==> trunk/qlgc_vnic.cfg.sample <==
# QLogic VNIC configuration file
#
# This file documents and describes the use of the
# VNIC configuration file qlgc_vnic.cfg. This file
# should reside in /etc/infiniband/qlgc_vnic.cfg
#
#
# How to fill in the configuration file
###############################################
#
# To fill in the configuration file, you need some information about
# your EVIC/VEx device. This information can be obtained with the help
# of the ib_qlgc_vnic_query tool.
# The "ib_qlgc_vnic_query -es" command gives DGID, IOCGUID and IOCSTRING
# information for the EVIC/VEx IOCs that are available through port 1, and
# "ib_qlgc_vnic_query -es -d /dev/infiniband/umad1" gives the same
# information for the EVIC/VEx IOCs available through port 2.
#
# Refer to the README for more information about the ib_qlgc_vnic_query tool.
#
#
# General structure of the configuration file
###############################################
#
# All lines beginning with a # are treated as comments.
#
# A simple configuration file consists of CREATE commands
# for each VNIC interface to be created.
#
# A simple CREATE command looks like this:
#
# {CREATE; NAME="eioc1";
# DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1";
# }
#
#Where
#
#NAME - The device name for the interface
#
#DGID - The DGID of the IOC to use.
#
# If DGID is specified then IOCGUID MUST also be specified.
#
# Though specifying DGID is optional, using this option is recommended,
# as it will provide the quickest way of starting up the VNIC service.
#
#
#IOCGUID - The GUID of the IOC to use.
#
#IOCSTRING - The IOC Profile ID String of the IOC to use.
#
# Either an IOCGUID or an IOCSTRING MUST always be specified.
#
# If DGID is specified then IOCGUID MUST also be specified.
#
# If no DGID is specified and both IOCGUID and IOCSTRING are specified
# then IOCSTRING is given preference and the DGID of the IOC whose
# IOCSTRING is specified is used to create the VNIC interface.
#
# If hotswap capability of EVIC/VEx is to be used, then IOCSTRING
# must be specified.
#
#INSTANCE - Defaults to 0. Range 0-255. If a host will connect to the
# same IOC more than once, each connection must be assigned a unique
# number.
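#
# For example, a sketch of one host connecting twice to the same IOC
# (the IOCGUID value is reused from the examples below), with the two
# connections distinguished by INSTANCE:
#
# {CREATE; NAME="eioc1"; IOCGUID=0x66A0130000105; INSTANCE=0; }
# {CREATE; NAME="eioc2"; IOCGUID=0x66A0130000105; INSTANCE=1; }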
#
#
#RX_CSUM - defaults to TRUE. When true, indicates that the receive checksum
# should be done by the EVIC/VEx
#
#HEARTBEAT - defaults to 100. Specifies the time, in 1/100ths of a second,
# between heartbeats.
#
#PORT - Specification for local HCA port. First port is 1.
#
#HCA - Optional HCA specification for use with PORT specification. First HCA is 0.
#
#PORTGUID - The PORTGUID of the IB port to use.
#
# Using PORTGUID to configure the VNIC interface is advantageous on
# hosts with more than one HCA plugged in. Because the PORTGUID is
# persistent for a given IB port, VNIC configurations remain consistent
# and reliable, unaffected by restarts of the OFED IB stack.
#
# On the downside, if an HCA in the host is replaced, VNIC interfaces
# configured with PORTGUID need to be reconfigured.
#
#IB_MULTICAST - Controls enabling or disabling of IB multicast feature on VNIC.
# Defaults to TRUE implying IB multicast is enabled for
# the interface. To disable IB multicast, set it to FALSE.
#
# Example of DGID and IOCGUID based configuration (this configuration will give
# the quickest start up of VNIC service):
#
# {CREATE; NAME="eioc1";
# DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001;
# }
#
#
# Example of IOCGUID based configuration:
#
# {CREATE; NAME="eioc1"; IOCGUID=0x66A013000010C;
# RX_CSUM=TRUE;
# HEARTBEAT=100; }
#
# Example of IOCSTRING based configuration:
#
# {CREATE; NAME="eioc1"; IOCSTRING="Chassis 0x00066A0050000018, Slot 2, IOC 1";
# RX_CSUM=TRUE;
# HEARTBEAT=100; }
#
#
#Failover configuration:
#########################
#
# It is possible to create a VNIC interface with failover configuration
# by using the PRIMARY and SECONDARY commands. The IOC specified in
# the PRIMARY command will be used as the primary IOC for this interface
# and the IOC specified in the SECONDARY command will be used as the
# fail-over backup in case the connection with the primary IOC fails
# for some reason.
#
# PRIMARY and SECONDARY commands are written in the following way:
#
# PRIMARY={DGID=...; IOCGUID=...; IOCSTRING=...; INSTANCE=...; } -
# IOCGUID and INSTANCE must be values that are unique to the primary
# interface.
#
# SECONDARY={DGID=...; IOCGUID=...; INSTANCE=...; } -
# IOCGUID and INSTANCE must be values that are unique to the secondary
# interface.
#
# OR it can also be specified without using DGID, like this:
#
# PRIMARY={IOCGUID=...; INSTANCE=...; } - IOCGUID may be substituted with
# IOCSTRING. IOCGUID, IOCSTRING, and INSTANCE must be values that are
# unique to the primary interface.
#
# SECONDARY={IOCGUID=...; INSTANCE=...; } - brings up a secondary connection
# for fail-over. IOCGUID may be substituted with IOCSTRING. IOCGUID,
# IOCSTRING, and INSTANCE are the values to be used for the secondary
# connection.
#
#
#Examples of failover configuration:
#
#{CREATE; NAME="veth1";
# PRIMARY={ DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1";
# INSTANCE=1; PORT=1; }
# SECONDARY={DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0230000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 2";
# INSTANCE=1; PORT=2; }
#}
#
# {CREATE; NAME="eioc2";
# PRIMARY = {IOCGUID=0x66A0130000105; INSTANCE=0; PORT=1; }
# SECONDARY = {IOCGUID=0x66A013000010C; INSTANCE=0; PORT=2; }
# }
#
#Example of configuration with IB_MULTICAST
#
# {CREATE; NAME="eioc2";
# PRIMARY = {IOCGUID=0x66A0130000105; INSTANCE=0; PORT=1; IB_MULTICAST=FALSE; }
# SECONDARY = {IOCGUID=0x66A013000010C; INSTANCE=0; PORT=2; IB_MULTICAST=FALSE; }
# }
#
# Example of HCA/PORT and PORTGUID configurations:
# {
# CREATE; NAME="veth1";
# PRIMARY={IOCGUID=00066a02de000070; INSTANCE=1; PORTGUID=0x0002c903000010f5; }
# SECONDARY={IOCGUID=00066a02de000070; INSTANCE=2; PORTGUID=0x0002c903000010f6; }
# }
#
# {
# CREATE; NAME="veth2";
# PRIMARY={IOCGUID=00066a02de000070; DGID=fe8000000000000000066a21de000070; INSTANCE=3; HCA=1; PORT=2; }
# SECONDARY={IOCGUID=00066a02de000070; DGID=fe8000000000000000066a21de000070; INSTANCE=4; HCA=0; PORT=1; }
# }
#
# {
# CREATE; NAME="veth3";
# IOCSTRING="EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2";
# INSTANCE=5; PORTGUID=0x0002c90300000786;
# }
# {
# CREATE; NAME="veth4";
# IOCGUID=00066a02de000070;
# INSTANCE=6; HCA=1; PORT=2;
# }
==> trunk/dhcp/0001-Make-DHCP-server-print-HW-info.patch <==
From 37cb19b5b3da9a8c3f61eeb3c4fa5885c70a4375 Mon Sep 17 00:00:00 2001
From: Eli Cohen
Date: Tue, 23 Jun 2009 10:11:55 +0300
Subject: [PATCH] Make DHCP server print HW info
When the DHCP server gets a request, it prints to the log file the HW address
which sent the request. Since for IPoIB the HW address is not conveyed in the
messages, this patch puts the client identifier in the HW address. This is
fine since we put the HW address in the client identifier.
Signed-off-by: Eli Cohen
---
common/discover.c | 4 ++--
server/dhcp.c | 12 ++++++++++++
2 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/common/discover.c b/common/discover.c
index b0387d0..c4f7200 100644
--- a/common/discover.c
+++ b/common/discover.c
@@ -533,9 +533,9 @@ void discover_interfaces (state)
#endif
case ARPHRD_INFINIBAND:
- tmp -> hw_address.hlen = 1;
+ tmp -> hw_address.hlen = 16;
tmp -> hw_address.hbuf [0] = ARPHRD_INFINIBAND;
- memcpy (&tmp -> hw_address.hbuf [1], sa.sa_data, 20);
+ memcpy (&tmp -> hw_address.hbuf [1], sa.sa_data, 16);
break;
default:
diff --git a/server/dhcp.c b/server/dhcp.c
index 1c90a6c..b558cfb 100644
--- a/server/dhcp.c
+++ b/server/dhcp.c
@@ -262,6 +262,18 @@ void dhcpdiscover (packet, ms_nulltp)
#if defined (FAILOVER_PROTOCOL)
dhcp_failover_state_t *peer;
#endif
+ struct option_cache *oc;
+
+ if (packet->raw->htype == ARPHRD_INFINIBAND) {
+ oc = lookup_option (&dhcp_universe, packet->options, DHO_DHCP_CLIENT_IDENTIFIER);
+ if (oc) {
+ int len;
+
+ len = oc->data.len > 16 ? 16 : oc->data.len;
+ packet->raw->hlen = len;
+ memcpy(packet->raw->chaddr, oc->data.data, len);
+ }
+ }
find_lease (&lease, packet, packet -> shared_network,
0, &peer_has_leases, (struct lease *)0, MDL);
--
1.6.3.2
==> trunk/dhcp/dhcp-3.0.4.patch <==
Index: dhcp-3.0.4/includes/site.h
===================================================================
--- dhcp-3.0.4.orig/includes/site.h 2002-03-12 20:33:39.000000000 +0200
+++ dhcp-3.0.4/includes/site.h 2006-05-23 11:34:38.000000000 +0300
@@ -135,7 +135,7 @@
the aforementioned problems do not matter to you, or if no other
API is supported for your system, you may want to go with it. */
-/* #define USE_SOCKETS */
+#define USE_SOCKETS
/* Define this to use the Sun Streams NIT API.
Index: dhcp-3.0.4/common/discover.c
===================================================================
--- dhcp-3.0.4.orig/common/discover.c 2006-02-23 00:43:27.000000000 +0200
+++ dhcp-3.0.4/common/discover.c 2006-05-23 11:45:16.000000000 +0300
@@ -532,6 +532,12 @@ void discover_interfaces (state)
break;
#endif
+ case ARPHRD_INFINIBAND:
+ tmp -> hw_address.hlen = 1;
+ tmp -> hw_address.hbuf [0] = ARPHRD_INFINIBAND;
+ memcpy (&tmp -> hw_address.hbuf [1], sa.sa_data, 20);
+ break;
+
default:
log_error ("%s: unknown hardware address type %d",
ifr.ifr_name, sa.sa_family);
==> trunk/ofed.conf-example <==
prefix=/usr
core=y
mthca=y
mlx4=y
mlx4_en=y
cxgb3=y
nes=y
ipath=y
ipoib=y
sdp=y
srp=y
srpt=y
rds=y
kernel-ib=y
kernel-ib-devel=y
libibverbs=y
libibverbs-devel=y
libibverbs-devel-static=y
libibverbs-utils=y
libmthca=y
libmthca-devel-static=y
libmlx4=y
libmlx4-devel=y
libcxgb3=y
libcxgb3-devel=y
libnes=y
libnes-devel-static=y
libipathverbs=y
libipathverbs-devel=y
libibcm=y
libibcm-devel=y
libibcommon=y
libibcommon-devel=y
libibcommon-static=y
libibumad=y
libibumad-devel=y
libibumad-static=y
libibmad=y
libibmad-devel=y
libibmad-static=y
ibsim=y
librdmacm=y
librdmacm-utils=y
librdmacm-devel=y
libsdp=y
libsdp-devel=y
opensm=y
opensm-libs=y
opensm-devel=y
opensm-static=y
compat-dapl=y
compat-dapl-devel=y
dapl=y
dapl-devel=y
dapl-devel-static=y
dapl-utils=y
perftest=y
mstflint=y
tvflash=y
sdpnetstat=y
srptools=y
rds-tools=y
ibutils=y
infiniband-diags=y
qperf=y
ofed-docs=y
ofed-scripts=y
mpi-selector=y
mvapich_gcc=y
mvapich2_gcc=y
openmpi_gcc=y
mpitests_mvapich_gcc=y
mpitests_mvapich2_gcc=y
mpitests_openmpi_gcc=y
mvapich2_conf_impl=ofa
mvapich2_conf_romio=1
mvapich2_conf_shared_libs=1
mvapich2_conf_ckpt=0
mvapich2_conf_vcluster=small
mvapich2_conf_dapl_provider=ib0
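The package selections in a file of this form can be listed with standard
text tools. A minimal sketch, assuming the simple key=value format above
(the small sample file written here is made up for illustration):

```shell
# Create a tiny, hypothetical ofed.conf-style sample, then print the
# names of the packages that are enabled with "=y".
cat > ofed.conf.sample <<'EOF'
core=y
mlx4=y
mvapich2_conf_ckpt=0
EOF
enabled=$(grep -E '=y$' ofed.conf.sample | cut -d= -f1)
echo "$enabled"
rm -f ofed.conf.sample
```

Run against the full example above, the same pipeline would print each
package set to "y", one per line.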
==> trunk/mvapich2_release_notes.txt <==
========================================================================
Open Fabrics Enterprise Distribution (OFED)
MVAPICH2-1.2p1 in OFED 1.4 Release Notes
December 2008
Overview
--------
These are the release notes for MVAPICH2-1.2p1. This is OFED's edition of
the MVAPICH2-1.2p1 release. MVAPICH2 is an MPI-2 implementation over
InfiniBand and iWARP from the Ohio State University
(http://mvapich.cse.ohio-state.edu/).
User Guide
----------
For more information on using MVAPICH2-1.2p1, please visit the user guide at
http://mvapich.cse.ohio-state.edu/support/.
Software Dependencies
---------------------
MVAPICH2 depends on the installation of the OFED Distribution stack with
OpenSM running. The MPI module also requires an established network
interface (either InfiniBand, IPoIB, iWARP, uDAPL, or Ethernet). BLCR support
is needed if built with fault tolerance support.
New Features
------------
MVAPICH2 (MPI-2 over InfiniBand and iWARP) is an MPI-2 implementation based on
MPICH2. MVAPICH2 1.2p1 is available as a single integrated package (with
MPICH2 1.0.7). This version of MVAPICH2-1.2p1 for OFED has the following
changes from MVAPICH2-1.0.3:
MVAPICH2-1.2p1 (11/11/2008)
- Fix shared-memory communication issue for AMD Barcelona systems.
MVAPICH2-1.2 (11/06/2008)
* Bugs fixed since MVAPICH2-1.2-rc2
- Ignore the last bit of the pkey and remove the pkey_ix option since the
index can be different on different machines. Thanks to Pasha@Mellanox
for the patch.
- Fix data types for memory allocations. Thanks to Dr. Bill Barth
from TACC for the patches.
- Fix a bug when MV2_NUM_HCAS is larger than the number of active HCAs.
- Allow builds on architectures for which tuning parameters do not exist.
* Efficient support for intra-node shared memory communication on
diskless clusters
* Changes related to the mpirun_rsh framework
- Always build and install mpirun_rsh in addition to the process
manager(s) selected through the --with-pm mechanism.
- Cleaner job abort handling
- Ability to detect the path to mpispawn if the Linux proc filesystem is
available.
- Added Totalview debugger support
- Stdin is only available to rank 0. Other ranks get /dev/null.
* Other miscellaneous changes
- Add sequence numbers for RPUT and RGET finish packets.
- Increase the number of allowed nodes for shared memory broadcast to 4K.
- Use /dev/shm on Linux as the default temporary file path for shared
memory communication. Thanks to Doug Johnson@OSC for the patch.
- MV2_DEFAULT_MAX_WQE has been replaced with MV2_DEFAULT_MAX_SEND_WQE and
MV2_DEFAULT_MAX_RECV_WQE for send and recv wqes, respectively.
- Fix compilation warnings.
MVAPICH2-1.2-RC2 (08/20/2008)
* The following bugs are fixed in RC2
- Properly handle the scenario in shared memory broadcast code when the
datatypes of different processes taking part in broadcast are different.
- Fix a bug in Checkpoint-Restart code to determine whether a connection
is a shared memory connection or a network connection.
- Support non-standard path for BLCR header files.
- Increase the maximum heap size to avoid race condition in realloc().
- Use int32_t for rank for larger jobs with 32k processes or more.
- Improve mvapich2-1.2 bandwidth to the same level of mvapich2-1.0.3.
- An error handling patch for uDAPL interface. Thanks for Nilesh Awate
for the patch.
- Explicitly set some of the EP attributes when on demand connection
is used in uDAPL interface.
MVAPICH2-1.2RC1 (07/02/08)
* Based on MPICH2 1.0.7
* Scalable and robust daemon-less job startup
- Enhanced and robust mpirun_rsh framework (non-MPD-based) to
provide scalable job launching on multi-thousand core clusters
- Available for OpenFabrics (IB and iWARP) and uDAPL interfaces
(including Solaris)
* Checkpoint-restart with intra-node shared memory support
- Allows best performance and scalability with fault-tolerance
support
* Enhancement to software installation
- Full autoconf-based configuration
- An application (mpiname) for querying the MVAPICH2
library version and configuration information
* Enhanced processor affinity using PLPA for multi-core architectures
- Allows user-defined flexible processor affinity
* Enhanced scalability for RDMA-based direct one-sided communication
with less communication resource
* Shared memory optimized MPI_Bcast operations
* Optimized and tuned MPI_Alltoall
Main Verification Flows
-----------------------
In order to verify the correctness of MVAPICH2-1.2p1, the following tests
and parameters were run.
Test             Description
====================================================================
Intel            Intel's MPI functionality test suite
OSU Benchmarks   OSU's performance tests
IMB              Intel's MPI Benchmark test
mpich2           Test suite distributed with MPICH2
mpitest          b_eff test
Linpack          Linpack benchmark
NAS              NAS Parallel Benchmarks (NPB3.2)
NAMD             NAMD application
Mailing List
------------
There is a public mailing list mvapich-discuss@cse.ohio-state.edu for
mvapich users and developers to
- Ask for help and support from each other and get prompt response
- Contribute patches and enhancements
========================================================================
==> trunk/ofed_patch.sh <==
#!/bin/bash
#
# Copyright (c) 2006 Mellanox Technologies. All rights reserved.
#
# This Software is licensed under one of the following licenses:
#
# 1) under the terms of the "Common Public License 1.0" a copy of which is
# available from the Open Source Initiative, see
# http://www.opensource.org/licenses/cpl.php.
#
# 2) under the terms of the "The BSD License" a copy of which is
# available from the Open Source Initiative, see
# http://www.opensource.org/licenses/bsd-license.php.
#
# 3) under the terms of the "GNU General Public License (GPL) Version 2" a
# copy of which is available from the Open Source Initiative, see
# http://www.opensource.org/licenses/gpl-license.php.
#
# Licensee has the right to choose one of the above licenses.
#
# Redistributions of source code must retain the above copyright
# notice and one of the license notices.
#
# Redistributions in binary form must reproduce both the above copyright
# notice, one of the license notices in the documentation
# and/or other materials provided with the distribution.
#
#
# Add/Remove a patch to/from OFED-1.3's ofa package
usage()
{
cat << EOF
Usage:
  Add patch to OFED:
    `basename $0` --add
                  --ofed|-o <path to OFED directory>
                  --patch|-p <path to patch file>
                  --type|-t <kernel|backport|addons>
  Remove patch from OFED:
    `basename $0` --remove
                  --ofed|-o <path to OFED directory>
                  --patch|-p <patch file name>
                  --type|-t <kernel|backport|addons>
Example:
  `basename $0` --add --ofed /tmp/OFED-1.3/ --patch /tmp/cma_establish.patch --type kernel
  `basename $0` --remove --ofed /tmp/OFED-1.3/ --patch cma_establish.patch --type kernel
EOF
}
action=""
# Execute command w/ echo and exit if it fails
ex()
{
echo "$@"
if ! "$@"; then
printf "\nFailed executing $@\n\n"
exit 1
fi
}
add_patch()
{
if [ -f $2/${1##*/} ]; then
echo Replacing $2/${1##*/}
ex /bin/rm -f $2/${1##*/}
fi
ex cp $1 $2
}
remove_patch()
{
if [ -f $2/${1##*/} ]; then
echo Removing $2/${1##*/}
ex /bin/rm -f $2/${1##*/}
else
echo Patch $2/${1##*/} was not found
exit 1
fi
}
set_rpm_info()
{
package_SRC_RPM=$(/bin/ls -1 ${ofed}/SRPMS/${1}*src.rpm 2> /dev/null)
if [[ -n "${package_SRC_RPM}" && -s ${package_SRC_RPM} ]]; then
package_name=$(rpm --queryformat "[%{NAME}]" -qp ${package_SRC_RPM})
package_ver=$(rpm --queryformat "[%{VERSION}]" -qp ${package_SRC_RPM})
package_rel=$(rpm --queryformat "[%{RELEASE}]" -qp ${package_SRC_RPM})
else
echo $1 src.rpm not found under ${ofed}/SRPMS
exit 1
fi
}
main()
{
while [ ! -z "$1" ]
do
case $1 in
--add)
action="add"
shift
;;
--remove)
action="remove"
shift
;;
--ofed|-o)
ofed=$2
shift 2
;;
--patch|-p)
patch=$2
shift 2
;;
--type|-t)
type=$2
shift 2
case ${type} in
backport|addons)
tag=$1
shift
;;
esac
;;
--help|-h)
usage
exit 0
;;
*)
usage
exit 1
;;
esac
done
if [ -z "$action" ]; then
usage
exit 1
fi
if [ -z "$ofed" ] || [ ! -d "$ofed" ]; then
echo Set the path to the OFED directory. Use \'--ofed\' parameter
exit 1
else
ofed=$(readlink -f $ofed)
fi
if [ "$action" == "add" ]; then
if [ -z "$patch" ] || [ ! -r "$patch" ]; then
echo Set the path to the patch file. Use \'--patch\' parameter
exit 1
else
patch=$(readlink -f $patch)
fi
else
if [ -z "$patch" ]; then
echo Set the name of the patch to be removed. Use \'--patch\' parameter
exit 1
fi
fi
if [ -z "$type" ]; then
echo Set the type of the patch. Use \'--type\' parameter
exit 1
fi
if [ "$type" == "backport" ] || [ "$type" == "addons" ]; then
if [ -z "$tag" ]; then
echo Set tag for backport patch.
exit 1
fi
fi
# Get ofa RPM version
case $type in
kernel|backport|addons)
set_rpm_info ofa_kernel
;;
*)
echo "Unknown type $type"
exit 1
;;
esac
package=${package_name}-${package_ver}
cd ${ofed}
if [ ! -e SRPMS/${package}-${package_rel}.src.rpm ]; then
echo File ${ofed}/SRPMS/${package}-${package_rel}.src.rpm not found
exit 1
fi
if ! ( set -x && rpm -i --define "_topdir $(pwd)" SRPMS/${package}-${package_rel}.src.rpm && set +x ); then
echo "Failed to install ${package}-${package_rel}.src.rpm"
exit 1
fi
cd -
cd ${ofed}/SOURCES
ex tar xzf ${package}.tgz
case $type in
kernel)
if [ "$action" == "add" ]; then
add_patch $patch ${package}/kernel_patches/fixes
else
remove_patch $patch ${package}/kernel_patches/fixes
fi
;;
backport)
if [ "$action" == "add" ]; then
if [ ! -d ${package}/kernel_patches/backport/$tag ]; then
echo Creating ${package}/kernel_patches/backport/$tag directory
ex mkdir -p ${package}/kernel_patches/backport/$tag
echo WARNING: Check that ${package} configure supports backport/$tag
fi
add_patch $patch ${package}/kernel_patches/backport/$tag
else
remove_patch $patch ${package}/kernel_patches/backport/$tag
fi
;;
addons)
if [ "$action" == "add" ]; then
if [ ! -d ${package}/kernel_addons/backport/$tag ]; then
echo Creating ${package}/kernel_addons/backport/$tag directory
ex mkdir -p ${package}/kernel_addons/backport/$tag
echo WARNING: Check that ${package} configure supports backport/$tag
fi
add_patch $patch ${package}/kernel_addons/backport/$tag
else
remove_patch $patch ${package}/kernel_addons/backport/$tag
fi
;;
*)
echo Unknown patch type: $type
exit 1
;;
esac
ex tar czf ${package}.tgz ${package}
cd -
cd ${ofed}
echo Rebuilding ${package_name} source rpm:
if ! ( set -x && rpmbuild -bs --define "_topdir $(pwd)" SPECS/${package_name}.spec && set +x ); then
echo Failed to create ${package}-${package_rel}.src.rpm
exit 1
fi
ex rm -rf SOURCES/${package}*
if [ "$action" == "add" ]; then
echo Patch added successfully.
else
echo Patch removed successfully.
fi
echo
echo Remove existing RPM packages from ${ofed}/RPMS directory in order
echo to rebuild RPMs
}
main $@
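A note on the `${1##*/}` parameter expansion that add_patch and remove_patch
rely on above: it removes the longest prefix matching `*/`, i.e. it extracts
the file name in pure shell. A standalone sketch, using the path from the
usage example:

```shell
# ${var##*/} strips everything up to and including the last '/',
# leaving only the file name.
patch=/tmp/cma_establish.patch
name=${patch##*/}
echo "$name"   # prints: cma_establish.patch
```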
==> trunk/mlx4_release_notes.txt <==
Open Fabrics Enterprise Distribution (OFED)
ConnectX driver (mlx4) in OFED 1.4.2 Release Notes
July 2009
===============================================================================
Table of Contents
===============================================================================
1. Overview
2. Supported Firmware Versions
3. VPI (Virtual Protocol Interconnect)
4. Infiniband new features and bug fixes since OFED 1.3.1
5. Infiniband (mlx4_ib) new features and bug fixes since OFED 1.4
6. Eth (mlx4_en) new features and bug fixes since OFED 1.4
7. New features and bug fixes since OFED 1.4.1
8. Known Issues
9. mlx4 Available Parameters
===============================================================================
1. Overview
===============================================================================
mlx4 is the low level driver implementation for the ConnectX adapters designed
by Mellanox Technologies. The ConnectX can operate as an InfiniBand adapter,
as an Ethernet NIC, or as a Fibre Channel HBA. The driver in OFED 1.4 supports
Infiniband and Ethernet NIC configurations. To accommodate the supported
configurations, the driver is split into three modules:
- mlx4_core
Handles low-level functions like device initialization and firmware
commands processing. Also controls resource allocation so that the
InfiniBand and Ethernet functions can share the device without
interfering with each other.
- mlx4_ib
Handles InfiniBand-specific functions and plugs into the InfiniBand
midlayer
- mlx4_en
A new 10G driver named mlx4_en was added to drivers/net/mlx4.
It handles Ethernet specific functions and plugs into the netdev mid-layer.
===============================================================================
2. Supported Firmware Versions
===============================================================================
- This release was tested with FW 2.6.000.
- The minimal version to use is 2.3.000.
- To use both IB and Ethernet (VPI) use FW version 2.6.0
===============================================================================
3. VPI (Virtual Protocol Interconnect)
===============================================================================
VPI enables ConnectX to be configured as an Ethernet NIC and/or an Infiniband
adapter.
o Overview:
The VPI driver is a combination of the Mellanox ConnectX HCA Ethernet and
Infiniband drivers.
It supplies the user with the ability to run Infiniband and Ethernet
protocols on the same HCA (separately or at the same time).
For more details on the Ethernet driver see MLNX_EN_README.txt.
o Firmware:
The VPI driver works with FW 25408 version 2.6.000 or higher.
One needs to use INI files that allow different protocols over same HCA.
o Installing the mlnx_en driver:
To install OFED 1.4 with the Ethernet driver, make sure that "mlx4_en=y" is
set in the file ofed.conf.
o Loading drivers:
To load the Ethernet driver one should set "MLX4_EN_LOAD=yes"
in file /etc/infiniband/openib.conf.
If "MLX4_EN_LOAD" is not marked as "yes", the Ethernet driver can be loaded
by running "/sbin/modprobe mlx4_en".
o Port type management:
By default both ConnectX ports are initialized as Infiniband ports.
If you wish to change the port type use the connectx_port_config script after
the driver is loaded.
Running "/sbin/connectx_port_config -s" will show current port configuration
for all ConnectX devices.
Port configuration is saved in file: /etc/infiniband/connectx.conf.
This saved configuration is restored at driver restart only if done via
"/etc/init.d/openibd restart".
Possible port types are:
"eth" - Always Ethernet.
"ib" - Always Infiniband.
"auto" - Link sensing mode - detect port type based on the attached
network type. If no link is detected, the driver retries link
sensing every few seconds.
Port link type can be configured for each device in the system at run time
using the "/sbin/connectx_port_config" script.
This utility will prompt for the PCI device to be modified (if there is only
one it will be selected automatically).
At the next stage the user will be prompted for the desired mode for each port.
The desired port configuration will then be set for the selected device.
Note: This utility also has a non-interactive mode:
"/sbin/connectx_port_config [[-d|--device <device>] -c|--conf <port1,port2>]".
- The following configurations are supported by VPI:
Port1 = eth Port2 = eth
Port1 = ib Port2 = ib
Port1 = auto Port2 = auto
Port1 = ib Port2 = eth
Port1 = ib Port2 = auto
Port1 = auto Port2 = eth
Note: the following options are not supported:
Port1 = eth Port2 = ib
Port1 = eth Port2 = auto
Port1 = auto Port2 = ib
===============================================================================
4. Infiniband new features and bug fixes since OFED 1.3.1
===============================================================================
Features that are enabled with FW 2.5.0 only:
- Send with invalidate and Local invalidate send queue work requests.
- Resize CQ support.
Features that are enabled with FW 2.6.0 only:
- Fast register MR send queue work requests.
- Local DMA L_Key.
- Raw Ethertype QP support (one QP per port) -- receive only.
Non FW dependent features:
- Allow 4K messages for UD QPs.
- Allocate/free fast register MR page lists.
- More efficient MTT allocator.
- RESET->ERR QP state transition no longer supported (IB Spec 1.2.1).
- Pass congestion management class MADs to the HCA.
- Enable firmware diagnostic counters available via sysfs.
- Enable LSO support for IPOIB.
- IB_EVENT_LID_CHANGE is generated more appropriately.
- Fixed race condition between create QP and destroy QP (bugzilla 1389)
===============================================================================
5. Infiniband new features and bug fixes since OFED 1.4
===============================================================================
- Enable setting via module param (set_4k_mtu) 4K MTU for ConnectX ports.
- Support optimized registration of huge pages backed memory.
With this optimization, the number of MTT entries used is significantly
lower than for regular memory, so the HCA will access registered memory with
fewer cache misses and improved performance.
For more information on this topic, please refer to Linux documentation file:
Documentation/vm/hugetlbpage.txt
- Do not enable blueflame sends if write combining is not available
- Add write combining support for PPC64, and thus enable blueflame sends.
- Unregister IB device before executing CLOSE_PORT.
- Notify and exit if the kernel module used does not support XRC. This is done
to avoid libmlx4 compatibility problem.
- Added a module parameter (log_mtts_per_seg) for the number of MTTs per
segment. This enables registering more memory with the same number of segments.
===============================================================================
6. Eth (mlx4_en) new features and bug fixes since OFED 1.4
===============================================================================
6.1 Changes and New Features
----------------------------
- Added Tx multi-queue support, which improves multi-stream and bi-directional
TCP performance.
- Added IP Reassembly to improve RX bandwidth for IP fragmented packets.
- Added linear skb support which improves UDP performance.
- Removed the following module parameters:
- rx/tx_ring_size
- rx_ring_num - number of RX rings
- pprx/pptx - global pause frames
The parameters above are controlled through the standard Ethtool interface.
Bug Fixes
---------
- Memory leak when driver is unloaded without configuring interfaces first.
- Setting flow control parameters for one ConnectX port through Ethtool
impacts the other port as well.
- Adaptive interrupt moderation malfunctions after receiving/transmitting
around 7 Tera-bytes of data.
- Firmware commands fail with bad flow messages when bringing an interface up.
- Unexpected behavior in case of memory allocation failures.
===============================================================================
7. New features and bug fixes since OFED 1.4.1
===============================================================================
- Added support for new device ID: 0x6764: MT26468 ConnectX EN 10GigE PCIe gen2
===============================================================================
8. Known Issues
===============================================================================
- mlx4_en driver is not supported on PPC64 and IA64
- The mlx4_en module uses a Linux implementation for Large Receive Offload
(LRO) in kernel 2.6.24 and later. These kernels require installing the
"inet_lro" module.
- The SQD feature is not supported.
- To load the driver on machines with a 64KB default page size, the UAR bar
must be enlarged. 64KB is the default page size on PPC with RHEL5 and on
Itanium with SLES 11, or when a 64KB page size is enabled.
Perform the following three steps:
1. Add the following line in the firmware configuration (INI) file under the
[HCA] section:
log2_uar_bar_megabytes = 5
2. Burn a modified firmware image with the changed INI file.
3. Reboot the system.
- Ethernet in MLNX_OFED 1.4 is not supported for the following OPNs:
MHQH29-XTC
MHGH29-XTC
MHGH29-XSC
MHGH28-XTC
MHGH28-XSC
MHEH28-XTC
MHEH28-XSC
MHQH19-XTC
Attempting to use these cards as NICs will yield the following error
in /var/log/messages:
mlx4_core 0000:0d:00.0: command 0x9 failed: fw status = 0x8
mtlx008 kernel: mlx4_en 0000:0d:00.0: Failed Initializing port
mtlx008 kernel: mlx4_en 0000:0d:00.0: Failed starting port:1
As a workaround, use the MLNX_EN driver instead of OFED for Linux.
===============================================================================
9. mlx4 Available Parameters
===============================================================================
In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:
options mlx4_core parameter=<value>
and/or
options mlx4_ib parameter=<value>
and/or
options mlx4_en parameter=<value>
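For example, to enable 4K MTU and Quality of Service support (an illustrative
combination; both parameters are described in the lists below):

options mlx4_core set_4k_mtu=1 enable_qos=1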
mlx4_core parameters:
set_4k_mtu: attempt to set 4K MTU to all ConnectX ports (default 0)
msi_x: attempt to use MSI-X if nonzero (default 1)
enable_qos: Enable Quality of Service support in the HCA if > 0, (default 0)
block_loopback: Block multicast loopback packets if > 0 (default: 1)
internal_err_reset: Reset device on internal errors if non-zero (default 1)
debug_level: Enable debug tracing if > 0 (default 0)
log_num_qp: log maximum number of QPs per HCA (default is 17; max is 20)
log_num_srq: log maximum number of SRQs per HCA (default is 16; max is 20)
log_rdmarc_per_qp: log number of RDMARC buffers per QP (default is 4; max is 7)
log_num_cq: log maximum number of CQs per HCA (default is 16; max is 19)
log_num_mcg: log maximum number of multicast groups per HCA (default is 13; max is 21)
log_num_mpt: log maximum number of memory protection table entries per HCA
(default is 17; max is 20)
log_num_mtt: log maximum number of memory translation table segments per HCA
(default is 20; max is 20)
log_num_mac: log maximum number of MACs per ETH port (1-7) (int)
log_num_vlan: log maximum number of VLANs per ETH port (0-7) (int)
log_mtts_per_seg: Log2 number of MTT entries per segment (1-5; default is 3)
use_prio: Enable steering by VLAN priority on ETH ports (0/1, default 0) (bool)
mlx4_ib parameters:
debug_level: Enable debug tracing if > 0 (default 0)
mlx4_en parameters:
rss_xor: Use XOR hash function for RSS if nonzero (default is 0)
rss_mask: RSS hash type bitmask (default is 0xf)
num_lro: Number of LRO sessions per ring; 0 disables LRO (default is 32)
pfctx: Priority based Flow Control policy on TX[7:0].
Per priority bit mask (default is 0)
pfcrx: Priority based Flow Control policy on RX[7:0].
Per priority bit mask (default is 0)
inline_thold: threshold for using inline data (default is 128)
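Note that the log_* values above are base-2 logarithms of the actual resource counts, so small changes have a large effect; for example:

```shell
# The default log_num_qp=17 allows 2^17 queue pairs per HCA:
echo $((1 << 17))    # prints 131072
# Raising it to the maximum of 20 would allow 2^20 = 1048576 QPs.
echo $((1 << 20))    # prints 1048576
```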
Open Fabrics Enterprise Distribution (OFED)
RDMA CM in OFED 1.4 Release Notes
December 2008
===============================================================================
Table of Contents
===============================================================================
1. Overview
2. New Features
3. Known Issues
4. Fixed bugs since OFED 1.3
5. Fixed bugs since OFED 1.3.1
===============================================================================
1. Overview
===============================================================================
The RDMA CM is a communication manager used to set up reliable connected
and unreliable datagram data transfers. It provides an RDMA transport
neutral interface for establishing connections. The API is based on sockets,
but adapted for queue pair (QP) based semantics: communication must be
over a specific RDMA device, and data transfers are message based.
The RDMA CM only provides the communication management (connection setup /
teardown) portion of an RDMA API. It works in conjunction with the verbs
API for data transfers.
===============================================================================
2. New Features
===============================================================================
for OFED 1.3:
Added support for valgrind checks.
Added quality of service support. Quality of service (QoS) is automatically
enabled through the use of the kernel rdma_cm, if the local subnet is
configured for QoS. Additionally, the librdmacm allows users to request
a specific type of service to use when connecting through a new API.
Support for QoS is fabric dependent, and usually configured by an
administrator. Details of QoS are outside the scope of this document;
additional information may be found in subnet management (SM) documentation.
Added sanity checks and fixes for maximum outstanding RDMA operations in an
effort to detect application errors earlier in the connection process.
Various documentation updates.
for OFED 1.2:
The RDMA CM now supports connected, datagram, and multicast data transfers.
When used over Infiniband, the RDMA CM will make use of a local path record
cache, if it is enabled. On large fabrics, use of the local cache can greatly
reduce connection time. Use of a cache is not necessary for iWarp.
Man pages have been created to describe the various interfaces and test
programs available. For a full list, users should refer to the rdma_cm.7 man
page.
===============================================================================
3. Known Issues
===============================================================================
The RDMA CM relies on the operating system's network configuration tables to
map IP addresses to RDMA devices. Incorrectly configured network
configurations can result in the RDMA CM being unable to locate the correct
RDMA device. Currently, the RDMA CM only supports IPv4 addressing.
All RDMA interfaces must provide a way to map IP addresses to an RDMA device.
For Infiniband, this is done using IPoIB, and requires correctly configured
IPoIB device interfaces sharing the same multicast domain. For details on
configuring IPoIB, refer to ipoib_release_notes.txt. For RDMA devices to
communicate, they must support the same underlying network and data link
layers.
If you experience problems using the RDMA CM, you may want to check the
following:
* Verify that you have IP connectivity over the RDMA devices. For example,
ping between iWarp or IPoIB devices.
* Ensure that IP network addresses assigned to RDMA devices do not
overlap with IP network addresses assigned to standard Ethernet devices.
* For multicast issues, either bind directly to a specific RDMA device, or
configure the IP routing tables to route multicast traffic over an RDMA
device's IP address.
===============================================================================
4. Fixed bugs since OFED 1.3
===============================================================================
- Added the reject status fix for DAPL.
===============================================================================
5. Fixed bugs since OFED 1.3.1
===============================================================================
- Non
Open Fabrics Enterprise Distribution (OFED)
ehca in OFED 1.4.1 Release Notes
May 2009
Overview
--------
ehca is the low level driver implementation for all IBM GX-based HCAs.
Supported HCAs
--------------
- GX Dual-port SDR 4x IB HCA
- GX Dual-port SDR 12x IB HCA
- GX Dual-port DDR 4x IB HCA
- GX Dual-port DDR 12x IB HCA
Available Parameters
--------------------
In order to set ehca parameters, add the following line(s) to /etc/modprobe.conf:
options ib_ehca <parameter>=<value>
whereby <parameter> is one of the following items:
- debug_level debug level (0: no debug traces (default), 1: with debug traces)
- port_act_time time to wait for port activation (default: 30 sec)
- scaling_code scaling code (0: disable (default), 1: enable)
- open_aqp1 Open AQP1 on startup (default: no) (bool)
- hw_level Hardware level (0: autosensing (default), 0x10..0x14: eHCA, 0x20..0x23: eHCA2) (int)
- nr_ports number of connected ports (-1: autodetect, 1: port one only, 2: two ports (default)) (int)
- use_hp_mr Use high performance MRs (default: no) (bool)
- poll_all_eqs Poll all event queues periodically (default: yes) (bool)
- static_rate Set permanent static rate (default: no static rate) (int)
- lock_hcalls Serialize all hCalls made by the driver (default: autodetect) (bool)
- number_of_cqs Max number of CQs which can be allocated (default: autodetect) (int)
- number_of_qps Max number of QPs which can be allocated (default: autodetect) (int)
New Features
------------
- none
Fixed Bugs ofed-1.4.1
---------------------
- none
Fixed Bugs ofed-1.4
---------------------
- Reject send work requests only for RESET, INIT and RTR state
- Reject receive work requests if QP is in RESET state
- In case of lost interrupts, trigger EOI to reenable interrupts
- Filter PATH_MIG events if QP was never armed
- Release mutex in error path of alloc_small_queue_page()
- Check idr_find() return value
- Discard double CQE for one WR
- Generate flush status CQ entries
- Don't allow creating UC QP with SRQ
- Fix reported max number of QPs and CQs in systems with >1 adapter
- Reject dynamic memory add/remove when ehca adapter is present
- Remove reference to special QP in case of port activation failure
- Fix locking for shca_list_lock
Fixed Bugs ofed-1.3.1
---------------------
- Support all ibv_devinfo values in query_device() and query_port()
- Prevent posting of SQ WQEs if QP not in RTS
- Remove mr_largepage parameter, i.e., always enable large page support
- Allocate event queue size depending on max number of CQs and QPs
- Protect QP against destroying until all async events for it are handled
Fixed Bugs ofed-1.3
-------------------
- Serialize HCA-related hCalls if necessary
- Fix static rate if path faster than link
- Return physical link information in query_port()
- Fix clipping of device limits to INT_MAX
- Fix issues related to path migration support
- Support more than 4k QPs for userspace and kernelspace
- Prevent sending UD packets to QP0
- Prevent RDMA-related connection failures on some eHCA2 hardware
Available backports
-------------------
- RedHat EL5 up2: 2.6.18-92.ELsmp
- RedHat EL5 up3: 2.6.18-128.ELsmp
- SLES11: 2.6.27.19-5.1-smp
- SLES10SP1: 2.6.16-53-0.16-smp
- SLES10SP2: 2.6.16-60
- kernel.org: 2.6.24-27
Known Issues
------------
1. The device driver normally uses both ports. When using just one port, it is
strongly recommended to set the option nr_ports=-1 to enable autodetect mode:
modprobe ib_ehca nr_ports=-1
2. Furthermore, the port(s) need to be connected to an active switch port while
the ehca device driver is loading.
3. Dynamic memory operations are not supported with ehca.
4. Allocating a large number of queue pairs might be time consuming. This will
be fixed in the next OFED release.
OpenFabrics.org BSD license:
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
Open Fabrics Enterprise Distribution (OFED)
SRP in OFED 1.4 Release Notes
December 2008
==============================================================================
Table of contents
==============================================================================
1. Overview
2. Changes and Bug Fixes since OFED 1.3.1
3. Software Dependencies
4. Major Features
5. Loading SRP Initiator
6. Manually Establishing an SRP Connection
7. SRP Tools - ibsrpdm and srp_daemon
8. Automatic Discovery and Connecting to Targets
9. Multiple Connections from Initiator IB Port to the Target
10. High Availability
11. Shutting Down SRP
12. Known Issues
13. Vendor Specific Notes
==============================================================================
1. Overview
==============================================================================
The SRP standard describes the message format and protocol definitions required
for transferring commands and data between a SCSI initiator port and a SCSI
target port using RDMA communication service.
==============================================================================
2. Changes and Bug Fixes since OFED 1.3.1
==============================================================================
* Added a check for scsi_id in scmnd to prevent scan/rescan operations
(i.e., echo "- - -" > /sys/class/scsi_host/hostXX/scan) from repeatedly
adding new SCSI devices
* Miscellaneous bug fixes
==============================================================================
3. Software Dependencies
==============================================================================
The SRP Initiator depends on the installation of the OFED Distribution stack
with OpenSM running.
==============================================================================
4. Major Features
==============================================================================
This SRP Initiator is based on source taken from openib.org gen2 implementing
the SCSI RDMA Protocol-2 (SRP-2), Doc. no. T10/1524-D. See:
www.t10.org/ftp/t10/drafts/srp2/srp2r00a.pdf
The SRP Initiator supports:
- Basic SCSI Primary Commands -3 (SPC-3)
(www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf)
- Basic SCSI Block Commands -2 (SBC-2)
(www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf)
- Basic functionality, task management and limited error handling
==============================================================================
5. Loading SRP Initiator
==============================================================================
To load the SRP module, either execute the "modprobe ib_srp" command after the
OFED driver is up, or change the value of SRP_LOAD in
/etc/infiniband/openib.conf to "yes" (causing the srp module to be loaded
at driver boot).
NOTE: When loading the ib_srp module, it is possible to set the module
parameter srp_sg_tablesize. This is the maximum number of
gather/scatter entries per I/O (default: 12).
a. modprobe ib_srp srp_sg_tablesize=32
or
b. edit /etc/modprobe.conf and add the following line:
options ib_srp srp_sg_tablesize=32
==============================================================================
6. Manually Establishing an SRP Connection
==============================================================================
The following steps describe how to manually establish an SRP connection
between the Initiator and an SRP Target. Section 8 explains how to do this
automatically.
- Make sure that the ib_srp module is loaded, the SRP Initiator is reachable
by the SRP Target, and that an SM is running.
- To establish a connection with an SRP Target and create SRP (SCSI) device(s)
for that target under /dev, use the following command:
echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],\
pkey=ffff,service_id=[service[0] value] > \
/sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target
Notes:
a. Execution of the above "echo" command may take some time
b. The SM must be running while the command executes
c. It is possible to include additional parameters in the echo command:
> max_cmd_per_lun - Default: 63
> max_sect (short for max_sectors) - sets the request size of a command
> io_class - Default: 0x100 as in rev 16A of the specification
Note: In rev 10 the default was 0xff00
> initiator_ext - Please refer to Section 9 (Multiple Connections...)
d. See SRP Tools below for instructions on how the parameters in the
echo command above may be obtained.
- To list the new SCSI devices that have been added by the echo command, you
may use either of the following two methods:
a. Execute "fdisk -l". This command lists all devices; the new devices are
included in this listing.
b. Execute "dmesg" or look at /var/log/messages to find messages with the names
of the new devices.
==============================================================================
7. SRP Tools - ibsrpdm and srp_daemon
==============================================================================
To assist in performing the steps in Section 6, the OFED 1.3.1 distribution
provides two utilities which:
- Detect targets on the fabric reachable by the Initiator (for step 1)
- Output target attributes in a format suitable for use in the above
"echo" command (step 2)
These utilities are: ibsrpdm and srp_daemon.
The utilities can be found under /usr/local/ofed/sbin/ (or /sbin/),
and are part of the srptools RPM that may be installed using the
OFED custom installation. Detailed information regarding the various
options for these utilities are provided by their man pages.
Below, several usage scenarios for these utilities are presented.
ibsrpdm usage
-------------
1. Detecting reachable targets
a. To detect all targets reachable by the SRP initiator via the default
umad device (/dev/umad0), execute the following command:
> ibsrpdm
This command will output information on each SRP target detected, in
human-readable form.
Sample output:
IO Unit Info:
port LID: 0103
port GID: fe800000000000000002c90200402bd5
change ID: 0002
max controllers: 0x10
controller[ 1]
GUID: 0002c90200402bd4
vendor ID: 0002c9
device ID: 005a44
IO class : 0100
ID: LSI Storage Systems SRP Driver 200400a0b81146a1
service entries: 1
service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1
b. To detect all the SRP Targets reachable by the SRP Initiator via
another umad device, use the following command:
> ibsrpdm -d <umad device>
2. Assistance in creating an SRP connection
a. To generate output suitable for utilization in the "echo" command of
Section 6, add the "-c" option to ibsrpdm:
>ibsrpdm -c
Sample output:
id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,
dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1
b. To establish a connection with an SRP Target (Section 6) using the output
from the "ibsrpdm -c" example above, execute the following command:
echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,
dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1
> /sys/class/infiniband_srp/srp-mthca0-1/add_target
The SRP connection should now be up; the newly created SCSI devices should appear
in the listing obtained from the "fdisk -l" command.
srp_daemon
----------
The srp_daemon utility is based on ibsrpdm and extends its functionality.
In addition to the ibsrpdm functionality described above, srp_daemon can also
- Establish an SRP connection by itself (without the need to issue the "echo"
command described in Section 6)
- Continue running in background, detecting new targets and establishing SRP
connections with them (daemon mode)
- Discover reachable SRP targets given an InfiniBand HCA name and port, rather
than just by /dev/umadN, where N is a digit
- Enable High Availability operation (together with Device-Mapper Multipath)
- Have a configuration file that determines the targets to connect to
a. srp_daemon commands equivalent to ibsrpdm:
"srp_daemon -a -o" is equivalent to "ibsrpdm"
"srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"
Note: These srp_daemon commands can behave differently than the equivalent
ibsrpdm command when /etc/srp_daemon.conf is not empty.
b. srp_daemon extensions to ibsrpdm
- To discover SRP Targets reachable from HCA device <device> and
port <port> (and generate output suitable for 'echo'), you may execute:
srp_daemon -c -a -o -i <device> -p <port>
- To both discover the SRP Targets and establish connections with them, just
add the -e option to the above command.
- Executing srp_daemon over a port without the -a option will only display
the targets reachable via that port to which the initiator is not yet
connected. When executing with the -e option, it is better to omit -a.
- It is recommended to use the -n option. This option adds the initiator_ext
to the connecting string. (See Section 9 for more details).
- srp_daemon has a configuration file that can be set, where the default is
/etc/srp_daemon.conf. Use the -f option to supply a different configuration file
that configures the targets srp_daemon is allowed to connect to. The
configuration file can also be used to set values for additional
parameters (e.g., max_cmd_per_lun, max_sect).
- A continuous background (daemon) operation, providing an automatic ongoing
detection and connection capability. See Section 8.
==============================================================================
8. Automatic Discovery and Connecting to Targets
==============================================================================
- Make sure that the ib_srp module is loaded, the SRP Initiator can reach an
SRP Target, and that an SM is running.
- To connect to all the existing Targets in the fabric, execute
srp_daemon -e -o. This utility will scan the fabric once, connect to
every Target it detects, and then exit.
NOTE: srp_daemon will follow the configuration it finds in
/etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in
the configuration file.
- To connect to all the existing Targets in the fabric and to connect
to new targets that will join the fabric, execute srp_daemon -e. This utility
continues to execute until it is either killed by the user or encounters
connection errors (such as no SM in the fabric).
- To execute SRP daemon as a daemon you may execute run_srp_daemon
(found under /usr/local/ofed/sbin/ or /sbin/), providing it with
the same options used for running srp_daemon.
Note: Make sure only one instance of run_srp_daemon runs per port.
- To execute SRP daemon as a daemon on all the ports, execute srp_daemon.sh
(found under /usr/local/ofed/sbin/ or /sbin/).
srp_daemon.sh sends its log to /var/log/srp_daemon.log.
- It is possible to configure this script to execute automatically when the
InfiniBand driver starts by changing the value of SRP_DAEMON_ENABLE in
/etc/infiniband/openib.conf to "yes".
Another option for configuring this script to execute automatically when the
InfiniBand driver starts is to change the value of SRPHA_ENABLE in
/etc/infiniband/openib.conf to "yes". However, this option also enables
SRP High Availability, which provides additional features. (Please read the
High Availability section.)
==============================================================================
9. Multiple Connections from Initiator IB Port to the Target
==============================================================================
Some system configurations may need multiple SRP connections from
the SRP Initiator to the same SRP Target: to the same Target IB port,
or to different IB ports on the same Target HCA.
In the case of a single Target IB port, i.e., when SRP connections use the same
path, the configuration is enabled using a different initiator_ext value for each
SRP connection. The initiator_ext value is a 16-hexadecimal-digit value
specified in the connection command.
Likewise, in the case of two physical connections (i.e., network paths) from a
single initiator IB port to two different IB ports on the same Target HCA, a
different initiator_ext value is needed on each path. The convention is to
use the Target port GUID as the initiator_ext value for the relevant path.
If you use srp_daemon with the -n flag, it automatically assigns initiator_ext
values according to this convention. For example:
id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,dgid=fe800000000000000002c90200402bed,\
pkey=ffff,service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200
Notes:
a. It is recommended to use the -n flag for all srp_daemon invocations.
b. ibsrpdm does not have a corresponding option.
c. srp_daemon.sh always uses the -n option (whether invoked manually by
the user, or automatically at startup by setting SRPHA_ENABLE or
SRP_DAEMON_ENABLE to yes).
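In the example above, the initiator_ext value ed2b400002c90200 is the target port GUID 0002c90200402bed (the last 16 hex digits of the dgid) with its byte order reversed. That byte swap can be sketched as follows; the sed-based helper is an illustration, not an srptools utility.

```shell
# Reverse the byte order of a 16-hex-digit GUID, as in the example above.
guid=0002c90200402bed
ext=$(echo "$guid" | sed 's/\(..\)\(..\)\(..\)\(..\)\(..\)\(..\)\(..\)\(..\)/\8\7\6\5\4\3\2\1/')
echo "$ext"    # prints ed2b400002c90200
```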
==============================================================================
10. High Availability (HA)
==============================================================================
High Availability Overview
--------------------------
High Availability works using the Device-Mapper (DM) multipath and the
SRP daemon.
Each initiator is connected to the same target from several ports/HCAs.
The DM multipath is responsible for joining together different paths to the
same target and for fail-over between paths when one of them goes offline.
Multipath will be executed on newly joined SCSI devices.
Each initiator should execute several instances of the SRP daemon, one for each
port. At startup, each SRP daemon detects the SRP targets in the fabric and
sends requests to the ib_srp module to connect to each of them. These
SRP daemons also detect targets that subsequently join the fabric, and send the
ib_srp module requests to connect to them as well.
High Availability Operation
---------------------------
When a path (from port1) to a target fails, the ib_srp module starts an error
recovery process. If this process gets to the reset_host stage and there is no
path to the target from this port, ib_srp will remove this scsi_host. After
the scsi_host is removed, multipath switches to another path to this target
(from another port/HCA).
When the failed path recovers, it will be detected by the SRP daemon. The SRP
daemon will then request ib_srp to connect to this target. Once the connection
is up, there will be a new scsi_host for this target. Multipath will be
executed on the devices of this host, returning to the original state (prior to
the failed path).
High Availability Prerequisites
-------------------------------
Installation for RHEL4 and RHEL5: (Execute once)
- Verify that the standard device-mapper-multipath rpm is installed. If not,
install it from the RHEL distribution.
Installation for SLES10: (Execute once)
- Verify that multipath is installed. If not, install it from the SLES
installation media (you can use YaST).
- Update udev: (Execute once - for manual activation of High Availability only)
- Add a file to /etc/udev/rules.d/ (you can call it 91-srp.rules)
This file should have one line:
ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath %M:%m"
Note: When SRPHA_ENABLE is set to "yes" (see Automatic Activation of High
Availability below), this file is created upon each boot of the driver and
is deleted when the driver is unloaded.
Manual Activation of High Availability
--------------------------------------
Initialization: (Execute after each boot of the driver)
1) Execute modprobe dm-multipath
2) Execute modprobe ib-srp
3) Make sure you have created file /etc/udev/rules.d/91-srp.rules
as described above
4) Execute for each port and each HCA:
srp_daemon -c -e -R 300 -i <device> -p <port>
(You can use another value for -R. See the workaround for the rare race
condition under the Known Issues section.)
This step can be performed by executing srp_daemon.sh, which sends
its log to /var/log/srp_daemon.log.
Now it is possible to access the SRP LUNs on /dev/mapper/.
NOTE: It is possible for regular (non-SRP) LUNs to also be present;
the SRP LUNs may be identified by their names. You can configure the
/etc/multipath.conf file to change multipath behavior.
Automatic Activation of High Availability
-----------------------------------------
- Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes".
- From the next loading of the driver it will be possible to access the SRP
LUNs on /dev/mapper/
NOTE: It is possible that regular (not SRP) LUNs may also be present;
the SRP LUNs may be identified by their name.
- It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log
==============================================================================
11. Shutting Down SRP
==============================================================================
SRP can be shut down by using "rmmod ib_srp", or by stopping the OFED driver
("/etc/init.d/openibd stop"), or as a by-product of a complete system shutdown.
Prior to shutting down SRP, remove all references to it. The actions you need
to take depend on the way SRP was loaded. There are three cases.
a. Without High Availability
------------------------------------
When working without High Availability, you should unmount the SRP
partitions that were mounted prior to shutting down SRP.
b. After Manual Activation of High Availability
-----------------------------------------------
If you manually activated SRP High Availability, perform the following steps:
1) Unmount all SRP partitions that were mounted
2) Kill the SRP daemon instances
3) Make sure no multipath instances are running. If any are running, wait
for them to end or kill them.
4) Execute multipath -F
c. After Automatic Activation of High Availability
--------------------------------------------------
If SRP High Availability was automatically activated, SRP shutdown must be
part of the driver shutdown ("/etc/init.d/openibd stop") which performs
steps 2-4 of case b above. However, you still have to unmount all SRP
partitions that were mounted before driver shutdown.
HAL Issue
---------
The HAL (Hardware Abstraction Layer) system includes a daemon that examines
all devices in the system. In this process, it frequently holds a reference
to the ib_srp module. If you attempt to shut down SRP while this daemon is
holding a reference to ib_srp, the shutdown will fail. Therefore, you
should make sure this will not occur. One solution may be to stop "haldaemon"
(/etc/init.d/haldaemon stop) prior to SRP shutdown.
==============================================================================
12. Known Issues
==============================================================================
- There is a very rare race condition which can cause the SRP daemon to miss a
target that joins the fabric. The race can occur if a target joins and leaves
the fabric several times in a short time (e.g., if the cable is not connected
well). In such a case, the SM may ignore this quick change of state and may
not send an InformInfo to the srp_daemon.
Workaround: Execute the srp_daemon command with the -R <N> option. This
option causes the SRP daemon to perform a full rescan of the fabric every
<N> seconds.
- The srp_daemon does not support pkeys other than the default pkey=ffff
- It is recommended to use an SM that supports the enhanced capability mask
matching feature (errata MGTWG8372). With SMs which support this feature, the
SRP daemon generates significantly less communication traffic.
- When booting OFED with SRP High Availability enabled, executing multipath for
all LUNs on all connections may take some time (several minutes). However, it
is possible to start working while this process is in progress.
- Stopping the driver while SRP High Availability is enabled kills all
multipath processes. Consider appropriate actions in case multipath is used
for other purposes.
- As High Availability is based on Device Mapper multipath, it inherits
multipath's limitations as well as its configuration and tuning options.
See http://christophe.varoqui.free.fr/wiki/wakka.php?wiki=Home
for information on multipath.
To modify and tune multipath configuration, edit the file /etc/multipath.conf
according to instructions and tips listed in
/usr/share/doc/packages/multipath-tools/multipath.conf.*
- In case your topology has two physical connections (i.e., network paths) from
a single initiator IB port to two different IB ports on the same Target HCA,
and you wish to have an SRP connection on the one path coexist with an SRP
connection on the second path, you must set a different initiator_ext value
on each path. See Section 9, "Multiple Connections from Initiator IB Port
to the Target" for details.
- The srp_daemon tool reads by default the configuration file
/etc/srp_daemon.conf. In case this configuration file disallows connecting
to a certain target, srp_daemon will ignore the target. If you find out
that srp_daemon ignores a target, please check the /etc/srp_daemon.conf file.
==============================================================================
13. Vendor Specific Notes
==============================================================================
Hosts connected to Qlogic SRP Targets must perform one of the following
steps after upgrading to OFED 1.3.1 to continue accessing their storage
successfully:
1. When issuing the "echo" command to add a new SRP Target, the host
must append the string ",initiator_ext=0000000000000001" to the original
echo string.
Example:
'ibsrpdm -c' output is as follows:
id_ext=0000000000000001,ioc_guid=00066a0138000165,dgid=fe8000000000000
000066a0260000165,pkey=ffff,service_id=0000494353535250,io_class=ff00
id_ext=0000000000000001,ioc_guid=00066a0238000165,dgid=fe8000000000000
000066a0260000165,pkey=ffff,service_id=0000494353535250,io_class=ff00
To connect to the first target, the echo command must be:
echo -n \
id_ext=0000000000000001,ioc_guid=00066a0138000165,\
dgid=fe8000000000000000066a0260000165,pkey=ffff,\
service_id=0000494353535250,io_class=ff00,\
initiator_ext=0000000000000001 > \
/sys/class/infiniband_srp/srp-mthca0-1/add_target
2. Change the SRP map on the Qlogic SRP Target to set the expected initiator
extension to 0. For details on how to change the SRP map on a Qlogic SRP
Target, please refer to product documentation.
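The append-and-echo step can be wrapped in a small helper. This is a hypothetical sketch, not a QLogic-provided tool; the add_target path is passed as a parameter so it can be tried outside a live SRP setup:

```shell
# Hypothetical helper (not part of OFED): append the initiator_ext string
# required by QLogic SRP targets to one line of 'ibsrpdm -c' output and
# write it to the given add_target file.
add_qlogic_target() {
    target="$1"      # one target string from 'ibsrpdm -c'
    add_target="$2"  # e.g. /sys/class/infiniband_srp/srp-mthca0-1/add_target
    printf '%s,initiator_ext=0000000000000001' "$target" > "$add_target"
}
```

On a live system the second argument would be the sysfs add_target file of the SRP port on which the target was discovered.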
trunk/ib-bonding.txt
IB Bonding
===============================================================================
1. Introduction
2. How to work with ib-bond
3. How to work with interface configuration scripts
3.1 Configuration with initscripts support
3.1.1 Writing network scripts under Redhat-AS4 (Update 4, 5, 6 or 7)
3.1.2 Writing network scripts under Redhat-EL5
3.2 Configuration with sysconfig support
3.2.1 Writing network scripts under SLES-10
3.3 Configuring Ethernet slaves
1. Introduction
-------------------------------------------------------------------------------
ib-bonding is a High Availability solution for IPoIB interfaces. It is based
on the Linux Ethernet Bonding Driver, adapted to work with IPoIB. Note that
IPoIB interfaces are supported only in active-backup mode; other modes should
not be used.
The ib-bonding package contains a bonding driver and a utility called ib-bond
to manage and control the driver operation.
2. How to work with ib-bond
-------------------------------------------------------------------------------
* Creating a bonding network interface
--bond-name: sets the name of the bonding network interface. Default is bond0
--bond-ip : sets the IP address of the bonding interface, given as IP or
IP/MASK. If MASK is not given, it is set to 255.255.255.0 (24 bits). Note
that MASK is the number of 1 bits in the netmask (a prefix length).
--slaves: a comma separated list of slave ib devices. If not given ib0 and
ib1 will be used as slaves. Child interfaces are allowed.
--miimon: the MII monitoring interval in mSec. Default is 100
* Deleting a bonding network interface
--stop: unenslave slaves and delete a specific bonding network interface (use with --bond-name)
--stop-all: unenslave slaves and delete all bonding network interfaces
* Querying a bonding network interface
--status: show the status of a specific bonding network interface (use with --bond-name)
--status-all: show the status of all bonding network interfaces
Examples:
* To bring up bond0 with ib0 and ib2 as slaves (assumes 2 HCAs)
ib-bond --bond-ip 192.186.10.100 --slaves ib0,ib2
* To bring up bond1 with ib0.f1f1 and ib1.f1f1 as slaves with a non-default
netmask
ib-bond --bond-name bond1 --bond-ip 192.186.10.100/25 --slaves ib0.f1f1,ib1.f1f1
* To query the status of bond1
ib-bond --bond-name bond1 --status
* To query the status of all bonding interfaces
ib-bond --status-all
* To stop bond1
ib-bond --bond-name bond1 --stop
* To stop all bonding interfaces
ib-bond --stop-all
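The MASK in --bond-ip is a prefix length (the number of 1 bits), so the /25 in the bond1 example corresponds to 255.255.255.128. A small sketch (hypothetical helper, not part of the ib-bonding package) that expands a prefix length into the dotted netmask:

```shell
# Convert a prefix length (0-32) to a dotted-quad netmask.
prefix_to_netmask() {
    bits=$1 out=""
    for i in 1 2 3 4; do
        if [ "$bits" -ge 8 ]; then
            octet=255; bits=$((bits - 8))
        else
            # e.g. 1 leftover bit -> 256 - 128 = 128
            octet=$((256 - (1 << (8 - bits)))); bits=0
        fi
        out="${out}${out:+.}${octet}"
    done
    printf '%s\n' "$out"
}
```

For example, `prefix_to_netmask 25` prints 255.255.255.128.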
3. How to work with interface configuration scripts
-------------------------------------------------------------------------------
Using ib-bond to configure interfaces does not save the configuration
anywhere, so whenever the master or one of the slaves is destroyed (e.g.,
after a system reboot), the configuration must be restored by running ib-bond
again. This can be avoided by creating an interface configuration script for
the ibX and bondX interfaces, using the standard bonding configuration syntax
for your OS.
3.1 Configuration with initscripts support
------------------------------------------
Note: This feature is available only for Redhat-AS4 (Update 4, Update 5,
Update 6 or Update 7) and for Redhat-EL5 and above.
3.1.1 Writing network scripts under Redhat-AS4 (Update 4, 5, 6 or 7)
-----------------------------------------------------------------
* In the master (bond) interface script, add the lines:
TYPE=Bonding
MTU=
Example: for bond0 (master) the file is named /etc/sysconfig/network-scripts/ifcfg-bond0
with the following text in the file:
DEVICE=bond0
IPADDR=192.168.1.1
NETMASK=255.255.255.0
NETWORK=192.168.1.0
BROADCAST=192.168.1.255
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
TYPE=Bonding
MTU=65520
Note: 65520 is a valid MTU value only if all IPoIB slaves operate in
connected mode and are configured with the same value. For IPoIB slaves that
work in datagram mode, use MTU=2044. If you do not set the correct MTU, or do
not set one at all (letting it take the default value), the performance of
the interface might decrease.
* In the slave (ib) interface script put the following lines:
SLAVE=yes
MASTER=
TYPE=InfiniBand
PRIMARY=
Example: the script for ib0 (slave) would be named /etc/sysconfig/network-scripts/ifcfg-ib0
with the following text in the file:
DEVICE=ib0
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
TYPE=InfiniBand
PRIMARY=yes
Note: If the slave interface is not primary then the line PRIMARY= is not
required and can be omitted.
After the configuration is saved, restart the network service by running:
/etc/init.d/network restart
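The two files above can be generated mechanically. A hypothetical sketch follows; the target directory and IP address are parameters so it can be tried outside /etc/sysconfig/network-scripts, and the file contents are trimmed to the essential lines from the example:

```shell
# Write ifcfg-bond0 (master) and ifcfg-ib0 (slave) scripts into a directory.
write_bond_scripts() {
    dir="$1" ip="$2"
    cat > "$dir/ifcfg-bond0" <<EOF
DEVICE=bond0
IPADDR=$ip
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
TYPE=Bonding
MTU=65520
EOF
    cat > "$dir/ifcfg-ib0" <<EOF
DEVICE=ib0
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
TYPE=InfiniBand
PRIMARY=yes
EOF
}
```

On a real Redhat-AS4 host the directory would be /etc/sysconfig/network-scripts, followed by a network service restart as shown above.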
3.1.2 Writing network scripts under Redhat-EL5
-----------------------------------------------
Follow the instructions in 3.1.1 (Writing network scripts under Redhat-AS4)
with the following changes:
* In the bondX (master) script - the line TYPE=Bonding is not needed.
* In the bondX (master) script - you may add to the configuration more options
with the following line
BONDING_OPTS="primary=ib0 updelay=0 downdelay=0"
* In the ibX (slave) script - the line TYPE=InfiniBand is necessary when using
bonding over devices configured with partitions (p_key)
Example:
ifcfg-ibX.8003 and ifcfg-ibY.8003 must include the TYPE=InfiniBand line in
their configuration files when used as slaves for a bondX device
* in /etc/modprobe.conf add the following lines
alias bond0 bonding
options bond0 miimon=100 mode=1 max_bonds=1
If you want more than one bonding interface, name them bond1, bond2, etc.,
add the necessary lines to /etc/modprobe.conf, and change max_bonds=1 to
max_bonds=N, where N is the number of bonding interfaces.
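The modprobe.conf lines for N bonding interfaces can be generated with a short sketch (hypothetical helper; it mirrors the alias/options lines shown above):

```shell
# Emit /etc/modprobe.conf lines for N bonding interfaces (bond0..bond<N-1>),
# keeping a single options line with max_bonds=N as described above.
gen_modprobe_lines() {
    n="$1"
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "alias bond$i bonding"
        i=$((i + 1))
    done
    echo "options bond0 miimon=100 mode=1 max_bonds=$n"
}
```

Running `gen_modprobe_lines 1` reproduces the two lines of the example above.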
Note: restarting OFED doesn't keep the bonding configuration via initscripts.
You have to restart the network service in order to recreate the bonding
interface.
3.2 Configuration with sysconfig support
----------------------------------------
Note: This feature is available only for SLES-10 and above.
3.2.1 Writing network scripts under SLES-10
-----------------------------------------------
* In the master (bond) interface script add the lines:
BONDING_MASTER=yes
BONDING_MODULE_OPTS="mode=active-backup miimon="
BONDING_SLAVE0=slave0
BONDING_SLAVE1=slave1
MTU=
Example: for bond0 (master) the file is named /etc/sysconfig/network/ifcfg-bond0
with the following text in the file:
BOOTPROTO="static"
BROADCAST="10.0.2.255"
IPADDR="10.0.2.10"
NETMASK="255.255.0.0"
NETWORK="10.0.2.0"
REMOTE_IPADDR=""
STARTMODE="onboot"
BONDING_MASTER="yes"
BONDING_MODULE_OPTS="mode=active-backup miimon=100 primary=ib0 updelay=0 downdelay=0"
BONDING_SLAVE0=ib0
BONDING_SLAVE1=ib1
MTU=65520
Note: 65520 is a valid MTU value only if all IPoIB slaves operate in
connected mode and are configured with the same value. For IPoIB slaves that
work in datagram mode, use MTU=2044. If you do not set the correct MTU, or do
not set one at all (letting it take the default value), the performance of
the interface might decrease.
Note: primary, downdelay and updelay are optional bonding interface
configuration parameters. You may use them, change them, or delete them from
the configuration script (by editing the line that starts with
BONDING_MODULE_OPTS).
* The slave (ib) interface script should look like this:
BOOTPROTO='none'
STARTMODE='off'
PRE_DOWN_SCRIPT=/etc/sysconfig/network/unenslave.sh
After the configuration is saved, restart the network service by running:
/etc/init.d/network restart
3.3 Configuring Ethernet slaves
-------------------------------
It is not possible to have a mix of Ethernet slaves and IPoIB slaves under
the same bonding master. It is possible, however, for a bonding master of
Ethernet slaves and a bonding master of IPoIB slaves to co-exist in one
machine.
To configure Ethernet slaves under a bonding master, use the following
instructions (depending on the OS):
* Under Redhat-AS4
Use the same instructions as for IPoIB slaves with the following exceptions
- In the master configuration file add the line
SLAVEDEV=1
- In the slave configuration file leave out the line
TYPE=InfiniBand
- For Ethernet, it is possible to set parameters of the bonding module in /etc/modprobe.conf
with the following line for example
options bonding miimon=100 mode=1 primary=eth0
Note that alias names for the bonding module (such as bond0) may not work.
* Under Redhat-AS5
No special instructions are required.
* Under SLES10
When using both types of bonding, it is necessary to update the
MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with
the names of the InfiniBand devices (ib0, ib1, etc.). Otherwise, bonding
devices will be created before InfiniBand devices at boot time.
Note: If more than one Ethernet NIC is installed, there might be a race for
the interface names eth0, eth1, etc. This can produce an unexpected mapping
between logical and physical devices, and hence a wrong bonding
configuration. The issue can be solved by binding a logical device name (e.g.
eth0) to a physical (hardware) device by specifying the MAC address in the
ethN configuration file.
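On Red Hat-style systems, such binding is typically expressed with an HWADDR line in the interface's configuration file (the MAC address below is a placeholder):

```
# /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
HWADDR=00:11:22:33:44:55
ONBOOT=yes
BOOTPROTO=none
```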
trunk/iser_release_notes.txt
Open Fabrics Enterprise Distribution (OFED)
iSER initiator in OFED 1.4 Release Notes
December 2008
* Background
iSER allows iSCSI to be layered over RDMA transports (including
InfiniBand and iWARP (RNIC)).
The OpenFabrics iSER initiator implementation is interoperable with
open-iscsi (http://www.open-iscsi.org/). It provides an alternative
transport to iscsi_tcp in the open-iscsi framework. The iSER transport
exposes a transport API to scsi_transport_iscsi, and a SCSI LLD API to
the Linux SCSI mid-layer (scsi_mod). Currently, the OpenFabrics iSER
initiator can be layered over InfiniBand (no iWARP support yet).
* Supported platforms
SLES 10
SLES 10 sp1
SLES 10 sp2
RHAS 4 up4
RHAS 4 up5
RHAS 4 up6
RHAS 4 up7
RHEL 5
RHEL 5.1
RHEL 5.2
The release has been tested against Voltaire iSCSI over iSER target
running in Voltaire's IB/Fibre-Channel router (SR4G) and the STGT
target.
* Fixed Bugs and Enhancements since OFED 1.3
iSER:
- Add logical unit reset support
- Update URLs of iSER docs
- Add change_queue_depth method
- Fix list iteration bug
- Handle iser_device allocation error gracefully
- Don't change ITT endianness
- Move high-volume debug output to higher debug level
- Count FMR alignment violations per session
Open-iSCSI:
- Update open-iscsi rpm versions from
2.0-754 to 2.0-754.1 and from 2.0-865.15 to 2.0-869.2
- Change open-iscsi defaults
- iscsi_discovery: fixed printing debug information
- iscsi_discovery: check if iscsid is running
- Set open-iscsi for auto-startup when installing OFED
- iscsiadm: bail out if daemon isn't running
* Known Issues
Open-iSCSI:
- Modifying a node's transport_name while a session is active
will create a stale session, which will be deleted only after reboot.
This issue is scheduled to be fixed in OFED 1.4 as part of the new
open-iscsi version.
* Installation/upgrade of open-iscsi
If iSER is selected to be installed with OFED, open-iscsi will be also
installed (or upgraded if another version of open-iscsi is already
installed). Installing/upgrading open-iscsi is required for iSER to
work properly. Before installing OFED, please make sure that no version
of open-iscsi is installed or add the following key to your ofed.conf
file: upgrade_open_iscsi=yes. Using this key will remove any old version
of open-iscsi.
If an older version of open-iscsi was installed, it is recommended to
delete its records before running open-iscsi. This can easily be done by
running the following command (while open-iscsi is stopped):
rm -rf /etc/iscsi/nodes/* /etc/iscsi/send_targets/*
Then, open-iscsi may be started, and targets may be discovered by running
'iscsi_discovery '.
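The cleanup sequence above can be scripted. This is a hypothetical sketch; the records root is a parameter so the sketch can be exercised against a scratch directory, but on a real host it is /etc/iscsi, and open-iscsi must be stopped first:

```shell
# Remove stale open-iscsi node and send_targets records under the given
# root directory (defaults to /etc/iscsi).
clean_iscsi_records() {
    root="${1:-/etc/iscsi}"
    rm -rf "$root/nodes"/* "$root/send_targets"/*
}
```

After the cleanup, open-iscsi may be started and targets rediscovered with iscsi_discovery as described above.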
* iSER links
Wiki pages
Information on building/configuring/running the open iscsi initiator over
iSER: https://wiki.openfabrics.org/tiki-index.php?page=iSER
IETF pages
iSCSI and iSER specifications come out of the IETF IP storage (IPS) work
group.
iSCSI specification: http://www.ietf.org/rfc/rfc3720.txt
iSER specification: http://www.ietf.org/rfc/rfc5046.txt
"About" page
general and detailed information on iSCSI and iSER
http://www.voltaire.com/Products/Server_Products/iSER_iSCSI_RDMA
trunk/BUILD_ID
OFED-1.4.2
libibverbs:
git://git.openfabrics.org/ofed_1_4/libibverbs.git ofed_1_4
commit b00dc7d2f79e0660ac40160607c9c4937a895433
libmthca:
git://git.openfabrics.org/ofed_1_4/libmthca.git ofed_1_4
commit be5eef3895eb7864db6395b885a19f770fde7234
libmlx4:
git://git.openfabrics.org/ofed_1_4/libmlx4.git ofed_1_4
commit d5e5026e2bd3bbd7648199a48c4245daf313aa48
libehca:
git://git.openfabrics.org/ofed_1_4/libehca.git ofed_1_4
commit 0249815e9b6f134f33546da6fa2e84e1185eea6d
libipathverbs:
git://git.openfabrics.org/~ralphc/libipathverbs ofed_1_4
commit 337df3c1cbe43c3e9cb58e7f6e91f44603dd23fb
libcxgb3:
git://git.openfabrics.org/~swise/libcxgb3.git ofed_1_4
commit f685c8fe7e77e64614d825e563dd9f02a0b1ae16
libnes:
git://git.openfabrics.org/~glenn/libnes.git master
commit 379cccb4484f39b99c974eb6910d3a0407c0bbd1
libibcm:
git://git.openfabrics.org/~shefty/libibcm.git master
commit 7fb57e005b3eae2feb83b3fd369aeba700a5bcf8
librdmacm:
git://git.openfabrics.org/~shefty/librdmacm.git master
commit 62c2bddeaf5275425e1a7e3add59c3913ccdb4e9
libsdp:
git://git.openfabrics.org/ofed_1_4/libsdp.git ofed_1_4
commit b1eaecb7806d60922b2fe7a2592cea4ae56cc2ab
sdpnetstat:
git://git.openfabrics.org/~amirv/sdpnetstat.git ofed_1_4
commit 798e44f6d5ff8b15b2a86bc36768bd2ad473a6d7
srptools:
git://git.openfabrics.org/~ishai/srptools.git master
commit ce1f64c8dd63c93d56c1cc5fbcdaaadd4f74a1e3
perftest:
git://git.openfabrics.org/~orenmeron/perftest.git master
commit e96be03d61da50275015f19ce5a237cc7e8daa54
qlvnictools:
git://git.openfabrics.org/~ramachandrak/qlvnictools.git ofed_1_4
commit 4ce9789273896d0e67430c330eb3703405b59951
tvflash:
git://git.openfabrics.org/ofed_1_4/tvflash.git ofed_1_4
commit e1b50b3b8af52b0bc55b2825bb4d6ce699d5c43b
mstflint:
git://git.openfabrics.org/~orenk/mstflint.git master
commit 57e9b162ec298fc08264128c347104bbd0311f69
qperf:
git://git.openfabrics.org/~johann/qperf.git/.git master
commit b81434ec094694bae55e20dd1af5f5057d8e5f82
ibutils:
git://git.openfabrics.org/~kliteyn/ibutils.git ofed_1_4
commit 5638c9b23013bf046b44f45f7091b4761a80c8f1
ibsim:
git://git.openfabrics.org/ofed_1_4/ibsim.git ofed_1_4
commit a76132ae36dde8302552d896e35bd29608ac9524
ofa_kernel-1.4.2:
Git:
git://git.openfabrics.org/ofed_1_4/linux-2.6.git ofed_kernel
commit a1c3199280b1a0d3ccdc92a809ea6c612c5c44d8
# MPI
mvapich-1.1.0-3355.src.rpm
mvapich2-1.2p1-1.src.rpm
openmpi-1.3.2-1.src.rpm
mpitests-3.1-891.src.rpm
trunk/HOWTO.build_ofed
Open Fabrics Enterprise Distribution (OFED)
How To Build OFED 1.4
December 2008
==============================================================================
Table of contents
==============================================================================
1. Overview
2. Usage
3. Requirements
==============================================================================
1. Overview
==============================================================================
The script "build_ofed.sh" is used to build the OFED package based on the
OpenFabrics project and InfiniBand git tree. The package is built under the
current working directory.
See OFED_release_notes.txt for more details.
==============================================================================
2. Usage
==============================================================================
The build script for the OFED package can be downloaded from:
git://git.openfabrics.org/~vlad/ofabuild
branch: ofed_1_3
Name: build_ofed.sh
Usage: build_ofed.sh --ver|-v
[--tmpdir ]
[--ofed-scripts ]
[--ofed-docs ]
[--mpidir|-m ]
[--long-help]
Example:
./build_ofed.sh --ver 1.4-rc7
This command will create a package (i.e., subtree) called OFED-1.4-rc7
in the current working directory.
Sources are extracted by default from the following locations:
libibverbs:
git://git.openfabrics.org/ofed_1_4/libibverbs.git ofed_1_4
commit b00dc7d2f79e0660ac40160607c9c4937a895433
libmthca:
git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git master
commit be5eef3895eb7864db6395b885a19f770fde7234
libmlx4:
git://git.openfabrics.org/ofed_1_4/libmlx4.git ofed_1_4
commit fd418d6ee049afe76bb769aff87c303b96848495
libehca:
git://git.openfabrics.org/ofed_1_4/libehca.git ofed_1_4
commit e0c2d7e8ee2aa5dd3f3511270521fb0c206167c6
libipathverbs:
git://git.openfabrics.org/~ralphc/libipathverbs ofed_1_4
commit 65e5701dbe7b511f796cb0026b0cd51831a62318
libcxgb3:
git://git.openfabrics.org/~swise/libcxgb3.git ofed_1_4
commit f685c8fe7e77e64614d825e563dd9f02a0b1ae16
libnes:
git://git.openfabrics.org/~glenn/libnes.git master
commit 07fb9dfbbb36b28b5ea6caa14a1a5e215386b3e8
libibcm:
git://git.openfabrics.org/~shefty/libibcm.git master
commit 7fb57e005b3eae2feb83b3fd369aeba700a5bcf8
librdmacm:
git://git.openfabrics.org/~shefty/librdmacm.git master
commit e0b1ece1dc0518b2a5232872e0c48d3e2e354e47
libsdp:
git://git.openfabrics.org/ofed_1_4/libsdp.git ofed_1_4
commit 02404fb0266082f5b64412c3c25a71cb9d39442d
sdpnetstat:
git://git.openfabrics.org/~amirv/sdpnetstat.git ofed_1_4
commit 75a033a9512127449f141411b0b7516f72351f95
srptools:
git://git.openfabrics.org/ofed_1_3/srptools.git ofed_1_3
commit d3025d0771317584e51490a419a79ab55650ebc9
perftest:
git://git.openfabrics.org/~orenmeron/perftest.git master
commit ca629627c7a26005a1a4c8775cc01f483524f1c4
qlvnictools:
git://git.openfabrics.org/~ramachandrak/qlvnictools.git ofed_1_4
commit 1dc6e51a728cbfbdd2018260602b8bebde618da9
tvflash:
git://git.openfabrics.org/ofed_1_4/tvflash.git ofed_1_4
commit e1b50b3b8af52b0bc55b2825bb4d6ce699d5c43b
mstflint:
git://git.openfabrics.org/~orenk/mstflint.git master
commit 9ddeea464e946cd425e05b0d1fdd9ec003fca824
qperf:
git://git.openfabrics.org/~johann/qperf.git/.git master
commit bee05d35b09b0349cf4734ae43fc9c2e970ada8c
ibutils:
git://git.openfabrics.org/~orenk/ibutils.git master
commit 6516d16e815c68fa405562ea773b0c5215c1b70c
ibsim:
git://git.openfabrics.org/~sashak/ibsim.git master
commit a76132ae36dde8302552d896e35bd29608ac9524
ofa_kernel-1.4:
Git:
git://git.openfabrics.org/ofed_1_4/linux-2.6.git ofed_kernel
commit 7055de9adefaaa409856d7ee0b8a986485fbfb06
SRPMS:
rds-tools:
http://www.openfabrics.org/~vlad/ofed_1_4/rds-tools
mvapich:
http://www.openfabrics.org/~pasha/ofed_1_4/mvapich
openmpi:
http://www.openfabrics.org/~jsquyres/ofed_1_4
mvapich2:
http://www.openfabrics.org/~perkinjo/ofed_1_4
mpitests:
http://www.openfabrics.org/~pasha/ofed_1_4/mpitests
==============================================================================
3. Requirements
==============================================================================
1. Git:
Can be downloaded from:
http://www.kernel.org/pub/software/scm/git/git-1.5.3.tar.gz
2. Autotools:
libtool-1.5.20 or higher
autoconf-2.59 or higher
automake-1.9.6 or higher
m4-1.4.4 or higher
The above tools can be downloaded from the following URLs:
libtool - "http://ftp.gnu.org/gnu/libtool/libtool-1.5.20.tar.gz"
autoconf - "http://ftp.gnu.org/gnu/autoconf/autoconf-2.59.tar.gz"
automake - "http://ftp.gnu.org/gnu/automake/automake-1.9.6.tar.gz"
m4 - "http://ftp.gnu.org/gnu/m4/m4-1.4.4.tar.gz"
trunk/OFED_release_notes.txt
Open Fabrics Enterprise Distribution (OFED)
Version 1.4.2
Release Notes
July 2009
===============================================================================
Table of Contents
===============================================================================
1. Overview, which includes:
- OFED Distribution Rev 1.4 Contents
- Supported Platforms and Operating Systems
- Supported HCA and RNIC Adapter Cards and Firmware Versions
- Tested Switch Platforms
- Third party Test Packages
- OFED sources
2. Main Changes from OFED 1.3
3. Main Changes from OFED 1.4
4. Main Changes from OFED 1.4.1
5. Known Issues
===============================================================================
1. Overview
===============================================================================
These are the release notes of OpenFabrics Enterprise Distribution (OFED)
release 1.4.2. The OFED software package is composed of several software modules,
and is intended for use on a computer cluster constructed as an InfiniBand
subnet or iWARP network.
Note: If you plan to upgrade the OFED package on your cluster, please upgrade
all of its nodes to this new version.
1.1 OFED 1.4.2 Contents
-----------------------
The OFED package contains the following components:
- OpenFabrics core and ULPs:
- IB HCA drivers (mthca, mlx4, ipath, ehca)
- iWARP RNIC driver (cxgb3, nes)
- core
- Upper Layer Protocols: IPoIB, SDP, SRP Initiator and target, iSER
Initiator and target, RDS, uDAPL, qlgc_vnic and NFS-RDMA.
- OpenFabrics utilities:
- OpenSM (OSM): InfiniBand Subnet Manager
- Diagnostic tools
- Performance tests
- MPI:
- OSU MPI stack supporting the InfiniBand and iWARP interface
- Open MPI stack supporting the InfiniBand and iWARP interface
- OSU MVAPICH2 stack supporting the InfiniBand and iWARP interface
- MPI benchmark tests (OSU benchmarks, Intel MPI benchmarks, Presta)
- Extra packages:
- open-iscsi: open-iscsi initiator with iSER support
- ib-bonding: Bonding driver for IPoIB interface
- Sources of all software modules (under conditions mentioned in the modules'
LICENSE files)
- Documentation
Notes:
1. iSER target is of Beta quality.
2. NFS-RDMA is at Beta, thus it is not installed by default.
3. All other OFED components are of production quality.
4. See release notes for each package in the docs directory.
5. Any Topspin copyright belongs to Cisco Systems, Inc.
1.2 Supported Platforms and Operating Systems
---------------------------------------------
o CPU architectures:
- x86_64
- x86
- ppc64
- ia64
o Linux Operating Systems:
- RedHat EL4 up4: 2.6.9-42.ELsmp *
- RedHat EL4 up5: 2.6.9-55.ELsmp
- RedHat EL4 up6: 2.6.9-67.ELsmp
- RedHat EL4 up7: 2.6.9-78.ELsmp
- RedHat EL5: 2.6.18-8.el5
- RedHat EL5 up1: 2.6.18-53.el5
- RedHat EL5 up2: 2.6.18-92.el5
- RedHat EL5 up3: 2.6.18-128.el5
- Fedora C9: 2.6.25-14.fc9 *
- SLES10: 2.6.16.21-0.8-smp
- SLES10 SP1: 2.6.16.46-0.12-smp
- SLES10 SP2: 2.6.16.60-0.21-smp
- SLES11: 2.6.27.19-5-default
- OpenSuSE 10.3: 2.6.22.5-31 *
- OEL 4 up7 2.6.9-78.ELsmp
- OEL 5 up2 2.6.18-92.el5
- CentOS5.2 2.6.18-92.el5
- kernel.org: 2.6.26 and 2.6.27
* Minimal QA for these versions
1.3 HCAs and RNICs Supported
----------------------------
This release supports IB HCAs by Mellanox Technologies, Qlogic and IBM as
well as iWARP RNICs by Chelsio Communications and Intel.
o Mellanox Technologies HCAs (SDR, DDR and QDR Modes are Supported):
- InfiniHost (fw-23108 Rev 3.5.000)
- InfiniHost III Ex (MemFree: fw-25218 Rev 5.3.000
with memory: fw-25208 Rev 4.8.200)
- InfiniHost III Lx (fw-25204 Rev 1.2.000)
- ConnectX IB (fw-25408 Rev 2.6.000)
For official firmware versions please see:
http://www.mellanox.com/content/pages.php?pg=firmware_download
o Qlogic HCAs:
- QHT7140 QLogic InfiniPath SDR HTX HCA
- QLE7140 QLogic InfiniPath SDR PCIe HCA
- QLE7240 QLogic InfiniPath DDR x8 PCIe HCA
- QLE7280 QLogic InfiniPath DDR x16 PCIe HCA
o IBM HCAs:
- GX Dual-port SDR 4x IB HCA
- GX Dual-port SDR 12x IB HCA
- GX Dual-port DDR 4x IB HCA
- GX Dual-port DDR 12x IB HCA
o Chelsio RNICs:
- S310/S320 10GbE Storage Accelerators
- R310/R320 10GbE iWARP Adapters
o Intel RNICs:
- NE020 10Gb iWARP Adapter
1.4 Switches Supported
----------------------
This release was tested with switches and gateways provided by the following
companies:
- Cisco
- Voltaire
- Qlogic
- Flextronics
- Sun
- Mellanox
1.5 Third Party Packages
------------------------
The following third party packages have been tested with OFED 1.4:
1. Intel MPI, Version 3.0 - Package ID: l_mpi_p_3.0.043
2. HP-MPI 2.2.5.1 (without XRC support)
3. HP-MPI 2.2.7 (with XRC support)
1.6 OFED Sources
----------------
All sources are located under git://git.openfabrics.org/
Kernel sources: git://git.openfabrics.org/ofed_1_4/linux-2.6.git ofed_kernel
User level Sources are located in all git trees as written in the BUILD_ID
The kernel sources are based on Linux 2.6.27 mainline kernel. Its patches
are included in the OFED sources directory.
For details see HOWTO.build_ofed.
===============================================================================
2. Main Changes from OFED 1.3
===============================================================================
Note: For details regarding the various changes, please see the release notes
for each package in the docs directory.
2.1 General changes
o Kernel code based on 2.6.27
o Added iSER target package
o Added NFS-RDMA support - in technology preview state
o New verbs to support BMME:
- Fast memory thru send queue
- Local invalidate send work requests
- Read with invalidate
o Multi-Protocol (Virtual Protocol Interconnect) support, Eth and IB
for ConnectX. See mlx4_release note for the usage model.
2.2 IPoIB
o Datagram mode: Added LRO and LSO support
2.3 SDP
o GA level
2.4 qlgc_vnic
o Support for hotswap of EVIC and dynamic update of existing connections
with the addition of QLogic dynamic update daemon.
o Performance improvements in handling of Ethernet broadcast/multicast
traffic.
2.5 RDS
o GA of RDMA API (using FMRs) - RDS API version 3
2.6 uDAPL
o Added socket based CM - for both scalability and interop with Windows
o Added UD extensions - for version 2.0 only
o v1 library package has been renamed to compat-dapl-1.2.8-1
2.7 Management
o OpenSM
- Cached routing
- Multi lid routing balancing for updn/minhop routing algorithms
- Preserve base lid routes when LMC > 0
- OpenSM configuration unification
- IPv6 Solicited Node Multicast addresses consolidation
- Routing Chaining
- Failover/Handover improvements: Query remote SMs during light sweep
- Ordered routing paths balancing
o ibutils:
- Report created in CSV format
o Diagnostic tools:
- ibnetdiscover library - to accelerate other tools
2.8 MPI:
a. OSU MVAPICH 1.1.0
- eXtended Reliable Connection (XRC) support
- Lock-free design to provide support for asynchronous progress at
both sender and receiver to overlap computation and communication
- Optimized MPI_allgather collective
- Efficient intra-node shared memory communication support for
diskless clusters
- Enhanced Totalview Support with the new mpirun_rsh framework
b. Open MPI 1.2.8
- Bug fixes
c. OSU MVAPICH2 1.2p1
- Scalable and robust daemon-less job startup
- Enhanced scalability for RDMA-based Direct One-sided communication
with less resources
- Checkpoint-Restart with Intra-node Shared Memory Support
- Multi-core optimized Collectives
- Shared memory optimized MPI_Bcast
- Optimized and tuned MPI_Alltoall
- Enhancement to Software Installation with Full Autoconf-based
Configuration
- PLPA Support for affinity
d. MPI tests:
- Updated IMB 3.1
===============================================================================
3. Main Changes from OFED 1.4
===============================================================================
- Added support for RHEL 5.3 and SLES11
- NFS/RDMA: In beta quality with backports for RHEL 5.2, 5.3 and SLES 10 SP2
- Updated MPI packages:
MVAPICH 1.1.0-3355
Open MPI 1.3.2
- Updated bonding package: ib-bonding-0.9.0-40
- Updated DAPL: compat-dapl-1.2.14 and dapl-2.0.19
- Updated opensm version to include critical bug fixes
- Fixed RDS iWARP support
- Low level drivers updated: ehca, mlx4, cxgb3, nes, ipath, mthca
- Added a module parameter to control number of MTTs per segment in Mellanox
HCAs (mlx4 & mthca)
- mstflint update
- Bug fixes - see each component release notes for details
===============================================================================
4. Main Changes from OFED 1.4.1
===============================================================================
Bug fixes:
----------
SDP:
- Fix memory leak in bzcopy (#1672)
- Fix bad credits advertised when connection initiated (#1679)
- Fix compilation on i386 with gcc 3.4 used in RedHat 4. (#1630)
- Fix Data integrity error (#1672)
backports:
- Fix clear-dirty-page accounting. This bug was hit in Lustre testing (#1650)
- Fix NFS stale file handles (#1680)
- Simple NFS file operations cause RHEL 5.3 server to hard hang. (#1676)
- iozone direct write test causes OFED to hang/crash (#1675)
- Module crash on server with multiple client simple load (#1677)
mlx4 driver:
- Fix post send of local invalidate and fast registration packets.
This fixed nfsrdma server crash @test5 connectathon basic test. (#1571)
nes driver:
- fix qp refcount during disconnect
Features:
---------
nes driver:
- Make LRO as default feature
mlx4 driver:
- Added a new device ID: 0x6764: MT26468 ConnectX EN 10GigE PCIe gen2
===============================================================================
5. Known Issues
===============================================================================
The following is a list of general limitations and known issues of the various
components of the OFED 1.4 release.
1. When upgrading from an earlier OFED version, the installation script does not
stop the earlier OFED version prior to uninstalling it.
Workaround: Stop the old OFED stack (/etc/init.d/openibd stop) before
upgrading to OFED 1.4.
2. Memory registration by the user is limited according to administrator
setting. See "Pinning (Locking) User Memory Pages" in OFED_tips.txt for
system configuration.
3. Fork support from kernel 2.6.12 and above is available provided
that applications do not use threads. The fork() is supported as long
as the parent process does not run before the child exits or calls exec().
The former can be achieved by calling wait(childpid), and the latter can be
achieved by application specific means. The Posix system() call is
supported.
4. The ipath driver is supported only on 64-bit platforms.
5. When installing OFED on OpenSuse or Ubuntu one should use the
--without-depcheck option of the install.pl script
6. To install OFED 1.4 on Fedora Core 8 one should:
1. Install libtool RPM (required by libibcommon)
2. Install tcsh RPM (required by mpi-selector)
3. Create the file '.rpmmacros' (required by mvapich):
echo "%__arch_install_post %{nil}" >> /root/.rpmmacros
7. IPoIB: brctl utilities do not work on IPoIB interfaces. The reason for that
is that these utilities support devices of type Ethernet only.
8. "openibd stop" can sometime fail with the error:
Unloading ib_cm [FAILED]
ERROR: Module ib_cm is in use by ib_ipoib
Workaround: run "openibd stop" again.
9. When working with iSCSI over IPoIB or mlx4_en, you must disable LRO (even
if IPoIB is set to connected mode). This is because a bug in older
kernels causes a kernel panic.
10. On SLES11, if the uninstall fails, examine the error log
    and remove the failing RPMs manually using 'rpm -e'
11. On SLES11, set the allow_unsupported_modules parameter to 1 in the file
    /etc/modprobe.d/unsupported-modules. Without this, the modules will not
    load.
Note: See the release notes of each component for additional issues.
trunk/ipoib_release_notes.txt
Open Fabrics Enterprise Distribution (OFED)
IPoIB in OFED 1.4.1 Release Notes
May 2009
===============================================================================
Table of Contents
===============================================================================
1. Overview
2. New Features
3. Known Issues
4. DHCP Support of IPoIB
5. The ib-bonding driver
6. Bug Fixes and Enhancements Since OFED 1.3
7. Bug Fixes and Enhancements Since OFED 1.3.1
8. Bug Fixes and Enhancements Since OFED 1.4
9. Performance tuning
===============================================================================
1. Overview
===============================================================================
IPoIB is a network driver implementation that enables transmitting IP and ARP
protocol packets over an InfiniBand UD channel. The implementation conforms to
the relevant IETF working group's RFCs (http://www.ietf.org).
===============================================================================
2. New Features
===============================================================================
1. This version of OFED introduces improvements to IPoIB by cutting the CPU
   overhead of handling received packets. This improves operation
   in datagram mode:
Large Receive Offload (LRO) - aggregating multiple incoming packets from a
single stream into a larger buffer before they are passed higher up the
networking stack, thus reducing the number of packets that have to be
processed.
This feature is enabled on HCAs that can support LRO, e.g. ConnectX.
2. Datagram mode: LSO (large send offload) allows the networking stack to pass
SKBs with data size larger than the MTU to the IPoIB driver and have the HCA
HW fragment the data to multiple MSS-sized packets. Add a device capability
flag IB_DEVICE_UD_TSO for devices that can perform TCP segmentation offload,
a new send work request opcode IB_WR_LSO, header, hlen and mss fields for
the work request structure, and a new IB_WC_LSO completion type.
This feature is enabled on HCAs that can support LSO, e.g. ConnectX.
Usage and configuration:
========================
1. To check the current mode used for outgoing connections, enter:
cat /sys/class/net/ib0/mode
2. To disable IPoIB CM at compile time, enter:
cd OFED-1.4
export OFA_KERNEL_PARAMS="--without-ipoib-cm"
./install.pl
3. To change the run-time configuration for IPoIB, enter:
edit /etc/infiniband/openib.conf, change the following parameters:
# Enable IPoIB Connected Mode
SET_IPOIB_CM=yes
# Set IPoIB MTU
IPOIB_MTU=65520
4. You can also change the mode and MTU for a specific interface manually.
To enable connected mode for interface ib0, enter:
echo connected > /sys/class/net/ib0/mode
To increase MTU, enter:
ifconfig ib0 mtu 65520
5. Switching between CM and UD mode can be done at run time:
   echo datagram > /sys/class/net/ib0/mode sets the mode of ib0 to UD
   echo connected > /sys/class/net/ib0/mode sets the mode of ib0 to CM
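The run-time switch in item 5 can be wrapped in a small helper. This is a sketch, not part of OFED: toggle_ipoib_mode is a hypothetical name, and the mode file path is taken as an argument so the logic can be exercised against an ordinary file.

```shell
#!/bin/sh
# toggle_ipoib_mode <mode-file>: flip an IPoIB interface between
# datagram (UD) and connected (CM) mode by writing to its sysfs mode
# file, e.g. /sys/class/net/ib0/mode. Prints the resulting mode.
toggle_ipoib_mode() {
    mode_file="$1"
    if [ "$(cat "$mode_file")" = "datagram" ]; then
        echo connected > "$mode_file"
    else
        echo datagram > "$mode_file"
    fi
    cat "$mode_file"    # report the new mode
}
```

Typical use (as root): toggle_ipoib_mode /sys/class/net/ib0/mode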
===============================================================================
3. Known Issues
===============================================================================
1. If a host has multiple interfaces and (a) each interface belongs to a
different IP subnet, (b) they all use the same InfiniBand Partition, and (c)
they are connected to the same IB Switch, then the host violates the IP rule
requiring different broadcast domains. Consequently, the host may build an
incorrect ARP table.
The correct setting of a multi-homed IPoIB host is achieved by using a
different PKEY for each IP subnet. If a host has multiple interfaces on the
same IP subnet, then to prevent a peer from building an incorrect ARP entry
(neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X
stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). This
causes the network stack to send ARP replies only on the interface with the
IP address specified in the ARP request:
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_ignore=1
Or, globally,
sysctl -w net.ipv4.conf.all.arp_ignore=1
To learn more about the arp_ignore parameter, see Documentation/networking/ip-sysctl.txt.
Note that distributions have the means to make kernel parameters persistent.
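For example, on many distributions the arp_ignore setting can be made persistent by adding lines such as the following to /etc/sysctl.conf (a sketch; the exact file and mechanism vary by distribution):

```shell
# /etc/sysctl.conf fragment -- reapplied at boot (or by "sysctl -p")
net.ipv4.conf.ib0.arp_ignore = 1
net.ipv4.conf.ib1.arp_ignore = 1
```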
2. There are IPoIB alias lines in modprobe.conf which prevent stopping/
unloading the stack (i.e., '/etc/init.d/openibd stop' will fail).
These alias lines cause the drivers to be loaded again by udev scripts.
Workaround: Change modprobe.conf to set
OFA_KERNEL_PARAMS="--without-modprobe" before running install.pl, or remove
the alias lines from modprobe.conf.
3. On SLES 10:
The ib1 interface uses the configuration script of ib0.
Workaround: Invoke ifup/ifdown using both the interface name and the
configuration script name (example: ifup ib1 ib1).
4. After a hotplug event, the IPoIB interface falls back to datagram mode, and
MTU is reduced to 2K.
Workaround: Re-enable connected mode and increase MTU manually:
echo connected > /sys/class/net/ib0/mode
ifconfig ib0 mtu 65520
5. Since the IPoIB configuration files (ifcfg-ib<n>) are installed under the
   standard networking scripts location (RedHat: /etc/sysconfig/network-scripts/
   and SuSE: /etc/sysconfig/network/), the option IPOIB_LOAD=no in openib.conf
   does not prevent the loading of IPoIB on boot.
6. If IPoIB connected mode is enabled, it uses a large MTU for connected mode
messages and a small MTU for datagram (in particular, multicast) messages,
and relies on path MTU discovery to adjust MTU appropriately. Packets sent
in the window before MTU discovery automatically reduces the MTU for a
specific destination will be dropped, producing the following message in the
system log:
"packet len (> ) too long to send, dropping"
To warn about this, a message is produced in the system log each time MTU is
set to a value higher than 2K.
7. IPoIB IPv6 support is broken between systems with kernels < 2.6.12 and
   systems with kernels >= 2.6.12. The reason is that kernel 2.6.12 puts the
   link layer address at an offset of two bytes relative to older kernels.
   This causes the other host to misinterpret the hardware address, resulting
   in failure to resolve paths, which end up based on wrong GIDs. As an
   example, RH 4.x and RH 5.x cannot interoperate.
8. In connected mode, TCP latency for short messages is larger by approx. 1usec
(~5%) than in datagram mode. As a workaround, use datagram mode.
9. Single-socket TCP bandwidth for kernels < 2.6.18 is lower than with
newer kernels. We recommend kernels from 2.6.18 and up for
best IPoIB performance.
10. Connectivity issues have been encountered when using IPv6 on ia64 systems.
11. The IPoIB module uses a Linux implementation for Large Receive Offload
(LRO) in kernel 2.6.24 and later. These kernels require installing the
"inet_lro" module.
12. ConnectX only: If you have a port configured as ETH, and are running IPoIB
in connected mode -- and then change the port type to IB, the IPoIB mode
changes to datagram mode.
13. When working with iSCSI, you must disable LRO (even if you are working in
    connected mode), because a bug in older kernels causes a kernel panic.
14. IPoIB datagram mode initial packet loss (bug #1287): When the datagram test
    reaches packet sizes of 8192 and larger, it always loses the first packet in
    the sequence.
Workaround: Increase the number of pending skb's before a neighbor is
resolved (default is 3). This value can be changed with:
sysctl net.ipv4.neigh.ib0.unres_qlen.
15. IPoIB multicast support is broken in RH4.x kernels. This is because
ndisc_mc_map() does not handle IPOIB hardware addresses.
===============================================================================
4. IPoIB Configuration Based on DHCP
===============================================================================
Setting an IPoIB interface configuration based on DHCP (v3.1.2 which is available
via www.isc.org) is performed similarly to the configuration of Ethernet
interfaces. In other words, you need to make sure that IPoIB configuration files
include the following line:
For RedHat:
BOOTPROTO=dhcp
For SLES:
BOOTPROTO=dhcp
Note: If IPoIB configuration files are included, ifcfg-ib<n> files will be
installed under:
/etc/sysconfig/network-scripts/ on a RedHat machine
/etc/sysconfig/network/ on a SuSE machine
Note: A patch for DHCP is required for supporting IPoIB. The patch file for
DHCP v3.1.2, dhcp.patch, is available under the docs/ directory.
Standard DHCP fields holding MAC addresses are not large enough to contain an
IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages
convey a client identifier field used to identify the DHCP session. This client
identifier field can be used to associate an IP address with a client identifier
value, such that the DHCP server will grant the same IP address to any client
that conveys this client identifier.
Note: Refer to the DHCP documentation for more details on how to make this
association.
The length of the client identifier field is not fixed in the specification.
4.1 DHCP Server
In order for the DHCP server to provide configuration records for clients, an
appropriate configuration file needs to be created. By default, the DHCP server
looks for a configuration file called dhcpd.conf under /etc. You can either edit
this file or create a new one and provide its full path to the DHCP server using
the -cf flag. See a file example at docs/dhcpd.conf of this package.
The DHCP server must run on a machine which has loaded the IPoIB module.
To run the DHCP server from the command line, enter:
dhcpd -d
Example:
host1# dhcpd ib0 -d
4.2 DHCP Client (Optional)
Note: A DHCP client can be used if you need to prepare a diskless machine with
an IB driver.
In order to use a DHCP client identifier, you need to first create a
configuration file that defines the DHCP client identifier. Then run the DHCP
client with this file using the following command:
dhclient -cf <configuration file> <interface>
Example of a configuration file for the ConnectX (PCI Device ID 25418), called
dhclient.conf:
# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier 00:02:c9:03:00:00:10:39;
}
Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called
dhclient.conf:
# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
}
In order to use the configuration file, run:
host1# dhclient -cf dhclient.conf ib1
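The 20-byte value in the InfiniHost III Ex example above is the interface's IPoIB hardware address, which can be read from sysfs. The sketch below is an assumed convenience workflow, not part of the OFED package: make_dhclient_conf is a hypothetical helper, and the address file is a parameter so the function can be tested against an ordinary file.

```shell
#!/bin/sh
# make_dhclient_conf <interface> <address-file>: emit a dhclient.conf
# stanza whose client identifier is the interface's IPoIB hardware
# address (20 colon-separated hex bytes, as in the example above).
make_dhclient_conf() {
    ifname="$1"
    hwaddr=$(cat "$2")
    printf 'interface "%s" {\n' "$ifname"
    printf '    send dhcp-client-identifier %s;\n' "$hwaddr"
    printf '}\n'
}
```

Typical use: make_dhclient_conf ib1 /sys/class/net/ib1/address > dhclient.conf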
===============================================================================
5. The ib-bonding driver
===============================================================================
The ib-bonding driver is a High Availability solution for IPoIB interfaces.
It is based on the Linux Ethernet Bonding Driver and was adapted to work with
IPoIB. The ib-bonding package contains a bonding driver and a utility called
ib-bond to manage and control the driver operation.
The ib-bonding driver comes with the ib-bonding package (run rpm -qi ib-bonding
to get the package information).
Using the ib-bonding driver
---------------------------
The ib-bonding driver can be loaded manually or automatically.
1. Manual operation:
Use the utility ib-bond to start, query, or stop the driver. For details on this
utility, read the documentation for the ib-bonding package.
2. Automatic operation:
Use standard OS tools (sysconfig in SuSE and initscripts in Redhat)
to create a configuration that will come up with network restart. For details
on this, read the documentation for the ib-bonding package.
Notes:
* Using /etc/infiniband/openib.conf to create a persistent configuration is
no longer supported
* On RHEL4_U7, a slave interface cannot be set as primary.
===============================================================================
6. Bug Fixes and Enhancements Since OFED 1.3
===============================================================================
- There is no default configuration for IPoIB interfaces: you must manually
  specify the full IP configuration or use the ofed_net.conf file. See
  OFED_Installation_Guide.txt for details on IPoIB configuration.
- Don't drop multicast sends when they can be queued
- IPoIB panics with RHEL5U1, RHEL4U6 and RHEL4U5: Bug fix when copying small
SKBs (bug 989)
- IPoIB failed on stress testing (bug 1004)
- Kernel Oops during "port up/down test" (bug 1040)
- Restarting the stack while iperf 2.0.4 was running on the client side caused
  a kernel panic (bug 985)
- Fix neigh destructor oops on kernel versions between 2.6.17 and 2.6.20
- Set max CM MTU when moving to CM mode, instead of setting it in openibd script
- Fix CQ size calculations for ipoib
- Bonding: Enable build for SLES10 SP2
- Bonding: Fix issue in using the bonding module for Ethernet slaves (see
documentation for details)
===============================================================================
7. Bug Fixes and Enhancements Since OFED 1.3.1
===============================================================================
- IPoIB: Refresh paths instead of flushing them on SM change events to improve
  failover response
- IPoIB: Fix loss of connectivity after bonding failover on both sides
- Bonding: Fix link state detection under RHEL4
- Bonding: Avoid annoying messages from initscripts when starting bond
- Bonding: Set default number of grat. ARP after failover to three (was one)
===============================================================================
8. Bug Fixes and Enhancements Since OFED 1.4
===============================================================================
- Performance tuning is enabled by default for IPOIB CM.
- Clear IPOIB_FLAG_ADMIN_UP if ipoib_open fails
- disable napi while cq is being drained (bugzilla #1587)
- rdma_cm: Use rate from ipoib broadcast when joining ipoib multicast
When joining IPoIB multicast group, use the same rate as in the broadcast
group. Otherwise, if rdma_cm creates this group before IPoIB does, it might get
a different rate. This will cause IPoIB to fail joining to the same group later
on, because IPoIB has a strict rate selection.
- fix unprotected use of priv->broadcast in ipoib_mcast_join_task.
- Do not join broadcast group if interface is brought down
===============================================================================
9. Performance tuning
===============================================================================
When IPoIB is configured to run in connected mode, tcp parameter tuning is
performed at driver startup -- to improve the throughput of medium and large
messages.
The driver startup scripts set the following TCP parameters as follows:
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=0
net.core.netdev_max_backlog=250000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=16777216
net.ipv4.tcp_mem="16777216 16777216 16777216"
net.ipv4.tcp_rmem="4096 87380 16777216"
net.ipv4.tcp_wmem="4096 65536 16777216"
This tuning is effective only for connected mode. If you run in datagram mode,
it actually reduces performance.
If you change the IPoIB run mode to "datagram" while the driver is running,
the tuned parameters do not get reset to their default values. We therefore
recommend that you change the IPoIB mode only while the driver is down
(by setting line "SET_IPOIB_CM=yes" to "SET_IPOIB_CM=no" in file
/etc/infiniband/openib.conf, and then restarting the driver).
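The parameter list above can be captured in a small dry-run helper (a sketch, not part of the driver scripts; print_ipoib_cm_tuning is a hypothetical name). It only prints the sysctl commands, so the output can be reviewed and then piped to sh as root to apply.

```shell
#!/bin/sh
# print_ipoib_cm_tuning: emit the sysctl commands corresponding to the
# connected-mode TCP tuning listed above. To apply (as root):
#     print_ipoib_cm_tuning | sh
print_ipoib_cm_tuning() {
    cat <<'EOF'
sysctl -w net.ipv4.tcp_timestamps=0
sysctl -w net.ipv4.tcp_sack=0
sysctl -w net.core.netdev_max_backlog=250000
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=16777216
sysctl -w net.core.wmem_default=16777216
sysctl -w net.core.optmem_max=16777216
sysctl -w 'net.ipv4.tcp_mem=16777216 16777216 16777216'
sysctl -w 'net.ipv4.tcp_rmem=4096 87380 16777216'
sysctl -w 'net.ipv4.tcp_wmem=4096 65536 16777216'
EOF
}
```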
trunk/PERF_TEST_README.txt
Open Fabrics Enterprise Distribution (OFED)
Performance Tests README for OFED 1.4.1
May 2009
===============================================================================
Table of Contents
===============================================================================
1. Overview
2. Notes on Testing Method
3. Test Descriptions
4. Running Tests
===============================================================================
1. Overview
===============================================================================
This is a collection of tests written over uverbs intended for use as a
performance micro-benchmark. As an example, the tests can be used for
hardware or software tuning and/or functional testing.
Please post results and observations to the openib-general mailing list.
See "Contact Us" at http://openib.org/mailman/listinfo/openib-general and
http://www.openib.org.
===============================================================================
2. Notes on Testing Method
===============================================================================
- The benchmark uses the CPU cycle counter to get time stamps without a context
switch. Some CPU architectures (e.g., Intel's 80486 or older PPC) do NOT have
such capability.
- The benchmark measures round-trip time but reports half of that as one-way
latency. This means that it may not be sufficiently accurate for asymmetrical
configurations.
- Min/Median/Max results are reported.
The Median (vs average) is less sensitive to extreme scores.
Typically, the Max value is the first value measured.
- Larger samples only help marginally. The default (1000) is very satisfactory.
Note that an array of cycles_t (typically an unsigned long) is allocated
once to collect samples and again to store the difference between them.
Really big sample sizes (e.g., 1 million) might expose other problems
with the program.
- The "-H" option will dump the histogram for additional statistical analysis.
See xgraph, ygraph, r-base (http://www.r-project.org/), pspp, or other
statistical math programs.
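Assuming the histogram dump is one numeric sample per line (an assumption about the -H output format; report_stats is a hypothetical helper), the Min/Median/Max summary described above can be recomputed with standard tools:

```shell
#!/bin/sh
# report_stats <sample-file>: print min, median, and max of a file with
# one numeric sample per line. The median is the middle element of the
# sorted samples (lower middle for even counts), which, as noted above,
# is less sensitive to extreme scores than the mean.
report_stats() {
    sort -n "$1" | awk '
        { v[NR] = $1 }
        END { print "min=" v[1], "median=" v[int((NR + 1) / 2)], "max=" v[NR] }'
}
```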
Architectures tested: i686, x86_64, ia64, ppc64
===============================================================================
3. Test Descriptions
===============================================================================
The following tests are mainly useful for hardware/software benchmarking.
write_lat.c latency test with RDMA write transactions
write_bw.c bandwidth test with RDMA write transactions
send_lat.c latency test with send transactions
send_bw.c bandwidth test with send transactions
read_lat.c latency test with RDMA read transactions
read_bw.c bandwidth test with RDMA read transactions
Legacy tests: (To be removed in the next release)
rdma_lat.c latency test with RDMA write transactions
rdma_bw.c streaming bandwidth test with RDMA write transactions
The executable name of each test starts with the general prefix "ib_";
for example, ib_write_lat.
===============================================================================
4. Running Tests
===============================================================================
Prerequisites:
kernel 2.6
ib_uverbs (kernel module) matches libibverbs
("match" means binary compatible, but ideally of the same SVN rev)
Server: ./<test name>
Client: ./<test name> <server address>
o <server address> is an IPv4 or IPv6 address. You can use the IPoIB
address if IPoIB is configured.
o "<test name> --help" lists the available options.
*** IMPORTANT NOTE: The SAME OPTIONS must be passed to both server and client.
Options in the tests (some options apply only to some of the tests):
 -p, --port=<port>           listen on/connect to port <port> (default: 18515)
 -c, --connection=<RC/UC>    connection type RC/UC (default: RC)
 -m, --mtu=<mtu>             mtu size (default: 1024)
 -d, --ib-dev=<dev>          use IB device <dev> (default: first device found)
 -i, --ib-port=<port>        use port <port> of IB device (default: 1)
 -s, --size=<size>           size of message to exchange (default: 1)
 -a, --all                   run sizes from 2 till 2^23
 -t, --tx-depth=<dep>        size of tx queue (default: 50)
 -n, --iters=<iters>         number of exchanges (at least 100, default: 1000)
 -u, --qp-timeout=<timeout>  QP timeout; timeout value is 4 usec * 2^timeout
                             (default: 14)
 -g, --post=<num>            number of posts for each qp in the chain
                             (default: tx_depth); for the write_bw test
 -g, --mcg                   send messages to multicast group (only available
                             with a UD connection); for the send tests
 -o, --outs=<num>            num of outstanding read/atom (default: 4)
 -q, --qp=<num>              num of qp's (default: 1)
 -r, --rx-depth=<dep>        make rx queue bigger than tx (default: 600)
 -I, --inline_size=<size>    max size of message to be sent in inline mode
                             (default: 400)
 -C, --report-cycles         report times in cpu cycle units
                             (default: microseconds)
 -H, --report-histogram      print out all results
                             (default: print summary only)
 -U, --report-unsorted       (implies -H) print out unsorted results
                             (default: sorted)
 -V, --version               display version number
 -F, --CPU-freq              do not fail even if cpufreq_ondemand module is
                             loaded
 -N, --no-peak-bw            cancel peak-bw calculation (default: with peak-bw)
 -e, --events                sleep on CQ events (default: poll)
 -l, --signal                signal completion on each msg
 -b, --bidirectional         measure bidirectional bandwidth
                             (default: unidirectional)
*** IMPORTANT NOTE: You need to be running a Subnet Manager on the switch or
on one of the nodes in your fabric.
Example:
Run "ib_write_lat -a" on the server side.
Then run "ib_write_lat -a <server address>" on the client side.
ib_write_lat will exit on both server and client after printing results.
trunk/open_mpi_release_notes.txt
Open Fabrics Enterprise Distribution (OFED)
Open MPI in OFED 1.4.1 Copyrights, License, and Release Notes
May 2009
Open MPI Copyrights
-------------------
Most files in this release are marked with the copyrights of the
organizations who have edited them. The copyrights below generally
reflect members of the Open MPI core team who have contributed code to
this release. The copyrights for code used under license from other
parties are included in the corresponding files.
Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
University Research and Technology
Corporation. All rights reserved.
Copyright (c) 2004-2009 The University of Tennessee and The University
of Tennessee Research Foundation. All rights
reserved.
Copyright (c) 2004-2008 High Performance Computing Center Stuttgart,
University of Stuttgart. All rights reserved.
Copyright (c) 2004-2007 The Regents of the University of California.
All rights reserved.
Copyright (c) 2006-2009 Los Alamos National Security, LLC. All rights
reserved.
Copyright (c) 2006-2009 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
Copyright (c) 2006-2008 Sandia National Laboratories. All rights reserved.
Copyright (c) 2006-2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Copyright (c) 2006-2009 The University of Houston. All rights reserved.
Copyright (c) 2006-2008 Myricom, Inc. All rights reserved.
Copyright (c) 2007-2008 UT-Battelle, LLC. All rights reserved.
Copyright (c) 2007-2008 IBM Corporation. All rights reserved.
Copyright (c) 1998-2005 Forschungszentrum Juelich, Juelich Supercomputing
Centre, Federal Republic of Germany
Copyright (c) 2005-2008 ZIH, TU Dresden, Federal Republic of Germany
Copyright (c) 2007 Evergrid, Inc. All rights reserved.
Copyright (c) 2008 Institut National de Recherche en
Informatique. All rights reserved.
Copyright (c) 2007 Lawrence Livermore National Security, LLC.
All rights reserved.
Copyright (c) 2007-2009 Mellanox Technologies. All rights reserved.
Copyright (c) 2006 QLogic Corporation. All rights reserved.
Additional copyrights may follow
Open MPI License
----------------
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer listed
in this license in the documentation and/or other materials
provided with the distribution.
- Neither the name of the copyright holders nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
The copyright holders provide no reassurances that the source code
provided does not infringe any patent, copyright, or any other
intellectual property rights of third parties. The copyright holders
disclaim any liability to any recipient for claims brought against
recipient by any third party for infringement of that parties
intellectual property rights.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
===========================================================================
When submitting questions and problems, be sure to include as much
extra information as possible. This web page details all the
information that we request in order to provide assistance:
http://www.open-mpi.org/community/help/
The best way to report bugs, send comments, or ask questions is to
sign up on the user's and/or developer's mailing list (for user-level
and developer-level questions; when in doubt, send to the user's
list):
users@open-mpi.org
devel@open-mpi.org
Because of spam, only subscribers are allowed to post to these lists
(ensure that you subscribe with and post from exactly the same e-mail
address -- joe@example.com is considered different than
joe@mycomputer.example.com!). Visit these pages to subscribe to the
lists:
http://www.open-mpi.org/mailman/listinfo.cgi/users
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Thanks for your time.
===========================================================================
Much, much more information is also available in the Open MPI FAQ:
http://www.open-mpi.org/faq/
===========================================================================
OFED-Specific Release Notes
---------------------------
** SLES 10 with Pathscale compiler support:
Using the Pathscale compiler to build Open MPI on SLES10 may result in
a non-functional Open MPI installation (every Open MPI command fails).
If this problem occurs, try upgrading your Pathscale installation to
the latest maintenance release, or use a different compiler to compile
Open MPI.
** Intel compiler support:
Some versions of the Intel 9.1 C++ compiler suite series produce
incorrect code when used with the Open MPI C++ bindings. Symptoms of
this problem include crashing applications (e.g., segmentation
violations) and Open MPI producing errors about incorrect parameters.
Be sure to upgrade to the latest maintenance release of the Intel 9.1
compiler to avoid these problems.
** Installing newer versions of Open MPI after OFED is installed:
Open MPI can be built from source after OFED is fully installed. The
source code for Open MPI can be extracted from the SRPM shipped with
OFED or downloaded from the main Open MPI web site:
http://www.open-mpi.org/.
To compile with Open MPI from source with OFED support, fully install
the rest of OFED. If you used the default prefix for the OFED
installation (/usr), Open MPI should build with OpenFabrics support by
default. If you used a different OFED prefix, you must tell Open MPI
what it is with the "--with-openib=" switch to configure.
You can verify that Open MPI installed with OpenFabrics support by
running (the exact version numbers displayed may be different; the
important part is that the "openib" BTL is displayed):
shell$ ompi_info | grep openib
MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.2)
See the rest of the documentation below for other configure command
line options and installation instructions.
** Changelog summary
Showing versions 1.2.7 - 1.3.2; see the "NEWS" file in an Open MPI
distribution for the full list.
1.3.2
-----
- Fixed a potential infinite loop in the openib BTL that could occur
in senders in some frequent-communication scenarios. Thanks to Don
Wood for reporting the problem.
- Add a new checksum PML variation on ob1 (main MPI point-to-point
communication engine) to detect memory corruption in node-to-node
messages
- Add a new configuration option to add padding to the openib
header so the data is aligned
- Add a new configuration option to use an alternative checksum algo
when using the checksum PML
- Fixed a problem reported by multiple users on the mailing list that
the LSF support would fail to find the appropriate libraries at
run-time.
- Allow empty shell designations from getpwuid(). Thanks to Sergey
Koposov for the bug report.
- Ensure that mpirun exits with non-zero status when applications die
due to user signal. Thanks to Geoffroy Pignot for suggesting the
fix.
- Ensure that MPI_VERSION / MPI_SUBVERSION match what is returned by
MPI_GET_VERSION. Thanks to Rob Egan for reporting the error.
- Updated MPI_*KEYVAL_CREATE functions to properly handle Fortran
extra state.
- A variety of ob1 (main MPI point-to-point communication engine) bug
fixes that could have caused hangs or seg faults.
- Do not install Open MPI's signal handlers in MPI_INIT if there are
already signal handlers installed. Thanks to Kees Verstoep for
bringing the issue to our attention.
- Fix GM support to not seg fault in MPI_INIT.
- Various VampirTrace fixes.
- Various PLPA fixes.
- No longer create BTLs for invalid (TCP) devices.
- Various man page style and lint cleanups.
- Fix critical OpenFabrics-related bug noted here:
http://www.open-mpi.org/community/lists/announce/2009/03/0029.php.
Open MPI now uses a much more robust memory intercept scheme that is
quite similar to what is used by MX. The use of "-lopenmpi-malloc"
is no longer necessary, is deprecated, and is expected to disappear
in a future release. -lopenmpi-malloc will continue to work for the
duration of the Open MPI v1.3 and v1.4 series.
- Fix some OpenFabrics shutdown errors, both regarding iWARP and SRQ.
- Allow the udapl BTL to work on Solaris platforms that support
relaxed PCI ordering.
- Fix problem where the mpirun would sometimes use rsh/ssh to launch on
the localhost (instead of simply forking).
- Minor SLURM stdin fixes.
- Fix to run properly under SGE jobs.
- Scalability and latency improvements for shared memory jobs: convert
to using one message queue instead of N queues.
- Automatically size the shared-memory area (mmap file) to match
better what is needed; specifically, so that large-np jobs will start.
- Use fixed-length MPI predefined handles in order to provide ABI
compatibility between Open MPI releases.
- Fix building of the posix paffinity component to properly get the
number of processors in loosely tested environments (e.g.,
FreeBSD). Thanks to Steve Kargl for reporting the issue.
- Fix --with-libnuma handling in configure. Thanks to Gus Correa for
reporting the problem.
1.3.1
-----
- Added "sync" coll component to allow users to synchronize every N
collective operations on a given communicator.
- Increased the default values of the IB and RNR timeout MCA parameters.
- Fix a compiler error noted by Mostyn Lewis with the PGI 8.0 compiler.
- Fix an error that prevented stdin from being forwarded if the
rsh launcher was in use. Thanks to Branden Moore for pointing out
the problem.
- Correct a case where the added datatype is considered contiguous but
  has gaps at the beginning.
- Fix an error that limited the number of comm_spawns that could
  simultaneously be running in some environments.
- Correct a corner case in OB1's GET protocol for long messages; the
error could sometimes cause MPI jobs using the openib BTL to hang.
- Fix a bunch of bugs in the IO forwarding (IOF) subsystem and add some
new options to output to files and redirect output to xterm. Thanks to
Jody Weissmann for helping test out many of the new fixes and
features.
- Fix SLURM race condition.
- Fix MPI_File_c2f(MPI_FILE_NULL) to return 0, not -1. Thanks to
Lisandro Dalcin for the bug report.
- Fix the DSO build of tm PLM.
- Various fixes for size disparity between C int's and Fortran
INTEGER's. Thanks to Christoph van Wullen for the bug report.
- Ensure that mpirun exits with a non-zero exit status when daemons or
processes abort or fail to launch.
- Various fixes to work around Intel (NetEffect) RNIC behavior.
- Various fixes for mpirun's --preload-files and --preload-binary
options.
- Fix the string name in MPI::ERRORS_THROW_EXCEPTIONS.
- Add ability to forward SIGTSTP and SIGCONT to MPI processes if you
set the MCA parameter orte_forward_job_control to 1.
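For example, the signal forwarding described in the entry above could be
enabled on the mpirun command line like this (a sketch; "./my_app" is a
hypothetical application binary):

```shell
# Forward SIGTSTP and SIGCONT from mpirun to the MPI processes by
# setting the orte_forward_job_control MCA parameter to 1.
# "./my_app" is a placeholder for a real MPI application.
mpirun --mca orte_forward_job_control 1 -np 4 ./my_app
```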
- Allow the sm BTL to allocate larger amounts of shared memory if
desired (helpful for very large multi-core boxen).
- Fix a few places where we used PATH_MAX instead of OMPI_PATH_MAX,
leading to compile problems on some platforms. Thanks to Andrea Iob
for the bug report.
- Fix mca_btl_openib_warn_no_device_params_found MCA parameter; it
was accidentally being ignored.
- Fix some run-time issues with the sctp BTL.
- Ensure that RTLD_NEXT exists before trying to use it (e.g., it
doesn't exist on Cygwin). Thanks to Gustavo Seabra for reporting
the issue.
- Various fixes to VampirTrace, including fixing compile errors on
some platforms.
- Fixed missing MPI_Comm_accept.3 man page; fixed minor issue in
orterun.1 man page. Thanks to Dirk Eddelbuettel for identifying the
problem and submitting a patch.
- Implement the XML formatted output of stdout/stderr/stddiag.
- Fixed mpirun's -wdir switch to ensure that working directories for
multiple app contexts are properly handled. Thanks to Geoffroy
Pignot for reporting the problem.
- Improvements to the MPI C++ integer constants:
- Allow MPI::SEEK_* constants to be used as constants
- Allow other MPI C++ constants to be used as array sizes
- Fix minor problem with orte-restart's command line options. See
ticket #1761 for details. Thanks to Gregor Dschung for reporting
the problem.
1.3
---
- Extended the OS X 10.5.x (Leopard) workaround for a problem when
assembly code is compiled with -g[0-9]. Thanks to Barry Smith for
reporting the problem. See ticket #1701.
- Disabled MPI_REAL16 and MPI_COMPLEX32 support on platforms where the
bit representation of REAL*16 is different than that of the C type
of the same size (usually long double). Thanks to Julien Devriendt
for reporting the issue. See ticket #1603.
- Increased the size of MPI_MAX_PORT_NAME to 1024 from 36. See ticket #1533.
- Added "notify debugger on abort" feature. See tickets #1509 and #1510.
Thanks to Seppo Sahrakropi for the bug report.
- Upgraded Open MPI tarballs to use Autoconf 2.63, Automake 1.10.1,
Libtool 2.2.6a.
- Added missing MPI::Comm::Call_errhandler() function. Thanks to Dave
Goodell for bringing this to our attention.
- Increased MPI_SUBVERSION value in mpi.h to 1 (i.e., MPI 2.1).
- Changed behavior of MPI_GRAPH_CREATE, MPI_TOPO_CREATE, and several
other topology functions per MPI-2.1.
- Fix the type of the C++ constant MPI::IN_PLACE.
- Various enhancements to the openib BTL:
- Added btl_openib_if_[in|ex]clude MCA parameters for
including/excluding comma-delimited lists of HCAs and ports.
  - Added RDMA CM support, including btl_openib_cpc_[in|ex]clude MCA
parameters
- Added NUMA support to only use "near" network adapters
- Added "Bucket SRQ" (BSRQ) support to better utilize registered
memory, including btl_openib_receive_queues MCA parameter
- Added ConnectX XRC support (and integrated with BSRQ)
- Added btl_openib_ib_max_inline_data MCA parameter
- Added iWARP support
  - Revamped flow control mechanisms to be more efficient
- "mpi_leave_pinned=1" is now the default when possible,
automatically improving performance for large messages when
application buffers are re-used
- Eliminated duplicated error messages when multiple MPI processes fail
with the same error.
- Added NUMA support to the shared memory BTL.
- Add Valgrind-based memory checking for MPI-semantic checks.
- Add support for some optional Fortran datatypes (MPI_LOGICAL1,
MPI_LOGICAL2, MPI_LOGICAL4 and MPI_LOGICAL8).
- Remove the use of the STL from the C++ bindings.
- Added support for Platform/LSF job launchers. Must be Platform LSF
v7.0.2 or later.
- Updated ROMIO with the version from MPICH2 1.0.7.
- Added RDMA capable one-sided component (called rdma), which
can be used with BTL components that expose a full one-sided
interface.
- Added the optional datatype MPI_REAL2. As this is added to the "end of"
predefined datatypes in the fortran header files, there will not be
any compatibility issues.
- Added Portable Linux Processor Affinity (PLPA) for Linux.
- Added finer-grained symbol export control via the visibility feature
  offered by some compilers.
- Added checkpoint/restart process fault tolerance support. Initially
support a LAM/MPI-like protocol.
- Removed "mvapi" BTL; all InfiniBand support now uses the OpenFabrics
driver stacks ("openib" BTL).
- Added more stringent MPI API parameter checking to help user-level
debugging.
- The ptmalloc2 memory manager component is now by default built as
a standalone library named libopenmpi-malloc. Users wanting to
use leave_pinned with ptmalloc2 will now need to link the library
into their application explicitly. All other users will use the
  libc-provided allocator instead of Open MPI's ptmalloc2. This change
  may be overridden with the configure option --enable-ptmalloc2-internal.
- The leave_pinned options will now default to using mallopt on
  Linux in the cases where ptmalloc2 was not linked in. mallopt
  will also only be available if munmap can be intercepted (the
  default whenever Open MPI is not compiled with
  --without-memory-manager).
- Open MPI will now complain and refuse to use leave_pinned if
no memory intercept / mallopt option is available.
- Add option of using Perl-based wrapper compilers instead of the
C-based wrapper compilers. The Perl-based version does not
have the features of the C-based version, but does work better
in cross-compile environments.
1.2.9
-----
- Fix a segfault when using one-sided communications on some forms of derived
datatypes. Thanks to Dorian Krause for reporting the bug. See #1715.
- Fix an alignment problem affecting one-sided communications on
some architectures (e.g., SPARC64). See #1738.
- Fix compilation on Solaris when thread support is enabled in Open MPI
(e.g., when using --with-threads). See #1736.
- Correctly take into account the MTU that an OpenFabrics device port
is using. See #1722 and
https://bugs.openfabrics.org/show_bug.cgi?id=1369.
- Fix two datatype engine bugs. See #1677.
Thanks to Peter Kjellstrom for the bugreport.
- Fix the bml r2 help filename so the help message can be found. See #1623.
- Fix a compilation problem on RHEL4U3 with the PGI 32 bit compiler
caused by . See ticket #1613.
- Fix the --enable-cxx-exceptions configure option. See ticket #1607.
- Properly handle when the MX BTL cannot open an endpoint. See ticket #1621.
- Fix a double free of events on the tcp_events list. See ticket #1631.
- Fix a buffer overrun in opal_free_list_grow (called by MPI_Init).
Thanks to Patrick Farrell for the bugreport and Stephan Kramer for
the bugfix. See ticket #1583.
- Fix a problem setting OPAL_PREFIX for remote sh-based shells.
See ticket #1580.
1.2.8
-----
- Tweaked one memory barrier in the openib component to be more conservative.
May fix a problem observed on PPC machines. See ticket #1532.
- Fix OpenFabrics IB partition support. See ticket #1557.
- Restore v1.1 feature that sourced .profile on remote nodes if the default
shell will not do so (e.g. /bin/sh and /bin/ksh). See ticket #1560.
- Fix segfault in MPI_Init_thread() if ompi_mpi_init() fails. See ticket #1562.
- Adjust SLURM support to first look for $SLURM_JOB_CPUS_PER_NODE instead of
the deprecated $SLURM_TASKS_PER_NODE environment variable. This change
may be *required* when using SLURM v1.2 and above. See ticket #1536.
- Fix the MPIR_Proctable to be in process rank order. See ticket #1529.
- Fix a regression introduced in 1.2.6 for the IBM eHCA. See ticket #1526.
1.2.7
-----
- Add some Sun HCA vendor IDs. See ticket #1461.
- Fixed a memory leak in MPI_Alltoallw when called from Fortran.
Thanks to Dave Grote for the bugreport. See ticket #1457.
- Only link in libutil when it is needed/desired. Thanks to
Brian Barret for diagnosing and fixing the problem. See ticket #1455.
- Update some QLogic HCA vendor IDs. See ticket #1453.
- Fix F90 binding for MPI_CART_GET. Thanks to Scott Beardsley for
bringing it to our attention. See ticket #1429.
- Remove a spurious warning message generated in/by ROMIO. See ticket #1421.
- Fix a bug where command-line MCA parameters were not overriding
MCA parameters set from environment variables. See ticket #1380.
- Fix a bug in the AMD64 atomics assembly. Thanks to Gabriele Fatigati
for the bug report and bugfix. See ticket #1351.
- Fix a gather and scatter bug on intercommunicators when the datatype
being moved is 0 bytes. See ticket #1331.
- Some more man page fixes from the Debian maintainers.
See tickets #1324 and #1329.
- Have openib BTL (OpenFabrics support) check for the presence of
/sys/class/infiniband before allowing itself to be used. This check
prevents spurious "OMPI did not find RDMA hardware!" notices on
systems that have the software drivers installed, but no
corresponding hardware. See tickets #1321 and #1305.
- Added vendor IDs for some ConnectX openib HCAs. See ticket #1311.
- Fix some RPM specfile inconsistencies. See ticket #1308.
Thanks to Jim Kusznir for noticing the problem.
- Removed an unused function prototype that caused warnings on
some systems (e.g., OS X). See ticket #1274.
- Fix a deadlock in inter-communicator scatter/gather operations.
Thanks to Martin Audet for the bug report. See ticket #1268.
===========================================================================
Much, much more information is also available in the Open MPI FAQ:
http://www.open-mpi.org/faq/
===========================================================================
General Release Notes
---------------------
Detailed Open MPI v1.3 Feature List:
o Open MPI RunTime Environment (ORTE) improvements
- General robustness improvements
- Scalable job launch (we've seen ~16K processes in less than a
minute in a highly-optimized configuration)
- New process mappers
- Support for Platform/LSF environments (v7.0.2 and later)
- More flexible processing of host lists
- new mpirun cmd line options and associated functionality
o Fault-Tolerance Features
- Asynchronous, transparent checkpoint/restart support
- Fully coordinated checkpoint/restart coordination component
- Support for the following checkpoint/restart services:
    - blcr: Berkeley Lab's Checkpoint/Restart
- self: Application level callbacks
- Support for the following interconnects:
- tcp
- mx
- openib
- sm
- self
- Improved Message Logging
o MPI_THREAD_MULTIPLE support for point-to-point messaging in the
following BTLs (note that only MPI point-to-point messaging API
functions support MPI_THREAD_MULTIPLE; other API functions likely
do not):
- tcp
- sm
- mx
- elan
- self
o Point-to-point Messaging Layer (PML) improvements
- Memory footprint reduction
- Improved latency
- Improved algorithm for multiple communication device
("multi-rail") support
o Numerous Open Fabrics improvements/enhancements
- Added iWARP support (including RDMA CM)
- Memory footprint and performance improvements
- "Bucket" SRQ support for better registered memory utilization
- XRC/ConnectX support
- Message coalescing
- Improved error report mechanism with Asynchronous events
- Automatic Path Migration (APM)
- Improved processor/port binding
- Infrastructure for additional wireup strategies
- mpi_leave_pinned is now enabled by default
o uDAPL BTL enhancements
- Multi-rail support
- Subnet checking
- Interface include/exclude capabilities
o Processor affinity
- Linux processor affinity improvements
- Core/socket <--> process mappings
o Collectives
- Performance improvements
- Support for hierarchical collectives (must be activated
manually; see below)
o Miscellaneous
- MPI 2.1 compliant
- Sparse process groups and communicators
- Support for Cray Compute Node Linux (CNL)
- One-sided RDMA component (BTL-level based rather than PML-level
based)
- Aggregate MCA parameter sets
- MPI handle debugging
- Many small improvements to the MPI C++ bindings
- Valgrind support
- VampirTrace support
- Updated ROMIO to the version from MPICH2 1.0.7
- Removed the mVAPI IB stacks
- Display most error messages only once (vs. once for each
process)
- Many other small improvements and bug fixes, too numerous to
list here
Known issues
------------
o There is a segfault that sometimes occurs on one of our x86_64 test
  clusters when using MPI one-sided communications over Myrinet MX.
Since no one else has reported this problem we are not holding
up the 1.3 release. See ticket #1757 for the details, and any
possible workarounds.
o XGrid support is currently broken.
https://svn.open-mpi.org/trac/ompi/ticket/1777
o MPI_REDUCE_SCATTER does not work with counts of 0.
https://svn.open-mpi.org/trac/ompi/ticket/1559
o Please also see the Open MPI bug tracker for bugs beyond this release.
https://svn.open-mpi.org/trac/ompi/report
===========================================================================
The following abbreviated list of release notes applies to this code
base as of this writing (14 April 2009):
General notes
-------------
- Open MPI includes support for a wide variety of supplemental
  hardware and software packages. When configuring Open MPI, you may
need to supply additional flags to the "configure" script in order
to tell Open MPI where the header files, libraries, and any other
required files are located. As such, running "configure" by itself
may not include support for all the devices (etc.) that you expect,
especially if their support headers / libraries are installed in
non-standard locations. Network interconnects are an easy example
to discuss -- Myrinet and OpenFabrics networks, for example, both
have supplemental headers and libraries that must be found before
Open MPI can build support for them. You must specify where these
files are with the appropriate options to configure. See the
listing of configure command-line switches, below, for more details.
- The majority of Open MPI's documentation is here in this file, the
included man pages, and on the web site FAQ
(http://www.open-mpi.org/). This will eventually be supplemented
with cohesive installation and user documentation files.
- Note that Open MPI documentation uses the word "component"
  frequently; the word "plugin" is probably more familiar to most
  users. As such, end users can substitute the word "plugin" wherever
  they see "component" in our documentation. For what it's worth, we
  use the word "component" for historical reasons, mainly because it
  is part of our acronyms and internal API function calls.
- The run-time systems that are currently supported are:
- rsh / ssh
- LoadLeveler
- PBS Pro, Open PBS, Torque
- Platform LSF (v7.0.2 and later)
- SLURM
- XGrid (known to be broken in 1.3 through 1.3.2)
- Cray XT-3 and XT-4
- Sun Grid Engine (SGE) 6.1, 6.2 and open source Grid Engine
- Microsoft Windows CCP (Microsoft Windows server 2003 and 2008)
- Systems that have been tested are:
- Linux (various flavors/distros), 32 bit, with gcc, and Sun Studio 12
- Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft,
Intel, Portland, Pathscale, and Sun Studio 12 compilers (*)
- OS X (10.4), 32 and 64 bit (i386, PPC, PPC64, x86_64), with gcc
and Absoft compilers (*)
- Solaris 10 update 2, 3 and 4, 32 and 64 bit (SPARC, i386, x86_64),
with Sun Studio 10, 11 and 12
(*) Be sure to read the Compiler Notes, below.
- Other systems have been lightly (but not fully) tested:
- Other 64 bit platforms (e.g., Linux on PPC64)
- Microsoft Windows CCP (Microsoft Windows server 2003 and 2008);
more testing and support is expected later in the Open MPI v1.3.x
series.
Compiler Notes
--------------
- Mixing compilers from different vendors when building Open MPI
(e.g., using the C/C++ compiler from one vendor and the F77/F90
compiler from a different vendor) has been successfully employed by
some Open MPI users (discussed on the Open MPI user's mailing list),
but such configurations are not tested and not documented. For
example, such configurations may require additional compiler /
linker flags to make Open MPI build properly.
- Open MPI does not support the Sparc v8 CPU target, which is the
default on Sun Solaris. The v8plus (32 bit) or v9 (64 bit)
targets must be used to build Open MPI on Solaris. This can be
  done by including a flag in CFLAGS, CXXFLAGS, FFLAGS, and FCFLAGS:
  -xarch=v8plus for the Sun compilers, or -mv8plus for GCC.
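For example, with the Sun compilers the target flags above might be
passed to configure as follows (a sketch in the style of the other
configure examples in this file; substitute -mv8plus when using GCC):

```shell
shell$ ./configure CFLAGS=-xarch=v8plus CXXFLAGS=-xarch=v8plus \
    FFLAGS=-xarch=v8plus FCFLAGS=-xarch=v8plus ...
```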
- At least some versions of the Intel 8.1 compiler seg fault while
compiling certain Open MPI source code files. As such, it is not
supported.
- The Intel 9.0 v20051201 compiler on IA64 platforms seems to have a
problem with optimizing the ptmalloc2 memory manager component (the
generated code will segv). As such, the ptmalloc2 component will
automatically disable itself if it detects that it is on this
platform/compiler combination. The only effect that this should
have is that the MCA parameter mpi_leave_pinned will be inoperative.
- Early versions of the Portland Group 6.0 compiler have problems
creating the C++ MPI bindings as a shared library (e.g., v6.0-1).
Tests with later versions show that this has been fixed (e.g.,
v6.0-5).
- The Portland Group compilers prior to version 7.0 require the
"-Msignextend" compiler flag to extend the sign bit when converting
  from a shorter to a longer integer. This is different from other
compilers (such as GNU). When compiling Open MPI with the Portland
compiler suite, the following flags should be passed to Open MPI's
configure script:
shell$ ./configure CFLAGS=-Msignextend CXXFLAGS=-Msignextend \
--with-wrapper-cflags=-Msignextend \
--with-wrapper-cxxflags=-Msignextend ...
This will both compile Open MPI with the proper compile flags and
also automatically add "-Msignextend" when the C and C++ MPI wrapper
compilers are used to compile user MPI applications.
- Using the MPI C++ bindings with the Pathscale compiler is known
to fail, possibly due to Pathscale compiler issues.
- Using the Absoft compiler to build the MPI Fortran bindings on Suse
9.3 is known to fail due to a Libtool compatibility issue.
- Open MPI will build bindings suitable for all common forms of
Fortran 77 compiler symbol mangling on platforms that support it
(e.g., Linux). On platforms that do not support weak symbols (e.g.,
OS X), Open MPI will build Fortran 77 bindings just for the compiler
that Open MPI was configured with.
Hence, on platforms that support it, if you configure Open MPI with
a Fortran 77 compiler that uses one symbol mangling scheme, you can
successfully compile and link MPI Fortran 77 applications with a
Fortran 77 compiler that uses a different symbol mangling scheme.
NOTE: For platforms that support the multi-Fortran-compiler bindings
(i.e., weak symbols are supported), due to limitations in the MPI
standard and in Fortran compilers, it is not possible to hide these
differences in all cases. Specifically, the following two cases may
not be portable between different Fortran compilers:
1. The C constants MPI_F_STATUS_IGNORE and MPI_F_STATUSES_IGNORE
will only compare properly to Fortran applications that were
     created with Fortran compilers that use the same
name-mangling scheme as the Fortran compiler that Open MPI was
configured with.
  2. Fortran compilers may have different values for the logical
     .TRUE. constant. As such, any MPI function that uses the Fortran
     LOGICAL type may only get .TRUE. values back that correspond to
     the .TRUE. value of the Fortran compiler that Open MPI was
     configured with. Note that some Fortran compilers allow forcing
     .TRUE. to be 1 and .FALSE. to be 0. For example, the Portland
     Group compilers provide the "-Munixlogical" option, and Intel
     compilers (version >= 8.0) provide the "-fpscomp logicals" option.
You can use the ompi_info command to see the Fortran compiler that
Open MPI was configured with.
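For example (a sketch; the exact field labels that ompi_info prints can
vary between Open MPI versions):

```shell
shell$ ompi_info | grep -i fortran
```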
- The Fortran 90 MPI bindings can now be built in one of three sizes
using --with-mpi-f90-size=SIZE (see description below). These sizes
reflect the number of MPI functions included in the "mpi" Fortran 90
module and therefore which functions will be subject to strict type
checking. All functions not included in the Fortran 90 module can
still be invoked from F90 applications, but will fall back to
Fortran-77 style checking (i.e., little/none).
- trivial: Only includes F90-specific functions from MPI-2. This
means overloaded versions of MPI_SIZEOF for all the MPI-supported
F90 intrinsic types.
- small (default): All the functions in "trivial" plus all MPI
functions that take no choice buffers (meaning buffers that are
specified by the user and are of type (void*) in the C bindings --
generally buffers specified for message passing). Hence,
functions like MPI_COMM_RANK are included, but functions like
MPI_SEND are not.
- medium: All the functions in "small" plus all MPI functions that
take one choice buffer (e.g., MPI_SEND, MPI_RECV, ...). All
one-choice-buffer functions have overloaded variants for each of
the MPI-supported Fortran intrinsic types up to the number of
dimensions specified by --with-f90-max-array-dim (default value is
4).
Increasing the size of the F90 module (in order from trivial, small,
and medium) will generally increase the length of time required to
compile user MPI applications. Specifically, "trivial"- and
"small"-sized F90 modules generally allow user MPI applications to
be compiled fairly quickly but lose type safety for all MPI
functions with choice buffers. "medium"-sized F90 modules generally
take longer to compile user applications but provide greater type
safety for MPI functions.
Note that MPI functions with two choice buffers (e.g., MPI_GATHER)
are not currently included in Open MPI's F90 interface. Calls to
these functions will automatically fall through to Open MPI's F77
interface. A "large" size that includes the two choice buffer MPI
functions is possible in future versions of Open MPI.
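Putting these options together, a configure invocation that selects the
"medium" module size with the default maximum array dimension might look
like this (a sketch; the trailing "..." stands for whatever other
configure options you need):

```shell
shell$ ./configure --with-mpi-f90-size=medium --with-f90-max-array-dim=4 ...
```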
General Run-Time Support Notes
------------------------------
- The Open MPI installation must be in your PATH on all nodes (and
potentially LD_LIBRARY_PATH, if libmpi is a shared library), unless
using the --prefix or --enable-mpirun-prefix-by-default
functionality (see below).
- LAM/MPI-like mpirun notation of "C" and "N" is not yet supported.
- The XGrid support is experimental - see the Open MPI FAQ and this
post on the Open MPI user's mailing list for more information:
http://www.open-mpi.org/community/lists/users/2006/01/0539.php
- Open MPI's run-time behavior can be customized via MCA ("MPI
Component Architecture") parameters (see below for more information
on how to get/set MCA parameter values). Some MCA parameters can be
set in a way that renders Open MPI inoperable (see notes about MCA
parameters later in this file). In particular, some parameters have
required options that must be included.
- If specified, the "btl" parameter must include the "self"
component, or Open MPI will not be able to deliver messages to the
same rank as the sender. For example: "mpirun --mca btl tcp,self
..."
  - If specified, the "btl_tcp_if_exclude" parameter must include the
loopback device ("lo" on many Linux platforms), or Open MPI will
not be able to route MPI messages using the TCP BTL. For example:
"mpirun --mca btl_tcp_if_exclude lo,eth1 ..."
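Rather than repeating such settings on every mpirun command line, MCA
parameters can also be placed in a per-user parameter file, which Open
MPI reads from $HOME/.openmpi/mca-params.conf by default. A sketch (the
"eth1" interface name is just an example):

```shell
# Create a per-user MCA parameter file; Open MPI reads this path by default.
mkdir -p "$HOME/.openmpi"
cat > "$HOME/.openmpi/mca-params.conf" <<'EOF'
# Always include the "self" BTL so ranks can deliver messages to themselves.
btl = tcp,self
# Keep the loopback device in the TCP exclude list, per the note above.
btl_tcp_if_exclude = lo,eth1
EOF
```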
- Running on nodes with different endian and/or different datatype
sizes within a single parallel job is supported in this release.
However, Open MPI does not resize data when datatypes differ in size
(for example, sending a 4 byte MPI_DOUBLE and receiving an 8 byte
MPI_DOUBLE will fail).
MPI Functionality and Features
------------------------------
- All MPI-2.1 functionality is supported.
- MPI_THREAD_MULTIPLE support is included, but is only lightly tested.
It likely does not work for thread-intensive applications. Note
that *only* the MPI point-to-point communication functions for the
BTL's listed above are considered thread safe. Other support
functions (e.g., MPI attributes) have not been certified as safe
when simultaneously used by multiple threads.
Note that Open MPI's thread support is in a fairly early stage; the
above devices are likely to *work*, but the latency is likely to be
fairly high. Specifically, efforts so far have concentrated on
*correctness*, not *performance* (yet).
- MPI_REAL16 and MPI_COMPLEX32 are only supported on platforms where a
portable C datatype can be found that matches the Fortran type
REAL*16, both in size and bit representation.
- Asynchronous message passing progress using threads can be turned on
with the --enable-progress-threads option to configure.
Asynchronous message passing progress is only supported with devices
that support MPI_THREAD_MULTIPLE, but is only very lightly tested
(and may not provide very much performance benefit).
Collectives
-----------
- The "hierarch" coll component (i.e., an implementation of MPI
collective operations) attempts to discover network layers of
latency in order to segregate individual "local" and "global"
operations as part of the overall collective operation. In this
way, network traffic can be reduced -- or possibly even minimized
(similar to MagPIe). The current "hierarch" component only
separates MPI processes into on- and off-node groups.
Hierarch has had sufficient correctness testing, but has not
received much performance tuning. As such, hierarch is not
activated by default -- it must be enabled manually by setting its
priority level to 100:
mpirun --mca coll_hierarch_priority 100 ...
We would appreciate feedback from the user community about how well
hierarch works for your applications.
Network Support
---------------
- The OpenFabrics Enterprise Distribution (OFED) software package v1.0
will not work properly with Open MPI v1.2 (and later) due to how its
  Mellanox InfiniBand plugin driver is created. The problem is fixed
  in OFED v1.1 (and later).
- Older mVAPI-based InfiniBand drivers (Mellanox VAPI) are no longer
supported. Please use an older version of Open MPI (1.2 series or
earlier) if you need mVAPI support.
- The use of fork() with the openib BTL is only partially supported,
and only on Linux kernels >= v2.6.15 with libibverbs v1.1 or later
(first released as part of OFED v1.2), per restrictions imposed by
the OFED network stack.
- There are two MPI network models available: "ob1" and "cm". "ob1"
uses BTL ("Byte Transfer Layer") components for each supported
  network. "cm" uses MTL ("Matching Transport Layer") components for
each supported network.
- "ob1" supports a variety of networks that can be used in
combination with each other (per OS constraints; e.g., there are
reports that the GM and OpenFabrics kernel drivers do not operate
well together):
- OpenFabrics: InfiniBand and iWARP
- Loopback (send-to-self)
- Myrinet: GM and MX
- Portals
- Quadrics Elan
- Shared memory
- TCP
- SCTP
- uDAPL
- "cm" supports a smaller number of networks (and they cannot be
    used together), but may provide better overall MPI
performance:
- Myrinet MX (not GM)
- InfiniPath PSM
- Portals
Open MPI will, by default, choose to use "cm" when the InfiniPath
PSM MTL can be used. Otherwise, OB1 will be used and the
corresponding BTLs will be selected. Users can force the use of ob1
or cm if desired by setting the "pml" MCA parameter at run-time:
shell$ mpirun --mca pml ob1 ...
or
shell$ mpirun --mca pml cm ...
- Myrinet MX support is shared between two internal devices, the MTL
  and the BTL. The design of the BTL interface in Open MPI assumes
that only naive one-sided communication capabilities are provided by
the low level communication layers. However, modern communication
layers such as Myrinet MX, InfiniPath PSM, or Portals, natively
implement highly-optimized two-sided communication semantics. To
leverage these capabilities, Open MPI provides the "cm" PML and
corresponding MTL components to transfer messages rather than bytes.
The MTL interface implements a shorter code path and lets the
low-level network library decide which protocol to use (depending on
issues such as message length, internal resources and other
parameters specific to the underlying interconnect). However, Open
MPI cannot currently use multiple MTL modules at once. In the case
of the MX MTL, process loopback and on-node shared memory
  communications are provided by the MX library. Moreover, the
  current MX MTL does not support message pipelining, resulting in
  lower performance for non-contiguous datatypes.
The "ob1" PML and BTL components use Open MPI's internal on-node
shared memory and process loopback devices for high performance.
The BTL interface allows multiple devices to be used simultaneously.
  For the MX BTL it is recommended that the first segment (which acts
  as the threshold between the eager and rendezvous protocols) be at
  most 4KB; there is no further restriction on the size of subsequent
  fragments.
The MX MTL is recommended in the common case for best performance on
10G hardware when most of the data transfers cover contiguous memory
layouts. The MX BTL is recommended in all other cases, such as when
using multiple interconnects at the same time (including TCP), or
  transferring non-contiguous datatypes.
===========================================================================
Building Open MPI
-----------------
Open MPI uses a traditional configure script paired with "make" to
build. Typical installs can be of the pattern:
---------------------------------------------------------------------------
shell$ ./configure [...options...]
shell$ make all install
---------------------------------------------------------------------------
There are many available configure options (see "./configure --help"
for a full list); a summary of the more commonly used ones follows:
--prefix=<directory>
  Install Open MPI into the base directory named <directory>. Hence,
  Open MPI will place its executables in <directory>/bin, its header
  files in <directory>/include, its libraries in <directory>/lib, etc.
--with-elan=<directory>
Specify the directory where the Quadrics Elan library and header
files are located. This option is generally only necessary if the
Elan headers and libraries are not in default compiler/linker
search paths.
Elan is the support library for Quadrics-based networks.
--with-elan-libdir=<directory>
  Look in directory for the Quadrics Elan libraries. By default, Open
  MPI will look in <elan directory>/lib and <elan directory>/lib64,
which covers most cases. This option is only needed for special
configurations.
--with-gm=<directory>
Specify the directory where the GM libraries and header files are
located. This option is generally only necessary if the GM headers
and libraries are not in default compiler/linker search paths.
GM is the support library for older Myrinet-based networks (GM has
been obsoleted by MX).
--with-gm-libdir=<directory>
  Look in directory for the GM libraries. By default, Open MPI will
  look in <gm directory>/lib and <gm directory>/lib64, which covers
most cases. This option is only needed for special configurations.
--with-mx=<directory>
Specify the directory where the MX libraries and header files are
located. This option is generally only necessary if the MX headers
and libraries are not in default compiler/linker search paths.
MX is the support library for Myrinet-based networks.
--with-mx-libdir=<directory>
Look in directory for the MX libraries. By default, Open MPI will
look in <mx directory>/lib and <mx directory>/lib64, which covers
most cases. This option is only needed for special configurations.
--with-openib=<directory>
Specify the directory where the OpenFabrics (previously known as
OpenIB) libraries and header files are located. This option is
generally only necessary if the OpenFabrics headers and libraries
are not in default compiler/linker search paths.
"OpenFabrics" refers to iWARP- and InfiniBand-based networks.
--with-openib-libdir=<directory>
Look in directory for the OpenFabrics libraries. By default, Open
MPI will look in <openib directory>/lib and <openib directory>/lib64,
which covers most cases. This option is only
needed for special configurations.
--with-portals=<directory>
Specify the directory where the Portals libraries and header files
are located. This option is generally only necessary if the Portals
headers and libraries are not in default compiler/linker search
paths.
Portals is the support library for Cray interconnects, but is also
available on other platforms (e.g., there is a Portals library
implemented over regular TCP).
--with-portals-config=<configuration>
Configuration to use for Portals support. The following
values are possible: "utcp", "xt3", "xt3-modex" (default: utcp).
--with-portals-libs=<libraries>
Additional libraries to link with for Portals support.
--with-psm=<directory>
Specify the directory where the QLogic InfiniPath PSM library and
header files are located. This option is generally only necessary
if the InfiniPath headers and libraries are not in default
compiler/linker search paths.
PSM is the support library for QLogic InfiniPath network adapters.
--with-psm-libdir=<directory>
Look in directory for the PSM libraries. By default, Open MPI will
look in <psm directory>/lib and <psm directory>/lib64, which covers
most cases. This option is only needed for special configurations.
--with-sctp=<directory>
Specify the directory where the SCTP libraries and header files are
located. This option is generally only necessary if the SCTP headers
and libraries are not in default compiler/linker search paths.
SCTP is a special network stack over Ethernet networks.
--with-sctp-libdir=<directory>
Look in directory for the SCTP libraries. By default, Open MPI will
look in <sctp directory>/lib and <sctp directory>/lib64, which covers
most cases. This option is only needed for special configurations.
--with-udapl=<directory>
Specify the directory where the UDAPL libraries and header files are
located. Note that UDAPL support is disabled by default on Linux;
the --with-udapl flag must be specified in order to enable it.
Specifying the directory argument is generally only necessary if the
UDAPL headers and libraries are not in default compiler/linker
search paths.
UDAPL is the support library for high performance networks in Sun
HPC ClusterTools and on Linux OpenFabrics networks (although the
"openib" options are preferred for Linux OpenFabrics networks, not
UDAPL).
--with-udapl-libdir=<directory>
Look in directory for the UDAPL libraries. By default, Open MPI
will look in <udapl directory>/lib and <udapl directory>/lib64,
which covers most cases. This option is only needed for special
configurations.
--with-lsf=<directory>
Specify the directory where the LSF libraries and header files are
located. This option is generally only necessary if the LSF headers
and libraries are not in default compiler/linker search paths.
LSF is a resource manager system, frequently used as a batch
scheduler in HPC systems.
--with-lsf-libdir=<directory>
Look in directory for the LSF libraries. By default, Open MPI will
look in <lsf directory>/lib and <lsf directory>/lib64, which covers
most cases. This option is only needed for special configurations.
--with-tm=<directory>
Specify the directory where the TM libraries and header files are
located. This option is generally only necessary if the TM headers
and libraries are not in default compiler/linker search paths.
TM is the support library for the Torque and PBS Pro resource
manager systems, both of which are frequently used as a batch
scheduler in HPC systems.
--with-sge
Specify to build support for the Sun Grid Engine (SGE) resource
manager. SGE support is disabled by default; this option must be
specified to build OMPI's SGE support.
The Sun Grid Engine (SGE) is a resource manager system, frequently
used as a batch scheduler in HPC systems.
--with-mpi-param-check(=value)
"value" can be one of: always, never, runtime. If --with-mpi-param-check
is not specified, "runtime" is the default. If --with-mpi-param-check
is specified with no value, "always" is used. Using
--without-mpi-param-check is equivalent to "never".
- always: the parameters of MPI functions are always checked for
errors
- never: the parameters of MPI functions are never checked for
errors
- runtime: whether the parameters of MPI functions are checked
depends on the value of the MCA parameter mpi_param_check
(default: yes).
--with-threads=value
Since thread support (both support for MPI_THREAD_MULTIPLE and
asynchronous progress) is only partially tested, it is disabled by
default. To enable threading, use "--with-threads=posix". This is
most useful when combined with --enable-mpi-threads and/or
--enable-progress-threads.
--enable-mpi-threads
Allows the MPI thread level MPI_THREAD_MULTIPLE. See
--with-threads; this is currently disabled by default.
--enable-progress-threads
Allows asynchronous progress in some transports. See
--with-threads; this is currently disabled by default. See the
above note about asynchronous progress.
--disable-mpi-cxx
Disable building the C++ MPI bindings. Note that this does *not*
disable the C++ checks during configure; some of Open MPI's tools
are written in C++ and therefore require a C++ compiler to be built.
--disable-mpi-cxx-seek
Disable the MPI::SEEK_* constants. Due to a problem with the MPI-2
specification, these constants can conflict with system-level SEEK_*
constants. Open MPI attempts to work around this problem, but the
workaround may fail in some esoteric situations. The
--disable-mpi-cxx-seek switch disables Open MPI's workarounds (and
therefore the MPI::SEEK_* constants will be unavailable).
--disable-mpi-f77
Disable building the Fortran 77 MPI bindings.
--disable-mpi-f90
Disable building the Fortran 90 MPI bindings. Also related to the
--with-f90-max-array-dim and --with-mpi-f90-size options.
--with-mpi-f90-size=<SIZE>
Three sizes of the MPI F90 module can be built: trivial (only a
handful of MPI-2 F90-specific functions are included in the F90
module), small (trivial + all MPI functions that take no choice
buffers), and medium (small + all MPI functions that take 1 choice
buffer). This parameter is only used if the F90 bindings are
enabled.
--with-f90-max-array-dim=<DIM>
The F90 MPI bindings are strictly typed, even including the number of
dimensions for arrays for MPI choice buffer parameters. Open MPI
generates these bindings at compile time with a maximum number of
dimensions as specified by this parameter. The default value is 4.
--enable-mpirun-prefix-by-default
This option forces the "mpirun" command to always behave as if
"--prefix $prefix" was present on the command line (where $prefix is
the value given to the --prefix option to configure). This prevents
most rsh/ssh-based users from needing to modify their shell startup
files to set the PATH and/or LD_LIBRARY_PATH for Open MPI on remote
nodes. Note, however, that such users may still desire to set PATH
-- perhaps even in their shell startup files -- so that executables
such as mpicc and mpirun can be found without needing to type long
path names. --enable-orterun-prefix-by-default is a synonym for
this option.
--disable-shared
By default, libmpi is built as a shared library, and all components
are built as dynamic shared objects (DSOs). This switch disables
this default; it is really only useful when used with
--enable-static. Specifically, this option does *not* imply
--enable-static; enabling static libraries and disabling shared
libraries are two independent options.
--enable-static
Build libmpi as a static library, and statically link in all
components. Note that this option does *not* imply
--disable-shared; enabling static libraries and disabling shared
libraries are two independent options.
--enable-sparse-groups
Enable the usage of sparse groups. This would save memory
significantly especially if you are creating large
communicators. (Disabled by default)
--enable-peruse
Enable the PERUSE MPI data analysis interface.
--enable-dlopen
Build all of Open MPI's components as standalone Dynamic Shared
Objects (DSO's) that are loaded at run-time. The opposite of this
option, --disable-dlopen, causes two things:
1. All of Open MPI's components will be built as part of Open MPI's
normal libraries (e.g., libmpi).
2. Open MPI will not attempt to open any DSO's at run-time.
Note that this option does *not* imply that OMPI's libraries will be
built as static objects (e.g., libmpi.a). It only specifies the
location of OMPI's components: standalone DSOs or folded into the
Open MPI libraries. You can control whether Open MPI's libraries
are built as static or dynamic via --enable|disable-static and
--enable|disable-shared.
--enable-heterogeneous
Enable support for running on heterogeneous clusters (e.g., machines
with different endian representations). Heterogeneous support is
disabled by default because it imposes a minor performance penalty.
--enable-ptmalloc2-internal
***NOTE: This option no longer exists.
This option was introduced in Open MPI v1.3 and was then removed in
Open MPI v1.3.2. Open MPI fundamentally changed how it uses
ptmalloc2 support in v1.3.2 such that the
--enable-ptmalloc2-internal flag was no longer necessary. It can
still harmlessly be supplied to Open MPI's configure script, but a
warning will appear about how it is an unrecognized option.
In v1.3 and v1.3.1, Open MPI built the ptmalloc2 library as a
standalone library that users could choose to link in or not (by
adding -lopenmpi-malloc to their link command). Using this option
restored pre-v1.3 behavior of *always* forcing the user to use the
ptmalloc2 memory manager (because it is part of libmpi).
Starting with v1.3.2, ptmalloc2 is always built into Open MPI, but
is only activated in certain scenarios.
--with-wrapper-cflags=<cflags>
--with-wrapper-cxxflags=<cxxflags>
--with-wrapper-fflags=<fflags>
--with-wrapper-fcflags=<fcflags>
--with-wrapper-ldflags=<ldflags>
--with-wrapper-libs=<libs>
Add the specified flags to the default flags that are used in Open
MPI's "wrapper" compilers (e.g., mpicc -- see below for more
information about Open MPI's wrapper compilers). By default, Open
MPI's wrapper compilers use the same compilers used to build Open
MPI and specify an absolute minimum set of additional flags that are
necessary to compile/link MPI applications. These configure options
give system administrators the ability to embed additional flags in
OMPI's wrapper compilers (which is a local policy decision). The
meanings of the different flags are:
<cflags>: Flags passed by the mpicc wrapper to the C compiler
<cxxflags>: Flags passed by the mpic++ wrapper to the C++ compiler
<fflags>: Flags passed by the mpif77 wrapper to the F77 compiler
<fcflags>: Flags passed by the mpif90 wrapper to the F90 compiler
<ldflags>: Flags passed by all the wrappers to the linker
<libs>: Libraries passed by all the wrappers to the linker
There are other ways to configure Open MPI's wrapper compiler
behavior; see the Open MPI FAQ for more information.
There are many other options available -- see "./configure --help".
Changing the compilers that Open MPI uses to build itself uses the
standard Autoconf mechanism of setting special environment variables
either before invoking configure or on the configure command line.
The following environment variables are recognized by configure:
CC - C compiler to use
CFLAGS - Compile flags to pass to the C compiler
CPPFLAGS - Preprocessor flags to pass to the C compiler
CXX - C++ compiler to use
CXXFLAGS - Compile flags to pass to the C++ compiler
CXXCPPFLAGS - Preprocessor flags to pass to the C++ compiler
F77 - Fortran 77 compiler to use
FFLAGS - Compile flags to pass to the Fortran 77 compiler
FC - Fortran 90 compiler to use
FCFLAGS - Compile flags to pass to the Fortran 90 compiler
LDFLAGS - Linker flags to pass to all compilers
LIBS - Libraries to pass to all compilers (it is rarely
necessary for users to need to specify additional LIBS)
For example:
shell$ ./configure CC=mycc CXX=myc++ F77=myf77 FC=myf90 ...
***Note: We generally suggest using the above command line form for
setting different compilers (vs. setting environment variables and
then invoking "./configure"). The above form will save all
variables and values in the config.log file, which makes
post-mortem analysis easier when problems occur.
It is required that the compilers specified be compile and link
compatible, meaning that object files created by one compiler must be
able to be linked with object files from the other compilers and
produce correctly functioning executables.
Open MPI supports all the "make" targets that are provided by GNU
Automake, such as:
all - build the entire Open MPI package
install - install Open MPI
uninstall - remove all traces of Open MPI from the $prefix
clean - clean out the build tree
Once Open MPI has been built and installed, it is safe to run "make
clean" and/or remove the entire build tree.
VPATH and parallel builds are fully supported.
Generally speaking, the only thing that users need to do to use Open
MPI is ensure that <prefix>/bin is in their PATH and <prefix>/lib is
in their LD_LIBRARY_PATH. Users may need to set the PATH
and LD_LIBRARY_PATH in their shell setup files (e.g., .bashrc, .cshrc)
so that non-interactive rsh/ssh-based logins will be able to find the
Open MPI executables.
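As a minimal sketch for Bourne-style shells -- assuming a hypothetical install prefix of /opt/openmpi; substitute the value you gave to --prefix -- the following lines could be added to a shell setup file such as .bashrc:

```shell
# Hypothetical prefix; substitute the value you passed to --prefix.
OMPI_PREFIX=/opt/openmpi

# Prepend Open MPI's directories so that both interactive and
# non-interactive rsh/ssh logins can find mpicc, mpirun, and libmpi.
export PATH="$OMPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

csh-style shells would use "setenv PATH ..." instead.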
===========================================================================
Checking Your Open MPI Installation
-----------------------------------
The "ompi_info" command can be used to check the status of your Open
MPI installation (located in <prefix>/bin/ompi_info). Running it with
no arguments provides a summary of information about your Open MPI
installation.
Note that the ompi_info command is extremely helpful in determining
which components are installed as well as listing all the run-time
settable parameters that are available in each component (as well as
their default values).
The following options may be helpful:
--all Show a *lot* of information about your Open MPI
installation.
--parsable Display all the information in an easily
grep/cut/awk/sed-able format.
--param <framework> <component>
A <framework> of "all" and a <component> of "all" will
show all parameters to all components. Otherwise, the
parameters of all the components in a specific framework,
or just the parameters of a specific component can be
displayed by using an appropriate <framework> and/or
<component> name.
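As an illustrative sketch (the exact parameter lists printed depend on which components were built into your installation), typical invocations look like:

```shell
# Summary of the whole installation
shell$ ompi_info

# All parameters of all components in every framework
shell$ ompi_info --param all all

# Only the parameters of the tcp component in the btl framework
shell$ ompi_info --param btl tcp
```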
Changing the values of these parameters is explained in the "The
Modular Component Architecture (MCA)" section, below.
===========================================================================
Compiling Open MPI Applications
-------------------------------
Open MPI provides "wrapper" compilers that should be used for
compiling MPI applications:
C: mpicc
C++: mpiCC (or mpic++ if your filesystem is case-insensitive)
Fortran 77: mpif77
Fortran 90: mpif90
For example:
shell$ mpicc hello_world_mpi.c -o hello_world_mpi -g
shell$
All the wrapper compilers do is add a variety of compiler and linker
flags to the command line and then invoke a back-end compiler. To be
specific: the wrapper compilers do not parse source code at all; they
are solely command-line manipulators, and have nothing to do with the
actual compilation or linking of programs. The end result is an MPI
executable that is properly linked to all the relevant libraries.
Customizing the behavior of the wrapper compilers is possible (e.g.,
changing the compiler [not recommended] or specifying additional
compiler/linker flags); see the Open MPI FAQ for more information.
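One way to see exactly which flags a wrapper adds, without compiling anything, is the wrapper compilers' --showme option. For example:

```shell
# Print the full back-end command line the wrapper would execute
shell$ mpicc hello_world_mpi.c -o hello_world_mpi --showme

# Print only the compile-time or only the link-time flags
shell$ mpicc --showme:compile
shell$ mpicc --showme:link
```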
===========================================================================
Running Open MPI Applications
-----------------------------
Open MPI supports both mpirun and mpiexec (they are exactly
equivalent). For example:
shell$ mpirun -np 2 hello_world_mpi
or
shell$ mpiexec -np 1 hello_world_mpi : -np 1 hello_world_mpi
are equivalent. Some of mpiexec's switches (such as -host and -arch)
are not yet functional, although they will not error if you try to use
them.
The rsh launcher accepts a -hostfile parameter (the option
"-machinefile" is equivalent); you can specify a -hostfile parameter
indicating a standard mpirun-style hostfile (one hostname per line):
shell$ mpirun -hostfile my_hostfile -np 2 hello_world_mpi
If you intend to run more than one process on a node, the hostfile can
use the "slots" attribute. If "slots" is not specified, a count of 1
is assumed. For example, using the following hostfile:
---------------------------------------------------------------------------
node1.example.com
node2.example.com
node3.example.com slots=2
node4.example.com slots=4
---------------------------------------------------------------------------
shell$ mpirun -hostfile my_hostfile -np 8 hello_world_mpi
will launch MPI_COMM_WORLD rank 0 on node1, rank 1 on node2, ranks 2
and 3 on node3, and ranks 4 through 7 on node4.
Other starters, such as the resource manager / batch scheduling
environments, do not require hostfiles (and will ignore the hostfile
if it is supplied). They will also launch as many processes as slots
have been allocated by the scheduler if no "-np" argument has been
provided. For example, running a SLURM job with 8 processors:
shell$ salloc -n 8 mpirun a.out
The above command will reserve 8 processors and run 1 copy of mpirun,
which will, in turn, launch 8 copies of a.out in a single
MPI_COMM_WORLD on the processors that were allocated by SLURM.
Note that the values of component parameters can be changed on the
mpirun / mpiexec command line. This is explained in the section
below, "The Modular Component Architecture (MCA)".
===========================================================================
The Modular Component Architecture (MCA)
----------------------------------------
The MCA is the backbone of Open MPI -- most services and functionality
are implemented through MCA components. Here is a list of all the
component frameworks in Open MPI:
---------------------------------------------------------------------------
MPI component frameworks:
-------------------------
allocator - Memory allocator
bml - BTL management layer
btl - MPI point-to-point Byte Transfer Layer, used for MPI
point-to-point messages on some types of networks
coll - MPI collective algorithms
crcp - Checkpoint/restart coordination protocol
dpm - MPI-2 dynamic process management
io - MPI-2 I/O
mpool - Memory pooling
mtl - Matching transport layer, used for MPI point-to-point
messages on some types of networks
osc - MPI-2 one-sided communications
pml - MPI point-to-point management layer
pubsub - MPI-2 publish/subscribe management
rcache - Memory registration cache
topo - MPI topology routines
Back-end run-time environment component frameworks:
---------------------------------------------------
errmgr - RTE error manager
ess - RTE environment-specific services
filem - Remote file management
grpcomm - RTE group communications
iof - I/O forwarding
notifier - System/network administrator notification system
odls - OpenRTE daemon local launch subsystem
oob - Out of band messaging
plm - Process lifecycle management
ras - Resource allocation system
rmaps - Resource mapping system
rml - RTE message layer
routed - Routing table for the RML
snapc - Snapshot coordination
Miscellaneous frameworks:
-------------------------
backtrace - Debugging call stack backtrace support
carto - Cartography (host/network mapping) support
crs - Checkpoint and restart service
installdirs - Installation directory relocation services
maffinity - Memory affinity
memchecker - Run-time memory checking
memcpy - Memory copy support
memory - Memory management hooks
paffinity - Processor affinity
timer - High-resolution timers
---------------------------------------------------------------------------
Each framework typically has one or more components that are used at
run-time. For example, the btl framework is used by the MPI layer to
send bytes across different types of underlying networks. The tcp btl,
for example, sends messages across TCP-based networks; the openib btl
sends messages across OpenFabrics-based networks; the MX btl sends
messages across Myrinet networks.
Each component typically has some tunable parameters that can be
changed at run-time. Use the ompi_info command to check a component
to see what its tunable parameters are. For example:
shell$ ompi_info --param btl tcp
shows all the parameters (and default values) for the tcp btl
component.
These values can be overridden at run-time in several ways. At
run-time, the following locations are examined (in order) for new
values of parameters:
1. /etc/openmpi-mca-params.conf
This file is intended to set any system-wide default MCA parameter
values -- it will apply, by default, to all users who use this Open
MPI installation. The default file that is installed contains many
comments explaining its format.
2. $HOME/.openmpi/mca-params.conf
If this file exists, it should be in the same format as
/etc/openmpi-mca-params.conf. It is intended to provide
per-user default parameter values.
3. environment variables of the form OMPI_MCA_<name> set equal to a <value>
Where <name> is the name of the parameter. For example, set the
variable named OMPI_MCA_btl_tcp_frag_size to the value 65536
(Bourne-style shells):
shell$ OMPI_MCA_btl_tcp_frag_size=65536
shell$ export OMPI_MCA_btl_tcp_frag_size
4. the mpirun command line: --mca <name> <value>
Where <name> is the name of the parameter. For example:
shell$ mpirun --mca btl_tcp_frag_size 65536 -np 2 hello_world_mpi
These locations are checked in order. For example, a parameter value
passed on the mpirun command line will override an environment
variable; an environment variable will override the system-wide
defaults.
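As an illustrative sketch of that precedence (the values shown are hypothetical), all of the mechanisms below set the same parameter, and later items in the list above win:

```shell
# 1. / 2. In /etc/openmpi-mca-params.conf or $HOME/.openmpi/mca-params.conf:
#        btl_tcp_frag_size = 32768

# 3. Environment variable (Bourne-style shell) -- overrides the files:
export OMPI_MCA_btl_tcp_frag_size=49152

# 4. mpirun command line -- overrides all of the above:
shell$ mpirun --mca btl_tcp_frag_size 65536 -np 2 hello_world_mpi
```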
===========================================================================
Common Questions
----------------
Many common questions about building and using Open MPI are answered
on the FAQ:
http://www.open-mpi.org/faq/
===========================================================================
Got more questions?
-------------------
Found a bug? Got a question? Want to make a suggestion? Want to
contribute to Open MPI? Please let us know!
When submitting questions and problems, be sure to include as much
extra information as possible. This web page details all the
information that we request in order to provide assistance:
http://www.open-mpi.org/community/help/
User-level questions and comments should generally be sent to the
user's mailing list (users@open-mpi.org). Because of spam, only
subscribers are allowed to post to this list (ensure that you
subscribe with and post from *exactly* the same e-mail address --
joe@example.com is considered different than
joe@mycomputer.example.com!). Visit this page to subscribe to the
user's list:
http://www.open-mpi.org/mailman/listinfo.cgi/users
Developer-level bug reports, questions, and comments should generally
be sent to the developer's mailing list (devel@open-mpi.org). Please
do not post the same question to both lists. As with the user's list,
only subscribers are allowed to post to the developer's list. Visit
the following web page to subscribe:
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Make today an Open MPI day!
Open Fabrics Enterprise Distribution (OFED)
ipath in OFED 1.4.1 Release Notes
May 2009
======================================================================
Table of Contents
======================================================================
1. Overview
2. Fixed Bugs and Enhancements
3. Known Issues
======================================================================
1. Overview
======================================================================
ipath is the low level driver implementation for all QLogic InfiniPath
HCAs: the HTX QHT7140, the PCI-Express x8 QLE7140, the PCI-Express x8
DDR QLE7240, and the PCI-Express x16 DDR QLE7280.
======================================================================
2. Fixed Bugs and Enhancements
======================================================================
2.1 (Bug 1369) No results when running Open MPI bandwidth with msg size
bigger than 2200 with QLogic HCA
This was fixed by a change submitted to OpenMPI 1.3, which is part of OFED
1.4.1.
======================================================================
3. Known Issues
======================================================================
3.1 (Bug 1242) Kernel panic while running mpi2007 against
ofed1.4 -- ib_ipath: ipath_sdma_verbs_send
Found while running mpi2007 over OpenMPI on stock OFED1.4 RC1 RHEL4.X
machines. QLogic is working on it. Contact support@qlogic.com if you
run into this problem.
Open Fabrics Enterprise Distribution (OFED)
NetEffect Ethernet Cluster Server Adapter Release Notes
May 2009
The iw_nes module and libnes user library provide RDMA and L2IF
support for the NetEffect Ethernet Cluster Server Adapters.
============================================
Required Setting - RDMA Unify TCP port space
============================================
RDMA connections use the same TCP port space as the host stack. To avoid
conflicts, set the rdma_cm module option unify_tcp_port_space to 1 by adding
the following to /etc/modprobe.conf:
options rdma_cm unify_tcp_port_space=1
=======================
Loadable Module Options
=======================
The following options can be used when loading the iw_nes module by modifying
the modprobe.conf file:
wide_ppm_offset = 0
Set to 1 to increase the CX4 interface clock ppm offset to 300ppm.
The default setting of 0 is 100ppm.
mpa_version = 1
MPA version to be used in MPA Req/Resp (0 or 1).
disable_mpa_crc = 0
Disable checking of MPA CRC.
send_first = 0
Send RDMA Message First on Active Connection.
nes_drv_opt = 0x00000100
Following options are supported:
Enable MSI - 0x00000010
No Inline Data - 0x00000080
Disable Interrupt Moderation - 0x00000100
Disable Virtual Work Queue - 0x00000200
nes_debug_level = 0
Enable debug output level.
wqm_quanta = 65536
Set size of data to be transmitted at a time.
limit_maxrdreqsz = 0
Limit PCI read request size to 256 bytes.
===============
Runtime Options
===============
The following options can be used to alter the behavior of the iw_nes module:
NOTE: Assuming NetEffect Ethernet Cluster Server Adapter is assigned eth2.
ifconfig eth2 mtu 9000 - largest mtu supported
ethtool -K eth2 tso on - enables TSO
ethtool -K eth2 tso off - disables TSO
ethtool -C eth2 rx-usecs-irq 128 - set static interrupt moderation
ethtool -C eth2 adaptive-rx on - enable dynamic interrupt moderation
ethtool -C eth2 adaptive-rx off - disable dynamic interrupt moderation
ethtool -C eth2 rx-frames-low 16 - low watermark of rx queue for dynamic
interrupt moderation
ethtool -C eth2 rx-frames-high 256 - high watermark of rx queue for
dynamic interrupt moderation
ethtool -C eth2 rx-usecs-low 40 - smallest interrupt moderation timer
for dynamic interrupt moderation
ethtool -C eth2 rx-usecs-high 1000 - largest interrupt moderation timer
for dynamic interrupt moderation
===================
uDAPL Configuration
===================
The rest of this document assumes the following uDAPL settings in dat.conf:
OpenIB-cma-nes u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
ofa-v2-nes u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
=======================================
Recommended Settings for HP MPI 2.2.7
=======================================
Add the following to the mpirun command:
-1sided
Example mpirun command with uDAPL-2.0:
mpirun -UDAPL -prot -intra=shm
-e MPI_ICLIB_UDAPL=libdaplofa.so.1
-e MPI_HASIC_UDAPL=ofa-v2-nes
-1sided
-f /opt/hpmpi/appfile
Example mpirun command with uDAPL-1.2:
mpirun -UDAPL -prot -intra=shm
-e MPI_ICLIB_UDAPL=libdaplcma.so.1
-e MPI_HASIC_UDAPL=OpenIB-cma-nes
-1sided
-f /opt/hpmpi/appfile
=======================================
Recommended Settings for Intel MPI 3.2
=======================================
Add the following to the mpiexec command:
-genv I_MPI_FALLBACK_DEVICE 0
-genv I_MPI_DEVICE rdma:OpenIB-cma-nes
-genv I_MPI_RENDEZVOUS_RDMA_WRITE
Example mpiexec command line for uDAPL-2.0:
mpiexec -genv I_MPI_FALLBACK_DEVICE 0
-genv I_MPI_DEVICE rdma:ofa-v2-nes
-genv I_MPI_RENDEZVOUS_RDMA_WRITE
-ppn 1 -n 2
/opt/intel/impi/3.2.0.011/bin64/IMB-MPI1
Example mpiexec command line for uDAPL-1.2:
mpiexec -genv I_MPI_FALLBACK_DEVICE 0
-genv I_MPI_DEVICE rdma:OpenIB-cma-nes
-genv I_MPI_RENDEZVOUS_RDMA_WRITE
-ppn 1 -n 2
/opt/intel/impi/3.2.0.011/bin64/IMB-MPI1
========================================
Recommended Setting for MVAPICH2 and OFA
========================================
Add the following to the mpirun command:
-env MV2_USE_RDMA_CM 1
-env MV2_USE_IWARP_MODE 1
For larger number of processes, it is also recommended to set the following:
-env MV2_MAX_INLINE_SIZE 64
-env MV2_USE_SRQ 0
Example mpiexec command line:
mpiexec -l -n 2
-env MV2_USE_RDMA_CM 1
-env MV2_USE_IWARP_MODE 1
/usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0/osu_latency
==========================================
Recommended Setting for MVAPICH2 and uDAPL
==========================================
Add the following to the mpirun command:
-env MV2_PREPOST_DEPTH 59
Example mpiexec command line:
mpiexec -l -n 2
-env MV2_DAPL_PROVIDER ofa-v2-nes
-env MV2_PREPOST_DEPTH 59
/usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0/osu_latency
mpiexec -l -n 2
-env MV2_DAPL_PROVIDER OpenIB-cma-nes
-env MV2_PREPOST_DEPTH 59
/usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0/osu_latency
===========================
Modify Settings in Open MPI
===========================
There is more than one way to specify MCA parameters in
Open MPI. Please visit this link and use the best method
for your environment:
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
=======================================
Recommended Settings for Open MPI 1.3.2
=======================================
Caching pinned memory is enabled by default but it may be necessary
to limit the size of the cache to prevent running out of memory by
adding the following parameter:
mpool_rdma_rcache_size_limit = <cache size>
The cache size depends on the number of processes and nodes, e.g. for
64 processes with 8 nodes, limit the pinned cache size to
104857600 (100 MBytes).
Example mpirun command line:
mpirun -np 2 -hostfile /opt/mpd.hosts
-mca btl openib,self,sm
-mca mpool_rdma_rcache_size_limit 104857600
/usr/mpi/gcc/openmpi-1.3.2/tests/IMB-3.1/IMB-MPI1
=======================================
Recommended Settings for Open MPI 1.3.1
=======================================
There is a known problem with cached pinned memory. It is recommended
that pinned memory caching be disabled. For more information, see
https://svn.open-mpi.org/trac/ompi/ticket/1853
To disable pinned memory caching, add the following parameter:
mpi_leave_pinned = 0
Example mpirun command line:
mpirun -np 2 -hostfile /opt/mpd.hosts
-mca btl openib,self,sm
-mca mpi_leave_pinned 0
/usr/mpi/gcc/openmpi-1.3.1/tests/IMB-3.1/IMB-MPI1
=====================================
Recommended Settings for Open MPI 1.3
=====================================
There is a known problem with cached pinned memory. It is recommended
that pinned memory caching be disabled. For more information, see
https://svn.open-mpi.org/trac/ompi/ticket/1853
To disable pinned memory caching, add the following parameter:
mpi_leave_pinned = 0
Receive Queue setting:
btl_openib_receive_queues = P,65536,256,192,128
Set maximum size of inline data segment to 64:
btl_openib_max_inline_data = 64
Example mpirun command:
mpirun -np 2 -hostfile /root/mpd.hosts
-mca btl openib,self,sm
-mca mpi_leave_pinned 0
-mca btl_openib_receive_queues P,65536,256,192,128
-mca btl_openib_max_inline_data 64
/usr/mpi/gcc/openmpi-1.3/tests/IMB-3.1/IMB-MPI1
============
Known Issues
============
The following is a list of known issues with the Linux kernel and the
OFED 1.4.1 release.
1. We have observed a "__qdisc_run" softlockup crash when running UDP
traffic on RHEL5.1 systems with more than 8 cores. The issue is in the
Linux network stack. The fix is available from the following link:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=2ba2506ca7ca62c56edaa334b0fe61eb5eab6ab0;hp=32aced7509cb20ef3ec67c9b56f5b55c41dd4f8d
2. Running the Pallas test suite with MVAPICH2 (OFA/uDAPL) for more
than 64 processes terminates abnormally. The workaround is to
add the following to the mpirun command:
-env MV2_ON_DEMAND_THRESHOLD <total number of processes>
e.g. for 72 total processes: -env MV2_ON_DEMAND_THRESHOLD 72
3. For MVAPICH2 (OFA/uDAPL), the IMB-EXT "Window" test (part of the Pallas
suite) may show high latency numbers. It is recommended to turn off one-sided
communication by adding the following to the mpirun command:
-env MV2_USE_RDMA_ONE_SIDED 0
4. IMB-EXT does not run with Open MPI 1.3.1 or 1.3. The workaround is
to turn off message coalescing by adding the following to the mpirun
command:
-mca btl_openib_use_message_coalescing 0
NetEffect is a trademark of Intel Corporation in the U.S. and other countries.
trunk/mstflint_release_notes.txt

Open Fabrics InfiniBand Diagnostic Utilities
--------------------------------------------
*******************************************************************************
RELEASE: OFED 1.4
DATE: Dec 2008
===============================================================================
Table of Contents
===============================================================================
1. Overview
2. New features
3. Major Bugs Fixed
4. Known Issues
===============================================================================
1. Overview
===============================================================================
This package contains a burning tool and diagnostic tools for Mellanox
manufactured cards. It also provides access to the relevant source
code. Please see the file LICENSE for licensing details.
2) Package Contents
a) mstflint source code
b) mflash lib
This lib provides Flash access through Mellanox HCAs.
c) mtcr lib (implemented in mtcr.h file)
This lib enables access to HCA hardware registers.
d) mstregdump utility
This utility dumps hardware registers from Mellanox hardware
for later analysis by Mellanox.
e) mstvpd
This utility dumps the on-card VPD.
f) hca_self_test.ofed
This script checks the status of the software, firmware and hardware
of the HCAs installed on the local host.
===============================================================================
2. New Features
===============================================================================
* Mellanox InfiniScaleIV switch support.
Mstflint and the mflash lib support burning of this switch device.
mstregdump can dump InfiniScaleIV registers
* Added hca_self_test.ofed tool to the package.
See file hca_self_test.readme included in the package for details.
===============================================================================
3. Major Bugs Fixed
===============================================================================
* Fixed: Mstregdump on ConnectX devices caused the device to hang
===============================================================================
4. Known Issues
===============================================================================
* In the very unlikely event that you get the following error message when
running mstflint:
Warning: memory access to device 0a:00.0 failed: Input/output error.
Warning: Fallback on IO: much slower, and unsafe if device in use.
*** buffer overflow detected ***: mstflint terminated
simply run "mst start" and then re-run mstflint.
trunk/QoS_architecture.txt

QoS support in OFED
==============================================================================
Table of contents
==============================================================================
1. Overview
2. Architecture
3. Supported Policy
4. CMA functionality
5. IPoIB functionality
6. SDP functionality
7. RDS functionality
8. SRP functionality
9. iSER functionality
10. OpenSM functionality
==============================================================================
1. Overview
==============================================================================
Quality of Service requirements stem from the realization of I/O consolidation
over the IB network: as multiple applications and ULPs share the same fabric,
means to control their use of the network resources become a must.
The basic need is to differentiate the service levels provided to different
traffic flows, so that a policy can be enforced to control each flow's
utilization of the fabric resources.
IBTA specification defined several hardware features and management interfaces
to support QoS:
* Up to 15 Virtual Lanes (VLs) carry traffic in a non-blocking manner
* Arbitration between traffic of different VLs is performed by a
two-priority-level weighted round robin arbiter. The arbiter is programmable
with a sequence of (VL, weight) pairs and a maximal number of high-priority
credits to be processed before low priority is served
* Packets carry class of service marking in the range 0 to 15 in their
header SL field
* Each switch can map the incoming packet by its SL to a particular output
VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
* The Subnet Administrator controls each communication flow parameters
by providing them as a response to Path Record (PR) or MultiPathRecord (MPR)
queries
The IB QoS features provide the means to implement a DiffServ like
architecture. DiffServ architecture (IETF RFC 2474 & 2475) is widely used
today in highly dynamic fabrics.
This document provides the detailed functional definition for the various
software elements that enable a DiffServ like architecture over the
OpenFabrics software stack.
==============================================================================
2. Architecture
==============================================================================
QoS functionality is split between the SM/SA, the CMA and the various ULPs.
We take the "chronology approach" to describe how the overall system works.
2.1. The network manager (human) provides a set of rules (a policy) that
defines how the network is configured and how its resources are split
into different QoS-Levels. The policy also defines how to decide which
QoS-Level each application, ULP, or service uses.
2.2. The SM analyzes the provided policy to see if it is realizable and
performs the necessary fabric setup. Part of this policy defines the default
QoS-Level of each partition. The SA is enhanced to match the requested Source,
Destination, QoS-Class, Service-ID, PKey against the policy, so clients
(ULPs, programs) can obtain a policy enforced QoS. The SM may also set up
partitions with appropriate IPoIB broadcast group. This broadcast group
carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.
2.3. IPoIB is being setup. IPoIB uses the SL, MTU, RATE and Packet Lifetime
available on the multicast group which forms the broadcast group of this
partition.
2.4. MPI which provides non IB based connection management should be
configured to run using hard coded SLs. It uses these SLs for every QP
being opened.
2.5. ULPs that use CM interface (like SRP) have their own pre-assigned
Service-ID and use it while obtaining PathRecord/MultiPathRecord (PR/MPR)
for establishing connections. The SA receiving the PR/MPR matches it
against the policy and returns the appropriate PR/MPR including SL, MTU,
RATE and Lifetime.
2.6. ULPs and programs (e.g. SDP) that use the CMA to establish an RC
connection provide the CMA with the target IP and port number. ULPs might
also provide a QoS-Class. The CMA then creates a Service-ID for the ULP and
passes this ID and the optional QoS-Class in the PR/MPR request. The
resulting PR/MPR is used for configuring the connection QP.
PathRecord and MultiPathRecord enhancements for QoS:
As mentioned above, the PathRecord and MultiPathRecord attributes are enhanced
to carry the Service-ID, which is a 64-bit value. A new field, QoS-Class, is
also provided.
A new capability bit describes the SM QoS support in the SA class port info.
This approach provides an easy migration path for existing access layer and
ULPs by not introducing new set of PR/MPR attributes.
==============================================================================
3. Supported Policy
==============================================================================
The QoS policy, which is specified in a separate file, is divided into
four subsections:
I) Port Group: a set of CAs, Routers or Switches that share the same settings.
A port group might be a partition defined by the partition manager policy,
list of GUIDs, or list of port names based on NodeDescription.
II) Fabric Setup: Defines how the SL2VL and VLArb tables should be setup.
NOTE: Currently this part of the policy is ignored. SL2VL and VLArb
tables should be configured in the OpenSM options file
(opensm.opts).
III) QoS-Levels Definition: This section defines the possible sets of
parameters for QoS that a client might be mapped to. Each set holds
SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits.
NOTE: Currently, Path Bits are not implemented.
IV) Matching Rules: A list of rules that match an incoming PR/MPR request
to a QoS-Level. The rules are processed in order, such that the first match
is applied. Each rule is built out of a set of match expressions, all of
which must match for the rule to apply. The matching expressions are
defined for the following fields:
- SRC and DST to lists of port groups
- Service-ID to a list of Service-ID values or ranges
- QoS-Class to a list of QoS-Class values or ranges
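For illustration, a minimal policy file combining a QoS-Level and a matching rule might look like the sketch below; the keywords follow the OpenSM QoS policy file syntax, but the level name, SL, and Service-ID values are hypothetical placeholders:

```
qos-levels
    qos-level
        name: vnic-bulk
        sl: 2
    end-qos-level
end-qos-levels

qos-match-rules
    qos-match-rule
        service-id: 0x1000066a00000001
        qos-level-name: vnic-bulk
    end-qos-match-rule
end-qos-match-rules
```

A PR/MPR request carrying that Service-ID would be answered with SL 2 per the rule above.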
==============================================================================
4. CMA functionality
==============================================================================
The CMA interface supports the Service-ID through the notion of a port space
as a prefix to the port_num that is part of the sockaddr provided to
rdma_resolve_addr().
The CMA also allows the ULP (like SDP) to propagate a request for a specific
QoS-Class. The CMA uses the provided QoS-Class and Service-ID in the PR/MPR
it sends.
==============================================================================
5. IPoIB functionality
==============================================================================
IPoIB queries the SA for its broadcast group information.
It provides the broadcast group SL, MTU, and RATE in every following
PathRecord query performed when a new UDAV is needed by IPoIB.
==============================================================================
6. SDP functionality
==============================================================================
SDP uses CMA for building its connections.
The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
holding the remote TCP/IP Port Number to connect to.
==============================================================================
7. RDS functionality
==============================================================================
RDS uses CMA and thus is very similar to SDP. The Service-ID for RDS is
0x000000000106PPPP, where PPPP are 4 hex digits holding the TCP/IP Port
Number that the protocol connects to.
Default port number for RDS is 0x48CA, which makes a default Service-ID
0x00000000010648CA.
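The Service-ID construction above can be checked with simple shell arithmetic; the prefix/port split is exactly as described in this section:

```shell
# Compose the RDS Service-ID: the fixed prefix 0x0000000001060000 OR'ed with
# the 16-bit port number (the PPPP digits in the text above).
port=0x48CA                                          # default RDS port
printf '0x%016X\n' $(( 0x0000000001060000 | $port )) # prints 0x00000000010648CA
```

Substituting another port value yields the Service-ID for a non-default RDS port in the same way.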
==============================================================================
8. SRP
==============================================================================
The current SRP implementation uses its own CM callbacks (not the CMA), so SRP
fills in the Service-ID in the PR/MPR by itself and uses that information in
setting up the QP.
The SRP Service-ID is defined by the SRP target I/O Controller (it also
complies with IBTA Service-ID rules). The Service-ID is reported by the I/O
Controller in the ServiceEntries DM attribute and should be used in the PR/MPR
if the SA reports its ability to handle QoS PR/MPRs.
==============================================================================
9. iSER
==============================================================================
Similar to RDS, iSER also uses CMA. The Service-ID for iSER is similar to RDS
(0x000000000106PPPP), with default port number 0x0CBC, which makes a default
Service-ID 0x0000000001060CBC.
==============================================================================
10. OpenSM functionality
==============================================================================
The QoS related functionality that is provided by OpenSM can be split into two
main parts:
10.1. Fabric Setup
During fabric initialization the SM parses the policy and applies its settings
to the discovered fabric elements.
10.2. PR/MPR query handling:
OpenSM enforces the provided policy on client requests.
The overall flow for such requests is: first the request is matched against
the defined match rules so that the target QoS-Level definition is found.
Given the QoS-Level, a path search is performed with the restrictions
imposed by that level.
==============================================================================
trunk/ofed_net.conf-example

LAN_INTERFACE_ib0=eth0
IPADDR_ib0=192.168.0.'*'
NETMASK_ib0=255.255.255.0
NETWORK_ib0=192.168.0.0
BROADCAST_ib0=192.168.0.255
ONBOOT_ib0=1
LAN_INTERFACE_ib1=eth0
IPADDR_ib1=172.16.'*'.'*'
NETMASK_ib1=255.255.0.0
NETWORK_ib1=172.16.0.0
BROADCAST_ib1=172.16.255.255
ONBOOT_ib1=1
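The '*' wildcards above appear to stand for octets taken from the LAN interface address. As a purely hypothetical illustration of that substitution (the interpretation and the addresses below are assumptions, not part of the example file):

```shell
# Hypothetical illustration only: if '*' in IPADDR_ib0 stands for the
# matching octet of the eth0 address, ib0's address can be derived from it.
eth0_ip=192.168.1.7                          # example eth0 address
echo "IPADDR_ib0=192.168.0.${eth0_ip##*.}"   # prints IPADDR_ib0=192.168.0.7
```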
trunk/QLOGIC_VNIC_README.txt

This is a release of the QLogic VNIC driver on OFED 1.4. This driver is
currently supported on Intel x86 32 and 64 bit machines.
Supported OS are:
- RHEL 4 Update 4.
- RHEL 4 Update 5.
- RHEL 4 Update 6.
- SLES 10.
- SLES 10 Service Pack 1.
- SLES 10 Service Pack 1 Update 1.
- SLES 10 Service Pack 2.
- RHEL 5.
- RHEL 5 Update 1.
- RHEL 5 Update 2.
- vanilla 2.6.27 kernel.
The VNIC driver in conjunction with the QLogic Ethernet Virtual I/O Controller
(EVIC) provides Ethernet interfaces on a host with IB HCA(s) without the need
for any physical Ethernet NIC.
This file describes the use of the QLogic VNIC ULP service on an OFED stack
and covers the following points:
A) Creating QLogic VNIC interfaces
B) Discovering VEx/EVIC IOCs present on the fabric using ib_qlgc_vnic_query
C) Starting the QLogic VNIC driver and the VNIC interfaces
D) Assigning IP addresses etc for the QLogic VNIC interfaces
E) Information about the QLogic VNIC interfaces
F) Deleting a specific QLogic VNIC interface
G) Forced Failover feature for QLogic VNIC.
H) Infiniband Quality of Service for VNIC.
I) QLogic VNIC Dynamic Update Daemon Tool and Hot Swap support
J) Information about creating VLAN interfaces
K) Information about enabling IB Multicast for QLogic VNIC interface
L) Basic Troubleshooting
A) Creating QLogic VNIC interfaces
The VNIC interfaces can be created with the help of
the configuration file, which must be placed at /etc/infiniband/qlgc_vnic.cfg.
Please take a look at the /etc/infiniband/qlgc_vnic.cfg.sample file (available
also as part of the documentation) to see how VNIC configuration files are
written. You can use this file as the basis for creating a VNIC configuration
file by copying it to /etc/infiniband/qlgc_vnic.cfg. Of course, you will have
to replace the IOCGUID, IOCSTRING, etc. values in the sample configuration
file with those of the EVIC IOCs present on your fabric.
(For backward compatibility, if this file is missing,
/etc/infiniband/qlogic_vnic.cfg or /etc/sysconfig/ics_inic.cfg
will be used for configuration)
Please note that using the DGID of the EVIC/VEx IOC is
recommended, as it will ensure the quickest startup of the
VNIC service. If a DGID is specified then you must also
specify the IOCGUID. More details can be found in
the qlgc_vnic.cfg.sample file.
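The qlgc_vnic.cfg.sample file documents one CREATE command per interface; a minimal entry (the DGID, IOCGUID, and IOCSTRING values below are the sample file's placeholders and must be replaced with ib_qlgc_vnic_query output for your fabric) looks like:

```
{CREATE; NAME="eioc1";
 DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001;
 IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1";
}
```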
If a host has more than one HCA plugged in, VNIC interfaces can be
configured based on HCA number and port number, or on PORTGUID.
B) Discovering EVIC/VEx IOCs present on the fabric using ib_qlgc_vnic_query
For writing the configuration file, you will need information
about the EVIC/VEx IOCs present on the fabric like their IOCGUID,
IOCSTRING etc. The ib_qlgc_vnic_query tool should be used to get this
information.
When ib_qlgc_vnic_query is executed without any options, it scans through ALL
active IB ports on the host and obtains the detailed information about all the
EVIC/VEx IOCs reachable through each active IB port:
# ib_qlgc_vnic_query
HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
IO Unit Info:
port LID: 0008
port GID: fe8000000000000000066a11de000070
change ID: 0003
max controllers: 0x02
controller[ 1]
GUID: 00066a01de000070
vendor ID: 00066a
device ID: 000030
IO class : 2000
ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
service entries: 2
service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
IO Unit Info:
port LID: 0009
port GID: fe8000000000000000066a21de000070
change ID: 0003
max controllers: 0x02
controller[ 2]
GUID: 00066a02de000070
vendor ID: 00066a
device ID: 000030
IO class : 2000
ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
service entries: 2
service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active
IO Unit Info:
port LID: 0008
port GID: fe8000000000000000066a11de000070
change ID: 0003
max controllers: 0x02
controller[ 1]
GUID: 00066a01de000070
vendor ID: 00066a
device ID: 000030
IO class : 2000
ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
service entries: 2
service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
IO Unit Info:
port LID: 0009
port GID: fe8000000000000000066a21de000070
change ID: 0003
max controllers: 0x02
controller[ 2]
GUID: 00066a02de000070
vendor ID: 00066a
device ID: 000030
IO class : 2000
ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
service entries: 2
service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down
Port State is Down. Skipping search of DM nodes on this port.
HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
IO Unit Info:
port LID: 0008
port GID: fe8000000000000000066a11de000070
change ID: 0003
max controllers: 0x02
controller[ 1]
GUID: 00066a01de000070
vendor ID: 00066a
device ID: 000030
IO class : 2000
ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
service entries: 2
service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
IO Unit Info:
port LID: 0009
port GID: fe8000000000000000066a21de000070
change ID: 0003
max controllers: 0x02
controller[ 2]
GUID: 00066a02de000070
vendor ID: 00066a
device ID: 000030
IO class : 2000
ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
service entries: 2
service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
This is meant to help the network administrator learn the HCA/port information
on the host, along with the EVIC IOCs reachable through each active IB port on
the fabric. When ib_qlgc_vnic_query is run with the -e option, it reports the
IOCGUID information, and with the -s option it reports the IOCSTRING
information for the EVIC/VEx IOCs present on the fabric:
# ib_qlgc_vnic_query -e
HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff
ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff
HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active
ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff
ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff
HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down
Port State is Down. Skipping search of DM nodes on this port.
HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff
ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff
# ib_qlgc_vnic_query -s
HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active
"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down
Port State is Down. Skipping search of DM nodes on this port.
HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
# ib_qlgc_vnic_query -es
HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active
ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down
Port State is Down. Skipping search of DM nodes on this port.
HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
ib_qlgc_vnic_query can be used to discover EVIC IOCs on the fabric based on a
umad device, HCA number/port number, or PORTGUID, as follows:
For umad devices, it takes the name of the umad device mentioned with '-d'
option:
# ib_qlgc_vnic_query -es -d /dev/infiniband/umad0
HCA No = 0, HCA = mlx4_0, Port = 1
ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
If the name of the HCA and its port number are known, then ib_qlgc_vnic_query
can make use of this information to discover EVIC IOCs on the fabric. The HCA
name and port number are specified with the '-C' and '-P' options respectively.
# ib_qlgc_vnic_query -es -C mlx4_1 -P 2
ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
If the HCA name is not specified but the port number is, HCA 0 is selected as
the default HCA to discover IOCs; if the port number is missing, port 1 of the
named HCA is used to discover the IOCs. If both are missing, the behaviour is
the default and ib_qlgc_vnic_query will scan all the IB ports on the host to
discover the IOCs reachable through each one of them.
PORTGUID information about the IB ports on given host can be obtained using
the option '-L':
# ib_qlgc_vnic_query -L
0,mlx4_0,1,0x0002c903000010f5
0,mlx4_0,2,0x0002c903000010f6
1,mlx4_1,1,0x0002c90300000785
1,mlx4_1,2,0x0002c90300000786
This lists configurable parameters of the IB ports present on the given host
in the order: HCA number, HCA name, port number, PORTGUID, separated by
commas. The PORTGUID value obtained this way can be used to discover the EVIC
IOCs reachable through that port using the '-G' option as follows:
# ib_qlgc_vnic_query -es -G 0x0002c903000010f5
HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
C) Starting the QLogic VNIC driver and the QLogic VNIC interfaces
To start the QLogic VNIC service as part of the startup of the OFED stack, set
QLGC_VNIC_LOAD=yes
in the /etc/infiniband/openib.conf file. With this setting, the QLogic VNIC
service will also be stopped when the OFED stack is stopped. Also, if the OFED
stack has been marked to start on boot, the QLogic VNIC service will also
start on boot.
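A quick way to flip this flag in place is sketched below; it assumes the file currently contains the exact line QLGC_VNIC_LOAD=no, and the scratch-file default exists only so the sketch can be run safely outside a real install:

```shell
# Flip QLGC_VNIC_LOAD from "no" to "yes".  On a real system set
# CONF=/etc/infiniband/openib.conf first; the default below is a scratch
# copy so this sketch is safe to run anywhere.
CONF="${CONF:-$(mktemp)}"
grep -q '^QLGC_VNIC_LOAD=' "$CONF" || echo 'QLGC_VNIC_LOAD=no' >> "$CONF"
sed -i 's/^QLGC_VNIC_LOAD=no$/QLGC_VNIC_LOAD=yes/' "$CONF"
grep '^QLGC_VNIC_LOAD=' "$CONF"    # prints QLGC_VNIC_LOAD=yes
```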
The rest of the discussion in this subsection C) is valid only if
QLGC_VNIC_LOAD=no
is set into /etc/infiniband/openib.conf.
Once you have created a configuration file, you can start the VNIC driver
and create the VNIC interfaces specified in the configuration file with:
#/sbin/service qlgc_vnic start
You can stop the VNIC driver and bring down the VNIC interfaces with
#/sbin/service qlgc_vnic stop
To restart the QLogic VNIC driver, you can use
#/sbin/service qlgc_vnic restart
If you have not started the Infiniband network stack (Infinipath or OFED),
then running "/sbin/service qlgc_vnic start" command will also cause the
Infiniband network stack to be started since the QLogic VNIC service requires
the Infiniband stack.
On the other hand if you start the Infiniband network stack separately, then
the correct order of starting is:
- Start the Infiniband stack
- Start QLogic VNIC service
For example, if you use OFED, correct order of starting is:
/sbin/service openibd start
/sbin/service qlgc_vnic start
Correct order of stopping is:
- Stop QLogic VNIC service
- Stop the Infiniband stack
For example, if you use OFED, correct order of stopping is:
/sbin/service qlgc_vnic stop
/sbin/service openibd stop
If you try to stop the Infiniband stack when the QLogic VNIC service is
running, you will get an error message that some of the modules of the
Infiniband stack are in use by the QLogic VNIC service. Also, any QLogic VNIC
interfaces that you created are removed (because stopping the Infiniband
network stack causes the HCA driver to be unloaded, which is required for the
VNIC interfaces to be present).
In this case, do the following:
1. Stop the QLogic VNIC service with "/sbin/service qlgc_vnic stop"
2. Stop the Infiniband stack again.
3. If you want to restart the QLogic VNIC interfaces, use
"/sbin/service qlgc_vnic start".
D) Assigning IP addresses etc for the QLogic VNIC interfaces
This can be done with ifconfig, or by setting up the ifcfg-XXX network files
(ifcfg-veth0 for an interface named veth0, etc.) for the corresponding VNIC
interfaces.
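For example, on a Red Hat style system an ifcfg-veth0 file for a statically addressed VNIC interface could look like the sketch below; all address values are placeholders:

```
DEVICE=veth0
BOOTPROTO=static
IPADDR=192.168.10.5
NETMASK=255.255.255.0
ONBOOT=yes
```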
E) Information about the QLogic VNIC interfaces
Information about the VNIC interfaces on a given host can be obtained using
the script "ib_qlgc_vnic_info":
# ib_qlgc_vnic_info
VNIC Interface : eioc0
VNIC State : VNIC_REGISTERED
Current Path : primary path
Receive Checksum : true
Transmit checksum : true
Primary Path :
VIPORT State : VIPORT_CONNECTED
Link State : LINK_IDLING
HCA Info. : vnic-mthca0-1
Heartbeat : 100
IOC String : EVIC in Chassis 0x00066a00db000010, Slot 4, Ioc 1
IOC GUID : 66a01de000037
DGID : fe8000000000000000066a11de000037
P Key : ffff
Secondary Path :
VIPORT State : VIPORT_DISCONNECTED
Link State : INVALID STATE
HCA Info. : vnic-mthca0-2
Heartbeat : 100
IOC String :
IOC GUID : 66a01de000037
DGID : 00000000000000000000000000000000
P Key : 0
This information is collected from /sys/class/infiniband_qlgc_vnic/interfaces/
directory under which there is a separate directory corresponding to each
VNIC interface.
F) Deleting a specific QLogic VNIC interface
VNIC interfaces can be deleted by writing the name of the interface to
the /sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic file.
For example to delete interface veth0
echo -n veth0 > /sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic
G) Forced Failover feature for QLogic VNIC.
VNIC interfaces, when configured with failover configuration, can be
forced to failover to use other active path. For example, if VNIC interface
"veth1" is configured with failover configuration, then to switch to other
path, use command:
echo -n veth1 > /sys/class/infiniband_qlgc_vnic/interfaces/force_failover
This will make the VNIC interface veth1 switch to the other active path, even
though its current path, before the forced failover operation, is not in a
disconnected state.
This feature allows the network administrator to control the path of the
VNIC traffic at run time; neither reconfiguration nor a restart of the VNIC
service is required.
Once enabled as mentioned above, forced failover can be cleared with
the unfailover command:
echo -n veth1 > /sys/class/infiniband_qlgc_vnic/interfaces/unfailover
This clears the forced failover on VNIC interface "veth1". Once cleared,
if module parameter "default_prefer_primary" is set to 1, then VNIC
interface switches back to primary path. If module parameter
"default_prefer_primary" is set to 0, then VNIC interface continues to
use its current active path.
Forced failover thus takes priority over default_prefer_primary; the
default_prefer_primary feature will not be active until the forced
failover is cleared through "unfailover".
Besides this forced failover, QLogic VNIC service does retain its
original failover feature which gets triggered when current active
path gets disconnected.
H) Infiniband Quality of Service for VNIC:-
To enforce Infiniband Quality of Service (QoS) for the VNIC protocol, no
configuration is required on the host side. The service level for the
VNIC protocol can be configured using the service ID or the target port GUID
in the "qos-ulps" section of /etc/opensm/qos-policy.conf on the host
running OpenSM.
Service IDs for the EVIC IO controllers can be obtained from the output
of ib_qlgc_vnic_query:
HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
IO Unit Info:
port LID: 0008
port GID: fe8000000000000000066a11de000070
change ID: 0003
max controllers: 0x02
controller[ 1]
GUID: 00066a01de000070
vendor ID: 00066a
device ID: 000030
IO class : 2000
ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
service entries: 2
------> service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
------> service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
IO Unit Info:
port LID: 0009
port GID: fe8000000000000000066a21de000070
change ID: 0003
max controllers: 0x02
controller[ 2]
GUID: 00066a02de000070
vendor ID: 00066a
device ID: 000030
IO class : 2000
ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
service entries: 2
------> service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
------> service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
Numbers 1000066a00000002, 1000066a00000102 are the required service IDs.
Finer control on quality of service for the VNIC protocol can be achieved by
configuring the service level using target port guid values of the EVIC IO
controllers. Target port guid values for the EVIC IO controllers can be
obtained using "saquery" command supplied by OFED package.
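Putting the two options together, a hypothetical "qos-ulps" section using one of the service IDs shown above, plus a target-port-guid rule (the GUID value and the SL numbers are placeholders), might look like:

```
qos-ulps
    default                                  : 0
    any, service-id 0x1000066a00000002       : 1
    any, target-port-guid 0x00066a02de000070 : 2
end-qos-ulps
```

OpenSM would then return SL 1 for PR/MPR queries carrying that service ID, and SL 2 for paths to that target port.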
I) QLogic VNIC Dynamic Update Daemon Tool and Hot Swap support:-
This tool is started and stopped as part of the QLogic VNIC service
(refer to C above) and provides the following features:
1. Dynamic update of disconnected interfaces (which have been configured
WITHOUT using the DGID option in the configuration file):
At VNIC driver start-up, if the HCA port through which a particular VNIC
interface path (primary or secondary) connects to its target is down, or the
EVIC/VEx IOC is not available, then the parameters (DGID etc.) required for
connecting to the EVIC/VEx cannot be determined, and the corresponding VNIC
interface path is not available when the VNIC service starts. The daemon
constantly monitors the configured VNIC interfaces to check whether any of
them are disconnected. If any are, it scans for available EVIC/VEx targets
using the "ib_qlgc_vnic_query" tool. When the daemon sees that the EVIC/VEx
IOC configured for a given path of a VNIC interface has become available, it
dynamically updates the VNIC kernel driver with the information required to
establish the connection for that path. In this way, the interface gets
connected to its configured EVIC/VEx whenever the target becomes available,
without any manual intervention.
2. Hot Swap support :
Hot swap is an operation in which an existing EVIC/VEx is replaced by another
EVIC/VEx (in the same slot of the switch chassis as the older one). In such a
case, the current connection for the corresponding VNIC interface will have to
be re-established. The daemon detects this hot swap case and re-establishes
the connection automatically. To make use of this feature of the daemon, it is
recommended that IOCSTRING be used in the configuration file to configure the
VNIC interfaces.
This is because, after a hot swap, the IOCSTRING remains the same even
though all other parameters of the EVIC/VEx (DGID, IOCGUID, etc.) change.
The daemon therefore monitors disconnected interfaces for changes in
IOCGUID and DGID based on the IOCSTRING, and if these values have changed
it updates the kernel driver so that the VNIC interface can start using
the new EVIC/VEx.
If DGID and IOCGUID have been used in addition to IOCSTRING to configure
a VNIC interface, then on a hot swap the daemon will update the parameters
as required. However, to have that VNIC interface available immediately on
the next restart of the QLogic VNIC service, make sure to update the
configuration file with the new DGID and IOCGUID values. Otherwise, the
creation of such interfaces will be delayed until the daemon runs and
updates the parameters.
J) Information about creating VLAN interfaces
The EVIC/VEx supports VLAN tagging without having to explicitly create VLAN
interfaces for the VNIC interface on the host. This is done by enabling
Egress/Ingress tagging on the EVIC/VEx and setting the "Host ignores VLAN"
option for the VNIC interface. The "Host ignores VLAN" option is enabled
by default, so VLAN tags are ignored on the host by the QLogic VNIC
driver; as a result, VLAN interfaces created explicitly (using the vconfig
command) for a given VNIC interface will not be operational.
If you want to explicitly create a VLAN interface for a given VNIC interface,
then you will have to disable the "Host ignores VLAN" option for the
VNIC interface on the EVIC/VEx. The qlgc_vnic service must be restarted
on the host after disabling (or enabling) the "Host ignores VLAN" option.
Please refer to the EVIC/VEx documentation for more information on Egress/Ingress
port tagging feature and disabling the "Host ignores VLAN" option.
K) Information about enabling IB Multicast for QLogic VNIC interface
The QLogic VNIC driver has been upgraded to support the IB multicasting
feature of EVIC/VEx. This feature enables the QLogic VNIC host driver to
support IP multicasting more efficiently: with it enabled, an InfiniBand
multicast group acts as a carrier of IP multicast traffic, and the EVIC
uses such IB multicast groups to forward IP multicast traffic to the VNIC
interfaces that are members of a given IP multicast group. In the older
QLogic VNIC host driver, IB multicasting was not used to carry IP
multicast traffic.
By default, IB multicasting is disabled on EVIC/VEx, but it is enabled by
default in the QLogic VNIC host driver.
To disable the IB multicast feature in the host driver, set the parameter
IB_MULTICAST=FALSE in the interface configuration in the VNIC
configuration file. Please refer to qlgc_vnic.cfg.sample for more details
on configuring VNIC interfaces for IB multicasting.
IB multicasting also needs to be enabled on the EVIC/VEx. Please refer to
the EVIC/VEx documentation for more information on enabling the IB
multicast feature on the EVIC/VEx.
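As a sketch, reusing the placeholder values from qlgc_vnic.cfg.sample, an
interface entry with IB multicast disabled on the host might look like this
(verify the exact parameter placement against the sample file):

```text
{CREATE; NAME="eioc1";
         IOCGUID=0x66a0130000001;
         IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1";
         IB_MULTICAST=FALSE;
}
```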
L) Basic Troubleshooting
1. In case of any problems, make sure that:
a) The HCA ports you are trying to use have IB cables connected and are in an
active state. You can use the "ibv_devinfo" tool to check the state of
your HCA ports.
b) If your HCA ports are not active, check if an SM is running on the fabric
where the HCA ports are connected. If you have done a full install of
OFED, you can use the "sminfo" command ("sminfo -P 2" for port 2) to
check SM information.
c) Make sure that the EVIC/VEx is powered up and its Ethernet cables are connected
properly.
d) Check /var/log/messages for any error messages.
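The checks in 1(a)-(d) can be collected into a quick status pass (the
command names are those referenced in this README; the grep patterns are
only suggestions, and output will vary by system):

```text
$ ibv_devinfo | grep -i state        # HCA ports should show PORT_ACTIVE
$ sminfo -P 2                        # SM information for port 2
$ ifconfig -a | grep eioc            # look for the configured VNIC interfaces
$ grep -i vnic /var/log/messages     # driver and daemon error messages
```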
2. If some of your VNIC interfaces are not available:
a) Use the "ifconfig" tool with the -a option to see if all interfaces are created.
It is possible that the interfaces are created but do not have an
IP address. Make sure that you have setup a correct ifcfg-XXX file for your
VNIC interfaces for automatic assignment of IP addresses.
If the VNIC interface is created and the ifcfg file is also correct
but the VNIC interface is not UP, make sure that the target EVIC/VEx
IOC has an Ethernet cable properly connected.
b) Make sure that the VNIC configuration file has been setup properly
with correct EVIC/VEx target DGID/IOCGUID/IOCSTRING information and
instance numbers.
c) Make sure that the EVIC/VEx target IOC specified for that interface is
available. You can use the "ib_qlgc_vnic_query" tool to verify this. If it is not
available when you started the service, but it becomes available later
on, then the QLogic VNIC dynamic update daemon will bring up the
interface when the target becomes available. You will see messages in
/var/log/messages when the corresponding interface is created.
d) Make sure that you have not exceeded the total number of Virtual interfaces
supported by the EVIC/VEx. You can check the total number of Virtual interfaces
currently in use on the HTTP interface of the EVIC/VEx.
Open Fabrics Enterprise Distribution (OFED)
mthca in OFED 1.4.1 Release Notes
May 2009
===============================================================================
Table of Contents
===============================================================================
1. Overview
2. Fixed Bugs since OFED 1.3.1
3. Bug fixes and enhancements since OFED 1.4
4. Known Issues
===============================================================================
1. Overview
===============================================================================
mthca is the low-level driver implementation for the following Mellanox
Technologies HCAs: InfiniHost, InfiniHost III Ex and InfiniHost III Lx.
mthca Available Parameters
--------------------------
In order to set mthca parameters, add the following line to
/etc/modprobe.conf:
options ib_mthca parameter=<value>
mthca parameters:
tune_pci - increase PCI burst from the default set by BIOS if
nonzero
msi - attempt to use MSI if nonzero
msi_x - attempt to use MSI-X if nonzero
fw_cmd_doorbell - post firmware commands through doorbell page if non-
zero (and supported by firmware)
catas_reset_disable - disable device reset on a catastrophic event if non-
zero
debug_level - Enable debug tracing if > 0 (int)
num_qp - maximum number of QPs per HCA (int)
rdb_per_qp - number of RDB buffers per QP (int)
num_cq - maximum number of CQs per HCA (int)
num_mcg - maximum number of multicast groups per HCA (int)
num_mpt - maximum number of memory protection table entries
per HCA (int)
num_mtt - maximum number of memory translation table segments
per HCA (int)
num_udav - maximum number of UD address vectors per HCA (int)
fmr_reserved_mtts - number of memory translation table segments reserved
for FMR (int)
log_mtts_per_seg - log2 number of MTT entries per segment (1-5)
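For example, a line enabling MSI-X and raising the QP limit might look like
this (the values shown are illustrative only, not recommendations):

```text
options ib_mthca msi_x=1 num_qp=131072
```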
===============================================================================
2. Fixed Bugs since OFED 1.3.1
===============================================================================
- IB_EVENT_LID_CHANGE is generated more appropriately.
- Improved MTT buddy allocator (free count per order).
- Fix check of max_send_sge for special QPs.
- RESET->ERR QP state transition no longer supported (IB Spec 1.2.1).
- Clear ICM pages before handing to FW.
- Fixed race condition between create QP and destroy QP (bugzilla 1389)
===============================================================================
3. Bug fixes and enhancements since OFED 1.4
===============================================================================
- Added a module parameter (log_mtts_per_seg) for the number of MTTs per
  segment. This enables registering more memory with the same number of
  segments.
- Brought INIT_HCA and other command timeouts into consistency with the PRM.
  This solves an issue seen when more than 2^18 QPs were configured.
===============================================================================
4. Known Issues
===============================================================================
1. A UAR size other than 8MB prevents mthca driver loading. The default UAR
size is 8MB. If the size is changed, the following error message will be
logged to /var/log/messages upon attempting to load the mthca driver:
ib_mthca 0000:04:00.0: Missing UAR, aborting.
2. If a user level application using multicast receives a control signal
in the process of detaching from a multicast group, its QP may remain a
member of the multicast group (in HCA).
Workaround: Destroy the multicast group after detaching the QP from it.
3. In mem-free devices, RC QPs can be created with a maximum of (max_sge - 1)
entries only; UD QPs can be created with a maximum of (max_sge - 3) entries.
4. Performance can be degraded due to a wrong BIOS configuration:
The PCI Express specification requires the BIOS to set the MaxReadReq
register for each HCA card for maximum performance and stability.
If you experience bandwidth performance degradation, try forcing the card to
behave not according to the PCI Express specification by setting the
tune_pci=1 module parameter. This tune_pci=1 assignment was the default
setting in OFED 1.0; therefore, it may have masked performance degradation
on some systems.
If tune_pci=1 improves bandwidth, please report the issue to your BIOS
vendor. Please note that Mellanox Technologies does not recommend using
tune_pci=1 in production systems: working with tune_pci=1 set is untested
and is known to trigger instability issues on some platforms.
Open Fabrics Enterprise Distribution (OFED)
RDS in OFED 1.4.1 Release Notes
May 2009
===============================================================================
Table of Contents
===============================================================================
1. Overview
2. Supported Platforms
3. Installation & Configuration
4. New Features
5. Bug fixes and Enhancements since OFED 1.3.1
6. Bug fixes and Enhancements since OFED 1.3
7. Bug fixes and Enhancements since OFED 1.2
8. Known Issues
===============================================================================
1. Overview
===============================================================================
RDS (Reliable Datagram Sockets) is a socket API that provides reliable,
in-order datagram delivery between sockets over a variety of transports.
For details see RDS_README.txt and man 7 rds.
===============================================================================
2. Supported Platforms
===============================================================================
RHEL4.0 Update 4,5,6
RHEL5.0 Update 1,2
SLES 10
===============================================================================
3. Installation & Configuration
===============================================================================
To install RDS select rds in OFED's manual installation or put 'rds=y' in the
ofed.conf for unattended installation.
To load RDS module upon boot edit file '/etc/infiniband/openib.conf' as
follows:
# Load RDS module
RDS_LOAD=yes
===============================================================================
4. New Features
===============================================================================
RDS protocol version 3.1.
RDS v3.1 is backwards compatible with v3.0 via protocol negotiation.
Support for iWARP (bcopy mode only).
Locking and scalability improvements.
Credit-based flow control for iWARP transport.
TCP transport removed.
===============================================================================
5. Bug fixes and Enhancements since OFED 1.3.1
===============================================================================
- RDMA completion notifications are signalled when the IB stack gives us the
completion event for the accompanying RDS message. This is a change from the
1.3.x behavior, which signalled completion notifications when the RDS message
was ACKed.
- Fixed bugs associated with congestion monitoring.
- FMR pool size increased from 2K to 4K
- Added support for RDMA_CM_EVENT_ADDR_CHANGE event.
- RDS should now work on Qlogic HCAs.
===============================================================================
6. Bug fixes and Enhancements since OFED 1.3
===============================================================================
- Fix a bug in RDMA signaling
- Add 3 more stats counters
- Fix a kernel crash that can occur when RDS/IB connection drops
- Fixes for RDMA API
===============================================================================
7. Bug fixes and Enhancements since OFED 1.2
===============================================================================
1) Wire protocol for RDS v3 and RDS v2 are not compatible.
2) RDS over TCP is disabled in OFED 1.3. We will re-enable in future release.
3) Congestion monitoring support gives the application more fine-grained
control.
With explicit monitoring, the application polls for POLLIN as before, and
additionally uses the RDS_CONG_MONITOR socket option to install a 64-bit mask
value in the socket, where each bit corresponds to a group of ports.
When a congestion update arrives, RDS checks the set of ports that became
uncongested against the bit mask installed in the socket. If they overlap, a
control message is enqueued on the socket, and the application is woken up.
When the application calls recvmsg(2), it is given the control message
containing the bitmap on the socket.
===============================================================================
8. Known Issues
===============================================================================
1. RDMAs over 1 MiB not supported.
Open Fabrics Enterprise Distribution (OFED)
Version 1.4
README
December 2008
This is the OpenFabrics Enterprise Distribution (OFED) version 1.4
software package supporting InfiniBand and iWARP fabrics. It is composed
of several software modules intended for use on a computer cluster
constructed as an InfiniBand subnet or an iWARP network.
*** Note: If you plan to upgrade OFED on your cluster, please upgrade all
its nodes to this new version.
This document includes the following sections:
1. HW and SW Requirements
2. OFED Package Contents
3. Installing OFED Software
4. Starting and Verifying the IB Fabric
5. MPI (Message Passing Interface)
6. Related Documentation
OpenFabrics Home Page: http://www.openfabrics.org
The OFED 1.4 software download is available at
http://www.openfabrics.org/builds/ofed-1.4/release/
Please email bug and error reports to your InfiniBand vendor, or use
Bugzilla: https://bugs.openfabrics.org/
1. HW and SW Requirements:
==========================
1) Server platform with InfiniBand HCA or iWARP RNIC (see OFED Distribution
Release Notes for details)
2) Linux operating system (see OFED Distribution Release Notes for details)
3) Administrator privileges on your machine(s)
4) Disk Space: - For Build & Installation: 300MB
- For Installation only: 200MB
5) For the OFED Distribution to compile on your machine, some software
packages of your operating system (OS) distribution are required. These
are listed here.
OS Distribution Required Packages
--------------- ----------------------------------
General:
o Common to all gcc, glib, glib-devel, glibc, glibc-devel,
glibc-devel-32bit (to build 32-bit libraries on x86_64
and ppc64), zlib-devel
o RedHat, Fedora kernel-devel, rpm-build
o SLES 9.0 kernel-source, udev, rpm
o SLES 10.0 kernel-source, rpm
Note: To build 32-bit libraries on x86_64 and ppc64 platforms, the 32-bit
glibc-devel should be installed.
Specific Component Requirements:
o Mvapich a Fortran Compiler (such as gcc-g77)
o Mvapich2 libstdc++-devel, sysfsutils (SuSE),
libsysfs-devel (RedHat5.0, Fedora C6)
o Open MPI libstdc++-devel
o ibutils tcl-8.4, tcl-devel-8.4, tk, libstdc++-devel
o tvflash pciutils-devel
o mstflint libstdc++-devel (32-bit on ppc64), gcc-c++
Note: The installer will warn you if you attempt to compile any of the
above packages and do not have the prerequisites installed.
*** Important Note for open-iscsi users:
Installing iSER as part of OFED installation will also install open-iscsi.
Before installing OFED, please uninstall any open-iscsi version that may
be installed on your machine. Installing OFED with iSER support while
another open-iscsi version is already installed will cause the installation
process to fail.
2. OFED Package Contents
========================
The OFED Distribution package generates RPMs for installing the following:
o OpenFabrics core and ULPs
- HCA drivers (mthca, mlx4, ipath, ehca)
- iWARP driver (cxgb3, nes)
- core
- Upper Layer Protocols: IPoIB, SDP, SRP Initiator and target, iSER
Initiator and target, RDS, uDAPL, qlgc_vnic and NFS-RDMA.
o OpenFabrics utilities
- OpenSM: InfiniBand Subnet Manager
- Diagnostic tools
- Performance tests
o MPI
- OSU MVAPICH stack supporting the InfiniBand and iWARP interface
- Open MPI stack supporting the InfiniBand and iWARP interface
- OSU MVAPICH2 stack supporting the InfiniBand and iWARP interface
- MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta)
o Extra packages
- open-iscsi: open-iscsi initiator with iSER support
- ib-bonding: Bonding driver for IPoIB interface
o Sources of all software modules (under conditions mentioned in the
modules' LICENSE files)
o Documentation
3. Installing OFED Software
============================
The default installation directory is: /usr
Install Quick Guide:
1) Download the OFED-1.4.tgz file and extract it: tar xzvf OFED-1.4.tgz
2) Change into directory: cd OFED-1.4
3) Run as root: ./install.pl
4) Follow the directions to install required components. For details, please see
OFED_Installation_Guide.txt under OFED-1.4/docs.
Notes:
1. The install script removes previously installed IB packages and
re-installs from scratch. You will be prompted to acknowledge the deletion
of the old packages. However, configuration files (.conf) will be
preserved and saved with a ".rpmsave" extension.
2. After the installer completes, information about the OFED
installation such as the prefix, the kernel version, and
installation parameters can be found by running
/etc/infiniband/info.
3. Information on the driver version and source git trees can be found
using the ofed_info utility
4. Starting and Verifying the IB Fabric
=======================================
1) If you rebooted your machine after the installation process completed,
IB interfaces should be up. If you did not reboot your machine, please
enter the following command: /etc/init.d/openibd start
2) Check that the IB driver is running on all nodes: ibv_devinfo should print
"hca_id: " on the first line.
3) Make sure that a Subnet Manager is running by invoking the sminfo utility.
If an SM is not running, sminfo prints:
sminfo: iberror: query failed
If an SM is running, sminfo prints the LID and other SM node information.
Example:
sminfo: sm lid 0x1 sm guid 0x2c9010b7c2ae1, activity count 20 priority 1
To check if OpenSM is running on the management node, enter: /etc/init.d/opensmd status
To start OpenSM, enter: /etc/init.d/opensmd start
Note: OpenSM parameters can be set via the file /etc/opensm/opensm.conf
4) Verify the status of ports by using ibv_devinfo: all connected ports should
report a "PORT_ACTIVE" state.
5) Check the network connectivity status: run ibchecknet to see if the subnet
is "clean" and ready for ULP/application use. The following tools display
more information in addition to IB info: ibnetdiscover, ibhosts, and
ibswitches.
6) Alternatively, instead of running steps 3 to 5 you can use the ibdiagnet
utility to perform a set of tests on your network. Upon finding an error,
ibdiagnet will print a message starting with a "-E-". For a more complete
report of the network features you should run ibdiagnet -r. If you have a
topology file describing your network you can feed this file to ibdiagnet
(using the option: -t ) and all reports will use the names as they appear
in the file (instead of LIDs, GUIDs and directed routes).
7) To run an application over SDP set the following variables:
env LD_PRELOAD='stack_prefix'/lib/libsdp.so
LIBSDP_CONFIG_FILE='stack_prefix'/etc/libsdp.conf
(or LD_PRELOAD='stack_prefix'/lib64/libsdp.so on 64 bit machines)
The default 'stack_prefix' is /usr
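The settings above can be wrapped in a small launcher script so that an
unmodified TCP application runs over SDP. This is a minimal sketch assuming
the default stack_prefix of /usr on a 64-bit machine; the script name
"sdprun" is hypothetical:

```shell
#!/bin/sh
# Generate a small wrapper (hypothetical name "sdprun") that preloads
# libsdp before launching an application, redirecting its TCP sockets
# to SDP. Paths assume stack_prefix=/usr on a 64-bit machine.
cat > /tmp/sdprun <<'EOF'
#!/bin/sh
LD_PRELOAD=/usr/lib64/libsdp.so \
LIBSDP_CONFIG_FILE=/usr/etc/libsdp.conf \
exec "$@"
EOF
chmod +x /tmp/sdprun
```

An application would then be started as, for example: /tmp/sdprun ssh somehost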
5. MPI (Message Passing Interface)
==================================
In Step 2 of the main menu of install.pl, options 2, 3 and 4 can
install one or more MPI stacks. Multiple MPI stacks can be installed
simultaneously -- they will not conflict with each other.
Three MPI stacks are included in this release of OFED:
- MVAPICH 1.1.0-3355
- Open MPI 1.3.2
- MVAPICH2 1.2p1
OFED also includes 4 basic tests that can be run against each MPI
stack: bandwidth (bw), latency (lt), Intel MPI Benchmark and Presta. The tests
are located under: /mpi///tests/.
Please see MPI_README.txt for more details on each MPI package and how to run
the tests.
6. Related Documentation
========================
1) Release Notes for OFED Distribution components are to be found under
OFED-1.4/docs and, after the package installation, under
/usr/share/doc/ofed-docs-1.4 for RedHat
/usr/share/doc/packages/ofed-docs-1.4 for SuSE.
2) For a detailed installation guide, see OFED_Installation_Guide.txt.
3) For more information, please visit the OFED web-page http://www.openfabrics.org
For more information contact your vendor.
################################################################################
# #
# NFS/RDMA README #
# #
################################################################################
Author: NetApp and Open Grid Computing
Adapted for OFED 1.4 (from linux-2.6.27.8/Documentation/filesystems/nfs-rdma.txt)
by Jeff Becker
Table of Contents
~~~~~~~~~~~~~~~~~
- Overview
- OFED 1.4 limitations
- Getting Help
- Installation
- Check RDMA and NFS Setup
- NFS/RDMA Setup
Overview
~~~~~~~~
This document describes how to install and setup the Linux NFS/RDMA client
and server software.
The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server
was first included in the following release, Linux 2.6.25.
In our testing, we have obtained excellent performance results (full 10Gbit
wire bandwidth at minimal client CPU) under many workloads. The code passes
the full Connectathon test suite and operates over both Infiniband and iWARP
RDMA adapters.
OFED 1.4.1 limitations:
~~~~~~~~~~~~~~~~~~~~~
NFS-RDMA is supported for the following releases:
- Redhat Enterprise Linux (RHEL) version 5.1
- Redhat Enterprise Linux (RHEL) version 5.2
- Redhat Enterprise Linux (RHEL) version 5.3
- SUSE Linux Enterprise Server (SLES) version 10, Service Pack 2
- SUSE Linux Enterprise Server (SLES) version 11
And the following kernel.org kernels:
- 2.6.22
- 2.6.26
- 2.6.27
All other Linux distributions and kernel versions are NOT supported on OFED 1.4.1.
Getting Help
~~~~~~~~~~~~
If you get stuck, you can ask questions on the
nfs-rdma-devel@lists.sourceforge.net, or general@lists.openfabrics.org
mailing lists.
Installation
~~~~~~~~~~~~
These instructions are a step by step guide to building a machine for
use with NFS/RDMA.
- Install an RDMA device
Any device supported by the drivers in drivers/infiniband/hw is acceptable.
Testing has been performed using several Mellanox-based IB cards and
the Chelsio cxgb3 iWARP adapter.
- Install OFED 1.4.1
NFS/RDMA has been tested on RHEL5.1, RHEL5.2, RHEL 5.3, SLES10SP2, SLES11,
kernels 2.6.22, 2.6.26, and 2.6.27. On these kernels, NFS-RDMA will be
installed by default if you simply select "install all", and can be
specifically included by a "custom" install.
In addition, the install script will install a version of the nfs-utils that
is required for NFS/RDMA. The binary installed will be named "mount.rnfs".
This version is not necessary for Linux Distributions with nfs-utils 1.1 or
later.
Upon successful installation, the NFS kernel modules will be placed in the
directory /lib/modules/`uname -r`/updates. It is recommended that you reboot
to ensure that the correct modules are loaded.
Check RDMA and NFS Setup
~~~~~~~~~~~~~~~~~~~~~~~~
Before configuring the NFS/RDMA software, it is a good idea to test
your new kernel to ensure that the kernel is working correctly.
In particular, it is a good idea to verify that the RDMA stack
is functioning as expected and standard NFS over TCP/IP and/or UDP/IP
is working properly.
- Check RDMA Setup
If you built the RDMA components as modules, load them at
this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel
card:
$ modprobe ib_mthca
$ modprobe ib_ipoib
If you are using InfiniBand, make sure there is a Subnet Manager (SM)
running on the network. If your IB switch has an embedded SM, you can
use it. Otherwise, you will need to run an SM, such as OpenSM, on one
of your end nodes.
If an SM is running on your network, you should see the following:
$ cat /sys/class/infiniband/driverX/ports/1/state
4: ACTIVE
where driverX is mthca0, ipath5, ehca3, etc.
To further test the InfiniBand software stack, use IPoIB (this
assumes you have two IB hosts named host1 and host2):
host1$ ifconfig ib0 a.b.c.x
host2$ ifconfig ib0 a.b.c.y
host1$ ping a.b.c.y
host2$ ping a.b.c.x
For other device types, follow the appropriate procedures.
- Check NFS Setup
For the NFS components enabled above (client and/or server),
test their functionality over standard Ethernet using TCP/IP or UDP/IP.
NFS/RDMA Setup
~~~~~~~~~~~~~~
We recommend that you use two machines, one to act as the client and
one to act as the server.
One time configuration:
- On the server system, configure the /etc/exports file and
start the NFS/RDMA server.
Exports entries with the following formats have been tested:
/vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash)
/vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
The IP address(es) are the client's IPoIB address(es) for an InfiniBand
HCA or the client's iWARP address(es) for an RNIC.
NOTE: The "insecure" option must be used because the NFS/RDMA client does
not use a reserved port.
Each time a machine boots:
- Load and configure the RDMA drivers
For InfiniBand using a Mellanox adapter:
$ modprobe ib_mthca
$ modprobe ib_ipoib
$ ifconfig ib0 a.b.c.d
NOTE: use unique addresses for the client and server
- Start the NFS server
Load the RDMA transport module:
$ modprobe svcrdma
Start the server:
$ /etc/init.d/nfsserver start
or
$ service nfs start
Instruct the server to listen on the RDMA transport:
$ echo rdma 20049 > /proc/fs/nfsd/portlist
- On the client system
Load the RDMA client module:
$ modprobe xprtrdma
Mount the NFS/RDMA server:
$ mount -o rdma,port=20049 :/ /mnt
To verify that the mount is using RDMA, run "cat /proc/mounts" and check
the "proto" field for the given mount.
Congratulations! You're using NFS/RDMA!
Known Issues
~~~~~~~~~~~~~~~~~~~~~~~~
If you're running NFS/RDMA over Chelsio's T3 RNIC and your clients use a
64KB page size (like PPC64 and IA64 systems) while your server uses a
4KB page size (like i386 and x86_64), then you need to mount the server
using rsize=32768,wsize=32768 to avoid overrunning the Chelsio RNIC fast
register limits. This is a known firmware limitation in the Chelsio RNIC.
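Under those assumptions, the client-side mount might look like this
("server" and the export path /vol0 are placeholders):

```text
$ mount -o rdma,port=20049,rsize=32768,wsize=32768 server:/vol0 /mnt
```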
Open Fabrics Enterprise Distribution (OFED)
SDP in OFED 1.4.1 Release Notes
May 2009
===============================================================================
Table of Contents
===============================================================================
1. Overview
2. Bug Fixes and Enhancements since OFED 1.3
3. Bug Fixes and Enhancements since OFED 1.4
4. Known Issues
5. Verification Applications/Flows/Tests
===============================================================================
1. Overview
===============================================================================
SDP in OFED is at GA level for OFED 1.4.1
===============================================================================
2. Bug Fixes and Enhancements since OFED 1.3
===============================================================================
* Cleanup
- Compilation warnings
- New kernel support
* New functionality
- sdpnetstat supplies information about socket state and process name.
* Bug fixes
- Open/close connection mechanism overhaul - many issues related to it are
fixed.
- No known kernel crashes are caused by SDP.
- Many small bugs fixed - see Bugzilla.
- Full Windows interoperability is now available.
===============================================================================
3. Bug Fixes and Enhancements since OFED 1.4
===============================================================================
SDP:
- BUG1311 - Netpipe fails with an IB_WC_LOC_LEN_ERR.
- BUG1472 - clean socket timeouts and refcount when device is removed
- BUG1502 - scheduling while atomic
- BUG1309 - SDP close is slow + fix recv buffer initial size setting
- BUG1087 - fixed recovery from failing rdma_create_qp()
===============================================================================
4. Known Issues
===============================================================================
- BUG1444 - setsockopt(SO_RCVBUF) is not working on an SDP socket. To limit
system-wide SDP memory usage for receive, use the module parameter
top_mem_usage.
- There are some issues regarding PPC and IA64 that were not fixed for this
release. Check Bugzilla for more info.
- TCP allows connecting to IP_ANY - 0.0.0.0 (as a destination address!); SDP
does not, and will reject such a connection.
- Each SDP socket currently consumes up to 2 MBytes of memory. If this value
is high for your installation, it is possible to trade off performance
for lower memory utilization per socket by reducing the value of the
"rcvbuf_scale" module parameter (default: 16).
Note: the minimum legal value for this parameter is 1.
At this parameter value, each socket will consume approximately 128 KBytes.
- Small message size performance is low when messages are sent by the client
at a rate lower than the rate at which they are consumed by the server,
and when TCP_CORK is not set. This is observed, for example, with the
iperf benchmark. As a workaround, set the TCP_CORK socket option
to ensure data is sent in at least 32K byte chunks.
- Performance is low on 32-bit kernels, as SDP utilizes high memory
to ease memory pressure. Moving to a 64-bit kernel solves this
problem even if the application remains a 32-bit one.
- By default, SDP utilizes a 2 Kbyte MTU size. This may cause PCI-X cards
using Mellanox Technologies "Infinihost" HCAs to experience low bandwidth.
Workaround: reset the MTU size to 1K in this situation, using either of
the two methods below:
1. Activate the "tavor quirk" workaround in opensm:
a. Create an opensm options cache file (/var/cache/osm/opensm.opts):
> opensm --cache-options -o
b. Add the following line to /var/cache/osm/opensm.opts:
enable_quirks TRUE
c. Rerun opensm using your usual command line options to activate
the opensm quirk option.
2. Activate the "tavor quirk" workaround in cma:
set the tavor_quirk module parameter of the rdma_cm module to value 1
(default: 0).
- BZCOPY mode is only effective for large block transfers.
By setting the /sys parameter 'sdp_zcopy_thresh' to a non-zero value, a
non-standard SDP speedup is enabled. All messages longer than
'sdp_zcopy_thresh' bytes in length will cause the user space buffer to
be pinned and the data sent directly from the original buffer. This
results in less CPU use and, on many systems, much better bandwidth.
The default 64K value for 'sdp_zcopy_thresh' is sometimes too low for
some systems. You must experiment with your hardware to select the
best value. When mixing BZCOPY and BCOPY on the same socket, the socket
can get stuck in rare situations - BUG1324.
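For the TCP_CORK workaround mentioned above, the application sets the option on its socket before writing. A minimal sketch (plain TCP shown here; under an SDP preload the same setsockopt() call is what gets intercepted; TCP_CORK is Linux-specific):

```python
import socket

# TCP_CORK delays partial frames so data goes out in larger chunks.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
corked = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CORK)
print(corked)  # 1 while the cork is set
# ... write the data ...
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)  # uncork: flush
s.close()
```

The cork is usually set before a batch of small writes and cleared afterwards to flush whatever remains.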
===============================================================================
4. Verification Applications/Flows/Tests
===============================================================================
- ssh/sshd
- wget/netscape/firefox/apache
- netpipe
- netperf
- LTP socket tests
- iperf-2.0.2
- ttcp
- Threaded and forking echo client server examples
- Various Java client server applications (SUN:jre, BEA:jrockit/WebLogic, GNU:gij/gcj)
- Many UNIX utilities to verify that pre-load did not harm the applications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Open Fabrics Enterprise Distribution (OFED)
libsdp v. 9382 in OFED 1.4.1 Release Notes
May 2009
===============================================================================
Table of Contents
===============================================================================
1. Overview
2. New Features
3. Bug Fixes
4. Bug Fixes and Enhancements since OFED 1.4
5. Known Issues
6. Verification Applications/Flows/Tests
===============================================================================
1. Overview
===============================================================================
This document describes the contents of the libsdp OFED 1.4.1 release.
libsdp is a LD_PRELOAD-able library that can be used to migrate existing
applications to use InfiniBand Sockets Direct Protocol (SDP) instead of
TCP sockets, transparently and without recompilation. To set up libsdp
please follow the instructions below. The libsdp version for this release
is 1.3.
===============================================================================
2. New Features
===============================================================================
* Add support for new kernel options
- SIOCOUTQ ioctl support
- Add keepalive support
- New options: SOCK_KEEPALIVE, TCP_KEEPIDLE
* Add libsdp-devel sub-package
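As an illustration of the new keepalive options from an application's point of view (standard names shown: SO_KEEPALIVE and the Linux-specific TCP_KEEPIDLE; the 60-second idle time is an arbitrary example):

```python
import socket

# SO_KEEPALIVE enables keepalive probing; TCP_KEEPIDLE (Linux) sets the
# idle time in seconds before the first probe is sent.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))   # 1
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))  # 60
s.close()
```

Under an LD_PRELOAD of libsdp these same calls are intercepted and applied to the SDP socket.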
===============================================================================
3. Bug Fixes
===============================================================================
The following bugs were fixed. Note that other, less critical
or less visible, bugs were also fixed.
* Multi-threaded applications that are opening and closing many SDP sockets
quickly using the 'both' flow would incorrectly close a valid socket.
* Attempt to bind to an invalid socket using 'both' flow returns the wrong
errno.
* Applications using signal driven IO (FASYNC) on 'both' flow sockets would
fail because the second accept() would be executed on the TCP socket.
* Update error handling to set errno to a valid error code prior to
returning -1.
===============================================================================
4. Bug Fixes and Enhancements since OFED 1.4
===============================================================================
libsdp:
* Enable building libsdp on Solaris
* BUG1256 - Add epoll support
sdpnetstat:
* BUG1513 - sdpnetstat is not showing all the listening processes on ipv6 sockets.
===============================================================================
5. Known Issues
===============================================================================
* libsdp cannot provide its socket switch functionality for executables
statically linked with libc.
* When using a server to listen on both SDP and TCP, the number of
sockets is doubled.
===============================================================================
6. Verification Applications/Flows/Tests
===============================================================================
See the corresponding section in the SDP release notes above.
Infiniband HOWTO
Guy Coates
This document describes how to install and configure the OFED
infiniband software on Debian.
______________________________________________________________________
Table of Contents
1. Introduction
1.1 The latest version
1.2 What is OFED?
2. Installing the OFED Software
2.1 Installing prebuilt packages
2.2 Building packages from source
2.2.1 Install the prerequisite development packages
2.2.2 Checkout the svn tree
2.2.3 Install the upstream source (optional)
2.2.4 Build the packages.
3. Install the kernel modules
3.1 Building new kernel modules
4. Setting up a basic infiniband network
4.1 Upgrade your Infiniband card and switch firmware
4.2 Physically Connect the network
4.3 Choose a Subnet Manager
4.4 Load the kernel modules
4.5 (optional) Start opensm
4.6 Check network health
4.7 Check the extended network connectivity
4.8 testing connectivity with ibping
4.9 Testing RDMA performance
5. IP over Infiniband (IPoIB)
5.1 List the network devices
5.2 IP Configuration
5.3 Connected vs Unconnected Mode
5.4 TCP tuning
5.5 ARP and dual ported cards
6. OpenMPI
6.1 Configure IPoIB
6.2 Load the modules
6.3 Check permissions and limits
6.4 Install the mpi test programs
6.5 Configure Host Access
6.6 Run the MPI PingPong benchmark
7. SDP
7.1 Configuration
7.2 Example Using SDP with Netpipe
8. SRP
8.1 Configuration
8.2 SRP daemon configuration
8.2.1 Determine the IDs of presented devices
8.2.2 Configure srp_daemon to connect to the devices
8.3 Multipathing, LVM and formatting
9. Building Lustre against OFED
9.1 Check Compatibility
9.2 Build a lustre patched kernel
9.3 Build OFED modules for the lustre patched kernel
9.4 Configure lustre
10. Troubleshooting
10.1 General fabric troubleshooting
10.2 ib_query_gid() failed errors on mlx4 platforms
10.3 Missing XRC support
11. Tips and Tricks
11.1 Descriptive node names
12. Further Information
______________________________________________________________________
1. Introduction
This document describes how to install and configure the OFED
infiniband software on Debian. This document is intended to show you
how to configure a simple Infiniband network as quickly as possible.
It is not a replacement for the detailed documentation provided in the
ofed-docs package!
1.1. The latest version
The latest version of the howto can be found on the pkg-ofed alioth
website:
http://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html
Source is kept in the SVN repository:
http://svn.debian.org/wsvn/pkg-ofed/
1.2. What is OFED?
OFED (OpenFabrics Enterprise Distribution) is the de facto Infiniband
software stack on Linux. OFED provides a consistent set of kernel
modules and userspace libraries which have been tested together.
Further details of the Openfabrics Alliance and OFED can be found here
http://www.openfabrics.org
2. Installing the OFED Software
Before you can use your infiniband network you will need to install
the OFED software on your infiniband client machines. You can choose
to use the pre-built packages on alioth, or build your own packages
straight from the alioth SVN repository.
2.1. Installing prebuilt packages
Add the following lines to your sources.list file:
deb http://pkg-ofed.alioth.debian.org/apt/ofed ./
deb-src http://pkg-ofed.alioth.debian.org/apt/ofed ./
and run:
aptitude update
aptitude install ofed
2.2. Building packages from source
If you wish to build the OFED packages from the alioth svn repository,
use the following procedure.
2.2.1. Install the prerequisite development packages
aptitude install svn-buildpackage build-essential devscripts
2.2.2. Checkout the svn tree
svn co svn://svn.debian.org/pkg-ofed/
2.2.3. Install the upstream source (optional)
The upstream source tarballs need to be available if you want to build
pukka debian packages suitable for inclusion upstream. If you are
simply building packages for your own use, you can ignore this step.
cd pkg-ofed
mkdir tarballs
Original source tarballs can be downloaded from the repository:
apt-get source libibverbs
Alternatively, you can grab the source code directly from upstream.
http://www.openfabrics.org/downloads/OFED/
Upstream source is distributed via SRPMS; you can use alien to convert
them into tarballs.
2.2.4. Build the packages.
cd into the package you wish to build, e.g. for libibcommon:
cd pkg-ofed/libibcommon
Link in the upstream tarballs directory (optional)
ln -s -f ../tarballs .
Run svn-buildpackage from within the trunk directory.
cd pkg-ofed/libibcommon/trunk
svn-buildpackage -uc -us -rfakeroot
The build process will generate a deb in the build-area directory.
Repeat the process for the rest of the packages. Note that some
packages have build dependencies on other OFED packages. The suggested
build order is:
libibverbs
libnes
libcxgb3
libipathverbs
libmlx4
libmthca
librdmacm
libibcm
libibcommon
libibumad
libibmad
libsdp
dapl
opensm
infiniband-diags
ibutils
mstflint
perftest
qlvnictools
qperf
rds-tools
sdpnetstat
srptools
tvflash
ibsim
mpitests
ofed-docs
ofa_kernel
ofed
3. Install the kernel modules
You now need to build a set of OFED kernel modules which match the
version of the OFED software you have installed.
The Debian kernel contains a set of OFED infiniband drivers, but they
may not match the OFED userspace version you have installed. Consult the
table below to determine which OFED version the Debian kernel contains.
Debian Kernel Version    OFED Version
<=2.6.26                 1.3
>=2.6.27                 1.4
If the debian kernel modules are the incorrect version, you can build
a new set of modules using the ofa-kernel-source package. If your
kernel already includes the correct OFED kernel modules you can skip
the rest of this section. If you are in doubt, you should build a new
set of modules rather than relying on the modules shipped with the
kernel.
3.1. Building new kernel modules
You can build new kernel modules using module-assistant.
aptitude install module-assistant
Ensure you have the ofa-kernel-source package installed, and then run:
module-assistant prepare
module-assistant clean ofa-kernel
module-assistant build ofa-kernel
This procedure will create an ofa-kernel-modules deb in /usr/src. You
can then install the deb using dpkg or by running:
module-assistant install ofa-kernel
The deb can also be copied to your other infiniband hosts and
installed using dpkg.
As the deb contains replacements for existing kernel modules you will
need to either manually remove any infiniband modules which have
already been loaded, or reboot the machine, before you can use the new
modules.
The new kernel modules will be installed into /lib/modules/<kernel version>/updates. They will not overwrite the original kernel modules,
but the module loader will pick up the modules from the updates
directory in preference. You can verify that the system is using the
new kernel modules by running the modinfo command.
# modinfo ib_core
filename: /lib/modules/2.6.22.19/updates/kernel/drivers/infiniband/core/ib_core.ko
author: Roland Dreier
description: core kernel InfiniBand API
license: Dual BSD/GPL
vermagic: 2.6.22.19 SMP mod_unload
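A quicker check than reading the full modinfo output is to test the resolved module path directly (assumes the ib_core module is installed; `modinfo -n` prints only the filename the loader would use):

```shell
# succeeds with a message only when the updates copy takes precedence
if modinfo -n ib_core | grep -q '/updates/'; then
    echo "ib_core will load from the updates directory"
else
    echo "ib_core still resolves to the stock kernel module"
fi
```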
Note that if you wish to rebuild the kernel modules for any reason
(eg for a new kernel version or to continue an interrupted build) then
you must issue the "module-assistant clean" command before trying a
new build.
4. Setting up a basic infiniband network
This section describes how to set up a basic infiniband network and
test its functionality.
4.1. Upgrade your Infiniband card and switch firmware
Before proceeding you should ensure that the firmware in your switches
and infiniband cards is at the latest release. Older firmware
versions may cause interoperability and fabric stability issues. Do
not assume that, just because your hardware has come fresh from the
factory, it has the latest firmware on it.
You should follow the documentation from your vendor as to how the
firmware should be updated.
4.2. Physically Connect the network
Connect up your hosts and switches.
4.3. Choose a Subnet Manager
Each infiniband network requires a subnet manager. You can choose to
run the OFED opensm subnet manager on one of the Linux clients, or you
may choose to use an embedded subnet manager running on one of the
switches in your fabric. Note that not all switches come with a subnet
manager; check your switch documentation.
4.4. Load the kernel modules
Infiniband kernel modules are not loaded automatically. You should
add them to /etc/modules so that they are automatically loaded on
machine bootup. You will need to include the hardware specific modules
and the protocol modules.
/etc/modules:
# Hardware drivers
# Choose the appropriate modules from
# /lib/modules/<kernel version>/updates/kernel/drivers/infiniband/hw
#
#mlx4_ib # Mellanox ConnectX cards
#ib_mthca # some mellanox cards
#iw_cxgb3 # Chelsio T3 cards
#iw_nes # NetEffect cards
#
# Protocol modules
# Common modules
rdma_ucm
ib_umad
ib_uverbs
# IP over IB
ib_ipoib
# scsi over IB
ib_srp
# IB SDP protocol
ib_sdp
4.5. (optional) Start opensm
If you are going to use the opensm subnet manager, edit
/etc/default/opensm and add the port GUIDs of the interfaces on which
you wish to start opensm.
You can find the port GUIDs of your cards with the ibstat -p command:
# ibstat -p
0x0002c9030002fb05
0x0002c9030002fb06
/etc/default/opensm:
PORTS="0x0002c9030002fb05 0x0002c9030002fb06"
Note if you want to start opensm on all ports you can use the
PORTS="ALL" keyword.
Start opensm:
#/etc/init.d/opensm start
If opensm has started correctly you should see SUBNET UP messages in
the opensm logfile (/var/log/opensm.<port GUID>.log).
Mar 04 14:56:06 600685 [4580A960] 0x02 -> SUBNET UP
Note that you can start opensm on multiple nodes; one node will be the
active subnet manager and the others will put themselves into standby.
4.6. Check network health
You can now check the status of the local IB link with the ibstat
command. Connected links should be in the "LinkUp" state. The
following output is from a dual ported card, only one of which (port1)
is connected.
# ibstat
CA 'mlx4_0'
CA type: MT25418
Number of ports: 2
Firmware version: 2.3.0
Hardware version: a0
Node GUID: 0x0002c9030002fb04
System image GUID: 0x0002c9030002fb07
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 2
LMC: 0
SM lid: 1
Capability mask: 0x02510868
Port GUID: 0x0002c9030002fb05
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510868
Port GUID: 0x0002c9030002fb06
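Beyond eyeballing the output, port state can be checked mechanically. A small sketch (the awk filter is an assumption about the ibstat output format shown above; a captured sample is piped in here so the filter can be exercised anywhere, in practice pipe `ibstat` itself):

```shell
# Count ports whose physical state is not "LinkUp".
check_links() {
    awk '/Physical state:/ && $3 != "LinkUp" { bad++ }
         END { print (bad ? bad " port(s) down" : "all ports LinkUp") }'
}
printf 'Physical state: LinkUp\nPhysical state: Polling\n' | check_links
# -> 1 port(s) down
```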
4.7. Check the extended network connectivity
Once the host is connected to the infiniband network you can check the
health of all of the other network components with the ibhosts,
ibswitches and iblinkinfo commands.
ibhosts displays all of the hosts visible on the network.
# ibhosts
Ca : 0x0008f1040399d3d0 ports 2 "Voltaire HCA400Ex-D"
Ca : 0x0008f1040399d370 ports 2 "Voltaire HCA400Ex-D"
Ca : 0x0008f1040399d3fc ports 2 "Voltaire HCA400Ex-D"
Ca : 0x0008f1040399d3f4 ports 2 "Voltaire HCA400Ex-D"
Ca : 0x0002c9030002faf4 ports 2 "MT25408 ConnectX Mellanox Technologies"
Ca : 0x0002c9030002fc0c ports 2 "MT25408 ConnectX Mellanox Technologies"
Ca : 0x0002c9030002fc10 ports 2 "MT25408 ConnectX Mellanox Technologies"
ibswitches will display all of the switches in the network.
# ibswitches
Switch : 0x0008f104004121fa ports 24 "ISR9024D-M Voltaire" enhanced port 0 lid 1 lmc 0
iblinkinfo will show the status and speed of all of the links in the
network.
#iblinkinfo.pl
Switch 0x0008f104004121fa ISR9024D-M Voltaire:
1 1[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 2 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
1 2[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 13 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
1 3[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 4 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
1 4[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 26 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
1 5[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 27 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
1 6[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 24 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
1 7[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 28 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
1 8[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 25 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
1 9[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 31 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
1 10[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 32 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
1 11[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 33 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
1 12[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 29 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
1 13[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 30 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
14[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( )
1 15[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 3 1[ ] "Voltaire HCA400Ex-D" ( )
1 16[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 10 1[ ] "Voltaire HCA400Ex-D" ( )
17[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( )
18[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( )
1 19[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 7 2[ ] "Voltaire HCA400Ex-D" ( )
1 20[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 6 2[ ] "Voltaire HCA400Ex-D" ( )
1 21[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 5 2[ ] "Voltaire HCA400Ex-D" ( )
1 22[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 21 1[ ] "Voltaire HCA400Ex-D" ( )
1 23[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 9 2[ ] "Voltaire HCA400Ex-D" ( )
1 24[ ] ==( 4X 5.0 Gbps Active / LinkUp)==> 8 1[ ] "Voltaire HCA400Ex-D" ( )
4.8. Testing connectivity with ibping
ibping is an infiniband equivalent of the icmp ping command. Choose a
node on the fabric and run an ibping server:
#ibping -S
Choose another node on your network, and then ping the port GUID of
the server. (ibstat on the server will list the port GUID).
#ibping -G 0x0002c9030002fc1d
Pong from test.example.com (Lid 13): time 0.072 ms
Pong from test.example.com (Lid 13): time 0.043 ms
Pong from test.example.com (Lid 13): time 0.045 ms
Pong from test.example.com (Lid 13): time 0.045 ms
4.9. Testing RDMA performance
You can test the latency and bandwidth of a link with the ib_rdma_lat
and ib_rdma_bw commands.
To test the latency, start the server on a node:
#ib_rdma_lat
and then start a client on another node, giving it the hostname of the
server.
#ib_rdma_lat hostname-of-server
local address: LID 0x0d QPN 0x18004a PSN 0xca58c4 RKey 0xda002824 VAddr 0x00000000509001
remote address: LID 0x02 QPN 0x7c004a PSN 0x4b4eba RKey 0x82002466 VAddr 0x00000000509001
Latency typical: 1.15193 usec
Latency best : 1.13094 usec
Latency worst : 5.48519 usec
You can test the bandwidth of the link using the ib_rdma_bw command.
#ib_rdma_bw
and then start a client on another node, giving it the hostname of the
server.
#ib_rdma_bw hostname-of-server
855: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 |
855: Local address: LID 0x0d, QPN 0x1c004a, PSN 0xbf60dd RKey 0xde002824 VAddr 0x002aea4092b000
855: Remote address: LID 0x02, QPN 0x004a, PSN 0xaad03c, RKey 0x86002466 VAddr 0x002b8a4e191000
855: Bandwidth peak (#0 to #955): 1486.85 MB/sec
855: Bandwidth average: 1486.47 MB/sec
855: Service Demand peak (#0 to #955): 1970 cycles/KB
855: Service Demand Avg : 1971 cycles/KB
The perftest package contains a number of other similar benchmarking
programs to test various aspects of your network.
5. IP over Infiniband (IPoIB)
The OFED stack allows you to run TCP/IP over your infiniband network,
allowing you to run non-infiniband aware applications across your
network. Several native infiniband applications also use IPoIB for
host resolution (eg Lustre and SDP).
5.1. List the network devices
Check that the IPoIB module is loaded.
#modprobe ib_ipoib
You will now have an "ib" network interface for each of your
infiniband cards.
#ifconfig -a
ib0 Link encap:UNSPEC HWaddr 80-06-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
BROADCAST MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
ib1 Link encap:UNSPEC HWaddr 80-06-00-49-FE-80-00-00-00-00-00-00-00-00-00-00
BROADCAST MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
5.2. IP Configuration
You can now configure the ib network devices using
/etc/network/interfaces.
auto ib0
iface ib0 inet static
address 172.31.128.50
netmask 255.255.240.0
broadcast 172.31.143.255
Bring the network device up, as normal.
ifup ib0
5.3. Connected vs Unconnected Mode
IPoIB can run over two infiniband transports, Unreliable Datagram (UD)
mode or Connected mode (CM). The differences between these two modes
are described in:
RFC4392 - IP over InfiniBand (IPoIB) Architecture
RFC4391 - Transmission of IP over InfiniBand (IPoIB) (UD mode)
RFC4755 - IP over InfiniBand: Connected Mode
ADDME: Pro/cons of these two methods?
You can switch between these two modes at runtime with:
echo datagram > /sys/class/net/ibX/mode
echo connected > /sys/class/net/ibX/mode
The default is datagram (UD) mode. If you wish to use CM then you can
add a script to /etc/network/if-up.d to automatically set
CM mode on your interfaces when they are configured.
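Such a hook might look like the sketch below (the filename is an example; ifupdown exports the interface name to hook scripts in the IFACE environment variable):

```shell
#!/bin/sh
# /etc/network/if-up.d/zz-ipoib-connected  (name is an example)
# Put IPoIB interfaces into connected mode as they are brought up.
case "$IFACE" in
  ib*)
    [ -w "/sys/class/net/$IFACE/mode" ] && \
        echo connected > "/sys/class/net/$IFACE/mode"
    ;;
esac
exit 0
```

Remember to make the script executable, or ifupdown will silently skip it.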
5.4. TCP tuning
In order to obtain maximum IPoIB throughput you may need to tweak the
MTU and various kernel TCP buffer and window settings. See the
details in the ipoib_release_notes.txt document in the ofed-docs
package.
5.5. ARP and dual ported cards
If you have a dual ported card with both ports on the same IB subnet,
but different IP subnets, you will need to tweak the ARP settings for
the IPoIB interfaces. See ipoib_release_notes.txt in the ofed-docs
package for a full discussion of this issue.
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_ignore=1
6. OpenMPI
This section describes how to configure OpenMPI to use Infiniband.
6.1. Configure IPoIB
OpenMPI uses IPoIB for job startup and tear-down. You should configure
IPoIB on all of your hosts.
6.2. Load the modules
Ensure the rdma_ucm module is loaded.
modprobe rdma_ucm
6.3. Check permissions and limits
Users who want to run MPI jobs will need to have write permissions for
the following devices:
/dev/infiniband/uverbs*
/dev/infiniband/rdma_cm*
The simplest way to do this is to add the users to the rdma group. If
that is not suitable for your site, you can change the permissions
and ownership of these devices by editing the following udev rules:
/etc/udev/rules.d/50-udev.rules
/etc/udev/rules.d/91-permissions.rules
OpenMPI will need to pin memory. Edit /etc/security/limits.conf and
add the line:
* hard memlock unlimited
6.4. Install the mpi test programs
Check the mpitests package is installed.
aptitude install mpitests
6.5. Configure Host Access
OpenMPI uses ssh to spawn jobs on remote hosts. You should configure a
public/private keypair to ensure that you can ssh between hosts
without entering a password. You should also ensure that your login
process is silent.
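A hedged sketch of the key setup (filenames and hostnames are examples; run as the user who will launch the MPI jobs):

```shell
# generate a passwordless keypair if none exists, then push it to a peer
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
ssh-copy-id hostB          # repeat for every host in the MPI hostfile
# verify the login is both password-free and silent:
ssh hostB true
```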
6.6. Run the MPI PingPong benchmark
We will use the MPI PingPong benchmark for our testing. By default,
openmpi should use infiniband networks in preference to any tcp
networks it finds. However, we will force mpi to ignore tcp networks
to ensure that it is using the infiniband network.
#!/bin/bash
#Infiniband MPI test program
#Edit the hosts below to match your test hosts
cat > /tmp/hostfile.$$.mpi <
# PingPong
[HostB][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[HostB][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[HostA][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[HostA][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.53 0.00
1 1000 1.44 0.66
2 1000 1.42 1.34
4 1000 1.41 2.70
8 1000 1.48 5.15
16 1000 1.50 10.15
32 1000 1.54 19.85
64 1000 1.79 34.05
128 1000 3.01 40.56
256 1000 3.56 68.66
512 1000 4.46 109.41
1024 1000 5.37 181.92
2048 1000 8.13 240.25
4096 1000 10.87 359.48
8192 1000 15.97 489.17
16384 1000 30.54 511.68
32768 1000 55.01 568.12
65536 640 122.20 511.46
131072 320 207.20 603.27
262144 160 377.10 662.96
524288 80 706.21 708.00
1048576 40 1376.93 726.25
2097152 20 1946.00 1027.75
4194304 10 3119.29 1282.34
If you encounter any errors read the excellent OpenMPI troubleshooting
guide. http://www.openmpi.org
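For reference, a minimal direct invocation of the same benchmark, forcing the InfiniBand transport, might look like this (hostnames, rank count and the IMB-MPI1 binary name are assumptions about your setup):

```shell
# two ranks on two hosts; excluding the tcp BTL by naming the allowed ones
printf 'hostA\nhostB\n' > /tmp/hostfile.mpi
mpirun --hostfile /tmp/hostfile.mpi -np 2 \
       --mca btl openib,self,sm IMB-MPI1 PingPong
```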
If you want to compare infiniband performance with your ethernet/TCP
networks, you can re-run the tests using flags to tell openmpi to use
your ethernet network. (The example below assumes that your test nodes
are connected via eth0).
#!/bin/bash
#TCP MPI test program
#Edit the hosts below to match your test hosts
cat > /tmp/hostfile.$$.mpi <
0.22 Mbps in 34.04 usec
1: 2 bytes 2937 times --> 0.45 Mbps in 33.65 usec
2: 3 bytes 2971 times --> 0.69 Mbps in 33.41 usec
121: 8388605 bytes 3 times --> 2951.89 Mbps in 21680.99 usec
122: 8388608 bytes 3 times --> 3008.08 Mbps in 21276.00 usec
123: 8388611 bytes 3 times --> 2941.76 Mbps in 21755.66 usec
Now repeat the test, but force netpipe to use SDP rather than TCP.
nodeA# LD_PRELOAD=libsdp.so NPtcp
nodeB# LD_PRELOAD=libsdp.so NPtcp -h 10.0.0.1
Send and receive buffers are 16384 and 87380 bytes
(A bug in Linux doubles the requested buffer sizes)
Now starting the main loop
0: 1 bytes 9765 times --> 1.45 Mbps in 5.28 usec
1: 2 bytes 18946 times --> 2.80 Mbps in 5.46 usec
2: 3 bytes 18323 times --> 4.06 Mbps in 5.63 usec
121: 8388605 bytes 5 times --> 7665.51 Mbps in 8349.08 usec
122: 8388608 bytes 5 times --> 7668.62 Mbps in 8345.70 usec
123: 8388611 bytes 5 times --> 7629.04 Mbps in 8389.00 usec
You should see a significant increase in performance when using SDP.
8. SRP
SRP (SCSI RDMA Protocol) is a protocol that
allows the use of SCSI devices across infiniband. If you have
infiniband storage, you can use SRP to access the devices.
8.1. Configuration
Ensure that your infiniband storage is presented to the host in
question. Check your storage controller documentation. Ensure that
the ib_srp kernel module is loaded and that the srptools package is
installed.
modprobe ib_srp
8.2. SRP daemon configuration
srp_daemon is responsible for discovering and connecting to SRP
targets. The default configuration shipped with srp_daemon is to
ignore all presented devices; this is a failsafe to prevent devices
from being mounted by accident on the wrong hosts.
The srp_daemon config file /etc/srp_daemon.conf has a simple syntax
and is described in the srp_daemon(1) manpage. Each line in this file
is a rule which can be either to allow connection or to disallow
connection according to the first character in the line (a or d
accordingly) and ID of the storage device.
8.2.1. Determine the IDs of presented devices
You can determine the IDs of SRP devices presented to your hosts by
running the ibsrpdm -c command.
# ibsrpdm -c
id_ext=50001ff10005052a,ioc_guid=50001ff10005052a,dgid=fe8000000000000050001ff10005052a,pkey=ffff,service_id=2a050500f11f0050
8.2.2. Configure srp_daemon to connect to the devices
Once we have the IDs of the devices, we can add them to
/etc/srp_daemon.conf. You can also specify other srp related options
for the target, such as max_cmd_per_lun and max_sect. These are
storage specific; check your vendor documentation for recommended
values.
# This rule allows connection to our target
a id_ext=50001ff10005052a,ioc_guid=50001ff10005052a,max_cmd_per_lun=32,max_sect=65535
# This rule disallows everything else
d
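When many targets are presented, the allow rules can be generated from the ibsrpdm output rather than typed by hand. A hedged sketch (check srp_daemon(1) for which keys are honoured in rules before relying on this):

```shell
# one "allow" rule per presented target, then the catch-all deny
ibsrpdm -c | sed 's/^/a /' >> /etc/srp_daemon.conf
echo 'd' >> /etc/srp_daemon.conf
```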
Restart the srp_daemon and the storage target should now become
visible; check the kernel log to see if the disk has been detected.
/etc/init.d/srptools restart
In the example kernel log output the disk has been discovered as scsi
device sdb.
scsi 3:0:0:1: Direct-Access IBM DCS9900 5.03 PQ: 0 ANSI: 5
sd 3:0:0:1: [sdb] 1953458176 4096-byte hardware sectors (8001365 MB)
sd 3:0:0:1: [sdb] Write Protect is off
sd 3:0:0:1: [sdb] Mode Sense: 97 00 10 08
sd 3:0:0:1: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
sd 3:0:0:1: [sdb] 1953458176 4096-byte hardware sectors (8001365 MB)
sd 3:0:0:1: [sdb] Write Protect is off
sd 3:0:0:1: [sdb] Mode Sense: 97 00 10 08
sd 3:0:0:1: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
sdb:<6>scsi4 : SRP.T10:50001FF10005052A
unknown partition table
sd 3:0:0:1: [sdb] Attached SCSI disk
sd 3:0:0:1: Attached scsi generic sg5 type 0
8.3. Multipathing, LVM and formatting
The newly detected SRP device can be treated as any other scsi device.
If you have multiple infiniband adapters you can use multipath-tools
on top of the SRP devices to protect against a network failure. If
you are not using multipathed IO you can simply format the device as
normal.
9. Building Lustre against OFED
Lustre is a scalable cluster filesystem popular on high performance
compute clusters. See http://www.lustre.org
for more information. Lustre can use infiniband as one of its network
transports in order to increase performance. This section describes how
to compile lustre against the OFED infiniband stack.
9.1. Check Compatibility
Not all lustre versions are compatible with all OFED or kernel
versions. Read the lustre release notes to see which versions are
supported.
9.2. Build a lustre patched kernel
Build a lustre patched kernel as per the instructions on the lustre
wiki. Once you have built the kernel, keep the configured source
tree. It is required for the next step.
9.3. Build OFED modules for the lustre patched kernel
Build OFED modules against the newly built lustre patched kernel.
module-assistant prepare
module-assistant clean ofa-kernel
module-assistant -k/path/to/lustre/patched/kernel build ofa-kernel
Do not issue a "module-assistant clean" command after the build. The
ofa-kernel-module source tree is needed for the next step.
9.4. Configure lustre
You can now configure lustre to build against the lustre patched
kernel and the ofa-kernel-module sources.
cd lustre-source
./configure --with-o2ib=/usr/src/modules/ofa-kernel --with-linux=/path/to/patched/linux/source \
--other-options
10. Troubleshooting
This section covers general troubleshooting and commonly reported
problems.
10.1. General fabric troubleshooting
The ibdiagnet program can be used to troubleshoot potential issues
with your infiniband fabric.
ibdiagnet -r
10.2. ib_query_gid() failed errors on mlx4 platforms
ibstat or opensm hangs and the following kernel messages are printed:
kernel: [ 78.170077] ib0: ib_query_gid() failed
kernel: [ 89.272789] ib0: ib_query_port failed
Fix: Load the mlx4_core module with the msi_x=0 option.
cat > /etc/modprobe.d/mlx4_core <<EOF
options mlx4_core msi_x=0
EOF

11. Tips and Tricks

You can change the node description of an HCA (the name shown by
tools such as ibhosts) by writing to its node_desc file in sysfs:

echo -n "new node description" > /sys/class/infiniband/<hca>/node_desc
12. Further Information
Extensive documentation on the OFED software is present in the ofed-
docs package.
The OpenFabrics Alliance webpage can be found here:
http://www.openfabrics.org/
The following mailing lists are also useful:
pkg-ofed-devel: Discussion of Debian-specific problems or issues.
http://lists.alioth.debian.org/mailman/listinfo/pkg-ofed-devel

ofa-general: General discussion of the OFED software.
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
Books:
Infiniband Network Architecture
by MindShare, Inc.; Tom Shanley
Publisher: Addison-Wesley Professional
Pub Date: October 31, 2002
Print ISBN-10: 0-321-11765-4
Infiniband HOWTO
Guy Coates
This document describes how to install and configure the OFED
infiniband software on Debian. It is intended to show you how to
configure a simple Infiniband network as quickly as possible. It is
not a replacement for the detailed documentation provided in the
ofed-docs package!
OFED (OpenFabric's Enterprise Distribution) is the defacto Infiniband software stack on Linux. OFED
provides a consistent set of kernel modules and userspace libraries which have been tested together.
Before you can use your infiniband network you will need to install the OFED software on your infiniband client machines.
You can choose to use the pre-built packages on alioth, or build your own packages straight from the alioth SVN repository.
The upstream source tarballs need to be available if you
want to build pukka debian packages suitable for inclusion
upstream. If you are simply building packages for your own use,
you can ignore this step.
cd pkg-ofed
mkdir tarballs
Original source tarballs can be downloaded from the repository:
apt-get source libibverbs
Alternatively, you can grab the source code directly from upstream.
http://www.openfabrics.org/downloads/OFED/
Upstream source is distributed via SRPMS; you can use alien to convert them into tarballs.
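A sketch of the conversion (the SRPM file name below is hypothetical; use whichever SRPM you downloaded):

```shell
# alien's --to-tgz option converts an RPM payload into a gzipped tarball
alien --to-tgz libibverbs-1.1.2-1.src.rpm
```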
Build the packages.
cd into the package you wish to build, e.g. for libibcommon:
cd pkg-ofed/libibcommon
Link in the upstream tarballs directory (optional)
ln -s -f ../tarballs .
Run svn-buildpackage from within the trunk directory.
cd pkg-ofed/libibcommon/trunk
svn-buildpackage -uc -us -rfakeroot
The build process will generate a deb in the build-area directory.
Repeat the process for the rest of the packages. Note that some packages have build dependencies on other OFED packages. The suggested build order is:
You now need to build a set of OFED kernel modules which match the version of the OFED software you have installed.
The Debian kernel contains a set of OFED infiniband drivers, but they may not match the OFED userspace version you have installed.
Consult the table below to determine what OFED version the Debian kernel contains.
Debian Kernel Version OFED Version
<=2.6.26 1.3
>=2.6.27 1.4
If the debian kernel modules are the incorrect version, you can build a new set of modules using the ofa-kernel-source package.
If your kernel already includes the correct OFED kernel modules you can skip the rest of this section. If you are in doubt, you should
build a new set of modules rather than relying on the modules shipped with the kernel.
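A minimal sketch of the module build, assuming the ofa-kernel-source package is installed and you are building for the currently running kernel (the same module-assistant commands reappear later for the Lustre build):

```shell
# Prepare the build environment (kernel headers, build dependencies)
module-assistant prepare
# Build the OFED modules; the resulting deb is placed in /usr/src
module-assistant build ofa-kernel
# Install the freshly built deb
module-assistant install ofa-kernel
```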
This procedure will create an ofa-kernel-modules deb in /usr/src. You can then install the deb using dpkg or by running:
module-assistant install ofa-kernel
The deb can also be copied to your other infiniband hosts and installed using dpkg.
As the deb contains replacements for existing kernel modules you will need to either manually remove
any infiniband modules which have already been loaded, or reboot the machine, before you can use the new modules.
The new kernel modules will be installed into /lib/modules/<kernel-version>/updates. They will not overwrite the original kernel modules, but the module
loader will pick up the modules from the updates directory in preference. You can verify that the system is using the new kernel modules by running the
modinfo command.
# modinfo ib_core
filename: /lib/modules/2.6.22.19/updates/kernel/drivers/infiniband/core/ib_core.ko
author: Roland Dreier
description: core kernel InfiniBand API
license: Dual BSD/GPL
vermagic: 2.6.22.19 SMP mod_unload
Note that if you wish to rebuild the kernel modules for any reason, (eg for a new kernel version or to continue an interrupted build) then you must issue
the "module-assistant clean" command before trying a new build.
Before proceeding you should ensure that the firmware in your switches and infiniband cards is at the latest release.
Older firmware versions may cause interoperability and fabric stability issues. Do not assume that hardware fresh
from the factory has the latest firmware on it.
You should follow the documentation from your vendor as to how the firmware should be updated.
Each infiniband network requires a subnet manager. You can choose to run the OFED opensm subnet manager on one of the
Linux clients, or you may choose to use an embedded subnet manager running on one of the switches in your fabric. Note
that not all switches come with a subnet manager; check your switch documentation.
Infiniband kernel modules are not loaded automatically. You should add them to /etc/modules so that they are automatically loaded on machine
bootup. You will need to include the hardware specific modules and the protocol modules.
/etc/modules:
# Hardware drivers
# Choose the appropriate modules from
# /lib/modules/<kernel-version>/updates/kernel/drivers/infiniband/hw
#
#mlx4_ib # Mellanox ConnectX cards
#ib_mthca # some mellanox cards
#iw_cxgb3 # Chelsio T3 cards
#iw_nes # NetEffect cards
#
# Protocol modules
# Common modules
rdma_ucm
ib_umad
ib_uverbs
# IP over IB
ib_ipoib
# scsi over IB
ib_srp
# IB SDP protocol
ib_sdp
If you are going to use the opensm subnet manager, edit /etc/default/opensm and add the port
GUIDs of the interfaces on which you wish to start opensm.
You can find the port GUIDs of your cards with the ibstat -p command:
# ibstat -p
0x0002c9030002fb05
0x0002c9030002fb06
/etc/default/opensm:
PORTS="0x0002c9030002fb05 0x0002c9030002fb06"
Note: if you want to start opensm on all ports you can use the PORTS="ALL" keyword.
Start opensm:
#/etc/init.d/opensm start
If opensm has started correctly you should see SUBNET UP messages in the opensm logfile (/var/log/opensm.<PORTID>.log).
Mar 04 14:56:06 600685 [4580A960] 0x02 -> SUBNET UP
Note that you can start opensm on multiple nodes; one node will be the active subnet manager and the others will put themselves into standby.
You can now check the status of the local IB link with the ibstat command. Connected links should be in the "LinkUp" state. The following
output is from a dual ported card, only one of which (port1) is connected.
# ibstat
CA 'mlx4_0'
CA type: MT25418
Number of ports: 2
Firmware version: 2.3.0
Hardware version: a0
Node GUID: 0x0002c9030002fb04
System image GUID: 0x0002c9030002fb07
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 2
LMC: 0
SM lid: 1
Capability mask: 0x02510868
Port GUID: 0x0002c9030002fb05
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510868
Port GUID: 0x0002c9030002fb06
Once the host is connected to the infiniband network you can check the health of all of the other network components with the ibhosts, ibswitches and iblinkinfo commands.
ibhosts displays all of the hosts visible on the network.
# ibhosts
Ca : 0x0008f1040399d3d0 ports 2 "Voltaire HCA400Ex-D"
Ca : 0x0008f1040399d370 ports 2 "Voltaire HCA400Ex-D"
Ca : 0x0008f1040399d3fc ports 2 "Voltaire HCA400Ex-D"
Ca : 0x0008f1040399d3f4 ports 2 "Voltaire HCA400Ex-D"
Ca : 0x0002c9030002faf4 ports 2 "MT25408 ConnectX Mellanox Technologies"
Ca : 0x0002c9030002fc0c ports 2 "MT25408 ConnectX Mellanox Technologies"
Ca : 0x0002c9030002fc10 ports 2 "MT25408 ConnectX Mellanox Technologies"
ibswitches will display all of the switches in the network.
# ibswitches
Switch : 0x0008f104004121fa ports 24 "ISR9024D-M Voltaire" enhanced port 0 lid 1 lmc 0
iblinkinfo will show the status and speed of all of the links in the network.
ibping is an infiniband equivalent to the icmp ping command. Choose a node on the fabric and run an ibping server:
#ibping -S
Choose another node on your network, and then ping the port GUID of the server. (ibstat on the server will list the port GUID).
#ibping -G 0x0002c9030002fc1d
Pong from test.example.com (Lid 13): time 0.072 ms
Pong from test.example.com (Lid 13): time 0.043 ms
Pong from test.example.com (Lid 13): time 0.045 ms
Pong from test.example.com (Lid 13): time 0.045 ms
The OFED stack allows you to run TCP/IP over your infiniband network, allowing you to run non-infiniband aware applications across
your network. Several native infiniband applications also use IPoIB for host resolution (eg Lustre and SDP).
IPoIB can run over two infiniband transports, Unreliable Datagram (UD) mode or Connected Mode (CM). The differences between
these two modes are described in:
RFC4392 - IP over InfiniBand (IPoIB) Architecture
RFC4391 - Transmission of IP over InfiniBand (IPoIB) (UD mode)
RFC4755 - IP over InfiniBand: Connected Mode
ADDME: Pro/cons of these two methods?
You can switch between these two modes at runtime.
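As a sketch (the interface name ib0 is an assumption), the mode is controlled through the interface's sysfs mode file:

```shell
# Show the current IPoIB transport mode for ib0
cat /sys/class/net/ib0/mode
# Switch to connected mode
echo connected > /sys/class/net/ib0/mode
# Switch back to datagram (UD) mode
echo datagram > /sys/class/net/ib0/mode
```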
The default is datagram (UD) mode. If you wish to use CM then you can add a script to /etc/network/if-up.d to
automatically set CM mode on your interfaces when they are configured.
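A minimal sketch of such a hook script (the file name ipoib-cm is hypothetical; remember to make it executable):

```shell
#!/bin/sh
# /etc/network/if-up.d/ipoib-cm (hypothetical file name)
# ifupdown exports the interface name in $IFACE; set connected
# mode on IPoIB interfaces as they are brought up.
case "$IFACE" in
    ib*)
        echo connected > /sys/class/net/"$IFACE"/mode
        ;;
esac
```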
In order to obtain maximum IPoIB throughput you may need to tweak the MTU and various kernel TCP buffer and window settings.
See the details in the ipoib_release_notes.txt document in the ofed-docs package.
If you have a dual ported card with both ports on the same IB subnet, but different IP subnets, you
will need to tweak the ARP settings for the IPoIB interfaces. See ipoib_release_notes.txt in the ofed-docs package for a full
discussion of this issue.
Users who want to run MPI jobs will need to have write permissions for the following devices:
/dev/infiniband/uverbs*
/dev/infiniband/rdma_cm*
The simplest way to do this is to add the users to the rdma group. If that is not suitable for
your site, you can change the permissions and ownership of these devices by editing the following
udev rules:
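As a sketch of what such rules might look like (the group name and mode mirror the rdma-group approach above; verify against the actual rules file shipped by your packages):

```
KERNEL=="uverbs*", GROUP="rdma", MODE="0660"
KERNEL=="rdma_cm", GROUP="rdma", MODE="0660"
```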
OpenMPI uses ssh to spawn jobs on remote hosts. You should configure a public/private keypair to ensure that you
can ssh between hosts without entering a password. You should also ensure that your login process is silent.
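A minimal sketch of the key setup (the user and host names are hypothetical):

```shell
# Generate a passphrase-less keypair and push the public key to the other host
ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"
ssh-copy-id user@hostB
# Verify that this now works without a password prompt
ssh user@hostB true
```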
We will use the MPI PingPong benchmark for our testing. By default, openmpi should use infiniband networks in preference to any tcp networks it finds. However, we will force mpi to ignore tcp networks to ensure that it is using the infiniband network.
#!/bin/bash
#Infiniband MPI test program
#Edit the hosts below to match your test hosts
cat > /tmp/hostfile.$$.mpi <<EOF
hostA slots=1
HostB slots=1
EOF
mpirun --mca btl_openib_verbose 1 --mca btl ^tcp -n 2 -hostfile /tmp/hostfile.$$.mpi IMB-MPI1 PingPong
If all goes well you should see openib debugging messages from both hosts, together with the job output.
<snip>
# PingPong
[HostB][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[HostB][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[HostA][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[HostA][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.53 0.00
1 1000 1.44 0.66
2 1000 1.42 1.34
4 1000 1.41 2.70
8 1000 1.48 5.15
16 1000 1.50 10.15
32 1000 1.54 19.85
64 1000 1.79 34.05
128 1000 3.01 40.56
256 1000 3.56 68.66
512 1000 4.46 109.41
1024 1000 5.37 181.92
2048 1000 8.13 240.25
4096 1000 10.87 359.48
8192 1000 15.97 489.17
16384 1000 30.54 511.68
32768 1000 55.01 568.12
65536 640 122.20 511.46
131072 320 207.20 603.27
262144 160 377.10 662.96
524288 80 706.21 708.00
1048576 40 1376.93 726.25
2097152 20 1946.00 1027.75
4194304 10 3119.29 1282.34
If you encounter any errors read the excellent OpenMPI troubleshooting guide.
http://www.openmpi.org
If you want to compare infiniband performance with your ethernet/TCP networks, you can re-run the tests using flags to tell openmpi to use your ethernet network. (The example below assumes that your test nodes are connected via eth0).
#!/bin/bash
#TCP MPI test program
#Edit the hosts below to match your test hosts
cat > /tmp/hostfile.$$.mpi <<EOF
hostA slots=1
HostB slots=1
EOF
mpirun --mca btl ^openib --mca btl_tcp_if_include eth0 --hostfile /tmp/hostfile.$$.mpi -n 2 IMB-MPI1 PingPong
You should notice significantly higher latencies than for the infiniband test.
Sockets Direct Protocol (SDP) is a network protocol which provides an RDMA accelerated
alternative to TCP over infiniband networks. OFED provides an LD_PRELOADable library
(libsdp.so) which allows programs which use TCP to use the more efficient SDP protocol instead.
The use of an LD_PRELOADable library means that the switch in protocol is transparent,
and does not require the application to be recompiled.
SDP uses IPoIB for address resolution, so you must configure IPoIB before using SDP.
You should also ensure the ib_sdp kernel module is installed.
modprobe ib_sdp
You can use libsdp in two ways; you can either manually LD_PRELOAD the library whilst invoking your application, or
create a config file which specifies which applications will use SDP.
To manually LD_PRELOAD a library, simply set the LD_PRELOAD variable before invoking your application.
The following example shows how to use libsdp to make the TCP benchmarking application, netpipe, use SDP rather than TCP.
NodeA is the server and NodeB is the client. IPoIB is configured on both nodes, and NodeA's IPoIB address is 10.0.0.1
Install netpipe on both nodes.
aptitude install netpipe-tcp
First, run the netpipe benchmark over TCP in order to obtain a baseline number.
nodeA# NPtcp
nodeB# NPtcp -h 10.0.0.1
Send and receive buffers are 16384 and 87380 bytes
(A bug in Linux doubles the requested buffer sizes)
Now starting the main loop
0: 1 bytes 2778 times --> 0.22 Mbps in 34.04 usec
1: 2 bytes 2937 times --> 0.45 Mbps in 33.65 usec
2: 3 bytes 2971 times --> 0.69 Mbps in 33.41 usec
<snip>
121: 8388605 bytes 3 times --> 2951.89 Mbps in 21680.99 usec
122: 8388608 bytes 3 times --> 3008.08 Mbps in 21276.00 usec
123: 8388611 bytes 3 times --> 2941.76 Mbps in 21755.66 usec
Now repeat the test, but force netpipe to use SDP rather than TCP.
nodeA# LD_PRELOAD=libsdp.so NPtcp
nodeB# LD_PRELOAD=libsdp.so NPtcp -h 10.0.0.1
Send and receive buffers are 16384 and 87380 bytes
(A bug in Linux doubles the requested buffer sizes)
Now starting the main loop
0: 1 bytes 9765 times --> 1.45 Mbps in 5.28 usec
1: 2 bytes 18946 times --> 2.80 Mbps in 5.46 usec
2: 3 bytes 18323 times --> 4.06 Mbps in 5.63 usec
<snip>
121: 8388605 bytes 5 times --> 7665.51 Mbps in 8349.08 usec
122: 8388608 bytes 5 times --> 7668.62 Mbps in 8345.70 usec
123: 8388611 bytes 5 times --> 7629.04 Mbps in 8389.00 usec
You should see a significant increase in performance when using SDP.
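For the config-file approach mentioned earlier, rules are placed in /etc/libsdp.conf. The fragment below is a sketch only; the exact rule syntax varies between libsdp versions, so check the comments in the libsdp.conf shipped with your libsdp package:

```
# Hypothetical example: make NPtcp use SDP for all addresses and ports
use sdp server NPtcp *:*
use sdp client NPtcp *:*
```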
Infiniband HOWTO: SRP
SRP (SCSI RDMA Protocol) is a protocol that allows the use of SCSI devices across
infiniband. If you have infiniband storage, you can use SRP to access the devices.
Ensure that your infiniband storage is presented to the host in question. Check your storage controller documentation.
Ensure that the ib_srp kernel module is loaded and that the srptools package is installed.
srp_daemon is responsible for discovering and connecting to SRP targets. The default configuration shipped with srp_daemon is to ignore all presented
devices; this is a failsafe to prevent devices from being mounted by accident on the wrong hosts.
The srp_daemon config file /etc/srp_daemon.conf has a simple syntax, and is described in the srp_daemon(1) manpage. Each line in this file is a rule which either
allows or disallows a connection, according to the first character of the line (a or d respectively) and the ID of the storage device.
Determine the IDs of presented devices
You can determine the IDs of SRP devices presented to your hosts by running the ibsrpdm -c command.
Infiniband HOWTO: Building Lustre against OFED
If you see error messages pertaining to missing support for XRC, it means you have mismatched kernel modules and userspace libraries.
mlx4: There is a mismatch between the kernel and the userspace
libraries: Kernel does not support XRC. Exiting.
Fix: Make sure that you build and install the OFED kernel modules as described in section X.
Infiniband HOWTO: Tips and Tricks
Open Fabrics InfiniBand Diagnostic Utilities
--------------------------------------------
*******************************************************************************
RELEASE: OFED 1.4.1
DATE: May 2009
===============================================================================
Table of Contents
===============================================================================
1. Overview
2. New features
3. Major Bugs Fixed
3.1 Changes since OFED 1.4
4. Known Issues
===============================================================================
1. Overview
===============================================================================
The ibutils package provides a set of diagnostic tools that check the health
of an InfiniBand fabric.
Package components:
ibis: IB interface - A TCL shell that provides interface for sending various
MADs on the IB fabric. This is the component that actually accesses
the IB Hardware.
ibdm: IB Data Model - A library that provides IB fabric analysis.
ibmgtsim: An IB fabric simulator. Useful for developing IB tools.
ibdiag: This package provides 3 tools which provide the user interface
to activate the above functionality:
- ibdiagnet: Performs various quality and health checks on the IB
fabric.
- ibdiagpath: Performs various fabric quality and health checks on
the given links and nodes in a specific path.
- ibdiagui: A GUI wrapper for the above tools.
===============================================================================
2. New Features
===============================================================================
* Mellanox InfiniScaleIV support:
- Support switches with port count > 32
- Added ibnl (IB network) files for MTS3600 and MTS3610 InfiniScaleIV based switch systems.
* IBDM QoS Credit Loop check:
This check now considers SL/VL when looking for credit loops.
This check can be activated by running "ibdiagnet -r"
* ibdiagnet: Added -csv flag, which generates a set of Comma Separated Values
  files containing data about the fabric. Generated files:
inv_csv - Lists the ports found in the fabric
links_csv - Lists the ports connections in the fabric
pm_csv - Lists port counters in csv format
err_csv - Lists errors found during the run
* ibmgtsim: Add basic M_Key mechanism simulation
===============================================================================
3. Major Bugs Fixed
===============================================================================
* ibdm: Support 2 port switches in a loaded LST file.
* ibis: Fixed some buffer overrun bugs with long node descriptions.
* Installation: Ibdiagui requires tcl/tk 8.4 or 8.5 (was only 8.4). This allows
installation on Fedora Core 9.
* ibdiagnet: Fixed a crash caused by the -pm flag on back-to-back (no switch) setups.
* ibdiagnet: Do not query port counters when local port is in INIT state.
===============================================================================
3.1 Changes since OFED 1.4
===============================================================================
* PM csv files format fix
* Fixed generating and parsing IBNL files
* Fixed CC packet format to meet IBTA approved format
* Set of changes to sync with OpenSM changes
* Regenerated wrappers - fixed compilation errors on some distros
* Fixed printing SM and mcast info in ibdiagnet
* Other minor fixes/improvements
===============================================================================
4. Known Issues
===============================================================================
- Ibdiagnet "-wt" option may generate a bad topology file when running on a
cluster that contains complex switch systems.
- When a subnet manager is not running, ibdiagnet IPoIB check may take a long
time to complete.
Open Fabrics Enterprise Distribution (OFED)
Version 1.4.1
Installation Guide
May 2009
==============================================================================
Table of contents
==============================================================================
1. Overview
2. Contents of the OFED Distribution
3. Hardware and Software Requirements
4. How to Download and Extract the OFED Distribution
5. Installing OFED Software
6. Building OFED RPMs
7. IPoIB Configuration
8. Uninstalling OFED
9. Upgrading OFED
10. Configuration
11. Related Documentation
==============================================================================
1. Overview
==============================================================================
This is the OpenFabrics Enterprise Distribution (OFED) version 1.4.1
software package supporting InfiniBand and iWARP fabrics. It is composed
of several software modules intended for use on a computer cluster
constructed as an InfiniBand subnet or an iWARP network.
This document describes how to install the various modules and test them in
a Linux environment.
General Notes:
1) The install script removes all previously installed OFED packages
and re-installs from scratch. (Note: Configuration files will not
be removed). You will be prompted to acknowledge the deletion of
the old packages.
2) When installing OFED on an entire [homogeneous] cluster, a common
strategy is to install the software on one of the cluster nodes
(perhaps on a shared file system such as NFS). The resulting RPMs,
created under OFED-1.4.1/RPMS directory, can then be installed on all
nodes in the cluster using any cluster-aware tools (such as pdsh).
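As a sketch of the cluster rollout (node names and the shared NFS path are hypothetical):

```shell
# Install the generated RPMs on all cluster nodes from a shared NFS directory
pdsh -w node[01-16] "rpm -ivh /nfs/OFED-1.4.1/RPMS/*.rpm"
```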
==============================================================================
2. OFED Package Contents
==============================================================================
The OFED Distribution package generates RPMs for installing the following:
o OpenFabrics core and ULPs:
- HCA drivers (mthca, mlx4, ipath, ehca)
- iWARP driver (cxgb3, nes)
- core
- Upper Layer Protocols: IPoIB, SDP, SRP Initiator and target, iSER
Initiator and target, RDS, qlgc_vnic, uDAPL and NFS-RDMA
o OpenFabrics utilities
- OpenSM: InfiniBand Subnet Manager
- Diagnostic tools
- Performance tests
o MPI
- OSU MVAPICH stack supporting the InfiniBand and iWARP interface
- Open MPI stack supporting the InfiniBand and iWARP interface
- OSU MVAPICH2 stack supporting the InfiniBand and iWARP interface
- MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta)
o Extra packages
- open-iscsi: open-iscsi initiator with iSER support
- ib-bonding: Bonding driver for IPoIB interface
o Sources of all software modules (under conditions mentioned in the
modules' LICENSE files)
o Documentation
==============================================================================
3. Hardware and Software Requirements
==============================================================================
1) Server platform with InfiniBand HCA or iWARP RNIC (see OFED Distribution
Release Notes for details)
2) Linux operating system (see OFED Distribution Release Notes for details)
3) Administrator privileges on your machine(s)
4) Disk Space: - For Build & Installation: 300MB
- For Installation only: 200MB
5) For the OFED Distribution to compile on your machine, some software
packages of your operating system (OS) distribution are required. These
are listed here.
OS Distribution Required Packages
--------------- ----------------------------------
General:
o Common to all gcc, glib, glib-devel, glibc, glibc-devel,
glibc-devel-32bit (to build 32-bit libraries on x86_64
and ppc64), zlib-devel
o RedHat, Fedora kernel-devel, rpm-build
o SLES 10.0 kernel-source, rpm
Note: To build 32-bit libraries on x86_64 and ppc64 platforms, the 32-bit
glibc-devel should be installed.
Specific Component Requirements:
o Mvapich a Fortran Compiler (such as gcc-g77)
o Mvapich2 libstdc++-devel, sysfsutils (SuSE),
libsysfs-devel (RedHat5.0, Fedora C6)
o Open MPI libstdc++-devel
o ibutils tcl-8.4, tcl-devel-8.4, tk, libstdc++-devel
o tvflash pciutils-devel
o mstflint libstdc++-devel (32-bit on ppc64), gcc-c++
Note: The installer will warn you if you attempt to compile any of the
above packages and do not have the prerequisites installed.
*** Important Note for open-iscsi users:
Installing iSER as part of OFED installation will also install open-iscsi.
Before installing OFED, please uninstall any open-iscsi version that may
be installed on your machine. Installing OFED with iSER support while
another open-iscsi version is already installed will cause the installation
process to fail.
==============================================================================
4. How to Download and Extract the OFED Distribution
==============================================================================
1) Download the OFED-X.X.X.tgz file to your target Linux host.
If this package is to be installed on a cluster, it is recommended to
download it to an NFS shared directory.
2) Extract the package using:
tar xzvf OFED-X.X.X.tgz
==============================================================================
5. Installing OFED Software
==============================================================================
1) Go to the directory into which the package was extracted:
cd /..../OFED-X.X.X
2) Installing the OFED package must be done as root. For a
menu-driven first build and installation, run the installer
script:
./install.pl
Interactive menus will direct you through the install process.
Note: After the installer completes, information about the OFED
installation such as the prefix, the kernel version, and
installation parameters can be found by running
/etc/infiniband/info.
Information on the driver version and source git trees can be found
using the ofed_info utility
During the interactive installation of OFED, two files are
generated: ofed.conf and ofed_net.conf.
ofed.conf holds the installed software modules and configuration settings
chosen by the user. ofed_net.conf holds the IPoIB settings chosen by the
user.
If the package is installed on a cluster-shared directory, these
files can then be used to perform an automatic, unattended
installation of OFED on other machines in the cluster. The
unattended installation will use the same choices as were selected
in the interactive installation.
For an automatic installation on any host, run the following:
./OFED-X.X.X/install.pl -c /ofed.conf -n /ofed_net.conf
3) Install script usage:
Usage: ./install.pl [-c <packages config_file>|--all|--hpc|--basic]
                    [-n|--net <network config_file>]
-c|--config <packages config_file>. Example of the config file can
            be found under docs (ofed.conf-example)
-n|--net <network config_file>. Example of the config file can be
            found under docs (ofed_net.conf-example)
-l|--prefix <prefix>. Set installation prefix.
-p|--print-available. Print available packages for current platform
            and create a corresponding ofed.conf file.
-k|--kernel <kernel version>. Default on this system: $(uname -r)
-s|--kernel-sources <path to kernel sources>. Default on this
            system: /lib/modules/$(uname -r)/build
--build32 Build 32-bit libraries. Relevant for x86_64 and
ppc64 platforms
--without-depcheck Skip Distro's libraries check
-v|-vv|-vvv. Set verbosity level
-q. Set quiet - no messages will be printed
--all|--hpc|--basic Install all,hpc or basic packages
correspondingly
Notes:
------
a. It is possible to rename and/or edit the ofed.conf and ofed_net.conf files.
Thus it is possible to change user choices (observing the original format).
See examples of ofed.conf and ofed_net.conf under OFED-X.X.X/docs.
Run './install.pl -p' to get ofed.conf with all available packages included.
b. Important note for open-iscsi users:
Installing iSER as part of the OFED installation will also install
open-iscsi. Before installing OFED, please uninstall any open-iscsi version
that may be installed on your machine. Installing OFED with iSER support
while another open-iscsi version is already installed will cause the
installation process to fail.
Install Process Results:
------------------------
o The OFED package is installed under the <prefix> directory. Default prefix is /usr
o The kernel modules are installed under:
- Infiniband subsystem:
/lib/modules/`uname -r`/updates/kernel/drivers/infiniband/
- open-iscsi:
/lib/modules/`uname -r`/updates/kernel/drivers/scsi/
- Chelsio driver:
/lib/modules/`uname -r`/updates/kernel/drivers/net/cxgb3/cxgb3.ko
- ConnectX driver:
/lib/modules/`uname -r`/updates/kernel/drivers/net/mlx4/mlx4_core.ko
- RDS:
/lib/modules/`uname -r`/updates/kernel/net/rds/rds.ko
- Bonding module:
/lib/modules/`uname -r`/updates/kernel/drivers/net/bonding/bonding.ko
o The package kernel include files are placed under <prefix>/src/ofa_kernel/.
These includes should be used when building kernel modules which use
the Openfabrics stack. (Note that these includes, if needed, are
"backported" to your kernel).
o The raw package (un-backported) source files are placed under
<prefix>/src/ofa_kernel-1.4.1
o The script "openibd" is installed under /etc/init.d/. This script can
be used to load and unload the software stack.
o The directory /etc/infiniband is created with the files "info" and
"openib.conf". The "info" script can be used to retrieve OFED
installation information. The "openib.conf" file contains the list of
modules that are loaded when the "openibd" script is used.
o The file "90-ib.rules" is installed under /etc/udev/rules.d/
o If libibverbs-utils is installed, then ofed.sh and ofed.csh are
installed under /etc/profile.d/. These automatically update the PATH
environment variable with <prefix>/bin. In addition, ofed.conf is
installed under /etc/ld.so.conf.d/ to update the dynamic linker's
run-time search path to find the InfiniBand shared libraries.
o The file /etc/modprobe.conf is updated to include the following:
- "alias ib ib_ipoib" for each ib interface.
- "alias net-pf-27 ib_sdp" for sdp.
o If opensm is installed, the daemon opensmd is installed under /etc/init.d/
o All verbs tests and examples are installed under <prefix>/bin and management
utilities under <prefix>/sbin
o ofed_info script provides information on the OFED version and git repository.
o If IPoIB configuration files are included, ifcfg-ib files will be
installed at:
- RedHat: /etc/sysconfig/network-scripts/
- SuSE: /etc/sysconfig/network/
o If iSER is included, open-iscsi user-space files will be also installed:
- Configuration files will be installed at /etc/iscsi
- Startup script will be installed at:
- RedHat: /etc/init.d/iscsi
- SuSE: /etc/init.d/open-iscsi
- Other tools (iscsiadm, iscsid, iscsi_discovery, iscsi-iname, iscsistart)
will be installed under /sbin.
- Documentation will be installed under:
- RedHat: /usr/share/doc/iscsi-initiator-utils-<version>
- SuSE: /usr/share/doc/packages/open-iscsi
o man pages will be installed under /usr/share/man/.
==============================================================================
6. Building OFED RPMs
==============================================================================
1) Go to the directory into which the package was extracted:
cd /..../OFED-X.X.X
2) Run install.pl as explained above
This script also builds OFED binary RPMs under OFED-X.X.X/RPMS; the sources
are placed in OFED-X.X.X/SRPMS/.
Once the install process has completed, the user may run ./install.pl on
other machines that have the same operating system and kernel to
install the new RPMs.
Note: Depending on your hardware, the build procedure may take 30-45
minutes. Installation, however, is a relatively short process
(~5 minutes). A common strategy for OFED installation on large
homogeneous clusters is to extract the tarball on a network
file system (such as NFS), build OFED RPMs on NFS, and then run the
installer on each node with the RPMs that were previously built.
==============================================================================
7. IP-over-IB (IPoIB) Configuration
==============================================================================
Configuring IPoIB is an optional step during the installation. During
an interactive installation, the user may choose to insert the ifcfg-ib
files. If this option is chosen, the ifcfg-ib files will be
installed under:
- RedHat: /etc/sysconfig/network-scripts/
- SuSE: /etc/sysconfig/network/
Setting IPoIB Configuration:
----------------------------
There is no default configuration for IPoIB interfaces.
One should manually specify the full IP configuration during the
interactive installation: IP address, network address, netmask, and
broadcast address, or use the ofed_net.conf file.
For bonding settings, please see "ipoib_release_notes.txt"
For unattended installations, a configuration file can provide this
information. The configuration file can specify either:
- Fixed values for each IPoIB interface, or
- An IPoIB configuration based on an Ethernet configuration (this may be
useful for cluster configuration)
Here are some examples of ofed_net.conf:
# Static settings; all values provided by this file
IPADDR_ib0=172.16.0.4
NETMASK_ib0=255.255.0.0
NETWORK_ib0=172.16.0.0
BROADCAST_ib0=172.16.255.255
ONBOOT_ib0=1
# Based on eth0; each '*' will be replaced by the script with corresponding
# octet from eth0.
LAN_INTERFACE_ib0=eth0
IPADDR_ib0=172.16.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=172.16.0.0
BROADCAST_ib0=172.16.255.255
ONBOOT_ib0=1
# Based on the first eth<n> interface that is found (for n=0,1,...);
# each '*' will be replaced by the script with corresponding octet from eth<n>.
LAN_INTERFACE_ib0=
IPADDR_ib0=172.16.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=172.16.0.0
BROADCAST_ib0=172.16.255.255
ONBOOT_ib0=1
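The '*' substitution rule above can be illustrated with a small POSIX-shell
sketch. This is only an illustration of the rule, not the installer's actual
code, and `subst_ip` is a hypothetical helper name:

```shell
#!/bin/sh
# Replace each '*' octet in a template address with the octet in the
# same position taken from a donor interface's address.
subst_ip() {
    # $1 = template, e.g. "172.16.*.*"; $2 = donor address, e.g. eth0's IP
    set -f                       # keep '*' literal during word splitting
    old_ifs=$IFS; IFS=.
    set -- $1 $2                 # $1..$4 = template octets, $5..$8 = donor octets
    IFS=$old_ifs; set +f
    r1=$1; [ "$r1" = '*' ] && r1=$5
    r2=$2; [ "$r2" = '*' ] && r2=$6
    r3=$3; [ "$r3" = '*' ] && r3=$7
    r4=$4; [ "$r4" = '*' ] && r4=$8
    echo "$r1.$r2.$r3.$r4"
}

# With eth0 at 10.20.30.40, IPADDR_ib0=172.16.'*'.'*' becomes:
subst_ip "172.16.*.*" "10.20.30.40"   # -> 172.16.30.40
```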
==============================================================================
8. Uninstalling OFED
==============================================================================
There are two ways to uninstall OFED:
1) Via the installation menu.
2) Using the script ofed_uninstall.sh, which is part of the ofed-scripts
package.
Note: The ofed_uninstall.sh script supports the --unload-modules flag,
which executes 'openibd stop' before removing the RPMs.
==============================================================================
9. Upgrading OFED
==============================================================================
If an old OFED version is installed, it may be upgraded by installing a
new OFED version as described in section 5. Note that if the old OFED
version was loaded before upgrading, you need to restart OFED or reboot
your machine in order to start the new OFED stack.
==============================================================================
10. Configuration
==============================================================================
Most of the OFED components can be configured or reconfigured after
the installation by modifying the relevant configuration files. The
list of the modules that will be loaded automatically upon boot can be
found in the /etc/infiniband/openib.conf file. Other configuration
files include:
- SDP configuration file: /etc/libsdp.conf
- OpenSM configuration file: /etc/ofa/opensm.conf (for RedHat)
/etc/sysconfig/opensm (for SuSE) - should be
created manually if required.
- DAPL configuration file: /etc/dat.conf
See packages Release Notes for more details.
Note: After the installer completes, information about the OFED
installation such as the prefix, kernel version, and
installation parameters can be found by running
/etc/infiniband/info.
==============================================================================
11. Related Documentation
==============================================================================
OFED documentation is located in the ofed-docs RPM. After
installation the documents are located under the directory:
/usr/share/doc/ofed-docs-1.4.1 for RedHat
/usr/share/doc/packages/ofed-docs-1.4.1 for SuSE
Document list:
o README.txt
o OFED_Installation_Guide.txt
o MPI_README.txt
o Examples of configuration files
o OFED_tips.txt
o HOWTO.build_ofed
o All release notes and README files
For more information, please visit the OpenFabrics web site:
http://www.openfabrics.org/
open-iscsi documentation is located at:
- RedHat: /usr/share/doc/iscsi-initiator-utils-<version>
- SuSE: /usr/share/doc/packages/open-iscsi
For more information, please visit the open-iscsi web site:
http://www.open-iscsi.org/
Open Fabrics Enterprise Distribution (OFED)
MPI in OFED 1.4.1 README
May 2009
===============================================================================
Table of Contents
===============================================================================
1. Overview
2. MVAPICH
3. Open MPI
4. MVAPICH2
===============================================================================
1. Overview
===============================================================================
Three MPI stacks are included in this release of OFED:
- MVAPICH 1.1.0-3355
- Open MPI 1.3.2
- MVAPICH2 1.2p1
Setup, compilation and run information of MVAPICH, Open MPI and MVAPICH2 is
provided below in sections 2, 3 and 4 respectively.
1.1 Installation Note
---------------------
In Step 2 of the main menu of install.pl, options 2, 3 and 4 can install
one or more MPI stacks. Please refer to docs/OFED_Installation_Guide.txt
to learn about the different options.
The installation script allows each MPI to be compiled using one or
more compilers. For each installed MPI stack, users need to set PATH
and/or LD_LIBRARY_PATH to point at the desired compiled MPI stack.
1.2 MPI Tests
-------------
OFED includes four basic tests that can be run against each MPI stack:
bandwidth (bw), latency (lt), Intel MPI Benchmark, and Presta. The tests
are located under: <prefix>/mpi/<compiler>/<mpi stack>/tests/,
where <prefix> is /usr by default.
1.3 Selecting Which MPI to Use: mpi-selector
--------------------------------------------
Depending on how the OFED installer was run, multiple different MPI
implementations may be installed on your system. The OFED installer
will run an MPI selector tool during the installation process,
presenting a menu-based interface to select which MPI implementation
is set as the default for all users. This MPI selector tool can be
re-run at any time by the administrator after the OFED installer
completes to modify the site-wide default MPI implementation selection
by invoking the "mpi-selector-menu" command (root access is typically
required to change the site-wide default).
The mpi-selector-menu command can also be used by non-administrative
users to override the site-wide default MPI implementation selection
by setting a per-user default. Specifically: unless a user runs the
MPI selector tool to set a per-user default, their environment will be
setup for the site-wide default MPI implementation.
Note that the default MPI selection does *not* affect the shell from
which the command was invoked (or any other shells that were already
running when the MPI selector tool was invoked). The default
selection is only changed for *new* shells started after the selector
tool was invoked. It is recommended that once the default MPI
implementation is changed via the selector tool, users should logout
and login again to ensure that they have a consistent view of the
default MPI implementation. Other tools can be used to change the MPI
environment in the current shell, such as the environment modules
software package (which is not included in the OFED software package;
see http://modules.sourceforge.net/ for details).
Note that the site-wide default is set in a file that is typically not
on a networked filesystem, and is therefore specific to the host on
which it was run. As such, it is recommended to run the
mpi-selector-menu command on all hosts in a cluster, picking the same
default MPI implementation on each. It may be more convenient,
however, to use the mpi-selector command in script-based scenarios
(such as running on every host in a cluster); mpi-selector effects all
the same functionality as mpi-selector-menu, but is intended for
automated environments. See the mpi-selector(1) manual page for more
details.
Additionally, per-user defaults are set in a file in the user's $HOME
directory. If this directory is not on a network-shared filesystem
between all hosts that will be used for MPI applications, then it also
needs to be propagated to all relevant hosts.
Note: The MPI selector tool typically sets the PATH and/or
LD_LIBRARY_PATH for a given MPI implementation. This step can, of
course, also be performed manually by a user or on a site-wide basis.
The MPI selector tool simply bundles up this functionality in a
convenient set of command line tools and menus.
1.4 Updating MPI Installations
------------------------------
Note that all of the MPI implementations included in the OFED software
package are the versions that were available when OFED v1.4 was
released. They have been QA tested with this version of OFED and are
fully supported.
However, note that administrators can go to the web sites of each MPI
implementation and download / install newer versions after OFED has
been successfully installed. There is nothing specific about the
OFED-included MPI software packages that prohibit installing
newer/other MPI implementations.
It should be also noted that versions of MPI released after OFED v1.4
are not supported by OFED. But since each MPI has its own release
schedule and QA process (each of which involves testing with the OFED
stack), it may sometimes be desirable -- or even advisable, depending
on how old the MPI implementations are that are included in OFED -- to
download and install a newer version of MPI.
The web sites of each MPI implementation are listed below:
- Open MPI: http://www.open-mpi.org/
- MVAPICH: http://mvapich.cse.ohio-state.edu/
- MVAPICH2: http://mvapich.cse.ohio-state.edu/overview/mvapich2/
===============================================================================
2. MVAPICH MPI
===============================================================================
This package is a 1.1.0 version of the MVAPICH software package,
and is the officially supported MPI stack for this release of OFED.
See http://mvapich.cse.ohio-state.edu for more details.
2.1 Setting up for MVAPICH
--------------------------
To launch MPI jobs, the MVAPICH installation directory needs to be included
in PATH and LD_LIBRARY_PATH. To set them, execute one of the following
commands:
source <prefix>/mpi/<compiler>/<mvapich>/bin/mpivars.sh
-- when using sh for launching MPI jobs
or
source <prefix>/mpi/<compiler>/<mvapich>/bin/mpivars.csh
-- when using csh for launching MPI jobs
2.2 Compiling MVAPICH Applications:
-----------------------------------
***Important note***:
A valid Fortran compiler must be present in order to build the MVAPICH MPI
stack and tests.
The default gcc-g77 Fortran compiler is provided with all RedHat Linux
releases. SuSE distributions earlier than SuSE Linux 9.0 do not provide
this compiler as part of the default installation.
The following compilers are supported by OFED's MVAPICH package: GCC,
Intel, PathScale, and PGI. The install script prompts the user to choose
the compiler with which to build the MVAPICH RPM. Note that more
than one compiler can be selected simultaneously, if desired.
For details see:
http://mvapich.cse.ohio-state.edu/support
To review the default configuration of the installation, check the default
configuration file: <prefix>/mpi/<compiler>/<mvapich>/etc/mvapich.conf
2.3 Running MVAPICH Applications:
---------------------------------
Requirements:
o At least two nodes. Example: mtlm01, mtlm02
o Machine file: Includes the list of machines. Example: /root/cluster
o Bidirectional rsh or ssh without a password
Note: ssh will be used unless -rsh is specified. In order to use
rsh, add to the mpirun_rsh command the parameter: -rsh
*** Running OSU tests ***
/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/osu_benchmarks-3.0/osu_bw
/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/osu_benchmarks-3.0/osu_latency
/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/osu_benchmarks-3.0/osu_bibw
/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/osu_benchmarks-3.0/osu_bcast
*** Running Intel MPI Benchmark test (Full test) ***
/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/IMB-3.1/IMB-MPI1
*** Running Presta test ***
/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/presta-1.4.0/com -o 100
/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/presta-1.4.0/glob -o 100
/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/presta-1.4.0/globalop
===============================================================================
3. Open MPI
===============================================================================
Open MPI is a next-generation MPI implementation from the Open MPI
Project (http://www.open-mpi.org/). Version 1.3.2 of Open MPI is
included in this release, which is also available directly from the
main Open MPI web site.
A working Fortran compiler is not required to build Open MPI, but some
of the included MPI tests are written in Fortran. These tests will
not compile/run if Open MPI is built without Fortran support.
The following compilers are supported by OFED's Open MPI package: GNU,
Pathscale, Intel, or Portland. The install script prompts the user
for the compiler with which to build the Open MPI RPM. Note that more
than one compiler can be selected simultaneously, if desired.
Users should check the main Open MPI web site for additional
documentation and support. (Note: The FAQ file considers OpenFabrics
tuning among other issues.)
3.1 Setting up for Open MPI
---------------------------
Selecting to use Open MPI via the mpi-selector-menu and mpi-selector
tools will perform all the necessary setup for users to build and run
Open MPI applications. If you use the MPI selector tools, you can
skip the rest of this section.
If you do not wish to use the MPI selector tools, the Open MPI team
strongly advises users to put the Open MPI installation directory in
their PATH and LD_LIBRARY_PATH. This can be done at the system level
if all users are going to use Open MPI. Specifically:
- add <openmpi_install>/bin to PATH
- add <openmpi_install>/lib to LD_LIBRARY_PATH
where <openmpi_install> is the directory in which the desired Open MPI
instance was installed ("instance" refers to the compiler used for
Open MPI compilation at install time).
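For example, the following lines set both variables. The install directory
below assumes the default prefix and the gcc-built instance of this OFED
release; adjust it to the instance actually installed on your system:

```shell
#!/bin/sh
# Put the (assumed) Open MPI instance first on the search paths.
OMPI=/usr/mpi/gcc/openmpi-1.3.2

PATH=$OMPI/bin:$PATH
# Add a separating colon only if LD_LIBRARY_PATH was already non-empty.
LD_LIBRARY_PATH=$OMPI/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export PATH LD_LIBRARY_PATH
```

Placing these lines in a shell startup file (or in a file under
/etc/profile.d/) makes the setting per-user or system-wide, respectively.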
If you are using a job scheduler to launch MPI jobs (e.g., SLURM,
Torque), setting the PATH and LD_LIBRARY_PATH is still required, but
it does not need to be set in your shell startup files. Procedures
describing how to add these values to PATH and LD_LIBRARY_PATH are
described in detail at:
http://www.open-mpi.org/faq/?category=running
3.2 Open MPI Installation Support / Updates
-------------------------------------------
The OFED package will install Open MPI with support for TCP, shared
memory, and the OpenFabrics network stacks. No other networks are
supported by the OFED Open MPI installation.
Open MPI supports a wide variety of run-time environments. The OFED
installer will not include support for all of them, however (e.g.,
Torque and PBS-based environments are not supported by the
OFED-installed Open MPI).
The ompi_info command can be used to see what support was installed;
look for plugins for your specific environment / network / etc. If
you do not see them, the OFED installer did not include support for
them.
As described above, administrators or users can go to the Open MPI web
site and download / install either a newer version of Open MPI (if
available), or the same version with different configuration options
(e.g., support for Torque / PBS-based environments).
3.3 Compiling Open MPI Applications
-----------------------------------
(copied from http://www.open-mpi.org/faq/?category=mpi-apps -- see
this web page for more details)
The Open MPI team strongly recommends that you simply use Open MPI's
"wrapper" compilers to compile your MPI applications. That is, instead
of using (for example) gcc to compile your program, use mpicc. Open
MPI provides a wrapper compiler for four languages:
Language Wrapper compiler name
------------- --------------------------------
C mpicc
C++ mpiCC, mpicxx, or mpic++
(note that mpiCC will not exist
on case-insensitive file-systems)
Fortran 77 mpif77
Fortran 90 mpif90
------------- --------------------------------
Note that if no Fortran 77 or Fortran 90 compilers were found when
Open MPI was built, Fortran 77 and 90 support will automatically be
disabled (respectively).
If you expect to compile your program as:
> gcc my_mpi_application.c -lmpi -o my_mpi_application
Simply use the following instead:
> mpicc my_mpi_application.c -o my_mpi_application
Specifically: simply adding "-lmpi" to your normal compile/link
command line *will not work*. See
http://www.open-mpi.org/faq/?category=mpi-apps if you cannot use the
Open MPI wrapper compilers.
Note that Open MPI's wrapper compilers do not do any actual compiling
or linking; all they do is manipulate the command line and add in all
the relevant compiler / linker flags and then invoke the underlying
compiler / linker (hence, the name "wrapper" compiler). More
specifically, if you run into a compiler or linker error, check your
source code and/or back-end compiler -- it is usually not the fault of
the Open MPI wrapper compiler.
3.4 Running Open MPI Applications:
----------------------------------
Open MPI uses either the "mpirun" or "mpiexec" commands to launch
applications. If your cluster uses a resource manager (such as
SLURM), providing a hostfile is not necessary:
> mpirun -np 4 my_mpi_application
If you use rsh/ssh to launch applications, they must be set up to NOT
prompt for a password (see http://www.open-mpi.org/faq/?category=rsh
for more details on this topic). Moreover, you need to provide a
hostfile containing a list of hosts to run on.
Example:
> cat hostfile
host1.example.com
host2.example.com
host3.example.com
host4.example.com
> mpirun -np 4 -hostfile hostfile my_mpi_application
(application runs on all 4 hosts)
In the following examples, replace <np> with the number of processes to run,
and <hostfile> with the filename of a valid hostfile listing the hosts
to run on (unless you are running under a supported resource manager,
in which case a hostfile is unnecessary).
Also note that Open MPI is highly run-time tunable. There are many
options that can be tuned to obtain optimal performance of your MPI
applications (see the Open MPI web site / FAQ for more information:
http://www.open-mpi.org/faq/).
- <np> is an integer indicating how many MPI processes to run (e.g., 2)
- <hostfile> is the filename of a hostfile, as described above
Example 1: Running the OSU bandwidth:
> cd /usr/mpi/gcc/openmpi-1.3.2/tests/osu_benchmarks-3.0
> mpirun -np <np> -hostfile <hostfile> osu_bw
Example 2: Running the Intel MPI Benchmark benchmarks:
> cd /usr/mpi/gcc/openmpi-1.3.2/tests/IMB-3.1
> mpirun -np <np> -hostfile <hostfile> IMB-MPI1
--> Note that the version of IMB-EXT that ships in this version of
OFED contains a bug that will cause it to immediately error
out when run with Open MPI.
Example 3: Running the Presta benchmarks:
> cd /usr/mpi/gcc/openmpi-1.3.2/tests/presta-1.4.0
> mpirun -np <np> -hostfile <hostfile> com -o 100
3.5 More Open MPI Information
-----------------------------
Much, much more information is available about using and tuning Open
MPI (to include OpenFabrics-specific tunable parameters) on the Open
MPI web site FAQ:
http://www.open-mpi.org/faq/
Users who cannot find the answers that they are looking for, or are
experiencing specific problems should consult the "how to get help" web
page for more information:
http://www.open-mpi.org/community/help/
===============================================================================
4. MVAPICH2 MPI
===============================================================================
MVAPICH2 is an MPI-2 implementation which includes all MPI-1 features.
It is based on MPICH2 and MVICH. MVAPICH2 provides many features including
fault-tolerance with checkpoint-restart, RDMA_CM support, iWARP support,
optimized collectives, on-demand connection management, multi-core optimized
and scalable shared memory support, and memory hook with ptmalloc2 library
support. The ADI-3-level design of MVAPICH2 supports many features including:
MPI-2 functionalities (one-sided, collectives and data-type), multi-threading
and all MPI-1 functionalities. It also supports a wide range of platforms
(architecture, OS, compilers, InfiniBand adapters and iWARP adapters). More
information can be found on the MVAPICH2 project site:
http://mvapich.cse.ohio-state.edu/overview/mvapich2/
A valid Fortran compiler must be present in order to build the MVAPICH2
MPI stack and tests. The following compilers are supported by OFED's
MVAPICH2 MPI package: gcc, intel, pgi, and pathscale. The install script
prompts the user to choose the compiler with which to build the MVAPICH2
MPI RPM. Note that more than one compiler can be selected simultaneously,
if desired.
The install script prompts for various MVAPICH2 build options as detailed
below:
- Implementation (OFA or uDAPL) [default "OFA"]
- OFA (IB and iWARP) Options:
- ROMIO Support [default Y]
- Shared Library Support [default Y]
- Checkpoint-Restart Support [default N]
* requires an installation of BLCR and prompts for the
BLCR installation directory location
- uDAPL Options:
- ROMIO Support [default Y]
- Shared Library Support [default Y]
- Cluster Size [default "Small"]
- I/O Bus [default "PCI-Express"]
- Link Speed [default "SDR"]
- Default DAPL Provider [default ""]
* the default provider is determined based on detected OS
For non-interactive builds where no MVAPICH2 build options are stored in
the OFED configuration file, the default settings are:
Implementation: OFA
ROMIO Support: Y
Shared Library Support: Y
Checkpoint-Restart Support: N
4.1 Setting up for MVAPICH2
---------------------------
Selecting to use MVAPICH2 via the MPI selector tools will perform
most of the setup necessary to build and run MPI applications with
MVAPICH2. If one does not wish to use the MPI Selector tools, using
the following settings should be enough:
- add <mvapich2_install>/bin to PATH
The above <mvapich2_install> is the directory where the desired MVAPICH2
instance was installed ("instance" refers to the path based on
the RPM package name, including the compiler chosen during the
install). It is also possible to source the following files
in order to set up the proper environment:
source <mvapich2_install>/bin/mpivars.sh [for Bourne based shells]
source <mvapich2_install>/bin/mpivars.csh [for C based shells]
In addition to the user environment settings handled by the MPI selector
tools, some other system settings might need to be modified. MVAPICH2
requires the memlock resource limit to be modified from the default
in /etc/security/limits.conf:
* soft memlock unlimited
MVAPICH2 requires bidirectional rsh or ssh without a password to work.
The default is ssh, and in this case it will be required to add the
following line to the /etc/init.d/sshd script before sshd is started:
ulimit -l unlimited
It is also possible to specify a specific size in kilobytes instead
of unlimited if desired.
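The memlock requirement above can be checked with a small sketch;
`check_memlock` is a hypothetical helper that scans a limits.conf-style
stream and reports the soft memlock value configured for '*', or "missing"
if no such entry exists:

```shell
#!/bin/sh
# Report the soft memlock limit configured for all users ('*') in a
# limits.conf-style input read from stdin.
check_memlock() {
    awk '$1 == "*" && $2 == "soft" && $3 == "memlock" { v = $4 }
         END { if (v == "") v = "missing"; print v }'
}

# Typically run as: check_memlock < /etc/security/limits.conf
printf '* soft memlock unlimited\n' | check_memlock   # prints: unlimited
```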
The MVAPICH2 OFA build requires an /etc/mv2.conf file specifying the
IP address of an Infiniband HCA (IPoIB) for RDMA-CM functionality
or the IP address of an iWARP adapter for iWARP functionality if
either of those are desired. This is not required by default, unless
either of the following runtime environment variables are set when
using the OFA MVAPICH2 build:
RDMA-CM
-------
MV2_USE_RDMA_CM=1
iWARP
-----
MV2_USE_IWARP_MODE=1
Otherwise, the OFA build will work without an /etc/mv2.conf file using
only the Infiniband HCA directly.
The MVAPICH2 uDAPL build requires an /etc/dat.conf file specifying the
DAPL provider information. The default DAPL provider is chosen at
build time, with a default value of "ib0", however it can also be
specified at runtime by setting the following environment variable:
MV2_DEFAULT_DAPL_PROVIDER=<dapl provider>
More information about MVAPICH2 can be found in the MVAPICH2 User Guide:
http://mvapich.cse.ohio-state.edu/support/
4.2 Compiling MVAPICH2 Applications
-----------------------------------
The MVAPICH2 compiler commands for each language are:
Language Compiler Command
-------- ----------------
C mpicc
C++ mpicxx
Fortran 77 mpif77
Fortran 90 mpif90
The system compiler commands should not be used directly. The Fortran 90
compiler command only exists if a Fortran 90 compiler was used during the
build process.
4.3 Running MVAPICH2 Applications
---------------------------------
4.3.1 Running MVAPICH2 Applications with mpirun_rsh
---------------------------------------------------
From release 1.2, MVAPICH2 comes with a faster and more scalable startup based
on mpirun_rsh. To launch an MPI job with mpirun_rsh, password-less ssh needs to
be enabled across all nodes.
Note: ssh will be used by default. In order to use rsh, use the -rsh option on
the mpirun_rsh commandline. For more options, see mpirun_rsh -help or the
MVAPICH2 user guide.
*** Running 4 processes on 4 nodes ***
$ cat > hostfile
node1
node2
node3
node4
$ mpirun_rsh -np 4 -hostfile hostfile /path/to/my_mpi_app
*** Running OSU tests ***
/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0/osu_bw
/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0/osu_latency
/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0/osu_bibw
/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0/osu_bcast
*** Running Intel MPI Benchmark test (Full test) ***
/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/IMB-3.1/IMB-MPI1
*** Running Presta test ***
/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/com -o 100
/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/glob -o 100
/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/globalop
4.3.2 Running MVAPICH2 Applications with mpd and mpiexec
--------------------------------------------------------
Launching processes in MVAPICH2 is a two-step process. First, mpdboot must
be used to launch MPD daemons on the desired hosts. Second, the mpiexec
command is used to launch the processes. MVAPICH2 requires bidirectional
ssh or rsh without a password; which one is used is specified when the MPD
daemons are launched with the mpdboot command, through the --rsh command
line option. The default is ssh. Once the processes have finished, stop the
MPD daemons with the mpdallexit command. The following example shows the
basic procedure:
4 Processes on 4 Hosts Example:
$ cat >hostsfile
node1.example.com
node2.example.com
node3.example.com
node4.example.com
$ mpdboot -n 4 -f ./hostsfile
$ mpiexec -n 4 ./my_mpi_application
$ mpdallexit
It is also possible to use the mpirun command in place of mpiexec. They are
actually the same command in MVAPICH2; however, using mpiexec is preferred.
It is possible to run more processes than hosts. In this case, multiple
processes will run on some or all of the hosts used. The following examples
demonstrate how to run the MPI tests. The default installation prefix and
gcc version of MVAPICH2 are shown. In each case, it is assumed that a hosts
file listing two hosts has been created in the test directory.
OSU Tests Example:
$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0
$ mpdboot -n 2 -f ./hosts
$ mpiexec -n 2 ./osu_bcast
$ mpiexec -n 2 ./osu_bibw
$ mpiexec -n 2 ./osu_bw
$ mpiexec -n 2 ./osu_latency
$ mpdallexit
Intel MPI Benchmark Example:
$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/IMB-3.1
$ mpdboot -n 2 -f ./hosts
$ mpiexec -n 2 ./IMB-MPI1
$ mpdallexit
Presta Benchmarks Example:
$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0
$ mpdboot -n 2 -f ./hosts
$ mpiexec -n 2 ./com -o 100
$ mpiexec -n 2 ./glob -o 100
$ mpiexec -n 2 ./globalop
$ mpdallexit
Mellanox Technologies - www.mellanox.com
****************************************
MSTFLINT Package - Firmware Burning and Diagnostics Tools
1) Overview
This package contains a burning tool and diagnostic tools for Mellanox
manufactured HCA/NIC cards. It also provides access to the relevant source
code. Please see the file LICENSE for licensing details.
----------------------------------------------------------------------------
NOTE:
This burning tool should be used only with Mellanox-manufactured
HCA/NIC cards. Using it with cards manufactured by other vendors
may be harmful to the cards (due to different configurations).
Using the diagnostic tools is normally safe for all HCAs/NICs.
----------------------------------------------------------------------------
2) Package Contents
a) mstflint source code
b) mflash lib
This lib provides Flash access through Mellanox HCAs.
c) mtcr lib (implemented in mtcr.h file)
This lib enables access to HCA hardware registers.
d) mstregdump utility
This utility dumps hardware registers from Mellanox hardware
for later analysis by Mellanox.
e) mstvpd
This utility dumps the on-card VPD.
3) Installation
a) Build the mstflint utility. This package is built using a standard
autotools method.
Example:
> ./configure
> make
> make install
- Run "configure --help" for custom configuration options.
- Typically, root privileges are required to run "make install"
4) Hardware Access Device Names
The tools in this package require a device name in the command
line. The device name is the identifier of the target CA.
This section describes the device name formats and the HW access flow.
a) The devices can be accessed by their PCI ID as displayed by lspci
(bus:dev.fn).
Example:
# List all Mellanox devices
> /sbin/lspci -d 15b3:
02:00.0 Ethernet controller: Mellanox Technologies Unknown device 6368 (rev a0)
# Use mstflint tool to query the firmware on this device
> mstflint -d 02:00.0 q
b) When the IB driver (mlx4 or mthca) is loaded, the devices can be accessed
by their IB device name.
Example:
# List the IB devices
> ibv_devinfo | grep hca_id
hca_id: mlx4_0
# Use mstvpd tool to dump the VPD of this device
> mstvpd mlx4_0
c) PCI configuration access
In examples a and b above, the device is accessed via PCI Memory Mapping.
The device can also be accessed by PCI configuration cycles.
PCI configuration access is slower and less safe than memory access --
use it only if methods a and b above do not work.
To force configuration access, use device names in the following format:
/proc/bus/pci/<bus>/<dev.fn>
Example:
# List all Mellanox devices
> /sbin/lspci -d 15b3:
02:00.0 Ethernet controller: Mellanox Technologies Unknown device 6368 (rev a0)
# Use mstregdump to dump HW registers, using PCI config cycles
> mstregdump /proc/bus/pci/02/00.0 > crdump.log
Note: Typically, you will need root privileges for hardware access
5) Usage (mstflint):
Read mstflint usage. Enter "./mstflint -h" for a short help message, or
"./mstflint -hh" for a detailed help message.
Obtaining firmware files:
If you purchased your card from Mellanox Technologies, please use the
Mellanox website (www.mellanox.com, under 'Firmware' downloads) to
download the firmware for your card.
If you purchased your card from a vendor other than Mellanox, get a
specific firmware configuration (INI) file from your HCA card vendor and
generate the binary image.
Use mstflint to burn a device according to the burning instructions in
"mstflint -hh" and on the Mellanox website firmware page.
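As a sketch (the PCI device name 02:00.0 matches the lspci example above;
the image file name is a placeholder for the file downloaded from Mellanox
or generated from your vendor's INI file), a typical query-burn-verify
sequence looks like:

```shell
# Query the currently installed firmware
> mstflint -d 02:00.0 q

# Burn the new image (fw-image.bin is a placeholder file name)
> mstflint -d 02:00.0 -i fw-image.bin burn

# Verify the image after burning
> mstflint -d 02:00.0 v
```

See "mstflint -hh" for the authoritative option list.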
6) Usage (mstregdump):
An internal register dump is displayed to the standard output.
Please store it in a file for analysis by Mellanox.
Example:
> mstregdump mthca0 > dumpfile
7) Usage (mstvpd):
A VPD dump is displayed to the standard output.
A list of keywords to dump can be supplied after the -- flag
to apply an output filter.
Examples:
> mstvpd mthca0
ID: Lion cub DDR
PN: MHGA28-1T
EC: A3
SN: MT0551X00740
V0: PCIe x8
V1: N/A
YA: R R
RW:
> mstvpd mthca0 -- PN ID
PN: MHGA28-1T
ID: Lion cub DDR
8) Problem Reporting:
Please collect the following information when reporting issues:
uname -a
cat /etc/issue
cat /proc/bus/pci/devices
mstflint -vv
lspci
mstflint -d 02:00.0 v
mstflint -d 02:00.0 q
mstvpd 02:00.0
===============================================================================
MLNX_EN driver for Mellanox Adapter Cards with 10GigE Support
README for MLNX_OFED 1.4
March 2009
===============================================================================
Contents:
=========
1. Overview
2. Software Dependencies
3. Ethernet Driver Usage and Configuration
4. Known Issues
5. Troubleshooting
1. Overview
===========
The MLNX_EN driver is composed of the mlx4_core and mlx4_en kernel modules.
The MLNX_EN driver release exposes the following capabilities:
- Single/Dual port
- Fibre Channel over Ethernet (FCoE)
- Up to 16 Rx queues per port
- Rx steering mode: Receive Core Affinity (RCA)
- Tx arbitration mode: VLAN user-priority (off by default)
- MSI-X or INTx
- Adaptive interrupt moderation
- HW Tx/Rx checksum calculation
- Large Send Offload (i.e., TCP Segmentation Offload)
- Large Receive Offload
- IP reassembly offload for fragmented IP packets
- Multi-core NAPI support
- VLAN Tx/Rx acceleration (HW VLAN stripping/insertion)
- HW VLAN filtering
- HW multicast filtering
- ifconfig up/down + mtu changes (up to 10K)
- Ethtool support
- Net device statistics
- CX4 connectors (XAUI) or XFP
2. Software Dependencies
========================
- The mlx4_en module uses a Linux implementation for Large Receive Offload
(LRO) in kernel 2.6.24 and later. These kernels require installing the
"inet_lro" module.
3. Ethernet Driver Usage and Configuration
==========================================
- To assign an IP address to the interface run:
#> ifconfig eth<x> <ip-address>
where 'x' is the OS assigned interface number.
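For example (the interface number, address, and netmask below are
placeholders; substitute values appropriate for your network):

```shell
#> ifconfig eth2 192.168.10.4 netmask 255.255.255.0 up
```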
- To check driver and device information run:
#> ethtool -i eth<x>
Example:
#> ethtool -i eth2
driver: mlx4_en (MT_0BD0110004)
version: 1.4.0 (Dec 2008)
firmware-version: 2.6.0
bus-info: 0000:0e:00.0
- To query stateless offload status run:
#> ethtool -k eth<x>
- To set stateless offload status run:
#> ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off]
- To query interrupt coalescing settings run:
#> ethtool -c eth<x>
- By default, the driver uses adaptive interrupt moderation for the receive path,
which adjusts the moderation time according to the traffic pattern.
Adaptive moderation settings can be set by:
#> ethtool -C eth<x> adaptive-rx on|off
- To set interrupt coalescing settings run:
#> ethtool -C eth<x> [rx-usecs N] [rx-frames N] [tx-usecs N] [tx-frames N]
Note: usec settings correspond to the time to wait after the *last* packet
sent/received before triggering an interrupt
- To query pause frame settings run:
#> ethtool -a eth<x>
- To set pause frame settings run:
#> ethtool -A eth<x> [rx on|off] [tx on|off]
- To obtain additional device statistics, run:
#> ethtool -S eth<x>
The driver defaults to the following parameters:
- Both ports are activated (i.e., a net device is created for each port)
- The number of Rx rings for each port is the number of on-line CPUs
- Per-core NAPI is enabled
- LRO is enabled with 32 concurrent sessions per Rx ring
Some of these values can be changed using module parameters, which are
detailed by running:
#> modinfo mlx4_en
To set non-default values to module parameters, the following line should be
added to /etc/modprobe.conf file:
"options mlx4_en <parameter>=<value> ..."
Values of all parameters can be observed in /sys/module/mlx4_en/parameters/.
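As a sketch, the flow for changing a module parameter looks like this (the
parameter name is a placeholder; list the real names and their meanings
with "modinfo mlx4_en"):

```shell
# In /etc/modprobe.conf:
#   options mlx4_en <parameter>=<value>

# Reload the driver so the new value takes effect
modprobe -r mlx4_en
modprobe mlx4_en

# Confirm the value that actually took effect
cat /sys/module/mlx4_en/parameters/<parameter>
```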
4. Known Issues
===============
- For RedHat EL4, adding and removing multiple vlan interfaces over the network
interface created by the mlx4_en driver may lead to printing the following:
"kernel: unregister_netdevice: waiting for eth<x> to become free. Usage count = <n>"
- iperf with multiple (> 100) streams may fail on kernel.org 2.6.25 versions
earlier than 2.6.25.9.
- mlx4_en driver is not supported on PPC64 and IA64
5. Troubleshooting
==================
Problem: I restarted the driver and received the following error message:
mlx4_core 0000:13:00.0: PCI device did not come back after reset, aborting.
mlx4_core 0000:13:00.0: Failed to reset HCA, aborting.
Suggestion: This error appears if you have burnt new firmware to the adapter
card but have not rebooted the machine yet. Reboot the machine to allow the
new firmware to take effect.
Open Fabrics Enterprise Distribution (OFED)
CHELSIO T3 RNIC RELEASE NOTES
May 2009
The iw_cxgb3 and cxgb3 modules provide RDMA and NIC support for the
Chelsio S series adapters. Make sure you choose the 'cxgb3' and
'libcxgb3' options when generating your ofed-1.4.1 rpms.
============================================
New for ofed-1.4.1
============================================
- NFSRDMA support.
- 7.4 Firmware support. See below for more information on updating
your RNIC to the latest firmware.
============================================
Enabling Various MPIs
============================================
For OpenMPI, Intel MPI, HP MPI, and Scali MPI: you must set the iw_cxgb3
module option peer2peer=1 on all systems. This can be done by writing
to the /sys/module file system during boot. EG:
# echo 1 > /sys/module/iw_cxgb3/parameters/peer2peer
Or you can add the following line to /etc/modprobe.conf to set the option
at module load time:
options iw_cxgb3 peer2peer=1
For Intel MPI, HP MPI, and Scali MPI: Enable the chelsio device by adding
an entry to /etc/dat.conf for the chelsio interface. For instance,
if your chelsio interface name is eth2, then the following line adds a
DAT device named "chelsio" for that interface:
chelsio u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
=============
Intel MPI:
=============
The following env vars enable Intel MPI version 3.1.038. Place these
in your user env after installing and setting up Intel MPI:
export RSH=ssh
export DAPL_MAX_INLINE=64
export I_MPI_DEVICE=rdssm:chelsio
export MPIEXEC_TIMEOUT=180
export MPI_BIT_MODE=64
Note: I_MPI_DEVICE=rdssm:chelsio assumes you have an entry in
/etc/dat.conf named "chelsio".
Contact Intel for obtaining their MPI with DAPL support.
=============
HP MPI:
=============
To run HP MPI applications, use these mpirun options:
-prot -e DAPL_MAX_INLINE=64 -UDAPL
EG:
$ mpirun -prot -e DAPL_MAX_INLINE=64 -UDAPL -hostlist r1-iw,r2-iw ~/tests/presta-1.4.0/glob
Where r1-iw and r2-iw are hostnames mapping to the chelsio interfaces.
Also this assumes your first entry in /etc/dat.conf is for the chelsio
device.
Contact HP for obtaining their MPI with DAPL support.
=============
Scali MPI:
=============
The following env vars enable Scali MPI. Place these in your user env
after installing and setting up Scali MPI for running over Infiniband:
export DAPL_MAX_INLINE=64
export SCAMPI_NETWORKS=chelsio
export SCAMPI_CHANNEL_ENTRY_COUNT="chelsio:128"
Note: SCAMPI_NETWORKS=chelsio assumes you have an entry in /etc/dat.conf
named "chelsio".
Contact Scali for obtaining their MPI with DAPL support.
=============
OpenMPI:
=============
OpenMPI iWARP support is only available in OpenMPI version 1.3 or greater.
Open MPI will work without any specific configuration via the openib btl.
Users wishing to performance tune the configurable options may wish to
inspect the receive queue values. Those can be found in the "Chelsio T3"
section of mca-btl-openib-hca-params.ini.
============================================
Loadable Module options:
============================================
The following options can be used when loading the iw_cxgb3 module to
tune the iWARP driver:
cong_flavor - set the congestion control algorithm. Default is 1.
0 == Reno
1 == Tahoe
2 == NewReno
3 == HighSpeed
snd_win - set the TCP send window in bytes. Default is 32kB.
rcv_win - set the TCP receive window in bytes. Default is 256kB.
crc_enabled - set whether MPA CRC should be negotiated. Default is 1.
markers_enabled - set whether to request receiving MPA markers. Default is
0; do not request to receive markers.
NOTE: The Chelsio RNIC fully supports markers, but
the current OFA RDMA-CM doesn't provide an API for
requesting either markers or crc to be negotiated. Thus
this functionality is provided via module parameters.
mpa_rev - set the MPA revision to be used. Default is 1, which is
spec compliant. Set to 0 to connect with the Ammasso 1100
rnic.
ep_timeout_secs - set the number of seconds for timing out MPA start up
negotiation and normal close. Default is 60.
peer2peer - Enables connection setup changes to allow peer2peer
applications to work over chelsio rnics. This enables
the following applications:
Intel MPI
HP MPI
Open MPI
Scali MPI
Set peer2peer=1 on all systems to enable these
applications.
The following options can be used when loading the cxgb3 module to
tune the NIC driver:
msi - whether to use MSI or MSI-X. Default is 2.
0 = only pin
1 = only MSI or pin
2 = use MSI/X, MSI, or pin, based on system
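Putting the options above together, a modprobe.conf fragment enabling
peer2peer and keeping the documented defaults explicit might look like the
following sketch (all values other than peer2peer are the defaults listed
above):

```shell
# /etc/modprobe.conf
# peer2peer=1 is required for Intel MPI, HP MPI, Open MPI, and Scali MPI;
# snd_win/rcv_win are the documented defaults (32kB / 256kB)
options iw_cxgb3 peer2peer=1 snd_win=32768 rcv_win=262144
# msi=2: use MSI-X, MSI, or pin, based on the system
options cxgb3 msi=2
```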
============================================
Updating Firmware:
============================================
This release requires firmware version 7.x, and Protocol SRAM version
1.1.x. This firmware can be downloaded from http://service.chelsio.com.
If your distro/kernel supports firmware loading, you can place the
chelsio firmware and psram images in /lib/firmware, then unload and reload
the cxgb3 module to get the new images loaded. If this does not work,
then you can load the firmware images manually:
Obtain the cxgbtool tool and the update_eeprom.sh script from Chelsio.
To build cxgbtool:
# cd <cxgbtool source directory>
# make && make install
Then load the cxgb3 driver:
# modprobe cxgb3
Now note the ethernet interface name for the T3 device. This can be
done by typing 'ifconfig -a' and noting the interface name for the
interface with a HW address that begins with "00:07:43". Then load the
new firmware and eeprom file:
# cxgbtool ethxx loadfw <firmware image>
# update_eeprom.sh ethxx
# reboot
============================================
Testing connectivity with ping and rping:
============================================
Configure the ethernet interfaces for your cxgb3 device. After you
modprobe iw_cxgb3 you will see one or two ethernet interfaces for the
T3 device. Configure them with an appropriate ip address, netmask, etc.
You can use the Linux ping command to test basic connectivity via the
T3 interface.
To test RDMA, use the rping command that is included in the librdmacm-utils
rpm:
On the server machine:
# rping -s -a 0.0.0.0 -p 9999
On the client machine:
# rping -c -VvC10 -a server_ip_addr -p 9999
You should see ping data like this on the client:
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
client DISCONNECT EVENT...
#
============================================
Addition Notes and Issues
============================================
1) To run uDAPL over the chelsio device, you must export this environment
variable:
export DAPL_MAX_INLINE=64
2) If you have a multi-homed host and the physical ethernet networks
are bridged, or if you have multiple chelsio rnics in the system, then
you need to configure arp to only send replies on the interface with
the target ip address:
sysctl -w net.ipv4.conf.all.arp_ignore=2
3) If you are building OFED against a kernel.org kernel later than
2.6.20, then make sure your kernel is configured with the cxgb3 and
iw_cxgb3 modules enabled. This forces the kernel to pull in the genalloc
allocator, which is required for the OFED iw_cxgb3 module. Make sure
these config options are included in your .config file:
CONFIG_CHELSIO_T3=m
CONFIG_INFINIBAND_CXGB3=m
4) If you run the RDMA latency test using the ib_rdma_lat program, make
sure you use the following command lines to limit the amount of inline
data to 64:
server: ib_rdma_lat -c -I 64
client: ib_rdma_lat -c -I 64 server_ip_addr
5) If you're running NFSRDMA over Chelsio's T3 RNIC and your clients are
using a 64KB page size (like PPC64 and IA64 systems) and your server is
using a 4KB page size (like i386 and X86_64), then you need to mount the
server using rsize=32768,wsize=32768 to avoid overrunning the Chelsio
RNIC fast register limits. This is a known firmware limitation in the
Chelsio RNIC.
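A mount command for such a client might look like the following sketch
(the server name, export path, and mount point are placeholders; port
20049 is the conventional NFS/RDMA port -- check your distribution's
NFS/RDMA documentation for the exact transport option on your kernel):

```shell
# Mount the export with reduced rsize/wsize over the RDMA transport
mount -o rdma,port=20049,rsize=32768,wsize=32768 server:/export /mnt/nfs
```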
Open Fabrics Enterprise Distribution (OFED)
STGT/iSER target in OFED 1.4 Release Notes
December 2008
* Background
iSER allows iSCSI to be layered over RDMA transports (including InfiniBand
and iWARP (RNIC)). Linux target framework (tgt) aims to simplify various SCSI
target driver (iSCSI, Fibre Channel, SRP, etc) creation and maintenance.
tgt supports the following target drivers (among others):
- iSCSI software (tcp) target driver for Ethernet/IPoIB NICs
- iSER software target driver for Infiniband and RDMA NICs
For iSCSI and iSER, tgt consists of a user-space daemon and user-space
tools; no special kernel support is needed other than the kernel (and
user-space) RDMA stacks.
The code is under the GNU General Public License version 2.
This package is based on a snapshot (clone) of the tgt git tree taken
on August 28th, 2008
* Supported platforms
RHEL 5 and its updates
SLES 10 and its service-packs
The release has been tested against the Linux open-iscsi initiator.
* STGT/iSER links
STGT home page
http://stgt.berlios.de
STGT git
git://git.kernel.org/pub/scm/linux/kernel/git/tomo/tgt.git
The STGT sources include some embedded documentation; specifically, the
README and REDMA.iscsi files are useful.
Wiki pages
Information on building/configuring/running the stgt/iser target
https://wiki.openfabrics.org/tiki-index.php?page=iSER-target
general and detailed information on iSCSI and iSER
http://www.voltaire.com/Products/Server_Products/iSER_iSCSI_RDMA
OpenSM Release Notes 3.2
=============================
Version: OpenSM 3.2.x
Repo: git://git.openfabrics.org/~sashak/management.git
Date: May 2009
1 Overview
----------
This document describes the contents of the OpenSM 3.2 release.
OpenSM is an InfiniBand-compliant Subnet Manager and Subnet Administration,
and runs on top of OpenIB. The OpenSM version for this release
is opensm-3.2.5.
This document includes the following sections:
1 This Overview section (describing new features and software
dependencies)
2 Known Issues And Limitations
3 Unsupported IB compliance statements
4 Bug Fixes
5 Main Verification Flows
6 Qualified Software Stacks and Devices
1.1 Major New Features
* Cached Routing
OpenSM provides an optional unicast routing cache (enabled by the '-A' or
'--ucast_cache' option). When enabled, the unicast routing cache prevents
routing recalculation (a heavy task in a large cluster) when no topology
change was detected during the heavy sweep, or when the topology change
does not require a new routing calculation, e.g. when one or more
CAs/RTRs/leaf switches go down, or when one or more of these nodes come
back after being down.
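For example, either of the following invocations enables the routing
cache:

```shell
opensm -A
# or equivalently
opensm --ucast_cache
```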
* Routing Chaining
Routing chaining is the ability to configure the order in which routing
algorithms are applied in opensm, i.e. '-R ftree,updn,minhop' - try
using ftree routing. If ftree fails, try updn. If updn fails, try
minhop.
* IPv6 Solicited Node Multicast addresses consolidation
When this mode is used (enabled with --consolidate_ipv6_snm_req option)
OpenSM will map all IPv6 Solicited Node Multicast address join requests
into a single Multicast group with address ff10:601b::1:ff00:0. In this
way limited MLID space is saved. This IBA noncompliant feature is very
useful with large (~> 1024 nodes) clusters.
* OpenSM sweep state machine rework
The large and buggy OpenSM sweep state machine was fully rewritten in a
safer and more effective synchronous manner.
* Multi-LID routing balancing for updn/minhop routing algorithms
When LMC > 0 is used, OpenSM ensures that routing paths are generated via
different switches and, when possible, different chassis.
* Preserve base LID routes when LMC > 0
When LMC > 0 is used, OpenSM preserves routing paths for base LIDs as
they would be with LMC = 0. In this way, traffic at each LID level is
not affected by LMC changes.
* Ordered routing paths balancing
This adds ability to predefine the port order in which routing paths
balancing is performed by OpenSM. Helps to improve performance
dramatically (40-50%) for applications with known communication
pattern. Activated with --guid_routing_order_file command line option.
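A sketch of using this option (the file path and GUID values below are
placeholders; the file lists port GUIDs, one per line, in the desired
balancing order):

```shell
# Create the routing order file (GUIDs shown are placeholders)
cat > /etc/opensm/guid_routing_order.conf << EOF
0x0002c90300001234
0x0002c90300005678
EOF

# Start OpenSM with ordered routing paths balancing
opensm --guid_routing_order_file /etc/opensm/guid_routing_order.conf
```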
* Unified OpenSM configuration
There is now a conventional config file instead of the hidden option cache
file (opensm.opts). OpenSM looks for it in a default location (consult the
man page for the exact path), or the file name can be specified with the
'-F' command line option. There is also an option ('-c') to generate a
config file template.
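A sketch of this flow (the file path is a placeholder; consult the man
page for the exact '-c' syntax in your version):

```shell
# Generate a config file template, then edit it as needed
opensm -c /etc/opensm/opensm.conf

# Start OpenSM with the edited config file
opensm -F /etc/opensm/opensm.conf
```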
* Query remote SMs during light sweep
Master OpenSM will query remote standby SMs periodically to catch its
possible state changes and react accordingly (as required by IBA spec).
* Predefined port ids for Up/Down algorithm
This is useful as Up/Down fine tuning tool - the algorithm will use
predefined port IDs instead of GUIDs for its decision about direction.
Activated with --ids_guid_file command line option.
* Improved plugin API version 2.
Now OpenSM will provide to plugins the access to all data structures.
This makes it possible to implement powerful multi-purpose plugins. All
OpenSM header files are installed now and specific configuration/build
options are exported via generated osm_config.h header file.
* Many code improvements, optimizations and cleanups
* Automatic daily snapshots generation.
This is not a "feature", but it simplifies access to recent OpenSM
bits.
1.2 Minor New Features:
* Cleanup cl_qlock_pool memory allocator - speedup memory allocations
* Support for configurable (via OSM_UMAD_MAX_PENDING environment variable)
size of pending MADs pool.
* Set packet life time to subnet timeout option rather than default
* Enforce routing paths rebalancing on switch reconnection
* In Up/Down routing algorithm compare GUID values in host byte order
* Add 'switchbalance' and 'lidbalance' commands for OpenSM console
* Respond to new trap 144 node description update flag
* Add '--connect_roots' command line options. This preserves connectivity
between root nodes in Up/Down routing algorithm
* Setting SL in the IPoIB MCast groups in accordance with QoS policy
* Dump auto detected root node guids in Up/Down routing algorithm
* Unify OpenSM dumpers code
* Unify various guid files parsers - add generic nodenamemap style parser
* When root node guids were provided in file update the list on each
Up/Down run
* During ./configure show values of configuration dirs and files
* Make prefix routes config file name configurable
* Add a Performance Manager HOWTO to the docs and the dist
* Support separate SA and SM keys as clarified in IBA 1.2.1
* Remove AM_MAINTAINER_MODE in ./configure
* Make vendor type OSM_VENDOR_INTF_OPENIB (libibumad) to be default
* Build osm_perfmgr_db.* content only when PerfMgr is enabled.
* Move PerfMgr event_db_dump_file to common OpenSM dump dir
* Allow space separated strings as values in OpenSM config
* Support for multiple event plugins
* Add '--version' command line option
* Add '--create-config <file-name>' command line option
* Speedup and simplify logging code
* Speedup multicast processing in SA DB
* In log messages convert unicast LIDs from hex to decimal format and
GIDs from hex to IPv6 address format
* Handle all possible ports in "ignore-guids" file
* Add 'reroute' console command
* Remove many install-exec-hook from Makefiles
* Some cleanups in LASH routing algorithm code
* In Makefiles remove -rpath and explicit -lpthread, -ldl from LDFLAGS
(move to configurator)
* Install all OpenSM header files
* Improve locking in SM Info receiver
* Add new OSM_EVENT_ID_SUBNET_UP event for plugins
* Redo lex and yacc files generation in conventional way
* Add a missing Node Description check on light sweep.
* Move vendor specific compilation defines from command to generated
config.h file
* Provide useful error message when log file opening fails
* Add generated osm_config.h file with OpenSM specific defines
* Display port number in decimal in log messages
* Replace osm_vendor_select.h by generated osm_config.h
* Unify options listing in OpenSM usage message
* LFT buffers handling simplification
* Add 'dump_conf' console command
* OpenSM performs sweep on SIGCONT (coming out of suspend).
* When our SM is in Standby state and its priority is increased
(via console command), notify master SM by sending Trap 144.
* When entering standby state (after discovery) notify master SM
with Trap 144.
* support more PortInfo:CapabilityMask bits
* When babbling port policy is on disable the port with the least hop
count.
1.3 Library API Changes
None
1.4 Software Dependencies
OpenSM depends on the installation of either OFED 1.x, OpenIB gen2 (e.g.
IBG2 distribution), OpenIB gen1 (e.g. IBGD distribution), or Mellanox
VAPI stacks. The qualified driver versions are provided in Table 2,
"Qualified IB Stacks".
Also, building of QoS manager policy file parser requires flex, and either
bison or byacc installed.
1.5 Supported Devices Firmware
The main task of OpenSM is to initialize InfiniBand devices. The
qualified devices and their corresponding firmware versions
are listed in Table 3.
2 Known Issues And Limitations
------------------------------
* No Service / Key associations:
There is no way to manage Service access by Keys.
* No SM to SM SMDB synchronization:
Puts the burden of re-registering services, multicast groups, and
inform-info on the client application (or IB access layer core).
* When running with QoS with default configuration (opensm -Q),
OpenSM prints list of "Invalid Cached Option" error messages.
This does not affect OpenSM functionality.
* SMs do not hand-over when running on ConnectX in a switch-based topology.
3 Unsupported IB Compliance Statements
--------------------------------------
The following section lists all the IB compliance statements which
OpenSM does not support. Please refer to the IB specification for detailed
information regarding each compliance statement.
* C14-22 (Authentication):
M_Key M_KeyProtectBits and M_KeyLeasePeriod shall be set in one
SubnSet method. As a work-around, an OpenSM option is provided for
defining the protect bits.
* C14-67 (Authentication):
On SubnGet(SMInfo) and SubnSet(SMInfo) - if M_Key is not zero then
the SM shall generate a SubnGetResp if the M_Key matches, or
silently drop the packet if M_Key does not match.
* C15-0.1.23.4 (Authentication):
InformInfoRecords shall always be provided with the QPN set to 0,
except for the case of a trusted request, in which case the actual
subscriber QPN shall be returned.
* o13-17.1.2 (Event-FWD):
If no permission to forward, the subscription should be removed and
no further forwarding should occur.
* C14-24.1.1.5 and C14-62.1.1.22 (Initialization):
GUIDInfo - SM should enable assigning Port GUIDInfo.
* C14-44 (Initialization):
If the SM discovers that it is missing an M_Key to update CA/RT/SW,
it should notify the higher level.
* C14-62.1.1.12 (Initialization):
PortInfo:M_Key - Set the M_Key to a node based random value.
* C14-62.1.1.13 (Initialization):
PortInfo:P_KeyProtectBits - set according to an optional policy.
* C14-62.1.1.24 (Initialization):
SwitchInfo:DefaultPort - should be configured for random FDB.
* C14-62.1.1.32 (Initialization):
RandomForwardingTable should be configured.
* o15-0.1.12 (Multicast):
If the JoinState is SendOnlyNonMember = 1 (only), then the endport
should join as sender only.
* o15-0.1.8 (Multicast):
If a request for creating an MCG with fields that cannot be met,
return ERR_REQ_INVALID (currently ignores SL and FlowLabelTClass).
* C15-0.1.8.6 (SA-Query):
Respond to SubnAdmGetTraceTable - this is an optional attribute.
* C15-0.1.13 Services:
Reject ServiceRecord create, modify or delete if the given
ServiceP_Key does not match the one included in the ServiceGID port
and the port that sent the request.
* C15-0.1.14 (Services):
Provide means to associate service name and ServiceKeys.
4 Bug Fixes
-----------
4.1 Major Bug Fixes
* Set SA attribute offset to 0 when no records are returned
* Send trap 64 only after new ports are in ACTIVE state.
* Fix in sending client reregistration bit
* Fix default OpenSM SM (and SA) Key byte order
* Fix in sending Multicast groups creation/deletion notification (Traps
66,67)
* Don't startup automatically on SuSE based systems
* Discovery bug, where some ports were left unlinked (without a remote side).
4.2 Other Bug Fixes
* opensm/osm_console.c: fix seg fault when running "portstatus ca" in
the console
* opensm: fix potential core dumps where osm_node_get_physp_ptr can
return NULL
* opensm/osm_mcast_mgr: limit spanning tree creation recursion to value
of max hops (64)
* opensm: switch LFTs incremental update fix
* opensm/osm_state_mgr.c: fix segmentation fault
* opensm: eliminate some potential NULL pointer dereferences
* opensm/osm_console.c: fix guid parsing
* opensm: fix off by 1 issue with max_lid and max_multicat_lid_ho
* opensm: fix potentially wrong port_guid initialization
* opensm/configure.in: fix wrong HAVE_DEFAULT_OPENSM_CONFIG_FILE define
generation
* opensm: fix snprintf() usage
* opensm/osm_sa_lft_record: validate LFT block number
* opensm/osm_sa_lft_record: pass block parameter in host byte order
* opensm/include/Makefile.am: don't duplicate header files in EXTRA_DIST
* opensm/osm_sa_class_port_info.c: fix over bound array access
* osmtest/osmt_service.c: fix over bound array access
* osmtest: fix qpn encoding in osmtest_informinfo_request()
* opensm/osm_vendor_mlx_sa.c: handling attribute offset of 0
* opensm: fix segfault corner case when osm_console_init fails
* opensm/console: close console socket on cleanup path
* opensm/osm_ucast_lash: fix buffer overflow
* opensm: fix broken IPv6 SNM consolidation code
* opensm/osm_sa_lft_record.c: fix block number encoding byte order
* opensm/osm_sa: fix memory leak in SA responder
* opensm/osm_mcast_mgr: fix memory leak
* opensm: fix qos config parsing bugs
* opensm/osm_mcast_tbl.c: fix sending invalid MF block due to max mlid
overflow
* opensm: log_max_size config parameter in MB
* opensm/osm_ucast_lash: fix extra memory allocations
* opensm: fix race in main OpenSM flow
* opensm/ftree: fix GUID check against cn_guid_file
* opensm/ftree: save FLT buffers memory allocations
* opensm/osm_sa_link_record.c: prevent potential endless recursion
* opensm: remove SM from sm_guid_tbl when IsSM port capability flag is
not set
* opensm: fix QoS config bug
* opensm: don't reassign zeroed params from config file
* opensm: update LFTs when entering master
* opensm: invalidate routing cache when entering master state
* opensm/osm_port_info_rcv.c: don't clear sw->need_update if port 0 is active
* Other less critical or visible bugs were also fixed.
5 Main Verification Flows
-------------------------
OpenSM verification is run using the following activities:
* osmtest - a stand-alone program
* ibmgtsim (IB management simulator) based - a set of flows that
simulate clusters, inject errors and verify OpenSM capability to
respond and bring up the network correctly.
* small cluster regression testing - where the SM is used on back to
back or single switch configurations. The regression includes
multiple OpenSM dedicated tests.
* cluster testing - when we run OpenSM to setup a large cluster, perform
hand-off, reboots and reconnects, verify routing correctness and SA
responsiveness at the ULP level (IPoIB and SDP).
5.1 osmtest
osmtest is an automated verification tool used for OpenSM
testing. Its verification flows are described in the list below.
* Inventory File: Obtain and verify all port info, node info, link and path
records parameters.
* Service Record:
- Register new service
- Register another service (with a lease period)
- Register another service (with service p_key set to zero)
- Get all services by name
- Delete the first service
- Delete the third service
- Bad flows of get/delete of an invalid service
- Add / Get same service with different data
- Add / Get / Delete by different component mask values (services
by Name & Key / Name & Data / Name & Id / Id only )
* Multicast Member Record:
- Query of existing Groups (IPoIB)
- BAD Join with insufficient comp mask (o15.0.1.3)
- Create given MGID=0 (o15.0.1.4)
- Create given MGID=0xFF12A01C,FE800000,00000000,12345678 (o15.0.1.4)
- Create BAD MGID=0xFA. (o15.0.1.6)
- Create BAD MGID=0xFF12A01B w/ link-local not set (o15.0.1.6)
- New MGID with invalid join state (o15.0.1.9)
- Retry of existing MGID - See JoinState update (o15.0.1.11)
- BAD RATE when connecting to existing MGID (o15.0.1.13)
- Partial JoinState delete request - removing FullMember (o15.0.1.14)
- Full Delete of a group (o15.0.1.14)
- Verify Delete by trying to Join deleted group (o15.0.1.14)
- BAD Delete of IPoIB membership (no prev join) (o15.0.1.15)
* GUIDInfo Record:
- All GUIDInfoRecords in subnet are obtained
* MultiPathRecord:
- Perform some compliant and noncompliant MultiPathRecord requests
- Validation is via status in responses and IB analyzer
* PKeyTableRecord:
- Perform some compliant and noncompliant PKeyTableRecord queries
- Validation is via status in responses and IB analyzer
* LinearForwardingTableRecord:
- Perform some compliant and noncompliant LinearForwardingTableRecord queries
- Validation is via status in responses and IB analyzer
* Event Forwarding: Register for trap forwarding using reports
- Send a trap and wait for report
- Unregister non-existing
* Trap 64/65 Flow: Register to Trap 64-65, create traps (by
disconnecting/connecting ports) and wait for report, then unregister.
* Stress Test: send PortInfoRecord queries, both single and RMPP and
check for the rate of responses as well as their validity.
5.2 IB Management Simulator OpenSM Test Flows:
The simulator provides the ability to simulate SM handling of virtual
topologies that are not limited to the lab equipment actually available.
OpenSM was simulated to bring up clusters of up to 10,000 nodes. Daily
regressions use smaller clusters (16 and 128 nodes).
The following test flows are run on the IB management simulator:
* Stability:
Up to 12 links from the fabric are randomly selected to drop packets
at drop rates up to 90%. The SM is required to succeed in bringing the
fabric up. The resulting routing is verified to be correct as well.
* LID Manager:
Using LMC = 2 the fabric is initialized with LIDs. Faults such as
zero LID, Duplicated LID, non-aligned (to LMC) LIDs are
randomly assigned to various nodes and other errors are randomly
output to the guid2lid cache file. The SM sweep is run 5 times and
after each iteration a complete verification is made to ensure that all
LIDs that could possibly be maintained are kept, as well as that all nodes
were assigned a legal LID range.
* Multicast Routing:
Nodes randomly join the 0xc000 group and eventually the
resulting routing is verified for completeness and adherence to
Up/Down routing rules.
* osmtest:
The complete osmtest flow as described in the previous table is run on
the simulated fabrics.
* Stress Test:
This flow merges fabric, LID and stability issues with continuous
PathRecord, ServiceRecord and Multicast Join/Leave activity to
stress the SM/SA during continuous sweeps. InformInfo Set/Delete/Get
operations were added to the test such that both existing and
non-existing nodes perform them in random order.
5.3 OpenSM Regression
Using a back-to-back or single switch connection, the following set of
tests is run nightly on the stacks described in table 2. The included
tests are:
* Stress Testing: Flood the SA with queries from multiple channel
adapters to check the robustness of the entire stack up to the SA.
* Dynamic Changes: Dynamic Topology changes, through randomly
dropping SMP packets, used to test OpenSM adaptation to an unstable
network & verify DB correctness.
* Trap Injection: This flow injects traps to the SM and verifies that it
handles them gracefully.
* SA Query Test: This test exhaustively checks the SA responses to all
possible single component masks. To do that, the test examines the
entire set of records the SA can provide, classifies them by their
field values and then selects every field (using component mask and a
value) and verifies that the response matches the expected set of records.
A random selection using multiple component mask bits is also performed.
5.4 Cluster testing:
Cluster testing is usually run before a distribution release. It
involves real hardware setups of 16 to 32 nodes (or more if a beta site
is available). Each test is validated by running all-to-all ping through the IB
interface. The test procedure includes:
* Cluster bringup
* Hand-off between 2 or 3 SMs while performing:
- Node reboots
- Switch power cycles (disconnecting the SMs)
* Unresponsive port detection and recovery
* osmtest from multiple nodes
* Trap injection and recovery
6 Qualified Software Stacks and Devices
---------------------------------------
OpenSM Compatibility
--------------------
Note that OpenSM version 3.2.1 and earlier used a value of 1 in host
byte order for the default SM_Key, so there is a compatibility issue
with these earlier versions of OpenSM when the 3.2.2 or later version
is running on a little endian machine. This affects SM handover as well
as SA queries (saquery tool in infiniband-diags).
Table 2 - Qualified IB Stacks
=============================
Stack | Version
-----------------------------------------|--------------------------
OFED | 1.4
OFED | 1.3
OFED | 1.2
OFED | 1.1
OFED | 1.0
OpenIB Gen2 (IBG2 distribution) | 1.0
OpenIB Gen1 (IBGD distribution) | 1.8.0
VAPI (Mellanox InfiniBand HCA Driver) | 3.2 and later
Table 3 - Qualified Devices and Corresponding Firmware
======================================================
Mellanox
Device | FW versions
------------------------------------|-------------------------------
InfiniScale | fw-43132 5.2.000 (and later)
InfiniScale III | fw-47396 0.5.000 (and later)
InfiniScale IV | fw-48436 7.1.000 (and later)
InfiniHost | fw-23108 3.5.000 (and later)
InfiniHost III Lx | fw-25204 1.2.000 (and later)
InfiniHost III Ex (InfiniHost Mode) | fw-25208 4.8.200 (and later)
InfiniHost III Ex (MemFree Mode) | fw-25218 5.3.000 (and later)
ConnectX IB | fw-25408 2.3.000 (and later)
QLogic/PathScale
Device | Note
--------|-----------------------------------------------------------
iPath | QHT6040 (PathScale InfiniPath HT-460)
iPath | QHT6140 (PathScale InfiniPath HT-465)
iPath | QLE6140 (PathScale InfiniPath PE-880)
iPath | QLE7240
iPath | QLE7280
Note 1: OpenSM does not run on an IBM Galaxy (eHCA) as it does not expose
QP0 and QP1. However, OpenSM supports eHCA as a device on the subnet.
Note 2: QoS firmware and Mellanox devices
HCAs: QoS supported by ConnectX. QoS-enabled FW release is 2_5_000 and
later.
Switches: QoS supported by InfiniScale III
Any InfiniScale III FW that is supported by OpenSM supports QoS.
Open Fabrics Enterprise Distribution (OFED)
OSU MPI MVAPICH-1.1.0, in OFED 1.4.r10 Release Notes
May 2009
===============================================================================
Table of Contents
===============================================================================
1. Overview
2. Software Dependencies
3. New Features
4. Bug Fixes
5. Known Issues
6. Main Verification Flows
===============================================================================
1. Overview
===============================================================================
These are the release notes for OSU MPI MVAPICH-1.1.0.
OSU MPI is an MPI channel implementation over InfiniBand
by Ohio State University (OSU).
See http://mvapich.cse.ohio-state.edu
===============================================================================
2. Software Dependencies
===============================================================================
OSU MPI depends on the installation of the OFED stack with OpenSM running.
The MPI module also requires an established network interface (either
InfiniBand IPoIB or Ethernet).
===============================================================================
3. New Features ( Compared to mvapich 1.0.0 )
===============================================================================
MVAPICH-1.1.0 has the following additional features:
- eXtended Reliable Connection (XRC) support
- Lock-free design to provide support for asynchronous
progress at both sender and receiver to overlap
computation and communication
- Optimized MPI_allgather collective
- Efficient intra-node shared memory communication
support for diskless clusters
- Enhanced Totalview Support with the new mpirun_rsh framework
===============================================================================
4. Bug Fixes ( Compared to mvapich 1.0.0 )
===============================================================================
- De-register stale memory regions earlier to prevent
excess allocations of physical memory
- Fixes for MPI_Query_thread and MPI_Is_thread_main
- Fixes for PGI compiler support
- Compilation warnings cleanup
- Fixes for optimized collectives
- Fix data types for memory allocations
- Multiple fixes for mpirun_rsh launcher
===============================================================================
5. Known Issues
===============================================================================
- Shared memory broadcast optimization is disabled by default.
- MVAPICH MPI compiled on AMD x86_64 does not work with MVAPICH MPI compiled
on Intel x86_64 (EM64T).
Workaround:
Use the "VIADEV_USE_COMPAT_MODE=1" run-time option to enable a compatibility
mode that works for both AMD and Intel platforms.
- A process running MPI cannot fork after MPI_Init unless the environment
variable IBV_FORK_SAFE=1 is set to enable fork support. This support also
requires a kernel version of 2.6.16 or higher.
- For users of Mellanox Technologies firmware fw-23108 or fw-25208 only:
MVAPICH might fail in its default configuration if your HCA is burnt with an
fw-23108 version that is earlier than 3.4.000, or with an fw-25208 version
4.7.400 or earlier.
NOTE: There is no issue if you chose to update firmware during Mellanox
OFED installation as newer firmware versions were burnt.
Workaround:
Option 1 - Update the firmware. For instructions, see Mellanox Firmware Tools
(MFT) User's Manual under the docs/ folder.
Option 2 - In mvapich.conf, set VIADEV_SRQ_ENABLE=0
- MVAPICH may fail to run on some SLES 10 machines due to problems in resolving
the host name.
Workaround: Edit /etc/hosts and comment-out/remove the line that maps
IP address 127.0.0.2 to the system's fully qualified hostname.
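The workaround can be rehearsed on a copy of the file before touching the
real /etc/hosts; in this sketch the sample file name and hostnames are
made up:

```shell
# Create a stand-in for /etc/hosts with the problematic SLES 10 mapping.
cat > hosts.sample <<'EOF'
127.0.0.1   localhost
127.0.0.2   myhost.example.com myhost
EOF
# Comment out the line that maps 127.0.0.2 to the fully qualified hostname.
sed 's/^127\.0\.0\.2/# &/' hosts.sample > hosts.fixed
grep '127.0.0.2' hosts.fixed
```

Once the result looks right, the same sed expression can be applied to the
real file (with a backup).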
===============================================================================
6. Main Verification Flows
===============================================================================
In order to verify the correctness of MVAPICH, the following tests and
parameters were run.
Test Description
-------------------------------------------------------------------
Intel's Test suite - 1400 Intel tests
BW/LT OSU's test for bandwidth latency
IMB Intel's MPI Benchmark test
mpitest b_eff test
Presta Presta multicast test
Linpack Linpack benchmark
NAS2.3 NAS NPB2.3 tests
SuperLU SuperLU benchmark (NERSC edition)
NAMD NAMD application
CAM CAM application
Distribution
Open Fabrics Enterprise Distribution (OFED) 1.4, December 2008
Summary
qperf - Measure RDMA and IP performance
Overview
qperf measures bandwidth and latency between two nodes. It can work over
TCP/IP as well as the RDMA transports.
Quick Start
* Since qperf measures latency and bandwidth between two nodes, you need
access to two nodes. Assume they are called node1 and node2.
* On node1, run qperf without any arguments. It will act as a server and
continue to run until asked to quit.
* To measure TCP bandwidth between the two nodes, on node2, type:
qperf node1 tcp_bw
* To measure RDMA RC latency, type (on node2):
qperf node1 rc_lat
* To measure RDMA UD latency using polling, type (on node2):
qperf node1 -P 1 ud_lat
* To measure SDP bandwidth, on node2, type:
qperf node1 sdp_bw
Documentation
* Man page available. Type
man qperf
* To get a list of examples, type:
qperf --help examples
* To get a list of tests, type:
qperf --help tests
Tests
Miscellaneous
conf Show configuration
quit Cause the server to quit
Socket Based
rds_bw RDS streaming one way bandwidth
rds_lat RDS one way latency
sctp_bw SCTP streaming one way bandwidth
sctp_lat SCTP one way latency
sdp_bw SDP streaming one way bandwidth
sdp_lat SDP one way latency
tcp_bw TCP streaming one way bandwidth
tcp_lat TCP one way latency
udp_bw UDP streaming one way bandwidth
udp_lat UDP one way latency
RDMA Send/Receive
ud_bw UD streaming one way bandwidth
ud_bi_bw UD streaming two way bandwidth
ud_lat UD one way latency
rc_bw RC streaming one way bandwidth
rc_bi_bw RC streaming two way bandwidth
rc_lat RC one way latency
uc_bw UC streaming one way bandwidth
uc_bi_bw UC streaming two way bandwidth
uc_lat UC one way latency
RDMA
rc_rdma_read_bw RC RDMA read streaming one way bandwidth
rc_rdma_read_lat RC RDMA read one way latency
rc_rdma_write_bw RC RDMA write streaming one way bandwidth
rc_rdma_write_lat RC RDMA write one way latency
rc_rdma_write_poll_lat RC RDMA write one way polling latency
uc_rdma_write_bw UC RDMA write streaming one way bandwidth
uc_rdma_write_lat UC RDMA write one way latency
uc_rdma_write_poll_lat UC RDMA write one way polling latency
InfiniBand Atomics
rc_compare_swap_mr RC compare and swap messaging rate
rc_fetch_add_mr RC fetch and add messaging rate
Verification
ver_rc_compare_swap Verify RC compare and swap
ver_rc_fetch_add Verify RC fetch and add
#!/bin/bash
#
# Copyright (c) 2006 Mellanox Technologies. All rights reserved.
# Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved.
#
# This Software is licensed under one of the following licenses:
#
# 1) under the terms of the "Common Public License 1.0" a copy of which is
# available from the Open Source Initiative, see
# http://www.opensource.org/licenses/cpl.php.
#
# 2) under the terms of the "The BSD License" a copy of which is
# available from the Open Source Initiative, see
# http://www.opensource.org/licenses/bsd-license.php.
#
# 3) under the terms of the "GNU General Public License (GPL) Version 2" a
# copy of which is available from the Open Source Initiative, see
# http://www.opensource.org/licenses/gpl-license.php.
#
# Licensee has the right to choose one of the above licenses.
#
# Redistributions of source code must retain the above copyright
# notice and one of the license notices.
#
# Redistributions in binary form must reproduce both the above copyright
# notice, one of the license notices in the documentation
# and/or other materials provided with the distribution.
#
# Description: creates Module.symvers file for InfiniBand modules
K_VER=${K_VER:-$(uname -r)}
MOD_SYMVERS_IB=./Module.symvers
SYMS=/tmp/syms
if [ -d /lib/modules/$K_VER/updates/kernel/drivers/infiniband ]; then
MODULES_DIR=/lib/modules/$K_VER/updates/kernel/drivers/infiniband
elif [ -d /lib/modules/$K_VER/kernel/drivers/infiniband ]; then
MODULES_DIR=/lib/modules/$K_VER/kernel/drivers/infiniband
else
echo "No infiniband modules found"
exit 1
fi
echo MODULES_DIR=${MODULES_DIR}
if [ -f ${MOD_SYMVERS_IB} -a ! -f ${MOD_SYMVERS_IB}.save ]; then
mv ${MOD_SYMVERS_IB} ${MOD_SYMVERS_IB}.save
fi
rm -f $MOD_SYMVERS_IB
rm -f $SYMS
n_mods=0
for mod in $(find ${MODULES_DIR} -name '*.ko') ; do
nm -o $mod |grep __crc >> $SYMS
n_mods=$((n_mods+1))
done
n_syms=$(wc -l $SYMS |cut -f1 -d" ")
echo Found $n_syms InfiniBand symbols in $n_mods InfiniBand modules
n=1
while [ $n -le $n_syms ] ; do
line=$(head -$n $SYMS|tail -1)
line1=$(echo $line|cut -f1 -d:)
line2=$(echo $line|cut -f2 -d:)
file=$(echo $line1|cut -f6- -d/)
file=$(echo $file|cut -f1 -d.)
crc=$(echo $line2|cut -f1 -d" ")
crc=${crc:8}
sym=$(echo $line2|cut -f3 -d" ")
sym=${sym:6}
echo -e "0x$crc\t$sym\t$file" >> $MOD_SYMVERS_IB
if [ -z "$allsyms" ] ; then
allsyms=$sym
else
allsyms="$allsyms|$sym"
fi
n=$((n+1))
done
echo ${MOD_SYMVERS_IB} created.
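The field extraction performed inside the loop above can be traced with a
single fabricated `nm -o` line; the module path, CRC, and symbol below are
illustrative, not real output:

```shell
# One sample line in the format produced by: nm -o <module>.ko | grep __crc
line='/lib/modules/2.6.18/kernel/drivers/infiniband/core/ib_core.ko:00000000deadbeef A __crc_ib_register_device'
line1=$(echo $line|cut -f1 -d:)   # path portion before the colon
line2=$(echo $line|cut -f2 -d:)   # "<crc> <type> <symbol>" portion
file=$(echo $line1|cut -f6- -d/)  # drop the /lib/modules/<ver>/... prefix
file=$(echo $file|cut -f1 -d.)    # strip the .ko extension
crc=$(echo $line2|cut -f1 -d" ")
crc=${crc:8}                      # keep the low 32 bits of the CRC
sym=$(echo $line2|cut -f3 -d" ")
sym=${sym:6}                      # strip the __crc_ prefix
echo -e "0x$crc\t$sym\t$file"     # one Module.symvers entry
```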
MPI Selector 1.0 release notes
December 2008
==============================
OFED contains a simple mechanism for system administrators and end
users to select which MPI implementation they want to use. The MPI
selector functionality is not specific to any MPI implementation; it
can be used with any implementation that provides shell startup files
that correctly set the environment for that MPI. The OFED installer
will automatically add MPI selector support for each MPI that it
installs. Additional MPIs not known by the OFED installer can be
listed in the MPI selector; see the mpi-selector(1) man page for
details.
Note that MPI selector only affects the default MPI environment for
*future* shells. Specifically, if you use MPI selector to select MPI
implementation ABC, this default selection will not take effect until
you start a new shell (e.g., logout and login again). Other packages
(such as environment modules) provide functionality that allows
changing your environment to point to a new MPI implementation in the
current shell. The MPI selector was not meant to duplicate or replace
that functionality.
The MPI selector functionality can be invoked in one of two ways:
1. The mpi-selector-menu command.
This command is a simple, menu-based program that allows the
selection of the system-wide MPI (usually only settable by root)
and a per-user MPI selection. It also shows what the current
selections are.
This command is recommended for all users.
2. The mpi-selector command.
This command is a CLI-equivalent of the mpi-selector-menu,
allowing for the same functionality as mpi-selector-menu but
without the interactive menus and prompts. It is suitable for
scripting.
See the mpi-selector(1) man page for more information.
Open Fabrics Enterprise Distribution (OFED)
Tips for Working with OFED 1.4
December 2008
===============================================================================
Table of Contents
===============================================================================
1. OFED Utilities
2. Debug HOWTOs
3. Pinning (Locking) User Memory Pages
4. External Module Compilation Over OFED-1.4
5. Adding/Deleting a patch to OFED package
6. Adding vendor specific actions to the installation of OFED
7. How to compile OFED sources manually
===============================================================================
1. OFED Utilities
===============================================================================
The OFED package includes utilities under <prefix>/bin, where <prefix>
stands for the OFED installation path. To retrieve this path, run the script
"/etc/infiniband/info" as explained in Section 2.2 below.
Notes:
------
1. This document includes descriptions for a subset of the existing utilities.
To learn about other utilities, use their --help flag.
2. The sources for all utilities are not part of the RPM installation. However,
all sources exist in the openib-1.4.tgz tarball.
1.1 Device Information
----------------------
Device information can be obtained using several utilities:
a. ibv_devinfo
ibv_devinfo prints the CA (channel adapter) attributes.
usage:
ibv_devinfo
Options:
-d, --ib-dev=<dev>      use IB device <dev> (default: first device found)
-i, --ib-port=<port>    use port <port> of IB device (default: all ports)
-l, --list              print only the IB devices names
-v, --verbose           print all the attributes of the IB device(s)
b. ibstat
usage:
ibstat [OPTIONS] <ca_name> [portnum]
Options:
-d debug
-l list all IB devices
-s print short device summary
-p print port GUIDs
-V print ibstat version information and exit
-h print usage
Examples:
ibstat -l # list all IB devices
ibstat mthca0 2 # stat port 2 of mthca0
c. Using sysfs file system
The driver supports the sysfs file system under: /sys/class/infiniband
Examples:
> ls /sys/class/infiniband/mthca0/
board_id device fw_ver hca_type hw_rev node_desc node_guid node_type
ports sys_image_guid
> cat /sys/class/infiniband/mthca0/board_id
MT_0200000001
> ls /sys/class/infiniband/mthca0/ports/1/
cap_mask counters gids lid lid_mask_count phys_state pkeys rate sm_lid
sm_sl state
> cat /sys/class/infiniband/mthca0/ports/1/state
4: ACTIVE
1.2 Performance Tests
---------------------
The following performance tests are provided with the OFED release:
1. Latency tests:
- ib_read_lat: RDMA read
- ib_write_lat: RDMA write
- ib_send_lat: UD, UC and RC (default) send
2. Bandwidth tests:
- ib_read_bw: RDMA read
- ib_write_bw: RDMA write
- ib_send_bw: UD, UC and RC (default) send
Usage:
Server: <test name> <options>
Client: <test name> <options> <server address>
where <server address> is an Ethernet or IPoIB address.
--help lists the available <options>. The same options must be
passed to both server and client.
Note: See PERF_TEST_README.txt for more information on the performance
tests.
Example: ib_send_bw
Usage:
ib_send_bw                  start a server and wait for connection
ib_send_bw <host>           connect to server at <host>
options:
-p, --port=<port>           listen on/connect to port <port>
                            (default: 18515)
-d, --ib-dev=<dev>          use IB device <dev>
                            (default: first device found)
-i, --ib-port=<port>        use port <port> of IB device
                            (default: 1)
-c, --connection=<RC/UC/UD> connection type RC/UC/UD (default: RC)
-m, --mtu=<mtu>             mtu size (default: 1024)
-s, --size=<size>           size of message to exchange
                            (default: 65536)
-a, --all                   run sizes from 2 up to 2^23
-t, --tx-depth=<dep>        size of tx queue (default: 300)
-n, --iters=<iters>         number of exchanges
                            (at least 2, default: 1000)
-b, --bidirectional         measure bidirectional bandwidth
                            (default: unidirectional)
-V, --version               display version number
1.3 Ping-pong Example Tests
---------------------------
The ping-pong example tests provide basic connectivity tests. Each test
has a help message (-h).
- ibv_ud_pingpong
- ibv_rc_pingpong
- ibv_srq_pingpong
- ibv_uc_pingpong
Example: ibv_ud_pingpong -h
Usage:
ibv_ud_pingpong            start a server and wait for connection
ibv_ud_pingpong <host>     connect to server at <host>
options:
-p, --port=<port>          listen on/connect to port <port>
                           (default: 18515)
-d, --ib-dev=<dev>         use IB device <dev>
                           (default: first device found)
-i, --ib-port=<port>       use port <port> of IB device (default: 1)
-s, --size=<size>          size of message to exchange (default: 2048)
-r, --rx-depth=<dep>       number of receives to post at a time
                           (default: 500)
-n, --iters=<iters>        number of exchanges (default: 1000)
-e, --events               sleep on CQ events (default: poll)
===============================================================================
2. Debug HOWTOs
===============================================================================
2.1 OFED Components and Version Information
-------------------------------------------
The text file BUILD_ID provides data on all OFED components (whether installed
or not). This file is a part of the ofed-docs RPM and installed under
/usr/share/doc/ofed-docs-1.4 on RedHat, and under
/usr/share/doc/packages/ofed-docs-1.4 on SuSE.
The same information can be obtained by executing the 'ofed_info' command. For
example:
> ofed_info
OFED-1.4
libibverbs:
git://git.openfabrics.org/ofed_1_4/libibverbs.git ofed_1_4
commit b00dc7d2f79e0660ac40160607c9c4937a895433
libmthca:
git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git master
commit be5eef3895eb7864db6395b885a19f770fde7234
libmlx4:
git://git.openfabrics.org/ofed_1_4/libmlx4.git ofed_1_4
commit fd418d6ee049afe76bb769aff87c303b96848495
libehca:
git://git.openfabrics.org/ofed_1_4/libehca.git ofed_1_4
commit e0c2d7e8ee2aa5dd3f3511270521fb0c206167c6
libipathverbs:
git://git.openfabrics.org/~ralphc/libipathverbs ofed_1_4
commit 65e5701dbe7b511f796cb0026b0cd51831a62318
libcxgb3:
git://git.openfabrics.org/~swise/libcxgb3.git ofed_1_4
commit f685c8fe7e77e64614d825e563dd9f02a0b1ae16
libnes:
git://git.openfabrics.org/~glenn/libnes.git master
commit 07fb9dfbbb36b28b5ea6caa14a1a5e215386b3e8
libibcm:
git://git.openfabrics.org/~shefty/libibcm.git master
commit 7fb57e005b3eae2feb83b3fd369aeba700a5bcf8
librdmacm:
git://git.openfabrics.org/~shefty/librdmacm.git master
commit e0b1ece1dc0518b2a5232872e0c48d3e2e354e47
libsdp:
git://git.openfabrics.org/ofed_1_4/libsdp.git ofed_1_4
commit 02404fb0266082f5b64412c3c25a71cb9d39442d
sdpnetstat:
git://git.openfabrics.org/~amirv/sdpnetstat.git ofed_1_4
commit 75a033a9512127449f141411b0b7516f72351f95
srptools:
git://git.openfabrics.org/ofed_1_3/srptools.git ofed_1_3
commit d3025d0771317584e51490a419a79ab55650ebc9
perftest:
git://git.openfabrics.org/~orenmeron/perftest.git master
commit ca629627c7a26005a1a4c8775cc01f483524f1c4
qlvnictools:
git://git.openfabrics.org/~ramachandrak/qlvnictools.git ofed_1_4
commit 1dc6e51a728cbfbdd2018260602b8bebde618da9
tvflash:
git://git.openfabrics.org/ofed_1_4/tvflash.git ofed_1_4
commit e1b50b3b8af52b0bc55b2825bb4d6ce699d5c43b
mstflint:
git://git.openfabrics.org/~orenk/mstflint.git master
commit 9ddeea464e946cd425e05b0d1fdd9ec003fca824
qperf:
git://git.openfabrics.org/~johann/qperf.git/.git master
commit bee05d35b09b0349cf4734ae43fc9c2e970ada8c
ibutils:
git://git.openfabrics.org/~orenk/ibutils.git master
commit 6516d16e815c68fa405562ea773b0c5215c1b70c
ibsim:
git://git.openfabrics.org/~sashak/ibsim.git master
commit eff83c7a522dea41c21e15746b1c58ff21fdecaa
ofa_kernel-1.4:
Git:
git://git.openfabrics.org/ofed_1_4/linux-2.6.git ofed_kernel
commit 60ca4b0e03aa5acccb01a3e0430ba240ad521547
# MPI
mvapich-1.1.0-3143.src.rpm
mvapich2-1.2p1-1.src.rpm
openmpi-1.2.8-1.src.rpm
mpitests-3.1-891.src.rpm
2.2 Installed OFED Components
-------------------------------
The script /etc/infiniband/info provides data on the specific OFED installation
on the machine.
For example:
> /etc/infiniband/info
prefix=/usr
Kernel=2.6.9-78.ELsmp
MODULES: CONFIG_INFINIBAND=m CONFIG_INFINIBAND_USER_MAD=m
CONFIG_INFINIBAND_USER_ACCESS=m CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_MTHCA=m CONFIG_IPATH_CORE=m CONFIG_INFINIBAND_IPATH=m
CONFIG_INFINIBAND_IPOIB=m
User level: --kernel-version 2.6.9-78.ELsmp --kernel-sources
/lib/modules/2.6.9-78.ELsmp/build --with-libibcm --with-libibverbs
--with-libipathverbs --with-libmthca --with-mstflint --with-perftest
2.3 Building/Installing InfiniBand (IB) Modules With Debug Information
----------------------------------------------------------------------
To compile/build/install the IB modules so that they will contain debug
information, set OPENIB_KERNEL_EXTRA_CFLAGS="-g" in your environment
before running OFED's install.pl/build.sh .
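As a minimal sketch, the flag can be exported in the shell that will run the
installer; the install.pl invocation is left commented out since it depends
on the local package directory:

```shell
# Export the extra kernel CFLAGS so the IB modules are built with
# debug information (-g).
export OPENIB_KERNEL_EXTRA_CFLAGS="-g"
echo "extra kernel cflags: $OPENIB_KERNEL_EXTRA_CFLAGS"
# ./install.pl   # run from the top of the OFED package
```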
===============================================================================
3. Pinning (Locking) User Memory Pages
===============================================================================
Memory locking is managed by the kernel on a per user basis. Regular users (as
opposed to root) have a limited number of pages which they may pin, where
the limit is pre-set by the administrator. Registering memory for IB verbs
requires pinning memory, thus an application cannot register more memory than
it is allowed to pin.
The user can change the system per-process memory lock limit by adding
the following two lines to the file /etc/security/limits.conf:
* soft memlock <number>
* hard memlock <number>
where <number> denotes the number of KBytes that may be locked by a
user process.
The above change to /etc/security/limits.conf will allow any user process in
the system to lock up to <number> KBytes of memory.
On some systems, it may be possible to use "unlimited" for the size to disable
these limits entirely.
Note: The file /etc/security/limits.conf contains further documentation.
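Before editing limits.conf it helps to check the limits currently in effect;
a quick sketch using the shell built-in ulimit:

```shell
# Show the current per-process locked-memory limits, which bound how much
# memory an IB application may register. Values are in KBytes, or
# "unlimited" if the limits have been disabled.
soft=$(ulimit -S -l)
hard=$(ulimit -H -l)
echo "soft memlock limit: $soft"
echo "hard memlock limit: $hard"
```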
===============================================================================
4. External Module Compilation Over OFED-1.4
===============================================================================
To build kernel modules that depend on OFED's modules, take the Module.symvers
file from <prefix>/src/openib/Module.symvers (part of the kernel-ib-devel
RPM), copy it to the module subdir, and then compile your module.
If <prefix>/src/openib/Module.symvers does not exist or is empty, use the
create_Module.symvers.sh (a part of the ofed-docs RPM) script to create the
Module.symvers file.
See "Module versioning & Module.symvers" in the modules.txt from kernel
documentation (e.g. linux-2.6.20/Documentation/kbuild/modules.txt).
===============================================================================
5. Adding/Deleting a patch to OFED package
===============================================================================
If there is a need to add or delete a patch to the OFED package, use the
ofed_patch.sh script, which is available under the docs directory.
This script supports kernel sources only.
Usage:
Add patch to OFED:
ofed_patch.sh --add
              --ofed|-o <path_to_expanded_OFED_directory>
              --patch|-p <path_to_patch_file>
              --type|-t <kernel|addons>
Remove patch from OFED:
ofed_patch.sh --remove
              --ofed|-o <path_to_expanded_OFED_directory>
              --patch|-p <patch_file_name>
              --type|-t <kernel|addons>
Examples:
ofed_patch.sh --add --ofed /tmp/OFED-1.4/ --patch /tmp/cma_fix.patch --type kernel
ofed_patch.sh --remove --ofed /tmp/OFED-1.4/ --patch cma_fix.patch --type kernel
===============================================================================
6. Adding vendor specific actions to the installation of OFED
===============================================================================
Vendors that want to add actions to the install/uninstall process of OFED can
bind external scripts to hooks in install.pl and ofed_uninstall.sh.
6.1 Specifying vendor scripts and configuration parameters
-----------------------------------------------------------
This option is only available when installing OFED in non-interactive mode.
Edit the OFED configuration file (ofed.conf) and add any of the lines below (you don't have to use all of them).
# Script to run before install process starts
vendor_pre_install=my_pre_install.sh
# Script to run after install process finishes
vendor_post_install=my_post_install.sh
# Script to run before uninstall process starts
vendor_pre_uninstall=my_pre_uninstall.sh
# Script to run after uninstall process finishes
vendor_post_uninstall=my_post_uninstall.sh
You can also add vendor-specific configuration parameters. Lines that start
with vendor_config_ are not parsed by install.pl and can instead be parsed by
one of the vendor scripts:
vendor_config_something=value_for_that_something
Running ./install.pl -c ofed.conf in the OFED directory will now invoke the
relevant vendor specific actions.
6.2. Requirements from vendor scripts
-------------------------------------
The script files that are given to install.pl in ofed.conf should
- be located in the root directory of OFED
- return zero on success
If a vendor script fails, the entire installation fails.
6.3 Skeleton for pre/post install vendor script
-----------------------------------------------
install.pl passes some useful installation variables to the vendor pre/post
install scripts environment. See the example below for a typical usage.
#!/bin/bash
eval $*
# The following env. parameters are set at this point
#
# CONFIG: full path filename of the OFED configuration file
# RPMS: directory of binary RPMs
# SRPMS: directory of source RPMS
# PREFIX: prefix of installation
# TOPDIR: root of OFED package
# QUIET: quiet operation indicator
function readconf() {
local config=$1
while read line; do
# skip comments
[[ ${line:0:1} == "#" ]] && continue
# skip empty lines
[[ -z "$line" ]] && continue
# parse line into token=value
token=$(echo $line | cut -f1 -d=)
value=$(echo $line | cut -f2 -d=)
# act on $token and $value here (e.g. pick up vendor_config_* entries)
done < "$config"
}
readconf $CONFIG
exit 0
===============================================================================
7. How to compile OFED sources manually
===============================================================================
These are instructions for compiling and installing the kernel and user space
parts "manually", i.e. without building the RPMs and without using the
install.pl script.
7.1 Compiling the kernel modules
--------------------------------
1. tar xzf OFED-1.4.tgz
2. rpm -ihv OFED-1.4/SRPMS/ofa_kernel-1.4-ofed1.4.src.rpm
3. cd /usr/src/redhat/SOURCES
4. tar xzvf ofa_kernel-1.4.tgz
5. cd ofa_kernel-1.4
6. configure:
run ./configure --help for a list of options.
basic invocation is:
./configure --with-core-mod --with-ipoib-mod --with-mthca-mod --with-mlx4_core-mod --with-mlx4_inf-mod
7. make
make install
NOTES:
1. configure applies the patches to the source code according to the current
kernel. If you wish to rerun configure, it is recommended to untar the source
code tree again and start from a clean state.
An alternative is to pass the option --without-patch to the configure invocation.
2. The modules selected for install are written to configure.mk.kernel
7.2 Compiling the user space libraries
--------------------------------------
To install user space library from the source RPM provided by OFED-1.4 manually,
do the following:
Example for libibverbs:
1. tar xzf OFED-1.4.tgz
2. rpm -ihv SRPMS/libibverbs-1.1.2-1.ofed1.4.src.rpm
3. cd /usr/src/redhat/SOURCES (for RedHat)
or
cd /usr/src/packages/SOURCES (for SuSE)
4. tar xzf libibverbs-1.1.2.tgz
5. cd libibverbs-1.1.2
6. ./configure (specify parameters, if required)
7. make
8. make install
Open Fabrics Enterprise Distribution (OFED)
Diagnostic Tools in OFED 1.4 Release Notes
December 2008
Repo: git://git.openfabrics.org/~ofed_1_3/management.git (release)
git://git.openfabrics.org/~sashak/management/management.git (development)
General
-------
Model of operation: All diag utilities use direct MAD access to perform their
operations. Operations that require only QP0 MADs may use directed route
MADs, and therefore can work even in unconfigured subnets. Almost all
utilities can operate without accessing the SM, unless GUID to LID translation
is required. The only exception is saquery, which requires the SM.
Dependencies
------------
Most diag utilities depend on libibmad and libibumad.
All diag utilities depend on the ib_umad kernel module.
Multiple port/Multiple CA support
---------------------------------
When no IB device or port is specified (see the "local umad parameters" below),
the libibumad library selects the port to use by the following criteria:
1. the first port that is ACTIVE.
2. if not found, the first port that is UP (physical link up).
If a port and/or CA name is specified, the libibumad library attempts to
satisfy the user request, and will fail if it cannot do so.
For example:
ibaddr # use the 'best port'
ibaddr -C mthca1 # pick the best port from mthca1 only.
ibaddr -P 2 # use the second (active/up) port from the
first available IB device.
ibaddr -C mthca0 -P 2 # use the specified port only.
Common options & flags
----------------------
Most diagnostics take the following flags. The exact list of supported
flags per utility can be found in the usage message and can be displayed
using util_name -h syntax.
# Debugging flags
-d raise the IB debugging level. May be used
several times (-ddd or -d -d -d).
-e show umad send receive errors (timeouts and others)
-h display the usage message
-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)
-V display the internal version info.
# Addressing flags
-D use directed path address arguments. The path
is a comma separated list of out ports.
Examples:
"0" # self port
"0,1,2,1,4" # out via port 1, then 2, ...
-G use GUID address arguments. In most cases, it is the Port GUID.
Examples:
"0x08f1040023"
-s use 'smlid' as the target lid for SA queries.
# Local umad parameters:
-C use the specified ca_name.
-P use the specified ca_port.
-t override the default timeout for the solicited mads.
CLI notation
------------
All utilities use the POSIX style notation, meaning that all options (flags)
must precede all arguments (parameters).
Utilities descriptions
----------------------
See man pages
Bugs Fixed
----------
RDS(7)
NAME
RDS - Reliable Datagram Sockets
SYNOPSIS
#include <sys/socket.h>
#include <netinet/in.h>
DESCRIPTION
This is an implementation of the RDS socket API. It provides reliable,
in-order datagram delivery between sockets over a variety of trans‐
ports.
Currently, RDS can be transported over Infiniband, and loopback.
iWARP bcopy is supported, but not RDMA operations.
RDS uses standard AF_INET addresses as described in ip(7) to identify
end points.
Socket Creation
RDS is still in development and as such does not have a reserved proto‐
col family constant. Applications must read the string representation
of the protocol family value from the pf_rds sysctl parameter file
described below.
rds_socket = socket(pf_rds, SOCK_SEQPACKET, 0);
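The lookup-and-create sequence above can be sketched in C. This is an illustrative sketch, not part of the original text: the helper names parse_rds_family() and rds_socket() are invented here, and on a kernel without the RDS module the socket() call will simply fail.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>

/* Parse the numeric protocol family string read from
 * /proc/sys/net/rds/pf_rds; returns -1 if it cannot be parsed. */
int parse_rds_family(const char *s)
{
    char *end;
    long v = strtol(s, &end, 10);

    if (end == s || v <= 0)
        return -1;
    return (int)v;
}

/* Read the dynamically assigned PF_RDS value and create an RDS socket.
 * Returns a socket fd, or -1 if RDS is unavailable on this kernel. */
int rds_socket(void)
{
    char buf[32];
    int pf_rds;
    FILE *f = fopen("/proc/sys/net/rds/pf_rds", "r");

    if (!f)
        return -1;                  /* RDS not available */
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    pf_rds = parse_rds_family(buf);
    if (pf_rds < 0)
        return -1;
    return socket(pf_rds, SOCK_SEQPACKET, 0);
}
```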
Socket Options
RDS sockets support a number of socket options through the setsock‐
opt(2) and getsockopt(2) calls. The following generic options (with
socket level SOL_SOCKET) are of specific importance:
SO_RCVBUF
Specifies the size of the receive buffer. See section on "Con‐
gestion Control" below.
SO_SNDBUF
Specifies the size of the send buffer. See "Message Transmis‐
sion" below.
SO_SNDTIMEO
Specifies the send timeout when trying to enqueue a message on a
socket with a full queue in blocking mode.
In addition to these, RDS supports a number of protocol specific
options (with socket level SOL_RDS). Just as with the RDS protocol
family, an official value has not been assigned yet, so the kernel will
assign a value dynamically. The assigned value can be retrieved from
the sol_rds sysctl parameter file.
RDS specific socket options will be described in a separate section
below.
Binding
A new RDS socket has no local address when it is first returned from
socket(2). It must be bound to a local address by calling bind(2)
before any messages can be sent or received. This will also attach the
socket to a specific transport, based on the type of interface the
local address is attached to. From that point on, the socket can only
reach destinations which are available through this transport.
For instance, when binding to the address of an Infiniband interface
such as ib0, the socket will use the Infiniband transport. If RDS is
not able to associate a transport with the given address, it will
return EADDRNOTAVAIL.
An RDS socket can only be bound to one address and only one socket can
be bound to a given address/port pair. If no port is specified in the
binding address then an unbound port is selected at random.
RDS does not allow the application to bind a previously bound socket to
another address. Binding to the wildcard address INADDR_ANY is not per‐
mitted either.
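A bind sketch follows from the rules above. The helper names are invented for illustration; the address passed must belong to an RDS-capable interface (such as ib0), and bind(2) returns EADDRNOTAVAIL when RDS cannot associate a transport with it.

```c
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Build the local address to bind an RDS socket to; port 0 lets
 * RDS pick an unbound port at random, as described above. */
struct sockaddr_in make_rds_addr(const char *ip, unsigned short port)
{
    struct sockaddr_in sin;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(port);
    inet_pton(AF_INET, ip, &sin.sin_addr);
    return sin;
}

/* bind() attaches the socket to the transport behind this address;
 * it fails if the address/port pair is already bound. */
int rds_bind(int fd, const char *ip, unsigned short port)
{
    struct sockaddr_in sin = make_rds_addr(ip, port);

    return bind(fd, (struct sockaddr *)&sin, sizeof(sin));
}
```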
Connecting
The default mode of operation for RDS is to use unconnected sockets, and
specify a destination address as an argument to sendmsg. However, RDS
allows sockets to be connected to a remote end point using connect(2).
If a socket is connected, calling sendmsg without specifying a destina‐
tion address will use the previously given remote address.
Congestion Control
RDS does not have explicit congestion control like common streaming
protocols such as TCP. However, sockets have two queue limits associ‐
ated with them; the send queue size and the receive queue size. Mes‐
sages are accounted based on the number of bytes of payload.
The send queue size limits how much data local processes can queue on a
local socket (see the following section). If that limit is exceeded,
the kernel will not accept further messages until the queue is drained
and messages have been delivered to and acknowledged by the remote
host.
The receive queue size limits how much data RDS will put on the receive
queue of a socket before marking the socket as congested. When a
socket becomes congested, RDS will send a congestion map update to the
other participating hosts, who are then expected to stop sending more
messages to this port.
There is a timing window during which a remote host can still continue
to send messages to a congested port; RDS solves this by accepting
these messages even if the socket's receive queue is already over the
limit.
As the application pulls incoming messages off the receive queue using
recvmsg(2), the number of bytes on the receive queue will eventually
drop below the receive queue size, at which point the port is then
marked uncongested, and another congestion update is sent to all par‐
ticipating hosts. This tells them to allow applications to send addi‐
tional messages to this port.
The default values for the send and receive buffer size are controlled
by the A given RDS socket has limited transmit buffer space. It
defaults to the system wide socket send buffer size set in the
wmem_default and rmem_default sysctls, respectively. They can be tuned
by the application through the SO_SNDBUF and SO_RCVBUF socket options.
Blocking Behavior
The sendmsg(2) and recvmsg(2) calls can block in a variety of situa‐
tions. Whether a call blocks or returns with an error depends on the
non-blocking setting of the file descriptor and the MSG_DONTWAIT mes‐
sage flag. If the file descriptor is set to blocking mode (which is the
default), and the MSG_DONTWAIT flag is not given, the call will block.
In addition, the SO_SNDTIMEO and SO_RCVTIMEO socket options can be used
to specify a timeout (in seconds) after which the call will abort wait‐
ing, and return an error. The default timeout is 0, which tells RDS to
block indefinitely.
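As a sketch of the timeout behavior described above: SO_SNDTIMEO and SO_RCVTIMEO take a struct timeval, and a zero timeout means block indefinitely. The helper names here are illustrative, not part of the RDS API.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Build the timeval passed to SO_SNDTIMEO/SO_RCVTIMEO;
 * 0 seconds means block indefinitely. */
struct timeval rds_timeout(long seconds)
{
    struct timeval tv;

    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    return tv;
}

/* After this, a blocked sendmsg(2) aborts waiting and returns an
 * error once the timeout expires. */
int set_send_timeout(int fd, long seconds)
{
    struct timeval tv = rds_timeout(seconds);

    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}
```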
Message Transmission
Messages may be sent using sendmsg(2) once the RDS socket is bound.
Message length cannot exceed 4 gigabytes as the wire protocol uses an
unsigned 32 bit integer to express the message length.
RDS does not support out of band data. Applications are allowed to send
to unicast addresses only; broadcast or multicast are not supported.
A successful sendmsg(2) call puts the message in the socket's transmit
queue where it will remain until either the destination acknowledges
that the message is no longer in the network or the application removes
the message from the send queue.
Messages can be removed from the send queue with the RDS_CANCEL_SENT_TO
socket option described below.
While a message is in the transmit queue its payload bytes are
accounted for. If an attempt is made to send a message while there is
not sufficient room on the transmit queue, the call will either block
or return EAGAIN.
When sending to a destination that is marked congested (see above),
the call will either block or return ENOBUFS.
A message sent with no payload bytes will not consume any space in the
destination's send buffer but will result in a message receipt on the
destination. The receiver will not get any payload data but will be
able to see the sender's address.
Messages sent to a port to which no socket is bound will be silently
discarded by the destination host. No error messages are reported to
the sender.
Message Receipt
Messages may be received with recvmsg(2) on an RDS socket once it is
bound to a source address. RDS will return messages in-order, i.e. mes‐
sages from the same sender will arrive in the same order in which they
were sent.
The address of the sender will be returned in the sockaddr_in structure
pointed to by the msg_name field, if set.
If the MSG_PEEK flag is given, the first message on the receive queue is
returned without removing it from the queue.
The memory consumed by messages waiting for delivery does not limit the
number of messages that can be queued for receive. RDS does attempt to
perform congestion control as described in the section above.
If the length of the message exceeds the size of the buffer provided to
recvmsg(2), then the remainder of the bytes in the message are dis‐
carded and the MSG_TRUNC flag is set in the msg_flags field. In this
truncating case recvmsg(2) will still return the number of bytes
copied, not the length of the entire message. If MSG_TRUNC is set in the
flags argument to recvmsg(2), then it will return the number of bytes
in the entire message. Thus one can examine the size of the next mes‐
sage in the receive queue without incurring a copying overhead by pro‐
viding a zero length buffer and setting MSG_PEEK and MSG_TRUNC in the
flags argument.
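The size-probing trick described above (zero-length buffer plus MSG_PEEK and MSG_TRUNC) can be sketched as follows; the helper name is invented for illustration.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Probe the size of the next queued message without copying it:
 * an empty iovec plus MSG_PEEK|MSG_TRUNC makes recvmsg(2) return
 * the full message length while leaving the message on the queue.
 * Returns -1 on error (e.g. bad fd). */
ssize_t rds_next_msg_size(int fd)
{
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));   /* no iovec: zero-length buffer */
    return recvmsg(fd, &msg, MSG_PEEK | MSG_TRUNC);
}
```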
The sending address of a zero-length message will still be provided in
the msg_name field.
Control Messages
RDS uses control messages (a.k.a. ancillary data) through the msg_con‐
trol and msg_controllen fields in sendmsg(2) and recvmsg(2). Control
messages generated by RDS have a cmsg_level value of sol_rds. Most
control messages are related to the zerocopy interface added in RDS
version 3, and are described in rds-rdma(7).
The only exception is the RDS_CMSG_CONG_UPDATE message, which is
described in the following section.
Polling
RDS supports the poll(2) interface in a limited fashion. POLLIN is
returned when there is a message (either a proper RDS message, or a
control message) waiting in the socket's receive queue. POLLOUT is
always returned while there is room on the socket's send queue.
Sending to congested ports requires special handling. When an applica‐
tion tries to send to a congested destination, the system call will
return ENOBUFS. However, it cannot poll for POLLOUT, as there is prob‐
ably still room on the transmit queue, so the call to poll(2) would
return immediately, even though the destination is still congested.
There are two ways of dealing with this situation. The first is to sim‐
ply poll for POLLIN. By default, a process sleeping in poll(2) is
always woken up when the congestion map is updated, and thus the appli‐
cation can retry any previously congested sends.
The second option is explicit congestion monitoring, which gives the
application more fine-grained control.
With explicit monitoring, the application polls for POLLIN as before,
and additionally uses the RDS_CONG_MONITOR socket option to install a
64bit mask value in the socket, where each bit corresponds to a group
of ports. When a congestion update arrives, RDS checks the set of ports
that became uncongested against the bit mask installed in the socket.
If they overlap, a control message is enqueued on the socket, and the
application is woken up. When it calls recvmsg(2), it will be given the
control message containing the bitmap.
The congestion monitor bitmask can be set and queried using setsock‐
opt(2) with RDS_CONG_MONITOR, and a pointer to the 64bit mask variable.
Congestion updates are delivered to the application via
RDS_CMSG_CONG_UPDATE control messages. These control messages are
always delivered by themselves (or possibly additional control messages),
but never along with an RDS data message. The cmsg_data field of
the control message is an 8 byte datum containing the 64bit mask value.
Applications can use the following macros to test for and set bits in
the bitmask:
#define RDS_CONG_MONITOR_SIZE 64
#define RDS_CONG_MONITOR_BIT(port) (((unsigned int) port) % RDS_CONG_MONITOR_SIZE)
#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))
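Using those macros, an application monitoring several ports would OR the per-port masks together before installing the result with RDS_CONG_MONITOR. The sketch below is illustrative; it shifts 1ULL so the bit operation is done in 64 bits, matching the 64-bit mask variable.

```c
#include <stdint.h>

#define RDS_CONG_MONITOR_SIZE 64
#define RDS_CONG_MONITOR_BIT(port)  (((unsigned int) port) % RDS_CONG_MONITOR_SIZE)
#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))

/* Build a monitor mask covering every port the application cares
 * about; each bit stands for a group of ports, so distinct ports
 * may share a bit. */
uint64_t build_cong_mask(const unsigned short *ports, int n)
{
    uint64_t mask = 0;
    int i;

    for (i = 0; i < n; i++)
        mask |= RDS_CONG_MONITOR_MASK(ports[i]);
    return mask;
}
```

The resulting mask is installed with setsockopt(2) using RDS_CONG_MONITOR at the sol_rds socket level described in the SYSCTL VALUES section.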
Canceling Messages
An application can cancel (flush) messages from the send queue using
the RDS_CANCEL_SENT_TO socket option with setsockopt(2). This call
takes an optional sockaddr_in address structure as argument. If given,
only messages to the destination specified by this address are dis‐
carded. If no address is given, all pending messages are discarded.
Note that this affects messages that have not yet been transmitted as
well as messages that have been transmitted, but for which no acknowl‐
edgment from the remote host has been received yet.
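A cancellation sketch, under the assumption that the RDS_CANCEL_SENT_TO option constant comes from linux/rds.h (the value below mirrors that header) and that sol_rds is the dynamic socket level read from the sysctl file described later.

```c
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef RDS_CANCEL_SENT_TO
#define RDS_CANCEL_SENT_TO 1    /* assumed value, as in linux/rds.h */
#endif

/* Flush pending sends: pass a destination to discard only messages
 * to that peer, or NULL to discard all pending messages. */
int rds_cancel_sent_to(int fd, int sol_rds, const struct sockaddr_in *dst)
{
    return setsockopt(fd, sol_rds, RDS_CANCEL_SENT_TO,
                      dst, dst ? sizeof(*dst) : 0);
}
```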
Reliability
If sendmsg(2) succeeds, RDS guarantees that the message will be vis‐
ible to recvmsg(2) on a socket bound to the destination address as
long as that destination socket remains open.
If there is no socket bound on the destination, the message is
silently dropped. If the sending RDS can't be sure that there is no
socket bound then it will try to send the message indefinitely until it
can be sure or the sent message is canceled.
If a socket is closed then all pending sent messages on the socket are
canceled and may or may not be seen by the receiver.
The RDS_CANCEL_SENT_TO socket option can be used to cancel all pending
messages to a given destination.
If a receiving socket is closed with pending messages then the sender
considers those messages as having left the network and will not
retransmit them.
A message will only be seen by recvmsg(2) once, unless MSG_PEEK was
specified. Once the message has been delivered it is removed from the
sending socket's transmit queue.
All messages sent from the same socket to the same destination will be
delivered in the order they're sent. Messages sent from different sock‐
ets, or to different destinations, may be delivered in any order.
SYSCTL VALUES
These parameters may only be accessed through their files in
/proc/sys/net/rds. Access through sysctl(2) is not supported.
pf_rds This file contains the string representation of the protocol
family constant passed to socket(2) to create a new RDS socket.
sol_rds
This file contains the string representation of the socket level
parameter that is passed to getsockopt(2) and setsockopt(2) to
manipulate RDS socket options.
max_unacked_bytes and max_unacked_packets
These parameters are used to tune the generation of acknowledge‐
ments. By default, the system receiving RDS messages does not
send back explicit acknowledgements unless it transmits a mes‐
sage of its own (in which case the ACK is piggybacked onto the
outgoing message), or when the sending system requests an ACK.
However, the sender needs to see an ACK from time to time so
that it can purge old messages from the send queue. The unacked
bytes and packet counters are used to keep track of how much
data has been sent without requesting an ACK. The default is to
request an acknowledgement every 16 packets, or every 16 MB,
whichever comes first.
reconnect_delay_min_ms and reconnect_delay_max_ms
RDS uses host-to-host connections to transport RDS messages
(both for the TCP and the Infiniband transport). If this connec‐
tion breaks, RDS will try to re-establish the connection.
Because this reconnect may be triggered by both hosts at the
same time and fail, RDS uses a random backoff before attempting
a reconnect. These two parameters specify the minimum and maxi‐
mum delay in milliseconds. The default values are 1 and 1000,
respectively.
SEE ALSO
rds-rdma(7), socket(2), bind(2), sendmsg(2), recvmsg(2), getsockopt(2),
setsockopt(2).
RDS(7)
Release Notes for
OFED 1.4.1 DAPL Release
May 2009
OFED 1.4.1 RELEASE NOTES
This release of the uDAPL reference implementation package, for both the
DAT 1.2 and 2.0 specifications, is timed to coincide with the OFED release
of the OpenFabrics (www.openfabrics.org) software stack.
NEW SINCE OFED 1.4 - new versions of uDAPL v1 (1.2.14-1) and v2 (2.0.19-1)
* New Features - optional counters, must be configured/built with -DDAPL_COUNTERS
* Bug Fixes
v2 - scm, cma: dat max_lmr_block_size is 32 bit, verbs max_mr_size is 64 bit
v2 - scm, cma: use direct SGE mappings from dat_lmr_triplet to ibv_sge
v2 - dtest: add flush EVD call after data transfer errors
v2 - scm: increase default MTU size from 1024 to 2048
v2 - dapltest: reset server listen ports to avoid collisions during long runs
v2 - dapltest: avoid duplicating ports, increment based on ep/thread count
v2 - dapltest: fix assumptions that multiple EP's will connect in order
v2 - common: fix missing sync when removing items off of EVD pending queue
v2 - scm: reduce open time with thread start up
v2 - scm: getsockopt optlen needs initialized to size of optval
v2 - scm: cr_thread cleanup
v2 - OFED and WinOF code sync
v2 - scm: remove unnecessary query gid/lid from connection phase code.
v2 - scm: add optional 64-bit counters, build with -DDAPL_COUNTERS.
v1,v2 - spec files missing Requires(post) statements for sed/coreutils
v1,v2 - dtest/dapltest: use $(top_builddir) for .la files during test builds
v1,v2 - scm: remove unnecessary thread when using direct objects
v1,v2 - Fix SuSE 11 build issues, asm/atomic.h no longer exists
* Build Notes:
# NON_DEBUG build/install example for x86_64, OFED targets
./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
make install
# DEBUG build/install example for x86_64, using OFED targets
./configure --enable-debug --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
make install
# COUNTERS build/install example for x86_64, using OFED targets
./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include -DDAPL_COUNTERS"
make install
* BKM for running new DAPL library on your cluster without any impact on existing OFED installation:
Note: example for user /home/ardavis, (assumes /home/ardavis is exported) and MLX4 adapter, port 1
Download latest 2.x package: http://www.openfabrics.org/downloads/dapl/dapl-2.0.19.tar.gz
untar in /home/ardavis
cd /home/ardavis/dapl-2.0.19
./configure && make (build on node with OFED 1.3 or higher installed, dependency on verb/rdma_cm libraries)
create /home/ardavis/dat.conf with following 2 lines. (entries with path to new libraries):
ofa-v2-ib0 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.19/dapl/udapl/.libs/libdaplcma.so.1 dapl.2.0 "ib0 0" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.19/dapl/udapl/.libs/libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
Run uDAPL application or an MPI that uses uDAPL, with (assuming MLX4 connectx adapters) following:
setenv DAT_OVERRIDE /home/ardavis/dat.conf
If running Intel MPI and uDAPL socket cm, set the following:
setenv I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1
or if running Intel MPI and uDAPL rdma_cm, set the following:
setenv I_MPI_DEVICE rdssm:ofa-v2-ib0
-------------------------
OFED 1.4 RELEASE NOTES
NEW SINCE OFED 1.3.1 - new versions of uDAPL v1 (1.2.12-1) and v2 (2.0.15-1)
* New Features
1. The new socket CM provider, introduced in 1.2.8 and 2.0.11 packages,
assumes homogeneous cluster and will setup the QP's based on local HCA port
attributes and exchanges QP information via socket's using the hostname of
each node. IPoIB and rdma_cm are NOT required for this provider. QP attributes
can be adjusted via the following environment parameters:
DAPL_ACK_TIMER (default=16 5 bits, 4.096us*2^ack_timer. 16 == 268ms)
DAPL_ACK_RETRY (default=7 3 bits, 7 * 268ms = 1.8 seconds)
DAPL_RNR_TIMER (default=12 5 bits, 12 == 64ms, 28 == 163ms, 31 == 491ms)
DAPL_RNR_RETRY (default=7 3 bits, 7 == infinite)
DAPL_IB_MTU (default=1024 limited to active MTU max)
The new socket cm entries in /etc/dat.conf provide a link to the actual HCA
device and port. Example v1 and v2 entries for a Mellanox connectx device, port 1:
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
This new socket cm provider, was successfully tested on the TATA CRL cluster
(#8 on Top500) with Intel MPI, achieving a HPLinpack score of 132.8TFlops on
1798 nodes, 14384 cores at ~76.9% of peak. DAPL_ACK_TIMER was increased to 21
for this scale.
2. New v2 definitions for IB unreliable datagram extension (only supported in
scm provider, libdaploscm.so.2)
Extended EP dat_service_type, with DAT_IB_SERVICE_TYPE_UD
Add IB extension call dat_ib_post_send_ud().
Add address handle definition for UD calls.
Add IB event definitions to provide remote AH via connect and connect requests
See dtestx (-d) source for example usage model
* Bug Fixes
v1,v2 - dapltest: trans test moves to cleanup stage before rdma_read processing is complete
v1,v2 - Fix static registration (dat.conf) to include sysconfdir override
v1,v2 - dat.conf: add default iwarp entry for eth2
v1,v2 - dapl: adjust max_rdma_read_iov to 1 for iWARP devices
v1,v2 - dtest: reduce default IOV's for ep_create to support iWARP
v1,v2 - dtest: fix 32-bit build issues
v1,v2 - build: $(DESTDIR) prepend needed on install hooks for dat.conf
v2 - scm: UD shares EP;s which requires serialization
v2 - dapl: fixes for IB UD extensions in common code and socket cm provider.
v2 - dapl: add provider specific attribute query option for IB UD MTU size
v2 - dapl build: add correct CFLAGS, set non-debug build by default for v2
v2 - dtestx: fix stack corruption problem with hostname strcpy
v2 - dapl extension: dapli_post_ext should always allocate cookie for requests.
v2 - dapltest: manpage - rdma write example incorrect
v1,v2 - dat, dapl, dtest, dapltest, providers: fix compiler warnings in dat common code
v1,v2 - dapl cma: debug message during query needs definition for inet_ntoa
v1,v2 - dapl scm: fix corner case that delivers duplicate disconnect events
v1,v2 - dat: include stddef.h for NULL definition in dat_platform_specific.h
v1,v2 - dapl: add debug messages during async and overflow events
v1,v2 - dapltest: add check for duplicate disconnect events in transaction test
v1,v2 - dapl scm: use correct device attribute for max_rdma_read_out, max_qp_init_rd_atom
v1,v2 - dapl scm: change IB RC qp inline and timer defaults.
v1,v2 - dapl scm: add mtu adjustments via environment, default = 1024.
v1,v2 - dapl scm: change connect and accept to non-blocking to avoid blocking user thread.
v1,v2 - dapl scm: update max_rdma_read_iov, max_rdma_write_iov EP attributes during query
v1,v2 - dat: allow TYPE_ERR messages to be turned off with DAT_DBG_TYPE
v1,v2 - dapl: remove needless terminating 0 in dto_op_str functions.
v1,v2 - dat: remove reference to doc/dat.conf in makefile.am
v1,v2 - dapl scm: fix ibv_destroy_cq busy error condition during dat_evd_free.
v1,v2 - dapl scm: add stdout logging for uname and gethostbyname errors during open.
v1,v2 - dapl scm: support global routing and set mtu based on active_mtu
v1,v2 - dapl: add opcode to string function to report opcode during failures.
v1,v2 - dapl: remove unused iov buffer allocation on the endpoint
v1,v2 - dapl: endpoint pending request count is wrong
-------------------------
OFED 1.3.1 RELEASE NOTES
NEW SINCE OFED 1.3 - new versions of uDAPL v1 (1.2.7-1) and v2 (2.0.9-1)
* New Features - None
* Bug Fixes
v2 - add private data exchange with reject
v1,v2 - better error reporting in non-debug builds
v1,v2 - update only OFA entries in dat.conf, cooperate with non-ofa providers
v1,v2 - support for zero byte operations, iov==NULL
v1,v2 - multi-transport support for inline data and private data differences
v1,v2 - fix memory leaks and other reported bugs since OFED 1.3
v1,v2 - dtest,dtestx,dapltest build issues on RHEL5.1
v1,v2 - long delay during dat_ia_open when DNS not configured
v1,v2 - use rdma_read_in/out from ep_attr per consumer instead of HCA max
-------------------------
OFED 1.3 RELEASE NOTES
NEW SINCE OFED 1.2
* New Features
1. Add v2.0 library support for new 2.0 API Specification
2. Separate v1.2 library release to co-exist with v2.0 libraries.
3. New dat.conf with both 1.2 and 2.0 support
4. New v2.0 dtestx utilities to test IB extensions
* Bug Fixes
v1.2 and v2.0
- uDAT: static/dynamic registry parsing fixes
- uDAPL: provider fixes for dat_psp_create_any
- dtest/dapltest: change default provider names to sync with dat.conf
- openib_cma: issues with destroy_cm_id and init/resp exchange
- dapltest: use gettimeofday instead of get_cycles for better portability
- dapltest: endian issue with mem_handle, mem_address
- dapltest fix to include inet_ntoa definitions
- fix build problems on 32-bit and 64-bit PowerPC
- cleanup packaging
v2.0
- set default config options to match spec file, --enable-debug --enable-ext-type=ib
- use unique devel target names, libdat2.so, /usr/include/dat2
- dtestx fix memory leak, freeaddrinfo after getaddrinfo
- Fix for IB extended DTO cookie deallocation on inbound rdma_Write_immed
- WinOF: Update OFED code base to include WinOF changes, work from same code base
- WinOF: add DAT_API definition, __stdcall for windows, nothing for linux
- dtest: add dat_evd_query to check correct size
- openib_cma: add macro to convert SID to PORT
- dtest: endian support for exchanging RMR info
- openib_cma: lower default settings, inline and RDMA init/resp
- openib_cma: missing ia_query for max_iov_segments_per_rdma_write
v1.2
- openib_cma: turn down dbg noise level on rejects
- dtest: typo in memset
BUILD: v1 and v2 uDAPL source install/build instructions (redhat example):
# cd to distribution SRPMS directory
cd /tmp/OFED-1.3/SRPMS
rpm -i dapl-1.2*.rpm
rpm -i dapl-2.0*.rpm
cd /usr/src/redhat/SOURCES
tar zxf dapl-1.2*.tgz
tar zxf dapl-2.0*.tgz
# NON_DEBUG build example for x86_64, using OFED targets
./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64
LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
# build and install
make
make install
# DEBUG build example for x86_64, using OFED targets
./configure --enable-debug --prefix /usr --sysconf=/etc --libdir /usr/lib64
LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
# build and install
make
make install
# DEBUG messages: set environment variable DAPL_DBG_TYPE, default
mapping is 0x0003
DAPL_DBG_TYPE_ERR = 0x0001,
DAPL_DBG_TYPE_WARN = 0x0002,
DAPL_DBG_TYPE_EVD = 0x0004,
DAPL_DBG_TYPE_CM = 0x0008,
DAPL_DBG_TYPE_EP = 0x0010,
DAPL_DBG_TYPE_UTIL = 0x0020,
DAPL_DBG_TYPE_CALLBACK = 0x0040,
DAPL_DBG_TYPE_DTO_COMP_ERR= 0x0080,
DAPL_DBG_TYPE_API = 0x0100,
DAPL_DBG_TYPE_RTN = 0x0200,
DAPL_DBG_TYPE_EXCEPTION = 0x0400,
DAPL_DBG_TYPE_SRQ = 0x0800,
DAPL_DBG_TYPE_CNTR = 0x1000
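The type bits can be OR'd together to select multiple message classes. As a small sketch (the variable name and bit values come from the table above; the particular combination chosen here is only an example):

```shell
# ERR (0x0001) | WARN (0x0002) | CM (0x0008) = 0x000b:
# trace errors, warnings, and connection-management activity.
export DAPL_DBG_TYPE=0x000b
# Run the uDAPL application from this shell so it inherits the setting.
```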
-------------------------
OFED 1.2 RELEASE NOTES
NEW SINCE Gamma 3.2 and OFED 1.1
* New Features
1. Added dtest and dapltest to the openfabrics build and utils rpm.
Includes manpages.
2. Added the following environment variables to configure connection
management timers (default settings shown below) for larger clusters:
DAPL_CM_ARP_TIMEOUT_MS 4000
DAPL_CM_ARP_RETRY_COUNT 15
DAPL_CM_ROUTE_TIMEOUT_MS 4000
DAPL_CM_ROUTE_RETRY_COUNT 15
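For instance, a job-launcher script on a large cluster might raise these before starting the application. The variable names come from the list above; the values here are purely illustrative:

```shell
# Double the default 4000 ms resolution timeouts and raise the
# retry counts for fabrics where ARP/route resolution is slow.
export DAPL_CM_ARP_TIMEOUT_MS=8000
export DAPL_CM_ARP_RETRY_COUNT=20
export DAPL_CM_ROUTE_TIMEOUT_MS=8000
export DAPL_CM_ROUTE_RETRY_COUNT=20
```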
* Bug Fixes
+ Added support for new ib verbs client register event. No extra
processing required at the uDAPL level.
+ Fix some issues supporting QP creation without a recv CQ handle or
recv QP resources. IB verbs assume a recv_cq handle, and uDAPL
dapl_ep_create assumes recv_sge resources are always specified.
+ Fix some timeout and long disconnect delay issues discovered during
scale-out testing. Added support to retry rdma_cm address and route
resolution with configuration options. Provide a disconnect call
when receiving the disconnect request to guarantee a disconnect reply
and event on the remote side. The rdma_disconnect was not being called
from dat_ep_disconnect() as a result of the state changing
to DISCONNECTED in the event callback.
+ Changes to support exchanging and validating the device
responder_resources and the initiator_depth during connection establishment.
+ Fix some build issues with dapltest on 32 bit arch, and on ia64 SUSE arch
+ Add support for multiple IB devices to dat.conf to support IPoIB HA failover
+ Fix atomic operation build problem with ia64 and RHEL5.
+ Add support to return local and remote port information with dat_ep_query
+ Cleanup RPM specfile for the dapl package, move to 1.2-1 release.
NEW SINCE Gamma 3.1 and OFED 1.0
* BUG FIXES
+ Update obsolete CLK_TCK to CLOCKS_PER_SEC
+ Fill out some uninitialized fields in the ia_attr structure returned by
dat_ia_query().
+ Update dtest to support multiple segments on rdma write and change
makefile to use OpenIB-cma by default.
+ Add support for dat_evd_set_unwaitable on a DTO evd in openib_cma
provider
+ Added errno reporting (message and return codes) during open to help
diagnose create thread issues.
+ Fix some suspicious inline assembly EIEIO_ON_SMP and ISYNC_ON_SMP
+ Fix IA64 build problems
+ Lower the reject debug message level so we don't see warnings when
consumers reject.
+ Added support for active side TIMED_OUT event from a provider.
+ Fix bug in dapls_ib_get_dat_event() call after adding new unreachable
event.
+ Update for new rdma_create_id() function signature.
+ Set max rdma read per EP attributes
+ Report the proper error and timeout events.
+ Socket CM fix to guard against using a loopback address as the local
device address.
+ Use the uCM set_option feature to adjust connect request timeout
retry values.
+ Fix to disallow any event after a disconnect event.
* OFED 1.1 uDAPL source build instructions:
cd /usr/local/ofed/src/openib-1.1/src/userspace/dapl
# NON_DEBUG build configuration
./configure --disable-libcheck --prefix /usr/local/ofed \
    --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 \
    CPPFLAGS="-I../libibverbs/include -I../librdmacm/include"
# build and install
make
make install
# DEBUG build configuration
./configure --disable-libcheck --enable-debug --prefix /usr/local/ofed \
    --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 \
    CPPFLAGS="-I../libibverbs/include -I../librdmacm/include"
# build and install
make
make install
# DEBUG messages: set environment variable DAPL_DBG_TYPE; the default
# mapping is 0x0003
DAPL_DBG_TYPE_ERR = 0x0001,
DAPL_DBG_TYPE_WARN = 0x0002,
DAPL_DBG_TYPE_EVD = 0x0004,
DAPL_DBG_TYPE_CM = 0x0008,
DAPL_DBG_TYPE_EP = 0x0010,
DAPL_DBG_TYPE_UTIL = 0x0020,
DAPL_DBG_TYPE_CALLBACK = 0x0040,
DAPL_DBG_TYPE_DTO_COMP_ERR= 0x0080,
DAPL_DBG_TYPE_API = 0x0100,
DAPL_DBG_TYPE_RTN = 0x0200,
DAPL_DBG_TYPE_EXCEPTION = 0x0400,
DAPL_DBG_TYPE_SRQ = 0x0800,
DAPL_DBG_TYPE_CNTR = 0x1000
Note: The udapl provider library libdaplscm.so is untested and
unsupported, thus customers should not use it.
It will be removed in the next OFED release.
DAPL GAMMA 3.1 RELEASE NOTES
This release of the DAPL reference implementation
is timed to coincide with the first release of the
Open Fabrics (www.openfabrics.org) software stack.
This release adds support for this new stack, which
is now the native Linux RDMA stack.
This release also adds a new licensing option. In
addition to the Common Public License and BSD License,
the code can now be licensed under the terms of the GNU
General Public License (GPL) version 2.
NEW SINCE Gamma 3.0
- GPL v2 added as a licensing option
- OpenFabrics (aka OpenIB) gen2 verbs support
- dapltest support for Solaris 10
* BUG FIXES
+ Fixed a disconnect event processing race
+ Fix to destroy all QPs on IA close
+ Removed compiler warnings
+ Removed unused variables
+ And many more...
DAPL GAMMA 3.0 RELEASE NOTES
This is the first release based on version 1.2 of the spec. Some
components, such as shared receive queues (SRQs), are not yet
implemented.
Once again there were numerous bug fixes submitted by the
DAPL community.
NEW SINCE Beta 2.06
- DAT 1.2 headers
- DAT_IA_HANDLEs implemented as small integers
- Changed default device name to be "ia0a"
- Initial support for Linux 2.6.X kernels
- Updates to the OpenIB gen 1 provider
* BUG FIXES
+ Updated Makefile for differentiation between OS releases.
+ Updated atomic routines to use appropriate API
+ Removed unnecessary assert from atomic_dec.
+ Fixed bugs when freeing a PSP.
+ Fixed error codes returned by the DAT static registry.
+ Kernel updates for dat_strerror.
+ Cleaned up the transport layer/adapter interface to use DAPL
types rather than transport types.
+ Fixed ring buffer reallocation.
+ Removed old test/udapl/dapltest directory.
+ Fixed DAT_IA_HANDLE translation (from pointer to int and
vice versa) on 64-bit platforms.
DAPL BETA 2.06 RELEASE NOTES
We are not planning any further releases of the Beta series,
which is based on the 1.1 version of the spec. There may be
further releases for bug fixes, but we anticipate that the DAPL
community will move to the new 1.2 version of the spec and the
changes mandated in the reference implementation.
The biggest item in this release is the first inclusion of the
OpenIB Gen 1 provider, an item generating a lot of interest in
the IB community. This implementation has graciously been
provided by the Mellanox team. The kdapl implementation is in
progress, and we imagine work will soon begin on Gen 2.
There are also a handful of bug fixes available, as well as a long
awaited update to the endpoint design document.
NEW SINCE Beta 2.05
- OpenIB gen 1 provider support has been added
- Added dapls_evd_post_generic_event(), routine to post generic
event types as requested by some providers. Also cleaned up
error reporting.
- Updated the endpoint design document in the doc/ directory.
* BUG FIXES
+ Cleaned up memory leak on close by freeing the HCA structure.
+ Removed bogus #defs for rdtsc calls on IA64.
+ Changed dapltest thread types to use internal types for
portability & correctness
+ Various 64 bit enhancements & updates
+ Fixes to conformance test that were defining CONN_QUAL twice
and using it in different ways
+ Cleaned up private data handling in ep_connect & provider
support: we now avoid extra copy in connect code; reduced
stack requirements by using private_data structure in the EP;
removed provider variable.
+ Fixed problem in the dat conformance test where cno_wait would
attempt to dereference a timer value and SEGV.
+ Removed old vestiges of deprecated POLLING_COMPLETIONS
conditionals.
DAPL BETA 2.05 RELEASE NOTES
This was to be a very minor release, the primary change was
going to be the new wording of the DAT license as contained in
the header for all source files. But the interest and
development occurring in DAPL provided some extra bug fixes, and
some new functionality that has been requested for a while.
First, you may notice that every single source file was
changed. If you read the release notes from DAPL BETA 2.04, you
were warned this would happen. There was a legal issue with the
wording in the header, the end result was that every source file
was required to change the words 'either of' to 'both'. We've
been putting this change off as long as possible, but we wanted
to do it in a clean drop before we start working on DAT 1.2
changes in the reference implementation, just to keep things
reasonably sane.
kdapltest has enabled three of the subtests supported by
dapltest. The Performance test in particular has been very
useful for getting minima and maxima. The Limit test
pushes the limits by allocating the maximum number of specific
resources. And the FFT tests are also available.
Most vendors have supported shared memory regions for a while,
and several have asked the reference implementation team to
provide a common implementation. Shared memory registration has
been tested on ibapi, and compiled into vapi. Both InfiniBand
providers have the restriction that a memory region must be
created before it can be shared; not all RDMA APIs are this way,
several allow you to declare a memory region shared when it is
registered. Hence, details of the implementation are hidden in
the provider layer, rather than forcing other APIs to do
something strange.
This release also contains some changes that will allow dapl to
work on Opteron processors, as well as some preliminary support
for Power PC architecture. These features are not well tested
and may be incomplete at this time.
Finally, we have been asked several times over the course of the
project for a canonical interface between the common and
provider layers. This release includes a dummy provider to meet
that need. Anyone should be able to download the release and do
a:
make VERBS=DUMMY
And have a cleanly compiled dapl library. This will be useful
both to those porting new transport providers, as well as those
going to new machines.
The DUMMY provider has been compiled on both Linux and Windows
machines.
NEW SINCE Beta 2.04
- kdapltest enhancements:
* Limit subtests now work
* Performance subtests now work.
* FFT tests now work.
- The VAPI headers have been refreshed by Mellanox
- Initial Opteron and PPC support.
- Atomic data types now have consistent treatment, allowing us to
use native data types other than integers. The Linux kdapl
uses atomic_t, allowing dapl to use the kernel macros and
eliminate the assembly code in dapl_osd.h
- The license language was updated per the direction of the
DAT Collaborative. This two word change affected the header
of every file in the tree.
- SHARED memory regions are now supported.
- Initial support for the TOPSPIN provider.
- Added a dummy provider, essentially the NULL provider. Its
purpose is to aid in porting and to clarify exactly what is
expected in a provider implementation.
- Removed memory allocation from the DTO path for VAPI
- cq_resize will now allow the CQ to be resized smaller. Not all
providers support this, but it's a provider problem, not a
limitation of the common code.
* BUG FIXES
+ Removed spurious lock in dapl_evd_connection_callb.c that
would have caused a deadlock.
+ The Async EVD was getting torn down too early, potentially
causing lost errors. Has been moved later in the teardown
process.
+ kDAPL replaced mem_map_reserve() with newer SetPageReserved()
for better Linux integration.
+ kdapltest no longer allocates large print buffers on the stack,
and is more careful to ensure buffers don't overflow.
+ Put dapl_os_dbg_print() under DAPL_DBG conditional, it is
supposed to go away in a production build.
+ dapltest protocol version has been bumped to reflect the
change in the Service ID.
+ Corrected several instances of routines that did not adhere
to the DAT 1.1 error code scheme.
+ Cleaned up vapi ib_reject_connection to pass DAT types rather
than provider specific types. Also cleaned up naming interface
declarations and their use in vapi_cm.c; fixed incorrect
#ifdef for naming.
+ Initialize missing uDAPL provider attr, pz_support.
+ Changes for better layering: first, moved
dapl_lmr_convert_privileges to the provider layer as memory
permissions are clearly transport specific and are not always
defined in an integer bitfield; removed common routines for
lmr and rmr. Second, move init and release setup/teardown
routines into adapter_util.h, which defined the provider
interface.
+ Cleaned up the HCA name cruft that allowed different types
of names such as strings or ints to be dealt with in common
code; but all names are presented by the dat_registry as
strings, so pushed conversions down to the provider
level. Greatly simplifies names.
+ Changed deprecated true/false to DAT_TRUE/DAT_FALSE.
+ Removed old IB_HCA_NAME type in favor of char *.
+ Fixed race condition in kdapltest's use of dat_evd_dequeue.
+ Changed cast for SERVER_PORT_NUMBER to DAT_CONN_QUAL as it
should be.
+ Small code reorg to put the CNO into the EVD when it is
allocated, which simplifies things.
+ Removed gratuitous ib_hca_port_t and ib_send_op_type_t types,
replaced with standard int.
+ Pass a pointer to cqe debug routine, not a structure. Some
clean up of data types.
+ kdapl threads now invoke reparent_to_init() on exit to allow
threads to get cleaned up.
DAPL BETA 2.04 RELEASE NOTES
The big changes for this release involve a more strict adherence
to the original dapl architecture. Originally, only InfiniBand
providers were available, so allowing various data types and
event codes to show through into common code wasn't a big deal.
But today, there are an increasing number of providers available
on a number of transports. Requiring an IP iWarp provider to
match up to InfiniBand events is silly, for example.
Restructuring the code allows more flexibility in providing an
implementation.
There are also a large number of bug fixes available in this
release, particularly in kdapl related code.
Be warned that the next release will change every file in the
tree as we move to the newly approved DAT license. This is a
small change, but all files are affected.
Future releases will also support to the soon to be ratified DAT
1.2 specification.
This release has benefited from many bug reports and fixes from
a number of individuals and companies. On behalf of the DAPL
community, thank you!
NEW SINCE Beta 2.03
- Made several changes to be more rigorous on the layering
design of dapl. The intent is to make it easier for non
InfiniBand transports to use dapl. These changes include:
* Revamped the ib_hca_open/close code to use an hca_ptr
rather than an ib_handle, giving the transport layer more
flexibility in assigning transport handles and resources.
* Removed the CQD calls, they are specific to the IBM API;
folded this functionality into the provider open/close calls.
* Moved VAPI, IBAPI transport specific items into a transport
structure placed inside of the HCA structure. Also updated
routines using these fields to use the new location. Cleaned
up provider knobs that have been exposed for too long.
* Changed a number of provider routines to use DAPL structure
pointers rather than exposing provider handles & values. Moved
provider specific items out of common code, including provider
data types (e.g. ib_uint32_t).
* Pushed provider completion codes and type back into the
provider layer. We no longer use EVD or CM completion types at
the common layer, instead we obtain the appropriate DAT type
from the provider and process only DAT types.
* Change private_data handling such that we can now accommodate
variable length private data.
- Remove DAT 1.0 cruft from the DAT header files.
- Better spec compliance in headers and various routines.
- Major updates to the VAPI implementation from
Mellanox. Includes initial kdapl implementation
- Move kdapl platform specific support for hash routines into
OSD file.
- Cleanups to make the code more readable, including comments
and certain variable and structure names.
- Fixed CM_BUSTED code so that it works again: very useful for
new dapl ports where infrastructure is lacking. Also made
some fixes for IBHOSTS_NAMING conditional code.
- Added DAPL_MERGE_CM_DTO as a compile time switch to support
EVD stream merging of CM and DTO events. Default is off.
- 'Quit' test ported to kdapltest
- uDAPL now builds on Linux 2.6 platform (SuSE 9.1).
- kDAPL now builds for a larger range of Linux kernels, but
still lacks 2.6 support.
- Added shared memory ID to LMR structure. Shared memory is
still not fully supported in the reference implementation, but
the common code will appear soon.
* Bug fixes
- Various Makefiles fixed to use the correct dat registry
library in its new location (as of Beta 2.03)
- Simple reorg of dat header files to be consistent with
the spec.
- fixed bug in vapi_dto.h recv macro where we could have an
uninitialized pointer.
- Simple fix in dat_dr.c to initialize a variable early in the
routine before errors occur.
- Removed private data pointers from a CONNECTED event, as
there should be no private data here.
- dat_strerror no longer returns an uninitialized pointer if
the error code is not recognized.
- dat_dup_connect() will reject 0 timeout values, per the
spec.
- Removed unused internal_hca_names parameter from
ib_enum_hcas() interface.
- Use a temporary DAT_EVENT for kdapl up-calls rather than
making assumptions about the current event queue.
- Relocated some platform dependent code to an OSD file.
- Eliminated several #ifdefs in .c files.
- Inserted a missing unlock() on an error path.
- Added bounds checking on size of private data to make sure
we don't overrun the buffer
- Fixed a kdapltest problem that caused a machine to panic if
the user hit ^C
- kdapltest now uses spin locks more appropriate for their
context, e.g. spin_lock_bh or spin_lock_irq. Under a
conditional.
- Fixed kdapltest loops that drain EVDs so they don't go into
endless loops.
- Fixed bug in dapl_llist_add_entry link list code.
- Better error reporting from provider code.
- Handle case of user trying to reap DTO completions on an
EP that has been freed.
- No longer hold lock when ep_free() calls into provider layer
- Fixed cr_accept() to not have an extra copy of
private_data.
- Verify private_data pointers before using them, avoid
panic.
- Fixed memory leak in kdapltest where print buffers were not
getting reclaimed.
DAPL BETA 2.03 RELEASE NOTES
There are some prominent features in this release:
1) dapltest/kdapltest. The dapltest test program has been
rearchitected such that a kernel version is now available
to test with kdapl. The most obvious change is a new
directory structure that more closely matches other core
dapl software. But there are a large number of changes
throughout the source files to accommodate both the
differences in the udapl/kdapl interfaces and more mundane
things such as printing.
The new dapltest is in the tree at ./test/dapltest, while the
old remains at ./test/udapl/dapltest. For this release, we
have maintained both versions. In a future release, perhaps
the next release, the old dapltest directory will be
removed. Ongoing development will only occur in the new tree.
2) DAT 1.1 compliance. The DAT Collaborative has been busy
finalizing the 1.1 revision of the spec. The header files
have been reviewed and posted on the DAT Collaborative web
site, they are now in full compliance.
The reference implementation has been at a 1.1 level for a
while. The current implementation has some features that will
be part of the 1.2 DAT specification, but only in places
where full compatibility can be maintained.
3) The DAT Registry has undergone some positive changes for
robustness and support of more platforms. It now has the
ability to support several identical provider names
simultaneously, which enables the same dat.conf file to
support multiple platforms. The registry will open each
library and return when successful. For example, a dat.conf
file may contain multiple provider names for ex0a, each
pointing to a different library that may represent different
platforms or vendors. This simplifies distribution into
different environments by enabling the use of common
dat.conf files.
In addition, there are a large number of bug fixes throughout
the code. Bug reports and fixes have come from a number of
companies.
Also note that the Release notes are cleaned up, no longer
containing the complete text of previous releases.
* DTO and CONNECTION event types are no longer supported on the
same EVD. NOTE: The problem is maintaining the event ordering
between two channels such that no DTO completes before a
connection is received, and no DTO completes after a
disconnect is received. For 90% of the cases this can be made
to work, but getting the remaining 10% right would cause
serious performance degradation.
NEW SINCE Beta 2.2
* DAT 1.1 spec compliance. This includes some new types, error
codes, and moving structures around in the header files,
among other things. Note the Class bits of dat_error.h have
returned to a #define (from an enum) to cover the broadest
range of platforms.
* Several additions for robustness, including handle and
pointer checking, better argument checking, state
verification, etc. Better recovery from error conditions,
and some assert()s have been replaced with 'if' statements to
handle the error.
* EVDs now maintain the actual queue length, rather than the
requested amount. Both the DAT spec and IB (and other
transports) allow the underlying implementation to provide
more CQ entries than requested.
Requests for the same number of entries contained by an EVD
return immediate success.
* kDAPL enhancements:
- module parameters & OS support calls updated to work with
more recent Linux kernels.
- kDAPL build options changed to match the Linux kernel, vastly
reducing the size and making it more robust.
- kDAPL unload now works properly
- kDAPL takes a reference on the provider driver when it
obtains a verbs vector, to prevent an accidental unload
- Cleaned out all of the uDAPL cruft from the linux/osd files.
* New dapltest (see above).
* Added a new I/O trace facility, enabling a developer to debug
all I/O that is in progress or recently completed. Default
is OFF in the build.
* 0 timeout connections now refused, per the spec.
* Moved the remaining uDAPL specific files from the common/
directory to udapl/. Also removed udapl files from the kdapl
build.
* Bug fixes
- Better error reporting from provider layer
- Fixed race condition on reference counts for posting DTO
ops.
- Use DAT_COMPLETION_SUPPRESS_FLAG to suppress successful
completion of dapl_rmr_bind (instead of
DAT_COMPLETION_UNSIGNALLED, which is for non-notification
completion).
- Verify psp_flags value per the spec
- Bug in psp_create_any() checking psp_flags fixed
- Fixed type of flags in ib_disconnect from
DAT_COMPLETION_FLAGS to DAT_CLOSE_FLAGS
- Removed hard coded check for ASYNC_EVD. Placed all EVD
prevention in evd_stream_merging_supported array, and
prevent ASYNC_EVD from being created by an app.
- ep_free() fixed to comply with the spec
- Replaced various printfs with dbg_log statements
- Fixed kDAPL interaction with the Linux kernel
- Corrected phy_register prototype
- Corrected kDAPL wait/wakeup synchronization
- Fixed kDAPL evd_kcreate() such that it no longer depends
on uDAPL only code.
- dapl_provider.h had wrong guard #def: changed DAT_PROVIDER_H
to DAPL_PROVIDER_H
- removed extra (and bogus) call to dapls_ib_completion_notify()
in evd_kcreate.c
- Inserted missing error code assignment in
dapls_rbuf_realloc()
- When a CONNECTED event arrives, make sure we are ready for
it, else something bad may have happened to the EP and we
just return. This replaces an explicit check for a single
error condition with a general check that the EP state can
deal with the request.
- Better context pointer verification. Removed locks around
call to ib_disconnect on an error path, which would result
in a deadlock. Added code for BROKEN events.
- Brought the vapi code more up to date: added conditional
compile switches, removed obsolete __ActivePort, deal
with 0 length DTO
- Several dapltest fixes to bring the code up to the 1.1
specification.
- Fixed mismatched dapl_os_dbg_print() #else dapl_Dbg_Print();
the latter was replaced with the former.
- ep_state_subtype() now includes UNCONNECTED.
- Added some missing ibapi error codes.
NEW SINCE Beta 2.1
* Changes for errata and the 1.1 spec
- Removed DAT_NAME_NOT_FOUND, per DAT errata
- EVD's with DTO and CONNECTION flags set no longer valid.
- Removed DAT_IS_SUCCESS macro
- Moved provider attribute structures from vendor files to udat.h
and kdat.h
- kdapl UPCALL_OBJECT now passed by reference
* Completed dat_strerror return strings
* Now support interrupted system calls
* dapltest now uses dat_strerror for error reporting.
* A large number of files were formatted to meet the project
standard; very cosmetic changes, but they improve readability and
maintainability. Also cleaned up a number of comments during
this effort.
* dat_registry and RPM file changes (contributed by Steffen Persvold):
- Renamed the RPM name of the registry to be dat-registry
(renamed the .spec file too, some cvs add/remove needed)
- Added the ability to create RPMs as a normal user (using
temporary paths); works on SuSE, Fedora, and RedHat.
- 'make rpm' now works even if you didn't build first.
- Changed to using the GNU __attribute__((constructor)) and
__attribute__((destructor)) on the dat_init functions, dat_init
and dat_fini. The old -init and -fini options to LD make
applications crash on some platforms (Fedora, for example).
- Added support for 64 bit platforms.
- Added code to allow multiple provider names in the registry,
primarily to support ia32 and ia64 libraries simultaneously.
Provider names are now kept in a list, the first successful
library open will be the provider.
* Added initial infrastructure for DAPL_DCNTR, a feature that
will aid in debug and tuning of a dapl implementation. Partial
implementation only at this point.
* Bug fixes
- Prevent debug messages from crashing dapl in EVD completions by
verifying the error code to ensure data is valid.
- Verify CNO before using it to clean up in evd_free()
- CNO timeouts now return correct error codes, per the spec.
- cr_accept now complies with the spec concerning connection
requests that go away before the accept is invoked.
- Verify valid EVD before posting connection events on the active
side of a connection. EP locking also corrected.
- Clean up of dapltest Makefile, no longer need to declare
DAT_THREADSAFE
- Fixed check of EP states to see if we need to disconnect when
an IA is closed.
- ep_free() code reworked such that we can properly close a
connection pending EP.
- Changed disconnect processing to comply with the spec: user will
see a BROKEN event, not DISCONNECTED.
- If we get a DTO error, issue a disconnect to let the CM and
the user know the EP state changed to disconnect; checked IBA
spec to make sure we disconnect on correct error codes.
- ep_disconnect now properly deals with abrupt disconnects on the
active side of a connection.
- PSP now created in the correct state for psp_create_any(), making
it usable.
- dapl_evd_resize() now returns correct status, instead of always
DAT_NOT_IMPLEMENTED.
- dapl_evd_modify_cno() does better error checking before invoking
the provider layer, avoiding bugs.
- Simple change to allow dapl_evd_modify_cno() to set the CNO to
NULL, per the spec.
- Added required locking around call to dapl_sp_remove_cr.
- Fixed problems related to dapl_ep_free: the new
disconnect(abrupt) allows us to do a more immediate teardown of
connections, removing the need for the MAGIC_EP_EXIT magic
number/state, which has been removed. Much cleanup of paths,
and made more robust.
- Made changes to meet the spec, uDAPL 1.1 6.3.2.3: CNO is
triggered if there are waiters when the last EVD is removed
or when the IA is freed.
- Added code to deal with the provider synchronously telling us
a connection is unreachable, and generate the appropriate
event.
- Changed timer routine type from unsigned long to uintptr_t
to better fit with machine architectures.
- ep.param data now initialized in ep_create, not ep_alloc.
- Or Gerlitz provided updates to Mellanox files for evd_resize,
fw attributes, many others. Also implemented changes for correct
sizes on REP side of a connection request.
NEW SINCE Beta 2.0
* dat_echo now DAT 1.1 compliant. Various small enhancements.
* Revamped atomic_inc/dec to be void; the return value was never
used. This allows kdapl to use Linux kernel equivalents, and
is a small performance advantage.
* kDAPL: dapl_evd_modify_upcall implemented and tested.
* kDAPL: physical memory registration implemented and tested.
* uDAPL now builds cleanly for non-debug versions.
* Default RDMA credits increased to 8.
* Default ACK_TIMEOUT now a reasonable value (2 sec vs old 2
months).
* Cleaned up dat_error.h, now 1.1 compliant in comments.
* evd_resize initial implementation. Untested.
* Bug fixes
- __KDAPL__ is defined in kdat_config.h, so apps don't need
to define it.
- Changed include file ordering in kdat.h to put kdat_config.h
first.
- resolved connection/tear-down race on the client side.
- kDAPL timeouts now scaled properly; fixed 3 orders of
magnitude difference.
- kDAPL EVD callbacks now get invoked for all completions; old
code would drop them under heavy utilization.
- Fixed error path in kDAPL evd creation, so we no longer
leak CNOs.
- create_psp_any returns correct error code if it can't create
a connection qualifier.
- lock fix in ibapi disconnect code.
- kDAPL INFINITE waits now work properly (non connection
waits)
- kDAPL driver unload now works properly
- dapl_lmr_[k]create now returns 1.1 error codes
- ibapi routines now return DAT 1.1 error codes
NEW SINCE Beta 1.10
* kDAPL is now part of the DAPL distribution. See the release
notes above.
The kDAPL 1.1 spec is now contained in the doc/ subdirectory.
* Several files have been moved around as part of the kDAPL
checkin. Some files that were previously in udapl/ are now
in common/, some in common are now in udapl/. The goal was
to make sure files are properly located and make sense for
the build.
* Source code formatting changes for consistency.
* Bug fixes
- dapl_evd_create() was comparing the wrong bit combinations,
allowing bogus EVDs to be created.
- Removed code that swallowed zero length I/O requests, which
are allowed by the spec and are useful to applications.
- Locking in dapli_get_sp_ep was asymmetric; fixed it so the
routine will take and release the lock. Cosmetic change.
- dapl_get_consumer_context() will now verify the pointer
argument 'context' is not NULL.
OBTAIN THE CODE
To obtain the tree for your local machine you can check it
out of the source repository using CVS tools. CVS is common
on Unix systems and available as freeware on Windows machines.
The command to anonymously obtain the source code from
Source Forge (with no password) is:
cvs -d:pserver:anonymous@cvs.dapl.sourceforge.net:/cvsroot/dapl login
cvs -z3 -d:pserver:anonymous@cvs.dapl.sourceforge.net:/cvsroot/dapl co .
When prompted for a password, simply press the Enter key.
Source Forge also contains explicit directions on how to become
a developer, as well as how to use different CVS commands. You may
also browse the source code using the URL:
http://svn.sourceforge.net/viewvc/dapl/trunk/
SYSTEM REQUIREMENTS
This project has been implemented on Red Hat Linux 7.3, SuSE
SLES 8, 9, and 10, Windows 2000, RHEL 3.0, 4.0, and 5.0, and a few
other Linux distributions. The structure of the code is designed
to allow other operating systems to easily be adapted.
The DAPL team has used Mellanox Tavor based InfiniBand HCAs for
development, and continues with this platform. Our HCAs use the
IB verbs API submitted by IBM. Mellanox has contributed an
adapter layer using their VAPI verbs API. Either platform is
available to any group considering DAPL work. The structure of
the uDAPL source allows other provider API sets to be easily
integrated.
The development team uses any one of three topologies: two HCAs
in a single machine; a single HCA in each of two machines; and
most commonly, a switch. Machines connected to a switch may have
more than one HCA.
The DAPL Plugfest revealed that switches and HCAs available from
most vendors will interoperate with little trouble, given the
most recent releases of software. The dapl reference team makes
no recommendation on HCA or switch vendors.
Explicit machine configurations are available upon request.
IN THE TREE
The DAPL tree contains source code for the uDAPL and kDAPL
implementations, and also includes tests and documentation.
The included documentation covers the base-level APIs of the
providers: OpenFabrics, IBM Access, and Mellanox Verbs API. Also
included are a growing number of DAPL design documents which
lead the reader through specific DAPL subsystems. More
design documents are in progress and will appear in the tree in
the near future.
A small number of test applications and a unit test framework
are also included. dapltest is the primary testing application
used by the DAPL team; it can simulate a variety of
loads and exercises a large number of interfaces. Full
documentation is included for each of the tests.
Recently, the dapl conformance test has been added to the source
repository. The test provides coverage of the most common
interfaces, doing both positive and negative testing. Vendors
providing DAPL implementations are strongly encouraged to run
this set of tests.
MAKEFILE NOTES
There are a number of #ifdefs in the code that were necessary
during early development. They are disappearing as we
have time to take advantage of features and work available from
newer releases of provider software. These #ifdefs are not
documented as the intent is to remove them as soon as possible.
CONTRIBUTIONS
As is common to Source Forge projects, there are a small number
of developers directly associated with the source tree and having
privileges to change the tree. Requested updates, changes, bug
fixes, enhancements, or contributions should be sent to
James Lentini at jlentinit@netapp.com for review. We welcome your
contributions and expect the quality of the project will
improve thanks to your help.
The core DAPL team is:
James Lentini
Arlin Davis
Steve Sears
... with contributions from a number of excellent engineers in
various companies contributing to the open source effort.
ONGOING WORK
Not all of the DAPL spec is implemented at this time.
Functionality such as shared memory will probably not be
implemented by the reference implementation (there is a write up
on this in the doc/ area), and there are still various cases where
work remains to be done. And of course, not all of the
implemented functionality has been tested yet. The DAPL team
continues to develop and test the tree with the intent of
completing the specification and delivering a robust and useful
implementation.
The DAPL Team
SCSI RDMA Protocol (SRP) Target driver for Linux
=================================================
The SRP Target driver is designed to work directly on top of the OpenFabrics
OFED-1.x software stack (http://www.openfabrics.org) or the InfiniBand
drivers in the Linux kernel tree (kernel.org). It also interfaces with the
generic SCSI target mid-level driver, SCST (http://scst.sourceforge.net).
By interfacing with SCST, the driver can support a variety of I/O
modes on real or virtual devices in the backend:
1. scst_disk -- interfaces with the SCSI subsystem to claim and export real
SCSI devices, i.e. disks, hardware RAID volumes, or tape libraries, as SRP LUNs
2. scst_vdisk -- FILEIO and BLOCKIO modes. These allow you to turn software
RAID volumes, LVM volumes, IDE disks, block devices, and regular files into
SRP LUNs
3. NULLIO mode -- allows you to measure performance without sending I/Os
to *real* devices
Prerequisites
-------------
0. Supported distributions: RHEL 5/5.1/5.2, SLES 10 sp1/sp2, and vanilla
kernels > 2.6.16
Note: On distribution default kernels you can run scst_vdisk BLOCKIO mode
with good performance. You can also run scst_disk, i.e. SCSI pass-thru
mode; however, you must compile SCST with -DSTRICT_SERIALIZING
enabled, and this does not yield good performance.
Recompiling the kernel is required for good performance with
scst_disk, i.e. SCSI pass-thru mode.
1. Download and install SCST driver.
a. download scst-1.0.0.tar.gz from this URL
http://scst.sourceforge.net/downloads.html
If your distribution is RHEL 5.2, please go to step (e).
b. untar and install scst-1.0.0
$ tar zxvf scst-1.0.0.tar.gz
$ cd scst-1.0.0
For RedHat distributions:
$ make && make install
For SuSE distributions:
. Save the following patch to /tmp/scst_sles_spX.patch
/************************ Start scst_sles_spX.patch *********************/
diff -Naur scst-1.0.0/src/scst_lib.c scst-1.0.0.wk/src/scst_lib.c
--- scst-1.0.0/src/scst_lib.c 2008-06-26 23:20:18.000000000 -0700
+++ scst-1.0.0.wk/src/scst_lib.c 2008-12-08 15:28:46.000000000 -0800
@@ -1071,7 +1071,7 @@
return orig_cmd;
}
-#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 18)
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 16)
static void scst_req_done(struct scsi_cmnd *scsi_cmd)
{
struct scsi_request *req;
@@ -1134,7 +1134,7 @@
TRACE_EXIT();
return;
}
-#else /* LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 18) */
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 16) */
static void scst_send_release(struct scst_device *dev)
{
struct scsi_device *scsi_dev;
@@ -1183,7 +1183,7 @@
TRACE_EXIT();
return;
}
-#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 18) */
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 16) */
/* scst_mutex supposed to be held */
static void scst_clear_reservation(struct scst_tgt_dev *tgt_dev)
@@ -1421,7 +1421,7 @@
sBUG_ON(cmd->inc_blocking || cmd->needs_unblocking ||
cmd->dec_on_dev_needed);
-#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 18)
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 16)
#if defined(EXTRACHECKS)
if (cmd->scsi_req) {
PRINT_ERROR("%s: %s", __func__, "Cmd with unfreed "
@@ -1596,7 +1596,7 @@
return;
}
-#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 18)
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 16)
int scst_alloc_request(struct scst_cmd *cmd)
{
int res = 0;
diff -Naur scst-1.0.0/src/scst_main.c scst-1.0.0.wk/src/scst_main.c
--- scst-1.0.0/src/scst_main.c 2008-07-07 12:04:00.000000000 -0700
+++ scst-1.0.0.wk/src/scst_main.c 2008-12-08 15:25:05.000000000 -0800
@@ -1593,7 +1593,7 @@
TRACE_ENTRY();
-#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 18)
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 16)
{
struct scsi_request *req;
BUILD_BUG_ON(SCST_SENSE_BUFFERSIZE !=
diff -Naur scst-1.0.0/src/scst_priv.h scst-1.0.0.wk/src/scst_priv.h
--- scst-1.0.0/src/scst_priv.h 2008-06-12 04:40:53.000000000 -0700
+++ scst-1.0.0.wk/src/scst_priv.h 2008-12-08 15:25:43.000000000 -0800
@@ -27,7 +27,7 @@
#include
#include
-#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 18)
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 16)
#include
#endif
@@ -320,7 +320,7 @@
void scst_check_retries(struct scst_tgt *tgt);
void scst_tgt_retry_timer_fn(unsigned long arg);
-#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 18)
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 16)
int scst_alloc_request(struct scst_cmd *cmd);
void scst_release_request(struct scst_cmd *cmd);
diff -Naur scst-1.0.0/src/scst_targ.c scst-1.0.0.wk/src/scst_targ.c
--- scst-1.0.0/src/scst_targ.c 2008-06-26 23:20:05.000000000 -0700
+++ scst-1.0.0.wk/src/scst_targ.c 2008-12-08 15:26:45.000000000 -0800
@@ -1132,7 +1132,7 @@
return context;
}
-#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 18)
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 16)
static inline struct scst_cmd *scst_get_cmd(struct scsi_cmnd *scsi_cmd,
struct scsi_request **req)
{
@@ -1183,7 +1183,7 @@
TRACE_EXIT();
return;
}
-#else /* LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 18) */
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 16) */
static void scst_cmd_done(void *data, char *sense, int result, int resid)
{
struct scst_cmd *cmd;
@@ -1205,7 +1205,7 @@
TRACE_EXIT();
return;
}
-#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 18) */
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 16) */
static void scst_cmd_done_local(struct scst_cmd *cmd, int next_state)
{
@@ -1771,7 +1771,7 @@
}
#endif
-#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 18)
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 16)
if (unlikely(scst_alloc_request(cmd) != 0)) {
if (scst_cmd_atomic(cmd)) {
rc = SCST_EXEC_NEED_THREAD;
@@ -1823,7 +1823,7 @@
out_error:
scst_set_cmd_error(cmd, SCST_LOAD_SENSE(scst_sense_hardw_error));
-#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 18)
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 16)
out_busy:
scst_set_busy(cmd);
/* go through */
/************************ End scst_sles_spX.patch ***********************/
. patch -p1 < /tmp/scst_sles_spX.patch
. make && make install
c. save the following patch into /tmp/scsi_tgt.patch
/************************ Start scsi_tgt.patch *********************/
--- scsi_tgt.h 2008-07-20 14:25:30.000000000 -0700
+++ scsi_tgt.h 2008-07-20 14:25:09.000000000 -0700
@@ -42,7 +42,9 @@
#endif
#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 19)
+/*
typedef _Bool bool;
+*/
#define true 1
#define false 0
#endif
@@ -2330,7 +2332,7 @@
void scst_async_mcmd_completed(struct scst_mgmt_cmd *mcmd, int status);
#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24)
-
+/*
static inline struct page *sg_page(struct scatterlist *sg)
{
return sg->page;
@@ -2358,7 +2360,7 @@
sg->offset = offset;
sg->length = len;
}
-
+*/
#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24) */
static inline void sg_clear(struct scatterlist *sg)
/************************ End scsi_tgt.patch *********************/
d. patch scsi_tgt.h file with /tmp/scsi_tgt.patch
$ cd /usr/local/include/scst;
$ cp scst.h scsi_tgt.h
$ patch -p0 < /tmp/scsi_tgt.patch
These steps (e-h) are for RHEL 5.2 distributions only.
Other versions (RHEL 5/5.1, SLES 10 sp1/sp2) should skip steps (e-h)
and continue with step (2) - OFED installation.
e. save the following patch into /tmp/scst.patch
/************************ Start scst.patch *********************/
--- scst.h 2008-07-20 14:25:30.000000000 -0700
+++ scst.h 2008-07-20 14:25:09.000000000 -0700
@@ -42,7 +42,9 @@
#endif
#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 19)
+/*
typedef _Bool bool;
+*/
#define true 1
#define false 0
#endif
/************************ End scst.patch *********************/
f. untar, patch, and install scst-1.0.0
$ tar zxvf scst-1.0.0.tar.gz
$ cd scst-1.0.0/include
$ patch -p0 < /tmp/scst.patch
$ cd ..
$ make && make install
g. save the following patch into /tmp/scsi_tgt.patch
/************************ Start scsi_tgt.patch *********************/
--- scsi_tgt.h 2008-07-20 14:25:30.000000000 -0700
+++ scsi_tgt.h 2008-07-20 14:25:09.000000000 -0700
@@ -2330,7 +2332,7 @@
void scst_async_mcmd_completed(struct scst_mgmt_cmd *mcmd, int status);
#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24)
-
+/*
static inline struct page *sg_page(struct scatterlist *sg)
{
return sg->page;
@@ -2358,7 +2360,7 @@
sg->offset = offset;
sg->length = len;
}
-
+*/
#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24) */
static inline void sg_clear(struct scatterlist *sg)
/************************ End scsi_tgt.patch *********************/
h. patch scsi_tgt.h file with /tmp/scsi_tgt.patch
$ cd /usr/local/include/scst
$ cp scst.h scsi_tgt.h
$ patch -p0 < /tmp/scsi_tgt.patch
2. Download/install OFED-1.3.1 or OFED-1.4 - SRP target is part of OFED
Note: if your system already has an OFED stack installed, you need to remove
all the previously built RPMs and reinstall:
$ cd ~/OFED-1.4
$ rm RPMS/*
$ ./install.pl -c ofed.conf
a. download OFED packages from this URL
http://www.openfabrics.org/downloads/OFED/
b. install OFED - remember to choose srpt=y
$ cd ~/OFED-1.4
$ ./install.pl
How-to run
-----------
A. On srp target machine
1. Please refer to SCST's README for loading scst driver and its dev_handlers
drivers (scst_disk, scst_vdisk block or file IO mode, nullio, ...)
Note: In any mode you always need to have LUN 0 in each group's device list.
After LUN 0 you can use any LUN numbers (they do not need to be in
order; only the first LUN must be 0).
Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not sufficient:
it only loads the ib_srpt module and does not load scst and its dev_handlers.
Example 1: working with VDISK BLOCKIO mode
(using md0 device, sda, and cciss/c1d0)
a. modprobe scst
b. modprobe scst_vdisk
c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
f. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices
g. echo "add vdisk1 1" >/proc/scsi_tgt/groups/Default/devices
h. echo "add vdisk2 2" >/proc/scsi_tgt/groups/Default/devices
Example 2: working with real back-end scsi disks in scsi pass-thru mode
a. modprobe scst
b. modprobe scst_disk
c. cat /proc/scsi_tgt/scsi_tgt
ibstor00:~ # cat /proc/scsi_tgt/scsi_tgt
Device (host:ch:id:lun or name) Device handler
0:0:0:0 dev_disk
4:0:0:0 dev_disk
5:0:0:0 dev_disk
6:0:0:0 dev_disk
7:0:0:0 dev_disk
Suppose you want to exclude the first SCSI disk and expose the last four
SCSI disks as IB/SRP LUNs for I/O:
echo "add 4:0:0:0 0" >/proc/scsi_tgt/groups/Default/devices
echo "add 5:0:0:0 1" >/proc/scsi_tgt/groups/Default/devices
echo "add 6:0:0:0 2" >/proc/scsi_tgt/groups/Default/devices
echo "add 7:0:0:0 3" >/proc/scsi_tgt/groups/Default/devices
Example 3: working with scst_vdisk FILEIO mode
(using md0 device and file 10G-file)
a. modprobe scst
b. modprobe scst_vdisk
c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
d. echo "open vdisk1 /10G-file" > /proc/scsi_tgt/vdisk/vdisk
e. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices
f. echo "add vdisk1 1" >/proc/scsi_tgt/groups/Default/devices
2. modprobe ib_srpt
B. On initiator machines you can manually do the following steps:
1. modprobe ib_srp
2. ibsrpdm -c -d /dev/infiniband/umadX
(to discover new SRP targets)
umad0: port 1 of the first HCA
umad1: port 2 of the first HCA
umad2: port 1 of the second HCA
3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
4. fdisk -l (will show new discovered scsi disks)
Example:
Assume that you use port 1 of the first HCA in the system, i.e. mthca0
[root@lab104 ~]# ibsrpdm -c -d /dev/infiniband/umad0
id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,
dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
[root@lab104 ~]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,
dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 >
/sys/class/infiniband_srp/srp-mthca0-1/add_target
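When several targets are discovered, the two manual steps above can be combined in a small loop. This is only a sketch: the add_target function below is a stand-in for the sysfs write, and the here-doc simulates the ibsrpdm output shown above so the loop can be read in isolation.

```shell
# Stand-in for the sysfs write; on a real system this would be:
#   echo "$tgt" > /sys/class/infiniband_srp/srp-mthca0-1/add_target
add_target() { printf 'adding: %s\n' "$1"; }

# On a real system, replace the here-doc with the live query:
#   ibsrpdm -c -d /dev/infiniband/umad0
while read -r tgt; do
    [ -n "$tgt" ] && add_target "$tgt"
done <<'EOF'
id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
EOF
```

Each line of ibsrpdm -c output is already in the exact format add_target expects, which is why no reformatting is needed in the loop.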
OR
+ You can edit /etc/infiniband/openib.conf to load the SRP driver and SRP HA
daemon automatically, i.e. set SRP_LOAD=yes and SRPHA_ENABLE=yes
+ To set up and use the high availability feature you need the dm-multipath
driver and the multipath tool
+ Please refer to the OFED-1.x SRP user manual for more detailed instructions
on how to enable and use the HA feature
Here is an example of an SRP target setup file
--------------------------------------------
#!/bin/sh
modprobe scst scst_threads=1
modprobe scst_vdisk scst_vdisk_ID=100
echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
echo "open vdisk3 /dev/sdd BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
echo "add vdisk3 3" > /proc/scsi_tgt/groups/Default/devices
modprobe ib_srpt
echo "add "mgmt"" > /proc/scsi_tgt/trace_level
echo "add "mgmt_dbg"" > /proc/scsi_tgt/trace_level
echo "add "out_of_mem"" > /proc/scsi_tgt/trace_level
How-to unload/shutdown
-----------------------
1. Unload ib_srpt
$ modprobe -r ib_srpt
2. Unload scst and its dev_handlers first
$ modprobe -r scst_vdisk scst
3. Unload ofed
$ /etc/rc.d/openibd stop
QoS Management in OpenSM
==============================================================================
Table of contents
==============================================================================
1. Overview
2. Full QoS Policy File
3. Simplified QoS Policy Definition
4. Policy File Syntax Guidelines
5. Examples of Full Policy File
6. Simplified QoS Policy - Details and Examples
7. SL2VL Mapping and VL Arbitration
==============================================================================
1. Overview
==============================================================================
When QoS in OpenSM is enabled (-Q or --qos), OpenSM looks for a QoS policy file.
The default name of OpenSM QoS policy file is
/usr/local/etc/opensm/qos-policy.conf. The default may be changed by using -Y
or --qos_policy_file option with OpenSM.
During fabric initialization and at every heavy sweep OpenSM parses the QoS
policy file, applies its settings to the discovered fabric elements, and
enforces the provided policy on client requests. The overall flow for such
requests is:
- The request is matched against the defined matching rules such that the
QoS Level definition is found.
- Given the QoS Level, path(s) search is performed with the given
restrictions imposed by that level.
There are two ways to define QoS policy:
- Full policy, where the policy file syntax provides an administrator
various ways to match PathRecord/MultiPathRecord (PR/MPR) request and
enforce various QoS constraints on the requested PR/MPR
- Simplified QoS policy definition, where an administrator would be able to
match PR/MPR requests by various ULPs and applications running on top of
these ULPs.
While the full policy syntax is very flexible, in many cases the simplified
policy definition would be sufficient.
==============================================================================
2. Full QoS Policy File
==============================================================================
QoS policy file has the following sections:
I) Port Groups (denoted by port-groups).
This section defines zero or more port groups that can be referred later by
matching rules (see below). Port group lists ports by:
- Port GUID
- Port name, which is a combination of NodeDescription and IB port number
- PKey, which means that all the ports in the subnet that belong to
partition with a given PKey belong to this port group
- Partition name, which means that all the ports in the subnet that belong
to partition with a given name belong to this port group
- Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and
SELF (SM's port).
II) QoS Setup (denoted by qos-setup).
This section describes how to set up SL2VL and VL Arbitration tables on
various nodes in the fabric.
However, this is not supported in OpenSM currently.
SL2VL and VLArb tables should be configured in the OpenSM options file
(default location - /usr/local/etc/opensm/opensm.conf).
III) QoS Levels (denoted by qos-levels).
Each QoS Level defines Service Level (SL) and a few optional fields:
- MTU limit
- Rate limit
- PKey
- Packet lifetime
When path(s) search is performed, it is done with regard to the restrictions
that these QoS Level parameters impose.
One QoS level that is mandatory to define is a DEFAULT QoS level. It is
applied to a PR/MPR query that does not match any existing match rule.
Similar to any other QoS Level, it can also be explicitly referred to by any
match rule.
IV) QoS Matching Rules (denoted by qos-match-rules).
Each PathRecord/MultiPathRecord query that OpenSM receives is matched against
the set of matching rules. Rules are scanned in order of appearance in the QoS
policy file such that the first match takes precedence.
Each rule has a name of QoS level that will be applied to the matching query.
A default QoS level is applied to a query that did not match any rule.
Queries can be matched by:
- Source port group (whether a source port is a member of a specified group)
- Destination port group (same as above, only for destination port)
- PKey
- QoS class
- Service ID
To match a certain matching rule, PR/MPR query has to match ALL the rule's
criteria. However, not all the fields of the PR/MPR query have to appear in
the matching rule.
For instance, if the rule has a single criterion - Service ID, it will match
any query that has this Service ID, disregarding the rest of the query fields.
However, if a certain query has only Service ID (which means that this is the
only bit in the PR/MPR component mask that is on), it will not match any rule
that has other matching criteria besides Service ID.
==============================================================================
3. Simplified QoS Policy Definition
==============================================================================
Simplified QoS policy definition comprises a single section denoted by
qos-ulps. Similar to the full QoS policy, it has a list of match rules and
their QoS Level, but in this case a match rule has only one criterion - its
goal is to match a certain ULP (or a certain application on top of this ULP)
PR/MPR request, and QoS Level has only one constraint - Service Level (SL).
The simplified policy section may appear in the policy file in combination
with the full policy, or as a stand-alone policy definition.
See more details and list of match rule criteria below.
==============================================================================
4. Policy File Syntax Guidelines
==============================================================================
- Empty lines are ignored.
- Leading and trailing blanks, as well as empty lines, are ignored, so
the indentation in the example is just for better readability.
- Comments are started with the pound sign (#) and terminated by EOL.
- Any keyword should be the first non-blank in the line, unless it's a
comment.
- Keywords that denote section/subsection start have matching closing
keywords.
- Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR
requests that didn't match any of the matching rules.
- Any section/subsection of the policy file is optional.
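The comment and blank-line rules above can be mimicked with a short filter when checking a policy file by hand. This is only a rough pre-check sketch (strip_policy is an illustrative name, and the demo file path is arbitrary), not how OpenSM itself parses the file:

```shell
# Strip pound-sign comments and blank lines, the way the guidelines above
# describe OpenSM treating them.
strip_policy() { sed -e 's/#.*$//' -e '/^[[:space:]]*$/d' "$1"; }

# Write a tiny policy fragment and filter it.
cat > /tmp/qos_demo.conf <<'EOF'
# a comment
qos-levels
    qos-level
        name: DEFAULT   # trailing comment
        sl: 0
    end-qos-level
end-qos-levels
EOF
strip_policy /tmp/qos_demo.conf
```

The filtered output keeps the six keyword/value lines and drops the full-line comment; the trailing comment on the name line is removed while the line itself survives.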
==============================================================================
5. Examples of Full Policy File
==============================================================================
As mentioned earlier, any section of the policy file is optional, and
the only mandatory part of the policy file is a default QoS Level.
Here's an example of the shortest policy file:
qos-levels
qos-level
name: DEFAULT
sl: 0
end-qos-level
end-qos-levels
Port groups section is missing because there are no match rules, which means
that port groups are not referred to anywhere, and there is no need to define
them. And since this policy file doesn't have any matching rules, a PR/MPR query
won't match any rule, and OpenSM will enforce default QoS level.
Essentially, the above example is equivalent to not having a QoS policy file
at all.
The following example shows all the possible options and keywords in the
policy file and their syntax:
#
# See the comments in the following example.
# They explain different keywords and their meaning.
#
port-groups
port-group # using port GUIDs
name: Storage
# "use" is just a description that is used for logging
# Other than that, it is just a comment
use: SRP Targets
port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
port-guid: 0x1000000000FFFF
end-port-group
port-group
name: Virtual Servers
# The syntax of the port name is as follows:
# "node_description/Pnum".
# node_description is compared to the NodeDescription of the node,
# and "Pnum" is a port number on that node.
port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
end-port-group
# using partitions defined in the partition policy
port-group
name: Partitions
partition: Part1
pkey: 0x1234
end-port-group
# using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
# or ALL (for all the nodes in the subnet)
port-group
name: CAs and SM
node-type: CA, SELF
end-port-group
end-port-groups
qos-setup
# This section of the policy file describes how to set up SL2VL and VL
# Arbitration tables on various nodes in the fabric.
# However, this is not supported in OpenSM currently - the section is
# parsed and ignored. SL2VL and VLArb tables should be configured in the
# OpenSM options file (by default - /usr/local/etc/opensm/opensm.conf).
end-qos-setup
qos-levels
# Having a QoS Level named "DEFAULT" is a must - it is applied to
# PR/MPR requests that didn't match any of the matching rules.
qos-level
name: DEFAULT
use: default QoS Level
sl: 0
end-qos-level
# the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime
qos-level
name: WholeSet
sl: 1
mtu-limit: 4
rate-limit: 5
pkey: 0x1234
packet-life: 8
end-qos-level
end-qos-levels
# Match rules are scanned in order of their appearance in the policy file.
# First matched rule takes precedence.
qos-match-rules
# matching by single criteria: QoS class
qos-match-rule
use: by QoS class
qos-class: 7-9,11
# Name of qos-level to apply to the matching PR/MPR
qos-level-name: WholeSet
end-qos-match-rule
# show matching by destination group and service id
qos-match-rule
use: Storage targets
destination: Storage
service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF
qos-level-name: WholeSet
end-qos-match-rule
qos-match-rule
source: Storage
use: match by source group only
qos-level-name: DEFAULT
end-qos-match-rule
qos-match-rule
use: match by all parameters
qos-class: 7-9,11
source: Virtual Servers
destination: Storage
service-id: 0x0000000000010000-0x000000000001FFFF
pkey: 0x0F00-0x0FFF
qos-level-name: WholeSet
end-qos-match-rule
end-qos-match-rules
==============================================================================
6. Simplified QoS Policy - Details and Examples
==============================================================================
Simplified QoS policy match rules are tailored for matching ULPs (or some
application on top of a ULP) PR/MPR requests. This section has a list of
per-ULP (or per-application) match rules and the SL that should be enforced
on the matched PR/MPR query.
Match rules include:
- Default match rule that is applied to PR/MPR query that didn't match any
of the other match rules
- SDP
- SDP application with a specific target TCP/IP port range
- SRP with a specific target IB port GUID
- RDS
- iSER
- iSER application with a specific target TCP/IP port range
- IPoIB with a default PKey
- IPoIB with a specific PKey
- any ULP/application with a specific Service ID in the PR/MPR query
- any ULP/application with a specific PKey in the PR/MPR query
- any ULP/application with a specific target IB port GUID in the PR/MPR query
Since any section of the policy file is optional, as long as the basic rules of
the file are kept (such as not referring to a nonexistent port group, having a
default QoS Level, etc.), the simplified policy section (qos-ulps) can serve
as a complete QoS policy file.
The shortest policy file in this case would be as follows:
qos-ulps
default : 0 #default SL
end-qos-ulps
It is equivalent to the previous example of the shortest policy file, and it
is also equivalent to not having a policy file at all.
Below is an example of simplified QoS policy with all the possible keywords:
qos-ulps
default : 0 # default SL
sdp, port-num 30000 : 0 # SL for application running on top
# of SDP when a destination
# TCP/IP port is 30000
sdp, port-num 10000-20000 : 0
sdp : 1 # default SL for any other
# application running on top of SDP
rds : 2 # SL for RDS traffic
iser, port-num 900 : 0 # SL for iSER with a specific target
# port
iser : 3 # default SL for iSER
ipoib, pkey 0x0001 : 0 # SL for IPoIB on partition with
# pkey 0x0001
ipoib : 4 # default IPoIB partition,
# pkey=0x7FFF
any, service-id 0x6234 : 6 # match any PR/MPR query with a
# specific Service ID
any, pkey 0x0ABC : 6 # match any PR/MPR query with a
# specific PKey
srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on
# a specified IB port GUID
any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with
# a specific target port GUID
end-qos-ulps
Similar to the full policy definition, matching of PR/MPR queries is done in
order of appearance in the QoS policy file such that the first match takes
precedence, except for the "default" rule, which is applied only if the query
didn't match any other rule.
All other sections of the QoS policy file take precedence over the qos-ulps
section. That is, if a policy file has both qos-match-rules and qos-ulps
sections, then any query is matched first against the rules in the
qos-match-rules section, and only if there was no match, the query is matched
against the rules in qos-ulps section.
Note that some of these match rules may overlap, so in order to use the
simplified QoS definition effectively, it is important to understand how each
of the ULPs is matched:
6.1 IPoIB
IPoIB query is matched by PKey. Default PKey for IPoIB partition is 0x7fff, so
the following three match rules are equivalent:
ipoib :
ipoib, pkey 0x7fff :
any, pkey 0x7fff :
6.2 SDP
SDP PR query is matched by Service ID. The Service-ID for SDP is
0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port
Number to connect to. The following two match rules are equivalent:
sdp :
any, service-id 0x0000000000010000-0x000000000001ffff :
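The Service-ID layout above is simple enough to compute directly. A two-line helper (sdp_service_id is an illustrative name, not part of OpenSM or any tool) shows the port-to-Service-ID arithmetic:

```shell
# SDP Service ID is 0x000000000001PPPP, where PPPP is the remote
# TCP/IP port number in hex.
sdp_service_id() { printf '0x000000000001%04x\n' "$1"; }

sdp_service_id 30000   # the port used in the qos-ulps example above
```

So a "sdp, port-num 30000" rule matches the same queries as a service-id rule for the value printed above.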
6.3 RDS
Similar to SDP, RDS PR query is matched by Service ID. The Service ID for RDS
is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP
Port Number to connect to. Default port number for RDS is 0x48CA, which makes
a default Service-ID 0x00000000010648CA. The following two match rules are
equivalent:
rds :
any, service-id 0x00000000010648CA :
6.4 iSER
Similar to RDS, an iSER query is matched by Service ID, where the Service ID
is also 0x000000000106PPPP. Default port number for iSER is 0x0CBC, which makes
a default Service-ID 0x0000000001060CBC. The following two match rules are
equivalent:
iser :
any, service-id 0x0000000001060CBC :
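The same arithmetic reproduces the default Service IDs quoted above for RDS and iSER (the helper name is illustrative):

```shell
# RDS and iSER Service IDs both use the layout 0x000000000106PPPP,
# where PPPP is the remote TCP/IP port number in hex.
ulp_service_id() { printf '0x000000000106%04X\n' "$(( $1 ))"; }

ulp_service_id 0x48CA   # RDS default port
ulp_service_id 0x0CBC   # iSER default port
```

The two printed values are exactly the default Service IDs 0x00000000010648CA and 0x0000000001060CBC cited in sections 6.3 and 6.4.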
6.5 SRP
Service ID for SRP varies from storage vendor to vendor, thus SRP query is
matched by the target IB port GUID. The following two match rules are
equivalent:
srp, target-port-guid 0x1234 :
any, target-port-guid 0x1234 :
Note that any of the above ULPs might contain target port GUID in the PR
query, so in order for these queries not to be recognized by the QoS manager
as SRP, the SRP match rule (or any match rule that refers to the target port
guid only) should be placed at the end of the qos-ulps match rules.
6.6 MPI
The SL for MPI is manually configured by the MPI admin. OpenSM does not force
any SL on the MPI traffic, which is why it is the only ULP that does not appear
in the qos-ulps section.
==============================================================================
7. SL2VL Mapping and VL Arbitration
==============================================================================
The OpenSM cached options file has a set of QoS-related configuration
parameters that are used to configure SL2VL mapping and VL arbitration on IB ports.
These parameters are:
- Max VLs: the maximum number of VLs that will be on the subnet.
- High limit: the limit of High Priority component of VL Arbitration
table (IBA 7.6.9).
- VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template.
- VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template.
- SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs
corresponding to SLs 0-15 (Note that VL15 used here means drop this SL).
There are separate sets of QoS configuration parameters for various target
types: CAs, routers, switch external ports, and the switch's enhanced port 0.
The names of these parameters are prefixed by a "qos_<type>_" string. Here is
a full list of the currently supported sets:
qos_ca_ - QoS configuration parameters set for CAs.
qos_rtr_ - parameters set for routers.
qos_sw0_ - parameters set for switches' port 0.
qos_swe_ - parameters set for switches' external ports.
Here is an example of the typical default values for CAs and switches'
external ports (hard-coded in OpenSM initialization):
qos_ca_max_vls 15
qos_ca_high_limit 0
qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
qos_swe_max_vls 15
qos_swe_high_limit 0
qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
VL arbitration tables (both high and low) are lists of VL/Weight pairs.
Each list entry contains a VL number (0-14) and a weighting value (0-255)
indicating the number of 64-byte units (credits) that may be transmitted from
that VL when its turn in the arbitration occurs. A weight of 0 indicates that
the entry should be skipped. If a list entry is programmed for VL15, or for a
VL that is not supported or not currently configured by the port, the port may
either skip that entry or send from any supported VL for that entry.
Note that the same VL may be listed multiple times in the high- or
low-priority arbitration table and, further, may be listed in both tables.
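The VL/Weight pair format above can be validated mechanically. The following
is a minimal illustrative sketch (not part of OpenSM) that parses a VLArb
table string and enforces the VL and weight ranges just described:

```python
def parse_vlarb(table):
    """Parse a VLArb table string such as "0:4,1:64" into (VL, weight)
    pairs, enforcing the ranges described above: VL 0-14, weight 0-255.
    Illustrative helper only; not part of OpenSM itself."""
    entries = []
    for item in table.split(","):
        vl_str, weight_str = item.split(":")
        vl, weight = int(vl_str), int(weight_str)
        if not 0 <= vl <= 14:
            raise ValueError("VL %d out of range 0-14" % vl)
        if not 0 <= weight <= 255:
            raise ValueError("weight %d out of range 0-255" % weight)
        entries.append((vl, weight))
    return entries

# The default CA low-priority table shown above parses into 15 entries,
# one per VL:
default_low = "0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4"
print(len(parse_vlarb(default_low)))  # 15
```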
The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates
the number of high-priority packets that can be transmitted without an
opportunity to send a low-priority packet. Specifically, the number of bytes
that can be sent is high_limit times 4K bytes.
A high_limit value of 255 indicates that the byte limit is unbounded.
Note: if the 255 value is used, the low priority VLs may be starved.
A value of 0 indicates that only a single packet from the high-priority table
may be sent before an opportunity is given to the low-priority table.
Keep in mind that ports usually transmit packets whose size equals the MTU.
For instance, with a 4KB MTU a single packet requires 4096/64 = 64 credits, so
in order to achieve effective VL arbitration for packets of 4KB MTU, the
weighting values for each VL should be multiples of 64.
Below is an example of SL2VL and VL Arbitration configuration on a subnet:
qos_ca_max_vls 15
qos_ca_high_limit 6
qos_ca_vlarb_high 0:4
qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
qos_swe_max_vls 15
qos_swe_high_limit 6
qos_swe_vlarb_high 0:4
qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is
defined as a high-priority VL, and it is limited to 6 x 4KB = 24KB in a single
transmission burst. Such a configuration would suit a VL that needs low
latency and uses a small MTU when transmitting packets. The rest of the VLs
are defined as low-priority VLs with different weights, while VL4 is
effectively turned off (weight 0).
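The relative bandwidth implied by the low-priority weights in this example can
be estimated with a back-of-envelope calculation. The sketch below is
illustrative only: it ignores the high-priority VL0 bursts and assumes all
low-priority VLs are saturated.

```python
# Weights from the example low-priority table "0:0,1:64,2:128,3:192,4:0,
# 5:64,6:64,7:64". Each weight is the number of 64-byte credits the VL may
# send per arbitration turn, so a VL's share of one full low-table round is
# its weight divided by the sum of all weights.
low_weights = {0: 0, 1: 64, 2: 128, 3: 192, 4: 0, 5: 64, 6: 64, 7: 64}
total = sum(low_weights.values())  # 576 credits per full low-table round
for vl, w in sorted(low_weights.items()):
    share = 100.0 * w / total
    print("VL%d: weight %3d -> %.1f%% of low-priority bandwidth" % (vl, w, share))
```

With these weights VL3 receives one third of the low-priority bandwidth
(192/576), VL2 receives two ninths, and VL0 and VL4 receive none from the low
table (VL0 is served from the high-priority table instead).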