OPENSM(8)                      OpenIB Management                     OPENSM(8)

NAME

6       opensm - InfiniBand subnet manager and administration (SM/SA)
7
8

SYNOPSIS

       opensm [--version] [-F | --config <file_name>] [-c(reate-config)
       <file_name>] [-g(uid) <GUID in hex>] [-l(mc) <LMC>] [-p(riority)
       <PRIORITY>] [--subnet_prefix <PREFIX in hex>] [--smkey <SM_Key>]
       [--sm_sl <SL number>] [-r(eassign_lids)] [-R <engine name(s)> |
       --routing_engine <engine name(s)>] [--do_mesh_analysis]
       [--lash_start_vl <vl number>] [-A | --ucast_cache]
       [-z | --connect_roots] [-M <file name> | --lid_matrix_file <file name>]
       [-U <file name> | --lfts_file <file name>] [-S | --sadb_file <file name>]
       [-a | --root_guid_file <path to file>]
       [-u | --cn_guid_file <path to file>]
       [-G | --io_guid_file <path to file>] [--port-shifting] [--scatter-ports]
       [-H | --max_reverse_hops <max reverse hops allowed>]
       [-X | --guid_routing_order_file <path to file>]
       [-m | --ids_guid_file <path to file>] [-o(nce)] [-s(weep) <interval>]
       [-t(imeout) <milliseconds>] [--retries <number>] [--maxsmps <number>]
       [--console [off | local | socket | loopback]] [--console-port <port>]
       [-i(gnore-guids) <equalize-ignore-guids-file>]
       [-w | --hop_weights_file <path to file>]
       [-O | --port_search_ordering_file <path to file>]
       [-O | --dimn_ports_file <path to file>] (DEPRECATED)
       [-f <log file path> | --log_file <log file path>]
       [-L | --log_limit <size in MB>] [-e(rase_log_file)]
       [-P(config) <partition config file>]
       [-N | --no_part_enforce] (DEPRECATED)
       [-Z | --part_enforce [both | in | out | off]] [-W | --allow_both_pkeys]
       [-Q | --qos [-Y | --qos_policy_file <file name>]] [--congestion_control]
       [--cc_key <key>] [-y | --stay_on_fatal] [-B | --daemon]
       [-J | --pidfile <file_name>] [-I | --inactive] [--perfmgr]
       [--perfmgr_sweep_time_s <seconds>] [--prefix_routes_file <path>]
       [--consolidate_ipv6_snm_req] [--log_prefix <prefix text>]
       [--torus_config <path to file>] [-v(erbose)] [-V] [-D <flags>]
       [-d(ebug) <number>] [-h(elp)] [-?]
37
38

DESCRIPTION

40       opensm  is  an  InfiniBand compliant Subnet Manager and Administration,
41       and runs on top of OpenIB.
42
       opensm provides an implementation of an InfiniBand Subnet Manager and
       Administration.  Such a software entity must be running in order to
       initialize the InfiniBand hardware (at least one per InfiniBand
       subnet).

       opensm also contains an experimental version of a performance manager.
50
51       opensm defaults were designed to meet the common case usage on clusters
52       with up to a few hundred nodes. Thus, in this default mode, opensm will
53       scan the IB fabric, initialize it, and sweep occasionally for changes.
54
55       opensm attaches to a specific IB port on the local machine and  config‐
56       ures  only  the fabric connected to it. (If the local machine has other
57       IB ports, opensm will ignore  the  fabrics  connected  to  those  other
58       ports). If no port is specified, it will select the first "best" avail‐
59       able port.
60
61       opensm can present the available ports and prompt for a port number  to
62       attach to.
63
       By default, the run is logged to two files: /var/log/messages and
       /var/log/opensm.log.  The first file records only general major
       events, whereas the second includes details of reported errors.  All
       errors reported in the second file should be treated as indicators of
       IB fabric health issues.  (Note that when a fatal and non-recoverable
       error occurs, opensm will exit.)  Both log files should include the
       message "SUBNET UP" if opensm was able to set up the subnet correctly.
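
       The log check described above can be sketched in Python as follows.
       This is only an illustration: the helper name subnet_is_up is
       hypothetical, and the default path is the one named in this page.

```python
from pathlib import Path


def subnet_is_up(log_path="/var/log/opensm.log"):
    """Return True if the log file contains the "SUBNET UP" message.

    Hypothetical helper: a missing or unreadable log counts as "not up".
    """
    try:
        text = Path(log_path).read_text(errors="replace")
    except OSError:
        return False
    return "SUBNET UP" in text
```

       Because the log file is accumulative by default, a match may come from
       an earlier run; erasing the log at startup (-e) avoids that ambiguity.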
71
72

OPTIONS

74       --version
75              Prints OpenSM version and exits.
76
77       -F, --config <config file>
              The name of the OpenSM config file.  When not specified,
              /etc/rdma/opensm.conf will be used (if it exists).
80
81       -c, --create-config <file name>
              OpenSM will dump its configuration to the specified file and
              exit.  This is a way to generate an OpenSM configuration file
              template.
85
86       -g, --guid <GUID in hex>
87              This option specifies the  local  port  GUID  value  with  which
88              OpenSM  should  bind.   OpenSM may be bound to 1 port at a time.
89              If GUID given is 0, OpenSM displays  a  list  of  possible  port
90              GUIDs and waits for user input.  Without -g, OpenSM tries to use
91              the default port.
92
93       -l, --lmc <LMC value>
94              This option specifies the subnet's LMC  value.   The  number  of
95              LIDs  assigned  to each port is 2^LMC.  The LMC value must be in
96              the range 0-7.  LMC values >  0  allow  multiple  paths  between
97              ports.   LMC values > 0 should only be used if the subnet topol‐
98              ogy actually provides multiple paths between ports, i.e.  multi‐
99              ple interconnects between switches.  Without -l, OpenSM defaults
100              to LMC = 0, which allows one path between any two ports.
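
              The 2^LMC relationship can be illustrated with a few lines of
              Python.  The function name lids_per_port is hypothetical; the
              0-7 range check follows the text above.

```python
def lids_per_port(lmc=0):
    """Number of LIDs assigned to each port for a given LMC (2^LMC)."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 1 << lmc


for lmc in (0, 1, 3, 7):
    print(f"LMC={lmc}: {lids_per_port(lmc)} LID(s) per port")
```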
101
102       -p, --priority <Priority value>
              This option specifies the SM's PRIORITY.  This will affect the
              handover cases, where the master is chosen by priority and GUID.
              The range goes from 0 (default and lowest priority) to 15
              (highest).
106
107       --subnet_prefix <PREFIX in hex>
108              This option specifies the subnet prefix to use  on  the  fabric.
109              The default prefix is 0xfe80000000000000.  OpenMPI in particular
110              requires that separate fabrics plugged into different ports on a
111              machine must have different subnet prefixes in order to identify
112              that it is not two ports plugged into a single fabric.
113
114       --smkey <SM_Key value>
              This option specifies the SM's SM_Key (64 bits).  This will
              affect SM authentication.  Note that OpenSM version 3.2.1 and
              below used the default value '1' in host byte order.  This is
              fixed now, but you may need this option to interoperate with an
              old OpenSM running on a little endian machine.
120
121       --sm_sl <SL number>
122              This option sets the SL to use for communication with the SM/SA.
123              Defaults to 0.
124
125       -r, --reassign_lids
126              This  option  causes  OpenSM  to reassign LIDs to all end nodes.
127              Specifying -r on a running subnet may  disrupt  subnet  traffic.
              Without -r, OpenSM attempts to preserve existing LID assignments,
              resolving multiple uses of the same LID.
130
131       -R, --routing_engine <Routing engine names>
132              This option chooses routing engine(s) to use instead of Min  Hop
133              algorithm  (default).  Multiple routing engines can be specified
134              separated by commas so that specific ordering of  routing  algo‐
135              rithms  will  be  tried if earlier routing engines fail.  If all
136              configured routing engines fail, OpenSM will always  attempt  to
137              route  with Min Hop unless 'no_fallback' is included in the list
138              of routing engines.   Supported  engines:  minhop,  updn,  dnup,
139              file, ftree, lash, dor, torus-2QoS, dfsssp, sssp.
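
              As a sketch of how the comma-separated engine list might be
              assembled, the Python helper below validates names against the
              supported engines listed above.  The function name is
              hypothetical, and placing no_fallback at the end of the list is
              an assumption; this page only says it must be included in the
              list.

```python
SUPPORTED_ENGINES = ("minhop", "updn", "dnup", "file", "ftree", "lash",
                     "dor", "torus-2QoS", "dfsssp", "sssp")


def routing_engine_arg(engines, fallback_to_minhop=True):
    """Build the comma-separated value for -R / --routing_engine."""
    unknown = [e for e in engines if e not in SUPPORTED_ENGINES]
    if unknown:
        raise ValueError(f"unsupported routing engine(s): {unknown}")
    names = list(engines)
    if not fallback_to_minhop:
        # Assumption: the position of no_fallback in the list does not matter.
        names.append("no_fallback")
    return ",".join(names)


print(routing_engine_arg(["ftree", "updn"]))                    # ftree,updn
print(routing_engine_arg(["ftree"], fallback_to_minhop=False))  # ftree,no_fallback
```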
140
141       --do_mesh_analysis
142              This  option  enables  additional  analysis for the lash routing
143              engine to precondition switch port assignments in regular carte‐
144              sian  meshes which may reduce the number of SLs required to give
145              a deadlock free routing.
146
147       --lash_start_vl <vl number>
148              This option sets the starting VL to use  for  the  lash  routing
149              algorithm.  Defaults to 0.
150
151       -A, --ucast_cache
              This option enables the unicast routing cache and prevents
              routing recalculation (which is a heavy task in a large cluster)
              when no topology change was detected during the heavy sweep, or
              when the topology change does not require a new routing
              calculation, e.g. when one or more CAs/RTRs/leaf switches go
              down, or one or more of these nodes come back after being down.
              A very common case that is handled by the unicast routing cache
              is host reboot, which otherwise would cause two full routing
              recalculations: one when the host goes down, and the other when
              the host comes back online.
162
163       -z, --connect_roots
              This option forces the routing engines (up/down and fat-tree) to
              provide connectivity between root switches and in this way to be
              fully IBA compliant.  In many cases this can violate the "pure"
              deadlock-free algorithm, so use it carefully.
168
169       -M, --lid_matrix_file <file name>
              This option specifies the name of the lid matrix dump file from
              which switch lid matrices (min hops tables) will be loaded.
172
173       -U, --lfts_file <file name>
174              This option specifies the name  of  the  LFTs  file  from  where
175              switch  forwarding tables will be loaded when using "file" rout‐
176              ing engine.
177
178       -S, --sadb_file <file name>
179              This option specifies the name of the SA DB dump file from where
180              SA database will be loaded.
181
182       -a, --root_guid_file <file name>
183              Set the root nodes for the Up/Down or Fat-Tree routing algorithm
184              to the guids provided in the given file (one to a line).
185
186       -u, --cn_guid_file <file name>
187              Set the compute nodes for the Fat-Tree  or  DFSSSP/SSSP  routing
188              algorithms  to the port GUIDs provided in the given file (one to
189              a line).
190
191       -G, --io_guid_file <file name>
192              Set the I/O nodes for the Fat-Tree or DFSSSP/SSSP routing  algo‐
193              rithms  to  the  port GUIDs provided in the given file (one to a
194              line).
195              In the case of Fat-Tree routing:
196              I/O nodes are non-CN nodes allowed to use up to max_reverse_hops
197              switches the wrong way around to improve connectivity.
198              In the case of (DF)SSSP routing:
199              Providing  guids  of  compute  and/or I/O nodes will ensure that
200              paths towards those nodes are  as  much  separated  as  possible
201              within their node category, i.e., I/O traffic will not share the
202              same link if multiple links are available.
203
204       --port-shifting
205              This option enables a feature called  port  shifting.   In  some
206              fabrics,  particularly  cluster  environments,  routes  commonly
207              align and congest  with  other  routes  due  to  algorithmically
208              unchanging  traffic  patterns.  This routing option will "shift"
209              routing around in an attempt to alleviate this problem.
210
211       --scatter-ports
              This option will randomize port selection in routing.
213
214       -H, --max_reverse_hops <max reverse hops allowed>
215              Set the maximum number of reverse hops an I/O node is allowed to
216              make. A reverse hop is the use of a switch the wrong way around.
217
218       -m, --ids_guid_file <file name>
219              Name  of  the map file with set of the IDs which will be used by
220              Up/Down routing algorithm instead of node GUIDs (format:  <guid>
221              <id> per line).
222
223       -X, --guid_routing_order_file <file name>
224              Set  the  order  port  guids  will  be routed for the MinHop and
225              Up/Down routing algorithms to the guids provided  in  the  given
226              file (one to a line).
227
228       -o, --once
229              This  option  causes  OpenSM  to configure the subnet once, then
230              exit.  Ports remain in the ACTIVE state.
231
232       -s, --sweep <interval value>
233              This option specifies  the  number  of  seconds  between  subnet
234              sweeps.   Specifying -s 0 disables sweeping.  Without -s, OpenSM
235              defaults to a sweep interval of 10 seconds.
236
237       -t, --timeout <value>
238              This option specifies the time in milliseconds used for transac‐
239              tion  timeouts.   Timeout  values  should  be  > 0.  Without -t,
240              OpenSM defaults to a timeout value of 200 milliseconds.
241
242       --retries <number>
243              This option specifies the number of retries  used  for  transac‐
244              tions.   Without  --retries,  OpenSM  defaults  to 3 retries for
245              transactions.
246
247       --maxsmps <number>
248              This option specifies the number of VL15 SMP MADs allowed on the
249              wire  at  any one time.  Specifying --maxsmps 0 allows unlimited
250              outstanding SMPs.  Without --maxsmps, OpenSM defaults to a maxi‐
251              mum of 4 outstanding SMPs.
252
253       --console [off | local | loopback | socket]
254              This  option  brings up the OpenSM console (default off).  Note,
255              loopback and socket open a socket  which  can  be  connected  to
256              WITHOUT  CREDENTIALS.   Loopback  is  safer if access to your SM
257              host is controlled.  tcp_wrappers (hosts.[allow|deny])  is  used
258              with  loopback  and  socket.   loopback  and socket will only be
259              available if OpenSM  was  built  with  --enable-console-loopback
260              (default  yes)  and --enable-console-socket (default no) respec‐
261              tively.
262
263       --console-port <port>
264              Specify an alternate telnet port for the socket console (default
265              10000).   Note that this option only appears if OpenSM was built
266              with --enable-console-socket.
267
268       -i, --ignore-guids <equalize-ignore-guids-file>
269              This option provides the means to define a set of ports (by node
270              guid  and  port  number)  that  will be ignored by the link load
271              equalization algorithm.
272
273       -w, --hop_weights_file <path to file>
274              This option provides weighting factors per port  representing  a
275              hop  cost  in  computing  the  lid matrix.  The file consists of
276              lines containing a switch port GUID (specified as a 64  bit  hex
277              number, with leading 0x), output port number, and weighting fac‐
278              tor.  Any port not listed in the file defaults  to  a  weighting
279              factor  of  1.   Lines  starting  with  # are comments.  Weights
280              affect only the output route from the port, so many useful  con‐
281              figurations will require weights to be specified in pairs.
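
              The file format described above can be illustrated with a small
              Python parser.  This is a sketch under assumptions: fields are
              taken to be whitespace-separated, port numbers and weights are
              taken to be decimal, and the GUIDs in the sample are made up.

```python
def parse_hop_weights(text):
    """Parse lines of the form: <switch port GUID> <output port> <weight>.

    Returns {(guid, port): weight}.  Lines starting with '#' are comments.
    Assumes whitespace-separated fields with decimal port and weight.
    """
    weights = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        guid_s, port_s, weight_s = line.split()
        weights[(int(guid_s, 16), int(port_s))] = int(weight_s)
    return weights


def hop_weight(weights, guid, port):
    """Ports not listed in the file default to a weighting factor of 1."""
    return weights.get((guid, port), 1)


# Hypothetical GUIDs: both ends of one link are weighted, since a weight
# affects only the output route from the listed port.
sample = """\
# GUID               port  weight
0x0002c90000000001   1     4
0x0002c90000000002   3     4
"""
w = parse_hop_weights(sample)
print(hop_weight(w, 0x0002c90000000001, 1))  # 4
print(hop_weight(w, 0x0002c90000000001, 2))  # 1
```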
282
283       -O, --port_search_ordering_file <path to file>
              This option tweaks the routing.  It is suitable for two cases:

              1. While using the DOR routing algorithm.  This option provides
              a mapping between hypercube dimensions and ports on a per switch
              basis for the DOR routing engine.

              2. While using a general routing algorithm.  This option
              provides the order of the ports that would be chosen for routing
              from each switch, rather than searching for an appropriate port
              from port 1 to N.

              In both cases, the file consists of lines containing a switch
              node GUID (specified as a 64 bit hex number, with leading 0x)
              followed by a list of non-zero port numbers, separated by
              spaces, one switch per line.  In the case of DOR, the order of
              the port numbers is in one to one correspondence to the
              dimensions, and ports not listed on a line are assigned to the
              remaining dimensions, in port order.  Anything after a # is a
              comment.
304
305       -O, --dimn_ports_file <path to file> (DEPRECATED)
306              This is  a  deprecated  flag.  Please  use  --port_search_order‐
307              ing_file instead.  This option provides a mapping between hyper‐
308              cube dimensions and ports on a per  switch  basis  for  the  DOR
309              routing  engine.  The file consists of lines containing a switch
310              node GUID (specified as a 64 bit hex number,  with  leading  0x)
311              followed  by  a list of non-zero port numbers, separated by spa‐
312              ces, one switch per line.  The order for the port numbers is  in
313              one  to  one correspondence to the dimensions.  Ports not listed
314              on a line are assigned to  the  remaining  dimensions,  in  port
315              order.  Anything after a # is a comment.
316
317       -x, --honor_guid2lid
318              This  option  forces  OpenSM to honor the guid2lid file, when it
319              comes  out  of  Standby  state,  if  such  file   exists   under
320              OSM_CACHE_DIR, and is valid.  By default, this is FALSE.
321
322       -f, --log_file <file name>
323              This  option  defines the log to be the given file.  By default,
324              the log goes to /var/log/opensm.log.  For the log to go to stan‐
325              dard output use -f stdout.
326
327       -L, --log_limit <size in MB>
              This option defines the maximal log file size in MB.  When
              specified, the log file will be truncated upon reaching this
              limit.
330
331       -e, --erase_log_file
              This option will cause deletion of the log file (if it already
              exists).  By default, the log file is accumulative.
334
335       -P, --Pconfig <partition config file>
336              This  option  defines the optional partition configuration file.
337              The default name is /etc/rdma/partitions.conf.
338
339       --prefix_routes_file <file name>
340              Prefix routes control how the SA responds to path record queries
341              for  off-subnet  DGIDs.   By default, the SA fails such queries.
342              The PREFIX ROUTES section below describes the format of the con‐
343              figuration       file.        The      default      path      is
344              /etc/rdma/prefix-routes.conf.
345
346       -Q, --qos
347              This option enables QoS setup. It is disabled by default.
348
349       -Y, --qos_policy_file <file name>
350              This option defines the optional QoS policy  file.  The  default
351              name     is     /etc/rdma/qos-policy.conf.    See    QoS_manage‐
352              ment_in_OpenSM.txt in opensm doc for more information on config‐
353              uring QoS policy via this file.
354
       --congestion_control
              (EXPERIMENTAL) This option enables congestion control
              configuration.  It is disabled by default.  See the config file
              for congestion control configuration options.

       --cc_key <key>
              (EXPERIMENTAL) This option configures the CCkey to use when
              configuring congestion control.  Note that this option does not
              configure a new CCkey into switches and CAs.  Defaults to 0.
362
363       -N, --no_part_enforce (DEPRECATED)
364              This is a deprecated flag. Please  use  --part_enforce  instead.
365              This  option  disables  partition enforcement on switch external
366              ports.
367
368       -Z, --part_enforce [both | in | out | off]
369              This  option  indicates  the  partition  enforcement  type  (for
370              switches).   Enforcement type can be inbound only (in), outbound
371              only (out), both or disabled (off). Default is both.
372
373       -W, --allow_both_pkeys
374              This option indicates whether both full and  limited  membership
375              on  the  same  partition  can  be  configured  in the PKeyTable.
376              Default is not to allow both pkeys.
377
378       -y, --stay_on_fatal
              This option will cause the SM not to exit on fatal
              initialization issues, namely when the SM discovers duplicated
              GUIDs or a 12x link with badly configured lane reversal.  By
              default, the SM will exit on these errors.
383
384       -B, --daemon
385              Run in daemon mode - OpenSM will run in the background.
386
387       -J, --pidfile <file_name>
388              Makes  the  SM  write  its  own  PID  to the specified file when
389              started in daemon mode.
390
391       -I, --inactive
392              Start SM in inactive rather than init SM state.  This option can
393              be  used  in  conjunction with the perfmgr so as to run a stand‐
394              alone performance manager without SM/SA.  However, this  is  NOT
395              currently implemented in the performance manager.
396
397       --perfmgr
398              Enable  the  perfmgr.  Only takes effect if --enable-perfmgr was
399              specified at configure time.  See  performance-manager-HOWTO.txt
400              in opensm doc for more information on running perfmgr.
401
402       --perfmgr_sweep_time_s <seconds>
403              Specify  the  sweep  time for the performance manager in seconds
404              (default is 180 seconds).  Only takes effect if --enable-perfmgr
405              was specified at configure time.
406
407       --consolidate_ipv6_snm_req
408              Use  shared  MLID  for  IPv6 Solicited Node Multicast groups per
409              MGID scope and P_Key.
410
411       --log_prefix <prefix text>
412              This option specifies the prefix to  the  syslog  messages  from
413              OpenSM.  A suitable prefix can be used to identify the IB subnet
414              in syslog messages when two or more instances of OpenSM run in a
415              single  node to manage multiple fabrics. For example, in a dual-
416              fabric (or dual-rail) IB cluster, the prefix for the first  fab‐
417              ric could be "mpi" and the other fabric could be "storage".
418
419       --torus_config <path to torus-2QoS config file>
              This option defines the file name for the extra configuration
              information needed for the torus-2QoS routing engine.  The
              default name is /etc/rdma/torus-2QoS.conf.
423
424       -v, --verbose
425              This  option  increases  the log verbosity level.  The -v option
426              may be specified multiple times to  further  increase  the  ver‐
427              bosity  level.  See the -D option for more information about log
428              verbosity.
429
430       -V     This option sets the maximum  verbosity  level  and  forces  log
431              flushing.   The  -V option is equivalent to ´-D 0xFF -d 2´.  See
432              the -D option for more information about log verbosity.
433
434       -D <value>
435              This option sets the log verbosity level.  A  flags  field  must
436              follow the -D option.  A bit set/clear in the flags enables/dis‐
437              ables a specific log level as follows:
438
439               BIT    LOG LEVEL ENABLED
440               ----   -----------------
441               0x01 - ERROR (error messages)
442               0x02 - INFO (basic messages, low volume)
443               0x04 - VERBOSE (interesting stuff, moderate volume)
444               0x08 - DEBUG (diagnostic, high volume)
445               0x10 - FUNCS (function entry/exit, very high volume)
446               0x20 - FRAMES (dumps all SMP and GMP frames)
447               0x40 - ROUTING (dump FDB routing information)
               0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM
                      logging)
450
451              Without  -D,  OpenSM defaults to ERROR + INFO (0x3).  Specifying
452              -D 0 disables all messages.  Specifying -D 0xFF enables all mes‐
453              sages  (see  -V).   High verbosity levels may require increasing
454              the transaction timeout with the -t option.
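
              The flags field is a plain bitmask, so it can be composed and
              decoded mechanically.  The Python sketch below uses the bit
              values from the table above; the function names are
              hypothetical.

```python
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
    0x80: "SYS",
}


def decode_log_flags(flags):
    """List the log levels enabled by a -D flags value."""
    return [name for bit, name in LOG_LEVELS.items() if flags & bit]


def encode_log_flags(names):
    """Combine log level names into a -D flags value."""
    by_name = {name: bit for bit, name in LOG_LEVELS.items()}
    flags = 0
    for name in names:
        flags |= by_name[name]
    return flags


print(decode_log_flags(0x03))                               # ['ERROR', 'INFO']
print(hex(encode_log_flags(["ERROR", "INFO", "VERBOSE"])))  # 0x7
```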
455
456       -d, --debug <value>
457              This option specifies a debug option.   These  options  are  not
458              normally  needed.   The  number  following  -d selects the debug
459              option to enable as follows:
460
461               OPT   Description
462               ---    -----------------
463               -d0  - Ignore other SM nodes
464               -d1  - Force single threaded dispatching
465               -d2  - Force log flushing after each log message
466               -d3  - Disable multicast support
467
468       -h, --help
469              Display this usage info then exit.
470
471       -?     Display this usage info then exit.
472
473

ENVIRONMENT VARIABLES

475       The following environment variables control opensm behavior:
476
477       OSM_TMP_DIR - controls the directory in which the temporary files  gen‐
478       erated  by  opensm  are  created.  These  files are: opensm-subnet.lst,
479       opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.
480
481       OSM_CACHE_DIR - opensm stores certain data to the disk such that subse‐
482       quent   runs   are   consistent.   The   default   directory   used  is
483       /var/cache/opensm.  The following files are included in it:
484
        guid2lid  - stores the LID range assigned to each GUID
        guid2mkey - stores the MKey previously assigned to each GUID
        neighbors - stores a map of the GUIDs at either end of each link
                    in the fabric
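
       As an illustration of how these two variables select file locations,
       the Python sketch below resolves the paths named above from a given
       environment.  The helper name opensm_paths is hypothetical; the
       defaults and file names are the ones listed in this section.

```python
import os
import posixpath

TMP_FILES = ("opensm-subnet.lst", "opensm.fdbs", "opensm.mcfdbs")
CACHE_FILES = ("guid2lid", "guid2mkey", "neighbors")


def opensm_paths(env=None):
    """Resolve temporary and cache file paths from the environment."""
    env = os.environ if env is None else env
    tmp_dir = env.get("OSM_TMP_DIR", "/var/log")
    cache_dir = env.get("OSM_CACHE_DIR", "/var/cache/opensm")
    return {
        "tmp": [posixpath.join(tmp_dir, name) for name in TMP_FILES],
        "cache": [posixpath.join(cache_dir, name) for name in CACHE_FILES],
    }


for kind, paths in opensm_paths().items():
    for path in paths:
        print(kind, path)
```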
489
490

NOTES

492       When opensm receives a HUP signal, it starts a new heavy sweep as if  a
493       trap was received or a topology change was found.
494
495       Also,  SIGUSR1  can  be used to trigger a reopen of /var/log/opensm.log
496       for logrotate purposes.
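
       A minimal Python sketch of sending these signals is given below.  It
       assumes opensm was started with -B and -J <file_name>, and that the
       PID file holds a single decimal PID; both the file layout and the
       helper names are assumptions made for illustration.

```python
import os
import signal


def read_pid(pidfile):
    """Read a PID from a file assumed to hold one decimal number."""
    with open(pidfile) as f:
        return int(f.read().strip())


def request_heavy_sweep(pidfile):
    """Send HUP so that opensm starts a new heavy sweep."""
    os.kill(read_pid(pidfile), signal.SIGHUP)


def request_log_reopen(pidfile):
    """Send SIGUSR1 so that opensm reopens its log file (for logrotate)."""
    os.kill(read_pid(pidfile), signal.SIGUSR1)
```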
497
498

PARTITION CONFIGURATION

       The default name of the OpenSM partitions configuration file is
       /etc/rdma/partitions.conf.  The default may be changed by using the
       --Pconfig (-P) option with OpenSM.

       The default partition will be created by OpenSM unconditionally, even
       when the partition configuration file does not exist or cannot be
       accessed.

       The default partition has P_Key value 0x7fff.  OpenSM's port will
       always have full membership in the default partition.  All other end
       ports will have full membership if the partition configuration file is
       not found or cannot be accessed, or limited membership if the file
       exists and can be accessed but there is no rule for the Default
       partition.

       Effectively, this amounts to the same as if one of the following rules
       appeared in the partition configuration file.
515
516       In the case of no rule for the Default partition:
517
518       Default=0x7fff : ALL=limited, SELF=full ;
519
       In the case of no partition configuration file, or if the file cannot
       be accessed:
522
523       Default=0x7fff : ALL=full ;
524
525
526       File Format
527
528       Comments:
529
       Any text following a '#' character is a comment and is ignored by the
       parser.
532
533       General file format:
534
535       <Partition Definition>:[<newline>]<Partition Properties>;
536
            Partition Definition:
              [PartitionName][=PKey][,ipoib_bc_flags][,defmember=full|limited|both]

               PartitionName  - string, will be used with logging. When
                                omitted, an empty string will be used.
               PKey           - P_Key value for this partition. Only the low
                                15 bits will be used. When omitted, it will be
                                autogenerated.
               ipoib_bc_flags - used to indicate/specify IPoIB capability of
                                this partition.
               defmember=full|limited|both
                              - specifies default membership for the port guid
                                list. Default is limited.

            ipoib_bc_flags:
               ipoib_flag|[mgroup_flag]*

               ipoib_flag - indicates that this partition may be used for
                            IPoIB; as a result, the IPoIB broadcast group will
                            be created with the flags given, if any.
561
562            Partition Properties:
563              [<Port list>|<MCast Group>]* | <Port list>
564
565            Port list:
566               <Port Specifier>[,<Port Specifier>]
567
568            Port Specifier:
569               <PortGUID>[=[full|limited|both]]
570
               PortGUID         - GUID of partition member EndPort.
                                  Hexadecimal numbers should start from 0x,
                                  decimal numbers are accepted too.

               full, limited,   - indicates full and/or limited membership for
               both               this port.  When omitted (or unrecognized),
                                  limited membership is assumed.  Both
                                  indicates both full and limited membership
                                  for this port.
584
585            MCast Group:
586               mgid=gid[,mgroup_flag]*<newline>
587
                                - The gid specified is verified to be a
                                  Multicast address.  IP groups are verified
                                  to match the rate and mtu of the broadcast
                                  group.  The P_Key bits of the mgid for IP
                                  groups are verified to either match the
                                  P_Key specified in the "Partition
                                  Definition" or, if they are 0x0000, the
                                  P_Key will be copied into those bits.
599
600            mgroup_flag:
               rate=<val>      - specifies rate for this MC group
                                 (default is 3 (10 Gb/s))
               mtu=<val>       - specifies MTU for this MC group
                                 (default is 4 (2048))
               sl=<val>        - specifies SL for this MC group
                                 (default is 0)
               scope=<val>     - specifies scope for this MC group
                                 (default is 2 (link local)).  Multiple scope
                                 settings are permitted for a partition.
                                 NOTE: This overwrites the scope nibble of the
                                 specified mgid.  Furthermore, specifying
                                 multiple scope settings will result in
                                 multiple MC groups being created.
               Q_Key=<val>     - specifies the Q_Key for this MC group
                                 (default: 0x0b1b for IP groups, 0 for other
                                 groups)
               TClass=<val>    - specifies tclass for this MC group
                                 (default is 0)
               FlowLabel=<val> - specifies FlowLabel for this MC group
                                 (default is 0)
624
            newline: '\n'
626
627
628       Note that values for rate, mtu, and scope, for both partitions and mul‐
629       ticast groups, should be specified as defined in the IBTA specification
630       (for example, mtu=4 for 2048).
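
       As an illustration of the encoded values mentioned above, the following
       Python sketch maps byte and Gbps values to their IBTA encodings.  The
       tables list only the commonly used encodings, and the helper names
       encode_mtu and encode_rate are hypothetical; they are not part of OpenSM.

```python
# Encoded values as defined by the IBTA specification for MTU and rate.
# Only the commonly used encodings are listed; this is a sketch, not a
# complete table.
IB_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}
IB_RATE_GBPS = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120}


def _encode(table, value, what):
    """Reverse lookup: find the encoding whose decoded value equals `value`."""
    for code, decoded in table.items():
        if decoded == value:
            return code
    raise ValueError(f"no IBTA encoding for {what} {value}")


def encode_mtu(mtu_bytes):
    """Return the value to use in mtu=<val> for an MTU given in bytes."""
    return _encode(IB_MTU_BYTES, mtu_bytes, "MTU")


def encode_rate(rate_gbps):
    """Return the value to use in rate=<val> for a rate given in Gbps."""
    return _encode(IB_RATE_GBPS, rate_gbps, "rate")
```

       For example, encode_mtu(2048) gives 4 and encode_rate(10) gives 3, which
       match the defaults listed for mtu and rate above.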
631
632       There are several useful keywords for PortGUID definition:
633
634        - 'ALL' means all end ports in this subnet.
635        - 'ALL_CAS' means all Channel Adapter end ports in this subnet.
636        - 'ALL_SWITCHES' means all Switch end ports in this subnet.
637        - 'ALL_ROUTERS' means all Router end ports in this subnet.
638        - 'SELF' means subnet manager's port.
639
640       Empty list means no ports in this partition.
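
       The port list rules above can be sketched in Python as follows.  This is
       a minimal illustration, not the OpenSM parser: the function names are
       hypothetical, and the "default" argument stands in for the membership
       that applies when none is given (limited, unless changed elsewhere in
       the partition definition).

```python
KEYWORDS = {"ALL", "ALL_CAS", "ALL_SWITCHES", "ALL_ROUTERS", "SELF"}
MEMBERSHIPS = {"full", "limited", "both"}


def parse_port_specifier(spec, default="limited"):
    """Parse '<PortGUID>[=[full|limited|both]]' into (port, membership)."""
    port, _, membership = (part.strip() for part in spec.partition("="))
    if membership not in MEMBERSHIPS:
        # Omitted or unrecognized membership falls back to the default.
        membership = default
    if port not in KEYWORDS:
        # Base 0 accepts '0x...' as hexadecimal and plain digits as decimal
        # (decimal values with leading zeros are rejected by Python).
        port = int(port, 0)
    return port, membership


def parse_port_list(text, default="limited"):
    """Parse a comma-separated port list; an empty list yields no ports."""
    return [parse_port_specifier(s, default) for s in text.split(",") if s.strip()]
```

       With this sketch, parse_port_list("0x123456=full, 0x3456789034=limi")
       returns [(0x123456, "full"), (0x3456789034, "limited")], because the
       unrecognized membership "limi" is treated as limited.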
641
642       Notes:
643
644       White space is permitted between delimiters ('=', ',',':',';').
645
646       PartitionName does not need to be unique, PKey does need to be  unique.
647       If  PKey is repeated then those partition configurations will be merged
648       and first PartitionName will be used (see also next note).
649
650       It is possible to split partition configuration in more than one  defi‐
651       nition, but then PKey should be explicitly specified (otherwise differ‐
652       ent PKey values will be generated for those definitions).
653
654       Examples:
655
656        Default=0x7fff : ALL, SELF=full ;
657        Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ;
658
        NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
661
662        YetAnotherOne = 0x300 : SELF=full ;
663        YetAnotherOne = 0x300 : ALL=limited ;
664
665        ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
666        # 0x123453, 0x123454 will be limited
667        ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
668        # 0x123456, 0x123457 will be limited
        ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full;
671        ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
        ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
674
675        # multicast groups added to default
676        Default=0x7fff,ipoib:
677               mgid=ff12:401b::0707,sl=1 # random IPv4 group
678               mgid=ff12:601b::16    # MLDv2-capable routers
679               mgid=ff12:401b::16    # IGMP
680               mgid=ff12:601b::2     # All routers
681               mgid=ff12::1,sl=1,Q_Key=0xDEADBEEF,rate=3,mtu=2 # random group
682               ALL=full;
683
684
685       Note:
686
687       The following rule is equivalent to how OpenSM used to run prior to the
688       partition manager:
689
690        Default=0x7fff,ipoib:ALL=full;
691
692

QOS CONFIGURATION

       There is a set of QoS-related low-level configuration parameters.  All
       these parameter names are prefixed by the "qos_" string.  Here is a full
       list of these parameters:
697
698        qos_max_vls    - The maximum number of VLs that will be on the subnet
699        qos_high_limit - The limit of High Priority component of VL
700                         Arbitration table (IBA 7.6.9)
701        qos_vlarb_low  - Low priority VL Arbitration table (IBA 7.6.9)
702                         template
703        qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
704                         template
705                         Both VL arbitration templates are pairs of
706                         VL and weight
707        qos_sl2vl      - SL2VL Mapping table (IBA 7.6.6) template. It is
708                         a list of VLs corresponding to SLs 0-15 (Note
709                         that VL15 used here means drop this SL)
710
711       Typical default values (hard-coded in OpenSM initialization) are:
712
        qos_max_vls 15
        qos_high_limit 0
        qos_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
        qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
        qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
720
       The syntax is compatible with the rest of the OpenSM configuration
       options, and values may be stored in the OpenSM config file (cached
       options file).
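
       To make the template formats concrete, the following Python sketch
       parses the VL arbitration and SL2VL values shown above.  The function
       names are hypothetical and the code only illustrates the documented
       formats; it is not how OpenSM itself reads its configuration.

```python
def parse_vlarb(text):
    """Parse a VL arbitration template 'vl:weight,...' into (vl, weight) pairs."""
    pairs = []
    for entry in text.split(","):
        vl, weight = entry.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


def parse_sl2vl(text):
    """Parse an SL2VL template into a list of VLs indexed by SL."""
    return [int(vl) for vl in text.split(",") if vl.strip()]


def dropped_sls(sl2vl):
    """Return the SLs mapped to VL15, which means 'drop this SL'."""
    return [sl for sl, vl in enumerate(sl2vl) if vl == 15]
```

       Applied to the default qos_sl2vl value, parse_sl2vl returns 16 entries
       (one per SL) and dropped_sls returns an empty list, since no SL is
       mapped to VL15 by default.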
723
       In addition to the above, separate sets of QoS configuration parameters
       may be defined for various target types.  The currently supported
       targets are CAs, routers, switch external ports, and switches' enhanced
       port 0.  The names of such specialized parameters are prefixed by the
       "qos_<type>_" string.  Here is a full list of the currently supported
       sets:
729
730        qos_ca_  - QoS configuration parameters set for CAs.
731        qos_rtr_ - parameters set for routers.
732        qos_sw0_ - parameters set for switches' port 0.
733        qos_swe_ - parameters set for switches' external ports.
734
735       Examples:
736        qos_sw0_max_vls=2
737        qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
738        qos_swe_high_limit=0
739
740

PREFIX ROUTES

742       Prefix  routes  control  how the SA responds to path record queries for
743       off-subnet DGIDs.  By default, the SA fails such  queries.   Note  that
744       IBA  does  not  specify how the SA should obtain off-subnet path record
745       information.  The prefix routes configuration is meant  as  a  stop-gap
746       until the specification is completed.
747
748       Each  line  in  the configuration file is a 64-bit prefix followed by a
749       64-bit GUID, separated by white space.  The GUID specifies  the  router
750       port  on the local subnet that will handle the prefix.  Blank lines are
751       ignored, as is anything between a # character and the end of the  line.
752       The  prefix  and  GUID  are  both  in  hex, the leading 0x is optional.
753       Either, or both, can be wild-carded by specifying an  asterisk  instead
754       of an explicit prefix or GUID.
755
       When responding to a path record query for an off-subnet DGID, opensm
       searches for the first prefix match in the configuration file.
       Therefore, the order of the lines in the configuration file is
       important: a wild-carded prefix at the beginning of the configuration
       file renders all subsequent lines useless.  If there is no match, then
       opensm fails the query.  It is legal to repeat prefixes in the
       configuration file; opensm will return the path to the first available
       matching router.  A configuration file with a single line where both
       prefix and GUID are wild-carded means that a path record query
       specifying any off-subnet DGID should return a path to the first
       available router.  This configuration yields the same behavior formerly
       achieved by compiling opensm with -DROUTER_EXP, which is now obsolete.
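
       The file format and first-match lookup described above can be sketched
       in Python as follows.  This is an illustration under stated assumptions,
       not the OpenSM implementation: the function names are hypothetical, and
       router availability is modeled as an ordered list of GUIDs supplied by
       the caller.

```python
def _parse_field(token):
    """'*' is a wild card (None); otherwise hex, with an optional leading 0x."""
    return None if token == "*" else int(token, 16)


def parse_prefix_routes(text):
    """Return a list of (prefix, guid) tuples in file order."""
    routes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        prefix, guid = line.split()
        routes.append((_parse_field(prefix), _parse_field(guid)))
    return routes


def select_router(routes, dgid_prefix, available_guids):
    """Return the first available router GUID matching the prefix, or None."""
    for prefix, guid in routes:
        if prefix is not None and prefix != dgid_prefix:
            continue
        if guid is None:
            # Wild-carded GUID: assume any available router will do.
            if available_guids:
                return available_guids[0]
        elif guid in available_guids:
            return guid
    return None  # no match: the query fails
```

       A return value of None corresponds to the SA failing the query.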
768
769

MKEY CONFIGURATION

771       OpenSM supports configuring a single  management  key  (MKey)  for  use
772       across the subnet.
773
774       The following configuration options are available:
775
776        m_key                  - the 64-bit MKey to be used on the subnet
777                                 (IBA 14.2.4)
778        m_key_protection_level - the numeric value of the MKey ProtectBits
779                                 (IBA 14.2.4.1)
780        m_key_lease_period     - the number of seconds a CA will wait for a
781                                 response from the SM before resetting the
782                                 protection level to 0 (IBA 14.2.4.2).
783
784       OpenSM  will  configure  all  ports  with  the MKey specified by m_key,
785       defaulting to a value of 0. A m_key value of 0 disables MKey protection
786       on  the subnet.  Switches and HCAs with a non-zero MKey will not accept
787       requests to change their configuration unless the request includes  the
788       proper MKey.
789
790       MKey Protection Levels
791
792       MKey  protection  levels  modify  how  switches and CAs respond to SMPs
793       lacking a valid MKey.  OpenSM will configure each port's ProtectBits to
794       support  the level defined by the m_key_protection_level parameter.  If
795       no parameter is specified, OpenSM defaults to operating  at  protection
796       level 0.
797
798       There are currently 4 protection levels defined by the IBA:
799
800        0 - Queries return valid data, including MKey.  Configuration changes
801            are not allowed unless the request contains a valid MKey.
802        1 - Like level 0, but the MKey is set to 0 (0x00000000) in queries,
803            unless the request contains a valid MKey.
804        2 - Neither queries nor configuration changes are allowed, unless the
805            request contains a valid MKey.
806        3 - Identical to 2.  Maintained for backwards compatibility.
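
       The four levels can be summarized by the following Python sketch of how
       a port might answer an SMP.  It is a simplified model of the rules
       listed above, not code from OpenSM or from any device firmware; the
       function name and the "get"/"set" method strings are hypothetical.

```python
def handle_smp(method, level, port_mkey, request_mkey):
    """Return (allowed, reported_mkey) for a 'get' (query) or 'set' SMP.

    reported_mkey is None when the request is not answered.
    """
    # An MKey of 0 on the port means MKey protection is disabled.
    valid = port_mkey == 0 or request_mkey == port_mkey
    if valid:
        return True, port_mkey
    if method == "set":
        # Configuration changes always require a valid MKey.
        return False, None
    if level == 0:
        return True, port_mkey  # queries return data, including the MKey
    if level == 1:
        return True, 0          # queries return data, MKey reported as 0
    return False, None          # levels 2 and 3: queries are refused too
```

       For a port with a non-zero MKey and a request carrying the wrong key, a
       query succeeds at levels 0 and 1 and is refused at levels 2 and 3, while
       a configuration change is refused at every level.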
807
808       MKey Lease Period
809
810       InfiniBand  supports  a  MKey lease timeout, which is intended to allow
811       administrators or a new SM to recover/reset lost MKeys on a fabric.
812
813       If MKeys are enabled on the subnet  and  a  switch  or  CA  receives  a
814       request  that  requires a valid MKey but does not contain one, it warns
815       the SM by sending a trap (Bad M_Key, Trap  256).   If  the  MKey  lease
816       period is non-zero, it also starts a countdown timer for the time spec‐
817       ified by the lease period.  If a SM (or other agent) responds with  the
818       correct  MKey,  the timer is stopped and reset.  Should the timer reach
819       zero, the switch or CA will reset  its  MKey  protection  level  to  0,
820       exposing the MKey and allowing recovery.
821
       OpenSM will initialize all ports to use an MKey lease period of the
       number of seconds specified in the config file.  If no
       m_key_lease_period is specified, a default of 0 will be used.
825
826       OpenSM  normally quickly responds to all Bad_M_Key traps, resetting the
827       lease timers.  Additionally, OpenSM's subnet sweeps  will  also  cancel
828       any  running  timers.   For  maximum  protection  against accidentally-
829       exposed MKeys, the MKey lease time should be a  few  multiples  of  the
830       subnet sweep time.  If OpenSM detects at startup that your sweep inter‐
831       val is greater than your MKey lease period, it  will  reset  the  lease
832       period  to  be greater than the sweep interval.  Similarly, if sweeping
833       is disabled at startup, it will be re-enabled  with  an  interval  less
834       than the Mkey lease period.
835
836       If  OpenSM  is  required  to  recover  a subnet for which it is missing
837       mkeys, it must do so one switch level at a time.  As  such,  the  total
838       time to recover the subnet may be as long as the mkey lease period mul‐
839       tiplied by the maximum number of hops between the SM and  an  endpoint,
840       plus one.
841
842       MKey Effects on Diagnostic Utilities
843
844       Setting a MKey may have a detrimental effect on diagnostic software run
845       on the subnet, unless your diagnostic  software  is  able  to  retrieve
846       MKeys from the SA or can be explicitly configured with the proper MKey.
847       This is particularly true at protection level 2, where CAs will  ignore
848       queries for management information that do not contain the proper MKey.
849
850

ROUTING

852       OpenSM now offers nine routing engines:
853
854       1.   Min  Hop  Algorithm - based on the minimum hops to each node where
855       the path length is optimized.
856
857       2.  UPDN Unicast routing algorithm - also based on the minimum hops  to
858       each  node,  but  it  is  constrained  to ranking rules. This algorithm
859       should be chosen if the subnet is not a pure Fat Tree, and deadlock may
860       occur due to a loop in the subnet.
861
862       3.  DNUP Unicast routing algorithm - similar to UPDN but allows routing
863       in fabrics which have some CA nodes attached closer to the  roots  than
864       some switch nodes.
865
866       4.  Fat Tree Unicast routing algorithm - this algorithm optimizes rout‐
867       ing for congestion-free "shift" communication pattern.   It  should  be
868       chosen  if  a subnet is a symmetrical or almost symmetrical fat-tree of
869       various types,  not  just  K-ary-N-Trees:  non-constant  K,  not  fully
870       staffed,  any  Constant  Bisectional Bandwidth (CBB) ratio.  Similar to
871       UPDN, Fat Tree routing is constrained to ranking rules.
872
873       5. LASH unicast routing algorithm - uses Infiniband virtual layers (SL)
874       to  provide deadlock-free shortest-path routing while also distributing
875       the paths between layers. LASH is an alternative  deadlock-free  topol‐
876       ogy-agnostic routing algorithm to the non-minimal UPDN algorithm avoid‐
877       ing the use of a potentially congested root node.
878
879       6. DOR Unicast routing algorithm - based on the Min Hop algorithm,  but
880       avoids  port  equalization  except for redundant links between the same
881       two switches.  This provides deadlock free routes for  hypercubes  when
882       the  fabric  is  cabled  as a hypercube and for meshes when cabled as a
883       mesh (see details below).
884
885       7. Torus-2QoS unicast routing algorithm - a DOR-based routing algorithm
886       specialized  for 2D/3D torus topologies.  Torus-2QoS provides deadlock-
887       free routing while supporting two quality of service (QoS) levels.   In
888       addition  it  is able to route around multiple failed fabric links or a
889       single failed fabric switch without introducing deadlocks, and  without
890       changing path SL values granted before the failure.
891
892       8.  DFSSSP  unicast  routing algorithm - a deadlock-free single-source-
893       shortest-path routing, which uses the SSSP algorithm (see algorithm 9.)
894       as  the  base  to optimize link utilization and uses Infiniband virtual
895       lanes (SL) to provide deadlock-freedom.
896
897       9. SSSP unicast routing algorithm - a single-source-shortest-path rout‐
898       ing algorithm, which globally balances the number of routes per link to
899       optimize link utilization. This routing algorithm has  no  restrictions
900       in terms of the underlying topology.
901
902       OpenSM  also supports a file method which can load routes from a table.
903       See ´Modular Routing Engine´ for more information on this.
904
905       The basic routing algorithm is comprised of two stages:
906
       1. MinHop matrix calculation
          How many hops are required to get from each port to each LID?
          The algorithm to fill these tables is different if you run standard
          (min hop) or Up/Down.
          For standard routing, a "relaxation" algorithm is used to propagate
          min hop from every destination LID through neighbor switches.
          For Up/Down routing, a BFS from every target is used.  The BFS tracks
          link direction (up or down) and avoids steps that would go up after
          a down step was used.
916
       2. Once MinHop matrices exist, each switch is visited and for each
       target LID a decision is made as to what port should be used to get to
       that LID.
          This step is common to standard and Up/Down routing.  Each port has a
       counter counting the number of target LIDs going through it.
          When there are multiple alternative ports with the same MinHop to a
       LID, the one with the fewest previously assigned LIDs is selected.
          If LMC > 0, more checks are added.  Within each group of LIDs
       assigned to the same target port:
          a. use only ports which have the same MinHop
          b. first prefer the ones that go to a different systemImageGuid (than
       the previous LID of the same LMC group)
          c. if none, prefer those which go through another NodeGuid
          d. fall back to the number of paths method (if all go to the same
       node).
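
       The port selection in stage 2 (for LMC = 0) can be sketched in Python as
       follows.  This is a minimal illustration of the counting rule described
       above, not the OpenSM code: the function name is hypothetical, and
       breaking remaining ties by the lowest port number is an assumption.

```python
def choose_port(min_hops, port_load):
    """Pick the output port for one target LID on one switch.

    min_hops:  {port: hop count to the target LID through that port}
    port_load: {port: number of target LIDs already routed through it};
               updated in place.
    """
    best = min(min_hops.values())
    # Only ports that achieve the MinHop distance are candidates.
    candidates = [port for port, hops in min_hops.items() if hops == best]
    # Prefer the least-loaded candidate (assumed tie-break: lowest port).
    port = min(candidates, key=lambda p: (port_load.get(p, 0), p))
    port_load[port] = port_load.get(port, 0) + 1
    return port
```

       Calling choose_port repeatedly with the same min_hops and a shared
       port_load dictionary spreads successive target LIDs across the
       equal-distance ports.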
931
932       Effect of Topology Changes
933
934       OpenSM will preserve existing routing in any case  where  there  is  no
935       change in the fabric switches unless the -r (--reassign_lids) option is
936       specified.
937
938       -r
939       --reassign_lids
940                 This option causes OpenSM to reassign LIDs to all
941                 end nodes. Specifying -r on a running subnet
942                 may disrupt subnet traffic.
943                 Without -r, OpenSM attempts to preserve existing
944                 LID assignments resolving multiple use of same LID.
945
946       If a link is added or removed, OpenSM does not recalculate  the  routes
947       that  do  not  have  to change. A route has to change if the port is no
948       longer UP or no longer the MinHop. When routing changes are  performed,
949       the same algorithm for balancing the routes is invoked.
950
       In the case of file-based routing, any topology changes are currently
       ignored.  The 'file' routing engine just loads the LFTs from the file
       specified, with no reaction to the real topology.  Obviously, this will
       not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs
       for non-existent switches will be skipped.  Multicast is not affected
       by the 'file' routing engine (this uses min hop tables).
957
958
959       Min Hop Algorithm
960
961       The Min Hop algorithm is invoked by default if no routing algorithm  is
962       specified.  It can also be invoked by specifying '-R minhop'.
963
964       The  Min  Hop algorithm is divided into two stages: computation of min-
965       hop tables on every switch and LFT output port  assignment.  Link  sub‐
966       scription  is also equalized with the ability to override based on port
967       GUID. The latter is supplied by:
968
969       -i <equalize-ignore-guids-file>
970       --ignore-guids <equalize-ignore-guids-file>
971                 This option provides the means to define a set of ports
972                 (by guid) that will be ignored by the link load
973                 equalization algorithm. Note that only endports (CA,
974                 switch port 0, and router ports) and not switch external
975                 ports are supported.
976
977       LMC awareness routes based on (remote) system or switch basis.
978
979
980       Purpose of UPDN Algorithm
981
982       The UPDN algorithm is designed to prevent deadlocks from  occurring  in
983       loops  of  the subnet. A loop-deadlock is a situation in which it is no
984       longer possible to send data between any two  hosts  connected  through
985       the  loop.  As  such,  the UPDN routing algorithm should be used if the
986       subnet is not a pure Fat Tree, and one of its loops  may  experience  a
987       deadlock (due, for example, to high pressure).
988
989       The UPDN algorithm is based on the following main stages:
990
991       1.  Auto-detect root nodes - based on the CA hop length from any switch
992       in the subnet, a statistical histogram is built for  each  switch  (hop
993       num  vs  number  of  occurrences). If the histogram reflects a specific
994       column (higher than others) for a certain node, then it is marked as  a
995       root node. Since the algorithm is statistical, it may not find any root
996       nodes. The list of the root nodes found by this  auto-detect  stage  is
997       used by the ranking process stage.
998
999           Note 1: The user can override the node list manually.
1000           Note 2: If this stage cannot find any root nodes, and the user did
1001                   not specify a guid list file, OpenSM defaults back to the
1002                   Min Hop routing algorithm.
1003
1004       2.   Ranking  process  -  All  root switch nodes (found in stage 1) are
1005       assigned a rank of 0. Using the BFS algorithm, the rest of  the  switch
1006       nodes  in the subnet are ranked incrementally. This ranking aids in the
1007       process of enforcing rules that ensure loop-free paths.
1008
1009       3.  Min Hop Table setting - after ranking is done, a BFS  algorithm  is
1010       run  from  each  (CA  or  switch)  node  in  the subnet. During the BFS
1011       process, the FDB table of each switch node traversed by BFS is updated,
1012       in  reference to the starting node, based on the ranking rules and guid
1013       values.
1014
1015       At the end of the process, the  updated  FDB  tables  ensure  loop-free
1016       paths through the subnet.
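
       The ranking stage and the Up/Down rule can be sketched in Python as
       follows.  This is an illustration only, not the OpenSM implementation:
       the function names are hypothetical, node identifiers stand in for
       GUIDs, and ordering equal-rank neighbors by identifier is an assumed
       stand-in for the guid-based rule mentioned above.

```python
from collections import deque


def rank_switches(adjacency, roots):
    """BFS from the root switches: roots get rank 0, others rank by distance."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in rank:
                rank[neighbor] = rank[node] + 1
                queue.append(neighbor)
    return rank


def is_up_step(src, dst, rank):
    """A step is 'up' if it moves toward the roots (assumed id tie-break)."""
    return (rank[dst], dst) < (rank[src], src)


def is_updn_path(path, rank):
    """True if the path never takes an up step after a down step."""
    gone_down = False
    for src, dst in zip(path, path[1:]):
        if is_up_step(src, dst, rank):
            if gone_down:
                return False
        else:
            gone_down = True
    return True
```

       Paths that pass is_updn_path cannot close a loop of dependencies,
       because every such path consists of zero or more up steps followed by
       zero or more down steps.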
1017
1018       Note:  Up/Down routing does not allow LID routing communication between
1019       switches that are located inside spine "switch systems".  The reason is
1020       that  there  is  no way to allow a LID route between them that does not
1021       break the Up/Down rule.  One ramification of this is  that  you  cannot
1022       run SM on switches other than the leaf switches of the fabric.
1023
1024
1025       UPDN Algorithm Usage
1026
1027       Activation through OpenSM
1028
1029       Use  '-R  updn' option (instead of old '-u') to activate the UPDN algo‐
1030       rithm.  Use '-a <root_guid_file>' for adding an  UPDN  guid  file  that
1031       contains  the  root nodes for ranking.  If the `-a' option is not used,
1032       OpenSM uses its auto-detect root nodes algorithm.
1033
1034       Notes on the guid list file:
1035
1036       1.   A valid guid file specifies one guid in each line. Lines  with  an
1037       invalid format will be discarded.
1038       2.   The user should specify the root switch guids. However, it is also
1039       possible to specify CA guids; OpenSM will use the guid  of  the  switch
1040       (if it exists) that connects the CA to the subnet as a root node.
1041
1042       Purpose of DNUP Algorithm
1043
1044       The DNUP algorithm is designed to serve a similar purpose to UPDN. How‐
1045       ever it is intended to work in network topologies which are unsuited to
1046       UPDN  due to nodes being connected closer to the roots than some of the
1047       switches.  An example would  be  a  fabric  which  contains  nodes  and
1048       uplinks connected to the same switch. The operation of DNUP is the same
       as UPDN with the exception of the ranking process.  In DNUP all switch
       nodes are ranked based solely on their distance from CA nodes: all
       switch nodes directly connected to at least one CA are assigned a value
       of 1, and all other switch nodes are assigned a value of one more than
       the minimum rank of all neighbor switch nodes.
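
       The DNUP ranking rule can be sketched in Python as a breadth-first
       search that starts from every switch with a CA attached.  The function
       name and the input format are hypothetical; this illustrates the rule
       as stated above and is not taken from the OpenSM source.

```python
from collections import deque


def dnup_rank(switch_links, switches_with_ca):
    """Rank switches by distance from CAs.

    switch_links:     {switch: [neighbor switches]}
    switches_with_ca: switches directly connected to at least one CA
    """
    rank = {switch: 1 for switch in switches_with_ca}
    queue = deque(switches_with_ca)
    while queue:
        switch = queue.popleft()
        for neighbor in switch_links[switch]:
            if neighbor not in rank:
                # First visit comes from a neighbor of minimum rank, so this
                # is one more than the minimum rank among all neighbors.
                rank[neighbor] = rank[switch] + 1
                queue.append(neighbor)
    return rank
```

       In a fabric where two leaf switches carry CAs and connect to a spine
       that in turn connects to a core switch, the leaves get rank 1, the
       spine rank 2, and the core rank 3.  If a CA is attached directly to the
       spine, the spine gets rank 1 and the core rank 2.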
1054
1055       Fat-tree Routing Algorithm
1056
       The fat-tree algorithm optimizes routing for the "shift" communication
       pattern.  It should be chosen if a subnet is a symmetrical or almost
       symmetrical fat-tree of various types.  It supports more than just
       K-ary-N-Trees: it handles non-constant K, cases where not all leaves
       (CAs) are present, and any CBB ratio.  As in UPDN, fat-tree also
       prevents credit-loop deadlocks.
1063
1064       If  the  root  guid  file  is  not provided ('-a' or '--root_guid_file'
1065       options), the topology has to be pure fat-tree that complies  with  the
1066       following rules:
1067         - Tree rank should be between two and eight (inclusively)
         - Switches of the same rank should have the same number
           of UP-going port groups*, unless they are root switches,
           in which case they shouldn't have UP-going ports at all.
1071         - Switches of the same rank should have the same number
1072           of DOWN-going port groups, unless they are leaf switches.
1073         - Switches of the same rank should have the same number
1074           of ports in each UP-going port group.
1075         - Switches of the same rank should have the same number
1076           of ports in each DOWN-going port group.
1077         - All the CAs have to be at the same tree level (rank).
1078
1079       If the root guid file is provided, the topology doesn't have to be pure
1080       fat-tree, and it should only comply with the following rules:
1081         - Tree rank should be between two and eight (inclusively)
1082         - All the Compute Nodes** have to be at the same tree level (rank).
1083           Note that non-compute node CAs are allowed here to be at different
1084           tree ranks.
1085
1086       * ports that are connected to the same remote switch are referenced  as
1087       ´port group´.
1088
1089       **   list   of  compute  nodes  (CNs)  can  be  specified  by  ´-u´  or
1090       ´--cn_guid_file´ OpenSM options.
1091
1092       Topologies that do not comply cause a  fallback  to  min  hop  routing.
1093       Note that this can also occur on link failures which cause the topology
1094       to no longer be "pure" fat-tree.
1095
1096       Note that although fat-tree algorithm supports trees  with  non-integer
1097       CBB  ratio,  the  routing will not be as balanced as in case of integer
1098       CBB ratio.  In addition to this, although  the  algorithm  allows  leaf
1099       switches  to have any number of CAs, the closer the tree is to be fully
1100       populated, the more effective the "shift"  communication  pattern  will
1101       be.   In  general,  even  if  the root list is provided, the closer the
1102       topology to a pure and symmetrical fat-tree, the more optimal the rout‐
1103       ing will be.
1104
1105       The  algorithm  also dumps compute node ordering file (opensm-ftree-ca-
1106       order.dump) in the same directory where the OpenSM  log  resides.  This
1107       ordering  file  provides  the CN order that may be used to create effi‐
1108       cient communication pattern, that will match the routing tables.
1109
1110       Routing between non-CN nodes
1111
       The use of the cn_guid_file option allows non-CN nodes to be located on
       different levels in the fat tree.  In such a case, it is not guaranteed
       that the Fat Tree algorithm will route between two non-CN nodes.  To
       solve this problem, a list of non-CN nodes can be specified by the ´-G´
       or ´--io_guid_file´ option.  These nodes will be allowed to use
       switches the wrong way round a specific number of times (specified by
       ´-H´ or ´--max_reverse_hops´).  With the proper max_reverse_hops and
       io_guid_file values, you can ensure full connectivity in the Fat Tree.
1120
1121       Please  note  that  using  max_reverse_hops creates routes that use the
1122       switch in a counter-stream way.  This option should never  be  used  to
1123       connect nodes with high bandwidth traffic between them ! It should only
1124       be used to allow connectivity for HA purposes or similar.  Also  having
1125       routes the other way around can in theory cause credit loops.
1126
1127       Use these options with extreme care !
1128
1129       Activation through OpenSM
1130
1131       Use  '-R  ftree'  option  to  activate the fat-tree algorithm.  Use '-a
1132       <root_guid_file>' to provide root nodes for ranking. If the `-a' option
1133       is  not  used,  routing algorithm will detect roots automatically.  Use
1134       '-u <root_cn_file>' to provide the list of compute nodes. If  the  `-u'
1135       option is not used, all the CAs are considered as compute nodes.
1136
1137       Note:  LMC  > 0 is not supported by fat-tree routing. If this is speci‐
1138       fied, the default routing algorithm is invoked instead.
1139
1140
1141       LASH Routing Algorithm
1142
1143       LASH is an acronym for LAyered SHortest Path Routing. It is a determin‐
1144       istic  shortest  path  routing algorithm that enables topology agnostic
1145       deadlock-free routing within communication networks.
1146
1147       When computing the routing function, LASH analyzes the network topology
1148       for  the  shortest-path  routes between all pairs of sources / destina‐
1149       tions and groups these paths into virtual layers in such a  way  as  to
1150       avoid deadlock.
1151
       Note that LASH analyzes routes and ensures deadlock freedom between
       switch pairs.  The link between an HCA and a switch does not need
       virtual layers, as deadlock will not arise between switch and HCA.
1155
1156       In more detail, the algorithm works as follows:
1157
1158       1) LASH determines the shortest-path between all pairs of source / des‐
1159       tination switches. Note, LASH ensures the  same  SL  is  used  for  all
1160       SRC/DST  - DST/SRC pairs and there is no guarantee that the return path
1161       for a given DST/SRC will be the reverse of the route SRC/DST.
1162
1163       2) LASH then begins an SL assignment process where a route is  assigned
1164       to  a  layer (SL) if the addition of that route does not cause deadlock
1165       within that layer. This is achieved  by  maintaining  and  analysing  a
1166       channel dependency graph for each layer. Once the potential addition of
1167       a path could lead to deadlock, LASH opens a new layer and continues the
1168       process.
1169
1170       3)  Once  this  stage  has been completed, it is highly likely that the
1171       first layers processed will contain more paths than  the  latter  ones.
1172       To better balance the use of layers, LASH moves paths from one layer to
1173       another so that the number of paths in each layer averages out.
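
       The layer assignment in step 2 can be sketched in Python as follows.
       This is a simplified illustration, not the opensm implementation: the
       function names are hypothetical, paths are given as lists of switch
       identifiers, each layer is tried in order, and the balancing pass of
       step 3 is omitted.

```python
def path_dependencies(path):
    """Channel dependencies of a path: consecutive pairs of directed links."""
    channels = list(zip(path, path[1:]))
    return set(zip(channels, channels[1:]))


def has_cycle(edges):
    """Depth-first search for a cycle in a channel dependency graph."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    state = {}  # 1 = on the current DFS stack, 2 = fully explored

    def visit(node):
        state[node] = 1
        for nxt in graph[node]:
            if state.get(nxt) == 1:
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = 2
        return False

    return any(node not in state and visit(node) for node in graph)


def assign_layers(paths):
    """Return the layer index chosen for each path, opening layers as needed."""
    layers = []      # one dependency set per layer
    assignment = []
    for path in paths:
        deps = path_dependencies(path)
        for index, layer_deps in enumerate(layers):
            if not has_cycle(layer_deps | deps):
                layer_deps |= deps  # path fits: extend this layer in place
                assignment.append(index)
                break
        else:
            layers.append(deps)     # would deadlock everywhere: new layer
            assignment.append(len(layers) - 1)
    return assignment
```

       On a ring of four switches, the clockwise two-hop paths [0, 1, 2],
       [1, 2, 3], [2, 3, 0] and [3, 0, 1] are assigned to layers 0, 0, 0 and 1:
       the fourth path would close a cycle of channel dependencies in layer 0,
       so a second layer is opened for it.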
1174
1175       Note, the implementation of LASH in opensm attempts to use as few  lay‐
1176       ers as possible. This number can be less than the number of actual lay‐
1177       ers available.
1178
1179       In general LASH is a very flexible  algorithm.  It  can,  for  example,
1180       reduce to Dimension Order Routing in certain topologies, it is topology
1181       agnostic and fares well in the face of faults.
1182
1183       It has been shown that for both regular and irregular topologies,  LASH
1184       outperforms  Up/Down.  The reason for this is that LASH distributes the
1185       traffic more evenly through a network, avoiding the  bottleneck  issues
1186       related to a root node and always routes shortest-path.
1187
1188       The algorithm was developed by Simula Research Laboratory.
1189
1190
1191       Use '-R lash -Q ' option to activate the LASH algorithm.
1192
1193       Note:  QoS support has to be turned on in order that SL/VL mappings are
1194       used.
1195
1196       Note: LMC > 0 is not supported by the LASH routing. If this  is  speci‐
1197       fied, the default routing algorithm is invoked instead.
1198
1199       For  open regular cartesian meshes the DOR algorithm is the ideal rout‐
1200       ing algorithm. For toroidal meshes on the other hand there are  routing
1201       loops  that can cause deadlocks. LASH can be used to route these cases.
1202       The performance of LASH can be improved by preconditioning the mesh  in
1203       cases  where  there  are multiple links connecting switches and also in
1204       cases where the switches are not cabled consistently. An option  exists
1205       for  LASH  to  do this. To invoke this use '-R lash -Q --do_mesh_analy‐
1206       sis'. This will add an additional phase that analyses the mesh  to  try
1207       to  determine  the  dimension and size of a mesh. If it determines that
1208       the mesh looks like an open or closed cartesian mesh  it  reorders  the
1209       ports in dimension order before the rest of the LASH algorithm runs.
1210
1211       DOR Routing Algorithm
1212
1213       The Dimension Order Routing algorithm is based on the Min Hop algorithm
1214       and so uses shortest paths.  Instead of spreading  traffic  out  across
1215       different  paths  with the same shortest distance, it chooses among the
1216       available shortest paths based on an ordering of dimensions.  Each port
1217       must  be  consistently  cabled  to represent a hypercube dimension or a
1218       mesh dimension.  Alternatively, the -O option can be used to  assign  a
1219       custom  mapping between the ports on a given switch, and the associated
1220       dimension.  Paths are grown from a destination back to a  source  using
1221       the lowest dimension (port) of available paths at each step.  This pro‐
1222       vides the ordering necessary to avoid deadlock.  When there are  multi‐
1223       ple  links  between  any  two  switches,  they still represent only one
1224       dimension and traffic is balanced across them unless port  equalization
1225       is  turned  off.  In the case of hypercubes, the same port must be used
1226       throughout the fabric to represent the hypercube dimension and match on
1227       both  ends of the cable, or the -O option used to accomplish the align‐
1228       ment.  In the case of meshes, the dimension should consistently use the
1229       same  pair  of  ports,  one port on one end of the cable, and the other
1230       port on the other end, continuing along the mesh dimension, or  the  -O
1231       option used as an override.
1232
1233       Use the '-R dor' option to activate the DOR algorithm.
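
       The way DOR grows a path from the destination back to the source
       can be illustrated on an open Cartesian mesh. The sketch below
       is a simplification under stated assumptions: switches are
       identified by coordinate tuples rather than by ports, and the
       function name dor_path is hypothetical. At each step it moves
       along the lowest dimension in which the current position still
       differs from the source.

```python
def dor_path(src, dst):
    """Return the route from src to dst on an open mesh.

    The path is grown from dst back towards src, always stepping along
    the lowest dimension whose coordinate still differs, and is then
    reversed so that it reads from source to destination.
    """
    if len(src) != len(dst):
        raise ValueError("src and dst must have the same number of dimensions")
    current = list(dst)
    reverse_path = [tuple(current)]
    while tuple(current) != tuple(src):
        for dim, (c, s) in enumerate(zip(current, src)):
            if c != s:
                current[dim] += 1 if s > c else -1   # one hop in this dimension
                break
        reverse_path.append(tuple(current))
    return reverse_path[::-1]
```

       Because every source/destination pair resolves the dimensions
       in the same order, the routes on an open mesh follow the
       consistent ordering that the text above credits with avoiding
       deadlock.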
1234
1235       DFSSSP and SSSP Routing Algorithm
1236
1237       The  (Deadlock-Free)  Single-Source-Shortest-Path  routing algorithm is
1238       designed to optimize link utilization through global balancing of routes,
1239       while  supporting  arbitrary  topologies.  The DFSSSP routing algorithm
1240       uses InfiniBand virtual lanes (SL) to provide deadlock-freedom.
1241
1242       The DFSSSP algorithm consists of five major steps:
1243       1) It discovers the subnet and models the subnet as a  directed  multi‐
1244       graph  in which each node represents a node of the physical network and
1245       each edge represents one direction of the  full-duplex  links  used  to
1246       connect the nodes.
1247       2) A loop, which iterates over all CAs and switches of the subnet, will
1248       perform three steps to generate the linear forwarding tables  for  each
1249       switch:
1250       2.1)  use Dijkstra's algorithm to find the shortest path from all nodes
1251       to the current selected destination;
1252       2.2) update the edge weights in the  graph,  i.e.  add  the  number  of
1253       routes, which use a link to reach the destination, to the link/edge;
1254       2.3)  update  the  LFT  of each switch with the outgoing port which was
1255       used in the current step to route the traffic to the destination node.
1256       3) After the number of available virtual lanes or layers in the  subnet
1257       is  detected  and  a  channel  dependency graph is initialized for each
1258       layer, the algorithm will put each possible route of  the  subnet  into
1259       the first layer.
1260       4)  A  loop  iterates over all channel dependency graphs (CDG) and per‐
1261       forms the following substeps:
1262       4.1) search for a cycle in the current CDG;
1263       4.2) when a cycle is found, i.e. a possible deadlock  is  present,  one
1264       edge is selected and all routes that induce this edge are moved to
1265       the "next higher" virtual layer (CDG[i+1]);
1266       4.3) the cycle search is continued until  all  cycles  are  broken  and
1267       routes are moved "up".
1268       5) When the number of layers needed to remove all cycles in all CDGs
1269       does not exceed the number of available SL/VL, the routing is
1270       deadlock-free and a relation table is generated, which contains the
1271       assignment of routes from source to destination to an SL.
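
       Steps 2.1 to 2.3 can be sketched as below. This is a minimal
       illustration under stated assumptions, not the opensm code: the
       function name sssp_tables, the string node names, the initial
       edge weight of 1, and the use of a simple graph (no parallel
       links) are all choices made for the example. For each
       destination it runs Dijkstra's algorithm, records the next hop
       per node, and then adds the number of routes crossing each
       directed edge to that edge's weight, so later destinations tend
       to avoid links that are already heavily used.

```python
import heapq


def sssp_tables(links, destinations):
    """Build next-hop tables with per-destination edge-weight updates.

    links: list of (u, v) pairs, each a full-duplex link.
    Returns (tables, weight) where tables[node][dest] is the next hop.
    """
    weight = {}
    for u, v in links:                         # one directed edge per direction
        weight[(u, v)] = 1
        weight[(v, u)] = 1
    nodes = sorted({n for link in links for n in link})
    neighbours = {n: [] for n in nodes}
    for u, v in weight:
        neighbours[u].append(v)
    tables = {n: {} for n in nodes}

    for dest in destinations:
        # 2.1) Dijkstra from dest outwards, costing each edge in the
        #      direction traffic would travel (neighbour -> node).
        dist = {dest: 0}
        next_hop = {}
        heap = [(0, dest)]
        while heap:
            d, node = heapq.heappop(heap)
            if d > dist[node]:
                continue                       # stale heap entry
            for nb in neighbours[node]:
                nd = d + weight[(nb, node)]
                if nd < dist.get(nb, float("inf")):
                    dist[nb] = nd
                    next_hop[nb] = node
                    heapq.heappush(heap, (nd, nb))

        # 2.2) Count the routes crossing each edge and add that count
        #      to the edge weight.
        usage = {}
        for start in next_hop:
            node = start
            while node != dest:
                edge = (node, next_hop[node])
                usage[edge] = usage.get(edge, 0) + 1
                node = next_hop[node]
        for edge, count in usage.items():
            weight[edge] += count

        # 2.3) Record the outgoing hop towards dest for every node.
        for node, hop in next_hop.items():
            tables[node][dest] = hop
    return tables, weight
```

       On a small fabric where switch s1 reaches switch s4 through
       either s2 or s3, and two destinations d1 and d2 hang off s4, the
       weight update makes s1 use different intermediate switches for
       d1 and d2 instead of sending both routes over the same link.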
1272
1273       Note on SSSP:
1274       This algorithm does not perform steps 3)-5) and cannot be considered
1275       deadlock-free for all topologies. On the one hand, you
1276       can choose this algorithm for really large  networks  (5,000+  CAs  and
1277       deadlock-free by design) to reduce the runtime of the algorithm. On the
1278       other hand, you might use the SSSP routing algorithm as an alternative,
1279       when all deadlock-free routing algorithms fail to route the network for
1280       whatever reason.  In the last case, SSSP was  designed  to  deliver  an
1281       equal  or  higher bandwidth due to better congestion avoidance than the
1282       Min Hop routing algorithm.
1283
1284       Notes for usage:
1285       a) running DFSSSP: '-R dfsssp -Q'
1286       a.1) QoS has to be configured to equally spread the load on the  avail‐
1287       able SL or virtual lanes
1288       a.2)  applications  must perform a path record query to get path SL for
1289       each route, which the application will use to transmit packets
1290       b) running SSSP:   '-R sssp'
1291       c) both algorithms support LMC > 0
1292
1293       Hints for optimizing I/O traffic:
1294       Having more nodes (I/O and compute) connected to a switch than incoming
1295       links  can  result  in  a  'bad'  routing of the I/O traffic as long as
1296       (DF)SSSP routing is not aware of the dedicated I/O nodes, i.e., in  the
1297       following  network configuration CN1-CN3 might send all I/O traffic via
1298       Link2 to IO1,IO2:
1299
1300            CN1         Link1        IO1
1301               \       /----\       /
1302         CN2 -- Switch1      Switch2 -- CN4
1303               /       \----/       \
1304            CN3         Link2        IO2
1305
1306       To prevent this from happening (DF)SSSP can use both the  compute  node
1307       guid   file   and   the   I/O  guid  file  specified  by  the  ´-u´  or
1308       ´--cn_guid_file´ and ´-G´ or ´--io_guid_file´ options (similar  to  the
1309       Fat-Tree routing).  This ensures that traffic towards compute nodes and
1310       I/O nodes is balanced separately and therefore distributed as  much  as
1311       possible  across  the available links. Port GUIDs, as listed by ibstat,
1312       must be specified (not Node GUIDs).
1313       The priority for the optimization is as follows:
1314         compute nodes -> I/O nodes -> other nodes
1315       Possible use case scenarios:
1316       a) neither ´-u´ nor ´-G´ is specified: all nodes are treated as ´other
1317       nodes´ and therefore balanced equally;
1318       b)  ´-G´ is specified: traffic towards I/O nodes will be balanced opti‐
1319       mally;
1320       c) the system has three node types, such as  login/admin,  compute  and
1321       I/O,  but  the  balancing focus should be I/O, then one has to use ´-u´
1322       and ´-G´ with I/O guids listed in cn_guid_file and compute  node  guids
1323       listed in io_guid_file;
1324       d) ...
1325
1326       Torus-2QoS Routing Algorithm
1327
1328       Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus
1329       fabrics; see torus-2QoS(8) for full documentation.
1330
1331       Use '-R torus-2QoS -Q' or '-R torus-2QoS,no_fallback  -Q'  to  activate
1332       the torus-2QoS algorithm.
1333
1334
1335       Routing References
1336
1337       To  learn  more  about deadlock-free routing, see the article "Deadlock
1338       Free Message Routing in  Multiprocessor  Interconnection  Networks"  by
1339       William J Dally and Charles L Seitz (1985).
1340
1341       To  learn  more about the up/down algorithm, see the article "Effective
1342       Strategy to Compute Forwarding Tables for InfiniBand Networks" by  Jose
1343       Carlos  Sancho,  Antonio  Robles,  and  Jose  Duato  at the Universidad
1344       Politecnica de Valencia.
1345
1346       To learn more about LASH and the flexibility behind it, the requirement
1347       for  layers,  performance comparisons to other algorithms, see the fol‐
1348       lowing articles:
1349
1350       "Layered Routing in Irregular Networks", Lysne et al., IEEE Transactions
1351       on Parallel and Distributed Systems, Vol. 16, No. 12, December 2005.
1352
1353       "Routing for the ASI Fabric Manager", Solheim et al., IEEE Communica‐
1354       tions Magazine, Vol. 44, No. 7, July 2006.
1355
1356       "Layered Shortest Path (LASH) Routing in  Irregular  System  Area  Net‐
1357       works",  Skeie  et al. IEEE Computer Society Communication Architecture
1358       for Clusters 2002.
1359
1360       To learn more about the DFSSSP and  SSSP  routing  algorithm,  see  the
1361       articles:
1362       J.  Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for
1363       Arbitrary Topologies, In Proceedings of  the  25th  IEEE  International
1364       Parallel & Distributed Processing Symposium (IPDPS 2011)
1365       T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-
1366       Scale InfiniBand Networks, In 17th Annual IEEE Symposium on  High  Per‐
1367       formance Interconnects (HOTI 2009)
1368
1369       Modular Routing Engine
1370
1371       Modular  routing engine structure allows for the ease of "plugging" new
1372       routing modules.
1373
1374       Currently, only unicast callbacks are supported. Multicast can be added
1375       later.
1376
1377       One  existing  routing module is up-down "updn", which may be activated
1378       with '-R updn' option (instead of old '-u').
1379
1380       General usage is: $ opensm -R 'module-name'
1381
1382       There is also a trivial routing module which is able to load LFT tables
1383       from a file.
1384
1385       Main features:
1386
1387        - this will load switch LFTs and/or LID matrices (min hops tables)
1388        - this will load switch LFTs according to the path entries introduced
1389          in the file
1390        - no additional checks will be performed (such as "is port connected",
1391          etc.)
1392        - in case when fabric LIDs were changed this will try to reconstruct
1393          LFTs correctly if endport GUIDs are represented in the file
1394          (in order to disable this, GUIDs may be removed from the file
1395           or zeroed)
1396
1397       The file format is compatible with the output of the 'ibroute' utility
1398       and, for a whole fabric, can be generated with the dump_lfts.sh script.
1399
1400       To activate file based routing module, use:
1401
1402         opensm -R file -U /path/to/lfts_file
1403
1404       If the lfts_file is not found or is in error, the default routing algo‐
1405       rithm is utilized.
1406
1407       The  ability  to dump switch lid matrices (aka min hops tables) to file
1408       and later to load these is also supported.
1409
1410       The usage is similar to loading unicast forwarding tables from an lfts
1411       file (introduced by the 'file' routing engine), but the lid matrix file
1412       name is specified with the -M or --lid_matrix_file option. For
1413       example:
1414
1415         opensm -R file -M ./opensm-lid-matrix.dump
1416
1417       The  dump  file is named ´opensm-lid-matrix.dump´ and will be generated
1418       in  standard  opensm  dump  directory  (/var/log   by   default)   when
1419       OSM_LOG_ROUTING logging flag is set.
1420
1421       When routing engine 'file' is activated, but the lfts file is not
1422       specified or cannot be opened, the default lid matrix algorithm is used.
1423
1424       There is also a switch forwarding tables dumper which generates a  file
1425       compatible with dump_lfts.sh output. This file can be used as input for
1426       forwarding tables loading by the 'file' routing engine. Either or both
1427       of the -U and -M options can be specified together with ´-R file´.
1428
1429

PER MODULE LOGGING CONFIGURATION

1431       To enable per module logging, set per_module_logging to TRUE in the
1432       opensm options file and configure per_module_logging_file there
1433       appropriately.
1434
1435       The per module logging config file format is a set of lines with module
1436       name and logging level as follows:
1437
1438        <module name><separator><logging level>
1439
1440        <module name> is the file name including .c
1441        <separator> is either = , space, or tab
1442        <logging level> uses the same levels as the coarse/overall
1443        logging, as follows:
1444
1445        BIT    LOG LEVEL ENABLED
1446        ----   -----------------
1447        0x01 - ERROR (error messages)
1448        0x02 - INFO (basic messages, low volume)
1449        0x04 - VERBOSE (interesting stuff, moderate volume)
1450        0x08 - DEBUG (diagnostic, high volume)
1451        0x10 - FUNCS (function entry/exit, very high volume)
1452        0x20 - FRAMES (dumps all SMP and GMP frames)
1453        0x40 - ROUTING (dump FDB routing information)
1454        0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM logging)
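
       The line format and bit table above can be illustrated with a
       small parser. This is a sketch for understanding the format, not
       code taken from opensm: the function names and the module name
       osm_example.c are hypothetical, the separator list is read here
       as '=', ',', space, or tab, and writing the level in hexadecimal
       (for example 0x03) is an assumption about how the value is
       spelled in the file.

```python
import re

# Bit values and names copied from the table above.
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
    0x80: "SYS",
}


def parse_line(line):
    """Split '<module name><separator><logging level>' into (name, mask)."""
    parts = [p for p in re.split(r"[=,\s]+", line.strip()) if p]
    if len(parts) != 2:
        raise ValueError("expected '<module name><separator><logging level>'")
    module, level = parts
    return module, int(level, 0)     # base 0 accepts both decimal and 0x.. hex


def decode(mask):
    """Return the names of all log levels enabled in a bit mask."""
    return [name for bit, name in LOG_LEVELS.items() if mask & bit]
```

       With these helpers, the line "osm_example.c=0x03" parses to the
       module osm_example.c with mask 3, which decodes to ERROR and
       INFO.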
1455
1456

FILES

1458       /etc/rdma/opensm.conf
1459              default OpenSM config file.
1460
1461
1462       /etc/rdma/ib-node-name-map
1463              default node name map file.  See ibnetdiscover for more informa‐
1464              tion on format.
1465
1466
1467       /etc/rdma/partitions.conf
1468              default partition config file
1469
1470
1471       /etc/rdma/qos-policy.conf
1472              default QOS policy config file
1473
1474
1475       /etc/rdma/prefix-routes.conf
1476              default prefix routes file
1477
1478
1479       /etc/rdma/per-module-logging.conf
1480              default per module logging config file
1481
1482
1483       /etc/rdma/torus-2QoS.conf
1484              default torus-2QoS config file
1485
1486

AUTHORS

1488       Hal Rosenstock
1489              <hal@mellanox.com>
1490
1491       Sasha Khapyorsky
1492              <sashak@voltaire.com>
1493
1494       Eitan Zahavi
1495              <eitan@mellanox.co.il>
1496
1497       Yevgeny Kliteynik
1498              <kliteyn@mellanox.co.il>
1499
1500       Thomas Sodring
1501              <tsodring@simula.no>
1502
1503       Ira Weiny
1504              <weiny2@llnl.gov>
1505
1506       Dale Purdy
1507              <purdy@sgi.com>
1508
1509

SEE ALSO

1511       torus-2QoS(8), torus-2QoS.conf(5).
1512
1513
1514
1515OpenIB                           March 8, 2012                       OPENSM(8)