OPENSM(8)                      OpenIB Management                     OPENSM(8)

NAME

       opensm - InfiniBand subnet manager and administration (SM/SA)

SYNOPSIS

       opensm [--version] [-F | --config <file_name>] [-c(reate-config)
       <file_name>] [-g(uid) <GUID in hex>] [-l(mc) <LMC>]
       [-p(riority) <PRIORITY>] [--subnet_prefix <PREFIX in hex>]
       [--smkey <SM_Key>] [--sm_sl <SL number>] [-r(eassign_lids)]
       [-R <engine name(s)> | --routing_engine <engine name(s)>]
       [--do_mesh_analysis] [--lash_start_vl <vl number>]
       [-A | --ucast_cache] [-z | --connect_roots]
       [-M <file name> | --lid_matrix_file <file name>]
       [-U <file name> | --lfts_file <file name>]
       [-S | --sadb_file <file name>] [-a | --root_guid_file <path to file>]
       [-u | --cn_guid_file <path to file>]
       [-G | --io_guid_file <path to file>] [--port-shifting]
       [--scatter-ports <random seed>]
       [-H | --max_reverse_hops <max reverse hops allowed>]
       [-X | --guid_routing_order_file <path to file>]
       [-m | --ids_guid_file <path to file>] [-o(nce)] [-s(weep) <interval>]
       [-t(imeout) <milliseconds>] [--retries <number>] [--maxsmps <number>]
       [--console [off | local | socket | loopback]] [--console-port <port>]
       [-i | --ignore_guids <equalize-ignore-guids-file>]
       [-w | --hop_weights_file <path to file>]
       [-O | --port_search_ordering_file <path to file>]
       [-O | --dimn_ports_file <path to file>] (DEPRECATED)
       [-f <log file path> | --log_file <log file path>]
       [-L | --log_limit <size in MB>] [-e(rase_log_file)]
       [-P(config) <partition config file>]
       [-N | --no_part_enforce] (DEPRECATED)
       [-Z | --part_enforce [both | in | out | off]]
       [-W | --allow_both_pkeys]
       [-Q | --qos [-Y | --qos_policy_file <file name>]]
       [--congestion_control] [--cc_key <key>] [-y | --stay_on_fatal]
       [-B | --daemon] [-J | --pidfile <file_name>] [-I | --inactive]
       [--perfmgr] [--perfmgr_sweep_time_s <seconds>]
       [--prefix_routes_file <path>] [--consolidate_ipv6_snm_req]
       [--log_prefix <prefix text>] [--torus_config <path to file>]
       [-v(erbose)] [-V] [-D <flags>] [-d(ebug) <number>] [-h(elp)] [-?]

DESCRIPTION

       opensm is an InfiniBand compliant Subnet Manager and Administration,
       and runs on top of OpenIB.

       opensm provides an implementation of an InfiniBand Subnet Manager and
       Administration. Such a software entity is required to run in order
       to initialize the InfiniBand hardware (at least one per InfiniBand
       subnet).

       opensm also contains an experimental version of a performance
       manager.

       opensm defaults were designed to meet the common case usage on clusters
       with up to a few hundred nodes. Thus, in this default mode, opensm will
       scan the IB fabric, initialize it, and sweep occasionally for changes.

       opensm attaches to a specific IB port on the local machine and
       configures only the fabric connected to it. (If the local machine has
       other IB ports, opensm will ignore the fabrics connected to those
       other ports.) If no port is specified, it will select the first "best"
       available port.

       opensm can present the available ports and prompt for a port number to
       attach to.

       By default, the run is logged to two files: /var/log/messages and
       /var/log/opensm.log. The first file will register only general major
       events, whereas the second will include details of reported errors. All
       errors reported in this second file should be treated as indicators of
       IB fabric health issues. (Note that when a fatal and non-recoverable
       error occurs, opensm will exit.) Both log files should include the
       message "SUBNET UP" if opensm was able to set up the subnet correctly.
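
       As an illustration of the "SUBNET UP" check, the following Python
       sketch scans a log file for that message. The helper name
       subnet_is_up is hypothetical and not part of opensm; the default
       path is the log location named above.

```python
from pathlib import Path


def subnet_is_up(log_path: str = "/var/log/opensm.log") -> bool:
    """Return True if the given log contains the "SUBNET UP" message."""
    path = Path(log_path)
    if not path.is_file():
        # No log file means there is nothing to confirm.
        return False
    with path.open(errors="replace") as log:
        return any("SUBNET UP" in line for line in log)
```

       Because the log can accumulate across runs, a match may come from an
       earlier run; starting opensm with -e (erase log file) avoids that
       ambiguity.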

OPTIONS

       --version
              Prints OpenSM version and exits.

       -F, --config <config file>
              The name of the OpenSM config file. When not specified,
              /etc/rdma/opensm.conf will be used (if it exists).

       -c, --create-config <file name>
              OpenSM will dump its configuration to the specified file and
              exit. This is a way to generate an OpenSM configuration file
              template.

       -g, --guid <GUID in hex>
              This option specifies the local port GUID value with which
              OpenSM should bind. OpenSM may be bound to 1 port at a time.
              If the GUID given is 0, OpenSM displays a list of possible port
              GUIDs and waits for user input. Without -g, OpenSM tries to use
              the default port.

       -l, --lmc <LMC value>
              This option specifies the subnet's LMC value. The number of
              LIDs assigned to each port is 2^LMC. The LMC value must be in
              the range 0-7. LMC values > 0 allow multiple paths between
              ports. LMC values > 0 should only be used if the subnet
              topology actually provides multiple paths between ports, i.e.
              multiple interconnects between switches. Without -l, OpenSM
              defaults to LMC = 0, which allows one path between any two
              ports.
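
              The 2^LMC relationship can be tabulated with a short Python
              sketch. The function name lids_per_port is hypothetical; only
              the formula and the 0-7 range come from the text above.

```python
def lids_per_port(lmc: int) -> int:
    """Number of LIDs assigned to each port for a given LMC value."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 2 ** lmc


for lmc in (0, 1, 3, 7):
    print(f"LMC={lmc}: {lids_per_port(lmc)} LID(s) per port")
```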

       -p, --priority <Priority value>
              This option specifies the SM's PRIORITY. This will affect the
              handover cases, where the master is chosen by priority and GUID.
              Range goes from 0 (default and lowest priority) to 15 (highest).
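
              A minimal Python sketch of choosing a master by priority and
              GUID follows. It assumes the highest priority wins and that a
              tie goes to the lowest GUID; the text above states only that
              both values are used, so the tie-break direction is an
              assumption here, and the names SmCandidate and elect_master
              are hypothetical.

```python
from typing import Iterable, NamedTuple


class SmCandidate(NamedTuple):
    guid: int
    priority: int  # 0 (lowest) to 15 (highest)


def elect_master(candidates: Iterable[SmCandidate]) -> SmCandidate:
    """Pick the highest priority; break ties with the lowest GUID (assumed)."""
    return min(candidates, key=lambda sm: (-sm.priority, sm.guid))
```

              With this rule, raising -p on one SM makes it win any handover
              against SMs left at the default priority of 0.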

       --subnet_prefix <PREFIX in hex>
              This option specifies the subnet prefix to use on the fabric.
              The default prefix is 0xfe80000000000000. OpenMPI in particular
              requires separate fabrics plugged into different ports to have
              different prefixes or else it won't run.

       --smkey <SM_Key value>
              This option specifies the SM's SM_Key (64 bits). This will
              affect SM authentication. Note that OpenSM version 3.2.1 and
              below used the default value '1' in host byte order. This is
              fixed now, but you may need this option to interoperate with an
              old OpenSM running on a little endian machine.

       --sm_sl <SL number>
              This option sets the SL to use for communication with the SM/SA.
              Defaults to 0.

       -r, --reassign_lids
              This option causes OpenSM to reassign LIDs to all end nodes.
              Specifying -r on a running subnet may disrupt subnet traffic.
              Without -r, OpenSM attempts to preserve existing LID
              assignments, resolving multiple uses of the same LID.

       -R, --routing_engine <Routing engine names>
              This option chooses routing engine(s) to use instead of the Min
              Hop algorithm (default). Multiple routing engines can be
              specified, separated by commas, so that a specific ordering of
              routing algorithms will be tried if earlier routing engines
              fail. If all configured routing engines fail, OpenSM will always
              attempt to route with Min Hop unless 'no_fallback' is included
              in the list of routing engines. Supported engines: minhop,
              updn, dnup, file, ftree, lash, dor, torus-2QoS, dfsssp, sssp.
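
              The comma-separated engine list can be assembled as in the
              Python sketch below. The helper routing_engine_arg is
              hypothetical, and placing 'no_fallback' at the end of the list
              is an assumption; the text above requires only that it appear
              in the list.

```python
SUPPORTED_ENGINES = (
    "minhop", "updn", "dnup", "file", "ftree",
    "lash", "dor", "torus-2QoS", "dfsssp", "sssp",
)


def routing_engine_arg(engines, fallback_to_minhop=True):
    """Build the value passed to -R from an ordered list of engine names."""
    unknown = [name for name in engines if name not in SUPPORTED_ENGINES]
    if unknown:
        raise ValueError(f"unsupported routing engine(s): {unknown}")
    names = list(engines)
    if not fallback_to_minhop:
        # Assumed position: appended after the real engines.
        names.append("no_fallback")
    return ",".join(names)


print("opensm -R " + routing_engine_arg(["ftree", "updn"]))
```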

       --do_mesh_analysis
              This option enables additional analysis for the lash routing
              engine to precondition switch port assignments in regular
              cartesian meshes, which may reduce the number of SLs required
              to give a deadlock free routing.

       --lash_start_vl <vl number>
              This option sets the starting VL to use for the lash routing
              algorithm. Defaults to 0.

       -A, --ucast_cache
              This option enables the unicast routing cache and prevents
              routing recalculation (which is a heavy task in a large cluster)
              when no topology change was detected during the heavy sweep, or
              when the topology change does not require new routing
              calculation, e.g. when one or more CAs/RTRs/leaf switches go
              down, or one or more of these nodes come back after being down.
              A very common case that is handled by the unicast routing cache
              is host reboot, which otherwise would cause two full routing
              recalculations: one when the host goes down, and the other when
              the host comes back online.

       -z, --connect_roots
              This option forces the routing engines (up/down and fat-tree)
              to make connectivity between root switches and in this way to
              be fully IBA compliant. In many cases this can violate the
              "pure" deadlock free algorithm, so use it carefully.

       -M, --lid_matrix_file <file name>
              This option specifies the name of the lid matrix dump file from
              which switch lid matrices (min hops tables) will be loaded.

       -U, --lfts_file <file name>
              This option specifies the name of the LFTs file from which
              switch forwarding tables will be loaded when using the "file"
              routing engine.

       -S, --sadb_file <file name>
              This option specifies the name of the SA DB dump file from
              which the SA database will be loaded.

       -a, --root_guid_file <file name>
              Set the root nodes for the Up/Down or Fat-Tree routing algorithm
              to the guids provided in the given file (one to a line).

       -u, --cn_guid_file <file name>
              Set the compute nodes for the Fat-Tree or DFSSSP/SSSP routing
              algorithms to the port GUIDs provided in the given file (one to
              a line).

       -G, --io_guid_file <file name>
              Set the I/O nodes for the Fat-Tree or DFSSSP/SSSP routing
              algorithms to the port GUIDs provided in the given file (one to
              a line).
              In the case of Fat-Tree routing:
              I/O nodes are non-CN nodes allowed to use up to max_reverse_hops
              switches the wrong way around to improve connectivity.
              In the case of (DF)SSSP routing:
              Providing guids of compute and/or I/O nodes will ensure that
              paths towards those nodes are as much separated as possible
              within their node category, i.e., I/O traffic will not share the
              same link if multiple links are available.

       --port-shifting
              This option enables a feature called port shifting. In some
              fabrics, particularly cluster environments, routes commonly
              align and congest with other routes due to algorithmically
              unchanging traffic patterns. This routing option will "shift"
              routing around in an attempt to alleviate this problem.

       --scatter-ports <random seed>
              This option is used to randomize port selection in routing
              rather than using a round-robin algorithm (which is the
              default). The value supplied with the option is used as a
              random seed. If the value is 0, which is the default, the
              scatter ports option is disabled.

       -H, --max_reverse_hops <max reverse hops allowed>
              Set the maximum number of reverse hops an I/O node is allowed to
              make. A reverse hop is the use of a switch the wrong way around.

       -m, --ids_guid_file <file name>
              Name of the map file with the set of IDs which will be used by
              the Up/Down routing algorithm instead of node GUIDs (format:
              <guid> <id> per line).

       -X, --guid_routing_order_file <file name>
              Set the order in which port guids will be routed for the MinHop
              and Up/Down routing algorithms to the guids provided in the
              given file (one to a line).

       -o, --once
              This option causes OpenSM to configure the subnet once, then
              exit. Ports remain in the ACTIVE state.

       -s, --sweep <interval value>
              This option specifies the number of seconds between subnet
              sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM
              defaults to a sweep interval of 10 seconds.

       -t, --timeout <value>
              This option specifies the time in milliseconds used for
              transaction timeouts. Timeout values should be > 0. Without -t,
              OpenSM defaults to a timeout value of 200 milliseconds.

       --retries <number>
              This option specifies the number of retries used for
              transactions. Without --retries, OpenSM defaults to 3 retries
              for transactions.

       --maxsmps <number>
              This option specifies the number of VL15 SMP MADs allowed on the
              wire at any one time. Specifying --maxsmps 0 allows unlimited
              outstanding SMPs. Without --maxsmps, OpenSM defaults to a
              maximum of 4 outstanding SMPs.

       --console [off | local | loopback | socket]
              This option brings up the OpenSM console (default off). Note
              that loopback and socket open a socket which can be connected to
              WITHOUT CREDENTIALS. Loopback is safer if access to your SM
              host is controlled. tcp_wrappers (hosts.[allow|deny]) is used
              with loopback and socket. loopback and socket will only be
              available if OpenSM was built with --enable-console-loopback
              (default yes) and --enable-console-socket (default no)
              respectively.

       --console-port <port>
              Specify an alternate telnet port for the socket console (default
              10000). Note that this option only appears if OpenSM was built
              with --enable-console-socket.

       -i, --ignore_guids <equalize-ignore-guids-file>
              This option provides the means to define a set of ports (by node
              guid and port number) that will be ignored by the link load
              equalization algorithm.

       -w, --hop_weights_file <path to file>
              This option provides weighting factors per port representing a
              hop cost in computing the lid matrix. The file consists of
              lines containing a switch port GUID (specified as a 64 bit hex
              number, with leading 0x), output port number, and weighting
              factor. Any port not listed in the file defaults to a weighting
              factor of 1. Lines starting with # are comments. Weights
              affect only the output route from the port, so many useful
              configurations will require weights to be specified in pairs.
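
              To make the file layout concrete, the Python sketch below
              parses text in the format described above. It assumes the
              three fields are separated by whitespace, which the text does
              not state explicitly; the function names and the sample GUID
              are hypothetical.

```python
def parse_hop_weights(text: str) -> dict:
    """Map (switch port GUID, output port number) to its weighting factor."""
    weights = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        guid_text, port_text, weight_text = line.split()
        # int(..., 16) accepts the leading 0x required by the format.
        weights[(int(guid_text, 16), int(port_text))] = int(weight_text)
    return weights


def hop_weight(weights: dict, guid: int, port: int) -> int:
    """Ports not listed in the file default to a weighting factor of 1."""
    return weights.get((guid, port), 1)


sample = """# make output port 3 of this switch more expensive
0x0002c90200000001 3 4
"""
table = parse_hop_weights(sample)
print(hop_weight(table, 0x0002C90200000001, 3))
print(hop_weight(table, 0x0002C90200000001, 5))
```

              Since a weight affects only the output direction, a symmetric
              cost on a link needs a second line for the port at the other
              end.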

       -O, --port_search_ordering_file <path to file>
              This option tweaks the routing. It is suitable for two cases:

              1. While using the DOR routing algorithm. This option provides
              a mapping between hypercube dimensions and ports on a per
              switch basis for the DOR routing engine. The file consists of
              lines containing a switch node GUID (specified as a 64 bit hex
              number, with leading 0x) followed by a list of non-zero port
              numbers, separated by spaces, one switch per line. The order
              for the port numbers is in one to one correspondence to the
              dimensions. Ports not listed on a line are assigned to the
              remaining dimensions, in port order. Anything after a # is a
              comment.

              2. While using a general routing algorithm. This option
              provides the order of the ports that would be chosen for
              routing, from each switch, rather than searching for an
              appropriate port from port 1 to N. The file consists of lines
              containing a switch node GUID (specified as a 64 bit hex
              number, with leading 0x) followed by a list of non-zero port
              numbers, separated by spaces, one switch per line. In case of
              DOR, the order for the port numbers is in one to one
              correspondence to the dimensions. Ports not listed on a line
              are assigned to the remaining dimensions, in port order.
              Anything after a # is a comment.
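
              A file in this layout can be generated with a few lines of
              Python, sketched below. The function name and the GUID are
              hypothetical; zero-padding the GUID to 16 hex digits is a
              presentation choice, not a documented requirement.

```python
def format_port_ordering(ordering: dict) -> str:
    """Render {switch node GUID: [port, ...]} as one line per switch."""
    lines = []
    for guid, ports in ordering.items():
        if any(port <= 0 for port in ports):
            raise ValueError("port numbers must be non-zero")
        port_list = " ".join(str(port) for port in ports)
        lines.append(f"0x{guid:016x} {port_list}")
    return "\n".join(lines) + "\n"


print(format_port_ordering({0x0002C90200000001: [3, 1, 2]}), end="")
```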

       -O, --dimn_ports_file <path to file> (DEPRECATED)
              This is a deprecated flag. Please use
              --port_search_ordering_file instead. This option provides a
              mapping between hypercube dimensions and ports on a per switch
              basis for the DOR routing engine. The file consists of lines
              containing a switch node GUID (specified as a 64 bit hex number,
              with leading 0x) followed by a list of non-zero port numbers,
              separated by spaces, one switch per line. The order for the
              port numbers is in one to one correspondence to the dimensions.
              Ports not listed on a line are assigned to the remaining
              dimensions, in port order. Anything after a # is a comment.

       -x, --honor_guid2lid
              This option forces OpenSM to honor the guid2lid file when it
              comes out of Standby state, if such a file exists under
              OSM_CACHE_DIR and is valid. By default, this is FALSE.

       -f, --log_file <file name>
              This option defines the log to be the given file. By default,
              the log goes to /var/log/opensm.log. For the log to go to
              standard output use -f stdout.

       -L, --log_limit <size in MB>
              This option defines the maximal log file size in MB. When
              specified, the log file will be truncated upon reaching this
              limit.

       -e, --erase_log_file
              This option will cause deletion of the log file (if it already
              exists). By default, the log file accumulates across runs.

       -P, --Pconfig <partition config file>
              This option defines the optional partition configuration file.
              The default name is /etc/rdma/partitions.conf.

       --prefix_routes_file <file name>
              Prefix routes control how the SA responds to path record queries
              for off-subnet DGIDs. By default, the SA fails such queries.
              The PREFIX ROUTES section below describes the format of the
              configuration file. The default path is
              /etc/rdma/prefix-routes.conf.

       -Q, --qos
              This option enables QoS setup. It is disabled by default.

       -Y, --qos_policy_file <file name>
              This option defines the optional QoS policy file. The default
              name is /etc/rdma/qos-policy.conf. See
              QoS_management_in_OpenSM.txt in opensm doc for more information
              on configuring QoS policy via this file.

       --congestion_control
              (EXPERIMENTAL) This option enables congestion control
              configuration. It is disabled by default. See config file for
              congestion control configuration options.

       --cc_key <key>
              (EXPERIMENTAL) This option configures the CCkey to use when
              configuring congestion control. Note that this option does not
              configure a new CCkey into switches and CAs. Defaults to 0.

       -N, --no_part_enforce (DEPRECATED)
              This is a deprecated flag. Please use --part_enforce instead.
              This option disables partition enforcement on switch external
              ports.

       -Z, --part_enforce [both | in | out | off]
              This option indicates the partition enforcement type (for
              switches). Enforcement type can be inbound only (in), outbound
              only (out), both or disabled (off). Default is both.

       -W, --allow_both_pkeys
              This option indicates whether both full and limited membership
              on the same partition can be configured in the PKeyTable.
              Default is not to allow both pkeys.

       -y, --stay_on_fatal
              This option will cause SM not to exit on fatal initialization
              issues: if SM discovers duplicated guids or a 12x link with lane
              reversal badly configured. By default, the SM will exit on
              these errors.

       -B, --daemon
              Run in daemon mode - OpenSM will run in the background.

       -J, --pidfile <file_name>
              Makes the SM write its own PID to the specified file when
              started in daemon mode.

       -I, --inactive
              Start SM in inactive rather than init SM state. This option can
              be used in conjunction with the perfmgr so as to run a
              standalone performance manager without SM/SA. However, this is
              NOT currently implemented in the performance manager.

       --perfmgr
              Enable the perfmgr. Only takes effect if --enable-perfmgr was
              specified at configure time. See performance-manager-HOWTO.txt
              in opensm doc for more information on running perfmgr.

       --perfmgr_sweep_time_s <seconds>
              Specify the sweep time for the performance manager in seconds
              (default is 180 seconds). Only takes effect if --enable-perfmgr
              was specified at configure time.

       --consolidate_ipv6_snm_req
              Use shared MLID for IPv6 Solicited Node Multicast groups per
              MGID scope and P_Key.

       --log_prefix <prefix text>
              This option specifies the prefix to the syslog messages from
              OpenSM. A suitable prefix can be used to identify the IB subnet
              in syslog messages when two or more instances of OpenSM run in a
              single node to manage multiple fabrics. For example, in a
              dual-fabric (or dual-rail) IB cluster, the prefix for the first
              fabric could be "mpi" and the other fabric could be "storage".

       --torus_config <path to torus-2QoS config file>
              This option defines the file name for the extra configuration
              information needed for the torus-2QoS routing engine. The
              default name is /etc/rdma/torus-2QoS.conf

       -v, --verbose
              This option increases the log verbosity level. The -v option
              may be specified multiple times to further increase the
              verbosity level. See the -D option for more information about
              log verbosity.

       -V     This option sets the maximum verbosity level and forces log
              flushing. The -V option is equivalent to '-D 0xFF -d 2'. See
              the -D option for more information about log verbosity.

       -D <value>
              This option sets the log verbosity level. A flags field must
              follow the -D option. A bit set/clear in the flags
              enables/disables a specific log level as follows:

               BIT    LOG LEVEL ENABLED
               ----   -----------------
               0x01 - ERROR (error messages)
               0x02 - INFO (basic messages, low volume)
               0x04 - VERBOSE (interesting stuff, moderate volume)
               0x08 - DEBUG (diagnostic, high volume)
               0x10 - FUNCS (function entry/exit, very high volume)
               0x20 - FRAMES (dumps all SMP and GMP frames)
               0x40 - ROUTING (dump FDB routing information)
               0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM
                      logging)

              Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying
              -D 0 disables all messages. Specifying -D 0xFF enables all
              messages (see -V). High verbosity levels may require increasing
              the transaction timeout with the -t option.
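
              The flags value is a bitwise OR of the levels in the table.
              The Python sketch below composes it by name; the helper
              log_flags is hypothetical, while the bit values are copied
              from the table above.

```python
LOG_LEVELS = {
    "ERROR": 0x01,
    "INFO": 0x02,
    "VERBOSE": 0x04,
    "DEBUG": 0x08,
    "FUNCS": 0x10,
    "FRAMES": 0x20,
    "ROUTING": 0x40,
    "SYS": 0x80,
}


def log_flags(*levels: str) -> int:
    """OR together the bits for the named log levels."""
    mask = 0
    for level in levels:
        mask |= LOG_LEVELS[level]
    return mask


# The documented default (ERROR + INFO) plus routing dumps.
print(f"opensm -D 0x{log_flags('ERROR', 'INFO', 'ROUTING'):02X}")
```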

       -d, --debug <value>
              This option specifies a debug option. These options are not
              normally needed. The number following -d selects the debug
              option to enable as follows:

               OPT    Description
               ---    -----------------
               -d0  - Ignore other SM nodes
               -d1  - Force single threaded dispatching
               -d2  - Force log flushing after each log message
               -d3  - Disable multicast support

       -h, --help
              Display this usage info then exit.

       -?     Display this usage info then exit.

ENVIRONMENT VARIABLES

       The following environment variables control opensm behavior:

       OSM_TMP_DIR - controls the directory in which the temporary files
       generated by opensm are created. These files are: opensm-subnet.lst,
       opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.

       OSM_CACHE_DIR - opensm stores certain data to the disk such that
       subsequent runs are consistent. The default directory used is
       /var/cache/opensm. The following files are included in it:

        guid2lid  - stores the LID range assigned to each GUID
        guid2mkey - stores the MKey previously assigned to each GUID
        neighbors - stores a map of the GUIDs at either end of each link
                    in the fabric
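
       The locations that result from these two variables can be listed
       with the Python sketch below. The helper opensm_paths is
       hypothetical; the file names and default directories are the ones
       given above.

```python
import os
from pathlib import PurePosixPath

TMP_FILES = ("opensm-subnet.lst", "opensm.fdbs", "opensm.mcfdbs")
CACHE_FILES = ("guid2lid", "guid2mkey", "neighbors")


def opensm_paths(env=None):
    """Return the temporary and cache file paths for a given environment."""
    env = os.environ if env is None else env
    tmp_dir = PurePosixPath(env.get("OSM_TMP_DIR", "/var/log"))
    cache_dir = PurePosixPath(env.get("OSM_CACHE_DIR", "/var/cache/opensm"))
    paths = [tmp_dir / name for name in TMP_FILES]
    paths += [cache_dir / name for name in CACHE_FILES]
    return paths


for path in opensm_paths():
    print(path)
```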

NOTES

       When opensm receives a HUP signal, it starts a new heavy sweep as if a
       trap was received or a topology change was found.

       Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log
       for logrotate purposes.
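
       Both signals can be sent from a script, as in the Python sketch
       below. It assumes opensm was started in daemon mode with -J so that
       a PID file exists; the function names and any PID file path passed
       to them are hypothetical.

```python
import os
import signal
from pathlib import Path


def signal_opensm(pidfile: str, sig: int) -> int:
    """Send sig to the process whose PID is stored in pidfile."""
    pid = int(Path(pidfile).read_text().strip())
    os.kill(pid, sig)
    return pid


def request_heavy_sweep(pidfile: str) -> int:
    """HUP: start a new heavy sweep."""
    return signal_opensm(pidfile, signal.SIGHUP)


def request_log_reopen(pidfile: str) -> int:
    """SIGUSR1: reopen the log file, e.g. after logrotate moved it."""
    return signal_opensm(pidfile, signal.SIGUSR1)
```

       A logrotate postrotate hook would typically call the equivalent of
       request_log_reopen after moving the old log aside.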

PARTITION CONFIGURATION

       The default name of the OpenSM partitions configuration file is
       /etc/rdma/partitions.conf. The default may be changed by using the
       --Pconfig (-P) option with OpenSM.

       The default partition will be created by OpenSM unconditionally, even
       when the partition configuration file does not exist or cannot be
       accessed.

       The default partition has P_Key value 0x7fff. OpenSM's port will always
       have full membership in the default partition. All other end ports will
       have full membership if the partition configuration file is not found
       or cannot be accessed, or limited membership if the file exists and can
       be accessed but there is no rule for the Default partition.

       Effectively, this amounts to the same as if one of the following rules
       appeared in the partition configuration file.

       In the case of no rule for the Default partition:

       Default=0x7fff : ALL=limited, SELF=full ;

       In the case of no partition configuration file or file cannot be
       accessed:

       Default=0x7fff : ALL=full ;
528
529       File Format
530
531       Comments:
532
533       Line content followed after ´#´ character is  comment  and  ignored  by
534       parser.
535
536       General file format:
537
538       <Partition Definition>:[<newline>]<Partition Properties>;
539
540            Partition Definition:
541              [PartitionName][=PKey][,indx0][,ipoib_bc_flags][,defmem‐
542       ber=full|limited]
543
544               PartitionName  - string, will be used with logging. When
545                                omitted, empty string will be used.
546               PKey           - P_Key value for this partition. Only low 15
547                                bits will be used. When omitted will be
548                                autogenerated.
549               indx0          - indicates that this pkey should be inserted in
550                                block 0 index 0.
551               ipoib_bc_flags - used to indicate/specify IPoIB capability of
552                                this partition.
553
554               defmember=full|limited|both - specifies default membership for
555                                port guid list. Default is limited.
556
557            ipoib_bc_flags:
558               ipoib_flag|[mgroup_flag]*
559
560               ipoib_flag:
561                   ipoib  - indicates that this partition may be used for
562                            IPoIB, as a result the IPoIB broadcast group will
563                            be created with the mgroup_flag flags given,
564                            if any.
565
566            Partition Properties:
567              [<Port list>|<MCast Group>]* | <Port list>
568
569            Port list:
570               <Port Specifier>[,<Port Specifier>]
571
572            Port Specifier:
573               <PortGUID>[=[full|limited|both]]
574
575               PortGUID         - GUID of partition member EndPort.
576                                  Hexadecimal numbers should start from
577                                  0x, decimal numbers are accepted too.
578               full, limited,   - indicates full and/or limited membership for
579               both               this port.  When omitted (or unrecognized)
580                                  limited membership is assumed.  Both
581                                  indicates both full and limited membership
582                                  for this port.
583
584            MCast Group:
585               mgid=gid[,mgroup_flag]*<newline>
586
587                                - gid specified is verified to be a Multicast
588                                  address.  IP groups are verified to match
589                                  the rate and mtu of the broadcast group.
590                                  The P_Key bits of the mgid for IP groups are
591                                  verified to either match the P_Key specified
592                                  in by "Partition Definition" or if they are
593                                  0x0000 the P_Key will be copied into those
594                                  bits.
595
596            mgroup_flag:
597               rate=<val>  - specifies rate for this MC group
598                             (default is 3 (10 Gbps))
599               mtu=<val>   - specifies MTU for this MC group
600                             (default is 4 (2048))
601               sl=<val>    - specifies SL for this MC group
602                             (default is 0)
603               scope=<val> - specifies scope for this MC group
604                             (default is 2 (link local)).  Multiple scope
605                             settings are permitted for a partition.
606                             NOTE: This overwrites the scope nibble of the
607                                   specified mgid.  Furthermore specifying
608                                   multiple scope settings will result in
609                                   multiple MC groups being created.
610               Q_Key=<val>     - specifies the Q_Key for this MC group
611                                 (default: 0x0b1b for IP groups, 0 for other
612                                  groups)
613                                 WARNING: changing this for the broadcast
614                                          group may break IPoIB on client
615                                          nodes!!
616               TClass=<val>    - specifies tclass for this MC group
617                                 (default is 0)
618               FlowLabel=<val> - specifies FlowLabel for this MC group
619                                 (default is 0)
620
621       Note that values for rate, mtu, and scope, for both partitions and mul‐
622       ticast groups, should be specified as defined in the IBTA specification
623       (for example, mtu=4 for 2048).
624
625       There are several useful keywords for PortGUID definition:
626
627        - 'ALL' means all end ports in this subnet.
628        - 'ALL_CAS' means all Channel Adapter end ports in this subnet.
629        - 'ALL_SWITCHES' means all Switch end ports in this subnet.
630        - 'ALL_ROUTERS' means all Router end ports in this subnet.
631        - 'SELF' means subnet manager's port.
632
633       Empty list means no ports in this partition.
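
       The port list grammar above can be sketched as a small parser.  This is
       an illustration only, not OpenSM's actual parser: the function names
       are hypothetical, and it assumes that an omitted membership takes the
       partition's defmember value while an unrecognized one falls back to
       limited.

```python
# Keywords the text lists as accepted in place of a numeric PortGUID.
KEYWORDS = {"ALL", "ALL_CAS", "ALL_SWITCHES", "ALL_ROUTERS", "SELF"}
MEMBERSHIPS = {"full", "limited", "both"}


def parse_port_specifier(spec, defmember="limited"):
    """Split '<PortGUID>[=membership]' into (port, membership).

    Hypothetical helper, not part of OpenSM.
    """
    port, _, member = (part.strip() for part in spec.partition("="))
    if port not in KEYWORDS:
        # Hexadecimal GUIDs start with 0x; anything else is read as decimal.
        port = int(port, 16) if port.lower().startswith("0x") else int(port, 10)
    if not member:
        member = defmember        # omitted: use the partition default
    elif member not in MEMBERSHIPS:
        member = "limited"        # unrecognized: assume limited membership
    return port, member


def parse_port_list(text, defmember="limited"):
    """Parse a comma-separated port list; an empty list yields no ports."""
    return [parse_port_specifier(s, defmember)
            for s in text.split(",") if s.strip()]
```

       For instance, parse_port_list("ALL, SELF=full") yields ALL with limited
       membership and SELF with full membership.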
634
635       Notes:
636
637       White space is permitted between delimiters ('=', ',',':',';').
638
639       PartitionName does not need to be unique, PKey does need to be  unique.
640       If  PKey is repeated then those partition configurations will be merged
641       and first PartitionName will be used (see also next note).
642
643       It is possible to split partition configuration in more than one  defi‐
644       nition, but then PKey should be explicitly specified (otherwise differ‐
645       ent PKey values will be generated for those definitions).
646
647       Examples:
648
649        Default=0x7fff : ALL, SELF=full ;
650        Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ;
651
652        NewPartition , ipoib : 0x123456=full, 0x3456789034=limi,  0x2134af2306
653       ;
654
655        YetAnotherOne = 0x300 : SELF=full ;
656        YetAnotherOne = 0x300 : ALL=limited ;
657
658        ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
659        # 0x123453, 0x123454 will be limited
660        ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
661        # 0x123456, 0x123457 will be limited
662        ShareIO   =   0x80   :   defmember=limited   :   0x123456,   0x123457,
663       0x123458=full;
664        ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
665        ShareIO  =  0x80  ,  defmember=full  :   0x12345b,   0x12345c=limited,
666       0x12345d;
667
668        # multicast groups added to default
669        Default=0x7fff,ipoib:
670               mgid=ff12:401b::0707,sl=1 # random IPv4 group
671               mgid=ff12:601b::16    # MLDv2-capable routers
672               mgid=ff12:401b::16    # IGMP
673               mgid=ff12:601b::2     # All routers
674               mgid=ff12::1,sl=1,Q_Key=0xDEADBEEF,rate=3,mtu=2 # random group
675               ALL=full;
676
677
678       Note:
679
680       The following rule is equivalent to how OpenSM used to run prior to the
681       partition manager:
682
683        Default=0x7fff,ipoib:ALL=full;
684
685

QOS CONFIGURATION

687       There is a set of QoS related low-level configuration parameters.  All
688       these  parameter  names  are  prefixed by "qos_" string. Here is a full
689       list of these parameters:
690
691        qos_max_vls    - The maximum number of VLs that will be on the subnet
692        qos_high_limit - The limit of High Priority component of VL
693                         Arbitration table (IBA 7.6.9)
694        qos_vlarb_low  - Low priority VL Arbitration table (IBA 7.6.9)
695                         template
696        qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
697                         template
698                         Both VL arbitration templates are pairs of
699                         VL and weight
700        qos_sl2vl      - SL2VL Mapping table (IBA 7.6.6) template. It is
701                         a list of VLs corresponding to SLs 0-15 (Note
702                         that VL15 used here means drop this SL)
703
704       Typical default values (hard-coded in OpenSM initialization) are:
705
706        qos_max_vls 15
707        qos_high_limit 0
708        qos_vlarb_low
709       0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
710        qos_vlarb_high
711       0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
712        qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
713
714       The syntax is compatible with rest of OpenSM configuration options  and
715       values may be stored in OpenSM config file (cached options file).
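
       To make the template formats concrete, the following sketch decodes a
       VL arbitration template and an SL2VL template into Python structures.
       The function names are hypothetical and the parsing is deliberately
       minimal; it is not how OpenSM itself reads these options.

```python
def parse_vlarb(template):
    """Turn 'VL:weight,VL:weight,...' into a list of (vl, weight) pairs."""
    pairs = []
    for item in template.split(","):
        vl, weight = item.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


def parse_sl2vl(template):
    """Turn a comma-separated VL list into {sl: vl}, SL 0 first."""
    vls = [int(v) for v in template.split(",") if v.strip()]
    return dict(enumerate(vls))


def dropped_sls(sl2vl):
    """SLs mapped to VL15, which the text says means 'drop this SL'."""
    return [sl for sl, vl in sl2vl.items() if vl == 15]


# The typical default SL2VL template quoted above.
default_sl2vl = parse_sl2vl("0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7")
```

       With the default template no SL maps to VL15, so dropped_sls returns an
       empty list, and SL 15 shares VL 7 with SL 7.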
716
717       In  addition  to  the  above,  we may define separate QoS configuration
718       parameters sets for various target types. As targets, we currently sup‐
719       port CAs, routers, switch external ports, and switch's enhanced port 0.
720       The names of such specialized parameters are prefixed by  "qos_<type>_"
721       string. Here is a full list of the currently supported sets:
722
723        qos_ca_  - QoS configuration parameters set for CAs.
724        qos_rtr_ - parameters set for routers.
725        qos_sw0_ - parameters set for switches' port 0.
726        qos_swe_ - parameters set for switches' external ports.
727
728       Examples:
729        qos_sw0_max_vls=2
730        qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
731        qos_swe_high_limit=0
732
733

PREFIX ROUTES

735       Prefix  routes  control  how the SA responds to path record queries for
736       off-subnet DGIDs.  By default, the SA fails such  queries.   Note  that
737       IBA  does  not  specify how the SA should obtain off-subnet path record
738       information.  The prefix routes configuration is meant  as  a  stop-gap
739       until the specification is completed.
740
741       Each  line  in  the configuration file is a 64-bit prefix followed by a
742       64-bit GUID, separated by white space.  The GUID specifies  the  router
743       port  on the local subnet that will handle the prefix.  Blank lines are
744       ignored, as is anything between a # character and the end of the  line.
745       The  prefix  and  GUID  are  both  in  hex, the leading 0x is optional.
746       Either, or both, can be wild-carded by specifying an  asterisk  instead
747       of an explicit prefix or GUID.
748
749       When  responding  to a path record query for an off-subnet DGID, opensm
750       searches for the first prefix match in the configuration file.   There‐
751       fore,  the order of the lines in the configuration file is important: a
752       wild-carded prefix at the beginning of the configuration  file  renders
753       all  subsequent lines useless.  If there is no match, then opensm fails
754       the query.  It is legal to repeat prefixes in the  configuration  file,
755       opensm  will return the path to the first available matching router.  A
756       configuration file with a single line where both prefix  and  GUID  are
757       wild-carded  means  that  a path record query specifying any off-subnet
758       DGID should return a path to the first available router.  This configu‐
759       ration  yields  the same behavior formerly achieved by compiling opensm
760       with -DROUTER_EXP which has been obsoleted.
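
       The lookup described above can be sketched as follows.  This is one
       reading of the text, not OpenSM's implementation: the function names
       are hypothetical, the list of available router GUIDs is assumed to be
       supplied by the caller, and a matching line whose router is not
       available is assumed to be skipped in favor of the next matching line.

```python
def parse_prefix_routes(text):
    """Parse lines of '<prefix> <guid>'; '*' is a wildcard (stored as None)."""
    routes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue                           # skip blank lines
        prefix, guid = line.split()
        # Values are hex; int(x, 16) accepts them with or without 0x.
        routes.append(tuple(None if field == "*" else int(field, 16)
                            for field in (prefix, guid)))
    return routes


def find_router(routes, dgid_prefix, available_guids):
    """Return the router GUID of the first matching, available route."""
    for prefix, guid in routes:
        if prefix is not None and prefix != dgid_prefix:
            continue
        if guid is None:
            # Wild-carded GUID: any available router will do.
            if available_guids:
                return available_guids[0]
        elif guid in available_guids:
            return guid
    return None   # no match: the query fails
```

       A file containing only the line "* *" therefore sends every off-subnet
       query to the first router in available_guids.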
761
762

MKEY CONFIGURATION

764       OpenSM supports configuring a single  management  key  (MKey)  for  use
765       across the subnet.
766
767       The following configuration options are available:
768
769        m_key                  - the 64-bit MKey to be used on the subnet
770                                 (IBA 14.2.4)
771        m_key_protection_level - the numeric value of the MKey ProtectBits
772                                 (IBA 14.2.4.1)
773        m_key_lease_period     - the number of seconds a CA will wait for a
774                                 response from the SM before resetting the
775                                 protection level to 0 (IBA 14.2.4.2).
776
777       OpenSM  will  configure  all  ports  with  the MKey specified by m_key,
778       defaulting to a value of 0. A m_key value of 0 disables MKey protection
779       on  the subnet.  Switches and HCAs with a non-zero MKey will not accept
780       requests to change their configuration unless the request includes  the
781       proper MKey.
782
783       MKey Protection Levels
784
785       MKey  protection  levels  modify  how  switches and CAs respond to SMPs
786       lacking a valid MKey.  OpenSM will configure each port's ProtectBits to
787       support  the level defined by the m_key_protection_level parameter.  If
788       no parameter is specified, OpenSM defaults to operating  at  protection
789       level 0.
790
791       There are currently 4 protection levels defined by the IBA:
792
793        0 - Queries return valid data, including MKey.  Configuration changes
794            are not allowed unless the request contains a valid MKey.
795        1 - Like level 0, but the MKey is set to 0 (0x00000000) in queries,
796            unless the request contains a valid MKey.
797        2 - Neither queries nor configuration changes are allowed, unless the
798            request contains a valid MKey.
799        3 - Identical to 2.  Maintained for backwards compatibility.
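
       The four levels can be summarized as a decision function.  The sketch
       below only restates the list above; the function name, the "get"/"set"
       method labels, and the return convention are hypothetical and do not
       correspond to any OpenSM or verbs API.

```python
def smp_response(level, port_mkey, request_mkey, method):
    """Describe how a port answers an SMP at a given protection level.

    method is "get" (query) or "set" (configuration change). Returns
    (allowed, mkey_seen_by_requester). Illustrative helper only.
    """
    if port_mkey == 0 or request_mkey == port_mkey:
        return True, port_mkey    # protection disabled, or valid MKey supplied
    if method == "set":
        return False, None        # changes always need a valid MKey
    if level == 0:
        return True, port_mkey    # queries answered, MKey included
    if level == 1:
        return True, 0            # queries answered, MKey reads as 0
    return False, None            # levels 2 and 3: queries refused
```

       Note that a port MKey of 0 short-circuits every check, matching the
       statement that an m_key value of 0 disables MKey protection.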
800
801       MKey Lease Period
802
803       InfiniBand  supports  a  MKey lease timeout, which is intended to allow
804       administrators or a new SM to recover/reset lost MKeys on a fabric.
805
806       If MKeys are enabled on the subnet  and  a  switch  or  CA  receives  a
807       request  that  requires a valid MKey but does not contain one, it warns
808       the SM by sending a trap (Bad M_Key, Trap  256).   If  the  MKey  lease
809       period is non-zero, it also starts a countdown timer for the time spec‐
810       ified by the lease period.  If a SM (or other agent) responds with  the
811       correct  MKey,  the timer is stopped and reset.  Should the timer reach
812       zero, the switch or CA will reset  its  MKey  protection  level  to  0,
813       exposing the MKey and allowing recovery.
814
815       OpenSM will initialize all ports to use a mkey lease period of the num‐
816       ber of seconds specified in the config file.  If  no  mkey_lease_period
817       is specified, a default of 0 will be used.
818
819       OpenSM  normally quickly responds to all Bad_M_Key traps, resetting the
820       lease timers.  Additionally, OpenSM's subnet sweeps  will  also  cancel
821       any  running  timers.   For  maximum  protection  against accidentally-
822       exposed MKeys, the MKey lease time should be a  few  multiples  of  the
823       subnet sweep time.  If OpenSM detects at startup that your sweep inter‐
824       val is greater than your MKey lease period, it  will  reset  the  lease
825       period  to  be greater than the sweep interval.  Similarly, if sweeping
826       is disabled at startup, it will be re-enabled  with  an  interval  less
827       than the MKey lease period.
828
829       If  OpenSM  is  required  to  recover  a subnet for which it is missing
830       mkeys, it must do so one switch level at a time.  As  such,  the  total
831       time to recover the subnet may be as long as the mkey lease period mul‐
832       tiplied by the maximum number of hops between the SM and  an  endpoint,
833       plus one.
834
835       MKey Effects on Diagnostic Utilities
836
837       Setting a MKey may have a detrimental effect on diagnostic software run
838       on the subnet, unless your diagnostic  software  is  able  to  retrieve
839       MKeys from the SA or can be explicitly configured with the proper MKey.
840       This is particularly true at protection level 2, where CAs will  ignore
841       queries for management information that do not contain the proper MKey.
842
843

ROUTING

845       OpenSM now offers nine routing engines:
846
847       1.   Min  Hop  Algorithm - based on the minimum hops to each node where
848       the path length is optimized.
849
850       2.  UPDN Unicast routing algorithm - also based on the minimum hops  to
851       each  node,  but  it  is  constrained  to ranking rules. This algorithm
852       should be chosen if the subnet is not a pure Fat Tree, and deadlock may
853       occur due to a loop in the subnet.
854
855       3.  DNUP Unicast routing algorithm - similar to UPDN but allows routing
856       in fabrics which have some CA nodes attached closer to the  roots  than
857       some switch nodes.
858
859       4.  Fat Tree Unicast routing algorithm - this algorithm optimizes rout‐
860       ing for congestion-free "shift" communication pattern.   It  should  be
861       chosen  if  a subnet is a symmetrical or almost symmetrical fat-tree of
862       various types,  not  just  K-ary-N-Trees:  non-constant  K,  not  fully
863       staffed,  any  Constant  Bisectional Bandwidth (CBB) ratio.  Similar to
864       UPDN, Fat Tree routing is constrained to ranking rules.
865
866       5. LASH unicast routing algorithm - uses Infiniband virtual layers (SL)
867       to  provide deadlock-free shortest-path routing while also distributing
868       the paths between layers. LASH is an alternative  deadlock-free  topol‐
869       ogy-agnostic routing algorithm to the non-minimal UPDN algorithm avoid‐
870       ing the use of a potentially congested root node.
871
872       6. DOR Unicast routing algorithm - based on the Min Hop algorithm,  but
873       avoids  port  equalization  except for redundant links between the same
874       two switches.  This provides deadlock free routes for  hypercubes  when
875       the  fabric  is  cabled  as a hypercube and for meshes when cabled as a
876       mesh (see details below).
877
878       7. Torus-2QoS unicast routing algorithm - a DOR-based routing algorithm
879       specialized  for 2D/3D torus topologies.  Torus-2QoS provides deadlock-
880       free routing while supporting two quality of service (QoS) levels.   In
881       addition  it  is able to route around multiple failed fabric links or a
882       single failed fabric switch without introducing deadlocks, and  without
883       changing path SL values granted before the failure.
884
885       8.  DFSSSP  unicast  routing algorithm - a deadlock-free single-source-
886       shortest-path routing, which uses the SSSP algorithm (see algorithm 9.)
887       as  the  base  to optimize link utilization and uses Infiniband virtual
888       lanes (SL) to provide deadlock-freedom.
889
890       9. SSSP unicast routing algorithm - a single-source-shortest-path rout‐
891       ing algorithm, which globally balances the number of routes per link to
892       optimize link utilization. This routing algorithm has  no  restrictions
893       in terms of the underlying topology.
894
895       OpenSM  also supports a file method which can load routes from a table.
896       See ´Modular Routing Engine´ for more information on this.
897
898       The basic routing algorithm is comprised of two stages:
899
900       1. MinHop matrix calculation
901          How many hops are required to get from each port to each LID ?
902          The algorithm to fill these tables is different if you run  standard
903       (min hop) or Up/Down.
904          For  standard routing, a "relaxation" algorithm is used to propagate
905       min hop from every destination LID through neighbor switches
906          For Up/Down routing, a BFS from every target is used. The BFS tracks
907       link  direction (up or down) and avoids steps that would go up after
908       a down step was used.
909
910       2. Once MinHop matrices exist, each switch is visited and for each tar‐
911       get  LID  a  decision  is made as to what port should be used to get to
912       that LID.
913          This step is common to standard and Up/Down routing. Each port has a
914       counter counting the number of target LIDs going through it.
915          When there are multiple alternative ports with same MinHop to a LID,
916       the one with fewer previously assigned LIDs is selected.
917          If LMC > 0, more  checks  are  added:  Within  each  group  of  LIDs
918       assigned to same target port,
919          a. use only ports which have same MinHop
920          b.  first prefer the ones that go to different systemImageGuid (than
921       the previous LID of the same LMC group)
922          c. if none - prefer those which go through another NodeGuid
923          d. fall back to the number of paths method (if all go to same node).
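
       The two stages can be sketched for the simple LMC = 0 case as follows.
       This is an illustration under stated assumptions, not OpenSM's code:
       the data layout (links as {switch: {port: neighbor}}, with every link
       present in both directions) and the function names are hypothetical, a
       plain BFS stands in for the "relaxation" step, and ties between
       equally loaded ports are broken by port number.

```python
from collections import deque


def min_hops(links, dest):
    """Stage 1: hop count from every switch to switch `dest` (BFS)."""
    hops = {dest: 0}
    queue = deque([dest])
    while queue:
        sw = queue.popleft()
        for neighbor in links[sw].values():
            if neighbor not in hops:
                hops[neighbor] = hops[sw] + 1
                queue.append(neighbor)
    return hops


def assign_ports(links, lid_to_switch):
    """Stage 2: pick an output port per (switch, LID), balancing port load."""
    lft = {sw: {} for sw in links}
    # Per-port counter of target LIDs already routed through that port.
    load = {sw: {port: 0 for port in ports} for sw, ports in links.items()}
    for lid, dest in sorted(lid_to_switch.items()):
        hops = min_hops(links, dest)
        for sw, ports in links.items():
            if sw == dest or sw not in hops:
                continue
            # Candidate ports: those whose neighbor is one hop closer.
            best = [p for p, n in ports.items() if hops.get(n) == hops[sw] - 1]
            port = min(best, key=lambda p: (load[sw][p], p))
            lft[sw][lid] = port
            load[sw][port] += 1
    return lft
```

       When two LIDs sit behind the same destination switch and a source
       switch has two equal-length paths, the counters send the first LID out
       one port and the second LID out the other.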
924
925       Effect of Topology Changes
926
927       OpenSM will preserve existing routing in any case  where  there  is  no
928       change in the fabric switches unless the -r (--reassign_lids) option is
929       specified.
930
931       -r
932       --reassign_lids
933                 This option causes OpenSM to reassign LIDs to all
934                 end nodes. Specifying -r on a running subnet
935                 may disrupt subnet traffic.
936                 Without -r, OpenSM attempts to preserve existing
937                 LID assignments resolving multiple use of same LID.
938
939       If a link is added or removed, OpenSM does not recalculate  the  routes
940       that  do  not  have  to change. A route has to change if the port is no
941       longer UP or no longer the MinHop. When routing changes are  performed,
942       the same algorithm for balancing the routes is invoked.
943
944       In  the  case of using the file based routing, any topology changes are
945       currently ignored. The 'file' routing engine just loads the LFTs from
946       the  file specified, with no reaction to real topology. Obviously, this
947       will not be able to recheck LIDs (by GUID) for disconnected nodes,  and
948       LFTs  for  non-existent  switches  will  be  skipped.  Multicast is not
949       affected by 'file' routing engine (this uses min hop tables).
950
951
952       Min Hop Algorithm
953
954       The Min Hop algorithm is invoked by default if no routing algorithm  is
955       specified.  It can also be invoked by specifying '-R minhop'.
956
957       The  Min  Hop algorithm is divided into two stages: computation of min-
958       hop tables on every switch and LFT output port  assignment.  Link  sub‐
959       scription  is also equalized with the ability to override based on port
960       GUID. The latter is supplied by:
961
962       -i <equalize-ignore-guids-file>
963       --ignore_guids <equalize-ignore-guids-file>
964                 This option provides the means to define a set of ports
965                 (by guid) that will be ignored by the link load
966                 equalization algorithm. Note that only endports (CA,
967                 switch port 0, and router ports) and not switch external
968                 ports are supported.
969
970       LMC awareness routes based on (remote) system or switch basis.
971
972
973       Purpose of UPDN Algorithm
974
975       The UPDN algorithm is designed to prevent deadlocks from  occurring  in
976       loops  of  the subnet. A loop-deadlock is a situation in which it is no
977       longer possible to send data between any two  hosts  connected  through
978       the  loop.  As  such,  the UPDN routing algorithm should be used if the
979       subnet is not a pure Fat Tree, and one of its loops  may  experience  a
980       deadlock (due, for example, to high pressure).
981
982       The UPDN algorithm is based on the following main stages:
983
984       1.  Auto-detect root nodes - based on the CA hop length from any switch
985       in the subnet, a statistical histogram is built for  each  switch  (hop
986       num  vs  number  of  occurrences). If the histogram reflects a specific
987       column (higher than others) for a certain node, then it is marked as  a
988       root node. Since the algorithm is statistical, it may not find any root
989       nodes. The list of the root nodes found by this  auto-detect  stage  is
990       used by the ranking process stage.
991
992           Note 1: The user can override the node list manually.
993           Note 2: If this stage cannot find any root nodes, and the user did
994                   not specify a guid list file, OpenSM defaults back to the
995                   Min Hop routing algorithm.
996
997       2.   Ranking  process  -  All  root switch nodes (found in stage 1) are
998       assigned a rank of 0. Using the BFS algorithm, the rest of  the  switch
999       nodes  in the subnet are ranked incrementally. This ranking aids in the
1000       process of enforcing rules that ensure loop-free paths.
1001
1002       3.  Min Hop Table setting - after ranking is done, a BFS  algorithm  is
1003       run  from  each  (CA  or  switch)  node  in  the subnet. During the BFS
1004       process, the FDB table of each switch node traversed by BFS is updated,
1005       in  reference to the starting node, based on the ranking rules and guid
1006       values.
1007
1008       At the end of the process, the  updated  FDB  tables  ensure  loop-free
1009       paths through the subnet.
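
       The ranking stage and the Up/Down rule it enforces can be sketched as
       below.  The function names and the adjacency-list layout are
       hypothetical.  The sketch ranks switches only and ignores the GUID
       tie-breaking used for links between switches of equal rank, so it is a
       simplification rather than a description of OpenSM's implementation.

```python
from collections import deque


def rank_switches(links, roots):
    """BFS ranking: roots get rank 0, each further level one more."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        sw = queue.popleft()
        for neighbor in links[sw]:
            if neighbor not in rank:
                rank[neighbor] = rank[sw] + 1
                queue.append(neighbor)
    return rank


def is_updn_legal(path, rank):
    """True if the path never goes up (toward the roots) after going down."""
    gone_down = False
    for a, b in zip(path, path[1:]):
        if rank[b] > rank[a]:
            gone_down = True          # step away from the roots
        elif rank[b] < rank[a] and gone_down:
            return False              # up after down breaks the rule
    return True
```

       With two roots and two leaf switches, a leaf-root-leaf path is legal,
       while a root-leaf-root path is rejected because it turns upward after
       a downward step.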
1010
1011       Note:  Up/Down routing does not allow LID routing communication between
1012       switches that are located inside spine "switch systems".  The reason is
1013       that  there  is  no way to allow a LID route between them that does not
1014       break the Up/Down rule.  One ramification of this is  that  you  cannot
1015       run SM on switches other than the leaf switches of the fabric.
1016
1017
1018       UPDN Algorithm Usage
1019
1020       Activation through OpenSM
1021
1022       Use  '-R  updn' option (instead of old '-u') to activate the UPDN algo‐
1023       rithm.  Use '-a <root_guid_file>' for adding an  UPDN  guid  file  that
1024       contains  the  root nodes for ranking.  If the `-a' option is not used,
1025       OpenSM uses its auto-detect root nodes algorithm.
1026
1027       Notes on the guid list file:
1028
1029       1.   A valid guid file specifies one guid in each line. Lines  with  an
1030       invalid format will be discarded.
1031       2.   The user should specify the root switch guids. However, it is also
1032       possible to specify CA guids; OpenSM will use the guid  of  the  switch
1033       (if it exists) that connects the CA to the subnet as a root node.
1034
1035       Purpose of DNUP Algorithm
1036
1037       The DNUP algorithm is designed to serve a similar purpose to UPDN. How‐
1038       ever it is intended to work in network topologies which are unsuited to
1039       UPDN  due to nodes being connected closer to the roots than some of the
1040       switches.  An example would  be  a  fabric  which  contains  nodes  and
1041       uplinks connected to the same switch. The operation of DNUP is the same
1042       as UPDN with the exception of the ranking process.  In DNUP all  switch
1043       nodes  are  ranked  based  solely  on their distance from CA Nodes: all
1044       switch nodes directly connected to at least one CA are assigned a value
1045       of 1, and all other switch nodes are assigned a value of one more than the
1046       minimum rank of all neighbor switch nodes.
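
       A minimal sketch of this ranking rule, assuming a hypothetical
       adjacency-list layout and function name: a breadth-first search started
       from all CA-attached switches at once gives each remaining switch one
       more than the minimum rank among its neighbors.

```python
from collections import deque


def dnup_rank(switch_links, switches_with_cas):
    """Rank switches by distance from CAs: 1 next to a CA, +1 per hop away."""
    rank = {sw: 1 for sw in switches_with_cas}
    queue = deque(switches_with_cas)
    while queue:
        sw = queue.popleft()
        for neighbor in switch_links[sw]:
            if neighbor not in rank:
                rank[neighbor] = rank[sw] + 1
                queue.append(neighbor)
    return rank
```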
1047
1048       Fat-tree Routing Algorithm
1049
1050       The fat-tree algorithm optimizes routing for the "shift" communication
1051       pattern.  It should be chosen if a subnet is a symmetrical or almost
1052       symmetrical fat-tree of various types.  It supports not just
1053       K-ary-N-Trees: it also handles non-constant K, cases where not all
1054       leaves (CAs) are present, and any CBB ratio.  As in UPDN, fat-tree
1055       routing also prevents credit-loop deadlocks.
1056
1057       If  the  root  guid  file  is  not provided ('-a' or '--root_guid_file'
1058       options), the topology has to be pure fat-tree that complies  with  the
1059       following rules:
1060         - Tree rank should be between two and eight (inclusively)
1061         - Switches of the same rank should have the same number
1062           of UP-going port groups*, unless they are root switches,
1063           in which case they shouldn't have UP-going ports at all.
1064         - Switches of the same rank should have the same number
1065           of DOWN-going port groups, unless they are leaf switches.
1066         - Switches of the same rank should have the same number
1067           of ports in each UP-going port group.
1068         - Switches of the same rank should have the same number
1069           of ports in each DOWN-going port group.
1070         - All the CAs have to be at the same tree level (rank).
1071
1072       If the root guid file is provided, the topology doesn't have to be pure
1073       fat-tree, and it should only comply with the following rules:
1074         - Tree rank should be between two and eight (inclusively)
1075         - All the Compute Nodes** have to be at the same tree level (rank).
1076           Note that non-compute node CAs are allowed here to be at different
1077           tree ranks.
1078
1079       * ports that are connected to the same remote switch are referenced  as
1080       ´port group´.
1081
1082       **   list   of  compute  nodes  (CNs)  can  be  specified  by  ´-u´  or
1083       ´--cn_guid_file´ OpenSM options.
1084
1085       Topologies that do not comply cause a  fallback  to  min  hop  routing.
1086       Note that this can also occur on link failures which cause the topology
1087       to no longer be "pure" fat-tree.
1088
1089       Note that although fat-tree algorithm supports trees  with  non-integer
1090       CBB  ratio,  the  routing will not be as balanced as in case of integer
1091       CBB ratio.  In addition to this, although  the  algorithm  allows  leaf
1092       switches  to have any number of CAs, the closer the tree is to be fully
1093       populated, the more effective the "shift"  communication  pattern  will
1094       be.   In  general,  even  if  the root list is provided, the closer the
1095       topology to a pure and symmetrical fat-tree, the more optimal the rout‐
1096       ing will be.
1097
1098       The  algorithm  also dumps compute node ordering file (opensm-ftree-ca-
1099       order.dump) in the same directory where the OpenSM  log  resides.  This
1100       ordering  file  provides  the CN order that may be used to create effi‐
1101       cient communication pattern, that will match the routing tables.
1102
1103       Routing between non-CN nodes
1104
1105       The use of the cn_guid_file option allows non-CN nodes to be located on
1106       different  levels  in the fat tree.  In such case, it is not guaranteed
1107       that the Fat Tree algorithm will route between two  non-CN  nodes.   To
1108       solve  this problem, a list of non-CN nodes can be specified by ´-G´ or
1109       ´--io_guid_file´ option.  These nodes will be allowed to use  switches
1110       the  wrong  way  round a specific number of times (specified by ´-H´ or
1111       ´--max_reverse_hops´).   With   the   proper   max_reverse_hops    and
1112       io_guid_file values, you can ensure full connectivity in the Fat Tree.
1113
1114       Please  note  that  using  max_reverse_hops creates routes that use the
1115       switch in a counter-stream way.  This option should never  be  used  to
1116       connect nodes with high bandwidth traffic between them ! It should only
1117       be used to allow connectivity for HA purposes or similar.  Also  having
1118       routes the other way around can in theory cause credit loops.
1119
1120       Use these options with extreme care !
1121
1122       Activation through OpenSM
1123
1124       Use  '-R  ftree'  option  to  activate the fat-tree algorithm.  Use '-a
1125       <root_guid_file>' to provide root nodes for ranking. If the `-a' option
1126       is  not  used,  routing algorithm will detect roots automatically.  Use
1127       '-u <root_cn_file>' to provide the list of compute nodes. If  the  `-u'
1128       option is not used, all the CAs are considered as compute nodes.
1129
1130       Note:  LMC  > 0 is not supported by fat-tree routing. If this is speci‐
1131       fied, the default routing algorithm is invoked instead.
1132
1133
1134       LASH Routing Algorithm
1135
1136       LASH is an acronym for LAyered SHortest Path Routing. It is a determin‐
1137       istic  shortest  path  routing algorithm that enables topology agnostic
1138       deadlock-free routing within communication networks.
1139
1140       When computing the routing function, LASH analyzes the network topology
1141       for  the  shortest-path  routes between all pairs of sources / destina‐
1142       tions and groups these paths into virtual layers in such a  way  as  to
1143       avoid deadlock.
1144
1145       Note that LASH analyzes routes and ensures deadlock freedom between
1146       switch pairs.  The link between an HCA and a switch does not need virtual
1147       layers, as deadlock will not arise between switch and HCA.
1148
1149       In more detail, the algorithm works as follows:
1150
1151       1) LASH determines the shortest-path between all pairs of source / des‐
1152       tination switches. Note, LASH ensures the  same  SL  is  used  for  all
1153       SRC/DST  - DST/SRC pairs and there is no guarantee that the return path
1154       for a given DST/SRC will be the reverse of the route SRC/DST.
1155
1156       2) LASH then begins an SL assignment process where a route is  assigned
1157       to  a  layer (SL) if the addition of that route does not cause deadlock
1158       within that layer. This is achieved  by  maintaining  and  analysing  a
1159       channel dependency graph for each layer. Once the potential addition of
1160       a path could lead to deadlock, LASH opens a new layer and continues the
1161       process.
1162
1163       3)  Once  this  stage  has been completed, it is highly likely that the
1164       first layers processed will contain more paths than  the  latter  ones.
1165       To better balance the use of layers, LASH moves paths from one layer to
1166       another so that the number of paths in each layer averages out.
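
       Step 2 can be sketched as a greedy loop over routes.  This is an
       illustration only: the function names are hypothetical, routes are
       given as lists of switch names, and a route is placed in the first
       existing layer whose channel dependency graph stays acyclic.  The
       balancing pass of step 3 and the rule that both directions of a
       SRC/DST pair share one SL are not modeled.

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph given as {node: set(successors)}."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in edges.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY or (state == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(edges))


def route_dependencies(route):
    """Channel dependencies of a switch path: (a->b) waits on (b->c)."""
    channels = list(zip(route, route[1:]))
    return list(zip(channels, channels[1:]))


def assign_layers(routes):
    """Give each route the first layer whose dependency graph stays acyclic."""
    layers = []       # one channel dependency graph per layer
    assignment = []   # layer index chosen for each route, in input order
    for route in routes:
        deps = route_dependencies(route)
        for index, graph in enumerate(layers):
            trial = {ch: set(succ) for ch, succ in graph.items()}
            for src, dst in deps:
                trial.setdefault(src, set()).add(dst)
            if not has_cycle(trial):
                layers[index] = trial
                assignment.append(index)
                break
        else:
            # Every existing layer would deadlock: open a new layer.
            graph = {}
            for src, dst in deps:
                graph.setdefault(src, set()).add(dst)
            layers.append(graph)
            assignment.append(len(layers) - 1)
    return assignment
```

       On a four-switch ring, three consecutive two-hop routes fit in one
       layer, and the fourth, which would close the dependency cycle, is
       moved to a second layer.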
1167
1168       Note, the implementation of LASH in opensm attempts to use as few  lay‐
1169       ers as possible. This number can be less than the number of actual lay‐
1170       ers available.
1171
1172       In general LASH is a very flexible  algorithm.  It  can,  for  example,
1173       reduce to Dimension Order Routing in certain topologies, it is topology
1174       agnostic and fares well in the face of faults.
1175
       It has been shown that for both regular and irregular topologies, LASH
       outperforms Up/Down. The reason is that LASH distributes traffic more
       evenly through the network, avoids the bottleneck issues related to a
       root node, and always routes along shortest paths.
1180
1181       The algorithm was developed by Simula Research Laboratory.
1182
1183
       Use the '-R lash -Q' option to activate the LASH algorithm.
1185
1186       Note:  QoS support has to be turned on in order that SL/VL mappings are
1187       used.
1188
1189       Note: LMC > 0 is not supported by the LASH routing. If this  is  speci‐
1190       fied, the default routing algorithm is invoked instead.
1191
1192       For  open regular cartesian meshes the DOR algorithm is the ideal rout‐
1193       ing algorithm. For toroidal meshes on the other hand there are  routing
1194       loops  that can cause deadlocks. LASH can be used to route these cases.
1195       The performance of LASH can be improved by preconditioning the mesh  in
1196       cases  where  there  are multiple links connecting switches and also in
1197       cases where the switches are not cabled consistently. An option  exists
1198       for  LASH  to  do this. To invoke this use '-R lash -Q --do_mesh_analy‐
1199       sis'. This will add an additional phase that analyses the mesh  to  try
1200       to  determine  the  dimension and size of a mesh. If it determines that
1201       the mesh looks like an open or closed cartesian mesh  it  reorders  the
1202       ports in dimension order before the rest of the LASH algorithm runs.
1203
1204       DOR Routing Algorithm
1205
1206       The Dimension Order Routing algorithm is based on the Min Hop algorithm
1207       and so uses shortest paths.  Instead of spreading  traffic  out  across
1208       different  paths  with the same shortest distance, it chooses among the
1209       available shortest paths based on an ordering of dimensions.  Each port
1210       must  be  consistently  cabled  to represent a hypercube dimension or a
1211       mesh dimension.  Alternatively, the -O option can be used to  assign  a
1212       custom  mapping between the ports on a given switch, and the associated
1213       dimension.  Paths are grown from a destination back to a  source  using
1214       the lowest dimension (port) of available paths at each step.  This pro‐
1215       vides the ordering necessary to avoid deadlock.  When there are  multi‐
1216       ple  links  between  any  two  switches,  they still represent only one
1217       dimension and traffic is balanced across them unless port  equalization
1218       is  turned  off.  In the case of hypercubes, the same port must be used
1219       throughout the fabric to represent the hypercube dimension and match on
1220       both  ends of the cable, or the -O option used to accomplish the align‐
1221       ment.  In the case of meshes, the dimension should consistently use the
1222       same  pair  of  ports,  one port on one end of the cable, and the other
1223       port on the other end, continuing along the mesh dimension, or  the  -O
1224       option used as an override.
1225
1226       Use '-R dor' option to activate the DOR algorithm.
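
       The effect of ordering by dimension can be illustrated on an open
       mesh. The sketch below is a simplification under stated
       assumptions: switches are identified by hypothetical mesh
       coordinates rather than by port numbers, and the path is built
       forward from the source, whereas the text above describes growing
       paths from the destination back to the source. The function name
       dor_path is illustrative and not part of opensm.

```python
def dor_path(src, dst):
    """Walk an open mesh from src to dst, lowest dimension first.

    src and dst are coordinate tuples of equal length. Dimension 0 is
    fully corrected before dimension 1 is touched, and so on, which is
    the ordering that rules out cyclic dependencies on an open mesh.
    """
    current = list(src)
    path = [tuple(current)]
    for dim in range(len(current)):
        step = 1 if dst[dim] > current[dim] else -1
        while current[dim] != dst[dim]:
            current[dim] += step
            path.append(tuple(current))
    return path
```

       For example, dor_path((0, 0), (2, 1)) first moves along dimension 0
       to (2, 0) and only then along dimension 1 to (2, 1).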
1227
1228       DFSSSP and SSSP Routing Algorithm
1229
       The (Deadlock-Free) Single-Source-Shortest-Path routing algorithm is
       designed to optimize link utilization through global balancing of
       routes, while supporting arbitrary topologies. The DFSSSP routing
       algorithm uses InfiniBand virtual lanes (SL) to provide
       deadlock-freedom.
1234
1235       The DFSSSP algorithm consists of five major steps:
1236       1) It discovers the subnet and models the subnet as a  directed  multi‐
1237       graph  in which each node represents a node of the physical network and
1238       each edge represents one direction of the  full-duplex  links  used  to
1239       connect the nodes.
       2) A loop, which iterates over all CAs and switches of the subnet,
       performs three steps to generate the linear forwarding tables for
       each switch:
1243       2.1)  use Dijkstra's algorithm to find the shortest path from all nodes
1244       to the current selected destination;
       2.2) update the edge weights in the graph, i.e. add the number of
       routes that use a link to reach the destination to the link/edge;
1247       2.3)  update  the  LFT  of each switch with the outgoing port which was
1248       used in the current step to route the traffic to the destination node.
1249       3) After the number of available virtual lanes or layers in the  subnet
1250       is  detected  and  a  channel  dependency graph is initialized for each
1251       layer, the algorithm will put each possible route of  the  subnet  into
1252       the first layer.
1253       4)  A  loop  iterates over all channel dependency graphs (CDG) and per‐
1254       forms the following substeps:
1255       4.1) search for a cycle in the current CDG;
       4.2) when a cycle is found, i.e. a possible deadlock is present, one
       edge is selected and all routes that induced this edge are moved to
       the "next higher" virtual layer (CDG[i+1]);
1259       4.3) the cycle search is continued until  all  cycles  are  broken  and
1260       routes are moved "up".
       5) When the number of layers needed to remove all cycles in all CDGs
       does not exceed the number of available SL/VL, the routing is
       deadlock-free and a relation table is generated, which contains the
       assignment of each route from source to destination to an SL.
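
       Step 2 (shortest paths with weight updates) can be sketched as
       follows. This is a minimal illustration under stated assumptions,
       not the opensm code: the fabric is a hypothetical adjacency map of
       node names, every link starts with weight 1, every node is treated
       as both a source and a destination, and the table stores the next
       node rather than an outgoing port number. The function names are
       illustrative. The deadlock-removal steps 3) to 5) are not shown.

```python
import heapq


def route_to_destination(adj, weight, dest):
    """Dijkstra rooted at dest; returns the next hop of each node.

    adj maps a node to its neighbours (links are full duplex), and
    weight maps a directed edge (u, v) to its current weight.
    """
    dist = {dest: 0}
    next_hop = {}
    heap = [(0, dest)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue  # stale heap entry
        for u in adj[v]:
            nd = d + weight[(u, v)]  # cost of sending from u via v
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                next_hop[u] = v
                heapq.heappush(heap, (nd, u))
    return next_hop


def build_tables(adj):
    """Run steps 2.1 to 2.3 once per destination."""
    weight = {(u, v): 1 for u in adj for v in adj[u]}
    lft = {node: {} for node in adj}
    for dest in sorted(adj):
        next_hop = route_to_destination(adj, weight, dest)   # step 2.1
        for src in next_hop:                                  # step 2.2
            node = src
            while node != dest:
                weight[(node, next_hop[node])] += 1
                node = next_hop[node]
        for node, hop in next_hop.items():                    # step 2.3
            lft[node][dest] = hop
    return lft, weight
```

       Because links that already carry many routes become more expensive,
       later destinations tend to be routed over less-used links, which is
       the global balancing the algorithm aims for.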
1265
1266       Note on SSSP:
       This algorithm does not perform steps 3)-5) and cannot be considered
       deadlock-free for all topologies. But on the one hand, you
1269       can choose this algorithm for really large  networks  (5,000+  CAs  and
1270       deadlock-free by design) to reduce the runtime of the algorithm. On the
1271       other hand, you might use the SSSP routing algorithm as an alternative,
1272       when all deadlock-free routing algorithms fail to route the network for
1273       whatever reason.  In the last case, SSSP was  designed  to  deliver  an
1274       equal  or  higher bandwidth due to better congestion avoidance than the
1275       Min Hop routing algorithm.
1276
1277       Notes for usage:
1278       a) running DFSSSP: '-R dfsssp -Q'
1279       a.1) QoS has to be configured to equally spread the load on the  avail‐
1280       able SL or virtual lanes
       a.2) applications must perform a path record query to get the path SL
       for each route, which the application will use to transmit packets
1283       b) running SSSP:   '-R sssp'
1284       c) both algorithms support LMC > 0
1285
1286       Hints for optimizing I/O traffic:
1287       Having more nodes (I/O and compute) connected to a switch than incoming
1288       links  can  result  in  a  'bad'  routing of the I/O traffic as long as
1289       (DF)SSSP routing is not aware of the dedicated I/O nodes, i.e., in  the
1290       following  network configuration CN1-CN3 might send all I/O traffic via
1291       Link2 to IO1,IO2:
1292
1293            CN1         Link1        IO1
1294               \       /----\       /
1295         CN2 -- Switch1      Switch2 -- CN4
1296               /       \----/       \
1297            CN3         Link2        IO2
1298
1299       To prevent this from happening (DF)SSSP can use both the  compute  node
1300       guid   file   and   the   I/O  guid  file  specified  by  the  ´-u´  or
1301       ´--cn_guid_file´ and ´-G´ or ´--io_guid_file´ options (similar  to  the
1302       Fat-Tree routing).  This ensures that traffic towards compute nodes and
1303       I/O nodes is balanced separately and therefore distributed as  much  as
1304       possible  across  the available links. Port GUIDs, as listed by ibstat,
1305       must be specified (not Node GUIDs).
1306       The priority for the optimization is as follows:
1307         compute nodes -> I/O nodes -> other nodes
       Possible use case scenarios:
       a) neither ´-u´ nor ´-G´ is specified: all nodes are treated as ´other
       nodes´ and therefore balanced equally;
1311       b)  ´-G´ is specified: traffic towards I/O nodes will be balanced opti‐
1312       mally;
1313       c) the system has three node types, such as  login/admin,  compute  and
1314       I/O,  but  the  balancing focus should be I/O, then one has to use ´-u´
1315       and ´-G´ with I/O guids listed in cn_guid_file and compute  node  guids
1316       listed in io_guid_file;
1317       d) ...
1318
1319       Torus-2QoS Routing Algorithm
1320
       Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus
       fabrics; see torus-2QoS(8) for full documentation.
1323
1324       Use '-R torus-2QoS -Q' or '-R torus-2QoS,no_fallback  -Q'  to  activate
1325       the torus-2QoS algorithm.
1326
1327
1328       Routing References
1329
1330       To  learn  more  about deadlock-free routing, see the article "Deadlock
1331       Free Message Routing in  Multiprocessor  Interconnection  Networks"  by
1332       William J Dally and Charles L Seitz (1985).
1333
1334       To  learn  more about the up/down algorithm, see the article "Effective
1335       Strategy to Compute Forwarding Tables for InfiniBand Networks" by  Jose
1336       Carlos  Sancho,  Antonio  Robles,  and  Jose  Duato  at the Universidad
1337       Politecnica de Valencia.
1338
1339       To learn more about LASH and the flexibility behind it, the requirement
1340       for  layers,  performance comparisons to other algorithms, see the fol‐
1341       lowing articles:
1342
       "Layered Routing in Irregular Networks", Lysne et al., IEEE
       Transactions on Parallel and Distributed Systems, Vol. 16, No. 12,
       December 2005.
1345
1346       "Routing  for  the  ASI Fabric Manager", Solheim et al. IEEE Communica‐
1347       tions Magazine, Vol.44, No.7, July 2006.
1348
1349       "Layered Shortest Path (LASH) Routing in  Irregular  System  Area  Net‐
1350       works",  Skeie  et al. IEEE Computer Society Communication Architecture
1351       for Clusters 2002.
1352
1353       To learn more about the DFSSSP and  SSSP  routing  algorithm,  see  the
1354       articles:
1355       J.  Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for
1356       Arbitrary Topologies, In Proceedings of  the  25th  IEEE  International
1357       Parallel & Distributed Processing Symposium (IPDPS 2011)
1358       T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-
1359       Scale InfiniBand Networks, In 17th Annual IEEE Symposium on  High  Per‐
1360       formance Interconnects (HOTI 2009)
1361
       Modular Routing Engine
1363
       The modular routing engine structure makes it easy to plug in new
       routing modules.
1366
1367       Currently, only unicast callbacks are supported. Multicast can be added
1368       later.
1369
1370       One  existing  routing module is up-down "updn", which may be activated
1371       with '-R updn' option (instead of old '-u').
1372
1373       General usage is: $ opensm -R 'module-name'
1374
1375       There is also a trivial routing module which is able to load LFT tables
1376       from a file.
1377
1378       Main features:
1379
1380        - this will load switch LFTs and/or LID matrices (min hops tables)
1381        - this will load switch LFTs according to the path entries introduced
1382          in the file
1383        - no additional checks will be performed (such as "is port connected",
1384          etc.)
1385        - in case when fabric LIDs were changed this will try to reconstruct
1386          LFTs correctly if endport GUIDs are represented in the file
1387          (in order to disable this, GUIDs may be removed from the file
1388           or zeroed)
1389
       The file format is compatible with the output of the 'ibroute' utility,
       and for the whole fabric it can be generated with the dump_lfts.sh
       script.
1392
1393       To activate file based routing module, use:
1394
1395         opensm -R file -U /path/to/lfts_file
1396
1397       If the lfts_file is not found or is in error, the default routing algo‐
1398       rithm is utilized.
1399
1400       The  ability  to dump switch lid matrices (aka min hops tables) to file
1401       and later to load these is also supported.
1402
1403       The usage is similar to unicast forwarding tables loading from  a  lfts
1404       file  (introduced  by  'file'  routing engine), but new lid matrix file
1405       name should be specified by -M or --lid_matrix_file option.  For  exam‐
1406       ple:
1407
1408         opensm -R file -M ./opensm-lid-matrix.dump
1409
1410       The  dump  file is named ´opensm-lid-matrix.dump´ and will be generated
1411       in  standard  opensm  dump  directory  (/var/log   by   default)   when
1412       OSM_LOG_ROUTING logging flag is set.
1413
       When routing engine 'file' is activated, but the lfts file is not
       specified or cannot be opened, the default lid matrix algorithm will
       be used.
1416
       There is also a switch forwarding tables dumper which generates a file
       compatible with dump_lfts.sh output. This file can be used as input
       for forwarding table loading by the 'file' routing engine. Either or
       both of the -U and -M options can be specified together with
       ´-R file´.
1421
1422

PER MODULE LOGGING CONFIGURATION

1424       To  enable per module logging, configure per_module_logging_file to the
1425       per module logging config file name in the opensm options file. To dis‐
1426       able, configure per_module_logging_file to (null) there.
1427
1428       The per module logging config file format is a set of lines with module
1429       name and logging level as follows:
1430
1431        <module name><separator><logging level>
1432
1433        <module name> is the file name including .c
1434        <separator> is either = , space, or tab
        <logging level> uses the same levels as the coarse/overall
        logging, as follows:
1437
1438        BIT    LOG LEVEL ENABLED
1439        ----   -----------------
1440        0x01 - ERROR (error messages)
1441        0x02 - INFO (basic messages, low volume)
1442        0x04 - VERBOSE (interesting stuff, moderate volume)
1443        0x08 - DEBUG (diagnostic, high volume)
1444        0x10 - FUNCS (function entry/exit, very high volume)
1445        0x20 - FRAMES (dumps all SMP and GMP frames)
1446        0x40 - ROUTING (dump FDB routing information)
1447        0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM logging)
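
       The format above can be read with a few lines of Python. This is a
       sketch for checking a config file by hand, not part of opensm. It
       assumes that the separator is '=', space, or tab, and that the
       logging level is written as a number such as 0x03; the function
       names and the module names in the usage note are illustrative.

```python
import re

# Bit values as listed in the table above.
LOG_BITS = [
    (0x01, "ERROR"),
    (0x02, "INFO"),
    (0x04, "VERBOSE"),
    (0x08, "DEBUG"),
    (0x10, "FUNCS"),
    (0x20, "FRAMES"),
    (0x40, "ROUTING"),
    (0x80, "SYS"),
]


def parse_per_module_logging(text):
    """Map each module name to its numeric logging level."""
    levels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: '=', spaces and tabs all act as the separator.
        module, level = re.split(r"[=\s]+", line, maxsplit=1)
        levels[module] = int(level, 0)  # accepts 0x.. hex or decimal
    return levels


def describe_level(mask):
    """List the names of the log levels enabled in a bit mask."""
    return [name for bit, name in LOG_BITS if mask & bit]
```

       With this sketch, a line such as "osm_ucast_mgr.c=0x43" parses to
       the value 0x43, which describe_level() expands to ERROR, INFO and
       ROUTING.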
1448
1449

FILES

1451       /etc/rdma/opensm.conf
1452              default OpenSM config file.
1453
1454
1455       /etc/rdma/ib-node-name-map
1456              default node name map file.  See ibnetdiscover for more informa‐
1457              tion on format.
1458
1459
1460       /etc/rdma/partitions.conf
1461              default partition config file
1462
1463
1464       /etc/rdma/qos-policy.conf
1465              default QOS policy config file
1466
1467
1468       /etc/rdma/prefix-routes.conf
1469              default prefix routes file
1470
1471
1472       /etc/rdma/per-module-logging.conf
1473              default per module logging config file
1474
1475
1476       /etc/rdma/torus-2QoS.conf
1477              default torus-2QoS config file
1478
1479

AUTHORS

1481       Hal Rosenstock
1482              <hal@mellanox.com>
1483
1484       Sasha Khapyorsky
1485              <sashak@voltaire.com>
1486
1487       Eitan Zahavi
1488              <eitan@mellanox.co.il>
1489
1490       Yevgeny Kliteynik
1491              <kliteyn@mellanox.co.il>
1492
1493       Thomas Sodring
1494              <tsodring@simula.no>
1495
1496       Ira Weiny
1497              <weiny2@llnl.gov>
1498
1499       Dale Purdy
1500              <purdy@sgi.com>
1501
1502

SEE ALSO

1504       torus-2QoS(8), torus-2QoS.conf(5).
1505
1506
1507
1508OpenIB                           Sept 15, 2014                       OPENSM(8)