OPENSM(8)                      OpenIB Management                     OPENSM(8)

NAME

6       opensm - InfiniBand subnet manager and administration (SM/SA)
7
8

SYNOPSIS

       opensm  [--version]  [-F  |  --config  <file_name>]  [-c(reate-config)
11       <file_name>] [-g(uid) <GUID in hex>] [-l(mc) <LMC>] [-p(riority)  <PRI‐
12       ORITY>]  [-smkey <SM_Key>] [--sm_sl <SL number>] [-r(eassign_lids)] [-R
13       <engine name(s)> | --routing_engine <engine name(s)>] [--do_mesh_analy‐
14       sis]  [--lash_start_vl  <vl  number>] [-A | --ucast_cache] [-z | --con‐
15       nect_roots] [-M <file name> | --lid_matrix_file <file name>] [-U  <file
16       name>  |  --lfts_file <file name>] [-S | --sadb_file <file name>] [-a |
17       --root_guid_file <path to file>] [-u | --cn_guid_file <path  to  file>]
18       [-G  |  --io_guid_file  <path  to  file>] [-H | --max_reverse_hops <max
19       reverse hops allowed>] [-X | --guid_routing_order_file <path to  file>]
20       [-m  |  --ids_guid_file <path to file>] [-o(nce)] [-s(weep) <interval>]
21       [-t(imeout) <milliseconds>] [--retries  <number>]  [-maxsmps  <number>]
22       [-console  [off  |  local  | socket | loopback]] [-console-port <port>]
23       [-i(gnore-guids) <equalize-ignore-guids-file>] [-w | --hop_weights_file
24       <path  to file>] [-f <log file path> | --log_file <log file path> ] [-L
25       | --log_limit <size in MB>] [-e(rase_log_file)] [-P(config)  <partition
26       config  file>  ]  [-N | --no_part_enforce] [-Q | --qos [-Y | --qos_pol‐
27       icy_file <file name>]] [-y | --stay_on_fatal] [-B  |  --daemon]  [-I  |
28       --inactive]   [--perfmgr]  [--perfmgr_sweep_time_s  <seconds>]  [--pre‐
29       fix_routes_file <path>] [--consolidate_ipv6_snm_req] [-v(erbose)]  [-V]
30       [-D <flags>] [-d(ebug) <number>] [-h(elp)] [-?]
31
32

DESCRIPTION

       opensm is an InfiniBand-compliant Subnet Manager and Administration
       (SM/SA) that runs on top of OpenIB.

       At least one such software entity must run on each InfiniBand subnet
       in order to initialize the InfiniBand hardware.

       opensm also contains an experimental version of a performance manager.
44
45       opensm defaults were designed to meet the common case usage on clusters
46       with up to a few hundred nodes. Thus, in this default mode, opensm will
47       scan the IB fabric, initialize it, and sweep occasionally for changes.
48
49       opensm attaches to a specific IB port on the local machine and  config‐
50       ures  only  the fabric connected to it. (If the local machine has other
51       IB ports, opensm will ignore  the  fabrics  connected  to  those  other
52       ports). If no port is specified, it will select the first "best" avail‐
53       able port.
54
55       opensm can present the available ports and prompt for a port number  to
56       attach to.
57
58       By  default,  the  run  is  logged  to two files: /var/log/messages and
59       /var/log/opensm.log.  The first file will register only  general  major
60       events, whereas the second will include details of reported errors. All
61       errors reported in this second file should be treated as indicators  of
62       IB  fabric  health issues.  (Note that when a fatal and non-recoverable
63       error occurs, opensm will exit.)  Both log  files  should  include  the
       message "SUBNET UP" if opensm was able to set up the subnet correctly.
65
66

OPTIONS

68       --version
69              Prints OpenSM version and exits.
70
71       -F, --config <config file>
              The name of the OpenSM config file.  When not specified,
              /etc/rdma/opensm.conf is used (if it exists).
74
75       -c, --create-config <file name>
              OpenSM will dump its configuration to the specified file and
              exit.  This is a way to generate an OpenSM configuration file
              template.
79
80       -g, --guid <GUID in hex>
81              This option specifies the  local  port  GUID  value  with  which
82              OpenSM  should  bind.   OpenSM may be bound to 1 port at a time.
83              If GUID given is 0, OpenSM displays  a  list  of  possible  port
84              GUIDs and waits for user input.  Without -g, OpenSM tries to use
85              the default port.
86
87       -l, --lmc <LMC value>
88              This option specifies the subnet's LMC  value.   The  number  of
89              LIDs  assigned  to each port is 2^LMC.  The LMC value must be in
90              the range 0-7.  LMC values >  0  allow  multiple  paths  between
91              ports.   LMC values > 0 should only be used if the subnet topol‐
92              ogy actually provides multiple paths between ports, i.e.  multi‐
93              ple interconnects between switches.  Without -l, OpenSM defaults
94              to LMC = 0, which allows one path between any two ports.
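
              The relationship between LMC and the number of LIDs per port
              can be checked with a short sketch.  The function name
              lids_per_port is hypothetical and is not part of OpenSM; only
              the 2^LMC rule and the 0-7 range come from the text above.

```python
def lids_per_port(lmc):
    """Return the number of LIDs assigned to each port for a given LMC."""
    # The option description limits LMC to the range 0-7.
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 2 ** lmc


if __name__ == "__main__":
    for lmc in range(8):
        print("LMC", lmc, "->", lids_per_port(lmc), "LIDs per port")
```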
95
96       -p, --priority <Priority value>
              This option specifies the SM's priority.  This affects the
              handover cases, where the master is chosen by priority and GUID.
              The range is 0 (default and lowest priority) to 15 (highest).
100
101       -smkey <SM_Key value>
              This option specifies the SM's SM_Key (64 bits).  This affects
              SM authentication.  Note that OpenSM version 3.2.1 and below
              used the default value '1' in host byte order.  This is fixed
              now, but you may need this option to interoperate with an old
              OpenSM running on a little-endian machine.
107
108       --sm_sl <SL number>
109              This option sets the SL to use for communication with the SM/SA.
110              Defaults to 0.
111
112       -r, --reassign_lids
113              This  option  causes  OpenSM  to reassign LIDs to all end nodes.
114              Specifying -r on a running subnet may  disrupt  subnet  traffic.
              Without -r, OpenSM attempts to preserve existing LID
              assignments, resolving any multiple use of the same LID.
117
118       -R, --routing_engine <Routing engine names>
              This option chooses routing engine(s) to use instead of the Min
              Hop algorithm (default).  Multiple routing engines can be
              specified, separated by commas; they are tried in the given
              order, each one being used only if the earlier engines fail.
              Supported engines: minhop, updn, file, ftree, lash, dor
124
125       --do_mesh_analysis
126              This option enables additional analysis  for  the  lash  routing
127              engine to precondition switch port assignments in regular carte‐
128              sian meshes which may reduce the number of SLs required to  give
129              a deadlock free routing.
130
131       --lash_start_vl <vl number>
132              This  option  sets  the  starting VL to use for the lash routing
133              algorithm.  Defaults to 0.
134
135       -A, --ucast_cache
              This option enables the unicast routing cache and prevents
              routing recalculation (which is a heavy task in a large
              cluster) when no topology change was detected during the heavy
              sweep, or when the topology change does not require a new
              routing calculation, e.g. when one or more CAs/RTRs/leaf
              switches go down, or one or more of these nodes come back
              after being down.  A very common case handled by the unicast
              routing cache is a host reboot, which would otherwise cause two
              full routing recalculations: one when the host goes down, and
              the other when the host comes back online.
146
147       -z, --connect_roots
              This option forces the routing engines (up/down and fat-tree)
              to provide connectivity between root switches and in this way
              be fully IBA compliant.  In many cases this can violate a
              "pure" deadlock-free algorithm, so use it carefully.
152
153       -M, --lid_matrix_file <file name>
              This option specifies the name of the lid matrix dump file from
              which switch lid matrices (min hops tables) will be loaded.
156
157       -U, --lfts_file <file name>
158              This  option  specifies  the  name  of  the LFTs file from where
159              switch forwarding tables will be loaded.
160
161       -S, --sadb_file <file name>
162              This option specifies the name of the SA DB dump file from where
163              SA database will be loaded.
164
165       -a, --root_guid_file <file name>
166              Set the root nodes for the Up/Down or Fat-Tree routing algorithm
167              to the guids provided in the given file (one to a line).
168
169       -u, --cn_guid_file <file name>
170              Set the compute nodes for the Fat-Tree routing algorithm to  the
171              guids provided in the given file (one to a line).
172
173       -G, --io_guid_file <file name>
174              Set  the  I/O  nodes  for  the Fat-Tree routing algorithm to the
175              guids provided in the given file (one to a line).  I/O nodes are
176              non-CN  nodes allowed to use up to max_reverse_hops switches the
177              wrong way around to improve connectivity.
178
       -H, --max_reverse_hops <max reverse hops allowed>
180              Set the maximum number of reverse hops an I/O node is allowed to
181              make. A reverse hop is the use of a switch the wrong way around.
182
183       -m, --ids_guid_file <file name>
              Name of the map file with the set of IDs that the Up/Down
              routing algorithm will use instead of node GUIDs (format: <guid>
              <id> per line).
187
188       -X, --guid_routing_order_file <file name>
189              Set  the  order  port  guids  will  be routed for the MinHop and
190              Up/Down routing algorithms to the guids provided  in  the  given
191              file (one to a line).
192
193       -o, --once
194              This  option  causes  OpenSM  to configure the subnet once, then
195              exit.  Ports remain in the ACTIVE state.
196
197       -s, --sweep <interval value>
198              This option specifies  the  number  of  seconds  between  subnet
199              sweeps.   Specifying -s 0 disables sweeping.  Without -s, OpenSM
200              defaults to a sweep interval of 10 seconds.
201
202       -t, --timeout <value>
203              This option specifies the time in milliseconds used for transac‐
204              tion  timeouts.  Specifying -t 0 disables timeouts.  Without -t,
205              OpenSM defaults to a timeout value of 200 milliseconds.
206
207       --retries <number>
208              This option specifies the number of retries  used  for  transac‐
209              tions.   Without  --retries,  OpenSM  defaults  to 3 retries for
210              transactions.
211
212       -maxsmps <number>
213              This option specifies the number of VL15 SMP MADs allowed on the
214              wire  at  any  one time.  Specifying -maxsmps 0 allows unlimited
215              outstanding SMPs.  Without -maxsmps, OpenSM defaults to a  maxi‐
216              mum of 4 outstanding SMPs.
217
218       -console [off | local | socket | loopback]
219              This  option  brings  up the OpenSM console (default off).  Note
220              that the socket and loopback options will only be  available  if
221              OpenSM was built with --enable-console-socket.
222
223       -console-port <port>
224              Specify an alternate telnet port for the socket console (default
225              10000).  Note that this option only appears if OpenSM was  built
226              with --enable-console-socket.
227
228       -i, -ignore-guids <equalize-ignore-guids-file>
229              This option provides the means to define a set of ports (by node
230              guid and port number) that will be  ignored  by  the  link  load
231              equalization algorithm.
232
233       -w, --hop_weights_file <path to file>
234              This  option  provides weighting factors per port representing a
235              hop cost in computing the lid  matrix.   The  file  consists  of
236              lines  containing  a switch port GUID (specified as a 64 bit hex
237              number, with leading 0x), output port number, and weighting fac‐
238              tor.   Any  port  not listed in the file defaults to a weighting
239              factor of 1.  Lines  starting  with  #  are  comments.   Weights
240              affect  only the output route from the port, so many useful con‐
241              figurations will require weights to be specified in pairs.
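
              As an illustration of the file format described above, the
              following sketch parses a hop weights file into a lookup
              table.  It assumes that the three fields are separated by white
              space and that the port number and weight are decimal; the text
              above does not state either point.  The function names are
              hypothetical and the GUIDs are made up.

```python
def parse_hop_weights(text):
    """Parse hop weights text into {(switch_port_guid, out_port): weight}."""
    weights = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines starting with '#' carry no data.
        if not line or line.startswith("#"):
            continue
        # Assumption: fields are separated by white space.
        guid_text, port_text, weight_text = line.split()
        weights[(int(guid_text, 16), int(port_text))] = int(weight_text)
    return weights


def hop_weight(weights, guid, port):
    """Ports not listed in the file default to a weighting factor of 1."""
    return weights.get((guid, port), 1)


EXAMPLE = """\
# Weights affect only the output direction, so a link between two
# switches is weighted on both of its ends.
0x0002c90000000010 1 3
0x0002c90000000020 2 3
"""

if __name__ == "__main__":
    table = parse_hop_weights(EXAMPLE)
    print(hop_weight(table, 0x0002C90000000010, 1))
    print(hop_weight(table, 0x0002C90000000010, 5))
```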
242
243       -x, --honor_guid2lid
244              This option forces OpenSM to honor the guid2lid  file,  when  it
245              comes   out   of  Standby  state,  if  such  file  exists  under
246              OSM_CACHE_DIR, and is valid.  By default, this is FALSE.
247
248       -f, --log_file <file name>
249              This option defines the log to be the given file.   By  default,
250              the log goes to /var/log/opensm.log.  For the log to go to stan‐
251              dard output use -f stdout.
252
253       -L, --log_limit <size in MB>
              This option defines the maximum log file size in MB.  When
              specified, the log file is truncated upon reaching this limit.
256
257       -e, --erase_log_file
              This option causes the log file to be deleted (if it already
              exists).  By default, new messages are appended to the log file.
260
261       -P, --Pconfig <partition config file>
262              This option defines the optional partition  configuration  file.
263              The default name is /etc/rdma/partitions.conf.
264
265       --prefix_routes_file <file name>
266              Prefix routes control how the SA responds to path record queries
267              for off-subnet DGIDs.  By default, the SA  fails  such  queries.
268              The PREFIX ROUTES section below describes the format of the con‐
269              figuration      file.       The      default       path       is
270              /etc/rdma/prefix-routes.conf.
271
272       -Q, --qos
273              This option enables QoS setup. It is disabled by default.
274
275       -Y, --qos_policy_file <file name>
276              This  option  defines  the optional QoS policy file. The default
277              name    is    /etc/rdma/qos-policy.conf.     See     QoS_manage‐
278              ment_in_OpenSM.txt in opensm doc for more information on config‐
279              uring QoS policy via this file.
280
281       -N, --no_part_enforce
282              This option disables partition enforcement  on  switch  external
283              ports.
284
285       -y, --stay_on_fatal
286              This  option  will  cause SM not to exit on fatal initialization
287              issues: if SM discovers duplicated guids or a 12x link with lane
288              reversal  badly  configured.   By  default,  the SM will exit on
289              these errors.
290
291       -B, --daemon
292              Run in daemon mode - OpenSM will run in the background.
293
294       -I, --inactive
295              Start SM in inactive rather than init SM state.  This option can
296              be  used  in  conjunction with the perfmgr so as to run a stand‐
297              alone performance manager without SM/SA.  However, this  is  NOT
298              currently implemented in the performance manager.
299
300       -perfmgr
301              Enable  the  perfmgr.  Only takes effect if --enable-perfmgr was
302              specified at configure time.  See  performance-manager-HOWTO.txt
303              in opensm doc for more information on running perfmgr.
304
305       -perfmgr_sweep_time_s <seconds>
306              Specify  the  sweep  time for the performance manager in seconds
307              (default is 180 seconds).  Only takes effect if --enable-perfmgr
308              was specified at configure time.
309
310       --consolidate_ipv6_snm_req
311              Use  shared  MLID  for  IPv6 Solicited Node Multicast groups per
312              MGID scope and P_Key.
313
314       -v, --verbose
315              This option increases the log verbosity level.   The  -v  option
316              may  be  specified  multiple  times to further increase the ver‐
317              bosity level.  See the -D option for more information about  log
318              verbosity.
319
       -V     This option sets the maximum verbosity level and forces log
              flushing.  The -V option is equivalent to '-D 0xFF -d 2'.  See
              the -D option for more information about log verbosity.
323
324       -D <value>
325              This  option  sets  the log verbosity level.  A flags field must
326              follow the -D option.  A bit set/clear in the flags enables/dis‐
327              ables a specific log level as follows:
328
329               BIT    LOG LEVEL ENABLED
330               ----   -----------------
331               0x01 - ERROR (error messages)
332               0x02 - INFO (basic messages, low volume)
333               0x04 - VERBOSE (interesting stuff, moderate volume)
334               0x08 - DEBUG (diagnostic, high volume)
335               0x10 - FUNCS (function entry/exit, very high volume)
336               0x20 - FRAMES (dumps all SMP and GMP frames)
337               0x40 - ROUTING (dump FDB routing information)
338               0x80 - currently unused.
339
340              Without  -D,  OpenSM defaults to ERROR + INFO (0x3).  Specifying
341              -D 0 disables all messages.  Specifying -D 0xFF enables all mes‐
342              sages  (see  -V).   High verbosity levels may require increasing
343              the transaction timeout with the -t option.
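
              The flags field is a plain bit mask, so a value can be built
              or decoded with bitwise operations.  The sketch below uses the
              bit assignments from the table above; the function names are
              hypothetical and are not part of OpenSM.

```python
# Bit assignments copied from the -D table (0x80 is currently unused).
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_log_flags(flags):
    """Return the names of the log levels enabled by a -D flags value."""
    return [name for bit, name in sorted(LOG_LEVELS.items()) if flags & bit]


def encode_log_flags(names):
    """Return the -D flags value that enables the given log levels."""
    by_name = {name: bit for bit, name in LOG_LEVELS.items()}
    flags = 0
    for name in names:
        flags |= by_name[name]
    return flags


if __name__ == "__main__":
    print(decode_log_flags(0x03))  # the default: ERROR + INFO
    print(hex(encode_log_flags(["ERROR", "INFO", "VERBOSE"])))
```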
344
345       -d, --debug <value>
346              This option specifies a debug option.   These  options  are  not
347              normally  needed.   The  number  following  -d selects the debug
348              option to enable as follows:
349
350               OPT   Description
351               ---    -----------------
352               -d0  - Ignore other SM nodes
353               -d1  - Force single threaded dispatching
354               -d2  - Force log flushing after each log message
355               -d3  - Disable multicast support
356
357       -h, --help
358              Display this usage info then exit.
359
360       -?     Display this usage info then exit.
361
362

ENVIRONMENT VARIABLES

364       The following environment variables control opensm behavior:
365
366       OSM_TMP_DIR - controls the directory in which the temporary files  gen‐
367       erated  by  opensm  are  created.  These  files are: opensm-subnet.lst,
368       opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.
369
370       OSM_CACHE_DIR - opensm stores certain data to the disk such that subse‐
371       quent   runs   are   consistent.   The   default   directory   used  is
372       /var/cache/opensm.  The following file is included in it:
373
374        guid2lid - stores the LID range assigned to each GUID
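
       A helper script that needs to locate these files could resolve the
       directories the same way, falling back to the defaults listed above
       when a variable is unset.  This is a minimal sketch; the function
       names are hypothetical and are not part of OpenSM.

```python
import os
import posixpath

TMP_FILES = ("opensm-subnet.lst", "opensm.fdbs", "opensm.mcfdbs")


def opensm_dirs(env=None):
    """Return the temporary and cache directories opensm would use."""
    env = os.environ if env is None else env
    return {
        "tmp": env.get("OSM_TMP_DIR", "/var/log"),
        "cache": env.get("OSM_CACHE_DIR", "/var/cache/opensm"),
    }


def dump_file_paths(env=None):
    """Return the expected paths of the temporary files and guid2lid."""
    dirs = opensm_dirs(env)
    paths = [posixpath.join(dirs["tmp"], name) for name in TMP_FILES]
    paths.append(posixpath.join(dirs["cache"], "guid2lid"))
    return paths


if __name__ == "__main__":
    for path in dump_file_paths():
        print(path)
```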
375
376

NOTES

378       When opensm receives a HUP signal, it starts a new heavy sweep as if  a
379       trap was received or a topology change was found.
380
381       Also,  SIGUSR1  can  be used to trigger a reopen of /var/log/opensm.log
382       for logrotate purposes.
383
384

PARTITION CONFIGURATION

       The default name of the OpenSM partition configuration file is
       /etc/rdma/partitions.conf.  The default may be changed by using the
       --Pconfig (-P) option with OpenSM.

       The default partition is created by OpenSM unconditionally, even
       when the partition configuration file does not exist or cannot be
       accessed.

       The default partition has P_Key value 0x7fff.  OpenSM's port always
       has full membership in the default partition.  All other end ports
       have full membership if the partition configuration file is not found
       or cannot be accessed, or limited membership if the file exists and can
       be accessed but contains no rule for the Default partition.

       Effectively, this is the same as if one of the following rules
       appeared in the partition configuration file.
401
402       In the case of no rule for the Default partition:
403
404       Default=0x7fff : ALL=limited, SELF=full ;
405
406       In the case of no  partition  configuration  file  or  file  cannot  be
407       accessed:
408
409       Default=0x7fff : ALL=full ;
410
411
412       File Format
413
414       Comments:
415
       Anything following a '#' character on a line is a comment and is
       ignored by the parser.
418
419       General file format:
420
421       <Partition Definition>:<PortGUIDs list> ;
422
423       Partition Definition:
424
425       [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
426
427        PartitionName - string, will be used with logging. When omitted
428                        empty string will be used.
429        PKey          - P_Key value for this partition. Only low 15 bits will
430                        be used. When omitted will be autogenerated.
431        flag          - used to indicate IPoIB capability of this partition.
432        defmember=full|limited - specifies default membership for port guid
433                        list. Default is limited.
434
435       Currently recognized flags are:
436
437        ipoib       - indicates that this partition may be used for IPoIB, as
438                      result IPoIB capable MC group will be created.
439        rate=<val>  - specifies rate for this IPoIB MC group
                      (default is 3 (10 Gbps))
441        mtu=<val>   - specifies MTU for this IPoIB MC group
442                      (default is 4 (2048))
443        sl=<val>    - specifies SL for this IPoIB MC group
444                      (default is 0)
445        scope=<val> - specifies scope for this IPoIB MC group
446                      (default is 2 (link local)).  Multiple scope settings
447                      are permitted for a partition.
448
449       Note that values for rate,  mtu,  and  scope  should  be  specified  as
450       defined in the IBTA specification (for example, mtu=4 for 2048).
451
452       PortGUIDs list:
453
454        PortGUID         - GUID of partition member EndPort. Hexadecimal
455                           numbers should start from 0x, decimal numbers
456                           are accepted too.
457        full or limited  - indicates full or limited membership for this
458                           port.  When omitted (or unrecognized) limited
459                           membership is assumed.
460
       There are several useful keywords for PortGUID definition:
462
463        - 'ALL' means all end ports in this subnet.
464        - 'ALL_CAS' means all Channel Adapter end ports in this subnet.
465        - 'ALL_SWITCHES' means all Switch end ports in this subnet.
466        - 'ALL_ROUTERS' means all Router end ports in this subnet.
467        - 'SELF' means subnet manager's port.
468
469       Empty list means no ports in this partition.
470
471       Notes:
472
473       White space is permitted between delimiters ('=', ',',':',';').
474
       A line can be wrapped after the ':' that follows the Partition
       Definition and between the entries of the PortGUIDs list.
477
       PartitionName does not need to be unique, but PKey does.  If a PKey
       is repeated, those partition configurations are merged and the first
       PartitionName is used (see also the next note).

       It is possible to split a partition configuration across more than
       one definition, but then the PKey should be explicitly specified
       (otherwise different PKey values will be generated for those
       definitions).
485
486       Examples:
487
488        Default=0x7fff : ALL, SELF=full ;
489        Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ;
490
        NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
493
494        YetAnotherOne = 0x300 : SELF=full ;
495        YetAnotherOne = 0x300 : ALL=limited ;
496
        ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
        # 0x123453, 0x123454 will be limited
        ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
        # 0x123456, 0x123457 will be limited
        ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full;
        ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
        ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
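
       To make the file format concrete, the following sketch parses
       partition definitions into Python dictionaries.  It is not the OpenSM
       parser: the function name and the returned layout are hypothetical,
       P_Key autogeneration and merging of repeated PKeys are not emulated,
       and a repeated flag (such as a second scope) overwrites the earlier
       value.  Membership defaults follow the rules described above.

```python
def parse_partitions(text):
    """Parse partition definitions into a list of dictionaries."""
    # Anything after '#' on a line is a comment.
    body = " ".join(line.split("#", 1)[0] for line in text.splitlines())
    partitions = []
    for entry in body.split(";"):
        if not entry.strip():
            continue
        definition, _, port_list = entry.partition(":")
        fields = [field.strip() for field in definition.split(",")]

        # First field: [PartitionName][=PKey]; only the low 15 bits are used.
        name, _, pkey_text = fields[0].partition("=")
        pkey_text = pkey_text.strip()
        pkey = int(pkey_text, 0) & 0x7FFF if pkey_text else None

        flags = {}
        default_member = "limited"
        for field in fields[1:]:
            key, _, value = field.partition("=")
            key, value = key.strip(), value.strip()
            if key == "defmember":
                default_member = value if value in ("full", "limited") else "limited"
            elif key:
                flags[key] = value or True

        members = {}
        for item in port_list.split(","):
            if not item.strip():
                continue
            port, sep, membership = item.partition("=")
            membership = membership.strip()
            if not sep:
                membership = default_member  # omitted: use defmember
            elif membership not in ("full", "limited"):
                membership = "limited"       # unrecognized: limited
            members[port.strip()] = membership

        partitions.append({"name": name.strip(), "pkey": pkey,
                           "flags": flags, "members": members})
    return partitions


if __name__ == "__main__":
    for part in parse_partitions("Default=0x7fff : ALL, SELF=full ;"):
        print(part)
```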
506
507
508       Note:
509
510       The following rule is equivalent to how OpenSM used to run prior to the
511       partition manager:
512
513        Default=0x7fff,ipoib:ALL=full;
514
515

QOS CONFIGURATION

       There is a set of QoS-related low-level configuration parameters.  All
       of these parameter names are prefixed by the "qos_" string.  Here is a
       full list of these parameters:
520
521        qos_max_vls    - The maximum number of VLs that will be on the subnet
522        qos_high_limit - The limit of High Priority component of VL
523                         Arbitration table (IBA 7.6.9)
524        qos_vlarb_low  - Low priority VL Arbitration table (IBA 7.6.9)
525                         template
526        qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
527                         template
528                         Both VL arbitration templates are pairs of
529                         VL and weight
530        qos_sl2vl      - SL2VL Mapping table (IBA 7.6.6) template. It is
531                         a list of VLs corresponding to SLs 0-15 (Note
532                         that VL15 used here means drop this SL)
533
534       Typical default values (hard-coded in OpenSM initialization) are:
535
        qos_max_vls 15
        qos_high_limit 0
        qos_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
        qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
        qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
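
       The template values above are simple comma-separated lists.  The
       following sketch shows one way to read them: VL arbitration templates
       as (VL, weight) pairs and the SL2VL template as a mapping from SL to
       VL.  The function names are hypothetical and are not part of OpenSM,
       and tolerating a trailing comma is an assumption made here.

```python
def parse_vlarb(template):
    """Parse a VL arbitration template into a list of (VL, weight) pairs."""
    pairs = []
    for item in template.split(","):
        if not item.strip():
            continue
        vl, weight = item.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


def parse_sl2vl(template):
    """Parse an SL2VL template into {SL: VL}; entries map SL 0, 1, 2, ..."""
    vls = [int(value) for value in template.split(",") if value.strip()]
    return dict(enumerate(vls))


def dropped_sls(sl2vl):
    """Return the SLs mapped to VL15, which means the SL is dropped."""
    return [sl for sl, vl in sorted(sl2vl.items()) if vl == 15]


DEFAULT_VLARB_LOW = "0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4"
DEFAULT_SL2VL = "0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7"

if __name__ == "__main__":
    print(parse_vlarb(DEFAULT_VLARB_LOW)[:3])
    print(parse_sl2vl(DEFAULT_SL2VL)[15])
```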
543
       The syntax is compatible with the rest of the OpenSM configuration
       options, and values may be stored in the OpenSM config file (cached
       options file).

       In addition to the above, separate QoS configuration parameter sets
       may be defined for various target types.  The currently supported
       targets are CAs, routers, switch external ports, and switch enhanced
       port 0.  The names of such specialized parameters are prefixed by the
       "qos_<type>_" string.  Here is a full list of the currently supported
       sets:
552
553        qos_ca_  - QoS configuration parameters set for CAs.
554        qos_rtr_ - parameters set for routers.
555        qos_sw0_ - parameters set for switches' port 0.
556        qos_swe_ - parameters set for switches' external ports.
557
558       Examples:
559        qos_sw0_max_vls=2
560        qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
561        qos_swe_high_limit=0
562
563

PREFIX ROUTES

565       Prefix  routes  control  how the SA responds to path record queries for
566       off-subnet DGIDs.  By default, the SA fails such  queries.   Note  that
567       IBA  does  not  specify how the SA should obtain off-subnet path record
568       information.  The prefix routes configuration is meant  as  a  stop-gap
569       until the specification is completed.
570
571       Each  line  in  the configuration file is a 64-bit prefix followed by a
572       64-bit GUID, separated by white space.  The GUID specifies  the  router
573       port  on the local subnet that will handle the prefix.  Blank lines are
574       ignored, as is anything between a # character and the end of the  line.
575       The  prefix  and  GUID  are  both  in  hex, the leading 0x is optional.
576       Either, or both, can be wild-carded by specifying an  asterisk  instead
577       of an explicit prefix or GUID.
578
579       When  responding  to a path record query for an off-subnet DGID, opensm
580       searches for the first prefix match in the configuration file.   There‐
581       fore,  the order of the lines in the configuration file is important: a
582       wild-carded prefix at the beginning of the configuration  file  renders
583       all  subsequent lines useless.  If there is no match, then opensm fails
       the query.  It is legal to repeat prefixes in the configuration file;
       opensm will return the path to the first available matching router.  A
       configuration file with a single line where both prefix and GUID are
       wild-carded means that a path record query specifying any off-subnet
       DGID should return a path to the first available router.  This
       configuration yields the same behavior formerly achieved by compiling
       opensm with -DROUTER_EXP, which is now obsolete.
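
       The first-match behavior described above can be sketched as follows.
       This is an illustration, not the opensm implementation: the function
       names are hypothetical, the GUIDs and prefixes in the example are
       made up, and the list of available routers is assumed to be supplied
       by the caller in the order in which "first available" is to be
       judged.

```python
def _parse_field(token):
    """Return None for a wildcard, else the value of a hex field."""
    return None if token == "*" else int(token, 16)  # leading 0x is optional


def parse_prefix_routes(text):
    """Parse the configuration text into an ordered list of (prefix, guid)."""
    routes = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        prefix_text, guid_text = line.split()
        routes.append((_parse_field(prefix_text), _parse_field(guid_text)))
    return routes


def select_router(routes, dgid_prefix, available_routers):
    """Return the router GUID for a DGID prefix, or None if the query fails."""
    for prefix, guid in routes:
        if prefix is not None and prefix != dgid_prefix:
            continue
        if guid is None:
            # Wild-carded GUID: any available router will do.
            if available_routers:
                return available_routers[0]
        elif guid in available_routers:
            return guid
    return None


EXAMPLE = """\
# prefix             router port GUID
0xfec0000000000001   0x0002c90000000001
fec0000000000001     0002c90000000002   # same prefix, second router
*                    *
"""

if __name__ == "__main__":
    routes = parse_prefix_routes(EXAMPLE)
    router = select_router(routes, 0xFEC0000000000001, [0x0002C90000000002])
    print(hex(router))
```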
591
592

ROUTING

594       OpenSM now offers five routing engines:
595
596       1.  Min Hop Algorithm - based on the minimum hops to  each  node  where
597       the path length is optimized.
598
599       2.   UPDN Unicast routing algorithm - also based on the minimum hops to
600       each node, but it is  constrained  to  ranking  rules.  This  algorithm
601       should be chosen if the subnet is not a pure Fat Tree, and deadlock may
602       occur due to a loop in the subnet.
603
604       3.  Fat Tree Unicast routing algorithm - this algorithm optimizes rout‐
605       ing  for  congestion-free  "shift" communication pattern.  It should be
606       chosen if a subnet is a symmetrical or almost symmetrical  fat-tree  of
607       various  types,  not  just  K-ary-N-Trees:  non-constant  K,  not fully
608       staffed, any Constant Bisectional Bandwidth (CBB)  ratio.   Similar  to
609       UPDN, Fat Tree routing is constrained to ranking rules.
610
       4. LASH unicast routing algorithm - uses InfiniBand virtual layers (SL)
       to provide deadlock-free shortest-path routing while also distributing
       the paths between layers.  LASH is an alternative deadlock-free,
       topology-agnostic routing algorithm to the non-minimal UPDN algorithm,
       avoiding the use of a potentially congested root node.
616
617       5.  DOR Unicast routing algorithm - based on the Min Hop algorithm, but
618       avoids port equalization except for redundant links  between  the  same
619       two  switches.   This provides deadlock free routes for hypercubes when
620       the fabric is cabled as a hypercube and for meshes  when  cabled  as  a
621       mesh (see details below).
622
623       OpenSM  also supports a file method which can load routes from a table.
624       See ´Modular Routing Engine´ for more information on this.
625
626       The basic routing algorithm is comprised of two stages:
627
628       1. MinHop matrix calculation
629          How many hops are required to get from each port to each LID ?
630          The algorithm to fill these tables is different if you run  standard
631       (min hop) or Up/Down.
632          For standard routing, a "relaxation" algorithm is used to propagate
633       min hop from every destination LID through neighbor switches.
634          For Up/Down routing, a BFS from every target is used. The BFS tracks
635       link direction (up or down) and avoids steps that would go up after a
636       down step was used.
637
638       2. Once MinHop matrices exist, each switch is visited and for each tar‐
639       get  LID  a  decision  is made as to what port should be used to get to
640       that LID.
641          This step is common to standard and Up/Down routing. Each port has a
642       counter counting the number of target LIDs going through it.
643          When there are multiple alternative ports with the same MinHop to a
644       LID, the one with fewer previously assigned LIDs is selected.
645          If LMC > 0, more  checks  are  added:  Within  each  group  of  LIDs
646       assigned to same target port,
647          a. use only ports which have same MinHop
648          b. first prefer the ones that go to a different systemImageGuid (than
649       the previous LID of the same LMC group)
650          c. if none - prefer those which go through another NodeGuid
651          d. fall back to the number of paths method (if all go to same node).
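
       The two stages above can be sketched in a few lines of Python. This is
       an illustrative sketch only, not OpenSM's implementation. The topology
       format (a dict mapping each switch name to {port: neighbor}, with end
       nodes written as integer LIDs), the function names min_hops and
       assign_ports, and the tie-break on the lowest port number are all
       assumptions made for the example. Links are assumed to be listed in
       both directions, and the LMC > 0 checks are omitted.

```python
from collections import deque


def min_hops(ports):
    """Stage 1: hops[switch][lid][port] = hops to lid when leaving via port."""
    lids = sorted({n for p in ports.values() for n in p.values()
                   if isinstance(n, int)})
    dist = {}  # (switch, lid) -> minimum hops from switch to lid
    for lid in lids:
        seen = {}
        queue = deque()
        # Start the BFS at the switches the destination is attached to.
        for sw, plist in ports.items():
            if lid in plist.values():
                seen[sw] = 1
                queue.append(sw)
        while queue:
            sw = queue.popleft()
            for nbr in ports[sw].values():
                if isinstance(nbr, str) and nbr not in seen:
                    seen[nbr] = seen[sw] + 1
                    queue.append(nbr)
        for sw, d in seen.items():
            dist[(sw, lid)] = d

    hops = {}
    for sw, plist in ports.items():
        hops[sw] = {}
        for lid in lids:
            per_port = {}
            for port, nbr in plist.items():
                if nbr == lid:
                    per_port[port] = 1
                elif isinstance(nbr, str) and (nbr, lid) in dist:
                    per_port[port] = 1 + dist[(nbr, lid)]
            hops[sw][lid] = per_port
    return hops


def assign_ports(hops):
    """Stage 2: pick one MinHop port per LID, balancing by per-port counters."""
    lft = {}
    for sw, per_lid in hops.items():
        load = {}  # port -> number of target LIDs already routed through it
        lft[sw] = {}
        for lid in sorted(per_lid):
            candidates = per_lid[lid]
            if not candidates:
                continue
            best = min(candidates.values())
            ties = [p for p, h in candidates.items() if h == best]
            # Least-loaded port wins; lowest port number breaks ties (assumed).
            port = min(ties, key=lambda p: (load.get(p, 0), p))
            lft[sw][lid] = port
            load[port] = load.get(port, 0) + 1
    return lft
```

       With two leaf switches joined through two spine switches, the two
       remote LIDs seen from a leaf have equal MinHop on both uplinks, so the
       counters spread them across the two ports.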
652
653       Effect of Topology Changes
654
655       OpenSM will preserve existing routing in any case  where  there  is  no
656       change in the fabric switches unless the -r (--reassign_lids) option is
657       specified.
658
659       -r
660       --reassign_lids
661                 This option causes OpenSM to reassign LIDs to all
662                 end nodes. Specifying -r on a running subnet
663                 may disrupt subnet traffic.
664                 Without -r, OpenSM attempts to preserve existing
665                 LID assignments resolving multiple use of same LID.
666
667       If a link is added or removed, OpenSM does not recalculate  the  routes
668       that  do  not  have  to change. A route has to change if the port is no
669       longer UP or no longer the MinHop. When routing changes are  performed,
670       the same algorithm for balancing the routes is invoked.
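
       The rule for when a route has to change can be written as a small
       predicate. The sketch below is only an illustration of that rule. The
       function name and the two dict arguments are hypothetical and do not
       correspond to OpenSM data structures.

```python
def route_must_change(current_port, port_is_up, hops_by_port):
    """Return True if the existing route via current_port must be recomputed.

    port_is_up:   {port: bool}, link state after the topology change
    hops_by_port: {port: hops to the target LID via that port}
    """
    if not port_is_up.get(current_port, False):
        return True  # the port is no longer UP
    usable = [h for p, h in hops_by_port.items() if port_is_up.get(p, False)]
    best = min(usable, default=None)
    # The route also changes if the port is no longer the MinHop.
    return hops_by_port.get(current_port) != best
```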
671
672       When file based routing is used, any topology changes are currently
673       ignored. The 'file' routing engine just loads the LFTs from the file
674       specified, with no reaction to the real topology. Obviously, this will
675       not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs
676       for non-existent switches will be skipped. Multicast is not affected by
677       the 'file' routing engine (this uses min hop tables).
678
679
680       Min Hop Algorithm
681
682       The Min Hop algorithm is invoked by default if no routing algorithm  is
683       specified.  It can also be invoked by specifying '-R minhop'.
684
685       The  Min  Hop algorithm is divided into two stages: computation of min-
686       hop tables on every switch and LFT output port  assignment.  Link  sub‐
687       scription  is also equalized with the ability to override based on port
688       GUID. The latter is supplied by:
689
690       -i <equalize-ignore-guids-file>
691       -ignore-guids <equalize-ignore-guids-file>
692                 This option provides the means to define a set of ports
693                 (by guid) that will be ignored by the link load
694                 equalization algorithm. Note that only endports (CA,
695                 switch port 0, and router ports) and not switch external
696                 ports are supported.
697
698       LMC awareness routes based on (remote) system or switch basis.
699
700
701       Purpose of UPDN Algorithm
702
703       The UPDN algorithm is designed to prevent deadlocks from  occurring  in
704       loops  of  the subnet. A loop-deadlock is a situation in which it is no
705       longer possible to send data between any two  hosts  connected  through
706       the  loop.  As  such,  the UPDN routing algorithm should be used if the
707       subnet is not a pure Fat Tree, and one of its loops  may  experience  a
708       deadlock (due, for example, to high pressure).
709
710       The UPDN algorithm is based on the following main stages:
711
712       1.  Auto-detect root nodes - based on the CA hop length from any switch
713       in the subnet, a statistical histogram is built for  each  switch  (hop
714       num  vs  number  of  occurrences). If the histogram reflects a specific
715       column (higher than others) for a certain node, then it is marked as  a
716       root node. Since the algorithm is statistical, it may not find any root
717       nodes. The list of the root nodes found by this  auto-detect  stage  is
718       used by the ranking process stage.
719
720           Note 1: The user can override the node list manually.
721           Note 2: If this stage cannot find any root nodes, and the user did
722                   not specify a guid list file, OpenSM defaults back to the
723                   Min Hop routing algorithm.
724
725       2.   Ranking  process  -  All  root switch nodes (found in stage 1) are
726       assigned a rank of 0. Using the BFS algorithm, the rest of  the  switch
727       nodes  in the subnet are ranked incrementally. This ranking aids in the
728       process of enforcing rules that ensure loop-free paths.
729
730       3.  Min Hop Table setting - after ranking is done, a BFS  algorithm  is
731       run  from  each  (CA  or  switch)  node  in  the subnet. During the BFS
732       process, the FDB table of each switch node traversed by BFS is updated,
733       in  reference to the starting node, based on the ranking rules and guid
734       values.
735
736       At the end of the process, the  updated  FDB  tables  ensure  loop-free
737       paths through the subnet.
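
       The ranking stage and the Up/Down rule can be illustrated with a short
       Python sketch. It is a simplified model, not the OpenSM code: the
       adjacency dict, the function names, and the use of the switch name as
       a stand-in for the GUID when two switches have the same rank are
       assumptions made for the example.

```python
from collections import deque


def rank_switches(adj, roots):
    """Assign rank 0 to the roots, then rank the other switches by BFS."""
    rank = {r: 0 for r in roots}
    queue = deque(roots)
    while queue:
        sw = queue.popleft()
        for nbr in adj[sw]:
            if nbr not in rank:
                rank[nbr] = rank[sw] + 1
                queue.append(nbr)
    return rank


def is_up(rank, a, b):
    """A step a -> b is 'up' if b is closer to a root than a."""
    if rank[a] != rank[b]:
        return rank[b] < rank[a]
    return b < a  # equal ranks: compare names (stand-in for GUID values)


def path_is_legal(rank, path):
    """Up/Down rule: no 'up' step is allowed after a 'down' step."""
    gone_down = False
    for a, b in zip(path, path[1:]):
        if is_up(rank, a, b):
            if gone_down:
                return False
        else:
            gone_down = True
    return True
```

       In a triangle of one root R and two switches A and B, the path
       A, R, B goes up and then down and is legal, while A, B, R goes down
       and then up and is rejected. Rejecting such paths is what removes the
       loop.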
738
739       Note:  Up/Down routing does not allow LID routing communication between
740       switches that are located inside spine "switch systems".  The reason is
741       that  there  is  no way to allow a LID route between them that does not
742       break the Up/Down rule.  One ramification of this is  that  you  cannot
743       run SM on switches other than the leaf switches of the fabric.
744
745
746       UPDN Algorithm Usage
747
748       Activation through OpenSM
749
750       Use  '-R  updn' option (instead of old '-u') to activate the UPDN algo‐
751       rithm.  Use '-a <root_guid_file>' for adding an  UPDN  guid  file  that
752       contains  the  root nodes for ranking.  If the `-a' option is not used,
753       OpenSM uses its auto-detect root nodes algorithm.
754
755       Notes on the guid list file:
756
757       1.   A valid guid file specifies one guid in each line. Lines  with  an
758       invalid format will be discarded.
759       2.   The user should specify the root switch guids. However, it is also
760       possible to specify CA guids; OpenSM will use the guid  of  the  switch
761       (if it exists) that connects the CA to the subnet as a root node.
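
       As an illustration of the "one guid in each line" rule, the sketch
       below reads such a list and discards malformed lines. The accepted
       syntax (up to 16 hex digits with an optional 0x prefix) is an
       assumption made for the example. The exact format OpenSM accepts is
       not specified in this section.

```python
import re

# Assumed syntax: optional "0x" followed by 1 to 16 hex digits.
_GUID_RE = re.compile(r"^(?:0x)?[0-9a-fA-F]{1,16}$")


def parse_guid_lines(lines):
    """Return the GUIDs found, one per line; skip lines that do not match."""
    guids = []
    for line in lines:
        text = line.strip()
        if _GUID_RE.match(text):
            guids.append(int(text, 16))
    return guids
```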
762
763
764       Fat-tree Routing Algorithm
765
766       The fat-tree algorithm optimizes routing for the "shift" communication
767       pattern. It should be chosen if a subnet is a symmetrical or almost
768       symmetrical fat-tree of various types. It supports more than just
769       K-ary-N-Trees: it handles non-constant K, cases where not all leaves
770       (CAs) are present, and any CBB ratio. As in UPDN, fat-tree routing also
771       prevents credit-loop deadlocks.
772
773       If the root guid file  is  not  provided  ('-a'  or  '--root_guid_file'
774       options),  the  topology has to be pure fat-tree that complies with the
775       following rules:
776         - Tree rank should be between two and eight (inclusively)
777         - Switches of the same rank should have the same number
778           of UP-going port groups*, unless they are root switches,
779           in which case they shouldn't have UP-going ports at all.
780         - Switches of the same rank should have the same number
781           of DOWN-going port groups, unless they are leaf switches.
782         - Switches of the same rank should have the same number
783           of ports in each UP-going port group.
784         - Switches of the same rank should have the same number
785           of ports in each DOWN-going port group.
786         - All the CAs have to be at the same tree level (rank).
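
       The port group rules above lend themselves to a mechanical check. The
       following Python sketch covers only the four "same number" rules and
       the root rule. The tree rank limits and the CA level rule are left
       out. The input format (a dict per switch with "rank", "leaf", and
       lists of port group sizes under "up" and "down"), the convention that
       root switches have rank 0, and the function name are assumptions made
       for the example.

```python
def check_pure_fat_tree(switches):
    """Return a list of rule violations (empty when the checks pass)."""
    problems = []
    by_rank = {}
    for sw in switches.values():
        by_rank.setdefault(sw["rank"], []).append(sw)

    for rank, members in sorted(by_rank.items()):
        # Leaf switches are exempt from the DOWN-going port group rules here.
        inner = [sw for sw in members if not sw["leaf"]]
        if rank == 0:
            if any(sw["up"] for sw in members):
                problems.append("root switch has UP-going ports")
        elif len({len(sw["up"]) for sw in members}) > 1:
            problems.append(
                f"rank {rank}: number of UP-going port groups differs")
        if len({len(sw["down"]) for sw in inner}) > 1:
            problems.append(
                f"rank {rank}: number of DOWN-going port groups differs")
        if len({n for sw in members for n in sw["up"]}) > 1:
            problems.append(f"rank {rank}: UP-going port group sizes differ")
        if len({n for sw in inner for n in sw["down"]}) > 1:
            problems.append(f"rank {rank}: DOWN-going port group sizes differ")
    return problems
```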
787
788       If the root guid file is provided, the topology doesn't have to be pure
789       fat-tree, and it should only comply with the following rules:
790         - Tree rank should be between two and eight (inclusively)
791         - All the Compute Nodes** have to be at the same tree level (rank).
792           Note that non-compute node CAs are allowed here to be at different
793           tree ranks.
794
795       *  ports that are connected to the same remote switch are referenced as
796       ´port group´.
797
798       **  list  of  compute  nodes  (CNs)  can  be  specified  by   ´-u´   or
799       ´--cn_guid_file´ OpenSM options.
800
801       Topologies  that  do  not  comply  cause a fallback to min hop routing.
802       Note that this can also occur on link failures which cause the topology
803       to no longer be "pure" fat-tree.
804
805       Note that although the fat-tree algorithm supports trees with a
806       non-integer CBB ratio, the routing will not be as balanced as with an
807       integer CBB ratio. In addition, although the algorithm allows leaf
808       switches to have any number of CAs, the closer the tree is to being
809       fully populated, the more effective the "shift" communication pattern
810       will be. In general, even if the root list is provided, the closer the
811       topology is to a pure and symmetrical fat-tree, the more optimal the
812       routing will be.
813
814       The algorithm also dumps a compute node ordering file, named
815       opensm-ftree-ca-order.dump, in the directory where the OpenSM log
816       resides. This file provides the CN order that may be used to create an
817       efficient communication pattern that matches the routing tables.
818
819       Routing between non-CN nodes
820
821       The use of the cn_guid_file option allows non-CN nodes to be located on
822       different levels in the fat tree. In such a case, it is not guaranteed
823       that the Fat Tree algorithm will route between two non-CN nodes. To
824       solve this problem, a list of non-CN nodes can be specified by the ´-G´
825       or ´--io_guid_file´ option. These nodes will be allowed to use switches
826       the wrong way round a specific number of times (specified by ´-H´ or
827       ´--max_reverse_hops´). With the proper max_reverse_hops and
828       io_guid_file values, you can ensure full connectivity in the Fat Tree.
829
830       Please note that using max_reverse_hops creates routes that use the
831       switch in a counter-stream way. This option should never be used to
832       connect nodes with high bandwidth traffic between them! It should only
833       be used to allow connectivity for HA purposes or similar. Also, having
834       routes the other way around can in theory cause credit loops.
835
836       Use these options with extreme care!
837
838       Activation through OpenSM
839
840       Use '-R ftree' option to activate  the  fat-tree  algorithm.   Use  '-a
841       <root_guid_file>' to provide root nodes for ranking. If the `-a' option
842       is not used, routing algorithm will detect  roots  automatically.   Use
843       '-u  <root_cn_file>'  to provide the list of compute nodes. If the `-u'
844       option is not used, all the CAs are considered as compute nodes.
845
846       Note: LMC > 0 is not supported by fat-tree routing. If this  is  speci‐
847       fied, the default routing algorithm is invoked instead.
848
849
850       LASH Routing Algorithm
851
852       LASH is an acronym for LAyered SHortest Path Routing. It is a determin‐
853       istic shortest path routing algorithm that  enables  topology  agnostic
854       deadlock-free routing within communication networks.
855
856       When computing the routing function, LASH analyzes the network topology
857       for the shortest-path routes between all pairs of  sources  /  destina‐
858       tions  and  groups  these paths into virtual layers in such a way as to
859       avoid deadlock.
860
861       Note that LASH analyzes routes and ensures deadlock freedom between
862       switch pairs. The link between an HCA and a switch does not need virtual
863       layers, as deadlock will not arise between a switch and an HCA.
864
865       In more detail, the algorithm works as follows:
866
867       1) LASH determines the shortest-path between all pairs of source / des‐
868       tination  switches.  Note,  LASH  ensures  the  same SL is used for all
869       SRC/DST - DST/SRC pairs and there is no guarantee that the return  path
870       for a given DST/SRC will be the reverse of the route SRC/DST.
871
872       2)  LASH then begins an SL assignment process where a route is assigned
873       to a layer (SL) if the addition of that route does not  cause  deadlock
874       within  that  layer.  This  is  achieved by maintaining and analysing a
875       channel dependency graph for each layer. Once the potential addition of
876       a path could lead to deadlock, LASH opens a new layer and continues the
877       process.
878
879       3) Once this stage has been completed, it is  highly  likely  that  the
880       first  layers  processed  will contain more paths than the latter ones.
881       To better balance the use of layers, LASH moves paths from one layer to
882       another so that the number of paths in each layer averages out.
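
       Step 2 can be illustrated with a short Python sketch that assigns
       routes to layers while keeping each layer's channel dependency graph
       acyclic. This is a simplified model rather than the OpenSM
       implementation: routes are given as lists of switch names, a channel
       is a directed link (a, b), the function names are made up for the
       example, and the balancing pass of step 3 is not included.

```python
def route_dependencies(path):
    """Channel dependencies of one route: each link depends on the next."""
    channels = list(zip(path, path[1:]))
    return set(zip(channels, channels[1:]))


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    state = {}  # 1 = on the current DFS stack, 2 = fully explored

    def visit(node):
        state[node] = 1
        for nxt in graph[node]:
            if state.get(nxt) == 1:
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = 2
        return False

    return any(node not in state and visit(node) for node in graph)


def assign_layers(paths):
    """Put each route in the first layer where it does not close a cycle."""
    layers = []
    for path in paths:
        deps = route_dependencies(path)
        for layer in layers:
            if not has_cycle(layer["deps"] | deps):
                layer["deps"] |= deps
                layer["paths"].append(path)
                break
        else:
            # Every existing layer would deadlock: open a new layer.
            layers.append({"deps": set(deps), "paths": [path]})
    return layers
```

       On a ring of four switches, the four clockwise two-hop routes form a
       dependency cycle when placed together, so the fourth route ends up in
       a second layer.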
883
884       Note,  the implementation of LASH in opensm attempts to use as few lay‐
885       ers as possible. This number can be less than the number of actual lay‐
886       ers available.
887
888       In  general  LASH  is  a  very flexible algorithm. It can, for example,
889       reduce to Dimension Order Routing in certain topologies, it is topology
890       agnostic and fares well in the face of faults.
891
892       It  has been shown that for both regular and irregular topologies, LASH
893       outperforms Up/Down. The reason for this is that LASH  distributes  the
894       traffic  more  evenly through a network, avoiding the bottleneck issues
895       related to a root node and always routes shortest-path.
896
897       The algorithm was developed by Simula Research Laboratory.
898
899
900       Use the '-R lash -Q' option to activate the LASH algorithm.
901
902       Note: QoS support has to be turned on in order that SL/VL mappings  are
903       used.
904
905       Note:  LMC  > 0 is not supported by the LASH routing. If this is speci‐
906       fied, the default routing algorithm is invoked instead.
907
908       For open regular cartesian meshes the DOR algorithm is the ideal  rout‐
909       ing  algorithm. For toroidal meshes on the other hand there are routing
910       loops that can cause deadlocks. LASH can be used to route these  cases.
911       The  performance of LASH can be improved by preconditioning the mesh in
912       cases where there are multiple links connecting switches  and  also  in
913       cases  where the switches are not cabled consistently. An option exists
914       for LASH to do this. To invoke this use '-R  lash  -Q  --do_mesh_analy‐
915       sis'.  This  will add an additional phase that analyses the mesh to try
916       to determine the dimension and size of a mesh. If  it  determines  that
917       the  mesh  looks  like an open or closed cartesian mesh it reorders the
918       ports in dimension order before the rest of the LASH algorithm runs.
919
920       DOR Routing Algorithm
921
922       The Dimension Order Routing algorithm is based on the Min Hop algorithm
923       and  so  uses  shortest paths.  Instead of spreading traffic out across
924       different paths with the same shortest distance, it chooses  among  the
925       available shortest paths based on an ordering of dimensions.  Each port
926       must be consistently cabled to represent a  hypercube  dimension  or  a
927       mesh  dimension.   Paths  are grown from a destination back to a source
928       using the lowest dimension (port) of  available  paths  at  each  step.
929       This provides the ordering necessary to avoid deadlock.  When there are
930       multiple links between any two switches, they still represent only  one
931       dimension  and traffic is balanced across them unless port equalization
932       is turned off.  In the case of hypercubes, the same port must  be  used
933       throughout the fabric to represent the hypercube dimension and match on
934       both ends of the cable.  In the case of meshes,  the  dimension  should
935       consistently  use  the  same  pair of ports, one port on one end of the
936       cable, and the other port on the other end, continuing along  the  mesh
937       dimension.
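
       The idea of ordering by dimension can be shown on an open mesh whose
       switches are identified by coordinates. The sketch below is textbook
       dimension order routing, given only as an illustration: it walks
       forward from source to destination and resolves the lowest dimension
       first, whereas OpenSM grows paths from the destination and works with
       port numbers. The coordinate model and the function name are
       assumptions made for the example.

```python
def dor_path(src, dst):
    """Route on an open mesh, correcting one dimension at a time."""
    path = [tuple(src)]
    cur = list(src)
    for dim in range(len(cur)):
        step = 1 if dst[dim] > cur[dim] else -1
        # Finish this dimension completely before touching the next one.
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append(tuple(cur))
    return path
```

       Because every route finishes dimension 0 before it starts on
       dimension 1, no route ever turns from a higher dimension back into a
       lower one, which is the ordering that avoids deadlock on an open
       mesh.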
938
939       Use '-R dor' option to activate the DOR algorithm.
940
941
942       Routing References
943
944       To  learn  more  about deadlock-free routing, see the article "Deadlock
945       Free Message Routing in  Multiprocessor  Interconnection  Networks"  by
946       William J Dally and Charles L Seitz (1985).
947
948       To  learn  more about the up/down algorithm, see the article "Effective
949       Strategy to Compute Forwarding Tables for InfiniBand Networks" by  Jose
950       Carlos  Sancho,  Antonio  Robles,  and  Jose  Duato  at the Universidad
951       Politecnica de Valencia.
952
953       To learn more about LASH and the flexibility behind it, the requirement
954       for  layers,  performance comparisons to other algorithms, see the fol‐
955       lowing articles:
956
957       "Layered Routing in Irregular Networks", Lysne et al., IEEE Transactions
958       on Parallel and Distributed Systems, Vol. 16, No. 12, December 2005.
959
960       "Routing for the ASI Fabric Manager", Solheim et al., IEEE
961       Communications Magazine, Vol. 44, No. 7, July 2006.
962
963       "Layered Shortest Path (LASH) Routing in Irregular System Area
964       Networks", Skeie et al., IEEE Computer Society Communication
965       Architecture for Clusters 2002.
966
967
968       Modular Routing Engine
969
970       Modular routing engine structure allows for the ease of "plugging"  new
971       routing modules.
972
973       Currently, only unicast callbacks are supported. Multicast can be added
974       later.
975
976       One existing routing module is up-down "updn", which may  be  activated
977       with '-R updn' option (instead of old '-u').
978
979       General usage is: $ opensm -R 'module-name'
980
981       There is also a trivial routing module which is able to load LFT tables
982       from a file.
983
984       Main features:
985
986        - this will load switch LFTs and/or LID matrices (min hops tables)
987        - this will load switch LFTs according to the path entries introduced
988          in the file
989        - no additional checks will be performed (such as "is port connected",
990          etc.)
991        - if fabric LIDs have changed, this will try to reconstruct
992          LFTs correctly if endport GUIDs are represented in the file
993          (in order to disable this, GUIDs may be removed from the file
994           or zeroed)
995
996       The file format is compatible with the output of the 'ibroute'
997       utility; for the whole fabric it can be generated with dump_lfts.sh.
998
999       To activate file based routing module, use:
1000
1001         opensm -R file -U /path/to/lfts_file
1002
1003       If the lfts_file is not found or is in error, the default routing algo‐
1004       rithm is utilized.
1005
1006       The ability to dump switch lid matrices (aka min hops tables)  to  file
1007       and later to load these is also supported.
1008
1009       The usage is similar to loading unicast forwarding tables from an lfts
1010       file (introduced by the 'file' routing engine), but the lid matrix file
1011       name should be specified with the -M or --lid_matrix_file option. For
1012       example:
1013
1014         opensm -R file -M ./opensm-lid-matrix.dump
1015
1016       The dump file is named ´opensm-lid-matrix.dump´ and will be generated
1017       in the standard opensm dump directory (/var/log by default) when the
1018       OSM_LOG_ROUTING logging flag is set.
1019
1020       When routing engine 'file' is activated, but the lfts file is not
1021       specified or cannot be opened, the default lid matrix algorithm is used.
1022
1023       There is also a switch forwarding tables dumper which generates a file
1024       compatible with dump_lfts.sh output. This file can be used as input for
1025       forwarding tables loading by the 'file' routing engine. Either or both
1026       of the options -U and -M can be specified together with ´-R file´.
1027
1028

FILES

1030       /etc/rdma/opensm.conf
1031              default OpenSM config file.
1032
1033
1034       /etc/rdma/ib-node-name-map
1035              default node name map file.  See ibnetdiscover for more informa‐
1036              tion on format.
1037
1038
1039       /etc/rdma/partitions.conf
1040              default partition config file
1041
1042
1043       /etc/rdma/qos-policy.conf
1044              default QOS policy config file
1045
1046
1047       /etc/rdma/prefix-routes.conf
1048              default prefix routes file.
1049
1050

AUTHORS

1052       Hal Rosenstock
1053              <hal.rosenstock@gmail.com>
1054
1055       Sasha Khapyorsky
1056              <sashak@voltaire.com>
1057
1058       Eitan Zahavi
1059              <eitan@mellanox.co.il>
1060
1061       Yevgeny Kliteynik
1062              <kliteyn@mellanox.co.il>
1063
1064       Thomas Sodring
1065              <tsodring@simula.no>
1066
1067       Ira Weiny
1068              <weiny2@llnl.gov>
1069
1070
1071
1072OpenIB                         October 22, 2009                      OPENSM(8)