1OPENSM(8) OpenIB Management OPENSM(8)
2
3
4
6 opensm - InfiniBand subnet manager and administration (SM/SA)
7
8
10 opensm [--version] [-F | --config <file_name>] [-c(reate-config)
11 <file_name>] [-g(uid) <GUID in hex>] [-l(mc) <LMC>] [-p(riority) <PRI‐
12 ORITY>] [-smkey <SM_Key>] [--sm_sl <SL number>] [-r(eassign_lids)] [-R
13 <engine name(s)> | --routing_engine <engine name(s)>] [--do_mesh_analy‐
14 sis] [--lash_start_vl <vl number>] [-A | --ucast_cache] [-z | --con‐
15 nect_roots] [-M <file name> | --lid_matrix_file <file name>] [-U <file
16 name> | --lfts_file <file name>] [-S | --sadb_file <file name>] [-a |
17 --root_guid_file <path to file>] [-u | --cn_guid_file <path to file>]
18 [-G | --io_guid_file <path to file>] [-H | --max_reverse_hops <max
19 reverse hops allowed>] [-X | --guid_routing_order_file <path to file>]
20 [-m | --ids_guid_file <path to file>] [-o(nce)] [-s(weep) <interval>]
21 [-t(imeout) <milliseconds>] [--retries <number>] [-maxsmps <number>]
22 [-console [off | local | socket | loopback]] [-console-port <port>]
23 [-i(gnore-guids) <equalize-ignore-guids-file>] [-w | --hop_weights_file
24 <path to file>] [-f <log file path> | --log_file <log file path> ] [-L
25 | --log_limit <size in MB>] [-e(rase_log_file)] [-P(config) <partition
26 config file> ] [-N | --no_part_enforce] [-Q | --qos [-Y | --qos_pol‐
27 icy_file <file name>]] [-y | --stay_on_fatal] [-B | --daemon] [-I |
28 --inactive] [--perfmgr] [--perfmgr_sweep_time_s <seconds>] [--pre‐
29 fix_routes_file <path>] [--consolidate_ipv6_snm_req] [-v(erbose)] [-V]
30 [-D <flags>] [-d(ebug) <number>] [-h(elp)] [-?]
31
32
34 opensm is an InfiniBand compliant Subnet Manager and Administration,
35 and runs on top of OpenIB.
36
37 opensm provides an implementation of an InfiniBand Subnet Manager and
38 Administration. Such a software entity is required to run in order
39 to initialize the InfiniBand hardware (at least one per InfiniBand
40 subnet).
41
42 opensm also contains an experimental version of a performance
43 manager.
44
45 opensm defaults were designed to meet the common case usage on clusters
46 with up to a few hundred nodes. Thus, in this default mode, opensm will
47 scan the IB fabric, initialize it, and sweep occasionally for changes.
48
49 opensm attaches to a specific IB port on the local machine and config‐
50 ures only the fabric connected to it. (If the local machine has other
51 IB ports, opensm will ignore the fabrics connected to those other
52 ports). If no port is specified, it will select the first "best" avail‐
53 able port.
54
55 opensm can present the available ports and prompt for a port number to
56 attach to.
57
58 By default, the run is logged to two files: /var/log/messages and
59 /var/log/opensm.log. The first file will register only general major
60 events, whereas the second will include details of reported errors. All
61 errors reported in this second file should be treated as indicators of
62 IB fabric health issues. (Note that when a fatal and non-recoverable
63 error occurs, opensm will exit.) Both log files should include the
64 message "SUBNET UP" if opensm was able to set up the subnet correctly.
65
66
68 --version
69 Prints OpenSM version and exits.
70
71 -F, --config <config file>
72 The name of the OpenSM config file. When not specified,
73 /etc/rdma/opensm.conf will be used (if it exists).
74
75 -c, --create-config <file name>
76 OpenSM will dump its configuration to the specified file and
77 exit. This is a way to generate an OpenSM configuration file
78 template.
79
80 -g, --guid <GUID in hex>
81 This option specifies the local port GUID value with which
82 OpenSM should bind. OpenSM may be bound to 1 port at a time.
83 If GUID given is 0, OpenSM displays a list of possible port
84 GUIDs and waits for user input. Without -g, OpenSM tries to use
85 the default port.
86
87 -l, --lmc <LMC value>
88 This option specifies the subnet's LMC value. The number of
89 LIDs assigned to each port is 2^LMC. The LMC value must be in
90 the range 0-7. LMC values > 0 allow multiple paths between
91 ports. LMC values > 0 should only be used if the subnet topol‐
92 ogy actually provides multiple paths between ports, i.e. multi‐
93 ple interconnects between switches. Without -l, OpenSM defaults
94 to LMC = 0, which allows one path between any two ports.
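The relationship between LMC and the number of LIDs per port can be sketched as follows. This is only an illustration of the 2^LMC rule and the 0-7 range stated above; the function name lids_per_port is hypothetical and is not part of OpenSM.

```python
def lids_per_port(lmc: int) -> int:
    """Return the number of LIDs assigned to each port for a given LMC."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 1 << lmc  # 2^LMC


if __name__ == "__main__":
    for lmc in (0, 1, 3, 7):
        print(f"LMC={lmc}: {lids_per_port(lmc)} LID(s) per port")
```

With the default LMC of 0 each port receives a single LID, which is why only one path exists between any two ports.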
95
96 -p, --priority <Priority value>
97 This option specifies the SM´s PRIORITY. This will affect the
98 handover cases, where the master is chosen by priority and GUID.
99 Range goes from 0 (default and lowest priority) to 15 (highest).
100
101 -smkey <SM_Key value>
102 This option specifies the SM´s SM_Key (64 bits). This will
103 affect SM authentication. Note that OpenSM version 3.2.1 and
104 below used the default value '1' in host byte order. This is
105 fixed now, but you may need this option to interoperate with an
106 old OpenSM running on a little endian machine.
107
108 --sm_sl <SL number>
109 This option sets the SL to use for communication with the SM/SA.
110 Defaults to 0.
111
112 -r, --reassign_lids
113 This option causes OpenSM to reassign LIDs to all end nodes.
114 Specifying -r on a running subnet may disrupt subnet traffic.
115 Without -r, OpenSM attempts to preserve existing LID assignments
116 resolving multiple use of same LID.
117
118 -R, --routing_engine <Routing engine names>
119 This option chooses routing engine(s) to use instead of Min Hop
120 algorithm (default). Multiple routing engines can be specified
121 separated by commas so that specific ordering of routing algo‐
122 rithms will be tried if earlier routing engines fail. Supported
123 engines: minhop, updn, file, ftree, lash, dor
124
125 --do_mesh_analysis
126 This option enables additional analysis for the lash routing
127 engine to precondition switch port assignments in regular carte‐
128 sian meshes which may reduce the number of SLs required to give
129 a deadlock free routing.
130
131 --lash_start_vl <vl number>
132 This option sets the starting VL to use for the lash routing
133 algorithm. Defaults to 0.
134
135 -A, --ucast_cache
136 This option enables unicast routing cache and prevents routing
137 recalculation (which is a heavy task in a large cluster) when
138 there was no topology change detected during the heavy sweep, or
139 when the topology change does not require new routing calculation,
140 e.g. when one or more CAs/RTRs/leaf switches go down,
141 or one or more of these nodes come back after being down. A
142 very common case that is handled by the unicast routing cache is
143 host reboot, which otherwise would cause two full routing recal‐
144 culations: one when the host goes down, and the other when the
145 host comes back online.
146
147 -z, --connect_roots
148 This option forces the routing engines (up/down and fat-tree) to
149 provide connectivity between root switches and in this way to be
150 fully IBA compliant. In many cases this can violate the "pure"
151 deadlock free algorithm, so use it carefully.
152
153 -M, --lid_matrix_file <file name>
154 This option specifies the name of the lid matrix dump file from
155 where switch lid matrices (min hops tables) will be loaded.
156
157 -U, --lfts_file <file name>
158 This option specifies the name of the LFTs file from where
159 switch forwarding tables will be loaded.
160
161 -S, --sadb_file <file name>
162 This option specifies the name of the SA DB dump file from where
163 SA database will be loaded.
164
165 -a, --root_guid_file <file name>
166 Set the root nodes for the Up/Down or Fat-Tree routing algorithm
167 to the guids provided in the given file (one to a line).
168
169 -u, --cn_guid_file <file name>
170 Set the compute nodes for the Fat-Tree routing algorithm to the
171 guids provided in the given file (one to a line).
172
173 -G, --io_guid_file <file name>
174 Set the I/O nodes for the Fat-Tree routing algorithm to the
175 guids provided in the given file (one to a line). I/O nodes are
176 non-CN nodes allowed to use up to max_reverse_hops switches the
177 wrong way around to improve connectivity.
178
179 -H, --max_reverse_hops <max reverse hops allowed>
180 Set the maximum number of reverse hops an I/O node is allowed to
181 make. A reverse hop is the use of a switch the wrong way around.
182
183 -m, --ids_guid_file <file name>
184 Name of the map file with set of the IDs which will be used by
185 Up/Down routing algorithm instead of node GUIDs (format: <guid>
186 <id> per line).
187
188 -X, --guid_routing_order_file <file name>
189 Set the order port guids will be routed for the MinHop and
190 Up/Down routing algorithms to the guids provided in the given
191 file (one to a line).
192
193 -o, --once
194 This option causes OpenSM to configure the subnet once, then
195 exit. Ports remain in the ACTIVE state.
196
197 -s, --sweep <interval value>
198 This option specifies the number of seconds between subnet
199 sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM
200 defaults to a sweep interval of 10 seconds.
201
202 -t, --timeout <value>
203 This option specifies the time in milliseconds used for transac‐
204 tion timeouts. Specifying -t 0 disables timeouts. Without -t,
205 OpenSM defaults to a timeout value of 200 milliseconds.
206
207 --retries <number>
208 This option specifies the number of retries used for transac‐
209 tions. Without --retries, OpenSM defaults to 3 retries for
210 transactions.
211
212 -maxsmps <number>
213 This option specifies the number of VL15 SMP MADs allowed on the
214 wire at any one time. Specifying -maxsmps 0 allows unlimited
215 outstanding SMPs. Without -maxsmps, OpenSM defaults to a maxi‐
216 mum of 4 outstanding SMPs.
217
218 -console [off | local | socket | loopback]
219 This option brings up the OpenSM console (default off). Note
220 that the socket and loopback options will only be available if
221 OpenSM was built with --enable-console-socket.
222
223 -console-port <port>
224 Specify an alternate telnet port for the socket console (default
225 10000). Note that this option only appears if OpenSM was built
226 with --enable-console-socket.
227
228 -i, -ignore-guids <equalize-ignore-guids-file>
229 This option provides the means to define a set of ports (by node
230 guid and port number) that will be ignored by the link load
231 equalization algorithm.
232
233 -w, --hop_weights_file <path to file>
234 This option provides weighting factors per port representing a
235 hop cost in computing the lid matrix. The file consists of
236 lines containing a switch port GUID (specified as a 64 bit hex
237 number, with leading 0x), output port number, and weighting fac‐
238 tor. Any port not listed in the file defaults to a weighting
239 factor of 1. Lines starting with # are comments. Weights
240 affect only the output route from the port, so many useful con‐
241 figurations will require weights to be specified in pairs.
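A hop weights file of the kind described above could be read with a sketch like the one below. It assumes that the three fields on a line are separated by white space and that the port number and weight are decimal; the text above does not state the separator, so treat both as assumptions. The function names and the GUIDs in the usage example are made up for illustration.

```python
def parse_hop_weights(text):
    """Map (switch port GUID, output port) -> weighting factor."""
    weights = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no weights
        guid_s, port_s, weight_s = line.split()
        weights[(int(guid_s, 16), int(port_s))] = int(weight_s)
    return weights


def hop_weight(weights, guid, port):
    """Ports not listed in the file default to a weighting factor of 1."""
    return weights.get((guid, port), 1)


if __name__ == "__main__":
    sample = "# guid port weight\n0x0002c90000000001 3 10\n"
    table = parse_hop_weights(sample)
    print(hop_weight(table, 0x0002C90000000001, 3))
    print(hop_weight(table, 0x0002C90000000001, 4))
```

Because a weight affects only the output direction of a port, a symmetric cost on a link needs one entry for each end of that link.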
242
243 -x, --honor_guid2lid
244 This option forces OpenSM to honor the guid2lid file, when it
245 comes out of Standby state, if such file exists under
246 OSM_CACHE_DIR, and is valid. By default, this is FALSE.
247
248 -f, --log_file <file name>
249 This option defines the log to be the given file. By default,
250 the log goes to /var/log/opensm.log. For the log to go to stan‐
251 dard output use -f stdout.
252
253 -L, --log_limit <size in MB>
254 This option defines maximal log file size in MB. When specified
255 the log file will be truncated upon reaching this limit.
256
257 -e, --erase_log_file
258 This option will cause deletion of the log file (if it already
259 exists). By default, the log file is cumulative.
260
261 -P, --Pconfig <partition config file>
262 This option defines the optional partition configuration file.
263 The default name is /etc/rdma/partitions.conf.
264
265 --prefix_routes_file <file name>
266 Prefix routes control how the SA responds to path record queries
267 for off-subnet DGIDs. By default, the SA fails such queries.
268 The PREFIX ROUTES section below describes the format of the con‐
269 figuration file. The default path is
270 /etc/rdma/prefix-routes.conf.
271
272 -Q, --qos
273 This option enables QoS setup. It is disabled by default.
274
275 -Y, --qos_policy_file <file name>
276 This option defines the optional QoS policy file. The default
277 name is /etc/rdma/qos-policy.conf. See QoS_manage‐
278 ment_in_OpenSM.txt in opensm doc for more information on config‐
279 uring QoS policy via this file.
280
281 -N, --no_part_enforce
282 This option disables partition enforcement on switch external
283 ports.
284
285 -y, --stay_on_fatal
286 This option will cause SM not to exit on fatal initialization
287 issues: if SM discovers duplicated guids or a 12x link with lane
288 reversal badly configured. By default, the SM will exit on
289 these errors.
290
291 -B, --daemon
292 Run in daemon mode - OpenSM will run in the background.
293
294 -I, --inactive
295 Start SM in inactive rather than init SM state. This option can
296 be used in conjunction with the perfmgr so as to run a stand‐
297 alone performance manager without SM/SA. However, this is NOT
298 currently implemented in the performance manager.
299
300 --perfmgr
301 Enable the perfmgr. Only takes effect if --enable-perfmgr was
302 specified at configure time. See performance-manager-HOWTO.txt
303 in opensm doc for more information on running perfmgr.
304
305 --perfmgr_sweep_time_s <seconds>
306 Specify the sweep time for the performance manager in seconds
307 (default is 180 seconds). Only takes effect if --enable-perfmgr
308 was specified at configure time.
309
310 --consolidate_ipv6_snm_req
311 Use shared MLID for IPv6 Solicited Node Multicast groups per
312 MGID scope and P_Key.
313
314 -v, --verbose
315 This option increases the log verbosity level. The -v option
316 may be specified multiple times to further increase the ver‐
317 bosity level. See the -D option for more information about log
318 verbosity.
319
320 -V This option sets the maximum verbosity level and forces log
321 flushing. The -V option is equivalent to ´-D 0xFF -d 2´. See
322 the -D option for more information about log verbosity.
323
324 -D <value>
325 This option sets the log verbosity level. A flags field must
326 follow the -D option. A bit set/clear in the flags enables/dis‐
327 ables a specific log level as follows:
328
329 BIT LOG LEVEL ENABLED
330 ---- -----------------
331 0x01 - ERROR (error messages)
332 0x02 - INFO (basic messages, low volume)
333 0x04 - VERBOSE (interesting stuff, moderate volume)
334 0x08 - DEBUG (diagnostic, high volume)
335 0x10 - FUNCS (function entry/exit, very high volume)
336 0x20 - FRAMES (dumps all SMP and GMP frames)
337 0x40 - ROUTING (dump FDB routing information)
338 0x80 - currently unused.
339
340 Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying
341 -D 0 disables all messages. Specifying -D 0xFF enables all mes‐
342 sages (see -V). High verbosity levels may require increasing
343 the transaction timeout with the -t option.
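The flags field is a plain bit mask, so a value passed to -D can be decoded as in the following sketch. The table is copied from the list above; the helper name decode_log_flags is hypothetical.

```python
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_log_flags(flags: int):
    """Return the names of the log levels enabled by a -D flags value."""
    return [name for bit, name in LOG_LEVELS.items() if flags & bit]


if __name__ == "__main__":
    print(decode_log_flags(0x03))  # the default when -D is not given
    print(decode_log_flags(0x43))
```

For example, ERROR + INFO + ROUTING corresponds to 0x01 | 0x02 | 0x40 = 0x43.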
344
345 -d, --debug <value>
346 This option specifies a debug option. These options are not
347 normally needed. The number following -d selects the debug
348 option to enable as follows:
349
350 OPT Description
351 --- -----------------
352 -d0 - Ignore other SM nodes
353 -d1 - Force single threaded dispatching
354 -d2 - Force log flushing after each log message
355 -d3 - Disable multicast support
356
357 -h, --help
358 Display this usage info then exit.
359
360 -? Display this usage info then exit.
361
362
364 The following environment variables control opensm behavior:
365
366 OSM_TMP_DIR - controls the directory in which the temporary files gen‐
367 erated by opensm are created. These files are: opensm-subnet.lst,
368 opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.
369
370 OSM_CACHE_DIR - opensm stores certain data to the disk such that subse‐
371 quent runs are consistent. The default directory used is
372 /var/cache/opensm. The following file is included in it:
373
374 guid2lid - stores the LID range assigned to each GUID
375
376
378 When opensm receives a HUP signal, it starts a new heavy sweep as if a
379 trap was received or a topology change was found.
380
381 Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log
382 for logrotate purposes.
383
384
386 The default name of OpenSM partitions configuration file is
387 /etc/rdma/partitions.conf. The default may be changed by using the
388 --Pconfig (-P) option with OpenSM.
389
390 The default partition will be created by OpenSM unconditionally even
391 when partition configuration file does not exist or cannot be accessed.
392
393 The default partition has P_Key value 0x7fff. OpenSM´s port will always
394 have full membership in default partition. All other end ports will
395 have full membership if the partition configuration file is not found
396 or cannot be accessed, or limited membership if the file exists and can
397 be accessed but there is no rule for the Default partition.
398
399 Effectively, this amounts to the same as if one of the following rules
400 appeared in the partition configuration file.
401
402 In the case of no rule for the Default partition:
403
404 Default=0x7fff : ALL=limited, SELF=full ;
405
406 In the case of no partition configuration file or file cannot be
407 accessed:
408
409 Default=0x7fff : ALL=full ;
410
411
412 File Format
413
414 Comments:
415
416 Any content following a ´#´ character is a comment and is ignored by
417 the parser.
418
419 General file format:
420
421 <Partition Definition>:<PortGUIDs list> ;
422
423 Partition Definition:
424
425 [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
426
427 PartitionName - string, will be used with logging. When omitted
428 empty string will be used.
429 PKey - P_Key value for this partition. Only low 15 bits will
430 be used. When omitted will be autogenerated.
431 flag - used to indicate IPoIB capability of this partition.
432 defmember=full|limited - specifies default membership for port guid
433 list. Default is limited.
434
435 Currently recognized flags are:
436
437 ipoib - indicates that this partition may be used for IPoIB, as
438 result IPoIB capable MC group will be created.
439 rate=<val> - specifies rate for this IPoIB MC group
440 (default is 3 (10 Gbps))
441 mtu=<val> - specifies MTU for this IPoIB MC group
442 (default is 4 (2048))
443 sl=<val> - specifies SL for this IPoIB MC group
444 (default is 0)
445 scope=<val> - specifies scope for this IPoIB MC group
446 (default is 2 (link local)). Multiple scope settings
447 are permitted for a partition.
448
449 Note that values for rate, mtu, and scope should be specified as
450 defined in the IBTA specification (for example, mtu=4 for 2048).
451
452 PortGUIDs list:
453
454 PortGUID - GUID of partition member EndPort. Hexadecimal
455 numbers should start from 0x, decimal numbers
456 are accepted too.
457 full or limited - indicates full or limited membership for this
458 port. When omitted (or unrecognized) limited
459 membership is assumed.
460
461 There are several useful keywords for PortGUID definition:
462
463 - 'ALL' means all end ports in this subnet.
464 - 'ALL_CAS' means all Channel Adapter end ports in this subnet.
465 - 'ALL_SWITCHES' means all Switch end ports in this subnet.
466 - 'ALL_ROUTERS' means all Router end ports in this subnet.
467 - 'SELF' means subnet manager's port.
468
469 Empty list means no ports in this partition.
470
471 Notes:
472
473 White space is permitted between delimiters ('=', ',',':',';').
474
475 A line can be wrapped after the ':' that follows the Partition
476 Definition and between the entries of the PortGUIDs list.
477
478 PartitionName does not need to be unique, but PKey does need to be unique.
479 If PKey is repeated then those partition configurations will be merged
480 and first PartitionName will be used (see also next note).
481
482 It is possible to split partition configuration in more than one defi‐
483 nition, but then PKey should be explicitly specified (otherwise differ‐
484 ent PKey values will be generated for those definitions).
485
486 Examples:
487
488 Default=0x7fff : ALL, SELF=full ;
489 Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ;
490
491 NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306
492 ;
493
494 YetAnotherOne = 0x300 : SELF=full ;
495 YetAnotherOne = 0x300 : ALL=limited ;
496
497 ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
498 # 0x123453, 0x123454 will be limited
499 ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
500 # 0x123456, 0x123457 will be limited
501 ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457,
502 0x123458=full;
503 ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
504 ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited,
505 0x12345d;
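The file format above can be made concrete with a small parser for a single definition (already joined into one string, up to the closing ';'). This is a minimal sketch and not the OpenSM parser: the function name and the returned dictionary layout are hypothetical, and treating an unrecognized membership word the same as an omitted one (falling back to defmember, which itself defaults to limited) is an assumption.

```python
def parse_partition(definition: str) -> dict:
    """Parse '<Partition Definition> : <PortGUIDs list> ;' into a dict."""
    body = definition.strip().rstrip(";")
    head, _, members_s = body.partition(":")
    fields = [f.strip() for f in head.split(",")]

    name, _, pkey_s = fields[0].partition("=")
    pkey_s = pkey_s.strip()
    # Only the low 15 bits of the P_Key are used; None means "autogenerate".
    pkey = int(pkey_s, 0) & 0x7FFF if pkey_s else None

    flags, defmember = {}, "limited"
    for field in fields[1:]:
        key, _, value = (part.strip() for part in field.partition("="))
        if key == "defmember":
            defmember = value
        elif key:
            flags[key] = value or True

    members = {}
    for entry in members_s.split(","):
        port, _, membership = (part.strip() for part in entry.partition("="))
        if not port:
            continue
        if membership not in ("full", "limited"):
            membership = defmember  # omitted or unrecognized (assumption)
        members[port] = membership

    return {"name": name.strip(), "pkey": pkey, "flags": flags,
            "defmember": defmember, "members": members}


if __name__ == "__main__":
    print(parse_partition("Default=0x7fff : ALL=limited, SELF=full ;"))
    print(parse_partition("ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;"))
```

Port entries are kept as strings so that keywords such as ALL and SELF pass through unchanged; merging several definitions that share a PKey is left out of the sketch.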
506
507
508 Note:
509
510 The following rule is equivalent to how OpenSM used to run prior to the
511 partition manager:
512
513 Default=0x7fff,ipoib:ALL=full;
514
515
517 There is a set of QoS related low-level configuration parameters. All
518 these parameter names are prefixed by "qos_" string. Here is a full
519 list of these parameters:
520
521 qos_max_vls - The maximum number of VLs that will be on the subnet
522 qos_high_limit - The limit of High Priority component of VL
523 Arbitration table (IBA 7.6.9)
524 qos_vlarb_low - Low priority VL Arbitration table (IBA 7.6.9)
525 template
526 qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
527 template
528 Both VL arbitration templates are pairs of
529 VL and weight
530 qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is
531 a list of VLs corresponding to SLs 0-15 (Note
532 that VL15 used here means drop this SL)
533
534 Typical default values (hard-coded in OpenSM initialization) are:
535
536 qos_max_vls 15
537 qos_high_limit 0
538 qos_vlarb_low
539 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
540 qos_vlarb_high
541 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
542 qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
543
544 The syntax is compatible with rest of OpenSM configuration options and
545 values may be stored in OpenSM config file (cached options file).
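The value strings shown above are simple comma-separated lists, which the following sketch turns into Python structures. The function names are hypothetical; the only rules used are the ones stated in this section (VL:weight pairs for the arbitration templates, one VL per SL for qos_sl2vl, and VL15 meaning "drop this SL").

```python
def parse_vlarb(value: str):
    """Parse a VL arbitration template into a list of (VL, weight) pairs."""
    pairs = []
    for item in value.split(","):
        vl, weight = item.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


def parse_sl2vl(value: str):
    """Parse an SL2VL template; index i holds the VL used for SL i."""
    return [int(v) for v in value.split(",") if v.strip()]


def dropped_sls(sl2vl):
    """SLs mapped to VL15 are dropped."""
    return [sl for sl, vl in enumerate(sl2vl) if vl == 15]


if __name__ == "__main__":
    table = parse_sl2vl("0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7")
    print(len(table), dropped_sls(table))
    print(parse_vlarb("0:4,1:0,2:0"))
```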
546
547 In addition to the above, we may define separate QoS configuration
548 parameters sets for various target types. As targets, we currently sup‐
549 port CAs, routers, switch external ports, and switch's enhanced port 0.
550 The names of such specialized parameters are prefixed by "qos_<type>_"
551 string. Here is a full list of the currently supported sets:
552
553 qos_ca_ - QoS configuration parameters set for CAs.
554 qos_rtr_ - parameters set for routers.
555 qos_sw0_ - parameters set for switches' port 0.
556 qos_swe_ - parameters set for switches' external ports.
557
558 Examples:
559 qos_sw0_max_vls=2
560 qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
561 qos_swe_high_limit=0
562
563
565 Prefix routes control how the SA responds to path record queries for
566 off-subnet DGIDs. By default, the SA fails such queries. Note that
567 IBA does not specify how the SA should obtain off-subnet path record
568 information. The prefix routes configuration is meant as a stop-gap
569 until the specification is completed.
570
571 Each line in the configuration file is a 64-bit prefix followed by a
572 64-bit GUID, separated by white space. The GUID specifies the router
573 port on the local subnet that will handle the prefix. Blank lines are
574 ignored, as is anything between a # character and the end of the line.
575 The prefix and GUID are both in hex, the leading 0x is optional.
576 Either, or both, can be wild-carded by specifying an asterisk instead
577 of an explicit prefix or GUID.
578
579 When responding to a path record query for an off-subnet DGID, opensm
580 searches for the first prefix match in the configuration file. There‐
581 fore, the order of the lines in the configuration file is important: a
582 wild-carded prefix at the beginning of the configuration file renders
583 all subsequent lines useless. If there is no match, then opensm fails
584 the query. It is legal to repeat prefixes in the configuration file,
585 opensm will return the path to the first available matching router. A
586 configuration file with a single line where both prefix and GUID are
587 wild-carded means that a path record query specifying any off-subnet
588 DGID should return a path to the first available router. This configu‐
589 ration yields the same behavior formerly achieved by compiling opensm
590 with -DROUTER_EXP which has been obsoleted.
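The first-match lookup described above can be sketched as follows. The file parsing follows the stated format (hex prefix and GUID, optional 0x, '*' as a wild card, '#' comments). The function names are hypothetical, and the available callback is an assumed stand-in for however opensm decides that a router is reachable; the GUIDs in the usage example are made up.

```python
def parse_prefix_routes(text):
    """Return a list of (prefix, guid) tuples; None stands for a wild card."""
    routes = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        prefix_s, guid_s = line.split()
        prefix = None if prefix_s == "*" else int(prefix_s, 16)
        guid = None if guid_s == "*" else int(guid_s, 16)
        routes.append((prefix, guid))
    return routes


def find_router(routes, dgid_prefix, available=lambda guid: True):
    """Return the first matching route, or None if the query should fail."""
    for prefix, guid in routes:
        if prefix is not None and prefix != dgid_prefix:
            continue
        if guid is None or available(guid):
            return (prefix, guid)
    return None


if __name__ == "__main__":
    conf = "fe80000000000001 0x0002c90000000010\n* *   # catch-all\n"
    routes = parse_prefix_routes(conf)
    print(find_router(routes, 0xFE80000000000001))
    print(find_router(routes, 0x1234))
```

Putting the catch-all line first would make the specific line unreachable, which is the ordering hazard the text warns about.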
591
592
594 OpenSM now offers five routing engines:
595
596 1. Min Hop Algorithm - based on the minimum hops to each node where
597 the path length is optimized.
598
599 2. UPDN Unicast routing algorithm - also based on the minimum hops to
600 each node, but it is constrained to ranking rules. This algorithm
601 should be chosen if the subnet is not a pure Fat Tree, and deadlock may
602 occur due to a loop in the subnet.
603
604 3. Fat Tree Unicast routing algorithm - this algorithm optimizes rout‐
605 ing for congestion-free "shift" communication pattern. It should be
606 chosen if a subnet is a symmetrical or almost symmetrical fat-tree of
607 various types, not just K-ary-N-Trees: non-constant K, not fully
608 staffed, any Constant Bisectional Bandwidth (CBB) ratio. Similar to
609 UPDN, Fat Tree routing is constrained to ranking rules.
610
611 4. LASH unicast routing algorithm - uses InfiniBand virtual layers (SL)
612 to provide deadlock-free shortest-path routing while also distributing
613 the paths between layers. LASH is an alternative deadlock-free topol‐
614 ogy-agnostic routing algorithm to the non-minimal UPDN algorithm avoid‐
615 ing the use of a potentially congested root node.
616
617 5. DOR Unicast routing algorithm - based on the Min Hop algorithm, but
618 avoids port equalization except for redundant links between the same
619 two switches. This provides deadlock free routes for hypercubes when
620 the fabric is cabled as a hypercube and for meshes when cabled as a
621 mesh (see details below).
622
623 OpenSM also supports a file method which can load routes from a table.
624 See ´Modular Routing Engine´ for more information on this.
625
626 The basic routing algorithm is comprised of two stages:
627
628 1. MinHop matrix calculation
629 How many hops are required to get from each port to each LID ?
630 The algorithm to fill these tables is different if you run standard
631 (min hop) or Up/Down.
632 For standard routing, a "relaxation" algorithm is used to propagate
633 min hop from every destination LID through neighbor switches.
634 For Up/Down routing, a BFS from every target is used. The BFS tracks
635 link direction (up or down) and avoids steps that would go up after
636 a down step was used.
637
638 2. Once MinHop matrices exist, each switch is visited and for each tar‐
639 get LID a decision is made as to what port should be used to get to
640 that LID.
641 This step is common to standard and Up/Down routing. Each port has a
642 counter counting the number of target LIDs going through it.
643 When there are multiple alternative ports with same MinHop to a LID,
644 the one with fewer previously assigned LIDs is selected.
645 If LMC > 0, more checks are added: Within each group of LIDs
646 assigned to same target port,
647 a. use only ports which have same MinHop
648 b. first prefer the ones that go to different systemImageGuid (than
649 the previous LID of the same LMC group)
650 c. if none - prefer those which go through another NodeGuid
651 d. fall back to the number of paths method (if all go to same node).
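The second stage (for LMC = 0) can be illustrated with the sketch below: for each target LID, the ports with the minimum hop count are the candidates, and the one currently carrying the fewest LIDs wins. The data layout and the function name are hypothetical, and breaking remaining ties by the lowest port number is an assumption; the LMC > 0 checks listed above are not modelled.

```python
def assign_ports(min_hops, lids):
    """Build a forwarding table for one switch.

    min_hops[lid][port] is the hop count to lid when leaving through port.
    Returns a dict mapping each LID to the chosen output port.
    """
    load = {}  # per-port counter of target LIDs routed through it
    lft = {}
    for lid in lids:
        hops = min_hops[lid]
        best = min(hops.values())
        candidates = [port for port, h in hops.items() if h == best]
        # Prefer the least loaded candidate; lowest port number breaks ties.
        port = min(candidates, key=lambda p: (load.get(p, 0), p))
        lft[lid] = port
        load[port] = load.get(port, 0) + 1
    return lft


if __name__ == "__main__":
    hops = {1: {1: 1, 2: 1}, 2: {1: 1, 2: 1}, 3: {1: 2, 2: 1}}
    print(assign_ports(hops, [1, 2, 3]))
```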
652
653 Effect of Topology Changes
654
655 OpenSM will preserve existing routing in any case where there is no
656 change in the fabric switches unless the -r (--reassign_lids) option is
657 specified.
658
659 -r
660 --reassign_lids
661 This option causes OpenSM to reassign LIDs to all
662 end nodes. Specifying -r on a running subnet
663 may disrupt subnet traffic.
664 Without -r, OpenSM attempts to preserve existing
665 LID assignments resolving multiple use of same LID.
666
667 If a link is added or removed, OpenSM does not recalculate the routes
668 that do not have to change. A route has to change if the port is no
669 longer UP or no longer the MinHop. When routing changes are performed,
670 the same algorithm for balancing the routes is invoked.
671
672 In the case of using the file based routing, any topology changes are
673 currently ignored. The 'file' routing engine just loads the LFTs from
674 the file specified, with no reaction to real topology. Obviously, this
675 will not be able to recheck LIDs (by GUID) for disconnected nodes, and
676 LFTs for non-existent switches will be skipped. Multicast is not
677 affected by 'file' routing engine (this uses min hop tables).
678
679
680 Min Hop Algorithm
681
682 The Min Hop algorithm is invoked by default if no routing algorithm is
683 specified. It can also be invoked by specifying '-R minhop'.
684
685 The Min Hop algorithm is divided into two stages: computation of min-
686 hop tables on every switch and LFT output port assignment. Link sub‐
687 scription is also equalized with the ability to override based on port
688 GUID. The latter is supplied by:
689
690 -i <equalize-ignore-guids-file>
691 -ignore-guids <equalize-ignore-guids-file>
692 This option provides the means to define a set of ports
693 (by guid) that will be ignored by the link load
694 equalization algorithm. Note that only endports (CA,
695 switch port 0, and router ports) and not switch external
696 ports are supported.
697
698 LMC awareness routes based on (remote) system or switch basis.
699
700
701 Purpose of UPDN Algorithm
702
703 The UPDN algorithm is designed to prevent deadlocks from occurring in
704 loops of the subnet. A loop-deadlock is a situation in which it is no
705 longer possible to send data between any two hosts connected through
706 the loop. As such, the UPDN routing algorithm should be used if the
707 subnet is not a pure Fat Tree, and one of its loops may experience a
708 deadlock (due, for example, to high pressure).
709
710 The UPDN algorithm is based on the following main stages:
711
712 1. Auto-detect root nodes - based on the CA hop length from any switch
713 in the subnet, a statistical histogram is built for each switch (hop
714 num vs number of occurrences). If the histogram reflects a specific
715 column (higher than others) for a certain node, then it is marked as a
716 root node. Since the algorithm is statistical, it may not find any root
717 nodes. The list of the root nodes found by this auto-detect stage is
718 used by the ranking process stage.
719
720 Note 1: The user can override the node list manually.
721 Note 2: If this stage cannot find any root nodes, and the user did
722 not specify a guid list file, OpenSM defaults back to the
723 Min Hop routing algorithm.
724
725 2. Ranking process - All root switch nodes (found in stage 1) are
726 assigned a rank of 0. Using the BFS algorithm, the rest of the switch
727 nodes in the subnet are ranked incrementally. This ranking aids in the
728 process of enforcing rules that ensure loop-free paths.
729
730 3. Min Hop Table setting - after ranking is done, a BFS algorithm is
731 run from each (CA or switch) node in the subnet. During the BFS
732 process, the FDB table of each switch node traversed by BFS is updated,
733 in reference to the starting node, based on the ranking rules and guid
734 values.
735
736 At the end of the process, the updated FDB tables ensure loop-free
737 paths through the subnet.
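The ranking stage and the "no up after down" rule can be sketched as follows. This is not the OpenSM implementation: the function names and the adjacency-dict representation are hypothetical, and comparing node identifiers to order switches of equal rank is an assumption standing in for the GUID-based rule mentioned in stage 3.

```python
from collections import deque


def rank_switches(adjacency, roots):
    """BFS from the root switches; roots get rank 0, others rank by distance."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in rank:
                rank[neighbor] = rank[node] + 1
                queue.append(neighbor)
    return rank


def is_up(src, dst, rank):
    """A step is 'up' when it moves toward a root (lower rank)."""
    return (rank[dst], dst) < (rank[src], src)


def obeys_updn(path, rank):
    """A legal path never takes an up step after a down step."""
    went_down = False
    for src, dst in zip(path, path[1:]):
        if is_up(src, dst, rank):
            if went_down:
                return False
        else:
            went_down = True
    return True


if __name__ == "__main__":
    topo = {"R": ["A", "B"], "A": ["R", "B"], "B": ["R", "A"]}
    ranks = rank_switches(topo, ["R"])
    print(ranks)
    print(obeys_updn(["A", "R", "B"], ranks))
    print(obeys_updn(["R", "A", "R"], ranks))
```

A path that climbs to the root and then descends is accepted, while one that descends and then climbs again is rejected; excluding the latter is what removes the cyclic dependencies that lead to deadlock.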
738
739 Note: Up/Down routing does not allow LID routing communication between
740 switches that are located inside spine "switch systems". The reason is
741 that there is no way to allow a LID route between them that does not
742 break the Up/Down rule. One ramification of this is that you cannot
743 run SM on switches other than the leaf switches of the fabric.
744
745
746 UPDN Algorithm Usage
747
748 Activation through OpenSM
749
750 Use the '-R updn' option (instead of the old '-u') to activate the UPDN
751 algorithm. Use '-a <root_guid_file>' to add a UPDN guid file that
752 contains the root nodes for ranking. If the '-a' option is not used,
753 OpenSM uses its auto-detect root nodes algorithm.
754
755 Notes on the guid list file:
756
757 1. A valid guid file specifies one guid in each line. Lines with an
758 invalid format will be discarded.
759 2. The user should specify the root switch guids. However, it is also
760 possible to specify CA guids; OpenSM will use the guid of the switch
761 (if it exists) that connects the CA to the subnet as a root node.
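
The "one guid per line, invalid lines discarded" rule can be sketched as follows. The accepted syntax here (hexadecimal, optional 0x prefix, value fitting in 64 bits) is an assumption for illustration; the exact format OpenSM accepts is not stated in this section.

```python
def parse_guid_lines(lines):
    """Keep one 64-bit GUID per line and silently drop invalid lines.

    Assumed syntax: hexadecimal digits with an optional 0x prefix.
    """
    guids = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        try:
            value = int(text, 16)
        except ValueError:
            continue  # invalid format: discard the line
        if 0 <= value < 2**64:
            guids.append(value)
    return guids
```

The function takes any iterable of lines, so an open file object can be passed directly.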
762
763
764 Fat-tree Routing Algorithm
765
766 The fat-tree algorithm optimizes routing for the "shift" communication
767 pattern. It should be chosen if a subnet is a symmetrical or almost
768 symmetrical fat-tree of various types. It supports more than K-ary-N-Trees:
769 it handles non-constant K, cases where not all leaves (CAs) are present,
770 and any CBB ratio. As in UPDN, fat-tree also prevents
771 credit-loop-deadlocks.
772
773 If the root guid file is not provided ('-a' or '--root_guid_file'
774 options), the topology has to be pure fat-tree that complies with the
775 following rules:
776 - Tree rank should be between two and eight (inclusive)
777 - Switches of the same rank should have the same number
778 of UP-going port groups*, unless they are root switches,
779 in which case they shouldn't have UP-going ports at all.
780 - Switches of the same rank should have the same number
781 of DOWN-going port groups, unless they are leaf switches.
782 - Switches of the same rank should have the same number
783 of ports in each UP-going port group.
784 - Switches of the same rank should have the same number
785 of ports in each DOWN-going port group.
786 - All the CAs have to be at the same tree level (rank).
787
788 If the root guid file is provided, the topology doesn't have to be pure
789 fat-tree, and it should only comply with the following rules:
790 - Tree rank should be between two and eight (inclusive)
791 - All the Compute Nodes** have to be at the same tree level (rank).
792 Note that non-compute node CAs are allowed here to be at different
793 tree ranks.
794
795 * ports that are connected to the same remote switch are referenced as
796 ´port group´.
797
798 ** list of compute nodes (CNs) can be specified by ´-u´ or
799 ´--cn_guid_file´ OpenSM options.
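
The "port group" notion used in the rules above can be made concrete with a small sketch. The mapping from local port number to remote switch name is a hypothetical input format chosen for illustration.

```python
from collections import defaultdict

def port_groups(port_to_remote):
    """Group local ports by the remote switch they connect to.

    Each resulting list of ports is one 'port group'.
    """
    groups = defaultdict(list)
    for port, remote in sorted(port_to_remote.items()):
        groups[remote].append(port)
    return dict(groups)
```

With this representation, the rules that compare the "number of ports in each port group" amount to comparing the lengths of these lists across switches of the same rank.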
800
801 Topologies that do not comply cause a fallback to min hop routing.
802 Note that this can also occur on link failures which cause the topology
803 to no longer be "pure" fat-tree.
804
805 Note that although the fat-tree algorithm supports trees with a
806 non-integer CBB ratio, the routing will not be as balanced as with an
807 integer CBB ratio. In addition, although the algorithm allows leaf
808 switches to have any number of CAs, the closer the tree is to being fully
809 populated, the more effective the "shift" communication pattern will
810 be. In general, even if the root list is provided, the closer the
811 topology is to a pure and symmetrical fat-tree, the more optimal the
812 routing will be.
813
814 The algorithm also dumps a compute node ordering file
815 (opensm-ftree-ca-order.dump) in the directory where the OpenSM log
816 resides. This ordering file provides the CN order that may be used to
817 create an efficient communication pattern that matches the routing tables.
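
A "shift" communication pattern over an ordered list of compute nodes can be sketched as below. The sketch takes the CN order as a plain list of names; it does not parse the dump file, whose format is not described here, and the function name is hypothetical.

```python
def shift_pattern(cn_order, shift):
    """Pair each compute node with the node `shift` positions later,
    wrapping around at the end of the list."""
    count = len(cn_order)
    return [(cn_order[i], cn_order[(i + shift) % count]) for i in range(count)]
```

An application would typically step through shift values 1 to N-1, so that every node talks to every other node exactly once per full cycle.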
818
819 Routing between non-CN nodes
820
821 The use of the cn_guid_file option allows non-CN nodes to be located on
822 different levels in the fat tree. In such a case, it is not guaranteed
823 that the Fat Tree algorithm will route between two non-CN nodes. To
824 solve this problem, a list of non-CN nodes can be specified by the ´-G´ or
825 ´--io_guid_file´ option. These nodes will be allowed to use switches
826 the wrong way round a specific number of times (specified by ´-H´ or
827 ´--max_reverse_hops´). With the proper max_reverse_hops and
828 io_guid_file values, you can ensure full connectivity in the Fat Tree.
829
830 Please note that using max_reverse_hops creates routes that use the
831 switch in a counter-stream way. This option should never be used to
832 connect nodes with high bandwidth traffic between them! It should only
833 be used to allow connectivity for HA purposes or similar. Also, having
834 routes the other way around can in theory cause credit loops.
835
836 Use these options with extreme care!
837
838 Activation through OpenSM
839
840 Use the '-R ftree' option to activate the fat-tree algorithm. Use '-a
841 <root_guid_file>' to provide root nodes for ranking. If the '-a' option
842 is not used, the routing algorithm will detect roots automatically. Use
843 '-u <root_cn_file>' to provide the list of compute nodes. If the '-u'
844 option is not used, all the CAs are considered as compute nodes.
845
846 Note: LMC > 0 is not supported by fat-tree routing. If this is speci‐
847 fied, the default routing algorithm is invoked instead.
848
849
850 LASH Routing Algorithm
851
852 LASH is an acronym for LAyered SHortest Path Routing. It is a determin‐
853 istic shortest path routing algorithm that enables topology agnostic
854 deadlock-free routing within communication networks.
855
856 When computing the routing function, LASH analyzes the network topology
857 for the shortest-path routes between all pairs of sources / destina‐
858 tions and groups these paths into virtual layers in such a way as to
859 avoid deadlock.
860
861 Note that LASH analyzes routes and ensures deadlock freedom between switch
862 pairs. The link between an HCA and a switch does not need virtual
863 layers, as deadlock will not arise between switch and HCA.
864
865 In more detail, the algorithm works as follows:
866
867 1) LASH determines the shortest path between all pairs of source /
868 destination switches. Note that LASH ensures the same SL is used for all
869 SRC/DST - DST/SRC pairs, and there is no guarantee that the return path
870 for a given DST/SRC will be the reverse of the route SRC/DST.
871
872 2) LASH then begins an SL assignment process where a route is assigned
873 to a layer (SL) if the addition of that route does not cause deadlock
874 within that layer. This is achieved by maintaining and analysing a
875 channel dependency graph for each layer. Once the potential addition of
876 a path could lead to deadlock, LASH opens a new layer and continues the
877 process.
878
879 3) Once this stage has been completed, it is highly likely that the
880 first layers processed will contain more paths than the later ones.
881 To better balance the use of layers, LASH moves paths from one layer to
882 another so that the number of paths in each layer averages out.
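
The layer assignment in step 2 can be sketched with a channel dependency graph (CDG) per layer. This is a simplified illustration, not the opensm code: it covers step 2 only, leaves out the balancing of step 3 and the shared SL for SRC/DST and DST/SRC pairs, and all names are hypothetical. A channel is a directed link (a, b); a path creates a dependency from each channel to the next one it uses.

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph {node: set(successors)} by DFS."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in edges.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY or (state == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(edges))

def path_dependencies(path):
    """Return (channel, next_channel) pairs for a path given as a node list."""
    channels = list(zip(path, path[1:]))
    return list(zip(channels, channels[1:]))

def assign_layers(paths):
    """Put each path in the first layer whose CDG stays acyclic;
    open a new layer when no existing layer can take it."""
    layers = []      # one CDG per layer: {channel: set(channels)}
    assignment = []  # layer index chosen for each path, in input order
    for path in paths:
        deps = path_dependencies(path)
        for index, cdg in enumerate(layers):
            trial = {chan: set(succ) for chan, succ in cdg.items()}
            for first, second in deps:
                trial.setdefault(first, set()).add(second)
            if not has_cycle(trial):
                layers[index] = trial
                assignment.append(index)
                break
        else:
            fresh = {}
            for first, second in deps:
                fresh.setdefault(first, set()).add(second)
            layers.append(fresh)
            assignment.append(len(layers) - 1)
    return assignment
```

On a four-switch ring with the clockwise two-hop paths 0-1-2, 1-2-3, 2-3-0 and 3-0-1, the first three paths fit in one layer and the fourth would close a dependency cycle, so it is placed in a second layer.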
883
884 Note, the implementation of LASH in opensm attempts to use as few lay‐
885 ers as possible. This number can be less than the number of actual lay‐
886 ers available.
887
888 In general, LASH is a very flexible algorithm. It can, for example,
889 reduce to Dimension Order Routing in certain topologies; it is topology
890 agnostic and fares well in the face of faults.
891
892 It has been shown that for both regular and irregular topologies, LASH
893 outperforms Up/Down. The reason for this is that LASH distributes the
894 traffic more evenly through a network, avoiding the bottleneck issues
895 related to a root node and always routes shortest-path.
896
897 The algorithm was developed by Simula Research Laboratory.
898
899
900 Use the '-R lash -Q' option to activate the LASH algorithm.
901
902 Note: QoS support has to be turned on so that SL/VL mappings are
903 used.
904
905 Note: LMC > 0 is not supported by the LASH routing. If this is speci‐
906 fied, the default routing algorithm is invoked instead.
907
908 For open regular cartesian meshes the DOR algorithm is the ideal rout‐
909 ing algorithm. For toroidal meshes on the other hand there are routing
910 loops that can cause deadlocks. LASH can be used to route these cases.
911 The performance of LASH can be improved by preconditioning the mesh in
912 cases where there are multiple links connecting switches and also in
913 cases where the switches are not cabled consistently. An option exists
914 for LASH to do this. To invoke it, use '-R lash -Q --do_mesh_analysis'.
915 This will add an additional phase that analyses the mesh to try
916 to determine the dimension and size of a mesh. If it determines that
917 the mesh looks like an open or closed cartesian mesh it reorders the
918 ports in dimension order before the rest of the LASH algorithm runs.
919
920 DOR Routing Algorithm
921
922 The Dimension Order Routing algorithm is based on the Min Hop algorithm
923 and so uses shortest paths. Instead of spreading traffic out across
924 different paths with the same shortest distance, it chooses among the
925 available shortest paths based on an ordering of dimensions. Each port
926 must be consistently cabled to represent a hypercube dimension or a
927 mesh dimension. Paths are grown from a destination back to a source
928 using the lowest dimension (port) of available paths at each step.
929 This provides the ordering necessary to avoid deadlock. When there are
930 multiple links between any two switches, they still represent only one
931 dimension and traffic is balanced across them unless port equalization
932 is turned off. In the case of hypercubes, the same port must be used
933 throughout the fabric to represent the hypercube dimension and match on
934 both ends of the cable. In the case of meshes, the dimension should
935 consistently use the same pair of ports, one port on one end of the
936 cable, and the other port on the other end, continuing along the mesh
937 dimension.
938
939 Use the '-R dor' option to activate the DOR algorithm.
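
The ordering principle behind DOR can be illustrated on an open cartesian mesh addressed by coordinates. This is the textbook form of dimension order routing, walking forward from the source; OpenSM builds its tables from the destination back to the source and works with port numbers, so the sketch below is only an analogy and its names are hypothetical.

```python
def dor_path(src, dst):
    """Route on an open mesh by fully correcting dimension 0, then
    dimension 1, and so on. Returns the list of visited coordinates."""
    path = [tuple(src)]
    current = list(src)
    for dim in range(len(src)):
        step = 1 if dst[dim] > current[dim] else -1
        while current[dim] != dst[dim]:
            current[dim] += step
            path.append(tuple(current))
    return path
```

Because every path finishes a lower dimension before touching a higher one, no path ever turns from a higher dimension back into a lower one, which is the ordering that avoids deadlock on an open mesh.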
940
941
942 Routing References
943
944 To learn more about deadlock-free routing, see the article "Deadlock
945 Free Message Routing in Multiprocessor Interconnection Networks" by
946 William J Dally and Charles L Seitz (1985).
947
948 To learn more about the up/down algorithm, see the article "Effective
949 Strategy to Compute Forwarding Tables for InfiniBand Networks" by Jose
950 Carlos Sancho, Antonio Robles, and Jose Duato at the Universidad
951 Politecnica de Valencia.
952
953 To learn more about LASH and the flexibility behind it, the requirement
954 for layers, and performance comparisons to other algorithms, see the
955 following articles:
956
957 "Layered Routing in Irregular Networks", Lysne et al., IEEE Transactions
958 on Parallel and Distributed Systems, Vol. 16, No. 12, December 2005.
959
960 "Routing for the ASI Fabric Manager", Solheim et al., IEEE Communications
961 Magazine, Vol. 44, No. 7, July 2006.
962
963 "Layered Shortest Path (LASH) Routing in Irregular System Area
964 Networks", Skeie et al., IEEE Computer Society Communication Architecture
965 for Clusters 2002.
966
967
968 Modular Routing Engine
969
970 The modular routing engine structure makes it easy to plug in new
971 routing modules.
972
973 Currently, only unicast callbacks are supported. Multicast can be added
974 later.
975
976 One existing routing module is up-down "updn", which may be activated
977 with the '-R updn' option (instead of the old '-u').
978
979 General usage is: $ opensm -R 'module-name'
980
981 There is also a trivial routing module which is able to load LFT tables
982 from a file.
983
984 Main features:
985
986 - this will load switch LFTs and/or LID matrices (min hops tables)
987 - this will load switch LFTs according to the path entries introduced
988 in the file
989 - no additional checks will be performed (such as "is port connected",
990 etc.)
991 - if fabric LIDs have changed, this will try to reconstruct
992 LFTs correctly if endport GUIDs are represented in the file
993 (in order to disable this, GUIDs may be removed from the file
994 or zeroed)
995
996 The file format is compatible with the output of the 'ibroute' utility
997 and, for the whole fabric, can be generated with the dump_lfts.sh script.
998
999 To activate the file-based routing module, use:
1000
1001 opensm -R file -U /path/to/lfts_file
1002
1003 If the lfts_file is not found or is in error, the default routing
1004 algorithm is used.
1005
1006 The ability to dump switch lid matrices (aka min hops tables) to file
1007 and later to load these is also supported.
1008
1009 The usage is similar to loading unicast forwarding tables from an lfts
1010 file (introduced by the 'file' routing engine), but the lid matrix file
1011 name should be specified with the -M or --lid_matrix_file option. For
1012 example:
1013
1014 opensm -R file -M ./opensm-lid-matrix.dump
1015
1016 The dump file is named ´opensm-lid-matrix.dump´ and will be generated
1017 in the standard opensm dump directory (/var/log by default) when the
1018 OSM_LOG_ROUTING logging flag is set.
1019
1020 When routing engine 'file' is activated, but the lfts file is not
1021 specified or cannot be opened, the default lid matrix algorithm will be used.
1022
1023 There is also a switch forwarding tables dumper which generates a file
1024 compatible with dump_lfts.sh output. This file can be used as input for
1025 forwarding tables loading by the 'file' routing engine. Either or both of
1026 the -U and -M options can be specified together with ´-R file´.
1027
1028
1030 /etc/rdma/opensm.conf
1031 default OpenSM config file.
1032
1033
1034 /etc/rdma/ib-node-name-map
1035 default node name map file. See ibnetdiscover for more informa‐
1036 tion on format.
1037
1038
1039 /etc/rdma/partitions.conf
1040 default partition config file
1041
1042
1043 /etc/rdma/qos-policy.conf
1044 default QOS policy config file
1045
1046
1047 /etc/rdma/prefix-routes.conf
1048 default prefix routes file.
1049
1050
1052 Hal Rosenstock
1053 <hal.rosenstock@gmail.com>
1054
1055 Sasha Khapyorsky
1056 <sashak@voltaire.com>
1057
1058 Eitan Zahavi
1059 <eitan@mellanox.co.il>
1060
1061 Yevgeny Kliteynik
1062 <kliteyn@mellanox.co.il>
1063
1064 Thomas Sodring
1065 <tsodring@simula.no>
1066
1067 Ira Weiny
1068 <weiny2@llnl.gov>
1069
1070
1071
1072OpenIB October 22, 2009 OPENSM(8)