OPENSM(8) OpenIB Management OPENSM(8)

NAME
opensm - InfiniBand subnet manager and administration (SM/SA)

SYNOPSIS
opensm [--version] [-F | --config <file_name>]
[-c(reate-config) <file_name>] [-g(uid) <GUID in hex>] [-l(mc) <LMC>]
[-p(riority) <PRIORITY>] [--subnet_prefix <PREFIX in hex>]
[--smkey <SM_Key>] [--sm_sl <SL number>] [-r(eassign_lids)]
[-R <engine name(s)> | --routing_engine <engine name(s)>]
[--do_mesh_analysis] [--lash_start_vl <vl number>]
[--nue_max_num_vls <vl number>] [-A | --ucast_cache]
[-z | --connect_roots] [-M <file name> | --lid_matrix_file <file name>]
[-U <file name> | --lfts_file <file name>]
[-S | --sadb_file <file name>] [-a | --root_guid_file <path to file>]
[-u | --cn_guid_file <path to file>] [-G | --io_guid_file <path to file>]
[--port-shifting] [--scatter-ports <random seed>]
[-H | --max_reverse_hops <max reverse hops allowed>]
[-X | --guid_routing_order_file <path to file>]
[-m | --ids_guid_file <path to file>] [-o(nce)] [-s(weep) <interval>]
[-t(imeout) <milliseconds>] [--retries <number>] [--maxsmps <number>]
[--console [off | local | socket | loopback]] [--console-port <port>]
[-i | --ignore_guids <equalize-ignore-guids-file>]
[-w | --hop_weights_file <path to file>]
[-O | --port_search_ordering_file <path to file>]
[-O | --dimn_ports_file <path to file>] (DEPRECATED)
[-f <log file path> | --log_file <log file path>]
[-L | --log_limit <size in MB>] [-e(rase_log_file)]
[-P(config) <partition config file>]
[-N | --no_part_enforce] (DEPRECATED)
[-Z | --part_enforce [both | in | out | off]] [-W | --allow_both_pkeys]
[-Q | --qos [-Y | --qos_policy_file <file name>]]
[--congestion_control] [--cc_key <key>] [-y | --stay_on_fatal]
[-B | --daemon] [-J | --pidfile <file_name>] [-I | --inactive]
[--perfmgr] [--perfmgr_sweep_time_s <seconds>]
[--prefix_routes_file <path>] [--consolidate_ipv6_snm_req]
[--log_prefix <prefix text>] [--torus_config <path to file>]
[-v(erbose)] [-V] [-D <flags>] [-d(ebug) <number>] [-h(elp)] [-?]

DESCRIPTION
41 opensm is an InfiniBand compliant Subnet Manager and Administration,
42 and runs on top of OpenIB.
43
opensm provides an implementation of an InfiniBand Subnet Manager and
Administration. Such a software entity must be running in order to
initialize the InfiniBand hardware (at least one per InfiniBand
subnet).

opensm also contains an experimental version of a performance manager.
51
52 opensm defaults were designed to meet the common case usage on clusters
53 with up to a few hundred nodes. Thus, in this default mode, opensm will
54 scan the IB fabric, initialize it, and sweep occasionally for changes.
55
56 opensm attaches to a specific IB port on the local machine and config‐
57 ures only the fabric connected to it. (If the local machine has other
58 IB ports, opensm will ignore the fabrics connected to those other
59 ports). If no port is specified, it will select the first "best" avail‐
60 able port.
61
62 opensm can present the available ports and prompt for a port number to
63 attach to.
64
By default, the run is logged to two files: /var/log/messages and
/var/log/opensm.log. The first file registers only general major
events, whereas the second includes details of reported errors. All
errors reported in this second file should be treated as indicators of
IB fabric health issues. (Note that when a fatal and non-recoverable
error occurs, opensm will exit.) Both log files should include the
message "SUBNET UP" if opensm was able to set up the subnet correctly.

OPTIONS
75 --version
76 Prints OpenSM version and exits.
77
78 -F, --config <config file>
The name of the OpenSM config file. When not specified,
/etc/rdma/opensm.conf will be used (if it exists).

-c, --create-config <file name>
OpenSM will dump its configuration to the specified file and
exit. This is a way to generate an OpenSM configuration file
template.
86
87 -g, --guid <GUID in hex>
88 This option specifies the local port GUID value with which
89 OpenSM should bind. OpenSM may be bound to 1 port at a time.
90 If GUID given is 0, OpenSM displays a list of possible port
91 GUIDs and waits for user input. Without -g, OpenSM tries to use
92 the default port.
93
94 -l, --lmc <LMC value>
95 This option specifies the subnet's LMC value. The number of
96 LIDs assigned to each port is 2^LMC. The LMC value must be in
97 the range 0-7. LMC values > 0 allow multiple paths between
98 ports. LMC values > 0 should only be used if the subnet topol‐
99 ogy actually provides multiple paths between ports, i.e. multi‐
100 ple interconnects between switches. Without -l, OpenSM defaults
101 to LMC = 0, which allows one path between any two ports.
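
The LID arithmetic described for -l can be illustrated with a minimal Python sketch. The function name lids_per_port is purely illustrative and is not part of OpenSM; the sketch only restates the 2^LMC rule and the 0-7 range given above.

```python
def lids_per_port(lmc: int) -> int:
    """Return how many LIDs each port is assigned for a given LMC value."""
    # The option text restricts LMC to the range 0-7.
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    # Each port is assigned 2^LMC LIDs.
    return 1 << lmc


if __name__ == "__main__":
    for lmc in (0, 1, 3, 7):
        print(f"LMC={lmc}: {lids_per_port(lmc)} LID(s) per port")
```

With the default LMC of 0 each port gets a single LID, which corresponds to the single path between any two ports mentioned above.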
102
103 -p, --priority <Priority value>
This option specifies the SM's PRIORITY. This will affect the
handover cases, where the master is chosen by priority and GUID.
The range is from 0 (default and lowest priority) to 15 (highest).
107
108 --subnet_prefix <PREFIX in hex>
109 This option specifies the subnet prefix to use on the fabric.
110 The default prefix is 0xfe80000000000000. OpenMPI in particular
111 requires separate fabrics plugged into different ports to have
112 different prefixes or else it won't run.
113
114 --smkey <SM_Key value>
This option specifies the SM's SM_Key (64 bits). This will
affect SM authentication. Note that OpenSM version 3.2.1 and
below used the default value '1' in host byte order. This is
fixed now, but you may need this option to interoperate with an
old OpenSM running on a little-endian machine.
120
121 --sm_sl <SL number>
122 This option sets the SL to use for communication with the SM/SA.
123 Defaults to 0.
124
125 -r, --reassign_lids
This option causes OpenSM to reassign LIDs to all end nodes.
Specifying -r on a running subnet may disrupt subnet traffic.
Without -r, OpenSM attempts to preserve existing LID assignments,
resolving any multiple use of the same LID.
130
131 -R, --routing_engine <Routing engine names>
132 This option chooses routing engine(s) to use instead of Min Hop
133 algorithm (default). Multiple routing engines can be specified
134 separated by commas so that specific ordering of routing algo‐
135 rithms will be tried if earlier routing engines fail. If all
136 configured routing engines fail, OpenSM will always attempt to
137 route with Min Hop unless 'no_fallback' is included in the list
138 of routing engines. Supported engines: minhop, updn, dnup,
139 file, ftree, lash, dor, torus-2QoS, nue, dfsssp, sssp.
140
141 --do_mesh_analysis
142 This option enables additional analysis for the lash routing
143 engine to precondition switch port assignments in regular carte‐
144 sian meshes which may reduce the number of SLs required to give
145 a deadlock free routing.
146
147 --lash_start_vl <vl number>
148 This option sets the starting VL to use for the lash routing
149 algorithm. Defaults to 0.
150
151 --nue_max_num_vls <vl number>
This option sets the maximum number of VLs to use for the Nue
routing engine. Any number greater than or equal to 0 is allowed,
and the default is 1 to enforce deadlock-freedom even if QoS is
not enabled. If set to 0, Nue routing will automatically
determine and choose the maximum supported by the fabric. If set
to any integer >= 1, Nue uses
min(max_supported, nue_max_num_vls). As a rule of thumb, a higher
nue_max_num_vls results in better path balancing.
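
The selection rule for --nue_max_num_vls can be restated as a small Python sketch. The function and the max_supported argument (the maximum number of VLs the fabric supports) are hypothetical names chosen for illustration; this is not OpenSM code.

```python
def nue_effective_vls(nue_max_num_vls: int, max_supported: int) -> int:
    """Return the number of VLs Nue would use under the rule described above."""
    if nue_max_num_vls < 0:
        raise ValueError("nue_max_num_vls must be >= 0")
    # 0 means: use the maximum supported by the fabric.
    if nue_max_num_vls == 0:
        return max_supported
    # Otherwise the configured value is capped by what the fabric supports.
    return min(max_supported, nue_max_num_vls)


if __name__ == "__main__":
    print(nue_effective_vls(0, 8))
    print(nue_effective_vls(1, 8))
    print(nue_effective_vls(4, 2))
```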
160
161 -A, --ucast_cache
This option enables the unicast routing cache and prevents routing
recalculation (which is a heavy task in a large cluster) when
no topology change was detected during the heavy sweep, or
when the topology change does not require new routing calculation,
e.g. when one or more CAs/RTRs/leaf switches go down,
or one or more of these nodes come back after being down. A
very common case that is handled by the unicast routing cache is
host reboot, which otherwise would cause two full routing
recalculations: one when the host goes down, and the other when the
host comes back online.
172
173 -z, --connect_roots
This option forces the routing engines (up/down and fat-tree) to
provide connectivity between root switches and in this way to be
fully IBA compliant. In many cases this can violate the "pure"
deadlock-free algorithm, so use it carefully.
178
179 -M, --lid_matrix_file <file name>
180 This option specifies the name of the lid matrix dump file from
181 where switch lid matrices (min hops tables) will be loaded.
182
183 -U, --lfts_file <file name>
184 This option specifies the name of the LFTs file from where
185 switch forwarding tables will be loaded when using "file" rout‐
186 ing engine.
187
188 -S, --sadb_file <file name>
189 This option specifies the name of the SA DB dump file from where
190 SA database will be loaded.
191
192 -a, --root_guid_file <file name>
193 Set the root nodes for the Up/Down or Fat-Tree routing algorithm
194 to the guids provided in the given file (one to a line).
195
196 -u, --cn_guid_file <file name>
197 Set the compute nodes for the Fat-Tree or DFSSSP/SSSP routing
198 algorithms to the port GUIDs provided in the given file (one to
199 a line).
200
201 -G, --io_guid_file <file name>
202 Set the I/O nodes for the Fat-Tree or DFSSSP/SSSP routing algo‐
203 rithms to the port GUIDs provided in the given file (one to a
204 line).
205 In the case of Fat-Tree routing:
206 I/O nodes are non-CN nodes allowed to use up to max_reverse_hops
207 switches the wrong way around to improve connectivity.
208 In the case of (DF)SSSP routing:
209 Providing guids of compute and/or I/O nodes will ensure that
210 paths towards those nodes are as much separated as possible
211 within their node category, i.e., I/O traffic will not share the
212 same link if multiple links are available.
213
214 --port-shifting
215 This option enables a feature called port shifting. In some
216 fabrics, particularly cluster environments, routes commonly
217 align and congest with other routes due to algorithmically
218 unchanging traffic patterns. This routing option will "shift"
219 routing around in an attempt to alleviate this problem.
220
221 --scatter-ports <random seed>
222 This option is used to randomize port selection in routing
223 rather than using a round-robin algorithm (which is the
224 default). Value supplied with option is used as a random seed.
225 If value is 0, which is the default, the scatter ports option is
226 disabled.
227
228 -H, --max_reverse_hops <max reverse hops allowed>
229 Set the maximum number of reverse hops an I/O node is allowed to
230 make. A reverse hop is the use of a switch the wrong way around.
231
232 -m, --ids_guid_file <file name>
233 Name of the map file with set of the IDs which will be used by
234 Up/Down routing algorithm instead of node GUIDs (format: <guid>
235 <id> per line).
236
237 -X, --guid_routing_order_file <file name>
238 Set the order port guids will be routed for the MinHop and
239 Up/Down routing algorithms to the guids provided in the given
240 file (one to a line).
241
242 -o, --once
243 This option causes OpenSM to configure the subnet once, then
244 exit. Ports remain in the ACTIVE state.
245
246 -s, --sweep <interval value>
247 This option specifies the number of seconds between subnet
248 sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM
249 defaults to a sweep interval of 10 seconds.
250
251 -t, --timeout <value>
252 This option specifies the time in milliseconds used for transac‐
253 tion timeouts. Timeout values should be > 0. Without -t,
254 OpenSM defaults to a timeout value of 200 milliseconds.
255
256 --retries <number>
257 This option specifies the number of retries used for transac‐
258 tions. Without --retries, OpenSM defaults to 3 retries for
259 transactions.
260
261 --maxsmps <number>
262 This option specifies the number of VL15 SMP MADs allowed on the
263 wire at any one time. Specifying --maxsmps 0 allows unlimited
264 outstanding SMPs. Without --maxsmps, OpenSM defaults to a maxi‐
265 mum of 4 outstanding SMPs.
266
267 --console [off | local | loopback | socket]
268 This option brings up the OpenSM console (default off). Note,
269 loopback and socket open a socket which can be connected to
270 WITHOUT CREDENTIALS. Loopback is safer if access to your SM
271 host is controlled. tcp_wrappers (hosts.[allow|deny]) is used
272 with loopback and socket. loopback and socket will only be
273 available if OpenSM was built with --enable-console-loopback
274 (default yes) and --enable-console-socket (default no) respec‐
275 tively.
276
277 --console-port <port>
278 Specify an alternate telnet port for the socket console (default
279 10000). Note that this option only appears if OpenSM was built
280 with --enable-console-socket.
281
282 -i, --ignore_guids <equalize-ignore-guids-file>
283 This option provides the means to define a set of ports (by node
284 guid and port number) that will be ignored by the link load
285 equalization algorithm.
286
287 -w, --hop_weights_file <path to file>
288 This option provides weighting factors per port representing a
289 hop cost in computing the lid matrix. The file consists of
290 lines containing a switch port GUID (specified as a 64 bit hex
291 number, with leading 0x), output port number, and weighting fac‐
292 tor. Any port not listed in the file defaults to a weighting
293 factor of 1. Lines starting with # are comments. Weights
294 affect only the output route from the port, so many useful con‐
295 figurations will require weights to be specified in pairs.
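
As an illustration of the hop weights file layout, here is a minimal Python parsing sketch. It assumes that the three fields on a line are separated by white space and that the port number and weight are decimal; neither detail is stated above. The GUIDs and weights in SAMPLE are made up.

```python
from typing import Dict, Tuple

PortKey = Tuple[int, int]  # (switch port GUID, output port number)

# Made-up sample: one weight for each end of a link, since weights
# affect only the output route from a port.
SAMPLE = """\
# GUID               port  weight
0x0002c90000000001   1     4
0x0002c90000000002   3     4
"""


def parse_hop_weights(text: str) -> Dict[PortKey, int]:
    """Parse hop-weight lines into a {(guid, port): weight} mapping."""
    weights: Dict[PortKey, int] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Lines starting with # are comments; blank lines are skipped
        # (skipping blank lines is an assumption).
        if not line or line.startswith("#"):
            continue
        # Assumption: fields are white-space separated.
        guid_str, port_str, weight_str = line.split()
        weights[(int(guid_str, 16), int(port_str))] = int(weight_str)
    return weights


def hop_weight(weights: Dict[PortKey, int], guid: int, port: int) -> int:
    """Any port not listed in the file defaults to a weighting factor of 1."""
    return weights.get((guid, port), 1)


if __name__ == "__main__":
    table = parse_hop_weights(SAMPLE)
    print(hop_weight(table, 0x0002C90000000001, 1))  # listed port
    print(hop_weight(table, 0x0002C90000000001, 2))  # unlisted port
```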
296
297 -O, --port_search_ordering_file <path to file>
This option tweaks the routing. It is suitable for two cases:

1. While using the DOR routing algorithm. This option provides a
   mapping between hypercube dimensions and ports on a per switch
   basis for the DOR routing engine. The file consists of lines
   containing a switch node GUID (specified as a 64 bit hex number,
   with leading 0x) followed by a list of non-zero port numbers,
   separated by spaces, one switch per line. The order for the
   port numbers is in one to one correspondence to the dimensions.
   Ports not listed on a line are assigned to the remaining
   dimensions, in port order. Anything after a # is a comment.

2. While using a general routing algorithm. This option provides the
   order of the ports that would be chosen for routing, from each
   switch, rather than searching for an appropriate port from port 1
   to N. The file consists of lines containing a switch node GUID
   (specified as a 64 bit hex number, with leading 0x) followed by
   a list of non-zero port numbers, separated by spaces, one switch
   per line. In case of DOR, the order for the port numbers is in
   one to one correspondence to the dimensions. Ports not listed
   on a line are assigned to the remaining dimensions, in port
   order. Anything after a # is a comment.
318
319 -O, --dimn_ports_file <path to file> (DEPRECATED)
320 This is a deprecated flag. Please use --port_search_order‐
321 ing_file instead. This option provides a mapping between hyper‐
322 cube dimensions and ports on a per switch basis for the DOR
323 routing engine. The file consists of lines containing a switch
324 node GUID (specified as a 64 bit hex number, with leading 0x)
325 followed by a list of non-zero port numbers, separated by spa‐
326 ces, one switch per line. The order for the port numbers is in
327 one to one correspondence to the dimensions. Ports not listed
328 on a line are assigned to the remaining dimensions, in port
329 order. Anything after a # is a comment.
330
331 -x, --honor_guid2lid
332 This option forces OpenSM to honor the guid2lid file, when it
333 comes out of Standby state, if such file exists under
334 OSM_CACHE_DIR, and is valid. By default, this is FALSE.
335
336 -f, --log_file <file name>
337 This option defines the log to be the given file. By default,
338 the log goes to /var/log/opensm.log. For the log to go to stan‐
339 dard output use -f stdout.
340
341 -L, --log_limit <size in MB>
342 This option defines maximal log file size in MB. When specified
343 the log file will be truncated upon reaching this limit.
344
345 -e, --erase_log_file
This option will cause deletion of the log file (if it already
exists). By default, the log file is appended to.
348
349 -P, --Pconfig <partition config file>
350 This option defines the optional partition configuration file.
351 The default name is /etc/rdma/partitions.conf.
352
353 --prefix_routes_file <file name>
354 Prefix routes control how the SA responds to path record queries
355 for off-subnet DGIDs. By default, the SA fails such queries.
356 The PREFIX ROUTES section below describes the format of the con‐
357 figuration file. The default path is
358 /etc/rdma/prefix-routes.conf.
359
360 -Q, --qos
361 This option enables QoS setup. It is disabled by default.
362
363 -Y, --qos_policy_file <file name>
364 This option defines the optional QoS policy file. The default
365 name is /etc/rdma/qos-policy.conf. See QoS_manage‐
366 ment_in_OpenSM.txt in opensm doc for more information on config‐
367 uring QoS policy via this file.
368
--congestion_control
(EXPERIMENTAL) This option enables congestion control
configuration. It is disabled by default. See config file for
congestion control configuration options.

--cc_key <key>
(EXPERIMENTAL) This option configures the CCkey to use when
configuring congestion control. Note that this option does not
configure a new CCkey into switches and CAs. Defaults to 0.
376
377 -N, --no_part_enforce (DEPRECATED)
378 This is a deprecated flag. Please use --part_enforce instead.
379 This option disables partition enforcement on switch external
380 ports.
381
382 -Z, --part_enforce [both | in | out | off]
383 This option indicates the partition enforcement type (for
384 switches). Enforcement type can be inbound only (in), outbound
385 only (out), both or disabled (off). Default is both.
386
387 -W, --allow_both_pkeys
388 This option indicates whether both full and limited membership
389 on the same partition can be configured in the PKeyTable.
390 Default is not to allow both pkeys.
391
392 -y, --stay_on_fatal
393 This option will cause SM not to exit on fatal initialization
394 issues: if SM discovers duplicated guids or a 12x link with lane
395 reversal badly configured. By default, the SM will exit on
396 these errors.
397
398 -B, --daemon
399 Run in daemon mode - OpenSM will run in the background.
400
401 -J, --pidfile <file_name>
402 Makes the SM write its own PID to the specified file when
403 started in daemon mode.
404
405 -I, --inactive
406 Start SM in inactive rather than init SM state. This option can
407 be used in conjunction with the perfmgr so as to run a stand‐
408 alone performance manager without SM/SA. However, this is NOT
409 currently implemented in the performance manager.
410
411 --perfmgr
412 Enable the perfmgr. Only takes effect if --enable-perfmgr was
413 specified at configure time. See performance-manager-HOWTO.txt
414 in opensm doc for more information on running perfmgr.
415
416 --perfmgr_sweep_time_s <seconds>
417 Specify the sweep time for the performance manager in seconds
418 (default is 180 seconds). Only takes effect if --enable-perfmgr
419 was specified at configure time.
420
421 --consolidate_ipv6_snm_req
422 Use shared MLID for IPv6 Solicited Node Multicast groups per
423 MGID scope and P_Key.
424
425 --log_prefix <prefix text>
426 This option specifies the prefix to the syslog messages from
427 OpenSM. A suitable prefix can be used to identify the IB subnet
428 in syslog messages when two or more instances of OpenSM run in a
429 single node to manage multiple fabrics. For example, in a dual-
430 fabric (or dual-rail) IB cluster, the prefix for the first fab‐
431 ric could be "mpi" and the other fabric could be "storage".
432
433 --torus_config <path to torus-2QoS config file>
434 This option defines the file name for the extra configuration
435 information needed for the torus-2QoS routing engine. The
436 default name is /etc/rdma/torus-2QoS.conf
437
438 -v, --verbose
439 This option increases the log verbosity level. The -v option
440 may be specified multiple times to further increase the ver‐
441 bosity level. See the -D option for more information about log
442 verbosity.
443
-V This option sets the maximum verbosity level and forces log
flushing. The -V option is equivalent to '-D 0xFF -d 2'. See
the -D option for more information about log verbosity.
447
448 -D <value>
449 This option sets the log verbosity level. A flags field must
450 follow the -D option. A bit set/clear in the flags enables/dis‐
451 ables a specific log level as follows:
452
453 BIT LOG LEVEL ENABLED
454 ---- -----------------
455 0x01 - ERROR (error messages)
456 0x02 - INFO (basic messages, low volume)
457 0x04 - VERBOSE (interesting stuff, moderate volume)
458 0x08 - DEBUG (diagnostic, high volume)
459 0x10 - FUNCS (function entry/exit, very high volume)
460 0x20 - FRAMES (dumps all SMP and GMP frames)
461 0x40 - ROUTING (dump FDB routing information)
462 0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM log‐
463 ging)
464
465 Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying
466 -D 0 disables all messages. Specifying -D 0xFF enables all mes‐
467 sages (see -V). High verbosity levels may require increasing
468 the transaction timeout with the -t option.
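
To show how a -D flags value is composed from the bit table above, here is a short Python sketch. The helper names log_flags and describe_flags are illustrative only; the bit values are the ones listed in the table.

```python
# Bit values copied from the -D table above.
LOG_LEVELS = {
    "ERROR": 0x01,
    "INFO": 0x02,
    "VERBOSE": 0x04,
    "DEBUG": 0x08,
    "FUNCS": 0x10,
    "FRAMES": 0x20,
    "ROUTING": 0x40,
    "SYS": 0x80,
}


def log_flags(*names: str) -> int:
    """OR together the bits for the named log levels."""
    value = 0
    for name in names:
        value |= LOG_LEVELS[name]
    return value


def describe_flags(flags: int) -> list:
    """List the log levels enabled by a flags value, in table order."""
    return [name for name, bit in LOG_LEVELS.items() if flags & bit]


if __name__ == "__main__":
    default = log_flags("ERROR", "INFO")
    print(f"-D 0x{default:02X} enables {describe_flags(default)}")
    print(f"-D 0x{log_flags(*LOG_LEVELS):02X} enables everything")
```

For instance, adding VERBOSE to the default ERROR + INFO levels gives 0x07, which would be passed as -D 0x07.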
469
470 -d, --debug <value>
471 This option specifies a debug option. These options are not
472 normally needed. The number following -d selects the debug
473 option to enable as follows:
474
475 OPT Description
476 --- -----------------
477 -d0 - Ignore other SM nodes
478 -d1 - Force single threaded dispatching
479 -d2 - Force log flushing after each log message
480 -d3 - Disable multicast support
481
482 -h, --help
483 Display this usage info then exit.
484
485 -? Display this usage info then exit.

ENVIRONMENT VARIABLES
489 The following environment variables control opensm behavior:
490
491 OSM_TMP_DIR - controls the directory in which the temporary files gen‐
492 erated by opensm are created. These files are: opensm-subnet.lst,
493 opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.
494
495 OSM_CACHE_DIR - opensm stores certain data to the disk such that subse‐
496 quent runs are consistent. The default directory used is
497 /var/cache/opensm. The following files are included in it:
498
guid2lid  - stores the LID range assigned to each GUID
guid2mkey - stores the MKey previously assigned to each GUID
neighbors - stores a map of the GUIDs at either end of each link
            in the fabric

SIGNALING
506 When opensm receives a HUP signal, it starts a new heavy sweep as if a
507 trap was received or a topology change was found.
508
509 Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log
510 for logrotate purposes.
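
A minimal Python sketch of sending these signals from a script follows. It assumes a POSIX system and that opensm was started with -B and -J so that a PID file exists; the helper name signal_opensm and any PID file path you pass to it are illustrative, not something OpenSM provides.

```python
import os
import signal


def signal_opensm(pidfile: str, sig: int = signal.SIGHUP) -> int:
    """Send `sig` to the process whose PID is stored in `pidfile`.

    SIGHUP (the default here) requests a new heavy sweep;
    signal.SIGUSR1 requests a reopen of the log file.
    Returns the PID that was signalled.
    """
    with open(pidfile) as handle:
        pid = int(handle.read().strip())
    os.kill(pid, sig)
    return pid
```

A log rotation hook could call signal_opensm(path, signal.SIGUSR1) after moving the old log aside, where path is whatever file was given to -J.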

PARTITION CONFIGURATION
514 The default name of OpenSM partitions configuration file is
515 /etc/rdma/partitions.conf. The default may be changed by using the
516 --Pconfig (-P) option with OpenSM.
517
518 The default partition will be created by OpenSM unconditionally even
519 when partition configuration file does not exist or cannot be accessed.
520
The default partition has P_Key value 0x7fff. OpenSM's port will always
have full membership in the default partition. All other end ports will
have full membership if the partition configuration file is not found
or cannot be accessed, or limited membership if the file exists and can
be accessed but there is no rule for the Default partition.

Effectively, this amounts to the same as if one of the following rules
appears in the partition configuration file.
529
530 In the case of no rule for the Default partition:
531
532 Default=0x7fff : ALL=limited, SELF=full ;
533
534 In the case of no partition configuration file or file cannot be
535 accessed:
536
537 Default=0x7fff : ALL=full ;
538
539
540 File Format
541
542 Comments:
543
Anything following a '#' character on a line is a comment and is
ignored by the parser.
546
547 General file format:
548
549 <Partition Definition>:[<newline>]<Partition Properties>;
550
551 Partition Definition:
[PartitionName][=PKey][,indx0][,ipoib_bc_flags][,defmember=full|limited|both]
554
555 PartitionName - string, will be used with logging. When
556 omitted, empty string will be used.
557 PKey - P_Key value for this partition. Only low 15
558 bits will be used. When omitted will be
559 autogenerated.
560 indx0 - indicates that this pkey should be inserted in
561 block 0 index 0.
562 ipoib_bc_flags - used to indicate/specify IPoIB capability of
563 this partition.
564
565 defmember=full|limited|both - specifies default membership for
566 port guid list. Default is limited.
567
568 ipoib_bc_flags:
569 ipoib_flag|[mgroup_flag]*
570
571 ipoib_flag:
572 ipoib - indicates that this partition may be used for
573 IPoIB, as a result the IPoIB broadcast group will
574 be created with the mgroup_flag flags given,
575 if any.
576
577 Partition Properties:
578 [<Port list>|<MCast Group>]* | <Port list>
579
580 Port list:
581 <Port Specifier>[,<Port Specifier>]
582
583 Port Specifier:
584 <PortGUID>[=[full|limited|both]]
585
PortGUID             - GUID of partition member EndPort.
                       Hexadecimal numbers should start with
                       0x; decimal numbers are accepted too.
full, limited, both  - indicates full and/or limited membership for
                       this port. When omitted (or unrecognized),
                       limited membership is assumed. Both
                       indicates both full and limited membership
                       for this port.
594
595 MCast Group:
596 mgid=gid[,mgroup_flag]*<newline>
597
- gid specified is verified to be a Multicast
  address. IP groups are verified to match
  the rate and mtu of the broadcast group.
  The P_Key bits of the mgid for IP groups are
  verified to either match the P_Key specified
  in the "Partition Definition" or, if they are
  0x0000, the P_Key will be copied into those
  bits.
606
607 mgroup_flag:
rate=<val>      - specifies rate for this MC group
                  (default is 3 (10 Gbps))
mtu=<val>       - specifies MTU for this MC group
                  (default is 4 (2048))
sl=<val>        - specifies SL for this MC group
                  (default is 0)
scope=<val>     - specifies scope for this MC group
                  (default is 2 (link local)). Multiple scope
                  settings are permitted for a partition.
                  NOTE: This overwrites the scope nibble of the
                        specified mgid. Furthermore, specifying
                        multiple scope settings will result in
                        multiple MC groups being created.
Q_Key=<val>     - specifies the Q_Key for this MC group
                  (default: 0x0b1b for IP groups, 0 for other
                  groups)
                  WARNING: changing this for the broadcast
                           group may break IPoIB on client
                           nodes!!
TClass=<val>    - specifies tclass for this MC group
                  (default is 0)
FlowLabel=<val> - specifies FlowLabel for this MC group
                  (default is 0)

NOTE: All mgroup_flag flags MUST be separated by comma (,).
632
633 Note that values for rate, mtu, and scope, for both partitions and mul‐
634 ticast groups, should be specified as defined in the IBTA specification
635 (for example, mtu=4 for 2048).
636
637 There are several useful keywords for PortGUID definition:
638
639 - 'ALL' means all end ports in this subnet.
640 - 'ALL_CAS' means all Channel Adapter end ports in this subnet.
641 - 'ALL_SWITCHES' means all Switch end ports in this subnet.
642 - 'ALL_ROUTERS' means all Router end ports in this subnet.
643 - 'SELF' means subnet manager's port.
644
645 Empty list means no ports in this partition.
646
647 Notes:
648
649 White space is permitted between delimiters ('=', ',',':',';').
650
651 PartitionName does not need to be unique, PKey does need to be unique.
652 If PKey is repeated then those partition configurations will be merged
653 and first PartitionName will be used (see also next note).
654
655 It is possible to split partition configuration in more than one defi‐
656 nition, but then PKey should be explicitly specified (otherwise differ‐
657 ent PKey values will be generated for those definitions).
658
659 Examples:
660
661 Default=0x7fff : ALL, SELF=full ;
662 Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ;
663
664 NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306
665 ;
666
667 YetAnotherOne = 0x300 : SELF=full ;
668 YetAnotherOne = 0x300 : ALL=limited ;
669
670 ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
671 # 0x123453, 0x123454 will be limited
672 ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
673 # 0x123456, 0x123457 will be limited
ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457,
0x123458=full;
676 ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
677 ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited,
678 0x12345d;
679
680 # multicast groups added to default
681 Default=0x7fff,ipoib:
682 mgid=ff12:401b::0707,sl=1 # random IPv4 group
683 mgid=ff12:601b::16 # MLDv2-capable routers
684 mgid=ff12:401b::16 # IGMP
685 mgid=ff12:601b::2 # All routers
686 mgid=ff12::1,sl=1,Q_Key=0xDEADBEEF,rate=3,mtu=2 # random group
687 ALL=full;
688
689
690 Note:
691
692 The following rule is equivalent to how OpenSM used to run prior to the
693 partition manager:
694
695 Default=0x7fff,ipoib:ALL=full;

QOS CONFIGURATION
There is a set of QoS-related low-level configuration parameters. All
of these parameter names are prefixed by the "qos_" string. Here is a
full list of these parameters:
702
703 qos_max_vls - The maximum number of VLs that will be on the subnet
704 qos_high_limit - The limit of High Priority component of VL
705 Arbitration table (IBA 7.6.9)
706 qos_vlarb_low - Low priority VL Arbitration table (IBA 7.6.9)
707 template
708 qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
709 template
710 Both VL arbitration templates are pairs of
711 VL and weight
712 qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is
713 a list of VLs corresponding to SLs 0-15 (Note
714 that VL15 used here means drop this SL)
715
716 Typical default values (hard-coded in OpenSM initialization) are:
717
718 qos_max_vls 15
719 qos_high_limit 0
720 qos_vlarb_low
721 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
722 qos_vlarb_high
723 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
724 qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
725
726 The syntax is compatible with rest of OpenSM configuration options and
727 values may be stored in OpenSM config file (cached options file).
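
To make the value formats concrete, the following Python sketch parses an SL2VL list and a VL arbitration template as they are written above. The function names are illustrative; accepting fewer than 16 SL2VL entries and ignoring a trailing comma are assumptions made for the sketch, not documented OpenSM behavior.

```python
from typing import Dict, List, Tuple


def parse_sl2vl(value: str) -> Dict[int, int]:
    """Map SL index (0-15) to VL from a comma-separated list of VLs."""
    vls = [int(item) for item in value.split(",") if item.strip()]
    if len(vls) > 16:
        raise ValueError("at most 16 entries (SLs 0-15) are expected")
    for vl in vls:
        if not 0 <= vl <= 15:
            raise ValueError(f"VL {vl} is out of range 0-15")
    return dict(enumerate(vls))


def dropped_sls(sl2vl: Dict[int, int]) -> List[int]:
    """SLs mapped to VL15, which here means 'drop this SL'."""
    return [sl for sl, vl in sl2vl.items() if vl == 15]


def parse_vlarb(value: str) -> List[Tuple[int, int]]:
    """Parse a VL arbitration template into (VL, weight) pairs."""
    pairs = []
    for item in value.split(","):
        if not item.strip():
            continue
        vl, weight = item.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


if __name__ == "__main__":
    default_sl2vl = "0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7"
    print(parse_sl2vl(default_sl2vl)[15])
    print(parse_vlarb("0:4,1:0,2:0"))
```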
728
In addition to the above, we may define separate QoS configuration
parameter sets for various target types. As targets, we currently
support CAs, routers, switch external ports, and switch enhanced port 0.
The names of such specialized parameters are prefixed by the
"qos_<type>_" string. Here is a full list of the currently supported sets:
734
735 qos_ca_ - QoS configuration parameters set for CAs.
736 qos_rtr_ - parameters set for routers.
737 qos_sw0_ - parameters set for switches' port 0.
738 qos_swe_ - parameters set for switches' external ports.
739
740 Examples:
741 qos_sw0_max_vls=2
742 qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
743 qos_swe_high_limit=0

PREFIX ROUTES
747 Prefix routes control how the SA responds to path record queries for
748 off-subnet DGIDs. By default, the SA fails such queries. Note that
749 IBA does not specify how the SA should obtain off-subnet path record
750 information. The prefix routes configuration is meant as a stop-gap
751 until the specification is completed.
752
753 Each line in the configuration file is a 64-bit prefix followed by a
754 64-bit GUID, separated by white space. The GUID specifies the router
755 port on the local subnet that will handle the prefix. Blank lines are
756 ignored, as is anything between a # character and the end of the line.
The prefix and GUID are both in hex; the leading 0x is optional.
758 Either, or both, can be wild-carded by specifying an asterisk instead
759 of an explicit prefix or GUID.
760
761 When responding to a path record query for an off-subnet DGID, opensm
762 searches for the first prefix match in the configuration file. There‐
763 fore, the order of the lines in the configuration file is important: a
764 wild-carded prefix at the beginning of the configuration file renders
765 all subsequent lines useless. If there is no match, then opensm fails
the query. It is legal to repeat prefixes in the configuration file;
767 opensm will return the path to the first available matching router. A
768 configuration file with a single line where both prefix and GUID are
769 wild-carded means that a path record query specifying any off-subnet
770 DGID should return a path to the first available router. This configu‐
771 ration yields the same behavior formerly achieved by compiling opensm
772 with -DROUTER_EXP which has been obsoleted.
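
The file format and the order-sensitive matching described above can be sketched as follows. This is a minimal Python illustration, not the OpenSM parser; the function names and the sample prefixes and GUIDs are made up for the example. A wildcard is represented as None, and the function returns the candidate router GUIDs in file order, from which the first available one would be used.

```python
def _field(token):
    """Parse a hex prefix or GUID; '*' is a wildcard (returned as None)."""
    return None if token == "*" else int(token, 16)


def parse_prefix_routes(text):
    """Return a list of (prefix, guid) tuples in file order."""
    routes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue  # blank lines are ignored
        prefix, guid = line.split()
        routes.append((_field(prefix), _field(guid)))
    return routes


def matching_routes(routes, dgid_prefix):
    """Router GUIDs (None = any router) whose line matches, in file order."""
    return [guid for prefix, guid in routes
            if prefix is None or prefix == dgid_prefix]


# Hypothetical configuration; the values are illustrative only.
sample = """
# route one prefix through a specific router port
0xfec0000000000001  0x0002c90300001111
fec0000000000002    *
*                   0x0002c90300002222   # catch-all, must come last
"""
routes = parse_prefix_routes(sample)
```

Moving the catch-all line to the top would make it the first match for every query, which is why line order matters.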
773
774
776 OpenSM supports configuring a single management key (MKey) for use
777 across the subnet.
778
779 The following configuration options are available:
780
781 m_key - the 64-bit MKey to be used on the subnet
782 (IBA 14.2.4)
783 m_key_protection_level - the numeric value of the MKey ProtectBits
784 (IBA 14.2.4.1)
785 m_key_lease_period - the number of seconds a CA will wait for a
786 response from the SM before resetting the
787 protection level to 0 (IBA 14.2.4.2).
788
789 OpenSM will configure all ports with the MKey specified by m_key,
790 defaulting to a value of 0. A m_key value of 0 disables MKey protection
791 on the subnet. Switches and HCAs with a non-zero MKey will not accept
792 requests to change their configuration unless the request includes the
793 proper MKey.
794
795 MKey Protection Levels
796
797 MKey protection levels modify how switches and CAs respond to SMPs
798 lacking a valid MKey. OpenSM will configure each port's ProtectBits to
799 support the level defined by the m_key_protection_level parameter. If
800 no parameter is specified, OpenSM defaults to operating at protection
801 level 0.
802
803 There are currently 4 protection levels defined by the IBA:
804
805 0 - Queries return valid data, including MKey. Configuration changes
806 are not allowed unless the request contains a valid MKey.
807 1 - Like level 0, but the MKey is set to 0 (0x00000000) in queries,
808 unless the request contains a valid MKey.
809 2 - Neither queries nor configuration changes are allowed, unless the
810 request contains a valid MKey.
811 3 - Identical to 2. Maintained for backwards compatibility.
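
The four levels can be summarized as a small decision function. The following Python is only a model of the table above (the name smp_response and the returned tuple are assumptions for illustration); it does not represent how a port or OpenSM implements the check.

```python
def smp_response(level, is_set, has_valid_mkey):
    """Model how a port treats an SMP under a given protection level.

    Returns (accepted, mkey_visible):
      accepted     - the query or configuration change is processed
      mkey_visible - the real MKey is returned rather than 0
    """
    if level not in (0, 1, 2, 3):
        raise ValueError("protection level must be 0..3")
    if has_valid_mkey:
        return (True, True)    # a valid MKey is always sufficient
    if is_set:
        return (False, False)  # changes always need a valid MKey
    if level == 0:
        return (True, True)    # queries answered, MKey included
    if level == 1:
        return (True, False)   # queries answered, MKey reported as 0
    return (False, False)      # levels 2 and 3: queries refused as well
```

For example, smp_response(1, False, False) yields (True, False): the query is answered but the MKey field is zeroed.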
812
813 MKey Lease Period
814
815 InfiniBand supports a MKey lease timeout, which is intended to allow
816 administrators or a new SM to recover/reset lost MKeys on a fabric.
817
818 If MKeys are enabled on the subnet and a switch or CA receives a
819 request that requires a valid MKey but does not contain one, it warns
820 the SM by sending a trap (Bad M_Key, Trap 256). If the MKey lease
821 period is non-zero, it also starts a countdown timer for the time spec‐
822 ified by the lease period. If a SM (or other agent) responds with the
823 correct MKey, the timer is stopped and reset. Should the timer reach
824 zero, the switch or CA will reset its MKey protection level to 0,
825 exposing the MKey and allowing recovery.
826
OpenSM will initialize all ports to use an MKey lease period of the
number of seconds specified in the config file. If no
m_key_lease_period is specified, a default of 0 will be used.
830
831 OpenSM normally quickly responds to all Bad_M_Key traps, resetting the
832 lease timers. Additionally, OpenSM's subnet sweeps will also cancel
833 any running timers. For maximum protection against accidentally-
834 exposed MKeys, the MKey lease time should be a few multiples of the
835 subnet sweep time. If OpenSM detects at startup that your sweep inter‐
836 val is greater than your MKey lease period, it will reset the lease
837 period to be greater than the sweep interval. Similarly, if sweeping
838 is disabled at startup, it will be re-enabled with an interval less
839 than the Mkey lease period.
840
841 If OpenSM is required to recover a subnet for which it is missing
842 mkeys, it must do so one switch level at a time. As such, the total
843 time to recover the subnet may be as long as the mkey lease period mul‐
844 tiplied by the maximum number of hops between the SM and an endpoint,
845 plus one.
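
As a rough worked example of the bound above, the following Python sketch reads the sentence as lease period multiplied by (maximum hops + 1). That reading is an assumption, as is the function name; it only applies when the lease period is non-zero, since otherwise no countdown timer is started.

```python
def worst_case_recovery_seconds(lease_period_s, max_hops):
    """Upper bound on recovery time, one switch level per lease period.

    Assumes the bound is lease_period * (max_hops + 1).
    """
    if lease_period_s < 0 or max_hops < 0:
        raise ValueError("arguments must be non-negative")
    return lease_period_s * (max_hops + 1)


# A 60 second lease on a fabric with at most 4 hops to any endpoint.
bound = worst_case_recovery_seconds(60, 4)
```

Under that reading, the example fabric could take up to 300 seconds to recover.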
846
847 MKey Effects on Diagnostic Utilities
848
849 Setting a MKey may have a detrimental effect on diagnostic software run
850 on the subnet, unless your diagnostic software is able to retrieve
851 MKeys from the SA or can be explicitly configured with the proper MKey.
852 This is particularly true at protection level 2, where CAs will ignore
853 queries for management information that do not contain the proper MKey.
854
855
857 OpenSM now offers ten routing engines:
858
859 1. Min Hop Algorithm - based on the minimum hops to each node where
860 the path length is optimized.
861
862 2. UPDN Unicast routing algorithm - also based on the minimum hops to
863 each node, but it is constrained to ranking rules. This algorithm
864 should be chosen if the subnet is not a pure Fat Tree, and deadlock may
865 occur due to a loop in the subnet.
866
867 3. DNUP Unicast routing algorithm - similar to UPDN but allows routing
868 in fabrics which have some CA nodes attached closer to the roots than
869 some switch nodes.
870
871 4. Fat Tree Unicast routing algorithm - this algorithm optimizes rout‐
872 ing for congestion-free "shift" communication pattern. It should be
873 chosen if a subnet is a symmetrical or almost symmetrical fat-tree of
874 various types, not just K-ary-N-Trees: non-constant K, not fully
875 staffed, any Constant Bisectional Bandwidth (CBB) ratio. Similar to
876 UPDN, Fat Tree routing is constrained to ranking rules.
877
878 5. LASH unicast routing algorithm - uses InfiniBand virtual layers (SL)
879 to provide deadlock-free shortest-path routing while also distributing
880 the paths between layers. LASH is an alternative deadlock-free topol‐
881 ogy-agnostic routing algorithm to the non-minimal UPDN algorithm avoid‐
882 ing the use of a potentially congested root node.
883
884 6. DOR Unicast routing algorithm - based on the Min Hop algorithm, but
885 avoids port equalization except for redundant links between the same
886 two switches. This provides deadlock free routes for hypercubes when
887 the fabric is cabled as a hypercube and for meshes when cabled as a
888 mesh (see details below).
889
890 7. Torus-2QoS unicast routing algorithm - a DOR-based routing algorithm
891 specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-
892 free routing while supporting two quality of service (QoS) levels. In
893 addition it is able to route around multiple failed fabric links or a
894 single failed fabric switch without introducing deadlocks, and without
895 changing path SL values granted before the failure.
896
897 8. DFSSSP unicast routing algorithm - a deadlock-free single-source-
898 shortest-path routing, which uses the SSSP algorithm (see algorithm 9.)
899 as the base to optimize link utilization and uses InfiniBand virtual
900 lanes (SL) to provide deadlock-freedom.
901
902 9. SSSP unicast routing algorithm - a single-source-shortest-path rout‐
903 ing algorithm, which globally balances the number of routes per link to
904 optimize link utilization. This routing algorithm has no restrictions
905 in terms of the underlying topology.
906
907 10. Nue unicast routing algorithm - a 100%-applicable and deadlock-free
908 routing which can be used for any arbitrary or faulty network topology
and any number of virtual lanes (this includes the absence of VLs as
910 well). Paths are globally balanced w.r.t the number of routes per link,
911 and are kept as short as possible while enforcing deadlock-freedom
912 within the VL constraint.
913
914 OpenSM also supports a file method which can load routes from a table.
915 See ´Modular Routing Engine´ for more information on this.
916
917 The basic routing algorithm is comprised of two stages:
918
919 1. MinHop matrix calculation
920 How many hops are required to get from each port to each LID ?
921 The algorithm to fill these tables is different if you run standard
922 (min hop) or Up/Down.
923 For standard routing, a "relaxation" algorithm is used to propagate
924 min hop from every destination LID through neighbor switches
For Up/Down routing, a BFS from every target is used. The BFS tracks
link direction (up or down) and avoids steps that would go up after
a down step was used.
928
929 2. Once MinHop matrices exist, each switch is visited and for each tar‐
930 get LID a decision is made as to what port should be used to get to
931 that LID.
932 This step is common to standard and Up/Down routing. Each port has a
933 counter counting the number of target LIDs going through it.
When there are multiple alternative ports with the same MinHop to a LID,
the one with fewer previously assigned LIDs is selected.
936 If LMC > 0, more checks are added: Within each group of LIDs
937 assigned to same target port,
938 a. use only ports which have same MinHop
b. first prefer the ones that go to a different systemImageGuid (than
the previous LID of the same LMC group)
941 c. if none - prefer those which go through another NodeGuid
942 d. fall back to the number of paths method (if all go to same node).
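
The second stage (for LMC = 0) can be sketched as follows. This Python is a simplified illustration rather than OpenSM's implementation: the data layout min_hop[port][lid] and the tie-break on the lowest port number are assumptions made for the example.

```python
def assign_ports(min_hop, lids):
    """Pick an output port per LID on one switch.

    min_hop[port][lid] is the hop count to `lid` when leaving via `port`.
    Among ports with the minimum hop count, the port carrying the fewest
    previously assigned LIDs wins (ties go to the lowest port number,
    which is an assumption of this sketch).
    """
    load = {port: 0 for port in min_hop}  # per-port LID counter
    lft = {}
    for lid in lids:
        best = min(min_hop[port][lid] for port in min_hop)
        candidates = [p for p in min_hop if min_hop[p][lid] == best]
        port = min(candidates, key=lambda p: (load[p], p))
        lft[lid] = port
        load[port] += 1
    return lft


# Two ports; LIDs 10-12 are equally far via both, LID 13 is closer via 2.
hops = {
    1: {10: 2, 11: 2, 12: 2, 13: 3},
    2: {10: 2, 11: 2, 12: 2, 13: 1},
}
table = assign_ports(hops, [10, 11, 12, 13])
```

LIDs 10 to 12 alternate between the two ports, while LID 13 always takes port 2 because it is the only MinHop port.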
943
944 Effect of Topology Changes
945
946 OpenSM will preserve existing routing in any case where there is no
947 change in the fabric switches unless the -r (--reassign_lids) option is
948 specified.
949
950 -r
951 --reassign_lids
952 This option causes OpenSM to reassign LIDs to all
953 end nodes. Specifying -r on a running subnet
954 may disrupt subnet traffic.
955 Without -r, OpenSM attempts to preserve existing
956 LID assignments resolving multiple use of same LID.
957
958 If a link is added or removed, OpenSM does not recalculate the routes
959 that do not have to change. A route has to change if the port is no
960 longer UP or no longer the MinHop. When routing changes are performed,
961 the same algorithm for balancing the routes is invoked.
962
963 In the case of using the file based routing, any topology changes are
currently ignored. The 'file' routing engine just loads the LFTs from
965 the file specified, with no reaction to real topology. Obviously, this
966 will not be able to recheck LIDs (by GUID) for disconnected nodes, and
967 LFTs for non-existent switches will be skipped. Multicast is not
968 affected by 'file' routing engine (this uses min hop tables).
969
970
971 Min Hop Algorithm
972
973 The Min Hop algorithm is invoked by default if no routing algorithm is
974 specified. It can also be invoked by specifying '-R minhop'.
975
976 The Min Hop algorithm is divided into two stages: computation of min-
977 hop tables on every switch and LFT output port assignment. Link sub‐
978 scription is also equalized with the ability to override based on port
979 GUID. The latter is supplied by:
980
981 -i <equalize-ignore-guids-file>
982 --ignore_guids <equalize-ignore-guids-file>
983 This option provides the means to define a set of ports
984 (by guid) that will be ignored by the link load
985 equalization algorithm. Note that only endports (CA,
986 switch port 0, and router ports) and not switch external
987 ports are supported.
988
LMC-aware routing is done on a (remote) system or switch basis.
990
991
992 Purpose of UPDN Algorithm
993
994 The UPDN algorithm is designed to prevent deadlocks from occurring in
995 loops of the subnet. A loop-deadlock is a situation in which it is no
996 longer possible to send data between any two hosts connected through
997 the loop. As such, the UPDN routing algorithm should be used if the
998 subnet is not a pure Fat Tree, and one of its loops may experience a
999 deadlock (due, for example, to high pressure).
1000
1001 The UPDN algorithm is based on the following main stages:
1002
1003 1. Auto-detect root nodes - based on the CA hop length from any switch
1004 in the subnet, a statistical histogram is built for each switch (hop
1005 num vs number of occurrences). If the histogram reflects a specific
1006 column (higher than others) for a certain node, then it is marked as a
1007 root node. Since the algorithm is statistical, it may not find any root
1008 nodes. The list of the root nodes found by this auto-detect stage is
1009 used by the ranking process stage.
1010
1011 Note 1: The user can override the node list manually.
1012 Note 2: If this stage cannot find any root nodes, and the user did
1013 not specify a guid list file, OpenSM defaults back to the
1014 Min Hop routing algorithm.
1015
1016 2. Ranking process - All root switch nodes (found in stage 1) are
1017 assigned a rank of 0. Using the BFS algorithm, the rest of the switch
1018 nodes in the subnet are ranked incrementally. This ranking aids in the
1019 process of enforcing rules that ensure loop-free paths.
1020
1021 3. Min Hop Table setting - after ranking is done, a BFS algorithm is
1022 run from each (CA or switch) node in the subnet. During the BFS
1023 process, the FDB table of each switch node traversed by BFS is updated,
1024 in reference to the starting node, based on the ranking rules and guid
1025 values.
1026
1027 At the end of the process, the updated FDB tables ensure loop-free
1028 paths through the subnet.
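
The ranking stage and the rule it enforces can be illustrated with a short Python sketch. It is not OpenSM code: the topology, the names, and the treatment of equal-rank steps as "up" moves are simplifying assumptions (the text above notes that OpenSM also uses guid values).

```python
from collections import deque


def rank_switches(links, roots):
    """BFS ranking: roots get rank 0, every other switch its BFS depth."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        switch = queue.popleft()
        for neighbor in links[switch]:
            if neighbor not in rank:
                rank[neighbor] = rank[switch] + 1
                queue.append(neighbor)
    return rank


def obeys_updn(path, rank):
    """True if the path never takes an up step after a down step."""
    gone_down = False
    for here, there in zip(path, path[1:]):
        if rank[there] > rank[here]:
            gone_down = True   # moving away from the roots
        elif gone_down:
            return False       # up (or equal-rank) step after going down
    return True


# Hypothetical two-level fabric: two spines, three leaves.
links = {
    "S1": ["L1", "L2", "L3"],
    "S2": ["L1", "L2", "L3"],
    "L1": ["S1", "S2"],
    "L2": ["S1", "S2"],
    "L3": ["S1", "S2"],
}
rank = rank_switches(links, ["S1", "S2"])
```

A path such as L1, S1, L2 goes up and then down and is allowed; L1, S1, L2, S2, L3 turns upward again after descending and is rejected.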
1029
1030 Note: Up/Down routing does not allow LID routing communication between
1031 switches that are located inside spine "switch systems". The reason is
1032 that there is no way to allow a LID route between them that does not
1033 break the Up/Down rule. One ramification of this is that you cannot
1034 run SM on switches other than the leaf switches of the fabric.
1035
1036
1037 UPDN Algorithm Usage
1038
1039 Activation through OpenSM
1040
1041 Use '-R updn' option (instead of old '-u') to activate the UPDN algo‐
1042 rithm. Use '-a <root_guid_file>' for adding an UPDN guid file that
1043 contains the root nodes for ranking. If the `-a' option is not used,
1044 OpenSM uses its auto-detect root nodes algorithm.
1045
1046 Notes on the guid list file:
1047
1048 1. A valid guid file specifies one guid in each line. Lines with an
1049 invalid format will be discarded.
1050 2. The user should specify the root switch guids. However, it is also
1051 possible to specify CA guids; OpenSM will use the guid of the switch
1052 (if it exists) that connects the CA to the subnet as a root node.
1053
1054 Purpose of DNUP Algorithm
1055
1056 The DNUP algorithm is designed to serve a similar purpose to UPDN. How‐
1057 ever it is intended to work in network topologies which are unsuited to
1058 UPDN due to nodes being connected closer to the roots than some of the
1059 switches. An example would be a fabric which contains nodes and
1060 uplinks connected to the same switch. The operation of DNUP is the same
as UPDN with the exception of the ranking process. In DNUP all switch
nodes are ranked based solely on their distance from CA nodes: all
switch nodes directly connected to at least one CA are assigned a value
of 1, and all other switch nodes are assigned a value of one more than
the minimum rank of all neighbor switch nodes.
1066
1067 Fat-tree Routing Algorithm
1068
The fat-tree algorithm optimizes routing for the "shift" communication
pattern. It should be chosen if a subnet is a symmetrical or almost
symmetrical fat-tree of various types. It supports not just K-ary-N-Trees:
it also handles non-constant K, cases where not all leaves (CAs) are
present, and any CBB ratio. As in UPDN, fat-tree routing also prevents
credit-loop deadlocks.
1075
1076 If the root guid file is not provided ('-a' or '--root_guid_file'
1077 options), the topology has to be pure fat-tree that complies with the
1078 following rules:
1079 - Tree rank should be between two and eight (inclusively)
1080 - Switches of the same rank should have the same number
1081 of UP-going port groups*, unless they are root switches,
in which case they shouldn't have UP-going ports at all.
1083 - Switches of the same rank should have the same number
1084 of DOWN-going port groups, unless they are leaf switches.
1085 - Switches of the same rank should have the same number
1086 of ports in each UP-going port group.
1087 - Switches of the same rank should have the same number
1088 of ports in each DOWN-going port group.
1089 - All the CAs have to be at the same tree level (rank).
1090
1091 If the root guid file is provided, the topology doesn't have to be pure
1092 fat-tree, and it should only comply with the following rules:
1093 - Tree rank should be between two and eight (inclusively)
1094 - All the Compute Nodes** have to be at the same tree level (rank).
1095 Note that non-compute node CAs are allowed here to be at different
1096 tree ranks.
1097
1098 * ports that are connected to the same remote switch are referenced as
1099 ´port group´.
1100
1101 ** list of compute nodes (CNs) can be specified by ´-u´ or
1102 ´--cn_guid_file´ OpenSM options.
1103
1104 Topologies that do not comply cause a fallback to min hop routing.
1105 Note that this can also occur on link failures which cause the topology
1106 to no longer be "pure" fat-tree.
1107
1108 Note that although fat-tree algorithm supports trees with non-integer
1109 CBB ratio, the routing will not be as balanced as in case of integer
1110 CBB ratio. In addition to this, although the algorithm allows leaf
switches to have any number of CAs, the closer the tree is to being fully
1112 populated, the more effective the "shift" communication pattern will
1113 be. In general, even if the root list is provided, the closer the
1114 topology to a pure and symmetrical fat-tree, the more optimal the rout‐
1115 ing will be.
1116
The algorithm also dumps a compute node ordering file
(opensm-ftree-ca-order.dump) in the same directory where the OpenSM log
resides. This ordering file provides the CN order that may be used to
create an efficient communication pattern that matches the routing
tables.
1121
1122 Routing between non-CN nodes
1123
1124 The use of the cn_guid_file option allows non-CN nodes to be located on
1125 different levels in the fat tree. In such case, it is not guaranteed
1126 that the Fat Tree algorithm will route between two non-CN nodes. To
1127 solve this problem, a list of non-CN nodes can be specified by ´-G´ or
´--io_guid_file´ option. These nodes will be allowed to use switches
the wrong way round a specific number of times (specified by ´-H´ or
´--max_reverse_hops´). With the proper max_reverse_hops and
1131 io_guid_file values, you can ensure full connectivity in the Fat Tree.
1132
1133 Please note that using max_reverse_hops creates routes that use the
1134 switch in a counter-stream way. This option should never be used to
1135 connect nodes with high bandwidth traffic between them ! It should only
1136 be used to allow connectivity for HA purposes or similar. Also having
1137 routes the other way around can in theory cause credit loops.
1138
1139 Use these options with extreme care !
1140
1141 Activation through OpenSM
1142
1143 Use '-R ftree' option to activate the fat-tree algorithm. Use '-a
1144 <root_guid_file>' to provide root nodes for ranking. If the `-a' option
1145 is not used, routing algorithm will detect roots automatically. Use
1146 '-u <root_cn_file>' to provide the list of compute nodes. If the `-u'
1147 option is not used, all the CAs are considered as compute nodes.
1148
1149 Note: LMC > 0 is not supported by fat-tree routing. If this is speci‐
1150 fied, the default routing algorithm is invoked instead.
1151
1152
1153 LASH Routing Algorithm
1154
1155 LASH is an acronym for LAyered SHortest Path Routing. It is a determin‐
1156 istic shortest path routing algorithm that enables topology agnostic
1157 deadlock-free routing within communication networks.
1158
1159 When computing the routing function, LASH analyzes the network topology
1160 for the shortest-path routes between all pairs of sources / destina‐
1161 tions and groups these paths into virtual layers in such a way as to
1162 avoid deadlock.
1163
Note that LASH analyzes routes and ensures deadlock freedom between
switch pairs. The link between an HCA and a switch does not need virtual
layers, as deadlock will not arise between switch and HCA.
1167
1168 In more detail, the algorithm works as follows:
1169
1170 1) LASH determines the shortest-path between all pairs of source / des‐
1171 tination switches. Note, LASH ensures the same SL is used for all
1172 SRC/DST - DST/SRC pairs and there is no guarantee that the return path
1173 for a given DST/SRC will be the reverse of the route SRC/DST.
1174
1175 2) LASH then begins an SL assignment process where a route is assigned
1176 to a layer (SL) if the addition of that route does not cause deadlock
1177 within that layer. This is achieved by maintaining and analysing a
1178 channel dependency graph for each layer. Once the potential addition of
1179 a path could lead to deadlock, LASH opens a new layer and continues the
1180 process.
1181
1182 3) Once this stage has been completed, it is highly likely that the
1183 first layers processed will contain more paths than the latter ones.
1184 To better balance the use of layers, LASH moves paths from one layer to
1185 another so that the number of paths in each layer averages out.
1186
1187 Note, the implementation of LASH in opensm attempts to use as few lay‐
1188 ers as possible. This number can be less than the number of actual lay‐
1189 ers available.
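
Step 2 of the description above (layer assignment guided by a channel dependency graph) can be sketched in Python as follows. This is a simplified illustration, not the opensm implementation: it takes precomputed switch paths as input, omits the balancing of step 3 and the pairing of SRC/DST with DST/SRC, and all names are hypothetical.

```python
def route_dependencies(path):
    """Channel dependencies of one route: consecutive directed links."""
    channels = list(zip(path, path[1:]))
    return set(zip(channels, channels[1:]))


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    state = {}  # 1 = on the current DFS stack, 2 = fully explored

    def visit(node):
        state[node] = 1
        for nxt in graph[node]:
            if state.get(nxt) == 1:
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = 2
        return False

    return any(node not in state and visit(node) for node in graph)


def assign_layers(paths):
    """Place each route in the first layer whose CDG stays acyclic."""
    layers = []      # one dependency-edge set (CDG) per layer
    assignment = []  # layer index chosen for each path
    for path in paths:
        deps = route_dependencies(path)
        for index, cdg in enumerate(layers):
            if not has_cycle(cdg | deps):
                cdg |= deps  # updates the layer's set in place
                assignment.append(index)
                break
        else:
            layers.append(set(deps))  # open a new layer
            assignment.append(len(layers) - 1)
    return assignment


# Four clockwise two-hop routes on a ring of switches A-B-C-D.
ring_routes = [list("ABC"), list("BCD"), list("CDA"), list("DAB")]
layers_used = assign_layers(ring_routes)
```

On this ring the fourth route would close a dependency cycle in the first layer, so it is moved to a second one.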
1190
1191 In general LASH is a very flexible algorithm. It can, for example,
1192 reduce to Dimension Order Routing in certain topologies, it is topology
1193 agnostic and fares well in the face of faults.
1194
1195 It has been shown that for both regular and irregular topologies, LASH
1196 outperforms Up/Down. The reason for this is that LASH distributes the
1197 traffic more evenly through a network, avoiding the bottleneck issues
1198 related to a root node and always routes shortest-path.
1199
1200 The algorithm was developed by Simula Research Laboratory.
1201
1202
Use the '-R lash -Q' option to activate the LASH algorithm.
1204
1205 Note: QoS support has to be turned on in order that SL/VL mappings are
1206 used.
1207
1208 Note: LMC > 0 is not supported by the LASH routing. If this is speci‐
1209 fied, the default routing algorithm is invoked instead.
1210
1211 For open regular cartesian meshes the DOR algorithm is the ideal rout‐
1212 ing algorithm. For toroidal meshes on the other hand there are routing
1213 loops that can cause deadlocks. LASH can be used to route these cases.
1214 The performance of LASH can be improved by preconditioning the mesh in
1215 cases where there are multiple links connecting switches and also in
1216 cases where the switches are not cabled consistently. An option exists
1217 for LASH to do this. To invoke this use '-R lash -Q --do_mesh_analy‐
1218 sis'. This will add an additional phase that analyses the mesh to try
1219 to determine the dimension and size of a mesh. If it determines that
1220 the mesh looks like an open or closed cartesian mesh it reorders the
1221 ports in dimension order before the rest of the LASH algorithm runs.
1222
1223 DOR Routing Algorithm
1224
1225 The Dimension Order Routing algorithm is based on the Min Hop algorithm
1226 and so uses shortest paths. Instead of spreading traffic out across
1227 different paths with the same shortest distance, it chooses among the
1228 available shortest paths based on an ordering of dimensions. Each port
1229 must be consistently cabled to represent a hypercube dimension or a
1230 mesh dimension. Alternatively, the -O option can be used to assign a
1231 custom mapping between the ports on a given switch, and the associated
1232 dimension. Paths are grown from a destination back to a source using
1233 the lowest dimension (port) of available paths at each step. This pro‐
1234 vides the ordering necessary to avoid deadlock. When there are multi‐
1235 ple links between any two switches, they still represent only one
1236 dimension and traffic is balanced across them unless port equalization
1237 is turned off. In the case of hypercubes, the same port must be used
1238 throughout the fabric to represent the hypercube dimension and match on
1239 both ends of the cable, or the -O option used to accomplish the align‐
1240 ment. In the case of meshes, the dimension should consistently use the
1241 same pair of ports, one port on one end of the cable, and the other
1242 port on the other end, continuing along the mesh dimension, or the -O
1243 option used as an override.
1244
1245 Use '-R dor' option to activate the DOR algorithm.
1246
1247 DFSSSP and SSSP Routing Algorithm
1248
1249 The (Deadlock-Free) Single-Source-Shortest-Path routing algorithm is
1250 designed to optimize link utilization thru global balancing of routes,
1251 while supporting arbitrary topologies. The DFSSSP routing algorithm
1252 uses InfiniBand virtual lanes (SL) to provide deadlock-freedom.
1253
1254 The DFSSSP algorithm consists of five major steps:
1255 1) It discovers the subnet and models the subnet as a directed multi‐
1256 graph in which each node represents a node of the physical network and
1257 each edge represents one direction of the full-duplex links used to
1258 connect the nodes.
2) A loop, which iterates over all CAs and switches of the subnet, will
1260 perform three steps to generate the linear forwarding tables for each
1261 switch:
1262 2.1) use Dijkstra's algorithm to find the shortest path from all nodes
1263 to the current selected destination;
1264 2.2) update the edge weights in the graph, i.e. add the number of
1265 routes, which use a link to reach the destination, to the link/edge;
1266 2.3) update the LFT of each switch with the outgoing port which was
1267 used in the current step to route the traffic to the destination node.
1268 3) After the number of available virtual lanes or layers in the subnet
1269 is detected and a channel dependency graph is initialized for each
1270 layer, the algorithm will put each possible route of the subnet into
1271 the first layer.
1272 4) A loop iterates over all channel dependency graphs (CDG) and per‐
1273 forms the following substeps:
1274 4.1) search for a cycle in the current CDG;
1275 4.2) when a cycle is found, i.e. a possible deadlock is present, one
1276 edge is selected and all routes, which induced this edge, are moved to
1277 the "next higher" virtual layer (CDG[i+1]);
1278 4.3) the cycle search is continued until all cycles are broken and
1279 routes are moved "up".
5) When the number of layers needed to remove all cycles in all CDGs
does not exceed the number of available SL/VL, the routing is
deadlock-free and a relation table is generated, which contains the
assignment of routes from source to destination to an SL.
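
Steps 2.1 to 2.3 can be illustrated with a compact Python sketch. It is not the OpenSM code: the graph representation, the initial edge weight of 1, and the choice to count every node as a traffic source are assumptions, and the deadlock-removal steps 3) to 5) are left out.

```python
import heapq


def sssp_routes(links, destinations):
    """Return {dest: {node: next_hop}} with globally balanced routes.

    links maps each node to its neighbors (links are full duplex, so the
    mapping is symmetric). Each direction of a link has its own weight.
    """
    weight = {(a, b): 1 for a in links for b in links[a]}
    tables = {}
    for dest in destinations:
        # 2.1) Dijkstra from the destination back towards all sources.
        dist = {dest: 0}
        next_hop = {}
        heap = [(0, dest)]
        while heap:
            d, node = heapq.heappop(heap)
            if d > dist[node]:
                continue  # stale heap entry
            for nbr in links[node]:
                # Traffic flows nbr -> node, so use that direction.
                nd = d + weight[(nbr, node)]
                if nd < dist.get(nbr, float("inf")):
                    dist[nbr] = nd
                    next_hop[nbr] = node
                    heapq.heappush(heap, (nd, nbr))
        # 2.2) Add one unit per route to every link the route uses.
        for src in next_hop:
            node = src
            while node != dest:
                weight[(node, next_hop[node])] += 1
                node = next_hop[node]
        # 2.3) Record the outgoing choice (the LFT entry) per node.
        tables[dest] = next_hop
    return tables


# Hypothetical fabric: A reaches B1 and B2 through either X or Y.
fabric = {
    "A": ["X", "Y"],
    "X": ["A", "B1", "B2"],
    "Y": ["A", "B1", "B2"],
    "B1": ["X", "Y"],
    "B2": ["X", "Y"],
}
tables = sssp_routes(fabric, ["B1", "B2"])
```

Because the route from A to B1 raises the weight of the A-X link, the later route from A to B2 is steered through Y, which is the balancing effect the weight update is meant to achieve.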
1284
1285 Note on SSSP:
1286 This algorithm does not perform the steps 3)-5) and can not be consid‐
1287 ered to be deadlock-free for all topologies. But on the one hand, you
1288 can choose this algorithm for really large networks (5,000+ CAs and
1289 deadlock-free by design) to reduce the runtime of the algorithm. On the
1290 other hand, you might use the SSSP routing algorithm as an alternative,
1291 when all deadlock-free routing algorithms fail to route the network for
1292 whatever reason. In the last case, SSSP was designed to deliver an
1293 equal or higher bandwidth due to better congestion avoidance than the
1294 Min Hop routing algorithm.
1295
1296 Notes for usage:
1297 a) running DFSSSP: '-R dfsssp -Q'
1298 a.1) QoS has to be configured to equally spread the load on the avail‐
1299 able SL or virtual lanes
1300 a.2) applications must perform a path record query to get path SL for
each route, which the application will use to transmit packets
1302 b) running SSSP: '-R sssp'
1303 c) both algorithms support LMC > 0
1304
1305 Hints for optimizing I/O traffic:
1306 Having more nodes (I/O and compute) connected to a switch than incoming
1307 links can result in a 'bad' routing of the I/O traffic as long as
1308 (DF)SSSP routing is not aware of the dedicated I/O nodes, i.e., in the
1309 following network configuration CN1-CN3 might send all I/O traffic via
1310 Link2 to IO1,IO2:
1311
1312 CN1 Link1 IO1
1313 \ /----\ /
1314 CN2 -- Switch1 Switch2 -- CN4
1315 / \----/ \
1316 CN3 Link2 IO2
1317
1318 To prevent this from happening (DF)SSSP can use both the compute node
1319 guid file and the I/O guid file specified by the ´-u´ or
1320 ´--cn_guid_file´ and ´-G´ or ´--io_guid_file´ options (similar to the
1321 Fat-Tree routing). This ensures that traffic towards compute nodes and
1322 I/O nodes is balanced separately and therefore distributed as much as
1323 possible across the available links. Port GUIDs, as listed by ibstat,
1324 must be specified (not Node GUIDs).
1325 The priority for the optimization is as follows:
1326 compute nodes -> I/O nodes -> other nodes
1327 Possible use case scenarios:
a) neither ´-u´ nor ´-G´ is specified: all nodes are treated as ´other
1329 nodes´ and therefore balanced equally;
1330 b) ´-G´ is specified: traffic towards I/O nodes will be balanced opti‐
1331 mally;
1332 c) the system has three node types, such as login/admin, compute and
1333 I/O, but the balancing focus should be I/O, then one has to use ´-u´
1334 and ´-G´ with I/O guids listed in cn_guid_file and compute node guids
1335 listed in io_guid_file;
1336 d) ...
1337
1338 Torus-2QoS Routing Algorithm
1339
Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus
1341 fabrics; see torus-2QoS(8) for full documentation.
1342
1343 Use '-R torus-2QoS -Q' or '-R torus-2QoS,no_fallback -Q' to activate
1344 the torus-2QoS algorithm.
1345
1346 Nue Routing Algorithm
1347
1348 Use either `-R nue' or `-R nue -Q --nue_max_num_vls <int>' to activate
1349 Nue.
1350
1351 Note: if `--nue_max_num_vls' is specified and unequal to 1, then QoS
1352 support must be turned on, so that SL2VL mappings are valid and appli‐
1353 cations comply with suggested SLs to avoid credit-loops. For more
1354 details on QoS and Nue see below.
1355
1356 The implementation of Nue routing for OpenSM is a 100%-applicable, bal‐
1357 anced, and deadlock-free unicast routing engine (which also configures
1358 multicast tables, see 'Note on multicast' below). The key points of
1359 this algorithm are the following:
- 100% fault-tolerant, oblivious routing strategy
- topology-agnostic, i.e., applicable to every topology (no matter if
  topology is regular, irregular after faults, or random)
- 100% deadlock-free routing within the resource limits (i.e., it never
  exceeds the given number of available virtual lanes, and it does not
  necessarily require virtual lanes) for every topology
- very good path balancing and therefore high throughput (even better
  when using METIS, see notes below)
- QoS (via SLs/VLs) + deadlock-freedom can be combined (since both rely
  on VLs), e.g., using VL0-3 for Nue's deadlock-freedom (and 1. QoS
  level) and VL4-7 as second QoS level
- forwarding tables are fast to calculate: O(n^2 * log n), however
  slightly slower compared to topology-aware routings (for obvious
  reasons), and
- the path-to-VL mapping only depends on the destination, which may be
  useful for scalable, efficient path resolution and caching mechanisms.
From a very high-level perspective, Nue routing is similar to DFSSSP
(see above) in the sense that both use Dijkstra and edge weight updates
for path balancing, and paths are mapped to virtual layers assuming a
1:1 mapping of SL2VL tables. However, the fundamental difference is
that Nue routing doesn't perform the path calculation on the graph
representing the real fabric, and instead routes directly within the
channel dependency graph. This approach allows Nue routing to place
routing restrictions (to avoid any credit-loops) in an on-demand
manner, which overcomes the problem of all other good VL-based
algorithms: the competitors cannot control or limit the use of VLs,
and might run out of them and have to give up. On the flip side, Nue
may have to use detours for a few routes, and hence cannot really be
considered "shortest-path" routing, because it is impossible to
accomplish deadlock-free, shortest-path routing with a limited number
of available virtual lanes for arbitrary network topologies.
1399
1400 Note on the use of METIS library with Nue:
Nue routing may have to separate the LIDs into multiple subsets, one
for every virtual layer, if multiple layers are used. Nue has two
options to perform this partitioning (not to be confused with IB
partitions): the first is a fairly simple semi-random assignment of
LIDs to layers/subsets, and the second uses the METIS library to
partition the network graph into k approximately equal-sized parts. The
latter approach has shown better results in terms of path balancing and
avoidance of using fallback paths, and hence it is HIGHLY advised to
install/use the METIS library with OpenSM (enforced via the
`--enable-metis' configure flag when building OpenSM). In the rare case
that METIS isn't packaged with the Linux distro, here is a link to the
official website to download and install METIS 5.1.0 manually:
http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
OpenSM's configure script also provides options in case the METIS
header and library aren't found in the default path.
1416
1417 Runtime options for Nue:
1418 The behavior of Nue routing can be directly influenced by the osm.conf
1419 parameter (which is also available as command line option):
- nue_max_num_vls: controls/limits the number of virtual lanes/layers
  which Nue is allowed to use (detailed explanation in osm.conf file).
Furthermore, Nue supports TRUE and FALSE settings of
avoid_throttled_links, use_ucast_cache, and qos (more on this
hereafter); and lmc > 0.
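
As a hedged illustration of these runtime options, the following shell
sketch prints an options-file fragment for Nue. The helper name
nue_conf_fragment is hypothetical, and routing_engine is assumed here to
be the options-file counterpart of `-R'; verify the option names against
the file created by `opensm -c <file_name>' before relying on them.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenSM): print an options-file
# fragment for Nue. "routing_engine" is an assumed option name; the
# names nue_max_num_vls and qos are taken from the text above.
nue_conf_fragment() {
    vls=$1
    qos=FALSE
    if [ "$vls" -ne 1 ]; then
        # More than one virtual layer requires QoS to be turned on.
        qos=TRUE
    fi
    cat <<EOF
routing_engine nue
nue_max_num_vls $vls
qos $qos
EOF
}

nue_conf_fragment 4
```

The printed lines are meant to be reviewed and merged into the options
file by hand; the sketch does not modify any existing configuration.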
1426
1427 Notes on Quality of Service (QoS):
1428 The advantage of Nue is that it works with AND without QoS being
1429 enabled, i.e., the usage of SLs/VLs for deadlock-freedom can be
1430 avoided. Here are the three possible usage scenarios:
- neither setting `--nue_max_num_vls <int>' nor `-Q': Nue assumes
  that only 1 virtual layer (identical to the physical network; or
  OperVLs equal to VL0) is usable and all paths are to be calculated
  within this one layer. Hence, there is no need for special SL2VL
  mappings in the network or for the use of specific SLs by
  applications.
- setting `-Q' but not `--nue_max_num_vls <int>': This combination
  works like the previous one, meaning the SL returned for path
  record requests is not defined by Nue, since all paths are
  deadlock-free without using VLs. However, any separate QoS
  settings may influence the SL returned to applications.
- setting `-Q --nue_max_num_vls <int>' with int != 1: In this
  configuration, applications have to query and obey the SL for path
  records as returned by Nue because otherwise deadlock-freedom
  cannot be guaranteed anymore. Furthermore, errors in the fabric
  may require applications to repath to avoid message deadlocks.
  Since Nue operates on virtual layers, admins should configure the
  SL2VL mapping tables in a homogeneous 1:1 manner across the entire
  subnet to separate the layers.
As an additional note, using more VLs for Nue usually improves the
overall network throughput, so there are trade-offs admins may have to
consider when configuring the subnet manager with Nue routing.
1465
1466 Note on multicast:
The Nue routing engine configures multicast forwarding tables by
utilizing a spanning tree calculation rooted at a subnet switch
suggested by OpenSM. This spanning tree for a mcast group will try to
use the least overloaded links (w.r.t. the ucast paths-per-link
metric/weight) in the fabric. However, Nue routing currently does not
guarantee deadlock-freedom for the set of multicast routes on all
topologies, nor for the combination of deadlock-free unicast routes
with additional multicast routes. Assuming that, for a given topology,
the calculated mcast routes are deadlock-free, an admin may fix the
latter problem by separating the VLs, e.g., using VL0-6 for unicast
routing by specifying `--nue_max_num_vls 7' and utilizing VL7 for
multicast.
1478
1479
1480 Routing References
1481
1482 To learn more about deadlock-free routing, see the article "Deadlock
1483 Free Message Routing in Multiprocessor Interconnection Networks" by
1484 William J Dally and Charles L Seitz (1985).
1485
1486 To learn more about the up/down algorithm, see the article "Effective
1487 Strategy to Compute Forwarding Tables for InfiniBand Networks" by Jose
1488 Carlos Sancho, Antonio Robles, and Jose Duato at the Universidad
1489 Politecnica de Valencia.
1490
To learn more about LASH and the flexibility behind it, the requirement
for layers, and performance comparisons to other algorithms, see the
following articles:

"Layered Routing in Irregular Networks", Lysne et al., IEEE
Transactions on Parallel and Distributed Systems, Vol. 16, No. 12,
December 2005.

"Routing for the ASI Fabric Manager", Solheim et al., IEEE
Communications Magazine, Vol. 44, No. 7, July 2006.

"Layered Shortest Path (LASH) Routing in Irregular System Area
Networks", Skeie et al., IEEE Computer Society Communication
Architecture for Clusters 2002.
1504
To learn more about the DFSSSP and SSSP routing algorithms, see the
articles:
1507 J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for
1508 Arbitrary Topologies, In Proceedings of the 25th IEEE International
1509 Parallel & Distributed Processing Symposium (IPDPS 2011)
1510 T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-
1511 Scale InfiniBand Networks, In 17th Annual IEEE Symposium on High Per‐
1512 formance Interconnects (HOTI 2009)
1513
1514 To learn more about the Nue routing algorithm, see the article "Routing
1515 on the Dependency Graph: A New Approach to Deadlock-Free High-Perfor‐
1516 mance Routing" by J. Domke, T. Hoefler and S. Matsuoka (published in
1517 HPDC'16).
1518
Modular Routing Engine
1520
The modular routing engine structure makes it easy to plug in new
routing modules.
1523
1524 Currently, only unicast callbacks are supported. Multicast can be added
1525 later.
1526
One existing routing module is up-down "updn", which may be activated
with the '-R updn' option (instead of the old '-u').
1529
1530 General usage is: $ opensm -R 'module-name'
1531
1532 There is also a trivial routing module which is able to load LFT tables
1533 from a file.
1534
1535 Main features:
1536
1537 - this will load switch LFTs and/or LID matrices (min hops tables)
1538 - this will load switch LFTs according to the path entries introduced
1539 in the file
1540 - no additional checks will be performed (such as "is port connected",
1541 etc.)
- if fabric LIDs have changed, this will try to reconstruct LFTs
  correctly, provided endport GUIDs are represented in the file
  (in order to disable this, GUIDs may be removed from the file
  or zeroed)
1546
The file format is compatible with the output of the 'ibroute' utility;
for the whole fabric, it can be generated with the dump_lfts.sh script.

To activate the file-based routing module, use:
1551
1552 opensm -R file -U /path/to/lfts_file
1553
1554 If the lfts_file is not found or is in error, the default routing algo‐
1555 rithm is utilized.
1556
1557 The ability to dump switch lid matrices (aka min hops tables) to file
1558 and later to load these is also supported.
1559
The usage is similar to loading unicast forwarding tables from an lfts
file (introduced by the 'file' routing engine), but the lid matrix file
name should be specified with the -M or --lid_matrix_file option. For
example:
1564
1565 opensm -R file -M ./opensm-lid-matrix.dump
1566
The dump file is named ´opensm-lid-matrix.dump´ and will be generated
in the standard opensm dump directory (/var/log by default) when the
OSM_LOG_ROUTING logging flag is set.

When the routing engine 'file' is activated, but the lfts file is not
specified or cannot be opened, the default lid matrix algorithm will be
used.
1573
There is also a switch forwarding tables dumper which generates a file
compatible with dump_lfts.sh output. This file can be used as input for
forwarding tables loading by the 'file' routing engine. Either or both
of the options -U and -M can be specified together with ´-R file´.
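
Because a missing lfts file silently leads to the default routing
algorithm, a pre-flight check may be useful. The following shell sketch
builds the 'file' routing arguments and includes -U and -M only for
files that are readable. The helper name file_routing_args is
hypothetical and not part of OpenSM, and the sketch assumes paths
without whitespace.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (not part of OpenSM): print the
# arguments for the 'file' routing engine, adding -U / -M only when
# the corresponding dump file exists and is readable.
file_routing_args() {
    lfts=$1
    matrix=$2
    args="-R file"
    if [ -n "$lfts" ] && [ -r "$lfts" ]; then
        args="$args -U $lfts"
    else
        # Without a usable lfts file, opensm uses its default routing.
        echo "warning: lfts file not readable: '$lfts'" >&2
    fi
    if [ -n "$matrix" ] && [ -r "$matrix" ]; then
        args="$args -M $matrix"
    fi
    printf '%s\n' "$args"
}

# Example call with a path that normally does not exist: the warning
# goes to stderr and only "-R file" is printed.
file_routing_args /nonexistent/lfts_file ""
```

A possible use would be
`opensm $(file_routing_args /path/to/lfts_file ./opensm-lid-matrix.dump)'.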
1578
1579
1581 To enable per module logging, configure per_module_logging_file to the
1582 per module logging config file name in the opensm options file. To dis‐
1583 able, configure per_module_logging_file to (null) there.
1584
1585 The per module logging config file format is a set of lines with module
1586 name and logging level as follows:
1587
1588 <module name><separator><logging level>
1589
1590 <module name> is the file name including .c
1591 <separator> is either = , space, or tab
1592 <logging level> is the same levels as used in the coarse/overall
1593 logging as follows:
1594
1595 BIT LOG LEVEL ENABLED
1596 ---- -----------------
1597 0x01 - ERROR (error messages)
1598 0x02 - INFO (basic messages, low volume)
1599 0x04 - VERBOSE (interesting stuff, moderate volume)
1600 0x08 - DEBUG (diagnostic, high volume)
1601 0x10 - FUNCS (function entry/exit, very high volume)
1602 0x20 - FRAMES (dumps all SMP and GMP frames)
1603 0x40 - ROUTING (dump FDB routing information)
1604 0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM logging)
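
Since the logging level is a bit mask, several levels can be combined
by OR-ing their values. The shell sketch below shows the arithmetic and
formats one config line. The helper names log_level and module_log_line
are hypothetical, the module name osm_ucast_mgr.c is only an example,
and the sketch assumes that the level may be written in hexadecimal as
in the table above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of OpenSM).

# OR together the given bit values and print the mask in hex.
log_level() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | $bit ))
    done
    printf '0x%02X\n' "$mask"
}

# Print one line in the <module name><separator><logging level> format,
# using '=' as the separator.
module_log_line() {
    printf '%s=%s\n' "$1" "$2"
}

# ERROR (0x01) + INFO (0x02) + ROUTING (0x40)
level=$(log_level 0x01 0x02 0x40)
module_log_line osm_ucast_mgr.c "$level"
```

The resulting line would go into the per module logging config file
named by per_module_logging_file.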
1605
1606
1608 /etc/rdma/opensm.conf
1609 default OpenSM config file.
1610
1611
1612 /etc/rdma/ib-node-name-map
1613 default node name map file. See ibnetdiscover for more informa‐
1614 tion on format.
1615
1616
1617 /etc/rdma/partitions.conf
1618 default partition config file
1619
1620
1621 /etc/rdma/qos-policy.conf
1622 default QOS policy config file
1623
1624
1625 /etc/rdma/prefix-routes.conf
1626 default prefix routes file
1627
1628
1629 /etc/rdma/per-module-logging.conf
1630 default per module logging config file
1631
1632
1633 /etc/rdma/torus-2QoS.conf
1634 default torus-2QoS config file
1635
1636
1638 Hal Rosenstock
1639 <hal@mellanox.com>
1640
1641 Sasha Khapyorsky
1642 <sashak@voltaire.com>
1643
1644 Eitan Zahavi
1645 <eitan@mellanox.co.il>
1646
1647 Yevgeny Kliteynik
1648 <kliteyn@mellanox.co.il>
1649
1650 Thomas Sodring
1651 <tsodring@simula.no>
1652
1653 Ira Weiny
1654 <weiny2@llnl.gov>
1655
1656 Dale Purdy
1657 <purdy@sgi.com>
1658
1659
1661 torus-2QoS(8), torus-2QoS.conf(5).
1662
1663
1664
1665OpenIB Sept 15, 2014 OPENSM(8)