OPENSM(8)                    OpenIB Management                    OPENSM(8)

NAME
opensm - InfiniBand subnet manager and administration (SM/SA)

SYNOPSIS
opensm [--version] [-F | --config <file_name>] [-c(reate-config)
<file_name>] [-g(uid) <GUID in hex>] [-l(mc) <LMC>] [-p(riority)
<PRIORITY>] [--subnet_prefix <PREFIX in hex>] [--smkey <SM_Key>]
[--sm_sl <SL number>] [-r(eassign_lids)] [-R <engine name(s)> |
--routing_engine <engine name(s)>] [--do_mesh_analysis]
[--lash_start_vl <vl number>] [-A | --ucast_cache] [-z |
--connect_roots] [-M <file name> | --lid_matrix_file <file name>]
[-U <file name> | --lfts_file <file name>] [-S | --sadb_file <file
name>] [-a | --root_guid_file <path to file>] [-u | --cn_guid_file
<path to file>] [-G | --io_guid_file <path to file>]
[--port-shifting] [--scatter-ports] [-H | --max_reverse_hops <max
reverse hops allowed>] [-X | --guid_routing_order_file <path to
file>] [-m | --ids_guid_file <path to file>] [-o(nce)] [-s(weep)
<interval>] [-t(imeout) <milliseconds>] [--retries <number>]
[--maxsmps <number>] [--console [off | local | socket | loopback]]
[--console-port <port>] [-i(gnore-guids)
<equalize-ignore-guids-file>] [-w | --hop_weights_file <path to
file>] [-O | --port_search_ordering_file <path to file>] [-O |
--dimn_ports_file <path to file>] (DEPRECATED) [-f <log file path> |
--log_file <log file path>] [-L | --log_limit <size in MB>]
[-e(rase_log_file)] [-P(config) <partition config file>] [-N |
--no_part_enforce] (DEPRECATED) [-Z | --part_enforce [both | in | out
| off]] [-W | --allow_both_pkeys] [-Q | --qos [-Y | --qos_policy_file
<file name>]] [--congestion_control] [--cc_key <key>] [-y |
--stay_on_fatal] [-B | --daemon] [-J | --pidfile <file_name>] [-I |
--inactive] [--perfmgr] [--perfmgr_sweep_time_s <seconds>]
[--prefix_routes_file <path>] [--consolidate_ipv6_snm_req]
[--log_prefix <prefix text>] [--torus_config <path to file>]
[-v(erbose)] [-V] [-D <flags>] [-d(ebug) <number>] [-h(elp)] [-?]

DESCRIPTION
opensm is an InfiniBand compliant Subnet Manager and Administration,
and runs on top of OpenIB.

opensm provides an implementation of an InfiniBand Subnet Manager and
Administration. Such a software entity must run in order to initialize
the InfiniBand hardware (at least one per InfiniBand subnet).

opensm also contains an experimental version of a performance manager.
50
51 opensm defaults were designed to meet the common case usage on clusters
52 with up to a few hundred nodes. Thus, in this default mode, opensm will
53 scan the IB fabric, initialize it, and sweep occasionally for changes.
54
55 opensm attaches to a specific IB port on the local machine and config‐
56 ures only the fabric connected to it. (If the local machine has other
57 IB ports, opensm will ignore the fabrics connected to those other
58 ports). If no port is specified, it will select the first "best" avail‐
59 able port.
60
61 opensm can present the available ports and prompt for a port number to
62 attach to.
63
64 By default, the run is logged to two files: /var/log/messages and
65 /var/log/opensm.log. The first file will register only general major
66 events, whereas the second will include details of reported errors. All
67 errors reported in this second file should be treated as indicators of
68 IB fabric health issues. (Note that when a fatal and non-recoverable
69 error occurs, opensm will exit.) Both log files should include the
message "SUBNET UP" if opensm was able to set up the subnet correctly.
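The "SUBNET UP" check described above can be scripted. The following is a minimal sketch, assuming the default log path named in this page; the function names are hypothetical and are not part of OpenSM.

```python
from pathlib import Path

# Marker string that this page says both log files should contain.
SUBNET_UP_MARKER = "SUBNET UP"


def subnet_came_up(log_text: str) -> bool:
    """Return True if the given log text contains the "SUBNET UP" message."""
    return SUBNET_UP_MARKER in log_text


def log_reports_subnet_up(path: str = "/var/log/opensm.log") -> bool:
    """Read a log file (default path taken from this page) and check it."""
    return subnet_came_up(Path(path).read_text(errors="replace"))
```

The check only looks for the marker; it does not inspect the error entries that the second log file may also contain.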
71
72
OPTIONS
74 --version
75 Prints OpenSM version and exits.
76
-F, --config <config file>
The name of the OpenSM config file. When not specified,
/etc/rdma/opensm.conf will be used (if it exists).

-c, --create-config <file name>
OpenSM will dump its configuration to the specified file and
exit. This is a way to generate an OpenSM configuration file
template.
85
86 -g, --guid <GUID in hex>
87 This option specifies the local port GUID value with which
88 OpenSM should bind. OpenSM may be bound to 1 port at a time.
89 If GUID given is 0, OpenSM displays a list of possible port
90 GUIDs and waits for user input. Without -g, OpenSM tries to use
91 the default port.
92
93 -l, --lmc <LMC value>
94 This option specifies the subnet's LMC value. The number of
95 LIDs assigned to each port is 2^LMC. The LMC value must be in
96 the range 0-7. LMC values > 0 allow multiple paths between
97 ports. LMC values > 0 should only be used if the subnet topol‐
98 ogy actually provides multiple paths between ports, i.e. multi‐
99 ple interconnects between switches. Without -l, OpenSM defaults
100 to LMC = 0, which allows one path between any two ports.
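The relation between LMC and the number of LIDs per port can be checked with a few lines of arithmetic. This is a small sketch; the function name is hypothetical.

```python
def lids_per_port(lmc: int) -> int:
    """Number of LIDs assigned to each port for a given LMC (2^LMC)."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 2 ** lmc
```

For instance, the default LMC of 0 gives one LID per port, and the maximum LMC of 7 gives 128.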
101
102 -p, --priority <Priority value>
This option specifies the SM's PRIORITY. This will affect the
handover cases, where master is chosen by priority and GUID.
Range goes from 0 (default and lowest priority) to 15 (highest).
106
107 --subnet_prefix <PREFIX in hex>
108 This option specifies the subnet prefix to use on the fabric.
109 The default prefix is 0xfe80000000000000. OpenMPI in particular
110 requires that separate fabrics plugged into different ports on a
111 machine must have different subnet prefixes in order to identify
112 that it is not two ports plugged into a single fabric.
113
114 --smkey <SM_Key value>
This option specifies the SM's SM_Key (64 bits). This will
affect SM authentication. Note that OpenSM version 3.2.1 and
below used the default value '1' in host byte order. This is
fixed now, but you may need this option to interoperate with an
old OpenSM running on a little endian machine.
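To see why the byte order of the old default matters, the sketch below reinterprets a 64-bit value stored in host byte order as a network (big endian) value. It only illustrates the arithmetic; which value to pass to --smkey for a given old installation is not stated on this page, and the function name is hypothetical.

```python
import struct


def host_order_as_network_order(value: int, little_endian_host: bool = True) -> int:
    """Reinterpret a 64-bit value stored in host byte order as big endian."""
    fmt = "<Q" if little_endian_host else ">Q"
    raw = struct.pack(fmt, value)          # bytes as the host would store them
    return struct.unpack(">Q", raw)[0]     # the same bytes read in network order
```

On a little endian host the value 1 reads as 0x0100000000000000 in network order, while on a big endian host it stays 1.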
120
121 --sm_sl <SL number>
122 This option sets the SL to use for communication with the SM/SA.
123 Defaults to 0.
124
125 -r, --reassign_lids
126 This option causes OpenSM to reassign LIDs to all end nodes.
127 Specifying -r on a running subnet may disrupt subnet traffic.
Without -r, OpenSM attempts to preserve existing LID assignments,
resolving any multiple use of the same LID.
130
131 -R, --routing_engine <Routing engine names>
132 This option chooses routing engine(s) to use instead of Min Hop
133 algorithm (default). Multiple routing engines can be specified
134 separated by commas so that specific ordering of routing algo‐
135 rithms will be tried if earlier routing engines fail. If all
136 configured routing engines fail, OpenSM will always attempt to
137 route with Min Hop unless 'no_fallback' is included in the list
138 of routing engines. Supported engines: minhop, updn, dnup,
139 file, ftree, lash, dor, torus-2QoS, dfsssp, sssp.
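The fallback order described for -R can be modelled in a few lines. This is a sketch of the documented behaviour only, not OpenSM's implementation; try_engine is a hypothetical callback that reports whether a named engine managed to route the fabric.

```python
from typing import Callable, Optional


def pick_routing_engine(configured: str,
                        try_engine: Callable[[str], bool]) -> Optional[str]:
    """Return the name of the first engine that succeeds, or None."""
    names = [n.strip() for n in configured.split(",") if n.strip()]
    allow_fallback = "no_fallback" not in names
    for name in names:
        if name == "no_fallback":
            continue  # a marker in the list, not an engine to try
        if try_engine(name):
            return name
    # All configured engines failed: fall back to Min Hop unless disabled.
    if allow_fallback and try_engine("minhop"):
        return "minhop"
    return None
```

With "ftree,updn" the sketch tries ftree, then updn, then Min Hop; with "ftree,no_fallback" it gives up after ftree fails.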
140
141 --do_mesh_analysis
142 This option enables additional analysis for the lash routing
143 engine to precondition switch port assignments in regular carte‐
144 sian meshes which may reduce the number of SLs required to give
145 a deadlock free routing.
146
147 --lash_start_vl <vl number>
148 This option sets the starting VL to use for the lash routing
149 algorithm. Defaults to 0.
150
151 -A, --ucast_cache
152 This option enables unicast routing cache and prevents routing
153 recalculation (which is a heavy task in a large cluster) when
154 there was no topology change detected during the heavy sweep, or
when the topology change does not require new routing calculation,
e.g. when one or more CAs/RTRs/leaf switches go down, or one or
more of these nodes come back after being down. A
158 very common case that is handled by the unicast routing cache is
159 host reboot, which otherwise would cause two full routing recal‐
160 culations: one when the host goes down, and the other when the
161 host comes back online.
162
163 -z, --connect_roots
This option forces the routing engines (up/down and fat-tree) to
make connectivity between root switches and in this way to be
fully IBA compliant. In many cases this can violate the "pure"
deadlock free algorithm, so use it carefully.
168
169 -M, --lid_matrix_file <file name>
This option specifies the name of the lid matrix dump file from
where switch lid matrices (min hops tables) will be loaded.
172
173 -U, --lfts_file <file name>
174 This option specifies the name of the LFTs file from where
175 switch forwarding tables will be loaded when using "file" rout‐
176 ing engine.
177
178 -S, --sadb_file <file name>
179 This option specifies the name of the SA DB dump file from where
180 SA database will be loaded.
181
182 -a, --root_guid_file <file name>
183 Set the root nodes for the Up/Down or Fat-Tree routing algorithm
184 to the guids provided in the given file (one to a line).
185
186 -u, --cn_guid_file <file name>
187 Set the compute nodes for the Fat-Tree or DFSSSP/SSSP routing
188 algorithms to the port GUIDs provided in the given file (one to
189 a line).
190
191 -G, --io_guid_file <file name>
192 Set the I/O nodes for the Fat-Tree or DFSSSP/SSSP routing algo‐
193 rithms to the port GUIDs provided in the given file (one to a
194 line).
195 In the case of Fat-Tree routing:
196 I/O nodes are non-CN nodes allowed to use up to max_reverse_hops
197 switches the wrong way around to improve connectivity.
198 In the case of (DF)SSSP routing:
199 Providing guids of compute and/or I/O nodes will ensure that
200 paths towards those nodes are as much separated as possible
201 within their node category, i.e., I/O traffic will not share the
202 same link if multiple links are available.
203
204 --port-shifting
205 This option enables a feature called port shifting. In some
206 fabrics, particularly cluster environments, routes commonly
207 align and congest with other routes due to algorithmically
208 unchanging traffic patterns. This routing option will "shift"
209 routing around in an attempt to alleviate this problem.
210
211 --scatter-ports
This option will randomize port selection in routing.
213
214 -H, --max_reverse_hops <max reverse hops allowed>
215 Set the maximum number of reverse hops an I/O node is allowed to
216 make. A reverse hop is the use of a switch the wrong way around.
217
218 -m, --ids_guid_file <file name>
219 Name of the map file with set of the IDs which will be used by
220 Up/Down routing algorithm instead of node GUIDs (format: <guid>
221 <id> per line).
222
223 -X, --guid_routing_order_file <file name>
224 Set the order port guids will be routed for the MinHop and
225 Up/Down routing algorithms to the guids provided in the given
226 file (one to a line).
227
228 -o, --once
229 This option causes OpenSM to configure the subnet once, then
230 exit. Ports remain in the ACTIVE state.
231
232 -s, --sweep <interval value>
233 This option specifies the number of seconds between subnet
234 sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM
235 defaults to a sweep interval of 10 seconds.
236
237 -t, --timeout <value>
238 This option specifies the time in milliseconds used for transac‐
239 tion timeouts. Timeout values should be > 0. Without -t,
240 OpenSM defaults to a timeout value of 200 milliseconds.
241
242 --retries <number>
243 This option specifies the number of retries used for transac‐
244 tions. Without --retries, OpenSM defaults to 3 retries for
245 transactions.
246
247 --maxsmps <number>
248 This option specifies the number of VL15 SMP MADs allowed on the
249 wire at any one time. Specifying --maxsmps 0 allows unlimited
250 outstanding SMPs. Without --maxsmps, OpenSM defaults to a maxi‐
251 mum of 4 outstanding SMPs.
252
253 --console [off | local | loopback | socket]
254 This option brings up the OpenSM console (default off). Note,
255 loopback and socket open a socket which can be connected to
256 WITHOUT CREDENTIALS. Loopback is safer if access to your SM
257 host is controlled. tcp_wrappers (hosts.[allow|deny]) is used
258 with loopback and socket. loopback and socket will only be
259 available if OpenSM was built with --enable-console-loopback
260 (default yes) and --enable-console-socket (default no) respec‐
261 tively.
262
263 --console-port <port>
264 Specify an alternate telnet port for the socket console (default
265 10000). Note that this option only appears if OpenSM was built
266 with --enable-console-socket.
267
268 -i, --ignore-guids <equalize-ignore-guids-file>
269 This option provides the means to define a set of ports (by node
270 guid and port number) that will be ignored by the link load
271 equalization algorithm.
272
273 -w, --hop_weights_file <path to file>
274 This option provides weighting factors per port representing a
275 hop cost in computing the lid matrix. The file consists of
276 lines containing a switch port GUID (specified as a 64 bit hex
277 number, with leading 0x), output port number, and weighting fac‐
278 tor. Any port not listed in the file defaults to a weighting
279 factor of 1. Lines starting with # are comments. Weights
280 affect only the output route from the port, so many useful con‐
281 figurations will require weights to be specified in pairs.
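The hop weights file format described above can be read with a short parser. This is a sketch under stated assumptions: fields are separated by white space, the port number and weight are decimal integers, and the function names are hypothetical.

```python
from typing import Dict, Tuple

Weights = Dict[Tuple[int, int], int]


def parse_hop_weights(text: str) -> Weights:
    """Map (switch port GUID, output port number) to a weighting factor."""
    weights: Weights = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        guid_str, port_str, weight_str = line.split()
        weights[(int(guid_str, 16), int(port_str))] = int(weight_str)
    return weights


def hop_weight(weights: Weights, guid: int, port: int) -> int:
    """Ports not listed in the file default to a weighting factor of 1."""
    return weights.get((guid, port), 1)
```

Because a weight affects only the output route from the listed port, a symmetric configuration needs one line for each end of a link.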
282
283 -O, --port_search_ordering_file <path to file>
This option tweaks the routing. It is suitable for two cases:

1. While using the DOR routing algorithm. This option provides a
mapping between hypercube dimensions and ports on a per switch
basis for the DOR routing engine. The file consists of lines
containing a switch node GUID (specified as a 64 bit hex number,
with leading 0x) followed by a list of non-zero port numbers,
separated by spaces, one switch per line. The order for the
port numbers is in one to one correspondence to the dimensions.
Ports not listed on a line are assigned to the remaining
dimensions, in port order. Anything after a # is a comment.

2. While using a general routing algorithm. This option provides
the order of the ports that would be chosen for routing, from each
switch, rather than searching for an appropriate port from port 1
to N. The file consists of lines containing a switch node GUID
(specified as a 64 bit hex number, with leading 0x) followed by
a list of non-zero port numbers, separated by spaces, one switch
per line. Anything after a # is a comment.
304
305 -O, --dimn_ports_file <path to file> (DEPRECATED)
306 This is a deprecated flag. Please use --port_search_order‐
307 ing_file instead. This option provides a mapping between hyper‐
308 cube dimensions and ports on a per switch basis for the DOR
309 routing engine. The file consists of lines containing a switch
310 node GUID (specified as a 64 bit hex number, with leading 0x)
311 followed by a list of non-zero port numbers, separated by spa‐
312 ces, one switch per line. The order for the port numbers is in
313 one to one correspondence to the dimensions. Ports not listed
314 on a line are assigned to the remaining dimensions, in port
315 order. Anything after a # is a comment.
316
317 -x, --honor_guid2lid
318 This option forces OpenSM to honor the guid2lid file, when it
319 comes out of Standby state, if such file exists under
320 OSM_CACHE_DIR, and is valid. By default, this is FALSE.
321
322 -f, --log_file <file name>
323 This option defines the log to be the given file. By default,
324 the log goes to /var/log/opensm.log. For the log to go to stan‐
325 dard output use -f stdout.
326
327 -L, --log_limit <size in MB>
328 This option defines maximal log file size in MB. When specified
329 the log file will be truncated upon reaching this limit.
330
331 -e, --erase_log_file
This option will cause deletion of the log file (if it already
exists). By default, the log file is accumulative.
334
335 -P, --Pconfig <partition config file>
336 This option defines the optional partition configuration file.
337 The default name is /etc/rdma/partitions.conf.
338
339 --prefix_routes_file <file name>
340 Prefix routes control how the SA responds to path record queries
341 for off-subnet DGIDs. By default, the SA fails such queries.
342 The PREFIX ROUTES section below describes the format of the con‐
343 figuration file. The default path is
344 /etc/rdma/prefix-routes.conf.
345
346 -Q, --qos
347 This option enables QoS setup. It is disabled by default.
348
349 -Y, --qos_policy_file <file name>
350 This option defines the optional QoS policy file. The default
351 name is /etc/rdma/qos-policy.conf. See QoS_manage‐
352 ment_in_OpenSM.txt in opensm doc for more information on config‐
353 uring QoS policy via this file.
354
--congestion_control
(EXPERIMENTAL) This option enables congestion control
configuration. It is disabled by default. See config file for
congestion control configuration options.

--cc_key <key>
(EXPERIMENTAL) This option configures the CCkey to use when
configuring congestion control. Note that this option does not
configure a new CCkey into switches and CAs. Defaults to 0.
362
363 -N, --no_part_enforce (DEPRECATED)
364 This is a deprecated flag. Please use --part_enforce instead.
365 This option disables partition enforcement on switch external
366 ports.
367
368 -Z, --part_enforce [both | in | out | off]
369 This option indicates the partition enforcement type (for
370 switches). Enforcement type can be inbound only (in), outbound
371 only (out), both or disabled (off). Default is both.
372
373 -W, --allow_both_pkeys
374 This option indicates whether both full and limited membership
375 on the same partition can be configured in the PKeyTable.
376 Default is not to allow both pkeys.
377
378 -y, --stay_on_fatal
379 This option will cause SM not to exit on fatal initialization
380 issues: if SM discovers duplicated guids or a 12x link with lane
381 reversal badly configured. By default, the SM will exit on
382 these errors.
383
384 -B, --daemon
385 Run in daemon mode - OpenSM will run in the background.
386
387 -J, --pidfile <file_name>
388 Makes the SM write its own PID to the specified file when
389 started in daemon mode.
390
391 -I, --inactive
392 Start SM in inactive rather than init SM state. This option can
393 be used in conjunction with the perfmgr so as to run a stand‐
394 alone performance manager without SM/SA. However, this is NOT
395 currently implemented in the performance manager.
396
397 --perfmgr
398 Enable the perfmgr. Only takes effect if --enable-perfmgr was
399 specified at configure time. See performance-manager-HOWTO.txt
400 in opensm doc for more information on running perfmgr.
401
402 --perfmgr_sweep_time_s <seconds>
403 Specify the sweep time for the performance manager in seconds
404 (default is 180 seconds). Only takes effect if --enable-perfmgr
405 was specified at configure time.
406
407 --consolidate_ipv6_snm_req
408 Use shared MLID for IPv6 Solicited Node Multicast groups per
409 MGID scope and P_Key.
410
411 --log_prefix <prefix text>
412 This option specifies the prefix to the syslog messages from
413 OpenSM. A suitable prefix can be used to identify the IB subnet
414 in syslog messages when two or more instances of OpenSM run in a
415 single node to manage multiple fabrics. For example, in a dual-
416 fabric (or dual-rail) IB cluster, the prefix for the first fab‐
417 ric could be "mpi" and the other fabric could be "storage".
418
419 --torus_config <path to torus-2QoS config file>
420 This option defines the file name for the extra configuration
421 information needed for the torus-2QoS routing engine. The
422 default name is /etc/rdma/torus-2QoS.conf
423
424 -v, --verbose
425 This option increases the log verbosity level. The -v option
426 may be specified multiple times to further increase the ver‐
427 bosity level. See the -D option for more information about log
428 verbosity.
429
-V
This option sets the maximum verbosity level and forces log
flushing. The -V option is equivalent to '-D 0xFF -d 2'. See
the -D option for more information about log verbosity.
433
434 -D <value>
435 This option sets the log verbosity level. A flags field must
436 follow the -D option. A bit set/clear in the flags enables/dis‐
437 ables a specific log level as follows:
438
BIT    LOG LEVEL ENABLED
----   -----------------
0x01 - ERROR (error messages)
0x02 - INFO (basic messages, low volume)
0x04 - VERBOSE (interesting stuff, moderate volume)
0x08 - DEBUG (diagnostic, high volume)
0x10 - FUNCS (function entry/exit, very high volume)
0x20 - FRAMES (dumps all SMP and GMP frames)
0x40 - ROUTING (dump FDB routing information)
0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM logging)
450
451 Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying
452 -D 0 disables all messages. Specifying -D 0xFF enables all mes‐
453 sages (see -V). High verbosity levels may require increasing
454 the transaction timeout with the -t option.
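The flags value for -D is a bitwise OR of the levels in the table. The sketch below mirrors that table with a Python IntFlag; the class and function names are hypothetical.

```python
from enum import IntFlag


class LogLevel(IntFlag):
    """Log level bits as listed in the -D table."""
    ERROR = 0x01
    INFO = 0x02
    VERBOSE = 0x04
    DEBUG = 0x08
    FUNCS = 0x10
    FRAMES = 0x20
    ROUTING = 0x40
    SYS = 0x80


def d_option_value(levels: LogLevel) -> str:
    """Format a combination of levels as a hex argument for -D."""
    return hex(int(levels))
```

Combining ERROR and INFO reproduces the documented default of 0x3, and setting every bit gives 0xff, the value that -V implies.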
455
456 -d, --debug <value>
457 This option specifies a debug option. These options are not
458 normally needed. The number following -d selects the debug
459 option to enable as follows:
460
461 OPT Description
462 --- -----------------
463 -d0 - Ignore other SM nodes
464 -d1 - Force single threaded dispatching
465 -d2 - Force log flushing after each log message
466 -d3 - Disable multicast support
467
468 -h, --help
469 Display this usage info then exit.
470
471 -? Display this usage info then exit.
472
473
ENVIRONMENT VARIABLES
475 The following environment variables control opensm behavior:
476
477 OSM_TMP_DIR - controls the directory in which the temporary files gen‐
478 erated by opensm are created. These files are: opensm-subnet.lst,
479 opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.
480
481 OSM_CACHE_DIR - opensm stores certain data to the disk such that subse‐
482 quent runs are consistent. The default directory used is
483 /var/cache/opensm. The following files are included in it:
484
485 guid2lid - stores the LID range assigned to each GUID
guid2mkey - stores the MKey previously assigned to each GUID
487 neighbors - stores a map of the GUIDs at either end of each link
488 in the fabric
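A script that needs to locate these files can apply the same defaults. This is a minimal sketch, assuming the two variables and default directories named above; the function names are hypothetical.

```python
import os
import posixpath
from typing import Dict, Mapping, Optional


def opensm_dirs(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Resolve the temporary and cache directories, applying the defaults."""
    env = os.environ if env is None else env
    return {
        "tmp": env.get("OSM_TMP_DIR", "/var/log"),
        "cache": env.get("OSM_CACHE_DIR", "/var/cache/opensm"),
    }


def cache_file(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Path of a cache file such as guid2lid, guid2mkey or neighbors."""
    return posixpath.join(opensm_dirs(env)["cache"], name)
```

Passing an explicit mapping instead of the process environment makes the lookup easy to exercise in isolation.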
489
490
SIGNALING
492 When opensm receives a HUP signal, it starts a new heavy sweep as if a
493 trap was received or a topology change was found.
494
495 Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log
496 for logrotate purposes.
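Both signals can be sent from a small helper. This is a sketch that assumes OpenSM was started in daemon mode with -J and that the PID file holds a single decimal PID; the function names are hypothetical.

```python
import os
import signal


def read_pid(pidfile: str) -> int:
    """Read the PID from a file written via -J (format is an assumption)."""
    with open(pidfile) as f:
        return int(f.read().strip())


def request_heavy_sweep(pid: int) -> None:
    """Send HUP, which starts a new heavy sweep."""
    os.kill(pid, signal.SIGHUP)


def request_log_reopen(pid: int) -> None:
    """Send SIGUSR1, which reopens the log file (for logrotate)."""
    os.kill(pid, signal.SIGUSR1)
```

A logrotate postrotate step could call request_log_reopen(read_pid(path)) with whatever path was given to -J.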
497
498
PARTITION CONFIGURATION
500 The default name of OpenSM partitions configuration file is
501 /etc/rdma/partitions.conf. The default may be changed by using the
502 --Pconfig (-P) option with OpenSM.
503
The default partition will be created by OpenSM unconditionally, even
when the partition configuration file does not exist or cannot be
accessed.

The default partition has P_Key value 0x7fff. OpenSM's port will always
have full membership in the default partition. All other end ports will
have full membership if the partition configuration file is not found
or cannot be accessed, or limited membership if the file exists and can
be accessed but there is no rule for the Default partition.

Effectively, this amounts to the same as if one of the following rules
appears in the partition configuration file.
515
516 In the case of no rule for the Default partition:
517
518 Default=0x7fff : ALL=limited, SELF=full ;
519
520 In the case of no partition configuration file or file cannot be
521 accessed:
522
523 Default=0x7fff : ALL=full ;
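The two cases above reduce to a small decision. The sketch below returns the rule that is effectively applied; it models the documented behaviour only, and the function name is hypothetical.

```python
from typing import Optional


def implied_default_rule(config_accessible: bool,
                         has_default_rule: bool) -> Optional[str]:
    """Return the implied Default partition rule, or None if the file has one."""
    if not config_accessible:
        # No partition configuration file, or it cannot be accessed.
        return "Default=0x7fff : ALL=full ;"
    if not has_default_rule:
        # The file is readable but has no rule for the Default partition.
        return "Default=0x7fff : ALL=limited, SELF=full ;"
    return None  # the file's own Default rule applies
```

In both implied rules OpenSM's own port ends up with full membership, since ALL=full covers it in the first and SELF=full grants it explicitly in the second.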
524
525
526 File Format
527
528 Comments:
529
Content following a '#' character on a line is a comment and is
ignored by the parser.
532
533 General file format:
534
535 <Partition Definition>:[<newline>]<Partition Properties>;
536
537 Partition Definition:
[PartitionName][=PKey][,ipoib_bc_flags][,defmember=full|limited|both]

PartitionName  - string, will be used with logging. When omitted,
                 an empty string will be used.
PKey           - P_Key value for this partition. Only the low 15 bits
                 will be used. When omitted, it will be autogenerated.
ipoib_bc_flags - used to indicate/specify IPoIB capability of this
                 partition.
defmember=full|limited|both
               - specifies default membership for the port guid
                 list. Default is limited.

ipoib_bc_flags:
ipoib_flag|[mgroup_flag]*

ipoib_flag - indicates that this partition may be used for IPoIB; as
             a result, the IPoIB broadcast group will be created with
             the flags given, if any.
561
562 Partition Properties:
563 [<Port list>|<MCast Group>]* | <Port list>
564
565 Port list:
566 <Port Specifier>[,<Port Specifier>]
567
568 Port Specifier:
569 <PortGUID>[=[full|limited|both]]
570
PortGUID            - GUID of partition member EndPort. Hexadecimal
                      numbers should start from 0x, decimal numbers
                      are accepted too.
full, limited, both - indicates full and/or limited membership for
                      this port. When omitted (or unrecognized),
                      limited membership is assumed. Both indicates
                      both full and limited membership for this port.
584
585 MCast Group:
586 mgid=gid[,mgroup_flag]*<newline>
587
- gid specified is verified to be a Multicast address. IP groups
  are verified to match the rate and mtu of the broadcast group.
  The P_Key bits of the mgid for IP groups are verified to either
  match the P_Key specified by "Partition Definition" or, if they
  are 0x0000, the P_Key will be copied into those bits.
599
mgroup_flag:
rate=<val>      - specifies rate for this MC group
                  (default is 3 (10 Gbps))
mtu=<val>       - specifies MTU for this MC group
                  (default is 4 (2048))
sl=<val>        - specifies SL for this MC group
                  (default is 0)
scope=<val>     - specifies scope for this MC group
                  (default is 2 (link local)). Multiple scope
                  settings are permitted for a partition.
                  NOTE: This overwrites the scope nibble of the
                  specified mgid. Furthermore, specifying multiple
                  scope settings will result in multiple MC groups
                  being created.
Q_Key=<val>     - specifies the Q_Key for this MC group
                  (default: 0x0b1b for IP groups, 0 for other groups)
TClass=<val>    - specifies tclass for this MC group
                  (default is 0)
FlowLabel=<val> - specifies FlowLabel for this MC group
                  (default is 0)
624
newline: '\n'
626
627
628 Note that values for rate, mtu, and scope, for both partitions and mul‐
629 ticast groups, should be specified as defined in the IBTA specification
630 (for example, mtu=4 for 2048).
631
632 There are several useful keywords for PortGUID definition:
633
634 - 'ALL' means all end ports in this subnet.
635 - 'ALL_CAS' means all Channel Adapter end ports in this subnet.
636 - 'ALL_SWITCHES' means all Switch end ports in this subnet.
637 - 'ALL_ROUTERS' means all Router end ports in this subnet.
638 - 'SELF' means subnet manager's port.
639
640 Empty list means no ports in this partition.
641
642 Notes:
643
644 White space is permitted between delimiters ('=', ',',':',';').
645
PartitionName does not need to be unique, but PKey does need to be
unique. If PKey is repeated then those partition configurations will
be merged and the first PartitionName will be used (see also next note).

It is possible to split a partition configuration into more than one
definition, but then PKey should be explicitly specified (otherwise
different PKey values will be generated for those definitions).
653
654 Examples:
655
656 Default=0x7fff : ALL, SELF=full ;
657 Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ;
658
NewPartition , ipoib : 0x123456=full, 0x3456789034=limited, 0x2134af2306 ;
661
662 YetAnotherOne = 0x300 : SELF=full ;
663 YetAnotherOne = 0x300 : ALL=limited ;
664
ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
# 0x123453, 0x123454 will be limited
ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
# 0x123456, 0x123457 will be limited
ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full;
ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
674
675 # multicast groups added to default
676 Default=0x7fff,ipoib:
677 mgid=ff12:401b::0707,sl=1 # random IPv4 group
678 mgid=ff12:601b::16 # MLDv2-capable routers
679 mgid=ff12:401b::16 # IGMP
680 mgid=ff12:601b::2 # All routers
681 mgid=ff12::1,sl=1,Q_Key=0xDEADBEEF,rate=3,mtu=2 # random group
682 ALL=full;
683
684
685 Note:
686
687 The following rule is equivalent to how OpenSM used to run prior to the
688 partition manager:
689
690 Default=0x7fff,ipoib:ALL=full;
691
692
QOS CONFIGURATION
There is a set of QoS related low-level configuration parameters. All
these parameter names are prefixed by the "qos_" string. Here is a full
list of these parameters:

qos_max_vls    - The maximum number of VLs that will be on the subnet
qos_high_limit - The limit of High Priority component of VL
                 Arbitration table (IBA 7.6.9)
qos_vlarb_low  - Low priority VL Arbitration table (IBA 7.6.9)
                 template
qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
                 template
                 Both VL arbitration templates are pairs of
                 VL and weight
qos_sl2vl      - SL2VL Mapping table (IBA 7.6.6) template. It is
                 a list of VLs corresponding to SLs 0-15 (Note
                 that VL15 used here means drop this SL)
710
711 Typical default values (hard-coded in OpenSM initialization) are:
712
qos_max_vls 15
qos_high_limit 0
qos_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
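The template strings above have a simple structure that a script can unpack for inspection. This is a sketch, assuming comma-separated items and VL:weight pairs as shown in the defaults; the function names are hypothetical.

```python
from typing import List, Tuple


def parse_vlarb(template: str) -> List[Tuple[int, int]]:
    """Split a VL arbitration template into (VL, weight) pairs."""
    pairs = []
    for item in template.split(","):
        if not item.strip():
            continue  # tolerate a trailing comma
        vl, weight = item.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


def parse_sl2vl(template: str) -> List[int]:
    """Split an SL2VL template into a list of VLs indexed by SL."""
    return [int(v) for v in template.split(",") if v.strip()]


def dropped_sls(vls: List[int]) -> List[int]:
    """SLs mapped to VL15, which in this template means drop this SL."""
    return [sl for sl, vl in enumerate(vls) if vl == 15]
```

Applied to the default qos_sl2vl value, parse_sl2vl yields sixteen entries with SL 15 mapped to VL 7, and dropped_sls returns an empty list.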
720
The syntax is compatible with the rest of the OpenSM configuration
options, and values may be stored in the OpenSM config file (cached
options file).

In addition to the above, we may define separate QoS configuration
parameter sets for various target types. As targets, we currently
support CAs, routers, switch external ports, and switch's enhanced
port 0. The names of such specialized parameters are prefixed by the
"qos_<type>_" string. Here is a full list of the currently supported
sets:
729
730 qos_ca_ - QoS configuration parameters set for CAs.
731 qos_rtr_ - parameters set for routers.
732 qos_sw0_ - parameters set for switches' port 0.
733 qos_swe_ - parameters set for switches' external ports.
734
735 Examples:
736 qos_sw0_max_vls=2
737 qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
738 qos_swe_high_limit=0
739
740
PREFIX ROUTES
742 Prefix routes control how the SA responds to path record queries for
743 off-subnet DGIDs. By default, the SA fails such queries. Note that
744 IBA does not specify how the SA should obtain off-subnet path record
745 information. The prefix routes configuration is meant as a stop-gap
746 until the specification is completed.
747
748 Each line in the configuration file is a 64-bit prefix followed by a
749 64-bit GUID, separated by white space. The GUID specifies the router
750 port on the local subnet that will handle the prefix. Blank lines are
751 ignored, as is anything between a # character and the end of the line.
The prefix and GUID are both in hex; the leading 0x is optional.
753 Either, or both, can be wild-carded by specifying an asterisk instead
754 of an explicit prefix or GUID.
755
756 When responding to a path record query for an off-subnet DGID, opensm
757 searches for the first prefix match in the configuration file. There‐
758 fore, the order of the lines in the configuration file is important: a
759 wild-carded prefix at the beginning of the configuration file renders
760 all subsequent lines useless. If there is no match, then opensm fails
761 the query. It is legal to repeat prefixes in the configuration file,
762 opensm will return the path to the first available matching router. A
763 configuration file with a single line where both prefix and GUID are
764 wild-carded means that a path record query specifying any off-subnet
765 DGID should return a path to the first available router. This configu‐
766 ration yields the same behavior formerly achieved by compiling opensm
767 with -DROUTER_EXP which has been obsoleted.
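The first-match lookup described above can be sketched in Python. This is an illustrative model only, not OpenSM's implementation: the names parse_prefix_routes and first_match and all sample prefixes and GUIDs are made up for the example, and router availability is not modeled.

```python
def parse_prefix_routes(text):
    """Parse prefix-routes lines into (prefix, guid) pairs; None means '*'."""
    routes = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        prefix, guid = line.split()  # exactly two whitespace-separated fields
        routes.append(
            tuple(None if f == "*" else int(f, 16) for f in (prefix, guid))
        )
    return routes


def first_match(routes, dgid_prefix):
    """Return the first entry whose prefix matches, or None (query fails)."""
    for prefix, guid in routes:
        if prefix is None or prefix == dgid_prefix:
            return prefix, guid
    return None


SAMPLE = """
# hypothetical values, for illustration only
0xfec0000000000001  0x0002c90300001111
fec0000000000002    *        # leading 0x is optional
*                   0x0002c90300002222
"""
ROUTES = parse_prefix_routes(SAMPLE)
```

Because the wild-carded prefix is the last line of SAMPLE, it only catches prefixes that no earlier line matched; moving it to the top would shadow the two explicit entries.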
768
769
771 OpenSM supports configuring a single management key (MKey) for use
772 across the subnet.
773
774 The following configuration options are available:
775
776 m_key - the 64-bit MKey to be used on the subnet
777 (IBA 14.2.4)
778 m_key_protection_level - the numeric value of the MKey ProtectBits
779 (IBA 14.2.4.1)
780 m_key_lease_period - the number of seconds a CA will wait for a
781 response from the SM before resetting the
782 protection level to 0 (IBA 14.2.4.2).
783
784 OpenSM will configure all ports with the MKey specified by m_key,
785 defaulting to a value of 0. A m_key value of 0 disables MKey protection
786 on the subnet. Switches and HCAs with a non-zero MKey will not accept
787 requests to change their configuration unless the request includes the
788 proper MKey.
789
790 MKey Protection Levels
791
792 MKey protection levels modify how switches and CAs respond to SMPs
793 lacking a valid MKey. OpenSM will configure each port's ProtectBits to
794 support the level defined by the m_key_protection_level parameter. If
795 no parameter is specified, OpenSM defaults to operating at protection
796 level 0.
797
798 There are currently 4 protection levels defined by the IBA:
799
800 0 - Queries return valid data, including MKey. Configuration changes
801 are not allowed unless the request contains a valid MKey.
802 1 - Like level 0, but the MKey is set to 0 (0x00000000) in queries,
803 unless the request contains a valid MKey.
804 2 - Neither queries nor configuration changes are allowed, unless the
805 request contains a valid MKey.
806 3 - Identical to 2. Maintained for backwards compatibility.
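As a rough illustration, the four levels listed above can be written as two small Python predicates. The function names are hypothetical, and the sketch assumes a non-zero MKey is configured (an m_key of 0 disables protection altogether).

```python
def smp_accepted(level, is_set, key_valid):
    """True if a port at this protection level honours the SMP."""
    if level not in (0, 1, 2, 3):
        raise ValueError("protection level must be 0..3")
    if key_valid:
        return True
    if is_set:
        return False      # configuration changes always need a valid MKey
    return level < 2      # levels 2 and 3 also refuse queries


def mkey_in_response(level, key_valid, m_key):
    """MKey value reported in the response to an accepted query."""
    return m_key if (key_valid or level == 0) else 0
```

For example, a query without the key is answered at level 1, but the MKey field in the answer reads as 0.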
807
808 MKey Lease Period
809
810 InfiniBand supports a MKey lease timeout, which is intended to allow
811 administrators or a new SM to recover/reset lost MKeys on a fabric.
812
813 If MKeys are enabled on the subnet and a switch or CA receives a
814 request that requires a valid MKey but does not contain one, it warns
815 the SM by sending a trap (Bad M_Key, Trap 256). If the MKey lease
816 period is non-zero, it also starts a countdown timer for the time spec‐
817 ified by the lease period. If a SM (or other agent) responds with the
818 correct MKey, the timer is stopped and reset. Should the timer reach
819 zero, the switch or CA will reset its MKey protection level to 0,
820 exposing the MKey and allowing recovery.
821
822 OpenSM will initialize all ports with an MKey lease period equal to
823 the number of seconds specified in the config file. If no
824 m_key_lease_period is specified, a default of 0 will be used.
825
826 OpenSM normally quickly responds to all Bad_M_Key traps, resetting the
827 lease timers. Additionally, OpenSM's subnet sweeps will also cancel
828 any running timers. For maximum protection against accidentally-
829 exposed MKeys, the MKey lease time should be a few multiples of the
830 subnet sweep time. If OpenSM detects at startup that your sweep inter‐
831 val is greater than your MKey lease period, it will reset the lease
832 period to be greater than the sweep interval. Similarly, if sweeping
833 is disabled at startup, it will be re-enabled with an interval less
834 than the Mkey lease period.
835
836 If OpenSM is required to recover a subnet for which it is missing
837 mkeys, it must do so one switch level at a time. As such, the total
838 time to recover the subnet may be as long as the mkey lease period mul‐
839 tiplied by the maximum number of hops between the SM and an endpoint,
840 plus one.
841
842 MKey Effects on Diagnostic Utilities
843
844 Setting a MKey may have a detrimental effect on diagnostic software run
845 on the subnet, unless your diagnostic software is able to retrieve
846 MKeys from the SA or can be explicitly configured with the proper MKey.
847 This is particularly true at protection level 2, where CAs will ignore
848 queries for management information that do not contain the proper MKey.
849
850
852 OpenSM now offers nine routing engines:
853
854 1. Min Hop Algorithm - based on the minimum hops to each node where
855 the path length is optimized.
856
857 2. UPDN Unicast routing algorithm - also based on the minimum hops to
858 each node, but it is constrained to ranking rules. This algorithm
859 should be chosen if the subnet is not a pure Fat Tree, and deadlock may
860 occur due to a loop in the subnet.
861
862 3. DNUP Unicast routing algorithm - similar to UPDN but allows routing
863 in fabrics which have some CA nodes attached closer to the roots than
864 some switch nodes.
865
866 4. Fat Tree Unicast routing algorithm - this algorithm optimizes rout‐
867 ing for congestion-free "shift" communication pattern. It should be
868 chosen if a subnet is a symmetrical or almost symmetrical fat-tree of
869 various types, not just K-ary-N-Trees: non-constant K, not fully
870 staffed, any Constant Bisectional Bandwidth (CBB) ratio. Similar to
871 UPDN, Fat Tree routing is constrained to ranking rules.
872
873 5. LASH unicast routing algorithm - uses Infiniband virtual layers (SL)
874 to provide deadlock-free shortest-path routing while also distributing
875 the paths between layers. LASH is an alternative deadlock-free topol‐
876 ogy-agnostic routing algorithm to the non-minimal UPDN algorithm avoid‐
877 ing the use of a potentially congested root node.
878
879 6. DOR Unicast routing algorithm - based on the Min Hop algorithm, but
880 avoids port equalization except for redundant links between the same
881 two switches. This provides deadlock free routes for hypercubes when
882 the fabric is cabled as a hypercube and for meshes when cabled as a
883 mesh (see details below).
884
885 7. Torus-2QoS unicast routing algorithm - a DOR-based routing algorithm
886 specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-
887 free routing while supporting two quality of service (QoS) levels. In
888 addition it is able to route around multiple failed fabric links or a
889 single failed fabric switch without introducing deadlocks, and without
890 changing path SL values granted before the failure.
891
892 8. DFSSSP unicast routing algorithm - a deadlock-free single-source-
893 shortest-path routing, which uses the SSSP algorithm (see algorithm 9.)
894 as the base to optimize link utilization and uses Infiniband virtual
895 lanes (SL) to provide deadlock-freedom.
896
897 9. SSSP unicast routing algorithm - a single-source-shortest-path rout‐
898 ing algorithm, which globally balances the number of routes per link to
899 optimize link utilization. This routing algorithm has no restrictions
900 in terms of the underlying topology.
901
902 OpenSM also supports a file method which can load routes from a table.
903 See ´Modular Routing Engine´ for more information on this.
904
905 The basic routing algorithm is comprised of two stages:
906
907 1. MinHop matrix calculation
908 How many hops are required to get from each port to each LID ?
909 The algorithm to fill these tables is different if you run standard
910 (min hop) or Up/Down.
911 For standard routing, a "relaxation" algorithm is used to propagate
912 min hop from every destination LID through neighbor switches.
913 For Up/Down routing, a BFS from every target is used. The BFS tracks
914 link direction (up or down) and avoids steps that would go up after
915 a down step was used.
916
917 2. Once MinHop matrices exist, each switch is visited and for each tar‐
918 get LID a decision is made as to what port should be used to get to
919 that LID.
920 This step is common to standard and Up/Down routing. Each port has a
921 counter counting the number of target LIDs going through it.
922 When there are multiple alternative ports with the same MinHop to a
923 LID, the one with fewer previously assigned LIDs is selected.
924 If LMC > 0, more checks are added: Within each group of LIDs
925 assigned to the same target port,
926 a. use only ports which have same MinHop
927 b. first prefer the ones that go to different systemImageGuid (than
928 the previous LID of the same LMC group)
929 c. if none - prefer those which go through another NodeGuid
930 d. fall back to the number of paths method (if all go to same node).
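The basic port choice of stage 2 (LMC = 0) can be sketched as follows. The function name, the dictionary layout and the tie-break on the lowest port number are assumptions made for the example; the extra LMC > 0 checks a. to d. are not modeled.

```python
def choose_port(min_hops, assigned_lids):
    """Pick the output port for one target LID on one switch.

    min_hops: {port: hop count to the target LID} for ports that reach it.
    assigned_lids: {port: number of target LIDs already routed through it};
    updated in place.
    """
    best = min(min_hops.values())
    candidates = [p for p, hops in min_hops.items() if hops == best]
    # Least-loaded candidate wins; lowest port number breaks ties (assumption).
    port = min(candidates, key=lambda p: (assigned_lids.get(p, 0), p))
    assigned_lids[port] = assigned_lids.get(port, 0) + 1
    return port


load = {}
# Four target LIDs, each reachable in 2 hops via ports 1 and 2, 3 hops via 3.
picks = [choose_port({1: 2, 2: 2, 3: 3}, load) for _ in range(4)]
```

Here the two min-hop ports alternate, so each ends up carrying two LIDs, while port 3 is never used because it is not on a shortest path.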
931
932 Effect of Topology Changes
933
934 OpenSM will preserve existing routing in any case where there is no
935 change in the fabric switches unless the -r (--reassign_lids) option is
936 specified.
937
938 -r
939 --reassign_lids
940 This option causes OpenSM to reassign LIDs to all
941 end nodes. Specifying -r on a running subnet
942 may disrupt subnet traffic.
943 Without -r, OpenSM attempts to preserve existing
944 LID assignments resolving multiple use of same LID.
945
946 If a link is added or removed, OpenSM does not recalculate the routes
947 that do not have to change. A route has to change if the port is no
948 longer UP or no longer the MinHop. When routing changes are performed,
949 the same algorithm for balancing the routes is invoked.
950
951 In the case of using the file based routing, any topology changes are
952 currently ignored. The 'file' routing engine just loads the LFTs from
953 the file specified, with no reaction to real topology. Obviously, this
954 will not be able to recheck LIDs (by GUID) for disconnected nodes, and
955 LFTs for non-existent switches will be skipped. Multicast is not
956 affected by 'file' routing engine (this uses min hop tables).
957
958
959 Min Hop Algorithm
960
961 The Min Hop algorithm is invoked by default if no routing algorithm is
962 specified. It can also be invoked by specifying '-R minhop'.
963
964 The Min Hop algorithm is divided into two stages: computation of min-
965 hop tables on every switch and LFT output port assignment. Link sub‐
966 scription is also equalized with the ability to override based on port
967 GUID. The latter is supplied by:
968
969 -i <equalize-ignore-guids-file>
970 --ignore-guids <equalize-ignore-guids-file>
971 This option provides the means to define a set of ports
972 (by guid) that will be ignored by the link load
973 equalization algorithm. Note that only endports (CA,
974 switch port 0, and router ports) and not switch external
975 ports are supported.
976
977 LMC awareness routes based on (remote) system or switch basis.
978
979
980 Purpose of UPDN Algorithm
981
982 The UPDN algorithm is designed to prevent deadlocks from occurring in
983 loops of the subnet. A loop-deadlock is a situation in which it is no
984 longer possible to send data between any two hosts connected through
985 the loop. As such, the UPDN routing algorithm should be used if the
986 subnet is not a pure Fat Tree, and one of its loops may experience a
987 deadlock (due, for example, to high pressure).
988
989 The UPDN algorithm is based on the following main stages:
990
991 1. Auto-detect root nodes - based on the CA hop length from any switch
992 in the subnet, a statistical histogram is built for each switch (hop
993 num vs number of occurrences). If the histogram reflects a specific
994 column (higher than others) for a certain node, then it is marked as a
995 root node. Since the algorithm is statistical, it may not find any root
996 nodes. The list of the root nodes found by this auto-detect stage is
997 used by the ranking process stage.
998
999 Note 1: The user can override the node list manually.
1000 Note 2: If this stage cannot find any root nodes, and the user did
1001 not specify a guid list file, OpenSM defaults back to the
1002 Min Hop routing algorithm.
1003
1004 2. Ranking process - All root switch nodes (found in stage 1) are
1005 assigned a rank of 0. Using the BFS algorithm, the rest of the switch
1006 nodes in the subnet are ranked incrementally. This ranking aids in the
1007 process of enforcing rules that ensure loop-free paths.
1008
1009 3. Min Hop Table setting - after ranking is done, a BFS algorithm is
1010 run from each (CA or switch) node in the subnet. During the BFS
1011 process, the FDB table of each switch node traversed by BFS is updated,
1012 in reference to the starting node, based on the ranking rules and guid
1013 values.
1014
1015 At the end of the process, the updated FDB tables ensure loop-free
1016 paths through the subnet.
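A minimal Python sketch of the ranking stage and of the "no up after down" rule follows. The data layout and function names are illustrative. Links between switches of equal rank, which the text says are resolved using guid values, are simply treated as "up" steps here.

```python
from collections import deque


def rank_switches(links, roots):
    """BFS ranking: roots get rank 0, other switches their hop distance."""
    rank = {r: 0 for r in roots}
    queue = deque(roots)
    while queue:
        sw = queue.popleft()
        for nbr in links[sw]:
            if nbr not in rank:
                rank[nbr] = rank[sw] + 1
                queue.append(nbr)
    return rank


def obeys_updn(path, rank):
    """True if the switch path never goes up after it has gone down."""
    gone_down = False
    for a, b in zip(path, path[1:]):
        if rank[b] > rank[a]:
            gone_down = True          # step away from the roots
        elif gone_down:
            return False              # up step after a down step
    return True


# Two roots, two leaf switches, every leaf cabled to every root.
LINKS = {"R1": ["L1", "L2"], "R2": ["L1", "L2"],
         "L1": ["R1", "R2"], "L2": ["R1", "R2"]}
RANK = rank_switches(LINKS, ["R1", "R2"])
```

With this ranking, a leaf-to-leaf path through a root is legal, while a root-to-root path through a leaf is not, which matches the note below about switches inside spine "switch systems".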
1017
1018 Note: Up/Down routing does not allow LID routing communication between
1019 switches that are located inside spine "switch systems". The reason is
1020 that there is no way to allow a LID route between them that does not
1021 break the Up/Down rule. One ramification of this is that you cannot
1022 run SM on switches other than the leaf switches of the fabric.
1023
1024
1025 UPDN Algorithm Usage
1026
1027 Activation through OpenSM
1028
1029 Use '-R updn' option (instead of old '-u') to activate the UPDN algo‐
1030 rithm. Use '-a <root_guid_file>' for adding an UPDN guid file that
1031 contains the root nodes for ranking. If the `-a' option is not used,
1032 OpenSM uses its auto-detect root nodes algorithm.
1033
1034 Notes on the guid list file:
1035
1036 1. A valid guid file specifies one guid in each line. Lines with an
1037 invalid format will be discarded.
1038 2. The user should specify the root switch guids. However, it is also
1039 possible to specify CA guids; OpenSM will use the guid of the switch
1040 (if it exists) that connects the CA to the subnet as a root node.
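The "one guid per line, invalid lines discarded" rule might be modeled as below. The sketch assumes GUIDs are written in hex with an optional leading 0x; the function name and sample values are hypothetical.

```python
def read_guid_list(text):
    """Return the 64-bit GUIDs found in the text, one per line."""
    guids = []
    for line in text.splitlines():
        try:
            value = int(line.strip(), 16)   # accepts an optional 0x prefix
        except ValueError:
            continue                        # invalid format: line discarded
        if 0 <= value < 2**64:
            guids.append(value)
    return guids


SAMPLE_GUIDS = read_guid_list("0x0002c90200000001\nnot-a-guid\n0002c90200000002\n")
```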
1041
1042 Purpose of DNUP Algorithm
1043
1044 The DNUP algorithm is designed to serve a similar purpose to UPDN. How‐
1045 ever it is intended to work in network topologies which are unsuited to
1046 UPDN due to nodes being connected closer to the roots than some of the
1047 switches. An example would be a fabric which contains nodes and
1048 uplinks connected to the same switch. The operation of DNUP is the same
1049 as UPDN with the exception of the ranking process. In DNUP all switch
1050 nodes are ranked based solely on their distance from CA nodes: all
1051 switch nodes directly connected to at least one CA are assigned a rank
1052 of 1, and all other switch nodes are assigned a rank of one more than
1053 the minimum rank of all neighbor switch nodes.
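The DNUP ranking rule amounts to a breadth-first search started from every switch that has a CA attached. A small sketch, with an illustrative data layout and function name:

```python
from collections import deque


def dnup_rank(switch_links, switches_with_cas):
    """Rank 1 for switches with a CA, else 1 + minimum neighbor rank."""
    rank = {sw: 1 for sw in switches_with_cas}
    queue = deque(switches_with_cas)
    while queue:
        sw = queue.popleft()
        for nbr in switch_links[sw]:
            if nbr not in rank:
                # First visit in BFS order comes from a minimum-rank neighbor.
                rank[nbr] = rank[sw] + 1
                queue.append(nbr)
    return rank


# "mid" carries both CAs and an uplink, the case that is unsuited to UPDN.
CHAIN = {"leaf": ["mid"], "mid": ["leaf", "top"], "top": ["mid"]}
DNUP_RANKS = dnup_rank(CHAIN, ["leaf", "mid"])
```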
1054
1055 Fat-tree Routing Algorithm
1056
1057 The fat-tree algorithm optimizes routing for the "shift" communication
1058 pattern. It should be chosen if a subnet is a symmetrical or almost
1059 symmetrical fat-tree of various types. It supports more than just
1060 K-ary-N-Trees: it handles non-constant K, cases where not all leaves
1061 (CAs) are present, and any CBB ratio. Like UPDN, fat-tree also
1062 prevents credit-loop-deadlocks.
1063
1064 If the root guid file is not provided ('-a' or '--root_guid_file'
1065 options), the topology has to be pure fat-tree that complies with the
1066 following rules:
1067 - Tree rank should be between two and eight (inclusively)
1068 - Switches of the same rank should have the same number
1069 of UP-going port groups*, unless they are root switches,
1070 in which case they shouldn't have UP-going ports at all.
1071 - Switches of the same rank should have the same number
1072 of DOWN-going port groups, unless they are leaf switches.
1073 - Switches of the same rank should have the same number
1074 of ports in each UP-going port group.
1075 - Switches of the same rank should have the same number
1076 of ports in each DOWN-going port group.
1077 - All the CAs have to be at the same tree level (rank).
1078
1079 If the root guid file is provided, the topology doesn't have to be pure
1080 fat-tree, and it should only comply with the following rules:
1081 - Tree rank should be between two and eight (inclusively)
1082 - All the Compute Nodes** have to be at the same tree level (rank).
1083 Note that non-compute node CAs are allowed here to be at different
1084 tree ranks.
1085
1086 * ports that are connected to the same remote switch are referenced as
1087 ´port group´.
1088
1089 ** list of compute nodes (CNs) can be specified by ´-u´ or
1090 ´--cn_guid_file´ OpenSM options.
1091
1092 Topologies that do not comply cause a fallback to min hop routing.
1093 Note that this can also occur on link failures which cause the topology
1094 to no longer be "pure" fat-tree.
1095
1096 Note that although fat-tree algorithm supports trees with non-integer
1097 CBB ratio, the routing will not be as balanced as in case of integer
1098 CBB ratio. In addition to this, although the algorithm allows leaf
1099 switches to have any number of CAs, the closer the tree is to being fully
1100 populated, the more effective the "shift" communication pattern will
1101 be. In general, even if the root list is provided, the closer the
1102 topology to a pure and symmetrical fat-tree, the more optimal the rout‐
1103 ing will be.
1104
1105 The algorithm also dumps a compute node ordering file
1106 (opensm-ftree-ca-order.dump) in the directory where the OpenSM log
1107 resides. This file provides the CN order that may be used to create
1108 an efficient communication pattern that matches the routing tables.
1109
1110 Routing between non-CN nodes
1111
1112 The use of the cn_guid_file option allows non-CN nodes to be located on
1113 different levels in the fat tree. In such case, it is not guaranteed
1114 that the Fat Tree algorithm will route between two non-CN nodes. To
1115 solve this problem, a list of non-CN nodes can be specified by ´-G´ or
1116 ´--io_guid_file´ option. These nodes will be allowed to use switches
1117 the wrong way round a specific number of times (specified by ´-H´ or
1118 ´--max_reverse_hops´). With the proper max_reverse_hops and
1119 io_guid_file values, you can ensure full connectivity in the Fat Tree.
1120
1121 Please note that using max_reverse_hops creates routes that use the
1122 switch in a counter-stream way. This option should never be used to
1123 connect nodes with high bandwidth traffic between them ! It should only
1124 be used to allow connectivity for HA purposes or similar. Also having
1125 routes the other way around can in theory cause credit loops.
1126
1127 Use these options with extreme care !
1128
1129 Activation through OpenSM
1130
1131 Use '-R ftree' option to activate the fat-tree algorithm. Use '-a
1132 <root_guid_file>' to provide root nodes for ranking. If the `-a' option
1133 is not used, routing algorithm will detect roots automatically. Use
1134 '-u <root_cn_file>' to provide the list of compute nodes. If the `-u'
1135 option is not used, all the CAs are considered as compute nodes.
1136
1137 Note: LMC > 0 is not supported by fat-tree routing. If this is speci‐
1138 fied, the default routing algorithm is invoked instead.
1139
1140
1141 LASH Routing Algorithm
1142
1143 LASH is an acronym for LAyered SHortest Path Routing. It is a determin‐
1144 istic shortest path routing algorithm that enables topology agnostic
1145 deadlock-free routing within communication networks.
1146
1147 When computing the routing function, LASH analyzes the network topology
1148 for the shortest-path routes between all pairs of sources / destina‐
1149 tions and groups these paths into virtual layers in such a way as to
1150 avoid deadlock.
1151
1152 Note that LASH analyzes routes and ensures deadlock freedom between
1153 switch pairs. The link between an HCA and a switch does not need
1154 virtual layers, as deadlock will not arise between switch and HCA.
1155
1156 In more detail, the algorithm works as follows:
1157
1158 1) LASH determines the shortest-path between all pairs of source / des‐
1159 tination switches. Note, LASH ensures the same SL is used for all
1160 SRC/DST - DST/SRC pairs and there is no guarantee that the return path
1161 for a given DST/SRC will be the reverse of the route SRC/DST.
1162
1163 2) LASH then begins an SL assignment process where a route is assigned
1164 to a layer (SL) if the addition of that route does not cause deadlock
1165 within that layer. This is achieved by maintaining and analysing a
1166 channel dependency graph for each layer. Once the potential addition of
1167 a path could lead to deadlock, LASH opens a new layer and continues the
1168 process.
1169
1170 3) Once this stage has been completed, it is highly likely that the
1171 first layers processed will contain more paths than the latter ones.
1172 To better balance the use of layers, LASH moves paths from one layer to
1173 another so that the number of paths in each layer averages out.
1174
1175 Note, the implementation of LASH in opensm attempts to use as few lay‐
1176 ers as possible. This number can be less than the number of actual lay‐
1177 ers available.
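The layer assignment of step 2 can be sketched in Python as follows. This is a simplified model under stated assumptions, not the OpenSM code: routes are given as lists of switches, a "channel" is a directed switch-to-switch link, all names are illustrative, and the balancing of step 3 is left out.

```python
def route_dependencies(route):
    """Channel-to-channel dependencies created by one switch route."""
    channels = list(zip(route, route[1:]))      # directed links (u, v)
    return list(zip(channels, channels[1:]))    # consecutive channel pairs


def has_cycle(deps):
    """Depth-first search for a cycle in a channel dependency graph."""
    state = {}                                  # missing key = unvisited

    def visit(channel):
        state[channel] = "open"
        for nxt in deps.get(channel, ()):
            if state.get(nxt) == "open":
                return True
            if nxt not in state and visit(nxt):
                return True
        state[channel] = "done"
        return False

    return any(ch not in state and visit(ch) for ch in list(deps))


def assign_layers(routes, max_layers):
    """Put each route in the first layer where it creates no cycle."""
    layers = []                                 # one dependency graph per layer
    assignment = []
    for route in routes:
        for index in range(max_layers):
            if index == len(layers):
                layers.append({})               # open a new layer
            trial = {ch: set(nxt) for ch, nxt in layers[index].items()}
            for a, b in route_dependencies(route):
                trial.setdefault(a, set()).add(b)
            if not has_cycle(trial):
                layers[index] = trial
                assignment.append(index)
                break
        else:
            raise RuntimeError("not enough layers for a deadlock-free assignment")
    return assignment


# Four two-hop routes running the same way round a ring of four switches.
RING_ROUTES = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
RING_LAYERS = assign_layers(RING_ROUTES, max_layers=8)
```

In the ring example the first three routes share a layer; the fourth would close the dependency cycle around the ring, so it is moved to a second layer.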
1178
1179 In general LASH is a very flexible algorithm. It can, for example,
1180 reduce to Dimension Order Routing in certain topologies, it is topology
1181 agnostic and fares well in the face of faults.
1182
1183 It has been shown that for both regular and irregular topologies, LASH
1184 outperforms Up/Down. The reason for this is that LASH distributes the
1185 traffic more evenly through a network, avoiding the bottleneck issues
1186 related to a root node and always routes shortest-path.
1187
1188 The algorithm was developed by Simula Research Laboratory.
1189
1190
1191 Use the '-R lash -Q' option to activate the LASH algorithm.
1192
1193 Note: QoS support has to be turned on in order that SL/VL mappings are
1194 used.
1195
1196 Note: LMC > 0 is not supported by the LASH routing. If this is speci‐
1197 fied, the default routing algorithm is invoked instead.
1198
1199 For open regular cartesian meshes the DOR algorithm is the ideal rout‐
1200 ing algorithm. For toroidal meshes on the other hand there are routing
1201 loops that can cause deadlocks. LASH can be used to route these cases.
1202 The performance of LASH can be improved by preconditioning the mesh in
1203 cases where there are multiple links connecting switches and also in
1204 cases where the switches are not cabled consistently. An option exists
1205 for LASH to do this. To invoke this use '-R lash -Q --do_mesh_analy‐
1206 sis'. This will add an additional phase that analyses the mesh to try
1207 to determine the dimension and size of a mesh. If it determines that
1208 the mesh looks like an open or closed cartesian mesh it reorders the
1209 ports in dimension order before the rest of the LASH algorithm runs.
1210
1211 DOR Routing Algorithm
1212
1213 The Dimension Order Routing algorithm is based on the Min Hop algorithm
1214 and so uses shortest paths. Instead of spreading traffic out across
1215 different paths with the same shortest distance, it chooses among the
1216 available shortest paths based on an ordering of dimensions. Each port
1217 must be consistently cabled to represent a hypercube dimension or a
1218 mesh dimension. Alternatively, the -O option can be used to assign a
1219 custom mapping between the ports on a given switch, and the associated
1220 dimension. Paths are grown from a destination back to a source using
1221 the lowest dimension (port) of available paths at each step. This pro‐
1222 vides the ordering necessary to avoid deadlock. When there are multi‐
1223 ple links between any two switches, they still represent only one
1224 dimension and traffic is balanced across them unless port equalization
1225 is turned off. In the case of hypercubes, the same port must be used
1226 throughout the fabric to represent the hypercube dimension and match on
1227 both ends of the cable, or the -O option used to accomplish the align‐
1228 ment. In the case of meshes, the dimension should consistently use the
1229 same pair of ports, one port on one end of the cable, and the other
1230 port on the other end, continuing along the mesh dimension, or the -O
1231 option used as an override.
1232
1233 Use '-R dor' option to activate the DOR algorithm.
1234
1235 DFSSSP and SSSP Routing Algorithm
1236
1237 The (Deadlock-Free) Single-Source-Shortest-Path routing algorithm is
1238 designed to optimize link utilization through global balancing of routes,
1239 while supporting arbitrary topologies. The DFSSSP routing algorithm
1240 uses Infiniband virtual lanes (SL) to provide deadlock-freedom.
1241
1242 The DFSSSP algorithm consists of five major steps:
1243 1) It discovers the subnet and models the subnet as a directed multi‐
1244 graph in which each node represents a node of the physical network and
1245 each edge represents one direction of the full-duplex links used to
1246 connect the nodes.
1247 2) A loop, which iterates over all CA and switches of the subnet, will
1248 perform three steps to generate the linear forwarding tables for each
1249 switch:
1250 2.1) use Dijkstra's algorithm to find the shortest path from all nodes
1251 to the current selected destination;
1252 2.2) update the edge weights in the graph, i.e. add the number of
1253 routes, which use a link to reach the destination, to the link/edge;
1254 2.3) update the LFT of each switch with the outgoing port which was
1255 used in the current step to route the traffic to the destination node.
1256 3) After the number of available virtual lanes or layers in the subnet
1257 is detected and a channel dependency graph is initialized for each
1258 layer, the algorithm will put each possible route of the subnet into
1259 the first layer.
1260 4) A loop iterates over all channel dependency graphs (CDG) and per‐
1261 forms the following substeps:
1262 4.1) search for a cycle in the current CDG;
1263 4.2) when a cycle is found, i.e. a possible deadlock is present, one
1264 edge is selected and all routes, which induced this edge, are moved to
1265 the "next higher" virtual layer (CDG[i+1]);
1266 4.3) the cycle search is continued until all cycles are broken and
1267 routes are moved "up".
1268 5) When the number of layers needed to remove all cycles in all CDGs
1269 does not exceed the number of available SL/VL, the routing is
1270 deadlock-free and a relation table is generated, which contains the
1271 assignment of routes from source to destination to an SL.
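Step 2 above, which SSSP and DFSSSP share, can be illustrated with a short Python sketch. It rests on several simplifying assumptions that are not taken from OpenSM: the fabric is a plain graph rather than a multigraph, every directed link starts with weight 1, the forwarding table stores the next-hop node instead of an output port, and all names are made up.

```python
import heapq


def sssp_tables(graph):
    """graph: {node: [neighbors]} with every link listed in both directions."""
    weight = {(u, v): 1 for u in graph for v in graph[u]}
    lft = {u: {} for u in graph}
    for dest in sorted(graph):
        # 2.1) Dijkstra from the destination over the weighted links.
        dist, next_hop, heap = {dest: 0}, {}, [(0, dest)]
        while heap:
            d, v = heapq.heappop(heap)
            if d > dist[v]:
                continue
            for u in graph[v]:
                nd = d + weight[(u, v)]         # cost of sending u -> v
                if u not in dist or nd < dist[u]:
                    dist[u], next_hop[u] = nd, v
                    heapq.heappush(heap, (nd, u))
        for src in next_hop:
            # 2.3) record the forwarding decision for this destination.
            lft[src][dest] = next_hop[src]
            # 2.2) every link on the route src -> dest gets one more route.
            node = src
            while node != dest:
                weight[(node, next_hop[node])] += 1
                node = next_hop[node]
    return lft, weight


RING = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
LFT, WEIGHT = sssp_tables(RING)
```

Because the weights grow as routes are placed, later destinations tend to avoid links that already carry many routes, which is the global balancing the text refers to.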
1272
1273 Note on SSSP:
1274 This algorithm does not perform the steps 3)-5) and can not be consid‐
1275 ered to be deadlock-free for all topologies. But on the one hand, you
1276 can choose this algorithm for really large networks (5,000+ CAs and
1277 deadlock-free by design) to reduce the runtime of the algorithm. On the
1278 other hand, you might use the SSSP routing algorithm as an alternative,
1279 when all deadlock-free routing algorithms fail to route the network for
1280 whatever reason. In the last case, SSSP was designed to deliver an
1281 equal or higher bandwidth due to better congestion avoidance than the
1282 Min Hop routing algorithm.
1283
1284 Notes for usage:
1285 a) running DFSSSP: '-R dfsssp -Q'
1286 a.1) QoS has to be configured to equally spread the load on the avail‐
1287 able SL or virtual lanes
1288 a.2) applications must perform a path record query to get path SL for
1289 each route, which the application will use to transmit packets
1290 b) running SSSP: '-R sssp'
1291 c) both algorithms support LMC > 0
1292
1293 Hints for optimizing I/O traffic:
1294 Having more nodes (I/O and compute) connected to a switch than incoming
1295 links can result in a 'bad' routing of the I/O traffic as long as
1296 (DF)SSSP routing is not aware of the dedicated I/O nodes, i.e., in the
1297 following network configuration CN1-CN3 might send all I/O traffic via
1298 Link2 to IO1,IO2:
1299
1300 CN1 Link1 IO1
1301 \ /----\ /
1302 CN2 -- Switch1 Switch2 -- CN4
1303 / \----/ \
1304 CN3 Link2 IO2
1305
1306 To prevent this from happening (DF)SSSP can use both the compute node
1307 guid file and the I/O guid file specified by the ´-u´ or
1308 ´--cn_guid_file´ and ´-G´ or ´--io_guid_file´ options (similar to the
1309 Fat-Tree routing). This ensures that traffic towards compute nodes and
1310 I/O nodes is balanced separately and therefore distributed as much as
1311 possible across the available links. Port GUIDs, as listed by ibstat,
1312 must be specified (not Node GUIDs).
1313 The priority for the optimization is as follows:
1314 compute nodes -> I/O nodes -> other nodes
1315 Possible use case scenarios:
1316 a) neither ´-u´ nor ´-G´ is specified: all nodes are treated as ´other
1317 nodes´ and therefore balanced equally;
1318 b) ´-G´ is specified: traffic towards I/O nodes will be balanced opti‐
1319 mally;
1320 c) the system has three node types, such as login/admin, compute and
1321 I/O, but the balancing focus should be I/O, then one has to use ´-u´
1322 and ´-G´ with I/O guids listed in cn_guid_file and compute node guids
1323 listed in io_guid_file;
1324 d) ...
1325
1326 Torus-2QoS Routing Algorithm
1327
1328 Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus
1329 fabrics; see torus-2QoS(8) for full documentation.
1330
1331 Use '-R torus-2QoS -Q' or '-R torus-2QoS,no_fallback -Q' to activate
1332 the torus-2QoS algorithm.
1333
1334
1335 Routing References
1336
1337 To learn more about deadlock-free routing, see the article "Deadlock
1338 Free Message Routing in Multiprocessor Interconnection Networks" by
1339 William J Dally and Charles L Seitz (1985).
1340
1341 To learn more about the up/down algorithm, see the article "Effective
1342 Strategy to Compute Forwarding Tables for InfiniBand Networks" by Jose
1343 Carlos Sancho, Antonio Robles, and Jose Duato at the Universidad
1344 Politecnica de Valencia.
1345
1346 To learn more about LASH and the flexibility behind it, the requirement
1347 for layers, performance comparisons to other algorithms, see the fol‐
1348 lowing articles:
1349
1350 "Layered Routing in Irregular Networks", Lysne et al, IEEE Transactions
1351 on Parallel and Distributed Systems, VOL.16, No12, December 2005.
1352
1353 "Routing for the ASI Fabric Manager", Solheim et al. IEEE Communica‐
1354 tions Magazine, Vol.44, No.7, July 2006.
1355
1356 "Layered Shortest Path (LASH) Routing in Irregular System Area Net‐
1357 works", Skeie et al. IEEE Computer Society Communication Architecture
1358 for Clusters 2002.
1359
1360 To learn more about the DFSSSP and SSSP routing algorithm, see the
1361 articles:
1362 J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for
1363 Arbitrary Topologies, In Proceedings of the 25th IEEE International
1364 Parallel & Distributed Processing Symposium (IPDPS 2011)
1365 T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-
1366 Scale InfiniBand Networks, In 17th Annual IEEE Symposium on High Per‐
1367 formance Interconnects (HOTI 2009)
1368
1369 Modular Routing Engine
1370
1371 The modular routing engine structure makes it easy to plug in new
1372 routing modules.
1373
1374 Currently, only unicast callbacks are supported. Multicast can be added
1375 later.
1376
1377 One existing routing module is up-down "updn", which may be activated
1378 with '-R updn' option (instead of old '-u').
1379
1380 General usage is: $ opensm -R 'module-name'
1381
1382 There is also a trivial routing module which is able to load LFT tables
1383 from a file.
1384
1385 Main features:
1386
1387 - this will load switch LFTs and/or LID matrices (min hops tables)
1388 - this will load switch LFTs according to the path entries introduced
1389 in the file
1390 - no additional checks will be performed (such as "is port connected",
1391 etc.)
1392 - in case when fabric LIDs were changed this will try to reconstruct
1393 LFTs correctly if endport GUIDs are represented in the file
1394 (in order to disable this, GUIDs may be removed from the file
1395 or zeroed)
1396
1397 The file format is compatible with output of 'ibroute' util and for
1398 whole fabric can be generated with dump_lfts.sh script.
1399
1400 To activate file based routing module, use:
1401
1402 opensm -R file -U /path/to/lfts_file
1403
1404 If the lfts_file is not found or is in error, the default routing algo‐
1405 rithm is utilized.
1406
1407 The ability to dump switch lid matrices (aka min hops tables) to file
1408 and later to load these is also supported.
1409
Usage is similar to loading unicast forwarding tables from an lfts
file (introduced by the 'file' routing engine), but the lid matrix file
name must be specified with the -M or --lid_matrix_file option. For
example:
1414
1415 opensm -R file -M ./opensm-lid-matrix.dump
1416
The dump file is named 'opensm-lid-matrix.dump' and is generated in the
standard opensm dump directory (/var/log by default) when the
OSM_LOG_ROUTING logging flag is set.
1420
When the 'file' routing engine is activated but the lfts file is not
specified or cannot be opened, the default lid matrix algorithm is used.
1423
There is also a switch forwarding tables dumper which generates a file
compatible with dump_lfts.sh output. This file can be used as input for
forwarding table loading by the 'file' routing engine. Either or both
of the -U and -M options can be specified together with '-R file'.
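
As a rough illustration of how the -U and -M options combine with
'-R file', the following Python sketch assembles the opensm command
line. The helper name file_routing_argv and its parameters are
hypothetical and exist only for this example; the only flags used are
the ones described above.

```python
import shlex


def file_routing_argv(lfts_file=None, lid_matrix_file=None):
    """Build an opensm argument list for the 'file' routing engine.

    Hypothetical helper for illustration only. Either file, both, or
    neither may be given; with no lfts file, the text above says the
    default lid matrix algorithm is used.
    """
    argv = ["opensm", "-R", "file"]
    if lfts_file is not None:
        argv += ["-U", lfts_file]        # LFTs to load
    if lid_matrix_file is not None:
        argv += ["-M", lid_matrix_file]  # lid matrices (min hops tables)
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe
    # to try on a machine without opensm installed.
    print(shlex.join(file_routing_argv("/path/to/lfts_file",
                                       "./opensm-lid-matrix.dump")))
```

The sketch only prints the command; whether to run it, and with which
files, depends on the fabric.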
1428
1429
To enable per module logging, set per_module_logging to TRUE in the
opensm options file and configure per_module_logging_file there
appropriately.
1434
1435 The per module logging config file format is a set of lines with module
1436 name and logging level as follows:
1437
1438 <module name><separator><logging level>
1439
1440 <module name> is the file name including .c
1441 <separator> is either = , space, or tab
<logging level> uses the same levels as the coarse/overall
logging, as follows:
1444
1445 BIT LOG LEVEL ENABLED
1446 ---- -----------------
1447 0x01 - ERROR (error messages)
1448 0x02 - INFO (basic messages, low volume)
1449 0x04 - VERBOSE (interesting stuff, moderate volume)
1450 0x08 - DEBUG (diagnostic, high volume)
1451 0x10 - FUNCS (function entry/exit, very high volume)
1452 0x20 - FRAMES (dumps all SMP and GMP frames)
1453 0x40 - ROUTING (dump FDB routing information)
1454 0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM logging)
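
To make the line format and the level bits above concrete, here is a
minimal Python sketch that splits one config line and decodes its level
mask. It is not OpenSM's own parser. It assumes the separator is '=',
space, or tab, and that the level is written as a number Python's
int(x, 0) accepts (for example 0x43 or 67); the exact syntax OpenSM
accepts is not stated here. The function names and the sample line are
illustrative only.

```python
import re

# Bit values copied from the table above.
LOG_LEVELS = {
    "ERROR": 0x01,
    "INFO": 0x02,
    "VERBOSE": 0x04,
    "DEBUG": 0x08,
    "FUNCS": 0x10,
    "FRAMES": 0x20,
    "ROUTING": 0x40,
    "SYS": 0x80,
}


def parse_line(line):
    """Split '<module name><separator><logging level>' into (name, mask).

    Assumption: separators are '=', space, or tab, possibly repeated.
    """
    parts = [p for p in re.split(r"[= \t]+", line.strip()) if p]
    if len(parts) != 2:
        raise ValueError("expected '<module name><separator><level>'")
    name, level = parts
    return name, int(level, 0)


def decode(mask):
    """Return the names of the log levels whose bits are set in mask."""
    return [name for name, bit in LOG_LEVELS.items() if mask & bit]


if __name__ == "__main__":
    # Sample line for illustration; the module name is only an example.
    name, mask = parse_line("osm_ucast_mgr.c=0x43")
    print(name, decode(mask))
```

A level value is the bitwise OR of the bits to enable, so 0x43 in the
sample line selects ERROR, INFO and ROUTING.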
1455
1456
1458 /etc/rdma/opensm.conf
1459 default OpenSM config file.
1460
1461
1462 /etc/rdma/ib-node-name-map
default node name map file. See ibnetdiscover for more
information on format.
1465
1466
1467 /etc/rdma/partitions.conf
1468 default partition config file
1469
1470
1471 /etc/rdma/qos-policy.conf
1472 default QOS policy config file
1473
1474
1475 /etc/rdma/prefix-routes.conf
1476 default prefix routes file
1477
1478
1479 /etc/rdma/per-module-logging.conf
1480 default per module logging config file
1481
1482
1483 /etc/rdma/torus-2QoS.conf
1484 default torus-2QoS config file
1485
1486
1488 Hal Rosenstock
1489 <hal@mellanox.com>
1490
1491 Sasha Khapyorsky
1492 <sashak@voltaire.com>
1493
1494 Eitan Zahavi
1495 <eitan@mellanox.co.il>
1496
1497 Yevgeny Kliteynik
1498 <kliteyn@mellanox.co.il>
1499
1500 Thomas Sodring
1501 <tsodring@simula.no>
1502
1503 Ira Weiny
1504 <weiny2@llnl.gov>
1505
1506 Dale Purdy
1507 <purdy@sgi.com>
1508
1509
1511 torus-2QoS(8), torus-2QoS.conf(5).
1512
1513
1514
1515OpenIB March 8, 2012 OPENSM(8)