1 OPENSM(8) OpenIB Management OPENSM(8)
2
3
4
6 opensm - InfiniBand subnet manager and administration (SM/SA)
7
8
10 opensm [--version] [-F | --config <file_name>] [-c(reate-config)
11 <file_name>] [-g(uid) <GUID in hex>] [-l(mc) <LMC>] [-p(riority) <PRI‐
12 ORITY>] [--subnet_prefix <PREFIX in hex>] [--smkey <SM_Key>] [--sm_sl
13 <SL number>] [-r(eassign_lids)] [-R <engine name(s)> | --routing_engine
14 <engine name(s)>] [--do_mesh_analysis] [--lash_start_vl <vl number>]
15 [-A | --ucast_cache] [-z | --connect_roots] [-M <file name> |
16 --lid_matrix_file <file name>] [-U <file name> | --lfts_file <file
17 name>] [-S | --sadb_file <file name>] [-a | --root_guid_file <path to
18 file>] [-u | --cn_guid_file <path to file>] [-G | --io_guid_file <path
19 to file>] [--port-shifting] [--scatter-ports <random seed>] [-H |
20 --max_reverse_hops <max reverse hops allowed>] [-X | --guid_rout‐
21 ing_order_file <path to file>] [-m | --ids_guid_file <path to file>]
22 [-o(nce)] [-s(weep) <interval>] [-t(imeout) <milliseconds>] [--retries
23 <number>] [--maxsmps <number>] [--console [off | local | socket | loop‐
24 back]] [--console-port <port>] [-i | --ignore_guids <equalize-ignore-
25 guids-file>] [-w | --hop_weights_file <path to file>] [-O |
26 --port_search_ordering_file <path to file>] [-O | --dimn_ports_file
27 <path to file>] (DEPRECATED) [-f <log file path> | --log_file <log file
28 path> ] [-L | --log_limit <size in MB>] [-e(rase_log_file)] [-P(config)
29 <partition config file> ] [-N | --no_part_enforce] (DEPRECATED) [-Z |
30 --part_enforce [both | in | out | off]] [-W | --allow_both_pkeys] [-Q |
31 --qos [-Y | --qos_policy_file <file name>]] [--congestion_control]
32 [--cc_key <key>] [-y | --stay_on_fatal] [-B | --daemon] [-J | --pidfile
33 <file_name>] [-I | --inactive] [--perfmgr] [--perfmgr_sweep_time_s
34 <seconds>] [--prefix_routes_file <path>] [--consolidate_ipv6_snm_req]
35 [--log_prefix <prefix text>] [--torus_config <path to file>]
36 [-v(erbose)] [-V] [-D <flags>] [-d(ebug) <number>] [-h(elp)] [-?]
37
38
40 opensm is an InfiniBand compliant Subnet Manager and Administration,
41 and runs on top of OpenIB.
42
43 opensm provides an implementation of an InfiniBand Subnet Manager and
44 Administration. Such a software entity is required to run in order to
45 initialize the InfiniBand hardware (at least one per InfiniBand
46 subnet).
47
48 opensm also contains an experimental version of a performance
49 manager.
50
51 opensm defaults were designed to meet the common case usage on clusters
52 with up to a few hundred nodes. Thus, in this default mode, opensm will
53 scan the IB fabric, initialize it, and sweep occasionally for changes.
54
55 opensm attaches to a specific IB port on the local machine and config‐
56 ures only the fabric connected to it. (If the local machine has other
57 IB ports, opensm will ignore the fabrics connected to those other
58 ports). If no port is specified, it will select the first "best" avail‐
59 able port.
60
61 opensm can present the available ports and prompt for a port number to
62 attach to.
63
64 By default, the run is logged to two files: /var/log/messages and
65 /var/log/opensm.log. The first file will register only general major
66 events, whereas the second will include details of reported errors. All
67 errors reported in this second file should be treated as indicators of
68 IB fabric health issues. (Note that when a fatal and non-recoverable
69 error occurs, opensm will exit.) Both log files should include the
70 message "SUBNET UP" if opensm was able to set up the subnet correctly.
71
72
74 --version
75 Prints OpenSM version and exits.
76
77 -F, --config <config file>
78 The name of the OpenSM config file. When not specified,
79 /etc/rdma/opensm.conf will be used (if it exists).
80
81 -c, --create-config <file name>
82 OpenSM will dump its configuration to the specified file and
83 exit. This is a way to generate an OpenSM configuration file
84 template.
85
86 -g, --guid <GUID in hex>
87 This option specifies the local port GUID value with which
88 OpenSM should bind. OpenSM may be bound to 1 port at a time.
89 If GUID given is 0, OpenSM displays a list of possible port
90 GUIDs and waits for user input. Without -g, OpenSM tries to use
91 the default port.
92
93 -l, --lmc <LMC value>
94 This option specifies the subnet's LMC value. The number of
95 LIDs assigned to each port is 2^LMC. The LMC value must be in
96 the range 0-7. LMC values > 0 allow multiple paths between
97 ports. LMC values > 0 should only be used if the subnet topol‐
98 ogy actually provides multiple paths between ports, i.e. multi‐
99 ple interconnects between switches. Without -l, OpenSM defaults
100 to LMC = 0, which allows one path between any two ports.
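
As a rough illustration of the 2^LMC rule above, the following Python sketch computes how many LIDs a port receives for a given LMC and lists the LIDs such a port would answer to. The function names and the base LID are hypothetical, and the aligned, contiguous LID range is an assumption (LMC masks the low-order LID bits); this page does not describe how OpenSM picks base LIDs.

```python
from typing import List


def lids_per_port(lmc: int) -> int:
    """Number of LIDs assigned to each port for a given LMC (2^LMC)."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 1 << lmc


def port_lids(base_lid: int, lmc: int) -> List[int]:
    """LIDs a port answers to, assuming an aligned contiguous range."""
    count = lids_per_port(lmc)
    if base_lid % count != 0:
        # Assumption: the base LID is aligned to a multiple of 2^LMC.
        raise ValueError("base LID is assumed to be aligned to 2^LMC")
    return list(range(base_lid, base_lid + count))


if __name__ == "__main__":
    print(lids_per_port(0))   # default LMC: one LID, one path
    print(port_lids(8, 2))    # LMC 2: four LIDs starting at the base LID
```

With the default LMC of 0 each port has exactly one LID, which matches the single-path behaviour described above.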
101
102 -p, --priority <Priority value>
103 This option specifies the SM's PRIORITY. This will affect the
104 handover cases, where master is chosen by priority and GUID.
105 Range goes from 0 (default and lowest priority) to 15 (highest).
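
The selection "by priority and GUID" can be sketched as follows. This is only an illustration, assuming the usual InfiniBand convention that the highest priority wins and that a tie is broken in favour of the lowest GUID; it does not model the actual handover state machine, and the SubnetManager type and elect_master function are hypothetical names.

```python
from typing import Iterable, NamedTuple


class SubnetManager(NamedTuple):
    guid: int       # port GUID of the SM
    priority: int   # 0 (lowest) .. 15 (highest)


def elect_master(candidates: Iterable[SubnetManager]) -> SubnetManager:
    """Pick the master: highest priority first, then lowest GUID (assumed)."""
    sms = list(candidates)
    if not sms:
        raise ValueError("no subnet managers to choose from")
    for sm in sms:
        if not 0 <= sm.priority <= 15:
            raise ValueError("priority must be in the range 0-15")
    return min(sms, key=lambda sm: (-sm.priority, sm.guid))


if __name__ == "__main__":
    sms = [
        SubnetManager(guid=0x0002C90000000010, priority=5),
        SubnetManager(guid=0x0002C90000000005, priority=5),
        SubnetManager(guid=0x0002C90000000001, priority=0),
    ]
    print(hex(elect_master(sms).guid))
```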
106
107 --subnet_prefix <PREFIX in hex>
108 This option specifies the subnet prefix to use on the fabric.
109 The default prefix is 0xfe80000000000000. OpenMPI in particular
110 requires separate fabrics plugged into different ports to have
111 different prefixes or else it won't run.
112
113 --smkey <SM_Key value>
114 This option specifies the SM's SM_Key (64 bits). This will
115 affect SM authentication. Note that OpenSM version 3.2.1 and
116 below used the default value '1' in host byte order. This is
117 fixed now, but you may need this option to interoperate with an old
118 OpenSM running on a little endian machine.
119
120 --sm_sl <SL number>
121 This option sets the SL to use for communication with the SM/SA.
122 Defaults to 0.
123
124 -r, --reassign_lids
125 This option causes OpenSM to reassign LIDs to all end nodes.
126 Specifying -r on a running subnet may disrupt subnet traffic.
127 Without -r, OpenSM attempts to preserve existing LID assignments
128 while resolving multiple uses of the same LID.
129
130 -R, --routing_engine <Routing engine names>
131 This option chooses routing engine(s) to use instead of Min Hop
132 algorithm (default). Multiple routing engines can be specified
133 separated by commas so that specific ordering of routing algo‐
134 rithms will be tried if earlier routing engines fail. If all
135 configured routing engines fail, OpenSM will always attempt to
136 route with Min Hop unless 'no_fallback' is included in the list
137 of routing engines. Supported engines: minhop, updn, dnup,
138 file, ftree, lash, dor, torus-2QoS, dfsssp, sssp.
139
140 --do_mesh_analysis
141 This option enables additional analysis for the lash routing
142 engine to precondition switch port assignments in regular carte‐
143 sian meshes which may reduce the number of SLs required to give
144 a deadlock free routing.
145
146 --lash_start_vl <vl number>
147 This option sets the starting VL to use for the lash routing
148 algorithm. Defaults to 0.
149
150 -A, --ucast_cache
151 This option enables unicast routing cache and prevents routing
152 recalculation (which is a heavy task in a large cluster) when
153 there was no topology change detected during the heavy sweep, or
154 when the topology change does not require new routing
155 calculation, e.g. when one or more CAs/RTRs/leaf switches go down,
156 or one or more of these nodes come back after being down. A
157 very common case that is handled by the unicast routing cache is
158 host reboot, which otherwise would cause two full routing recal‐
159 culations: one when the host goes down, and the other when the
160 host comes back online.
161
162 -z, --connect_roots
163 This option forces routing engines (up/down and fat-tree) to
164 provide connectivity between root switches and thus to be
165 fully IBA compliant. In many cases this can violate the "pure"
166 deadlock-free algorithm, so use it carefully.
167
168 -M, --lid_matrix_file <file name>
169 This option specifies the name of the lid matrix dump file from
170 where switch lid matrices (min hops tables) will be loaded.
171
172 -U, --lfts_file <file name>
173 This option specifies the name of the LFTs file from where
174 switch forwarding tables will be loaded when using "file" rout‐
175 ing engine.
176
177 -S, --sadb_file <file name>
178 This option specifies the name of the SA DB dump file from where
179 SA database will be loaded.
180
181 -a, --root_guid_file <file name>
182 Set the root nodes for the Up/Down or Fat-Tree routing algorithm
183 to the guids provided in the given file (one to a line).
184
185 -u, --cn_guid_file <file name>
186 Set the compute nodes for the Fat-Tree or DFSSSP/SSSP routing
187 algorithms to the port GUIDs provided in the given file (one to
188 a line).
189
190 -G, --io_guid_file <file name>
191 Set the I/O nodes for the Fat-Tree or DFSSSP/SSSP routing algo‐
192 rithms to the port GUIDs provided in the given file (one to a
193 line).
194 In the case of Fat-Tree routing:
195 I/O nodes are non-CN nodes allowed to use up to max_reverse_hops
196 switches the wrong way around to improve connectivity.
197 In the case of (DF)SSSP routing:
198 Providing guids of compute and/or I/O nodes will ensure that
199 paths towards those nodes are as much separated as possible
200 within their node category, i.e., I/O traffic will not share the
201 same link if multiple links are available.
202
203 --port-shifting
204 This option enables a feature called port shifting. In some
205 fabrics, particularly cluster environments, routes commonly
206 align and congest with other routes due to algorithmically
207 unchanging traffic patterns. This routing option will "shift"
208 routing around in an attempt to alleviate this problem.
209
210 --scatter-ports <random seed>
211 This option is used to randomize port selection in routing
212 rather than using a round-robin algorithm (which is the
213 default). Value supplied with option is used as a random seed.
214 If value is 0, which is the default, the scatter ports option is
215 disabled.
216
217 -H, --max_reverse_hops <max reverse hops allowed>
218 Set the maximum number of reverse hops an I/O node is allowed to
219 make. A reverse hop is the use of a switch the wrong way around.
220
221 -m, --ids_guid_file <file name>
222 Name of the map file with set of the IDs which will be used by
223 Up/Down routing algorithm instead of node GUIDs (format: <guid>
224 <id> per line).
225
226 -X, --guid_routing_order_file <file name>
227 Set the order port guids will be routed for the MinHop and
228 Up/Down routing algorithms to the guids provided in the given
229 file (one to a line).
230
231 -o, --once
232 This option causes OpenSM to configure the subnet once, then
233 exit. Ports remain in the ACTIVE state.
234
235 -s, --sweep <interval value>
236 This option specifies the number of seconds between subnet
237 sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM
238 defaults to a sweep interval of 10 seconds.
239
240 -t, --timeout <value>
241 This option specifies the time in milliseconds used for transac‐
242 tion timeouts. Timeout values should be > 0. Without -t,
243 OpenSM defaults to a timeout value of 200 milliseconds.
244
245 --retries <number>
246 This option specifies the number of retries used for transac‐
247 tions. Without --retries, OpenSM defaults to 3 retries for
248 transactions.
249
250 --maxsmps <number>
251 This option specifies the number of VL15 SMP MADs allowed on the
252 wire at any one time. Specifying --maxsmps 0 allows unlimited
253 outstanding SMPs. Without --maxsmps, OpenSM defaults to a maxi‐
254 mum of 4 outstanding SMPs.
255
256 --console [off | local | loopback | socket]
257 This option brings up the OpenSM console (default off). Note,
258 loopback and socket open a socket which can be connected to
259 WITHOUT CREDENTIALS. Loopback is safer if access to your SM
260 host is controlled. tcp_wrappers (hosts.[allow|deny]) is used
261 with loopback and socket. loopback and socket will only be
262 available if OpenSM was built with --enable-console-loopback
263 (default yes) and --enable-console-socket (default no) respec‐
264 tively.
265
266 --console-port <port>
267 Specify an alternate telnet port for the socket console (default
268 10000). Note that this option only appears if OpenSM was built
269 with --enable-console-socket.
270
271 -i, --ignore_guids <equalize-ignore-guids-file>
272 This option provides the means to define a set of ports (by node
273 guid and port number) that will be ignored by the link load
274 equalization algorithm.
275
276 -w, --hop_weights_file <path to file>
277 This option provides weighting factors per port representing a
278 hop cost in computing the lid matrix. The file consists of
279 lines containing a switch port GUID (specified as a 64 bit hex
280 number, with leading 0x), output port number, and weighting fac‐
281 tor. Any port not listed in the file defaults to a weighting
282 factor of 1. Lines starting with # are comments. Weights
283 affect only the output route from the port, so many useful con‐
284 figurations will require weights to be specified in pairs.
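
A minimal sketch of reading such a hop weights file is shown below. It assumes the three fields on each line are separated by white space, which this page does not state explicitly; parse_hop_weights and hop_weight are hypothetical helper names, not part of OpenSM.

```python
from typing import Dict, Tuple

WeightTable = Dict[Tuple[int, int], int]


def parse_hop_weights(text: str) -> WeightTable:
    """Map (switch port GUID, output port number) to a weighting factor."""
    weights: WeightTable = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        # Assumption: fields are whitespace-separated.
        guid_s, port_s, weight_s = line.split()
        weights[(int(guid_s, 16), int(port_s, 0))] = int(weight_s, 0)
    return weights


def hop_weight(weights: WeightTable, guid: int, port: int) -> int:
    """Ports not listed in the file default to a weighting factor of 1."""
    return weights.get((guid, port), 1)


if __name__ == "__main__":
    sample = """\
# weights are given in pairs, one line per direction of the link
0x0002c90000000001 3 4
0x0002c90000000002 7 4
"""
    table = parse_hop_weights(sample)
    print(hop_weight(table, 0x0002C90000000001, 3))
    print(hop_weight(table, 0x0002C90000000001, 4))
```

Because a weight affects only the output route from the listed port, the sample lists both ends of the same (made-up) link.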
285
286 -O, --port_search_ordering_file <path to file>
287 This option tweaks the routing. It is suitable for two cases: 1.
288 While using DOR routing algorithm. This option provides a map‐
289 ping between hypercube dimensions and ports on a per switch
290 basis for the DOR routing engine. The file consists of lines
291 containing a switch node GUID (specified as a 64 bit hex number,
292 with leading 0x) followed by a list of non-zero port numbers,
293 separated by spaces, one switch per line. The order for the
294 port numbers is in one to one correspondence to the dimensions.
295 Ports not listed on a line are assigned to the remaining dimen‐
296 sions, in port order. Anything after a # is a comment. 2.
297 While using general routing algorithm. This option provides the
298 order of the ports that would be chosen for routing, from each
299 switch rather than searching for an appropriate port from port 1
300 to N. The file consists of lines containing a switch node GUID
301 (specified as a 64 bit hex number, with leading 0x) followed by
302 a list of non-zero port numbers, separated by spaces, one switch
303 per line. In case of DOR, the order for the port numbers is in
304 one to one correspondence to the dimensions. Ports not listed
305 on a line are assigned to the remaining dimensions, in port
306 order. Anything after a # is a comment.
307
308 -O, --dimn_ports_file <path to file> (DEPRECATED)
309 This is a deprecated flag. Please use --port_search_order‐
310 ing_file instead. This option provides a mapping between hyper‐
311 cube dimensions and ports on a per switch basis for the DOR
312 routing engine. The file consists of lines containing a switch
313 node GUID (specified as a 64 bit hex number, with leading 0x)
314 followed by a list of non-zero port numbers, separated by spa‐
315 ces, one switch per line. The order for the port numbers is in
316 one to one correspondence to the dimensions. Ports not listed
317 on a line are assigned to the remaining dimensions, in port
318 order. Anything after a # is a comment.
319
320 -x, --honor_guid2lid
321 This option forces OpenSM to honor the guid2lid file, when it
322 comes out of Standby state, if such file exists under
323 OSM_CACHE_DIR, and is valid. By default, this is FALSE.
324
325 -f, --log_file <file name>
326 This option defines the log to be the given file. By default,
327 the log goes to /var/log/opensm.log. For the log to go to stan‐
328 dard output use -f stdout.
329
330 -L, --log_limit <size in MB>
331 This option defines maximal log file size in MB. When specified
332 the log file will be truncated upon reaching this limit.
333
334 -e, --erase_log_file
335 This option will cause deletion of the log file (if it already
336 exists). By default, the log file is accumulative.
337
338 -P, --Pconfig <partition config file>
339 This option defines the optional partition configuration file.
340 The default name is /etc/rdma/partitions.conf.
341
342 --prefix_routes_file <file name>
343 Prefix routes control how the SA responds to path record queries
344 for off-subnet DGIDs. By default, the SA fails such queries.
345 The PREFIX ROUTES section below describes the format of the con‐
346 figuration file. The default path is
347 /etc/rdma/prefix-routes.conf.
348
349 -Q, --qos
350 This option enables QoS setup. It is disabled by default.
351
352 -Y, --qos_policy_file <file name>
353 This option defines the optional QoS policy file. The default
354 name is /etc/rdma/qos-policy.conf. See QoS_manage‐
355 ment_in_OpenSM.txt in opensm doc for more information on config‐
356 uring QoS policy via this file.
357
358 --congestion_control
359 (EXPERIMENTAL) This option enables congestion control configuration.
360 It is disabled by default. See the config file for its options.
361
362 --cc_key <key>
363 (EXPERIMENTAL) This option sets the CCkey used to configure congestion
364 control (default 0). It does not configure a new CCkey into switches and CAs.
365
366 -N, --no_part_enforce (DEPRECATED)
367 This is a deprecated flag. Please use --part_enforce instead.
368 This option disables partition enforcement on switch external
369 ports.
370
371 -Z, --part_enforce [both | in | out | off]
372 This option indicates the partition enforcement type (for
373 switches). Enforcement type can be inbound only (in), outbound
374 only (out), both or disabled (off). Default is both.
375
376 -W, --allow_both_pkeys
377 This option indicates whether both full and limited membership
378 on the same partition can be configured in the PKeyTable.
379 Default is not to allow both pkeys.
380
381 -y, --stay_on_fatal
382 This option will cause SM not to exit on fatal initialization
383 issues: if SM discovers duplicated guids or a 12x link with lane
384 reversal badly configured. By default, the SM will exit on
385 these errors.
386
387 -B, --daemon
388 Run in daemon mode - OpenSM will run in the background.
389
390 -J, --pidfile <file_name>
391 Makes the SM write its own PID to the specified file when
392 started in daemon mode.
393
394 -I, --inactive
395 Start SM in inactive rather than init SM state. This option can
396 be used in conjunction with the perfmgr so as to run a stand‐
397 alone performance manager without SM/SA. However, this is NOT
398 currently implemented in the performance manager.
399
400 --perfmgr
401 Enable the perfmgr. Only takes effect if --enable-perfmgr was
402 specified at configure time. See performance-manager-HOWTO.txt
403 in opensm doc for more information on running perfmgr.
404
405 --perfmgr_sweep_time_s <seconds>
406 Specify the sweep time for the performance manager in seconds
407 (default is 180 seconds). Only takes effect if --enable-perfmgr
408 was specified at configure time.
409
410 --consolidate_ipv6_snm_req
411 Use shared MLID for IPv6 Solicited Node Multicast groups per
412 MGID scope and P_Key.
413
414 --log_prefix <prefix text>
415 This option specifies the prefix to the syslog messages from
416 OpenSM. A suitable prefix can be used to identify the IB subnet
417 in syslog messages when two or more instances of OpenSM run in a
418 single node to manage multiple fabrics. For example, in a dual-
419 fabric (or dual-rail) IB cluster, the prefix for the first fab‐
420 ric could be "mpi" and the other fabric could be "storage".
421
422 --torus_config <path to torus-2QoS config file>
423 This option defines the file name for the extra configuration
424 information needed for the torus-2QoS routing engine. The
425 default name is /etc/rdma/torus-2QoS.conf
426
427 -v, --verbose
428 This option increases the log verbosity level. The -v option
429 may be specified multiple times to further increase the ver‐
430 bosity level. See the -D option for more information about log
431 verbosity.
432
433 -V This option sets the maximum verbosity level and forces log
434 flushing. The -V option is equivalent to '-D 0xFF -d 2'. See
435 the -D option for more information about log verbosity.
436
437 -D <value>
438 This option sets the log verbosity level. A flags field must
439 follow the -D option. A bit set/clear in the flags enables/dis‐
440 ables a specific log level as follows:
441
442 BIT LOG LEVEL ENABLED
443 ---- -----------------
444 0x01 - ERROR (error messages)
445 0x02 - INFO (basic messages, low volume)
446 0x04 - VERBOSE (interesting stuff, moderate volume)
447 0x08 - DEBUG (diagnostic, high volume)
448 0x10 - FUNCS (function entry/exit, very high volume)
449 0x20 - FRAMES (dumps all SMP and GMP frames)
450 0x40 - ROUTING (dump FDB routing information)
451 0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM log‐
452 ging)
453
454 Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying
455 -D 0 disables all messages. Specifying -D 0xFF enables all mes‐
456 sages (see -V). High verbosity levels may require increasing
457 the transaction timeout with the -t option.
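
The flags value passed to -D is simply the bitwise OR of the levels in the table above. The following Python sketch composes such a value; the LogLevel enum and the d_option helper are hypothetical names used only for illustration.

```python
from enum import IntFlag


class LogLevel(IntFlag):
    """Log level bits as listed for the -D option."""
    ERROR = 0x01
    INFO = 0x02
    VERBOSE = 0x04
    DEBUG = 0x08
    FUNCS = 0x10
    FRAMES = 0x20
    ROUTING = 0x40
    SYS = 0x80


# Default when -D is not given: ERROR + INFO.
DEFAULT_LEVEL = LogLevel.ERROR | LogLevel.INFO


def d_option(flags: LogLevel) -> str:
    """Format a set of log levels as a -D command-line argument."""
    return "-D 0x{:02X}".format(int(flags))


if __name__ == "__main__":
    print(d_option(DEFAULT_LEVEL))
    print(d_option(DEFAULT_LEVEL | LogLevel.VERBOSE | LogLevel.ROUTING))
```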
458
459 -d, --debug <value>
460 This option specifies a debug option. These options are not
461 normally needed. The number following -d selects the debug
462 option to enable as follows:
463
464 OPT Description
465 --- -----------------
466 -d0 - Ignore other SM nodes
467 -d1 - Force single threaded dispatching
468 -d2 - Force log flushing after each log message
469 -d3 - Disable multicast support
470
471 -h, --help
472 Display this usage info then exit.
473
474 -? Display this usage info then exit.
475
476
478 The following environment variables control opensm behavior:
479
480 OSM_TMP_DIR - controls the directory in which the temporary files gen‐
481 erated by opensm are created. These files are: opensm-subnet.lst,
482 opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.
483
484 OSM_CACHE_DIR - opensm stores certain data to the disk such that subse‐
485 quent runs are consistent. The default directory used is
486 /var/cache/opensm. The following files are included in it:
487
488 guid2lid - stores the LID range assigned to each GUID
489 guid2mkey - stores the MKey previously assigned to each GUID
490 neighbors - stores a map of the GUIDs at either end of each link
491 in the fabric
492
493
495 When opensm receives a HUP signal, it starts a new heavy sweep as if a
496 trap was received or a topology change was found.
497
498 Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log
499 for logrotate purposes.
500
501
503 The default name of OpenSM partitions configuration file is
504 /etc/rdma/partitions.conf. The default may be changed by using the
505 --Pconfig (-P) option with OpenSM.
506
507 The default partition will be created by OpenSM unconditionally even
508 when partition configuration file does not exist or cannot be accessed.
509
510 The default partition has P_Key value 0x7fff. OpenSM's port will always
511 have full membership in default partition. All other end ports will
512 have full membership if the partition configuration file is not found
513 or cannot be accessed, or limited membership if the file exists and can
514 be accessed but there is no rule for the Default partition.
515
516 Effectively, this amounts to the same as if one of the following rules
517 appeared in the partition configuration file.
518
519 In the case of no rule for the Default partition:
520
521 Default=0x7fff : ALL=limited, SELF=full ;
522
523 In the case of no partition configuration file or file cannot be
524 accessed:
525
526 Default=0x7fff : ALL=full ;
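
To make the difference between full and limited membership concrete, here is a small Python sketch of how a 16-bit P_Key table entry could be derived from a partition's P_Key. It assumes the InfiniBand convention that the low 15 bits carry the key and the most significant bit marks full membership; the constant and function names are hypothetical.

```python
PKEY_BASE_MASK = 0x7FFF    # low 15 bits: the partition key itself
FULL_MEMBER_BIT = 0x8000   # assumed: top bit set means full membership


def pkey_table_value(pkey: int, full: bool) -> int:
    """Return the 16-bit P_Key entry for a full or limited member."""
    base = pkey & PKEY_BASE_MASK
    return (base | FULL_MEMBER_BIT) if full else base


if __name__ == "__main__":
    # Default partition, as seen by a full and by a limited member.
    print(hex(pkey_table_value(0x7FFF, full=True)))
    print(hex(pkey_table_value(0x7FFF, full=False)))
```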
527
528
529 File Format
530
531 Comments:
532
533 Everything after a '#' character on a line is a comment and is ignored
534 by the parser.
535
536 General file format:
537
538 <Partition Definition>:[<newline>]<Partition Properties>;
539
540 Partition Definition:
541 [PartitionName][=PKey][,indx0][,ipoib_bc_flags]
542 [,defmember=full|limited|both]
543
544 PartitionName - string, will be used with logging. When
545 omitted, empty string will be used.
546 PKey - P_Key value for this partition. Only low 15
547 bits will be used. When omitted will be
548 autogenerated.
549 indx0 - indicates that this pkey should be inserted in
550 block 0 index 0.
551 ipoib_bc_flags - used to indicate/specify IPoIB capability of
552 this partition.
553
554 defmember=full|limited|both - specifies default membership for
555 port guid list. Default is limited.
556
557 ipoib_bc_flags:
558 ipoib_flag|[mgroup_flag]*
559
560 ipoib_flag:
561 ipoib - indicates that this partition may be used for
562 IPoIB, as a result the IPoIB broadcast group will
563 be created with the mgroup_flag flags given,
564 if any.
565
566 Partition Properties:
567 [<Port list>|<MCast Group>]* | <Port list>
568
569 Port list:
570 <Port Specifier>[,<Port Specifier>]
571
572 Port Specifier:
573 <PortGUID>[=[full|limited|both]]
574
575 PortGUID - GUID of partition member EndPort.
576 Hexadecimal numbers should start from
577 0x, decimal numbers are accepted too.
578 full, limited, both - indicates full and/or limited membership for
579 this port. When omitted (or unrecognized),
580 limited membership is assumed. Both
581 indicates both full and limited membership
582 for this port.
583
584 MCast Group:
585 mgid=gid[,mgroup_flag]*<newline>
586
587 - gid specified is verified to be a Multicast
588 address. IP groups are verified to match
589 the rate and mtu of the broadcast group.
590 The P_Key bits of the mgid for IP groups are
591 verified to either match the P_Key specified
592 by "Partition Definition" or, if they are
593 0x0000 the P_Key will be copied into those
594 bits.
595
596 mgroup_flag:
597 rate=<val> - specifies rate for this MC group
598 (default is 3 (10 Gbps))
599 mtu=<val> - specifies MTU for this MC group
600 (default is 4 (2048))
601 sl=<val> - specifies SL for this MC group
602 (default is 0)
603 scope=<val> - specifies scope for this MC group
604 (default is 2 (link local)). Multiple scope
605 settings are permitted for a partition.
606 NOTE: This overwrites the scope nibble of the
607 specified mgid. Furthermore specifying
608 multiple scope settings will result in
609 multiple MC groups being created.
610 Q_Key=<val> - specifies the Q_Key for this MC group
611 (default: 0x0b1b for IP groups, 0 for other
612 groups)
613 WARNING: changing this for the broadcast
614 group may break IPoIB on client
615 nodes!!
616 TClass=<val> - specifies tclass for this MC group
617 (default is 0)
618 FlowLabel=<val> - specifies FlowLabel for this MC group
619 (default is 0)
620
621 Note that values for rate, mtu, and scope, for both partitions and mul‐
622 ticast groups, should be specified as defined in the IBTA specification
623 (for example, mtu=4 for 2048).
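
Since rate and mtu take encoded values rather than raw numbers, a lookup table helps when writing mgroup_flag settings. The Python sketch below lists the encodings as I understand them from the IBTA specification (only the commonly cited ones; newer rates are omitted) and should be checked against the specification before use. The table and function names are hypothetical.

```python
from typing import Dict

# Encoded MTU value -> MTU in bytes.
MTU_BYTES: Dict[int, int] = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}

# Encoded rate value -> rate in Gbps (partial list, assumed from the IBTA spec).
RATE_GBPS: Dict[int, float] = {
    2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120,
}


def encode_mtu(mtu_bytes: int) -> int:
    """Return the encoded value to use in 'mtu=<val>'."""
    for code, size in MTU_BYTES.items():
        if size == mtu_bytes:
            return code
    raise ValueError("unsupported MTU: %d" % mtu_bytes)


def encode_rate(gbps: float) -> int:
    """Return the encoded value to use in 'rate=<val>'."""
    for code, rate in RATE_GBPS.items():
        if rate == gbps:
            return code
    raise ValueError("unsupported rate: %s" % gbps)


if __name__ == "__main__":
    # The documented defaults: rate=3 and mtu=4.
    print("rate=%d,mtu=%d" % (encode_rate(10), encode_mtu(2048)))
```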
624
625 There are several useful keywords for PortGUID definition:
626
627 - 'ALL' means all end ports in this subnet.
628 - 'ALL_CAS' means all Channel Adapter end ports in this subnet.
629 - 'ALL_SWITCHES' means all Switch end ports in this subnet.
630 - 'ALL_ROUTERS' means all Router end ports in this subnet.
631 - 'SELF' means subnet manager's port.
632
633 Empty list means no ports in this partition.
634
635 Notes:
636
637 White space is permitted between delimiters ('=', ',',':',';').
638
639 PartitionName does not need to be unique, PKey does need to be unique.
640 If PKey is repeated then those partition configurations will be merged
641 and first PartitionName will be used (see also next note).
642
643 It is possible to split partition configuration in more than one defi‐
644 nition, but then PKey should be explicitly specified (otherwise differ‐
645 ent PKey values will be generated for those definitions).
646
647 Examples:
648
649 Default=0x7fff : ALL, SELF=full ;
650 Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ;
651
652 NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306
653 ;
654
655 YetAnotherOne = 0x300 : SELF=full ;
656 YetAnotherOne = 0x300 : ALL=limited ;
657
658 ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
659 # 0x123453, 0x123454 will be limited
660 ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
661 # 0x123456, 0x123457 will be limited
662 ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457,
663 0x123458=full;
664 ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
665 ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited,
666 0x12345d;
667
668 # multicast groups added to default
669 Default=0x7fff,ipoib:
670 mgid=ff12:401b::0707,sl=1 # random IPv4 group
671 mgid=ff12:601b::16 # MLDv2-capable routers
672 mgid=ff12:401b::16 # IGMP
673 mgid=ff12:601b::2 # All routers
674 mgid=ff12::1,sl=1,Q_Key=0xDEADBEEF,rate=3,mtu=2 # random group
675 ALL=full;
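
The membership rules for a port list can be sketched in Python as follows. This is an illustration of the rules stated in this section, not OpenSM's parser: a port without an explicit membership takes the defmember value (limited unless specified), and an unrecognized membership string is treated as limited. How OpenSM treats an unrecognized value when defmember=full is an assumption here, and parse_port_list is a hypothetical name.

```python
from typing import Dict

VALID_MEMBERSHIPS = ("full", "limited", "both")


def parse_port_list(spec: str, defmember: str = "limited") -> Dict[str, str]:
    """Map each port specifier in a comma-separated list to its membership."""
    if defmember not in VALID_MEMBERSHIPS:
        raise ValueError("defmember must be full, limited or both")
    members: Dict[str, str] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        port, _, membership = item.partition("=")
        membership = membership.strip()
        if not membership:
            membership = defmember      # omitted: use the partition default
        elif membership not in VALID_MEMBERSHIPS:
            membership = "limited"      # unrecognized: assumed limited
        members[port.strip()] = membership
    return members


if __name__ == "__main__":
    print(parse_port_list("0x123456=full, 0x3456789034=limi, 0x2134af2306"))
    print(parse_port_list("0x12345b, 0x12345c=limited, 0x12345d", "full"))
```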
676
677
678 Note:
679
680 The following rule is equivalent to how OpenSM used to run prior to the
681 partition manager:
682
683 Default=0x7fff,ipoib:ALL=full;
684
685
687 There is a set of QoS related low-level configuration parameters. All
688 these parameter names are prefixed by "qos_" string. Here is a full
689 list of these parameters:
690
691 qos_max_vls - The maximum number of VLs that will be on the subnet
692 qos_high_limit - The limit of High Priority component of VL
693 Arbitration table (IBA 7.6.9)
694 qos_vlarb_low - Low priority VL Arbitration table (IBA 7.6.9)
695 template
696 qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
697 template
698 Both VL arbitration templates are pairs of
699 VL and weight
700 qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is
701 a list of VLs corresponding to SLs 0-15 (Note
702 that VL15 used here means drop this SL)
703
704 Typical default values (hard-coded in OpenSM initialization) are:
705
706 qos_max_vls 15
707 qos_high_limit 0
708 qos_vlarb_low
709 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
710 qos_vlarb_high
711 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
712 qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
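
The value syntax of these parameters is easy to check mechanically. The Python sketch below splits a VL arbitration template into (VL, weight) pairs and an SL2VL template into a list indexed by SL; the helper names are hypothetical and the parsing is a simplification of whatever OpenSM itself does.

```python
from typing import List, Tuple


def parse_vlarb(value: str) -> List[Tuple[int, int]]:
    """Parse 'VL:weight,VL:weight,...' into a list of (VL, weight) pairs."""
    pairs = []
    for item in value.split(","):
        vl, weight = item.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


def parse_sl2vl(value: str) -> List[int]:
    """Parse a comma-separated VL list; index i is the VL used for SL i."""
    return [int(vl) for vl in value.split(",")]


if __name__ == "__main__":
    vlarb_high = "0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0"
    sl2vl = "0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7"
    print(parse_vlarb(vlarb_high)[:2])
    print(parse_sl2vl(sl2vl)[15])   # VL that SL 15 maps to in the defaults
```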
713
714 The syntax is compatible with the rest of the OpenSM configuration options,
715 and values may be stored in the OpenSM config file (cached options file).
716
717 In addition to the above, we may define separate QoS configuration
718 parameters sets for various target types. As targets, we currently sup‐
719 port CAs, routers, switch external ports, and switch's enhanced port 0.
720 The names of such specialized parameters are prefixed by "qos_<type>_"
721 string. Here is a full list of the currently supported sets:
722
723 qos_ca_ - QoS configuration parameters set for CAs.
724 qos_rtr_ - parameters set for routers.
725 qos_sw0_ - parameters set for switches' port 0.
726 qos_swe_ - parameters set for switches' external ports.
727
728 Examples:
729 qos_sw0_max_vls=2
730 qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
731 qos_swe_high_limit=0
732
733
735 Prefix routes control how the SA responds to path record queries for
736 off-subnet DGIDs. By default, the SA fails such queries. Note that
737 IBA does not specify how the SA should obtain off-subnet path record
738 information. The prefix routes configuration is meant as a stop-gap
739 until the specification is completed.
740
741 Each line in the configuration file is a 64-bit prefix followed by a
742 64-bit GUID, separated by white space. The GUID specifies the router
743 port on the local subnet that will handle the prefix. Blank lines are
744 ignored, as is anything between a # character and the end of the line.
745 The prefix and GUID are both in hex; the leading 0x is optional.
746 Either, or both, can be wild-carded by specifying an asterisk instead
747 of an explicit prefix or GUID.
748
749 When responding to a path record query for an off-subnet DGID, opensm
750 searches for the first prefix match in the configuration file. There‐
751 fore, the order of the lines in the configuration file is important: a
752 wild-carded prefix at the beginning of the configuration file renders
753 all subsequent lines useless. If there is no match, then opensm fails
754 the query. It is legal to repeat prefixes in the configuration file;
755 opensm will return the path to the first available matching router. A
756 configuration file with a single line where both prefix and GUID are
757 wild-carded means that a path record query specifying any off-subnet
758 DGID should return a path to the first available router. This configu‐
759 ration yields the same behavior formerly achieved by compiling opensm
760 with -DROUTER_EXP which has been obsoleted.
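
The file format and the first-match rule described above can be sketched in
Python as follows. This is an illustration only, not the OpenSM parser: the
function names are hypothetical, a wildcard is represented as None, and
router availability (which decides between repeated prefixes) is not modeled.

```python
# Sketch of first-match lookup over prefix-route lines ("<prefix> <guid>").
# parse_prefix_routes and first_match are hypothetical names.

def parse_field(text):
    # "*" is a wildcard (None here); otherwise hex, leading 0x optional.
    return None if text == "*" else int(text, 16)

def parse_prefix_routes(lines):
    routes = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop "#" comments
        if not line:
            continue                          # blank lines are ignored
        prefix, guid = line.split()
        routes.append((parse_field(prefix), parse_field(guid)))
    return routes

def first_match(routes, dgid_prefix):
    for prefix, guid in routes:
        if prefix is None or prefix == dgid_prefix:
            return guid                       # None means "any router"
    raise LookupError("no prefix route matches; the query fails")
```

Because the search stops at the first match, placing a line such as "* *"
first would shadow every line after it.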
761
762
764 OpenSM supports configuring a single management key (MKey) for use
765 across the subnet.
766
767 The following configuration options are available:
768
769 m_key - the 64-bit MKey to be used on the subnet
770 (IBA 14.2.4)
771 m_key_protection_level - the numeric value of the MKey ProtectBits
772 (IBA 14.2.4.1)
773 m_key_lease_period - the number of seconds a CA will wait for a
774 response from the SM before resetting the
775 protection level to 0 (IBA 14.2.4.2).
776
777 OpenSM will configure all ports with the MKey specified by m_key,
778 defaulting to a value of 0. An m_key value of 0 disables MKey protection
779 on the subnet. Switches and HCAs with a non-zero MKey will not accept
780 requests to change their configuration unless the request includes the
781 proper MKey.
782
783 MKey Protection Levels
784
785 MKey protection levels modify how switches and CAs respond to SMPs
786 lacking a valid MKey. OpenSM will configure each port's ProtectBits to
787 support the level defined by the m_key_protection_level parameter. If
788 no parameter is specified, OpenSM defaults to operating at protection
789 level 0.
790
791 There are currently 4 protection levels defined by the IBA:
792
793 0 - Queries return valid data, including MKey. Configuration changes
794 are not allowed unless the request contains a valid MKey.
795 1 - Like level 0, but the MKey is set to 0 (0x00000000) in queries,
796 unless the request contains a valid MKey.
797 2 - Neither queries nor configuration changes are allowed, unless the
798 request contains a valid MKey.
799 3 - Identical to 2. Maintained for backwards compatibility.
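
The four levels can be summarized by a small Python model of how a port
answers an SMP. This is a toy model of the behavior listed above, not OpenSM
or firmware code; smp_allowed and its arguments are hypothetical names.

```python
# Toy model of a port's response to an SMP at each MKey protection level.
# smp_allowed is a hypothetical name used only for illustration.

def smp_allowed(method, level, port_mkey, smp_mkey):
    """Return (accepted, mkey_visible) for a "Get" or "Set" request."""
    if port_mkey == 0 or smp_mkey == port_mkey:
        return True, True      # protection disabled, or valid MKey supplied
    if method == "Set":
        return False, False    # changes always need a valid MKey
    if level == 0:
        return True, True      # query answered, MKey included
    if level == 1:
        return True, False     # query answered, MKey reported as 0
    return False, False        # levels 2 and 3: query refused
```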
800
801 MKey Lease Period
802
803 InfiniBand supports an MKey lease timeout, which is intended to allow
804 administrators or a new SM to recover/reset lost MKeys on a fabric.
805
806 If MKeys are enabled on the subnet and a switch or CA receives a
807 request that requires a valid MKey but does not contain one, it warns
808 the SM by sending a trap (Bad M_Key, Trap 256). If the MKey lease
809 period is non-zero, it also starts a countdown timer for the time spec‐
810 ified by the lease period. If an SM (or other agent) responds with the
811 correct MKey, the timer is stopped and reset. Should the timer reach
812 zero, the switch or CA will reset its MKey protection level to 0,
813 exposing the MKey and allowing recovery.
814
815 OpenSM will initialize all ports to use an MKey lease period of the num‐
816 ber of seconds specified in the config file. If no m_key_lease_period
817 is specified, a default of 0 will be used.
818
819 OpenSM normally quickly responds to all Bad_M_Key traps, resetting the
820 lease timers. Additionally, OpenSM's subnet sweeps will also cancel
821 any running timers. For maximum protection against accidentally-
822 exposed MKeys, the MKey lease time should be a few multiples of the
823 subnet sweep time. If OpenSM detects at startup that your sweep inter‐
824 val is greater than your MKey lease period, it will reset the lease
825 period to be greater than the sweep interval. Similarly, if sweeping
826 is disabled at startup, it will be re-enabled with an interval less
827 than the MKey lease period.
828
829 If OpenSM is required to recover a subnet for which it is missing
830 MKeys, it must do so one switch level at a time. As such, the total
831 time to recover the subnet may be as long as the MKey lease period mul‐
832 tiplied by the maximum number of hops between the SM and an endpoint,
833 plus one.
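
As a worked example of this bound, the sketch below assumes that "plus one"
applies to the hop count, that is, one lease period per switch level. The
function name is hypothetical and the reading itself is an assumption.

```python
def worst_case_recovery_seconds(lease_period_s, max_hops):
    # Assumed reading: one lease period per level, with (max_hops + 1)
    # levels between the SM and the farthest endpoint.
    return lease_period_s * (max_hops + 1)
```

Under that reading, a 60 second lease period and a maximum of 4 hops give a
worst case of 60 * (4 + 1) = 300 seconds.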
834
835 MKey Effects on Diagnostic Utilities
836
837 Setting a MKey may have a detrimental effect on diagnostic software run
838 on the subnet, unless your diagnostic software is able to retrieve
839 MKeys from the SA or can be explicitly configured with the proper MKey.
840 This is particularly true at protection level 2, where CAs will ignore
841 queries for management information that do not contain the proper MKey.
842
843
845 OpenSM now offers nine routing engines:
846
847 1. Min Hop Algorithm - based on the minimum hops to each node where
848 the path length is optimized.
849
850 2. UPDN Unicast routing algorithm - also based on the minimum hops to
851 each node, but it is constrained to ranking rules. This algorithm
852 should be chosen if the subnet is not a pure Fat Tree, and deadlock may
853 occur due to a loop in the subnet.
854
855 3. DNUP Unicast routing algorithm - similar to UPDN but allows routing
856 in fabrics which have some CA nodes attached closer to the roots than
857 some switch nodes.
858
859 4. Fat Tree Unicast routing algorithm - this algorithm optimizes rout‐
860 ing for congestion-free "shift" communication pattern. It should be
861 chosen if a subnet is a symmetrical or almost symmetrical fat-tree of
862 various types, not just K-ary-N-Trees: non-constant K, not fully
863 staffed, any Constant Bisectional Bandwidth (CBB) ratio. Similar to
864 UPDN, Fat Tree routing is constrained to ranking rules.
865
866 5. LASH unicast routing algorithm - uses InfiniBand virtual layers (SL)
867 to provide deadlock-free shortest-path routing while also distributing
868 the paths between layers. LASH is an alternative deadlock-free topol‐
869 ogy-agnostic routing algorithm to the non-minimal UPDN algorithm avoid‐
870 ing the use of a potentially congested root node.
871
872 6. DOR Unicast routing algorithm - based on the Min Hop algorithm, but
873 avoids port equalization except for redundant links between the same
874 two switches. This provides deadlock free routes for hypercubes when
875 the fabric is cabled as a hypercube and for meshes when cabled as a
876 mesh (see details below).
877
878 7. Torus-2QoS unicast routing algorithm - a DOR-based routing algorithm
879 specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-
880 free routing while supporting two quality of service (QoS) levels. In
881 addition it is able to route around multiple failed fabric links or a
882 single failed fabric switch without introducing deadlocks, and without
883 changing path SL values granted before the failure.
884
885 8. DFSSSP unicast routing algorithm - a deadlock-free single-source-
886 shortest-path routing, which uses the SSSP algorithm (see algorithm 9)
887 as the base to optimize link utilization and uses InfiniBand virtual
888 lanes (SL) to provide deadlock-freedom.
889
890 9. SSSP unicast routing algorithm - a single-source-shortest-path rout‐
891 ing algorithm, which globally balances the number of routes per link to
892 optimize link utilization. This routing algorithm has no restrictions
893 in terms of the underlying topology.
894
895 OpenSM also supports a file method which can load routes from a table.
896 See ´Modular Routing Engine´ for more information on this.
897
898 The basic routing algorithm is comprised of two stages:
899
900 1. MinHop matrix calculation
901 How many hops are required to get from each port to each LID ?
902 The algorithm to fill these tables is different if you run standard
903 (min hop) or Up/Down.
904 For standard routing, a "relaxation" algorithm is used to propagate
905 min hop from every destination LID through neighbor switches.
906 For Up/Down routing, a BFS from every target is used. The BFS tracks
907 link direction (up or down) and avoids steps that would go up after
908 a down step was used.
909
910 2. Once MinHop matrices exist, each switch is visited and for each tar‐
911 get LID a decision is made as to what port should be used to get to
912 that LID.
913 This step is common to standard and Up/Down routing. Each port has a
914 counter counting the number of target LIDs going through it.
915 When there are multiple alternative ports with the same MinHop to a LID,
916 the one with fewer previously assigned LIDs is selected.
917 If LMC > 0, more checks are added: Within each group of LIDs
918 assigned to the same target port,
919 a. use only ports which have the same MinHop
920 b. first prefer the ones that go to a different systemImageGuid (than
921 the previous LID of the same LMC group)
922 c. if none - prefer those which go through another NodeGuid
923 d. fall back to the number of paths method (if all go to same node).
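
The second stage can be sketched in Python for a single switch as follows.
This is a simplified illustration, not the OpenSM implementation:
assign_ports is a hypothetical name, min_hops[port][lid] is assumed to hold
the stage 1 result, and the additional LMC > 0 checks are not modeled.

```python
# Sketch of stage 2 on one switch: pick an output port per target LID.
# min_hops[port][lid] is the hop count to lid through that port.
# assign_ports is a hypothetical name; LMC > 0 tie-breaks are omitted.

def assign_ports(min_hops, lids):
    load = {port: 0 for port in min_hops}  # target LIDs routed per port
    lft = {}
    for lid in lids:
        best = min(hops[lid] for hops in min_hops.values())
        candidates = [p for p in sorted(min_hops) if min_hops[p][lid] == best]
        port = min(candidates, key=lambda p: load[p])  # fewest LIDs so far
        lft[lid] = port
        load[port] += 1
    return lft
```

With two ports that are equally close to several LIDs, the counters make the
assignment alternate between the ports instead of stacking every LID on one.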
924
925 Effect of Topology Changes
926
927 OpenSM will preserve existing routing in any case where there is no
928 change in the fabric switches unless the -r (--reassign_lids) option is
929 specified.
930
931 -r
932 --reassign_lids
933 This option causes OpenSM to reassign LIDs to all
934 end nodes. Specifying -r on a running subnet
935 may disrupt subnet traffic.
936 Without -r, OpenSM attempts to preserve existing
937 LID assignments resolving multiple use of same LID.
938
939 If a link is added or removed, OpenSM does not recalculate the routes
940 that do not have to change. A route has to change if the port is no
941 longer UP or no longer the MinHop. When routing changes are performed,
942 the same algorithm for balancing the routes is invoked.
943
944 In the case of using the file based routing, any topology changes are
945 currently ignored. The 'file' routing engine just loads the LFTs from
946 the file specified, with no reaction to real topology. Obviously, this
947 will not be able to recheck LIDs (by GUID) for disconnected nodes, and
948 LFTs for non-existent switches will be skipped. Multicast is not
949 affected by 'file' routing engine (this uses min hop tables).
950
951
952 Min Hop Algorithm
953
954 The Min Hop algorithm is invoked by default if no routing algorithm is
955 specified. It can also be invoked by specifying '-R minhop'.
956
957 The Min Hop algorithm is divided into two stages: computation of min-
958 hop tables on every switch and LFT output port assignment. Link sub‐
959 scription is also equalized with the ability to override based on port
960 GUID. The latter is supplied by:
961
962 -i <equalize-ignore-guids-file>
963 --ignore_guids <equalize-ignore-guids-file>
964 This option provides the means to define a set of ports
965 (by guid) that will be ignored by the link load
966 equalization algorithm. Note that only endports (CA,
967 switch port 0, and router ports) and not switch external
968 ports are supported.
969
970 LMC awareness routes based on (remote) system or switch basis.
971
972
973 Purpose of UPDN Algorithm
974
975 The UPDN algorithm is designed to prevent deadlocks from occurring in
976 loops of the subnet. A loop-deadlock is a situation in which it is no
977 longer possible to send data between any two hosts connected through
978 the loop. As such, the UPDN routing algorithm should be used if the
979 subnet is not a pure Fat Tree, and one of its loops may experience a
980 deadlock (due, for example, to high pressure).
981
982 The UPDN algorithm is based on the following main stages:
983
984 1. Auto-detect root nodes - based on the CA hop length from any switch
985 in the subnet, a statistical histogram is built for each switch (hop
986 num vs number of occurrences). If the histogram reflects a specific
987 column (higher than others) for a certain node, then it is marked as a
988 root node. Since the algorithm is statistical, it may not find any root
989 nodes. The list of the root nodes found by this auto-detect stage is
990 used by the ranking process stage.
991
992 Note 1: The user can override the node list manually.
993 Note 2: If this stage cannot find any root nodes, and the user did
994 not specify a guid list file, OpenSM defaults back to the
995 Min Hop routing algorithm.
996
997 2. Ranking process - All root switch nodes (found in stage 1) are
998 assigned a rank of 0. Using the BFS algorithm, the rest of the switch
999 nodes in the subnet are ranked incrementally. This ranking aids in the
1000 process of enforcing rules that ensure loop-free paths.
1001
1002 3. Min Hop Table setting - after ranking is done, a BFS algorithm is
1003 run from each (CA or switch) node in the subnet. During the BFS
1004 process, the FDB table of each switch node traversed by BFS is updated,
1005 in reference to the starting node, based on the ranking rules and guid
1006 values.
1007
1008 At the end of the process, the updated FDB tables ensure loop-free
1009 paths through the subnet.
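
The ranking stage and the rule it enforces can be sketched in Python as
follows. This is an illustration under simplifying assumptions, not the
OpenSM code: rank_switches and obeys_updn are hypothetical names, and hops
between switches of equal rank (which the text above resolves using guid
values) are simply ignored here.

```python
from collections import deque

# Sketch of UPDN ranking (stage 2) and of the up/down legality rule.
# rank_switches and obeys_updn are hypothetical names.

def rank_switches(links, roots):
    """BFS from the root switches; roots get rank 0."""
    rank = {r: 0 for r in roots}
    queue = deque(roots)
    while queue:
        sw = queue.popleft()
        for nbr in links[sw]:
            if nbr not in rank:
                rank[nbr] = rank[sw] + 1
                queue.append(nbr)
    return rank

def obeys_updn(path, rank):
    """A path is legal if it never goes up after it has gone down."""
    gone_down = False
    for a, b in zip(path, path[1:]):
        if rank[b] > rank[a]:
            gone_down = True          # moving away from the roots
        elif rank[b] < rank[a] and gone_down:
            return False              # up after down is forbidden
    return True
```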
1010
1011 Note: Up/Down routing does not allow LID routing communication between
1012 switches that are located inside spine "switch systems". The reason is
1013 that there is no way to allow a LID route between them that does not
1014 break the Up/Down rule. One ramification of this is that you cannot
1015 run SM on switches other than the leaf switches of the fabric.
1016
1017
1018 UPDN Algorithm Usage
1019
1020 Activation through OpenSM
1021
1022 Use '-R updn' option (instead of old '-u') to activate the UPDN algo‐
1023 rithm. Use '-a <root_guid_file>' for adding an UPDN guid file that
1024 contains the root nodes for ranking. If the `-a' option is not used,
1025 OpenSM uses its auto-detect root nodes algorithm.
1026
1027 Notes on the guid list file:
1028
1029 1. A valid guid file specifies one guid in each line. Lines with an
1030 invalid format will be discarded.
1031 2. The user should specify the root switch guids. However, it is also
1032 possible to specify CA guids; OpenSM will use the guid of the switch
1033 (if it exists) that connects the CA to the subnet as a root node.
1034
1035 Purpose of DNUP Algorithm
1036
1037 The DNUP algorithm is designed to serve a similar purpose to UPDN. How‐
1038 ever it is intended to work in network topologies which are unsuited to
1039 UPDN due to nodes being connected closer to the roots than some of the
1040 switches. An example would be a fabric which contains nodes and
1041 uplinks connected to the same switch. The operation of DNUP is the same
1042 as UPDN with the exception of the ranking process. In DNUP all switch
1043 nodes are ranked based solely on their distance from CA nodes: all
1044 switch nodes directly connected to at least one CA are assigned a value
1045 of 1, and all other switch nodes are assigned a value of one more than
1046 the minimum rank of all neighbor switch nodes.
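
The DNUP ranking rule amounts to a breadth-first search started from every
switch that has a CA attached. A minimal Python sketch, assuming a map from
each switch to its neighbor switches (dnup_rank and both argument names are
hypothetical):

```python
from collections import deque

# Sketch of DNUP ranking: rank each switch by distance from the nearest CA.
# switch_links maps switch -> neighbor switches; ca_switches lists the
# switches with at least one CA attached. dnup_rank is a hypothetical name.

def dnup_rank(switch_links, ca_switches):
    rank = {sw: 1 for sw in ca_switches}
    queue = deque(ca_switches)
    while queue:
        sw = queue.popleft()
        for nbr in switch_links[sw]:
            if nbr not in rank:
                # BFS order makes this one more than the minimum
                # rank among the neighbor switches.
                rank[nbr] = rank[sw] + 1
                queue.append(nbr)
    return rank
```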
1047
1048 Fat-tree Routing Algorithm
1049
1050 The fat-tree algorithm optimizes routing for the "shift" communication
1051 pattern. It should be chosen if a subnet is a symmetrical or almost sym‐
1052 metrical fat-tree of various types. It supports more than just K-ary-N-
1053 Trees: it handles non-constant K, cases where not all leaves (CAs) are
1054 present, and any CBB ratio. As in UPDN, fat-tree also prevents credit-
1055 loop-deadlocks.
1056
1057 If the root guid file is not provided ('-a' or '--root_guid_file'
1058 options), the topology has to be pure fat-tree that complies with the
1059 following rules:
1060 - Tree rank should be between two and eight (inclusively)
1061 - Switches of the same rank should have the same number
1062 of UP-going port groups*, unless they are root switches,
1063 in which case they shouldn't have UP-going ports at all.
1064 - Switches of the same rank should have the same number
1065 of DOWN-going port groups, unless they are leaf switches.
1066 - Switches of the same rank should have the same number
1067 of ports in each UP-going port group.
1068 - Switches of the same rank should have the same number
1069 of ports in each DOWN-going port group.
1070 - All the CAs have to be at the same tree level (rank).
1071
1072 If the root guid file is provided, the topology doesn't have to be pure
1073 fat-tree, and it should only comply with the following rules:
1074 - Tree rank should be between two and eight (inclusively)
1075 - All the Compute Nodes** have to be at the same tree level (rank).
1076 Note that non-compute node CAs are allowed here to be at different
1077 tree ranks.
1078
1079 * ports that are connected to the same remote switch are referenced as
1080 ´port group´.
1081
1082 ** list of compute nodes (CNs) can be specified by ´-u´ or
1083 ´--cn_guid_file´ OpenSM options.
1084
1085 Topologies that do not comply cause a fallback to min hop routing.
1086 Note that this can also occur on link failures which cause the topology
1087 to no longer be "pure" fat-tree.
1088
1089 Note that although fat-tree algorithm supports trees with non-integer
1090 CBB ratio, the routing will not be as balanced as in case of integer
1091 CBB ratio. In addition to this, although the algorithm allows leaf
1092 switches to have any number of CAs, the closer the tree is to being fully
1093 populated, the more effective the "shift" communication pattern will
1094 be. In general, even if the root list is provided, the closer the
1095 topology to a pure and symmetrical fat-tree, the more optimal the rout‐
1096 ing will be.
1097
1098 The algorithm also dumps a compute node ordering file (opensm-ftree-ca-
1099 order.dump) in the same directory where the OpenSM log resides. This
1100 ordering file provides the CN order that may be used to create an effi‐
1101 cient communication pattern that will match the routing tables.
1102
1103 Routing between non-CN nodes
1104
1105 The use of the cn_guid_file option allows non-CN nodes to be located on
1106 different levels in the fat tree. In such case, it is not guaranteed
1107 that the Fat Tree algorithm will route between two non-CN nodes. To
1108 solve this problem, a list of non-CN nodes can be specified by ´-G´ or
1109 ´--io_guid_file´ option. These nodes will be allowed to use switches
1110 the wrong way round a specific number of times (specified by ´-H´ or
1111 ´--max_reverse_hops´). With the proper max_reverse_hops and
1112 io_guid_file values, you can ensure full connectivity in the Fat Tree.
1113
1114 Please note that using max_reverse_hops creates routes that use the
1115 switch in a counter-stream way. This option should never be used to
1116 connect nodes with high bandwidth traffic between them ! It should only
1117 be used to allow connectivity for HA purposes or similar. Also having
1118 routes the other way around can in theory cause credit loops.
1119
1120 Use these options with extreme care !
1121
1122 Activation through OpenSM
1123
1124 Use '-R ftree' option to activate the fat-tree algorithm. Use '-a
1125 <root_guid_file>' to provide root nodes for ranking. If the `-a' option
1126 is not used, routing algorithm will detect roots automatically. Use
1127 '-u <root_cn_file>' to provide the list of compute nodes. If the `-u'
1128 option is not used, all the CAs are considered as compute nodes.
1129
1130 Note: LMC > 0 is not supported by fat-tree routing. If this is speci‐
1131 fied, the default routing algorithm is invoked instead.
1132
1133
1134 LASH Routing Algorithm
1135
1136 LASH is an acronym for LAyered SHortest Path Routing. It is a determin‐
1137 istic shortest path routing algorithm that enables topology agnostic
1138 deadlock-free routing within communication networks.
1139
1140 When computing the routing function, LASH analyzes the network topology
1141 for the shortest-path routes between all pairs of sources / destina‐
1142 tions and groups these paths into virtual layers in such a way as to
1143 avoid deadlock.
1144
1145 Note LASH analyzes routes and ensures deadlock freedom between switch
1146 pairs. The link between an HCA and a switch does not need virtual lay‐
1147 ers as deadlock will not arise between switch and HCA.
1148
1149 In more detail, the algorithm works as follows:
1150
1151 1) LASH determines the shortest-path between all pairs of source / des‐
1152 tination switches. Note, LASH ensures the same SL is used for all
1153 SRC/DST - DST/SRC pairs and there is no guarantee that the return path
1154 for a given DST/SRC will be the reverse of the route SRC/DST.
1155
1156 2) LASH then begins an SL assignment process where a route is assigned
1157 to a layer (SL) if the addition of that route does not cause deadlock
1158 within that layer. This is achieved by maintaining and analysing a
1159 channel dependency graph for each layer. Once the potential addition of
1160 a path could lead to deadlock, LASH opens a new layer and continues the
1161 process.
1162
1163 3) Once this stage has been completed, it is highly likely that the
1164 first layers processed will contain more paths than the latter ones.
1165 To better balance the use of layers, LASH moves paths from one layer to
1166 another so that the number of paths in each layer averages out.
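
Step 2 can be sketched in Python as a greedy assignment with a cycle check
on each layer's channel dependency graph. This is a simplified illustration,
not the OpenSM implementation: all function names are hypothetical, routes
are given as lists of switches, and the balancing of step 3 is not modeled.

```python
# Sketch of LASH step 2: place each route in the first layer whose channel
# dependency graph (CDG) stays acyclic. All names here are hypothetical.

def dependencies(route):
    """Channel pairs (link -> next link) that a switch-level route creates."""
    channels = list(zip(route, route[1:]))
    return set(zip(channels, channels[1:]))

def has_cycle(edges):
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    state = {}                        # node -> "active" or "done"

    def visit(node):
        state[node] = "active"
        for nxt in graph[node]:
            if state.get(nxt) == "active":
                return True           # back edge: a cycle exists
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(visit(n) for n in graph if n not in state)

def assign_layers(routes):
    layers = []                       # one CDG edge set per layer (SL)
    placement = {}
    for route in routes:
        deps = dependencies(route)
        for sl, cdg in enumerate(layers):
            if not has_cycle(cdg | deps):
                cdg |= deps           # route fits in this layer
                placement[tuple(route)] = sl
                break
        else:
            layers.append(set(deps))  # open a new layer
            placement[tuple(route)] = len(layers) - 1
    return placement
```

On a ring of four switches with four two-hop routes in the same direction,
the first three routes share one layer and the fourth, which would close the
dependency cycle, is moved to a second layer.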
1167
1168 Note, the implementation of LASH in opensm attempts to use as few lay‐
1169 ers as possible. This number can be less than the number of actual lay‐
1170 ers available.
1171
1172 In general LASH is a very flexible algorithm. It can, for example,
1173 reduce to Dimension Order Routing in certain topologies, it is topology
1174 agnostic and fares well in the face of faults.
1175
1176 It has been shown that for both regular and irregular topologies, LASH
1177 outperforms Up/Down. The reason for this is that LASH distributes the
1178 traffic more evenly through a network, avoiding the bottleneck issues
1179 related to a root node and always routes shortest-path.
1180
1181 The algorithm was developed by Simula Research Laboratory.
1182
1183
1184 Use the '-R lash -Q' option to activate the LASH algorithm.
1185
1186 Note: QoS support has to be turned on in order that SL/VL mappings are
1187 used.
1188
1189 Note: LMC > 0 is not supported by the LASH routing. If this is speci‐
1190 fied, the default routing algorithm is invoked instead.
1191
1192 For open regular cartesian meshes the DOR algorithm is the ideal rout‐
1193 ing algorithm. For toroidal meshes on the other hand there are routing
1194 loops that can cause deadlocks. LASH can be used to route these cases.
1195 The performance of LASH can be improved by preconditioning the mesh in
1196 cases where there are multiple links connecting switches and also in
1197 cases where the switches are not cabled consistently. An option exists
1198 for LASH to do this. To invoke this use '-R lash -Q --do_mesh_analy‐
1199 sis'. This will add an additional phase that analyses the mesh to try
1200 to determine the dimension and size of a mesh. If it determines that
1201 the mesh looks like an open or closed cartesian mesh it reorders the
1202 ports in dimension order before the rest of the LASH algorithm runs.
1203
1204 DOR Routing Algorithm
1205
1206 The Dimension Order Routing algorithm is based on the Min Hop algorithm
1207 and so uses shortest paths. Instead of spreading traffic out across
1208 different paths with the same shortest distance, it chooses among the
1209 available shortest paths based on an ordering of dimensions. Each port
1210 must be consistently cabled to represent a hypercube dimension or a
1211 mesh dimension. Alternatively, the -O option can be used to assign a
1212 custom mapping between the ports on a given switch, and the associated
1213 dimension. Paths are grown from a destination back to a source using
1214 the lowest dimension (port) of available paths at each step. This pro‐
1215 vides the ordering necessary to avoid deadlock. When there are multi‐
1216 ple links between any two switches, they still represent only one
1217 dimension and traffic is balanced across them unless port equalization
1218 is turned off. In the case of hypercubes, the same port must be used
1219 throughout the fabric to represent the hypercube dimension and match on
1220 both ends of the cable, or the -O option used to accomplish the align‐
1221 ment. In the case of meshes, the dimension should consistently use the
1222 same pair of ports, one port on one end of the cable, and the other
1223 port on the other end, continuing along the mesh dimension, or the -O
1224 option used as an override.
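
The port choice described above can be illustrated with a very small Python
sketch. It assumes that port numbers stand for dimensions and takes "lowest
dimension (port)" literally; dor_port is a hypothetical name, and neither
balancing across redundant links nor the -O remapping is modeled.

```python
# Sketch of the DOR port choice for one destination on one switch.
# min_hops_for_lid maps port number -> hop count to that destination.
# dor_port is a hypothetical name used only for illustration.

def dor_port(min_hops_for_lid):
    best = min(min_hops_for_lid.values())
    # Among the shortest paths, take the lowest-numbered port
    # (lowest dimension) instead of the least-loaded one.
    return min(p for p, h in min_hops_for_lid.items() if h == best)
```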
1225
1226 Use '-R dor' option to activate the DOR algorithm.
1227
1228 DFSSSP and SSSP Routing Algorithm
1229
1230 The (Deadlock-Free) Single-Source-Shortest-Path routing algorithm is
1231 designed to optimize link utilization through global balancing of routes,
1232 while supporting arbitrary topologies. The DFSSSP routing algorithm
1233 uses InfiniBand virtual lanes (SL) to provide deadlock-freedom.
1234
1235 The DFSSSP algorithm consists of five major steps:
1236 1) It discovers the subnet and models the subnet as a directed multi‐
1237 graph in which each node represents a node of the physical network and
1238 each edge represents one direction of the full-duplex links used to
1239 connect the nodes.
1240 2) A loop, which iterates over all CA and switches of the subnet, will
1241 perform three steps to generate the linear forwarding tables for each
1242 switch:
1243 2.1) use Dijkstra's algorithm to find the shortest path from all nodes
1244 to the current selected destination;
1245 2.2) update the edge weights in the graph, i.e. add the number of
1246 routes, which use a link to reach the destination, to the link/edge;
1247 2.3) update the LFT of each switch with the outgoing port which was
1248 used in the current step to route the traffic to the destination node.
1249 3) After the number of available virtual lanes or layers in the subnet
1250 is detected and a channel dependency graph is initialized for each
1251 layer, the algorithm will put each possible route of the subnet into
1252 the first layer.
1253 4) A loop iterates over all channel dependency graphs (CDG) and per‐
1254 forms the following substeps:
1255 4.1) search for a cycle in the current CDG;
1256 4.2) when a cycle is found, i.e. a possible deadlock is present, one
1257 edge is selected and all routes, which induced this edge, are moved to
1258 the "next higher" virtual layer (CDG[i+1]);
1259 4.3) the cycle search is continued until all cycles are broken and
1260 routes are moved "up".
1261 5) When the number of layers needed to remove all cycles in all CDGs
1262 does not exceed the number of available SL/VL, the routing is dead‐
1263 lock-free and a relation table is generated, which contains the
1264 assignment of routes from source to destination to an SL.
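
Step 2, which DFSSSP shares with SSSP, can be sketched in Python as follows.
This is an illustration under stated assumptions, not the OpenSM code:
sssp_routes is a hypothetical name, the graph is a map from each node to its
neighbors, the initial edge weight of 1 is an assumption, and a per-node
next-hop table stands in for the switch LFTs.

```python
import heapq

# Sketch of step 2: per-destination Dijkstra with growing edge weights.
# sssp_routes is a hypothetical name; graph maps node -> set of neighbors.

def sssp_routes(graph, destinations):
    weight = {(u, v): 1 for u in graph for v in graph[u]}  # directed edges
    next_hop = {u: {} for u in graph}   # stand-in for per-switch LFTs
    for dst in destinations:
        dist = {dst: 0}
        parent = {}                     # node -> neighbor toward dst
        heap = [(0, dst)]
        while heap:                     # 2.1) Dijkstra toward dst
            d, v = heapq.heappop(heap)
            if d > dist[v]:
                continue                # stale heap entry
            for u in sorted(graph[v]):
                nd = d + weight[(u, v)]  # cost of sending u -> v
                if nd < dist.get(u, float("inf")):
                    dist[u], parent[u] = nd, v
                    heapq.heappush(heap, (nd, u))
        for src in parent:
            node = src
            while node != dst:          # 2.2) count routes per edge
                weight[(node, parent[node])] += 1
                node = parent[node]
            next_hop[src][dst] = parent[src]  # 2.3) record output choice
    return next_hop, weight
```

Because every route raises the weight of the edges it uses, a later
destination tends to be routed over links that earlier destinations left
lightly used, which is the balancing effect described above.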
1265
1266 Note on SSSP:
1267 This algorithm does not perform the steps 3)-5) and can not be consid‐
1268 ered to be deadlock-free for all topologies. But on the one hand, you
1269 can choose this algorithm for really large networks (5,000+ CAs and
1270 deadlock-free by design) to reduce the runtime of the algorithm. On the
1271 other hand, you might use the SSSP routing algorithm as an alternative,
1272 when all deadlock-free routing algorithms fail to route the network for
1273 whatever reason. In the last case, SSSP was designed to deliver an
1274 equal or higher bandwidth due to better congestion avoidance than the
1275 Min Hop routing algorithm.
1276
1277 Notes for usage:
1278 a) running DFSSSP: '-R dfsssp -Q'
1279 a.1) QoS has to be configured to equally spread the load on the avail‐
1280 able SL or virtual lanes
1281 a.2) applications must perform a path record query to get path SL for
1282 each route, which the application will use to transmit packets
1283 b) running SSSP: '-R sssp'
1284 c) both algorithms support LMC > 0
1285
1286 Hints for optimizing I/O traffic:
1287 Having more nodes (I/O and compute) connected to a switch than incoming
1288 links can result in a 'bad' routing of the I/O traffic as long as
1289 (DF)SSSP routing is not aware of the dedicated I/O nodes, i.e., in the
1290 following network configuration CN1-CN3 might send all I/O traffic via
1291 Link2 to IO1,IO2:
1292
1293 CN1 Link1 IO1
1294 \ /----\ /
1295 CN2 -- Switch1 Switch2 -- CN4
1296 / \----/ \
1297 CN3 Link2 IO2
1298
1299 To prevent this from happening (DF)SSSP can use both the compute node
1300 guid file and the I/O guid file specified by the ´-u´ or
1301 ´--cn_guid_file´ and ´-G´ or ´--io_guid_file´ options (similar to the
1302 Fat-Tree routing). This ensures that traffic towards compute nodes and
1303 I/O nodes is balanced separately and therefore distributed as much as
1304 possible across the available links. Port GUIDs, as listed by ibstat,
1305 must be specified (not Node GUIDs).
1306 The priority for the optimization is as follows:
1307 compute nodes -> I/O nodes -> other nodes
1308 Possible use case scenarios:
1309 a) neither ´-u´ nor ´-G´ is specified: all nodes are treated as ´other
1310 nodes´ and therefore balanced equally;
1311 b) ´-G´ is specified: traffic towards I/O nodes will be balanced opti‐
1312 mally;
1313 c) the system has three node types, such as login/admin, compute and
1314 I/O, but the balancing focus should be I/O, then one has to use ´-u´
1315 and ´-G´ with I/O guids listed in cn_guid_file and compute node guids
1316 listed in io_guid_file;
1317 d) ...
1318
1319 Torus-2QoS Routing Algorithm
1320
1321 Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus
1322 fabrics; see torus-2QoS(8) for full documentation.
1323
1324 Use '-R torus-2QoS -Q' or '-R torus-2QoS,no_fallback -Q' to activate
1325 the torus-2QoS algorithm.
1326
1327
1328 Routing References
1329
1330 To learn more about deadlock-free routing, see the article "Deadlock
1331 Free Message Routing in Multiprocessor Interconnection Networks" by
1332 William J Dally and Charles L Seitz (1985).
1333
1334 To learn more about the up/down algorithm, see the article "Effective
1335 Strategy to Compute Forwarding Tables for InfiniBand Networks" by Jose
1336 Carlos Sancho, Antonio Robles, and Jose Duato at the Universidad
1337 Politecnica de Valencia.
1338
1339 To learn more about LASH and the flexibility behind it, the requirement
1340 for layers, performance comparisons to other algorithms, see the fol‐
1341 lowing articles:
1342
1343 "Layered Routing in Irregular Networks", Lysne et al, IEEE Transactions
1344 on Parallel and Distributed Systems, VOL.16, No12, December 2005.
1345
1346 "Routing for the ASI Fabric Manager", Solheim et al. IEEE Communica‐
1347 tions Magazine, Vol.44, No.7, July 2006.
1348
1349 "Layered Shortest Path (LASH) Routing in Irregular System Area Net‐
1350 works", Skeie et al. IEEE Computer Society Communication Architecture
1351 for Clusters 2002.
1352
1353 To learn more about the DFSSSP and SSSP routing algorithm, see the
1354 articles:
1355 J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for
1356 Arbitrary Topologies, In Proceedings of the 25th IEEE International
1357 Parallel & Distributed Processing Symposium (IPDPS 2011)
1358 T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-
1359 Scale InfiniBand Networks, In 17th Annual IEEE Symposium on High Per‐
1360 formance Interconnects (HOTI 2009)
1361
1362 Modular Routing Engine
1363
1364 Modular routing engine structure allows for the ease of "plugging" new
1365 routing modules.
1366
1367 Currently, only unicast callbacks are supported. Multicast can be added
1368 later.
1369
1370 One existing routing module is up-down "updn", which may be activated
1371 with '-R updn' option (instead of old '-u').
1372
1373 General usage is: $ opensm -R 'module-name'
1374
1375 There is also a trivial routing module which is able to load LFT tables
1376 from a file.
1377
1378 Main features:
1379
1380 - this will load switch LFTs and/or LID matrices (min hops tables)
1381 - this will load switch LFTs according to the path entries introduced
1382 in the file
1383 - no additional checks will be performed (such as "is port connected",
1384 etc.)
1385 - in case when fabric LIDs were changed this will try to reconstruct
1386 LFTs correctly if endport GUIDs are represented in the file
1387 (in order to disable this, GUIDs may be removed from the file
1388 or zeroed)
1389
1390 The file format is compatible with output of 'ibroute' util and for
1391 whole fabric can be generated with dump_lfts.sh script.
1392
1393 To activate file based routing module, use:
1394
1395 opensm -R file -U /path/to/lfts_file
1396
If the lfts_file is not found or contains errors, the default routing
algorithm is used.
1399
The ability to dump switch lid matrices (aka min hops tables) to a file
and to load them later is also supported.

The usage is similar to loading unicast forwarding tables from an lfts
file (introduced by the 'file' routing engine), but the lid matrix file
name is specified with the -M or --lid_matrix_file option. For example:
1407
1408 opensm -R file -M ./opensm-lid-matrix.dump
1409
The dump file is named 'opensm-lid-matrix.dump' and is generated in the
standard opensm dump directory (/var/log by default) when the
OSM_LOG_ROUTING logging flag is set.
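As a worked example of the flag arithmetic, the following Python sketch
combines the ERROR, INFO and ROUTING bits into a single mask. The bit
values are taken from the log level table in this page; the helper name
combine_flags is hypothetical, and passing the mask with -D assumes that
the -D <flags> option from the synopsis accepts this bit field.

```python
# Bit values copied from the log level table in this man page.
LOG_BITS = {
    "ERROR": 0x01,
    "INFO": 0x02,
    "VERBOSE": 0x04,
    "DEBUG": 0x08,
    "FUNCS": 0x10,
    "FRAMES": 0x20,
    "ROUTING": 0x40,
    "SYS": 0x80,
}


def combine_flags(*names):
    """OR together the bits for the named log levels (hypothetical helper)."""
    mask = 0
    for name in names:
        mask |= LOG_BITS[name]
    return mask


# ROUTING (0x40) is the bit that corresponds to OSM_LOG_ROUTING.
flags = combine_flags("ERROR", "INFO", "ROUTING")
print(f"opensm -D {flags:#04x}")
```

Keeping ERROR and INFO set alongside ROUTING is only a choice made for
this illustration; the routing dump itself depends on the ROUTING bit.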
1413
When the 'file' routing engine is activated but the lfts file is not
specified or cannot be opened, the default lid matrix algorithm is used.
1416
There is also a switch forwarding tables dumper, which generates a file
compatible with dump_lfts.sh output. This file can be used as input when
loading forwarding tables with the 'file' routing engine. Either or both
of the -U and -M options can be specified together with '-R file'.
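The option combinations described above can be sketched as a small
Python helper that assembles the command line. This is only an
illustration: the function name file_routing_argv is hypothetical, the
paths are the placeholders used earlier in this section, and the sketch
builds the argument list without starting opensm.

```python
import shlex


def file_routing_argv(lfts_file=None, lid_matrix_file=None):
    """Build an opensm argument list selecting the 'file' routing engine.

    Either file may be omitted; this section says opensm then falls back
    to its default algorithm for the missing part.
    """
    argv = ["opensm", "-R", "file"]
    if lfts_file is not None:
        argv += ["-U", lfts_file]        # LFTs to load
    if lid_matrix_file is not None:
        argv += ["-M", lid_matrix_file]  # lid matrices (min hops tables)
    return argv


if __name__ == "__main__":
    argv = file_routing_argv("/path/to/lfts_file", "./opensm-lid-matrix.dump")
    print(shlex.join(argv))
```

Building a list rather than a single string keeps each path as one
argument, so file names containing spaces are passed through unchanged.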
1421
1422
PER MODULE LOGGING CONFIGURATION
To enable per module logging, set per_module_logging_file in the opensm
options file to the name of the per module logging config file. To
disable it, set per_module_logging_file to (null) there.
1427
1428 The per module logging config file format is a set of lines with module
1429 name and logging level as follows:
1430
1431 <module name><separator><logging level>
1432
1433 <module name> is the file name including .c
1434 <separator> is either = , space, or tab
<logging level> uses the same levels as the coarse/overall logging, as
follows:

BIT    LOG LEVEL ENABLED
----   -----------------
0x01 - ERROR (error messages)
0x02 - INFO (basic messages, low volume)
0x04 - VERBOSE (interesting stuff, moderate volume)
0x08 - DEBUG (diagnostic, high volume)
0x10 - FUNCS (function entry/exit, very high volume)
0x20 - FRAMES (dumps all SMP and GMP frames)
0x40 - ROUTING (dump FDB routing information)
0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM logging)
1449
FILES
1451 /etc/rdma/opensm.conf
1452 default OpenSM config file.
1453
1454
1455 /etc/rdma/ib-node-name-map
1456 default node name map file. See ibnetdiscover for more informa‐
1457 tion on format.
1458
1459
1460 /etc/rdma/partitions.conf
1461 default partition config file
1462
1463
1464 /etc/rdma/qos-policy.conf
1465 default QOS policy config file
1466
1467
1468 /etc/rdma/prefix-routes.conf
1469 default prefix routes file
1470
1471
1472 /etc/rdma/per-module-logging.conf
1473 default per module logging config file
1474
1475
1476 /etc/rdma/torus-2QoS.conf
1477 default torus-2QoS config file
1478
1479
AUTHORS
1481 Hal Rosenstock
1482 <hal@mellanox.com>
1483
1484 Sasha Khapyorsky
1485 <sashak@voltaire.com>
1486
1487 Eitan Zahavi
1488 <eitan@mellanox.co.il>
1489
1490 Yevgeny Kliteynik
1491 <kliteyn@mellanox.co.il>
1492
1493 Thomas Sodring
1494 <tsodring@simula.no>
1495
1496 Ira Weiny
1497 <weiny2@llnl.gov>
1498
1499 Dale Purdy
1500 <purdy@sgi.com>
1501
1502
SEE ALSO
1504 torus-2QoS(8), torus-2QoS.conf(5).
1505
1506
1507
1508OpenIB Sept 15, 2014 OPENSM(8)