TORUS-2QOS(8)                OpenIB Management               TORUS-2QOS(8)

torus-2QoS - Routing engine for OpenSM subnet manager

Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus
fabrics. The torus-2QoS routing engine can provide the following
functionality on a 2D/3D torus:

  – Routing that is free of credit loops.
  – Two levels of Quality of Service (QoS), assuming switches support
    eight data VLs and channel adapters support two data VLs.
  – The ability to route around a single failed switch, and/or multiple
    failed links, without
      – introducing credit loops, or
      – changing path SL values.
  – Very short run times, with good scaling properties as fabric size
    increases.

Unicast routing in torus-2QoS is based on Dimension Order Routing
(DOR). It avoids the deadlocks that would otherwise occur in a
DOR-routed torus using the concept of a dateline for each torus
dimension. It encodes into a path SL which datelines the path crosses,
as follows:

    sl = 0;
    for (d = 0; d < torus_dimensions; d++) {
        /* path_crosses_dateline(d) returns 0 or 1 */
        sl |= path_crosses_dateline(d) << d;
    }

On a 3D torus this consumes three SL bits, leaving one SL bit unused.
Torus-2QoS uses this SL bit to implement two QoS levels.

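As a minimal sketch of how the four SL bits fit together on a 3D torus,
consider the helper below. The name compose_sl and its arguments are
hypothetical and are not part of OpenSM; the sketch assumes, as stated
later in this page, that SL bit 3 carries the QoS level.

```c
/* Hypothetical helper, not an OpenSM function.
 * crosses[d] is 1 if the path crosses the dateline in dimension d,
 * else 0.  qos is the QoS level, 0 or 1, and is placed in SL bit 3.
 */
static unsigned compose_sl(const int crosses[3], int qos)
{
    unsigned sl = 0;
    int d;

    /* SL bits 0-2: one dateline-crossing flag per torus dimension. */
    for (d = 0; d < 3; d++)
        sl |= (unsigned)(crosses[d] & 1) << d;

    /* SL bit 3: the QoS level. */
    sl |= (unsigned)(qos & 1) << 3;
    return sl;
}
```

For a path that crosses the x and z datelines, this yields SL 5 at QoS
level 0 and SL 13 at QoS level 1.
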
Torus-2QoS also makes use of the output port dependence of switch SL2VL
maps to encode into one VL bit the information encoded in three SL
bits. It computes in which torus coordinate direction each
inter-switch link "points", and writes SL2VL maps for such ports as
follows:

    for (sl = 0; sl < 16; sl++) {
        /* cdir(port) computes which torus coordinate direction
         * a switch port "points" in; returns 0, 1, or 2
         */
        sl2vl(iport,oport,sl) = 0x1 & (sl >> cdir(oport));
    }

Thus, on a pristine 3D torus, i.e., in the absence of failed fabric
switches, torus-2QoS consumes eight SL values (SL bits 0-2) and two VL
values (VL bit 0) per QoS level to provide deadlock-free routing.

Torus-2QoS routes around link failure by "taking the long way around"
any 1D ring interrupted by link failure. For example, consider the 2D
6x5 torus below, where switches are denoted by [+a-zA-Z]:

           |    |    |    |    |    |
      4  --+----+----+----+----+----+--
           |    |    |    |    |    |
      3  --+----+----+----D----+----+--
           |    |    |    |    |    |
      2  --+----+----I----r----+----+--
           |    |    |    |    |    |
      1  --m----S----n----T----o----p--
           |    |    |    |    |    |
    y=0  --+----+----+----+----+----+--
           |    |    |    |    |    |

         x=0    1    2    3    4    5

For a pristine fabric the path from S to D would be S-n-T-r-D. In the
event that either link S-n or n-T has failed, torus-2QoS would use the
path S-m-p-o-T-r-D. Note that it can do this without changing the path
SL value; once the 1D ring m-S-n-T-o-p-m has been broken by failure,
path segments using it cannot contribute to deadlock, and the
x-direction dateline (between, say, x=5 and x=0) can be ignored for
path segments on that ring.

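The "long way around" choice on a single 1D ring can be sketched as
follows. This is an illustration only, not OpenSM code: ring_direction
is a hypothetical name, link k is assumed to join ring position k to
position (k + 1) % radix, and ties between the two directions are
broken toward +1 purely for simplicity.

```c
/* Hypothetical sketch, not OpenSM code.
 * Positions on a 1D ring are 0..radix-1; link k joins position k to
 * position (k + 1) % radix.  failed_link is the index of the single
 * failed link, or -1 for a pristine ring.
 * Returns +1 or -1: the direction to step in when going from src to dst.
 */
static int ring_direction(int radix, int src, int dst, int failed_link)
{
    int fwd = (dst - src + radix) % radix;   /* hop count going +1 */
    int dir = (fwd <= radix - fwd) ? +1 : -1; /* shorter way first */
    int hops = (dir > 0) ? fwd : radix - fwd;
    int pos;

    if (failed_link < 0)
        return dir;

    /* Walk the preferred direction; reverse if it needs the failed link. */
    for (pos = src; hops > 0; hops--) {
        int link = (dir > 0) ? pos : (pos - 1 + radix) % radix;

        if (link == failed_link)
            return -dir;
        pos = (pos + dir + radix) % radix;
    }
    return dir;
}
```

With radix 6, S at x=1 and T at x=3, a failure of link S-n (link 1) or
n-T (link 2) makes the function return -1, which corresponds to the
S-m-p-o-T segment of the path described above.
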
One result of this is that torus-2QoS can route around many
simultaneous link failures, as long as no 1D ring is broken into
disjoint segments. For example, if links n-T and T-o have both failed,
that ring has been broken into two disjoint segments, T and o-p-m-S-n.
Torus-2QoS checks for such issues, reports if they are found, and
refuses to route such fabrics.

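A minimal sketch of the per-ring condition follows. It is not the check
OpenSM performs; ring_is_routable is a hypothetical name, the sketch
looks at link failures only, and it treats a set of parallel links as
one link that is down only when all of its members are down.

```c
/* Hypothetical sketch, not OpenSM code.
 * link_ok[k] is nonzero if the link joining ring position k to
 * position (k + 1) % radix is up.  A ring with two or more failed
 * links has been cut into disjoint segments.
 * Returns 1 if at most one link has failed, else 0.
 */
static int ring_is_routable(const int *link_ok, int radix)
{
    int k, failed = 0;

    for (k = 0; k < radix; k++)
        if (!link_ok[k])
            failed++;
    return failed <= 1;
}
```

For the y=1 ring in the example, failing links n-T and T-o (links 2 and
3) makes the function return 0, matching the case that torus-2QoS
refuses to route.
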
Note that in the case where there are multiple parallel links between a
pair of switches, torus-2QoS will allocate routes across such links in
a round-robin fashion, based on ports at the path destination switch
that are active and not used for inter-switch links. Should a link
that is one of several such parallel links fail, routes are
redistributed across the remaining links. When the last of such a set
of parallel links fails, traffic is rerouted as described above.

Handling a failed switch under DOR requires introducing into a path at
least one turn that would be otherwise "illegal", i.e., not allowed by
DOR rules. Torus-2QoS will introduce such a turn as close as possible
to the failed switch in order to route around it.

In the above example, suppose switch T has failed, and consider the
path from S to D. Torus-2QoS will produce the path S-n-I-r-D, rather
than the S-n-T-r-D path for a pristine torus, by introducing an early
turn at n. Normal DOR rules will cause traffic arriving at switch I to
be forwarded to switch r; for traffic arriving at I from n due to the
"early" turn at n, this will generate an "illegal" turn at I.

Torus-2QoS will also use the input port dependence of SL2VL maps to set
VL bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns,
i.e., those turns that are illegal under DOR. This causes the first
hop after any such turn to use a separate set of VL values, and
prevents deadlock in the presence of a single failed switch.

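Putting the VL bits together for an inter-switch link might look like
the sketch below. It is illustrative only: isl_sl2vl is a hypothetical
name, both ports are assumed to be inter-switch ports, and the mapping
of SL bit 3 to VL bit 2 is an assumption that is merely consistent with
the VL 0-3 and VL 4-7 ranges described in the QoS part of this page.

```c
/* Hypothetical sketch of one inter-switch SL2VL entry; not OpenSM code.
 * in_dir and out_dir are the coordinate directions (0 = x, 1 = y,
 * 2 = z) in which the input and output ports "point".
 */
static unsigned isl_sl2vl(int in_dir, int out_dir, unsigned sl)
{
    /* VL bit 0: dateline flag for the dimension being traversed. */
    unsigned vl = 0x1 & (sl >> out_dir);

    /* VL bit 1: y-x, z-x and z-y turns, i.e. turns into a lower
     * dimension than the one the packet arrived on. */
    if (out_dir < in_dir)
        vl |= 0x2;

    /* Assumption: SL bit 3 (QoS level) selects VL bit 2. */
    vl |= ((sl >> 3) & 0x1) << 2;
    return vl;
}
```

Under these assumptions a y-x turn with SL 0 lands on VL 2, while a
legal x-y turn with SL 2 (y dateline crossed) lands on VL 1.
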
For any given path, only the hops after a turn that is illegal under
DOR can contribute to a credit loop that leads to deadlock. So in the
example above with failed switch T, the location of the illegal turn at
I in the path from S to D requires that any credit loop caused by that
turn must encircle the failed switch at T. Thus the second and later
hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a
credit loop because they cannot be used to construct a loop encircling
T. The hop I-r uses a separate VL, so it cannot contribute to a credit
loop encircling T.

Extending this argument shows that in addition to being capable of
routing around a single switch failure without introducing deadlock,
torus-2QoS can also route around multiple failed switches on the
condition they are adjacent in the last dimension routed by DOR. For
example, consider the following case on a 6x6 2D torus:

           |    |    |    |    |    |
      5  --+----+----+----+----+----+--
           |    |    |    |    |    |
      4  --+----+----+----D----+----+--
           |    |    |    |    |    |
      3  --+----+----I----u----+----+--
           |    |    |    |    |    |
      2  --+----+----q----R----+----+--
           |    |    |    |    |    |
      1  --m----S----n----T----o----p--
           |    |    |    |    |    |
    y=0  --+----+----+----+----+----+--
           |    |    |    |    |    |

         x=0    1    2    3    4    5

Suppose switches T and R have failed, and consider the path from S to
D. Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn
at switch I, and with hop I-u using a VL with bit 1 set.

As a further example, consider a case that torus-2QoS cannot route
without deadlock: two failed switches adjacent in a dimension that is
not the last dimension routed by DOR; here the failed switches are O
and T:

           |    |    |    |    |    |
      5  --+----+----+----+----+----+--
           |    |    |    |    |    |
      4  --+----+----+----+----+----+--
           |    |    |    |    |    |
      3  --+----+----+----+----D----+--
           |    |    |    |    |    |
      2  --+----+----I----q----r----+--
           |    |    |    |    |    |
      1  --m----S----n----O----T----p--
           |    |    |    |    |    |
    y=0  --+----+----+----+----+----+--
           |    |    |    |    |    |

         x=0    1    2    3    4    5

In a pristine fabric, torus-2QoS would generate the path from S to D as
S-n-O-T-r-D. With failed switches O and T, torus-2QoS will generate
the path S-n-I-q-r-D, with illegal turn at switch I, and with hop I-q
using a VL with bit 1 set. In contrast to the earlier examples, the
second hop after the illegal turn, q-r, can be used to construct a
credit loop encircling the failed switches.

Since torus-2QoS uses all four available SL bits, and the three data VL
bits that are typically available in current switches, there is no way
to use SL/VL values to separate multicast traffic from unicast traffic.
Thus, torus-2QoS must generate multicast routing such that credit loops
cannot arise from a combination of multicast and unicast path segments.

It turns out that it is possible to construct spanning trees for
multicast routing that have that property. For the 2D 6x5 torus
example above, here is the full-fabric spanning tree that torus-2QoS
will construct, where "x" is the root switch and each "+" is a non-root
switch:

      4    +    +    +    +    +    +
           |    |    |    |    |    |
      3    +    +    +    +    +    +
           |    |    |    |    |    |
      2    +----+----+----x----+----+
           |    |    |    |    |    |
      1    +    +    +    +    +    +
           |    |    |    |    |    |
    y=0    +    +    +    +    +    +

         x=0    1    2    3    4    5

For multicast traffic routed from root to tip, every turn in the above
spanning tree is a legal DOR turn.

For traffic routed from tip to root, and some traffic routed through
the root, turns are not legal DOR turns. However, to construct a
credit loop, the union of multicast routing on this spanning tree with
DOR unicast routing can only provide 3 of the 4 turns needed for the
loop.

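The shape of the pristine-fabric tree shown above can be sketched as a
parent function. This is an illustration for a 2D torus without
failures, not the OpenSM implementation; struct coord, tree_parent and
tree_depth are hypothetical names.

```c
struct coord { int x, y; };

/* Hypothetical sketch for a pristine 2D torus; not OpenSM code.
 * Returns the parent of switch c in a spanning tree rooted at root:
 * switches outside the root's row step along y toward that row, and
 * switches in the root's row step along x toward the root.  No step
 * uses a wrap-around link.  The root is returned as its own parent.
 */
static struct coord tree_parent(struct coord c, struct coord root)
{
    struct coord p = c;

    if (c.y != root.y)
        p.y += (c.y < root.y) ? 1 : -1;
    else if (c.x != root.x)
        p.x += (c.x < root.x) ? 1 : -1;
    return p;
}

/* Number of hops from c to the root along the tree. */
static int tree_depth(struct coord c, struct coord root)
{
    int hops = 0;

    while (c.x != root.x || c.y != root.y) {
        c = tree_parent(c, root);
        hops++;
    }
    return hops;
}
```

With the root at (3,2), the switch at (0,4) reaches the root in five
hops: two along y, then three along x. Traffic flowing the other way,
from root to tip, moves along x first and then y, which is the legal
DOR order.
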
In addition, if none of the above spanning tree branches crosses a
dateline used for unicast credit loop avoidance on a torus, and if
multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS
uses SL bit 3 to differentiate QoS level), then multicast traffic also
cannot contribute to the "ring" credit loops that are otherwise
possible in a torus.

Torus-2QoS uses these ideas to create a master spanning tree. Every
multicast group spanning tree will be constructed as a subset of the
master tree, with the same root as the master tree.

Such multicast group spanning trees will in general not be optimal for
groups which are a subset of the full fabric. However, this compromise
must be made to enable support for two QoS levels on a torus while
preventing credit loops.

In the presence of link or switch failures that result in a fabric for
which torus-2QoS can generate credit-loop-free unicast routes, it is
also possible to generate a master spanning tree for multicast that
retains the required properties. For example, consider that same 2D
6x5 torus, with the link from (2,2) to (3,2) failed. Torus-2QoS will
generate the following master spanning tree:

      4    +    +    +    +    +    +
           |    |    |    |    |    |
      3    +    +    +    +    +    +
           |    |    |    |    |    |
      2  --+----+----+    x----+----+--
           |    |    |    |    |    |
      1    +    +    +    +    +    +
           |    |    |    |    |    |
    y=0    +    +    +    +    +    +

         x=0    1    2    3    4    5

Two things are notable about this master spanning tree. First,
assuming the x dateline was between x=5 and x=0, this spanning tree has
a branch that crosses the dateline. However, just as for unicast,
crossing a dateline on a 1D ring (here, the ring for y=2) that is
broken by a failure cannot contribute to a torus credit loop.

Second, this spanning tree is no longer optimal even for multicast
groups that encompass the entire fabric. That, unfortunately, is a
compromise that must be made to retain the other desirable properties
of torus-2QoS routing.

In the event that a single switch fails, torus-2QoS will generate a
master spanning tree that has no "extra" turns by appropriately
selecting a root switch. In the 2D 6x5 torus example, assume now that
the switch at (3,2), i.e., the root for a pristine fabric, fails.
Torus-2QoS will generate the following master spanning tree for that
case:

                          |
      4    +    +    +    +    +    +
           |    |    |    |    |    |
      3    +    +    +    +    +    +
           |    |    |         |    |
      2    +    +    +         +    +
           |    |    |         |    |
      1    +----+----x----+----+----+
           |    |    |    |    |    |
    y=0    +    +    +    +    +    +
                          |

         x=0    1    2    3    4    5

Assuming the y dateline was between y=4 and y=0, this spanning tree has
a branch that crosses a dateline. However, again this cannot
contribute to credit loops as it occurs on a 1D ring (the ring for x=3)
that is broken by a failure, as in the above example.

The algorithm used by torus-2QoS to construct the torus topology from
the undirected graph representing the fabric requires that the radix of
each dimension be configured via torus-2QoS.conf. It also requires
that the torus topology be "seeded"; for a 3D torus this requires
configuring four switches that define the three coordinate directions
of the torus.

Given this starting information, the algorithm examines the cube formed
by the eight switch locations bounded by the corners (x,y,z) and
(x+1,y+1,z+1). Based on switches already placed into the torus
topology at some of these locations, the algorithm examines 4-loops of
inter-switch links to find the one that is consistent with a face of
the cube of switch locations, and adds its switches to the discovered
topology in the correct locations.

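One small step of such a search, completing a 4-loop from three
switches that are already placed, can be sketched as follows. This is a
much simplified illustration and not the OpenSM algorithm;
fourth_corner is a hypothetical name, and the fabric is assumed to be
given as a flattened n-by-n adjacency matrix.

```c
/* Hypothetical sketch, not OpenSM code.
 * adj is an n*n adjacency matrix stored row by row; adj[i * n + j] is
 * nonzero if switches i and j share an inter-switch link.  Switch a is
 * assumed to be linked to both b and c.
 * Returns the index of the unique switch d (other than a, b, c) that
 * is linked to both b and c, closing the 4-loop a-b-d-c-a, or -1 if
 * there is no such switch or more than one candidate.
 */
static int fourth_corner(const unsigned char *adj, int n,
                         int a, int b, int c)
{
    int d, found = -1;

    for (d = 0; d < n; d++) {
        if (d == a || d == b || d == c)
            continue;
        if (adj[b * n + d] && adj[c * n + d]) {
            if (found >= 0)
                return -1;   /* ambiguous: more than one candidate */
            found = d;
        }
    }
    return found;
}
```

On a small 3x3 torus with switches numbered y * 3 + x, the switches at
(0,0), (1,0) and (0,1) are completed by the switch at (1,1).
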
Because the algorithm is based on examining the topology of 4-loops of
links, a torus with one or more radix-4 dimensions requires extra
initial seed configuration. See torus-2QoS.conf(5) for details.
Torus-2QoS will detect and report when it has insufficient
configuration for a torus with radix-4 dimensions.

In the event the torus is significantly degraded, i.e., there are many
missing switches or links, it may happen that torus-2QoS is unable to
place into the torus some switches and/or links that were discovered in
the fabric, and will generate a warning in that case. A similar
condition occurs if torus-2QoS is misconfigured, i.e., the radix of a
torus dimension as configured does not match the radix of that torus
dimension as wired, and many switches/links in the fabric will not be
placed into the torus.

OpenSM will not program switches and channel adapters with SL2VL maps
or VL arbitration configuration unless it is invoked with -Q. Since
torus-2QoS depends on such functionality for correct operation, always
invoke OpenSM with -Q when torus-2QoS is in the list of routing
engines.

Any quality of service configuration method supported by OpenSM will
work with torus-2QoS, subject to the following limitations and
considerations.

For all routing engines supported by OpenSM except torus-2QoS, there is
a one-to-one correspondence between QoS level and SL. Torus-2QoS can
only support two quality of service levels, so only the high-order bit
of any SL value used for unicast QoS configuration will be honored by
torus-2QoS.

For multicast QoS configuration, only SL values 0 and 8 should be used
with torus-2QoS.

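In other words, a configured SL is reduced to its high-order bit for
unicast, and multicast is limited to the two SL values whose low three
bits are zero. A minimal sketch, with hypothetical helper names that
are not part of OpenSM:

```c
/* Hypothetical helpers, not OpenSM functions. */

/* QoS level that torus-2QoS would honor for a configured unicast SL:
 * only SL bit 3 (the high-order bit of the 4-bit SL) matters. */
static int qos_level_from_sl(unsigned sl)
{
    return (sl >> 3) & 0x1;
}

/* Multicast SL values should be 0 or 8, i.e. no dateline bits set. */
static int multicast_sl_ok(unsigned sl)
{
    return sl == 0 || sl == 8;
}
```

For example, configured unicast SL values 1 and 7 both fall into QoS
level 0, while 9 and 15 both fall into QoS level 1.
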
Since SL to VL map configuration must be under the complete control of
torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc., must
be, and will be, ignored, and a warning will be generated.

For inter-switch links, torus-2QoS uses VL values 0-3 to implement one
of its supported QoS levels, and VL values 4-7 to implement the other.
For endport links (CA, router, switch management port), torus-2QoS uses
VL value 0 for one of its supported QoS levels and VL value 1 to
implement the other. Hard-to-diagnose application issues may arise if
traffic is not delivered fairly across each of these two VL ranges.
For inter-switch links, torus-2QoS will detect and warn if VL
arbitration is configured unfairly across VLs in the range 0-3, and
also in the range 4-7. Note that the default OpenSM VL arbitration
configuration does not meet this constraint, so all torus-2QoS users
should configure VL arbitration via qos_ca_vlarb_high,
qos_swe_vlarb_high, qos_ca_vlarb_low, qos_swe_vlarb_low, etc.

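One way to sanity-check a VL arbitration setting before deploying it is
sketched below. This is not the check torus-2QoS performs: the helper
names are hypothetical, the setting is assumed to be a comma-separated
list of vl:weight pairs, and "fair" is read here as "every VL in the
range receives the same total weight".

```c
#include <stdio.h>

/* Hypothetical sketch, not OpenSM code.
 * Sum the weights given to each VL in a "vl:weight,vl:weight,..." list.
 * Returns 0 on success, -1 on a malformed entry or a VL above 15.
 */
static int sum_vl_weights(const char *list, unsigned weight[16])
{
    unsigned vl, w;
    int n;

    for (vl = 0; vl < 16; vl++)
        weight[vl] = 0;

    while (*list) {
        /* %n records how many characters this entry consumed. */
        if (sscanf(list, "%u:%u%n", &vl, &w, &n) != 2 || vl > 15)
            return -1;
        weight[vl] += w;
        list += n;
        if (*list == ',')
            list++;
        else if (*list)
            return -1;
    }
    return 0;
}

/* Returns 1 if VLs first..last all have the same total weight. */
static int range_is_fair(const unsigned weight[16], int first, int last)
{
    int vl;

    for (vl = first + 1; vl <= last; vl++)
        if (weight[vl] != weight[first])
            return 0;
    return 1;
}
```

A list that gives equal weight to VLs 0-7 passes range_is_fair() for
both 0-3 and 4-7, whereas a list that gives weight only to VL 0 fails
the check for 0-3.
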
Note that torus-2QoS maps SL values to VL values differently for
inter-switch and endport links. This is why qos_vlarb_high and
qos_vlarb_low should not be used, as using them may result in VL
arbitration for a QoS level being different across inter-switch links
vs. across endport links.

Any routing algorithm for a torus IB fabric must employ path SL values
to avoid credit loops. As a result, all applications run over such
fabrics must perform a path record query to obtain the correct path SL
for connection setup. Applications that use rdma_cm for connection
setup will automatically meet this requirement.

If a change in fabric topology causes changes in path SL values
required to route without credit loops, in general all applications
would need to repath to avoid message deadlock. Since torus-2QoS has
the ability to reroute after a single switch failure without changing
path SL values, repathing by running applications is not required when
the fabric is routed with torus-2QoS.

Torus-2QoS can provide unchanging path SL values in the presence of
subnet manager failover provided that all OpenSM instances have the
same idea of dateline location. See torus-2QoS.conf(5) for details.

Torus-2QoS will detect configurations of failed switches and links that
prevent routing that is free of credit loops, and will log warnings and
refuse to route. If "no_fallback" was configured in the list of OpenSM
routing engines, then no other routing engine will attempt to route the
fabric. In that case all paths that do not transit the failed
components will continue to work, and the subset of paths that are
still operational will continue to remain free of credit loops. OpenSM
will continue to attempt to route the fabric after every sweep
interval, and after any change (such as a link up) in the fabric
topology. When the fabric components are repaired, full functionality
will be restored.

In the event OpenSM was configured to allow some other engine to route
the fabric if torus-2QoS fails, then credit loops and message deadlock
are likely if torus-2QoS had previously routed the fabric successfully.
Even if the other engine is capable of routing a torus without credit
loops, applications that built connections with path SL values granted
under torus-2QoS will likely experience message deadlock under routing
generated by a different engine, unless they repath.

To verify that a torus fabric is routed free of credit loops, use
ibdmchk to analyze data collected via ibdiagnet -vlr.

/etc/rdma/opensm.conf
    default OpenSM config file.

/etc/rdma/qos-policy.conf
    default QoS policy config file.

/etc/rdma/torus-2QoS.conf
    default torus-2QoS config file.

opensm(8), torus-2QoS.conf(5), ibdiagnet(1), ibdmchk(1), rdma_cm(7).

OpenIB                       November 10, 2010               TORUS-2QOS(8)