TORUS-2QOS(8)                  OpenIB Management                 TORUS-2QOS(8)

NAME

       torus-2QoS - Routing engine for OpenSM subnet manager

DESCRIPTION

       Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus
       fabrics.  The torus-2QoS routing engine can provide the following
       functionality on a 2D/3D torus:

         – Routing that is free of credit loops.
         – Two levels of Quality of Service (QoS), assuming switches support
           eight data VLs and channel adapters support two data VLs.
         – The ability to route around a single failed switch, and/or multiple
           failed links, without
           – introducing credit loops, or
           – changing path SL values.
         – Very short run times, with good scaling properties as fabric size
           increases.

UNICAST ROUTING

       Unicast routing in torus-2QoS is based on Dimension Order Routing
       (DOR).  It avoids the deadlocks that would otherwise occur in a
       DOR-routed torus using the concept of a dateline for each torus
       dimension.  It encodes into a path SL which datelines the path
       crosses, as follows:

           sl = 0;
           for (d = 0; d < torus_dimensions; d++) {
               /* path_crosses_dateline(d) returns 0 or 1 */
               sl |= path_crosses_dateline(d) << d;
           }

       On a 3D torus this consumes three SL bits, leaving one SL bit unused.
       Torus-2QoS uses this SL bit to implement two QoS levels.

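       As a minimal sketch of the encoding above, the following C fragment
       computes a path SL from source and destination coordinates.  It is not
       the OpenSM implementation: path_sl(), its argument layout, the
       three-argument form of path_crosses_dateline(), the placement of each
       dateline between coordinates radix-1 and 0, and shortest-direction
       travel with ties going in the increasing direction are all assumptions
       made for illustration.

```c
#include <assert.h>

#define TORUS_DIMENSIONS 3

/*
 * Hypothetical helper: returns 1 if the shortest path on a 1D ring of size
 * `radix` from coordinate `src` to `dst` wraps between radix-1 and 0 (the
 * assumed dateline position), else 0.  Ties go in the increasing direction.
 */
int path_crosses_dateline(int src, int dst, int radix)
{
    assert(radix > 0 && src >= 0 && src < radix && dst >= 0 && dst < radix);

    int fwd = (dst - src + radix) % radix;  /* hops in the + direction */
    int bwd = radix - fwd;                  /* hops in the - direction */

    if (fwd == 0)
        return 0;           /* no movement in this dimension */
    if (fwd <= bwd)
        return dst < src;   /* moving +: wrapped from radix-1 to 0 */
    return dst > src;       /* moving -: wrapped from 0 to radix-1 */
}

/*
 * Builds a 4-bit path SL: bits 0-2 record dateline crossings per dimension,
 * bit 3 carries the QoS level (0 or 1).
 */
unsigned path_sl(const int src[], const int dst[], const int radix[],
                 unsigned qos_level)
{
    unsigned sl = 0;

    assert(qos_level <= 1);
    for (int d = 0; d < TORUS_DIMENSIONS; d++)
        sl |= (unsigned)path_crosses_dateline(src[d], dst[d], radix[d]) << d;
    return sl | (qos_level << 3);
}
```

       Under these assumptions, on a 6x5 torus treated as 6x5x1, a path from
       (1,1,0) to (3,3,0) crosses no dateline and gets SL 0 at QoS level 0.
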
       Torus-2QoS also makes use of the output port dependence of switch SL2VL
       maps to encode into one VL bit the information encoded in three SL
       bits.  It computes in which torus coordinate direction each
       inter-switch link "points", and writes SL2VL maps for such ports as
       follows:

           for (sl = 0; sl < 16; sl++) {
               /* cdir(port) computes which torus coordinate direction
                * a switch port "points" in; returns 0, 1, or 2
                */
               sl2vl(iport,oport,sl) = 0x1 & (sl >> cdir(oport));
           }

       Thus, on a pristine 3D torus, i.e., in the absence of failed fabric
       switches, torus-2QoS consumes eight SL values (SL bits 0-2) and two VL
       values (VL bit 0) per QoS level to provide deadlock-free routing.

       Torus-2QoS routes around link failure by "taking the long way around"
       any 1D ring interrupted by link failure.  For example, consider the 2D
       6x5 torus below, where switches are denoted by [+a-zA-Z]:

                               |    |    |    |    |    |
                          4  --+----+----+----+----+----+--
                               |    |    |    |    |    |
                          3  --+----+----+----D----+----+--
                               |    |    |    |    |    |
                          2  --+----+----I----r----+----+--
                               |    |    |    |    |    |
                          1  --m----S----n----T----o----p--
                               |    |    |    |    |    |
                        y=0  --+----+----+----+----+----+--
                               |    |    |    |    |    |

                             x=0    1    2    3    4    5

       For a pristine fabric the path from S to D would be S-n-T-r-D.  In the
       event that either link S-n or n-T has failed, torus-2QoS would use the
       path S-m-p-o-T-r-D.  Note that it can do this without changing the path
       SL value; once the 1D ring m-S-n-T-o-p-m has been broken by failure,
       path segments using it cannot contribute to deadlock, and the
       x-direction dateline (between, say, x=5 and x=0) can be ignored for
       path segments on that ring.

       One result of this is that torus-2QoS can route around many
       simultaneous link failures, as long as no 1D ring is broken into
       disjoint segments.  For example, if links n-T and T-o have both failed,
       that ring has been broken into two disjoint segments, T and o-p-m-S-n.
       Torus-2QoS checks for such issues, reports if they are found, and
       refuses to route such fabrics.

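       The disjoint-segment condition can be illustrated with a short C
       sketch.  ring_is_segmented() and the link_up[] layout are hypothetical
       and not part of OpenSM.  The sketch assumes that every switch of the
       ring is present and that each entry stands for the whole set of
       parallel links between two neighbors, so that two or more failed
       entries always split the ring.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * link_up[i] describes the connection between ring positions i and
 * (i + 1) % radix.  One failed connection leaves the ring as a single
 * chain; two or more cut it into disjoint segments.
 */
bool ring_is_segmented(const bool link_up[], int radix)
{
    int failed = 0;

    assert(radix >= 3);
    for (int i = 0; i < radix; i++)
        if (!link_up[i])
            failed++;
    return failed >= 2;
}
```

       With the ring m-S-n-T-o-p-m indexed from m = 0, failing n-T and T-o
       clears entries 2 and 3 and the function reports a segmented ring;
       failing only S-n clears entry 1 and it does not.
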
       Note that in the case where there are multiple parallel links between a
       pair of switches, torus-2QoS will allocate routes across such links in
       a round-robin fashion, based on ports at the path destination switch
       that are active and not used for inter-switch links.  Should a link
       that is one of several such parallel links fail, routes are
       redistributed across the remaining links.  When the last of such a set
       of parallel links fails, traffic is rerouted as described above.

       Handling a failed switch under DOR requires introducing into a path at
       least one turn that would be otherwise "illegal", i.e., not allowed by
       DOR rules.  Torus-2QoS will introduce such a turn as close as possible
       to the failed switch in order to route around it.

       In the above example, suppose switch T has failed, and consider the
       path from S to D.  Torus-2QoS will produce the path S-n-I-r-D, rather
       than the S-n-T-r-D path for a pristine torus, by introducing an early
       turn at n.  Normal DOR rules will cause traffic arriving at switch I to
       be forwarded to switch r; for traffic arriving from I due to the
       "early" turn at n, this will generate an "illegal" turn at I.

       Torus-2QoS will also use the input port dependence of SL2VL maps to set
       VL bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns,
       i.e., those turns that are illegal under DOR.  This causes the first
       hop after any such turn to use a separate set of VL values, and
       prevents deadlock in the presence of a single failed switch.

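       The two VL bits described so far can be combined in a short C sketch.
       turn_is_illegal() and interswitch_vl_bits() are hypothetical names, not
       OpenSM functions.  The sketch assumes coordinate directions are encoded
       as 0 (x), 1 (y), and 2 (z), matching the return values of cdir() above,
       that DOR routes x first, then y, then z, and that in_dir is the
       direction of the link the traffic arrived on.  The QoS level bit is
       left out.

```c
#include <assert.h>

/*
 * A turn from a higher dimension back to a lower one (y-x, z-x, z-y)
 * is illegal under x-then-y-then-z dimension order routing.
 */
int turn_is_illegal(int in_dir, int out_dir)
{
    assert(in_dir >= 0 && in_dir <= 2 && out_dir >= 0 && out_dir <= 2);
    return out_dir < in_dir;
}

/*
 * VL bit 0: whether the path crosses the dateline of the dimension the
 *           output port points in (taken from the matching SL bit).
 * VL bit 1: set for the first hop after a turn that is illegal under DOR.
 */
unsigned interswitch_vl_bits(unsigned sl, int in_dir, int out_dir)
{
    assert(sl < 16);

    unsigned dateline_bit = (sl >> out_dir) & 0x1;
    unsigned turn_bit = (unsigned)turn_is_illegal(in_dir, out_dir);

    return dateline_bit | (turn_bit << 1);
}
```

       For the path S-n-I-r-D with switch T failed, the hop I-r leaves in the
       x direction after arriving in the y direction, so the sketch sets VL
       bit 1 for that hop only.
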
       For any given path, only the hops after a turn that is illegal under
       DOR can contribute to a credit loop that leads to deadlock.  So in the
       example above with failed switch T, the location of the illegal turn at
       I in the path from S to D requires that any credit loop caused by that
       turn must encircle the failed switch at T.  Thus the second and later
       hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a
       credit loop because they cannot be used to construct a loop encircling
       T.  The hop I-r uses a separate VL, so it cannot contribute to a credit
       loop encircling T.

       Extending this argument shows that in addition to being capable of
       routing around a single switch failure without introducing deadlock,
       torus-2QoS can also route around multiple failed switches on the
       condition they are adjacent in the last dimension routed by DOR.  For
       example, consider the following case on a 6x6 2D torus:

                               |    |    |    |    |    |
                          5  --+----+----+----+----+----+--
                               |    |    |    |    |    |
                          4  --+----+----+----D----+----+--
                               |    |    |    |    |    |
                          3  --+----+----I----u----+----+--
                               |    |    |    |    |    |
                          2  --+----+----q----R----+----+--
                               |    |    |    |    |    |
                          1  --m----S----n----T----o----p--
                               |    |    |    |    |    |
                        y=0  --+----+----+----+----+----+--
                               |    |    |    |    |    |

                             x=0    1    2    3    4    5

       Suppose switches T and R have failed, and consider the path from S to
       D.  Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn
       at switch I, and with hop I-u using a VL with bit 1 set.

       As a further example, consider a case that torus-2QoS cannot route
       without deadlock: two failed switches adjacent in a dimension that is
       not the last dimension routed by DOR; here the failed switches are O
       and T:

                               |    |    |    |    |    |
                          5  --+----+----+----+----+----+--
                               |    |    |    |    |    |
                          4  --+----+----+----+----+----+--
                               |    |    |    |    |    |
                          3  --+----+----+----+----D----+--
                               |    |    |    |    |    |
                          2  --+----+----I----q----r----+--
                               |    |    |    |    |    |
                          1  --m----S----n----O----T----p--
                               |    |    |    |    |    |
                        y=0  --+----+----+----+----+----+--
                               |    |    |    |    |    |

                             x=0    1    2    3    4    5

       In a pristine fabric, torus-2QoS would generate the path from S to D as
       S-n-O-T-r-D.  With failed switches O and T, torus-2QoS will generate
       the path S-n-I-q-r-D, with illegal turn at switch I, and with hop I-q
       using a VL with bit 1 set.  In contrast to the earlier examples, the
       second hop after the illegal turn, q-r, can be used to construct a
       credit loop encircling the failed switches.

MULTICAST ROUTING

       Since torus-2QoS uses all four available SL bits, and the three data VL
       bits that are typically available in current switches, there is no way
       to use SL/VL values to separate multicast traffic from unicast traffic.
       Thus, torus-2QoS must generate multicast routing such that credit loops
       cannot arise from a combination of multicast and unicast path segments.

       It turns out that it is possible to construct spanning trees for
       multicast routing that have that property.  For the 2D 6x5 torus
       example above, here is the full-fabric spanning tree that torus-2QoS
       will construct, where "x" is the root switch and each "+" is a non-root
       switch:

                          4    +    +    +    +    +    +
                               |    |    |    |    |    |
                          3    +    +    +    +    +    +
                               |    |    |    |    |    |
                          2    +----+----+----x----+----+
                               |    |    |    |    |    |
                          1    +    +    +    +    +    +
                               |    |    |    |    |    |
                        y=0    +    +    +    +    +    +

                             x=0    1    2    3    4    5

       For multicast traffic routed from root to tip, every turn in the above
       spanning tree is a legal DOR turn.

       For traffic routed from tip to root, and some traffic routed through
       the root, turns are not legal DOR turns.  However, to construct a
       credit loop, the union of multicast routing on this spanning tree with
       DOR unicast routing can only provide 3 of the 4 turns needed for the
       loop.

       In addition, if none of the above spanning tree branches crosses a
       dateline used for unicast credit loop avoidance on a torus, and if
       multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS
       uses SL bit 3 to differentiate QoS level), then multicast traffic also
       cannot contribute to the "ring" credit loops that are otherwise
       possible in a torus.

       Torus-2QoS uses these ideas to create a master spanning tree.  Every
       multicast group spanning tree will be constructed as a subset of the
       master tree, with the same root as the master tree.

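       The shape of the pristine-fabric spanning tree shown above can be
       sketched in C as a parent lookup.  struct coord and
       master_tree_parent() are hypothetical and not taken from OpenSM; the
       sketch covers only a pristine 2D torus and assumes, as in the diagram,
       that no tree branch uses a wraparound link.

```c
#include <assert.h>
#include <stddef.h>

struct coord {
    int x, y;
};

/*
 * Computes the tree parent of switch `sw` for a tree rooted at `root`.
 * Switches off the root's row step one hop in y toward that row; switches
 * on the root's row step one hop in x toward the root.  Traffic flowing
 * from root to tip therefore moves in x first and then in y.
 * Returns 1 and fills *parent for a non-root switch, 0 for the root.
 */
int master_tree_parent(struct coord sw, struct coord root,
                       struct coord *parent)
{
    assert(parent != NULL);

    *parent = sw;
    if (sw.y != root.y) {
        parent->y += (sw.y < root.y) ? 1 : -1;
        return 1;
    }
    if (sw.x != root.x) {
        parent->x += (sw.x < root.x) ? 1 : -1;
        return 1;
    }
    return 0;
}
```

       With the root at (3,2) as in the diagram, the parent of (0,0) is (0,1)
       and the parent of (0,2) is (1,2).
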
       Such multicast group spanning trees will in general not be optimal for
       groups which are a subset of the full fabric.  However, this compromise
       must be made to enable support for two QoS levels on a torus while
       preventing credit loops.

       In the presence of link or switch failures that result in a fabric for
       which torus-2QoS can generate credit-loop-free unicast routes, it is
       also possible to generate a master spanning tree for multicast that
       retains the required properties.  For example, consider that same 2D
       6x5 torus, with the link from (2,2) to (3,2) failed.  Torus-2QoS will
       generate the following master spanning tree:

                          4    +    +    +    +    +    +
                               |    |    |    |    |    |
                          3    +    +    +    +    +    +
                               |    |    |    |    |    |
                          2  --+----+----+    x----+----+--
                               |    |    |    |    |    |
                          1    +    +    +    +    +    +
                               |    |    |    |    |    |
                        y=0    +    +    +    +    +    +

                             x=0    1    2    3    4    5

       Two things are notable about this master spanning tree.  First,
       assuming the x dateline was between x=5 and x=0, this spanning tree has
       a branch that crosses the dateline.  However, just as for unicast,
       crossing a dateline on a 1D ring (here, the ring for y=2) that is
       broken by a failure cannot contribute to a torus credit loop.

       Second, this spanning tree is no longer optimal even for multicast
       groups that encompass the entire fabric.  That, unfortunately, is a
       compromise that must be made to retain the other desirable properties
       of torus-2QoS routing.

       In the event that a single switch fails, torus-2QoS will generate a
       master spanning tree that has no "extra" turns by appropriately
       selecting a root switch.  In the 2D 6x5 torus example, assume now that
       the switch at (3,2), i.e., the root for a pristine fabric, fails.
       Torus-2QoS will generate the following master spanning tree for that
       case:

                                              |
                          4    +    +    +    +    +    +
                               |    |    |    |    |    |
                          3    +    +    +    +    +    +
                               |    |    |         |    |
                          2    +    +    +         +    +
                               |    |    |         |    |
                          1    +----+----x----+----+----+
                               |    |    |    |    |    |
                        y=0    +    +    +    +    +    +
                                              |

                             x=0    1    2    3    4    5

       Assuming the y dateline was between y=4 and y=0, this spanning tree has
       a branch that crosses a dateline.  However, again this cannot
       contribute to credit loops as it occurs on a 1D ring (the ring for x=3)
       that is broken by a failure, as in the above example.

TORUS TOPOLOGY DISCOVERY

       The algorithm used by torus-2QoS to construct the torus topology from
       the undirected graph representing the fabric requires that the radix of
       each dimension be configured via torus-2QoS.conf.  It also requires
       that the torus topology be "seeded"; for a 3D torus this requires
       configuring four switches that define the three coordinate directions
       of the torus.

       Given this starting information, the algorithm examines the cube formed
       by the eight switch locations bounded by the corners (x,y,z) and
       (x+1,y+1,z+1).  Based on switches already placed into the torus
       topology at some of these locations, the algorithm examines 4-loops of
       inter-switch links to find the one that is consistent with a face of
       the cube of switch locations, and adds its switches to the discovered
       topology in the correct locations.

       Because the algorithm is based on examining the topology of 4-loops of
       links, a torus with one or more radix-4 dimensions requires extra
       initial seed configuration.  See torus-2QoS.conf(5) for details.
       Torus-2QoS will detect and report when it has insufficient
       configuration for a torus with radix-4 dimensions.

       In the event the torus is significantly degraded, i.e., there are many
       missing switches or links, it may happen that torus-2QoS is unable to
       place into the torus some switches and/or links that were discovered in
       the fabric, and will generate a warning in that case.  A similar
       condition occurs if torus-2QoS is misconfigured, i.e., the radix of a
       torus dimension as configured does not match the radix of that torus
       dimension as wired, and many switches/links in the fabric will not be
       placed into the torus.

QUALITY OF SERVICE CONFIGURATION

       OpenSM will not program switches and channel adapters with SL2VL maps
       or VL arbitration configuration unless it is invoked with -Q.  Since
       torus-2QoS depends on such functionality for correct operation, always
       invoke OpenSM with -Q when torus-2QoS is in the list of routing
       engines.

       Any quality of service configuration method supported by OpenSM will
       work with torus-2QoS, subject to the following limitations and
       considerations.

       For all routing engines supported by OpenSM except torus-2QoS, there is
       a one-to-one correspondence between QoS level and SL.  Torus-2QoS can
       only support two quality of service levels, so only the high-order bit
       of any SL value used for unicast QoS configuration will be honored by
       torus-2QoS.

       For multicast QoS configuration, only SL values 0 and 8 should be used
       with torus-2QoS.

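       The two SL rules above amount to a pair of one-line checks, sketched
       here in C.  qos_level_from_sl() and multicast_sl_is_valid() are
       hypothetical helper names used only for illustration; they are not
       OpenSM functions or configuration keywords.

```c
#include <assert.h>

/*
 * Only the high-order bit (bit 3) of a 4-bit SL selects the QoS level;
 * the low three bits are reserved for dateline encoding.
 */
unsigned qos_level_from_sl(unsigned sl)
{
    assert(sl < 16);
    return (sl >> 3) & 0x1;
}

/* Multicast traffic should use SL 0 or SL 8 only. */
int multicast_sl_is_valid(unsigned sl)
{
    return sl == 0 || sl == 8;
}
```

       For example, SL values 5 and 13 configured for unicast differ only in
       bit 3, so they select QoS levels 0 and 1 respectively.
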
       Since SL to VL map configuration must be under the complete control of
       torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc., is
       ignored, and a warning is generated.

       For inter-switch links, torus-2QoS uses VL values 0-3 to implement one
       of its supported QoS levels, and VL values 4-7 to implement the other.
       For endport links (CA, router, switch management port), torus-2QoS uses
       VL value 0 for one of its supported QoS levels and VL value 1 to
       implement the other.  Hard-to-diagnose application issues may arise if
       traffic is not delivered fairly across each of these two VL ranges.
       For inter-switch links, torus-2QoS will detect and warn if VL
       arbitration is configured unfairly across VLs in the range 0-3, and
       also in the range 4-7.  Note that the default OpenSM VL arbitration
       configuration does not meet this constraint, so all torus-2QoS users
       should configure VL arbitration via qos_ca_vlarb_high,
       qos_swe_vlarb_high, qos_ca_vlarb_low, qos_swe_vlarb_low, etc.

       Note that torus-2QoS maps SL values to VL values differently for
       inter-switch and endport links.  This is why qos_vlarb_high and
       qos_vlarb_low should not be used, as using them may result in VL
       arbitration for a QoS level being different across inter-switch links
       vs. across endport links.

OPERATIONAL CONSIDERATIONS

       Any routing algorithm for a torus IB fabric must employ path SL values
       to avoid credit loops.  As a result, all applications run over such
       fabrics must perform a path record query to obtain the correct path SL
       for connection setup.  Applications that use rdma_cm for connection
       setup will automatically meet this requirement.

       If a change in fabric topology causes changes in path SL values
       required to route without credit loops, in general all applications
       would need to repath to avoid message deadlock.  Since torus-2QoS has
       the ability to reroute after a single switch failure without changing
       path SL values, repathing by running applications is not required when
       the fabric is routed with torus-2QoS.

       Torus-2QoS can provide unchanging path SL values in the presence of
       subnet manager failover provided that all OpenSM instances are
       configured with the same dateline locations.  See torus-2QoS.conf(5)
       for details.

       Torus-2QoS will detect configurations of failed switches and links that
       prevent routing that is free of credit loops, and will log warnings and
       refuse to route.  If "no_fallback" was configured in the list of OpenSM
       routing engines, then no other routing engine will attempt to route the
       fabric.  In that case all paths that do not transit the failed
       components will continue to work, and the subset of paths that are
       still operational will remain free of credit loops.  OpenSM will
       continue to attempt to route the fabric after every sweep interval, and
       after any change (such as a link up) in the fabric topology.  When the
       fabric components are repaired, full functionality will be restored.

       In the event OpenSM was configured to allow some other engine to route
       the fabric if torus-2QoS fails, then credit loops and message deadlock
       are likely if torus-2QoS had previously routed the fabric successfully.
       Even if the other engine is capable of routing a torus without credit
       loops, applications that built connections with path SL values granted
       under torus-2QoS will likely experience message deadlock under routing
       generated by a different engine, unless they repath.

       To verify that a torus fabric is routed free of credit loops, use
       ibdmchk to analyze data collected via ibdiagnet -vlr.

FILES

       /etc/rdma/opensm.conf
              default OpenSM config file.

       /etc/rdma/qos-policy.conf
              default QoS policy config file.

       /etc/rdma/torus-2QoS.conf
              default torus-2QoS config file.

SEE ALSO

       opensm(8), torus-2QoS.conf(5), ibdiagnet(1), ibdmchk(1), rdma_cm(7).

OpenIB                         November 10, 2010                 TORUS-2QOS(8)