TC-HFSC(7)                            Linux                           TC-HFSC(7)


NAME
tc-hfsc - Hierarchical Fair Service Curve

HISTORY & INTRODUCTION
HFSC (Hierarchical Fair Service Curve) is a network packet scheduling
algorithm that was first presented at SIGCOMM'97. Developed as a part of
ALTQ (ALTernative Queuing) on NetBSD, it quickly found its way to other BSD
systems, and then a few years ago became part of the Linux kernel. Still,
it's not the most popular scheduling algorithm - especially when compared
to HTB - and it's not well documented for the end user. This introduction
aims to explain how HFSC works without using too much math (although some
math will be inevitable).

In short HFSC aims to:

1) guarantee precise bandwidth and delay allocation for all leaf classes
   (realtime criterion)

2) allocate excess bandwidth fairly as specified by class hierarchy
   (linkshare & upperlimit criterion)

3) minimize any discrepancy between the service curve and the actual
   amount of service provided during linksharing

The main "selling" point of HFSC is feature (1), which is achieved by using
nonlinear service curves (more about what they actually are later). This is
particularly useful for VoIP or games, where not only a guarantee of
consistent bandwidth is important, but also limiting the initial delay of a
data stream. Note that it matters only for leaf classes (where the actual
queues are) - thus class hierarchy is ignored in the realtime case.

Feature (2) is, well, obvious - any algorithm featuring class hierarchy
(such as HTB or CBQ) strives to achieve that. HFSC does that well, although
you might end up with unusual situations if you define service curves
carelessly - see the CORNER CASES section for examples.

Feature (3) is mentioned due to the nature of the problem. There may be
situations where it's either not possible to guarantee service of all
curves at the same time, and/or it's impossible to do so fairly. Both will
be explained later. Note that this is mainly related to interior (aka
aggregate) classes, as the leaves are already handled by (1). Still, it's
perfectly possible to create a leaf class without realtime service, and in
such a case the caveats will naturally extend to leaf classes as well.

ABBREVIATIONS
For the remaining part of the document, we'll use the following shortcuts:

   RT - realtime
   LS - linkshare
   UL - upperlimit
   SC - service curve

BASICS OF HFSC
To understand how HFSC works, we must first introduce a service curve.
Overall, it's a nondecreasing function of some time unit, returning the
amount of service (an allowed or allocated amount of bandwidth) at some
specific point in time. Its purpose should be intuitively obvious: if a
class was allowed to transfer not less than the amount specified by its
service curve, then the service curve is not violated.

Still, we need a more elaborate criterion than just the above (although in
the most generic case it can be reduced to it). The criterion has to take
two things into account:

   · idling periods

   · the ability to "look back", so if during the current active period
     the service curve is violated, maybe it isn't if we count excess
     bandwidth received during earlier active period(s)

Let's define the criterion as follows:

   (1) For each t1, there must exist t0 in set B, so that S(t1-t0) <= w(t0,t1)

Here 'w' denotes the amount of service received during some time period
between t0 and t1, and S is the service curve. B is a set of all times
where a session becomes active after an idling period (further denoted as
'becoming backlogged'). For a clearer picture, imagine two situations:

   a) our session was active during two periods, with a small time gap
      between them

   b) as in (a), but with a larger gap

Consider (a): if the service received during both periods meets (1), then
all is well. But what if it doesn't do so during the 2nd period? If the
amount of service received during the 1st period is larger than the service
curve, then it might compensate for the smaller service during the 2nd
period and the gap - if the gap is small enough.

If the gap is larger (b), this is less likely to happen (unless the excess
bandwidth allocated during the 1st period was really large). Still, the
larger the gap, the less interesting is what happened in the past (e.g. 10
minutes ago) - what matters is the current traffic that has just started.

From HFSC's perspective, the more interesting question is: when should we
start transferring packets, so that the service curve of a class is not
violated? Or, rephrasing it: what amount of service X() should a session
have received by time t, so that the service curve is not violated?
Function X(), defined as below, is the basic building block of HFSC, used
in the eligible, deadline, virtual-time and fit-time curves. Of course, X()
is based on equation (1) and is defined recursively:

   · At the beginning of the 1st backlogged period, function X is
     initialized to the generic service curve assigned to the class

   · At any subsequent backlogged period, X() is:
     min(X() from the previous period ; w(t0)+S(t-t0) for t>=t0),
     ... where t0 denotes the beginning of the current backlogged period.

HFSC uses either linear, or two-piece linear service curves. In case of
linear or two-piece linear convex functions (first slope < second slope),
min() in X's definition reduces to the 2nd argument. But in case of
two-piece concave functions, the 1st argument might quickly become smaller
for some t>=t0. Note that for some backlogged period, X() is defined only
from that period's beginning. We also define X^(-1)(w) as the smallest
t>=t0 for which X(t) = w. We have to define it this way, as X() is usually
not an injection.

The above generic X() can be one of the following:

E()  In the realtime criterion, selects packets eligible for sending. If
     none are eligible, HFSC will use the linkshare criterion. Eligible
     time 'et' is calculated with reference to packets' heads
     (et = E^(-1)(w)). It's based on the RT service curve, but in case of
     a convex curve, uses its 2nd slope only.

D()  In the realtime criterion, selects the most suitable packet from the
     ones chosen by E(). Deadline time 'dt' corresponds to packets' tails
     (dt = D^(-1)(w+l), where 'l' is the packet's length). Based on the RT
     service curve.

V()  In the linkshare criterion, arbitrates which packet to send next.
     Note that V() is a function of virtual time - see the LINKSHARE
     CRITERION section for details. Virtual time 'vt' corresponds to
     packets' heads (vt = V^(-1)(w)). Based on the LS service curve.

F()  An extension to the linkshare criterion, used to limit the speed at
     which the linkshare criterion is allowed to dequeue. Fit-time 'ft'
     corresponds to packets' heads as well (ft = F^(-1)(w)). Based on the
     UL service curve.

Be sure to make a clean distinction between a session's RT, LS and UL
service curves and the above "utility" functions.

REALTIME CRITERION
The RT criterion ignores class hierarchy and guarantees precise bandwidth
and delay allocation. We say that a packet is eligible for sending when the
current real time is later than the eligible time of the packet. From all
eligible packets, the one most suited for sending is the one with the
shortest deadline time. This sounds simple, but consider the following
example:

Interface 10Mbit, two classes, both with two-piece linear service curves:

   · 1st class - 2Mbit for 100ms, then 7Mbit (convex - 1st slope < 2nd
     slope)

   · 2nd class - 7Mbit for 100ms, then 2Mbit (concave - 1st slope > 2nd
     slope)

Assume for a moment that we only use D() both for finding eligible packets
and for choosing the most fitting one, so eligible time would be computed
as D^(-1)(w) and deadline time as D^(-1)(w+l). If the 2nd class starts
sending packets 1 second after the 1st class, it's of course impossible to
guarantee 14Mbit, as the interface capability is only 10Mbit. The only
workaround in this scenario is to allow the 1st class to send its packets
earlier than would normally be allowed. That's where a separate E() comes
to help. Putting all the math aside (see the HFSC paper for details), E()
for a concave RT service curve is just like D(), but for a convex RT
service curve it's constructed using only the RT service curve's 2nd slope
(in our example 7Mbit).

The effect of such an E() is that packets will be sent earlier, and at the
same time D() will be updated - so the current deadline time calculated
from it will be later. Thus, when the 2nd class starts sending packets
later, both the 1st and the 2nd class will be eligible, but the 2nd
session's deadline time will be smaller and its packets will be sent first.
When the 1st class becomes idle at some later point, the 2nd class will be
able to "buffer" up again for a later active period of the 1st class.

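For reference, two-piece curves like the ones above are expressed with the
m1/d/m2 parameters of tc-hfsc(8). A minimal sketch of such a setup could
look like the following (the device, class ids and the root qdisc line are
assumptions, and the filters directing traffic to the classes are not
shown):

   tc qdisc add dev eth0 root handle 1:0 hfsc
   # 1st class: 2Mbit for the first 100ms of its backlogged period, then 7Mbit
   tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m1 2mbit d 100ms m2 7mbit
   # 2nd class: 7Mbit for the first 100ms, then 2Mbit
   tc class add dev eth0 parent 1:0 classid 1:2 hfsc rt m1 7mbit d 100ms m2 2mbit
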
A short remark - in a situation where the total amount of bandwidth
available on the interface is larger than the total allocated realtime
parts (imagine a 10Mbit interface, but 1Mbit/2Mbit and 2Mbit/1Mbit
classes), the sheer speed of the interface could suffice to guarantee the
times.

An important part of the RT criterion is that apart from updating its D()
and E(), it also updates V() used by the LS criterion. Generally the RT
criterion is secondary to the LS one, and it is used only if there's a risk
of violating precise realtime requirements. Still, the "participation" in
bandwidth distributed by the LS criterion is there, so V() has to be
updated along the way. The LS criterion can then properly compensate for a
non-ideal fair sharing situation caused by RT scheduling. If you use a UL
service curve, its F() will be updated as well (the UL service curve is an
extension to the LS one - see the UPPERLIMIT CRITERION section).

Anyway - careless specification of LS and RT service curves can lead to
potentially undesired situations (see CORNER CASES for examples). This
wasn't the case in the HFSC paper, where LS and RT service curves couldn't
be specified separately.

LINKSHARE CRITERION
The LS criterion's task is to distribute bandwidth according to the
specified class hierarchy. Contrary to the RT criterion, there are no
comparisons between current real time and virtual time - the decision is
based solely on a direct comparison of the virtual times of all active
subclasses - the one with the smallest vt wins and gets scheduled. One
immediate conclusion from this fact is that absolute values don't matter -
only the ratios between them (so for example, two child classes with simple
linear 1Mbit service curves will get the same treatment from the LS
criterion's perspective as if they were 5Mbit). The other conclusion is
that in a perfectly fluid system with linear curves, all virtual times
across the whole class hierarchy would be equal.

Why is V() defined in terms of virtual time (and what is it)?

Imagine an example: class A with two children - A1 and A2, both with, let's
say, 10Mbit SCs. If A2 is idle, A1 receives all the bandwidth of A (and
updates its V() in the process). When A2 becomes active, A1's virtual time
is already far later than A2's. Considering the type of decision made by
the LS criterion, A1 would become idle for a long time. We can work around
this situation by adjusting the virtual time of the class becoming active -
we do that by bringing such time "up to date". HFSC uses the mean of the
smallest and the biggest virtual time of the currently active children fit
for sending. As it's not real time anymore (excluding the trivial case
where all classes become active at the same time and never become idle),
it's called virtual time.

Such an approach has its price though. The problem is analogous to what was
presented in the previous section and is caused by the non-linearity of
service curves:

1) either it's impossible to guarantee service curves and satisfy fairness
   during certain time periods:

   Recall the example from the RT section, slightly modified (with 3Mbit
   slopes instead of 2Mbit ones):

      · 1st class - 3Mbit for 100ms, then 7Mbit (convex - 1st slope < 2nd
        slope)

      · 2nd class - 7Mbit for 100ms, then 3Mbit (concave - 1st slope > 2nd
        slope)

   They sum up nicely to 10Mbit - the interface's capacity. But if we
   wanted to use only LS for guarantees and fairness, it simply won't work.
   In the LS context, only V() is used for making the decision which class
   to schedule. If the 2nd class becomes active when the 1st one is in its
   second slope, fairness will be preserved - the ratio will be 1:1
   (7Mbit:7Mbit), but LS itself is of course unable to guarantee the
   absolute values themselves - as it would have to go beyond what the
   interface is capable of.

2) and/or it's impossible to guarantee service curves of all classes at the
   same time [fairly or not]:

   This is similar to the above case, but a bit more subtle. We will
   consider two subtrees, arbitrated by their common (root here) parent:

      R (root) - 10Mbit

         A  - 7Mbit, then 3Mbit
            A1 - 5Mbit, then 2Mbit
            A2 - 2Mbit, then 1Mbit

         B  - 3Mbit, then 7Mbit

   R arbitrates between the left subtree (A) and the right one (B). Assume
   that A2 and B are constantly backlogged, and at some later point A1
   becomes backlogged (when all other classes are in their 2nd linear
   part).

   What happens now? B (the choice made by R) will always get 7Mbit, as R
   is (obviously) only concerned with the ratio between its direct
   children. Thus the A subtree gets 3Mbit, but its children would want (at
   the point when A1 became backlogged) 5Mbit + 1Mbit. That's of course
   impossible, as they can only get 3Mbit due to the interface limitation.

   In the left subtree we have the same situation as previously (a fair
   split between A1 and A2, but violated guarantees), but in the whole tree
   there's no fairness (B got 7Mbit, but A1 and A2 have to fit together in
   3Mbit) and there are no guarantees for all classes (only B got what it
   wanted). Even if we violated fairness in the A subtree and set A2's
   service curve to 0, A1 would still not get the required bandwidth.

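For illustration, the above hierarchy could be sketched with commands along
these lines (the device, class ids and the 100ms length of the first
segments are assumptions; R's 10Mbit is simply the interface capacity, and
filters are not shown):

   tc qdisc add dev eth0 root handle 1:0 hfsc
   # A subtree
   tc class add dev eth0 parent 1:0 classid 1:1  hfsc ls m1 7mbit d 100ms m2 3mbit
   tc class add dev eth0 parent 1:1 classid 1:10 hfsc ls m1 5mbit d 100ms m2 2mbit
   tc class add dev eth0 parent 1:1 classid 1:11 hfsc ls m1 2mbit d 100ms m2 1mbit
   # B
   tc class add dev eth0 parent 1:0 classid 1:2  hfsc ls m1 3mbit d 100ms m2 7mbit
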
UPPERLIMIT CRITERION
The UL criterion is an extension to the LS one that permits sending packets
only if the current real time is later than the fit-time ('ft'). So the
modified LS criterion becomes: choose the smallest virtual time from all
active children, such that fit-time < current real time also holds.
Fit-time is calculated from F(), which is based on the UL service curve. As
you can see, its role is somewhat similar to that of E() used in the RT
criterion. Also, for obvious reasons, you can't specify a UL service curve
without an LS one.

The main purpose of the UL service curve is to limit HFSC to the bandwidth
available on the upstream router (think of an ADSL home modem/router, and a
Linux server acting as NAT/firewall/etc. with a 100Mbit+ connection to the
mentioned modem/router). Typically, it's used to create a single class
directly under the root, setting a linear UL service curve to the available
bandwidth - and then creating your class structure from that class
downwards. Of course, you're free to add a UL service curve (linear or not)
to any class with an LS criterion.

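A minimal sketch of that typical layout could look as follows (the 8Mbit
upstream rate, device, class ids and the child classes are assumptions -
the actual structure would be built below 1:1):

   tc qdisc add dev eth0 root handle 1:0 hfsc default 20
   # single class directly under root, upper-limited to the upstream bandwidth
   tc class add dev eth0 parent 1:0 classid 1:1  hfsc ls m2 8mbit ul m2 8mbit
   # class structure continues from 1:1 downwards, for example:
   tc class add dev eth0 parent 1:1 classid 1:10 hfsc ls m2 6mbit
   tc class add dev eth0 parent 1:1 classid 1:20 hfsc ls m2 2mbit
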
An important thing about the UL service curve is that whenever at some
point in time a class doesn't qualify for linksharing due to its fit-time,
the next time it does qualify it will update its virtual time to the
smallest virtual time of all active children fit for linksharing. This way,
one of the main things the LS criterion tries to achieve - equality of all
virtual times across the whole hierarchy - is preserved (in a perfectly
fluid system with only linear curves, all virtual times would be equal).

Without that, 'vt' would lag behind the other virtual times, and could
cause problems. Consider an interface with a capacity of 10Mbit, and the
following leaf classes (just in case you're skipping this text quickly -
this example shows behavior that doesn't happen):

   A - ls 5.0Mbit
   B - ls 2.5Mbit
   C - ls 2.5Mbit, ul 2.5Mbit

If B was idle while A and C were constantly backlogged, A and C would
normally (as far as the LS criterion is concerned) divide bandwidth in a
2:1 ratio. But due to the UL service curve in place, C would get at most
2.5Mbit, and A would get the remaining 7.5Mbit. The longer the backlogged
period, the further the virtual times of A and C would drift apart. If B
became backlogged at some later point in time, its virtual time would be
set to (A's vt + C's vt)/2, thus blocking A from sending any traffic until
B's virtual time caught up with A's.

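Expressed as commands (a sketch - the device and class ids are
assumptions), the above leaf classes would be:

   tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 5mbit
   tc class add dev eth0 parent 1:0 classid 1:2 hfsc ls m2 2500kbit
   # C: same linkshare as B, but upper-limited to 2.5Mbit
   tc class add dev eth0 parent 1:0 classid 1:3 hfsc ls m2 2500kbit ul m2 2500kbit
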
SEPARATE LS / RT SCs
Another difference from the original HFSC paper is that RT and LS SCs can
be specified separately. Moreover, leaf classes are allowed to have only a
RT SC or only a LS SC. For interior classes, only LS SCs make sense: any RT
SC will be ignored.

CORNER CASES
Separate service curves for LS and RT criteria can lead to certain traps
that come from "fighting" between ideal linksharing and enforced realtime
guarantees. Those situations didn't exist in the original HFSC paper, where
specifying separate LS / RT service curves was not discussed.

Consider an interface with a 10Mbit capacity, with the following leaf
classes:

   A - ls 5.0Mbit, rt 8Mbit
   B - ls 2.5Mbit
   C - ls 2.5Mbit

Imagine A and C are constantly backlogged. As B is idle, A and C would
divide bandwidth in a 2:1 ratio, considering the LS service curves (so in
theory - 6.66 and 3.33Mbit). Alas, the RT criterion takes priority, so A
will get 8Mbit and the LS criterion will be able to compensate class C for
only 2Mbit - this will cause a discrepancy between the virtual times of A
and C.

Assume this situation lasts for a long time with no idle periods, and
suddenly B becomes active. B's virtual time will be updated to
(A's vt + C's vt)/2, effectively landing in the middle between A's and C's
virtual times. The effect - B, having no RT guarantees, will be punished
and will not be allowed to transfer until C's virtual time catches up.

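For completeness, the (problematic) setup above corresponds to something
like the following (a sketch - the device and class ids are assumptions):

   tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 8mbit ls m2 5mbit
   tc class add dev eth0 parent 1:0 classid 1:2 hfsc ls m2 2500kbit
   tc class add dev eth0 parent 1:0 classid 1:3 hfsc ls m2 2500kbit
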
If the interface had a higher capacity, for example 100Mbit, this example
would behave perfectly fine though.

Let's look a bit closer at the above example - it "cleverly" invalidates
one of the basic things the LS criterion tries to achieve - equality of all
virtual times across the class hierarchy. Leaf classes without RT service
curves are literally left to their own fate (governed by messed up virtual
times).

Also, it doesn't make much sense. Class A will always be guaranteed up to
8Mbit, and this is more than any absolute bandwidth it could get from its
LS criterion (excluding the trivial case of only A being active). If the
bandwidth taken by A is smaller than the absolute value from the LS
criterion, the unused part will be automatically assigned to the other
active classes (as A has idling periods in such a case). The only
"advantage" is that even in case of low bandwidth on average, bursts would
be handled at the speed defined by the RT criterion. Still, if extra speed
is needed (e.g. due to latency), nonlinear service curves should be used in
such a case.

In other words: the LS criterion is meaningless in the above example.

You can quickly "work around" it by making sure each leaf class has a RT
service curve assigned (thus guaranteeing all of them will get some
bandwidth), but it doesn't make it any more valid.

Keep in mind - if you use nonlinear curves and the irregularities explained
above happen only in the first segment, then there's little wrong with
"overusing" the RT curve a bit:

   A - ls 5.0Mbit, rt 9Mbit/30ms, then 1Mbit
   B - ls 2.5Mbit
   C - ls 2.5Mbit

Here, the vt of A will "spike" in the initial period, but then A will never
get more than 1Mbit until B & C catch up. Then everything will be back to
normal.

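In tc terms, A's curves from this last example could be written as (a
sketch - the device and class ids are assumptions):

   # rt: 9Mbit for the first 30ms of a backlogged period, then 1Mbit; ls: 5Mbit
   tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m1 9mbit d 30ms m2 1mbit ls m2 5mbit
   tc class add dev eth0 parent 1:0 classid 1:2 hfsc ls m2 2500kbit
   tc class add dev eth0 parent 1:0 classid 1:3 hfsc ls m2 2500kbit
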
LINUX AND TIMER RESOLUTION
In certain situations, the scheduler can throttle itself and set up a
so-called watchdog to wake up the dequeue function at some later time. In
the case of HFSC this happens when, for example, no packet is eligible for
scheduling, and the UL service curve is used to limit the speed at which
the LS criterion is allowed to dequeue packets. It's called throttling, and
its accuracy depends on how the kernel is compiled.

There are 3 important options in modern kernels, as far as timers'
resolution goes: 'tickless system', 'high resolution timer support' and
'timer frequency'.

If you have 'tickless system' enabled, then the timer interrupt will
trigger as rarely as possible, but each time the scheduler throttles itself
(or any other part of the kernel needs better accuracy), the rate will be
increased as needed / possible. The ceiling is either 'timer frequency' if
'high resolution timer support' is not available or not compiled in, or
it's hardware dependent and can go far beyond the highest 'timer frequency'
setting available.

If 'tickless system' is not enabled, the timer will trigger at a fixed rate
specified by 'timer frequency' - regardless of whether high resolution
timers are available or not.

It is important to keep those settings in mind, as in a scenario like: no
tickless system, no HR timers, frequency set to 100Hz - throttling accuracy
would be 10ms. It doesn't automatically mean you would be limited to
~0.8Mbit/s (assuming packets of ~1KB) - as long as your queues are prepared
to cover for the timer inaccuracy. Of course, in case of e.g. locally
generated UDP traffic, an appropriate socket buffer size is needed as well.
A short example to make it more understandable (assume hardcore
anti-scheduling settings - HZ=100, no HR timers, no tickless system):

   tc qdisc add dev eth0 root handle 1:0 hfsc default 1
   tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 10Mbit

Assuming packets of ~1KB size and HZ=100, sending one packet per 10ms tick
averages to ~0.8Mbit - anything beyond that (e.g. the above example with a
specified rate over 10x larger) will require appropriate queuing and will
cause bursts every ~10ms. As you can imagine, any of HFSC's RT guarantees
will be seriously invalidated by that. The aforementioned example is mainly
important if you deal with old hardware - as is particularly popular for
home server chores. Even then, you can easily set HZ=1000 and have very
accurate scheduling for typical ADSL speeds.

Anything modern (APIC or even HPET MSI based timers + 'tickless system')
will provide enough accuracy for superb 1Gbit scheduling. For example, on
one of my cheap dual-core AMD boards I have the following settings:

   tc qdisc add dev eth0 parent root handle 1:0 hfsc default 1
   tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 300mbit

And a simple:

   nc -u dst.host.com 54321 </dev/zero
   nc -l -p 54321 >/dev/null

... will yield the following effects over a period of ~10 seconds (taken
from /proc/interrupts):

   319:   42124229   0   HPET_MSI-edge   hpet2   (before)
   319:   42436214   0   HPET_MSI-edge   hpet2   (after 10s.)

That's roughly 31000/s. Now compare it with the HZ=1000 setting. The
obvious drawback is that CPU load can be rather high from servicing that
many timer interrupts. The example with a 300Mbit RT service curve on a
1Gbit link is particularly ugly, as it requires a lot of throttling with
minuscule delays.

Also note that it's just an example showing the capabilities of current
hardware. The above example (essentially a 300Mbit TBF emulator) is
pointless on an internal interface to begin with: you will pretty much
always want a regular LS service curve there, and in such a scenario HFSC
simply doesn't throttle at all.

300Mbit RT service curve (selected columns from mpstat -P ALL 1):

   10:56:43 PM  CPU    %sys   %irq   %soft   %idle
   10:56:44 PM  all   20.10   6.53   34.67   37.19
   10:56:44 PM    0   35.00   0.00   63.00    0.00
   10:56:44 PM    1    4.95  12.87    6.93   73.27

So, in the rare case you need those speeds with only a RT service curve, or
with a UL service curve: remember the drawbacks.

CAVEAT: RANDOM ONLINE EXAMPLES
For reasons unknown (though well guessed), many examples you can google
love to overuse the UL criterion and stuff it into every node possible.
This makes no sense and works against what HFSC tries to do (and does
pretty damn well). Use UL where it makes sense: on the uppermost node to
match the upstream router's uplink capacity. Or in special cases, such as
testing (limiting a certain subtree to some speed), or for customers that
must never get more than a certain speed. In the last case you can usually
achieve the same by just using a RT criterion without LS+UL on leaf nodes.

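For example, a leaf class for a customer that must never get more than
5Mbit could then be just (the class id and rate are, of course, arbitrary):

   tc class add dev eth0 parent 1:0 classid 1:50 hfsc rt m2 5mbit
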
As for the router case - remember it's good to differentiate between
"traffic to the router" (remote console, web config, etc.) and "outgoing
traffic", so for example:

   tc qdisc add dev eth0 root handle 1:0 hfsc default 0x8002
   tc class add dev eth0 parent 1:0 classid 1:999 hfsc rt m2 50Mbit
   tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 2Mbit ul m2 2Mbit

... so the "internet" tree goes under 1:1 and the "router itself" under
1:999.

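Traffic is then directed to those classes with filters; for example, a u32
filter matching the modem/router's management address (192.168.0.1 below is
just an assumed address) could put such traffic into 1:999:

   tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 \
      match ip dst 192.168.0.1/32 flowid 1:999
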
LAYER2 ADAPTATION
Please refer to tc-stab(8).

SEE ALSO
tc(8), tc-hfsc(8), tc-stab(8)

Please direct bug reports and patches to: <netdev@vger.kernel.org>

Manpage created by Michal Soltys (soltys@ziu.info)



iproute2                         31 October 2011                    TC-HFSC(7)