BPF classifier and actions in tc(8)    Linux    BPF classifier and actions in tc(8)

NAME
BPF - BPF programmable classifier and actions for ingress/egress
queueing disciplines

SYNOPSIS
eBPF classifier (filter) or action:
    tc filter ... bpf [ object-file OBJ_FILE ] [ section CLS_NAME ]
        [ export UDS_FILE ] [ verbose ] [ direct-action | da ]
        [ skip_hw | skip_sw ] [ police POLICE_SPEC ]
        [ action ACTION_SPEC ] [ classid CLASSID ]
    tc action ... bpf [ object-file OBJ_FILE ] [ section CLS_NAME ]
        [ export UDS_FILE ] [ verbose ]

cBPF classifier (filter) or action:
    tc filter ... bpf [ bytecode-file BPF_FILE | bytecode BPF_BYTECODE ]
        [ police POLICE_SPEC ] [ action ACTION_SPEC ] [ classid CLASSID ]
    tc action ... bpf [ bytecode-file BPF_FILE | bytecode BPF_BYTECODE ]

DESCRIPTION
Extended Berkeley Packet Filter (eBPF) and classic Berkeley Packet
Filter (originally known as BPF, referred to as cBPF here for better
distinction) are both available as fully programmable and highly
efficient classifiers and actions. Both offer a minimal instruction set
for implementing small programs that can safely be loaded into the
kernel and executed there in a tiny virtual machine. An in-kernel
verifier guarantees that a given program always terminates and neither
crashes nor leaks data from the kernel.

In Linux, eBPF is generally considered the successor of cBPF. The
kernel internally transforms cBPF expressions into eBPF expressions and
executes the latter. Programs are either run in an interpreter or
just-in-time compiled (JIT'ed) at setup time to run as native machine
code.

Currently, the eBPF JIT compiler is available for the following
architectures:

* x86_64 (since Linux 3.18)
* arm64 (since Linux 3.18)
* s390 (since Linux 4.1)
* ppc64 (since Linux 4.8)
* sparc64 (since Linux 4.12)
* mips64 (since Linux 4.13)
* arm32 (since Linux 4.14)
* x86_32 (since Linux 4.18)

The following architectures support cBPF, but have not (yet) switched
to eBPF JIT support:

* ppc32
* sparc32
* mips32

eBPF's instruction set follows principles similar to those of the cBPF
instruction set, but it is modelled more closely on the underlying
architecture to better mimic native instruction sets, with the aim of
achieving better run-time performance. It is designed to be JIT'ed with
a one-to-one mapping, which also opens up the possibility for compilers
to generate optimized eBPF code through an eBPF backend that performs
almost as fast as natively compiled code. Since LLVM provides such an
eBPF backend, eBPF programs can easily be written in a subset of the C
language. In addition, the eBPF infrastructure comes with a construct
called "maps". eBPF maps are key/value stores that can be shared between
multiple eBPF programs, and also between eBPF programs and user space
applications.

For the traffic control subsystem, classifiers and actions that can be
attached to ingress and egress qdiscs can be written in eBPF or cBPF.
The advantage over other classifiers and actions is that eBPF/cBPF
provides a generic framework in which users can implement their highly
specialized use cases efficiently. A classifier or action written that
way does not suffer from feature bloat and can therefore execute its
task highly efficiently. It allows for non-linear classification and
even for merging the action part into the classification. Combined with
efficient eBPF map data structures, user space can push new policies
such as classids into the kernel without reloading a classifier, or it
can gather statistics that are pushed into one map and use another map
for dynamically load balancing traffic based on the determined load, to
name just a few examples.


PARAMETERS
object-file
points to an object file in executable and linkable format (ELF) that
contains eBPF opcodes and eBPF map definitions. The LLVM compiler
infrastructure with clang(1) as a C language front end is one project
that supports emitting eBPF object files that can be passed to the eBPF
classifier (more details in the EXAMPLES section). This option is
mandatory when an eBPF classifier or action is to be loaded.


section
is the name of the ELF section in the object file where the eBPF
classifier or action resides. By default, the section name for the
classifier is "classifier", and for the action "action". Given that a
single object file can contain multiple classifiers and actions, the
corresponding section name needs to be specified if it differs from the
defaults.


export
points to a Unix domain socket file. If the eBPF object file also
contains a section named "maps" with eBPF map specifications, the map
file descriptors can be handed off via the Unix domain socket to an
eBPF "agent" that keeps all descriptors open beyond the lifetime of tc.
This can be some third-party application implementing the IPC
counterpart for the import, which uses the descriptors to call into the
bpf(2) system call in order to read out or update eBPF map data from
user space, for example, for monitoring purposes or to push down new
policies.


verbose
if set, dumps the eBPF verifier output even if loading the eBPF program
was successful. By default, the verifier log is emitted to the user
only on error.


direct-action | da
instructs the eBPF classifier not to invoke external TC actions.
Instead, the return code of the classifier itself is interpreted as a
TC action return code (TC_ACT_OK, TC_ACT_SHOT etc.).
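
To illustrate direct-action mode, here is a minimal sketch of a
classifier whose return value is itself the verdict. The mark value
DROP_MARK and the function name cls_da are assumptions made up for this
illustration; they do not come from the kernel or from iproute2.

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>

#ifndef __section
# define __section(x)  __attribute__((section(x), used))
#endif

/* Hypothetical mark value, chosen only for this example. */
#define DROP_MARK 0xdead

/* In direct-action mode the classifier returns a TC_ACT_* code,
 * so no separate action needs to be attached to the filter.
 */
__section("classifier") int cls_da(struct __sk_buff *skb)
{
        if (skb->mark == DROP_MARK)
                return TC_ACT_SHOT;     /* drop the packet */

        return TC_ACT_OK;               /* let the packet proceed */
}

char __license[] __section("license") = "GPL";
```

Assuming the program has been compiled to bpf.o and an ingress qdisc
exists on the device, it could be attached with the da flag from the
synopsis, along the lines of "tc filter add dev em1 parent ffff: bpf da
obj bpf.o".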


skip_hw | skip_sw
hardware offload control flags. By default, TC tries to offload filters
to hardware if possible. skip_hw explicitly disables the attempt to
offload. skip_sw forces the offload and disables running the eBPF
program in the kernel. If hardware offload is not possible and skip_sw
was set, the kernel reports an error and the filter is not installed at
all.


police
is an optional parameter for an eBPF/cBPF classifier that specifies a
policer in tc(8) which is attached to the classifier, for example, on
an ingress qdisc.


action
is an optional parameter for an eBPF/cBPF classifier that specifies a
subsequent action in tc(8) which is attached to the classifier.


classid
flowid
provides the default traffic control class identifier for this
eBPF/cBPF classifier. The default class identifier can also be
overridden by the return code of the eBPF/cBPF program. A return code
of -1 selects the default class identifier provided here. A return code
of 0 implies that no match took place, and any other return code
overrides the default classid. This allows for efficient, non-linear
classification with only a single eBPF/cBPF program, as opposed to
multiple individual programs for various class identifiers, each of
which would need to reparse the packet contents.
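
The three cases above can be sketched in plain user space C. The
function resolve_classid is a hypothetical helper written for this
illustration only; it is not part of tc or the kernel and merely
restates the rules described above.

```c
#include <stdint.h>

/* Hypothetical helper: derives the effective class identifier from a
 * classifier's 32 bit return code. Returns 0 if the program reported
 * no match, and 1 if *classid has been set.
 */
static int resolve_classid(int32_t ret, uint32_t default_classid,
                           uint32_t *classid)
{
        if (ret == 0)
                return 0;                       /* no match */

        if (ret == -1)
                *classid = default_classid;     /* classid from command line */
        else
                *classid = (uint32_t)ret;       /* program overrides default */

        return 1;
}
```

For example, with a default of 1:1 (0x00010001), a return code of -1
yields 1:1, while a return code of 0x00010002 selects 1:2 instead.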


bytecode
is used for loading cBPF classifiers and actions only. The cBPF
bytecode is passed directly as a text string in the form of 's,c t f
k,c t f k,c t f k,...', where s denotes the number of subsequent
4-tuples. Each 4-tuple consists of the decimals c t f k, where c
represents the cBPF opcode, t the jump true offset target, f the jump
false offset target and k the immediate constant/literal. Various tools
generate code in this loadable format, for example bpf_asm, which ships
with the Linux kernel source tree under tools/net/, so the bytecode is
not expected to be written by hand. The bytecode or bytecode-file
option is mandatory when a cBPF classifier or action is to be loaded.
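
To make the string format concrete, the following sketch serializes an
array of instructions into 's,c t f k,...'. The struct cbpf_insn and
the function cbpf_format are names invented for this illustration; the
struct only mirrors the field layout of struct sock_filter from
linux/filter.h so that the example stays self-contained.

```c
#include <stdint.h>
#include <stdio.h>

/* Same fields as struct sock_filter, redeclared for self-containment. */
struct cbpf_insn {
        uint16_t code;  /* c: cBPF opcode */
        uint8_t  jt;    /* t: jump true offset */
        uint8_t  jf;    /* f: jump false offset */
        uint32_t k;     /* k: immediate constant/literal */
};

/* Hypothetical helper: writes "s,c t f k,c t f k,..." into buf.
 * Returns the length of the string, or -1 if buf is too small.
 */
static int cbpf_format(const struct cbpf_insn *insns, unsigned int count,
                       char *buf, size_t len)
{
        size_t used;
        unsigned int i;
        int n;

        n = snprintf(buf, len, "%u", count);
        if (n < 0 || (size_t)n >= len)
                return -1;
        used = (size_t)n;

        for (i = 0; i < count; i++) {
                n = snprintf(buf + used, len - used, ",%u %u %u %u",
                             (unsigned int)insns[i].code,
                             (unsigned int)insns[i].jt,
                             (unsigned int)insns[i].jf,
                             (unsigned int)insns[i].k);
                if (n < 0 || (size_t)n >= len - used)
                        return -1;
                used += (size_t)n;
        }

        return (int)used;
}
```

A single "return -1" instruction, that is opcode 6 with k set to
0xffffffff, is rendered as "1,6 0 0 4294967295" by this sketch.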


bytecode-file
is also used to load a cBPF classifier or action. It is effectively the
same as bytecode, except that the cBPF bytecode is not passed directly
via the command line, but rather resides in a text file.


EXAMPLES
eBPF TOOLING
A full-blown example including eBPF agent code can be found inside the
iproute2 source package under: examples/bpf/

As prerequisites, the kernel needs to have the eBPF system call, namely
bpf(2), enabled and needs to ship with the cls_bpf and act_bpf kernel
modules for the traffic control subsystem. To enable eBPF/cBPF JIT
support, depending on which of the two the given architecture supports:

    echo 1 > /proc/sys/net/core/bpf_jit_enable

A given restricted C file can be compiled via LLVM as:

    clang -O2 -emit-llvm -c bpf.c -o - | \
        llc -march=bpf -filetype=obj -o bpf.o

The compiler invocation might still be simplified in the future, so for
now it is quite handy to wrap this construct in one way or another, for
example:

    __bcc() {
            clang -O2 -emit-llvm -c "$1" -o - | \
            llc -march=bpf -filetype=obj -o "$(basename "$1" .c).o"
    }

    alias bcc=__bcc
210
211 A minimal, stand-alone unit, which matches on all traffic with the de‐
212 fault classid (return code of -1) looks like:
213
214
215 #include <linux/bpf.h>
216
217 #ifndef __section
218 # define __section(x) __attribute__((section(x), used))
219 #endif
220
221 __section("classifier") int cls_main(struct __sk_buff *skb)
222 {
223 return -1;
224 }
225
226 char __license[] __section("license") = "GPL";
227
228 More examples can be found further below in subsection eBPF PROGRAMMING
229 as focus here will be on tooling.

There can be various other sections, for example, also for actions.
Thus, an eBPF object file can contain multiple entry points. A specific
entry point must, however, always be specified when configuring with
tc. A license must be part of the restricted C code, and the license
string syntax is the same as with Linux kernel modules. The kernel
reserves the right to restrict some eBPF helper functions to
GPL-compatible licenses only, and may thus reject loading a program
into the kernel when such a license mismatch occurs.

The resulting object file from the compilation can be inspected with
the usual set of tools that also operate on normal object files, for
example objdump(1) for inspecting ELF section headers:

    objdump -h bpf.o
    [...]
      3 classifier    000007f8  0000000000000000  0000000000000000  00000040  2**3
                      CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
      4 action-mark   00000088  0000000000000000  0000000000000000  00000838  2**3
                      CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
      5 action-rand   00000098  0000000000000000  0000000000000000  000008c0  2**3
                      CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
      6 maps          00000030  0000000000000000  0000000000000000  00000958  2**2
                      CONTENTS, ALLOC, LOAD, DATA
      7 license       00000004  0000000000000000  0000000000000000  00000988  2**0
                      CONTENTS, ALLOC, LOAD, DATA
    [...]

Adding an eBPF classifier from an object file that contains a
classifier in the default ELF section is trivial (note that instead of
"object-file", shortcuts such as "obj" can also be used):

    bcc bpf.c
    tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1

If the classifier resides in ELF section "mycls", the same command
needs to be invoked as:

    tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1

Dumping the classifier configuration shows the location of the
classifier, in other words, that it comes from object file "bpf.o"
under section "mycls":

    tc filter show dev em1
    filter parent 1: protocol all pref 49152 bpf
    filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1 bpf.o:[mycls]

The same program can also be installed on the ingress qdisc side as
opposed to egress ...

    tc qdisc add dev em1 handle ffff: ingress
    tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls \
        flowid ffff:1

... and again dumped from there:

    tc filter show dev em1 parent ffff:
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[mycls]

Attaching a classifier and action on ingress has the restriction that
ingress doesn't have an actual underlying queueing discipline. What
ingress can do is classify, mangle, redirect or drop packets. When
queueing is required on the ingress side, ingress must redirect packets
to the ifb device; otherwise, policing can be used. Moreover, ingress
can be used as an early drop point for unwanted packets before they hit
upper layers of the networking stack, to perform network accounting
with eBPF maps that could be shared with egress, or as an early mangle
and/or redirection point to different networking devices.

Multiple eBPF actions and classifiers can be placed into a single
object file within various sections. In that case, non-default section
names must be provided, which is the case for both actions in this
example:

    tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \
        action bpf obj bpf.o sec action-mark \
        action bpf obj bpf.o sec action-rand ok

The advantage of this is that the classifier and the two actions can
then share eBPF maps with each other, if implemented in the programs.

In order to access eBPF maps from user space beyond the tc(8) setup
lifetime, the ownership can be transferred to an eBPF agent via Unix
domain sockets. There are two possibilities for implementing this:

1) Implement your own eBPF agent that takes care of setting up the Unix
domain socket and implements the protocol that tc(8) dictates. A code
example of this can be found inside the iproute2 source package under:
examples/bpf/

2) Use tc exec for transferring the eBPF map file descriptors through a
Unix domain socket and spawning an application such as sh(1). The
advantage of this approach is that tc places the file descriptors into
the environment and thus makes them available just like the stdin,
stdout and stderr file descriptors. This means that user applications
run from within this fd-owner shell can terminate and restart without
losing the eBPF map file descriptors. Example invocation with the
previous classifier and action mixture:

    tc exec bpf imp /tmp/bpf
    tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf \
        flowid 1:1 \
        action bpf obj bpf.o sec action-mark \
        action bpf obj bpf.o sec action-rand ok

Assuming that eBPF maps are shared between classifier and actions, it
is enough to export them once, for example, from within the classifier
or action command. tc will set up all eBPF map file descriptors at the
time when the object file is first parsed.

When a shell has been spawned, the environment will have a couple of
eBPF related variables. BPF_NUM_MAPS provides the total number of maps
that have been transferred over the Unix domain socket. The value of
BPF_MAP<X> is the file descriptor number that can be accessed in eBPF
agent applications. In other words, it can be used directly as the file
descriptor value for the bpf(2) system call to retrieve or alter eBPF
map values. <X> denotes the identifier of the eBPF map. It corresponds
to the id member of struct bpf_elf_map from the tc eBPF map
specification.
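
An agent started from such a shell could pick up the descriptors
roughly as follows. This is a sketch only: the function names
env_to_int, bpf_num_maps_from_env and bpf_map_fd_from_env are
hypothetical and not part of iproute2, and the returned descriptor
would still have to be passed to bpf(2) by the agent itself.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parses a non-negative decimal integer from the environment variable
 * called name. Returns -1 if it is unset, empty or malformed.
 */
static int env_to_int(const char *name)
{
        const char *val = getenv(name);
        char *end;
        long num;

        if (!val || !*val)
                return -1;

        num = strtol(val, &end, 10);
        if (*end != '\0' || num < 0 || num > INT_MAX)
                return -1;

        return (int)num;
}

/* Number of maps handed over by tc, taken from BPF_NUM_MAPS. */
static int bpf_num_maps_from_env(void)
{
        return env_to_int("BPF_NUM_MAPS");
}

/* File descriptor of the map with the given id, taken from BPF_MAP<id>. */
static int bpf_map_fd_from_env(unsigned int id)
{
        char name[32];

        snprintf(name, sizeof(name), "BPF_MAP%u", id);
        return env_to_int(name);
}
```

With BPF_MAP1=6 in the environment, bpf_map_fd_from_env(1) returns 6 in
this sketch, and an identifier without a matching variable yields -1.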

The environment in this example looks as follows:

    sh# env | grep BPF
        BPF_NUM_MAPS=3
        BPF_MAP1=6
        BPF_MAP0=5
        BPF_MAP2=7
    sh# ls -la /proc/self/fd
        [...]
        lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
        lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
        lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
    sh# my_bpf_agent

eBPF agents are very useful in that they can prepopulate eBPF maps from
user space, monitor statistics via maps and, based on that feedback,
rewrite classids in eBPF map values during runtime, for example. Given
that eBPF agents are implemented as normal applications, they can also
dynamically receive traffic control policies from external controllers
and push them down into eBPF maps to adapt to network conditions.
Moreover, eBPF maps can also be shared with other eBPF program types
(e.g. tracing), so very powerful combinations can be implemented.


eBPF PROGRAMMING
eBPF classifiers and actions are implemented in restricted C syntax (in
the future, additional language front ends could be supported).

The header file linux/bpf.h provides eBPF helper functions that can be
called from an eBPF program. This man page only provides two minimal,
stand-alone examples; have a look at examples/bpf from the iproute2
source package for a fully fledged flow dissector example that better
demonstrates some of the possibilities with eBPF.

Supported 32 bit classifier return codes from the C program and their
meanings:
    0     denotes a mismatch
    -1    denotes the default classid configured from the command line
    else  everything else will override the default classid to provide
          a facility for non-linear matching

Supported 32 bit action return codes from the C program and their
meanings (linux/pkt_cls.h):
    TC_ACT_OK (0)          will terminate the packet processing
                           pipeline and allow the packet to proceed
    TC_ACT_SHOT (2)        will terminate the packet processing
                           pipeline and drop the packet
    TC_ACT_UNSPEC (-1)     will use the default action configured from
                           tc (similar to returning -1 from a
                           classifier)
    TC_ACT_PIPE (3)        will iterate to the next action, if available
    TC_ACT_RECLASSIFY (1)  will terminate the packet processing
                           pipeline and start classification from the
                           beginning
    else                   everything else is an unspecified return code

Both classifier and action return codes are supported in eBPF and cBPF
programs.

To demonstrate restricted C syntax, a minimal toy classifier example is
provided. It assumes that egress packets, for instance originating from
a container, have previously been marked with a value in the interval
[0, 255]. The program keeps statistics on the different marks for user
space and maps the classid to the root qdisc with the mark itself as
the minor handle:

    #include <stdint.h>
    #include <asm/types.h>

    #include <linux/bpf.h>
    #include <linux/pkt_sched.h>

    #include "helpers.h"

    /* Per-mark counters, stored as map values. */
    struct tuple {
            long packets;
            long bytes;
    };

    #define BPF_MAP_ID_STATS        1 /* agent's map identifier */
    #define BPF_MAX_MARK            256

    struct bpf_elf_map __section("maps") map_stats = {
            .type           =       BPF_MAP_TYPE_ARRAY,
            .id             =       BPF_MAP_ID_STATS,
            .size_key       =       sizeof(uint32_t),
            .size_value     =       sizeof(struct tuple),
            .max_elem       =       BPF_MAX_MARK,
            .pinning        =       PIN_GLOBAL_NS,
    };

    /* Account the packet under its mark. */
    static inline void cls_update_stats(const struct __sk_buff *skb,
                                        uint32_t mark)
    {
            struct tuple *tu;

            tu = bpf_map_lookup_elem(&map_stats, &mark);
            if (likely(tu)) {
                    __sync_fetch_and_add(&tu->packets, 1);
                    __sync_fetch_and_add(&tu->bytes, skb->len);
            }
    }

    __section("cls") int cls_main(struct __sk_buff *skb)
    {
            uint32_t mark = skb->mark;

            /* Marks outside of the expected interval do not match. */
            if (unlikely(mark >= BPF_MAX_MARK))
                    return 0;

            cls_update_stats(skb, mark);

            return TC_H_MAKE(TC_H_ROOT, mark);
    }

    char __license[] __section("license") = "GPL";
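
The return value of the classifier above is a classid composed of a 16
bit major and a 16 bit minor part. The following user space sketch
reproduces that computation. The CLASSID_* macros and both function
names are stand-ins invented for this illustration; their values are
meant to match TC_H_MAJ_MASK, TC_H_MIN_MASK, TC_H_ROOT and TC_H_MAKE
from linux/pkt_sched.h.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the constants from <linux/pkt_sched.h>, renamed so
 * that the example does not clash with that header.
 */
#define CLASSID_MAJ_MASK        0xFFFF0000U
#define CLASSID_MIN_MASK        0x0000FFFFU
#define CLASSID_ROOT            0xFFFFFFFFU

/* Same computation as TC_H_MAKE(maj, min). */
static uint32_t classid_make(uint32_t maj, uint32_t min)
{
        return (maj & CLASSID_MAJ_MASK) | (min & CLASSID_MIN_MASK);
}

/* Formats a classid in hexadecimal major:minor notation. */
static int classid_format(uint32_t classid, char *buf, size_t len)
{
        return snprintf(buf, len, "%x:%x",
                        (unsigned int)(classid >> 16),
                        (unsigned int)(classid & CLASSID_MIN_MASK));
}
```

A mark of 5 therefore results in 0xffff0005, or ffff:5. Note that even
a mark of 0 yields the non-zero value 0xffff0000, so the classifier
never reports a mismatch by accident for marks inside the interval.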

Another small example is a port redirector that demuxes destination
port 80 into the interval [8080, 8087], steered by RSS. It can be
attached to the ingress qdisc. Adding the egress counterpart and IPv6
support is left as an exercise to the reader:

    #include <asm/types.h>
    #include <asm/byteorder.h>

    #include <linux/bpf.h>
    #include <linux/filter.h>
    #include <linux/in.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/tcp.h>

    #include "helpers.h"

    /* Fix up the TCP checksum, then rewrite the destination port. */
    static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
                                     __u16 old_port, __u16 new_port)
    {
            bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
                                old_port, new_port, sizeof(new_port));
            bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
                                &new_port, sizeof(new_port), 0);
    }

    static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off)
    {
            __u16 dport, dport_new = 8080, off;
            __u8 ip_proto, ip_vl;

            ip_proto = load_byte(skb, nh_off +
                                 offsetof(struct iphdr, protocol));
            if (ip_proto != IPPROTO_TCP)
                    return 0;

            /* Skip the IPv4 header, with or without options. */
            ip_vl = load_byte(skb, nh_off);
            if (likely(ip_vl == 0x45))
                    nh_off += sizeof(struct iphdr);
            else
                    nh_off += (ip_vl & 0xF) << 2;

            dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest));
            if (dport != 80)
                    return 0;

            /* The receive queue selects one of the eight target ports. */
            off = skb->queue_mapping & 7;
            set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80),
                          __cpu_to_be16(dport_new + off));
            return -1;
    }

    __section("lb") int lb_main(struct __sk_buff *skb)
    {
            int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN;

            if (likely(skb->protocol == __constant_htons(ETH_P_IP)))
                    ret = lb_do_ipv4(skb, nh_off);

            return ret;
    }

    char __license[] __section("license") = "GPL";

The related helper header file helpers.h used in both examples was:

    /* Misc helper macros. */
    #define __section(x) __attribute__((section(x), used))
    #define offsetof(x, y) __builtin_offsetof(x, y)
    #define likely(x) __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* Object pinning settings */
    #define PIN_NONE       0
    #define PIN_OBJECT_NS  1
    #define PIN_GLOBAL_NS  2

    /* ELF map definition */
    struct bpf_elf_map {
            __u32 type;
            __u32 size_key;
            __u32 size_value;
            __u32 max_elem;
            __u32 flags;
            __u32 id;
            __u32 pinning;
            __u32 inner_id;
            __u32 inner_idx;
    };

    /* Some used BPF function calls. */
    static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
                                      int len, int flags) =
            (void *) BPF_FUNC_skb_store_bytes;
    static int (*bpf_l4_csum_replace)(void *ctx, int off, int from,
                                      int to, int flags) =
            (void *) BPF_FUNC_l4_csum_replace;
    static void *(*bpf_map_lookup_elem)(void *map, void *key) =
            (void *) BPF_FUNC_map_lookup_elem;

    /* Some used BPF intrinsics. */
    unsigned long long load_byte(void *skb, unsigned long long off)
            asm ("llvm.bpf.load.byte");
    unsigned long long load_half(void *skb, unsigned long long off)
            asm ("llvm.bpf.load.half");

As a best practice, we recommend having only a single eBPF classifier
loaded in tc and performing all necessary matching and mangling from
there, instead of using a list of individual classifiers and separate
actions. A single classifier tailored to a given use case will be the
most efficient to run.


eBPF DEBUGGING
Both the tc filter and action commands for bpf support an optional
verbose parameter that can be used to inspect the eBPF verifier log. It
is dumped by default in case of an error.

If the eBPF/cBPF JIT compiler has been enabled, it can also be
instructed to emit debug output of the resulting opcode image into the
kernel log, which can be read via dmesg(1):

    echo 2 > /proc/sys/net/core/bpf_jit_enable

The Linux kernel source tree additionally ships a small helper called
bpf_jit_disasm under tools/net/ that reads out the opcode image dump
from the kernel log and dumps the resulting disassembly:

    bpf_jit_disasm -o

Other than that, the Linux kernel also contains an extensive eBPF/cBPF
test suite module called test_bpf. Upon ...

    modprobe test_bpf

... it runs a variety of test cases and dumps the results into the
kernel log, which can be inspected with dmesg(1). The results can
differ depending on whether the JIT compiler is enabled or not. In case
of failed test cases, the module will fail to load. In such cases, we
urge you to file a bug report to the related JIT authors and to the
Linux kernel and networking mailing lists.


cBPF
Although we generally recommend switching to implementing eBPF
classifiers and actions, for the sake of completeness, a few words on
how to program in cBPF are in order here.

Likewise, the bpf_jit_enable switch can be enabled as mentioned
already. Tooling such as bpf_jit_disasm also works regardless of
whether eBPF or cBPF code is being loaded.

Unlike in eBPF, classifiers and actions are not implemented in
restricted C, but rather in a minimal assembler-like language or with
the help of other tooling.

The raw interface with tc takes opcodes directly. For example, the most
minimal classifier, matching every packet and resulting in the default
classid of 1:1, looks like:

    tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' \
        flowid 1:1

The first decimal of the bytecode sequence denotes the number of
subsequent 4-tuples of cBPF opcodes. As mentioned, such a 4-tuple
consists of the decimals c t f k, where c represents the cBPF opcode, t
the jump true offset target, f the jump false offset target and k the
immediate constant/literal. Here, this denotes an unconditional return
from the program with an immediate value of -1.

For egress classification, Willem de Bruijn implemented a minimal
stand-alone helper tool under the GNU General Public License version 2
for the iptables(8) BPF extension, which abuses the libpcap internal
classic BPF compiler. His code is adapted here for usage with tc(8):

    #include <pcap.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
            struct bpf_program prog;
            struct bpf_insn *ins;
            int i, ret, dlt = DLT_RAW;

            if (argc < 2 || argc > 3)
                    return 1;
            if (argc == 3) {
                    /* Optional first argument: link type name. */
                    dlt = pcap_datalink_name_to_val(argv[1]);
                    if (dlt == -1)
                            return 1;
            }

            /* Last argument: the filter expression to compile. */
            ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1],
                                      1, PCAP_NETMASK_UNKNOWN);
            if (ret)
                    return 1;

            /* Emit the instruction count, then each 4-tuple. */
            printf("%d,", prog.bf_len);
            ins = prog.bf_insns;

            for (i = 0; i < prog.bf_len - 1; ++ins, ++i)
                    printf("%u %u %u %u,", ins->code,
                           ins->jt, ins->jf, ins->k);
            printf("%u %u %u %u",
                   ins->code, ins->jt, ins->jf, ins->k);

            pcap_freecode(&prog);
            return 0;
    }

Given this small helper, any tcpdump(8) filter expression can be abused
as a classifier where a match will result in the default classid:

    bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn
    tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn \
        flowid 1:1

Basically, such a minimal generator is equivalent to:

    tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | tr '\n' ',' > \
        /var/bpf/tcp-syn

Since libpcap does not support all Linux-specific cBPF extensions in
its compiler, the Linux kernel also ships a minimal BPF assembler
called bpf_asm under tools/net/ for providing full control. For
detailed syntax and semantics on implementing such programs by hand,
see the references under FURTHER READING.

A trivial toy example in bpf_asm for classifying IPv4/TCP packets,
saved in a text file called foobar:

    ldh [12]
    jne #0x800, drop
    ldb [23]
    jneq #6, drop
    ret #-1
    drop: ret #0

Similarly, such a classifier can be loaded as:

    bpf_asm foobar > /var/bpf/tcp-syn
    tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn \
        flowid 1:1

For BPF classifiers, the Linux kernel additionally provides a small BPF
debugger called bpf_dbg under tools/net/, which can be used to test a
classifier against pcap files, single-step through the classifier
program or add various breakpoints to it, and dump register contents
during runtime.

Implementing an action in classic BPF is rather limited in the sense
that packet mangling is not supported. Therefore, it is generally
recommended to make the switch to eBPF whenever possible.


FURTHER READING
Further and more technical details about the BPF architecture can be
found in the Linux kernel source tree under
Documentation/networking/filter.txt.

Further details on eBPF tc(8) examples can be found in the iproute2
source tree under examples/bpf/.


SEE ALSO
tc(8), tc-ematch(8), bpf(2), bpf(4)


AUTHORS
Manpage written by Daniel Borkmann.

Please report corrections or improvements to the Linux kernel
networking mailing list: <netdev@vger.kernel.org>

iproute2    18 May 2015    BPF classifier and actions in tc(8)