BPF classifier and actions in tc(8)   Linux   BPF classifier and actions in tc(8)



NAME
BPF - BPF programmable classifier and actions for ingress/egress
queueing disciplines

SYNOPSIS
eBPF classifier (filter) or action:
tc filter ... bpf [ object-file OBJ_FILE ] [ section CLS_NAME ]
    [ export UDS_FILE ] [ verbose ] [ direct-action | da ]
    [ skip_hw | skip_sw ] [ police POLICE_SPEC ]
    [ action ACTION_SPEC ] [ classid CLASSID ]
tc action ... bpf [ object-file OBJ_FILE ] [ section CLS_NAME ]
    [ export UDS_FILE ] [ verbose ]


cBPF classifier (filter) or action:
tc filter ... bpf [ bytecode-file BPF_FILE | bytecode BPF_BYTECODE ]
    [ police POLICE_SPEC ] [ action ACTION_SPEC ] [ classid CLASSID ]
tc action ... bpf [ bytecode-file BPF_FILE | bytecode BPF_BYTECODE ]


DESCRIPTION
Extended Berkeley Packet Filter (eBPF) and classic Berkeley Packet
Filter (originally known as BPF; referred to as cBPF here for better
distinction) are both available as fully programmable and highly
efficient classifiers and actions. Both offer a minimal instruction set
for implementing small programs that can safely be loaded into the
kernel and executed there in a tiny virtual machine. An in-kernel
verifier guarantees that a given program always terminates and neither
crashes nor leaks data from the kernel.

In Linux, eBPF is generally considered the successor of cBPF. The
kernel internally transforms cBPF expressions into eBPF expressions and
executes the latter. They can either be run in an interpreter or be
just-in-time compiled (JIT'ed) at setup time to run as native machine
code.

Currently, the eBPF JIT compiler is available for the following
architectures:

* x86_64 (since Linux 3.18)
* arm64 (since Linux 3.18)
* s390 (since Linux 4.1)
* ppc64 (since Linux 4.8)
* sparc64 (since Linux 4.12)
* mips64 (since Linux 4.13)
* arm32 (since Linux 4.14)
* x86_32 (since Linux 4.18)

The following architectures have cBPF, but have not (yet) switched to
eBPF JIT support:

* ppc32
* sparc32
* mips32

The eBPF instruction set follows similar underlying principles as the
cBPF instruction set. However, it is modelled more closely on the
underlying architecture to better mimic native instruction sets, with
the aim of achieving better run-time performance. It is designed to be
JIT'ed with a one-to-one mapping, which also opens up the possibility
for compilers to generate optimized eBPF code through an eBPF backend
that performs almost as fast as natively compiled code. Given that LLVM
provides such an eBPF backend, eBPF programs can easily be written in a
subset of the C language. In addition, the eBPF infrastructure comes
with a construct called "maps". eBPF maps are key/value stores that are
shared between multiple eBPF programs, but also between eBPF programs
and user space applications.

For the traffic control subsystem, classifiers and actions that can be
attached to ingress and egress qdiscs can be written in eBPF or cBPF.
The advantage over other classifiers and actions is that eBPF/cBPF
provides the generic framework, while users can implement their highly
specialized use cases efficiently. This means that a classifier or
action written that way does not suffer from feature bloat and can
therefore execute its task highly efficiently. It allows for non-linear
classification and even for merging the action part into the
classification. Combined with efficient eBPF map data structures, user
space can push new policies such as classids into the kernel without
reloading a classifier, or it can gather statistics that are pushed
into one map and use another map for dynamically load balancing traffic
based on the determined load, to name just a few examples.


PARAMETERS
object-file
points to an object file in executable and linkable format (ELF) that
contains eBPF opcodes and eBPF map definitions. The LLVM compiler
infrastructure with clang(1) as a C language front end is one project
that supports emitting eBPF object files that can be passed to the eBPF
classifier (more details in the EXAMPLES section). This option is
mandatory when an eBPF classifier or action is to be loaded.


section
is the name of the ELF section in the object file where the eBPF
classifier or action resides. By default, the section name for the
classifier is "classifier", and for the action "action". Given that a
single object file can contain multiple classifiers and actions, the
corresponding section name needs to be specified if it differs from the
defaults.


export
points to a Unix domain socket file. If the eBPF object file also
contains a section named "maps" with eBPF map specifications, the map
file descriptors can be handed off via the Unix domain socket to an
eBPF "agent" that herds all descriptors after tc's lifetime. This can
be some third-party application that implements the IPC counterpart for
the import and uses the descriptors for calling into the bpf(2) system
call to read out or update eBPF map data from user space, for example,
for monitoring purposes or to push down new policies.


verbose
if set, dumps the eBPF verifier output even if loading the eBPF program
was successful. By default, the verifier log is only emitted to the
user on error.


direct-action | da
instructs the eBPF classifier to not invoke external TC actions and
instead interpret the classifier's own return value as a TC action
return code (TC_ACT_OK, TC_ACT_SHOT etc.).


skip_hw | skip_sw
hardware offload control flags. By default, TC tries to offload filters
to hardware if possible. skip_hw explicitly disables the attempt to
offload. skip_sw forces the offload and disables running the eBPF
program in the kernel. If skip_sw is set and hardware offload is not
possible, the kernel reports an error and the filter is not installed
at all.


police
is an optional parameter for an eBPF/cBPF classifier that specifies a
police action in tc(8) which is attached to the classifier, for
example, on an ingress qdisc.


action
is an optional parameter for an eBPF/cBPF classifier that specifies a
subsequent action in tc(8) which is attached to the classifier.


classid
flowid
provides the default traffic control class identifier for this
eBPF/cBPF classifier. The default class identifier can also be
overridden by the return code of the eBPF/cBPF program. A return code
of -1 selects the default class identifier provided here. A return code
of 0 implies that no match took place, and any other return code
overrides the default classid. This allows for efficient, non-linear
classification with only a single eBPF/cBPF program, as opposed to
having multiple individual programs for various class identifiers,
which would each need to reparse the packet contents.
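
The return code rules above can be modelled in a few lines of user
space C. This is only an illustrative sketch, not kernel code: the
function name resolve_classid and its signature are hypothetical and
merely restate the three cases described for classid.

```c
#include <stdint.h>

/*
 * Hypothetical user-space model of how a classifier return code selects
 * the class identifier. Returns 1 on a match (with *classid filled in)
 * and 0 when the program reported a mismatch.
 */
static int resolve_classid(int32_t ret, uint32_t default_classid,
                           uint32_t *classid)
{
    if (ret == 0)
        return 0;                   /* 0: no match took place */

    if (ret == -1)
        *classid = default_classid; /* -1: use classid from command line */
    else
        *classid = (uint32_t)ret;   /* anything else overrides the default */

    return 1;
}
```

For example, with a default classid of 1:1 (0x00010001), a return code
of -1 yields 1:1, while a return code of 0x00010005 yields 1:5.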


bytecode
is used for loading cBPF classifiers and actions only. The cBPF
bytecode is passed directly as a text string in the form of
's,c t f k,c t f k,c t f k,...', where s denotes the number of
subsequent 4-tuples. Each such 4-tuple consists of the decimals
c t f k, where c represents the cBPF opcode, t the jump true offset
target, f the jump false offset target and k the immediate
constant/literal. There are various tools that generate code in this
loadable format, for example, bpf_asm that ships with the Linux kernel
source tree under tools/net/, so this string is not expected to be
written by hand. The bytecode or bytecode-file option is mandatory when
a cBPF classifier or action is to be loaded.
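
To make the string layout concrete, the following sketch serializes an
array of instructions into the 's,c t f k,...' form. The struct
cbpf_insn type and the cbpf_format helper are hypothetical names chosen
for this illustration; they are not part of tc or the kernel headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One cBPF instruction, in the same field order as the "c t f k" tuple. */
struct cbpf_insn {
    uint16_t code;  /* c: opcode */
    uint8_t  jt;    /* t: jump true offset */
    uint8_t  jf;    /* f: jump false offset */
    uint32_t k;     /* k: immediate constant/literal */
};

/*
 * Hypothetical helper: write "s,c t f k,c t f k,..." into buf.
 * Returns the resulting string length, or -1 if buf is too small.
 */
static int cbpf_format(const struct cbpf_insn *prog, size_t len,
                       char *buf, size_t size)
{
    size_t pos, i;
    int n;

    /* Leading element: number of 4-tuples that follow. */
    n = snprintf(buf, size, "%zu", len);
    if (n < 0 || (size_t)n >= size)
        return -1;
    pos = (size_t)n;

    /* Each instruction is appended as ",c t f k" in decimal. */
    for (i = 0; i < len; i++) {
        n = snprintf(buf + pos, size - pos, ",%u %u %u %u",
                     (unsigned int)prog[i].code, (unsigned int)prog[i].jt,
                     (unsigned int)prog[i].jf, (unsigned int)prog[i].k);
        if (n < 0 || (size_t)n >= size - pos)
            return -1;
        pos += (size_t)n;
    }
    return (int)pos;
}
```

A single instruction { 0x06, 0, 0, 0xffffffff } is rendered as
"1,6 0 0 4294967295".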


bytecode-file
is also used to load a cBPF classifier or action. It is effectively the
same as bytecode, except that the cBPF bytecode is not passed directly
via the command line, but rather resides in a text file.


EXAMPLES
eBPF TOOLING
A full-blown example including eBPF agent code can be found inside the
iproute2 source package under: examples/bpf/

As prerequisites, the kernel needs to have the eBPF system call, namely
bpf(2), enabled and needs to ship with the cls_bpf and act_bpf kernel
modules for the traffic control subsystem. To enable cBPF/eBPF JIT
support, depending on which of the two the given architecture supports:

echo 1 > /proc/sys/net/core/bpf_jit_enable

A given restricted C file can be compiled via LLVM as:

clang -O2 -emit-llvm -c bpf.c -o - | \
    llc -march=bpf -filetype=obj -o bpf.o

The compiler invocation might still be simplified in the future, so for
now it is quite handy to wrap this construct in one way or another, for
example:

__bcc() {
    clang -O2 -emit-llvm -c $1 -o - | \
    llc -march=bpf -filetype=obj -o "`basename $1 .c`.o"
}

alias bcc=__bcc

A minimal, stand-alone unit which matches on all traffic with the
default classid (return code of -1) looks like:


#include <linux/bpf.h>

#ifndef __section
# define __section(x)  __attribute__((section(x), used))
#endif

__section("classifier") int cls_main(struct __sk_buff *skb)
{
    return -1;
}

char __license[] __section("license") = "GPL";

More examples can be found further below in the subsection eBPF
PROGRAMMING, as the focus here is on tooling.

There can be various other sections, for example, also for actions.
Thus, an eBPF object file can contain multiple entry points. However,
when configuring with tc, one specific entry point must always be
selected, either explicitly or through the default section name. A
license must be part of the restricted C code, and the license string
syntax is the same as with Linux kernel modules. The kernel reserves
the right to restrict some eBPF helper functions to GPL-compatible
licenses only, and thus may reject a program from being loaded into the
kernel when such a license mismatch occurs.

The object file resulting from the compilation can be inspected with
the usual set of tools that also operate on normal object files, for
example objdump(1) for inspecting ELF section headers:


objdump -h bpf.o
[...]
  3 classifier    000007f8  0000000000000000  0000000000000000  00000040  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  4 action-mark   00000088  0000000000000000  0000000000000000  00000838  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  5 action-rand   00000098  0000000000000000  0000000000000000  000008c0  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  6 maps          00000030  0000000000000000  0000000000000000  00000958  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  7 license       00000004  0000000000000000  0000000000000000  00000988  2**0
                  CONTENTS, ALLOC, LOAD, DATA
[...]

Adding an eBPF classifier from an object file that contains a
classifier in the default ELF section is trivial (note that instead of
"object-file", shortcuts such as "obj" can also be used):

bcc bpf.c
tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1

In case the classifier resides in the ELF section "mycls", the same
command needs to be invoked as:

tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1

Dumping the classifier configuration shows the location of the
classifier, in other words, that it comes from the object file "bpf.o"
under section "mycls":

tc filter show dev em1
filter parent 1: protocol all pref 49152 bpf
filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1
    bpf.o:[mycls]

The same program can also be installed on the ingress qdisc side as
opposed to egress ...

tc qdisc add dev em1 handle ffff: ingress
tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls \
    flowid ffff:1

... and again dumped from there:

tc filter show dev em1 parent ffff:
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1
    bpf.o:[mycls]

Attaching a classifier and action on ingress has the restriction that
ingress doesn't have an actual underlying queueing discipline. What
ingress can do is classify, mangle, redirect or drop packets. When
queueing is required on the ingress side, ingress must redirect packets
to the ifb device; otherwise, policing can be used. Moreover, ingress
can be used as an early drop point for unwanted packets before they hit
upper layers of the networking stack, to perform network accounting
with eBPF maps that could be shared with egress, or as an early mangle
and/or redirection point to different networking devices.

Multiple eBPF actions and classifiers can be placed into a single
object file within various sections. In that case, non-default section
names must be provided, which is the case for both actions in this
example:

tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \
    action bpf obj bpf.o sec action-mark \
    action bpf obj bpf.o sec action-rand ok

The advantage of this is that the classifier and the two actions can
then share eBPF maps with each other, if implemented in the programs.

In order to access eBPF maps from user space beyond the tc(8) setup
lifetime, the ownership can be transferred to an eBPF agent via Unix
domain sockets. There are two possibilities for implementing this:

1) Implement your own eBPF agent that takes care of setting up the Unix
domain socket and implements the protocol that tc(8) dictates. A code
example of this can be found inside the iproute2 source package under:
examples/bpf/

2) Use tc exec for transferring the eBPF map file descriptors through a
Unix domain socket and spawning an application such as sh(1). The
advantage of this approach is that tc places the file descriptors into
the environment and thus makes them available just like the stdin,
stdout and stderr file descriptors. This means that user applications
run from within this fd-owner shell can terminate and restart without
losing the eBPF map file descriptors. Example invocation with the
previous classifier and action mixture:

tc exec bpf imp /tmp/bpf
tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf flowid 1:1 \
    action bpf obj bpf.o sec action-mark \
    action bpf obj bpf.o sec action-rand ok

Assuming that eBPF maps are shared between classifier and actions, it
is enough to export them once, for example, from within the classifier
or action command. tc sets up all eBPF map file descriptors at the time
when the object file is first parsed.

When a shell has been spawned, the environment contains a couple of
eBPF-related variables. BPF_NUM_MAPS provides the total number of maps
that have been transferred over the Unix domain socket. The value of
BPF_MAP<X> is the file descriptor number that can be accessed in eBPF
agent applications; in other words, it can be used directly as the file
descriptor value for the bpf(2) system call to retrieve or alter eBPF
map values. <X> denotes the identifier of the eBPF map. It corresponds
to the id member of struct bpf_elf_map from the tc eBPF map
specification.

The environment in this example looks as follows:


sh# env | grep BPF
BPF_NUM_MAPS=3
BPF_MAP1=6
BPF_MAP0=5
BPF_MAP2=7
sh# ls -la /proc/self/fd
[...]
lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
sh# my_bpf_agent
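
As a rough sketch of how an agent started from such a shell might pick
up the descriptors, the helpers below read BPF_NUM_MAPS and BPF_MAP<X>
from the environment. The helper names are hypothetical, and the sketch
stops at obtaining the descriptor number; passing it on to bpf(2) is
left out.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a non-negative decimal environment variable; -1 if unset/invalid. */
static int env_to_int(const char *name)
{
    const char *val = getenv(name);
    char *end;
    long num;

    if (!val || *val == '\0')
        return -1;

    num = strtol(val, &end, 10);
    if (*end != '\0' || num < 0 || num > INT_MAX)
        return -1;

    return (int)num;
}

/* Number of maps handed over by tc, taken from BPF_NUM_MAPS. */
static int bpf_env_num_maps(void)
{
    return env_to_int("BPF_NUM_MAPS");
}

/* File descriptor of the map with the given id, taken from BPF_MAP<id>. */
static int bpf_env_map_fd(unsigned int id)
{
    char name[32];

    snprintf(name, sizeof(name), "BPF_MAP%u", id);
    return env_to_int(name);
}
```

In the environment shown above, bpf_env_num_maps() would return 3 and
bpf_env_map_fd(1) would return 6.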

eBPF agents are very useful in that they can prepopulate eBPF maps from
user space, monitor statistics via maps and, based on that feedback,
for example, rewrite classids in eBPF map values during runtime. Given
that eBPF agents are implemented as normal applications, they can also
dynamically receive traffic control policies from external controllers
and push them down into eBPF maps to dynamically adapt to network
conditions. Moreover, eBPF maps can also be shared with other eBPF
program types (e.g. tracing), so very powerful combinations can be
implemented.


eBPF PROGRAMMING
eBPF classifiers and actions are implemented in restricted C syntax (in
the future, additional language front ends could be supported).

The header file linux/bpf.h provides eBPF helper functions that can be
called from an eBPF program. This man page only provides two minimal,
stand-alone examples; have a look at examples/bpf from the iproute2
source package for a fully fledged flow dissector example that better
demonstrates some of the possibilities with eBPF.

Supported 32 bit classifier return codes from the C program and their
meanings:
    0, denotes a mismatch
    -1, denotes the default classid configured from the command line
    else, everything else will override the default classid to provide
        a facility for non-linear matching

Supported 32 bit action return codes from the C program and their
meanings (linux/pkt_cls.h):
    TC_ACT_OK (0), will terminate the packet processing pipeline and
        allow the packet to proceed
    TC_ACT_SHOT (2), will terminate the packet processing pipeline and
        drop the packet
    TC_ACT_UNSPEC (-1), will use the default action configured from tc
        (similar to returning -1 from a classifier)
    TC_ACT_PIPE (3), will iterate to the next action, if available
    TC_ACT_RECLASSIFY (1), will terminate the packet processing
        pipeline and start classification from the beginning
    else, everything else is an unspecified return code

Both classifier and action return codes are supported in eBPF and cBPF
programs.

To demonstrate restricted C syntax, a minimal toy classifier example is
provided. It assumes that egress packets, for instance originating from
a container, have previously been marked in the interval [0, 255]. The
program keeps statistics on the different marks for user space and maps
the classid to the root qdisc with the mark itself as the minor handle:


#include <stdint.h>
#include <asm/types.h>

#include <linux/bpf.h>
#include <linux/pkt_sched.h>

#include "helpers.h"

struct tuple {
    long packets;
    long bytes;
};

#define BPF_MAP_ID_STATS        1 /* agent's map identifier */
#define BPF_MAX_MARK            256

struct bpf_elf_map __section("maps") map_stats = {
    .type           =       BPF_MAP_TYPE_ARRAY,
    .id             =       BPF_MAP_ID_STATS,
    .size_key       =       sizeof(uint32_t),
    .size_value     =       sizeof(struct tuple),
    .max_elem       =       BPF_MAX_MARK,
    .pinning        =       PIN_GLOBAL_NS,
};

static inline void cls_update_stats(const struct __sk_buff *skb,
                                    uint32_t mark)
{
    struct tuple *tu;

    /* Look up the per-mark counters and update them atomically. */
    tu = bpf_map_lookup_elem(&map_stats, &mark);
    if (likely(tu)) {
        __sync_fetch_and_add(&tu->packets, 1);
        __sync_fetch_and_add(&tu->bytes, skb->len);
    }
}

__section("cls") int cls_main(struct __sk_buff *skb)
{
    uint32_t mark = skb->mark;

    /* Marks outside [0, 255] are reported as a mismatch. */
    if (unlikely(mark >= BPF_MAX_MARK))
        return 0;

    cls_update_stats(skb, mark);

    /* Override the default classid: root major, mark as minor. */
    return TC_H_MAKE(TC_H_ROOT, mark);
}

char __license[] __section("license") = "GPL";
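
The classid computed by such a classifier can be reproduced in user
space, which may help when checking which class a given mark ends up
in. The sketch below carries local copies of the handle macros as found
in linux/pkt_sched.h so that it compiles on its own; the function names
are hypothetical.

```c
#include <stdint.h>

/* Local copies of the handle macros from linux/pkt_sched.h. */
#ifndef TC_H_MAKE
#define TC_H_MAJ_MASK   0xFFFF0000U
#define TC_H_MIN_MASK   0x0000FFFFU
#define TC_H_ROOT       0xFFFFFFFFU
#define TC_H_MAKE(maj, min) \
    (((maj) & TC_H_MAJ_MASK) | ((min) & TC_H_MIN_MASK))
#endif

#define EXAMPLE_MAX_MARK 256

/* Mirrors the classifier logic: 0 (mismatch) for out-of-range marks. */
static uint32_t classid_for_mark(uint32_t mark)
{
    if (mark >= EXAMPLE_MAX_MARK)
        return 0;
    return TC_H_MAKE(TC_H_ROOT, mark);
}

/* Split a classid into the major and minor parts of "major:minor". */
static uint32_t classid_major(uint32_t classid)
{
    return classid >> 16;
}

static uint32_t classid_minor(uint32_t classid)
{
    return classid & TC_H_MIN_MASK;
}
```

A mark of 5 results in 0xffff0005, that is, major ffff and minor 5.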

Another small example is a port redirector that demuxes destination
port 80 into the interval [8080, 8087], steered by RSS, and that can be
attached to the ingress qdisc. The exercise of adding the egress
counterpart and IPv6 support is left to the reader:


#include <asm/types.h>
#include <asm/byteorder.h>

#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/in.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>

#include "helpers.h"

static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
                                 __u16 old_port, __u16 new_port)
{
    /* Fix up the TCP checksum first, then rewrite the port itself. */
    bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
                        old_port, new_port, sizeof(new_port));
    bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
                        &new_port, sizeof(new_port), 0);
}

static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off)
{
    __u16 dport, dport_new = 8080, off;
    __u8 ip_proto, ip_vl;

    ip_proto = load_byte(skb, nh_off +
                         offsetof(struct iphdr, protocol));
    if (ip_proto != IPPROTO_TCP)
        return 0;

    /* Skip the IPv4 header, taking IP options into account. */
    ip_vl = load_byte(skb, nh_off);
    if (likely(ip_vl == 0x45))
        nh_off += sizeof(struct iphdr);
    else
        nh_off += (ip_vl & 0xF) << 2;

    dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest));
    if (dport != 80)
        return 0;

    /* Pick one of eight target ports based on the receive queue. */
    off = skb->queue_mapping & 7;
    set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80),
                  __cpu_to_be16(dport_new + off));
    return -1;
}

__section("lb") int lb_main(struct __sk_buff *skb)
{
    int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN;

    if (likely(skb->protocol == __constant_htons(ETH_P_IP)))
        ret = lb_do_ipv4(skb, nh_off);

    return ret;
}

char __license[] __section("license") = "GPL";
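
The port selection in lb_do_ipv4 boils down to masking the queue
mapping to three bits and adding it to the base port. A stand-alone
sketch of just that arithmetic, with a hypothetical function name:

```c
#include <stdint.h>

/* Base port and number of target ports used by the redirector example. */
#define LB_BASE_PORT  8080
#define LB_PORT_MASK  7     /* eight ports: 8080 .. 8087 */

/* Map a receive queue index onto one of the eight target ports. */
static uint16_t lb_pick_port(uint32_t queue_mapping)
{
    return (uint16_t)(LB_BASE_PORT + (queue_mapping & LB_PORT_MASK));
}
```

Queue 0 maps to port 8080, queue 7 to port 8087, and queue 8 wraps
around to port 8080 again.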

The helper header file helpers.h used in both examples is:


/* Misc helper macros. */
#define __section(x) __attribute__((section(x), used))
#define offsetof(x, y) __builtin_offsetof(x, y)
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Object pinning settings */
#define PIN_NONE       0
#define PIN_OBJECT_NS  1
#define PIN_GLOBAL_NS  2

/* ELF map definition */
struct bpf_elf_map {
    __u32 type;
    __u32 size_key;
    __u32 size_value;
    __u32 max_elem;
    __u32 flags;
    __u32 id;
    __u32 pinning;
    __u32 inner_id;
    __u32 inner_idx;
};

/* Some used BPF function calls. */
static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
                                  int len, int flags) =
      (void *) BPF_FUNC_skb_store_bytes;
static int (*bpf_l4_csum_replace)(void *ctx, int off, int from,
                                  int to, int flags) =
      (void *) BPF_FUNC_l4_csum_replace;
static void *(*bpf_map_lookup_elem)(void *map, void *key) =
      (void *) BPF_FUNC_map_lookup_elem;

/* Some used BPF intrinsics. */
unsigned long long load_byte(void *skb, unsigned long long off)
    asm ("llvm.bpf.load.byte");
unsigned long long load_half(void *skb, unsigned long long off)
    asm ("llvm.bpf.load.half");

As a best practice, we recommend loading only a single eBPF classifier
in tc and performing all necessary matching and mangling from there,
instead of using a list of individual classifiers and separate actions.
A single classifier tailored to a given use case will be the most
efficient to run.


eBPF DEBUGGING
Both the tc filter and action commands for bpf support an optional
verbose parameter that can be used to inspect the eBPF verifier log. In
case of an error, the log is dumped by default.

In case the eBPF/cBPF JIT compiler has been enabled, it can also be
instructed to emit debug output of the resulting opcode image into the
kernel log, which can be read via dmesg(1):

echo 2 > /proc/sys/net/core/bpf_jit_enable

The Linux kernel source tree additionally ships a small helper called
bpf_jit_disasm under tools/net/ that reads the opcode image dump from
the kernel log and prints the resulting disassembly:

bpf_jit_disasm -o

Other than that, the Linux kernel also contains an extensive eBPF/cBPF
test suite module called test_bpf. Upon ...

modprobe test_bpf

... it runs a variety of test cases and dumps the results into the
kernel log, which can be inspected with dmesg(1). The results can
differ depending on whether the JIT compiler is enabled or not. In case
of failed test cases, the module will fail to load. In such cases, we
urge you to file a bug report with the related JIT authors and the
Linux kernel and networking mailing lists.


cBPF
Although we generally recommend switching to implementing eBPF
classifiers and actions, for the sake of completeness, a few words on
how to program in cBPF follow.

As with eBPF, the bpf_jit_enable switch can be enabled as mentioned
above. Tooling such as bpf_jit_disasm also works independently of
whether eBPF or cBPF code is being loaded.

Unlike in eBPF, classifiers and actions are not implemented in
restricted C, but rather in a minimal assembler-like language or with
the help of other tooling.

The raw interface with tc takes opcodes directly. For example, the most
minimal classifier, which matches on every packet and results in the
default classid of 1:1, looks like:

tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' \
    flowid 1:1

The first decimal of the bytecode sequence denotes the number of
subsequent 4-tuples of cBPF opcodes. As mentioned, such a 4-tuple
consists of the decimals c t f k, where c represents the cBPF opcode, t
the jump true offset target, f the jump false offset target and k the
immediate constant/literal. Here, this denotes an unconditional return
from the program with an immediate value of -1.
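
To see why this string encodes "return -1", it can be decoded again.
The following sketch parses the textual form back into 4-tuples; the
struct cbpf_tuple type and the cbpf_parse helper are hypothetical names
used only for illustration. Opcode 6 is BPF_RET | BPF_K, and 4294967295
is 0xffffffff, which is -1 when read as a signed 32 bit value.

```c
#include <stdio.h>

/* One decoded "c t f k" 4-tuple. */
struct cbpf_tuple {
    unsigned int c;   /* opcode */
    unsigned int t;   /* jump true offset */
    unsigned int f;   /* jump false offset */
    unsigned int k;   /* immediate constant/literal */
};

/*
 * Hypothetical helper: parse "s,c t f k,c t f k,..." into out[].
 * Returns the number of tuples parsed, or -1 on malformed input or if
 * more than max tuples are announced. A trailing comma is tolerated.
 */
static int cbpf_parse(const char *s, struct cbpf_tuple *out, int max)
{
    int count, used, i;

    /* Leading element: number of 4-tuples that follow. */
    if (sscanf(s, "%d%n", &count, &used) != 1 || count < 0 || count > max)
        return -1;
    s += used;

    for (i = 0; i < count; i++) {
        if (sscanf(s, ",%u %u %u %u%n", &out[i].c, &out[i].t,
                   &out[i].f, &out[i].k, &used) != 4)
            return -1;
        s += used;
    }
    return count;
}
```

Parsing '1,6 0 0 4294967295,' yields one tuple with c = 6, t = 0, f = 0
and k = 0xffffffff.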

For egress classification, Willem de Bruijn implemented a minimal
stand-alone helper tool under the GNU General Public License version 2
for the iptables(8) BPF extension, which abuses the libpcap internal
classic BPF compiler. His code, adapted here for usage with tc(8):


#include <pcap.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct bpf_program prog;
    struct bpf_insn *ins;
    int i, ret, dlt = DLT_RAW;

    if (argc < 2 || argc > 3)
        return 1;
    /* Optional first argument: link type name, e.g. EN10MB. */
    if (argc == 3) {
        dlt = pcap_datalink_name_to_val(argv[1]);
        if (dlt == -1)
            return 1;
    }

    /* Last argument: the filter expression to compile. */
    ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1],
                              1, PCAP_NETMASK_UNKNOWN);
    if (ret)
        return 1;

    /* Emit "s,c t f k,...,c t f k" without a trailing comma. */
    printf("%d,", prog.bf_len);
    ins = prog.bf_insns;

    for (i = 0; i < prog.bf_len - 1; ++ins, ++i)
        printf("%u %u %u %u,", ins->code,
               ins->jt, ins->jf, ins->k);
    printf("%u %u %u %u",
           ins->code, ins->jt, ins->jf, ins->k);

    pcap_freecode(&prog);
    return 0;
}

Given this small helper, any tcpdump(8) filter expression can be abused
as a classifier where a match will result in the default classid:

bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn
tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn \
    flowid 1:1

Basically, such a minimal generator is equivalent to:

tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | tr '\n' ',' > \
    /var/bpf/tcp-syn

Since libpcap does not support all Linux-specific cBPF extensions in
its compiler, the Linux kernel also ships a minimal BPF assembler
called bpf_asm under tools/net/ for providing full control. For
detailed syntax and semantics on implementing such programs by hand,
see the references under FURTHER READING.

A trivial toy example in bpf_asm for classifying IPv4/TCP packets,
saved in a text file called foobar:


ldh [12]
jne #0x800, drop
ldb [23]
jneq #6, drop
ret #-1
drop: ret #0

Similarly, such a classifier can be loaded as:

bpf_asm foobar > /var/bpf/tcp-syn
tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn \
    flowid 1:1

For BPF classifiers, the Linux kernel additionally provides a small BPF
debugger called bpf_dbg under tools/net/, which can be used to test a
classifier against pcap files, single-step through the classifier
program or add various breakpoints to it, and dump register contents
during runtime.

Implementing an action in classic BPF is rather limited in the sense
that packet mangling is not supported. Therefore, it is generally
recommended to make the switch to eBPF whenever possible.


FURTHER READING
Further and more technical details about the BPF architecture can be
found in the Linux kernel source tree under
Documentation/networking/filter.txt .

Further details on eBPF tc(8) examples can be found in the iproute2
source tree under examples/bpf/ .


SEE ALSO
tc(8), tc-ematch(8), bpf(2), bpf(4)


AUTHORS
Manpage written by Daniel Borkmann.

Please report corrections or improvements to the Linux kernel
networking mailing list: <netdev@vger.kernel.org>



iproute2                    18 May 2015   BPF classifier and actions in tc(8)