BPF classifier and actions in tc(8)     Linux     BPF classifier and actions in tc(8)

NAME

       BPF - BPF programmable classifier and actions for ingress/egress
       queueing disciplines

SYNOPSIS

   eBPF classifier (filter) or action:
       tc filter ... bpf [ object-file OBJ_FILE ] [ section CLS_NAME ]
               [ export UDS_FILE ] [ verbose ] [ direct-action | da ]
               [ skip_hw | skip_sw ] [ police POLICE_SPEC ]
               [ action ACTION_SPEC ] [ classid CLASSID ]
       tc action ... bpf [ object-file OBJ_FILE ] [ section CLS_NAME ]
               [ export UDS_FILE ] [ verbose ]

   cBPF classifier (filter) or action:
       tc filter ... bpf [ bytecode-file BPF_FILE | bytecode BPF_BYTECODE ]
               [ police POLICE_SPEC ] [ action ACTION_SPEC ]
               [ classid CLASSID ]
       tc action ... bpf [ bytecode-file BPF_FILE | bytecode BPF_BYTECODE ]

DESCRIPTION

       Extended Berkeley Packet Filter (eBPF) and classic Berkeley Packet
       Filter (originally known as BPF, referred to as cBPF here for better
       distinction) are both available as a fully programmable and highly
       efficient classifier and actions. Both offer a minimal instruction
       set for implementing small programs that can safely be loaded into
       the kernel and executed there in a tiny virtual machine. An
       in-kernel verifier guarantees that a given program always
       terminates and neither crashes nor leaks data from the kernel.

       In Linux, eBPF is generally considered the successor of cBPF. The
       kernel internally transforms cBPF expressions into eBPF expressions
       and executes the latter. Programs are either run in an interpreter
       or, at setup time, just-in-time compiled (JIT'ed) to run as native
       machine code.

       Currently, the eBPF JIT compiler is available for the following
       architectures:

       *   x86_64 (since Linux 3.18)
       *   arm64 (since Linux 3.18)
       *   s390 (since Linux 4.1)
       *   ppc64 (since Linux 4.8)
       *   sparc64 (since Linux 4.12)
       *   mips64 (since Linux 4.13)
       *   arm32 (since Linux 4.14)
       *   x86_32 (since Linux 4.18)

       The following architectures have a cBPF JIT, but have not (yet)
       switched to eBPF JIT support:

       *   ppc32
       *   sparc32
       *   mips32

       eBPF's instruction set follows similar underlying principles as the
       cBPF instruction set, but is modelled more closely on the underlying
       architecture to better mimic native instruction sets, with the aim
       of achieving better run-time performance. It is designed to be
       JIT'ed with a one-to-one mapping, which also opens up the
       possibility for compilers to generate optimized eBPF code through an
       eBPF backend that performs almost as fast as natively compiled code.
       Given that LLVM provides such an eBPF backend, eBPF programs can
       easily be written in a subset of the C language. In addition, the
       eBPF infrastructure comes with a construct called "maps". eBPF maps
       are key/value stores that are shared between multiple eBPF programs,
       but also between eBPF programs and user space applications.

       For the traffic control subsystem, classifiers and actions that can
       be attached to ingress and egress qdiscs can be written in eBPF or
       cBPF. The advantage over other classifiers and actions is that
       eBPF/cBPF provides the generic framework, while users can implement
       their highly specialized use cases efficiently. A classifier or
       action written that way does not suffer from feature bloat and can
       therefore execute its task highly efficiently. It allows for
       non-linear classification and even for merging the action part into
       the classification. Combined with efficient eBPF map data
       structures, user space can push new policies such as classids into
       the kernel without reloading a classifier, or it can gather
       statistics that are pushed into one map and use another map for
       dynamically load balancing traffic based on the determined load,
       just to provide a few examples.

PARAMETERS

   object-file
       points to an object file in executable and linkable format (ELF)
       that contains eBPF opcodes and eBPF map definitions. The LLVM
       compiler infrastructure with clang(1) as a C language front end is
       one project that supports emitting eBPF object files that can be
       passed to the eBPF classifier (more details in the EXAMPLES
       section). This option is mandatory when an eBPF classifier or action
       is to be loaded.

   section
       is the name of the ELF section in the object file where the eBPF
       classifier or action resides. By default, the section name for the
       classifier is "classifier", and for the action "action". Given that
       a single object file can contain multiple classifiers and actions,
       the corresponding section name needs to be specified if it differs
       from the defaults.

   export
       points to a Unix domain socket file. In case the eBPF object file
       also contains a section named "maps" with eBPF map specifications,
       the map file descriptors can be handed off via the Unix domain
       socket to an eBPF "agent" that keeps hold of all descriptors after
       tc has exited. This can be some third party application that
       implements the IPC counterpart for the import and uses the
       descriptors for calling into the bpf(2) system call to read out or
       update eBPF map data from user space, for example, for monitoring
       purposes or to push down new policies.

   verbose
       if set, dumps the eBPF verifier output even if loading the eBPF
       program was successful. By default, the verifier log is only
       emitted to the user on error.

   direct-action | da
       instructs the eBPF classifier not to invoke external TC actions;
       instead, the classifier's own return code is interpreted as a TC
       action return code (TC_ACT_OK, TC_ACT_SHOT, etc.).

   skip_hw | skip_sw
       hardware offload control flags. By default, TC will try to offload
       filters to hardware if possible. skip_hw explicitly disables the
       attempt to offload. skip_sw forces the offload and disables running
       the eBPF program in the kernel. If skip_sw is set and hardware
       offload is not possible, the kernel will report an error and the
       filter will not be installed at all.

   police
       is an optional parameter for an eBPF/cBPF classifier that specifies
       a policer in tc(8) which is attached to the classifier, for example,
       on an ingress qdisc.

   action
       is an optional parameter for an eBPF/cBPF classifier that specifies
       a subsequent action in tc(8) which is attached to the classifier.

   classid
   flowid
       provides the default traffic control class identifier for this
       eBPF/cBPF classifier. The default class identifier can also be
       overridden by the return code of the eBPF/cBPF program. A return
       code of -1 selects the default class identifier provided here. A
       return code of 0 implies that no match took place, and any other
       return code overrides the default classid. This allows for
       efficient, non-linear classification with only a single eBPF/cBPF
       program, as opposed to having multiple individual programs for
       various class identifiers, each of which would need to reparse
       packet contents.
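
       The return code handling described above can be sketched in plain
       user space C. This is only an illustrative model and not kernel
       code; the names sketch_resolve_classid and SKETCH_NO_MATCH are made
       up for this example.

```c
#include <stdint.h>

/* Marker for "no match" used by this sketch only; not a tc value. */
#define SKETCH_NO_MATCH 0u

/*
 * Hypothetical model of how a classifier return code is interpreted:
 * 0 is a mismatch, -1 selects the default classid given on the command
 * line, and anything else overrides that default.
 */
static uint32_t sketch_resolve_classid(uint32_t default_classid, int32_t ret)
{
        if (ret == 0)
                return SKETCH_NO_MATCH;     /* filter did not match */
        if (ret == -1)
                return default_classid;     /* e.g. "flowid 1:1" */
        return (uint32_t)ret;               /* classid chosen by the program */
}
```

       With a default of 1:1 (0x00010001), a return code of -1 yields 1:1,
       while a return code of 0x00010002 selects 1:2 instead.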

   bytecode
       is used for loading cBPF classifiers and actions only. The cBPF
       bytecode is passed directly as a text string in the form of
       's,c t f k,c t f k,c t f k,...', where s denotes the number of
       subsequent 4-tuples. Each 4-tuple consists of the decimals c t f k,
       where c represents the cBPF opcode, t the jump true offset target,
       f the jump false offset target and k the immediate
       constant/literal. There are various tools that generate code in
       this loadable format, for example, bpf_asm that ships with the
       Linux kernel source tree under tools/net/, so the bytecode is not
       expected to be written by hand. The bytecode or bytecode-file
       option is mandatory when a cBPF classifier or action is to be
       loaded.
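
       To make the text format more tangible, the following user space C
       sketch parses such a string into 4-tuples. It is not the parser
       used by tc; struct sketch_insn and sketch_parse_bytecode are
       hypothetical names, and the sketch assumes that a trailing comma
       after a tuple is optional.

```c
#include <stdint.h>
#include <stdio.h>

/* One c t f k tuple; the layout is chosen for this sketch only. */
struct sketch_insn {
        uint16_t code;  /* c: cBPF opcode */
        uint8_t  jt;    /* t: jump true offset */
        uint8_t  jf;    /* f: jump false offset */
        uint32_t k;     /* k: immediate constant */
};

/* Returns the number of parsed tuples, or -1 on malformed input. */
static int sketch_parse_bytecode(const char *s, struct sketch_insn *out,
                                 int max)
{
        unsigned int count, c, t, f, k;
        int used, i;

        /* Leading "s," gives the number of tuples that follow. */
        if (sscanf(s, "%u%n", &count, &used) != 1 || s[used] != ',')
                return -1;
        if (max < 0 || count > (unsigned int)max)
                return -1;
        s += used + 1;

        for (i = 0; i < (int)count; i++) {
                if (sscanf(s, "%u %u %u %u%n", &c, &t, &f, &k, &used) != 4)
                        return -1;
                if (c > UINT16_MAX || t > UINT8_MAX || f > UINT8_MAX)
                        return -1;
                out[i].code = (uint16_t)c;
                out[i].jt = (uint8_t)t;
                out[i].jf = (uint8_t)f;
                out[i].k = k;
                s += used;
                if (*s == ',')          /* separator or trailing comma */
                        s++;
        }
        return (int)count;
}
```

       Parsing '1,6 0 0 4294967295,' with this sketch yields one tuple
       with opcode 6 and an immediate of 4294967295.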

   bytecode-file
       is also used to load a cBPF classifier or action. It is
       effectively the same as bytecode, except that the cBPF bytecode is
       not passed directly on the command line, but rather resides in a
       text file.

EXAMPLES

   eBPF TOOLING
       A full-blown example including eBPF agent code can be found inside
       the iproute2 source package under: examples/bpf/

       As prerequisites, the kernel needs to have the eBPF system call,
       namely bpf(2), enabled and must ship with the cls_bpf and act_bpf
       kernel modules for the traffic control subsystem. To enable
       cBPF/eBPF JIT support, depending on which of the two the given
       architecture supports:

           echo 1 > /proc/sys/net/core/bpf_jit_enable

       A given restricted C file can be compiled via LLVM as:

           clang -O2 -emit-llvm -c bpf.c -o - | \
                   llc -march=bpf -filetype=obj -o bpf.o

       The compiler invocation might still be simplified in the future,
       so for now it is quite handy to alias this construct in one way or
       another, for example:

           __bcc() {
                   clang -O2 -emit-llvm -c "$1" -o - | \
                   llc -march=bpf -filetype=obj -o "$(basename "$1" .c).o"
           }

           alias bcc=__bcc

       A minimal, stand-alone unit that matches all traffic with the
       default classid (return code of -1) looks like:

           #include <linux/bpf.h>

           #ifndef __section
           # define __section(x)  __attribute__((section(x), used))
           #endif

           /* Entry point, placed in the default "classifier" section. */
           __section("classifier") int cls_main(struct __sk_buff *skb)
           {
                   return -1;
           }

           char __license[] __section("license") = "GPL";

       More examples can be found further below in the subsection eBPF
       PROGRAMMING, as the focus here is on tooling.

       There can be various other sections, for example, also for actions.
       Thus, an eBPF object file can contain multiple entry points.
       However, only one specific entry point is selected when configuring
       with tc. A license must be part of the restricted C code, and the
       license string syntax is the same as with Linux kernel modules. The
       kernel reserves the right to restrict some eBPF helper functions to
       GPL-compatible licenses only, and may thus reject loading a program
       into the kernel when such a license mismatch occurs.

       The object file resulting from the compilation can be inspected
       with the usual set of tools that also operate on normal object
       files, for example objdump(1) for inspecting ELF section headers:

           objdump -h bpf.o
           [...]
           3 classifier    000007f8  0000000000000000  0000000000000000  00000040  2**3
                           CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
           4 action-mark   00000088  0000000000000000  0000000000000000  00000838  2**3
                           CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
           5 action-rand   00000098  0000000000000000  0000000000000000  000008c0  2**3
                           CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
           6 maps          00000030  0000000000000000  0000000000000000  00000958  2**2
                           CONTENTS, ALLOC, LOAD, DATA
           7 license       00000004  0000000000000000  0000000000000000  00000988  2**0
                           CONTENTS, ALLOC, LOAD, DATA
           [...]

       Adding an eBPF classifier from an object file that contains a
       classifier in the default ELF section is trivial (note that instead
       of "object-file", shortcuts such as "obj" can also be used):

           bcc bpf.c
           tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1

       In case the classifier resides in the ELF section "mycls", that
       same command needs to be invoked as:

           tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1

       Dumping the classifier configuration shows the location of the
       classifier, in other words, that it comes from the object file
       "bpf.o" under the section "mycls":

           tc filter show dev em1
           filter parent 1: protocol all pref 49152 bpf
           filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1 bpf.o:[mycls]

       The same program can also be installed on the ingress qdisc side as
       opposed to egress ...

           tc qdisc add dev em1 handle ffff: ingress
           tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls flowid ffff:1

       ... and again dumped from there:

           tc filter show dev em1 parent ffff:
           filter protocol all pref 49152 bpf
           filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[mycls]

       Attaching a classifier and action on ingress has the restriction
       that ingress has no actual underlying queueing discipline. What
       ingress can do is classify, mangle, redirect or drop packets. When
       queueing is required on the ingress side, ingress must redirect
       packets to the ifb device; otherwise, policing can be used.
       Moreover, ingress can be used as an early drop point for unwanted
       packets before they hit upper layers of the networking stack, to
       perform network accounting with eBPF maps that could be shared with
       egress, or as an early mangle and/or redirection point to different
       networking devices.

       Multiple eBPF actions and classifiers can be placed into a single
       object file within various sections. In that case, non-default
       section names must be provided, which is the case for both actions
       in this example:

           tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \
                                    action bpf obj bpf.o sec action-mark \
                                    action bpf obj bpf.o sec action-rand ok

       The advantage of this is that the classifier and the two actions
       can then share eBPF maps with each other, if implemented in the
       programs.

       In order to access eBPF maps from user space beyond the lifetime of
       the tc(8) setup, ownership can be transferred to an eBPF agent via
       Unix domain sockets. There are two possibilities for implementing
       this:

       1) implement a custom eBPF agent that takes care of setting up the
       Unix domain socket and implements the protocol that tc(8) dictates.
       A code example of this can be found inside the iproute2 source
       package under: examples/bpf/

       2) use tc exec for transferring the eBPF map file descriptors
       through a Unix domain socket and spawning an application such as
       sh(1). The advantage of this approach is that tc places the file
       descriptors into the environment and thus makes them available just
       like the stdin, stdout and stderr file descriptors. This means that
       user applications run from within this fd-owner shell can terminate
       and restart without losing the eBPF map file descriptors. Example
       invocation with the previous classifier and action mixture:

           tc exec bpf imp /tmp/bpf
           tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf flowid 1:1 \
                                    action bpf obj bpf.o sec action-mark \
                                    action bpf obj bpf.o sec action-rand ok

       Assuming that eBPF maps are shared between classifier and actions,
       it is enough to export them once, for example, from within the
       classifier or action command. tc will set up all eBPF map file
       descriptors at the time when the object file is first parsed.

       When a shell has been spawned, the environment will have a couple
       of eBPF related variables. BPF_NUM_MAPS provides the total number
       of maps that have been transferred over the Unix domain socket. The
       value of BPF_MAP<X> is the file descriptor number that can be
       accessed in eBPF agent applications; in other words, it can be used
       directly as the file descriptor value for the bpf(2) system call to
       retrieve or alter eBPF map values. <X> denotes the identifier of
       the eBPF map. It corresponds to the id member of struct bpf_elf_map
       from the tc eBPF map specification.

       The environment in this example looks as follows:

           sh# env | grep BPF
               BPF_NUM_MAPS=3
               BPF_MAP1=6
               BPF_MAP0=5
               BPF_MAP2=7
           sh# ls -la /proc/self/fd
               [...]
               lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
               lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
               lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
           sh# my_bpf_agent
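
       An agent started from such a shell could pick up a map file
       descriptor from the environment roughly as sketched below. The
       function name sketch_map_fd_from_env is hypothetical, and the
       sketch only reads the variable; actually accessing the map through
       bpf(2) is left out.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns the file descriptor stored in the
 * environment variable BPF_MAP<id>, or -1 if it is unset or malformed.
 */
static int sketch_map_fd_from_env(unsigned int id)
{
        char name[32];
        const char *val;
        char *end;
        long fd;

        snprintf(name, sizeof(name), "BPF_MAP%u", id);
        val = getenv(name);
        if (!val || !*val)
                return -1;

        fd = strtol(val, &end, 10);
        if (*end != '\0' || fd < 0 || fd > (1L << 20))
                return -1;              /* not a plausible descriptor */
        return (int)fd;
}
```

       In the environment shown above, a lookup for map identifier 1
       would return 6.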

       eBPF agents are very useful in that they can prepopulate eBPF maps
       from user space, monitor statistics via maps and, based on that
       feedback, rewrite classids in eBPF map values at runtime, for
       example. Given that eBPF agents are implemented as normal
       applications, they can also dynamically receive traffic control
       policies from external controllers and push them down into eBPF
       maps to adapt to network conditions. Moreover, eBPF maps can also
       be shared with other eBPF program types (e.g. tracing), so very
       powerful combinations can be implemented.

   eBPF PROGRAMMING
       eBPF classifiers and actions are implemented in restricted C syntax
       (in the future, additional language front ends could be supported).

       The header file linux/bpf.h provides eBPF helper functions that can
       be called from an eBPF program. This man page only provides two
       minimal, stand-alone examples; have a look at examples/bpf from the
       iproute2 source package for a fully fledged flow dissector example
       that better demonstrates some of the possibilities with eBPF.

       Supported 32 bit classifier return codes from the C program and
       their meanings:
           0     denotes a mismatch
           -1    denotes the default classid configured from the command
                 line
           else  everything else overrides the default classid to provide
                 a facility for non-linear matching

       Supported 32 bit action return codes from the C program and their
       meanings (linux/pkt_cls.h):
           TC_ACT_OK (0)
                 terminates the packet processing pipeline and allows the
                 packet to proceed
           TC_ACT_SHOT (2)
                 terminates the packet processing pipeline and drops the
                 packet
           TC_ACT_UNSPEC (-1)
                 uses the default action configured from tc (similar to
                 returning -1 from a classifier)
           TC_ACT_PIPE (3)
                 iterates to the next action, if available
           TC_ACT_RECLASSIFY (1)
                 terminates the packet processing pipeline and starts
                 classification from the beginning
           else  everything else is an unspecified return code
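
       For quick reference when reading debug output, the action return
       codes listed above can be mapped back to their names with a small
       user space helper. This is only a sketch; sketch_act_name is a
       hypothetical name, and the numeric values are the ones given in
       the list above.

```c
#include <stdint.h>

/* Hypothetical helper: name of a TC action return code as listed above. */
static const char *sketch_act_name(int32_t ret)
{
        switch (ret) {
        case -1:
                return "TC_ACT_UNSPEC";     /* use default action from tc */
        case 0:
                return "TC_ACT_OK";         /* terminate, let packet proceed */
        case 1:
                return "TC_ACT_RECLASSIFY"; /* terminate, classify again */
        case 2:
                return "TC_ACT_SHOT";       /* terminate, drop packet */
        case 3:
                return "TC_ACT_PIPE";       /* continue with next action */
        default:
                return "unspecified";
        }
}
```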

       Both classifier and action return codes are supported in eBPF and
       cBPF programs.

       To demonstrate restricted C syntax, a minimal toy classifier
       example is provided, which assumes that egress packets, for
       instance originating from a container, have previously been marked
       in the interval [0, 255]. The program keeps statistics on the
       different marks for user space and maps the classid to the root
       qdisc with the marking itself as the minor handle:

           #include <stdint.h>
           #include <asm/types.h>

           #include <linux/bpf.h>
           #include <linux/pkt_sched.h>

           #include "helpers.h"

           /* Per-mark counters stored as the map value. */
           struct tuple {
                   long packets;
                   long bytes;
           };

           #define BPF_MAP_ID_STATS        1 /* agent's map identifier */
           #define BPF_MAX_MARK            256

           struct bpf_elf_map __section("maps") map_stats = {
                   .type           =       BPF_MAP_TYPE_ARRAY,
                   .id             =       BPF_MAP_ID_STATS,
                   .size_key       =       sizeof(uint32_t),
                   .size_value     =       sizeof(struct tuple),
                   .max_elem       =       BPF_MAX_MARK,
           };

           /* Account one packet and its length under the given mark. */
           static inline void cls_update_stats(const struct __sk_buff *skb,
                                               uint32_t mark)
           {
                   struct tuple *tu;

                   tu = bpf_map_lookup_elem(&map_stats, &mark);
                   if (likely(tu)) {
                           __sync_fetch_and_add(&tu->packets, 1);
                           __sync_fetch_and_add(&tu->bytes, skb->len);
                   }
           }

           __section("cls") int cls_main(struct __sk_buff *skb)
           {
                   uint32_t mark = skb->mark;

                   /* Marks outside [0, 255] are treated as a mismatch. */
                   if (unlikely(mark >= BPF_MAX_MARK))
                           return 0;

                   cls_update_stats(skb, mark);

                   /* Override the classid, using the mark as minor handle. */
                   return TC_H_MAKE(TC_H_ROOT, mark);
           }

           char __license[] __section("license") = "GPL";
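
       How the returned classid is composed from a major and a minor part
       can be illustrated in user space. The sketch below uses local
       stand-ins that are assumed to mirror TC_H_MAKE, TC_H_ROOT and the
       handle masks from linux/pkt_sched.h; the SKETCH_ and sketch_ names
       themselves are hypothetical.

```c
#include <stdint.h>

/* Local stand-ins, assumed to mirror the values in <linux/pkt_sched.h>. */
#define SKETCH_TC_H_MAJ_MASK    0xFFFF0000U
#define SKETCH_TC_H_MIN_MASK    0x0000FFFFU
#define SKETCH_TC_H_ROOT        0xFFFFFFFFU

/* Combine the major part of maj with the minor part of min. */
static uint32_t sketch_tc_h_make(uint32_t maj, uint32_t min)
{
        return (maj & SKETCH_TC_H_MAJ_MASK) | (min & SKETCH_TC_H_MIN_MASK);
}
```

       Under these assumptions, a mark of 42 combined with the root handle
       gives 0xFFFF002A, that is, major ffff and minor 2a.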

       Another small example is a port redirector which demuxes
       destination port 80 into the interval [8080, 8087], steered by RSS,
       and which can then be attached to the ingress qdisc. The exercise
       of adding the egress counterpart and IPv6 support is left to the
       reader:

           #include <asm/types.h>
           #include <asm/byteorder.h>

           #include <linux/bpf.h>
           #include <linux/filter.h>
           #include <linux/in.h>
           #include <linux/if_ether.h>
           #include <linux/ip.h>
           #include <linux/tcp.h>

           #include "helpers.h"

           /* Rewrite the TCP destination port and fix up the checksum. */
           static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
                                            __u16 old_port, __u16 new_port)
           {
                   bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
                                       old_port, new_port, sizeof(new_port));
                   bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
                                       &new_port, sizeof(new_port), 0);
           }

           static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off)
           {
                   __u16 dport, dport_new = 8080, off;
                   __u8 ip_proto, ip_vl;

                   ip_proto = load_byte(skb, nh_off +
                                        offsetof(struct iphdr, protocol));
                   if (ip_proto != IPPROTO_TCP)
                           return 0;

                   /* Skip the IPv4 header, honoring IP options if present. */
                   ip_vl = load_byte(skb, nh_off);
                   if (likely(ip_vl == 0x45))
                           nh_off += sizeof(struct iphdr);
                   else
                           nh_off += (ip_vl & 0xF) << 2;

                   dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest));
                   if (dport != 80)
                           return 0;

                   /* Pick one of eight ports based on the receive queue. */
                   off = skb->queue_mapping & 7;
                   set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80),
                                 __cpu_to_be16(dport_new + off));
                   return -1;
           }

           __section("lb") int lb_main(struct __sk_buff *skb)
           {
                   int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN;

                   if (likely(skb->protocol == __constant_htons(ETH_P_IP)))
                           ret = lb_do_ipv4(skb, nh_off);

                   return ret;
           }

           char __license[] __section("license") = "GPL";
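
       The two pieces of arithmetic used by the redirector, namely the
       IPv4 header length taken from the first header byte and the choice
       of the new port from the queue mapping, can be tried out in user
       space. This is a stand-alone sketch with hypothetical function
       names, not part of the eBPF program above.

```c
#include <stdint.h>

/* IPv4 header length in bytes from the combined version/IHL byte. */
static unsigned int sketch_ipv4_hdr_len(uint8_t ver_ihl)
{
        return (unsigned int)(ver_ihl & 0xF) << 2;
}

/* New destination port in [8080, 8087], selected by the queue mapping. */
static uint16_t sketch_demux_port(uint32_t queue_mapping)
{
        return (uint16_t)(8080 + (queue_mapping & 7));
}
```

       A version/IHL byte of 0x45 corresponds to a 20 byte header without
       options, and queue mappings 0 to 7 select the ports 8080 to 8087.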

       The related helper header file helpers.h used in both examples is:

           /* Misc helper macros. */
           #define __section(x) __attribute__((section(x), used))
           #define offsetof(x, y) __builtin_offsetof(x, y)
           #define likely(x) __builtin_expect(!!(x), 1)
           #define unlikely(x) __builtin_expect(!!(x), 0)

           /* Used map structure */
           struct bpf_elf_map {
               __u32 type;
               __u32 size_key;
               __u32 size_value;
               __u32 max_elem;
               __u32 id;
           };

           /* Some used BPF function calls. */
           static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
                                             int len, int flags) =
                 (void *) BPF_FUNC_skb_store_bytes;
           static int (*bpf_l4_csum_replace)(void *ctx, int off, int from,
                                             int to, int flags) =
                 (void *) BPF_FUNC_l4_csum_replace;
           static void *(*bpf_map_lookup_elem)(void *map, void *key) =
                 (void *) BPF_FUNC_map_lookup_elem;

           /* Some used BPF intrinsics. */
           unsigned long long load_byte(void *skb, unsigned long long off)
               asm ("llvm.bpf.load.byte");
           unsigned long long load_half(void *skb, unsigned long long off)
               asm ("llvm.bpf.load.half");

       As a best practice, we recommend loading only a single eBPF
       classifier in tc and performing all necessary matching and mangling
       from there, instead of using a list of individual classifiers and
       separate actions. A single classifier tailored to a given use case
       will be the most efficient to run.

   eBPF DEBUGGING
       Both tc filter and action commands for bpf support an optional
       verbose parameter that can be used to inspect the eBPF verifier
       log. The log is dumped by default in case of an error.

       In case the eBPF/cBPF JIT compiler has been enabled, it can also be
       instructed to emit debug output of the resulting opcode image into
       the kernel log, which can be read via dmesg(1):

           echo 2 > /proc/sys/net/core/bpf_jit_enable

       The Linux kernel source tree additionally ships a small helper
       called bpf_jit_disasm under tools/net/ that reads the opcode image
       dump from the kernel log and prints the resulting disassembly:

           bpf_jit_disasm -o

       Other than that, the Linux kernel also contains an extensive
       eBPF/cBPF test suite module called test_bpf. Upon ...

           modprobe test_bpf

       ... it runs a variety of test cases and dumps the results into the
       kernel log, which can be inspected with dmesg(1). The results can
       differ depending on whether the JIT compiler is enabled or not. In
       case of failed test cases, the module will fail to load. In such
       cases, we urge you to file a bug report to the related JIT authors
       and to the Linux kernel and networking mailing lists.

   cBPF
       Although we generally recommend switching to implementing eBPF
       classifiers and actions, for the sake of completeness, a few words
       on how to program in cBPF follow here.

       Likewise, the bpf_jit_enable switch can be enabled as mentioned
       already. Tooling such as bpf_jit_disasm is also independent of
       whether eBPF or cBPF code is being loaded.

       Unlike in eBPF, classifiers and actions are not implemented in
       restricted C, but rather in a minimal assembler-like language or
       with the help of other tooling.

       The raw interface with tc takes opcodes directly. For example, the
       most minimal classifier, matching on every packet and resulting in
       the default classid of 1:1, looks like:

           tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' \
                   flowid 1:1

       The first decimal of the bytecode sequence denotes the number of
       subsequent 4-tuples of cBPF opcodes. As mentioned, such a 4-tuple
       consists of the decimals c t f k, where c represents the cBPF
       opcode, t the jump true offset target, f the jump false offset
       target and k the immediate constant/literal. Here, this denotes an
       unconditional return from the program with an immediate value of
       -1.
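
       The one-instruction program above can also be produced from a
       structured representation. The following user space sketch
       formats 4-tuples into the text form accepted by the bytecode
       option. The struct and function names are hypothetical, and the
       local opcode constants are assumed to mirror BPF_RET and BPF_K
       from linux/filter.h.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed to mirror BPF_RET and BPF_K from <linux/filter.h>. */
#define SKETCH_BPF_RET  0x06
#define SKETCH_BPF_K    0x00

/* One c t f k tuple; the layout is chosen for this sketch only. */
struct sketch_tuple {
        uint16_t c;
        uint8_t  t, f;
        uint32_t k;
};

/*
 * Writes "<n>,c t f k,c t f k,..." (with a trailing comma) into buf.
 * Returns the string length, or -1 if buf is too small.
 */
static int sketch_format_bytecode(const struct sketch_tuple *p,
                                  unsigned int n, char *buf, size_t len)
{
        size_t off;
        unsigned int i;
        int ret;

        ret = snprintf(buf, len, "%u,", n);
        if (ret < 0 || (size_t)ret >= len)
                return -1;
        off = (size_t)ret;

        for (i = 0; i < n; i++) {
                ret = snprintf(buf + off, len - off, "%u %u %u %u,",
                               (unsigned int)p[i].c, (unsigned int)p[i].t,
                               (unsigned int)p[i].f, (unsigned int)p[i].k);
                if (ret < 0 || (size_t)ret >= len - off)
                        return -1;
                off += (size_t)ret;
        }
        return (int)off;
}
```

       A single tuple with opcode SKETCH_BPF_RET | SKETCH_BPF_K and an
       immediate of (uint32_t)-1 is formatted as '1,6 0 0 4294967295,'.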

       For egress classification, Willem de Bruijn implemented a minimal
       stand-alone helper tool under the GNU General Public License
       version 2 for the iptables(8) BPF extension, which abuses the
       libpcap-internal classic BPF compiler. His code, adapted here for
       use with tc(8), is:

           #include <pcap.h>
           #include <stdio.h>

           int main(int argc, char **argv)
           {
                   struct bpf_program prog;
                   struct bpf_insn *ins;
                   int i, ret, dlt = DLT_RAW;

                   /* Usage: [link type name] 'filter expression' */
                   if (argc < 2 || argc > 3)
                           return 1;
                   if (argc == 3) {
                           dlt = pcap_datalink_name_to_val(argv[1]);
                           if (dlt == -1)
                                   return 1;
                   }

                   ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1],
                                             1, PCAP_NETMASK_UNKNOWN);
                   if (ret)
                           return 1;

                   /* Emit the instruction count, then one tuple each. */
                   printf("%d,", prog.bf_len);
                   ins = prog.bf_insns;

                   for (i = 0; i < prog.bf_len - 1; ++ins, ++i)
                           printf("%u %u %u %u,", ins->code,
                                  ins->jt, ins->jf, ins->k);
                   printf("%u %u %u %u",
                          ins->code, ins->jt, ins->jf, ins->k);

                   pcap_freecode(&prog);
                   return 0;
           }

       Given this small helper, any tcpdump(8) filter expression can be
       abused as a classifier, where a match will result in the default
       classid:

           bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn
           tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn \
                   flowid 1:1

       Basically, such a minimal generator is equivalent to:

           tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | tr '\n' ',' > \
                   /var/bpf/tcp-syn

       Since libpcap does not support all Linux-specific cBPF extensions
       in its compiler, the Linux kernel also ships a minimal BPF
       assembler called bpf_asm under tools/net/ for providing full
       control. For detailed syntax and semantics on implementing such
       programs by hand, see the references under FURTHER READING.

       A trivial toy example in bpf_asm for classifying IPv4/TCP packets,
       saved in a text file called foobar:

           ldh [12]
           jne #0x800, drop
           ldb [23]
           jneq #6, drop
           ret #-1
           drop: ret #0
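
       What this program checks can be mirrored in user space C on a raw
       Ethernet frame. The sketch below is only meant to illustrate the
       offsets involved and is not how the kernel executes cBPF; the
       function name is hypothetical, and frames too short for the loads
       are simply treated as a mismatch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user space equivalent of the foobar program: returns
 * (uint32_t)-1 for IPv4/TCP frames and 0 otherwise.
 */
static uint32_t sketch_match_ipv4_tcp(const uint8_t *frame, size_t len)
{
        uint16_t ethertype;

        if (len < 24)
                return 0;               /* loads would be out of bounds */

        /* ldh [12]: EtherType, read in network byte order */
        ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
        if (ethertype != 0x800)
                return 0;               /* drop: ret #0 */

        /* ldb [23]: protocol field of an IPv4 header at offset 14 */
        if (frame[23] != 6)
                return 0;               /* drop: ret #0 */

        return (uint32_t)-1;            /* ret #-1 */
}
```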

       Similarly, such a classifier can be loaded as:

           bpf_asm foobar > /var/bpf/tcp-syn
           tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn \
                   flowid 1:1

       For BPF classifiers, the Linux kernel additionally provides a
       small BPF debugger called bpf_dbg under tools/net/, which can be
       used to test a classifier against pcap files, single-step through
       the classifier program, add various breakpoints and dump register
       contents during runtime.

       Implementing an action in classic BPF is rather limited in the
       sense that packet mangling is not supported. Therefore, it is
       generally recommended to make the switch to eBPF whenever
       possible.

FURTHER READING

       Further and more technical details about the BPF architecture can
       be found in the Linux kernel source tree under
       Documentation/networking/filter.txt.

       Further details on eBPF tc(8) examples can be found in the
       iproute2 source tree under examples/bpf/.

SEE ALSO

       tc(8), tc-ematch(8), bpf(2), bpf(4)

AUTHORS

       Manpage written by Daniel Borkmann.

       Please report corrections or improvements to the Linux kernel
       networking mailing list: <netdev@vger.kernel.org>

iproute2                          18 May 2015     BPF classifier and actions in tc(8)