BPF classifier and actions in tc(8)     Linux     BPF classifier and actions in tc(8)

NAME

       BPF - BPF programmable classifier and actions for ingress/egress
       queueing disciplines

SYNOPSIS

   eBPF classifier (filter) or action:
       tc filter ... bpf [ object-file OBJ_FILE ] [ section CLS_NAME ]
       [ export UDS_FILE ] [ verbose ] [ direct-action | da ]
       [ skip_hw | skip_sw ] [ police POLICE_SPEC ] [ action ACTION_SPEC ]
       [ classid CLASSID ]
       tc action ... bpf [ object-file OBJ_FILE ] [ section CLS_NAME ]
       [ export UDS_FILE ] [ verbose ]

   cBPF classifier (filter) or action:
       tc filter ... bpf [ bytecode-file BPF_FILE | bytecode BPF_BYTECODE ]
       [ police POLICE_SPEC ] [ action ACTION_SPEC ] [ classid CLASSID ]
       tc action ... bpf [ bytecode-file BPF_FILE | bytecode BPF_BYTECODE ]

DESCRIPTION

       Extended Berkeley Packet Filter (eBPF) and classic Berkeley Packet
       Filter (originally known as BPF, referred to as cBPF here for better
       distinction) are both available as fully programmable and highly
       efficient classifiers and actions. They both offer a minimal
       instruction set for implementing small programs which can safely be
       loaded into the kernel and executed in a tiny virtual machine in
       kernel space. An in-kernel verifier guarantees that a specified
       program always terminates and neither crashes nor leaks data from the
       kernel.

       In Linux, eBPF is generally considered the successor of cBPF. The
       kernel internally transforms cBPF expressions into eBPF expressions
       and executes the latter. They can either be run in an interpreter or,
       at setup time, be just-in-time compiled (JIT'ed) to run as native
       machine code.

       Currently, the eBPF JIT compiler is available for the following
       architectures:

       *   x86_64 (since Linux 3.18)
       *   arm64 (since Linux 3.18)
       *   s390 (since Linux 4.1)
       *   ppc64 (since Linux 4.8)
       *   sparc64 (since Linux 4.12)
       *   mips64 (since Linux 4.13)
       *   arm32 (since Linux 4.14)
       *   x86_32 (since Linux 4.18)

       The following architectures have cBPF, but have not (yet) switched to
       eBPF JIT support:

       *   ppc32
       *   sparc32
       *   mips32

       eBPF's instruction set has similar underlying principles as the cBPF
       instruction set. It is, however, modelled more closely on the
       underlying architecture to better mimic native instruction sets, with
       the aim of achieving better run-time performance. It is designed to
       be JIT'ed with a one-to-one mapping, which also opens up the
       possibility for compilers to generate optimized eBPF code through an
       eBPF backend that performs almost as fast as natively compiled code.
       Given that LLVM provides such an eBPF backend, eBPF programs can
       easily be written in a subset of the C language. In addition, the
       eBPF infrastructure comes with a construct called "maps". eBPF maps
       are key/value stores that are shared between multiple eBPF programs,
       but also between eBPF programs and user space applications.

       For the traffic control subsystem, classifiers and actions that can
       be attached to ingress and egress qdiscs can be written in eBPF or
       cBPF. The advantage over other classifiers and actions is that
       eBPF/cBPF provides the generic framework, while users can implement
       their highly specialized use cases efficiently. This means that a
       classifier or action written that way will not suffer from feature
       bloat and can therefore execute its task highly efficiently. It
       allows for non-linear classification and even merging the action part
       into the classification. Combined with efficient eBPF map data
       structures, user space can push new policies such as classids into
       the kernel without reloading a classifier, or it can gather
       statistics that are pushed into one map and use another one for
       dynamically load balancing traffic based on the determined load, just
       to provide a few examples.

PARAMETERS

   object-file
       points to an object file in executable and linkable format (ELF)
       that contains eBPF opcodes and eBPF map definitions. The LLVM
       compiler infrastructure with clang(1) as a C language front end is
       one project that supports emitting eBPF object files that can be
       passed to the eBPF classifier (more details in the EXAMPLES section).
       This option is mandatory when an eBPF classifier or action is to be
       loaded.

   section
       is the name of the ELF section in the object file where the eBPF
       classifier or action resides. By default, the section name for the
       classifier is "classifier", and for the action "action". Given that
       a single object file can contain multiple classifiers and actions,
       the corresponding section name needs to be specified if it differs
       from the defaults.

   export
       points to a Unix domain socket file. In case the eBPF object file
       also contains a section named "maps" with eBPF map specifications,
       the map file descriptors can be handed off via the Unix domain
       socket to an eBPF "agent" that holds on to all descriptors after
       tc's lifetime. This can be some third party application implementing
       the IPC counterpart for the import, which uses them for calling into
       the bpf(2) system call to read out or update eBPF map data from user
       space, for example, for monitoring purposes or to push down new
       policies.

   verbose
       if set, dumps the eBPF verifier output even if loading the eBPF
       program was successful. By default, the verifier log is only emitted
       to the user on error.

   direct-action | da
       instructs the eBPF classifier to not invoke external TC actions and
       instead use the TC action return codes (TC_ACT_OK, TC_ACT_SHOT etc.)
       for classifiers.

   skip_hw | skip_sw
       hardware offload control flags. By default, TC will try to offload
       filters to hardware if possible. skip_hw explicitly disables the
       attempt to offload. skip_sw forces the offload and disables running
       the eBPF program in the kernel. If hardware offload is not possible
       and this flag was set, the kernel will report an error and the filter
       will not be installed at all.

   police
       is an optional parameter for an eBPF/cBPF classifier that specifies
       a police in tc(8) which is attached to the classifier, for example,
       on an ingress qdisc.

   action
       is an optional parameter for an eBPF/cBPF classifier that specifies
       a subsequent action in tc(8) which is attached to a classifier.

   classid
   flowid
       provides the default traffic control class identifier for this
       eBPF/cBPF classifier. The default class identifier can also be
       overridden by the return code of the eBPF/cBPF program. A return
       code of -1 selects the default class identifier provided here. A
       return code of 0 implies that no match took place, and any other
       return code overrides the default classid. This allows for
       efficient, non-linear classification with only a single eBPF/cBPF
       program, as opposed to having multiple individual programs for
       various class identifiers which would each need to reparse packet
       contents.

   bytecode
       is used for loading cBPF classifiers and actions only. The cBPF
       bytecode is passed directly as a text string in the form of
       's,c t f k,c t f k,c t f k,...', where s denotes the number of
       subsequent 4-tuples. Each 4-tuple consists of the decimals c t f k,
       where c represents the cBPF opcode, t the jump true offset target,
       f the jump false offset target and k the immediate constant/literal.
       There are various tools that generate code in this loadable format,
       for example, bpf_asm that ships with the Linux kernel source tree
       under tools/net/, so this is not expected to be written by hand.
       The bytecode or bytecode-file option is mandatory when a cBPF
       classifier or action is to be loaded.

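       As a rough illustration of the text format described above, the
       following user space sketch parses such a string into 4-tuples. The
       names cbpf_insn and cbpf_parse are made up for this example and are
       not part of tc or the kernel; tc's own parser may differ in what it
       accepts.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One cBPF instruction as described above (illustrative type). */
struct cbpf_insn {
        uint16_t code;  /* c: opcode */
        uint8_t  jt;    /* t: jump true offset */
        uint8_t  jf;    /* f: jump false offset */
        uint32_t k;     /* k: immediate constant/literal */
};

/*
 * Hypothetical helper: parse "s,c t f k,c t f k,..." into out[].
 * Returns the number of instructions, or -1 if the input is malformed
 * or announces more instructions than fit into max.
 */
int cbpf_parse(const char *str, struct cbpf_insn *out, int max)
{
        unsigned int count, c, t, f, k;
        int used, i;

        assert(str != NULL && out != NULL);

        if (sscanf(str, "%u%n", &count, &used) != 1)
                return -1;
        if (count == 0 || max < 0 || count > (unsigned int)max)
                return -1;
        str += used;

        for (i = 0; i < (int)count; i++) {
                /* each tuple is introduced by a comma */
                if (sscanf(str, " ,%u %u %u %u%n", &c, &t, &f, &k, &used) != 4)
                        return -1;
                if (c > 0xffff || t > 0xff || f > 0xff)
                        return -1;
                out[i].code = (uint16_t)c;
                out[i].jt = (uint8_t)t;
                out[i].jf = (uint8_t)f;
                out[i].k = k;
                str += used;
        }
        return (int)count;
}
```

       A trailing comma after the last tuple is tolerated by this sketch,
       and a string announcing more tuples than it contains is rejected.
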
   bytecode-file
       is also used to load a cBPF classifier or action. It is effectively
       the same as bytecode, except that the cBPF bytecode is not passed
       directly on the command line, but rather resides in a text file.

EXAMPLES

   eBPF TOOLING
       A full-blown example including eBPF agent code can be found inside
       the iproute2 source package under: examples/bpf/

       As prerequisites, the kernel needs to have the eBPF system call,
       namely bpf(2), enabled and must ship with the cls_bpf and act_bpf
       kernel modules for the traffic control subsystem. To enable
       cBPF/eBPF JIT support, depending on which of the two the given
       architecture supports:

           echo 1 > /proc/sys/net/core/bpf_jit_enable

       A given restricted C file can be compiled via LLVM as:

           clang -O2 -emit-llvm -c bpf.c -o - | llc -march=bpf -filetype=obj -o bpf.o

       The compiler invocation might still be simplified in the future, so
       for now it is quite handy to wrap this construct in one way or
       another, for example:

           __bcc() {
                   clang -O2 -emit-llvm -c "$1" -o - | \
                   llc -march=bpf -filetype=obj -o "$(basename "$1" .c).o"
           }

           alias bcc=__bcc

       A minimal, stand-alone unit which matches all traffic with the
       default classid (return code of -1) looks like:

           #include <linux/bpf.h>

           #ifndef __section
           # define __section(x)  __attribute__((section(x), used))
           #endif

           __section("classifier") int cls_main(struct __sk_buff *skb)
           {
                   return -1;
           }

           char __license[] __section("license") = "GPL";

       More examples can be found further below in subsection eBPF
       PROGRAMMING, as the focus here is on tooling.

       There can be various other sections, for example, for actions. Thus,
       an eBPF object file can contain multiple entry points. However, a
       specific entry point must always be specified when configuring with
       tc. A license must be part of the restricted C code, and the license
       string syntax is the same as with Linux kernel modules. The kernel
       reserves the right to restrict some eBPF helper functions to GPL
       compatible licenses only, and thus may reject a program from loading
       into the kernel when such a license mismatch occurs.

       The resulting object file from the compilation can be inspected with
       the usual set of tools that also operate on normal object files, for
       example objdump(1) for inspecting ELF section headers:

           objdump -h bpf.o
           [...]
           3 classifier    000007f8  0000000000000000  0000000000000000  00000040  2**3
                           CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
           4 action-mark   00000088  0000000000000000  0000000000000000  00000838  2**3
                           CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
           5 action-rand   00000098  0000000000000000  0000000000000000  000008c0  2**3
                           CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
           6 maps          00000030  0000000000000000  0000000000000000  00000958  2**2
                           CONTENTS, ALLOC, LOAD, DATA
           7 license       00000004  0000000000000000  0000000000000000  00000988  2**0
                           CONTENTS, ALLOC, LOAD, DATA
           [...]

       Adding an eBPF classifier from an object file that contains a
       classifier in the default ELF section is trivial (note that instead
       of "object-file", shortcuts such as "obj" can also be used):

           bcc bpf.c
           tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1

       In case the classifier resides in ELF section "mycls", that same
       command needs to be invoked as:

           tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1

       Dumping the classifier configuration will tell the location of the
       classifier, in other words that it's from object file "bpf.o" under
       section "mycls":

           tc filter show dev em1
           filter parent 1: protocol all pref 49152 bpf
           filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1 bpf.o:[mycls]

       The same program can also be installed on the ingress qdisc side as
       opposed to egress ...

           tc qdisc add dev em1 handle ffff: ingress
           tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls flowid ffff:1

       ... and again dumped from there:

           tc filter show dev em1 parent ffff:
           filter protocol all pref 49152 bpf
           filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[mycls]

       Attaching a classifier and action on ingress has the restriction
       that ingress doesn't have an actual underlying queueing discipline.
       What ingress can do is classify, mangle, redirect or drop packets.
       When queueing is required on the ingress side, ingress must redirect
       packets to the ifb device; otherwise, policing can be used. Moreover,
       ingress can be used to have an early drop point for unwanted packets
       before they hit upper layers of the networking stack, to perform
       network accounting with eBPF maps that could be shared with egress,
       or to have an early mangle and/or redirection point to different
       networking devices.

       Multiple eBPF actions and classifiers can be placed into a single
       object file within various sections. In that case, non-default
       section names must be provided, which is the case for both actions in
       this example:

           tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \
                                    action bpf obj bpf.o sec action-mark \
                                    action bpf obj bpf.o sec action-rand ok

       The advantage of this is that the classifier and the two actions can
       then share eBPF maps with each other, if implemented in the programs.

       In order to access eBPF maps from user space beyond the tc(8) setup
       lifetime, the ownership can be transferred to an eBPF agent via Unix
       domain sockets. There are two possibilities for implementing this:

       1) implement your own eBPF agent that takes care of setting up the
       Unix domain socket and implementing the protocol that tc(8)
       dictates. A code example of this can be found inside the iproute2
       source package under: examples/bpf/

       2) use tc exec for transferring the eBPF map file descriptors
       through a Unix domain socket and spawning an application such as
       sh(1). This approach's advantage is that tc will place the file
       descriptors into the environment and thus make them available just
       like the stdin, stdout and stderr file descriptors. This means that
       user applications run from within this fd-owner shell can terminate
       and restart without losing the eBPF map file descriptors. Example
       invocation with the previous classifier and action mixture:

           tc exec bpf imp /tmp/bpf
           tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf flowid 1:1 \
                                    action bpf obj bpf.o sec action-mark \
                                    action bpf obj bpf.o sec action-rand ok

       Assuming that eBPF maps are shared between classifier and actions,
       it is enough to export them once, for example, from within the
       classifier or action command. tc will set up all eBPF map file
       descriptors at the time when the object file is first parsed.

       When a shell has been spawned, the environment will have a couple of
       eBPF related variables. BPF_NUM_MAPS provides the total number of
       maps that have been transferred over the Unix domain socket. The
       value of BPF_MAP<X> is the file descriptor number that can be
       accessed in eBPF agent applications; in other words, it can directly
       be used as the file descriptor value for the bpf(2) system call to
       retrieve or alter eBPF map values. <X> denotes the identifier of the
       eBPF map. It corresponds to the id member of struct bpf_elf_map from
       the tc eBPF map specification.

       The environment in this example looks as follows:

           sh# env | grep BPF
               BPF_NUM_MAPS=3
               BPF_MAP1=6
               BPF_MAP0=5
               BPF_MAP2=7
           sh# ls -la /proc/self/fd
               [...]
               lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
               lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
               lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
           sh# my_bpf_agent

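       A minimal sketch of how an agent started from such a shell might
       pick up a map file descriptor from the environment. The helper names
       bpf_parse_fd and bpf_map_fd_from_env are hypothetical; only the
       variable name BPF_MAP<X> is taken from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a decimal file descriptor number; returns -1 if val is not one. */
int bpf_parse_fd(const char *val)
{
        char *end;
        long fd;

        if (!val || !*val)
                return -1;
        errno = 0;
        fd = strtol(val, &end, 10);
        if (errno || *end != '\0' || fd < 0 || fd > INT_MAX)
                return -1;
        return (int)fd;
}

/*
 * Hypothetical agent-side helper: look up BPF_MAP<id> in the environment
 * and return the file descriptor number, or -1 if it is not set.
 */
int bpf_map_fd_from_env(unsigned int id)
{
        char name[32];
        int n;

        n = snprintf(name, sizeof(name), "BPF_MAP%u", id);
        assert(n > 0 && (size_t)n < sizeof(name));
        return bpf_parse_fd(getenv(name));
}
```

       With the environment shown above, bpf_map_fd_from_env(1) would yield
       6, which could then be passed as the map file descriptor to bpf(2).
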
       eBPF agents are very useful in that they can prepopulate eBPF maps
       from user space, monitor statistics via maps and, based on that
       feedback, for example rewrite classids in eBPF map values during
       runtime. Given that eBPF agents are implemented as normal
       applications, they can also dynamically receive traffic control
       policies from external controllers and push them down into eBPF maps
       to dynamically adapt to network conditions. Moreover, eBPF maps can
       also be shared with other eBPF program types (e.g. tracing), so very
       powerful combinations can be implemented.

   eBPF PROGRAMMING
       eBPF classifiers and actions are implemented in restricted C syntax
       (in the future, additional language front ends could be supported).

       The header file linux/bpf.h provides eBPF helper functions that can
       be called from an eBPF program. This man page only provides two
       minimal, stand-alone examples; have a look at examples/bpf from the
       iproute2 source package for a fully fledged flow dissector example
       that better demonstrates some of the possibilities with eBPF.

       Supported 32 bit classifier return codes from the C program and
       their meanings:
           0 , denotes a mismatch
           -1 , denotes the default classid configured from the command line
           else , everything else will override the default classid to
           provide a facility for non-linear matching

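       The classifier return codes listed above can be modelled in plain
       user space C as follows. This is only a sketch of the semantics as
       described in this man page, not the kernel's implementation, and
       the function name cls_resolve_classid is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative model of how a classifier return code is interpreted:
 *    0    -> no match
 *   -1    -> match, use the default classid from the command line
 *   else  -> match, the return code overrides the default classid
 * Returns true on a match and stores the resulting classid in *classid.
 */
bool cls_resolve_classid(int32_t ret, uint32_t default_classid,
                         uint32_t *classid)
{
        assert(classid != NULL);

        if (ret == 0)
                return false;
        *classid = (ret == -1) ? default_classid : (uint32_t)ret;
        return true;
}
```

       For instance, with a default of 1:1 (0x00010001), a return code of
       -1 selects 1:1, while a return code of 0x00010002 selects 1:2.
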
       Supported 32 bit action return codes from the C program and their
       meanings ( linux/pkt_cls.h ):
           TC_ACT_OK (0) , will terminate the packet processing pipeline and
           allow the packet to proceed
           TC_ACT_SHOT (2) , will terminate the packet processing pipeline
           and drop the packet
           TC_ACT_UNSPEC (-1) , will use the default action configured from
           tc (similar to returning -1 from a classifier)
           TC_ACT_PIPE (3) , will iterate to the next action, if available
           TC_ACT_RECLASSIFY (1) , will terminate the packet processing
           pipeline and start classification from the beginning
           else , everything else is an unspecified return code

       Both classifier and action return codes are supported in eBPF and
       cBPF programs.

       To demonstrate restricted C syntax, a minimal toy classifier example
       is provided, which assumes that egress packets, for instance
       originating from a container, have previously been marked in the
       interval [0, 255]. The program keeps statistics on different marks
       for user space and maps the classid to the root qdisc with the
       marking itself as the minor handle:

           #include <stdint.h>
           #include <asm/types.h>

           #include <linux/bpf.h>
           #include <linux/pkt_sched.h>

           #include "helpers.h"

           struct tuple {
                   long packets;
                   long bytes;
           };

           #define BPF_MAP_ID_STATS        1 /* agent's map identifier */
           #define BPF_MAX_MARK            256

           struct bpf_elf_map __section("maps") map_stats = {
                   .type           =       BPF_MAP_TYPE_ARRAY,
                   .id             =       BPF_MAP_ID_STATS,
                   .size_key       =       sizeof(uint32_t),
                   .size_value     =       sizeof(struct tuple),
                   .max_elem       =       BPF_MAX_MARK,
                   .pinning        =       PIN_GLOBAL_NS,
           };

           static inline void cls_update_stats(const struct __sk_buff *skb,
                                               uint32_t mark)
           {
                   struct tuple *tu;

                   tu = bpf_map_lookup_elem(&map_stats, &mark);
                   if (likely(tu)) {
                           __sync_fetch_and_add(&tu->packets, 1);
                           __sync_fetch_and_add(&tu->bytes, skb->len);
                   }
           }

           __section("cls") int cls_main(struct __sk_buff *skb)
           {
                   uint32_t mark = skb->mark;

                   if (unlikely(mark >= BPF_MAX_MARK))
                           return 0;

                   cls_update_stats(skb, mark);

                   return TC_H_MAKE(TC_H_ROOT, mark);
           }

           char __license[] __section("license") = "GPL";

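       To make the classid computed by TC_H_MAKE(TC_H_ROOT, mark) more
       tangible, the following user space sketch repeats the relevant
       definitions from linux/pkt_sched.h under an EX_ prefix (so that it
       stays self-contained) and prints the result in major:minor hex
       notation. The function names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Same values as the TC_H_* macros in linux/pkt_sched.h. */
#define EX_TC_H_MAJ_MASK        0xFFFF0000U
#define EX_TC_H_MIN_MASK        0x0000FFFFU
#define EX_TC_H_ROOT            0xFFFFFFFFU
#define EX_TC_H_MAKE(maj, min) \
        (((maj) & EX_TC_H_MAJ_MASK) | ((min) & EX_TC_H_MIN_MASK))

/* Classid as returned by the toy classifier for a given mark. */
uint32_t classid_from_mark(uint32_t mark)
{
        return EX_TC_H_MAKE(EX_TC_H_ROOT, mark);
}

/* Format a 32 bit classid as hexadecimal "major:minor". */
int classid_to_str(uint32_t classid, char *buf, size_t len)
{
        assert(buf != NULL);
        return snprintf(buf, len, "%x:%x",
                        (unsigned int)(classid >> 16),
                        (unsigned int)(classid & EX_TC_H_MIN_MASK));
}
```

       For a mark of 5, classid_from_mark() yields 0xffff0005, which
       classid_to_str() renders as ffff:5.
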
       Another small example is a port redirector which demuxes destination
       port 80 into the interval [8080, 8087], steered by RSS; it can then
       be attached to the ingress qdisc. The exercise of adding the egress
       counterpart and IPv6 support is left to the reader:

           #include <asm/types.h>
           #include <asm/byteorder.h>

           #include <linux/bpf.h>
           #include <linux/filter.h>
           #include <linux/in.h>
           #include <linux/if_ether.h>
           #include <linux/ip.h>
           #include <linux/tcp.h>

           #include "helpers.h"

           static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
                                            __u16 old_port, __u16 new_port)
           {
                   bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
                                       old_port, new_port, sizeof(new_port));
                   bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
                                       &new_port, sizeof(new_port), 0);
           }

           static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off)
           {
                   __u16 dport, dport_new = 8080, off;
                   __u8 ip_proto, ip_vl;

                   ip_proto = load_byte(skb, nh_off +
                                        offsetof(struct iphdr, protocol));
                   if (ip_proto != IPPROTO_TCP)
                           return 0;

                   ip_vl = load_byte(skb, nh_off);
                   if (likely(ip_vl == 0x45))
                           nh_off += sizeof(struct iphdr);
                   else
                           nh_off += (ip_vl & 0xF) << 2;

                   dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest));
                   if (dport != 80)
                           return 0;

                   off = skb->queue_mapping & 7;
                   set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80),
                                 __cpu_to_be16(dport_new + off));
                   return -1;
           }

           __section("lb") int lb_main(struct __sk_buff *skb)
           {
                   int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN;

                   if (likely(skb->protocol == __constant_htons(ETH_P_IP)))
                           ret = lb_do_ipv4(skb, nh_off);

                   return ret;
           }

           char __license[] __section("license") = "GPL";

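       The two pieces of arithmetic in the redirector, the IPv4 header
       length derived from the first header byte and the mapping of the
       queue index onto a port in [8080, 8087], can be tried out in user
       space. The following is only a sketch; the function names
       ipv4_hdr_len and demux_port do not appear in the example above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * IPv4 header length in bytes, from the first header byte which holds
 * the version (upper 4 bits) and the header length in 32 bit words
 * (lower 4 bits).
 */
unsigned int ipv4_hdr_len(uint8_t ip_vl)
{
        assert((ip_vl >> 4) == 4);      /* caller must pass an IPv4 header */
        return (unsigned int)(ip_vl & 0xF) << 2;
}

/* New destination port: base plus the low three bits of the queue index. */
uint16_t demux_port(uint16_t base, uint32_t queue_mapping)
{
        return (uint16_t)(base + (queue_mapping & 7));
}
```

       A header without options (0x45) is 20 bytes long, and a queue index
       of 13 with a base of 8080 maps to port 8085.
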
       The related helper header file helpers.h in both examples was:

           /* Misc helper macros. */
           #define __section(x) __attribute__((section(x), used))
           #define offsetof(x, y) __builtin_offsetof(x, y)
           #define likely(x) __builtin_expect(!!(x), 1)
           #define unlikely(x) __builtin_expect(!!(x), 0)

           /* Object pinning settings */
           #define PIN_NONE       0
           #define PIN_OBJECT_NS  1
           #define PIN_GLOBAL_NS  2

           /* ELF map definition */
           struct bpf_elf_map {
               __u32 type;
               __u32 size_key;
               __u32 size_value;
               __u32 max_elem;
               __u32 flags;
               __u32 id;
               __u32 pinning;
               __u32 inner_id;
               __u32 inner_idx;
           };

           /* Some used BPF function calls. */
           static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
                                             int len, int flags) =
                 (void *) BPF_FUNC_skb_store_bytes;
           static int (*bpf_l4_csum_replace)(void *ctx, int off, int from,
                                             int to, int flags) =
                 (void *) BPF_FUNC_l4_csum_replace;
           static void *(*bpf_map_lookup_elem)(void *map, void *key) =
                 (void *) BPF_FUNC_map_lookup_elem;

           /* Some used BPF intrinsics. */
           unsigned long long load_byte(void *skb, unsigned long long off)
               asm ("llvm.bpf.load.byte");
           unsigned long long load_half(void *skb, unsigned long long off)
               asm ("llvm.bpf.load.half");

       As a best practice, we recommend having only a single eBPF classifier
       loaded in tc and performing all necessary matching and mangling from
       there, instead of a list of individual classifiers and separate
       actions. A single classifier tailored to a given use case will be
       most efficient to run.

   eBPF DEBUGGING
       Both tc filter and action commands for bpf support an optional
       verbose parameter that can be used to inspect the eBPF verifier log.
       It is dumped by default in case of an error.

       In case the eBPF/cBPF JIT compiler has been enabled, it can also be
       instructed to emit a debug output of the resulting opcode image into
       the kernel log, which can be read via dmesg(1):

           echo 2 > /proc/sys/net/core/bpf_jit_enable

       The Linux kernel source tree additionally ships, under tools/net/, a
       small helper called bpf_jit_disasm that reads out the opcode image
       dump from the kernel log and dumps the resulting disassembly:

           bpf_jit_disasm -o

       Other than that, the Linux kernel also contains an extensive
       eBPF/cBPF test suite module called test_bpf. Upon ...

           modprobe test_bpf

       ... it performs a variety of test cases and dumps the results into
       the kernel log, which can be inspected with dmesg(1). The results can
       differ depending on whether the JIT compiler is enabled or not. In
       case of failed test cases, the module will fail to load. In such
       cases, we urge you to file a bug report to the related JIT authors
       and to the Linux kernel and networking mailing lists.

   cBPF
       Although we generally recommend switching to implementing eBPF
       classifiers and actions, for the sake of completeness, a few words
       on how to program in cBPF follow here.

       Likewise, the bpf_jit_enable switch can be enabled as mentioned
       already. Tooling such as bpf_jit_disasm also works independently of
       whether eBPF or cBPF code is being loaded.

       Unlike in eBPF, classifiers and actions are not implemented in
       restricted C, but rather in a minimal assembler-like language or
       with the help of other tooling.

       The raw interface with tc takes opcodes directly. For example, the
       most minimal classifier matching on every packet and resulting in
       the default classid of 1:1 looks like:

           tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' flowid 1:1

       The first decimal of the bytecode sequence denotes the number of
       subsequent 4-tuples of cBPF opcodes. As mentioned, such a 4-tuple
       consists of c t f k decimals, where c represents the cBPF opcode, t
       the jump true offset target, f the jump false offset target and k
       the immediate constant/literal. Here, this denotes an unconditional
       return from the program with an immediate value of -1 (4294967295
       when read as an unsigned 32 bit number).

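       The decoding of the single tuple '6 0 0 4294967295' can be sketched
       in user space C as below. The opcode field values match those in
       the kernel's classic BPF headers but are repeated under an EX_
       prefix to keep the sketch self-contained; the function name
       cbpf_is_ret_k is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Classic BPF opcode fields (same values as the kernel's BPF_* macros). */
#define EX_BPF_CLASS(code)      ((code) & 0x07)
#define EX_BPF_RET              0x06
#define EX_BPF_RVAL(code)       ((code) & 0x18)
#define EX_BPF_K                0x00

/*
 * Returns 1 if (code, k) encodes "ret #k" and stores k reinterpreted as a
 * signed 32 bit value in *value, otherwise returns 0.
 */
int cbpf_is_ret_k(uint16_t code, uint32_t k, int32_t *value)
{
        assert(value != NULL);

        if (EX_BPF_CLASS(code) != EX_BPF_RET || EX_BPF_RVAL(code) != EX_BPF_K)
                return 0;
        /* 0xffffffff becomes -1 on two's complement machines */
        *value = (int32_t)k;
        return 1;
}
```

       Called with code 6 and k 4294967295, the sketch reports a return
       instruction with the value -1, which is the "default classid"
       return code.
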
       For egress classification, Willem de Bruijn implemented a minimal
       stand-alone helper tool under the GNU General Public License version
       2 for the iptables(8) BPF extension, which abuses the libpcap
       internal classic BPF compiler. His code, adapted here for usage with
       tc(8):

           #include <pcap.h>
           #include <stdio.h>

           int main(int argc, char **argv)
           {
                   struct bpf_program prog;
                   struct bpf_insn *ins;
                   int i, ret, dlt = DLT_RAW;

                   /* arguments: [link-type name] 'filter expression' */
                   if (argc < 2 || argc > 3)
                           return 1;
                   if (argc == 3) {
                           dlt = pcap_datalink_name_to_val(argv[1]);
                           if (dlt == -1)
                                   return 1;
                   }

                   ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1],
                                             1, PCAP_NETMASK_UNKNOWN);
                   if (ret)
                           return 1;

                   /* emit "s,c t f k,c t f k,..." as loadable by tc */
                   printf("%d,", prog.bf_len);
                   ins = prog.bf_insns;

                   for (i = 0; i < prog.bf_len - 1; ++ins, ++i)
                           printf("%u %u %u %u,", ins->code,
                                  ins->jt, ins->jf, ins->k);
                   printf("%u %u %u %u",
                          ins->code, ins->jt, ins->jf, ins->k);

                   pcap_freecode(&prog);
                   return 0;
           }

       Given this small helper (built as bpftool in the invocation below),
       any tcpdump(8) filter expression can be abused as a classifier where
       a match will result in the default classid:

           bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn
           tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1

       Basically, such a minimal generator is equivalent to:

           tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | tr '\n' ',' > /var/bpf/tcp-syn

       Since libpcap does not support all Linux specific cBPF extensions in
       its compiler, the Linux kernel also ships, under tools/net/, a
       minimal BPF assembler called bpf_asm for providing full control. For
       detailed syntax and semantics on implementing such programs by hand,
       see references under FURTHER READING.

       A trivial toy example in bpf_asm for classifying IPv4/TCP packets,
       saved in a text file called foobar:

           ldh [12]
           jne #0x800, drop
           ldb [23]
           jneq #6, drop
           ret #-1
           drop: ret #0

       Similarly, such a classifier can be loaded as:

           bpf_asm foobar > /var/bpf/tcp-syn
           tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1

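       For readers less familiar with the assembler syntax, the toy
       bpf_asm program above can be mirrored in plain user space C. This is
       only a sketch for reasoning about the logic; it assumes the buffer
       starts at the Ethernet header, and the function name
       classify_ipv4_tcp is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Plain C rendering of the toy program: returns -1 (match, default
 * classid) for IPv4/TCP frames and 0 (no match) otherwise. Packets too
 * short for the loads yield 0, since a cBPF load beyond the end of the
 * packet ends the program with a return value of 0.
 */
int classify_ipv4_tcp(const uint8_t *pkt, size_t len)
{
        assert(pkt != NULL);

        if (len < 24)                   /* bytes 0..23 are accessed */
                return 0;
        /* ldh [12]; jne #0x800, drop */
        if ((((unsigned int)pkt[12] << 8) | pkt[13]) != 0x0800)
                return 0;
        /* ldb [23]; jneq #6, drop */
        if (pkt[23] != 6)
                return 0;
        return -1;                      /* ret #-1 */
}
```

       Offset 12 holds the EtherType and offset 23 the IPv4 protocol field,
       so a frame with EtherType 0x0800 and protocol 6 matches.
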
       For BPF classifiers, the Linux kernel additionally provides, under
       tools/net/, a small BPF debugger called bpf_dbg, which can be used to
       test a classifier against pcap files, single-step or add various
       breakpoints into the classifier program, and dump register contents
       during runtime.

       Implementing an action in classic BPF is rather limited in the sense
       that packet mangling is not supported. Therefore, it is generally
       recommended to make the switch to eBPF whenever possible.

FURTHER READING

       Further and more technical details about the BPF architecture can be
       found in the Linux kernel source tree under
       Documentation/networking/filter.txt.

       Further details on eBPF tc(8) examples can be found in the iproute2
       source tree under examples/bpf/.

SEE ALSO

       tc(8), tc-ematch(8), bpf(2), bpf(4)

AUTHORS

       Manpage written by Daniel Borkmann.

       Please report corrections or improvements to the Linux kernel
       networking mailing list: <netdev@vger.kernel.org>

iproute2                          18 May 2015     BPF classifier and actions in tc(8)