
bpf(2)                        System Calls Manual                       bpf(2)

NAME

       bpf - perform a command on an extended BPF map or program

SYNOPSIS

       #include <linux/bpf.h>

       int bpf(int cmd, union bpf_attr *attr, unsigned int size);

DESCRIPTION

       The  bpf()  system  call  performs a range of operations related to ex‐
       tended Berkeley Packet Filters.  Extended BPF (or eBPF) is  similar  to
       the  original  ("classic")  BPF  (cBPF) used to filter network packets.
       For both cBPF and eBPF programs, the  kernel  statically  analyzes  the
       programs  before loading them, in order to ensure that they cannot harm
       the running system.

       eBPF extends cBPF in multiple ways, including the  ability  to  call  a
       fixed set of in-kernel helper functions (via the BPF_CALL opcode exten‐
       sion provided by eBPF) and access shared data structures such  as  eBPF
       maps.

   Extended BPF Design/Architecture
       eBPF  maps  are  a generic data structure for storage of different data
       types.  Data types are generally treated as binary  blobs,  so  a  user
       just  specifies  the  size of the key and the size of the value at map-
       creation time.  In other words, a key/value for a given map can have an
       arbitrary structure.

       A  user  process  can  create multiple maps (with key/value-pairs being
       opaque bytes of data) and access them via file descriptors.   Different
       eBPF  programs  can  access  the same maps in parallel.  It's up to the
       user process and eBPF program to decide what they store inside maps.

       There's one special map type, called a program array.  This type of map
       stores  file  descriptors  referring  to  other  eBPF programs.  When a
       lookup in the map is performed, the program flow is redirected in-place
       to  the  beginning  of another eBPF program and does not return back to
       the calling program.  The level of nesting has a fixed limit of 32,  so
       that  infinite  loops cannot be crafted.  At run time, the program file
       descriptors stored in the map can be modified, so program functionality
       can  be  altered based on specific requirements.  All programs referred
       to in a program-array map must have been  previously  loaded  into  the
       kernel via bpf().  If a map lookup fails, the current program continues
       its execution.  See BPF_MAP_TYPE_PROG_ARRAY below for further details.

       Generally, eBPF programs are loaded by the user process  and  automati‐
       cally  unloaded  when  the  process exits.  In some cases, for example,
       tc-bpf(8), the program will continue to stay alive  inside  the  kernel
       even  after  the  process that loaded the program exits.  In that case,
       the tc subsystem holds a reference to the eBPF program after  the  file
       descriptor  has been closed by the user-space program.  Thus, whether a
       specific program continues to live inside the kernel depends on how  it
       is further attached to a given kernel subsystem after it was loaded via
       bpf().

       Each eBPF program is a set of instructions that is safe  to  run  until
       its  completion.   An in-kernel verifier statically determines that the
       eBPF program terminates and is safe to execute.   During  verification,
       the  kernel  increments  reference counts for each of the maps that the
       eBPF program uses, so that the attached maps can't be removed until the
       program is unloaded.

       eBPF programs can be attached to different events.  These events can be
       the arrival of network packets, tracing events,  classification  events
       by network queueing  disciplines (for eBPF programs attached to a tc(8)
       classifier), and other types that may be added in the  future.   A  new
       event  triggers execution of the eBPF program, which may store informa‐
       tion about the event in eBPF maps.  Beyond storing data, eBPF  programs
       may call a fixed set of in-kernel helper functions.

       The  same eBPF program can be attached to multiple events and different
       eBPF programs can access the same map:

           tracing     tracing    tracing    packet      packet     packet
           event A     event B    event C    on eth0     on eth1    on eth2
            |             |         |          |           |          ^
            |             |         |          |           v          |
            --> tracing <--     tracing      socket    tc ingress   tc egress
                 prog_1          prog_2      prog_3    classifier    action
                 |  |              |           |         prog_4      prog_5
              |---  -----|  |------|          map_3        |           |
            map_1       map_2                              --| map_4 |--

   Arguments
       The operation to be performed by the bpf() system call is determined by
       the  cmd argument.  Each operation takes an accompanying argument, pro‐
       vided via attr, which is a pointer to a union of type bpf_attr (see be‐
       low).   The  unused  fields  and  padding must be zeroed out before the
       call.  The size argument is the size of the union pointed to by attr.

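       The zeroing requirement can be met by clearing the whole union with
       memset(3) before setting the fields of one command.  The following is
       a minimal sketch rather than a definitive implementation: the function
       name create_hash_map_zeroed() is hypothetical, and the call is made
       through syscall(2) on the assumption that the C library provides no
       bpf() wrapper.

```c
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper (not part of any library): create a hash map
   with every unused field and all padding of bpf_attr set to zero. */
static int
create_hash_map_zeroed(unsigned int key_size, unsigned int value_size,
                       unsigned int max_entries)
{
    union bpf_attr attr;

    /* Clear the entire union first, so that fields belonging to other
       commands and any padding bytes are zero. */
    memset(&attr, 0, sizeof(attr));

    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = key_size;
    attr.value_size  = value_size;
    attr.max_entries = max_entries;

    /* Returns a new file descriptor, or -1 with errno set. */
    return (int) syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}
```

       Clearing the union with memset(3) does not depend on how a particular
       compiler treats the bytes of a union that a designated initializer
       leaves unnamed.
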
       The value provided in cmd is one of the following:

       BPF_MAP_CREATE
              Create a map and return a file descriptor  that  refers  to  the
              map.   The  close-on-exec file descriptor flag (see fcntl(2)) is
              automatically enabled for the new file descriptor.

       BPF_MAP_LOOKUP_ELEM
              Look up an element by key in a  specified  map  and  return  its
              value.

       BPF_MAP_UPDATE_ELEM
              Create or update an element (key/value pair) in a specified map.

       BPF_MAP_DELETE_ELEM
              Look up and delete an element by key in a specified map.

       BPF_MAP_GET_NEXT_KEY
              Look  up an element by key in a specified map and return the key
              of the next element.

       BPF_PROG_LOAD
              Verify and load an eBPF program, returning a new file descriptor
              associated  with the program.  The close-on-exec file descriptor
              flag (see fcntl(2)) is automatically enabled for  the  new  file
              descriptor.

       The bpf_attr union consists of various anonymous structures that are
       used by different bpf() commands:

           union bpf_attr {
               struct {    /* Used by BPF_MAP_CREATE */
                   __u32         map_type;
                   __u32         key_size;    /* size of key in bytes */
                   __u32         value_size;  /* size of value in bytes */
                   __u32         max_entries; /* maximum number of entries
                                                 in a map */
               };

               struct {    /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY
                              commands */
                   __u32         map_fd;
                   __aligned_u64 key;
                   union {
                       __aligned_u64 value;
                       __aligned_u64 next_key;
                   };
                   __u64         flags;
               };

               struct {    /* Used by BPF_PROG_LOAD */
                   __u32         prog_type;
                   __u32         insn_cnt;
                   __aligned_u64 insns;      /* 'const struct bpf_insn *' */
                   __aligned_u64 license;    /* 'const char *' */
                   __u32         log_level;  /* verbosity level of verifier */
                   __u32         log_size;   /* size of user buffer */
                   __aligned_u64 log_buf;    /* user supplied 'char *'
                                                buffer */
                   __u32         kern_version;
                                             /* checked when prog_type=kprobe
                                                (since Linux 4.1) */
               };
           } __attribute__((aligned(8)));

   eBPF maps
       Maps are a generic data structure for storage  of  different  types  of
       data.   They  allow  sharing  of data between eBPF kernel programs, and
       also between kernel and user-space applications.

       Each map type has the following attributes:

       •  type

       •  maximum number of elements

       •  key size in bytes

       •  value size in bytes

       The following wrapper functions demonstrate how various bpf()  commands
       can  be used to access the maps.  The functions use the cmd argument to
       invoke different operations.

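       The wrapper functions in this section call bpf() and ptr_to_u64()
       without defining them.  One plausible pair of definitions is sketched
       here; both bodies are assumptions made for illustration.  bpf() is
       taken to forward its arguments through syscall(2), and ptr_to_u64() is
       taken to convert a pointer into the 64-bit integer form used by the
       __aligned_u64 fields of bpf_attr.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed wrapper: forward the three arguments to the raw system call. */
static int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}

/* Assumed helper: widen a user-space pointer to the 64-bit integer
   representation used by the pointer-carrying fields of bpf_attr. */
static inline __u64
ptr_to_u64(const void *ptr)
{
    return (__u64) (uintptr_t) ptr;
}
```

       Converting through uintptr_t keeps the cast well defined on both
       32-bit and 64-bit systems.
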
       BPF_MAP_CREATE
              The BPF_MAP_CREATE command creates a new map,  returning  a  new
              file descriptor that refers to the map.

                  int
                  bpf_create_map(enum bpf_map_type map_type,
                                 unsigned int key_size,
                                 unsigned int value_size,
                                 unsigned int max_entries)
                  {
                      union bpf_attr attr = {
                          .map_type    = map_type,
                          .key_size    = key_size,
                          .value_size  = value_size,
                          .max_entries = max_entries
                      };

                      return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
                  }

              The  new  map has the type specified by map_type, and attributes
              as specified in key_size, value_size, and max_entries.  On  suc‐
              cess, this operation returns a file descriptor.  On error, -1 is
              returned and errno is set to EINVAL, EPERM, or ENOMEM.

              The key_size and value_size attributes will be used by the veri‐
              fier during program loading to check that the program is calling
              bpf_map_*_elem() helper functions with a  correctly  initialized
              key and to check that the program doesn't access the map element
              value beyond the specified value_size.  For example, when a  map
              is created with a key_size of 8 and the eBPF program calls

                  bpf_map_lookup_elem(map_fd, fp - 4)

              the  program  will be rejected, since the in-kernel helper func‐
              tion

                  bpf_map_lookup_elem(map_fd, void *key)

              expects to read 8 bytes from the location pointed to by key, but
              the  fp - 4  (where fp is the top of the stack) starting address
              will cause out-of-bounds stack access.

              Similarly, when a map is created with a value_size of 1 and  the
              eBPF program contains

                  value = bpf_map_lookup_elem(...);
                  *(u32 *) value = 1;

              the  program  will  be  rejected,  since  it  accesses the value
              pointer beyond the specified 1 byte value_size limit.

              Currently, the following values are supported for map_type:

                  enum bpf_map_type {
                      BPF_MAP_TYPE_UNSPEC,  /* Reserve 0 as invalid map type */
                      BPF_MAP_TYPE_HASH,
                      BPF_MAP_TYPE_ARRAY,
                      BPF_MAP_TYPE_PROG_ARRAY,
                      BPF_MAP_TYPE_PERF_EVENT_ARRAY,
                      BPF_MAP_TYPE_PERCPU_HASH,
                      BPF_MAP_TYPE_PERCPU_ARRAY,
                      BPF_MAP_TYPE_STACK_TRACE,
                      BPF_MAP_TYPE_CGROUP_ARRAY,
                      BPF_MAP_TYPE_LRU_HASH,
                      BPF_MAP_TYPE_LRU_PERCPU_HASH,
                      BPF_MAP_TYPE_LPM_TRIE,
                      BPF_MAP_TYPE_ARRAY_OF_MAPS,
                      BPF_MAP_TYPE_HASH_OF_MAPS,
                      BPF_MAP_TYPE_DEVMAP,
                      BPF_MAP_TYPE_SOCKMAP,
                      BPF_MAP_TYPE_CPUMAP,
                      BPF_MAP_TYPE_XSKMAP,
                      BPF_MAP_TYPE_SOCKHASH,
                      BPF_MAP_TYPE_CGROUP_STORAGE,
                      BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
                      BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
                      BPF_MAP_TYPE_QUEUE,
                      BPF_MAP_TYPE_STACK,
                      /* See /usr/include/linux/bpf.h for the full list. */
                  };

              map_type selects one of the available map implementations in the
              kernel.   For  all map types, eBPF programs access maps with the
              same  bpf_map_lookup_elem()  and  bpf_map_update_elem()   helper
              functions.   Further  details of the various map types are given
              below.

       BPF_MAP_LOOKUP_ELEM
              The BPF_MAP_LOOKUP_ELEM command looks up an element with a given
              key in the map referred to by the file descriptor fd.

                  int
                  bpf_lookup_elem(int fd, const void *key, void *value)
                  {
                      union bpf_attr attr = {
                          .map_fd = fd,
                          .key    = ptr_to_u64(key),
                          .value  = ptr_to_u64(value),
                      };

                      return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
                  }

              If  an  element  is found, the operation returns zero and stores
              the element's value into value, which must point to a buffer  of
              value_size bytes.

              If  no element is found, the operation returns -1 and sets errno
              to ENOENT.

       BPF_MAP_UPDATE_ELEM
              The BPF_MAP_UPDATE_ELEM command creates or  updates  an  element
              with  a  given  key/value in the map referred to by the file de‐
              scriptor fd.

                  int
                  bpf_update_elem(int fd, const void *key, const void *value,
                                  uint64_t flags)
                  {
                      union bpf_attr attr = {
                          .map_fd = fd,
                          .key    = ptr_to_u64(key),
                          .value  = ptr_to_u64(value),
                          .flags  = flags,
                      };

                      return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
                  }

              The flags argument should be specified as one of the following:

              BPF_ANY
                     Create a new element or update an existing element.

              BPF_NOEXIST
                     Create a new element only if it did not exist.

              BPF_EXIST
                     Update an existing element.

              On success, the operation returns zero.  On  error,  -1  is  re‐
              turned  and  errno  is  set  to EINVAL, EPERM, ENOMEM, or E2BIG.
              E2BIG indicates that the number of elements in the  map  reached
              the  max_entries  limit  specified at map creation time.  EEXIST
              will be returned if flags specifies BPF_NOEXIST and the  element
              with  key already exists in the map.  ENOENT will be returned if
              flags specifies BPF_EXIST and the element with key doesn't exist
              in the map.

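              The interaction between BPF_NOEXIST and EEXIST can be used to
              add an element without overwriting one that is already
              present.  The following sketch illustrates the idea; the
              function name insert_if_absent() and its return convention
              are hypothetical, and the call is made through syscall(2).

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: add key/value only if key is not yet present.
   Returns 1 if the element was created, 0 if it already existed
   (the stored value is left untouched), and -1 on any other error. */
static int
insert_if_absent(int fd, const void *key, const void *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key    = (__u64) (uintptr_t) key;
    attr.value  = (__u64) (uintptr_t) value;
    attr.flags  = BPF_NOEXIST;

    if (syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)) == 0)
        return 1;

    /* EEXIST is the only failure that means "already present". */
    return (errno == EEXIST) ? 0 : -1;
}
```
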
       BPF_MAP_DELETE_ELEM
              The BPF_MAP_DELETE_ELEM command deletes the element whose key is
              key from the map referred to by the file descriptor fd.

                  int
                  bpf_delete_elem(int fd, const void *key)
                  {
                      union bpf_attr attr = {
                          .map_fd = fd,
                          .key    = ptr_to_u64(key),
                      };

                      return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
                  }

              On success, zero is returned.  If the element is not  found,  -1
              is returned and errno is set to ENOENT.

       BPF_MAP_GET_NEXT_KEY
              The  BPF_MAP_GET_NEXT_KEY  command looks up an element by key in
              the map referred to by the  file  descriptor  fd  and  sets  the
              next_key pointer to the key of the next element.

                  int
                  bpf_get_next_key(int fd, const void *key, void *next_key)
                  {
                      union bpf_attr attr = {
                          .map_fd   = fd,
                          .key      = ptr_to_u64(key),
                          .next_key = ptr_to_u64(next_key),
                      };

                      return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
                  }

              If  key  is  found,  the  operation  returns  zero  and sets the
              next_key pointer to the key of the next element.  If key is  not
              found,  the operation returns zero and sets the next_key pointer
              to the key of the first element.  If key is the last element, -1
              is  returned  and  errno is set to ENOENT.  Other possible errno
              values are ENOMEM, EFAULT, EPERM, and EINVAL.  This  method  can
              be used to iterate over all elements in the map.

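              The iteration described above might look like the following
              sketch.  It assumes a map with four-byte keys and a caller-
              supplied starting key that is known not to be present in the
              map, so that the first call yields the first element.  The
              function name count_map_keys() is hypothetical.

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: count the elements of a map with 4-byte keys.
   absent_key must be a key that is not stored in the map.
   Returns the number of keys visited, or -1 on error. */
static long
count_map_keys(int fd, __u32 absent_key)
{
    union bpf_attr attr;
    __u32 key = absent_key, next_key;
    long count = 0;

    for (;;) {
        memset(&attr, 0, sizeof(attr));
        attr.map_fd   = fd;
        attr.key      = (__u64) (uintptr_t) &key;
        attr.next_key = (__u64) (uintptr_t) &next_key;

        if (syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, &attr,
                    sizeof(attr)) != 0)
            break;

        count++;
        key = next_key;     /* continue from the key just returned */
    }

    /* ENOENT marks the normal end of the walk; anything else is an error. */
    return (errno == ENOENT) ? count : -1;
}
```

              If elements are deleted while the walk is in progress, the
              current key may no longer be found, in which case the next
              call returns the first key again and elements can be visited
              more than once.
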
       close(map_fd)
              Delete  the map referred to by the file descriptor map_fd.  When
              the user-space program that created a map exits, all  maps  will
              be deleted automatically (but see NOTES).

   eBPF map types
       The following map types are supported:

       BPF_MAP_TYPE_HASH
              Hash-table maps have the following characteristics:

              •  Maps  are created and destroyed by user-space programs.  Both
                 user-space and eBPF programs can perform lookup, update,  and
                 delete operations.

              •  The  kernel  takes  care  of allocating and freeing key/value
                 pairs.

              •  The map_update_elem() helper will fail to insert a new element
                 when  the  max_entries  limit is reached.  (This ensures that
                 eBPF programs cannot exhaust memory.)

              •  map_update_elem() replaces existing elements atomically.

              Hash-table maps are optimized for speed of lookup.

       BPF_MAP_TYPE_ARRAY
              Array maps have the following characteristics:

              •  Optimized for fastest possible lookup.  In the future, the
                 verifier/JIT compiler may recognize lookup() operations that
                 employ a constant key and optimize them into a constant
                 pointer.  It is possible to optimize a non-constant key into
                 direct pointer arithmetic as well, since pointers and
                 value_size are constant for the life of the eBPF program.
                 In other words, array_map_lookup_elem() may be 'inlined' by
                 the verifier/JIT compiler while preserving concurrent access
                 to this map from user space.

              •  All array elements are preallocated and zero-initialized at
                 init time.

              •  The key is an array index, and must be exactly four bytes.

              •  map_delete_elem() fails with the error EINVAL, since elements
                 cannot be deleted.

              •  map_update_elem() replaces elements in a nonatomic fashion;
                 for atomic updates, a hash-table map should be used instead.
                 There is, however, one special case that can also be used
                 with arrays: the atomic built-in __sync_fetch_and_add() can
                 be used on 32-bit and 64-bit atomic counters.  For example,
                 it can be applied to the whole value itself if it represents
                 a single counter, or, in the case of a structure containing
                 multiple counters, it can be used on individual counters.
                 This is often useful for aggregation and accounting of
                 events.

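              The counter case can be illustrated in ordinary C.  In the
              following sketch, struct pkt_stats and account_packet() are
              hypothetical names; in an eBPF program, the pointer would be
              the value pointer obtained from a map lookup rather than a
              pointer to ordinary memory.

```c
#include <stdint.h>

/* Hypothetical layout of an array-map value holding two counters. */
struct pkt_stats {
    uint64_t packets;
    uint64_t bytes;
};

/* Atomically account for one packet of len bytes.  Each counter is
   updated on its own; the pair is not updated as a single unit. */
static void
account_packet(struct pkt_stats *stats, uint64_t len)
{
    __sync_fetch_and_add(&stats->packets, 1);
    __sync_fetch_and_add(&stats->bytes, len);
}
```

              Because each counter is incremented separately, a concurrent
              reader may observe the packet count already updated while the
              byte count is not.
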
              Among the uses for array maps are the following:

              •  As "global" eBPF variables: an array of 1 element  whose  key
                 is  (index) 0 and where the value is a collection of 'global'
                 variables which eBPF programs can use to keep  state  between
                 events.

              •  Aggregation of tracing events into a fixed set of buckets.

              •  Accounting of networking events, for example, number of pack‐
                 ets and packet sizes.

       BPF_MAP_TYPE_PROG_ARRAY (since Linux 4.2)
              A program array map is a special kind of  array  map  whose  map
              values  contain  only  file  descriptors referring to other eBPF
              programs.  Thus, both the key_size and value_size  must  be  ex‐
              actly  four  bytes.   This  map  is used in conjunction with the
              bpf_tail_call() helper.

              This means that an eBPF program with a  program  array  map  at‐
              tached to it can call from kernel side into

                  void bpf_tail_call(void *context, void *prog_map,
                                     unsigned int index);

              and therefore replace its own program flow with the one from the
              program at the given program array slot, if present.  This can
              be regarded as a kind of jump table to a different eBPF program.
              The invoked program will then reuse the same stack.  When a jump
              into the new program has been performed, it won't return to the
              old program anymore.

              If no eBPF program is found at the given index of the program
              array (because the map slot doesn't contain a valid program file
              descriptor, the specified lookup index/key is out of bounds, or
              the limit of 32 nested calls has been exceeded), execution
              continues with the current eBPF program.  This can be used as a
              fall-through for default cases.

              A program array map is useful, for example, in tracing or
              networking, to handle individual system calls or protocols in
              their own subprograms and use their identifiers as an individual
              map index.  This approach may result in performance benefits,
              and also makes it possible to overcome the maximum instruction
              limit of a single eBPF program.  In dynamic environments, a
              user-space daemon might atomically replace individual
              subprograms at run time with newer versions to alter overall
              program behavior, for instance, if global policies change.

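              From user space, a slot of a program array is filled with the
              BPF_MAP_UPDATE_ELEM command, using a four-byte index as the key
              and a four-byte program file descriptor as the value.  A
              minimal sketch follows; the function name prog_array_set() is
              hypothetical, and the call is made through syscall(2).

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: store the program referred to by prog_fd in
   slot index of the program array referred to by map_fd.
   Returns 0 on success, or -1 with errno set. */
static int
prog_array_set(int map_fd, __u32 index, int prog_fd)
{
    union bpf_attr attr;
    __u32 value = (__u32) prog_fd;    /* value_size is four bytes */

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key    = (__u64) (uintptr_t) &index;
    attr.value  = (__u64) (uintptr_t) &value;
    attr.flags  = BPF_ANY;            /* create or replace the slot */

    return (int) syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr,
                         sizeof(attr));
}
```

              Calling this helper again for the same index with a different
              program file descriptor is one way a user-space daemon could
              swap a subprogram at run time.
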
   eBPF programs
       The BPF_PROG_LOAD command is used to load an eBPF program into the ker‐
       nel.  The return value for this command is a new file descriptor  asso‐
       ciated with this eBPF program.

           char bpf_log_buf[LOG_BUF_SIZE];

           int
           bpf_prog_load(enum bpf_prog_type type,
                         const struct bpf_insn *insns, int insn_cnt,
                         const char *license)
           {
               union bpf_attr attr = {
                   .prog_type = type,
                   .insns     = ptr_to_u64(insns),
                   .insn_cnt  = insn_cnt,
                   .license   = ptr_to_u64(license),
                   .log_buf   = ptr_to_u64(bpf_log_buf),
                   .log_size  = LOG_BUF_SIZE,
                   .log_level = 1,
               };

               return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
           }

       prog_type is one of the available program types:

                  enum bpf_prog_type {
                      BPF_PROG_TYPE_UNSPEC,        /* Reserve 0 as invalid
                                                      program type */
                      BPF_PROG_TYPE_SOCKET_FILTER,
                      BPF_PROG_TYPE_KPROBE,
                      BPF_PROG_TYPE_SCHED_CLS,
                      BPF_PROG_TYPE_SCHED_ACT,
                      BPF_PROG_TYPE_TRACEPOINT,
                      BPF_PROG_TYPE_XDP,
                      BPF_PROG_TYPE_PERF_EVENT,
                      BPF_PROG_TYPE_CGROUP_SKB,
                      BPF_PROG_TYPE_CGROUP_SOCK,
                      BPF_PROG_TYPE_LWT_IN,
                      BPF_PROG_TYPE_LWT_OUT,
                      BPF_PROG_TYPE_LWT_XMIT,
                      BPF_PROG_TYPE_SOCK_OPS,
                      BPF_PROG_TYPE_SK_SKB,
                      BPF_PROG_TYPE_CGROUP_DEVICE,
                      BPF_PROG_TYPE_SK_MSG,
                      BPF_PROG_TYPE_RAW_TRACEPOINT,
                      BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
                      BPF_PROG_TYPE_LWT_SEG6LOCAL,
                      BPF_PROG_TYPE_LIRC_MODE2,
                      BPF_PROG_TYPE_SK_REUSEPORT,
                      BPF_PROG_TYPE_FLOW_DISSECTOR,
                      /* See /usr/include/linux/bpf.h for the full list. */
                  };

       For further details of eBPF program types, see below.

       The remaining fields of bpf_attr are set as follows:

       •  insns is an array of struct bpf_insn instructions.

       •  insn_cnt is the number of instructions in the program referred to by
          insns.

       •  license is a license string, which must be GPL  compatible  to  call
          helper functions marked gpl_only.  (The licensing rules are the same
          as for kernel modules, so that also dual  licenses,  such  as  "Dual
          BSD/GPL", may be used.)

       •  log_buf  is  a pointer to a caller-allocated buffer in which the in-
          kernel verifier can store the  verification  log.   This  log  is  a
          multi-line string that can be checked by the program author in order
          to understand how the verifier came to the conclusion that the  eBPF
          program  is unsafe.  The format of the output can change at any time
          as the verifier evolves.

       •  log_size is the size of the buffer pointed to by log_buf.  If the
          buffer is not large enough to store all verifier messages, -1 is
          returned and errno is set to ENOSPC.

       •  log_level is the verbosity level of the verifier.  A value of zero
          means that the verifier will not provide a log; in this case,
          log_buf must be a NULL pointer, and log_size must be zero.

       Applying close(2) to the file descriptor returned by BPF_PROG_LOAD will
       unload the eBPF program (but see NOTES).

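       To make the fields above concrete, the following sketch loads a
       two-instruction socket filter that sets register 0 to zero and
       exits.  The names trivial_prog, verifier_log, and
       load_trivial_filter() are hypothetical, the 4096-byte log buffer
       size is an arbitrary choice, and the call is made through
       syscall(2).

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Buffer that receives the verifier log (size chosen arbitrarily). */
static char verifier_log[4096];

/* "r0 = 0; exit": the smallest program that sets a return value. */
static const struct bpf_insn trivial_prog[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
    { .code = BPF_JMP | BPF_EXIT },
};

/* Hypothetical helper: load trivial_prog as a socket filter.
   Returns a new file descriptor, or -1 with errno set. */
static int
load_trivial_filter(void)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns     = (__u64) (uintptr_t) trivial_prog;
    attr.insn_cnt  = sizeof(trivial_prog) / sizeof(trivial_prog[0]);
    attr.license   = (__u64) (uintptr_t) "GPL";
    attr.log_buf   = (__u64) (uintptr_t) verifier_log;
    attr.log_size  = sizeof(verifier_log);
    attr.log_level = 1;

    return (int) syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

       If the load fails, verifier_log may contain the verifier's
       explanation, provided that the failure occurred during verification.
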
       Maps  are  accessible  from eBPF programs and are used to exchange data
       between eBPF programs and between eBPF  programs  and  user-space  pro‐
       grams.   For  example,  eBPF  programs can process various events (like
       kprobe, packets) and store their data into a map, and  user-space  pro‐
       grams  can  then  fetch data from the map.  Conversely, user-space pro‐
       grams can use a map as a configuration mechanism,  populating  the  map
       with values checked by the eBPF program, which then modifies its behav‐
       ior on the fly according to those values.

   eBPF program types
       The eBPF program type (prog_type) determines the subset of kernel
       helper functions that the program may call.  The program type also
       determines the program input (context), that is, the format of struct
       bpf_context (the data blob passed into the eBPF program as the first
       argument).

       For example, a tracing program does not have the exact same  subset  of
       helper  functions as a socket filter program (though they may have some
       helpers in common).  Similarly, the input (context) for a tracing  pro‐
       gram  is  a  set  of register values, while for a socket filter it is a
       network packet.

       The set of functions available to eBPF programs of a given type may in‐
       crease in the future.

       The following program types are supported:

591
592       BPF_PROG_TYPE_SOCKET_FILTER (since Linux 3.19)
593              Currently,  the set of functions for BPF_PROG_TYPE_SOCKET_FILTER
594              is:
595
596                  bpf_map_lookup_elem(map_fd, void *key)
597                                      /* look up key in a map_fd */
598                  bpf_map_update_elem(map_fd, void *key, void *value)
599                                      /* update key/value */
600                  bpf_map_delete_elem(map_fd, void *key)
601                                      /* delete key in a map_fd */
602
603              The bpf_context argument is a pointer to a struct __sk_buff.
604
605       BPF_PROG_TYPE_KPROBE (since Linux 4.1)
606              [To be documented]
607
608       BPF_PROG_TYPE_SCHED_CLS (since Linux 4.1)
609              [To be documented]
610
611       BPF_PROG_TYPE_SCHED_ACT (since Linux 4.1)
612              [To be documented]
613
   Events
       Once a program is loaded, it can be attached to an event.  Various ker‐
       nel subsystems have different ways to do so.

       Since Linux 3.19, the following call will attach the program prog_fd to
       the socket sockfd, which was created by an earlier call to socket(2):

           setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF,
                      &prog_fd, sizeof(prog_fd));

       Since Linux 4.1, the following call may be used to attach the eBPF pro‐
       gram  referred  to  by the file descriptor prog_fd to a perf event file
       descriptor,  event_fd,  that  was  created  by  a  previous   call   to
       perf_event_open(2):

           ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);

RETURN VALUE

       For a successful call, the return value depends on the operation:

       BPF_MAP_CREATE
              The new file descriptor associated with the eBPF map.

       BPF_PROG_LOAD
              The new file descriptor associated with the eBPF program.

       All other commands
              Zero.

       On error, -1 is returned, and errno is set to indicate the error.

ERRORS

       E2BIG  The eBPF program is too large or a map reached the max_entries
              limit (maximum number of elements).

       EACCES For BPF_PROG_LOAD, even though all program instructions are
              valid, the program has been rejected because it was deemed
              unsafe.  The verifier may reject a program because it could
              access a disallowed memory region or an uninitialized
              stack/register, because the function constraints don't match
              the actual types, or because of a misaligned memory access.
              In this case, it is recommended to call bpf() again with
              log_level = 1 and examine log_buf for the specific reason
              provided by the verifier.

       EBADF  fd is not an open file descriptor.

       EFAULT One of the pointers (key or value or log_buf or insns) is
              outside the accessible address space.

       EINVAL The value specified in cmd is not recognized by this kernel.

       EINVAL For BPF_MAP_CREATE, either map_type or attributes are invalid.

       EINVAL For BPF_MAP_*_ELEM commands, some of the fields of union
              bpf_attr that are not used by this command are not set to zero.

       EINVAL For BPF_PROG_LOAD, indicates an attempt to load an invalid
              program.  eBPF programs can be deemed invalid due to
              unrecognized instructions, the use of reserved fields, jumps
              out of range, infinite loops or calls of unknown functions.

       ENOENT For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM, indicates that
              the element with the given key was not found.

       ENOMEM Cannot allocate sufficient memory.

       EPERM  The call was made without sufficient privilege (without the
              CAP_SYS_ADMIN capability).
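
       The advice given for EACCES can be followed with a loader that
       always supplies a log buffer.  This is a minimal sketch under
       stated assumptions: load_prog_with_log() is a hypothetical name,
       the program type is fixed to BPF_PROG_TYPE_SOCKET_FILTER for
       brevity, and the system call is made through syscall(2).

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Load a socket-filter program with log_level = 1, so that the
 * verifier writes its messages into log (log_size bytes).
 * Returns a program file descriptor, or -1 with errno set. */
static int
load_prog_with_log(const struct bpf_insn *insns, uint32_t insn_cnt,
                   const char *license, char *log, uint32_t log_size)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns     = (uint64_t) (uintptr_t) insns;
    attr.insn_cnt  = insn_cnt;
    attr.license   = (uint64_t) (uintptr_t) license;
    attr.log_level = 1;
    attr.log_size  = log_size;
    attr.log_buf   = (uint64_t) (uintptr_t) log;

    log[0] = '\0';      /* stays empty if the verifier never runs */
    return (int) syscall(SYS_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

       When the call fails with EACCES or EINVAL, printing the log buffer
       alongside strerror(errno) usually shows which instruction the
       verifier objected to.  For failures that occur before verification,
       such as EPERM, the buffer can be expected to remain empty.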

STANDARDS

       Linux.

HISTORY

       Linux 3.18.

NOTES

       Prior to Linux 4.4, all bpf() commands required the caller to have the
       CAP_SYS_ADMIN capability.  From Linux 4.4 onwards, an unprivileged user
       may create limited programs of type BPF_PROG_TYPE_SOCKET_FILTER and
       associated maps.  However, they may not store kernel pointers within
       the maps and are presently limited to the following helper functions:

       •  get_random
       •  get_smp_processor_id
       •  tail_call
       •  ktime_get_ns

       Unprivileged access may be blocked by writing the value 1 to the file
       /proc/sys/kernel/unprivileged_bpf_disabled.

       eBPF objects (maps and programs) can be shared between processes.  For
       example, after fork(2), the child inherits file descriptors referring
       to the same eBPF objects.  In addition, file descriptors referring to
       eBPF objects can be transferred over UNIX domain sockets.  File
       descriptors referring to eBPF objects can be duplicated in the usual
       way, using dup(2) and similar calls.  An eBPF object is deallocated
       only after all file descriptors referring to the object have been
       closed.
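
       Transferring a descriptor over a UNIX domain socket uses the generic
       SCM_RIGHTS ancillary message, which is not specific to eBPF.  The
       following is a minimal sketch of that mechanism; send_fd() and
       recv_fd() are hypothetical helper names, and the descriptor passed
       in could refer to a map or program as well as to any other open
       file.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send the file descriptor fd over the connected UNIX domain socket
 * sock.  Returns 0 on success, or -1 on failure. */
static int
send_fd(int sock, int fd)
{
    char byte = 0;      /* one byte of ordinary data accompanies the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;           /* forces suitable alignment */
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor sent by send_fd().  Returns the new
 * descriptor, or -1 if none arrived. */
static int
recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

       The received descriptor is a new number in the receiving process
       that refers to the same underlying object, so the object stays
       allocated until both sides have closed their descriptors.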

       eBPF programs can be written in a restricted C that is compiled (using
       the clang compiler) into eBPF bytecode.  Various features are omitted
       from this restricted C, such as loops, global variables, variadic
       functions, floating-point numbers, and passing structures as function
       arguments.  Some examples can be found in the samples/bpf/*_kern.c
       files in the kernel source tree.

       The kernel contains a just-in-time (JIT) compiler that translates eBPF
       bytecode into native machine code for better performance.  Before Linux
       4.15, the JIT compiler is disabled by default, but its operation can be
       controlled by writing one of the following integer strings to the file
       /proc/sys/net/core/bpf_jit_enable:

       0      Disable JIT compilation (default).

       1      Normal compilation.

       2      Debugging mode.  The generated opcodes are dumped in hexadecimal
              into the kernel log.  These opcodes can then be disassembled
              using the program tools/net/bpf_jit_disasm.c provided in the
              kernel source tree.

       Since Linux 4.15, the kernel may be configured with the
       CONFIG_BPF_JIT_ALWAYS_ON option.  In this case, the JIT compiler is
       always enabled, and bpf_jit_enable is initialized to 1 and is
       immutable.  (This kernel configuration option was provided as a
       mitigation for one of the Spectre attacks against the BPF interpreter.)

       The JIT compiler for eBPF is currently available for the following
       architectures:

       •  x86-64 (since Linux 3.18; cBPF since Linux 3.0);
       •  ARM32 (since Linux 3.18; cBPF since Linux 3.4);
       •  SPARC 32 (since Linux 3.18; cBPF since Linux 3.5);
       •  ARM-64 (since Linux 3.18);
       •  s390 (since Linux 4.1; cBPF since Linux 3.7);
       •  PowerPC 64 (since Linux 4.8; cBPF since Linux 3.1);
       •  SPARC 64 (since Linux 4.12);
       •  x86-32 (since Linux 4.18);
       •  MIPS 64 (since Linux 4.18; cBPF since Linux 3.16);
       •  riscv (since Linux 5.1).
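
       The current JIT setting can be read back from the bpf_jit_enable
       file described above.  The following is a minimal sketch;
       read_sysctl_int() and jit_mode_name() are hypothetical helpers, and
       the returned strings merely paraphrase the values 0, 1, and 2
       listed in this section.

```c
#include <stdio.h>

/* Read an integer from a /proc/sys file.
 * Returns the value, or -1 if the file cannot be read or parsed. */
static int
read_sysctl_int(const char *path)
{
    FILE *fp = fopen(path, "r");
    int value;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &value) != 1)
        value = -1;
    fclose(fp);
    return value;
}

/* Map a bpf_jit_enable value to a short description. */
static const char *
jit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "JIT disabled";
    case 1:  return "normal compilation";
    case 2:  return "debugging mode";
    default: return "unknown";
    }
}
```

       For example, jit_mode_name(read_sysctl_int(
       "/proc/sys/net/core/bpf_jit_enable")) yields "unknown" when the
       file is missing or unreadable, since the reader then returns -1.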

EXAMPLES

       /* bpf+sockets example:
        * 1. create array map of 256 elements
        * 2. load program that counts number of packets received
        *    r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)]
        *    map[r0]++
        * 3. attach prog_fd to raw socket via setsockopt()
        * 4. print number of received TCP/UDP packets every second
        *
        * bpf_create_map(), bpf_prog_load() and bpf_lookup_elem() are the
        * wrapper functions shown earlier in this page.  The BPF_*()
        * instruction macros and open_raw_sock() are not defined by any
        * header included here and must be supplied separately.
        */
       #include <arpa/inet.h>
       #include <assert.h>
       #include <errno.h>
       #include <linux/bpf.h>
       #include <linux/if_ether.h>
       #include <linux/ip.h>
       #include <stddef.h>
       #include <stdio.h>
       #include <string.h>
       #include <sys/socket.h>
       #include <unistd.h>

       int
       main(int argc, char *argv[])
       {
           int sock, map_fd, prog_fd, key;
           long long value = 0, tcp_cnt, udp_cnt;

           map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key),
                                   sizeof(value), 256);
           if (map_fd < 0) {
               printf("failed to create map '%s'\n", strerror(errno));
               /* likely not run as root */
               return 1;
           }

           struct bpf_insn prog[] = {
               BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),        /* r6 = r1 */
               BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)),
                                       /* r0 = ip->proto */
               BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
                                       /* *(u32 *)(fp - 4) = r0 */
               BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),       /* r2 = fp */
               BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),      /* r2 = r2 - 4 */
               BPF_LD_MAP_FD(BPF_REG_1, map_fd),           /* r1 = map_fd */
               BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
                                       /* r0 = map_lookup(r1, r2) */
               BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
                                       /* if (r0 == 0) goto pc+2 */
               BPF_MOV64_IMM(BPF_REG_1, 1),                /* r1 = 1 */
               BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
                                       /* lock *(u64 *) r0 += r1 */
               BPF_MOV64_IMM(BPF_REG_0, 0),                /* r0 = 0 */
               BPF_EXIT_INSN(),                            /* return r0 */
           };

           prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog,
                                   sizeof(prog) / sizeof(prog[0]), "GPL");
           if (prog_fd < 0) {
               printf("failed to load prog '%s'\n", strerror(errno));
               return 1;
           }

           sock = open_raw_sock("lo");

           assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
                             sizeof(prog_fd)) == 0);

           for (;;) {
               key = IPPROTO_TCP;
               assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
               key = IPPROTO_UDP;
               assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
               printf("TCP %lld UDP %lld packets\n", tcp_cnt, udp_cnt);
               sleep(1);
           }

           return 0;
       }

       Some complete working code can be found in the samples/bpf directory in
       the kernel source tree.
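
       The example above calls open_raw_sock(), which this page does not
       define.  The following is a minimal sketch of what such a helper
       might look like, assuming it is meant to open a packet socket bound
       to the named interface; the implementation is an illustration and
       is not taken from this page.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a nonblocking packet socket bound to the interface ifname.
 * Returns the socket file descriptor, or -1 with errno set. */
static int
open_raw_sock(const char *ifname)
{
    struct sockaddr_ll sll;
    int sock, saved_errno;

    sock = socket(AF_PACKET, SOCK_RAW | SOCK_NONBLOCK | SOCK_CLOEXEC,
                  htons(ETH_P_ALL));
    if (sock < 0)
        return -1;      /* typically EPERM without CAP_NET_RAW */

    memset(&sll, 0, sizeof(sll));
    sll.sll_family   = AF_PACKET;
    sll.sll_ifindex  = (int) if_nametoindex(ifname);
    sll.sll_protocol = htons(ETH_P_ALL);

    if (sll.sll_ifindex == 0 ||
        bind(sock, (struct sockaddr *) &sll, sizeof(sll)) < 0) {
        saved_errno = errno;    /* close() must not clobber the cause */
        close(sock);
        errno = saved_errno;
        return -1;
    }

    return sock;
}
```

       Because the helper can fail, a caller such as the example program
       would do well to check for a negative return value before passing
       the descriptor to setsockopt(2).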