bpf(2) System Calls Manual bpf(2)

6 bpf - perform a command on an extended BPF map or program
7
9 #include <linux/bpf.h>
10
11 int bpf(int cmd, union bpf_attr *attr, unsigned int size);
12
14 The bpf() system call performs a range of operations related to ex‐
15 tended Berkeley Packet Filters. Extended BPF (or eBPF) is similar to
16 the original ("classic") BPF (cBPF) used to filter network packets.
17 For both cBPF and eBPF programs, the kernel statically analyzes the
18 programs before loading them, in order to ensure that they cannot harm
19 the running system.
20
21 eBPF extends cBPF in multiple ways, including the ability to call a
22 fixed set of in-kernel helper functions (via the BPF_CALL opcode exten‐
23 sion provided by eBPF) and access shared data structures such as eBPF
24 maps.
25
26 Extended BPF Design/Architecture
27 eBPF maps are a generic data structure for storage of different data
28 types. Data types are generally treated as binary blobs, so a user
29 just specifies the size of the key and the size of the value at map-
30 creation time. In other words, a key/value for a given map can have an
31 arbitrary structure.
32
33 A user process can create multiple maps (with key/value-pairs being
34 opaque bytes of data) and access them via file descriptors. Different
35 eBPF programs can access the same maps in parallel. It's up to the
36 user process and eBPF program to decide what they store inside maps.
37
38 There's one special map type, called a program array. This type of map
39 stores file descriptors referring to other eBPF programs. When a
40 lookup in the map is performed, the program flow is redirected in-place
41 to the beginning of another eBPF program and does not return back to
42 the calling program. The level of nesting has a fixed limit of 32, so
43 that infinite loops cannot be crafted. At run time, the program file
44 descriptors stored in the map can be modified, so program functionality
45 can be altered based on specific requirements. All programs referred
46 to in a program-array map must have been previously loaded into the
47 kernel via bpf(). If a map lookup fails, the current program continues
48 its execution. See BPF_MAP_TYPE_PROG_ARRAY below for further details.
49
50 Generally, eBPF programs are loaded by the user process and automati‐
51 cally unloaded when the process exits. In some cases, for example,
52 tc-bpf(8), the program will continue to stay alive inside the kernel
53 even after the process that loaded the program exits. In that case,
54 the tc subsystem holds a reference to the eBPF program after the file
55 descriptor has been closed by the user-space program. Thus, whether a
56 specific program continues to live inside the kernel depends on how it
57 is further attached to a given kernel subsystem after it was loaded via
58 bpf().
59
60 Each eBPF program is a set of instructions that is safe to run until
61 its completion. An in-kernel verifier statically determines that the
62 eBPF program terminates and is safe to execute. During verification,
63 the kernel increments reference counts for each of the maps that the
64 eBPF program uses, so that the attached maps can't be removed until the
65 program is unloaded.
66
67 eBPF programs can be attached to different events. These events can be
68 the arrival of network packets, tracing events, classification events
69 by network queueing disciplines (for eBPF programs attached to a tc(8)
70 classifier), and other types that may be added in the future. A new
71 event triggers execution of the eBPF program, which may store informa‐
72 tion about the event in eBPF maps. Beyond storing data, eBPF programs
73 may call a fixed set of in-kernel helper functions.
74
75 The same eBPF program can be attached to multiple events and different
76 eBPF programs can access the same map:
77
78 tracing tracing tracing packet packet packet
79 event A event B event C on eth0 on eth1 on eth2
80 | | | | | ^
81 | | | | v |
82 --> tracing <-- tracing socket tc ingress tc egress
83 prog_1 prog_2 prog_3 classifier action
84 | | | | prog_4 prog_5
85 |--- -----| |------| map_3 | |
86 map_1 map_2 --| map_4 |--
87
88 Arguments
89 The operation to be performed by the bpf() system call is determined by
90 the cmd argument. Each operation takes an accompanying argument, pro‐
91 vided via attr, which is a pointer to a union of type bpf_attr (see be‐
92 low). The unused fields and padding must be zeroed out before the
93 call. The size argument is the size of the union pointed to by attr.
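
The wrapper functions shown on this page call bpf() and a ptr_to_u64() helper without defining either. A minimal sketch of both follows. It assumes a Linux system whose headers define __NR_bpf; glibc provides no bpf() wrapper, so the call goes through syscall(2). The function bodies are illustrative and are not taken from this page.

```c
#define _GNU_SOURCE
#include <linux/bpf.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc exports no bpf() symbol, so invoke the system call directly. */
static inline int bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}

/* Pointers travel in __aligned_u64 fields so that 32-bit and 64-bit
   user space share one layout; widen through uintptr_t. */
static inline uint64_t ptr_to_u64(const void *ptr)
{
    return (uint64_t) (uintptr_t) ptr;
}
```

Callers would typically clear the whole union with memset(3) before setting the fields of the member they use, which satisfies the zeroing requirement above.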
94
95 The value provided in cmd is one of the following:
96
97 BPF_MAP_CREATE
98 Create a map and return a file descriptor that refers to the
99 map. The close-on-exec file descriptor flag (see fcntl(2)) is
100 automatically enabled for the new file descriptor.
101
102 BPF_MAP_LOOKUP_ELEM
103 Look up an element by key in a specified map and return its
104 value.
105
106 BPF_MAP_UPDATE_ELEM
107 Create or update an element (key/value pair) in a specified map.
108
109 BPF_MAP_DELETE_ELEM
110 Look up and delete an element by key in a specified map.
111
112 BPF_MAP_GET_NEXT_KEY
113 Look up an element by key in a specified map and return the key
114 of the next element.
115
116 BPF_PROG_LOAD
117 Verify and load an eBPF program, returning a new file descriptor
118 associated with the program. The close-on-exec file descriptor
119 flag (see fcntl(2)) is automatically enabled for the new file
120 descriptor.
121
122 The bpf_attr union consists of various anonymous structures that
123 are used by different bpf() commands:
124
125 union bpf_attr {
126 struct { /* Used by BPF_MAP_CREATE */
127 __u32 map_type;
128 __u32 key_size; /* size of key in bytes */
129 __u32 value_size; /* size of value in bytes */
130 __u32 max_entries; /* maximum number of entries
131 in a map */
132 };
133
134 struct { /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY
135 commands */
136 __u32 map_fd;
137 __aligned_u64 key;
138 union {
139 __aligned_u64 value;
140 __aligned_u64 next_key;
141 };
142 __u64 flags;
143 };
144
145 struct { /* Used by BPF_PROG_LOAD */
146 __u32 prog_type;
147 __u32 insn_cnt;
148 __aligned_u64 insns; /* 'const struct bpf_insn *' */
149 __aligned_u64 license; /* 'const char *' */
150 __u32 log_level; /* verbosity level of verifier */
151 __u32 log_size; /* size of user buffer */
152 __aligned_u64 log_buf; /* user supplied 'char *'
153 buffer */
154 __u32 kern_version;
155 /* checked when prog_type=kprobe
156 (since Linux 4.1) */
157 };
158 } __attribute__((aligned(8)));
159
160 eBPF maps
161 Maps are a generic data structure for storage of different types of
162 data. They allow sharing of data between eBPF kernel programs, and
163 also between kernel and user-space applications.
164
165 Each map type has the following attributes:
166
167 • type
168
169 • maximum number of elements
170
171 • key size in bytes
172
173 • value size in bytes
174
175 The following wrapper functions demonstrate how various bpf() commands
176 can be used to access the maps. The functions use the cmd argument to
177 invoke different operations.
178
179 BPF_MAP_CREATE
180 The BPF_MAP_CREATE command creates a new map, returning a new
181 file descriptor that refers to the map.
182
183 int
184 bpf_create_map(enum bpf_map_type map_type,
185 unsigned int key_size,
186 unsigned int value_size,
187 unsigned int max_entries)
188 {
189 union bpf_attr attr = {
190 .map_type = map_type,
191 .key_size = key_size,
192 .value_size = value_size,
193 .max_entries = max_entries
194 };
195
196 return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
197 }
198
199 The new map has the type specified by map_type, and attributes
200 as specified in key_size, value_size, and max_entries. On suc‐
201 cess, this operation returns a file descriptor. On error, -1 is
202 returned and errno is set to EINVAL, EPERM, or ENOMEM.
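
As an illustration of map creation with error reporting, the sketch below creates a hash map whose key and value are small structures. The names flow_key, flow_val, fill_create_attr(), and create_flow_map() are hypothetical; the kernel is told only the sizes of the two structures.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/bpf.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical key and value layouts; the kernel sees only their sizes. */
struct flow_key { __u32 src_ip; __u32 dst_ip; };
struct flow_val { __u64 packets; __u64 bytes; };

/* Fill the BPF_MAP_CREATE member of a fully zeroed bpf_attr. */
void fill_create_attr(union bpf_attr *attr, __u32 max_entries)
{
    memset(attr, 0, sizeof(*attr));
    attr->map_type = BPF_MAP_TYPE_HASH;
    attr->key_size = sizeof(struct flow_key);
    attr->value_size = sizeof(struct flow_val);
    attr->max_entries = max_entries;
}

/* Returns a map file descriptor, or -1 after printing the reason. */
int create_flow_map(__u32 max_entries)
{
    union bpf_attr attr;
    int fd;

    fill_create_attr(&attr, max_entries);
    fd = (int) syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
    if (fd < 0)
        fprintf(stderr, "BPF_MAP_CREATE: %s\n", strerror(errno));
    return fd;
}
```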
203
204 The key_size and value_size attributes will be used by the veri‐
205 fier during program loading to check that the program is calling
206 bpf_map_*_elem() helper functions with a correctly initialized
207 key and to check that the program doesn't access the map element
208 value beyond the specified value_size. For example, when a map
209 is created with a key_size of 8 and the eBPF program calls
210
211 bpf_map_lookup_elem(map_fd, fp - 4)
212
213 the program will be rejected, since the in-kernel helper func‐
214 tion
215
216 bpf_map_lookup_elem(map_fd, void *key)
217
218 expects to read 8 bytes from the location pointed to by key, but
219 the fp - 4 (where fp is the top of the stack) starting address
220 will cause out-of-bounds stack access.
221
222 Similarly, when a map is created with a value_size of 1 and the
223 eBPF program contains
224
225 value = bpf_map_lookup_elem(...);
226 *(u32 *) value = 1;
227
228 the program will be rejected, since it accesses the value
229 pointer beyond the specified 1 byte value_size limit.
230
231 Currently, the following values are supported for map_type:
232
233 enum bpf_map_type {
234 BPF_MAP_TYPE_UNSPEC, /* Reserve 0 as invalid map type */
235 BPF_MAP_TYPE_HASH,
236 BPF_MAP_TYPE_ARRAY,
237 BPF_MAP_TYPE_PROG_ARRAY,
238 BPF_MAP_TYPE_PERF_EVENT_ARRAY,
239 BPF_MAP_TYPE_PERCPU_HASH,
240 BPF_MAP_TYPE_PERCPU_ARRAY,
241 BPF_MAP_TYPE_STACK_TRACE,
242 BPF_MAP_TYPE_CGROUP_ARRAY,
243 BPF_MAP_TYPE_LRU_HASH,
244 BPF_MAP_TYPE_LRU_PERCPU_HASH,
245 BPF_MAP_TYPE_LPM_TRIE,
246 BPF_MAP_TYPE_ARRAY_OF_MAPS,
247 BPF_MAP_TYPE_HASH_OF_MAPS,
248 BPF_MAP_TYPE_DEVMAP,
249 BPF_MAP_TYPE_SOCKMAP,
250 BPF_MAP_TYPE_CPUMAP,
251 BPF_MAP_TYPE_XSKMAP,
252 BPF_MAP_TYPE_SOCKHASH,
253 BPF_MAP_TYPE_CGROUP_STORAGE,
254 BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
255 BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
256 BPF_MAP_TYPE_QUEUE,
257 BPF_MAP_TYPE_STACK,
258 /* See /usr/include/linux/bpf.h for the full list. */
259 };
260
261 map_type selects one of the available map implementations in the
262 kernel. For all map types, eBPF programs access maps with the
263 same bpf_map_lookup_elem() and bpf_map_update_elem() helper
264 functions. Further details of the various map types are given
265 below.
266
267 BPF_MAP_LOOKUP_ELEM
268 The BPF_MAP_LOOKUP_ELEM command looks up an element with a given
269 key in the map referred to by the file descriptor fd.
270
271 int
272 bpf_lookup_elem(int fd, const void *key, void *value)
273 {
274 union bpf_attr attr = {
275 .map_fd = fd,
276 .key = ptr_to_u64(key),
277 .value = ptr_to_u64(value),
278 };
279
280 return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
281 }
282
283 If an element is found, the operation returns zero and stores
284 the element's value into value, which must point to a buffer of
285 value_size bytes.
286
287 If no element is found, the operation returns -1 and sets errno
288 to ENOENT.
289
290 BPF_MAP_UPDATE_ELEM
291 The BPF_MAP_UPDATE_ELEM command creates or updates an element
292 with a given key/value in the map referred to by the file de‐
293 scriptor fd.
294
295 int
296 bpf_update_elem(int fd, const void *key, const void *value,
297 uint64_t flags)
298 {
299 union bpf_attr attr = {
300 .map_fd = fd,
301 .key = ptr_to_u64(key),
302 .value = ptr_to_u64(value),
303 .flags = flags,
304 };
305
306 return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
307 }
308
309 The flags argument should be specified as one of the following:
310
311 BPF_ANY
312 Create a new element or update an existing element.
313
314 BPF_NOEXIST
315 Create a new element only if it did not exist.
316
317 BPF_EXIST
318 Update an existing element.
319
320 On success, the operation returns zero. On error, -1 is re‐
321 turned and errno is set to EINVAL, EPERM, ENOMEM, or E2BIG.
322 E2BIG indicates that the number of elements in the map reached
323 the max_entries limit specified at map creation time. EEXIST
324 will be returned if flags specifies BPF_NOEXIST and the element
325 with key already exists in the map. ENOENT will be returned if
326 flags specifies BPF_EXIST and the element with key doesn't exist
327 in the map.
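
The flags can be used to build an "insert only if absent" operation. The following sketch (insert_if_absent() is a hypothetical name) passes BPF_NOEXIST and treats EEXIST as a normal outcome rather than a failure.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if the element was created, 0 if the key already existed,
   and -1 on any other error (errno is left set by the system call). */
int insert_if_absent(int map_fd, const void *key, const void *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (__u32) map_fd;
    attr.key = (__u64) (uintptr_t) key;
    attr.value = (__u64) (uintptr_t) value;
    attr.flags = BPF_NOEXIST;

    if (syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)) == 0)
        return 1;
    return errno == EEXIST ? 0 : -1;
}
```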
328
329 BPF_MAP_DELETE_ELEM
330 The BPF_MAP_DELETE_ELEM command deletes the element whose key is
331 key from the map referred to by the file descriptor fd.
332
333 int
334 bpf_delete_elem(int fd, const void *key)
335 {
336 union bpf_attr attr = {
337 .map_fd = fd,
338 .key = ptr_to_u64(key),
339 };
340
341 return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
342 }
343
344 On success, zero is returned. If the element is not found, -1
345 is returned and errno is set to ENOENT.
346
347 BPF_MAP_GET_NEXT_KEY
348 The BPF_MAP_GET_NEXT_KEY command looks up an element by key in
349 the map referred to by the file descriptor fd and sets the
350 next_key pointer to the key of the next element.
351
352 int
353 bpf_get_next_key(int fd, const void *key, void *next_key)
354 {
355 union bpf_attr attr = {
356 .map_fd = fd,
357 .key = ptr_to_u64(key),
358 .next_key = ptr_to_u64(next_key),
359 };
360
361 return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
362 }
363
364 If key is found, the operation returns zero and sets the
365 next_key pointer to the key of the next element. If key is not
366 found, the operation returns zero and sets the next_key pointer
367 to the key of the first element. If key is the last element, -1
368 is returned and errno is set to ENOENT. Other possible errno
369 values are ENOMEM, EFAULT, EPERM, and EINVAL. This method can
370 be used to iterate over all elements in the map.
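
A sketch of such an iteration follows. It counts the elements of a map whose keys are four bytes wide. The function name count_u32_keys() is hypothetical, and the sketch assumes a kernel (Linux 4.12 or later) that accepts a NULL key as a request for the first key; on older kernels, a key known to be absent from the map would be passed instead, as described above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns the number of keys in the map, or -1 on error. */
long count_u32_keys(int map_fd)
{
    __u32 key = 0, next_key = 0;
    const void *cur = NULL;     /* assumption: NULL requests the first key */
    long n = 0;

    for (;;) {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.map_fd = (__u32) map_fd;
        attr.key = (__u64) (uintptr_t) cur;
        attr.next_key = (__u64) (uintptr_t) &next_key;

        if (syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, &attr,
                    sizeof(attr)) != 0)
            return errno == ENOENT ? n : -1;   /* ENOENT: past last key */

        key = next_key;         /* continue from the key just returned */
        cur = &key;
        n++;
    }
}
```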
371
372 close(map_fd)
373 Delete the map referred to by the file descriptor map_fd. When
374 the user-space program that created a map exits, all maps will
375 be deleted automatically (but see NOTES).
376
377 eBPF map types
378 The following map types are supported:
379
380 BPF_MAP_TYPE_HASH
381 Hash-table maps have the following characteristics:
382
383 • Maps are created and destroyed by user-space programs. Both
384 user-space and eBPF programs can perform lookup, update, and
385 delete operations.
386
387 • The kernel takes care of allocating and freeing key/value
388 pairs.
389
• The map_update_elem() helper will fail to insert a new element
when the max_entries limit is reached. (This ensures that
eBPF programs cannot exhaust memory.)
393
394 • map_update_elem() replaces existing elements atomically.
395
396 Hash-table maps are optimized for speed of lookup.
397
398 BPF_MAP_TYPE_ARRAY
399 Array maps have the following characteristics:
400
• Optimized for fastest possible lookup. In the future the
verifier/JIT compiler may recognize lookup() operations that
employ a constant key and optimize them into a constant pointer.
It is possible to optimize a non-constant key into direct
pointer arithmetic as well, since pointers and value_size are
constant for the life of the eBPF program. In other words,
array_map_lookup_elem() may be 'inlined' by the verifier/JIT
compiler while preserving concurrent access to this map from
user space.
410
• All array elements are preallocated and zero-initialized at init
time.
413
414 • The key is an array index, and must be exactly four bytes.
415
416 • map_delete_elem() fails with the error EINVAL, since elements
417 cannot be deleted.
418
419 • map_update_elem() replaces elements in a nonatomic fashion;
420 for atomic updates, a hash-table map should be used instead.
421 There is however one special case that can also be used with
422 arrays: the atomic built-in __sync_fetch_and_add() can be
423 used on 32 and 64 bit atomic counters. For example, it can
424 be applied on the whole value itself if it represents a sin‐
425 gle counter, or in case of a structure containing multiple
426 counters, it could be used on individual counters. This is
427 quite often useful for aggregation and accounting of events.
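
The per-counter use of __sync_fetch_and_add() can be illustrated in plain user-space C, where the same compiler built-in is available. In the sketch below, pkt_stats and account_packet() are hypothetical names; inside an eBPF program, the pointer would be the value pointer returned by bpf_map_lookup_elem() rather than the address of a local structure.

```c
#include <stdint.h>

/* Hypothetical array-map value holding two independent counters. */
struct pkt_stats {
    uint64_t packets;
    uint64_t bytes;
};

/* Each field is updated by its own atomic add; the two updates are
   not atomic with respect to each other. */
void account_packet(struct pkt_stats *st, uint64_t len)
{
    __sync_fetch_and_add(&st->packets, 1);
    __sync_fetch_and_add(&st->bytes, len);
}
```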
428
429 Among the uses for array maps are the following:
430
431 • As "global" eBPF variables: an array of 1 element whose key
432 is (index) 0 and where the value is a collection of 'global'
433 variables which eBPF programs can use to keep state between
434 events.
435
436 • Aggregation of tracing events into a fixed set of buckets.
437
438 • Accounting of networking events, for example, number of pack‐
439 ets and packet sizes.
440
441 BPF_MAP_TYPE_PROG_ARRAY (since Linux 4.2)
442 A program array map is a special kind of array map whose map
443 values contain only file descriptors referring to other eBPF
444 programs. Thus, both the key_size and value_size must be ex‐
445 actly four bytes. This map is used in conjunction with the
446 bpf_tail_call() helper.
447
448 This means that an eBPF program with a program array map at‐
449 tached to it can call from kernel side into
450
451 void bpf_tail_call(void *context, void *prog_map,
452 unsigned int index);
453
454 and therefore replace its own program flow with the one from the
455 program at the given program array slot, if present. This can
456 be regarded as kind of a jump table to a different eBPF program.
457 The invoked program will then reuse the same stack. When a jump
458 into the new program has been performed, it won't return to the
459 old program anymore.
460
If no eBPF program is found at the given index of the program
array (because the map slot doesn't contain a valid program file
descriptor, the specified lookup index/key is out of bounds, or
the limit of 32 nested calls has been exceeded), execution
continues with the current eBPF program. This can be used as a
fall-through for default cases.
467
468 A program array map is useful, for example, in tracing or net‐
469 working, to handle individual system calls or protocols in their
470 own subprograms and use their identifiers as an individual map
471 index. This approach may result in performance benefits, and
472 also makes it possible to overcome the maximum instruction limit
473 of a single eBPF program. In dynamic environments, a user-space
474 daemon might atomically replace individual subprograms at run-
475 time with newer versions to alter overall program behavior, for
476 instance, if global policies change.
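
From user space, a slot of a program array is filled with an ordinary BPF_MAP_UPDATE_ELEM operation whose value is a program file descriptor. A minimal sketch, in which prog_array_set() is a hypothetical name:

```c
#define _GNU_SOURCE
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Store prog_fd in slot 'index' of a BPF_MAP_TYPE_PROG_ARRAY map.
   Key and value are both four bytes wide, as required for this
   map type.  Returns 0 on success, -1 on error. */
int prog_array_set(int map_fd, __u32 index, int prog_fd)
{
    __u32 value = (__u32) prog_fd;
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (__u32) map_fd;
    attr.key = (__u64) (uintptr_t) &index;
    attr.value = (__u64) (uintptr_t) &value;
    attr.flags = BPF_ANY;

    return (int) syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr,
                         sizeof(attr));
}
```

A daemon that replaces a subprogram at run time, as described above, would load the new program first and then call such a function with the same index.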
477
478 eBPF programs
479 The BPF_PROG_LOAD command is used to load an eBPF program into the ker‐
480 nel. The return value for this command is a new file descriptor asso‐
481 ciated with this eBPF program.
482
483 char bpf_log_buf[LOG_BUF_SIZE];
484
485 int
486 bpf_prog_load(enum bpf_prog_type type,
487 const struct bpf_insn *insns, int insn_cnt,
488 const char *license)
489 {
490 union bpf_attr attr = {
491 .prog_type = type,
492 .insns = ptr_to_u64(insns),
493 .insn_cnt = insn_cnt,
494 .license = ptr_to_u64(license),
495 .log_buf = ptr_to_u64(bpf_log_buf),
496 .log_size = LOG_BUF_SIZE,
497 .log_level = 1,
498 };
499
500 return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
501 }
502
503 prog_type is one of the available program types:
504
505 enum bpf_prog_type {
506 BPF_PROG_TYPE_UNSPEC, /* Reserve 0 as invalid
507 program type */
508 BPF_PROG_TYPE_SOCKET_FILTER,
509 BPF_PROG_TYPE_KPROBE,
510 BPF_PROG_TYPE_SCHED_CLS,
511 BPF_PROG_TYPE_SCHED_ACT,
512 BPF_PROG_TYPE_TRACEPOINT,
513 BPF_PROG_TYPE_XDP,
514 BPF_PROG_TYPE_PERF_EVENT,
515 BPF_PROG_TYPE_CGROUP_SKB,
516 BPF_PROG_TYPE_CGROUP_SOCK,
517 BPF_PROG_TYPE_LWT_IN,
518 BPF_PROG_TYPE_LWT_OUT,
519 BPF_PROG_TYPE_LWT_XMIT,
520 BPF_PROG_TYPE_SOCK_OPS,
521 BPF_PROG_TYPE_SK_SKB,
522 BPF_PROG_TYPE_CGROUP_DEVICE,
523 BPF_PROG_TYPE_SK_MSG,
524 BPF_PROG_TYPE_RAW_TRACEPOINT,
525 BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
526 BPF_PROG_TYPE_LWT_SEG6LOCAL,
527 BPF_PROG_TYPE_LIRC_MODE2,
528 BPF_PROG_TYPE_SK_REUSEPORT,
529 BPF_PROG_TYPE_FLOW_DISSECTOR,
530 /* See /usr/include/linux/bpf.h for the full list. */
531 };
532
533 For further details of eBPF program types, see below.
534
535 The remaining fields of bpf_attr are set as follows:
536
537 • insns is an array of struct bpf_insn instructions.
538
539 • insn_cnt is the number of instructions in the program referred to by
540 insns.
541
542 • license is a license string, which must be GPL compatible to call
543 helper functions marked gpl_only. (The licensing rules are the same
544 as for kernel modules, so that also dual licenses, such as "Dual
545 BSD/GPL", may be used.)
546
547 • log_buf is a pointer to a caller-allocated buffer in which the in-
548 kernel verifier can store the verification log. This log is a
549 multi-line string that can be checked by the program author in order
550 to understand how the verifier came to the conclusion that the eBPF
551 program is unsafe. The format of the output can change at any time
552 as the verifier evolves.
553
• log_size is the size of the buffer pointed to by log_buf. If the
buffer is not large enough to store all verifier messages, -1 is
returned and errno is set to ENOSPC.

• log_level is the verbosity level of the verifier. A value of zero
means that the verifier will not provide a log; in this case, log_buf
must be a NULL pointer, and log_size must be zero.
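
The following sketch shows how these fields fit together when loading the smallest useful program ("r0 = 0; exit"). The names trivial_prog and load_trivial_filter() are hypothetical. The instructions are written as plain struct bpf_insn initializers using only constants from <linux/bpf.h>, since the BPF_MOV64_IMM()-style macros used in this page's example program come from the kernel source tree rather than from that header.

```c
#define _GNU_SOURCE
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Two-instruction program: r0 = 0; exit. */
static const struct bpf_insn trivial_prog[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
    { .code = BPF_JMP | BPF_EXIT },
};

/* Returns a program file descriptor, or -1 on error.  If 'log' is
   non-NULL, the verifier log is requested at log_level 1; otherwise
   log_buf, log_size, and log_level all stay zero. */
int load_trivial_filter(char *log, unsigned int log_size)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns = (__u64) (uintptr_t) trivial_prog;
    attr.insn_cnt = sizeof(trivial_prog) / sizeof(trivial_prog[0]);
    attr.license = (__u64) (uintptr_t) "GPL";

    if (log != NULL && log_size > 0) {
        log[0] = '\0';
        attr.log_buf = (__u64) (uintptr_t) log;
        attr.log_size = log_size;
        attr.log_level = 1;
    }

    return (int) syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

On failure, a caller would print the contents of the log buffer alongside strerror(errno) to see why the verifier rejected the program.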
561
562 Applying close(2) to the file descriptor returned by BPF_PROG_LOAD will
563 unload the eBPF program (but see NOTES).
564
565 Maps are accessible from eBPF programs and are used to exchange data
566 between eBPF programs and between eBPF programs and user-space pro‐
567 grams. For example, eBPF programs can process various events (like
568 kprobe, packets) and store their data into a map, and user-space pro‐
569 grams can then fetch data from the map. Conversely, user-space pro‐
570 grams can use a map as a configuration mechanism, populating the map
571 with values checked by the eBPF program, which then modifies its behav‐
572 ior on the fly according to those values.
573
574 eBPF program types
575 The eBPF program type (prog_type) determines the subset of kernel
576 helper functions that the program may call. The program type also de‐
577 termines the program input (context)—the format of struct bpf_context
578 (which is the data blob passed into the eBPF program as the first argu‐
579 ment).
580
581 For example, a tracing program does not have the exact same subset of
582 helper functions as a socket filter program (though they may have some
583 helpers in common). Similarly, the input (context) for a tracing pro‐
584 gram is a set of register values, while for a socket filter it is a
585 network packet.
586
587 The set of functions available to eBPF programs of a given type may in‐
588 crease in the future.
589
590 The following program types are supported:
591
592 BPF_PROG_TYPE_SOCKET_FILTER (since Linux 3.19)
593 Currently, the set of functions for BPF_PROG_TYPE_SOCKET_FILTER
594 is:
595
596 bpf_map_lookup_elem(map_fd, void *key)
597 /* look up key in a map_fd */
598 bpf_map_update_elem(map_fd, void *key, void *value)
599 /* update key/value */
600 bpf_map_delete_elem(map_fd, void *key)
601 /* delete key in a map_fd */
602
603 The bpf_context argument is a pointer to a struct __sk_buff.
604
605 BPF_PROG_TYPE_KPROBE (since Linux 4.1)
606 [To be documented]
607
608 BPF_PROG_TYPE_SCHED_CLS (since Linux 4.1)
609 [To be documented]
610
611 BPF_PROG_TYPE_SCHED_ACT (since Linux 4.1)
612 [To be documented]
613
614 Events
615 Once a program is loaded, it can be attached to an event. Various ker‐
616 nel subsystems have different ways to do so.
617
618 Since Linux 3.19, the following call will attach the program prog_fd to
619 the socket sockfd, which was created by an earlier call to socket(2):
620
621 setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF,
622 &prog_fd, sizeof(prog_fd));
623
624 Since Linux 4.1, the following call may be used to attach the eBPF pro‐
625 gram referred to by the file descriptor prog_fd to a perf event file
626 descriptor, event_fd, that was created by a previous call to
627 perf_event_open(2):
628
629 ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
630
632 For a successful call, the return value depends on the operation:
633
634 BPF_MAP_CREATE
635 The new file descriptor associated with the eBPF map.
636
637 BPF_PROG_LOAD
638 The new file descriptor associated with the eBPF program.
639
640 All other commands
641 Zero.
642
643 On error, -1 is returned, and errno is set to indicate the error.
644
646 E2BIG The eBPF program is too large or a map reached the max_entries
647 limit (maximum number of elements).
648
649 EACCES For BPF_PROG_LOAD, even though all program instructions are
650 valid, the program has been rejected because it was deemed un‐
651 safe. This may be because it may have accessed a disallowed
652 memory region or an uninitialized stack/register or because the
653 function constraints don't match the actual types or because
654 there was a misaligned memory access. In this case, it is rec‐
655 ommended to call bpf() again with log_level = 1 and examine
656 log_buf for the specific reason provided by the verifier.
657
658 EBADF fd is not an open file descriptor.
659
660 EFAULT One of the pointers (key or value or log_buf or insns) is out‐
661 side the accessible address space.
662
663 EINVAL The value specified in cmd is not recognized by this kernel.
664
665 EINVAL For BPF_MAP_CREATE, either map_type or attributes are invalid.
666
667 EINVAL For BPF_MAP_*_ELEM commands, some of the fields of union
668 bpf_attr that are not used by this command are not set to zero.
669
670 EINVAL For BPF_PROG_LOAD, indicates an attempt to load an invalid pro‐
671 gram. eBPF programs can be deemed invalid due to unrecognized
672 instructions, the use of reserved fields, jumps out of range,
673 infinite loops or calls of unknown functions.
674
675 ENOENT For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM, indicates that
676 the element with the given key was not found.
677
678 ENOMEM Cannot allocate sufficient memory.
679
680 EPERM The call was made without sufficient privilege (without the
681 CAP_SYS_ADMIN capability).
682
STANDARDS
Linux.

HISTORY
Linux 3.18.
688
Prior to Linux 4.4, all bpf() commands require the caller to have the
CAP_SYS_ADMIN capability. From Linux 4.4 onwards, an unprivileged user
may create limited programs of type BPF_PROG_TYPE_SOCKET_FILTER and
associated maps. However, they may not store kernel pointers within the
maps and are presently limited to the following helper functions:
695
696 • get_random
697 • get_smp_processor_id
698 • tail_call
699 • ktime_get_ns
700
701 Unprivileged access may be blocked by writing the value 1 to the file
702 /proc/sys/kernel/unprivileged_bpf_disabled.
703
704 eBPF objects (maps and programs) can be shared between processes. For
705 example, after fork(2), the child inherits file descriptors referring
706 to the same eBPF objects. In addition, file descriptors referring to
707 eBPF objects can be transferred over UNIX domain sockets. File de‐
708 scriptors referring to eBPF objects can be duplicated in the usual way,
709 using dup(2) and similar calls. An eBPF object is deallocated only af‐
710 ter all file descriptors referring to the object have been closed.
711
712 eBPF programs can be written in a restricted C that is compiled (using
713 the clang compiler) into eBPF bytecode. Various features are omitted
714 from this restricted C, such as loops, global variables, variadic func‐
715 tions, floating-point numbers, and passing structures as function argu‐
716 ments. Some examples can be found in the samples/bpf/*_kern.c files in
717 the kernel source tree.
718
719 The kernel contains a just-in-time (JIT) compiler that translates eBPF
720 bytecode into native machine code for better performance. Before Linux
721 4.15, the JIT compiler is disabled by default, but its operation can be
722 controlled by writing one of the following integer strings to the file
723 /proc/sys/net/core/bpf_jit_enable:
724
725 0 Disable JIT compilation (default).
726
727 1 Normal compilation.
728
729 2 Debugging mode. The generated opcodes are dumped in hexadecimal
730 into the kernel log. These opcodes can then be disassembled us‐
731 ing the program tools/net/bpf_jit_disasm.c provided in the ker‐
732 nel source tree.
733
Since Linux 4.15, the kernel may be configured with the
CONFIG_BPF_JIT_ALWAYS_ON option. In this case, the JIT compiler is
always enabled, and bpf_jit_enable is initialized to 1 and is
immutable. (This kernel configuration option was provided as a
mitigation for one of the Spectre attacks against the BPF interpreter.)
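
The sysctl files mentioned in these notes hold a single integer and can be inspected with ordinary file I/O. A minimal sketch, in which read_proc_int() is a hypothetical helper:

```c
#include <stdio.h>

/* Returns the integer stored in the given file, or -1 if the file
   cannot be opened or does not start with an integer. */
int read_proc_int(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

For example, read_proc_int("/proc/sys/net/core/bpf_jit_enable") would report the current JIT setting, and read_proc_int("/proc/sys/kernel/unprivileged_bpf_disabled") would report whether unprivileged access is blocked. Reading these files may itself require privilege on some systems.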
739
740 The JIT compiler for eBPF is currently available for the following ar‐
741 chitectures:
742
743 • x86-64 (since Linux 3.18; cBPF since Linux 3.0);
744 • ARM32 (since Linux 3.18; cBPF since Linux 3.4);
745 • SPARC 32 (since Linux 3.18; cBPF since Linux 3.5);
746 • ARM-64 (since Linux 3.18);
747 • s390 (since Linux 4.1; cBPF since Linux 3.7);
748 • PowerPC 64 (since Linux 4.8; cBPF since Linux 3.1);
749 • SPARC 64 (since Linux 4.12);
750 • x86-32 (since Linux 4.18);
751 • MIPS 64 (since Linux 4.18; cBPF since Linux 3.16);
752 • riscv (since Linux 5.1).
753
755 /* bpf+sockets example:
756 * 1. create array map of 256 elements
757 * 2. load program that counts number of packets received
758 * r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)]
759 * map[r0]++
760 * 3. attach prog_fd to raw socket via setsockopt()
761 * 4. print number of received TCP/UDP packets every second
762 */
763 int
764 main(int argc, char *argv[])
765 {
766 int sock, map_fd, prog_fd, key;
767 long long value = 0, tcp_cnt, udp_cnt;
768
769 map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key),
770 sizeof(value), 256);
771 if (map_fd < 0) {
772 printf("failed to create map '%s'\n", strerror(errno));
773 /* likely not run as root */
774 return 1;
775 }
776
777 struct bpf_insn prog[] = {
778 BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* r6 = r1 */
779 BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)),
780 /* r0 = ip->proto */
781 BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
782 /* *(u32 *)(fp - 4) = r0 */
783 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), /* r2 = fp */
784 BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = r2 - 4 */
785 BPF_LD_MAP_FD(BPF_REG_1, map_fd), /* r1 = map_fd */
786 BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
787 /* r0 = map_lookup(r1, r2) */
788 BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
789 /* if (r0 == 0) goto pc+2 */
790 BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
791 BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
792 /* lock *(u64 *) r0 += r1 */
793 BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
794 BPF_EXIT_INSN(), /* return r0 */
795 };
796
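    prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog,
                            sizeof(prog) / sizeof(prog[0]), "GPL");
    if (prog_fd < 0) {
        /* bpf_log_buf holds the verifier log filled in by
           bpf_prog_load() */
        printf("failed to load prog '%s'\n%s", strerror(errno),
               bpf_log_buf);
        return 1;
    }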
799
800 sock = open_raw_sock("lo");
801
802 assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
803 sizeof(prog_fd)) == 0);
804
805 for (;;) {
806 key = IPPROTO_TCP;
807 assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
808 key = IPPROTO_UDP;
809 assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
810 printf("TCP %lld UDP %lld packets\n", tcp_cnt, udp_cnt);
811 sleep(1);
812 }
813
814 return 0;
815 }
816
817 Some complete working code can be found in the samples/bpf directory in
818 the kernel source tree.
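
The example above calls open_raw_sock(), which is not defined on this page. One possible definition is sketched below, assuming that the function is meant to return a packet socket bound to the named interface; creating such a socket normally requires the CAP_NET_RAW capability.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns a raw packet socket bound to 'ifname', or -1 on error. */
int open_raw_sock(const char *ifname)
{
    struct sockaddr_ll sll;
    unsigned int ifindex = if_nametoindex(ifname);
    int sock;

    if (ifindex == 0)           /* no such interface */
        return -1;

    sock = socket(PF_PACKET, SOCK_RAW | SOCK_CLOEXEC, htons(ETH_P_ALL));
    if (sock < 0)
        return -1;

    memset(&sll, 0, sizeof(sll));
    sll.sll_family = AF_PACKET;
    sll.sll_ifindex = (int) ifindex;
    sll.sll_protocol = htons(ETH_P_ALL);
    if (bind(sock, (struct sockaddr *) &sll, sizeof(sll)) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}
```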
819
821 seccomp(2), bpf-helpers(7), socket(7), tc(8), tc-bpf(8)
822
823 Both classic and extended BPF are explained in the kernel source file
824 Documentation/networking/filter.txt.
825

Linux man-pages 6.04 2023-03-30 bpf(2)