1BPF-HELPERS(7) BPF-HELPERS(7)
2
3
4
6 BPF-HELPERS - list of eBPF helper functions
7
The extended Berkeley Packet Filter (eBPF) subsystem consists of
programs written in a pseudo-assembly language, then attached to one of
several kernel hooks and run in reaction to specific events. This
framework differs from the older, "classic" BPF (or "cBPF") in several
aspects, one of them being the ability to call special functions (or
"helpers") from within a program. These functions are restricted to a
white-list of helpers defined in the kernel.
16
17 These helpers are used by eBPF programs to interact with the system, or
18 with the context in which they work. For instance, they can be used to
19 print debugging messages, to get the time since the system was booted,
to interact with eBPF maps, or to manipulate network packets. Since
there are several eBPF program types, and they do not all run in the
same context, each program type can only call a subset of those
helpers.
24
Due to eBPF conventions, a helper cannot have more than five arguments.
27
Internally, eBPF programs call directly into the compiled helper
functions without requiring any foreign-function interface. As a result,
calling a helper costs no more than an ordinary function call, which
keeps performance high.
32
33 This document is an attempt to list and document the helpers available
34 to eBPF developers. They are sorted by chronological order (the oldest
35 helpers in the kernel at the top).
36
38 void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
39
40 Description
41 Perform a lookup in map for an entry associated to key.
42
43 Return Map value associated to key, or NULL if no entry was
44 found.
45
46 long bpf_map_update_elem(struct bpf_map *map, const void *key, const
47 void *value, u64 flags)
48
49 Description
50 Add or update the value of the entry associated to key in
51 map with value. flags is one of:
52
53 BPF_NOEXIST
54 The entry for key must not exist in the map.
55
56 BPF_EXIST
57 The entry for key must already exist in the map.
58
59 BPF_ANY
60 No condition on the existence of the entry for
61 key.
62
Flag value BPF_NOEXIST cannot be used for maps of types
BPF_MAP_TYPE_ARRAY or BPF_MAP_TYPE_PERCPU_ARRAY: all elements
always exist in such maps, so the helper would return an error.
66
67 Return 0 on success, or a negative error in case of failure.
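
The existence conditions attached to flags can be sketched with a small user-space model. This is not kernel code: toy_map and toy_map_update() are hypothetical names, and the specific error codes (-EEXIST, -ENOENT, -EINVAL) are an assumption about what a hash map would return, since the text above only promises "a negative error". The three flag values match the kernel UAPI header linux/bpf.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as defined in the kernel UAPI header <linux/bpf.h>. */
enum { BPF_ANY = 0, BPF_NOEXIST = 1, BPF_EXIST = 2 };

#define TOY_SLOTS 4

/* Hypothetical stand-in for a tiny hash-like map keyed by slot number. */
struct toy_map {
    bool     used[TOY_SLOTS];
    uint64_t value[TOY_SLOTS];
};

/* Model of the existence checks described for bpf_map_update_elem(). */
static int toy_map_update(struct toy_map *m, unsigned key,
                          uint64_t value, uint64_t flags)
{
    assert(m != NULL);
    if (key >= TOY_SLOTS || flags > BPF_EXIST)
        return -EINVAL;             /* assumed error code */
    if (flags == BPF_NOEXIST && m->used[key])
        return -EEXIST;             /* entry must not exist yet */
    if (flags == BPF_EXIST && !m->used[key])
        return -ENOENT;             /* entry must already exist */
    m->used[key] = true;
    m->value[key] = value;
    return 0;
}
```

In this model, an array-style map would correspond to every slot being marked as used from the start, which is why BPF_NOEXIST can never succeed there.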
68
69 long bpf_map_delete_elem(struct bpf_map *map, const void *key)
70
71 Description
72 Delete entry with key from map.
73
74 Return 0 on success, or a negative error in case of failure.
75
76 long bpf_probe_read(void *dst, u32 size, const void *unsafe_ptr)
77
78 Description
79 For tracing programs, safely attempt to read size bytes
80 from kernel space address unsafe_ptr and store the data
81 in dst.
82
83 Generally, use bpf_probe_read_user() or
84 bpf_probe_read_kernel() instead.
85
86 Return 0 on success, or a negative error in case of failure.
87
88 u64 bpf_ktime_get_ns(void)
89
90 Description
91 Return the time elapsed since system boot, in nanosec‐
92 onds. Does not include time the system was suspended.
93 See: clock_gettime(CLOCK_MONOTONIC)
94
95 Return Current ktime.
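
For comparison from user space, the clock referenced above can be read as follows. This is only a sketch of the user-space counterpart, not of the helper itself; monotonic_ns() is a hypothetical name.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read CLOCK_MONOTONIC and convert the result to nanoseconds. */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;   /* silence unused warning when NDEBUG is defined */
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}
```

Two successive calls return non-decreasing values, which makes the result suitable for measuring intervals.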
96
97 long bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
98
99 Description
100 This helper is a "printk()-like" facility for debugging.
101 It prints a message defined by format fmt (of size
102 fmt_size) to file /sys/kernel/debug/tracing/trace from
DebugFS, if available. It can take up to three additional
u64 arguments (as with any eBPF helper, the total number of
arguments is limited to five).
106
Each time the helper is called, it appends a line to the
trace. Lines are discarded while
/sys/kernel/debug/tracing/trace is open; use
/sys/kernel/debug/tracing/trace_pipe to avoid this. The format
of the trace is customizable, and the exact output one will get
depends on the options set in
/sys/kernel/debug/tracing/trace_options (see also the README
file under the same directory). However, it usually defaults to
something like:
115
116 telnet-470 [001] .N.. 419421.045894: 0x00000001: <formatted msg>
117
118 In the above:
119
120 • telnet is the name of the current task.
121
122 • 470 is the PID of the current task.
123
124 • 001 is the CPU number on which the task is running.
125
126 • In .N.., each character refers to a set of options
127 (whether irqs are enabled, scheduling options,
128 whether hard/softirqs are running, level of pre‐
129 empt_disabled respectively). N means that
130 TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set.
131
132 • 419421.045894 is a timestamp.
133
134 • 0x00000001 is a fake value used by BPF for the in‐
135 struction pointer register.
136
137 • <formatted msg> is the message formatted with fmt.
138
The conversion specifiers supported by fmt are similar to,
but more limited than, those of printk(). They are %d, %i, %u,
%x, %ld, %li, %lu, %lx, %lld, %lli, %llu, %llx, %p, %s.
No modifier (size of field, padding with zeroes, etc.) is
available, and the helper will return -EINVAL (but print
nothing) if it encounters an unknown specifier.
145
146 Also, note that bpf_trace_printk() is slow, and should
147 only be used for debugging purposes. For this reason, a
148 notice block (spanning several lines) is printed to ker‐
149 nel logs and states that the helper should not be used
150 "for production use" the first time this helper is used
151 (or more precisely, when trace_printk() buffers are allo‐
152 cated). For passing values to user space, perf events
153 should be preferred.
154
155 Return The number of bytes written to the buffer, or a negative
156 error in case of failure.
157
158 u32 bpf_get_prandom_u32(void)
159
160 Description
161 Get a pseudo-random number.
162
163 From a security point of view, this helper uses its own
164 pseudo-random internal state, and cannot be used to infer
165 the seed of other random functions in the kernel. How‐
166 ever, it is essential to note that the generator used by
167 the helper is not cryptographically secure.
168
169 Return A random 32-bit unsigned value.
170
171 u32 bpf_get_smp_processor_id(void)
172
173 Description
174 Get the SMP (symmetric multiprocessing) processor id.
175 Note that all programs run with preemption disabled,
176 which means that the SMP processor id is stable during
177 all the execution of the program.
178
179 Return The SMP id of the processor running the program.
180
181 long bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void
182 *from, u32 len, u64 flags)
183
184 Description
185 Store len bytes from address from into the packet associ‐
186 ated to skb, at offset. flags are a combination of
187 BPF_F_RECOMPUTE_CSUM (automatically recompute the check‐
188 sum for the packet after storing the bytes) and BPF_F_IN‐
189 VALIDATE_HASH (set skb->hash, skb->swhash and skb->l4hash
190 to 0).
191
192 A call to this helper is susceptible to change the under‐
193 lying packet buffer. Therefore, at load time, all checks
194 on pointers previously done by the verifier are invali‐
195 dated and must be performed again, if the helper is used
196 in combination with direct packet access.
197
198 Return 0 on success, or a negative error in case of failure.
199
200 long bpf_l3_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64
201 to, u64 size)
202
203 Description
204 Recompute the layer 3 (e.g. IP) checksum for the packet
205 associated to skb. Computation is incremental, so the
206 helper must know the former value of the header field
207 that was modified (from), the new value of this field
208 (to), and the number of bytes (2 or 4) for this field,
209 stored in size. Alternatively, it is possible to store
210 the difference between the previous and the new values of
211 the header field in to, by setting from and size to 0.
212 For both methods, offset indicates the location of the IP
213 checksum within the packet.
214
215 This helper works in combination with bpf_csum_diff(),
216 which does not update the checksum in-place, but offers
217 more flexibility and can handle sizes larger than 2 or 4
218 for the checksum to update.
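
The incremental computation described above can be illustrated in plain C. The sketch below follows the update rule from RFC 1624 for a 2-byte field and operates on 16-bit words in a single, consistent byte order. It is a model of the arithmetic only, not the kernel implementation, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full Internet checksum over n 16-bit words (checksum field zeroed). */
static uint16_t csum_full(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += words[i];
    return (uint16_t)~fold16(sum);
}

/* Incremental update (RFC 1624, eq. 3): HC' = ~(~HC + ~m + m'). */
static uint16_t csum_replace16(uint16_t check, uint16_t from, uint16_t to)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~from;
    sum += to;
    return (uint16_t)~fold16(sum);
}
```

Only the old checksum, the old field value and the new field value are needed, which is why the helper asks for from and to rather than re-reading the whole header.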
219
220 A call to this helper is susceptible to change the under‐
221 lying packet buffer. Therefore, at load time, all checks
222 on pointers previously done by the verifier are invali‐
223 dated and must be performed again, if the helper is used
224 in combination with direct packet access.
225
226 Return 0 on success, or a negative error in case of failure.
227
228 long bpf_l4_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64
229 to, u64 flags)
230
231 Description
Recompute the layer 4 (e.g. TCP, UDP or ICMP) checksum
for the packet associated to skb. Computation is
incremental, so the helper must know the former value of the
header field that was modified (from), the new value of
this field (to), and the number of bytes (2 or 4) for
this field, stored in the lowest four bits of flags.
Alternatively, it is possible to store the difference
between the previous and the new values of the header field
in to, by setting from and the four lowest bits of flags
to 0. For both methods, offset indicates the location of
the layer 4 checksum within the packet. In addition to the
size of the field, actual flags can be combined into flags
with a bitwise OR. With BPF_F_MARK_MANGLED_0, a null checksum
is left untouched (unless BPF_F_MARK_ENFORCE is added as well),
and for updates resulting in a null checksum the value is
set to CSUM_MANGLED_0 instead. Flag BPF_F_PSEUDO_HDR
indicates the checksum is to be computed against a
pseudo-header.
250
251 This helper works in combination with bpf_csum_diff(),
252 which does not update the checksum in-place, but offers
253 more flexibility and can handle sizes larger than 2 or 4
254 for the checksum to update.
255
256 A call to this helper is susceptible to change the under‐
257 lying packet buffer. Therefore, at load time, all checks
258 on pointers previously done by the verifier are invali‐
259 dated and must be performed again, if the helper is used
260 in combination with direct packet access.
261
262 Return 0 on success, or a negative error in case of failure.
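
Because flags carries both the field size and the option bits, it can help to see how the two are combined. The sketch below is plain C. The constants are the values from the kernel UAPI header linux/bpf.h, while l4_csum_flags() and l4_csum_field_size() are hypothetical convenience functions that are not part of any API.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in the kernel UAPI header <linux/bpf.h>. */
#define BPF_F_HDR_FIELD_MASK  0xfULL
#define BPF_F_PSEUDO_HDR      (1ULL << 4)
#define BPF_F_MARK_MANGLED_0  (1ULL << 5)
#define BPF_F_MARK_ENFORCE    (1ULL << 6)

/* Combine the field size (0, 2 or 4 bytes) with the option flags. */
static uint64_t l4_csum_flags(unsigned field_size, uint64_t options)
{
    assert(field_size == 0 || field_size == 2 || field_size == 4);
    assert((options & BPF_F_HDR_FIELD_MASK) == 0);
    return options | field_size;
}

/* Extract the field size stored in the four lowest bits. */
static unsigned l4_csum_field_size(uint64_t flags)
{
    return (unsigned)(flags & BPF_F_HDR_FIELD_MASK);
}
```

For example, updating a 2-byte port in a UDP header would typically combine size 2 with BPF_F_PSEUDO_HDR only when the modified field is also part of the pseudo-header, such as an IP address.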
263
264 long bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 in‐
265 dex)
266
267 Description
268 This special helper is used to trigger a "tail call", or
269 in other words, to jump into another eBPF program. The
270 same stack frame is used (but values on stack and in reg‐
271 isters for the caller are not accessible to the callee).
272 This mechanism allows for program chaining, either for
273 raising the maximum number of available eBPF instruc‐
274 tions, or to execute given programs in conditional
275 blocks. For security reasons, there is an upper limit to
276 the number of successive tail calls that can be per‐
277 formed.
278
279 Upon call of this helper, the program attempts to jump
280 into a program referenced at index index in prog_ar‐
281 ray_map, a special map of type BPF_MAP_TYPE_PROG_ARRAY,
282 and passes ctx, a pointer to the context.
283
284 If the call succeeds, the kernel immediately runs the
285 first instruction of the new program. This is not a func‐
286 tion call, and it never returns to the previous program.
287 If the call fails, then the helper has no effect, and the
caller continues to run its subsequent instructions. A
call can fail if the destination program for the jump
does not exist (i.e. index is greater than or equal to the
number of entries in prog_array_map), or if the maximum
number of tail calls has been reached for this chain of
programs. This limit is defined in the kernel by the macro
MAX_TAIL_CALL_CNT (not accessible to user space), which
is currently set to 32.
296
297 Return 0 on success, or a negative error in case of failure.
298
299 long bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags)
300
301 Description
302 Clone and redirect the packet associated to skb to an‐
303 other net device of index ifindex. Both ingress and
304 egress interfaces can be used for redirection. The
305 BPF_F_INGRESS value in flags is used to make the distinc‐
306 tion (ingress path is selected if the flag is present,
307 egress path otherwise). This is the only flag supported
308 for now.
309
310 In comparison with bpf_redirect() helper, bpf_clone_redi‐
311 rect() has the associated cost of duplicating the packet
312 buffer, but this can be executed out of the eBPF program.
313 Conversely, bpf_redirect() is more efficient, but it is
314 handled through an action code where the redirection hap‐
315 pens only after the eBPF program has returned.
316
317 A call to this helper is susceptible to change the under‐
318 lying packet buffer. Therefore, at load time, all checks
319 on pointers previously done by the verifier are invali‐
320 dated and must be performed again, if the helper is used
321 in combination with direct packet access.
322
323 Return 0 on success, or a negative error in case of failure.
324
325 u64 bpf_get_current_pid_tgid(void)
326
Return A 64-bit integer containing the current tgid and pid, and
created as such: current_task->tgid << 32 | current_task->pid.
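
Splitting the returned value back into its two halves is a matter of shifting and truncating. The following plain C sketch shows the layout; the function names are hypothetical. Note that the kernel's tgid is what user space usually calls the process ID, while the kernel's pid identifies the individual thread.

```c
#include <assert.h>
#include <stdint.h>

/* Build the value the way the helper is documented to: tgid << 32 | pid. */
static uint64_t pack_pid_tgid(uint32_t tgid, uint32_t pid)
{
    return (uint64_t)tgid << 32 | pid;
}

/* Upper 32 bits: thread group id (the process ID seen from user space). */
static uint32_t tgid_of(uint64_t pid_tgid)
{
    return (uint32_t)(pid_tgid >> 32);
}

/* Lower 32 bits: kernel pid (the thread ID seen from user space). */
static uint32_t pid_of(uint64_t pid_tgid)
{
    return (uint32_t)pid_tgid;
}
```

The same shift-and-truncate pattern applies to bpf_get_current_uid_gid(), with the GID in the upper half and the UID in the lower half.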
330
331 u64 bpf_get_current_uid_gid(void)
332
333 Return A 64-bit integer containing the current GID and UID, and
334 created as such: current_gid << 32 | current_uid.
335
336 long bpf_get_current_comm(void *buf, u32 size_of_buf)
337
338 Description
339 Copy the comm attribute of the current task into buf of
340 size_of_buf. The comm attribute contains the name of the
341 executable (excluding the path) for the current task. The
342 size_of_buf must be strictly positive. On success, the
343 helper makes sure that the buf is NUL-terminated. On
344 failure, it is filled with zeroes.
345
346 Return 0 on success, or a negative error in case of failure.
347
348 u32 bpf_get_cgroup_classid(struct sk_buff *skb)
349
350 Description
351 Retrieve the classid for the current task, i.e. for the
352 net_cls cgroup to which skb belongs.
353
354 This helper can be used on TC egress path, but not on
355 ingress.
356
357 The net_cls cgroup provides an interface to tag network
358 packets based on a user-provided identifier for all traf‐
359 fic coming from the tasks belonging to the related
360 cgroup. See also the related kernel documentation, avail‐
361 able from the Linux sources in file Documentation/ad‐
362 min-guide/cgroup-v1/net_cls.rst.
363
364 The Linux kernel has two versions for cgroups: there are
365 cgroups v1 and cgroups v2. Both are available to users,
366 who can use a mixture of them, but note that the net_cls
367 cgroup is for cgroup v1 only. This makes it incompatible
368 with BPF programs run on cgroups, which is a
369 cgroup-v2-only feature (a socket can only hold data for
370 one version of cgroups at a time).
371
This helper is only available if the kernel was compiled
with the CONFIG_CGROUP_NET_CLASSID configuration option
set to "y" or to "m".
375
376 Return The classid, or 0 for the default unconfigured classid.
377
378 long bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16
379 vlan_tci)
380
381 Description
382 Push a vlan_tci (VLAN tag control information) of proto‐
383 col vlan_proto to the packet associated to skb, then up‐
384 date the checksum. Note that if vlan_proto is different
385 from ETH_P_8021Q and ETH_P_8021AD, it is considered to be
386 ETH_P_8021Q.
387
388 A call to this helper is susceptible to change the under‐
389 lying packet buffer. Therefore, at load time, all checks
390 on pointers previously done by the verifier are invali‐
391 dated and must be performed again, if the helper is used
392 in combination with direct packet access.
393
394 Return 0 on success, or a negative error in case of failure.
395
396 long bpf_skb_vlan_pop(struct sk_buff *skb)
397
398 Description
399 Pop a VLAN header from the packet associated to skb.
400
401 A call to this helper is susceptible to change the under‐
402 lying packet buffer. Therefore, at load time, all checks
403 on pointers previously done by the verifier are invali‐
404 dated and must be performed again, if the helper is used
405 in combination with direct packet access.
406
407 Return 0 on success, or a negative error in case of failure.
408
409 long bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key
410 *key, u32 size, u64 flags)
411
412 Description
413 Get tunnel metadata. This helper takes a pointer key to
414 an empty struct bpf_tunnel_key of size, that will be
415 filled with tunnel metadata for the packet associated to
416 skb. The flags can be set to BPF_F_TUNINFO_IPV6, which
417 indicates that the tunnel is based on IPv6 protocol in‐
418 stead of IPv4.
419
420 The struct bpf_tunnel_key is an object that generalizes
421 the principal parameters used by various tunneling proto‐
422 cols into a single struct. This way, it can be used to
423 easily make a decision based on the contents of the en‐
424 capsulation header, "summarized" in this struct. In par‐
425 ticular, it holds the IP address of the remote end (IPv4
426 or IPv6, depending on the case) in key->remote_ipv4 or
427 key->remote_ipv6. Also, this struct exposes the key->tun‐
428 nel_id, which is generally mapped to a VNI (Virtual Net‐
429 work Identifier), making it programmable together with
430 the bpf_skb_set_tunnel_key() helper.
431
432 Let's imagine that the following code is part of a pro‐
433 gram attached to the TC ingress interface, on one end of
434 a GRE tunnel, and is supposed to filter out all messages
435 coming from remote ends with IPv4 address other than
436 10.0.0.1:
437
    int ret;
    struct bpf_tunnel_key key = {};

    ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
    if (ret < 0)
            return TC_ACT_SHOT;     // drop packet

    if (key.remote_ipv4 != 0x0a000001)
            return TC_ACT_SHOT;     // drop packet

    return TC_ACT_OK;               // accept packet
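
The constant 0x0a000001 in the comparison above is 10.0.0.1 written as a 32-bit integer with the first octet in the most significant byte. A small plain C sketch makes the mapping explicit; ipv4_host_order() is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a dotted quad a.b.c.d with a in the most significant byte. */
static uint32_t ipv4_host_order(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint32_t)a << 24 | (uint32_t)b << 16 | (uint32_t)c << 8 | d;
}
```

Writing the address this way avoids hand-converting each octet to hexadecimal when adapting the filter to another remote end.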
449
450 This interface can also be used with all encapsulation
451 devices that can operate in "collect metadata" mode: in‐
452 stead of having one network device per specific configu‐
453 ration, the "collect metadata" mode only requires a sin‐
454 gle device where the configuration can be extracted from
455 this helper.
456
457 This can be used together with various tunnels such as
458 VXLan, Geneve, GRE or IP in IP (IPIP).
459
460 Return 0 on success, or a negative error in case of failure.
461
462 long bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key
463 *key, u32 size, u64 flags)
464
465 Description
466 Populate tunnel metadata for packet associated to skb.
467 The tunnel metadata is set to the contents of key, of
468 size. The flags can be set to a combination of the fol‐
469 lowing values:
470
471 BPF_F_TUNINFO_IPV6
472 Indicate that the tunnel is based on IPv6 protocol
473 instead of IPv4.
474
475 BPF_F_ZERO_CSUM_TX
476 For IPv4 packets, add a flag to tunnel metadata
477 indicating that checksum computation should be
478 skipped and checksum set to zeroes.
479
480 BPF_F_DONT_FRAGMENT
481 Add a flag to tunnel metadata indicating that the
482 packet should not be fragmented.
483
484 BPF_F_SEQ_NUMBER
485 Add a flag to tunnel metadata indicating that a
486 sequence number should be added to tunnel header
487 before sending the packet. This flag was added for
488 GRE encapsulation, but might be used with other
489 protocols as well in the future.
490
491 Here is a typical usage on the transmit path:
492
    struct bpf_tunnel_key key;
    /* populate key ... */
    bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0);
    bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
497
498 See also the description of the bpf_skb_get_tunnel_key()
499 helper for additional information.
500
501 Return 0 on success, or a negative error in case of failure.
502
503 u64 bpf_perf_event_read(struct bpf_map *map, u64 flags)
504
505 Description
506 Read the value of a perf event counter. This helper re‐
507 lies on a map of type BPF_MAP_TYPE_PERF_EVENT_ARRAY. The
508 nature of the perf event counter is selected when map is
509 updated with perf event file descriptors. The map is an
510 array whose size is the number of available CPUs, and
511 each cell contains a value relative to one CPU. The value
512 to retrieve is indicated by flags, that contains the in‐
513 dex of the CPU to look up, masked with BPF_F_INDEX_MASK.
514 Alternatively, flags can be set to BPF_F_CURRENT_CPU to
515 indicate that the value for the current CPU should be re‐
516 trieved.
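
The encoding of flags can be sketched in plain C as follows. The two constants are the values from the kernel UAPI header linux/bpf.h; the two functions are hypothetical and only illustrate the masking.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in the kernel UAPI header <linux/bpf.h>. */
#define BPF_F_INDEX_MASK   0xffffffffULL
#define BPF_F_CURRENT_CPU  BPF_F_INDEX_MASK

/* Build a flags value that selects the map cell for a given CPU index. */
static uint64_t perf_flags_for_cpu(uint32_t cpu)
{
    return (uint64_t)cpu & BPF_F_INDEX_MASK;
}

/* Tell whether a flags value asks for the current CPU. */
static int perf_flags_is_current_cpu(uint64_t flags)
{
    return (flags & BPF_F_INDEX_MASK) == BPF_F_CURRENT_CPU;
}
```

Since BPF_F_CURRENT_CPU is the all-ones index, an explicit index of 0xffffffff cannot be distinguished from a request for the current CPU.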
517
Note that before Linux 4.13, only hardware perf events can
be retrieved.
520
521 Also, be aware that the newer helper
522 bpf_perf_event_read_value() is recommended over
523 bpf_perf_event_read() in general. The latter has some ABI
524 quirks where error and counter value are used as a return
525 code (which is wrong to do since ranges may overlap).
526 This issue is fixed with bpf_perf_event_read_value(),
527 which at the same time provides more features over the
528 bpf_perf_event_read() interface. Please refer to the de‐
529 scription of bpf_perf_event_read_value() for details.
530
531 Return The value of the perf event counter read from the map, or
532 a negative error code in case of failure.
533
534 long bpf_redirect(u32 ifindex, u64 flags)
535
536 Description
537 Redirect the packet to another net device of index
538 ifindex. This helper is somewhat similar to
539 bpf_clone_redirect(), except that the packet is not
540 cloned, which provides increased performance.
541
542 Except for XDP, both ingress and egress interfaces can be
543 used for redirection. The BPF_F_INGRESS value in flags is
544 used to make the distinction (ingress path is selected if
545 the flag is present, egress path otherwise). Currently,
546 XDP only supports redirection to the egress interface,
547 and accepts no flag at all.
548
549 The same effect can also be attained with the more
550 generic bpf_redirect_map(), which uses a BPF map to store
551 the redirect target instead of providing it directly to
552 the helper.
553
554 Return For XDP, the helper returns XDP_REDIRECT on success or
555 XDP_ABORTED on error. For other program types, the values
556 are TC_ACT_REDIRECT on success or TC_ACT_SHOT on error.
557
558 u32 bpf_get_route_realm(struct sk_buff *skb)
559
560 Description
Retrieve the realm of the route, that is to say the
tclassid field of the destination for the skb. The
identifier retrieved is a user-provided tag, similar to the
564 one used with the net_cls cgroup (see description for
565 bpf_get_cgroup_classid() helper), but here this tag is
566 held by a route (a destination entry), not by a task.
567
568 Retrieving this identifier works with the clsact TC
569 egress hook (see also tc-bpf(8)), or alternatively on
570 conventional classful egress qdiscs, but not on TC
571 ingress path. In case of clsact TC egress hook, this has
572 the advantage that, internally, the destination entry has
573 not been dropped yet in the transmit path. Therefore, the
574 destination entry does not need to be artificially held
575 via netif_keep_dst() for a classful qdisc until the skb
576 is freed.
577
578 This helper is available only if the kernel was compiled
579 with CONFIG_IP_ROUTE_CLASSID configuration option.
580
581 Return The realm of the route for the packet associated to skb,
582 or 0 if none was found.
583
584 long bpf_perf_event_output(void *ctx, struct bpf_map *map, u64 flags,
585 void *data, u64 size)
586
587 Description
588 Write raw data blob into a special BPF perf event held by
589 map of type BPF_MAP_TYPE_PERF_EVENT_ARRAY. This perf
590 event must have the following attributes: PERF_SAMPLE_RAW
591 as sample_type, PERF_TYPE_SOFTWARE as type, and
592 PERF_COUNT_SW_BPF_OUTPUT as config.
593
594 The flags are used to indicate the index in map for which
595 the value must be put, masked with BPF_F_INDEX_MASK. Al‐
596 ternatively, flags can be set to BPF_F_CURRENT_CPU to in‐
597 dicate that the index of the current CPU core should be
598 used.
599
The value to write, of size, is passed through the eBPF stack
and pointed to by data.

The context of the program ctx also needs to be passed to
the helper.
605
In user space, a program that wants to read the values needs
607 to call perf_event_open() on the perf event (either for
608 one or for all CPUs) and to store the file descriptor
609 into the map. This must be done before the eBPF program
610 can send data into it. An example is available in file
611 samples/bpf/trace_output_user.c in the Linux kernel
612 source tree (the eBPF program counterpart is in sam‐
613 ples/bpf/trace_output_kern.c).
614
bpf_perf_event_output() achieves better performance than
bpf_trace_printk() for sharing data with user space, and
is much better suited for streaming data from eBPF
programs.
619
620 Note that this helper is not restricted to tracing use
621 cases and can be used with programs attached to TC or XDP
622 as well, where it allows for passing data to user space
623 listeners. Data can be:
624
625 • Only custom structs,
626
627 • Only the packet payload, or
628
629 • A combination of both.
630
631 Return 0 on success, or a negative error in case of failure.
632
633 long bpf_skb_load_bytes(const void *skb, u32 offset, void *to, u32 len)
634
635 Description
This helper was provided as an easy way to load data from
a packet. It can be used to load len bytes from offset
from the packet associated to skb, into the buffer
pointed to by to.
640
641 Since Linux 4.7, usage of this helper has mostly been re‐
642 placed by "direct packet access", enabling packet data to
643 be manipulated with skb->data and skb->data_end pointing
644 respectively to the first byte of packet data and to the
645 byte after the last byte of packet data. However, it re‐
646 mains useful if one wishes to read large quantities of
647 data at once from a packet into the eBPF stack.
648
649 Return 0 on success, or a negative error in case of failure.
650
651 long bpf_get_stackid(void *ctx, struct bpf_map *map, u64 flags)
652
653 Description
654 Walk a user or a kernel stack and return its id. To
655 achieve this, the helper needs ctx, which is a pointer to
656 the context on which the tracing program is executed, and
657 a pointer to a map of type BPF_MAP_TYPE_STACK_TRACE.
658
659 The last argument, flags, holds the number of stack
660 frames to skip (from 0 to 255), masked with
661 BPF_F_SKIP_FIELD_MASK. The next bits can be used to set a
662 combination of the following flags:
663
664 BPF_F_USER_STACK
665 Collect a user space stack instead of a kernel
666 stack.
667
668 BPF_F_FAST_STACK_CMP
669 Compare stacks by hash only.
670
671 BPF_F_REUSE_STACKID
672 If two different stacks hash into the same
673 stackid, discard the old one.
674
675 The stack id retrieved is a 32 bit long integer handle
676 which can be further combined with other data (including
677 other stack ids) and used as a key into maps. This can be
678 useful for generating a variety of graphs (such as flame
679 graphs or off-cpu graphs).
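
The layout of flags (skip count in the low byte, option bits above it) can be sketched in plain C. The constants are the values from the kernel UAPI header linux/bpf.h; stackid_flags() is a hypothetical convenience function.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in the kernel UAPI header <linux/bpf.h>. */
#define BPF_F_SKIP_FIELD_MASK  0xffULL
#define BPF_F_USER_STACK       (1ULL << 8)
#define BPF_F_FAST_STACK_CMP   (1ULL << 9)
#define BPF_F_REUSE_STACKID    (1ULL << 10)

/* Combine the number of frames to skip (0 to 255) with option flags. */
static uint64_t stackid_flags(unsigned skip, uint64_t options)
{
    assert(skip <= 255);
    assert((options & BPF_F_SKIP_FIELD_MASK) == 0);
    return (skip & BPF_F_SKIP_FIELD_MASK) | options;
}
```

A tracing program that wants the user stack, skipping two frames and reusing colliding ids, would pass stackid_flags(2, BPF_F_USER_STACK | BPF_F_REUSE_STACKID) as the last argument.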
680
For walking a stack, this helper is an improvement over
bpf_probe_read(), which can be used with unrolled loops
but is not efficient and consumes a lot of eBPF
instructions. Instead, bpf_get_stackid() can collect up to
PERF_MAX_STACK_DEPTH frames, for both kernel and user
stacks. Note that this limit can be controlled with the
sysctl program, and that it should be manually increased in
order to profile long user stacks (such as stacks for Java
programs). To do so, use:
690
691 # sysctl kernel.perf_event_max_stack=<new value>
692
693 Return The positive or null stack id on success, or a negative
694 error in case of failure.
695
696 s64 bpf_csum_diff(__be32 *from, u32 from_size, __be32 *to, u32 to_size,
697 __wsum seed)
698
699 Description
Compute a checksum difference, from the raw buffer
pointed to by from, of length from_size (that must be a
multiple of 4), towards the raw buffer pointed to by to, of
size to_size (same remark). An optional seed can be added
to the value (this can be cascaded, the seed may come
from a previous call to the helper).
706
707 This is flexible enough to be used in several ways:
708
709 • With from_size == 0, to_size > 0 and seed set to check‐
710 sum, it can be used when pushing new data.
711
712 • With from_size > 0, to_size == 0 and seed set to check‐
713 sum, it can be used when removing data from a packet.
714
715 • With from_size > 0, to_size > 0 and seed set to 0, it
716 can be used to compute a diff. Note that from_size and
717 to_size do not need to be equal.
718
719 This helper can be used in combination with
720 bpf_l3_csum_replace() and bpf_l4_csum_replace(), to which
721 one can feed in the difference computed with
722 bpf_csum_diff().
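
The arithmetic behind a checksum difference can be modelled in plain C. This sketch assumes the difference is the ones' complement sum of the seed, the complemented from words and the to words. It takes sizes in 32-bit words rather than bytes, ignores byte-order concerns, and uses hypothetical function names. It is a model for illustration, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement addition of two 32-bit partial sums. */
static uint32_t csum_add32(uint32_t a, uint32_t b)
{
    uint64_t s = (uint64_t)a + b;

    return (uint32_t)(s + (s >> 32));   /* end-around carry */
}

/* Fold a 32-bit partial sum down to 16 bits. */
static uint16_t csum_fold32(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Model of the difference: seed + sum(~from[i]) + sum(to[i]). */
static uint32_t csum_diff_model(const uint32_t *from, size_t from_words,
                                const uint32_t *to, size_t to_words,
                                uint32_t seed)
{
    uint32_t sum = seed;

    for (size_t i = 0; i < from_words; i++)
        sum = csum_add32(sum, ~from[i]);
    for (size_t i = 0; i < to_words; i++)
        sum = csum_add32(sum, to[i]);
    return sum;
}

/* Full checksum over n 32-bit words, used here only for comparison. */
static uint16_t csum_full32(const uint32_t *words, size_t n)
{
    return (uint16_t)~csum_fold32(csum_diff_model(NULL, 0, words, n, 0));
}

/* Apply a difference to an existing 16-bit checksum. */
static uint16_t csum_apply_diff(uint16_t check, uint32_t diff)
{
    return (uint16_t)~csum_fold32(csum_add32(diff, (uint16_t)~check));
}
```

Applying the difference to the old checksum gives the same result as recomputing the checksum over the modified data, which is the property the replace helpers rely on when they are fed a difference in to.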
723
724 Return The checksum result, or a negative error code in case of
725 failure.
726
727 long bpf_skb_get_tunnel_opt(struct sk_buff *skb, void *opt, u32 size)
728
729 Description
730 Retrieve tunnel options metadata for the packet associ‐
731 ated to skb, and store the raw tunnel option data to the
732 buffer opt of size.
733
This helper can be used with encapsulation devices that
can operate in "collect metadata" mode (please refer to
the related note in the description of
bpf_skb_get_tunnel_key() for more details). A particular
example where this can be used is in combination with the
Geneve encapsulation protocol, where it allows for pushing
(with bpf_skb_set_tunnel_opt() helper) and retrieving
arbitrary TLVs (Type-Length-Value headers) from the eBPF
program. This allows for full customization of these headers.
743
744 Return The size of the option data retrieved.
745
746 long bpf_skb_set_tunnel_opt(struct sk_buff *skb, void *opt, u32 size)
747
748 Description
749 Set tunnel options metadata for the packet associated to
750 skb to the option data contained in the raw buffer opt of
751 size.
752
753 See also the description of the bpf_skb_get_tunnel_opt()
754 helper for additional information.
755
756 Return 0 on success, or a negative error in case of failure.
757
758 long bpf_skb_change_proto(struct sk_buff *skb, __be16 proto, u64 flags)
759
760 Description
Change the protocol of the skb to proto. Currently
supported are transition from IPv4 to IPv6, and from IPv6 to
IPv4. The helper takes care of the groundwork for the
transition, including resizing the socket buffer. The
eBPF program is expected to fill the new headers, if any,
via bpf_skb_store_bytes() and to recompute the checksums with
bpf_l3_csum_replace() and bpf_l4_csum_replace(). The main
case for this helper is to perform NAT64 operations out
of an eBPF program.
770
771 Internally, the GSO type is marked as dodgy so that head‐
772 ers are checked and segments are recalculated by the
773 GSO/GRO engine. The size for GSO target is adapted as
774 well.
775
776 All values for flags are reserved for future usage, and
777 must be left at zero.
778
779 A call to this helper is susceptible to change the under‐
780 lying packet buffer. Therefore, at load time, all checks
781 on pointers previously done by the verifier are invali‐
782 dated and must be performed again, if the helper is used
783 in combination with direct packet access.
784
785 Return 0 on success, or a negative error in case of failure.
786
787 long bpf_skb_change_type(struct sk_buff *skb, u32 type)
788
789 Description
790 Change the packet type for the packet associated to skb.
791 This comes down to setting skb->pkt_type to type, except
792 the eBPF program does not have a write access to
793 skb->pkt_type beside this helper. Using a helper here al‐
794 lows for graceful handling of errors.
795
The major use case is to change incoming skbs to
PACKET_HOST in a programmatic way instead of having to
recirculate via redirect(..., BPF_F_INGRESS), for
example.
800
801 Note that type only allows certain values. At this time,
802 they are:
803
804 PACKET_HOST
805 Packet is for us.
806
807 PACKET_BROADCAST
808 Send packet to all.
809
810 PACKET_MULTICAST
811 Send packet to group.
812
813 PACKET_OTHERHOST
814 Send packet to someone else.
815
816 Return 0 on success, or a negative error in case of failure.
817
818 long bpf_skb_under_cgroup(struct sk_buff *skb, struct bpf_map *map, u32
819 index)
820
821 Description
822 Check whether skb is a descendant of the cgroup2 held by
823 map of type BPF_MAP_TYPE_CGROUP_ARRAY, at index.
824
825 Return The return value depends on the result of the test, and
826 can be:
827
828 • 0, if the skb failed the cgroup2 descendant test.
829
830 • 1, if the skb passed the cgroup2 descendant test.
831
832 • A negative error code, if an error occurred.
833
834 u32 bpf_get_hash_recalc(struct sk_buff *skb)
835
836 Description
837 Retrieve the hash of the packet, skb->hash. If it is not
838 set, in particular if the hash was cleared due to man‐
839 gling, recompute this hash. Later accesses to the hash
840 can be done directly with skb->hash.
841
842 Calling bpf_set_hash_invalid(), changing a packet
843 protocol with bpf_skb_change_proto(), or calling
844 bpf_skb_store_bytes() with the BPF_F_INVALIDATE_HASH flag
845 are actions susceptible to clear the hash and to trigger
846 a new computation for the next call to
847 bpf_get_hash_recalc().
848
849 Return The 32-bit hash.
850
851 u64 bpf_get_current_task(void)
852
853 Return A pointer to the current task struct.
854
855 long bpf_probe_write_user(void *dst, const void *src, u32 len)
856
857 Description
858 Attempt in a safe way to write len bytes from the buffer
859 src to dst in memory. It only works for threads that are
860 in user context, and dst must be a valid user space ad‐
861 dress.
862
863 This helper should not be used to implement any kind of
864 security mechanism because of TOC-TOU attacks, but rather
865 to debug, divert, and manipulate execution of semi-coop‐
866 erative processes.
867
868 Keep in mind that this feature is meant for experiments,
869 and it has a risk of crashing the system and running pro‐
870 grams. Therefore, when an eBPF program using this helper
871 is attached, a warning including PID and process name is
872 printed to kernel logs.
873
874 Return 0 on success, or a negative error in case of failure.
875
876 long bpf_current_task_under_cgroup(struct bpf_map *map, u32 index)
877
878 Description
879 Check whether the probe is being run in the context of a
880 given subset of the cgroup2 hierarchy. The cgroup2 to
881 test is held by map of type BPF_MAP_TYPE_CGROUP_ARRAY, at
882 index.
883
884 Return The return value depends on the result of the test, and
885 can be:
886
887 • 1, if the current task belongs to the cgroup2.
888
889 • 0, if the current task does not belong to the cgroup2.
890
891 • A negative error code, if an error occurred.
892
893 long bpf_skb_change_tail(struct sk_buff *skb, u32 len, u64 flags)
894
895 Description
896 Resize (trim or grow) the packet associated to skb to the
897 new len. The flags are reserved for future usage, and
898 must be left at zero.
899
900 The basic idea is that the helper performs the needed
901 work to change the size of the packet, then the eBPF pro‐
902 gram rewrites the rest via helpers like
903 bpf_skb_store_bytes(), bpf_l3_csum_replace(),
904 bpf_l4_csum_replace() and others. This helper is a slow
905 path utility intended for replies with control messages.
906 And because it is targeted for slow path, the helper it‐
907 self can afford to be slow: it implicitly linearizes, un‐
908 clones and drops offloads from the skb.
909
910 A call to this helper is susceptible to change the under‐
911 lying packet buffer. Therefore, at load time, all checks
912 on pointers previously done by the verifier are invali‐
913 dated and must be performed again, if the helper is used
914 in combination with direct packet access.
915
916 Return 0 on success, or a negative error in case of failure.
917
918 long bpf_skb_pull_data(struct sk_buff *skb, u32 len)
919
920 Description
921 Pull in non-linear data in case the skb is non-linear and
922 not all of len are part of the linear section. Make len
923 bytes from skb readable and writable. If a zero value is
924 passed for len, then the whole length of the skb is
925 pulled.
926
927 This helper is only needed for reading and writing with
928 direct packet access.
929
930 For direct packet access, testing that offsets to access
931 are within packet boundaries (test on skb->data_end) is
932 susceptible to fail if offsets are invalid, or if the re‐
933 quested data is in non-linear parts of the skb. On fail‐
934 ure the program can just bail out, or in the case of a
935 non-linear buffer, use a helper to make the data avail‐
936 able. The bpf_skb_load_bytes() helper is a first solution
937 to access the data. Another one consists in using
938 bpf_skb_pull_data() to pull in the non-linear parts once,
939 then retesting and finally accessing the data.
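
The test-then-pull pattern described above can be modeled in plain user-space C. This is only a sketch under stated assumptions: struct toy_pkt, toy_read_u16(), toy_pull() and toy_read_u16_pulling() are hypothetical names that stand in for the linear section of an skb and for the effect of bpf_skb_pull_data(); none of them is a kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an skb: 'linear' bytes out of 'total' are
 * directly readable, the rest is assumed to sit in non-linear parts. */
struct toy_pkt {
	uint8_t bytes[64];
	size_t linear;   /* plays the role of data_end - data */
	size_t total;    /* whole packet length */
};

/* Direct access: succeed only if [off, off + 2) is inside the linear part. */
static int toy_read_u16(const struct toy_pkt *p, size_t off, uint16_t *out)
{
	const uint8_t *data = p->bytes;
	const uint8_t *data_end = p->bytes + p->linear;

	assert(p->linear <= p->total && p->total <= sizeof(p->bytes));
	if (data + off + sizeof(*out) > data_end)
		return -1;              /* out of bounds: bail out or pull */
	memcpy(out, data + off, sizeof(*out));
	return 0;
}

/* Models pulling 'len' bytes into the linear part (0 means everything). */
static int toy_pull(struct toy_pkt *p, size_t len)
{
	if (len == 0)
		len = p->total;
	if (len > p->total)
		return -1;
	if (len > p->linear)
		p->linear = len;
	return 0;
}

/* Try direct access, pull once on failure, then retest. */
static int toy_read_u16_pulling(struct toy_pkt *p, size_t off, uint16_t *out)
{
	if (toy_read_u16(p, off, out) == 0)
		return 0;
	if (toy_pull(p, off + sizeof(*out)) < 0)
		return -1;
	return toy_read_u16(p, off, out);
}
```

The point of the model is the order of operations: the bounds test is repeated after the pull, because the earlier test result no longer applies once the buffer may have changed.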
940
941 At the same time, this also makes sure the skb is un‐
942 cloned, which is a necessary condition for direct write.
943 As this needs to be an invariant for the write part only,
944 the verifier detects writes and adds a prologue that is
945 calling bpf_skb_pull_data() to effectively unclone the
946 skb from the very beginning in case it is indeed cloned.
947
948 A call to this helper is susceptible to change the under‐
949 lying packet buffer. Therefore, at load time, all checks
950 on pointers previously done by the verifier are invali‐
951 dated and must be performed again, if the helper is used
952 in combination with direct packet access.
953
954 Return 0 on success, or a negative error in case of failure.
955
956 s64 bpf_csum_update(struct sk_buff *skb, __wsum csum)
957
958 Description
959 Add the checksum csum into skb->csum in case the driver
960 has supplied a checksum for the entire packet into that
961 field. Return an error otherwise. This helper is intended
962 to be used in combination with bpf_csum_diff(), in par‐
963 ticular when the checksum needs to be updated after data
964 has been written into the packet through direct packet
965 access.
966
967 Return The checksum on success, or a negative error code in case
968 of failure.
969
970 void bpf_set_hash_invalid(struct sk_buff *skb)
971
972 Description
973 Invalidate the current skb->hash. It can be used after
974 mangling on headers through direct packet access, in or‐
975 der to indicate that the hash is outdated and to trigger
976 a recalculation the next time the kernel tries to access
977 this hash or when the bpf_get_hash_recalc() helper is
978 called.
979
980 long bpf_get_numa_node_id(void)
981
982 Description
983 Return the id of the current NUMA node. The primary use
984 case for this helper is the selection of sockets for the
985 local NUMA node, when the program is attached to sockets
986 using the SO_ATTACH_REUSEPORT_EBPF option (see also
987 socket(7)), but the helper is also available to other
988 eBPF program types, similarly to bpf_get_smp_proces‐
989 sor_id().
990
991 Return The id of current NUMA node.
992
993 long bpf_skb_change_head(struct sk_buff *skb, u32 len, u64 flags)
994
995 Description
996 Grows headroom of packet associated to skb and adjusts
997 the offset of the MAC header accordingly, adding len
998 bytes of space. It automatically extends and reallocates
999 memory as required.
1000
1001 This helper can be used on a layer 3 skb to push a MAC
1002 header for redirection into a layer 2 device.
1003
1004 All values for flags are reserved for future usage, and
1005 must be left at zero.
1006
1007 A call to this helper is susceptible to change the under‐
1008 lying packet buffer. Therefore, at load time, all checks
1009 on pointers previously done by the verifier are invali‐
1010 dated and must be performed again, if the helper is used
1011 in combination with direct packet access.
1012
1013 Return 0 on success, or a negative error in case of failure.
1014
1015 long bpf_xdp_adjust_head(struct xdp_buff *xdp_md, int delta)
1016
1017 Description
1018 Adjust (move) xdp_md->data by delta bytes. Note that it
1019 is possible to use a negative value for delta. This
1020 helper can be used to prepare the packet for pushing or
1021 popping headers.
1022
1023 A call to this helper is susceptible to change the under‐
1024 lying packet buffer. Therefore, at load time, all checks
1025 on pointers previously done by the verifier are invali‐
1026 dated and must be performed again, if the helper is used
1027 in combination with direct packet access.
1028
1029 Return 0 on success, or a negative error in case of failure.
1030
1031 long bpf_probe_read_str(void *dst, u32 size, const void *unsafe_ptr)
1032
1033 Description
1034 Copy a NUL terminated string from an unsafe kernel ad‐
1035 dress unsafe_ptr to dst. See bpf_probe_read_kernel_str()
1036 for more details.
1037
1038 Generally, use bpf_probe_read_user_str() or
1039 bpf_probe_read_kernel_str() instead.
1040
1041 Return On success, the strictly positive length of the string,
1042 including the trailing NUL character. On error, a nega‐
1043 tive value.
1044
1045 u64 bpf_get_socket_cookie(struct sk_buff *skb)
1046
1047 Description
1048 If the struct sk_buff pointed by skb has a known socket,
1049 retrieve the cookie (generated by the kernel) of this
1050 socket. If no cookie has been set yet, generate a new
1051 cookie. Once generated, the socket cookie remains stable
1052 for the life of the socket. This helper can be useful for
1053 monitoring per socket networking traffic statistics as it
1054 provides a global socket identifier that can be assumed
1055 unique.
1056
1057 Return An 8-byte long non-decreasing number on success, or 0 if
1058 the socket field is missing inside skb.
1059
1060 u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx)
1061
1062 Description
1063 Equivalent to bpf_get_socket_cookie() helper that accepts
1064 skb, but gets socket from struct bpf_sock_addr context.
1065
1066 Return An 8-byte long non-decreasing number.
1067
1068 u64 bpf_get_socket_cookie(struct bpf_sock_ops *ctx)
1069
1070 Description
1071 Equivalent to bpf_get_socket_cookie() helper that accepts
1072 skb, but gets socket from struct bpf_sock_ops context.
1073
1074 Return An 8-byte long non-decreasing number.
1075
1076 u32 bpf_get_socket_uid(struct sk_buff *skb)
1077
1078 Return The owner UID of the socket associated to skb. If the
1079 socket is NULL, or if it is not a full socket (i.e. if it
1080 is a time-wait or a request socket instead), overflowuid
1081 value is returned (note that overflowuid might also be
1082 the actual UID value for the socket).
1083
1084 long bpf_set_hash(struct sk_buff *skb, u32 hash)
1085
1086 Description
1087 Set the full hash for skb (set the field skb->hash) to
1088 value hash.
1089
1090 Return 0
1091
1092 long bpf_setsockopt(void *bpf_socket, int level, int optname, void
1093 *optval, int optlen)
1094
1095 Description
1096 Emulate a call to setsockopt() on the socket associated
1097 to bpf_socket, which must be a full socket. The level at
1098 which the option resides and the name optname of the op‐
1099 tion must be specified, see setsockopt(2) for more infor‐
1100 mation. The option value of length optlen is pointed by
1101 optval.
1102
1103 bpf_socket should be one of the following:
1104
1105 • struct bpf_sock_ops for BPF_PROG_TYPE_SOCK_OPS.
1106
1107 • struct bpf_sock_addr for BPF_CGROUP_INET4_CONNECT and
1108 BPF_CGROUP_INET6_CONNECT.
1109
1110 This helper actually implements a subset of setsockopt().
1111 It supports the following levels:
1112
1113 • SOL_SOCKET, which supports the following optnames:
1114 SO_RCVBUF, SO_SNDBUF, SO_MAX_PACING_RATE, SO_PRIORITY,
1115 SO_RCVLOWAT, SO_MARK, SO_BINDTODEVICE, SO_KEEPALIVE.
1116
1117 • IPPROTO_TCP, which supports the following optnames:
1118 TCP_CONGESTION, TCP_BPF_IW, TCP_BPF_SNDCWND_CLAMP,
1119 TCP_SAVE_SYN, TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT,
1120 TCP_SYNCNT, TCP_USER_TIMEOUT.
1121
1122 • IPPROTO_IP, which supports optname IP_TOS.
1123
1124 • IPPROTO_IPV6, which supports optname IPV6_TCLASS.
1125
1126 Return 0 on success, or a negative error in case of failure.
1127
1128 long bpf_skb_adjust_room(struct sk_buff *skb, s32 len_diff, u32 mode,
1129 u64 flags)
1130
1131 Description
1132 Grow or shrink the room for data in the packet associated
1133 to skb by len_diff, and according to the selected mode.
1134
1135 By default, the helper will reset any offloaded checksum
1136 indicator of the skb to CHECKSUM_NONE. This can be
1137 avoided by the following flag:
1138
1139 • BPF_F_ADJ_ROOM_NO_CSUM_RESET: Do not reset offloaded
1140 checksum data of the skb to CHECKSUM_NONE.
1141
1142 There are two supported modes at this time:
1143
1144 • BPF_ADJ_ROOM_MAC: Adjust room at the mac layer (room
1145 space is added or removed below the layer 2 header).
1146
1147 • BPF_ADJ_ROOM_NET: Adjust room at the network layer
1148 (room space is added or removed below the layer 3
1149 header).
1150
1151 The following flags are supported at this time:
1152
1153 • BPF_F_ADJ_ROOM_FIXED_GSO: Do not adjust gso_size. Ad‐
1154 justing mss in this way is not allowed for datagrams.
1155
1156 • BPF_F_ADJ_ROOM_ENCAP_L3_IPV4, BPF_F_ADJ_ROOM_EN‐
1157 CAP_L3_IPV6: Any new space is reserved to hold a tunnel
1158 header. Configure skb offsets and other fields accord‐
1159 ingly.
1160
1161 • BPF_F_ADJ_ROOM_ENCAP_L4_GRE, BPF_F_ADJ_ROOM_EN‐
1162 CAP_L4_UDP: Use with ENCAP_L3 flags to further specify
1163 the tunnel type.
1164
1165 • BPF_F_ADJ_ROOM_ENCAP_L2(len): Use with ENCAP_L3/L4
1166 flags to further specify the tunnel type; len is the
1167 length of the inner MAC header.
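
As an illustration of how the encapsulation flags listed above combine, the following user-space sketch builds the flags value for an IPv4/UDP tunnel that carries an inner Ethernet frame. The macro values are local copies that are assumed to match <linux/bpf.h>; a real eBPF program would take them from that header. encap_flags_ipv4_udp_l2() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the flag values; assumed to match <linux/bpf.h>. */
#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV4    (1ULL << 1)
#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV6    (1ULL << 2)
#define BPF_F_ADJ_ROOM_ENCAP_L4_GRE     (1ULL << 3)
#define BPF_F_ADJ_ROOM_ENCAP_L4_UDP     (1ULL << 4)
#define BPF_ADJ_ROOM_ENCAP_L2_MASK      0xffULL
#define BPF_ADJ_ROOM_ENCAP_L2_SHIFT     56
#define BPF_F_ADJ_ROOM_ENCAP_L2(len) \
	(((uint64_t)(len) & BPF_ADJ_ROOM_ENCAP_L2_MASK) << \
	 BPF_ADJ_ROOM_ENCAP_L2_SHIFT)

/* Hypothetical helper: flags for an IPv4/UDP tunnel with an inner MAC
 * header of inner_mac_len bytes. */
static uint64_t encap_flags_ipv4_udp_l2(unsigned int inner_mac_len)
{
	assert(inner_mac_len <= BPF_ADJ_ROOM_ENCAP_L2_MASK);
	return BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 |
	       BPF_F_ADJ_ROOM_ENCAP_L4_UDP |
	       BPF_F_ADJ_ROOM_ENCAP_L2(inner_mac_len);
}
```

In an eBPF program the resulting value would be passed as the flags argument of bpf_skb_adjust_room(), with len_diff set to the total size of the headers being added.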
1168
1169 A call to this helper is susceptible to change the under‐
1170 lying packet buffer. Therefore, at load time, all checks
1171 on pointers previously done by the verifier are invali‐
1172 dated and must be performed again, if the helper is used
1173 in combination with direct packet access.
1174
1175 Return 0 on success, or a negative error in case of failure.
1176
1177 long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
1178
1179 Description
1180 Redirect the packet to the endpoint referenced by map at
1181 index key. Depending on its type, this map can contain
1182 references to net devices (for forwarding packets through
1183 other ports), or to CPUs (for redirecting XDP frames to
1184 another CPU; but this is only implemented for native XDP
1185 (with driver support) as of this writing).
1186
1187 The lower two bits of flags are used as the return code
1188 if the map lookup fails. This is so that the return value
1189 can be one of the XDP program return codes up to XDP_TX,
1190 as chosen by the caller. Any higher bits in the flags ar‐
1191 gument must be unset.
1192
1193 See also bpf_redirect(), which only supports redirecting
1194 to an ifindex, but doesn't require a map to do so.
1195
1196 Return XDP_REDIRECT on success, or the value of the two lower
1197 bits of the flags argument on error.
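
The documented return value can be summarized with a small user-space model. It is a sketch only: toy_redirect_map_result() is a hypothetical name, the map lookup is reduced to a boolean, and the XDP action values are local copies assumed to match enum xdp_action in <linux/bpf.h>.

```c
#include <assert.h>
#include <stdint.h>

/* XDP return codes, assumed to match enum xdp_action in <linux/bpf.h>. */
enum { XDP_ABORTED = 0, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT };

/* Hypothetical model of the documented result of bpf_redirect_map():
 * XDP_REDIRECT when the lookup succeeds, else the two lower flag bits. */
static long toy_redirect_map_result(int lookup_ok, uint64_t flags)
{
	assert((flags & ~3ULL) == 0);   /* higher bits must be unset */
	return lookup_ok ? XDP_REDIRECT : (long)(flags & 3);
}
```

A common consequence of this design is that a program can pass XDP_PASS as flags, so that packets fall through to the regular stack when the map has no entry for key.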
1198
1199 long bpf_sk_redirect_map(struct sk_buff *skb, struct bpf_map *map, u32
1200 key, u64 flags)
1201
1202 Description
1203 Redirect the packet to the socket referenced by map (of
1204 type BPF_MAP_TYPE_SOCKMAP) at index key. Both ingress and
1205 egress interfaces can be used for redirection. The
1206 BPF_F_INGRESS value in flags is used to make the distinc‐
1207 tion (ingress path is selected if the flag is present,
1208 egress path otherwise). This is the only flag supported
1209 for now.
1210
1211 Return SK_PASS on success, or SK_DROP on error.
1212
1213 long bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map
1214 *map, void *key, u64 flags)
1215
1216 Description
1217 Add an entry to, or update a map referencing sockets. The
1218 skops is used as a new value for the entry associated to
1219 key. flags is one of:
1220
1221 BPF_NOEXIST
1222 The entry for key must not exist in the map.
1223
1224 BPF_EXIST
1225 The entry for key must already exist in the map.
1226
1227 BPF_ANY
1228 No condition on the existence of the entry for
1229 key.
1230
1231 If the map has eBPF programs (parser and verdict), those
1232 will be inherited by the socket being added. If the
1233 socket is already attached to eBPF programs, this results
1234 in an error.
1235
1236 Return 0 on success, or a negative error in case of failure.
1237
1238 long bpf_xdp_adjust_meta(struct xdp_buff *xdp_md, int delta)
1239
1240 Description
1241 Adjust the address pointed by xdp_md->data_meta by delta
1242 (which can be positive or negative). Note that this oper‐
1243 ation modifies the address stored in xdp_md->data, so the
1244 latter must be loaded only after the helper has been
1245 called.
1246
1247 The use of xdp_md->data_meta is optional and programs are
1248 not required to use it. The rationale is that when the
1249 packet is processed with XDP (e.g. as DoS filter), it is
1250 possible to push further meta data along with it before
1251 passing to the stack, and to give the guarantee that an
1252 ingress eBPF program attached as a TC classifier on the
1253 same device can pick this up for further post-processing.
1254 Since TC works with socket buffers, it remains possible
1255 to set from XDP the mark or priority pointers, or other
1256 pointers for the socket buffer. Having this scratch
1257 space generic and programmable allows for more flexibil‐
1258 ity as the user is free to store whatever meta data they
1259 need.
1260
1261 A call to this helper is susceptible to change the under‐
1262 lying packet buffer. Therefore, at load time, all checks
1263 on pointers previously done by the verifier are invali‐
1264 dated and must be performed again, if the helper is used
1265 in combination with direct packet access.
1266
1267 Return 0 on success, or a negative error in case of failure.
1268
1269 long bpf_perf_event_read_value(struct bpf_map *map, u64 flags, struct
1270 bpf_perf_event_value *buf, u32 buf_size)
1271
1272 Description
1273 Read the value of a perf event counter, and store it into
1274 buf of size buf_size. This helper relies on a map of type
1275 BPF_MAP_TYPE_PERF_EVENT_ARRAY. The nature of the perf
1276 event counter is selected when map is updated with perf
1277 event file descriptors. The map is an array whose size is
1278 the number of available CPUs, and each cell contains a
1279 value relative to one CPU. The value to retrieve is indi‐
1280 cated by flags, that contains the index of the CPU to
1281 look up, masked with BPF_F_INDEX_MASK. Alternatively,
1282 flags can be set to BPF_F_CURRENT_CPU to indicate that
1283 the value for the current CPU should be retrieved.
1284
1285 This helper behaves in a way close to
1286 bpf_perf_event_read() helper, save that instead of just
1287 returning the value observed, it fills the buf structure.
1288 This allows for additional data to be retrieved: in par‐
1289 ticular, the enabled and running times (in buf->enabled
1290 and buf->running, respectively) are copied. In general,
1291 bpf_perf_event_read_value() is recommended over
1292 bpf_perf_event_read(), which has some ABI issues and pro‐
1293 vides fewer functionalities.
1294
1295 These values are interesting, because hardware PMU (Per‐
1296 formance Monitoring Unit) counters are limited resources.
1297 When there are more PMU based perf events opened than
1298 available counters, the kernel will multiplex these events
1299 so each event gets a certain percentage (but not all) of the
1300 PMU time. When multiplexing happens, the number of
1301 samples or the counter value does not match what would be
1302 observed without multiplexing. This makes comparison
1303 between different runs difficult. Typically, the
1304 counter value should be normalized before comparing to
1305 other experiments. The usual normalization is done as
1306 follows.
1307
1308 normalized_counter = counter * t_enabled / t_running
1309
1310 Where t_enabled is the time enabled for the event and
1311 t_running is the time running for the event since the last
1312 normalization. The enabled and running times are accumulated
1313 since the perf event was opened. To obtain a scaling factor between
1314 two invocations of an eBPF program, users can use CPU id
1315 as the key (which is typical for perf array usage model)
1316 to remember the previous value and do the calculation in‐
1317 side the eBPF program.
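
The per-interval normalization described above can be sketched as follows. This is user-space C under stated assumptions: struct perf_value is a local mirror of the three fields of struct bpf_perf_event_value, normalized_delta() is a hypothetical name, and the previous reading is assumed to have been remembered by the caller (for instance in a map keyed by CPU id).

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of struct bpf_perf_event_value (assumed layout). */
struct perf_value {
	uint64_t counter;
	uint64_t enabled;
	uint64_t running;
};

/* Scale the counter delta between two readings by the ratio of the
 * enabled and running time deltas:
 *     delta_counter * delta_enabled / delta_running
 * Returns 0 if the event did not run at all during the interval.
 * Note: the multiplication can overflow for very large deltas. */
static uint64_t normalized_delta(const struct perf_value *prev,
				 const struct perf_value *cur)
{
	uint64_t d_counter, d_enabled, d_running;

	/* Times are accumulated since the event was opened. */
	assert(cur->enabled >= prev->enabled);
	assert(cur->running >= prev->running);

	d_counter = cur->counter - prev->counter;
	d_enabled = cur->enabled - prev->enabled;
	d_running = cur->running - prev->running;

	if (d_running == 0)
		return 0;
	return d_counter * d_enabled / d_running;
}
```

When the event ran for the whole interval, the two time deltas are equal and the function returns the raw counter delta unchanged.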
1318
1319 Return 0 on success, or a negative error in case of failure.
1320
1321 long bpf_perf_prog_read_value(struct bpf_perf_event_data *ctx, struct
1322 bpf_perf_event_value *buf, u32 buf_size)
1323
1324 Description
1325 For an eBPF program attached to a perf event, retrieve
1326 the value of the event counter associated to ctx and
1327 store it in the structure pointed by buf and of size
1328 buf_size. Enabled and running times are also stored in
1329 the structure (see description of helper
1330 bpf_perf_event_read_value() for more details).
1331
1332 Return 0 on success, or a negative error in case of failure.
1333
1334 long bpf_getsockopt(void *bpf_socket, int level, int optname, void
1335 *optval, int optlen)
1336
1337 Description
1338 Emulate a call to getsockopt() on the socket associated
1339 to bpf_socket, which must be a full socket. The level at
1340 which the option resides and the name optname of the op‐
1341 tion must be specified, see getsockopt(2) for more infor‐
1342 mation. The retrieved value is stored in the structure
1343 pointed by optval and of length optlen.
1344
1345 bpf_socket should be one of the following:
1346
1347 • struct bpf_sock_ops for BPF_PROG_TYPE_SOCK_OPS.
1348
1349 • struct bpf_sock_addr for BPF_CGROUP_INET4_CONNECT and
1350 BPF_CGROUP_INET6_CONNECT.
1351
1352 This helper actually implements a subset of getsockopt().
1353 It supports the following levels:
1354
1355 • IPPROTO_TCP, which supports optname TCP_CONGESTION.
1356
1357 • IPPROTO_IP, which supports optname IP_TOS.
1358
1359 • IPPROTO_IPV6, which supports optname IPV6_TCLASS.
1360
1361 Return 0 on success, or a negative error in case of failure.
1362
1363 long bpf_override_return(struct pt_regs *regs, u64 rc)
1364
1365 Description
1366 Used for error injection, this helper uses kprobes to
1367 override the return value of the probed function, and to
1368 set it to rc. The first argument is the context regs on
1369 which the kprobe works.
1370
1371 This helper works by setting the PC (program counter) to
1372 an override function which is run in place of the origi‐
1373 nal probed function. This means the probed function is
1374 not run at all. The replacement function just returns
1375 with the required value.
1376
1377 This helper has security implications, and thus is sub‐
1378 ject to restrictions. It is only available if the kernel
1379 was compiled with the CONFIG_BPF_KPROBE_OVERRIDE configu‐
1380 ration option, and in this case it only works on func‐
1381 tions tagged with ALLOW_ERROR_INJECTION in the kernel
1382 code.
1383
1384 Also, the helper is only available for the architectures
1385 having the CONFIG_FUNCTION_ERROR_INJECTION option. As of
1386 this writing, x86 architecture is the only one to support
1387 this feature.
1388
1389 Return 0
1390
1391 long bpf_sock_ops_cb_flags_set(struct bpf_sock_ops *bpf_sock, int
1392 argval)
1393
1394 Description
1395 Attempt to set the value of the bpf_sock_ops_cb_flags
1396 field for the full TCP socket associated to bpf_sock_ops
1397 to argval.
1398
1399 The primary use of this field is to determine if there
1400 should be calls to eBPF programs of type
1401 BPF_PROG_TYPE_SOCK_OPS at various points in the TCP code.
1402 A program of the same type can change its value, per con‐
1403 nection and as necessary, when the connection is estab‐
1404 lished. This field is directly accessible for reading,
1405 but this helper must be used for updates in order to re‐
1406 turn an error if an eBPF program tries to set a callback
1407 that is not supported in the current kernel.
1408
1409 argval is a flag array which can combine these flags:
1410
1411 • BPF_SOCK_OPS_RTO_CB_FLAG (retransmission time out)
1412
1413 • BPF_SOCK_OPS_RETRANS_CB_FLAG (retransmission)
1414
1415 • BPF_SOCK_OPS_STATE_CB_FLAG (TCP state change)
1416
1417 • BPF_SOCK_OPS_RTT_CB_FLAG (every RTT)
1418
1419 Therefore, this function can be used to clear a callback
1420 flag by setting the appropriate bit to zero. For example,
1421 to disable the RTO callback:
1422
1423 bpf_sock_ops_cb_flags_set(bpf_sock,
1424 bpf_sock->bpf_sock_ops_cb_flags &
1425 ~BPF_SOCK_OPS_RTO_CB_FLAG)
1426
1427 Here are some examples of where one could call such an eBPF
1428 program:
1429
1430 • When RTO fires.
1431
1432 • When a packet is retransmitted.
1433
1434 • When the connection terminates.
1435
1436 • When a packet is sent.
1437
1438 • When a packet is received.
1439
1440 Return Code -EINVAL if the socket is not a full TCP socket; oth‐
1441 erwise, a positive number containing the bits that could
1442 not be set is returned (which comes down to 0 if all bits
1443 were set as required).
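
The return value convention (the bits that could not be set) can be illustrated with a user-space model. This is a sketch only: struct toy_sock and toy_cb_flags_set() are hypothetical, the set of supported callbacks is made configurable to imitate an older kernel, and the flag values are local copies assumed to match <linux/bpf.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values assumed to match <linux/bpf.h>. */
#define BPF_SOCK_OPS_RTO_CB_FLAG      (1 << 0)
#define BPF_SOCK_OPS_RETRANS_CB_FLAG  (1 << 1)
#define BPF_SOCK_OPS_STATE_CB_FLAG    (1 << 2)
#define BPF_SOCK_OPS_RTT_CB_FLAG      (1 << 3)

/* Hypothetical model of a socket on a kernel that implements only the
 * callbacks listed in 'supported'. */
struct toy_sock {
	int cb_flags;    /* plays the role of bpf_sock_ops_cb_flags */
	int supported;   /* callbacks this model "kernel" implements */
};

/* Store the supported bits and return the bits that could not be set. */
static long toy_cb_flags_set(struct toy_sock *sk, int argval)
{
	assert(argval >= 0);            /* flags form a non-negative mask */
	sk->cb_flags = argval & sk->supported;
	return argval & ~sk->supported;
}
```

A caller can therefore treat any non-zero positive return value as "some requested callbacks are unavailable" and inspect the returned bits to see which ones.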
1444
1445 long bpf_msg_redirect_map(struct sk_msg_buff *msg, struct bpf_map *map,
1446 u32 key, u64 flags)
1447
1448 Description
1449 This helper is used in programs implementing policies at
1450 the socket level. If the message msg is allowed to pass
1451 (i.e. if the verdict eBPF program returns SK_PASS), redi‐
1452 rect it to the socket referenced by map (of type
1453 BPF_MAP_TYPE_SOCKMAP) at index key. Both ingress and
1454 egress interfaces can be used for redirection. The
1455 BPF_F_INGRESS value in flags is used to make the distinc‐
1456 tion (ingress path is selected if the flag is present,
1457 egress path otherwise). This is the only flag supported
1458 for now.
1459
1460 Return SK_PASS on success, or SK_DROP on error.
1461
1462 long bpf_msg_apply_bytes(struct sk_msg_buff *msg, u32 bytes)
1463
1464 Description
1465 For socket policies, apply the verdict of the eBPF pro‐
1466 gram to the next bytes (number of bytes) of message msg.
1467
1468 For example, this helper can be used in the following
1469 cases:
1470
1471 • A single sendmsg() or sendfile() system call contains
1472 multiple logical messages that the eBPF program is sup‐
1473 posed to read and for which it should apply a verdict.
1474
1475 • An eBPF program only cares to read the first bytes of a
1476 msg. If the message has a large payload, then setting
1477 up and calling the eBPF program repeatedly for all
1478 bytes, even though the verdict is already known, would
1479 create unnecessary overhead.
1480
1481 When called from within an eBPF program, the helper sets
1482 a counter internal to the BPF infrastructure, that is
1483 used to apply the last verdict to the next bytes. If
1484 bytes is smaller than the current data being processed
1485 from a sendmsg() or sendfile() system call, the first
1486 bytes will be sent and the eBPF program will be re-run
1487 with the pointer for start of data pointing to byte num‐
1488 ber bytes + 1. If bytes is larger than the current data
1489 being processed, then the eBPF verdict will be applied to
1490 multiple sendmsg() or sendfile() calls until bytes are
1491 consumed.
1492
1493 Note that if a socket closes with the internal counter
1494 holding a non-zero value, this is not a problem because
1495 data is not being buffered for bytes and is sent as it is
1496 received.
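
The counter behavior described above can be pictured with a toy user-space model. It is a simplification under stated assumptions: toy_count_verdict_runs() is a hypothetical name, every run of the verdict program is assumed to call bpf_msg_apply_bytes() with the same value, and each array element stands for the data of one sendmsg() call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: count how often a verdict program would run when every run
 * applies its verdict to the next 'apply' bytes and the sends arrive
 * with the given sizes. */
static unsigned int toy_count_verdict_runs(const uint32_t *sizes, size_t n,
					   uint32_t apply)
{
	uint32_t covered = 0;   /* bytes still covered by the last verdict */
	unsigned int runs = 0;

	assert(apply > 0);
	for (size_t i = 0; i < n; i++) {
		uint32_t left = sizes[i];

		while (left > 0) {
			if (covered == 0) {     /* nothing pending: run program */
				runs++;
				covered = apply;
			}
			uint32_t take = left < covered ? left : covered;
			left -= take;
			covered -= take;
		}
	}
	return runs;
}
```

With three 4-byte sends and apply set to 10, the model runs the program twice: once for the first 10 bytes, which span all three sends, and once more for the 2 bytes that remain in the third send.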
1497
1498 Return 0
1499
1500 long bpf_msg_cork_bytes(struct sk_msg_buff *msg, u32 bytes)
1501
1502 Description
1503 For socket policies, prevent the execution of the verdict
1504 eBPF program for message msg until bytes (byte number)
1505 have been accumulated.
1506
1507 This can be used when one needs a specific number of
1508 bytes before a verdict can be assigned, even if the data
1509 spans multiple sendmsg() or sendfile() calls. The extreme
1510 case would be a user calling sendmsg() repeatedly with
1511 1-byte long message segments. Obviously, this is bad for
1512 performance, but it is still valid. If the eBPF program
1513 needs bytes bytes to validate a header, this helper can
1514 be used to prevent the eBPF program from being called again
1515 until bytes have been accumulated.
1516
1517 Return 0
1518
1519 long bpf_msg_pull_data(struct sk_msg_buff *msg, u32 start, u32 end, u64
1520 flags)
1521
1522 Description
1523 For socket policies, pull in non-linear data from user
1524 space for msg and set pointers msg->data and
1525 msg->data_end to start and end bytes offsets into msg,
1526 respectively.
1527
1528 If a program of type BPF_PROG_TYPE_SK_MSG is run on a msg
1529 it can only parse data that the (data, data_end) pointers
1530 have already consumed. For sendmsg() hooks this is likely
1531 the first scatterlist element. But for calls relying on
1532 the sendpage handler (e.g. sendfile()) this will be the
1533 range (0, 0) because the data is shared with user space
1534 and by default the objective is to avoid allowing user
1535 space to modify data while (or after) eBPF verdict is be‐
1536 ing decided. This helper can be used to pull in data and
1537 to set the start and end pointer to given values. Data
1538 will be copied if necessary (i.e. if data was not linear
1539 and if start and end pointers do not point to the same
1540 chunk).
1541
1542 A call to this helper is susceptible to change the under‐
1543 lying packet buffer. Therefore, at load time, all checks
1544 on pointers previously done by the verifier are invali‐
1545 dated and must be performed again, if the helper is used
1546 in combination with direct packet access.
1547
1548 All values for flags are reserved for future usage, and
1549 must be left at zero.
1550
1551 Return 0 on success, or a negative error in case of failure.
1552
1553 long bpf_bind(struct bpf_sock_addr *ctx, struct sockaddr *addr, int
1554 addr_len)
1555
1556 Description
1557 Bind the socket associated to ctx to the address pointed
1558 by addr, of length addr_len. This allows for making
1559 outgoing connections from the desired IP address, which can
1560 be useful for example when all processes inside a cgroup
1561 should use one single IP address on a host that has
1562 multiple IP addresses configured.
1563
1564 This helper works for IPv4 and IPv6, TCP and UDP sockets.
1565 The domain (addr->sa_family) must be AF_INET (or
1566 AF_INET6). It's advised to pass zero port (sin_port or
1567 sin6_port) which triggers IP_BIND_ADDRESS_NO_PORT-like
1568 behavior and lets the kernel efficiently pick up an un‐
1569 used port as long as 4-tuple is unique. Passing non-zero
1570 port might lead to degraded performance.
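
The address structure recommended above (AF_INET family, zero port) can be sketched in user-space C. fill_bind_addr() is a hypothetical helper, and inet_pton() is used here only for readability; an eBPF program would typically store the address as a constant in network byte order instead.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Fill an IPv4 address with port 0, as advised for bpf_bind().
 * Returns 0 on success, -1 if 'ip' is not a valid dotted-quad string. */
static int fill_bind_addr(struct sockaddr_in *sa, const char *ip)
{
	memset(sa, 0, sizeof(*sa));
	sa->sin_family = AF_INET;
	sa->sin_port = 0;               /* let the kernel pick the port */
	return inet_pton(AF_INET, ip, &sa->sin_addr) == 1 ? 0 : -1;
}
```

Inside a connect hook, a structure filled this way would be passed as bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)).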
1571
1572 Return 0 on success, or a negative error in case of failure.
1573
1574 long bpf_xdp_adjust_tail(struct xdp_buff *xdp_md, int delta)
1575
1576 Description
1577 Adjust (move) xdp_md->data_end by delta bytes. It is
1578 possible to both shrink and grow the packet tail. Shrinking
1579 is done by passing a negative integer for delta.
1580
1581 A call to this helper is susceptible to change the under‐
1582 lying packet buffer. Therefore, at load time, all checks
1583 on pointers previously done by the verifier are invali‐
1584 dated and must be performed again, if the helper is used
1585 in combination with direct packet access.
1586
1587 Return 0 on success, or a negative error in case of failure.
1588
1589 long bpf_skb_get_xfrm_state(struct sk_buff *skb, u32 index, struct
1590 bpf_xfrm_state *xfrm_state, u32 size, u64 flags)
1591
1592 Description
1593 Retrieve the XFRM state (IP transform framework, see also
1594 ip-xfrm(8)) at index in XFRM "security path" for skb.
1595
1596 The retrieved value is stored in the struct
1597 bpf_xfrm_state pointed by xfrm_state and of length size.
1598
1599 All values for flags are reserved for future usage, and
1600 must be left at zero.
1601
1602 This helper is available only if the kernel was compiled
1603 with CONFIG_XFRM configuration option.
1604
1605 Return 0 on success, or a negative error in case of failure.
1606
1607 long bpf_get_stack(void *ctx, void *buf, u32 size, u64 flags)
1608
1609 Description
1610 Return a user or a kernel stack in a buffer provided by
1611 the BPF program. To achieve this, the helper needs ctx,
1612 which is a pointer to the context on which the tracing
1613 program is executed. To store the stacktrace, the BPF
1614 program provides buf with a nonnegative size.
1615
1616 The last argument, flags, holds the number of stack
1617 frames to skip (from 0 to 255), masked with
1618 BPF_F_SKIP_FIELD_MASK. The next bits can be used to set
1619 the following flags:
1620
1621 BPF_F_USER_STACK
1622 Collect a user space stack instead of a kernel
1623 stack.
1624
1625 BPF_F_USER_BUILD_ID
1626 Collect build ID + offset instead of instruction
1627 pointers for the user stack; only valid if
1628 BPF_F_USER_STACK is also specified.
1629
1630 bpf_get_stack() can collect up to PERF_MAX_STACK_DEPTH
1631 frames, both kernel and user, given a sufficiently large
1632 buffer size. Note that this limit can be controlled with
1633 the sysctl program, and that it should be manually in‐
1634 creased in order to profile long user stacks (such as
1635 stacks for Java programs). To do so, use:
1636
1637 # sysctl kernel.perf_event_max_stack=<new value>
1638
1639 Return A non-negative value equal to or less than size on suc‐
1640 cess, or a negative error in case of failure.
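
The layout of flags described above can be sketched in plain C. The three constants below repeat the values from the kernel's uapi <linux/bpf.h>; they are redefined locally (an assumption of this sketch) so that it compiles without kernel headers. The function make_stack_flags is hypothetical.

```c
#include <stdint.h>

/* Local copies of the uapi flag values, repeated here so that the
 * sketch compiles without kernel headers. */
#define BPF_F_SKIP_FIELD_MASK 0xffULL
#define BPF_F_USER_STACK      (1ULL << 8)
#define BPF_F_USER_BUILD_ID   (1ULL << 11)

/* Hypothetical helper: build the flags argument for bpf_get_stack().
 * Returns 0 and stores the result in *flags, or -1 if the request
 * is inconsistent. */
static int make_stack_flags(unsigned int skip, int user, int build_id,
			    uint64_t *flags)
{
	if (skip > BPF_F_SKIP_FIELD_MASK)
		return -1;	/* only 0..255 frames can be skipped */
	if (build_id && !user)
		return -1;	/* build IDs require a user stack */

	*flags = skip & BPF_F_SKIP_FIELD_MASK;
	if (user)
		*flags |= BPF_F_USER_STACK;
	if (build_id)
		*flags |= BPF_F_USER_BUILD_ID;
	return 0;
}
```

For instance, skipping two frames of a user stack yields the low byte 2 combined with the BPF_F_USER_STACK bit.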
1641
1642 long bpf_skb_load_bytes_relative(const void *skb, u32 offset, void *to,
1643 u32 len, u32 start_header)
1644
1645 Description
1646 This helper is similar to bpf_skb_load_bytes() in that it
1647 provides an easy way to load len bytes from offset from
1648 the packet associated to skb, into the buffer pointed by
1649 to. The difference to bpf_skb_load_bytes() is that a
1650 fifth argument start_header exists in order to select a
1651 base offset to start from. start_header can be one of:
1652
1653 BPF_HDR_START_MAC
1654 Base offset to load data from is skb's mac header.
1655
1656 BPF_HDR_START_NET
1657 Base offset to load data from is skb's network
1658 header.
1659
1660 In general, "direct packet access" is the preferred
1661 method to access packet data; however, this helper is
1662 particularly useful in socket filters where skb->data does
1663 not always point to the start of the mac header and where
1664 "direct packet access" is not available.
1665
1666 Return 0 on success, or a negative error in case of failure.
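
The effect of start_header can be modelled in user space. The sketch below is not the kernel implementation: struct fake_pkt and load_bytes_relative are hypothetical stand-ins for an skb and for the helper, and the two enum values repeat the kernel's enum bpf_hdr_start_off locally so that the code compiles without kernel headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values as in the kernel's enum bpf_hdr_start_off, repeated
 * locally for this sketch. */
enum { BPF_HDR_START_MAC = 0, BPF_HDR_START_NET = 1 };

/* Hypothetical stand-in for an skb: a flat buffer plus the offsets
 * of its MAC and network headers. */
struct fake_pkt {
	const uint8_t *data;
	size_t len;
	size_t mac_off;
	size_t net_off;
};

/* Copy len bytes found at offset, counted from the chosen header,
 * into to. Returns 0 on success, -1 on a bad argument or when the
 * read would run past the end of the buffer. */
static int load_bytes_relative(const struct fake_pkt *p, size_t offset,
			       void *to, size_t len, int start_header)
{
	size_t base;

	if (start_header == BPF_HDR_START_MAC)
		base = p->mac_off;
	else if (start_header == BPF_HDR_START_NET)
		base = p->net_off;
	else
		return -1;

	if (base > p->len || offset > p->len - base ||
	    len > p->len - base - offset)
		return -1;

	memcpy(to, p->data + base + offset, len);
	return 0;
}
```

With a 14-byte Ethernet header, the same offset reads different bytes depending on whether the base is the MAC header or the network header.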
1667
1668 long bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen,
1669 u32 flags)
1670
1671 Description
1672 Do FIB lookup in kernel tables using parameters in
1673 params. If lookup is successful and result shows packet
1674 is to be forwarded, the neighbor tables are searched for
1675 the nexthop. If successful (i.e., FIB lookup shows
1676 forwarding and nexthop is resolved), the nexthop address is
1677 returned in ipv4_dst or ipv6_dst based on family, smac is
1678 set to mac address of egress device, dmac is set to nex‐
1679 thop mac address, rt_metric is set to metric from route
1680 (IPv4/IPv6 only), and ifindex is set to the device index
1681 of the nexthop from the FIB lookup.
1682
1683 plen argument is the size of the passed in struct. flags
1684 argument can be a combination of one or more of the fol‐
1685 lowing values:
1686
1687 BPF_FIB_LOOKUP_DIRECT
1688 Do a direct table lookup vs full lookup using FIB
1689 rules.
1690
1691 BPF_FIB_LOOKUP_OUTPUT
1692 Perform lookup from an egress perspective (default
1693 is ingress).
1694
1695 ctx is either struct xdp_md for XDP programs or struct
1696 sk_buff for tc cls_act programs.
1697
1698 Return
1699
1700 • < 0 if any input argument is invalid
1701
1702 • 0 on success (packet is forwarded, nexthop neighbor ex‐
1703 ists)
1704
1705 • > 0 one of BPF_FIB_LKUP_RET_ codes explaining why the
1706 packet is not forwarded or needs assist from full stack
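
A caller typically branches on the three ranges of the return value listed above. The following minimal sketch shows that dispatch in plain C; the enum fib_outcome and the function classify_fib_ret are hypothetical names introduced for the example and are not kernel constants.

```c
/* Hypothetical outcome labels for this sketch; they are not kernel
 * constants. */
enum fib_outcome { FIB_BAD_ARGS, FIB_FORWARD, FIB_NOT_FORWARDED };

/* Map the return value of bpf_fib_lookup() onto the three cases
 * listed above. */
static enum fib_outcome classify_fib_ret(long ret)
{
	if (ret < 0)
		return FIB_BAD_ARGS;	/* invalid input argument */
	if (ret == 0)
		return FIB_FORWARD;	/* params holds nexthop data */
	return FIB_NOT_FORWARDED;	/* a BPF_FIB_LKUP_RET_ code */
}
```

Only in the FIB_FORWARD case are the smac, dmac and ifindex fields of params meaningful for forwarding the packet.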
1707
1708 long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map
1709 *map, void *key, u64 flags)
1710
1711 Description
1712 Add an entry to, or update, a sockhash map referencing
1713 sockets. The skops is used as a new value for the entry
1714 associated to key. flags is one of:
1715
1716 BPF_NOEXIST
1717 The entry for key must not exist in the map.
1718
1719 BPF_EXIST
1720 The entry for key must already exist in the map.
1721
1722 BPF_ANY
1723 No condition on the existence of the entry for
1724 key.
1725
1726 If the map has eBPF programs (parser and verdict), those
1727 will be inherited by the socket being added. If the
1728 socket is already attached to eBPF programs, this results
1729 in an error.
1730
1731 Return 0 on success, or a negative error in case of failure.
1732
1733 long bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map
1734 *map, void *key, u64 flags)
1735
1736 Description
1737 This helper is used in programs implementing policies at
1738 the socket level. If the message msg is allowed to pass
1739 (i.e. if the verdict eBPF program returns SK_PASS), redi‐
1740 rect it to the socket referenced by map (of type
1741 BPF_MAP_TYPE_SOCKHASH) using hash key. Both ingress and
1742 egress interfaces can be used for redirection. The
1743 BPF_F_INGRESS value in flags is used to make the distinc‐
1744 tion (ingress path is selected if the flag is present,
1745 egress path otherwise). This is the only flag supported
1746 for now.
1747
1748 Return SK_PASS on success, or SK_DROP on error.
1749
1750 long bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map,
1751 void *key, u64 flags)
1752
1753 Description
1754 This helper is used in programs implementing policies at
1755 the skb socket level. If the sk_buff skb is allowed to
1756 pass (i.e. if the verdict eBPF program returns SK_PASS),
1757 redirect it to the socket referenced by map (of type
1758 BPF_MAP_TYPE_SOCKHASH) using hash key. Both ingress and
1759 egress interfaces can be used for redirection. The
1760 BPF_F_INGRESS value in flags is used to make the distinc‐
1761 tion (ingress path is selected if the flag is present,
1762 egress otherwise). This is the only flag supported for
1763 now.
1764
1765 Return SK_PASS on success, or SK_DROP on error.
1766
1767 long bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32
1768 len)
1769
1770 Description
1771 Encapsulate the packet associated to skb within a Layer 3
1772 protocol header. This header is provided in the buffer at
1773 address hdr, with len its size in bytes. type indicates
1774 the protocol of the header and can be one of:
1775
1776 BPF_LWT_ENCAP_SEG6
1777 IPv6 encapsulation with Segment Routing Header
1778 (struct ipv6_sr_hdr). hdr only contains the SRH,
1779 the IPv6 header is computed by the kernel.
1780
1781 BPF_LWT_ENCAP_SEG6_INLINE
1782 Only works if skb contains an IPv6 packet. Insert
1783 a Segment Routing Header (struct ipv6_sr_hdr) in‐
1784 side the IPv6 header.
1785
1786 BPF_LWT_ENCAP_IP
1787 IP encapsulation (GRE/GUE/IPIP/etc). The outer
1788 header must be IPv4 or IPv6, followed by zero or
1789 more additional headers, up to LWT_BPF_MAX_HEAD‐
1790 ROOM total bytes in all prepended headers. Please
1791 note that if skb_is_gso(skb) is true, no more than
1792 two headers can be prepended, and the inner
1793 header, if present, should be either GRE or
1794 UDP/GUE.
1795
1796 BPF_LWT_ENCAP_SEG6* types can be called by BPF programs
1797 of type BPF_PROG_TYPE_LWT_IN; BPF_LWT_ENCAP_IP type can
1798 be called by bpf programs of types BPF_PROG_TYPE_LWT_IN
1799 and BPF_PROG_TYPE_LWT_XMIT.
1800
1801 A call to this helper may change the underlying
1802 packet buffer. Therefore, at load time, all checks
1803 on pointers previously done by the verifier are invali‐
1804 dated and must be performed again, if the helper is used
1805 in combination with direct packet access.
1806
1807 Return 0 on success, or a negative error in case of failure.
1808
1809 long bpf_lwt_seg6_store_bytes(struct sk_buff *skb, u32 offset, const
1810 void *from, u32 len)
1811
1812 Description
1813 Store len bytes from address from into the packet associ‐
1814 ated to skb, at offset. Only the flags, tag and TLVs in‐
1815 side the outermost IPv6 Segment Routing Header can be
1816 modified through this helper.
1817
1818 A call to this helper may change the underlying
1819 packet buffer. Therefore, at load time, all checks
1820 on pointers previously done by the verifier are invali‐
1821 dated and must be performed again, if the helper is used
1822 in combination with direct packet access.
1823
1824 Return 0 on success, or a negative error in case of failure.
1825
1826 long bpf_lwt_seg6_adjust_srh(struct sk_buff *skb, u32 offset, s32
1827 delta)
1828
1829 Description
1830 Adjust the size allocated to TLVs in the outermost IPv6
1831 Segment Routing Header contained in the packet associated
1832 to skb, at position offset by delta bytes. Only offsets
1833 after the segments are accepted. delta can be either
1834 positive (growing) or negative (shrinking).
1835
1836 A call to this helper may change the underlying
1837 packet buffer. Therefore, at load time, all checks
1838 on pointers previously done by the verifier are invali‐
1839 dated and must be performed again, if the helper is used
1840 in combination with direct packet access.
1841
1842 Return 0 on success, or a negative error in case of failure.
1843
1844 long bpf_lwt_seg6_action(struct sk_buff *skb, u32 action, void *param,
1845 u32 param_len)
1846
1847 Description
1848 Apply an IPv6 Segment Routing action of type action to
1849 the packet associated to skb. Each action takes a parame‐
1850 ter contained at address param, and of length param_len
1851 bytes. action can be one of:
1852
1853 SEG6_LOCAL_ACTION_END_X
1854 End.X action: Endpoint with Layer-3 cross-connect.
1855 Type of param: struct in6_addr.
1856
1857 SEG6_LOCAL_ACTION_END_T
1858 End.T action: Endpoint with specific IPv6 table
1859 lookup. Type of param: int.
1860
1861 SEG6_LOCAL_ACTION_END_B6
1862 End.B6 action: Endpoint bound to an SRv6 policy.
1863 Type of param: struct ipv6_sr_hdr.
1864
1865 SEG6_LOCAL_ACTION_END_B6_ENCAP
1866 End.B6.Encap action: Endpoint bound to an SRv6 en‐
1867 capsulation policy. Type of param: struct
1868 ipv6_sr_hdr.
1869
1870 A call to this helper may change the underlying
1871 packet buffer. Therefore, at load time, all checks
1872 on pointers previously done by the verifier are invali‐
1873 dated and must be performed again, if the helper is used
1874 in combination with direct packet access.
1875
1876 Return 0 on success, or a negative error in case of failure.
1877
1878 long bpf_rc_repeat(void *ctx)
1879
1880 Description
1881 This helper is used in programs implementing IR decoding,
1882 to report a successfully decoded repeat key message. This
1883 delays the generation of a key up event for the previously
1884 generated key down event.
1885
1886 Some IR protocols like NEC have a special IR message for
1887 repeating the last button, for when a button is held down.
1888
1889 The ctx should point to the lirc sample as passed into
1890 the program.
1891
1892 This helper is only available if the kernel was compiled
1893 with the CONFIG_BPF_LIRC_MODE2 configuration option set
1894 to "y".
1895
1896 Return 0
1897
1898 long bpf_rc_keydown(void *ctx, u32 protocol, u64 scancode, u32 toggle)
1899
1900 Description
1901 This helper is used in programs implementing IR decoding,
1902 to report a successfully decoded key press with scancode,
1903 toggle value in the given protocol. The scancode will be
1904 translated to a keycode using the rc keymap, and reported
1905 as an input key down event. After a period a key up event
1906 is generated. This period can be extended by calling ei‐
1907 ther bpf_rc_keydown() again with the same values, or
1908 calling bpf_rc_repeat().
1909
1910 Some protocols include a toggle bit, in case the button
1911 was released and pressed again between consecutive scan‐
1912 codes.
1913
1914 The ctx should point to the lirc sample as passed into
1915 the program.
1916
1917 The protocol is the decoded protocol number (see enum
1918 rc_proto for some predefined values).
1919
1920 This helper is only available if the kernel was compiled
1921 with the CONFIG_BPF_LIRC_MODE2 configuration option set
1922 to "y".
1923
1924 Return 0
1925
1926 u64 bpf_skb_cgroup_id(struct sk_buff *skb)
1927
1928 Description
1929 Return the cgroup v2 id of the socket associated with the
1930 skb. This is roughly similar to the bpf_get_cgroup_classid()
1931 helper for cgroup v1 in that it provides a tag or
1932 identifier that can be matched on or used for map lookups,
1933 e.g. to implement policy. The cgroup v2 id of a given
1934 path in the hierarchy is exposed in user space through
1935 the f_handle API in order to get to the same 64-bit id.
1936
1937 This helper can be used on TC egress path, but not on
1938 ingress, and is available only if the kernel was compiled
1939 with the CONFIG_SOCK_CGROUP_DATA configuration option.
1940
1941 Return The id is returned or 0 in case the id could not be re‐
1942 trieved.
1943
1944 u64 bpf_get_current_cgroup_id(void)
1945
1946 Return A 64-bit integer containing the current cgroup id based
1947 on the cgroup within which the current task is running.
1948
1949 void *bpf_get_local_storage(void *map, u64 flags)
1950
1951 Description
1952 Get the pointer to the local storage area. The type and
1953 the size of the local storage is defined by the map argu‐
1954 ment. The flags meaning is specific for each map type,
1955 and has to be 0 for cgroup local storage.
1956
1957 Depending on the BPF program type, a local storage area
1958 can be shared between multiple instances of the BPF pro‐
1959 gram, running simultaneously.
1960
1961 Users must take care of synchronization themselves,
1962 for example by using the BPF_STX_XADD instruction to
1963 alter the shared data.
1964
1965 Return A pointer to the local storage area.
1966
1967 long bpf_sk_select_reuseport(struct sk_reuseport_md *reuse, struct
1968 bpf_map *map, void *key, u64 flags)
1969
1970 Description
1971 Select a SO_REUSEPORT socket from a
1972 BPF_MAP_TYPE_REUSEPORT_ARRAY map. It checks that the selected
1973 socket matches the incoming request in the socket buffer.
1974
1975 Return 0 on success, or a negative error in case of failure.
1976
1977 u64 bpf_skb_ancestor_cgroup_id(struct sk_buff *skb, int ancestor_level)
1978
1979 Description
1980 Return the id of the cgroup v2 that is the ancestor, at
1981 ancestor_level, of the cgroup associated with the skb. The
1982 root cgroup is at ancestor_level zero and each step down the
1983 hierarchy increments the level. If ancestor_level equals the
1984 level of the cgroup associated with skb, the return value is
1985 the same as that of bpf_skb_cgroup_id().
1986
1987 The helper is useful to implement policies based on
1988 cgroups that are higher in the hierarchy than the immediate
1989 cgroup associated with skb.
1990
1991 The format of the returned id and the helper limitations are
1992 the same as in bpf_skb_cgroup_id().
1993
1994 Return The id is returned or 0 in case the id could not be re‐
1995 trieved.
1996
1997 struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple
1998 *tuple, u32 tuple_size, u64 netns, u64 flags)
1999
2000 Description
2001 Look for TCP socket matching tuple, optionally in a child
2002 network namespace netns. The return value must be
2003 checked, and if non-NULL, released via bpf_sk_release().
2004
2005 The ctx should point to the context of the program, such
2006 as the skb or socket (depending on the hook in use). This
2007 is used to determine the base network namespace for the
2008 lookup.
2009
2010 tuple_size must be one of:
2011
2012 sizeof(tuple->ipv4)
2013 Look for an IPv4 socket.
2014
2015 sizeof(tuple->ipv6)
2016 Look for an IPv6 socket.
2017
2018 If the netns is a negative signed 32-bit integer, then
2019 the socket lookup table in the netns associated with the
2020 ctx will be used. For the TC hooks, this is the netns of
2021 the device in the skb. For socket hooks, this is the
2022 netns of the socket. If netns is any other signed 32-bit
2023 value greater than or equal to zero then it specifies the
2024 ID of the netns relative to the netns associated with the
2025 ctx. netns values beyond the range of 32-bit integers are
2026 reserved for future use.
2027
2028 All values for flags are reserved for future usage, and
2029 must be left at zero.
2030
2031 This helper is available only if the kernel was compiled
2032 with CONFIG_NET configuration option.
2033
2034 Return Pointer to struct bpf_sock, or NULL in case of failure.
2035 For sockets with reuseport option, the struct bpf_sock
2036 result is from reuse->socks[] using the hash of the tu‐
2037 ple.
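
The three ranges of netns described above can be written down as a small classifier. This sketch follows the wording of this page only: the u64 is read as a sign-extended value, and the kernel's own check may treat some corner cases differently. The enum netns_kind and the function classify_netns are hypothetical names; the kernel header provides BPF_F_CURRENT_NETNS (-1) for the "current netns" case.

```c
#include <stdint.h>

/* Hypothetical labels for this sketch; not kernel constants. */
enum netns_kind { NETNS_CURRENT, NETNS_BY_ID, NETNS_RESERVED };

/* Classify a netns argument following the wording above. Anything
 * that fits a negative s32 selects the caller's netns, anything
 * that fits a non-negative s32 is a netns ID, and the rest is
 * reserved. */
static enum netns_kind classify_netns(uint64_t netns)
{
	/* INT32_MIN sign-extended to 64 bits */
	const uint64_t neg32_min = (uint64_t)(int64_t)INT32_MIN;

	if (netns <= (uint64_t)INT32_MAX)
		return NETNS_BY_ID;
	if (netns >= neg32_min)
		return NETNS_CURRENT;
	return NETNS_RESERVED;
}
```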
2038
2039 struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple
2040 *tuple, u32 tuple_size, u64 netns, u64 flags)
2041
2042 Description
2043 Look for UDP socket matching tuple, optionally in a child
2044 network namespace netns. The return value must be
2045 checked, and if non-NULL, released via bpf_sk_release().
2046
2047 The ctx should point to the context of the program, such
2048 as the skb or socket (depending on the hook in use). This
2049 is used to determine the base network namespace for the
2050 lookup.
2051
2052 tuple_size must be one of:
2053
2054 sizeof(tuple->ipv4)
2055 Look for an IPv4 socket.
2056
2057 sizeof(tuple->ipv6)
2058 Look for an IPv6 socket.
2059
2060 If the netns is a negative signed 32-bit integer, then
2061 the socket lookup table in the netns associated with the
2062 ctx will be used. For the TC hooks, this is the netns of
2063 the device in the skb. For socket hooks, this is the
2064 netns of the socket. If netns is any other signed 32-bit
2065 value greater than or equal to zero then it specifies the
2066 ID of the netns relative to the netns associated with the
2067 ctx. netns values beyond the range of 32-bit integers are
2068 reserved for future use.
2069
2070 All values for flags are reserved for future usage, and
2071 must be left at zero.
2072
2073 This helper is available only if the kernel was compiled
2074 with CONFIG_NET configuration option.
2075
2076 Return Pointer to struct bpf_sock, or NULL in case of failure.
2077 For sockets with reuseport option, the struct bpf_sock
2078 result is from reuse->socks[] using the hash of the tu‐
2079 ple.
2080
2081 long bpf_sk_release(struct bpf_sock *sock)
2082
2083 Description
2084 Release the reference held by sock. sock must be a
2085 non-NULL pointer that was returned from
2086 bpf_sk_lookup_xxx().
2087
2088 Return 0 on success, or a negative error in case of failure.
2089
2090 long bpf_map_push_elem(struct bpf_map *map, const void *value, u64
2091 flags)
2092
2093 Description
2094 Push an element value in map. flags is one of:
2095
2096 BPF_EXIST
2097 If the queue/stack is full, the oldest element is
2098 removed to make room for this.
2099
2100 Return 0 on success, or a negative error in case of failure.
2101
2102 long bpf_map_pop_elem(struct bpf_map *map, void *value)
2103
2104 Description
2105 Pop an element from map.
2106
2107 Return 0 on success, or a negative error in case of failure.
2108
2109 long bpf_map_peek_elem(struct bpf_map *map, void *value)
2110
2111 Description
2112 Get an element from map without removing it.
2113
2114 Return 0 on success, or a negative error in case of failure.
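
The interplay of push, pop and peek, including the BPF_EXIST eviction, can be modelled in user space. The sketch below is an assumption-laden model of a FIFO queue only (not a stack, and not the kernel implementation): struct queue_model, QCAP and the q_* functions are hypothetical, and -1 stands in for whatever negative error the kernel returns.

```c
#include <stdint.h>

#define BPF_EXIST 2	/* value from the kernel's uapi <linux/bpf.h> */
#define QCAP 3		/* hypothetical capacity (max_entries) */

/* Hypothetical user-space model of a queue map holding u32 values. */
struct queue_model {
	uint32_t slot[QCAP];
	unsigned int head;	/* index of the oldest element */
	unsigned int count;
};

static int q_push(struct queue_model *q, uint32_t value, uint64_t flags)
{
	if (q->count == QCAP) {
		if (!(flags & BPF_EXIST))
			return -1;	/* full and no eviction requested */
		q->head = (q->head + 1) % QCAP;	/* drop the oldest */
		q->count--;
	}
	q->slot[(q->head + q->count) % QCAP] = value;
	q->count++;
	return 0;
}

static int q_peek(const struct queue_model *q, uint32_t *value)
{
	if (q->count == 0)
		return -1;
	*value = q->slot[q->head];
	return 0;
}

static int q_pop(struct queue_model *q, uint32_t *value)
{
	if (q_peek(q, value) < 0)
		return -1;
	q->head = (q->head + 1) % QCAP;
	q->count--;
	return 0;
}
```

Pushing a fourth value into a full three-slot queue fails without BPF_EXIST; with it, the oldest value is dropped and the new one is appended.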
2115
2116 long bpf_msg_push_data(struct sk_msg_buff *msg, u32 start, u32 len, u64
2117 flags)
2118
2119 Description
2120 For socket policies, insert len bytes into msg at offset
2121 start.
2122
2123 If a program of type BPF_PROG_TYPE_SK_MSG is run on a msg
2124 it may want to insert metadata or options into the msg.
2125 This can later be read and used by any of the lower layer
2126 BPF hooks.
2127
2128 This helper may fail under memory pressure (if an
2129 allocation fails). In that case the BPF program gets an
2130 appropriate error and needs to handle it.
2131
2132 Return 0 on success, or a negative error in case of failure.
2133
2134 long bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 len, u64
2135 flags)
2136
2137 Description
2138 Remove len bytes from msg, starting at byte start.
2139 This may result in ENOMEM errors under certain situations
2140 if an allocation and copy are required due to a full ring
2141 buffer. However, the helper will try to avoid doing the
2142 allocation if possible. Other errors can occur if input
2143 parameters are invalid, either because the start byte is not
2144 a valid part of the msg payload or because the pop value is
2145 too large.
2146
2147 Return 0 on success, or a negative error in case of failure.
2148
2149 long bpf_rc_pointer_rel(void *ctx, s32 rel_x, s32 rel_y)
2150
2151 Description
2152 This helper is used in programs implementing IR decoding,
2153 to report a successfully decoded pointer movement.
2154
2155 The ctx should point to the lirc sample as passed into
2156 the program.
2157
2158 This helper is only available if the kernel was compiled
2159 with the CONFIG_BPF_LIRC_MODE2 configuration option set
2160 to "y".
2161
2162 Return 0
2163
2164 long bpf_spin_lock(struct bpf_spin_lock *lock)
2165
2166 Description
2167 Acquire a spinlock represented by the pointer lock, which
2168 is stored as part of a value of a map. Taking the lock
2169 makes it safe to update the rest of the fields in that
2170 value. The spinlock can (and must) later be released with
2171 a call to bpf_spin_unlock(lock).
2172
2173 Spinlocks in BPF programs come with a number of restric‐
2174 tions and constraints:
2175
2176 • bpf_spin_lock objects are only allowed inside maps of
2177 types BPF_MAP_TYPE_HASH and BPF_MAP_TYPE_ARRAY (this
2178 list could be extended in the future).
2179
2180 • BTF description of the map is mandatory.
2181
2182 • The BPF program can take ONE lock at a time, since
2183 taking two or more could cause deadlocks.
2184
2185 • Only one struct bpf_spin_lock is allowed per map ele‐
2186 ment.
2187
2188 • When the lock is taken, calls (either BPF to BPF or
2189 helpers) are not allowed.
2190
2191 • The BPF_LD_ABS and BPF_LD_IND instructions are not al‐
2192 lowed inside a spinlock-ed region.
2193
2194 • The BPF program MUST call bpf_spin_unlock() to release
2195 the lock, on all execution paths, before it returns.
2196
2197 • The BPF program can access struct bpf_spin_lock only
2198 via the bpf_spin_lock() and bpf_spin_unlock() helpers.
2199 Loading or storing data into the struct bpf_spin_lock
2200 lock; field of a map is not allowed.
2201
2202 • To use the bpf_spin_lock() helper, the BTF description
2203 of the map value must be a struct and have struct
2204 bpf_spin_lock anyname; field at the top level. Nested
2205 lock inside another struct is not allowed.
2206
2207 • The struct bpf_spin_lock lock field in a map value must
2208 be aligned on a multiple of 4 bytes in that value.
2209
2210 • Syscall with command BPF_MAP_LOOKUP_ELEM does not copy
2211 the bpf_spin_lock field to user space.
2212
2213 • Syscall with command BPF_MAP_UPDATE_ELEM, or update
2214 from a BPF program, do not update the bpf_spin_lock
2215 field.
2216
2217 • bpf_spin_lock cannot be on the stack or inside a
2218 networking packet (it can only be inside a map value).
2219
2220 • bpf_spin_lock is available to root only.
2221
2222 • Tracing programs and socket filter programs cannot use
2223 bpf_spin_lock() due to insufficient preemption checks
2224 (but this may change in the future).
2225
2226 • bpf_spin_lock is not allowed in inner maps of
2227 map-in-map.
2228
2229 Return 0
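
Several of the layout constraints above (a single lock per element, at the top level of the value, on a 4-byte boundary) can be illustrated with a map value definition. In this sketch, struct bpf_spin_lock is a local stand-in for the kernel's uapi definition, repeated so that the code compiles without kernel headers, and struct counter_value is a hypothetical map value.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the kernel's struct bpf_spin_lock (a single
 * 32-bit word in the uapi header). */
struct bpf_spin_lock {
	uint32_t val;
};

/* Hypothetical map value: the lock sits at the top level of the
 * struct, appears once, and guards the remaining fields. */
struct counter_value {
	struct bpf_spin_lock lock;
	uint64_t packets;
	uint64_t bytes;
};

/* The lock field must sit at an offset that is a multiple of 4. */
_Static_assert(offsetof(struct counter_value, lock) % 4 == 0,
	       "bpf_spin_lock must be 4-byte aligned in the map value");
```

In a BPF program, updates to packets and bytes would be bracketed by bpf_spin_lock() and bpf_spin_unlock() on the lock field, with no other calls in between.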
2230
2231 long bpf_spin_unlock(struct bpf_spin_lock *lock)
2232
2233 Description
2234 Release the lock previously locked by a call to
2235 bpf_spin_lock(lock).
2236
2237 Return 0
2238
2239 struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)
2240
2241 Description
2242 This helper gets a struct bpf_sock pointer such that all
2243 the fields in this bpf_sock can be accessed.
2244
2245 Return A struct bpf_sock pointer on success, or NULL in case of
2246 failure.
2247
2248 struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk)
2249
2250 Description
2251 This helper gets a struct bpf_tcp_sock pointer from a
2252 struct bpf_sock pointer.
2253
2254 Return A struct bpf_tcp_sock pointer on success, or NULL in case
2255 of failure.
2256
2257 long bpf_skb_ecn_set_ce(struct sk_buff *skb)
2258
2259 Description
2260 Set ECN (Explicit Congestion Notification) field of IP
2261 header to CE (Congestion Encountered) if current value is
2262 ECT (ECN Capable Transport). Otherwise, do nothing. Works
2263 with IPv6 and IPv4.
2264
2265 Return 1 if the CE flag is set (either by the current helper
2266 call or because it was already present), 0 if it is not
2267 set.
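
The rule above can be modelled on the IPv4 TOS byte, where the ECN field occupies the two low-order bits (RFC 3168). This is a hypothetical user-space model rather than the kernel code: set_ce is an invented name, and the sketch ignores the IPv4 header checksum, which a real implementation has to keep consistent.

```c
#include <stdint.h>

/* ECN codepoints in the two low-order bits of the IPv4 TOS byte
 * (RFC 3168). */
#define ECN_MASK    0x3
#define ECN_NOT_ECT 0x0
#define ECN_CE      0x3

/* Hypothetical model of the helper's effect on the TOS byte.
 * Returns 1 if CE is set afterwards, 0 if the packet is not
 * ECN-capable and was left alone. */
static int set_ce(uint8_t *tos)
{
	if ((*tos & ECN_MASK) == ECN_NOT_ECT)
		return 0;
	*tos |= ECN_CE;	/* ECT(0), ECT(1) and CE all end up as CE */
	return 1;
}
```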
2268
2269 struct bpf_sock *bpf_get_listener_sock(struct bpf_sock *sk)
2270
2271 Description
2272 Return a struct bpf_sock pointer in TCP_LISTEN state.
2273 bpf_sk_release() is unnecessary and not allowed.
2274
2275 Return A struct bpf_sock pointer on success, or NULL in case of
2276 failure.
2277
2278 struct bpf_sock *bpf_skc_lookup_tcp(void *ctx, struct bpf_sock_tuple
2279 *tuple, u32 tuple_size, u64 netns, u64 flags)
2280
2281 Description
2282 Look for TCP socket matching tuple, optionally in a child
2283 network namespace netns. The return value must be
2284 checked, and if non-NULL, released via bpf_sk_release().
2285
2286 This function is identical to bpf_sk_lookup_tcp(), except
2287 that it also returns timewait or request sockets. Use
2288 bpf_sk_fullsock() or bpf_tcp_sock() to access the full
2289 structure.
2290
2291 This helper is available only if the kernel was compiled
2292 with CONFIG_NET configuration option.
2293
2294 Return Pointer to struct bpf_sock, or NULL in case of failure.
2295 For sockets with reuseport option, the struct bpf_sock
2296 result is from reuse->socks[] using the hash of the tu‐
2297 ple.
2298
2299 long bpf_tcp_check_syncookie(struct bpf_sock *sk, void *iph, u32
2300 iph_len, struct tcphdr *th, u32 th_len)
2301
2302 Description
2303 Check whether iph and th contain a valid SYN cookie ACK
2304 for the listening socket in sk.
2305
2306 iph points to the start of the IPv4 or IPv6 header, while
2307 iph_len contains sizeof(struct iphdr) or sizeof(struct
2308 ipv6hdr).
2309
2310 th points to the start of the TCP header, while th_len
2311 contains sizeof(struct tcphdr).
2312
2313 Return 0 if iph and th are a valid SYN cookie ACK, or a negative
2314 error otherwise.
2315
2316 long bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t
2317 buf_len, u64 flags)
2318
2319 Description
2320 Get name of sysctl in /proc/sys/ and copy it into the
2321 program-provided buffer buf of size buf_len.
2322
2323 The buffer is always NUL terminated, unless it's
2324 zero-sized.
2325
2326 If flags is zero, full name (e.g. "net/ipv4/tcp_mem") is
2327 copied. Use BPF_F_SYSCTL_BASE_NAME flag to copy base name
2328 only (e.g. "tcp_mem").
2329
2330 Return Number of characters copied (not including the trailing
2331 NUL).
2332
2333 -E2BIG if the buffer wasn't big enough (buf will contain
2334 truncated name in this case).
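
The copy semantics described above (full name versus base name, NUL termination, -E2BIG on truncation) can be modelled in user space. The function copy_sysctl_name below is a hypothetical model, not the kernel implementation; in particular, returning -E2BIG for a zero-sized buffer is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model of the copy semantics described
 * above. Copies either the full name or only its last path
 * component into buf, always NUL-terminating a non-empty buffer.
 * Returns the number of characters copied, or -E2BIG on truncation. */
static long copy_sysctl_name(const char *name, int base_only,
			     char *buf, size_t buf_len)
{
	const char *src = name;
	size_t len, n;

	if (base_only) {
		const char *slash = strrchr(name, '/');
		if (slash)
			src = slash + 1;
	}
	len = strlen(src);

	if (buf_len == 0)
		return -E2BIG;	/* assumption: nothing fits at all */

	n = len < buf_len ? len : buf_len - 1;
	memcpy(buf, src, n);
	buf[n] = '\0';
	return len < buf_len ? (long)n : -E2BIG;
}
```

With base_only set, "net/ipv4/tcp_mem" is reduced to "tcp_mem", which mirrors the effect of BPF_F_SYSCTL_BASE_NAME.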
2335
2336 long bpf_sysctl_get_current_value(struct bpf_sysctl *ctx, char *buf,
2337 size_t buf_len)
2338
2339 Description
2340 Get current value of sysctl as it is presented in
2341 /proc/sys (incl. newline, etc), and copy it as a string
2342 into the program-provided buffer buf of size buf_len.
2343
2344 The whole value is copied, no matter what file position
2345 user space issued e.g. sys_read at.
2346
2347 The buffer is always NUL terminated, unless it's
2348 zero-sized.
2349
2350 Return Number of characters copied (not including the trailing
2351 NUL).
2352
2353 -E2BIG if the buffer wasn't big enough (buf will contain
2354 truncated value in this case).
2355
2356 -EINVAL if current value was unavailable, e.g. because
2357 sysctl is uninitialized and read returns -EIO for it.
2358
2359 long bpf_sysctl_get_new_value(struct bpf_sysctl *ctx, char *buf, size_t
2360 buf_len)
2361
2362 Description
2363 Get new value being written by user space to sysctl (be‐
2364 fore the actual write happens) and copy it as a string
2365 into the program-provided buffer buf of size buf_len.
2366
2367 User space may write new value at file position > 0.
2368
2369 The buffer is always NUL terminated, unless it's
2370 zero-sized.
2371
2372 Return Number of characters copied (not including the trailing
2373 NUL).
2374
2375 -E2BIG if the buffer wasn't big enough (buf will contain
2376 truncated value in this case).
2377
2378 -EINVAL if sysctl is being read.
2379
2380 long bpf_sysctl_set_new_value(struct bpf_sysctl *ctx, const char *buf,
2381 size_t buf_len)
2382
2383 Description
2384 Override new value being written by user space to sysctl
2385 with value provided by program in buffer buf of size
2386 buf_len.
2387
2388 buf should contain a string in the same form as provided by
2389 user space on sysctl write.
2390
2391 User space may write a new value at file position > 0. To
2392 override the whole sysctl value, the file position should be
2393 set to zero.
2394
2395 Return 0 on success.
2396
2397 -E2BIG if the buf_len is too big.
2398
2399 -EINVAL if sysctl is being read.
2400
2401 long bpf_strtol(const char *buf, size_t buf_len, u64 flags, long *res)
2402
2403 Description
2404 Convert the initial part of the string from buffer buf of
2405 size buf_len to a long integer according to the given
2406 base and save the result in res.
2407
2408 The string may begin with an arbitrary amount of white
2409 space (as determined by isspace(3)) followed by a single
2410 optional '-' sign.
2411
2412 Five least significant bits of flags encode base, other
2413 bits are currently unused.
2414
2415 Base must be either 8, 10, 16 or 0 to detect it automati‐
2416 cally similar to user space strtol(3).
2417
2418 Return Number of characters consumed on success. Must be posi‐
2419 tive but no more than buf_len.
2420
2421 -EINVAL if no valid digits were found or unsupported base
2422 was provided.
2423
2424 -ERANGE if resulting value was out of range.
2425
2426 long bpf_strtoul(const char *buf, size_t buf_len, u64 flags, unsigned
2427 long *res)
2428
2429 Description
2430 Convert the initial part of the string from buffer buf of
2431 size buf_len to an unsigned long integer according to the
2432 given base and save the result in res.
2433
2434 The string may begin with an arbitrary amount of white
2435 space (as determined by isspace(3)).
2436
2437 Five least significant bits of flags encode base, other
2438 bits are currently unused.
2439
2440 Base must be either 8, 10, 16 or 0 to detect it automati‐
2441 cally similar to user space strtoul(3).
2442
2443 Return Number of characters consumed on success. Must be posi‐
2444 tive but no more than buf_len.
2445
2446 -EINVAL if no valid digits were found or unsupported base
2447 was provided.
2448
2449 -ERANGE if resulting value was out of range.
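
For both bpf_strtol() and bpf_strtoul(), the base travels in the five least significant bits of flags. The sketch below shows one way to build and read back such a value; the names BASE_MASK, make_strtol_flags and strtol_base are hypothetical and introduced only for this example.

```c
#include <stdint.h>

/* The base occupies the five least significant bits of flags. */
#define BASE_MASK 0x1fULL

/* Hypothetical helper: build the flags argument for bpf_strtol() or
 * bpf_strtoul(). Returns 0 and stores the flags, or -1 for a base
 * the helpers do not accept. */
static int make_strtol_flags(unsigned int base, uint64_t *flags)
{
	if (base != 0 && base != 8 && base != 10 && base != 16)
		return -1;
	*flags = base & BASE_MASK;	/* all other bits stay zero */
	return 0;
}

/* Recover the base from a flags value. */
static unsigned int strtol_base(uint64_t flags)
{
	return (unsigned int)(flags & BASE_MASK);
}
```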
2450
2451 void *bpf_sk_storage_get(struct bpf_map *map, struct bpf_sock *sk, void
2452 *value, u64 flags)
2453
2454 Description
2455 Get a bpf-local-storage from a sk.
2456
2457 Logically, it could be thought of as getting the value from
2458 a map with sk as the key. From this perspective, the
2459 usage is not much different from bpf_map_lookup_elem(map,
2460 &sk), except that this helper enforces that the key must be
2461 a full socket and that the map must also be a
2462 BPF_MAP_TYPE_SK_STORAGE.
2463
2464 Underneath, the value is stored locally at sk instead of
2465 the map. The map is used as the bpf-local-storage
2466 "type". The bpf-local-storage "type" (i.e. the map) is
2467 searched against all bpf-local-storages residing at sk.
2468
2469 An optional flag (BPF_SK_STORAGE_GET_F_CREATE) can be
2470 used such that a new bpf-local-storage will be created if
2471 one does not exist. value can be used together with
2472 BPF_SK_STORAGE_GET_F_CREATE to specify the initial value
2473 of a bpf-local-storage. If value is NULL, the new
2474 bpf-local-storage will be zero initialized.
2475
2476 Return A bpf-local-storage pointer is returned on success.
2477
2478 NULL if not found or there was an error in adding a new
2479 bpf-local-storage.
2480
2481 long bpf_sk_storage_delete(struct bpf_map *map, struct bpf_sock *sk)
2482
2483 Description
2484 Delete a bpf-local-storage from a sk.
2485
2486 Return 0 on success.
2487
2488 -ENOENT if the bpf-local-storage cannot be found.
2489
2490 long bpf_send_signal(u32 sig)
2491
2492 Description
2493 Send signal sig to the process of the current task. The
2494 signal may be delivered to any of this process's threads.
2495
2496 Return 0 on success or successfully queued.
2497
2498 -EBUSY if work queue under nmi is full.
2499
2500 -EINVAL if sig is invalid.
2501
2502 -EPERM if no permission to send the sig.
2503
2504 -EAGAIN if bpf program can try again.
2505
2506 s64 bpf_tcp_gen_syncookie(struct bpf_sock *sk, void *iph, u32 iph_len,
2507 struct tcphdr *th, u32 th_len)
2508
2509 Description
2510 Try to issue a SYN cookie for the packet with correspond‐
2511 ing IP/TCP headers, iph and th, on the listening socket
2512 in sk.
2513
2514 iph points to the start of the IPv4 or IPv6 header, while
2515 iph_len contains sizeof(struct iphdr) or sizeof(struct
2516 ipv6hdr).
2517
2518 th points to the start of the TCP header, while th_len
2519 contains the length of the TCP header.
2520
2521 Return On success, the lower 32 bits hold the generated SYN
2522 cookie, followed by 16 bits which hold the MSS value for
2523 that cookie; the top 16 bits are unused.
2524
2525 On failure, the returned value is one of the following:
2526
2527 -EINVAL SYN cookie cannot be issued due to error
2528
2529 -ENOENT SYN cookie should not be issued (no SYN flood)
2530
2531 -EOPNOTSUPP kernel configuration does not enable SYN
2532 cookies
2533
2534 -EPROTONOSUPPORT IP packet version is not 4 or 6
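
The layout of the return value can be unpacked as in the sketch below: the low 32 bits are taken as the cookie and the next 16 bits as the MSS, while negative values are passed through as errors. The struct syncookie_result type and the decode_syncookie() function are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* Hypothetical container for the two fields packed in the return value. */
struct syncookie_result {
	uint32_t cookie;	/* lower 32 bits */
	uint16_t mss;		/* next 16 bits */
};

/*
 * Split a value returned by bpf_tcp_gen_syncookie().
 * Returns 0 and fills *out on success; on failure the negative
 * error is returned unchanged and *out is left untouched.
 */
static int decode_syncookie(int64_t ret, struct syncookie_result *out)
{
	uint64_t v;

	if (ret < 0)
		return (int)ret;

	v = (uint64_t)ret;
	out->cookie = (uint32_t)(v & 0xffffffffULL);
	out->mss = (uint16_t)((v >> 32) & 0xffffULL);
	return 0;
}
```

Checking the sign before masking matters here, since a negative error would otherwise be misread as a cookie.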
2535
2536 long bpf_skb_output(void *ctx, struct bpf_map *map, u64 flags, void
2537 *data, u64 size)
2538
2539 Description
2540 Write raw data blob into a special BPF perf event held by
2541 map of type BPF_MAP_TYPE_PERF_EVENT_ARRAY. This perf
2542 event must have the following attributes: PERF_SAMPLE_RAW
2543 as sample_type, PERF_TYPE_SOFTWARE as type, and
2544 PERF_COUNT_SW_BPF_OUTPUT as config.
2545
2546 The flags are used to indicate the index in map for which
2547 the value must be put, masked with BPF_F_INDEX_MASK. Al‐
2548 ternatively, flags can be set to BPF_F_CURRENT_CPU to in‐
2549 dicate that the index of the current CPU core should be
2550 used.
2551
2552 The value to write, size bytes long, is passed through
2553 the eBPF stack and pointed to by data.
2554
2555 ctx is a pointer to in-kernel struct sk_buff.
2556
2557 This helper is similar to bpf_perf_event_output() but re‐
2558 stricted to raw_tracepoint bpf programs.
2559
2560 Return 0 on success, or a negative error in case of failure.
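
The way flags selects a slot in map can be modelled in user space as follows. The constants are believed to match include/uapi/linux/bpf.h but should be checked against the headers in use; output_index() is a hypothetical function written for this example and is not part of any API.

```c
#include <stdint.h>

/* Values believed to match include/uapi/linux/bpf.h; verify locally. */
#ifndef BPF_F_INDEX_MASK
#define BPF_F_INDEX_MASK  0xffffffffULL
#endif
#ifndef BPF_F_CURRENT_CPU
#define BPF_F_CURRENT_CPU BPF_F_INDEX_MASK
#endif

/*
 * Hypothetical model of the index selection: the low bits of flags
 * name a slot in the perf event array, unless they equal
 * BPF_F_CURRENT_CPU, in which case the current CPU number is used.
 */
static uint64_t output_index(uint64_t flags, uint32_t current_cpu)
{
	uint64_t index = flags & BPF_F_INDEX_MASK;

	if (index == BPF_F_CURRENT_CPU)
		index = current_cpu;
	return index;
}
```

In practice most programs pass BPF_F_CURRENT_CPU, so that each CPU writes to its own perf buffer.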
2561
2562 long bpf_probe_read_user(void *dst, u32 size, const void *unsafe_ptr)
2563
2564 Description
2565 Safely attempt to read size bytes from user space address
2566 unsafe_ptr and store the data in dst.
2567
2568 Return 0 on success, or a negative error in case of failure.
2569
2570 long bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
2571
2572 Description
2573 Safely attempt to read size bytes from kernel space ad‐
2574 dress unsafe_ptr and store the data in dst.
2575
2576 Return 0 on success, or a negative error in case of failure.
2577
2578 long bpf_probe_read_user_str(void *dst, u32 size, const void *un‐
2579 safe_ptr)
2580
2581 Description
2582 Copy a NUL terminated string from an unsafe user address
2583 unsafe_ptr to dst. The size should include the terminat‐
2584 ing NUL byte. In case the string length is smaller than
2585 size, the target is not padded with further NUL bytes. If
2586 the string length is larger than size, just size-1 bytes
2587 are copied and the last byte is set to NUL.
2588
2589 On success, the length of the copied string is returned.
2590 This makes this helper useful in tracing programs for
2591 reading strings, and more importantly to get its length
2592 at runtime. See the following snippet:
2593
2594 SEC("kprobe/sys_open")
2595 void bpf_sys_open(struct pt_regs *ctx)
2596 {
2597 char buf[PATHLEN]; // PATHLEN is defined to 256
2598 int res = bpf_probe_read_user_str(buf, sizeof(buf),
2599 ctx->di);
2600
2601 // Consume buf, for example push it to
2602 // userspace via bpf_perf_event_output(); we
2603 // can use res (the string length) as event
2604 // size, after checking its boundaries.
2605 }
2606
2607 In comparison, using the bpf_probe_read_user() helper
2608 here instead to read the string would require estimating
2609 the length at compile time, and would often result in
2610 copying more memory than necessary.
2611
2612 Another useful use case is when parsing individual
2613 process arguments or individual environment variables
2614 navigating current->mm->arg_start and cur‐
2615 rent->mm->env_start: using this helper and the return
2616 value, one can quickly iterate at the right offset of the
2617 memory area.
2618
2619 Return On success, the strictly positive length of the string,
2620 including the trailing NUL character. On error, a nega‐
2621 tive value.
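
The copy and return-value semantics described above, and the way the return value can be used to step through a NUL-separated area such as process arguments, can be sketched with a plain user-space model. model_read_str() and count_strings() are hypothetical stand-ins written for this example; they do not call the real helper and perform no fault handling.

```c
#include <stddef.h>

/*
 * Hypothetical user-space model of the string helper: copy at most
 * size - 1 bytes, always NUL-terminate, do not pad, and return the
 * copied length including the trailing NUL.
 */
static long model_read_str(char *dst, size_t size, const char *src)
{
	size_t i;

	if (size == 0)
		return 0;
	for (i = 0; i + 1 < size && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';
	return (long)(i + 1);
}

/*
 * Count the NUL-separated strings stored in area[0..len), advancing
 * by the returned length each time. Assumes the area ends with a NUL
 * and that every string fits in buf; a truncated read would leave
 * the offset in the middle of a string.
 */
static int count_strings(const char *area, size_t len)
{
	char buf[16];
	size_t off = 0;
	int n = 0;

	while (off < len) {
		long res = model_read_str(buf, sizeof(buf), area + off);

		if (res <= 0)
			break;
		off += (size_t)res;
		n++;
	}
	return n;
}
```

Because the returned length already accounts for the terminating NUL, adding it to the offset lands exactly on the start of the next string.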
2622
2623 long bpf_probe_read_kernel_str(void *dst, u32 size, const void *un‐
2624 safe_ptr)
2625
2626 Description
2627 Copy a NUL terminated string from an unsafe kernel ad‐
2628 dress unsafe_ptr to dst. Same semantics as with
2629 bpf_probe_read_user_str() apply.
2630
2631 Return On success, the strictly positive length of the string,
2632 including the trailing NUL character. On error, a nega‐
2633 tive value.
2634
2635 long bpf_tcp_send_ack(void *tp, u32 rcv_nxt)
2636
2637 Description
2638 Send out a tcp-ack. tp is the in-kernel struct tcp_sock.
2639 rcv_nxt is the ack_seq to be sent out.
2640
2641 Return 0 on success, or a negative error in case of failure.
2642
2643 long bpf_send_signal_thread(u32 sig)
2644
2645 Description
2646 Send signal sig to the thread corresponding to the cur‐
2647 rent task.
2648
2649 Return 0 on success or successfully queued.
2650
2651 -EBUSY if work queue under nmi is full.
2652
2653 -EINVAL if sig is invalid.
2654
2655 -EPERM if no permission to send the sig.
2656
2657 -EAGAIN if bpf program can try again.
2658
2659 u64 bpf_jiffies64(void)
2660
2661 Description
2662 Obtain the 64-bit jiffies value.
2663
2664 Return The 64-bit jiffies value.
2665
2666 long bpf_read_branch_records(struct bpf_perf_event_data *ctx, void
2667 *buf, u32 size, u64 flags)
2668
2669 Description
2670 For an eBPF program attached to a perf event, retrieve
2671 the branch records (struct perf_branch_entry) associated
2672 with ctx and store them in the buffer pointed to by buf,
2673 up to size bytes.
2674
2675 Return On success, number of bytes written to buf. On error, a
2676 negative value.
2677
2678 The flags can be set to BPF_F_GET_BRANCH_RECORDS_SIZE to
2679 instead return the number of bytes required to store all
2680 the branch entries. If this flag is set, buf may be NULL.
2681
2682 -EINVAL if arguments invalid or size not a multiple of
2683 sizeof(struct perf_branch_entry).
2684
2685 -ENOENT if architecture does not support branch records.
2686
2687 long bpf_get_ns_current_pid_tgid(u64 dev, u64 ino, struct
2688 bpf_pidns_info *nsdata, u32 size)
2689
2690 Description
2691 Returns 0 on success, values for pid and tgid as seen
2692 from the current namespace will be returned in nsdata.
2693
2694 Return 0 on success, or one of the following in case of failure:
2695
2696 -EINVAL if dev and ino supplied don't match dev_t and
2697 inode number with nsfs of current task, or if dev conver‐
2698 sion to dev_t lost high bits.
2699
2700 -ENOENT if pidns does not exist for the current task.
2701
2702 long bpf_xdp_output(void *ctx, struct bpf_map *map, u64 flags, void
2703 *data, u64 size)
2704
2705 Description
2706 Write raw data blob into a special BPF perf event held by
2707 map of type BPF_MAP_TYPE_PERF_EVENT_ARRAY. This perf
2708 event must have the following attributes: PERF_SAMPLE_RAW
2709 as sample_type, PERF_TYPE_SOFTWARE as type, and
2710 PERF_COUNT_SW_BPF_OUTPUT as config.
2711
2712 The flags are used to indicate the index in map for which
2713 the value must be put, masked with BPF_F_INDEX_MASK. Al‐
2714 ternatively, flags can be set to BPF_F_CURRENT_CPU to in‐
2715 dicate that the index of the current CPU core should be
2716 used.
2717
2718 The value to write, size bytes long, is passed through
2719 the eBPF stack and pointed to by data.
2720
2721 ctx is a pointer to in-kernel struct xdp_buff.
2722
2723 This helper is similar to bpf_perf_event_output() but
2724 restricted to raw_tracepoint bpf programs.
2725
2726 Return 0 on success, or a negative error in case of failure.
2727
2728 u64 bpf_get_netns_cookie(void *ctx)
2729
2730 Description
2731 Retrieve the cookie (generated by the kernel) of the net‐
2732 work namespace the input ctx is associated with. The net‐
2733 work namespace cookie remains stable for its lifetime and
2734 provides a global identifier that can be assumed unique.
2735 If ctx is NULL, then the helper returns the cookie for
2736 the initial network namespace. The cookie itself is very
2737 similar to that of bpf_get_socket_cookie() helper, but
2738 for network namespaces instead of sockets.
2739
2740 Return An 8-byte long opaque number.
2741
2742 u64 bpf_get_current_ancestor_cgroup_id(int ancestor_level)
2743
2744 Description
2745 Return id of cgroup v2 that is ancestor of the cgroup as‐
2746 sociated with the current task at the ancestor_level. The
2747 root cgroup is at ancestor_level zero and each step down
2748 the hierarchy increments the level. If ancestor_level ==
2749 level of cgroup associated with the current task, then
2750 return value will be the same as that of bpf_get_cur‐
2751 rent_cgroup_id().
2752
2753 The helper is useful to implement policies based on
2754 cgroups that are higher in the hierarchy than the
2755 immediate cgroup associated with the current task.
2756
2757 The format of the returned id and the helper limitations
2758 are the same as in bpf_get_current_cgroup_id().
2759
2760 Return The id is returned or 0 in case the id could not be re‐
2761 trieved.
2762
2763 long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
2764
2765 Description
2766 Helper is overloaded depending on BPF program type. This
2767 description applies to BPF_PROG_TYPE_SCHED_CLS and
2768 BPF_PROG_TYPE_SCHED_ACT programs.
2769
2770 Assign the sk to the skb. When combined with appropriate
2771 routing configuration to receive the packet towards the
2772 socket, will cause skb to be delivered to the specified
2773 socket. Subsequent redirection of skb via bpf_redi‐
2774 rect(), bpf_clone_redirect() or other methods outside of
2775 BPF may interfere with successful delivery to the socket.
2776
2777 This operation is only valid from TC ingress path.
2778
2779 The flags argument must be zero.
2780
2781 Return 0 on success, or a negative error in case of failure:
2782
2783 -EINVAL if specified flags are not supported.
2784
2785 -ENOENT if the socket is unavailable for assignment.
2786
2787 -ENETUNREACH if the socket is unreachable (wrong netns).
2788
2789 -EOPNOTSUPP if the operation is not supported, for exam‐
2790 ple a call from outside of TC ingress.
2791
2792 -ESOCKTNOSUPPORT if the socket type is not supported
2793 (reuseport).
2794
2795 long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64
2796 flags)
2797
2798 Description
2799 Helper is overloaded depending on BPF program type. This
2800 description applies to BPF_PROG_TYPE_SK_LOOKUP programs.
2801
2802 Select the sk as a result of a socket lookup.
2803
2804 For the operation to succeed, the passed socket must be
2805 compatible with the packet description provided by the
2806 ctx object.
2807
2808 The L4 protocol (IPPROTO_TCP or IPPROTO_UDP) must be an
2809 exact match, while the IP family (AF_INET or AF_INET6)
2810 must be compatible; that is, IPv6 sockets that are not
2811 v6-only can be selected for IPv4 packets.
2812
2813 Only TCP listeners and UDP unconnected sockets can be se‐
2814 lected. sk can also be NULL to reset any previous selec‐
2815 tion.
2816
2817 The flags argument can be a combination of the following values:
2818
2819 • BPF_SK_LOOKUP_F_REPLACE to override the previous socket
2820 selection, potentially done by a BPF program that ran
2821 before us.
2822
2823 • BPF_SK_LOOKUP_F_NO_REUSEPORT to skip load-balancing
2824 within reuseport group for the socket being selected.
2825
2826 On success ctx->sk will point to the selected socket.
2827
2828 Return 0 on success, or a negative errno in case of failure.
2829
2830 • -EAFNOSUPPORT if socket family (sk->family) is not com‐
2831 patible with packet family (ctx->family).
2832
2833 • -EEXIST if socket has been already selected, poten‐
2834 tially by another program, and BPF_SK_LOOKUP_F_REPLACE
2835 flag was not specified.
2836
2837 • -EINVAL if unsupported flags were specified.
2838
2839 • -EPROTOTYPE if socket L4 protocol (sk->protocol)
2840 doesn't match packet protocol (ctx->protocol).
2841
2842 • -ESOCKTNOSUPPORT if socket is not in allowed state (TCP
2843 listening or UDP unconnected).
2844
2845 u64 bpf_ktime_get_boot_ns(void)
2846
2847 Description
2848 Return the time elapsed since system boot, in
2849 nanoseconds, including the time the system was suspended.
2850 See: clock_gettime(CLOCK_BOOTTIME)
2851
2852 Return Current ktime.
2853
2854 long bpf_seq_printf(struct seq_file *m, const char *fmt, u32 fmt_size,
2855 const void *data, u32 data_len)
2856
2857 Description
2858 bpf_seq_printf() uses seq_file seq_printf() to print out
2859 the format string. The m represents the seq_file. The
2860 fmt and fmt_size are for the format string itself. The
2861 data and data_len are format string arguments. The data
2862 are a u64 array and corresponding format string values
2863 are stored in the array. For strings and pointers where
2864 pointees are accessed, only the pointer values are stored
2865 in the data array. The data_len is the size of data in
2866 bytes.
2867
2868 Formats %s and %p{i,I}{4,6} require reading kernel memory.
2869 Reading kernel memory may fail due to either invalid ad‐
2870 dress or valid address but requiring a major memory
2871 fault. If reading kernel memory fails, the string for %s
2872 will be an empty string, and the ip address for
2873 %p{i,I}{4,6} will be 0. Not returning error to bpf pro‐
2874 gram is consistent with what bpf_trace_printk() does for
2875 now.
2876
2877 Return 0 on success, or a negative error in case of failure:
2878
2879 -EBUSY if per-CPU memory copy buffer is busy, can try
2880 again by returning 1 from bpf program.
2881
2882 -EINVAL if arguments are invalid, or if fmt is in‐
2883 valid/unsupported.
2884
2885 -E2BIG if fmt contains too many format specifiers.
2886
2887 -EOVERFLOW if an overflow happened: The same object will
2888 be tried again.
2889
2890 long bpf_seq_write(struct seq_file *m, const void *data, u32 len)
2891
2892 Description
2893 bpf_seq_write() uses seq_file seq_write() to write the
2894 data. The m represents the seq_file. The data and len
2895 represent the data to write in bytes.
2896
2897 Return 0 on success, or a negative error in case of failure:
2898
2899 -EOVERFLOW if an overflow happened: The same object will
2900 be tried again.
2901
2902 u64 bpf_sk_cgroup_id(struct bpf_sock *sk)
2903
2904 Description
2905 Return the cgroup v2 id of the socket sk.
2906
2907 sk must be a non-NULL pointer to a full socket, e.g. one
2908 returned from bpf_sk_lookup_xxx(), bpf_sk_fullsock(),
2909 etc. The format of returned id is same as in
2910 bpf_skb_cgroup_id().
2911
2912 This helper is available only if the kernel was compiled
2913 with the CONFIG_SOCK_CGROUP_DATA configuration option.
2914
2915 Return The id is returned or 0 in case the id could not be re‐
2916 trieved.
2917
2918 u64 bpf_sk_ancestor_cgroup_id(struct bpf_sock *sk, int ancestor_level)
2919
2920 Description
2921 Return id of cgroup v2 that is ancestor of cgroup associ‐
2922 ated with the sk at the ancestor_level. The root cgroup
2923 is at ancestor_level zero and each step down the hierar‐
2924 chy increments the level. If ancestor_level == level of
2925 cgroup associated with sk, then return value will be same
2926 as that of bpf_sk_cgroup_id().
2927
2928 The helper is useful to implement policies based on
2929 cgroups that are higher in the hierarchy than the
2930 immediate cgroup associated with sk.
2931
2932 The format of the returned id and the helper limitations
2933 are the same as in bpf_sk_cgroup_id().
2934
2935 Return The id is returned or 0 in case the id could not be re‐
2936 trieved.
2937
2938 long bpf_ringbuf_output(void *ringbuf, void *data, u64 size, u64 flags)
2939
2940 Description
2941 Copy size bytes from data into a ring buffer ringbuf. If
2942 BPF_RB_NO_WAKEUP is specified in flags, no notification
2943 of new data availability is sent. If BPF_RB_FORCE_WAKEUP
2944 is specified in flags, notification of new data avail‐
2945 ability is sent unconditionally.
2946
2947 Return 0 on success, or a negative error in case of failure.
2948
2949 void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags)
2950
2951 Description
2952 Reserve size bytes of payload in a ring buffer ringbuf.
2953
2954 Return Valid pointer with size bytes of memory available; NULL,
2955 otherwise.
2956
2957 void bpf_ringbuf_submit(void *data, u64 flags)
2958
2959 Description
2960 Submit reserved ring buffer sample, pointed to by data.
2961 If BPF_RB_NO_WAKEUP is specified in flags, no notifica‐
2962 tion of new data availability is sent. If
2963 BPF_RB_FORCE_WAKEUP is specified in flags, notification
2964 of new data availability is sent unconditionally.
2965
2966 Return Nothing. Always succeeds.
2967
2968 void bpf_ringbuf_discard(void *data, u64 flags)
2969
2970 Description
2971 Discard reserved ring buffer sample, pointed to by data.
2972 If BPF_RB_NO_WAKEUP is specified in flags, no notifica‐
2973 tion of new data availability is sent. If
2974 BPF_RB_FORCE_WAKEUP is specified in flags, notification
2975 of new data availability is sent unconditionally.
2976
2977 Return Nothing. Always succeeds.
2978
2979 u64 bpf_ringbuf_query(void *ringbuf, u64 flags)
2980
2981 Description
2982 Query various characteristics of provided ring buffer.
2983 What exactly is queried is determined by flags:
2984
2985 • BPF_RB_AVAIL_DATA: Amount of data not yet consumed.
2986
2987 • BPF_RB_RING_SIZE: The size of ring buffer.
2988
2989 • BPF_RB_CONS_POS: Consumer position (can wrap around).
2990
2991 • BPF_RB_PROD_POS: Producer(s) position (can wrap
2992 around).
2993
2994 Data returned is just a momentary snapshot of actual val‐
2995 ues and could be inaccurate, so this facility should be
2996 used to power heuristics and for reporting, not to make
2997 100% correct calculation.
2998
2999 Return Requested value, or 0, if flags are not recognized.
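
Since the positions can wrap around, a heuristic built on them is safest with unsigned arithmetic. The sketch below assumes that the amount of unconsumed data equals the producer position minus the consumer position, which this page does not state explicitly; unconsumed_bytes() and fill_percent() are hypothetical names used only here.

```c
#include <stdint.h>

/*
 * Assumed relation: unconsumed data is the distance from the consumer
 * position to the producer position. Unsigned subtraction is modulo
 * 2^64, so the result stays correct after a position wraps around.
 */
static uint64_t unconsumed_bytes(uint64_t prod_pos, uint64_t cons_pos)
{
	return prod_pos - cons_pos;
}

/*
 * Rough fill level in percent, for reporting only. The inputs are
 * momentary snapshots, so the value is clamped rather than trusted.
 * The multiplication assumes ring sizes far below 2^64 / 100.
 */
static unsigned int fill_percent(uint64_t avail, uint64_t ring_size)
{
	if (ring_size == 0)
		return 0;
	if (avail > ring_size)
		avail = ring_size;
	return (unsigned int)(avail * 100 / ring_size);
}
```

Clamping reflects the caveat above: two snapshots taken at slightly different times may not be mutually consistent.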
3000
3001 long bpf_csum_level(struct sk_buff *skb, u64 level)
3002
3003 Description
3004 Change the skb's checksum level by one layer up or down,
3005 or reset it entirely to none in order to have the stack
3006 perform checksum validation. The level is applicable to
3007 the following protocols: TCP, UDP, GRE, SCTP, FCOE. For
3008 example, a decap of | ETH | IP | UDP | GUE | IP | TCP |
3009 into | ETH | IP | TCP | through bpf_skb_adjust_room()
3010 helper with passing in BPF_F_ADJ_ROOM_NO_CSUM_RESET flag
3011 would require one call to bpf_csum_level() with
3012 BPF_CSUM_LEVEL_DEC since the UDP header is removed. Simi‐
3013 larly, an encap of the latter into the former could be
3014 accompanied by a helper call to bpf_csum_level() with
3015 BPF_CSUM_LEVEL_INC if the skb is still intended to be
3016 processed in higher layers of the stack instead of just
3017 egressing at tc.
3018
3019 There are four supported level settings at this time:
3020
3021 • BPF_CSUM_LEVEL_INC: Increases skb->csum_level for skbs
3022 with CHECKSUM_UNNECESSARY.
3023
3024 • BPF_CSUM_LEVEL_DEC: Decreases skb->csum_level for skbs
3025 with CHECKSUM_UNNECESSARY.
3026
3027 • BPF_CSUM_LEVEL_RESET: Resets skb->csum_level to 0 and
3028 sets CHECKSUM_NONE to force checksum validation by the
3029 stack.
3030
3031 • BPF_CSUM_LEVEL_QUERY: No-op, returns the current
3032 skb->csum_level.
3033
3034 Return 0 on success, or a negative error in case of failure. In
3035 the case of BPF_CSUM_LEVEL_QUERY, the current
3036 skb->csum_level is returned or the error code -EACCES in
3037 case the skb is not subject to CHECKSUM_UNNECESSARY.
3038
3039 struct tcp6_sock *bpf_skc_to_tcp6_sock(void *sk)
3040
3041 Description
3042 Dynamically cast a sk pointer to a tcp6_sock pointer.
3043
3044 Return sk if casting is valid, or NULL otherwise.
3045
3046 struct tcp_sock *bpf_skc_to_tcp_sock(void *sk)
3047
3048 Description
3049 Dynamically cast a sk pointer to a tcp_sock pointer.
3050
3051 Return sk if casting is valid, or NULL otherwise.
3052
3053 struct tcp_timewait_sock *bpf_skc_to_tcp_timewait_sock(void *sk)
3054
3055 Description
3056 Dynamically cast a sk pointer to a tcp_timewait_sock
3057 pointer.
3058
3059 Return sk if casting is valid, or NULL otherwise.
3060
3061 struct tcp_request_sock *bpf_skc_to_tcp_request_sock(void *sk)
3062
3063 Description
3064 Dynamically cast a sk pointer to a tcp_request_sock
3065 pointer.
3066
3067 Return sk if casting is valid, or NULL otherwise.
3068
3069 struct udp6_sock *bpf_skc_to_udp6_sock(void *sk)
3070
3071 Description
3072 Dynamically cast a sk pointer to a udp6_sock pointer.
3073
3074 Return sk if casting is valid, or NULL otherwise.
3075
3076 long bpf_get_task_stack(struct task_struct *task, void *buf, u32 size,
3077 u64 flags)
3078
3079 Description
3080 Return a user or a kernel stack in bpf program provided
3081 buffer. To achieve this, the helper needs task, which is
3082 a valid pointer to struct task_struct. To store the
3083 stacktrace, the bpf program provides buf with a nonnega‐
3084 tive size.
3085
3086 The last argument, flags, holds the number of stack
3087 frames to skip (from 0 to 255), masked with
3088 BPF_F_SKIP_FIELD_MASK. The next bits can be used to set
3089 the following flags:
3090
3091 BPF_F_USER_STACK
3092 Collect a user space stack instead of a kernel
3093 stack.
3094
3095 BPF_F_USER_BUILD_ID
3096 Collect buildid+offset instead of ips for user
3097 stack, only valid if BPF_F_USER_STACK is also
3098 specified.
3099
3100 bpf_get_task_stack() can collect up to
3101 PERF_MAX_STACK_DEPTH kernel and user frames, subject to a
3102 sufficiently large buffer size. Note that this limit can
3103 be controlled with the sysctl program, and that it should
3104 be manually increased in order to profile long user
3105 stacks (such as stacks for Java programs). To do so, use:
3106
3107 # sysctl kernel.perf_event_max_stack=<new value>
3108
3109 Return A non-negative value equal to or less than size on suc‐
3110 cess, or a negative error in case of failure.
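
The flags layout for bpf_get_task_stack() can be assembled as in the following user-space sketch: the skip count occupies the low eight bits, and the option bits sit above it. The constant values are believed to match include/uapi/linux/bpf.h but should be verified against the headers in use; make_stack_flags() is a hypothetical function written for this example.

```c
#include <stdint.h>

/* Values believed to match include/uapi/linux/bpf.h; verify locally. */
#ifndef BPF_F_SKIP_FIELD_MASK
#define BPF_F_SKIP_FIELD_MASK 0xffULL
#endif
#ifndef BPF_F_USER_STACK
#define BPF_F_USER_STACK      (1ULL << 8)
#endif
#ifndef BPF_F_USER_BUILD_ID
#define BPF_F_USER_BUILD_ID   (1ULL << 9)
#endif

/*
 * Hypothetical builder for the flags argument. Returns 0 and stores
 * the result in *flags, or -1 if the request is inconsistent:
 * a skip count above 255, or build IDs without a user stack.
 */
static int make_stack_flags(unsigned int skip, int user_stack,
			    int build_id, uint64_t *flags)
{
	uint64_t f;

	if (skip > BPF_F_SKIP_FIELD_MASK)
		return -1;
	if (build_id && !user_stack)
		return -1;

	f = skip & BPF_F_SKIP_FIELD_MASK;
	if (user_stack)
		f |= BPF_F_USER_STACK;
	if (build_id)
		f |= BPF_F_USER_BUILD_ID;
	*flags = f;
	return 0;
}
```

Validating the combination up front mirrors the rule that BPF_F_USER_BUILD_ID is only valid together with BPF_F_USER_STACK.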
3111
3113 Example usage for most of the eBPF helpers listed in this manual page
3114 is available within the Linux kernel sources, at the following
3115 locations:
3116
3117 • samples/bpf/
3118
3119 • tools/testing/selftests/bpf/
3120
3122 eBPF programs can have an associated license, passed along with the
3123 bytecode instructions to the kernel when the programs are loaded. The
3124 format for that string is identical to the one in use for kernel mod‐
3125 ules (Dual licenses, such as "Dual BSD/GPL", may be used). Some helper
3126 functions are only accessible to programs that are compatible with the
3127 GNU General Public License (GPL).
3128
3129 In order to use such helpers, the eBPF program must be loaded with the
3130 correct license string passed (via attr) to the bpf() system call, and
3131 this generally translates into the C source code of the program con‐
3132 taining a line similar to the following:
3133
3134 char ____license[] __attribute__((section("license"), used)) = "GPL";
3135
3137 This manual page is an effort to document the existing eBPF helper
3138 functions. But as of this writing, the BPF sub-system is under heavy
3139 development. New eBPF program or map types are added, along with new
3140 helper functions. Some helpers are occasionally made available for ad‐
3141 ditional program types. So in spite of the efforts of the community,
3142 this page might not be up-to-date. If you want to check by yourself
3143 what helper functions exist in your kernel, or what types of programs
3144 they can support, here are some files among the kernel tree that you
3145 may be interested in:
3146
3147 • include/uapi/linux/bpf.h is the main BPF header. It contains the full
3148 list of all helper functions, as well as many other BPF definitions
3149 including most of the flags, structs or constants used by the
3150 helpers.
3151
3152 • net/core/filter.c contains the definition of most network-related
3153 helper functions, and the list of program types from which they can
3154 be used.
3155
3156 • kernel/trace/bpf_trace.c is the equivalent for most tracing pro‐
3157 gram-related helpers.
3158
3159 • kernel/bpf/verifier.c contains the functions used to check that valid
3160 types of eBPF maps are used with a given helper function.
3161
3162 • kernel/bpf/ directory contains other files in which additional
3163 helpers are defined (for cgroups, sockmaps, etc.).
3164
3165 • The bpftool utility can be used to probe the availability of helper
3166 functions on the system (as well as supported program and map types,
3167 and a number of other parameters). To do so, run bpftool feature
3168 probe (see bpftool-feature(8) for details). Add the unprivileged key‐
3169 word to list features available to unprivileged users.
3170
3171 Compatibility between helper functions and program types can generally
3172 be found in the files where helper functions are defined. Look for the
3173 struct bpf_func_proto objects and for functions returning them: these
3174 functions contain a list of helpers that a given program type can call.
3175 Note that the default: label of the switch ... case used to filter
3176 helpers can call other functions, themselves allowing access to addi‐
3177 tional helpers. The requirement for GPL license is also in those struct
3178 bpf_func_proto.
3179
3180 Compatibility between helper functions and map types can be found in
3181 the check_map_func_compatibility() function in file kernel/bpf/veri‐
3182 fier.c.
3183
3184 Helper functions that invalidate the checks on data and data_end point‐
3185 ers for network processing are listed in function
3186 bpf_helper_changes_pkt_data() in file net/core/filter.c.
3187
3189 bpf(2), bpftool(8), cgroups(7), ip(8), perf_event_open(2), sendmsg(2),
3190 socket(7), tc-bpf(8)
3191
3193 This page is part of release 5.10 of the Linux man-pages project. A
3194 description of the project, information about reporting bugs, and the
3195 latest version of this page, can be found at
3196 https://www.kernel.org/doc/man-pages/.
3197
3198
3199
3200 BPF-HELPERS(7)