BPFTRACE(8) BPFTRACE(8)

6 bpftrace - a high-level tracing language
7
9 bpftrace [OPTIONS] FILENAME
10 bpftrace [OPTIONS] -e 'program code'
11
13 bpftrace is a high-level tracing language and runtime for Linux based
14 on BPF. It supports static and dynamic tracing for both the kernel and
15 user-space.
16
17 When FILENAME is "-", read from stdin.
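
For instance, a program can be piped in from another command. The one-liner below is a minimal sketch; the program text is purely illustrative:

```bpftrace
# echo 'BEGIN { printf("hello from stdin\n"); exit(); }' | bpftrace -
```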
18
20 List all probes with "sleep" in their name
21
22 # bpftrace -l '*sleep*'
23
24 Trace processes calling sleep
25
26 # bpftrace -e 'kprobe:do_nanosleep { printf("%d sleeping\n", pid); }'
27
28 Trace processes calling sleep while spawning sleep 5 as a child process
29
30 # bpftrace -e 'kprobe:do_nanosleep { printf("%d sleeping\n", pid); }' -c 'sleep 5'
31
Supported architectures: x86_64, arm64 and s390x.
34
36 Output format
-B MODE
Set the buffer mode for stdout. Valid values are:
none - No buffering. Each I/O is written as soon as possible.
line - Data is written on the first newline or when the buffer is
full. This is the default mode.
full - Data is written once the buffer is full.

-f FORMAT
Set the output format. Valid values are:
json
text
46
47 -o FILENAME
48 Write bpftrace tracing output to FILENAME instead of stdout. This
49 doesn’t include child process (-c option) output. Errors are still
50 written to stderr.
51
52 --no-warnings
53 Suppress all warning messages created by bpftrace.
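
The output options can be combined. The sketch below writes JSON output to a file while tracing a short-lived child process; the path /tmp/sleep.json is a hypothetical name chosen for illustration:

```bpftrace
# bpftrace -f json -o /tmp/sleep.json -e 'kprobe:do_nanosleep { printf("%d sleeping\n", pid); }' -c 'sleep 1'
```

As described under -o, only bpftrace's own tracing output should end up in the file; anything the child prints still goes to the terminal.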
54
55 Tracing
56 -e PROGRAM
57 Execute PROGRAM instead of reading the program from a file
58
59 -I DIR
60 Add the directory DIR to the search path for C headers. This option
61 can be used multiple times.
62
63 --include FILENAME
Add FILENAME as an include for the pre-processor. This is equal to
adding '#include FILENAME' to the start of the bpftrace program. This
option can be used multiple times.
67
68 -l [SEARCH]
69 List all probes that match the SEARCH pattern. If the pattern is
70 omitted all probes will be listed. This pattern supports wildcards
71 in the same way that probes do. E.g. '-l kprobe:*file*' to list all
72 'kprobes' with 'file' in the name. For more details see the LISTING
73 PROBES section.
74
75 --unsafe
76 Some calls, like 'system', are marked as unsafe as they can have
77 dangerous side effects ('system("rm -rf")') and are disabled by
78 default. This flag allows their use.
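
As an illustration, the sketch below runs a harmless shell command from a BEGIN probe. The choice of 'date' is arbitrary; without --unsafe, bpftrace is expected to reject the program:

```bpftrace
# bpftrace --unsafe -e 'BEGIN { system("date"); exit(); }'
```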
79
80 -k
Errors from bpf-helpers(7) are silently ignored by default, which
can lead to strange results. This flag enables the detection of
errors (except for errors from 'probe_read_*'). When an error occurs,
bpftrace logs a warning containing the source location and the
error code:
86
87 stdin:48-57: WARNING: Failed to probe_read_user_str: Bad address (-14)
88 u:lib.so:"fn(char const*)" { printf("arg0:%s\n", str(arg0));}
89 ~~~~~~~~~
90
91 -kk
92 Same as '-k' but also includes the errors from 'probe_read_*'
93 helpers.
94
95 Process management
96 -p PID
97 Attach to the process with PID. If the process terminates, bpftrace
98 will also terminate. When using USDT probes they will be attached
99 to only this process.
100
101 -c COMMAND
Run COMMAND as a child process. When the child terminates, bpftrace
stops as well, as if 'exit()' had been called. If bpftrace
terminates before the child process does, the child process is
terminated with SIGTERM. If used, 'USDT' probes are only
attached to the child process. To avoid a race condition when
using 'USDT' probes, the child is stopped after 'execve' using
'ptrace(2)' and continued once all 'USDT' probes are attached.
The child PID is available to programs as the 'cpid' builtin.
The child process runs with the same privileges as bpftrace itself
(usually root).
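
A possible use of 'cpid' is to restrict a probe to the child process with a predicate. The following is a sketch; the command 'ls /' and the choice of tracepoint are illustrative assumptions:

```bpftrace
# bpftrace -e 'tracepoint:syscalls:sys_enter_openat /pid == cpid/ { printf("%s\n", str(args.filename)); }' -c 'ls /'
```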
112
113 --usdt-file-activation
Activate USDT semaphores based on file path.
115
116 Miscellaneous
117 --info
118 Print detailed information about features supported by the kernel
119 and the bpftrace build.
120
121 -h, --help
122 Print the help summary
123
124 -V, --version
125 Print bpftrace version information
126
-v
Print verbose messages.

-d
Enable debug mode.

-dd
Enable verbose debug mode.
135
137 Some behavior can only be controlled through environment variables.
138 This section lists all those variables.
139
140 BPFTRACE_STRLEN
141 Default: 64
142
143 Number of bytes allocated on the BPF stack for the string returned by
144 str().
145
146 Make this larger if you wish to read bigger strings with str().
147
148 Beware that the BPF stack is small (512 bytes).
149
150 Support for even larger strings is [being
151 discussed](https://github.com/iovisor/bpftrace/issues/305).
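
Environment variables are set in the usual shell way when invoking bpftrace. The sketch below raises the string size so that longer file names are less likely to be truncated; the value 128 is an arbitrary example:

```bpftrace
# BPFTRACE_STRLEN=128 bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s\n", str(args.filename)); }'
```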
152
153 BPFTRACE_NO_CPP_DEMANGLE
154 Default: 0
155
156 C++ symbol demangling in user space stack traces is enabled by default.
157
158 This feature can be turned off by setting the value of this environment
159 variable to 1.
160
161 BPFTRACE_MAP_KEYS_MAX
162 Default: 4096
163
164 This is the maximum number of keys that can be stored in a map.
165 Increasing the value will consume more memory and increase startup
166 times. There are some cases where you will want to: for example,
167 sampling stack traces, recording timestamps for each page, etc.
168
169 BPFTRACE_MAX_PROBES
170 Default: 512
171
172 This is the maximum number of probes that bpftrace can attach to.
173 Increasing the value will consume more memory, increase startup times
174 and can incur high performance overhead or even freeze or crash the
175 system.
176
177 BPFTRACE_CACHE_USER_SYMBOLS
178 Default: PER_PROGRAM if ASLR disabled or -c option given, PER_PID
179 otherwise.
180
181 Caching strategy for user symbols. Valid values are:
182
183 • PER_PROGRAM - each program has its own cache. If there are more
184 processes with enabled ASLR for a single program, this might
185 produce incorrect results.
186
• PER_PID - each process has its own cache. This is accurate for
processes with ASLR enabled, and enables bpftrace to preload caches
for processes running at probe attachment time. If there are many
processes running, it will consume a lot of memory.
191
192 • NONE - caching disabled. This saves the most memory, but at the
193 cost of speed.
194
195 BPFTRACE_VMLINUX
196 Default: None
197
This specifies the vmlinux path used for kernel symbol resolution when
attaching a kprobe to an offset. If this value is not given, bpftrace
searches for vmlinux in predefined locations. See
src/attached_probe.cpp:find_vmlinux() for details.
202
203 BPFTRACE_BTF
204 Default: None
205
206 The path to a BTF file. By default, bpftrace searches several locations
207 to find a BTF file. See src/btf.cpp for the details.
208
209 BPFTRACE_PERF_RB_PAGES
210 Default: 64
211
Number of pages to allocate per CPU for the perf ring buffer. The value
must be a power of 2.
214
215 If you’re getting a lot of dropped events bpftrace may not be
216 processing events in the ring buffer fast enough. It may be useful to
217 bump the value higher so more events can be queued up. The tradeoff is
218 that bpftrace will use more memory.
219
220 BPFTRACE_MAX_BPF_PROGS
221 Default: 512
222
223 This is the maximum number of BPF programs (functions) that bpftrace
224 can generate. The main purpose of this limit is to prevent bpftrace
225 from hanging since generating a lot of probes takes a lot of resources
226 (and it should not happen often).
227
228 BPFTRACE_STR_TRUNC_TRAILER
229 Default: ..
230
231 Trailer to add to strings that were truncated. Set to empty string to
232 disable truncation trailers.
233
234 BPFTRACE_STACK_MODE
235 Default: bpftrace
236
237 Output format for ustack and kstack builtins. Available modes/formats:
238 bpftrace, perf, and raw. This can be overwritten at the call site.
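
Both ways of selecting the stack format can be sketched as follows. The first command sets the mode globally through the environment; the second overrides it for a single call by passing the mode to kstack(). The probe is only an example:

```bpftrace
# BPFTRACE_STACK_MODE=perf bpftrace -e 'kprobe:do_nanosleep { printf("%s\n", kstack); exit(); }'
# bpftrace -e 'kprobe:do_nanosleep { printf("%s\n", kstack(perf)); exit(); }'
```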
239
241 Overview
The bpftrace (bt) language is inspired by the D language used by dtrace
and uses the same program structure. Each script consists of a
preamble and one or more action blocks.
245
246 preamble
247
248 actionblock1
249 actionblock2
250
251 Preprocessor and type definitions take place in the preamble:
252
253 #include <linux/socket.h>
254 #define RED "\033[31m"
255
256 struct S {
257 int x;
258 }
259
260 Each action block consists of three parts:
261
262 probe[,probe]
263 /predicate/ {
264 action
265 }
266
Probes
A probe specifies the event and event type to attach to.

Predicate
The predicate is an optional condition that must be met for the action
to be executed.

Action
Actions are the programs that run when an event fires (and the
predicate is met). An action is a semicolon (;) separated list of
statements and is always enclosed in braces {}.
278
279 A basic script that traces the open(2) and openat(2) system calls can
280 be written as follows:
281
282 BEGIN
283 {
284 printf("Tracing open syscalls... Hit Ctrl-C to end.\n");
285 }
286
287 tracepoint:syscalls:sys_enter_open,
288 tracepoint:syscalls:sys_enter_openat
289 {
290 printf("%-6d %-16s %s\n", pid, comm, str(args.filename));
291 }
292
293 This script has two action blocks and a total of 3 probes. The first
294 action block uses the special BEGIN probe, which fires once during
295 bpftrace startup. This probe is used to print a header, indicating that
296 the tracing has started.
297
The second action block uses two probes, one for open and one for
openat, and defines an action that prints the file being opened as
well as the pid and comm of the process that executes the syscall. See
the PROBES section for details on the available probe types.
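
The script above does not use a predicate. As a minimal sketch, the variant below adds one so that the action only runs for processes whose comm is "bash"; the process name is an arbitrary example:

```bpftrace
tracepoint:syscalls:sys_enter_openat
/comm == "bash"/
{
  // Only reached when the predicate evaluates to true.
  printf("%-6d %s\n", pid, str(args.filename));
}
```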
302
303 Identifiers
304 Identifiers must match the following regular expression:
305 [_a-zA-Z][_a-zA-Z0-9]*
306
307 Comments
308 Both single line and multi line comments are supported.
309
310 // A single line comment
311 i:s:1 { // can also be used to comment inline
312 /*
313 a multi line comment
314
315 */
316 print(/* inline comment block */ 1);
317 }
318
319 Data Types
320 The following fundamental integer types are provided by the language.
321
322 ┌───────┬─────────────────────────┐
323 │ │ │
324 │Type │ Description │
325 ├───────┼─────────────────────────┤
326 │ │ │
327 │uint8 │ Unsigned 8 bit integer │
328 ├───────┼─────────────────────────┤
329 │ │ │
330 │int8 │ Signed 8 bit integer │
331 ├───────┼─────────────────────────┤
332 │ │ │
333 │uint16 │ Unsigned 16 bit integer │
334 ├───────┼─────────────────────────┤
335 │ │ │
336 │int16 │ Signed 16 bit integer │
337 ├───────┼─────────────────────────┤
338 │ │ │
339 │uint32 │ Unsigned 32 bit integer │
340 ├───────┼─────────────────────────┤
341 │ │ │
342 │int32 │ Signed 32 bit integer │
343 ├───────┼─────────────────────────┤
344 │ │ │
345 │uint64 │ Unsigned 64 bit integer │
346 ├───────┼─────────────────────────┤
347 │ │ │
348 │int64 │ Signed 64 bit integer │
349 └───────┴─────────────────────────┘
350
351 Floating-point
352 Floating-point numbers are not supported by BPF and therefore not by
353 bpftrace.
354
355 Constants
Integer constants can be defined in the following formats:
357
358 • decimal (base 10)
359
360 • octal (base 8)
361
362 • hexadecimal (base 16)
363
364 • scientific (base 10)
365
Octal constants have to be prefixed with a 0, e.g. 0123. Hexadecimal
constants start with either 0x or 0X, e.g. 0x10. Scientific literals
are written in the <m>e<n> format, which is shorthand for m*10^n, e.g.
$i = 2e3;. Note that scientific literals are integer only due to the
lack of floating point support, so 1e-3 is not valid.

To improve the readability of big literals, an underscore _ can be used
as a field separator, e.g. 1_000_123_000.
374
375 Integer suffixes as found in the C language are parsed by bpftrace to
376 ensure compatibility with C headers/definitions but they’re not used as
377 size specifiers. 123UL, 123U and 123LL all result in the same integer
378 type with a value of 123.
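
The integer formats described above can be tried out side by side. This is a small sketch; the values in the comments follow from the rules stated in this section:

```bpftrace
BEGIN {
  printf("%d\n", 0123);           // octal: 83 in decimal
  printf("%d\n", 0x10);           // hexadecimal: 16
  printf("%d\n", 2e3);            // scientific: 2 * 10^3 = 2000
  printf("%d\n", 1_000_123_000);  // underscores only aid readability
  printf("%d\n", 123UL);          // suffix is parsed but not used: 123
  exit();
}
```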
379
380 Character constants can be defined by enclosing the character in single
381 quotes, e.g. $c = 'c';.
382
383 String constants can be defined by enclosing the character string in
384 double quotes, e.g. $str = "Hello world";.
385
386 Characters and strings support the following escape sequences:
387
388 ┌─────┬──────────────────────┐
389 │ │ │
390 │\n │ Newline │
391 ├─────┼──────────────────────┤
392 │ │ │
393 │\t │ Tab │
394 ├─────┼──────────────────────┤
395 │ │ │
396 │\0nn │ Octal value nn │
397 ├─────┼──────────────────────┤
398 │ │ │
399 │\xnn │ Hexadecimal value nn │
400 └─────┴──────────────────────┘
401
402 Type conversion
403 Integer and pointer types can be converted using explicit type
404 conversion with an expression like:
405
406 $y = (uint32) $z;
407 $py = (int16 *) $pz;
408
409 Integer casts to a higher rank are sign extended. Conversion to a lower
410 rank is done by zeroing leading bits.
411
412 It is also possible to cast between integers and integer arrays using
413 the same syntax:
414
415 $a = (uint8[8]) 12345;
416 $x = (uint64) $a;
417
418 Both the cast and the destination type must have the same size. When
419 casting to an array, it is possible to omit the size which will be
420 determined automatically from the size of the cast value.
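
The sketch below exercises both kinds of cast. The variable names and values are illustrative; the narrowing cast should keep only the low 16 bits, as described above:

```bpftrace
BEGIN {
  $z = 0x1234abcd;
  $y = (uint16) $z;        // leading bits are zeroed, leaving 0xabcd
  printf("0x%x\n", $y);

  $a = (uint8[8]) 12345;   // view an 8-byte integer as 8 bytes
  $x = (uint64) $a;        // convert the byte array back to an integer
  printf("%d\n", $x);
  exit();
}
```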
421
422 Operators and Expressions
423 Arithmetic Operators
424 The following operators are available for integer arithmetic:
425
426 ┌──┬────────────────────────┐
427 │ │ │
428 │+ │ integer addition │
429 ├──┼────────────────────────┤
430 │ │ │
431 │- │ integer subtraction │
432 ├──┼────────────────────────┤
433 │ │ │
434 │* │ integer multiplication │
435 ├──┼────────────────────────┤
436 │ │ │
437 │/ │ integer division │
438 ├──┼────────────────────────┤
439 │ │ │
440 │% │ integer modulo │
441 └──┴────────────────────────┘
442
443 Logical Operators
444 ┌───┬─────────────┐
445 │ │ │
446 │&& │ Logical AND │
447 ├───┼─────────────┤
448 │ │ │
449 │|| │ Logical OR │
450 ├───┼─────────────┤
451 │ │ │
452 │! │ Logical NOT │
453 └───┴─────────────┘
454
455 Bitwise Operators
456 ┌───┬───────────────────────────┐
457 │ │ │
458 │& │ AND │
459 ├───┼───────────────────────────┤
460 │ │ │
461 │| │ OR │
462 ├───┼───────────────────────────┤
463 │ │ │
464 │^ │ XOR │
465 ├───┼───────────────────────────┤
466 │ │ │
467 │<< │ Left shift the left-hand │
468 │ │ operand by the number of │
469 │ │ bits specified by the │
470 │ │ right-hand expression │
471 │ │ value │
472 ├───┼───────────────────────────┤
473 │ │ │
474 │>> │ Right shift the left-hand │
475 │ │ operand by the number of │
476 │ │ bits specified by the │
477 │ │ right-hand expression │
478 │ │ value │
479 └───┴───────────────────────────┘
480
481 Relational Operators
482 The following relational operators are defined for integers and
483 pointers.
484
485 ┌───┬────────────────────────────┐
486 │ │ │
487 │< │ left-hand expression is │
488 │ │ less than right-hand │
489 ├───┼────────────────────────────┤
490 │ │ │
491 │<= │ left-hand expression is │
492 │ │ less than or equal to │
493 │ │ right-hand │
494 ├───┼────────────────────────────┤
495 │ │ │
496 │> │ left-hand expression is │
497 │ │ bigger than right-hand │
498 ├───┼────────────────────────────┤
499 │ │ │
500 │>= │ left-hand expression is │
│ │ bigger than or equal to │
502 │ │ right-hand │
503 ├───┼────────────────────────────┤
504 │ │ │
505 │== │ left-hand expression equal │
506 │ │ to right-hand │
507 ├───┼────────────────────────────┤
508 │ │ │
509 │!= │ left-hand expression not │
510 │ │ equal to right-hand │
511 └───┴────────────────────────────┘
512
The following relational operators are available for comparing strings
and integer arrays.
515
516 ┌───┬────────────────────────────┐
517 │ │ │
518 │== │ left-hand string equal to │
519 │ │ right-hand │
520 ├───┼────────────────────────────┤
521 │ │ │
522 │!= │ left-hand string not equal │
523 │ │ to right-hand │
524 └───┴────────────────────────────┘
525
526 Assignment Operators
527 The following assignment operators can be used on both map and scratch
528 variables:
529
530 ┌────┬────────────────────────────┐
531 │ │ │
532 │= │ Assignment, assign the │
533 │ │ right-hand expression to │
534 │ │ the left-hand variable │
535 ├────┼────────────────────────────┤
536 │ │ │
537 │<<= │ Update the variable with │
538 │ │ its value left shifted by │
539 │ │ the number of bits │
540 │ │ specified by the │
541 │ │ right-hand expression │
542 │ │ value │
543 ├────┼────────────────────────────┤
544 │ │ │
545 │>>= │ Update the variable with │
546 │ │ its value right shifted by │
547 │ │ the number of bits │
548 │ │ specified by the │
549 │ │ right-hand expression │
550 │ │ value │
551 ├────┼────────────────────────────┤
552 │ │ │
553 │+= │ Increment the variable by │
554 │ │ the right-hand expression │
555 │ │ value │
556 ├────┼────────────────────────────┤
557 │ │ │
558 │-= │ Decrement the variable by │
559 │ │ the right-hand expression │
560 │ │ value │
561 ├────┼────────────────────────────┤
562 │ │ │
│*= │ Multiply the variable by │
564 │ │ the right-hand expression │
565 │ │ value │
566 ├────┼────────────────────────────┤
567 │ │ │
568 │/= │ Divide the variable by the │
569 │ │ right-hand expression │
570 │ │ value │
571 ├────┼────────────────────────────┤
572 │ │ │
573 │%= │ Modulo the variable by the │
574 │ │ right-hand expression │
575 │ │ value │
576 ├────┼────────────────────────────┤
577 │ │ │
578 │&= │ Bitwise AND the variable │
579 │ │ by the right-hand │
580 │ │ expression value │
581 ├────┼────────────────────────────┤
582 │ │ │
583 │|= │ Bitwise OR the variable by │
584 │ │ the right-hand expression │
585 │ │ value │
586 ├────┼────────────────────────────┤
587 │ │ │
588 │^= │ Bitwise XOR the variable │
589 │ │ by the right-hand │
590 │ │ expression value │
591 └────┴────────────────────────────┘
592
593 All these operators are syntactic sugar for combining assignment with
594 the specified operator. @ -= 5 is equal to @ = @ - 5.
595
596 Increment and Decrement Operators
The increment (++) and decrement (--) operators can be used on
integer and pointer variables to increment or decrement their value by
one. They can only be used on variables and can be applied either as
prefix or suffix. The difference is that the expression x++ returns the
original value of x, before it got incremented, while ++x returns the
value of x post increment. E.g.
603
604 $x = 10;
605 $y = $x--; // y = 10; x = 9
606 $a = 10;
607 $b = --$a; // a = 9; b = 9
608
609 Note that maps will be implicitly declared and initialized to 0 if not
610 already declared or defined. Scratch variables must be initialized
611 before using these operators.
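
The difference between maps and scratch variables here can be sketched as follows; the names @ticks and $n are hypothetical:

```bpftrace
i:s:1 {
  @ticks++;   // map: implicitly declared and initialized to 0 first
  $n = 0;     // scratch: must be initialized before ++ can be used
  $n++;
  printf("ticks=%d n=%d\n", @ticks, $n);
}
```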
612
613 Variables and Maps
614 bpftrace knows two types of variables, scratch and map.
615
'scratch' variables are kept on the BPF stack and only exist during
the execution of the action block; they cannot be accessed outside of
the program. Scratch variable names always start with a $, e.g. $myvar.
619
620 'map' variables use BPF 'maps'. These exist for the lifetime of
621 bpftrace itself and can be accessed from all action blocks and
622 user-space. Map names always start with a @, e.g. @mymap.
623
Any valid identifier can be used as a name.
625
626 The data type of a variable is automatically determined during first
627 assignment and cannot be changed afterwards.
628
629 Associative Arrays
630 Associative arrays are a collection of elements indexed by a key,
631 similar to the hash tables found in languages like C++ (std::map) and
632 Python (dict). They’re a variant of 'map' variables.
633
634 @name[key] = expression
635 @name[key1,key2] = expression
636
637 Just like with any variable the type is determined on first use and
638 cannot be modified afterwards. This applies to both the key(s) and the
639 value type.
640
641 The following snippet creates a map with key signature [int64,
642 string[16]] and a value type of int64:
643
644 @[pid, comm]++
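
A common pattern that combines both kinds of variables is measuring the time spent in a function. The following is a sketch under a few assumptions: the traced function do_nanosleep is just an example, and delete(), which removes a map entry, is a bpftrace function not described in this section:

```bpftrace
kprobe:do_nanosleep {
  @start[tid] = nsecs;            // map: still there when the function returns
}

kretprobe:do_nanosleep /@start[tid]/ {
  $delta = nsecs - @start[tid];   // scratch: only lives in this action block
  printf("%s spent %d us in do_nanosleep\n", comm, $delta / 1000);
  delete(@start[tid]);            // drop the per-thread entry again
}
```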
645
646 Variable scoping
647 Pointers
648 Pointers in bpftrace are similar to those found in C.
649
650 Tuples
651 bpftrace has support for immutable N-tuples (n > 1). A tuple is a
652 sequence type (like an array) where, unlike an array, every element can
653 have a different type.
654
Tuples are a comma separated list of expressions, enclosed in
parentheses, e.g. (1,2). Individual fields can be accessed with the .
operator. Tuples are zero indexed, like arrays.
658
659 i:s:1 {
660 $a = (1,2);
661 $b = (3,4, $a);
662 print($a);
663 print($b);
664 print($b.0);
665 }
666
667 Prints:
668
669 (1, 2)
670 (3, 4, (1, 2))
671 3
672
673 Arrays
674 bpftrace supports accessing one-dimensional arrays like those found in
675 C.
676
677 Constructing arrays from scratch, like int a[] = {1,2,3} in C, is not
678 supported. They can only be read into a variable from a pointer.
679
680 The [] operator is used to access elements.
681
682 struct MyStruct {
683 int y[4];
684 }
685
686 kprobe:dummy {
687 $s = (struct MyStruct *) arg0;
688 print($s->y[0]);
689 }
690
691 Structs
692 C like structs are supported by bpftrace. Fields are accessed with the
693 . operator. Fields of a pointer to a struct can be accessed with the ->
694 operator.
695
Custom structs can be defined in the preamble.
697
698 Constructing structs from scratch, like struct X var = {.f1 = 1} in C,
699 is not supported. They can only be read into a variable from a pointer.
700
701 struct MyStruct {
702 int a;
703 }
704
705 kprobe:dummy {
706 $ptr = (struct MyStruct *) arg0;
707 $st = *$ptr;
708 print($st.a);
709 print($ptr->a);
710 }
711
712 Conditionals
713 Conditional expressions are supported in the form of if/else statements
714 and the ternary operator.
715
716 The ternary operator consists of three operands: a condition followed
717 by a ?, the expression to execute when the condition is true followed
718 by a : and the expression to execute if the condition is false.
719
720 condition ? ifTrue : ifFalse
721
722 Both the ifTrue and ifFalse expressions must be of the same type,
723 mixing types is not allowed.
724
725 The ternary operator can be used as part of an assignment.
726
727 $a == 1 ? print("true") : print("false");
728 $b = $a > 0 ? $a : -1;
729
730 If/else statements, like the one in C, are supported.
731
732 if (condition) {
733 ifblock
734 } else if (condition) {
735 if2block
736 } else {
737 elseblock
738 }
739
740 Loops
741 Since kernel 5.3 BPF supports loops as long as the verifier can prove
742 they’re bounded and fit within the instruction limit.
743
744 In bpftrace loops are available through the while statement.
745
746 while (condition) {
747 block;
748 }
749
750 Within a while-loop the following control flow statements can be used:
751
752 ┌─────────┬────────────────────────────┐
753 │ │ │
754 │continue │ skip processing of the │
755 │ │ rest of the block and jump │
756 │ │ back to the evaluation of │
757 │ │ the conditional │
758 ├─────────┼────────────────────────────┤
759 │ │ │
760 │break │ Terminate the loop │
761 └─────────┴────────────────────────────┘
762
763 i:s:1 {
764 $i = 0;
765 while ($i <= 100) {
766 printf("%d ", $i);
767 if ($i > 5) {
768 break;
769 }
770 $i++
771 }
772 printf("\n");
773 }
774
775 Loop unrolling is also supported with the unroll statement.
776
777 unroll(n) {
778 block;
779 }
780
781 The compiler will evaluate the block n times and generate the BPF code
782 for the block n times. As this happens at compile time n must be a
783 constant greater than 0 (n > 0).
784
785 The following two probes compile into the same code:
786
787 i:s:1 {
788 unroll(3) {
789 print("Unrolled")
790 }
791 }
792
i:s:1 {
  print("Unrolled");
  print("Unrolled");
  print("Unrolled");
}
798
800 There are three invocation modes for bpftrace built-in functions.
801
802 ┌─────────────┬─────────────────────┬────────────────────┐
803 │ │ │ │
804 │Mode │ Description │ Example functions │
805 ├─────────────┼─────────────────────┼────────────────────┤
806 │ │ │ │
807 │Synchronous │ The value/effect of │ reg(), str(), │
808 │ │ the built-in │ ntop() │
809 │ │ function is │ │
810 │ │ determined/handled │ │
811 │ │ right away by the │ │
812 │ │ bpf program in the │ │
813 │ │ kernel space. │ │
814 ├─────────────┼─────────────────────┼────────────────────┤
815 │ │ │ │
816 │Asynchronous │ The value/effect of │ printf(), clear(), │
817 │ │ the built-in │ exit() │
818 │ │ function is │ │
819 │ │ determined/handled │ │
820 │ │ later by the │ │
821 │ │ bpftrace process in │ │
822 │ │ the user space. │ │
823 ├─────────────┼─────────────────────┼────────────────────┤
824 │ │ │ │
825 │Compile-time │ The value of the │ kaddr(), │
826 │ │ built-in function │ cgroupid(), │
827 │ │ is determined │ offsetof() │
828 │ │ before bpf programs │ │
829 │ │ are running. │ │
830 └─────────────┴─────────────────────┴────────────────────┘
831
832 While BPF in the kernel can do a lot there are still things that can
833 only be done from user space, like the outputting (printing) of data.
834 The way bpftrace handles this is by sending events from the BPF program
835 which user-space will pick up some time in the future (usually in
836 milliseconds). Operations that happen in the kernel are 'synchronous'
837 ('sync') and those that are handled in user space are 'asynchronous'
838 ('async')
839
The asynchronous behavior can lead to unexpected results, as
updates can happen before user space has had time to process the event.
The following situations may occur:

• event loss: when using printf(), the amount of data printed may be
less than the actual number of events generated by the kernel
during the BPF program’s execution.

• delayed exit: when using exit() to terminate the program,
bpftrace needs to handle the exit signal asynchronously, so the
BPF program may continue to run for some additional time.
851
852 One example is updating a map value in a tight loop:
853
854 BEGIN {
855 @=0;
856 unroll(10) {
857 print(@);
858 @++;
859 }
860 exit()
861 }
862
863 Maps are printed by reference not by value and as the value gets
864 updated right after the print user-space will likely only see the final
865 value once it processes the event:
866
867 @: 10
868 @: 10
869 @: 10
870 @: 10
871 @: 10
872 @: 10
873 @: 10
874 @: 10
875 @: 10
876 @: 10
877
878 Therefore, when you need precise event statistics, it is recommended to
879 use synchronous functions (e.g. count() and hist()) to ensure more
880 reliable and accurate results.
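
As a sketch of this recommendation, the script below counts events in a map instead of printing one line per event, and only reports the totals periodically. The map name @opens and the five second interval are arbitrary choices:

```bpftrace
tracepoint:syscalls:sys_enter_openat {
  @opens[comm] = count();   // updated in the kernel for every event
}

interval:s:5 {
  print(@opens);            // report the aggregated counts
  clear(@opens);            // start over for the next interval
}
```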
881
883 Kernel and user pointers live in different address spaces which,
884 depending on the CPU architecture, might overlap. Trying to read a
885 pointer that is in the wrong address space results in a runtime error.
This error is hidden by default but can be made visible with the -kk flag:
887
888 stdin:1:9-12: WARNING: Failed to probe_read_user: Bad address (-14)
889 BEGIN { @=*uptr(kaddr("do_poweroff")) }
890 ~~~
891
892 bpftrace tries to automatically set the correct address space for a
893 pointer based on the probe type, but might fail in cases where it is
894 unclear. The address space can be changed with the kptr() and uptr()
895 functions.
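
The sketch below shows how uptr() might be used in a kprobe, where pointers are otherwise treated as kernel pointers. It assumes a kernel that provides the function do_sys_openat2 and that its second argument (arg1) is the user-space file name pointer; both are assumptions to verify on the target system:

```bpftrace
kprobe:do_sys_openat2 {
  // arg1 is assumed to point into user space, so mark it explicitly.
  printf("%s\n", str(uptr(arg1)));
}
```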
896
Builtins are special variables built into the language. Unlike
scratch and map variables, they don’t need a $ or @ prefix (except for
the positional parameters).
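
Positional parameters are passed on the command line after the program. A minimal sketch, with arbitrary example values:

```bpftrace
# bpftrace -e 'BEGIN { printf("%d %s (%d parameters)\n", $1, str($2), $#); exit(); }' 42 hello
```

Here $1 is used as an integer, while the string parameter $2 is read with str().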
901
902 ┌──────────────┬─────────────┬────────────┬───────────────────────┬───────────────────┐
903 │ │ │ │ │ │
904 │Variable │ Type │ Kernel │ BPF Helper │ Description │
905 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
906 │ │ │ │ │ │
907 │$1, $2, ...$n │ int64 │ n/a │ n/a │ The nth │
908 │ │ │ │ │ positional │
909 │ │ │ │ │ parameter │
910 │ │ │ │ │ passed to the │
911 │ │ │ │ │ bpftrace │
912 │ │ │ │ │ program. If │
913 │ │ │ │ │ less than n │
914 │ │ │ │ │ parameters │
915 │ │ │ │ │ are passed │
916 │ │ │ │ │ this │
917 │ │ │ │ │ evaluates to │
918 │ │ │ │ │ 0. For string │
919 │ │ │ │ │ arguments use │
920 │ │ │ │ │ the str() │
921 │ │ │ │ │ call to │
922 │ │ │ │ │ retrieve the │
923 │ │ │ │ │ value. │
924 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
925 │ │ │ │ │ │
926 │$# │ int64 │ n/a │ n/a │ Total amount │
927 │ │ │ │ │ of positional │
928 │ │ │ │ │ parameters │
929 │ │ │ │ │ passed. │
930 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
931 │ │ │ │ │ │
932 │arg0, arg1, │ int64 │ n/a │ n/a │ nth argument │
933 │...argn │ │ │ │ passed to the │
934 │ │ │ │ │ function │
935 │ │ │ │ │ being traced. │
936 │ │ │ │ │ These are │
937 │ │ │ │ │ extracted │
938 │ │ │ │ │ from the CPU │
939 │ │ │ │ │ registers. │
940 │ │ │ │ │ The amount of │
941 │ │ │ │ │ args passed │
942 │ │ │ │ │ in registers │
943 │ │ │ │ │ depends on │
944 │ │ │ │ │ the CPU │
945 │ │ │ │ │ architecture. │
946 │ │ │ │ │ (kprobes, │
947 │ │ │ │ │ uprobes, │
948 │ │ │ │ │ usdt). │
949 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
950 │ │ │ │ │ │
951 │args │ struct args │ n/a │ n/a │ The struct of │
952 │ │ │ │ │ all arguments │
953 │ │ │ │ │ of the traced │
954 │ │ │ │ │ function. │
955 │ │ │ │ │ Available in │
956 │ │ │ │ │ tracepoint, │
957 │ │ │ │ │ kfunc, and │
958 │ │ │ │ │ uprobe (with │
959 │ │ │ │ │ DWARF) │
960 │ │ │ │ │ probes. Use │
961 │ │ │ │ │ args.x to │
962 │ │ │ │ │ access │
963 │ │ │ │ │ argument x or │
964 │ │ │ │ │ args to get a │
965 │ │ │ │ │ record with │
966 │ │ │ │ │ all │
967 │ │ │ │ │ arguments. │
968 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
969 │ │ │ │ │ │
970 │cgroup │ uint64 │ 4.18 │ get_current_cgroup_id │ ID of the │
971 │ │ │ │ │ cgroup the │
972 │ │ │ │ │ current task │
973 │ │ │ │ │ is in. Only │
974 │ │ │ │ │ works with │
975 │ │ │ │ │ cgroupv2. │
976 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
977 │ │ │ │ │ │
│comm │ string[16] │ 4.2 │ get_current_comm │ comm of the │
979 │ │ │ │ │ current task. │
980 │ │ │ │ │ Equal to the │
981 │ │ │ │ │ value in │
982 │ │ │ │ │ /proc/<pid>/comm │
983 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
984 │ │ │ │ │ │
985 │cpid │ uint32 │ n/a │ n/a │ PID of the child │
986 │ │ │ │ │ process │
987 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
988 │ │ │ │ │ │
989 │numaid │ uint32 │ 5.8 │ numa_node_id │ ID of the NUMA │
990 │ │ │ │ │ node executing │
991 │ │ │ │ │ the BPF program │
992 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
993 │ │ │ │ │ │
994 │cpu │ uint32 │ 4.1 │ raw_smp_processor_id │ ID of the │
995 │ │ │ │ │ processor │
996 │ │ │ │ │ executing the │
997 │ │ │ │ │ BPF program │
998 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
999 │ │ │ │ │ │
1000 │curtask │ uint64 │ 4.8 │ get_current_task │ Pointer to │
1001 │ │ │ │ │ struct │
1002 │ │ │ │ │ task_struct of │
1003 │ │ │ │ │ the current task │
1004 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
1005 │ │ │ │ │ │
│elapsed │ uint64 │ (see nsecs) │ ktime_get_ns / │ Nanoseconds │
1007 │ │ │ │ ktime_get_boot_ns │ elapsed since │
1008 │ │ │ │ │ bpftrace │
1009 │ │ │ │ │ initialization, │
1010 │ │ │ │ │ based on nsecs │
1011 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
1012 │ │ │ │ │ │
1013 │func │ string │ n/a │ n/a │ Name of the │
1014 │ │ │ │ │ current function │
1015 │ │ │ │ │ being traced │
1016 │ │ │ │ │ (kprobes,uprobes) │
1017 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
1018 │ │ │ │ │ │
1019 │gid │ uint64 │ 4.2 │ get_current_uid_gid │ GID of current │
1020 │ │ │ │ │ task │
1021 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
1022 │ │ │ │ │ │
1023 │kstack │ kstack │ │ get_stackid │ Kernel stack │
1024 │ │ │ │ │ trace │
1025 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
1026 │ │ │ │ │ │
1027 │nsecs │ uint64 │ 4.1 / 5.7 │ ktime_get_ns / │ nanoseconds since │
1028 │ │ │ │ ktime_get_boot_ns │ kernel boot. On │
1029 │ │ │ │ │ kernels that │
1030 │ │ │ │ │ support │
1031 │ │ │ │ │ ktime_get_boot_ns │
1032 │ │ │ │ │ this includes the │
1033 │ │ │ │ │ time spent │
1034 │ │ │ │ │ suspended, on │
1035 │ │ │ │ │ older kernels it │
1036 │ │ │ │ │ does not. │
1037 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
1038 │ │ │ │ │ │
1039 │pid │ uint64 │ 4.2 │ get_current_pid_tgid │ Process ID (or │
1040 │ │ │ │ │ thread group ID) │
1041 │ │ │ │ │ of the current │
1042 │ │ │ │ │ task. │
1043 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
1044 │ │ │ │ │ │
1045 │probe │ string │ n/a │ n/a │ Name of the │
1046 │ │ │ │ │ current probe │
1047 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
1048 │ │ │ │ │ │
1049 │rand │ uint32 │ 4.1 │ get_prandom_u32 │ Random number │
1050 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
1051 │ │ │ │ │ │
1052 │retval │ int64 │ n/a │ n/a │ Value returned by │
1053 │ │ │ │ │ the function │
1054 │ │ │ │ │ being traced │
1055 │ │ │ │ │ (kretprobe, │
1056 │ │ │ │ │ uretprobe, │
1057 │ │ │ │ │ kretfunc) │
1058 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
1059 │ │ │ │ │ │
1060 │sarg0, sarg1, │ int64 │ n/a │ n/a │ nth stack value │
1061 │...sargn │ │ │ │ of the function │
1062 │ │ │ │ │ being traced. │
1063 │ │ │ │ │ (kprobes, │
1064 │ │ │ │ │ uprobes). │
1065 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
1066 │ │ │ │ │ │
1067 │tid │ uint64 │ 4.2 │ get_current_pid_tgid │ Thread ID of the │
1068 │ │ │ │ │ current task. │
1069 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
1070 │ │ │ │ │ │
1071 │uid │ uint64 │ 4.2 │ get_current_uid_gid │ UID of current │
1072 │ │ │ │ │ task │
1073 ├──────────────┼─────────────┼────────────┼───────────────────────┼───────────────────┤
1074 │ │ │ │ │ │
1075 │ustack │ ustack │ 4.6 │ get_stackid │ Userspace stack │
1076 │ │ │ │ │ trace │
1077 └──────────────┴─────────────┴────────────┴───────────────────────┴───────────────────┘
1078
1080 Map functions are built-in functions whose return value can only be
1081 assigned to maps. The data types associated with these functions are
1082 only for internal use and are not compatible with the (integer)
1083 operators.
1084
1085 Functions that are marked async are asynchronous, which can lead to
1086 unexpected behavior; see the [Sync and Async] section for more
1087 information.
1088
1089 avg
1090 variants
1091
1092 • avg(int64 n)
1093
1094 Calculate the running average of n between consecutive calls.
1095
1096 i:s:1 {
1097   @x++;
1098   @y = avg(@x);
1099   print(@x);
1100   print(@y);
1101 }
1102
1103 Internally this keeps two values in the map: value count and running
1104 total. The average is computed in user-space when printing by dividing
1105 the total by the count.
1106
1107 clear
1108 variants
1109
1110 • clear(map m)
1111
1112 async
1113
1114 Clear all keys/values from map m.
1115
1116 i:ms:100 {
1117   @[rand % 10] = count();
1118 }
1119
1120 i:s:10 {
1121   print(@);
1122   clear(@);
1123 }
1124
1125 count
1126 variants
1127
1128 • count()
1129
1130 Count how often this function is called.
1131
1132 Using @=count() is conceptually similar to @++. The difference is that
1133 the count() function uses a map type optimized for this (PER_CPU),
1134 increasing performance. Due to this the map cannot be accessed as a
1135 regular integer.
1136
1137 i:ms:100 {
1138   @ = count();
1139 }
1140
1141 i:s:10 {
1142   print(@);
1143   clear(@);
1144 }
1145
1146 delete
1147 variants
1148
1149 • delete(mapkey k)
1150
1151 Delete a single key from a map. For a single value map this deletes the
1152 only element. For an associative-array the key to delete has to be
1153 specified.
1154
1155 k:dummy {
1156   @scalar = 1;
1157   @associative[1,2] = 1;
1158   delete(@scalar);
1159   delete(@associative[1,2]);
1160
1161   delete(@associative); // error
1162 }
1163
1164 hist
1165 variants
1166
1167 • hist(int64 n)
1168
1169 Create a log2 histogram of n.
1170
1171 kretprobe:vfs_read {
1172   @bytes = hist(retval);
1173 }
1174
1175 Results in:
1176
1177 @bytes:
1178 [1M, 2M) 3 | |
1179 [2M, 4M) 2 | |
1180 [4M, 8M) 2 | |
1181 [8M, 16M) 6 | |
1182 [16M, 32M) 16 | |
1183 [32M, 64M) 27 | |
1184 [64M, 128M) 48 |@ |
1185 [128M, 256M) 98 |@@@ |
1186 [256M, 512M) 191 |@@@@@@ |
1187 [512M, 1G) 394 |@@@@@@@@@@@@@ |
1188 [1G, 2G) 820 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
1189
1190 lhist
1191 variants
1192
1193 • lhist(int64 n, int64 min, int64 max, int64 step)
1194
1195 Create a linear histogram of n. lhist creates M ((max - min) / step)
1196 buckets in the range [min,max) where each bucket is step in size.
1197 Values in the range (-inf, min) and [max, inf) each get their own
1198 bucket too, bringing the total number of buckets created to M+2.
1199
1200 i:ms:1 {
1201   @ = lhist(rand % 10, 0, 10, 1);
1202 }
1203
1204 i:s:5 {
1205   exit();
1206 }
1207
1208 Prints:
1209
1210 @:
1211 [0, 1) 306 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
1212 [1, 2) 284 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
1213 [2, 3) 294 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
1214 [3, 4) 318 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
1215 [4, 5) 311 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
1216 [5, 6) 362 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
1217 [6, 7) 336 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
1218 [7, 8) 326 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
1219 [8, 9) 328 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
1220 [9, 10) 318 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
1221
1222 max
1223 variants
1224
1225 • max(int64 n)
1226
1227 Update the map with n if n is bigger than the current value held.
1228
1229 min
1230 variants
1231
1232 • min(int64 n)
1233
1234 Update the map with n if n is smaller than the current value held.
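
As a minimal sketch of max and min used together, the following program records the largest and smallest positive read size per process name. The map names @largest and @smallest are arbitrary, and attaching to vfs_read is an assumption carried over from the hist example.

```bpftrace
kretprobe:vfs_read
/(int64)retval > 0/ {
  // Keep only the extreme values seen so far for each process name.
  @largest[comm] = max(retval);
  @smallest[comm] = min(retval);
}
```

Both maps are printed when bpftrace exits.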
1235
1236 stats
1237 variants
1238
1239 • stats(int64 n)
1240
1241 stats combines the count, avg and sum calls into one.
1242
1243 kprobe:vfs_read {
1244   @bytes[comm] = stats(arg2);
1245 }
1246
1247 @bytes[bash]: count 7, average 1, total 7
1248 @bytes[sleep]: count 5, average 832, total 4160
1249 @bytes[ls]: count 7, average 886, total 6208
1251
1252 sum
1253 variants
1254
1255 • sum(int64 n)
1256
1257 Calculate the sum of all n passed.
1258
1259 zero
1260 variants
1261
1262 • zero(map m)
1263
1264 async
1265
1266 Set all values for all keys to zero.
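
The difference between zero and clear is easiest to see in a periodic report. The sketch below sums read sizes with sum and resets the totals with zero after each report, so the keys stay in the map with a value of 0 instead of being removed as clear would do. The map name @bytes and the 5 second period are arbitrary choices, and vfs_read is again only an assumed probe target.

```bpftrace
kretprobe:vfs_read
/(int64)retval > 0/ {
  // Accumulate the number of bytes read per process name.
  @bytes[comm] = sum(retval);
}

interval:s:5 {
  // Report the totals, then reset every value while keeping the keys.
  print(@bytes);
  zero(@bytes);
}
```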
1267
1269 Functions that are marked async are asynchronous, which can lead to
1270 unexpected behavior; see the [Sync and Async] section for more
1271 information.
1272
1273 Functions marked compile time are evaluated at compile time; a static
1274 value is compiled into the program.
1275
1276 Functions marked unsafe can have dangerous side effects and should be
1277 used with care; the --unsafe flag is required to use them.
1278
1279 bswap
1280 variants
1281
1282 • uint8 bswap(uint8 n)
1283
1284 • uint16 bswap(uint16 n)
1285
1286 • uint32 bswap(uint32 n)
1287
1288 • uint64 bswap(uint64 n)
1289
1290 bswap reverses the order of the bytes in integer n. In case of 8 bit
1291 integers, n is returned without being modified. The return type is an
1292 unsigned integer of the same width as n.
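
As a small illustration, the sketch below swaps a 16-bit and a 32-bit constant. The casts are there to pin the operand width, on the assumption that an uncast integer literal may be treated as a 64-bit value.

```bpftrace
BEGIN {
  // The cast selects the uint16 and uint32 variants of bswap.
  printf("0x%x\n", bswap((uint16)0x1234));
  printf("0x%x\n", bswap((uint32)0x12345678));
  exit();
}
```

Going by the description above, the expected results are 0x3412 and 0x78563412.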
1293
1294 buf
1295 variants
1296
1297 • buf_t buf(void * data, [int64 length])
1298
1299 buf reads length bytes from address data. The maximum value of length
1300 is limited by the BPFTRACE_STRLEN variable. For arrays the length is
1301 optional; it is automatically inferred from the signature.
1302
1303 buf is address space aware and will call the correct helper based on
1304 the address space associated with data.
1305
1306 The buf_t object returned by buf can safely be printed as a hex encoded
1307 string with the %r format specifier.
1308
1309 Bytes with values >=32 and <=126 are printed using their ASCII
1310 character, other bytes are printed in hex form (e.g. \x00). The %rx
1311 format specifier can be used to print everything in hex form, including
1312 ASCII characters. The similar %rh format specifier prints everything in
1313 hex form without \x and with spaces between bytes (e.g. 0a fe).
1314
1315 i:s:1 {
1316   printf("%r\n", buf(kaddr("avenrun"), 8));
1317 }
1318
1319 \x00\x03\x00\x00\x00\x00\x00\x00
1320 \xc2\x02\x00\x00\x00\x00\x00\x00
1321
1322 cat
1323 variants
1324
1325 • void cat(string namefmt, [...args])
1326
1327 async
1328
1329 Dump the contents of the named file to stdout. cat supports the same
1330 format string and arguments that printf does. If the file cannot be
1331 opened or read an error is printed to stderr.
1332
1333 t:syscalls:sys_enter_execve {
1334   cat("/proc/%d/maps", pid);
1335 }
1336
1337 55f683ebd000-55f683ec1000 r--p 00000000 08:01 1843399 /usr/bin/ls
1338 55f683ec1000-55f683ed6000 r-xp 00004000 08:01 1843399 /usr/bin/ls
1339 55f683ed6000-55f683edf000 r--p 00019000 08:01 1843399 /usr/bin/ls
1340 55f683edf000-55f683ee2000 rw-p 00021000 08:01 1843399 /usr/bin/ls
1341 55f683ee2000-55f683ee3000 rw-p 00000000 00:00 0
1342
1343 cgroup_path
1344 variants
1345
1346 • cgroup_path cgroup_path(int cgroupid, [string filter])
1347
1348 Convert cgroup id to cgroup path. This is done asynchronously in
1349 userspace when the cgroup_path value is printed, therefore it can
1350 resolve to a different value if the cgroup id gets reassigned. This
1351 also means that the returned value can only be used for printing.
1352
1353 A string literal may be passed as an optional second argument to filter
1354 cgroup hierarchies in which the cgroup id is looked up by a wildcard
1355 expression (cgroup2 is always represented by "unified", regardless of
1356 where it is mounted).
1357
1358 The currently mounted hierarchy at /sys/fs/cgroup is used to do the
1359 lookup. If the cgroup with the given id isn’t present here (e.g. when
1360 running in a Docker container), the cgroup path won’t be found (unlike
1361 when looking up the cgroup path of a process via /proc/.../cgroup).
1362
1363 BEGIN {
1364   $cgroup_path = cgroup_path(3436);
1365   print($cgroup_path);
1366   print($cgroup_path); /* This may print a different path */
1367   printf("%s %s", $cgroup_path, $cgroup_path); /* This may print two different paths */
1368 }
1369
1370 cgroupid
1371 variants
1372
1373 • uint64 cgroupid(const string path)
1374
1375 compile time
1376
1377 cgroupid retrieves the cgroupv2 ID of the cgroup available at path.
1378
1379 BEGIN {
1380   print(cgroupid("/sys/fs/cgroup/system.slice"));
1381 }
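
A common use of cgroupid is to restrict a probe to one cgroup. The sketch below assumes the cgroup builtin (the cgroup ID of the current task) and reuses the system.slice path from the example above; both the path and the traced tracepoint are illustrative.

```bpftrace
tracepoint:syscalls:sys_enter_openat
/cgroup == cgroupid("/sys/fs/cgroup/system.slice")/ {
  // Only tasks that belong to the selected cgroup reach this action.
  printf("%s opened %s\n", comm, str(args.filename));
}
```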
1382
1383 exit
1384 variants
1385
1386 • void exit()
1387
1388 async
1389
1390 Terminate bpftrace, as if a SIGTERM was received. The END probe will
1391 still trigger (if specified) and maps will be printed.
1392
1393 join
1394 variants
1395
1396 • void join(char *arr[], [char * sep = ' '])
1397
1398 async
1399
1400 join joins all the strings in the array arr, with sep as separator,
1401 into one string. This string is printed to stdout directly; it cannot
1402 be used as a string value.
1403
1404 The concatenation of the array members is done in BPF and the printing
1405 happens in userspace.
1406
1407 tracepoint:syscalls:sys_enter_execve {
1408   join(args.argv);
1409 }
1410
1411 kaddr
1412 variants
1413
1414 • uint64 kaddr(const string name)
1415
1416 compile time
1417
1418 Get the address of the kernel symbol name.
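
A minimal sketch, assuming the running kernel exports the avenrun symbol (the same symbol used in the buf example):

```bpftrace
BEGIN {
  // kaddr is resolved at compile time, so the address is a constant here.
  printf("avenrun is at 0x%llx\n", kaddr("avenrun"));
  exit();
}
```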
1419
1422 kptr
1423 variants
1424
1425 • T * kptr(T * ptr)
1426
1427 Marks ptr as a kernel address space pointer. See the address-spaces
1428 section for more information on address-spaces. The pointer type is
1429 left unchanged.
1430
1431 ksym
1432 variants
1433
1434 • ksym_t ksym(uint64 addr)
1435
1436 async
1437
1438 Retrieve the name of the function that contains address addr. The
1439 address to name mapping happens in user-space.
1440
1441 The ksym_t type can be printed with the %s format specifier.
1442
1443 kprobe:do_nanosleep
1444 {
1445   printf("%s\n", ksym(reg("ip")));
1446 }
1447
1448 Prints:
1449
1450 do_nanosleep
1451
1452 macaddr
1453 variants
1454
1455 • macaddr_t macaddr(char [6] mac)
1456
1457 Create a buffer that holds a MAC address as read from mac. This buffer
1458 can be printed in the canonical string format using the %s format
1459 specifier.
1460
1461 kprobe:arp_create {
1462   printf("SRC %s, DST %s\n", macaddr(sarg0), macaddr(sarg1));
1463 }
1464
1465 Prints:
1466
1467 SRC 18:C0:4D:08:2E:BB, DST 74:83:C2:7F:8C:FF
1468
1469 ntop
1470 variants
1471
1472 • inet_t ntop([int64 af, ] int addr)
1473
1474 • inet_t ntop([int64 af, ] char addr[4])
1475
1476 • inet_t ntop([int64 af, ] char addr[16])
1477
1478 ntop returns the string representation of an IPv4 or IPv6 address. ntop
1479 will infer the address type (IPv4 or IPv6) based on the addr type and
1480 size. If an integer or char[4] is given, ntop assumes IPv4, if a
1481 char[16] is given, ntop assumes IPv6. You can also pass the address
1482 type (e.g. AF_INET) explicitly as the first parameter.
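
As a brief sketch, an IPv4 address held in an integer can be printed with the %s format specifier. The constant 0x0100007f is an illustrative value; on a little-endian machine its bytes in memory are 127, 0, 0, 1.

```bpftrace
BEGIN {
  // An integer argument makes ntop assume IPv4.
  printf("%s\n", ntop(0x0100007f));
  exit();
}
```

On such a machine this is expected to print 127.0.0.1.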
1483
1484 pton
1485 variants
1486
1487 • char addr[4] pton(const string *addr_v4)
1488
1489 • char addr[16] pton(const string *addr_v6)
1490
1491 compile time
1492
1493 pton converts a text representation of an IPv4 or IPv6 address to byte
1494 array. pton infers the address family based on . or : in the given
1495 argument. pton comes in handy when we need to select packets with
1496 certain IP addresses.
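
The packet-selection use case mentioned above can be sketched as follows. This assumes the tcp:tcp_retransmit_skb tracepoint is available and exposes a 16-byte daddr_v6 field, which depends on the kernel.

```bpftrace
tracepoint:tcp:tcp_retransmit_skb {
  // pton is evaluated at compile time; the comparison is between byte arrays.
  if (args.daddr_v6 == pton("::1")) {
    printf("retransmit to IPv6 localhost\n");
  }
}
```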
1497
1498 override
1499 variants
1500
1501 • override(uint64 rc)
1502
1503 unsafe
1504
1505 Kernel 4.16
1506
1507 Helper bpf_override_return
1508
1509 Supported probes
1510
1511 • kprobe
1512
1513 When using override the probed function will not be executed and
1514 instead rc will be returned.
1515
1516 k:__x64_sys_getuid
1517 /comm == "id"/ {
1518   override(2<<21);
1519 }
1520
1521 uid=4194304 gid=0(root) euid=0(root) groups=0(root)
1522
1523 This feature only works on kernels compiled with
1524 CONFIG_BPF_KPROBE_OVERRIDE and only works on functions tagged
1525 ALLOW_ERROR_INJECTION.
1526
1527 bpftrace does not test whether error injection is allowed for the
1528 probed function; instead it will fail to load the program into the
1529 kernel:
1530
1531 ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
1532 Error attaching probe: 'kprobe:vfs_read'
1533
1534 reg
1535 variants
1536
1537 • reg(const string name)
1538
1539 Supported probes
1540
1541 • kprobe
1542
1543 • uprobe
1544
1545 Get the contents of the register identified by name. Valid names depend
1546 on the CPU architecture.
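
A minimal sketch for x86_64, where "ip" is the register name used elsewhere in this document and "sp" is assumed to be the matching name for the stack pointer. Other architectures use different names.

```bpftrace
kprobe:do_nanosleep {
  // Print the instruction pointer and stack pointer at probe entry.
  printf("ip=0x%lx sp=0x%lx\n", reg("ip"), reg("sp"));
}
```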
1547
1548 signal
1549 variants
1550
1551 • signal(const string sig)
1552
1553 • signal(uint32 signum)
1554
1555 unsafe
1556
1557 Kernel 5.3
1558
1559 Helper bpf_send_signal
1560
1561 Probe types: k(ret)probe, u(ret)probe, USDT, profile
1562
1563 Send a signal to the process being traced. The signal can either be
1564 identified by name, e.g. SIGSTOP or by ID, e.g. 19 as found in kill -l.
1565
1566 kprobe:__x64_sys_execve
1567 /comm == "bash"/ {
1568   signal(5);
1569 }
1570
1571 $ ls
1572 Trace/breakpoint trap (core dumped)
1573
1574 sizeof
1575 variants
1576
1577 • sizeof(TYPE)
1578
1579 • sizeof(EXPRESSION)
1580
1581 compile time
1582
1583 Returns size of the argument in bytes. Similar to C/C++ sizeof
1584 operator. Note that the expression does not get evaluated.
1585
1586 offsetof
1587 variants
1588
1589 • offsetof(STRUCT, FIELD)
1590
1591 • offsetof(EXPRESSION, FIELD)
1592
1593 compile time
1594
1595 Returns the offset, in bytes, of the field within the struct. Similar
1596 to the kernel offsetof macro. Note that subfields are not yet supported.
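
Both sizeof and offsetof can be tried on a small user-defined struct. The struct Foo below is purely hypothetical and exists only for this sketch.

```bpftrace
struct Foo {
  int x;
  char c;
};

BEGIN {
  // Both calls are evaluated at compile time and become constants.
  printf("sizeof(struct Foo)      = %d\n", sizeof(struct Foo));
  printf("offsetof(struct Foo, c) = %d\n", offsetof(struct Foo, c));
  exit();
}
```

With the usual C layout rules on x86_64 (4-byte int, padding up to the struct alignment) the expected values are 8 and 4.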
1597
1598 str
1599 variants
1600
1601 • str(char * data [, uint32 length])
1602
1603 Helper probe_read_str, probe_read_{kernel,user}_str
1604
1605 str reads a NULL terminated (\0) string from data. The maximum string
1606 length is limited by the BPFTRACE_STRLEN env variable, unless length
1607 is specified and shorter than the maximum. In case the string is longer
1608 than the specified length only length - 1 bytes are copied and a NULL
1609 byte is appended at the end.
1610
1611 When available (starting from kernel 5.5, see the --info flag) bpftrace
1612 will automatically use the kernel or user variant of
1613 probe_read_{kernel,user}_str based on the address space of data, see
1614 ADDRESS-SPACES for more information.
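
A short sketch of both forms, assuming the syscalls:sys_enter_openat tracepoint and its filename argument are available. The length of 16 is an arbitrary illustration.

```bpftrace
tracepoint:syscalls:sys_enter_openat {
  // Default form: read up to the configured maximum string length.
  printf("%s opened %s\n", comm, str(args.filename));
  // Bounded form: at most 15 bytes are copied, followed by a NULL byte.
  printf("%s (short) %s\n", comm, str(args.filename, 16));
}
```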
1615
1616 strerror
1617 variants
1618
1619 • strerror strerror(int error)
1620
1621 Convert errno code to string. This is done asynchronously in userspace
1622 when the strerror value is printed, hence the returned value can only
1623 be used for printing.
1624
1625 #include <errno.h>
1626 BEGIN {
1627   print(strerror(EPERM));
1628 }
1629
1630 strftime
1631 variants
1632
1633 • timestr_t strftime(const string fmt, int64 timestamp_ns)
1634
1635 async
1636
1637 Format the nanoseconds since boot timestamp timestamp_ns according to
1638 the format specified by fmt. The time conversion and formatting happens
1639 in user space, therefore the timestr_t value returned can only be used
1640 for printing using the %s format specifier.
1641
1642 bpftrace uses the strftime(3) function for formatting time and supports
1643 the same format specifiers.
1644
1645 i:s:1 {
1646   printf("%s\n", strftime("%H:%M:%S", nsecs));
1647 }
1648
1649 bpftrace also supports the following format string extensions:
1650
1651 ┌──────────┬────────────────────────────┐
1652 │          │                            │
1653 │Specifier │ Description                │
1654 ├──────────┼────────────────────────────┤
1655 │          │                            │
1656 │%f        │ Microsecond as a decimal   │
1657 │          │ number, zero-padded on the │
1658 │          │ left                       │
1659 └──────────┴────────────────────────────┘
1660
1661 strncmp
1662 variants
1663
1664 • int64 strncmp(char * s1, char * s2, int64 n)
1665
1666 strncmp compares up to n characters of string s1 and string s2. If
1667 they’re equal 0 is returned, else a non-zero value is returned.
1668
1669 bpftrace doesn’t read past the length of the shortest string.
1670
1671 The use of the == and != operators is recommended over calling strncmp
1672 directly.
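
A prefix match is one case where strncmp is still convenient. The sketch below counts openat calls from processes whose name starts with "ssh"; the tracepoint and the map name @opens are illustrative choices.

```bpftrace
tracepoint:syscalls:sys_enter_openat
/strncmp(comm, "ssh", 3) == 0/ {
  // Only the first 3 characters of comm are compared.
  @opens[comm] = count();
}
```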
1673
1674 strcontains
1675 variants
1676
1677 • int64 strcontains(const char *haystack, const char *needle)
1678
1679 strcontains checks whether the string haystack contains the string
1680 needle. If needle is contained 1 is returned, else 0 is returned.
1681
1682 bpftrace doesn’t read past the length of the shortest string.
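
As a sketch, strcontains can be used in a filter to select processes by a substring of their name. The tracepoint and the needle "ssh" are illustrative, and it is assumed here that comm is accepted as the haystack argument.

```bpftrace
tracepoint:syscalls:sys_enter_openat
/strcontains(comm, "ssh")/ {
  // Matches any process name that has "ssh" somewhere in it.
  printf("%s opened %s\n", comm, str(args.filename));
}
```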
1683
1684 system
1685 variants
1686
1687 • void system(string namefmt [, ...args])
1688
1689 unsafe async
1690
1691 system runs the specified command (fork and exec), waits until it
1692 completes, and prints its stdout. The command is run with the same
1693 privileges as bpftrace, and it blocks execution of the processing
1694 threads, which can lead to missed events and delayed processing of
1695 async events.
1696
1697 i:s:1 {
1698   time("%H:%M:%S: ");
1699   printf("%d\n", @++);
1700 }
1701 i:s:10 {
1702   system("/bin/sleep 10");
1703 }
1704 i:s:30 {
1705   exit();
1706 }
1707
1708 Note how the async time and printf first print every second until the
1709 i:s:10 probe hits, then they print every 10 seconds due to bpftrace
1710 blocking on sleep.
1711
1712 Attaching 3 probes...
1713 08:50:37: 0
1714 08:50:38: 1
1715 08:50:39: 2
1716 08:50:40: 3
1717 08:50:41: 4
1718 08:50:42: 5
1719 08:50:43: 6
1720 08:50:44: 7
1721 08:50:45: 8
1722 08:50:46: 9
1723 08:50:56: 10
1724 08:50:56: 11
1725 08:50:56: 12
1726 08:50:56: 13
1727 08:50:56: 14
1728 08:50:56: 15
1729 08:50:56: 16
1730 08:50:56: 17
1731 08:50:56: 18
1732 08:50:56: 19
1733
1734 system supports the same format string and arguments that printf does.
1735
1736 t:syscalls:sys_enter_execve {
1737   system("/bin/grep %s /proc/%d/status", "VmSwap", pid);
1738 }
1739
1740 time
1741 variants
1742
1743 • void time(const string fmt)
1744
1745 async
1746
1747 Format the current wall time according to the format specifier fmt and
1748 print it to stdout. Unlike strftime(), time() doesn’t send a timestamp
1749 from the probe; instead it uses the time at which user-space processes
1750 the event.
1751
1752 bpftrace uses the strftime(3) function for formatting time and supports
1753 the same format specifiers.
1754
1755 uaddr
1756 variants
1757
1758 • T * uaddr(const string sym)
1759
1760 Supported probes
1761
1762 • uprobes
1763
1764 • uretprobes
1765
1766 • USDT
1767
1768 Does not work with ASLR, see issue #75
1769 <https://github.com/iovisor/bpftrace/issues/75>
1770
1771 The uaddr function returns the address of the specified symbol. This
1772 lookup happens during program compilation and cannot be used
1773 dynamically.
1774
1775 The default return type is uint64*. If the ELF object size matches a
1776 known integer size (1, 2, 4 or 8 bytes) the return type is modified to
1777 match the width (uint8*, uint16*, uint32* or uint64* resp.). As ELF
1778 does not contain type info the type is always assumed to be unsigned.
1779
1780 uprobe:/bin/bash:readline {
1781   printf("PS1: %s\n", str(*uaddr("ps1_prompt")));
1782 }
1783
1784 uptr
1785 variants
1786
1787 • T * uptr(T * ptr)
1788
1789 Marks ptr as a user address space pointer. See the address-spaces
1790 section for more information on address-spaces. The pointer type is
1791 left unchanged.
1792
1793 usym
1794 variants
1795
1796 • usym_t usym(uint64 * addr)
1797
1798 async
1799
1800 Supported probes
1801
1802 • uprobes
1803
1804 • uretprobes
1805
1806 Equal to ksym but resolves user space symbols.
1807
1808 If ASLR is enabled, user space symbolication only works when the
1809 process is running at either the time of the symbol resolution or the
1810 time of the probe attachment. The latter requires
1811 BPFTRACE_CACHE_USER_SYMBOLS to be set to PER_PID, and might not work
1812 with older versions of BCC. A similar limitation also applies to
1813 dynamically loaded symbols.
1814
1815 uprobe:/bin/bash:readline
1816 {
1817   printf("%s\n", usym(reg("ip")));
1818 }
1819
1820 Prints:
1821
1822 readline
1823
1824 path
1825 variants
1826
1827 • char * path(struct path * path)
1828
1829 Kernel 5.10
1830
1831 Helper bpf_d_path
1832
1833 Return the full path referenced by the struct path pointer argument.
1834
1835 path can only be used in probes attached to functions that are allowed
1836 to call it; these functions are contained in the btf_allowlist_d_path
1836 set in the kernel.
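
A minimal sketch, assuming kfunc probes are supported and that filp_close is on the kernel's allowlist for bpf_d_path (this can differ between kernel versions):

```bpftrace
kfunc:filp_close {
  // Resolve the full path of the file that is being closed.
  printf("%s closed %s\n", comm, path(args.filp->f_path));
}
```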
1837
1838 unwatch
1839 variants
1840
1841 • void unwatch(void * addr)
1842
1843 async
1844
1845 Removes a watchpoint.
1846
1847 skboutput
1848 variants
1849
1850 • uint32 skboutput(const string path, struct sk_buff *skb, uint64
1851 length, const uint64 offset)
1852
1853 Kernel 5.5
1854
1855 Helper bpf_skb_output
1856
1857 Write the data section of sk_buff skb to a PCAP file at path, starting
1858 from offset to offset + length.
1859
1860 The PCAP file is encapsulated in RAW IP, so no ethernet header is
1861 included. The data section in the struct skb may contain an ethernet
1862 header in some kernel contexts; in that case you may set offset to 14
1863 bytes to exclude the ethernet header.
1864
1865 Each packet’s timestamp is determined by adding nsecs and boot time,
1866 the accuracy varies on different kernels, see nsecs.
1867
1868 This function returns 0 on success, or a negative error in case of
1869 failure.
1870
1871 Environment variable BPFTRACE_PERF_RB_PAGES should be increased in
1872 order to capture large packets, or else these packets will be dropped.
1873
1874 Usage
1875
1876 # cat dump.bt
1877 kfunc:napi_gro_receive {
1878   $ret = skboutput("receive.pcap", args.skb, args.skb->len, 0);
1879 }
1880
1881 kfunc:dev_queue_xmit {
1882   // setting offset to 14, to exclude ethernet header
1883   $ret = skboutput("output.pcap", args.skb, args.skb->len, 14);
1884   printf("skboutput returns %d\n", $ret);
1885 }
1886
1887 # export BPFTRACE_PERF_RB_PAGES=1024
1888 # bpftrace dump.bt
1889 ...
1890
1891 # tcpdump -n -r ./receive.pcap | head -3
1892 reading from file ./receive.pcap, link-type RAW (Raw IP)
1893 dropped privs to tcpdump
1894 10:23:44.674087 IP 22.128.74.231.63175 > 192.168.0.23.22: Flags [.], ack 3513221061, win 14009, options [nop,nop,TS val 721277750 ecr 3115333619], length 0
1895 10:23:45.823194 IP 100.101.2.146.53 > 192.168.0.23.46619: 17273 0/1/0 (130)
1896 10:23:45.823229 IP 100.101.2.146.53 > 192.168.0.23.46158: 45799 1/0/0 A 100.100.45.106 (60)
1897
1899 print
1900 variants
1901
1902 • void print(T val)
1903
1904 • void print(@map)
1905
1906 • void print(@map, uint64 top)
1907
1908 • void print(@map, uint64 top, uint64 div)
1909
1910 async
1915
1916 print prints the value, which can be a map or a scalar value, with
1917 the default formatting for the type.
1918
1919 i:ms:10 { @=hist(rand); }
1920 i:s:1 {
1921   print(@);
1922   print(123);
1923   print("abc");
1924   exit();
1925 }
1926
1927 Prints:
1928
1929 @:
1930 [16M, 32M) 3 |@@@ |
1931 [32M, 64M) 2 |@@ |
1932 [64M, 128M) 1 |@ |
1933 [128M, 256M) 4 |@@@@ |
1934 [256M, 512M) 3 |@@@ |
1935 [512M, 1G) 14 |@@@@@@@@@@@@@@ |
1936 [1G, 2G) 22 |@@@@@@@@@@@@@@@@@@@@@@ |
1937 [2G, 4G) 51 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
1938
1939 123
1940 abc
1941
1942 Note that maps are printed by reference while scalar values are copied.
1943 This means that updating and printing maps in a fast loop will likely
1944 result in bogus map values as the map will be updated before userspace
1945 gets the time to dump and print it.
1946
1947 The printing of maps supports the optional top and div arguments. top
1948 limits the printing to the top N entries with the highest integer
1949 values:
1950
1951 BEGIN {
1952   $i = 11;
1953   while($i) {
1954     @[$i] = --$i;
1955   }
1956   print(@, 2);
1957   clear(@);
1958   exit();
1959 }
1960
1961 @[9]: 9
1962 @[10]: 10
1963
1964 The div argument scales the values prior to printing them. Scaling
1965 values before storing them can result in rounding errors. Consider the
1966 following program:
1967
1968 k:f {
1969   @[func] += arg0/10;
1970 }
1971
1972 With the sequence 134, 377, 111, 99 as values for arg0, the total is
1973 721, which rounds to 72 when scaled by 10, but the program would print
1974 70 due to the rounding of individual values.
1975
1976 Changing the print call in the BEGIN example above to print(@, 5, 2)
1977 will take the top 5 values and divide them by 2:
1978
1979 @[6]: 3
1980 @[7]: 3
1981 @[8]: 4
1982 @[9]: 4
1983 @[10]: 5
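
Applied to the k:f program above, one way to avoid the rounding loss is to store the raw arg0 values and leave the division to print. This is only a sketch: f remains a placeholder function name, and the top value of 5 is arbitrary.

```bpftrace
k:f {
  // Store the unscaled value so no precision is lost per event.
  @[func] += arg0;
}

END {
  // Scale once, at print time: top 5 entries, each divided by 10.
  print(@, 5, 10);
}
```

With the arg0 sequence from the text the stored total is 721, which print would then show as 72 rather than 70.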
1984
1985 printf
1986 variants
1987
1988 • void printf(const string fmt, args...)
1989
1990 async
1991
1992 printf() formats and prints data. It behaves similarly to printf() found
1993 in C and many other languages.
1994
1995 The format string has to be a constant, it cannot be modified at
1996 runtime. The formatting of the string happens in user space. Values are
1997 copied and passed by value.
1998
1999 bpftrace supports all the typical format specifiers like %llx and %hhu.
2000 The non-standard ones can be found in the table below:
2001
2002 ┌──────────┬────────┬─────────────────────┐
2003 │          │        │                     │
2004 │Specifier │ Type   │ Description         │
2005 ├──────────┼────────┼─────────────────────┤
2006 │          │        │                     │
2007 │%r        │ buffer │ Hex-formatted       │
2008 │          │        │ string to print     │
2009 │          │        │ arbitrary binary    │
2010 │          │        │ content returned by │
2011 │          │        │ the buf function.   │
2013 └──────────┴────────┴─────────────────────┘
2014
2015 Supported escape sequences
2016
2017 Colors are supported too, using standard terminal escape sequences:
2018
2019 printf("\033[31mRed\t\033[33mYellow\033[0m\n")
2020
2022 bpftrace supports various probe types which allow the user to attach
2023 BPF programs to different types of events. Each probe starts with a
2024 provider (e.g. kprobe) followed by a colon (:) separated list of
2025 options. The number of options and their meaning depend on the provider
2026 and are detailed below. The valid values for options can depend on the
2027 system or binary being traced, e.g. for uprobes it depends on the
2028 binary. Also see LISTING PROBES.
2029
2030 It is possible to associate multiple probes with a single action as
2031 long as the action is valid for all specified probes. Multiple probes
2032 can be specified as a comma (,) separated list:
2033
2034 kprobe:tcp_reset,kprobe:tcp_v4_rcv {
2035   printf("Entered: %s\n", probe);
2036 }
2037
2038 Wildcards are supported too:
2039
2040 kprobe:tcp_* {
2041   printf("Entered: %s\n", probe);
2042 }
2043
2044 Both can be combined:
2045
2046 kprobe:tcp_reset,kprobe:*socket* {
2047   printf("Entered: %s\n", probe);
2048 }
2049
2050 Most providers also support a short name which can be used instead of
2051 the full name, e.g. kprobe:f and k:f are identical.
2052
2053 BEGIN and END
2054 These are special built-in events provided by the bpftrace runtime.
2055 BEGIN is triggered before all other probes are attached. END is
2056 triggered after all other probes are detached.
2057
2058 Note that specifying an END probe doesn’t override the printing of
2059 'non-empty' maps at exit. To prevent this printing, all used maps need
2060 to be cleared, which can be done in the END probe:
2061
2062 END {
2063   clear(@map1);
2064   clear(@map2);
2065 }
2066
2067 hardware
2068 variants
2069
2070 • hardware:event_name:
2071
2072 • hardware:event_name:count
2073
2074 shortname
2075
2076 • h
2077
2078 The hardware probe attaches to pre-defined hardware events provided by
2079 the kernel.
2080
2081 They are implemented using performance monitoring counters (PMCs):
2082 hardware resources on the processor. There are about ten of these, and
2083 they are documented in the perf_event_open(2) man page. The event names
2084 are:
2085
2086 • cpu-cycles or cycles
2087
2088 • instructions
2089
2090 • cache-references
2091
2092 • cache-misses
2093
2094 • branch-instructions or branches
2095
2096 • branch-misses
2097
2098 • bus-cycles
2099
2100 • frontend-stalls
2101
2102 • backend-stalls
2103
2104 • ref-cycles
2105
2106 The count option specifies how many events must happen before the probe
2107 fires. If count is left unspecified a default value is used.
2108
2109 hardware:cache-misses:1e6 { @[pid] = count(); }
2110
2111 interval
2112 variants
2113
2114 • interval:us:count
2115
2116 • interval:ms:count
2117
2118 • interval:s:count
2119
2120 • interval:hz:rate
2121
2122 shortnames
2123
2124 • i
2125
2126 The interval probe fires at a fixed interval as specified by its time
2127 spec. Interval probes fire on one CPU at a time, unlike profile probes.
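
Interval probes are commonly used to print periodic summaries. The sketch below counts system calls and prints the count once per second; the raw_syscalls:sys_enter tracepoint and the map name @syscalls are assumptions made for illustration.

```bpftrace
tracepoint:raw_syscalls:sys_enter {
  @syscalls = count();
}

interval:s:1 {
  // Fires once per second on a single CPU: report, then start over.
  print(@syscalls);
  clear(@syscalls);
}
```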
2128
2129 iterator
2130 variants
2131
2132 • iter:task
2133
2134 • iter:task:pin
2135
2136 • iter:task_file
2137
2138 • iter:task_file:pin
2139
2140 • iter:task_vma
2141
2142 • iter:task_vma:pin
2143
2144 shortnames
2145
2146 • it
2147
2148 These are eBPF iterator probes that allow iteration over kernel
2149 objects.
2150
2151 An iterator probe can’t be mixed with any other probe, not even another
2152 iterator.
2153
2154 Each iterator probe provides a set of fields that can be accessed
2155 through the ctx pointer. The set of available fields for an iterator
2156 can be displayed with the -lv options as described below.
2157
2158 Examples:
2159
2160 # bpftrace -e 'iter:task { printf("%s:%d\n", ctx->task->comm, ctx->task->pid); }'
2161 Attaching 1 probe...
2162 systemd:1
2163 kthreadd:2
2164 rcu_gp:3
2165 rcu_par_gp:4
2166 kworker/0:0H:6
2167 mm_percpu_wq:8
2168 ...
2169
2170 # bpftrace -e 'iter:task_file { printf("%s:%d %d:%s\n", ctx->task->comm, ctx->task->pid, ctx->fd, path(ctx->file->f_path)); }'
2171 Attaching 1 probe...
2172 systemd:1 1:/dev/null
2173 systemd:1 2:/dev/null
2174 systemd:1 3:/dev/kmsg
2175 ...
2176 su:1622 1:/dev/pts/1
2177 su:1622 2:/dev/pts/1
2178 su:1622 3:/var/lib/sss/mc/passwd
2179 ...
2180 bpftrace:1892 1:pipe:[35124]
2181 bpftrace:1892 2:/dev/pts/1
2182 bpftrace:1892 3:anon_inode:bpf-map
2183 bpftrace:1892 4:anon_inode:bpf-map
2184 bpftrace:1892 5:anon_inode:bpf_link
2185 bpftrace:1892 6:anon_inode:bpf-prog
2186 bpftrace:1892 7:anon_inode:bpf_iter
2187
2188 # bpftrace -e 'iter:task_vma {printf("%s %d %lx-%lx\n", comm, pid, ctx->vma->vm_start, ctx->vma->vm_end);}'
2189 Attaching 1 probe...
2190 bpftrace 119480 55b92c380000-55b92c386000
2191 bpftrace 119480 55b92c386000-55b92c391000
2192 bpftrace 119480 55b92c391000-55b92c397000
2193 bpftrace 119480 55b92c398000-55b92c399000
2194 bpftrace 119480 55b92c399000-55b92c39a000
2195 bpftrace 119480 55b92cce3000-55b92d010000
2196 ...
2197 bpftrace 119480 7ffd55dde000-7ffd55de2000
2198 bpftrace 119480 7ffd55de2000-7ffd55de4000
2199
An iterator can be pinned by adding the optional ':pin' part to the
probe, which defines the pin file. The pin file can be specified as an
absolute path or as a path relative to /sys/fs/bpf.
2203
2204 relative pin
2205
2206 # bpftrace -e 'iter:task:list { printf("%s:%d\n", ctx->task->comm, ctx->task->pid); }'
2207 Program pinned to /sys/fs/bpf/list
2208
2209 # cat /sys/fs/bpf/list
2210 systemd:1
2211 kthreadd:2
2212 rcu_gp:3
2213 rcu_par_gp:4
2214 kworker/0:0H:6
2215 mm_percpu_wq:8
2216 rcu_tasks_kthre:9
2217 ...
2218
Example with an absolute pin file:
2220
2221 absolute pin
2222
2223 # bpftrace -e '
2224 iter:task_file:/sys/fs/bpf/files {
2225 printf("%s:%d %s\n", ctx->task->comm, ctx->task->pid, path(ctx->file->f_path));
2226 }'
2227
2228 Program pinned to /sys/fs/bpf/files
2229
2230 # cat /sys/fs/bpf/files
2231 systemd:1 anon_inode:inotify
2232 systemd:1 anon_inode:[timerfd]
2233 ...
2234 systemd-journal:849 /dev/kmsg
2235 systemd-journal:849 anon_inode:[eventpoll]
2236 ...
2237 sssd:1146 /var/log/sssd/sssd.log
2238 sssd:1146 anon_inode:[eventpoll]
2239 ...
2240 NetworkManager:1155 anon_inode:[eventfd]
2241 NetworkManager:1155 /var/lib/sss/mc/passwd (deleted)
2242
2243 kfunc and kretfunc
2244 variants
2245
2246 • kfunc[:mod]:fn
2247
2248 • kretfunc[:mod]:fn
2249
2250 shortnames
2251
2252 • f (kfunc)
2253
2254 • fr (kretfunc)
2255
2256 requires (--info)
2257
2258 • Kernel features:BTF
2259
2260 • Probe types:kfunc
2261
kfuncs attach to kernel functions, similar to kprobe and kretprobe.
They make use of eBPF trampolines, which allow kernel code to call into
BPF programs with near zero overhead.

kfuncs make use of BTF type information to derive the types of function
arguments at compile time. This removes the need for manual type
casting and makes the code more resilient against small signature
changes in the kernel. The function arguments are available in the args
struct, which can be inspected with verbose listing (see LISTING
PROBES). These arguments are also available in the return probe
(kretfunc).
2273
2274 # bpftrace -lv 'kfunc:tcp_reset'
2275 kfunc:tcp_reset
2276 struct sock * sk
2277 struct sk_buff * skb
2278
2279 kfunc:x86_pmu_stop {
2280 printf("pmu %s stop\n", str(args.event->pmu->name));
2281 }
2282
2283 kretfunc:fget {
2284 printf("fd %d name %s\n", args.fd, str(retval->f_path.dentry->d_name.name));
2285 }
2286
2287 fd 3 name ld.so.cache
2288 fd 3 name libselinux.so.1
2289 fd 3 name libselinux.so.1
2290 ...
2291
2292 kfunc:kvm:x86_emulate_insn { @ = count(); }
2293
2294 @ = 347603
2295
2296 kprobe and kretprobe
2297 variants
2298
2299 • kprobe:fn
2300
2301 • kprobe:fn+offset
2302
2303 • kretprobe:fn
2304
2305 shortnames
2306
2307 • k
2308
2309 • kr
2310
kprobes allow for dynamic instrumentation of kernel functions. Each
time the specified kernel function is executed, the attached BPF
programs are run.
2314
2315 kprobe:tcp_reset {
2316 @tcp_resets = count()
2317 }
2318
Function arguments are available through the argX and sargX builtins,
for register arguments and stack arguments respectively. Whether an
argument is passed on the stack or in a register depends on the
architecture and on the number of arguments in use. For example, on
x86_64 the first six non-floating-point arguments are passed in
registers and all following arguments are passed on the stack. Note
that floating point arguments are typically passed in special
registers that don’t count as argX arguments, which can cause
confusion. Consider a function with the following signature:
2327
2328 void func(int a, double d, int x)
2329
Because d is a floating point argument, x is accessed through arg1
where one might expect arg2.
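
For functions that take only integer and pointer arguments, the mapping
is direct. A minimal sketch, assuming the running kernel exposes
vfs_read with its usual signature (file, buf, count, pos), so that the
requested byte count is the third argument. The map name is
illustrative:

```bpftrace
// arg2 is the third register argument (count), assuming the usual
// vfs_read(file, buf, count, pos) signature on the running kernel.
kprobe:vfs_read {
  @requested_bytes = hist(arg2);
}
```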
2332
bpftrace does not detect the function signature, so it is not aware of
the argument count or the argument types. It is up to the user to
perform type conversion when needed, e.g.
2336
2337 kprobe:tcp_connect
2338 {
2339 $sk = ((struct sock *) arg0);
2340 ...
2341 }
2342
kprobes are not limited to function entry. They can be attached to any
instruction in a function by specifying an offset from the start of the
function.
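
A sketch of the offset form follows. Both the function name and the
offset are assumptions chosen for illustration: the function must exist
on the running kernel, and the offset must land on an instruction
boundary inside it, which depends on the kernel build.

```bpftrace
// Hypothetical offset; valid offsets differ between kernel builds.
kprobe:do_sys_open+9 {
  printf("in here\n");
}
```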
2346
kretprobes trigger on the return from a kernel function. Return probes
do not have access to the function (input) arguments, only to the
return value (through retval). A common pattern to work around this is
to store the arguments in a map on function entry and retrieve them in
the return probe:
2352
2353 kprobe:d_lookup
2354 {
2355 $name = (struct qstr *)arg1;
2356 @fname[tid] = $name->name;
2357 }
2358
kretprobe:d_lookup
/@fname[tid]/
{
  printf("%-8d %-6d %-16s M %s\n", elapsed / 1e6, pid, comm,
      str(@fname[tid]));
  // Remove the stored entry so the map does not keep stale keys.
  delete(@fname[tid]);
}
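
The same store-on-entry pattern is commonly used to time a function. A
minimal sketch, assuming vfs_read can be probed on the running kernel.
The map names @start and @read_ns are illustrative:

```bpftrace
kprobe:vfs_read {
  // Remember the entry timestamp for this thread.
  @start[tid] = nsecs;
}

kretprobe:vfs_read
/@start[tid]/
{
  // Record the elapsed time in nanoseconds as a histogram.
  @read_ns = hist(nsecs - @start[tid]);
  // Drop the per-thread entry once it has been consumed.
  delete(@start[tid]);
}
```

The predicate skips returns for which no entry was recorded, for
example calls that were already in flight when tracing started.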
2365
2366 profile
2367 variants
2368
2369 • profile:us:count
2370
2371 • profile:ms:count
2372
2373 • profile:s:count
2374
2375 • profile:hz:rate
2376
2377 shortnames
2378
2379 • p
2380
Profile probes fire on each CPU at the specified interval.
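
A common use of profile probes is stack sampling. A minimal sketch,
with 99 Hz as an arbitrary example rate and @samples as an
illustrative map name:

```bpftrace
// Sample the kernel stack on every CPU 99 times per second.
profile:hz:99 {
  @samples[kstack] = count();
}
```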
2382
2383 software
2384 variants
2385
2386 • software:event:
2387
2388 • software:event:count
2389
2390 shortnames
2391
2392 • s
2393
2394 The software probe attaches to pre-defined software events provided by
2395 the kernel. Event details can be found in the perf_event_open(2) man
2396 page.
2397
2398 The event names are:
2399
2400 • cpu-clock or cpu
2401
2402 • task-clock
2403
2404 • page-faults or faults
2405
2406 • context-switches or cs
2407
2408 • cpu-migrations
2409
2410 • minor-faults
2411
2412 • major-faults
2413
2414 • alignment-faults
2415
2416 • emulation-faults
2417
2418 • dummy
2419
2420 • bpf-output
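
As a sketch of the count form, the following program fires once per
100 page-fault events and counts those samples by process name. The
sample count of 100 and the map name are arbitrary choices for
illustration:

```bpftrace
// Fires once per 100 page-fault events.
software:faults:100 {
  @faults[comm] = count();
}
```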
2421
2422 tracepoint
2423 variants
2424
2425 • tracepoint:subsys:event
2426
2427 shortnames
2428
2429 • t
2430
Tracepoints are hooks into events in the kernel. Tracepoints are
defined in the kernel source and compiled into the kernel binary, which
makes them a form of static tracing. Unlike kprobes, new tracepoints
cannot be added without modifying the kernel.

The advantage of tracepoints is that they generally provide a more
stable interface than kprobes do, because they do not depend on the
existence of a kernel function.
2439
2440 Tracepoint arguments are available in the args struct which can be
2441 inspected with verbose listing, see the LISTING PROBES section for more
2442 details.
2443
2444 tracepoint:syscalls:sys_enter_openat {
2445 printf("%s %s\n", comm, str(args.filename));
2446 }
2447
2448 irqbalance /proc/interrupts
2449 irqbalance /proc/stat
2450 snmpd /proc/diskstats
2451 snmpd /proc/stat
2452 snmpd /proc/vmstat
2453 snmpd /proc/net/dev
2454 [...]
2455
2456 Additional information
2457
2458 • https://www.kernel.org/doc/html/latest/trace/tracepoints.html
2459
2460 rawtracepoint
2461 variants
2462
2463 • rawtracepoint:event
2464
2465 shortnames
2466
2467 • rt
2468
tracepoint and rawtracepoint are triggered by the same hook point and
are nearly identical in terms of functionality. The only difference is
in the program context: rawtracepoint offers the raw arguments to the
tracepoint, while tracepoint applies further processing to the raw
arguments. The additional processing is defined inside the kernel.

Tracepoint arguments are available via the argN builtins. The available
arguments can be found in the kernel source tree under
include/trace/events/. Each arg is a 64-bit integer.
2479
2480 rawtracepoint:block_rq_insert {
2481 printf("%llx %llx\n", arg0, arg1);
2482 }
2483
2484 ffff88810977d6f8 ffff8881097e8e80
2485 [...]
2486
2487 uprobe, uretprobe
2488 variants
2489
2490 • uprobe:binary:func
2491
2492 • uprobe:binary:func+offset
2493
2494 • uprobe:binary:offset
2495
2496 • uretprobe:binary:func
2497
2498 shortnames
2499
2500 • u
2501
2502 • ur
2503
uprobes, or user-space probes, are the user-space equivalent of
kprobes. The same limitations that apply to kprobe and kretprobe also
apply to uprobes and uretprobes.
2507
When tracing libraries, it is sufficient to specify the library name
instead of a full path. The path will then be automatically resolved
using /etc/ld.so.cache:
2511
2512 # bpftrace -e 'uprobe:libc:malloc { printf("Allocated %d bytes\n", arg0); }'
2513 Allocated 4 bytes
2514 ...
2515
2516 If the traced binary has DWARF included, function arguments are
2517 available in the args struct which can be inspected with verbose
2518 listing, see the LISTING PROBES section for more details.
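
A minimal sketch of reading a DWARF-derived argument, assuming that
/bin/bash on the target system was built with DWARF debug information
and exposes rl_set_prompt with a prompt argument:

```bpftrace
// Requires DWARF in /bin/bash; without it the args struct is not
// available for this probe.
uprobe:/bin/bash:rl_set_prompt {
  printf("prompt: %s\n", str(args.prompt));
}
```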
2519
2520 When tracing C++ programs, it is possible to turn on automatic symbol
2521 demangling by using the :cpp prefix:
2522
2523 # bpftrace -e 'u:src/bpftrace:cpp:"bpftrace::BPFtrace::add_probe" { print("adding probe\n"); }'
2524 Attaching 1 probe...
2525 adding probe
2526
It is important to note that for uretprobes to work, the kernel runs a
special helper on user-space function entry which overrides the return
address on the stack. This can cause issues with languages that have
their own runtime, like Golang:
2531
2532 example.go
2533
2534 func myprint(s string) {
2535 fmt.Printf("Input: %s\n", s)
2536 }
2537
2538 func main() {
2539 ss := []string{"a", "b", "c"}
2540 for _, s := range ss {
2541 go myprint(s)
2542 }
2543 time.Sleep(1*time.Second)
2544 }
2545
2546 bpftrace
2547
2548 # bpftrace -e 'uretprobe:./test:main.myprint { @=count(); }' -c ./test
2549 runtime: unexpected return pc for main.myprint called from 0x7fffffffe000
2550 stack: frame={sp:0xc00008cf60, fp:0xc00008cfd0} stack=[0xc00008c000,0xc00008d000)
2551 fatal error: unknown caller pc
2552
2553 usdt
2554 variants
2555
2556 • usdt:binary_path:probe_name
2557
2558 • usdt:binary_path:[probe_namespace]:probe_name
2559
2560 • usdt:library_path:probe_name
2561
2562 • usdt:library_path:[probe_namespace]:probe_name
2563
2564 shortnames
2565
2566 • U
2567
You can target the entire host (or an entire process’s address space by
using the -p arg) by using a single wildcard in place of the
binary_path/library_path, e.g. bpftrace -e 'usdt:*:loop {
printf("hi\n"); }'. Please note that if you use wildcards for the
probe_name or probe_namespace and end up targeting multiple USDTs for
the same probe, you might get errors if you also utilize the USDT
argument builtins (e.g. arg0), as they could be of different types.
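
A minimal sketch of attaching to a single USDT probe. The binary path
/root/tick and the probe name loop are hypothetical placeholders for a
program that defines such a probe, and its first argument is assumed to
be an integer:

```bpftrace
// Hypothetical binary and probe name; adjust to the traced program.
usdt:/root/tick:loop {
  printf("loop hit, arg0 = %d\n", arg0);
}
```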
2575
2576 watchpoint and asyncwatchpoint
2577 variants
2578
2579 • watchpoint:absolute_address:length:mode
2580
2581 • watchpoint:function+argN:length:mode
2582
2583 shortnames
2584
2585 • w
2586
2587 • aw
2588
2589 These are memory watchpoints provided by the kernel. Whenever a memory
2590 address is written to (w), read from (r), or executed (x), the kernel
2591 can generate an event.
2592
2593 In the first form, an absolute address is monitored. If a pid (-p) or a
2594 command (-c) is provided, bpftrace takes the address as a userspace
2595 address and monitors the appropriate process. If not, bpftrace takes
2596 the address as a kernel space address.
2597
In the second form, the address present in argN when the function is
entered is monitored. A pid or command must be provided for this form.
2600 If synchronous (watchpoint), a SIGSTOP is sent to the tracee upon
2601 function entry. The tracee will be SIGCONTed after the watchpoint is
2602 attached. This is to ensure events are not missed. If you want to avoid
2603 the SIGSTOP + SIGCONT use asyncwatchpoint.
2604
2605 Note that on most architectures you may not monitor for execution while
2606 monitoring read or write.
2607
2608 Examples
2609
2610 Print hit when a read from or write to 0x10000000 happens:
2611
2612 # bpftrace -e 'watchpoint:0x10000000:8:rw { printf("hit!\n"); exit(); }' -c ./testprogs/watchpoint
2613
2614 Print the call stack every time the jiffies variable is updated:
2615
2616 # bpftrace -e "watchpoint:0x$(awk '$3 == "jiffies" {print $1}' /proc/kallsyms):8:w {
2617 @[kstack] = count();
2618 }
2619
2620 i:s:1 { exit(); }"
2621 ......
2622 @[
2623 do_timer+12
2624 tick_do_update_jiffies64.part.22+89
2625 tick_sched_do_timer+103
2626 tick_sched_timer+39
2627 __hrtimer_run_queues+256
2628 hrtimer_interrupt+256
2629 smp_apic_timer_interrupt+106
2630 apic_timer_interrupt+15
2631 cpuidle_enter_state+188
2632 cpuidle_enter+41
2633 do_idle+536
2634 cpu_startup_entry+25
2635 start_secondary+355
2636 secondary_startup_64+164
2637 ]: 319
2638
Print "hit!" and exit when the memory pointed to by arg1 of increment
is written to:
2641
2642 # cat wpfunc.c
2643 #include <stdio.h>
2644 #include <stdlib.h>
2645 #include <unistd.h>
2646
2647 __attribute__((noinline))
2648 void increment(__attribute__((unused)) int _, int *i)
2649 {
2650 (*i)++;
2651 }
2652
2653 int main()
2654 {
int *i = calloc(1, sizeof(int)); /* zero-initialized counter */
2656 while (1)
2657 {
2658 increment(0, i);
2659 (*i)++;
2660 usleep(1000);
2661 }
2662 }
2663
2664 # bpftrace -e 'watchpoint:increment+arg1:4:w { printf("hit!\n"); exit() }' -c ./wpfunc
2665
LISTING PROBES
2667 Probe listing is the method to discover which probes are supported by
2668 the current system. Listing supports the same syntax as normal
2669 attachment does:
2670
2671 # bpftrace -l 'kprobe:*'
# bpftrace -l 't:syscalls:*openat*'
2673 # bpftrace -l 'kprobe:tcp*,trace
2674 # bpftrace -l 'k:*socket*,tracepoint:syscalls:*tcp*'
2675
2676 The verbose flag (-v) can be specified to inspect arguments (args) for
2677 providers that support it:
2678
2679 # bpftrace -l 'fr:tcp_reset,t:syscalls:sys_enter_openat' -v
2680 kretfunc:tcp_reset
2681 struct sock * sk
2682 struct sk_buff * skb
2683 tracepoint:syscalls:sys_enter_openat
2684 int __syscall_nr
2685 int dfd
2686 const char * filename
2687 int flags
2688 umode_t mode
2689 # bpftrace -l 'uprobe:/bin/bash:rl_set_prompt' -v # works only if /bin/bash has DWARF
2690 uprobe:/bin/bash:rl_set_prompt
2691 const char *prompt
2692
2693
2694
2695 2023-10-04 BPFTRACE(8)