PERF_EVENT_OPEN(2)          Linux Programmer's Manual          PERF_EVENT_OPEN(2)

NAME
perf_event_open - set up performance monitoring
7
SYNOPSIS
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

int perf_event_open(struct perf_event_attr *attr,
                    pid_t pid, int cpu, int group_fd,
                    unsigned long flags);
15
16 Note: There is no glibc wrapper for this system call; see NOTES.
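
Because there is no glibc wrapper, callers typically reach the system call through syscall(2). The following is a minimal sketch of such a wrapper; the name sys_perf_event_open is a hypothetical choice for illustration and is not part of any library.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper name: forwards its arguments to the raw system call.
   Returns the new file descriptor, or -1 with errno set on failure. */
static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```

The wrapper adds no policy of its own; error handling is left to the caller, which should inspect errno when -1 is returned.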
17
DESCRIPTION
Given a list of parameters, perf_event_open() returns a file descriptor
for use in subsequent system calls (read(2), mmap(2), prctl(2),
fcntl(2), etc.).
22
23 A call to perf_event_open() creates a file descriptor that allows mea‐
24 suring performance information. Each file descriptor corresponds to
25 one event that is measured; these can be grouped together to measure
26 multiple events simultaneously.
27
28 Events can be enabled and disabled in two ways: via ioctl(2) and via
29 prctl(2). When an event is disabled it does not count or generate
30 overflows but does continue to exist and maintain its count value.
31
32 Events come in two flavors: counting and sampled. A counting event is
33 one that is used for counting the aggregate number of events that oc‐
34 cur. In general, counting event results are gathered with a read(2)
35 call. A sampling event periodically writes measurements to a buffer
36 that can then be accessed via mmap(2).
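
As an illustration of a counting event, the sketch below opens a counter in the disabled state, enables it with ioctl(2) around a busy loop, and gathers the result with read(2). It is only a sketch under stated assumptions: the helper names are hypothetical, the task-clock software event is chosen simply because it needs no hardware PMU, and the call may fail on systems with a strict perf_event_paranoid setting.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper around the raw system call. */
static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Counts the task clock of the calling thread around a busy loop.
   Returns 0 and stores the count in *out, or -1 if any step fails. */
static int count_busy_loop(uint64_t *out)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_TASK_CLOCK;
    attr.disabled = 1;        /* start disabled, enable explicitly below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)sys_perf_event_open(&attr, 0, -1, -1, 0);
    if (fd == -1)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile unsigned long sink = 0;   /* work to be measured */
    for (unsigned long i = 0; i < 1000000; i++)
        sink += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* With read_format == 0, read(2) yields a single 64-bit count. */
    ssize_t n = read(fd, out, sizeof(*out));
    close(fd);
    return n == (ssize_t)sizeof(*out) ? 0 : -1;
}
```

While the event is disabled it keeps its count, so the read(2) after PERF_EVENT_IOC_DISABLE reflects only the enabled interval.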
37
38 Arguments
39 The pid and cpu arguments allow specifying which process and CPU to
40 monitor:
41
42 pid == 0 and cpu == -1
43 This measures the calling process/thread on any CPU.
44
45 pid == 0 and cpu >= 0
46 This measures the calling process/thread only when running on
47 the specified CPU.
48
49 pid > 0 and cpu == -1
50 This measures the specified process/thread on any CPU.
51
52 pid > 0 and cpu >= 0
53 This measures the specified process/thread only when running on
54 the specified CPU.
55
56 pid == -1 and cpu >= 0
57 This measures all processes/threads on the specified CPU. This
58 requires CAP_PERFMON (since Linux 5.8) or CAP_SYS_ADMIN capabil‐
59 ity or a /proc/sys/kernel/perf_event_paranoid value of less than
60 1.
61
62 pid == -1 and cpu == -1
63 This setting is invalid and will return an error.
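
The combinations above can be summarized as a small classification function. This is an illustrative sketch only: the enum and function names are hypothetical, values not listed above (pid < -1 or cpu < -1) are treated as invalid by assumption, and the PERF_FLAG_PID_CGROUP case, in which pid is a file descriptor, is not covered.

```c
#include <sys/types.h>

/* Hypothetical names describing what a (pid, cpu) pair selects. */
enum scope {
    SCOPE_SELF_ANY_CPU,   /* pid == 0,  cpu == -1 */
    SCOPE_SELF_ONE_CPU,   /* pid == 0,  cpu >= 0  */
    SCOPE_TASK_ANY_CPU,   /* pid > 0,   cpu == -1 */
    SCOPE_TASK_ONE_CPU,   /* pid > 0,   cpu >= 0  */
    SCOPE_CPU_WIDE,       /* pid == -1, cpu >= 0  */
    SCOPE_INVALID         /* pid == -1, cpu == -1, or unlisted values */
};

static enum scope scope_of(pid_t pid, int cpu)
{
    if (pid < -1 || cpu < -1)
        return SCOPE_INVALID;          /* not listed above (assumption) */
    if (pid == -1)
        return cpu >= 0 ? SCOPE_CPU_WIDE : SCOPE_INVALID;
    if (pid == 0)
        return cpu == -1 ? SCOPE_SELF_ANY_CPU : SCOPE_SELF_ONE_CPU;
    return cpu == -1 ? SCOPE_TASK_ANY_CPU : SCOPE_TASK_ONE_CPU;
}
```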
64
65 When pid is greater than zero, permission to perform this system call
66 is governed by CAP_PERFMON (since Linux 5.9) and a ptrace access mode
67 PTRACE_MODE_READ_REALCREDS check on older Linux versions; see
68 ptrace(2).
69
70 The group_fd argument allows event groups to be created. An event
71 group has one event which is the group leader. The leader is created
72 first, with group_fd = -1. The rest of the group members are created
73 with subsequent perf_event_open() calls with group_fd being set to the
74 file descriptor of the group leader. (A single event on its own is
75 created with group_fd = -1 and is considered to be a group with only 1
76 member.) An event group is scheduled onto the CPU as a unit: it will
77 be put onto the CPU only if all of the events in the group can be put
78 onto the CPU. This means that the values of the member events can be
79 meaningfully compared—added, divided (to get ratios), and so on—with
80 each other, since they have counted events for the same set of executed
81 instructions.
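
A group of two events might be created as sketched below: the leader is opened with group_fd = -1 and the member with group_fd set to the leader's file descriptor. The helper names are hypothetical, and the choice of cycles and instructions is only an example; hardware events may be unavailable (for instance in some virtual machines), in which case the function reports failure.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper around the raw system call. */
static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Opens cycles (leader, fds[0]) and instructions (member, fds[1]) for the
   calling thread.  Returns 0 on success, or -1 with fds[0] set to -1. */
static int open_cycles_instructions_group(int fds[2])
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;                 /* the leader starts disabled */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    fds[0] = (int)sys_perf_event_open(&attr, 0, -1, -1, 0);
    if (fds[0] == -1)
        return -1;

    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 0;                 /* the member follows the leader */
    fds[1] = (int)sys_perf_event_open(&attr, 0, -1, fds[0], 0);
    if (fds[1] == -1) {
        close(fds[0]);
        fds[0] = -1;
        return -1;
    }
    return 0;
}
```

The whole group can then be started with ioctl(fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP). Because both events are scheduled together, dividing the instruction count by the cycle count gives a meaningful ratio.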
82
83 The flags argument is formed by ORing together zero or more of the fol‐
84 lowing values:
85
86 PERF_FLAG_FD_CLOEXEC (since Linux 3.14)
87 This flag enables the close-on-exec flag for the created event
88 file descriptor, so that the file descriptor is automatically
closed on execve(2). Setting the close-on-exec flag at creation
time, rather than later with fcntl(2), avoids potential
91 race conditions where the calling thread invokes
92 perf_event_open() and fcntl(2) at the same time as another
93 thread calls fork(2) then execve(2).
94
95 PERF_FLAG_FD_NO_GROUP
96 This flag tells the event to ignore the group_fd parameter ex‐
97 cept for the purpose of setting up output redirection using the
98 PERF_FLAG_FD_OUTPUT flag.
99
100 PERF_FLAG_FD_OUTPUT (broken since Linux 2.6.35)
101 This flag re-routes the event's sampled output to instead be in‐
102 cluded in the mmap buffer of the event specified by group_fd.
103
104 PERF_FLAG_PID_CGROUP (since Linux 2.6.39)
105 This flag activates per-container system-wide monitoring. A
106 container is an abstraction that isolates a set of resources for
107 finer-grained control (CPUs, memory, etc.). In this mode, the
108 event is measured only if the thread running on the monitored
109 CPU belongs to the designated container (cgroup). The cgroup is
110 identified by passing a file descriptor opened on its directory
111 in the cgroupfs filesystem. For instance, if the cgroup to mon‐
112 itor is called test, then a file descriptor opened on
113 /dev/cgroup/test (assuming cgroupfs is mounted on /dev/cgroup)
114 must be passed as the pid parameter. cgroup monitoring is
115 available only for system-wide events and may therefore require
116 extra permissions.
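
The following sketch shows how flags might be ORed together for cgroup monitoring: a file descriptor opened on the cgroup directory is passed as pid, together with a specific CPU. The helper names are hypothetical, the CPU-clock software event is an arbitrary choice, and the sketch assumes the event keeps its own reference to the cgroup so that the directory descriptor can be closed afterwards.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper around the raw system call. */
static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Opens a CPU-clock event restricted to threads of the cgroup whose
   directory is cgroup_dir, on the given CPU.  Returns the event file
   descriptor, or -1 on failure. */
static int open_cgroup_event(const char *cgroup_dir, int cpu)
{
    int cgroup_fd = open(cgroup_dir, O_RDONLY | O_DIRECTORY);
    if (cgroup_fd == -1)
        return -1;

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_CPU_CLOCK;
    attr.disabled = 1;

    /* The cgroup directory descriptor takes the place of pid. */
    int fd = (int)sys_perf_event_open(&attr, cgroup_fd, cpu, -1,
                                      PERF_FLAG_PID_CGROUP |
                                      PERF_FLAG_FD_CLOEXEC);
    close(cgroup_fd);   /* assumption: the event holds its own reference */
    return fd;
}
```

With cgroupfs mounted on /dev/cgroup as in the example above, a call might look like open_cgroup_event("/dev/cgroup/test", 0).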
117
118 The perf_event_attr structure provides detailed configuration informa‐
119 tion for the event being created.
120
    struct perf_event_attr {
        __u32 type;                 /* Type of event */
        __u32 size;                 /* Size of attribute structure */
        __u64 config;               /* Type-specific configuration */

        union {
            __u64 sample_period;    /* Period of sampling */
            __u64 sample_freq;      /* Frequency of sampling */
        };

        __u64 sample_type;  /* Specifies values included in sample */
        __u64 read_format;  /* Specifies values returned in read */

        __u64 disabled       : 1,   /* off by default */
              inherit        : 1,   /* children inherit it */
              pinned         : 1,   /* must always be on PMU */
              exclusive      : 1,   /* only group on PMU */
              exclude_user   : 1,   /* don't count user */
              exclude_kernel : 1,   /* don't count kernel */
              exclude_hv     : 1,   /* don't count hypervisor */
              exclude_idle   : 1,   /* don't count when idle */
              mmap           : 1,   /* include mmap data */
              comm           : 1,   /* include comm data */
              freq           : 1,   /* use freq, not period */
              inherit_stat   : 1,   /* per task counts */
              enable_on_exec : 1,   /* next exec enables */
              task           : 1,   /* trace fork/exit */
              watermark      : 1,   /* wakeup_watermark */
              precise_ip     : 2,   /* skid constraint */
              mmap_data      : 1,   /* non-exec mmap data */
              sample_id_all  : 1,   /* sample_type all events */
              exclude_host   : 1,   /* don't count in host */
              exclude_guest  : 1,   /* don't count in guest */
              exclude_callchain_kernel : 1,
                                    /* exclude kernel callchains */
              exclude_callchain_user   : 1,
                                    /* exclude user callchains */
              mmap2          : 1,   /* include mmap with inode data */
              comm_exec      : 1,   /* flag comm events that are
                                       due to exec */
              use_clockid    : 1,   /* use clockid for time fields */
              context_switch : 1,   /* context switch data */
              write_backward : 1,   /* Write ring buffer from end
                                       to beginning */
              namespaces     : 1,   /* include namespaces data */
              ksymbol        : 1,   /* include ksymbol events */
              bpf_event      : 1,   /* include bpf events */
              aux_output     : 1,   /* generate AUX records
                                       instead of events */
              cgroup         : 1,   /* include cgroup events */
              text_poke      : 1,   /* include text poke events */

              __reserved_1   : 30;

        union {
            __u32 wakeup_events;    /* wakeup every n events */
            __u32 wakeup_watermark; /* bytes before wakeup */
        };

        __u32     bp_type;          /* breakpoint type */

        union {
            __u64 bp_addr;          /* breakpoint address */
            __u64 kprobe_func;      /* for perf_kprobe */
            __u64 uprobe_path;      /* for perf_uprobe */
            __u64 config1;          /* extension of config */
        };

        union {
            __u64 bp_len;           /* breakpoint length */
            __u64 kprobe_addr;      /* with kprobe_func == NULL */
            __u64 probe_offset;     /* for perf_[k,u]probe */
            __u64 config2;          /* extension of config1 */
        };
        __u64 branch_sample_type;   /* enum perf_branch_sample_type */
        __u64 sample_regs_user;     /* user regs to dump on samples */
        __u32 sample_stack_user;    /* size of stack to dump on
                                       samples */
        __s32 clockid;              /* clock to use for time fields */
        __u64 sample_regs_intr;     /* regs to dump on samples */
        __u32 aux_watermark;        /* aux bytes before wakeup */
        __u16 sample_max_stack;     /* max frames in callchain */
        __u16 __reserved_2;         /* align to u64 */

    };
206
207 The fields of the perf_event_attr structure are described in more de‐
208 tail below:
209
210 type This field specifies the overall event type. It has one of the
211 following values:
212
213 PERF_TYPE_HARDWARE
214 This indicates one of the "generalized" hardware events
215 provided by the kernel. See the config field definition
216 for more details.
217
218 PERF_TYPE_SOFTWARE
219 This indicates one of the software-defined events pro‐
220 vided by the kernel (even if no hardware support is
221 available).
222
223 PERF_TYPE_TRACEPOINT
224 This indicates a tracepoint provided by the kernel trace‐
225 point infrastructure.
226
227 PERF_TYPE_HW_CACHE
228 This indicates a hardware cache event. This has a spe‐
229 cial encoding, described in the config field definition.
230
231 PERF_TYPE_RAW
232 This indicates a "raw" implementation-specific event in
233 the config field.
234
235 PERF_TYPE_BREAKPOINT (since Linux 2.6.33)
236 This indicates a hardware breakpoint as provided by the
237 CPU. Breakpoints can be read/write accesses to an ad‐
238 dress as well as execution of an instruction address.
239
240 dynamic PMU
241 Since Linux 2.6.38, perf_event_open() can support multi‐
242 ple PMUs. To enable this, a value exported by the kernel
243 can be used in the type field to indicate which PMU to
244 use. The value to use can be found in the sysfs filesys‐
245 tem: there is a subdirectory per PMU instance under
246 /sys/bus/event_source/devices. In each subdirectory
247 there is a type file whose content is an integer that can
248 be used in the type field. For instance,
249 /sys/bus/event_source/devices/cpu/type contains the value
250 for the core CPU PMU, which is usually 4.
251
252 kprobe and uprobe (since Linux 4.17)
253 These two dynamic PMUs create a kprobe/uprobe and attach
it to the file descriptor generated by perf_event_open().
255 The kprobe/uprobe will be destroyed on the destruction of
256 the file descriptor. See fields kprobe_func, up‐
257 robe_path, kprobe_addr, and probe_offset for more de‐
258 tails.
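
For the dynamic PMU case, the type value can be read from sysfs at run time. The helper below is a minimal sketch with a hypothetical name; it builds the path from the PMU's directory name and parses the integer stored in its type file.

```c
#include <stddef.h>
#include <stdio.h>

/* Reads /sys/bus/event_source/devices/<pmu_name>/type.
   Returns the integer found there, or -1 if it cannot be read. */
static int read_pmu_type(const char *pmu_name)
{
    char path[256];
    int n = snprintf(path, sizeof(path),
                     "/sys/bus/event_source/devices/%s/type", pmu_name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;                     /* name too long for the buffer */

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;                     /* no such PMU, or sysfs missing */

    int type = -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}
```

The returned value is meant to be stored in the type field; for example, read_pmu_type("cpu") yields the value for the core CPU PMU where that directory exists.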
259
260 size The size of the perf_event_attr structure for forward/backward
261 compatibility. Set this using sizeof(struct perf_event_attr) to
262 allow the kernel to see the struct size at the time of compila‐
263 tion.
264
265 The related define PERF_ATTR_SIZE_VER0 is set to 64; this was
266 the size of the first published struct. PERF_ATTR_SIZE_VER1 is
267 72, corresponding to the addition of breakpoints in Linux
268 2.6.33. PERF_ATTR_SIZE_VER2 is 80 corresponding to the addition
269 of branch sampling in Linux 3.4. PERF_ATTR_SIZE_VER3 is 96 cor‐
270 responding to the addition of sample_regs_user and sam‐
271 ple_stack_user in Linux 3.7. PERF_ATTR_SIZE_VER4 is 104 corre‐
272 sponding to the addition of sample_regs_intr in Linux 3.19.
273 PERF_ATTR_SIZE_VER5 is 112 corresponding to the addition of
274 aux_watermark in Linux 4.1.
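
A common way to satisfy the size requirement is to zero the whole structure and then store sizeof(struct perf_event_attr), as in the sketch below. The helper name init_attr is hypothetical.

```c
#include <linux/perf_event.h>
#include <string.h>

/* Zeroes the structure, then fills in size, type, and config.
   Zeroing first leaves all flag bits and reserved fields cleared. */
static void init_attr(struct perf_event_attr *attr, __u32 type, __u64 config)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);   /* struct size as seen at compile time */
    attr->type = type;
    attr->config = config;
}
```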
275
config This specifies which event you want, in conjunction with the
type field. The config1 and config2 fields are also taken into
account in cases where 64 bits is not enough to fully specify
the event. The encoding of these fields is event dependent.
280
281 There are various ways to set the config field that are depen‐
282 dent on the value of the previously described type field. What
283 follows are various possible settings for config separated out
284 by type.
285
286 If type is PERF_TYPE_HARDWARE, we are measuring one of the gen‐
287 eralized hardware CPU events. Not all of these are available on
288 all platforms. Set config to one of the following:
289
290 PERF_COUNT_HW_CPU_CYCLES
291 Total cycles. Be wary of what happens during CPU
292 frequency scaling.
293
294 PERF_COUNT_HW_INSTRUCTIONS
295 Retired instructions. Be careful, these can be af‐
296 fected by various issues, most notably hardware in‐
297 terrupt counts.
298
299 PERF_COUNT_HW_CACHE_REFERENCES
300 Cache accesses. Usually this indicates Last Level
301 Cache accesses but this may vary depending on your
302 CPU. This may include prefetches and coherency mes‐
303 sages; again this depends on the design of your CPU.
304
305 PERF_COUNT_HW_CACHE_MISSES
306 Cache misses. Usually this indicates Last Level
307 Cache misses; this is intended to be used in con‐
308 junction with the PERF_COUNT_HW_CACHE_REFERENCES
309 event to calculate cache miss rates.
310
311 PERF_COUNT_HW_BRANCH_INSTRUCTIONS
312 Retired branch instructions. Prior to Linux 2.6.35,
313 this used the wrong event on AMD processors.
314
315 PERF_COUNT_HW_BRANCH_MISSES
316 Mispredicted branch instructions.
317
318 PERF_COUNT_HW_BUS_CYCLES
319 Bus cycles, which can be different from total cy‐
320 cles.
321
322 PERF_COUNT_HW_STALLED_CYCLES_FRONTEND (since Linux 3.0)
323 Stalled cycles during issue.
324
325 PERF_COUNT_HW_STALLED_CYCLES_BACKEND (since Linux 3.0)
326 Stalled cycles during retirement.
327
328 PERF_COUNT_HW_REF_CPU_CYCLES (since Linux 3.3)
329 Total cycles; not affected by CPU frequency scaling.
330
331 If type is PERF_TYPE_SOFTWARE, we are measuring software events
332 provided by the kernel. Set config to one of the following:
333
334 PERF_COUNT_SW_CPU_CLOCK
335 This reports the CPU clock, a high-resolution per-
336 CPU timer.
337
338 PERF_COUNT_SW_TASK_CLOCK
339 This reports a clock count specific to the task that
340 is running.
341
342 PERF_COUNT_SW_PAGE_FAULTS
343 This reports the number of page faults.
344
345 PERF_COUNT_SW_CONTEXT_SWITCHES
This counts context switches. Until Linux 2.6.34,
these were all reported as user-space events; after
that, they are reported as happening in the kernel.
349
350 PERF_COUNT_SW_CPU_MIGRATIONS
351 This reports the number of times the process has mi‐
352 grated to a new CPU.
353
354 PERF_COUNT_SW_PAGE_FAULTS_MIN
355 This counts the number of minor page faults. These
356 did not require disk I/O to handle.
357
358 PERF_COUNT_SW_PAGE_FAULTS_MAJ
359 This counts the number of major page faults. These
360 required disk I/O to handle.
361
362 PERF_COUNT_SW_ALIGNMENT_FAULTS (since Linux 2.6.33)
363 This counts the number of alignment faults. These
364 happen when unaligned memory accesses happen; the
365 kernel can handle these but it reduces performance.
366 This happens only on some architectures (never on
367 x86).
368
369 PERF_COUNT_SW_EMULATION_FAULTS (since Linux 2.6.33)
370 This counts the number of emulation faults. The
371 kernel sometimes traps on unimplemented instructions
372 and emulates them for user space. This can nega‐
373 tively impact performance.
374
375 PERF_COUNT_SW_DUMMY (since Linux 3.12)
376 This is a placeholder event that counts nothing.
377 Informational sample record types such as mmap or
378 comm must be associated with an active event. This
379 dummy event allows gathering such records without
380 requiring a counting event.
381
382 If type is PERF_TYPE_TRACEPOINT, then we are measuring kernel
383 tracepoints. The value to use in config can be obtained from
384 under debugfs tracing/events/*/*/id if ftrace is enabled in the
385 kernel.
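
Reading a tracepoint ID might look like the sketch below. The helper name is hypothetical, and the location of the tracing directory is left to the caller because it depends on where debugfs is mounted; /sys/kernel/debug/tracing is a common location but is an assumption here.

```c
#include <stddef.h>
#include <stdio.h>

/* Reads <tracing_dir>/events/<subsystem>/<event>/id.
   Returns the ID for use in config, or -1 if it cannot be read. */
static long read_tracepoint_id(const char *tracing_dir,
                               const char *subsystem, const char *event)
{
    char path[512];
    int n = snprintf(path, sizeof(path), "%s/events/%s/%s/id",
                     tracing_dir, subsystem, event);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;   /* not mounted, not permitted, or no such event */

    long id = -1;
    if (fscanf(f, "%ld", &id) != 1)
        id = -1;
    fclose(f);
    return id;
}
```

A caller could, for instance, try read_tracepoint_id("/sys/kernel/debug/tracing", "sched", "sched_switch") and fall back to another mount point if -1 is returned.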
386
387 If type is PERF_TYPE_HW_CACHE, then we are measuring a hardware
388 CPU cache event. To calculate the appropriate config value, use
389 the following equation:
390
391 config = (perf_hw_cache_id) |
392 (perf_hw_cache_op_id << 8) |
393 (perf_hw_cache_op_result_id << 16);
394
395 where perf_hw_cache_id is one of:
396
397 PERF_COUNT_HW_CACHE_L1D
398 for measuring Level 1 Data Cache
399
400 PERF_COUNT_HW_CACHE_L1I
401 for measuring Level 1 Instruction Cache
402
403 PERF_COUNT_HW_CACHE_LL
404 for measuring Last-Level Cache
405
406 PERF_COUNT_HW_CACHE_DTLB
407 for measuring the Data TLB
408
409 PERF_COUNT_HW_CACHE_ITLB
410 for measuring the Instruction TLB
411
412 PERF_COUNT_HW_CACHE_BPU
413 for measuring the branch prediction unit
414
415 PERF_COUNT_HW_CACHE_NODE (since Linux 3.1)
416 for measuring local memory accesses
417
418 and perf_hw_cache_op_id is one of:
419
420 PERF_COUNT_HW_CACHE_OP_READ
421 for read accesses
422
423 PERF_COUNT_HW_CACHE_OP_WRITE
424 for write accesses
425
426 PERF_COUNT_HW_CACHE_OP_PREFETCH
427 for prefetch accesses
428
429 and perf_hw_cache_op_result_id is one of:
430
431 PERF_COUNT_HW_CACHE_RESULT_ACCESS
432 to measure accesses
433
434 PERF_COUNT_HW_CACHE_RESULT_MISS
435 to measure misses
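
The equation above translates directly into a small helper. This sketch uses a hypothetical function name; the three arguments are the enum values listed above.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Packs cache ID (bits 0-7), operation (bits 8-15), and result
   (bits 16-23) into a config value for PERF_TYPE_HW_CACHE. */
static uint64_t hw_cache_config(unsigned int cache_id, unsigned int op_id,
                                unsigned int result_id)
{
    return (uint64_t)cache_id |
           ((uint64_t)op_id << 8) |
           ((uint64_t)result_id << 16);
}
```

For example, L1 data cache read misses would be requested with hw_cache_config(PERF_COUNT_HW_CACHE_L1D, PERF_COUNT_HW_CACHE_OP_READ, PERF_COUNT_HW_CACHE_RESULT_MISS). Not every combination is supported on every CPU.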
436
437 If type is PERF_TYPE_RAW, then a custom "raw" config value is
438 needed. Most CPUs support events that are not covered by the
439 "generalized" events. These are implementation defined; see
440 your CPU manual (for example the Intel Volume 3B documentation
441 or the AMD BIOS and Kernel Developer Guide). The libpfm4 li‐
442 brary can be used to translate from the name in the architec‐
443 tural manuals to the raw hex value perf_event_open() expects in
444 this field.
445
446 If type is PERF_TYPE_BREAKPOINT, then leave config set to zero.
447 Its parameters are set in other places.
448
449 If type is kprobe or uprobe, set retprobe (bit 0 of config, see
450 /sys/bus/event_source/devices/[k,u]probe/format/retprobe) for
451 kretprobe/uretprobe. See fields kprobe_func, uprobe_path,
452 kprobe_addr, and probe_offset for more details.
453
454 kprobe_func, uprobe_path, kprobe_addr, and probe_offset
455 These fields describe the kprobe/uprobe for dynamic PMUs kprobe
456 and uprobe. For kprobe: use kprobe_func and probe_offset, or
457 use kprobe_addr and leave kprobe_func as NULL. For uprobe: use
458 uprobe_path and probe_offset.
459
460 sample_period, sample_freq
461 A "sampling" event is one that generates an overflow notifica‐
462 tion every N events, where N is given by sample_period. A sam‐
463 pling event has sample_period > 0. When an overflow occurs, re‐
464 quested data is recorded in the mmap buffer. The sample_type
465 field controls what data is recorded on each overflow.
466
467 sample_freq can be used if you wish to use frequency rather than
468 period. In this case, you set the freq flag. The kernel will
adjust the sampling period to try to achieve the desired rate.
470 The rate of adjustment is a timer tick.
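
The choice between a fixed period and a target frequency can be sketched as follows. The helper name is hypothetical; note that sample_period and sample_freq share storage in a union, so the freq bit alone decides how the value is interpreted.

```c
#include <linux/perf_event.h>

/* Configures sampling either every `value` events (use_freq == 0)
   or at roughly `value` samples per second (use_freq != 0). */
static void set_sampling(struct perf_event_attr *attr, int use_freq,
                         __u64 value)
{
    if (use_freq) {
        attr->freq = 1;
        attr->sample_freq = value;
    } else {
        attr->freq = 0;
        attr->sample_period = value;
    }
}
```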
471
472 sample_type
473 The various bits in this field specify which values to include
474 in the sample. They will be recorded in a ring-buffer, which is
available to user space using mmap(2). The order in which the
values are saved in the sample is documented in the MMAP Layout
subsection below; it is not the enum perf_event_sample_format
order.
479
480 PERF_SAMPLE_IP
481 Records instruction pointer.
482
483 PERF_SAMPLE_TID
484 Records the process and thread IDs.
485
486 PERF_SAMPLE_TIME
487 Records a timestamp.
488
489 PERF_SAMPLE_ADDR
490 Records an address, if applicable.
491
492 PERF_SAMPLE_READ
Records counter values for all events in a group, not just
the group leader.
495
496 PERF_SAMPLE_CALLCHAIN
497 Records the callchain (stack backtrace).
498
499 PERF_SAMPLE_ID
500 Records a unique ID for the opened event's group leader.
501
502 PERF_SAMPLE_CPU
503 Records CPU number.
504
505 PERF_SAMPLE_PERIOD
506 Records the current sampling period.
507
508 PERF_SAMPLE_STREAM_ID
509 Records a unique ID for the opened event. Unlike
510 PERF_SAMPLE_ID the actual ID is returned, not the group
511 leader. This ID is the same as the one returned by
512 PERF_FORMAT_ID.
513
514 PERF_SAMPLE_RAW
515 Records additional data, if applicable. Usually returned
516 by tracepoint events.
517
518 PERF_SAMPLE_BRANCH_STACK (since Linux 3.4)
519 This provides a record of recent branches, as provided by
520 CPU branch sampling hardware (such as Intel Last Branch
521 Record). Not all hardware supports this feature.
522
523 See the branch_sample_type field for how to filter which
524 branches are reported.
525
526 PERF_SAMPLE_REGS_USER (since Linux 3.7)
527 Records the current user-level CPU register state (the
528 values in the process before the kernel was called).
529
530 PERF_SAMPLE_STACK_USER (since Linux 3.7)
531 Records the user level stack, allowing stack unwinding.
532
533 PERF_SAMPLE_WEIGHT (since Linux 3.10)
534 Records a hardware provided weight value that expresses
535 how costly the sampled event was. This allows the hard‐
536 ware to highlight expensive events in a profile.
537
538 PERF_SAMPLE_DATA_SRC (since Linux 3.10)
539 Records the data source: where in the memory hierarchy
540 the data associated with the sampled instruction came
541 from. This is available only if the underlying hardware
542 supports this feature.
543
544 PERF_SAMPLE_IDENTIFIER (since Linux 3.12)
545 Places the SAMPLE_ID value in a fixed position in the
546 record, either at the beginning (for sample events) or at
547 the end (if a non-sample event).
548
549 This was necessary because a sample stream may have
550 records from various different event sources with differ‐
551 ent sample_type settings. Parsing the event stream prop‐
552 erly was not possible because the format of the record
553 was needed to find SAMPLE_ID, but the format could not be
554 found without knowing what event the sample belonged to
555 (causing a circular dependency).
556
557 The PERF_SAMPLE_IDENTIFIER setting makes the event stream
558 always parsable by putting SAMPLE_ID in a fixed location,
559 even though it means having duplicate SAMPLE_ID values in
560 records.
561
562 PERF_SAMPLE_TRANSACTION (since Linux 3.13)
563 Records reasons for transactional memory abort events
564 (for example, from Intel TSX transactional memory sup‐
565 port).
566
567 The precise_ip setting must be greater than 0 and a
568 transactional memory abort event must be measured or no
569 values will be recorded. Also note that some perf_event
570 measurements, such as sampled cycle counting, may cause
571 extraneous aborts (by causing an interrupt during a
572 transaction).
573
574 PERF_SAMPLE_REGS_INTR (since Linux 3.19)
575 Records a subset of the current CPU register state as
576 specified by sample_regs_intr. Unlike PERF_SAM‐
577 PLE_REGS_USER the register values will return kernel reg‐
578 ister state if the overflow happened while kernel code is
579 running. If the CPU supports hardware sampling of regis‐
580 ter state (i.e., PEBS on Intel x86) and precise_ip is set
581 higher than zero then the register values returned are
582 those captured by hardware at the time of the sampled in‐
583 struction's retirement.
584
585 PERF_SAMPLE_PHYS_ADDR (since Linux 4.13)
Records the physical address of the data, analogous to the
address recorded by PERF_SAMPLE_ADDR.
588
589 PERF_SAMPLE_CGROUP (since Linux 5.7)
Records the (perf_event) cgroup ID of the process. This
corresponds to the id field in the PERF_RECORD_CGROUP event.
592
593 read_format
594 This field specifies the format of the data returned by read(2)
595 on a perf_event_open() file descriptor.
596
597 PERF_FORMAT_TOTAL_TIME_ENABLED
598 Adds the 64-bit time_enabled field. This can be used to
599 calculate estimated totals if the PMU is overcommitted
600 and multiplexing is happening.
601
602 PERF_FORMAT_TOTAL_TIME_RUNNING
603 Adds the 64-bit time_running field. This can be used to
604 calculate estimated totals if the PMU is overcommitted
605 and multiplexing is happening.
606
607 PERF_FORMAT_ID
608 Adds a 64-bit unique value that corresponds to the event
609 group.
610
611 PERF_FORMAT_GROUP
612 Allows all counter values in an event group to be read
613 with one read.
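
When multiplexing is happening, an estimated total is usually obtained by scaling the raw count by the ratio of enabled time to running time. The sketch below assumes both PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING are set and that PERF_FORMAT_GROUP and PERF_FORMAT_ID are not, so that read(2) returns three 64-bit values in the order shown. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout of the data returned by read(2) for a single event
   with both time fields requested and no group or ID fields. */
struct read_single {
    uint64_t value;          /* raw count */
    uint64_t time_enabled;   /* time the event was enabled */
    uint64_t time_running;   /* time the event was actually on the PMU */
};

/* Estimates the count the event would have reached had it been running
   for the whole enabled time.  Returns 0 if it never ran. */
static uint64_t scaled_count(const struct read_single *r)
{
    if (r->time_running == 0)
        return 0;
    if (r->time_running == r->time_enabled)
        return r->value;     /* no multiplexing: the count is exact */
    return (uint64_t)((double)r->value * (double)r->time_enabled /
                      (double)r->time_running);
}
```

The result is an estimate: it presumes the event rate during the time the event was not running matched the rate while it was.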
614
615 disabled
616 The disabled bit specifies whether the counter starts out dis‐
617 abled or enabled. If disabled, the event can later be enabled
618 by ioctl(2), prctl(2), or enable_on_exec.
619
620 When creating an event group, typically the group leader is ini‐
621 tialized with disabled set to 1 and any child events are ini‐
622 tialized with disabled set to 0. Despite disabled being 0, the
623 child events will not start until the group leader is enabled.
624
625 inherit
626 The inherit bit specifies that this counter should count events
627 of child tasks as well as the task specified. This applies only
628 to new children, not to any existing children at the time the
629 counter is created (nor to any new children of existing chil‐
630 dren).
631
632 Inherit does not work for some combinations of read_format val‐
633 ues, such as PERF_FORMAT_GROUP.
634
635 pinned The pinned bit specifies that the counter should always be on
636 the CPU if at all possible. It applies only to hardware coun‐
637 ters and only to group leaders. If a pinned counter cannot be
638 put onto the CPU (e.g., because there are not enough hardware
639 counters or because of a conflict with some other event), then
640 the counter goes into an 'error' state, where reads return end-
641 of-file (i.e., read(2) returns 0) until the counter is subse‐
642 quently enabled or disabled.
643
644 exclusive
645 The exclusive bit specifies that when this counter's group is on
646 the CPU, it should be the only group using the CPU's counters.
647 In the future this may allow monitoring programs to support PMU
648 features that need to run alone so that they do not disrupt
649 other hardware counters.
650
651 Note that many unexpected situations may prevent events with the
652 exclusive bit set from ever running. This includes any users
653 running a system-wide measurement as well as any kernel use of
654 the performance counters (including the commonly enabled NMI
655 Watchdog Timer interface).
656
657 exclude_user
658 If this bit is set, the count excludes events that happen in
659 user space.
660
661 exclude_kernel
662 If this bit is set, the count excludes events that happen in
663 kernel space.
664
665 exclude_hv
666 If this bit is set, the count excludes events that happen in the
667 hypervisor. This is mainly for PMUs that have built-in support
668 for handling this (such as POWER). Extra support is needed for
669 handling hypervisor measurements on most machines.
670
671 exclude_idle
672 If set, don't count when the CPU is running the idle task.
673 While you can currently enable this for any event type, it is
674 ignored for all but software events.
675
676 mmap The mmap bit enables generation of PERF_RECORD_MMAP samples for
677 every mmap(2) call that has PROT_EXEC set. This allows tools to
678 notice new executable code being mapped into a program (dynamic
679 shared libraries for example) so that addresses can be mapped
680 back to the original code.
681
682 comm The comm bit enables tracking of process command name as modi‐
683 fied by the exec(2) and prctl(PR_SET_NAME) system calls as well
684 as writing to /proc/self/comm. If the comm_exec flag is also
685 successfully set (possible since Linux 3.16), then the misc flag
686 PERF_RECORD_MISC_COMM_EXEC can be used to differentiate the
687 exec(2) case from the others.
688
freq If this bit is set, then sample_freq, not sample_period, is
used when setting up the sampling interval.
691
692 inherit_stat
693 This bit enables saving of event counts on context switch for
694 inherited tasks. This is meaningful only if the inherit field
695 is set.
696
697 enable_on_exec
698 If this bit is set, a counter is automatically enabled after a
699 call to exec(2).
700
701 task If this bit is set, then fork/exit notifications are included in
702 the ring buffer.
703
704 watermark
705 If set, have an overflow notification happen when we cross the
706 wakeup_watermark boundary. Otherwise, overflow notifications
707 happen after wakeup_events samples.
708
709 precise_ip (since Linux 2.6.35)
710 This controls the amount of skid. Skid is how many instructions
711 execute between an event of interest happening and the kernel
712 being able to stop and record the event. Smaller skid is better
713 and allows more accurate reporting of which events correspond to
which instructions, but hardware is often limited in how small
this can be.
716
717 The possible values of this field are the following:
718
719 0 SAMPLE_IP can have arbitrary skid.
720
721 1 SAMPLE_IP must have constant skid.
722
723 2 SAMPLE_IP requested to have 0 skid.
724
725 3 SAMPLE_IP must have 0 skid. See also the description of
726 PERF_RECORD_MISC_EXACT_IP.
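
Since the supported precise_ip levels differ between CPUs, one possible approach is to request the strictest level first and relax it until the call succeeds. The sketch below illustrates that idea; the helper names are hypothetical, and the fallback strategy is an illustration rather than something this page prescribes.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper around the raw system call. */
static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Tries precise_ip = 3, 2, 1, 0 in turn.  On success, stores the file
   descriptor in *fd_out and returns the level that was accepted.
   If every attempt fails, stores -1 in *fd_out and returns -1. */
static int open_with_best_precise_ip(struct perf_event_attr *attr,
                                     pid_t pid, int cpu, int *fd_out)
{
    for (int level = 3; level >= 0; level--) {
        attr->precise_ip = (unsigned int)level;
        long fd = sys_perf_event_open(attr, pid, cpu, -1, 0);
        if (fd != -1) {
            *fd_out = (int)fd;
            return level;
        }
    }
    *fd_out = -1;
    return -1;
}
```

The caller is expected to have filled in the rest of attr (type, size, config, sample_period, sample_type) beforehand; the function changes only precise_ip.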
727
728 mmap_data (since Linux 2.6.36)
729 This is the counterpart of the mmap field. This enables genera‐
730 tion of PERF_RECORD_MMAP samples for mmap(2) calls that do not
731 have PROT_EXEC set (for example data and SysV shared memory).
732
733 sample_id_all (since Linux 2.6.38)
734 If set, then TID, TIME, ID, STREAM_ID, and CPU can additionally
735 be included in non-PERF_RECORD_SAMPLEs if the corresponding sam‐
736 ple_type is selected.
737
738 If PERF_SAMPLE_IDENTIFIER is specified, then an additional ID
739 value is included as the last value to ease parsing the record
740 stream. This may lead to the id value appearing twice.
741
742 The layout is described by this pseudo-structure:
743
744 struct sample_id {
745 { u32 pid, tid; } /* if PERF_SAMPLE_TID set */
746 { u64 time; } /* if PERF_SAMPLE_TIME set */
747 { u64 id; } /* if PERF_SAMPLE_ID set */
748 { u64 stream_id;} /* if PERF_SAMPLE_STREAM_ID set */
749 { u32 cpu, res; } /* if PERF_SAMPLE_CPU set */
750 { u64 id; } /* if PERF_SAMPLE_IDENTIFIER set */
751 };
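
A parser needs to know how many bytes this trailer occupies in order to locate it at the end of a non-sample record. The sketch below derives that size from sample_type, following the pseudo-structure above field by field; the function name is hypothetical.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the size in bytes of the sample_id trailer implied by the
   given sample_type bits, assuming sample_id_all is set. */
static size_t sample_id_size(uint64_t sample_type)
{
    size_t size = 0;

    if (sample_type & PERF_SAMPLE_TID)
        size += 2 * sizeof(uint32_t);   /* pid, tid */
    if (sample_type & PERF_SAMPLE_TIME)
        size += sizeof(uint64_t);       /* time */
    if (sample_type & PERF_SAMPLE_ID)
        size += sizeof(uint64_t);       /* id */
    if (sample_type & PERF_SAMPLE_STREAM_ID)
        size += sizeof(uint64_t);       /* stream_id */
    if (sample_type & PERF_SAMPLE_CPU)
        size += 2 * sizeof(uint32_t);   /* cpu, res */
    if (sample_type & PERF_SAMPLE_IDENTIFIER)
        size += sizeof(uint64_t);       /* id, repeated as last value */
    return size;
}
```

Bits that do not contribute to the trailer, such as PERF_SAMPLE_IP, are ignored by this helper.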
752
753 exclude_host (since Linux 3.2)
754 When conducting measurements that include processes running VM
755 instances (i.e., have executed a KVM_RUN ioctl(2)), only measure
756 events happening inside a guest instance. This is only meaning‐
757 ful outside the guests; this setting does not change counts
758 gathered inside of a guest. Currently, this functionality is
759 x86 only.
760
761 exclude_guest (since Linux 3.2)
762 When conducting measurements that include processes running VM
763 instances (i.e., have executed a KVM_RUN ioctl(2)), do not mea‐
764 sure events happening inside guest instances. This is only
765 meaningful outside the guests; this setting does not change
766 counts gathered inside of a guest. Currently, this functional‐
767 ity is x86 only.
768
769 exclude_callchain_kernel (since Linux 3.7)
770 Do not include kernel callchains.
771
772 exclude_callchain_user (since Linux 3.7)
773 Do not include user callchains.
774
775 mmap2 (since Linux 3.16)
776 Generate an extended executable mmap record that contains enough
777 additional information to uniquely identify shared mappings.
778 The mmap flag must also be set for this to work.
779
780 comm_exec (since Linux 3.16)
This is purely a feature-detection flag; it does not change kernel
behavior. If this flag can successfully be set, then, when
783 comm is enabled, the PERF_RECORD_MISC_COMM_EXEC flag will be set
784 in the misc field of a comm record header if the rename event
785 being reported was caused by a call to exec(2). This allows
786 tools to distinguish between the various types of process renam‐
787 ing.
788
789 use_clockid (since Linux 4.1)
790 This allows selecting which internal Linux clock to use when
791 generating timestamps via the clockid field. This can make it
792 easier to correlate perf sample times with timestamps generated
793 by other tools.
794
795 context_switch (since Linux 4.3)
796 This enables the generation of PERF_RECORD_SWITCH records when a
797 context switch occurs. It also enables the generation of
798 PERF_RECORD_SWITCH_CPU_WIDE records when sampling in CPU-wide
799 mode. This functionality is in addition to existing tracepoint
800 and software events for measuring context switches. The advan‐
801 tage of this method is that it will give full information even
802 with strict perf_event_paranoid settings.
803
804 write_backward (since Linux 4.6)
805 This causes the ring buffer to be written from the end to the
beginning. This is to support reading from an overwritable ring
buffer.
808
809 namespaces (since Linux 4.11)
810 This enables the generation of PERF_RECORD_NAMESPACES records
811 when a task enters a new namespace. Each namespace has a combi‐
812 nation of device and inode numbers.
813
814 ksymbol (since Linux 5.0)
815 This enables the generation of PERF_RECORD_KSYMBOL records when
816 new kernel symbols are registered or unregistered. This is useful
817 for analyzing dynamic kernel functions such as eBPF programs.
818
819 bpf_event (since Linux 5.0)
820 This enables the generation of PERF_RECORD_BPF_EVENT records
821 when an eBPF program is loaded or unloaded.
822
823 auxevent (since Linux 5.4)
824 This allows normal (non-AUX) events to generate data for AUX
825 events if the hardware supports it.
826
827 cgroup (since Linux 5.7)
828 This enables the generation of PERF_RECORD_CGROUP records when a
829 new cgroup is created (and activated).
830
831 text_poke (since Linux 5.8)
832 This enables the generation of PERF_RECORD_TEXT_POKE records
833 when there is a change to the kernel text (i.e., self-modifying
834 code).
835
836 wakeup_events, wakeup_watermark
837 This union sets how many samples (wakeup_events) or bytes
838 (wakeup_watermark) happen before an overflow notification hap‐
839 pens. Which one is used is selected by the watermark bit flag.
840
841 wakeup_events counts only PERF_RECORD_SAMPLE record types. To
842 receive overflow notification for all PERF_RECORD types choose
843 watermark and set wakeup_watermark to 1.
844
845 Prior to Linux 3.0, setting wakeup_events to 0 resulted in no
846 overflow notifications; more recent kernels treat 0 the same as
847 1.
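
As a minimal sketch of the advice above, the following helper sets the watermark bit and a one-byte watermark so that every record type triggers an overflow notification. The function name is hypothetical and not part of any API.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/perf_event.h>

/* Hypothetical helper: request a wakeup as soon as any record
 * (not only PERF_RECORD_SAMPLE) places data in the ring buffer. */
static void request_wakeup_on_every_record(struct perf_event_attr *attr)
{
    assert(attr != NULL);
    attr->watermark = 1;        /* select wakeup_watermark from the union */
    attr->wakeup_watermark = 1; /* notify once 1 byte is available */
}
```

The attr structure is then passed to perf_event_open() as usual.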
848
849 bp_type (since Linux 2.6.33)
850 This chooses the breakpoint type. It is one of:
851
852 HW_BREAKPOINT_EMPTY
853 No breakpoint.
854
855 HW_BREAKPOINT_R
856 Count when we read the memory location.
857
858 HW_BREAKPOINT_W
859 Count when we write the memory location.
860
861 HW_BREAKPOINT_RW
862 Count when we read or write the memory location.
863
864 HW_BREAKPOINT_X
865 Count when we execute code at the memory location.
866
867 The values can be combined via a bitwise or, but the combination
868 of HW_BREAKPOINT_R or HW_BREAKPOINT_W with HW_BREAKPOINT_X is
869 not allowed.
870
871 bp_addr (since Linux 2.6.33)
872 This is the address of the breakpoint. For execution break‐
873 points, this is the memory address of the instruction of inter‐
874 est; for read and write breakpoints, it is the memory address of
875 the memory location of interest.
876
877 config1 (since Linux 2.6.39)
878 config1 is used for setting events that need an extra register
879 or otherwise do not fit in the regular config field. Raw OFF‐
880 CORE_EVENTS on Nehalem/Westmere/SandyBridge use this field on
881 Linux 3.3 and later kernels.
882
883 bp_len (since Linux 2.6.33)
884 bp_len is the length of the breakpoint being measured if type is
885 PERF_TYPE_BREAKPOINT. Options are HW_BREAKPOINT_LEN_1,
886 HW_BREAKPOINT_LEN_2, HW_BREAKPOINT_LEN_4, and HW_BREAK‐
887 POINT_LEN_8. For an execution breakpoint, set this to
888 sizeof(long).
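
The breakpoint fields described above (bp_type, bp_addr, bp_len) might be filled in as in the sketch below, which sets up a counter for 4-byte writes to one address. The helper name and the choice of a 4-byte length are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

/* Hypothetical helper: prepare attr to count 4-byte writes to addr. */
static void init_write_breakpoint(struct perf_event_attr *attr, uint64_t addr)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_BREAKPOINT;  /* bp_* fields apply to this type */
    attr->bp_type = HW_BREAKPOINT_W;    /* count writes only */
    attr->bp_addr = addr;               /* memory location of interest */
    attr->bp_len = HW_BREAKPOINT_LEN_4; /* watch 4 bytes */
}
```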
889
890 config2 (since Linux 2.6.39)
891 config2 is a further extension of the config1 field.
892
893 branch_sample_type (since Linux 3.4)
894 If PERF_SAMPLE_BRANCH_STACK is enabled, then this specifies what
895 branches to include in the branch record.
896
897 The first part of the value is the privilege level, which is a
898 combination of one or more of the values listed below. If the user does
899 not set privilege level explicitly, the kernel will use the
900 event's privilege level. Event and branch privilege levels do
901 not have to match.
902
903 PERF_SAMPLE_BRANCH_USER
904 Branch target is in user space.
905
906 PERF_SAMPLE_BRANCH_KERNEL
907 Branch target is in kernel space.
908
909 PERF_SAMPLE_BRANCH_HV
910 Branch target is in hypervisor.
911
912 PERF_SAMPLE_BRANCH_PLM_ALL
913 A convenience value that is the three preceding values
914 ORed together.
915
916 In addition to the privilege value, at least one of the
917 following bits must be set.
918
919 PERF_SAMPLE_BRANCH_ANY
920 Any branch type.
921
922 PERF_SAMPLE_BRANCH_ANY_CALL
923 Any call branch (includes direct calls, indirect calls,
924 and far jumps).
925
926 PERF_SAMPLE_BRANCH_IND_CALL
927 Indirect calls.
928
929 PERF_SAMPLE_BRANCH_CALL (since Linux 4.4)
930 Direct calls.
931
932 PERF_SAMPLE_BRANCH_ANY_RETURN
933 Any return branch.
934
935 PERF_SAMPLE_BRANCH_IND_JUMP (since Linux 4.2)
936 Indirect jumps.
937
938 PERF_SAMPLE_BRANCH_COND (since Linux 3.16)
939 Conditional branches.
940
941 PERF_SAMPLE_BRANCH_ABORT_TX (since Linux 3.11)
942 Transactional memory aborts.
943
944 PERF_SAMPLE_BRANCH_IN_TX (since Linux 3.11)
945 Branch in transactional memory transaction.
946
947 PERF_SAMPLE_BRANCH_NO_TX (since Linux 3.11)
948 Branch not in transactional memory transaction.
949
950 PERF_SAMPLE_BRANCH_CALL_STACK (since Linux 4.1)
951 Branch is part of a hardware-generated call stack. This requires
952 hardware support, currently found only on Intel x86 Haswell or newer.
953
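A branch_sample_type value combines a privilege level with a branch type. The sketch below requests user-space call branches; the helper name is hypothetical, and the particular combination is only one possible choice.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/perf_event.h>

/* Hypothetical helper: record user-space call branches in each sample. */
static void request_user_call_branches(struct perf_event_attr *attr)
{
    assert(attr != NULL);
    attr->sample_type |= PERF_SAMPLE_BRANCH_STACK;
    attr->branch_sample_type =
        PERF_SAMPLE_BRANCH_USER |    /* privilege level part */
        PERF_SAMPLE_BRANCH_ANY_CALL; /* branch type part */
}
```
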
954 sample_regs_user (since Linux 3.7)
955 This bit mask defines the set of user CPU registers to dump on
956 samples. The layout of the register mask is architecture-spe‐
957 cific and is described in the kernel header file arch/ARCH/in‐
958 clude/uapi/asm/perf_regs.h.
959
960 sample_stack_user (since Linux 3.7)
961 This defines the size of the user stack to dump if PERF_SAM‐
962 PLE_STACK_USER is specified.
963
964 clockid (since Linux 4.1)
965 If use_clockid is set, then this field selects which internal
966 Linux timer to use for timestamps. The available timers are de‐
967 fined in linux/time.h, with CLOCK_MONOTONIC, CLOCK_MONO‐
968 TONIC_RAW, CLOCK_REALTIME, CLOCK_BOOTTIME, and CLOCK_TAI cur‐
969 rently supported.
970
971 aux_watermark (since Linux 4.1)
972 This specifies how much data is required to trigger a
973 PERF_RECORD_AUX sample.
974
975 sample_max_stack (since Linux 4.8)
976 When sample_type includes PERF_SAMPLE_CALLCHAIN, this field
977 specifies how many stack frames to report when generating the
978 callchain.
979
980 Reading results
981 Once a perf_event_open() file descriptor has been opened, the values of
982 the events can be read from the file descriptor. The values that are
983 there are specified by the read_format field in the attr structure at
984 open time.
985
986 If you attempt to read into a buffer that is not big enough to hold the
987 data, the error ENOSPC results.
988
989 Here is the layout of the data returned by a read:
990
991 * If PERF_FORMAT_GROUP was specified to allow reading all events in a
992 group at once:
993
994 struct read_format {
995 u64 nr; /* The number of events */
996 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
997 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
998 struct {
999 u64 value; /* The value of the event */
1000 u64 id; /* if PERF_FORMAT_ID */
1001 } values[nr];
1002 };
1003
1004 * If PERF_FORMAT_GROUP was not specified:
1005
1006 struct read_format {
1007 u64 value; /* The value of the event */
1008 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
1009 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
1010 u64 id; /* if PERF_FORMAT_ID */
1011 };
1012
1013 The values read are as follows:
1014
1015 nr The number of events in this file descriptor. Available only if
1016 PERF_FORMAT_GROUP was specified.
1017
1018 time_enabled, time_running
1019 Total time the event was enabled and running. Normally these
1020 values are the same. Multiplexing happens if the number of
1021 events is more than the number of available PMU counter slots.
1022 In that case the events run only part of the time and the
1023 time_enabled and time_running values can be used to scale an
1024 estimated value for the count.
1025
1026 value An unsigned 64-bit value containing the counter result.
1027
1028 id A globally unique value for this particular event; only present
1029 if PERF_FORMAT_ID was specified in read_format.
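
The group layout above can be decoded as in the following sketch. It assumes the event was opened with read_format set to PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING | PERF_FORMAT_ID, so that every optional field is present. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One decoded group member. */
struct group_value {
    uint64_t value;
    uint64_t id;
};

/*
 * Decode a buffer of 64-bit words filled by read(2) on a group leader.
 * Returns the number of events decoded, or -1 if the buffer is too
 * short or out has too few slots.
 */
static int decode_group_read(const uint64_t *buf, size_t buf_words,
                             uint64_t *time_enabled, uint64_t *time_running,
                             struct group_value *out, size_t out_len)
{
    assert(buf != NULL && time_enabled != NULL &&
           time_running != NULL && out != NULL);

    if (buf_words < 3)
        return -1;

    uint64_t nr = buf[0];
    if (nr > out_len || (buf_words - 3) / 2 < nr)
        return -1;

    *time_enabled = buf[1];
    *time_running = buf[2];
    for (uint64_t i = 0; i < nr; i++) {
        out[i].value = buf[3 + 2 * i];
        out[i].id = buf[4 + 2 * i];
    }
    return (int)nr;
}

/* Estimate the full count of a multiplexed event from the time it ran. */
static uint64_t scale_count(uint64_t value, uint64_t enabled, uint64_t running)
{
    if (running == 0)
        return 0; /* the event never ran, so no estimate is possible */
    /* Split the division to reduce the risk of 64-bit overflow. */
    return (value / running) * enabled + ((value % running) * enabled) / running;
}
```

When time_enabled equals time_running, scale_count() returns the value unchanged.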
1030
1031 MMAP layout
1032 When using perf_event_open() in sampled mode, asynchronous events (like
1033 counter overflow or PROT_EXEC mmap tracking) are logged into a ring-
1034 buffer. This ring-buffer is created and accessed through mmap(2).
1035
1036 The mmap size should be 1+2^n pages, where the first page is a metadata
1037 page (struct perf_event_mmap_page) that contains various bits of infor‐
1038 mation such as where the ring-buffer head is.
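
The size rule above works out as in the sketch below. The helper name is hypothetical; in a real program the page size would typically come from sysconf(_SC_PAGESIZE).

```c
#include <assert.h>
#include <stddef.h>

/* Length to pass to mmap(2): one metadata page plus 2^n data pages. */
static size_t perf_mmap_length(unsigned int n, size_t page_size)
{
    assert(n < sizeof(size_t) * 8);
    return (1 + ((size_t)1 << n)) * page_size;
}
```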
1039
1040 Before Linux 2.6.39, a bug meant that you had to allocate an
1041 mmap ring buffer when sampling even if you did not plan to access it.
1042
1043 The structure of the first metadata mmap page is as follows:
1044
1045 struct perf_event_mmap_page {
1046 __u32 version; /* version number of this structure */
1047 __u32 compat_version; /* lowest version this is compat with */
1048 __u32 lock; /* seqlock for synchronization */
1049 __u32 index; /* hardware counter identifier */
1050 __s64 offset; /* add to hardware counter value */
1051 __u64 time_enabled; /* time event active */
1052 __u64 time_running; /* time event on CPU */
1053 union {
1054 __u64 capabilities;
1055 struct {
1056 __u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
1057 cap_bit0_is_deprecated : 1,
1058 cap_user_rdpmc : 1,
1059 cap_user_time : 1,
1060 cap_user_time_zero : 1,
1061 };
1062 };
1063 __u16 pmc_width;
1064 __u16 time_shift;
1065 __u32 time_mult;
1066 __u64 time_offset;
1067 __u64 __reserved[120]; /* Pad to 1 k */
1068 __u64 data_head; /* head in the data section */
1069 __u64 data_tail; /* user-space written tail */
1070 __u64 data_offset; /* where the buffer starts */
1071 __u64 data_size; /* data buffer size */
1072 __u64 aux_head;
1073 __u64 aux_tail;
1074 __u64 aux_offset;
1075 __u64 aux_size;
1076
1077 };
1078
1079 The following list describes the fields in the perf_event_mmap_page
1080 structure in more detail:
1081
1082 version
1083 Version number of this structure.
1084
1085 compat_version
1086 The lowest version this is compatible with.
1087
1088 lock A seqlock for synchronization.
1089
1090 index A unique hardware counter identifier.
1091
1092 offset When using rdpmc for reads this offset value must be added to
1093 the one returned by rdpmc to get the current total event count.
1094
1095 time_enabled
1096 Time the event was active.
1097
1098 time_running
1099 Time the event was running.
1100
1101 cap_usr_time / cap_usr_rdpmc / cap_bit0 (since Linux 3.4)
1102 There was a bug in the definition of cap_usr_time and
1103 cap_usr_rdpmc from Linux 3.4 until Linux 3.11. Both bits were
1104 defined to point to the same location, so it was impossible to
1105 know if cap_usr_time or cap_usr_rdpmc were actually set.
1106
1107 Starting with Linux 3.12, these are renamed to cap_bit0 and you
1108 should use the cap_user_time and cap_user_rdpmc fields instead.
1109
1110 cap_bit0_is_deprecated (since Linux 3.12)
1111 If set, this bit indicates that the kernel supports the properly
1112 separated cap_user_time and cap_user_rdpmc bits.
1113
1114 If not set, it indicates an older kernel where cap_usr_time and
1115 cap_usr_rdpmc map to the same bit and thus both features should
1116 be used with caution.
1117
1118 cap_user_rdpmc (since Linux 3.12)
1119 If the hardware supports user-space read of performance counters
1120 without syscall (this is the "rdpmc" instruction on x86), then
1121 the following code can be used to do a read:
1122
1123 u32 seq, time_mult, time_shift, idx, width;
1124 u64 count, enabled, running;
1125 u64 cyc, time_offset;
1126
1127 do {
1128 seq = pc->lock;
1129 barrier();
1130 enabled = pc->time_enabled;
1131 running = pc->time_running;
1132
1133 if (pc->cap_user_time && enabled != running) {
1134 cyc = rdtsc();
1135 time_offset = pc->time_offset;
1136 time_mult = pc->time_mult;
1137 time_shift = pc->time_shift;
1138 }
1139
1140 idx = pc->index;
1141 count = pc->offset;
1142
1143 if (pc->cap_user_rdpmc && idx) {
1144 width = pc->pmc_width;
1145 count += rdpmc(idx - 1);
1146 }
1147
1148 barrier();
1149 } while (pc->lock != seq);
1150
1151 cap_user_time (since Linux 3.12)
1152 This bit indicates the hardware has a constant, nonstop time‐
1153 stamp counter (TSC on x86).
1154
1155 cap_user_time_zero (since Linux 3.12)
1156 Indicates the presence of time_zero which allows mapping time‐
1157 stamp values to the hardware clock.
1158
1159 pmc_width
1160 If cap_user_rdpmc is set, this field provides the bit-width of the
1161 value read using the rdpmc or equivalent instruction. This can be
1162 used to sign-extend the result as follows:
1163
1164 pmc <<= 64 - pmc_width;
1165 pmc >>= 64 - pmc_width; // signed shift right
1166 count += pmc;
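
A self-contained version of the sign extension might look like the sketch below. The function name is hypothetical. The code relies on right shifts of negative signed values being arithmetic, which holds for GCC and Clang but is implementation-defined in ISO C.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a raw counter value that is pmc_width bits wide. */
static int64_t sign_extend_pmc(uint64_t raw, unsigned int pmc_width)
{
    assert(pmc_width >= 1 && pmc_width <= 64);
    unsigned int shift = 64 - pmc_width;
    /* Shift left as unsigned to avoid signed overflow, then shift
     * right as signed so the top bit of the counter is replicated. */
    return (int64_t)(raw << shift) >> shift;
}
```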
1167
1168 time_shift, time_mult, time_offset
1169
1170 If cap_user_time is set, these fields can be used to compute the
1171 time delta since time_enabled (in nanoseconds) using rdtsc or
1172 similar.
1173
1174 u64 quot, rem;
1175 u64 delta;
1176
1177 quot = cyc >> time_shift;
1178 rem = cyc & (((u64)1 << time_shift) - 1);
1179 delta = time_offset + quot * time_mult +
1180 ((rem * time_mult) >> time_shift);
1181
1182 Where time_offset, time_mult, time_shift, and cyc are read in
1183 the seqcount loop described above. This delta can then be added
1184 to enabled and possibly running (if idx), improving the scaling:
1185
1186 enabled += delta;
1187 if (idx)
1188 running += delta;
1189 quot = count / running;
1190 rem = count % running;
1191 count = quot * enabled + (rem * enabled) / running;
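
The delta computation can be wrapped in a function and checked with easy numbers. In this sketch the function name is hypothetical, and the values used in the usage note (time_mult = 1024 with time_shift = 10, meaning one cycle per nanosecond) are chosen only to make the arithmetic obvious.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count into the time delta in nanoseconds. */
static uint64_t cycles_to_delta_ns(uint64_t cyc, uint64_t time_offset,
                                   uint32_t time_mult, uint16_t time_shift)
{
    assert(time_shift < 64);
    uint64_t quot = cyc >> time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << time_shift) - 1);
    /* Multiply quotient and remainder separately to limit overflow. */
    return time_offset + quot * time_mult + ((rem * time_mult) >> time_shift);
}
```

With time_mult = 1024 and time_shift = 10, 5000 cycles and a time_offset of 100 give a delta of 5100.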
1192
1193 time_zero (since Linux 3.12)
1194
1195 If cap_user_time_zero is set, then the hardware clock (the TSC
1196 timestamp counter on x86) can be calculated from the time_zero,
1197 time_mult, and time_shift values:
1198
1199 time = timestamp - time_zero;
1200 quot = time / time_mult;
1201 rem = time % time_mult;
1202 cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
1203
1204 And vice versa:
1205
1206 quot = cyc >> time_shift;
1207 rem = cyc & (((u64)1 << time_shift) - 1);
1208 timestamp = time_zero + quot * time_mult +
1209 ((rem * time_mult) >> time_shift);
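
Both directions can be expressed as small functions, which makes a round-trip check easy. The function names are hypothetical, and the scale factors in the usage note are illustrative values, not ones reported by any particular kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Map a perf timestamp to a hardware clock (cycle) value. */
static uint64_t timestamp_to_cycles(uint64_t timestamp, uint64_t time_zero,
                                    uint32_t time_mult, uint16_t time_shift)
{
    assert(time_mult != 0 && time_shift < 32);
    uint64_t time = timestamp - time_zero;
    uint64_t quot = time / time_mult;
    uint64_t rem = time % time_mult;
    return (quot << time_shift) + (rem << time_shift) / time_mult;
}

/* Map a hardware clock (cycle) value back to a perf timestamp. */
static uint64_t cycles_to_timestamp(uint64_t cyc, uint64_t time_zero,
                                    uint32_t time_mult, uint16_t time_shift)
{
    assert(time_shift < 32);
    uint64_t quot = cyc >> time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << time_shift) - 1);
    return time_zero + quot * time_mult + ((rem * time_mult) >> time_shift);
}
```

With time_zero = 100, time_mult = 1024, and time_shift = 10, the timestamp 5100 maps to cycle 5000 and back again.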
1210
1211 data_head
1212 This points to the head of the data section. The value
1213 continuously increases; it does not wrap. The value needs to be
1214 manually wrapped by the size of the mmap buffer before accessing the
1215 samples.
1216
1217 On SMP-capable platforms, after reading the data_head value,
1218 user space should issue an rmb().
1219
1220 data_tail
1221 When the mapping is PROT_WRITE, the data_tail value should be
1222 written by user space to reflect the last read data. In this
1223 case, the kernel will not overwrite unread data.
1224
1225 data_offset (since Linux 4.1)
1226 Contains the offset of the location in the mmap buffer where
1227 perf sample data begins.
1228
1229 data_size (since Linux 4.1)
1230 Contains the size of the perf sample region within the mmap buf‐
1231 fer.
1232
1233 aux_head, aux_tail, aux_offset, aux_size (since Linux 4.1)
1234 The AUX region allows mmap(2)-ing a separate sample buffer for
1235 high-bandwidth data streams (separate from the main perf sample
1236 buffer). An example of a high-bandwidth stream is instruction
1237 tracing support, as is found in newer Intel processors.
1238
1239 To set up an AUX area, first aux_offset needs to be set with an
1240 offset greater than data_offset+data_size and aux_size needs to
1241 be set to the desired buffer size. The desired offset and size
1242 must be page aligned, and the size must be a power of two.
1243 These values are then passed to mmap in order to map the AUX
1244 buffer. Pages in the AUX buffer are included as part of the
1245 RLIMIT_MEMLOCK resource limit (see setrlimit(2)), and also as
1246 part of the perf_event_mlock_kb allowance.
1247
1248 By default, the AUX buffer will be truncated if it will not fit
1249 in the available space in the ring buffer. If the AUX buffer is
1250 mapped as a read only buffer, then it will operate in ring buf‐
1251 fer mode where old data will be overwritten by new. In over‐
1252 write mode, it might not be possible to infer where the new data
1253 began, and it is the consumer's job to disable measurement while
1254 reading to avoid possible data races.
1255
1256 The aux_head and aux_tail ring buffer pointers have the same
1257 behavior and ordering rules as the previously described data_head
1258 and data_tail.
1259
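Because data_head and data_tail never wrap, a consumer has to reduce them modulo the buffer size and cope with records that straddle the end of the buffer. The following sketch shows one way to copy bytes out of such a ring. The function name is hypothetical, and the code assumes the data area size is a power of two, as the 2^n page rule implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy len bytes starting at logical position pos out of a ring buffer
 * of data_size bytes that begins at data. pos is an ever-increasing
 * offset such as data_tail; it is wrapped here.
 */
static void ring_copy(const uint8_t *data, uint64_t data_size,
                      uint64_t pos, void *dst, size_t len)
{
    assert(data != NULL && dst != NULL);
    assert(data_size != 0 && (data_size & (data_size - 1)) == 0);
    assert(len <= data_size);

    uint64_t off = pos & (data_size - 1); /* wrap the position */
    size_t first = len;
    if (off + len > data_size)
        first = (size_t)(data_size - off); /* bytes before the end */

    memcpy(dst, data + off, first);
    memcpy((uint8_t *)dst + first, data, len - first); /* wrapped part */
}
```

A real consumer would read data_head with an acquire barrier before calling such a helper, and would store the new data_tail only after it has finished with the copied records.
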
1260 The following 2^n ring-buffer pages have the layout described below.
1261
1262 If perf_event_attr.sample_id_all is set, then all event types will
1263 include the sample_type selected fields related to where/when
1264 (identity) an event took place (TID, TIME, ID, CPU, STREAM_ID), as
1265 described in PERF_RECORD_SAMPLE below. These fields are stashed just
1266 after the perf_event_header and the fields already present for the
1267 record type, that is, at the end of the payload. This allows a newer
1268 perf.data file to be supported by older perf tools, with the new
1269 optional fields being ignored.
1270
1271 The mmap values start with a header:
1272
1273 struct perf_event_header {
1274 __u32 type;
1275 __u16 misc;
1276 __u16 size;
1277 };
1278
1279 Below, we describe the perf_event_header fields in more detail. For
1280 ease of reading, the fields with shorter descriptions are presented
1281 first.
1282
1283 size This indicates the size of the record.
1284
1285 misc The misc field contains additional information about the sample.
1286
1287 The CPU mode can be determined from this value by masking with
1288 PERF_RECORD_MISC_CPUMODE_MASK and looking for one of the follow‐
1289 ing (note these are not bit masks, only one can be set at a
1290 time):
1291
1292 PERF_RECORD_MISC_CPUMODE_UNKNOWN
1293 Unknown CPU mode.
1294
1295 PERF_RECORD_MISC_KERNEL
1296 Sample happened in the kernel.
1297
1298 PERF_RECORD_MISC_USER
1299 Sample happened in user code.
1300
1301 PERF_RECORD_MISC_HYPERVISOR
1302 Sample happened in the hypervisor.
1303
1304 PERF_RECORD_MISC_GUEST_KERNEL (since Linux 2.6.35)
1305 Sample happened in the guest kernel.
1306
1307 PERF_RECORD_MISC_GUEST_USER (since Linux 2.6.35)
1308 Sample happened in guest user code.
1309
1310 Since the following three statuses are generated by different
1311 record types, they alias to the same bit:
1312
1313 PERF_RECORD_MISC_MMAP_DATA (since Linux 3.10)
1314 This is set when the mapping is not executable; otherwise
1315 the mapping is executable.
1316
1317 PERF_RECORD_MISC_COMM_EXEC (since Linux 3.16)
1318 This is set for a PERF_RECORD_COMM record on Linux 3.16
1319 and later kernels if a process name change was
1320 caused by an exec(2) system call.
1321
1322 PERF_RECORD_MISC_SWITCH_OUT (since Linux 4.3)
1323 When a PERF_RECORD_SWITCH or PERF_RECORD_SWITCH_CPU_WIDE
1324 record is generated, this bit indicates that the context
1325 switch is away from the current process (instead of into
1326 the current process).
1327
1328 In addition, the following bits can be set:
1329
1330 PERF_RECORD_MISC_EXACT_IP
1331 This indicates that the content of PERF_SAMPLE_IP points
1332 to the actual instruction that triggered the event. See
1333 also perf_event_attr.precise_ip.
1334
1335 PERF_RECORD_MISC_EXT_RESERVED (since Linux 2.6.35)
1336 This indicates there is extended data available (cur‐
1337 rently not used).
1338
1339 PERF_RECORD_MISC_PROC_MAP_PARSE_TIMEOUT
1340 This bit is not set by the kernel. It is reserved for
1341 the user-space perf utility to indicate that
1342 /proc/[pid]/maps parsing was taking too long and was
1343 stopped, and thus the mmap records may be truncated.
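
Since the CPU mode values are not bit masks, they are compared after masking rather than tested bit by bit. The sketch below decodes the mode from a record header; the function name and the returned strings are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/perf_event.h>

/* Return a short description of the CPU mode stored in hdr->misc. */
static const char *cpumode_name(const struct perf_event_header *hdr)
{
    assert(hdr != NULL);
    switch (hdr->misc & PERF_RECORD_MISC_CPUMODE_MASK) {
    case PERF_RECORD_MISC_KERNEL:       return "kernel";
    case PERF_RECORD_MISC_USER:         return "user";
    case PERF_RECORD_MISC_HYPERVISOR:   return "hypervisor";
    case PERF_RECORD_MISC_GUEST_KERNEL: return "guest kernel";
    case PERF_RECORD_MISC_GUEST_USER:   return "guest user";
    default:                            return "unknown";
    }
}
```

Other bits in misc, such as PERF_RECORD_MISC_EXACT_IP, do not affect the result because they fall outside the mask.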
1344
1345 type The type value is one of the below. The values in the corre‐
1346 sponding record (that follows the header) depend on the type se‐
1347 lected as shown.
1348
1349 PERF_RECORD_MMAP
1350 The MMAP events record the PROT_EXEC mappings so that we can
1351 correlate user-space IPs to code. They have the following
1352 structure:
1353
1354 struct {
1355 struct perf_event_header header;
1356 u32 pid, tid;
1357 u64 addr;
1358 u64 len;
1359 u64 pgoff;
1360 char filename[];
1361 };
1362
1363 pid is the process ID.
1364
1365 tid is the thread ID.
1366
1367 addr is the address of the allocated memory.
1368 len is the length of the allocated memory.
1369 pgoff is the page offset of the allocated memory.
1370 filename is a string describing the backing of the allocated memory.
1371
1372 PERF_RECORD_LOST
1373 This record indicates when events are lost.
1374
1375 struct {
1376 struct perf_event_header header;
1377 u64 id;
1378 u64 lost;
1379 struct sample_id sample_id;
1380 };
1381
1382 id is the unique event ID for the samples that were
1383 lost.
1384
1385 lost is the number of events that were lost.
1386
1387 PERF_RECORD_COMM
1388 This record indicates a change in the process name.
1389
1390 struct {
1391 struct perf_event_header header;
1392 u32 pid;
1393 u32 tid;
1394 char comm[];
1395 struct sample_id sample_id;
1396 };
1397
1398 pid is the process ID.
1399
1400 tid is the thread ID.
1401
1402 comm is a string containing the new name of the process.
1403
1404 PERF_RECORD_EXIT
1405 This record indicates a process exit event.
1406
1407 struct {
1408 struct perf_event_header header;
1409 u32 pid, ppid;
1410 u32 tid, ptid;
1411 u64 time;
1412 struct sample_id sample_id;
1413 };
1414
1415 PERF_RECORD_THROTTLE, PERF_RECORD_UNTHROTTLE
1416 This record indicates a throttle/unthrottle event.
1417
1418 struct {
1419 struct perf_event_header header;
1420 u64 time;
1421 u64 id;
1422 u64 stream_id;
1423 struct sample_id sample_id;
1424 };
1425
1426 PERF_RECORD_FORK
1427 This record indicates a fork event.
1428
1429 struct {
1430 struct perf_event_header header;
1431 u32 pid, ppid;
1432 u32 tid, ptid;
1433 u64 time;
1434 struct sample_id sample_id;
1435 };
1436
1437 PERF_RECORD_READ
1438 This record indicates a read event.
1439
1440 struct {
1441 struct perf_event_header header;
1442 u32 pid, tid;
1443 struct read_format values;
1444 struct sample_id sample_id;
1445 };
1446
1447 PERF_RECORD_SAMPLE
1448 This record indicates a sample.
1449
1450 struct {
1451 struct perf_event_header header;
1452 u64 sample_id; /* if PERF_SAMPLE_IDENTIFIER */
1453 u64 ip; /* if PERF_SAMPLE_IP */
1454 u32 pid, tid; /* if PERF_SAMPLE_TID */
1455 u64 time; /* if PERF_SAMPLE_TIME */
1456 u64 addr; /* if PERF_SAMPLE_ADDR */
1457 u64 id; /* if PERF_SAMPLE_ID */
1458 u64 stream_id; /* if PERF_SAMPLE_STREAM_ID */
1459 u32 cpu, res; /* if PERF_SAMPLE_CPU */
1460 u64 period; /* if PERF_SAMPLE_PERIOD */
1461 struct read_format v;
1462 /* if PERF_SAMPLE_READ */
1463 u64 nr; /* if PERF_SAMPLE_CALLCHAIN */
1464 u64 ips[nr]; /* if PERF_SAMPLE_CALLCHAIN */
1465 u32 size; /* if PERF_SAMPLE_RAW */
1466 char data[size]; /* if PERF_SAMPLE_RAW */
1467 u64 bnr; /* if PERF_SAMPLE_BRANCH_STACK */
1468 struct perf_branch_entry lbr[bnr];
1469 /* if PERF_SAMPLE_BRANCH_STACK */
1470 u64 abi; /* if PERF_SAMPLE_REGS_USER */
1471 u64 regs[weight(mask)];
1472 /* if PERF_SAMPLE_REGS_USER */
1473 u64 size; /* if PERF_SAMPLE_STACK_USER */
1474 char data[size]; /* if PERF_SAMPLE_STACK_USER */
1475 u64 dyn_size; /* if PERF_SAMPLE_STACK_USER &&
1476 size != 0 */
1477 u64 weight; /* if PERF_SAMPLE_WEIGHT */
1478 u64 data_src; /* if PERF_SAMPLE_DATA_SRC */
1479 u64 transaction; /* if PERF_SAMPLE_TRANSACTION */
1480 u64 abi; /* if PERF_SAMPLE_REGS_INTR */
1481 u64 regs[weight(mask)];
1482 /* if PERF_SAMPLE_REGS_INTR */
1483 u64 phys_addr; /* if PERF_SAMPLE_PHYS_ADDR */
1484 u64 cgroup; /* if PERF_SAMPLE_CGROUP */
1485 };
1486
1487 sample_id
1488 If PERF_SAMPLE_IDENTIFIER is enabled, a 64-bit unique ID
1489 is included. This is a duplication of the PERF_SAM‐
1490 PLE_ID id value, but included at the beginning of the
1491 sample so parsers can easily obtain the value.
1492
1493 ip If PERF_SAMPLE_IP is enabled, then a 64-bit instruction
1494 pointer value is included.
1495
1496 pid, tid
1497 If PERF_SAMPLE_TID is enabled, then a 32-bit process ID
1498 and 32-bit thread ID are included.
1499
1500 time
1501 If PERF_SAMPLE_TIME is enabled, then a 64-bit timestamp
1502 is included. This is obtained via local_clock() which
1503 is a hardware timestamp if available and the jiffies
1504 value if not.
1505
1506 addr
1507 If PERF_SAMPLE_ADDR is enabled, then a 64-bit address is
1508 included. This is usually the address of a tracepoint,
1509 breakpoint, or software event; otherwise the value is 0.
1510
1511 id If PERF_SAMPLE_ID is enabled, a 64-bit unique ID is in‐
1512 cluded. If the event is a member of an event group, the
1513 group leader ID is returned. This ID is the same as the
1514 one returned by PERF_FORMAT_ID.
1515
1516 stream_id
1517 If PERF_SAMPLE_STREAM_ID is enabled, a 64-bit unique ID
1518 is included. Unlike PERF_SAMPLE_ID the actual ID is re‐
1519 turned, not the group leader. This ID is the same as
1520 the one returned by PERF_FORMAT_ID.
1521
1522 cpu, res
1523 If PERF_SAMPLE_CPU is enabled, this is a 32-bit value
1524 indicating which CPU was being used, in addition to a
1525 reserved (unused) 32-bit value.
1526
1527 period
1528 If PERF_SAMPLE_PERIOD is enabled, a 64-bit value indi‐
1529 cating the current sampling period is written.
1530
1531 v If PERF_SAMPLE_READ is enabled, a structure of type
1532 read_format is included which has values for all events
1533 in the event group. The values included depend on the
1534 read_format value used at perf_event_open() time.
1535
1536 nr, ips[nr]
1537 If PERF_SAMPLE_CALLCHAIN is enabled, then a 64-bit
1538 number is included which indicates how many 64-bit
1539 instruction pointers follow. This is the current
1540 callchain.
1541
1542 size, data[size]
1543 If PERF_SAMPLE_RAW is enabled, then a 32-bit value indi‐
1544 cating size is included followed by an array of 8-bit
1545 values of length size. The values are padded with 0 to
1546 have 64-bit alignment.
1547
1548 This RAW record data is opaque with respect to the ABI.
1549 The ABI doesn't make any promises with respect to the
1550 stability of its content; it may vary depending on
1551 event, hardware, and kernel version.
1552
1553 bnr, lbr[bnr]
1554 If PERF_SAMPLE_BRANCH_STACK is enabled, then a 64-bit
1555 value indicating the number of records is included, fol‐
1556 lowed by bnr perf_branch_entry structures which each in‐
1557 clude the fields:
1558
1559 from This indicates the source instruction (may not be
1560 a branch).
1561
1562 to The branch target.
1563
1564 mispred
1565 The branch target was mispredicted.
1566
1567 predicted
1568 The branch target was predicted.
1569
1570 in_tx (since Linux 3.11)
1571 The branch was in a transactional memory transac‐
1572 tion.
1573
1574 abort (since Linux 3.11)
1575 The branch was in an aborted transactional memory
1576 transaction.
1577
1578 cycles (since Linux 4.3)
1579 This reports the number of cycles elapsed since
1580 the previous branch stack update.
1581
1582 The entries are from most to least recent, so the first
1583 entry has the most recent branch.
1584
1585 Support for mispred, predicted, and cycles is optional;
1586 if not supported, those values will be 0.
1587
1588 The type of branches recorded is specified by the
1589 branch_sample_type field.
1590
1591 abi, regs[weight(mask)]
1592 If PERF_SAMPLE_REGS_USER is enabled, then the user CPU
1593 registers are recorded.
1594
1595 The abi field is one of PERF_SAMPLE_REGS_ABI_NONE,
1596 PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
1597
1598 The regs field is an array of the CPU registers that
1599 were specified by the sample_regs_user attr field. The
1600 number of values is the number of bits set in the sam‐
1601 ple_regs_user bit mask.
1602
1603 size, data[size], dyn_size
1604 If PERF_SAMPLE_STACK_USER is enabled, then the user
1605 stack is recorded. This can be used to generate stack
1606 backtraces. size is the size requested by the user in
1607 sample_stack_user or else the maximum record size. data
1608 is the stack data (a raw dump of the memory pointed to
1609 by the stack pointer at the time of sampling). dyn_size
1610 is the amount of data actually dumped (can be less than
1611 size). Note that dyn_size is omitted if size is 0.
1612
1613 weight
1614 If PERF_SAMPLE_WEIGHT is enabled, then a 64-bit value
1615 provided by the hardware is recorded that indicates how
1616 costly the event was. This allows expensive events to
1617 stand out more clearly in profiles.
1618
1619 data_src
1620 If PERF_SAMPLE_DATA_SRC is enabled, then a 64-bit value
1621 is recorded that is made up of the following fields:
1622
1623 mem_op
1624 Type of opcode, a bitwise combination of:
1625
1626 PERF_MEM_OP_NA Not available
1627 PERF_MEM_OP_LOAD Load instruction
1628 PERF_MEM_OP_STORE Store instruction
1629 PERF_MEM_OP_PFETCH Prefetch
1630 PERF_MEM_OP_EXEC Executable code
1631
1632 mem_lvl
1633 Memory hierarchy level hit or miss, a bitwise combi‐
1634 nation of the following, shifted left by
1635 PERF_MEM_LVL_SHIFT:
1636
1637 PERF_MEM_LVL_NA Not available
1638 PERF_MEM_LVL_HIT Hit
1639 PERF_MEM_LVL_MISS Miss
1640 PERF_MEM_LVL_L1 Level 1 cache
1641 PERF_MEM_LVL_LFB Line fill buffer
1642 PERF_MEM_LVL_L2 Level 2 cache
1643 PERF_MEM_LVL_L3 Level 3 cache
1644 PERF_MEM_LVL_LOC_RAM Local DRAM
1645 PERF_MEM_LVL_REM_RAM1 Remote DRAM 1 hop
1646 PERF_MEM_LVL_REM_RAM2 Remote DRAM 2 hops
1647 PERF_MEM_LVL_REM_CCE1 Remote cache 1 hop
1648 PERF_MEM_LVL_REM_CCE2 Remote cache 2 hops
1649 PERF_MEM_LVL_IO I/O memory
1650 PERF_MEM_LVL_UNC Uncached memory
1651
1652 mem_snoop
1653 Snoop mode, a bitwise combination of the following,
1654 shifted left by PERF_MEM_SNOOP_SHIFT:
1655
1656 PERF_MEM_SNOOP_NA Not available
1657 PERF_MEM_SNOOP_NONE No snoop
1658 PERF_MEM_SNOOP_HIT Snoop hit
1659 PERF_MEM_SNOOP_MISS Snoop miss
1660 PERF_MEM_SNOOP_HITM Snoop hit modified
1661
1662 mem_lock
1663 Lock instruction, a bitwise combination of the fol‐
1664 lowing, shifted left by PERF_MEM_LOCK_SHIFT:
1665
1666 PERF_MEM_LOCK_NA Not available
1667 PERF_MEM_LOCK_LOCKED Locked transaction
1668
1669 mem_dtlb
1670 TLB access hit or miss, a bitwise combination of the
1671 following, shifted left by PERF_MEM_TLB_SHIFT:
1672
1673 PERF_MEM_TLB_NA Not available
1674 PERF_MEM_TLB_HIT Hit
1675 PERF_MEM_TLB_MISS Miss
1676 PERF_MEM_TLB_L1 Level 1 TLB
1677 PERF_MEM_TLB_L2 Level 2 TLB
1678 PERF_MEM_TLB_WK Hardware walker
1679 PERF_MEM_TLB_OS OS fault handler
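
As an illustration of the shifts described above, the sketch below tests whether a data_src value describes a load that hit in the level 1 cache. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* Return nonzero if data_src describes a load that hit in the L1 cache. */
static int is_l1_load_hit(uint64_t data_src)
{
    uint64_t op = data_src;                        /* mem_op is not shifted */
    uint64_t lvl = data_src >> PERF_MEM_LVL_SHIFT; /* undo the left shift */

    return (op & PERF_MEM_OP_LOAD) &&
           (lvl & PERF_MEM_LVL_HIT) &&
           (lvl & PERF_MEM_LVL_L1);
}
```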
1680
1681 transaction
1682 If the PERF_SAMPLE_TRANSACTION flag is set, then a
1683 64-bit field is recorded describing the sources of any
1684 transactional memory aborts.
1685
1686 The field is a bitwise combination of the following val‐
1687 ues:
1688
1689 PERF_TXN_ELISION
1690 Abort from an elision type transaction (Intel-
1691 CPU-specific).
1692
1693 PERF_TXN_TRANSACTION
1694 Abort from a generic transaction.
1695
1696 PERF_TXN_SYNC
1697 Synchronous abort (related to the reported in‐
1698 struction).
1699
1700 PERF_TXN_ASYNC
1701 Asynchronous abort (not related to the reported
1702 instruction).
1703
1704 PERF_TXN_RETRY
1705 Retryable abort (retrying the transaction may
1706 have succeeded).
1707
1708 PERF_TXN_CONFLICT
1709 Abort due to memory conflicts with other threads.
1710
1711 PERF_TXN_CAPACITY_WRITE
1712 Abort due to write capacity overflow.
1713
1714 PERF_TXN_CAPACITY_READ
1715 Abort due to read capacity overflow.
1716
1717 In addition, a user-specified abort code can be obtained
1718 from the high 32 bits of the field by masking with
1719 PERF_TXN_ABORT_MASK and shifting right by
1720 PERF_TXN_ABORT_SHIFT.
1721
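The abort code can be extracted as in the sketch below. It assumes, as in the linux/perf_event.h header, that PERF_TXN_ABORT_MASK selects the high 32 bits in place, so the mask is applied before the shift. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* Extract the user-specified abort code from a transaction field. */
static uint32_t txn_abort_code(uint64_t transaction)
{
    return (uint32_t)((transaction & (uint64_t)PERF_TXN_ABORT_MASK)
                      >> PERF_TXN_ABORT_SHIFT);
}
```
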
1722 abi, regs[weight(mask)]
1723 If PERF_SAMPLE_REGS_INTR is enabled, then the CPU registers
1724 at the time of the interrupt are recorded.
1725
1726 The abi field is one of PERF_SAMPLE_REGS_ABI_NONE,
1727 PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
1728
1729 The regs field is an array of the CPU registers that
1730 were specified by the sample_regs_intr attr field. The
1731 number of values is the number of bits set in the sam‐
1732 ple_regs_intr bit mask.
1733
1734 phys_addr
1735 If the PERF_SAMPLE_PHYS_ADDR flag is set, then the
1736 64-bit physical address is recorded.
1737
1738 cgroup
1739 If the PERF_SAMPLE_CGROUP flag is set, then the 64-bit
1740 cgroup ID (for the perf_event subsystem) is recorded.
1741 To get the pathname of the cgroup, the ID should be matched
1742 to one in a PERF_RECORD_CGROUP record.
1743
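Because each field of a PERF_RECORD_SAMPLE is present only if its bit is set in sample_type, a parser has to walk the payload in the order of the structure shown earlier. The sketch below handles only the first few fields (sample_id, ip, pid, tid, time); the type and function names are hypothetical, and a full parser would continue with the remaining fields in the same manner.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/* Leading fields of a sample that this sketch knows how to decode. */
struct sample_head {
    uint64_t ip;
    uint32_t pid, tid;
    uint64_t time;
};

/*
 * Walk the start of a PERF_RECORD_SAMPLE payload (the bytes that follow
 * perf_event_header). Returns the number of bytes consumed.
 */
static size_t parse_sample_head(uint64_t sample_type, const uint8_t *p,
                                struct sample_head *out)
{
    const uint8_t *start = p;

    assert(p != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    if (sample_type & PERF_SAMPLE_IDENTIFIER)
        p += 8; /* skip sample_id */
    if (sample_type & PERF_SAMPLE_IP) {
        memcpy(&out->ip, p, 8);
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_TID) {
        memcpy(&out->pid, p, 4); /* pid comes first, then tid */
        memcpy(&out->tid, p + 4, 4);
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_TIME) {
        memcpy(&out->time, p, 8);
        p += 8;
    }
    return (size_t)(p - start);
}
```
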
1744 PERF_RECORD_MMAP2
1745 This record includes extended information on mmap(2) calls
1746 returning executable mappings. The format is similar to
1747 that of the PERF_RECORD_MMAP record, but includes extra val‐
1748 ues that allow uniquely identifying shared mappings.
1749
1750 struct {
1751 struct perf_event_header header;
1752 u32 pid;
1753 u32 tid;
1754 u64 addr;
1755 u64 len;
1756 u64 pgoff;
1757 u32 maj;
1758 u32 min;
1759 u64 ino;
1760 u64 ino_generation;
1761 u32 prot;
1762 u32 flags;
1763 char filename[];
1764 struct sample_id sample_id;
1765 };
1766
1767 pid is the process ID.
1768
1769 tid is the thread ID.
1770
1771 addr is the address of the allocated memory.
1772
1773 len is the length of the allocated memory.
1774
1775 pgoff is the page offset of the allocated memory.
1776
1777 maj is the major ID of the underlying device.
1778
1779 min is the minor ID of the underlying device.
1780
1781 ino is the inode number.
1782
1783 ino_generation
1784 is the inode generation.
1785
1786 prot is the protection information.
1787
1788 flags is the flags information.
1789
1790 filename
1791 is a string describing the backing of the allocated
1792 memory.
1793
1794 PERF_RECORD_AUX (since Linux 4.1)
1795 This record reports that new data is available in the sepa‐
1796 rate AUX buffer region.
1797
1798 struct {
1799 struct perf_event_header header;
1800 u64 aux_offset;
1801 u64 aux_size;
1802 u64 flags;
1803 struct sample_id sample_id;
1804 };
1805
1806 aux_offset
1807 offset in the AUX mmap region where the new data be‐
1808 gins.
1809
1810 aux_size
1811 size of the data made available.
1812
1813 flags describes the AUX update.
1814
1815 PERF_AUX_FLAG_TRUNCATED
1816 if set, then the data returned was truncated
1817 to fit the available buffer size.
1818
1819 PERF_AUX_FLAG_OVERWRITE
1820 if set, then the data returned has overwritten
1821 previous data.
1822
1823 PERF_RECORD_ITRACE_START (since Linux 4.1)
1824 This record indicates which process has initiated an in‐
1825 struction trace event, allowing tools to properly correlate
1826 the instruction addresses in the AUX buffer with the proper
1827 executable.
1828
1829 struct {
1830 struct perf_event_header header;
1831 u32 pid;
1832 u32 tid;
1833 };
1834
1835 pid process ID of the thread starting an instruction
1836 trace.
1837
1838 tid thread ID of the thread starting an instruction
1839 trace.
1840
1841 PERF_RECORD_LOST_SAMPLES (since Linux 4.2)
1842 When using hardware sampling (such as Intel PEBS) this
1843 record indicates some number of samples that may have been
1844 lost.
1845
1846 struct {
1847 struct perf_event_header header;
1848 u64 lost;
1849 struct sample_id sample_id;
1850 };
1851
1852 lost the number of potentially lost samples.
1853
1854 PERF_RECORD_SWITCH (since Linux 4.3)
1855 This record indicates a context switch has happened. The
1856 PERF_RECORD_MISC_SWITCH_OUT bit in the misc field indicates
1857 whether it was a context switch into or away from the cur‐
1858 rent process.
1859
1860 struct {
1861 struct perf_event_header header;
1862 struct sample_id sample_id;
1863 };
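
A minimal sketch of testing the direction of a context-switch record, assuming the record header has already been located in the ring buffer. The helper name is hypothetical.

```c
#include <linux/perf_event.h>

/* Hypothetical helper: returns 1 if the record describes a switch
   away from the current process, 0 if it is a switch into it. */
static int
is_switch_out(const struct perf_event_header *hdr)
{
    return (hdr->misc & PERF_RECORD_MISC_SWITCH_OUT) != 0;
}
```

The same test applies to PERF_RECORD_SWITCH_CPU_WIDE records, which use the same misc bit.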
1864
1865 PERF_RECORD_SWITCH_CPU_WIDE (since Linux 4.3)
1866 As with PERF_RECORD_SWITCH this record indicates a context
1867 switch has happened, but it only occurs when sampling in
1868 CPU-wide mode and provides additional information on the
1869 process being switched to/from. The
1870 PERF_RECORD_MISC_SWITCH_OUT bit in the misc field indicates
1871 whether it was a context switch into or away from the cur‐
1872 rent process.
1873
1874 struct {
1875 struct perf_event_header header;
1876 u32 next_prev_pid;
1877 u32 next_prev_tid;
1878 struct sample_id sample_id;
1879 };
1880
1881 next_prev_pid
1882 The process ID of the previous (if switching in) or
1883 next (if switching out) process on the CPU.
1884
1885 next_prev_tid
1886 The thread ID of the previous (if switching in) or
1887 next (if switching out) thread on the CPU.
1888
1889 PERF_RECORD_NAMESPACES (since Linux 4.11)
1890 This record includes various namespace information of a
1891 process.
1892
1893 struct {
1894 struct perf_event_header header;
1895 u32 pid;
1896 u32 tid;
1897 u64 nr_namespaces;
1898 struct { u64 dev, inode } [nr_namespaces];
1899 struct sample_id sample_id;
1900 };
1901
1902 pid is the process ID
1903
1904 tid is the thread ID
1905
nr_namespaces
is the number of namespaces in this record

Each namespace has dev and inode fields and is recorded at
a fixed position in the array, as follows:
1911
1912 NET_NS_INDEX=0
1913 Network namespace
1914
1915 UTS_NS_INDEX=1
1916 UTS namespace
1917
1918 IPC_NS_INDEX=2
1919 IPC namespace
1920
1921 PID_NS_INDEX=3
1922 PID namespace
1923
1924 USER_NS_INDEX=4
1925 User namespace
1926
1927 MNT_NS_INDEX=5
1928 Mount namespace
1929
1930 CGROUP_NS_INDEX=6
1931 Cgroup namespace
1932
1933 PERF_RECORD_KSYMBOL (since Linux 5.0)
1934 This record indicates kernel symbol register/unregister
1935 events.
1936
1937 struct {
1938 struct perf_event_header header;
1939 u64 addr;
1940 u32 len;
1941 u16 ksym_type;
1942 u16 flags;
1943 char name[];
1944 struct sample_id sample_id;
1945 };
1946
1947 addr is the address of the kernel symbol.
1948
1949 len is the length of the kernel symbol.
1950
1951 ksym_type
1952 is the type of the kernel symbol. Currently the fol‐
1953 lowing types are available:
1954
1955 PERF_RECORD_KSYMBOL_TYPE_BPF
1956 The kernel symbol is a BPF function.
1957
1958 flags If the PERF_RECORD_KSYMBOL_FLAGS_UNREGISTER is set,
1959 then this event is for unregistering the kernel sym‐
1960 bol.
1961
1962 PERF_RECORD_BPF_EVENT (since Linux 5.0)
This record indicates that a BPF program is loaded or unloaded.
1964
1965 struct {
1966 struct perf_event_header header;
1967 u16 type;
1968 u16 flags;
1969 u32 id;
1970 u8 tag[BPF_TAG_SIZE];
1971 struct sample_id sample_id;
1972 };
1973
1974 type is one of the following values:
1975
1976 PERF_BPF_EVENT_PROG_LOAD
1977 A BPF program is loaded
1978
1979 PERF_BPF_EVENT_PROG_UNLOAD
1980 A BPF program is unloaded
1981
1982 id is the ID of the BPF program.
1983
1984 tag is the tag of the BPF program. Currently,
1985 BPF_TAG_SIZE is defined as 8.
1986
1987 PERF_RECORD_CGROUP (since Linux 5.7)
This record indicates that a new cgroup has been created and activated.
1989
1990 struct {
1991 struct perf_event_header header;
1992 u64 id;
1993 char path[];
1994 struct sample_id sample_id;
1995 };
1996
id is the cgroup identifier. This can also be retrieved
by name_to_handle_at(2) on the cgroup path (as a file
handle).
2000
2001 path is the path of the cgroup from the root.
2002
2003 PERF_RECORD_TEXT_POKE (since Linux 5.8)
This record indicates a change in the kernel text. This
includes the addition and removal of text, in which case the
corresponding length (old_len or new_len) is zero.
2007
2008 struct {
2009 struct perf_event_header header;
2010 u64 addr;
2011 u16 old_len;
2012 u16 new_len;
2013 u8 bytes[];
2014 struct sample_id sample_id;
2015 };
2016
2017 addr is the address of the change
2018
2019 old_len
2020 is the old length
2021
2022 new_len
2023 is the new length
2024
2025 bytes contains old bytes immediately followed by new bytes.
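
Because the old and new bytes share one array, the boundary between them is given by old_len. A minimal sketch of splitting the array; the structure and function names are hypothetical and not part of the kernel ABI:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the bytes[] tail of a text-poke record. */
struct text_poke_bytes {
    const uint8_t *old_bytes;   /* old_len bytes of previous text */
    const uint8_t *new_bytes;   /* new_len bytes of replacement text */
};

/* The new bytes start immediately after the old_len old bytes. */
static struct text_poke_bytes
text_poke_split(const uint8_t *bytes, uint16_t old_len)
{
    struct text_poke_bytes v = { bytes, bytes + old_len };

    return v;
}
```

A caller should read at most old_len bytes through old_bytes and at most new_len bytes through new_bytes.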
2026
2027 Overflow handling
2028 Events can be set to notify when a threshold is crossed, indicating an
2029 overflow. Overflow conditions can be captured by monitoring the event
2030 file descriptor with poll(2), select(2), or epoll(7). Alternatively,
the overflow events can be captured via a signal handler, by enabling
I/O signaling on the file descriptor; see the discussion of the
F_SETOWN and F_SETSIG operations in fcntl(2).
2034
2035 Overflows are generated only by sampling events (sample_period must
2036 have a nonzero value).
2037
2038 There are two ways to generate overflow notifications.
2039
2040 The first is to set a wakeup_events or wakeup_watermark value that will
2041 trigger if a certain number of samples or bytes have been written to
2042 the mmap ring buffer. In this case, POLL_IN is indicated.
2043
2044 The other way is by use of the PERF_EVENT_IOC_REFRESH ioctl. This
2045 ioctl adds to a counter that decrements each time the event overflows.
2046 When nonzero, POLL_IN is indicated, but once the counter reaches 0
2047 POLL_HUP is indicated and the underlying event is disabled.
2048
Refreshing an event group leader refreshes all siblings, and refreshing
with a parameter of 0 currently enables infinite refreshes; these
behaviors are unsupported and should not be relied on.
2052
2053 Starting with Linux 3.18, POLL_HUP is indicated if the event being mon‐
2054 itored is attached to a different process and that process exits.
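
A minimal sketch of the poll(2) approach described above. With poll(2), the two conditions appear as the POLLIN and POLLHUP bits in revents. The function name and its return codes are hypothetical. The helper works on any pollable file descriptor, so it can be tried without a perf event.

```c
#include <poll.h>

/* Hypothetical helper: wait up to timeout_ms for an overflow
   notification on fd.
   Returns 1 if data is available (POLLIN), 2 on hangup (POLLHUP),
   0 on timeout, and -1 on error. */
static int
wait_overflow(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    /* Hangup is checked first; unread data may still remain. */
    if (pfd.revents & POLLHUP)
        return 2;
    if (pfd.revents & POLLIN)
        return 1;
    return -1;
}
```

A sampling loop would typically call this helper, drain the mmap ring buffer when it returns 1, and stop or re-arm the event when it returns 2.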
2055
2056 rdpmc instruction
2057 Starting with Linux 3.4 on x86, you can use the rdpmc instruction to
2058 get low-latency reads without having to enter the kernel. Note that
2059 using rdpmc is not necessarily faster than other methods for reading
2060 event values.
2061
2062 Support for this can be detected with the cap_usr_rdpmc field in the
2063 mmap page; documentation on how to calculate event values can be found
2064 in that section.
2065
2066 Originally, when rdpmc support was enabled, any process (not just ones
2067 with an active perf event) could use the rdpmc instruction to access
2068 the counters. Starting with Linux 4.0, rdpmc support is only allowed
2069 if an event is currently enabled in a process's context. To restore
2070 the old behavior, write the value 2 to /sys/devices/cpu/rdpmc.
2071
2072 perf_event ioctl calls
2073 Various ioctls act on perf_event_open() file descriptors:
2074
2075 PERF_EVENT_IOC_ENABLE
2076 This enables the individual event or event group specified by
2077 the file descriptor argument.
2078
2079 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
2080 then all events in a group are enabled, even if the event speci‐
2081 fied is not the group leader (but see BUGS).
2082
2083 PERF_EVENT_IOC_DISABLE
2084 This disables the individual counter or event group specified by
2085 the file descriptor argument.
2086
2087 Enabling or disabling the leader of a group enables or disables
2088 the entire group; that is, while the group leader is disabled,
2089 none of the counters in the group will count. Enabling or dis‐
2090 abling a member of a group other than the leader affects only
2091 that counter; disabling a non-leader stops that counter from
2092 counting but doesn't affect any other counter.
2093
2094 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
2095 then all events in a group are disabled, even if the event spec‐
2096 ified is not the group leader (but see BUGS).
2097
2098 PERF_EVENT_IOC_REFRESH
2099 Non-inherited overflow counters can use this to enable a counter
2100 for a number of overflows specified by the argument, after which
2101 it is disabled. Subsequent calls of this ioctl add the argument
2102 value to the current count. An overflow notification with
2103 POLL_IN set will happen on each overflow until the count reaches
2104 0; when that happens a notification with POLL_HUP set is sent
2105 and the event is disabled. Using an argument of 0 is considered
2106 undefined behavior.
2107
2108 PERF_EVENT_IOC_RESET
2109 Reset the event count specified by the file descriptor argument
2110 to zero. This resets only the counts; there is no way to reset
2111 the multiplexing time_enabled or time_running values.
2112
2113 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
2114 then all events in a group are reset, even if the event speci‐
2115 fied is not the group leader (but see BUGS).
2116
2117 PERF_EVENT_IOC_PERIOD
2118 This updates the overflow period for the event.
2119
2120 Since Linux 3.7 (on ARM) and Linux 3.14 (all other architec‐
2121 tures), the new period takes effect immediately. On older ker‐
2122 nels, the new period did not take effect until after the next
2123 overflow.
2124
2125 The argument is a pointer to a 64-bit value containing the de‐
2126 sired new period.
2127
2128 Prior to Linux 2.6.36, this ioctl always failed due to a bug in
2129 the kernel.
2130
2131 PERF_EVENT_IOC_SET_OUTPUT
2132 This tells the kernel to report event notifications to the spec‐
2133 ified file descriptor rather than the default one. The file de‐
2134 scriptors must all be on the same CPU.
2135
2136 The argument specifies the desired file descriptor, or -1 if
2137 output should be ignored.
2138
2139 PERF_EVENT_IOC_SET_FILTER (since Linux 2.6.33)
2140 This adds an ftrace filter to this event.
2141
2142 The argument is a pointer to the desired ftrace filter.
2143
2144 PERF_EVENT_IOC_ID (since Linux 3.12)
2145 This returns the event ID value for the given event file de‐
2146 scriptor.
2147
2148 The argument is a pointer to a 64-bit unsigned integer to hold
2149 the result.
2150
2151 PERF_EVENT_IOC_SET_BPF (since Linux 4.1)
2152 This allows attaching a Berkeley Packet Filter (BPF) program to
2153 an existing kprobe tracepoint event. You need CAP_PERFMON
2154 (since Linux 5.8) or CAP_SYS_ADMIN privileges to use this ioctl.
2155
2156 The argument is a BPF program file descriptor that was created
2157 by a previous bpf(2) system call.
2158
2159 PERF_EVENT_IOC_PAUSE_OUTPUT (since Linux 4.7)
2160 This allows pausing and resuming the event's ring-buffer. A
2161 paused ring-buffer does not prevent generation of samples, but
2162 simply discards them. The discarded samples are considered
2163 lost, and cause a PERF_RECORD_LOST sample to be generated when
2164 possible. An overflow signal may still be triggered by the dis‐
2165 carded sample even though the ring-buffer remains empty.
2166
2167 The argument is an unsigned 32-bit integer. A nonzero value
2168 pauses the ring-buffer, while a zero value resumes the ring-buf‐
2169 fer.
2170
PERF_EVENT_IOC_MODIFY_ATTRIBUTES (since Linux 4.17)
2172 This allows modifying an existing event without the overhead of
2173 closing and reopening a new event. Currently this is supported
2174 only for breakpoint events.
2175
2176 The argument is a pointer to a perf_event_attr structure con‐
2177 taining the updated event settings.
2178
2179 PERF_EVENT_IOC_QUERY_BPF (since Linux 4.16)
2180 This allows querying which Berkeley Packet Filter (BPF) programs
2181 are attached to an existing kprobe tracepoint. You can only at‐
2182 tach one BPF program per event, but you can have multiple events
2183 attached to a tracepoint. Querying this value on one tracepoint
event returns the IDs of all BPF programs in all events attached
2185 to the tracepoint. You need CAP_PERFMON (since Linux 5.8) or
2186 CAP_SYS_ADMIN privileges to use this ioctl.
2187
The argument is a pointer to the following structure:
2189 struct perf_event_query_bpf {
2190 __u32 ids_len;
2191 __u32 prog_cnt;
2192 __u32 ids[0];
2193 };
2194
2195 The ids_len field indicates the number of ids that can fit in
2196 the provided ids array. The prog_cnt value is filled in by the
2197 kernel with the number of attached BPF programs. The ids array
2198 is filled with the ID of each attached BPF program. If there
2199 are more programs than will fit in the array, then the kernel
2200 will return ENOSPC and ids_len will indicate the number of pro‐
2201 gram IDs that were successfully copied.
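
As a sketch of how the ENABLE, DISABLE, and RESET ioctls above combine with PERF_IOC_FLAG_GROUP, the helpers below start and stop a whole event group through a single file descriptor. The function names are hypothetical, and the descriptor is assumed to come from an earlier perf_event_open() call.

```c
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/* Hypothetical helper: zero and then enable every event in the group.
   Returns 0 on success, -1 on error (errno is set by ioctl(2)). */
static int
group_start(int group_fd)
{
    if (ioctl(group_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1)
        return -1;
    return ioctl(group_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
}

/* Hypothetical helper: disable every event in the group. */
static int
group_stop(int group_fd)
{
    return ioctl(group_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
}
```

The measured code goes between group_start() and group_stop(); the counts can then be collected with read(2).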
2202
2203 Using prctl(2)
2204 A process can enable or disable all currently open event groups using
2205 the prctl(2) PR_TASK_PERF_EVENTS_ENABLE and PR_TASK_PERF_EVENTS_DISABLE
2206 operations. This applies only to events created locally by the calling
2207 process. This does not apply to events created by other processes at‐
2208 tached to the calling process or inherited events from a parent
2209 process. Only group leaders are enabled and disabled, not any other
2210 members of the groups.
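
A minimal sketch of the two prctl(2) operations, wrapped in hypothetical helper functions. The unused prctl(2) arguments are passed as zero.

```c
#include <sys/prctl.h>

/* Hypothetical helper: disable all event groups opened by the caller.
   Returns 0 on success, -1 on error. */
static int
pause_own_events(void)
{
    return prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0);
}

/* Hypothetical helper: re-enable all event groups opened by the caller. */
static int
resume_own_events(void)
{
    return prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0);
}
```

This pair can bracket a region that should be excluded from measurement, without keeping track of the individual file descriptors.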
2211
2212 perf_event related configuration files
2213 Files in /proc/sys/kernel/
2214
2215 /proc/sys/kernel/perf_event_paranoid
2216 The perf_event_paranoid file can be set to restrict access
2217 to the performance counters.
2218
2219 2 allow only user-space measurements (default since Linux
2220 4.6).
2221 1 allow both kernel and user measurements (default before
2222 Linux 4.6).
2223 0 allow access to CPU-specific data but not raw tracepoint
2224 samples.
2225 -1 no restrictions.
2226
2227 The existence of the perf_event_paranoid file is the offi‐
2228 cial method for determining if a kernel supports
2229 perf_event_open().
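
A minimal sketch of the check described above: a failure to open the file suggests that perf_event_open() is not supported, and otherwise the parsed value is the current restriction level. The function names are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: parse the text of the paranoid file.
   Returns 0 and stores the level on success, -1 on malformed text. */
static int
parse_paranoid(const char *text, int *level)
{
    return sscanf(text, "%d", level) == 1 ? 0 : -1;
}

/* Hypothetical helper: read the level from path.
   Returns -1 if the file cannot be opened or parsed. */
static int
read_paranoid(const char *path, int *level)
{
    char buf[32];
    FILE *f = fopen(path, "r");
    int ret = -1;

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) != NULL)
        ret = parse_paranoid(buf, level);
    fclose(f);
    return ret;
}
```

A typical call is read_paranoid("/proc/sys/kernel/perf_event_paranoid", &level).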
2230
2231 /proc/sys/kernel/perf_event_max_sample_rate
2232 This sets the maximum sample rate. Setting this too high
2233 can allow users to sample at a rate that impacts overall ma‐
2234 chine performance and potentially lock up the machine. The
2235 default value is 100000 (samples per second).
2236
2237 /proc/sys/kernel/perf_event_max_stack
2238 This file sets the maximum depth of stack frame entries re‐
2239 ported when generating a call trace.
2240
2241 /proc/sys/kernel/perf_event_mlock_kb
The maximum amount of memory (in kB) that an unprivileged user
can mlock(2). The default is 516 (kB).
2244
2245 Files in /sys/bus/event_source/devices/
2246
2247 Since Linux 2.6.34, the kernel supports having multiple PMUs avail‐
2248 able for monitoring. Information on how to program these PMUs can
2249 be found under /sys/bus/event_source/devices/. Each subdirectory
2250 corresponds to a different PMU.
2251
2252 /sys/bus/event_source/devices/*/type (since Linux 2.6.38)
2253 This contains an integer that can be used in the type field
2254 of perf_event_attr to indicate that you wish to use this
2255 PMU.
2256
2257 /sys/bus/event_source/devices/cpu/rdpmc (since Linux 3.4)
2258 If this file is 1, then direct user-space access to the per‐
2259 formance counter registers is allowed via the rdpmc instruc‐
2260 tion. This can be disabled by echoing 0 to the file.
2261
2262 As of Linux 4.0 the behavior has changed, so that 1 now
2263 means only allow access to processes with active perf
2264 events, with 2 indicating the old allow-anyone-access behav‐
2265 ior.
2266
2267 /sys/bus/event_source/devices/*/format/ (since Linux 3.4)
2268 This subdirectory contains information on the architecture-
2269 specific subfields available for programming the various
2270 config fields in the perf_event_attr struct.
2271
2272 The content of each file is the name of the config field,
2273 followed by a colon, followed by a series of integer bit
2274 ranges separated by commas. For example, the file event may
2275 contain the value config1:1,6-10,44 which indicates that
event is an attribute that occupies bits 1, 6–10, and 44 of
2277 perf_event_attr::config1.
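
A minimal sketch of parsing the contents of one format file into the name of the config field and a 64-bit mask of the bits that the attribute occupies. The function name and its error convention are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse a spec such as "config1:1,6-10,44".
   Copies the config field name into field and sets *mask.
   Returns 0 on success, -1 on malformed input. */
static int
parse_format(const char *spec, char *field, size_t field_len,
             uint64_t *mask)
{
    const char *colon = strchr(spec, ':');
    const char *p;
    size_t name_len;

    if (colon == NULL)
        return -1;
    name_len = (size_t) (colon - spec);
    if (name_len == 0 || name_len >= field_len)
        return -1;
    memcpy(field, spec, name_len);
    field[name_len] = '\0';

    *mask = 0;
    p = colon + 1;
    while (*p != '\0' && *p != '\n') {
        char *end;
        unsigned long lo, hi;

        lo = strtoul(p, &end, 10);
        if (end == p)
            return -1;
        hi = lo;
        p = end;
        if (*p == '-') {                /* a range such as 6-10 */
            p++;
            hi = strtoul(p, &end, 10);
            if (end == p)
                return -1;
            p = end;
        }
        if (lo > hi || hi > 63)
            return -1;
        for (unsigned long b = lo; b <= hi; b++)
            *mask |= UINT64_C(1) << b;
        if (*p == ',')
            p++;
        else if (*p != '\0' && *p != '\n')
            return -1;
    }
    return 0;
}
```

For the example above, the field is "config1" and the mask has bits 1, 6 through 10, and 44 set.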
2278
2279 /sys/bus/event_source/devices/*/events/ (since Linux 3.4)
2280 This subdirectory contains files with predefined events.
2281 The contents are strings describing the event settings ex‐
2282 pressed in terms of the fields found in the previously men‐
2283 tioned ./format/ directory. These are not necessarily com‐
2284 plete lists of all events supported by a PMU, but usually a
2285 subset of events deemed useful or interesting.
2286
2287 The content of each file is a list of attribute names sepa‐
2288 rated by commas. Each entry has an optional value (either
2289 hex or decimal). If no value is specified, then it is as‐
2290 sumed to be a single-bit field with a value of 1. An exam‐
2291 ple entry may look like this: event=0x2,inv,ldlat=3.
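
A minimal sketch of looking up one attribute in such an event string. The function name is hypothetical. Values are parsed with strtoull(3) in base 0, so a value with a leading 0 would be read as octal, which is an assumption of this sketch rather than a documented property of the files.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: find attribute name in a string such as
   "event=0x2,inv,ldlat=3" and store its value in *value.
   An attribute without "=value" is treated as the value 1.
   Returns 0 if found, -1 otherwise. */
static int
event_term(const char *spec, const char *name, uint64_t *value)
{
    size_t name_len = strlen(name);
    const char *p = spec;

    while (*p != '\0') {
        size_t term_len = strcspn(p, ",\n");
        const char *eq = memchr(p, '=', term_len);
        size_t key_len = eq ? (size_t) (eq - p) : term_len;

        if (key_len == name_len && memcmp(p, name, name_len) == 0) {
            *value = eq ? strtoull(eq + 1, NULL, 0) : 1;
            return 0;
        }
        p += term_len;
        if (*p != '\0')
            p++;                /* skip the ',' or newline */
    }
    return -1;
}
```

Each value found this way would then be placed into the bits given by the matching file in the format directory.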
2292
2293 /sys/bus/event_source/devices/*/uevent
2294 This file is the standard kernel device interface for in‐
2295 jecting hotplug events.
2296
2297 /sys/bus/event_source/devices/*/cpumask (since Linux 3.7)
2298 The cpumask file contains a comma-separated list of integers
2299 that indicate a representative CPU number for each socket
2300 (package) on the motherboard. This is needed when setting
2301 up uncore or northbridge events, as those PMUs present
2302 socket-wide events.
2303
2305 perf_event_open() returns the new file descriptor, or -1 if an error
2306 occurred (in which case, errno is set appropriately).
2307
2309 The errors returned by perf_event_open() can be inconsistent, and may
2310 vary across processor architectures and performance monitoring units.
2311
2312 E2BIG Returned if the perf_event_attr size value is too small (smaller
2313 than PERF_ATTR_SIZE_VER0), too big (larger than the page size),
2314 or larger than the kernel supports and the extra bytes are not
2315 zero. When E2BIG is returned, the perf_event_attr size field is
2316 overwritten by the kernel to be the size of the structure it was
2317 expecting.
2318
2319 EACCES Returned when the requested event requires CAP_PERFMON (since
2320 Linux 5.8) or CAP_SYS_ADMIN permissions (or a more permissive
2321 perf_event paranoid setting). Some common cases where an un‐
2322 privileged process may encounter this error: attaching to a
2323 process owned by a different user; monitoring all processes on a
2324 given CPU (i.e., specifying the pid argument as -1); and not
2325 setting exclude_kernel when the paranoid setting requires it.
2326
2327 EBADF Returned if the group_fd file descriptor is not valid, or, if
2328 PERF_FLAG_PID_CGROUP is set, the cgroup file descriptor in pid
2329 is not valid.
2330
2331 EBUSY (since Linux 4.1)
2332 Returned if another event already has exclusive access to the
2333 PMU.
2334
2335 EFAULT Returned if the attr pointer points at an invalid memory ad‐
2336 dress.
2337
EINVAL Returned if the specified event is invalid. There are many
possible reasons for this. A non-exhaustive list: sample_freq is
2340 higher than the maximum setting; the cpu to monitor does not ex‐
2341 ist; read_format is out of range; sample_type is out of range;
2342 the flags value is out of range; exclusive or pinned set and the
2343 event is not a group leader; the event config values are out of
2344 range or set reserved bits; the generic event selected is not
2345 supported; or there is not enough room to add the selected
2346 event.
2347
2348 EINTR Returned when trying to mix perf and ftrace handling for a up‐
2349 robe.
2350
2351 EMFILE Each opened event uses one file descriptor. If a large number
2352 of events are opened, the per-process limit on the number of
2353 open file descriptors will be reached, and no more events can be
2354 created.
2355
2356 ENODEV Returned when the event involves a feature not supported by the
2357 current CPU.
2358
2359 ENOENT Returned if the type setting is not valid. This error is also
2360 returned for some unsupported generic events.
2361
2362 ENOSPC Prior to Linux 3.3, if there was not enough room for the event,
2363 ENOSPC was returned. In Linux 3.3, this was changed to EINVAL.
2364 ENOSPC is still returned if you try to add more breakpoint
2365 events than supported by the hardware.
2366
2367 ENOSYS Returned if PERF_SAMPLE_STACK_USER is set in sample_type and it
2368 is not supported by hardware.
2369
2370 EOPNOTSUPP
2371 Returned if an event requiring a specific hardware feature is
2372 requested but there is no hardware support. This includes re‐
2373 questing low-skid events if not supported, branch tracing if it
2374 is not available, sampling if no PMU interrupt is available, and
2375 branch stacks for software events.
2376
2377 EOVERFLOW (since Linux 4.8)
2378 Returned if PERF_SAMPLE_CALLCHAIN is requested and sam‐
2379 ple_max_stack is larger than the maximum specified in
2380 /proc/sys/kernel/perf_event_max_stack.
2381
2382 EPERM Returned on many (but not all) architectures when an unsupported
2383 exclude_hv, exclude_idle, exclude_user, or exclude_kernel set‐
2384 ting is specified.
2385
2386 It can also happen, as with EACCES, when the requested event re‐
2387 quires CAP_PERFMON (since Linux 5.8) or CAP_SYS_ADMIN permis‐
2388 sions (or a more permissive perf_event paranoid setting). This
2389 includes setting a breakpoint on a kernel address, and (since
2390 Linux 3.13) setting a kernel function-trace tracepoint.
2391
2392 ESRCH Returned if attempting to attach to a process that does not ex‐
2393 ist.
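
Since several of these errors have more than one cause, a caller may want to print a hint alongside the errno value. The helper below is a minimal sketch with a hypothetical name. Its messages only paraphrase a few of the entries above and are not exhaustive.

```c
#include <errno.h>

/* Hypothetical helper: map an errno value from a failed
   perf_event_open() call to a short diagnostic hint. */
static const char *
perf_open_hint(int err)
{
    switch (err) {
    case EACCES:
    case EPERM:
        return "check CAP_PERFMON or CAP_SYS_ADMIN, the "
               "perf_event_paranoid setting, and the exclude_* bits";
    case E2BIG:
        return "attr.size rejected; the kernel has written the "
               "size it expected into attr.size";
    case EMFILE:
        return "per-process limit on open file descriptors reached";
    case ENOENT:
        return "type is not valid, or the generic event is unsupported";
    case ESRCH:
        return "the target process does not exist";
    default:
        return "see the list of errors for perf_event_open()";
    }
}
```

A caller would typically save errno immediately after the failed call and pass the saved value to this helper.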
2394
2396 perf_event_open() was introduced in Linux 2.6.31 but was called
2397 perf_counter_open(). It was renamed in Linux 2.6.32.
2398
The perf_event_open() system call is Linux-specific and should not be
used in programs intended to be portable.
2402
2404 Glibc does not provide a wrapper for this system call; call it using
2405 syscall(2). See the example below.
2406
2407 The official way of knowing if perf_event_open() support is enabled is
2408 checking for the existence of the file /proc/sys/ker‐
2409 nel/perf_event_paranoid.
2410
The CAP_PERFMON capability (since Linux 5.8) provides a secure approach
to performance monitoring and observability operations in a system,
according to the principle of least privilege (POSIX IEEE 1003.1e).
Accessing system performance monitoring and observability operations
using CAP_PERFMON rather than the much more powerful CAP_SYS_ADMIN
reduces the chances of misusing credentials and makes operations more
secure. CAP_SYS_ADMIN usage for secure system performance monitoring and
observability is discouraged in favor of the CAP_PERFMON capability.
2419
2421 The F_SETOWN_EX option to fcntl(2) is needed to properly get overflow
2422 signals in threads. This was introduced in Linux 2.6.32.
2423
2424 Prior to Linux 2.6.33 (at least for x86), the kernel did not check if
2425 events could be scheduled together until read time. The same happens
on all known kernels if the NMI watchdog is enabled. This means that,
to find out whether a given set of events works, you have to call
perf_event_open(), start the events, and then read them before you know
for sure that you can get valid measurements.
2429
2430 Prior to Linux 2.6.34, event constraints were not enforced by the ker‐
2431 nel. In that case, some events would silently return "0" if the kernel
2432 scheduled them in an improper counter slot.
2433
2434 Prior to Linux 2.6.34, there was a bug when multiplexing where the
2435 wrong results could be returned.
2436
Kernels from Linux 2.6.35 to Linux 2.6.39 can quickly crash
if "inherit" is enabled and many threads are started.
2439
2440 Prior to Linux 2.6.35, PERF_FORMAT_GROUP did not work with attached
2441 processes.
2442
2443 There is a bug in the kernel code between Linux 2.6.36 and Linux 3.0
2444 that ignores the "watermark" field and acts as if a wakeup_event was
2445 chosen if the union has a nonzero value in it.
2446
2447 From Linux 2.6.31 to Linux 3.4, the PERF_IOC_FLAG_GROUP ioctl argument
2448 was broken and would repeatedly operate on the event specified rather
2449 than iterating across all sibling events in a group.
2450
2451 From Linux 3.4 to Linux 3.11, the mmap cap_usr_rdpmc and cap_usr_time
2452 bits mapped to the same location. Code should migrate to the new
2453 cap_user_rdpmc and cap_user_time fields instead.
2454
2455 Always double-check your results! Various generalized events have had
2456 wrong values. For example, retired branches measured the wrong thing
2457 on AMD machines until Linux 2.6.35.
2458
2460 The following is a short example that measures the total instruction
2461 count of a call to printf(3).
2462
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

/* Thin wrapper: glibc provides no perf_event_open() function. */
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, hw_event, pid, cpu,
                   group_fd, flags);
}

int
main(void)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    /* Count user-space instructions only; start disabled. */
    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;

    /* pid == 0 and cpu == -1: the calling thread on any CPU. */
    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        perror("perf_event_open");
        fprintf(stderr, "Error opening leader %llx\n",
                (unsigned long long) pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    printf("Measuring instruction count for this printf\n");

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &count, sizeof(count)) != sizeof(count)) {
        perror("read");
        close(fd);
        exit(EXIT_FAILURE);
    }

    printf("Used %lld instructions\n", count);

    close(fd);
    exit(EXIT_SUCCESS);
}
2515
2517 perf(1), fcntl(2), mmap(2), open(2), prctl(2), read(2)
2518
2519 Documentation/admin-guide/perf-security.rst in the kernel source tree
2520
2522 This page is part of release 5.10 of the Linux man-pages project. A
2523 description of the project, information about reporting bugs, and the
2524 latest version of this page, can be found at
2525 https://www.kernel.org/doc/man-pages/.
2526
2527
2528
2529Linux 2020-11-01 PERF_EVENT_OPEN(2)