PERF_EVENT_OPEN(2)        Linux Programmer's Manual        PERF_EVENT_OPEN(2)

NAME
perf_event_open - set up performance monitoring

SYNOPSIS
#include <linux/perf_event.h>    /* Definition of PERF_* constants */
#include <linux/hw_breakpoint.h> /* Definition of HW_* constants */
#include <sys/syscall.h>         /* Definition of SYS_* constants */
#include <unistd.h>

int syscall(SYS_perf_event_open, struct perf_event_attr *attr,
            pid_t pid, int cpu, int group_fd, unsigned long flags);
16
17 Note: glibc provides no wrapper for perf_event_open(), necessitating
18 the use of syscall(2).
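
Because there is no glibc wrapper, programs commonly define a small helper around syscall(2). The following is a minimal sketch; the name perf_open is an assumption chosen for illustration and is not provided by any library.

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (the name is an assumption): forwards its
 * arguments unchanged to the raw system call.  Returns the new file
 * descriptor, or -1 with errno set on failure.
 */
static long
perf_open(struct perf_event_attr *attr, pid_t pid, int cpu,
          int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```

A typed helper of this kind lets the compiler check argument types, which the variadic syscall(2) interface cannot do.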

DESCRIPTION
Given a list of parameters, perf_event_open() returns a file descriptor,
for use in subsequent system calls (read(2), mmap(2), prctl(2),
fcntl(2), etc.).
24
25 A call to perf_event_open() creates a file descriptor that allows mea‐
26 suring performance information. Each file descriptor corresponds to
27 one event that is measured; these can be grouped together to measure
28 multiple events simultaneously.
29
30 Events can be enabled and disabled in two ways: via ioctl(2) and via
31 prctl(2). When an event is disabled it does not count or generate
32 overflows but does continue to exist and maintain its count value.
33
34 Events come in two flavors: counting and sampled. A counting event is
35 one that is used for counting the aggregate number of events that oc‐
36 cur. In general, counting event results are gathered with a read(2)
37 call. A sampling event periodically writes measurements to a buffer
38 that can then be accessed via mmap(2).
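
The counting workflow described above (create the event disabled, enable it with ioctl(2), then collect the result with read(2)) might look like the following sketch. The helper name count_task_clock and the choice of the PERF_COUNT_SW_TASK_CLOCK software event are assumptions made for illustration; the PERF_EVENT_IOC_* requests come from <linux/perf_event.h>. The call can fail, for example when the perf_event_paranoid setting forbids it, so the helper reports failure instead of assuming success.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: counts one software event around a short busy
 * loop in the calling thread.  Returns 0 and stores the count on
 * success, or -1 if the event could not be opened or read.
 */
static int
count_task_clock(uint64_t *count)
{
    struct perf_event_attr attr;
    volatile unsigned long sink = 0;
    int fd, rc = -1;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_TASK_CLOCK;
    attr.disabled = 1;              /* start disabled; enable below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid == 0, cpu == -1: the calling thread on any CPU */
    fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd == -1)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (unsigned long i = 0; i < 1000000; i++)
        sink += i;                  /* the work being measured */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* With read_format == 0, read(2) returns a single 64-bit count. */
    if (read(fd, count, sizeof(*count)) == (ssize_t) sizeof(*count))
        rc = 0;

    close(fd);
    return rc;
}
```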
39
40 Arguments
41 The pid and cpu arguments allow specifying which process and CPU to
42 monitor:
43
44 pid == 0 and cpu == -1
45 This measures the calling process/thread on any CPU.
46
47 pid == 0 and cpu >= 0
48 This measures the calling process/thread only when running on
49 the specified CPU.
50
51 pid > 0 and cpu == -1
52 This measures the specified process/thread on any CPU.
53
54 pid > 0 and cpu >= 0
55 This measures the specified process/thread only when running on
56 the specified CPU.
57
58 pid == -1 and cpu >= 0
59 This measures all processes/threads on the specified CPU. This
60 requires CAP_PERFMON (since Linux 5.8) or CAP_SYS_ADMIN capabil‐
61 ity or a /proc/sys/kernel/perf_event_paranoid value of less than
62 1.
63
64 pid == -1 and cpu == -1
65 This setting is invalid and will return an error.
66
67 When pid is greater than zero, permission to perform this system call
68 is governed by CAP_PERFMON (since Linux 5.9) and a ptrace access mode
69 PTRACE_MODE_READ_REALCREDS check on older Linux versions; see
70 ptrace(2).
71
72 The group_fd argument allows event groups to be created. An event
73 group has one event which is the group leader. The leader is created
74 first, with group_fd = -1. The rest of the group members are created
75 with subsequent perf_event_open() calls with group_fd being set to the
76 file descriptor of the group leader. (A single event on its own is
77 created with group_fd = -1 and is considered to be a group with only 1
78 member.) An event group is scheduled onto the CPU as a unit: it will
79 be put onto the CPU only if all of the events in the group can be put
80 onto the CPU. This means that the values of the member events can be
81 meaningfully compared—added, divided (to get ratios), and so on—with
82 each other, since they have counted events for the same set of executed
83 instructions.
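
As an illustration of the group_fd rules above, the following sketch opens a leader with group_fd set to -1 and then one member that passes the leader's file descriptor. The helper names and the two software events are assumptions chosen so that the example does not depend on particular hardware counters; creating the leader disabled follows the common convention of enabling the whole group through the leader later.

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: open one software event for the calling thread. */
static int
open_sw_event(unsigned long long config, int disabled, int group_fd)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = disabled;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid == 0, cpu == -1: the calling thread on any CPU */
    return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0UL);
}

/*
 * Hypothetical helper: on success returns 0 with fds[0] holding the
 * group leader and fds[1] holding the member; returns -1 on failure.
 */
static int
open_group_pair(int fds[2])
{
    /* The leader is created first, with group_fd == -1. */
    fds[0] = open_sw_event(PERF_COUNT_SW_TASK_CLOCK, 1, -1);
    if (fds[0] == -1)
        return -1;

    /* The member names the leader through group_fd. */
    fds[1] = open_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 0, fds[0]);
    if (fds[1] == -1) {
        close(fds[0]);
        return -1;
    }
    return 0;
}
```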
84
85 The flags argument is formed by ORing together zero or more of the fol‐
86 lowing values:
87
88 PERF_FLAG_FD_CLOEXEC (since Linux 3.14)
89 This flag enables the close-on-exec flag for the created event
90 file descriptor, so that the file descriptor is automatically
closed on execve(2). Setting the close-on-exec flag at creation
time, rather than later with fcntl(2), avoids potential race
conditions where the calling thread invokes perf_event_open() and
fcntl(2) at the same time as another thread calls fork(2) then
execve(2).
96
97 PERF_FLAG_FD_NO_GROUP
98 This flag tells the event to ignore the group_fd parameter ex‐
99 cept for the purpose of setting up output redirection using the
100 PERF_FLAG_FD_OUTPUT flag.
101
102 PERF_FLAG_FD_OUTPUT (broken since Linux 2.6.35)
103 This flag re-routes the event's sampled output to instead be in‐
104 cluded in the mmap buffer of the event specified by group_fd.
105
106 PERF_FLAG_PID_CGROUP (since Linux 2.6.39)
107 This flag activates per-container system-wide monitoring. A
108 container is an abstraction that isolates a set of resources for
109 finer-grained control (CPUs, memory, etc.). In this mode, the
110 event is measured only if the thread running on the monitored
111 CPU belongs to the designated container (cgroup). The cgroup is
112 identified by passing a file descriptor opened on its directory
113 in the cgroupfs filesystem. For instance, if the cgroup to mon‐
114 itor is called test, then a file descriptor opened on
115 /dev/cgroup/test (assuming cgroupfs is mounted on /dev/cgroup)
116 must be passed as the pid parameter. cgroup monitoring is
117 available only for system-wide events and may therefore require
118 extra permissions.
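
The cgroup mode can be sketched as follows: a file descriptor opened on the cgroup's directory is passed in place of pid, together with a specific CPU and PERF_FLAG_PID_CGROUP. The helper name open_cgroup_event and the choice of PERF_COUNT_SW_CPU_CLOCK are assumptions for illustration; the directory path is supplied by the caller, for instance /dev/cgroup/test in the example above.

```c
#include <fcntl.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: opens a system-wide event on the given CPU,
 * restricted to threads in the cgroup whose directory is cgroup_dir.
 * Returns the event file descriptor, or -1 on failure.
 */
static int
open_cgroup_event(const char *cgroup_dir, int cpu)
{
    struct perf_event_attr attr;
    int cgroup_fd, fd;

    cgroup_fd = open(cgroup_dir, O_RDONLY);
    if (cgroup_fd == -1)
        return -1;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_CPU_CLOCK;

    /* The cgroup descriptor takes the place of the pid argument. */
    fd = syscall(SYS_perf_event_open, &attr, cgroup_fd, cpu, -1,
                 (unsigned long) PERF_FLAG_PID_CGROUP);

    close(cgroup_fd);
    return fd;
}
```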
119
120 The perf_event_attr structure provides detailed configuration informa‐
121 tion for the event being created.
122
struct perf_event_attr {
    __u32 type;                 /* Type of event */
    __u32 size;                 /* Size of attribute structure */
    __u64 config;               /* Type-specific configuration */

    union {
        __u64 sample_period;    /* Period of sampling */
        __u64 sample_freq;      /* Frequency of sampling */
    };

    __u64 sample_type;  /* Specifies values included in sample */
    __u64 read_format;  /* Specifies values returned in read */

    __u64 disabled       : 1,   /* off by default */
          inherit        : 1,   /* children inherit it */
          pinned         : 1,   /* must always be on PMU */
          exclusive      : 1,   /* only group on PMU */
          exclude_user   : 1,   /* don't count user */
          exclude_kernel : 1,   /* don't count kernel */
          exclude_hv     : 1,   /* don't count hypervisor */
          exclude_idle   : 1,   /* don't count when idle */
          mmap           : 1,   /* include mmap data */
          comm           : 1,   /* include comm data */
          freq           : 1,   /* use freq, not period */
          inherit_stat   : 1,   /* per task counts */
          enable_on_exec : 1,   /* next exec enables */
          task           : 1,   /* trace fork/exit */
          watermark      : 1,   /* wakeup_watermark */
          precise_ip     : 2,   /* skid constraint */
          mmap_data      : 1,   /* non-exec mmap data */
          sample_id_all  : 1,   /* sample_type all events */
          exclude_host   : 1,   /* don't count in host */
          exclude_guest  : 1,   /* don't count in guest */
          exclude_callchain_kernel : 1,
                                /* exclude kernel callchains */
          exclude_callchain_user   : 1,
                                /* exclude user callchains */
          mmap2          : 1,   /* include mmap with inode data */
          comm_exec      : 1,   /* flag comm events that are
                                   due to exec */
          use_clockid    : 1,   /* use clockid for time fields */
          context_switch : 1,   /* context switch data */
          write_backward : 1,   /* Write ring buffer from end
                                   to beginning */
          namespaces     : 1,   /* include namespaces data */
          ksymbol        : 1,   /* include ksymbol events */
          bpf_event      : 1,   /* include bpf events */
          aux_output     : 1,   /* generate AUX records
                                   instead of events */
          cgroup         : 1,   /* include cgroup events */
          text_poke      : 1,   /* include text poke events */

          __reserved_1   : 30;

    union {
        __u32 wakeup_events;    /* wakeup every n events */
        __u32 wakeup_watermark; /* bytes before wakeup */
    };

    __u32     bp_type;          /* breakpoint type */

    union {
        __u64 bp_addr;          /* breakpoint address */
        __u64 kprobe_func;      /* for perf_kprobe */
        __u64 uprobe_path;      /* for perf_uprobe */
        __u64 config1;          /* extension of config */
    };

    union {
        __u64 bp_len;           /* breakpoint length */
        __u64 kprobe_addr;      /* with kprobe_func == NULL */
        __u64 probe_offset;     /* for perf_[k,u]probe */
        __u64 config2;          /* extension of config1 */
    };
    __u64 branch_sample_type;   /* enum perf_branch_sample_type */
    __u64 sample_regs_user;     /* user regs to dump on samples */
    __u32 sample_stack_user;    /* size of stack to dump on
                                   samples */
    __s32 clockid;              /* clock to use for time fields */
    __u64 sample_regs_intr;     /* regs to dump on samples */
    __u32 aux_watermark;        /* aux bytes before wakeup */
    __u16 sample_max_stack;     /* max frames in callchain */
    __u16 __reserved_2;         /* align to u64 */

};
208
209 The fields of the perf_event_attr structure are described in more de‐
210 tail below:
211
212 type This field specifies the overall event type. It has one of the
213 following values:
214
215 PERF_TYPE_HARDWARE
216 This indicates one of the "generalized" hardware events
217 provided by the kernel. See the config field definition
218 for more details.
219
220 PERF_TYPE_SOFTWARE
221 This indicates one of the software-defined events pro‐
222 vided by the kernel (even if no hardware support is
223 available).
224
225 PERF_TYPE_TRACEPOINT
226 This indicates a tracepoint provided by the kernel trace‐
227 point infrastructure.
228
229 PERF_TYPE_HW_CACHE
230 This indicates a hardware cache event. This has a spe‐
231 cial encoding, described in the config field definition.
232
233 PERF_TYPE_RAW
234 This indicates a "raw" implementation-specific event in
235 the config field.
236
237 PERF_TYPE_BREAKPOINT (since Linux 2.6.33)
238 This indicates a hardware breakpoint as provided by the
239 CPU. Breakpoints can be read/write accesses to an ad‐
240 dress as well as execution of an instruction address.
241
242 dynamic PMU
243 Since Linux 2.6.38, perf_event_open() can support multi‐
244 ple PMUs. To enable this, a value exported by the kernel
245 can be used in the type field to indicate which PMU to
246 use. The value to use can be found in the sysfs filesys‐
247 tem: there is a subdirectory per PMU instance under
248 /sys/bus/event_source/devices. In each subdirectory
249 there is a type file whose content is an integer that can
250 be used in the type field. For instance,
251 /sys/bus/event_source/devices/cpu/type contains the value
252 for the core CPU PMU, which is usually 4.
253
254 kprobe and uprobe (since Linux 4.17)
These two dynamic PMUs create a kprobe/uprobe and attach it to
the file descriptor generated by perf_event_open(). The
kprobe/uprobe will be destroyed on the destruction of the file
descriptor. See fields kprobe_func, uprobe_path, kprobe_addr,
and probe_offset for more details.
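
For the dynamic PMU case described above, the value for the type field has to be read from sysfs at run time. A minimal sketch follows; the helper name read_pmu_type is an assumption, and the path is the /sys/bus/event_source/devices/<pmu>/type file mentioned above.

```c
#include <stdio.h>

/*
 * Hypothetical helper: returns the integer stored in
 * /sys/bus/event_source/devices/<pmu>/type, or -1 if the file cannot
 * be opened or parsed (for example, when the PMU does not exist).
 */
static int
read_pmu_type(const char *pmu)
{
    char path[256];
    FILE *fp;
    int type = -1;
    int n;

    n = snprintf(path, sizeof(path),
                 "/sys/bus/event_source/devices/%s/type", pmu);
    if (n < 0 || n >= (int) sizeof(path))
        return -1;

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    if (fscanf(fp, "%d", &type) != 1)
        type = -1;

    fclose(fp);
    return type;
}
```

The result of a call such as read_pmu_type("cpu") would then be stored in the type field; the same approach applies to the kprobe and uprobe PMUs.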
261
262 size The size of the perf_event_attr structure for forward/backward
263 compatibility. Set this using sizeof(struct perf_event_attr) to
264 allow the kernel to see the struct size at the time of compila‐
265 tion.
266
267 The related define PERF_ATTR_SIZE_VER0 is set to 64; this was
268 the size of the first published struct. PERF_ATTR_SIZE_VER1 is
269 72, corresponding to the addition of breakpoints in Linux
270 2.6.33. PERF_ATTR_SIZE_VER2 is 80 corresponding to the addition
271 of branch sampling in Linux 3.4. PERF_ATTR_SIZE_VER3 is 96 cor‐
272 responding to the addition of sample_regs_user and sam‐
273 ple_stack_user in Linux 3.7. PERF_ATTR_SIZE_VER4 is 104 corre‐
274 sponding to the addition of sample_regs_intr in Linux 3.19.
275 PERF_ATTR_SIZE_VER5 is 112 corresponding to the addition of
276 aux_watermark in Linux 4.1.
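
A common way to honor the size convention is to zero the whole structure and then record its compile-time size before filling in the remaining fields. The sketch below assumes a hypothetical helper named init_attr.

```c
#include <linux/perf_event.h>
#include <string.h>

/*
 * Hypothetical helper: clears every field (so that unknown or
 * reserved bits are zero) and records the size of the structure as
 * known at compile time.
 */
static void
init_attr(struct perf_event_attr *attr, __u32 type, __u64 config)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = type;
    attr->size = sizeof(*attr);
    attr->config = config;
}
```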
277
278 config This specifies which event you want, in conjunction with the
279 type field. The config1 and config2 fields are also taken into
280 account in cases where 64 bits is not enough to fully specify
the event. The encoding of these fields is event dependent.
282
283 There are various ways to set the config field that are depen‐
284 dent on the value of the previously described type field. What
285 follows are various possible settings for config separated out
286 by type.
287
288 If type is PERF_TYPE_HARDWARE, we are measuring one of the gen‐
289 eralized hardware CPU events. Not all of these are available on
290 all platforms. Set config to one of the following:
291
292 PERF_COUNT_HW_CPU_CYCLES
293 Total cycles. Be wary of what happens during CPU
294 frequency scaling.
295
296 PERF_COUNT_HW_INSTRUCTIONS
Retired instructions. Be careful: these can be affected
by various issues, most notably hardware interrupt counts.
300
301 PERF_COUNT_HW_CACHE_REFERENCES
302 Cache accesses. Usually this indicates Last Level
303 Cache accesses but this may vary depending on your
304 CPU. This may include prefetches and coherency mes‐
305 sages; again this depends on the design of your CPU.
306
307 PERF_COUNT_HW_CACHE_MISSES
308 Cache misses. Usually this indicates Last Level
309 Cache misses; this is intended to be used in con‐
310 junction with the PERF_COUNT_HW_CACHE_REFERENCES
311 event to calculate cache miss rates.
312
313 PERF_COUNT_HW_BRANCH_INSTRUCTIONS
314 Retired branch instructions. Prior to Linux 2.6.35,
315 this used the wrong event on AMD processors.
316
317 PERF_COUNT_HW_BRANCH_MISSES
318 Mispredicted branch instructions.
319
320 PERF_COUNT_HW_BUS_CYCLES
321 Bus cycles, which can be different from total cy‐
322 cles.
323
324 PERF_COUNT_HW_STALLED_CYCLES_FRONTEND (since Linux 3.0)
325 Stalled cycles during issue.
326
327 PERF_COUNT_HW_STALLED_CYCLES_BACKEND (since Linux 3.0)
328 Stalled cycles during retirement.
329
330 PERF_COUNT_HW_REF_CPU_CYCLES (since Linux 3.3)
331 Total cycles; not affected by CPU frequency scaling.
332
333 If type is PERF_TYPE_SOFTWARE, we are measuring software events
334 provided by the kernel. Set config to one of the following:
335
336 PERF_COUNT_SW_CPU_CLOCK
337 This reports the CPU clock, a high-resolution per-
338 CPU timer.
339
340 PERF_COUNT_SW_TASK_CLOCK
341 This reports a clock count specific to the task that
342 is running.
343
344 PERF_COUNT_SW_PAGE_FAULTS
345 This reports the number of page faults.
346
347 PERF_COUNT_SW_CONTEXT_SWITCHES
This counts context switches. Until Linux 2.6.34,
these were all reported as user-space events; after
that they are reported as happening in the kernel.
351
352 PERF_COUNT_SW_CPU_MIGRATIONS
353 This reports the number of times the process has mi‐
354 grated to a new CPU.
355
356 PERF_COUNT_SW_PAGE_FAULTS_MIN
357 This counts the number of minor page faults. These
358 did not require disk I/O to handle.
359
360 PERF_COUNT_SW_PAGE_FAULTS_MAJ
361 This counts the number of major page faults. These
362 required disk I/O to handle.
363
364 PERF_COUNT_SW_ALIGNMENT_FAULTS (since Linux 2.6.33)
365 This counts the number of alignment faults. These
366 happen when unaligned memory accesses happen; the
367 kernel can handle these but it reduces performance.
368 This happens only on some architectures (never on
369 x86).
370
371 PERF_COUNT_SW_EMULATION_FAULTS (since Linux 2.6.33)
372 This counts the number of emulation faults. The
373 kernel sometimes traps on unimplemented instructions
374 and emulates them for user space. This can nega‐
375 tively impact performance.
376
377 PERF_COUNT_SW_DUMMY (since Linux 3.12)
378 This is a placeholder event that counts nothing.
379 Informational sample record types such as mmap or
380 comm must be associated with an active event. This
381 dummy event allows gathering such records without
382 requiring a counting event.
383
384 If type is PERF_TYPE_TRACEPOINT, then we are measuring kernel
385 tracepoints. The value to use in config can be obtained from
386 under debugfs tracing/events/*/*/id if ftrace is enabled in the
387 kernel.
388
389 If type is PERF_TYPE_HW_CACHE, then we are measuring a hardware
390 CPU cache event. To calculate the appropriate config value, use
391 the following equation:
392
    config = (perf_hw_cache_id) |
             (perf_hw_cache_op_id << 8) |
             (perf_hw_cache_op_result_id << 16);
396
397 where perf_hw_cache_id is one of:
398
399 PERF_COUNT_HW_CACHE_L1D
400 for measuring Level 1 Data Cache
401
402 PERF_COUNT_HW_CACHE_L1I
403 for measuring Level 1 Instruction Cache
404
405 PERF_COUNT_HW_CACHE_LL
406 for measuring Last-Level Cache
407
408 PERF_COUNT_HW_CACHE_DTLB
409 for measuring the Data TLB
410
411 PERF_COUNT_HW_CACHE_ITLB
412 for measuring the Instruction TLB
413
414 PERF_COUNT_HW_CACHE_BPU
415 for measuring the branch prediction unit
416
417 PERF_COUNT_HW_CACHE_NODE (since Linux 3.1)
418 for measuring local memory accesses
419
420 and perf_hw_cache_op_id is one of:
421
422 PERF_COUNT_HW_CACHE_OP_READ
423 for read accesses
424
425 PERF_COUNT_HW_CACHE_OP_WRITE
426 for write accesses
427
428 PERF_COUNT_HW_CACHE_OP_PREFETCH
429 for prefetch accesses
430
431 and perf_hw_cache_op_result_id is one of:
432
433 PERF_COUNT_HW_CACHE_RESULT_ACCESS
434 to measure accesses
435
436 PERF_COUNT_HW_CACHE_RESULT_MISS
437 to measure misses
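
The PERF_TYPE_HW_CACHE encoding given above can be wrapped in a small function. The helper name hw_cache_config is an assumption; the shifts are the ones from the equation, and the constants are those defined in <linux/perf_event.h>.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/*
 * Hypothetical helper: packs the cache ID into bits 0-7, the
 * operation ID into bits 8-15, and the result ID into bits 16-23.
 */
static uint64_t
hw_cache_config(unsigned int cache_id, unsigned int op_id,
                unsigned int result_id)
{
    return (uint64_t) cache_id |
           ((uint64_t) op_id << 8) |
           ((uint64_t) result_id << 16);
}
```

For example, Level 1 data cache read misses would be requested with hw_cache_config(PERF_COUNT_HW_CACHE_L1D, PERF_COUNT_HW_CACHE_OP_READ, PERF_COUNT_HW_CACHE_RESULT_MISS). Not every combination is supported on every CPU.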
438
439 If type is PERF_TYPE_RAW, then a custom "raw" config value is
440 needed. Most CPUs support events that are not covered by the
441 "generalized" events. These are implementation defined; see
442 your CPU manual (for example the Intel Volume 3B documentation
443 or the AMD BIOS and Kernel Developer Guide). The libpfm4 li‐
444 brary can be used to translate from the name in the architec‐
445 tural manuals to the raw hex value perf_event_open() expects in
446 this field.
447
448 If type is PERF_TYPE_BREAKPOINT, then leave config set to zero.
449 Its parameters are set in other places.
450
451 If type is kprobe or uprobe, set retprobe (bit 0 of config, see
452 /sys/bus/event_source/devices/[k,u]probe/format/retprobe) for
453 kretprobe/uretprobe. See fields kprobe_func, uprobe_path,
454 kprobe_addr, and probe_offset for more details.
455
456 kprobe_func, uprobe_path, kprobe_addr, and probe_offset
457 These fields describe the kprobe/uprobe for dynamic PMUs kprobe
458 and uprobe. For kprobe: use kprobe_func and probe_offset, or
459 use kprobe_addr and leave kprobe_func as NULL. For uprobe: use
460 uprobe_path and probe_offset.
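
The by-name kprobe variant might be set up as in the following sketch. The helper name fill_kprobe_attr is an assumption, as is the way the PMU type value is obtained (it would normally be read from /sys/bus/event_source/devices/kprobe/type). The 64-bit kprobe_func field carries a pointer to a NUL-terminated function name, and the retprobe flag is taken to be bit 0 of config, as described under the config field.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: describes a kprobe (or kretprobe when
 * is_return is nonzero) at func + offset.  kprobe_pmu_type is the
 * value read from the kprobe PMU's sysfs type file.  The string
 * pointed to by func must remain valid until perf_event_open()
 * has been called.
 */
static void
fill_kprobe_attr(struct perf_event_attr *attr, __u32 kprobe_pmu_type,
                 const char *func, __u64 offset, int is_return)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = kprobe_pmu_type;
    attr->size = sizeof(*attr);
    attr->config = is_return ? 1 : 0;       /* retprobe: bit 0 */
    attr->kprobe_func = (__u64) (uintptr_t) func;
    attr->probe_offset = offset;
}
```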
461
462 sample_period, sample_freq
463 A "sampling" event is one that generates an overflow notifica‐
464 tion every N events, where N is given by sample_period. A sam‐
465 pling event has sample_period > 0. When an overflow occurs, re‐
466 quested data is recorded in the mmap buffer. The sample_type
467 field controls what data is recorded on each overflow.
468
469 sample_freq can be used if you wish to use frequency rather than
470 period. In this case, you set the freq flag. The kernel will
adjust the sampling period to try to achieve the desired rate.
472 The rate of adjustment is a timer tick.
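
Because sample_period and sample_freq share one union, the freq bit has to agree with whichever member is written. The following sketch, with a hypothetical helper named set_sampling, keeps the two in step.

```c
#include <linux/perf_event.h>

/*
 * Hypothetical helper: if use_freq is nonzero, request roughly
 * `value` samples per second; otherwise request one sample every
 * `value` events.
 */
static void
set_sampling(struct perf_event_attr *attr, int use_freq, __u64 value)
{
    if (use_freq) {
        attr->freq = 1;
        attr->sample_freq = value;
    } else {
        attr->freq = 0;
        attr->sample_period = value;
    }
}
```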
473
474 sample_type
475 The various bits in this field specify which values to include
476 in the sample. They will be recorded in a ring-buffer, which is
available to user space using mmap(2). The order in which the
values are saved in the sample is documented in the MMAP Layout
subsection below; it is not the enum perf_event_sample_format
order.
481
482 PERF_SAMPLE_IP
483 Records instruction pointer.
484
485 PERF_SAMPLE_TID
486 Records the process and thread IDs.
487
488 PERF_SAMPLE_TIME
489 Records a timestamp.
490
491 PERF_SAMPLE_ADDR
492 Records an address, if applicable.
493
494 PERF_SAMPLE_READ
Records counter values for all events in a group, not just
496 the group leader.
497
498 PERF_SAMPLE_CALLCHAIN
499 Records the callchain (stack backtrace).
500
501 PERF_SAMPLE_ID
502 Records a unique ID for the opened event's group leader.
503
504 PERF_SAMPLE_CPU
505 Records CPU number.
506
507 PERF_SAMPLE_PERIOD
508 Records the current sampling period.
509
510 PERF_SAMPLE_STREAM_ID
Records a unique ID for the opened event. Unlike
PERF_SAMPLE_ID, the ID of the event itself is returned, not
that of the group leader. This ID is the same as the one
returned by PERF_FORMAT_ID.
515
516 PERF_SAMPLE_RAW
517 Records additional data, if applicable. Usually returned
518 by tracepoint events.
519
520 PERF_SAMPLE_BRANCH_STACK (since Linux 3.4)
521 This provides a record of recent branches, as provided by
522 CPU branch sampling hardware (such as Intel Last Branch
523 Record). Not all hardware supports this feature.
524
525 See the branch_sample_type field for how to filter which
526 branches are reported.
527
528 PERF_SAMPLE_REGS_USER (since Linux 3.7)
529 Records the current user-level CPU register state (the
530 values in the process before the kernel was called).
531
532 PERF_SAMPLE_STACK_USER (since Linux 3.7)
533 Records the user level stack, allowing stack unwinding.
534
535 PERF_SAMPLE_WEIGHT (since Linux 3.10)
536 Records a hardware provided weight value that expresses
537 how costly the sampled event was. This allows the hard‐
538 ware to highlight expensive events in a profile.
539
540 PERF_SAMPLE_DATA_SRC (since Linux 3.10)
541 Records the data source: where in the memory hierarchy
542 the data associated with the sampled instruction came
543 from. This is available only if the underlying hardware
544 supports this feature.
545
546 PERF_SAMPLE_IDENTIFIER (since Linux 3.12)
547 Places the SAMPLE_ID value in a fixed position in the
548 record, either at the beginning (for sample events) or at
549 the end (if a non-sample event).
550
551 This was necessary because a sample stream may have
552 records from various different event sources with differ‐
553 ent sample_type settings. Parsing the event stream prop‐
554 erly was not possible because the format of the record
555 was needed to find SAMPLE_ID, but the format could not be
556 found without knowing what event the sample belonged to
557 (causing a circular dependency).
558
559 The PERF_SAMPLE_IDENTIFIER setting makes the event stream
560 always parsable by putting SAMPLE_ID in a fixed location,
561 even though it means having duplicate SAMPLE_ID values in
562 records.
563
564 PERF_SAMPLE_TRANSACTION (since Linux 3.13)
565 Records reasons for transactional memory abort events
566 (for example, from Intel TSX transactional memory sup‐
567 port).
568
569 The precise_ip setting must be greater than 0 and a
570 transactional memory abort event must be measured or no
571 values will be recorded. Also note that some perf_event
572 measurements, such as sampled cycle counting, may cause
573 extraneous aborts (by causing an interrupt during a
574 transaction).
575
576 PERF_SAMPLE_REGS_INTR (since Linux 3.19)
Records a subset of the current CPU register state as
specified by sample_regs_intr. Unlike PERF_SAMPLE_REGS_USER,
the register values will return kernel register state if the
overflow happened while kernel code is running. If the CPU
supports hardware sampling of register state (e.g., PEBS on
Intel x86) and precise_ip is set higher than zero, then the
register values returned are those captured by hardware at
the time of the sampled instruction's retirement.
586
587 PERF_SAMPLE_PHYS_ADDR (since Linux 4.13)
Records the physical address of the data, analogous to the
address recorded by PERF_SAMPLE_ADDR.
590
591 PERF_SAMPLE_CGROUP (since Linux 5.7)
592 Records (perf_event) cgroup ID of the process. This cor‐
593 responds to the id field in the PERF_RECORD_CGROUP event.
594
595 read_format
596 This field specifies the format of the data returned by read(2)
597 on a perf_event_open() file descriptor.
598
599 PERF_FORMAT_TOTAL_TIME_ENABLED
600 Adds the 64-bit time_enabled field. This can be used to
601 calculate estimated totals if the PMU is overcommitted
602 and multiplexing is happening.
603
604 PERF_FORMAT_TOTAL_TIME_RUNNING
605 Adds the 64-bit time_running field. This can be used to
606 calculate estimated totals if the PMU is overcommitted
607 and multiplexing is happening.
608
609 PERF_FORMAT_ID
610 Adds a 64-bit unique value that corresponds to the event
611 group.
612
613 PERF_FORMAT_GROUP
614 Allows all counter values in an event group to be read
615 with one read.
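
The two time fields are typically used to scale a raw count when the event was multiplexed. The sketch below assumes that both PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING are set and that PERF_FORMAT_GROUP and PERF_FORMAT_ID are not, in which case read(2) returns three consecutive 64-bit values in the order shown; this layout is not spelled out in this section. The structure and function names are assumptions, and the linear extrapolation is only an estimate.

```c
#include <stdint.h>

/* Assumed layout of the buffer filled in by read(2); see above. */
struct read_value_times {
    uint64_t value;         /* raw count */
    uint64_t time_enabled;  /* time the event was enabled */
    uint64_t time_running;  /* time the event was actually on the PMU */
};

/*
 * Hypothetical helper: extrapolates the raw count to the whole
 * enabled period.  Returns 0 if the event never ran.
 */
static uint64_t
scale_count(const struct read_value_times *r)
{
    if (r->time_running == 0)
        return 0;

    return (uint64_t) ((double) r->value *
                       (double) r->time_enabled /
                       (double) r->time_running);
}
```

When time_enabled equals time_running, no multiplexing took place and the raw count is returned unchanged.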
616
617 disabled
618 The disabled bit specifies whether the counter starts out dis‐
619 abled or enabled. If disabled, the event can later be enabled
620 by ioctl(2), prctl(2), or enable_on_exec.
621
622 When creating an event group, typically the group leader is ini‐
623 tialized with disabled set to 1 and any child events are ini‐
624 tialized with disabled set to 0. Despite disabled being 0, the
625 child events will not start until the group leader is enabled.
626
627 inherit
628 The inherit bit specifies that this counter should count events
629 of child tasks as well as the task specified. This applies only
630 to new children, not to any existing children at the time the
631 counter is created (nor to any new children of existing chil‐
632 dren).
633
634 Inherit does not work for some combinations of read_format val‐
635 ues, such as PERF_FORMAT_GROUP.
636
637 pinned The pinned bit specifies that the counter should always be on
638 the CPU if at all possible. It applies only to hardware coun‐
639 ters and only to group leaders. If a pinned counter cannot be
640 put onto the CPU (e.g., because there are not enough hardware
641 counters or because of a conflict with some other event), then
642 the counter goes into an 'error' state, where reads return end-
643 of-file (i.e., read(2) returns 0) until the counter is subse‐
644 quently enabled or disabled.
645
646 exclusive
647 The exclusive bit specifies that when this counter's group is on
648 the CPU, it should be the only group using the CPU's counters.
649 In the future this may allow monitoring programs to support PMU
650 features that need to run alone so that they do not disrupt
651 other hardware counters.
652
653 Note that many unexpected situations may prevent events with the
654 exclusive bit set from ever running. This includes any users
655 running a system-wide measurement as well as any kernel use of
656 the performance counters (including the commonly enabled NMI
657 Watchdog Timer interface).
658
659 exclude_user
660 If this bit is set, the count excludes events that happen in
661 user space.
662
663 exclude_kernel
664 If this bit is set, the count excludes events that happen in
665 kernel space.
666
667 exclude_hv
668 If this bit is set, the count excludes events that happen in the
669 hypervisor. This is mainly for PMUs that have built-in support
670 for handling this (such as POWER). Extra support is needed for
671 handling hypervisor measurements on most machines.
672
673 exclude_idle
674 If set, don't count when the CPU is running the idle task.
675 While you can currently enable this for any event type, it is
676 ignored for all but software events.
677
678 mmap The mmap bit enables generation of PERF_RECORD_MMAP samples for
679 every mmap(2) call that has PROT_EXEC set. This allows tools to
680 notice new executable code being mapped into a program (dynamic
681 shared libraries for example) so that addresses can be mapped
682 back to the original code.
683
684 comm The comm bit enables tracking of process command name as modi‐
685 fied by the execve(2) and prctl(PR_SET_NAME) system calls as
686 well as writing to /proc/self/comm. If the comm_exec flag is
687 also successfully set (possible since Linux 3.16), then the misc
688 flag PERF_RECORD_MISC_COMM_EXEC can be used to differentiate the
689 execve(2) case from the others.
690
freq If this bit is set, then sample_freq, not sample_period, is
used when setting up the sampling interval.
693
694 inherit_stat
695 This bit enables saving of event counts on context switch for
696 inherited tasks. This is meaningful only if the inherit field
697 is set.
698
699 enable_on_exec
700 If this bit is set, a counter is automatically enabled after a
701 call to execve(2).
702
703 task If this bit is set, then fork/exit notifications are included in
704 the ring buffer.
705
706 watermark
707 If set, have an overflow notification happen when we cross the
708 wakeup_watermark boundary. Otherwise, overflow notifications
709 happen after wakeup_events samples.
710
711 precise_ip (since Linux 2.6.35)
712 This controls the amount of skid. Skid is how many instructions
713 execute between an event of interest happening and the kernel
714 being able to stop and record the event. Smaller skid is better
715 and allows more accurate reporting of which events correspond to
which instructions, but hardware is often limited in how small
this can be.
718
719 The possible values of this field are the following:
720
721 0 SAMPLE_IP can have arbitrary skid.
722
723 1 SAMPLE_IP must have constant skid.
724
725 2 SAMPLE_IP requested to have 0 skid.
726
727 3 SAMPLE_IP must have 0 skid. See also the description of
728 PERF_RECORD_MISC_EXACT_IP.
729
730 mmap_data (since Linux 2.6.36)
731 This is the counterpart of the mmap field. This enables genera‐
732 tion of PERF_RECORD_MMAP samples for mmap(2) calls that do not
733 have PROT_EXEC set (for example data and SysV shared memory).
734
735 sample_id_all (since Linux 2.6.38)
736 If set, then TID, TIME, ID, STREAM_ID, and CPU can additionally
737 be included in non-PERF_RECORD_SAMPLEs if the corresponding sam‐
738 ple_type is selected.
739
740 If PERF_SAMPLE_IDENTIFIER is specified, then an additional ID
741 value is included as the last value to ease parsing the record
742 stream. This may lead to the id value appearing twice.
743
744 The layout is described by this pseudo-structure:
745
    struct sample_id {
        { u32 pid, tid; }   /* if PERF_SAMPLE_TID set */
        { u64 time;     }   /* if PERF_SAMPLE_TIME set */
        { u64 id;       }   /* if PERF_SAMPLE_ID set */
        { u64 stream_id;}   /* if PERF_SAMPLE_STREAM_ID set */
        { u32 cpu, res; }   /* if PERF_SAMPLE_CPU set */
        { u64 id;       }   /* if PERF_SAMPLE_IDENTIFIER set */
    };
754
755 exclude_host (since Linux 3.2)
756 When conducting measurements that include processes running VM
757 instances (i.e., have executed a KVM_RUN ioctl(2)), only measure
758 events happening inside a guest instance. This is only meaning‐
759 ful outside the guests; this setting does not change counts
760 gathered inside of a guest. Currently, this functionality is
761 x86 only.
762
763 exclude_guest (since Linux 3.2)
764 When conducting measurements that include processes running VM
765 instances (i.e., have executed a KVM_RUN ioctl(2)), do not mea‐
766 sure events happening inside guest instances. This is only
767 meaningful outside the guests; this setting does not change
768 counts gathered inside of a guest. Currently, this functional‐
769 ity is x86 only.
770
771 exclude_callchain_kernel (since Linux 3.7)
772 Do not include kernel callchains.
773
774 exclude_callchain_user (since Linux 3.7)
775 Do not include user callchains.
776
777 mmap2 (since Linux 3.16)
778 Generate an extended executable mmap record that contains enough
779 additional information to uniquely identify shared mappings.
780 The mmap flag must also be set for this to work.
781
782 comm_exec (since Linux 3.16)
This is purely a feature-detection flag; it does not change
kernel behavior. If this flag can successfully be set, then, when
785 comm is enabled, the PERF_RECORD_MISC_COMM_EXEC flag will be set
786 in the misc field of a comm record header if the rename event
787 being reported was caused by a call to execve(2). This allows
788 tools to distinguish between the various types of process renam‐
789 ing.
790
791 use_clockid (since Linux 4.1)
792 This allows selecting which internal Linux clock to use when
793 generating timestamps via the clockid field. This can make it
794 easier to correlate perf sample times with timestamps generated
795 by other tools.
796
797 context_switch (since Linux 4.3)
798 This enables the generation of PERF_RECORD_SWITCH records when a
799 context switch occurs. It also enables the generation of
800 PERF_RECORD_SWITCH_CPU_WIDE records when sampling in CPU-wide
801 mode. This functionality is in addition to existing tracepoint
802 and software events for measuring context switches. The advan‐
803 tage of this method is that it will give full information even
804 with strict perf_event_paranoid settings.
805
806 write_backward (since Linux 4.6)
807 This causes the ring buffer to be written from the end to the
808 beginning. This is to support reading from overwritable ring
809 buffer.
810
811 namespaces (since Linux 4.11)
812 This enables the generation of PERF_RECORD_NAMESPACES records
813 when a task enters a new namespace. Each namespace has a combi‐
814 nation of device and inode numbers.
815
816 ksymbol (since Linux 5.0)
817 This enables the generation of PERF_RECORD_KSYMBOL records when
new kernel symbols are registered or unregistered. This helps
when analyzing dynamic kernel functions such as eBPF programs.
820
821 bpf_event (since Linux 5.0)
822 This enables the generation of PERF_RECORD_BPF_EVENT records
823 when an eBPF program is loaded or unloaded.
824
825 auxevent (since Linux 5.4)
826 This allows normal (non-AUX) events to generate data for AUX
827 events if the hardware supports it.
828
829 cgroup (since Linux 5.7)
830 This enables the generation of PERF_RECORD_CGROUP records when a
831 new cgroup is created (and activated).
832
833 text_poke (since Linux 5.8)
834 This enables the generation of PERF_RECORD_TEXT_POKE records
835 when there's a change to the kernel text (i.e., self-modifying
836 code).
837
838 wakeup_events, wakeup_watermark
839 This union sets how many samples (wakeup_events) or bytes
840 (wakeup_watermark) happen before an overflow notification hap‐
841 pens. Which one is used is selected by the watermark bit flag.
842
843 wakeup_events counts only PERF_RECORD_SAMPLE record types. To
844 receive overflow notification for all PERF_RECORD types choose
845 watermark and set wakeup_watermark to 1.
846
847 Prior to Linux 3.0, setting wakeup_events to 0 resulted in no
848 overflow notifications; more recent kernels treat 0 the same as
849 1.
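
The rule above (select watermark and set wakeup_watermark to 1 to be
notified for every PERF_RECORD type) can be sketched as follows. The
helper name is hypothetical and only the two relevant fields are
touched; the caller is assumed to fill in the rest of attr.

```c
#include <linux/perf_event.h>

/*
 * Hypothetical helper (not part of any API): ask for an overflow
 * notification as soon as any record, of any type, has been written.
 */
static void wake_on_every_record(struct perf_event_attr *attr)
{
    attr->watermark = 1;        /* interpret the union as wakeup_watermark */
    attr->wakeup_watermark = 1; /* notify once at least 1 byte is pending */
}
```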
850
851 bp_type (since Linux 2.6.33)
852 This chooses the breakpoint type. It is one of:
853
854 HW_BREAKPOINT_EMPTY
855 No breakpoint.
856
857 HW_BREAKPOINT_R
858 Count when we read the memory location.
859
860 HW_BREAKPOINT_W
861 Count when we write the memory location.
862
863 HW_BREAKPOINT_RW
864 Count when we read or write the memory location.
865
866 HW_BREAKPOINT_X
867 Count when we execute code at the memory location.
868
The values can be combined via a bitwise OR, but the combination
870 of HW_BREAKPOINT_R or HW_BREAKPOINT_W with HW_BREAKPOINT_X is
871 not allowed.
872
873 bp_addr (since Linux 2.6.33)
874 This is the address of the breakpoint. For execution break‐
875 points, this is the memory address of the instruction of inter‐
876 est; for read and write breakpoints, it is the memory address of
877 the memory location of interest.
878
879 config1 (since Linux 2.6.39)
880 config1 is used for setting events that need an extra register
881 or otherwise do not fit in the regular config field. Raw OFF‐
882 CORE_EVENTS on Nehalem/Westmere/SandyBridge use this field on
883 Linux 3.3 and later kernels.
884
885 bp_len (since Linux 2.6.33)
886 bp_len is the length of the breakpoint being measured if type is
887 PERF_TYPE_BREAKPOINT. Options are HW_BREAKPOINT_LEN_1,
888 HW_BREAKPOINT_LEN_2, HW_BREAKPOINT_LEN_4, and HW_BREAK‐
889 POINT_LEN_8. For an execution breakpoint, set this to
890 sizeof(long).
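
As an illustration of how bp_type, bp_addr, and bp_len fit together,
here is a minimal sketch that describes a 4-byte write watchpoint.
The helper name is hypothetical, and the choice of a 4-byte length is
an assumption made for the example.

```c
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: describe a 4-byte write watchpoint at addr. */
static void init_write_watchpoint(struct perf_event_attr *attr,
                                  const void *addr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_BREAKPOINT;
    attr->bp_type = HW_BREAKPOINT_W;           /* count writes only */
    attr->bp_addr = (uint64_t)(uintptr_t)addr; /* location to watch */
    attr->bp_len = HW_BREAKPOINT_LEN_4;        /* watch 4 bytes */
}
```

The resulting attr would then be passed to perf_event_open() in the
usual way.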
891
892 config2 (since Linux 2.6.39)
893 config2 is a further extension of the config1 field.
894
895 branch_sample_type (since Linux 3.4)
896 If PERF_SAMPLE_BRANCH_STACK is enabled, then this specifies what
897 branches to include in the branch record.
898
The first part of the value is the privilege level, which is a
combination of one or more of the values listed below. If the
user does not set the privilege level explicitly, the kernel will
use the event's privilege level. Event and branch privilege
levels do not have to match.
904
905 PERF_SAMPLE_BRANCH_USER
906 Branch target is in user space.
907
908 PERF_SAMPLE_BRANCH_KERNEL
909 Branch target is in kernel space.
910
911 PERF_SAMPLE_BRANCH_HV
912 Branch target is in hypervisor.
913
914 PERF_SAMPLE_BRANCH_PLM_ALL
915 A convenience value that is the three preceding values
916 ORed together.
917
In addition to the privilege value, at least one of the following
bits must be set.
920
921 PERF_SAMPLE_BRANCH_ANY
922 Any branch type.
923
924 PERF_SAMPLE_BRANCH_ANY_CALL
925 Any call branch (includes direct calls, indirect calls,
926 and far jumps).
927
928 PERF_SAMPLE_BRANCH_IND_CALL
929 Indirect calls.
930
931 PERF_SAMPLE_BRANCH_CALL (since Linux 4.4)
932 Direct calls.
933
934 PERF_SAMPLE_BRANCH_ANY_RETURN
935 Any return branch.
936
937 PERF_SAMPLE_BRANCH_IND_JUMP (since Linux 4.2)
938 Indirect jumps.
939
940 PERF_SAMPLE_BRANCH_COND (since Linux 3.16)
941 Conditional branches.
942
943 PERF_SAMPLE_BRANCH_ABORT_TX (since Linux 3.11)
944 Transactional memory aborts.
945
946 PERF_SAMPLE_BRANCH_IN_TX (since Linux 3.11)
947 Branch in transactional memory transaction.
948
949 PERF_SAMPLE_BRANCH_NO_TX (since Linux 3.11)
950 Branch not in transactional memory transaction.

PERF_SAMPLE_BRANCH_CALL_STACK (since Linux 4.1)
Branch is part of a hardware-generated call stack. This
requires hardware support, currently only found on Intel x86
Haswell or newer.
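
A branch_sample_type value combines a privilege part with at least
one branch-type bit. The sketch below, using a hypothetical helper
name, requests user-space call branches; the particular combination
is only an example.

```c
#include <linux/perf_event.h>

/* Hypothetical helper: record user-space call branches in each sample. */
static void request_user_call_branches(struct perf_event_attr *attr)
{
    attr->sample_type |= PERF_SAMPLE_BRANCH_STACK;
    attr->branch_sample_type =
        PERF_SAMPLE_BRANCH_USER |    /* privilege level part */
        PERF_SAMPLE_BRANCH_ANY_CALL; /* branch type part */
}
```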
955
956 sample_regs_user (since Linux 3.7)
957 This bit mask defines the set of user CPU registers to dump on
958 samples. The layout of the register mask is architecture-spe‐
959 cific and is described in the kernel header file arch/ARCH/in‐
960 clude/uapi/asm/perf_regs.h.
961
962 sample_stack_user (since Linux 3.7)
963 This defines the size of the user stack to dump if PERF_SAM‐
964 PLE_STACK_USER is specified.
965
966 clockid (since Linux 4.1)
967 If use_clockid is set, then this field selects which internal
968 Linux timer to use for timestamps. The available timers are de‐
969 fined in linux/time.h, with CLOCK_MONOTONIC, CLOCK_MONO‐
970 TONIC_RAW, CLOCK_REALTIME, CLOCK_BOOTTIME, and CLOCK_TAI cur‐
971 rently supported.
972
973 aux_watermark (since Linux 4.1)
974 This specifies how much data is required to trigger a
975 PERF_RECORD_AUX sample.
976
977 sample_max_stack (since Linux 4.8)
978 When sample_type includes PERF_SAMPLE_CALLCHAIN, this field
979 specifies how many stack frames to report when generating the
980 callchain.
981
982 Reading results
983 Once a perf_event_open() file descriptor has been opened, the values of
984 the events can be read from the file descriptor. The values that are
985 there are specified by the read_format field in the attr structure at
986 open time.
987
988 If you attempt to read into a buffer that is not big enough to hold the
989 data, the error ENOSPC results.
990
991 Here is the layout of the data returned by a read:
992
993 * If PERF_FORMAT_GROUP was specified to allow reading all events in a
994 group at once:
995
996 struct read_format {
997 u64 nr; /* The number of events */
998 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
999 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
1000 struct {
1001 u64 value; /* The value of the event */
1002 u64 id; /* if PERF_FORMAT_ID */
1003 } values[nr];
1004 };
1005
1006 * If PERF_FORMAT_GROUP was not specified:
1007
1008 struct read_format {
1009 u64 value; /* The value of the event */
1010 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
1011 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
1012 u64 id; /* if PERF_FORMAT_ID */
1013 };
1014
1015 The values read are as follows:
1016
1017 nr The number of events in this file descriptor. Available only if
1018 PERF_FORMAT_GROUP was specified.
1019
1020 time_enabled, time_running
1021 Total time the event was enabled and running. Normally these
1022 values are the same. Multiplexing happens if the number of
1023 events is more than the number of available PMU counter slots.
1024 In that case the events run only part of the time and the
time_enabled and time_running values can be used to scale an
estimated value for the count.
1027
1028 value An unsigned 64-bit value containing the counter result.
1029
1030 id A globally unique value for this particular event; only present
1031 if PERF_FORMAT_ID was specified in read_format.
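
The group layout shown above can be decoded along these lines. This
is a sketch, not a reference parser: struct group_reading and
parse_group_read() are names invented for the example, and only the
fields listed in the layout above are handled.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative container; these names are not part of the ABI. */
struct group_reading {
    uint64_t nr, time_enabled, time_running;
    const uint64_t *values; /* nr entries of { value [, id] } */
    size_t stride;          /* u64 words per entry: 1 or 2 */
};

/*
 * Decode a buffer filled by read(2) on an event opened with
 * PERF_FORMAT_GROUP. Returns 0 on success, -1 if buf is too short.
 */
static int parse_group_read(const uint64_t *buf, size_t words,
                            uint64_t read_format,
                            struct group_reading *out)
{
    size_t i = 0;

    if (words < 1)
        return -1;
    out->nr = buf[i++];
    out->time_enabled = out->time_running = 0;
    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) {
        if (i >= words)
            return -1;
        out->time_enabled = buf[i++];
    }
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) {
        if (i >= words)
            return -1;
        out->time_running = buf[i++];
    }
    out->stride = (read_format & PERF_FORMAT_ID) ? 2 : 1;
    if (out->nr > (words - i) / out->stride)
        return -1;
    out->values = buf + i;
    return 0;
}
```

Entry k then has its value at values[k * stride] and, when
PERF_FORMAT_ID was requested, its id at values[k * stride + 1].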
1032
1033 MMAP layout
1034 When using perf_event_open() in sampled mode, asynchronous events (like
1035 counter overflow or PROT_EXEC mmap tracking) are logged into a ring-
1036 buffer. This ring-buffer is created and accessed through mmap(2).
1037
1038 The mmap size should be 1+2^n pages, where the first page is a metadata
1039 page (struct perf_event_mmap_page) that contains various bits of infor‐
1040 mation such as where the ring-buffer head is.
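
The 1+2^n sizing rule translates directly into a length for mmap(2).
A minimal sketch, with a hypothetical helper name:

```c
#include <stddef.h>

/* Length to pass to mmap(2): one metadata page plus 2^n data pages. */
static size_t perf_mmap_length(unsigned int n, size_t page_size)
{
    return ((size_t)1 + ((size_t)1 << n)) * page_size;
}
```

The page size would typically come from sysconf(_SC_PAGESIZE).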
1041
Before Linux 2.6.39, a bug meant that you had to allocate an mmap
ring buffer when sampling, even if you did not plan to access it.
1044
1045 The structure of the first metadata mmap page is as follows:
1046
1047 struct perf_event_mmap_page {
1048 __u32 version; /* version number of this structure */
1049 __u32 compat_version; /* lowest version this is compat with */
1050 __u32 lock; /* seqlock for synchronization */
1051 __u32 index; /* hardware counter identifier */
1052 __s64 offset; /* add to hardware counter value */
1053 __u64 time_enabled; /* time event active */
1054 __u64 time_running; /* time event on CPU */
1055 union {
1056 __u64 capabilities;
1057 struct {
1058 __u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
1059 cap_bit0_is_deprecated : 1,
1060 cap_user_rdpmc : 1,
1061 cap_user_time : 1,
1062 cap_user_time_zero : 1,
1063 };
1064 };
1065 __u16 pmc_width;
1066 __u16 time_shift;
1067 __u32 time_mult;
1068 __u64 time_offset;
1069 __u64 __reserved[120]; /* Pad to 1 k */
1070 __u64 data_head; /* head in the data section */
1071 __u64 data_tail; /* user-space written tail */
1072 __u64 data_offset; /* where the buffer starts */
1073 __u64 data_size; /* data buffer size */
1074 __u64 aux_head;
1075 __u64 aux_tail;
1076 __u64 aux_offset;
1077 __u64 aux_size;
};
1080
1081 The following list describes the fields in the perf_event_mmap_page
1082 structure in more detail:
1083
1084 version
1085 Version number of this structure.
1086
1087 compat_version
1088 The lowest version this is compatible with.
1089
1090 lock A seqlock for synchronization.
1091
1092 index A unique hardware counter identifier.
1093
1094 offset When using rdpmc for reads this offset value must be added to
1095 the one returned by rdpmc to get the current total event count.
1096
1097 time_enabled
1098 Time the event was active.
1099
1100 time_running
1101 Time the event was running.
1102
1103 cap_usr_time / cap_usr_rdpmc / cap_bit0 (since Linux 3.4)
1104 There was a bug in the definition of cap_usr_time and
1105 cap_usr_rdpmc from Linux 3.4 until Linux 3.11. Both bits were
1106 defined to point to the same location, so it was impossible to
know whether cap_usr_time or cap_usr_rdpmc was actually set.
1108
1109 Starting with Linux 3.12, these are renamed to cap_bit0 and you
1110 should use the cap_user_time and cap_user_rdpmc fields instead.
1111
1112 cap_bit0_is_deprecated (since Linux 3.12)
1113 If set, this bit indicates that the kernel supports the properly
1114 separated cap_user_time and cap_user_rdpmc bits.
1115
If not set, it indicates an older kernel where cap_usr_time and
1117 cap_usr_rdpmc map to the same bit and thus both features should
1118 be used with caution.
1119
1120 cap_user_rdpmc (since Linux 3.12)
1121 If the hardware supports user-space read of performance counters
1122 without syscall (this is the "rdpmc" instruction on x86), then
1123 the following code can be used to do a read:
1124
u32 seq, time_mult, time_shift, idx, width;
u64 count, enabled, running;
u64 cyc, time_offset;

do {
    seq = pc->lock;             /* snapshot the seqlock */
    barrier();
    enabled = pc->time_enabled;
    running = pc->time_running;

    if (pc->cap_user_time && enabled != running) {
        cyc = rdtsc();
        time_offset = pc->time_offset;
        time_mult = pc->time_mult;
        time_shift = pc->time_shift;
    }

    idx = pc->index;
    count = pc->offset;

    if (pc->cap_user_rdpmc && idx) {
        width = pc->pmc_width;
        count += rdpmc(idx - 1);
    }

    barrier();
} while (pc->lock != seq);      /* retry if the page changed meanwhile */
1152
1153 cap_user_time (since Linux 3.12)
1154 This bit indicates the hardware has a constant, nonstop time‐
1155 stamp counter (TSC on x86).
1156
1157 cap_user_time_zero (since Linux 3.12)
1158 Indicates the presence of time_zero which allows mapping time‐
1159 stamp values to the hardware clock.
1160
1161 pmc_width
If cap_user_rdpmc is set, this field provides the bit-width of the
value read using the rdpmc or equivalent instruction. This can be
used to sign-extend the result as follows:
1165
1166 pmc <<= 64 - pmc_width;
1167 pmc >>= 64 - pmc_width; // signed shift right
1168 count += pmc;
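
The sign extension above can be wrapped in a small function. This is
a sketch with a hypothetical name; it assumes 1 <= width <= 64 and
relies on right shifts of negative signed values being arithmetic,
which GCC and Clang provide but the C standard leaves
implementation-defined.

```c
#include <stdint.h>

/* Sign-extend a raw counter value that is `width` bits wide. */
static int64_t sign_extend_pmc(uint64_t raw, unsigned int width)
{
    unsigned int shift = 64 - width;

    /* Move the counter's top bit into bit 63, then shift back. */
    return (int64_t)(raw << shift) >> shift;
}
```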
1169
1170 time_shift, time_mult, time_offset
1171
If cap_user_time is set, these fields can be used to compute the
time delta since time_enabled (in nanoseconds) using rdtsc or
similar:
1175
1176 u64 quot, rem;
1177 u64 delta;
1178
1179 quot = cyc >> time_shift;
1180 rem = cyc & (((u64)1 << time_shift) - 1);
1181 delta = time_offset + quot * time_mult +
1182 ((rem * time_mult) >> time_shift);
1183
1184 Where time_offset, time_mult, time_shift, and cyc are read in
1185 the seqcount loop described above. This delta can then be added
to enabled and possibly running (if idx), improving the scaling:
1187
1188 enabled += delta;
1189 if (idx)
1190 running += delta;
1191 quot = count / running;
1192 rem = count % running;
1193 count = quot * enabled + (rem * enabled) / running;
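
The two fragments above can be packaged as plain functions, which
makes the arithmetic easy to check in isolation. The function names
are hypothetical, and the guard for running == 0 is an addition made
for this sketch.

```c
#include <stdint.h>

/* Nanoseconds corresponding to cyc timestamp-counter cycles. */
static uint64_t cycles_to_delta(uint64_t cyc, uint64_t time_offset,
                                uint32_t time_mult, uint16_t time_shift)
{
    uint64_t quot = cyc >> time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << time_shift) - 1);

    return time_offset + quot * time_mult +
           ((rem * time_mult) >> time_shift);
}

/*
 * Scale count by enabled/running. The division is split into
 * quotient and remainder parts to reduce the risk of overflow.
 */
static uint64_t scale_count(uint64_t count, uint64_t enabled,
                            uint64_t running)
{
    if (running == 0)
        return 0; /* the event never ran; nothing to extrapolate */
    return (count / running) * enabled +
           ((count % running) * enabled) / running;
}
```

With time_mult equal to 2^time_shift, one cycle maps to one
nanosecond, which is a convenient sanity check.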
1194
1195 time_zero (since Linux 3.12)
1196
If cap_user_time_zero is set, then the hardware clock (the TSC
1198 timestamp counter on x86) can be calculated from the time_zero,
1199 time_mult, and time_shift values:
1200
1201 time = timestamp - time_zero;
1202 quot = time / time_mult;
1203 rem = time % time_mult;
1204 cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
1205
1206 And vice versa:
1207
1208 quot = cyc >> time_shift;
1209 rem = cyc & (((u64)1 << time_shift) - 1);
1210 timestamp = time_zero + quot * time_mult +
1211 ((rem * time_mult) >> time_shift);
1212
1213 data_head
This points to the head of the data section. The value
continuously increases; it does not wrap. The value needs to be
manually wrapped by the size of the mmap buffer before accessing
the samples.
1218
1219 On SMP-capable platforms, after reading the data_head value,
1220 user space should issue an rmb().
1221
1222 data_tail
1223 When the mapping is PROT_WRITE, the data_tail value should be
1224 written by user space to reflect the last read data. In this
1225 case, the kernel will not overwrite unread data.
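
Because data_head and data_tail increase without wrapping, a reader
has to wrap positions itself. The sketch below copies bytes out of
the data area starting at a free-running position; the function name
is hypothetical, and the ring size is assumed to be a power of two
(as it is for 2^n pages).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes starting at the free-running position pos out of
 * a ring of `size` bytes, wrapping at the end of the ring. In real
 * use, pos lies between data_tail and data_head, and an rmb() is
 * issued after data_head has been read.
 */
static void ring_copy(const uint8_t *ring, uint64_t size, uint64_t pos,
                      void *dst, size_t len)
{
    uint8_t *out = dst;

    for (size_t i = 0; i < len; i++)
        out[i] = ring[(pos + i) & (size - 1)];
}
```

After consuming a record, a reader with a PROT_WRITE mapping would
store the new position in data_tail.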
1226
1227 data_offset (since Linux 4.1)
1228 Contains the offset of the location in the mmap buffer where
1229 perf sample data begins.
1230
1231 data_size (since Linux 4.1)
1232 Contains the size of the perf sample region within the mmap buf‐
1233 fer.
1234
1235 aux_head, aux_tail, aux_offset, aux_size (since Linux 4.1)
1236 The AUX region allows mmap(2)-ing a separate sample buffer for
1237 high-bandwidth data streams (separate from the main perf sample
1238 buffer). An example of a high-bandwidth stream is instruction
1239 tracing support, as is found in newer Intel processors.
1240
1241 To set up an AUX area, first aux_offset needs to be set with an
1242 offset greater than data_offset+data_size and aux_size needs to
1243 be set to the desired buffer size. The desired offset and size
1244 must be page aligned, and the size must be a power of two.
1245 These values are then passed to mmap in order to map the AUX
1246 buffer. Pages in the AUX buffer are included as part of the
1247 RLIMIT_MEMLOCK resource limit (see setrlimit(2)), and also as
1248 part of the perf_event_mlock_kb allowance.
1249
1250 By default, the AUX buffer will be truncated if it will not fit
1251 in the available space in the ring buffer. If the AUX buffer is
1252 mapped as a read only buffer, then it will operate in ring buf‐
1253 fer mode where old data will be overwritten by new. In over‐
1254 write mode, it might not be possible to infer where the new data
1255 began, and it is the consumer's job to disable measurement while
1256 reading to avoid possible data races.
1257
The aux_head and aux_tail ring buffer pointers have the same
behavior and ordering rules as the previously described data_head
and data_tail.
1261
1262 The following 2^n ring-buffer pages have the layout described below.
1263
If perf_event_attr.sample_id_all is set, then every record type
also carries the sample_type-selected fields that identify where
and when the event took place (TID, TIME, ID, CPU, STREAM_ID),
as described under PERF_RECORD_SAMPLE below. These fields are
stashed just after the perf_event_header and the fields already
present for the record type, that is, at the end of the payload.
This allows a newer perf.data file to be supported by older perf
tools, with the new optional fields being ignored.
1272
1273 The mmap values start with a header:
1274
1275 struct perf_event_header {
1276 __u32 type;
1277 __u16 misc;
1278 __u16 size;
1279 };
1280
1281 Below, we describe the perf_event_header fields in more detail. For
1282 ease of reading, the fields with shorter descriptions are presented
1283 first.
1284
1285 size This indicates the size of the record.
1286
1287 misc The misc field contains additional information about the sample.
1288
1289 The CPU mode can be determined from this value by masking with
1290 PERF_RECORD_MISC_CPUMODE_MASK and looking for one of the follow‐
1291 ing (note these are not bit masks, only one can be set at a
1292 time):
1293
1294 PERF_RECORD_MISC_CPUMODE_UNKNOWN
1295 Unknown CPU mode.
1296
1297 PERF_RECORD_MISC_KERNEL
1298 Sample happened in the kernel.
1299
1300 PERF_RECORD_MISC_USER
1301 Sample happened in user code.
1302
1303 PERF_RECORD_MISC_HYPERVISOR
1304 Sample happened in the hypervisor.
1305
1306 PERF_RECORD_MISC_GUEST_KERNEL (since Linux 2.6.35)
1307 Sample happened in the guest kernel.
1308
1309 PERF_RECORD_MISC_GUEST_USER (since Linux 2.6.35)
1310 Sample happened in guest user code.
1311
1312 Since the following three statuses are generated by different
1313 record types, they alias to the same bit:
1314
1315 PERF_RECORD_MISC_MMAP_DATA (since Linux 3.10)
1316 This is set when the mapping is not executable; otherwise
1317 the mapping is executable.
1318
1319 PERF_RECORD_MISC_COMM_EXEC (since Linux 3.16)
1320 This is set for a PERF_RECORD_COMM record on kernels more
1321 recent than Linux 3.16 if a process name change was
1322 caused by an execve(2) system call.
1323
1324 PERF_RECORD_MISC_SWITCH_OUT (since Linux 4.3)
1325 When a PERF_RECORD_SWITCH or PERF_RECORD_SWITCH_CPU_WIDE
1326 record is generated, this bit indicates that the context
1327 switch is away from the current process (instead of into
1328 the current process).
1329
1330 In addition, the following bits can be set:
1331
1332 PERF_RECORD_MISC_EXACT_IP
1333 This indicates that the content of PERF_SAMPLE_IP points
1334 to the actual instruction that triggered the event. See
1335 also perf_event_attr.precise_ip.
1336
1337 PERF_RECORD_MISC_EXT_RESERVED (since Linux 2.6.35)
1338 This indicates there is extended data available (cur‐
1339 rently not used).
1340
1341 PERF_RECORD_MISC_PROC_MAP_PARSE_TIMEOUT
1342 This bit is not set by the kernel. It is reserved for
1343 the user-space perf utility to indicate that
/proc/[pid]/maps parsing was taking too long and was
1345 stopped, and thus the mmap records may be truncated.
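
Since the CPU mode values are not bit masks, the misc field has to be
masked first and then compared. A minimal sketch, with a hypothetical
function name and label strings chosen for the example:

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Map the CPU-mode part of perf_event_header.misc to a label. */
static const char *cpumode_name(uint16_t misc)
{
    switch (misc & PERF_RECORD_MISC_CPUMODE_MASK) {
    case PERF_RECORD_MISC_KERNEL:       return "kernel";
    case PERF_RECORD_MISC_USER:         return "user";
    case PERF_RECORD_MISC_HYPERVISOR:   return "hypervisor";
    case PERF_RECORD_MISC_GUEST_KERNEL: return "guest kernel";
    case PERF_RECORD_MISC_GUEST_USER:   return "guest user";
    default:                            return "unknown";
    }
}
```

Other bits, such as PERF_RECORD_MISC_EXACT_IP, are ignored by the
mask and can be tested separately.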
1346
1347 type The type value is one of the below. The values in the corre‐
1348 sponding record (that follows the header) depend on the type se‐
1349 lected as shown.
1350
1351 PERF_RECORD_MMAP
1352 The MMAP events record the PROT_EXEC mappings so that we can
1353 correlate user-space IPs to code. They have the following
1354 structure:
1355
1356 struct {
1357 struct perf_event_header header;
1358 u32 pid, tid;
1359 u64 addr;
1360 u64 len;
1361 u64 pgoff;
1362 char filename[];
1363 };
1364
1365 pid is the process ID.
1366
1367 tid is the thread ID.
1368
addr is the address of the allocated memory.

len is the length of the allocated memory.

pgoff is the page offset of the allocated memory.

filename
is a string describing the backing of the allocated memory.
1373
1374 PERF_RECORD_LOST
1375 This record indicates when events are lost.
1376
1377 struct {
1378 struct perf_event_header header;
1379 u64 id;
1380 u64 lost;
1381 struct sample_id sample_id;
1382 };
1383
1384 id is the unique event ID for the samples that were
1385 lost.
1386
1387 lost is the number of events that were lost.
1388
1389 PERF_RECORD_COMM
1390 This record indicates a change in the process name.
1391
1392 struct {
1393 struct perf_event_header header;
1394 u32 pid;
1395 u32 tid;
1396 char comm[];
1397 struct sample_id sample_id;
1398 };
1399
1400 pid is the process ID.
1401
1402 tid is the thread ID.
1403
1404 comm is a string containing the new name of the process.
1405
1406 PERF_RECORD_EXIT
1407 This record indicates a process exit event.
1408
1409 struct {
1410 struct perf_event_header header;
1411 u32 pid, ppid;
1412 u32 tid, ptid;
1413 u64 time;
1414 struct sample_id sample_id;
1415 };
1416
1417 PERF_RECORD_THROTTLE, PERF_RECORD_UNTHROTTLE
1418 This record indicates a throttle/unthrottle event.
1419
1420 struct {
1421 struct perf_event_header header;
1422 u64 time;
1423 u64 id;
1424 u64 stream_id;
1425 struct sample_id sample_id;
1426 };
1427
1428 PERF_RECORD_FORK
1429 This record indicates a fork event.
1430
1431 struct {
1432 struct perf_event_header header;
1433 u32 pid, ppid;
1434 u32 tid, ptid;
1435 u64 time;
1436 struct sample_id sample_id;
1437 };
1438
1439 PERF_RECORD_READ
1440 This record indicates a read event.
1441
1442 struct {
1443 struct perf_event_header header;
1444 u32 pid, tid;
1445 struct read_format values;
1446 struct sample_id sample_id;
1447 };
1448
1449 PERF_RECORD_SAMPLE
1450 This record indicates a sample.
1451
1452 struct {
1453 struct perf_event_header header;
1454 u64 sample_id; /* if PERF_SAMPLE_IDENTIFIER */
1455 u64 ip; /* if PERF_SAMPLE_IP */
1456 u32 pid, tid; /* if PERF_SAMPLE_TID */
1457 u64 time; /* if PERF_SAMPLE_TIME */
1458 u64 addr; /* if PERF_SAMPLE_ADDR */
1459 u64 id; /* if PERF_SAMPLE_ID */
1460 u64 stream_id; /* if PERF_SAMPLE_STREAM_ID */
1461 u32 cpu, res; /* if PERF_SAMPLE_CPU */
1462 u64 period; /* if PERF_SAMPLE_PERIOD */
1463 struct read_format v;
1464 /* if PERF_SAMPLE_READ */
1465 u64 nr; /* if PERF_SAMPLE_CALLCHAIN */
1466 u64 ips[nr]; /* if PERF_SAMPLE_CALLCHAIN */
1467 u32 size; /* if PERF_SAMPLE_RAW */
1468 char data[size]; /* if PERF_SAMPLE_RAW */
1469 u64 bnr; /* if PERF_SAMPLE_BRANCH_STACK */
1470 struct perf_branch_entry lbr[bnr];
1471 /* if PERF_SAMPLE_BRANCH_STACK */
1472 u64 abi; /* if PERF_SAMPLE_REGS_USER */
1473 u64 regs[weight(mask)];
1474 /* if PERF_SAMPLE_REGS_USER */
1475 u64 size; /* if PERF_SAMPLE_STACK_USER */
1476 char data[size]; /* if PERF_SAMPLE_STACK_USER */
1477 u64 dyn_size; /* if PERF_SAMPLE_STACK_USER &&
1478 size != 0 */
1479 u64 weight; /* if PERF_SAMPLE_WEIGHT */
1480 u64 data_src; /* if PERF_SAMPLE_DATA_SRC */
1481 u64 transaction; /* if PERF_SAMPLE_TRANSACTION */
1482 u64 abi; /* if PERF_SAMPLE_REGS_INTR */
1483 u64 regs[weight(mask)];
1484 /* if PERF_SAMPLE_REGS_INTR */
1485 u64 phys_addr; /* if PERF_SAMPLE_PHYS_ADDR */
1486 u64 cgroup; /* if PERF_SAMPLE_CGROUP */
1487 };
1488
1489 sample_id
1490 If PERF_SAMPLE_IDENTIFIER is enabled, a 64-bit unique ID
1491 is included. This is a duplication of the PERF_SAM‐
1492 PLE_ID id value, but included at the beginning of the
1493 sample so parsers can easily obtain the value.
1494
1495 ip If PERF_SAMPLE_IP is enabled, then a 64-bit instruction
1496 pointer value is included.
1497
1498 pid, tid
1499 If PERF_SAMPLE_TID is enabled, then a 32-bit process ID
1500 and 32-bit thread ID are included.
1501
1502 time
1503 If PERF_SAMPLE_TIME is enabled, then a 64-bit timestamp
1504 is included. This is obtained via local_clock() which
1505 is a hardware timestamp if available and the jiffies
1506 value if not.
1507
1508 addr
1509 If PERF_SAMPLE_ADDR is enabled, then a 64-bit address is
1510 included. This is usually the address of a tracepoint,
1511 breakpoint, or software event; otherwise the value is 0.
1512
1513 id If PERF_SAMPLE_ID is enabled, a 64-bit unique ID is in‐
1514 cluded. If the event is a member of an event group, the
1515 group leader ID is returned. This ID is the same as the
1516 one returned by PERF_FORMAT_ID.
1517
1518 stream_id
1519 If PERF_SAMPLE_STREAM_ID is enabled, a 64-bit unique ID
1520 is included. Unlike PERF_SAMPLE_ID the actual ID is re‐
1521 turned, not the group leader. This ID is the same as
1522 the one returned by PERF_FORMAT_ID.
1523
1524 cpu, res
1525 If PERF_SAMPLE_CPU is enabled, this is a 32-bit value
1526 indicating which CPU was being used, in addition to a
1527 reserved (unused) 32-bit value.
1528
1529 period
1530 If PERF_SAMPLE_PERIOD is enabled, a 64-bit value indi‐
1531 cating the current sampling period is written.
1532
1533 v If PERF_SAMPLE_READ is enabled, a structure of type
1534 read_format is included which has values for all events
1535 in the event group. The values included depend on the
1536 read_format value used at perf_event_open() time.
1537
1538 nr, ips[nr]
If PERF_SAMPLE_CALLCHAIN is enabled, then a 64-bit number is
included which indicates how many 64-bit instruction pointers
follow. This is the current callchain.
1543
1544 size, data[size]
1545 If PERF_SAMPLE_RAW is enabled, then a 32-bit value indi‐
1546 cating size is included followed by an array of 8-bit
1547 values of length size. The values are padded with 0 to
1548 have 64-bit alignment.
1549
1550 This RAW record data is opaque with respect to the ABI.
1551 The ABI doesn't make any promises with respect to the
stability of its content; it may vary depending on
1553 event, hardware, and kernel version.
1554
1555 bnr, lbr[bnr]
1556 If PERF_SAMPLE_BRANCH_STACK is enabled, then a 64-bit
1557 value indicating the number of records is included, fol‐
1558 lowed by bnr perf_branch_entry structures which each in‐
1559 clude the fields:
1560
1561 from This indicates the source instruction (may not be
1562 a branch).
1563
1564 to The branch target.
1565
1566 mispred
1567 The branch target was mispredicted.
1568
1569 predicted
1570 The branch target was predicted.
1571
1572 in_tx (since Linux 3.11)
1573 The branch was in a transactional memory transac‐
1574 tion.
1575
1576 abort (since Linux 3.11)
1577 The branch was in an aborted transactional memory
1578 transaction.
1579
1580 cycles (since Linux 4.3)
1581 This reports the number of cycles elapsed since
1582 the previous branch stack update.
1583
1584 The entries are from most to least recent, so the first
1585 entry has the most recent branch.
1586
1587 Support for mispred, predicted, and cycles is optional;
1588 if not supported, those values will be 0.
1589
1590 The type of branches recorded is specified by the
1591 branch_sample_type field.
1592
1593 abi, regs[weight(mask)]
1594 If PERF_SAMPLE_REGS_USER is enabled, then the user CPU
1595 registers are recorded.
1596
1597 The abi field is one of PERF_SAMPLE_REGS_ABI_NONE,
1598 PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
1599
1600 The regs field is an array of the CPU registers that
1601 were specified by the sample_regs_user attr field. The
1602 number of values is the number of bits set in the sam‐
1603 ple_regs_user bit mask.
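
The weight(mask) in the record layout is the number of bits set in
the register mask. One way to compute it, assuming a GCC- or
Clang-compatible compiler for the builtin (the function name is
hypothetical):

```c
#include <stdint.h>

/* Number of u64 entries in regs[]: the set bits in the mask. */
static unsigned int sampled_reg_count(uint64_t sample_regs_mask)
{
    return (unsigned int)__builtin_popcountll(sample_regs_mask);
}
```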
1604
1605 size, data[size], dyn_size
1606 If PERF_SAMPLE_STACK_USER is enabled, then the user
1607 stack is recorded. This can be used to generate stack
1608 backtraces. size is the size requested by the user in
1609 sample_stack_user or else the maximum record size. data
1610 is the stack data (a raw dump of the memory pointed to
1611 by the stack pointer at the time of sampling). dyn_size
1612 is the amount of data actually dumped (can be less than
1613 size). Note that dyn_size is omitted if size is 0.
1614
1615 weight
1616 If PERF_SAMPLE_WEIGHT is enabled, then a 64-bit value
1617 provided by the hardware is recorded that indicates how
1618 costly the event was. This allows expensive events to
1619 stand out more clearly in profiles.
1620
1621 data_src
1622 If PERF_SAMPLE_DATA_SRC is enabled, then a 64-bit value
1623 is recorded that is made up of the following fields:
1624
1625 mem_op
1626 Type of opcode, a bitwise combination of:
1627
1628 PERF_MEM_OP_NA Not available
1629 PERF_MEM_OP_LOAD Load instruction
1630 PERF_MEM_OP_STORE Store instruction
1631 PERF_MEM_OP_PFETCH Prefetch
1632 PERF_MEM_OP_EXEC Executable code
1633
1634 mem_lvl
1635 Memory hierarchy level hit or miss, a bitwise combi‐
1636 nation of the following, shifted left by
1637 PERF_MEM_LVL_SHIFT:
1638
1639 PERF_MEM_LVL_NA Not available
1640 PERF_MEM_LVL_HIT Hit
1641 PERF_MEM_LVL_MISS Miss
1642 PERF_MEM_LVL_L1 Level 1 cache
1643 PERF_MEM_LVL_LFB Line fill buffer
1644 PERF_MEM_LVL_L2 Level 2 cache
1645 PERF_MEM_LVL_L3 Level 3 cache
1646 PERF_MEM_LVL_LOC_RAM Local DRAM
1647 PERF_MEM_LVL_REM_RAM1 Remote DRAM 1 hop
1648 PERF_MEM_LVL_REM_RAM2 Remote DRAM 2 hops
1649 PERF_MEM_LVL_REM_CCE1 Remote cache 1 hop
1650 PERF_MEM_LVL_REM_CCE2 Remote cache 2 hops
1651 PERF_MEM_LVL_IO I/O memory
1652 PERF_MEM_LVL_UNC Uncached memory
1653
1654 mem_snoop
1655 Snoop mode, a bitwise combination of the following,
1656 shifted left by PERF_MEM_SNOOP_SHIFT:
1657
1658 PERF_MEM_SNOOP_NA Not available
1659 PERF_MEM_SNOOP_NONE No snoop
1660 PERF_MEM_SNOOP_HIT Snoop hit
1661 PERF_MEM_SNOOP_MISS Snoop miss
1662 PERF_MEM_SNOOP_HITM Snoop hit modified
1663
1664 mem_lock
1665 Lock instruction, a bitwise combination of the fol‐
1666 lowing, shifted left by PERF_MEM_LOCK_SHIFT:
1667
1668 PERF_MEM_LOCK_NA Not available
1669 PERF_MEM_LOCK_LOCKED Locked transaction
1670
1671 mem_dtlb
1672 TLB access hit or miss, a bitwise combination of the
1673 following, shifted left by PERF_MEM_TLB_SHIFT:
1674
1675 PERF_MEM_TLB_NA Not available
1676 PERF_MEM_TLB_HIT Hit
1677 PERF_MEM_TLB_MISS Miss
1678 PERF_MEM_TLB_L1 Level 1 TLB
1679 PERF_MEM_TLB_L2 Level 2 TLB
1680 PERF_MEM_TLB_WK Hardware walker
1681 PERF_MEM_TLB_OS OS fault handler
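
The data_src subfields can be picked apart with union
perf_mem_data_src, which linux/perf_event.h declares with one
bit-field per subfield. The two predicates below are hypothetical
helpers written for this example.

```c
#include <linux/perf_event.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the sampled access was a load. */
static bool is_load(uint64_t data_src)
{
    union perf_mem_data_src d = { .val = data_src };

    return (d.mem_op & PERF_MEM_OP_LOAD) != 0;
}

/* True if the sampled access hit in the level 1 cache. */
static bool is_l1_hit(uint64_t data_src)
{
    union perf_mem_data_src d = { .val = data_src };

    return (d.mem_lvl & PERF_MEM_LVL_HIT) != 0 &&
           (d.mem_lvl & PERF_MEM_LVL_L1) != 0;
}
```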
1682
1683 transaction
1684 If the PERF_SAMPLE_TRANSACTION flag is set, then a
1685 64-bit field is recorded describing the sources of any
1686 transactional memory aborts.
1687
1688 The field is a bitwise combination of the following val‐
1689 ues:
1690
1691 PERF_TXN_ELISION
1692 Abort from an elision type transaction (Intel-
1693 CPU-specific).
1694
1695 PERF_TXN_TRANSACTION
1696 Abort from a generic transaction.
1697
1698 PERF_TXN_SYNC
1699 Synchronous abort (related to the reported in‐
1700 struction).
1701
1702 PERF_TXN_ASYNC
1703 Asynchronous abort (not related to the reported
1704 instruction).
1705
1706 PERF_TXN_RETRY
1707 Retryable abort (retrying the transaction may
1708 have succeeded).
1709
1710 PERF_TXN_CONFLICT
1711 Abort due to memory conflicts with other threads.
1712
1713 PERF_TXN_CAPACITY_WRITE
1714 Abort due to write capacity overflow.
1715
1716 PERF_TXN_CAPACITY_READ
1717 Abort due to read capacity overflow.
1718
In addition, a user-specified abort code can be obtained
from the high 32 bits of the field by masking with the value
PERF_TXN_ABORT_MASK and then shifting right by
PERF_TXN_ABORT_SHIFT.
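
Extracting the abort code might look like the sketch below. It
assumes, as the kernel's linux/perf_event.h header appears to define
it, that PERF_TXN_ABORT_MASK covers the high 32 bits in place, so the
mask is applied before the shift. The function name is hypothetical.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Extract the user-specified abort code from the transaction field. */
static uint32_t txn_abort_code(uint64_t transaction)
{
    return (uint32_t)((transaction & PERF_TXN_ABORT_MASK) >>
                      PERF_TXN_ABORT_SHIFT);
}
```

The low bits can still be tested directly against the PERF_TXN_*
flags listed above.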
1723
1724 abi, regs[weight(mask)]
If PERF_SAMPLE_REGS_INTR is enabled, then the CPU registers
selected by sample_regs_intr are recorded.
1727
1728 The abi field is one of PERF_SAMPLE_REGS_ABI_NONE,
1729 PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
1730
1731 The regs field is an array of the CPU registers that
1732 were specified by the sample_regs_intr attr field. The
1733 number of values is the number of bits set in the sam‐
1734 ple_regs_intr bit mask.
1735
1736 phys_addr
1737 If the PERF_SAMPLE_PHYS_ADDR flag is set, then the
1738 64-bit physical address is recorded.
1739
1740 cgroup
1741 If the PERF_SAMPLE_CGROUP flag is set, then the 64-bit
1742 cgroup ID (for the perf_event subsystem) is recorded.
1743 To get the pathname of the cgroup, match the ID against
1744 the id field of a PERF_RECORD_CGROUP record.
1745
1746 PERF_RECORD_MMAP2
1747 This record includes extended information on mmap(2) calls
1748 returning executable mappings. The format is similar to
1749 that of the PERF_RECORD_MMAP record, but includes extra val‐
1750 ues that allow uniquely identifying shared mappings.
1751
1752 struct {
1753 struct perf_event_header header;
1754 u32 pid;
1755 u32 tid;
1756 u64 addr;
1757 u64 len;
1758 u64 pgoff;
1759 u32 maj;
1760 u32 min;
1761 u64 ino;
1762 u64 ino_generation;
1763 u32 prot;
1764 u32 flags;
1765 char filename[];
1766 struct sample_id sample_id;
1767 };
1768
1769 pid is the process ID.
1770
1771 tid is the thread ID.
1772
1773 addr is the address of the allocated memory.
1774
1775 len is the length of the allocated memory.
1776
1777 pgoff is the page offset of the allocated memory.
1778
1779 maj is the major ID of the underlying device.
1780
1781 min is the minor ID of the underlying device.
1782
1783 ino is the inode number.
1784
1785 ino_generation
1786 is the inode generation.
1787
1788 prot is the protection information.
1789
1790 flags is the flags information.
1791
1792 filename
1793 is a string describing the backing of the allocated
1794 memory.
1795
1796 PERF_RECORD_AUX (since Linux 4.1)
1797 This record reports that new data is available in the sepa‐
1798 rate AUX buffer region.
1799
1800 struct {
1801 struct perf_event_header header;
1802 u64 aux_offset;
1803 u64 aux_size;
1804 u64 flags;
1805 struct sample_id sample_id;
1806 };
1807
1808 aux_offset
1809 offset in the AUX mmap region where the new data be‐
1810 gins.
1811
1812 aux_size
1813 size of the data made available.
1814
1815 flags describes the AUX update.
1816
1817 PERF_AUX_FLAG_TRUNCATED
1818 if set, then the data returned was truncated
1819 to fit the available buffer size.
1820
1821 PERF_AUX_FLAG_OVERWRITE
1822 if set, then the data returned has overwritten
1823 previous data.
1824
1825 PERF_RECORD_ITRACE_START (since Linux 4.1)
1826 This record indicates which process has initiated an in‐
1827 struction trace event, allowing tools to properly correlate
1828 the instruction addresses in the AUX buffer with the proper
1829 executable.
1830
1831 struct {
1832 struct perf_event_header header;
1833 u32 pid;
1834 u32 tid;
1835 };
1836
1837 pid process ID of the thread starting an instruction
1838 trace.
1839
1840 tid thread ID of the thread starting an instruction
1841 trace.
1842
1843 PERF_RECORD_LOST_SAMPLES (since Linux 4.2)
1844 When using hardware sampling (such as Intel PEBS) this
1845 record indicates some number of samples that may have been
1846 lost.
1847
1848 struct {
1849 struct perf_event_header header;
1850 u64 lost;
1851 struct sample_id sample_id;
1852 };
1853
1854 lost the number of potentially lost samples.
1855
1856 PERF_RECORD_SWITCH (since Linux 4.3)
1857 This record indicates a context switch has happened. The
1858 PERF_RECORD_MISC_SWITCH_OUT bit in the misc field indicates
1859 whether it was a context switch into or away from the cur‐
1860 rent process.
1861
1862 struct {
1863 struct perf_event_header header;
1864 struct sample_id sample_id;
1865 };
1866
1867 PERF_RECORD_SWITCH_CPU_WIDE (since Linux 4.3)
1868 As with PERF_RECORD_SWITCH this record indicates a context
1869 switch has happened, but it only occurs when sampling in
1870 CPU-wide mode and provides additional information on the
1871 process being switched to/from. The
1872 PERF_RECORD_MISC_SWITCH_OUT bit in the misc field indicates
1873 whether it was a context switch into or away from the cur‐
1874 rent process.
1875
1876 struct {
1877 struct perf_event_header header;
1878 u32 next_prev_pid;
1879 u32 next_prev_tid;
1880 struct sample_id sample_id;
1881 };
1882
1883 next_prev_pid
1884 The process ID of the previous (if switching in) or
1885 next (if switching out) process on the CPU.
1886
1887 next_prev_tid
1888 The thread ID of the previous (if switching in) or
1889 next (if switching out) thread on the CPU.
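
The direction of a switch can be read from the record header alone. A minimal sketch, assuming the header has already been located in the ring buffer; the helper name is hypothetical:

```c
#include <linux/perf_event.h>

/* Returns 1 if the record describes a switch away from the current
 * process, 0 if it describes a switch into it. Only meaningful for
 * PERF_RECORD_SWITCH and PERF_RECORD_SWITCH_CPU_WIDE records. */
int is_switch_out(const struct perf_event_header *hdr)
{
    return (hdr->misc & PERF_RECORD_MISC_SWITCH_OUT) != 0;
}
```

For a CPU-wide record, the result also tells the reader whether next_prev_pid and next_prev_tid name the next task (switch out) or the previous one (switch in).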
1890
1891 PERF_RECORD_NAMESPACES (since Linux 4.11)
1892 This record includes various namespace information of a
1893 process.
1894
1895 struct {
1896 struct perf_event_header header;
1897 u32 pid;
1898 u32 tid;
1899 u64 nr_namespaces;
1900 struct { u64 dev, inode } [nr_namespaces];
1901 struct sample_id sample_id;
1902 };
1903
1904 pid is the process ID.
1905
1906 tid is the thread ID.
1907
1908 nr_namespaces
1909 is the number of namespaces in this record.
1910
1911 Each namespace has dev and inode fields and is recorded at a
1912 fixed position in the array, as listed below:
1913
1914 NET_NS_INDEX=0
1915 Network namespace
1916
1917 UTS_NS_INDEX=1
1918 UTS namespace
1919
1920 IPC_NS_INDEX=2
1921 IPC namespace
1922
1923 PID_NS_INDEX=3
1924 PID namespace
1925
1926 USER_NS_INDEX=4
1927 User namespace
1928
1929 MNT_NS_INDEX=5
1930 Mount namespace
1931
1932 CGROUP_NS_INDEX=6
1933 Cgroup namespace
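
Because the positions are fixed, a tool can label each array entry with a simple lookup table. The sketch below mirrors the list above; the short names and the helper are illustrative choices, not kernel identifiers.

```c
#include <stddef.h>

/* Labels for the fixed slots of the namespace array, in the order
 * listed above (NET_NS_INDEX=0 ... CGROUP_NS_INDEX=6). */
static const char *const ns_names[] = {
    "net", "uts", "ipc", "pid", "user", "mnt", "cgroup",
};

/* Returns a label for a slot, or NULL if the index is not known. */
const char *namespace_name(size_t idx)
{
    if (idx >= sizeof(ns_names) / sizeof(ns_names[0]))
        return NULL;
    return ns_names[idx];
}
```

Returning NULL for unknown indices lets a parser tolerate records from newer kernels that report more namespaces than this table covers.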
1934
1935 PERF_RECORD_KSYMBOL (since Linux 5.0)
1936 This record indicates kernel symbol register/unregister
1937 events.
1938
1939 struct {
1940 struct perf_event_header header;
1941 u64 addr;
1942 u32 len;
1943 u16 ksym_type;
1944 u16 flags;
1945 char name[];
1946 struct sample_id sample_id;
1947 };
1948
1949 addr is the address of the kernel symbol.
1950
1951 len is the length of the kernel symbol.
1952
1953 ksym_type
1954 is the type of the kernel symbol. Currently the fol‐
1955 lowing types are available:
1956
1957 PERF_RECORD_KSYMBOL_TYPE_BPF
1958 The kernel symbol is a BPF function.
1959
1960 flags If the PERF_RECORD_KSYMBOL_FLAGS_UNREGISTER flag is set,
1961 then this event is for unregistering the kernel
1962 symbol.
1963
1964 PERF_RECORD_BPF_EVENT (since Linux 5.0)
1965 This record indicates that a BPF program was loaded or unloaded.
1966
1967 struct {
1968 struct perf_event_header header;
1969 u16 type;
1970 u16 flags;
1971 u32 id;
1972 u8 tag[BPF_TAG_SIZE];
1973 struct sample_id sample_id;
1974 };
1975
1976 type is one of the following values:
1977
1978 PERF_BPF_EVENT_PROG_LOAD
1979 A BPF program is loaded
1980
1981 PERF_BPF_EVENT_PROG_UNLOAD
1982 A BPF program is unloaded
1983
1984 id is the ID of the BPF program.
1985
1986 tag is the tag of the BPF program. Currently,
1987 BPF_TAG_SIZE is defined as 8.
1988
1989 PERF_RECORD_CGROUP (since Linux 5.7)
1990 This record indicates that a new cgroup has been created and activated.
1991
1992 struct {
1993 struct perf_event_header header;
1994 u64 id;
1995 char path[];
1996 struct sample_id sample_id;
1997 };
1998
1999 id is the cgroup identifier. This can also be retrieved
2000 by name_to_handle_at(2) on the cgroup path (as a file
2001 handle).
2002
2003 path is the path of the cgroup from the root.
2004
2005 PERF_RECORD_TEXT_POKE (since Linux 5.8)
2006 This record indicates a change in the kernel text. This
2007 includes the addition or removal of text, in which case the
2008 corresponding length (old_len or new_len, respectively) is zero.
2009
2010 struct {
2011 struct perf_event_header header;
2012 u64 addr;
2013 u16 old_len;
2014 u16 new_len;
2015 u8 bytes[];
2016 struct sample_id sample_id;
2017 };
2018
2019 addr is the address of the change.
2020
2021 old_len
2022 is the length of the old text, in bytes.
2023
2024 new_len
2025 is the length of the new text, in bytes.
2026
2027 bytes contains old bytes immediately followed by new bytes.
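
To illustrate how the bytes array is split, here is a sketch that models the part of the record after the header and returns pointers to the old and new text. The struct and function names are hypothetical; the layout follows the structure shown above.

```c
#include <stdint.h>

/* Body of a PERF_RECORD_TEXT_POKE record, following the header.
 * (Illustrative type; field order mirrors the struct shown above.) */
struct text_poke_body {
    uint64_t addr;
    uint16_t old_len;
    uint16_t new_len;
    uint8_t  bytes[];   /* old_len old bytes, then new_len new bytes */
};

/* Start of the old text. */
const uint8_t *text_poke_old(const struct text_poke_body *tp)
{
    return tp->bytes;
}

/* Start of the new text, which immediately follows the old text. */
const uint8_t *text_poke_new(const struct text_poke_body *tp)
{
    return tp->bytes + tp->old_len;
}
```

When old_len is zero both pointers coincide, which matches the case of newly added text.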
2028
2029 Overflow handling
2030 Events can be set to notify when a threshold is crossed, indicating an
2031 overflow. Overflow conditions can be captured by monitoring the event
2032 file descriptor with poll(2), select(2), or epoll(7). Alternatively,
2033 the overflow events can be captured via a signal handler, by enabling
2034 I/O signaling on the file descriptor; see the discussion of the F_SE‐
2035 TOWN and F_SETSIG operations in fcntl(2).
2036
2037 Overflows are generated only by sampling events (sample_period must
2038 have a nonzero value).
2039
2040 There are two ways to generate overflow notifications.
2041
2042 The first is to set a wakeup_events or wakeup_watermark value that will
2043 trigger if a certain number of samples or bytes have been written to
2044 the mmap ring buffer. In this case, POLL_IN is indicated.
2045
2046 The other way is by use of the PERF_EVENT_IOC_REFRESH ioctl. This
2047 ioctl adds to a counter that decrements each time the event overflows.
2048 When nonzero, POLL_IN is indicated, but once the counter reaches 0
2049 POLL_HUP is indicated and the underlying event is disabled.
2050
2051 Refreshing an event group leader refreshes all siblings and refreshing
2052 with a parameter of 0 currently enables infinite refreshes; these be‐
2053 haviors are unsupported and should not be relied on.
2054
2055 Starting with Linux 3.18, POLL_HUP is indicated if the event being mon‐
2056 itored is attached to a different process and that process exits.
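
The polling approach can be sketched as follows. The helper name is hypothetical, and it is assumed here that poll(2) reports the two conditions through the POLLIN and POLLHUP bits of revents (POLL_IN and POLL_HUP being the corresponding si_code values seen by a signal handler).

```c
#include <poll.h>

/* Wait for an overflow notification on an event file descriptor.
 * Returns 1 if data is ready (POLLIN), 0 if the event has hung up
 * (POLLHUP), and -1 on timeout or error. (Hypothetical helper.) */
int wait_for_overflow(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n <= 0)
        return -1;
    if (pfd.revents & POLLIN)
        return 1;
    if (pfd.revents & POLLHUP)
        return 0;
    return -1;
}
```

A monitoring loop would call this repeatedly, draining the mmap ring buffer on 1 and stopping (or issuing another PERF_EVENT_IOC_REFRESH) on 0.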
2057
2058 rdpmc instruction
2059 Starting with Linux 3.4 on x86, you can use the rdpmc instruction to
2060 get low-latency reads without having to enter the kernel. Note that
2061 using rdpmc is not necessarily faster than other methods for reading
2062 event values.
2063
2064 Support for this can be detected with the cap_usr_rdpmc field in the
2065 mmap page; documentation on how to calculate event values can be found
2066 in that section.
2067
2068 Originally, when rdpmc support was enabled, any process (not just ones
2069 with an active perf event) could use the rdpmc instruction to access
2070 the counters. Starting with Linux 4.0, rdpmc support is only allowed
2071 if an event is currently enabled in a process's context. To restore
2072 the old behavior, write the value 2 to /sys/devices/cpu/rdpmc.
2073
2074 perf_event ioctl calls
2075 Various ioctls act on perf_event_open() file descriptors:
2076
2077 PERF_EVENT_IOC_ENABLE
2078 This enables the individual event or event group specified by
2079 the file descriptor argument.
2080
2081 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
2082 then all events in a group are enabled, even if the event speci‐
2083 fied is not the group leader (but see BUGS).
2084
2085 PERF_EVENT_IOC_DISABLE
2086 This disables the individual counter or event group specified by
2087 the file descriptor argument.
2088
2089 Enabling or disabling the leader of a group enables or disables
2090 the entire group; that is, while the group leader is disabled,
2091 none of the counters in the group will count. Enabling or dis‐
2092 abling a member of a group other than the leader affects only
2093 that counter; disabling a non-leader stops that counter from
2094 counting but doesn't affect any other counter.
2095
2096 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
2097 then all events in a group are disabled, even if the event spec‐
2098 ified is not the group leader (but see BUGS).
2099
2100 PERF_EVENT_IOC_REFRESH
2101 Non-inherited overflow counters can use this to enable a counter
2102 for a number of overflows specified by the argument, after which
2103 it is disabled. Subsequent calls of this ioctl add the argument
2104 value to the current count. An overflow notification with
2105 POLL_IN set will happen on each overflow until the count reaches
2106 0; when that happens a notification with POLL_HUP set is sent
2107 and the event is disabled. Using an argument of 0 is considered
2108 undefined behavior.
2109
2110 PERF_EVENT_IOC_RESET
2111 Reset the event count specified by the file descriptor argument
2112 to zero. This resets only the counts; there is no way to reset
2113 the multiplexing time_enabled or time_running values.
2114
2115 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
2116 then all events in a group are reset, even if the event speci‐
2117 fied is not the group leader (but see BUGS).
2118
2119 PERF_EVENT_IOC_PERIOD
2120 This updates the overflow period for the event.
2121
2122 Since Linux 3.7 (on ARM) and Linux 3.14 (all other architec‐
2123 tures), the new period takes effect immediately. On older ker‐
2124 nels, the new period did not take effect until after the next
2125 overflow.
2126
2127 The argument is a pointer to a 64-bit value containing the de‐
2128 sired new period.
2129
2130 Prior to Linux 2.6.36, this ioctl always failed due to a bug in
2131 the kernel.
2132
2133 PERF_EVENT_IOC_SET_OUTPUT
2134 This tells the kernel to report event notifications to the spec‐
2135 ified file descriptor rather than the default one. The file de‐
2136 scriptors must all be on the same CPU.
2137
2138 The argument specifies the desired file descriptor, or -1 if
2139 output should be ignored.
2140
2141 PERF_EVENT_IOC_SET_FILTER (since Linux 2.6.33)
2142 This adds an ftrace filter to this event.
2143
2144 The argument is a pointer to the desired ftrace filter.
2145
2146 PERF_EVENT_IOC_ID (since Linux 3.12)
2147 This returns the event ID value for the given event file de‐
2148 scriptor.
2149
2150 The argument is a pointer to a 64-bit unsigned integer to hold
2151 the result.
2152
2153 PERF_EVENT_IOC_SET_BPF (since Linux 4.1)
2154 This allows attaching a Berkeley Packet Filter (BPF) program to
2155 an existing kprobe tracepoint event. You need CAP_PERFMON
2156 (since Linux 5.8) or CAP_SYS_ADMIN privileges to use this ioctl.
2157
2158 The argument is a BPF program file descriptor that was created
2159 by a previous bpf(2) system call.
2160
2161 PERF_EVENT_IOC_PAUSE_OUTPUT (since Linux 4.7)
2162 This allows pausing and resuming the event's ring-buffer. A
2163 paused ring-buffer does not prevent generation of samples, but
2164 simply discards them. The discarded samples are considered
2165 lost, and cause a PERF_RECORD_LOST sample to be generated when
2166 possible. An overflow signal may still be triggered by the dis‐
2167 carded sample even though the ring-buffer remains empty.
2168
2169 The argument is an unsigned 32-bit integer. A nonzero value
2170 pauses the ring-buffer, while a zero value resumes the ring-buf‐
2171 fer.
2172
2173 PERF_EVENT_IOC_MODIFY_ATTRIBUTES (since Linux 4.17)
2174 This allows modifying an existing event without the overhead of
2175 closing and reopening a new event. Currently this is supported
2176 only for breakpoint events.
2177
2178 The argument is a pointer to a perf_event_attr structure con‐
2179 taining the updated event settings.
2180
2181 PERF_EVENT_IOC_QUERY_BPF (since Linux 4.16)
2182 This allows querying which Berkeley Packet Filter (BPF) programs
2183 are attached to an existing kprobe tracepoint. You can only at‐
2184 tach one BPF program per event, but you can have multiple events
2185 attached to a tracepoint. Querying this value on one tracepoint
2186 event returns the ID of all BPF programs in all events attached
2187 to the tracepoint. You need CAP_PERFMON (since Linux 5.8) or
2188 CAP_SYS_ADMIN privileges to use this ioctl.
2189
2190 The argument is a pointer to a structure of the following form:
2191 struct perf_event_query_bpf {
2192 __u32 ids_len;
2193 __u32 prog_cnt;
2194 __u32 ids[0];
2195 };
2196
2197 The ids_len field indicates the number of ids that can fit in
2198 the provided ids array. The prog_cnt value is filled in by the
2199 kernel with the number of attached BPF programs. The ids array
2200 is filled with the ID of each attached BPF program. If there
2201 are more programs than will fit in the array, then the kernel
2202 will return ENOSPC and ids_len will indicate the number of pro‐
2203 gram IDs that were successfully copied.
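
Because the structure ends in a variable-length array, the caller must size the allocation and fill in ids_len before the ioctl. A sketch of that preparation step, with a hypothetical helper name and assuming a <linux/perf_event.h> recent enough to define the structure:

```c
#include <stdint.h>
#include <stdlib.h>
#include <linux/perf_event.h>

/* Allocate a zeroed query buffer with room for max_ids program IDs and
 * record that capacity in ids_len. The caller frees the result.
 * (Hypothetical helper.) */
struct perf_event_query_bpf *alloc_bpf_query(uint32_t max_ids)
{
    struct perf_event_query_bpf *q;

    q = calloc(1, sizeof(*q) + (size_t)max_ids * sizeof(q->ids[0]));
    if (q)
        q->ids_len = max_ids;
    return q;
}
```

The returned pointer would then be passed as the argument of ioctl(fd, PERF_EVENT_IOC_QUERY_BPF, q); on ENOSPC a caller could retry with a larger buffer sized from prog_cnt.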
2204
2205 Using prctl(2)
2206 A process can enable or disable all currently open event groups using
2207 the prctl(2) PR_TASK_PERF_EVENTS_ENABLE and PR_TASK_PERF_EVENTS_DISABLE
2208 operations. This applies only to events created locally by the calling
2209 process. This does not apply to events created by other processes at‐
2210 tached to the calling process or inherited events from a parent
2211 process. Only group leaders are enabled and disabled, not any other
2212 members of the groups.
2213
2214 perf_event related configuration files
2215 Files in /proc/sys/kernel/
2216
2217 /proc/sys/kernel/perf_event_paranoid
2218 The perf_event_paranoid file can be set to restrict access
2219 to the performance counters.
2220
2221 2 allow only user-space measurements (default since Linux
2222 4.6).
2223 1 allow both kernel and user measurements (default before
2224 Linux 4.6).
2225 0 allow access to CPU-specific data but not raw tracepoint
2226 samples.
2227 -1 no restrictions.
2228
2229 The existence of the perf_event_paranoid file is the offi‐
2230 cial method for determining if a kernel supports
2231 perf_event_open().
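
A program can combine the existence check with reading the current level. The sketch below uses a hypothetical general-purpose helper that reads one integer from a file; nothing in it is specific to perf.

```c
#include <stdio.h>

/* Read a single integer from a procfs or sysfs file.
 * Returns 0 on success, -1 if the file cannot be opened or does not
 * start with an integer. (Hypothetical helper.) */
int read_int_file(const char *path, long *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Calling read_int_file("/proc/sys/kernel/perf_event_paranoid", &level) and getting -1 suggests that perf_event_open() is unavailable; otherwise level holds one of the values listed above.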
2232
2233 /proc/sys/kernel/perf_event_max_sample_rate
2234 This sets the maximum sample rate. Setting this too high
2235 can allow users to sample at a rate that impacts overall ma‐
2236 chine performance and potentially lock up the machine. The
2237 default value is 100000 (samples per second).
2238
2239 /proc/sys/kernel/perf_event_max_stack
2240 This file sets the maximum depth of stack frame entries re‐
2241 ported when generating a call trace.
2242
2243 /proc/sys/kernel/perf_event_mlock_kb
2244 Maximum amount of memory, in kilobytes, that an unprivileged
2245 user can mlock(2). The default is 516 (kB).
2246
2247 Files in /sys/bus/event_source/devices/
2248
2249 Since Linux 2.6.34, the kernel supports having multiple PMUs avail‐
2250 able for monitoring. Information on how to program these PMUs can
2251 be found under /sys/bus/event_source/devices/. Each subdirectory
2252 corresponds to a different PMU.
2253
2254 /sys/bus/event_source/devices/*/type (since Linux 2.6.38)
2255 This contains an integer that can be used in the type field
2256 of perf_event_attr to indicate that you wish to use this
2257 PMU.
2258
2259 /sys/bus/event_source/devices/cpu/rdpmc (since Linux 3.4)
2260 If this file is 1, then direct user-space access to the per‐
2261 formance counter registers is allowed via the rdpmc instruc‐
2262 tion. This can be disabled by echoing 0 to the file.
2263
2264 As of Linux 4.0 the behavior has changed, so that 1 now
2265 means only allow access to processes with active perf
2266 events, with 2 indicating the old allow-anyone-access behav‐
2267 ior.
2268
2269 /sys/bus/event_source/devices/*/format/ (since Linux 3.4)
2270 This subdirectory contains information on the architecture-
2271 specific subfields available for programming the various
2272 config fields in the perf_event_attr struct.
2273
2274 The content of each file is the name of the config field,
2275 followed by a colon, followed by a series of integer bit
2276 ranges separated by commas. For example, the file event may
2277 contain the value config1:1,6-10,44 which indicates that
2278 event is an attribute that occupies bits 1, 6–10, and 44 of
2279 perf_event_attr::config1.
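
The format syntax is simple enough to parse by hand. The following sketch, with a hypothetical function name, turns a string such as the one in the example into a field name and a 64-bit mask of the selected bits.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse a format spec such as "config1:1,6-10,44". Stores the config
 * field name in field and the selected bits in *mask.
 * Returns 0 on success, -1 on malformed input. (Hypothetical helper.) */
int parse_format(const char *spec, char *field, size_t field_sz,
                 uint64_t *mask)
{
    const char *colon = strchr(spec, ':');
    const char *p;
    size_t len;

    if (!colon || colon == spec)
        return -1;
    len = (size_t)(colon - spec);
    if (len >= field_sz)
        return -1;
    memcpy(field, spec, len);
    field[len] = '\0';

    *mask = 0;
    p = colon + 1;
    while (*p) {
        char *end;
        unsigned long lo, hi;

        lo = hi = strtoul(p, &end, 10);     /* single bit or range start */
        if (end == p)
            return -1;
        if (*end == '-') {                  /* range: parse its end */
            p = end + 1;
            hi = strtoul(p, &end, 10);
            if (end == p)
                return -1;
        }
        if (lo > hi || hi > 63)
            return -1;
        for (unsigned long b = lo; b <= hi; b++)
            *mask |= 1ULL << b;
        if (*end == ',')
            end++;
        else if (*end != '\0' && *end != '\n')
            return -1;
        p = (*end == '\n') ? end + 1 : end; /* tolerate trailing newline */
    }
    return 0;
}
```

With the example above, field becomes "config1" and mask has bits 1, 6 through 10, and 44 set.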
2280
2281 /sys/bus/event_source/devices/*/events/ (since Linux 3.4)
2282 This subdirectory contains files with predefined events.
2283 The contents are strings describing the event settings ex‐
2284 pressed in terms of the fields found in the previously men‐
2285 tioned ./format/ directory. These are not necessarily com‐
2286 plete lists of all events supported by a PMU, but usually a
2287 subset of events deemed useful or interesting.
2288
2289 The content of each file is a list of attribute names sepa‐
2290 rated by commas. Each entry has an optional value (either
2291 hex or decimal). If no value is specified, then it is as‐
2292 sumed to be a single-bit field with a value of 1. An exam‐
2293 ple entry may look like this: event=0x2,inv,ldlat=3.
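
A lookup over such an entry might be sketched as below. The function name is hypothetical, and the sketch handles only the syntax described above (names with optional hex or decimal values).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Look up one attribute in an events entry such as
 * "event=0x2,inv,ldlat=3". Returns 1 and stores the value if name is
 * present (a bare name counts as 1), 0 if it is absent, and -1 if its
 * value is not a valid number. (Hypothetical helper.) */
int event_term(const char *spec, const char *name, uint64_t *value)
{
    size_t nlen = strlen(name);
    const char *p = spec;

    while (*p && *p != '\n') {
        size_t tlen = strcspn(p, ",\n");    /* length of this term */
        size_t klen = strcspn(p, "=,\n");   /* length of its name */

        if (klen == nlen && strncmp(p, name, nlen) == 0) {
            char *end;

            if (klen == tlen) {             /* bare name: single bit */
                *value = 1;
                return 1;
            }
            /* Base 0 accepts both "0x..." and decimal values. */
            *value = strtoull(p + klen + 1, &end, 0);
            if (end != p + tlen || end == p + klen + 1)
                return -1;
            return 1;
        }
        p += tlen;
        if (*p == ',')
            p++;
    }
    return 0;
}
```

Each value found this way would then be placed into the bit positions given by the matching file in the format directory.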
2294
2295 /sys/bus/event_source/devices/*/uevent
2296 This file is the standard kernel device interface for in‐
2297 jecting hotplug events.
2298
2299 /sys/bus/event_source/devices/*/cpumask (since Linux 3.7)
2300 The cpumask file contains a comma-separated list of integers
2301 that indicate a representative CPU number for each socket
2302 (package) on the motherboard. This is needed when setting
2303 up uncore or northbridge events, as those PMUs present
2304 socket-wide events.
2305
2307 On success, perf_event_open() returns the new file descriptor. On er‐
2308 ror, -1 is returned and errno is set to indicate the error.
2309
2311 The errors returned by perf_event_open() can be inconsistent, and may
2312 vary across processor architectures and performance monitoring units.
2313
2314 E2BIG Returned if the perf_event_attr size value is too small (smaller
2315 than PERF_ATTR_SIZE_VER0), too big (larger than the page size),
2316 or larger than the kernel supports and the extra bytes are not
2317 zero. When E2BIG is returned, the perf_event_attr size field is
2318 overwritten by the kernel to be the size of the structure it was
2319 expecting.
2320
2321 EACCES Returned when the requested event requires CAP_PERFMON (since
2322 Linux 5.8) or CAP_SYS_ADMIN permissions (or a more permissive
2323 perf_event paranoid setting). Some common cases where an un‐
2324 privileged process may encounter this error: attaching to a
2325 process owned by a different user; monitoring all processes on a
2326 given CPU (i.e., specifying the pid argument as -1); and not
2327 setting exclude_kernel when the paranoid setting requires it.
2328
2329 EBADF Returned if the group_fd file descriptor is not valid, or, if
2330 PERF_FLAG_PID_CGROUP is set, the cgroup file descriptor in pid
2331 is not valid.
2332
2333 EBUSY (since Linux 4.1)
2334 Returned if another event already has exclusive access to the
2335 PMU.
2336
2337 EFAULT Returned if the attr pointer points at an invalid memory ad‐
2338 dress.
2339
2340 EINTR Returned when trying to mix perf and ftrace handling for a up‐
2341 robe.
2342
2343 EINVAL Returned if the specified event is invalid. There are many
2344 possible reasons for this. A non-exhaustive list: sample_freq is
2345 higher than the maximum setting; the cpu to monitor does not ex‐
2346 ist; read_format is out of range; sample_type is out of range;
2347 the flags value is out of range; exclusive or pinned set and the
2348 event is not a group leader; the event config values are out of
2349 range or set reserved bits; the generic event selected is not
2350 supported; or there is not enough room to add the selected
2351 event.
2352
2353 EMFILE Each opened event uses one file descriptor. If a large number
2354 of events are opened, the per-process limit on the number of
2355 open file descriptors will be reached, and no more events can be
2356 created.
2357
2358 ENODEV Returned when the event involves a feature not supported by the
2359 current CPU.
2360
2361 ENOENT Returned if the type setting is not valid. This error is also
2362 returned for some unsupported generic events.
2363
2364 ENOSPC Prior to Linux 3.3, if there was not enough room for the event,
2365 ENOSPC was returned. In Linux 3.3, this was changed to EINVAL.
2366 ENOSPC is still returned if you try to add more breakpoint
2367 events than supported by the hardware.
2368
2369 ENOSYS Returned if PERF_SAMPLE_STACK_USER is set in sample_type and it
2370 is not supported by hardware.
2371
2372 EOPNOTSUPP
2373 Returned if an event requiring a specific hardware feature is
2374 requested but there is no hardware support. This includes re‐
2375 questing low-skid events if not supported, branch tracing if it
2376 is not available, sampling if no PMU interrupt is available, and
2377 branch stacks for software events.
2378
2379 EOVERFLOW (since Linux 4.8)
2380 Returned if PERF_SAMPLE_CALLCHAIN is requested and sam‐
2381 ple_max_stack is larger than the maximum specified in
2382 /proc/sys/kernel/perf_event_max_stack.
2383
2384 EPERM Returned on many (but not all) architectures when an unsupported
2385 exclude_hv, exclude_idle, exclude_user, or exclude_kernel set‐
2386 ting is specified.
2387
2388 It can also happen, as with EACCES, when the requested event re‐
2389 quires CAP_PERFMON (since Linux 5.8) or CAP_SYS_ADMIN permis‐
2390 sions (or a more permissive perf_event paranoid setting). This
2391 includes setting a breakpoint on a kernel address, and (since
2392 Linux 3.13) setting a kernel function-trace tracepoint.
2393
2394 ESRCH Returned if attempting to attach to a process that does not ex‐
2395 ist.
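
Since several of these errors have more than one possible cause, a tool may want to print a hint alongside the errno value. The sketch below is purely illustrative: the function name and the wording of the hints are assumptions, condensed from the descriptions above.

```c
#include <errno.h>

/* Map an errno value from a failed perf_event_open() call to a short
 * hint. Hint texts are illustrative only. (Hypothetical helper.) */
const char *perf_open_hint(int err)
{
    switch (err) {
    case EACCES:
        return "permission denied: check perf_event_paranoid, "
               "CAP_PERFMON, or set exclude_kernel";
    case EPERM:
        return "unsupported exclude_* setting, or missing privilege";
    case EINVAL:
        return "invalid attr: check config, sample_type, read_format, cpu";
    case ENOENT:
        return "type not valid, or generic event not supported";
    case ENODEV:
    case EOPNOTSUPP:
        return "feature not supported by this CPU or PMU";
    case EMFILE:
        return "too many open file descriptors";
    case ESRCH:
        return "target process does not exist";
    case EBUSY:
        return "another event has exclusive access to the PMU";
    default:
        return "no specific hint";
    }
}
```

A caller would typically print this next to strerror(errno) right after the failed call.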
2396
2398 perf_event_open() was introduced in Linux 2.6.31 but was called
2399 perf_counter_open(). It was renamed in Linux 2.6.32.
2400
2402 The perf_event_open() system call is Linux-specific and should not be
2403 used in programs intended to be portable.
2404
2406 The official way of knowing if perf_event_open() support is enabled is
2407 checking for the existence of the file /proc/sys/ker‐
2408 nel/perf_event_paranoid.
2409
2410 The CAP_PERFMON capability (since Linux 5.8) provides a secure approach
2411 to performance monitoring and observability operations in a system,
2412 according to the principle of least privilege (POSIX IEEE 1003.1e).
2413 Accessing system performance monitoring and observability operations using
2414 CAP_PERFMON rather than the much more powerful CAP_SYS_ADMIN reduces
2415 the opportunity to misuse credentials and makes operations more secure.
2416 CAP_SYS_ADMIN usage for secure system performance monitoring and
2417 observability is discouraged in favor of the CAP_PERFMON capability.
2418
2420 The F_SETOWN_EX option to fcntl(2) is needed to properly get overflow
2421 signals in threads. This was introduced in Linux 2.6.32.
2422
2423 Prior to Linux 2.6.33 (at least for x86), the kernel did not check if
2424 events could be scheduled together until read time. The same happens
2425 on all known kernels if the NMI watchdog is enabled. This means that, to
2426 see whether a given set of events works, you must call perf_event_open(),
2427 start the events, and read them before you can be sure the results are valid.
2428
2429 Prior to Linux 2.6.34, event constraints were not enforced by the ker‐
2430 nel. In that case, some events would silently return "0" if the kernel
2431 scheduled them in an improper counter slot.
2432
2433 Prior to Linux 2.6.34, there was a bug when multiplexing where the
2434 wrong results could be returned.
2435
2436 Kernels from Linux 2.6.35 to Linux 2.6.39 can quickly crash the kernel
2437 if "inherit" is enabled and many threads are started.
2438
2439 Prior to Linux 2.6.35, PERF_FORMAT_GROUP did not work with attached
2440 processes.
2441
2442 There is a bug in the kernel code between Linux 2.6.36 and Linux 3.0
2443 that ignores the "watermark" field and acts as if a wakeup_event was
2444 chosen if the union has a nonzero value in it.
2445
2446 From Linux 2.6.31 to Linux 3.4, the PERF_IOC_FLAG_GROUP ioctl argument
2447 was broken and would repeatedly operate on the event specified rather
2448 than iterating across all sibling events in a group.
2449
2450 From Linux 3.4 to Linux 3.11, the mmap cap_usr_rdpmc and cap_usr_time
2451 bits mapped to the same location. Code should migrate to the new
2452 cap_user_rdpmc and cap_user_time fields instead.
2453
2454 Always double-check your results! Various generalized events have had
2455 wrong values. For example, retired branches measured the wrong thing
2456 on AMD machines until Linux 2.6.35.
2457
2459 The following is a short example that measures the total instruction
2460 count of a call to printf(3).
2461
    #include <linux/perf_event.h>    /* Definition of PERF_* constants */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>         /* Definition of SYS_* constants */
    #include <unistd.h>

    /* glibc provides no wrapper, so invoke the system call directly. */
    static long
    perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                    int cpu, int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, hw_event, pid, cpu,
                       group_fd, flags);
    }

    int
    main(void)
    {
        struct perf_event_attr pe;
        long long count;
        int fd;

        /* Count user-space instructions; the event starts disabled. */
        memset(&pe, 0, sizeof(pe));
        pe.type = PERF_TYPE_HARDWARE;
        pe.size = sizeof(pe);
        pe.config = PERF_COUNT_HW_INSTRUCTIONS;
        pe.disabled = 1;
        pe.exclude_kernel = 1;
        pe.exclude_hv = 1;

        /* Measure the calling process on any CPU, with no group. */
        fd = perf_event_open(&pe, 0, -1, -1, 0);
        if (fd == -1) {
            fprintf(stderr, "Error opening leader %llx\n",
                    (unsigned long long) pe.config);
            exit(EXIT_FAILURE);
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        printf("Measuring instruction count for this printf\n");

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        if (read(fd, &count, sizeof(count)) != sizeof(count)) {
            perror("read");
            exit(EXIT_FAILURE);
        }

        printf("Used %lld instructions\n", count);

        close(fd);
        exit(EXIT_SUCCESS);
    }
2514
2516 perf(1), fcntl(2), mmap(2), open(2), prctl(2), read(2)
2517
2518 Documentation/admin-guide/perf-security.rst in the kernel source tree
2519
2521 This page is part of release 5.13 of the Linux man-pages project. A
2522 description of the project, information about reporting bugs, and the
2523 latest version of this page, can be found at
2524 https://www.kernel.org/doc/man-pages/.
2525
2526
2527
2528Linux 2021-08-27 PERF_EVENT_OPEN(2)