PERF_EVENT_OPEN(2)         Linux Programmer's Manual         PERF_EVENT_OPEN(2)

NAME
       perf_event_open - set up performance monitoring

SYNOPSIS
       #include <linux/perf_event.h>
       #include <linux/hw_breakpoint.h>

       int perf_event_open(struct perf_event_attr *attr,
                           pid_t pid, int cpu, int group_fd,
                           unsigned long flags);

       Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
       Given a list of parameters, perf_event_open() returns a file
       descriptor, for use in subsequent system calls (read(2), mmap(2),
       prctl(2), fcntl(2), etc.).
22
23 A call to perf_event_open() creates a file descriptor that allows mea‐
24 suring performance information. Each file descriptor corresponds to
25 one event that is measured; these can be grouped together to measure
26 multiple events simultaneously.
27
28 Events can be enabled and disabled in two ways: via ioctl(2) and via
29 prctl(2). When an event is disabled it does not count or generate
30 overflows but does continue to exist and maintain its count value.
31
32 Events come in two flavors: counting and sampled. A counting event is
33 one that is used for counting the aggregate number of events that
34 occur. In general, counting event results are gathered with a read(2)
35 call. A sampling event periodically writes measurements to a buffer
36 that can then be accessed via mmap(2).
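The counting flavor described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: because glibc provides no wrapper, the call is made through syscall(2); count_instructions() and busy_loop() are hypothetical helper names, and the choice of PERF_COUNT_HW_INSTRUCTIONS with kernel and hypervisor excluded is only an assumption made for the example.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper: glibc does not provide one for this system call. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Example workload (hypothetical): a short busy loop. */
static void busy_loop(void)
{
    volatile unsigned long i;

    for (i = 0; i < 100000; i++)
        ;
}

/*
 * Count user-space instructions retired while work() runs in the
 * calling thread.  Returns 0 and stores the count in *count on
 * success, or -1 if the event cannot be opened or read (for example,
 * because of perf_event_paranoid or missing hardware support).
 */
static int count_instructions(void (*work)(void), uint64_t *count)
{
    struct perf_event_attr attr;
    int fd, ok;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;              /* start disabled; enabled below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this thread, any CPU */
    if (fd == -1)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* With read_format == 0, read(2) returns a single 64-bit count. */
    ok = read(fd, count, sizeof(*count)) == (ssize_t)sizeof(*count);
    close(fd);
    return ok ? 0 : -1;
}
```

A caller would invoke count_instructions(busy_loop, &n) and treat a return value of -1 as "counter unavailable" rather than as a fatal error, since availability depends on hardware and system policy.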
37
38 Arguments
39 The pid and cpu arguments allow specifying which process and CPU to
40 monitor:
41
42 pid == 0 and cpu == -1
43 This measures the calling process/thread on any CPU.
44
45 pid == 0 and cpu >= 0
46 This measures the calling process/thread only when running on
47 the specified CPU.
48
49 pid > 0 and cpu == -1
50 This measures the specified process/thread on any CPU.
51
52 pid > 0 and cpu >= 0
53 This measures the specified process/thread only when running on
54 the specified CPU.
55
56 pid == -1 and cpu >= 0
57 This measures all processes/threads on the specified CPU. This
58 requires CAP_SYS_ADMIN capability or a /proc/sys/ker‐
59 nel/perf_event_paranoid value of less than 1.
60
61 pid == -1 and cpu == -1
62 This setting is invalid and will return an error.
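The combinations listed above can be summarized in a small helper. This is only a sketch: the enum and function names are hypothetical, the helper assumes PERF_FLAG_PID_CGROUP is not in use (in which case pid is a file descriptor), and treating values below -1 as invalid is an assumption, since the list above does not cover them.

```c
#include <sys/types.h>

enum target_kind {                /* hypothetical names, for illustration */
    TARGET_SELF_ANY_CPU,          /* pid == 0,  cpu == -1 */
    TARGET_SELF_ONE_CPU,          /* pid == 0,  cpu >= 0  */
    TARGET_TASK_ANY_CPU,          /* pid > 0,   cpu == -1 */
    TARGET_TASK_ONE_CPU,          /* pid > 0,   cpu >= 0  */
    TARGET_ALL_ON_CPU,            /* pid == -1, cpu >= 0  */
    TARGET_INVALID                /* pid == -1, cpu == -1 */
};

static enum target_kind classify_target(pid_t pid, int cpu)
{
    /* Assumption: values below -1 are not described, so reject them. */
    if (pid < -1 || cpu < -1)
        return TARGET_INVALID;
    if (pid == -1)
        return cpu >= 0 ? TARGET_ALL_ON_CPU : TARGET_INVALID;
    if (pid == 0)
        return cpu >= 0 ? TARGET_SELF_ONE_CPU : TARGET_SELF_ANY_CPU;
    return cpu >= 0 ? TARGET_TASK_ONE_CPU : TARGET_TASK_ANY_CPU;
}
```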
63
64 When pid is greater than zero, permission to perform this system call
65 is governed by a ptrace access mode PTRACE_MODE_READ_REALCREDS check;
66 see ptrace(2).
67
68 The group_fd argument allows event groups to be created. An event
69 group has one event which is the group leader. The leader is created
70 first, with group_fd = -1. The rest of the group members are created
71 with subsequent perf_event_open() calls with group_fd being set to the
72 file descriptor of the group leader. (A single event on its own is
73 created with group_fd = -1 and is considered to be a group with only 1
74 member.) An event group is scheduled onto the CPU as a unit: it will
75 be put onto the CPU only if all of the events in the group can be put
76 onto the CPU. This means that the values of the member events can be
77 meaningfully compared—added, divided (to get ratios), and so on—with
78 each other, since they have counted events for the same set of executed
79 instructions.
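As an illustration of the group_fd rules above, the following sketch opens a two-event group on the calling thread: a leader created with group_fd = -1 and one member created with group_fd set to the leader's file descriptor. The helper names are hypothetical, and the choice of cycles and instructions as the two events is an assumption made for the example.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper: glibc does not provide one for this system call. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static void init_hw_attr(struct perf_event_attr *attr, __u64 config,
                         int disabled)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = config;
    attr->disabled = disabled;
    attr->exclude_kernel = 1;
    attr->exclude_hv = 1;
}

/*
 * Open a group with cycles as leader and instructions as member.
 * On success, returns 0 with fds[0] = leader and fds[1] = member.
 * On failure, returns -1, sets fds[0] to -1, and leaves no
 * descriptors open.
 */
static int open_cycles_instructions_group(int fds[2])
{
    struct perf_event_attr attr;

    init_hw_attr(&attr, PERF_COUNT_HW_CPU_CYCLES, 1);
    fds[0] = perf_event_open(&attr, 0, -1, -1, 0);      /* leader */
    if (fds[0] == -1)
        return -1;

    init_hw_attr(&attr, PERF_COUNT_HW_INSTRUCTIONS, 0);
    fds[1] = perf_event_open(&attr, 0, -1, fds[0], 0);  /* member */
    if (fds[1] == -1) {
        close(fds[0]);
        fds[0] = -1;
        return -1;
    }
    return 0;
}
```

The leader starts disabled so that both events begin counting together once the leader is enabled, for instance with ioctl(fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP).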
80
81 The flags argument is formed by ORing together zero or more of the fol‐
82 lowing values:
83
84 PERF_FLAG_FD_CLOEXEC (since Linux 3.14)
85 This flag enables the close-on-exec flag for the created event
86 file descriptor, so that the file descriptor is automatically
              closed on execve(2).  Setting the close-on-exec flag at
              creation time, rather than later with fcntl(2), avoids potential
89 race conditions where the calling thread invokes
90 perf_event_open() and fcntl(2) at the same time as another
91 thread calls fork(2) then execve(2).
92
93 PERF_FLAG_FD_NO_GROUP
94 This flag tells the event to ignore the group_fd parameter
95 except for the purpose of setting up output redirection using
96 the PERF_FLAG_FD_OUTPUT flag.
97
98 PERF_FLAG_FD_OUTPUT (broken since Linux 2.6.35)
99 This flag re-routes the event's sampled output to instead be
100 included in the mmap buffer of the event specified by group_fd.
101
102 PERF_FLAG_PID_CGROUP (since Linux 2.6.39)
103 This flag activates per-container system-wide monitoring. A
104 container is an abstraction that isolates a set of resources for
105 finer-grained control (CPUs, memory, etc.). In this mode, the
106 event is measured only if the thread running on the monitored
107 CPU belongs to the designated container (cgroup). The cgroup is
108 identified by passing a file descriptor opened on its directory
109 in the cgroupfs filesystem. For instance, if the cgroup to mon‐
110 itor is called test, then a file descriptor opened on
111 /dev/cgroup/test (assuming cgroupfs is mounted on /dev/cgroup)
112 must be passed as the pid parameter. cgroup monitoring is
113 available only for system-wide events and may therefore require
114 extra permissions.
115
116 The perf_event_attr structure provides detailed configuration informa‐
117 tion for the event being created.
118
           struct perf_event_attr {
               __u32 type;                 /* Type of event */
               __u32 size;                 /* Size of attribute structure */
               __u64 config;               /* Type-specific configuration */

               union {
                   __u64 sample_period;    /* Period of sampling */
                   __u64 sample_freq;      /* Frequency of sampling */
               };

               __u64 sample_type;  /* Specifies values included in sample */
               __u64 read_format;  /* Specifies values returned in read */

               __u64 disabled       : 1,   /* off by default */
                     inherit        : 1,   /* children inherit it */
                     pinned         : 1,   /* must always be on PMU */
                     exclusive      : 1,   /* only group on PMU */
                     exclude_user   : 1,   /* don't count user */
                     exclude_kernel : 1,   /* don't count kernel */
                     exclude_hv     : 1,   /* don't count hypervisor */
                     exclude_idle   : 1,   /* don't count when idle */
                     mmap           : 1,   /* include mmap data */
                     comm           : 1,   /* include comm data */
                     freq           : 1,   /* use freq, not period */
                     inherit_stat   : 1,   /* per task counts */
                     enable_on_exec : 1,   /* next exec enables */
                     task           : 1,   /* trace fork/exit */
                     watermark      : 1,   /* wakeup_watermark */
                     precise_ip     : 2,   /* skid constraint */
                     mmap_data      : 1,   /* non-exec mmap data */
                     sample_id_all  : 1,   /* sample_type all events */
                     exclude_host   : 1,   /* don't count in host */
                     exclude_guest  : 1,   /* don't count in guest */
                     exclude_callchain_kernel : 1,
                                           /* exclude kernel callchains */
                     exclude_callchain_user   : 1,
                                           /* exclude user callchains */
                     mmap2          : 1,   /* include mmap with inode data */
                     comm_exec      : 1,   /* flag comm events that are
                                              due to exec */
                     use_clockid    : 1,   /* use clockid for time fields */
                     context_switch : 1,   /* context switch data */

                     __reserved_1   : 37;

               union {
                   __u32 wakeup_events;    /* wakeup every n events */
                   __u32 wakeup_watermark; /* bytes before wakeup */
               };

               __u32     bp_type;          /* breakpoint type */

               union {
                   __u64 bp_addr;          /* breakpoint address */
                   __u64 kprobe_func;      /* for perf_kprobe */
                   __u64 uprobe_path;      /* for perf_uprobe */
                   __u64 config1;          /* extension of config */
               };

               union {
                   __u64 bp_len;           /* breakpoint length */
                   __u64 kprobe_addr;      /* with kprobe_func == NULL */
                   __u64 probe_offset;     /* for perf_[k,u]probe */
                   __u64 config2;          /* extension of config1 */
               };
               __u64 branch_sample_type;   /* enum perf_branch_sample_type */
               __u64 sample_regs_user;     /* user regs to dump on samples */
               __u32 sample_stack_user;    /* size of stack to dump on
                                              samples */
               __s32 clockid;              /* clock to use for time fields */
               __u64 sample_regs_intr;     /* regs to dump on samples */
               __u32 aux_watermark;        /* aux bytes before wakeup */
               __u16 sample_max_stack;     /* max frames in callchain */
               __u16 __reserved_2;         /* align to u64 */

           };
195
196 The fields of the perf_event_attr structure are described in more
197 detail below:
198
199 type This field specifies the overall event type. It has one of the
200 following values:
201
202 PERF_TYPE_HARDWARE
203 This indicates one of the "generalized" hardware events
204 provided by the kernel. See the config field definition
205 for more details.
206
207 PERF_TYPE_SOFTWARE
208 This indicates one of the software-defined events pro‐
209 vided by the kernel (even if no hardware support is
210 available).
211
212 PERF_TYPE_TRACEPOINT
213 This indicates a tracepoint provided by the kernel trace‐
214 point infrastructure.
215
216 PERF_TYPE_HW_CACHE
217 This indicates a hardware cache event. This has a spe‐
218 cial encoding, described in the config field definition.
219
220 PERF_TYPE_RAW
221 This indicates a "raw" implementation-specific event in
222 the config field.
223
224 PERF_TYPE_BREAKPOINT (since Linux 2.6.33)
225 This indicates a hardware breakpoint as provided by the
226 CPU. Breakpoints can be read/write accesses to an
227 address as well as execution of an instruction address.
228
229 dynamic PMU
230 Since Linux 2.6.38, perf_event_open() can support multi‐
231 ple PMUs. To enable this, a value exported by the kernel
232 can be used in the type field to indicate which PMU to
233 use. The value to use can be found in the sysfs filesys‐
234 tem: there is a subdirectory per PMU instance under
235 /sys/bus/event_source/devices. In each subdirectory
236 there is a type file whose content is an integer that can
237 be used in the type field. For instance,
238 /sys/bus/event_source/devices/cpu/type contains the value
239 for the core CPU PMU, which is usually 4.
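The sysfs lookup described above might be written as the following sketch. The function name read_pmu_type() and the fixed-size path buffer are assumptions made for illustration; only the /sys/bus/event_source/devices/<pmu>/type layout comes from the text.

```c
#include <stdio.h>

/*
 * Read the dynamic PMU type number for pmu_name (for example "cpu")
 * from sysfs.  Returns the value to place in the type field, or -1 if
 * the file cannot be read or parsed.
 */
static int read_pmu_type(const char *pmu_name)
{
    char path[256];
    FILE *f;
    int n, type = -1;

    n = snprintf(path, sizeof(path),
                 "/sys/bus/event_source/devices/%s/type", pmu_name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;              /* name does not fit in the buffer */

    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}
```

Reading the value at run time, rather than hard-coding it, avoids depending on a particular number being assigned to a PMU on a given system.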
240
241 kprobe and uprobe (since Linux 4.17)
242 These two dynamic PMUs create a kprobe/uprobe and attach
                     it to the file descriptor generated by perf_event_open().
244 The kprobe/uprobe will be destroyed on the destruction of
245 the file descriptor. See fields kprobe_func,
246 uprobe_path, kprobe_addr, and probe_offset for more
247 details.
248
249 size The size of the perf_event_attr structure for forward/backward
250 compatibility. Set this using sizeof(struct perf_event_attr) to
251 allow the kernel to see the struct size at the time of compila‐
252 tion.
253
254 The related define PERF_ATTR_SIZE_VER0 is set to 64; this was
255 the size of the first published struct. PERF_ATTR_SIZE_VER1 is
256 72, corresponding to the addition of breakpoints in Linux
257 2.6.33. PERF_ATTR_SIZE_VER2 is 80 corresponding to the addition
258 of branch sampling in Linux 3.4. PERF_ATTR_SIZE_VER3 is 96 cor‐
259 responding to the addition of sample_regs_user and sam‐
260 ple_stack_user in Linux 3.7. PERF_ATTR_SIZE_VER4 is 104 corre‐
261 sponding to the addition of sample_regs_intr in Linux 3.19.
262 PERF_ATTR_SIZE_VER5 is 112 corresponding to the addition of
263 aux_watermark in Linux 4.1.
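A common way to follow this advice is to zero the whole structure and then set size from sizeof, as in the sketch below. The helper name init_attr() is hypothetical; zeroing first is an assumption about good practice, made so that reserved and unused fields are not left uninitialized.

```c
#include <linux/perf_event.h>
#include <string.h>

/* Zero the structure, then record its compile-time size for the kernel. */
static void init_attr(struct perf_event_attr *attr, __u32 type,
                      __u64 config)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = type;
    attr->config = config;
}
```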
264
265 config This specifies which event you want, in conjunction with the
266 type field. The config1 and config2 fields are also taken into
267 account in cases where 64 bits is not enough to fully specify
              the event.  The encoding of these fields is event dependent.
269
270 There are various ways to set the config field that are depen‐
271 dent on the value of the previously described type field. What
272 follows are various possible settings for config separated out
273 by type.
274
275 If type is PERF_TYPE_HARDWARE, we are measuring one of the gen‐
276 eralized hardware CPU events. Not all of these are available on
277 all platforms. Set config to one of the following:
278
279 PERF_COUNT_HW_CPU_CYCLES
280 Total cycles. Be wary of what happens during CPU
281 frequency scaling.
282
283 PERF_COUNT_HW_INSTRUCTIONS
284 Retired instructions. Be careful, these can be
285 affected by various issues, most notably hardware
286 interrupt counts.
287
288 PERF_COUNT_HW_CACHE_REFERENCES
289 Cache accesses. Usually this indicates Last Level
290 Cache accesses but this may vary depending on your
291 CPU. This may include prefetches and coherency mes‐
292 sages; again this depends on the design of your CPU.
293
294 PERF_COUNT_HW_CACHE_MISSES
295 Cache misses. Usually this indicates Last Level
296 Cache misses; this is intended to be used in con‐
297 junction with the PERF_COUNT_HW_CACHE_REFERENCES
298 event to calculate cache miss rates.
299
300 PERF_COUNT_HW_BRANCH_INSTRUCTIONS
301 Retired branch instructions. Prior to Linux 2.6.35,
302 this used the wrong event on AMD processors.
303
304 PERF_COUNT_HW_BRANCH_MISSES
305 Mispredicted branch instructions.
306
307 PERF_COUNT_HW_BUS_CYCLES
308 Bus cycles, which can be different from total
309 cycles.
310
311 PERF_COUNT_HW_STALLED_CYCLES_FRONTEND (since Linux 3.0)
312 Stalled cycles during issue.
313
314 PERF_COUNT_HW_STALLED_CYCLES_BACKEND (since Linux 3.0)
315 Stalled cycles during retirement.
316
317 PERF_COUNT_HW_REF_CPU_CYCLES (since Linux 3.3)
318 Total cycles; not affected by CPU frequency scaling.
319
320 If type is PERF_TYPE_SOFTWARE, we are measuring software events
321 provided by the kernel. Set config to one of the following:
322
323 PERF_COUNT_SW_CPU_CLOCK
324 This reports the CPU clock, a high-resolution per-
325 CPU timer.
326
327 PERF_COUNT_SW_TASK_CLOCK
328 This reports a clock count specific to the task that
329 is running.
330
331 PERF_COUNT_SW_PAGE_FAULTS
332 This reports the number of page faults.
333
334 PERF_COUNT_SW_CONTEXT_SWITCHES
335 This counts context switches. Until Linux 2.6.34,
                          these were all reported as user-space events; after
                          that, they are reported as happening in the kernel.
338
339 PERF_COUNT_SW_CPU_MIGRATIONS
340 This reports the number of times the process has
341 migrated to a new CPU.
342
343 PERF_COUNT_SW_PAGE_FAULTS_MIN
344 This counts the number of minor page faults. These
345 did not require disk I/O to handle.
346
347 PERF_COUNT_SW_PAGE_FAULTS_MAJ
348 This counts the number of major page faults. These
349 required disk I/O to handle.
350
351 PERF_COUNT_SW_ALIGNMENT_FAULTS (since Linux 2.6.33)
352 This counts the number of alignment faults. These
353 happen when unaligned memory accesses happen; the
354 kernel can handle these but it reduces performance.
355 This happens only on some architectures (never on
356 x86).
357
358 PERF_COUNT_SW_EMULATION_FAULTS (since Linux 2.6.33)
359 This counts the number of emulation faults. The
360 kernel sometimes traps on unimplemented instructions
361 and emulates them for user space. This can nega‐
362 tively impact performance.
363
364 PERF_COUNT_SW_DUMMY (since Linux 3.12)
365 This is a placeholder event that counts nothing.
366 Informational sample record types such as mmap or
367 comm must be associated with an active event. This
368 dummy event allows gathering such records without
369 requiring a counting event.
370
371 If type is PERF_TYPE_TRACEPOINT, then we are measuring kernel
372 tracepoints. The value to use in config can be obtained from
373 under debugfs tracing/events/*/*/id if ftrace is enabled in the
374 kernel.
375
376 If type is PERF_TYPE_HW_CACHE, then we are measuring a hardware
377 CPU cache event. To calculate the appropriate config value use
378 the following equation:
379
                  (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) |
                  (perf_hw_cache_op_result_id << 16)
382
383 where perf_hw_cache_id is one of:
384
385 PERF_COUNT_HW_CACHE_L1D
386 for measuring Level 1 Data Cache
387
388 PERF_COUNT_HW_CACHE_L1I
389 for measuring Level 1 Instruction Cache
390
391 PERF_COUNT_HW_CACHE_LL
392 for measuring Last-Level Cache
393
394 PERF_COUNT_HW_CACHE_DTLB
395 for measuring the Data TLB
396
397 PERF_COUNT_HW_CACHE_ITLB
398 for measuring the Instruction TLB
399
400 PERF_COUNT_HW_CACHE_BPU
401 for measuring the branch prediction unit
402
403 PERF_COUNT_HW_CACHE_NODE (since Linux 3.1)
404 for measuring local memory accesses
405
406 and perf_hw_cache_op_id is one of:
407
408 PERF_COUNT_HW_CACHE_OP_READ
409 for read accesses
410
411 PERF_COUNT_HW_CACHE_OP_WRITE
412 for write accesses
413
414 PERF_COUNT_HW_CACHE_OP_PREFETCH
415 for prefetch accesses
416
417 and perf_hw_cache_op_result_id is one of:
418
419 PERF_COUNT_HW_CACHE_RESULT_ACCESS
420 to measure accesses
421
422 PERF_COUNT_HW_CACHE_RESULT_MISS
423 to measure misses
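The encoding above can be expressed as a small helper. This is a sketch; hw_cache_config() is a hypothetical name, and whether a given cache/operation/result combination is actually supported depends on the CPU.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Combine the three IDs into a config value for PERF_TYPE_HW_CACHE. */
static uint64_t hw_cache_config(unsigned int cache_id, unsigned int op_id,
                                unsigned int result_id)
{
    return (uint64_t)cache_id |
           ((uint64_t)op_id << 8) |
           ((uint64_t)result_id << 16);
}
```

For instance, L1 data cache read misses would be requested with hw_cache_config(PERF_COUNT_HW_CACHE_L1D, PERF_COUNT_HW_CACHE_OP_READ, PERF_COUNT_HW_CACHE_RESULT_MISS).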
424
425 If type is PERF_TYPE_RAW, then a custom "raw" config value is
426 needed. Most CPUs support events that are not covered by the
427 "generalized" events. These are implementation defined; see
428 your CPU manual (for example the Intel Volume 3B documentation
429 or the AMD BIOS and Kernel Developer Guide). The libpfm4
430 library can be used to translate from the name in the architec‐
431 tural manuals to the raw hex value perf_event_open() expects in
432 this field.
433
434 If type is PERF_TYPE_BREAKPOINT, then leave config set to zero.
435 Its parameters are set in other places.
436
437 If type is kprobe or uprobe, set retprobe (bit 0 of config, see
438 /sys/bus/event_source/devices/[k,u]probe/format/retprobe) for
439 kretprobe/uretprobe. See fields kprobe_func, uprobe_path,
440 kprobe_addr, and probe_offset for more details.
441
442 kprobe_func, uprobe_path, kprobe_addr, and probe_offset
443 These fields describe the kprobe/uprobe for dynamic PMUs kprobe
444 and uprobe. For kprobe: use kprobe_func and probe_offset, or
445 use kprobe_addr and leave kprobe_func as NULL. For uprobe: use
446 uprobe_path and probe_offset.
447
448 sample_period, sample_freq
449 A "sampling" event is one that generates an overflow notifica‐
450 tion every N events, where N is given by sample_period. A sam‐
451 pling event has sample_period > 0. When an overflow occurs,
452 requested data is recorded in the mmap buffer. The sample_type
453 field controls what data is recorded on each overflow.
454
455 sample_freq can be used if you wish to use frequency rather than
456 period. In this case, you set the freq flag. The kernel will
              adjust the sampling period to try to achieve the desired rate.
458 The rate of adjustment is a timer tick.
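Because sample_period and sample_freq share a union, the freq bit decides how the value is interpreted. The two hypothetical helpers below sketch each choice; they assume the caller has already zeroed the structure and set type, size, and config.

```c
#include <linux/perf_event.h>

/* Request one overflow every `period` events. */
static void set_sample_period(struct perf_event_attr *attr, __u64 period)
{
    attr->freq = 0;
    attr->sample_period = period;
}

/* Ask the kernel to aim for roughly `hz` samples per second. */
static void set_sample_freq(struct perf_event_attr *attr, __u64 hz)
{
    attr->freq = 1;
    attr->sample_freq = hz;
}
```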
459
460 sample_type
461 The various bits in this field specify which values to include
462 in the sample. They will be recorded in a ring-buffer, which is
463 available to user space using mmap(2). The order in which the
              values are saved in the sample is documented in the MMAP Layout
465 subsection below; it is not the enum perf_event_sample_format
466 order.
467
468 PERF_SAMPLE_IP
469 Records instruction pointer.
470
471 PERF_SAMPLE_TID
472 Records the process and thread IDs.
473
474 PERF_SAMPLE_TIME
475 Records a timestamp.
476
477 PERF_SAMPLE_ADDR
478 Records an address, if applicable.
479
480 PERF_SAMPLE_READ
481 Record counter values for all events in a group, not just
482 the group leader.
483
484 PERF_SAMPLE_CALLCHAIN
485 Records the callchain (stack backtrace).
486
487 PERF_SAMPLE_ID
488 Records a unique ID for the opened event's group leader.
489
490 PERF_SAMPLE_CPU
491 Records CPU number.
492
493 PERF_SAMPLE_PERIOD
494 Records the current sampling period.
495
496 PERF_SAMPLE_STREAM_ID
497 Records a unique ID for the opened event. Unlike
498 PERF_SAMPLE_ID the actual ID is returned, not the group
499 leader. This ID is the same as the one returned by
500 PERF_FORMAT_ID.
501
502 PERF_SAMPLE_RAW
503 Records additional data, if applicable. Usually returned
504 by tracepoint events.
505
506 PERF_SAMPLE_BRANCH_STACK (since Linux 3.4)
507 This provides a record of recent branches, as provided by
508 CPU branch sampling hardware (such as Intel Last Branch
509 Record). Not all hardware supports this feature.
510
511 See the branch_sample_type field for how to filter which
512 branches are reported.
513
514 PERF_SAMPLE_REGS_USER (since Linux 3.7)
515 Records the current user-level CPU register state (the
516 values in the process before the kernel was called).
517
518 PERF_SAMPLE_STACK_USER (since Linux 3.7)
519 Records the user level stack, allowing stack unwinding.
520
521 PERF_SAMPLE_WEIGHT (since Linux 3.10)
522 Records a hardware provided weight value that expresses
523 how costly the sampled event was. This allows the hard‐
524 ware to highlight expensive events in a profile.
525
526 PERF_SAMPLE_DATA_SRC (since Linux 3.10)
527 Records the data source: where in the memory hierarchy
528 the data associated with the sampled instruction came
529 from. This is available only if the underlying hardware
530 supports this feature.
531
532 PERF_SAMPLE_IDENTIFIER (since Linux 3.12)
533 Places the SAMPLE_ID value in a fixed position in the
534 record, either at the beginning (for sample events) or at
535 the end (if a non-sample event).
536
537 This was necessary because a sample stream may have
538 records from various different event sources with differ‐
539 ent sample_type settings. Parsing the event stream prop‐
540 erly was not possible because the format of the record
541 was needed to find SAMPLE_ID, but the format could not be
542 found without knowing what event the sample belonged to
543 (causing a circular dependency).
544
545 The PERF_SAMPLE_IDENTIFIER setting makes the event stream
546 always parsable by putting SAMPLE_ID in a fixed location,
547 even though it means having duplicate SAMPLE_ID values in
548 records.
549
550 PERF_SAMPLE_TRANSACTION (since Linux 3.13)
551 Records reasons for transactional memory abort events
552 (for example, from Intel TSX transactional memory sup‐
553 port).
554
555 The precise_ip setting must be greater than 0 and a
556 transactional memory abort event must be measured or no
557 values will be recorded. Also note that some perf_event
558 measurements, such as sampled cycle counting, may cause
559 extraneous aborts (by causing an interrupt during a
560 transaction).
561
562 PERF_SAMPLE_REGS_INTR (since Linux 3.19)
563 Records a subset of the current CPU register state as
564 specified by sample_regs_intr. Unlike PERF_SAM‐
565 PLE_REGS_USER the register values will return kernel reg‐
566 ister state if the overflow happened while kernel code is
                     running.  If the CPU supports hardware sampling of register
                     state (e.g., PEBS on Intel x86) and precise_ip is set
569 higher than zero then the register values returned are
570 those captured by hardware at the time of the sampled
571 instruction's retirement.
572
573 read_format
574 This field specifies the format of the data returned by read(2)
575 on a perf_event_open() file descriptor.
576
577 PERF_FORMAT_TOTAL_TIME_ENABLED
578 Adds the 64-bit time_enabled field. This can be used to
579 calculate estimated totals if the PMU is overcommitted
580 and multiplexing is happening.
581
582 PERF_FORMAT_TOTAL_TIME_RUNNING
583 Adds the 64-bit time_running field. This can be used to
584 calculate estimated totals if the PMU is overcommitted
585 and multiplexing is happening.
586
587 PERF_FORMAT_ID
588 Adds a 64-bit unique value that corresponds to the event
589 group.
590
591 PERF_FORMAT_GROUP
592 Allows all counter values in an event group to be read
593 with one read.
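When both time fields are requested, a multiplexed count is commonly scaled by the ratio of enabled time to running time to estimate the total. The sketch below assumes a non-group event opened with PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING and no other read_format bits, so that read(2) returns three consecutive 64-bit values; the structure and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed read(2) layout: value, then time_enabled, then time_running. */
struct read_value_times {
    uint64_t value;
    uint64_t time_enabled;
    uint64_t time_running;
};

/*
 * Estimate the count the event would have reached had it been on the
 * PMU for the whole time it was enabled.  Returns 0 if it never ran.
 */
static uint64_t scaled_count(const struct read_value_times *rv)
{
    if (rv->time_running == 0)
        return 0;
    if (rv->time_running >= rv->time_enabled)
        return rv->value;       /* no multiplexing took place */
    return (uint64_t)((double)rv->value * (double)rv->time_enabled /
                      (double)rv->time_running);
}
```

The result is an estimate only: it assumes the event rate while the counter was off the PMU matched the rate while it was on.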
594
595 disabled
596 The disabled bit specifies whether the counter starts out dis‐
597 abled or enabled. If disabled, the event can later be enabled
598 by ioctl(2), prctl(2), or enable_on_exec.
599
600 When creating an event group, typically the group leader is ini‐
601 tialized with disabled set to 1 and any child events are ini‐
602 tialized with disabled set to 0. Despite disabled being 0, the
603 child events will not start until the group leader is enabled.
604
605 inherit
606 The inherit bit specifies that this counter should count events
607 of child tasks as well as the task specified. This applies only
608 to new children, not to any existing children at the time the
609 counter is created (nor to any new children of existing chil‐
610 dren).
611
612 Inherit does not work for some combinations of read_format val‐
613 ues, such as PERF_FORMAT_GROUP.
614
615 pinned The pinned bit specifies that the counter should always be on
616 the CPU if at all possible. It applies only to hardware coun‐
617 ters and only to group leaders. If a pinned counter cannot be
618 put onto the CPU (e.g., because there are not enough hardware
619 counters or because of a conflict with some other event), then
620 the counter goes into an 'error' state, where reads return end-
621 of-file (i.e., read(2) returns 0) until the counter is subse‐
622 quently enabled or disabled.
623
624 exclusive
625 The exclusive bit specifies that when this counter's group is on
626 the CPU, it should be the only group using the CPU's counters.
627 In the future this may allow monitoring programs to support PMU
628 features that need to run alone so that they do not disrupt
629 other hardware counters.
630
631 Note that many unexpected situations may prevent events with the
632 exclusive bit set from ever running. This includes any users
633 running a system-wide measurement as well as any kernel use of
634 the performance counters (including the commonly enabled NMI
635 Watchdog Timer interface).
636
637 exclude_user
638 If this bit is set, the count excludes events that happen in
639 user space.
640
641 exclude_kernel
642 If this bit is set, the count excludes events that happen in
643 kernel space.
644
645 exclude_hv
646 If this bit is set, the count excludes events that happen in the
647 hypervisor. This is mainly for PMUs that have built-in support
648 for handling this (such as POWER). Extra support is needed for
649 handling hypervisor measurements on most machines.
650
651 exclude_idle
652 If set, don't count when the CPU is idle.
653
654 mmap The mmap bit enables generation of PERF_RECORD_MMAP samples for
655 every mmap(2) call that has PROT_EXEC set. This allows tools to
656 notice new executable code being mapped into a program (dynamic
657 shared libraries for example) so that addresses can be mapped
658 back to the original code.
659
660 comm The comm bit enables tracking of process command name as modi‐
661 fied by the exec(2) and prctl(PR_SET_NAME) system calls as well
662 as writing to /proc/self/comm. If the comm_exec flag is also
663 successfully set (possible since Linux 3.16), then the misc flag
664 PERF_RECORD_MISC_COMM_EXEC can be used to differentiate the
665 exec(2) case from the others.
666
       freq   If this bit is set, then sample_freq, not sample_period, is
              used when setting up the sampling interval.
669
670 inherit_stat
671 This bit enables saving of event counts on context switch for
672 inherited tasks. This is meaningful only if the inherit field
673 is set.
674
675 enable_on_exec
676 If this bit is set, a counter is automatically enabled after a
677 call to exec(2).
678
679 task If this bit is set, then fork/exit notifications are included in
680 the ring buffer.
681
682 watermark
683 If set, have an overflow notification happen when we cross the
684 wakeup_watermark boundary. Otherwise, overflow notifications
685 happen after wakeup_events samples.
686
687 precise_ip (since Linux 2.6.35)
688 This controls the amount of skid. Skid is how many instructions
689 execute between an event of interest happening and the kernel
690 being able to stop and record the event. Smaller skid is better
691 and allows more accurate reporting of which events correspond to
692 which instructions, but hardware is often limited with how small
693 this can be.
694
695 The possible values of this field are the following:
696
697 0 SAMPLE_IP can have arbitrary skid.
698
699 1 SAMPLE_IP must have constant skid.
700
701 2 SAMPLE_IP requested to have 0 skid.
702
703 3 SAMPLE_IP must have 0 skid. See also the description of
704 PERF_RECORD_MISC_EXACT_IP.
705
706 mmap_data (since Linux 2.6.36)
707 This is the counterpart of the mmap field. This enables genera‐
708 tion of PERF_RECORD_MMAP samples for mmap(2) calls that do not
709 have PROT_EXEC set (for example data and SysV shared memory).
710
711 sample_id_all (since Linux 2.6.38)
712 If set, then TID, TIME, ID, STREAM_ID, and CPU can additionally
713 be included in non-PERF_RECORD_SAMPLEs if the corresponding sam‐
714 ple_type is selected.
715
716 If PERF_SAMPLE_IDENTIFIER is specified, then an additional ID
717 value is included as the last value to ease parsing the record
718 stream. This may lead to the id value appearing twice.
719
720 The layout is described by this pseudo-structure:
721
                  struct sample_id {
                      { u32 pid, tid; }   /* if PERF_SAMPLE_TID set */
                      { u64 time;     }   /* if PERF_SAMPLE_TIME set */
                      { u64 id;       }   /* if PERF_SAMPLE_ID set */
                      { u64 stream_id;}   /* if PERF_SAMPLE_STREAM_ID set */
                      { u32 cpu, res; }   /* if PERF_SAMPLE_CPU set */
                      { u64 id;       }   /* if PERF_SAMPLE_IDENTIFIER set */
                  };
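A parser that walks non-sample records needs to know how many bytes this trailer occupies. The following sketch derives the size from sample_type according to the pseudo-structure above; sample_id_size() is a hypothetical helper, and it applies only when sample_id_all is set.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the sample_id trailer implied by sample_type. */
static size_t sample_id_size(uint64_t sample_type)
{
    size_t n = 0;

    if (sample_type & PERF_SAMPLE_TID)
        n += 2 * sizeof(uint32_t);      /* pid, tid */
    if (sample_type & PERF_SAMPLE_TIME)
        n += sizeof(uint64_t);          /* time */
    if (sample_type & PERF_SAMPLE_ID)
        n += sizeof(uint64_t);          /* id */
    if (sample_type & PERF_SAMPLE_STREAM_ID)
        n += sizeof(uint64_t);          /* stream_id */
    if (sample_type & PERF_SAMPLE_CPU)
        n += 2 * sizeof(uint32_t);      /* cpu, res */
    if (sample_type & PERF_SAMPLE_IDENTIFIER)
        n += sizeof(uint64_t);          /* id, repeated as last value */
    return n;
}
```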
731
732 exclude_host (since Linux 3.2)
733 When conducting measurements that include processes running VM
734 instances (i.e., have executed a KVM_RUN ioctl(2)), only measure
735 events happening inside a guest instance. This is only meaning‐
736 ful outside the guests; this setting does not change counts
737 gathered inside of a guest. Currently, this functionality is
738 x86 only.
739
740 exclude_guest (since Linux 3.2)
741 When conducting measurements that include processes running VM
742 instances (i.e., have executed a KVM_RUN ioctl(2)), do not mea‐
743 sure events happening inside guest instances. This is only
744 meaningful outside the guests; this setting does not change
745 counts gathered inside of a guest. Currently, this functional‐
746 ity is x86 only.
747
748 exclude_callchain_kernel (since Linux 3.7)
749 Do not include kernel callchains.
750
751 exclude_callchain_user (since Linux 3.7)
752 Do not include user callchains.
753
754 mmap2 (since Linux 3.16)
755 Generate an extended executable mmap record that contains enough
756 additional information to uniquely identify shared mappings.
757 The mmap flag must also be set for this to work.
758
759 comm_exec (since Linux 3.16)
              This is purely a feature-detection flag; it does not change kernel
              behavior.  If this flag can successfully be set, then, when
762 comm is enabled, the PERF_RECORD_MISC_COMM_EXEC flag will be set
763 in the misc field of a comm record header if the rename event
764 being reported was caused by a call to exec(2). This allows
765 tools to distinguish between the various types of process renam‐
766 ing.
767
768 use_clockid (since Linux 4.1)
769 This allows selecting which internal Linux clock to use when
770 generating timestamps via the clockid field. This can make it
771 easier to correlate perf sample times with timestamps generated
772 by other tools.
773
774 context_switch (since Linux 4.3)
775 This enables the generation of PERF_RECORD_SWITCH records when a
776 context switch occurs. It also enables the generation of
777 PERF_RECORD_SWITCH_CPU_WIDE records when sampling in CPU-wide
778 mode. This functionality is in addition to existing tracepoint
779 and software events for measuring context switches. The advan‐
780 tage of this method is that it will give full information even
781 with strict perf_event_paranoid settings.
782
783 wakeup_events, wakeup_watermark
784 This union sets how many samples (wakeup_events) or bytes
785 (wakeup_watermark) happen before an overflow notification hap‐
786 pens. Which one is used is selected by the watermark bit flag.
787
788 wakeup_events counts only PERF_RECORD_SAMPLE record types. To
789 receive overflow notification for all PERF_RECORD types choose
790 watermark and set wakeup_watermark to 1.
791
792 Prior to Linux 3.0, setting wakeup_events to 0 resulted in no
793 overflow notifications; more recent kernels treat 0 the same as
794 1.
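The two wakeup policies described above can be sketched as a pair of hypothetical helpers: one that wakes after every n sample records, and one that follows the advice above for being notified about every record type.

```c
#include <linux/perf_event.h>

/* Notify after every n PERF_RECORD_SAMPLE records. */
static void wake_every_n_samples(struct perf_event_attr *attr, __u32 n)
{
    attr->watermark = 0;
    attr->wakeup_events = n;
}

/* Notify on any record type: watermark mode with a 1-byte threshold. */
static void wake_on_any_record(struct perf_event_attr *attr)
{
    attr->watermark = 1;
    attr->wakeup_watermark = 1;
}
```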
795
796 bp_type (since Linux 2.6.33)
797 This chooses the breakpoint type. It is one of:
798
799 HW_BREAKPOINT_EMPTY
800 No breakpoint.
801
802 HW_BREAKPOINT_R
803 Count when we read the memory location.
804
805 HW_BREAKPOINT_W
806 Count when we write the memory location.
807
808 HW_BREAKPOINT_RW
809 Count when we read or write the memory location.
810
811 HW_BREAKPOINT_X
812 Count when we execute code at the memory location.
813
814 The values can be combined via a bitwise or, but the combination
815 of HW_BREAKPOINT_R or HW_BREAKPOINT_W with HW_BREAKPOINT_X is
816 not allowed.
817
818 bp_addr (since Linux 2.6.33)
819 This is the address of the breakpoint. For execution break‐
820 points, this is the memory address of the instruction of inter‐
821 est; for read and write breakpoints, it is the memory address of
822 the memory location of interest.
823
824 config1 (since Linux 2.6.39)
825 config1 is used for setting events that need an extra register
826 or otherwise do not fit in the regular config field. Raw OFF‐
827 CORE_EVENTS on Nehalem/Westmere/SandyBridge use this field on
828 Linux 3.3 and later kernels.
829
830 bp_len (since Linux 2.6.33)
831 bp_len is the length of the breakpoint being measured if type is
832 PERF_TYPE_BREAKPOINT. Options are HW_BREAKPOINT_LEN_1,
833 HW_BREAKPOINT_LEN_2, HW_BREAKPOINT_LEN_4, and HW_BREAK‐
834 POINT_LEN_8. For an execution breakpoint, set this to
835 sizeof(long).
836
837 config2 (since Linux 2.6.39)
838 config2 is a further extension of the config1 field.
839
840 branch_sample_type (since Linux 3.4)
841 If PERF_SAMPLE_BRANCH_STACK is enabled, then this specifies what
842 branches to include in the branch record.
843
844 The first part of the value is the privilege level, which is a
845 combination of one or more of the values listed below. If the
846 user does not set the privilege level explicitly, the kernel will
847 use the event's privilege level. Event and branch privilege
848 levels do not have to match.
849
850 PERF_SAMPLE_BRANCH_USER
851 Branch target is in user space.
852
853 PERF_SAMPLE_BRANCH_KERNEL
854 Branch target is in kernel space.
855
856 PERF_SAMPLE_BRANCH_HV
857 Branch target is in hypervisor.
858
859 PERF_SAMPLE_BRANCH_PLM_ALL
860 A convenience value that is the three preceding values
861 ORed together.
862
863 In addition to the privilege value, at least one of the
864 following bits must be set.
865
866 PERF_SAMPLE_BRANCH_ANY
867 Any branch type.
868
869 PERF_SAMPLE_BRANCH_ANY_CALL
870 Any call branch (includes direct calls, indirect calls,
871 and far jumps).
872
873 PERF_SAMPLE_BRANCH_IND_CALL
874 Indirect calls.
875
876 PERF_SAMPLE_BRANCH_CALL (since Linux 4.4)
877 Direct calls.
878
879 PERF_SAMPLE_BRANCH_ANY_RETURN
880 Any return branch.
881
882 PERF_SAMPLE_BRANCH_IND_JUMP (since Linux 4.2)
883 Indirect jumps.
884
885 PERF_SAMPLE_BRANCH_COND (since Linux 3.16)
886 Conditional branches.
887
888 PERF_SAMPLE_BRANCH_ABORT_TX (since Linux 3.11)
889 Transactional memory aborts.
890
891 PERF_SAMPLE_BRANCH_IN_TX (since Linux 3.11)
892 Branch in transactional memory transaction.
893
894 PERF_SAMPLE_BRANCH_NO_TX (since Linux 3.11)
895 Branch not in transactional memory transaction.

896 PERF_SAMPLE_BRANCH_CALL_STACK (since Linux 4.1)
897 Branch is part of a hardware-generated call stack. This
898 requires hardware support, currently only found on Intel x86
899 Haswell or newer.
900
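A branch_sample_type value can be split into its privilege part and its branch-type part. The following is a minimal sketch: the BR_* macros are local mirrors of the corresponding PERF_SAMPLE_BRANCH_* bits in <linux/perf_event.h>, and the helper functions are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirrors of PERF_SAMPLE_BRANCH_* bits from <linux/perf_event.h>. */
#define BR_USER     (1ULL << 0)
#define BR_KERNEL   (1ULL << 1)
#define BR_HV       (1ULL << 2)
#define BR_PLM_ALL  (BR_USER | BR_KERNEL | BR_HV)
#define BR_ANY      (1ULL << 3)
#define BR_ANY_CALL (1ULL << 4)

/* Privilege-level bits of a branch_sample_type value. */
uint64_t branch_priv_bits(uint64_t v)
{
    return v & BR_PLM_ALL;
}

/* Everything that is not a privilege-level bit. */
uint64_t branch_type_bits(uint64_t v)
{
    return v & ~BR_PLM_ALL;
}

/* The text requires at least one non-privilege bit to be set. */
bool branch_sample_type_ok(uint64_t v)
{
    return branch_type_bits(v) != 0;
}
```

A value of BR_USER alone fails this check, while BR_USER | BR_ANY passes it.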
901 sample_regs_user (since Linux 3.7)
902 This bit mask defines the set of user CPU registers to dump on
903 samples. The layout of the register mask is architecture-spe‐
904 cific and is described in the kernel header file
905 arch/ARCH/include/uapi/asm/perf_regs.h.
906
907 sample_stack_user (since Linux 3.7)
908 This defines the size of the user stack to dump if PERF_SAM‐
909 PLE_STACK_USER is specified.
910
911 clockid (since Linux 4.1)
912 If use_clockid is set, then this field selects which internal
913 Linux timer to use for timestamps. The available timers are
914 defined in linux/time.h, with CLOCK_MONOTONIC, CLOCK_MONO‐
915 TONIC_RAW, CLOCK_REALTIME, CLOCK_BOOTTIME, and CLOCK_TAI cur‐
916 rently supported.
917
918 aux_watermark (since Linux 4.1)
919 This specifies how much data is required to trigger a
920 PERF_RECORD_AUX sample.
921
922 sample_max_stack (since Linux 4.8)
923 When sample_type includes PERF_SAMPLE_CALLCHAIN, this field
924 specifies how many stack frames to report when generating the
925 callchain.
926
927 Reading results
928 Once a perf_event_open() file descriptor has been opened, the values of
929 the events can be read from the file descriptor. The values that are
930 there are specified by the read_format field in the attr structure at
931 open time.
932
933 If you attempt to read into a buffer that is not big enough to hold the
934 data, the error ENOSPC results.
935
936 Here is the layout of the data returned by a read:
937
938 * If PERF_FORMAT_GROUP was specified to allow reading all events in a
939 group at once:
940
941 struct read_format {
942 u64 nr; /* The number of events */
943 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
944 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
945 struct {
946 u64 value; /* The value of the event */
947 u64 id; /* if PERF_FORMAT_ID */
948 } values[nr];
949 };
950
951 * If PERF_FORMAT_GROUP was not specified:
952
953 struct read_format {
954 u64 value; /* The value of the event */
955 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
956 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
957 u64 id; /* if PERF_FORMAT_ID */
958 };
959
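Because the optional members appear only when the matching read_format bit is set, a reader has to walk the buffer word by word. The following is a minimal sketch for the PERF_FORMAT_GROUP layout; the FMT_* macros are local mirrors of the PERF_FORMAT_* bits in <linux/perf_event.h>, and struct group_reading and parse_group_read() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the PERF_FORMAT_* bits from <linux/perf_event.h>. */
#define FMT_TOTAL_TIME_ENABLED (1U << 0)
#define FMT_TOTAL_TIME_RUNNING (1U << 1)
#define FMT_ID                 (1U << 2)
#define FMT_GROUP              (1U << 3)

struct group_reading {
    uint64_t nr;
    uint64_t time_enabled;    /* 0 if not requested */
    uint64_t time_running;    /* 0 if not requested */
    const uint64_t *values;   /* first word of values[] */
    size_t stride;            /* u64 words per values[] entry (1 or 2) */
};

/* Parse a group read of `words` u64 words; 0 on success, -1 if too short. */
int parse_group_read(const uint64_t *buf, size_t words,
                     uint64_t read_format, struct group_reading *out)
{
    size_t pos = 0;

    if (!(read_format & FMT_GROUP) || words < 1)
        return -1;
    out->nr = buf[pos++];
    out->time_enabled = 0;
    out->time_running = 0;
    if (read_format & FMT_TOTAL_TIME_ENABLED) {
        if (pos >= words)
            return -1;
        out->time_enabled = buf[pos++];
    }
    if (read_format & FMT_TOTAL_TIME_RUNNING) {
        if (pos >= words)
            return -1;
        out->time_running = buf[pos++];
    }
    out->stride = (read_format & FMT_ID) ? 2 : 1;
    if (out->nr > (words - pos) / out->stride)
        return -1;            /* values[] would run past the buffer */
    out->values = buf + pos;
    return 0;
}
```

The count of event i is then out->values[i * out->stride], and its ID (if PERF_FORMAT_ID was requested) is the word that follows it.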
960 The values read are as follows:
961
962 nr The number of events in this file descriptor. Available only if
963 PERF_FORMAT_GROUP was specified.
964
965 time_enabled, time_running
966 Total time the event was enabled and running. Normally these
967 values are the same. If more events are started than there are
968 available counter slots on the PMU, then multiplexing happens and
969 events run only part of the time. In that case, the time_enabled
970 and time_running values can be used to scale an estimated value
971 for the count.
972
973 value An unsigned 64-bit value containing the counter result.
974
975 id A globally unique value for this particular event; only present
976 if PERF_FORMAT_ID was specified in read_format.
977
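The scaling mentioned for time_enabled and time_running can be sketched as follows. This is an illustrative helper (scale_count() is a hypothetical name); it splits the multiplication into quotient and remainder parts to reduce the risk of 64-bit overflow, and the result is only an estimate.

```c
#include <stdint.h>

/* Estimate the full count of a multiplexed event: value * enabled / running. */
uint64_t scale_count(uint64_t value, uint64_t time_enabled,
                     uint64_t time_running)
{
    uint64_t quot, rem;

    if (time_running == 0)
        return 0;               /* the event never ran; nothing to scale */
    if (time_running >= time_enabled)
        return value;           /* no multiplexing took place */

    quot = value / time_running;
    rem = value % time_running;
    return quot * time_enabled + (rem * time_enabled) / time_running;
}
```

For instance, a value of 150 read with time_enabled of 300 and time_running of 100 scales to an estimated 450.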
978 MMAP layout
979 When using perf_event_open() in sampled mode, asynchronous events (like
980 counter overflow or PROT_EXEC mmap tracking) are logged into a ring-
981 buffer. This ring-buffer is created and accessed through mmap(2).
982
983 The mmap size should be 1+2^n pages, where the first page is a metadata
984 page (struct perf_event_mmap_page) that contains various bits of infor‐
985 mation such as where the ring-buffer head is.
986
987 Before Linux 2.6.39, a bug meant that you had to allocate an
988 mmap ring buffer when sampling even if you did not plan to access it.
989
990 The structure of the first metadata mmap page is as follows:
991
992 struct perf_event_mmap_page {
993 __u32 version; /* version number of this structure */
994 __u32 compat_version; /* lowest version this is compat with */
995 __u32 lock; /* seqlock for synchronization */
996 __u32 index; /* hardware counter identifier */
997 __s64 offset; /* add to hardware counter value */
998 __u64 time_enabled; /* time event active */
999 __u64 time_running; /* time event on CPU */
1000 union {
1001 __u64 capabilities;
1002 struct {
1003 __u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
1004 cap_bit0_is_deprecated : 1,
1005 cap_user_rdpmc : 1,
1006 cap_user_time : 1,
1007 cap_user_time_zero : 1,
1008 };
1009 };
1010 __u16 pmc_width;
1011 __u16 time_shift;
1012 __u32 time_mult;
1013 __u64 time_offset;
1014 __u64 __reserved[120]; /* Pad to 1 k */
1015 __u64 data_head; /* head in the data section */
1016 __u64 data_tail; /* user-space written tail */
1017 __u64 data_offset; /* where the buffer starts */
1018 __u64 data_size; /* data buffer size */
1019 __u64 aux_head;
1020 __u64 aux_tail;
1021 __u64 aux_offset;
1022 __u64 aux_size;
1024 };
1025
1026 The following list describes the fields in the perf_event_mmap_page
1027 structure in more detail:
1028
1029 version
1030 Version number of this structure.
1031
1032 compat_version
1033 The lowest version this is compatible with.
1034
1035 lock A seqlock for synchronization.
1036
1037 index A unique hardware counter identifier.
1038
1039 offset When using rdpmc for reads this offset value must be added to
1040 the one returned by rdpmc to get the current total event count.
1041
1042 time_enabled
1043 Time the event was active.
1044
1045 time_running
1046 Time the event was running.
1047
1048 cap_usr_time / cap_usr_rdpmc / cap_bit0 (since Linux 3.4)
1049 There was a bug in the definition of cap_usr_time and
1050 cap_usr_rdpmc from Linux 3.4 until Linux 3.11. Both bits were
1051 defined to point to the same location, so it was impossible to
1052 know if cap_usr_time or cap_usr_rdpmc were actually set.
1053
1054 Starting with Linux 3.12, these are renamed to cap_bit0 and you
1055 should use the cap_user_time and cap_user_rdpmc fields instead.
1056
1057 cap_bit0_is_deprecated (since Linux 3.12)
1058 If set, this bit indicates that the kernel supports the properly
1059 separated cap_user_time and cap_user_rdpmc bits.
1060
1061 If not set, it indicates an older kernel where cap_usr_time and
1062 cap_usr_rdpmc map to the same bit and thus both features should
1063 be used with caution.
1064
1065 cap_user_rdpmc (since Linux 3.12)
1066 If the hardware supports user-space read of performance counters
1067 without syscall (this is the "rdpmc" instruction on x86), then
1068 the following code can be used to do a read:
1069
1070 u32 seq, time_mult, time_shift, idx, width;
1071 u64 count, enabled, running;
1072 u64 cyc, time_offset;
1073
1074 do {
1075 seq = pc->lock;
1076 barrier();
1077 enabled = pc->time_enabled;
1078 running = pc->time_running;
1079
1080 if (pc->cap_user_time && enabled != running) {
1081 cyc = rdtsc();
1082 time_offset = pc->time_offset;
1083 time_mult = pc->time_mult;
1084 time_shift = pc->time_shift;
1085 }
1086
1087 idx = pc->index;
1088 count = pc->offset;
1089
1090 if (pc->cap_user_rdpmc && idx) {
1091 width = pc->pmc_width;
1092 count += rdpmc(idx - 1);
1093 }
1094
1095 barrier();
1096 } while (pc->lock != seq);
1097
1098 cap_user_time (since Linux 3.12)
1099 This bit indicates the hardware has a constant, nonstop time‐
1100 stamp counter (TSC on x86).
1101
1102 cap_user_time_zero (since Linux 3.12)
1103 Indicates the presence of time_zero which allows mapping time‐
1104 stamp values to the hardware clock.
1105
1106 pmc_width
1107 If cap_user_rdpmc is set, this field provides the bit-width of
1108 the value read using the rdpmc or equivalent instruction. This
1109 can be used to sign extend the result like:
1110
1111 pmc <<= 64 - pmc_width;
1112 pmc >>= 64 - pmc_width; // signed shift right
1113 count += pmc;
1114
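The sign extension can also be written as a standalone function. The following sketch uses an XOR-and-subtract formulation rather than the shift pair shown above, so that it does not depend on how the compiler implements a right shift of a negative value; sign_extend_pmc() is a hypothetical helper name.

```c
#include <stdint.h>

/* Sign extend the low `width` bits of a raw counter value to 64 bits. */
int64_t sign_extend_pmc(uint64_t raw, unsigned int width)
{
    uint64_t sign_bit;

    if (width == 0 || width >= 64)
        return (int64_t)raw;    /* nothing to extend */

    sign_bit = (uint64_t)1 << (width - 1);
    raw &= (sign_bit << 1) - 1; /* keep only the low `width` bits */
    return (int64_t)((raw ^ sign_bit) - sign_bit);
}
```

With a 48-bit counter, a raw value of 0xFFFFFFFFFFFF becomes -1, and the result can be added to count as in the snippet above.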
1115 time_shift, time_mult, time_offset
1116
1117 If cap_user_time is set, these fields can be used to compute the
1118 time delta since time_enabled (in nanoseconds) using rdtsc or
1119 similar:
1120
1121 u64 quot, rem;
1122 u64 delta;
1123 quot = (cyc >> time_shift);
1124 rem = cyc & (((u64)1 << time_shift) - 1);
1125 delta = time_offset + quot * time_mult +
1126 ((rem * time_mult) >> time_shift);
1127
1128 Where time_offset, time_mult, time_shift, and cyc are read in
1129 the seqcount loop described above. This delta can then be added
1130 to enabled and possibly running (if idx), improving the scaling:
1131
1132 enabled += delta;
1133 if (idx)
1134 running += delta;
1135 quot = count / running;
1136 rem = count % running;
1137 count = quot * enabled + (rem * enabled) / running;
1138
1139 time_zero (since Linux 3.12)
1140
1141 If cap_user_time_zero is set, then the hardware clock (the TSC
1142 timestamp counter on x86) can be calculated from the time_zero,
1143 time_mult, and time_shift values:
1144
1145 time = timestamp - time_zero;
1146 quot = time / time_mult;
1147 rem = time % time_mult;
1148 cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
1149
1150 And vice versa:
1151
1152 quot = cyc >> time_shift;
1153 rem = cyc & (((u64)1 << time_shift) - 1);
1154 timestamp = time_zero + quot * time_mult +
1155 ((rem * time_mult) >> time_shift);
1156
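The two conversions above can be wrapped in a pair of functions. This is a minimal sketch: struct perf_clock is a hypothetical container for values that would be copied out of the metadata page inside the seqlock loop, and time_mult is assumed to be nonzero.

```c
#include <stdint.h>

/* Hypothetical snapshot of the conversion fields from the metadata page. */
struct perf_clock {
    uint64_t time_zero;
    uint32_t time_mult;     /* assumed nonzero */
    uint16_t time_shift;
};

/* Hardware cycle count -> perf timestamp. */
uint64_t cyc_to_timestamp(const struct perf_clock *c, uint64_t cyc)
{
    uint64_t quot = cyc >> c->time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << c->time_shift) - 1);

    return c->time_zero + quot * c->time_mult +
           ((rem * c->time_mult) >> c->time_shift);
}

/* Perf timestamp -> hardware cycle count. */
uint64_t timestamp_to_cyc(const struct perf_clock *c, uint64_t timestamp)
{
    uint64_t time = timestamp - c->time_zero;
    uint64_t quot = time / c->time_mult;
    uint64_t rem = time % c->time_mult;

    return (quot << c->time_shift) + (rem << c->time_shift) / c->time_mult;
}
```

Because both directions use integer arithmetic, a round trip may lose a small amount of precision when the cycle count is not a multiple of 2^time_shift.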
1157 data_head
1158 This points to the head of the data section. The value
1159 continuously increases; it does not wrap. The value needs to be
1160 manually wrapped by the size of the mmap buffer before accessing
1161 the samples.
1162
1163 On SMP-capable platforms, after reading the data_head value,
1164 user space should issue an rmb().
1165
1166 data_tail
1167 When the mapping is PROT_WRITE, the data_tail value should be
1168 written by user space to reflect the last read data. In this
1169 case, the kernel will not overwrite unread data.
1170
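Since data_head and data_tail increase without wrapping, a reader reduces each position modulo the size of the data section and must cope with records that straddle the end of the buffer. The following is a minimal sketch of that arithmetic only; ring_copy() and ring_unread() are hypothetical helpers, the data section size is assumed to be a power of two, and the memory barriers required around accesses to data_head and data_tail are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of bytes written by the kernel but not yet consumed. */
uint64_t ring_unread(uint64_t head, uint64_t tail)
{
    return head - tail;
}

/*
 * Copy len bytes starting at free-running position pos out of a ring of
 * data_size bytes (a power of two, len <= data_size), handling the wrap.
 */
void ring_copy(const uint8_t *data, uint64_t data_size, uint64_t pos,
               void *dst, size_t len)
{
    uint64_t off = pos & (data_size - 1);
    size_t first = len;

    if (first > data_size - off)
        first = (size_t)(data_size - off);
    memcpy(dst, data + off, first);                     /* up to the end */
    memcpy((uint8_t *)dst + first, data, len - first);  /* wrapped part */
}
```

After a record has been copied out and processed, the consumer would advance its tail by the record size and store the new value to data_tail.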
1171 data_offset (since Linux 4.1)
1172 Contains the offset of the location in the mmap buffer where
1173 perf sample data begins.
1174
1175 data_size (since Linux 4.1)
1176 Contains the size of the perf sample region within the mmap buf‐
1177 fer.
1178
1179 aux_head, aux_tail, aux_offset, aux_size (since Linux 4.1)
1180 The AUX region allows mapping a separate sample buffer for
1181 high-bandwidth data streams (separate from the main perf sample
1182 buffer). An example of a high-bandwidth stream is instruction
1183 tracing support, as is found in newer Intel processors.
1184
1185 To set up an AUX area, first aux_offset needs to be set with an
1186 offset greater than data_offset+data_size and aux_size needs to
1187 be set to the desired buffer size. The desired offset and size
1188 must be page aligned, and the size must be a power of two.
1189 These values are then passed to mmap in order to map the AUX
1190 buffer. Pages in the AUX buffer are included as part of the
1191 RLIMIT_MEMLOCK resource limit (see setrlimit(2)), and also as
1192 part of the perf_event_mlock_kb allowance.
1193
1194 By default, the AUX buffer will be truncated if it will not fit
1195 in the available space in the ring buffer. If the AUX buffer is
1196 mapped as a read only buffer, then it will operate in ring buf‐
1197 fer mode where old data will be overwritten by new. In over‐
1198 write mode, it might not be possible to infer where the new data
1199 began, and it is the consumer's job to disable measurement while
1200 reading to avoid possible data races.
1201
1202 The aux_head and aux_tail ring buffer pointers have the same
1203 behavior and ordering rules as the previously described data_head
1204 and data_tail.
1205
1206 The following 2^n ring-buffer pages have the layout described below.
1207
1208 If perf_event_attr.sample_id_all is set, then all event types carry the
1209 sample_type-selected fields that identify where and when an event took
1210 place (TID, TIME, ID, CPU, STREAM_ID), as described in
1211 PERF_RECORD_SAMPLE below. These fields are stashed just after the
1212 perf_event_header and the fields already present for the record type,
1213 that is, at the end of the payload. This allows a newer perf.data
1214 file to be supported by older perf tools, with the new optional
1215 fields being ignored.
1216
1217 The mmap values start with a header:
1218
1219 struct perf_event_header {
1220 __u32 type;
1221 __u16 misc;
1222 __u16 size;
1223 };
1224
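Because every record starts with this header and the size field covers the whole record, a consumer can step from one record to the next without understanding each record type. The following is a minimal sketch over a linear buffer (that is, one already copied out of the ring so that no record wraps); struct rec_header is a local mirror of struct perf_event_header, and count_records() is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct perf_event_header. */
struct rec_header {
    uint32_t type;
    uint16_t misc;
    uint16_t size;
};

/* Count the records of the given type in a linear buffer of len bytes. */
size_t count_records(const uint8_t *buf, size_t len, uint32_t type)
{
    size_t off = 0, n = 0;

    while (len - off >= sizeof(struct rec_header)) {
        struct rec_header h;

        memcpy(&h, buf + off, sizeof(h));
        if (h.size < sizeof(h) || h.size > len - off)
            break;              /* malformed or truncated record */
        if (h.type == type)
            n++;
        off += h.size;          /* size includes the header itself */
    }
    return n;
}
```

The walk stops at the first record whose size is inconsistent with the remaining data rather than guessing where the next record begins.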
1225 Below, we describe the perf_event_header fields in more detail. For
1226 ease of reading, the fields with shorter descriptions are presented
1227 first.
1228
1229 size This indicates the size of the record.
1230
1231 misc The misc field contains additional information about the sample.
1232
1233 The CPU mode can be determined from this value by masking with
1234 PERF_RECORD_MISC_CPUMODE_MASK and looking for one of the follow‐
1235 ing (note these are not bit masks, only one can be set at a
1236 time):
1237
1238 PERF_RECORD_MISC_CPUMODE_UNKNOWN
1239 Unknown CPU mode.
1240
1241 PERF_RECORD_MISC_KERNEL
1242 Sample happened in the kernel.
1243
1244 PERF_RECORD_MISC_USER
1245 Sample happened in user code.
1246
1247 PERF_RECORD_MISC_HYPERVISOR
1248 Sample happened in the hypervisor.
1249
1250 PERF_RECORD_MISC_GUEST_KERNEL (since Linux 2.6.35)
1251 Sample happened in the guest kernel.
1252
1253 PERF_RECORD_MISC_GUEST_USER (since Linux 2.6.35)
1254 Sample happened in guest user code.
1255
1256 Since the following three statuses are generated by different
1257 record types, they alias to the same bit:
1258
1259 PERF_RECORD_MISC_MMAP_DATA (since Linux 3.10)
1260 This is set when the mapping is not executable; otherwise
1261 the mapping is executable.
1262
1263 PERF_RECORD_MISC_COMM_EXEC (since Linux 3.16)
1264 This is set for a PERF_RECORD_COMM record on kernels more
1265 recent than Linux 3.16 if a process name change was
1266 caused by an exec(2) system call.
1267
1268 PERF_RECORD_MISC_SWITCH_OUT (since Linux 4.3)
1269 When a PERF_RECORD_SWITCH or PERF_RECORD_SWITCH_CPU_WIDE
1270 record is generated, this bit indicates that the context
1271 switch is away from the current process (instead of into
1272 the current process).
1273
1274 In addition, the following bits can be set:
1275
1276 PERF_RECORD_MISC_EXACT_IP
1277 This indicates that the content of PERF_SAMPLE_IP points
1278 to the actual instruction that triggered the event. See
1279 also perf_event_attr.precise_ip.
1280
1281 PERF_RECORD_MISC_EXT_RESERVED (since Linux 2.6.35)
1282 This indicates there is extended data available (cur‐
1283 rently not used).
1284
1285 PERF_RECORD_MISC_PROC_MAP_PARSE_TIMEOUT
1286 This bit is not set by the kernel. It is reserved for
1287 the user-space perf utility to indicate that
1288 /proc/[pid]/maps parsing was taking too long and was
1289 stopped, and thus the mmap records may be truncated.
1290
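Decoding the misc field therefore takes two steps: mask out the CPU mode and compare it as a value, then test the remaining flags as individual bits. The following is a minimal sketch; the MISC_* macros are local mirrors of PERF_RECORD_MISC_CPUMODE_MASK and PERF_RECORD_MISC_EXACT_IP from <linux/perf_event.h>, and the function names and returned strings are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirrors of values from <linux/perf_event.h>. */
#define MISC_CPUMODE_MASK 0x7U
#define MISC_EXACT_IP     (1U << 14)

/* The CPU mode is an enumerated value, not a set of independent bits. */
const char *cpumode_name(uint16_t misc)
{
    switch (misc & MISC_CPUMODE_MASK) {
    case 0:  return "unknown";
    case 1:  return "kernel";
    case 2:  return "user";
    case 3:  return "hypervisor";
    case 4:  return "guest-kernel";
    case 5:  return "guest-user";
    default: return "invalid";
    }
}

/* True if the sampled IP points at the instruction that caused the event. */
bool misc_exact_ip(uint16_t misc)
{
    return (misc & MISC_EXACT_IP) != 0;
}
```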
1291 type The type value is one of the below. The values in the corre‐
1292 sponding record (that follows the header) depend on the type
1293 selected as shown.
1294
1295 PERF_RECORD_MMAP
1296 The MMAP events record the PROT_EXEC mappings so that we can
1297 correlate user-space IPs to code. They have the following
1298 structure:
1299
1300 struct {
1301 struct perf_event_header header;
1302 u32 pid, tid;
1303 u64 addr;
1304 u64 len;
1305 u64 pgoff;
1306 char filename[];
1307 };
1308
1309 pid is the process ID.
1310
1311 tid is the thread ID.
1312
1313 addr is the address of the allocated memory. len is the
1314 length of the allocated memory. pgoff is the page
1315 offset of the allocated memory. filename is a string
1316 describing the backing of the allocated memory.
1317
1318 PERF_RECORD_LOST
1319 This record indicates when events are lost.
1320
1321 struct {
1322 struct perf_event_header header;
1323 u64 id;
1324 u64 lost;
1325 struct sample_id sample_id;
1326 };
1327
1328 id is the unique event ID for the samples that were
1329 lost.
1330
1331 lost is the number of events that were lost.
1332
1333 PERF_RECORD_COMM
1334 This record indicates a change in the process name.
1335
1336 struct {
1337 struct perf_event_header header;
1338 u32 pid;
1339 u32 tid;
1340 char comm[];
1341 struct sample_id sample_id;
1342 };
1343
1344 pid is the process ID.
1345
1346 tid is the thread ID.
1347
1348 comm is a string containing the new name of the process.
1349
1350 PERF_RECORD_EXIT
1351 This record indicates a process exit event.
1352
1353 struct {
1354 struct perf_event_header header;
1355 u32 pid, ppid;
1356 u32 tid, ptid;
1357 u64 time;
1358 struct sample_id sample_id;
1359 };
1360
1361 PERF_RECORD_THROTTLE, PERF_RECORD_UNTHROTTLE
1362 This record indicates a throttle/unthrottle event.
1363
1364 struct {
1365 struct perf_event_header header;
1366 u64 time;
1367 u64 id;
1368 u64 stream_id;
1369 struct sample_id sample_id;
1370 };
1371
1372 PERF_RECORD_FORK
1373 This record indicates a fork event.
1374
1375 struct {
1376 struct perf_event_header header;
1377 u32 pid, ppid;
1378 u32 tid, ptid;
1379 u64 time;
1380 struct sample_id sample_id;
1381 };
1382
1383 PERF_RECORD_READ
1384 This record indicates a read event.
1385
1386 struct {
1387 struct perf_event_header header;
1388 u32 pid, tid;
1389 struct read_format values;
1390 struct sample_id sample_id;
1391 };
1392
1393 PERF_RECORD_SAMPLE
1394 This record indicates a sample.
1395
1396 struct {
1397 struct perf_event_header header;
1398 u64 sample_id; /* if PERF_SAMPLE_IDENTIFIER */
1399 u64 ip; /* if PERF_SAMPLE_IP */
1400 u32 pid, tid; /* if PERF_SAMPLE_TID */
1401 u64 time; /* if PERF_SAMPLE_TIME */
1402 u64 addr; /* if PERF_SAMPLE_ADDR */
1403 u64 id; /* if PERF_SAMPLE_ID */
1404 u64 stream_id; /* if PERF_SAMPLE_STREAM_ID */
1405 u32 cpu, res; /* if PERF_SAMPLE_CPU */
1406 u64 period; /* if PERF_SAMPLE_PERIOD */
1407 struct read_format v;
1408 /* if PERF_SAMPLE_READ */
1409 u64 nr; /* if PERF_SAMPLE_CALLCHAIN */
1410 u64 ips[nr]; /* if PERF_SAMPLE_CALLCHAIN */
1411 u32 size; /* if PERF_SAMPLE_RAW */
1412 char data[size]; /* if PERF_SAMPLE_RAW */
1413 u64 bnr; /* if PERF_SAMPLE_BRANCH_STACK */
1414 struct perf_branch_entry lbr[bnr];
1415 /* if PERF_SAMPLE_BRANCH_STACK */
1416 u64 abi; /* if PERF_SAMPLE_REGS_USER */
1417 u64 regs[weight(mask)];
1418 /* if PERF_SAMPLE_REGS_USER */
1419 u64 size; /* if PERF_SAMPLE_STACK_USER */
1420 char data[size]; /* if PERF_SAMPLE_STACK_USER */
1421 u64 dyn_size; /* if PERF_SAMPLE_STACK_USER &&
1422 size != 0 */
1423 u64 weight; /* if PERF_SAMPLE_WEIGHT */
1424 u64 data_src; /* if PERF_SAMPLE_DATA_SRC */
1425 u64 transaction; /* if PERF_SAMPLE_TRANSACTION */
1426 u64 abi; /* if PERF_SAMPLE_REGS_INTR */
1427 u64 regs[weight(mask)];
1428 /* if PERF_SAMPLE_REGS_INTR */
1429 };
1430
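Since each member is present only when its sample_type bit is set, a parser has to consume the record body in the order shown above, skipping the members that were not requested. The following is a minimal sketch covering only the fixed-size leading members (up to period); the S_* macros are local mirrors of the PERF_SAMPLE_* bits in <linux/perf_event.h>, and struct sample_prefix, take(), and parse_sample_prefix() are hypothetical names. The body pointer is assumed to point just past the perf_event_header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of PERF_SAMPLE_* bits from <linux/perf_event.h>. */
#define S_IP         (1ULL << 0)
#define S_TID        (1ULL << 1)
#define S_TIME       (1ULL << 2)
#define S_ADDR       (1ULL << 3)
#define S_ID         (1ULL << 6)
#define S_CPU        (1ULL << 7)
#define S_PERIOD     (1ULL << 8)
#define S_STREAM_ID  (1ULL << 9)
#define S_IDENTIFIER (1ULL << 16)

struct sample_prefix {
    uint64_t sample_id, ip, time, addr, id, stream_id, period;
    uint32_t pid, tid, cpu;
};

/* Copy n bytes from *p and advance it; nonzero if fewer than n remain. */
int take(const uint8_t **p, const uint8_t *end, void *dst, size_t n)
{
    if ((size_t)(end - *p) < n)
        return -1;
    memcpy(dst, *p, n);
    *p += n;
    return 0;
}

/* Returns the number of bytes consumed, or -1 if the body is too short. */
long parse_sample_prefix(const void *body, size_t len,
                         uint64_t sample_type, struct sample_prefix *s)
{
    const uint8_t *start = body, *p = start, *end = start + len;
    uint32_t pair[2];

    memset(s, 0, sizeof(*s));
    if ((sample_type & S_IDENTIFIER) && take(&p, end, &s->sample_id, 8))
        return -1;
    if ((sample_type & S_IP) && take(&p, end, &s->ip, 8))
        return -1;
    if (sample_type & S_TID) {
        if (take(&p, end, pair, 8))
            return -1;
        s->pid = pair[0];
        s->tid = pair[1];
    }
    if ((sample_type & S_TIME) && take(&p, end, &s->time, 8))
        return -1;
    if ((sample_type & S_ADDR) && take(&p, end, &s->addr, 8))
        return -1;
    if ((sample_type & S_ID) && take(&p, end, &s->id, 8))
        return -1;
    if ((sample_type & S_STREAM_ID) && take(&p, end, &s->stream_id, 8))
        return -1;
    if (sample_type & S_CPU) {
        if (take(&p, end, pair, 8))
            return -1;
        s->cpu = pair[0];       /* pair[1] is the reserved word */
    }
    if ((sample_type & S_PERIOD) && take(&p, end, &s->period, 8))
        return -1;
    return (long)(p - start);
}
```

The returned byte count tells the caller where the variable-length members (read values, callchain, raw data, and so on) would begin.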
1431 sample_id
1432 If PERF_SAMPLE_IDENTIFIER is enabled, a 64-bit unique ID
1433 is included. This is a duplication of the PERF_SAM‐
1434 PLE_ID id value, but included at the beginning of the
1435 sample so parsers can easily obtain the value.
1436
1437 ip If PERF_SAMPLE_IP is enabled, then a 64-bit instruction
1438 pointer value is included.
1439
1440 pid, tid
1441 If PERF_SAMPLE_TID is enabled, then a 32-bit process ID
1442 and 32-bit thread ID are included.
1443
1444 time
1445 If PERF_SAMPLE_TIME is enabled, then a 64-bit timestamp
1446 is included. This is obtained via local_clock() which
1447 is a hardware timestamp if available and the jiffies
1448 value if not.
1449
1450 addr
1451 If PERF_SAMPLE_ADDR is enabled, then a 64-bit address is
1452 included. This is usually the address of a tracepoint,
1453 breakpoint, or software event; otherwise the value is 0.
1454
1455 id If PERF_SAMPLE_ID is enabled, a 64-bit unique ID is
1456 included. If the event is a member of an event group,
1457 the group leader ID is returned. This ID is the same as
1458 the one returned by PERF_FORMAT_ID.
1459
1460 stream_id
1461 If PERF_SAMPLE_STREAM_ID is enabled, a 64-bit unique ID
1462 is included. Unlike PERF_SAMPLE_ID the actual ID is
1463 returned, not the group leader. This ID is the same as
1464 the one returned by PERF_FORMAT_ID.
1465
1466 cpu, res
1467 If PERF_SAMPLE_CPU is enabled, this is a 32-bit value
1468 indicating which CPU was being used, in addition to a
1469 reserved (unused) 32-bit value.
1470
1471 period
1472 If PERF_SAMPLE_PERIOD is enabled, a 64-bit value indi‐
1473 cating the current sampling period is written.
1474
1475 v If PERF_SAMPLE_READ is enabled, a structure of type
1476 read_format is included which has values for all events
1477 in the event group. The values included depend on the
1478 read_format value used at perf_event_open() time.
1479
1480 nr, ips[nr]
1481 If PERF_SAMPLE_CALLCHAIN is enabled, then a 64-bit
1482 number is included which indicates how many 64-bit
1483 instruction pointers follow. This is the current
1484 callchain.
1485
1486 size, data[size]
1487 If PERF_SAMPLE_RAW is enabled, then a 32-bit value indi‐
1488 cating size is included followed by an array of 8-bit
1489 values of length size. The values are padded with 0 to
1490 have 64-bit alignment.
1491
1492 This RAW record data is opaque with respect to the ABI.
1493 The ABI doesn't make any promises with respect to the
1494 stability of its content; it may vary depending on
1495 event, hardware, and kernel version.
1496
1497 bnr, lbr[bnr]
1498 If PERF_SAMPLE_BRANCH_STACK is enabled, then a 64-bit
1499 value indicating the number of records is included, fol‐
1500 lowed by bnr perf_branch_entry structures which each
1501 include the fields:
1502
1503 from This indicates the source instruction (may not be
1504 a branch).
1505
1506 to The branch target.
1507
1508 mispred
1509 The branch target was mispredicted.
1510
1511 predicted
1512 The branch target was predicted.
1513
1514 in_tx (since Linux 3.11)
1515 The branch was in a transactional memory transac‐
1516 tion.
1517
1518 abort (since Linux 3.11)
1519 The branch was in an aborted transactional memory
1520 transaction.
1521
1522 cycles (since Linux 4.3)
1523 This reports the number of cycles elapsed since
1524 the previous branch stack update.
1525
1526 The entries are from most to least recent, so the first
1527 entry has the most recent branch.
1528
1529 Support for mispred, predicted, and cycles is optional;
1530 if not supported, those values will be 0.
1531
1532 The type of branches recorded is specified by the
1533 branch_sample_type field.
1534
1535 abi, regs[weight(mask)]
1536 If PERF_SAMPLE_REGS_USER is enabled, then the user CPU
1537 registers are recorded.
1538
1539 The abi field is one of PERF_SAMPLE_REGS_ABI_NONE,
1540 PERF_SAMPLE_REGS_ABI_32 or PERF_SAMPLE_REGS_ABI_64.
1541
1542 The regs field is an array of the CPU registers that
1543 were specified by the sample_regs_user attr field. The
1544 number of values is the number of bits set in the sam‐
1545 ple_regs_user bit mask.
1546
1547 size, data[size], dyn_size
1548 If PERF_SAMPLE_STACK_USER is enabled, then the user
1549 stack is recorded. This can be used to generate stack
1550 backtraces. size is the size requested by the user in
1551 sample_stack_user or else the maximum record size. data
1552 is the stack data (a raw dump of the memory pointed to
1553 by the stack pointer at the time of sampling). dyn_size
1554 is the amount of data actually dumped (can be less than
1555 size). Note that dyn_size is omitted if size is 0.
1556
1557 weight
1558 If PERF_SAMPLE_WEIGHT is enabled, then a 64-bit value
1559 provided by the hardware is recorded that indicates how
1560 costly the event was. This allows expensive events to
1561 stand out more clearly in profiles.
1562
1563 data_src
1564 If PERF_SAMPLE_DATA_SRC is enabled, then a 64-bit value
1565 is recorded that is made up of the following fields:
1566
1567 mem_op
1568 Type of opcode, a bitwise combination of:
1569
1570 PERF_MEM_OP_NA Not available
1571 PERF_MEM_OP_LOAD Load instruction
1572 PERF_MEM_OP_STORE Store instruction
1573 PERF_MEM_OP_PFETCH Prefetch
1574 PERF_MEM_OP_EXEC Executable code
1575
1576 mem_lvl
1577 Memory hierarchy level hit or miss, a bitwise combi‐
1578 nation of the following, shifted left by
1579 PERF_MEM_LVL_SHIFT:
1580
1581 PERF_MEM_LVL_NA Not available
1582 PERF_MEM_LVL_HIT Hit
1583 PERF_MEM_LVL_MISS Miss
1584 PERF_MEM_LVL_L1 Level 1 cache
1585 PERF_MEM_LVL_LFB Line fill buffer
1586 PERF_MEM_LVL_L2 Level 2 cache
1587 PERF_MEM_LVL_L3 Level 3 cache
1588 PERF_MEM_LVL_LOC_RAM Local DRAM
1589 PERF_MEM_LVL_REM_RAM1 Remote DRAM 1 hop
1590 PERF_MEM_LVL_REM_RAM2 Remote DRAM 2 hops
1591 PERF_MEM_LVL_REM_CCE1 Remote cache 1 hop
1592 PERF_MEM_LVL_REM_CCE2 Remote cache 2 hops
1593 PERF_MEM_LVL_IO I/O memory
1594 PERF_MEM_LVL_UNC Uncached memory
1595
1596 mem_snoop
1597 Snoop mode, a bitwise combination of the following,
1598 shifted left by PERF_MEM_SNOOP_SHIFT:
1599
1600 PERF_MEM_SNOOP_NA Not available
1601 PERF_MEM_SNOOP_NONE No snoop
1602 PERF_MEM_SNOOP_HIT Snoop hit
1603 PERF_MEM_SNOOP_MISS Snoop miss
1604 PERF_MEM_SNOOP_HITM Snoop hit modified
1605
1606 mem_lock
1607 Lock instruction, a bitwise combination of the fol‐
1608 lowing, shifted left by PERF_MEM_LOCK_SHIFT:
1609
1610 PERF_MEM_LOCK_NA Not available
1611 PERF_MEM_LOCK_LOCKED Locked transaction
1612
1613 mem_dtlb
1614 TLB access hit or miss, a bitwise combination of the
1615 following, shifted left by PERF_MEM_TLB_SHIFT:
1616
1617 PERF_MEM_TLB_NA Not available
1618 PERF_MEM_TLB_HIT Hit
1619 PERF_MEM_TLB_MISS Miss
1620 PERF_MEM_TLB_L1 Level 1 TLB
1621 PERF_MEM_TLB_L2 Level 2 TLB
1622 PERF_MEM_TLB_WK Hardware walker
1623 PERF_MEM_TLB_OS OS fault handler
1624
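The data_src value can be unpacked with shifts and masks. The following is a minimal sketch: the MEM_*_SHIFT macros are local mirrors of the PERF_MEM_*_SHIFT constants, and the field widths (5, 14, 5, 2, and 7 bits) are an assumption taken from the union perf_mem_data_src bit-field layout in <linux/perf_event.h>. The names struct data_src_fields and decode_data_src() are hypothetical.

```c
#include <stdint.h>

/* Local mirrors of the PERF_MEM_*_SHIFT constants. */
#define MEM_OP_SHIFT    0
#define MEM_LVL_SHIFT   5
#define MEM_SNOOP_SHIFT 19
#define MEM_LOCK_SHIFT  24
#define MEM_TLB_SHIFT   26

struct data_src_fields {
    uint32_t mem_op;
    uint32_t mem_lvl;
    uint32_t mem_snoop;
    uint32_t mem_lock;
    uint32_t mem_dtlb;
};

/* Split a 64-bit data_src value into its unshifted fields. */
struct data_src_fields decode_data_src(uint64_t v)
{
    struct data_src_fields f;

    f.mem_op    = (uint32_t)((v >> MEM_OP_SHIFT)    & 0x1f);    /* 5 bits */
    f.mem_lvl   = (uint32_t)((v >> MEM_LVL_SHIFT)   & 0x3fff);  /* 14 bits */
    f.mem_snoop = (uint32_t)((v >> MEM_SNOOP_SHIFT) & 0x1f);    /* 5 bits */
    f.mem_lock  = (uint32_t)((v >> MEM_LOCK_SHIFT)  & 0x3);     /* 2 bits */
    f.mem_dtlb  = (uint32_t)((v >> MEM_TLB_SHIFT)   & 0x7f);    /* 7 bits */
    return f;
}
```

Each decoded field can then be tested against the unshifted PERF_MEM_* values listed above, for example checking mem_lvl for the hit and level 1 cache bits together.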
1625 transaction
1626 If the PERF_SAMPLE_TRANSACTION flag is set, then a
1627 64-bit field is recorded describing the sources of any
1628 transactional memory aborts.
1629
1630 The field is a bitwise combination of the following val‐
1631 ues:
1632
1633 PERF_TXN_ELISION
1634 Abort from an elision type transaction (Intel-
1635 CPU-specific).
1636
1637 PERF_TXN_TRANSACTION
1638 Abort from a generic transaction.
1639
1640 PERF_TXN_SYNC
1641 Synchronous abort (related to the reported
1642 instruction).
1643
1644 PERF_TXN_ASYNC
1645 Asynchronous abort (not related to the reported
1646 instruction).
1647
1648 PERF_TXN_RETRY
1649 Retryable abort (retrying the transaction may
1650 have succeeded).
1651
1652 PERF_TXN_CONFLICT
1653 Abort due to memory conflicts with other threads.
1654
1655 PERF_TXN_CAPACITY_WRITE
1656 Abort due to write capacity overflow.
1657
1658 PERF_TXN_CAPACITY_READ
1659 Abort due to read capacity overflow.
1660
1661 In addition, a user-specified abort code can be obtained
1662 from the high 32 bits of the field by masking with
1663 PERF_TXN_ABORT_MASK and then shifting right by
1664 PERF_TXN_ABORT_SHIFT.
1665
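Extracting the abort code can be sketched as follows. The TXN_* macros are local mirrors of PERF_TXN_ABORT_SHIFT and PERF_TXN_ABORT_MASK, under the assumption (matching <linux/perf_event.h>) that the mask covers the high 32 bits in place, so the mask is applied before the shift. The function names are hypothetical.

```c
#include <stdint.h>

/* Local mirrors of PERF_TXN_ABORT_SHIFT and PERF_TXN_ABORT_MASK. */
#define TXN_ABORT_SHIFT 32
#define TXN_ABORT_MASK  (0xffffffffULL << TXN_ABORT_SHIFT)

/* User-specified abort code stored in the high 32 bits. */
uint32_t txn_abort_code(uint64_t txn)
{
    return (uint32_t)((txn & TXN_ABORT_MASK) >> TXN_ABORT_SHIFT);
}

/* The PERF_TXN_* flag bits stored in the low 32 bits. */
uint32_t txn_flags(uint64_t txn)
{
    return (uint32_t)(txn & ~TXN_ABORT_MASK);
}
```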
1666 abi, regs[weight(mask)]
1667 If PERF_SAMPLE_REGS_INTR is enabled, then the CPU
1668 registers at the time of the sample are recorded.
1669
1670 The abi field is one of PERF_SAMPLE_REGS_ABI_NONE,
1671 PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
1672
1673 The regs field is an array of the CPU registers that
1674 were specified by the sample_regs_intr attr field. The
1675 number of values is the number of bits set in the sam‐
1676 ple_regs_intr bit mask.
1677
1678 PERF_RECORD_MMAP2
1679 This record includes extended information on mmap(2) calls
1680 returning executable mappings. The format is similar to
1681 that of the PERF_RECORD_MMAP record, but includes extra val‐
1682 ues that allow uniquely identifying shared mappings.
1683
1684 struct {
1685 struct perf_event_header header;
1686 u32 pid;
1687 u32 tid;
1688 u64 addr;
1689 u64 len;
1690 u64 pgoff;
1691 u32 maj;
1692 u32 min;
1693 u64 ino;
1694 u64 ino_generation;
1695 u32 prot;
1696 u32 flags;
1697 char filename[];
1698 struct sample_id sample_id;
1699 };
1700
1701 pid is the process ID.
1702
1703 tid is the thread ID.
1704
1705 addr is the address of the allocated memory.
1706
1707 len is the length of the allocated memory.
1708
1709 pgoff is the page offset of the allocated memory.
1710
1711 maj is the major ID of the underlying device.
1712
1713 min is the minor ID of the underlying device.
1714
1715 ino is the inode number.
1716
1717 ino_generation
1718 is the inode generation.
1719
1720 prot is the protection information.
1721
1722 flags is the flags information.
1723
1724 filename
1725 is a string describing the backing of the allocated
1726 memory.
1727
1728 PERF_RECORD_AUX (since Linux 4.1)
1729
1730 This record reports that new data is available in the sepa‐
1731 rate AUX buffer region.
1732
1733 struct {
1734 struct perf_event_header header;
1735 u64 aux_offset;
1736 u64 aux_size;
1737 u64 flags;
1738 struct sample_id sample_id;
1739 };
1740
1741 aux_offset
1742 offset in the AUX mmap region where the new data
1743 begins.
1744
1745 aux_size
1746 size of the data made available.
1747
1748 flags describes the AUX update.
1749
1750 PERF_AUX_FLAG_TRUNCATED
1751 if set, then the data returned was truncated
1752 to fit the available buffer size.
1753
1754 PERF_AUX_FLAG_OVERWRITE
1755 if set, then the data returned has overwritten
1756 previous data.
1757
1758 PERF_RECORD_ITRACE_START (since Linux 4.1)
1759
1760 This record indicates which process has initiated an
1761 instruction trace event, allowing tools to properly corre‐
1762 late the instruction addresses in the AUX buffer with the
1763 proper executable.
1764
1765 struct {
1766 struct perf_event_header header;
1767 u32 pid;
1768 u32 tid;
1769 };
1770
1771 pid process ID of the thread starting an instruction
1772 trace.
1773
1774 tid thread ID of the thread starting an instruction
1775 trace.
1776
1777 PERF_RECORD_LOST_SAMPLES (since Linux 4.2)
1778
1779 When using hardware sampling (such as Intel PEBS) this
1780 record indicates some number of samples that may have been
1781 lost.
1782
1783 struct {
1784 struct perf_event_header header;
1785 u64 lost;
1786 struct sample_id sample_id;
1787 };
1788
1789 lost the number of potentially lost samples.
1790
1791 PERF_RECORD_SWITCH (since Linux 4.3)
1792
1793 This record indicates a context switch has happened. The
1794 PERF_RECORD_MISC_SWITCH_OUT bit in the misc field indicates
1795 whether it was a context switch into or away from the cur‐
1796 rent process.
1797
1798 struct {
1799 struct perf_event_header header;
1800 struct sample_id sample_id;
1801 };
1802
1803 PERF_RECORD_SWITCH_CPU_WIDE (since Linux 4.3)
1804
1805 As with PERF_RECORD_SWITCH this record indicates a context
1806 switch has happened, but it only occurs when sampling in
1807 CPU-wide mode and provides additional information on the
1808 process being switched to/from. The
1809 PERF_RECORD_MISC_SWITCH_OUT bit in the misc field indicates
1810 whether it was a context switch into or away from the cur‐
1811 rent process.
1812
1813 struct {
1814 struct perf_event_header header;
1815 u32 next_prev_pid;
1816 u32 next_prev_tid;
1817 struct sample_id sample_id;
1818 };
1819
1820 next_prev_pid
1821 The process ID of the previous (if switching in) or
1822 next (if switching out) process on the CPU.
1823
1824 next_prev_tid
1825 The thread ID of the previous (if switching in) or
1826 next (if switching out) thread on the CPU.
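
The record layouts above can be consumed with a simple walk over the headers. The following is a minimal sketch, not the method used by any particular tool: it assumes the records have already been copied out of the mmap ring buffer into a linear buffer (so no wrap-around handling is needed), and the names record_summary and summarize_records are hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/* Hypothetical tally of the record types described above. */
struct record_summary {
    uint64_t lost_samples;   /* sum of PERF_RECORD_LOST_SAMPLES.lost */
    unsigned switch_in;      /* switch records without SWITCH_OUT */
    unsigned switch_out;     /* switch records with SWITCH_OUT */
    unsigned aux_truncated;  /* PERF_RECORD_AUX with TRUNCATED set */
};

/* Walk a linear buffer of records, using header.size to step from
   one record to the next.  Fields are copied out with memcpy() so
   that no alignment of buf is assumed. */
static void
summarize_records(const unsigned char *buf, size_t len,
                  struct record_summary *s)
{
    size_t off = 0;

    memset(s, 0, sizeof(*s));
    while (off + sizeof(struct perf_event_header) <= len) {
        struct perf_event_header h;
        const unsigned char *body;
        uint64_t v;

        memcpy(&h, buf + off, sizeof(h));
        if (h.size < sizeof(h) || off + h.size > len)
            break;                      /* malformed or truncated */
        body = buf + off + sizeof(h);

        switch (h.type) {
        case PERF_RECORD_LOST_SAMPLES:  /* body starts with u64 lost */
            if (h.size >= sizeof(h) + sizeof(v)) {
                memcpy(&v, body, sizeof(v));
                s->lost_samples += v;
            }
            break;
        case PERF_RECORD_AUX:           /* aux_offset, aux_size, flags */
            if (h.size >= sizeof(h) + 3 * sizeof(v)) {
                memcpy(&v, body + 2 * sizeof(v), sizeof(v));
                if (v & PERF_AUX_FLAG_TRUNCATED)
                    s->aux_truncated++;
            }
            break;
        case PERF_RECORD_SWITCH:
        case PERF_RECORD_SWITCH_CPU_WIDE:
            if (h.misc & PERF_RECORD_MISC_SWITCH_OUT)
                s->switch_out++;
            else
                s->switch_in++;
            break;
        default:
            break;                      /* other record types ignored */
        }
        off += h.size;
    }
}
```

A caller would pass the bytes between the ring buffer's tail and head and then inspect the summary, for example to warn when lost_samples is nonzero.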
1827
1828 Overflow handling
1829 Events can be set to notify when a threshold is crossed, indicating an
1830 overflow. Overflow conditions can be captured by monitoring the event
1831 file descriptor with poll(2), select(2), or epoll(7). Alternatively,
1832 the overflow events can be captured via a signal handler, by enabling
1833 I/O signaling on the file descriptor; see the discussion of the
1834 F_SETOWN and F_SETSIG operations in fcntl(2).
1835
1836 Overflows are generated only by sampling events (sample_period must
1837 have a nonzero value).
1838
1839 There are two ways to generate overflow notifications.
1840
1841 The first is to set a wakeup_events or wakeup_watermark value that will
1842 trigger if a certain number of samples or bytes have been written to
1843 the mmap ring buffer. In this case, POLL_IN is indicated.
1844
1845 The other way is by use of the PERF_EVENT_IOC_REFRESH ioctl. This
1846 ioctl adds to a counter that decrements each time the event overflows.
1847 While the counter is nonzero, POLL_IN is indicated; once the counter
1848 reaches 0, POLL_HUP is indicated and the underlying event is disabled.
1849
1850 Refreshing an event group leader refreshes all siblings and refreshing
1851 with a parameter of 0 currently enables infinite refreshes; these
1852 behaviors are unsupported and should not be relied on.
1853
1854 Starting with Linux 3.18, POLL_HUP is indicated if the event being mon‐
1855 itored is attached to a different process and that process exits.
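
As an illustration of the poll(2) approach described in this section, here is a minimal sketch. The helper wait_for_overflow and the overflow_state values are hypothetical names. It assumes fd is an event file descriptor whose wakeup_events or wakeup_watermark has been configured; poll(2) reports the conditions as POLLIN and POLLHUP.

```c
#include <poll.h>

/* Hypothetical result codes for the helper below. */
enum overflow_state {
    OVF_ERROR   = -1,   /* poll(2) failed or reported an error */
    OVF_TIMEOUT = 0,    /* nothing happened within timeout_ms */
    OVF_READY   = 1,    /* overflow notification: data available */
    OVF_HANGUP  = 2     /* event disabled or monitored task gone */
};

/* Wait up to timeout_ms milliseconds for an overflow notification
   on a perf event file descriptor. */
static int
wait_for_overflow(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return OVF_ERROR;
    if (n == 0)
        return OVF_TIMEOUT;
    if (pfd.revents & POLLIN)
        return OVF_READY;
    if (pfd.revents & POLLHUP)
        return OVF_HANGUP;
    return OVF_ERROR;
}
```

On OVF_READY a caller would typically consume the new records from the mmap ring buffer; on OVF_HANGUP it could re-arm the event with PERF_EVENT_IOC_REFRESH or close the descriptor.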
1856
1857 rdpmc instruction
1858 Starting with Linux 3.4 on x86, you can use the rdpmc instruction to
1859 get low-latency reads without having to enter the kernel. Note that
1860 using rdpmc is not necessarily faster than other methods for reading
1861 event values.
1862
1863 Support for this can be detected with the cap_usr_rdpmc field in the
1864 mmap page; documentation on how to calculate event values can be found
1865 in that section.
1866
1867 Originally, when rdpmc support was enabled, any process (not just ones
1868 with an active perf event) could use the rdpmc instruction to access
1869 the counters. Starting with Linux 4.0, rdpmc support is only allowed
1870 if an event is currently enabled in a process's context. To restore
1871 the old behavior, write the value 2 to /sys/devices/cpu/rdpmc.
1872
1873 perf_event ioctl calls
1874 Various ioctls act on perf_event_open() file descriptors:
1875
1876 PERF_EVENT_IOC_ENABLE
1877 This enables the individual event or event group specified by
1878 the file descriptor argument.
1879
1880 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
1881 then all events in a group are enabled, even if the event speci‐
1882 fied is not the group leader (but see BUGS).
1883
1884 PERF_EVENT_IOC_DISABLE
1885 This disables the individual counter or event group specified by
1886 the file descriptor argument.
1887
1888 Enabling or disabling the leader of a group enables or disables
1889 the entire group; that is, while the group leader is disabled,
1890 none of the counters in the group will count. Enabling or dis‐
1891 abling a member of a group other than the leader affects only
1892 that counter; disabling a non-leader stops that counter from
1893 counting but doesn't affect any other counter.
1894
1895 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
1896 then all events in a group are disabled, even if the event spec‐
1897 ified is not the group leader (but see BUGS).
1898
1899 PERF_EVENT_IOC_REFRESH
1900 Non-inherited overflow counters can use this to enable a counter
1901 for a number of overflows specified by the argument, after which
1902 it is disabled. Subsequent calls of this ioctl add the argument
1903 value to the current count. An overflow notification with
1904 POLL_IN set will happen on each overflow until the count reaches
1905 0; when that happens a notification with POLL_HUP set is sent
1906 and the event is disabled. Using an argument of 0 is considered
1907 undefined behavior.
1908
1909 PERF_EVENT_IOC_RESET
1910 Reset the event count specified by the file descriptor argument
1911 to zero. This resets only the counts; there is no way to reset
1912 the multiplexing time_enabled or time_running values.
1913
1914 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
1915 then all events in a group are reset, even if the event speci‐
1916 fied is not the group leader (but see BUGS).
1917
1918 PERF_EVENT_IOC_PERIOD
1919 This updates the overflow period for the event.
1920
1921 Since Linux 3.7 (on ARM) and Linux 3.14 (all other architec‐
1922 tures), the new period takes effect immediately. On older ker‐
1923 nels, the new period did not take effect until after the next
1924 overflow.
1925
1926 The argument is a pointer to a 64-bit value containing the
1927 desired new period.
1928
1929 Prior to Linux 2.6.36, this ioctl always failed due to a bug in
1930 the kernel.
1931
1932 PERF_EVENT_IOC_SET_OUTPUT
1933 This tells the kernel to report event notifications to the spec‐
1934 ified file descriptor rather than the default one. The file
1935 descriptors must all be on the same CPU.
1936
1937 The argument specifies the desired file descriptor, or -1 if
1938 output should be ignored.
1939
1940 PERF_EVENT_IOC_SET_FILTER (since Linux 2.6.33)
1941 This adds an ftrace filter to this event.
1942
1943 The argument is a pointer to the desired ftrace filter.
1944
1945 PERF_EVENT_IOC_ID (since Linux 3.12)
1946 This returns the event ID value for the given event file
1947 descriptor.
1948
1949 The argument is a pointer to a 64-bit unsigned integer to hold
1950 the result.
1951
1952 PERF_EVENT_IOC_SET_BPF (since Linux 4.1)
1953 This allows attaching a Berkeley Packet Filter (BPF) program to
1954 an existing kprobe tracepoint event. You need CAP_SYS_ADMIN
1955 privileges to use this ioctl.
1956
1957 The argument is a BPF program file descriptor that was created
1958 by a previous bpf(2) system call.
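
The argument conventions of several of these ioctls can be sketched as thin wrappers. This is an illustrative sketch only; the function names are hypothetical, and each wrapper simply returns what ioctl(2) returns (-1 with errno set on failure).

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/* Reset and then enable every event in the group fd belongs to. */
static int
group_restart(int fd)
{
    if (ioctl(fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1)
        return -1;
    return ioctl(fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
}

/* Disable every event in the group fd belongs to. */
static int
group_stop(int fd)
{
    return ioctl(fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
}

/* PERF_EVENT_IOC_PERIOD takes a pointer to the new 64-bit period,
   not the period itself. */
static int
set_period(int fd, uint64_t period)
{
    return ioctl(fd, PERF_EVENT_IOC_PERIOD, &period);
}

/* PERF_EVENT_IOC_ID stores the event ID through the pointer. */
static int
get_event_id(int fd, uint64_t *id)
{
    return ioctl(fd, PERF_EVENT_IOC_ID, id);
}
```

A typical sequence would be group_restart(leader_fd), the code under measurement, and then group_stop(leader_fd) followed by read(2) on each descriptor.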
1959
1960 Using prctl(2)
1961 A process can enable or disable all the event groups that are attached
1962 to it using the prctl(2) PR_TASK_PERF_EVENTS_ENABLE and
1963 PR_TASK_PERF_EVENTS_DISABLE operations. This applies to all counters
1964 on the calling process, whether created by this process or by another,
1965 and does not affect any counters that this process has created on other
1966 processes. It enables or disables only the group leaders, not any
1967 other members in the groups.
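
A possible use of these prctl(2) operations is to exclude a region of code from measurement. The sketch below assumes that use; run_unmeasured, demo_work, and demo_calls are hypothetical names introduced only for illustration.

```c
#include <sys/prctl.h>

static int demo_calls;      /* illustration only: counts callback runs */

static void
demo_work(void)
{
    demo_calls++;
}

/* Disable all event groups attached to the calling process, run fn,
   then enable them again.  Returns 0 if both prctl(2) calls
   succeeded, -1 otherwise.  fn runs in either case. */
static int
run_unmeasured(void (*fn)(void))
{
    int rc = prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0);

    fn();
    if (prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0) == -1)
        rc = -1;
    return rc;
}
```

Calling run_unmeasured(demo_work) runs demo_work() once with the calling process's counters paused, provided the kernel accepts both operations.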
1968
1969 perf_event related configuration files
1970 Files in /proc/sys/kernel/
1971
1972 /proc/sys/kernel/perf_event_paranoid
1973 The perf_event_paranoid file can be set to restrict access
1974 to the performance counters.
1975
1976 2 allow only user-space measurements (default since Linux
1977 4.6).
1978 1 allow both kernel and user measurements (default before
1979 Linux 4.6).
1980 0 allow access to CPU-specific data but not raw tracepoint
1981 samples.
1982 -1 no restrictions.
1983
1984 The existence of the perf_event_paranoid file is the offi‐
1985 cial method for determining if a kernel supports
1986 perf_event_open().
1987
1988 /proc/sys/kernel/perf_event_max_sample_rate
1989 This sets the maximum sample rate. Setting this too high
1990 can allow users to sample at a rate that impacts overall
1991 machine performance and potentially lock up the machine.
1992 The default value is 100000 (samples per second).
1993
1994 /proc/sys/kernel/perf_event_max_stack
1995 This file sets the maximum depth of stack frame entries
1996 reported when generating a call trace.
1997
1998 /proc/sys/kernel/perf_event_mlock_kb
1999 Maximum amount of memory, in kilobytes, that an unprivileged
2000 user can mlock(2). The default is 516 (kB).
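
Each of the files above holds a single integer, so they can be read with one small helper. The following sketch uses hypothetical names (read_int_file, paranoid_allows_kernel) and interprets the perf_event_paranoid value according to the table above.

```c
#include <stdio.h>

/* Read a single integer from path.  Returns 0 on success, or -1 if
   the file cannot be opened or does not start with an integer. */
static int
read_int_file(const char *path, long *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Per the table above: levels of 1 and below permit kernel
   measurements by unprivileged users; 2 and above do not. */
static int
paranoid_allows_kernel(long level)
{
    return level <= 1;
}
```

For example, if read_int_file("/proc/sys/kernel/perf_event_paranoid", &level) fails, the kernel most likely lacks perf_event_open() support; if it succeeds and paranoid_allows_kernel(level) is false, an unprivileged caller should set exclude_kernel.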
2001
2002 Files in /sys/bus/event_source/devices/
2003
2004 Since Linux 2.6.34, the kernel supports having multiple PMUs avail‐
2005 able for monitoring. Information on how to program these PMUs can
2006 be found under /sys/bus/event_source/devices/. Each subdirectory
2007 corresponds to a different PMU.
2008
2009 /sys/bus/event_source/devices/*/type (since Linux 2.6.38)
2010 This contains an integer that can be used in the type field
2011 of perf_event_attr to indicate that you wish to use this
2012 PMU.
2013
2014 /sys/bus/event_source/devices/cpu/rdpmc (since Linux 3.4)
2015 If this file is 1, then direct user-space access to the per‐
2016 formance counter registers is allowed via the rdpmc instruc‐
2017 tion. This can be disabled by echoing 0 to the file.
2018
2019 As of Linux 4.0 the behavior has changed, so that 1 now
2020 means only allow access to processes with active perf
2021 events, with 2 indicating the old allow-anyone-access behav‐
2022 ior.
2023
2024 /sys/bus/event_source/devices/*/format/ (since Linux 3.4)
2025 This subdirectory contains information on the architecture-
2026 specific subfields available for programming the various
2027 config fields in the perf_event_attr struct.
2028
2029 The content of each file is the name of the config field,
2030 followed by a colon, followed by a series of integer bit
2031 ranges separated by commas. For example, the file event may
2032 contain the value config1:1,6-10,44 which indicates that
2033 event is an attribute that occupies bits 1, 6–10, and 44 of
2034 perf_event_attr::config1.
2035
2036 /sys/bus/event_source/devices/*/events/ (since Linux 3.4)
2037 This subdirectory contains files with predefined events.
2038 The contents are strings describing the event settings
2039 expressed in terms of the fields found in the previously
2040 mentioned ./format/ directory. These are not necessarily
2041 complete lists of all events supported by a PMU, but usually
2042 a subset of events deemed useful or interesting.
2043
2044 The content of each file is a list of attribute names sepa‐
2045 rated by commas. Each entry has an optional value (either
2046 hex or decimal). If no value is specified, then it is
2047 assumed to be a single-bit field with a value of 1. An
2048 example entry may look like this: event=0x2,inv,ldlat=3.
2049
2050 /sys/bus/event_source/devices/*/uevent
2051 This file is the standard kernel device interface for
2052 injecting hotplug events.
2053
2054 /sys/bus/event_source/devices/*/cpumask (since Linux 3.7)
2055 The cpumask file contains a comma-separated list of integers
2056 that indicate a representative CPU number for each socket
2057 (package) on the motherboard. This is needed when setting
2058 up uncore or northbridge events, as those PMUs present
2059 socket-wide events.
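
The format file syntax described above (a config field name, a colon, then comma-separated bits or bit ranges) is simple enough to parse by hand. The following sketch shows one way to do so; parse_format_spec is a hypothetical helper, not part of any library.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse a format entry such as "config1:1,6-10,44".  On success,
   store the field name in field, store in *mask a bitmask with the
   listed bits set, and return 0.  Return -1 on malformed input. */
static int
parse_format_spec(const char *spec, char *field, size_t field_len,
                  uint64_t *mask)
{
    const char *colon = strchr(spec, ':');
    const char *p;
    size_t n;

    if (colon == NULL || colon == spec)
        return -1;
    n = (size_t) (colon - spec);
    if (n >= field_len)
        return -1;
    memcpy(field, spec, n);
    field[n] = '\0';

    *mask = 0;
    p = colon + 1;
    for (;;) {
        char *end;
        unsigned long lo, hi;

        lo = hi = strtoul(p, &end, 10);
        if (end == p)
            return -1;              /* no digits where expected */
        if (*end == '-') {          /* a range such as 6-10 */
            p = end + 1;
            hi = strtoul(p, &end, 10);
            if (end == p)
                return -1;
        }
        if (lo > hi || hi > 63)
            return -1;
        for (unsigned long b = lo; b <= hi; b++)
            *mask |= (uint64_t) 1 << b;
        if (*end != ',')
            break;                  /* end of list */
        p = end + 1;
    }
    return 0;
}
```

Given the example string from the text, the helper yields the field name config1 and a mask with bits 1, 6 through 10, and 44 set.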
2060
2061 RETURN VALUE
2062 perf_event_open() returns the new file descriptor, or -1 if an error
2063 occurred (in which case, errno is set appropriately).
2064
2065 ERRORS
2066 The errors returned by perf_event_open() can be inconsistent, and may
2067 vary across processor architectures and performance monitoring units.
2068
2069 E2BIG Returned if the perf_event_attr size value is too small (smaller
2070 than PERF_ATTR_SIZE_VER0), too big (larger than the page size),
2071 or larger than the kernel supports and the extra bytes are not
2072 zero. When E2BIG is returned, the perf_event_attr size field is
2073 overwritten by the kernel to be the size of the structure it was
2074 expecting.
2075
2076 EACCES Returned when the requested event requires CAP_SYS_ADMIN permis‐
2077 sions (or a more permissive perf_event paranoid setting). Some
2078 common cases where an unprivileged process may encounter this
2079 error: attaching to a process owned by a different user; moni‐
2080 toring all processes on a given CPU (i.e., specifying the pid
2081 argument as -1); and not setting exclude_kernel when the para‐
2082 noid setting requires it.
2083
2084 EBADF Returned if the group_fd file descriptor is not valid, or, if
2085 PERF_FLAG_PID_CGROUP is set, the cgroup file descriptor in pid
2086 is not valid.
2087
2088 EBUSY (since Linux 4.1)
2089 Returned if another event already has exclusive access to the
2090 PMU.
2091
2092 EFAULT Returned if the attr pointer points at an invalid memory
2093 address.
2094
2095 EINVAL Returned if the specified event is invalid. There are many
2096 possible reasons for this. A non-exhaustive list: sample_freq is
2097 higher than the maximum setting; the cpu to monitor does not
2098 exist; read_format is out of range; sample_type is out of range;
2099 the flags value is out of range; exclusive or pinned set and the
2100 event is not a group leader; the event config values are out of
2101 range or set reserved bits; the generic event selected is not
2102 supported; or there is not enough room to add the selected
2103 event.
2104
2105 EMFILE Each opened event uses one file descriptor. If a large number
2106 of events are opened, the per-process limit on the number of
2107 open file descriptors will be reached, and no more events can be
2108 created.
2109
2110 ENODEV Returned when the event involves a feature not supported by the
2111 current CPU.
2112
2113 ENOENT Returned if the type setting is not valid. This error is also
2114 returned for some unsupported generic events.
2115
2116 ENOSPC Prior to Linux 3.3, if there was not enough room for the event,
2117 ENOSPC was returned. In Linux 3.3, this was changed to EINVAL.
2118 ENOSPC is still returned if you try to add more breakpoint
2119 events than supported by the hardware.
2120
2121 ENOSYS Returned if PERF_SAMPLE_STACK_USER is set in sample_type and it
2122 is not supported by hardware.
2123
2124 EOPNOTSUPP
2125 Returned if an event requiring a specific hardware feature is
2126 requested but there is no hardware support. This includes
2127 requesting low-skid events if not supported, branch tracing if
2128 it is not available, sampling if no PMU interrupt is available,
2129 and branch stacks for software events.
2130
2131 EOVERFLOW (since Linux 4.8)
2132 Returned if PERF_SAMPLE_CALLCHAIN is requested and sam‐
2133 ple_max_stack is larger than the maximum specified in
2134 /proc/sys/kernel/perf_event_max_stack.
2135
2136 EPERM Returned on many (but not all) architectures when an unsupported
2137 exclude_hv, exclude_idle, exclude_user, or exclude_kernel set‐
2138 ting is specified.
2139
2140 It can also happen, as with EACCES, when the requested event
2141 requires CAP_SYS_ADMIN permissions (or a more permissive
2142 perf_event paranoid setting). This includes setting a break‐
2143 point on a kernel address, and (since Linux 3.13) setting a ker‐
2144 nel function-trace tracepoint.
2145
2146 ESRCH Returned if attempting to attach to a process that does not
2147 exist.
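
Because the same error can have several causes, a caller may want to turn errno into a hint for the user. The sketch below condenses some of the descriptions above into short messages; perf_open_hint is a hypothetical helper, and the wording of the hints is illustrative rather than exhaustive.

```c
#include <errno.h>
#include <string.h>

/* Map an errno value from a failed perf_event_open() call to a
   short hint based on the error descriptions above. */
static const char *
perf_open_hint(int err)
{
    switch (err) {
    case EACCES:
    case EPERM:
        return "insufficient privilege: check perf_event_paranoid, "
               "or set exclude_kernel";
    case EMFILE:
        return "per-process file descriptor limit reached";
    case ENOENT:
        return "invalid type setting or unsupported generic event";
    case ENODEV:
    case EOPNOTSUPP:
        return "feature not supported by this CPU or PMU";
    case ESRCH:
        return "target process does not exist";
    case EBUSY:
        return "another event has exclusive access to the PMU";
    case EINVAL:
        return "invalid event attributes";
    default:
        return strerror(err);   /* fall back to the generic text */
    }
}
```

A caller would typically print perf_open_hint(errno) to stderr immediately after perf_event_open() returns -1.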
2148
2149 VERSION
2150 perf_event_open() was introduced in Linux 2.6.31 but was called
2151 perf_counter_open(). It was renamed in Linux 2.6.32.
2152
2153 CONFORMING TO
2154 This perf_event_open() system call is Linux-specific and should not be
2155 used in programs intended to be portable.
2156
2157 NOTES
2158 Glibc does not provide a wrapper for this system call; call it using
2159 syscall(2). See the example below.
2160
2161 The official way of knowing if perf_event_open() support is enabled is
2162 checking for the existence of the file /proc/sys/ker‐
2163 nel/perf_event_paranoid.
2164
2165 BUGS
2166 The F_SETOWN_EX option to fcntl(2) is needed to properly get overflow
2167 signals in threads. This was introduced in Linux 2.6.32.
2168
2169 Prior to Linux 2.6.33 (at least for x86), the kernel did not check if
2170 events could be scheduled together until read time. The same happens
2171 on all known kernels if the NMI watchdog is enabled. To find out whether
2172 a given set of events works, you must call perf_event_open(), start the
2173 events, and read them; only then do you know that the measurements are valid.
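
One way to perform such a check is sketched below, under stated assumptions: the event was opened with read_format set to PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING (without PERF_FORMAT_GROUP), so that read(2) returns the three values in the struct shown. The names timed_reading and scaled_count are hypothetical.

```c
#include <stdint.h>

/* Assumed layout of a read(2) result for a single event with
   TOTAL_TIME_ENABLED and TOTAL_TIME_RUNNING requested. */
struct timed_reading {
    uint64_t value;          /* raw count */
    uint64_t time_enabled;   /* time the event was enabled */
    uint64_t time_running;   /* time the event was on the PMU */
};

/* Returns 0 if the event never ran, in which case the count carries
   no information.  Otherwise stores an estimate scaled up for the
   time the event was not scheduled and returns 1. */
static int
scaled_count(const struct timed_reading *r, uint64_t *estimate)
{
    if (r->time_running == 0)
        return 0;
    *estimate = (uint64_t) ((double) r->value *
                            (double) r->time_enabled /
                            (double) r->time_running);
    return 1;
}
```

A return value of 0 after the workload has run suggests that the events could not be scheduled together.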
2174
2175 Prior to Linux 2.6.34, event constraints were not enforced by the ker‐
2176 nel. In that case, some events would silently return "0" if the kernel
2177 scheduled them in an improper counter slot.
2178
2179 Prior to Linux 2.6.34, there was a bug when multiplexing where the
2180 wrong results could be returned.
2181
2182 Kernels from Linux 2.6.35 to Linux 2.6.39 can quickly crash the kernel
2183 if "inherit" is enabled and many threads are started.
2184
2185 Prior to Linux 2.6.35, PERF_FORMAT_GROUP did not work with attached
2186 processes.
2187
2188 There is a bug in the kernel code between Linux 2.6.36 and Linux 3.0
2189 that ignores the "watermark" field and acts as if wakeup_events was
2190 chosen whenever the union has a nonzero value in it.
2191
2192 From Linux 2.6.31 to Linux 3.4, the PERF_IOC_FLAG_GROUP ioctl argument
2193 was broken and would repeatedly operate on the event specified rather
2194 than iterating across all sibling events in a group.
2195
2196 From Linux 3.4 to Linux 3.11, the mmap cap_usr_rdpmc and cap_usr_time
2197 bits mapped to the same location. Code should migrate to the new
2198 cap_user_rdpmc and cap_user_time fields instead.
2199
2200 Always double-check your results! Various generalized events have had
2201 wrong values. For example, retired branches measured the wrong thing
2202 on AMD machines until Linux 2.6.35.
2203
2204 EXAMPLE
2205 The following is a short example that measures the total instruction
2206 count of a call to printf(3).
2207
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

/* There is no glibc wrapper, so invoke the system call directly. */
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    long ret;       /* syscall(2) returns long; do not truncate to int */

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                  group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;            /* start disabled; enabled below */
    pe.exclude_kernel = 1;      /* count user-space instructions only */
    pe.exclude_hv = 1;

    /* Measure the calling thread on any CPU, with no group leader. */
    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    printf("Measuring instruction count for this printf\n");

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* A counting event returns its 64-bit value via read(2). */
    if (read(fd, &count, sizeof(long long)) != sizeof(long long)) {
        perror("read");
        exit(EXIT_FAILURE);
    }

    printf("Used %lld instructions\n", count);

    close(fd);
    exit(EXIT_SUCCESS);
}
2260
2261 SEE ALSO
2262 perf(1), fcntl(2), mmap(2), open(2), prctl(2), read(2)
2263
2264 COLOPHON
2265 This page is part of release 4.16 of the Linux man-pages project. A
2266 description of the project, information about reporting bugs, and the
2267 latest version of this page, can be found at
2268 https://www.kernel.org/doc/man-pages/.
2269
2270
2271
2272Linux 2018-02-02 PERF_EVENT_OPEN(2)