PERF_EVENT_OPEN(2)         Linux Programmer's Manual         PERF_EVENT_OPEN(2)

NAME
       perf_event_open - set up performance monitoring

SYNOPSIS
       #include <linux/perf_event.h>
       #include <linux/hw_breakpoint.h>

       int perf_event_open(struct perf_event_attr *attr,
                           pid_t pid, int cpu, int group_fd,
                           unsigned long flags);
15
16 Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
       Given a list of parameters, perf_event_open() returns a file
       descriptor, for use in subsequent system calls (read(2), mmap(2),
       prctl(2), fcntl(2), etc.).
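
       Because glibc provides no wrapper for this call, programs usually
       reach it through syscall(2).  The following is a minimal sketch of
       such a wrapper; the helper name sys_perf_event_open is an assumption
       chosen for illustration and is not part of any library.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper name. The arguments mirror the prototype of
 * perf_event_open(); the return value is a new file descriptor,
 * or -1 with errno set on failure.
 */
long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                         int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```

       A caller would zero a struct perf_event_attr, fill in at least type,
       size, and config, and then pass it to this helper.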
22
23 A call to perf_event_open() creates a file descriptor that allows mea‐
24 suring performance information. Each file descriptor corresponds to
25 one event that is measured; these can be grouped together to measure
26 multiple events simultaneously.
27
28 Events can be enabled and disabled in two ways: via ioctl(2) and via
29 prctl(2). When an event is disabled it does not count or generate
30 overflows but does continue to exist and maintain its count value.
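
       As a sketch of the ioctl(2) route, the function below creates a
       disabled counting event for the calling thread, enables it around a
       piece of work, and reads the total.  The function and workload names
       are hypothetical; the PERF_EVENT_IOC_RESET, PERF_EVENT_IOC_ENABLE,
       and PERF_EVENT_IOC_DISABLE request names are taken from
       <linux/perf_event.h> and are not spelled out in this section.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical workload used only to have something to measure. */
void sample_workload(void)
{
    volatile unsigned long sum = 0;

    for (unsigned long i = 0; i < 100000; i++)
        sum += i;
}

/*
 * Count user-space instructions retired while work() runs in the
 * calling thread. Returns the count, or -1 if the event could not
 * be opened or read (for example, no PMU or insufficient permission).
 */
long long count_instructions(void (*work)(void))
{
    struct perf_event_attr attr;
    uint64_t count;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;          /* start disabled, enable explicitly */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid == 0 and cpu == -1: calling thread, any CPU. */
    fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd == -1)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        close(fd);
        return -1;
    }
    close(fd);
    return (long long)count;
}
```

       While the event is disabled it keeps its count, which is why the
       value can still be read after PERF_EVENT_IOC_DISABLE.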
31
32 Events come in two flavors: counting and sampled. A counting event is
33 one that is used for counting the aggregate number of events that
34 occur. In general, counting event results are gathered with a read(2)
35 call. A sampling event periodically writes measurements to a buffer
36 that can then be accessed via mmap(2).
37
38 Arguments
39 The pid and cpu arguments allow specifying which process and CPU to
40 monitor:
41
42 pid == 0 and cpu == -1
43 This measures the calling process/thread on any CPU.
44
45 pid == 0 and cpu >= 0
46 This measures the calling process/thread only when running on
47 the specified CPU.
48
49 pid > 0 and cpu == -1
50 This measures the specified process/thread on any CPU.
51
52 pid > 0 and cpu >= 0
53 This measures the specified process/thread only when running on
54 the specified CPU.
55
56 pid == -1 and cpu >= 0
57 This measures all processes/threads on the specified CPU. This
58 requires CAP_SYS_ADMIN capability or a /proc/sys/ker‐
59 nel/perf_event_paranoid value of less than 1.
60
61 pid == -1 and cpu == -1
62 This setting is invalid and will return an error.
63
64 When pid is greater than zero, permission to perform this system call
65 is governed by a ptrace access mode PTRACE_MODE_READ_REALCREDS check;
66 see ptrace(2).
67
68 The group_fd argument allows event groups to be created. An event
69 group has one event which is the group leader. The leader is created
70 first, with group_fd = -1. The rest of the group members are created
71 with subsequent perf_event_open() calls with group_fd being set to the
72 file descriptor of the group leader. (A single event on its own is
73 created with group_fd = -1 and is considered to be a group with only 1
74 member.) An event group is scheduled onto the CPU as a unit: it will
75 be put onto the CPU only if all of the events in the group can be put
76 onto the CPU. This means that the values of the member events can be
77 meaningfully compared—added, divided (to get ratios), and so on—with
78 each other, since they have counted events for the same set of executed
79 instructions.
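
       The grouping rules above can be sketched as follows: a leader is
       opened with group_fd = -1, and a second event is opened with
       group_fd set to the leader's file descriptor.  The helper names are
       hypothetical, and the choice of cycles and instructions is only an
       example of two counts whose ratio is meaningful.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: fill in a user-space-only hardware counter. */
void init_hw_counter(struct perf_event_attr *attr,
                     unsigned long long config, int disabled)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = config;
    attr->disabled = disabled ? 1 : 0;
    attr->exclude_kernel = 1;
    attr->exclude_hv = 1;
}

/*
 * Open a two-event group on the calling thread.
 * On success returns 0 with fds[0] = leader (cycles) and
 * fds[1] = member (instructions); on failure returns -1.
 */
int open_cycles_instructions_group(int fds[2])
{
    struct perf_event_attr attr;

    /* Leader: created first, group_fd = -1, starts disabled. */
    init_hw_counter(&attr, PERF_COUNT_HW_CPU_CYCLES, 1);
    fds[0] = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fds[0] == -1) {
        fds[1] = -1;
        return -1;
    }

    /* Member: group_fd is the leader's file descriptor. */
    init_hw_counter(&attr, PERF_COUNT_HW_INSTRUCTIONS, 0);
    fds[1] = (int)syscall(__NR_perf_event_open, &attr, 0, -1, fds[0], 0UL);
    if (fds[1] == -1) {
        close(fds[0]);
        fds[0] = -1;
        return -1;
    }
    return 0;
}
```

       Because both events are scheduled onto the CPU together, dividing
       the instruction count by the cycle count gives a ratio computed
       over the same stretch of execution.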
80
81 The flags argument is formed by ORing together zero or more of the fol‐
82 lowing values:
83
       PERF_FLAG_FD_CLOEXEC (since Linux 3.14)
              This flag enables the close-on-exec flag for the created event
              file descriptor, so that the file descriptor is automatically
              closed on execve(2).  Setting the close-on-exec flag at
              creation time, rather than later with fcntl(2), avoids
              potential race conditions where the calling thread invokes
              perf_event_open() and fcntl(2) at the same time as another
              thread calls fork(2) then execve(2).
92
93 PERF_FLAG_FD_NO_GROUP
94 This flag tells the event to ignore the group_fd parameter
95 except for the purpose of setting up output redirection using
96 the PERF_FLAG_FD_OUTPUT flag.
97
98 PERF_FLAG_FD_OUTPUT (broken since Linux 2.6.35)
99 This flag re-routes the event's sampled output to instead be
100 included in the mmap buffer of the event specified by group_fd.
101
102 PERF_FLAG_PID_CGROUP (since Linux 2.6.39)
103 This flag activates per-container system-wide monitoring. A
104 container is an abstraction that isolates a set of resources for
105 finer-grained control (CPUs, memory, etc.). In this mode, the
106 event is measured only if the thread running on the monitored
107 CPU belongs to the designated container (cgroup). The cgroup is
108 identified by passing a file descriptor opened on its directory
109 in the cgroupfs filesystem. For instance, if the cgroup to mon‐
110 itor is called test, then a file descriptor opened on
111 /dev/cgroup/test (assuming cgroupfs is mounted on /dev/cgroup)
112 must be passed as the pid parameter. cgroup monitoring is
113 available only for system-wide events and may therefore require
114 extra permissions.
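
       As an illustration of passing a flag, the sketch below opens a
       software counter with PERF_FLAG_FD_CLOEXEC so that close-on-exec is
       set atomically at creation.  The function name and the choice of
       PERF_COUNT_SW_TASK_CLOCK are assumptions made for the example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a task-clock counter for the calling
 * thread with close-on-exec set at creation time.
 * Returns the file descriptor, or -1 on failure.
 */
int open_cloexec_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_TASK_CLOCK;
    attr.disabled = 1;

    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1,
                        (unsigned long)PERF_FLAG_FD_CLOEXEC);
}
```

       On success, fcntl(fd, F_GETFD) should report FD_CLOEXEC without any
       separate fcntl(2) call having been made.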
115
116 The perf_event_attr structure provides detailed configuration informa‐
117 tion for the event being created.
118
           struct perf_event_attr {
               __u32 type;                 /* Type of event */
               __u32 size;                 /* Size of attribute structure */
               __u64 config;               /* Type-specific configuration */

               union {
                   __u64 sample_period;    /* Period of sampling */
                   __u64 sample_freq;      /* Frequency of sampling */
               };

               __u64 sample_type;  /* Specifies values included in sample */
               __u64 read_format;  /* Specifies values returned in read */

               __u64 disabled       : 1,   /* off by default */
                     inherit        : 1,   /* children inherit it */
                     pinned         : 1,   /* must always be on PMU */
                     exclusive      : 1,   /* only group on PMU */
                     exclude_user   : 1,   /* don't count user */
                     exclude_kernel : 1,   /* don't count kernel */
                     exclude_hv     : 1,   /* don't count hypervisor */
                     exclude_idle   : 1,   /* don't count when idle */
                     mmap           : 1,   /* include mmap data */
                     comm           : 1,   /* include comm data */
                     freq           : 1,   /* use freq, not period */
                     inherit_stat   : 1,   /* per task counts */
                     enable_on_exec : 1,   /* next exec enables */
                     task           : 1,   /* trace fork/exit */
                     watermark      : 1,   /* wakeup_watermark */
                     precise_ip     : 2,   /* skid constraint */
                     mmap_data      : 1,   /* non-exec mmap data */
                     sample_id_all  : 1,   /* sample_type all events */
                     exclude_host   : 1,   /* don't count in host */
                     exclude_guest  : 1,   /* don't count in guest */
                     exclude_callchain_kernel : 1,
                                           /* exclude kernel callchains */
                     exclude_callchain_user   : 1,
                                           /* exclude user callchains */
                     mmap2          : 1,   /* include mmap with inode data */
                     comm_exec      : 1,   /* flag comm events that are
                                              due to exec */
                     use_clockid    : 1,   /* use clockid for time fields */
                     context_switch : 1,   /* context switch data */

                     __reserved_1   : 37;

               union {
                   __u32 wakeup_events;    /* wakeup every n events */
                   __u32 wakeup_watermark; /* bytes before wakeup */
               };

               __u32     bp_type;          /* breakpoint type */

               union {
                   __u64 bp_addr;          /* breakpoint address */
                   __u64 kprobe_func;      /* for perf_kprobe */
                   __u64 uprobe_path;      /* for perf_uprobe */
                   __u64 config1;          /* extension of config */
               };

               union {
                   __u64 bp_len;           /* breakpoint length */
                   __u64 kprobe_addr;      /* with kprobe_func == NULL */
                   __u64 probe_offset;     /* for perf_[k,u]probe */
                   __u64 config2;          /* extension of config1 */
               };
               __u64 branch_sample_type;   /* enum perf_branch_sample_type */
               __u64 sample_regs_user;     /* user regs to dump on samples */
               __u32 sample_stack_user;    /* size of stack to dump on
                                              samples */
               __s32 clockid;              /* clock to use for time fields */
               __u64 sample_regs_intr;     /* regs to dump on samples */
               __u32 aux_watermark;        /* aux bytes before wakeup */
               __u16 sample_max_stack;     /* max frames in callchain */
               __u16 __reserved_2;         /* align to u64 */

           };
195
196 The fields of the perf_event_attr structure are described in more
197 detail below:
198
199 type This field specifies the overall event type. It has one of the
200 following values:
201
202 PERF_TYPE_HARDWARE
203 This indicates one of the "generalized" hardware events
204 provided by the kernel. See the config field definition
205 for more details.
206
207 PERF_TYPE_SOFTWARE
208 This indicates one of the software-defined events pro‐
209 vided by the kernel (even if no hardware support is
210 available).
211
212 PERF_TYPE_TRACEPOINT
213 This indicates a tracepoint provided by the kernel trace‐
214 point infrastructure.
215
216 PERF_TYPE_HW_CACHE
217 This indicates a hardware cache event. This has a spe‐
218 cial encoding, described in the config field definition.
219
220 PERF_TYPE_RAW
221 This indicates a "raw" implementation-specific event in
222 the config field.
223
224 PERF_TYPE_BREAKPOINT (since Linux 2.6.33)
225 This indicates a hardware breakpoint as provided by the
226 CPU. Breakpoints can be read/write accesses to an
227 address as well as execution of an instruction address.
228
229 dynamic PMU
230 Since Linux 2.6.38, perf_event_open() can support multi‐
231 ple PMUs. To enable this, a value exported by the kernel
232 can be used in the type field to indicate which PMU to
233 use. The value to use can be found in the sysfs filesys‐
234 tem: there is a subdirectory per PMU instance under
235 /sys/bus/event_source/devices. In each subdirectory
236 there is a type file whose content is an integer that can
237 be used in the type field. For instance,
238 /sys/bus/event_source/devices/cpu/type contains the value
239 for the core CPU PMU, which is usually 4.
240
241 kprobe and uprobe (since Linux 4.17)
242 These two dynamic PMUs create a kprobe/uprobe and attach
243 it to the file descriptor generated by perf_event_open.
244 The kprobe/uprobe will be destroyed on the destruction of
245 the file descriptor. See fields kprobe_func,
246 uprobe_path, kprobe_addr, and probe_offset for more
247 details.
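
       For the dynamic PMU case described above, the type value has to be
       read from sysfs at run time.  The following is a minimal sketch; the
       function name read_pmu_type is hypothetical, and the path layout is
       the one given in the dynamic PMU entry.

```c
#include <stdio.h>

/*
 * Hypothetical helper: return the integer stored in
 * /sys/bus/event_source/devices/<pmu_name>/type,
 * or -1 if the file is missing or cannot be parsed.
 */
int read_pmu_type(const char *pmu_name)
{
    char path[256];
    FILE *f;
    int type;
    int n;

    n = snprintf(path, sizeof(path),
                 "/sys/bus/event_source/devices/%s/type", pmu_name);
    if (n < 0 || n >= (int)sizeof(path))
        return -1;

    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    if (fscanf(f, "%d", &type) != 1)
        type = -1;

    fclose(f);
    return type;
}
```

       A caller would assign a non-negative result, for example from
       read_pmu_type("cpu"), to the type field of struct perf_event_attr.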
248
249 size The size of the perf_event_attr structure for forward/backward
250 compatibility. Set this using sizeof(struct perf_event_attr) to
251 allow the kernel to see the struct size at the time of compila‐
252 tion.
253
254 The related define PERF_ATTR_SIZE_VER0 is set to 64; this was
255 the size of the first published struct. PERF_ATTR_SIZE_VER1 is
256 72, corresponding to the addition of breakpoints in Linux
257 2.6.33. PERF_ATTR_SIZE_VER2 is 80 corresponding to the addition
258 of branch sampling in Linux 3.4. PERF_ATTR_SIZE_VER3 is 96 cor‐
259 responding to the addition of sample_regs_user and sam‐
260 ple_stack_user in Linux 3.7. PERF_ATTR_SIZE_VER4 is 104 corre‐
261 sponding to the addition of sample_regs_intr in Linux 3.19.
262 PERF_ATTR_SIZE_VER5 is 112 corresponding to the addition of
263 aux_watermark in Linux 4.1.
264
       config This specifies which event you want, in conjunction with the
              type field.  The config1 and config2 fields are also taken into
              account in cases where 64 bits is not enough to fully specify
              the event.  The encoding of these fields is event dependent.
269
270 There are various ways to set the config field that are depen‐
271 dent on the value of the previously described type field. What
272 follows are various possible settings for config separated out
273 by type.
274
275 If type is PERF_TYPE_HARDWARE, we are measuring one of the gen‐
276 eralized hardware CPU events. Not all of these are available on
277 all platforms. Set config to one of the following:
278
279 PERF_COUNT_HW_CPU_CYCLES
280 Total cycles. Be wary of what happens during CPU
281 frequency scaling.
282
283 PERF_COUNT_HW_INSTRUCTIONS
284 Retired instructions. Be careful, these can be
285 affected by various issues, most notably hardware
286 interrupt counts.
287
288 PERF_COUNT_HW_CACHE_REFERENCES
289 Cache accesses. Usually this indicates Last Level
290 Cache accesses but this may vary depending on your
291 CPU. This may include prefetches and coherency mes‐
292 sages; again this depends on the design of your CPU.
293
294 PERF_COUNT_HW_CACHE_MISSES
295 Cache misses. Usually this indicates Last Level
296 Cache misses; this is intended to be used in con‐
297 junction with the PERF_COUNT_HW_CACHE_REFERENCES
298 event to calculate cache miss rates.
299
300 PERF_COUNT_HW_BRANCH_INSTRUCTIONS
301 Retired branch instructions. Prior to Linux 2.6.35,
302 this used the wrong event on AMD processors.
303
304 PERF_COUNT_HW_BRANCH_MISSES
305 Mispredicted branch instructions.
306
307 PERF_COUNT_HW_BUS_CYCLES
308 Bus cycles, which can be different from total
309 cycles.
310
311 PERF_COUNT_HW_STALLED_CYCLES_FRONTEND (since Linux 3.0)
312 Stalled cycles during issue.
313
314 PERF_COUNT_HW_STALLED_CYCLES_BACKEND (since Linux 3.0)
315 Stalled cycles during retirement.
316
317 PERF_COUNT_HW_REF_CPU_CYCLES (since Linux 3.3)
318 Total cycles; not affected by CPU frequency scaling.
319
320 If type is PERF_TYPE_SOFTWARE, we are measuring software events
321 provided by the kernel. Set config to one of the following:
322
323 PERF_COUNT_SW_CPU_CLOCK
324 This reports the CPU clock, a high-resolution per-
325 CPU timer.
326
327 PERF_COUNT_SW_TASK_CLOCK
328 This reports a clock count specific to the task that
329 is running.
330
331 PERF_COUNT_SW_PAGE_FAULTS
332 This reports the number of page faults.
333
                   PERF_COUNT_SW_CONTEXT_SWITCHES
                          This counts context switches.  Until Linux 2.6.34,
                          these were all reported as user-space events; since
                          then they have been reported as happening in the
                          kernel.
338
339 PERF_COUNT_SW_CPU_MIGRATIONS
340 This reports the number of times the process has
341 migrated to a new CPU.
342
343 PERF_COUNT_SW_PAGE_FAULTS_MIN
344 This counts the number of minor page faults. These
345 did not require disk I/O to handle.
346
347 PERF_COUNT_SW_PAGE_FAULTS_MAJ
348 This counts the number of major page faults. These
349 required disk I/O to handle.
350
                   PERF_COUNT_SW_ALIGNMENT_FAULTS (since Linux 2.6.33)
                          This counts the number of alignment faults.  These
                          occur on unaligned memory accesses; the kernel can
                          handle them, but doing so reduces performance.
                          This happens only on some architectures (never on
                          x86).
357
358 PERF_COUNT_SW_EMULATION_FAULTS (since Linux 2.6.33)
359 This counts the number of emulation faults. The
360 kernel sometimes traps on unimplemented instructions
361 and emulates them for user space. This can nega‐
362 tively impact performance.
363
364 PERF_COUNT_SW_DUMMY (since Linux 3.12)
365 This is a placeholder event that counts nothing.
366 Informational sample record types such as mmap or
367 comm must be associated with an active event. This
368 dummy event allows gathering such records without
369 requiring a counting event.
370
371 If type is PERF_TYPE_TRACEPOINT, then we are measuring kernel
372 tracepoints. The value to use in config can be obtained from
373 under debugfs tracing/events/*/*/id if ftrace is enabled in the
374 kernel.
375
376 If type is PERF_TYPE_HW_CACHE, then we are measuring a hardware
377 CPU cache event. To calculate the appropriate config value use
378 the following equation:
379
380 (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) |
381 (perf_hw_cache_op_result_id << 16)
382
383 where perf_hw_cache_id is one of:
384
385 PERF_COUNT_HW_CACHE_L1D
386 for measuring Level 1 Data Cache
387
388 PERF_COUNT_HW_CACHE_L1I
389 for measuring Level 1 Instruction Cache
390
391 PERF_COUNT_HW_CACHE_LL
392 for measuring Last-Level Cache
393
394 PERF_COUNT_HW_CACHE_DTLB
395 for measuring the Data TLB
396
397 PERF_COUNT_HW_CACHE_ITLB
398 for measuring the Instruction TLB
399
400 PERF_COUNT_HW_CACHE_BPU
401 for measuring the branch prediction unit
402
403 PERF_COUNT_HW_CACHE_NODE (since Linux 3.1)
404 for measuring local memory accesses
405
406 and perf_hw_cache_op_id is one of:
407
408 PERF_COUNT_HW_CACHE_OP_READ
409 for read accesses
410
411 PERF_COUNT_HW_CACHE_OP_WRITE
412 for write accesses
413
414 PERF_COUNT_HW_CACHE_OP_PREFETCH
415 for prefetch accesses
416
417 and perf_hw_cache_op_result_id is one of:
418
419 PERF_COUNT_HW_CACHE_RESULT_ACCESS
420 to measure accesses
421
422 PERF_COUNT_HW_CACHE_RESULT_MISS
423 to measure misses
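
              The equation above can be written as a small helper.  This is
              a sketch only; the function name hw_cache_config is
              hypothetical, while the constants come from
              <linux/perf_event.h>.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/*
 * Hypothetical helper: build the config value for a
 * PERF_TYPE_HW_CACHE event from its three components.
 */
uint64_t hw_cache_config(uint64_t cache_id, uint64_t op_id,
                         uint64_t result_id)
{
    return cache_id | (op_id << 8) | (result_id << 16);
}
```

              For example, last-level cache read misses would be requested
              with hw_cache_config(PERF_COUNT_HW_CACHE_LL,
              PERF_COUNT_HW_CACHE_OP_READ, PERF_COUNT_HW_CACHE_RESULT_MISS).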
424
425 If type is PERF_TYPE_RAW, then a custom "raw" config value is
426 needed. Most CPUs support events that are not covered by the
427 "generalized" events. These are implementation defined; see
428 your CPU manual (for example the Intel Volume 3B documentation
429 or the AMD BIOS and Kernel Developer Guide). The libpfm4
430 library can be used to translate from the name in the architec‐
431 tural manuals to the raw hex value perf_event_open() expects in
432 this field.
433
434 If type is PERF_TYPE_BREAKPOINT, then leave config set to zero.
435 Its parameters are set in other places.
436
437 If type is kprobe or uprobe, set retprobe (bit 0 of config, see
438 /sys/bus/event_source/devices/[k,u]probe/format/retprobe) for
439 kretprobe/uretprobe. See fields kprobe_func, uprobe_path,
440 kprobe_addr, and probe_offset for more details.
441
442 kprobe_func, uprobe_path, kprobe_addr, and probe_offset
443 These fields describe the kprobe/uprobe for dynamic PMUs kprobe
444 and uprobe. For kprobe: use kprobe_func and probe_offset, or
445 use kprobe_addr and leave kprobe_func as NULL. For uprobe: use
446 uprobe_path and probe_offset.
447
448 sample_period, sample_freq
449 A "sampling" event is one that generates an overflow notifica‐
450 tion every N events, where N is given by sample_period. A sam‐
451 pling event has sample_period > 0. When an overflow occurs,
452 requested data is recorded in the mmap buffer. The sample_type
453 field controls what data is recorded on each overflow.
454
              sample_freq can be used if you wish to use frequency rather
              than period.  In this case, you set the freq flag.  The kernel
              will adjust the sampling period to try to achieve the desired
              rate.  The adjustment is made once per timer tick.
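
              A sketch of a frequency-based sampling setup follows.  The
              function name and the particular sample_type bits are
              assumptions chosen for illustration; only the freq and
              sample_freq usage is taken from the description above.

```c
#include <linux/perf_event.h>
#include <string.h>

/*
 * Hypothetical helper: configure attr to sample CPU cycles at
 * roughly hz samples per second instead of every N events.
 */
void init_cycle_sampling(struct perf_event_attr *attr,
                         unsigned long long hz)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = PERF_COUNT_HW_CPU_CYCLES;

    attr->freq = 1;             /* interpret the union as sample_freq */
    attr->sample_freq = hz;

    /* Example choice of what each sample records. */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                        PERF_SAMPLE_TIME;

    attr->disabled = 1;
    attr->exclude_kernel = 1;
    attr->exclude_hv = 1;
}
```

              To sample every N events instead, leave freq at 0 and store N
              in sample_period.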
459
       sample_type
              The various bits in this field specify which values to include
              in the sample.  They will be recorded in a ring buffer, which
              is available to user space using mmap(2).  The order in which
              the values are saved in the sample is documented in the MMAP
              Layout subsection below; it is not the enum
              perf_event_sample_format order.
467
468 PERF_SAMPLE_IP
469 Records instruction pointer.
470
471 PERF_SAMPLE_TID
472 Records the process and thread IDs.
473
474 PERF_SAMPLE_TIME
475 Records a timestamp.
476
477 PERF_SAMPLE_ADDR
478 Records an address, if applicable.
479
480 PERF_SAMPLE_READ
481 Record counter values for all events in a group, not just
482 the group leader.
483
484 PERF_SAMPLE_CALLCHAIN
485 Records the callchain (stack backtrace).
486
487 PERF_SAMPLE_ID
488 Records a unique ID for the opened event's group leader.
489
490 PERF_SAMPLE_CPU
491 Records CPU number.
492
493 PERF_SAMPLE_PERIOD
494 Records the current sampling period.
495
496 PERF_SAMPLE_STREAM_ID
497 Records a unique ID for the opened event. Unlike
498 PERF_SAMPLE_ID the actual ID is returned, not the group
499 leader. This ID is the same as the one returned by
500 PERF_FORMAT_ID.
501
502 PERF_SAMPLE_RAW
503 Records additional data, if applicable. Usually returned
504 by tracepoint events.
505
506 PERF_SAMPLE_BRANCH_STACK (since Linux 3.4)
507 This provides a record of recent branches, as provided by
508 CPU branch sampling hardware (such as Intel Last Branch
509 Record). Not all hardware supports this feature.
510
511 See the branch_sample_type field for how to filter which
512 branches are reported.
513
514 PERF_SAMPLE_REGS_USER (since Linux 3.7)
515 Records the current user-level CPU register state (the
516 values in the process before the kernel was called).
517
518 PERF_SAMPLE_STACK_USER (since Linux 3.7)
519 Records the user level stack, allowing stack unwinding.
520
521 PERF_SAMPLE_WEIGHT (since Linux 3.10)
522 Records a hardware provided weight value that expresses
523 how costly the sampled event was. This allows the hard‐
524 ware to highlight expensive events in a profile.
525
526 PERF_SAMPLE_DATA_SRC (since Linux 3.10)
527 Records the data source: where in the memory hierarchy
528 the data associated with the sampled instruction came
529 from. This is available only if the underlying hardware
530 supports this feature.
531
532 PERF_SAMPLE_IDENTIFIER (since Linux 3.12)
533 Places the SAMPLE_ID value in a fixed position in the
534 record, either at the beginning (for sample events) or at
535 the end (if a non-sample event).
536
537 This was necessary because a sample stream may have
538 records from various different event sources with differ‐
539 ent sample_type settings. Parsing the event stream prop‐
540 erly was not possible because the format of the record
541 was needed to find SAMPLE_ID, but the format could not be
542 found without knowing what event the sample belonged to
543 (causing a circular dependency).
544
545 The PERF_SAMPLE_IDENTIFIER setting makes the event stream
546 always parsable by putting SAMPLE_ID in a fixed location,
547 even though it means having duplicate SAMPLE_ID values in
548 records.
549
550 PERF_SAMPLE_TRANSACTION (since Linux 3.13)
551 Records reasons for transactional memory abort events
552 (for example, from Intel TSX transactional memory sup‐
553 port).
554
555 The precise_ip setting must be greater than 0 and a
556 transactional memory abort event must be measured or no
557 values will be recorded. Also note that some perf_event
558 measurements, such as sampled cycle counting, may cause
559 extraneous aborts (by causing an interrupt during a
560 transaction).
561
              PERF_SAMPLE_REGS_INTR (since Linux 3.19)
                     Records a subset of the current CPU register state as
                     specified by sample_regs_intr.  Unlike
                     PERF_SAMPLE_REGS_USER, the register values will return
                     kernel register state if the overflow happened while
                     kernel code is running.  If the CPU supports hardware
                     sampling of register state (e.g., PEBS on Intel x86)
                     and precise_ip is set higher than zero, then the
                     register values returned are those captured by hardware
                     at the time of the sampled instruction's retirement.
572
573 read_format
574 This field specifies the format of the data returned by read(2)
575 on a perf_event_open() file descriptor.
576
577 PERF_FORMAT_TOTAL_TIME_ENABLED
578 Adds the 64-bit time_enabled field. This can be used to
579 calculate estimated totals if the PMU is overcommitted
580 and multiplexing is happening.
581
582 PERF_FORMAT_TOTAL_TIME_RUNNING
583 Adds the 64-bit time_running field. This can be used to
584 calculate estimated totals if the PMU is overcommitted
585 and multiplexing is happening.
586
587 PERF_FORMAT_ID
588 Adds a 64-bit unique value that corresponds to the event
589 group.
590
591 PERF_FORMAT_GROUP
592 Allows all counter values in an event group to be read
593 with one read.
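
              The estimate mentioned for the two time fields can be sketched
              as below.  This assumes the event was opened with read_format
              set to PERF_FORMAT_TOTAL_TIME_ENABLED |
              PERF_FORMAT_TOTAL_TIME_RUNNING and nothing else, in which case
              read(2) returns the count followed by the two times in that
              order.  The struct and function names are hypothetical.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Assumed layout for read_format ==
 * PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
 * (no PERF_FORMAT_ID, no PERF_FORMAT_GROUP).
 */
struct counter_reading {
    uint64_t value;
    uint64_t time_enabled;
    uint64_t time_running;
};

/* Read one counter; returns 0 on success, -1 on a short read or error. */
int read_counter(int fd, struct counter_reading *r)
{
    return read(fd, r, sizeof(*r)) == (ssize_t)sizeof(*r) ? 0 : -1;
}

/*
 * Estimate the count the event would have reached had it been on
 * the PMU the whole time it was enabled. Returns 0 if the event
 * never ran.
 */
uint64_t scale_count(const struct counter_reading *r)
{
    if (r->time_running == 0)
        return 0;
    return (uint64_t)((double)r->value * (double)r->time_enabled /
                      (double)r->time_running);
}
```

              When time_enabled equals time_running, no multiplexing
              occurred and the scaled value equals the raw count.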
594
595 disabled
596 The disabled bit specifies whether the counter starts out dis‐
597 abled or enabled. If disabled, the event can later be enabled
598 by ioctl(2), prctl(2), or enable_on_exec.
599
600 When creating an event group, typically the group leader is ini‐
601 tialized with disabled set to 1 and any child events are ini‐
602 tialized with disabled set to 0. Despite disabled being 0, the
603 child events will not start until the group leader is enabled.
604
605 inherit
606 The inherit bit specifies that this counter should count events
607 of child tasks as well as the task specified. This applies only
608 to new children, not to any existing children at the time the
609 counter is created (nor to any new children of existing chil‐
610 dren).
611
612 Inherit does not work for some combinations of read_format val‐
613 ues, such as PERF_FORMAT_GROUP.
614
615 pinned The pinned bit specifies that the counter should always be on
616 the CPU if at all possible. It applies only to hardware coun‐
617 ters and only to group leaders. If a pinned counter cannot be
618 put onto the CPU (e.g., because there are not enough hardware
619 counters or because of a conflict with some other event), then
620 the counter goes into an 'error' state, where reads return end-
621 of-file (i.e., read(2) returns 0) until the counter is subse‐
622 quently enabled or disabled.
623
624 exclusive
625 The exclusive bit specifies that when this counter's group is on
626 the CPU, it should be the only group using the CPU's counters.
627 In the future this may allow monitoring programs to support PMU
628 features that need to run alone so that they do not disrupt
629 other hardware counters.
630
631 Note that many unexpected situations may prevent events with the
632 exclusive bit set from ever running. This includes any users
633 running a system-wide measurement as well as any kernel use of
634 the performance counters (including the commonly enabled NMI
635 Watchdog Timer interface).
636
637 exclude_user
638 If this bit is set, the count excludes events that happen in
639 user space.
640
641 exclude_kernel
642 If this bit is set, the count excludes events that happen in
643 kernel space.
644
645 exclude_hv
646 If this bit is set, the count excludes events that happen in the
647 hypervisor. This is mainly for PMUs that have built-in support
648 for handling this (such as POWER). Extra support is needed for
649 handling hypervisor measurements on most machines.
650
651 exclude_idle
652 If set, don't count when the CPU is running the idle task.
653 While you can currently enable this for any event type, it is
654 ignored for all but software events.
655
656 mmap The mmap bit enables generation of PERF_RECORD_MMAP samples for
657 every mmap(2) call that has PROT_EXEC set. This allows tools to
658 notice new executable code being mapped into a program (dynamic
659 shared libraries for example) so that addresses can be mapped
660 back to the original code.
661
662 comm The comm bit enables tracking of process command name as modi‐
663 fied by the exec(2) and prctl(PR_SET_NAME) system calls as well
664 as writing to /proc/self/comm. If the comm_exec flag is also
665 successfully set (possible since Linux 3.16), then the misc flag
666 PERF_RECORD_MISC_COMM_EXEC can be used to differentiate the
667 exec(2) case from the others.
668
       freq   If this bit is set, then sample_freq, not sample_period, is
              used when setting up the sampling interval.
671
672 inherit_stat
673 This bit enables saving of event counts on context switch for
674 inherited tasks. This is meaningful only if the inherit field
675 is set.
676
677 enable_on_exec
678 If this bit is set, a counter is automatically enabled after a
679 call to exec(2).
680
681 task If this bit is set, then fork/exit notifications are included in
682 the ring buffer.
683
684 watermark
685 If set, have an overflow notification happen when we cross the
686 wakeup_watermark boundary. Otherwise, overflow notifications
687 happen after wakeup_events samples.
688
       precise_ip (since Linux 2.6.35)
              This controls the amount of skid.  Skid is how many
              instructions execute between an event of interest happening
              and the kernel being able to stop and record the event.
              Smaller skid is better and allows more accurate reporting of
              which events correspond to which instructions, but hardware is
              often limited in how small this can be.
696
697 The possible values of this field are the following:
698
699 0 SAMPLE_IP can have arbitrary skid.
700
701 1 SAMPLE_IP must have constant skid.
702
703 2 SAMPLE_IP requested to have 0 skid.
704
705 3 SAMPLE_IP must have 0 skid. See also the description of
706 PERF_RECORD_MISC_EXACT_IP.
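
              Since not every CPU supports every level, one possible
              approach is to request the strictest level first and relax it
              until the open succeeds.  The sketch below assumes that the
              kernel rejects an unsupported precise_ip value with an error
              rather than silently lowering it; the function name is
              hypothetical.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: try precise_ip = 3, 2, 1, 0 in turn and
 * return the first file descriptor that opens, or -1 if none does.
 * On success, attr->precise_ip holds the level that was accepted.
 */
int open_with_lowest_skid(struct perf_event_attr *attr, pid_t pid, int cpu)
{
    for (int level = 3; level >= 0; level--) {
        int fd;

        attr->precise_ip = (unsigned int)level;
        fd = (int)syscall(__NR_perf_event_open, attr, pid, cpu, -1, 0UL);
        if (fd != -1)
            return fd;
    }
    return -1;
}
```

              A failure at every level usually means the event itself could
              not be opened (for example, for permission reasons), not that
              skid control is unavailable.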
707
708 mmap_data (since Linux 2.6.36)
709 This is the counterpart of the mmap field. This enables genera‐
710 tion of PERF_RECORD_MMAP samples for mmap(2) calls that do not
711 have PROT_EXEC set (for example data and SysV shared memory).
712
713 sample_id_all (since Linux 2.6.38)
714 If set, then TID, TIME, ID, STREAM_ID, and CPU can additionally
715 be included in non-PERF_RECORD_SAMPLEs if the corresponding sam‐
716 ple_type is selected.
717
718 If PERF_SAMPLE_IDENTIFIER is specified, then an additional ID
719 value is included as the last value to ease parsing the record
720 stream. This may lead to the id value appearing twice.
721
722 The layout is described by this pseudo-structure:
723
724 struct sample_id {
725 { u32 pid, tid; } /* if PERF_SAMPLE_TID set */
726 { u64 time; } /* if PERF_SAMPLE_TIME set */
727 { u64 id; } /* if PERF_SAMPLE_ID set */
728 { u64 stream_id;} /* if PERF_SAMPLE_STREAM_ID set */
729 { u32 cpu, res; } /* if PERF_SAMPLE_CPU set */
730 { u64 id; } /* if PERF_SAMPLE_IDENTIFIER set */
731 };
732
733 exclude_host (since Linux 3.2)
734 When conducting measurements that include processes running VM
735 instances (i.e., have executed a KVM_RUN ioctl(2)), only measure
736 events happening inside a guest instance. This is only meaning‐
737 ful outside the guests; this setting does not change counts
738 gathered inside of a guest. Currently, this functionality is
739 x86 only.
740
741 exclude_guest (since Linux 3.2)
742 When conducting measurements that include processes running VM
743 instances (i.e., have executed a KVM_RUN ioctl(2)), do not mea‐
744 sure events happening inside guest instances. This is only
745 meaningful outside the guests; this setting does not change
746 counts gathered inside of a guest. Currently, this functional‐
747 ity is x86 only.
748
749 exclude_callchain_kernel (since Linux 3.7)
750 Do not include kernel callchains.
751
752 exclude_callchain_user (since Linux 3.7)
753 Do not include user callchains.
754
755 mmap2 (since Linux 3.16)
756 Generate an extended executable mmap record that contains enough
757 additional information to uniquely identify shared mappings.
758 The mmap flag must also be set for this to work.
759
       comm_exec (since Linux 3.16)
              This is purely a feature-detection flag; it does not change
              kernel behavior.  If this flag can successfully be set, then,
              when comm is enabled, the PERF_RECORD_MISC_COMM_EXEC flag will
              be set in the misc field of a comm record header if the rename
              event being reported was caused by a call to exec(2).  This
              allows tools to distinguish between the various types of
              process renaming.
768
769 use_clockid (since Linux 4.1)
770 This allows selecting which internal Linux clock to use when
771 generating timestamps via the clockid field. This can make it
772 easier to correlate perf sample times with timestamps generated
773 by other tools.
774
775 context_switch (since Linux 4.3)
776 This enables the generation of PERF_RECORD_SWITCH records when a
777 context switch occurs. It also enables the generation of
778 PERF_RECORD_SWITCH_CPU_WIDE records when sampling in CPU-wide
779 mode. This functionality is in addition to existing tracepoint
780 and software events for measuring context switches. The advan‐
781 tage of this method is that it will give full information even
782 with strict perf_event_paranoid settings.
783
784 wakeup_events, wakeup_watermark
785 This union sets how many samples (wakeup_events) or bytes
786 (wakeup_watermark) happen before an overflow notification hap‐
787 pens. Which one is used is selected by the watermark bit flag.
788
789 wakeup_events counts only PERF_RECORD_SAMPLE record types. To
790 receive overflow notification for all PERF_RECORD types choose
791 watermark and set wakeup_watermark to 1.
792
793 Prior to Linux 3.0, setting wakeup_events to 0 resulted in no
794 overflow notifications; more recent kernels treat 0 the same as
795 1.
796
797 bp_type (since Linux 2.6.33)
798 This chooses the breakpoint type. It is one of:
799
800 HW_BREAKPOINT_EMPTY
801 No breakpoint.
802
803 HW_BREAKPOINT_R
804 Count when we read the memory location.
805
806 HW_BREAKPOINT_W
807 Count when we write the memory location.
808
809 HW_BREAKPOINT_RW
810 Count when we read or write the memory location.
811
812 HW_BREAKPOINT_X
813 Count when we execute code at the memory location.
814
815 The values can be combined via a bitwise or, but the combination
816 of HW_BREAKPOINT_R or HW_BREAKPOINT_W with HW_BREAKPOINT_X is
817 not allowed.
818
819 bp_addr (since Linux 2.6.33)
820 This is the address of the breakpoint. For execution break‐
821 points, this is the memory address of the instruction of inter‐
822 est; for read and write breakpoints, it is the memory address of
823 the memory location of interest.
824
825 config1 (since Linux 2.6.39)
826 config1 is used for setting events that need an extra register
827 or otherwise do not fit in the regular config field. Raw OFF‐
828 CORE_EVENTS on Nehalem/Westmere/SandyBridge use this field on
829 Linux 3.3 and later kernels.
830
831 bp_len (since Linux 2.6.33)
832 bp_len is the length of the breakpoint being measured if type is
833 PERF_TYPE_BREAKPOINT. Options are HW_BREAKPOINT_LEN_1,
834 HW_BREAKPOINT_LEN_2, HW_BREAKPOINT_LEN_4, and HW_BREAK‐
835 POINT_LEN_8. For an execution breakpoint, set this to
836 sizeof(long).
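The breakpoint fields above (bp_type, bp_addr, bp_len) might be filled in as in the following sketch. The helper names are hypothetical, and the choice of a 4-byte write watchpoint is only an example; the constants come from linux/hw_breakpoint.h.

```c
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if bp_type obeys the rule that R/W may not be mixed with X. */
static int bp_type_is_valid(uint32_t bp_type)
{
    int rw = (bp_type & HW_BREAKPOINT_RW) != 0;
    int x = (bp_type & HW_BREAKPOINT_X) != 0;

    return !(rw && x);
}

/* Hypothetical helper: describe a 4-byte write watchpoint on addr. */
static void init_write_watchpoint(struct perf_event_attr *attr,
                                  const void *addr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_BREAKPOINT;
    attr->bp_type = HW_BREAKPOINT_W;
    attr->bp_addr = (uint64_t)(uintptr_t)addr;
    attr->bp_len = HW_BREAKPOINT_LEN_4;
}
```

The resulting attr would then be passed to perf_event_open(); that call is omitted here because it depends on the privileges and hardware of the system.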
837
838 config2 (since Linux 2.6.39)
839 config2 is a further extension of the config1 field.
840
841 branch_sample_type (since Linux 3.4)
842 If PERF_SAMPLE_BRANCH_STACK is enabled, then this specifies what
843 branches to include in the branch record.
844
845 The first part of the value is the privilege level, which is a
846 combination of one or more of the values listed below. If the user
847 does not set the privilege level explicitly, the kernel will use the
848 event's privilege level. Event and branch privilege levels do
849 not have to match.
850
851 PERF_SAMPLE_BRANCH_USER
852 Branch target is in user space.
853
854 PERF_SAMPLE_BRANCH_KERNEL
855 Branch target is in kernel space.
856
857 PERF_SAMPLE_BRANCH_HV
858 Branch target is in hypervisor.
859
860 PERF_SAMPLE_BRANCH_PLM_ALL
861 A convenience value that is the three preceding values
862 ORed together.
863
864 In addition to the privilege value, at least one of the
865 following bits must be set.
866
867 PERF_SAMPLE_BRANCH_ANY
868 Any branch type.
869
870 PERF_SAMPLE_BRANCH_ANY_CALL
871 Any call branch (includes direct calls, indirect calls,
872 and far jumps).
873
874 PERF_SAMPLE_BRANCH_IND_CALL
875 Indirect calls.
876
877 PERF_SAMPLE_BRANCH_CALL (since Linux 4.4)
878 Direct calls.
879
880 PERF_SAMPLE_BRANCH_ANY_RETURN
881 Any return branch.
882
883 PERF_SAMPLE_BRANCH_IND_JUMP (since Linux 4.2)
884 Indirect jumps.
885
886 PERF_SAMPLE_BRANCH_COND (since Linux 3.16)
887 Conditional branches.
888
889 PERF_SAMPLE_BRANCH_ABORT_TX (since Linux 3.11)
890 Transactional memory aborts.
891
892 PERF_SAMPLE_BRANCH_IN_TX (since Linux 3.11)
893 Branch in transactional memory transaction.
894
895 PERF_SAMPLE_BRANCH_NO_TX (since Linux 3.11)
896 Branch not in transactional memory transaction.

897 PERF_SAMPLE_BRANCH_CALL_STACK (since Linux 4.1)
898 Branch is part of a hardware-generated call stack. This
899 requires hardware support, currently only found on Intel
900 x86 Haswell or newer.
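A branch_sample_type value is built by ORing a privilege part with at least one branch-type bit. The sketch below shows one way to compose and sanity-check such a value. The helper names are hypothetical, and the check is a simplification that treats every non-privilege bit as a branch-type bit.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/*
 * Returns 1 if at least one bit other than the privilege bits is set.
 * Simplification: all non-privilege bits are treated as branch types.
 */
static int branch_filter_is_complete(uint64_t branch_sample_type)
{
    uint64_t plm = PERF_SAMPLE_BRANCH_PLM_ALL;

    return (branch_sample_type & ~plm) != 0;
}

/* Hypothetical example value: call branches that target user space. */
static uint64_t user_calls_filter(void)
{
    return PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY_CALL;
}
```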
901
902 sample_regs_user (since Linux 3.7)
903 This bit mask defines the set of user CPU registers to dump on
904 samples. The layout of the register mask is architecture-spe‐
905 cific and is described in the kernel header file
906 arch/ARCH/include/uapi/asm/perf_regs.h.
907
908 sample_stack_user (since Linux 3.7)
909 This defines the size of the user stack to dump if PERF_SAM‐
910 PLE_STACK_USER is specified.
911
912 clockid (since Linux 4.1)
913 If use_clockid is set, then this field selects which internal
914 Linux timer to use for timestamps. The available timers are
915 defined in linux/time.h, with CLOCK_MONOTONIC, CLOCK_MONO‐
916 TONIC_RAW, CLOCK_REALTIME, CLOCK_BOOTTIME, and CLOCK_TAI cur‐
917 rently supported.
918
919 aux_watermark (since Linux 4.1)
920 This specifies how much data is required to trigger a
921 PERF_RECORD_AUX sample.
922
923 sample_max_stack (since Linux 4.8)
924 When sample_type includes PERF_SAMPLE_CALLCHAIN, this field
925 specifies how many stack frames to report when generating the
926 callchain.
927
928 Reading results
929 Once a perf_event_open() file descriptor has been opened, the values of
930 the events can be read from the file descriptor. The values that are
931 there are specified by the read_format field in the attr structure at
932 open time.
933
934 If you attempt to read into a buffer that is not big enough to hold the
935 data, the error ENOSPC results.
936
937 Here is the layout of the data returned by a read:
938
939 * If PERF_FORMAT_GROUP was specified to allow reading all events in a
940 group at once:
941
942 struct read_format {
943 u64 nr; /* The number of events */
944 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
945 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
946 struct {
947 u64 value; /* The value of the event */
948 u64 id; /* if PERF_FORMAT_ID */
949 } values[nr];
950 };
951
952 * If PERF_FORMAT_GROUP was not specified:
953
954 struct read_format {
955 u64 value; /* The value of the event */
956 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
957 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
958 u64 id; /* if PERF_FORMAT_ID */
959 };
960
961 The values read are as follows:
962
963 nr The number of events in this file descriptor. Available only if
964 PERF_FORMAT_GROUP was specified.
965
966 time_enabled, time_running
967 Total time the event was enabled and running. Normally these
968 values are the same. Multiplexing happens if the number of
969 events is more than the number of available PMU counter slots.
970 In that case the events run only part of the time and the
971 time_enabled and time_running values can be used to scale an
972 estimated value for the count.
973
974 value An unsigned 64-bit value containing the counter result.
975
976 id A globally unique value for this particular event; only present
977 if PERF_FORMAT_ID was specified in read_format.
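The group layout above can be decoded field by field, as in the following sketch. The struct group_entry type and both function names are hypothetical. The code assumes that only the read_format bits documented here are in use and that the buffer was filled by a successful read(2) on the group leader. The scaling helper applies the time_enabled/time_running estimate described above.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for one event of a group read. */
struct group_entry {
    uint64_t value;
    uint64_t id;    /* 0 unless PERF_FORMAT_ID was requested */
};

/* Decode a PERF_FORMAT_GROUP read buffer; returns nr, or -1 on error. */
static int parse_group_read(const uint64_t *buf, size_t words,
                            uint64_t read_format,
                            uint64_t *time_enabled, uint64_t *time_running,
                            struct group_entry *out, size_t max)
{
    int has_en = (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) != 0;
    int has_run = (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) != 0;
    int has_id = (read_format & PERF_FORMAT_ID) != 0;
    size_t pos = 0, i;
    uint64_t nr;

    if (words < (size_t)(1 + has_en + has_run))
        return -1;
    nr = buf[pos++];
    *time_enabled = has_en ? buf[pos++] : 0;
    *time_running = has_run ? buf[pos++] : 0;

    /* Each entry is one word, or two when an ID is present. */
    if (nr > max || nr * (has_id ? 2 : 1) > words - pos)
        return -1;
    for (i = 0; i < nr; i++) {
        out[i].value = buf[pos++];
        out[i].id = has_id ? buf[pos++] : 0;
    }
    return (int)nr;
}

/* Estimate the full count of an event that was multiplexed. */
static uint64_t scale_count(uint64_t value, uint64_t enabled,
                            uint64_t running)
{
    if (running == 0)
        return 0;   /* never scheduled: no estimate is possible */
    return (value / running) * enabled +
           ((value % running) * enabled) / running;
}
```

Splitting the division into quotient and remainder keeps the intermediate products small, which reduces the risk of overflowing 64 bits for large counts.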
978
979 MMAP layout
980 When using perf_event_open() in sampled mode, asynchronous events (like
981 counter overflow or PROT_EXEC mmap tracking) are logged into a ring-
982 buffer. This ring-buffer is created and accessed through mmap(2).
983
984 The mmap size should be 1+2^n pages, where the first page is a metadata
985 page (struct perf_event_mmap_page) that contains various bits of infor‐
986 mation such as where the ring-buffer head is.
987
988 Before Linux 2.6.39, a bug meant that you had to allocate an mmap
989 ring buffer when sampling even if you did not plan to access it.
990
991 The structure of the first metadata mmap page is as follows:
992
993 struct perf_event_mmap_page {
994 __u32 version; /* version number of this structure */
995 __u32 compat_version; /* lowest version this is compat with */
996 __u32 lock; /* seqlock for synchronization */
997 __u32 index; /* hardware counter identifier */
998 __s64 offset; /* add to hardware counter value */
999 __u64 time_enabled; /* time event active */
1000 __u64 time_running; /* time event on CPU */
1001 union {
1002 __u64 capabilities;
1003 struct {
1004 __u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
1005 cap_bit0_is_deprecated : 1,
1006 cap_user_rdpmc : 1,
1007 cap_user_time : 1,
1008 cap_user_time_zero : 1,
1009 };
1010 };
1011 __u16 pmc_width;
1012 __u16 time_shift;
1013 __u32 time_mult;
1014 __u64 time_offset;
1015 __u64 __reserved[120]; /* Pad to 1 k */
1016 __u64 data_head; /* head in the data section */
1017 __u64 data_tail; /* user-space written tail */
1018 __u64 data_offset; /* where the buffer starts */
1019 __u64 data_size; /* data buffer size */
1020 __u64 aux_head;
1021 __u64 aux_tail;
1022 __u64 aux_offset;
1023 __u64 aux_size;
1024
1025 };
1026
1027 The following list describes the fields in the perf_event_mmap_page
1028 structure in more detail:
1029
1030 version
1031 Version number of this structure.
1032
1033 compat_version
1034 The lowest version this is compatible with.
1035
1036 lock A seqlock for synchronization.
1037
1038 index A unique hardware counter identifier.
1039
1040 offset When using rdpmc for reads this offset value must be added to
1041 the one returned by rdpmc to get the current total event count.
1042
1043 time_enabled
1044 Time the event was active.
1045
1046 time_running
1047 Time the event was running.
1048
1049 cap_usr_time / cap_usr_rdpmc / cap_bit0 (since Linux 3.4)
1050 There was a bug in the definition of cap_usr_time and
1051 cap_usr_rdpmc from Linux 3.4 until Linux 3.11. Both bits were
1052 defined to point to the same location, so it was impossible to
1053 know if cap_usr_time or cap_usr_rdpmc were actually set.
1054
1055 Starting with Linux 3.12, these are renamed to cap_bit0 and you
1056 should use the cap_user_time and cap_user_rdpmc fields instead.
1057
1058 cap_bit0_is_deprecated (since Linux 3.12)
1059 If set, this bit indicates that the kernel supports the properly
1060 separated cap_user_time and cap_user_rdpmc bits.
1061
1062 If not set, it indicates an older kernel where cap_usr_time and
1063 cap_usr_rdpmc map to the same bit and thus both features should
1064 be used with caution.
1065
1066 cap_user_rdpmc (since Linux 3.12)
1067 If the hardware supports user-space read of performance counters
1068 without syscall (this is the "rdpmc" instruction on x86), then
1069 the following code can be used to do a read:
1070
1071 u32 seq, time_mult, time_shift, idx, width;
1072 u64 count, enabled, running;
1073 u64 cyc, time_offset;
1074
1075 do {
1076 seq = pc->lock;
1077 barrier();
1078 enabled = pc->time_enabled;
1079 running = pc->time_running;
1080
1081 if (pc->cap_user_time && enabled != running) {
1082 cyc = rdtsc();
1083 time_offset = pc->time_offset;
1084 time_mult = pc->time_mult;
1085 time_shift = pc->time_shift;
1086 }
1087
1088 idx = pc->index;
1089 count = pc->offset;
1090
1091 if (pc->cap_user_rdpmc && idx) {
1092 width = pc->pmc_width;
1093 count += rdpmc(idx - 1);
1094 }
1095
1096 barrier();
1097 } while (pc->lock != seq);
1098
1099 cap_user_time (since Linux 3.12)
1100 This bit indicates the hardware has a constant, nonstop time‐
1101 stamp counter (TSC on x86).
1102
1103 cap_user_time_zero (since Linux 3.12)
1104 Indicates the presence of time_zero which allows mapping time‐
1105 stamp values to the hardware clock.
1106
1107 pmc_width
1108 If cap_user_rdpmc is set, this field provides the bit-width of the
1109 value read using the rdpmc or equivalent instruction. This can be
1110 used to sign extend the result like:
1111
1112 pmc <<= 64 - pmc_width;
1113 pmc >>= 64 - pmc_width; // signed shift right
1114 count += pmc;
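The sign extension above can be wrapped in a small function, as sketched below. The function name is hypothetical. The code relies on the compiler implementing a right shift of a negative value as an arithmetic shift, which GCC and Clang do but which the C standard leaves implementation-defined.

```c
#include <stdint.h>

/* Sign-extend a raw counter value that is pmc_width bits wide. */
static int64_t sign_extend_pmc(uint64_t raw, unsigned pmc_width)
{
    unsigned shift;

    if (pmc_width == 0 || pmc_width >= 64)
        return (int64_t)raw;    /* nothing to extend */
    shift = 64 - pmc_width;
    /* Shift left as unsigned, then shift right as signed. */
    return (int64_t)(raw << shift) >> shift;
}
```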
1115
1116 time_shift, time_mult, time_offset
1117
1118 If cap_user_time is set, these fields can be used to compute the
1119 time delta since time_enabled (in nanoseconds) using rdtsc or
1120 similar.
1121
1122 u64 quot, rem;
1123 u64 delta;
1124 quot = (cyc >> time_shift);
1125 rem = cyc & (((u64)1 << time_shift) - 1);
1126 delta = time_offset + quot * time_mult +
1127 ((rem * time_mult) >> time_shift);
1128
1129 Where time_offset, time_mult, time_shift, and cyc are read in
1130 the seqcount loop described above. This delta can then be added
1131 to enabled and possibly running (if idx), improving the scaling:
1132
1133 enabled += delta;
1134 if (idx)
1135 running += delta;
1136 quot = count / running;
1137 rem = count % running;
1138 count = quot * enabled + (rem * enabled) / running;
1139
1140 time_zero (since Linux 3.12)
1141
1142 If cap_user_time_zero is set, then the hardware clock (the TSC
1143 timestamp counter on x86) can be calculated from the time_zero,
1144 time_mult, and time_shift values:
1145
1146 time = timestamp - time_zero;
1147 quot = time / time_mult;
1148 rem = time % time_mult;
1149 cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
1150
1151 And vice versa:
1152
1153 quot = cyc >> time_shift;
1154 rem = cyc & (((u64)1 << time_shift) - 1);
1155 timestamp = time_zero + quot * time_mult +
1156 ((rem * time_mult) >> time_shift);
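The two conversions above can be written as a pair of functions, as in the following sketch. The function names are hypothetical. The parameters are assumed to have been read from the metadata page inside the seqlock loop, and time_mult is assumed to be nonzero, which should hold when cap_user_time_zero is set.

```c
#include <stdint.h>

/* Convert a hardware cycle count to a perf timestamp (nanoseconds). */
static uint64_t tsc_to_perf_time(uint64_t cyc, uint64_t time_zero,
                                 uint32_t time_mult, uint16_t time_shift)
{
    uint64_t quot = cyc >> time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << time_shift) - 1);

    return time_zero + quot * time_mult +
           ((rem * time_mult) >> time_shift);
}

/* Convert a perf timestamp back to a hardware cycle count. */
static uint64_t perf_time_to_tsc(uint64_t timestamp, uint64_t time_zero,
                                 uint32_t time_mult, uint16_t time_shift)
{
    uint64_t time = timestamp - time_zero;
    uint64_t quot = time / time_mult;
    uint64_t rem = time % time_mult;

    return (quot << time_shift) + (rem << time_shift) / time_mult;
}
```

With time_mult = 512 and time_shift = 10, one cycle corresponds to half a nanosecond, so 3000 cycles map to 1500 ns past time_zero and back again.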
1157
1158 data_head
1159 This points to the head of the data section. The value
1160 continuously increases; it does not wrap. The value needs to be
1161 manually wrapped by the size of the mmap buffer before accessing
1162 the samples.
1163
1164 On SMP-capable platforms, after reading the data_head value,
1165 user space should issue an rmb().
1166
1167 data_tail
1168 When the mapping is PROT_WRITE, the data_tail value should be
1169 written by user space to reflect the last read data. In this
1170 case, the kernel will not overwrite unread data.
1171
1172 data_offset (since Linux 4.1)
1173 Contains the offset of the location in the mmap buffer where
1174 perf sample data begins.
1175
1176 data_size (since Linux 4.1)
1177 Contains the size of the perf sample region within the mmap buf‐
1178 fer.
1179
1180 aux_head, aux_tail, aux_offset, aux_size (since Linux 4.1)
1181 The AUX region allows mmaping a separate sample buffer for high-
1182 bandwidth data streams (separate from the main perf sample buf‐
1183 fer). An example of a high-bandwidth stream is instruction
1184 tracing support, as is found in newer Intel processors.
1185
1186 To set up an AUX area, first aux_offset needs to be set with an
1187 offset greater than data_offset+data_size and aux_size needs to
1188 be set to the desired buffer size. The desired offset and size
1189 must be page aligned, and the size must be a power of two.
1190 These values are then passed to mmap in order to map the AUX
1191 buffer. Pages in the AUX buffer are included as part of the
1192 RLIMIT_MEMLOCK resource limit (see setrlimit(2)), and also as
1193 part of the perf_event_mlock_kb allowance.
1194
1195 By default, the AUX buffer will be truncated if it will not fit
1196 in the available space in the ring buffer. If the AUX buffer is
1197 mapped as a read only buffer, then it will operate in ring buf‐
1198 fer mode where old data will be overwritten by new. In over‐
1199 write mode, it might not be possible to infer where the new data
1200 began, and it is the consumer's job to disable measurement while
1201 reading to avoid possible data races.
1202
1203 The aux_head and aux_tail ring buffer pointers have the same
1204 behavior and ordering rules as the previously described data_head
1205 and data_tail.
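Because data_head and data_tail increase without wrapping, a consumer has to reduce them modulo the buffer size and cope with records that straddle the end of the buffer. The following sketch shows only that index arithmetic. The function names are hypothetical, the ring size is assumed to be a power of two, and len is assumed not to exceed the ring size. A real consumer would also need the read barrier after loading data_head and would write data_tail back after consuming.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of unread bytes between the consumer tail and producer head. */
static uint64_t ring_unread(uint64_t head, uint64_t tail)
{
    return head - tail;
}

/*
 * Copy len bytes starting at free-running position pos out of a ring
 * of size bytes, splitting the copy where the ring wraps around.
 */
static void ring_read(const uint8_t *ring, uint64_t size, uint64_t pos,
                      void *dst, size_t len)
{
    uint64_t off = pos & (size - 1);    /* manual wrap */
    size_t first = len;

    if (first > size - off)
        first = (size_t)(size - off);
    memcpy(dst, ring + off, first);
    memcpy((uint8_t *)dst + first, ring, len - first);
}
```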
1206
1207 The following 2^n ring-buffer pages have the layout described below.
1208
1209 If perf_event_attr.sample_id_all is set, then all event types will have
1210 the sample_type selected fields related to where/when (identity) an
1211 event took place (TID, TIME, ID, CPU, STREAM_ID), as described in
1212 PERF_RECORD_SAMPLE below. These fields are stashed just after the
1213 perf_event_header and the fields already present for the record
1214 type, that is, at the end of the payload. This allows a newer
1215 perf.data file to be supported by older perf tools, with the new
1216 optional fields being ignored.
1217
1218 The mmap values start with a header:
1219
1220 struct perf_event_header {
1221 __u32 type;
1222 __u16 misc;
1223 __u16 size;
1224 };
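Since every record begins with this header, a consumer can step from one record to the next using the size field. The sketch below counts records of a given type in a linear buffer. The function name is hypothetical, and the buffer is assumed to have already been copied out of the ring so that no record wraps.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the records of the given type in back-to-back record data. */
static size_t count_records(const uint8_t *buf, size_t len, uint32_t type)
{
    size_t pos = 0, n = 0;

    while (len - pos >= sizeof(struct perf_event_header)) {
        struct perf_event_header hdr;

        memcpy(&hdr, buf + pos, sizeof(hdr));
        if (hdr.size < sizeof(hdr) || hdr.size > len - pos)
            break;      /* malformed or truncated record */
        if (hdr.type == type)
            n++;
        pos += hdr.size;    /* size covers header and payload */
    }
    return n;
}
```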
1225
1226 Below, we describe the perf_event_header fields in more detail. For
1227 ease of reading, the fields with shorter descriptions are presented
1228 first.
1229
1230 size This indicates the size of the record.
1231
1232 misc The misc field contains additional information about the sample.
1233
1234 The CPU mode can be determined from this value by masking with
1235 PERF_RECORD_MISC_CPUMODE_MASK and looking for one of the follow‐
1236 ing (note these are not bit masks, only one can be set at a
1237 time):
1238
1239 PERF_RECORD_MISC_CPUMODE_UNKNOWN
1240 Unknown CPU mode.
1241
1242 PERF_RECORD_MISC_KERNEL
1243 Sample happened in the kernel.
1244
1245 PERF_RECORD_MISC_USER
1246 Sample happened in user code.
1247
1248 PERF_RECORD_MISC_HYPERVISOR
1249 Sample happened in the hypervisor.
1250
1251 PERF_RECORD_MISC_GUEST_KERNEL (since Linux 2.6.35)
1252 Sample happened in the guest kernel.
1253
1254 PERF_RECORD_MISC_GUEST_USER (since Linux 2.6.35)
1255 Sample happened in guest user code.
1256
1257 Since the following three statuses are generated by different
1258 record types, they alias to the same bit:
1259
1260 PERF_RECORD_MISC_MMAP_DATA (since Linux 3.10)
1261 This is set when the mapping is not executable; otherwise
1262 the mapping is executable.
1263
1264 PERF_RECORD_MISC_COMM_EXEC (since Linux 3.16)
1265 This is set for a PERF_RECORD_COMM record on kernels more
1266 recent than Linux 3.16 if a process name change was
1267 caused by an exec(2) system call.
1268
1269 PERF_RECORD_MISC_SWITCH_OUT (since Linux 4.3)
1270 When a PERF_RECORD_SWITCH or PERF_RECORD_SWITCH_CPU_WIDE
1271 record is generated, this bit indicates that the context
1272 switch is away from the current process (instead of into
1273 the current process).
1274
1275 In addition, the following bits can be set:
1276
1277 PERF_RECORD_MISC_EXACT_IP
1278 This indicates that the content of PERF_SAMPLE_IP points
1279 to the actual instruction that triggered the event. See
1280 also perf_event_attr.precise_ip.
1281
1282 PERF_RECORD_MISC_EXT_RESERVED (since Linux 2.6.35)
1283 This indicates there is extended data available (cur‐
1284 rently not used).
1285
1286 PERF_RECORD_MISC_PROC_MAP_PARSE_TIMEOUT
1287 This bit is not set by the kernel. It is reserved for
1288 the user-space perf utility to indicate that
1289 /proc/[pid]/maps parsing was taking too long and was
1290 stopped, and thus the mmap records may be truncated.
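Decoding the CPU mode from misc amounts to masking and comparing, as the following sketch shows. The function names and the returned strings are hypothetical; the constants come from linux/perf_event.h.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Map the CPU mode encoded in misc to a descriptive string. */
static const char *cpumode_name(uint16_t misc)
{
    /* The modes are enumerated values, not independent bits. */
    switch (misc & PERF_RECORD_MISC_CPUMODE_MASK) {
    case PERF_RECORD_MISC_KERNEL:
        return "kernel";
    case PERF_RECORD_MISC_USER:
        return "user";
    case PERF_RECORD_MISC_HYPERVISOR:
        return "hypervisor";
    case PERF_RECORD_MISC_GUEST_KERNEL:
        return "guest kernel";
    case PERF_RECORD_MISC_GUEST_USER:
        return "guest user";
    default:
        return "unknown";
    }
}

/* Returns 1 if the sampled IP is the instruction that triggered the event. */
static int sample_ip_is_exact(uint16_t misc)
{
    return (misc & PERF_RECORD_MISC_EXACT_IP) != 0;
}
```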
1291
1292 type The type value is one of the below. The values in the corre‐
1293 sponding record (that follows the header) depend on the type
1294 selected as shown.
1295
1296 PERF_RECORD_MMAP
1297 The MMAP events record the PROT_EXEC mappings so that we can
1298 correlate user-space IPs to code. They have the following
1299 structure:
1300
1301 struct {
1302 struct perf_event_header header;
1303 u32 pid, tid;
1304 u64 addr;
1305 u64 len;
1306 u64 pgoff;
1307 char filename[];
1308 };
1309
1310 pid is the process ID.
1311
1312 tid is the thread ID.
1313
1314 addr is the address of the allocated memory.
1315 len is the length of the allocated memory.
1316 pgoff is the page offset of the allocated memory.
1317 filename is a string describing the backing of the allocated memory.
1318
1319 PERF_RECORD_LOST
1320 This record indicates when events are lost.
1321
1322 struct {
1323 struct perf_event_header header;
1324 u64 id;
1325 u64 lost;
1326 struct sample_id sample_id;
1327 };
1328
1329 id is the unique event ID for the samples that were
1330 lost.
1331
1332 lost is the number of events that were lost.
1333
1334 PERF_RECORD_COMM
1335 This record indicates a change in the process name.
1336
1337 struct {
1338 struct perf_event_header header;
1339 u32 pid;
1340 u32 tid;
1341 char comm[];
1342 struct sample_id sample_id;
1343 };
1344
1345 pid is the process ID.
1346
1347 tid is the thread ID.
1348
1349 comm is a string containing the new name of the process.
1350
1351 PERF_RECORD_EXIT
1352 This record indicates a process exit event.
1353
1354 struct {
1355 struct perf_event_header header;
1356 u32 pid, ppid;
1357 u32 tid, ptid;
1358 u64 time;
1359 struct sample_id sample_id;
1360 };
1361
1362 PERF_RECORD_THROTTLE, PERF_RECORD_UNTHROTTLE
1363 This record indicates a throttle/unthrottle event.
1364
1365 struct {
1366 struct perf_event_header header;
1367 u64 time;
1368 u64 id;
1369 u64 stream_id;
1370 struct sample_id sample_id;
1371 };
1372
1373 PERF_RECORD_FORK
1374 This record indicates a fork event.
1375
1376 struct {
1377 struct perf_event_header header;
1378 u32 pid, ppid;
1379 u32 tid, ptid;
1380 u64 time;
1381 struct sample_id sample_id;
1382 };
1383
1384 PERF_RECORD_READ
1385 This record indicates a read event.
1386
1387 struct {
1388 struct perf_event_header header;
1389 u32 pid, tid;
1390 struct read_format values;
1391 struct sample_id sample_id;
1392 };
1393
1394 PERF_RECORD_SAMPLE
1395 This record indicates a sample.
1396
1397 struct {
1398 struct perf_event_header header;
1399 u64 sample_id; /* if PERF_SAMPLE_IDENTIFIER */
1400 u64 ip; /* if PERF_SAMPLE_IP */
1401 u32 pid, tid; /* if PERF_SAMPLE_TID */
1402 u64 time; /* if PERF_SAMPLE_TIME */
1403 u64 addr; /* if PERF_SAMPLE_ADDR */
1404 u64 id; /* if PERF_SAMPLE_ID */
1405 u64 stream_id; /* if PERF_SAMPLE_STREAM_ID */
1406 u32 cpu, res; /* if PERF_SAMPLE_CPU */
1407 u64 period; /* if PERF_SAMPLE_PERIOD */
1408 struct read_format v;
1409 /* if PERF_SAMPLE_READ */
1410 u64 nr; /* if PERF_SAMPLE_CALLCHAIN */
1411 u64 ips[nr]; /* if PERF_SAMPLE_CALLCHAIN */
1412 u32 size; /* if PERF_SAMPLE_RAW */
1413 char data[size]; /* if PERF_SAMPLE_RAW */
1414 u64 bnr; /* if PERF_SAMPLE_BRANCH_STACK */
1415 struct perf_branch_entry lbr[bnr];
1416 /* if PERF_SAMPLE_BRANCH_STACK */
1417 u64 abi; /* if PERF_SAMPLE_REGS_USER */
1418 u64 regs[weight(mask)];
1419 /* if PERF_SAMPLE_REGS_USER */
1420 u64 size; /* if PERF_SAMPLE_STACK_USER */
1421 char data[size]; /* if PERF_SAMPLE_STACK_USER */
1422 u64 dyn_size; /* if PERF_SAMPLE_STACK_USER &&
1423 size != 0 */
1424 u64 weight; /* if PERF_SAMPLE_WEIGHT */
1425 u64 data_src; /* if PERF_SAMPLE_DATA_SRC */
1426 u64 transaction; /* if PERF_SAMPLE_TRANSACTION */
1427 u64 abi; /* if PERF_SAMPLE_REGS_INTR */
1428 u64 regs[weight(mask)];
1429 /* if PERF_SAMPLE_REGS_INTR */
1430 };
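Because each field in the layout above is present only when its sample_type bit is set, a parser has to walk the fields in order. The sketch below decodes just the leading identifier, ip, pid/tid, and time fields. The struct sample_prefix type and both function names are hypothetical, and body is assumed to point just past the perf_event_header.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the fields this sketch decodes. */
struct sample_prefix {
    uint64_t ip;
    uint32_t pid, tid;
    uint64_t time;
};

/* Copy n bytes from body at *pos (or skip them if dst is NULL). */
static int take(const uint8_t *body, size_t len, size_t *pos,
                void *dst, size_t n)
{
    if (n > len - *pos)
        return -1;      /* record is shorter than expected */
    if (dst)
        memcpy(dst, body + *pos, n);
    *pos += n;
    return 0;
}

/* Returns the number of bytes consumed, or -1 if body is too short. */
static long parse_sample_prefix(const uint8_t *body, size_t len,
                                uint64_t sample_type,
                                struct sample_prefix *out)
{
    size_t pos = 0;

    memset(out, 0, sizeof(*out));
    if ((sample_type & PERF_SAMPLE_IDENTIFIER) &&
        take(body, len, &pos, NULL, 8))
        return -1;
    if ((sample_type & PERF_SAMPLE_IP) &&
        take(body, len, &pos, &out->ip, 8))
        return -1;
    if ((sample_type & PERF_SAMPLE_TID) &&
        (take(body, len, &pos, &out->pid, 4) ||
         take(body, len, &pos, &out->tid, 4)))
        return -1;
    if ((sample_type & PERF_SAMPLE_TIME) &&
        take(body, len, &pos, &out->time, 8))
        return -1;
    return (long)pos;
}
```

The remaining fields would be handled the same way, in the order shown in the structure; the variable-length ones (callchain, raw data, branch stack) first read their count or size word.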
1431
1432 sample_id
1433 If PERF_SAMPLE_IDENTIFIER is enabled, a 64-bit unique ID
1434 is included. This is a duplication of the PERF_SAM‐
1435 PLE_ID id value, but included at the beginning of the
1436 sample so parsers can easily obtain the value.
1437
1438 ip If PERF_SAMPLE_IP is enabled, then a 64-bit instruction
1439 pointer value is included.
1440
1441 pid, tid
1442 If PERF_SAMPLE_TID is enabled, then a 32-bit process ID
1443 and 32-bit thread ID are included.
1444
1445 time
1446 If PERF_SAMPLE_TIME is enabled, then a 64-bit timestamp
1447 is included. This is obtained via local_clock() which
1448 is a hardware timestamp if available and the jiffies
1449 value if not.
1450
1451 addr
1452 If PERF_SAMPLE_ADDR is enabled, then a 64-bit address is
1453 included. This is usually the address of a tracepoint,
1454 breakpoint, or software event; otherwise the value is 0.
1455
1456 id If PERF_SAMPLE_ID is enabled, a 64-bit unique ID is
1457 included. If the event is a member of an event group,
1458 the group leader ID is returned. This ID is the same as
1459 the one returned by PERF_FORMAT_ID.
1460
1461 stream_id
1462 If PERF_SAMPLE_STREAM_ID is enabled, a 64-bit unique ID
1463 is included. Unlike PERF_SAMPLE_ID the actual ID is
1464 returned, not the group leader. This ID is the same as
1465 the one returned by PERF_FORMAT_ID.
1466
1467 cpu, res
1468 If PERF_SAMPLE_CPU is enabled, this is a 32-bit value
1469 indicating which CPU was being used, in addition to a
1470 reserved (unused) 32-bit value.
1471
1472 period
1473 If PERF_SAMPLE_PERIOD is enabled, a 64-bit value indi‐
1474 cating the current sampling period is written.
1475
1476 v If PERF_SAMPLE_READ is enabled, a structure of type
1477 read_format is included which has values for all events
1478 in the event group. The values included depend on the
1479 read_format value used at perf_event_open() time.
1480
1481 nr, ips[nr]
1482 If PERF_SAMPLE_CALLCHAIN is enabled, then a 64-bit
1483 number is included which indicates how many 64-bit
1484 instruction pointers follow. This is the current
1485 callchain.
1486
1487 size, data[size]
1488 If PERF_SAMPLE_RAW is enabled, then a 32-bit value indi‐
1489 cating size is included followed by an array of 8-bit
1490 values of length size. The values are padded with 0 to
1491 have 64-bit alignment.
1492
1493 This RAW record data is opaque with respect to the ABI.
1494 The ABI doesn't make any promises with respect to the
1495 stability of its content; it may vary depending on
1496 event, hardware, and kernel version.
1497
1498 bnr, lbr[bnr]
1499 If PERF_SAMPLE_BRANCH_STACK is enabled, then a 64-bit
1500 value indicating the number of records is included, fol‐
1501 lowed by bnr perf_branch_entry structures which each
1502 include the fields:
1503
1504 from This indicates the source instruction (may not be
1505 a branch).
1506
1507 to The branch target.
1508
1509 mispred
1510 The branch target was mispredicted.
1511
1512 predicted
1513 The branch target was predicted.
1514
1515 in_tx (since Linux 3.11)
1516 The branch was in a transactional memory transac‐
1517 tion.
1518
1519 abort (since Linux 3.11)
1520 The branch was in an aborted transactional memory
1521 transaction.
1522
1523 cycles (since Linux 4.3)
1524 This reports the number of cycles elapsed since
1525 the previous branch stack update.
1526
1527 The entries are from most to least recent, so the first
1528 entry has the most recent branch.
1529
1530 Support for mispred, predicted, and cycles is optional;
1531 if not supported, those values will be 0.
1532
1533 The type of branches recorded is specified by the
1534 branch_sample_type field.
1535
1536 abi, regs[weight(mask)]
1537 If PERF_SAMPLE_REGS_USER is enabled, then the user CPU
1538 registers are recorded.
1539
1540 The abi field is one of PERF_SAMPLE_REGS_ABI_NONE,
1541 PERF_SAMPLE_REGS_ABI_32 or PERF_SAMPLE_REGS_ABI_64.
1542
1543 The regs field is an array of the CPU registers that
1544 were specified by the sample_regs_user attr field. The
1545 number of values is the number of bits set in the sam‐
1546 ple_regs_user bit mask.
1547
1548 size, data[size], dyn_size
1549 If PERF_SAMPLE_STACK_USER is enabled, then the user
1550 stack is recorded. This can be used to generate stack
1551 backtraces. size is the size requested by the user in
1552 sample_stack_user or else the maximum record size. data
1553 is the stack data (a raw dump of the memory pointed to
1554 by the stack pointer at the time of sampling). dyn_size
1555 is the amount of data actually dumped (can be less than
1556 size). Note that dyn_size is omitted if size is 0.
1557
1558 weight
1559 If PERF_SAMPLE_WEIGHT is enabled, then a 64-bit value
1560 provided by the hardware is recorded that indicates how
1561 costly the event was. This allows expensive events to
1562 stand out more clearly in profiles.
1563
1564 data_src
1565 If PERF_SAMPLE_DATA_SRC is enabled, then a 64-bit value
1566 is recorded that is made up of the following fields:
1567
1568 mem_op
1569 Type of opcode, a bitwise combination of:
1570
1571 PERF_MEM_OP_NA Not available
1572 PERF_MEM_OP_LOAD Load instruction
1573 PERF_MEM_OP_STORE Store instruction
1574 PERF_MEM_OP_PFETCH Prefetch
1575 PERF_MEM_OP_EXEC Executable code
1576
1577 mem_lvl
1578 Memory hierarchy level hit or miss, a bitwise combi‐
1579 nation of the following, shifted left by
1580 PERF_MEM_LVL_SHIFT:
1581
1582 PERF_MEM_LVL_NA Not available
1583 PERF_MEM_LVL_HIT Hit
1584 PERF_MEM_LVL_MISS Miss
1585 PERF_MEM_LVL_L1 Level 1 cache
1586 PERF_MEM_LVL_LFB Line fill buffer
1587 PERF_MEM_LVL_L2 Level 2 cache
1588 PERF_MEM_LVL_L3 Level 3 cache
1589 PERF_MEM_LVL_LOC_RAM Local DRAM
1590 PERF_MEM_LVL_REM_RAM1 Remote DRAM 1 hop
1591 PERF_MEM_LVL_REM_RAM2 Remote DRAM 2 hops
1592 PERF_MEM_LVL_REM_CCE1 Remote cache 1 hop
1593 PERF_MEM_LVL_REM_CCE2 Remote cache 2 hops
1594 PERF_MEM_LVL_IO I/O memory
1595 PERF_MEM_LVL_UNC Uncached memory
1596
1597 mem_snoop
1598 Snoop mode, a bitwise combination of the following,
1599 shifted left by PERF_MEM_SNOOP_SHIFT:
1600
1601 PERF_MEM_SNOOP_NA Not available
1602 PERF_MEM_SNOOP_NONE No snoop
1603 PERF_MEM_SNOOP_HIT Snoop hit
1604 PERF_MEM_SNOOP_MISS Snoop miss
1605 PERF_MEM_SNOOP_HITM Snoop hit modified
1606
1607 mem_lock
1608 Lock instruction, a bitwise combination of the fol‐
1609 lowing, shifted left by PERF_MEM_LOCK_SHIFT:
1610
1611 PERF_MEM_LOCK_NA Not available
1612 PERF_MEM_LOCK_LOCKED Locked transaction
1613
1614 mem_dtlb
1615 TLB access hit or miss, a bitwise combination of the
1616 following, shifted left by PERF_MEM_TLB_SHIFT:
1617
1618 PERF_MEM_TLB_NA Not available
1619 PERF_MEM_TLB_HIT Hit
1620 PERF_MEM_TLB_MISS Miss
1621 PERF_MEM_TLB_L1 Level 1 TLB
1622 PERF_MEM_TLB_L2 Level 2 TLB
1623 PERF_MEM_TLB_WK Hardware walker
1624 PERF_MEM_TLB_OS OS fault handler
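One way to test the data_src fields is through union perf_mem_data_src from linux/perf_event.h, which exposes them as bit-fields so that the shifts need not be applied by hand. The sketch below checks for a load that hit in the level 1 cache. The function name is hypothetical.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Returns 1 if data_src describes a load that hit in the L1 cache. */
static int is_l1_load_hit(uint64_t data_src)
{
    union perf_mem_data_src d;

    d.val = data_src;
    /* The fields are bit masks, so each flag is tested on its own. */
    return (d.mem_op & PERF_MEM_OP_LOAD) &&
           (d.mem_lvl & PERF_MEM_LVL_L1) &&
           (d.mem_lvl & PERF_MEM_LVL_HIT);
}
```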
1625
1626 transaction
1627 If the PERF_SAMPLE_TRANSACTION flag is set, then a
1628 64-bit field is recorded describing the sources of any
1629 transactional memory aborts.
1630
1631 The field is a bitwise combination of the following val‐
1632 ues:
1633
1634 PERF_TXN_ELISION
1635 Abort from an elision type transaction (Intel-
1636 CPU-specific).
1637
1638 PERF_TXN_TRANSACTION
1639 Abort from a generic transaction.
1640
1641 PERF_TXN_SYNC
1642 Synchronous abort (related to the reported
1643 instruction).
1644
1645 PERF_TXN_ASYNC
1646 Asynchronous abort (not related to the reported
1647 instruction).
1648
1649 PERF_TXN_RETRY
1650 Retryable abort (retrying the transaction may
1651 have succeeded).
1652
1653 PERF_TXN_CONFLICT
1654 Abort due to memory conflicts with other threads.
1655
1656 PERF_TXN_CAPACITY_WRITE
1657 Abort due to write capacity overflow.
1658
1659 PERF_TXN_CAPACITY_READ
1660 Abort due to read capacity overflow.
1661
1662 In addition, a user-specified abort code can be obtained
1663 from the high 32 bits of the field by shifting right by
1664 PERF_TXN_ABORT_SHIFT and masking with the value
1665 PERF_TXN_ABORT_MASK.
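Extracting the abort code might look like the following sketch. The function name is hypothetical. The code assumes that PERF_TXN_ABORT_MASK is defined in linux/perf_event.h as a mask already positioned over the high 32 bits, so the mask is applied before the shift.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Extract the user-specified abort code from a transaction field. */
static uint32_t txn_abort_code(uint64_t transaction)
{
    /* Assumption: the mask covers bits 32..63, so mask first. */
    return (uint32_t)((transaction & PERF_TXN_ABORT_MASK) >>
                      PERF_TXN_ABORT_SHIFT);
}
```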
1666
1667 abi, regs[weight(mask)]
1668 If PERF_SAMPLE_REGS_INTR is enabled, then the CPU registers
1669 current at the time of the sample are recorded.
1670
1671 The abi field is one of PERF_SAMPLE_REGS_ABI_NONE,
1672 PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
1673
1674 The regs field is an array of the CPU registers that
1675 were specified by the sample_regs_intr attr field. The
1676 number of values is the number of bits set in the sam‐
1677 ple_regs_intr bit mask.
1678
1679 PERF_RECORD_MMAP2
1680 This record includes extended information on mmap(2) calls
1681 returning executable mappings. The format is similar to
1682 that of the PERF_RECORD_MMAP record, but includes extra val‐
1683 ues that allow uniquely identifying shared mappings.
1684
1685 struct {
1686 struct perf_event_header header;
1687 u32 pid;
1688 u32 tid;
1689 u64 addr;
1690 u64 len;
1691 u64 pgoff;
1692 u32 maj;
1693 u32 min;
1694 u64 ino;
1695 u64 ino_generation;
1696 u32 prot;
1697 u32 flags;
1698 char filename[];
1699 struct sample_id sample_id;
1700 };
1701
1702 pid is the process ID.
1703
1704 tid is the thread ID.
1705
1706 addr is the address of the allocated memory.
1707
1708 len is the length of the allocated memory.
1709
1710 pgoff is the page offset of the allocated memory.
1711
1712 maj is the major ID of the underlying device.
1713
1714 min is the minor ID of the underlying device.
1715
1716 ino is the inode number.
1717
1718 ino_generation
1719 is the inode generation.
1720
1721 prot is the protection information.
1722
1723 flags is the flags information.
1724
1725 filename
1726 is a string describing the backing of the allocated
1727 memory.
1728
1729 PERF_RECORD_AUX (since Linux 4.1)
1730
1731 This record reports that new data is available in the sepa‐
1732 rate AUX buffer region.
1733
1734 struct {
1735 struct perf_event_header header;
1736 u64 aux_offset;
1737 u64 aux_size;
1738 u64 flags;
1739 struct sample_id sample_id;
1740 };
1741
1742 aux_offset
1743 offset in the AUX mmap region where the new data
1744 begins.
1745
1746 aux_size
1747 size of the data made available.
1748
1749 flags describes the AUX update.
1750
1751 PERF_AUX_FLAG_TRUNCATED
1752 if set, then the data returned was truncated
1753 to fit the available buffer size.
1754
1755 PERF_AUX_FLAG_OVERWRITE
1756 if set, then the data returned has overwritten
1757 previous data.
1758
1759 PERF_RECORD_ITRACE_START (since Linux 4.1)
1760
1761 This record indicates which process has initiated an
1762 instruction trace event, allowing tools to properly corre‐
1763 late the instruction addresses in the AUX buffer with the
1764 proper executable.
1765
1766 struct {
1767 struct perf_event_header header;
1768 u32 pid;
1769 u32 tid;
1770 };
1771
1772 pid process ID of the thread starting an instruction
1773 trace.
1774
1775 tid thread ID of the thread starting an instruction
1776 trace.
1777
1778 PERF_RECORD_LOST_SAMPLES (since Linux 4.2)
1779
1780 When using hardware sampling (such as Intel PEBS) this
1781 record indicates some number of samples that may have been
1782 lost.
1783
1784 struct {
1785 struct perf_event_header header;
1786 u64 lost;
1787 struct sample_id sample_id;
1788 };
1789
1790 lost the number of potentially lost samples.
1791
1792 PERF_RECORD_SWITCH (since Linux 4.3)
1793
This record indicates that a context switch has happened. If
the PERF_RECORD_MISC_SWITCH_OUT bit in the misc field is set,
it was a context switch away from the current process;
otherwise, it was a context switch into the current process.
1798
1799 struct {
1800 struct perf_event_header header;
1801 struct sample_id sample_id;
1802 };
1803
1804 PERF_RECORD_SWITCH_CPU_WIDE (since Linux 4.3)
1805
As with PERF_RECORD_SWITCH, this record indicates that a
context switch has happened, but it only occurs when sampling
in CPU-wide mode and provides additional information on the
process being switched to or from. If the
PERF_RECORD_MISC_SWITCH_OUT bit in the misc field is set, it
was a context switch away from the current process; otherwise,
it was a context switch into the current process.
1813
1814 struct {
1815 struct perf_event_header header;
1816 u32 next_prev_pid;
1817 u32 next_prev_tid;
1818 struct sample_id sample_id;
1819 };
1820
1821 next_prev_pid
1822 The process ID of the previous (if switching in) or
1823 next (if switching out) process on the CPU.
1824
1825 next_prev_tid
1826 The thread ID of the previous (if switching in) or
1827 next (if switching out) thread on the CPU.
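
As an illustration of the switch records above, the following sketch
decodes one PERF_RECORD_SWITCH_CPU_WIDE record from a buffer that has
already been copied out of the ring buffer. The struct switch_cpu_wide
and struct switch_info types and the decode_switch_cpu_wide() helper are
hypothetical names introduced here; the optional trailing sample_id is
ignored.

```c
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/* Layout of PERF_RECORD_SWITCH_CPU_WIDE as listed above, without the
 * optional trailing sample_id (hypothetical local mirror). */
struct switch_cpu_wide {
    struct perf_event_header header;
    uint32_t next_prev_pid;
    uint32_t next_prev_tid;
};

/* Summary of one decoded record (hypothetical helper type). */
struct switch_info {
    int switched_out;    /* 1 = away from the current task, 0 = into it */
    uint32_t other_pid;  /* next task if switched_out, else previous task */
    uint32_t other_tid;
};

/* Decode the record at buf.  Return 0 on success, or -1 if buf does not
 * hold a complete PERF_RECORD_SWITCH_CPU_WIDE record. */
int decode_switch_cpu_wide(const void *buf, size_t len,
                           struct switch_info *out)
{
    struct switch_cpu_wide rec;

    if (len < sizeof(rec))
        return -1;
    memcpy(&rec, buf, sizeof(rec));    /* avoid unaligned access */
    if (rec.header.type != PERF_RECORD_SWITCH_CPU_WIDE ||
        rec.header.size < sizeof(rec))
        return -1;

    out->switched_out =
        (rec.header.misc & PERF_RECORD_MISC_SWITCH_OUT) != 0;
    out->other_pid = rec.next_prev_pid;
    out->other_tid = rec.next_prev_tid;
    return 0;
}
```

A reader loop would typically dispatch on header.type first and call a
decoder like this only for the record types it understands, skipping
header.size bytes otherwise.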
1828
1829 Overflow handling
Events can be set to notify when a threshold is crossed, indicating an
overflow. Overflow conditions can be captured by monitoring the event
file descriptor with poll(2), select(2), or epoll(7). Alternatively,
the overflow events can be captured via a signal handler, by enabling
I/O signaling on the file descriptor; see the discussion of the
F_SETOWN and F_SETSIG operations in fcntl(2).
1836
1837 Overflows are generated only by sampling events (sample_period must
1838 have a nonzero value).
1839
1840 There are two ways to generate overflow notifications.
1841
1842 The first is to set a wakeup_events or wakeup_watermark value that will
1843 trigger if a certain number of samples or bytes have been written to
1844 the mmap ring buffer. In this case, POLL_IN is indicated.
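
A minimal sketch of this first approach follows. It only fills in the
relevant perf_event_attr fields; the caller is assumed to have zeroed
the structure and set type, size, and config already. The helper names
wake_after_samples() and wake_after_bytes() are hypothetical. The
watermark bit selects which member of the wakeup_events/wakeup_watermark
union the kernel uses.

```c
#include <stdint.h>
#include <linux/perf_event.h>

/* Request a wakeup after every nr_samples samples have been written. */
void wake_after_samples(struct perf_event_attr *attr, uint64_t period,
                        uint32_t nr_samples)
{
    attr->freq = 0;                  /* sample_period is a period */
    attr->sample_period = period;    /* must be nonzero for overflows */
    attr->watermark = 0;             /* union holds wakeup_events */
    attr->wakeup_events = nr_samples;
}

/* Request a wakeup after nr_bytes bytes have been written. */
void wake_after_bytes(struct perf_event_attr *attr, uint64_t period,
                      uint32_t nr_bytes)
{
    attr->freq = 0;
    attr->sample_period = period;
    attr->watermark = 1;             /* union holds wakeup_watermark */
    attr->wakeup_watermark = nr_bytes;
}
```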
1845
1846 The other way is by use of the PERF_EVENT_IOC_REFRESH ioctl. This
1847 ioctl adds to a counter that decrements each time the event overflows.
1848 When nonzero, POLL_IN is indicated, but once the counter reaches 0
1849 POLL_HUP is indicated and the underlying event is disabled.
1850
1851 Refreshing an event group leader refreshes all siblings and refreshing
1852 with a parameter of 0 currently enables infinite refreshes; these
1853 behaviors are unsupported and should not be relied on.
1854
1855 Starting with Linux 3.18, POLL_HUP is indicated if the event being mon‐
1856 itored is attached to a different process and that process exits.
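
The poll(2) approach described above might be sketched as follows. The
enum and the wait_for_overflow() helper are hypothetical names. The
sketch assumes that the POLL_IN and POLL_HUP indications described above
appear to poll(2) as the POLLIN and POLLHUP bits in revents.

```c
#include <poll.h>

enum overflow_state {
    OVERFLOW_ERROR = -1,
    OVERFLOW_TIMEOUT = 0,
    OVERFLOW_DATA = 1,      /* new samples are available */
    OVERFLOW_HANGUP = 2,    /* event disabled, or target exited */
};

/* Wait up to timeout_ms milliseconds for an overflow notification on
 * fd (a negative timeout waits indefinitely). */
enum overflow_state wait_for_overflow(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return OVERFLOW_ERROR;
    if (n == 0)
        return OVERFLOW_TIMEOUT;
    if (pfd.revents & POLLIN)
        return OVERFLOW_DATA;
    if (pfd.revents & POLLHUP)
        return OVERFLOW_HANGUP;
    return OVERFLOW_ERROR;  /* POLLERR or POLLNVAL */
}
```

On OVERFLOW_DATA the caller would normally drain the mmap ring buffer
before polling again; on OVERFLOW_HANGUP it can either refresh the event
or close the file descriptor.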
1857
1858 rdpmc instruction
1859 Starting with Linux 3.4 on x86, you can use the rdpmc instruction to
1860 get low-latency reads without having to enter the kernel. Note that
1861 using rdpmc is not necessarily faster than other methods for reading
1862 event values.
1863
1864 Support for this can be detected with the cap_usr_rdpmc field in the
1865 mmap page; documentation on how to calculate event values can be found
1866 in that section.
1867
1868 Originally, when rdpmc support was enabled, any process (not just ones
1869 with an active perf event) could use the rdpmc instruction to access
1870 the counters. Starting with Linux 4.0, rdpmc support is only allowed
1871 if an event is currently enabled in a process's context. To restore
1872 the old behavior, write the value 2 to /sys/devices/cpu/rdpmc.
1873
1874 perf_event ioctl calls
1875 Various ioctls act on perf_event_open() file descriptors:
1876
1877 PERF_EVENT_IOC_ENABLE
1878 This enables the individual event or event group specified by
1879 the file descriptor argument.
1880
1881 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
1882 then all events in a group are enabled, even if the event speci‐
1883 fied is not the group leader (but see BUGS).
1884
1885 PERF_EVENT_IOC_DISABLE
1886 This disables the individual counter or event group specified by
1887 the file descriptor argument.
1888
1889 Enabling or disabling the leader of a group enables or disables
1890 the entire group; that is, while the group leader is disabled,
1891 none of the counters in the group will count. Enabling or dis‐
1892 abling a member of a group other than the leader affects only
1893 that counter; disabling a non-leader stops that counter from
1894 counting but doesn't affect any other counter.
1895
1896 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
1897 then all events in a group are disabled, even if the event spec‐
1898 ified is not the group leader (but see BUGS).
1899
1900 PERF_EVENT_IOC_REFRESH
1901 Non-inherited overflow counters can use this to enable a counter
1902 for a number of overflows specified by the argument, after which
1903 it is disabled. Subsequent calls of this ioctl add the argument
1904 value to the current count. An overflow notification with
1905 POLL_IN set will happen on each overflow until the count reaches
1906 0; when that happens a notification with POLL_HUP set is sent
1907 and the event is disabled. Using an argument of 0 is considered
1908 undefined behavior.
1909
1910 PERF_EVENT_IOC_RESET
1911 Reset the event count specified by the file descriptor argument
1912 to zero. This resets only the counts; there is no way to reset
1913 the multiplexing time_enabled or time_running values.
1914
1915 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
1916 then all events in a group are reset, even if the event speci‐
1917 fied is not the group leader (but see BUGS).
1918
1919 PERF_EVENT_IOC_PERIOD
1920 This updates the overflow period for the event.
1921
1922 Since Linux 3.7 (on ARM) and Linux 3.14 (all other architec‐
1923 tures), the new period takes effect immediately. On older ker‐
1924 nels, the new period did not take effect until after the next
1925 overflow.
1926
1927 The argument is a pointer to a 64-bit value containing the
1928 desired new period.
1929
1930 Prior to Linux 2.6.36, this ioctl always failed due to a bug in
1931 the kernel.
1932
1933 PERF_EVENT_IOC_SET_OUTPUT
1934 This tells the kernel to report event notifications to the spec‐
1935 ified file descriptor rather than the default one. The file
1936 descriptors must all be on the same CPU.
1937
1938 The argument specifies the desired file descriptor, or -1 if
1939 output should be ignored.
1940
1941 PERF_EVENT_IOC_SET_FILTER (since Linux 2.6.33)
1942 This adds an ftrace filter to this event.
1943
1944 The argument is a pointer to the desired ftrace filter.
1945
1946 PERF_EVENT_IOC_ID (since Linux 3.12)
1947 This returns the event ID value for the given event file
1948 descriptor.
1949
1950 The argument is a pointer to a 64-bit unsigned integer to hold
1951 the result.
1952
1953 PERF_EVENT_IOC_SET_BPF (since Linux 4.1)
1954 This allows attaching a Berkeley Packet Filter (BPF) program to
1955 an existing kprobe tracepoint event. You need CAP_SYS_ADMIN
1956 privileges to use this ioctl.
1957
1958 The argument is a BPF program file descriptor that was created
1959 by a previous bpf(2) system call.
1960
1961 PERF_EVENT_IOC_PAUSE_OUTPUT (since Linux 4.7)
1962 This allows pausing and resuming the event's ring-buffer. A
1963 paused ring-buffer does not prevent generation of samples, but
1964 simply discards them. The discarded samples are considered
1965 lost, and cause a PERF_RECORD_LOST sample to be generated when
1966 possible. An overflow signal may still be triggered by the dis‐
1967 carded sample even though the ring-buffer remains empty.
1968
1969 The argument is an unsigned 32-bit integer. A nonzero value
1970 pauses the ring-buffer, while a zero value resumes the ring-buf‐
1971 fer.
1972
PERF_EVENT_IOC_MODIFY_ATTRIBUTES (since Linux 4.17)
This allows modifying an existing event without the overhead of
closing the event and opening a new one. Currently this is
supported only for breakpoint events.
1977
1978 The argument is a pointer to a perf_event_attr structure con‐
1979 taining the updated event settings.
1980
1981 PERF_EVENT_IOC_QUERY_BPF (since Linux 4.16)
1982 This allows querying which Berkeley Packet Filter (BPF) programs
1983 are attached to an existing kprobe tracepoint. You can only
1984 attach one BPF program per event, but you can have multiple
1985 events attached to a tracepoint. Querying this value on one
1986 tracepoint event returns the id of all BPF programs in all
1987 events attached to the tracepoint. You need CAP_SYS_ADMIN priv‐
1988 ileges to use this ioctl.
1989
1990 The argument is a pointer to a structure
1991 struct perf_event_query_bpf {
1992 __u32 ids_len;
1993 __u32 prog_cnt;
1994 __u32 ids[0];
1995 };
1996
1997 The ids_len field indicates the number of ids that can fit in
1998 the provided ids array. The prog_cnt value is filled in by the
1999 kernel with the number of attached BPF programs. The ids array
2000 is filled with the id of each attached BPF program. If there
2001 are more programs than will fit in the array, then the kernel
2002 will return ENOSPC and ids_len will indicate the number of pro‐
2003 gram IDs that were successfully copied.
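
The ioctls above differ in how they take their argument: some take a
flag or count by value, others take a pointer. The following sketch
wraps a few of them to make those conventions explicit. The wrapper
names are hypothetical, and error handling is left to the caller (each
wrapper returns -1 with errno set on failure).

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/* Reset and then enable every event in the group that fd belongs to. */
int perf_group_start(int fd)
{
    if (ioctl(fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1)
        return -1;
    return ioctl(fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
}

/* Disable every event in the group that fd belongs to. */
int perf_group_stop(int fd)
{
    return ioctl(fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
}

/* PERF_EVENT_IOC_PERIOD takes a pointer to the 64-bit period. */
int perf_set_period(int fd, uint64_t period)
{
    return ioctl(fd, PERF_EVENT_IOC_PERIOD, &period);
}

/* PERF_EVENT_IOC_ID stores the event ID through the pointer. */
int perf_get_id(int fd, uint64_t *id)
{
    return ioctl(fd, PERF_EVENT_IOC_ID, id);
}

/* PERF_EVENT_IOC_REFRESH takes the overflow count by value.  An
 * argument of 0 is undefined behavior, so it is rejected here. */
int perf_refresh(int fd, int overflows)
{
    if (overflows <= 0) {
        errno = EINVAL;
        return -1;
    }
    return ioctl(fd, PERF_EVENT_IOC_REFRESH, overflows);
}
```

Note that on the kernels listed in BUGS, PERF_IOC_FLAG_GROUP does not
iterate over the siblings, so portable code may prefer to issue the
ioctl on each file descriptor in the group.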
2004
2005 Using prctl(2)
2006 A process can enable or disable all currently open event groups using
2007 the prctl(2) PR_TASK_PERF_EVENTS_ENABLE and PR_TASK_PERF_EVENTS_DISABLE
2008 operations. This applies only to events created locally by the calling
2009 process. This does not apply to events created by other processes
2010 attached to the calling process or inherited events from a parent
2011 process. Only group leaders are enabled and disabled, not any other
2012 members of the groups.
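
A minimal sketch of the prctl(2) interface follows; perf_task_events()
is a hypothetical helper name. Because the enable operation acts on all
event groups created by the calling process, it also enables groups that
the program had deliberately left disabled.

```c
#include <sys/prctl.h>

/* Enable (on != 0) or disable (on == 0) all event groups that the
 * calling process has created.  Returns 0 on success, or -1 with errno
 * set on failure. */
int perf_task_events(int on)
{
    return prctl(on ? PR_TASK_PERF_EVENTS_ENABLE
                    : PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0);
}
```

A program could call perf_task_events(0) before a setup phase that
should not be measured and perf_task_events(1) afterward.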
2013
2014 perf_event related configuration files
2015 Files in /proc/sys/kernel/
2016
2017 /proc/sys/kernel/perf_event_paranoid
2018 The perf_event_paranoid file can be set to restrict access
2019 to the performance counters.
2020
2021 2 allow only user-space measurements (default since Linux
2022 4.6).
2023 1 allow both kernel and user measurements (default before
2024 Linux 4.6).
2025 0 allow access to CPU-specific data but not raw tracepoint
2026 samples.
2027 -1 no restrictions.
2028
2029 The existence of the perf_event_paranoid file is the offi‐
2030 cial method for determining if a kernel supports
2031 perf_event_open().
2032
2033 /proc/sys/kernel/perf_event_max_sample_rate
2034 This sets the maximum sample rate. Setting this too high
2035 can allow users to sample at a rate that impacts overall
2036 machine performance and potentially lock up the machine.
2037 The default value is 100000 (samples per second).
2038
2039 /proc/sys/kernel/perf_event_max_stack
2040 This file sets the maximum depth of stack frame entries
2041 reported when generating a call trace.
2042
2043 /proc/sys/kernel/perf_event_mlock_kb
Maximum amount of memory, in kilobytes, that an unprivileged
user can mlock(2). The default is 516 (kB).
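
The perf_event_paranoid levels listed above are cumulative: each lower
value permits everything the higher values permit. The following sketch
reads an integer from such a file and interprets the level from the
point of view of an unprivileged process. All function names are
hypothetical.

```c
#include <stdio.h>

/* Read one integer from a procfs or sysfs file.  Return 0 on success,
 * or -1 if the file cannot be opened or parsed. */
int read_int_file(const char *path, int *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = fscanf(f, "%d", value) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Nonzero if kernel measurements are allowed at this level. */
int paranoid_allows_kernel(int level)
{
    return level <= 1;
}

/* Nonzero if access to CPU-specific data is allowed at this level. */
int paranoid_allows_cpu(int level)
{
    return level <= 0;
}

/* Nonzero if raw tracepoint samples are allowed at this level. */
int paranoid_allows_raw_tracepoints(int level)
{
    return level <= -1;
}
```

A caller might pass "/proc/sys/kernel/perf_event_paranoid" to
read_int_file() and treat a return value of -1 as a sign that the
kernel does not support perf_event_open().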
2046
2047 Files in /sys/bus/event_source/devices/
2048
2049 Since Linux 2.6.34, the kernel supports having multiple PMUs avail‐
2050 able for monitoring. Information on how to program these PMUs can
2051 be found under /sys/bus/event_source/devices/. Each subdirectory
2052 corresponds to a different PMU.
2053
2054 /sys/bus/event_source/devices/*/type (since Linux 2.6.38)
2055 This contains an integer that can be used in the type field
2056 of perf_event_attr to indicate that you wish to use this
2057 PMU.
2058
2059 /sys/bus/event_source/devices/cpu/rdpmc (since Linux 3.4)
2060 If this file is 1, then direct user-space access to the per‐
2061 formance counter registers is allowed via the rdpmc instruc‐
2062 tion. This can be disabled by echoing 0 to the file.
2063
Since Linux 4.0, the meaning has changed: 1 now allows access
only to processes with active perf events, and 2 restores the
old behavior of allowing access to any process.
2068
2069 /sys/bus/event_source/devices/*/format/ (since Linux 3.4)
2070 This subdirectory contains information on the architecture-
2071 specific subfields available for programming the various
2072 config fields in the perf_event_attr struct.
2073
The content of each file is the name of the config field,
followed by a colon, followed by a series of integer bit
ranges separated by commas. For example, the file event may
contain the value config1:1,6-10,44, which indicates that
event is an attribute that occupies bits 1, 6–10, and 44 of
perf_event_attr::config1.
2080
2081 /sys/bus/event_source/devices/*/events/ (since Linux 3.4)
2082 This subdirectory contains files with predefined events.
2083 The contents are strings describing the event settings
2084 expressed in terms of the fields found in the previously
2085 mentioned ./format/ directory. These are not necessarily
2086 complete lists of all events supported by a PMU, but usually
2087 a subset of events deemed useful or interesting.
2088
2089 The content of each file is a list of attribute names sepa‐
2090 rated by commas. Each entry has an optional value (either
2091 hex or decimal). If no value is specified, then it is
2092 assumed to be a single-bit field with a value of 1. An
2093 example entry may look like this: event=0x2,inv,ldlat=3.
2094
2095 /sys/bus/event_source/devices/*/uevent
2096 This file is the standard kernel device interface for
2097 injecting hotplug events.
2098
2099 /sys/bus/event_source/devices/*/cpumask (since Linux 3.7)
2100 The cpumask file contains a comma-separated list of integers
2101 that indicate a representative CPU number for each socket
2102 (package) on the motherboard. This is needed when setting
2103 up uncore or northbridge events, as those PMUs present
2104 socket-wide events.
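
The format strings described above (for example, config1:1,6-10,44) can
be turned into a bit mask with a small parser. The following sketch
shows one way to do this. Both function names are hypothetical, and
deposit_bits() assumes that the low-order bits of the value fill the
listed bits from lowest to highest.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse a format string such as "config1:1,6-10,44".  On success, copy
 * the config field name into field, store a mask of the listed bits in
 * *mask, and return 0.  Return -1 on malformed input. */
int parse_pmu_format(const char *text, char *field, size_t field_len,
                     uint64_t *mask)
{
    const char *colon = strchr(text, ':');
    const char *p;
    size_t name_len;

    if (colon == NULL || colon == text)
        return -1;
    name_len = (size_t)(colon - text);
    if (name_len >= field_len)
        return -1;
    memcpy(field, text, name_len);
    field[name_len] = '\0';

    *mask = 0;
    p = colon + 1;
    for (;;) {
        char *end;
        unsigned long lo, hi;

        lo = strtoul(p, &end, 10);
        if (end == p)
            return -1;
        hi = lo;
        if (*end == '-') {              /* a range such as 6-10 */
            p = end + 1;
            hi = strtoul(p, &end, 10);
            if (end == p)
                return -1;
        }
        if (hi > 63 || lo > hi)
            return -1;
        for (unsigned long bit = lo; bit <= hi; bit++)
            *mask |= UINT64_C(1) << bit;
        if (*end != ',')
            break;                      /* end of the list */
        p = end + 1;
    }
    return 0;
}

/* Spread the low-order bits of value across the set bits of mask. */
uint64_t deposit_bits(uint64_t mask, uint64_t value)
{
    uint64_t out = 0;

    for (int bit = 0; bit < 64; bit++) {
        if (mask & (UINT64_C(1) << bit)) {
            if (value & 1)
                out |= UINT64_C(1) << bit;
            value >>= 1;
        }
    }
    return out;
}
```

The result of deposit_bits() would be ORed into the perf_event_attr
field named by the format file (config1 in the example above).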
2105
2107 perf_event_open() returns the new file descriptor, or -1 if an error
2108 occurred (in which case, errno is set appropriately).
2109
2111 The errors returned by perf_event_open() can be inconsistent, and may
2112 vary across processor architectures and performance monitoring units.
2113
2114 E2BIG Returned if the perf_event_attr size value is too small (smaller
2115 than PERF_ATTR_SIZE_VER0), too big (larger than the page size),
2116 or larger than the kernel supports and the extra bytes are not
2117 zero. When E2BIG is returned, the perf_event_attr size field is
2118 overwritten by the kernel to be the size of the structure it was
2119 expecting.
2120
2121 EACCES Returned when the requested event requires CAP_SYS_ADMIN permis‐
2122 sions (or a more permissive perf_event paranoid setting). Some
2123 common cases where an unprivileged process may encounter this
2124 error: attaching to a process owned by a different user; moni‐
2125 toring all processes on a given CPU (i.e., specifying the pid
2126 argument as -1); and not setting exclude_kernel when the para‐
2127 noid setting requires it.
2128
2129 EBADF Returned if the group_fd file descriptor is not valid, or, if
2130 PERF_FLAG_PID_CGROUP is set, the cgroup file descriptor in pid
2131 is not valid.
2132
2133 EBUSY (since Linux 4.1)
2134 Returned if another event already has exclusive access to the
2135 PMU.
2136
2137 EFAULT Returned if the attr pointer points at an invalid memory
2138 address.
2139
EINVAL Returned if the specified event is invalid. There are many
possible reasons for this. A non-exhaustive list: sample_freq is
higher than the maximum setting; the cpu to monitor does not
exist; read_format is out of range; sample_type is out of range;
the flags value is out of range; exclusive or pinned is set and
the event is not a group leader; the event config values are out
of range or set reserved bits; the generic event selected is not
supported; or there is not enough room to add the selected
event.
2149
2150 EINTR Returned when trying to mix perf and ftrace handling for a
2151 uprobe.
2152
2153 EMFILE Each opened event uses one file descriptor. If a large number
2154 of events are opened, the per-process limit on the number of
2155 open file descriptors will be reached, and no more events can be
2156 created.
2157
2158 ENODEV Returned when the event involves a feature not supported by the
2159 current CPU.
2160
2161 ENOENT Returned if the type setting is not valid. This error is also
2162 returned for some unsupported generic events.
2163
2164 ENOSPC Prior to Linux 3.3, if there was not enough room for the event,
2165 ENOSPC was returned. In Linux 3.3, this was changed to EINVAL.
2166 ENOSPC is still returned if you try to add more breakpoint
2167 events than supported by the hardware.
2168
2169 ENOSYS Returned if PERF_SAMPLE_STACK_USER is set in sample_type and it
2170 is not supported by hardware.
2171
2172 EOPNOTSUPP
2173 Returned if an event requiring a specific hardware feature is
2174 requested but there is no hardware support. This includes
2175 requesting low-skid events if not supported, branch tracing if
2176 it is not available, sampling if no PMU interrupt is available,
2177 and branch stacks for software events.
2178
2179 EOVERFLOW (since Linux 4.8)
2180 Returned if PERF_SAMPLE_CALLCHAIN is requested and sam‐
2181 ple_max_stack is larger than the maximum specified in
2182 /proc/sys/kernel/perf_event_max_stack.
2183
2184 EPERM Returned on many (but not all) architectures when an unsupported
2185 exclude_hv, exclude_idle, exclude_user, or exclude_kernel set‐
2186 ting is specified.
2187
2188 It can also happen, as with EACCES, when the requested event
2189 requires CAP_SYS_ADMIN permissions (or a more permissive
2190 perf_event paranoid setting). This includes setting a break‐
2191 point on a kernel address, and (since Linux 3.13) setting a ker‐
2192 nel function-trace tracepoint.
2193
2194 ESRCH Returned if attempting to attach to a process that does not
2195 exist.
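
Because several errors above share causes, a program may want to turn
errno into a short hint for the user. The following sketch shows one
way to do so. The perf_open_hint() name is hypothetical, and the
strings are illustrative summaries of the list above, not messages
produced by the kernel.

```c
#include <errno.h>

/* Map an errno value from perf_event_open() to a short hint. */
const char *perf_open_hint(int err)
{
    switch (err) {
    case E2BIG:
        return "attr.size rejected; the kernel stored the size it expects";
    case EACCES:
    case EPERM:
        return "insufficient privilege; check CAP_SYS_ADMIN, "
               "perf_event_paranoid, and the exclude_* bits";
    case EBADF:
        return "group_fd (or the cgroup fd passed in pid) is not valid";
    case EBUSY:
        return "another event has exclusive access to the PMU";
    case EINVAL:
        return "invalid attr, cpu, or flags value";
    case EMFILE:
        return "per-process file descriptor limit reached";
    case ENODEV:
    case EOPNOTSUPP:
        return "feature not supported by this CPU or PMU";
    case ENOENT:
        return "type is not valid, or the generic event is unsupported";
    case ENOSPC:
        return "too many breakpoint events for the hardware";
    case ESRCH:
        return "the target process does not exist";
    default:
        return "no specific hint";
    }
}
```

Since the errors can be inconsistent across architectures and PMUs, the
hint is best printed alongside strerror(3) output rather than in place
of it.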
2196
2198 perf_event_open() was introduced in Linux 2.6.31 but was called
2199 perf_counter_open(). It was renamed in Linux 2.6.32.
2200
The perf_event_open() system call is Linux-specific and should not be
used in programs intended to be portable.
2204
2206 Glibc does not provide a wrapper for this system call; call it using
2207 syscall(2). See the example below.
2208
2209 The official way of knowing if perf_event_open() support is enabled is
2210 checking for the existence of the file /proc/sys/ker‐
2211 nel/perf_event_paranoid.
2212
2214 The F_SETOWN_EX option to fcntl(2) is needed to properly get overflow
2215 signals in threads. This was introduced in Linux 2.6.32.
2216
Prior to Linux 2.6.33 (at least for x86), the kernel did not check if
events could be scheduled together until read time. The same happens
on all known kernels if the NMI watchdog is enabled. This means that,
to see whether a given set of events works, you have to call
perf_event_open(), start the events, and then read them before you
know for sure that you can get valid measurements.
2222
2223 Prior to Linux 2.6.34, event constraints were not enforced by the ker‐
2224 nel. In that case, some events would silently return "0" if the kernel
2225 scheduled them in an improper counter slot.
2226
2227 Prior to Linux 2.6.34, there was a bug when multiplexing where the
2228 wrong results could be returned.
2229
On kernels from Linux 2.6.35 to Linux 2.6.39, enabling "inherit" and
starting many threads can quickly crash the kernel.
2232
2233 Prior to Linux 2.6.35, PERF_FORMAT_GROUP did not work with attached
2234 processes.
2235
2236 There is a bug in the kernel code between Linux 2.6.36 and Linux 3.0
2237 that ignores the "watermark" field and acts as if a wakeup_event was
2238 chosen if the union has a nonzero value in it.
2239
2240 From Linux 2.6.31 to Linux 3.4, the PERF_IOC_FLAG_GROUP ioctl argument
2241 was broken and would repeatedly operate on the event specified rather
2242 than iterating across all sibling events in a group.
2243
2244 From Linux 3.4 to Linux 3.11, the mmap cap_usr_rdpmc and cap_usr_time
2245 bits mapped to the same location. Code should migrate to the new
2246 cap_user_rdpmc and cap_user_time fields instead.
2247
2248 Always double-check your results! Various generalized events have had
2249 wrong values. For example, retired branches measured the wrong thing
2250 on AMD machines until Linux 2.6.35.
2251
2253 The following is a short example that measures the total instruction
2254 count of a call to printf(3).
2255
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

/* Glibc provides no wrapper, so invoke the system call directly. */
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, hw_event, pid, cpu,
                   group_fd, flags);
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    /* Count user-space instructions only; start disabled. */
    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;

    /* Measure the calling process on any CPU, with no group. */
    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n",
                (unsigned long long) pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    printf("Measuring instruction count for this printf\n");

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* A counting event returns its value as a single 64-bit count. */
    if (read(fd, &count, sizeof(long long)) != sizeof(long long)) {
        perror("read");
        exit(EXIT_FAILURE);
    }

    printf("Used %lld instructions\n", count);

    close(fd);
    exit(EXIT_SUCCESS);
}
2308
2310 perf(1), fcntl(2), mmap(2), open(2), prctl(2), read(2)
2311
2312 Documentation/admin-guide/perf-security.rst in the kernel source tree
2313
2315 This page is part of release 5.07 of the Linux man-pages project. A
2316 description of the project, information about reporting bugs, and the
2317 latest version of this page, can be found at
2318 https://www.kernel.org/doc/man-pages/.
2319
2320
2321
2322Linux 2020-06-09 PERF_EVENT_OPEN(2)