PERF_EVENT_OPEN(2)         Linux Programmer's Manual         PERF_EVENT_OPEN(2)

NAME
perf_event_open - set up performance monitoring

SYNOPSIS
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

int perf_event_open(struct perf_event_attr *attr,
                    pid_t pid, int cpu, int group_fd,
                    unsigned long flags);

Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
Given a list of parameters, perf_event_open() returns a file descriptor,
for use in subsequent system calls (read(2), mmap(2), prctl(2),
fcntl(2), etc.).
22
23 A call to perf_event_open() creates a file descriptor that allows mea‐
24 suring performance information. Each file descriptor corresponds to
25 one event that is measured; these can be grouped together to measure
26 multiple events simultaneously.
27
28 Events can be enabled and disabled in two ways: via ioctl(2) and via
29 prctl(2). When an event is disabled it does not count or generate
30 overflows but does continue to exist and maintain its count value.
31
32 Events come in two flavors: counting and sampled. A counting event is
33 one that is used for counting the aggregate number of events that
34 occur. In general, counting event results are gathered with a read(2)
35 call. A sampling event periodically writes measurements to a buffer
36 that can then be accessed via mmap(2).
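
As a minimal sketch of the counting flavor described above, the
following C code wraps the raw system call (there is no glibc wrapper)
and counts user-space instructions retired by the calling thread while
a callback runs. The names sys_perf_event_open(),
make_instruction_counter(), and count_instructions() are illustrative
and are not part of any library; the ioctl(2) requests come from
<linux/perf_event.h>.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper: glibc provides none, so go through syscall(2). */
static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Illustrative helper: attr for counting user-space instructions. */
struct perf_event_attr make_instruction_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;          /* start stopped; enable via ioctl(2) */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return attr;
}

/* Counts instructions retired while fn() runs; returns -1 on failure. */
int count_instructions(void (*fn)(void), uint64_t *count)
{
    struct perf_event_attr attr = make_instruction_counter();
    ssize_t n;
    int fd;

    /* pid == 0, cpu == -1: the calling thread on any CPU. */
    fd = (int)sys_perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;              /* permissions, missing PMU, ... */

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    n = read(fd, count, sizeof(*count));
    close(fd);
    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

The sketch reports every failure as -1; a real tool would normally
inspect errno to distinguish a permission problem from an unsupported
event.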
37
38 Arguments
39 The pid and cpu arguments allow specifying which process and CPU to
40 monitor:
41
42 pid == 0 and cpu == -1
43 This measures the calling process/thread on any CPU.
44
45 pid == 0 and cpu >= 0
46 This measures the calling process/thread only when running on
47 the specified CPU.
48
49 pid > 0 and cpu == -1
50 This measures the specified process/thread on any CPU.
51
52 pid > 0 and cpu >= 0
53 This measures the specified process/thread only when running on
54 the specified CPU.
55
56 pid == -1 and cpu >= 0
57 This measures all processes/threads on the specified CPU. This
58 requires CAP_SYS_ADMIN capability or a /proc/sys/ker‐
59 nel/perf_event_paranoid value of less than 1.
60
61 pid == -1 and cpu == -1
62 This setting is invalid and will return an error.
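
The combinations listed above can be summarized in a small classifier.
This is only an illustrative sketch: the enum and the function
classify_target() are hypothetical names, values of pid or cpu below -1
are treated as invalid by assumption, and the case where pid carries a
cgroup file descriptor (PERF_FLAG_PID_CGROUP) is not covered.

```c
#include <sys/types.h>

enum perf_target {
    TARGET_INVALID,         /* pid == -1 and cpu == -1 */
    TARGET_SELF_ANY_CPU,    /* pid == 0  and cpu == -1 */
    TARGET_SELF_ONE_CPU,    /* pid == 0  and cpu >= 0  */
    TARGET_TASK_ANY_CPU,    /* pid > 0   and cpu == -1 */
    TARGET_TASK_ONE_CPU,    /* pid > 0   and cpu >= 0  */
    TARGET_CPU_WIDE         /* pid == -1 and cpu >= 0  */
};

/* Maps a (pid, cpu) pair to the monitoring mode it selects. */
enum perf_target classify_target(pid_t pid, int cpu)
{
    if (pid < -1 || cpu < -1)
        return TARGET_INVALID;  /* assumption: not a listed combination */
    if (pid == -1)
        return cpu >= 0 ? TARGET_CPU_WIDE : TARGET_INVALID;
    if (pid == 0)
        return cpu >= 0 ? TARGET_SELF_ONE_CPU : TARGET_SELF_ANY_CPU;
    return cpu >= 0 ? TARGET_TASK_ONE_CPU : TARGET_TASK_ANY_CPU;
}
```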
63
64 When pid is greater than zero, permission to perform this system call
65 is governed by a ptrace access mode PTRACE_MODE_READ_REALCREDS check;
66 see ptrace(2).
67
68 The group_fd argument allows event groups to be created. An event
69 group has one event which is the group leader. The leader is created
70 first, with group_fd = -1. The rest of the group members are created
71 with subsequent perf_event_open() calls with group_fd being set to the
72 file descriptor of the group leader. (A single event on its own is
73 created with group_fd = -1 and is considered to be a group with only 1
74 member.) An event group is scheduled onto the CPU as a unit: it will
75 be put onto the CPU only if all of the events in the group can be put
76 onto the CPU. This means that the values of the member events can be
77 meaningfully compared—added, divided (to get ratios), and so on—with
78 each other, since they have counted events for the same set of executed
79 instructions.
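
The grouping procedure described above might be sketched as follows:
a cycles event is opened as the leader with group_fd = -1, and an
instructions event then joins by passing the leader's file descriptor.
The helper names make_hw_attr() and open_cycles_instructions_group()
are hypothetical, and the choice of events is only an example.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative helper: a user-space hardware counting event. */
struct perf_event_attr make_hw_attr(unsigned long long config, int disabled)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = disabled ? 1 : 0;
    attr.exclude_kernel = 1;
    return attr;
}

/*
 * Opens cycles (leader) and instructions (member) for the calling thread.
 * On success stores both descriptors and returns 0; returns -1 otherwise.
 */
int open_cycles_instructions_group(int *leader_fd, int *member_fd)
{
    /* The leader starts disabled; the member follows the leader's state. */
    struct perf_event_attr leader =
        make_hw_attr(PERF_COUNT_HW_CPU_CYCLES, 1);
    struct perf_event_attr member =
        make_hw_attr(PERF_COUNT_HW_INSTRUCTIONS, 0);

    /* Leader first, with group_fd = -1. */
    *leader_fd = (int)syscall(SYS_perf_event_open, &leader, 0, -1, -1, 0UL);
    if (*leader_fd < 0)
        return -1;

    /* The member joins by passing the leader's descriptor as group_fd. */
    *member_fd = (int)syscall(SYS_perf_event_open, &member, 0, -1,
                              *leader_fd, 0UL);
    if (*member_fd < 0) {
        close(*leader_fd);
        *leader_fd = -1;
        return -1;
    }
    return 0;
}
```

Because both events are scheduled as a unit, dividing the instructions
count by the cycles count gives a ratio taken over the same stretch of
execution.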
80
81 The flags argument is formed by ORing together zero or more of the fol‐
82 lowing values:
83
84 PERF_FLAG_FD_CLOEXEC (since Linux 3.14)
This flag enables the close-on-exec flag for the created event
file descriptor, so that the file descriptor is automatically
closed on execve(2). Setting the close-on-exec flag at creation
time, rather than later with fcntl(2), avoids potential race
conditions where the calling thread invokes perf_event_open() and
fcntl(2) at the same time as another thread calls fork(2) then
execve(2).
92
93 PERF_FLAG_FD_NO_GROUP
94 This flag tells the event to ignore the group_fd parameter
95 except for the purpose of setting up output redirection using
96 the PERF_FLAG_FD_OUTPUT flag.
97
98 PERF_FLAG_FD_OUTPUT (broken since Linux 2.6.35)
99 This flag re-routes the event's sampled output to instead be
100 included in the mmap buffer of the event specified by group_fd.
101
102 PERF_FLAG_PID_CGROUP (since Linux 2.6.39)
103 This flag activates per-container system-wide monitoring. A
104 container is an abstraction that isolates a set of resources for
105 finer-grained control (CPUs, memory, etc.). In this mode, the
106 event is measured only if the thread running on the monitored
107 CPU belongs to the designated container (cgroup). The cgroup is
108 identified by passing a file descriptor opened on its directory
109 in the cgroupfs filesystem. For instance, if the cgroup to mon‐
110 itor is called test, then a file descriptor opened on
111 /dev/cgroup/test (assuming cgroupfs is mounted on /dev/cgroup)
112 must be passed as the pid parameter. cgroup monitoring is
113 available only for system-wide events and may therefore require
114 extra permissions.
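
A hedged sketch of cgroup monitoring follows. It opens the cgroup
directory, passes the resulting descriptor in place of pid together
with PERF_FLAG_PID_CGROUP, and also requests PERF_FLAG_FD_CLOEXEC. The
function name open_cgroup_cycles() is hypothetical, the directory path
is supplied by the caller, and closing the cgroup descriptor once the
event exists is an assumption of this sketch.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Opens a cycle counter restricted to tasks in the cgroup whose directory
 * is cgroup_dir, on the given CPU. Returns the event fd, or -1 on failure.
 */
int open_cgroup_cycles(const char *cgroup_dir, int cpu)
{
    struct perf_event_attr attr;
    int cgroup_fd, event_fd;

    cgroup_fd = open(cgroup_dir, O_RDONLY | O_DIRECTORY);
    if (cgroup_fd < 0)
        return -1;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;

    /* The cgroup descriptor takes the place of the pid argument. */
    event_fd = (int)syscall(SYS_perf_event_open, &attr, cgroup_fd, cpu, -1,
                            (unsigned long)(PERF_FLAG_PID_CGROUP |
                                            PERF_FLAG_FD_CLOEXEC));

    close(cgroup_fd);   /* assumption: not needed once the event exists */
    return event_fd < 0 ? -1 : event_fd;
}
```

With cgroupfs mounted as in the example above, a call might look like
open_cgroup_cycles("/dev/cgroup/test", 0).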
115
116 The perf_event_attr structure provides detailed configuration informa‐
117 tion for the event being created.
118
struct perf_event_attr {
    __u32 type;                 /* Type of event */
    __u32 size;                 /* Size of attribute structure */
    __u64 config;               /* Type-specific configuration */

    union {
        __u64 sample_period;    /* Period of sampling */
        __u64 sample_freq;      /* Frequency of sampling */
    };

    __u64 sample_type;  /* Specifies values included in sample */
    __u64 read_format;  /* Specifies values returned in read */

    __u64 disabled       : 1,   /* off by default */
          inherit        : 1,   /* children inherit it */
          pinned         : 1,   /* must always be on PMU */
          exclusive      : 1,   /* only group on PMU */
          exclude_user   : 1,   /* don't count user */
          exclude_kernel : 1,   /* don't count kernel */
          exclude_hv     : 1,   /* don't count hypervisor */
          exclude_idle   : 1,   /* don't count when idle */
          mmap           : 1,   /* include mmap data */
          comm           : 1,   /* include comm data */
          freq           : 1,   /* use freq, not period */
          inherit_stat   : 1,   /* per task counts */
          enable_on_exec : 1,   /* next exec enables */
          task           : 1,   /* trace fork/exit */
          watermark      : 1,   /* wakeup_watermark */
          precise_ip     : 2,   /* skid constraint */
          mmap_data      : 1,   /* non-exec mmap data */
          sample_id_all  : 1,   /* sample_type all events */
          exclude_host   : 1,   /* don't count in host */
          exclude_guest  : 1,   /* don't count in guest */
          exclude_callchain_kernel : 1,
                                /* exclude kernel callchains */
          exclude_callchain_user   : 1,
                                /* exclude user callchains */
          mmap2          :  1,  /* include mmap with inode data */
          comm_exec      :  1,  /* flag comm events that are
                                   due to exec */
          use_clockid    :  1,  /* use clockid for time fields */
          context_switch :  1,  /* context switch data */

          __reserved_1   : 37;

    union {
        __u32 wakeup_events;    /* wakeup every n events */
        __u32 wakeup_watermark; /* bytes before wakeup */
    };

    __u32     bp_type;          /* breakpoint type */

    union {
        __u64 bp_addr;          /* breakpoint address */
        __u64 config1;          /* extension of config */
    };

    union {
        __u64 bp_len;           /* breakpoint length */
        __u64 config2;          /* extension of config1 */
    };
    __u64 branch_sample_type;   /* enum perf_branch_sample_type */
    __u64 sample_regs_user;     /* user regs to dump on samples */
    __u32 sample_stack_user;    /* size of stack to dump on
                                   samples */
    __s32 clockid;              /* clock to use for time fields */
    __u64 sample_regs_intr;     /* regs to dump on samples */
    __u32 aux_watermark;        /* aux bytes before wakeup */
    __u16 sample_max_stack;     /* max frames in callchain */
    __u16 __reserved_2;         /* align to u64 */

};
191
192 The fields of the perf_event_attr structure are described in more
193 detail below:
194
195 type This field specifies the overall event type. It has one of the
196 following values:
197
198 PERF_TYPE_HARDWARE
199 This indicates one of the "generalized" hardware events
200 provided by the kernel. See the config field definition
201 for more details.
202
203 PERF_TYPE_SOFTWARE
204 This indicates one of the software-defined events pro‐
205 vided by the kernel (even if no hardware support is
206 available).
207
208 PERF_TYPE_TRACEPOINT
209 This indicates a tracepoint provided by the kernel trace‐
210 point infrastructure.
211
212 PERF_TYPE_HW_CACHE
213 This indicates a hardware cache event. This has a spe‐
214 cial encoding, described in the config field definition.
215
216 PERF_TYPE_RAW
217 This indicates a "raw" implementation-specific event in
218 the config field.
219
220 PERF_TYPE_BREAKPOINT (since Linux 2.6.33)
221 This indicates a hardware breakpoint as provided by the
222 CPU. Breakpoints can be read/write accesses to an
223 address as well as execution of an instruction address.
224
225 dynamic PMU
226 Since Linux 2.6.38, perf_event_open() can support multi‐
227 ple PMUs. To enable this, a value exported by the kernel
228 can be used in the type field to indicate which PMU to
229 use. The value to use can be found in the sysfs filesys‐
230 tem: there is a subdirectory per PMU instance under
231 /sys/bus/event_source/devices. In each subdirectory
232 there is a type file whose content is an integer that can
233 be used in the type field. For instance,
234 /sys/bus/event_source/devices/cpu/type contains the value
235 for the core CPU PMU, which is usually 4.
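
Looking up a dynamic PMU type could be sketched as below. The function
read_pmu_type() is a hypothetical helper; it takes the sysfs devices
directory as a parameter so that the lookup location is explicit.

```c
#include <stdio.h>

/*
 * Reads the integer stored in <sysfs_devices>/<pmu>/type.
 * Returns the type value, or -1 if the file is missing or malformed.
 */
int read_pmu_type(const char *sysfs_devices, const char *pmu)
{
    char path[256];
    int type = -1;
    FILE *f;
    int n;

    n = snprintf(path, sizeof(path), "%s/%s/type", sysfs_devices, pmu);
    if (n < 0 || n >= (int)sizeof(path))
        return -1;              /* path would not fit */

    f = fopen(path, "r");
    if (f == NULL)
        return -1;              /* no such PMU, or sysfs not mounted */
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}
```

A caller would typically assign a non-negative result of
read_pmu_type("/sys/bus/event_source/devices", "cpu") to the type
field of the attribute structure.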
236
237 size The size of the perf_event_attr structure for forward/backward
238 compatibility. Set this using sizeof(struct perf_event_attr) to
239 allow the kernel to see the struct size at the time of compila‐
240 tion.
241
242 The related define PERF_ATTR_SIZE_VER0 is set to 64; this was
243 the size of the first published struct. PERF_ATTR_SIZE_VER1 is
244 72, corresponding to the addition of breakpoints in Linux
245 2.6.33. PERF_ATTR_SIZE_VER2 is 80 corresponding to the addition
246 of branch sampling in Linux 3.4. PERF_ATTR_SIZE_VER3 is 96 cor‐
247 responding to the addition of sample_regs_user and sam‐
248 ple_stack_user in Linux 3.7. PERF_ATTR_SIZE_VER4 is 104 corre‐
249 sponding to the addition of sample_regs_intr in Linux 3.19.
250 PERF_ATTR_SIZE_VER5 is 112 corresponding to the addition of
251 aux_watermark in Linux 4.1.
252
config This specifies which event you want, in conjunction with the
type field. The config1 and config2 fields are also taken into
account in cases where 64 bits is not enough to fully specify
the event. The encoding of these fields is event dependent.
257
258 There are various ways to set the config field that are depen‐
259 dent on the value of the previously described type field. What
260 follows are various possible settings for config separated out
261 by type.
262
263 If type is PERF_TYPE_HARDWARE, we are measuring one of the gen‐
264 eralized hardware CPU events. Not all of these are available on
265 all platforms. Set config to one of the following:
266
267 PERF_COUNT_HW_CPU_CYCLES
268 Total cycles. Be wary of what happens during CPU
269 frequency scaling.
270
271 PERF_COUNT_HW_INSTRUCTIONS
272 Retired instructions. Be careful, these can be
273 affected by various issues, most notably hardware
274 interrupt counts.
275
276 PERF_COUNT_HW_CACHE_REFERENCES
277 Cache accesses. Usually this indicates Last Level
278 Cache accesses but this may vary depending on your
279 CPU. This may include prefetches and coherency mes‐
280 sages; again this depends on the design of your CPU.
281
282 PERF_COUNT_HW_CACHE_MISSES
283 Cache misses. Usually this indicates Last Level
284 Cache misses; this is intended to be used in con‐
285 junction with the PERF_COUNT_HW_CACHE_REFERENCES
286 event to calculate cache miss rates.
287
288 PERF_COUNT_HW_BRANCH_INSTRUCTIONS
289 Retired branch instructions. Prior to Linux 2.6.35,
290 this used the wrong event on AMD processors.
291
292 PERF_COUNT_HW_BRANCH_MISSES
293 Mispredicted branch instructions.
294
295 PERF_COUNT_HW_BUS_CYCLES
296 Bus cycles, which can be different from total
297 cycles.
298
299 PERF_COUNT_HW_STALLED_CYCLES_FRONTEND (since Linux 3.0)
300 Stalled cycles during issue.
301
302 PERF_COUNT_HW_STALLED_CYCLES_BACKEND (since Linux 3.0)
303 Stalled cycles during retirement.
304
305 PERF_COUNT_HW_REF_CPU_CYCLES (since Linux 3.3)
306 Total cycles; not affected by CPU frequency scaling.
307
308 If type is PERF_TYPE_SOFTWARE, we are measuring software events
309 provided by the kernel. Set config to one of the following:
310
311 PERF_COUNT_SW_CPU_CLOCK
312 This reports the CPU clock, a high-resolution per-
313 CPU timer.
314
315 PERF_COUNT_SW_TASK_CLOCK
316 This reports a clock count specific to the task that
317 is running.
318
319 PERF_COUNT_SW_PAGE_FAULTS
320 This reports the number of page faults.
321
322 PERF_COUNT_SW_CONTEXT_SWITCHES
This counts context switches. Until Linux 2.6.34,
these were all reported as user-space events; after
that they are reported as happening in the kernel.
326
327 PERF_COUNT_SW_CPU_MIGRATIONS
328 This reports the number of times the process has
329 migrated to a new CPU.
330
331 PERF_COUNT_SW_PAGE_FAULTS_MIN
332 This counts the number of minor page faults. These
333 did not require disk I/O to handle.
334
335 PERF_COUNT_SW_PAGE_FAULTS_MAJ
336 This counts the number of major page faults. These
337 required disk I/O to handle.
338
339 PERF_COUNT_SW_ALIGNMENT_FAULTS (since Linux 2.6.33)
340 This counts the number of alignment faults. These
341 happen when unaligned memory accesses happen; the
342 kernel can handle these but it reduces performance.
343 This happens only on some architectures (never on
344 x86).
345
346 PERF_COUNT_SW_EMULATION_FAULTS (since Linux 2.6.33)
347 This counts the number of emulation faults. The
348 kernel sometimes traps on unimplemented instructions
349 and emulates them for user space. This can nega‐
350 tively impact performance.
351
352 PERF_COUNT_SW_DUMMY (since Linux 3.12)
353 This is a placeholder event that counts nothing.
354 Informational sample record types such as mmap or
355 comm must be associated with an active event. This
356 dummy event allows gathering such records without
357 requiring a counting event.
358
359 If type is PERF_TYPE_TRACEPOINT, then we are measuring kernel
360 tracepoints. The value to use in config can be obtained from
361 under debugfs tracing/events/*/*/id if ftrace is enabled in the
362 kernel.
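
One way to fill in a tracepoint event is sketched below. The helper
make_tracepoint_attr() is hypothetical, and the location of the
tracing directory (commonly /sys/kernel/debug/tracing when debugfs is
mounted on /sys/kernel/debug) is an assumption left to the caller.

```c
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>

/*
 * Reads <tracing_dir>/events/<subsystem>/<event>/id and fills *attr for
 * that tracepoint. Returns 0 on success, -1 if the ID cannot be read.
 */
int make_tracepoint_attr(const char *tracing_dir, const char *subsystem,
                         const char *event, struct perf_event_attr *attr)
{
    char path[512];
    unsigned long long id;
    FILE *f;
    int n;

    n = snprintf(path, sizeof(path), "%s/events/%s/%s/id",
                 tracing_dir, subsystem, event);
    if (n < 0 || n >= (int)sizeof(path))
        return -1;

    f = fopen(path, "r");
    if (f == NULL)
        return -1;      /* ftrace disabled, not mounted, or no access */
    n = fscanf(f, "%llu", &id);
    fclose(f);
    if (n != 1)
        return -1;

    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_TRACEPOINT;
    attr->size = sizeof(*attr);
    attr->config = id;  /* tracepoint ID goes in config */
    return 0;
}
```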
363
364 If type is PERF_TYPE_HW_CACHE, then we are measuring a hardware
365 CPU cache event. To calculate the appropriate config value use
366 the following equation:
367
368 (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) |
369 (perf_hw_cache_op_result_id << 16)
370
371 where perf_hw_cache_id is one of:
372
373 PERF_COUNT_HW_CACHE_L1D
374 for measuring Level 1 Data Cache
375
376 PERF_COUNT_HW_CACHE_L1I
377 for measuring Level 1 Instruction Cache
378
379 PERF_COUNT_HW_CACHE_LL
380 for measuring Last-Level Cache
381
382 PERF_COUNT_HW_CACHE_DTLB
383 for measuring the Data TLB
384
385 PERF_COUNT_HW_CACHE_ITLB
386 for measuring the Instruction TLB
387
388 PERF_COUNT_HW_CACHE_BPU
389 for measuring the branch prediction unit
390
391 PERF_COUNT_HW_CACHE_NODE (since Linux 3.1)
392 for measuring local memory accesses
393
394 and perf_hw_cache_op_id is one of:
395
396 PERF_COUNT_HW_CACHE_OP_READ
397 for read accesses
398
399 PERF_COUNT_HW_CACHE_OP_WRITE
400 for write accesses
401
402 PERF_COUNT_HW_CACHE_OP_PREFETCH
403 for prefetch accesses
404
405 and perf_hw_cache_op_result_id is one of:
406
407 PERF_COUNT_HW_CACHE_RESULT_ACCESS
408 to measure accesses
409
410 PERF_COUNT_HW_CACHE_RESULT_MISS
411 to measure misses
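
The equation above translates directly into C. The function name
hw_cache_config() is illustrative; the enum types are those declared
in <linux/perf_event.h>.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Packs cache ID, operation, and result into a PERF_TYPE_HW_CACHE config. */
uint64_t hw_cache_config(enum perf_hw_cache_id cache,
                         enum perf_hw_cache_op_id op,
                         enum perf_hw_cache_op_result_id result)
{
    return (uint64_t)cache |
           ((uint64_t)op << 8) |
           ((uint64_t)result << 16);
}
```

For example, L1 data cache read misses would be requested with
hw_cache_config(PERF_COUNT_HW_CACHE_L1D, PERF_COUNT_HW_CACHE_OP_READ,
PERF_COUNT_HW_CACHE_RESULT_MISS). Not every combination is supported
on every CPU.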
412
413 If type is PERF_TYPE_RAW, then a custom "raw" config value is
414 needed. Most CPUs support events that are not covered by the
415 "generalized" events. These are implementation defined; see
416 your CPU manual (for example the Intel Volume 3B documentation
417 or the AMD BIOS and Kernel Developer Guide). The libpfm4
418 library can be used to translate from the name in the architec‐
419 tural manuals to the raw hex value perf_event_open() expects in
420 this field.
421
422 If type is PERF_TYPE_BREAKPOINT, then leave config set to zero.
423 Its parameters are set in other places.
424
425 sample_period, sample_freq
426 A "sampling" event is one that generates an overflow notifica‐
427 tion every N events, where N is given by sample_period. A sam‐
428 pling event has sample_period > 0. When an overflow occurs,
429 requested data is recorded in the mmap buffer. The sample_type
430 field controls what data is recorded on each overflow.
431
sample_freq can be used if you wish to use frequency rather than
period. In this case, you set the freq flag. The kernel will
adjust the sampling period to try to achieve the desired rate.
The rate of adjustment is a timer tick.
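
Since sample_period and sample_freq share storage, the freq bit
decides how the value is interpreted. The two hypothetical setters
below sketch this; the values passed by a caller are arbitrary.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Request one sample every 'period' events. */
void set_sample_period(struct perf_event_attr *attr, uint64_t period)
{
    attr->freq = 0;
    attr->sample_period = period;
}

/* Request roughly 'hz' samples per second; the kernel adapts the period. */
void set_sample_freq(struct perf_event_attr *attr, uint64_t hz)
{
    attr->freq = 1;
    attr->sample_freq = hz;
}
```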
436
437 sample_type
The various bits in this field specify which values to include
in the sample. They will be recorded in a ring buffer, which is
available to user space using mmap(2). The order in which the
values are saved in the sample is documented in the MMAP Layout
subsection below; it is not the enum perf_event_sample_format
order.
444
445 PERF_SAMPLE_IP
446 Records instruction pointer.
447
448 PERF_SAMPLE_TID
449 Records the process and thread IDs.
450
451 PERF_SAMPLE_TIME
452 Records a timestamp.
453
454 PERF_SAMPLE_ADDR
455 Records an address, if applicable.
456
457 PERF_SAMPLE_READ
458 Record counter values for all events in a group, not just
459 the group leader.
460
461 PERF_SAMPLE_CALLCHAIN
462 Records the callchain (stack backtrace).
463
464 PERF_SAMPLE_ID
465 Records a unique ID for the opened event's group leader.
466
467 PERF_SAMPLE_CPU
468 Records CPU number.
469
470 PERF_SAMPLE_PERIOD
471 Records the current sampling period.
472
473 PERF_SAMPLE_STREAM_ID
474 Records a unique ID for the opened event. Unlike
475 PERF_SAMPLE_ID the actual ID is returned, not the group
476 leader. This ID is the same as the one returned by
477 PERF_FORMAT_ID.
478
479 PERF_SAMPLE_RAW
480 Records additional data, if applicable. Usually returned
481 by tracepoint events.
482
483 PERF_SAMPLE_BRANCH_STACK (since Linux 3.4)
484 This provides a record of recent branches, as provided by
485 CPU branch sampling hardware (such as Intel Last Branch
486 Record). Not all hardware supports this feature.
487
488 See the branch_sample_type field for how to filter which
489 branches are reported.
490
491 PERF_SAMPLE_REGS_USER (since Linux 3.7)
492 Records the current user-level CPU register state (the
493 values in the process before the kernel was called).
494
495 PERF_SAMPLE_STACK_USER (since Linux 3.7)
496 Records the user level stack, allowing stack unwinding.
497
498 PERF_SAMPLE_WEIGHT (since Linux 3.10)
499 Records a hardware provided weight value that expresses
500 how costly the sampled event was. This allows the hard‐
501 ware to highlight expensive events in a profile.
502
503 PERF_SAMPLE_DATA_SRC (since Linux 3.10)
504 Records the data source: where in the memory hierarchy
505 the data associated with the sampled instruction came
506 from. This is available only if the underlying hardware
507 supports this feature.
508
509 PERF_SAMPLE_IDENTIFIER (since Linux 3.12)
510 Places the SAMPLE_ID value in a fixed position in the
511 record, either at the beginning (for sample events) or at
512 the end (if a non-sample event).
513
514 This was necessary because a sample stream may have
515 records from various different event sources with differ‐
516 ent sample_type settings. Parsing the event stream prop‐
517 erly was not possible because the format of the record
518 was needed to find SAMPLE_ID, but the format could not be
519 found without knowing what event the sample belonged to
520 (causing a circular dependency).
521
522 The PERF_SAMPLE_IDENTIFIER setting makes the event stream
523 always parsable by putting SAMPLE_ID in a fixed location,
524 even though it means having duplicate SAMPLE_ID values in
525 records.
526
527 PERF_SAMPLE_TRANSACTION (since Linux 3.13)
528 Records reasons for transactional memory abort events
529 (for example, from Intel TSX transactional memory sup‐
530 port).
531
532 The precise_ip setting must be greater than 0 and a
533 transactional memory abort event must be measured or no
534 values will be recorded. Also note that some perf_event
535 measurements, such as sampled cycle counting, may cause
536 extraneous aborts (by causing an interrupt during a
537 transaction).
538
539 PERF_SAMPLE_REGS_INTR (since Linux 3.19)
540 Records a subset of the current CPU register state as
541 specified by sample_regs_intr. Unlike PERF_SAM‐
542 PLE_REGS_USER the register values will return kernel reg‐
543 ister state if the overflow happened while kernel code is
544 running. If the CPU supports hardware sampling of regis‐
545 ter state (i.e., PEBS on Intel x86) and precise_ip is set
546 higher than zero then the register values returned are
547 those captured by hardware at the time of the sampled
548 instruction's retirement.
549
550 read_format
551 This field specifies the format of the data returned by read(2)
552 on a perf_event_open() file descriptor.
553
554 PERF_FORMAT_TOTAL_TIME_ENABLED
555 Adds the 64-bit time_enabled field. This can be used to
556 calculate estimated totals if the PMU is overcommitted
557 and multiplexing is happening.
558
559 PERF_FORMAT_TOTAL_TIME_RUNNING
560 Adds the 64-bit time_running field. This can be used to
561 calculate estimated totals if the PMU is overcommitted
562 and multiplexing is happening.
563
564 PERF_FORMAT_ID
565 Adds a 64-bit unique value that corresponds to the event
566 group.
567
568 PERF_FORMAT_GROUP
569 Allows all counter values in an event group to be read
570 with one read.
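
The estimate mentioned for the two time fields can be sketched as
follows. The struct assumes that read_format is exactly
PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
(without PERF_FORMAT_ID or PERF_FORMAT_GROUP); the names read_value
and scaled_count() are illustrative.

```c
#include <stdint.h>

/* What read(2) returns for the read_format assumed above. */
struct read_value {
    uint64_t value;         /* raw count while the event was on the PMU */
    uint64_t time_enabled;  /* time the event was enabled */
    uint64_t time_running;  /* time the event was actually counting */
};

/*
 * Extrapolates the raw count to the whole enabled period. Returns 0 when
 * the event never ran, since no estimate is possible in that case.
 */
uint64_t scaled_count(const struct read_value *rv)
{
    if (rv->time_running == 0)
        return 0;
    return (uint64_t)((double)rv->value * (double)rv->time_enabled /
                      (double)rv->time_running);
}
```

When time_running equals time_enabled no multiplexing took place and
the estimate equals the raw value.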
571
572 disabled
573 The disabled bit specifies whether the counter starts out dis‐
574 abled or enabled. If disabled, the event can later be enabled
575 by ioctl(2), prctl(2), or enable_on_exec.
576
577 When creating an event group, typically the group leader is ini‐
578 tialized with disabled set to 1 and any child events are ini‐
579 tialized with disabled set to 0. Despite disabled being 0, the
580 child events will not start until the group leader is enabled.
581
582 inherit
583 The inherit bit specifies that this counter should count events
584 of child tasks as well as the task specified. This applies only
585 to new children, not to any existing children at the time the
586 counter is created (nor to any new children of existing chil‐
587 dren).
588
589 Inherit does not work for some combinations of read_format val‐
590 ues, such as PERF_FORMAT_GROUP.
591
592 pinned The pinned bit specifies that the counter should always be on
593 the CPU if at all possible. It applies only to hardware coun‐
594 ters and only to group leaders. If a pinned counter cannot be
595 put onto the CPU (e.g., because there are not enough hardware
596 counters or because of a conflict with some other event), then
597 the counter goes into an 'error' state, where reads return end-
598 of-file (i.e., read(2) returns 0) until the counter is subse‐
599 quently enabled or disabled.
600
601 exclusive
602 The exclusive bit specifies that when this counter's group is on
603 the CPU, it should be the only group using the CPU's counters.
604 In the future this may allow monitoring programs to support PMU
605 features that need to run alone so that they do not disrupt
606 other hardware counters.
607
608 Note that many unexpected situations may prevent events with the
609 exclusive bit set from ever running. This includes any users
610 running a system-wide measurement as well as any kernel use of
611 the performance counters (including the commonly enabled NMI
612 Watchdog Timer interface).
613
614 exclude_user
615 If this bit is set, the count excludes events that happen in
616 user space.
617
618 exclude_kernel
619 If this bit is set, the count excludes events that happen in
620 kernel space.
621
622 exclude_hv
623 If this bit is set, the count excludes events that happen in the
624 hypervisor. This is mainly for PMUs that have built-in support
625 for handling this (such as POWER). Extra support is needed for
626 handling hypervisor measurements on most machines.
627
628 exclude_idle
629 If set, don't count when the CPU is idle.
630
631 mmap The mmap bit enables generation of PERF_RECORD_MMAP samples for
632 every mmap(2) call that has PROT_EXEC set. This allows tools to
633 notice new executable code being mapped into a program (dynamic
634 shared libraries for example) so that addresses can be mapped
635 back to the original code.
636
637 comm The comm bit enables tracking of process command name as modi‐
638 fied by the exec(2) and prctl(PR_SET_NAME) system calls as well
639 as writing to /proc/self/comm. If the comm_exec flag is also
640 successfully set (possible since Linux 3.16), then the misc flag
641 PERF_RECORD_MISC_COMM_EXEC can be used to differentiate the
642 exec(2) case from the others.
643
freq If this bit is set, then sample_freq, not sample_period, is
used when setting up the sampling interval.
646
647 inherit_stat
648 This bit enables saving of event counts on context switch for
649 inherited tasks. This is meaningful only if the inherit field
650 is set.
651
652 enable_on_exec
653 If this bit is set, a counter is automatically enabled after a
654 call to exec(2).
655
656 task If this bit is set, then fork/exit notifications are included in
657 the ring buffer.
658
659 watermark
660 If set, have an overflow notification happen when we cross the
661 wakeup_watermark boundary. Otherwise, overflow notifications
662 happen after wakeup_events samples.
663
664 precise_ip (since Linux 2.6.35)
665 This controls the amount of skid. Skid is how many instructions
666 execute between an event of interest happening and the kernel
667 being able to stop and record the event. Smaller skid is better
668 and allows more accurate reporting of which events correspond to
669 which instructions, but hardware is often limited with how small
670 this can be.
671
672 The possible values of this field are the following:
673
674 0 SAMPLE_IP can have arbitrary skid.
675
676 1 SAMPLE_IP must have constant skid.
677
678 2 SAMPLE_IP requested to have 0 skid.
679
680 3 SAMPLE_IP must have 0 skid. See also the description of
681 PERF_RECORD_MISC_EXACT_IP.
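
Because hardware differs in the precision it can deliver, one possible
approach (an assumption of this sketch, not something this page
prescribes) is to request the strictest level first and fall back
until the kernel accepts the event. The function name
open_with_best_precise_ip() is hypothetical.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Tries precise_ip = 3, 2, 1, 0 in turn. On success returns the event fd
 * and leaves the accepted level in attr->precise_ip; returns -1 if every
 * level is rejected (attr->precise_ip is then 0).
 */
int open_with_best_precise_ip(struct perf_event_attr *attr,
                              pid_t pid, int cpu)
{
    for (int level = 3; level >= 0; level--) {
        int fd;

        attr->precise_ip = (unsigned int)level;
        fd = (int)syscall(SYS_perf_event_open, attr, pid, cpu, -1, 0UL);
        if (fd >= 0)
            return fd;
    }
    return -1;
}
```

A failure at every level usually points to a cause unrelated to skid,
such as insufficient permissions, so callers may still want to inspect
errno.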
682
683 mmap_data (since Linux 2.6.36)
684 This is the counterpart of the mmap field. This enables genera‐
685 tion of PERF_RECORD_MMAP samples for mmap(2) calls that do not
686 have PROT_EXEC set (for example data and SysV shared memory).
687
688 sample_id_all (since Linux 2.6.38)
689 If set, then TID, TIME, ID, STREAM_ID, and CPU can additionally
690 be included in non-PERF_RECORD_SAMPLEs if the corresponding sam‐
691 ple_type is selected.
692
693 If PERF_SAMPLE_IDENTIFIER is specified, then an additional ID
694 value is included as the last value to ease parsing the record
695 stream. This may lead to the id value appearing twice.
696
697 The layout is described by this pseudo-structure:
698
struct sample_id {
    { u32 pid, tid; }   /* if PERF_SAMPLE_TID set */
    { u64 time;     }   /* if PERF_SAMPLE_TIME set */
    { u64 id;       }   /* if PERF_SAMPLE_ID set */
    { u64 stream_id;}   /* if PERF_SAMPLE_STREAM_ID set */
    { u32 cpu, res; }   /* if PERF_SAMPLE_CPU set */
    { u64 id;       }   /* if PERF_SAMPLE_IDENTIFIER set */
};
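
A parser needs to know how many bytes this trailer occupies. Since
each optional member in the pseudo-structure above takes 8 bytes, the
size follows from the sample_type bits. The function name
sample_id_size() is illustrative.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the sample_id trailer for a given sample_type. */
size_t sample_id_size(uint64_t sample_type)
{
    size_t size = 0;

    if (sample_type & PERF_SAMPLE_TID)
        size += 8;      /* u32 pid, tid */
    if (sample_type & PERF_SAMPLE_TIME)
        size += 8;      /* u64 time */
    if (sample_type & PERF_SAMPLE_ID)
        size += 8;      /* u64 id */
    if (sample_type & PERF_SAMPLE_STREAM_ID)
        size += 8;      /* u64 stream_id */
    if (sample_type & PERF_SAMPLE_CPU)
        size += 8;      /* u32 cpu, res */
    if (sample_type & PERF_SAMPLE_IDENTIFIER)
        size += 8;      /* u64 id, always last */
    return size;
}
```

Bits such as PERF_SAMPLE_IP do not contribute, because only the
members listed in the pseudo-structure are appended to
non-PERF_RECORD_SAMPLE records.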
708
709 exclude_host (since Linux 3.2)
710 When conducting measurements that include processes running VM
711 instances (i.e., have executed a KVM_RUN ioctl(2)), only measure
712 events happening inside a guest instance. This is only meaning‐
713 ful outside the guests; this setting does not change counts
714 gathered inside of a guest. Currently, this functionality is
715 x86 only.
716
717 exclude_guest (since Linux 3.2)
718 When conducting measurements that include processes running VM
719 instances (i.e., have executed a KVM_RUN ioctl(2)), do not mea‐
720 sure events happening inside guest instances. This is only
721 meaningful outside the guests; this setting does not change
722 counts gathered inside of a guest. Currently, this functional‐
723 ity is x86 only.
724
725 exclude_callchain_kernel (since Linux 3.7)
726 Do not include kernel callchains.
727
728 exclude_callchain_user (since Linux 3.7)
729 Do not include user callchains.
730
731 mmap2 (since Linux 3.16)
732 Generate an extended executable mmap record that contains enough
733 additional information to uniquely identify shared mappings.
734 The mmap flag must also be set for this to work.
735
736 comm_exec (since Linux 3.16)
This is purely a feature-detection flag; it does not change
kernel behavior. If this flag can successfully be set, then, when
comm is enabled, the PERF_RECORD_MISC_COMM_EXEC flag will be set
in the misc field of a comm record header if the rename event
being reported was caused by a call to exec(2). This allows
tools to distinguish between the various types of process
renaming.
744
745 use_clockid (since Linux 4.1)
746 This allows selecting which internal Linux clock to use when
747 generating timestamps via the clockid field. This can make it
748 easier to correlate perf sample times with timestamps generated
749 by other tools.
750
751 context_switch (since Linux 4.3)
752 This enables the generation of PERF_RECORD_SWITCH records when a
753 context switch occurs. It also enables the generation of
754 PERF_RECORD_SWITCH_CPU_WIDE records when sampling in CPU-wide
755 mode. This functionality is in addition to existing tracepoint
756 and software events for measuring context switches. The advan‐
757 tage of this method is that it will give full information even
758 with strict perf_event_paranoid settings.
759
760 wakeup_events, wakeup_watermark
761 This union sets how many samples (wakeup_events) or bytes
762 (wakeup_watermark) happen before an overflow notification hap‐
763 pens. Which one is used is selected by the watermark bit flag.
764
765 wakeup_events counts only PERF_RECORD_SAMPLE record types. To
766 receive overflow notification for all PERF_RECORD types choose
767 watermark and set wakeup_watermark to 1.
768
769 Prior to Linux 3.0, setting wakeup_events to 0 resulted in no
770 overflow notifications; more recent kernels treat 0 the same as
771 1.
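
The two notification modes could be configured as sketched below. Both
function names are hypothetical; the second follows the suggestion
above of selecting watermark with wakeup_watermark set to 1.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Notify after every n PERF_RECORD_SAMPLE records. */
void wake_every_n_samples(struct perf_event_attr *attr, uint32_t n)
{
    attr->watermark = 0;
    attr->wakeup_events = n;
}

/* Notify as soon as any record type has been written. */
void wake_on_every_record(struct perf_event_attr *attr)
{
    attr->watermark = 1;
    attr->wakeup_watermark = 1;     /* one byte is enough to trigger */
}
```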
772
773 bp_type (since Linux 2.6.33)
774 This chooses the breakpoint type. It is one of:
775
776 HW_BREAKPOINT_EMPTY
777 No breakpoint.
778
779 HW_BREAKPOINT_R
780 Count when we read the memory location.
781
782 HW_BREAKPOINT_W
783 Count when we write the memory location.
784
785 HW_BREAKPOINT_RW
786 Count when we read or write the memory location.
787
788 HW_BREAKPOINT_X
789 Count when we execute code at the memory location.
790
791 The values can be combined via a bitwise or, but the combination
792 of HW_BREAKPOINT_R or HW_BREAKPOINT_W with HW_BREAKPOINT_X is
793 not allowed.
794
795 bp_addr (since Linux 2.6.33)
796 This is the address of the breakpoint. For execution break‐
797 points, this is the memory address of the instruction of inter‐
798 est; for read and write breakpoints, it is the memory address of
799 the memory location of interest.
800
801 config1 (since Linux 2.6.39)
802 config1 is used for setting events that need an extra register
803 or otherwise do not fit in the regular config field. Raw OFF‐
804 CORE_EVENTS on Nehalem/Westmere/SandyBridge use this field on
805 Linux 3.3 and later kernels.
806
807 bp_len (since Linux 2.6.33)
808 bp_len is the length of the breakpoint being measured if type is
809 PERF_TYPE_BREAKPOINT. Options are HW_BREAKPOINT_LEN_1,
810 HW_BREAKPOINT_LEN_2, HW_BREAKPOINT_LEN_4, and HW_BREAK‐
811 POINT_LEN_8. For an execution breakpoint, set this to
812 sizeof(long).
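
Putting the breakpoint fields together, a write watchpoint might be
set up as in this sketch. The names bp_type_combination_allowed() and
make_write_watchpoint() are hypothetical; the first encodes only the
rule that R or W cannot be combined with X.

```c
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Returns 0 if bp_type mixes read/write access with execution. */
int bp_type_combination_allowed(uint32_t bp_type)
{
    if ((bp_type & HW_BREAKPOINT_X) && (bp_type & HW_BREAKPOINT_RW))
        return 0;
    return 1;
}

/*
 * Illustrative helper: attr that counts writes to 'len' bytes at 'addr'.
 * 'len' is expected to be one of the HW_BREAKPOINT_LEN_* values.
 */
struct perf_event_attr make_write_watchpoint(const void *addr, uint64_t len)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_BREAKPOINT;
    attr.size = sizeof(attr);
    attr.config = 0;                /* unused for breakpoints */
    attr.bp_type = HW_BREAKPOINT_W;
    attr.bp_addr = (uint64_t)(uintptr_t)addr;
    attr.bp_len = len;
    return attr;
}
```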
813
814 config2 (since Linux 2.6.39)
815 config2 is a further extension of the config1 field.
816
817 branch_sample_type (since Linux 3.4)
818 If PERF_SAMPLE_BRANCH_STACK is enabled, then this specifies what
819 branches to include in the branch record.
820
The first part of the value is the privilege level, which is a
combination of one or more of the values listed below. If the user
does not set the privilege level explicitly, the kernel will use the
event's privilege level. Event and branch privilege levels do
not have to match.
826
827 PERF_SAMPLE_BRANCH_USER
828 Branch target is in user space.
829
830 PERF_SAMPLE_BRANCH_KERNEL
831 Branch target is in kernel space.
832
833 PERF_SAMPLE_BRANCH_HV
834 Branch target is in hypervisor.
835
836 PERF_SAMPLE_BRANCH_PLM_ALL
837 A convenience value that is the three preceding values
838 ORed together.
839
840 In addition to the privilege value, one or more of the
841 following bits must be set.
842
843 PERF_SAMPLE_BRANCH_ANY
844 Any branch type.
845
846 PERF_SAMPLE_BRANCH_ANY_CALL
847 Any call branch (includes direct calls, indirect calls,
848 and far jumps).
849
850 PERF_SAMPLE_BRANCH_IND_CALL
851 Indirect calls.
852
853 PERF_SAMPLE_BRANCH_CALL (since Linux 4.4)
854 Direct calls.
855
856 PERF_SAMPLE_BRANCH_ANY_RETURN
857 Any return branch.
858
859 PERF_SAMPLE_BRANCH_IND_JUMP (since Linux 4.2)
860 Indirect jumps.
861
862 PERF_SAMPLE_BRANCH_COND (since Linux 3.16)
863 Conditional branches.
864
865 PERF_SAMPLE_BRANCH_ABORT_TX (since Linux 3.11)
866 Transactional memory aborts.
867
868 PERF_SAMPLE_BRANCH_IN_TX (since Linux 3.11)
869 Branch in transactional memory transaction.
870
871 PERF_SAMPLE_BRANCH_NO_TX (since Linux 3.11)
872 Branch not in transactional memory transaction.

873 PERF_SAMPLE_BRANCH_CALL_STACK (since Linux 4.1)
874 Branch is part of a hardware-generated call stack. This
875 requires hardware support, currently only found on Intel
876 x86 Haswell or newer.
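A branch_sample_type value combines an optional privilege part with at least one branch-type bit. The following is a minimal sketch of that rule. Both function names are hypothetical, and the event and sample period are arbitrary illustrative choices.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical check: true if at least one branch-type bit is set
   in addition to any privilege-level bits. */
static int
has_branch_type_bit(uint64_t branch_sample_type)
{
    return (branch_sample_type & ~(uint64_t) PERF_SAMPLE_BRANCH_PLM_ALL) != 0;
}

/* Hypothetical helper: sample user-space call branches. */
static struct perf_event_attr
make_user_call_branch_attr(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
    attr.sample_period = 100000;           /* arbitrary period */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
    attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER |
                              PERF_SAMPLE_BRANCH_ANY_CALL;
    return attr;
}
```

A value consisting only of privilege bits, such as PERF_SAMPLE_BRANCH_USER alone, fails the check because no branch type has been selected.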
877
878 sample_regs_user (since Linux 3.7)
879 This bit mask defines the set of user CPU registers to dump on
880 samples. The layout of the register mask is architecture-spe‐
881 cific and is described in the kernel header file
882 arch/ARCH/include/uapi/asm/perf_regs.h.
883
884 sample_stack_user (since Linux 3.7)
885 This defines the size of the user stack to dump if PERF_SAM‐
886 PLE_STACK_USER is specified.
887
888 clockid (since Linux 4.1)
889 If use_clockid is set, then this field selects which internal
890 Linux timer to use for timestamps. The available timers are
891 defined in linux/time.h, with CLOCK_MONOTONIC, CLOCK_MONO‐
892 TONIC_RAW, CLOCK_REALTIME, CLOCK_BOOTTIME, and CLOCK_TAI cur‐
893 rently supported.
894
895 aux_watermark (since Linux 4.1)
896 This specifies how much data is required to trigger a
897 PERF_RECORD_AUX sample.
898
899 sample_max_stack (since Linux 4.8)
900 When sample_type includes PERF_SAMPLE_CALLCHAIN, this field
901 specifies how many stack frames to report when generating the
902 callchain.
903
904 Reading results
905 Once a perf_event_open() file descriptor has been opened, the values of
906 the events can be read from the file descriptor. The values that are
907 there are specified by the read_format field in the attr structure at
908 open time.
909
910 If you attempt to read into a buffer that is not big enough to hold the
911 data, the error ENOSPC results.
912
913 Here is the layout of the data returned by a read:
914
915 * If PERF_FORMAT_GROUP was specified to allow reading all events in a
916 group at once:
917
918 struct read_format {
919 u64 nr; /* The number of events */
920 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
921 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
922 struct {
923 u64 value; /* The value of the event */
924 u64 id; /* if PERF_FORMAT_ID */
925 } values[nr];
926 };
927
928 * If PERF_FORMAT_GROUP was not specified:
929
930 struct read_format {
931 u64 value; /* The value of the event */
932 u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
933 u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
934 u64 id; /* if PERF_FORMAT_ID */
935 };
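Because the optional fields are present only when the corresponding read_format bits were set, a reader has to walk the buffer according to those bits. The following is a minimal sketch for the PERF_FORMAT_GROUP layout shown above. struct group_reading and parse_group_read() are hypothetical names, and only the read_format bits described in this section are handled.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one group read; not a kernel structure. */
struct group_reading {
    uint64_t nr;
    uint64_t time_enabled;    /* 0 if not requested */
    uint64_t time_running;    /* 0 if not requested */
    const uint64_t *entries;  /* first entry, inside the caller's buffer */
    size_t stride;            /* u64 words per entry: 1, or 2 with an id */
};

/* Decode nwords 64-bit words that read(2) stored in buf.
   Returns 0 on success, or -1 if the layout does not fit. */
static int
parse_group_read(const uint64_t *buf, size_t nwords, uint64_t read_format,
                 struct group_reading *out)
{
    size_t i = 0;
    size_t fixed = 1;         /* nr is always present */

    if (!(read_format & PERF_FORMAT_GROUP))
        return -1;
    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
        fixed++;
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
        fixed++;
    if (nwords < fixed)
        return -1;

    out->nr = buf[i++];
    out->time_enabled =
        (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) ? buf[i++] : 0;
    out->time_running =
        (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) ? buf[i++] : 0;
    out->stride = (read_format & PERF_FORMAT_ID) ? 2 : 1;
    if (out->nr > (nwords - i) / out->stride)
        return -1;            /* buffer too short for nr entries */
    out->entries = buf + i;
    return 0;
}
```

The value of event k is then entries[k * stride], and its id (when PERF_FORMAT_ID was requested) is entries[k * stride + 1].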
936
937 The values read are as follows:
938
939 nr The number of events in this file descriptor. Available only if
940 PERF_FORMAT_GROUP was specified.
941
942 time_enabled, time_running
943 Total time the event was enabled and running. Normally these
944 values are the same. If more events are started than there are
945 available counter slots on the PMU, then multiplexing happens and
946 events run only part of the time. In that case, the time_enabled
947 and time_running values can be used to scale an estimated value
948 for the count.
949
950 value An unsigned 64-bit value containing the counter result.
951
952 id A globally unique value for this particular event; only present
953 if PERF_FORMAT_ID was specified in read_format.
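When an event was multiplexed, the raw value covers only the time the event was actually running, so an estimate for the whole enabled period is value * time_enabled / time_running. The following is a minimal sketch of that scaling. The function name scale_count() is hypothetical, and returning 0 when the event never ran is an illustrative choice.

```c
#include <stdint.h>

/* Hypothetical helper: estimate the count over the whole enabled
   period from a count gathered while running only part of it. */
static uint64_t
scale_count(uint64_t value, uint64_t time_enabled, uint64_t time_running)
{
    uint64_t quot, rem;

    if (time_running == 0)
        return 0;             /* never scheduled: no basis for an estimate */
    if (time_running >= time_enabled)
        return value;         /* not multiplexed: value is exact */

    /* Split the division so that value * time_enabled is never
       formed directly; rem * time_enabled can still overflow for
       extremely large inputs. */
    quot = value / time_running;
    rem = value % time_running;
    return quot * time_enabled + (rem * time_enabled) / time_running;
}
```

For example, a value of 1000 gathered with time_enabled 200 and time_running 100 scales to an estimate of 2000.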
954
955 MMAP layout
956 When using perf_event_open() in sampled mode, asynchronous events (like
957 counter overflow or PROT_EXEC mmap tracking) are logged into a ring-
958 buffer. This ring-buffer is created and accessed through mmap(2).
959
960 The mmap size should be 1+2^n pages, where the first page is a metadata
961 page (struct perf_event_mmap_page) that contains various bits of infor‐
962 mation such as where the ring-buffer head is.
963
964 Before Linux 2.6.39, a bug meant that you had to allocate an
965 mmap ring buffer when sampling even if you did not plan to access it.
966
967 The structure of the first metadata mmap page is as follows:
968
969 struct perf_event_mmap_page {
970 __u32 version; /* version number of this structure */
971 __u32 compat_version; /* lowest version this is compat with */
972 __u32 lock; /* seqlock for synchronization */
973 __u32 index; /* hardware counter identifier */
974 __s64 offset; /* add to hardware counter value */
975 __u64 time_enabled; /* time event active */
976 __u64 time_running; /* time event on CPU */
977 union {
978 __u64 capabilities;
979 struct {
980 __u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
981 cap_bit0_is_deprecated : 1,
982 cap_user_rdpmc : 1,
983 cap_user_time : 1,
984 cap_user_time_zero : 1,
985 };
986 };
987 __u16 pmc_width;
988 __u16 time_shift;
989 __u32 time_mult;
990 __u64 time_offset;
991 __u64 __reserved[120]; /* Pad to 1 k */
992 __u64 data_head; /* head in the data section */
993 __u64 data_tail; /* user-space written tail */
994 __u64 data_offset; /* where the buffer starts */
995 __u64 data_size; /* data buffer size */
996 __u64 aux_head;
997 __u64 aux_tail;
998 __u64 aux_offset;
999 __u64 aux_size;
1001 };
1002
1003 The following list describes the fields in the perf_event_mmap_page
1004 structure in more detail:
1005
1006 version
1007 Version number of this structure.
1008
1009 compat_version
1010 The lowest version this is compatible with.
1011
1012 lock A seqlock for synchronization.
1013
1014 index A unique hardware counter identifier.
1015
1016 offset When using rdpmc for reads this offset value must be added to
1017 the one returned by rdpmc to get the current total event count.
1018
1019 time_enabled
1020 Time the event was active.
1021
1022 time_running
1023 Time the event was running.
1024
1025 cap_usr_time / cap_usr_rdpmc / cap_bit0 (since Linux 3.4)
1026 There was a bug in the definition of cap_usr_time and
1027 cap_usr_rdpmc from Linux 3.4 until Linux 3.11. Both bits were
1028 defined to point to the same location, so it was impossible to
1029 know if cap_usr_time or cap_usr_rdpmc was actually set.
1030
1031 Starting with Linux 3.12, these are renamed to cap_bit0 and you
1032 should use the cap_user_time and cap_user_rdpmc fields instead.
1033
1034 cap_bit0_is_deprecated (since Linux 3.12)
1035 If set, this bit indicates that the kernel supports the properly
1036 separated cap_user_time and cap_user_rdpmc bits.
1037
1038 If not set, it indicates an older kernel where cap_usr_time and
1039 cap_usr_rdpmc map to the same bit and thus both features should
1040 be used with caution.
1041
1042 cap_user_rdpmc (since Linux 3.12)
1043 If the hardware supports user-space read of performance counters
1044 without syscall (this is the "rdpmc" instruction on x86), then
1045 the following code can be used to do a read:
1046
1047 u32 seq, time_mult, time_shift, idx, width;
1048 u64 count, enabled, running;
1049 u64 cyc, time_offset;
1050
1051 do {
1052 seq = pc->lock;
1053 barrier();
1054 enabled = pc->time_enabled;
1055 running = pc->time_running;
1056
1057 if (pc->cap_user_time && enabled != running) {
1058 cyc = rdtsc();
1059 time_offset = pc->time_offset;
1060 time_mult = pc->time_mult;
1061 time_shift = pc->time_shift;
1062 }
1063
1064 idx = pc->index;
1065 count = pc->offset;
1066
1067 if (pc->cap_user_rdpmc && idx) {
1068 width = pc->pmc_width;
1069 count += rdpmc(idx - 1);
1070 }
1071
1072 barrier();
1073 } while (pc->lock != seq);
1074
1075 cap_user_time (since Linux 3.12)
1076 This bit indicates the hardware has a constant, nonstop time‐
1077 stamp counter (TSC on x86).
1078
1079 cap_user_time_zero (since Linux 3.12)
1080 Indicates the presence of time_zero which allows mapping time‐
1081 stamp values to the hardware clock.
1082
1083 pmc_width
1084 If cap_user_rdpmc is set, this field provides the bit-width of
1085 the value read using the rdpmc or equivalent instruction. This
1086 can be used to sign extend the result like:
1087
1088 pmc <<= 64 - pmc_width;
1089 pmc >>= 64 - pmc_width; // signed shift right
1090 count += pmc;
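The shift pair above works because the left shift moves the counter's top bit into bit 63 and the signed right shift then replicates it. The following is a minimal self-contained sketch. The function name sign_extend_pmc() is hypothetical, and the code assumes a compiler that implements right shift of a negative signed value as an arithmetic shift (as GCC and Clang do).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: sign extend a raw counter value that is
   pmc_width bits wide to a signed 64-bit value. */
static int64_t
sign_extend_pmc(uint64_t raw, unsigned int pmc_width)
{
    unsigned int shift;

    assert(pmc_width >= 1 && pmc_width <= 64);
    shift = 64 - pmc_width;

    /* Assumes >> on a negative int64_t is an arithmetic shift. */
    return (int64_t) (raw << shift) >> shift;
}
```

The signed result can then be added to the offset field, so a counter that was programmed with a negative start value yields the correct running total.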
1091
1092 time_shift, time_mult, time_offset
1093
1094 If cap_user_time is set, these fields can be used to compute the
1095 time delta since time_enabled (in nanoseconds) using rdtsc or
1096 similar.
1097
1098 u64 quot, rem;
1099 u64 delta;
1100 quot = (cyc >> time_shift);
1101 rem = cyc & (((u64)1 << time_shift) - 1);
1102 delta = time_offset + quot * time_mult +
1103 ((rem * time_mult) >> time_shift);
1104
1105 Where time_offset, time_mult, time_shift, and cyc are read in
1106 the seqcount loop described above. This delta can then be added
1107 to enabled and possibly running (if idx), improving the scaling:
1108
1109 enabled += delta;
1110 if (idx)
1111 running += delta;
1112 quot = count / running;
1113 rem = count % running;
1114 count = quot * enabled + (rem * enabled) / running;
1115
1116 time_zero (since Linux 3.12)
1117
1118 If cap_user_time_zero is set, then the hardware clock (the TSC
1119 timestamp counter on x86) can be calculated from the time_zero,
1120 time_mult, and time_shift values:
1121
1122 time = timestamp - time_zero;
1123 quot = time / time_mult;
1124 rem = time % time_mult;
1125 cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
1126
1127 And vice versa:
1128
1129 quot = cyc >> time_shift;
1130 rem = cyc & (((u64)1 << time_shift) - 1);
1131 timestamp = time_zero + quot * time_mult +
1132 ((rem * time_mult) >> time_shift);
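The two conversions above can be wrapped as a pair of functions. The following is a minimal sketch. struct time_conv and both function names are hypothetical, and the three values are assumed to have been copied out of the mmap page inside the seqlock loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the conversion fields from the mmap page. */
struct time_conv {
    uint64_t time_zero;
    uint32_t time_mult;
    uint16_t time_shift;
};

/* Hardware clock cycles -> perf timestamp. */
static uint64_t
cyc_to_perf_time(uint64_t cyc, const struct time_conv *tc)
{
    uint64_t quot = cyc >> tc->time_shift;
    uint64_t rem = cyc & (((uint64_t) 1 << tc->time_shift) - 1);

    return tc->time_zero + quot * tc->time_mult +
           ((rem * tc->time_mult) >> tc->time_shift);
}

/* Perf timestamp -> hardware clock cycles. */
static uint64_t
perf_time_to_cyc(uint64_t timestamp, const struct time_conv *tc)
{
    uint64_t time, quot, rem;

    assert(tc->time_mult != 0);
    time = timestamp - tc->time_zero;
    quot = time / tc->time_mult;
    rem = time % tc->time_mult;
    return (quot << tc->time_shift) + (rem << tc->time_shift) / tc->time_mult;
}
```

Splitting each value into a quotient and a remainder keeps the intermediate products small. The round trip is exact only when no precision is lost in the shifts and divisions.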
1133
1134 data_head
1135 This points to the head of the data section. The value
1136 continuously increases; it does not wrap. The value needs to be
1137 manually wrapped by the size of the mmap buffer before accessing
1138 the samples.
1139
1140 On SMP-capable platforms, after reading the data_head value,
1141 user space should issue an rmb().
1142
1143 data_tail
1144 When the mapping is PROT_WRITE, the data_tail value should be
1145 written by user space to reflect the last read data. In this
1146 case, the kernel will not overwrite unread data.
1147
1148 data_offset (since Linux 4.1)
1149 Contains the offset of the location in the mmap buffer where
1150 perf sample data begins.
1151
1152 data_size (since Linux 4.1)
1153 Contains the size of the perf sample region within the mmap buf‐
1154 fer.
1155
1156 aux_head, aux_tail, aux_offset, aux_size (since Linux 4.1)
1157 The AUX region allows mapping a separate sample buffer for
1158 high-bandwidth data streams (separate from the main perf sample
1159 buffer). An example of a high-bandwidth stream is instruction
1160 tracing support, as is found in newer Intel processors.
1161
1162 To set up an AUX area, first aux_offset needs to be set with an
1163 offset greater than data_offset+data_size and aux_size needs to
1164 be set to the desired buffer size. The desired offset and size
1165 must be page aligned, and the size must be a power of two.
1166 These values are then passed to mmap in order to map the AUX
1167 buffer. Pages in the AUX buffer are included as part of the
1168 RLIMIT_MEMLOCK resource limit (see setrlimit(2)), and also as
1169 part of the perf_event_mlock_kb allowance.
1170
1171 By default, the AUX buffer will be truncated if it will not fit
1172 in the available space in the ring buffer. If the AUX buffer is
1173 mapped as a read only buffer, then it will operate in ring buf‐
1174 fer mode where old data will be overwritten by new. In over‐
1175 write mode, it might not be possible to infer where the new data
1176 began, and it is the consumer's job to disable measurement while
1177 reading to avoid possible data races.
1178
1179 The aux_head and aux_tail ring buffer pointers have the same
1180 behavior and ordering rules as the previously described data_head
1181 and data_tail.
1182
1183 The following 2^n ring-buffer pages have the layout described below.
1184
1185 If perf_event_attr.sample_id_all is set, then all event types will have
1186 the sample_type selected fields related to where/when (identity) an
1187 event took place (TID, TIME, ID, CPU, STREAM_ID), as described in
1188 PERF_RECORD_SAMPLE below. These fields are stashed just after the
1189 perf_event_header and the fields already present for that record
1190 type, that is, at the end of the payload. This allows a newer
1191 perf.data file to be supported by older perf tools, with the new
1192 optional fields being ignored.
1193
1194 The mmap values start with a header:
1195
1196 struct perf_event_header {
1197 __u32 type;
1198 __u16 misc;
1199 __u16 size;
1200 };
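Since data_head and data_tail increase without wrapping, a consumer reduces each position modulo the data area size and uses the header's size field to step from one record to the next. The following is a minimal sketch of that walk over a buffer already in memory. Both function names are hypothetical. The sketch omits the mmap(2) call, the read barrier after loading data_head, and the final store to data_tail that real code needs.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes starting at logical position pos out of a ring of
   size bytes (a power of two), handling wrap-around. */
static void
ring_copy(void *dst, const uint8_t *ring, uint64_t size, uint64_t pos,
          size_t len)
{
    uint64_t off;
    size_t first;

    assert(size != 0 && (size & (size - 1)) == 0);
    off = pos & (size - 1);
    first = len < size - off ? len : (size_t) (size - off);
    memcpy(dst, ring + off, first);
    memcpy((uint8_t *) dst + first, ring, len - first);
}

/* Step over every complete record between tail and head.  Stores
   the record count in *nrecords and returns the new tail value. */
static uint64_t
walk_records(const uint8_t *ring, uint64_t size, uint64_t head,
             uint64_t tail, unsigned int *nrecords)
{
    struct perf_event_header hdr;

    *nrecords = 0;
    while (head - tail >= sizeof(hdr)) {
        ring_copy(&hdr, ring, size, tail, sizeof(hdr));
        if (hdr.size < sizeof(hdr) || hdr.size > head - tail)
            break;            /* malformed or incomplete record */
        (*nrecords)++;        /* a real consumer would decode hdr.type here */
        tail += hdr.size;
    }
    return tail;
}
```

In real use, ring would point at the start of the data area (one page past the metadata page, or at data_offset on kernels that provide it), and the returned value would be written back to data_tail when the mapping is writable.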
1201
1202 Below, we describe the perf_event_header fields in more detail. For
1203 ease of reading, the fields with shorter descriptions are presented
1204 first.
1205
1206 size This indicates the size of the record.
1207
1208 misc The misc field contains additional information about the sample.
1209
1210 The CPU mode can be determined from this value by masking with
1211 PERF_RECORD_MISC_CPUMODE_MASK and looking for one of the follow‐
1212 ing (note these are not bit masks, only one can be set at a
1213 time):
1214
1215 PERF_RECORD_MISC_CPUMODE_UNKNOWN
1216 Unknown CPU mode.
1217
1218 PERF_RECORD_MISC_KERNEL
1219 Sample happened in the kernel.
1220
1221 PERF_RECORD_MISC_USER
1222 Sample happened in user code.
1223
1224 PERF_RECORD_MISC_HYPERVISOR
1225 Sample happened in the hypervisor.
1226
1227 PERF_RECORD_MISC_GUEST_KERNEL (since Linux 2.6.35)
1228 Sample happened in the guest kernel.
1229
1230 PERF_RECORD_MISC_GUEST_USER (since Linux 2.6.35)
1231 Sample happened in guest user code.
1232
1233 Since the following three statuses are generated by different
1234 record types, they alias to the same bit:
1235
1236 PERF_RECORD_MISC_MMAP_DATA (since Linux 3.10)
1237 This is set when the mapping is not executable; otherwise
1238 the mapping is executable.
1239
1240 PERF_RECORD_MISC_COMM_EXEC (since Linux 3.16)
1241 This is set for a PERF_RECORD_COMM record on kernels more
1242 recent than Linux 3.16 if a process name change was
1243 caused by an exec(2) system call.
1244
1245 PERF_RECORD_MISC_SWITCH_OUT (since Linux 4.3)
1246 When a PERF_RECORD_SWITCH or PERF_RECORD_SWITCH_CPU_WIDE
1247 record is generated, this bit indicates that the context
1248 switch is away from the current process (instead of into
1249 the current process).
1250
1251 In addition, the following bits can be set:
1252
1253 PERF_RECORD_MISC_EXACT_IP
1254 This indicates that the content of PERF_SAMPLE_IP points
1255 to the actual instruction that triggered the event. See
1256 also perf_event_attr.precise_ip.
1257
1258 PERF_RECORD_MISC_EXT_RESERVED (since Linux 2.6.35)
1259 This indicates there is extended data available (cur‐
1260 rently not used).
1261
1262 PERF_RECORD_MISC_PROC_MAP_PARSE_TIMEOUT
1263 This bit is not set by the kernel. It is reserved for
1264 the user-space perf utility to indicate that
1265 /proc/[pid]/maps parsing was taking too long and was
1266 stopped, and thus the mmap records may be truncated.
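Because the CPU mode values are an enumeration inside the mask rather than independent bits, they are compared for equality after masking, while flags such as PERF_RECORD_MISC_EXACT_IP are tested as ordinary bits. The following is a minimal sketch. The function name cpumode_name() and the returned strings are hypothetical.

```c
#include <linux/perf_event.h>

/* Hypothetical helper: describe the CPU mode encoded in misc. */
static const char *
cpumode_name(unsigned int misc)
{
    switch (misc & PERF_RECORD_MISC_CPUMODE_MASK) {
    case PERF_RECORD_MISC_KERNEL:
        return "kernel";
    case PERF_RECORD_MISC_USER:
        return "user";
    case PERF_RECORD_MISC_HYPERVISOR:
        return "hypervisor";
    case PERF_RECORD_MISC_GUEST_KERNEL:
        return "guest kernel";
    case PERF_RECORD_MISC_GUEST_USER:
        return "guest user";
    default:
        return "unknown";     /* PERF_RECORD_MISC_CPUMODE_UNKNOWN */
    }
}
```

The remaining bits can be checked independently of the mode, for example with (misc & PERF_RECORD_MISC_EXACT_IP) != 0.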
1267
1268 type The type value is one of the below. The values in the corre‐
1269 sponding record (that follows the header) depend on the type
1270 selected as shown.
1271
1272 PERF_RECORD_MMAP
1273 The MMAP events record the PROT_EXEC mappings so that we can
1274 correlate user-space IPs to code. They have the following
1275 structure:
1276
1277 struct {
1278 struct perf_event_header header;
1279 u32 pid, tid;
1280 u64 addr;
1281 u64 len;
1282 u64 pgoff;
1283 char filename[];
1284 };
1285
1286 pid is the process ID.
1287
1288 tid is the thread ID.
1289
1290 addr is the address of the allocated memory.
1291 len is the length of the allocated memory.
1292 pgoff is the page offset of the allocated memory.
1293 filename is a string describing the backing of the allocated memory.
1294
1295 PERF_RECORD_LOST
1296 This record indicates when events are lost.
1297
1298 struct {
1299 struct perf_event_header header;
1300 u64 id;
1301 u64 lost;
1302 struct sample_id sample_id;
1303 };
1304
1305 id is the unique event ID for the samples that were
1306 lost.
1307
1308 lost is the number of events that were lost.
1309
1310 PERF_RECORD_COMM
1311 This record indicates a change in the process name.
1312
1313 struct {
1314 struct perf_event_header header;
1315 u32 pid;
1316 u32 tid;
1317 char comm[];
1318 struct sample_id sample_id;
1319 };
1320
1321 pid is the process ID.
1322
1323 tid is the thread ID.
1324
1325 comm is a string containing the new name of the process.
1326
1327 PERF_RECORD_EXIT
1328 This record indicates a process exit event.
1329
1330 struct {
1331 struct perf_event_header header;
1332 u32 pid, ppid;
1333 u32 tid, ptid;
1334 u64 time;
1335 struct sample_id sample_id;
1336 };
1337
1338 PERF_RECORD_THROTTLE, PERF_RECORD_UNTHROTTLE
1339 This record indicates a throttle/unthrottle event.
1340
1341 struct {
1342 struct perf_event_header header;
1343 u64 time;
1344 u64 id;
1345 u64 stream_id;
1346 struct sample_id sample_id;
1347 };
1348
1349 PERF_RECORD_FORK
1350 This record indicates a fork event.
1351
1352 struct {
1353 struct perf_event_header header;
1354 u32 pid, ppid;
1355 u32 tid, ptid;
1356 u64 time;
1357 struct sample_id sample_id;
1358 };
1359
1360 PERF_RECORD_READ
1361 This record indicates a read event.
1362
1363 struct {
1364 struct perf_event_header header;
1365 u32 pid, tid;
1366 struct read_format values;
1367 struct sample_id sample_id;
1368 };
1369
1370 PERF_RECORD_SAMPLE
1371 This record indicates a sample.
1372
1373 struct {
1374 struct perf_event_header header;
1375 u64 sample_id; /* if PERF_SAMPLE_IDENTIFIER */
1376 u64 ip; /* if PERF_SAMPLE_IP */
1377 u32 pid, tid; /* if PERF_SAMPLE_TID */
1378 u64 time; /* if PERF_SAMPLE_TIME */
1379 u64 addr; /* if PERF_SAMPLE_ADDR */
1380 u64 id; /* if PERF_SAMPLE_ID */
1381 u64 stream_id; /* if PERF_SAMPLE_STREAM_ID */
1382 u32 cpu, res; /* if PERF_SAMPLE_CPU */
1383 u64 period; /* if PERF_SAMPLE_PERIOD */
1384 struct read_format v;
1385 /* if PERF_SAMPLE_READ */
1386 u64 nr; /* if PERF_SAMPLE_CALLCHAIN */
1387 u64 ips[nr]; /* if PERF_SAMPLE_CALLCHAIN */
1388 u32 size; /* if PERF_SAMPLE_RAW */
1389 char data[size]; /* if PERF_SAMPLE_RAW */
1390 u64 bnr; /* if PERF_SAMPLE_BRANCH_STACK */
1391 struct perf_branch_entry lbr[bnr];
1392 /* if PERF_SAMPLE_BRANCH_STACK */
1393 u64 abi; /* if PERF_SAMPLE_REGS_USER */
1394 u64 regs[weight(mask)];
1395 /* if PERF_SAMPLE_REGS_USER */
1396 u64 size; /* if PERF_SAMPLE_STACK_USER */
1397 char data[size]; /* if PERF_SAMPLE_STACK_USER */
1398 u64 dyn_size; /* if PERF_SAMPLE_STACK_USER &&
1399 size != 0 */
1400 u64 weight; /* if PERF_SAMPLE_WEIGHT */
1401 u64 data_src; /* if PERF_SAMPLE_DATA_SRC */
1402 u64 transaction; /* if PERF_SAMPLE_TRANSACTION */
1403 u64 abi; /* if PERF_SAMPLE_REGS_INTR */
1404 u64 regs[weight(mask)];
1405 /* if PERF_SAMPLE_REGS_INTR */
1406 };
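Since each field is present only if its sample_type bit was set, a parser has to consume the fields in the order shown above. The following is a minimal sketch covering only the fixed-size fields up to period. struct basic_sample and the function names are hypothetical, and the caller is assumed to have checked that the record is at least as long as the selected fields require.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the leading fixed-size sample fields. */
struct basic_sample {
    uint64_t identifier, ip, time, addr, id, stream_id, period;
    uint32_t pid, tid, cpu;
};

/* Read one 64-bit word and advance the cursor. */
static uint64_t
take_u64(const uint8_t **p)
{
    uint64_t v;

    memcpy(&v, *p, sizeof(v));
    *p += sizeof(v);
    return v;
}

/* Read two adjacent 32-bit values and advance the cursor. */
static void
take_u32_pair(const uint8_t **p, uint32_t *a, uint32_t *b)
{
    memcpy(a, *p, sizeof(*a));
    memcpy(b, *p + sizeof(*a), sizeof(*b));
    *p += sizeof(*a) + sizeof(*b);
}

/* body points just past the perf_event_header.  Returns a pointer
   to the first byte that was not consumed. */
static const uint8_t *
parse_basic_sample(const uint8_t *body, uint64_t sample_type,
                   struct basic_sample *s)
{
    const uint8_t *p = body;
    uint32_t reserved;

    memset(s, 0, sizeof(*s));
    if (sample_type & PERF_SAMPLE_IDENTIFIER) s->identifier = take_u64(&p);
    if (sample_type & PERF_SAMPLE_IP)         s->ip = take_u64(&p);
    if (sample_type & PERF_SAMPLE_TID)        take_u32_pair(&p, &s->pid, &s->tid);
    if (sample_type & PERF_SAMPLE_TIME)       s->time = take_u64(&p);
    if (sample_type & PERF_SAMPLE_ADDR)       s->addr = take_u64(&p);
    if (sample_type & PERF_SAMPLE_ID)         s->id = take_u64(&p);
    if (sample_type & PERF_SAMPLE_STREAM_ID)  s->stream_id = take_u64(&p);
    if (sample_type & PERF_SAMPLE_CPU)        take_u32_pair(&p, &s->cpu, &reserved);
    if (sample_type & PERF_SAMPLE_PERIOD)     s->period = take_u64(&p);
    return p;
}
```

The variable-length fields that follow (read values, callchain, raw data, and so on) would be decoded starting from the returned pointer.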
1407
1408 sample_id
1409 If PERF_SAMPLE_IDENTIFIER is enabled, a 64-bit unique ID
1410 is included. This is a duplication of the PERF_SAM‐
1411 PLE_ID id value, but included at the beginning of the
1412 sample so parsers can easily obtain the value.
1413
1414 ip If PERF_SAMPLE_IP is enabled, then a 64-bit instruction
1415 pointer value is included.
1416
1417 pid, tid
1418 If PERF_SAMPLE_TID is enabled, then a 32-bit process ID
1419 and 32-bit thread ID are included.
1420
1421 time
1422 If PERF_SAMPLE_TIME is enabled, then a 64-bit timestamp
1423 is included. This is obtained via local_clock() which
1424 is a hardware timestamp if available and the jiffies
1425 value if not.
1426
1427 addr
1428 If PERF_SAMPLE_ADDR is enabled, then a 64-bit address is
1429 included. This is usually the address of a tracepoint,
1430 breakpoint, or software event; otherwise the value is 0.
1431
1432 id If PERF_SAMPLE_ID is enabled, a 64-bit unique ID is
1433 included. If the event is a member of an event group,
1434 the group leader ID is returned. This ID is the same as
1435 the one returned by PERF_FORMAT_ID.
1436
1437 stream_id
1438 If PERF_SAMPLE_STREAM_ID is enabled, a 64-bit unique ID
1439 is included. Unlike PERF_SAMPLE_ID the actual ID is
1440 returned, not the group leader. This ID is the same as
1441 the one returned by PERF_FORMAT_ID.
1442
1443 cpu, res
1444 If PERF_SAMPLE_CPU is enabled, this is a 32-bit value
1445 indicating which CPU was being used, in addition to a
1446 reserved (unused) 32-bit value.
1447
1448 period
1449 If PERF_SAMPLE_PERIOD is enabled, a 64-bit value indi‐
1450 cating the current sampling period is written.
1451
1452 v If PERF_SAMPLE_READ is enabled, a structure of type
1453 read_format is included which has values for all events
1454 in the event group. The values included depend on the
1455 read_format value used at perf_event_open() time.
1456
1457 nr, ips[nr]
1458 If PERF_SAMPLE_CALLCHAIN is enabled, then a 64-bit
1459 number is included which indicates how many 64-bit
1460 instruction pointers follow. This is the current
1461 callchain.
1462
1463 size, data[size]
1464 If PERF_SAMPLE_RAW is enabled, then a 32-bit value indi‐
1465 cating size is included followed by an array of 8-bit
1466 values of length size. The values are padded with 0 to
1467 have 64-bit alignment.
1468
1469 This RAW record data is opaque with respect to the ABI.
1470 The ABI doesn't make any promises with respect to the
1471 stability of its content; it may vary depending on
1472 event, hardware, and kernel version.
1473
1474 bnr, lbr[bnr]
1475 If PERF_SAMPLE_BRANCH_STACK is enabled, then a 64-bit
1476 value indicating the number of records is included, fol‐
1477 lowed by bnr perf_branch_entry structures which each
1478 include the fields:
1479
1480 from This indicates the source instruction (may not be
1481 a branch).
1482
1483 to The branch target.
1484
1485 mispred
1486 The branch target was mispredicted.
1487
1488 predicted
1489 The branch target was predicted.
1490
1491 in_tx (since Linux 3.11)
1492 The branch was in a transactional memory transac‐
1493 tion.
1494
1495 abort (since Linux 3.11)
1496 The branch was in an aborted transactional memory
1497 transaction.
1498
1499 cycles (since Linux 4.3)
1500 This reports the number of cycles elapsed since
1501 the previous branch stack update.
1502
1503 The entries are from most to least recent, so the first
1504 entry has the most recent branch.
1505
1506 Support for mispred, predicted, and cycles is optional;
1507 if not supported, those values will be 0.
1508
1509 The type of branches recorded is specified by the
1510 branch_sample_type field.
1511
1512 abi, regs[weight(mask)]
1513 If PERF_SAMPLE_REGS_USER is enabled, then the user CPU
1514 registers are recorded.
1515
1516 The abi field is one of PERF_SAMPLE_REGS_ABI_NONE,
1517 PERF_SAMPLE_REGS_ABI_32 or PERF_SAMPLE_REGS_ABI_64.
1518
1519 The regs field is an array of the CPU registers that
1520 were specified by the sample_regs_user attr field. The
1521 number of values is the number of bits set in the sam‐
1522 ple_regs_user bit mask.
1523
1524 size, data[size], dyn_size
1525 If PERF_SAMPLE_STACK_USER is enabled, then the user
1526 stack is recorded. This can be used to generate stack
1527 backtraces. size is the size requested by the user in
1528 sample_stack_user or else the maximum record size. data
1529 is the stack data (a raw dump of the memory pointed to
1530 by the stack pointer at the time of sampling). dyn_size
1531 is the amount of data actually dumped (can be less than
1532 size). Note that dyn_size is omitted if size is 0.
1533
1534 weight
1535 If PERF_SAMPLE_WEIGHT is enabled, then a 64-bit value
1536 provided by the hardware is recorded that indicates how
1537 costly the event was. This allows expensive events to
1538 stand out more clearly in profiles.
1539
1540 data_src
1541 If PERF_SAMPLE_DATA_SRC is enabled, then a 64-bit value
1542 is recorded that is made up of the following fields:
1543
1544 mem_op
1545 Type of opcode, a bitwise combination of:
1546
1547 PERF_MEM_OP_NA Not available
1548 PERF_MEM_OP_LOAD Load instruction
1549 PERF_MEM_OP_STORE Store instruction
1550 PERF_MEM_OP_PFETCH Prefetch
1551 PERF_MEM_OP_EXEC Executable code
1552
1553 mem_lvl
1554 Memory hierarchy level hit or miss, a bitwise combi‐
1555 nation of the following, shifted left by
1556 PERF_MEM_LVL_SHIFT:
1557
1558 PERF_MEM_LVL_NA Not available
1559 PERF_MEM_LVL_HIT Hit
1560 PERF_MEM_LVL_MISS Miss
1561 PERF_MEM_LVL_L1 Level 1 cache
1562 PERF_MEM_LVL_LFB Line fill buffer
1563 PERF_MEM_LVL_L2 Level 2 cache
1564 PERF_MEM_LVL_L3 Level 3 cache
1565 PERF_MEM_LVL_LOC_RAM Local DRAM
1566 PERF_MEM_LVL_REM_RAM1 Remote DRAM 1 hop
1567 PERF_MEM_LVL_REM_RAM2 Remote DRAM 2 hops
1568 PERF_MEM_LVL_REM_CCE1 Remote cache 1 hop
1569 PERF_MEM_LVL_REM_CCE2 Remote cache 2 hops
1570 PERF_MEM_LVL_IO I/O memory
1571 PERF_MEM_LVL_UNC Uncached memory
1572
1573 mem_snoop
1574 Snoop mode, a bitwise combination of the following,
1575 shifted left by PERF_MEM_SNOOP_SHIFT:
1576
1577 PERF_MEM_SNOOP_NA Not available
1578 PERF_MEM_SNOOP_NONE No snoop
1579 PERF_MEM_SNOOP_HIT Snoop hit
1580 PERF_MEM_SNOOP_MISS Snoop miss
1581 PERF_MEM_SNOOP_HITM Snoop hit modified
1582
1583 mem_lock
1584 Lock instruction, a bitwise combination of the fol‐
1585 lowing, shifted left by PERF_MEM_LOCK_SHIFT:
1586
1587 PERF_MEM_LOCK_NA Not available
1588 PERF_MEM_LOCK_LOCKED Locked transaction
1589
1590 mem_dtlb
1591 TLB access hit or miss, a bitwise combination of the
1592 following, shifted left by PERF_MEM_TLB_SHIFT:
1593
1594 PERF_MEM_TLB_NA Not available
1595 PERF_MEM_TLB_HIT Hit
1596 PERF_MEM_TLB_MISS Miss
1597 PERF_MEM_TLB_L1 Level 1 TLB
1598 PERF_MEM_TLB_L2 Level 2 TLB
1599 PERF_MEM_TLB_WK Hardware walker
1600 PERF_MEM_TLB_OS OS fault handler
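The subfields of data_src can be picked apart through union perf_mem_data_src from <linux/perf_event.h>, which overlays bit-fields named after the entries above on the 64-bit value. The following is a minimal sketch, assuming headers recent enough to define that union. The function name is_l1_load_hit() is hypothetical.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Hypothetical helper: true if the sample describes a load that
   hit in the level 1 cache. */
static int
is_l1_load_hit(uint64_t data_src)
{
    union perf_mem_data_src d;

    d.val = data_src;
    return (d.mem_op & PERF_MEM_OP_LOAD) &&
           (d.mem_lvl & PERF_MEM_LVL_L1) &&
           (d.mem_lvl & PERF_MEM_LVL_HIT);
}
```

Going through the union avoids hard-coding field widths. The same test can be written with explicit shifts such as data_src >> PERF_MEM_LVL_SHIFT, provided the result is masked to the width of the field.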
1601
1602 transaction
1603 If the PERF_SAMPLE_TRANSACTION flag is set, then a
1604 64-bit field is recorded describing the sources of any
1605 transactional memory aborts.
1606
1607 The field is a bitwise combination of the following val‐
1608 ues:
1609
1610 PERF_TXN_ELISION
1611 Abort from an elision type transaction (Intel-
1612 CPU-specific).
1613
1614 PERF_TXN_TRANSACTION
1615 Abort from a generic transaction.
1616
1617 PERF_TXN_SYNC
1618 Synchronous abort (related to the reported
1619 instruction).
1620
1621 PERF_TXN_ASYNC
1622 Asynchronous abort (not related to the reported
1623 instruction).
1624
1625 PERF_TXN_RETRY
1626 Retryable abort (retrying the transaction may
1627 have succeeded).
1628
1629 PERF_TXN_CONFLICT
1630 Abort due to memory conflicts with other threads.
1631
1632 PERF_TXN_CAPACITY_WRITE
1633 Abort due to write capacity overflow.
1634
1635 PERF_TXN_CAPACITY_READ
1636 Abort due to read capacity overflow.
1637
1638 In addition, a user-specified abort code can be obtained
1639 from the high 32 bits of the field by masking with
1640 PERF_TXN_ABORT_MASK and then shifting right by
1641 PERF_TXN_ABORT_SHIFT.
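The abort code and the flag bits share one 64-bit word, so they are separated with a mask and a shift. The following is a minimal sketch. The function name txn_abort_code() is hypothetical. It relies on PERF_TXN_ABORT_MASK covering the high 32 bits in place, as the kernel header defines it, which is why the mask is applied before the shift.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Hypothetical helper: extract the user-specified abort code from
   the high 32 bits of a PERF_SAMPLE_TRANSACTION value. */
static uint32_t
txn_abort_code(uint64_t txn)
{
    return (uint32_t) ((txn & PERF_TXN_ABORT_MASK) >> PERF_TXN_ABORT_SHIFT);
}
```

The low bits are tested as ordinary flags, for example (txn & PERF_TXN_CONFLICT) != 0.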
1642
1643 abi, regs[weight(mask)]
1644 If PERF_SAMPLE_REGS_INTR is enabled, then the CPU registers
1645 at the time of the interrupt are recorded.
1646
1647 The abi field is one of PERF_SAMPLE_REGS_ABI_NONE,
1648 PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
1649
1650 The regs field is an array of the CPU registers that
1651 were specified by the sample_regs_intr attr field. The
1652 number of values is the number of bits set in the sam‐
1653 ple_regs_intr bit mask.
1654
1655 PERF_RECORD_MMAP2
1656 This record includes extended information on mmap(2) calls
1657 returning executable mappings. The format is similar to
1658 that of the PERF_RECORD_MMAP record, but includes extra val‐
1659 ues that allow uniquely identifying shared mappings.
1660
1661 struct {
1662 struct perf_event_header header;
1663 u32 pid;
1664 u32 tid;
1665 u64 addr;
1666 u64 len;
1667 u64 pgoff;
1668 u32 maj;
1669 u32 min;
1670 u64 ino;
1671 u64 ino_generation;
1672 u32 prot;
1673 u32 flags;
1674 char filename[];
1675 struct sample_id sample_id;
1676 };
1677
1678 pid is the process ID.
1679
1680 tid is the thread ID.
1681
1682 addr is the address of the allocated memory.
1683
1684 len is the length of the allocated memory.
1685
1686 pgoff is the page offset of the allocated memory.
1687
1688 maj is the major ID of the underlying device.
1689
1690 min is the minor ID of the underlying device.
1691
1692 ino is the inode number.
1693
1694 ino_generation
1695 is the inode generation.
1696
1697 prot is the protection information.
1698
1699 flags is the flags information.
1700
1701 filename
1702 is a string describing the backing of the allocated
1703 memory.
1704
1705 PERF_RECORD_AUX (since Linux 4.1)
1706
1707 This record reports that new data is available in the sepa‐
1708 rate AUX buffer region.
1709
1710 struct {
1711 struct perf_event_header header;
1712 u64 aux_offset;
1713 u64 aux_size;
1714 u64 flags;
1715 struct sample_id sample_id;
1716 };
1717
1718 aux_offset
1719 offset in the AUX mmap region where the new data
1720 begins.
1721
1722 aux_size
1723 size of the data made available.
1724
1725 flags describes the AUX update.
1726
1727 PERF_AUX_FLAG_TRUNCATED
1728 if set, then the data returned was truncated
1729 to fit the available buffer size.
1730
1731 PERF_AUX_FLAG_OVERWRITE
1732 if set, then the data returned has overwritten
1733 previous data.
1734
1735 PERF_RECORD_ITRACE_START (since Linux 4.1)
1736
1737 This record indicates which process has initiated an
1738 instruction trace event, allowing tools to properly corre‐
1739 late the instruction addresses in the AUX buffer with the
1740 proper executable.
1741
1742 struct {
1743 struct perf_event_header header;
1744 u32 pid;
1745 u32 tid;
1746 };
1747
1748 pid process ID of the thread starting an instruction
1749 trace.
1750
1751 tid thread ID of the thread starting an instruction
1752 trace.
1753
1754 PERF_RECORD_LOST_SAMPLES (since Linux 4.2)
1755
1756 When using hardware sampling (such as Intel PEBS) this
1757 record indicates some number of samples that may have been
1758 lost.
1759
1760 struct {
1761 struct perf_event_header header;
1762 u64 lost;
1763 struct sample_id sample_id;
1764 };
1765
1766 lost the number of potentially lost samples.
1767
1768 PERF_RECORD_SWITCH (since Linux 4.3)
1769
1770 This record indicates a context switch has happened. The
1771 PERF_RECORD_MISC_SWITCH_OUT bit in the misc field indicates
1772 whether it was a context switch into or away from the cur‐
1773 rent process.
1774
1775 struct {
1776 struct perf_event_header header;
1777 struct sample_id sample_id;
1778 };
1779
1780 PERF_RECORD_SWITCH_CPU_WIDE (since Linux 4.3)
1781
1782 As with PERF_RECORD_SWITCH this record indicates a context
1783 switch has happened, but it only occurs when sampling in
1784 CPU-wide mode and provides additional information on the
1785 process being switched to/from. The
1786 PERF_RECORD_MISC_SWITCH_OUT bit in the misc field indicates
1787 whether it was a context switch into or away from the cur‐
1788 rent process.
1789
1790 struct {
1791 struct perf_event_header header;
1792 u32 next_prev_pid;
1793 u32 next_prev_tid;
1794 struct sample_id sample_id;
1795 };
1796
1797 next_prev_pid
1798 The process ID of the previous (if switching in) or
1799 next (if switching out) process on the CPU.
1800
1801 next_prev_tid
1802 The thread ID of the previous (if switching in) or
1803 next (if switching out) thread on the CPU.
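
The two context-switch records above can be told apart by the type field of the header, and the direction by the PERF_RECORD_MISC_SWITCH_OUT bit in misc. The following is a minimal sketch, not a definitive decoder: describe_switch() and struct switch_cpu_wide are hypothetical names introduced here for illustration, and the sketch assumes that the record has already been copied out of the mmap ring buffer into contiguous memory.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fixed part of PERF_RECORD_SWITCH_CPU_WIDE, as laid out above.
 * struct sample_id follows it only if sample_id_all was set. */
struct switch_cpu_wide {
    struct perf_event_header header;
    uint32_t next_prev_pid;
    uint32_t next_prev_tid;
};

/* describe_switch() is a hypothetical helper, not a kernel API.
 * It formats a one-line description of a context-switch record.
 * Returns 1 for a switch out, 0 for a switch in, and -1 if the
 * record is not a complete context-switch record. */
static int
describe_switch(const struct perf_event_header *hdr, char *buf, size_t len)
{
    int out = (hdr->misc & PERF_RECORD_MISC_SWITCH_OUT) != 0;

    if (hdr->type == PERF_RECORD_SWITCH) {
        snprintf(buf, len, "switch %s", out ? "out" : "in");
        return out;
    }

    if (hdr->type == PERF_RECORD_SWITCH_CPU_WIDE
            && hdr->size >= sizeof(struct switch_cpu_wide)) {
        struct switch_cpu_wide rec;

        /* Copy to avoid assuming anything about alignment. */
        memcpy(&rec, hdr, sizeof(rec));
        snprintf(buf, len, "switch %s, %s pid=%u tid=%u",
                 out ? "out" : "in", out ? "next" : "previous",
                 (unsigned) rec.next_prev_pid,
                 (unsigned) rec.next_prev_tid);
        return out;
    }

    return -1;
}
```

A caller would typically invoke this for each record whose header type is one of the two switch types while walking the ring buffer.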
1804
1805 Overflow handling
Events can be set to notify when a threshold is crossed, indicating an
overflow. Overflow conditions can be captured by monitoring the event
file descriptor with poll(2), select(2), or epoll(7). Alternatively,
the overflow events can be captured via a signal handler, by enabling
I/O signaling on the file descriptor; see the discussion of the
F_SETOWN and F_SETSIG operations in fcntl(2).
1812
1813 Overflows are generated only by sampling events (sample_period must
1814 have a nonzero value).
1815
1816 There are two ways to generate overflow notifications.
1817
1818 The first is to set a wakeup_events or wakeup_watermark value that will
1819 trigger if a certain number of samples or bytes have been written to
1820 the mmap ring buffer. In this case, POLL_IN is indicated.
1821
1822 The other way is by use of the PERF_EVENT_IOC_REFRESH ioctl. This
1823 ioctl adds to a counter that decrements each time the event overflows.
1824 When nonzero, POLL_IN is indicated, but once the counter reaches 0
1825 POLL_HUP is indicated and the underlying event is disabled.
1826
1827 Refreshing an event group leader refreshes all siblings and refreshing
1828 with a parameter of 0 currently enables infinite refreshes; these
1829 behaviors are unsupported and should not be relied on.
1830
1831 Starting with Linux 3.18, POLL_HUP is indicated if the event being mon‐
1832 itored is attached to a different process and that process exits.
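
As a rough illustration of the poll(2) approach described above, the sketch below waits on an event file descriptor and classifies the result. With poll(2) the two conditions appear in revents as POLLIN and POLLHUP. The helper name wait_for_overflow() and the enum values are assumptions made for this example; the file descriptor is assumed to come from perf_event_open() with a sampling event and an mmap ring buffer already set up.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
enum overflow_state {
    OVF_ERROR   = -1,   /* poll(2) failed or reported an error */
    OVF_TIMEOUT = 0,    /* nothing happened within timeout_ms */
    OVF_READY   = 1,    /* POLLIN: new data is available */
    OVF_HANGUP  = 2     /* POLLHUP: the event has been disabled
                           or the monitored process exited */
};

/* Wait up to timeout_ms milliseconds for an overflow notification
 * on fd.  A negative timeout blocks indefinitely. */
static int
wait_for_overflow(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return OVF_ERROR;
    if (n == 0)
        return OVF_TIMEOUT;
    if (pfd.revents & POLLIN)
        return OVF_READY;
    if (pfd.revents & POLLHUP)
        return OVF_HANGUP;
    return OVF_ERROR;
}
```

On OVF_HANGUP after a PERF_EVENT_IOC_REFRESH countdown, a tool would normally either stop or issue another refresh to rearm the event.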
1833
1834 rdpmc instruction
1835 Starting with Linux 3.4 on x86, you can use the rdpmc instruction to
1836 get low-latency reads without having to enter the kernel. Note that
1837 using rdpmc is not necessarily faster than other methods for reading
1838 event values.
1839
Support for this can be detected with the cap_user_rdpmc field in the
mmap page (named cap_usr_rdpmc before Linux 3.12; see BUGS);
documentation on how to calculate event values can be found in that
section.
1843
1844 Originally, when rdpmc support was enabled, any process (not just ones
1845 with an active perf event) could use the rdpmc instruction to access
1846 the counters. Starting with Linux 4.0, rdpmc support is only allowed
1847 if an event is currently enabled in a process's context. To restore
1848 the old behavior, write the value 2 to /sys/devices/cpu/rdpmc.
1849
1850 perf_event ioctl calls
1851 Various ioctls act on perf_event_open() file descriptors:
1852
1853 PERF_EVENT_IOC_ENABLE
1854 This enables the individual event or event group specified by
1855 the file descriptor argument.
1856
1857 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
1858 then all events in a group are enabled, even if the event speci‐
1859 fied is not the group leader (but see BUGS).
1860
1861 PERF_EVENT_IOC_DISABLE
1862 This disables the individual counter or event group specified by
1863 the file descriptor argument.
1864
1865 Enabling or disabling the leader of a group enables or disables
1866 the entire group; that is, while the group leader is disabled,
1867 none of the counters in the group will count. Enabling or dis‐
1868 abling a member of a group other than the leader affects only
1869 that counter; disabling a non-leader stops that counter from
1870 counting but doesn't affect any other counter.
1871
1872 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
1873 then all events in a group are disabled, even if the event spec‐
1874 ified is not the group leader (but see BUGS).
1875
1876 PERF_EVENT_IOC_REFRESH
1877 Non-inherited overflow counters can use this to enable a counter
1878 for a number of overflows specified by the argument, after which
1879 it is disabled. Subsequent calls of this ioctl add the argument
1880 value to the current count. An overflow notification with
1881 POLL_IN set will happen on each overflow until the count reaches
1882 0; when that happens a notification with POLL_HUP set is sent
1883 and the event is disabled. Using an argument of 0 is considered
1884 undefined behavior.
1885
1886 PERF_EVENT_IOC_RESET
1887 Reset the event count specified by the file descriptor argument
1888 to zero. This resets only the counts; there is no way to reset
1889 the multiplexing time_enabled or time_running values.
1890
1891 If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
1892 then all events in a group are reset, even if the event speci‐
1893 fied is not the group leader (but see BUGS).
1894
1895 PERF_EVENT_IOC_PERIOD
1896 This updates the overflow period for the event.
1897
1898 Since Linux 3.7 (on ARM) and Linux 3.14 (all other architec‐
1899 tures), the new period takes effect immediately. On older ker‐
1900 nels, the new period did not take effect until after the next
1901 overflow.
1902
1903 The argument is a pointer to a 64-bit value containing the
1904 desired new period.
1905
1906 Prior to Linux 2.6.36, this ioctl always failed due to a bug in
1907 the kernel.
1908
1909 PERF_EVENT_IOC_SET_OUTPUT
1910 This tells the kernel to report event notifications to the spec‐
1911 ified file descriptor rather than the default one. The file
1912 descriptors must all be on the same CPU.
1913
1914 The argument specifies the desired file descriptor, or -1 if
1915 output should be ignored.
1916
1917 PERF_EVENT_IOC_SET_FILTER (since Linux 2.6.33)
1918 This adds an ftrace filter to this event.
1919
1920 The argument is a pointer to the desired ftrace filter.
1921
1922 PERF_EVENT_IOC_ID (since Linux 3.12)
1923 This returns the event ID value for the given event file
1924 descriptor.
1925
1926 The argument is a pointer to a 64-bit unsigned integer to hold
1927 the result.
1928
1929 PERF_EVENT_IOC_SET_BPF (since Linux 4.1)
1930 This allows attaching a Berkeley Packet Filter (BPF) program to
1931 an existing kprobe tracepoint event. You need CAP_SYS_ADMIN
1932 privileges to use this ioctl.
1933
1934 The argument is a BPF program file descriptor that was created
1935 by a previous bpf(2) system call.
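
The calling conventions of a few of these ioctls differ (a flag value for some, a pointer to a 64-bit value for others), which the following sketch tries to make concrete. The helper names group_restart(), set_period(), and get_event_id() are hypothetical; fd is assumed to be a descriptor returned by perf_event_open().

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Reset and then enable every event in the group that fd belongs
 * to.  PERF_IOC_FLAG_GROUP is passed as the ioctl argument. */
static int
group_restart(int fd)
{
    if (ioctl(fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1)
        return -1;
    return ioctl(fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
}

/* Change the overflow period.  The argument is a pointer to a
 * 64-bit value holding the new period. */
static int
set_period(int fd, uint64_t period)
{
    return ioctl(fd, PERF_EVENT_IOC_PERIOD, &period);
}

/* Fetch the event ID (Linux 3.12 and later).  The argument is a
 * pointer to a 64-bit unsigned integer that receives the result. */
static int
get_event_id(int fd, uint64_t *id)
{
    return ioctl(fd, PERF_EVENT_IOC_ID, id);
}
```

Each helper returns -1 with errno set on failure, following the usual ioctl(2) convention.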
1936
1937 Using prctl(2)
1938 A process can enable or disable all the event groups that are attached
1939 to it using the prctl(2) PR_TASK_PERF_EVENTS_ENABLE and
1940 PR_TASK_PERF_EVENTS_DISABLE operations. This applies to all counters
1941 on the calling process, whether created by this process or by another,
1942 and does not affect any counters that this process has created on other
1943 processes. It enables or disables only the group leaders, not any
1944 other members in the groups.
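
A minimal sketch of the prctl(2) interface follows. The wrapper name set_own_events_enabled() is an assumption for illustration; the two operations themselves are the ones named above.

```c
#include <sys/prctl.h>

/* Enable (nonzero) or disable (zero) all event group leaders
 * attached to the calling process.  Returns 0 on success, or -1
 * with errno set on failure. */
static int
set_own_events_enabled(int enable)
{
    return prctl(enable ? PR_TASK_PERF_EVENTS_ENABLE
                        : PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0);
}
```

One possible use is to bracket a region that should not be measured: call set_own_events_enabled(0) before it and set_own_events_enabled(1) after it.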
1945
1946 perf_event related configuration files
1947 Files in /proc/sys/kernel/
1948
1949 /proc/sys/kernel/perf_event_paranoid
1950 The perf_event_paranoid file can be set to restrict access
1951 to the performance counters.
1952
1953 2 allow only user-space measurements (default since Linux
1954 4.6).
1955 1 allow both kernel and user measurements (default before
1956 Linux 4.6).
1957 0 allow access to CPU-specific data but not raw tracepoint
1958 samples.
1959 -1 no restrictions.
1960
1961 The existence of the perf_event_paranoid file is the offi‐
1962 cial method for determining if a kernel supports
1963 perf_event_open().
1964
1965 /proc/sys/kernel/perf_event_max_sample_rate
1966 This sets the maximum sample rate. Setting this too high
1967 can allow users to sample at a rate that impacts overall
1968 machine performance and potentially lock up the machine.
1969 The default value is 100000 (samples per second).
1970
1971 /proc/sys/kernel/perf_event_max_stack
1972 This file sets the maximum depth of stack frame entries
1973 reported when generating a call trace.
1974
1975 /proc/sys/kernel/perf_event_mlock_kb
Maximum amount of memory, in kilobytes, that an unprivileged
user can mlock(2). The default is 516 (kB).
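
Each of the files above holds a single integer, so a tool can read them with a few lines of code. The sketch below is one way to do so; read_long_file() and paranoid_allows_kernel() are hypothetical helpers, and the interpretation of the paranoid level follows the table given above.

```c
#include <stdio.h>

/* Read one integer from a file such as
 * /proc/sys/kernel/perf_event_paranoid.  Returns 0 on success and
 * -1 if the file cannot be opened or does not start with a number.
 * A missing perf_event_paranoid file suggests that the kernel does
 * not support perf_event_open(). */
static int
read_long_file(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Per the table above, kernel measurements by unprivileged users
 * are allowed at levels 1, 0, and -1, but not at level 2. */
static int
paranoid_allows_kernel(long level)
{
    return level <= 1;
}
```

If the level does not allow kernel measurements, an unprivileged caller would normally set exclude_kernel in perf_event_attr before opening the event.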
1978
1979 Files in /sys/bus/event_source/devices/
1980
1981 Since Linux 2.6.34, the kernel supports having multiple PMUs avail‐
1982 able for monitoring. Information on how to program these PMUs can
1983 be found under /sys/bus/event_source/devices/. Each subdirectory
1984 corresponds to a different PMU.
1985
1986 /sys/bus/event_source/devices/*/type (since Linux 2.6.38)
1987 This contains an integer that can be used in the type field
1988 of perf_event_attr to indicate that you wish to use this
1989 PMU.
1990
1991 /sys/bus/event_source/devices/cpu/rdpmc (since Linux 3.4)
1992 If this file is 1, then direct user-space access to the per‐
1993 formance counter registers is allowed via the rdpmc instruc‐
1994 tion. This can be disabled by echoing 0 to the file.
1995
1996 As of Linux 4.0 the behavior has changed, so that 1 now
1997 means only allow access to processes with active perf
1998 events, with 2 indicating the old allow-anyone-access behav‐
1999 ior.
2000
2001 /sys/bus/event_source/devices/*/format/ (since Linux 3.4)
2002 This subdirectory contains information on the architecture-
2003 specific subfields available for programming the various
2004 config fields in the perf_event_attr struct.
2005
The content of each file is the name of the config field,
followed by a colon, followed by a series of integer bit
ranges separated by commas. For example, the file event may
contain the value config1:1,6-10,44 which indicates that
event is an attribute that occupies bits 1, 6-10, and 44 of
perf_event_attr::config1.
2012
2013 /sys/bus/event_source/devices/*/events/ (since Linux 3.4)
2014 This subdirectory contains files with predefined events.
2015 The contents are strings describing the event settings
2016 expressed in terms of the fields found in the previously
2017 mentioned ./format/ directory. These are not necessarily
2018 complete lists of all events supported by a PMU, but usually
2019 a subset of events deemed useful or interesting.
2020
2021 The content of each file is a list of attribute names sepa‐
2022 rated by commas. Each entry has an optional value (either
2023 hex or decimal). If no value is specified, then it is
2024 assumed to be a single-bit field with a value of 1. An
2025 example entry may look like this: event=0x2,inv,ldlat=3.
2026
2027 /sys/bus/event_source/devices/*/uevent
2028 This file is the standard kernel device interface for
2029 injecting hotplug events.
2030
2031 /sys/bus/event_source/devices/*/cpumask (since Linux 3.7)
2032 The cpumask file contains a comma-separated list of integers
2033 that indicate a representative CPU number for each socket
2034 (package) on the motherboard. This is needed when setting
2035 up uncore or northbridge events, as those PMUs present
2036 socket-wide events.
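
The format/ and events/ file syntaxes described above are simple enough to parse by hand. The following sketch shows one possible approach; parse_format() and next_term() are hypothetical helpers written for this example, and they handle only the syntax described in this section.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse a format/ entry such as "config1:1,6-10,44" into the
 * config field name and a mask of the bits it occupies.
 * Returns 0 on success, -1 on malformed input. */
static int
parse_format(const char *spec, char *field, size_t field_len,
             uint64_t *mask)
{
    const char *p = strchr(spec, ':');
    size_t n;

    if (p == NULL)
        return -1;
    n = (size_t) (p - spec);
    if (n == 0 || n >= field_len)
        return -1;
    memcpy(field, spec, n);
    field[n] = '\0';

    *mask = 0;
    p++;
    while (*p != '\0' && *p != '\n') {
        char *end;
        unsigned long lo = strtoul(p, &end, 10);
        unsigned long hi = lo;

        if (end == p)
            return -1;
        p = end;
        if (*p == '-') {                /* a range such as 6-10 */
            hi = strtoul(p + 1, &end, 10);
            if (end == p + 1)
                return -1;
            p = end;
        }
        if (lo > hi || hi > 63)
            return -1;
        for (unsigned long b = lo; b <= hi; b++)
            *mask |= (uint64_t) 1 << b;
        if (*p == ',')
            p++;
    }
    return 0;
}

/* Split the next "name" or "name=value" term off an events/ entry
 * such as "event=0x2,inv,ldlat=3".  A term without a value is
 * treated as 1.  Returns a pointer just past the term, or NULL
 * when no terms remain. */
static const char *
next_term(const char *s, char *name, size_t name_len, uint64_t *value)
{
    size_t n = strcspn(s, "=,\n");

    if (n == 0 || n >= name_len)
        return NULL;
    memcpy(name, s, n);
    name[n] = '\0';
    s += n;

    *value = 1;
    if (*s == '=') {
        char *end;
        int base = (strncmp(s + 1, "0x", 2) == 0) ? 16 : 10;

        *value = strtoull(s + 1, &end, base);
        s = end;
    }
    if (*s == ',')
        s++;
    return s;
}
```

A tool could combine the two by looking up each term name in format/, then shifting the term value into the bits given by the mask of the corresponding config field.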
2037
2039 perf_event_open() returns the new file descriptor, or -1 if an error
2040 occurred (in which case, errno is set appropriately).
2041
2043 The errors returned by perf_event_open() can be inconsistent, and may
2044 vary across processor architectures and performance monitoring units.
2045
2046 E2BIG Returned if the perf_event_attr size value is too small (smaller
2047 than PERF_ATTR_SIZE_VER0), too big (larger than the page size),
2048 or larger than the kernel supports and the extra bytes are not
2049 zero. When E2BIG is returned, the perf_event_attr size field is
2050 overwritten by the kernel to be the size of the structure it was
2051 expecting.
2052
2053 EACCES Returned when the requested event requires CAP_SYS_ADMIN permis‐
2054 sions (or a more permissive perf_event paranoid setting). Some
2055 common cases where an unprivileged process may encounter this
2056 error: attaching to a process owned by a different user; moni‐
2057 toring all processes on a given CPU (i.e., specifying the pid
2058 argument as -1); and not setting exclude_kernel when the para‐
2059 noid setting requires it.
2060
2061 EBADF Returned if the group_fd file descriptor is not valid, or, if
2062 PERF_FLAG_PID_CGROUP is set, the cgroup file descriptor in pid
2063 is not valid.
2064
2065 EBUSY (since Linux 4.1)
2066 Returned if another event already has exclusive access to the
2067 PMU.
2068
2069 EFAULT Returned if the attr pointer points at an invalid memory
2070 address.
2071
EINVAL Returned if the specified event is invalid. There are many
possible reasons for this. A non-exhaustive list: sample_freq is
higher than the maximum setting; the cpu to monitor does not
exist; read_format is out of range; sample_type is out of range;
the flags value is out of range; exclusive or pinned set and the
event is not a group leader; the event config values are out of
range or set reserved bits; the generic event selected is not
supported; or there is not enough room to add the selected
event.
2081
2082 EMFILE Each opened event uses one file descriptor. If a large number
2083 of events are opened, the per-process limit on the number of
2084 open file descriptors will be reached, and no more events can be
2085 created.
2086
2087 ENODEV Returned when the event involves a feature not supported by the
2088 current CPU.
2089
2090 ENOENT Returned if the type setting is not valid. This error is also
2091 returned for some unsupported generic events.
2092
2093 ENOSPC Prior to Linux 3.3, if there was not enough room for the event,
2094 ENOSPC was returned. In Linux 3.3, this was changed to EINVAL.
2095 ENOSPC is still returned if you try to add more breakpoint
2096 events than supported by the hardware.
2097
2098 ENOSYS Returned if PERF_SAMPLE_STACK_USER is set in sample_type and it
2099 is not supported by hardware.
2100
2101 EOPNOTSUPP
2102 Returned if an event requiring a specific hardware feature is
2103 requested but there is no hardware support. This includes
2104 requesting low-skid events if not supported, branch tracing if
2105 it is not available, sampling if no PMU interrupt is available,
2106 and branch stacks for software events.
2107
2108 EOVERFLOW (since Linux 4.8)
2109 Returned if PERF_SAMPLE_CALLCHAIN is requested and sam‐
2110 ple_max_stack is larger than the maximum specified in
2111 /proc/sys/kernel/perf_event_max_stack.
2112
2113 EPERM Returned on many (but not all) architectures when an unsupported
2114 exclude_hv, exclude_idle, exclude_user, or exclude_kernel set‐
2115 ting is specified.
2116
2117 It can also happen, as with EACCES, when the requested event
2118 requires CAP_SYS_ADMIN permissions (or a more permissive
2119 perf_event paranoid setting). This includes setting a break‐
2120 point on a kernel address, and (since Linux 3.13) setting a ker‐
2121 nel function-trace tracepoint.
2122
2123 ESRCH Returned if attempting to attach to a process that does not
2124 exist.
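
Because several of the errors above have more than one possible cause, a tool may want to print a hint alongside the errno value. The sketch below shows one way to do that; perf_open_hint() is a hypothetical helper, and its messages only summarize the descriptions in this section.

```c
#include <errno.h>
#include <string.h>

/* Map an errno value from a failed perf_event_open() call to a
 * short hint.  Falls back to strerror(3) for anything else. */
static const char *
perf_open_hint(int err)
{
    switch (err) {
    case E2BIG:
        return "attr.size not accepted; the kernel has stored the "
               "size it expects in attr.size";
    case EACCES:
        return "insufficient privilege; check perf_event_paranoid "
               "or set exclude_kernel";
    case EPERM:
        return "insufficient privilege or unsupported exclude_* "
               "setting";
    case EBUSY:
        return "another event has exclusive access to the PMU";
    case EMFILE:
        return "per-process file descriptor limit reached";
    case ENOENT:
        return "invalid type or unsupported generic event";
    case ENODEV:
    case EOPNOTSUPP:
        return "feature not supported by this CPU or PMU";
    case ESRCH:
        return "target process does not exist";
    default:
        return strerror(err);
    }
}
```

A caller would save errno immediately after the failed call and pass the saved value, since later library calls may overwrite it.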
2125
2127 perf_event_open() was introduced in Linux 2.6.31 but was called
2128 perf_counter_open(). It was renamed in Linux 2.6.32.
2129
This perf_event_open() system call is Linux-specific and should not be
used in programs intended to be portable.
2133
2135 Glibc does not provide a wrapper for this system call; call it using
2136 syscall(2). See the example below.
2137
2138 The official way of knowing if perf_event_open() support is enabled is
2139 checking for the existence of the file /proc/sys/ker‐
2140 nel/perf_event_paranoid.
2141
2143 The F_SETOWN_EX option to fcntl(2) is needed to properly get overflow
2144 signals in threads. This was introduced in Linux 2.6.32.
2145
Prior to Linux 2.6.33 (at least for x86), the kernel did not check if
events could be scheduled together until read time. The same happens
on all known kernels if the NMI watchdog is enabled. This means that,
to see whether a given set of events works, you have to call
perf_event_open(), start the events, and then read them before you
know for sure that you can get valid measurements.
2151
2152 Prior to Linux 2.6.34, event constraints were not enforced by the ker‐
2153 nel. In that case, some events would silently return "0" if the kernel
2154 scheduled them in an improper counter slot.
2155
2156 Prior to Linux 2.6.34, there was a bug when multiplexing where the
2157 wrong results could be returned.
2158
Kernels from Linux 2.6.35 to Linux 2.6.39 can crash quickly if
"inherit" is enabled and many threads are started.
2161
2162 Prior to Linux 2.6.35, PERF_FORMAT_GROUP did not work with attached
2163 processes.
2164
There is a bug in the kernel code between Linux 2.6.36 and Linux 3.0
that ignores the "watermark" field and acts as if wakeup_events was
chosen if the union has a nonzero value in it.
2168
2169 From Linux 2.6.31 to Linux 3.4, the PERF_IOC_FLAG_GROUP ioctl argument
2170 was broken and would repeatedly operate on the event specified rather
2171 than iterating across all sibling events in a group.
2172
2173 From Linux 3.4 to Linux 3.11, the mmap cap_usr_rdpmc and cap_usr_time
2174 bits mapped to the same location. Code should migrate to the new
2175 cap_user_rdpmc and cap_user_time fields instead.
2176
2177 Always double-check your results! Various generalized events have had
2178 wrong values. For example, retired branches measured the wrong thing
2179 on AMD machines until Linux 2.6.35.
2180
2182 The following is a short example that measures the total instruction
2183 count of a call to printf(3).
2184
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    /* There is no glibc wrapper, so invoke the system call directly. */
    return syscall(__NR_perf_event_open, hw_event, pid, cpu,
                   group_fd, flags);
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;            /* start disabled; enabled below */
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;

    /* Measure the calling thread (pid 0) on any CPU (cpu -1). */
    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n",
                (unsigned long long) pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    printf("Measuring instruction count for this printf\n");

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &count, sizeof(long long)) != sizeof(long long)) {
        fprintf(stderr, "Error reading count\n");
        exit(EXIT_FAILURE);
    }

    printf("Used %lld instructions\n", count);

    close(fd);
    exit(EXIT_SUCCESS);
}
2237
2239 perf(1), fcntl(2), mmap(2), open(2), prctl(2), read(2)
2240
2242 This page is part of release 4.15 of the Linux man-pages project. A
2243 description of the project, information about reporting bugs, and the
2244 latest version of this page, can be found at
2245 https://www.kernel.org/doc/man-pages/.
2246
2247
2248
2249Linux 2018-02-02 PERF_EVENT_OPEN(2)