PERF_EVENT_OPEN(2)         Linux Programmer's Manual        PERF_EVENT_OPEN(2)

NAME

       perf_event_open - set up performance monitoring

SYNOPSIS

       #include <linux/perf_event.h>    /* Definition of PERF_* constants */
       #include <linux/hw_breakpoint.h> /* Definition of HW_* constants */
       #include <sys/syscall.h>         /* Definition of SYS_* constants */
       #include <unistd.h>

       int syscall(SYS_perf_event_open, struct perf_event_attr *attr,
                   pid_t pid, int cpu, int group_fd, unsigned long flags);

       Note:  glibc  provides  no wrapper for perf_event_open(), necessitating
       the use of syscall(2).
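
       Because there is no glibc wrapper, programs commonly define a small
       helper of their own.  The following is a minimal sketch of such a
       helper.  The function name perf_event_open() used here is a local,
       hypothetical definition, not something a library provides, and the
       assert() is only a debugging aid chosen for this sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Local helper (not provided by glibc): forwards its arguments
 * unchanged to the raw system call.  Returns the new file
 * descriptor, or -1 with errno set on error.
 */
static long
perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                int group_fd, unsigned long flags)
{
    assert(attr != NULL);   /* catch caller bugs early in debug builds */
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```

       A caller would zero a struct perf_event_attr, fill in at least type,
       size, and config, and then pass it to this helper.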

DESCRIPTION

       Given a list of parameters, perf_event_open() returns a  file  descrip‐
       tor,  for  use  in subsequent system calls (read(2), mmap(2), prctl(2),
       fcntl(2), etc.).

       A call to perf_event_open() creates a file descriptor that allows  mea‐
       suring  performance  information.   Each file descriptor corresponds to
       one event that is measured; these can be grouped  together  to  measure
       multiple events simultaneously.

       Events  can  be  enabled and disabled in two ways: via ioctl(2) and via
       prctl(2).  When an event is disabled it  does  not  count  or  generate
       overflows but does continue to exist and maintain its count value.

       Events  come in two flavors: counting and sampling.  A counting event is
       one that is used for counting the aggregate number of events  that  oc‐
       cur.   In  general,  counting event results are gathered with a read(2)
       call.  A sampling event periodically writes measurements  to  a  buffer
       that can then be accessed via mmap(2).

   Arguments
       The  pid  and  cpu  arguments allow specifying which process and CPU to
       monitor:

       pid == 0 and cpu == -1
              This measures the calling process/thread on any CPU.

       pid == 0 and cpu >= 0
              This measures the calling process/thread only  when  running  on
              the specified CPU.

       pid > 0 and cpu == -1
              This measures the specified process/thread on any CPU.

       pid > 0 and cpu >= 0
              This  measures the specified process/thread only when running on
              the specified CPU.

       pid == -1 and cpu >= 0
              This measures all processes/threads on the specified CPU.   This
              requires CAP_PERFMON (since Linux 5.8) or CAP_SYS_ADMIN capabil‐
              ity or a /proc/sys/kernel/perf_event_paranoid value of less than
              1.

       pid == -1 and cpu == -1
              This setting is invalid and will return an error.
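
       The combinations listed above can be summarized as a small decision
       function.  This is only an illustrative sketch: the enum and the
       function name are hypothetical, and treating values below -1 as
       invalid is an assumption of the sketch, since the list above does not
       cover them.

```c
#include <assert.h>
#include <sys/types.h>

/* Hypothetical names, used only for this illustration. */
enum scope {
    SCOPE_INVALID,            /* pid == -1 and cpu == -1 (or out of range) */
    SCOPE_SELF_ANY_CPU,       /* pid == 0,  cpu == -1 */
    SCOPE_SELF_ONE_CPU,       /* pid == 0,  cpu >= 0  */
    SCOPE_TASK_ANY_CPU,       /* pid > 0,   cpu == -1 */
    SCOPE_TASK_ONE_CPU,       /* pid > 0,   cpu >= 0  */
    SCOPE_ALL_TASKS_ONE_CPU   /* pid == -1, cpu >= 0  */
};

/* Map a (pid, cpu) pair to the monitoring scope it selects. */
static enum scope
scope_of(pid_t pid, int cpu)
{
    if (pid < -1 || cpu < -1)
        return SCOPE_INVALID;       /* assumption: not covered by the list */
    if (pid == -1)
        return cpu >= 0 ? SCOPE_ALL_TASKS_ONE_CPU : SCOPE_INVALID;
    if (pid == 0)
        return cpu >= 0 ? SCOPE_SELF_ONE_CPU : SCOPE_SELF_ANY_CPU;
    return cpu >= 0 ? SCOPE_TASK_ONE_CPU : SCOPE_TASK_ANY_CPU;
}
```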

       When  pid  is greater than zero, permission to perform this system call
       is governed by CAP_PERFMON (since Linux 5.9) and a ptrace  access  mode
       PTRACE_MODE_READ_REALCREDS   check   on   older   Linux  versions;  see
       ptrace(2).

       The group_fd argument allows event groups  to  be  created.   An  event
       group  has  one event which is the group leader.  The leader is created
       first, with group_fd = -1.  The rest of the group members  are  created
       with  subsequent perf_event_open() calls with group_fd being set to the
       file descriptor of the group leader.  (A single event  on  its  own  is
       created  with group_fd = -1 and is considered to be a group with only 1
       member.)  An event group is scheduled onto the CPU as a unit:  it  will
       be  put  onto the CPU only if all of the events in the group can be put
       onto the CPU.  This means that the values of the member events  can  be
       meaningfully compared with each other (added, divided to get ratios,
       and so on), since they have counted events for the same set of executed
       instructions.
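
       The group creation order described above (leader first with group_fd
       set to -1, then members that name the leader) might be sketched as
       follows.  The helper names and the choice of events (cycles and
       instructions, user space only) are assumptions made for illustration;
       on systems where hardware events are unavailable or not permitted the
       helper simply returns -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open one hardware event for the calling thread on any CPU. */
static int
open_hw_event(unsigned long long config, int group_fd)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* Only the leader starts disabled; members follow the leader. */
    attr.disabled = (group_fd == -1);

    return (int) syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0UL);
}

/*
 * Hypothetical helper: create a two-event group with cycles as the
 * leader and instructions as a member.  Returns 0 and fills fds[]
 * on success; returns -1 with errno set if either open fails.
 */
static int
open_cycles_instructions_group(int fds[2])
{
    assert(fds != NULL);

    fds[0] = open_hw_event(PERF_COUNT_HW_CPU_CYCLES, -1);
    if (fds[0] == -1)
        return -1;

    fds[1] = open_hw_event(PERF_COUNT_HW_INSTRUCTIONS, fds[0]);
    if (fds[1] == -1) {
        int saved_errno = errno;    /* close() may overwrite errno */

        close(fds[0]);
        fds[0] = -1;
        errno = saved_errno;
        return -1;
    }
    return 0;
}
```

       Dividing the instructions count by the cycles count from such a group
       gives an instructions-per-cycle ratio, which is meaningful because
       both members were scheduled together.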

       The flags argument is formed by ORing together zero or more of the fol‐
       lowing values:

       PERF_FLAG_FD_CLOEXEC (since Linux 3.14)
              This flag enables the close-on-exec flag for the  created  event
              file  descriptor,  so  that the file descriptor is automatically
              closed on execve(2).  Setting the close-on-exec  flag  at  cre‐
              ation  time,  rather  than later with fcntl(2), avoids potential
              race   conditions   where    the    calling    thread    invokes
              perf_event_open()  and  fcntl(2)  at  the  same  time as another
              thread calls fork(2) then execve(2).

       PERF_FLAG_FD_NO_GROUP
              This flag tells the event to ignore the group_fd  parameter  ex‐
              cept  for the purpose of setting up output redirection using the
              PERF_FLAG_FD_OUTPUT flag.

       PERF_FLAG_FD_OUTPUT (broken since Linux 2.6.35)
              This flag re-routes the event's sampled output to instead be in‐
              cluded in the mmap buffer of the event specified by group_fd.

       PERF_FLAG_PID_CGROUP (since Linux 2.6.39)
              This  flag  activates  per-container  system-wide monitoring.  A
              container is an abstraction that isolates a set of resources for
              finer-grained  control  (CPUs, memory, etc.).  In this mode, the
              event is measured only if the thread running  on  the  monitored
              CPU belongs to the designated container (cgroup).  The cgroup is
              identified by passing a file descriptor opened on its  directory
              in the cgroupfs filesystem.  For instance, if the cgroup to mon‐
              itor  is  called  test,  then  a  file  descriptor   opened   on
              /dev/cgroup/test  (assuming  cgroupfs is mounted on /dev/cgroup)
              must be passed as  the  pid  parameter.   cgroup  monitoring  is
              available  only for system-wide events and may therefore require
              extra permissions.

       The perf_event_attr structure provides detailed configuration  informa‐
       tion for the event being created.

           struct perf_event_attr {
               __u32 type;                 /* Type of event */
               __u32 size;                 /* Size of attribute structure */
               __u64 config;               /* Type-specific configuration */

               union {
                   __u64 sample_period;    /* Period of sampling */
                   __u64 sample_freq;      /* Frequency of sampling */
               };

               __u64 sample_type;  /* Specifies values included in sample */
               __u64 read_format;  /* Specifies values returned in read */

               __u64 disabled       : 1,   /* off by default */
                     inherit        : 1,   /* children inherit it */
                     pinned         : 1,   /* must always be on PMU */
                     exclusive      : 1,   /* only group on PMU */
                     exclude_user   : 1,   /* don't count user */
                     exclude_kernel : 1,   /* don't count kernel */
                     exclude_hv     : 1,   /* don't count hypervisor */
                     exclude_idle   : 1,   /* don't count when idle */
                     mmap           : 1,   /* include mmap data */
                     comm           : 1,   /* include comm data */
                     freq           : 1,   /* use freq, not period */
                     inherit_stat   : 1,   /* per task counts */
                     enable_on_exec : 1,   /* next exec enables */
                     task           : 1,   /* trace fork/exit */
                     watermark      : 1,   /* wakeup_watermark */
                     precise_ip     : 2,   /* skid constraint */
                     mmap_data      : 1,   /* non-exec mmap data */
                     sample_id_all  : 1,   /* sample_type all events */
                     exclude_host   : 1,   /* don't count in host */
                     exclude_guest  : 1,   /* don't count in guest */
                     exclude_callchain_kernel : 1,
                                           /* exclude kernel callchains */
                     exclude_callchain_user   : 1,
                                           /* exclude user callchains */
                     mmap2          :  1,  /* include mmap with inode data */
                     comm_exec      :  1,  /* flag comm events that are
                                              due to exec */
                     use_clockid    :  1,  /* use clockid for time fields */
                     context_switch :  1,  /* context switch data */
                     write_backward :  1,  /* Write ring buffer from end
                                              to beginning */
                     namespaces     :  1,  /* include namespaces data */
                     ksymbol        :  1,  /* include ksymbol events */
                     bpf_event      :  1,  /* include bpf events */
                     aux_output     :  1,  /* generate AUX records
                                              instead of events */
                     cgroup         :  1,  /* include cgroup events */
                     text_poke      :  1,  /* include text poke events */

                     __reserved_1   : 30;

               union {
                   __u32 wakeup_events;    /* wakeup every n events */
                   __u32 wakeup_watermark; /* bytes before wakeup */
               };

               __u32     bp_type;          /* breakpoint type */

               union {
                   __u64 bp_addr;          /* breakpoint address */
                   __u64 kprobe_func;      /* for perf_kprobe */
                   __u64 uprobe_path;      /* for perf_uprobe */
                   __u64 config1;          /* extension of config */
               };

               union {
                   __u64 bp_len;           /* breakpoint length */
                   __u64 kprobe_addr;      /* with kprobe_func == NULL */
                   __u64 probe_offset;     /* for perf_[k,u]probe */
                   __u64 config2;          /* extension of config1 */
               };
               __u64 branch_sample_type;   /* enum perf_branch_sample_type */
               __u64 sample_regs_user;     /* user regs to dump on samples */
               __u32 sample_stack_user;    /* size of stack to dump on
                                              samples */
               __s32 clockid;              /* clock to use for time fields */
               __u64 sample_regs_intr;     /* regs to dump on samples */
               __u32 aux_watermark;        /* aux bytes before wakeup */
               __u16 sample_max_stack;     /* max frames in callchain */
               __u16 __reserved_2;         /* align to u64 */

           };

       The  fields  of the perf_event_attr structure are described in more de‐
       tail below:

       type   This field specifies the overall event type.  It has one of  the
              following values:

              PERF_TYPE_HARDWARE
                     This  indicates  one of the "generalized" hardware events
                     provided by the kernel.  See the config field  definition
                     for more details.

              PERF_TYPE_SOFTWARE
                     This  indicates  one  of the software-defined events pro‐
                     vided by the kernel  (even  if  no  hardware  support  is
                     available).

              PERF_TYPE_TRACEPOINT
                     This indicates a tracepoint provided by the kernel trace‐
                     point infrastructure.

              PERF_TYPE_HW_CACHE
                     This indicates a hardware cache event.  This has  a  spe‐
                     cial encoding, described in the config field definition.

              PERF_TYPE_RAW
                     This  indicates  a "raw" implementation-specific event in
                     the config field.

              PERF_TYPE_BREAKPOINT (since Linux 2.6.33)
                     This indicates a hardware breakpoint as provided  by  the
                     CPU.   Breakpoints  can  be read/write accesses to an ad‐
                     dress as well as execution of an instruction address.

              dynamic PMU
                     Since Linux 2.6.38, perf_event_open() can support  multi‐
                     ple PMUs.  To enable this, a value exported by the kernel
                     can be used in the type field to indicate  which  PMU  to
                     use.  The value to use can be found in the sysfs filesys‐
                     tem: there is  a  subdirectory  per  PMU  instance  under
                     /sys/bus/event_source/devices.    In   each  subdirectory
                     there is a type file whose content is an integer that can
                     be    used    in   the   type   field.    For   instance,
                     /sys/bus/event_source/devices/cpu/type contains the value
                     for the core CPU PMU, which is usually 4.

              kprobe and uprobe (since Linux 4.17)
                     These  two dynamic PMUs create a kprobe/uprobe and attach
                     it to the file descriptor generated  by  perf_event_open.
                     The kprobe/uprobe will be destroyed on the destruction of
                     the  file  descriptor.   See  fields   kprobe_func,   up‐
                     robe_path,  kprobe_addr,  and  probe_offset  for more de‐
                     tails.
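
              For the dynamic PMU case described above, the type value has to
              be looked up at run time.  A minimal sketch of such a lookup
              follows; the helper name read_pmu_type() and the fixed path
              buffer size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: read the integer stored in
 * /sys/bus/event_source/devices/<pmu>/type.
 * Returns the type value, or -1 if the file cannot be read or parsed.
 */
static int
read_pmu_type(const char *pmu)
{
    char path[256];
    FILE *fp;
    int type = -1;
    int n;

    assert(pmu != NULL);

    n = snprintf(path, sizeof(path),
                 "/sys/bus/event_source/devices/%s/type", pmu);
    if (n < 0 || (size_t) n >= sizeof(path))
        return -1;                  /* name too long for the buffer */

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;                  /* no such PMU, or sysfs not mounted */
    if (fscanf(fp, "%d", &type) != 1)
        type = -1;
    fclose(fp);
    return type;
}
```

              A caller would store a non-negative result in the type field,
              for example the value returned for the PMU named "cpu".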

       size   The size of the perf_event_attr structure  for  forward/backward
              compatibility.  Set this using sizeof(struct perf_event_attr) to
              allow the kernel to see the struct size at the time of  compila‐
              tion.

              The  related  define  PERF_ATTR_SIZE_VER0 is set to 64; this was
              the size of the first published struct.  PERF_ATTR_SIZE_VER1  is
              72,  corresponding  to  the  addition  of  breakpoints  in Linux
              2.6.33.  PERF_ATTR_SIZE_VER2 is 80 corresponding to the addition
              of branch sampling in Linux 3.4.  PERF_ATTR_SIZE_VER3 is 96 cor‐
              responding  to  the  addition  of  sample_regs_user   and   sam‐
              ple_stack_user  in Linux 3.7.  PERF_ATTR_SIZE_VER4 is 104 corre‐
              sponding to the addition  of  sample_regs_intr  in  Linux  3.19.
              PERF_ATTR_SIZE_VER5  is  112  corresponding  to  the addition of
              aux_watermark in Linux 4.1.
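
              A common way to follow this advice is to zero the whole
              structure before setting size, so that fields the program does
              not know about are left as 0.  The sketch below shows one way
              to do that; the helper name perf_attr_init() is hypothetical.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>

/*
 * Hypothetical helper: clear every field, then record the size of the
 * structure as known at compile time, plus the event type and config.
 */
static void
perf_attr_init(struct perf_event_attr *attr, __u32 type, __u64 config)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = type;
    attr->config = config;
}
```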

       config This specifies which event you want,  in  conjunction  with  the
              type  field.  The config1 and config2 fields are also taken into
              account in cases where 64 bits is not enough  to  fully  specify
              the event.  The encoding of these fields is event dependent.

              There  are  various ways to set the config field that are depen‐
              dent on the value of the previously described type field.   What
              follows  are  various possible settings for config separated out
              by type.

              If type is PERF_TYPE_HARDWARE, we are measuring one of the  gen‐
              eralized hardware CPU events.  Not all of these are available on
              all platforms.  Set config to one of the following:

                   PERF_COUNT_HW_CPU_CYCLES
                          Total cycles.  Be wary of what  happens  during  CPU
                          frequency scaling.

                   PERF_COUNT_HW_INSTRUCTIONS
                          Retired  instructions.  Be careful, these can be af‐
                          fected by various issues, most notably hardware  in‐
                          terrupt counts.

                   PERF_COUNT_HW_CACHE_REFERENCES
                          Cache  accesses.   Usually this indicates Last Level
                          Cache accesses but this may vary depending  on  your
                          CPU.  This may include prefetches and coherency mes‐
                          sages; again this depends on the design of your CPU.

                   PERF_COUNT_HW_CACHE_MISSES
                          Cache misses.  Usually  this  indicates  Last  Level
                          Cache  misses;  this  is intended to be used in con‐
                          junction  with  the   PERF_COUNT_HW_CACHE_REFERENCES
                          event to calculate cache miss rates.

                   PERF_COUNT_HW_BRANCH_INSTRUCTIONS
                          Retired branch instructions.  Prior to Linux 2.6.35,
                          this used the wrong event on AMD processors.

                   PERF_COUNT_HW_BRANCH_MISSES
                          Mispredicted branch instructions.

                   PERF_COUNT_HW_BUS_CYCLES
                          Bus cycles, which can be different  from  total  cy‐
                          cles.

                   PERF_COUNT_HW_STALLED_CYCLES_FRONTEND (since Linux 3.0)
                          Stalled cycles during issue.

                   PERF_COUNT_HW_STALLED_CYCLES_BACKEND (since Linux 3.0)
                          Stalled cycles during retirement.

                   PERF_COUNT_HW_REF_CPU_CYCLES (since Linux 3.3)
                          Total cycles; not affected by CPU frequency scaling.

              If  type is PERF_TYPE_SOFTWARE, we are measuring software events
              provided by the kernel.  Set config to one of the following:

                   PERF_COUNT_SW_CPU_CLOCK
                          This reports the CPU clock, a  high-resolution  per-
                          CPU timer.

                   PERF_COUNT_SW_TASK_CLOCK
                          This reports a clock count specific to the task that
                          is running.

                   PERF_COUNT_SW_PAGE_FAULTS
                          This reports the number of page faults.

                   PERF_COUNT_SW_CONTEXT_SWITCHES
                          This counts context switches.  Until  Linux  2.6.34,
                          these  were all reported as user-space events; after
                          that they are reported as happening in the kernel.

                   PERF_COUNT_SW_CPU_MIGRATIONS
                          This reports the number of times the process has mi‐
                          grated to a new CPU.

                   PERF_COUNT_SW_PAGE_FAULTS_MIN
                          This  counts the number of minor page faults.  These
                          did not require disk I/O to handle.

                   PERF_COUNT_SW_PAGE_FAULTS_MAJ
                          This counts the number of major page faults.   These
                          required disk I/O to handle.

                   PERF_COUNT_SW_ALIGNMENT_FAULTS (since Linux 2.6.33)
                          This  counts  the number of alignment faults.  These
                          happen when unaligned memory  accesses  happen;  the
                          kernel  can handle these but it reduces performance.
                          This happens only on some  architectures  (never  on
                          x86).

                   PERF_COUNT_SW_EMULATION_FAULTS (since Linux 2.6.33)
                          This  counts  the  number  of emulation faults.  The
                          kernel sometimes traps on unimplemented instructions
                          and  emulates  them  for user space.  This can nega‐
                          tively impact performance.

                   PERF_COUNT_SW_DUMMY (since Linux 3.12)
                          This is a placeholder  event  that  counts  nothing.
                          Informational  sample  record  types such as mmap or
                          comm must be associated with an active event.   This
                          dummy  event  allows  gathering such records without
                          requiring a counting event.

              If type is PERF_TYPE_TRACEPOINT, then we  are  measuring  kernel
              tracepoints.   The  value  to use in config can be obtained from
              under debugfs tracing/events/*/*/id if ftrace is enabled in  the
              kernel.

              If  type is PERF_TYPE_HW_CACHE, then we are measuring a hardware
              CPU cache event.  To calculate the appropriate config value, use
              the following equation:

                      config = (perf_hw_cache_id) |
                               (perf_hw_cache_op_id << 8) |
                               (perf_hw_cache_op_result_id << 16);

                  where perf_hw_cache_id is one of:

                      PERF_COUNT_HW_CACHE_L1D
                             for measuring Level 1 Data Cache

                      PERF_COUNT_HW_CACHE_L1I
                             for measuring Level 1 Instruction Cache

                      PERF_COUNT_HW_CACHE_LL
                             for measuring Last-Level Cache

                      PERF_COUNT_HW_CACHE_DTLB
                             for measuring the Data TLB

                      PERF_COUNT_HW_CACHE_ITLB
                             for measuring the Instruction TLB

                      PERF_COUNT_HW_CACHE_BPU
                             for measuring the branch prediction unit

                      PERF_COUNT_HW_CACHE_NODE (since Linux 3.1)
                             for measuring local memory accesses

                  and perf_hw_cache_op_id is one of:

                      PERF_COUNT_HW_CACHE_OP_READ
                             for read accesses

                      PERF_COUNT_HW_CACHE_OP_WRITE
                             for write accesses

                      PERF_COUNT_HW_CACHE_OP_PREFETCH
                             for prefetch accesses

                  and perf_hw_cache_op_result_id is one of:

                      PERF_COUNT_HW_CACHE_RESULT_ACCESS
                             to measure accesses

                      PERF_COUNT_HW_CACHE_RESULT_MISS
                             to measure misses
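
              The PERF_TYPE_HW_CACHE equation above can be wrapped in a small
              function.  This is a sketch only; the function name
              hw_cache_config() is hypothetical, while the enum types and
              constants come from <linux/perf_event.h>.

```c
#include <assert.h>
#include <linux/perf_event.h>

/*
 * Combine cache id, operation, and result into a config value:
 * id in bits 0-7, operation in bits 8-15, result in bits 16-23.
 */
static __u64
hw_cache_config(enum perf_hw_cache_id id,
                enum perf_hw_cache_op_id op,
                enum perf_hw_cache_op_result_id result)
{
    return (__u64) id | ((__u64) op << 8) | ((__u64) result << 16);
}
```

              For example, an event counting Level 1 data cache read misses
              would combine PERF_COUNT_HW_CACHE_L1D,
              PERF_COUNT_HW_CACHE_OP_READ, and
              PERF_COUNT_HW_CACHE_RESULT_MISS.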

              If  type  is  PERF_TYPE_RAW, then a custom "raw" config value is
              needed.  Most CPUs support events that are not  covered  by  the
              "generalized"  events.   These  are  implementation defined; see
              your CPU manual (for example the Intel Volume  3B  documentation
              or  the  AMD  BIOS and Kernel Developer Guide).  The libpfm4 li‐
              brary can be used to translate from the name  in  the  architec‐
              tural  manuals to the raw hex value perf_event_open() expects in
              this field.

              If type is PERF_TYPE_BREAKPOINT, then leave config set to  zero.
              Its parameters are set in other places.

              If  type is kprobe or uprobe, set retprobe (bit 0 of config, see
              /sys/bus/event_source/devices/[k,u]probe/format/retprobe)    for
              kretprobe/uretprobe.    See   fields  kprobe_func,  uprobe_path,
              kprobe_addr, and probe_offset for more details.

       kprobe_func, uprobe_path, kprobe_addr, and probe_offset
              These fields describe the kprobe/uprobe for dynamic PMUs  kprobe
              and  uprobe.   For  kprobe: use kprobe_func and probe_offset, or
              use kprobe_addr and leave kprobe_func as NULL.  For uprobe:  use
              uprobe_path and probe_offset.

       sample_period, sample_freq
              A  "sampling"  event is one that generates an overflow notifica‐
              tion every N events, where N is given by sample_period.  A  sam‐
              pling event has sample_period > 0.  When an overflow occurs, re‐
              quested data is recorded in the mmap  buffer.   The  sample_type
              field controls what data is recorded on each overflow.

              sample_freq can be used if you wish to use frequency rather than
              period.  In this case, you set the freq flag.  The  kernel  will
              adjust  the sampling period to try and achieve the desired rate.
              The rate of adjustment is a timer tick.

       sample_type
              The various bits in this field specify which values  to  include
              in the sample.  They will be recorded in a ring-buffer, which is
              available to user space using mmap(2).  The order in  which  the
              values are saved in the sample is documented in the MMAP Layout
              subsection below; it is not  the  enum  perf_event_sample_format
              order.

              PERF_SAMPLE_IP
                     Records instruction pointer.

              PERF_SAMPLE_TID
                     Records the process and thread IDs.

              PERF_SAMPLE_TIME
                     Records a timestamp.

              PERF_SAMPLE_ADDR
                     Records an address, if applicable.

              PERF_SAMPLE_READ
                     Records counter values for all events in a group, not just
                     the group leader.

              PERF_SAMPLE_CALLCHAIN
                     Records the callchain (stack backtrace).

              PERF_SAMPLE_ID
                     Records a unique ID for the opened event's group leader.

              PERF_SAMPLE_CPU
                     Records CPU number.

              PERF_SAMPLE_PERIOD
                     Records the current sampling period.

              PERF_SAMPLE_STREAM_ID
                     Records  a  unique  ID  for  the  opened  event.   Unlike
                     PERF_SAMPLE_ID  the  actual ID is returned, not the group
                     leader.  This ID is the  same  as  the  one  returned  by
                     PERF_FORMAT_ID.

              PERF_SAMPLE_RAW
                     Records additional data, if applicable.  Usually returned
                     by tracepoint events.

              PERF_SAMPLE_BRANCH_STACK (since Linux 3.4)
                     This provides a record of recent branches, as provided by
                     CPU  branch  sampling hardware (such as Intel Last Branch
                     Record).  Not all hardware supports this feature.

                     See the branch_sample_type field for how to filter  which
                     branches are reported.

              PERF_SAMPLE_REGS_USER (since Linux 3.7)
                     Records  the  current  user-level CPU register state (the
                     values in the process before the kernel was called).

              PERF_SAMPLE_STACK_USER (since Linux 3.7)
                     Records the user level stack, allowing stack unwinding.

              PERF_SAMPLE_WEIGHT (since Linux 3.10)
                     Records a hardware provided weight value  that  expresses
                     how  costly the sampled event was.  This allows the hard‐
                     ware to highlight expensive events in a profile.

              PERF_SAMPLE_DATA_SRC (since Linux 3.10)
                     Records the data source: where in  the  memory  hierarchy
                     the  data  associated  with  the sampled instruction came
                     from.  This is available only if the underlying  hardware
                     supports this feature.

              PERF_SAMPLE_IDENTIFIER (since Linux 3.12)
                     Places  the  SAMPLE_ID  value  in a fixed position in the
                     record, either at the beginning (for sample events) or at
                     the end (if a non-sample event).

                     This  was  necessary  because  a  sample  stream may have
                     records from various different event sources with differ‐
                     ent sample_type settings.  Parsing the event stream prop‐
                     erly was not possible because the format  of  the  record
                     was needed to find SAMPLE_ID, but the format could not be
                     found without knowing what event the sample  belonged  to
                     (causing a circular dependency).

                     The PERF_SAMPLE_IDENTIFIER setting makes the event stream
                     always parsable by putting SAMPLE_ID in a fixed location,
                     even though it means having duplicate SAMPLE_ID values in
                     records.

              PERF_SAMPLE_TRANSACTION (since Linux 3.13)
                     Records reasons for  transactional  memory  abort  events
                     (for  example,  from  Intel TSX transactional memory sup‐
                     port).

                     The precise_ip setting must  be  greater  than  0  and  a
                     transactional  memory  abort event must be measured or no
                     values will be recorded.  Also note that some  perf_event
                     measurements,  such  as sampled cycle counting, may cause
                     extraneous aborts  (by  causing  an  interrupt  during  a
                     transaction).

              PERF_SAMPLE_REGS_INTR (since Linux 3.19)
                     Records  a  subset  of  the current CPU register state as
                     specified   by   sample_regs_intr.    Unlike    PERF_SAM‐
                     PLE_REGS_USER the register values will return kernel reg‐
                     ister state if the overflow happened while kernel code is
                     running.  If the CPU supports hardware sampling of regis‐
                     ter state (e.g., PEBS on Intel x86) and precise_ip is set
                     higher  than  zero  then the register values returned are
                     those captured by hardware at the time of the sampled in‐
                     struction's retirement.

              PERF_SAMPLE_PHYS_ADDR (since Linux 4.13)
                     Records  physical  address  of  data  like  in  PERF_SAM‐
                     PLE_ADDR.

              PERF_SAMPLE_CGROUP (since Linux 5.7)
                     Records (perf_event) cgroup ID of the process.  This cor‐
                     responds to the id field in the PERF_RECORD_CGROUP event.

       read_format
              This  field specifies the format of the data returned by read(2)
              on a perf_event_open() file descriptor.

              PERF_FORMAT_TOTAL_TIME_ENABLED
                     Adds the 64-bit time_enabled field.  This can be used  to
                     calculate  estimated  totals  if the PMU is overcommitted
                     and multiplexing is happening.

              PERF_FORMAT_TOTAL_TIME_RUNNING
                     Adds the 64-bit time_running field.  This can be used  to
                     calculate  estimated  totals  if the PMU is overcommitted
                     and multiplexing is happening.

              PERF_FORMAT_ID
                     Adds a 64-bit unique value that corresponds to the  event
                     group.

              PERF_FORMAT_GROUP
                     Allows  all  counter  values in an event group to be read
                     with one read.
616
617       disabled
618              The disabled bit specifies whether the counter starts  out  dis‐
619              abled  or  enabled.  If disabled, the event can later be enabled
620              by ioctl(2), prctl(2), or enable_on_exec.
621
622              When creating an event group, typically the group leader is ini‐
623              tialized  with  disabled  set to 1 and any child events are ini‐
624              tialized with disabled set to 0.  Despite disabled being 0,  the
625              child events will not start until the group leader is enabled.
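
              The leader/sibling convention above can be sketched as follows.
              This is a minimal illustration, assuming a group of two hardware
              events (cycles and instructions); the helper name
              init_group_attrs() is hypothetical and is not part of any API.

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/* Hypothetical helper: fill in the attrs for a two-event group.
 * The leader starts disabled; the sibling has disabled == 0 but
 * still does not count until the leader is enabled. */
void init_group_attrs(struct perf_event_attr *leader,
                      struct perf_event_attr *sibling)
{
    assert(leader != NULL && sibling != NULL);

    memset(leader, 0, sizeof(*leader));
    leader->type = PERF_TYPE_HARDWARE;
    leader->size = sizeof(*leader);
    leader->config = PERF_COUNT_HW_CPU_CYCLES;
    leader->disabled = 1;               /* group starts stopped */

    memset(sibling, 0, sizeof(*sibling));
    sibling->type = PERF_TYPE_HARDWARE;
    sibling->size = sizeof(*sibling);
    sibling->config = PERF_COUNT_HW_INSTRUCTIONS;
    sibling->disabled = 0;              /* follows the leader */
}
```

              The leader would then be opened with group_fd set to -1 and the
              sibling with group_fd set to the leader's file descriptor.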
626
627       inherit
628              The  inherit bit specifies that this counter should count events
629              of child tasks as well as the task specified.  This applies only
630              to  new  children,  not to any existing children at the time the
631              counter is created (nor to any new children  of  existing  chil‐
632              dren).
633
634              Inherit  does not work for some combinations of read_format val‐
635              ues, such as PERF_FORMAT_GROUP.
636
637       pinned The pinned bit specifies that the counter should  always  be  on
638              the  CPU  if at all possible.  It applies only to hardware coun‐
639              ters and only to group leaders.  If a pinned counter  cannot  be
640              put  onto  the  CPU (e.g., because there are not enough hardware
641              counters or because of a conflict with some other  event),  then
642              the  counter goes into an 'error' state, where reads return end-
643              of-file (i.e., read(2) returns 0) until the  counter  is  subse‐
644              quently enabled or disabled.
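
              A reader can detect the 'error' state by checking for a zero
              return from read(2).  The sketch below assumes an event opened
              with read_format set to 0, so that a successful read yields a
              single 64-bit count; read_counter() and the enum names are
              hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

enum counter_status {
    COUNTER_OK,            /* *value holds the count */
    COUNTER_ERROR_STATE,   /* read(2) returned 0 (end-of-file) */
    COUNTER_READ_FAILED    /* read(2) failed or returned a short read */
};

/* Read one u64 count from a perf_event_open() file descriptor. */
enum counter_status read_counter(int fd, uint64_t *value)
{
    assert(value != NULL);

    ssize_t n = read(fd, value, sizeof(*value));
    if (n == (ssize_t)sizeof(*value))
        return COUNTER_OK;
    if (n == 0)
        return COUNTER_ERROR_STATE;
    return COUNTER_READ_FAILED;
}
```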
645
646       exclusive
647              The exclusive bit specifies that when this counter's group is on
648              the CPU, it should be the only group using the  CPU's  counters.
649              In  the future this may allow monitoring programs to support PMU
650              features that need to run alone so  that  they  do  not  disrupt
651              other hardware counters.
652
653              Note that many unexpected situations may prevent events with the
654              exclusive bit set from ever running.  This  includes  any  users
655              running  a  system-wide measurement as well as any kernel use of
656              the performance counters (including  the  commonly  enabled  NMI
657              Watchdog Timer interface).
658
659       exclude_user
660              If  this  bit  is  set, the count excludes events that happen in
661              user space.
662
663       exclude_kernel
664              If this bit is set, the count excludes  events  that  happen  in
665              kernel space.
666
667       exclude_hv
668              If this bit is set, the count excludes events that happen in the
669              hypervisor.  This is mainly for PMUs that have built-in  support
670              for  handling this (such as POWER).  Extra support is needed for
671              handling hypervisor measurements on most machines.
672
673       exclude_idle
674              If set, don't count when the  CPU  is  running  the  idle  task.
675              While  you  can  currently enable this for any event type, it is
676              ignored for all but software events.
677
678       mmap   The mmap bit enables generation of PERF_RECORD_MMAP samples  for
679              every mmap(2) call that has PROT_EXEC set.  This allows tools to
680              notice new executable code being mapped into a program  (dynamic
681              shared  libraries  for  example) so that addresses can be mapped
682              back to the original code.
683
684       comm   The comm bit enables tracking of process command name  as  modi‐
685              fied  by  the  execve(2)  and prctl(PR_SET_NAME) system calls as
686              well as writing to /proc/self/comm.  If the  comm_exec  flag  is
687              also successfully set (possible since Linux 3.16), then the misc
688              flag PERF_RECORD_MISC_COMM_EXEC can be used to differentiate the
689              execve(2) case from the others.
690
       freq   If this bit is set, then sample_freq, not sample_period, is
              used when setting up the sampling interval.
693
694       inherit_stat
695              This bit enables saving of event counts on  context  switch  for
696              inherited  tasks.   This is meaningful only if the inherit field
697              is set.
698
699       enable_on_exec
700              If this bit is set, a counter is automatically enabled  after  a
701              call to execve(2).
702
703       task   If this bit is set, then fork/exit notifications are included in
704              the ring buffer.
705
706       watermark
707              If set, have an overflow notification happen when we  cross  the
708              wakeup_watermark  boundary.   Otherwise,  overflow notifications
709              happen after wakeup_events samples.
710
711       precise_ip (since Linux 2.6.35)
712              This controls the amount of skid.  Skid is how many instructions
713              execute  between  an  event of interest happening and the kernel
714              being able to stop and record the event.  Smaller skid is better
715              and allows more accurate reporting of which events correspond to
716              which instructions, but hardware is often limited with how small
717              this can be.
718
719              The possible values of this field are the following:
720
721              0  SAMPLE_IP can have arbitrary skid.
722
723              1  SAMPLE_IP must have constant skid.
724
725              2  SAMPLE_IP requested to have 0 skid.
726
727              3  SAMPLE_IP  must  have  0  skid.   See also the description of
728                 PERF_RECORD_MISC_EXACT_IP.
729
730       mmap_data (since Linux 2.6.36)
731              This is the counterpart of the mmap field.  This enables genera‐
732              tion  of  PERF_RECORD_MMAP samples for mmap(2) calls that do not
733              have PROT_EXEC set (for example data and SysV shared memory).
734
735       sample_id_all (since Linux 2.6.38)
736              If set, then TID, TIME, ID, STREAM_ID, and CPU can  additionally
737              be included in non-PERF_RECORD_SAMPLEs if the corresponding sam‐
738              ple_type is selected.
739
740              If PERF_SAMPLE_IDENTIFIER is specified, then  an  additional  ID
741              value  is  included as the last value to ease parsing the record
742              stream.  This may lead to the id value appearing twice.
743
744              The layout is described by this pseudo-structure:
745
746                  struct sample_id {
747                      { u32 pid, tid; }   /* if PERF_SAMPLE_TID set */
748                      { u64 time;     }   /* if PERF_SAMPLE_TIME set */
749                      { u64 id;       }   /* if PERF_SAMPLE_ID set */
750                      { u64 stream_id;}   /* if PERF_SAMPLE_STREAM_ID set  */
751                      { u32 cpu, res; }   /* if PERF_SAMPLE_CPU set */
752                      { u64 id;       }   /* if PERF_SAMPLE_IDENTIFIER set */
753                  };
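
              Because the trailer is built only from fixed-size fields, its
              length can be derived from sample_type.  The following sketch
              follows the pseudo-structure above; the function name
              sample_id_trailer_size() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* Size in bytes of the struct sample_id trailer that is appended to
 * non-sample records when sample_id_all is set. */
size_t sample_id_trailer_size(uint64_t sample_type)
{
    size_t size = 0;

    if (sample_type & PERF_SAMPLE_TID)
        size += 2 * sizeof(uint32_t);   /* pid, tid */
    if (sample_type & PERF_SAMPLE_TIME)
        size += sizeof(uint64_t);
    if (sample_type & PERF_SAMPLE_ID)
        size += sizeof(uint64_t);
    if (sample_type & PERF_SAMPLE_STREAM_ID)
        size += sizeof(uint64_t);
    if (sample_type & PERF_SAMPLE_CPU)
        size += 2 * sizeof(uint32_t);   /* cpu, res */
    if (sample_type & PERF_SAMPLE_IDENTIFIER)
        size += sizeof(uint64_t);

    /* Every member occupies a whole number of 8-byte words. */
    assert(size % sizeof(uint64_t) == 0);
    return size;
}
```

              A parser can subtract this size from the record size to find
              where the trailer begins.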
754
755       exclude_host (since Linux 3.2)
756              When conducting measurements that include processes  running  VM
757              instances (i.e., have executed a KVM_RUN ioctl(2)), only measure
758              events happening inside a guest instance.  This is only meaning‐
759              ful  outside  the  guests;  this  setting does not change counts
760              gathered inside of a guest.  Currently,  this  functionality  is
761              x86 only.
762
763       exclude_guest (since Linux 3.2)
764              When  conducting  measurements that include processes running VM
765              instances (i.e., have executed a KVM_RUN ioctl(2)), do not  mea‐
766              sure  events  happening  inside  guest  instances.  This is only
767              meaningful outside the guests;  this  setting  does  not  change
768              counts  gathered inside of a guest.  Currently, this functional‐
769              ity is x86 only.
770
771       exclude_callchain_kernel (since Linux 3.7)
772              Do not include kernel callchains.
773
774       exclude_callchain_user (since Linux 3.7)
775              Do not include user callchains.
776
777       mmap2 (since Linux 3.16)
778              Generate an extended executable mmap record that contains enough
779              additional  information  to  uniquely  identify shared mappings.
780              The mmap flag must also be set for this to work.
781
782       comm_exec (since Linux 3.16)
              This is purely a feature-detection flag; it does not change
              kernel behavior.  If this flag can successfully be set, then, when
785              comm is enabled, the PERF_RECORD_MISC_COMM_EXEC flag will be set
786              in  the  misc  field of a comm record header if the rename event
787              being reported was caused by a call to execve(2).   This  allows
788              tools to distinguish between the various types of process renam‐
789              ing.
790
791       use_clockid (since Linux 4.1)
792              This allows selecting which internal Linux  clock  to  use  when
793              generating  timestamps  via the clockid field.  This can make it
794              easier to correlate perf sample times with timestamps  generated
795              by other tools.
796
797       context_switch (since Linux 4.3)
798              This enables the generation of PERF_RECORD_SWITCH records when a
799              context switch  occurs.   It  also  enables  the  generation  of
800              PERF_RECORD_SWITCH_CPU_WIDE  records  when  sampling in CPU-wide
801              mode.  This functionality is in addition to existing  tracepoint
802              and  software events for measuring context switches.  The advan‐
803              tage of this method is that it will give full  information  even
804              with strict perf_event_paranoid settings.
805
806       write_backward (since Linux 4.6)
807              This  causes  the  ring buffer to be written from the end to the
808              beginning.  This is to support reading  from  overwritable  ring
809              buffer.
810
811       namespaces (since Linux 4.11)
812              This  enables  the  generation of PERF_RECORD_NAMESPACES records
813              when a task enters a new namespace.  Each namespace has a combi‐
814              nation of device and inode numbers.
815
816       ksymbol (since Linux 5.0)
              This enables the generation of PERF_RECORD_KSYMBOL records when
              new kernel symbols are registered or unregistered.  This is
              helpful for analyzing dynamic kernel functions such as eBPF
              programs.
820
821       bpf_event (since Linux 5.0)
822              This  enables  the  generation  of PERF_RECORD_BPF_EVENT records
823              when an eBPF program is loaded or unloaded.
824
825       auxevent (since Linux 5.4)
826              This allows normal (non-AUX) events to  generate  data  for  AUX
827              events if the hardware supports it.
828
829       cgroup (since Linux 5.7)
830              This enables the generation of PERF_RECORD_CGROUP records when a
831              new cgroup is created (and activated).
832
833       text_poke (since Linux 5.8)
834              This enables the  generation  of  PERF_RECORD_TEXT_POKE  records
835              when  there's  a change to the kernel text (i.e., self-modifying
836              code).
837
838       wakeup_events, wakeup_watermark
839              This union  sets  how  many  samples  (wakeup_events)  or  bytes
840              (wakeup_watermark)  happen  before an overflow notification hap‐
841              pens.  Which one is used is selected by the watermark bit flag.
842
843              wakeup_events counts only PERF_RECORD_SAMPLE record  types.   To
844              receive  overflow  notification for all PERF_RECORD types choose
845              watermark and set wakeup_watermark to 1.
846
847              Prior to Linux 3.0, setting wakeup_events to 0  resulted  in  no
848              overflow  notifications; more recent kernels treat 0 the same as
849              1.
850
851       bp_type (since Linux 2.6.33)
852              This chooses the breakpoint type.  It is one of:
853
854              HW_BREAKPOINT_EMPTY
855                     No breakpoint.
856
857              HW_BREAKPOINT_R
858                     Count when we read the memory location.
859
860              HW_BREAKPOINT_W
861                     Count when we write the memory location.
862
863              HW_BREAKPOINT_RW
864                     Count when we read or write the memory location.
865
866              HW_BREAKPOINT_X
867                     Count when we execute code at the memory location.
868
869              The values can be combined via a bitwise or, but the combination
870              of  HW_BREAKPOINT_R  or  HW_BREAKPOINT_W with HW_BREAKPOINT_X is
871              not allowed.
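
              The combination rule can be checked before calling
              perf_event_open().  This is a sketch of the rule as stated
              above, not the kernel's own validation; bp_type_combination_ok()
              is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/hw_breakpoint.h>

static_assert(HW_BREAKPOINT_RW == (HW_BREAKPOINT_R | HW_BREAKPOINT_W),
              "HW_BREAKPOINT_RW is expected to be R | W");

/* Return true if bp_type uses only the documented bits and does not
 * mix an execute breakpoint with a read or write breakpoint. */
bool bp_type_combination_ok(uint32_t bp_type)
{
    const uint32_t known = HW_BREAKPOINT_RW | HW_BREAKPOINT_X;

    if (bp_type & ~known)
        return false;                   /* undocumented bits */
    if ((bp_type & HW_BREAKPOINT_X) && (bp_type & HW_BREAKPOINT_RW))
        return false;                   /* X combined with R or W */
    return true;
}
```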
872
873       bp_addr (since Linux 2.6.33)
874              This is the address of the  breakpoint.   For  execution  break‐
875              points,  this is the memory address of the instruction of inter‐
876              est; for read and write breakpoints, it is the memory address of
877              the memory location of interest.
878
879       config1 (since Linux 2.6.39)
880              config1  is  used for setting events that need an extra register
881              or otherwise do not fit in the regular config field.   Raw  OFF‐
882              CORE_EVENTS  on  Nehalem/Westmere/SandyBridge  use this field on
883              Linux 3.3 and later kernels.
884
885       bp_len (since Linux 2.6.33)
886              bp_len is the length of the breakpoint being measured if type is
887              PERF_TYPE_BREAKPOINT.     Options    are    HW_BREAKPOINT_LEN_1,
888              HW_BREAKPOINT_LEN_2,    HW_BREAKPOINT_LEN_4,    and    HW_BREAK‐
889              POINT_LEN_8.    For   an   execution  breakpoint,  set  this  to
890              sizeof(long).
891
892       config2 (since Linux 2.6.39)
893              config2 is a further extension of the config1 field.
894
895       branch_sample_type (since Linux 3.4)
896              If PERF_SAMPLE_BRANCH_STACK is enabled, then this specifies what
897              branches to include in the branch record.
898
              The first part of the value is the privilege level, which is a
              combination of one or more of the values listed below.  If the
              user does not set the privilege level explicitly, the kernel
              will use the event's privilege level.  Event and branch
              privilege levels do not have to match.
904
905              PERF_SAMPLE_BRANCH_USER
906                     Branch target is in user space.
907
908              PERF_SAMPLE_BRANCH_KERNEL
909                     Branch target is in kernel space.
910
911              PERF_SAMPLE_BRANCH_HV
912                     Branch target is in hypervisor.
913
914              PERF_SAMPLE_BRANCH_PLM_ALL
915                     A  convenience  value  that is the three preceding values
916                     ORed together.
917
              In addition to the privilege value, at least one of the
              following bits must be set.
920
921              PERF_SAMPLE_BRANCH_ANY
922                     Any branch type.
923
924              PERF_SAMPLE_BRANCH_ANY_CALL
925                     Any  call  branch (includes direct calls, indirect calls,
926                     and far jumps).
927
928              PERF_SAMPLE_BRANCH_IND_CALL
929                     Indirect calls.
930
931              PERF_SAMPLE_BRANCH_CALL (since Linux 4.4)
932                     Direct calls.
933
934              PERF_SAMPLE_BRANCH_ANY_RETURN
935                     Any return branch.
936
937              PERF_SAMPLE_BRANCH_IND_JUMP (since Linux 4.2)
938                     Indirect jumps.
939
940              PERF_SAMPLE_BRANCH_COND (since Linux 3.16)
941                     Conditional branches.
942
943              PERF_SAMPLE_BRANCH_ABORT_TX (since Linux 3.11)
944                     Transactional memory aborts.
945
946              PERF_SAMPLE_BRANCH_IN_TX (since Linux 3.11)
947                     Branch in transactional memory transaction.
948
949              PERF_SAMPLE_BRANCH_NO_TX (since Linux 3.11)
                     Branch not in transactional memory transaction.

              PERF_SAMPLE_BRANCH_CALL_STACK (since Linux 4.1)
                     Branch is part of a hardware-generated call stack.  This
                     requires hardware support, currently only found on Intel
                     x86 Haswell or newer.
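
              The requirement that a branch type bit accompany the privilege
              bits can be tested as sketched below.  The helper name
              branch_sample_type_has_filter() is hypothetical, and the check
              covers only the rule described in this section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/perf_event.h>

static_assert(PERF_SAMPLE_BRANCH_PLM_ALL ==
              (PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_KERNEL |
               PERF_SAMPLE_BRANCH_HV),
              "PLM_ALL is expected to be the three privilege bits");

/* Return true if at least one bit other than the privilege-level
 * bits is set in branch_sample_type. */
bool branch_sample_type_has_filter(uint64_t branch_sample_type)
{
    uint64_t plm = PERF_SAMPLE_BRANCH_PLM_ALL;

    return (branch_sample_type & ~plm) != 0;
}
```

              For example, PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY
              passes the check, whereas PERF_SAMPLE_BRANCH_USER alone does
              not.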
955
956       sample_regs_user (since Linux 3.7)
957              This bit mask defines the set of user CPU registers to  dump  on
958              samples.   The  layout of the register mask is architecture-spe‐
959              cific and is described in the kernel header  file  arch/ARCH/in‐
960              clude/uapi/asm/perf_regs.h.
961
962       sample_stack_user (since Linux 3.7)
963              This  defines  the  size  of the user stack to dump if PERF_SAM‐
964              PLE_STACK_USER is specified.
965
966       clockid (since Linux 4.1)
967              If use_clockid is set, then this field  selects  which  internal
968              Linux timer to use for timestamps.  The available timers are de‐
969              fined  in  linux/time.h,   with   CLOCK_MONOTONIC,   CLOCK_MONO‐
970              TONIC_RAW,  CLOCK_REALTIME,  CLOCK_BOOTTIME,  and CLOCK_TAI cur‐
971              rently supported.
972
973       aux_watermark (since Linux 4.1)
974              This  specifies  how  much  data  is  required  to   trigger   a
975              PERF_RECORD_AUX sample.
976
977       sample_max_stack (since Linux 4.8)
978              When  sample_type  includes  PERF_SAMPLE_CALLCHAIN,  this  field
979              specifies how many stack frames to report  when  generating  the
980              callchain.
981
982   Reading results
983       Once a perf_event_open() file descriptor has been opened, the values of
984       the events can be read from the file descriptor.  The values  that  are
985       there  are  specified by the read_format field in the attr structure at
986       open time.
987
988       If you attempt to read into a buffer that is not big enough to hold the
989       data, the error ENOSPC results.
990
991       Here is the layout of the data returned by a read:
992
993       * If  PERF_FORMAT_GROUP  was specified to allow reading all events in a
994         group at once:
995
996             struct read_format {
997                 u64 nr;            /* The number of events */
998                 u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
999                 u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
1000                 struct {
1001                     u64 value;     /* The value of the event */
1002                     u64 id;        /* if PERF_FORMAT_ID */
1003                 } values[nr];
1004             };
1005
1006       * If PERF_FORMAT_GROUP was not specified:
1007
1008             struct read_format {
1009                 u64 value;         /* The value of the event */
1010                 u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
1011                 u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
1012                 u64 id;            /* if PERF_FORMAT_ID */
1013             };
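
       The two layouts can be handled in code roughly as follows.  This is a
       sketch that assumes only the four read_format bits described in this
       page; the names read_buffer_size() and group_read_find() are
       hypothetical.  The first function gives a buffer size large enough to
       avoid ENOSPC, and the second looks up one event's value by its ID in a
       group read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* Bytes returned by read(2) for the given read_format.  nr_events is
 * the group size; it is ignored without PERF_FORMAT_GROUP. */
size_t read_buffer_size(uint64_t read_format, uint64_t nr_events)
{
    size_t fixed = 0;       /* time_enabled, time_running */
    size_t per_event = 1;   /* value */

    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
        fixed++;
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
        fixed++;
    if (read_format & PERF_FORMAT_ID)
        per_event++;        /* id */

    if (read_format & PERF_FORMAT_GROUP)
        return (1 + fixed + per_event * (size_t)nr_events) *
               sizeof(uint64_t);
    return (fixed + per_event) * sizeof(uint64_t);
}

/* Find the value for event `id` in a buffer filled by a group read.
 * Requires PERF_FORMAT_GROUP | PERF_FORMAT_ID.
 * Returns 0 on success, -1 if the ID is not present. */
int group_read_find(const uint64_t *buf, uint64_t read_format,
                    uint64_t id, uint64_t *value)
{
    assert(buf != NULL && value != NULL);
    assert((read_format & PERF_FORMAT_GROUP) &&
           (read_format & PERF_FORMAT_ID));

    uint64_t nr = buf[0];
    size_t pos = 1;         /* skip nr */

    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
        pos++;
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
        pos++;

    for (uint64_t i = 0; i < nr; i++, pos += 2) {
        if (buf[pos + 1] == id) {
            *value = buf[pos];
            return 0;
        }
    }
    return -1;
}
```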
1014
1015       The values read are as follows:
1016
1017       nr     The number of events in this file descriptor.  Available only if
1018              PERF_FORMAT_GROUP was specified.
1019
1020       time_enabled, time_running
              Total time the event was enabled and running.  Normally these
              values are the same.  Multiplexing happens if the number of
              events is more than the number of available PMU counter slots.
              In that case the events run only part of the time and the
              time_enabled and time_running values can be used to scale an
              estimated value for the count.
1027
1028       value  An unsigned 64-bit value containing the counter result.
1029
1030       id     A globally unique value for this particular event; only  present
1031              if PERF_FORMAT_ID was specified in read_format.
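
       The scaling mentioned under time_enabled and time_running can be
       sketched as below.  The function name scale_count() is hypothetical,
       and the result is only an estimate that assumes the event rate was
       similar while the event was not scheduled.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate the count an event would have reached had it been
 * running for the whole time it was enabled. */
uint64_t scale_count(uint64_t value, uint64_t time_enabled,
                     uint64_t time_running)
{
    /* Expected for values returned by read(2). */
    assert(time_running <= time_enabled);

    if (time_running == 0)
        return 0;           /* never scheduled: nothing to scale */
    if (time_running == time_enabled)
        return value;       /* no multiplexing took place */

    /* Split the division to reduce the risk of overflow in
     * value * time_enabled; very large times can still overflow. */
    uint64_t quot = value / time_running;
    uint64_t rem = value % time_running;

    return quot * time_enabled + (rem * time_enabled) / time_running;
}
```

       For instance, a value of 1000 counted while running for half of the
       enabled time scales to an estimate of 2000.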
1032
1033   MMAP layout
1034       When using perf_event_open() in sampled mode, asynchronous events (like
1035       counter overflow or PROT_EXEC mmap tracking) are logged  into  a  ring-
1036       buffer.  This ring-buffer is created and accessed through mmap(2).
1037
1038       The mmap size should be 1+2^n pages, where the first page is a metadata
1039       page (struct perf_event_mmap_page) that contains various bits of infor‐
1040       mation such as where the ring-buffer head is.
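
       The length to pass to mmap(2) can be computed as in this sketch, where
       perf_mmap_length() is a hypothetical helper and page_size would
       normally come from sysconf(_SC_PAGESIZE):

```c
#include <assert.h>
#include <stddef.h>

/* Length of a perf mapping: one metadata page plus 2^n data pages. */
size_t perf_mmap_length(unsigned int n, size_t page_size)
{
    /* Keep the shift within the width of size_t. */
    assert(n < 8 * sizeof(size_t) - 1);

    return ((size_t)1 + ((size_t)1 << n)) * page_size;
}
```

       With 4096-byte pages and n equal to 3, this gives 9 pages, or 36864
       bytes.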
1041
       Before Linux 2.6.39, there was a bug that meant you had to allocate an
       mmap ring buffer when sampling even if you did not plan to access it.
1044
1045       The structure of the first metadata mmap page is as follows:
1046
1047           struct perf_event_mmap_page {
1048               __u32 version;        /* version number of this structure */
1049               __u32 compat_version; /* lowest version this is compat with */
1050               __u32 lock;           /* seqlock for synchronization */
1051               __u32 index;          /* hardware counter identifier */
1052               __s64 offset;         /* add to hardware counter value */
1053               __u64 time_enabled;   /* time event active */
1054               __u64 time_running;   /* time event on CPU */
1055               union {
1056                   __u64   capabilities;
1057                   struct {
1058                       __u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
1059                             cap_bit0_is_deprecated : 1,
1060                             cap_user_rdpmc         : 1,
1061                             cap_user_time          : 1,
1062                             cap_user_time_zero     : 1,
1063                   };
1064               };
1065               __u16 pmc_width;
1066               __u16 time_shift;
1067               __u32 time_mult;
1068               __u64 time_offset;
1069               __u64 __reserved[120];   /* Pad to 1 k */
1070               __u64 data_head;         /* head in the data section */
1071               __u64 data_tail;         /* user-space written tail */
1072               __u64 data_offset;       /* where the buffer starts */
1073               __u64 data_size;         /* data buffer size */
1074               __u64 aux_head;
1075               __u64 aux_tail;
1076               __u64 aux_offset;
1077               __u64 aux_size;
           };
1080
1081       The following list describes the  fields  in  the  perf_event_mmap_page
1082       structure in more detail:
1083
1084       version
1085              Version number of this structure.
1086
1087       compat_version
1088              The lowest version this is compatible with.
1089
1090       lock   A seqlock for synchronization.
1091
1092       index  A unique hardware counter identifier.
1093
1094       offset When  using  rdpmc  for reads this offset value must be added to
1095              the one returned by rdpmc to get the current total event count.
1096
1097       time_enabled
1098              Time the event was active.
1099
1100       time_running
1101              Time the event was running.
1102
1103       cap_usr_time / cap_usr_rdpmc / cap_bit0 (since Linux 3.4)
1104              There  was  a  bug  in  the  definition  of   cap_usr_time   and
1105              cap_usr_rdpmc  from  Linux 3.4 until Linux 3.11.  Both bits were
1106              defined to point to the same location, so it was  impossible  to
1107              know if cap_usr_time or cap_usr_rdpmc were actually set.
1108
1109              Starting  with Linux 3.12, these are renamed to cap_bit0 and you
1110              should use the cap_user_time and cap_user_rdpmc fields instead.
1111
1112       cap_bit0_is_deprecated (since Linux 3.12)
1113              If set, this bit indicates that the kernel supports the properly
1114              separated cap_user_time and cap_user_rdpmc bits.
1115
1116              If  not-set, it indicates an older kernel where cap_usr_time and
1117              cap_usr_rdpmc map to the same bit and thus both features  should
1118              be used with caution.
1119
1120       cap_user_rdpmc (since Linux 3.12)
1121              If the hardware supports user-space read of performance counters
1122              without syscall (this is the "rdpmc" instruction on  x86),  then
1123              the following code can be used to do a read:
1124
                  u32 seq, time_mult, time_shift, idx, width;
                  u64 count, enabled, running;
                  u64 cyc, time_offset;

                  do {
                      seq = pc->lock;     /* snapshot the seqlock */
                      barrier();
                      enabled = pc->time_enabled;
                      running = pc->time_running;

                      /* time fields matter only if the event was
                         not running the whole time it was enabled */
                      if (pc->cap_user_time && enabled != running) {
                          cyc = rdtsc();
                          time_offset = pc->time_offset;
                          time_mult   = pc->time_mult;
                          time_shift  = pc->time_shift;
                      }

                      idx = pc->index;
                      count = pc->offset;

                      /* rdpmc takes the counter number idx - 1 */
                      if (pc->cap_user_rdpmc && idx) {
                          width = pc->pmc_width;  /* see pmc_width */
                          count += rdpmc(idx - 1);
                      }

                      barrier();
                  } while (pc->lock != seq);  /* retry if page changed */
1152
1153       cap_user_time (since Linux 3.12)
1154              This  bit  indicates  the hardware has a constant, nonstop time‐
1155              stamp counter (TSC on x86).
1156
1157       cap_user_time_zero (since Linux 3.12)
1158              Indicates the presence of time_zero which allows  mapping  time‐
1159              stamp values to the hardware clock.
1160
1161       pmc_width
              If cap_user_rdpmc is set, this field provides the bit-width of
              the value read using the rdpmc or equivalent instruction.  This
              can be used to sign extend the result like:
1165
1166                  pmc <<= 64 - pmc_width;
1167                  pmc >>= 64 - pmc_width; // signed shift right
1168                  count += pmc;
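
              In portable C the left shift is best done on an unsigned value,
              since shifting a signed value into the sign bit is undefined.
              The sketch below assumes a compiler that converts to int64_t by
              wrapping and shifts negative values arithmetically, as GCC and
              Clang do; sign_extend_pmc() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a raw counter value that is pmc_width bits wide. */
int64_t sign_extend_pmc(uint64_t raw, unsigned int pmc_width)
{
    assert(pmc_width >= 1 && pmc_width <= 64);

    unsigned int shift = 64 - pmc_width;

    /* Unsigned left shift moves the counter's top bit into bit 63;
     * the signed right shift then replicates it downwards. */
    return (int64_t)(raw << shift) >> shift;
}
```

              The returned value can then be added to the count read from the
              offset field.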
1169
1170       time_shift, time_mult, time_offset
1171
              If cap_user_time is set, these fields can be used to compute
              the time delta since time_enabled (in nanoseconds) using rdtsc
              or similar.
1175
1176                  u64 quot, rem;
1177                  u64 delta;
1178
1179                  quot  = cyc >> time_shift;
1180                  rem   = cyc & (((u64)1 << time_shift) - 1);
1181                  delta = time_offset + quot * time_mult +
1182                          ((rem * time_mult) >> time_shift);
1183
              Here, time_offset, time_mult, time_shift, and cyc are read in
              the seqlock loop described above.  This delta can then be added
              to enabled and possibly running (if idx), improving the scaling:
1187
1188                  enabled += delta;
1189                  if (idx)
1190                      running += delta;
1191                  quot  = count / running;
1192                  rem   = count % running;
1193                  count = quot * enabled + (rem * enabled) / running;
1194
1195       time_zero (since Linux 3.12)
1196
              If cap_user_time_zero is set, then the hardware clock (the TSC
              timestamp counter on x86) can be calculated from the time_zero,
              time_mult, and time_shift values:
1200
1201                  time = timestamp - time_zero;
1202                  quot = time / time_mult;
1203                  rem  = time % time_mult;
1204                  cyc  = (quot << time_shift) + (rem << time_shift) / time_mult;
1205
1206              And vice versa:
1207
1208                  quot = cyc >> time_shift;
1209                  rem  = cyc & (((u64)1 << time_shift) - 1);
1210                  timestamp = time_zero + quot * time_mult +
1211                              ((rem * time_mult) >> time_shift);
1212
1213       data_head
              This points to the head of the data section.  The value
              continuously increases; it does not wrap.  The value needs to
              be manually wrapped by the size of the mmap buffer before
              accessing the samples.
1218
1219              On SMP-capable platforms, after  reading  the  data_head  value,
1220              user space should issue an rmb().
1221
1222       data_tail
1223              When  the  mapping  is PROT_WRITE, the data_tail value should be
1224              written by user space to reflect the last read  data.   In  this
1225              case, the kernel will not overwrite unread data.
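
              The wrapping and ordering rules for data_head and data_tail can
              be sketched as follows.  The helper names are hypothetical, and
              the GCC/Clang __atomic builtins are used here as an assumed
              stand-in for the rmb() mentioned above.  The data area is
              assumed to be a power of two in size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/types.h>

/* Load data_head so that later reads of the ring are ordered after it. */
__u64 load_data_head(const __u64 *data_head)
{
    return __atomic_load_n(data_head, __ATOMIC_ACQUIRE);
}

/* Publish data_tail once the data before `tail` has been consumed. */
void store_data_tail(__u64 *data_tail, __u64 tail)
{
    __atomic_store_n(data_tail, tail, __ATOMIC_RELEASE);
}

/* Copy len bytes starting at the unwrapped position pos out of a
 * ring of data_size bytes, handling a wrap past the end. */
void ring_copy(void *dst, const uint8_t *data, uint64_t data_size,
               uint64_t pos, size_t len)
{
    assert(data_size != 0 && (data_size & (data_size - 1)) == 0);
    assert(len <= data_size);

    uint64_t off = pos & (data_size - 1);   /* manual wrap */
    size_t first = (data_size - off < len) ? (size_t)(data_size - off)
                                           : len;

    memcpy(dst, data + off, first);
    memcpy((uint8_t *)dst + first, data, len - first);
}
```

              A consumer would load data_head, copy out the records between
              its saved tail and that head, and then store the new tail.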
1226
1227       data_offset (since Linux 4.1)
1228              Contains  the  offset  of  the location in the mmap buffer where
1229              perf sample data begins.
1230
1231       data_size (since Linux 4.1)
1232              Contains the size of the perf sample region within the mmap buf‐
1233              fer.
1234
1235       aux_head, aux_tail, aux_offset, aux_size (since Linux 4.1)
1236              The  AUX  region allows mmap(2)-ing a separate sample buffer for
1237              high-bandwidth data streams (separate from the main perf  sample
1238              buffer).   An  example of a high-bandwidth stream is instruction
1239              tracing support, as is found in newer Intel processors.
1240
1241              To set up an AUX area, first aux_offset needs to be set with  an
1242              offset  greater than data_offset+data_size and aux_size needs to
1243              be set to the desired buffer size.  The desired offset and  size
1244              must  be  page  aligned,  and  the  size must be a power of two.
1245              These values are then passed to mmap in order  to  map  the  AUX
1246              buffer.   Pages  in  the  AUX buffer are included as part of the
1247              RLIMIT_MEMLOCK resource limit (see setrlimit(2)),  and  also  as
1248              part of the perf_event_mlock_kb allowance.
1249
              By default, the AUX buffer will be truncated if it will not fit
              in the available space in the ring buffer.  If the AUX buffer is
              mapped as a read-only buffer, then it will operate in ring-buffer
              mode, where old data will be overwritten by new.  In overwrite
              mode, it might not be possible to infer where the new data
              began, and it is the consumer's job to disable measurement while
              reading to avoid possible data races.
1257
              The aux_head and aux_tail ring-buffer pointers have the same
              behavior and ordering rules as the previously described
              data_head and data_tail.
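
              The alignment rules above can be sketched as a small helper
              that derives candidate aux_offset and aux_size values.  The
              struct and function names are hypothetical, page_size is
              assumed to be a power of two, and placing the area exactly at
              the first page boundary at or after data_offset+data_size is
              an assumption of this sketch; move it further out if a
              strictly greater offset is required.

```c
#include <stdint.h>

/* Hypothetical holder for the two values written to the metadata page. */
struct aux_layout {
    uint64_t offset;    /* candidate aux_offset */
    uint64_t size;      /* candidate aux_size */
};

/* Round v up to the next power of two (v >= 1). */
static uint64_t round_up_pow2(uint64_t v)
{
    uint64_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/*
 * Pick a page-aligned offset at or after the end of the data area and a
 * power-of-two size of at least one page that covers the wanted bytes.
 */
static struct aux_layout
compute_aux_layout(uint64_t data_offset, uint64_t data_size,
                   uint64_t wanted, uint64_t page_size)
{
    struct aux_layout l;
    uint64_t end = data_offset + data_size;

    l.offset = (end + page_size - 1) & ~(page_size - 1);
    l.size = round_up_pow2(wanted < page_size ? page_size : wanted);
    return l;
}
```

              The resulting values would be stored in aux_offset and
              aux_size and then used as the offset and length of the
              mmap(2) call that maps the AUX buffer.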
1261
1262       The following 2^n ring-buffer pages have the layout described below.
1263
       If perf_event_attr.sample_id_all is set, then all event types will
       include the sample_type-selected fields that identify where and when
       the event took place (TID, TIME, ID, CPU, STREAM_ID), as described
       in PERF_RECORD_SAMPLE below.  These fields are stashed just after the
       perf_event_header and the fields already present for the record type,
       that is, at the end of the payload.  This allows a newer perf.data
       file to be supported by older perf tools, with the new optional
       fields being ignored.
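
       Because the identity fields sit at the end of the payload, a parser
       can locate them from the record size alone.  The following is a
       minimal sketch; the function names are hypothetical,
       __builtin_popcountll() is a GCC/Clang builtin, and counting
       PERF_SAMPLE_IDENTIFIER as one of the trailer fields is an assumption
       based on the struct sample_id layout rather than on the paragraph
       above.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the identity trailer implied by sample_type. */
static size_t sample_id_size(uint64_t sample_type)
{
    const uint64_t id_bits = PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
                             PERF_SAMPLE_ID | PERF_SAMPLE_STREAM_ID |
                             PERF_SAMPLE_CPU | PERF_SAMPLE_IDENTIFIER;

    /* Each selected field (or u32 pair) occupies exactly 8 bytes. */
    return 8 * (size_t)__builtin_popcountll(sample_type & id_bits);
}

/*
 * Offset of the trailer inside a non-sample record of record_size bytes.
 * The caller is expected to pass a size at least as large as the trailer.
 */
static size_t sample_id_offset(uint16_t record_size, uint64_t sample_type)
{
    return record_size - sample_id_size(sample_type);
}
```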
1272
1273       The mmap values start with a header:
1274
1275           struct perf_event_header {
1276               __u32   type;
1277               __u16   misc;
1278               __u16   size;
1279           };
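
       Since every record begins with this header and header.size gives the
       length of the whole record, a consumer can step from one record to
       the next.  The sketch below walks the records between a tail and a
       head position in a ring of power-of-two size, copying each record out
       so that records that wrap around the end of the buffer are handled.
       All names (ring_copy, walk_records, count_record) are hypothetical.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Callback invoked once per record (hypothetical, for illustration). */
typedef void (*record_fn)(const struct perf_event_header *hdr, void *ctx);

/* Copy len bytes from logical position pos, wrapping at size (2^n). */
static void ring_copy(void *dst, const uint8_t *ring, uint64_t size,
                      uint64_t pos, size_t len)
{
    uint64_t off = pos & (size - 1);
    size_t first = len;

    if (off + len > size)
        first = (size_t)(size - off);   /* bytes before the wrap point */
    memcpy(dst, ring + off, first);
    memcpy((uint8_t *)dst + first, ring, len - first);
}

/* Visit each complete record in [tail, head) and return the new tail. */
static uint64_t walk_records(const uint8_t *ring, uint64_t size,
                             uint64_t tail, uint64_t head,
                             record_fn fn, void *ctx)
{
    /* header.size is 16 bits wide, so 64 KiB holds any record. */
    union {
        struct perf_event_header hdr;
        uint8_t raw[UINT16_MAX + 1];
    } rec;

    while (head - tail >= sizeof(rec.hdr)) {
        ring_copy(&rec.hdr, ring, size, tail, sizeof(rec.hdr));
        if (rec.hdr.size < sizeof(rec.hdr) || rec.hdr.size > head - tail)
            break;                      /* corrupt or incomplete record */
        ring_copy(rec.raw, ring, size, tail, rec.hdr.size);
        fn(&rec.hdr, ctx);
        tail += rec.hdr.size;
    }
    return tail;
}

/* Example callback: count the records seen. */
static void count_record(const struct perf_event_header *hdr, void *ctx)
{
    (void)hdr;
    ++*(unsigned *)ctx;
}
```

       In real use, ring would point at the data area of the mapping, head
       would be loaded from data_head followed by a read barrier, and the
       returned value would be stored to data_tail once the records have
       been consumed.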
1280
1281       Below, we describe the perf_event_header fields in  more  detail.   For
1282       ease  of  reading,  the  fields with shorter descriptions are presented
1283       first.
1284
1285       size   This indicates the size of the record.
1286
1287       misc   The misc field contains additional information about the sample.
1288
1289              The CPU mode can be determined from this value by  masking  with
1290              PERF_RECORD_MISC_CPUMODE_MASK and looking for one of the follow‐
1291              ing (note these are not bit masks, only one  can  be  set  at  a
1292              time):
1293
1294              PERF_RECORD_MISC_CPUMODE_UNKNOWN
1295                     Unknown CPU mode.
1296
1297              PERF_RECORD_MISC_KERNEL
1298                     Sample happened in the kernel.
1299
1300              PERF_RECORD_MISC_USER
1301                     Sample happened in user code.
1302
1303              PERF_RECORD_MISC_HYPERVISOR
1304                     Sample happened in the hypervisor.
1305
1306              PERF_RECORD_MISC_GUEST_KERNEL (since Linux 2.6.35)
1307                     Sample happened in the guest kernel.
1308
1309              PERF_RECORD_MISC_GUEST_USER  (since Linux 2.6.35)
1310                     Sample happened in guest user code.
1311
1312              Since  the  following  three statuses are generated by different
1313              record types, they alias to the same bit:
1314
1315              PERF_RECORD_MISC_MMAP_DATA (since Linux 3.10)
1316                     This is set when the mapping is not executable; otherwise
1317                     the mapping is executable.
1318
1319              PERF_RECORD_MISC_COMM_EXEC (since Linux 3.16)
1320                     This is set for a PERF_RECORD_COMM record on kernels more
1321                     recent than Linux 3.16  if  a  process  name  change  was
1322                     caused by an execve(2) system call.
1323
1324              PERF_RECORD_MISC_SWITCH_OUT (since Linux 4.3)
1325                     When  a PERF_RECORD_SWITCH or PERF_RECORD_SWITCH_CPU_WIDE
1326                     record is generated, this bit indicates that the  context
1327                     switch  is away from the current process (instead of into
1328                     the current process).
1329
1330              In addition, the following bits can be set:
1331
1332              PERF_RECORD_MISC_EXACT_IP
1333                     This indicates that the content of PERF_SAMPLE_IP  points
1334                     to  the actual instruction that triggered the event.  See
1335                     also perf_event_attr.precise_ip.
1336
1337              PERF_RECORD_MISC_EXT_RESERVED (since Linux 2.6.35)
1338                     This indicates there is  extended  data  available  (cur‐
1339                     rently not used).
1340
1341              PERF_RECORD_MISC_PROC_MAP_PARSE_TIMEOUT
                     This bit is not set by the kernel.  It is reserved for
                     the user-space perf utility to indicate that
                     /proc/[pid]/maps parsing was taking too long and was
                     stopped, and thus the mmap records may be truncated.
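
              As an illustration of the masking described for the CPU mode,
              the following sketch maps a misc value to a label and tests
              the independent PERF_RECORD_MISC_EXACT_IP bit.  The function
              names and label strings are hypothetical.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Return an illustrative label for the CPU mode encoded in misc. */
static const char *cpumode_name(uint16_t misc)
{
    /* The CPU mode is an enumerated value, not a set of flag bits. */
    switch (misc & PERF_RECORD_MISC_CPUMODE_MASK) {
    case PERF_RECORD_MISC_KERNEL:       return "kernel";
    case PERF_RECORD_MISC_USER:         return "user";
    case PERF_RECORD_MISC_HYPERVISOR:   return "hypervisor";
    case PERF_RECORD_MISC_GUEST_KERNEL: return "guest-kernel";
    case PERF_RECORD_MISC_GUEST_USER:   return "guest-user";
    default:                            return "unknown";
    }
}

/* PERF_RECORD_MISC_EXACT_IP is a separate flag bit in the same field. */
static int has_exact_ip(uint16_t misc)
{
    return (misc & PERF_RECORD_MISC_EXACT_IP) != 0;
}
```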
1346
1347       type   The type value is one of the below.  The values  in  the  corre‐
1348              sponding record (that follows the header) depend on the type se‐
1349              lected as shown.
1350
1351              PERF_RECORD_MMAP
1352                  The MMAP events record the PROT_EXEC mappings so that we can
1353                  correlate  user-space  IPs to code.  They have the following
1354                  structure:
1355
1356                      struct {
1357                          struct perf_event_header header;
1358                          u32    pid, tid;
1359                          u64    addr;
1360                          u64    len;
1361                          u64    pgoff;
1362                          char   filename[];
1363                      };
1364
1365                  pid    is the process ID.
1366
1367                  tid    is the thread ID.
1368
                  addr   is the address of the allocated memory.

                  len    is the length of the allocated memory.

                  pgoff  is the page offset of the allocated memory.

                  filename
                         is a string describing the backing of the allocated
                         memory.
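
                  A minimal sketch of decoding such a record from a byte
                  buffer follows.  The struct mmap_view type and the
                  parse_mmap() function are hypothetical; the sketch assumes
                  the record has already been copied out of the ring buffer
                  in one piece.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded view of a PERF_RECORD_MMAP record. */
struct mmap_view {
    uint32_t pid, tid;
    uint64_t addr, len, pgoff;
    const char *filename;       /* points into the record itself */
};

/* Decode rec (a complete record, header included); returns 0 on success. */
static int parse_mmap(const void *rec, struct mmap_view *out)
{
    const uint8_t *p = rec;
    struct perf_event_header hdr;
    const size_t fixed = sizeof(hdr) + 2 * sizeof(uint32_t) +
                         3 * sizeof(uint64_t);

    memcpy(&hdr, p, sizeof(hdr));
    if (hdr.type != PERF_RECORD_MMAP || hdr.size <= fixed)
        return -1;
    p += sizeof(hdr);
    memcpy(&out->pid, p, 4);    p += 4;
    memcpy(&out->tid, p, 4);    p += 4;
    memcpy(&out->addr, p, 8);   p += 8;
    memcpy(&out->len, p, 8);    p += 8;
    memcpy(&out->pgoff, p, 8);  p += 8;

    /* The filename must be NUL-terminated inside the record. */
    if (memchr(p, '\0', hdr.size - fixed) == NULL)
        return -1;
    out->filename = (const char *)p;
    return 0;
}
```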
1373
1374              PERF_RECORD_LOST
1375                  This record indicates when events are lost.
1376
1377                      struct {
1378                          struct perf_event_header header;
1379                          u64    id;
1380                          u64    lost;
1381                          struct sample_id sample_id;
1382                      };
1383
1384                  id     is  the  unique  event  ID  for the samples that were
1385                         lost.
1386
1387                  lost   is the number of events that were lost.
1388
1389              PERF_RECORD_COMM
1390                  This record indicates a change in the process name.
1391
1392                      struct {
1393                          struct perf_event_header header;
1394                          u32    pid;
1395                          u32    tid;
1396                          char   comm[];
1397                          struct sample_id sample_id;
1398                      };
1399
1400                  pid    is the process ID.
1401
1402                  tid    is the thread ID.
1403
1404                  comm   is a string containing the new name of the process.
1405
1406              PERF_RECORD_EXIT
1407                  This record indicates a process exit event.
1408
1409                      struct {
1410                          struct perf_event_header header;
1411                          u32    pid, ppid;
1412                          u32    tid, ptid;
1413                          u64    time;
1414                          struct sample_id sample_id;
1415                      };
1416
1417              PERF_RECORD_THROTTLE, PERF_RECORD_UNTHROTTLE
1418                  This record indicates a throttle/unthrottle event.
1419
1420                      struct {
1421                          struct perf_event_header header;
1422                          u64    time;
1423                          u64    id;
1424                          u64    stream_id;
1425                          struct sample_id sample_id;
1426                      };
1427
1428              PERF_RECORD_FORK
1429                  This record indicates a fork event.
1430
1431                      struct {
1432                          struct perf_event_header header;
1433                          u32    pid, ppid;
1434                          u32    tid, ptid;
1435                          u64    time;
1436                          struct sample_id sample_id;
1437                      };
1438
1439              PERF_RECORD_READ
1440                  This record indicates a read event.
1441
1442                      struct {
1443                          struct perf_event_header header;
1444                          u32    pid, tid;
1445                          struct read_format values;
1446                          struct sample_id sample_id;
1447                      };
1448
1449              PERF_RECORD_SAMPLE
1450                  This record indicates a sample.
1451
1452                      struct {
1453                          struct perf_event_header header;
1454                          u64    sample_id;   /* if PERF_SAMPLE_IDENTIFIER */
1455                          u64    ip;          /* if PERF_SAMPLE_IP */
1456                          u32    pid, tid;    /* if PERF_SAMPLE_TID */
1457                          u64    time;        /* if PERF_SAMPLE_TIME */
1458                          u64    addr;        /* if PERF_SAMPLE_ADDR */
1459                          u64    id;          /* if PERF_SAMPLE_ID */
1460                          u64    stream_id;   /* if PERF_SAMPLE_STREAM_ID */
1461                          u32    cpu, res;    /* if PERF_SAMPLE_CPU */
1462                          u64    period;      /* if PERF_SAMPLE_PERIOD */
1463                          struct read_format v;
1464                                              /* if PERF_SAMPLE_READ */
1465                          u64    nr;          /* if PERF_SAMPLE_CALLCHAIN */
1466                          u64    ips[nr];     /* if PERF_SAMPLE_CALLCHAIN */
1467                          u32    size;        /* if PERF_SAMPLE_RAW */
1468                          char   data[size];  /* if PERF_SAMPLE_RAW */
1469                          u64    bnr;         /* if PERF_SAMPLE_BRANCH_STACK */
1470                          struct perf_branch_entry lbr[bnr];
1471                                              /* if PERF_SAMPLE_BRANCH_STACK */
1472                          u64    abi;         /* if PERF_SAMPLE_REGS_USER */
1473                          u64    regs[weight(mask)];
1474                                              /* if PERF_SAMPLE_REGS_USER */
1475                          u64    size;        /* if PERF_SAMPLE_STACK_USER */
1476                          char   data[size];  /* if PERF_SAMPLE_STACK_USER */
1477                          u64    dyn_size;    /* if PERF_SAMPLE_STACK_USER &&
1478                                                 size != 0 */
1479                          u64    weight;      /* if PERF_SAMPLE_WEIGHT */
1480                          u64    data_src;    /* if PERF_SAMPLE_DATA_SRC */
1481                          u64    transaction; /* if PERF_SAMPLE_TRANSACTION */
1482                          u64    abi;         /* if PERF_SAMPLE_REGS_INTR */
1483                          u64    regs[weight(mask)];
1484                                              /* if PERF_SAMPLE_REGS_INTR */
1485                          u64    phys_addr;   /* if PERF_SAMPLE_PHYS_ADDR */
1486                          u64    cgroup;      /* if PERF_SAMPLE_CGROUP */
1487                      };
1488
1489                  sample_id
1490                      If PERF_SAMPLE_IDENTIFIER is enabled, a 64-bit unique ID
1491                      is  included.   This  is  a duplication of the PERF_SAM‐
1492                      PLE_ID id value, but included at the  beginning  of  the
1493                      sample so parsers can easily obtain the value.
1494
1495                  ip  If  PERF_SAMPLE_IP is enabled, then a 64-bit instruction
1496                      pointer value is included.
1497
1498                  pid, tid
1499                      If PERF_SAMPLE_TID is enabled, then a 32-bit process  ID
1500                      and 32-bit thread ID are included.
1501
1502                  time
1503                      If  PERF_SAMPLE_TIME is enabled, then a 64-bit timestamp
1504                      is included.  This is obtained via  local_clock()  which
1505                      is  a  hardware  timestamp  if available and the jiffies
1506                      value if not.
1507
1508                  addr
1509                      If PERF_SAMPLE_ADDR is enabled, then a 64-bit address is
1510                      included.   This is usually the address of a tracepoint,
1511                      breakpoint, or software event; otherwise the value is 0.
1512
1513                  id  If PERF_SAMPLE_ID is enabled, a 64-bit unique ID is  in‐
1514                      cluded.  If the event is a member of an event group, the
1515                      group leader ID is returned.  This ID is the same as the
1516                      one returned by PERF_FORMAT_ID.
1517
1518                  stream_id
1519                      If  PERF_SAMPLE_STREAM_ID is enabled, a 64-bit unique ID
1520                      is included.  Unlike PERF_SAMPLE_ID the actual ID is re‐
1521                      turned,  not  the  group leader.  This ID is the same as
1522                      the one returned by PERF_FORMAT_ID.
1523
1524                  cpu, res
1525                      If PERF_SAMPLE_CPU is enabled, this is  a  32-bit  value
1526                      indicating  which  CPU  was being used, in addition to a
1527                      reserved (unused) 32-bit value.
1528
1529                  period
1530                      If PERF_SAMPLE_PERIOD is enabled, a 64-bit  value  indi‐
1531                      cating the current sampling period is written.
1532
1533                  v   If  PERF_SAMPLE_READ  is  enabled,  a  structure of type
1534                      read_format is included which has values for all  events
1535                      in  the  event group.  The values included depend on the
1536                      read_format value used at perf_event_open() time.
1537
1538                  nr, ips[nr]
                      If PERF_SAMPLE_CALLCHAIN is enabled, then a 64-bit number
                      is included which indicates how many 64-bit instruction
                      pointers follow.  This is the current callchain.
1543
1544                  size, data[size]
1545                      If PERF_SAMPLE_RAW is enabled, then a 32-bit value indi‐
1546                      cating size is included followed by an  array  of  8-bit
1547                      values  of length size.  The values are padded with 0 to
1548                      have 64-bit alignment.
1549
                      This RAW record data is opaque with respect to the ABI.
                      The ABI doesn't make any promises with respect to the
                      stability of its content; it may vary depending on
                      event, hardware, and kernel version.
1554
1555                  bnr, lbr[bnr]
1556                      If  PERF_SAMPLE_BRANCH_STACK  is  enabled, then a 64-bit
1557                      value indicating the number of records is included, fol‐
1558                      lowed by bnr perf_branch_entry structures which each in‐
1559                      clude the fields:
1560
1561                      from   This indicates the source instruction (may not be
1562                             a branch).
1563
1564                      to     The branch target.
1565
1566                      mispred
1567                             The branch target was mispredicted.
1568
1569                      predicted
1570                             The branch target was predicted.
1571
1572                      in_tx (since Linux 3.11)
1573                             The branch was in a transactional memory transac‐
1574                             tion.
1575
1576                      abort (since Linux 3.11)
1577                             The branch was in an aborted transactional memory
1578                             transaction.
1579
1580                      cycles (since Linux 4.3)
1581                             This  reports  the number of cycles elapsed since
1582                             the previous branch stack update.
1583
1584                      The entries are from most to least recent, so the  first
1585                      entry has the most recent branch.
1586
1587                      Support  for mispred, predicted, and cycles is optional;
1588                      if not supported, those values will be 0.
1589
1590                      The type  of  branches  recorded  is  specified  by  the
1591                      branch_sample_type field.
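
                      As a small usage sketch, the entries can be scanned to
                      count mispredicted branches.  The function name is
                      hypothetical; on hardware that does not report
                      mispred the result is simply 0, as noted above.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Count the entries whose mispred bit is set among bnr branch records. */
static uint64_t count_mispredicted(const struct perf_branch_entry *lbr,
                                   uint64_t bnr)
{
    uint64_t n = 0;

    for (uint64_t i = 0; i < bnr; i++)
        if (lbr[i].mispred)
            n++;
    return n;
}
```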
1592
1593                  abi, regs[weight(mask)]
1594                      If  PERF_SAMPLE_REGS_USER  is enabled, then the user CPU
1595                      registers are recorded.
1596
1597                      The  abi  field  is  one  of  PERF_SAMPLE_REGS_ABI_NONE,
1598                      PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
1599
1600                      The  regs  field  is  an array of the CPU registers that
1601                      were specified by the sample_regs_user attr field.   The
1602                      number  of  values is the number of bits set in the sam‐
1603                      ple_regs_user bit mask.
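
                      The weight(mask) notation in the structure above is
                      the number of bits set in the mask.  A minimal sketch
                      of that computation (the function name is
                      hypothetical):

```c
#include <stdint.h>

/* Number of u64 register values implied by a sample_regs_user mask. */
static unsigned regs_count(uint64_t mask)
{
    unsigned n = 0;

    while (mask) {
        mask &= mask - 1;       /* clear the lowest set bit */
        n++;
    }
    return n;
}
```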
1604
1605                  size, data[size], dyn_size
1606                      If PERF_SAMPLE_STACK_USER  is  enabled,  then  the  user
1607                      stack  is  recorded.  This can be used to generate stack
1608                      backtraces.  size is the size requested by the  user  in
1609                      sample_stack_user or else the maximum record size.  data
1610                      is the stack data (a raw dump of the memory  pointed  to
1611                      by the stack pointer at the time of sampling).  dyn_size
1612                      is the amount of data actually dumped (can be less  than
1613                      size).  Note that dyn_size is omitted if size is 0.
1614
1615                  weight
1616                      If  PERF_SAMPLE_WEIGHT  is  enabled, then a 64-bit value
1617                      provided by the hardware is recorded that indicates  how
1618                      costly  the  event was.  This allows expensive events to
1619                      stand out more clearly in profiles.
1620
1621                  data_src
1622                      If PERF_SAMPLE_DATA_SRC is enabled, then a 64-bit  value
1623                      is recorded that is made up of the following fields:
1624
1625                      mem_op
1626                          Type of opcode, a bitwise combination of:
1627
1628                          PERF_MEM_OP_NA          Not available
1629                          PERF_MEM_OP_LOAD        Load instruction
1630                          PERF_MEM_OP_STORE       Store instruction
1631                          PERF_MEM_OP_PFETCH      Prefetch
1632                          PERF_MEM_OP_EXEC        Executable code
1633
1634                      mem_lvl
1635                          Memory hierarchy level hit or miss, a bitwise combi‐
1636                          nation   of   the   following,   shifted   left   by
1637                          PERF_MEM_LVL_SHIFT:
1638
1639                          PERF_MEM_LVL_NA         Not available
1640                          PERF_MEM_LVL_HIT        Hit
1641                          PERF_MEM_LVL_MISS       Miss
1642                          PERF_MEM_LVL_L1         Level 1 cache
1643                          PERF_MEM_LVL_LFB        Line fill buffer
1644                          PERF_MEM_LVL_L2         Level 2 cache
1645                          PERF_MEM_LVL_L3         Level 3 cache
1646                          PERF_MEM_LVL_LOC_RAM    Local DRAM
1647                          PERF_MEM_LVL_REM_RAM1   Remote DRAM 1 hop
1648                          PERF_MEM_LVL_REM_RAM2   Remote DRAM 2 hops
1649                          PERF_MEM_LVL_REM_CCE1   Remote cache 1 hop
1650                          PERF_MEM_LVL_REM_CCE2   Remote cache 2 hops
1651                          PERF_MEM_LVL_IO         I/O memory
1652                          PERF_MEM_LVL_UNC        Uncached memory
1653
1654                      mem_snoop
1655                          Snoop  mode, a bitwise combination of the following,
1656                          shifted left by PERF_MEM_SNOOP_SHIFT:
1657
1658                          PERF_MEM_SNOOP_NA       Not available
1659                          PERF_MEM_SNOOP_NONE     No snoop
1660                          PERF_MEM_SNOOP_HIT      Snoop hit
1661                          PERF_MEM_SNOOP_MISS     Snoop miss
1662                          PERF_MEM_SNOOP_HITM     Snoop hit modified
1663
1664                      mem_lock
1665                          Lock instruction, a bitwise combination of the  fol‐
1666                          lowing, shifted left by PERF_MEM_LOCK_SHIFT:
1667
1668                          PERF_MEM_LOCK_NA        Not available
1669                          PERF_MEM_LOCK_LOCKED    Locked transaction
1670
1671                      mem_dtlb
1672                          TLB access hit or miss, a bitwise combination of the
1673                          following, shifted left by PERF_MEM_TLB_SHIFT:
1674
1675                          PERF_MEM_TLB_NA         Not available
1676                          PERF_MEM_TLB_HIT        Hit
1677                          PERF_MEM_TLB_MISS       Miss
1678                          PERF_MEM_TLB_L1         Level 1 TLB
1679                          PERF_MEM_TLB_L2         Level 2 TLB
1680                          PERF_MEM_TLB_WK         Hardware walker
1681                          PERF_MEM_TLB_OS         OS fault handler
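
                      The fields can be pulled out of the 64-bit value with
                      union perf_mem_data_src from <linux/perf_event.h>,
                      which avoids hand-written shifts.  The helper names
                      below are hypothetical, and the sketch only covers
                      mem_op and mem_lvl.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Nonzero if the sampled access was a load. */
static int is_load(uint64_t data_src)
{
    union perf_mem_data_src d = { .val = data_src };

    return (d.mem_op & PERF_MEM_OP_LOAD) != 0;
}

/* Nonzero if the access is reported as a hit in the level 1 cache. */
static int is_l1_hit(uint64_t data_src)
{
    union perf_mem_data_src d = { .val = data_src };

    return (d.mem_lvl & PERF_MEM_LVL_HIT) && (d.mem_lvl & PERF_MEM_LVL_L1);
}
```

                      Test values can be built with the PERF_MEM_S() macro
                      from the same header, for example
                      PERF_MEM_S(OP, LOAD) | PERF_MEM_S(LVL, HIT) |
                      PERF_MEM_S(LVL, L1).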
1682
1683                  transaction
1684                      If the  PERF_SAMPLE_TRANSACTION  flag  is  set,  then  a
1685                      64-bit  field  is recorded describing the sources of any
1686                      transactional memory aborts.
1687
1688                      The field is a bitwise combination of the following val‐
1689                      ues:
1690
1691                      PERF_TXN_ELISION
1692                             Abort  from  an  elision type transaction (Intel-
1693                             CPU-specific).
1694
1695                      PERF_TXN_TRANSACTION
1696                             Abort from a generic transaction.
1697
1698                      PERF_TXN_SYNC
1699                             Synchronous abort (related to  the  reported  in‐
1700                             struction).
1701
1702                      PERF_TXN_ASYNC
1703                             Asynchronous  abort  (not related to the reported
1704                             instruction).
1705
1706                      PERF_TXN_RETRY
1707                             Retryable abort  (retrying  the  transaction  may
1708                             have succeeded).
1709
1710                      PERF_TXN_CONFLICT
1711                             Abort due to memory conflicts with other threads.
1712
1713                      PERF_TXN_CAPACITY_WRITE
1714                             Abort due to write capacity overflow.
1715
1716                      PERF_TXN_CAPACITY_READ
1717                             Abort due to read capacity overflow.
1718
                      In addition, a user-specified abort code can be obtained
                      from the high 32 bits of the field by masking with
                      PERF_TXN_ABORT_MASK and then shifting right by
                      PERF_TXN_ABORT_SHIFT.
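
                      A minimal sketch of extracting the abort code and
                      testing a flag follows.  The function names are
                      hypothetical.  The sketch masks before shifting
                      because, in the kernel headers it was written against,
                      PERF_TXN_ABORT_MASK is already positioned over the
                      high 32 bits of the field.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Extract the user-specified abort code from the transaction field. */
static uint32_t txn_abort_code(uint64_t txn)
{
    return (uint32_t)((txn & PERF_TXN_ABORT_MASK) >> PERF_TXN_ABORT_SHIFT);
}

/* Nonzero if the abort was caused by a conflict and may be retried. */
static int txn_retryable_conflict(uint64_t txn)
{
    return (txn & PERF_TXN_CONFLICT) && (txn & PERF_TXN_RETRY);
}
```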
1723
1724                  abi, regs[weight(mask)]
                      If PERF_SAMPLE_REGS_INTR is enabled, then the CPU
                      registers at the time of sampling are recorded; these
                      are kernel registers if the overflow occurred while
                      kernel code was running.
1727
1728                      The  abi  field  is  one  of  PERF_SAMPLE_REGS_ABI_NONE,
1729                      PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
1730
1731                      The regs field is an array of  the  CPU  registers  that
1732                      were  specified by the sample_regs_intr attr field.  The
1733                      number of values is the number of bits set in  the  sam‐
1734                      ple_regs_intr bit mask.
1735
1736                  phys_addr
1737                      If  the  PERF_SAMPLE_PHYS_ADDR  flag  is  set,  then the
1738                      64-bit physical address is recorded.
1739
1740                  cgroup
1741                      If the PERF_SAMPLE_CGROUP flag is set, then  the  64-bit
1742                      cgroup  ID  (for  the perf_event subsystem) is recorded.
                      To get the pathname of the cgroup, the ID should be
                      matched against one in a PERF_RECORD_CGROUP record.
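
                  Because each field of a sample is present only if its
                  sample_type bit is set, a parser has to walk the payload
                  in the order shown in the structure above.  The following
                  sketch decodes only the fixed-size leading fields (up to
                  period) and returns a pointer to whatever follows.  The
                  struct sample_head type and both function names are
                  hypothetical; p is assumed to point just past the
                  perf_event_header.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the fixed-size leading fields of a sample. */
struct sample_head {
    uint64_t identifier, ip, time, addr, id, stream_id, period;
    uint32_t pid, tid, cpu;
};

/* Read one u64 at *p and advance the cursor. */
static uint64_t take_u64(const uint8_t **p)
{
    uint64_t v;

    memcpy(&v, *p, sizeof(v));
    *p += sizeof(v);
    return v;
}

/* Decode the leading fields selected by sample_type; absent ones stay 0. */
static const uint8_t *
parse_sample_head(const uint8_t *p, uint64_t sample_type,
                  struct sample_head *s)
{
    memset(s, 0, sizeof(*s));
    if (sample_type & PERF_SAMPLE_IDENTIFIER)
        s->identifier = take_u64(&p);
    if (sample_type & PERF_SAMPLE_IP)
        s->ip = take_u64(&p);
    if (sample_type & PERF_SAMPLE_TID) {
        memcpy(&s->pid, p, 4);
        memcpy(&s->tid, p + 4, 4);
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_TIME)
        s->time = take_u64(&p);
    if (sample_type & PERF_SAMPLE_ADDR)
        s->addr = take_u64(&p);
    if (sample_type & PERF_SAMPLE_ID)
        s->id = take_u64(&p);
    if (sample_type & PERF_SAMPLE_STREAM_ID)
        s->stream_id = take_u64(&p);
    if (sample_type & PERF_SAMPLE_CPU) {
        memcpy(&s->cpu, p, 4);  /* the reserved u32 is skipped */
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_PERIOD)
        s->period = take_u64(&p);
    return p;
}
```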
1745
1746              PERF_RECORD_MMAP2
1747                  This  record  includes extended information on mmap(2) calls
1748                  returning executable mappings.  The  format  is  similar  to
1749                  that of the PERF_RECORD_MMAP record, but includes extra val‐
1750                  ues that allow uniquely identifying shared mappings.
1751
1752                      struct {
1753                          struct perf_event_header header;
1754                          u32    pid;
1755                          u32    tid;
1756                          u64    addr;
1757                          u64    len;
1758                          u64    pgoff;
1759                          u32    maj;
1760                          u32    min;
1761                          u64    ino;
1762                          u64    ino_generation;
1763                          u32    prot;
1764                          u32    flags;
1765                          char   filename[];
1766                          struct sample_id sample_id;
1767                      };
1768
1769                  pid    is the process ID.
1770
1771                  tid    is the thread ID.
1772
1773                  addr   is the address of the allocated memory.
1774
1775                  len    is the length of the allocated memory.
1776
1777                  pgoff  is the page offset of the allocated memory.
1778
1779                  maj    is the major ID of the underlying device.
1780
1781                  min    is the minor ID of the underlying device.
1782
1783                  ino    is the inode number.
1784
1785                  ino_generation
1786                         is the inode generation.
1787
1788                  prot   is the protection information.
1789
1790                  flags  is the flags information.
1791
1792                  filename
1793                         is a string describing the backing of  the  allocated
1794                         memory.
1795
1796              PERF_RECORD_AUX (since Linux 4.1)
1797                  This  record reports that new data is available in the sepa‐
1798                  rate AUX buffer region.
1799
1800                      struct {
1801                          struct perf_event_header header;
1802                          u64    aux_offset;
1803                          u64    aux_size;
1804                          u64    flags;
1805                          struct sample_id sample_id;
1806                      };
1807
1808                  aux_offset
1809                         offset in the AUX mmap region where the new data  be‐
1810                         gins.
1811
1812                  aux_size
1813                         size of the data made available.
1814
1815                  flags  describes the AUX update.
1816
1817                         PERF_AUX_FLAG_TRUNCATED
1818                                if  set,  then the data returned was truncated
1819                                to fit the available buffer size.
1820
1821                         PERF_AUX_FLAG_OVERWRITE
1822                                if set, then the data returned has overwritten
1823                                previous data.
1824
1825              PERF_RECORD_ITRACE_START (since Linux 4.1)
1826                  This  record  indicates  which  process has initiated an in‐
1827                  struction trace event, allowing tools to properly  correlate
1828                  the  instruction addresses in the AUX buffer with the proper
1829                  executable.
1830
1831                      struct {
1832                          struct perf_event_header header;
1833                          u32    pid;
1834                          u32    tid;
1835                      };
1836
1837                  pid    process ID of  the  thread  starting  an  instruction
1838                         trace.
1839
1840                  tid    thread  ID  of  the  thread  starting  an instruction
1841                         trace.
1842
1843              PERF_RECORD_LOST_SAMPLES (since Linux 4.2)
1844                  When using hardware  sampling  (such  as  Intel  PEBS)  this
1845                  record  indicates  some number of samples that may have been
1846                  lost.
1847
1848                      struct {
1849                          struct perf_event_header header;
1850                          u64    lost;
1851                          struct sample_id sample_id;
1852                      };
1853
1854                  lost   the number of potentially lost samples.
1855
1856              PERF_RECORD_SWITCH (since Linux 4.3)
1857                  This record indicates a context switch  has  happened.   The
1858                  PERF_RECORD_MISC_SWITCH_OUT  bit in the misc field indicates
1859                  whether it was a context switch into or away from  the  cur‐
1860                  rent process.
1861
1862                      struct {
1863                          struct perf_event_header header;
1864                          struct sample_id sample_id;
1865                      };
1866
1867              PERF_RECORD_SWITCH_CPU_WIDE (since Linux 4.3)
                  As with PERF_RECORD_SWITCH, this record indicates a context
                  switch has happened, but it only occurs when sampling in
                  CPU-wide mode and provides additional information on the
                  process being switched to/from.  The
                  PERF_RECORD_MISC_SWITCH_OUT bit in the misc field indicates
                  whether it was a context switch into or away from the
                  current process.
1875
1876                      struct {
1877                          struct perf_event_header header;
1878                          u32 next_prev_pid;
1879                          u32 next_prev_tid;
1880                          struct sample_id sample_id;
1881                      };
1882
1883                  next_prev_pid
1884                         The  process  ID of the previous (if switching in) or
1885                         next (if switching out) process on the CPU.
1886
1887                  next_prev_tid
1888                         The thread ID of the previous (if  switching  in)  or
1889                         next (if switching out) thread on the CPU.
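
                  Since PERF_RECORD_MISC_SWITCH_OUT shares its bit with
                  other misc flags, a check for the switch direction should
                  also look at the record type.  A minimal sketch, with a
                  hypothetical function name:

```c
#include <linux/perf_event.h>

/* Nonzero if hdr describes a context switch away from the current task. */
static int is_switch_out(const struct perf_event_header *hdr)
{
    if (hdr->type != PERF_RECORD_SWITCH &&
        hdr->type != PERF_RECORD_SWITCH_CPU_WIDE)
        return 0;               /* the bit means something else here */
    return (hdr->misc & PERF_RECORD_MISC_SWITCH_OUT) != 0;
}
```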
1890
1891              PERF_RECORD_NAMESPACES (since Linux 4.11)
1892                  This  record  includes  various  namespace  information of a
1893                  process.
1894
1895                      struct {
1896                          struct perf_event_header header;
1897                          u32    pid;
1898                          u32    tid;
1899                          u64    nr_namespaces;
1900                          struct { u64 dev, inode } [nr_namespaces];
1901                          struct sample_id sample_id;
1902                      };
1903
1904                  pid    is the process ID
1905
1906                  tid    is the thread ID
1907
                  nr_namespaces
1909                         is the number of namespaces in this record
1910
                  Each namespace has dev and inode fields and is recorded at
                  a fixed position in the array, as follows:
1913
1914                  NET_NS_INDEX=0
1915                         Network namespace
1916
1917                  UTS_NS_INDEX=1
1918                         UTS namespace
1919
1920                  IPC_NS_INDEX=2
1921                         IPC namespace
1922
1923                  PID_NS_INDEX=3
1924                         PID namespace
1925
1926                  USER_NS_INDEX=4
1927                         User namespace
1928
1929                  MNT_NS_INDEX=5
1930                         Mount namespace
1931
1932                  CGROUP_NS_INDEX=6
1933                         Cgroup namespace
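
                  The fixed positions listed above lend themselves to a
                  simple lookup table.  The following is a sketch only: the
                  enum is a local copy of the indices (an assumption made so
                  that the fragment stands alone), and ns_index_name() and
                  the short names it returns are hypothetical.

```c
#include <stddef.h>

/* Local copies of the indices listed above.  Do not combine this with
   a header that already defines the same names. */
enum {
    NET_NS_INDEX    = 0,
    UTS_NS_INDEX    = 1,
    IPC_NS_INDEX    = 2,
    PID_NS_INDEX    = 3,
    USER_NS_INDEX   = 4,
    MNT_NS_INDEX    = 5,
    CGROUP_NS_INDEX = 6
};

/* Map an array position in a PERF_RECORD_NAMESPACES record to a short
   label; returns NULL for positions not listed above. */
static const char *
ns_index_name(unsigned int idx)
{
    static const char *const names[] = {
        [NET_NS_INDEX]    = "net",
        [UTS_NS_INDEX]    = "uts",
        [IPC_NS_INDEX]    = "ipc",
        [PID_NS_INDEX]    = "pid",
        [USER_NS_INDEX]   = "user",
        [MNT_NS_INDEX]    = "mnt",
        [CGROUP_NS_INDEX] = "cgroup"
    };

    if (idx >= sizeof(names) / sizeof(names[0]))
        return NULL;
    return names[idx];
}
```

                  Returning NULL for unknown positions lets a reader skip
                  entries that a newer kernel might append.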
1934
1935              PERF_RECORD_KSYMBOL (since Linux 5.0)
1936                  This  record  indicates  kernel  symbol  register/unregister
1937                  events.
1938
1939                      struct {
1940                          struct perf_event_header header;
1941                          u64    addr;
1942                          u32    len;
1943                          u16    ksym_type;
1944                          u16    flags;
1945                          char   name[];
1946                          struct sample_id sample_id;
1947                      };
1948
1949                  addr   is the address of the kernel symbol.
1950
1951                  len    is the length of the kernel symbol.
1952
1953                  ksym_type
1954                         is the type of the kernel symbol.  Currently the fol‐
1955                         lowing types are available:
1956
1957                         PERF_RECORD_KSYMBOL_TYPE_BPF
1958                                The kernel symbol is a BPF function.
1959
1960                  flags  If  the  PERF_RECORD_KSYMBOL_FLAGS_UNREGISTER is set,
1961                         then this event is for unregistering the kernel  sym‐
1962                         bol.
1963
1964              PERF_RECORD_BPF_EVENT (since Linux 5.0)
                  This record indicates that a BPF program was loaded or
                  unloaded.
1966
1967                      struct {
1968                          struct perf_event_header header;
1969                          u16 type;
1970                          u16 flags;
1971                          u32 id;
1972                          u8 tag[BPF_TAG_SIZE];
1973                          struct sample_id sample_id;
1974                      };
1975
1976                  type   is one of the following values:
1977
1978                         PERF_BPF_EVENT_PROG_LOAD
1979                                A BPF program is loaded
1980
1981                         PERF_BPF_EVENT_PROG_UNLOAD
1982                                A BPF program is unloaded
1983
1984                  id     is the ID of the BPF program.
1985
1986                  tag    is   the   tag   of   the  BPF  program.   Currently,
1987                         BPF_TAG_SIZE is defined as 8.
1988
1989              PERF_RECORD_CGROUP (since Linux 5.7)
                  This record indicates that a new cgroup has been created
                  and activated.
1991
1992                      struct {
1993                          struct perf_event_header header;
1994                          u64    id;
1995                          char   path[];
1996                          struct sample_id sample_id;
1997                      };
1998
                  id     is the cgroup identifier.  This can also be retrieved
2000                         by name_to_handle_at(2) on the cgroup path (as a file
2001                         handle).
2002
2003                  path   is the path of the cgroup from the root.
2004
2005              PERF_RECORD_TEXT_POKE (since Linux 5.8)
                  This record indicates a change in the kernel text.  This
                  includes the addition and removal of text, in which case
                  the corresponding length (old_len or new_len) is zero.
2009
2010                      struct {
2011                          struct perf_event_header header;
2012                          u64    addr;
2013                          u16    old_len;
2014                          u16    new_len;
2015                          u8     bytes[];
2016                          struct sample_id sample_id;
2017                      };
2018
2019                  addr   is the address of the change
2020
2021                  old_len
2022                         is the old length
2023
2024                  new_len
2025                         is the new length
2026
2027                  bytes  contains old bytes immediately followed by new bytes.
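
                  Because bytes holds the old bytes immediately followed by
                  the new bytes, the two runs can be separated with pointer
                  arithmetic.  The following is a minimal sketch; struct
                  text_poke_view and split_text_poke() are hypothetical names
                  used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the variable-length part of a
   PERF_RECORD_TEXT_POKE record. */
struct text_poke_view {
    const uint8_t *old_bytes;   /* old_len bytes */
    uint16_t       old_len;
    const uint8_t *new_bytes;   /* new_len bytes, directly after the old */
    uint16_t       new_len;
};

/* Split the record's bytes[] array into its old and new runs. */
static struct text_poke_view
split_text_poke(const uint8_t *bytes, uint16_t old_len, uint16_t new_len)
{
    struct text_poke_view v;

    v.old_bytes = bytes;
    v.old_len   = old_len;
    v.new_bytes = bytes + old_len;
    v.new_len   = new_len;
    return v;
}
```

                  Either length may be zero, in which case the corresponding
                  pointer must not be dereferenced.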
2028
2029   Overflow handling
2030       Events can be set to notify when a threshold is crossed, indicating  an
2031       overflow.   Overflow conditions can be captured by monitoring the event
2032       file descriptor with poll(2), select(2), or  epoll(7).   Alternatively,
       the overflow events can be captured via a signal handler, by enabling
       I/O signaling on the file descriptor; see the discussion of the
       F_SETOWN and F_SETSIG operations in fcntl(2).
2036
2037       Overflows  are  generated  only  by sampling events (sample_period must
2038       have a nonzero value).
2039
2040       There are two ways to generate overflow notifications.
2041
2042       The first is to set a wakeup_events or wakeup_watermark value that will
2043       trigger  if  a  certain number of samples or bytes have been written to
2044       the mmap ring buffer.  In this case, POLL_IN is indicated.
2045
2046       The other way is by use  of  the  PERF_EVENT_IOC_REFRESH  ioctl.   This
2047       ioctl  adds to a counter that decrements each time the event overflows.
2048       When nonzero, POLL_IN is indicated, but  once  the  counter  reaches  0
2049       POLL_HUP is indicated and the underlying event is disabled.
2050
2051       Refreshing  an event group leader refreshes all siblings and refreshing
2052       with a parameter of 0 currently enables infinite refreshes;  these  be‐
2053       haviors are unsupported and should not be relied on.
2054
2055       Starting with Linux 3.18, POLL_HUP is indicated if the event being mon‐
2056       itored is attached to a different process and that process exits.
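
       As a rough sketch of the poll(2) approach, the helper below waits on
       an event file descriptor and reports which condition was seen.  It
       assumes that the POLL_IN and POLL_HUP indications described above
       appear as the POLLIN and POLLHUP bits in revents; the enum and the
       name wait_for_overflow() are hypothetical.

```c
#include <poll.h>

/* Hypothetical result codes for wait_for_overflow(). */
enum overflow_status {
    OVF_ERROR   = -1,   /* poll(2) failed or reported another condition */
    OVF_TIMEOUT = 0,    /* nothing happened within timeout_ms */
    OVF_DATA    = 1,    /* POLLIN: overflow notification, data may be read */
    OVF_HANGUP  = 2     /* POLLHUP: refresh count exhausted or target gone */
};

/* Wait up to timeout_ms milliseconds for an overflow notification on fd. */
static enum overflow_status
wait_for_overflow(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return OVF_ERROR;
    if (n == 0)
        return OVF_TIMEOUT;
    if (pfd.revents & POLLIN)
        return OVF_DATA;
    if (pfd.revents & POLLHUP)
        return OVF_HANGUP;
    return OVF_ERROR;
}
```

       After OVF_DATA, a caller would normally consume records from the
       mmap ring buffer; after OVF_HANGUP, the event has to be refreshed or
       re-enabled before further notifications arrive.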
2057
2058   rdpmc instruction
2059       Starting with Linux 3.4 on x86, you can use the  rdpmc  instruction  to
2060       get  low-latency  reads  without having to enter the kernel.  Note that
2061       using rdpmc is not necessarily faster than other  methods  for  reading
2062       event values.
2063
2064       Support  for  this  can be detected with the cap_usr_rdpmc field in the
2065       mmap page; documentation on how to calculate event values can be  found
2066       in that section.
2067
2068       Originally,  when rdpmc support was enabled, any process (not just ones
2069       with an active perf event) could use the rdpmc  instruction  to  access
2070       the  counters.   Starting with Linux 4.0, rdpmc support is only allowed
2071       if an event is currently enabled in a process's  context.   To  restore
2072       the old behavior, write the value 2 to /sys/devices/cpu/rdpmc.
2073
2074   perf_event ioctl calls
2075       Various ioctls act on perf_event_open() file descriptors:
2076
2077       PERF_EVENT_IOC_ENABLE
2078              This  enables  the  individual event or event group specified by
2079              the file descriptor argument.
2080
2081              If the PERF_IOC_FLAG_GROUP bit is set  in  the  ioctl  argument,
2082              then all events in a group are enabled, even if the event speci‐
2083              fied is not the group leader (but see BUGS).
2084
2085       PERF_EVENT_IOC_DISABLE
2086              This disables the individual counter or event group specified by
2087              the file descriptor argument.
2088
2089              Enabling  or disabling the leader of a group enables or disables
2090              the entire group; that is, while the group leader  is  disabled,
2091              none  of the counters in the group will count.  Enabling or dis‐
2092              abling a member of a group other than the  leader  affects  only
2093              that  counter;  disabling  a  non-leader stops that counter from
2094              counting but doesn't affect any other counter.
2095
2096              If the PERF_IOC_FLAG_GROUP bit is set  in  the  ioctl  argument,
2097              then all events in a group are disabled, even if the event spec‐
2098              ified is not the group leader (but see BUGS).
2099
2100       PERF_EVENT_IOC_REFRESH
2101              Non-inherited overflow counters can use this to enable a counter
2102              for a number of overflows specified by the argument, after which
2103              it is disabled.  Subsequent calls of this ioctl add the argument
2104              value  to  the  current  count.   An  overflow notification with
2105              POLL_IN set will happen on each overflow until the count reaches
2106              0;  when  that  happens a notification with POLL_HUP set is sent
2107              and the event is disabled.  Using an argument of 0 is considered
2108              undefined behavior.
2109
2110       PERF_EVENT_IOC_RESET
2111              Reset  the event count specified by the file descriptor argument
2112              to zero.  This resets only the counts; there is no way to  reset
2113              the multiplexing time_enabled or time_running values.
2114
2115              If  the  PERF_IOC_FLAG_GROUP  bit  is set in the ioctl argument,
2116              then all events in a group are reset, even if the  event  speci‐
2117              fied is not the group leader (but see BUGS).
2118
2119       PERF_EVENT_IOC_PERIOD
2120              This updates the overflow period for the event.
2121
2122              Since  Linux  3.7  (on  ARM) and Linux 3.14 (all other architec‐
2123              tures), the new period takes effect immediately.  On older  ker‐
2124              nels,  the  new  period did not take effect until after the next
2125              overflow.
2126
2127              The argument is a pointer to a 64-bit value containing  the  de‐
2128              sired new period.
2129
2130              Prior  to Linux 2.6.36, this ioctl always failed due to a bug in
2131              the kernel.
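
              A common mistake is to pass the period by value.  The
              following sketch passes a pointer to a 64-bit value, as
              described above; set_overflow_period() is a hypothetical
              helper name.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Update the overflow period of the event referred to by fd.
   Returns 0 on success, or -1 with errno set on failure. */
static int
set_overflow_period(int fd, uint64_t period)
{
    /* The ioctl takes the address of the new period, not the value. */
    return ioctl(fd, PERF_EVENT_IOC_PERIOD, &period);
}
```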
2132
2133       PERF_EVENT_IOC_SET_OUTPUT
2134              This tells the kernel to report event notifications to the spec‐
2135              ified file descriptor rather than the default one.  The file de‐
2136              scriptors must all be on the same CPU.
2137
2138              The argument specifies the desired file  descriptor,  or  -1  if
2139              output should be ignored.
2140
2141       PERF_EVENT_IOC_SET_FILTER (since Linux 2.6.33)
2142              This adds an ftrace filter to this event.
2143
2144              The argument is a pointer to the desired ftrace filter.
2145
2146       PERF_EVENT_IOC_ID (since Linux 3.12)
2147              This  returns  the  event  ID value for the given event file de‐
2148              scriptor.
2149
2150              The argument is a pointer to a 64-bit unsigned integer  to  hold
2151              the result.
2152
2153       PERF_EVENT_IOC_SET_BPF (since Linux 4.1)
2154              This  allows attaching a Berkeley Packet Filter (BPF) program to
2155              an existing  kprobe  tracepoint  event.   You  need  CAP_PERFMON
2156              (since Linux 5.8) or CAP_SYS_ADMIN privileges to use this ioctl.
2157
2158              The  argument  is a BPF program file descriptor that was created
2159              by a previous bpf(2) system call.
2160
2161       PERF_EVENT_IOC_PAUSE_OUTPUT (since Linux 4.7)
2162              This allows pausing and resuming  the  event's  ring-buffer.   A
2163              paused  ring-buffer  does not prevent generation of samples, but
2164              simply discards them.   The  discarded  samples  are  considered
2165              lost,  and  cause a PERF_RECORD_LOST sample to be generated when
2166              possible.  An overflow signal may still be triggered by the dis‐
2167              carded sample even though the ring-buffer remains empty.
2168
2169              The  argument  is  an  unsigned 32-bit integer.  A nonzero value
2170              pauses the ring-buffer, while a zero value resumes the ring-buf‐
2171              fer.
2172
       PERF_EVENT_IOC_MODIFY_ATTRIBUTES (since Linux 4.17)
2174              This  allows modifying an existing event without the overhead of
2175              closing and reopening a new event.  Currently this is  supported
2176              only for breakpoint events.
2177
2178              The  argument  is  a pointer to a perf_event_attr structure con‐
2179              taining the updated event settings.
2180
2181       PERF_EVENT_IOC_QUERY_BPF (since Linux 4.16)
2182              This allows querying which Berkeley Packet Filter (BPF) programs
2183              are attached to an existing kprobe tracepoint.  You can only at‐
2184              tach one BPF program per event, but you can have multiple events
2185              attached to a tracepoint.  Querying this value on one tracepoint
2186              event returns the ID of all BPF programs in all events  attached
2187              to  the  tracepoint.   You need CAP_PERFMON (since Linux 5.8) or
2188              CAP_SYS_ADMIN privileges to use this ioctl.
2189
2190              The argument is a pointer to a structure
2191                  struct perf_event_query_bpf {
2192                      __u32    ids_len;
2193                      __u32    prog_cnt;
2194                      __u32    ids[0];
2195                  };
2196
2197              The ids_len field indicates the number of ids that  can  fit  in
2198              the  provided ids array.  The prog_cnt value is filled in by the
2199              kernel with the number of attached BPF programs.  The ids  array
2200              is  filled  with  the ID of each attached BPF program.  If there
2201              are more programs than will fit in the array,  then  the  kernel
2202              will  return ENOSPC and ids_len will indicate the number of pro‐
2203              gram IDs that were successfully copied.
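
              Since ids is a variable-length trailing array, the caller
              must allocate the structure with room for the IDs.  The
              following is a sketch under the assumption that the installed
              kernel headers define struct perf_event_query_bpf;
              alloc_bpf_query() and query_bpf_ids() are hypothetical helper
              names.

```c
#include <errno.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Allocate a zeroed query structure with room for `capacity` IDs. */
static struct perf_event_query_bpf *
alloc_bpf_query(uint32_t capacity)
{
    struct perf_event_query_bpf *q;

    q = calloc(1, sizeof(*q) + (size_t) capacity * sizeof(q->ids[0]));
    if (q != NULL)
        q->ids_len = capacity;
    return q;
}

/* Query the BPF programs attached via fd.  Returns a structure that the
   caller must free(), or NULL with errno set (for example, ENOSPC if
   `capacity` was too small). */
static struct perf_event_query_bpf *
query_bpf_ids(int fd, uint32_t capacity)
{
    struct perf_event_query_bpf *q = alloc_bpf_query(capacity);

    if (q != NULL && ioctl(fd, PERF_EVENT_IOC_QUERY_BPF, q) == -1) {
        int saved_errno = errno;

        free(q);
        errno = saved_errno;
        return NULL;
    }
    return q;
}
```

              On success, the first prog_cnt entries of ids hold the
              program IDs.  A caller that sees ENOSPC can retry with a
              larger capacity.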
2204
2205   Using prctl(2)
2206       A process can enable or disable all currently open event  groups  using
2207       the prctl(2) PR_TASK_PERF_EVENTS_ENABLE and PR_TASK_PERF_EVENTS_DISABLE
2208       operations.  This applies only to events created locally by the calling
2209       process.   This does not apply to events created by other processes at‐
2210       tached to the  calling  process  or  inherited  events  from  a  parent
2211       process.   Only  group  leaders are enabled and disabled, not any other
2212       members of the groups.
2213
2214   perf_event related configuration files
2215       Files in /proc/sys/kernel/
2216
2217           /proc/sys/kernel/perf_event_paranoid
2218                  The perf_event_paranoid file can be set to  restrict  access
2219                  to the performance counters.
2220
2221                  2   allow  only user-space measurements (default since Linux
2222                      4.6).
2223                  1   allow both kernel and user measurements (default  before
2224                      Linux 4.6).
2225                  0   allow access to CPU-specific data but not raw tracepoint
2226                      samples.
2227                  -1  no restrictions.
2228
2229                  The existence of the perf_event_paranoid file is  the  offi‐
2230                  cial   method   for   determining   if   a  kernel  supports
2231                  perf_event_open().
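
                  The existence check and the levels listed above can be
                  sketched as follows.  The helper names are hypothetical,
                  and values other than those documented here (some
                  distributions add their own) are reported as unknown
                  rather than guessed at.

```c
#include <unistd.h>

#define PARANOID_PATH "/proc/sys/kernel/perf_event_paranoid"

/* Returns 1 if the kernel appears to support perf_event_open(),
   judged by the existence of the paranoid file, and 0 otherwise. */
static int
perf_event_open_supported(void)
{
    return access(PARANOID_PATH, F_OK) == 0;
}

/* Describe a perf_event_paranoid level as documented above. */
static const char *
describe_paranoid(int level)
{
    switch (level) {
    case 2:
        return "user-space measurements only";
    case 1:
        return "kernel and user measurements";
    case 0:
        return "CPU-specific data, but not raw tracepoint samples";
    case -1:
        return "no restrictions";
    default:
        return "value not documented here";
    }
}
```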
2232
2233           /proc/sys/kernel/perf_event_max_sample_rate
2234                  This sets the maximum sample rate.  Setting  this  too  high
2235                  can allow users to sample at a rate that impacts overall ma‐
2236                  chine performance and potentially lock up the machine.   The
2237                  default value is 100000 (samples per second).
2238
2239           /proc/sys/kernel/perf_event_max_stack
2240                  This  file sets the maximum depth of stack frame entries re‐
2241                  ported when generating a call trace.
2242
2243           /proc/sys/kernel/perf_event_mlock_kb
                  Maximum amount of memory, in kilobytes, that an
                  unprivileged user can mlock(2).  The default is 516 (kB).
2246
2247       Files in /sys/bus/event_source/devices/
2248
2249           Since Linux 2.6.34, the kernel supports having multiple PMUs avail‐
2250           able for monitoring.  Information on how to program these PMUs  can
2251           be  found  under /sys/bus/event_source/devices/.  Each subdirectory
2252           corresponds to a different PMU.
2253
2254           /sys/bus/event_source/devices/*/type (since Linux 2.6.38)
2255                  This contains an integer that can be used in the type  field
2256                  of  perf_event_attr  to  indicate  that you wish to use this
2257                  PMU.
2258
2259           /sys/bus/event_source/devices/cpu/rdpmc (since Linux 3.4)
2260                  If this file is 1, then direct user-space access to the per‐
2261                  formance counter registers is allowed via the rdpmc instruc‐
2262                  tion.  This can be disabled by echoing 0 to the file.
2263
2264                  As of Linux 4.0 the behavior has  changed,  so  that  1  now
2265                  means  only  allow  access  to  processes  with  active perf
2266                  events, with 2 indicating the old allow-anyone-access behav‐
2267                  ior.
2268
2269           /sys/bus/event_source/devices/*/format/ (since Linux 3.4)
2270                  This  subdirectory contains information on the architecture-
2271                  specific subfields available  for  programming  the  various
2272                  config fields in the perf_event_attr struct.
2273
2274                  The  content  of  each file is the name of the config field,
2275                  followed by a colon, followed by a  series  of  integer  bit
2276                  ranges separated by commas.  For example, the file event may
2277                  contain the value  config1:1,6-10,44  which  indicates  that
                  event  is  an attribute that occupies bits 1, 6–10, and 44 of
2279                  perf_event_attr::config1.
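
                  A parser for this syntax might look like the following
                  sketch, which turns the text of a format file into the
                  config field name and a 64-bit mask of the occupied bits.
                  The function name parse_format() and its return
                  convention are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse text such as "config1:1,6-10,44".  On success, store the field
   name in `field` and the occupied bits in `*mask`, and return 0.
   Return -1 if the text is malformed or a bit number exceeds 63. */
static int
parse_format(const char *text, char *field, size_t field_sz, uint64_t *mask)
{
    const char *colon = strchr(text, ':');
    const char *p;
    size_t n;

    if (colon == NULL)
        return -1;
    n = (size_t) (colon - text);
    if (n == 0 || n >= field_sz)
        return -1;
    memcpy(field, text, n);
    field[n] = '\0';

    *mask = 0;
    p = colon + 1;
    for (;;) {
        char *end;
        unsigned long lo, hi;

        lo = strtoul(p, &end, 10);
        if (end == p)
            return -1;
        hi = lo;
        if (*end == '-') {              /* A range such as 6-10 */
            p = end + 1;
            hi = strtoul(p, &end, 10);
            if (end == p)
                return -1;
        }
        if (lo > hi || hi > 63)
            return -1;
        for (unsigned long b = lo; b <= hi; b++)
            *mask |= UINT64_C(1) << b;
        if (*end != ',')                /* No further ranges */
            break;
        p = end + 1;
    }
    return 0;
}
```

                  For the example above, the resulting mask has bits 1,
                  6 to 10, and 44 set, and field is "config1".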
2280
2281           /sys/bus/event_source/devices/*/events/ (since Linux 3.4)
2282                  This subdirectory contains  files  with  predefined  events.
2283                  The  contents  are strings describing the event settings ex‐
2284                  pressed in terms of the fields found in the previously  men‐
2285                  tioned  ./format/ directory.  These are not necessarily com‐
2286                  plete lists of all events supported by a PMU, but usually  a
2287                  subset of events deemed useful or interesting.
2288
2289                  The  content of each file is a list of attribute names sepa‐
2290                  rated by commas.  Each entry has an optional  value  (either
2291                  hex  or  decimal).  If no value is specified, then it is as‐
2292                  sumed to be a single-bit field with a value of 1.  An  exam‐
2293                  ple entry may look like this: event=0x2,inv,ldlat=3.
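
                  The entry syntax described above can be split into
                  name/value pairs as in the following sketch.  The struct
                  event_term type and the function parse_event_terms() are
                  hypothetical; the sketch assumes numeric values only and
                  does not map the terms onto config bits.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One attribute from an events file (hypothetical type). */
struct event_term {
    char     name[32];
    uint64_t value;
};

/* Parse text such as "event=0x2,inv,ldlat=3" into at most max_terms
   terms.  A term without "=value" gets the value 1.  Returns the number
   of terms parsed, or -1 if the text is malformed or does not fit. */
static int
parse_event_terms(const char *text, struct event_term *terms, int max_terms)
{
    int count = 0;
    const char *p = text;

    while (*p != '\0' && *p != '\n') {
        size_t tok_len = strcspn(p, ",\n");
        const char *eq = memchr(p, '=', tok_len);
        size_t name_len = eq ? (size_t) (eq - p) : tok_len;

        if (count == max_terms || name_len == 0 ||
            name_len >= sizeof(terms[count].name))
            return -1;
        memcpy(terms[count].name, p, name_len);
        terms[count].name[name_len] = '\0';
        /* Base 0 accepts both hexadecimal (0x...) and decimal values. */
        terms[count].value = eq ? strtoull(eq + 1, NULL, 0) : 1;
        count++;

        p += tok_len;
        if (*p == ',')
            p++;
    }
    return count;
}
```

                  Each parsed term would then be placed into the bits given
                  by the file of the same name in the ./format/ directory.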
2294
2295           /sys/bus/event_source/devices/*/uevent
2296                  This  file  is  the standard kernel device interface for in‐
2297                  jecting hotplug events.
2298
2299           /sys/bus/event_source/devices/*/cpumask (since Linux 3.7)
2300                  The cpumask file contains a comma-separated list of integers
2301                  that  indicate  a  representative CPU number for each socket
2302                  (package) on the motherboard.  This is needed  when  setting
2303                  up  uncore  or  northbridge  events,  as  those PMUs present
2304                  socket-wide events.
2305

RETURN VALUE

2307       On success, perf_event_open() returns the new file descriptor.  On  er‐
2308       ror, -1 is returned and errno is set to indicate the error.
2309

ERRORS

2311       The  errors  returned by perf_event_open() can be inconsistent, and may
2312       vary across processor architectures and performance monitoring units.
2313
2314       E2BIG  Returned if the perf_event_attr size value is too small (smaller
2315              than  PERF_ATTR_SIZE_VER0), too big (larger than the page size),
2316              or larger than the kernel supports and the extra bytes  are  not
2317              zero.  When E2BIG is returned, the perf_event_attr size field is
2318              overwritten by the kernel to be the size of the structure it was
2319              expecting.
2320
2321       EACCES Returned  when  the  requested event requires CAP_PERFMON (since
2322              Linux 5.8) or CAP_SYS_ADMIN permissions (or  a  more  permissive
2323              perf_event  paranoid  setting).   Some common cases where an un‐
2324              privileged process may encounter  this  error:  attaching  to  a
2325              process owned by a different user; monitoring all processes on a
2326              given CPU (i.e., specifying the pid argument  as  -1);  and  not
2327              setting exclude_kernel when the paranoid setting requires it.
2328
2329       EBADF  Returned  if  the  group_fd file descriptor is not valid, or, if
2330              PERF_FLAG_PID_CGROUP is set, the cgroup file descriptor  in  pid
2331              is not valid.
2332
2333       EBUSY (since Linux 4.1)
2334              Returned  if  another  event already has exclusive access to the
2335              PMU.
2336
2337       EFAULT Returned if the attr pointer points at  an  invalid  memory  ad‐
2338              dress.
2339
2340       EINTR  Returned  when  trying to mix perf and ftrace handling for a up‐
2341              robe.
2342
       EINVAL Returned if the specified event is invalid.  There are many
              possible reasons for this.  A non-exhaustive list: sample_freq
              is higher than the maximum setting; the cpu to monitor does not
              exist; read_format is out of range; sample_type is out of
              range; the flags value is out of range; exclusive or pinned is
              set and the event is not a group leader; the event config
              values are out of range or set reserved bits; the generic
              event selected is not supported; or there is not enough room
              to add the selected event.
2352
2353       EMFILE Each opened event uses one file descriptor.  If a  large  number
2354              of  events  are  opened,  the per-process limit on the number of
2355              open file descriptors will be reached, and no more events can be
2356              created.
2357
2358       ENODEV Returned  when the event involves a feature not supported by the
2359              current CPU.
2360
2361       ENOENT Returned if the type setting is not valid.  This error  is  also
2362              returned for some unsupported generic events.
2363
2364       ENOSPC Prior  to Linux 3.3, if there was not enough room for the event,
2365              ENOSPC was returned.  In Linux 3.3, this was changed to  EINVAL.
2366              ENOSPC  is  still  returned  if  you  try to add more breakpoint
2367              events than supported by the hardware.
2368
2369       ENOSYS Returned if PERF_SAMPLE_STACK_USER is set in sample_type and  it
2370              is not supported by hardware.
2371
2372       EOPNOTSUPP
2373              Returned  if  an  event requiring a specific hardware feature is
2374              requested but there is no hardware support.  This  includes  re‐
2375              questing  low-skid events if not supported, branch tracing if it
2376              is not available, sampling if no PMU interrupt is available, and
2377              branch stacks for software events.
2378
2379       EOVERFLOW (since Linux 4.8)
2380              Returned   if   PERF_SAMPLE_CALLCHAIN   is  requested  and  sam‐
2381              ple_max_stack  is  larger  than   the   maximum   specified   in
2382              /proc/sys/kernel/perf_event_max_stack.
2383
2384       EPERM  Returned on many (but not all) architectures when an unsupported
2385              exclude_hv, exclude_idle, exclude_user, or  exclude_kernel  set‐
2386              ting is specified.
2387
2388              It can also happen, as with EACCES, when the requested event re‐
2389              quires CAP_PERFMON (since Linux 5.8)  or  CAP_SYS_ADMIN  permis‐
2390              sions  (or a more permissive perf_event paranoid setting).  This
2391              includes setting a breakpoint on a kernel  address,  and  (since
2392              Linux 3.13) setting a kernel function-trace tracepoint.
2393
2394       ESRCH  Returned  if attempting to attach to a process that does not ex‐
2395              ist.
2396

VERSION

2398       perf_event_open()  was  introduced  in  Linux  2.6.31  but  was  called
2399       perf_counter_open().  It was renamed in Linux 2.6.32.
2400

CONFORMING TO

       This perf_event_open() system call is Linux-specific and should not be
       used in programs intended to be portable.
2404

NOTES

2406       The official way of knowing if perf_event_open() support is enabled  is
2407       checking    for    the    existence    of   the   file   /proc/sys/ker‐
2408       nel/perf_event_paranoid.
2409
       The CAP_PERFMON capability (since Linux 5.8) provides a secure approach
       to performance monitoring and observability operations in a system,
       according to the principle of least privilege (POSIX IEEE 1003.1e).
       Accessing system performance monitoring and observability operations
       using CAP_PERFMON rather than the much more powerful CAP_SYS_ADMIN
       reduces the chances of misusing credentials and makes operations more
       secure.  Using CAP_SYS_ADMIN for secure system performance monitoring
       and observability is discouraged in favor of the CAP_PERFMON capability.
2418

BUGS

2420       The  F_SETOWN_EX  option to fcntl(2) is needed to properly get overflow
2421       signals in threads.  This was introduced in Linux 2.6.32.
2422
       Prior to Linux 2.6.33 (at least for x86), the kernel did not check
       whether events could be scheduled together until read time.  The same
       happens on all known kernels if the NMI watchdog is enabled.  This
       means that, to see whether a given set of events works, you have to
       call perf_event_open(), start the events, and then read them before
       you know for sure that you can get valid measurements.
2428
2429       Prior to Linux 2.6.34, event constraints were not enforced by the  ker‐
2430       nel.  In that case, some events would silently return "0" if the kernel
2431       scheduled them in an improper counter slot.
2432
2433       Prior to Linux 2.6.34, there was a  bug  when  multiplexing  where  the
2434       wrong results could be returned.
2435
2436       Kernels  from Linux 2.6.35 to Linux 2.6.39 can quickly crash the kernel
2437       if "inherit" is enabled and many threads are started.
2438
2439       Prior to Linux 2.6.35, PERF_FORMAT_GROUP did  not  work  with  attached
2440       processes.
2441
2442       There  is  a  bug in the kernel code between Linux 2.6.36 and Linux 3.0
2443       that ignores the "watermark" field and acts as if  a  wakeup_event  was
2444       chosen if the union has a nonzero value in it.
2445
2446       From  Linux 2.6.31 to Linux 3.4, the PERF_IOC_FLAG_GROUP ioctl argument
2447       was broken and would repeatedly operate on the event  specified  rather
2448       than iterating across all sibling events in a group.
2449
2450       From  Linux  3.4 to Linux 3.11, the mmap cap_usr_rdpmc and cap_usr_time
2451       bits mapped to the same location.   Code  should  migrate  to  the  new
2452       cap_user_rdpmc and cap_user_time fields instead.
2453
2454       Always  double-check your results!  Various generalized events have had
2455       wrong values.  For example, retired branches measured the  wrong  thing
2456       on AMD machines until Linux 2.6.35.
2457

EXAMPLES

2459       The  following  is  a short example that measures the total instruction
2460       count of a call to printf(3).
2461
       #include <linux/perf_event.h>    /* Definition of PERF_* constants */
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/ioctl.h>
       #include <sys/syscall.h>         /* Definition of SYS_* constants */
       #include <unistd.h>

       /* glibc provides no wrapper, so invoke the system call directly. */
       static long
       perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                       int cpu, int group_fd, unsigned long flags)
       {
           return syscall(SYS_perf_event_open, hw_event, pid, cpu,
                          group_fd, flags);
       }

       int
       main(void)
       {
           int                     fd;
           long long               count;
           struct perf_event_attr  pe;

           memset(&pe, 0, sizeof(pe));
           pe.type = PERF_TYPE_HARDWARE;
           pe.size = sizeof(pe);
           pe.config = PERF_COUNT_HW_INSTRUCTIONS;
           pe.disabled = 1;          /* Start disabled; enabled below */
           pe.exclude_kernel = 1;    /* Count user-space instructions only */
           pe.exclude_hv = 1;

           /* Measure the calling thread (pid 0) on any CPU (cpu -1). */
           fd = (int) perf_event_open(&pe, 0, -1, -1, 0);
           if (fd == -1) {
               fprintf(stderr, "Error opening leader %llx\n",
                       (unsigned long long) pe.config);
               exit(EXIT_FAILURE);
           }

           ioctl(fd, PERF_EVENT_IOC_RESET, 0);
           ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

           printf("Measuring instruction count for this printf\n");

           ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

           /* A counting event is read as a single 64-bit value. */
           if (read(fd, &count, sizeof(count)) != sizeof(count)) {
               perror("read");
               exit(EXIT_FAILURE);
           }

           printf("Used %lld instructions\n", count);

           close(fd);
           exit(EXIT_SUCCESS);
       }
2514

SEE ALSO

2516       perf(1), fcntl(2), mmap(2), open(2), prctl(2), read(2)
2517
2518       Documentation/admin-guide/perf-security.rst in the kernel source tree
2519

COLOPHON

2521       This page is part of release 5.13 of the Linux  man-pages  project.   A
2522       description  of  the project, information about reporting bugs, and the
2523       latest    version    of    this    page,    can     be     found     at
2524       https://www.kernel.org/doc/man-pages/.
2525
2526
2527
2528Linux                             2021-08-27                PERF_EVENT_OPEN(2)