PERF_EVENT_OPEN(2)         Linux Programmer's Manual        PERF_EVENT_OPEN(2)

NAME

       perf_event_open - set up performance monitoring

SYNOPSIS

       #include <linux/perf_event.h>
       #include <linux/hw_breakpoint.h>

       int perf_event_open(struct perf_event_attr *attr,
                           pid_t pid, int cpu, int group_fd,
                           unsigned long flags);

       Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION

       Given  a  list of parameters, perf_event_open() returns a file descrip‐
       tor, for use in subsequent system calls  (read(2),  mmap(2),  prctl(2),
       fcntl(2), etc.).

       A  call to perf_event_open() creates a file descriptor that allows mea‐
       suring performance information.  Each file  descriptor  corresponds  to
       one  event  that  is measured; these can be grouped together to measure
       multiple events simultaneously.

       Events can be enabled and disabled in two ways: via  ioctl(2)  and  via
       prctl(2).   When  an  event  is  disabled it does not count or generate
       overflows but does continue to exist and maintain its count value.

       Events come in two flavors: counting and sampled.  A counting event  is
       one  that  is used for counting the aggregate number of events that oc‐
       cur.  In general, counting event results are gathered  with  a  read(2)
       call.   A  sampling  event periodically writes measurements to a buffer
       that can then be accessed via mmap(2).

   Arguments
       The pid and cpu arguments allow specifying which  process  and  CPU  to
       monitor:

       pid == 0 and cpu == -1
              This measures the calling process/thread on any CPU.

       pid == 0 and cpu >= 0
              This  measures  the  calling process/thread only when running on
              the specified CPU.

       pid > 0 and cpu == -1
              This measures the specified process/thread on any CPU.

       pid > 0 and cpu >= 0
              This measures the specified process/thread only when running  on
              the specified CPU.

       pid == -1 and cpu >= 0
              This  measures all processes/threads on the specified CPU.  This
              requires CAP_PERFMON (since Linux 5.8) or CAP_SYS_ADMIN capabil‐
              ity or a /proc/sys/kernel/perf_event_paranoid value of less than
              1.

       pid == -1 and cpu == -1
              This setting is invalid and will return an error.

       When pid is greater than zero, permission to perform this  system  call
       is  governed  by CAP_PERFMON (since Linux 5.9) and a ptrace access mode
       PTRACE_MODE_READ_REALCREDS  check  on   older   Linux   versions;   see
       ptrace(2).
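
       Because glibc provides no wrapper, callers usually reach this system
       call through syscall(2).  The following is a minimal sketch rather
       than a definitive implementation.  It assumes that SYS_perf_event_open
       is defined by <sys/syscall.h>; open_self_instruction_counter() is a
       hypothetical helper name used only for illustration.  The helper uses
       pid == 0 and cpu == -1, that is, the calling thread on any CPU.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper: glibc has none, so go through syscall(2) directly. */
static long
perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/*
 * Hypothetical helper: open a counter of retired user-space
 * instructions for the calling thread on any CPU.
 * Returns a file descriptor, or -1 with errno set.
 */
static int
open_self_instruction_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;          /* start disabled; enable later */
    attr.exclude_kernel = 1;    /* do not count kernel activity */
    attr.exclude_hv = 1;        /* do not count hypervisor activity */

    /* pid == 0, cpu == -1, group_fd == -1 (no group) */
    return (int) perf_event_open(&attr, 0, -1, -1, PERF_FLAG_FD_CLOEXEC);
}
```

       The call can fail (for example, if the caller lacks permission or the
       hardware event is unavailable), so the return value and errno should
       always be checked before the descriptor is used.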

       The  group_fd  argument  allows  event  groups to be created.  An event
       group has one event which is the group leader.  The leader  is  created
       first,  with  group_fd = -1.  The rest of the group members are created
       with subsequent perf_event_open() calls with group_fd being set to  the
       file  descriptor  of  the  group leader.  (A single event on its own is
       created with group_fd = -1 and is considered to be a group with only  1
       member.)   An  event group is scheduled onto the CPU as a unit: it will
       be put onto the CPU only if all of the events in the group can  be  put
       onto  the  CPU.  This means that the values of the member events can be
       meaningfully compared with each other (added, divided  to  get  ratios,
       and  so  on), since they have counted events for the same set of exe‐
       cuted instructions.
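
       The group creation sequence described above can be sketched as fol‐
       lows.  This is an illustration under stated assumptions, not a defini‐
       tive implementation: the wrapper is the usual syscall(2) shim, and
       open_cycles_instructions_group() is a hypothetical helper name.  The
       leader is opened first with group_fd = -1, and the second event
       passes the leader's file descriptor as group_fd.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static long
perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/*
 * Hypothetical helper: create a two-event group for the calling thread.
 * fds[0] is the leader (CPU cycles), fds[1] the member (instructions).
 * Returns 0 on success; on failure returns -1 and sets both fds to -1.
 */
static int
open_cycles_instructions_group(int fds[2])
{
    struct perf_event_attr attr;

    fds[0] = fds[1] = -1;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;          /* leader starts disabled */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* Leader: group_fd == -1 */
    fds[0] = (int) perf_event_open(&attr, 0, -1, -1, 0);
    if (fds[0] == -1)
        return -1;

    /* Member: group_fd is the leader's file descriptor */
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 0;
    fds[1] = (int) perf_event_open(&attr, 0, -1, fds[0], 0);
    if (fds[1] == -1) {
        close(fds[0]);
        fds[0] = -1;
        return -1;
    }
    return 0;
}
```

       Because both events are scheduled as a unit, dividing the instruc‐
       tions count by the cycles count gives a ratio taken over the same
       stretch of execution.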

       The flags argument is formed by ORing together zero or more of the fol‐
       lowing values:

       PERF_FLAG_FD_CLOEXEC (since Linux 3.14)
              This  flag  enables the close-on-exec flag for the created event
              file descriptor, so that the file  descriptor  is  automatically
              closed  on  execve(2).   Setting  the close-on-exec flag at cre‐
              ation time, rather than later with  fcntl(2),  avoids  potential
              race    conditions    where    the    calling   thread   invokes
              perf_event_open() and fcntl(2)  at  the  same  time  as  another
              thread calls fork(2) then execve(2).

       PERF_FLAG_FD_NO_GROUP
              This  flag  tells the event to ignore the group_fd parameter ex‐
              cept for the purpose of setting up output redirection using  the
              PERF_FLAG_FD_OUTPUT flag.

       PERF_FLAG_FD_OUTPUT (broken since Linux 2.6.35)
              This flag re-routes the event's sampled output to instead be in‐
              cluded in the mmap buffer of the event specified by group_fd.

       PERF_FLAG_PID_CGROUP (since Linux 2.6.39)
              This flag activates  per-container  system-wide  monitoring.   A
              container is an abstraction that isolates a set of resources for
              finer-grained control (CPUs, memory, etc.).  In this  mode,  the
              event  is  measured  only if the thread running on the monitored
              CPU belongs to the designated container (cgroup).  The cgroup is
              identified  by passing a file descriptor opened on its directory
              in the cgroupfs filesystem.  For instance, if the cgroup to mon‐
              itor   is   called  test,  then  a  file  descriptor  opened  on
              /dev/cgroup/test (assuming cgroupfs is mounted  on  /dev/cgroup)
              must  be  passed  as  the  pid  parameter.  cgroup monitoring is
              available only for system-wide events and may therefore  require
              extra permissions.

       The  perf_event_attr structure provides detailed configuration informa‐
       tion for the event being created.

           struct perf_event_attr {
               __u32 type;                 /* Type of event */
               __u32 size;                 /* Size of attribute structure */
               __u64 config;               /* Type-specific configuration */

               union {
                   __u64 sample_period;    /* Period of sampling */
                   __u64 sample_freq;      /* Frequency of sampling */
               };

               __u64 sample_type;  /* Specifies values included in sample */
               __u64 read_format;  /* Specifies values returned in read */

               __u64 disabled       : 1,   /* off by default */
                     inherit        : 1,   /* children inherit it */
                     pinned         : 1,   /* must always be on PMU */
                     exclusive      : 1,   /* only group on PMU */
                     exclude_user   : 1,   /* don't count user */
                     exclude_kernel : 1,   /* don't count kernel */
                     exclude_hv     : 1,   /* don't count hypervisor */
                     exclude_idle   : 1,   /* don't count when idle */
                     mmap           : 1,   /* include mmap data */
                     comm           : 1,   /* include comm data */
                     freq           : 1,   /* use freq, not period */
                     inherit_stat   : 1,   /* per task counts */
                     enable_on_exec : 1,   /* next exec enables */
                     task           : 1,   /* trace fork/exit */
                     watermark      : 1,   /* wakeup_watermark */
                     precise_ip     : 2,   /* skid constraint */
                     mmap_data      : 1,   /* non-exec mmap data */
                     sample_id_all  : 1,   /* sample_type all events */
                     exclude_host   : 1,   /* don't count in host */
                     exclude_guest  : 1,   /* don't count in guest */
                     exclude_callchain_kernel : 1,
                                           /* exclude kernel callchains */
                     exclude_callchain_user   : 1,
                                           /* exclude user callchains */
                     mmap2          :  1,  /* include mmap with inode data */
                     comm_exec      :  1,  /* flag comm events that are
                                              due to exec */
                     use_clockid    :  1,  /* use clockid for time fields */
                     context_switch :  1,  /* context switch data */
                     write_backward :  1,  /* Write ring buffer from end
                                              to beginning */
                     namespaces     :  1,  /* include namespaces data */
                     ksymbol        :  1,  /* include ksymbol events */
                     bpf_event      :  1,  /* include bpf events */
                     aux_output     :  1,  /* generate AUX records
                                              instead of events */
                     cgroup         :  1,  /* include cgroup events */
                     text_poke      :  1,  /* include text poke events */

                     __reserved_1   : 30;

               union {
                   __u32 wakeup_events;    /* wakeup every n events */
                   __u32 wakeup_watermark; /* bytes before wakeup */
               };

               __u32     bp_type;          /* breakpoint type */

               union {
                   __u64 bp_addr;          /* breakpoint address */
                   __u64 kprobe_func;      /* for perf_kprobe */
                   __u64 uprobe_path;      /* for perf_uprobe */
                   __u64 config1;          /* extension of config */
               };

               union {
                   __u64 bp_len;           /* breakpoint length */
                   __u64 kprobe_addr;      /* with kprobe_func == NULL */
                   __u64 probe_offset;     /* for perf_[k,u]probe */
                   __u64 config2;          /* extension of config1 */
               };
               __u64 branch_sample_type;   /* enum perf_branch_sample_type */
               __u64 sample_regs_user;     /* user regs to dump on samples */
               __u32 sample_stack_user;    /* size of stack to dump on
                                              samples */
               __s32 clockid;              /* clock to use for time fields */
               __u64 sample_regs_intr;     /* regs to dump on samples */
               __u32 aux_watermark;        /* aux bytes before wakeup */
               __u16 sample_max_stack;     /* max frames in callchain */
               __u16 __reserved_2;         /* align to u64 */

           };

       The fields of the perf_event_attr structure are described in  more  de‐
       tail below:

       type   This  field specifies the overall event type.  It has one of the
              following values:

              PERF_TYPE_HARDWARE
                     This indicates one of the "generalized"  hardware  events
                     provided  by the kernel.  See the config field definition
                     for more details.

              PERF_TYPE_SOFTWARE
                     This indicates one of the  software-defined  events  pro‐
                     vided  by  the  kernel  (even  if  no hardware support is
                     available).

              PERF_TYPE_TRACEPOINT
                     This indicates a tracepoint provided by the kernel trace‐
                     point infrastructure.

              PERF_TYPE_HW_CACHE
                     This  indicates  a hardware cache event.  This has a spe‐
                     cial encoding, described in the config field definition.

              PERF_TYPE_RAW
                     This indicates a "raw" implementation-specific  event  in
                     the config field.

              PERF_TYPE_BREAKPOINT (since Linux 2.6.33)
                     This  indicates  a hardware breakpoint as provided by the
                     CPU.  Breakpoints can be read/write accesses  to  an  ad‐
                     dress as well as execution of an instruction address.

              dynamic PMU
                     Since  Linux 2.6.38, perf_event_open() can support multi‐
                     ple PMUs.  To enable this, a value exported by the kernel
                     can  be  used  in the type field to indicate which PMU to
                     use.  The value to use can be found in the sysfs filesys‐
                     tem:  there  is  a  subdirectory  per  PMU instance under
                     /sys/bus/event_source/devices.   In   each   subdirectory
                     there is a type file whose content is an integer that can
                     be   used   in   the   type   field.     For    instance,
                     /sys/bus/event_source/devices/cpu/type contains the value
                     for the core CPU PMU, which is usually 4.

              kprobe and uprobe (since Linux 4.17)
                     These two dynamic PMUs create a kprobe/uprobe and  attach
                     it to the file descriptor generated by perf_event_open().
                     The kprobe/uprobe will be destroyed on the destruction of
                     the   file   descriptor.   See  fields  kprobe_func,  up‐
                     robe_path, kprobe_addr, and  probe_offset  for  more  de‐
                     tails.
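
       For a dynamic PMU, the type value has to be read from sysfs at run
       time.  The following sketch shows one way to do that; read_pmu_type()
       is a hypothetical helper name, and the only path assumed is the
       /sys/bus/event_source/devices/<pmu>/type layout described above.

```c
#include <stdio.h>

/*
 * Hypothetical helper: return the integer stored in
 * /sys/bus/event_source/devices/<pmu>/type, or -1 if the file
 * cannot be read or does not contain an integer.
 */
static int
read_pmu_type(const char *pmu)
{
    char path[256];
    FILE *f;
    int type = -1;
    int n;

    n = snprintf(path, sizeof(path),
                 "/sys/bus/event_source/devices/%s/type", pmu);
    if (n < 0 || (size_t) n >= sizeof(path))
        return -1;              /* name too long for the buffer */

    f = fopen(path, "r");
    if (f == NULL)
        return -1;              /* no such PMU, or sysfs not mounted */

    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);

    return type;
}
```

       A caller would store the result of, for example,
       read_pmu_type("cpu") in the type field of perf_event_attr after
       checking that it is not -1.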

       size   The  size  of the perf_event_attr structure for forward/backward
              compatibility.  Set this using sizeof(struct perf_event_attr) to
              allow  the kernel to see the struct size at the time of compila‐
              tion.

              The related define PERF_ATTR_SIZE_VER0 is set to  64;  this  was
              the  size of the first published struct.  PERF_ATTR_SIZE_VER1 is
              72, corresponding  to  the  addition  of  breakpoints  in  Linux
              2.6.33.  PERF_ATTR_SIZE_VER2 is 80 corresponding to the addition
              of branch sampling in Linux 3.4.  PERF_ATTR_SIZE_VER3 is 96 cor‐
              responding   to   the  addition  of  sample_regs_user  and  sam‐
              ple_stack_user in Linux 3.7.  PERF_ATTR_SIZE_VER4 is 104  corre‐
              sponding  to  the  addition  of  sample_regs_intr in Linux 3.19.
              PERF_ATTR_SIZE_VER5 is 112  corresponding  to  the  addition  of
              aux_watermark in Linux 4.1.
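
       The relationship between structure size and ABI revision can be
       made concrete with a small lookup.  This is only a sketch:
       attr_size_version() is a hypothetical helper, and it assumes kernel
       headers recent enough to define PERF_ATTR_SIZE_VER5.

```c
#include <linux/perf_event.h>

/*
 * Hypothetical helper: map a perf_event_attr size in bytes to the
 * published revision number listed above, or -1 if the size matches
 * none of VER0..VER5 (for example, a newer or malformed structure).
 */
static int
attr_size_version(unsigned int size)
{
    switch (size) {
    case PERF_ATTR_SIZE_VER0: return 0;   /* 64: first published struct */
    case PERF_ATTR_SIZE_VER1: return 1;   /* 72: breakpoints */
    case PERF_ATTR_SIZE_VER2: return 2;   /* 80: branch sampling */
    case PERF_ATTR_SIZE_VER3: return 3;   /* 96: sample_regs_user etc. */
    case PERF_ATTR_SIZE_VER4: return 4;   /* 104: sample_regs_intr */
    case PERF_ATTR_SIZE_VER5: return 5;   /* 112: aux_watermark */
    default:                  return -1;
    }
}
```

       With headers newer than Linux 4.1, sizeof(struct perf_event_attr)
       may be larger than 112, in which case this helper returns -1.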

       config This  specifies  which  event  you want, in conjunction with the
              type field.  The config1 and config2 fields are also taken  into
              account  in  cases  where 64 bits is not enough to fully specify
              the event.  The encoding of these fields is event dependent.

              There are various ways to set the config field that  are  depen‐
              dent  on the value of the previously described type field.  What
              follows are various possible settings for config  separated  out
              by type.

              If  type is PERF_TYPE_HARDWARE, we are measuring one of the gen‐
              eralized hardware CPU events.  Not all of these are available on
              all platforms.  Set config to one of the following:

                   PERF_COUNT_HW_CPU_CYCLES
                          Total  cycles.   Be  wary of what happens during CPU
                          frequency scaling.

                   PERF_COUNT_HW_INSTRUCTIONS
                          Retired instructions.  Be careful, these can be  af‐
                          fected  by various issues, most notably hardware in‐
                          terrupt counts.

                   PERF_COUNT_HW_CACHE_REFERENCES
                          Cache accesses.  Usually this indicates  Last  Level
                          Cache  accesses  but this may vary depending on your
                          CPU.  This may include prefetches and coherency mes‐
                          sages; again this depends on the design of your CPU.

                   PERF_COUNT_HW_CACHE_MISSES
                          Cache  misses.   Usually  this  indicates Last Level
                          Cache misses; this is intended to be  used  in  con‐
                          junction   with  the  PERF_COUNT_HW_CACHE_REFERENCES
                          event to calculate cache miss rates.

                   PERF_COUNT_HW_BRANCH_INSTRUCTIONS
                          Retired branch instructions.  Prior to Linux 2.6.35,
                          this used the wrong event on AMD processors.

                   PERF_COUNT_HW_BRANCH_MISSES
                          Mispredicted branch instructions.

                   PERF_COUNT_HW_BUS_CYCLES
                          Bus  cycles,  which  can be different from total cy‐
                          cles.

                   PERF_COUNT_HW_STALLED_CYCLES_FRONTEND (since Linux 3.0)
                          Stalled cycles during issue.

                   PERF_COUNT_HW_STALLED_CYCLES_BACKEND (since Linux 3.0)
                          Stalled cycles during retirement.

                   PERF_COUNT_HW_REF_CPU_CYCLES (since Linux 3.3)
                          Total cycles; not affected by CPU frequency scaling.

              If type is PERF_TYPE_SOFTWARE, we are measuring software  events
              provided by the kernel.  Set config to one of the following:

                   PERF_COUNT_SW_CPU_CLOCK
                          This  reports  the CPU clock, a high-resolution per-
                          CPU timer.

                   PERF_COUNT_SW_TASK_CLOCK
                          This reports a clock count specific to the task that
                          is running.

                   PERF_COUNT_SW_PAGE_FAULTS
                          This reports the number of page faults.

                   PERF_COUNT_SW_CONTEXT_SWITCHES
                          This  counts  context switches.  Until Linux 2.6.34,
                          these were all reported as user-space events;  after
                          that, they are reported as happening in the kernel.

                   PERF_COUNT_SW_CPU_MIGRATIONS
                          This reports the number of times the process has mi‐
                          grated to a new CPU.

                   PERF_COUNT_SW_PAGE_FAULTS_MIN
                          This counts the number of minor page faults.   These
                          did not require disk I/O to handle.

                   PERF_COUNT_SW_PAGE_FAULTS_MAJ
                          This  counts the number of major page faults.  These
                          required disk I/O to handle.

                   PERF_COUNT_SW_ALIGNMENT_FAULTS (since Linux 2.6.33)
                          This counts the number of alignment  faults.   These
                          happen  when  unaligned  memory accesses happen; the
                          kernel can handle these but it reduces  performance.
                          This  happens  only  on some architectures (never on
                          x86).

                   PERF_COUNT_SW_EMULATION_FAULTS (since Linux 2.6.33)
                          This counts the number  of  emulation  faults.   The
                          kernel sometimes traps on unimplemented instructions
                          and emulates them for user space.   This  can  nega‐
                          tively impact performance.

                   PERF_COUNT_SW_DUMMY (since Linux 3.12)
                          This  is  a  placeholder  event that counts nothing.
                          Informational sample record types such  as  mmap  or
                          comm  must be associated with an active event.  This
                          dummy event allows gathering  such  records  without
                          requiring a counting event.

              If  type  is  PERF_TYPE_TRACEPOINT, then we are measuring kernel
              tracepoints.  The value to use in config can  be  obtained  from
              under  debugfs tracing/events/*/*/id if ftrace is enabled in the
              kernel.

              If type is PERF_TYPE_HW_CACHE, then we are measuring a  hardware
              CPU cache event.  To calculate the appropriate config value, use
              the following equation:

                      config = (perf_hw_cache_id) |
                               (perf_hw_cache_op_id << 8) |
                               (perf_hw_cache_op_result_id << 16);

                  where perf_hw_cache_id is one of:

                      PERF_COUNT_HW_CACHE_L1D
                             for measuring Level 1 Data Cache

                      PERF_COUNT_HW_CACHE_L1I
                             for measuring Level 1 Instruction Cache

                      PERF_COUNT_HW_CACHE_LL
                             for measuring Last-Level Cache

                      PERF_COUNT_HW_CACHE_DTLB
                             for measuring the Data TLB

                      PERF_COUNT_HW_CACHE_ITLB
                             for measuring the Instruction TLB

                      PERF_COUNT_HW_CACHE_BPU
                             for measuring the branch prediction unit

                      PERF_COUNT_HW_CACHE_NODE (since Linux 3.1)
                             for measuring local memory accesses

                  and perf_hw_cache_op_id is one of:

                      PERF_COUNT_HW_CACHE_OP_READ
                             for read accesses

                      PERF_COUNT_HW_CACHE_OP_WRITE
                             for write accesses

                      PERF_COUNT_HW_CACHE_OP_PREFETCH
                             for prefetch accesses

                  and perf_hw_cache_op_result_id is one of:

                      PERF_COUNT_HW_CACHE_RESULT_ACCESS
                             to measure accesses

                      PERF_COUNT_HW_CACHE_RESULT_MISS
                             to measure misses
437              If type is PERF_TYPE_RAW, then a custom "raw"  config  value  is
438              needed.   Most  CPUs  support events that are not covered by the
439              "generalized" events.  These  are  implementation  defined;  see
440              your  CPU  manual (for example the Intel Volume 3B documentation
441              or the AMD BIOS and Kernel Developer Guide).   The  libpfm4  li‐
442              brary  can  be  used to translate from the name in the architec‐
443              tural manuals to the raw hex value perf_event_open() expects  in
444              this field.
445
446              If  type is PERF_TYPE_BREAKPOINT, then leave config set to zero.
447              Its parameters are set in other places.
448
449              If type is kprobe or uprobe, set retprobe (bit 0 of config,  see
450              /sys/bus/event_source/devices/[k,u]probe/format/retprobe)    for
451              kretprobe/uretprobe.   See  fields   kprobe_func,   uprobe_path,
452              kprobe_addr, and probe_offset for more details.
453
454       kprobe_func, uprobe_path, kprobe_addr, and probe_offset
455              These  fields describe the kprobe/uprobe for dynamic PMUs kprobe
456              and uprobe.  For kprobe: use kprobe_func  and  probe_offset,  or
457              use  kprobe_addr and leave kprobe_func as NULL.  For uprobe: use
458              uprobe_path and probe_offset.
459
460       sample_period, sample_freq
461              A "sampling" event is one that generates an  overflow  notifica‐
462              tion  every N events, where N is given by sample_period.  A sam‐
463              pling event has sample_period > 0.  When an overflow occurs, re‐
464              quested  data  is  recorded in the mmap buffer.  The sample_type
465              field controls what data is recorded on each overflow.
466
467              sample_freq can be used if you wish to use frequency rather than
468              period.   In  this case, you set the freq flag.  The kernel will
469              adjust the sampling period to try and achieve the desired  rate.
470              The rate of adjustment is a timer tick.
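
              The choice between a fixed period and a target frequency can
              be sketched as follows.  This is an illustration, not a de‐
              finitive implementation: init_sampling_attr() is a hypotheti‐
              cal helper, and the event and sample_type chosen here are ar‐
              bitrary examples.

```c
#include <linux/perf_event.h>
#include <string.h>

/*
 * Hypothetical helper: prepare a CPU-cycles sampling event.
 * If use_freq is nonzero, value is a target rate and the freq flag
 * is set; otherwise value is the number of events between samples.
 */
static void
init_sampling_attr(struct perf_event_attr *attr, int use_freq,
                   unsigned long long value)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;

    if (use_freq) {
        attr->freq = 1;                 /* interpret the union as a rate */
        attr->sample_freq = value;
    } else {
        attr->sample_period = value;    /* overflow every 'value' events */
    }
}
```

              Because sample_period and sample_freq share a union, only
              one of them is meaningful for a given event; the freq flag
              tells the kernel which interpretation applies.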

       sample_type
              The  various  bits in this field specify which values to include
              in the sample.  They will be recorded in a ring-buffer, which is
              available  to  user space using mmap(2).  The order in which the
              values are saved in the sample is documented in the MMAP  Layout
              subsection  below;  it  is not the enum perf_event_sample_format
              order.

              PERF_SAMPLE_IP
                     Records instruction pointer.

              PERF_SAMPLE_TID
                     Records the process and thread IDs.

              PERF_SAMPLE_TIME
                     Records a timestamp.

              PERF_SAMPLE_ADDR
                     Records an address, if applicable.

              PERF_SAMPLE_READ
                     Records counter values for all events in a group, not just
                     the group leader.

              PERF_SAMPLE_CALLCHAIN
                     Records the callchain (stack backtrace).

              PERF_SAMPLE_ID
                     Records a unique ID for the opened event's group leader.

              PERF_SAMPLE_CPU
                     Records CPU number.

              PERF_SAMPLE_PERIOD
                     Records the current sampling period.

              PERF_SAMPLE_STREAM_ID
                     Records  a  unique  ID  for  the  opened  event.   Unlike
                     PERF_SAMPLE_ID the actual ID is returned, not  the  group
                     leader.   This  ID  is  the  same  as the one returned by
                     PERF_FORMAT_ID.

              PERF_SAMPLE_RAW
                     Records additional data, if applicable.  Usually returned
                     by tracepoint events.

              PERF_SAMPLE_BRANCH_STACK (since Linux 3.4)
                     This provides a record of recent branches, as provided by
                     CPU branch sampling hardware (such as Intel  Last  Branch
                     Record).  Not all hardware supports this feature.

                     See  the branch_sample_type field for how to filter which
                     branches are reported.

              PERF_SAMPLE_REGS_USER (since Linux 3.7)
                     Records the current user-level CPU  register  state  (the
                     values in the process before the kernel was called).

              PERF_SAMPLE_STACK_USER (since Linux 3.7)
                     Records the user level stack, allowing stack unwinding.

              PERF_SAMPLE_WEIGHT (since Linux 3.10)
                     Records  a  hardware provided weight value that expresses
                     how costly the sampled event was.  This allows the  hard‐
                     ware to highlight expensive events in a profile.

              PERF_SAMPLE_DATA_SRC (since Linux 3.10)
                     Records  the  data  source: where in the memory hierarchy
                     the data associated with  the  sampled  instruction  came
                     from.   This is available only if the underlying hardware
                     supports this feature.

              PERF_SAMPLE_IDENTIFIER (since Linux 3.12)
                     Places the SAMPLE_ID value in a  fixed  position  in  the
                     record, either at the beginning (for sample events) or at
                     the end (if a non-sample event).

                     This was necessary  because  a  sample  stream  may  have
                     records from various different event sources with differ‐
                     ent sample_type settings.  Parsing the event stream prop‐
                     erly  was  not  possible because the format of the record
                     was needed to find SAMPLE_ID, but the format could not be
                     found  without  knowing what event the sample belonged to
                     (causing a circular dependency).

                     The PERF_SAMPLE_IDENTIFIER setting makes the event stream
                     always parsable by putting SAMPLE_ID in a fixed location,
                     even though it means having duplicate SAMPLE_ID values in
                     records.

              PERF_SAMPLE_TRANSACTION (since Linux 3.13)
                     Records  reasons  for  transactional  memory abort events
                     (for example, from Intel TSX  transactional  memory  sup‐
                     port).

                     The  precise_ip  setting  must  be  greater  than 0 and a
                     transactional memory abort event must be measured  or  no
                     values  will be recorded.  Also note that some perf_event
                     measurements, such as sampled cycle counting,  may  cause
                     extraneous  aborts  (by  causing  an  interrupt  during a
                     transaction).

              PERF_SAMPLE_REGS_INTR (since Linux 3.19)
                     Records a subset of the current  CPU  register  state  as
                     specified    by   sample_regs_intr.    Unlike   PERF_SAM‐
                     PLE_REGS_USER the register values will return kernel reg‐
                     ister state if the overflow happened while kernel code is
                     running.  If the CPU supports hardware sampling of regis‐
                     ter state (e.g., PEBS on Intel x86) and precise_ip is set
                     higher than zero then the register  values  returned  are
                     those captured by hardware at the time of the sampled in‐
                     struction's retirement.

              PERF_SAMPLE_PHYS_ADDR (since Linux 4.13)
                     Records the physical address of data, as in  PERF_SAM‐
                     PLE_ADDR.

              PERF_SAMPLE_CGROUP (since Linux 5.7)
                     Records the (perf_event) cgroup ID of the process.   This
                     corresponds  to  the  id field in the PERF_RECORD_CGROUP
                     event.
592
593       read_format
594              This field specifies the format of the data returned by  read(2)
595              on a perf_event_open() file descriptor.
596
597              PERF_FORMAT_TOTAL_TIME_ENABLED
598                     Adds  the 64-bit time_enabled field.  This can be used to
599                     calculate estimated totals if the  PMU  is  overcommitted
600                     and multiplexing is happening.
601
602              PERF_FORMAT_TOTAL_TIME_RUNNING
603                     Adds  the 64-bit time_running field.  This can be used to
604                     calculate estimated totals if the  PMU  is  overcommitted
605                     and multiplexing is happening.
606
607              PERF_FORMAT_ID
608                     Adds  a 64-bit unique value that corresponds to the event
609                     group.
610
611              PERF_FORMAT_GROUP
612                     Allows all counter values in an event group  to  be  read
613                     with one read.
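
              The estimate mentioned for the two time fields can be sketched
              as follows.  This is a minimal illustration, not part of the
              kernel interface: the helper name scale_count() is hypothetical,
              and returning 0 when the event never ran is an assumption made
              here.

```c
#include <stdint.h>

/*
 * Estimate the full count of a multiplexed event.  The raw count covers
 * only the time the event was on the PMU (time_running); scale it up to
 * the time it was enabled (time_enabled).  Splitting into quotient and
 * remainder keeps the intermediate products smaller than value * enabled.
 */
static uint64_t scale_count(uint64_t value, uint64_t time_enabled,
                            uint64_t time_running)
{
    uint64_t quot, rem;

    if (time_running == 0)
        return 0;               /* event never ran: nothing to scale */
    if (time_running == time_enabled)
        return value;           /* no multiplexing took place */

    quot = value / time_running;
    rem  = value % time_running;
    return quot * time_enabled + (rem * time_enabled) / time_running;
}
```

              For example, a raw count of 1000 with time_enabled of 200 and
              time_running of 100 yields an estimate of 2000.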
614
615       disabled
616              The  disabled  bit specifies whether the counter starts out dis‐
617              abled or enabled.  If disabled, the event can later  be  enabled
618              by ioctl(2), prctl(2), or enable_on_exec.
619
620              When creating an event group, typically the group leader is ini‐
621              tialized with disabled set to 1 and any child  events  are  ini‐
622              tialized  with disabled set to 0.  Despite disabled being 0, the
623              child events will not start until the group leader is enabled.
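
              The group setup described above might look like the following
              sketch.  The helper names init_hw_attr() and
              open_cycles_instructions_group() are hypothetical, the choice
              of cycles and instructions is only an example, and the local
              perf_event_open() wrapper is needed because glibc provides
              none.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper around the raw system call. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Fill in a hardware counting event; only "disabled" differs per role. */
static void init_hw_attr(struct perf_event_attr *attr, __u64 config,
                         int is_leader)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = config;
    attr->disabled = is_leader ? 1 : 0;   /* leader starts disabled */
}

/*
 * Open cycles (leader) and instructions (child) for the calling thread
 * on any CPU.  Returns the leader fd, or -1 on failure.
 */
static int open_cycles_instructions_group(int *child_fd)
{
    struct perf_event_attr leader, child;
    int lfd, cfd;

    init_hw_attr(&leader, PERF_COUNT_HW_CPU_CYCLES, 1);
    init_hw_attr(&child, PERF_COUNT_HW_INSTRUCTIONS, 0);

    lfd = (int)perf_event_open(&leader, 0, -1, -1, 0);
    if (lfd == -1)
        return -1;
    cfd = (int)perf_event_open(&child, 0, -1, lfd, 0);
    if (cfd == -1) {
        close(lfd);
        return -1;
    }
    *child_fd = cfd;
    return lfd;
}
```

              Neither event counts until the leader is enabled, for example
              with the PERF_EVENT_IOC_ENABLE ioctl(2) operation on the leader
              file descriptor.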
624
625       inherit
626              The inherit bit specifies that this counter should count  events
627              of child tasks as well as the task specified.  This applies only
628              to new children, not to any existing children at  the  time  the
629              counter  is  created  (nor to any new children of existing chil‐
630              dren).
631
632              Inherit does not work for some combinations of read_format  val‐
633              ues, such as PERF_FORMAT_GROUP.
634
635       pinned The  pinned  bit  specifies that the counter should always be on
636              the CPU if at all possible.  It applies only to  hardware  coun‐
637              ters  and  only to group leaders.  If a pinned counter cannot be
638              put onto the CPU (e.g., because there are  not  enough  hardware
639              counters  or  because of a conflict with some other event), then
640              the counter goes into an 'error' state, where reads return  end-
641              of-file  (i.e.,  read(2)  returns 0) until the counter is subse‐
642              quently enabled or disabled.
643
644       exclusive
645              The exclusive bit specifies that when this counter's group is on
646              the  CPU,  it should be the only group using the CPU's counters.
647              In the future this may allow monitoring programs to support  PMU
648              features  that  need  to  run  alone so that they do not disrupt
649              other hardware counters.
650
651              Note that many unexpected situations may prevent events with the
652              exclusive  bit  set  from ever running.  This includes any users
653              running a system-wide measurement as well as any kernel  use  of
654              the  performance  counters  (including  the commonly enabled NMI
655              Watchdog Timer interface).
656
657       exclude_user
658              If this bit is set, the count excludes  events  that  happen  in
659              user space.
660
661       exclude_kernel
662              If  this  bit  is  set, the count excludes events that happen in
663              kernel space.
664
665       exclude_hv
666              If this bit is set, the count excludes events that happen in the
667              hypervisor.   This is mainly for PMUs that have built-in support
668              for handling this (such as POWER).  Extra support is needed  for
669              handling hypervisor measurements on most machines.
670
671       exclude_idle
672              If  set,  don't  count  when  the  CPU is running the idle task.
673              While you can currently enable this for any event  type,  it  is
674              ignored for all but software events.
675
676       mmap   The  mmap bit enables generation of PERF_RECORD_MMAP samples for
677              every mmap(2) call that has PROT_EXEC set.  This allows tools to
678              notice  new executable code being mapped into a program (dynamic
679              shared libraries for example) so that addresses  can  be  mapped
680              back to the original code.
681
682       comm   The  comm  bit enables tracking of process command name as modi‐
683              fied by the exec(2) and prctl(PR_SET_NAME) system calls as  well
684              as  writing  to  /proc/self/comm.  If the comm_exec flag is also
685              successfully set (possible since Linux 3.16), then the misc flag
686              PERF_RECORD_MISC_COMM_EXEC  can  be  used  to  differentiate the
687              exec(2) case from the others.
688
       freq   If this bit is set, then sample_frequency, not sample_period, is
              used when setting up the sampling interval.
691
692       inherit_stat
693              This  bit  enables  saving of event counts on context switch for
694              inherited tasks.  This is meaningful only if the  inherit  field
695              is set.
696
697       enable_on_exec
698              If  this  bit is set, a counter is automatically enabled after a
699              call to exec(2).
700
701       task   If this bit is set, then fork/exit notifications are included in
702              the ring buffer.
703
704       watermark
705              If  set,  have an overflow notification happen when we cross the
706              wakeup_watermark boundary.   Otherwise,  overflow  notifications
707              happen after wakeup_events samples.
708
709       precise_ip (since Linux 2.6.35)
710              This controls the amount of skid.  Skid is how many instructions
711              execute between an event of interest happening  and  the  kernel
712              being able to stop and record the event.  Smaller skid is better
713              and allows more accurate reporting of which events correspond to
714              which instructions, but hardware is often limited with how small
715              this can be.
716
717              The possible values of this field are the following:
718
719              0  SAMPLE_IP can have arbitrary skid.
720
721              1  SAMPLE_IP must have constant skid.
722
723              2  SAMPLE_IP requested to have 0 skid.
724
725              3  SAMPLE_IP must have 0 skid.   See  also  the  description  of
726                 PERF_RECORD_MISC_EXACT_IP.
727
728       mmap_data (since Linux 2.6.36)
729              This is the counterpart of the mmap field.  This enables genera‐
730              tion of PERF_RECORD_MMAP samples for mmap(2) calls that  do  not
731              have PROT_EXEC set (for example data and SysV shared memory).
732
733       sample_id_all (since Linux 2.6.38)
734              If  set, then TID, TIME, ID, STREAM_ID, and CPU can additionally
735              be included in non-PERF_RECORD_SAMPLEs if the corresponding sam‐
736              ple_type is selected.
737
738              If  PERF_SAMPLE_IDENTIFIER  is  specified, then an additional ID
739              value is included as the last value to ease parsing  the  record
740              stream.  This may lead to the id value appearing twice.
741
742              The layout is described by this pseudo-structure:
743
744                  struct sample_id {
745                      { u32 pid, tid; }   /* if PERF_SAMPLE_TID set */
746                      { u64 time;     }   /* if PERF_SAMPLE_TIME set */
747                      { u64 id;       }   /* if PERF_SAMPLE_ID set */
748                      { u64 stream_id;}   /* if PERF_SAMPLE_STREAM_ID set  */
749                      { u32 cpu, res; }   /* if PERF_SAMPLE_CPU set */
750                      { u64 id;       }   /* if PERF_SAMPLE_IDENTIFIER set */
751                  };
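
              Since every member of the pseudo-structure occupies 8 bytes,
              the size of this trailer follows directly from sample_type.
              The sketch below illustrates that calculation; the helper name
              sample_id_size() is hypothetical.

```c
#include <linux/perf_event.h>
#include <stddef.h>

/*
 * Size in bytes of the sample_id trailer appended to non-sample records
 * when sample_id_all is set, given the event's sample_type.
 * Every member of the pseudo-structure occupies 8 bytes.
 */
static size_t sample_id_size(__u64 sample_type)
{
    static const __u64 fields[] = {
        PERF_SAMPLE_TID, PERF_SAMPLE_TIME, PERF_SAMPLE_ID,
        PERF_SAMPLE_STREAM_ID, PERF_SAMPLE_CPU, PERF_SAMPLE_IDENTIFIER,
    };
    size_t size = 0;

    for (size_t i = 0; i < sizeof(fields) / sizeof(fields[0]); i++)
        if (sample_type & fields[i])
            size += 8;
    return size;
}
```

              With PERF_SAMPLE_TID and PERF_SAMPLE_TIME selected, for
              instance, the trailer is 16 bytes long.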
752
753       exclude_host (since Linux 3.2)
754              When  conducting  measurements that include processes running VM
755              instances (i.e., have executed a KVM_RUN ioctl(2)), only measure
756              events happening inside a guest instance.  This is only meaning‐
757              ful outside the guests; this  setting  does  not  change  counts
758              gathered  inside  of  a guest.  Currently, this functionality is
759              x86 only.
760
761       exclude_guest (since Linux 3.2)
762              When conducting measurements that include processes  running  VM
763              instances  (i.e., have executed a KVM_RUN ioctl(2)), do not mea‐
764              sure events happening inside  guest  instances.   This  is  only
765              meaningful  outside  the  guests;  this  setting does not change
766              counts gathered inside of a guest.  Currently, this  functional‐
767              ity is x86 only.
768
769       exclude_callchain_kernel (since Linux 3.7)
770              Do not include kernel callchains.
771
772       exclude_callchain_user (since Linux 3.7)
773              Do not include user callchains.
774
775       mmap2 (since Linux 3.16)
776              Generate an extended executable mmap record that contains enough
777              additional information to  uniquely  identify  shared  mappings.
778              The mmap flag must also be set for this to work.
779
780       comm_exec (since Linux 3.16)
              This is purely a feature-detection flag; it does not change
              kernel behavior.  If this flag can be set successfully, then,
              when comm is enabled, the PERF_RECORD_MISC_COMM_EXEC flag will
              be set in the misc field of a comm record header if the rename
              event being reported was caused by a call to exec(2).  This
              allows tools to distinguish between the various types of process
              renaming.
788
789       use_clockid (since Linux 4.1)
790              This  allows  selecting  which  internal Linux clock to use when
791              generating timestamps via the clockid field.  This can  make  it
792              easier  to correlate perf sample times with timestamps generated
793              by other tools.
794
795       context_switch (since Linux 4.3)
796              This enables the generation of PERF_RECORD_SWITCH records when a
797              context  switch  occurs.   It  also  enables  the  generation of
798              PERF_RECORD_SWITCH_CPU_WIDE records when  sampling  in  CPU-wide
799              mode.   This functionality is in addition to existing tracepoint
800              and software events for measuring context switches.  The  advan‐
801              tage  of  this method is that it will give full information even
802              with strict perf_event_paranoid settings.
803
804       write_backward (since Linux 4.6)
              This causes the ring buffer to be written from the end to the
              beginning.  This is to support reading from an overwritable ring
              buffer.
808
809       namespaces (since Linux 4.11)
810              This enables the generation  of  PERF_RECORD_NAMESPACES  records
811              when a task enters a new namespace.  Each namespace has a combi‐
812              nation of device and inode numbers.
813
814       ksymbol (since Linux 5.0)
              This enables the generation of PERF_RECORD_KSYMBOL records when
              new kernel symbols are registered or unregistered.  This is
              useful for analyzing dynamic kernel functions such as eBPF
              programs.
818
819       bpf_event (since Linux 5.0)
820              This enables the  generation  of  PERF_RECORD_BPF_EVENT  records
821              when an eBPF program is loaded or unloaded.
822
823       auxevent (since Linux 5.4)
824              This  allows  normal  (non-AUX)  events to generate data for AUX
825              events if the hardware supports it.
826
827       cgroup (since Linux 5.7)
828              This enables the generation of PERF_RECORD_CGROUP records when a
829              new cgroup is created (and activated).
830
831       text_poke (since Linux 5.8)
              This enables the generation of PERF_RECORD_TEXT_POKE records
              when there is a change to the kernel text (i.e., self-modifying
              code).
835
836       wakeup_events, wakeup_watermark
837              This  union  sets  how  many  samples  (wakeup_events)  or bytes
838              (wakeup_watermark) happen before an overflow  notification  hap‐
839              pens.  Which one is used is selected by the watermark bit flag.
840
841              wakeup_events  counts  only PERF_RECORD_SAMPLE record types.  To
842              receive overflow notification for all PERF_RECORD  types  choose
843              watermark and set wakeup_watermark to 1.
844
845              Prior  to  Linux  3.0, setting wakeup_events to 0 resulted in no
846              overflow notifications; more recent kernels treat 0 the same  as
847              1.
848
849       bp_type (since Linux 2.6.33)
850              This chooses the breakpoint type.  It is one of:
851
852              HW_BREAKPOINT_EMPTY
853                     No breakpoint.
854
855              HW_BREAKPOINT_R
856                     Count when we read the memory location.
857
858              HW_BREAKPOINT_W
859                     Count when we write the memory location.
860
861              HW_BREAKPOINT_RW
862                     Count when we read or write the memory location.
863
864              HW_BREAKPOINT_X
865                     Count when we execute code at the memory location.
866
867              The values can be combined via a bitwise or, but the combination
868              of HW_BREAKPOINT_R or HW_BREAKPOINT_W  with  HW_BREAKPOINT_X  is
869              not allowed.
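
              The restriction on combinations can be checked before calling
              perf_event_open().  The following is a minimal sketch; the
              helper name bp_type_combination_allowed() is hypothetical, and
              it tests only the rule stated above, not everything the kernel
              may reject.

```c
#include <linux/hw_breakpoint.h>
#include <stdbool.h>

/*
 * Check the one restriction stated above: an execute breakpoint cannot
 * be combined with a read or write breakpoint.
 */
static bool bp_type_combination_allowed(unsigned int bp_type)
{
    bool wants_exec = (bp_type & HW_BREAKPOINT_X) != 0;
    bool wants_data = (bp_type & HW_BREAKPOINT_RW) != 0;

    return !(wants_exec && wants_data);
}
```

              Here HW_BREAKPOINT_R | HW_BREAKPOINT_W passes the check, while
              HW_BREAKPOINT_X | HW_BREAKPOINT_W does not.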
870
871       bp_addr (since Linux 2.6.33)
872              This  is  the  address  of the breakpoint.  For execution break‐
873              points, this is the memory address of the instruction of  inter‐
874              est; for read and write breakpoints, it is the memory address of
875              the memory location of interest.
876
877       config1 (since Linux 2.6.39)
878              config1 is used for setting events that need an  extra  register
879              or  otherwise  do not fit in the regular config field.  Raw OFF‐
880              CORE_EVENTS on Nehalem/Westmere/SandyBridge use  this  field  on
881              Linux 3.3 and later kernels.
882
883       bp_len (since Linux 2.6.33)
884              bp_len is the length of the breakpoint being measured if type is
885              PERF_TYPE_BREAKPOINT.     Options    are    HW_BREAKPOINT_LEN_1,
886              HW_BREAKPOINT_LEN_2,    HW_BREAKPOINT_LEN_4,    and    HW_BREAK‐
887              POINT_LEN_8.   For  an  execution  breakpoint,   set   this   to
888              sizeof(long).
889
890       config2 (since Linux 2.6.39)
891              config2 is a further extension of the config1 field.
892
893       branch_sample_type (since Linux 3.4)
894              If PERF_SAMPLE_BRANCH_STACK is enabled, then this specifies what
895              branches to include in the branch record.
896
              The first part of the value is the privilege level, which is a
              combination of one or more of the values listed below.  If the
              user does not set the privilege level explicitly, the kernel will
              use the event's privilege level.  Event and branch privilege
              levels do not have to match.
902
903              PERF_SAMPLE_BRANCH_USER
904                     Branch target is in user space.
905
906              PERF_SAMPLE_BRANCH_KERNEL
907                     Branch target is in kernel space.
908
909              PERF_SAMPLE_BRANCH_HV
910                     Branch target is in hypervisor.
911
912              PERF_SAMPLE_BRANCH_PLM_ALL
913                     A convenience value that is the  three  preceding  values
914                     ORed together.
915
              In addition to the privilege value, at least one of the
              following bits must be set.
918
919              PERF_SAMPLE_BRANCH_ANY
920                     Any branch type.
921
922              PERF_SAMPLE_BRANCH_ANY_CALL
923                     Any call branch (includes direct calls,  indirect  calls,
924                     and far jumps).
925
926              PERF_SAMPLE_BRANCH_IND_CALL
927                     Indirect calls.
928
929              PERF_SAMPLE_BRANCH_CALL (since Linux 4.4)
930                     Direct calls.
931
932              PERF_SAMPLE_BRANCH_ANY_RETURN
933                     Any return branch.
934
935              PERF_SAMPLE_BRANCH_IND_JUMP (since Linux 4.2)
936                     Indirect jumps.
937
938              PERF_SAMPLE_BRANCH_COND (since Linux 3.16)
939                     Conditional branches.
940
941              PERF_SAMPLE_BRANCH_ABORT_TX (since Linux 3.11)
942                     Transactional memory aborts.
943
944              PERF_SAMPLE_BRANCH_IN_TX (since Linux 3.11)
945                     Branch in transactional memory transaction.
946
              PERF_SAMPLE_BRANCH_NO_TX (since Linux 3.11)
                     Branch not in transactional memory transaction.

              PERF_SAMPLE_BRANCH_CALL_STACK (since Linux 4.1)
                     Branch is part of a hardware-generated call stack.  This
                     requires hardware support, currently only found on Intel
                     x86 Haswell or newer.
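
              The requirement that a branch-type bit accompany the privilege
              bits can be tested as in the sketch below.  The helper name is
              hypothetical, and treating every bit outside
              PERF_SAMPLE_BRANCH_PLM_ALL as a branch-type bit is a
              simplifying assumption.

```c
#include <linux/perf_event.h>
#include <stdbool.h>

/*
 * A branch_sample_type value is usable only if, besides any privilege
 * bits, at least one branch-type bit is set.
 */
static bool branch_sample_type_has_filter(__u64 branch_sample_type)
{
    __u64 privilege_bits = PERF_SAMPLE_BRANCH_PLM_ALL;

    return (branch_sample_type & ~privilege_bits) != 0;
}
```

              For example, PERF_SAMPLE_BRANCH_USER |
              PERF_SAMPLE_BRANCH_ANY_CALL passes this check, whereas
              PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_KERNEL alone does
              not.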
953
954       sample_regs_user (since Linux 3.7)
955              This  bit  mask defines the set of user CPU registers to dump on
956              samples.  The layout of the register mask  is  architecture-spe‐
957              cific  and  is described in the kernel header file arch/ARCH/in‐
958              clude/uapi/asm/perf_regs.h.
959
960       sample_stack_user (since Linux 3.7)
961              This defines the size of the user stack  to  dump  if  PERF_SAM‐
962              PLE_STACK_USER is specified.
963
964       clockid (since Linux 4.1)
965              If  use_clockid  is  set, then this field selects which internal
966              Linux timer to use for timestamps.  The available timers are de‐
967              fined   in   linux/time.h,   with  CLOCK_MONOTONIC,  CLOCK_MONO‐
968              TONIC_RAW, CLOCK_REALTIME, CLOCK_BOOTTIME,  and  CLOCK_TAI  cur‐
969              rently supported.
970
971       aux_watermark (since Linux 4.1)
972              This   specifies   how  much  data  is  required  to  trigger  a
973              PERF_RECORD_AUX sample.
974
975       sample_max_stack (since Linux 4.8)
976              When  sample_type  includes  PERF_SAMPLE_CALLCHAIN,  this  field
977              specifies  how  many  stack frames to report when generating the
978              callchain.
979
980   Reading results
981       Once a perf_event_open() file descriptor has been opened, the values of
982       the  events  can be read from the file descriptor.  The values that are
983       there are specified by the read_format field in the attr  structure  at
984       open time.
985
986       If you attempt to read into a buffer that is not big enough to hold the
987       data, the error ENOSPC results.
988
989       Here is the layout of the data returned by a read:
990
991       * If PERF_FORMAT_GROUP was specified to allow reading all events  in  a
992         group at once:
993
994             struct read_format {
995                 u64 nr;            /* The number of events */
996                 u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
997                 u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
998                 struct {
999                     u64 value;     /* The value of the event */
1000                     u64 id;        /* if PERF_FORMAT_ID */
1001                 } values[nr];
1002             };
1003
1004       * If PERF_FORMAT_GROUP was not specified:
1005
1006             struct read_format {
1007                 u64 value;         /* The value of the event */
1008                 u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
1009                 u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
1010                 u64 id;            /* if PERF_FORMAT_ID */
1011             };
1012
1013       The values read are as follows:
1014
1015       nr     The number of events in this file descriptor.  Available only if
1016              PERF_FORMAT_GROUP was specified.
1017
1018       time_enabled, time_running
              Total time the event was enabled and running.  Normally these
              values are the same.  Multiplexing happens if the number of
              events is more than the number of available PMU counter slots.
              In that case the events run only part of the time and the
              time_enabled and time_running values can be used to scale an
              estimated value for the count.
1025
1026       value  An unsigned 64-bit value containing the counter result.
1027
1028       id     A  globally unique value for this particular event; only present
1029              if PERF_FORMAT_ID was specified in read_format.
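
       The PERF_FORMAT_GROUP layout above can be decoded word by word.  The
       following sketch assumes the caller has already filled a buffer with
       read(2); struct group_entry and parse_group_read() are hypothetical
       names introduced only for this illustration.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

struct group_entry {
    uint64_t value;
    uint64_t id;        /* 0 unless PERF_FORMAT_ID was requested */
};

/*
 * Decode a buffer filled by read(2) on a group leader opened with
 * PERF_FORMAT_GROUP.  "words" is the number of 64-bit words read.
 * Returns the number of entries stored in out[], or -1 if the buffer is
 * too short or out[] is too small.
 */
static int parse_group_read(const uint64_t *buf, size_t words,
                            uint64_t read_format,
                            uint64_t *time_enabled, uint64_t *time_running,
                            struct group_entry *out, size_t max_entries)
{
    size_t pos = 0;
    size_t per_entry = (read_format & PERF_FORMAT_ID) ? 2 : 1;
    uint64_t nr;

    if (words < 1)
        return -1;
    nr = buf[pos++];                        /* number of events */

    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) {
        if (pos >= words)
            return -1;
        *time_enabled = buf[pos++];
    }
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) {
        if (pos >= words)
            return -1;
        *time_running = buf[pos++];
    }

    if (nr > max_entries || nr > (words - pos) / per_entry)
        return -1;

    for (uint64_t i = 0; i < nr; i++) {
        out[i].value = buf[pos++];
        out[i].id = (read_format & PERF_FORMAT_ID) ? buf[pos++] : 0;
    }
    return (int)nr;
}
```

       The time fields are written only when the corresponding read_format
       bits are set, so callers should initialize them beforehand.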
1030
1031   MMAP layout
1032       When using perf_event_open() in sampled mode, asynchronous events (like
1033       counter  overflow  or  PROT_EXEC mmap tracking) are logged into a ring-
1034       buffer.  This ring-buffer is created and accessed through mmap(2).
1035
1036       The mmap size should be 1+2^n pages, where the first page is a metadata
1037       page (struct perf_event_mmap_page) that contains various bits of infor‐
1038       mation such as where the ring-buffer head is.
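
       The 1+2^n sizing rule can be expressed as in the sketch below.  The
       helper names are hypothetical; in a real program page_size would
       typically come from sysconf(_SC_PAGESIZE).

```c
#include <stdbool.h>
#include <stddef.h>

/* Length in bytes of a mapping with one metadata page and 2^n data pages. */
static size_t perf_mmap_length(unsigned int n, size_t page_size)
{
    return ((size_t)1 + ((size_t)1 << n)) * page_size;
}

/* True if "pages" has the form 1 + 2^n for some n >= 0. */
static bool perf_mmap_pages_valid(size_t pages)
{
    size_t data_pages;

    if (pages < 2)
        return false;
    data_pages = pages - 1;
    return (data_pages & (data_pages - 1)) == 0;
}
```

       With 4096-byte pages and n equal to 3, the mapping covers 9 pages, or
       36864 bytes.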
1039
       Before Linux 2.6.39, there was a bug that meant you had to allocate an
       mmap ring buffer when sampling even if you did not plan to access it.
1042
1043       The structure of the first metadata mmap page is as follows:
1044
           struct perf_event_mmap_page {
               __u32 version;        /* version number of this structure */
               __u32 compat_version; /* lowest version this is compat with */
               __u32 lock;           /* seqlock for synchronization */
               __u32 index;          /* hardware counter identifier */
               __s64 offset;         /* add to hardware counter value */
               __u64 time_enabled;   /* time event active */
               __u64 time_running;   /* time event on CPU */
               union {
                   __u64   capabilities;
                   struct {
                       __u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
                             cap_bit0_is_deprecated : 1,
                             cap_user_rdpmc         : 1,
                             cap_user_time          : 1,
                             cap_user_time_zero     : 1;
                   };
               };
               __u16 pmc_width;         /* bit width of rdpmc value */
               __u16 time_shift;        /* for cycle/time conversion */
               __u32 time_mult;         /* for cycle/time conversion */
               __u64 time_offset;       /* for cycle/time conversion */
               __u64 __reserved[120];   /* Pad to 1 k */
               __u64 data_head;         /* head in the data section */
               __u64 data_tail;         /* user-space written tail */
               __u64 data_offset;       /* where the buffer starts */
               __u64 data_size;         /* data buffer size */
               __u64 aux_head;          /* head in the AUX area */
               __u64 aux_tail;          /* user-space written AUX tail */
               __u64 aux_offset;        /* where the AUX area starts */
               __u64 aux_size;          /* AUX area size */
           };
1078
1079       The  following  list  describes  the fields in the perf_event_mmap_page
1080       structure in more detail:
1081
1082       version
1083              Version number of this structure.
1084
1085       compat_version
1086              The lowest version this is compatible with.
1087
1088       lock   A seqlock for synchronization.
1089
1090       index  A unique hardware counter identifier.
1091
1092       offset When using rdpmc for reads this offset value must  be  added  to
1093              the one returned by rdpmc to get the current total event count.
1094
1095       time_enabled
1096              Time the event was active.
1097
1098       time_running
1099              Time the event was running.
1100
1101       cap_usr_time / cap_usr_rdpmc / cap_bit0 (since Linux 3.4)
1102              There   was   a  bug  in  the  definition  of  cap_usr_time  and
1103              cap_usr_rdpmc from Linux 3.4 until Linux 3.11.  Both  bits  were
1104              defined  to  point to the same location, so it was impossible to
1105              know if cap_usr_time or cap_usr_rdpmc were actually set.
1106
1107              Starting with Linux 3.12, these are renamed to cap_bit0 and  you
1108              should use the cap_user_time and cap_user_rdpmc fields instead.
1109
1110       cap_bit0_is_deprecated (since Linux 3.12)
1111              If set, this bit indicates that the kernel supports the properly
1112              separated cap_user_time and cap_user_rdpmc bits.
1113
              If not set, it indicates an older kernel where cap_usr_time and
              cap_usr_rdpmc map to the same bit and thus both features should
              be used with caution.
1117
1118       cap_user_rdpmc (since Linux 3.12)
1119              If the hardware supports user-space read of performance counters
1120              without  syscall  (this is the "rdpmc" instruction on x86), then
1121              the following code can be used to do a read:
1122
                  u32 seq, time_mult, time_shift, idx, width;
                  u64 count, enabled, running;
                  u64 cyc, time_offset;
                  s64 pmc = 0;

                  do {
                      seq = pc->lock;     /* snapshot the seqlock */
                      barrier();
                      enabled = pc->time_enabled;
                      running = pc->time_running;

                      if (pc->cap_user_time && enabled != running) {
                          cyc = rdtsc();
                          time_offset = pc->time_offset;
                          time_mult   = pc->time_mult;
                          time_shift  = pc->time_shift;
                      }

                      idx = pc->index;
                      count = pc->offset;

                      if (pc->cap_user_rdpmc && idx) {
                          width = pc->pmc_width;
                          /* raw value; sign-extend to width bits and
                             add to count (see pmc_width) */
                          pmc = rdpmc(idx - 1);
                      }

                      barrier();
                  } while (pc->lock != seq);  /* retry if updated */
1150
1151       cap_user_time (since Linux 3.12)
1152              This bit indicates the hardware has a  constant,  nonstop  time‐
1153              stamp counter (TSC on x86).
1154
1155       cap_user_time_zero (since Linux 3.12)
1156              Indicates  the  presence of time_zero which allows mapping time‐
1157              stamp values to the hardware clock.
1158
1159       pmc_width
              If cap_user_rdpmc is set, this field provides the bit-width of
              the value read using the rdpmc or equivalent instruction.  This
              can be used to sign extend the result as follows:
1163
1164                  pmc <<= 64 - pmc_width;
1165                  pmc >>= 64 - pmc_width; // signed shift right
1166                  count += pmc;
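
              The sign extension can be wrapped in a small function, as in
              this sketch.  The name sign_extend_pmc() is hypothetical, and
              the code assumes that right-shifting a negative signed integer
              is an arithmetic shift, as it is with GCC and Clang.

```c
#include <stdint.h>

/*
 * Sign-extend a raw counter value that is only "width" bits wide.
 * Assumes 1 <= width <= 64 and an arithmetic right shift for signed
 * integers.
 */
static int64_t sign_extend_pmc(uint64_t raw, unsigned int width)
{
    unsigned int shift = 64 - width;

    return (int64_t)(raw << shift) >> shift;
}
```

              With a 48-bit counter, a raw value with all 48 bits set
              becomes -1, and small positive values are unchanged.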
1167
1168       time_shift, time_mult, time_offset
1169
              If cap_user_time is set, these fields can be used to compute the
              time delta since time_enabled (in nanoseconds) using rdtsc or
              similar:
1173
1174                  u64 quot, rem;
1175                  u64 delta;
1176
1177                  quot  = cyc >> time_shift;
1178                  rem   = cyc & (((u64)1 << time_shift) - 1);
1179                  delta = time_offset + quot * time_mult +
1180                          ((rem * time_mult) >> time_shift);
1181
              Here, time_offset, time_mult, time_shift, and cyc are read in
              the seqcount loop described above.  This delta can then be added
              to enabled and possibly running (if idx is nonzero), improving
              the scaling:
1185
1186                  enabled += delta;
1187                  if (idx)
1188                      running += delta;
1189                  quot  = count / running;
1190                  rem   = count % running;
1191                  count = quot * enabled + (rem * enabled) / running;
1192
1193       time_zero (since Linux 3.12)
1194
              If cap_user_time_zero is set, then the hardware clock (the TSC
              timestamp counter on x86) can be calculated from the time_zero,
              time_mult, and time_shift values:
1198
1199                  time = timestamp - time_zero;
1200                  quot = time / time_mult;
1201                  rem  = time % time_mult;
1202                  cyc  = (quot << time_shift) + (rem << time_shift) / time_mult;
1203
1204              And vice versa:
1205
1206                  quot = cyc >> time_shift;
1207                  rem  = cyc & (((u64)1 << time_shift) - 1);
1208                  timestamp = time_zero + quot * time_mult +
1209                              ((rem * time_mult) >> time_shift);
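
              The two conversions can be packaged as a pair of functions, as
              in the sketch below.  struct tsc_conv and both function names
              are hypothetical; the field values are assumed to have been
              copied from the metadata page inside the seqlock loop.

```c
#include <stdint.h>

/* Values copied from the metadata page inside the seqlock loop. */
struct tsc_conv {
    uint64_t time_zero;
    uint32_t time_mult;
    uint16_t time_shift;
};

/* Hardware clock cycles to perf timestamp. */
static uint64_t cyc_to_timestamp(const struct tsc_conv *tc, uint64_t cyc)
{
    uint64_t quot = cyc >> tc->time_shift;
    uint64_t rem  = cyc & (((uint64_t)1 << tc->time_shift) - 1);

    return tc->time_zero + quot * tc->time_mult +
           ((rem * tc->time_mult) >> tc->time_shift);
}

/* Perf timestamp to hardware clock cycles. */
static uint64_t timestamp_to_cyc(const struct tsc_conv *tc,
                                 uint64_t timestamp)
{
    uint64_t time = timestamp - tc->time_zero;
    uint64_t quot = time / tc->time_mult;
    uint64_t rem  = time % tc->time_mult;

    return (quot << tc->time_shift) +
           (rem << tc->time_shift) / tc->time_mult;
}
```

              With time_mult of 512 and time_shift of 10, each cycle
              corresponds to half a time unit, so 2048 cycles map to 1024
              units past time_zero.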
1210
1211       data_head
              This points to the head of the data section.  The value
              continuously increases; it does not wrap.  The value needs to be
              manually wrapped by the size of the mmap buffer before accessing
              the samples.
1216
1217              On  SMP-capable  platforms,  after  reading the data_head value,
1218              user space should issue an rmb().
1219
1220       data_tail
1221              When the mapping is PROT_WRITE, the data_tail  value  should  be
1222              written  by  user  space to reflect the last read data.  In this
1223              case, the kernel will not overwrite unread data.
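
              Because data_head and data_tail never wrap, a consumer has to
              reduce them to an offset within the data area itself.  The
              sketch below shows one way to copy bytes out of the ring; the
              name ring_copy() is hypothetical, and it assumes that the data
              area size is a power of two and that len does not exceed it.

```c
#include <stdint.h>
#include <string.h>

/*
 * Copy "len" bytes starting at the free-running position "pos" out of a
 * ring of "data_size" bytes.  data_size is assumed to be a power of two,
 * as it is for a 2^n-page data area, so masking implements the wrap.
 */
static void ring_copy(const uint8_t *data, uint64_t data_size,
                      uint64_t pos, void *dst, size_t len)
{
    uint64_t off = pos & (data_size - 1);
    size_t first = len;

    if (first > data_size - off)
        first = (size_t)(data_size - off);   /* bytes before the wrap */

    memcpy(dst, data + off, first);
    memcpy((uint8_t *)dst + first, data, len - first);
}
```

              A consumer would read data_head, issue the read barrier, copy
              the records lying between data_tail and data_head, and then
              store the new data_tail.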
1224
1225       data_offset (since Linux 4.1)
1226              Contains the offset of the location in  the  mmap  buffer  where
1227              perf sample data begins.
1228
1229       data_size (since Linux 4.1)
1230              Contains the size of the perf sample region within the mmap buf‐
1231              fer.
1232
1233       aux_head, aux_tail, aux_offset, aux_size (since Linux 4.1)
1234              The AUX region allows mmap(2)-ing a separate sample  buffer  for
1235              high-bandwidth  data streams (separate from the main perf sample
1236              buffer).  An example of a high-bandwidth stream  is  instruction
1237              tracing support, as is found in newer Intel processors.
1238
1239              To  set up an AUX area, first aux_offset needs to be set with an
1240              offset greater than data_offset+data_size and aux_size needs  to
1241              be  set to the desired buffer size.  The desired offset and size
1242              must be page aligned, and the size  must  be  a  power  of  two.
1243              These  values  are  then  passed to mmap in order to map the AUX
1244              buffer.  Pages in the AUX buffer are included  as  part  of  the
1245              RLIMIT_MEMLOCK  resource  limit  (see setrlimit(2)), and also as
1246              part of the perf_event_mlock_kb allowance.
1247
1248              By default, the AUX buffer will be truncated if it will not  fit
1249              in the available space in the ring buffer.  If the AUX buffer is
1250              mapped as a read only buffer, then it will operate in ring  buf‐
1251              fer  mode  where  old data will be overwritten by new.  In over‐
1252              write mode, it might not be possible to infer where the new data
1253              began, and it is the consumer's job to disable measurement while
1254              reading to avoid possible data races.
1255
1256              The aux_head and aux_tail ring buffer pointers have the same be‐
1257              havior  and ordering rules as the previously described data_head
1258              and data_tail.
1259
1260       The following 2^n ring-buffer pages have the layout described below.
1261
1262       If perf_event_attr.sample_id_all is set, then all event types will have
1263       the  sample_type  selected  fields  related to where/when (identity) an
1264       event  took  place  (TID,  TIME,  ID,  CPU,  STREAM_ID)  described   in
1265       PERF_RECORD_SAMPLE  below.   These  fields  are  stashed just after the
1266       perf_event_header and the fields already present for the  record  type,
1267       that is, at the end of the payload.  This allows a newer perf.data file
1268       to be supported by older perf tools,  with  the  new  optional  fields
1269       being ignored.
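
       As a rough illustration of the paragraph above, the size of  the  iden‐
       tity  trailer can be derived from sample_type.  The helper name is hy‐
       pothetical, and counting PERF_SAMPLE_IDENTIFIER in addition to the five
       fields named above is an assumption of this sketch.

```c
#include <linux/perf_event.h>
#include <stddef.h>

/*
 * Hypothetical helper: bytes occupied by the identity fields appended
 * to non-sample records when sample_id_all is set.  Each selected
 * field fills one 64-bit slot (TID holds pid+tid, CPU holds cpu+res).
 */
size_t sample_id_trailer_size(__u64 sample_type)
{
    size_t size = 0;

    if (sample_type & PERF_SAMPLE_TID)
        size += sizeof(__u64);
    if (sample_type & PERF_SAMPLE_TIME)
        size += sizeof(__u64);
    if (sample_type & PERF_SAMPLE_ID)
        size += sizeof(__u64);
    if (sample_type & PERF_SAMPLE_STREAM_ID)
        size += sizeof(__u64);
    if (sample_type & PERF_SAMPLE_CPU)
        size += sizeof(__u64);
    if (sample_type & PERF_SAMPLE_IDENTIFIER)   /* assumption, see above */
        size += sizeof(__u64);
    return size;
}
```

       A parser could subtract this value from the record size to find  where
       the record-specific fields end.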
1270
1271       The mmap values start with a header:
1272
1273           struct perf_event_header {
1274               __u32   type;
1275               __u16   misc;
1276               __u16   size;
1277           };
1278
1279       Below,  we  describe  the perf_event_header fields in more detail.  For
1280       ease of reading, the fields with  shorter  descriptions  are  presented
1281       first.
1282
1283       size   This indicates the size of the record.
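
              Because every record carries its own size, a consumer can  step
              from  one  record to the next without knowing each record type.
              The following is a minimal sketch; count_records() is  a  hypo‐
              thetical  helper, and it assumes the records have already been
              copied out of the ring buffer into a linear buffer and that size
              counts the header itself as well as the payload.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count records of one type in a linear buffer. */
size_t count_records(const unsigned char *buf, size_t len, __u32 type)
{
    size_t off = 0, count = 0;

    while (len - off >= sizeof(struct perf_event_header)) {
        struct perf_event_header hdr;

        /* memcpy avoids unaligned loads from the byte buffer. */
        memcpy(&hdr, buf + off, sizeof(hdr));

        /* A size below the header size or past the end is treated
         * as corruption and stops the walk. */
        if (hdr.size < sizeof(hdr) || hdr.size > len - off)
            break;

        if (hdr.type == type)
            count++;
        off += hdr.size;    /* advance to the next record */
    }
    return count;
}
```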
1284
1285       misc   The misc field contains additional information about the sample.
1286
1287              The  CPU  mode can be determined from this value by masking with
1288              PERF_RECORD_MISC_CPUMODE_MASK and looking for one of the follow‐
1289              ing  (note  these  are  not  bit masks, only one can be set at a
1290              time):
1291
1292              PERF_RECORD_MISC_CPUMODE_UNKNOWN
1293                     Unknown CPU mode.
1294
1295              PERF_RECORD_MISC_KERNEL
1296                     Sample happened in the kernel.
1297
1298              PERF_RECORD_MISC_USER
1299                     Sample happened in user code.
1300
1301              PERF_RECORD_MISC_HYPERVISOR
1302                     Sample happened in the hypervisor.
1303
1304              PERF_RECORD_MISC_GUEST_KERNEL (since Linux 2.6.35)
1305                     Sample happened in the guest kernel.
1306
1307              PERF_RECORD_MISC_GUEST_USER  (since Linux 2.6.35)
1308                     Sample happened in guest user code.
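
              The masking step described above might look like the  following
              sketch.   The function name and the returned strings are illus‐
              trative only.

```c
#include <linux/perf_event.h>

/* Hypothetical helper: map the CPU mode in misc to a short label. */
const char *cpumode_name(__u16 misc)
{
    /* The modes are values, not bit flags, so mask first and then
     * compare for equality. */
    switch (misc & PERF_RECORD_MISC_CPUMODE_MASK) {
    case PERF_RECORD_MISC_KERNEL:
        return "kernel";
    case PERF_RECORD_MISC_USER:
        return "user";
    case PERF_RECORD_MISC_HYPERVISOR:
        return "hypervisor";
    case PERF_RECORD_MISC_GUEST_KERNEL:
        return "guest-kernel";
    case PERF_RECORD_MISC_GUEST_USER:
        return "guest-user";
    default:
        return "unknown";
    }
}
```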
1309
1310              Since the following three statuses are  generated  by  different
1311              record types, they alias to the same bit:
1312
1313              PERF_RECORD_MISC_MMAP_DATA (since Linux 3.10)
1314                     This is set when the mapping is not executable; otherwise
1315                     the mapping is executable.
1316
1317              PERF_RECORD_MISC_COMM_EXEC (since Linux 3.16)
1318                     This is set for a PERF_RECORD_COMM record on kernels more
1319                     recent  than  Linux  3.16  if  a  process name change was
1320                     caused by an exec(2) system call.
1321
1322              PERF_RECORD_MISC_SWITCH_OUT (since Linux 4.3)
1323                     When a PERF_RECORD_SWITCH or  PERF_RECORD_SWITCH_CPU_WIDE
1324                     record  is generated, this bit indicates that the context
1325                     switch is away from the current process (instead of  into
1326                     the current process).
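
              Since these three flags share one bit, the bit can only be  in‐
              terpreted  together  with the record type.  A sketch of that de‐
              cision follows; the function name and strings are hypothetical,
              and  treating PERF_RECORD_MMAP2 like PERF_RECORD_MMAP is an as‐
              sumption.

```c
#include <linux/perf_event.h>

/* Hypothetical helper: explain the shared misc bit for one record. */
const char *shared_misc_bit_meaning(const struct perf_event_header *hdr)
{
    /* All three constants alias the same bit, so testing any one of
     * them tests the shared bit. */
    if (!(hdr->misc & PERF_RECORD_MISC_MMAP_DATA))
        return "not set";

    switch (hdr->type) {
    case PERF_RECORD_MMAP:
    case PERF_RECORD_MMAP2:             /* assumption, see lead-in */
        return "non-executable mapping";
    case PERF_RECORD_COMM:
        return "name change caused by exec";
    case PERF_RECORD_SWITCH:
    case PERF_RECORD_SWITCH_CPU_WIDE:
        return "switch out";
    default:
        return "not defined for this record type";
    }
}
```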
1327
1328              In addition, the following bits can be set:
1329
1330              PERF_RECORD_MISC_EXACT_IP
1331                     This  indicates that the content of PERF_SAMPLE_IP points
1332                     to the actual instruction that triggered the event.   See
1333                     also perf_event_attr.precise_ip.
1334
1335              PERF_RECORD_MISC_EXT_RESERVED (since Linux 2.6.35)
1336                     This  indicates  there  is  extended data available (cur‐
1337                     rently not used).
1338
1339              PERF_RECORD_MISC_PROC_MAP_PARSE_TIMEOUT
1340                     This bit is not set by the kernel.  It  is  reserved  for
1341                     the    user-space   perf   utility   to   indicate   that
1342                     /proc/[pid]/maps parsing was  taking  too  long  and  was
1343                     stopped, and thus the mmap records may be truncated.
1344
1345       type   The  type  value  is one of the below.  The values in the corre‐
1346              sponding record (that follows the header) depend on the type se‐
1347              lected as shown.
1348
1349              PERF_RECORD_MMAP
1350                  The MMAP events record the PROT_EXEC mappings so that we can
1351                  correlate user-space IPs to code.  They have  the  following
1352                  structure:
1353
1354                      struct {
1355                          struct perf_event_header header;
1356                          u32    pid, tid;
1357                          u64    addr;
1358                          u64    len;
1359                          u64    pgoff;
1360                          char   filename[];
1361                      };
1362
1363                  pid    is the process ID.
1364
1365                  tid    is the thread ID.
1366
1367                  addr   is  the  address of the allocated memory.  len is the
1368                         length of the allocated memory.  pgoff  is  the  page
1369                         offset of the allocated memory.  filename is a string
1370                         describing the backing of the allocated memory.
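
                  As a sketch of how such a record might be decoded, the fol‐
                  lowing  copies the fixed-size fields and returns a pointer to
                  the file name.  The struct and function names are hypotheti‐
                  cal,  and  the  sketch  assumes  the  file  name  is NUL-ter‐
                  minated inside the record.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name for the fixed-size fields after the header. */
struct mmap_fixed {
    __u32 pid, tid;
    __u64 addr, len, pgoff;
};

/*
 * Copy the fixed fields of a PERF_RECORD_MMAP record into *out and
 * return a pointer to filename[] inside the record, or NULL if the
 * record has another type or is too short to hold a file name.
 */
const char *parse_mmap_record(const unsigned char *rec,
                              struct mmap_fixed *out)
{
    struct perf_event_header hdr;

    memcpy(&hdr, rec, sizeof(hdr));
    if (hdr.type != PERF_RECORD_MMAP ||
        hdr.size <= sizeof(hdr) + sizeof(*out))
        return NULL;

    memcpy(out, rec + sizeof(hdr), sizeof(*out));
    return (const char *)(rec + sizeof(hdr) + sizeof(*out));
}
```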
1371
1372              PERF_RECORD_LOST
1373                  This record indicates when events are lost.
1374
1375                      struct {
1376                          struct perf_event_header header;
1377                          u64    id;
1378                          u64    lost;
1379                          struct sample_id sample_id;
1380                      };
1381
1382                  id     is the unique event ID  for  the  samples  that  were
1383                         lost.
1384
1385                  lost   is the number of events that were lost.
1386
1387              PERF_RECORD_COMM
1388                  This record indicates a change in the process name.
1389
1390                      struct {
1391                          struct perf_event_header header;
1392                          u32    pid;
1393                          u32    tid;
1394                          char   comm[];
1395                          struct sample_id sample_id;
1396                      };
1397
1398                  pid    is the process ID.
1399
1400                  tid    is the thread ID.
1401
1402                  comm   is a string containing the new name of the process.
1403
1404              PERF_RECORD_EXIT
1405                  This record indicates a process exit event.
1406
1407                      struct {
1408                          struct perf_event_header header;
1409                          u32    pid, ppid;
1410                          u32    tid, ptid;
1411                          u64    time;
1412                          struct sample_id sample_id;
1413                      };
1414
1415              PERF_RECORD_THROTTLE, PERF_RECORD_UNTHROTTLE
1416                  This record indicates a throttle/unthrottle event.
1417
1418                      struct {
1419                          struct perf_event_header header;
1420                          u64    time;
1421                          u64    id;
1422                          u64    stream_id;
1423                          struct sample_id sample_id;
1424                      };
1425
1426              PERF_RECORD_FORK
1427                  This record indicates a fork event.
1428
1429                      struct {
1430                          struct perf_event_header header;
1431                          u32    pid, ppid;
1432                          u32    tid, ptid;
1433                          u64    time;
1434                          struct sample_id sample_id;
1435                      };
1436
1437              PERF_RECORD_READ
1438                  This record indicates a read event.
1439
1440                      struct {
1441                          struct perf_event_header header;
1442                          u32    pid, tid;
1443                          struct read_format values;
1444                          struct sample_id sample_id;
1445                      };
1446
1447              PERF_RECORD_SAMPLE
1448                  This record indicates a sample.
1449
1450                      struct {
1451                          struct perf_event_header header;
1452                          u64    sample_id;   /* if PERF_SAMPLE_IDENTIFIER */
1453                          u64    ip;          /* if PERF_SAMPLE_IP */
1454                          u32    pid, tid;    /* if PERF_SAMPLE_TID */
1455                          u64    time;        /* if PERF_SAMPLE_TIME */
1456                          u64    addr;        /* if PERF_SAMPLE_ADDR */
1457                          u64    id;          /* if PERF_SAMPLE_ID */
1458                          u64    stream_id;   /* if PERF_SAMPLE_STREAM_ID */
1459                          u32    cpu, res;    /* if PERF_SAMPLE_CPU */
1460                          u64    period;      /* if PERF_SAMPLE_PERIOD */
1461                          struct read_format v;
1462                                              /* if PERF_SAMPLE_READ */
1463                          u64    nr;          /* if PERF_SAMPLE_CALLCHAIN */
1464                          u64    ips[nr];     /* if PERF_SAMPLE_CALLCHAIN */
1465                          u32    size;        /* if PERF_SAMPLE_RAW */
1466                          char   data[size];  /* if PERF_SAMPLE_RAW */
1467                          u64    bnr;         /* if PERF_SAMPLE_BRANCH_STACK */
1468                          struct perf_branch_entry lbr[bnr];
1469                                              /* if PERF_SAMPLE_BRANCH_STACK */
1470                          u64    abi;         /* if PERF_SAMPLE_REGS_USER */
1471                          u64    regs[weight(mask)];
1472                                              /* if PERF_SAMPLE_REGS_USER */
1473                          u64    size;        /* if PERF_SAMPLE_STACK_USER */
1474                          char   data[size];  /* if PERF_SAMPLE_STACK_USER */
1475                          u64    dyn_size;    /* if PERF_SAMPLE_STACK_USER &&
1476                                                 size != 0 */
1477                          u64    weight;      /* if PERF_SAMPLE_WEIGHT */
1478                          u64    data_src;    /* if PERF_SAMPLE_DATA_SRC */
1479                          u64    transaction; /* if PERF_SAMPLE_TRANSACTION */
1480                          u64    abi;         /* if PERF_SAMPLE_REGS_INTR */
1481                          u64    regs[weight(mask)];
1482                                              /* if PERF_SAMPLE_REGS_INTR */
1483                          u64    phys_addr;   /* if PERF_SAMPLE_PHYS_ADDR */
1484                          u64    cgroup;      /* if PERF_SAMPLE_CGROUP */
1485                      };
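
                  Because each field is present only when  its  sample_type
                  bit  is  set,  a  parser  has to walk the payload in struct
                  order.  The sketch below decodes only the  fixed-size  fields
                  up  to  period; struct sample_prefix, take_u64(), and
                  parse_sample_prefix() are hypothetical names,  and
                  __builtin_popcountll()  is  a GCC/Clang builtin rather than
                  part of the perf interface.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the fixed-size leading sample fields. */
struct sample_prefix {
    __u64 identifier, ip, time, addr, id, stream_id, period;
    __u32 pid, tid, cpu;
};

/* Copy one 64-bit slot out of the payload and advance the cursor. */
static __u64 take_u64(const unsigned char **p)
{
    __u64 v;

    memcpy(&v, *p, sizeof(v));
    *p += sizeof(v);
    return v;
}

/*
 * payload points just past the perf_event_header.  Returns the number
 * of bytes consumed, or 0 if len is too short for the selected fields.
 */
size_t parse_sample_prefix(const unsigned char *payload, size_t len,
                           __u64 sample_type, struct sample_prefix *out)
{
    const __u64 fixed = PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP |
                        PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
                        PERF_SAMPLE_ADDR | PERF_SAMPLE_ID |
                        PERF_SAMPLE_STREAM_ID | PERF_SAMPLE_CPU |
                        PERF_SAMPLE_PERIOD;
    const unsigned char *p = payload;
    __u64 slot;

    /* Every selected fixed field occupies exactly one 64-bit slot. */
    if (len < 8 * (size_t)__builtin_popcountll(sample_type & fixed))
        return 0;

    memset(out, 0, sizeof(*out));
    if (sample_type & PERF_SAMPLE_IDENTIFIER)
        out->identifier = take_u64(&p);
    if (sample_type & PERF_SAMPLE_IP)
        out->ip = take_u64(&p);
    if (sample_type & PERF_SAMPLE_TID) {
        slot = take_u64(&p);                /* u32 pid, then u32 tid */
        memcpy(&out->pid, &slot, 4);
        memcpy(&out->tid, (unsigned char *)&slot + 4, 4);
    }
    if (sample_type & PERF_SAMPLE_TIME)
        out->time = take_u64(&p);
    if (sample_type & PERF_SAMPLE_ADDR)
        out->addr = take_u64(&p);
    if (sample_type & PERF_SAMPLE_ID)
        out->id = take_u64(&p);
    if (sample_type & PERF_SAMPLE_STREAM_ID)
        out->stream_id = take_u64(&p);
    if (sample_type & PERF_SAMPLE_CPU) {
        slot = take_u64(&p);                /* second u32 is reserved */
        memcpy(&out->cpu, &slot, 4);
    }
    if (sample_type & PERF_SAMPLE_PERIOD)
        out->period = take_u64(&p);
    return (size_t)(p - payload);
}
```

                  The  return value gives the offset at which the variable-
                  length fields (read values, callchain, and so on) would be‐
                  gin.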
1486
1487                  sample_id
1488                      If PERF_SAMPLE_IDENTIFIER is enabled, a 64-bit unique ID
1489                      is included.  This is a  duplication  of  the  PERF_SAM‐
1490                      PLE_ID  id  value,  but included at the beginning of the
1491                      sample so parsers can easily obtain the value.
1492
1493                  ip  If PERF_SAMPLE_IP is enabled, then a 64-bit  instruction
1494                      pointer value is included.
1495
1496                  pid, tid
1497                      If  PERF_SAMPLE_TID is enabled, then a 32-bit process ID
1498                      and 32-bit thread ID are included.
1499
1500                  time
1501                      If PERF_SAMPLE_TIME is enabled, then a 64-bit  timestamp
1502                      is  included.   This is obtained via local_clock() which
1503                      is a hardware timestamp if  available  and  the  jiffies
1504                      value if not.
1505
1506                  addr
1507                      If PERF_SAMPLE_ADDR is enabled, then a 64-bit address is
1508                      included.  This is usually the address of a  tracepoint,
1509                      breakpoint, or software event; otherwise the value is 0.
1510
1511                  id  If  PERF_SAMPLE_ID is enabled, a 64-bit unique ID is in‐
1512                      cluded.  If the event is a member of an event group, the
1513                      group leader ID is returned.  This ID is the same as the
1514                      one returned by PERF_FORMAT_ID.
1515
1516                  stream_id
1517                      If PERF_SAMPLE_STREAM_ID is enabled, a 64-bit unique  ID
1518                      is included.  Unlike PERF_SAMPLE_ID the actual ID is re‐
1519                      turned, not the group leader.  This ID is  the  same  as
1520                      the one returned by PERF_FORMAT_ID.
1521
1522                  cpu, res
1523                      If  PERF_SAMPLE_CPU  is  enabled, this is a 32-bit value
1524                      indicating which CPU was being used, in  addition  to  a
1525                      reserved (unused) 32-bit value.
1526
1527                  period
1528                      If  PERF_SAMPLE_PERIOD  is enabled, a 64-bit value indi‐
1529                      cating the current sampling period is written.
1530
1531                  v   If PERF_SAMPLE_READ is  enabled,  a  structure  of  type
1532                      read_format  is included which has values for all events
1533                      in the event group.  The values included depend  on  the
1534                      read_format value used at perf_event_open() time.
1535
1536                  nr, ips[nr]
1537                      If  PERF_SAMPLE_CALLCHAIN is enabled, then a 64-bit num‐
1538                      ber  is  included  which  indicates  how   many   64-bit
1539                      instruction  pointers  follow.   This   is  the  current
1540                      callchain.
1541
1542                  size, data[size]
1543                      If PERF_SAMPLE_RAW is enabled, then a 32-bit value indi‐
1544                      cating  size  is  included followed by an array of 8-bit
1545                      values of length size.  The values are padded with 0  to
1546                      have 64-bit alignment.
1547
1548                      This  RAW record data is opaque with respect to the ABI.
1549                      The ABI doesn't make any promises with  respect  to  the
1550                      stability  of  its  content;  it  may  vary depending on
1551                      event, hardware, and kernel version.
1552
1553                  bnr, lbr[bnr]
1554                      If PERF_SAMPLE_BRANCH_STACK is enabled,  then  a  64-bit
1555                      value indicating the number of records is included, fol‐
1556                      lowed by bnr perf_branch_entry structures which each in‐
1557                      clude the fields:
1558
1559                      from   This indicates the source instruction (may not be
1560                             a branch).
1561
1562                      to     The branch target.
1563
1564                      mispred
1565                             The branch target was mispredicted.
1566
1567                      predicted
1568                             The branch target was predicted.
1569
1570                      in_tx (since Linux 3.11)
1571                             The branch was in a transactional memory transac‐
1572                             tion.
1573
1574                      abort (since Linux 3.11)
1575                             The branch was in an aborted transactional memory
1576                             transaction.
1577
1578                      cycles (since Linux 4.3)
1579                             This reports the number of cycles  elapsed  since
1580                             the previous branch stack update.
1581
1582                      The  entries are from most to least recent, so the first
1583                      entry has the most recent branch.
1584
1585                      Support for mispred, predicted, and cycles is  optional;
1586                      if not supported, those values will be 0.
1587
1588                      The  type  of  branches  recorded  is  specified  by the
1589                      branch_sample_type field.
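
                      As  a  small  illustration, the entries could be scanned
                      as follows.  count_mispredicted() is a hypothetical  name;
                      struct  perf_branch_entry  is  taken  from <linux/perf_
                      event.h>.

```c
#include <linux/perf_event.h>
#include <stddef.h>

/* Hypothetical helper: count entries flagged as mispredicted. */
size_t count_mispredicted(const struct perf_branch_entry *lbr, __u64 bnr)
{
    size_t n = 0;

    /* Entry 0 is the most recent branch. */
    for (__u64 i = 0; i < bnr; i++)
        if (lbr[i].mispred)
            n++;
    return n;
}
```

                      A  result  of  0  is ambiguous on its own, since mispred
                      also reads as 0 when the hardware does not support it.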
1590
1591                  abi, regs[weight(mask)]
1592                      If PERF_SAMPLE_REGS_USER is enabled, then the  user  CPU
1593                      registers are recorded.
1594
1595                      The  abi  field  is  one  of  PERF_SAMPLE_REGS_ABI_NONE,
1596                      PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
1597
1598                      The regs field is an array of  the  CPU  registers  that
1599                      were  specified by the sample_regs_user attr field.  The
1600                      number of values is the number of bits set in  the  sam‐
1601                      ple_regs_user bit mask.
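
                      The  weight(mask)  array  length  in  the struct is this
                      bit count.  A one-line sketch, using the  GCC/Clang
                      builtin  __builtin_popcountll()  (an  assumption  about
                      the compiler, not part of the perf interface):

```c
#include <linux/perf_event.h>

/* Hypothetical helper: number of u64 register values in regs[] for a
 * given sample_regs_user (or sample_regs_intr) bit mask. */
unsigned int sample_regs_count(__u64 mask)
{
    return (unsigned int)__builtin_popcountll(mask);
}
```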
1602
1603                  size, data[size], dyn_size
1604                      If  PERF_SAMPLE_STACK_USER  is  enabled,  then  the user
1605                      stack is recorded.  This can be used to  generate  stack
1606                      backtraces.   size  is the size requested by the user in
1607                      sample_stack_user or else the maximum record size.  data
1608                      is  the  stack data (a raw dump of the memory pointed to
1609                      by the stack pointer at the time of sampling).  dyn_size
1610                      is  the amount of data actually dumped (can be less than
1611                      size).  Note that dyn_size is omitted if size is 0.
1612
1613                  weight
1614                      If PERF_SAMPLE_WEIGHT is enabled, then  a  64-bit  value
1615                      provided  by the hardware is recorded that indicates how
1616                      costly the event was.  This allows expensive  events  to
1617                      stand out more clearly in profiles.
1618
1619                  data_src
1620                      If  PERF_SAMPLE_DATA_SRC is enabled, then a 64-bit value
1621                      is recorded that is made up of the following fields:
1622
1623                      mem_op
1624                          Type of opcode, a bitwise combination of:
1625
1626                          PERF_MEM_OP_NA          Not available
1627                          PERF_MEM_OP_LOAD        Load instruction
1628                          PERF_MEM_OP_STORE       Store instruction
1629                          PERF_MEM_OP_PFETCH      Prefetch
1630                          PERF_MEM_OP_EXEC        Executable code
1631
1632                      mem_lvl
1633                          Memory hierarchy level hit or miss, a bitwise combi‐
1634                          nation   of   the   following,   shifted   left   by
1635                          PERF_MEM_LVL_SHIFT:
1636
1637                          PERF_MEM_LVL_NA         Not available
1638                          PERF_MEM_LVL_HIT        Hit
1639                          PERF_MEM_LVL_MISS       Miss
1640                          PERF_MEM_LVL_L1         Level 1 cache
1641                          PERF_MEM_LVL_LFB        Line fill buffer
1642                          PERF_MEM_LVL_L2         Level 2 cache
1643                          PERF_MEM_LVL_L3         Level 3 cache
1644                          PERF_MEM_LVL_LOC_RAM    Local DRAM
1645                          PERF_MEM_LVL_REM_RAM1   Remote DRAM 1 hop
1646                          PERF_MEM_LVL_REM_RAM2   Remote DRAM 2 hops
1647                          PERF_MEM_LVL_REM_CCE1   Remote cache 1 hop
1648                          PERF_MEM_LVL_REM_CCE2   Remote cache 2 hops
1649                          PERF_MEM_LVL_IO         I/O memory
1650                          PERF_MEM_LVL_UNC        Uncached memory
1651
1652                      mem_snoop
1653                          Snoop mode, a bitwise combination of the  following,
1654                          shifted left by PERF_MEM_SNOOP_SHIFT:
1655
1656                          PERF_MEM_SNOOP_NA       Not available
1657                          PERF_MEM_SNOOP_NONE     No snoop
1658                          PERF_MEM_SNOOP_HIT      Snoop hit
1659                          PERF_MEM_SNOOP_MISS     Snoop miss
1660                          PERF_MEM_SNOOP_HITM     Snoop hit modified
1661
1662                      mem_lock
1663                          Lock  instruction, a bitwise combination of the fol‐
1664                          lowing, shifted left by PERF_MEM_LOCK_SHIFT:
1665
1666                          PERF_MEM_LOCK_NA        Not available
1667                          PERF_MEM_LOCK_LOCKED    Locked transaction
1668
1669                      mem_dtlb
1670                          TLB access hit or miss, a bitwise combination of the
1671                          following, shifted left by PERF_MEM_TLB_SHIFT:
1672
1673                          PERF_MEM_TLB_NA         Not available
1674                          PERF_MEM_TLB_HIT        Hit
1675                          PERF_MEM_TLB_MISS       Miss
1676                          PERF_MEM_TLB_L1         Level 1 TLB
1677                          PERF_MEM_TLB_L2         Level 2 TLB
1678                          PERF_MEM_TLB_WK         Hardware walker
1679                          PERF_MEM_TLB_OS         OS fault handler
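
                      One way to test these fields  is  through  the  bit-field
                      union  that  <linux/perf_event.h> provides.  The sketch
                      below assumes union perf_mem_data_src (not described  in
                      this  section)  is  available  in the installed headers;
                      the two function names are hypothetical.

```c
#include <linux/perf_event.h>

/* Hypothetical helper: was this sample a load that hit in L1? */
int is_l1_load_hit(__u64 data_src)
{
    union perf_mem_data_src d;

    d.val = data_src;
    /* The union's bit-fields are already shifted into place, so the
     * unshifted PERF_MEM_* constants apply directly. */
    return (d.mem_op & PERF_MEM_OP_LOAD) &&
           (d.mem_lvl & PERF_MEM_LVL_HIT) &&
           (d.mem_lvl & PERF_MEM_LVL_L1);
}

/* Hypothetical helper: did the access miss in the data TLB? */
int is_dtlb_miss(__u64 data_src)
{
    union perf_mem_data_src d;

    d.val = data_src;
    return (d.mem_dtlb & PERF_MEM_TLB_MISS) != 0;
}
```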
1680
1681                  transaction
1682                      If  the  PERF_SAMPLE_TRANSACTION  flag  is  set,  then a
1683                      64-bit field is recorded describing the sources  of  any
1684                      transactional memory aborts.
1685
1686                      The field is a bitwise combination of the following val‐
1687                      ues:
1688
1689                      PERF_TXN_ELISION
1690                             Abort from an elision  type  transaction  (Intel-
1691                             CPU-specific).
1692
1693                      PERF_TXN_TRANSACTION
1694                             Abort from a generic transaction.
1695
1696                      PERF_TXN_SYNC
1697                             Synchronous  abort  (related  to the reported in‐
1698                             struction).
1699
1700                      PERF_TXN_ASYNC
1701                             Asynchronous abort (not related to  the  reported
1702                             instruction).
1703
1704                      PERF_TXN_RETRY
1705                             Retryable  abort  (retrying  the  transaction may
1706                             have succeeded).
1707
1708                      PERF_TXN_CONFLICT
1709                             Abort due to memory conflicts with other threads.
1710
1711                      PERF_TXN_CAPACITY_WRITE
1712                             Abort due to write capacity overflow.
1713
1714                      PERF_TXN_CAPACITY_READ
1715                             Abort due to read capacity overflow.
1716
1717                      In addition, a user-specified abort code can be obtained
1718                      from the high 32 bits  of  the  field  by  masking  with
1719                      PERF_TXN_ABORT_MASK  and  then  shifting  right  by  the
1720                      value PERF_TXN_ABORT_SHIFT.
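
                      A  sketch of extracting the abort code follows.  It ap‐
                      plies the mask before shifting, on the  assumption  that
                      the  installed  header  defines  PERF_TXN_ABORT_MASK al‐
                      ready positioned in the high 32  bits;  txn_abort_code()
                      is a hypothetical name.

```c
#include <linux/perf_event.h>

/* Hypothetical helper: user-specified abort code from bits 32..63. */
__u32 txn_abort_code(__u64 transaction)
{
    return (__u32)((transaction & PERF_TXN_ABORT_MASK)
                   >> PERF_TXN_ABORT_SHIFT);
}
```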
1721
1722                  abi, regs[weight(mask)]
1723                      If  PERF_SAMPLE_REGS_INTR  is  enabled,  then   the  CPU
1724                      registers as of the sampling interrupt are recorded.
1725
1726                      The  abi  field  is  one  of  PERF_SAMPLE_REGS_ABI_NONE,
1727                      PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
1728
1729                      The  regs  field  is  an array of the CPU registers that
1730                      were specified by the sample_regs_intr attr field.   The
1731                      number  of  values is the number of bits set in the sam‐
1732                      ple_regs_intr bit mask.
1733
1734                  phys_addr
1735                      If the  PERF_SAMPLE_PHYS_ADDR  flag  is  set,  then  the
1736                      64-bit physical address is recorded.
1737
1738                  cgroup
1739                      If  the  PERF_SAMPLE_CGROUP flag is set, then the 64-bit
1740                      cgroup ID (for the perf_event  subsystem)  is  recorded.
1741                      To get the pathname of the cgroup,  the  ID  should  be
1742                      matched to one in a PERF_RECORD_CGROUP record.
1743
1744              PERF_RECORD_MMAP2
1745                  This record includes extended information on  mmap(2)  calls
1746                  returning  executable  mappings.   The  format is similar to
1747                  that of the PERF_RECORD_MMAP record, but includes extra val‐
1748                  ues that allow uniquely identifying shared mappings.
1749
1750                      struct {
1751                          struct perf_event_header header;
1752                          u32    pid;
1753                          u32    tid;
1754                          u64    addr;
1755                          u64    len;
1756                          u64    pgoff;
1757                          u32    maj;
1758                          u32    min;
1759                          u64    ino;
1760                          u64    ino_generation;
1761                          u32    prot;
1762                          u32    flags;
1763                          char   filename[];
1764                          struct sample_id sample_id;
1765                      };
1766
1767                  pid    is the process ID.
1768
1769                  tid    is the thread ID.
1770
1771                  addr   is the address of the allocated memory.
1772
1773                  len    is the length of the allocated memory.
1774
1775                  pgoff  is the page offset of the allocated memory.
1776
1777                  maj    is the major ID of the underlying device.
1778
1779                  min    is the minor ID of the underlying device.
1780
1781                  ino    is the inode number.
1782
1783                  ino_generation
1784                         is the inode generation.
1785
1786                  prot   is the protection information.
1787
1788                  flags  is the flags information.
1789
1790                  filename
1791                         is  a  string describing the backing of the allocated
1792                         memory.
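
                  To illustrate how the extra values can identify a shared
                  mapping,  the  sketch  below  compares the device and inode
                  fields of two records.  The struct and  function  names  are
                  hypothetical, and treating these four fields as the identity
                  key is an assumption of the example.

```c
#include <linux/perf_event.h>

/* Hypothetical subset of PERF_RECORD_MMAP2 naming the backing file. */
struct mmap2_file_id {
    __u32 maj, min;
    __u64 ino, ino_generation;
};

/* Nonzero if both mappings appear to be backed by the same file. */
int same_backing_file(const struct mmap2_file_id *a,
                      const struct mmap2_file_id *b)
{
    return a->maj == b->maj && a->min == b->min &&
           a->ino == b->ino &&
           a->ino_generation == b->ino_generation;
}
```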
1793
1794              PERF_RECORD_AUX (since Linux 4.1)
1795                  This record reports that new data is available in the  sepa‐
1796                  rate AUX buffer region.
1797
1798                      struct {
1799                          struct perf_event_header header;
1800                          u64    aux_offset;
1801                          u64    aux_size;
1802                          u64    flags;
1803                          struct sample_id sample_id;
1804                      };
1805
1806                  aux_offset
1807                         offset  in the AUX mmap region where the new data be‐
1808                         gins.
1809
1810                  aux_size
1811                         size of the data made available.
1812
1813                  flags  describes the AUX update.
1814
1815                         PERF_AUX_FLAG_TRUNCATED
1816                                if set, then the data returned  was  truncated
1817                                to fit the available buffer size.
1818
1819                         PERF_AUX_FLAG_OVERWRITE
1820                                if set, then the data returned has overwritten
1821                                previous data.
1822
1823              PERF_RECORD_ITRACE_START (since Linux 4.1)
1824                  This record indicates which process  has  initiated  an  in‐
1825                  struction  trace event, allowing tools to properly correlate
1826                  the instruction addresses in the AUX buffer with the  proper
1827                  executable.
1828
1829                      struct {
1830                          struct perf_event_header header;
1831                          u32    pid;
1832                          u32    tid;
1833                      };
1834
1835                  pid    process  ID  of  the  thread  starting an instruction
1836                         trace.
1837
1838                  tid    thread ID  of  the  thread  starting  an  instruction
1839                         trace.
1840
1841              PERF_RECORD_LOST_SAMPLES (since Linux 4.2)
1842                  When  using  hardware  sampling  (such  as  Intel PEBS) this
1843                  record indicates some number of samples that may  have  been
1844                  lost.
1845
1846                      struct {
1847                          struct perf_event_header header;
1848                          u64    lost;
1849                          struct sample_id sample_id;
1850                      };
1851
1852                  lost   the number of potentially lost samples.
1853
1854              PERF_RECORD_SWITCH (since Linux 4.3)
1855                  This  record  indicates  a context switch has happened.  The
1856                  PERF_RECORD_MISC_SWITCH_OUT bit in the misc field  indicates
1857                  whether  it  was a context switch into or away from the cur‐
1858                  rent process.
1859
1860                      struct {
1861                          struct perf_event_header header;
1862                          struct sample_id sample_id;
1863                      };
1864
1865              PERF_RECORD_SWITCH_CPU_WIDE (since Linux 4.3)
1866                  As with PERF_RECORD_SWITCH this record indicates  a  context
1867                  switch  has  happened,  but  it only occurs when sampling in
1868                  CPU-wide mode and provides  additional  information  on  the
1869                  process       being       switched       to/from.        The
1870                  PERF_RECORD_MISC_SWITCH_OUT bit in the misc field  indicates
1871                  whether  it  was a context switch into or away from the cur‐
1872                  rent process.
1873
1874                      struct {
1875                          struct perf_event_header header;
1876                          u32 next_prev_pid;
1877                          u32 next_prev_tid;
1878                          struct sample_id sample_id;
1879                      };
1880
1881                  next_prev_pid
1882                         The process ID of the previous (if switching  in)  or
1883                         next (if switching out) process on the CPU.
1884
1885                  next_prev_tid
1886                         The  thread  ID  of the previous (if switching in) or
1887                         next (if switching out) thread on the CPU.
1888
1889              PERF_RECORD_NAMESPACES (since Linux 4.11)
1890                  This record includes  various  namespace  information  of  a
1891                  process.
1892
1893                      struct {
1894                          struct perf_event_header header;
1895                          u32    pid;
1896                          u32    tid;
1897                          u64    nr_namespaces;
1898                          struct { u64 dev, inode; } [nr_namespaces];
1899                          struct sample_id sample_id;
1900                      };
1901
1902                  pid    is the process ID
1903
1904                  tid    is the thread ID
1905
                  nr_namespaces
1907                         is the number of namespaces in this record
1908
1909                  Each  namespace  has dev and inode fields and is recorded in
1910                  the fixed position like below:
1911
1912                  NET_NS_INDEX=0
1913                         Network namespace
1914
1915                  UTS_NS_INDEX=1
1916                         UTS namespace
1917
1918                  IPC_NS_INDEX=2
1919                         IPC namespace
1920
1921                  PID_NS_INDEX=3
1922                         PID namespace
1923
1924                  USER_NS_INDEX=4
1925                         User namespace
1926
1927                  MNT_NS_INDEX=5
1928                         Mount namespace
1929
1930                  CGROUP_NS_INDEX=6
1931                         Cgroup namespace
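
                  As an illustration of the layout above, the following sketch
                  looks up one namespace entry in the body of a
                  PERF_RECORD_NAMESPACES record.  The names ns_link_at() and
                  struct ns_link, and the local copy of the index constants,
                  are hypothetical and exist only for this example.  It assumes
                  that body points just past the perf_event_header and that the
                  record is in native byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the fixed positions listed above (illustrative only). */
enum {
    NET_NS_INDEX, UTS_NS_INDEX, IPC_NS_INDEX, PID_NS_INDEX,
    USER_NS_INDEX, MNT_NS_INDEX, CGROUP_NS_INDEX
};

/* One { dev, inode } pair from the record. */
struct ns_link {
    uint64_t dev;
    uint64_t inode;
};

/*
 * Copy namespace entry idx out of a record body laid out as
 * u32 pid, u32 tid, u64 nr_namespaces, then nr_namespaces pairs.
 * Returns 0 on success, or -1 if the entry is absent or truncated.
 */
int
ns_link_at(const unsigned char *body, size_t body_len, uint64_t idx,
           struct ns_link *out)
{
    const size_t fixed = 2 * sizeof(uint32_t) + sizeof(uint64_t);
    uint64_t nr;

    if (body_len < fixed)
        return -1;
    memcpy(&nr, body + 2 * sizeof(uint32_t), sizeof(nr));

    /* The entry must be announced by nr_namespaces and fit in the body. */
    if (idx >= nr || (body_len - fixed) / sizeof(*out) <= idx)
        return -1;

    memcpy(out, body + fixed + idx * sizeof(*out), sizeof(*out));
    return 0;
}
```

                  A caller would pass, for example, PID_NS_INDEX to obtain the
                  PID namespace of the process.  The trailing sample_id is not
                  parsed by this sketch.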
1932
1933              PERF_RECORD_KSYMBOL (since Linux 5.0)
1934                  This  record  indicates  kernel  symbol  register/unregister
1935                  events.
1936
1937                      struct {
1938                          struct perf_event_header header;
1939                          u64    addr;
1940                          u32    len;
1941                          u16    ksym_type;
1942                          u16    flags;
1943                          char   name[];
1944                          struct sample_id sample_id;
1945                      };
1946
1947                  addr   is the address of the kernel symbol.
1948
1949                  len    is the length of the kernel symbol.
1950
1951                  ksym_type
1952                         is the type of the kernel symbol.  Currently the fol‐
1953                         lowing types are available:
1954
1955                         PERF_RECORD_KSYMBOL_TYPE_BPF
1956                                The kernel symbol is a BPF function.
1957
1958                  flags  If the PERF_RECORD_KSYMBOL_FLAGS_UNREGISTER  is  set,
1959                         then  this event is for unregistering the kernel sym‐
1960                         bol.
1961
1962              PERF_RECORD_BPF_EVENT (since Linux 5.0)
                  This record indicates that a BPF program was loaded or
                  unloaded.
1964
1965                      struct {
1966                          struct perf_event_header header;
1967                          u16 type;
1968                          u16 flags;
1969                          u32 id;
1970                          u8 tag[BPF_TAG_SIZE];
1971                          struct sample_id sample_id;
1972                      };
1973
1974                  type   is one of the following values:
1975
1976                         PERF_BPF_EVENT_PROG_LOAD
1977                                A BPF program is loaded
1978
1979                         PERF_BPF_EVENT_PROG_UNLOAD
1980                                A BPF program is unloaded
1981
1982                  id     is the ID of the BPF program.
1983
1984                  tag    is  the  tag  of   the   BPF   program.    Currently,
1985                         BPF_TAG_SIZE is defined as 8.
1986
1987              PERF_RECORD_CGROUP (since Linux 5.7)
                  This record indicates that a new cgroup was created and
                  activated.
1989
1990                      struct {
1991                          struct perf_event_header header;
1992                          u64    id;
1993                          char   path[];
1994                          struct sample_id sample_id;
1995                      };
1996
                  id     is the cgroup identifier.  This can also be retrieved
                         by name_to_handle_at(2) on the cgroup path (as a file
                         handle).
2000
2001                  path   is the path of the cgroup from the root.
2002
2003              PERF_RECORD_TEXT_POKE (since Linux 5.8)
                  This record indicates a change in the kernel text.  This
                  includes the addition and removal of text; in those cases the
                  corresponding length (old_len for an addition, new_len for a
                  removal) is zero.
2007
2008                      struct {
2009                          struct perf_event_header header;
2010                          u64    addr;
2011                          u16    old_len;
2012                          u16    new_len;
2013                          u8     bytes[];
2014                          struct sample_id sample_id;
2015                      };
2016
2017                  addr   is the address of the change
2018
2019                  old_len
2020                         is the old length
2021
2022                  new_len
2023                         is the new length
2024
2025                  bytes  contains old bytes immediately followed by new bytes.
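
                  Because bytes holds the old bytes immediately followed by the
                  new bytes, a reader has to split it using old_len.  A minimal
                  sketch follows; text_poke_split() and struct text_poke_view
                  are hypothetical names, and avail is assumed to be the number
                  of bytes known to be present in the bytes array.

```c
#include <stddef.h>
#include <stdint.h>

/* Borrowed views into the bytes[] array of a PERF_RECORD_TEXT_POKE. */
struct text_poke_view {
    const uint8_t *old_bytes;
    size_t         old_len;
    const uint8_t *new_bytes;
    size_t         new_len;
};

/*
 * Split bytes[] into its old and new halves.
 * Returns 0 on success, or -1 if the two lengths exceed avail.
 */
int
text_poke_split(const uint8_t *bytes, size_t avail,
                uint16_t old_len, uint16_t new_len,
                struct text_poke_view *v)
{
    if ((size_t) old_len + new_len > avail)
        return -1;

    v->old_bytes = bytes;              /* old text comes first */
    v->old_len = old_len;
    v->new_bytes = bytes + old_len;    /* new text follows directly */
    v->new_len = new_len;
    return 0;
}
```

                  A zero old_len or new_len simply yields an empty view for that
                  half.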
2026
2027   Overflow handling
       Events can be set to notify when a threshold is crossed, indicating an
       overflow.  Overflow conditions can be captured by monitoring the event
       file descriptor with poll(2), select(2), or epoll(7).  Alternatively,
       the overflow events can be captured via a signal handler, by enabling
       I/O signaling on the file descriptor; see the discussion of the
       F_SETOWN and F_SETSIG operations in fcntl(2).
2034
2035       Overflows are generated only by  sampling  events  (sample_period  must
2036       have a nonzero value).
2037
2038       There are two ways to generate overflow notifications.
2039
2040       The first is to set a wakeup_events or wakeup_watermark value that will
2041       trigger if a certain number of samples or bytes have  been  written  to
2042       the mmap ring buffer.  In this case, POLL_IN is indicated.
2043
2044       The  other  way  is  by  use of the PERF_EVENT_IOC_REFRESH ioctl.  This
2045       ioctl adds to a counter that decrements each time the event  overflows.
2046       When  nonzero,  POLL_IN  is  indicated,  but once the counter reaches 0
2047       POLL_HUP is indicated and the underlying event is disabled.
2048
2049       Refreshing an event group leader refreshes all siblings and  refreshing
2050       with  a  parameter of 0 currently enables infinite refreshes; these be‐
2051       haviors are unsupported and should not be relied on.
2052
2053       Starting with Linux 3.18, POLL_HUP is indicated if the event being mon‐
2054       itored is attached to a different process and that process exits.
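
       The poll(2) approach described above can be sketched as follows.  The
       function wait_for_overflow() and the enum values are hypothetical names
       for this example.  The sketch assumes that the POLL_IN and POLL_HUP
       conditions mentioned above show up as POLLIN and POLLHUP in the revents
       field returned by poll(2).

```c
#include <poll.h>

enum overflow_status {
    OVF_ERROR   = -1,   /* poll(2) failed or reported an error condition */
    OVF_TIMEOUT =  0,   /* nothing happened within timeout_ms */
    OVF_WAKEUP  =  1,   /* overflow notification (POLLIN) */
    OVF_HANGUP  =  2    /* refresh count exhausted or target exited */
};

/* Wait up to timeout_ms milliseconds for a notification on fd. */
int
wait_for_overflow(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return OVF_ERROR;
    if (n == 0)
        return OVF_TIMEOUT;
    if (pfd.revents & POLLHUP)     /* check hangup first: event is disabled */
        return OVF_HANGUP;
    if (pfd.revents & POLLIN)
        return OVF_WAKEUP;
    return OVF_ERROR;
}
```

       A caller would typically loop on OVF_WAKEUP, draining the mmap ring
       buffer each time, and stop on OVF_HANGUP.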
2055
2056   rdpmc instruction
2057       Starting  with  Linux  3.4 on x86, you can use the rdpmc instruction to
2058       get low-latency reads without having to enter the  kernel.   Note  that
2059       using  rdpmc  is  not necessarily faster than other methods for reading
2060       event values.
2061
2062       Support for this can be detected with the cap_usr_rdpmc  field  in  the
2063       mmap  page; documentation on how to calculate event values can be found
2064       in that section.
2065
2066       Originally, when rdpmc support was enabled, any process (not just  ones
2067       with  an  active  perf event) could use the rdpmc instruction to access
2068       the counters.  Starting with Linux 4.0, rdpmc support is  only  allowed
2069       if  an  event  is currently enabled in a process's context.  To restore
2070       the old behavior, write the value 2 to /sys/devices/cpu/rdpmc.
2071
2072   perf_event ioctl calls
2073       Various ioctls act on perf_event_open() file descriptors:
2074
2075       PERF_EVENT_IOC_ENABLE
2076              This enables the individual event or event  group  specified  by
2077              the file descriptor argument.
2078
2079              If  the  PERF_IOC_FLAG_GROUP  bit  is set in the ioctl argument,
2080              then all events in a group are enabled, even if the event speci‐
2081              fied is not the group leader (but see BUGS).
2082
2083       PERF_EVENT_IOC_DISABLE
2084              This disables the individual counter or event group specified by
2085              the file descriptor argument.
2086
2087              Enabling or disabling the leader of a group enables or  disables
2088              the  entire  group; that is, while the group leader is disabled,
2089              none of the counters in the group will count.  Enabling or  dis‐
2090              abling  a  member  of a group other than the leader affects only
2091              that counter; disabling a non-leader  stops  that  counter  from
2092              counting but doesn't affect any other counter.
2093
2094              If  the  PERF_IOC_FLAG_GROUP  bit  is set in the ioctl argument,
2095              then all events in a group are disabled, even if the event spec‐
2096              ified is not the group leader (but see BUGS).
2097
2098       PERF_EVENT_IOC_REFRESH
2099              Non-inherited overflow counters can use this to enable a counter
2100              for a number of overflows specified by the argument, after which
2101              it is disabled.  Subsequent calls of this ioctl add the argument
2102              value to the  current  count.   An  overflow  notification  with
2103              POLL_IN set will happen on each overflow until the count reaches
2104              0; when that happens a notification with POLL_HUP  set  is  sent
2105              and the event is disabled.  Using an argument of 0 is considered
2106              undefined behavior.
2107
2108       PERF_EVENT_IOC_RESET
2109              Reset the event count specified by the file descriptor  argument
2110              to  zero.  This resets only the counts; there is no way to reset
2111              the multiplexing time_enabled or time_running values.
2112
2113              If the PERF_IOC_FLAG_GROUP bit is set  in  the  ioctl  argument,
2114              then  all  events in a group are reset, even if the event speci‐
2115              fied is not the group leader (but see BUGS).
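
              As a sketch of how the reset, enable, and disable operations
              combine, the hypothetical helper below measures a region of code
              with a whole event group.  The name measure_group() and the work
              callback are assumptions made for this example; fd may be any
              member of the group because PERF_IOC_FLAG_GROUP is passed (but
              see BUGS for older kernels).

```c
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/*
 * Reset and enable every event in the group that fd belongs to,
 * run work(), then disable the group again.
 * Returns 0 on success, or -1 with errno set if an ioctl fails.
 */
int
measure_group(int fd, void (*work)(void))
{
    if (ioctl(fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1)
        return -1;
    if (ioctl(fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1)
        return -1;

    if (work != NULL)
        work();                 /* the code being measured */

    return ioctl(fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
}
```

              The counts can then be collected with read(2) on the group
              members.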
2116
2117       PERF_EVENT_IOC_PERIOD
2118              This updates the overflow period for the event.
2119
2120              Since Linux 3.7 (on ARM) and Linux  3.14  (all  other  architec‐
2121              tures),  the new period takes effect immediately.  On older ker‐
2122              nels, the new period did not take effect until  after  the  next
2123              overflow.
2124
2125              The  argument  is a pointer to a 64-bit value containing the de‐
2126              sired new period.
2127
2128              Prior to Linux 2.6.36, this ioctl always failed due to a bug  in
2129              the kernel.
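
              Since the argument is a pointer rather than the period itself, a
              call might look like the sketch below.  The wrapper name
              set_sample_period() is hypothetical.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/*
 * Change the overflow period of the event behind fd.
 * Returns 0 on success, or -1 with errno set.
 */
int
set_sample_period(int fd, uint64_t period)
{
    /* The ioctl takes the address of the 64-bit value, not the value. */
    return ioctl(fd, PERF_EVENT_IOC_PERIOD, &period);
}
```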
2130
2131       PERF_EVENT_IOC_SET_OUTPUT
2132              This tells the kernel to report event notifications to the spec‐
2133              ified file descriptor rather than the default one.  The file de‐
2134              scriptors must all be on the same CPU.
2135
2136              The  argument  specifies  the  desired file descriptor, or -1 if
2137              output should be ignored.
2138
2139       PERF_EVENT_IOC_SET_FILTER (since Linux 2.6.33)
2140              This adds an ftrace filter to this event.
2141
2142              The argument is a pointer to the desired ftrace filter.
2143
2144       PERF_EVENT_IOC_ID (since Linux 3.12)
2145              This returns the event ID value for the  given  event  file  de‐
2146              scriptor.
2147
2148              The  argument  is a pointer to a 64-bit unsigned integer to hold
2149              the result.
2150
2151       PERF_EVENT_IOC_SET_BPF (since Linux 4.1)
2152              This allows attaching a Berkeley Packet Filter (BPF) program  to
2153              an  existing  kprobe  tracepoint  event.   You  need CAP_PERFMON
2154              (since Linux 5.8) or CAP_SYS_ADMIN privileges to use this ioctl.
2155
2156              The argument is a BPF program file descriptor that  was  created
2157              by a previous bpf(2) system call.
2158
2159       PERF_EVENT_IOC_PAUSE_OUTPUT (since Linux 4.7)
2160              This  allows  pausing  and  resuming the event's ring-buffer.  A
2161              paused ring-buffer does not prevent generation of  samples,  but
2162              simply  discards  them.   The  discarded  samples are considered
2163              lost, and cause a PERF_RECORD_LOST sample to be  generated  when
2164              possible.  An overflow signal may still be triggered by the dis‐
2165              carded sample even though the ring-buffer remains empty.
2166
2167              The argument is an unsigned 32-bit  integer.   A  nonzero  value
2168              pauses the ring-buffer, while a zero value resumes the ring-buf‐
2169              fer.
2170
       PERF_EVENT_IOC_MODIFY_ATTRIBUTES (since Linux 4.17)
2172              This allows modifying an existing event without the overhead  of
2173              closing  and reopening a new event.  Currently this is supported
2174              only for breakpoint events.
2175
2176              The argument is a pointer to a  perf_event_attr  structure  con‐
2177              taining the updated event settings.
2178
2179       PERF_EVENT_IOC_QUERY_BPF (since Linux 4.16)
2180              This allows querying which Berkeley Packet Filter (BPF) programs
2181              are attached to an existing kprobe tracepoint.  You can only at‐
2182              tach one BPF program per event, but you can have multiple events
2183              attached to a tracepoint.  Querying this value on one tracepoint
2184              event  returns the ID of all BPF programs in all events attached
2185              to the tracepoint.  You need CAP_PERFMON (since  Linux  5.8)  or
2186              CAP_SYS_ADMIN privileges to use this ioctl.
2187
2188              The argument is a pointer to a structure
2189                  struct perf_event_query_bpf {
2190                      __u32    ids_len;
2191                      __u32    prog_cnt;
2192                      __u32    ids[0];
2193                  };
2194
2195              The  ids_len  field  indicates the number of ids that can fit in
2196              the provided ids array.  The prog_cnt value is filled in by  the
2197              kernel  with the number of attached BPF programs.  The ids array
2198              is filled with the ID of each attached BPF  program.   If  there
2199              are  more  programs  than will fit in the array, then the kernel
2200              will return ENOSPC and ids_len will indicate the number of  pro‐
2201              gram IDs that were successfully copied.
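
              The variable-length structure has to be allocated with room for
              the ids array before the call.  The following is a minimal sketch
              under that assumption; query_bpf_ids() is a hypothetical helper,
              and it treats ENOSPC like any other failure rather than returning
              the partial list.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/*
 * Store up to max_ids attached BPF program IDs in ids.
 * Returns the number of IDs stored, or -1 with errno set
 * (ENOSPC means that max_ids was too small).
 */
int
query_bpf_ids(int fd, uint32_t *ids, uint32_t max_ids)
{
    struct perf_event_query_bpf *q;
    size_t size = sizeof(*q) + (size_t) max_ids * sizeof(q->ids[0]);
    uint32_t n;

    q = calloc(1, size);
    if (q == NULL)
        return -1;
    q->ids_len = max_ids;           /* capacity of the ids array */

    if (ioctl(fd, PERF_EVENT_IOC_QUERY_BPF, q) == -1) {
        int saved = errno;          /* keep errno across free() */

        free(q);
        errno = saved;
        return -1;
    }

    /* prog_cnt is filled in by the kernel; clamp it defensively. */
    n = q->prog_cnt < max_ids ? q->prog_cnt : max_ids;
    memcpy(ids, q->ids, n * sizeof(ids[0]));
    free(q);
    return (int) n;
}
```

              A caller that receives ENOSPC can retry with a larger max_ids.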
2202
2203   Using prctl(2)
2204       A  process  can enable or disable all currently open event groups using
2205       the prctl(2) PR_TASK_PERF_EVENTS_ENABLE and PR_TASK_PERF_EVENTS_DISABLE
2206       operations.  This applies only to events created locally by the calling
2207       process.  This does not apply to events created by other processes  at‐
2208       tached  to  the  calling  process  or  inherited  events  from a parent
2209       process.  Only group leaders are enabled and disabled,  not  any  other
2210       members of the groups.
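
       A minimal sketch of the two operations follows.  The wrapper name
       set_own_events_enabled() is hypothetical; the remaining prctl(2)
       arguments are unused and passed as zero.

```c
#include <sys/prctl.h>

/*
 * Enable (enable != 0) or disable (enable == 0) the event group leaders
 * created by the calling process.
 * Returns 0 on success, or -1 with errno set.
 */
int
set_own_events_enabled(int enable)
{
    int op = enable ? PR_TASK_PERF_EVENTS_ENABLE
                    : PR_TASK_PERF_EVENTS_DISABLE;

    return prctl(op, 0, 0, 0, 0);
}
```

       This can be used, for example, to disable counting around a region of
       code that should not be measured.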
2211
2212   perf_event related configuration files
2213       Files in /proc/sys/kernel/
2214
2215           /proc/sys/kernel/perf_event_paranoid
2216                  The  perf_event_paranoid  file can be set to restrict access
2217                  to the performance counters.
2218
2219                  2   allow only user-space measurements (default since  Linux
2220                      4.6).
2221                  1   allow  both kernel and user measurements (default before
2222                      Linux 4.6).
2223                  0   allow access to CPU-specific data but not raw tracepoint
2224                      samples.
2225                  -1  no restrictions.
2226
2227                  The  existence  of the perf_event_paranoid file is the offi‐
2228                  cial  method  for   determining   if   a   kernel   supports
2229                  perf_event_open().
2230
2231           /proc/sys/kernel/perf_event_max_sample_rate
2232                  This  sets  the  maximum sample rate.  Setting this too high
2233                  can allow users to sample at a rate that impacts overall ma‐
2234                  chine  performance and potentially lock up the machine.  The
2235                  default value is 100000 (samples per second).
2236
2237           /proc/sys/kernel/perf_event_max_stack
2238                  This file sets the maximum depth of stack frame entries  re‐
2239                  ported when generating a call trace.
2240
2241           /proc/sys/kernel/perf_event_mlock_kb
                  Maximum amount of memory, in kilobytes, that an unprivileged
                  user can mlock(2).  The default is 516 (kB).
2244
2245       Files in /sys/bus/event_source/devices/
2246
2247           Since Linux 2.6.34, the kernel supports having multiple PMUs avail‐
2248           able  for monitoring.  Information on how to program these PMUs can
2249           be found under /sys/bus/event_source/devices/.   Each  subdirectory
2250           corresponds to a different PMU.
2251
2252           /sys/bus/event_source/devices/*/type (since Linux 2.6.38)
2253                  This  contains an integer that can be used in the type field
2254                  of perf_event_attr to indicate that you  wish  to  use  this
2255                  PMU.
2256
2257           /sys/bus/event_source/devices/cpu/rdpmc (since Linux 3.4)
2258                  If this file is 1, then direct user-space access to the per‐
2259                  formance counter registers is allowed via the rdpmc instruc‐
2260                  tion.  This can be disabled by echoing 0 to the file.
2261
2262                  As  of  Linux  4.0  the  behavior has changed, so that 1 now
2263                  means only  allow  access  to  processes  with  active  perf
2264                  events, with 2 indicating the old allow-anyone-access behav‐
2265                  ior.
2266
2267           /sys/bus/event_source/devices/*/format/ (since Linux 3.4)
2268                  This subdirectory contains information on the  architecture-
2269                  specific  subfields  available  for  programming the various
2270                  config fields in the perf_event_attr struct.
2271
                  The content of each file is the name of the config field,
                  followed by a colon, followed by a series of integer bit
                  ranges separated by commas.  For example, the file event may
                  contain the value config1:1,6-10,44, which indicates that
                  event is an attribute that occupies bits 1, 6–10, and 44 of
                  perf_event_attr::config1.
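
                  The format syntax above can be turned into a bit mask with a
                  small parser.  The sketch below is illustrative only:
                  parse_format() is a hypothetical helper, and it assumes that
                  bit numbers lie in the range 0 to 63 and that the string is
                  terminated by a null byte or a newline.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse a string such as "config1:1,6-10,44".
 * On success, stores the field name in name, the occupied bits in *mask,
 * and returns 0.  Returns -1 if the string is malformed.
 */
int
parse_format(const char *spec, char *name, size_t name_sz, uint64_t *mask)
{
    const char *colon = strchr(spec, ':');
    const char *p;
    size_t n;

    if (colon == NULL)
        return -1;
    n = (size_t) (colon - spec);
    if (n == 0 || n >= name_sz)
        return -1;
    memcpy(name, spec, n);
    name[n] = '\0';

    *mask = 0;
    p = colon + 1;
    for (;;) {
        char *end;
        unsigned long lo, hi;

        if (*p < '0' || *p > '9')
            return -1;
        lo = hi = strtoul(p, &end, 10);
        if (*end == '-') {              /* a range such as 6-10 */
            p = end + 1;
            if (*p < '0' || *p > '9')
                return -1;
            hi = strtoul(p, &end, 10);
        }
        if (lo > hi || hi > 63)
            return -1;
        for (unsigned long b = lo; b <= hi; b++)
            *mask |= UINT64_C(1) << b;

        if (*end == ',') {              /* another range follows */
            p = end + 1;
            continue;
        }
        return (*end == '\0' || *end == '\n') ? 0 : -1;
    }
}
```

                  For the example string above, the resulting mask has bits 1,
                  6 to 10, and 44 set, and name is "config1".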
2278
2279           /sys/bus/event_source/devices/*/events/ (since Linux 3.4)
2280                  This  subdirectory  contains  files  with predefined events.
2281                  The contents are strings describing the event  settings  ex‐
2282                  pressed  in terms of the fields found in the previously men‐
2283                  tioned ./format/ directory.  These are not necessarily  com‐
2284                  plete  lists of all events supported by a PMU, but usually a
2285                  subset of events deemed useful or interesting.
2286
2287                  The content of each file is a list of attribute names  sepa‐
2288                  rated  by  commas.  Each entry has an optional value (either
2289                  hex or decimal).  If no value is specified, then it  is  as‐
2290                  sumed  to be a single-bit field with a value of 1.  An exam‐
2291                  ple entry may look like this: event=0x2,inv,ldlat=3.
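
                  A sketch of looking up one attribute in such a string
                  follows.  The helper event_term() is hypothetical, and it
                  handles only the plain name and name=value forms described
                  above, with values in decimal or 0x-prefixed hexadecimal.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Find attribute term in a string such as "event=0x2,inv,ldlat=3".
 * Stores its value in *value (1 for a bare name) and returns 0,
 * or returns -1 if the attribute is not present.
 */
int
event_term(const char *spec, const char *term, uint64_t *value)
{
    size_t tlen = strlen(term);
    const char *p = spec;

    while (*p != '\0' && *p != '\n') {
        size_t len = strcspn(p, ",\n");         /* length of this entry */
        const char *eq = memchr(p, '=', len);
        size_t klen = (eq != NULL) ? (size_t) (eq - p) : len;

        if (klen == tlen && memcmp(p, term, tlen) == 0) {
            if (eq == NULL) {
                *value = 1;         /* bare name: single-bit field */
            } else {
                const char *v = eq + 1;
                int base = (v[0] == '0' && (v[1] == 'x' || v[1] == 'X'))
                           ? 16 : 10;

                *value = strtoull(v, NULL, base);
            }
            return 0;
        }

        p += len;
        if (*p == ',')
            p++;
    }
    return -1;
}
```

                  Each value found this way would then be placed into the bits
                  named by the matching file in the format directory.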
2292
2293           /sys/bus/event_source/devices/*/uevent
2294                  This file is the standard kernel device  interface  for  in‐
2295                  jecting hotplug events.
2296
2297           /sys/bus/event_source/devices/*/cpumask (since Linux 3.7)
2298                  The cpumask file contains a comma-separated list of integers
2299                  that indicate a representative CPU number  for  each  socket
2300                  (package)  on  the motherboard.  This is needed when setting
2301                  up uncore or  northbridge  events,  as  those  PMUs  present
2302                  socket-wide events.
2303

RETURN VALUE

2305       perf_event_open()  returns  the  new file descriptor, or -1 if an error
2306       occurred (in which case, errno is set appropriately).
2307

ERRORS

2309       The errors returned by perf_event_open() can be inconsistent,  and  may
2310       vary across processor architectures and performance monitoring units.
2311
2312       E2BIG  Returned if the perf_event_attr size value is too small (smaller
2313              than PERF_ATTR_SIZE_VER0), too big (larger than the page  size),
2314              or  larger  than the kernel supports and the extra bytes are not
2315              zero.  When E2BIG is returned, the perf_event_attr size field is
2316              overwritten by the kernel to be the size of the structure it was
2317              expecting.
2318
2319       EACCES Returned when the requested event  requires  CAP_PERFMON  (since
2320              Linux  5.8)  or  CAP_SYS_ADMIN permissions (or a more permissive
2321              perf_event paranoid setting).  Some common cases  where  an  un‐
2322              privileged  process  may  encounter  this  error: attaching to a
2323              process owned by a different user; monitoring all processes on a
2324              given  CPU  (i.e.,  specifying  the pid argument as -1); and not
2325              setting exclude_kernel when the paranoid setting requires it.
2326
2327       EBADF  Returned if the group_fd file descriptor is not  valid,  or,  if
2328              PERF_FLAG_PID_CGROUP  is  set, the cgroup file descriptor in pid
2329              is not valid.
2330
2331       EBUSY (since Linux 4.1)
2332              Returned if another event already has exclusive  access  to  the
2333              PMU.
2334
2335       EFAULT Returned  if  the  attr  pointer points at an invalid memory ad‐
2336              dress.
2337
       EINVAL Returned if the specified event is invalid.  There are many
              possible reasons for this.  A non-exhaustive list: sample_freq
              is higher than the maximum setting; the cpu to monitor does not
              exist; read_format is out of range; sample_type is out of range;
              the flags value is out of range; exclusive or pinned is set and
              the event is not a group leader; the event config values are out
              of range or set reserved bits; the generic event selected is not
              supported; or there is not enough room to add the selected
              event.
2347
2348       EINTR  Returned when trying to mix perf and ftrace handling for  a  up‐
2349              robe.
2350
2351       EMFILE Each  opened  event uses one file descriptor.  If a large number
2352              of events are opened, the per-process limit  on  the  number  of
2353              open file descriptors will be reached, and no more events can be
2354              created.
2355
2356       ENODEV Returned when the event involves a feature not supported by  the
2357              current CPU.
2358
2359       ENOENT Returned  if  the type setting is not valid.  This error is also
2360              returned for some unsupported generic events.
2361
2362       ENOSPC Prior to Linux 3.3, if there was not enough room for the  event,
2363              ENOSPC  was returned.  In Linux 3.3, this was changed to EINVAL.
2364              ENOSPC is still returned if  you  try  to  add  more  breakpoint
2365              events than supported by the hardware.
2366
2367       ENOSYS Returned  if PERF_SAMPLE_STACK_USER is set in sample_type and it
2368              is not supported by hardware.
2369
2370       EOPNOTSUPP
2371              Returned if an event requiring a specific  hardware  feature  is
2372              requested  but  there is no hardware support.  This includes re‐
2373              questing low-skid events if not supported, branch tracing if  it
2374              is not available, sampling if no PMU interrupt is available, and
2375              branch stacks for software events.
2376
2377       EOVERFLOW (since Linux 4.8)
2378              Returned  if  PERF_SAMPLE_CALLCHAIN  is   requested   and   sam‐
2379              ple_max_stack   is   larger   than   the  maximum  specified  in
2380              /proc/sys/kernel/perf_event_max_stack.
2381
2382       EPERM  Returned on many (but not all) architectures when an unsupported
2383              exclude_hv,  exclude_idle,  exclude_user, or exclude_kernel set‐
2384              ting is specified.
2385
2386              It can also happen, as with EACCES, when the requested event re‐
2387              quires  CAP_PERFMON  (since  Linux 5.8) or CAP_SYS_ADMIN permis‐
2388              sions (or a more permissive perf_event paranoid setting).   This
2389              includes  setting  a  breakpoint on a kernel address, and (since
2390              Linux 3.13) setting a kernel function-trace tracepoint.
2391
2392       ESRCH  Returned if attempting to attach to a process that does not  ex‐
2393              ist.
2394

VERSION

2396       perf_event_open()  was  introduced  in  Linux  2.6.31  but  was  called
2397       perf_counter_open().  It was renamed in Linux 2.6.32.
2398

CONFORMING TO

       The perf_event_open() system call is Linux-specific and should not be
       used in programs intended to be portable.
2402

NOTES

2404       Glibc  does  not  provide a wrapper for this system call; call it using
2405       syscall(2).  See the example below.
2406
2407       The official way of knowing if perf_event_open() support is enabled  is
2408       checking    for    the    existence    of   the   file   /proc/sys/ker‐
2409       nel/perf_event_paranoid.
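
       A minimal sketch of that check follows.  The helper read_paranoid() is
       hypothetical, and it takes the path as a parameter only so that it can
       be exercised against an arbitrary file; in normal use the path would be
       /proc/sys/kernel/perf_event_paranoid.

```c
#include <stdio.h>

/*
 * Read the integer stored in the file at path into *level.
 * Returns 0 on success, or -1 if the file cannot be opened (which, for
 * the perf_event_paranoid file, suggests that perf_event_open() is not
 * supported) or does not contain an integer.
 */
int
read_paranoid(const char *path, int *level)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%d", level) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

       The value obtained can be compared against the levels described for
       perf_event_paranoid above.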
2410
       The CAP_PERFMON capability (since Linux 5.8) provides a secure approach
       to performance monitoring and observability operations in a system,
       according to the principle of least privilege (POSIX IEEE 1003.1e).
       Accessing system performance monitoring and observability operations
       using CAP_PERFMON rather than the much more powerful CAP_SYS_ADMIN
       avoids opportunities to misuse credentials and makes operations more
       secure.  Using CAP_SYS_ADMIN for secure system performance monitoring
       and observability is discouraged in favor of the CAP_PERFMON
       capability.
2419

BUGS

2421       The  F_SETOWN_EX  option to fcntl(2) is needed to properly get overflow
2422       signals in threads.  This was introduced in Linux 2.6.32.
2423
       Prior to Linux 2.6.33 (at least for x86), the kernel did not check if
       events could be scheduled together until read time.  The same happens
       on all known kernels if the NMI watchdog is enabled.  This means that,
       to see whether a given set of events works, you have to call
       perf_event_open(), start the events, and then read them before you
       know for sure that you can get valid measurements.
2429
2430       Prior to Linux 2.6.34, event constraints were not enforced by the  ker‐
2431       nel.  In that case, some events would silently return "0" if the kernel
2432       scheduled them in an improper counter slot.
2433
2434       Prior to Linux 2.6.34, there was a  bug  when  multiplexing  where  the
2435       wrong results could be returned.
2436
2437       Kernels  from Linux 2.6.35 to Linux 2.6.39 can quickly crash the kernel
2438       if "inherit" is enabled and many threads are started.
2439
2440       Prior to Linux 2.6.35, PERF_FORMAT_GROUP did  not  work  with  attached
2441       processes.
2442
2443       There  is  a  bug in the kernel code between Linux 2.6.36 and Linux 3.0
2444       that ignores the "watermark" field and acts as if  a  wakeup_event  was
2445       chosen if the union has a nonzero value in it.
2446
2447       From  Linux 2.6.31 to Linux 3.4, the PERF_IOC_FLAG_GROUP ioctl argument
2448       was broken and would repeatedly operate on the event  specified  rather
2449       than iterating across all sibling events in a group.
2450
2451       From  Linux  3.4 to Linux 3.11, the mmap cap_usr_rdpmc and cap_usr_time
2452       bits mapped to the same location.   Code  should  migrate  to  the  new
2453       cap_user_rdpmc and cap_user_time fields instead.
2454
2455       Always  double-check your results!  Various generalized events have had
2456       wrong values.  For example, retired branches measured the  wrong  thing
2457       on AMD machines until Linux 2.6.35.
2458

EXAMPLES

2460       The  following  is  a short example that measures the total instruction
2461       count of a call to printf(3).
2462
       #include <stdlib.h>
       #include <stdio.h>
       #include <unistd.h>
       #include <string.h>
       #include <sys/ioctl.h>
       #include <linux/perf_event.h>
       #include <asm/unistd.h>

       /* There is no glibc wrapper, so invoke the system call directly. */
       static long
       perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                       int cpu, int group_fd, unsigned long flags)
       {
           return syscall(__NR_perf_event_open, hw_event, pid, cpu,
                          group_fd, flags);
       }

       int
       main(void)
       {
           struct perf_event_attr pe;
           long long count;
           int fd;

           /* Count user-space instructions only; start disabled. */
           memset(&pe, 0, sizeof(pe));
           pe.type = PERF_TYPE_HARDWARE;
           pe.size = sizeof(pe);
           pe.config = PERF_COUNT_HW_INSTRUCTIONS;
           pe.disabled = 1;
           pe.exclude_kernel = 1;
           pe.exclude_hv = 1;

           /* Measure the calling thread (pid 0) on any CPU (cpu -1). */
           fd = (int) perf_event_open(&pe, 0, -1, -1, 0);
           if (fd == -1) {
               fprintf(stderr, "Error opening leader %llx\n",
                       (unsigned long long) pe.config);
               exit(EXIT_FAILURE);
           }

           ioctl(fd, PERF_EVENT_IOC_RESET, 0);
           ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

           printf("Measuring instruction count for this printf\n");

           ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

           /* With the default read_format, read(2) yields one 64-bit count. */
           if (read(fd, &count, sizeof(count)) != sizeof(count)) {
               perror("read");
               exit(EXIT_FAILURE);
           }

           printf("Used %lld instructions\n", count);

           close(fd);
           exit(EXIT_SUCCESS);
       }
2515

SEE ALSO

2517       perf(1), fcntl(2), mmap(2), open(2), prctl(2), read(2)
2518
2519       Documentation/admin-guide/perf-security.rst in the kernel source tree
2520

COLOPHON

2522       This page is part of release 5.10 of the Linux  man-pages  project.   A
2523       description  of  the project, information about reporting bugs, and the
2524       latest    version    of    this    page,    can     be     found     at
2525       https://www.kernel.org/doc/man-pages/.
2526
2527
2528
2529Linux                             2020-11-01                PERF_EVENT_OPEN(2)