PERF_EVENT_OPEN(2)         Linux Programmer's Manual        PERF_EVENT_OPEN(2)

NAME

       perf_event_open - set up performance monitoring


SYNOPSIS

       #include <linux/perf_event.h>
       #include <linux/hw_breakpoint.h>

       int perf_event_open(struct perf_event_attr *attr,
                           pid_t pid, int cpu, int group_fd,
                           unsigned long flags);

       Note: There is no glibc wrapper for this system call; see NOTES.


DESCRIPTION

       Given  a  list of parameters, perf_event_open() returns a file descrip‐
       tor, for use in subsequent system calls  (read(2),  mmap(2),  prctl(2),
       fcntl(2), etc.).

       A  call to perf_event_open() creates a file descriptor that allows mea‐
       suring performance information.  Each file  descriptor  corresponds  to
       one  event  that  is measured; these can be grouped together to measure
       multiple events simultaneously.

       Events can be enabled and disabled in two ways: via  ioctl(2)  and  via
       prctl(2).   When  an  event  is  disabled it does not count or generate
       overflows but does continue to exist and maintain its count value.

       Events come in two flavors: counting and sampling.  A counting event is
       one  that  is  used  for  counting  the aggregate number of events that
       occur.  In general, counting event results are gathered with a  read(2)
       call.   A  sampling  event periodically writes measurements to a buffer
       that can then be accessed via mmap(2).
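
       The counting flow described above can be sketched as follows.  This is
       a minimal illustration, not a reference implementation: the helper
       names count_events_during() and busy_loop() are hypothetical, and the
       PERF_EVENT_IOC_RESET, PERF_EVENT_IOC_ENABLE, and PERF_EVENT_IOC_DISABLE
       ioctl(2) requests are taken from <linux/perf_event.h> rather than from
       the text above.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in workload for the region being measured. */
static void busy_loop(void)
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 100000; i++)
        sink += i;
}

/* Hypothetical helper: count one event for the calling thread while
 * work() runs.  Returns 0 and stores the count, or -1 on any error. */
static int count_events_during(uint32_t type, uint64_t config,
                               void (*work)(void), uint64_t *count)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = type;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;          /* created disabled, enabled below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid == 0, cpu == -1: calling thread on any CPU; no group, no flags. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd == -1)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* A counting event is read with read(2); the default format is one u64. */
    ssize_t n = read(fd, count, sizeof(*count));
    close(fd);
    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

       A caller might pass PERF_TYPE_HARDWARE with PERF_COUNT_HW_INSTRUCTIONS
       and should be prepared for -1, since the event may be unsupported or
       not permitted on a given system.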

   Arguments
       The pid and cpu arguments allow specifying which  process  and  CPU  to
       monitor:

       pid == 0 and cpu == -1
              This measures the calling process/thread on any CPU.

       pid == 0 and cpu >= 0
              This  measures  the  calling process/thread only when running on
              the specified CPU.

       pid > 0 and cpu == -1
              This measures the specified process/thread on any CPU.

       pid > 0 and cpu >= 0
              This measures the specified process/thread only when running  on
              the specified CPU.

       pid == -1 and cpu >= 0
              This  measures all processes/threads on the specified CPU.  This
              requires   CAP_SYS_ADMIN   capability   or   a    /proc/sys/ker‐
              nel/perf_event_paranoid value of less than 1.

       pid == -1 and cpu == -1
              This setting is invalid and will return an error.

       When  pid  is greater than zero, permission to perform this system call
       is governed by a ptrace access mode  PTRACE_MODE_READ_REALCREDS  check;
       see ptrace(2).
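
       Because there is no glibc wrapper, callers usually go through
       syscall(2).  The sketch below shows one possible wrapper together with
       a small classifier that mirrors the pid/cpu combinations listed above.
       Both function names are hypothetical, and treating values below -1 as
       invalid is an assumption of this sketch, not something stated above.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical thin wrapper around the raw system call. */
static int sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
{
    return (int)syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical helper: describe what a pid/cpu pair would monitor. */
static const char *describe_target(pid_t pid, int cpu)
{
    if (pid < -1 || cpu < -1)       /* assumption: out-of-range values */
        return "invalid";
    if (pid == -1)
        return cpu >= 0 ? "all tasks on one CPU" : "invalid";
    if (pid == 0)
        return cpu >= 0 ? "calling task on one CPU"
                        : "calling task on any CPU";
    return cpu >= 0 ? "given task on one CPU" : "given task on any CPU";
}
```

       Checking the combination before the call only gives a friendlier
       diagnostic; the kernel remains the authority and reports the actual
       error through errno.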

       The  group_fd  argument  allows  event  groups to be created.  An event
       group has one event which is the group leader.  The leader  is  created
       first,  with  group_fd = -1.  The rest of the group members are created
       with subsequent perf_event_open() calls with group_fd being set to  the
       file  descriptor  of  the  group leader.  (A single event on its own is
       created with group_fd = -1 and is considered to be a group with only  1
       member.)   An  event group is scheduled onto the CPU as a unit: it will
       be put onto the CPU only if all of the events in the group can  be  put
       onto  the  CPU.  This means that the values of the member events can be
       meaningfully compared with each other (added, divided to get ratios,
       and so on), since they have counted events for the same set of executed
       instructions.
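
       As an illustration of the leader/member sequence, the following sketch
       opens a two-event group for the calling thread.  The function name
       open_cycles_and_instructions() is hypothetical, and starting the leader
       with disabled set to 1 is a common convention assumed here so that the
       whole group can be enabled later through the leader.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fds[0] is the group leader (cycles), fds[1] a
 * member (instructions).  Returns 0 on success; on failure returns -1
 * and leaves both entries set to -1. */
static int open_cycles_and_instructions(int fds[2])
{
    struct perf_event_attr attr;

    fds[0] = fds[1] = -1;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;                  /* leader starts disabled */

    /* group_fd == -1 creates a new group with this event as leader. */
    fds[0] = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fds[0] == -1)
        return -1;

    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 0;                  /* member follows the leader */

    /* group_fd == leader's descriptor joins the existing group. */
    fds[1] = (int)syscall(SYS_perf_event_open, &attr, 0, -1, fds[0], 0UL);
    if (fds[1] == -1) {
        close(fds[0]);
        fds[0] = -1;
        return -1;
    }
    return 0;
}
```

       With both descriptors open, dividing the instruction count by the
       cycle count gives a ratio over the same stretch of execution, which is
       the kind of comparison that group scheduling makes meaningful.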

       The flags argument is formed by ORing together zero or more of the fol‐
       lowing values:

       PERF_FLAG_FD_CLOEXEC (since Linux 3.14)
              This  flag  enables the close-on-exec flag for the created event
              file descriptor, so that the file  descriptor  is  automatically
              closed  on  execve(2).   Setting the close-on-exec flag at  cre‐
              ation time, rather than later with  fcntl(2),  avoids  potential
              race    conditions    where    the    calling   thread   invokes
              perf_event_open() and fcntl(2)  at  the  same  time  as  another
              thread calls fork(2) then execve(2).

       PERF_FLAG_FD_NO_GROUP
              This  flag  tells  the  event  to  ignore the group_fd parameter
              except for the purpose of setting up  output  redirection  using
              the PERF_FLAG_FD_OUTPUT flag.

       PERF_FLAG_FD_OUTPUT (broken since Linux 2.6.35)
              This  flag  re-routes  the  event's sampled output to instead be
              included in the mmap buffer of the event specified by group_fd.

       PERF_FLAG_PID_CGROUP (since Linux 2.6.39)
              This flag activates  per-container  system-wide  monitoring.   A
              container is an abstraction that isolates a set of resources for
              finer-grained control (CPUs, memory, etc.).  In this  mode,  the
              event  is  measured  only if the thread running on the monitored
              CPU belongs to the designated container (cgroup).  The cgroup is
              identified  by passing a file descriptor opened on its directory
              in the cgroupfs filesystem.  For instance, if the cgroup to mon‐
              itor   is   called  test,  then  a  file  descriptor  opened  on
              /dev/cgroup/test (assuming cgroupfs is mounted  on  /dev/cgroup)
              must  be  passed  as  the  pid  parameter.  cgroup monitoring is
              available only for system-wide events and may therefore  require
              extra permissions.
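
       The PERF_FLAG_PID_CGROUP calling convention can be sketched as below.
       The helper name open_cgroup_event() is hypothetical.  Closing the
       cgroup directory descriptor right after the call is an assumption of
       this sketch; a cautious caller could keep it open for the lifetime of
       the event instead.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open an event restricted to the cgroup whose
 * cgroupfs directory is cgroup_dir, on the given CPU.
 * Returns the event file descriptor, or -1 with errno set. */
static int open_cgroup_event(const char *cgroup_dir, int cpu,
                             struct perf_event_attr *attr)
{
    int cgroup_fd = open(cgroup_dir, O_RDONLY);
    if (cgroup_fd == -1)
        return -1;

    /* With PERF_FLAG_PID_CGROUP the pid argument carries the cgroup fd. */
    int fd = (int)syscall(SYS_perf_event_open, attr, cgroup_fd, cpu, -1,
                          (unsigned long)(PERF_FLAG_PID_CGROUP |
                                          PERF_FLAG_FD_CLOEXEC));
    int saved_errno = errno;

    close(cgroup_fd);           /* assumption: not needed after the call */
    errno = saved_errno;
    return fd;
}
```

       Following the example in the text, a caller would pass
       "/dev/cgroup/test" as cgroup_dir together with a CPU number of 0 or
       greater, because cgroup monitoring is available only for system-wide
       events.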

       The  perf_event_attr structure provides detailed configuration informa‐
       tion for the event being created.

           struct perf_event_attr {
               __u32 type;                 /* Type of event */
               __u32 size;                 /* Size of attribute structure */
               __u64 config;               /* Type-specific configuration */

               union {
                   __u64 sample_period;    /* Period of sampling */
                   __u64 sample_freq;      /* Frequency of sampling */
               };

               __u64 sample_type;  /* Specifies values included in sample */
               __u64 read_format;  /* Specifies values returned in read */

               __u64 disabled       : 1,   /* off by default */
                     inherit        : 1,   /* children inherit it */
                     pinned         : 1,   /* must always be on PMU */
                     exclusive      : 1,   /* only group on PMU */
                     exclude_user   : 1,   /* don't count user */
                     exclude_kernel : 1,   /* don't count kernel */
                     exclude_hv     : 1,   /* don't count hypervisor */
                     exclude_idle   : 1,   /* don't count when idle */
                     mmap           : 1,   /* include mmap data */
                     comm           : 1,   /* include comm data */
                     freq           : 1,   /* use freq, not period */
                     inherit_stat   : 1,   /* per task counts */
                     enable_on_exec : 1,   /* next exec enables */
                     task           : 1,   /* trace fork/exit */
                     watermark      : 1,   /* wakeup_watermark */
                     precise_ip     : 2,   /* skid constraint */
                     mmap_data      : 1,   /* non-exec mmap data */
                     sample_id_all  : 1,   /* sample_type all events */
                     exclude_host   : 1,   /* don't count in host */
                     exclude_guest  : 1,   /* don't count in guest */
                     exclude_callchain_kernel : 1,
                                           /* exclude kernel callchains */
                     exclude_callchain_user   : 1,
                                           /* exclude user callchains */
                     mmap2          :  1,  /* include mmap with inode data */
                     comm_exec      :  1,  /* flag comm events that are
                                              due to exec */
                     use_clockid    :  1,  /* use clockid for time fields */
                     context_switch :  1,  /* context switch data */

                     __reserved_1   : 37;

               union {
                   __u32 wakeup_events;    /* wakeup every n events */
                   __u32 wakeup_watermark; /* bytes before wakeup */
               };

               __u32     bp_type;          /* breakpoint type */

               union {
                   __u64 bp_addr;          /* breakpoint address */
                   __u64 kprobe_func;      /* for perf_kprobe */
                   __u64 uprobe_path;      /* for perf_uprobe */
                   __u64 config1;          /* extension of config */
               };

               union {
                   __u64 bp_len;           /* breakpoint length */
                   __u64 kprobe_addr;      /* with kprobe_func == NULL */
                   __u64 probe_offset;     /* for perf_[k,u]probe */
                   __u64 config2;          /* extension of config1 */
               };
               __u64 branch_sample_type;   /* enum perf_branch_sample_type */
               __u64 sample_regs_user;     /* user regs to dump on samples */
               __u32 sample_stack_user;    /* size of stack to dump on
                                              samples */
               __s32 clockid;              /* clock to use for time fields */
               __u64 sample_regs_intr;     /* regs to dump on samples */
               __u32 aux_watermark;        /* aux bytes before wakeup */
               __u16 sample_max_stack;     /* max frames in callchain */
               __u16 __reserved_2;         /* align to u64 */

           };

196       The fields of the  perf_event_attr  structure  are  described  in  more
197       detail below:
198
199       type   This  field specifies the overall event type.  It has one of the
200              following values:
201
202              PERF_TYPE_HARDWARE
203                     This indicates one of the "generalized"  hardware  events
204                     provided  by the kernel.  See the config field definition
205                     for more details.
206
207              PERF_TYPE_SOFTWARE
208                     This indicates one of the  software-defined  events  pro‐
209                     vided  by  the  kernel  (even  if  no hardware support is
210                     available).
211
212              PERF_TYPE_TRACEPOINT
213                     This indicates a tracepoint provided by the kernel trace‐
214                     point infrastructure.
215
216              PERF_TYPE_HW_CACHE
217                     This  indicates  a hardware cache event.  This has a spe‐
218                     cial encoding, described in the config field definition.
219
220              PERF_TYPE_RAW
221                     This indicates a "raw" implementation-specific  event  in
222                     the config field.
223
224              PERF_TYPE_BREAKPOINT (since Linux 2.6.33)
225                     This  indicates  a hardware breakpoint as provided by the
226                     CPU.   Breakpoints  can  be  read/write  accesses  to  an
227                     address as well as execution of an instruction address.
228
229              dynamic PMU
230                     Since  Linux 2.6.38, perf_event_open() can support multi‐
231                     ple PMUs.  To enable this, a value exported by the kernel
232                     can  be  used  in the type field to indicate which PMU to
233                     use.  The value to use can be found in the sysfs filesys‐
234                     tem:  there  is  a  subdirectory  per  PMU instance under
235                     /sys/bus/event_source/devices.   In   each   subdirectory
236                     there is a type file whose content is an integer that can
237                     be   used   in   the   type   field.     For    instance,
238                     /sys/bus/event_source/devices/cpu/type contains the value
239                     for the core CPU PMU, which is usually 4.
240
241              kprobe and uprobe (since Linux 4.17)
242                     These two dynamic PMUs create a kprobe/uprobe and  attach
243                     it  to  the file descriptor generated by perf_event_open.
244                     The kprobe/uprobe will be destroyed on the destruction of
245                     the    file    descriptor.    See   fields   kprobe_func,
246                     uprobe_path,  kprobe_addr,  and  probe_offset  for   more
247                     details.
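
       For the dynamic PMU case described above, the type value can be read
       from sysfs at run time.  The following is a minimal sketch; the helper
       name read_pmu_type() is hypothetical, and the path layout is the one
       given in the dynamic PMU entry.

```c
#include <stdio.h>

/* Hypothetical helper: return the integer stored in
 * /sys/bus/event_source/devices/<pmu_name>/type, or -1 if it cannot
 * be read (PMU absent, sysfs not mounted, name too long, ...). */
static int read_pmu_type(const char *pmu_name)
{
    char path[256];
    int type = -1;

    int n = snprintf(path, sizeof(path),
                     "/sys/bus/event_source/devices/%s/type", pmu_name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}
```

       The result of read_pmu_type("cpu"), "kprobe", or "uprobe" would then be
       stored in the type field of perf_event_attr when it is not -1.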
248
249       size   The  size  of the perf_event_attr structure for forward/backward
250              compatibility.  Set this using sizeof(struct perf_event_attr) to
251              allow  the kernel to see the struct size at the time of compila‐
252              tion.
253
254              The related define PERF_ATTR_SIZE_VER0 is set to  64;  this  was
255              the  size of the first published struct.  PERF_ATTR_SIZE_VER1 is
256              72, corresponding  to  the  addition  of  breakpoints  in  Linux
257              2.6.33.  PERF_ATTR_SIZE_VER2 is 80 corresponding to the addition
258              of branch sampling in Linux 3.4.  PERF_ATTR_SIZE_VER3 is 96 cor‐
259              responding   to   the  addition  of  sample_regs_user  and  sam‐
260              ple_stack_user in Linux 3.7.  PERF_ATTR_SIZE_VER4 is 104  corre‐
261              sponding  to  the  addition  of  sample_regs_intr in Linux 3.19.
262              PERF_ATTR_SIZE_VER5 is 112  corresponding  to  the  addition  of
263              aux_watermark in Linux 4.1.
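
       A common way to satisfy the size requirement is to zero the whole
       structure and then fill in size with sizeof, as sketched below.  The
       helper name init_perf_attr() is hypothetical.

```c
#include <linux/perf_event.h>
#include <string.h>

/* Hypothetical helper: zero the structure, then set the three fields
 * that every event needs.  Zeroing first leaves all flag bits and
 * reserved fields at 0. */
static void init_perf_attr(struct perf_event_attr *attr,
                           __u32 type, __u64 config)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = type;
    attr->size = sizeof(*attr);     /* struct size at compile time */
    attr->config = config;
}
```

       Because size is taken from the headers used at compile time, it is at
       least PERF_ATTR_SIZE_VER0 and grows as newer headers add fields.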
264
       config This  specifies  which  event  you want, in conjunction with the
              type field.  The config1 and config2 fields are also taken  into
              account  in  cases  where 64 bits is not enough to fully specify
              the event.  The encoding of these fields is event dependent.
269
270              There are various ways to set the config field that  are  depen‐
271              dent  on the value of the previously described type field.  What
272              follows are various possible settings for config  separated  out
273              by type.
274
275              If  type is PERF_TYPE_HARDWARE, we are measuring one of the gen‐
276              eralized hardware CPU events.  Not all of these are available on
277              all platforms.  Set config to one of the following:
278
279                   PERF_COUNT_HW_CPU_CYCLES
280                          Total  cycles.   Be  wary of what happens during CPU
281                          frequency scaling.
282
283                   PERF_COUNT_HW_INSTRUCTIONS
284                          Retired instructions.   Be  careful,  these  can  be
285                          affected  by  various  issues, most notably hardware
286                          interrupt counts.
287
288                   PERF_COUNT_HW_CACHE_REFERENCES
289                          Cache accesses.  Usually this indicates  Last  Level
290                          Cache  accesses  but this may vary depending on your
291                          CPU.  This may include prefetches and coherency mes‐
292                          sages; again this depends on the design of your CPU.
293
294                   PERF_COUNT_HW_CACHE_MISSES
295                          Cache  misses.   Usually  this  indicates Last Level
296                          Cache misses; this is intended to be  used  in  con‐
297                          junction   with  the  PERF_COUNT_HW_CACHE_REFERENCES
298                          event to calculate cache miss rates.
299
300                   PERF_COUNT_HW_BRANCH_INSTRUCTIONS
301                          Retired branch instructions.  Prior to Linux 2.6.35,
302                          this used the wrong event on AMD processors.
303
304                   PERF_COUNT_HW_BRANCH_MISSES
305                          Mispredicted branch instructions.
306
307                   PERF_COUNT_HW_BUS_CYCLES
308                          Bus  cycles,  which  can  be  different  from  total
309                          cycles.
310
311                   PERF_COUNT_HW_STALLED_CYCLES_FRONTEND (since Linux 3.0)
312                          Stalled cycles during issue.
313
314                   PERF_COUNT_HW_STALLED_CYCLES_BACKEND (since Linux 3.0)
315                          Stalled cycles during retirement.
316
317                   PERF_COUNT_HW_REF_CPU_CYCLES (since Linux 3.3)
318                          Total cycles; not affected by CPU frequency scaling.
319
320              If type is PERF_TYPE_SOFTWARE, we are measuring software  events
321              provided by the kernel.  Set config to one of the following:
322
323                   PERF_COUNT_SW_CPU_CLOCK
324                          This  reports  the CPU clock, a high-resolution per-
325                          CPU timer.
326
327                   PERF_COUNT_SW_TASK_CLOCK
328                          This reports a clock count specific to the task that
329                          is running.
330
331                   PERF_COUNT_SW_PAGE_FAULTS
332                          This reports the number of page faults.
333
334                   PERF_COUNT_SW_CONTEXT_SWITCHES
                          This  counts  context switches.  Until Linux 2.6.34,
                          these were all reported as user-space events;  since
                          then they are reported as happening in the kernel.
338
339                   PERF_COUNT_SW_CPU_MIGRATIONS
340                          This  reports  the  number  of times the process has
341                          migrated to a new CPU.
342
343                   PERF_COUNT_SW_PAGE_FAULTS_MIN
344                          This counts the number of minor page faults.   These
345                          did not require disk I/O to handle.
346
347                   PERF_COUNT_SW_PAGE_FAULTS_MAJ
348                          This  counts the number of major page faults.  These
349                          required disk I/O to handle.
350
351                   PERF_COUNT_SW_ALIGNMENT_FAULTS (since Linux 2.6.33)
352                          This counts the number of alignment  faults.   These
353                          happen  when  unaligned  memory accesses happen; the
354                          kernel can handle these but it reduces  performance.
355                          This  happens  only  on some architectures (never on
356                          x86).
357
358                   PERF_COUNT_SW_EMULATION_FAULTS (since Linux 2.6.33)
359                          This counts the number  of  emulation  faults.   The
360                          kernel sometimes traps on unimplemented instructions
361                          and emulates them for user space.   This  can  nega‐
362                          tively impact performance.
363
364                   PERF_COUNT_SW_DUMMY (since Linux 3.12)
365                          This  is  a  placeholder  event that counts nothing.
366                          Informational sample record types such  as  mmap  or
367                          comm  must be associated with an active event.  This
368                          dummy event allows gathering  such  records  without
369                          requiring a counting event.
370
371              If  type  is  PERF_TYPE_TRACEPOINT, then we are measuring kernel
372              tracepoints.  The value to use in config can  be  obtained  from
373              under  debugfs tracing/events/*/*/id if ftrace is enabled in the
374              kernel.
375
376              If type is PERF_TYPE_HW_CACHE, then we are measuring a  hardware
377              CPU  cache event.  To calculate the appropriate config value use
378              the following equation:
379
                      (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) |
                      (perf_hw_cache_op_result_id << 16)
382
383                  where perf_hw_cache_id is one of:
384
385                      PERF_COUNT_HW_CACHE_L1D
386                             for measuring Level 1 Data Cache
387
388                      PERF_COUNT_HW_CACHE_L1I
389                             for measuring Level 1 Instruction Cache
390
391                      PERF_COUNT_HW_CACHE_LL
392                             for measuring Last-Level Cache
393
394                      PERF_COUNT_HW_CACHE_DTLB
395                             for measuring the Data TLB
396
397                      PERF_COUNT_HW_CACHE_ITLB
398                             for measuring the Instruction TLB
399
400                      PERF_COUNT_HW_CACHE_BPU
401                             for measuring the branch prediction unit
402
403                      PERF_COUNT_HW_CACHE_NODE (since Linux 3.1)
404                             for measuring local memory accesses
405
406                  and perf_hw_cache_op_id is one of:
407
408                      PERF_COUNT_HW_CACHE_OP_READ
409                             for read accesses
410
411                      PERF_COUNT_HW_CACHE_OP_WRITE
412                             for write accesses
413
414                      PERF_COUNT_HW_CACHE_OP_PREFETCH
415                             for prefetch accesses
416
417                  and perf_hw_cache_op_result_id is one of:
418
419                      PERF_COUNT_HW_CACHE_RESULT_ACCESS
420                             to measure accesses
421
422                      PERF_COUNT_HW_CACHE_RESULT_MISS
423                             to measure misses
424
425              If type is PERF_TYPE_RAW, then a custom "raw"  config  value  is
426              needed.   Most  CPUs  support events that are not covered by the
427              "generalized" events.  These  are  implementation  defined;  see
428              your  CPU  manual (for example the Intel Volume 3B documentation
429              or the AMD  BIOS  and  Kernel  Developer  Guide).   The  libpfm4
430              library  can be used to translate from the name in the architec‐
431              tural manuals to the raw hex value perf_event_open() expects  in
432              this field.
433
434              If  type is PERF_TYPE_BREAKPOINT, then leave config set to zero.
435              Its parameters are set in other places.
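
              As an illustration of where the breakpoint parameters go, the
              sketch below fills in the bp_type, bp_addr, and bp_len fields
              shown in the structure above.  The helper name
              init_write_watchpoint() is hypothetical; HW_BREAKPOINT_W and
              HW_BREAKPOINT_LEN_4 come from <linux/hw_breakpoint.h>.

```c
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: describe a 4-byte write watchpoint at addr.
 * addr is assumed to be suitably aligned for a 4-byte access. */
static void init_write_watchpoint(struct perf_event_attr *attr,
                                  const void *addr)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_BREAKPOINT;
    attr->size = sizeof(*attr);
    attr->config = 0;                       /* left at zero for breakpoints */
    attr->bp_type = HW_BREAKPOINT_W;        /* trigger on writes */
    attr->bp_addr = (__u64)(uintptr_t)addr; /* address to watch */
    attr->bp_len = HW_BREAKPOINT_LEN_4;     /* watch 4 bytes */
}
```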
436
437              If type is kprobe or uprobe, set retprobe (bit 0 of config,  see
438              /sys/bus/event_source/devices/[k,u]probe/format/retprobe)    for
439              kretprobe/uretprobe.   See  fields   kprobe_func,   uprobe_path,
440              kprobe_addr, and probe_offset for more details.
441
442       kprobe_func, uprobe_path, kprobe_addr, and probe_offset
443              These  fields describe the kprobe/uprobe for dynamic PMUs kprobe
444              and uprobe.  For kprobe: use kprobe_func  and  probe_offset,  or
445              use  kprobe_addr and leave kprobe_func as NULL.  For uprobe: use
446              uprobe_path and probe_offset.
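
              A kprobe set up by function name might be described as in the
              following sketch.  The helper name init_kprobe_attr() is
              hypothetical, and the PMU type is assumed to have been read
              from /sys/bus/event_source/devices/kprobe/type beforehand.  The
              field names require kernel headers from Linux 4.17 or later.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill attr for the kprobe dynamic PMU.
 * kprobe_pmu_type: value read from sysfs (assumed supplied by caller).
 * func_name:       kernel symbol to probe; must stay valid until the
 *                  perf_event_open() call has returned.
 * offset:          byte offset from the start of the symbol.
 * is_retprobe:     nonzero to request a kretprobe (bit 0 of config). */
static void init_kprobe_attr(struct perf_event_attr *attr,
                             __u32 kprobe_pmu_type, const char *func_name,
                             __u64 offset, int is_retprobe)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = kprobe_pmu_type;
    attr->size = sizeof(*attr);
    attr->config = is_retprobe ? 1 : 0;
    attr->kprobe_func = (__u64)(uintptr_t)func_name;
    attr->probe_offset = offset;
}
```

              For the address-based form, kprobe_func would instead be left
              as 0 (NULL) and kprobe_addr set to the kernel address.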
447
448       sample_period, sample_freq
449              A "sampling" event is one that generates an  overflow  notifica‐
450              tion  every N events, where N is given by sample_period.  A sam‐
451              pling event has sample_period > 0.   When  an  overflow  occurs,
452              requested  data is recorded in the mmap buffer.  The sample_type
453              field controls what data is recorded on each overflow.
454
              sample_freq can be used if you wish to use frequency rather than
              period.   In  this case, you set the freq flag.  The kernel will
              adjust the sampling period to try to achieve the  desired  rate.
              The adjustment happens on each timer tick.
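
              The two ways of requesting sampling can be sketched as a pair
              of setters.  The helper names are hypothetical.  Note that
              sample_period and sample_freq share storage in a union, so the
              freq flag is what tells the kernel how to interpret the value.

```c
#include <linux/perf_event.h>

/* Hypothetical helper: sample roughly hz times per second. */
static void use_sample_freq(struct perf_event_attr *attr, __u64 hz)
{
    attr->freq = 1;             /* interpret the union as a frequency */
    attr->sample_freq = hz;
}

/* Hypothetical helper: sample once every n events. */
static void use_sample_period(struct perf_event_attr *attr, __u64 n)
{
    attr->freq = 0;             /* interpret the union as a period */
    attr->sample_period = n;
}
```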
459
460       sample_type
              The  various  bits in this field specify which values to include
              in the sample.  They will be recorded in a ring-buffer, which is
              available  to  user space using mmap(2).  The order in which the
              values are saved in the sample is documented in the MMAP  Layout
              subsection  below;  it  is not the enum perf_event_sample_format
              order.
467
468              PERF_SAMPLE_IP
469                     Records instruction pointer.
470
471              PERF_SAMPLE_TID
472                     Records the process and thread IDs.
473
474              PERF_SAMPLE_TIME
475                     Records a timestamp.
476
477              PERF_SAMPLE_ADDR
478                     Records an address, if applicable.
479
480              PERF_SAMPLE_READ
481                     Record counter values for all events in a group, not just
482                     the group leader.
483
484              PERF_SAMPLE_CALLCHAIN
485                     Records the callchain (stack backtrace).
486
487              PERF_SAMPLE_ID
488                     Records a unique ID for the opened event's group leader.
489
490              PERF_SAMPLE_CPU
491                     Records CPU number.
492
493              PERF_SAMPLE_PERIOD
494                     Records the current sampling period.
495
496              PERF_SAMPLE_STREAM_ID
497                     Records  a  unique  ID  for  the  opened  event.   Unlike
498                     PERF_SAMPLE_ID the actual ID is returned, not  the  group
499                     leader.   This  ID  is  the  same  as the one returned by
500                     PERF_FORMAT_ID.
501
502              PERF_SAMPLE_RAW
503                     Records additional data, if applicable.  Usually returned
504                     by tracepoint events.
505
506              PERF_SAMPLE_BRANCH_STACK (since Linux 3.4)
507                     This provides a record of recent branches, as provided by
508                     CPU branch sampling hardware (such as Intel  Last  Branch
509                     Record).  Not all hardware supports this feature.
510
511                     See  the branch_sample_type field for how to filter which
512                     branches are reported.
513
514              PERF_SAMPLE_REGS_USER (since Linux 3.7)
515                     Records the current user-level CPU  register  state  (the
516                     values in the process before the kernel was called).
517
518              PERF_SAMPLE_STACK_USER (since Linux 3.7)
519                     Records the user level stack, allowing stack unwinding.
520
521              PERF_SAMPLE_WEIGHT (since Linux 3.10)
522                     Records  a  hardware provided weight value that expresses
523                     how costly the sampled event was.  This allows the  hard‐
524                     ware to highlight expensive events in a profile.
525
526              PERF_SAMPLE_DATA_SRC (since Linux 3.10)
527                     Records  the  data  source: where in the memory hierarchy
528                     the data associated with  the  sampled  instruction  came
529                     from.   This is available only if the underlying hardware
530                     supports this feature.
531
532              PERF_SAMPLE_IDENTIFIER (since Linux 3.12)
                     Places the SAMPLE_ID value in a  fixed  position  in  the
                     record, either at the beginning (for sample events) or at
                     the end (for non-sample events).
536
537                     This was necessary  because  a  sample  stream  may  have
538                     records from various different event sources with differ‐
539                     ent sample_type settings.  Parsing the event stream prop‐
540                     erly  was  not  possible because the format of the record
541                     was needed to find SAMPLE_ID, but the format could not be
542                     found  without  knowing what event the sample belonged to
543                     (causing a circular dependency).
544
545                     The PERF_SAMPLE_IDENTIFIER setting makes the event stream
546                     always parsable by putting SAMPLE_ID in a fixed location,
547                     even though it means having duplicate SAMPLE_ID values in
548                     records.
549
550              PERF_SAMPLE_TRANSACTION (since Linux 3.13)
551                     Records  reasons  for  transactional  memory abort events
552                     (for example, from Intel TSX  transactional  memory  sup‐
553                     port).
554
555                     The  precise_ip  setting  must  be  greater  than 0 and a
556                     transactional memory abort event must be measured  or  no
557                     values  will be recorded.  Also note that some perf_event
558                     measurements, such as sampled cycle counting,  may  cause
559                     extraneous  aborts  (by  causing  an  interrupt  during a
560                     transaction).
561
562              PERF_SAMPLE_REGS_INTR (since Linux 3.19)
                     Records a subset of the current  CPU  register  state  as
                     specified by sample_regs_intr.  Unlike
                     PERF_SAMPLE_REGS_USER, the register values reflect kernel
                     register state if the overflow happened while kernel code
                     was running.  If the CPU supports hardware sampling of
                     register state (e.g., PEBS on Intel x86) and precise_ip is
                     set higher than zero, then the register values returned
                     are those captured by hardware at the time of the sampled
                     instruction's retirement.
572
573       read_format
574              This field specifies the format of the data returned by  read(2)
575              on a perf_event_open() file descriptor.
576
577              PERF_FORMAT_TOTAL_TIME_ENABLED
578                     Adds  the 64-bit time_enabled field.  This can be used to
579                     calculate estimated totals if the  PMU  is  overcommitted
580                     and multiplexing is happening.
581
582              PERF_FORMAT_TOTAL_TIME_RUNNING
583                     Adds  the 64-bit time_running field.  This can be used to
584                     calculate estimated totals if the  PMU  is  overcommitted
585                     and multiplexing is happening.
586
587              PERF_FORMAT_ID
588                     Adds  a 64-bit unique value that corresponds to the event
589                     group.
590
591              PERF_FORMAT_GROUP
592                     Allows all counter values in an event group  to  be  read
593                     with one read.
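
              The estimate mentioned for PERF_FORMAT_TOTAL_TIME_ENABLED and
              PERF_FORMAT_TOTAL_TIME_RUNNING is commonly computed as value *
              time_enabled / time_running.  The sketch below assumes that
              both bits are set and PERF_FORMAT_GROUP is not, in which case
              read(2) returns three 64-bit values in the order shown.  The
              struct and function names are hypothetical.

```c
#include <stdint.h>
#include <unistd.h>

/* Assumed read(2) layout with TOTAL_TIME_ENABLED | TOTAL_TIME_RUNNING
 * set and neither PERF_FORMAT_ID nor PERF_FORMAT_GROUP set. */
struct read_with_times {
    uint64_t value;
    uint64_t time_enabled;
    uint64_t time_running;
};

/* Hypothetical helper: extrapolate the count to the full enabled time.
 * Returns 0 if the event never ran, since no estimate is possible. */
static uint64_t scale_count(const struct read_with_times *r)
{
    if (r->time_running == 0)
        return 0;
    /* Floating point avoids 64-bit overflow of value * time_enabled. */
    return (uint64_t)((double)r->value * (double)r->time_enabled /
                      (double)r->time_running);
}

/* Hypothetical helper: read the event and store the scaled estimate.
 * Returns 0 on success, -1 if the read is short or fails. */
static int read_scaled(int fd, uint64_t *estimate)
{
    struct read_with_times r;

    if (read(fd, &r, sizeof(r)) != (ssize_t)sizeof(r))
        return -1;
    *estimate = scale_count(&r);
    return 0;
}
```

              When time_running equals time_enabled the event was never
              multiplexed and the estimate equals the raw value.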
594
595       disabled
596              The  disabled  bit specifies whether the counter starts out dis‐
597              abled or enabled.  If disabled, the event can later  be  enabled
598              by ioctl(2), prctl(2), or enable_on_exec.
599
600              When creating an event group, typically the group leader is ini‐
601              tialized with disabled set to 1 and any child  events  are  ini‐
602              tialized  with disabled set to 0.  Despite disabled being 0, the
603              child events will not start until the group leader is enabled.
604
605       inherit
606              The inherit bit specifies that this counter should count  events
607              of child tasks as well as the task specified.  This applies only
608              to new children, not to any existing children at  the  time  the
609              counter  is  created  (nor to any new children of existing chil‐
610              dren).
611
612              Inherit does not work for some combinations of read_format  val‐
613              ues, such as PERF_FORMAT_GROUP.
614
615       pinned The  pinned  bit  specifies that the counter should always be on
616              the CPU if at all possible.  It applies only to  hardware  coun‐
617              ters  and  only to group leaders.  If a pinned counter cannot be
618              put onto the CPU (e.g., because there are  not  enough  hardware
619              counters  or  because of a conflict with some other event), then
620              the counter goes into an 'error' state, where reads return  end-
621              of-file  (i.e.,  read(2)  returns 0) until the counter is subse‐
622              quently enabled or disabled.
623
624       exclusive
625              The exclusive bit specifies that when this counter's group is on
626              the  CPU,  it should be the only group using the CPU's counters.
627              In the future this may allow monitoring programs to support  PMU
628              features  that  need  to  run  alone so that they do not disrupt
629              other hardware counters.
630
631              Note that many unexpected situations may prevent events with the
632              exclusive  bit  set  from ever running.  This includes any users
633              running a system-wide measurement as well as any kernel  use  of
634              the  performance  counters  (including  the commonly enabled NMI
635              Watchdog Timer interface).
636
637       exclude_user
638              If this bit is set, the count excludes  events  that  happen  in
639              user space.
640
641       exclude_kernel
642              If  this  bit  is  set, the count excludes events that happen in
643              kernel space.
644
645       exclude_hv
646              If this bit is set, the count excludes events that happen in the
647              hypervisor.   This is mainly for PMUs that have built-in support
648              for handling this (such as POWER).  Extra support is needed  for
649              handling hypervisor measurements on most machines.
650
651       exclude_idle
652              If set, don't count when the CPU is idle.
653
654       mmap   The  mmap bit enables generation of PERF_RECORD_MMAP samples for
655              every mmap(2) call that has PROT_EXEC set.  This allows tools to
656              notice  new executable code being mapped into a program (dynamic
657              shared libraries for example) so that addresses  can  be  mapped
658              back to the original code.
659
660       comm   The  comm  bit enables tracking of process command name as modi‐
661              fied by the exec(2) and prctl(PR_SET_NAME) system calls as  well
662              as  writing  to  /proc/self/comm.  If the comm_exec flag is also
663              successfully set (possible since Linux 3.16), then the misc flag
664              PERF_RECORD_MISC_COMM_EXEC  can  be  used  to  differentiate the
665              exec(2) case from the others.
666
667       freq   If this bit is set, then sample_freq, not  sample_period,  is
668              used when setting up the sampling interval.
669
670       inherit_stat
671              This  bit  enables  saving of event counts on context switch for
672              inherited tasks.  This is meaningful only if the  inherit  field
673              is set.
674
675       enable_on_exec
676              If  this  bit is set, a counter is automatically enabled after a
677              call to exec(2).
678
679       task   If this bit is set, then fork/exit notifications are included in
680              the ring buffer.
681
682       watermark
683              If  set,  have an overflow notification happen when we cross the
684              wakeup_watermark boundary.   Otherwise,  overflow  notifications
685              happen after wakeup_events samples.
686
687       precise_ip (since Linux 2.6.35)
688              This controls the amount of skid.  Skid is how many instructions
689              execute between an event of interest happening  and  the  kernel
690              being able to stop and record the event.  Smaller skid is better
691              and allows more accurate reporting of which events correspond to
692              which instructions, but hardware is often limited with how small
693              this can be.
694
695              The possible values of this field are the following:
696
697              0  SAMPLE_IP can have arbitrary skid.
698
699              1  SAMPLE_IP must have constant skid.
700
701              2  SAMPLE_IP requested to have 0 skid.
702
703              3  SAMPLE_IP must have 0 skid.   See  also  the  description  of
704                 PERF_RECORD_MISC_EXACT_IP.
705
706       mmap_data (since Linux 2.6.36)
707              This is the counterpart of the mmap field.  This enables genera‐
708              tion of PERF_RECORD_MMAP samples for mmap(2) calls that  do  not
709              have PROT_EXEC set (for example data and SysV shared memory).
710
711       sample_id_all (since Linux 2.6.38)
712              If  set, then TID, TIME, ID, STREAM_ID, and CPU can additionally
713              be included in non-PERF_RECORD_SAMPLEs if the corresponding sam‐
714              ple_type is selected.
715
716              If  PERF_SAMPLE_IDENTIFIER  is  specified, then an additional ID
717              value is included as the last value to ease parsing  the  record
718              stream.  This may lead to the id value appearing twice.
719
720              The layout is described by this pseudo-structure:
721
722                  struct sample_id {
723                      { u32 pid, tid; }   /* if PERF_SAMPLE_TID set */
724                      { u64 time;     }   /* if PERF_SAMPLE_TIME set */
725                      { u64 id;       }   /* if PERF_SAMPLE_ID set */
726                      { u64 stream_id;}   /* if PERF_SAMPLE_STREAM_ID set  */
727                      { u32 cpu, res; }   /* if PERF_SAMPLE_CPU set */
728                      { u64 id;       }   /* if PERF_SAMPLE_IDENTIFIER set */
729                  };
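
              Because every member of the pseudo-structure occupies  8  bytes,
              the  size of the trailer can be derived from sample_type.  A mi‐
              nimal sketch; the helper name sample_id_size() is hypothetical:

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes occupied by the sample_id trailer. */
static size_t sample_id_size(uint64_t sample_type)
{
    static const uint64_t bits[] = {
        PERF_SAMPLE_TID, PERF_SAMPLE_TIME, PERF_SAMPLE_ID,
        PERF_SAMPLE_STREAM_ID, PERF_SAMPLE_CPU, PERF_SAMPLE_IDENTIFIER,
    };
    size_t n = 0;

    for (size_t i = 0; i < sizeof(bits) / sizeof(bits[0]); i++)
        if (sample_type & bits[i])
            n += 8;     /* each selected member is 8 bytes wide */
    return n;
}
```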
731
732       exclude_host (since Linux 3.2)
733              When  conducting  measurements that include processes running VM
734              instances (i.e., have executed a KVM_RUN ioctl(2)), only measure
735              events happening inside a guest instance.  This is only meaning‐
736              ful outside the guests; this  setting  does  not  change  counts
737              gathered  inside  of  a guest.  Currently, this functionality is
738              x86 only.
739
740       exclude_guest (since Linux 3.2)
741              When conducting measurements that include processes  running  VM
742              instances  (i.e., have executed a KVM_RUN ioctl(2)), do not mea‐
743              sure events happening inside  guest  instances.   This  is  only
744              meaningful  outside  the  guests;  this  setting does not change
745              counts gathered inside of a guest.  Currently, this  functional‐
746              ity is x86 only.
747
748       exclude_callchain_kernel (since Linux 3.7)
749              Do not include kernel callchains.
750
751       exclude_callchain_user (since Linux 3.7)
752              Do not include user callchains.
753
754       mmap2 (since Linux 3.16)
755              Generate an extended executable mmap record that contains enough
756              additional information to  uniquely  identify  shared  mappings.
757              The mmap flag must also be set for this to work.
758
759       comm_exec (since Linux 3.16)
760              This is purely a feature-detection flag; it does not change ker‐
761              nel behavior.  If this flag can successfully be set, then,  when
762              comm is enabled, the PERF_RECORD_MISC_COMM_EXEC flag will be set
763              in the misc field of a comm record header if  the  rename  event
764              being  reported  was  caused  by a call to exec(2).  This allows
765              tools to distinguish between the various types of process renam‐
766              ing.
767
768       use_clockid (since Linux 4.1)
769              This  allows  selecting  which  internal Linux clock to use when
770              generating timestamps via the clockid field.  This can  make  it
771              easier  to correlate perf sample times with timestamps generated
772              by other tools.
773
774       context_switch (since Linux 4.3)
775              This enables the generation of PERF_RECORD_SWITCH records when a
776              context  switch  occurs.   It  also  enables  the  generation of
777              PERF_RECORD_SWITCH_CPU_WIDE records when  sampling  in  CPU-wide
778              mode.   This functionality is in addition to existing tracepoint
779              and software events for measuring context switches.  The  advan‐
780              tage  of  this method is that it will give full information even
781              with strict perf_event_paranoid settings.
782
783       wakeup_events, wakeup_watermark
784              This union  sets  how  many  samples  (wakeup_events)  or  bytes
785              (wakeup_watermark)  happen  before an overflow notification hap‐
786              pens.  Which one is used is selected by the watermark bit flag.
787
788              wakeup_events counts only PERF_RECORD_SAMPLE record  types.   To
789              receive  overflow  notification for all PERF_RECORD types choose
790              watermark and set wakeup_watermark to 1.
791
792              Prior to Linux 3.0, setting wakeup_events to 0  resulted  in  no
793              overflow  notifications; more recent kernels treat 0 the same as
794              1.
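
              As a sketch of the "all record types" case described above,  the
              following sets the two fields involved.  The helper name is  hy‐
              pothetical; only the field names come from this page.

```c
#include <linux/perf_event.h>

/* Hypothetical helper: request an overflow notification as soon as
   any record (not only PERF_RECORD_SAMPLE) lands in the buffer. */
static void wake_on_every_record(struct perf_event_attr *attr)
{
    attr->watermark = 1;         /* interpret the union as bytes */
    attr->wakeup_watermark = 1;  /* notify after 1 byte of data */
}
```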
795
796       bp_type (since Linux 2.6.33)
797              This chooses the breakpoint type.  It is one of:
798
799              HW_BREAKPOINT_EMPTY
800                     No breakpoint.
801
802              HW_BREAKPOINT_R
803                     Count when we read the memory location.
804
805              HW_BREAKPOINT_W
806                     Count when we write the memory location.
807
808              HW_BREAKPOINT_RW
809                     Count when we read or write the memory location.
810
811              HW_BREAKPOINT_X
812                     Count when we execute code at the memory location.
813
814              The values can be combined via a bitwise or, but the combination
815              of  HW_BREAKPOINT_R  or  HW_BREAKPOINT_W with HW_BREAKPOINT_X is
816              not allowed.
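
              The combination rule can be checked before calling perf_event_open().
              A  minimal sketch, assuming only the rule stated above; the name
              bp_type_is_valid() is hypothetical:

```c
#include <linux/hw_breakpoint.h>
#include <stdbool.h>

/* Hypothetical helper: reject R and/or W combined with X. */
static bool bp_type_is_valid(unsigned int bp_type)
{
    bool rw = (bp_type & (HW_BREAKPOINT_R | HW_BREAKPOINT_W)) != 0;
    bool x  = (bp_type & HW_BREAKPOINT_X) != 0;

    return !(rw && x);
}
```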
817
818       bp_addr (since Linux 2.6.33)
819              This is the address of the  breakpoint.   For  execution  break‐
820              points,  this is the memory address of the instruction of inter‐
821              est; for read and write breakpoints, it is the memory address of
822              the memory location of interest.
823
824       config1 (since Linux 2.6.39)
825              config1  is  used for setting events that need an extra register
826              or otherwise do not fit in the regular config field.   Raw  OFF‐
827              CORE_EVENTS  on  Nehalem/Westmere/SandyBridge  use this field on
828              Linux 3.3 and later kernels.
829
830       bp_len (since Linux 2.6.33)
831              bp_len is the length of the breakpoint being measured if type is
832              PERF_TYPE_BREAKPOINT.     Options    are    HW_BREAKPOINT_LEN_1,
833              HW_BREAKPOINT_LEN_2,    HW_BREAKPOINT_LEN_4,    and    HW_BREAK‐
834              POINT_LEN_8.    For   an   execution  breakpoint,  set  this  to
835              sizeof(long).
836
837       config2 (since Linux 2.6.39)
838              config2 is a further extension of the config1 field.
839
840       branch_sample_type (since Linux 3.4)
841              If PERF_SAMPLE_BRANCH_STACK is enabled, then this specifies what
842              branches to include in the branch record.
843
844              The  first  part of the value is the privilege level, which is a
845              combination of one or more of the values listed below.   If  the
846              user does not set the privilege level explicitly, the kernel uses
847              the event's privilege level.  Event and branch privilege  levels
848              do not have to match.
849
850              PERF_SAMPLE_BRANCH_USER
851                     Branch target is in user space.
852
853              PERF_SAMPLE_BRANCH_KERNEL
854                     Branch target is in kernel space.
855
856              PERF_SAMPLE_BRANCH_HV
857                     Branch target is in hypervisor.
858
859              PERF_SAMPLE_BRANCH_PLM_ALL
860                     A  convenience  value  that is the three preceding values
861                     ORed together.
862
863              In addition to the privilege value, at least one of the following
864              bits must be set.
865
866              PERF_SAMPLE_BRANCH_ANY
867                     Any branch type.
868
869              PERF_SAMPLE_BRANCH_ANY_CALL
870                     Any  call  branch (includes direct calls, indirect calls,
871                     and far jumps).
872
873              PERF_SAMPLE_BRANCH_IND_CALL
874                     Indirect calls.
875
876              PERF_SAMPLE_BRANCH_CALL (since Linux 4.4)
877                     Direct calls.
878
879              PERF_SAMPLE_BRANCH_ANY_RETURN
880                     Any return branch.
881
882              PERF_SAMPLE_BRANCH_IND_JUMP (since Linux 4.2)
883                     Indirect jumps.
884
885              PERF_SAMPLE_BRANCH_COND (since Linux 3.16)
886                     Conditional branches.
887
888              PERF_SAMPLE_BRANCH_ABORT_TX (since Linux 3.11)
889                     Transactional memory aborts.
890
891              PERF_SAMPLE_BRANCH_IN_TX (since Linux 3.11)
892                     Branch in transactional memory transaction.
893
894              PERF_SAMPLE_BRANCH_NO_TX (since Linux 3.11)
895                     Branch not in transactional memory transaction.
896
897              PERF_SAMPLE_BRANCH_CALL_STACK (since Linux 4.1)
898                     Branch is part of a hardware-generated call stack.  This requires
899                     hardware support, currently found only on Intel x86 Haswell or newer.
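
              A mask that contains only privilege bits does not satisfy the re‐
              quirement  above.   The  following  sketch checks for that case,
              treating every bit outside PERF_SAMPLE_BRANCH_PLM_ALL as one  of
              the  bits  from the second list; the helper name is hypothetical.

```c
#include <linux/perf_event.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the mask sets at least one bit
   besides the privilege-level bits. */
static bool branch_mask_names_a_type(uint64_t mask)
{
    return (mask & ~(uint64_t)PERF_SAMPLE_BRANCH_PLM_ALL) != 0;
}
```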
900
901       sample_regs_user (since Linux 3.7)
902              This bit mask defines the set of user CPU registers to  dump  on
903              samples.   The  layout of the register mask is architecture-spe‐
904              cific   and   is   described   in   the   kernel   header   file
905              arch/ARCH/include/uapi/asm/perf_regs.h.
906
907       sample_stack_user (since Linux 3.7)
908              This  defines  the  size  of the user stack to dump if PERF_SAM‐
909              PLE_STACK_USER is specified.
910
911       clockid (since Linux 4.1)
912              If use_clockid is set, then this field  selects  which  internal
913              Linux  timer  to  use  for timestamps.  The available timers are
914              defined  in  linux/time.h,  with  CLOCK_MONOTONIC,   CLOCK_MONO‐
915              TONIC_RAW,  CLOCK_REALTIME,  CLOCK_BOOTTIME,  and CLOCK_TAI cur‐
916              rently supported.
917
918       aux_watermark (since Linux 4.1)
919              This  specifies  how  much  data  is  required  to   trigger   a
920              PERF_RECORD_AUX sample.
921
922       sample_max_stack (since Linux 4.8)
923              When  sample_type  includes  PERF_SAMPLE_CALLCHAIN,  this  field
924              specifies how many stack frames to report  when  generating  the
925              callchain.
926
927   Reading results
928       Once a perf_event_open() file descriptor has been opened, the values of
929       the events can be read from the file descriptor.  The values  that  are
930       there  are  specified by the read_format field in the attr structure at
931       open time.
932
933       If you attempt to read into a buffer that is not big enough to hold the
934       data, the error ENOSPC results.
935
936       Here is the layout of the data returned by a read:
937
938       * If  PERF_FORMAT_GROUP  was specified to allow reading all events in a
939         group at once:
940
941             struct read_format {
942                 u64 nr;            /* The number of events */
943                 u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
944                 u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
945                 struct {
946                     u64 value;     /* The value of the event */
947                     u64 id;        /* if PERF_FORMAT_ID */
948                 } values[nr];
949             };
950
951       * If PERF_FORMAT_GROUP was not specified:
952
953             struct read_format {
954                 u64 value;         /* The value of the event */
955                 u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
956                 u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
957                 u64 id;            /* if PERF_FORMAT_ID */
958             };
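
       Since the optional fields shift the position of everything after  them,
       a  reader has to walk the buffer according to read_format.  The follow‐
       ing sketch does this for the PERF_FORMAT_GROUP layout exactly as  shown
       above  (it  assumes no fields other than those listed here); the helper
       name group_value_by_id() is hypothetical.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: look up one event's value in a buffer
   filled by read(2) on a group leader.  Returns 0 on success. */
static int group_value_by_id(const uint64_t *buf, uint64_t read_format,
                             uint64_t id, uint64_t *value)
{
    size_t i = 0;
    uint64_t nr = buf[i++];                 /* number of events */

    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
        i++;                                /* skip time_enabled */
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
        i++;                                /* skip time_running */
    if (!(read_format & PERF_FORMAT_ID))
        return -1;                          /* no ids to match on */

    for (uint64_t n = 0; n < nr; n++, i += 2) {
        if (buf[i + 1] == id) {             /* { value, id } pairs */
            *value = buf[i];
            return 0;
        }
    }
    return -1;
}
```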
959
960       The values read are as follows:
961
962       nr     The number of events in this file descriptor.  Available only if
963              PERF_FORMAT_GROUP was specified.
964
965       time_enabled, time_running
966              Total  time  the  event was enabled and running.  Normally these
967              values are the same.  If more events are started than there  are
968              available counter slots on the PMU, then multiplexing happens and
969              events run only part of the time.  In that case, the time_enabled
970              and time_running values can be used to scale an estimated  value
971              for the count.
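
       The scaling mentioned under time_enabled and time_running can be sketched
       as  below.  This is one possible estimate, not the only one; the helper
       name scale_count() and the choice to return 0 when the event never  ran
       are assumptions of this example.

```c
#include <stdint.h>

/* Hypothetical helper: estimate the full count of a multiplexed
   event as value * enabled / running, using a quotient/remainder
   split so that the intermediate product is less likely to overflow. */
static uint64_t scale_count(uint64_t value, uint64_t enabled,
                            uint64_t running)
{
    if (running == 0)
        return 0;               /* event never ran: nothing to scale */
    if (running >= enabled)
        return value;           /* no multiplexing took place */

    uint64_t quot = value / running;
    uint64_t rem  = value % running;

    return quot * enabled + (rem * enabled) / running;
}
```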
972
973       value  An unsigned 64-bit value containing the counter result.
974
975       id     A globally unique value for this particular event; only  present
976              if PERF_FORMAT_ID was specified in read_format.
977
978   MMAP layout
979       When using perf_event_open() in sampled mode, asynchronous events (like
980       counter overflow or PROT_EXEC mmap tracking) are logged  into  a  ring-
981       buffer.  This ring-buffer is created and accessed through mmap(2).
982
983       The mmap size should be 1+2^n pages, where the first page is a metadata
984       page (struct perf_event_mmap_page) that contains various bits of infor‐
985       mation such as where the ring-buffer head is.
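
       The length to pass to mmap(2) follows directly from the  1+2^n  rule.  A
       minimal  sketch;  the helper name perf_mmap_len() is hypothetical, and the
       page size is assumed to be supplied by the caller:

```c
#include <stddef.h>

/* Hypothetical helper: one metadata page plus 2^n data pages. */
static size_t perf_mmap_len(unsigned int n, size_t page_size)
{
    return ((size_t)1 + ((size_t)1 << n)) * page_size;
}
```

       The page size would typically  come  from  sysconf(_SC_PAGESIZE).   For
       example, n = 3 with 4096-byte pages gives 9 pages, or 36864 bytes.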
986
987       Before Linux 2.6.39, a kernel bug made it necessary  to  allocate  an
988       mmap ring buffer when sampling even if you do not plan to access it.
989
990       The structure of the first metadata mmap page is as follows:
991
992           struct perf_event_mmap_page {
993               __u32 version;        /* version number of this structure */
994               __u32 compat_version; /* lowest version this is compat with */
995               __u32 lock;           /* seqlock for synchronization */
996               __u32 index;          /* hardware counter identifier */
997               __s64 offset;         /* add to hardware counter value */
998               __u64 time_enabled;   /* time event active */
999               __u64 time_running;   /* time event on CPU */
1000               union {
1001                   __u64   capabilities;
1002                   struct {
1003                       __u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
1004                             cap_bit0_is_deprecated : 1,
1005                             cap_user_rdpmc         : 1,
1006                             cap_user_time          : 1,
1007                             cap_user_time_zero     : 1,
1008                   };
1009               };
1010               __u16 pmc_width;
1011               __u16 time_shift;
1012               __u32 time_mult;
1013               __u64 time_offset;
1014               __u64 __reserved[120];   /* Pad to 1 k */
1015               __u64 data_head;         /* head in the data section */
1016               __u64 data_tail;         /* user-space written tail */
1017               __u64 data_offset;       /* where the buffer starts */
1018               __u64 data_size;         /* data buffer size */
1019               __u64 aux_head;
1020               __u64 aux_tail;
1021               __u64 aux_offset;
1022               __u64 aux_size;
1023
1024           };
1025
1026       The following list describes the  fields  in  the  perf_event_mmap_page
1027       structure in more detail:
1028
1029       version
1030              Version number of this structure.
1031
1032       compat_version
1033              The lowest version this is compatible with.
1034
1035       lock   A seqlock for synchronization.
1036
1037       index  A unique hardware counter identifier.
1038
1039       offset When  using  rdpmc  for reads this offset value must be added to
1040              the one returned by rdpmc to get the current total event count.
1041
1042       time_enabled
1043              Time the event was active.
1044
1045       time_running
1046              Time the event was running.
1047
1048       cap_usr_time / cap_usr_rdpmc / cap_bit0 (since Linux 3.4)
1049              There  was  a  bug  in  the  definition  of   cap_usr_time   and
1050              cap_usr_rdpmc  from  Linux 3.4 until Linux 3.11.  Both bits were
1051              defined to point to the same location, so it was  impossible  to
1052              know if cap_usr_time or cap_usr_rdpmc were actually set.
1053
1054              Starting  with Linux 3.12, these are renamed to cap_bit0 and you
1055              should use the cap_user_time and cap_user_rdpmc fields instead.
1056
1057       cap_bit0_is_deprecated (since Linux 3.12)
1058              If set, this bit indicates that the kernel supports the properly
1059              separated cap_user_time and cap_user_rdpmc bits.
1060
1061              If  not-set, it indicates an older kernel where cap_usr_time and
1062              cap_usr_rdpmc map to the same bit and thus both features  should
1063              be used with caution.
1064
1065       cap_user_rdpmc (since Linux 3.12)
1066              If the hardware supports user-space read of performance counters
1067              without syscall (this is the "rdpmc" instruction on  x86),  then
1068              the following code can be used to do a read:
1069
1070                  u32 seq, time_mult, time_shift, idx, width;
1071                  u64 count, enabled, running;
1072                  u64 cyc, time_offset;
1073
1074                  do {
1075                      seq = pc->lock;            /* snapshot the seqlock */
1076                      barrier();
1077                      enabled = pc->time_enabled;
1078                      running = pc->time_running;
1079
1080                      if (pc->cap_user_time && enabled != running) {
1081                          cyc = rdtsc();         /* inputs for the time delta */
1082                          time_offset = pc->time_offset;
1083                          time_mult   = pc->time_mult;
1084                          time_shift  = pc->time_shift;
1085                      }
1086
1087                      idx = pc->index;           /* 0: no counter assigned */
1088                      count = pc->offset;
1089
1090                      if (pc->cap_user_rdpmc && idx) {
1091                          width = pc->pmc_width; /* see pmc_width below */
1092                          count += rdpmc(idx - 1);
1093                      }
1094
1095                      barrier();
1096                  } while (pc->lock != seq);     /* retry if page changed */
1097
1098       cap_user_time (since Linux 3.12)
1099              This  bit  indicates  the hardware has a constant, nonstop time‐
1100              stamp counter (TSC on x86).
1101
1102       cap_user_time_zero (since Linux 3.12)
1103              Indicates the presence of time_zero which allows  mapping  time‐
1104              stamp values to the hardware clock.
1105
1106       pmc_width
1107              If cap_user_rdpmc is set, this field provides the  bit-width  of
1108              the value read using the rdpmc or equivalent instruction.   This
1109              can be used to sign-extend the result like:
1110
1111                  pmc <<= 64 - pmc_width;
1112                  pmc >>= 64 - pmc_width; // signed shift right
1113                  count += pmc;
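
              The shift pair above can be wrapped in a small function.  This is
              a  sketch with a hypothetical name; it assumes pmc_width is be‐
              tween 1 and 64 and that the compiler implements right  shift  of
              a negative signed value as an arithmetic shift (as gcc and clang
              do).

```c
#include <stdint.h>

/* Hypothetical helper: sign-extend a raw counter value that is
   pmc_width bits wide to a full 64-bit signed value. */
static int64_t sign_extend_pmc(uint64_t raw, unsigned int pmc_width)
{
    unsigned int shift = 64 - pmc_width;

    /* Move the counter's top bit into bit 63, then shift back
       with sign propagation. */
    return (int64_t)(raw << shift) >> shift;
}
```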
1114
1115       time_shift, time_mult, time_offset
1116
1117              If cap_user_time is set, these fields can be used to compute the
1118              time  delta  since  time_enabled  (in nanoseconds) using rdtsc or
1119              similar.
1120
1121                  u64 quot, rem;
1122                  u64 delta;
1123                  quot = (cyc >> time_shift);
1124                  rem = cyc & (((u64)1 << time_shift) - 1);
1125                  delta = time_offset + quot * time_mult +
1126                          ((rem * time_mult) >> time_shift);
1127
1128              Where  time_offset,  time_mult,  time_shift, and cyc are read in
1129              the seqcount loop described above.  This delta can then be added
1130              to enabled and possibly running (if idx), improving the scaling:
1131
1132                  enabled += delta;
1133                  if (idx)
1134                      running += delta;
1135                  quot = count / running;
1136                  rem  = count % running;
1137                  count = quot * enabled + (rem * enabled) / running;
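
              The delta computation splits cyc into a  quotient  and  remainder
              so  that  the multiplication by time_mult is less likely to over‐
              flow 64 bits.  A self-contained sketch of that  step,  with  the
              values  passed  in  as  parameters; the function name is hypothe‐
              tical:

```c
#include <stdint.h>

/* Hypothetical helper: convert a cycle count to the nanosecond
   delta, using the multiply-and-shift scheme shown above. */
static uint64_t cyc_to_delta_ns(uint64_t cyc, uint64_t time_offset,
                                uint32_t time_mult, uint16_t time_shift)
{
    uint64_t quot = cyc >> time_shift;
    uint64_t rem  = cyc & (((uint64_t)1 << time_shift) - 1);

    return time_offset + quot * time_mult +
           ((rem * time_mult) >> time_shift);
}
```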
1138
1139       time_zero (since Linux 3.12)
1140
1141              If cap_user_time_zero is set, then the hardware clock  (the  TSC
1142              timestamp counter on x86) can be calculated from the  time_zero,
1143              time_mult, and time_shift values:
1144
1145                  time = timestamp - time_zero;
1146                  quot = time / time_mult;
1147                  rem  = time % time_mult;
1148                  cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
1149
1150              And vice versa:
1151
1152                  quot = cyc >> time_shift;
1153                  rem  = cyc & (((u64)1 << time_shift) - 1);
1154                  timestamp = time_zero + quot * time_mult +
1155                      ((rem * time_mult) >> time_shift);
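
              The first conversion (timestamp to cycles) can be written  as  a
              function as sketched below.  The name timestamp_to_cyc() is  hy‐
              pothetical, and the parameters are assumed to have been read con‐
              sistently from the metadata page.

```c
#include <stdint.h>

/* Hypothetical helper: map a perf timestamp (ns) back to a
   hardware clock value, following the formula above. */
static uint64_t timestamp_to_cyc(uint64_t timestamp, uint64_t time_zero,
                                 uint32_t time_mult, uint16_t time_shift)
{
    uint64_t time = timestamp - time_zero;
    uint64_t quot = time / time_mult;
    uint64_t rem  = time % time_mult;

    return (quot << time_shift) + (rem << time_shift) / time_mult;
}
```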
1156
1157       data_head
1158              This points to the head of  the  data  section.   The  value
1159              continuously increases; it does not wrap.  The value needs to be
1160              manually wrapped by the size of the mmap buffer before accessing
1161              the samples.
1162
1163              On SMP-capable platforms, after  reading  the  data_head  value,
1164              user space should issue an rmb().
1165
1166       data_tail
1167              When  the  mapping  is PROT_WRITE, the data_tail value should be
1168              written by user space to reflect the last read  data.   In  this
1169              case, the kernel will not overwrite unread data.
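
       Because data_head and data_tail never wrap, a reader has to reduce them
       modulo  the  buffer  size and cope with records that straddle the end of
       the buffer.  The sketch below shows only that copying step on  a  plain
       byte  array;  the function name ring_read() is hypothetical, the buffer
       size is assumed to be a power of two, and len is assumed not to  exceed
       it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy len bytes that start at the
   (unwrapped) position pos out of a power-of-two sized ring. */
static void ring_read(void *dst, const uint8_t *ring, uint64_t ring_size,
                      uint64_t pos, size_t len)
{
    uint64_t off = pos & (ring_size - 1);   /* manual wrap */
    size_t first = len;

    if (off + len > ring_size)              /* record straddles the end */
        first = (size_t)(ring_size - off);

    memcpy(dst, ring + off, first);
    memcpy((uint8_t *)dst + first, ring, len - first);
}
```

       In a real consumer, data_head would be loaded before reading  (followed
       by  the  read  barrier  described above) and data_tail would be advanced
       only after the copied records have been fully processed.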
1170
1171       data_offset (since Linux 4.1)
1172              Contains  the  offset  of  the location in the mmap buffer where
1173              perf sample data begins.
1174
1175       data_size (since Linux 4.1)
1176              Contains the size of the perf sample region within the mmap buf‐
1177              fer.
1178
1179       aux_head, aux_tail, aux_offset, aux_size (since Linux 4.1)
1180              The AUX region allows mmapping a separate sample buffer for high-
1181              bandwidth data streams (separate from the main perf sample  buf‐
1182              fer).   An  example  of  a  high-bandwidth stream is instruction
1183              tracing support, as is found in newer Intel processors.
1184
1185              To set up an AUX area, first aux_offset needs to be set with  an
1186              offset  greater than data_offset+data_size and aux_size needs to
1187              be set to the desired buffer size.  The desired offset and  size
1188              must  be  page  aligned,  and  the  size must be a power of two.
1189              These values are then passed to mmap in order  to  map  the  AUX
1190              buffer.   Pages  in  the  AUX buffer are included as part of the
1191              RLIMIT_MEMLOCK resource limit (see setrlimit(2)),  and  also  as
1192              part of the perf_event_mlock_kb allowance.
1193
1194              By  default, the AUX buffer will be truncated if it will not fit
1195              in the available space in the ring buffer.  If the AUX buffer is
1196              mapped  as a read only buffer, then it will operate in ring buf‐
1197              fer mode where old data will be overwritten by  new.   In  over‐
1198              write mode, it might not be possible to infer where the new data
1199              began, and it is the consumer's job to disable measurement while
1200              reading to avoid possible data races.
1201
1202              The  aux_head  and  aux_tail  ring buffer pointers have the same
1203              behavior and ordering rules as the previously described data_head
1204              and data_tail.
1205
1206       The following 2^n ring-buffer pages have the layout described below.
1207
1208       If perf_event_attr.sample_id_all is set, then all event types will have
1209       the sample_type selected fields related  to  where/when  (identity)  an
1210       event  took  place  (TID,  TIME,  ID,  CPU,  STREAM_ID), as described in
1211       PERF_RECORD_SAMPLE below.  These fields are  stashed  just  after  the
1212       perf_event_header and the fields already present for the record type,
1213       that is, at the end of the payload.  This  allows  a  newer  perf.data
1214       file  to  be  supported  by  older  perf tools, with the new optional
1215       fields being ignored.
1216
1217       The mmap values start with a header:
1218
1219           struct perf_event_header {
1220               __u32   type;
1221               __u16   misc;
1222               __u16   size;
1223           };
1224
1225       Below, we describe the perf_event_header fields in  more  detail.   For
1226       ease  of  reading,  the  fields with shorter descriptions are presented
1227       first.
1228
1229       size   This indicates the size of the record.
1230
1231       misc   The misc field contains additional information about the sample.
1232
1233              The CPU mode can be determined from this value by  masking  with
1234              PERF_RECORD_MISC_CPUMODE_MASK and looking for one of the follow‐
1235              ing (note these are not bit masks, only one  can  be  set  at  a
1236              time):
1237
1238              PERF_RECORD_MISC_CPUMODE_UNKNOWN
1239                     Unknown CPU mode.
1240
1241              PERF_RECORD_MISC_KERNEL
1242                     Sample happened in the kernel.
1243
1244              PERF_RECORD_MISC_USER
1245                     Sample happened in user code.
1246
1247              PERF_RECORD_MISC_HYPERVISOR
1248                     Sample happened in the hypervisor.
1249
1250              PERF_RECORD_MISC_GUEST_KERNEL (since Linux 2.6.35)
1251                     Sample happened in the guest kernel.
1252
1253              PERF_RECORD_MISC_GUEST_USER  (since Linux 2.6.35)
1254                     Sample happened in guest user code.
1255
1256              Since  the  following  three statuses are generated by different
1257              record types, they alias to the same bit:
1258
1259              PERF_RECORD_MISC_MMAP_DATA (since Linux 3.10)
1260                     This is set when the mapping is not executable; otherwise
1261                     the mapping is executable.
1262
1263              PERF_RECORD_MISC_COMM_EXEC (since Linux 3.16)
1264                     This is set for a PERF_RECORD_COMM record on kernels more
1265                     recent than Linux 3.16  if  a  process  name  change  was
1266                     caused by an exec(2) system call.
1267
1268              PERF_RECORD_MISC_SWITCH_OUT (since Linux 4.3)
1269                     When  a PERF_RECORD_SWITCH or PERF_RECORD_SWITCH_CPU_WIDE
1270                     record is generated, this bit indicates that the  context
1271                     switch  is away from the current process (instead of into
1272                     the current process).
1273
1274              In addition, the following bits can be set:
1275
1276              PERF_RECORD_MISC_EXACT_IP
1277                     This indicates that the content of PERF_SAMPLE_IP  points
1278                     to  the actual instruction that triggered the event.  See
1279                     also perf_event_attr.precise_ip.
1280
1281              PERF_RECORD_MISC_EXT_RESERVED (since Linux 2.6.35)
1282                     This indicates there is  extended  data  available  (cur‐
1283                     rently not used).
1284
1285              PERF_RECORD_MISC_PROC_MAP_PARSE_TIMEOUT
1286                     This  bit  is  not set by the kernel.  It is reserved for
1287                     the   user-space   perf   utility   to   indicate    that
1288                     /proc/[pid]/maps  parsing  was  taking  too  long and was
1289                     stopped, and thus the mmap records may be truncated.
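
                  The checks described for misc might be written as follows.
                  This is only a sketch; the helper names cpumode_name() and
                  sample_ip_is_exact() are made up for illustration.  The CPU
                  mode is compared as a value after masking, while the
                  additional bits are tested individually.

```c
#include <linux/perf_event.h>
#include <stdbool.h>
#include <stdint.h>

/* Map the CPU-mode part of header.misc to a short label. */
static const char *cpumode_name(uint16_t misc)
{
    switch (misc & PERF_RECORD_MISC_CPUMODE_MASK) {
    case PERF_RECORD_MISC_KERNEL:       return "kernel";
    case PERF_RECORD_MISC_USER:         return "user";
    case PERF_RECORD_MISC_HYPERVISOR:   return "hypervisor";
    case PERF_RECORD_MISC_GUEST_KERNEL: return "guest-kernel";
    case PERF_RECORD_MISC_GUEST_USER:   return "guest-user";
    default:                            return "unknown";
    }
}

/* Flag bits lie outside the CPU-mode mask and are tested as bits. */
static bool sample_ip_is_exact(uint16_t misc)
{
    return (misc & PERF_RECORD_MISC_EXACT_IP) != 0;
}
```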
1290
1291       type   The type value is one of the below.  The values  in  the  corre‐
1292              sponding  record  (that  follows  the header) depend on the type
1293              selected as shown.
1294
1295              PERF_RECORD_MMAP
1296                  The MMAP events record the PROT_EXEC mappings so that we can
1297                  correlate  user-space  IPs to code.  They have the following
1298                  structure:
1299
1300                      struct {
1301                          struct perf_event_header header;
1302                          u32    pid, tid;
1303                          u64    addr;
1304                          u64    len;
1305                          u64    pgoff;
1306                          char   filename[];
1307                      };
1308
1309                  pid    is the process ID.
1310
1311                  tid    is the thread ID.
1312
1313                  addr   is the address of the allocated memory.  len  is  the
1314                         length  of  the  allocated memory.  pgoff is the page
1315                         offset of the allocated memory.  filename is a string
1316                         describing the backing of the allocated memory.
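
                      As an illustration of correlating an instruction
                      pointer with a mapping, the sketch below checks whether
                      an address lies inside a PERF_RECORD_MMAP region and
                      converts it to a file offset.  The struct and function
                      names are hypothetical, and treating pgoff as a byte
                      offset into the backing file is an assumption about
                      kernel behavior that is not stated in this text.

```c
#include <linux/perf_event.h>
#include <stdbool.h>
#include <stdint.h>

/* Layout of a PERF_RECORD_MMAP record; the struct name is hypothetical. */
struct mmap_record {
    struct perf_event_header header;
    uint32_t pid, tid;
    uint64_t addr;
    uint64_t len;
    uint64_t pgoff;
    char     filename[];
};

/* True if ip falls inside the mapping described by the record. */
static bool mmap_contains(const struct mmap_record *m, uint64_t ip)
{
    return ip >= m->addr && ip - m->addr < m->len;
}

/* Translate ip to an offset in the backing file (assumes byte pgoff). */
static uint64_t mmap_file_offset(const struct mmap_record *m, uint64_t ip)
{
    return ip - m->addr + m->pgoff;
}
```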
1317
1318              PERF_RECORD_LOST
1319                  This record indicates when events are lost.
1320
1321                      struct {
1322                          struct perf_event_header header;
1323                          u64    id;
1324                          u64    lost;
1325                          struct sample_id sample_id;
1326                      };
1327
1328                  id     is  the  unique  event  ID  for the samples that were
1329                         lost.
1330
1331                  lost   is the number of events that were lost.
1332
1333              PERF_RECORD_COMM
1334                  This record indicates a change in the process name.
1335
1336                      struct {
1337                          struct perf_event_header header;
1338                          u32    pid;
1339                          u32    tid;
1340                          char   comm[];
1341                          struct sample_id sample_id;
1342                      };
1343
1344                  pid    is the process ID.
1345
1346                  tid    is the thread ID.
1347
1348                  comm   is a string containing the new name of the process.
1349
1350              PERF_RECORD_EXIT
1351                  This record indicates a process exit event.
1352
1353                      struct {
1354                          struct perf_event_header header;
1355                          u32    pid, ppid;
1356                          u32    tid, ptid;
1357                          u64    time;
1358                          struct sample_id sample_id;
1359                      };
1360
1361              PERF_RECORD_THROTTLE, PERF_RECORD_UNTHROTTLE
1362                  This record indicates a throttle/unthrottle event.
1363
1364                      struct {
1365                          struct perf_event_header header;
1366                          u64    time;
1367                          u64    id;
1368                          u64    stream_id;
1369                          struct sample_id sample_id;
1370                      };
1371
1372              PERF_RECORD_FORK
1373                  This record indicates a fork event.
1374
1375                      struct {
1376                          struct perf_event_header header;
1377                          u32    pid, ppid;
1378                          u32    tid, ptid;
1379                          u64    time;
1380                          struct sample_id sample_id;
1381                      };
1382
1383              PERF_RECORD_READ
1384                  This record indicates a read event.
1385
1386                      struct {
1387                          struct perf_event_header header;
1388                          u32    pid, tid;
1389                          struct read_format values;
1390                          struct sample_id sample_id;
1391                      };
1392
1393              PERF_RECORD_SAMPLE
1394                  This record indicates a sample.
1395
1396                      struct {
1397                          struct perf_event_header header;
1398                          u64    sample_id;   /* if PERF_SAMPLE_IDENTIFIER */
1399                          u64    ip;          /* if PERF_SAMPLE_IP */
1400                          u32    pid, tid;    /* if PERF_SAMPLE_TID */
1401                          u64    time;        /* if PERF_SAMPLE_TIME */
1402                          u64    addr;        /* if PERF_SAMPLE_ADDR */
1403                          u64    id;          /* if PERF_SAMPLE_ID */
1404                          u64    stream_id;   /* if PERF_SAMPLE_STREAM_ID */
1405                          u32    cpu, res;    /* if PERF_SAMPLE_CPU */
1406                          u64    period;      /* if PERF_SAMPLE_PERIOD */
1407                          struct read_format v;
1408                                              /* if PERF_SAMPLE_READ */
1409                          u64    nr;          /* if PERF_SAMPLE_CALLCHAIN */
1410                          u64    ips[nr];     /* if PERF_SAMPLE_CALLCHAIN */
1411                          u32    size;        /* if PERF_SAMPLE_RAW */
1412                          char  data[size];   /* if PERF_SAMPLE_RAW */
1413                          u64    bnr;         /* if PERF_SAMPLE_BRANCH_STACK */
1414                          struct perf_branch_entry lbr[bnr];
1415                                              /* if PERF_SAMPLE_BRANCH_STACK */
1416                          u64    abi;         /* if PERF_SAMPLE_REGS_USER */
1417                          u64    regs[weight(mask)];
1418                                              /* if PERF_SAMPLE_REGS_USER */
1419                          u64    size;        /* if PERF_SAMPLE_STACK_USER */
1420                          char   data[size];  /* if PERF_SAMPLE_STACK_USER */
1421                          u64    dyn_size;    /* if PERF_SAMPLE_STACK_USER &&
1422                                                 size != 0 */
1423                          u64    weight;      /* if PERF_SAMPLE_WEIGHT */
1424                          u64    data_src;    /* if PERF_SAMPLE_DATA_SRC */
1425                          u64    transaction; /* if PERF_SAMPLE_TRANSACTION */
1426                          u64    abi;         /* if PERF_SAMPLE_REGS_INTR */
1427                          u64    regs[weight(mask)];
1428                                              /* if PERF_SAMPLE_REGS_INTR */
1429                      };
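
                      Because each field is present only when its sample_type
                      bit was requested, a reader has to walk the record in
                      the order shown.  The sketch below decodes just the
                      fixed-size leading fields (up to period); the struct
                      sample_prefix and the helper names are hypothetical,
                      and bounds checking against header.size is left to the
                      caller.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the fixed-size fields at the start of a sample. */
struct sample_prefix {
    uint64_t identifier, ip, time, addr, id, stream_id, period;
    uint32_t pid, tid, cpu;
};

/* Copy out a value and advance the cursor; memcpy avoids unaligned loads. */
static uint64_t take_u64(const unsigned char **p)
{
    uint64_t v;
    memcpy(&v, *p, sizeof(v));
    *p += sizeof(v);
    return v;
}

static uint32_t take_u32(const unsigned char **p)
{
    uint32_t v;
    memcpy(&v, *p, sizeof(v));
    *p += sizeof(v);
    return v;
}

/*
 * body points just past the perf_event_header; sample_type is the value
 * that was set in perf_event_attr.  Returns the first undecoded byte.
 */
static const unsigned char *
parse_sample_prefix(const unsigned char *body, uint64_t sample_type,
                    struct sample_prefix *out)
{
    const unsigned char *p = body;

    memset(out, 0, sizeof(*out));
    if (sample_type & PERF_SAMPLE_IDENTIFIER) out->identifier = take_u64(&p);
    if (sample_type & PERF_SAMPLE_IP)         out->ip = take_u64(&p);
    if (sample_type & PERF_SAMPLE_TID) {
        out->pid = take_u32(&p);
        out->tid = take_u32(&p);
    }
    if (sample_type & PERF_SAMPLE_TIME)       out->time = take_u64(&p);
    if (sample_type & PERF_SAMPLE_ADDR)       out->addr = take_u64(&p);
    if (sample_type & PERF_SAMPLE_ID)         out->id = take_u64(&p);
    if (sample_type & PERF_SAMPLE_STREAM_ID)  out->stream_id = take_u64(&p);
    if (sample_type & PERF_SAMPLE_CPU) {
        out->cpu = take_u32(&p);
        (void)take_u32(&p);                   /* skip the reserved word */
    }
    if (sample_type & PERF_SAMPLE_PERIOD)     out->period = take_u64(&p);
    return p;
}
```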
1430
1431                  sample_id
1432                      If PERF_SAMPLE_IDENTIFIER is enabled, a 64-bit unique ID
1433                      is  included.   This  is  a duplication of the PERF_SAM‐
1434                      PLE_ID id value, but included at the  beginning  of  the
1435                      sample so parsers can easily obtain the value.
1436
1437                  ip  If  PERF_SAMPLE_IP is enabled, then a 64-bit instruction
1438                      pointer value is included.
1439
1440                  pid, tid
1441                      If PERF_SAMPLE_TID is enabled, then a 32-bit process  ID
1442                      and 32-bit thread ID are included.
1443
1444                  time
1445                      If  PERF_SAMPLE_TIME is enabled, then a 64-bit timestamp
1446                      is included.  This is obtained via  local_clock()  which
1447                      is  a  hardware  timestamp  if available and the jiffies
1448                      value if not.
1449
1450                  addr
1451                      If PERF_SAMPLE_ADDR is enabled, then a 64-bit address is
1452                      included.   This is usually the address of a tracepoint,
1453                      breakpoint, or software event; otherwise the value is 0.
1454
1455                  id  If PERF_SAMPLE_ID is enabled,  a  64-bit  unique  ID  is
1456                      included.   If  the event is a member of an event group,
1457                      the group leader ID is returned.  This ID is the same as
1458                      the one returned by PERF_FORMAT_ID.
1459
1460                  stream_id
1461                      If  PERF_SAMPLE_STREAM_ID is enabled, a 64-bit unique ID
1462                      is included.  Unlike PERF_SAMPLE_ID, the event's own  ID
1463                      is returned, not the group leader's.  This ID is the same
1464                      as the one returned by PERF_FORMAT_ID.
1465
1466                  cpu, res
1467                      If PERF_SAMPLE_CPU is enabled, this is  a  32-bit  value
1468                      indicating  which  CPU  was being used, in addition to a
1469                      reserved (unused) 32-bit value.
1470
1471                  period
1472                      If PERF_SAMPLE_PERIOD is enabled, a 64-bit  value  indi‐
1473                      cating the current sampling period is written.
1474
1475                  v   If  PERF_SAMPLE_READ  is  enabled,  a  structure of type
1476                      read_format is included which has values for all  events
1477                      in  the  event group.  The values included depend on the
1478                      read_format value used at perf_event_open() time.
1479
1480                  nr, ips[nr]
1481                      If PERF_SAMPLE_CALLCHAIN is enabled, then a 64-bit  num‐
1482                      ber  is  included  which  indicates   how   many   64-bit
1483                      instruction pointers follow.  This array is the  current
1484                      callchain.
1485
1486                  size, data[size]
1487                      If PERF_SAMPLE_RAW is enabled, then a 32-bit value indi‐
1488                      cating size is included followed by an  array  of  8-bit
1489                      values  of length size.  The values are padded with 0 to
1490                      have 64-bit alignment.
1491
1492                      This RAW record data is opaque with respect to the  ABI.
1493                      The  ABI  doesn't  make any promises with respect to the
1494                      stability of its  content;  it  may  vary  depending  on
1495                      event, hardware, and kernel version.
1496
1497                  bnr, lbr[bnr]
1498                      If  PERF_SAMPLE_BRANCH_STACK  is  enabled, then a 64-bit
1499                      value indicating the number of records is included, fol‐
1500                      lowed  by  bnr  perf_branch_entry  structures which each
1501                      include the fields:
1502
1503                      from   This indicates the source instruction (may not be
1504                             a branch).
1505
1506                      to     The branch target.
1507
1508                      mispred
1509                             The branch target was mispredicted.
1510
1511                      predicted
1512                             The branch target was predicted.
1513
1514                      in_tx (since Linux 3.11)
1515                             The branch was in a transactional memory transac‐
1516                             tion.
1517
1518                      abort (since Linux 3.11)
1519                             The branch was in an aborted transactional memory
1520                             transaction.
1521
1522                      cycles (since Linux 4.3)
1523                             This  reports  the number of cycles elapsed since
1524                             the previous branch stack update.
1525
1526                      The entries are from most to least recent, so the  first
1527                      entry has the most recent branch.
1528
1529                      Support  for mispred, predicted, and cycles is optional;
1530                      if not supported, those values will be 0.
1531
1532                      The type  of  branches  recorded  is  specified  by  the
1533                      branch_sample_type field.
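
                          A short sketch of walking the branch stack, here to
                          count mispredicted branches.  It assumes the layout
                          is exactly the bnr count followed by bnr entries, as
                          shown above; count_mispredicted() is a hypothetical
                          helper name.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* p points at the bnr field of the sample. */
static uint64_t count_mispredicted(const unsigned char *p)
{
    uint64_t bnr, n = 0;

    memcpy(&bnr, p, sizeof(bnr));
    p += sizeof(bnr);
    for (uint64_t i = 0; i < bnr; i++) {
        struct perf_branch_entry e;

        /* Entry 0 is the most recent branch. */
        memcpy(&e, p + i * sizeof(e), sizeof(e));
        if (e.mispred)
            n++;
    }
    return n;
}
```

                          On hardware without prediction information both
                          mispred and predicted are 0, so a result of 0 does
                          not by itself mean that every branch was predicted.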
1534
1535                  abi, regs[weight(mask)]
1536                      If  PERF_SAMPLE_REGS_USER  is enabled, then the user CPU
1537                      registers are recorded.
1538
1539                      The  abi  field  is  one  of  PERF_SAMPLE_REGS_ABI_NONE,
1540                      PERF_SAMPLE_REGS_ABI_32 or PERF_SAMPLE_REGS_ABI_64.
1541
1542                      The  regs  field  is  an array of the CPU registers that
1543                      were specified by the sample_regs_user attr field.   The
1544                      number  of  values is the number of bits set in the sam‐
1545                      ple_regs_user bit mask.
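
                          The weight(mask) notation is a population count of
                          the bit mask.  The sketch below computes it and the
                          resulting field size; the function names are
                          hypothetical.  That no register values follow when
                          abi is PERF_SAMPLE_REGS_ABI_NONE is an assumption
                          about kernel behavior, not something stated here.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/* weight(mask): number of bits set in sample_regs_user. */
static unsigned regs_count(uint64_t mask)
{
    unsigned n = 0;

    for (; mask != 0; mask &= mask - 1)   /* clear the lowest set bit */
        n++;
    return n;
}

/* Bytes occupied by "u64 abi; u64 regs[weight(mask)];". */
static size_t regs_field_size(uint64_t abi, uint64_t mask)
{
    if (abi == PERF_SAMPLE_REGS_ABI_NONE)
        return sizeof(uint64_t);          /* assumed: abi word only */
    return sizeof(uint64_t) * (1 + regs_count(mask));
}
```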
1546
1547                  size, data[size], dyn_size
1548                      If PERF_SAMPLE_STACK_USER  is  enabled,  then  the  user
1549                      stack  is  recorded.  This can be used to generate stack
1550                      backtraces.  size is the size requested by the  user  in
1551                      sample_stack_user or else the maximum record size.  data
1552                      is the stack data (a raw dump of the memory  pointed  to
1553                      by the stack pointer at the time of sampling).  dyn_size
1554                      is the amount of data actually dumped (can be less  than
1555                      size).  Note that dyn_size is omitted if size is 0.
1556
1557                  weight
1558                      If  PERF_SAMPLE_WEIGHT  is  enabled, then a 64-bit value
1559                      provided by the hardware is recorded that indicates  how
1560                      costly  the  event was.  This allows expensive events to
1561                      stand out more clearly in profiles.
1562
1563                  data_src
1564                      If PERF_SAMPLE_DATA_SRC is enabled, then a 64-bit  value
1565                      is recorded that is made up of the following fields:
1566
1567                      mem_op
1568                          Type of opcode, a bitwise combination of:
1569
1570                          PERF_MEM_OP_NA          Not available
1571                          PERF_MEM_OP_LOAD        Load instruction
1572                          PERF_MEM_OP_STORE       Store instruction
1573                          PERF_MEM_OP_PFETCH      Prefetch
1574                          PERF_MEM_OP_EXEC        Executable code
1575
1576                      mem_lvl
1577                          Memory hierarchy level hit or miss, a bitwise combi‐
1578                          nation   of   the   following,   shifted   left   by
1579                          PERF_MEM_LVL_SHIFT:
1580
1581                          PERF_MEM_LVL_NA         Not available
1582                          PERF_MEM_LVL_HIT        Hit
1583                          PERF_MEM_LVL_MISS       Miss
1584                          PERF_MEM_LVL_L1         Level 1 cache
1585                          PERF_MEM_LVL_LFB        Line fill buffer
1586                          PERF_MEM_LVL_L2         Level 2 cache
1587                          PERF_MEM_LVL_L3         Level 3 cache
1588                          PERF_MEM_LVL_LOC_RAM    Local DRAM
1589                          PERF_MEM_LVL_REM_RAM1   Remote DRAM 1 hop
1590                          PERF_MEM_LVL_REM_RAM2   Remote DRAM 2 hops
1591                          PERF_MEM_LVL_REM_CCE1   Remote cache 1 hop
1592                          PERF_MEM_LVL_REM_CCE2   Remote cache 2 hops
1593                          PERF_MEM_LVL_IO         I/O memory
1594                          PERF_MEM_LVL_UNC        Uncached memory
1595
1596                      mem_snoop
1597                          Snoop  mode, a bitwise combination of the following,
1598                          shifted left by PERF_MEM_SNOOP_SHIFT:
1599
1600                          PERF_MEM_SNOOP_NA       Not available
1601                          PERF_MEM_SNOOP_NONE     No snoop
1602                          PERF_MEM_SNOOP_HIT      Snoop hit
1603                          PERF_MEM_SNOOP_MISS     Snoop miss
1604                          PERF_MEM_SNOOP_HITM     Snoop hit modified
1605
1606                      mem_lock
1607                          Lock instruction, a bitwise combination of the  fol‐
1608                          lowing, shifted left by PERF_MEM_LOCK_SHIFT:
1609
1610                          PERF_MEM_LOCK_NA        Not available
1611                          PERF_MEM_LOCK_LOCKED    Locked transaction
1612
1613                      mem_dtlb
1614                          TLB access hit or miss, a bitwise combination of the
1615                          following, shifted left by PERF_MEM_TLB_SHIFT:
1616
1617                          PERF_MEM_TLB_NA         Not available
1618                          PERF_MEM_TLB_HIT        Hit
1619                          PERF_MEM_TLB_MISS       Miss
1620                          PERF_MEM_TLB_L1         Level 1 TLB
1621                          PERF_MEM_TLB_L2         Level 2 TLB
1622                          PERF_MEM_TLB_WK         Hardware walker
1623                          PERF_MEM_TLB_OS         OS fault handler
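
                          A sketch of testing individual data_src flags with
                          the shift constants named above.  The helper names
                          are hypothetical, and PERF_MEM_OP_SHIFT is assumed
                          to be provided by the kernel header for the mem_op
                          field.

```c
#include <linux/perf_event.h>
#include <stdbool.h>
#include <stdint.h>

/* Test one flag of a data_src subfield; shift is that field's *_SHIFT. */
static bool data_src_has(uint64_t data_src, unsigned shift, uint64_t flag)
{
    return ((data_src >> shift) & flag) != 0;
}

/* Example query: a load that hit in the level 1 cache. */
static bool is_l1_load_hit(uint64_t data_src)
{
    return data_src_has(data_src, PERF_MEM_OP_SHIFT, PERF_MEM_OP_LOAD) &&
           data_src_has(data_src, PERF_MEM_LVL_SHIFT, PERF_MEM_LVL_HIT) &&
           data_src_has(data_src, PERF_MEM_LVL_SHIFT, PERF_MEM_LVL_L1);
}
```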
1624
1625                  transaction
1626                      If the  PERF_SAMPLE_TRANSACTION  flag  is  set,  then  a
1627                      64-bit  field  is recorded describing the sources of any
1628                      transactional memory aborts.
1629
1630                      The field is a bitwise combination of the following val‐
1631                      ues:
1632
1633                      PERF_TXN_ELISION
1634                             Abort  from  an  elision type transaction (Intel-
1635                             CPU-specific).
1636
1637                      PERF_TXN_TRANSACTION
1638                             Abort from a generic transaction.
1639
1640                      PERF_TXN_SYNC
1641                             Synchronous  abort  (related  to   the   reported
1642                             instruction).
1643
1644                      PERF_TXN_ASYNC
1645                             Asynchronous  abort  (not related to the reported
1646                             instruction).
1647
1648                      PERF_TXN_RETRY
1649                             Retryable abort  (retrying  the  transaction  may
1650                             have succeeded).
1651
1652                      PERF_TXN_CONFLICT
1653                             Abort due to memory conflicts with other threads.
1654
1655                      PERF_TXN_CAPACITY_WRITE
1656                             Abort due to write capacity overflow.
1657
1658                      PERF_TXN_CAPACITY_READ
1659                             Abort due to read capacity overflow.
1660
1661                      In addition, a user-specified abort code can be obtained
1662                      from the high 32 bits  of  the  field  by  masking  with
1663                      PERF_TXN_ABORT_MASK   and   then   shifting   right   by
1664                      PERF_TXN_ABORT_SHIFT.
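
                          A sketch of extracting the abort code.  It applies
                          the mask before shifting, on the assumption that
                          PERF_TXN_ABORT_MASK is defined in place over bits
                          32..63 in the kernel header; txn_abort_code() is a
                          hypothetical helper name.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Return the user-specified abort code held in the high 32 bits. */
static uint32_t txn_abort_code(uint64_t txn)
{
    return (uint32_t)((txn & PERF_TXN_ABORT_MASK) >> PERF_TXN_ABORT_SHIFT);
}
```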
1665
1666                  abi, regs[weight(mask)]
1667                      If PERF_SAMPLE_REGS_INTR is enabled,  then  the  CPU  registers
1668                      at the sample point (kernel or user state) are recorded.
1669
1670                      The  abi  field  is  one  of  PERF_SAMPLE_REGS_ABI_NONE,
1671                      PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
1672
1673                      The regs field is an array of  the  CPU  registers  that
1674                      were  specified by the sample_regs_intr attr field.  The
1675                      number of values is the number of bits set in  the  sam‐
1676                      ple_regs_intr bit mask.
1677
1678              PERF_RECORD_MMAP2
1679                  This  record  includes extended information on mmap(2) calls
1680                  returning executable mappings.  The  format  is  similar  to
1681                  that of the PERF_RECORD_MMAP record, but includes extra val‐
1682                  ues that allow uniquely identifying shared mappings.
1683
1684                      struct {
1685                          struct perf_event_header header;
1686                          u32    pid;
1687                          u32    tid;
1688                          u64    addr;
1689                          u64    len;
1690                          u64    pgoff;
1691                          u32    maj;
1692                          u32    min;
1693                          u64    ino;
1694                          u64    ino_generation;
1695                          u32    prot;
1696                          u32    flags;
1697                          char   filename[];
1698                          struct sample_id sample_id;
1699                      };
1700
1701                  pid    is the process ID.
1702
1703                  tid    is the thread ID.
1704
1705                  addr   is the address of the allocated memory.
1706
1707                  len    is the length of the allocated memory.
1708
1709                  pgoff  is the page offset of the allocated memory.
1710
1711                  maj    is the major ID of the underlying device.
1712
1713                  min    is the minor ID of the underlying device.
1714
1715                  ino    is the inode number.
1716
1717                  ino_generation
1718                         is the inode generation.
1719
1720                  prot   is the protection information.
1721
1722                  flags  is the flags information.
1723
1724                  filename
1725                         is a string describing the backing of  the  allocated
1726                         memory.
1727
1728              PERF_RECORD_AUX (since Linux 4.1)
1729
1730                  This  record reports that new data is available in the sepa‐
1731                  rate AUX buffer region.
1732
1733                      struct {
1734                          struct perf_event_header header;
1735                          u64    aux_offset;
1736                          u64    aux_size;
1737                          u64    flags;
1738                          struct sample_id sample_id;
1739                      };
1740
1741                  aux_offset
1742                         offset in the AUX mmap  region  where  the  new  data
1743                         begins.
1744
1745                  aux_size
1746                         size of the data made available.
1747
1748                  flags  describes the AUX update.
1749
1750                         PERF_AUX_FLAG_TRUNCATED
1751                                if  set,  then the data returned was truncated
1752                                to fit the available buffer size.
1753
1754                         PERF_AUX_FLAG_OVERWRITE
1755                                if set, then the data returned has overwritten
1756                                previous data.
1757
1758              PERF_RECORD_ITRACE_START (since Linux 4.1)
1759
1760                  This   record  indicates  which  process  has  initiated  an
1761                  instruction trace event, allowing tools to  properly  corre‐
1762                  late  the  instruction  addresses in the AUX buffer with the
1763                  proper executable.
1764
1765                      struct {
1766                          struct perf_event_header header;
1767                          u32    pid;
1768                          u32    tid;
1769                      };
1770
1771                  pid    process ID of  the  thread  starting  an  instruction
1772                         trace.
1773
1774                  tid    thread  ID  of  the  thread  starting  an instruction
1775                         trace.
1776
1777              PERF_RECORD_LOST_SAMPLES (since Linux 4.2)
1778
1779                  When using hardware  sampling  (such  as  Intel  PEBS)  this
1780                  record  indicates  some number of samples that may have been
1781                  lost.
1782
1783                      struct {
1784                          struct perf_event_header header;
1785                          u64    lost;
1786                          struct sample_id sample_id;
1787                      };
1788
1789                  lost   the number of potentially lost samples.
1790
1791              PERF_RECORD_SWITCH (since Linux 4.3)
1792
1793                  This record indicates a context switch  has  happened.   The
1794                  PERF_RECORD_MISC_SWITCH_OUT  bit in the misc field indicates
1795                  whether it was a context switch into or away from  the  cur‐
1796                  rent process.
1797
1798                      struct {
1799                          struct perf_event_header header;
1800                          struct sample_id sample_id;
1801                      };
1802
1803              PERF_RECORD_SWITCH_CPU_WIDE (since Linux 4.3)
1804
1805                  As  with  PERF_RECORD_SWITCH this record indicates a context
1806                  switch has happened, but it only  occurs  when  sampling  in
1807                  CPU-wide  mode  and  provides  additional information on the
1808                  process       being       switched       to/from.        The
1809                  PERF_RECORD_MISC_SWITCH_OUT  bit in the misc field indicates
1810                  whether it was a context switch into or away from  the  cur‐
1811                  rent process.
1812
1813                      struct {
1814                          struct perf_event_header header;
1815                          u32 next_prev_pid;
1816                          u32 next_prev_tid;
1817                          struct sample_id sample_id;
1818                      };
1819
1820                  next_prev_pid
1821                         The  process  ID of the previous (if switching in) or
1822                         next (if switching out) process on the CPU.
1823
1824                  next_prev_tid
1825                         The thread ID of the previous (if  switching  in)  or
1826                         next (if switching out) thread on the CPU.
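
                      Since PERF_RECORD_MISC_SWITCH_OUT shares its bit with
                      flags used by other record types, a reader should look
                      at the record type before interpreting it.  A minimal
                      sketch, with the hypothetical helper is_switch_out():

```c
#include <linux/perf_event.h>
#include <stdbool.h>

/* True only for context-switch records that switch away from the task. */
static bool is_switch_out(const struct perf_event_header *hdr)
{
    return (hdr->type == PERF_RECORD_SWITCH ||
            hdr->type == PERF_RECORD_SWITCH_CPU_WIDE) &&
           (hdr->misc & PERF_RECORD_MISC_SWITCH_OUT) != 0;
}
```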
1827
1828   Overflow handling
1829       Events  can be set to notify when a threshold is crossed, indicating an
1830       overflow.  Overflow conditions can be captured by monitoring the  event
1831       file  descriptor  with poll(2), select(2), or epoll(7).  Alternatively,
1832       the overflow events can be captured via a signal handler, by  enabling
1833       I/O  signaling  on  the  file  descriptor;  see  the  discussion of the
1834       F_SETOWN and F_SETSIG operations in fcntl(2).
1835
1836       Overflows are generated only by  sampling  events  (sample_period  must
1837       have a nonzero value).
1838
1839       There are two ways to generate overflow notifications.
1840
1841       The first is to set a wakeup_events or wakeup_watermark value that will
1842       trigger if a certain number of samples or bytes have  been  written  to
1843       the mmap ring buffer.  In this case, POLL_IN is indicated.
1844
1845       The  other  way  is  by  use of the PERF_EVENT_IOC_REFRESH ioctl.  This
1846       ioctl adds to a counter that decrements each time the event  overflows.
1847       While the counter is nonzero, POLL_IN is indicated; once it reaches  0,
1848       POLL_HUP is indicated and the underlying event is disabled.
1849
1850       Refreshing an event group leader refreshes all siblings, and refreshing
1851       with  a  parameter  of  0  currently  enables infinite refreshes; these
1852       behaviors are unsupported and should not be relied on.
1853
1854       Starting with Linux 3.18, POLL_HUP is indicated if the event being mon‐
1855       itored is attached to a different process and that process exits.
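
           A minimal sketch of waiting for an overflow notification with
           poll(2).  It assumes that the POLL_IN and POLL_HUP indications
           described above appear to poll(2) as the POLLIN and POLLHUP bits
           in revents; wait_for_overflow() is a hypothetical helper name.

```c
#include <poll.h>

/*
 * Wait for a notification on an event file descriptor.
 * Returns 1 if data is available (POLLIN), 0 on hang-up (POLLHUP),
 * and -1 on timeout or error.
 */
static int wait_for_overflow(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int r = poll(&pfd, 1, timeout_ms);

    if (r <= 0)
        return -1;
    if (pfd.revents & POLLIN)
        return 1;
    if (pfd.revents & POLLHUP)
        return 0;
    return -1;
}
```

           A return value of 0 would correspond to the cases above in which
           the refresh counter reached 0 or the monitored process exited.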
1856
1857   rdpmc instruction
1858       Starting  with  Linux  3.4 on x86, you can use the rdpmc instruction to
1859       get low-latency reads without having to enter the  kernel.   Note  that
1860       using  rdpmc  is  not necessarily faster than other methods for reading
1861       event values.
1862
1863       Support for this can be detected with the cap_usr_rdpmc  field  in  the
1864       mmap  page; documentation on how to calculate event values can be found
1865       in that section.
1866
1867       Originally, when rdpmc support was enabled, any process (not just  ones
1868       with  an  active  perf event) could use the rdpmc instruction to access
1869       the counters.  Starting with Linux 4.0, rdpmc support is  only  allowed
1870       if  an  event  is currently enabled in a process's context.  To restore
1871       the old behavior, write the value 2 to /sys/devices/cpu/rdpmc.
1872
1873   perf_event ioctl calls
1874       Various ioctls act on perf_event_open() file descriptors:
1875
1876       PERF_EVENT_IOC_ENABLE
1877              This enables the individual event or event  group  specified  by
1878              the file descriptor argument.
1879
1880              If  the  PERF_IOC_FLAG_GROUP  bit  is set in the ioctl argument,
1881              then all events in a group are enabled, even if the event speci‐
1882              fied is not the group leader (but see BUGS).
1883
1884       PERF_EVENT_IOC_DISABLE
1885              This disables the individual counter or event group specified by
1886              the file descriptor argument.
1887
1888              Enabling or disabling the leader of a group enables or  disables
1889              the  entire  group; that is, while the group leader is disabled,
1890              none of the counters in the group will count.  Enabling or  dis‐
1891              abling  a  member  of a group other than the leader affects only
1892              that counter; disabling a non-leader  stops  that  counter  from
1893              counting but doesn't affect any other counter.
1894
1895              If  the  PERF_IOC_FLAG_GROUP  bit  is set in the ioctl argument,
1896              then all events in a group are disabled, even if the event spec‐
1897              ified is not the group leader (but see BUGS).
1898
1899       PERF_EVENT_IOC_REFRESH
1900              Non-inherited overflow counters can use this to enable a counter
1901              for a number of overflows specified by the argument, after which
1902              it is disabled.  Subsequent calls of this ioctl add the argument
1903              value to the  current  count.   An  overflow  notification  with
1904              POLL_IN set will happen on each overflow until the count reaches
1905              0; when that happens a notification with POLL_HUP  set  is  sent
1906              and the event is disabled.  Using an argument of 0 is considered
1907              undefined behavior.
1908
1909       PERF_EVENT_IOC_RESET
1910              Reset the event count specified by the file descriptor  argument
1911              to  zero.  This resets only the counts; there is no way to reset
1912              the multiplexing time_enabled or time_running values.
1913
1914              If the PERF_IOC_FLAG_GROUP bit is set  in  the  ioctl  argument,
1915              then  all  events in a group are reset, even if the event speci‐
1916              fied is not the group leader (but see BUGS).
1917
1918       PERF_EVENT_IOC_PERIOD
1919              This updates the overflow period for the event.
1920
1921              Since Linux 3.7 (on ARM) and Linux  3.14  (all  other  architec‐
1922              tures),  the new period takes effect immediately.  On older ker‐
1923              nels, the new period did not take effect until  after  the  next
1924              overflow.
1925
1926              The  argument  is  a  pointer  to  a 64-bit value containing the
1927              desired new period.
1928
1929              Prior to Linux 2.6.36, this ioctl always failed due to a bug  in
1930              the kernel.
1931
1932       PERF_EVENT_IOC_SET_OUTPUT
1933              This tells the kernel to report event notifications to the spec‐
1934              ified file descriptor rather than the  default  one.   The  file
1935              descriptors must all be on the same CPU.
1936
1937              The  argument  specifies  the  desired file descriptor, or -1 if
1938              output should be ignored.
1939
1940       PERF_EVENT_IOC_SET_FILTER (since Linux 2.6.33)
1941              This adds an ftrace filter to this event.
1942
1943              The argument is a pointer to the desired ftrace filter.
1944
1945       PERF_EVENT_IOC_ID (since Linux 3.12)
1946              This returns the  event  ID  value  for  the  given  event  file
1947              descriptor.
1948
1949              The  argument  is a pointer to a 64-bit unsigned integer to hold
1950              the result.
1951
1952       PERF_EVENT_IOC_SET_BPF (since Linux 4.1)
1953              This allows attaching a Berkeley Packet Filter (BPF) program  to
1954              an  existing  kprobe  tracepoint  event.  You need CAP_SYS_ADMIN
1955              privileges to use this ioctl.
1956
1957              The argument is a BPF program file descriptor that  was  created
1958              by a previous bpf(2) system call.
1959
1960   Using prctl(2)
1961       A  process can enable or disable all the event groups that are attached
1962       to   it   using    the    prctl(2)    PR_TASK_PERF_EVENTS_ENABLE    and
1963       PR_TASK_PERF_EVENTS_DISABLE  operations.   This applies to all counters
1964       on the calling process, whether created by this process or by  another,
1965       and does not affect any counters that this process has created on other
1966       processes.  It enables or disables only  the  group  leaders,  not  any
1967       other members in the groups.
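
       The prctl(2) operations above can be wrapped as in the following
       sketch.  The helper name perf_task_events is hypothetical; the two
       operation constants come from <sys/prctl.h>.

```c
#include <sys/prctl.h>

/* Hypothetical helper: enable (on != 0) or disable (on == 0) all event
 * groups attached to the calling process.  Only group leaders are
 * toggled.  Returns the prctl(2) result: 0 on success, -1 on error. */
static int
perf_task_events(int on)
{
    int option = on ? PR_TASK_PERF_EVENTS_ENABLE
                    : PR_TASK_PERF_EVENTS_DISABLE;

    /* The remaining arguments are unused for these operations. */
    return prctl(option, 0, 0, 0, 0);
}
```

       A process might call perf_task_events(0) before a setup phase it
       does not want measured and perf_task_events(1) just before the
       region of interest.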
1968
1969   perf_event related configuration files
1970       Files in /proc/sys/kernel/
1971
1972           /proc/sys/kernel/perf_event_paranoid
1973                  The  perf_event_paranoid  file can be set to restrict access
1974                  to the performance counters.
1975
1976                  2   allow only user-space measurements (default since  Linux
1977                      4.6).
1978                  1   allow  both kernel and user measurements (default before
1979                      Linux 4.6).
1980                  0   allow access to CPU-specific data but not raw tracepoint
1981                      samples.
1982                  -1  no restrictions.
1983
1984                  The  existence  of the perf_event_paranoid file is the offi‐
1985                  cial  method  for   determining   if   a   kernel   supports
1986                  perf_event_open().
1987
1988           /proc/sys/kernel/perf_event_max_sample_rate
1989                  This  sets  the  maximum sample rate.  Setting this too high
1990                  can allow users to sample at a  rate  that  impacts  overall
1991                  machine  performance  and  potentially  lock up the machine.
1992                  The default value is 100000 (samples per second).
1993
1994           /proc/sys/kernel/perf_event_max_stack
1995                  This file sets the maximum  depth  of  stack  frame  entries
1996                  reported when generating a call trace.
1997
1998           /proc/sys/kernel/perf_event_mlock_kb
                  Maximum amount of memory, in kilobytes, that an unprivileged
                  user can mlock(2).  The default is 516 (kB).
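
           As a sketch of how a program might consult perf_event_paranoid
           before opening events, the code below reads the level and
           interprets it according to the table above.  The function names
           are hypothetical, and treating any level of 2 or more as "user-
           space only" is an assumption made for this example.

```c
#include <stdio.h>

/* Hypothetical helper: read the current perf_event_paranoid level.
 * Returns 0 and stores the level in *level on success.  Returns -1 if
 * the file cannot be opened (which suggests the kernel lacks
 * perf_event_open() support) or cannot be parsed. */
static int
read_paranoid_level(int *level)
{
    FILE *fp = fopen("/proc/sys/kernel/perf_event_paranoid", "r");
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%d", level) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}

/* Levels 1 and below allow both kernel and user measurements. */
static int
may_measure_kernel(int level)
{
    return level <= 1;
}

/* Levels 0 and below allow access to CPU-specific data. */
static int
may_access_cpu_data(int level)
{
    return level <= 0;
}

/* Level -1 removes all restrictions, including raw tracepoint samples. */
static int
may_use_raw_tracepoints(int level)
{
    return level <= -1;
}
```

           If may_measure_kernel() reports 0, an unprivileged caller would
           normally set exclude_kernel in perf_event_attr to avoid EACCES.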
2001
2002       Files in /sys/bus/event_source/devices/
2003
2004           Since Linux 2.6.34, the kernel supports having multiple PMUs avail‐
2005           able  for monitoring.  Information on how to program these PMUs can
2006           be found under /sys/bus/event_source/devices/.   Each  subdirectory
2007           corresponds to a different PMU.
2008
2009           /sys/bus/event_source/devices/*/type (since Linux 2.6.38)
2010                  This  contains an integer that can be used in the type field
2011                  of perf_event_attr to indicate that you  wish  to  use  this
2012                  PMU.
2013
2014           /sys/bus/event_source/devices/cpu/rdpmc (since Linux 3.4)
2015                  If this file is 1, then direct user-space access to the per‐
2016                  formance counter registers is allowed via the rdpmc instruc‐
2017                  tion.  This can be disabled by echoing 0 to the file.
2018
2019                  As  of  Linux  4.0  the  behavior has changed, so that 1 now
2020                  means only  allow  access  to  processes  with  active  perf
2021                  events, with 2 indicating the old allow-anyone-access behav‐
2022                  ior.
2023
2024           /sys/bus/event_source/devices/*/format/ (since Linux 3.4)
2025                  This subdirectory contains information on the  architecture-
2026                  specific  subfields  available  for  programming the various
2027                  config fields in the perf_event_attr struct.
2028
2029                  The content of each file is the name of  the  config  field,
2030                  followed  by  a  colon,  followed by a series of integer bit
2031                  ranges separated by commas.  For example, the file event may
2032                  contain  the  value  config1:1,6-10,44  which indicates that
                  event is an attribute that occupies bits 1, 6–10, and 44 of
2034                  perf_event_attr::config1.
2035
2036           /sys/bus/event_source/devices/*/events/ (since Linux 3.4)
2037                  This  subdirectory  contains  files  with predefined events.
2038                  The contents  are  strings  describing  the  event  settings
2039                  expressed  in  terms  of  the fields found in the previously
2040                  mentioned ./format/ directory.  These  are  not  necessarily
2041                  complete lists of all events supported by a PMU, but usually
2042                  a subset of events deemed useful or interesting.
2043
2044                  The content of each file is a list of attribute names  sepa‐
2045                  rated  by  commas.  Each entry has an optional value (either
2046                  hex or decimal).  If no  value  is  specified,  then  it  is
2047                  assumed  to  be  a  single-bit  field with a value of 1.  An
2048                  example entry may look like this: event=0x2,inv,ldlat=3.
2049
2050           /sys/bus/event_source/devices/*/uevent
2051                  This file  is  the  standard  kernel  device  interface  for
2052                  injecting hotplug events.
2053
2054           /sys/bus/event_source/devices/*/cpumask (since Linux 3.7)
2055                  The cpumask file contains a comma-separated list of integers
2056                  that indicate a representative CPU number  for  each  socket
2057                  (package)  on  the motherboard.  This is needed when setting
2058                  up uncore or  northbridge  events,  as  those  PMUs  present
2059                  socket-wide events.
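
           The format and events files described above can be combined to
           build a config value.  The following is a minimal sketch, not a
           definitive implementation: parse_format and field_to_config are
           hypothetical names, and the rule that a value's low-order bits
           fill the listed bit positions in ascending order is an assumption
           of this example.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse a format spec such as "config1:1,6-10,44".
 * Copies the field name ("config1") into field and stores a mask of the
 * listed bits in *mask.  Returns 0 on success, -1 on malformed input. */
static int
parse_format(const char *spec, char *field, size_t field_len,
             uint64_t *mask)
{
    const char *colon = strchr(spec, ':');
    const char *p;
    char *end;

    if (colon == NULL || colon == spec ||
        (size_t) (colon - spec) >= field_len)
        return -1;
    memcpy(field, spec, (size_t) (colon - spec));
    field[colon - spec] = '\0';

    *mask = 0;
    p = colon + 1;
    for (;;) {
        unsigned long lo, hi;

        lo = hi = strtoul(p, &end, 10);     /* single bit or range start */
        if (end == p)
            return -1;
        if (*end == '-') {                  /* range such as 6-10 */
            p = end + 1;
            hi = strtoul(p, &end, 10);
            if (end == p)
                return -1;
        }
        if (lo > hi || hi > 63)
            return -1;
        for (unsigned long b = lo; b <= hi; b++)
            *mask |= UINT64_C(1) << b;
        if (*end != ',')
            break;
        p = end + 1;
    }
    /* sysfs files usually end with a newline. */
    return (*end == '\0' || *end == '\n') ? 0 : -1;
}

/* Spread the low-order bits of value across the set bits of mask,
 * lowest mask bit first (assumed ordering). */
static uint64_t
field_to_config(uint64_t mask, uint64_t value)
{
    uint64_t out = 0;

    for (int bit = 0; bit < 64; bit++) {
        if (mask & (UINT64_C(1) << bit)) {
            if (value & 1)
                out |= UINT64_C(1) << bit;
            value >>= 1;
        }
    }
    return out;
}
```

           For an events entry such as event=0x2, a caller would look up the
           format file named event, parse it with parse_format(), and OR the
           result of field_to_config(mask, 0x2) into the named config field.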
2060

RETURN VALUE

2062       perf_event_open()  returns  the  new file descriptor, or -1 if an error
2063       occurred (in which case, errno is set appropriately).
2064

ERRORS

2066       The errors returned by perf_event_open() can be inconsistent,  and  may
2067       vary across processor architectures and performance monitoring units.
2068
2069       E2BIG  Returned if the perf_event_attr size value is too small (smaller
2070              than PERF_ATTR_SIZE_VER0), too big (larger than the page  size),
2071              or  larger  than the kernel supports and the extra bytes are not
2072              zero.  When E2BIG is returned, the perf_event_attr size field is
2073              overwritten by the kernel to be the size of the structure it was
2074              expecting.
2075
2076       EACCES Returned when the requested event requires CAP_SYS_ADMIN permis‐
2077              sions  (or a more permissive perf_event paranoid setting).  Some
2078              common cases where an unprivileged process  may  encounter  this
2079              error:  attaching  to a process owned by a different user; moni‐
2080              toring all processes on a given CPU (i.e.,  specifying  the  pid
2081              argument  as  -1); and not setting exclude_kernel when the para‐
2082              noid setting requires it.
2083
2084       EBADF  Returned if the group_fd file descriptor is not  valid,  or,  if
2085              PERF_FLAG_PID_CGROUP  is  set, the cgroup file descriptor in pid
2086              is not valid.
2087
2088       EBUSY (since Linux 4.1)
2089              Returned if another event already has exclusive  access  to  the
2090              PMU.
2091
2092       EFAULT Returned  if  the  attr  pointer  points  at  an  invalid memory
2093              address.
2094
2095       EINVAL Returned if the specified event is invalid.  There are many pos‐
2096              sible  reasons  for this.  A not-exhaustive list: sample_freq is
2097              higher than the maximum setting; the cpu  to  monitor  does  not
2098              exist; read_format is out of range; sample_type is out of range;
2099              the flags value is out of range; exclusive or pinned set and the
2100              event  is not a group leader; the event config values are out of
2101              range or set reserved bits; the generic event  selected  is  not
2102              supported;  or  there  is  not  enough  room to add the selected
2103              event.
2104
2105       EMFILE Each opened event uses one file descriptor.  If a  large  number
2106              of  events  are  opened,  the per-process limit on the number of
2107              open file descriptors will be reached, and no more events can be
2108              created.
2109
2110       ENODEV Returned  when the event involves a feature not supported by the
2111              current CPU.
2112
2113       ENOENT Returned if the type setting is not valid.  This error  is  also
2114              returned for some unsupported generic events.
2115
2116       ENOSPC Prior  to Linux 3.3, if there was not enough room for the event,
2117              ENOSPC was returned.  In Linux 3.3, this was changed to  EINVAL.
2118              ENOSPC  is  still  returned  if  you  try to add more breakpoint
2119              events than supported by the hardware.
2120
2121       ENOSYS Returned if PERF_SAMPLE_STACK_USER is set in sample_type and  it
2122              is not supported by hardware.
2123
2124       EOPNOTSUPP
2125              Returned  if  an  event requiring a specific hardware feature is
2126              requested but there  is  no  hardware  support.   This  includes
2127              requesting  low-skid  events if not supported, branch tracing if
2128              it is not available, sampling if no PMU interrupt is  available,
2129              and branch stacks for software events.
2130
2131       EOVERFLOW (since Linux 4.8)
2132              Returned   if   PERF_SAMPLE_CALLCHAIN   is  requested  and  sam‐
2133              ple_max_stack  is  larger  than   the   maximum   specified   in
2134              /proc/sys/kernel/perf_event_max_stack.
2135
2136       EPERM  Returned on many (but not all) architectures when an unsupported
2137              exclude_hv, exclude_idle, exclude_user, or  exclude_kernel  set‐
2138              ting is specified.
2139
2140              It  can  also  happen,  as with EACCES, when the requested event
2141              requires  CAP_SYS_ADMIN  permissions  (or  a   more   permissive
2142              perf_event  paranoid  setting).   This includes setting a break‐
2143              point on a kernel address, and (since Linux 3.13) setting a ker‐
2144              nel function-trace tracepoint.
2145
2146       ESRCH  Returned  if  attempting  to  attach  to a process that does not
2147              exist.
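
       Because several of these errors share a likely remedy, a caller may
       find it useful to turn errno into a short hint.  The following is
       only a sketch based on the list above; the function name
       perf_open_hint and the hint strings are hypothetical, and, as noted,
       the errors actually returned can vary across architectures.

```c
#include <errno.h>

/* Hypothetical helper: map an errno value from perf_event_open() to a
 * short hint derived from the ERRORS list. */
static const char *
perf_open_hint(int err)
{
    switch (err) {
    case EACCES:
    case EPERM:
        return "insufficient privilege; check perf_event_paranoid, "
               "or an unsupported exclude_* setting";
    case E2BIG:
        return "attr.size not accepted; the kernel wrote the size it "
               "expected back into attr.size";
    case EINVAL:
        return "invalid event specification";
    case ENOENT:
        return "invalid type, or unsupported generic event";
    case ENODEV:
    case ENOSYS:
    case EOPNOTSUPP:
        return "feature not supported by this CPU or PMU";
    case EBUSY:
        return "another event has exclusive access to the PMU";
    case EMFILE:
        return "per-process file descriptor limit reached";
    case ENOSPC:
        return "too many breakpoint events for the hardware";
    case EOVERFLOW:
        return "sample_max_stack exceeds perf_event_max_stack";
    case ESRCH:
        return "target process does not exist";
    case EBADF:
        return "invalid group_fd or cgroup file descriptor";
    case EFAULT:
        return "attr points to invalid memory";
    default:
        return "unexpected error";
    }
}
```

       A caller might print perf_open_hint(errno) alongside strerror(errno)
       when perf_event_open() returns -1.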
2148

VERSION

2150       perf_event_open()  was  introduced  in  Linux  2.6.31  but  was  called
2151       perf_counter_open().  It was renamed in Linux 2.6.32.
2152

CONFORMING TO

       This perf_event_open() system call is Linux-specific and should not be
       used in programs intended to be portable.
2156

NOTES

2158       Glibc does not provide a wrapper for this system call;  call  it  using
2159       syscall(2).  See the example below.
2160
2161       The  official way of knowing if perf_event_open() support is enabled is
2162       checking   for   the   existence    of    the    file    /proc/sys/ker‐
2163       nel/perf_event_paranoid.
2164

BUGS

2166       The  F_SETOWN_EX  option to fcntl(2) is needed to properly get overflow
2167       signals in threads.  This was introduced in Linux 2.6.32.
2168
2169       Prior to Linux 2.6.33 (at least for x86), the kernel did not  check  if
2170       events  could  be scheduled together until read time.  The same happens
       on all known kernels if the NMI watchdog is enabled.  This means that,
       to see whether a given set of events works, you have to call
       perf_event_open(), start the events, and then read them before you
       know for sure that you can get valid measurements.
2174
2175       Prior to Linux 2.6.34, event constraints were not enforced by the  ker‐
2176       nel.  In that case, some events would silently return "0" if the kernel
2177       scheduled them in an improper counter slot.
2178
2179       Prior to Linux 2.6.34, there was a  bug  when  multiplexing  where  the
2180       wrong results could be returned.
2181
2182       Kernels  from Linux 2.6.35 to Linux 2.6.39 can quickly crash the kernel
2183       if "inherit" is enabled and many threads are started.
2184
2185       Prior to Linux 2.6.35, PERF_FORMAT_GROUP did  not  work  with  attached
2186       processes.
2187
2188       There  is  a  bug in the kernel code between Linux 2.6.36 and Linux 3.0
2189       that ignores the "watermark" field and acts as if  a  wakeup_event  was
2190       chosen if the union has a nonzero value in it.
2191
2192       From  Linux 2.6.31 to Linux 3.4, the PERF_IOC_FLAG_GROUP ioctl argument
2193       was broken and would repeatedly operate on the event  specified  rather
2194       than iterating across all sibling events in a group.
2195
2196       From  Linux  3.4 to Linux 3.11, the mmap cap_usr_rdpmc and cap_usr_time
2197       bits mapped to the same location.   Code  should  migrate  to  the  new
2198       cap_user_rdpmc and cap_user_time fields instead.
2199
2200       Always  double-check your results!  Various generalized events have had
2201       wrong values.  For example, retired branches measured the  wrong  thing
2202       on AMD machines until Linux 2.6.35.
2203

EXAMPLE

2205       The  following  is  a short example that measures the total instruction
2206       count of a call to printf(3).
2207
       #include <stdlib.h>
       #include <stdio.h>
       #include <unistd.h>
       #include <string.h>
       #include <sys/ioctl.h>
       #include <linux/perf_event.h>
       #include <asm/unistd.h>

       /* There is no glibc wrapper, so invoke the system call directly. */
       static long
       perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                       int cpu, int group_fd, unsigned long flags)
       {
           return syscall(__NR_perf_event_open, hw_event, pid, cpu,
                          group_fd, flags);
       }

       int
       main(void)
       {
           struct perf_event_attr pe;
           long long count;
           int fd;

           /* Count user-space instructions; create the event disabled. */
           memset(&pe, 0, sizeof(pe));
           pe.type = PERF_TYPE_HARDWARE;
           pe.size = sizeof(pe);
           pe.config = PERF_COUNT_HW_INSTRUCTIONS;
           pe.disabled = 1;
           pe.exclude_kernel = 1;
           pe.exclude_hv = 1;

           /* Measure the calling thread (pid 0) on any CPU (cpu -1). */
           fd = perf_event_open(&pe, 0, -1, -1, 0);
           if (fd == -1) {
               fprintf(stderr, "Error opening leader %llx\n",
                       (unsigned long long) pe.config);
               exit(EXIT_FAILURE);
           }

           ioctl(fd, PERF_EVENT_IOC_RESET, 0);
           ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

           printf("Measuring instruction count for this printf\n");

           ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
           if (read(fd, &count, sizeof(count)) != sizeof(count)) {
               perror("read");
               exit(EXIT_FAILURE);
           }

           printf("Used %lld instructions\n", count);

           close(fd);
           exit(EXIT_SUCCESS);
       }
2260

SEE ALSO

2262       perf(1), fcntl(2), mmap(2), open(2), prctl(2), read(2)
2263

COLOPHON

2265       This page is part of release 4.16 of the Linux  man-pages  project.   A
2266       description  of  the project, information about reporting bugs, and the
2267       latest    version    of    this    page,    can     be     found     at
2268       https://www.kernel.org/doc/man-pages/.
2269
2270
2271
2272Linux                             2018-02-02                PERF_EVENT_OPEN(2)