PERF-STAT(1)                      perf Manual                     PERF-STAT(1)

NAME
    perf-stat - Run a command and gather performance counter statistics

SYNOPSIS
    perf stat [-e <EVENT> | --event=EVENT] [-a] <command>
    perf stat [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
    perf stat [-e <EVENT> | --event=EVENT] [-a] record [-o file] -- <command> [<options>]
    perf stat report [-i file]
DESCRIPTION
    This command runs a command and gathers performance counter
    statistics from it.

OPTIONS
    <command>...
        Any command you can specify in a shell.

    record
        See STAT RECORD.

    report
        See STAT REPORT.

    -e, --event=
        Select the PMU event. Selection can be:

        • a symbolic event name (use perf list to list all events)

        • a raw PMU event (eventsel+umask) in the form of rNNN where NNN
          is a hexadecimal event descriptor.

        • a symbolic or raw PMU event followed by an optional colon and
          a list of event modifiers, e.g., cpu-cycles:p. See the
          perf-list(1) man page for details on event modifiers.

        • a symbolically formed event like pmu/param1=0x3,param2/ where
          param1 and param2 are defined as formats for the PMU in
          /sys/bus/event_source/devices/<pmu>/format/*

          'percore' is an event qualifier that sums up the event counts
          for both hardware threads in a core. For example:

              perf stat -A -a -e cpu/event,percore=1/,otherevent ...

        • a symbolically formed event like
          pmu/config=M,config1=N,config2=K/ where M, N, K are numbers
          (in decimal, hex, or octal format). Acceptable values for each
          of the config, config1 and config2 parameters are defined by
          corresponding entries in
          /sys/bus/event_source/devices/<pmu>/format/*

        Note that the last two syntaxes support prefix and glob matching
        in the PMU name to simplify creation of events across multiple
        instances of the same type of PMU in large systems (e.g. memory
        controller PMUs). Multiple PMU instances are typical for uncore
        PMUs, so the prefix 'uncore_' is also ignored when performing
        this match.
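
        For example, on a hypothetical system whose memory controller
        PMUs are exposed as uncore_imc_0, uncore_imc_1, etc. (the PMU
        name and the event=0x1 config value here are illustrative, not
        real), either of the following creates one event per instance:

            # prefix match; the 'uncore_' prefix is implied:
            perf stat -e imc/event=0x1/ -a -- sleep 1
            # glob match:
            perf stat -e 'uncore_imc_*/event=0x1/' -a -- sleep 1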

    -i, --no-inherit
        Child tasks do not inherit counters.

    -p, --pid=<pid>
        Stat events on existing process id (comma separated list).

    -t, --tid=<tid>
        Stat events on existing thread id (comma separated list).

    -b, --bpf-prog
        Stat events on existing bpf program id (comma separated list),
        requiring root rights. bpftool-prog could be used to find the
        program ids of all bpf programs in the system. For example:

            # bpftool prog | head -n 1
            17247: tracepoint  name sys_enter  tag 192d548b9d754067  gpl

            # perf stat -e cycles,instructions --bpf-prog 17247 --timeout 1000

            Performance counter stats for 'BPF program(s) 17247':

                       85,967      cycles
                       28,982      instructions    #    0.34  insn per cycle

                  1.102235068 seconds time elapsed

    -a, --all-cpus
        System-wide collection from all CPUs (default if no target is
        specified).

    --no-scale
        Don’t scale/normalize counter values.

    -d, --detailed
        Print more detailed statistics, can be specified up to 3 times:

            -d:          detailed events, L1 and LLC data cache
            -d -d:       more detailed events, dTLB and iTLB events
            -d -d -d:    very detailed events, adding prefetch events

    -r, --repeat=<n>
        Repeat the command and print average + stddev (max: 100). 0
        means forever.

    -B, --big-num
        Print large numbers with thousands' separators according to
        locale. Enabled by default. Use "--no-big-num" to disable. The
        default setting can be changed with "perf config
        stat.big-num=false".

    -C, --cpu=
        Count only on the list of CPUs provided. Multiple CPUs can be
        provided as a comma-separated list with no space: 0,1. Ranges of
        CPUs are specified with -: 0-2. In per-thread mode, this option
        is ignored. The -a option is still necessary to activate
        system-wide monitoring. Default is to count on all CPUs.

    -A, --no-aggr
        Do not aggregate counts across all monitored CPUs.

    -n, --null
        Null run - don’t start any counters.

    -v, --verbose
        Be more verbose (show counter open errors, etc).

    -x SEP, --field-separator SEP
        Print counts using a CSV-style output to make it easy to import
        directly into spreadsheets. Columns are separated by the string
        specified in SEP.
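
        For example, to produce semicolon-separated output (the field
        values below are schematic placeholders, not real output):

            $ perf stat -x \; -e task-clock -- sleep 1
            <value>;<unit>;task-clock;<run time>;<percent running>;<metric>;<metric unit>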

    --table
        Display the time for each run (-r option), in a table format,
        e.g.:

            $ perf stat --null -r 5 --table perf bench sched pipe

            Performance counter stats for 'perf bench sched pipe' (5 runs):

              # Table of individual measurements:
              5.189 (-0.293) #
              5.189 (-0.294) #
              5.186 (-0.296) #
              5.663 (+0.181) ##
              6.186 (+0.703) ####

              # Final result:
              5.483 +- 0.198 seconds time elapsed ( +- 3.62% )

    -G name, --cgroup name
        Monitor only in the container (cgroup) called "name". This
        option is available only in per-cpu mode. The cgroup filesystem
        must be mounted. All threads belonging to container "name" are
        monitored when they run on the monitored CPUs. Multiple cgroups
        can be provided. Each cgroup is applied to the corresponding
        event, i.e., first cgroup to first event, second cgroup to
        second event and so on. It is possible to provide an empty
        cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups
        must have corresponding events, i.e., they always refer to
        events defined earlier on the command line. If the user wants to
        track multiple events for a specific cgroup, the user can use
        -e e1 -e e2 -G foo,foo or just use -e e1 -e e2 -G foo.

        To monitor, say, cycles for a cgroup and also system-wide, this
        command line can be used: perf stat -e cycles -G cgroup_name -a
        -e cycles.

    --for-each-cgroup name
        Expand the event list for each cgroup in "name" (multiple
        cgroups can be separated by comma). It also supports regex
        patterns to match multiple cgroups. This has the same effect as
        repeating the -e and -G options for each event x name pair. This
        option cannot be used with the -G/--cgroup option.
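
        As a sketch, assuming cgroups named A and B exist, the following
        two invocations are equivalent:

            perf stat --for-each-cgroup A,B -e cycles,instructions -a -- sleep 1
            perf stat -e cycles,instructions -G A,A \
                      -e cycles,instructions -G B,B -a -- sleep 1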

    -o file, --output file
        Print the output into the designated file.

    --append
        Append to the output file designated with the -o option. Ignored
        if -o is not specified.

    --log-fd
        Log output to fd, instead of stderr. Complementary to --output,
        and mutually exclusive with it. --append may be used here.
        Examples:

            3>results perf stat --log-fd 3 -- $cmd
            3>>results perf stat --log-fd 3 --append -- $cmd

    --control=fifo:ctl-fifo[,ack-fifo], --control=fd:ctl-fd[,ack-fd]
        ctl-fifo / ack-fifo are opened and used as ctl-fd / ack-fd as
        follows. Listen on the ctl-fd descriptor for commands to control
        measurement (enable: enable events, disable: disable events).
        Measurements can be started with events disabled using the
        --delay=-1 option. Optionally send control command completion
        (ack\n) to the ack-fd descriptor to synchronize with the
        controlling process. Example of a bash shell script to enable
        and disable events during measurements:

            #!/bin/bash

            ctl_dir=/tmp/

            ctl_fifo=${ctl_dir}perf_ctl.fifo
            test -p ${ctl_fifo} && unlink ${ctl_fifo}
            mkfifo ${ctl_fifo}
            exec {ctl_fd}<>${ctl_fifo}

            ctl_ack_fifo=${ctl_dir}perf_ctl_ack.fifo
            test -p ${ctl_ack_fifo} && unlink ${ctl_ack_fifo}
            mkfifo ${ctl_ack_fifo}
            exec {ctl_fd_ack}<>${ctl_ack_fifo}

            perf stat -D -1 -e cpu-cycles -a -I 1000 \
                      --control fd:${ctl_fd},${ctl_fd_ack} \
                      -- sleep 30 &
            perf_pid=$!

            sleep 5  && echo 'enable'  >&${ctl_fd} && read -u ${ctl_fd_ack} e1 && echo "enabled(${e1})"
            sleep 10 && echo 'disable' >&${ctl_fd} && read -u ${ctl_fd_ack} d1 && echo "disabled(${d1})"

            exec {ctl_fd_ack}>&-
            unlink ${ctl_ack_fifo}

            exec {ctl_fd}>&-
            unlink ${ctl_fifo}

            wait -n ${perf_pid}
            exit $?

    --pre, --post
        Pre and post measurement hooks, e.g.:

            perf stat --repeat 10 --null --sync --pre 'make -s O=defconfig-build/clean' -- make -s -j64 O=defconfig-build/ bzImage

    -I msecs, --interval-print msecs
        Print count deltas every N milliseconds (minimum: 1ms). The
        overhead percentage could be high in some cases, for instance
        with small, sub-100ms intervals. Use with caution. Example:

            perf stat -I 1000 -e cycles -a sleep 5

        If a metric exists for an event, it is calculated from the
        counts generated in the interval and printed after the # sign.

    --interval-count times
        Print count deltas a fixed number of times. This option should
        be used together with the "-I" option. Example:

            perf stat -I 1000 --interval-count 2 -e cycles -a

    --interval-clear
        Clear the screen before the next interval.

    --timeout msecs
        Stop the perf stat session and print count deltas after N
        milliseconds (minimum: 10 ms). This option is not supported with
        the "-I" option. Example:

            perf stat --timeout 2000 -e cycles -a

    --metric-only
        Only print computed metrics. Print them in a single line. Don’t
        show any raw values. Not supported with --per-thread.

    --per-socket
        Aggregate counts per processor socket for system-wide mode
        measurements. This is a useful mode to detect imbalance between
        sockets. To enable this mode, use --per-socket in addition to -a
        (system-wide). The output includes the socket number and the
        number of online processors on that socket. This is useful to
        gauge the amount of aggregation.

    --per-die
        Aggregate counts per processor die for system-wide mode
        measurements. This is a useful mode to detect imbalance between
        dies. To enable this mode, use --per-die in addition to -a
        (system-wide). The output includes the die number and the number
        of online processors on that die. This is useful to gauge the
        amount of aggregation.

    --per-core
        Aggregate counts per physical processor for system-wide mode
        measurements. This is a useful mode to detect imbalance between
        physical cores. To enable this mode, use --per-core in addition
        to -a (system-wide). The output includes the core number and the
        number of online logical processors on that physical processor.
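
        For example, to count cycles aggregated per core across the
        whole system for one second:

            perf stat --per-core -a -e cycles -- sleep 1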

    --per-thread
        Aggregate counts per monitored thread, when monitoring threads
        (-t option) or processes (-p option).

    --per-node
        Aggregate counts per NUMA node for system-wide mode
        measurements. This is a useful mode to detect imbalance between
        NUMA nodes. To enable this mode, use --per-node in addition to
        -a (system-wide).

    -D msecs, --delay msecs
        After starting the program, wait msecs before measuring (-1:
        start with events disabled). This is useful to filter out the
        startup phase of the program, which is often very different.

    -T, --transaction
        Print statistics of transactional execution if supported.

    --metric-no-group
        By default, events to compute a metric are placed in weak
        groups. The group tries to enforce scheduling all or none of the
        events. The --metric-no-group option places events outside of
        groups and may increase the chance of the events being scheduled
        - leading to more accuracy. However, as events may not be
        scheduled together, accuracy for metrics like instructions per
        cycle can be lower - as the underlying events may no longer be
        measured at the same time.

    --metric-no-merge
        By default, metric events in different weak groups can be shared
        if one group contains all the events needed by another. In such
        cases one group will be eliminated, reducing event multiplexing
        and making it so that certain groups of metrics sum to 100%. A
        downside to sharing a group is that the group may require
        multiplexing, and so accuracy for a small group that would not
        otherwise need multiplexing is lowered. This option forbids the
        event merging logic from sharing events between groups and may
        be used to increase accuracy in this case.

    --quiet
        Don’t print output. This is useful with perf stat record below
        to only write data to the perf.data file.
STAT RECORD
    Stores stat data into a perf data file.

    -o file, --output file
        Output file name.

STAT REPORT
    Reads and reports stat data from a perf data file.

    -i file, --input file
        Input file name.

    --per-socket
        Aggregate counts per processor socket for system-wide mode
        measurements.

    --per-die
        Aggregate counts per processor die for system-wide mode
        measurements.

    --per-core
        Aggregate counts per physical processor for system-wide mode
        measurements.

    -M, --metrics
        Print metrics or metricgroups specified in a comma separated
        list. For a group, all metrics from the group are added. The
        events from the metrics are automatically measured. See perf
        list output for the possible metrics and metricgroups.

    -A, --no-aggr
        Do not aggregate counts across all monitored CPUs.

    --topdown
        Print complete top-down metrics supported by the CPU. This
        allows bottlenecks in the CPU pipeline to be determined for CPU
        bound workloads, by breaking down the cycles consumed into
        frontend bound, backend bound, bad speculation and retiring.

        Frontend bound means that the CPU cannot fetch and decode
        instructions fast enough. Backend bound means that computation
        or memory access is the bottleneck. Bad speculation means that
        the CPU wasted cycles due to branch mispredictions and similar
        issues. Retiring means that the CPU computed without an apparent
        bottleneck. The bottleneck is only the real bottleneck if the
        workload is actually bound by the CPU and not by something else.

        For best results it is usually a good idea to use it with
        interval mode like -I 1000, as the bottleneck of workloads can
        change often.

        This enables --metric-only, unless overridden with
        --no-metric-only.

        The following restrictions only apply to older Intel CPUs and
        Atom; on newer CPUs (IceLake and later) TopDown can be collected
        for any thread:

        The top-down metrics are collected per core instead of per CPU
        thread. Per-core mode is automatically enabled and -a (global
        monitoring) is needed, requiring root rights or
        kernel.perf_event_paranoid=-1.

        Topdown uses the full Performance Monitoring Unit, and for best
        results needs the NMI watchdog disabled (as root):

            echo 0 > /proc/sys/kernel/nmi_watchdog

        Otherwise the bottlenecks may be inconsistent on workloads with
        changing phases.

        To interpret the results, it is usually necessary to know which
        CPUs the workload runs on. If needed, the CPUs can be forced
        using taskset.
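
        Putting this together, a typical invocation (run as root, with
        the NMI watchdog disabled as described above) might be:

            echo 0 > /proc/sys/kernel/nmi_watchdog
            perf stat --topdown -a -I 1000 -- sleep 10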

    --td-level
        Print the top-down statistics at levels equal to or lower than
        the input level. This allows users to print only the top-down
        metric levels of interest instead of the complete set of
        top-down metrics.

        The availability of top-down metric levels depends on the
        hardware. For example, Ice Lake only supports L1 top-down
        metrics, while Sapphire Rapids supports both L1 and L2 top-down
        metrics.

        Default: 0 means the maximum level that the current hardware
        supports. perf errors out if the input is higher than the
        supported maximum level.

    --no-merge
        Do not merge results from the same PMUs.

        When multiple events are created from a single event
        specification, stat will, by default, aggregate the event counts
        and show the result in a single row. This option disables that
        behavior and shows the individual events and counts.

        Multiple events are created from a single event specification
        when:

        1. Prefix or glob matching is used for the PMU name.

        2. Aliases, which are listed immediately after the Kernel PMU
           events by perf list, are used.
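
        As a sketch, on a hypothetical system with two uncore PMU
        instances matched by a glob (the PMU name and event=0x1 config
        are illustrative), --no-merge switches from one summed row to
        one row per instance:

            # one aggregated row for all matching PMU instances:
            perf stat -e 'uncore_imc_*/event=0x1/' -a -- sleep 1
            # one row per PMU instance:
            perf stat --no-merge -e 'uncore_imc_*/event=0x1/' -a -- sleep 1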

    --smi-cost
        Measure SMI cost if msr/aperf/ and msr/smi/ events are
        supported.

        During the measurement, /sys/devices/cpu/freeze_on_smi will be
        set to freeze core counters on SMI. The aperf counter will not
        be affected by the setting. The cost of SMI can be measured as
        (aperf - unhalted core cycles).

        In practice, the percentage of SMI cycles is very useful for
        performance oriented analysis. --metric-only will be applied by
        default. The output is SMI cycles%, which equals (aperf -
        unhalted core cycles) / aperf.

        Users who want to get the actual values can apply
        --no-metric-only.
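
        As a sketch, assuming a system-wide measurement, the first
        command prints the SMI cycles% metric and the second the raw
        counts:

            perf stat --smi-cost -a -- sleep 10
            perf stat --smi-cost --no-metric-only -a -- sleep 10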

    --all-kernel
        Configure all used events to run in kernel space.

    --all-user
        Configure all used events to run in user space.

    --percore-show-thread
        The event modifier "percore" sums up the event counts for all
        hardware threads in a core and shows the counts per core.

        With this option, the "percore" event counts are still summed up
        across all hardware threads in a core, but the sum counts are
        shown per hardware thread. This is essentially a replacement for
        the any bit and convenient for post processing.
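
        For example (here "event" is a placeholder for a real event
        specification, as in the percore example above):

            perf stat -A -a --percore-show-thread -e cpu/event,percore=1/ -- sleep 1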

    --summary
        Print a summary for interval mode (-I).
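
        For example, to print per-second count deltas followed by a
        final summary:

            perf stat -I 1000 --summary -e cycles -a -- sleep 3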

EXAMPLES
    $ perf stat -- make

        Performance counter stats for 'make':

           83723.452481      task-clock:u (msec)       #    1.004 CPUs utilized
                      0      context-switches:u        #    0.000 K/sec
                      0      cpu-migrations:u          #    0.000 K/sec
              3,228,188      page-faults:u             #    0.039 M/sec
        229,570,665,834      cycles:u                  #    2.742 GHz
        313,163,853,778      instructions:u            #    1.36  insn per cycle
         69,704,684,856      branches:u                #  832.559 M/sec
          2,078,861,393      branch-misses:u           #    2.98% of all branches

           83.409183620 seconds time elapsed

           74.684747000 seconds user
            8.739217000 seconds sys

TIMINGS
    As displayed in the example above, perf stat can display 3 types of
    timings. The time the counters were enabled/alive is always
    displayed:

        83.409183620 seconds time elapsed

    For workload sessions, the time the workload spent in user and
    system land is also displayed:

        74.684747000 seconds user
         8.739217000 seconds sys

    Those times are the very same as displayed by the time(1) tool.

CSV FORMAT
    With -x, perf stat is able to output a not-quite-CSV format; commas
    in the output are not quoted. To make the output easy to parse, it
    is recommended to use a different separator, e.g. -x \;

    The fields are in this order:

        • optional usec time stamp in fractions of second (with -I xxx)

        • optional CPU, core, or socket identifier

        • optional number of logical CPUs aggregated

        • counter value

        • unit of the counter value or empty

        • event name

        • run time of counter

        • percentage of measurement time the counter was running

        • optional variance if multiple values are collected with -r

        • optional metric value

        • optional unit of metric

    Additional metrics may be printed with all earlier fields being
    empty.
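
    A minimal parsing sketch (note that perf stat writes to stderr by
    default), extracting the counter value and event name fields from
    semicolon-separated output:

        perf stat -x \; -e task-clock -- sleep 1 2>&1 | awk -F';' '{ print $3, $1 }'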

SEE ALSO
    perf-top(1), perf-list(1)



perf                               06/03/2021                     PERF-STAT(1)