perf-stat(1)

PERF-STAT(1)                      perf Manual                     PERF-STAT(1)

NAME

       perf-stat - Run a command and gather performance counter statistics

SYNOPSIS

       perf stat [-e <EVENT> | --event=EVENT] [-a] <command>
       perf stat [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
       perf stat [-e <EVENT> | --event=EVENT] [-a] record [-o file] -- <command> [<options>]
       perf stat report [-i file]

DESCRIPTION

       This command runs a command and gathers performance counter statistics
       from it.

OPTIONS

       <command>...
           Any command you can specify in a shell.

       record
           See STAT RECORD.

       report
           See STAT REPORT.

       -e, --event=
           Select the PMU event. Selection can be:

           ·   a symbolic event name (use perf list to list all events)

           ·   a raw PMU event (eventsel+umask) in the form of rNNN where NNN
               is a hexadecimal event descriptor.

           ·   a symbolic or raw PMU event followed by an optional colon and a
               list of event modifiers, e.g., cpu-cycles:p. See the
               perf-list(1) man page for details on event modifiers.

           ·   a symbolically formed event like pmu/param1=0x3,param2/ where
               param1 and param2 are defined as formats for the PMU in
               /sys/bus/event_source/devices/<pmu>/format/*

                   'percore' is an event qualifier that sums up the event counts for both
                   hardware threads in a core. For example:
                   perf stat -A -a -e cpu/event,percore=1/,otherevent ...

           ·   a symbolically formed event like
               pmu/config=M,config1=N,config2=K/ where M, N, K are numbers (in
               decimal, hex, octal format). Acceptable values for each of
               config, config1 and config2 parameters are defined by
               corresponding entries in
               /sys/bus/event_source/devices/<pmu>/format/*

                   Note that the last two syntaxes support prefix and glob matching in
                   the PMU name to simplify creation of events across multiple instances
                   of the same type of PMU in large systems (e.g. memory controller PMUs).
                   Multiple PMU instances are typical for uncore PMUs, so the prefix
                   'uncore_' is also ignored when performing this match.

       -i, --no-inherit
           child tasks do not inherit counters

       -p, --pid=<pid>
           stat events on existing process id (comma separated list)

       -t, --tid=<tid>
           stat events on existing thread id (comma separated list)

       -a, --all-cpus
           system-wide collection from all CPUs (default if no target is
           specified)

       --no-scale
           Don’t scale/normalize counter values

       -d, --detailed
           print more detailed statistics, can be specified up to 3 times

                     -d:          detailed events, L1 and LLC data cache
                  -d -d:     more detailed events, dTLB and iTLB events
               -d -d -d:     very detailed events, adding prefetch events

       -r, --repeat=<n>
           repeat command and print average + stddev (max: 100). 0 means
           forever.

       -B, --big-num
           print large numbers with thousands' separators according to locale.
           Enabled by default. Use "--no-big-num" to disable. Default setting
           can be changed with "perf config stat.big-num=false".

       -C, --cpu=
           Count only on the list of CPUs provided. Multiple CPUs can be
           provided as a comma-separated list with no space: 0,1. Ranges of
           CPUs are specified with -: 0-2. In per-thread mode, this option is
           ignored. The -a option is still necessary to activate system-wide
           monitoring. Default is to count on all CPUs.

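       The CPU list format accepted by -C (comma-separated entries, each
       either a single CPU or a LO-HI range) can be illustrated with a small
       sketch. The helper name expand_cpu_list is hypothetical and not part
       of perf; it only shows which CPUs a list such as 0,2-4 names.

```shell
# Expand a CPU list such as "0,2-4" into one CPU number per line.
# expand_cpu_list is a hypothetical helper, not something perf provides.
expand_cpu_list() {
    # Split the argument on commas; each part is "N" or "LO-HI".
    old_ifs=$IFS
    IFS=,
    set -- $1
    IFS=$old_ifs
    for part in "$@"; do
        case $part in
            *-*)
                lo=${part%-*}
                hi=${part#*-}
                cpu=$lo
                while [ "$cpu" -le "$hi" ]; do
                    echo "$cpu"
                    cpu=$((cpu + 1))
                done
                ;;
            *)
                echo "$part"
                ;;
        esac
    done
}

expand_cpu_list "0,2-4"
```

       With that list, a system-wide run would be started as
       perf stat -a -C 0,2-4 ... since -a is still required.
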
       -A, --no-aggr
           Do not aggregate counts across all monitored CPUs.

       -n, --null
           null run - don’t start any counters

       -v, --verbose
           be more verbose (show counter open errors, etc)

       -x SEP, --field-separator SEP
           print counts using a CSV-style output to make it easy to import
           directly into spreadsheets. Columns are separated by the string
           specified in SEP.

       --table
           Display time for each run (-r option), in a table format, e.g.:

               $ perf stat --null -r 5 --table perf bench sched pipe

               Performance counter stats for 'perf bench sched pipe' (5 runs):

               # Table of individual measurements:
               5.189 (-0.293) #
               5.189 (-0.294) #
               5.186 (-0.296) #
               5.663 (+0.181) ##
               6.186 (+0.703) ####

               # Final result:
               5.483 +- 0.198 seconds time elapsed  ( +-  3.62% )

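       The "Final result" line in the sample above is consistent with the
       mean of the five run times and the standard error of that mean. The
       sketch below recomputes it from the rounded values shown; this is an
       observation about the sample, not a statement of how perf computes
       it internally. summarize_runs is a hypothetical helper name.

```shell
# Recompute mean and standard error of the mean from per-run times.
# Reads one elapsed time (seconds) per line on stdin.
summarize_runs() {
    awk '
        { t[NR] = $1; sum += $1 }
        END {
            if (NR < 2) exit 1
            mean = sum / NR
            for (i = 1; i <= NR; i++) {
                d = t[i] - mean
                ss += d * d
            }
            # sample standard deviation divided by sqrt(number of runs)
            sem = sqrt(ss / (NR - 1)) / sqrt(NR)
            printf "%.3f +- %.3f seconds ( +- %.2f%% )\n", mean, sem, 100 * sem / mean
        }'
}

# The five run times from the table above.
printf '%s\n' 5.189 5.189 5.186 5.663 6.186 | summarize_runs
```

       Because the table shows times rounded to three decimals, per-run
       deltas recomputed this way can differ from the printed ones in the
       last digit.
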
       -G name, --cgroup name
           monitor only in the container (cgroup) called "name". This option
           is available only in per-cpu mode. The cgroup filesystem must be
           mounted. All threads belonging to container "name" are monitored
           when they run on the monitored CPUs. Multiple cgroups can be
           provided. Each cgroup is applied to the corresponding event, i.e.,
           first cgroup to first event, second cgroup to second event and so
           on. It is possible to provide an empty cgroup (monitor all the
           time) using, e.g., -G foo,,bar. Cgroups must have corresponding
           events, i.e., they always refer to events defined earlier on the
           command line. If the user wants to track multiple events for a
           specific cgroup, the user can use -e e1 -e e2 -G foo,foo or just
           use -e e1 -e e2 -G foo.

       To monitor, say, cycles for a cgroup and also system-wide, this
       command line can be used: perf stat -e cycles -G cgroup_name
       -a -e cycles.

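       The positional pairing described for -G (first cgroup to first event,
       second to second, empty entry meaning "monitor all the time") can be
       sketched as follows. The helper pair_events_with_cgroups is
       hypothetical and only prints the mapping; it assumes both lists have
       the same number of entries and does not model the single-cgroup
       shorthand.

```shell
# Print which cgroup each event would be paired with, by position.
# $1: comma-separated events, $2: comma-separated cgroups (same length).
pair_events_with_cgroups() {
    awk -v events="$1" -v cgroups="$2" 'BEGIN {
        n = split(events, e, ",")
        split(cgroups, g, ",")
        for (i = 1; i <= n; i++)
            printf "%s -> %s\n", e[i], (g[i] == "" ? "(all the time)" : g[i])
    }'
}

# Corresponds to: perf stat -a -e cycles,instructions,branches -G foo,,bar
pair_events_with_cgroups "cycles,instructions,branches" "foo,,bar"
```
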
       --for-each-cgroup name
           Expand event list for each cgroup in "name" (allow multiple cgroups
           separated by comma). It also supports regex patterns to match
           multiple groups. This has the same effect as repeating the -e
           option and the -G option for each event x name. This option cannot
           be used with the -G/--cgroup option.

       -o file, --output file
           Print the output into the designated file.

       --append
           Append to the output file designated with the -o option. Ignored if
           -o is not specified.

       --log-fd
           Log output to fd, instead of stderr. Complementary to --output, and
           mutually exclusive with it. --append may be used here. Examples:

               3>results perf stat --log-fd 3 -- $cmd
               3>>results perf stat --log-fd 3 --append -- $cmd

       --control=fifo:ctl-fifo[,ack-fifo], --control=fd:ctl-fd[,ack-fd]
           ctl-fifo / ack-fifo are opened and used as ctl-fd / ack-fd as
           follows. Listen on ctl-fd descriptor for command to control
           measurement (enable: enable events, disable: disable events).
           Measurements can be started with events disabled using --delay=-1
           option. Optionally send control command completion (ack\n) to
           ack-fd descriptor to synchronize with the controlling process.
           Example of bash shell script to enable and disable events during
           measurements:

               #!/bin/bash

               ctl_dir=/tmp/

               ctl_fifo=${ctl_dir}perf_ctl.fifo
               test -p ${ctl_fifo} && unlink ${ctl_fifo}
               mkfifo ${ctl_fifo}
               exec {ctl_fd}<>${ctl_fifo}

               ctl_ack_fifo=${ctl_dir}perf_ctl_ack.fifo
               test -p ${ctl_ack_fifo} && unlink ${ctl_ack_fifo}
               mkfifo ${ctl_ack_fifo}
               exec {ctl_fd_ack}<>${ctl_ack_fifo}

               perf stat -D -1 -e cpu-cycles -a -I 1000       \
                         --control fd:${ctl_fd},${ctl_fd_ack} \
                         -- sleep 30 &
               perf_pid=$!

               sleep 5  && echo 'enable' >&${ctl_fd} && read -u ${ctl_fd_ack} e1 && echo "enabled(${e1})"
               sleep 10 && echo 'disable' >&${ctl_fd} && read -u ${ctl_fd_ack} d1 && echo "disabled(${d1})"

               exec {ctl_fd_ack}>&-
               unlink ${ctl_ack_fifo}

               exec {ctl_fd}>&-
               unlink ${ctl_fifo}

               wait -n ${perf_pid}
               exit $?

       --pre, --post
           Pre and post measurement hooks, e.g.:

               perf stat --repeat 10 --null --sync --pre 'make -s O=defconfig-build/clean' -- make -s -j64 O=defconfig-build/ bzImage

       -I msecs, --interval-print msecs
           Print count deltas every N milliseconds (minimum: 1ms). The
           overhead percentage could be high in some cases, for instance with
           small, sub 100ms intervals. Use with caution. Example:

               perf stat -I 1000 -e cycles -a sleep 5

       If a metric exists, it is calculated from the counts generated in this
       interval and printed after #.

       --interval-count times
           Print count deltas a fixed number of times. This option should be
           used together with the "-I" option. Example:

               perf stat -I 1000 --interval-count 2 -e cycles -a

       --interval-clear
           Clear the screen before next interval.

       --timeout msecs
           Stop the perf stat session and print count deltas after N
           milliseconds (minimum: 10 ms). This option is not supported with
           the "-I" option. Example:

               perf stat --timeout 2000 -e cycles -a

       --metric-only
           Only print computed metrics. Print them in a single line. Don’t
           show any raw values. Not supported with --per-thread.

       --per-socket
           Aggregate counts per processor socket for system-wide mode
           measurements. This is a useful mode to detect imbalance between
           sockets. To enable this mode, use --per-socket in addition to -a
           (system-wide). The output includes the socket number and the number
           of online processors on that socket. This is useful to gauge the
           amount of aggregation.

       --per-die
           Aggregate counts per processor die for system-wide mode
           measurements. This is a useful mode to detect imbalance between
           dies. To enable this mode, use --per-die in addition to -a
           (system-wide). The output includes the die number and the number of
           online processors on that die. This is useful to gauge the amount
           of aggregation.

       --per-core
           Aggregate counts per physical processor for system-wide mode
           measurements. This is a useful mode to detect imbalance between
           physical cores. To enable this mode, use --per-core in addition to
           -a (system-wide). The output includes the core number and the
           number of online logical processors on that physical processor.

       --per-thread
           Aggregate counts per monitored thread, when monitoring threads (-t
           option) or processes (-p option).

       --per-node
           Aggregate counts per NUMA node for system-wide mode measurements.
           This is a useful mode to detect imbalance between NUMA nodes. To
           enable this mode, use --per-node in addition to -a (system-wide).

       -D msecs, --delay msecs
           After starting the program, wait msecs before measuring (-1: start
           with events disabled). This is useful to filter out the startup
           phase of the program, which is often very different.

       -T, --transaction
           Print statistics of transactional execution if supported.

       --metric-no-group
           By default, events to compute a metric are placed in weak groups.
           The group tries to enforce scheduling all or none of the events.
           The --metric-no-group option places events outside of groups and
           may increase the chance of the event being scheduled, leading to
           more accuracy. However, as events may not be scheduled together,
           accuracy for metrics like instructions per cycle can be lower, as
           both events may no longer be measured at the same time.

       --metric-no-merge
           By default metric events in different weak groups can be shared if
           one group contains all the events needed by another. In such cases
           one group will be eliminated, reducing event multiplexing and making
           it so that certain groups of metrics sum to 100%. A downside to
           sharing a group is that the group may require multiplexing and so
           accuracy for a small group that need not have multiplexing is
           lowered. This option forbids the event merging logic from sharing
           events between groups and may be used to increase accuracy in this
           case.

       --quiet
           Don’t print output. This is useful with perf stat record below to
           only write data to the perf.data file.

STAT RECORD

       Stores stat data into perf data file.

       -o file, --output file
           Output file name.

STAT REPORT

       Reads and reports stat data from perf data file.

       -i file, --input file
           Input file name.

       --per-socket
           Aggregate counts per processor socket for system-wide mode
           measurements.

       --per-die
           Aggregate counts per processor die for system-wide mode
           measurements.

       --per-core
           Aggregate counts per physical processor for system-wide mode
           measurements.

       -M, --metrics
           Print metrics or metricgroups specified in a comma separated list.
           For a group all metrics from the group are added. The events from
           the metrics are automatically measured. See perf list output for
           the possible metrics and metricgroups.

       -A, --no-aggr
           Do not aggregate counts across all monitored CPUs.

       --topdown
           Print top down level 1 metrics if supported by the CPU. This allows
           determining bottlenecks in the CPU pipeline for CPU bound
           workloads, by breaking the cycles consumed down into frontend
           bound, backend bound, bad speculation and retiring.

       Frontend bound means that the CPU cannot fetch and decode instructions
       fast enough. Backend bound means that computation or memory access is
       the bottleneck. Bad Speculation means that the CPU wasted cycles due
       to branch mispredictions and similar issues. Retiring means that the
       CPU computed without an apparent bottleneck. The bottleneck is only
       the real bottleneck if the workload is actually bound by the CPU and
       not by something else.

       For best results it is usually a good idea to use it with interval mode
       like -I 1000, as the bottleneck of workloads can change often.

       This enables --metric-only, unless overridden with --no-metric-only.

       The following restrictions only apply to older Intel CPUs and Atom; on
       newer CPUs (IceLake and later) TopDown can be collected for any thread:

       The top down metrics are collected per core instead of per CPU thread.
       Per core mode is automatically enabled and -a (global monitoring) is
       needed, requiring root rights or kernel.perf_event_paranoid=-1.

       Topdown uses the full Performance Monitoring Unit, and needs disabling
       of the NMI watchdog (as root): echo 0 > /proc/sys/kernel/nmi_watchdog
       for best results. Otherwise the bottlenecks may be inconsistent on
       workloads with changing phases.

       To interpret the results it is usually necessary to know which CPUs the
       workload runs on. If needed the CPUs can be forced using taskset.

       --no-merge
           Do not merge results from same PMUs.

       When multiple events are created from a single event specification,
       stat will, by default, aggregate the event counts and show the result
       in a single row. This option disables that behavior and shows the
       individual events and counts.

       Multiple events are created from a single event specification when:

        1. Prefix or glob matching is used for the PMU name.

        2. Aliases, which are listed immediately after the Kernel PMU events
           by perf list, are used.

       --smi-cost
           Measure SMI cost if msr/aperf/ and msr/smi/ events are supported.

       During the measurement, /sys/devices/cpu/freeze_on_smi will be set
       to freeze core counters on SMI. The aperf counter will not be affected
       by the setting. The cost of SMI can be measured by (aperf - unhalted
       core cycles).

       In practice, the percentage of SMI cycles is very useful for
       performance oriented analysis. --metric-only will be applied by
       default. The output is SMI cycles%, equal to (aperf - unhalted core
       cycles) / aperf

       Users who want to get the actual value can apply --no-metric-only.

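       The SMI cycles% formula above, (aperf - unhalted core cycles) / aperf,
       can be worked through with a short sketch. The helper name
       smi_cycles_pct and the counter values are made up for illustration;
       real values would come from a --smi-cost --no-metric-only run.

```shell
# Compute SMI cycles% = (aperf - unhalted core cycles) / aperf * 100.
# $1: aperf count, $2: unhalted core cycles count (both hypothetical here).
smi_cycles_pct() {
    awk -v aperf="$1" -v cycles="$2" 'BEGIN {
        if (aperf <= 0) exit 1
        printf "%.1f%%\n", 100 * (aperf - cycles) / aperf
    }'
}

# Made-up counts: 50000 of 1000000 aperf cycles were not seen by the
# frozen core-cycles counter, so the result is 5.0%.
smi_cycles_pct 1000000 950000
```
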
       --all-kernel
           Configure all used events to run in kernel space.

       --all-user
           Configure all used events to run in user space.

       --percore-show-thread
           The event modifier "percore" sums up the event counts for all
           hardware threads in a core and shows the counts per core.

       This option with event modifier "percore" enabled also sums up the
       event counts for all hardware threads in a core but shows the sum
       counts per hardware thread. This is essentially a replacement for the
       any bit and convenient for post processing.

       --summary
           Print summary for interval mode (-I).

EXAMPLES

       $ perf stat -- make

           Performance counter stats for 'make':

              83723.452481      task-clock:u (msec)       #    1.004 CPUs utilized
                         0      context-switches:u        #    0.000 K/sec
                         0      cpu-migrations:u          #    0.000 K/sec
                 3,228,188      page-faults:u             #    0.039 M/sec
           229,570,665,834      cycles:u                  #    2.742 GHz
           313,163,853,778      instructions:u            #    1.36  insn per cycle
            69,704,684,856      branches:u                #  832.559 M/sec
             2,078,861,393      branch-misses:u           #    2.98% of all branches

           83.409183620 seconds time elapsed

           74.684747000 seconds user
            8.739217000 seconds sys

TIMINGS

       As displayed in the example above we can display 3 types of timings. We
       always display the time the counters were enabled/alive:

           83.409183620 seconds time elapsed

       For workload sessions we also display time the workloads spent in
       user/system lands:

           74.684747000 seconds user
            8.739217000 seconds sys

       Those times are the very same as displayed by the time tool.

CSV FORMAT

       With -x, perf stat is able to output a not-quite-CSV format. Commas in
       the output are not put into "". To make it easy to parse, it is
       recommended to use a different separator character, like -x \;

       The fields are in this order:

       ·   optional usec time stamp in fractions of a second (with -I xxx)

       ·   optional CPU, core, or socket identifier

       ·   optional number of logical CPUs aggregated

       ·   counter value

       ·   unit of the counter value or empty

       ·   event name

       ·   run time of counter

       ·   percentage of measurement time the counter was running

       ·   optional variance if multiple values are collected with -r

       ·   optional metric value

       ·   optional unit of metric

       Additional metrics may be printed with all earlier fields being empty.
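
       As a minimal sketch of parsing this format, the helper below prints
       "event=value unit" for each counter line. It assumes output produced
       with -x \; and without -I or per-CPU/core/socket aggregation, so the
       three optional leading fields are absent and the counter value is
       field 1. The helper name print_counts and the two sample lines are
       made up for illustration, loosely following the EXAMPLES section.

```shell
# Print "event=value[ unit]" for each counter line of `perf stat -x \;`.
# Assumes no -I and no per-CPU/core/socket aggregation (value is field 1).
print_counts() {
    # Lines whose value field is empty (additional metrics only) are skipped.
    awk -F';' 'NF >= 3 && $1 != "" {
        printf "%s=%s%s\n", $3, $1, ($2 == "" ? "" : " " $2)
    }'
}

# Hypothetical sample lines in the documented field order:
# value;unit;event;run time;percentage;metric value;metric unit
printf '%s\n' \
    '83723.45;msec;task-clock:u;83723452481;100.00;1.004;CPUs utilized' \
    '229570665834;;cycles:u;83723452481;100.00;2.742;GHz' |
    print_counts
```

       When -I or an aggregation mode is used, the field positions shift by
       the number of optional leading fields present, so the field indexes
       in the awk program would need adjusting.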