perf-stat(1)

PERF-STAT(1)                      perf Manual                     PERF-STAT(1)

NAME

       perf-stat - Run a command and gather performance counter statistics

SYNOPSIS

       perf stat [-e <EVENT> | --event=EVENT] [-a] <command>
       perf stat [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
       perf stat [-e <EVENT> | --event=EVENT] [-a] record [-o file] -- <command> [<options>]
       perf stat report [-i file]

DESCRIPTION

       This command runs a command and gathers performance counter statistics
       from it.

OPTIONS

       <command>...
           Any command you can specify in a shell.

       record
           See STAT RECORD.

       report
           See STAT REPORT.

       -e, --event=
           Select the PMU event. Selection can be:

           ·   a symbolic event name (use perf list to list all events)

           ·   a raw PMU event (eventsel+umask) in the form of rNNN where NNN
               is a hexadecimal event descriptor.

           ·   a symbolically formed event like pmu/param1=0x3,param2/ where
               param1 and param2 are defined as formats for the PMU in
               /sys/bus/event_source/devices/<pmu>/format/*

           ·   a symbolically formed event like
               pmu/config=M,config1=N,config2=K/ where M, N, K are numbers (in
               decimal, hex, octal format). Acceptable values for each of
               config, config1 and config2 parameters are defined by
               corresponding entries in
               /sys/bus/event_source/devices/<pmu>/format/*
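
           As a rough illustration of the rNNN form, the sketch below packs an
           event select and a unit mask into a raw event string. It assumes
           the x86 layout, where the umask occupies bits 8-15 and the event
           select bits 0-7. The helper name raw_event is hypothetical, and
           other PMUs may encode raw events differently.

```shell
#!/bin/sh
# Hypothetical helper: build an rNNN raw event string from an event select
# ($1) and a unit mask ($2). Assumes the x86 layout (umask in bits 8-15,
# event select in bits 0-7); this is not a rule for every PMU.
raw_event() {
    printf 'r%x\n' $(( ($2 << 8) | $1 ))
}

# Event select 0x2e with umask 0x41 gives r412e, which could then be passed
# as: perf stat -e r412e -- <command>
raw_event 0x2e 0x41
```

           The resulting string is used wherever a symbolic event name would
           be accepted by -e.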

       -i, --no-inherit
           child tasks do not inherit counters

       -p, --pid=<pid>
           stat events on existing process id (comma separated list)

       -t, --tid=<tid>
           stat events on existing thread id (comma separated list)

       -a, --all-cpus
           system-wide collection from all CPUs (default if no target is
           specified)

       -c, --scale
           scale/normalize counter values

       -d, --detailed
           print more detailed statistics, can be specified up to 3 times

                     -d:          detailed events, L1 and LLC data cache
                  -d -d:     more detailed events, dTLB and iTLB events
               -d -d -d:     very detailed events, adding prefetch events

       -r, --repeat=<n>
           repeat command and print average + stddev (max: 100). 0 means
           forever.

       -B, --big-num
           print large numbers with thousands' separators according to locale

       -C, --cpu=
           Count only on the list of CPUs provided. Multiple CPUs can be
           provided as a comma-separated list with no space: 0,1. Ranges of
           CPUs are specified with -: 0-2. In per-thread mode, this option is
           ignored. The -a option is still necessary to activate system-wide
           monitoring. Default is to count on all CPUs.
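
           The list syntax accepted by -C can be illustrated with a small
           expansion helper. This is a minimal sketch, not the parser perf
           itself uses. The function name expand_cpu_list is hypothetical,
           and it assumes a well-formed list of non-negative integers with
           ascending ranges.

```shell
#!/bin/sh
# Hypothetical helper: expand a -C style CPU list such as "0,2-4" into
# individual CPU numbers. The body runs in a subshell so that changing IFS
# does not leak into the caller.
expand_cpu_list() (
    out=""
    IFS=,
    for part in $1; do
        case $part in
            *-*) first=${part%-*}; last=${part#*-} ;;   # a range like 2-4
            *)   first=$part; last=$part ;;             # a single CPU
        esac
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            out="$out${out:+ }$cpu"
            cpu=$((cpu + 1))
        done
    done
    printf '%s\n' "$out"
)

# "0,2-4" names CPUs 0, 2, 3 and 4, as in: perf stat -a -C 0,2-4 -- <command>
expand_cpu_list 0,2-4
```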

       -A, --no-aggr
           Do not aggregate counts across all monitored CPUs.

       -n, --null
           null run - don’t start any counters

       -v, --verbose
           be more verbose (show counter open errors, etc)

       -x SEP, --field-separator SEP
           print counts using a CSV-style output to make it easy to import
           directly into spreadsheets. Columns are separated by the string
           specified in SEP.

       -G name, --cgroup name
           monitor only in the container (cgroup) called "name". This option
           is available only in per-cpu mode. The cgroup filesystem must be
           mounted. All threads belonging to container "name" are monitored
           when they run on the monitored CPUs. Multiple cgroups can be
           provided. Each cgroup is applied to the corresponding event, i.e.,
           first cgroup to first event, second cgroup to second event and so
           on. It is possible to provide an empty cgroup (monitor all the
           time) using, e.g., -G foo,,bar. Cgroups must have corresponding
           events, i.e., they always refer to events defined earlier on the
           command line.
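
           The positional pairing of cgroups with events can be pictured with
           the sketch below. It only mirrors the rule described above and is
           not how perf parses its arguments. The function name pair_cgroups
           is hypothetical, and the sketch assumes plain event names without
           embedded commas (so not the pmu/.../ form).

```shell
#!/bin/sh
# Hypothetical helper: show which cgroup applies to which event when the
# two comma-separated lists are matched position by position.
# $1 = event list as given to -e, $2 = cgroup list as given to -G.
pair_cgroups() (
    events=$1
    cgroups=$2
    while [ -n "$events" ]; do
        event=${events%%,*}
        cgroup=${cgroups%%,*}
        # An empty cgroup entry means the event is not restricted.
        printf '%s -> %s\n' "$event" "${cgroup:-(no cgroup)}"
        case $events in
            *,*) events=${events#*,} ;;
            *)   events="" ;;
        esac
        case $cgroups in
            *,*) cgroups=${cgroups#*,} ;;
            *)   cgroups="" ;;
        esac
    done
)

# Mirrors: perf stat -a -e cycles,instructions,cache-misses -G foo,,bar ...
pair_cgroups cycles,instructions,cache-misses foo,,bar
```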

       -o file, --output file
           Print the output into the designated file.

       --append
           Append to the output file designated with the -o option. Ignored if
           -o is not specified.

       --log-fd
           Log output to fd, instead of stderr. Complementary to --output, and
           mutually exclusive with it. --append may be used here. Examples:

               3>results  perf stat --log-fd 3          -- $cmd
               3>>results perf stat --log-fd 3 --append -- $cmd

       --pre, --post
           Pre and post measurement hooks, e.g.:

               perf stat --repeat 10 --null --sync \
                   --pre 'make -s O=defconfig-build/clean' \
                   -- make -s -j64 O=defconfig-build/ bzImage

       -I msecs, --interval-print msecs
           Print count deltas every msecs milliseconds (minimum: 10ms). The
           overhead percentage could be high in some cases, for instance with
           small, sub-100ms intervals. Use with caution. Example:

               perf stat -I 1000 -e cycles -a sleep 5

       --metric-only
           Only print computed metrics. Print them in a single line. Don’t
           show any raw values. Not supported with --per-thread.

       --per-socket
           Aggregate counts per processor socket for system-wide mode
           measurements. This is a useful mode to detect imbalance between
           sockets. To enable this mode, use --per-socket in addition to -a
           (system-wide). The output includes the socket number and the number
           of online processors on that socket. This is useful to gauge the
           amount of aggregation.

       --per-core
           Aggregate counts per physical processor for system-wide mode
           measurements. This is a useful mode to detect imbalance between
           physical cores. To enable this mode, use --per-core in addition to
           -a (system-wide). The output includes the core number and the
           number of online logical processors on that physical processor.

       --per-thread
           Aggregate counts per monitored thread, when monitoring threads (-t
           option) or processes (-p option).

       -D msecs, --delay msecs
           After starting the program, wait msecs before measuring. This is
           useful to filter out the startup phase of the program, which is
           often very different.

       -T, --transaction
           Print statistics of transactional execution if supported.

STAT RECORD

       Stores stat data into a perf data file.

       -o file, --output file
           Output file name.

STAT REPORT

       Reads and reports stat data from a perf data file.

       -i file, --input file
           Input file name.

       --per-socket
           Aggregate counts per processor socket for system-wide mode
           measurements.

       --per-core
           Aggregate counts per physical processor for system-wide mode
           measurements.

       -M, --metrics
           Print metrics or metricgroups specified in a comma separated list.
           For a group all metrics from the group are added. The events from
           the metrics are automatically measured. See perf list output for
           the possible metrics and metricgroups.

       -A, --no-aggr
           Do not aggregate counts across all monitored CPUs.

       --topdown
           Print top down level 1 metrics if supported by the CPU. This helps
           identify bottlenecks in the CPU pipeline for CPU bound workloads,
           by breaking the cycles consumed down into frontend bound, backend
           bound, bad speculation and retiring.

       Frontend bound means that the CPU cannot fetch and decode instructions
       fast enough. Backend bound means that computation or memory access is
       the bottleneck. Bad speculation means that the CPU wasted cycles due
       to branch mispredictions and similar issues. Retiring means that the
       CPU computed without an apparent bottleneck. The bottleneck is only
       the real bottleneck if the workload is actually bound by the CPU and
       not by something else.

       For best results it is usually a good idea to use it with interval mode
       like -I 1000, as the bottleneck of workloads can change often.

       The top down metrics are collected per core instead of per CPU thread.
       Per core mode is automatically enabled and -a (global monitoring) is
       needed, requiring root rights or kernel.perf_event_paranoid=-1.

       Topdown uses the full Performance Monitoring Unit, and needs disabling
       of the NMI watchdog (as root): echo 0 > /proc/sys/kernel/nmi_watchdog
       for best results. Otherwise the bottlenecks may be inconsistent on
       workloads with changing phases.

       This enables --metric-only, unless overridden with --no-metric-only.

       To interpret the results it is usually necessary to know which CPUs the
       workload runs on. If needed, the CPUs can be forced using taskset.

       --no-merge
           Do not merge results from the same PMUs.

       --smi-cost
           Measure SMI cost if msr/aperf/ and msr/smi/ events are supported.

       During the measurement, /sys/devices/cpu/freeze_on_smi will be set to
       freeze core counters on SMI. The aperf counter will not be affected
       by the setting. The cost of SMI can be measured by (aperf - unhalted
       core cycles).

       In practice, the percentage of SMI cycles is very useful for
       performance oriented analysis. --metric-only will be applied by
       default. The output is SMI cycles%, which equals (aperf - unhalted
       core cycles) / aperf.
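
       The SMI cycles% formula can be worked through with a few lines of
       shell. This is only a sketch of the arithmetic: the function name
       smi_cycles_pct and the counter values passed to it are made up for
       illustration and are not output from a real measurement.

```shell
#!/bin/sh
# Hypothetical helper: SMI cycles% = (aperf - unhalted core cycles) / aperf,
# expressed as a percentage. $1 = aperf count, $2 = unhalted core cycles.
smi_cycles_pct() {
    awk -v aperf="$1" -v unhalted="$2" \
        'BEGIN { printf "%.2f\n", (aperf - unhalted) * 100 / aperf }'
}

# Made-up counts: 10000 of 1000000 aperf cycles were spent in SMI, i.e. 1.00%.
smi_cycles_pct 1000000 990000
```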

       Users who want the actual value can apply --no-metric-only.

EXAMPLES

       $ perf stat -- make -j

           Performance counter stats for 'make -j':

           8117.370256  task clock ticks     #      11.281 CPU utilization factor
                   678  context switches     #       0.000 M/sec
                   133  CPU migrations       #       0.000 M/sec
                235724  pagefaults           #       0.029 M/sec
           24821162526  CPU cycles           #    3057.784 M/sec
           18687303457  instructions         #    2302.138 M/sec
             172158895  cache references     #      21.209 M/sec
              27075259  cache misses         #       3.335 M/sec

           Wall-clock time elapsed:   719.554352 msecs
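
       Some of the derived figures in the sample output can be reproduced by
       hand. The sketch below is illustrative only. The ratio helper is
       hypothetical, and the task clock value is assumed to be in msecs,
       which the matching 11.281 figure suggests. Instructions per cycle and
       the cache miss ratio are derived here from the counts above and are
       not printed in this sample.

```shell
#!/bin/sh
# Hypothetical helper: print $1 / $2 with $3 digits after the decimal point.
ratio() {
    awk -v a="$1" -v b="$2" -v digits="$3" \
        'BEGIN { fmt = "%." digits "f\n"; printf fmt, a / b }'
}

# CPU utilization factor: task clock over wall-clock time (both in msecs).
ratio 8117.370256 719.554352 3      # 11.281, as in the sample output
# Instructions per cycle.
ratio 18687303457 24821162526 2     # 0.75
# Fraction of cache references that missed.
ratio 27075259 172158895 3          # 0.157
```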

CSV FORMAT

       With -x, perf stat can produce not-quite-CSV output: commas in the
       output are not quoted with "". To make the output easy to parse, use
       a different separator character, such as -x \;

       The fields are in this order:

       ·   optional usec time stamp in fractions of a second (with -I xxx)

       ·   optional CPU, core, or socket identifier

       ·   optional number of logical CPUs aggregated

       ·   counter value

       ·   unit of the counter value or empty

       ·   event name

       ·   run time of counter

       ·   percentage of measurement time the counter was running

       ·   optional variance if multiple values are collected with -r

       ·   optional metric value

       ·   optional unit of metric

       Additional metrics may be printed with all earlier fields being empty.
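
       The field order above can be consumed with a short awk filter. This is
       a minimal sketch under stated assumptions: the output was produced
       with -x \; and without -I, -A or the --per-* options, so the counter
       value is the first field and the event name the third. The function
       name parse_perf_csv and the sample lines in the usage are made up for
       illustration.

```shell
#!/bin/sh
# Hypothetical helper: read ';'-separated perf stat output on stdin and print
# "event=value" for every line whose first field is a plain number. Lines
# whose counter field is not numeric (for example "<not counted>") and
# lines that carry only additional metrics are skipped.
parse_perf_csv() {
    awk -F';' '$1 ~ /^[0-9]+([.][0-9]+)?$/ { printf "%s=%s\n", $3, $1 }'
}

# Made-up sample lines standing in for the output of something like:
#   perf stat -x \; -e cycles,instructions -o stat.csv -- <command>
printf '%s\n' \
    '1200;;cycles;1000000;100.00' \
    '<not counted>;;instructions;0;0.00' | parse_perf_csv
```

       With the optional leading fields present (time stamp, CPU identifier,
       aggregated CPU count), the field numbers in the filter would need to be
       shifted accordingly.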