PERF-STAT(1)                      perf Manual                     PERF-STAT(1)

NAME
       perf-stat - Run a command and gather performance counter statistics

SYNOPSIS
       perf stat [-e <EVENT> | --event=EVENT] [-a] <command>
       perf stat [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
       perf stat [-e <EVENT> | --event=EVENT] [-a] record [-o file] -- <command> [<options>]
       perf stat report [-i file]

DESCRIPTION
       This command runs a command and gathers performance counter
       statistics from it.

OPTIONS
       <command>...
           Any command you can specify in a shell.

       record
           See STAT RECORD.

       report
           See STAT REPORT.

       -e, --event=
           Select the PMU event. Selection can be:

           •   a symbolic event name (use perf list to list all events)

           •   a raw PMU event in the form of rN where N is a hexadecimal
               value that represents the raw register encoding with the
               layout of the event control registers as described by entries
               in /sys/bus/event_source/devices/cpu/format/*.

           •   a symbolic or raw PMU event followed by an optional colon and
               a list of event modifiers, e.g., cpu-cycles:p. See the
               perf-list(1) man page for details on event modifiers.

           •   a symbolically formed event like pmu/param1=0x3,param2/ where
               param1 and param2 are defined as formats for the PMU in
               /sys/bus/event_source/devices/<pmu>/format/*

               'percore' is an event qualifier that sums up the event counts
               for both hardware threads in a core. For example:

                   perf stat -A -a -e cpu/event,percore=1/,otherevent ...

           •   a symbolically formed event like
               pmu/config=M,config1=N,config2=K/ where M, N, K are numbers
               (in decimal, hex, or octal format). Acceptable values for the
               config, config1 and config2 parameters are defined by the
               corresponding entries in
               /sys/bus/event_source/devices/<pmu>/format/*

           Note that the last two syntaxes support prefix and glob matching
           in the PMU name, to simplify creation of events across multiple
           instances of the same type of PMU in large systems (e.g. memory
           controller PMUs). Multiple PMU instances are typical for uncore
           PMUs, so the prefix 'uncore_' is also ignored when performing
           this match.
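
           For illustration, here are a few event specifications in these
           forms (the event encodings are examples and may not be valid on
           every system):

               # symbolic event name
               perf stat -e cycles -- sleep 1
               # raw PMU event (hexadecimal register encoding)
               perf stat -e r1a -- sleep 1
               # symbolic event with a modifier (count user space only)
               perf stat -e cycles:u -- sleep 1
               # parameterized PMU event
               perf stat -e cpu/config=0x3c/ -- sleep 1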

       -i, --no-inherit
           child tasks do not inherit counters

       -p, --pid=<pid>
           stat events on existing process id (comma separated list)

       -t, --tid=<tid>
           stat events on existing thread id (comma separated list)

       -b, --bpf-prog
           stat events on existing bpf program id (comma separated list),
           requires root rights. bpftool-prog can be used to find the
           program ids of all bpf programs in the system. For example:

               # bpftool prog | head -n 1
               17247: tracepoint  name sys_enter  tag 192d548b9d754067  gpl

               # perf stat -e cycles,instructions --bpf-prog 17247 --timeout 1000

               Performance counter stats for 'BPF program(s) 17247':

                          85,967      cycles
                          28,982      instructions    #    0.34  insn per cycle

                     1.102235068 seconds time elapsed

       --bpf-counters
           Use BPF programs to aggregate readings from perf_events. This
           allows multiple perf-stat sessions that are counting the same
           metric (cycles, instructions, etc.) to share hardware counters.
           To use BPF programs on common events by default, use "perf config
           stat.bpf-counter-events=<list_of_events>".
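
           For example, two sessions counting cycles at the same time can
           share one hardware counter (a sketch; the workload name is
           illustrative and root rights are typically required):

               # session 1, system wide
               perf stat --bpf-counters -e cycles -a -- sleep 10 &
               # session 2, started concurrently, shares the cycles counter
               perf stat --bpf-counters -e cycles -- ./my_workload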

       --bpf-attr-map
           With option "--bpf-counters", different perf-stat sessions share
           information about shared BPF programs and maps via a pinned
           hashmap. Use "--bpf-attr-map" to specify the path of this pinned
           hashmap. The default path is /sys/fs/bpf/perf_attr_map.

       -a, --all-cpus
           system-wide collection from all CPUs (default if no target is
           specified)

       --no-scale
           Don't scale/normalize counter values

       -d, --detailed
           print more detailed statistics, can be specified up to 3 times

               -d:          detailed events, L1 and LLC data cache
               -d -d:       more detailed events, dTLB and iTLB events
               -d -d -d:    very detailed events, adding prefetch events

       -r, --repeat=<n>
           repeat command and print average + stddev (max: 100). 0 means
           forever.

       -B, --big-num
           print large numbers with thousands' separators according to
           locale. Enabled by default. Use "--no-big-num" to disable. The
           default setting can be changed with "perf config
           stat.big-num=false".

       -C, --cpu=
           Count only on the list of CPUs provided. Multiple CPUs can be
           provided as a comma-separated list with no space: 0,1. Ranges of
           CPUs are specified with -: 0-2. In per-thread mode, this option
           is ignored. The -a option is still necessary to activate
           system-wide monitoring. Default is to count on all CPUs.

       -A, --no-aggr
           Do not aggregate counts across all monitored CPUs.

       -n, --null
           null run - Don't start any counters.

           This can be useful to measure just elapsed wall-clock time - or
           to assess the raw overhead of perf stat itself, without running
           any counters.

       -v, --verbose
           be more verbose (show counter open errors, etc)

       -x SEP, --field-separator SEP
           print counts using a CSV-style output to make it easy to import
           directly into spreadsheets. Columns are separated by the string
           specified in SEP.
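
           For example, to use a semicolon as the separator:

               perf stat -x \; -e cycles -- sleep 1

           See the CSV FORMAT section below for the order of the fields.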

       --table
           Display the time for each run (-r option), in a table format,
           e.g.:

               $ perf stat --null -r 5 --table perf bench sched pipe

               Performance counter stats for 'perf bench sched pipe' (5 runs):

                   # Table of individual measurements:
                   5.189 (-0.293) #
                   5.189 (-0.294) #
                   5.186 (-0.296) #
                   5.663 (+0.181) ##
                   6.186 (+0.703) ####

                   # Final result:
                   5.483 +- 0.198 seconds time elapsed ( +- 3.62% )

       -G name, --cgroup name
           monitor only in the container (cgroup) called "name". This option
           is available only in per-cpu mode. The cgroup filesystem must be
           mounted. All threads belonging to container "name" are monitored
           when they run on the monitored CPUs. Multiple cgroups can be
           provided. Each cgroup is applied to the corresponding event,
           i.e., the first cgroup to the first event, the second cgroup to
           the second event, and so on. It is possible to provide an empty
           cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups
           must have corresponding events, i.e., they always refer to events
           defined earlier on the command line. If the user wants to track
           multiple events for a specific cgroup, the user can use -e e1 -e
           e2 -G foo,foo or just use -e e1 -e e2 -G foo.

           To monitor, say, cycles for a cgroup and also system wide, this
           command line can be used: perf stat -e cycles -G cgroup_name -a
           -e cycles.

       --for-each-cgroup name
           Expand the event list for each cgroup in "name" (multiple cgroups
           may be separated by comma). It also supports regex patterns to
           match multiple cgroups. This has the same effect as repeating the
           -e and -G options for each event x name pair, as sketched below.
           This option cannot be used with the -G/--cgroup option.
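
           For example, per the description above, the following two
           invocations should be roughly equivalent (the cgroup names are
           illustrative):

               perf stat --for-each-cgroup A,B -e cycles,instructions -a -- sleep 1
               perf stat -e cycles,instructions -G A,A -e cycles,instructions -G B,B -a -- sleep 1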

       -o file, --output file
           Print the output into the designated file.

       --append
           Append to the output file designated with the -o option. Ignored
           if -o is not specified.

       --log-fd
           Log output to fd, instead of stderr. Complementary to --output,
           and mutually exclusive with it. --append may be used here.
           Examples:

               3>results perf stat --log-fd 3 -- $cmd
               3>>results perf stat --log-fd 3 --append -- $cmd

       --control=fifo:ctl-fifo[,ack-fifo], --control=fd:ctl-fd[,ack-fd]
           ctl-fifo / ack-fifo are opened and used as ctl-fd / ack-fd as
           follows. Listen on the ctl-fd descriptor for commands to control
           measurement (enable: enable events, disable: disable events).
           Measurements can be started with events disabled using the
           --delay=-1 option. Optionally send the control command completion
           (ack\n) to the ack-fd descriptor to synchronize with the
           controlling process. Example of a bash shell script to enable and
           disable events during measurements:

               #!/bin/bash

               ctl_dir=/tmp/

               # control fifo: commands ('enable'/'disable') are written here
               ctl_fifo=${ctl_dir}perf_ctl.fifo
               test -p ${ctl_fifo} && unlink ${ctl_fifo}
               mkfifo ${ctl_fifo}
               exec {ctl_fd}<>${ctl_fifo}

               # ack fifo: perf acknowledges each command here
               ctl_ack_fifo=${ctl_dir}perf_ctl_ack.fifo
               test -p ${ctl_ack_fifo} && unlink ${ctl_ack_fifo}
               mkfifo ${ctl_ack_fifo}
               exec {ctl_fd_ack}<>${ctl_ack_fifo}

               # start perf with events disabled (-D -1); they are toggled
               # through the control descriptor below
               perf stat -D -1 -e cpu-cycles -a -I 1000 \
                         --control fd:${ctl_fd},${ctl_fd_ack} \
                         -- sleep 30 &
               perf_pid=$!

               sleep 5  && echo 'enable'  >&${ctl_fd} && read -u ${ctl_fd_ack} e1 && echo "enabled(${e1})"
               sleep 10 && echo 'disable' >&${ctl_fd} && read -u ${ctl_fd_ack} d1 && echo "disabled(${d1})"

               exec {ctl_fd_ack}>&-
               unlink ${ctl_ack_fifo}

               exec {ctl_fd}>&-
               unlink ${ctl_fifo}

               wait -n ${perf_pid}
               exit $?

       --pre, --post
           Pre and post measurement hooks, e.g.:

               perf stat --repeat 10 --null --sync --pre 'make -s O=defconfig-build/clean' -- make -s -j64 O=defconfig-build/ bzImage

       -I msecs, --interval-print msecs
           Print count deltas every N milliseconds (minimum: 1ms). The
           overhead percentage could be high in some cases, for instance
           with small, sub-100ms intervals. Use with caution. Example:

               perf stat -I 1000 -e cycles -a sleep 5

           If a metric exists, it is calculated from the counts generated in
           this interval and printed after the # character.

       --interval-count times
           Print count deltas a fixed number of times. This option should be
           used together with the "-I" option. Example:

               perf stat -I 1000 --interval-count 2 -e cycles -a

       --interval-clear
           Clear the screen before the next interval.

       --timeout msecs
           Stop the perf stat session and print count deltas after N
           milliseconds (minimum: 10 ms). This option is not supported with
           the "-I" option. Example:

               perf stat --timeout 2000 -e cycles -a

       --metric-only
           Only print computed metrics. Print them in a single line. Don't
           show any raw values. Not supported with --per-thread.

       --per-socket
           Aggregate counts per processor socket for system-wide mode
           measurements. This is a useful mode to detect imbalance between
           sockets. To enable this mode, use --per-socket in addition to -a
           (system-wide). The output includes the socket number and the
           number of online processors on that socket. This is useful to
           gauge the amount of aggregation.

       --per-die
           Aggregate counts per processor die for system-wide mode
           measurements. This is a useful mode to detect imbalance between
           dies. To enable this mode, use --per-die in addition to -a
           (system-wide). The output includes the die number and the number
           of online processors on that die. This is useful to gauge the
           amount of aggregation.

       --per-cache
           Aggregate counts per cache instance for system-wide mode
           measurements. By default, the aggregation happens for the cache
           level at the highest index in the system. To specify a particular
           level, mention the cache level alongside the option in the format
           [Ll][1-9][0-9]*. For example, using option "--per-cache=l3" or
           "--per-cache=L3" will aggregate the information at the boundary
           of the level 3 cache in the system.

       --per-core
           Aggregate counts per physical processor for system-wide mode
           measurements. This is a useful mode to detect imbalance between
           physical cores. To enable this mode, use --per-core in addition
           to -a (system-wide). The output includes the core number and the
           number of online logical processors on that physical processor.

       --per-thread
           Aggregate counts per monitored thread, when monitoring threads
           (-t option) or processes (-p option).

       --per-node
           Aggregate counts per NUMA node for system-wide mode measurements.
           This is a useful mode to detect imbalance between NUMA nodes. To
           enable this mode, use --per-node in addition to -a (system-wide).

       -D msecs, --delay msecs
           After starting the program, wait msecs before measuring (-1:
           start with events disabled). This is useful to filter out the
           startup phase of the program, which is often very different.
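
           For example, to skip the first two seconds of a program's startup
           phase (the workload name is illustrative):

               perf stat -D 2000 -e cycles -- ./my_workload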

       -T, --transaction
           Print statistics of transactional execution if supported.

       --metric-no-group
           By default, events to compute a metric are placed in weak groups.
           The group tries to enforce scheduling all or none of the events.
           The --metric-no-group option places events outside of groups and
           may increase the chance of an event being scheduled, leading to
           more accuracy. However, as the events may not be scheduled
           together, accuracy for metrics like instructions per cycle can be
           lower, since both events may no longer be measured at the same
           time.

       --metric-no-merge
           By default, metric events in different weak groups can be shared
           if one group contains all the events needed by another. In such
           cases one group is eliminated, reducing event multiplexing and
           making it so that certain groups of metrics sum to 100%. A
           downside to sharing a group is that the group may require
           multiplexing, and so accuracy for a small group that need not
           have multiplexing is lowered. This option forbids the event
           merging logic from sharing events between groups and may be used
           to increase accuracy in this case.

       --metric-no-threshold
           Metric thresholds may increase the number of events necessary to
           compute whether a metric has exceeded its threshold expression.
           This may not be desirable, for example, as the events can
           introduce multiplexing. This option disables the adding of
           threshold expression events for a metric. However, if there are
           sufficient events to compute the threshold, then the threshold is
           still computed and used to color the metric's computed value.

       --quiet
           Don't print output, warnings or messages. This is useful with
           perf stat record below to only write data to the perf.data file.

STAT RECORD
       Stores stat data into a perf data file.

       -o file, --output file
           Output file name.

STAT REPORT
       Reads and reports stat data from a perf data file.

       -i file, --input file
           Input file name.

       --per-socket
           Aggregate counts per processor socket for system-wide mode
           measurements.

       --per-die
           Aggregate counts per processor die for system-wide mode
           measurements.

       --per-cache
           Aggregate counts per cache instance for system-wide mode
           measurements. By default, the aggregation happens for the cache
           level at the highest index in the system. To specify a particular
           level, mention the cache level alongside the option in the format
           [Ll][1-9][0-9]*. For example, using option "--per-cache=l3" or
           "--per-cache=L3" will aggregate the information at the boundary
           of the level 3 cache in the system.

       --per-core
           Aggregate counts per physical processor for system-wide mode
           measurements.

       -M, --metrics
           Print metrics or metricgroups specified in a comma separated
           list. For a group, all metrics from the group are added. The
           events from the metrics are automatically measured. See the perf
           list output for the possible metrics and metricgroups.

           When threshold information is available for a metric, the color
           red is used to signify that a metric has exceeded a threshold,
           while green shows it hasn't. The default color means that no
           threshold information was available or the threshold couldn't be
           computed.

       -A, --no-aggr
           Do not aggregate counts across all monitored CPUs.

       --topdown
           Print top-down metrics supported by the CPU. This makes it
           possible to determine bottlenecks in the CPU pipeline for
           CPU-bound workloads, by breaking down the cycles consumed into
           frontend bound, backend bound, bad speculation and retiring.

           Frontend bound means that the CPU cannot fetch and decode
           instructions fast enough. Backend bound means that computation or
           memory access is the bottleneck. Bad speculation means that the
           CPU wasted cycles due to branch mispredictions and similar
           issues. Retiring means that the CPU computed without an apparent
           bottleneck. The bottleneck is only the real bottleneck if the
           workload is actually bound by the CPU and not by something else.

           For best results it is usually a good idea to use it with
           interval mode like -I 1000, as the bottleneck of workloads can
           change often.

           This enables --metric-only, unless overridden with
           --no-metric-only.

           The following restrictions only apply to older Intel CPUs and
           Atom; on newer CPUs (IceLake and later) TopDown can be collected
           for any thread:

           The top-down metrics are collected per core instead of per CPU
           thread. Per-core mode is automatically enabled and -a (global
           monitoring) is needed, requiring root rights or
           kernel.perf_event_paranoid=-1.

           Topdown uses the full Performance Monitoring Unit, and for best
           results needs the NMI watchdog disabled (as root):

               echo 0 > /proc/sys/kernel/nmi_watchdog

           Otherwise the bottlenecks may be inconsistent on workloads with
           changing phases.

           To interpret the results it is usually necessary to know on which
           CPUs the workload runs. If needed, the CPUs can be forced using
           taskset.

       --td-level
           Print the top-down statistics at the given level. This allows
           users to print the top-down metrics at the level of interest
           instead of the level 1 top-down metrics.

           As the higher levels gather more metrics and use more counters,
           they will be less accurate. By convention, a metric can be
           examined by appending _group to it, and this will increase
           accuracy compared to gathering all metrics for a level. For
           example, level 1 analysis may highlight tma_frontend_bound. This
           metric may be drilled into with tma_frontend_bound_group, using
           perf stat -M tma_frontend_bound_group...

           Error out if the input is higher than the supported max level.
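
           For example (the metric and group names follow the TMA naming
           convention and depend on the CPU):

               # print only the level 2 top-down metrics
               perf stat --td-level=2 -a -- sleep 1
               # drill into the frontend bound category
               perf stat -M tma_frontend_bound_group -a -- sleep 1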

       --no-merge
           Do not merge results from the same PMUs.

           When multiple events are created from a single event
           specification, stat will, by default, aggregate the event counts
           and show the result in a single row. This option disables that
           behavior and shows the individual events and counts.

           Multiple events are created from a single event specification
           when:

               1. Prefix or glob matching is used for the PMU name.

               2. Aliases, which are listed immediately after the Kernel PMU
                  events by perf list, are used.
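
           As a sketch, on a system with several memory controller (IMC)
           uncore PMU instances (the PMU and event names are illustrative):

               # one merged row covering all matching uncore_imc instances
               perf stat -a -e uncore_imc/cas_count_read/ -- sleep 1
               # one row per PMU instance
               perf stat -a --no-merge -e uncore_imc/cas_count_read/ -- sleep 1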

       --hybrid-merge
           Merge the hybrid event counts from all PMUs.

           For hybrid events, by default, stat aggregates and reports the
           event counts per PMU. But sometimes it is also useful to
           aggregate the event counts from all PMUs. This option enables
           that behavior and reports the counts without PMUs.

           For non-hybrid events, this option has no effect.

       --smi-cost
           Measure the SMI cost if the msr/aperf/ and msr/smi/ events are
           supported.

           During the measurement, /sys/devices/cpu/freeze_on_smi will be
           set to freeze core counters on SMI. The aperf counter will not be
           affected by the setting. The cost of SMI can be measured as
           (aperf - unhalted core cycles).

           In practice, the percentage of SMI cycles is very useful for
           performance-oriented analysis. --metric-only will be applied by
           default. The output is SMI cycles%, which equals (aperf -
           unhalted core cycles) / aperf.

           Users who want to get the actual value can apply
           --no-metric-only.

       --all-kernel
           Configure all used events to run in kernel space.

       --all-user
           Configure all used events to run in user space.

       --percore-show-thread
           The event modifier "percore" sums up the event counts for all
           hardware threads in a core and shows the counts per core.

           With the "percore" event modifier enabled, this option also sums
           up the event counts for all hardware threads in a core, but shows
           the summed counts per hardware thread. This is essentially a
           replacement for the any bit, and it is convenient for post
           processing.
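
           For example (the raw event encoding 0x3c is illustrative and
           Intel-specific):

               perf stat -a -A --percore-show-thread \
                         -e cpu/event=0x3c,percore=1/ -- sleep 1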

       --summary
           Print a summary for interval mode (-I).

       --no-csv-summary
           Don't print the summary in the first column for CSV summary
           output. This option must be used with -x and --summary.

           This option can be enabled in perf config by setting the variable
           stat.no-csv-summary:

               $ perf config stat.no-csv-summary=true

       --cputype
           Only enable events on CPUs of this type on a hybrid platform
           (e.g. core or atom).
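
           For example, to count cycles only on the atom CPUs of a hybrid
           system:

               perf stat --cputype atom -e cycles -a -- sleep 1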

EXAMPLES
       $ perf stat -- make

           Performance counter stats for 'make':

              83723.452481      task-clock:u (msec)       #    1.004 CPUs utilized
                         0      context-switches:u        #    0.000 K/sec
                         0      cpu-migrations:u          #    0.000 K/sec
                 3,228,188      page-faults:u             #    0.039 M/sec
           229,570,665,834      cycles:u                  #    2.742 GHz
           313,163,853,778      instructions:u            #    1.36  insn per cycle
            69,704,684,856      branches:u                #  832.559 M/sec
             2,078,861,393      branch-misses:u           #    2.98% of all branches

              83.409183620 seconds time elapsed

              74.684747000 seconds user
               8.739217000 seconds sys

TIMINGS
       As displayed in the example above, we can display 3 types of timings.
       We always display the time the counters were enabled/alive:

           83.409183620 seconds time elapsed

       For workload sessions we also display the time the workloads spent in
       user/system lands:

           74.684747000 seconds user
            8.739217000 seconds sys

       Those times are the very same as displayed by the time tool.

CSV FORMAT
       With -x, perf stat is able to output a not-quite-CSV format; commas
       in the output are not put into "". To make it easy to parse, it is
       recommended to use a different character like -x \;

       The fields are in this order:

       •   optional usec time stamp in fractions of second (with -I xxx)

       •   optional CPU, core, or socket identifier

       •   optional number of logical CPUs aggregated

       •   counter value

       •   unit of the counter value or empty

       •   event name

       •   run time of counter

       •   percentage of measurement time the counter was running

       •   optional variance if multiple values are collected with -r

       •   optional metric value

       •   optional unit of metric

       Additional metrics may be printed with all earlier fields being
       empty.
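
       As an illustrative sketch (the values are made up), a line produced
       by perf stat -x \; -e cycles -- sleep 1 could look like:

           1234567;;cycles;1002478294;100.00;;

       following the field order above: counter value, empty unit, event
       name, run time, percentage running, and empty metric fields.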

INTEL HYBRID SUPPORT
       Support for Intel hybrid events within perf tools.

       Some Intel platforms, such as AlderLake, are hybrid platforms,
       consisting of atom cpus and core cpus. Each cpu type has a dedicated
       event list. Some events are available on the core cpus, some on the
       atom cpus, and some on both.

       The kernel exports two new cpu pmus via sysfs:

           /sys/devices/cpu_core
           /sys/devices/cpu_atom

       The cpus files are created under those directories. For example:

           $ cat /sys/devices/cpu_core/cpus
           0-15

           $ cat /sys/devices/cpu_atom/cpus
           16-23

       This indicates that cpu0-cpu15 are core cpus and cpu16-cpu23 are atom
       cpus.

       As before, use perf-list to list the symbolic events:

           $ perf list

           inst_retired.any
               [Fixed Counter: Counts the number of instructions retired.
                Unit: cpu_atom]
           inst_retired.any
               [Number of instructions retired. Fixed Counter -
                architectural event. Unit: cpu_core]

       The "Unit: xxx" is added to the brief description to indicate which
       pmu the event belongs to. The same event name can be supported on
       different pmus.

       Enable a hybrid event with a specific pmu

       To enable a core-only event or an atom-only event, the following
       syntax is supported:

           cpu_core/<event name>/
           or
           cpu_atom/<event name>/

       For example, to count the cycles event on core cpus:

           perf stat -e cpu_core/cycles/

       Create two events for one hardware event automatically

       When creating one event that is available on both atom and core, two
       events are created automatically: one for atom, the other for core.
       Most hardware events and cache events are available on both cpu_core
       and cpu_atom.

       Hardware events have pre-defined configs (e.g. 0 for cycles). But on
       a hybrid platform, the kernel needs to know whether the event comes
       from atom or from core. The original perf event type
       PERF_TYPE_HARDWARE can't carry pmu information, so this type is now
       extended to be PMU-aware. The PMU type ID is stored at
       attr.config[63:32].

       The PMU type ID is retrieved from sysfs:

           /sys/devices/cpu_atom/type
           /sys/devices/cpu_core/type

       The new attr.config layout for PERF_TYPE_HARDWARE:

           PERF_TYPE_HARDWARE:       0xEEEEEEEE000000AA
                                     AA: hardware event ID
                                     EEEEEEEE: PMU type ID
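
       As a sketch, such a config value can be composed from sysfs in the
       shell (shown for illustration only):

           # compose attr.config for the "cycles" hardware event (ID 0)
           # on the core pmu of a hybrid system
           core_type=$(cat /sys/devices/cpu_core/type)
           printf 'config: %#x\n' $(( (core_type << 32) | 0x0 ))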

       Cache events are similar. The type PERF_TYPE_HW_CACHE is extended to
       be PMU-aware; the PMU type ID is stored at attr.config[63:32].

       The new attr.config layout for PERF_TYPE_HW_CACHE:

           PERF_TYPE_HW_CACHE:       0xEEEEEEEE00DDCCBB
                                     BB: hardware cache ID
                                     CC: hardware cache op ID
                                     DD: hardware cache op result ID
                                     EEEEEEEE: PMU type ID

       When enabling a hardware event without a specified pmu, such as perf
       stat -e cycles -a (using system-wide in this example), two events are
       created automatically:

           ------------------------------------------------------------
           perf_event_attr:
             size                             120
             config                           0x400000000
             sample_type                      IDENTIFIER
             read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
             disabled                         1
             inherit                          1
             exclude_guest                    1
           ------------------------------------------------------------

       and

           ------------------------------------------------------------
           perf_event_attr:
             size                             120
             config                           0x800000000
             sample_type                      IDENTIFIER
             read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
             disabled                         1
             inherit                          1
             exclude_guest                    1
           ------------------------------------------------------------

       Type 0 is PERF_TYPE_HARDWARE. The 0x4 in 0x400000000 indicates the
       cpu_core pmu; the 0x8 in 0x800000000 indicates the cpu_atom pmu (the
       atom pmu type id is assigned dynamically).

       The kernel creates cycles (0x400000000) on cpu0-cpu15 (core cpus),
       and creates cycles (0x800000000) on cpu16-cpu23 (atom cpus).

       In the perf-stat result, two events are displayed:

           Performance counter stats for 'system wide':

                    6,744,979      cpu_core/cycles/
                    1,965,552      cpu_atom/cycles/

       The first cycles is the core event, the second cycles is the atom
       event.

       Thread mode example:

       perf-stat reports the scaled counts for hybrid events, with a
       percentage displayed. The percentage is the event's running
       time/enabling time.

       In one example, triad_loop runs on cpu16 (an atom core), and we can
       see the scaled value for core cycles is 160,444,092 and the
       percentage is 0.47%.

           perf stat -e cycles -- taskset -c 16 ./triad_loop

       As before, two events are created:

           ------------------------------------------------------------
           perf_event_attr:
             size                             120
             config                           0x400000000
             sample_type                      IDENTIFIER
             read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
             disabled                         1
             inherit                          1
             enable_on_exec                   1
             exclude_guest                    1
           ------------------------------------------------------------

       and

           ------------------------------------------------------------
           perf_event_attr:
             size                             120
             config                           0x800000000
             sample_type                      IDENTIFIER
             read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
             disabled                         1
             inherit                          1
             enable_on_exec                   1
             exclude_guest                    1
           ------------------------------------------------------------

           Performance counter stats for 'taskset -c 16 ./triad_loop':

                  233,066,666      cpu_core/cycles/       (0.43%)
                  604,097,080      cpu_atom/cycles/       (99.57%)

       perf-record:

       If there is no -e specified in perf record on a hybrid platform, it
       creates two default cycles events and adds them to the event list:
       one for core, the other for atom.

       perf-stat:

       If there is no -e specified in perf stat on a hybrid platform then,
       besides the software events, the following events are created and
       added to the event list in order:

           cpu_core/cycles/, cpu_atom/cycles/, cpu_core/instructions/,
           cpu_atom/instructions/, cpu_core/branches/, cpu_atom/branches/,
           cpu_core/branch-misses/, cpu_atom/branch-misses/

       Of course, both perf-stat and perf-record support enabling a hybrid
       event with a specific pmu, e.g.:

           perf stat -e cpu_core/cycles/
           perf stat -e cpu_atom/cycles/
           perf stat -e cpu_core/r1a/
           perf stat -e cpu_atom/L1-icache-loads/
           perf stat -e cpu_core/cycles/,cpu_atom/instructions/
           perf stat -e '{cpu_core/cycles/,cpu_core/instructions/}'

       But {cpu_core/cycles/,cpu_atom/instructions/} will trigger a warning
       and disable grouping, because the pmus in the group are not matched
       (cpu_core vs. cpu_atom).

JSON FORMAT
       With -j, perf stat is able to print out a JSON format output that can
       be used for parsing.

       •   timestamp : optional usec time stamp in fractions of second (with
           -I)

       •   optional aggregate options:

           •   core : core identifier (with --per-core)

           •   die : die identifier (with --per-die)

           •   socket : socket identifier (with --per-socket)

           •   node : node identifier (with --per-node)

           •   thread : thread identifier (with --per-thread)

       •   counter-value : counter value

       •   unit : unit of the counter value or empty

       •   event : event name

       •   variance : optional variance if multiple values are collected
           (with -r)

       •   runtime : run time of counter

       •   metric-value : optional metric value

       •   metric-unit : optional unit of metric
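
       As an illustrative sketch using the field names listed above (the
       values are made up, and exact keys and formatting may vary across
       perf versions), a record printed by perf stat -j -e cycles -- sleep 1
       could look like:

           {"counter-value" : "1234567.000000", "unit" : "", "event" : "cycles", "runtime" : 1002478294, "metric-value" : "0.500000", "metric-unit" : "GHz"}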

SEE ALSO
       perf-top(1), perf-list(1)

perf                              11/28/2023                      PERF-STAT(1)