PERF-TOP(1)                       perf Manual                      PERF-TOP(1)

NAME

       perf-top - System profiling tool.

SYNOPSIS

       perf top [-e <EVENT> | --event=EVENT] [<options>]

DESCRIPTION

       This command generates and displays a performance counter profile in
       real time.

OPTIONS

       -a, --all-cpus
           System-wide collection. (default)

       -c <count>, --count=<count>
           Event period to sample.

       -C <cpu-list>, --cpu=<cpu>
           Monitor only on the list of CPUs provided. Multiple CPUs can be
           provided as a comma-separated list with no space: 0,1. Ranges of
           CPUs are specified with -: 0-2. Default is to monitor all CPUs.
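
           As an illustration of the list syntax above, the following Python
           sketch expands a CPU list into individual CPU numbers. The function
           name parse_cpu_list is hypothetical and this is not perf's own
           parser; it only mirrors the comma and range forms described here.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "0,1" or "0-2" into a sorted list of ints."""
    if " " in spec:
        # The option text asks for a list with no spaces.
        raise ValueError("CPU list must not contain spaces")
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            low, high = int(first), int(last)
            if low > high:
                raise ValueError(f"descending range: {part!r}")
            cpus.update(range(low, high + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


print(parse_cpu_list("0-2,4"))  # prints [0, 1, 2, 4]
```

           Rejecting descending ranges is a choice made for this sketch, not
           documented perf behavior.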

       -d <seconds>, --delay=<seconds>
           Number of seconds to delay between refreshes.

       -e <event>, --event=<event>
           Select the PMU event. Selection can be a symbolic event name (use
           perf list to list all events) or a raw PMU event in the form of rN
           where N is a hexadecimal value that represents the raw register
           encoding with the layout of the event control registers as
           described by entries in /sys/bus/event_source/devices/cpu/format/*.
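
           To make the rN form more concrete, the sketch below packs named
           field values into a single hexadecimal value using format entries
           of the kind found in the format directory (for example
           "config:0-7"). The helper names are hypothetical, and the layout in
           ASSUMED_FORMATS (event in bits 0-7, umask in bits 8-15) is an
           assumption that is common on x86; read the real layout from the
           format files on the target machine.

```python
def parse_format(spec):
    """Parse a format entry such as "config:0-7" or "config:0-7,32-35".

    Returns a list of (low_bit, high_bit) tuples, lowest value bits first.
    """
    target, _, bits = spec.strip().partition(":")
    if target != "config":
        raise ValueError(f"this sketch only handles 'config' fields: {spec!r}")
    ranges = []
    for part in bits.split(","):
        low, _, high = part.partition("-")
        ranges.append((int(low), int(high or low)))
    return ranges


def encode_raw_event(formats, values):
    """Pack field values into one config word and return it as an rN string."""
    config = 0
    for name, value in values.items():
        for low, high in parse_format(formats[name]):
            width = high - low + 1
            config |= (value & ((1 << width) - 1)) << low
            value >>= width
        if value:
            raise ValueError(f"value for {name!r} does not fit its field")
    return f"r{config:x}"


# Assumed layout, as commonly seen on x86; not read from a real system here.
ASSUMED_FORMATS = {"event": "config:0-7", "umask": "config:8-15"}
print(encode_raw_event(ASSUMED_FORMATS, {"event": 0x3C, "umask": 0x01}))  # prints r13c
```

           The resulting string can then be passed to -e in place of a
           symbolic event name.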

       -E <entries>, --entries=<entries>
           Display this many functions.

       -f <count>, --count-filter=<count>
           Only display functions with more events than this.

       --group-sort-idx
           Sort the output by the event at index n in the group. If n is
           invalid, sort by the first event. Multiple groups with different
           numbers of events are supported. WARNING: This should be used on
           grouped events.

       -F <freq>, --freq=<freq>
           Profile at this frequency. Use max to use the current maximum
           allowed frequency, i.e. the value in the
           kernel.perf_event_max_sample_rate sysctl.

       -i, --inherit
           Child tasks do not inherit counters.

       -k <path>, --vmlinux=<path>
           Path to vmlinux. Required for annotation functionality.

       --ignore-vmlinux
           Ignore vmlinux files.

       --kallsyms=<file>
           Path to the kallsyms file.

       -m <pages>, --mmap-pages=<pages>
           Number of mmap data pages (must be a power of two) or a size
           specification with an appended unit character - B/K/M/G. The size
           is rounded up to the nearest power-of-two number of pages.
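
           The rounding rule can be sketched in Python as follows. This is an
           illustration rather than perf's implementation: the function names
           are hypothetical and the 4096-byte page size is an assumption, since
           the real page size depends on the system.

```python
UNITS = {"B": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
ASSUMED_PAGE_SIZE = 4096  # assumption; the real page size is system dependent


def next_power_of_two(n):
    """Return the smallest power of two that is >= n (and at least 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def mmap_pages(spec, page_size=ASSUMED_PAGE_SIZE):
    """Turn an --mmap-pages style argument into a number of pages."""
    spec = spec.strip()
    if not spec:
        raise ValueError("empty specification")
    unit = spec[-1]
    if unit in UNITS:
        size = int(spec[:-1]) * UNITS[unit]
        pages = -(-size // page_size)  # ceiling division
        return next_power_of_two(pages)
    pages = int(spec)
    if pages <= 0 or pages & (pages - 1):
        raise ValueError("a plain page count must be a power of two")
    return pages


print(mmap_pages("3M"))  # 768 pages of 4096 bytes, rounded up: prints 1024
```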

       -p <pid>, --pid=<pid>
           Profile events on existing process IDs (comma-separated list).

       -t <tid>, --tid=<tid>
           Profile events on existing thread IDs (comma-separated list).

       -u, --uid=
           Record events in threads owned by uid. Name or number.

       -r <priority>, --realtime=<priority>
           Collect data with this RT SCHED_FIFO priority.

       --sym-annotate=<symbol>
           Annotate this symbol.

       -K, --hide_kernel_symbols
           Hide kernel symbols.

       -U, --hide_user_symbols
           Hide user symbols.

       --demangle-kernel
           Demangle kernel symbols.

       -D, --dump-symtab
           Dump the symbol table used for profiling.

       -v, --verbose
           Be more verbose (show counter open errors, etc).

       -z, --zero
           Zero history across display updates.

       -s, --sort
           Sort by key(s): pid, comm, dso, symbol, parent, srcline, weight,
           local_weight, abort, in_tx, transaction, overhead, sample, period.
           Please see description of --sort in the perf-report man page.

       --fields=
           Specify output field - multiple keys can be specified in CSV
           format. The following fields are available: overhead, overhead_sys,
           overhead_us, overhead_children, sample and period. It can also
           contain any sort key(s).

               By default, every sort key not specified in --fields is appended
               automatically.

       -n, --show-nr-samples
           Show a column with the number of samples.

       --show-total-period
           Show a column with the sum of periods.

       --dsos
           Only consider symbols in these dsos. This option will affect the
           percentage of the overhead column. See --percentage for more info.

       --comms
           Only consider symbols in these comms. This option will affect the
           percentage of the overhead column. See --percentage for more info.

       --symbols
           Only consider these symbols. This option will affect the percentage
           of the overhead column. See --percentage for more info.

       -M, --disassembler-style=
           Set disassembler style for objdump.

       --addr2line=<path>
           Path to addr2line binary.

       --objdump=<path>
           Path to objdump binary.

       --prefix=PREFIX, --prefix-strip=N
           Remove first N entries from source file path names in executables
           and add PREFIX. This allows displaying source code compiled on
           systems with a different file system layout.
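
           A minimal Python sketch of the path rewriting described above. The
           function name and the example paths are hypothetical, and exactly
           how leading path components are counted is an assumption of this
           sketch rather than something stated here.

```python
from pathlib import PurePosixPath


def remap_source_path(path, prefix, strip):
    """Drop the first `strip` directory entries of `path`, then prepend `prefix`."""
    parts = PurePosixPath(path).parts
    if parts and parts[0] == "/":
        parts = parts[1:]  # the root itself is not counted as an entry here
    return str(PurePosixPath(prefix, *parts[strip:]))


# Hypothetical build-host path mapped onto a local source checkout.
print(remap_source_path("/build/host/src/kernel/sched.c", "/home/me/linux", 3))
# prints /home/me/linux/kernel/sched.c
```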

       --source
           Interleave source code with assembly code. Enabled by default,
           disable with --no-source.

       --asm-raw
           Show raw instruction encoding of assembly instructions.

       -g
           Enables call-graph (stack chain/backtrace) recording.

       --call-graph [mode,type,min[,limit],order[,key][,branch]]
           Setup and enable call-graph (stack chain/backtrace) recording,
           implies -g. See --call-graph section in perf-record and perf-report
           man pages for details.

       --children
           Accumulate callchain of children to parent entry so that they can
           show up in the output. The output will have a new "Children" column
           and will be sorted on the data. It requires the -g/--call-graph
           option to be enabled. See the OVERHEAD CALCULATION section for more
           details. Enabled by default, disable with --no-children.

       --max-stack
           Set the stack depth limit when parsing the callchain; anything
           beyond the specified depth will be ignored. This is a trade-off
           between information loss and faster processing, especially for
           workloads that can have a very long callchain stack.

               Default: /proc/sys/kernel/perf_event_max_stack when present, 127 otherwise.
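
           The default rule and the depth limit can be sketched as below. The
           function names are hypothetical, and the sketch assumes callchains
           are stored innermost frame first, so the frames dropped are the
           outermost ones.

```python
def default_max_stack(path="/proc/sys/kernel/perf_event_max_stack", fallback=127):
    """Return the value stored in `path`, or `fallback` if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return fallback


def limit_callchain(callchain, max_stack):
    """Keep at most `max_stack` frames; anything deeper is ignored."""
    return callchain[:max_stack]


print(limit_callchain(["foo", "bar", "main", "__libc_start_main"], 2))
# prints ['foo', 'bar']
```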

       --ignore-callees=<regex>
           Ignore callees of the function(s) matching the given regex. This
           has the effect of collecting the callers of each such function into
           one place in the call-graph tree.
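
           One way to picture the effect is the Python sketch below, which
           trims a single callchain. It is an illustration under assumptions,
           not perf's code: the function name is hypothetical, the chain is
           assumed to be ordered innermost frame first, and when several
           frames match, the outermost match is treated as the new leaf.

```python
import re


def ignore_callees(callchain, pattern):
    """Drop every frame called (directly or not) by the outermost match."""
    regex = re.compile(pattern)
    cut = 0
    for index, function in enumerate(callchain):
        if regex.search(function):
            cut = index  # later (outer) matches override earlier ones
    return callchain[cut:]


chain = ["memcpy", "do_copy", "handle_request", "main"]
print(ignore_callees(chain, r"^do_copy$"))
# prints ['do_copy', 'handle_request', 'main']
```

           After trimming, every sample that passed through do_copy ends at
           do_copy, so its callers are gathered under a single entry.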

       --percent-limit
           Do not show entries which have an overhead below the given
           percentage. (Default: 0).

       --percentage
           Determine how to display the overhead percentage of filtered
           entries. Filters can be applied by --comms, --dsos and/or --symbols
           options and Zoom operations on the TUI (thread, dso, etc).

               "relative" means it's relative to filtered entries only so that the
               sum of shown entries is always 100%. "absolute" means it retains
               the original value before and after the filter is applied.
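
           The difference between the two modes can be shown with a small
           Python sketch. The function name, the sample periods and the filter
           are all made up for illustration.

```python
def shown_percentages(periods, keep, mode):
    """Compute overhead percentages for the entries that pass `keep`.

    periods: mapping of entry name to summed period.
    mode: "relative" (to the filtered entries) or "absolute" (to all entries).
    """
    shown = {name: period for name, period in periods.items() if keep(name)}
    if mode == "relative":
        base = sum(shown.values())
    elif mode == "absolute":
        base = sum(periods.values())
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if base == 0:
        return {name: 0.0 for name in shown}
    return {name: 100.0 * period / base for name, period in shown.items()}


# Made-up periods per dso; the filter hides the kernel entry.
periods = {"libc.so": 50, "app": 30, "[kernel]": 20}
hide_kernel = lambda name: name != "[kernel]"
print(shown_percentages(periods, hide_kernel, "relative"))  # {'libc.so': 62.5, 'app': 37.5}
print(shown_percentages(periods, hide_kernel, "absolute"))  # {'libc.so': 50.0, 'app': 30.0}
```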

       -w, --column-widths=<width[,width...]>
           Force each column width to the provided list, for large terminal
           readability. 0 means no limit (default behavior).

       --proc-map-timeout
           Processing /proc/XXX/maps for pre-existing threads may take a long
           time, because the file may be huge. A timeout is needed in such
           cases. This option sets the timeout limit. The default value is
           500 ms.

       -b, --branch-any
           Enable taken branch stack sampling. Any type of taken branch may be
           sampled. This is a shortcut for --branch-filter any. See
           --branch-filter for more information.

       -j, --branch-filter
           Enable taken branch stack sampling. Each sample captures a series
           of consecutive taken branches. The number of branches captured with
           each sample depends on the underlying hardware, the type of
           branches of interest, and the executed code. It is possible to
           select the types of branches captured by enabling filters. For a
           full list of modifiers please see the perf-record man page.

               The option requires at least one branch type among any,
               any_call, any_ret, ind_call, cond. The privilege levels may be
               omitted, in which case the privilege levels of the associated
               event are applied to the branch filter. Both kernel (k) and
               hypervisor (hv) privilege levels are subject to permissions.
               When sampling on multiple events, branch stack sampling is
               enabled for all the sampling events. The sampled branch type is
               the same for all events. The various filters must be specified
               as a comma-separated list: --branch-filter any_ret,u,k. Note
               that this feature may not be available on all processors.
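
           The validation rule above (at least one branch type, optional
           privilege levels) can be sketched in Python. The function name is
           hypothetical, and the accepted names are limited to those mentioned
           on this page; perf record documents further modifiers that this
           sketch would reject.

```python
BRANCH_TYPES = {"any", "any_call", "any_ret", "ind_call", "cond"}
PRIVILEGE_LEVELS = {"u", "k", "hv"}


def parse_branch_filter(spec):
    """Split a filter list into (branch_types, privilege_levels)."""
    types, levels = [], []
    for item in spec.split(","):
        if item in BRANCH_TYPES:
            types.append(item)
        elif item in PRIVILEGE_LEVELS:
            levels.append(item)
        else:
            raise ValueError(f"unknown branch filter: {item!r}")
    if not types:
        raise ValueError("at least one branch type is required")
    return types, levels


print(parse_branch_filter("any_ret,u,k"))  # prints (['any_ret'], ['u', 'k'])
```

           An empty list of privilege levels corresponds to the case where the
           levels of the associated event are applied.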

       --branch-history
           Add the addresses of sampled taken branches to the callstack. This
           allows examining the path the program took to each sample.

       --raw-trace
           When displaying traceevent output, do not use print fmt or plugins.

       --hierarchy
           Enable hierarchy output.

       --overwrite
           Enable this to use just the most recent records, which helps on
           high core count machines such as Knights Landing/Mill. It is
           currently disabled by default because the pausing used in this
           technique leads to loss of metadata events such as
           PERF_RECORD_MMAP, which makes perf top unable to resolve samples
           and leads to lots of unknown samples appearing on the UI. Enable
           this if you are on such a machine and profiling a workload that
           doesn’t create short-lived threads and/or doesn’t use many
           executable mmap operations. Work is planned to solve this
           situation; until then, this will remain disabled by default.

       --force
           Don’t do ownership validation.

       --num-thread-synthesize
           The number of threads to run when synthesizing events for existing
           processes. By default, the number of threads equals the number of
           online CPUs.

       --namespaces
           Record events of type PERF_RECORD_NAMESPACES and display them with
           the cgroup_id sort key.

       -G name, --cgroup name
           Monitor only in the container (cgroup) called "name". This option
           is available only in per-cpu mode. The cgroup filesystem must be
           mounted. All threads belonging to container "name" are monitored
           when they run on the monitored CPUs. Multiple cgroups can be
           provided. Each cgroup is applied to the corresponding event, i.e.,
           first cgroup to first event, second cgroup to second event and so
           on. It is possible to provide an empty cgroup (monitor all the
           time) using, e.g., -G foo,,bar. Cgroups must have corresponding
           events, i.e., they always refer to events defined earlier on the
           command line. If the user wants to track multiple events for a
           specific cgroup, the user can use -e e1 -e e2 -G foo,foo or just
           use -e e1 -e e2 -G foo.

       --all-cgroups
           Record events of type PERF_RECORD_CGROUP and display them with the
           cgroup sort key.

       --switch-on EVENT_NAME
           Only consider events after this event is found.

               E.g.:

               Find out where broadcast packets are handled:

               perf probe -L icmp_rcv

               Insert a probe there:

               perf probe icmp_rcv:59

               Start perf top and ask it to only consider the cycles events when a
               broadcast packet arrives. This will show a menu with two entries and
               will start counting when a broadcast packet arrives:

               perf top -e cycles,probe:icmp_rcv --switch-on=probe:icmp_rcv

               Alternatively one can ask for a group and then two overhead columns
               will appear, the first for cycles and the second for the switch-on event.

               perf top -e '{cycles,probe:icmp_rcv}' --switch-on=probe:icmp_rcv

               This can be useful for measuring a workload only after some
               initialization phase is over, i.e. insert a perf probe at that point
               and use the above examples, replacing probe:icmp_rcv with the
               just-after-init probe.

       --switch-off EVENT_NAME
           Stop considering events after this event is found.
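
           The combined effect of --switch-on and --switch-off on a stream of
           events can be pictured with the Python sketch below. It is one
           plausible reading of the two descriptions, not perf's
           implementation: the function name and the probe names are
           hypothetical, the marker events themselves are not kept, and a
           second switch-on event is assumed to reopen the window.

```python
def consider_events(events, switch_on=None, switch_off=None):
    """Return the events that fall inside the switch-on/switch-off window."""
    active = switch_on is None  # with no switch-on event, start out active
    kept = []
    for name in events:
        if not active:
            if name == switch_on:
                active = True
            continue
        if switch_off is not None and name == switch_off:
            active = False
            continue
        kept.append(name)
    return kept


stream = ["cycles", "probe:init_done", "cycles", "cycles", "probe:shutdown", "cycles"]
print(consider_events(stream, "probe:init_done", "probe:shutdown"))
# prints ['cycles', 'cycles']
```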

       --show-on-off-events
           Show the --switch-on/off events too. This has no effect in perf top
           now but probably we’ll make the default not to show the
           switch-on/off events on the --group mode and if there is only one
           event besides the off/on ones, go straight to the histogram
           browser, just like perf top with no events explicitly specified
           does.

       --stitch-lbr
           Show callgraph with stitched LBRs, which may give a more complete
           callgraph. The option must be used with --call-graph lbr recording.
           Disabled by default. In common cases with call stack overflows, it
           can recreate better call stacks than the default lbr call stack
           output. But this approach is not foolproof. There can be cases
           where it creates incorrect call stacks from incorrect matches. The
           known limitations include exception handling such as
           setjmp/longjmp, where calls and returns do not match.

INTERACTIVE PROMPTING KEYS

       [d]
           Display refresh delay.

       [e]
           Number of entries to display.

       [E]
           Event to display when multiple counters are active.

       [f]
           Profile display filter (>= hit count).

       [F]
           Annotation display filter (>= % of total).

       [s]
           Annotate symbol.

       [S]
           Stop annotation, return to full profile display.

       [K]
           Hide kernel symbols.

       [U]
           Hide user symbols.

       [z]
           Toggle event count zeroing across display updates.

       [qQ]
           Quit.

       Pressing any unmapped key displays a menu, and prompts for input.

OVERHEAD CALCULATION

       The overhead can be shown in two columns as Children and Self when perf
       collects callchains. The self overhead is simply calculated by adding
       all period values of the entry - usually a function (symbol). This is
       the value that perf shows traditionally, and the sum of all the self
       overhead values should be 100%.

       The children overhead is calculated by adding all period values of the
       child functions so that it can show the total overhead of the higher
       level functions even if they don’t directly execute much. Children here
       means functions that are called from another (parent) function.

       It might be confusing that the sum of all the children overhead values
       exceeds 100% since each of them is already an accumulation of self
       overhead of its child functions. But with this enabled, users can find
       which function has the most overhead even if samples are spread over
       the children.

       Consider the following example; there are three functions, as shown
       below.

           void foo(void) {
               /* do something */
           }

           void bar(void) {
               /* do something */
               foo();
           }

           int main(void) {
               bar();
               return 0;
           }

       In this case foo is a child of bar, and bar is an immediate child of
       main so foo also is a child of main. In other words, main is a parent
       of foo and bar, and bar is a parent of foo.

       Suppose all samples are recorded in foo and bar only. When it’s
       recorded with callchains the output will show something like below in
       the usual (self-overhead-only) output of perf report:

           Overhead  Symbol
           ........  .....................
             60.00%  foo
                     |
                     --- foo
                         bar
                         main
                         __libc_start_main

             40.00%  bar
                     |
                     --- bar
                         main
                         __libc_start_main

       When the --children option is enabled, the self overhead values of
       child functions (i.e. foo and bar) are added to the parents to
       calculate the children overhead. In this case the report could be
       displayed as:

           Children      Self  Symbol
           ........  ........  ....................
            100.00%     0.00%  __libc_start_main
                     |
                     --- __libc_start_main

            100.00%     0.00%  main
                     |
                     --- main
                         __libc_start_main

            100.00%    40.00%  bar
                     |
                     --- bar
                         main
                         __libc_start_main

             60.00%    60.00%  foo
                     |
                     --- foo
                         bar
                         main
                         __libc_start_main

       In the above output, the self overhead of foo (60%) was added to the
       children overhead of bar, main and __libc_start_main. Likewise, the
       self overhead of bar (40%) was added to the children overhead of main
       and __libc_start_main.

       So __libc_start_main and main are shown first since they have the same
       (100%) children overhead (even though they have zero self overhead) and
       they are the parents of foo and bar.
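
       The arithmetic behind the two tables can be reproduced with a short
       Python sketch. It is a simplified model and not perf's implementation:
       the function name is hypothetical, each sample is a (callchain, period)
       pair with the innermost function first, and counting a function only
       once per sample (relevant for recursion) is a choice made here.

```python
from collections import defaultdict


def overhead(samples):
    """Return {function: (children_percent, self_percent)} for the samples."""
    total = sum(period for _, period in samples)
    self_period = defaultdict(int)
    children_period = defaultdict(int)
    for chain, period in samples:
        self_period[chain[0]] += period  # the sampled function itself
        for function in dict.fromkeys(chain):  # each function once per sample
            children_period[function] += period
    if total == 0:
        return {}
    return {
        function: (100.0 * children_period[function] / total,
                   100.0 * self_period[function] / total)
        for function in children_period
    }


# The foo/bar/main example: 60% of the period in foo, 40% in bar.
samples = [
    (["foo", "bar", "main", "__libc_start_main"], 60),
    (["bar", "main", "__libc_start_main"], 40),
]
for function, (children, own) in overhead(samples).items():
    print(f"{children:7.2f}% {own:7.2f}%  {function}")
```

       For these samples the sketch gives 60%/60% for foo, 100%/40% for bar and
       100%/0% for main and __libc_start_main, matching the Children and Self
       columns shown above.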

       Since v3.16 the children overhead is shown by default and the output is
       sorted by its values. The children overhead is disabled by specifying
       the --no-children option on the command line or by adding
       report.children = false or top.children = false in the perf config
       file.

SEE ALSO

       perf-stat(1), perf-list(1), perf-report(1)

perf                              11/28/2023                       PERF-TOP(1)