PERF-TOP(1)                       perf Manual                      PERF-TOP(1)

NAME

       perf-top - System profiling tool.

SYNOPSIS

       perf top [-e <EVENT> | --event=EVENT] [<options>]

DESCRIPTION

       This command generates and displays a performance counter profile in
       real time.

OPTIONS

       -a, --all-cpus
           System-wide collection. (default)

       -c <count>, --count=<count>
           Event period to sample.

       -C <cpu-list>, --cpu=<cpu-list>
           Monitor only the CPUs in the list provided. Multiple CPUs can be
           given as a comma-separated list with no spaces: 0,1. Ranges of
           CPUs are specified with -: 0-2. The default is to monitor all CPUs.
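
           The list syntax above can be illustrated with a short Python
           sketch. It is not perf's own parser: the function name
           parse_cpu_list is hypothetical and error handling is minimal.

```python
def parse_cpu_list(spec: str) -> list[int]:
    """Expand a CPU list such as "0,1" or "0-2" into sorted CPU numbers.

    Hypothetical helper that only mirrors the syntax described for -C;
    it is not how perf itself parses the option.
    """
    cpus: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            # A range such as "0-2" covers both endpoints.
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"bad CPU range: {part!r}")
            cpus.update(range(first, last + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)
```

           Under these assumptions, lists and ranges can be mixed in one
           argument, for example parse_cpu_list("0-2,5").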

       -d <seconds>, --delay=<seconds>
           Number of seconds to delay between refreshes.

       -e <event>, --event=<event>
           Select the PMU event. Selection can be a symbolic event name (use
           perf list to list all events) or a raw PMU event in the form of rN
           where N is a hexadecimal value that represents the raw register
           encoding with the layout of the event control registers as
           described by entries in /sys/bus/event_source/devices/cpu/format/*.

       -E <entries>, --entries=<entries>
           Display this many functions.

       -f <count>, --count-filter=<count>
           Only display functions with more events than this.

       --group
           Put the counters into a counter group.

       --group-sort-idx
           Sort the output by the event at index n in the group. If n is
           invalid, sort by the first event. Multiple groups with different
           numbers of events are supported. WARNING: This should be used on
           grouped events.

       -F <freq>, --freq=<freq>
           Profile at this frequency. Use max to use the current maximum
           allowed frequency, i.e. the value in the
           kernel.perf_event_max_sample_rate sysctl.

       -i, --no-inherit
           Child tasks do not inherit counters.

       -k <path>, --vmlinux=<path>
           Path to vmlinux. Required for annotation functionality.

       --ignore-vmlinux
           Ignore vmlinux files.

       --kallsyms=<file>
           Path to the kallsyms file.

       -m <pages>, --mmap-pages=<pages>
           Number of mmap data pages (must be a power of two) or a size
           specification with an appended unit character - B/K/M/G. The size
           is rounded up to the nearest power-of-two number of pages.
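
           The rounding rule can be sketched in Python as follows. This is
           an illustration only, not perf's implementation: the names
           mmap_pages and round_up_pow2 are hypothetical, a 4 KiB page size
           is assumed, and the K/M/G units are assumed to be powers of 1024.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; the real value is system-dependent

# Assumption: unit characters denote powers of 1024.
UNITS = {"B": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def round_up_pow2(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def mmap_pages(spec: str) -> int:
    """Translate a -m style argument into a power-of-two page count."""
    unit = spec[-1].upper()
    if unit in UNITS:
        size = int(spec[:-1]) * UNITS[unit]
        pages = -(-size // PAGE_SIZE)  # ceiling division: bytes -> pages
        return round_up_pow2(max(pages, 1))
    # A plain number is a page count and must already be a power of two.
    pages = int(spec)
    if pages < 1 or pages & (pages - 1):
        raise ValueError("page count must be a power of two")
    return pages
```

           With these assumptions, a size of 10K needs three pages and is
           therefore rounded up to four.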

       -p <pid>, --pid=<pid>
           Profile events on existing process IDs (comma-separated list).

       -t <tid>, --tid=<tid>
           Profile events on existing thread IDs (comma-separated list).

       -u, --uid=
           Record events in threads owned by uid. Name or number.

       -r <priority>, --realtime=<priority>
           Collect data with this RT SCHED_FIFO priority.

       --sym-annotate=<symbol>
           Annotate this symbol.

       -K, --hide_kernel_symbols
           Hide kernel symbols.

       -U, --hide_user_symbols
           Hide user symbols.

       --demangle-kernel
           Demangle kernel symbols.

       -D, --dump-symtab
           Dump the symbol table used for profiling.

       -v, --verbose
           Be more verbose (show counter open errors, etc).

       -z, --zero
           Zero history across display updates.

       -s, --sort
           Sort by key(s): pid, comm, dso, symbol, parent, srcline, weight,
           local_weight, abort, in_tx, transaction, overhead, sample, period.
           Please see the description of --sort in the perf-report man page.

       --fields=
           Specify output fields - multiple keys can be specified in CSV
           format. The following fields are available: overhead, overhead_sys,
           overhead_us, overhead_children, sample and period. It can also
           contain any sort key(s).

           By default, every sort key not specified in --fields is appended
           automatically.

       -n, --show-nr-samples
           Show a column with the number of samples.

       --show-total-period
           Show a column with the sum of periods.

       --dsos
           Only consider symbols in these dsos. This option will affect the
           percentage of the overhead column. See --percentage for more info.

       --comms
           Only consider symbols in these comms. This option will affect the
           percentage of the overhead column. See --percentage for more info.

       --symbols
           Only consider these symbols. This option will affect the percentage
           of the overhead column. See --percentage for more info.

       -M, --disassembler-style=
           Set disassembler style for objdump.

       --prefix=PREFIX, --prefix-strip=N
           Remove first N entries from source file path names in executables
           and add PREFIX. This allows displaying source code compiled on
           systems with a different file system layout.

       --source
           Interleave source code with assembly code. Enabled by default,
           disable with --no-source.

       --asm-raw
           Show raw instruction encoding of assembly instructions.

       -g
           Enables call-graph (stack chain/backtrace) recording.

       --call-graph [mode,type,min[,limit],order[,key][,branch]]
           Setup and enable call-graph (stack chain/backtrace) recording,
           implies -g. See --call-graph section in perf-record and perf-report
           man pages for details.

       --children
           Accumulate callchain of children to parent entry so that they can
           show up in the output. The output will have a new "Children" column
           and will be sorted on that data. It requires the -g/--call-graph
           option to be enabled. See the OVERHEAD CALCULATION section for more
           details. Enabled by default, disable with --no-children.

       --max-stack
           Set the stack depth limit when parsing the callchain; anything
           beyond the specified depth will be ignored. This is a trade-off
           between information loss and faster processing, especially for
           workloads that can have a very long callchain stack.

           Default: /proc/sys/kernel/perf_event_max_stack when present, 127
           otherwise.

       --ignore-callees=<regex>
           Ignore callees of the function(s) matching the given regex. This
           has the effect of collecting the callers of each such function into
           one place in the call-graph tree.
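
           The effect on a single callchain can be sketched in Python. This
           is a simplified model rather than perf's code: ignore_callees is
           a hypothetical helper, callchains are plain lists ordered callee
           first, and choosing the outermost match when several frames match
           is an assumption.

```python
import re


def ignore_callees(callchain: list[str], pattern: str) -> list[str]:
    """Drop every frame called (directly or indirectly) by a matching function.

    `callchain` is ordered callee first, e.g. ["memcpy", "foo", "main"].
    """
    regex = re.compile(pattern)
    # Assumption: when several frames match, the outermost caller wins,
    # so scan from the outermost frame (end of the list) inwards.
    for idx in range(len(callchain) - 1, -1, -1):
        if regex.search(callchain[idx]):
            return callchain[idx:]
    return callchain
```

           In this model, samples taken in different callees of foo all
           collapse onto the same chain starting at foo, which is what
           gathers the callers of foo in one place.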

       --percent-limit
           Do not show entries which have an overhead under that percent.
           (Default: 0).

       --percentage
           Determine how to display the overhead percentage of filtered
           entries. Filters can be applied by --comms, --dsos and/or --symbols
           options and Zoom operations on the TUI (thread, dso, etc).

           "relative" means it is relative to the filtered entries only, so
           that the sum of the shown entries is always 100%. "absolute" means
           it retains the original value before and after the filter is
           applied.
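
           The difference between the two modes can be shown with a small
           Python sketch. The function overhead_percentages and its inputs
           are hypothetical; it only models the arithmetic described above.

```python
def overhead_percentages(periods: dict[str, int], shown: set[str],
                         mode: str) -> dict[str, float]:
    """Compute the overhead column for the entries left after filtering.

    periods: period total per entry before any filter is applied.
    shown:   entries that pass the filter.
    mode:    "relative" or "absolute".
    """
    kept = {name: p for name, p in periods.items() if name in shown}
    if mode == "relative":
        base = sum(kept.values())      # only the filtered entries
    elif mode == "absolute":
        base = sum(periods.values())   # all entries, filtered or not
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if base == 0:
        return {name: 0.0 for name in kept}
    return {name: 100.0 * p / base for name, p in kept.items()}
```

           For periods of 50, 30 and 20 with the last entry filtered out,
           the relative values are 62.5% and 37.5%, while the absolute
           values stay at 50% and 30%.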

       -w, --column-widths=<width[,width...]>
           Force each column width to the provided list, for large terminal
           readability. 0 means no limit (default behavior).

       --proc-map-timeout
           Processing the /proc/XXX/maps files of pre-existing threads may
           take a long time, because the files may be huge. A timeout is
           needed in such cases. This option sets the timeout limit. The
           default value is 500 ms.

       -b, --branch-any
           Enable taken branch stack sampling. Any type of taken branch may be
           sampled. This is a shortcut for --branch-filter any. See
           --branch-filter for more information.

       -j, --branch-filter
           Enable taken branch stack sampling. Each sample captures a series
           of consecutive taken branches. The number of branches captured with
           each sample depends on the underlying hardware, the type of
           branches of interest, and the executed code. It is possible to
           select the types of branches captured by enabling filters. For a
           full list of modifiers please see the perf record manpage.

           The option requires at least one branch type among any, any_call,
           any_ret, ind_call, cond. The privilege levels may be omitted, in
           which case the privilege levels of the associated event are
           applied to the branch filter. Both kernel (k) and hypervisor (hv)
           privilege levels are subject to permissions. When sampling on
           multiple events, branch stack sampling is enabled for all the
           sampling events. The sampled branch type is the same for all
           events. The various filters must be specified as a comma-separated
           list: --branch-filter any_ret,u,k

           Note that this feature may not be available on all processors.

       --raw-trace
           When displaying traceevent output, do not use print fmt or plugins.

       --hierarchy
           Enable hierarchy output.

       --overwrite
           Enable this to use just the most recent records, which helps on
           high core count machines such as Knights Landing/Mill. It is
           disabled by default for now because the pausing used in this
           technique leads to loss of metadata events such as
           PERF_RECORD_MMAP, which makes perf top unable to resolve samples,
           leading to lots of unknown samples appearing on the UI. Enable
           this if you are on such machines and profiling a workload that
           doesn’t create short-lived threads and/or doesn’t use many
           executable mmap operations. Work is being planned to solve this
           situation; until then, this will remain disabled by default.

       --force
           Don’t do ownership validation.

       --num-thread-synthesize
           The number of threads to run when synthesizing events for existing
           processes. By default, the number of threads equals the number of
           online CPUs.

       --namespaces
           Record events of type PERF_RECORD_NAMESPACES and display it with
           the cgroup_id sort key.

       -G name, --cgroup name
           Monitor only in the container (cgroup) called "name". This option
           is available only in per-cpu mode. The cgroup filesystem must be
           mounted. All threads belonging to container "name" are monitored
           when they run on the monitored CPUs. Multiple cgroups can be
           provided. Each cgroup is applied to the corresponding event, i.e.,
           first cgroup to first event, second cgroup to second event and so
           on. It is possible to provide an empty cgroup (monitor all the
           time) using, e.g., -G foo,,bar. Cgroups must have corresponding
           events, i.e., they always refer to events defined earlier on the
           command line. If the user wants to track multiple events for a
           specific cgroup, the user can use -e e1 -e e2 -G foo,foo or just
           use -e e1 -e e2 -G foo.
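
           The positional pairing of cgroups and events can be sketched in
           Python. The helper assign_cgroups is hypothetical and only models
           the rules stated above; treating events without a listed cgroup
           as unrestricted is an assumption.

```python
from typing import Optional


def assign_cgroups(events: list[str],
                   cgroup_arg: str) -> list[tuple[str, Optional[str]]]:
    """Pair each event with the cgroup given at the same position in -G.

    An empty entry (as in "foo,,bar") yields None, meaning the event is
    monitored all the time.
    """
    names: list[Optional[str]] = [name or None for name in cgroup_arg.split(",")]
    if len(names) > len(events):
        raise ValueError("each cgroup must correspond to an earlier event")
    if len(names) == 1:
        # A single cgroup, as in "-e e1 -e e2 -G foo", covers every event.
        names = names * len(events)
    # Assumption: events without a listed cgroup are not restricted.
    names += [None] * (len(events) - len(names))
    return list(zip(events, names))
```

           Under these assumptions, -G foo and -G foo,foo give the same
           result for two events, matching the equivalence described above.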

       --all-cgroups
           Record events of type PERF_RECORD_CGROUP and display it with the
           cgroup sort key.

       --switch-on EVENT_NAME
           Only consider events after this event is found.

           E.g.:

           Find out where broadcast packets are handled:

               perf probe -L icmp_rcv

           Insert a probe there:

               perf probe icmp_rcv:59

           Start perf top and ask it to only consider the cycles events when a
           broadcast packet arrives. This will show a menu with two entries
           and will start counting when a broadcast packet arrives:

               perf top -e cycles,probe:icmp_rcv --switch-on=probe:icmp_rcv

           Alternatively one can ask for --group and then two overhead columns
           will appear, the first for cycles and the second for the switch-on
           event:

               perf top --group -e cycles,probe:icmp_rcv --switch-on=probe:icmp_rcv

           This may be useful for measuring a workload only after some
           initialization phase is over, i.e. insert a perf probe at that
           point and use the above examples, replacing probe:icmp_rcv with the
           just-after-init probe.

       --switch-off EVENT_NAME
           Stop considering events after this event is found.
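
           How the two switches gate a stream of events can be modelled with
           a short Python sketch. The function filter_events is
           hypothetical; dropping the marker events themselves and
           re-enabling on a later switch-on event are assumptions made for
           the illustration.

```python
from typing import Iterable, Optional


def filter_events(stream: Iterable[str],
                  switch_on: Optional[str] = None,
                  switch_off: Optional[str] = None) -> list[str]:
    """Return the events that would be considered, in order."""
    # With a switch-on event set, everything is discarded until it is seen.
    discarding = switch_on is not None
    kept: list[str] = []
    for name in stream:
        if switch_on is not None and name == switch_on:
            discarding = False
            continue  # assumption: marker events are not counted
        if switch_off is not None and name == switch_off:
            discarding = True
            continue
        if not discarding:
            kept.append(name)
    return kept
```

           In this model only the events between a switch-on marker and the
           following switch-off marker are kept.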

       --show-on-off-events
           Show the --switch-on/off events too. This currently has no effect
           in perf top, but the default will probably become not to show the
           switch-on/off events in --group mode and, if there is only one
           event besides the off/on ones, to go straight to the histogram
           browser, just like perf top with no events explicitly specified
           does.

       --stitch-lbr
           Show callgraph with stitched LBRs, which may give a more complete
           callgraph. The option must be used with --call-graph lbr recording.
           Disabled by default. In common cases with call stack overflows, it
           can recreate better call stacks than the default lbr call stack
           output. But this approach is not foolproof. There can be cases
           where it creates incorrect call stacks from incorrect matches. The
           known limitations include exception handling, such as
           setjmp/longjmp, where calls and returns do not match.

INTERACTIVE PROMPTING KEYS

       [d]
           Display refresh delay.

       [e]
           Number of entries to display.

       [E]
           Event to display when multiple counters are active.

       [f]
           Profile display filter (>= hit count).

       [F]
           Annotation display filter (>= % of total).

       [s]
           Annotate symbol.

       [S]
           Stop annotation, return to full profile display.

       [K]
           Hide kernel symbols.

       [U]
           Hide user symbols.

       [z]
           Toggle event count zeroing across display updates.

       [qQ]
           Quit.

       Pressing any unmapped key displays a menu, and prompts for input.

OVERHEAD CALCULATION

       The overhead can be shown in two columns as Children and Self when perf
       collects callchains. The self overhead is simply calculated by adding
       all period values of the entry - usually a function (symbol). This is
       the value that perf shows traditionally, and the sum of all the self
       overhead values should be 100%.

       The children overhead is calculated by adding all period values of the
       child functions so that it can show the total overhead of the higher
       level functions even if they don’t directly execute much. Children here
       means functions that are called from another (parent) function.

       It might be confusing that the sum of all the children overhead values
       exceeds 100% since each of them is already an accumulation of self
       overhead of its child functions. But with this enabled, users can find
       which function has the most overhead even if samples are spread over
       the children.

       Consider the following example; there are three functions like below.

           void foo(void) {
               /* do something */
           }

           void bar(void) {
               /* do something */
               foo();
           }

           int main(void) {
               bar();
               return 0;
           }

       In this case foo is a child of bar, and bar is an immediate child of
       main so foo also is a child of main. In other words, main is a parent
       of foo and bar, and bar is a parent of foo.

       Suppose all samples are recorded in foo and bar only. When it’s
       recorded with callchains the output will show something like below in
       the usual (self-overhead-only) output of perf report:

           Overhead  Symbol
           ........  .....................
             60.00%  foo
                     |
                     --- foo
                         bar
                         main
                         __libc_start_main

             40.00%  bar
                     |
                     --- bar
                         main
                         __libc_start_main

       When the --children option is enabled, the self overhead values of
       child functions (i.e. foo and bar) are added to the parents to
       calculate the children overhead. In this case the report could be
       displayed as:

           Children      Self  Symbol
           ........  ........  ....................
            100.00%     0.00%  __libc_start_main
                     |
                     --- __libc_start_main

            100.00%     0.00%  main
                     |
                     --- main
                         __libc_start_main

            100.00%    40.00%  bar
                     |
                     --- bar
                         main
                         __libc_start_main

             60.00%    60.00%  foo
                     |
                     --- foo
                         bar
                         main
                         __libc_start_main

       In the above output, the self overhead of foo (60%) was added to the
       children overhead of bar, main and __libc_start_main. Likewise, the
       self overhead of bar (40%) was added to the children overhead of main
       and __libc_start_main.

       So __libc_start_main and main are shown first since they have the same
       (100%) children overhead (even though they have zero self overhead) and
       they are the parents of foo and bar.
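
       The accumulation described above can be reproduced with a short Python
       sketch. It is a simplified model, not perf's implementation: the
       function name overhead and the sample representation (a callchain
       listed callee first, plus a period) are assumptions made for the
       illustration.

```python
from collections import defaultdict


def overhead(samples: list[tuple[list[str], int]]) -> dict[str, tuple[float, float]]:
    """Return {symbol: (children_percent, self_percent)}.

    samples: (callchain, period) pairs, callchain ordered callee first,
    e.g. (["foo", "bar", "main", "__libc_start_main"], 60).
    """
    total = sum(period for _, period in samples)
    self_period: dict[str, int] = defaultdict(int)
    children_period: dict[str, int] = defaultdict(int)
    for chain, period in samples:
        # The sampled function (innermost frame) gets the self period.
        self_period[chain[0]] += period
        # Every distinct function on the chain accumulates the period once,
        # so a recursive function is not counted twice for one sample.
        for symbol in set(chain):
            children_period[symbol] += period
    return {
        symbol: (100.0 * children_period[symbol] / total,
                 100.0 * self_period[symbol] / total)
        for symbol in children_period
    }
```

       Feeding it the two callchains from the example, with periods of 60 for
       foo and 40 for bar, yields the same Children and Self values as the
       table above.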

       Since v3.16 the children overhead is shown by default and the output is
       sorted by its values. The children overhead is disabled by specifying
       the --no-children option on the command line or by adding
       report.children = false or top.children = false to the perf config file.

SEE ALSO

       perf-stat(1), perf-list(1), perf-report(1)

perf                              01/12/2023                       PERF-TOP(1)