PERF-TOP(1)                       perf Manual                      PERF-TOP(1)

NAME
       perf-top - System profiling tool.

SYNOPSIS
       perf top [-e <EVENT> | --event=EVENT] [<options>]

DESCRIPTION
       This command generates and displays a performance counter profile in
       real time.

OPTIONS
       -a, --all-cpus
           System-wide collection. (default)

       -c <count>, --count=<count>
           Event period to sample.

       -C <cpu-list>, --cpu=<cpu>
           Monitor only on the list of CPUs provided. Multiple CPUs can be
           provided as a comma-separated list with no space: 0,1. Ranges of
           CPUs are specified with -: 0-2. Default is to monitor all CPUs.
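
           The CPU list syntax above can be illustrated with a small sketch.
           This is not perf's own parser; the function name parse_cpu_list
           is hypothetical and the code only shows how comma-separated
           entries and - ranges expand.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "0,1" or "0-2" into sorted CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            # A range such as "0-2" includes both end points.
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


print(parse_cpu_list("0,1"))  # [0, 1]
print(parse_cpu_list("0-2"))  # [0, 1, 2]
```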

       -d <seconds>, --delay=<seconds>
           Number of seconds to delay between refreshes.

       -e <event>, --event=<event>
           Select the PMU event. Selection can be a symbolic event name (use
           perf list to list all events) or a raw PMU event in the form of rN
           where N is a hexadecimal value that represents the raw register
           encoding with the layout of the event control registers as
           described by entries in /sys/bus/event_source/devices/cpu/format/*.

       -E <entries>, --entries=<entries>
           Display this many functions.

       -f <count>, --count-filter=<count>
           Only display functions with more events than this.

       --group
           Put the counters into a counter group.

       --group-sort-idx
           Sort the output by the event at index n in the group. If n is
           invalid, sort by the first event. Multiple groups with different
           numbers of events are supported. WARNING: This should be used on
           grouped events.
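
           The fallback rule for an invalid index can be sketched as follows.
           The row layout (a symbol plus one overhead value per event in the
           group) and the name sort_by_group_index are assumptions made for
           illustration, not perf's internal data structures.

```python
def sort_by_group_index(rows, idx):
    """Sort (symbol, overheads) rows by the overhead of the idx-th group event.

    overheads holds one value per event in the group. An out-of-range idx
    falls back to the first event, as the option description states.
    """
    def key(row):
        overheads = row[1]
        i = idx if 0 <= idx < len(overheads) else 0
        return overheads[i]

    return sorted(rows, key=key, reverse=True)


# Hypothetical data: overheads for a group of two events.
rows = [("a", [50.0, 10.0]), ("b", [30.0, 70.0]), ("c", [20.0, 20.0])]
print([sym for sym, _ in sort_by_group_index(rows, 1)])  # ['b', 'c', 'a']
print([sym for sym, _ in sort_by_group_index(rows, 9)])  # ['a', 'b', 'c']
```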

       -F <freq>, --freq=<freq>
           Profile at this frequency. Use max to use the current maximum
           allowed frequency, i.e. the value in the
           kernel.perf_event_max_sample_rate sysctl.

       -i, --inherit
           Child tasks do not inherit counters.

       -k <path>, --vmlinux=<path>
           Path to vmlinux. Required for annotation functionality.

       --ignore-vmlinux
           Ignore vmlinux files.

       --kallsyms=<file>
           kallsyms pathname.

       -m <pages>, --mmap-pages=<pages>
           Number of mmap data pages (must be a power of two) or size
           specification with appended unit character - B/K/M/G. The size is
           rounded up to the nearest power-of-two number of pages.
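
           The rounding rule can be made concrete with a short sketch. It
           assumes a 4096-byte page size and that K, M and G are powers of
           1024; both are assumptions for illustration, and mmap_pages is a
           hypothetical helper rather than perf code.

```python
UNIT_BYTES = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def mmap_pages(spec, page_size=4096):
    """Return the number of mmap data pages described by spec.

    page_size=4096 is an assumption; the real value depends on the system.
    """
    unit = spec[-1]
    if unit not in UNIT_BYTES:
        pages = int(spec)
        # A plain page count must already be a power of two.
        if pages < 1 or pages & (pages - 1):
            raise ValueError("page count must be a power of two")
        return pages
    size = int(spec[:-1]) * UNIT_BYTES[unit]
    pages = max(1, -(-size // page_size))  # ceiling division, at least 1 page
    power = 1
    while power < pages:  # round up to the next power of two
        power *= 2
    return power


print(mmap_pages("128"))   # 128
print(mmap_pages("512K"))  # 128
print(mmap_pages("3M"))    # 1024 (768 pages rounded up)
```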

       -p <pid>, --pid=<pid>
           Profile events on existing Process ID (comma separated list).

       -t <tid>, --tid=<tid>
           Profile events on existing thread ID (comma separated list).

       -u, --uid=
           Record events in threads owned by uid. Name or number.

       -r <priority>, --realtime=<priority>
           Collect data with this RT SCHED_FIFO priority.

       --sym-annotate=<symbol>
           Annotate this symbol.

       -K, --hide_kernel_symbols
           Hide kernel symbols.

       -U, --hide_user_symbols
           Hide user symbols.

       --demangle-kernel
           Demangle kernel symbols.

       -D, --dump-symtab
           Dump the symbol table used for profiling.

       -v, --verbose
           Be more verbose (show counter open errors, etc).

       -z, --zero
           Zero history across display updates.

       -s, --sort
           Sort by key(s): pid, comm, dso, symbol, parent, srcline, weight,
           local_weight, abort, in_tx, transaction, overhead, sample, period.
           Please see description of --sort in the perf-report man page.

       --fields=
           Specify the output fields; multiple keys can be given in CSV
           format. The following fields are available: overhead,
           overhead_sys, overhead_us, overhead_children, sample and period.
           It can also contain any sort key(s).

           By default, every sort key not specified in --fields is appended
           automatically.

       -n, --show-nr-samples
           Show a column with the number of samples.

       --show-total-period
           Show a column with the sum of periods.

       --dsos
           Only consider symbols in these dsos. This option will affect the
           percentage of the overhead column. See --percentage for more info.

       --comms
           Only consider symbols in these comms. This option will affect the
           percentage of the overhead column. See --percentage for more info.

       --symbols
           Only consider these symbols. This option will affect the percentage
           of the overhead column. See --percentage for more info.

       -M, --disassembler-style=
           Set disassembler style for objdump.

       --prefix=PREFIX, --prefix-strip=N
           Remove the first N entries from source file path names in
           executables and add PREFIX. This allows displaying source code
           compiled on systems with a different file system layout.
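
           A rough sketch of the path rewriting described above follows. It
           assumes that "entries" are the slash-separated directory
           components of an absolute path; the function name and the sample
           paths are hypothetical, and the real handling is done by the
           disassembler rather than by code like this.

```python
import posixpath


def rewrite_source_path(path, prefix="", strip=0):
    """Drop the first `strip` directory entries of path, then prepend prefix."""
    if not prefix and strip == 0:
        return path  # nothing to rewrite
    entries = [entry for entry in path.split("/") if entry]
    return posixpath.join(prefix, *entries[strip:])


# Hypothetical build-machine path mapped onto a local source tree.
print(rewrite_source_path("/build/host/linux/kernel/fork.c", "/usr/src", 2))
# /usr/src/linux/kernel/fork.c
```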

       --source
           Interleave source code with assembly code. Enabled by default,
           disable with --no-source.

       --asm-raw
           Show raw instruction encoding of assembly instructions.

       -g
           Enables call-graph (stack chain/backtrace) recording.

       --call-graph [mode,type,min[,limit],order[,key][,branch]]
           Setup and enable call-graph (stack chain/backtrace) recording,
           implies -g. See --call-graph section in perf-record and perf-report
           man pages for details.

       --children
           Accumulate callchain of children to parent entry so that they can
           show up in the output. The output will have a new "Children" column
           and will be sorted on the data. It requires the -g/--call-graph
           option to be enabled. See the ‘overhead calculation’ section for
           more details. Enabled by default, disable with --no-children.

       --max-stack
           Set the stack depth limit when parsing the callchain, anything
           beyond the specified depth will be ignored. This is a trade-off
           between information loss and faster processing especially for
           workloads that can have a very long callchain stack.

           Default: /proc/sys/kernel/perf_event_max_stack when present, 127
           otherwise.

       --ignore-callees=<regex>
           Ignore callees of the function(s) matching the given regex. This
           has the effect of collecting the callers of each such function into
           one place in the call-graph tree.
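
           The effect on a single callchain can be sketched like this. The
           sketch assumes callchains are listed callee first, and that the
           outermost matching function wins when several match; both are
           assumptions for illustration. The function names in the sample
           chains are hypothetical.

```python
import re


def ignore_callees(chain, pattern):
    """Cut a callee-first callchain so the matching function becomes the leaf."""
    regex = re.compile(pattern)
    # Walk from the outermost caller towards the leaf and stop at the first
    # match, so that everything called below it is dropped.
    for i in range(len(chain) - 1, -1, -1):
        if regex.search(chain[i]):
            return chain[i:]
    return chain


print(ignore_callees(["memcpy", "do_work", "handler", "main"], "^do_work$"))
# ['do_work', 'handler', 'main']
print(ignore_callees(["strlen", "do_work", "other", "main"], "^do_work$"))
# ['do_work', 'other', 'main']
```

           Both chains now start at do_work, so its callers are gathered
           under one entry instead of being split across memcpy and strlen.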

       --percent-limit
           Do not show entries which have an overhead under that percent.
           (Default: 0).

       --percentage
           Determine how to display the overhead percentage of filtered
           entries. Filters can be applied by --comms, --dsos and/or --symbols
           options and Zoom operations on the TUI (thread, dso, etc).

           "relative" means it's relative to filtered entries only so that the
           sum of shown entries will always be 100%. "absolute" means it
           retains the original value before and after the filter is applied.
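
           The difference between the two modes can be shown with a small
           calculation. The period values and symbol names are made up, and
           overhead_percent is a hypothetical helper that only mirrors the
           description above.

```python
def overhead_percent(periods, visible, mode="relative"):
    """Return the overhead percentage of each entry that passes the filter."""
    if mode not in ("relative", "absolute"):
        raise ValueError("mode must be 'relative' or 'absolute'")
    shown = {name: p for name, p in periods.items() if name in visible}
    # relative: divide by the filtered total; absolute: by the overall total.
    base = sum(shown.values()) if mode == "relative" else sum(periods.values())
    return {name: 100.0 * p / base for name, p in shown.items()}


periods = {"foo": 60, "bar": 20, "baz": 20}   # hypothetical sample periods
print(overhead_percent(periods, {"foo", "bar"}, "relative"))
# {'foo': 75.0, 'bar': 25.0}
print(overhead_percent(periods, {"foo", "bar"}, "absolute"))
# {'foo': 60.0, 'bar': 20.0}
```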

       -w, --column-widths=<width[,width...]>
           Force each column width to the provided list, for large terminal
           readability. 0 means no limit (default behavior).

       --proc-map-timeout
           Processing the /proc/XXX/maps files of pre-existing threads may
           take a long time, because a file may be huge. A time out is needed
           in such cases. This option sets the time out limit. The default
           value is 500 ms.

       -b, --branch-any
           Enable taken branch stack sampling. Any type of taken branch may be
           sampled. This is a shortcut for --branch-filter any. See
           --branch-filter for more information.

       -j, --branch-filter
           Enable taken branch stack sampling. Each sample captures a series
           of consecutive taken branches. The number of branches captured with
           each sample depends on the underlying hardware, the type of
           branches of interest, and the executed code. It is possible to
           select the types of branches captured by enabling filters. For a
           full list of modifiers please see the perf record manpage.

           The option requires at least one branch type among any, any_call,
           any_ret, ind_call, cond. The privilege levels may be omitted, in
           which case the privilege levels of the associated event are
           applied to the branch filter. Both kernel (k) and hypervisor (hv)
           privilege levels are subject to permissions. When sampling on
           multiple events, branch stack sampling is enabled for all the
           sampling events. The sampled branch type is the same for all
           events. The various filters must be specified as a comma separated
           list: --branch-filter any_ret,u,k. Note that this feature may not
           be available on all processors.

       --raw-trace
           When displaying traceevent output, do not use print fmt or plugins.

       --hierarchy
           Enable hierarchy output.

       --overwrite
           Enable this to use just the most recent records, which helps on
           high core count machines such as Knights Landing/Mill. It is
           currently disabled by default because the pausing used in this
           technique leads to the loss of metadata events such as
           PERF_RECORD_MMAP, which makes perf top unable to resolve samples,
           so that many unknown samples appear in the UI. Enable this if you
           are on such a machine and profiling a workload that doesn’t create
           short-lived threads and/or doesn’t use many executable mmap
           operations. Work is planned to solve this situation; until then,
           this will remain disabled by default.

       --force
           Don’t do ownership validation.

       --num-thread-synthesize
           The number of threads to run when synthesizing events for existing
           processes. By default, the number of threads equals the number of
           online CPUs.

       --namespaces
           Record events of type PERF_RECORD_NAMESPACES and display them with
           the cgroup_id sort key.

       -G name, --cgroup name
           Monitor only in the container (cgroup) called "name". This option
           is available only in per-cpu mode. The cgroup filesystem must be
           mounted. All threads belonging to container "name" are monitored
           when they run on the monitored CPUs. Multiple cgroups can be
           provided. Each cgroup is applied to the corresponding event, i.e.,
           first cgroup to first event, second cgroup to second event and so
           on. It is possible to provide an empty cgroup (monitor all the
           time) using, e.g., -G foo,,bar. Cgroups must have corresponding
           events, i.e., they always refer to events defined earlier on the
           command line. If the user wants to track multiple events for a
           specific cgroup, the user can use -e e1 -e e2 -G foo,foo or just
           use -e e1 -e e2 -G foo.

       --all-cgroups
           Record events of type PERF_RECORD_CGROUP and display them with the
           cgroup sort key.

       --switch-on EVENT_NAME
           Only consider events after this event is found.

           E.g.:

           Find out where broadcast packets are handled:

               perf probe -L icmp_rcv

           Insert a probe there:

               perf probe icmp_rcv:59

           Start perf top and ask it to only consider the cycles events when a
           broadcast packet arrives. This will show a menu with two entries
           and will start counting when a broadcast packet arrives:

               perf top -e cycles,probe:icmp_rcv --switch-on=probe:icmp_rcv

           Alternatively one can ask for --group and then two overhead columns
           will appear, the first for cycles and the second for the switch-on
           event:

               perf top --group -e cycles,probe:icmp_rcv --switch-on=probe:icmp_rcv

           This may be interesting to measure a workload only after some
           initialization phase is over, i.e. insert a perf probe at that
           point and use the above examples replacing probe:icmp_rcv with the
           just-after-init probe.

       --switch-off EVENT_NAME
           Stop considering events after this event is found.
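
           The combined effect of --switch-on and --switch-off on a stream of
           events can be sketched as a simple gate. This is an illustration
           under assumptions, not perf's implementation: the marker events
           themselves are not counted, the gate can reopen if the switch-on
           event is seen again, and probe:start and probe:stop are
           hypothetical event names.

```python
def considered_events(events, switch_on=None, switch_off=None):
    """Return the events that fall inside the switch-on/switch-off window.

    events is a list of event names in arrival order.
    """
    active = switch_on is None  # without --switch-on, start out active
    kept = []
    for name in events:
        if name == switch_on:
            active = True
            continue  # the marker itself is not counted in this sketch
        if name == switch_off:
            active = False
            continue
        if active:
            kept.append(name)
    return kept


stream = ["cycles", "probe:start", "cycles", "cycles", "probe:stop", "cycles"]
print(considered_events(stream, "probe:start", "probe:stop"))
# ['cycles', 'cycles']
```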

       --show-on-off-events
           Show the --switch-on/off events too. This has no effect in perf top
           now but probably we’ll make the default not to show the
           switch-on/off events on the --group mode and if there is only one
           event besides the off/on ones, go straight to the histogram
           browser, just like perf top with no events explicitly specified
           does.

       --stitch-lbr
           Show the callgraph with stitched LBRs, which may give a more
           complete callgraph. The option must be used with --call-graph lbr
           recording. Disabled by default. In common cases with call stack
           overflows, it can recreate better call stacks than the default lbr
           call stack output. But this approach is not foolproof. There can
           be cases where it creates incorrect call stacks from incorrect
           matches. Known limitations include exception handling: with
           mechanisms such as setjmp/longjmp, calls and returns will not
           match.

INTERACTIVE PROMPTING KEYS
       [d]
           Display refresh delay.

       [e]
           Number of entries to display.

       [E]
           Event to display when multiple counters are active.

       [f]
           Profile display filter (>= hit count).

       [F]
           Annotation display filter (>= % of total).

       [s]
           Annotate symbol.

       [S]
           Stop annotation, return to full profile display.

       [K]
           Hide kernel symbols.

       [U]
           Hide user symbols.

       [z]
           Toggle event count zeroing across display updates.

       [qQ]
           Quit.

       Pressing any unmapped key displays a menu, and prompts for input.

OVERHEAD CALCULATION
       The overhead can be shown in two columns as Children and Self when perf
       collects callchains. The self overhead is simply calculated by adding
       all period values of the entry - usually a function (symbol). This is
       the value that perf shows traditionally and the sum of all the self
       overhead values should be 100%.

       The children overhead is calculated by adding all period values of the
       child functions so that it can show the total overhead of the higher
       level functions even if they don’t directly execute much. Children here
       means functions that are called from another (parent) function.

       It might be confusing that the sum of all the children overhead values
       exceeds 100% since each of them is already an accumulation of the self
       overhead of its child functions. But with this enabled, users can find
       which function has the most overhead even if samples are spread over
       the children.

       Consider the following example; there are three functions like below.

           void foo(void) {
               /* do something */
           }

           void bar(void) {
               /* do something */
               foo();
           }

           int main(void) {
               bar();
               return 0;
           }

       In this case foo is a child of bar, and bar is an immediate child of
       main so foo also is a child of main. In other words, main is a parent
       of foo and bar, and bar is a parent of foo.

       Suppose all samples are recorded in foo and bar only. When it’s
       recorded with callchains the output will show something like below in
       the usual (self-overhead-only) output of perf report:

           Overhead  Symbol
           ........  .....................
             60.00%  foo
                     |
                     --- foo
                         bar
                         main
                         __libc_start_main

             40.00%  bar
                     |
                     --- bar
                         main
                         __libc_start_main

       When the --children option is enabled, the self overhead values of
       child functions (i.e. foo and bar) are added to the parents to
       calculate the children overhead. In this case the report could be
       displayed as:

           Children      Self  Symbol
           ........  ........  ....................
            100.00%     0.00%  __libc_start_main
                               |
                               --- __libc_start_main

            100.00%     0.00%  main
                               |
                               --- main
                                   __libc_start_main

            100.00%    40.00%  bar
                               |
                               --- bar
                                   main
                                   __libc_start_main

             60.00%    60.00%  foo
                               |
                               --- foo
                                   bar
                                   main
                                   __libc_start_main

       In the above output, the self overhead of foo (60%) was added to the
       children overhead of bar, main and __libc_start_main. Likewise, the
       self overhead of bar (40%) was added to the children overhead of main
       and __libc_start_main.

       So __libc_start_main and main are shown first since they have the same
       (100%) children overhead (even though they have zero self overhead) and
       they are the parents of foo and bar.

       Since v3.16 the children overhead is shown by default and the output is
       sorted by its values. The children overhead is disabled by specifying
       the --no-children option on the command line or by adding
       report.children = false or top.children = false in the perf config
       file.
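
       The accumulation described in this section can be reproduced with a
       few lines of code. The sketch below is not how perf stores its
       histograms; it assumes each sample is a (period, callchain) pair with
       the callchain listed callee first, and the function name overhead is
       hypothetical. The sample data matches the foo/bar/main example.

```python
from collections import defaultdict


def overhead(samples):
    """Return {symbol: (children_percent, self_percent)}.

    samples is a list of (period, callchain) pairs, callchain callee first.
    """
    total = sum(period for period, _ in samples)
    self_period = defaultdict(int)
    children_period = defaultdict(int)
    for period, chain in samples:
        self_period[chain[0]] += period       # only the leaf gets self overhead
        for symbol in set(chain):             # count a recursive symbol once
            children_period[symbol] += period
    return {
        symbol: (100.0 * children_period[symbol] / total,
                 100.0 * self_period[symbol] / total)
        for symbol in children_period
    }


samples = [
    (60, ["foo", "bar", "main", "__libc_start_main"]),
    (40, ["bar", "main", "__libc_start_main"]),
]
result = overhead(samples)
for symbol in sorted(result, key=lambda s: result[s][0], reverse=True):
    children, self_pct = result[symbol]
    print("%7.2f%% %7.2f%%  %s" % (children, self_pct, symbol))
```

       The self column sums to 100%, while the children column does not,
       because every caller on a chain is credited with the same period.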

SEE ALSO
       perf-stat(1), perf-list(1), perf-report(1)

perf                              01/12/2023                       PERF-TOP(1)