1PERF-INTEL-PT(1) perf Manual PERF-INTEL-PT(1)
2
3
4
6 perf-intel-pt - Support for Intel Processor Trace within perf tools
7
9 perf record -e intel_pt//
10
12 Intel Processor Trace (Intel PT) is an extension of Intel Architecture
13 that collects information about software execution such as control
14 flow, execution modes and timings and formats it into highly compressed
15 binary packets. Technical details are documented in the Intel 64 and
16 IA-32 Architectures Software Developer Manuals, Chapter 36 Intel
17 Processor Trace.
18
19 Intel PT is first supported in Intel Core M and 5th generation Intel
20 Core processors that are based on the Intel micro-architecture code
21 name Broadwell.
22
23 Trace data is collected by perf record and stored within the perf.data
24 file. See below for options to perf record.
25
26 Trace data must be decoded which involves walking the object code and
27 matching the trace data packets. For example a TNT packet only tells
28 whether a conditional branch was taken or not taken, so to make use of
29 that packet the decoder must know precisely which instruction was being
30 executed.
31
32 Decoding is done on-the-fly. The decoder outputs samples in the same
33 format as samples output by perf hardware events, for example as though
34 the "instructions" or "branches" events had been recorded. Presently 3
35 tools support this: perf script, perf report and perf inject. See below
36 for more information on using those tools.
37
38 The main distinguishing feature of Intel PT is that the decoder can
39 determine the exact flow of software execution. Intel PT can be used to
40 understand why and how did software get to a certain point, or behave a
41 certain way. The software does not have to be recompiled, so Intel PT
42 works with debug or release builds, however the executed images are
43 needed - which makes use in JIT-compiled environments, or with
44 self-modified code, a challenge. Also symbols need to be provided to
45 make sense of addresses.
46
47 A limitation of Intel PT is that it produces huge amounts of trace data
48 (hundreds of megabytes per second per core) which takes a long time to
49 decode, for example two or three orders of magnitude longer than it
50 took to collect. Another limitation is the performance impact of
51 tracing, something that will vary depending on the use-case and
52 architecture.
53
55 It is important to start small. That is because it is easy to capture
56 vastly more data than can possibly be processed.
57
58 The simplest thing to do with Intel PT is userspace profiling of small
59 programs. Data is captured with perf record e.g. to trace ls
60 userspace-only:
61
62 perf record -e intel_pt//u ls
63
64 And profiled with perf report e.g.
65
66 perf report
67
68 To also trace kernel space presents a problem, namely kernel
69 self-modifying code. A fairly good kernel image is available in
70 /proc/kcore but to get an accurate image a copy of /proc/kcore needs to
71 be made under the same conditions as the data capture. perf record can
72 make a copy of /proc/kcore if the option --kcore is used, but access to
73 /proc/kcore is restricted e.g.
74
75 sudo perf record -o pt_ls --kcore -e intel_pt// -- ls
76
77 which will create a directory named pt_ls and put the perf.data file
78 (named simply data) and copies of /proc/kcore, /proc/kallsyms and
79 /proc/modules into it. The other tools understand the directory format,
80 so to use perf report becomes:
81
82 sudo perf report -i pt_ls
83
84 Because samples are synthesized after-the-fact, the sampling period can
85 be selected for reporting. e.g. sample every microsecond
86
87 sudo perf report pt_ls --itrace=i1usge
88
89 See the sections below for more information about the --itrace option.
90
91 Beware the smaller the period, the more samples that are produced, and
92 the longer it takes to process them.
93
94 Also note that the coarseness of Intel PT timing information will start
95 to distort the statistical value of the sampling as the sampling period
96 becomes smaller.
97
98 To represent software control flow, "branches" samples are produced. By
99 default a branch sample is synthesized for every single branch. To get
100 an idea what data is available you can use the perf script tool with
101 all itrace sampling options, which will list all the samples.
102
103 perf record -e intel_pt//u ls
104 perf script --itrace=ibxwpe
105
106 An interesting field that is not printed by default is flags which can
107 be displayed as follows:
108
109 perf script --itrace=ibxwpe -F+flags
110
111 The flags are "bcrosyiABExgh" which stand for branch, call, return,
112 conditional, system, asynchronous, interrupt, transaction abort, trace
113 begin, trace end, in transaction, VM-entry, and VM-exit respectively.
114
115 perf script also supports higher level ways to dump instruction traces:
116
117 perf script --insn-trace --xed
118
119 Dump all instructions. This requires installing the xed tool (see XED
120 below) Dumping all instructions in a long trace can be fairly slow. It
121 is usually better to start with higher level decoding, like
122
123 perf script --call-trace
124
125 or
126
127 perf script --call-ret-trace
128
129 and then select a time range of interest. The time range can then be
130 examined in detail with
131
132 perf script --time starttime,stoptime --insn-trace --xed
133
134 While examining the trace it’s also useful to filter on specific CPUs
135 using the -C option
136
137 perf script --time starttime,stoptime --insn-trace --xed -C 1
138
139 Dump all instructions in time range on CPU 1.
140
141 Another interesting field that is not printed by default is ipc which
142 can be displayed as follows:
143
144 perf script --itrace=be -F+ipc
145
146 There are two ways that instructions-per-cycle (IPC) can be calculated
147 depending on the recording.
148
149 If the cyc config term (see config terms section below) was used, then
150 IPC is calculated using the cycle count from CYC packets, otherwise MTC
151 packets are used - refer to the mtc config term. When MTC is used,
152 however, the values are less accurate because the timing is less
153 accurate.
154
155 Because Intel PT does not update the cycle count on every branch or
156 instruction, the values will often be zero. When there are values, they
157 will be the number of instructions and number of cycles since the last
158 update, and thus represent the average IPC since the last IPC for that
159 event type. Note IPC for "branches" events is calculated separately
160 from IPC for "instructions" events.
161
162 Also note that the IPC instruction count may or may not include the
163 current instruction. If the cycle count is associated with an
164 asynchronous branch (e.g. page fault or interrupt), then the
165 instruction count does not include the current instruction, otherwise
166 it does. That is consistent with whether or not that instruction has
167 retired when the cycle count is updated.
168
169 Another note, in the case of "branches" events, non-taken branches are
170 not presently sampled, so IPC values for them do not appear e.g. a CYC
171 packet with a TNT packet that starts with a non-taken branch. To see
172 every possible IPC value, "instructions" events can be used e.g.
173 --itrace=i0ns
174
175 While it is possible to create scripts to analyze the data, an
176 alternative approach is available to export the data to a sqlite or
177 postgresql database. Refer to script export-to-sqlite.py or
178 export-to-postgresql.py for more details, and to script
179 exported-sql-viewer.py for an example of using the database.
180
181 There is also script intel-pt-events.py which provides an example of
182 how to unpack the raw data for power events and PTWRITE. The script
183 also displays branches, and supports 2 additional modes selected by
184 option:
185
186 --insn-trace - instruction trace
187 --src-trace - source trace
188
189 As mentioned above, it is easy to capture too much data. One way to
190 limit the data captured is to use snapshot mode which is explained
191 further below. Refer to new snapshot option and Intel PT modes of
192 operation further below.
193
194 Another problem that will be experienced is decoder errors. They can be
195 caused by inability to access the executed image, self-modified or
196 JIT-ed code, or the inability to match side-band information (such as
197 context switches and mmaps) which results in the decoder not knowing
198 what code was executed.
199
200 There is also the problem of perf not being able to copy the data fast
201 enough, resulting in data lost because the buffer was full. See Buffer
202 handling below for more details.
203
205 new event
206 The Intel PT kernel driver creates a new PMU for Intel PT. PMU events
207 are selected by providing the PMU name followed by the "config"
208 separated by slashes. An enhancement has been made to allow default
209 "config" e.g. the option
210
211 -e intel_pt//
212
213 will use a default config value. Currently that is the same as
214
215 -e intel_pt/tsc,noretcomp=0/
216
217 which is the same as
218
219 -e intel_pt/tsc=1,noretcomp=0/
220
221 Note there are now new config terms - see section config terms further
222 below.
223
224 The config terms are listed in /sys/devices/intel_pt/format. They are
225 bit fields within the config member of the struct perf_event_attr which
226 is passed to the kernel by the perf_event_open system call. They
227 correspond to bit fields in the IA32_RTIT_CTL MSR. Here is a list of
228 them and their definitions:
229
230 $ grep -H . /sys/bus/event_source/devices/intel_pt/format/*
231 /sys/bus/event_source/devices/intel_pt/format/cyc:config:1
232 /sys/bus/event_source/devices/intel_pt/format/cyc_thresh:config:19-22
233 /sys/bus/event_source/devices/intel_pt/format/mtc:config:9
234 /sys/bus/event_source/devices/intel_pt/format/mtc_period:config:14-17
235 /sys/bus/event_source/devices/intel_pt/format/noretcomp:config:11
236 /sys/bus/event_source/devices/intel_pt/format/psb_period:config:24-27
237 /sys/bus/event_source/devices/intel_pt/format/tsc:config:10
238
239 Note that the default config must be overridden for each term i.e.
240
241 -e intel_pt/noretcomp=0/
242
243 is the same as:
244
245 -e intel_pt/tsc=1,noretcomp=0/
246
247 So, to disable TSC packets use:
248
249 -e intel_pt/tsc=0/
250
251 It is also possible to specify the config value explicitly:
252
253 -e intel_pt/config=0x400/
254
255 Note that, as with all events, the event is suffixed with event
256 modifiers:
257
258 u userspace
259 k kernel
260 h hypervisor
261 G guest
262 H host
263 p precise ip
264
265 h, G and H are for virtualization which is not supported by Intel PT. p
266 is also not relevant to Intel PT. So only options u and k are
267 meaningful for Intel PT.
268
269 perf_event_attr is displayed if the -vv option is used e.g.
270
271 ------------------------------------------------------------
272 perf_event_attr:
273 type 6
274 size 112
275 config 0x400
276 { sample_period, sample_freq } 1
277 sample_type IP|TID|TIME|CPU|IDENTIFIER
278 read_format ID
279 disabled 1
280 inherit 1
281 exclude_kernel 1
282 exclude_hv 1
283 enable_on_exec 1
284 sample_id_all 1
285 ------------------------------------------------------------
286 sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
287 sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
288 sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
289 sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
290 ------------------------------------------------------------
291
292 config terms
293 The June 2015 version of Intel 64 and IA-32 Architectures Software
294 Developer Manuals, Chapter 36 Intel Processor Trace, defined new Intel
295 PT features. Some of the features are reflect in new config terms. All
296 the config terms are described below.
297
298 tsc Always supported. Produces TSC timestamp packets to provide timing
299 information. In some cases it is possible to decode without timing
300 information, for example a per-thread context that does not overlap
301 executable memory maps.
302
303 The default config selects tsc (i.e. tsc=1).
304
305 noretcomp Always supported. Disables "return compression" so a TIP
306 packet is produced when a function returns. Causes more packets to be
307 produced but might make decoding more reliable.
308
309 The default config does not select noretcomp (i.e. noretcomp=0).
310
311 psb_period Allows the frequency of PSB packets to be specified.
312
313 The PSB packet is a synchronization packet that provides a
314 starting point for decoding or recovery from errors.
315
316 Support for psb_period is indicated by:
317
318 /sys/bus/event_source/devices/intel_pt/caps/psb_cyc
319
320 which contains "1" if the feature is supported and "0"
321 otherwise.
322
323 Valid values are given by:
324
325 /sys/bus/event_source/devices/intel_pt/caps/psb_periods
326
327 which contains a hexadecimal value, the bits of which represent
328 valid values e.g. bit 2 set means value 2 is valid.
329
330 The psb_period value is converted to the approximate number of
331 trace bytes between PSB packets as:
332
333 2 ^ (value + 11)
334
335 e.g. value 3 means 16KiB bytes between PSBs
336
337 If an invalid value is entered, the error message
338 will give a list of valid values e.g.
339
340 $ perf record -e intel_pt/psb_period=15/u uname
341 Invalid psb_period for intel_pt. Valid values are: 0-5
342
343 If MTC packets are selected, the default config selects a value
344 of 3 (i.e. psb_period=3) or the nearest lower value that is
345 supported (0 is always supported). Otherwise the default is 0.
346
347 If decoding is expected to be reliable and the buffer is large
348 then a large PSB period can be used.
349
350 Because a TSC packet is produced with PSB, the PSB period can
351 also affect the granularity to timing information in the absence
352 of MTC or CYC.
353
354 mtc Produces MTC timing packets.
355
356 MTC packets provide finer grain timestamp information than TSC
357 packets. MTC packets record time using the hardware crystal
358 clock (CTC) which is related to TSC packets using a TMA packet.
359
360 Support for this feature is indicated by:
361
362 /sys/bus/event_source/devices/intel_pt/caps/mtc
363
364 which contains "1" if the feature is supported and
365 "0" otherwise.
366
367 The frequency of MTC packets can also be specified - see
368 mtc_period below.
369
370 mtc_period Specifies how frequently MTC packets are produced - see mtc
371 above for how to determine if MTC packets are supported.
372
373 Valid values are given by:
374
375 /sys/bus/event_source/devices/intel_pt/caps/mtc_periods
376
377 which contains a hexadecimal value, the bits of which represent
378 valid values e.g. bit 2 set means value 2 is valid.
379
380 The mtc_period value is converted to the MTC frequency as:
381
382 CTC-frequency / (2 ^ value)
383
384 e.g. value 3 means one eighth of CTC-frequency
385
386 Where CTC is the hardware crystal clock, the frequency of which
387 can be related to TSC via values provided in cpuid leaf 0x15.
388
389 If an invalid value is entered, the error message
390 will give a list of valid values e.g.
391
392 $ perf record -e intel_pt/mtc_period=15/u uname
393 Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9
394
395 The default value is 3 or the nearest lower value
396 that is supported (0 is always supported).
397
398 cyc Produces CYC timing packets.
399
400 CYC packets provide even finer grain timestamp information than
401 MTC and TSC packets. A CYC packet contains the number of CPU
402 cycles since the last CYC packet. Unlike MTC and TSC packets,
403 CYC packets are only sent when another packet is also sent.
404
405 Support for this feature is indicated by:
406
407 /sys/bus/event_source/devices/intel_pt/caps/psb_cyc
408
409 which contains "1" if the feature is supported and
410 "0" otherwise.
411
412 The number of CYC packets produced can be reduced by specifying
413 a threshold - see cyc_thresh below.
414
415 cyc_thresh Specifies how frequently CYC packets are produced - see cyc
416 above for how to determine if CYC packets are supported.
417
418 Valid cyc_thresh values are given by:
419
420 /sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds
421
422 which contains a hexadecimal value, the bits of which represent
423 valid values e.g. bit 2 set means value 2 is valid.
424
425 The cyc_thresh value represents the minimum number of CPU cycles
426 that must have passed before a CYC packet can be sent. The
427 number of CPU cycles is:
428
429 2 ^ (value - 1)
430
431 e.g. value 4 means 8 CPU cycles must pass before a CYC packet
432 can be sent. Note a CYC packet is still only sent when another
433 packet is sent, not at, e.g. every 8 CPU cycles.
434
435 If an invalid value is entered, the error message
436 will give a list of valid values e.g.
437
438 $ perf record -e intel_pt/cyc,cyc_thresh=15/u uname
439 Invalid cyc_thresh for intel_pt. Valid values are: 0-12
440
441 CYC packets are not requested by default.
442
443 pt Specifies pass-through which enables the branch config term.
444
445 The default config selects 'pt' if it is available, so a user will
446 never need to specify this term.
447
448 branch Enable branch tracing. Branch tracing is enabled by default so
449 to disable branch tracing use branch=0.
450
451 The default config selects 'branch' if it is available.
452
453 ptw Enable PTWRITE packets which are produced when a ptwrite
454 instruction is executed.
455
456 Support for this feature is indicated by:
457
458 /sys/bus/event_source/devices/intel_pt/caps/ptwrite
459
460 which contains "1" if the feature is supported and
461 "0" otherwise.
462
463 fup_on_ptw Enable a FUP packet to follow the PTWRITE packet. The FUP
464 packet provides the address of the ptwrite instruction. In the absence
465 of fup_on_ptw, the decoder will use the address of the previous branch
466 if branch tracing is enabled, otherwise the address will be zero. Note
467 that fup_on_ptw will work even when branch tracing is disabled.
468
469 pwr_evt Enable power events. The power events provide information about
470 changes to the CPU C-state.
471
472 Support for this feature is indicated by:
473
474 /sys/bus/event_source/devices/intel_pt/caps/power_event_trace
475
476 which contains "1" if the feature is supported and
477 "0" otherwise.
478
479 AUX area sampling option
480 To select Intel PT "sampling" the AUX area sampling option can be used:
481
482 --aux-sample
483
484 Optionally it can be followed by the sample size in bytes e.g.
485
486 --aux-sample=8192
487
488 In addition, the Intel PT event to sample must be defined e.g.
489
490 -e intel_pt//u
491
492 Samples on other events will be created containing Intel PT data e.g.
493 the following will create Intel PT samples on the branch-misses event,
494 note the events must be grouped using {}:
495
496 perf record --aux-sample -e '{intel_pt//u,branch-misses:u}'
497
498 An alternative to --aux-sample is to add the config term
499 aux-sample-size to events. In this case, the grouping is implied e.g.
500
501 perf record -e intel_pt//u -e branch-misses/aux-sample-size=8192/u
502
503 is the same as:
504
505 perf record -e '{intel_pt//u,branch-misses/aux-sample-size=8192/u}'
506
507 but allows for also using an address filter e.g.:
508
509 perf record -e intel_pt//u --filter 'filter * @/bin/ls' -e branch-misses/aux-sample-size=8192/u -- ls
510
511 It is important to select a sample size that is big enough to contain
512 at least one PSB packet. If not a warning will be displayed:
513
514 Intel PT sample size (%zu) may be too small for PSB period (%zu)
515
516 The calculation used for that is: if sample_size ⟨ psb_period + 256
517 display the warning. When sampling is used, psb_period defaults to 0
518 (2KiB).
519
520 The default sample size is 4KiB.
521
522 The sample size is passed in aux_sample_size in struct perf_event_attr.
523 The sample size is limited by the maximum event size which is 64KiB. It
524 is difficult to know how big the event might be without the trace
525 sample attached, but the tool validates that the sample size is not
526 greater than 60KiB.
527
528 new snapshot option
529 The difference between full trace and snapshot from the kernel’s
530 perspective is that in full trace we don’t overwrite trace data that
531 the user hasn’t collected yet (and indicated that by advancing
532 aux_tail), whereas in snapshot mode we let the trace run and overwrite
533 older data in the buffer so that whenever something interesting
534 happens, we can stop it and grab a snapshot of what was going on around
535 that interesting moment.
536
537 To select snapshot mode a new option has been added:
538
539 -S
540
541 Optionally it can be followed by the snapshot size e.g.
542
543 -S0x100000
544
545 The default snapshot size is the auxtrace mmap size. If neither
546 auxtrace mmap size nor snapshot size is specified, then the default is
547 4MiB for privileged users (or if /proc/sys/kernel/perf_event_paranoid <
548 0), 128KiB for unprivileged users. If an unprivileged user does not
549 specify mmap pages, the mmap pages will be reduced as described in the
550 new auxtrace mmap size option section below.
551
552 The snapshot size is displayed if the option -vv is used e.g.
553
554 Intel PT snapshot size: %zu
555
556 new auxtrace mmap size option
557 Intel PT buffer size is specified by an addition to the -m option e.g.
558
559 -m,16
560
561 selects a buffer size of 16 pages i.e. 64KiB.
562
563 Note that the existing functionality of -m is unchanged. The auxtrace
564 mmap size is specified by the optional addition of a comma and the
565 value.
566
567 The default auxtrace mmap size for Intel PT is 4MiB/page_size for
568 privileged users (or if /proc/sys/kernel/perf_event_paranoid < 0),
569 128KiB for unprivileged users. If an unprivileged user does not specify
570 mmap pages, the mmap pages will be reduced from the default
571 512KiB/page_size to 256KiB/page_size, otherwise the user is likely to
572 get an error as they exceed their mlock limit (Max locked memory as
573 shown in /proc/self/limits). Note that perf does not count the first
574 512KiB (actually /proc/sys/kernel/perf_event_mlock_kb minus 1 page) per
575 cpu against the mlock limit so an unprivileged user is allowed 512KiB
576 per cpu plus their mlock limit (which defaults to 64KiB but is not
577 multiplied by the number of cpus).
578
579 In full-trace mode, powers of two are allowed for buffer size, with a
580 minimum size of 2 pages. In snapshot mode or sampling mode, it is the
581 same but the minimum size is 1 page.
582
583 The mmap size and auxtrace mmap size are displayed if the -vv option is
584 used e.g.
585
586 mmap length 528384
587 auxtrace mmap length 4198400
588
589 Intel PT modes of operation
590 Intel PT can be used in 3 modes: full-trace mode sample mode snapshot
591 mode
592
593 Full-trace mode traces continuously e.g.
594
595 perf record -e intel_pt//u uname
596
597 Sample mode attaches a Intel PT sample to other events e.g.
598
599 perf record --aux-sample -e intel_pt//u -e branch-misses:u
600
601 Snapshot mode captures the available data when a signal is sent or
602 "snapshot" control command is issued. e.g. using a signal
603
604 perf record -v -e intel_pt//u -S ./loopy 1000000000 &
605 [1] 11435
606 kill -USR2 11435
607 Recording AUX area tracing snapshot
608
609 Note that the signal sent is SIGUSR2. Note that "Recording AUX area
610 tracing snapshot" is displayed because the -v option is used.
611
612 The advantage of using "snapshot" control command is that the access is
613 controlled by access to a FIFO e.g.
614
615 $ mkfifo perf.control
616 $ mkfifo perf.ack
617 $ cat perf.ack &
618 [1] 15235
619 $ sudo ~/bin/perf record --control fifo:perf.control,perf.ack -S -e intel_pt//u -- sleep 60 &
620 [2] 15243
621 $ ps -e | grep perf
622 15244 pts/1 00:00:00 perf
623 $ kill -USR2 15244
624 bash: kill: (15244) - Operation not permitted
625 $ echo snapshot > perf.control
626 ack
627
628 The 3 Intel PT modes of operation cannot be used together.
629
630 Buffer handling
631 There may be buffer limitations (i.e. single ToPa entry) which means
632 that actual buffer sizes are limited to powers of 2 up to 4MiB
633 (MAX_ORDER). In order to provide other sizes, and in particular an
634 arbitrarily large size, multiple buffers are logically concatenated.
635 However an interrupt must be used to switch between buffers. That has
636 two potential problems: a) the interrupt may not be handled in time so
637 that the current buffer becomes full and some trace data is lost. b)
638 the interrupts may slow the system and affect the performance results.
639
640 If trace data is lost, the driver sets truncated in the PERF_RECORD_AUX
641 event which the tools report as an error.
642
643 In full-trace mode, the driver waits for data to be copied out before
644 allowing the (logical) buffer to wrap-around. If data is not copied out
645 quickly enough, again truncated is set in the PERF_RECORD_AUX event. If
646 the driver has to wait, the intel_pt event gets disabled. Because it is
647 difficult to know when that happens, perf tools always re-enable the
648 intel_pt event after copying out data.
649
650 Intel PT and build ids
651 By default "perf record" post-processes the event stream to find all
652 build ids for executables for all addresses sampled. Deliberately,
653 Intel PT is not decoded for that purpose (it would take too long).
654 Instead the build ids for all executables encountered (due to mmap,
655 comm or task events) are included in the perf.data file.
656
657 To see buildids included in the perf.data file use the command:
658
659 perf buildid-list
660
661 If the perf.data file contains Intel PT data, that is the same as:
662
663 perf buildid-list --with-hits
664
665 Snapshot mode and event disabling
666 In order to make a snapshot, the intel_pt event is disabled using an
667 IOCTL, namely PERF_EVENT_IOC_DISABLE. However doing that can also
668 disable the collection of side-band information. In order to prevent
669 that, a dummy software event has been introduced that permits tracking
670 events (like mmaps) to continue to be recorded while intel_pt is
671 disabled. That is important to ensure there is complete side-band
672 information to allow the decoding of subsequent snapshots.
673
674 A test has been created for that. To find the test:
675
676 perf test list
677 ...
678 23: Test using a dummy software event to keep tracking
679
680 To run the test:
681
682 perf test 23
683 23: Test using a dummy software event to keep tracking : Ok
684
685 perf record modes (nothing new here)
686 perf record essentially operates in one of three modes: per thread per
687 cpu workload only
688
689 "per thread" mode is selected by -t or by --per-thread (with -p or -u
690 or just a workload). "per cpu" is selected by -C or -a. "workload only"
691 mode is selected by not using the other options but providing a command
692 to run (i.e. the workload).
693
694 In per-thread mode an exact list of threads is traced. There is no
695 inheritance. Each thread has its own event buffer.
696
697 In per-cpu mode all processes (or processes from the selected cgroup
698 i.e. -G option, or processes selected with -p or -u) are traced. Each
699 cpu has its own buffer. Inheritance is allowed.
700
701 In workload-only mode, the workload is traced but with per-cpu buffers.
702 Inheritance is allowed. Note that you can now trace a workload in
703 per-thread mode by using the --per-thread option.
704
705 Privileged vs non-privileged users
706 Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged
707 users have memory limits imposed upon them. That affects what buffer
708 sizes they can have as outlined above.
709
710 The v4.2 kernel introduced support for a context switch metadata event,
711 PERF_RECORD_SWITCH, which allows unprivileged users to see when their
712 processes are scheduled out and in, just not by whom, which is left for
713 the PERF_RECORD_SWITCH_CPU_WIDE, that is only accessible in system wide
714 context, which in turn requires CAP_PERFMON or CAP_SYS_ADMIN.
715
716 Please see the 45ac1403f564 ("perf: Add PERF_RECORD_SWITCH to indicate
717 context switches") commit, that introduces these metadata events for
718 further info.
719
720 When working with kernels < v4.2, the following considerations must be
721 taken, as the sched:sched_switch tracepoints will be used to receive
722 such information:
723
724 Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged
725 users are not permitted to use tracepoints which means there is
726 insufficient side-band information to decode Intel PT in per-cpu mode,
727 and potentially workload-only mode too if the workload creates new
728 processes.
729
730 Note also, that to use tracepoints, read-access to debugfs is required.
731 So if debugfs is not mounted or the user does not have read-access, it
732 will again not be possible to decode Intel PT in per-cpu mode.
733
734 sched_switch tracepoint
735 The sched_switch tracepoint is used to provide side-band data for Intel
736 PT decoding in kernels where the PERF_RECORD_SWITCH metadata event
737 isn’t available.
738
739 The sched_switch events are automatically added. e.g. the second event
740 shown below:
741
742 $ perf record -vv -e intel_pt//u uname
743 ------------------------------------------------------------
744 perf_event_attr:
745 type 6
746 size 112
747 config 0x400
748 { sample_period, sample_freq } 1
749 sample_type IP|TID|TIME|CPU|IDENTIFIER
750 read_format ID
751 disabled 1
752 inherit 1
753 exclude_kernel 1
754 exclude_hv 1
755 enable_on_exec 1
756 sample_id_all 1
757 ------------------------------------------------------------
758 sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
759 sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
760 sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
761 sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
762 ------------------------------------------------------------
763 perf_event_attr:
764 type 2
765 size 112
766 config 0x108
767 { sample_period, sample_freq } 1
768 sample_type IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER
769 read_format ID
770 inherit 1
771 sample_id_all 1
772 exclude_guest 1
773 ------------------------------------------------------------
774 sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8
775 sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8
776 sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8
777 sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8
778 ------------------------------------------------------------
779 perf_event_attr:
780 type 1
781 size 112
782 config 0x9
783 { sample_period, sample_freq } 1
784 sample_type IP|TID|TIME|IDENTIFIER
785 read_format ID
786 disabled 1
787 inherit 1
788 exclude_kernel 1
789 exclude_hv 1
790 mmap 1
791 comm 1
792 enable_on_exec 1
793 task 1
794 sample_id_all 1
795 mmap2 1
796 comm_exec 1
797 ------------------------------------------------------------
798 sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
799 sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
800 sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
801 sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
802 mmap size 528384B
803 AUX area mmap length 4194304
804 perf event ring buffer mmapped per cpu
805 Synthesizing auxtrace information
806 Linux
807 [ perf record: Woken up 1 times to write data ]
808 [ perf record: Captured and wrote 0.042 MB perf.data ]
809
810 Note, the sched_switch event is only added if the user is permitted to
811 use it and only in per-cpu mode.
812
813 Note also, the sched_switch event is only added if TSC packets are
814 requested. That is because, in the absence of timing information, the
815 sched_switch events cannot be matched against the Intel PT trace.
816
818 By default, perf script will decode trace data found in the perf.data
819 file. This can be further controlled by new option --itrace.
820
821 New --itrace option
822 Having no option is the same as
823
824 --itrace
825
826 which, in turn, is the same as
827
828 --itrace=cepwx
829
830 The letters are:
831
832 i synthesize "instructions" events
833 b synthesize "branches" events
834 x synthesize "transactions" events
835 w synthesize "ptwrite" events
836 p synthesize "power" events (incl. PSB events)
837 c synthesize branches events (calls only)
838 r synthesize branches events (returns only)
839 e synthesize tracing error events
840 d create a debug log
841 g synthesize a call chain (use with i or x)
842 G synthesize a call chain on existing event records
843 l synthesize last branch entries (use with i or x)
844 L synthesize last branch entries on existing event records
845 s skip initial number of events
846 q quicker (less detailed) decoding
847 Z prefer to ignore timestamps (so-called "timeless" decoding)
848
849 "Instructions" events look like they were recorded by "perf record -e
850 instructions".
851
852 "Branches" events look like they were recorded by "perf record -e
853 branches". "c" and "r" can be combined to get calls and returns.
854
855 "Transactions" events correspond to the start or end of transactions.
856 The flags field can be used in perf script to determine whether the
857 event is a transaction start, commit or abort.
858
859 Note that "instructions", "branches" and "transactions" events depend
860 on code flow packets which can be disabled by using the config term
861 "branch=0". Refer to the config terms section above.
862
863 "ptwrite" events record the payload of the ptwrite instruction and
864 whether "fup_on_ptw" was used. "ptwrite" events depend on PTWRITE
865 packets which are recorded only if the "ptw" config term was used.
866 Refer to the config terms section above. perf script "synth" field
867 displays "ptwrite" information like this: "ip: 0 payload:
868 0x123456789abcdef0" where "ip" is 1 if "fup_on_ptw" was used.
869
870 "Power" events correspond to power event packets and CBR (core-to-bus
871 ratio) packets. While CBR packets are always recorded when tracing is
872 enabled, power event packets are recorded only if the "pwr_evt" config
873 term was used. Refer to the config terms section above. The power
874 events record information about C-state changes, whereas CBR is
875 indicative of CPU frequency. perf script "event,synth" fields display
876 information like this: cbr: cbr: 22 freq: 2189 MHz (200%) mwait: hints:
877 0x60 extensions: 0x1 pwre: hw: 0 cstate: 2 sub-cstate: 0 exstop: ip: 1
878 pwrx: deepest cstate: 2 last cstate: 2 wake reason: 0x4 Where: "cbr"
879 includes the frequency and the percentage of maximum non-turbo "mwait"
880 shows mwait hints and extensions "pwre" shows C-state transitions (to a
881 C-state deeper than C0) and whether initiated by hardware "exstop"
882 indicates execution stopped and whether the IP was recorded exactly,
883 "pwrx" indicates return to C0 For more details refer to the Intel 64
884 and IA-32 Architectures Software Developer Manuals.
885
886 PSB events show when a PSB+ occurred and also the byte-offset in the
887 trace. Emitting a PSB+ can cause a CPU a slight delay. When doing
888 timing analysis of code with Intel PT, it is useful to know if a timing
889 bubble was caused by Intel PT or not.
890
891 Error events show where the decoder lost the trace. Error events are
892 quite important. Users must know if what they are seeing is a complete
893 picture or not. The "e" option may be followed by flags which affect
894 what errors will or will not be reported. Each flag must be preceded by
895 either + or -. The flags supported by Intel PT are: -o Suppress
896 overflow errors -l Suppress trace data lost errors For example, for
897 errors but not overflow or data lost errors:
898
899 --itrace=e-o-l
900
901 The "d" option will cause the creation of a file "intel_pt.log"
902 containing all decoded packets and instructions. Note that this option
903 slows down the decoder and that the resulting file may be very large.
904 The "d" option may be followed by flags which affect what debug
905 messages will or will not be logged. Each flag must be preceded by
906 either + or -. The flags support by Intel PT are: -a Suppress logging
907 of perf events +a Log all perf events By default, logged perf events
908 are filtered by any specified time ranges, but flag +a overrides that.
909
910 In addition, the period of the "instructions" event can be specified.
911 e.g.
912
913 --itrace=i10us
914
915 sets the period to 10us i.e. one instruction sample is synthesized for
916 each 10 microseconds of trace. Alternatives to "us" are "ms"
917 (milliseconds), "ns" (nanoseconds), "t" (TSC ticks) or "i"
918 (instructions).
919
920 "ms", "us" and "ns" are converted to TSC ticks.
921
922 The timing information included with Intel PT does not give the time of
923 every instruction. Consequently, for the purpose of sampling, the
924 decoder estimates the time since the last timing packet based on 1 tick
925 per instruction. The time on the sample is not adjusted and reflects
926 the last known value of TSC.
927
928 For Intel PT, the default period is 100us.
929
930 Setting it to a zero period means "as often as possible".
931
932 In the case of Intel PT that is the same as a period of 1 and a unit of
933 instructions (i.e. --itrace=i1i).
934
935 Also the call chain size (default 16, max. 1024) for instructions or
936 transactions events can be specified. e.g.
937
938 --itrace=ig32
939 --itrace=xg32
940
941 Also the number of last branch entries (default 64, max. 1024) for
942 instructions or transactions events can be specified. e.g.
943
944 --itrace=il10
945 --itrace=xl10
946
947 Note that last branch entries are cleared for each sample, so there is
948 no overlap from one sample to the next.
949
950 The G and L options are designed in particular for sample mode, and
951 work much like g and l but add call chain and branch stack to the other
952 selected events instead of synthesized events. For example, to record
953 branch-misses events for ls and then add a call chain derived from the
954 Intel PT trace:
955
956 perf record --aux-sample -e '{intel_pt//u,branch-misses:u}' -- ls
957 perf report --itrace=Ge
958
959 Although in fact G is a default for perf report, so that is the same as
960 just:
961
962 perf report
963
964 One caveat with the G and L options is that they work poorly with
965 "Large PEBS". Large PEBS means PEBS records will be accumulated by
966 hardware and the written into the event buffer in one go. That reduces
967 interrupts, but can give very late timestamps. Because the Intel PT
968 trace is synchronized by timestamps, the PEBS events do not match the
969 trace. Currently, Large PEBS is used only in certain circumstances: -
970 hardware supports it - PEBS is used - event period is specified,
971 instead of frequency - the sample type is limited to the following
972 flags: PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR |
973 PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID |
974 PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_TRANSACTION
975 | PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER
976 | PERF_SAMPLE_PERIOD (and sometimes) | PERF_SAMPLE_TIME Because Intel
977 PT sample mode uses a different sample type to the list above, Large
978 PEBS is not used with Intel PT sample mode. To avoid Large PEBS in
979 other cases, avoid specifying the event period i.e. avoid the perf
980 record -c option, --count option, or period config term.
981
982 To disable trace decoding entirely, use the option --no-itrace.
983
984 It is also possible to skip events generated (instructions, branches,
985 transactions) at the beginning. This is useful to ignore initialization
986 code.
987
988 --itrace=i0nss1000000
989
990 skips the first million instructions.
991
992 The q option changes the way the trace is decoded. The decoding is much
993 faster but much less detailed. Specifically, with the q option, the
994 decoder does not decode TNT packets, and does not walk object code, but
995 gets the ip from FUP and TIP packets. The q option can be used with the
996 b and i options but the period is not used. The q option decodes more
997 quickly, but is useful only if the control flow of interest is
998 represented or indicated by FUP, TIP, TIP.PGE, or TIP.PGD packets
999 (refer below). However the q option could be used to find time ranges
1000 that could then be decoded fully using the --time option.
1001
1002 What will not be decoded with the (single) q option:
1003
1004 • direct calls and jmps
1005
1006 • conditional branches
1007
1008 • non-branch instructions
1009
1010 What will be decoded with the (single) q option:
1011
1012 • asynchronous branches such as interrupts
1013
1014 • indirect branches
1015
1016 • function return target address if the noretcomp config term (refer
1017 config terms section) was used
1018
1019 • start of (control-flow) tracing
1020
1021 • end of (control-flow) tracing, if it is not out of context
1022
1023 • power events, ptwrite, transaction start and abort
1024
1025 • instruction pointer associated with PSB packets
1026
1027 Note the q option does not specify what events will be synthesized e.g.
1028 the p option must be used also to show power events.
1029
1030 Repeating the q option (double-q i.e. qq) results in even faster
1031 decoding and even less detail. The decoder decodes only extended PSB
1032 (PSB+) packets, getting the instruction pointer if there is a FUP
1033 packet within PSB+ (i.e. between PSB and PSBEND). Note PSB packets
1034 occur regularly in the trace based on the psb_period config term (refer
1035 config terms section). There will be a FUP packet if the PSB+ occurs
1036 while control flow is being traced.
1037
1038 What will not be decoded with the qq option:
1039
1040 • everything except instruction pointer associated with PSB packets
1041
1042 What will be decoded with the qq option:
1043
1044 • instruction pointer associated with PSB packets
1045
1046 The Z option is equivalent to having recorded a trace without TSC (i.e.
1047 config term tsc=0). It can be useful to avoid timestamp issues when
1048 decoding a trace of a virtual machine.
1049
1050 dump option
1051 perf script has an option (-D) to "dump" the events i.e. display the
1052 binary data.
1053
1054 When -D is used, Intel PT packets are displayed. The packet decoder
1055 does not pay attention to PSB packets, but just decodes the bytes - so
1056 the packets seen by the actual decoder may not be identical in places
1057 where the data is corrupt. One example of that would be when the
1058 buffer-switching interrupt has been too slow, and the buffer has been
1059 filled completely. In that case, the last packet in the buffer might be
1060 truncated and immediately followed by a PSB as the trace continues in
1061 the next buffer.
1062
1063 To disable the display of Intel PT packets, combine the -D option with
1064 --no-itrace.
1065
1067 By default, perf report will decode trace data found in the perf.data
1068 file. This can be further controlled by new option --itrace exactly the
1069 same as perf script, with the exception that the default is
1070 --itrace=igxe.
1071
1073 perf inject also accepts the --itrace option in which case tracing data
1074 is removed and replaced with the synthesized events. e.g.
1075
1076 perf inject --itrace -i perf.data -o perf.data.new
1077
1078 Below is an example of using Intel PT with autofdo. It requires autofdo
1079 (https://github.com/google/autofdo) and gcc version 5. The bubble sort
1080 example is from the AutoFDO tutorial
1081 (https://gcc.gnu.org/wiki/AutoFDO/Tutorial) amended to take the number
1082 of elements as a parameter.
1083
1084 $ gcc-5 -O3 sort.c -o sort_optimized
1085 $ ./sort_optimized 30000
1086 Bubble sorting array of 30000 elements
1087 2254 ms
1088
1089 $ cat ~/.perfconfig
1090 [intel-pt]
1091 mispred-all = on
1092
1093 $ perf record -e intel_pt//u ./sort 3000
1094 Bubble sorting array of 3000 elements
1095 58 ms
1096 [ perf record: Woken up 2 times to write data ]
1097 [ perf record: Captured and wrote 3.939 MB perf.data ]
1098 $ perf inject -i perf.data -o inj --itrace=i100usle --strip
1099 $ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1
1100 $ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo
1101 $ ./sort_autofdo 30000
1102 Bubble sorting array of 30000 elements
1103 2155 ms
1104
1105 Note there is currently no advantage to using Intel PT instead of LBR,
1106 but that may change in the future if greater use is made of the data.
1107
1109 Some hardware has the feature to redirect PEBS records to the Intel PT
1110 trace. Recording is selected by using the aux-output config term e.g.
1111
1112 perf record -c 10000 -e '{intel_pt/branch=0/,cycles/aux-output/ppp}' uname
1113
1114 Note that currently, software only supports redirecting at most one
1115 PEBS event.
1116
1117 To display PEBS events from the Intel PT trace, use the itrace o option
1118 e.g.
1119
1120 perf script --itrace=oe
1121
1123 For --xed the xed tool is needed. Here is how to install it:
1124
1125 $ git clone https://github.com/intelxed/mbuild.git mbuild
1126 $ git clone https://github.com/intelxed/xed
1127 $ cd xed
1128 $ ./mfile.py --share
1129 $ ./mfile.py examples
1130 $ sudo ./mfile.py --prefix=/usr/local install
1131 $ sudo ldconfig
1132 $ sudo cp obj/examples/xed /usr/local/bin
1133
1134 Basic xed testing:
1135
1136 $ xed | head -3
1137 ERROR: required argument(s) were missing
1138 Copyright (C) 2017, Intel Corporation. All rights reserved.
1139 XED version: [v10.0-328-g7d62c8c49b7b]
1140 $
1141
1143 Currently, only kernel tracing is supported and only with either
1144 "timeless" decoding (i.e. no TSC timestamps) or VM Time Correlation. VM
1145 Time Correlation is an extra step using perf inject and requires
1146 unchanging VMX TSC Offset and no VMX TSC Scaling.
1147
1148 Other limitations and caveats
1149
1150 VMX controls may suppress packets needed for decoding resulting in decoding errors
1151 VMX controls may block the perf NMI to the host potentially resulting in lost trace data
1152 Guest kernel self-modifying code (e.g. jump labels or JIT-compiled eBPF) will result in decoding errors
1153 Guest thread information is unknown
1154 Guest VCPU is unknown but may be able to be inferred from the host thread
1155 Callchains are not supported
1156
1157 Example using "timeless" decoding
1158
1159 Start VM
1160
1161 $ sudo virsh start kubuntu20.04
1162 Domain kubuntu20.04 started
1163
1164 Mount the guest file system. Note sshfs needs -o direct_io to enable
1165 reading of proc files. root access is needed to read /proc/kcore.
1166
1167 $ mkdir vm0
1168 $ sshfs -o direct_io root@vm0:/ vm0
1169
1170 Copy the guest /proc/kallsyms, /proc/modules and /proc/kcore
1171
1172 $ perf buildid-cache -v --kcore vm0/proc/kcore
1173 kcore added to build-id cache directory /home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306
1174 $ KALLSYMS=/home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306/kallsyms
1175
1176 Find the VM process
1177
1178 $ ps -eLl | grep 'KVM\|PID'
1179 F S UID PID PPID LWP C PRI NI ADDR SZ WCHAN TTY TIME CMD
1180 3 S 64055 1430 1 1440 1 80 0 - 1921718 - ? 00:02:47 CPU 0/KVM
1181 3 S 64055 1430 1 1441 1 80 0 - 1921718 - ? 00:02:41 CPU 1/KVM
1182 3 S 64055 1430 1 1442 1 80 0 - 1921718 - ? 00:02:38 CPU 2/KVM
1183 3 S 64055 1430 1 1443 2 80 0 - 1921718 - ? 00:03:18 CPU 3/KVM
1184
1185 Start an open-ended perf record, tracing the VM process, do something
1186 on the VM, and then ctrl-C to stop. TSC is not supported and tsc=0 must
1187 be specified. That means mtc is useless, so add mtc=0. However, IPC can
1188 still be determined, hence cyc=1 can be added. Only kernel decoding is
1189 supported, so k must be specified. Intel PT traces both the host and
1190 the guest so --guest and --host need to be specified. Without
1191 timestamps, --per-thread must be specified to distinguish threads.
1192
1193 $ sudo perf kvm --guest --host --guestkallsyms $KALLSYMS record --kcore -e intel_pt/tsc=0,mtc=0,cyc=1/k -p 1430 --per-thread
1194 ^C
1195 [ perf record: Woken up 1 times to write data ]
1196 [ perf record: Captured and wrote 5.829 MB ]
1197
1198 perf script can be used to provide an instruction trace
1199
1200 $ perf script --guestkallsyms $KALLSYMS --insn-trace --xed -F+ipc | grep -C10 vmresume | head -21
1201 CPU 0/KVM 1440 ffffffff82133cdd __vmx_vcpu_run+0x3d ([kernel.kallsyms]) movq 0x48(%rax), %r9
1202 CPU 0/KVM 1440 ffffffff82133ce1 __vmx_vcpu_run+0x41 ([kernel.kallsyms]) movq 0x50(%rax), %r10
1203 CPU 0/KVM 1440 ffffffff82133ce5 __vmx_vcpu_run+0x45 ([kernel.kallsyms]) movq 0x58(%rax), %r11
1204 CPU 0/KVM 1440 ffffffff82133ce9 __vmx_vcpu_run+0x49 ([kernel.kallsyms]) movq 0x60(%rax), %r12
1205 CPU 0/KVM 1440 ffffffff82133ced __vmx_vcpu_run+0x4d ([kernel.kallsyms]) movq 0x68(%rax), %r13
1206 CPU 0/KVM 1440 ffffffff82133cf1 __vmx_vcpu_run+0x51 ([kernel.kallsyms]) movq 0x70(%rax), %r14
1207 CPU 0/KVM 1440 ffffffff82133cf5 __vmx_vcpu_run+0x55 ([kernel.kallsyms]) movq 0x78(%rax), %r15
1208 CPU 0/KVM 1440 ffffffff82133cf9 __vmx_vcpu_run+0x59 ([kernel.kallsyms]) movq (%rax), %rax
1209 CPU 0/KVM 1440 ffffffff82133cfc __vmx_vcpu_run+0x5c ([kernel.kallsyms]) callq 0xffffffff82133c40
1210 CPU 0/KVM 1440 ffffffff82133c40 vmx_vmenter+0x0 ([kernel.kallsyms]) jz 0xffffffff82133c46
1211 CPU 0/KVM 1440 ffffffff82133c42 vmx_vmenter+0x2 ([kernel.kallsyms]) vmresume IPC: 0.11 (50/445)
1212 :1440 1440 ffffffffbb678b06 native_write_msr+0x6 ([guest.kernel.kallsyms]) nopl %eax, (%rax,%rax,1)
1213 :1440 1440 ffffffffbb678b0b native_write_msr+0xb ([guest.kernel.kallsyms]) retq IPC: 0.04 (2/41)
1214 :1440 1440 ffffffffbb666646 lapic_next_deadline+0x26 ([guest.kernel.kallsyms]) data16 nop
1215 :1440 1440 ffffffffbb666648 lapic_next_deadline+0x28 ([guest.kernel.kallsyms]) xor %eax, %eax
1216 :1440 1440 ffffffffbb66664a lapic_next_deadline+0x2a ([guest.kernel.kallsyms]) popq %rbp
1217 :1440 1440 ffffffffbb66664b lapic_next_deadline+0x2b ([guest.kernel.kallsyms]) retq IPC: 0.16 (4/25)
1218 :1440 1440 ffffffffbb74607f clockevents_program_event+0x8f ([guest.kernel.kallsyms]) test %eax, %eax
1219 :1440 1440 ffffffffbb746081 clockevents_program_event+0x91 ([guest.kernel.kallsyms]) jz 0xffffffffbb74603c IPC: 0.06 (2/30)
1220 :1440 1440 ffffffffbb74603c clockevents_program_event+0x4c ([guest.kernel.kallsyms]) popq %rbx
1221 :1440 1440 ffffffffbb74603d clockevents_program_event+0x4d ([guest.kernel.kallsyms]) popq %r12
1222
1223 Example using VM Time Correlation
1224
1225 Start VM
1226
1227 $ sudo virsh start kubuntu20.04
1228 Domain kubuntu20.04 started
1229
1230 Mount the guest file system. Note sshfs needs -o direct_io to enable
1231 reading of proc files. root access is needed to read /proc/kcore.
1232
1233 $ mkdir -p vm0
1234 $ sshfs -o direct_io root@vm0:/ vm0
1235
1236 Copy the guest /proc/kallsyms, /proc/modules and /proc/kcore
1237
1238 $ perf buildid-cache -v --kcore vm0/proc/kcore
1239 same kcore found in /home/user/.debug/[kernel.kcore]/cc9c55a98c5e4ec0aeda69302554aabed5cd6491/2021021312450777
1240 $ KALLSYMS=/home/user/.debug/\[kernel.kcore\]/cc9c55a98c5e4ec0aeda69302554aabed5cd6491/2021021312450777/kallsyms
1241
1242 Find the VM process
1243
1244 $ ps -eLl | grep 'KVM\|PID'
1245 F S UID PID PPID LWP C PRI NI ADDR SZ WCHAN TTY TIME CMD
1246 3 S 64055 16998 1 17005 13 80 0 - 1818189 - ? 00:00:16 CPU 0/KVM
1247 3 S 64055 16998 1 17006 4 80 0 - 1818189 - ? 00:00:05 CPU 1/KVM
1248 3 S 64055 16998 1 17007 3 80 0 - 1818189 - ? 00:00:04 CPU 2/KVM
1249 3 S 64055 16998 1 17008 4 80 0 - 1818189 - ? 00:00:05 CPU 3/KVM
1250
1251 Start an open-ended perf record, tracing the VM process, do something
1252 on the VM, and then ctrl-C to stop. IPC can be determined, hence cyc=1
1253 can be added. Only kernel decoding is supported, so k must be
1254 specified. Intel PT traces both the host and the guest so --guest and
1255 --host need to be specified.
1256
1257 $ sudo perf kvm --guest --host --guestkallsyms $KALLSYMS record --kcore -e intel_pt/cyc=1/k -p 16998
1258 ^C[ perf record: Woken up 1 times to write data ]
1259 [ perf record: Captured and wrote 9.041 MB perf.data.kvm ]
1260
1261 Now perf inject can be used to determine the VMX TCS Offset. Note,
1262 Intel PT TSC packets are only 7-bytes, so the TSC Offset might differ
1263 from the actual value in the 8th byte. That will have no effect i.e.
1264 the resulting timestamps will be correct anyway.
1265
1266 $ perf inject -i perf.data.kvm --vm-time-correlation=dry-run
1267 ERROR: Unknown TSC Offset for VMCS 0x1bff6a
1268 VMCS: 0x1bff6a TSC Offset 0xffffe42722c64c41
1269 ERROR: Unknown TSC Offset for VMCS 0x1cbc08
1270 VMCS: 0x1cbc08 TSC Offset 0xffffe42722c64c41
1271 ERROR: Unknown TSC Offset for VMCS 0x1c3ce8
1272 VMCS: 0x1c3ce8 TSC Offset 0xffffe42722c64c41
1273 ERROR: Unknown TSC Offset for VMCS 0x1cbce9
1274 VMCS: 0x1cbce9 TSC Offset 0xffffe42722c64c41
1275
1276 Each virtual CPU has a different Virtual Machine Control Structure
1277 (VMCS) shown above with the calculated TSC Offset. For an unchanging
1278 TSC Offset they should all be the same for the same virtual machine.
1279
1280 Now that the TSC Offset is known, it can be provided to perf inject
1281
1282 $ perf inject -i perf.data.kvm --vm-time-correlation="dry-run 0xffffe42722c64c41"
1283
1284 Note the options for perf inject --vm-time-correlation are:
1285
1286 [ dry-run ] [ <TSC Offset> [ : <VMCS> [ , <VMCS> ]... ] ]...
1287
1288 So it is possible to specify different TSC Offsets for different VMCS.
1289 The option "dry-run" will cause the file to be processed but without
1290 updating it. Note it is also possible to get a intel_pt.log file by
1291 adding option --itrace=d
1292
1293 There were no errors so, do it for real
1294
1295 $ perf inject -i perf.data.kvm --vm-time-correlation=0xffffe42722c64c41 --force
1296
1297 perf script can be used to see if there are any decoder errors
1298
1299 $ perf script -i perf.data.kvm --guestkallsyms $KALLSYMS --itrace=e-o
1300
1301 There were none.
1302
1303 perf script can be used to provide an instruction trace showing
1304 timestamps
1305
1306 $ perf script -i perf.data.kvm --guestkallsyms $KALLSYMS --insn-trace --xed -F+ipc | grep -C10 vmresume | head -21
1307 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cdd __vmx_vcpu_run+0x3d ([kernel.kallsyms]) movq 0x48(%rax), %r9
1308 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ce1 __vmx_vcpu_run+0x41 ([kernel.kallsyms]) movq 0x50(%rax), %r10
1309 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ce5 __vmx_vcpu_run+0x45 ([kernel.kallsyms]) movq 0x58(%rax), %r11
1310 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ce9 __vmx_vcpu_run+0x49 ([kernel.kallsyms]) movq 0x60(%rax), %r12
1311 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ced __vmx_vcpu_run+0x4d ([kernel.kallsyms]) movq 0x68(%rax), %r13
1312 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cf1 __vmx_vcpu_run+0x51 ([kernel.kallsyms]) movq 0x70(%rax), %r14
1313 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cf5 __vmx_vcpu_run+0x55 ([kernel.kallsyms]) movq 0x78(%rax), %r15
1314 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cf9 __vmx_vcpu_run+0x59 ([kernel.kallsyms]) movq (%rax), %rax
1315 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cfc __vmx_vcpu_run+0x5c ([kernel.kallsyms]) callq 0xffffffff82133c40
1316 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133c40 vmx_vmenter+0x0 ([kernel.kallsyms]) jz 0xffffffff82133c46
1317 CPU 1/KVM 17006 [001] 11500.262866075: ffffffff82133c42 vmx_vmenter+0x2 ([kernel.kallsyms]) vmresume IPC: 0.05 (40/769)
1318 :17006 17006 [001] 11500.262869216: ffffffff82200cb0 asm_sysvec_apic_timer_interrupt+0x0 ([guest.kernel.kallsyms]) clac
1319 :17006 17006 [001] 11500.262869216: ffffffff82200cb3 asm_sysvec_apic_timer_interrupt+0x3 ([guest.kernel.kallsyms]) pushq $0xffffffffffffffff
1320 :17006 17006 [001] 11500.262869216: ffffffff82200cb5 asm_sysvec_apic_timer_interrupt+0x5 ([guest.kernel.kallsyms]) callq 0xffffffff82201160
1321 :17006 17006 [001] 11500.262869216: ffffffff82201160 error_entry+0x0 ([guest.kernel.kallsyms]) cld
1322 :17006 17006 [001] 11500.262869216: ffffffff82201161 error_entry+0x1 ([guest.kernel.kallsyms]) pushq %rsi
1323 :17006 17006 [001] 11500.262869216: ffffffff82201162 error_entry+0x2 ([guest.kernel.kallsyms]) movq 0x8(%rsp), %rsi
1324 :17006 17006 [001] 11500.262869216: ffffffff82201167 error_entry+0x7 ([guest.kernel.kallsyms]) movq %rdi, 0x8(%rsp)
1325 :17006 17006 [001] 11500.262869216: ffffffff8220116c error_entry+0xc ([guest.kernel.kallsyms]) pushq %rdx
1326 :17006 17006 [001] 11500.262869216: ffffffff8220116d error_entry+0xd ([guest.kernel.kallsyms]) pushq %rcx
1327 :17006 17006 [001] 11500.262869216: ffffffff8220116e error_entry+0xe ([guest.kernel.kallsyms]) pushq %rax
1328
1330 perf-record(1), perf-script(1), perf-report(1), perf-inject(1)
1331
1332
1333
1334perf 11/22/2021 PERF-INTEL-PT(1)