perf-intel-pt(1)

1PERF-INTEL-PT(1)                  perf Manual                 PERF-INTEL-PT(1)
2
3
4

NAME

6       perf-intel-pt - Support for Intel Processor Trace within perf tools
7

SYNOPSIS

9       perf record -e intel_pt//
10

DESCRIPTION

12       Intel Processor Trace (Intel PT) is an extension of Intel Architecture
13       that collects information about software execution such as control
14       flow, execution modes and timings and formats it into highly compressed
15       binary packets. Technical details are documented in the Intel 64 and
16       IA-32 Architectures Software Developer Manuals, Chapter 36 Intel
17       Processor Trace.
18
19       Intel PT is first supported in Intel Core M and 5th generation Intel
20       Core processors that are based on the Intel micro-architecture code
21       name Broadwell.
22
23       Trace data is collected by perf record and stored within the perf.data
24       file. See below for options to perf record.
25
26       Trace data must be decoded which involves walking the object code and
27       matching the trace data packets. For example a TNT packet only tells
28       whether a conditional branch was taken or not taken, so to make use of
29       that packet the decoder must know precisely which instruction was being
30       executed.
31
32       Decoding is done on-the-fly. The decoder outputs samples in the same
33       format as samples output by perf hardware events, for example as though
34       the "instructions" or "branches" events had been recorded. Presently 3
35       tools support this: perf script, perf report and perf inject. See below
36       for more information on using those tools.
37
38       The main distinguishing feature of Intel PT is that the decoder can
39       determine the exact flow of software execution. Intel PT can be used to
40       understand why and how did software get to a certain point, or behave a
41       certain way. The software does not have to be recompiled, so Intel PT
42       works with debug or release builds, however the executed images are
43       needed - which makes use in JIT-compiled environments, or with
44       self-modified code, a challenge. Also symbols need to be provided to
45       make sense of addresses.
46
47       A limitation of Intel PT is that it produces huge amounts of trace data
48       (hundreds of megabytes per second per core) which takes a long time to
49       decode, for example two or three orders of magnitude longer than it
50       took to collect. Another limitation is the performance impact of
51       tracing, something that will vary depending on the use-case and
52       architecture.
53

QUICKSTART

55       It is important to start small. That is because it is easy to capture
56       vastly more data than can possibly be processed.
57
58       The simplest thing to do with Intel PT is userspace profiling of small
59       programs. Data is captured with perf record e.g. to trace ls
60       userspace-only:
61
62           perf record -e intel_pt//u ls
63
64       And profiled with perf report e.g.
65
66           perf report
67
68       To also trace kernel space presents a problem, namely kernel
69       self-modifying code. A fairly good kernel image is available in
70       /proc/kcore but to get an accurate image a copy of /proc/kcore needs to
71       be made under the same conditions as the data capture. perf record can
72       make a copy of /proc/kcore if the option --kcore is used, but access to
73       /proc/kcore is restricted e.g.
74
75           sudo perf record -o pt_ls --kcore -e intel_pt// -- ls
76
77       which will create a directory named pt_ls and put the perf.data file
78       (named simply data) and copies of /proc/kcore, /proc/kallsyms and
79       /proc/modules into it. The other tools understand the directory format,
80       so to use perf report becomes:
81
82           sudo perf report -i pt_ls
83
84       Because samples are synthesized after-the-fact, the sampling period can
85       be selected for reporting. e.g. sample every microsecond
86
87           sudo perf report pt_ls --itrace=i1usge
88
89       See the sections below for more information about the --itrace option.
90
91       Beware the smaller the period, the more samples that are produced, and
92       the longer it takes to process them.
93
94       Also note that the coarseness of Intel PT timing information will start
95       to distort the statistical value of the sampling as the sampling period
96       becomes smaller.
97
98       To represent software control flow, "branches" samples are produced. By
99       default a branch sample is synthesized for every single branch. To get
100       an idea what data is available you can use the perf script tool with
101       all itrace sampling options, which will list all the samples.
102
103           perf record -e intel_pt//u ls
104           perf script --itrace=ibxwpe
105
106       An interesting field that is not printed by default is flags which can
107       be displayed as follows:
108
109           perf script --itrace=ibxwpe -F+flags
110
111       The flags are "bcrosyiABExgh" which stand for branch, call, return,
112       conditional, system, asynchronous, interrupt, transaction abort, trace
113       begin, trace end, in transaction, VM-entry, and VM-exit respectively.
114
115       perf script also supports higher level ways to dump instruction traces:
116
117           perf script --insn-trace --xed
118
119       Dump all instructions. This requires installing the xed tool (see XED
120       below) Dumping all instructions in a long trace can be fairly slow. It
121       is usually better to start with higher level decoding, like
122
123           perf script --call-trace
124
125       or
126
127           perf script --call-ret-trace
128
129       and then select a time range of interest. The time range can then be
130       examined in detail with
131
132           perf script --time starttime,stoptime --insn-trace --xed
133
134       While examining the trace it’s also useful to filter on specific CPUs
135       using the -C option
136
137           perf script --time starttime,stoptime --insn-trace --xed -C 1
138
139       Dump all instructions in time range on CPU 1.
140
141       Another interesting field that is not printed by default is ipc which
142       can be displayed as follows:
143
144           perf script --itrace=be -F+ipc
145
146       There are two ways that instructions-per-cycle (IPC) can be calculated
147       depending on the recording.
148
149       If the cyc config term (see config terms section below) was used, then
150       IPC is calculated using the cycle count from CYC packets, otherwise MTC
151       packets are used - refer to the mtc config term. When MTC is used,
152       however, the values are less accurate because the timing is less
153       accurate.
154
155       Because Intel PT does not update the cycle count on every branch or
156       instruction, the values will often be zero. When there are values, they
157       will be the number of instructions and number of cycles since the last
158       update, and thus represent the average IPC since the last IPC for that
159       event type. Note IPC for "branches" events is calculated separately
160       from IPC for "instructions" events.
161
162       Also note that the IPC instruction count may or may not include the
163       current instruction. If the cycle count is associated with an
164       asynchronous branch (e.g. page fault or interrupt), then the
165       instruction count does not include the current instruction, otherwise
166       it does. That is consistent with whether or not that instruction has
167       retired when the cycle count is updated.
168
169       Another note, in the case of "branches" events, non-taken branches are
170       not presently sampled, so IPC values for them do not appear e.g. a CYC
171       packet with a TNT packet that starts with a non-taken branch. To see
172       every possible IPC value, "instructions" events can be used e.g.
173       --itrace=i0ns
174
175       While it is possible to create scripts to analyze the data, an
176       alternative approach is available to export the data to a sqlite or
177       postgresql database. Refer to script export-to-sqlite.py or
178       export-to-postgresql.py for more details, and to script
179       exported-sql-viewer.py for an example of using the database.
180
181       There is also script intel-pt-events.py which provides an example of
182       how to unpack the raw data for power events and PTWRITE. The script
183       also displays branches, and supports 2 additional modes selected by
184       option:
185
186           --insn-trace - instruction trace
187           --src-trace - source trace
188
189       As mentioned above, it is easy to capture too much data. One way to
190       limit the data captured is to use snapshot mode which is explained
191       further below. Refer to new snapshot option and Intel PT modes of
192       operation further below.
193
194       Another problem that will be experienced is decoder errors. They can be
195       caused by inability to access the executed image, self-modified or
196       JIT-ed code, or the inability to match side-band information (such as
197       context switches and mmaps) which results in the decoder not knowing
198       what code was executed.
199
200       There is also the problem of perf not being able to copy the data fast
201       enough, resulting in data lost because the buffer was full. See Buffer
202       handling below for more details.
203

PERF RECORD

205   new event
206       The Intel PT kernel driver creates a new PMU for Intel PT. PMU events
207       are selected by providing the PMU name followed by the "config"
208       separated by slashes. An enhancement has been made to allow default
209       "config" e.g. the option
210
211           -e intel_pt//
212
213       will use a default config value. Currently that is the same as
214
215           -e intel_pt/tsc,noretcomp=0/
216
217       which is the same as
218
219           -e intel_pt/tsc=1,noretcomp=0/
220
221       Note there are now new config terms - see section config terms further
222       below.
223
224       The config terms are listed in /sys/devices/intel_pt/format. They are
225       bit fields within the config member of the struct perf_event_attr which
226       is passed to the kernel by the perf_event_open system call. They
227       correspond to bit fields in the IA32_RTIT_CTL MSR. Here is a list of
228       them and their definitions:
229
230           $ grep -H . /sys/bus/event_source/devices/intel_pt/format/*
231           /sys/bus/event_source/devices/intel_pt/format/cyc:config:1
232           /sys/bus/event_source/devices/intel_pt/format/cyc_thresh:config:19-22
233           /sys/bus/event_source/devices/intel_pt/format/mtc:config:9
234           /sys/bus/event_source/devices/intel_pt/format/mtc_period:config:14-17
235           /sys/bus/event_source/devices/intel_pt/format/noretcomp:config:11
236           /sys/bus/event_source/devices/intel_pt/format/psb_period:config:24-27
237           /sys/bus/event_source/devices/intel_pt/format/tsc:config:10
238
239       Note that the default config must be overridden for each term i.e.
240
241           -e intel_pt/noretcomp=0/
242
243       is the same as:
244
245           -e intel_pt/tsc=1,noretcomp=0/
246
247       So, to disable TSC packets use:
248
249           -e intel_pt/tsc=0/
250
251       It is also possible to specify the config value explicitly:
252
253           -e intel_pt/config=0x400/
254
255       Note that, as with all events, the event is suffixed with event
256       modifiers:
257
258           u       userspace
259           k       kernel
260           h       hypervisor
261           G       guest
262           H       host
263           p       precise ip
264
265       h, G and H are for virtualization which is not supported by Intel PT. p
266       is also not relevant to Intel PT. So only options u and k are
267       meaningful for Intel PT.
268
269       perf_event_attr is displayed if the -vv option is used e.g.
270
271           ------------------------------------------------------------
272           perf_event_attr:
273           type                             6
274           size                             112
275           config                           0x400
276           { sample_period, sample_freq }   1
277           sample_type                      IP|TID|TIME|CPU|IDENTIFIER
278           read_format                      ID
279           disabled                         1
280           inherit                          1
281           exclude_kernel                   1
282           exclude_hv                       1
283           enable_on_exec                   1
284           sample_id_all                    1
285           ------------------------------------------------------------
286           sys_perf_event_open: pid 31104  cpu 0  group_fd -1  flags 0x8
287           sys_perf_event_open: pid 31104  cpu 1  group_fd -1  flags 0x8
288           sys_perf_event_open: pid 31104  cpu 2  group_fd -1  flags 0x8
289           sys_perf_event_open: pid 31104  cpu 3  group_fd -1  flags 0x8
290           ------------------------------------------------------------
291
292   config terms
293       The June 2015 version of Intel 64 and IA-32 Architectures Software
294       Developer Manuals, Chapter 36 Intel Processor Trace, defined new Intel
295       PT features. Some of the features are reflect in new config terms. All
296       the config terms are described below.
297
298       tsc Always supported. Produces TSC timestamp packets to provide timing
299       information. In some cases it is possible to decode without timing
300       information, for example a per-thread context that does not overlap
301       executable memory maps.
302
303           The default config selects tsc (i.e. tsc=1).
304
305       noretcomp Always supported. Disables "return compression" so a TIP
306       packet is produced when a function returns. Causes more packets to be
307       produced but might make decoding more reliable.
308
309           The default config does not select noretcomp (i.e. noretcomp=0).
310
311       psb_period Allows the frequency of PSB packets to be specified.
312
313           The PSB packet is a synchronization packet that provides a
314           starting point for decoding or recovery from errors.
315
316           Support for psb_period is indicated by:
317
318           /sys/bus/event_source/devices/intel_pt/caps/psb_cyc
319
320           which contains "1" if the feature is supported and "0"
321           otherwise.
322
323           Valid values are given by:
324
325           /sys/bus/event_source/devices/intel_pt/caps/psb_periods
326
327           which contains a hexadecimal value, the bits of which represent
328           valid values e.g. bit 2 set means value 2 is valid.
329
330           The psb_period value is converted to the approximate number of
331           trace bytes between PSB packets as:
332
333           2 ^ (value + 11)
334
335           e.g. value 3 means 16KiB bytes between PSBs
336
337           If an invalid value is entered, the error message
338           will give a list of valid values e.g.
339
340           $ perf record -e intel_pt/psb_period=15/u uname
341           Invalid psb_period for intel_pt. Valid values are: 0-5
342
343           If MTC packets are selected, the default config selects a value
344           of 3 (i.e. psb_period=3) or the nearest lower value that is
345           supported (0 is always supported).  Otherwise the default is 0.
346
347           If decoding is expected to be reliable and the buffer is large
348           then a large PSB period can be used.
349
350           Because a TSC packet is produced with PSB, the PSB period can
351           also affect the granularity to timing information in the absence
352           of MTC or CYC.
353
354       mtc Produces MTC timing packets.
355
356           MTC packets provide finer grain timestamp information than TSC
357           packets.  MTC packets record time using the hardware crystal
358           clock (CTC) which is related to TSC packets using a TMA packet.
359
360           Support for this feature is indicated by:
361
362           /sys/bus/event_source/devices/intel_pt/caps/mtc
363
364           which contains "1" if the feature is supported and
365           "0" otherwise.
366
367           The frequency of MTC packets can also be specified - see
368           mtc_period below.
369
370       mtc_period Specifies how frequently MTC packets are produced - see mtc
371       above for how to determine if MTC packets are supported.
372
373           Valid values are given by:
374
375           /sys/bus/event_source/devices/intel_pt/caps/mtc_periods
376
377           which contains a hexadecimal value, the bits of which represent
378           valid values e.g. bit 2 set means value 2 is valid.
379
380           The mtc_period value is converted to the MTC frequency as:
381
382           CTC-frequency / (2 ^ value)
383
384           e.g. value 3 means one eighth of CTC-frequency
385
386           Where CTC is the hardware crystal clock, the frequency of which
387           can be related to TSC via values provided in cpuid leaf 0x15.
388
389           If an invalid value is entered, the error message
390           will give a list of valid values e.g.
391
392           $ perf record -e intel_pt/mtc_period=15/u uname
393           Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9
394
395           The default value is 3 or the nearest lower value
396           that is supported (0 is always supported).
397
398       cyc Produces CYC timing packets.
399
400           CYC packets provide even finer grain timestamp information than
401           MTC and TSC packets.  A CYC packet contains the number of CPU
402           cycles since the last CYC packet. Unlike MTC and TSC packets,
403           CYC packets are only sent when another packet is also sent.
404
405           Support for this feature is indicated by:
406
407           /sys/bus/event_source/devices/intel_pt/caps/psb_cyc
408
409           which contains "1" if the feature is supported and
410           "0" otherwise.
411
412           The number of CYC packets produced can be reduced by specifying
413           a threshold - see cyc_thresh below.
414
415       cyc_thresh Specifies how frequently CYC packets are produced - see cyc
416       above for how to determine if CYC packets are supported.
417
418           Valid cyc_thresh values are given by:
419
420           /sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds
421
422           which contains a hexadecimal value, the bits of which represent
423           valid values e.g. bit 2 set means value 2 is valid.
424
425           The cyc_thresh value represents the minimum number of CPU cycles
426           that must have passed before a CYC packet can be sent.  The
427           number of CPU cycles is:
428
429           2 ^ (value - 1)
430
431           e.g. value 4 means 8 CPU cycles must pass before a CYC packet
432           can be sent.  Note a CYC packet is still only sent when another
433           packet is sent, not at, e.g. every 8 CPU cycles.
434
435           If an invalid value is entered, the error message
436           will give a list of valid values e.g.
437
438           $ perf record -e intel_pt/cyc,cyc_thresh=15/u uname
439           Invalid cyc_thresh for intel_pt. Valid values are: 0-12
440
441           CYC packets are not requested by default.
442
443       pt Specifies pass-through which enables the branch config term.
444
445           The default config selects 'pt' if it is available, so a user will
446           never need to specify this term.
447
448       branch Enable branch tracing. Branch tracing is enabled by default so
449       to disable branch tracing use branch=0.
450
451           The default config selects 'branch' if it is available.
452
453       ptw Enable PTWRITE packets which are produced when a ptwrite
454       instruction is executed.
455
456           Support for this feature is indicated by:
457
458           /sys/bus/event_source/devices/intel_pt/caps/ptwrite
459
460           which contains "1" if the feature is supported and
461           "0" otherwise.
462
463       fup_on_ptw Enable a FUP packet to follow the PTWRITE packet. The FUP
464       packet provides the address of the ptwrite instruction. In the absence
465       of fup_on_ptw, the decoder will use the address of the previous branch
466       if branch tracing is enabled, otherwise the address will be zero. Note
467       that fup_on_ptw will work even when branch tracing is disabled.
468
469       pwr_evt Enable power events. The power events provide information about
470       changes to the CPU C-state.
471
472           Support for this feature is indicated by:
473
474           /sys/bus/event_source/devices/intel_pt/caps/power_event_trace
475
476           which contains "1" if the feature is supported and
477           "0" otherwise.
478
479   AUX area sampling option
480       To select Intel PT "sampling" the AUX area sampling option can be used:
481
482           --aux-sample
483
484       Optionally it can be followed by the sample size in bytes e.g.
485
486           --aux-sample=8192
487
488       In addition, the Intel PT event to sample must be defined e.g.
489
490           -e intel_pt//u
491
492       Samples on other events will be created containing Intel PT data e.g.
493       the following will create Intel PT samples on the branch-misses event,
494       note the events must be grouped using {}:
495
496           perf record --aux-sample -e '{intel_pt//u,branch-misses:u}'
497
498       An alternative to --aux-sample is to add the config term
499       aux-sample-size to events. In this case, the grouping is implied e.g.
500
501           perf record -e intel_pt//u -e branch-misses/aux-sample-size=8192/u
502
503       is the same as:
504
505           perf record -e '{intel_pt//u,branch-misses/aux-sample-size=8192/u}'
506
507       but allows for also using an address filter e.g.:
508
509           perf record -e intel_pt//u --filter 'filter * @/bin/ls' -e branch-misses/aux-sample-size=8192/u -- ls
510
511       It is important to select a sample size that is big enough to contain
512       at least one PSB packet. If not a warning will be displayed:
513
514           Intel PT sample size (%zu) may be too small for PSB period (%zu)
515
516       The calculation used for that is: if sample_size ⟨ psb_period + 256
517       display the warning. When sampling is used, psb_period defaults to 0
518       (2KiB).
519
520       The default sample size is 4KiB.
521
522       The sample size is passed in aux_sample_size in struct perf_event_attr.
523       The sample size is limited by the maximum event size which is 64KiB. It
524       is difficult to know how big the event might be without the trace
525       sample attached, but the tool validates that the sample size is not
526       greater than 60KiB.
527
528   new snapshot option
529       The difference between full trace and snapshot from the kernel’s
530       perspective is that in full trace we don’t overwrite trace data that
531       the user hasn’t collected yet (and indicated that by advancing
532       aux_tail), whereas in snapshot mode we let the trace run and overwrite
533       older data in the buffer so that whenever something interesting
534       happens, we can stop it and grab a snapshot of what was going on around
535       that interesting moment.
536
537       To select snapshot mode a new option has been added:
538
539           -S
540
541       Optionally it can be followed by the snapshot size e.g.
542
543           -S0x100000
544
545       The default snapshot size is the auxtrace mmap size. If neither
546       auxtrace mmap size nor snapshot size is specified, then the default is
547       4MiB for privileged users (or if /proc/sys/kernel/perf_event_paranoid <
548       0), 128KiB for unprivileged users. If an unprivileged user does not
549       specify mmap pages, the mmap pages will be reduced as described in the
550       new auxtrace mmap size option section below.
551
552       The snapshot size is displayed if the option -vv is used e.g.
553
554           Intel PT snapshot size: %zu
555
556   new auxtrace mmap size option
557       Intel PT buffer size is specified by an addition to the -m option e.g.
558
559           -m,16
560
561       selects a buffer size of 16 pages i.e. 64KiB.
562
563       Note that the existing functionality of -m is unchanged. The auxtrace
564       mmap size is specified by the optional addition of a comma and the
565       value.
566
567       The default auxtrace mmap size for Intel PT is 4MiB/page_size for
568       privileged users (or if /proc/sys/kernel/perf_event_paranoid < 0),
569       128KiB for unprivileged users. If an unprivileged user does not specify
570       mmap pages, the mmap pages will be reduced from the default
571       512KiB/page_size to 256KiB/page_size, otherwise the user is likely to
572       get an error as they exceed their mlock limit (Max locked memory as
573       shown in /proc/self/limits). Note that perf does not count the first
574       512KiB (actually /proc/sys/kernel/perf_event_mlock_kb minus 1 page) per
575       cpu against the mlock limit so an unprivileged user is allowed 512KiB
576       per cpu plus their mlock limit (which defaults to 64KiB but is not
577       multiplied by the number of cpus).
578
579       In full-trace mode, powers of two are allowed for buffer size, with a
580       minimum size of 2 pages. In snapshot mode or sampling mode, it is the
581       same but the minimum size is 1 page.
582
583       The mmap size and auxtrace mmap size are displayed if the -vv option is
584       used e.g.
585
586           mmap length 528384
587           auxtrace mmap length 4198400
588
589   Intel PT modes of operation
590       Intel PT can be used in 3 modes: full-trace mode sample mode snapshot
591       mode
592
593       Full-trace mode traces continuously e.g.
594
595           perf record -e intel_pt//u uname
596
597       Sample mode attaches a Intel PT sample to other events e.g.
598
599           perf record --aux-sample -e intel_pt//u -e branch-misses:u
600
601       Snapshot mode captures the available data when a signal is sent or
602       "snapshot" control command is issued. e.g. using a signal
603
604           perf record -v -e intel_pt//u -S ./loopy 1000000000 &
605           [1] 11435
606           kill -USR2 11435
607           Recording AUX area tracing snapshot
608
609       Note that the signal sent is SIGUSR2. Note that "Recording AUX area
610       tracing snapshot" is displayed because the -v option is used.
611
612       The advantage of using "snapshot" control command is that the access is
613       controlled by access to a FIFO e.g.
614
615           $ mkfifo perf.control
616           $ mkfifo perf.ack
617           $ cat perf.ack &
618           [1] 15235
619           $ sudo ~/bin/perf record --control fifo:perf.control,perf.ack -S -e intel_pt//u -- sleep 60 &
620           [2] 15243
621           $ ps -e | grep perf
622           15244 pts/1    00:00:00 perf
623           $ kill -USR2 15244
624           bash: kill: (15244) - Operation not permitted
625           $ echo snapshot > perf.control
626           ack
627
628       The 3 Intel PT modes of operation cannot be used together.
629
630   Buffer handling
631       There may be buffer limitations (i.e. single ToPa entry) which means
632       that actual buffer sizes are limited to powers of 2 up to 4MiB
633       (MAX_ORDER). In order to provide other sizes, and in particular an
634       arbitrarily large size, multiple buffers are logically concatenated.
635       However an interrupt must be used to switch between buffers. That has
636       two potential problems: a) the interrupt may not be handled in time so
637       that the current buffer becomes full and some trace data is lost. b)
638       the interrupts may slow the system and affect the performance results.
639
640       If trace data is lost, the driver sets truncated in the PERF_RECORD_AUX
641       event which the tools report as an error.
642
643       In full-trace mode, the driver waits for data to be copied out before
644       allowing the (logical) buffer to wrap-around. If data is not copied out
645       quickly enough, again truncated is set in the PERF_RECORD_AUX event. If
646       the driver has to wait, the intel_pt event gets disabled. Because it is
647       difficult to know when that happens, perf tools always re-enable the
648       intel_pt event after copying out data.
649
650   Intel PT and build ids
651       By default "perf record" post-processes the event stream to find all
652       build ids for executables for all addresses sampled. Deliberately,
653       Intel PT is not decoded for that purpose (it would take too long).
654       Instead the build ids for all executables encountered (due to mmap,
655       comm or task events) are included in the perf.data file.
656
657       To see buildids included in the perf.data file use the command:
658
659           perf buildid-list
660
661       If the perf.data file contains Intel PT data, that is the same as:
662
663           perf buildid-list --with-hits
664
665   Snapshot mode and event disabling
666       In order to make a snapshot, the intel_pt event is disabled using an
667       IOCTL, namely PERF_EVENT_IOC_DISABLE. However doing that can also
668       disable the collection of side-band information. In order to prevent
669       that, a dummy software event has been introduced that permits tracking
670       events (like mmaps) to continue to be recorded while intel_pt is
671       disabled. That is important to ensure there is complete side-band
672       information to allow the decoding of subsequent snapshots.
673
674       A test has been created for that. To find the test:
675
676           perf test list
677           ...
678           23: Test using a dummy software event to keep tracking
679
680       To run the test:
681
682           perf test 23
683           23: Test using a dummy software event to keep tracking     : Ok
684
685   perf record modes (nothing new here)
686       perf record essentially operates in one of three modes: per thread per
687       cpu workload only
688
689       "per thread" mode is selected by -t or by --per-thread (with -p or -u
690       or just a workload). "per cpu" is selected by -C or -a. "workload only"
691       mode is selected by not using the other options but providing a command
692       to run (i.e. the workload).
693
694       In per-thread mode an exact list of threads is traced. There is no
695       inheritance. Each thread has its own event buffer.
696
697       In per-cpu mode all processes (or processes from the selected cgroup
698       i.e. -G option, or processes selected with -p or -u) are traced. Each
699       cpu has its own buffer. Inheritance is allowed.
700
701       In workload-only mode, the workload is traced but with per-cpu buffers.
702       Inheritance is allowed. Note that you can now trace a workload in
703       per-thread mode by using the --per-thread option.
704
705   Privileged vs non-privileged users
706       Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged
707       users have memory limits imposed upon them. That affects what buffer
708       sizes they can have as outlined above.
709
710       The v4.2 kernel introduced support for a context switch metadata event,
711       PERF_RECORD_SWITCH, which allows unprivileged users to see when their
712       processes are scheduled out and in, just not by whom, which is left for
713       the PERF_RECORD_SWITCH_CPU_WIDE, that is only accessible in system wide
714       context, which in turn requires CAP_PERFMON or CAP_SYS_ADMIN.
715
716       Please see the 45ac1403f564 ("perf: Add PERF_RECORD_SWITCH to indicate
717       context switches") commit, that introduces these metadata events for
718       further info.
719
720       When working with kernels < v4.2, the following considerations must be
721       taken, as the sched:sched_switch tracepoints will be used to receive
722       such information:
723
724       Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged
725       users are not permitted to use tracepoints which means there is
726       insufficient side-band information to decode Intel PT in per-cpu mode,
727       and potentially workload-only mode too if the workload creates new
728       processes.
729
730       Note also, that to use tracepoints, read-access to debugfs is required.
731       So if debugfs is not mounted or the user does not have read-access, it
732       will again not be possible to decode Intel PT in per-cpu mode.
733
734   sched_switch tracepoint
735       The sched_switch tracepoint is used to provide side-band data for Intel
736       PT decoding in kernels where the PERF_RECORD_SWITCH metadata event
737       isn’t available.
738
739       The sched_switch events are automatically added. e.g. the second event
740       shown below:
741
742           $ perf record -vv -e intel_pt//u uname
743           ------------------------------------------------------------
744           perf_event_attr:
745           type                             6
746           size                             112
747           config                           0x400
748           { sample_period, sample_freq }   1
749           sample_type                      IP|TID|TIME|CPU|IDENTIFIER
750           read_format                      ID
751           disabled                         1
752           inherit                          1
753           exclude_kernel                   1
754           exclude_hv                       1
755           enable_on_exec                   1
756           sample_id_all                    1
757           ------------------------------------------------------------
758           sys_perf_event_open: pid 31104  cpu 0  group_fd -1  flags 0x8
759           sys_perf_event_open: pid 31104  cpu 1  group_fd -1  flags 0x8
760           sys_perf_event_open: pid 31104  cpu 2  group_fd -1  flags 0x8
761           sys_perf_event_open: pid 31104  cpu 3  group_fd -1  flags 0x8
762           ------------------------------------------------------------
763           perf_event_attr:
764           type                             2
765           size                             112
766           config                           0x108
767           { sample_period, sample_freq }   1
768           sample_type                      IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER
769           read_format                      ID
770           inherit                          1
771           sample_id_all                    1
772           exclude_guest                    1
773           ------------------------------------------------------------
774           sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8
775           sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8
776           sys_perf_event_open: pid -1  cpu 2  group_fd -1  flags 0x8
777           sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8
778           ------------------------------------------------------------
779           perf_event_attr:
780           type                             1
781           size                             112
782           config                           0x9
783           { sample_period, sample_freq }   1
784           sample_type                      IP|TID|TIME|IDENTIFIER
785           read_format                      ID
786           disabled                         1
787           inherit                          1
788           exclude_kernel                   1
789           exclude_hv                       1
790           mmap                             1
791           comm                             1
792           enable_on_exec                   1
793           task                             1
794           sample_id_all                    1
795           mmap2                            1
796           comm_exec                        1
797           ------------------------------------------------------------
798           sys_perf_event_open: pid 31104  cpu 0  group_fd -1  flags 0x8
799           sys_perf_event_open: pid 31104  cpu 1  group_fd -1  flags 0x8
800           sys_perf_event_open: pid 31104  cpu 2  group_fd -1  flags 0x8
801           sys_perf_event_open: pid 31104  cpu 3  group_fd -1  flags 0x8
802           mmap size 528384B
803           AUX area mmap length 4194304
804           perf event ring buffer mmapped per cpu
805           Synthesizing auxtrace information
806           Linux
807           [ perf record: Woken up 1 times to write data ]
808           [ perf record: Captured and wrote 0.042 MB perf.data ]
809
810       Note, the sched_switch event is only added if the user is permitted to
811       use it and only in per-cpu mode.
812
813       Note also, the sched_switch event is only added if TSC packets are
814       requested. That is because, in the absence of timing information, the
815       sched_switch events cannot be matched against the Intel PT trace.
816

PERF SCRIPT

818       By default, perf script will decode trace data found in the perf.data
819       file. This can be further controlled by new option --itrace.
820
821   New --itrace option
822       Having no option is the same as
823
824           --itrace
825
826       which, in turn, is the same as
827
828           --itrace=cepwx
829
830       The letters are:
831
832           i       synthesize "instructions" events
833           b       synthesize "branches" events
834           x       synthesize "transactions" events
835           w       synthesize "ptwrite" events
836           p       synthesize "power" events (incl. PSB events)
837           c       synthesize branches events (calls only)
838           r       synthesize branches events (returns only)
839           e       synthesize tracing error events
840           d       create a debug log
841           g       synthesize a call chain (use with i or x)
842           G       synthesize a call chain on existing event records
843           l       synthesize last branch entries (use with i or x)
844           L       synthesize last branch entries on existing event records
845           s       skip initial number of events
846           q       quicker (less detailed) decoding
847           Z       prefer to ignore timestamps (so-called "timeless" decoding)
848
849       "Instructions" events look like they were recorded by "perf record -e
850       instructions".
851
852       "Branches" events look like they were recorded by "perf record -e
853       branches". "c" and "r" can be combined to get calls and returns.
854
855       "Transactions" events correspond to the start or end of transactions.
856       The flags field can be used in perf script to determine whether the
857       event is a transaction start, commit or abort.
858
859       Note that "instructions", "branches" and "transactions" events depend
860       on code flow packets which can be disabled by using the config term
861       "branch=0". Refer to the config terms section above.
862
863       "ptwrite" events record the payload of the ptwrite instruction and
864       whether "fup_on_ptw" was used. "ptwrite" events depend on PTWRITE
865       packets which are recorded only if the "ptw" config term was used.
866       Refer to the config terms section above. perf script "synth" field
867       displays "ptwrite" information like this: "ip: 0 payload:
868       0x123456789abcdef0" where "ip" is 1 if "fup_on_ptw" was used.
869
870       "Power" events correspond to power event packets and CBR (core-to-bus
871       ratio) packets. While CBR packets are always recorded when tracing is
872       enabled, power event packets are recorded only if the "pwr_evt" config
873       term was used. Refer to the config terms section above. The power
874       events record information about C-state changes, whereas CBR is
875       indicative of CPU frequency. perf script "event,synth" fields display
876       information like this: cbr: cbr: 22 freq: 2189 MHz (200%) mwait: hints:
877       0x60 extensions: 0x1 pwre: hw: 0 cstate: 2 sub-cstate: 0 exstop: ip: 1
878       pwrx: deepest cstate: 2 last cstate: 2 wake reason: 0x4 Where: "cbr"
879       includes the frequency and the percentage of maximum non-turbo "mwait"
880       shows mwait hints and extensions "pwre" shows C-state transitions (to a
881       C-state deeper than C0) and whether initiated by hardware "exstop"
882       indicates execution stopped and whether the IP was recorded exactly,
883       "pwrx" indicates return to C0 For more details refer to the Intel 64
884       and IA-32 Architectures Software Developer Manuals.
885
886       PSB events show when a PSB+ occurred and also the byte-offset in the
887       trace. Emitting a PSB+ can cause a CPU a slight delay. When doing
888       timing analysis of code with Intel PT, it is useful to know if a timing
889       bubble was caused by Intel PT or not.
890
891       Error events show where the decoder lost the trace. Error events are
892       quite important. Users must know if what they are seeing is a complete
893       picture or not. The "e" option may be followed by flags which affect
894       what errors will or will not be reported. Each flag must be preceded by
895       either + or -. The flags supported by Intel PT are: -o Suppress
896       overflow errors -l Suppress trace data lost errors For example, for
897       errors but not overflow or data lost errors:
898
899           --itrace=e-o-l
900
901       The "d" option will cause the creation of a file "intel_pt.log"
902       containing all decoded packets and instructions. Note that this option
903       slows down the decoder and that the resulting file may be very large.
904       The "d" option may be followed by flags which affect what debug
905       messages will or will not be logged. Each flag must be preceded by
906       either + or -. The flags support by Intel PT are: -a Suppress logging
907       of perf events +a Log all perf events By default, logged perf events
908       are filtered by any specified time ranges, but flag +a overrides that.
909
910       In addition, the period of the "instructions" event can be specified.
911       e.g.
912
913           --itrace=i10us
914
915       sets the period to 10us i.e. one instruction sample is synthesized for
916       each 10 microseconds of trace. Alternatives to "us" are "ms"
917       (milliseconds), "ns" (nanoseconds), "t" (TSC ticks) or "i"
918       (instructions).
919
920       "ms", "us" and "ns" are converted to TSC ticks.
921
922       The timing information included with Intel PT does not give the time of
923       every instruction. Consequently, for the purpose of sampling, the
924       decoder estimates the time since the last timing packet based on 1 tick
925       per instruction. The time on the sample is not adjusted and reflects
926       the last known value of TSC.
927
928       For Intel PT, the default period is 100us.
929
930       Setting it to a zero period means "as often as possible".
931
932       In the case of Intel PT that is the same as a period of 1 and a unit of
933       instructions (i.e. --itrace=i1i).
934
935       Also the call chain size (default 16, max. 1024) for instructions or
936       transactions events can be specified. e.g.
937
938           --itrace=ig32
939           --itrace=xg32
940
941       Also the number of last branch entries (default 64, max. 1024) for
942       instructions or transactions events can be specified. e.g.
943
944           --itrace=il10
945           --itrace=xl10
946
947       Note that last branch entries are cleared for each sample, so there is
948       no overlap from one sample to the next.
949
950       The G and L options are designed in particular for sample mode, and
951       work much like g and l but add call chain and branch stack to the other
952       selected events instead of synthesized events. For example, to record
953       branch-misses events for ls and then add a call chain derived from the
954       Intel PT trace:
955
956           perf record --aux-sample -e '{intel_pt//u,branch-misses:u}' -- ls
957           perf report --itrace=Ge
958
959       Although in fact G is a default for perf report, so that is the same as
960       just:
961
962           perf report
963
964       One caveat with the G and L options is that they work poorly with
965       "Large PEBS". Large PEBS means PEBS records will be accumulated by
966       hardware and the written into the event buffer in one go. That reduces
967       interrupts, but can give very late timestamps. Because the Intel PT
968       trace is synchronized by timestamps, the PEBS events do not match the
969       trace. Currently, Large PEBS is used only in certain circumstances: -
970       hardware supports it - PEBS is used - event period is specified,
971       instead of frequency - the sample type is limited to the following
972       flags: PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR |
973       PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID |
974       PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_TRANSACTION
975       | PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER
976       | PERF_SAMPLE_PERIOD (and sometimes) | PERF_SAMPLE_TIME Because Intel
977       PT sample mode uses a different sample type to the list above, Large
978       PEBS is not used with Intel PT sample mode. To avoid Large PEBS in
979       other cases, avoid specifying the event period i.e. avoid the perf
980       record -c option, --count option, or period config term.
981
982       To disable trace decoding entirely, use the option --no-itrace.
983
984       It is also possible to skip events generated (instructions, branches,
985       transactions) at the beginning. This is useful to ignore initialization
986       code.
987
988           --itrace=i0nss1000000
989
990       skips the first million instructions.
991
992       The q option changes the way the trace is decoded. The decoding is much
993       faster but much less detailed. Specifically, with the q option, the
994       decoder does not decode TNT packets, and does not walk object code, but
995       gets the ip from FUP and TIP packets. The q option can be used with the
996       b and i options but the period is not used. The q option decodes more
997       quickly, but is useful only if the control flow of interest is
998       represented or indicated by FUP, TIP, TIP.PGE, or TIP.PGD packets
999       (refer below). However the q option could be used to find time ranges
1000       that could then be decoded fully using the --time option.
1001
1002       What will not be decoded with the (single) q option:
1003
1004       •   direct calls and jmps
1005
1006       •   conditional branches
1007
1008       •   non-branch instructions
1009
1010       What will be decoded with the (single) q option:
1011
1012       •   asynchronous branches such as interrupts
1013
1014       •   indirect branches
1015
1016       •   function return target address if the noretcomp config term (refer
1017           config terms section) was used
1018
1019       •   start of (control-flow) tracing
1020
1021       •   end of (control-flow) tracing, if it is not out of context
1022
1023       •   power events, ptwrite, transaction start and abort
1024
1025       •   instruction pointer associated with PSB packets
1026
1027       Note the q option does not specify what events will be synthesized e.g.
1028       the p option must be used also to show power events.
1029
1030       Repeating the q option (double-q i.e. qq) results in even faster
1031       decoding and even less detail. The decoder decodes only extended PSB
1032       (PSB+) packets, getting the instruction pointer if there is a FUP
1033       packet within PSB+ (i.e. between PSB and PSBEND). Note PSB packets
1034       occur regularly in the trace based on the psb_period config term (refer
1035       config terms section). There will be a FUP packet if the PSB+ occurs
1036       while control flow is being traced.
1037
1038       What will not be decoded with the qq option:
1039
1040       •   everything except instruction pointer associated with PSB packets
1041
1042       What will be decoded with the qq option:
1043
1044       •   instruction pointer associated with PSB packets
1045
1046       The Z option is equivalent to having recorded a trace without TSC (i.e.
1047       config term tsc=0). It can be useful to avoid timestamp issues when
1048       decoding a trace of a virtual machine.
1049
1050   dump option
1051       perf script has an option (-D) to "dump" the events i.e. display the
1052       binary data.
1053
1054       When -D is used, Intel PT packets are displayed. The packet decoder
1055       does not pay attention to PSB packets, but just decodes the bytes - so
1056       the packets seen by the actual decoder may not be identical in places
1057       where the data is corrupt. One example of that would be when the
1058       buffer-switching interrupt has been too slow, and the buffer has been
1059       filled completely. In that case, the last packet in the buffer might be
1060       truncated and immediately followed by a PSB as the trace continues in
1061       the next buffer.
1062
1063       To disable the display of Intel PT packets, combine the -D option with
1064       --no-itrace.
1065

PERF REPORT

1067       By default, perf report will decode trace data found in the perf.data
1068       file. This can be further controlled by new option --itrace exactly the
1069       same as perf script, with the exception that the default is
1070       --itrace=igxe.
1071

PERF INJECT

1073       perf inject also accepts the --itrace option in which case tracing data
1074       is removed and replaced with the synthesized events. e.g.
1075
1076           perf inject --itrace -i perf.data -o perf.data.new
1077
1078       Below is an example of using Intel PT with autofdo. It requires autofdo
1079       (https://github.com/google/autofdo) and gcc version 5. The bubble sort
1080       example is from the AutoFDO tutorial
1081       (https://gcc.gnu.org/wiki/AutoFDO/Tutorial) amended to take the number
1082       of elements as a parameter.
1083
1084           $ gcc-5 -O3 sort.c -o sort_optimized
1085           $ ./sort_optimized 30000
1086           Bubble sorting array of 30000 elements
1087           2254 ms
1088
1089           $ cat ~/.perfconfig
1090           [intel-pt]
1091                   mispred-all = on
1092
1093           $ perf record -e intel_pt//u ./sort 3000
1094           Bubble sorting array of 3000 elements
1095           58 ms
1096           [ perf record: Woken up 2 times to write data ]
1097           [ perf record: Captured and wrote 3.939 MB perf.data ]
1098           $ perf inject -i perf.data -o inj --itrace=i100usle --strip
1099           $ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1
1100           $ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo
1101           $ ./sort_autofdo 30000
1102           Bubble sorting array of 30000 elements
1103           2155 ms
1104
1105       Note there is currently no advantage to using Intel PT instead of LBR,
1106       but that may change in the future if greater use is made of the data.
1107

PEBS VIA INTEL PT

1109       Some hardware has the feature to redirect PEBS records to the Intel PT
1110       trace. Recording is selected by using the aux-output config term e.g.
1111
1112           perf record -c 10000 -e '{intel_pt/branch=0/,cycles/aux-output/ppp}' uname
1113
1114       Note that currently, software only supports redirecting at most one
1115       PEBS event.
1116
1117       To display PEBS events from the Intel PT trace, use the itrace o option
1118       e.g.
1119
1120           perf script --itrace=oe
1121

XED

1123       For --xed the xed tool is needed. Here is how to install it:
1124
1125           $ git clone https://github.com/intelxed/mbuild.git mbuild
1126           $ git clone https://github.com/intelxed/xed
1127           $ cd xed
1128           $ ./mfile.py --share
1129           $ ./mfile.py examples
1130           $ sudo ./mfile.py --prefix=/usr/local install
1131           $ sudo ldconfig
1132           $ sudo cp obj/examples/xed /usr/local/bin
1133
1134       Basic xed testing:
1135
1136           $ xed | head -3
1137           ERROR: required argument(s) were missing
1138           Copyright (C) 2017, Intel Corporation. All rights reserved.
1139           XED version: [v10.0-328-g7d62c8c49b7b]
1140           $
1141

TRACING VIRTUAL MACHINES

1143       Currently, only kernel tracing is supported and only with either
1144       "timeless" decoding (i.e. no TSC timestamps) or VM Time Correlation. VM
1145       Time Correlation is an extra step using perf inject and requires
1146       unchanging VMX TSC Offset and no VMX TSC Scaling.
1147
1148       Other limitations and caveats
1149
1150           VMX controls may suppress packets needed for decoding resulting in decoding errors
1151           VMX controls may block the perf NMI to the host potentially resulting in lost trace data
1152           Guest kernel self-modifying code (e.g. jump labels or JIT-compiled eBPF) will result in decoding errors
1153           Guest thread information is unknown
1154           Guest VCPU is unknown but may be able to be inferred from the host thread
1155           Callchains are not supported
1156
1157       Example using "timeless" decoding
1158
1159       Start VM
1160
1161           $ sudo virsh start kubuntu20.04
1162           Domain kubuntu20.04 started
1163
1164       Mount the guest file system. Note sshfs needs -o direct_io to enable
1165       reading of proc files. root access is needed to read /proc/kcore.
1166
1167           $ mkdir vm0
1168           $ sshfs -o direct_io root@vm0:/ vm0
1169
1170       Copy the guest /proc/kallsyms, /proc/modules and /proc/kcore
1171
1172           $ perf buildid-cache -v --kcore vm0/proc/kcore
1173           kcore added to build-id cache directory /home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306
1174           $ KALLSYMS=/home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306/kallsyms
1175
1176       Find the VM process
1177
1178           $ ps -eLl | grep 'KVM\|PID'
1179           F S   UID     PID    PPID     LWP  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
1180           3 S 64055    1430       1    1440  1  80   0 - 1921718 -    ?        00:02:47 CPU 0/KVM
1181           3 S 64055    1430       1    1441  1  80   0 - 1921718 -    ?        00:02:41 CPU 1/KVM
1182           3 S 64055    1430       1    1442  1  80   0 - 1921718 -    ?        00:02:38 CPU 2/KVM
1183           3 S 64055    1430       1    1443  2  80   0 - 1921718 -    ?        00:03:18 CPU 3/KVM
1184
1185       Start an open-ended perf record, tracing the VM process, do something
1186       on the VM, and then ctrl-C to stop. TSC is not supported and tsc=0 must
1187       be specified. That means mtc is useless, so add mtc=0. However, IPC can
1188       still be determined, hence cyc=1 can be added. Only kernel decoding is
1189       supported, so k must be specified. Intel PT traces both the host and
1190       the guest so --guest and --host need to be specified. Without
1191       timestamps, --per-thread must be specified to distinguish threads.
1192
1193           $ sudo perf kvm --guest --host --guestkallsyms $KALLSYMS record --kcore -e intel_pt/tsc=0,mtc=0,cyc=1/k -p 1430 --per-thread
1194           ^C
1195           [ perf record: Woken up 1 times to write data ]
1196           [ perf record: Captured and wrote 5.829 MB ]
1197
1198       perf script can be used to provide an instruction trace
1199
1200           $ perf script --guestkallsyms $KALLSYMS --insn-trace --xed -F+ipc | grep -C10 vmresume | head -21
1201                 CPU 0/KVM  1440  ffffffff82133cdd __vmx_vcpu_run+0x3d ([kernel.kallsyms])                movq  0x48(%rax), %r9
1202                 CPU 0/KVM  1440  ffffffff82133ce1 __vmx_vcpu_run+0x41 ([kernel.kallsyms])                movq  0x50(%rax), %r10
1203                 CPU 0/KVM  1440  ffffffff82133ce5 __vmx_vcpu_run+0x45 ([kernel.kallsyms])                movq  0x58(%rax), %r11
1204                 CPU 0/KVM  1440  ffffffff82133ce9 __vmx_vcpu_run+0x49 ([kernel.kallsyms])                movq  0x60(%rax), %r12
1205                 CPU 0/KVM  1440  ffffffff82133ced __vmx_vcpu_run+0x4d ([kernel.kallsyms])                movq  0x68(%rax), %r13
1206                 CPU 0/KVM  1440  ffffffff82133cf1 __vmx_vcpu_run+0x51 ([kernel.kallsyms])                movq  0x70(%rax), %r14
1207                 CPU 0/KVM  1440  ffffffff82133cf5 __vmx_vcpu_run+0x55 ([kernel.kallsyms])                movq  0x78(%rax), %r15
1208                 CPU 0/KVM  1440  ffffffff82133cf9 __vmx_vcpu_run+0x59 ([kernel.kallsyms])                movq  (%rax), %rax
1209                 CPU 0/KVM  1440  ffffffff82133cfc __vmx_vcpu_run+0x5c ([kernel.kallsyms])                callq  0xffffffff82133c40
1210                 CPU 0/KVM  1440  ffffffff82133c40 vmx_vmenter+0x0 ([kernel.kallsyms])            jz 0xffffffff82133c46
1211                 CPU 0/KVM  1440  ffffffff82133c42 vmx_vmenter+0x2 ([kernel.kallsyms])            vmresume         IPC: 0.11 (50/445)
1212                     :1440  1440  ffffffffbb678b06 native_write_msr+0x6 ([guest.kernel.kallsyms])                 nopl  %eax, (%rax,%rax,1)
1213                     :1440  1440  ffffffffbb678b0b native_write_msr+0xb ([guest.kernel.kallsyms])                 retq     IPC: 0.04 (2/41)
1214                     :1440  1440  ffffffffbb666646 lapic_next_deadline+0x26 ([guest.kernel.kallsyms])             data16 nop
1215                     :1440  1440  ffffffffbb666648 lapic_next_deadline+0x28 ([guest.kernel.kallsyms])             xor %eax, %eax
1216                     :1440  1440  ffffffffbb66664a lapic_next_deadline+0x2a ([guest.kernel.kallsyms])             popq  %rbp
1217                     :1440  1440  ffffffffbb66664b lapic_next_deadline+0x2b ([guest.kernel.kallsyms])             retq     IPC: 0.16 (4/25)
1218                     :1440  1440  ffffffffbb74607f clockevents_program_event+0x8f ([guest.kernel.kallsyms])               test %eax, %eax
1219                     :1440  1440  ffffffffbb746081 clockevents_program_event+0x91 ([guest.kernel.kallsyms])               jz 0xffffffffbb74603c    IPC: 0.06 (2/30)
1220                     :1440  1440  ffffffffbb74603c clockevents_program_event+0x4c ([guest.kernel.kallsyms])               popq  %rbx
1221                     :1440  1440  ffffffffbb74603d clockevents_program_event+0x4d ([guest.kernel.kallsyms])               popq  %r12
1222
1223       Example using VM Time Correlation
1224
1225       Start VM
1226
1227           $ sudo virsh start kubuntu20.04
1228           Domain kubuntu20.04 started
1229
1230       Mount the guest file system. Note sshfs needs -o direct_io to enable
1231       reading of proc files. root access is needed to read /proc/kcore.
1232
1233           $ mkdir -p vm0
1234           $ sshfs -o direct_io root@vm0:/ vm0
1235
1236       Copy the guest /proc/kallsyms, /proc/modules and /proc/kcore
1237
1238           $ perf buildid-cache -v --kcore vm0/proc/kcore
1239           same kcore found in /home/user/.debug/[kernel.kcore]/cc9c55a98c5e4ec0aeda69302554aabed5cd6491/2021021312450777
1240           $ KALLSYMS=/home/user/.debug/\[kernel.kcore\]/cc9c55a98c5e4ec0aeda69302554aabed5cd6491/2021021312450777/kallsyms
1241
1242       Find the VM process
1243
1244           $ ps -eLl | grep 'KVM\|PID'
1245           F S   UID     PID    PPID     LWP  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
1246           3 S 64055   16998       1   17005 13  80   0 - 1818189 -    ?        00:00:16 CPU 0/KVM
1247           3 S 64055   16998       1   17006  4  80   0 - 1818189 -    ?        00:00:05 CPU 1/KVM
1248           3 S 64055   16998       1   17007  3  80   0 - 1818189 -    ?        00:00:04 CPU 2/KVM
1249           3 S 64055   16998       1   17008  4  80   0 - 1818189 -    ?        00:00:05 CPU 3/KVM
1250
1251       Start an open-ended perf record, tracing the VM process, do something
1252       on the VM, and then ctrl-C to stop. IPC can be determined, hence cyc=1
1253       can be added. Only kernel decoding is supported, so k must be
1254       specified. Intel PT traces both the host and the guest so --guest and
1255       --host need to be specified.
1256
1257           $ sudo perf kvm --guest --host --guestkallsyms $KALLSYMS record --kcore -e intel_pt/cyc=1/k -p 16998
1258           ^C[ perf record: Woken up 1 times to write data ]
1259           [ perf record: Captured and wrote 9.041 MB perf.data.kvm ]
1260
1261       Now perf inject can be used to determine the VMX TCS Offset. Note,
1262       Intel PT TSC packets are only 7-bytes, so the TSC Offset might differ
1263       from the actual value in the 8th byte. That will have no effect i.e.
1264       the resulting timestamps will be correct anyway.
1265
1266           $ perf inject -i perf.data.kvm --vm-time-correlation=dry-run
1267           ERROR: Unknown TSC Offset for VMCS 0x1bff6a
1268           VMCS: 0x1bff6a  TSC Offset 0xffffe42722c64c41
1269           ERROR: Unknown TSC Offset for VMCS 0x1cbc08
1270           VMCS: 0x1cbc08  TSC Offset 0xffffe42722c64c41
1271           ERROR: Unknown TSC Offset for VMCS 0x1c3ce8
1272           VMCS: 0x1c3ce8  TSC Offset 0xffffe42722c64c41
1273           ERROR: Unknown TSC Offset for VMCS 0x1cbce9
1274           VMCS: 0x1cbce9  TSC Offset 0xffffe42722c64c41
1275
1276       Each virtual CPU has a different Virtual Machine Control Structure
1277       (VMCS) shown above with the calculated TSC Offset. For an unchanging
1278       TSC Offset they should all be the same for the same virtual machine.
1279
1280       Now that the TSC Offset is known, it can be provided to perf inject
1281
1282           $ perf inject -i perf.data.kvm --vm-time-correlation="dry-run 0xffffe42722c64c41"
1283
1284       Note the options for perf inject --vm-time-correlation are:
1285
1286           [ dry-run ] [ <TSC Offset> [ : <VMCS> [ , <VMCS> ]... ]  ]...
1287
1288       So it is possible to specify different TSC Offsets for different VMCS.
1289       The option "dry-run" will cause the file to be processed but without
1290       updating it. Note it is also possible to get a intel_pt.log file by
1291       adding option --itrace=d
1292
1293       There were no errors so, do it for real
1294
1295           $ perf inject -i perf.data.kvm --vm-time-correlation=0xffffe42722c64c41 --force
1296
1297       perf script can be used to see if there are any decoder errors
1298
1299           $ perf script -i perf.data.kvm --guestkallsyms $KALLSYMS --itrace=e-o
1300
1301       There were none.
1302
1303       perf script can be used to provide an instruction trace showing
1304       timestamps
1305
1306           $ perf script -i perf.data.kvm --guestkallsyms $KALLSYMS --insn-trace --xed -F+ipc | grep -C10 vmresume | head -21
1307                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133cdd __vmx_vcpu_run+0x3d ([kernel.kallsyms])                 movq  0x48(%rax), %r9
1308                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133ce1 __vmx_vcpu_run+0x41 ([kernel.kallsyms])                 movq  0x50(%rax), %r10
1309                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133ce5 __vmx_vcpu_run+0x45 ([kernel.kallsyms])                 movq  0x58(%rax), %r11
1310                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133ce9 __vmx_vcpu_run+0x49 ([kernel.kallsyms])                 movq  0x60(%rax), %r12
1311                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133ced __vmx_vcpu_run+0x4d ([kernel.kallsyms])                 movq  0x68(%rax), %r13
1312                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133cf1 __vmx_vcpu_run+0x51 ([kernel.kallsyms])                 movq  0x70(%rax), %r14
1313                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133cf5 __vmx_vcpu_run+0x55 ([kernel.kallsyms])                 movq  0x78(%rax), %r15
1314                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133cf9 __vmx_vcpu_run+0x59 ([kernel.kallsyms])                 movq  (%rax), %rax
1315                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133cfc __vmx_vcpu_run+0x5c ([kernel.kallsyms])                 callq  0xffffffff82133c40
1316                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133c40 vmx_vmenter+0x0 ([kernel.kallsyms])             jz 0xffffffff82133c46
1317                 CPU 1/KVM 17006 [001] 11500.262866075:  ffffffff82133c42 vmx_vmenter+0x2 ([kernel.kallsyms])             vmresume         IPC: 0.05 (40/769)
1318                    :17006 17006 [001] 11500.262869216:  ffffffff82200cb0 asm_sysvec_apic_timer_interrupt+0x0 ([guest.kernel.kallsyms])           clac
1319                    :17006 17006 [001] 11500.262869216:  ffffffff82200cb3 asm_sysvec_apic_timer_interrupt+0x3 ([guest.kernel.kallsyms])           pushq  $0xffffffffffffffff
1320                    :17006 17006 [001] 11500.262869216:  ffffffff82200cb5 asm_sysvec_apic_timer_interrupt+0x5 ([guest.kernel.kallsyms])           callq  0xffffffff82201160
1321                    :17006 17006 [001] 11500.262869216:  ffffffff82201160 error_entry+0x0 ([guest.kernel.kallsyms])               cld
1322                    :17006 17006 [001] 11500.262869216:  ffffffff82201161 error_entry+0x1 ([guest.kernel.kallsyms])               pushq  %rsi
1323                    :17006 17006 [001] 11500.262869216:  ffffffff82201162 error_entry+0x2 ([guest.kernel.kallsyms])               movq  0x8(%rsp), %rsi
1324                    :17006 17006 [001] 11500.262869216:  ffffffff82201167 error_entry+0x7 ([guest.kernel.kallsyms])               movq  %rdi, 0x8(%rsp)
1325                    :17006 17006 [001] 11500.262869216:  ffffffff8220116c error_entry+0xc ([guest.kernel.kallsyms])               pushq  %rdx
1326                    :17006 17006 [001] 11500.262869216:  ffffffff8220116d error_entry+0xd ([guest.kernel.kallsyms])               pushq  %rcx
1327                    :17006 17006 [001] 11500.262869216:  ffffffff8220116e error_entry+0xe ([guest.kernel.kallsyms])               pushq  %rax
1328