perf-intel-pt(1)

1PERF-INTEL-PT(1)                  perf Manual                 PERF-INTEL-PT(1)
2
3
4

NAME

6       perf-intel-pt - Support for Intel Processor Trace within perf tools
7

SYNOPSIS

9       perf record -e intel_pt//
10

DESCRIPTION

12       Intel Processor Trace (Intel PT) is an extension of Intel Architecture
13       that collects information about software execution such as control
14       flow, execution modes and timings and formats it into highly compressed
15       binary packets. Technical details are documented in the Intel 64 and
16       IA-32 Architectures Software Developer Manuals, Chapter 36 Intel
17       Processor Trace.
18
19       Intel PT is first supported in Intel Core M and 5th generation Intel
20       Core processors that are based on the Intel micro-architecture code
21       name Broadwell.
22
23       Trace data is collected by perf record and stored within the perf.data
24       file. See below for options to perf record.
25
26       Trace data must be decoded which involves walking the object code and
27       matching the trace data packets. For example a TNT packet only tells
28       whether a conditional branch was taken or not taken, so to make use of
29       that packet the decoder must know precisely which instruction was being
30       executed.
31
32       Decoding is done on-the-fly. The decoder outputs samples in the same
33       format as samples output by perf hardware events, for example as though
34       the "instructions" or "branches" events had been recorded. Presently 3
35       tools support this: perf script, perf report and perf inject. See below
36       for more information on using those tools.
37
38       The main distinguishing feature of Intel PT is that the decoder can
39       determine the exact flow of software execution. Intel PT can be used to
40       understand why and how did software get to a certain point, or behave a
41       certain way. The software does not have to be recompiled, so Intel PT
42       works with debug or release builds, however the executed images are
43       needed - which makes use in JIT-compiled environments, or with
44       self-modified code, a challenge. Also symbols need to be provided to
45       make sense of addresses.
46
47       A limitation of Intel PT is that it produces huge amounts of trace data
48       (hundreds of megabytes per second per core) which takes a long time to
49       decode, for example two or three orders of magnitude longer than it
50       took to collect. Another limitation is the performance impact of
51       tracing, something that will vary depending on the use-case and
52       architecture.
53

QUICKSTART

55       It is important to start small. That is because it is easy to capture
56       vastly more data than can possibly be processed.
57
58       The simplest thing to do with Intel PT is userspace profiling of small
59       programs. Data is captured with perf record e.g. to trace ls
60       userspace-only:
61
62           perf record -e intel_pt//u ls
63
64       And profiled with perf report e.g.
65
66           perf report
67
68       To also trace kernel space presents a problem, namely kernel
69       self-modifying code. A fairly good kernel image is available in
70       /proc/kcore but to get an accurate image a copy of /proc/kcore needs to
71       be made under the same conditions as the data capture. perf record can
72       make a copy of /proc/kcore if the option --kcore is used, but access to
73       /proc/kcore is restricted e.g.
74
75           sudo perf record -o pt_ls --kcore -e intel_pt// -- ls
76
77       which will create a directory named pt_ls and put the perf.data file
78       (named simply data) and copies of /proc/kcore, /proc/kallsyms and
79       /proc/modules into it. The other tools understand the directory format,
80       so to use perf report becomes:
81
82           sudo perf report -i pt_ls
83
84       Because samples are synthesized after-the-fact, the sampling period can
85       be selected for reporting. e.g. sample every microsecond
86
87           sudo perf report pt_ls --itrace=i1usge
88
89       See the sections below for more information about the --itrace option.
90
91       Beware the smaller the period, the more samples that are produced, and
92       the longer it takes to process them.
93
94       Also note that the coarseness of Intel PT timing information will start
95       to distort the statistical value of the sampling as the sampling period
96       becomes smaller.
97
98       To represent software control flow, "branches" samples are produced. By
99       default a branch sample is synthesized for every single branch. To get
100       an idea what data is available you can use the perf script tool with
101       all itrace sampling options, which will list all the samples.
102
103           perf record -e intel_pt//u ls
104           perf script --itrace=ibxwpe
105
106       An interesting field that is not printed by default is flags which can
107       be displayed as follows:
108
109           perf script --itrace=ibxwpe -F+flags
110
111       The flags are "bcrosyiABExghDt" which stand for branch, call, return,
112       conditional, system, asynchronous, interrupt, transaction abort, trace
113       begin, trace end, in transaction, VM-entry, VM-exit, interrupt
114       disabled, and interrupt disable toggle respectively.
115
116       perf script also supports higher level ways to dump instruction traces:
117
118           perf script --insn-trace --xed
119
120       Dump all instructions. This requires installing the xed tool (see XED
121       below) Dumping all instructions in a long trace can be fairly slow. It
122       is usually better to start with higher level decoding, like
123
124           perf script --call-trace
125
126       or
127
128           perf script --call-ret-trace
129
130       and then select a time range of interest. The time range can then be
131       examined in detail with
132
133           perf script --time starttime,stoptime --insn-trace --xed
134
135       While examining the trace it’s also useful to filter on specific CPUs
136       using the -C option
137
138           perf script --time starttime,stoptime --insn-trace --xed -C 1
139
140       Dump all instructions in time range on CPU 1.
141
142       Another interesting field that is not printed by default is ipc which
143       can be displayed as follows:
144
145           perf script --itrace=be -F+ipc
146
147       There are two ways that instructions-per-cycle (IPC) can be calculated
148       depending on the recording.
149
150       If the cyc config term (see config terms section below) was used, then
151       IPC is calculated using the cycle count from CYC packets, otherwise MTC
152       packets are used - refer to the mtc config term. When MTC is used,
153       however, the values are less accurate because the timing is less
154       accurate.
155
156       Because Intel PT does not update the cycle count on every branch or
157       instruction, the values will often be zero. When there are values, they
158       will be the number of instructions and number of cycles since the last
159       update, and thus represent the average IPC since the last IPC for that
160       event type. Note IPC for "branches" events is calculated separately
161       from IPC for "instructions" events.
162
163       Even with the cyc config term, it is possible to produce IPC
164       information for every change of timestamp, but at the expense of
165       accuracy. That is selected by specifying the itrace A option. Due to
166       the granularity of timestamps, the actual number of cycles increases
167       even though the cycles reported does not. The number of instructions is
168       known, but if IPC is reported, cycles can be too low and so IPC is too
169       high. Note that inaccuracy decreases as the period of sampling
170       increases i.e. if the number of cycles is too low by a small amount,
171       that becomes less significant if the number of cycles is large. It may
172       also be useful to use the A option in conjunction with
173       dlfilter-show-cycles.so to provide higher granularity cycle
174       information.
175
176       Also note that the IPC instruction count may or may not include the
177       current instruction. If the cycle count is associated with an
178       asynchronous branch (e.g. page fault or interrupt), then the
179       instruction count does not include the current instruction, otherwise
180       it does. That is consistent with whether or not that instruction has
181       retired when the cycle count is updated.
182
183       Another note, in the case of "branches" events, non-taken branches are
184       not presently sampled, so IPC values for them do not appear e.g. a CYC
185       packet with a TNT packet that starts with a non-taken branch. To see
186       every possible IPC value, "instructions" events can be used e.g.
187       --itrace=i0ns
188
189       While it is possible to create scripts to analyze the data, an
190       alternative approach is available to export the data to a sqlite or
191       postgresql database. Refer to script export-to-sqlite.py or
192       export-to-postgresql.py for more details, and to script
193       exported-sql-viewer.py for an example of using the database.
194
195       There is also script intel-pt-events.py which provides an example of
196       how to unpack the raw data for power events and PTWRITE. The script
197       also displays branches, and supports 2 additional modes selected by
198       option:
199
200           --insn-trace - instruction trace
201           --src-trace - source trace
202
203       As mentioned above, it is easy to capture too much data. One way to
204       limit the data captured is to use snapshot mode which is explained
205       further below. Refer to new snapshot option and Intel PT modes of
206       operation further below.
207
208       Another problem that will be experienced is decoder errors. They can be
209       caused by inability to access the executed image, self-modified or
210       JIT-ed code, or the inability to match side-band information (such as
211       context switches and mmaps) which results in the decoder not knowing
212       what code was executed.
213
214       There is also the problem of perf not being able to copy the data fast
215       enough, resulting in data lost because the buffer was full. See Buffer
216       handling below for more details.
217

PERF RECORD

219   new event
220       The Intel PT kernel driver creates a new PMU for Intel PT. PMU events
221       are selected by providing the PMU name followed by the "config"
222       separated by slashes. An enhancement has been made to allow default
223       "config" e.g. the option
224
225           -e intel_pt//
226
227       will use a default config value. Currently that is the same as
228
229           -e intel_pt/tsc,noretcomp=0/
230
231       which is the same as
232
233           -e intel_pt/tsc=1,noretcomp=0/
234
235       Note there are now new config terms - see section config terms further
236       below.
237
238       The config terms are listed in /sys/devices/intel_pt/format. They are
239       bit fields within the config member of the struct perf_event_attr which
240       is passed to the kernel by the perf_event_open system call. They
241       correspond to bit fields in the IA32_RTIT_CTL MSR. Here is a list of
242       them and their definitions:
243
244           $ grep -H . /sys/bus/event_source/devices/intel_pt/format/*
245           /sys/bus/event_source/devices/intel_pt/format/cyc:config:1
246           /sys/bus/event_source/devices/intel_pt/format/cyc_thresh:config:19-22
247           /sys/bus/event_source/devices/intel_pt/format/mtc:config:9
248           /sys/bus/event_source/devices/intel_pt/format/mtc_period:config:14-17
249           /sys/bus/event_source/devices/intel_pt/format/noretcomp:config:11
250           /sys/bus/event_source/devices/intel_pt/format/psb_period:config:24-27
251           /sys/bus/event_source/devices/intel_pt/format/tsc:config:10
252
253       Note that the default config must be overridden for each term i.e.
254
255           -e intel_pt/noretcomp=0/
256
257       is the same as:
258
259           -e intel_pt/tsc=1,noretcomp=0/
260
261       So, to disable TSC packets use:
262
263           -e intel_pt/tsc=0/
264
265       It is also possible to specify the config value explicitly:
266
267           -e intel_pt/config=0x400/
268
269       Note that, as with all events, the event is suffixed with event
270       modifiers:
271
272           u       userspace
273           k       kernel
274           h       hypervisor
275           G       guest
276           H       host
277           p       precise ip
278
279       h, G and H are for virtualization which is not supported by Intel PT. p
280       is also not relevant to Intel PT. So only options u and k are
281       meaningful for Intel PT.
282
283       perf_event_attr is displayed if the -vv option is used e.g.
284
285           ------------------------------------------------------------
286           perf_event_attr:
287           type                             6
288           size                             112
289           config                           0x400
290           { sample_period, sample_freq }   1
291           sample_type                      IP|TID|TIME|CPU|IDENTIFIER
292           read_format                      ID
293           disabled                         1
294           inherit                          1
295           exclude_kernel                   1
296           exclude_hv                       1
297           enable_on_exec                   1
298           sample_id_all                    1
299           ------------------------------------------------------------
300           sys_perf_event_open: pid 31104  cpu 0  group_fd -1  flags 0x8
301           sys_perf_event_open: pid 31104  cpu 1  group_fd -1  flags 0x8
302           sys_perf_event_open: pid 31104  cpu 2  group_fd -1  flags 0x8
303           sys_perf_event_open: pid 31104  cpu 3  group_fd -1  flags 0x8
304           ------------------------------------------------------------
305
306   config terms
307       The June 2015 version of Intel 64 and IA-32 Architectures Software
308       Developer Manuals, Chapter 36 Intel Processor Trace, defined new Intel
309       PT features. Some of the features are reflect in new config terms. All
310       the config terms are described below.
311
312       tsc Always supported. Produces TSC timestamp packets to provide timing
313       information. In some cases it is possible to decode without timing
314       information, for example a per-thread context that does not overlap
315       executable memory maps.
316
317           The default config selects tsc (i.e. tsc=1).
318
319       noretcomp Always supported. Disables "return compression" so a TIP
320       packet is produced when a function returns. Causes more packets to be
321       produced but might make decoding more reliable.
322
323           The default config does not select noretcomp (i.e. noretcomp=0).
324
325       psb_period Allows the frequency of PSB packets to be specified.
326
327           The PSB packet is a synchronization packet that provides a
328           starting point for decoding or recovery from errors.
329
330           Support for psb_period is indicated by:
331
332           /sys/bus/event_source/devices/intel_pt/caps/psb_cyc
333
334           which contains "1" if the feature is supported and "0"
335           otherwise.
336
337           Valid values are given by:
338
339           /sys/bus/event_source/devices/intel_pt/caps/psb_periods
340
341           which contains a hexadecimal value, the bits of which represent
342           valid values e.g. bit 2 set means value 2 is valid.
343
344           The psb_period value is converted to the approximate number of
345           trace bytes between PSB packets as:
346
347           2 ^ (value + 11)
348
349           e.g. value 3 means 16KiB bytes between PSBs
350
351           If an invalid value is entered, the error message
352           will give a list of valid values e.g.
353
354           $ perf record -e intel_pt/psb_period=15/u uname
355           Invalid psb_period for intel_pt. Valid values are: 0-5
356
357           If MTC packets are selected, the default config selects a value
358           of 3 (i.e. psb_period=3) or the nearest lower value that is
359           supported (0 is always supported).  Otherwise the default is 0.
360
361           If decoding is expected to be reliable and the buffer is large
362           then a large PSB period can be used.
363
364           Because a TSC packet is produced with PSB, the PSB period can
365           also affect the granularity to timing information in the absence
366           of MTC or CYC.
367
368       mtc Produces MTC timing packets.
369
370           MTC packets provide finer grain timestamp information than TSC
371           packets.  MTC packets record time using the hardware crystal
372           clock (CTC) which is related to TSC packets using a TMA packet.
373
374           Support for this feature is indicated by:
375
376           /sys/bus/event_source/devices/intel_pt/caps/mtc
377
378           which contains "1" if the feature is supported and
379           "0" otherwise.
380
381           The frequency of MTC packets can also be specified - see
382           mtc_period below.
383
384       mtc_period Specifies how frequently MTC packets are produced - see mtc
385       above for how to determine if MTC packets are supported.
386
387           Valid values are given by:
388
389           /sys/bus/event_source/devices/intel_pt/caps/mtc_periods
390
391           which contains a hexadecimal value, the bits of which represent
392           valid values e.g. bit 2 set means value 2 is valid.
393
394           The mtc_period value is converted to the MTC frequency as:
395
396           CTC-frequency / (2 ^ value)
397
398           e.g. value 3 means one eighth of CTC-frequency
399
400           Where CTC is the hardware crystal clock, the frequency of which
401           can be related to TSC via values provided in cpuid leaf 0x15.
402
403           If an invalid value is entered, the error message
404           will give a list of valid values e.g.
405
406           $ perf record -e intel_pt/mtc_period=15/u uname
407           Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9
408
409           The default value is 3 or the nearest lower value
410           that is supported (0 is always supported).
411
412       cyc Produces CYC timing packets.
413
414           CYC packets provide even finer grain timestamp information than
415           MTC and TSC packets.  A CYC packet contains the number of CPU
416           cycles since the last CYC packet. Unlike MTC and TSC packets,
417           CYC packets are only sent when another packet is also sent.
418
419           Support for this feature is indicated by:
420
421           /sys/bus/event_source/devices/intel_pt/caps/psb_cyc
422
423           which contains "1" if the feature is supported and
424           "0" otherwise.
425
426           The number of CYC packets produced can be reduced by specifying
427           a threshold - see cyc_thresh below.
428
429       cyc_thresh Specifies how frequently CYC packets are produced - see cyc
430       above for how to determine if CYC packets are supported.
431
432           Valid cyc_thresh values are given by:
433
434           /sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds
435
436           which contains a hexadecimal value, the bits of which represent
437           valid values e.g. bit 2 set means value 2 is valid.
438
439           The cyc_thresh value represents the minimum number of CPU cycles
440           that must have passed before a CYC packet can be sent.  The
441           number of CPU cycles is:
442
443           2 ^ (value - 1)
444
445           e.g. value 4 means 8 CPU cycles must pass before a CYC packet
446           can be sent.  Note a CYC packet is still only sent when another
447           packet is sent, not at, e.g. every 8 CPU cycles.
448
449           If an invalid value is entered, the error message
450           will give a list of valid values e.g.
451
452           $ perf record -e intel_pt/cyc,cyc_thresh=15/u uname
453           Invalid cyc_thresh for intel_pt. Valid values are: 0-12
454
455           CYC packets are not requested by default.
456
457       pt Specifies pass-through which enables the branch config term.
458
459           The default config selects 'pt' if it is available, so a user will
460           never need to specify this term.
461
462       branch Enable branch tracing. Branch tracing is enabled by default so
463       to disable branch tracing use branch=0.
464
465           The default config selects 'branch' if it is available.
466
467       ptw Enable PTWRITE packets which are produced when a ptwrite
468       instruction is executed.
469
470           Support for this feature is indicated by:
471
472           /sys/bus/event_source/devices/intel_pt/caps/ptwrite
473
474           which contains "1" if the feature is supported and
475           "0" otherwise.
476
477       fup_on_ptw Enable a FUP packet to follow the PTWRITE packet. The FUP
478       packet provides the address of the ptwrite instruction. In the absence
479       of fup_on_ptw, the decoder will use the address of the previous branch
480       if branch tracing is enabled, otherwise the address will be zero. Note
481       that fup_on_ptw will work even when branch tracing is disabled.
482
483       pwr_evt Enable power events. The power events provide information about
484       changes to the CPU C-state.
485
486           Support for this feature is indicated by:
487
488           /sys/bus/event_source/devices/intel_pt/caps/power_event_trace
489
490           which contains "1" if the feature is supported and
491           "0" otherwise.
492
493       event Enable Event Trace. The events provide information about
494       asynchronous events.
495
496           Support for this feature is indicated by:
497
498           /sys/bus/event_source/devices/intel_pt/caps/event_trace
499
500           which contains "1" if the feature is supported and
501           "0" otherwise.
502
503       notnt Disable TNT packets. Without TNT packets, it is not possible to
504       walk executable code to reconstruct control flow, however FUP, TIP,
505       TIP.PGE and TIP.PGD packets still indicate asynchronous control flow,
506       and (if return compression is disabled - see noretcomp) return
507       statements. The advantage of eliminating TNT packets is reducing the
508       size of the trace and corresponding tracing overhead.
509
510           Support for this feature is indicated by:
511
512           /sys/bus/event_source/devices/intel_pt/caps/tnt_disable
513
514           which contains "1" if the feature is supported and
515           "0" otherwise.
516
517   AUX area sampling option
518       To select Intel PT "sampling" the AUX area sampling option can be used:
519
520           --aux-sample
521
522       Optionally it can be followed by the sample size in bytes e.g.
523
524           --aux-sample=8192
525
526       In addition, the Intel PT event to sample must be defined e.g.
527
528           -e intel_pt//u
529
530       Samples on other events will be created containing Intel PT data e.g.
531       the following will create Intel PT samples on the branch-misses event,
532       note the events must be grouped using {}:
533
534           perf record --aux-sample -e '{intel_pt//u,branch-misses:u}'
535
536       An alternative to --aux-sample is to add the config term
537       aux-sample-size to events. In this case, the grouping is implied e.g.
538
539           perf record -e intel_pt//u -e branch-misses/aux-sample-size=8192/u
540
541       is the same as:
542
543           perf record -e '{intel_pt//u,branch-misses/aux-sample-size=8192/u}'
544
545       but allows for also using an address filter e.g.:
546
547           perf record -e intel_pt//u --filter 'filter * @/bin/ls' -e branch-misses/aux-sample-size=8192/u -- ls
548
549       It is important to select a sample size that is big enough to contain
550       at least one PSB packet. If not a warning will be displayed:
551
552           Intel PT sample size (%zu) may be too small for PSB period (%zu)
553
554       The calculation used for that is: if sample_size ⟨ psb_period + 256
555       display the warning. When sampling is used, psb_period defaults to 0
556       (2KiB).
557
558       The default sample size is 4KiB.
559
560       The sample size is passed in aux_sample_size in struct perf_event_attr.
561       The sample size is limited by the maximum event size which is 64KiB. It
562       is difficult to know how big the event might be without the trace
563       sample attached, but the tool validates that the sample size is not
564       greater than 60KiB.
565
566   new snapshot option
567       The difference between full trace and snapshot from the kernel’s
568       perspective is that in full trace we don’t overwrite trace data that
569       the user hasn’t collected yet (and indicated that by advancing
570       aux_tail), whereas in snapshot mode we let the trace run and overwrite
571       older data in the buffer so that whenever something interesting
572       happens, we can stop it and grab a snapshot of what was going on around
573       that interesting moment.
574
575       To select snapshot mode a new option has been added:
576
577           -S
578
579       Optionally it can be followed by the snapshot size e.g.
580
581           -S0x100000
582
583       The default snapshot size is the auxtrace mmap size. If neither
584       auxtrace mmap size nor snapshot size is specified, then the default is
585       4MiB for privileged users (or if /proc/sys/kernel/perf_event_paranoid <
586       0), 128KiB for unprivileged users. If an unprivileged user does not
587       specify mmap pages, the mmap pages will be reduced as described in the
588       new auxtrace mmap size option section below.
589
590       The snapshot size is displayed if the option -vv is used e.g.
591
592           Intel PT snapshot size: %zu
593
594   new auxtrace mmap size option
595       Intel PT buffer size is specified by an addition to the -m option e.g.
596
597           -m,16
598
599       selects a buffer size of 16 pages i.e. 64KiB.
600
601       Note that the existing functionality of -m is unchanged. The auxtrace
602       mmap size is specified by the optional addition of a comma and the
603       value.
604
605       The default auxtrace mmap size for Intel PT is 4MiB/page_size for
606       privileged users (or if /proc/sys/kernel/perf_event_paranoid < 0),
607       128KiB for unprivileged users. If an unprivileged user does not specify
608       mmap pages, the mmap pages will be reduced from the default
609       512KiB/page_size to 256KiB/page_size, otherwise the user is likely to
610       get an error as they exceed their mlock limit (Max locked memory as
611       shown in /proc/self/limits). Note that perf does not count the first
612       512KiB (actually /proc/sys/kernel/perf_event_mlock_kb minus 1 page) per
613       cpu against the mlock limit so an unprivileged user is allowed 512KiB
614       per cpu plus their mlock limit (which defaults to 64KiB but is not
615       multiplied by the number of cpus).
616
617       In full-trace mode, powers of two are allowed for buffer size, with a
618       minimum size of 2 pages. In snapshot mode or sampling mode, it is the
619       same but the minimum size is 1 page.
620
621       The mmap size and auxtrace mmap size are displayed if the -vv option is
622       used e.g.
623
624           mmap length 528384
625           auxtrace mmap length 4198400
626
627   Intel PT modes of operation
628       Intel PT can be used in 3 modes: full-trace mode sample mode snapshot
629       mode
630
631       Full-trace mode traces continuously e.g.
632
633           perf record -e intel_pt//u uname
634
635       Sample mode attaches a Intel PT sample to other events e.g.
636
637           perf record --aux-sample -e intel_pt//u -e branch-misses:u
638
639       Snapshot mode captures the available data when a signal is sent or
640       "snapshot" control command is issued. e.g. using a signal
641
642           perf record -v -e intel_pt//u -S ./loopy 1000000000 &
643           [1] 11435
644           kill -USR2 11435
645           Recording AUX area tracing snapshot
646
647       Note that the signal sent is SIGUSR2. Note that "Recording AUX area
648       tracing snapshot" is displayed because the -v option is used.
649
650       The advantage of using "snapshot" control command is that the access is
651       controlled by access to a FIFO e.g.
652
653           $ mkfifo perf.control
654           $ mkfifo perf.ack
655           $ cat perf.ack &
656           [1] 15235
657           $ sudo ~/bin/perf record --control fifo:perf.control,perf.ack -S -e intel_pt//u -- sleep 60 &
658           [2] 15243
659           $ ps -e | grep perf
660           15244 pts/1    00:00:00 perf
661           $ kill -USR2 15244
662           bash: kill: (15244) - Operation not permitted
663           $ echo snapshot > perf.control
664           ack
665
666       The 3 Intel PT modes of operation cannot be used together.
667
668   Buffer handling
669       There may be buffer limitations (i.e. single ToPa entry) which means
670       that actual buffer sizes are limited to powers of 2 up to 4MiB
671       (MAX_ORDER). In order to provide other sizes, and in particular an
672       arbitrarily large size, multiple buffers are logically concatenated.
673       However an interrupt must be used to switch between buffers. That has
674       two potential problems: a) the interrupt may not be handled in time so
675       that the current buffer becomes full and some trace data is lost. b)
676       the interrupts may slow the system and affect the performance results.
677
678       If trace data is lost, the driver sets truncated in the PERF_RECORD_AUX
679       event which the tools report as an error.
680
681       In full-trace mode, the driver waits for data to be copied out before
682       allowing the (logical) buffer to wrap-around. If data is not copied out
683       quickly enough, again truncated is set in the PERF_RECORD_AUX event. If
684       the driver has to wait, the intel_pt event gets disabled. Because it is
685       difficult to know when that happens, perf tools always re-enable the
686       intel_pt event after copying out data.
687
688   Intel PT and build ids
689       By default "perf record" post-processes the event stream to find all
690       build ids for executables for all addresses sampled. Deliberately,
691       Intel PT is not decoded for that purpose (it would take too long).
692       Instead the build ids for all executables encountered (due to mmap,
693       comm or task events) are included in the perf.data file.
694
695       To see buildids included in the perf.data file use the command:
696
697           perf buildid-list
698
699       If the perf.data file contains Intel PT data, that is the same as:
700
701           perf buildid-list --with-hits
702
703   Snapshot mode and event disabling
704       In order to make a snapshot, the intel_pt event is disabled using an
705       IOCTL, namely PERF_EVENT_IOC_DISABLE. However doing that can also
706       disable the collection of side-band information. In order to prevent
707       that, a dummy software event has been introduced that permits tracking
708       events (like mmaps) to continue to be recorded while intel_pt is
709       disabled. That is important to ensure there is complete side-band
710       information to allow the decoding of subsequent snapshots.
711
712       A test has been created for that. To find the test:
713
714           perf test list
715           ...
716           23: Test using a dummy software event to keep tracking
717
718       To run the test:
719
720           perf test 23
721           23: Test using a dummy software event to keep tracking     : Ok
722
723   perf record modes (nothing new here)
724       perf record essentially operates in one of three modes: per thread per
725       cpu workload only
726
727       "per thread" mode is selected by -t or by --per-thread (with -p or -u
728       or just a workload). "per cpu" is selected by -C or -a. "workload only"
729       mode is selected by not using the other options but providing a command
730       to run (i.e. the workload).
731
732       In per-thread mode an exact list of threads is traced. There is no
733       inheritance. Each thread has its own event buffer.
734
735       In per-cpu mode all processes (or processes from the selected cgroup
736       i.e. -G option, or processes selected with -p or -u) are traced. Each
737       cpu has its own buffer. Inheritance is allowed.
738
739       In workload-only mode, the workload is traced but with per-cpu buffers.
740       Inheritance is allowed. Note that you can now trace a workload in
741       per-thread mode by using the --per-thread option.
742
743   Privileged vs non-privileged users
744       Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged
745       users have memory limits imposed upon them. That affects what buffer
746       sizes they can have as outlined above.
747
748       The v4.2 kernel introduced support for a context switch metadata event,
749       PERF_RECORD_SWITCH, which allows unprivileged users to see when their
750       processes are scheduled out and in, just not by whom, which is left for
751       the PERF_RECORD_SWITCH_CPU_WIDE, that is only accessible in system wide
752       context, which in turn requires CAP_PERFMON or CAP_SYS_ADMIN.
753
754       Please see the 45ac1403f564 ("perf: Add PERF_RECORD_SWITCH to indicate
755       context switches") commit, that introduces these metadata events for
756       further info.
757
758       When working with kernels < v4.2, the following considerations must be
759       taken, as the sched:sched_switch tracepoints will be used to receive
760       such information:
761
762       Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged
763       users are not permitted to use tracepoints which means there is
764       insufficient side-band information to decode Intel PT in per-cpu mode,
765       and potentially workload-only mode too if the workload creates new
766       processes.
767
768       Note also, that to use tracepoints, read-access to debugfs is required.
769       So if debugfs is not mounted or the user does not have read-access, it
770       will again not be possible to decode Intel PT in per-cpu mode.
771
772   sched_switch tracepoint
773       The sched_switch tracepoint is used to provide side-band data for Intel
774       PT decoding in kernels where the PERF_RECORD_SWITCH metadata event
775       isn’t available.
776
777       The sched_switch events are automatically added. e.g. the second event
778       shown below:
779
780           $ perf record -vv -e intel_pt//u uname
781           ------------------------------------------------------------
782           perf_event_attr:
783           type                             6
784           size                             112
785           config                           0x400
786           { sample_period, sample_freq }   1
787           sample_type                      IP|TID|TIME|CPU|IDENTIFIER
788           read_format                      ID
789           disabled                         1
790           inherit                          1
791           exclude_kernel                   1
792           exclude_hv                       1
793           enable_on_exec                   1
794           sample_id_all                    1
795           ------------------------------------------------------------
796           sys_perf_event_open: pid 31104  cpu 0  group_fd -1  flags 0x8
797           sys_perf_event_open: pid 31104  cpu 1  group_fd -1  flags 0x8
798           sys_perf_event_open: pid 31104  cpu 2  group_fd -1  flags 0x8
799           sys_perf_event_open: pid 31104  cpu 3  group_fd -1  flags 0x8
800           ------------------------------------------------------------
801           perf_event_attr:
802           type                             2
803           size                             112
804           config                           0x108
805           { sample_period, sample_freq }   1
806           sample_type                      IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER
807           read_format                      ID
808           inherit                          1
809           sample_id_all                    1
810           exclude_guest                    1
811           ------------------------------------------------------------
812           sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8
813           sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8
814           sys_perf_event_open: pid -1  cpu 2  group_fd -1  flags 0x8
815           sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8
816           ------------------------------------------------------------
817           perf_event_attr:
818           type                             1
819           size                             112
820           config                           0x9
821           { sample_period, sample_freq }   1
822           sample_type                      IP|TID|TIME|IDENTIFIER
823           read_format                      ID
824           disabled                         1
825           inherit                          1
826           exclude_kernel                   1
827           exclude_hv                       1
828           mmap                             1
829           comm                             1
830           enable_on_exec                   1
831           task                             1
832           sample_id_all                    1
833           mmap2                            1
834           comm_exec                        1
835           ------------------------------------------------------------
836           sys_perf_event_open: pid 31104  cpu 0  group_fd -1  flags 0x8
837           sys_perf_event_open: pid 31104  cpu 1  group_fd -1  flags 0x8
838           sys_perf_event_open: pid 31104  cpu 2  group_fd -1  flags 0x8
839           sys_perf_event_open: pid 31104  cpu 3  group_fd -1  flags 0x8
840           mmap size 528384B
841           AUX area mmap length 4194304
842           perf event ring buffer mmapped per cpu
843           Synthesizing auxtrace information
844           Linux
845           [ perf record: Woken up 1 times to write data ]
846           [ perf record: Captured and wrote 0.042 MB perf.data ]
847
848       Note, the sched_switch event is only added if the user is permitted to
849       use it and only in per-cpu mode.
850
851       Note also, the sched_switch event is only added if TSC packets are
852       requested. That is because, in the absence of timing information, the
853       sched_switch events cannot be matched against the Intel PT trace.
854

PERF SCRIPT

856       By default, perf script will decode trace data found in the perf.data
857       file. This can be further controlled by new option --itrace.
858
859   New --itrace option
860       Having no option is the same as
861
862           --itrace
863
864       which, in turn, is the same as
865
866           --itrace=cepwx
867
868       The letters are:
869
870           i       synthesize "instructions" events
871           b       synthesize "branches" events
872           x       synthesize "transactions" events
873           w       synthesize "ptwrite" events
874           p       synthesize "power" events (incl. PSB events)
875           c       synthesize branches events (calls only)
876           r       synthesize branches events (returns only)
877           o       synthesize PEBS-via-PT events
878           I       synthesize Event Trace events
879           e       synthesize tracing error events
880           d       create a debug log
881           g       synthesize a call chain (use with i or x)
882           G       synthesize a call chain on existing event records
883           l       synthesize last branch entries (use with i or x)
884           L       synthesize last branch entries on existing event records
885           s       skip initial number of events
886           q       quicker (less detailed) decoding
887           A       approximate IPC
888           Z       prefer to ignore timestamps (so-called "timeless" decoding)
889
890       "Instructions" events look like they were recorded by "perf record -e
891       instructions".
892
893       "Branches" events look like they were recorded by "perf record -e
894       branches". "c" and "r" can be combined to get calls and returns.
895
896       "Transactions" events correspond to the start or end of transactions.
897       The flags field can be used in perf script to determine whether the
898       event is a transaction start, commit or abort.
899
900       Note that "instructions", "branches" and "transactions" events depend
901       on code flow packets which can be disabled by using the config term
902       "branch=0". Refer to the config terms section above.
903
904       "ptwrite" events record the payload of the ptwrite instruction and
905       whether "fup_on_ptw" was used. "ptwrite" events depend on PTWRITE
906       packets which are recorded only if the "ptw" config term was used.
907       Refer to the config terms section above. perf script "synth" field
908       displays "ptwrite" information like this: "ip: 0 payload:
909       0x123456789abcdef0" where "ip" is 1 if "fup_on_ptw" was used.
910
911       "Power" events correspond to power event packets and CBR (core-to-bus
912       ratio) packets. While CBR packets are always recorded when tracing is
913       enabled, power event packets are recorded only if the "pwr_evt" config
914       term was used. Refer to the config terms section above. The power
915       events record information about C-state changes, whereas CBR is
916       indicative of CPU frequency. perf script "event,synth" fields display
917       information like this: cbr: cbr: 22 freq: 2189 MHz (200%) mwait: hints:
918       0x60 extensions: 0x1 pwre: hw: 0 cstate: 2 sub-cstate: 0 exstop: ip: 1
919       pwrx: deepest cstate: 2 last cstate: 2 wake reason: 0x4 Where: "cbr"
920       includes the frequency and the percentage of maximum non-turbo "mwait"
921       shows mwait hints and extensions "pwre" shows C-state transitions (to a
922       C-state deeper than C0) and whether initiated by hardware "exstop"
923       indicates execution stopped and whether the IP was recorded exactly,
924       "pwrx" indicates return to C0 For more details refer to the Intel 64
925       and IA-32 Architectures Software Developer Manuals.
926
927       PSB events show when a PSB+ occurred and also the byte-offset in the
928       trace. Emitting a PSB+ can cause a CPU a slight delay. When doing
929       timing analysis of code with Intel PT, it is useful to know if a timing
930       bubble was caused by Intel PT or not.
931
932       Error events show where the decoder lost the trace. Error events are
933       quite important. Users must know if what they are seeing is a complete
934       picture or not. The "e" option may be followed by flags which affect
935       what errors will or will not be reported. Each flag must be preceded by
936       either + or -. The flags supported by Intel PT are: -o Suppress
937       overflow errors -l Suppress trace data lost errors For example, for
938       errors but not overflow or data lost errors:
939
940           --itrace=e-o-l
941
942       The "d" option will cause the creation of a file "intel_pt.log"
943       containing all decoded packets and instructions. Note that this option
944       slows down the decoder and that the resulting file may be very large.
945       The "d" option may be followed by flags which affect what debug
946       messages will or will not be logged. Each flag must be preceded by
947       either + or -. The flags support by Intel PT are: -a Suppress logging
948       of perf events +a Log all perf events +o Output to stdout instead of
949       "intel_pt.log" By default, logged perf events are filtered by any
950       specified time ranges, but flag +a overrides that.
951
952       In addition, the period of the "instructions" event can be specified.
953       e.g.
954
955           --itrace=i10us
956
957       sets the period to 10us i.e. one instruction sample is synthesized for
958       each 10 microseconds of trace. Alternatives to "us" are "ms"
959       (milliseconds), "ns" (nanoseconds), "t" (TSC ticks) or "i"
960       (instructions).
961
962       "ms", "us" and "ns" are converted to TSC ticks.
963
964       The timing information included with Intel PT does not give the time of
965       every instruction. Consequently, for the purpose of sampling, the
966       decoder estimates the time since the last timing packet based on 1 tick
967       per instruction. The time on the sample is not adjusted and reflects
968       the last known value of TSC.
969
970       For Intel PT, the default period is 100us.
971
972       Setting it to a zero period means "as often as possible".
973
974       In the case of Intel PT that is the same as a period of 1 and a unit of
975       instructions (i.e. --itrace=i1i).
976
977       Also the call chain size (default 16, max. 1024) for instructions or
978       transactions events can be specified. e.g.
979
980           --itrace=ig32
981           --itrace=xg32
982
983       Also the number of last branch entries (default 64, max. 1024) for
984       instructions or transactions events can be specified. e.g.
985
986           --itrace=il10
987           --itrace=xl10
988
989       Note that last branch entries are cleared for each sample, so there is
990       no overlap from one sample to the next.
991
992       The G and L options are designed in particular for sample mode, and
993       work much like g and l but add call chain and branch stack to the other
994       selected events instead of synthesized events. For example, to record
995       branch-misses events for ls and then add a call chain derived from the
996       Intel PT trace:
997
998           perf record --aux-sample -e '{intel_pt//u,branch-misses:u}' -- ls
999           perf report --itrace=Ge
1000
1001       Although in fact G is a default for perf report, so that is the same as
1002       just:
1003
1004           perf report
1005
1006       One caveat with the G and L options is that they work poorly with
1007       "Large PEBS". Large PEBS means PEBS records will be accumulated by
1008       hardware and the written into the event buffer in one go. That reduces
1009       interrupts, but can give very late timestamps. Because the Intel PT
1010       trace is synchronized by timestamps, the PEBS events do not match the
1011       trace. Currently, Large PEBS is used only in certain circumstances: -
1012       hardware supports it - PEBS is used - event period is specified,
1013       instead of frequency - the sample type is limited to the following
1014       flags: PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR |
1015       PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID |
1016       PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_TRANSACTION
1017       | PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER
1018       | PERF_SAMPLE_PERIOD (and sometimes) | PERF_SAMPLE_TIME Because Intel
1019       PT sample mode uses a different sample type to the list above, Large
1020       PEBS is not used with Intel PT sample mode. To avoid Large PEBS in
1021       other cases, avoid specifying the event period i.e. avoid the perf
1022       record -c option, --count option, or period config term.
1023
1024       To disable trace decoding entirely, use the option --no-itrace.
1025
1026       It is also possible to skip events generated (instructions, branches,
1027       transactions) at the beginning. This is useful to ignore initialization
1028       code.
1029
1030           --itrace=i0nss1000000
1031
1032       skips the first million instructions.
1033
1034       The q option changes the way the trace is decoded. The decoding is much
1035       faster but much less detailed. Specifically, with the q option, the
1036       decoder does not decode TNT packets, and does not walk object code, but
1037       gets the ip from FUP and TIP packets. The q option can be used with the
1038       b and i options but the period is not used. The q option decodes more
1039       quickly, but is useful only if the control flow of interest is
1040       represented or indicated by FUP, TIP, TIP.PGE, or TIP.PGD packets
1041       (refer below). However the q option could be used to find time ranges
1042       that could then be decoded fully using the --time option.
1043
1044       What will not be decoded with the (single) q option:
1045
1046       •   direct calls and jmps
1047
1048       •   conditional branches
1049
1050       •   non-branch instructions
1051
1052       What will be decoded with the (single) q option:
1053
1054       •   asynchronous branches such as interrupts
1055
1056       •   indirect branches
1057
1058       •   function return target address if the noretcomp config term (refer
1059           config terms section) was used
1060
1061       •   start of (control-flow) tracing
1062
1063       •   end of (control-flow) tracing, if it is not out of context
1064
1065       •   power events, ptwrite, transaction start and abort
1066
1067       •   instruction pointer associated with PSB packets
1068
1069       Note the q option does not specify what events will be synthesized e.g.
1070       the p option must be used also to show power events.
1071
1072       Repeating the q option (double-q i.e. qq) results in even faster
1073       decoding and even less detail. The decoder decodes only extended PSB
1074       (PSB+) packets, getting the instruction pointer if there is a FUP
1075       packet within PSB+ (i.e. between PSB and PSBEND). Note PSB packets
1076       occur regularly in the trace based on the psb_period config term (refer
1077       config terms section). There will be a FUP packet if the PSB+ occurs
1078       while control flow is being traced.
1079
1080       What will not be decoded with the qq option:
1081
1082       •   everything except instruction pointer associated with PSB packets
1083
1084       What will be decoded with the qq option:
1085
1086       •   instruction pointer associated with PSB packets
1087
1088       The Z option is equivalent to having recorded a trace without TSC (i.e.
1089       config term tsc=0). It can be useful to avoid timestamp issues when
1090       decoding a trace of a virtual machine.
1091
1092   dlfilter-show-cycles.so
1093       Cycles can be displayed using dlfilter-show-cycles.so in which case the
1094       itrace A option can be useful to provide higher granularity cycle
1095       information:
1096
1097           perf script --itrace=A --call-trace --dlfilter dlfilter-show-cycles.so
1098
1099       To see a list of dlfilters:
1100
1101           perf script -v --list-dlfilters
1102
1103       See also perf-dlfilters(1)
1104
1105   dump option
1106       perf script has an option (-D) to "dump" the events i.e. display the
1107       binary data.
1108
1109       When -D is used, Intel PT packets are displayed. The packet decoder
1110       does not pay attention to PSB packets, but just decodes the bytes - so
1111       the packets seen by the actual decoder may not be identical in places
1112       where the data is corrupt. One example of that would be when the
1113       buffer-switching interrupt has been too slow, and the buffer has been
1114       filled completely. In that case, the last packet in the buffer might be
1115       truncated and immediately followed by a PSB as the trace continues in
1116       the next buffer.
1117
1118       To disable the display of Intel PT packets, combine the -D option with
1119       --no-itrace.
1120

PERF REPORT

1122       By default, perf report will decode trace data found in the perf.data
1123       file. This can be further controlled by new option --itrace exactly the
1124       same as perf script, with the exception that the default is
1125       --itrace=igxe.
1126

PERF INJECT

1128       perf inject also accepts the --itrace option in which case tracing data
1129       is removed and replaced with the synthesized events. e.g.
1130
1131           perf inject --itrace -i perf.data -o perf.data.new
1132
1133       Below is an example of using Intel PT with autofdo. It requires autofdo
1134       (https://github.com/google/autofdo) and gcc version 5. The bubble sort
1135       example is from the AutoFDO tutorial
1136       (https://gcc.gnu.org/wiki/AutoFDO/Tutorial) amended to take the number
1137       of elements as a parameter.
1138
1139           $ gcc-5 -O3 sort.c -o sort_optimized
1140           $ ./sort_optimized 30000
1141           Bubble sorting array of 30000 elements
1142           2254 ms
1143
1144           $ cat ~/.perfconfig
1145           [intel-pt]
1146                   mispred-all = on
1147
1148           $ perf record -e intel_pt//u ./sort 3000
1149           Bubble sorting array of 3000 elements
1150           58 ms
1151           [ perf record: Woken up 2 times to write data ]
1152           [ perf record: Captured and wrote 3.939 MB perf.data ]
1153           $ perf inject -i perf.data -o inj --itrace=i100usle --strip
1154           $ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1
1155           $ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo
1156           $ ./sort_autofdo 30000
1157           Bubble sorting array of 30000 elements
1158           2155 ms
1159
1160       Note there is currently no advantage to using Intel PT instead of LBR,
1161       but that may change in the future if greater use is made of the data.
1162

PEBS VIA INTEL PT

1164       Some hardware has the feature to redirect PEBS records to the Intel PT
1165       trace. Recording is selected by using the aux-output config term e.g.
1166
1167           perf record -c 10000 -e '{intel_pt/branch=0/,cycles/aux-output/ppp}' uname
1168
1169       Originally, software only supported redirecting at most one PEBS event
1170       because it was not able to differentiate one event from another. To
1171       overcome that, more recent kernels and perf tools add support for the
1172       PERF_RECORD_AUX_OUTPUT_HW_ID side-band event. To check for the presence
1173       of that event in a PEBS-via-PT trace:
1174
1175           perf script -D --no-itrace | grep PERF_RECORD_AUX_OUTPUT_HW_ID
1176
1177       To display PEBS events from the Intel PT trace, use the itrace o option
1178       e.g.
1179
1180           perf script --itrace=oe
1181

XED

1183       For --xed the xed tool is needed. Here is how to install it:
1184
1185           $ git clone https://github.com/intelxed/mbuild.git mbuild
1186           $ git clone https://github.com/intelxed/xed
1187           $ cd xed
1188           $ ./mfile.py --share
1189           $ ./mfile.py examples
1190           $ sudo ./mfile.py --prefix=/usr/local install
1191           $ sudo ldconfig
1192           $ sudo cp obj/examples/xed /usr/local/bin
1193
1194       Basic xed testing:
1195
1196           $ xed | head -3
1197           ERROR: required argument(s) were missing
1198           Copyright (C) 2017, Intel Corporation. All rights reserved.
1199           XED version: [v10.0-328-g7d62c8c49b7b]
1200           $
1201

TRACING VIRTUAL MACHINES

1203       Currently, only kernel tracing is supported and only with either
1204       "timeless" decoding (i.e. no TSC timestamps) or VM Time Correlation. VM
1205       Time Correlation is an extra step using perf inject and requires
1206       unchanging VMX TSC Offset and no VMX TSC Scaling.
1207
1208       Other limitations and caveats
1209
1210           VMX controls may suppress packets needed for decoding resulting in decoding errors
1211           VMX controls may block the perf NMI to the host potentially resulting in lost trace data
1212           Guest kernel self-modifying code (e.g. jump labels or JIT-compiled eBPF) will result in decoding errors
1213           Guest thread information is unknown
1214           Guest VCPU is unknown but may be able to be inferred from the host thread
1215           Callchains are not supported
1216
1217       Example using "timeless" decoding
1218
1219       Start VM
1220
1221           $ sudo virsh start kubuntu20.04
1222           Domain kubuntu20.04 started
1223
1224       Mount the guest file system. Note sshfs needs -o direct_io to enable
1225       reading of proc files. root access is needed to read /proc/kcore.
1226
1227           $ mkdir vm0
1228           $ sshfs -o direct_io root@vm0:/ vm0
1229
1230       Copy the guest /proc/kallsyms, /proc/modules and /proc/kcore
1231
1232           $ perf buildid-cache -v --kcore vm0/proc/kcore
1233           kcore added to build-id cache directory /home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306
1234           $ KALLSYMS=/home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306/kallsyms
1235
1236       Find the VM process
1237
1238           $ ps -eLl | grep 'KVM\|PID'
1239           F S   UID     PID    PPID     LWP  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
1240           3 S 64055    1430       1    1440  1  80   0 - 1921718 -    ?        00:02:47 CPU 0/KVM
1241           3 S 64055    1430       1    1441  1  80   0 - 1921718 -    ?        00:02:41 CPU 1/KVM
1242           3 S 64055    1430       1    1442  1  80   0 - 1921718 -    ?        00:02:38 CPU 2/KVM
1243           3 S 64055    1430       1    1443  2  80   0 - 1921718 -    ?        00:03:18 CPU 3/KVM
1244
1245       Start an open-ended perf record, tracing the VM process, do something
1246       on the VM, and then ctrl-C to stop. TSC is not supported and tsc=0 must
1247       be specified. That means mtc is useless, so add mtc=0. However, IPC can
1248       still be determined, hence cyc=1 can be added. Only kernel decoding is
1249       supported, so k must be specified. Intel PT traces both the host and
1250       the guest so --guest and --host need to be specified. Without
1251       timestamps, --per-thread must be specified to distinguish threads.
1252
1253           $ sudo perf kvm --guest --host --guestkallsyms $KALLSYMS record --kcore -e intel_pt/tsc=0,mtc=0,cyc=1/k -p 1430 --per-thread
1254           ^C
1255           [ perf record: Woken up 1 times to write data ]
1256           [ perf record: Captured and wrote 5.829 MB ]
1257
1258       perf script can be used to provide an instruction trace
1259
1260           $ perf script --guestkallsyms $KALLSYMS --insn-trace --xed -F+ipc | grep -C10 vmresume | head -21
1261                 CPU 0/KVM  1440  ffffffff82133cdd __vmx_vcpu_run+0x3d ([kernel.kallsyms])                movq  0x48(%rax), %r9
1262                 CPU 0/KVM  1440  ffffffff82133ce1 __vmx_vcpu_run+0x41 ([kernel.kallsyms])                movq  0x50(%rax), %r10
1263                 CPU 0/KVM  1440  ffffffff82133ce5 __vmx_vcpu_run+0x45 ([kernel.kallsyms])                movq  0x58(%rax), %r11
1264                 CPU 0/KVM  1440  ffffffff82133ce9 __vmx_vcpu_run+0x49 ([kernel.kallsyms])                movq  0x60(%rax), %r12
1265                 CPU 0/KVM  1440  ffffffff82133ced __vmx_vcpu_run+0x4d ([kernel.kallsyms])                movq  0x68(%rax), %r13
1266                 CPU 0/KVM  1440  ffffffff82133cf1 __vmx_vcpu_run+0x51 ([kernel.kallsyms])                movq  0x70(%rax), %r14
1267                 CPU 0/KVM  1440  ffffffff82133cf5 __vmx_vcpu_run+0x55 ([kernel.kallsyms])                movq  0x78(%rax), %r15
1268                 CPU 0/KVM  1440  ffffffff82133cf9 __vmx_vcpu_run+0x59 ([kernel.kallsyms])                movq  (%rax), %rax
1269                 CPU 0/KVM  1440  ffffffff82133cfc __vmx_vcpu_run+0x5c ([kernel.kallsyms])                callq  0xffffffff82133c40
1270                 CPU 0/KVM  1440  ffffffff82133c40 vmx_vmenter+0x0 ([kernel.kallsyms])            jz 0xffffffff82133c46
1271                 CPU 0/KVM  1440  ffffffff82133c42 vmx_vmenter+0x2 ([kernel.kallsyms])            vmresume         IPC: 0.11 (50/445)
1272                     :1440  1440  ffffffffbb678b06 native_write_msr+0x6 ([guest.kernel.kallsyms])                 nopl  %eax, (%rax,%rax,1)
1273                     :1440  1440  ffffffffbb678b0b native_write_msr+0xb ([guest.kernel.kallsyms])                 retq     IPC: 0.04 (2/41)
1274                     :1440  1440  ffffffffbb666646 lapic_next_deadline+0x26 ([guest.kernel.kallsyms])             data16 nop
1275                     :1440  1440  ffffffffbb666648 lapic_next_deadline+0x28 ([guest.kernel.kallsyms])             xor %eax, %eax
1276                     :1440  1440  ffffffffbb66664a lapic_next_deadline+0x2a ([guest.kernel.kallsyms])             popq  %rbp
1277                     :1440  1440  ffffffffbb66664b lapic_next_deadline+0x2b ([guest.kernel.kallsyms])             retq     IPC: 0.16 (4/25)
1278                     :1440  1440  ffffffffbb74607f clockevents_program_event+0x8f ([guest.kernel.kallsyms])               test %eax, %eax
1279                     :1440  1440  ffffffffbb746081 clockevents_program_event+0x91 ([guest.kernel.kallsyms])               jz 0xffffffffbb74603c    IPC: 0.06 (2/30)
1280                     :1440  1440  ffffffffbb74603c clockevents_program_event+0x4c ([guest.kernel.kallsyms])               popq  %rbx
1281                     :1440  1440  ffffffffbb74603d clockevents_program_event+0x4d ([guest.kernel.kallsyms])               popq  %r12
1282
1283       Example using VM Time Correlation
1284
1285       Start VM
1286
1287           $ sudo virsh start kubuntu20.04
1288           Domain kubuntu20.04 started
1289
1290       Mount the guest file system. Note sshfs needs -o direct_io to enable
1291       reading of proc files. root access is needed to read /proc/kcore.
1292
1293           $ mkdir -p vm0
1294           $ sshfs -o direct_io root@vm0:/ vm0
1295
1296       Copy the guest /proc/kallsyms, /proc/modules and /proc/kcore
1297
1298           $ perf buildid-cache -v --kcore vm0/proc/kcore
1299           same kcore found in /home/user/.debug/[kernel.kcore]/cc9c55a98c5e4ec0aeda69302554aabed5cd6491/2021021312450777
1300           $ KALLSYMS=/home/user/.debug/\[kernel.kcore\]/cc9c55a98c5e4ec0aeda69302554aabed5cd6491/2021021312450777/kallsyms
1301
1302       Find the VM process
1303
1304           $ ps -eLl | grep 'KVM\|PID'
1305           F S   UID     PID    PPID     LWP  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
1306           3 S 64055   16998       1   17005 13  80   0 - 1818189 -    ?        00:00:16 CPU 0/KVM
1307           3 S 64055   16998       1   17006  4  80   0 - 1818189 -    ?        00:00:05 CPU 1/KVM
1308           3 S 64055   16998       1   17007  3  80   0 - 1818189 -    ?        00:00:04 CPU 2/KVM
1309           3 S 64055   16998       1   17008  4  80   0 - 1818189 -    ?        00:00:05 CPU 3/KVM
1310
1311       Start an open-ended perf record, tracing the VM process, do something
1312       on the VM, and then ctrl-C to stop. IPC can be determined, hence cyc=1
1313       can be added. Only kernel decoding is supported, so k must be
1314       specified. Intel PT traces both the host and the guest so --guest and
1315       --host need to be specified.
1316
1317           $ sudo perf kvm --guest --host --guestkallsyms $KALLSYMS record --kcore -e intel_pt/cyc=1/k -p 16998
1318           ^C[ perf record: Woken up 1 times to write data ]
1319           [ perf record: Captured and wrote 9.041 MB perf.data.kvm ]
1320
1321       Now perf inject can be used to determine the VMX TCS Offset. Note,
1322       Intel PT TSC packets are only 7-bytes, so the TSC Offset might differ
1323       from the actual value in the 8th byte. That will have no effect i.e.
1324       the resulting timestamps will be correct anyway.
1325
1326           $ perf inject -i perf.data.kvm --vm-time-correlation=dry-run
1327           ERROR: Unknown TSC Offset for VMCS 0x1bff6a
1328           VMCS: 0x1bff6a  TSC Offset 0xffffe42722c64c41
1329           ERROR: Unknown TSC Offset for VMCS 0x1cbc08
1330           VMCS: 0x1cbc08  TSC Offset 0xffffe42722c64c41
1331           ERROR: Unknown TSC Offset for VMCS 0x1c3ce8
1332           VMCS: 0x1c3ce8  TSC Offset 0xffffe42722c64c41
1333           ERROR: Unknown TSC Offset for VMCS 0x1cbce9
1334           VMCS: 0x1cbce9  TSC Offset 0xffffe42722c64c41
1335
1336       Each virtual CPU has a different Virtual Machine Control Structure
1337       (VMCS) shown above with the calculated TSC Offset. For an unchanging
1338       TSC Offset they should all be the same for the same virtual machine.
1339
1340       Now that the TSC Offset is known, it can be provided to perf inject
1341
1342           $ perf inject -i perf.data.kvm --vm-time-correlation="dry-run 0xffffe42722c64c41"
1343
1344       Note the options for perf inject --vm-time-correlation are:
1345
1346           [ dry-run ] [ <TSC Offset> [ : <VMCS> [ , <VMCS> ]... ]  ]...
1347
1348       So it is possible to specify different TSC Offsets for different VMCS.
1349       The option "dry-run" will cause the file to be processed but without
1350       updating it. Note it is also possible to get a intel_pt.log file by
1351       adding option --itrace=d
1352
1353       There were no errors so, do it for real
1354
1355           $ perf inject -i perf.data.kvm --vm-time-correlation=0xffffe42722c64c41 --force
1356
1357       perf script can be used to see if there are any decoder errors
1358
1359           $ perf script -i perf.data.kvm --guestkallsyms $KALLSYMS --itrace=e-o
1360
1361       There were none.
1362
1363       perf script can be used to provide an instruction trace showing
1364       timestamps
1365
1366           $ perf script -i perf.data.kvm --guestkallsyms $KALLSYMS --insn-trace --xed -F+ipc | grep -C10 vmresume | head -21
1367                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133cdd __vmx_vcpu_run+0x3d ([kernel.kallsyms])                 movq  0x48(%rax), %r9
1368                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133ce1 __vmx_vcpu_run+0x41 ([kernel.kallsyms])                 movq  0x50(%rax), %r10
1369                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133ce5 __vmx_vcpu_run+0x45 ([kernel.kallsyms])                 movq  0x58(%rax), %r11
1370                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133ce9 __vmx_vcpu_run+0x49 ([kernel.kallsyms])                 movq  0x60(%rax), %r12
1371                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133ced __vmx_vcpu_run+0x4d ([kernel.kallsyms])                 movq  0x68(%rax), %r13
1372                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133cf1 __vmx_vcpu_run+0x51 ([kernel.kallsyms])                 movq  0x70(%rax), %r14
1373                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133cf5 __vmx_vcpu_run+0x55 ([kernel.kallsyms])                 movq  0x78(%rax), %r15
1374                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133cf9 __vmx_vcpu_run+0x59 ([kernel.kallsyms])                 movq  (%rax), %rax
1375                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133cfc __vmx_vcpu_run+0x5c ([kernel.kallsyms])                 callq  0xffffffff82133c40
1376                 CPU 1/KVM 17006 [001] 11500.262865593:  ffffffff82133c40 vmx_vmenter+0x0 ([kernel.kallsyms])             jz 0xffffffff82133c46
1377                 CPU 1/KVM 17006 [001] 11500.262866075:  ffffffff82133c42 vmx_vmenter+0x2 ([kernel.kallsyms])             vmresume         IPC: 0.05 (40/769)
1378                    :17006 17006 [001] 11500.262869216:  ffffffff82200cb0 asm_sysvec_apic_timer_interrupt+0x0 ([guest.kernel.kallsyms])           clac
1379                    :17006 17006 [001] 11500.262869216:  ffffffff82200cb3 asm_sysvec_apic_timer_interrupt+0x3 ([guest.kernel.kallsyms])           pushq  $0xffffffffffffffff
1380                    :17006 17006 [001] 11500.262869216:  ffffffff82200cb5 asm_sysvec_apic_timer_interrupt+0x5 ([guest.kernel.kallsyms])           callq  0xffffffff82201160
1381                    :17006 17006 [001] 11500.262869216:  ffffffff82201160 error_entry+0x0 ([guest.kernel.kallsyms])               cld
1382                    :17006 17006 [001] 11500.262869216:  ffffffff82201161 error_entry+0x1 ([guest.kernel.kallsyms])               pushq  %rsi
1383                    :17006 17006 [001] 11500.262869216:  ffffffff82201162 error_entry+0x2 ([guest.kernel.kallsyms])               movq  0x8(%rsp), %rsi
1384                    :17006 17006 [001] 11500.262869216:  ffffffff82201167 error_entry+0x7 ([guest.kernel.kallsyms])               movq  %rdi, 0x8(%rsp)
1385                    :17006 17006 [001] 11500.262869216:  ffffffff8220116c error_entry+0xc ([guest.kernel.kallsyms])               pushq  %rdx
1386                    :17006 17006 [001] 11500.262869216:  ffffffff8220116d error_entry+0xd ([guest.kernel.kallsyms])               pushq  %rcx
1387                    :17006 17006 [001] 11500.262869216:  ffffffff8220116e error_entry+0xe ([guest.kernel.kallsyms])               pushq  %rax
1388

EVENT TRACE

1390       Event Trace records information about asynchronous events, for example
1391       interrupts, faults, VM exits and entries. The information is recorded
1392       in CFE and EVD packets, and also the Interrupt Flag is recorded on the
1393       MODE.Exec packet. The CFE packet contains a type field to identify one
1394       of the following:
1395
1396            1      INTR            interrupt, fault, exception, NMI
1397            2      IRET            interrupt return
1398            3      SMI             system management interrupt
1399            4      RSM             resume from system management mode
1400            5      SIPI            startup interprocessor interrupt
1401            6      INIT            INIT signal
1402            7      VMENTRY         VM-Entry
1403            8      VMEXIT          VM-Entry
1404            9      VMEXIT_INTR     VM-Exit due to interrupt
1405           10      SHUTDOWN        Shutdown
1406
1407       For more details, refer to the Intel 64 and IA-32 Architectures
1408       Software Developer Manuals (version 076 or later).
1409
1410       The capability to do Event Trace is indicated by the
1411       /sys/bus/event_source/devices/intel_pt/caps/event_trace file.
1412
1413       Event trace is selected for recording using the "event" config term.
1414       e.g.
1415
1416           perf record -e intel_pt/event/u uname
1417
1418       Event trace events are output using the --itrace I option. e.g.
1419
1420           perf script --itrace=Ie
1421
1422       perf script displays events containing CFE type, vector and event data,
1423       in the form:
1424
1425           evt:   hw int            (t)  cfe: INTR IP: 1 vector: 3 PFA: 0x8877665544332211
1426
1427       The IP flag indicates if the event binds to an IP, which includes any
1428       case where flow control packet generation is enabled, as well as when
1429       CFE packet IP bit is set.
1430
1431       perf script displays events containing changes to the Interrupt Flag in
1432       the form:
1433
1434           iflag:   t                      IFLAG: 1->0 via branch
1435
1436       where "via branch" indicates a branch (interrupt or return from
1437       interrupt) and "non branch" indicates an instruction such as CFI, STI
1438       or POPF).
1439
1440       In addition, the current state of the interrupt flag is indicated by
1441       the presence or absence of the "D" (interrupt disabled) perf script
1442       flag. If the interrupt flag is changed, then the "t" flag is also
1443       included i.e.
1444
1445                   no flag, interrupts enabled IF=1
1446           t       interrupts become disabled IF=1 -> IF=0
1447           D       interrupts are disabled IF=0
1448           Dt      interrupts become enabled  IF=0 -> IF=1
1449
1450       The intel-pt-events.py script illustrates how to access Event Trace
1451       information using a Python script.
1452

TNT DISABLE

1454       TNT packets are disabled using the "notnt" config term. e.g.
1455
1456           perf record -e intel_pt/notnt/u uname
1457
1458       In that case the --itrace q option is forced because walking executable
1459       code to reconstruct the control flow is not possible.
1460