1PERF-INTEL-PT(1) perf Manual PERF-INTEL-PT(1)
2
3
4
6 perf-intel-pt - Support for Intel Processor Trace within perf tools
7
9 perf record -e intel_pt//
10
12 Intel Processor Trace (Intel PT) is an extension of Intel Architecture
13 that collects information about software execution such as control
14 flow, execution modes and timings and formats it into highly compressed
15 binary packets. Technical details are documented in the Intel 64 and
16 IA-32 Architectures Software Developer Manuals, Chapter 36 Intel
17 Processor Trace.
18
19 Intel PT is first supported in Intel Core M and 5th generation Intel
20 Core processors that are based on the Intel micro-architecture code
21 name Broadwell.
22
23 Trace data is collected by perf record and stored within the perf.data
24 file. See below for options to perf record.
25
26 Trace data must be decoded which involves walking the object code and
27 matching the trace data packets. For example a TNT packet only tells
28 whether a conditional branch was taken or not taken, so to make use of
29 that packet the decoder must know precisely which instruction was being
30 executed.
31
32 Decoding is done on-the-fly. The decoder outputs samples in the same
33 format as samples output by perf hardware events, for example as though
34 the "instructions" or "branches" events had been recorded. Presently 3
35 tools support this: perf script, perf report and perf inject. See below
36 for more information on using those tools.
37
38 The main distinguishing feature of Intel PT is that the decoder can
39 determine the exact flow of software execution. Intel PT can be used to
40 understand why and how did software get to a certain point, or behave a
41 certain way. The software does not have to be recompiled, so Intel PT
42 works with debug or release builds, however the executed images are
43 needed - which makes use in JIT-compiled environments, or with
44 self-modified code, a challenge. Also symbols need to be provided to
45 make sense of addresses.
46
47 A limitation of Intel PT is that it produces huge amounts of trace data
48 (hundreds of megabytes per second per core) which takes a long time to
49 decode, for example two or three orders of magnitude longer than it
50 took to collect. Another limitation is the performance impact of
51 tracing, something that will vary depending on the use-case and
52 architecture.
53
55 It is important to start small. That is because it is easy to capture
56 vastly more data than can possibly be processed.
57
58 The simplest thing to do with Intel PT is userspace profiling of small
59 programs. Data is captured with perf record e.g. to trace ls
60 userspace-only:
61
62 perf record -e intel_pt//u ls
63
64 And profiled with perf report e.g.
65
66 perf report
67
68 To also trace kernel space presents a problem, namely kernel
69 self-modifying code. A fairly good kernel image is available in
70 /proc/kcore but to get an accurate image a copy of /proc/kcore needs to
71 be made under the same conditions as the data capture. perf record can
72 make a copy of /proc/kcore if the option --kcore is used, but access to
73 /proc/kcore is restricted e.g.
74
75 sudo perf record -o pt_ls --kcore -e intel_pt// -- ls
76
77 which will create a directory named pt_ls and put the perf.data file
78 (named simply data) and copies of /proc/kcore, /proc/kallsyms and
79 /proc/modules into it. The other tools understand the directory format,
80 so to use perf report becomes:
81
82 sudo perf report -i pt_ls
83
84 Because samples are synthesized after-the-fact, the sampling period can
85 be selected for reporting. e.g. sample every microsecond
86
87 sudo perf report pt_ls --itrace=i1usge
88
89 See the sections below for more information about the --itrace option.
90
91 Beware the smaller the period, the more samples that are produced, and
92 the longer it takes to process them.
93
94 Also note that the coarseness of Intel PT timing information will start
95 to distort the statistical value of the sampling as the sampling period
96 becomes smaller.
97
98 To represent software control flow, "branches" samples are produced. By
99 default a branch sample is synthesized for every single branch. To get
100 an idea what data is available you can use the perf script tool with
101 all itrace sampling options, which will list all the samples.
102
103 perf record -e intel_pt//u ls
104 perf script --itrace=ibxwpe
105
106 An interesting field that is not printed by default is flags which can
107 be displayed as follows:
108
109 perf script --itrace=ibxwpe -F+flags
110
111 The flags are "bcrosyiABExghDt" which stand for branch, call, return,
112 conditional, system, asynchronous, interrupt, transaction abort, trace
113 begin, trace end, in transaction, VM-entry, VM-exit, interrupt
114 disabled, and interrupt disable toggle respectively.
115
116 perf script also supports higher level ways to dump instruction traces:
117
118 perf script --insn-trace --xed
119
120 Dump all instructions. This requires installing the xed tool (see XED
121 below) Dumping all instructions in a long trace can be fairly slow. It
122 is usually better to start with higher level decoding, like
123
124 perf script --call-trace
125
126 or
127
128 perf script --call-ret-trace
129
130 and then select a time range of interest. The time range can then be
131 examined in detail with
132
133 perf script --time starttime,stoptime --insn-trace --xed
134
135 While examining the trace it’s also useful to filter on specific CPUs
136 using the -C option
137
138 perf script --time starttime,stoptime --insn-trace --xed -C 1
139
140 Dump all instructions in time range on CPU 1.
141
142 Another interesting field that is not printed by default is ipc which
143 can be displayed as follows:
144
145 perf script --itrace=be -F+ipc
146
147 There are two ways that instructions-per-cycle (IPC) can be calculated
148 depending on the recording.
149
150 If the cyc config term (see config terms section below) was used, then
151 IPC is calculated using the cycle count from CYC packets, otherwise MTC
152 packets are used - refer to the mtc config term. When MTC is used,
153 however, the values are less accurate because the timing is less
154 accurate.
155
156 Because Intel PT does not update the cycle count on every branch or
157 instruction, the values will often be zero. When there are values, they
158 will be the number of instructions and number of cycles since the last
159 update, and thus represent the average IPC since the last IPC for that
160 event type. Note IPC for "branches" events is calculated separately
161 from IPC for "instructions" events.
162
163 Even with the cyc config term, it is possible to produce IPC
164 information for every change of timestamp, but at the expense of
165 accuracy. That is selected by specifying the itrace A option. Due to
166 the granularity of timestamps, the actual number of cycles increases
167 even though the cycles reported does not. The number of instructions is
168 known, but if IPC is reported, cycles can be too low and so IPC is too
169 high. Note that inaccuracy decreases as the period of sampling
170 increases i.e. if the number of cycles is too low by a small amount,
171 that becomes less significant if the number of cycles is large. It may
172 also be useful to use the A option in conjunction with
173 dlfilter-show-cycles.so to provide higher granularity cycle
174 information.
175
176 Also note that the IPC instruction count may or may not include the
177 current instruction. If the cycle count is associated with an
178 asynchronous branch (e.g. page fault or interrupt), then the
179 instruction count does not include the current instruction, otherwise
180 it does. That is consistent with whether or not that instruction has
181 retired when the cycle count is updated.
182
183 Another note, in the case of "branches" events, non-taken branches are
184 not presently sampled, so IPC values for them do not appear e.g. a CYC
185 packet with a TNT packet that starts with a non-taken branch. To see
186 every possible IPC value, "instructions" events can be used e.g.
187 --itrace=i0ns
188
189 While it is possible to create scripts to analyze the data, an
190 alternative approach is available to export the data to a sqlite or
191 postgresql database. Refer to script export-to-sqlite.py or
192 export-to-postgresql.py for more details, and to script
193 exported-sql-viewer.py for an example of using the database.
194
195 There is also script intel-pt-events.py which provides an example of
196 how to unpack the raw data for power events and PTWRITE. The script
197 also displays branches, and supports 2 additional modes selected by
198 option:
199
200 --insn-trace - instruction trace
201 --src-trace - source trace
202
203 As mentioned above, it is easy to capture too much data. One way to
204 limit the data captured is to use snapshot mode which is explained
205 further below. Refer to new snapshot option and Intel PT modes of
206 operation further below.
207
208 Another problem that will be experienced is decoder errors. They can be
209 caused by inability to access the executed image, self-modified or
210 JIT-ed code, or the inability to match side-band information (such as
211 context switches and mmaps) which results in the decoder not knowing
212 what code was executed.
213
214 There is also the problem of perf not being able to copy the data fast
215 enough, resulting in data lost because the buffer was full. See Buffer
216 handling below for more details.
217
219 new event
220 The Intel PT kernel driver creates a new PMU for Intel PT. PMU events
221 are selected by providing the PMU name followed by the "config"
222 separated by slashes. An enhancement has been made to allow default
223 "config" e.g. the option
224
225 -e intel_pt//
226
227 will use a default config value. Currently that is the same as
228
229 -e intel_pt/tsc,noretcomp=0/
230
231 which is the same as
232
233 -e intel_pt/tsc=1,noretcomp=0/
234
235 Note there are now new config terms - see section config terms further
236 below.
237
238 The config terms are listed in /sys/devices/intel_pt/format. They are
239 bit fields within the config member of the struct perf_event_attr which
240 is passed to the kernel by the perf_event_open system call. They
241 correspond to bit fields in the IA32_RTIT_CTL MSR. Here is a list of
242 them and their definitions:
243
244 $ grep -H . /sys/bus/event_source/devices/intel_pt/format/*
245 /sys/bus/event_source/devices/intel_pt/format/cyc:config:1
246 /sys/bus/event_source/devices/intel_pt/format/cyc_thresh:config:19-22
247 /sys/bus/event_source/devices/intel_pt/format/mtc:config:9
248 /sys/bus/event_source/devices/intel_pt/format/mtc_period:config:14-17
249 /sys/bus/event_source/devices/intel_pt/format/noretcomp:config:11
250 /sys/bus/event_source/devices/intel_pt/format/psb_period:config:24-27
251 /sys/bus/event_source/devices/intel_pt/format/tsc:config:10
252
253 Note that the default config must be overridden for each term i.e.
254
255 -e intel_pt/noretcomp=0/
256
257 is the same as:
258
259 -e intel_pt/tsc=1,noretcomp=0/
260
261 So, to disable TSC packets use:
262
263 -e intel_pt/tsc=0/
264
265 It is also possible to specify the config value explicitly:
266
267 -e intel_pt/config=0x400/
268
269 Note that, as with all events, the event is suffixed with event
270 modifiers:
271
272 u userspace
273 k kernel
274 h hypervisor
275 G guest
276 H host
277 p precise ip
278
279 h, G and H are for virtualization which is not supported by Intel PT. p
280 is also not relevant to Intel PT. So only options u and k are
281 meaningful for Intel PT.
282
283 perf_event_attr is displayed if the -vv option is used e.g.
284
285 ------------------------------------------------------------
286 perf_event_attr:
287 type 6
288 size 112
289 config 0x400
290 { sample_period, sample_freq } 1
291 sample_type IP|TID|TIME|CPU|IDENTIFIER
292 read_format ID
293 disabled 1
294 inherit 1
295 exclude_kernel 1
296 exclude_hv 1
297 enable_on_exec 1
298 sample_id_all 1
299 ------------------------------------------------------------
300 sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
301 sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
302 sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
303 sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
304 ------------------------------------------------------------
305
306 config terms
307 The June 2015 version of Intel 64 and IA-32 Architectures Software
308 Developer Manuals, Chapter 36 Intel Processor Trace, defined new Intel
309 PT features. Some of the features are reflect in new config terms. All
310 the config terms are described below.
311
312 tsc Always supported. Produces TSC timestamp packets to provide timing
313 information. In some cases it is possible to decode without timing
314 information, for example a per-thread context that does not overlap
315 executable memory maps.
316
317 The default config selects tsc (i.e. tsc=1).
318
319 noretcomp Always supported. Disables "return compression" so a TIP
320 packet is produced when a function returns. Causes more packets to be
321 produced but might make decoding more reliable.
322
323 The default config does not select noretcomp (i.e. noretcomp=0).
324
325 psb_period Allows the frequency of PSB packets to be specified.
326
327 The PSB packet is a synchronization packet that provides a
328 starting point for decoding or recovery from errors.
329
330 Support for psb_period is indicated by:
331
332 /sys/bus/event_source/devices/intel_pt/caps/psb_cyc
333
334 which contains "1" if the feature is supported and "0"
335 otherwise.
336
337 Valid values are given by:
338
339 /sys/bus/event_source/devices/intel_pt/caps/psb_periods
340
341 which contains a hexadecimal value, the bits of which represent
342 valid values e.g. bit 2 set means value 2 is valid.
343
344 The psb_period value is converted to the approximate number of
345 trace bytes between PSB packets as:
346
347 2 ^ (value + 11)
348
349 e.g. value 3 means 16KiB bytes between PSBs
350
351 If an invalid value is entered, the error message
352 will give a list of valid values e.g.
353
354 $ perf record -e intel_pt/psb_period=15/u uname
355 Invalid psb_period for intel_pt. Valid values are: 0-5
356
357 If MTC packets are selected, the default config selects a value
358 of 3 (i.e. psb_period=3) or the nearest lower value that is
359 supported (0 is always supported). Otherwise the default is 0.
360
361 If decoding is expected to be reliable and the buffer is large
362 then a large PSB period can be used.
363
364 Because a TSC packet is produced with PSB, the PSB period can
365 also affect the granularity to timing information in the absence
366 of MTC or CYC.
367
368 mtc Produces MTC timing packets.
369
370 MTC packets provide finer grain timestamp information than TSC
371 packets. MTC packets record time using the hardware crystal
372 clock (CTC) which is related to TSC packets using a TMA packet.
373
374 Support for this feature is indicated by:
375
376 /sys/bus/event_source/devices/intel_pt/caps/mtc
377
378 which contains "1" if the feature is supported and
379 "0" otherwise.
380
381 The frequency of MTC packets can also be specified - see
382 mtc_period below.
383
384 mtc_period Specifies how frequently MTC packets are produced - see mtc
385 above for how to determine if MTC packets are supported.
386
387 Valid values are given by:
388
389 /sys/bus/event_source/devices/intel_pt/caps/mtc_periods
390
391 which contains a hexadecimal value, the bits of which represent
392 valid values e.g. bit 2 set means value 2 is valid.
393
394 The mtc_period value is converted to the MTC frequency as:
395
396 CTC-frequency / (2 ^ value)
397
398 e.g. value 3 means one eighth of CTC-frequency
399
400 Where CTC is the hardware crystal clock, the frequency of which
401 can be related to TSC via values provided in cpuid leaf 0x15.
402
403 If an invalid value is entered, the error message
404 will give a list of valid values e.g.
405
406 $ perf record -e intel_pt/mtc_period=15/u uname
407 Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9
408
409 The default value is 3 or the nearest lower value
410 that is supported (0 is always supported).
411
412 cyc Produces CYC timing packets.
413
414 CYC packets provide even finer grain timestamp information than
415 MTC and TSC packets. A CYC packet contains the number of CPU
416 cycles since the last CYC packet. Unlike MTC and TSC packets,
417 CYC packets are only sent when another packet is also sent.
418
419 Support for this feature is indicated by:
420
421 /sys/bus/event_source/devices/intel_pt/caps/psb_cyc
422
423 which contains "1" if the feature is supported and
424 "0" otherwise.
425
426 The number of CYC packets produced can be reduced by specifying
427 a threshold - see cyc_thresh below.
428
429 cyc_thresh Specifies how frequently CYC packets are produced - see cyc
430 above for how to determine if CYC packets are supported.
431
432 Valid cyc_thresh values are given by:
433
434 /sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds
435
436 which contains a hexadecimal value, the bits of which represent
437 valid values e.g. bit 2 set means value 2 is valid.
438
439 The cyc_thresh value represents the minimum number of CPU cycles
440 that must have passed before a CYC packet can be sent. The
441 number of CPU cycles is:
442
443 2 ^ (value - 1)
444
445 e.g. value 4 means 8 CPU cycles must pass before a CYC packet
446 can be sent. Note a CYC packet is still only sent when another
447 packet is sent, not at, e.g. every 8 CPU cycles.
448
449 If an invalid value is entered, the error message
450 will give a list of valid values e.g.
451
452 $ perf record -e intel_pt/cyc,cyc_thresh=15/u uname
453 Invalid cyc_thresh for intel_pt. Valid values are: 0-12
454
455 CYC packets are not requested by default.
456
457 pt Specifies pass-through which enables the branch config term.
458
459 The default config selects 'pt' if it is available, so a user will
460 never need to specify this term.
461
462 branch Enable branch tracing. Branch tracing is enabled by default so
463 to disable branch tracing use branch=0.
464
465 The default config selects 'branch' if it is available.
466
467 ptw Enable PTWRITE packets which are produced when a ptwrite
468 instruction is executed.
469
470 Support for this feature is indicated by:
471
472 /sys/bus/event_source/devices/intel_pt/caps/ptwrite
473
474 which contains "1" if the feature is supported and
475 "0" otherwise.
476
477 fup_on_ptw Enable a FUP packet to follow the PTWRITE packet. The FUP
478 packet provides the address of the ptwrite instruction. In the absence
479 of fup_on_ptw, the decoder will use the address of the previous branch
480 if branch tracing is enabled, otherwise the address will be zero. Note
481 that fup_on_ptw will work even when branch tracing is disabled.
482
483 pwr_evt Enable power events. The power events provide information about
484 changes to the CPU C-state.
485
486 Support for this feature is indicated by:
487
488 /sys/bus/event_source/devices/intel_pt/caps/power_event_trace
489
490 which contains "1" if the feature is supported and
491 "0" otherwise.
492
493 event Enable Event Trace. The events provide information about
494 asynchronous events.
495
496 Support for this feature is indicated by:
497
498 /sys/bus/event_source/devices/intel_pt/caps/event_trace
499
500 which contains "1" if the feature is supported and
501 "0" otherwise.
502
503 notnt Disable TNT packets. Without TNT packets, it is not possible to
504 walk executable code to reconstruct control flow, however FUP, TIP,
505 TIP.PGE and TIP.PGD packets still indicate asynchronous control flow,
506 and (if return compression is disabled - see noretcomp) return
507 statements. The advantage of eliminating TNT packets is reducing the
508 size of the trace and corresponding tracing overhead.
509
510 Support for this feature is indicated by:
511
512 /sys/bus/event_source/devices/intel_pt/caps/tnt_disable
513
514 which contains "1" if the feature is supported and
515 "0" otherwise.
516
517 AUX area sampling option
518 To select Intel PT "sampling" the AUX area sampling option can be used:
519
520 --aux-sample
521
522 Optionally it can be followed by the sample size in bytes e.g.
523
524 --aux-sample=8192
525
526 In addition, the Intel PT event to sample must be defined e.g.
527
528 -e intel_pt//u
529
530 Samples on other events will be created containing Intel PT data e.g.
531 the following will create Intel PT samples on the branch-misses event,
532 note the events must be grouped using {}:
533
534 perf record --aux-sample -e '{intel_pt//u,branch-misses:u}'
535
536 An alternative to --aux-sample is to add the config term
537 aux-sample-size to events. In this case, the grouping is implied e.g.
538
539 perf record -e intel_pt//u -e branch-misses/aux-sample-size=8192/u
540
541 is the same as:
542
543 perf record -e '{intel_pt//u,branch-misses/aux-sample-size=8192/u}'
544
545 but allows for also using an address filter e.g.:
546
547 perf record -e intel_pt//u --filter 'filter * @/bin/ls' -e branch-misses/aux-sample-size=8192/u -- ls
548
549 It is important to select a sample size that is big enough to contain
550 at least one PSB packet. If not a warning will be displayed:
551
552 Intel PT sample size (%zu) may be too small for PSB period (%zu)
553
554 The calculation used for that is: if sample_size ⟨ psb_period + 256
555 display the warning. When sampling is used, psb_period defaults to 0
556 (2KiB).
557
558 The default sample size is 4KiB.
559
560 The sample size is passed in aux_sample_size in struct perf_event_attr.
561 The sample size is limited by the maximum event size which is 64KiB. It
562 is difficult to know how big the event might be without the trace
563 sample attached, but the tool validates that the sample size is not
564 greater than 60KiB.
565
566 new snapshot option
567 The difference between full trace and snapshot from the kernel’s
568 perspective is that in full trace we don’t overwrite trace data that
569 the user hasn’t collected yet (and indicated that by advancing
570 aux_tail), whereas in snapshot mode we let the trace run and overwrite
571 older data in the buffer so that whenever something interesting
572 happens, we can stop it and grab a snapshot of what was going on around
573 that interesting moment.
574
575 To select snapshot mode a new option has been added:
576
577 -S
578
579 Optionally it can be followed by the snapshot size e.g.
580
581 -S0x100000
582
583 The default snapshot size is the auxtrace mmap size. If neither
584 auxtrace mmap size nor snapshot size is specified, then the default is
585 4MiB for privileged users (or if /proc/sys/kernel/perf_event_paranoid <
586 0), 128KiB for unprivileged users. If an unprivileged user does not
587 specify mmap pages, the mmap pages will be reduced as described in the
588 new auxtrace mmap size option section below.
589
590 The snapshot size is displayed if the option -vv is used e.g.
591
592 Intel PT snapshot size: %zu
593
594 new auxtrace mmap size option
595 Intel PT buffer size is specified by an addition to the -m option e.g.
596
597 -m,16
598
599 selects a buffer size of 16 pages i.e. 64KiB.
600
601 Note that the existing functionality of -m is unchanged. The auxtrace
602 mmap size is specified by the optional addition of a comma and the
603 value.
604
605 The default auxtrace mmap size for Intel PT is 4MiB/page_size for
606 privileged users (or if /proc/sys/kernel/perf_event_paranoid < 0),
607 128KiB for unprivileged users. If an unprivileged user does not specify
608 mmap pages, the mmap pages will be reduced from the default
609 512KiB/page_size to 256KiB/page_size, otherwise the user is likely to
610 get an error as they exceed their mlock limit (Max locked memory as
611 shown in /proc/self/limits). Note that perf does not count the first
612 512KiB (actually /proc/sys/kernel/perf_event_mlock_kb minus 1 page) per
613 cpu against the mlock limit so an unprivileged user is allowed 512KiB
614 per cpu plus their mlock limit (which defaults to 64KiB but is not
615 multiplied by the number of cpus).
616
617 In full-trace mode, powers of two are allowed for buffer size, with a
618 minimum size of 2 pages. In snapshot mode or sampling mode, it is the
619 same but the minimum size is 1 page.
620
621 The mmap size and auxtrace mmap size are displayed if the -vv option is
622 used e.g.
623
624 mmap length 528384
625 auxtrace mmap length 4198400
626
627 Intel PT modes of operation
628 Intel PT can be used in 3 modes: full-trace mode sample mode snapshot
629 mode
630
631 Full-trace mode traces continuously e.g.
632
633 perf record -e intel_pt//u uname
634
635 Sample mode attaches a Intel PT sample to other events e.g.
636
637 perf record --aux-sample -e intel_pt//u -e branch-misses:u
638
639 Snapshot mode captures the available data when a signal is sent or
640 "snapshot" control command is issued. e.g. using a signal
641
642 perf record -v -e intel_pt//u -S ./loopy 1000000000 &
643 [1] 11435
644 kill -USR2 11435
645 Recording AUX area tracing snapshot
646
647 Note that the signal sent is SIGUSR2. Note that "Recording AUX area
648 tracing snapshot" is displayed because the -v option is used.
649
650 The advantage of using "snapshot" control command is that the access is
651 controlled by access to a FIFO e.g.
652
653 $ mkfifo perf.control
654 $ mkfifo perf.ack
655 $ cat perf.ack &
656 [1] 15235
657 $ sudo ~/bin/perf record --control fifo:perf.control,perf.ack -S -e intel_pt//u -- sleep 60 &
658 [2] 15243
659 $ ps -e | grep perf
660 15244 pts/1 00:00:00 perf
661 $ kill -USR2 15244
662 bash: kill: (15244) - Operation not permitted
663 $ echo snapshot > perf.control
664 ack
665
666 The 3 Intel PT modes of operation cannot be used together.
667
668 Buffer handling
669 There may be buffer limitations (i.e. single ToPa entry) which means
670 that actual buffer sizes are limited to powers of 2 up to 4MiB
671 (MAX_ORDER). In order to provide other sizes, and in particular an
672 arbitrarily large size, multiple buffers are logically concatenated.
673 However an interrupt must be used to switch between buffers. That has
674 two potential problems: a) the interrupt may not be handled in time so
675 that the current buffer becomes full and some trace data is lost. b)
676 the interrupts may slow the system and affect the performance results.
677
678 If trace data is lost, the driver sets truncated in the PERF_RECORD_AUX
679 event which the tools report as an error.
680
681 In full-trace mode, the driver waits for data to be copied out before
682 allowing the (logical) buffer to wrap-around. If data is not copied out
683 quickly enough, again truncated is set in the PERF_RECORD_AUX event. If
684 the driver has to wait, the intel_pt event gets disabled. Because it is
685 difficult to know when that happens, perf tools always re-enable the
686 intel_pt event after copying out data.
687
688 Intel PT and build ids
689 By default "perf record" post-processes the event stream to find all
690 build ids for executables for all addresses sampled. Deliberately,
691 Intel PT is not decoded for that purpose (it would take too long).
692 Instead the build ids for all executables encountered (due to mmap,
693 comm or task events) are included in the perf.data file.
694
695 To see buildids included in the perf.data file use the command:
696
697 perf buildid-list
698
699 If the perf.data file contains Intel PT data, that is the same as:
700
701 perf buildid-list --with-hits
702
703 Snapshot mode and event disabling
704 In order to make a snapshot, the intel_pt event is disabled using an
705 IOCTL, namely PERF_EVENT_IOC_DISABLE. However doing that can also
706 disable the collection of side-band information. In order to prevent
707 that, a dummy software event has been introduced that permits tracking
708 events (like mmaps) to continue to be recorded while intel_pt is
709 disabled. That is important to ensure there is complete side-band
710 information to allow the decoding of subsequent snapshots.
711
712 A test has been created for that. To find the test:
713
714 perf test list
715 ...
716 23: Test using a dummy software event to keep tracking
717
718 To run the test:
719
720 perf test 23
721 23: Test using a dummy software event to keep tracking : Ok
722
723 perf record modes (nothing new here)
724 perf record essentially operates in one of three modes: per thread per
725 cpu workload only
726
727 "per thread" mode is selected by -t or by --per-thread (with -p or -u
728 or just a workload). "per cpu" is selected by -C or -a. "workload only"
729 mode is selected by not using the other options but providing a command
730 to run (i.e. the workload).
731
732 In per-thread mode an exact list of threads is traced. There is no
733 inheritance. Each thread has its own event buffer.
734
735 In per-cpu mode all processes (or processes from the selected cgroup
736 i.e. -G option, or processes selected with -p or -u) are traced. Each
737 cpu has its own buffer. Inheritance is allowed.
738
739 In workload-only mode, the workload is traced but with per-cpu buffers.
740 Inheritance is allowed. Note that you can now trace a workload in
741 per-thread mode by using the --per-thread option.
742
743 Privileged vs non-privileged users
744 Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged
745 users have memory limits imposed upon them. That affects what buffer
746 sizes they can have as outlined above.
747
748 The v4.2 kernel introduced support for a context switch metadata event,
749 PERF_RECORD_SWITCH, which allows unprivileged users to see when their
750 processes are scheduled out and in, just not by whom, which is left for
751 the PERF_RECORD_SWITCH_CPU_WIDE, that is only accessible in system wide
752 context, which in turn requires CAP_PERFMON or CAP_SYS_ADMIN.
753
754 Please see the 45ac1403f564 ("perf: Add PERF_RECORD_SWITCH to indicate
755 context switches") commit, that introduces these metadata events for
756 further info.
757
758 When working with kernels < v4.2, the following considerations must be
759 taken, as the sched:sched_switch tracepoints will be used to receive
760 such information:
761
762 Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged
763 users are not permitted to use tracepoints which means there is
764 insufficient side-band information to decode Intel PT in per-cpu mode,
765 and potentially workload-only mode too if the workload creates new
766 processes.
767
768 Note also, that to use tracepoints, read-access to debugfs is required.
769 So if debugfs is not mounted or the user does not have read-access, it
770 will again not be possible to decode Intel PT in per-cpu mode.
771
772 sched_switch tracepoint
773 The sched_switch tracepoint is used to provide side-band data for Intel
774 PT decoding in kernels where the PERF_RECORD_SWITCH metadata event
775 isn’t available.
776
777 The sched_switch events are automatically added. e.g. the second event
778 shown below:
779
780 $ perf record -vv -e intel_pt//u uname
781 ------------------------------------------------------------
782 perf_event_attr:
783 type 6
784 size 112
785 config 0x400
786 { sample_period, sample_freq } 1
787 sample_type IP|TID|TIME|CPU|IDENTIFIER
788 read_format ID
789 disabled 1
790 inherit 1
791 exclude_kernel 1
792 exclude_hv 1
793 enable_on_exec 1
794 sample_id_all 1
795 ------------------------------------------------------------
796 sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
797 sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
798 sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
799 sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
800 ------------------------------------------------------------
801 perf_event_attr:
802 type 2
803 size 112
804 config 0x108
805 { sample_period, sample_freq } 1
806 sample_type IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER
807 read_format ID
808 inherit 1
809 sample_id_all 1
810 exclude_guest 1
811 ------------------------------------------------------------
812 sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8
813 sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8
814 sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8
815 sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8
816 ------------------------------------------------------------
817 perf_event_attr:
818 type 1
819 size 112
820 config 0x9
821 { sample_period, sample_freq } 1
822 sample_type IP|TID|TIME|IDENTIFIER
823 read_format ID
824 disabled 1
825 inherit 1
826 exclude_kernel 1
827 exclude_hv 1
828 mmap 1
829 comm 1
830 enable_on_exec 1
831 task 1
832 sample_id_all 1
833 mmap2 1
834 comm_exec 1
835 ------------------------------------------------------------
836 sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
837 sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
838 sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
839 sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
840 mmap size 528384B
841 AUX area mmap length 4194304
842 perf event ring buffer mmapped per cpu
843 Synthesizing auxtrace information
844 Linux
845 [ perf record: Woken up 1 times to write data ]
846 [ perf record: Captured and wrote 0.042 MB perf.data ]
847
848 Note, the sched_switch event is only added if the user is permitted to
849 use it and only in per-cpu mode.
850
851 Note also, the sched_switch event is only added if TSC packets are
852 requested. That is because, in the absence of timing information, the
853 sched_switch events cannot be matched against the Intel PT trace.
854
856 By default, perf script will decode trace data found in the perf.data
857 file. This can be further controlled by new option --itrace.
858
859 New --itrace option
860 Having no option is the same as
861
862 --itrace
863
864 which, in turn, is the same as
865
866 --itrace=cepwx
867
868 The letters are:
869
870 i synthesize "instructions" events
871 b synthesize "branches" events
872 x synthesize "transactions" events
873 w synthesize "ptwrite" events
874 p synthesize "power" events (incl. PSB events)
875 c synthesize branches events (calls only)
876 r synthesize branches events (returns only)
877 o synthesize PEBS-via-PT events
878 I synthesize Event Trace events
879 e synthesize tracing error events
880 d create a debug log
881 g synthesize a call chain (use with i or x)
882 G synthesize a call chain on existing event records
883 l synthesize last branch entries (use with i or x)
884 L synthesize last branch entries on existing event records
885 s skip initial number of events
886 q quicker (less detailed) decoding
887 A approximate IPC
888 Z prefer to ignore timestamps (so-called "timeless" decoding)
889
890 "Instructions" events look like they were recorded by "perf record -e
891 instructions".
892
893 "Branches" events look like they were recorded by "perf record -e
894 branches". "c" and "r" can be combined to get calls and returns.
895
896 "Transactions" events correspond to the start or end of transactions.
897 The flags field can be used in perf script to determine whether the
898 event is a transaction start, commit or abort.
899
900 Note that "instructions", "branches" and "transactions" events depend
901 on code flow packets which can be disabled by using the config term
902 "branch=0". Refer to the config terms section above.
903
904 "ptwrite" events record the payload of the ptwrite instruction and
905 whether "fup_on_ptw" was used. "ptwrite" events depend on PTWRITE
906 packets which are recorded only if the "ptw" config term was used.
907 Refer to the config terms section above. perf script "synth" field
908 displays "ptwrite" information like this: "ip: 0 payload:
909 0x123456789abcdef0" where "ip" is 1 if "fup_on_ptw" was used.
910
911 "Power" events correspond to power event packets and CBR (core-to-bus
912 ratio) packets. While CBR packets are always recorded when tracing is
913 enabled, power event packets are recorded only if the "pwr_evt" config
914 term was used. Refer to the config terms section above. The power
915 events record information about C-state changes, whereas CBR is
916 indicative of CPU frequency. perf script "event,synth" fields display
917 information like this: cbr: cbr: 22 freq: 2189 MHz (200%) mwait: hints:
918 0x60 extensions: 0x1 pwre: hw: 0 cstate: 2 sub-cstate: 0 exstop: ip: 1
919 pwrx: deepest cstate: 2 last cstate: 2 wake reason: 0x4 Where: "cbr"
920 includes the frequency and the percentage of maximum non-turbo "mwait"
921 shows mwait hints and extensions "pwre" shows C-state transitions (to a
922 C-state deeper than C0) and whether initiated by hardware "exstop"
923 indicates execution stopped and whether the IP was recorded exactly,
924 "pwrx" indicates return to C0 For more details refer to the Intel 64
925 and IA-32 Architectures Software Developer Manuals.
926
927 PSB events show when a PSB+ occurred and also the byte-offset in the
928 trace. Emitting a PSB+ can cause a CPU a slight delay. When doing
929 timing analysis of code with Intel PT, it is useful to know if a timing
930 bubble was caused by Intel PT or not.
931
932 Error events show where the decoder lost the trace. Error events are
933 quite important. Users must know if what they are seeing is a complete
934 picture or not. The "e" option may be followed by flags which affect
935 what errors will or will not be reported. Each flag must be preceded by
936 either + or -. The flags supported by Intel PT are: -o Suppress
937 overflow errors -l Suppress trace data lost errors For example, for
938 errors but not overflow or data lost errors:
939
940 --itrace=e-o-l
941
942 The "d" option will cause the creation of a file "intel_pt.log"
943 containing all decoded packets and instructions. Note that this option
944 slows down the decoder and that the resulting file may be very large.
945 The "d" option may be followed by flags which affect what debug
946 messages will or will not be logged. Each flag must be preceded by
947 either + or -. The flags support by Intel PT are: -a Suppress logging
948 of perf events +a Log all perf events +o Output to stdout instead of
949 "intel_pt.log" By default, logged perf events are filtered by any
950 specified time ranges, but flag +a overrides that.
951
952 In addition, the period of the "instructions" event can be specified.
953 e.g.
954
955 --itrace=i10us
956
957 sets the period to 10us i.e. one instruction sample is synthesized for
958 each 10 microseconds of trace. Alternatives to "us" are "ms"
959 (milliseconds), "ns" (nanoseconds), "t" (TSC ticks) or "i"
960 (instructions).
961
962 "ms", "us" and "ns" are converted to TSC ticks.
963
964 The timing information included with Intel PT does not give the time of
965 every instruction. Consequently, for the purpose of sampling, the
966 decoder estimates the time since the last timing packet based on 1 tick
967 per instruction. The time on the sample is not adjusted and reflects
968 the last known value of TSC.
969
970 For Intel PT, the default period is 100us.
971
972 Setting it to a zero period means "as often as possible".
973
974 In the case of Intel PT that is the same as a period of 1 and a unit of
975 instructions (i.e. --itrace=i1i).
976
977 Also the call chain size (default 16, max. 1024) for instructions or
978 transactions events can be specified. e.g.
979
980 --itrace=ig32
981 --itrace=xg32
982
983 Also the number of last branch entries (default 64, max. 1024) for
984 instructions or transactions events can be specified. e.g.
985
986 --itrace=il10
987 --itrace=xl10
988
989 Note that last branch entries are cleared for each sample, so there is
990 no overlap from one sample to the next.
991
992 The G and L options are designed in particular for sample mode, and
993 work much like g and l but add call chain and branch stack to the other
994 selected events instead of synthesized events. For example, to record
995 branch-misses events for ls and then add a call chain derived from the
996 Intel PT trace:
997
998 perf record --aux-sample -e '{intel_pt//u,branch-misses:u}' -- ls
999 perf report --itrace=Ge
1000
1001 Although in fact G is a default for perf report, so that is the same as
1002 just:
1003
1004 perf report
1005
1006 One caveat with the G and L options is that they work poorly with
1007 "Large PEBS". Large PEBS means PEBS records will be accumulated by
1008 hardware and the written into the event buffer in one go. That reduces
1009 interrupts, but can give very late timestamps. Because the Intel PT
1010 trace is synchronized by timestamps, the PEBS events do not match the
1011 trace. Currently, Large PEBS is used only in certain circumstances: -
1012 hardware supports it - PEBS is used - event period is specified,
1013 instead of frequency - the sample type is limited to the following
1014 flags: PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR |
1015 PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID |
1016 PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_TRANSACTION
1017 | PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER
1018 | PERF_SAMPLE_PERIOD (and sometimes) | PERF_SAMPLE_TIME Because Intel
1019 PT sample mode uses a different sample type to the list above, Large
1020 PEBS is not used with Intel PT sample mode. To avoid Large PEBS in
1021 other cases, avoid specifying the event period i.e. avoid the perf
1022 record -c option, --count option, or period config term.
1023
1024 To disable trace decoding entirely, use the option --no-itrace.
1025
1026 It is also possible to skip events generated (instructions, branches,
1027 transactions) at the beginning. This is useful to ignore initialization
1028 code.
1029
1030 --itrace=i0nss1000000
1031
1032 skips the first million instructions.
1033
1034 The q option changes the way the trace is decoded. The decoding is much
1035 faster but much less detailed. Specifically, with the q option, the
1036 decoder does not decode TNT packets, and does not walk object code, but
1037 gets the ip from FUP and TIP packets. The q option can be used with the
1038 b and i options but the period is not used. The q option decodes more
1039 quickly, but is useful only if the control flow of interest is
1040 represented or indicated by FUP, TIP, TIP.PGE, or TIP.PGD packets
1041 (refer below). However the q option could be used to find time ranges
1042 that could then be decoded fully using the --time option.
1043
1044 What will not be decoded with the (single) q option:
1045
1046 • direct calls and jmps
1047
1048 • conditional branches
1049
1050 • non-branch instructions
1051
1052 What will be decoded with the (single) q option:
1053
1054 • asynchronous branches such as interrupts
1055
1056 • indirect branches
1057
1058 • function return target address if the noretcomp config term (refer
1059 config terms section) was used
1060
1061 • start of (control-flow) tracing
1062
1063 • end of (control-flow) tracing, if it is not out of context
1064
1065 • power events, ptwrite, transaction start and abort
1066
1067 • instruction pointer associated with PSB packets
1068
1069 Note the q option does not specify what events will be synthesized e.g.
1070 the p option must be used also to show power events.
1071
1072 Repeating the q option (double-q i.e. qq) results in even faster
1073 decoding and even less detail. The decoder decodes only extended PSB
1074 (PSB+) packets, getting the instruction pointer if there is a FUP
1075 packet within PSB+ (i.e. between PSB and PSBEND). Note PSB packets
1076 occur regularly in the trace based on the psb_period config term (refer
1077 config terms section). There will be a FUP packet if the PSB+ occurs
1078 while control flow is being traced.
1079
1080 What will not be decoded with the qq option:
1081
1082 • everything except instruction pointer associated with PSB packets
1083
1084 What will be decoded with the qq option:
1085
1086 • instruction pointer associated with PSB packets
1087
1088 The Z option is equivalent to having recorded a trace without TSC (i.e.
1089 config term tsc=0). It can be useful to avoid timestamp issues when
1090 decoding a trace of a virtual machine.
1091
1092 dlfilter-show-cycles.so
1093 Cycles can be displayed using dlfilter-show-cycles.so in which case the
1094 itrace A option can be useful to provide higher granularity cycle
1095 information:
1096
1097 perf script --itrace=A --call-trace --dlfilter dlfilter-show-cycles.so
1098
1099 To see a list of dlfilters:
1100
1101 perf script -v --list-dlfilters
1102
1103 See also perf-dlfilters(1)
1104
1105 dump option
1106 perf script has an option (-D) to "dump" the events i.e. display the
1107 binary data.
1108
1109 When -D is used, Intel PT packets are displayed. The packet decoder
1110 does not pay attention to PSB packets, but just decodes the bytes - so
1111 the packets seen by the actual decoder may not be identical in places
1112 where the data is corrupt. One example of that would be when the
1113 buffer-switching interrupt has been too slow, and the buffer has been
1114 filled completely. In that case, the last packet in the buffer might be
1115 truncated and immediately followed by a PSB as the trace continues in
1116 the next buffer.
1117
1118 To disable the display of Intel PT packets, combine the -D option with
1119 --no-itrace.
1120
1122 By default, perf report will decode trace data found in the perf.data
1123 file. This can be further controlled by new option --itrace exactly the
1124 same as perf script, with the exception that the default is
1125 --itrace=igxe.
1126
1128 perf inject also accepts the --itrace option in which case tracing data
1129 is removed and replaced with the synthesized events. e.g.
1130
1131 perf inject --itrace -i perf.data -o perf.data.new
1132
1133 Below is an example of using Intel PT with autofdo. It requires autofdo
1134 (https://github.com/google/autofdo) and gcc version 5. The bubble sort
1135 example is from the AutoFDO tutorial
1136 (https://gcc.gnu.org/wiki/AutoFDO/Tutorial) amended to take the number
1137 of elements as a parameter.
1138
1139 $ gcc-5 -O3 sort.c -o sort_optimized
1140 $ ./sort_optimized 30000
1141 Bubble sorting array of 30000 elements
1142 2254 ms
1143
1144 $ cat ~/.perfconfig
1145 [intel-pt]
1146 mispred-all = on
1147
1148 $ perf record -e intel_pt//u ./sort 3000
1149 Bubble sorting array of 3000 elements
1150 58 ms
1151 [ perf record: Woken up 2 times to write data ]
1152 [ perf record: Captured and wrote 3.939 MB perf.data ]
1153 $ perf inject -i perf.data -o inj --itrace=i100usle --strip
1154 $ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1
1155 $ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo
1156 $ ./sort_autofdo 30000
1157 Bubble sorting array of 30000 elements
1158 2155 ms
1159
1160 Note there is currently no advantage to using Intel PT instead of LBR,
1161 but that may change in the future if greater use is made of the data.
1162
1164 Some hardware has the feature to redirect PEBS records to the Intel PT
1165 trace. Recording is selected by using the aux-output config term e.g.
1166
1167 perf record -c 10000 -e '{intel_pt/branch=0/,cycles/aux-output/ppp}' uname
1168
1169 Originally, software only supported redirecting at most one PEBS event
1170 because it was not able to differentiate one event from another. To
1171 overcome that, more recent kernels and perf tools add support for the
1172 PERF_RECORD_AUX_OUTPUT_HW_ID side-band event. To check for the presence
1173 of that event in a PEBS-via-PT trace:
1174
1175 perf script -D --no-itrace | grep PERF_RECORD_AUX_OUTPUT_HW_ID
1176
1177 To display PEBS events from the Intel PT trace, use the itrace o option
1178 e.g.
1179
1180 perf script --itrace=oe
1181
1183 For --xed the xed tool is needed. Here is how to install it:
1184
1185 $ git clone https://github.com/intelxed/mbuild.git mbuild
1186 $ git clone https://github.com/intelxed/xed
1187 $ cd xed
1188 $ ./mfile.py --share
1189 $ ./mfile.py examples
1190 $ sudo ./mfile.py --prefix=/usr/local install
1191 $ sudo ldconfig
1192 $ sudo cp obj/examples/xed /usr/local/bin
1193
1194 Basic xed testing:
1195
1196 $ xed | head -3
1197 ERROR: required argument(s) were missing
1198 Copyright (C) 2017, Intel Corporation. All rights reserved.
1199 XED version: [v10.0-328-g7d62c8c49b7b]
1200 $
1201
1203 Currently, only kernel tracing is supported and only with either
1204 "timeless" decoding (i.e. no TSC timestamps) or VM Time Correlation. VM
1205 Time Correlation is an extra step using perf inject and requires
1206 unchanging VMX TSC Offset and no VMX TSC Scaling.
1207
1208 Other limitations and caveats
1209
1210 VMX controls may suppress packets needed for decoding resulting in decoding errors
1211 VMX controls may block the perf NMI to the host potentially resulting in lost trace data
1212 Guest kernel self-modifying code (e.g. jump labels or JIT-compiled eBPF) will result in decoding errors
1213 Guest thread information is unknown
1214 Guest VCPU is unknown but may be able to be inferred from the host thread
1215 Callchains are not supported
1216
1217 Example using "timeless" decoding
1218
1219 Start VM
1220
1221 $ sudo virsh start kubuntu20.04
1222 Domain kubuntu20.04 started
1223
1224 Mount the guest file system. Note sshfs needs -o direct_io to enable
1225 reading of proc files. root access is needed to read /proc/kcore.
1226
1227 $ mkdir vm0
1228 $ sshfs -o direct_io root@vm0:/ vm0
1229
1230 Copy the guest /proc/kallsyms, /proc/modules and /proc/kcore
1231
1232 $ perf buildid-cache -v --kcore vm0/proc/kcore
1233 kcore added to build-id cache directory /home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306
1234 $ KALLSYMS=/home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306/kallsyms
1235
1236 Find the VM process
1237
1238 $ ps -eLl | grep 'KVM\|PID'
1239 F S UID PID PPID LWP C PRI NI ADDR SZ WCHAN TTY TIME CMD
1240 3 S 64055 1430 1 1440 1 80 0 - 1921718 - ? 00:02:47 CPU 0/KVM
1241 3 S 64055 1430 1 1441 1 80 0 - 1921718 - ? 00:02:41 CPU 1/KVM
1242 3 S 64055 1430 1 1442 1 80 0 - 1921718 - ? 00:02:38 CPU 2/KVM
1243 3 S 64055 1430 1 1443 2 80 0 - 1921718 - ? 00:03:18 CPU 3/KVM
1244
1245 Start an open-ended perf record, tracing the VM process, do something
1246 on the VM, and then ctrl-C to stop. TSC is not supported and tsc=0 must
1247 be specified. That means mtc is useless, so add mtc=0. However, IPC can
1248 still be determined, hence cyc=1 can be added. Only kernel decoding is
1249 supported, so k must be specified. Intel PT traces both the host and
1250 the guest so --guest and --host need to be specified. Without
1251 timestamps, --per-thread must be specified to distinguish threads.
1252
1253 $ sudo perf kvm --guest --host --guestkallsyms $KALLSYMS record --kcore -e intel_pt/tsc=0,mtc=0,cyc=1/k -p 1430 --per-thread
1254 ^C
1255 [ perf record: Woken up 1 times to write data ]
1256 [ perf record: Captured and wrote 5.829 MB ]
1257
1258 perf script can be used to provide an instruction trace
1259
1260 $ perf script --guestkallsyms $KALLSYMS --insn-trace --xed -F+ipc | grep -C10 vmresume | head -21
1261 CPU 0/KVM 1440 ffffffff82133cdd __vmx_vcpu_run+0x3d ([kernel.kallsyms]) movq 0x48(%rax), %r9
1262 CPU 0/KVM 1440 ffffffff82133ce1 __vmx_vcpu_run+0x41 ([kernel.kallsyms]) movq 0x50(%rax), %r10
1263 CPU 0/KVM 1440 ffffffff82133ce5 __vmx_vcpu_run+0x45 ([kernel.kallsyms]) movq 0x58(%rax), %r11
1264 CPU 0/KVM 1440 ffffffff82133ce9 __vmx_vcpu_run+0x49 ([kernel.kallsyms]) movq 0x60(%rax), %r12
1265 CPU 0/KVM 1440 ffffffff82133ced __vmx_vcpu_run+0x4d ([kernel.kallsyms]) movq 0x68(%rax), %r13
1266 CPU 0/KVM 1440 ffffffff82133cf1 __vmx_vcpu_run+0x51 ([kernel.kallsyms]) movq 0x70(%rax), %r14
1267 CPU 0/KVM 1440 ffffffff82133cf5 __vmx_vcpu_run+0x55 ([kernel.kallsyms]) movq 0x78(%rax), %r15
1268 CPU 0/KVM 1440 ffffffff82133cf9 __vmx_vcpu_run+0x59 ([kernel.kallsyms]) movq (%rax), %rax
1269 CPU 0/KVM 1440 ffffffff82133cfc __vmx_vcpu_run+0x5c ([kernel.kallsyms]) callq 0xffffffff82133c40
1270 CPU 0/KVM 1440 ffffffff82133c40 vmx_vmenter+0x0 ([kernel.kallsyms]) jz 0xffffffff82133c46
1271 CPU 0/KVM 1440 ffffffff82133c42 vmx_vmenter+0x2 ([kernel.kallsyms]) vmresume IPC: 0.11 (50/445)
1272 :1440 1440 ffffffffbb678b06 native_write_msr+0x6 ([guest.kernel.kallsyms]) nopl %eax, (%rax,%rax,1)
1273 :1440 1440 ffffffffbb678b0b native_write_msr+0xb ([guest.kernel.kallsyms]) retq IPC: 0.04 (2/41)
1274 :1440 1440 ffffffffbb666646 lapic_next_deadline+0x26 ([guest.kernel.kallsyms]) data16 nop
1275 :1440 1440 ffffffffbb666648 lapic_next_deadline+0x28 ([guest.kernel.kallsyms]) xor %eax, %eax
1276 :1440 1440 ffffffffbb66664a lapic_next_deadline+0x2a ([guest.kernel.kallsyms]) popq %rbp
1277 :1440 1440 ffffffffbb66664b lapic_next_deadline+0x2b ([guest.kernel.kallsyms]) retq IPC: 0.16 (4/25)
1278 :1440 1440 ffffffffbb74607f clockevents_program_event+0x8f ([guest.kernel.kallsyms]) test %eax, %eax
1279 :1440 1440 ffffffffbb746081 clockevents_program_event+0x91 ([guest.kernel.kallsyms]) jz 0xffffffffbb74603c IPC: 0.06 (2/30)
1280 :1440 1440 ffffffffbb74603c clockevents_program_event+0x4c ([guest.kernel.kallsyms]) popq %rbx
1281 :1440 1440 ffffffffbb74603d clockevents_program_event+0x4d ([guest.kernel.kallsyms]) popq %r12
1282
1283 Example using VM Time Correlation
1284
1285 Start VM
1286
1287 $ sudo virsh start kubuntu20.04
1288 Domain kubuntu20.04 started
1289
1290 Mount the guest file system. Note sshfs needs -o direct_io to enable
1291 reading of proc files. root access is needed to read /proc/kcore.
1292
1293 $ mkdir -p vm0
1294 $ sshfs -o direct_io root@vm0:/ vm0
1295
1296 Copy the guest /proc/kallsyms, /proc/modules and /proc/kcore
1297
1298 $ perf buildid-cache -v --kcore vm0/proc/kcore
1299 same kcore found in /home/user/.debug/[kernel.kcore]/cc9c55a98c5e4ec0aeda69302554aabed5cd6491/2021021312450777
1300 $ KALLSYMS=/home/user/.debug/\[kernel.kcore\]/cc9c55a98c5e4ec0aeda69302554aabed5cd6491/2021021312450777/kallsyms
1301
1302 Find the VM process
1303
1304 $ ps -eLl | grep 'KVM\|PID'
1305 F S UID PID PPID LWP C PRI NI ADDR SZ WCHAN TTY TIME CMD
1306 3 S 64055 16998 1 17005 13 80 0 - 1818189 - ? 00:00:16 CPU 0/KVM
1307 3 S 64055 16998 1 17006 4 80 0 - 1818189 - ? 00:00:05 CPU 1/KVM
1308 3 S 64055 16998 1 17007 3 80 0 - 1818189 - ? 00:00:04 CPU 2/KVM
1309 3 S 64055 16998 1 17008 4 80 0 - 1818189 - ? 00:00:05 CPU 3/KVM
1310
1311 Start an open-ended perf record, tracing the VM process, do something
1312 on the VM, and then ctrl-C to stop. IPC can be determined, hence cyc=1
1313 can be added. Only kernel decoding is supported, so k must be
1314 specified. Intel PT traces both the host and the guest so --guest and
1315 --host need to be specified.
1316
1317 $ sudo perf kvm --guest --host --guestkallsyms $KALLSYMS record --kcore -e intel_pt/cyc=1/k -p 16998
1318 ^C[ perf record: Woken up 1 times to write data ]
1319 [ perf record: Captured and wrote 9.041 MB perf.data.kvm ]
1320
1321 Now perf inject can be used to determine the VMX TCS Offset. Note,
1322 Intel PT TSC packets are only 7-bytes, so the TSC Offset might differ
1323 from the actual value in the 8th byte. That will have no effect i.e.
1324 the resulting timestamps will be correct anyway.
1325
1326 $ perf inject -i perf.data.kvm --vm-time-correlation=dry-run
1327 ERROR: Unknown TSC Offset for VMCS 0x1bff6a
1328 VMCS: 0x1bff6a TSC Offset 0xffffe42722c64c41
1329 ERROR: Unknown TSC Offset for VMCS 0x1cbc08
1330 VMCS: 0x1cbc08 TSC Offset 0xffffe42722c64c41
1331 ERROR: Unknown TSC Offset for VMCS 0x1c3ce8
1332 VMCS: 0x1c3ce8 TSC Offset 0xffffe42722c64c41
1333 ERROR: Unknown TSC Offset for VMCS 0x1cbce9
1334 VMCS: 0x1cbce9 TSC Offset 0xffffe42722c64c41
1335
1336 Each virtual CPU has a different Virtual Machine Control Structure
1337 (VMCS) shown above with the calculated TSC Offset. For an unchanging
1338 TSC Offset they should all be the same for the same virtual machine.
1339
1340 Now that the TSC Offset is known, it can be provided to perf inject
1341
1342 $ perf inject -i perf.data.kvm --vm-time-correlation="dry-run 0xffffe42722c64c41"
1343
1344 Note the options for perf inject --vm-time-correlation are:
1345
1346 [ dry-run ] [ <TSC Offset> [ : <VMCS> [ , <VMCS> ]... ] ]...
1347
1348 So it is possible to specify different TSC Offsets for different VMCS.
1349 The option "dry-run" will cause the file to be processed but without
1350 updating it. Note it is also possible to get a intel_pt.log file by
1351 adding option --itrace=d
1352
1353 There were no errors so, do it for real
1354
1355 $ perf inject -i perf.data.kvm --vm-time-correlation=0xffffe42722c64c41 --force
1356
1357 perf script can be used to see if there are any decoder errors
1358
1359 $ perf script -i perf.data.kvm --guestkallsyms $KALLSYMS --itrace=e-o
1360
1361 There were none.
1362
1363 perf script can be used to provide an instruction trace showing
1364 timestamps
1365
1366 $ perf script -i perf.data.kvm --guestkallsyms $KALLSYMS --insn-trace --xed -F+ipc | grep -C10 vmresume | head -21
1367 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cdd __vmx_vcpu_run+0x3d ([kernel.kallsyms]) movq 0x48(%rax), %r9
1368 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ce1 __vmx_vcpu_run+0x41 ([kernel.kallsyms]) movq 0x50(%rax), %r10
1369 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ce5 __vmx_vcpu_run+0x45 ([kernel.kallsyms]) movq 0x58(%rax), %r11
1370 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ce9 __vmx_vcpu_run+0x49 ([kernel.kallsyms]) movq 0x60(%rax), %r12
1371 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ced __vmx_vcpu_run+0x4d ([kernel.kallsyms]) movq 0x68(%rax), %r13
1372 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cf1 __vmx_vcpu_run+0x51 ([kernel.kallsyms]) movq 0x70(%rax), %r14
1373 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cf5 __vmx_vcpu_run+0x55 ([kernel.kallsyms]) movq 0x78(%rax), %r15
1374 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cf9 __vmx_vcpu_run+0x59 ([kernel.kallsyms]) movq (%rax), %rax
1375 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cfc __vmx_vcpu_run+0x5c ([kernel.kallsyms]) callq 0xffffffff82133c40
1376 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133c40 vmx_vmenter+0x0 ([kernel.kallsyms]) jz 0xffffffff82133c46
1377 CPU 1/KVM 17006 [001] 11500.262866075: ffffffff82133c42 vmx_vmenter+0x2 ([kernel.kallsyms]) vmresume IPC: 0.05 (40/769)
1378 :17006 17006 [001] 11500.262869216: ffffffff82200cb0 asm_sysvec_apic_timer_interrupt+0x0 ([guest.kernel.kallsyms]) clac
1379 :17006 17006 [001] 11500.262869216: ffffffff82200cb3 asm_sysvec_apic_timer_interrupt+0x3 ([guest.kernel.kallsyms]) pushq $0xffffffffffffffff
1380 :17006 17006 [001] 11500.262869216: ffffffff82200cb5 asm_sysvec_apic_timer_interrupt+0x5 ([guest.kernel.kallsyms]) callq 0xffffffff82201160
1381 :17006 17006 [001] 11500.262869216: ffffffff82201160 error_entry+0x0 ([guest.kernel.kallsyms]) cld
1382 :17006 17006 [001] 11500.262869216: ffffffff82201161 error_entry+0x1 ([guest.kernel.kallsyms]) pushq %rsi
1383 :17006 17006 [001] 11500.262869216: ffffffff82201162 error_entry+0x2 ([guest.kernel.kallsyms]) movq 0x8(%rsp), %rsi
1384 :17006 17006 [001] 11500.262869216: ffffffff82201167 error_entry+0x7 ([guest.kernel.kallsyms]) movq %rdi, 0x8(%rsp)
1385 :17006 17006 [001] 11500.262869216: ffffffff8220116c error_entry+0xc ([guest.kernel.kallsyms]) pushq %rdx
1386 :17006 17006 [001] 11500.262869216: ffffffff8220116d error_entry+0xd ([guest.kernel.kallsyms]) pushq %rcx
1387 :17006 17006 [001] 11500.262869216: ffffffff8220116e error_entry+0xe ([guest.kernel.kallsyms]) pushq %rax
1388
1390 Event Trace records information about asynchronous events, for example
1391 interrupts, faults, VM exits and entries. The information is recorded
1392 in CFE and EVD packets, and also the Interrupt Flag is recorded on the
1393 MODE.Exec packet. The CFE packet contains a type field to identify one
1394 of the following:
1395
1396 1 INTR interrupt, fault, exception, NMI
1397 2 IRET interrupt return
1398 3 SMI system management interrupt
1399 4 RSM resume from system management mode
1400 5 SIPI startup interprocessor interrupt
1401 6 INIT INIT signal
1402 7 VMENTRY VM-Entry
1403 8 VMEXIT VM-Entry
1404 9 VMEXIT_INTR VM-Exit due to interrupt
1405 10 SHUTDOWN Shutdown
1406
1407 For more details, refer to the Intel 64 and IA-32 Architectures
1408 Software Developer Manuals (version 076 or later).
1409
1410 The capability to do Event Trace is indicated by the
1411 /sys/bus/event_source/devices/intel_pt/caps/event_trace file.
1412
1413 Event trace is selected for recording using the "event" config term.
1414 e.g.
1415
1416 perf record -e intel_pt/event/u uname
1417
1418 Event trace events are output using the --itrace I option. e.g.
1419
1420 perf script --itrace=Ie
1421
1422 perf script displays events containing CFE type, vector and event data,
1423 in the form:
1424
1425 evt: hw int (t) cfe: INTR IP: 1 vector: 3 PFA: 0x8877665544332211
1426
1427 The IP flag indicates if the event binds to an IP, which includes any
1428 case where flow control packet generation is enabled, as well as when
1429 CFE packet IP bit is set.
1430
1431 perf script displays events containing changes to the Interrupt Flag in
1432 the form:
1433
1434 iflag: t IFLAG: 1->0 via branch
1435
1436 where "via branch" indicates a branch (interrupt or return from
1437 interrupt) and "non branch" indicates an instruction such as CFI, STI
1438 or POPF).
1439
1440 In addition, the current state of the interrupt flag is indicated by
1441 the presence or absence of the "D" (interrupt disabled) perf script
1442 flag. If the interrupt flag is changed, then the "t" flag is also
1443 included i.e.
1444
1445 no flag, interrupts enabled IF=1
1446 t interrupts become disabled IF=1 -> IF=0
1447 D interrupts are disabled IF=0
1448 Dt interrupts become enabled IF=0 -> IF=1
1449
1450 The intel-pt-events.py script illustrates how to access Event Trace
1451 information using a Python script.
1452
1454 TNT packets are disabled using the "notnt" config term. e.g.
1455
1456 perf record -e intel_pt/notnt/u uname
1457
1458 In that case the --itrace q option is forced because walking executable
1459 code to reconstruct the control flow is not possible.
1460
1462 perf-record(1), perf-script(1), perf-report(1), perf-inject(1)
1463
1464
1465
1466perf 06/14/2022 PERF-INTEL-PT(1)