perf-arm-spe(1)

PERF-ARM-SPE(1)                   perf Manual                  PERF-ARM-SPE(1)

NAME

       perf-arm-spe - Support for Arm Statistical Profiling Extension within
       Perf tools

SYNOPSIS

       perf record -e arm_spe//

DESCRIPTION

       The SPE (Statistical Profiling Extension) feature provides accurate
       attribution of latencies and events down to individual instructions.
       Rather than being interrupt-driven, it picks an instruction to sample
       and then captures data for it during execution. Data includes execution
       time in cycles. For loads and stores it also includes data address,
       cache miss events, and data origin.

       The sampling has 5 stages:

        1. Choose an operation

        2. Collect data about the operation

        3. Optionally discard the record based on a filter

        4. Write the record to memory

        5. Interrupt when the buffer is full

   Choose an operation
       The operation is chosen from a sample population; for SPE this is an
       IMPLEMENTATION DEFINED choice of all architectural instructions or all
       micro-ops. Sampling happens at a programmable interval. The
       architecture provides a mechanism for the SPE driver to infer the
       minimum interval at which it should sample. This minimum interval is
       used by the driver if no interval is specified. A pseudo-random
       perturbation is also added to the sampling interval by default.

   Collect data about the operation
       Program counter, PMU events, timings and data addresses related to the
       operation are recorded. Sampling ensures that only one sampled
       operation is in flight at a time.

   Optionally discard the record based on a filter
       Based on programmable criteria, choose whether to keep the record or
       discard it. If the record is discarded then the flow stops here for
       this sample.

   Write the record to memory
       The record is appended to a memory buffer.

   Interrupt when the buffer is full
       When the buffer fills, an interrupt is sent and the driver signals Perf
       to collect the records. Perf saves the raw data in the perf.data file.

OPENING THE FILE

       Up to this point, neither the kernel nor Perf has decoded the SPE
       data. Decoding happens only when the recorded file is opened with
       perf report or perf script. When decoding the data, Perf generates
       "synthetic samples" as if they had been generated at the time of
       recording. These samples are the same as those Perf would produce
       with normal sampling without SPE, although they may have more
       attributes associated with them. For example, a normal sample may
       have just the instruction pointer, but an SPE sample can have data
       addresses and latency attributes.

WHY SAMPLING?

       •   Sampling, rather than tracing, cuts down the profiling problem to
           something more manageable for hardware. Only one sampled operation
           is in flight at a time.

       •   Allows precise attribution data, including the full PC of the
           instruction and data virtual and physical addresses.

       •   Allows correlation between an instruction and events, such as TLB
           and cache misses. (Data source indicates which particular cache was
           hit, but the meaning is implementation defined because different
           implementations can have different cache configurations.)

       However, SPE does not provide any call-graph information, and relies on
       statistical methods.

COLLISIONS

       When an operation is sampled while a previous sampled operation has not
       finished, a collision occurs. The new sample is dropped. Collisions
       affect the integrity of the data, so the sample rate should be set to
       avoid collisions.

       The sample_collision PMU event can be used to determine the number of
       lost samples. However, this count is based on collisions before
       filtering occurs, so it cannot be used as an exact number of dropped
       samples that would have made it through the filter, but it can serve
       as a rough guide.

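       As a rough illustration of that estimate, the sketch below turns two
       counts into a percentage of sampled operations lost to collisions.
       The helper name spe_collision_pct and both of its inputs are
       hypothetical: the first argument is a sample_collision count and the
       second is the number of records that were written. It assumes that no
       filter was configured, because collisions are counted before
       filtering.

```shell
# Hypothetical helper: percentage of sampled operations lost to collisions.
# $1 = sample_collision count, $2 = number of records written.
# Assumes no filter is active; with a filter this is only a rough guide.
spe_collision_pct() {
    awk -v c="$1" -v w="$2" 'BEGIN {
        total = c + w
        if (total == 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * c / total
    }'
}

spe_collision_pct 50 950   # prints 5.0
```

       A persistently high percentage suggests that the sampling interval
       is too short for the workload.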

THE EFFECT OF MICROARCHITECTURAL SAMPLING

       If an implementation samples micro-operations instead of instructions,
       the results of sampling must be weighted accordingly.

       For example, if a given instruction A is always converted into two
       micro-operations, A0 and A1, it becomes twice as likely to appear in
       the sample population.

       The coarse effect of conversions, and, if applicable, sampling of
       speculative operations, can be estimated from the sample_pop and
       inst_retired PMU events.

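       One way to read those two counters together is as a ratio. The
       sketch below is only an illustration: spe_ops_per_inst is a
       hypothetical helper, and how the two counts are collected is left
       out because event naming depends on the platform's PMU. A result
       above 1 would suggest that the sample population contains more
       operations than retired instructions, for instance because of
       micro-op conversion or speculative operations.

```shell
# Hypothetical helper: average sampled-population operations per retired
# instruction.
# $1 = sample_pop count, $2 = inst_retired count.
spe_ops_per_inst() {
    awk -v p="$1" -v r="$2" 'BEGIN {
        if (r == 0) { print "n/a"; exit }
        printf "%.2f\n", p / r
    }'
}

spe_ops_per_inst 1500 1000   # prints 1.50
```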

KERNEL REQUIREMENTS

       The ARM_SPE_PMU config option must be enabled, built either as a module
       or statically.

       Depending on CPU model, the kernel may need to be booted with page
       table isolation disabled (kpti=off). If KPTI needs to be disabled and
       is not, SPE fails with the console message "profiling buffer
       inaccessible. Try passing kpti=off on the kernel command line".

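       A quick way to check the first requirement is to look for the option
       in the running kernel's configuration. The following is a minimal
       sketch: spe_config_state is a hypothetical helper that reads kernel
       config text on standard input, and the config file locations in the
       usage comments are assumptions that vary by distribution.

```shell
# Hypothetical helper: classify CONFIG_ARM_SPE_PMU from kernel config text
# read on stdin. Prints "builtin", "module" or "missing".
spe_config_state() {
    case "$(grep -E '^CONFIG_ARM_SPE_PMU=' | head -n 1)" in
        *=y) echo builtin ;;
        *=m) echo module ;;
        *)   echo missing ;;
    esac
}

# Example usage (file locations are assumptions):
#   spe_config_state < "/boot/config-$(uname -r)"
#   zcat /proc/config.gz | spe_config_state
```

       If the result is "module", the module must also be loaded before the
       arm_spe PMU becomes visible to Perf.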

CAPTURING SPE WITH PERF COMMAND-LINE TOOLS

       You can record a session with SPE samples:

           perf record -e arm_spe// -- ./mybench

       The sample period is set with the -c option. Because the minimum
       interval is used by default, it is recommended to set this to a higher
       value. The value is written to PMSIRR.INTERVAL.

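       One possible way to choose a higher value is to scale the minimum
       interval reported by the driver. This is a sketch under assumptions:
       spe_period is a hypothetical helper, the multiplier is arbitrary, and
       the default sysfs path (including the arm_spe_0 device name) is an
       assumption that should be checked on the target system.

```shell
# Hypothetical helper: derive a sample period as a multiple of the minimum
# interval.
# $1 = multiplier
# $2 = file holding the minimum interval (the default path is an assumption)
spe_period() {
    file=${2:-/sys/bus/event_source/devices/arm_spe_0/caps/min_interval}
    min=$(cat "$file") || return 1
    echo $(( min * $1 ))
}

# Example usage on an SPE-capable machine:
#   perf record -e arm_spe// -c "$(spe_period 4)" -- ./mybench
```
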
   Config parameters
       These are placed between the // in the event and are comma separated,
       for example -e arm_spe/load_filter=1,min_latency=10/

           branch_filter=1     - collect branches only (PMSFCR.B)
           event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
           jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
           load_filter=1       - collect loads only (PMSFCR.LD)
           min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
           pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
           pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
           store_filter=1      - collect stores only (PMSFCR.ST)
           ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)

       * Latency is the total latency from the point at which sampling started
       on that instruction, rather than only the execution latency.

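       When scripting recordings, the event string can be assembled from
       individual parameters. The sketch below only builds the string and
       does not validate parameter names; spe_event is a hypothetical
       helper, not part of Perf.

```shell
# Hypothetical helper: join key=value parameters into an arm_spe event string.
# The subshell keeps the IFS change local to this function.
spe_event() {
    ( IFS=,; printf 'arm_spe/%s/\n' "$*" )
}

spe_event load_filter=1 min_latency=10
# prints arm_spe/load_filter=1,min_latency=10/

# Example usage on an SPE-capable machine:
#   perf record -e "$(spe_event load_filter=1 min_latency=10)" -- ./mybench
```
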
       Only some events can be filtered on; these include:

           bit 1     - instruction retired (i.e. omit speculative instructions)
           bit 3     - L1D refill
           bit 5     - TLB refill
           bit 7     - mispredict
           bit 11    - misaligned access

       So to sample just retired instructions:

           perf record -e arm_spe/event_filter=2/ -- ./mybench

       or just mispredicted branches:

           perf record -e arm_spe/event_filter=0x80/ -- ./mybench

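       The mask values in these commands are the listed bit numbers
       expressed as powers of two (bit 1 is 2, bit 7 is 0x80). A small
       sketch of that arithmetic follows; spe_event_mask is a hypothetical
       helper, and it simply ORs together whichever bit numbers it is
       given.

```shell
# Hypothetical helper: build an event_filter mask from bit numbers.
spe_event_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))
    done
    printf '0x%x\n' "$mask"
}

spe_event_mask 1      # prints 0x2  (instruction retired)
spe_event_mask 7      # prints 0x80 (mispredict)
spe_event_mask 3 5    # prints 0x28 (bits 3 and 5 set together)
```
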
   Viewing the data
       By default perf report and perf script will assign samples to separate
       groups depending on the attributes/events of the SPE record. Because
       instructions can have multiple events associated with them, the samples
       in these groups are not necessarily unique. For example, perf report
       shows these groups:

           Available samples
           0 arm_spe//
           0 dummy:u
           21 l1d-miss
           897 l1d-access
           5 llc-miss
           7 llc-access
           2 tlb-miss
           1K tlb-access
           36 branch-miss
           0 remote-access
           900 memory

       The arm_spe// and dummy:u events are implementation details and are
       expected to be empty.

       To get a full list of unique samples that are not sorted into groups,
       set the itrace option to generate instruction samples. The period
       option is also taken into account, so set it to 1 instruction unless
       you want to further downsample the already sampled SPE data:

           perf report --itrace=i1i

       Memory access details are also stored on the samples and this can be
       viewed with:

           perf report --mem-mode

   Common errors
       •   "Cannot find PMU ‘arm_spe’. Missing kernel support?"

               Module not built or loaded, KPTI not disabled (see above), or running on a VM

       •   "Arm SPE CONTEXT packets not found in the traces."

               Root privilege is required to collect context packets, but these only increase the accuracy of
               assigning PIDs to kernel samples. For userspace sampling this can be ignored.

       •   Excessively large perf.data file size

               Increase sampling interval (see above)