LLVM-EXEGESIS(1) LLVM LLVM-EXEGESIS(1)

NAME
llvm-exegesis - LLVM Machine Instruction Benchmark

SYNOPSIS
llvm-exegesis [options]

DESCRIPTION
llvm-exegesis is a benchmarking tool that uses information available in
LLVM to measure host machine instruction characteristics like latency,
throughput, or port decomposition.

Given an LLVM opcode name and a benchmarking mode, llvm-exegesis generates
a code snippet that makes execution as serial (resp. as parallel) as
possible so that we can measure the latency (resp. inverse throughput/uop
decomposition) of the instruction. The code snippet is jitted and, unless
requested not to, executed on the host subtarget. The time taken (resp.
resource usage) is measured using hardware performance counters. The
result is printed out as YAML to the standard output.

The main goal of this tool is to automatically (in)validate LLVM's
TableGen scheduling models. To that end, we also provide analysis of the
results.

llvm-exegesis can also benchmark arbitrary user-provided code snippets.

EXAMPLE 1: BENCHMARKING INSTRUCTIONS
Assume you have an X86-64 machine. To measure the latency of a single
instruction, run:

   $ llvm-exegesis -mode=latency -opcode-name=ADD64rr

Measuring the uop decomposition or inverse throughput of an instruction
works similarly:

   $ llvm-exegesis -mode=uops -opcode-name=ADD64rr
   $ llvm-exegesis -mode=inverse_throughput -opcode-name=ADD64rr

The output is a YAML document. By default it is written to stdout, but
you can write it to a file using -benchmarks-file:

   ---
   key:
     opcode_name: ADD64rr
     mode: latency
     config: ''
   cpu_name: haswell
   llvm_triple: x86_64-unknown-linux-gnu
   num_repetitions: 10000
   measurements:
     - { key: latency, value: 1.0058, debug_string: '' }
   error: ''
   info: 'explicit self cycles, selecting one aliasing configuration.
   Snippet:
   ADD64rr R8, R8, R10
   '
   ...

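As a rough illustration of consuming this output, the following Python
sketch pulls the (key, value) pairs out of the measurements list with a
regular expression. The function name extract_measurements and the regular
expression are assumptions made for this example, not part of
llvm-exegesis; a real YAML parser would be more robust, and the regex is
used here only to keep the sketch free of third-party dependencies.

```python
import re

# Sample text copied from the YAML document shown above.
SAMPLE = """\
---
key:
  opcode_name: ADD64rr
  mode: latency
  config: ''
cpu_name: haswell
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 1.0058, debug_string: '' }
error: ''
info: 'explicit self cycles, selecting one aliasing configuration.
Snippet:
ADD64rr R8, R8, R10
'
...
"""

# Matches the start of one entry of the "measurements" list, e.g.
#   - { key: latency, value: 1.0058, debug_string: '' }
# The leading brace keeps the top-level "key:" mapping from matching.
MEASUREMENT_RE = re.compile(
    r"\{\s*key:\s*(?P<key>[\w-]+),\s*value:\s*(?P<value>[-+0-9.eE]+)"
)


def extract_measurements(text):
    """Return (key, value) pairs found in llvm-exegesis YAML output."""
    return [
        (match.group("key"), float(match.group("value")))
        for match in MEASUREMENT_RE.finditer(text)
    ]


print(extract_measurements(SAMPLE))  # [('latency', 1.0058)]
```
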
To measure the latency of all instructions for the host architecture,
run:

   $ llvm-exegesis -mode=latency -opcode-index=-1

EXAMPLE 2: BENCHMARKING A CUSTOM CODE SNIPPET
To measure the latency/uops of a custom piece of code, you can specify
the snippets-file option (- reads from standard input).

   $ echo "vzeroupper" | llvm-exegesis -mode=uops -snippets-file=-

Real-life code snippets typically depend on registers or memory.
llvm-exegesis checks the liveness of registers (i.e. any register use
has a corresponding def or is a "live in"). If your code depends on the
value of some registers, you have two options:

• Mark the register as requiring a definition. llvm-exegesis will
  automatically assign a value to the register. This can be done using
  the directive LLVM-EXEGESIS-DEFREG <reg name> <hex_value>, where
  <hex_value> is a bit pattern used to fill <reg name>. If <hex_value>
  is smaller than the register width, it will be sign-extended.

• Mark the register as a "live in". llvm-exegesis will benchmark using
  whatever value was in this register on entry. This can be done using
  the directive LLVM-EXEGESIS-LIVEIN <reg name>.

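To make the sign-extension rule for LLVM-EXEGESIS-DEFREG concrete, here is
a small Python sketch. It assumes that the width of the supplied pattern is
four bits per hex digit; that assumption, the function name
sign_extend_hex, and the truncation of over-wide patterns are illustrative
choices for this example rather than documented behaviour of the tool.

```python
def sign_extend_hex(hex_value, reg_bits):
    """Sign-extend a hex bit pattern to a register of reg_bits bits.

    Assumption of this sketch: the pattern's own width is 4 bits per
    hex digit, so "42" is treated as an 8-bit pattern.
    """
    pattern_bits = 4 * len(hex_value)
    value = int(hex_value, 16)
    register_mask = (1 << reg_bits) - 1
    if pattern_bits >= reg_bits:
        # Pattern already fills the register; keeping only the low bits
        # is a choice made by this sketch.
        return value & register_mask
    if (value >> (pattern_bits - 1)) & 1:
        # Top bit of the pattern is set: fill the upper bits with ones.
        value |= register_mask ^ ((1 << pattern_bits) - 1)
    return value


print(hex(sign_extend_hex("42", 128)))  # 0x42 (top bit clear, zero fill)
print(hex(sign_extend_hex("ff", 32)))   # 0xffffffff (top bit set)
```

Under these assumptions, a pattern such as 42 leaves the upper bits of the
register clear, while a pattern whose top bit is set fills them with ones.
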
For example, the following code snippet depends on the values of XMM1
(which will be set by the tool) and the memory buffer passed in RDI
(live in).

   # LLVM-EXEGESIS-LIVEIN RDI
   # LLVM-EXEGESIS-DEFREG XMM1 42
   vmulps (%rdi), %xmm1, %xmm2
   vhaddps %xmm2, %xmm2, %xmm3
   addq $0x10, %rdi

EXAMPLE 3: ANALYSIS
Assuming you have a set of benchmarked instructions (either latency or
uops) as YAML in file /tmp/benchmarks.yaml, you can analyze the results
using the following command:

   $ llvm-exegesis -mode=analysis \
         -benchmarks-file=/tmp/benchmarks.yaml \
         -analysis-clusters-output-file=/tmp/clusters.csv \
         -analysis-inconsistencies-output-file=/tmp/inconsistencies.html

This will group the instructions into clusters with the same
performance characteristics. The clusters will be written out to
/tmp/clusters.csv in the following format:

   cluster_id,opcode_name,config,sched_class
   ...
   2,ADD32ri8_DB,,WriteALU,1.00
   2,ADD32ri_DB,,WriteALU,1.01
   2,ADD32rr,,WriteALU,1.01
   2,ADD32rr_DB,,WriteALU,1.00
   2,ADD32rr_REV,,WriteALU,1.00
   2,ADD64i32,,WriteALU,1.01
   2,ADD64ri32,,WriteALU,1.01
   2,MOVSX64rr32,,BSWAP32r_BSWAP64r_MOVSX64rr32,1.00
   2,VPADDQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.02
   2,VPSUBQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.01
   2,ADD64ri8,,WriteALU,1.00
   2,SETBr,,WriteSETCC,1.01
   ...

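A clusters file in this format can be post-processed with a few lines of
Python. The sketch below groups scheduling classes by cluster id, which is
one way to spot clusters that mix several classes. The function name
sched_classes_by_cluster is hypothetical. Note that the sample rows carry
one more field than the header names, so the sketch reads the first four
fields by position and ignores anything after them.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the sample output above.
SAMPLE = """\
cluster_id,opcode_name,config,sched_class
...
2,ADD32ri8_DB,,WriteALU,1.00
2,ADD32rr,,WriteALU,1.01
2,SETBr,,WriteSETCC,1.01
...
"""


def sched_classes_by_cluster(csv_text):
    """Map each cluster id to the set of sched classes found in it."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    classes = defaultdict(set)
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or elided rows such as "..."
        cluster_id, _opcode, _config, sched_class = row[:4]
        classes[cluster_id].add(sched_class)
    return dict(classes)


print(sched_classes_by_cluster(SAMPLE))
```
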
llvm-exegesis will also analyze the clusters to point out
inconsistencies in the scheduling information. The output is an HTML
file. For example, /tmp/inconsistencies.html will contain messages like
the following: [image]

Note that the scheduling class names will be resolved only when
llvm-exegesis is compiled in debug mode; otherwise only the class id will
be shown. This does not invalidate any of the analysis results, though.

OPTIONS
-help  Print a summary of command line options.

-opcode-index=<LLVM opcode index>
       Specify the opcode to measure, by index. Specifying -1 will result
       in measuring every existing opcode. See example 1 for details.
       Either opcode-index, opcode-name or snippets-file must be set.

-opcode-name=<opcode name 1>,<opcode name 2>,...
       Specify the opcode to measure, by name. Several opcodes can be
       specified as a comma-separated list. See example 1 for details.
       Either opcode-index, opcode-name or snippets-file must be set.

-snippets-file=<filename>
       Specify the custom code snippet to measure. See example 2 for
       details. Either opcode-index, opcode-name or snippets-file must be
       set.

-mode=[latency|uops|inverse_throughput|analysis]
       Specify the run mode. Note that some modes have additional
       requirements and options.

       latency mode can make use of either RDTSC or LBR. latency[LBR] is
       only available on X86 (at least Skylake). To run in latency[LBR]
       mode, a positive value must be specified for
       -x86-lbr-sample-period, together with --repetition-mode=loop.

       In analysis mode, you also need to specify at least one of the
       -analysis-clusters-output-file= and
       -analysis-inconsistencies-output-file= options.

--benchmark-phase=[prepare-snippet|prepare-and-assemble-snippet|assemble-measured-code|measure]
       By default, when -mode= is specified, the generated snippet will be
       executed and measured. This requires running on the hardware for
       which the snippet was generated, and that hardware must support
       performance measurements. However, it is possible to stop at some
       stage before measuring. Choices are:

       • prepare-snippet: Only generate the minimal instruction sequence.

       • prepare-and-assemble-snippet: Same as prepare-snippet, but also
         dumps an excerpt of the sequence (hex encoded).

       • assemble-measured-code: Same as prepare-and-assemble-snippet,
         but also creates the full sequence that can be dumped to a file
         using --dump-object-to-disk.

       • measure: Same as assemble-measured-code, but also runs the
         measurement.

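Since each phase is described as "same as" the previous one plus one more
step, the phases can be thought of as an ordered list of cumulative
stages. The Python sketch below models that reading; the names PHASES,
stages_run and needs_target_hardware are invented for this illustration
and do not correspond to anything in llvm-exegesis itself.

```python
# Phase names as accepted by --benchmark-phase, in the order the
# description above introduces them.
PHASES = (
    "prepare-snippet",
    "prepare-and-assemble-snippet",
    "assemble-measured-code",
    "measure",
)


def stages_run(selected):
    """Return every stage performed when --benchmark-phase=selected."""
    return PHASES[: PHASES.index(selected) + 1]


def needs_target_hardware(selected):
    """Only the final phase executes the snippet and reads counters."""
    return selected == "measure"


print(stages_run("assemble-measured-code"))
print(needs_target_hardware("assemble-measured-code"))  # False
```

Read this way, any phase other than measure stops before the snippet is
executed, which is what makes it usable on a machine other than the one
the snippet targets.
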
-x86-lbr-sample-period=<nBranches/sample>
       Specify the LBR sampling period, i.e. how many branches occur before
       we take a sample. When a positive value is specified for this option
       and the mode is latency, we will use LBRs for measuring. When
       choosing the sampling period, a small value is preferred, but
       throttling could occur if the sampling is too frequent. A prime
       number should be used to avoid consistently skipping certain blocks.

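One simple way to follow the advice about prime sampling periods is to
round a desired period up to the next prime. The Python sketch below does
this by trial division. The starting value of 200 is an arbitrary number
chosen for the example, not a recommended period, and next_prime is a
hypothetical helper.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small periods."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


def next_prime(n):
    """Return the smallest prime greater than or equal to n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


period = next_prime(200)
print(f"-x86-lbr-sample-period={period}")  # -x86-lbr-sample-period=211
```
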
-x86-disable-upper-sse-registers
       Using the upper xmm registers (xmm8-xmm15) forces a longer
       instruction encoding, which may put greater pressure on the frontend
       fetch and decode stages, potentially reducing the rate at which
       instructions are dispatched to the backend, particularly on older
       hardware. Comparing baseline results with results obtained with
       this option enabled can help determine the effects of the frontend
       and can be used to improve latency and throughput estimates.

-repetition-mode=[duplicate|loop|min]
       Specify the repetition mode. duplicate will create a large,
       straight-line basic block with num-repetitions instructions
       (repeating the snippet num-repetitions/snippet size times). loop
       will, optionally, duplicate the snippet until the loop body
       contains at least loop-body-size instructions, and then wrap the
       result in a loop which will execute num-repetitions instructions
       (thus, again, repeating the snippet num-repetitions/snippet size
       times). The loop mode, especially with loop unrolling, tends to
       better hide the effects of the CPU frontend on architectures that
       cache decoded instructions, but consumes a register for counting
       iterations. If performing an analysis over many opcodes, it may be
       best to instead use the min mode, which will run each of the other
       modes and produce the minimal measured result.

-num-repetitions=<Number of repetitions>
       Specify the target number of executed instructions. Note that the
       actual repetition count of the snippet will be
       num-repetitions/snippet size. Higher values lead to more accurate
       measurements but lengthen the benchmark.

-loop-body-size=<Preferred loop body size>
       Only effective for -repetition-mode=[loop|min]. Instead of looping
       over the snippet directly, first duplicate it so that the loop body
       contains at least this many instructions. This potentially results
       in the loop body being cached in the CPU Op Cache / Loop Cache,
       which may have higher throughput than the CPU decoders.

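The relationship between num-repetitions, the snippet size and
loop-body-size can be worked through with a little arithmetic. The Python
sketch below is only a model of the description above: rounding down, and
ignoring any loop-counter instructions, are assumptions of this example,
and the tool may round or count differently. The function names are
hypothetical.

```python
import math


def duplicate_mode_copies(num_repetitions, snippet_size):
    """Snippet copies in duplicate mode: num-repetitions/snippet size.

    Rounding down is an assumption of this sketch.
    """
    return num_repetitions // snippet_size


def loop_mode_plan(num_repetitions, snippet_size, loop_body_size=0):
    """Return (copies per loop body, body instructions, iterations)."""
    # Duplicate the snippet until the body holds at least
    # loop_body_size instructions (always at least one copy).
    copies = max(1, math.ceil(loop_body_size / snippet_size))
    body_instructions = copies * snippet_size
    # Iterate until roughly num_repetitions instructions have executed.
    iterations = num_repetitions // body_instructions
    return copies, body_instructions, iterations


print(duplicate_mode_copies(10000, 2))  # 5000
print(loop_mode_plan(10000, 2, 5))      # (3, 6, 1666)
```

For a two-instruction snippet and 10000 repetitions, this model gives 5000
copies in duplicate mode; with a preferred loop body size of 5, the loop
body holds three copies (six instructions) and runs about 1666 times.
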
-max-configs-per-opcode=<value>
       Specify the maximum number of configurations that can be generated
       for each opcode. By default this is 1, meaning that we assume that
       a single measurement is enough to characterize an opcode. This
       might not be true of all instructions: for example, the performance
       characteristics of the LEA instruction on X86 depend on the values
       of assigned registers and immediates. Setting
       -max-configs-per-opcode to a value larger than 1 allows
       llvm-exegesis to explore more configurations to discover whether
       some register or immediate assignments lead to different
       performance characteristics.

-benchmarks-file=</path/to/file>
       File to read (analysis mode) or write
       (latency/uops/inverse_throughput modes) benchmark results. "-" uses
       stdin/stdout.

-analysis-clusters-output-file=</path/to/file>
       If provided, write the analysis clusters as CSV to this file. "-"
       prints to stdout. By default, this analysis is not run.

-analysis-inconsistencies-output-file=</path/to/file>
       If non-empty, write inconsistencies found during analysis to this
       file. "-" prints to stdout. By default, this analysis is not run.

-analysis-filter=[all|reg-only|mem-only]
       By default, all benchmark results are analysed, but sometimes it
       may be useful to only look at those that do not involve memory, or
       vice versa. This option allows you to either keep all benchmarks,
       or filter out (ignore) all the ones that involve memory (i.e.
       involve instructions that may read or write to memory), or the
       opposite, to keep only such benchmarks.

-analysis-clustering=[dbscan,naive]
       Specify the clustering algorithm to use. By default DBSCAN will be
       used. The naive clustering algorithm is better suited for doing
       further work on the -analysis-inconsistencies-output-file= output:
       it will create one cluster per opcode, and check that the cluster
       is stable (all points are neighbours).

-analysis-numpoints=<dbscan numPoints parameter>
       Specify the numPoints parameter to be used for DBSCAN clustering
       (analysis mode, DBSCAN only).

-analysis-clustering-epsilon=<dbscan epsilon parameter>
       Specify the epsilon parameter used for clustering of benchmark
       points (analysis mode).

-analysis-inconsistency-epsilon=<epsilon>
       Specify the epsilon parameter used for detecting when the cluster
       is different from the LLVM schedule profile values (analysis mode).

-analysis-display-unstable-clusters
       If there is more than one benchmark for an opcode, said benchmarks
       may end up not being clustered into the same cluster if the
       measured performance characteristics are different. By default all
       such opcodes are filtered out. This flag will instead show only
       such unstable opcodes.

-ignore-invalid-sched-class=false
       If set, ignore instructions that do not have a sched class (class
       idx = 0).

-mtriple=<triple name>
       Target triple. See -version for available targets.

-mcpu=<cpu name>
       If set, measure the CPU characteristics using the counters for
       this CPU. This is useful when creating new sched models (the host
       CPU is unknown to LLVM). (-mcpu=help for details)

--analysis-override-benchmark-triple-and-cpu
       By default, llvm-exegesis will analyze the benchmarks for the
       triple/CPU they were measured for, but if you want to analyze them
       for some other combination (specified via -mtriple/-mcpu), you can
       pass this flag.

--dump-object-to-disk=true
       If set, llvm-exegesis will dump the generated code to a temporary
       file to enable code inspection. Disabled by default.

EXIT STATUS
llvm-exegesis returns 0 on success. Otherwise, an error message is
printed to standard error, and the tool returns a non-zero value.

AUTHOR
Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT
2003-2023, LLVM Project

16 2023-08-24 LLVM-EXEGESIS(1)