LLVM-EXEGESIS(1)                     LLVM                     LLVM-EXEGESIS(1)

llvm-exegesis - LLVM Machine Instruction Benchmark

llvm-exegesis [options]

llvm-exegesis is a benchmarking tool that uses information available in
LLVM to measure host machine instruction characteristics such as
latency, throughput, or port decomposition.

Given an LLVM opcode name and a benchmarking mode, llvm-exegesis
generates a code snippet that makes execution as serial as possible (to
measure latency) or as parallel as possible (to measure inverse
throughput or uop decomposition). The code snippet is jitted and,
unless requested not to, executed on the host subtarget. The time taken
(or, for uop decomposition, the resource usage) is measured using
hardware performance counters. The result is printed as YAML to
standard output.

The main goal of this tool is to automatically (in)validate LLVM's
TableGen scheduling models. To that end, it also provides analysis of
the results.

llvm-exegesis can also benchmark arbitrary user-provided code snippets.

llvm-exegesis currently supports benchmarking only on Linux, for X86
(64-bit only), ARM (AArch64 only), MIPS, and PowerPC (PowerPC64LE
only). Not all benchmarking functionality is guaranteed to work on
every platform. llvm-exegesis also has a separate analysis mode that is
supported on every platform that LLVM supports.

llvm-exegesis supports benchmarking arbitrary snippets of assembly.
However, benchmarking these snippets often requires some setup so that
they can execute properly. llvm-exegesis has two annotations and some
additional utilities to help with setup so that snippets can be
benchmarked properly.

• LLVM-EXEGESIS-DEFREG <register name> - Adding this annotation to the
  text assembly snippet to be benchmarked marks the register as
  requiring a definition. A value is provided automatically unless a
  second parameter, a hex value, is passed in, using the
  LLVM-EXEGESIS-DEFREG <register name> <hex value> format. <hex value>
  is a bit pattern used to fill the register. If the value is narrower
  than the register, it is sign-extended to match the size of the
  register.

• LLVM-EXEGESIS-LIVEIN <register name> - This annotation specifies
  registers that should keep their value upon starting the benchmark.
  In some cases, values can be passed through registers from the
  benchmarking setup. The registers, and the values assigned to them,
  that a benchmarking script can use with LLVM-EXEGESIS-LIVEIN are as
  follows:

  • Scratch memory register - The specific register that holds this
    value is platform dependent (e.g., it is the RDI register on X86
    Linux). Marking this register as a live-in ensures that it holds a
    pointer to a block of memory (1MB) that the snippet can use.

• LLVM-EXEGESIS-MEM-DEF <value name> <size> <value> - This annotation
  specifies memory definitions that can later be mapped into the
  execution process of a snippet with the LLVM-EXEGESIS-MEM-MAP
  annotation. Each value is named using the <value name> argument so
  that it can be referenced later within a map annotation. The size is
  specified in bytes and the value is given in hexadecimal. If the size
  of the value is less than the specified size, the value is repeated
  until it fills the entire section of memory. Using this annotation
  requires the subprocess execution mode.

• LLVM-EXEGESIS-MEM-MAP <value name> <address> - This annotation maps
  previously defined memory definitions into the execution context of a
  process. The value name refers to a previously defined memory
  definition, and the address is a decimal number that specifies the
  address at which the memory definition should start. Note that a
  single memory definition can be mapped multiple times. Using this
  annotation requires the subprocess execution mode.

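The four annotation formats listed above can be recognized with a small parser. The following Python sketch is only an illustration of the syntax described here, not the parser llvm-exegesis itself uses; the Annotation type and parse_annotations function are hypothetical names introduced for this example.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical record type for this sketch; not part of llvm-exegesis."""
    kind: str    # "DEFREG", "LIVEIN", "MEM-DEF" or "MEM-MAP"
    args: tuple  # the remaining whitespace-separated arguments


_PATTERN = re.compile(r"LLVM-EXEGESIS-(DEFREG|LIVEIN|MEM-DEF|MEM-MAP)\s+(.*)")

# Accepted argument counts per annotation, following the formats above.
_ARITY = {"DEFREG": (1, 2), "LIVEIN": (1,), "MEM-DEF": (3,), "MEM-MAP": (2,)}


def parse_annotations(snippet: str) -> list:
    """Collect the annotation comments found in a text assembly snippet."""
    found = []
    for line in snippet.splitlines():
        match = _PATTERN.search(line)
        if match is None:
            continue  # an ordinary assembly line or an unrelated comment
        kind, args = match.group(1), tuple(match.group(2).split())
        if len(args) not in _ARITY[kind]:
            raise ValueError(f"unexpected arguments for {kind}: {args}")
        found.append(Annotation(kind, args))
    return found
```

The sketch keeps every argument as a string, so interpreting a value as hexadecimal or an address as decimal is left to the caller.
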
Assume you have an X86-64 machine. To measure the latency of a single
instruction, run:

    $ llvm-exegesis -mode=latency -opcode-name=ADD64rr

Measuring the uop decomposition or inverse throughput of an instruction
works similarly:

    $ llvm-exegesis -mode=uops -opcode-name=ADD64rr
    $ llvm-exegesis -mode=inverse_throughput -opcode-name=ADD64rr

The output is a YAML document (by default it is written to stdout, but
you can write it to a file using -benchmarks-file):

    ---
    key:
      opcode_name: ADD64rr
      mode: latency
      config: ''
    cpu_name: haswell
    llvm_triple: x86_64-unknown-linux-gnu
    num_repetitions: 10000
    measurements:
      - { key: latency, value: 1.0058, debug_string: '' }
    error: ''
    info: 'explicit self cycles, selecting one aliasing configuration.
    Snippet:
    ADD64rr R8, R8, R10
    '
    ...

To measure the latency of all instructions for the host architecture,
run:

    $ llvm-exegesis -mode=latency -opcode-index=-1

To measure the latency/uops of a custom piece of code, you can specify
the snippets-file option (- reads from standard input).

    $ echo "vzeroupper" | llvm-exegesis -mode=uops -snippets-file=-

Real-life code snippets typically depend on registers or memory.
llvm-exegesis checks the liveness of registers (i.e., that any register
use has a corresponding def or is a "live in"). If your code depends on
the value of some registers, you need to use snippet annotations to
ensure setup is performed properly.

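The liveness rule stated above (every register use needs an earlier def or a live-in) can be illustrated with a short Python sketch. It is a simplified model and not the tool's implementation: the def/use sets are written by hand, sub-register aliasing is ignored, and undefined_uses is a hypothetical name.

```python
def undefined_uses(instructions, live_ins=(), defined=()):
    """Return registers that are read before anything defines them.

    instructions: (defs, uses) pairs of register-name sets, in program order.
    live_ins:     registers declared with LLVM-EXEGESIS-LIVEIN.
    defined:      registers declared with LLVM-EXEGESIS-DEFREG.
    """
    available = set(live_ins) | set(defined)
    missing = []
    for defs, uses in instructions:
        for reg in sorted(uses):
            if reg not in available and reg not in missing:
                missing.append(reg)
        # Registers written here are available to later instructions.
        available |= set(defs)
    return missing


# Hand-written model of a snippet that reads RDI and XMM1 before writing them.
program = [
    ({"XMM2"}, {"RDI", "XMM1"}),
    ({"XMM3"}, {"XMM2"}),
    ({"RDI"}, {"RDI"}),
]
print(undefined_uses(program))
print(undefined_uses(program, live_ins={"RDI"}, defined={"XMM1"}))
```

In this model, an empty result means every use is covered by a def, a DEFREG annotation, or a LIVEIN annotation.
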
For example, the following code snippet depends on the values of XMM1
(which will be set by the tool) and the memory buffer passed in RDI
(live in).

    # LLVM-EXEGESIS-LIVEIN RDI
    # LLVM-EXEGESIS-DEFREG XMM1 42
    vmulps (%rdi), %xmm1, %xmm2
    vhaddps %xmm2, %xmm2, %xmm3
    addq $0x10, %rdi

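The DEFREG value in this snippet (42) is a hex bit pattern that is sign-extended to the register width. The Python sketch below shows one way to compute such a pattern. It assumes that the width of the written value is four bits per hex digit, which this page does not state; fill_pattern is a hypothetical helper.

```python
def fill_pattern(hex_value: str, register_bits: int) -> int:
    """Sign-extend a hex bit pattern to the width of a register."""
    value_bits = 4 * len(hex_value)  # assumption: 4 bits per written digit
    if value_bits > register_bits:
        raise ValueError("value is wider than the register")
    value = int(hex_value, 16)
    if (value >> (value_bits - 1)) & 1:
        # The top bit is set: treat the value as negative so that the
        # upper bits of the register are filled with ones.
        value -= 1 << value_bits
    return value & ((1 << register_bits) - 1)


# 0x42 has a clear top bit, so a 128-bit register would hold 0x42.
print(hex(fill_pattern("42", 128)))
# 0x80 has its top bit set, so a 32-bit register would hold 0xffffff80.
print(hex(fill_pattern("80", 32)))
```
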
Some snippets require memory setup in specific places to execute
without crashing. Setting up memory can be accomplished with the
LLVM-EXEGESIS-MEM-DEF and LLVM-EXEGESIS-MEM-MAP annotations. To execute
the following snippet:

    movq $8192, %rax
    movq (%rax), %rdi

we need at least eight bytes of memory allocated starting at address
0x2000 (8192 in decimal). We can create the necessary execution
environment by adding the following annotations to the snippet:

    # LLVM-EXEGESIS-MEM-DEF test1 4096 2147483647
    # LLVM-EXEGESIS-MEM-MAP test1 8192

    movq $8192, %rax
    movq (%rax), %rdi

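The MEM-DEF annotation above defines 4096 bytes filled by repeating the hexadecimal value 2147483647. A minimal Python sketch of that fill rule follows. The byte order and the handling of a final partial repetition are assumptions made for illustration, and fill_memory is a hypothetical name; the tool's actual memory layout may differ.

```python
def fill_memory(size: int, hex_value: str) -> bytes:
    """Repeat a hexadecimal value until it fills `size` bytes."""
    if not hex_value:
        raise ValueError("the value must not be empty")
    # Assumption: bytes appear in memory in the order they are written.
    pattern = bytes.fromhex(hex_value)
    repeats = -(-size // len(pattern))  # ceiling division
    # Assumption: a final partial repetition is truncated.
    return (pattern * repeats)[:size]


block = fill_memory(4096, "2147483647")
print(len(block), block[:5].hex())

# MEM-MAP takes a decimal address; 8192 is the 0x2000 used by the snippet.
print(int("8192") == 0x2000)
```
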
Assuming you have a set of benchmarked instructions (either latency or
uops) as YAML in file /tmp/benchmarks.yaml, you can analyze the results
using the following command:

    $ llvm-exegesis -mode=analysis \
        -benchmarks-file=/tmp/benchmarks.yaml \
        -analysis-clusters-output-file=/tmp/clusters.csv \
        -analysis-inconsistencies-output-file=/tmp/inconsistencies.html

This will group the instructions into clusters with the same
performance characteristics. The clusters will be written out to
/tmp/clusters.csv in the following format:

    cluster_id,opcode_name,config,sched_class
    ...
    2,ADD32ri8_DB,,WriteALU,1.00
    2,ADD32ri_DB,,WriteALU,1.01
    2,ADD32rr,,WriteALU,1.01
    2,ADD32rr_DB,,WriteALU,1.00
    2,ADD32rr_REV,,WriteALU,1.00
    2,ADD64i32,,WriteALU,1.01
    2,ADD64ri32,,WriteALU,1.01
    2,MOVSX64rr32,,BSWAP32r_BSWAP64r_MOVSX64rr32,1.00
    2,VPADDQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.02
    2,VPSUBQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.01
    2,ADD64ri8,,WriteALU,1.00
    2,SETBr,,WriteSETCC,1.01
    ...

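A clusters file in this format can be post-processed with a few lines of Python. The sketch below groups rows by cluster id using only the standard library. It treats every column after sched_class as a measured value, which is an assumption based on the sample rows above; group_clusters is a hypothetical name.

```python
import csv
import io
from collections import defaultdict


def group_clusters(csv_text: str) -> dict:
    """Map each cluster id to its (opcode, sched_class, values) rows."""
    clusters = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines, the header, and anything too short to be a row.
        if len(row) < 4 or row[0] == "cluster_id":
            continue
        cluster_id, opcode, _config, sched_class, *values = row
        measured = [float(v) for v in values]
        clusters[cluster_id].append((opcode, sched_class, measured))
    return dict(clusters)


sample = """cluster_id,opcode_name,config,sched_class
2,ADD32rr,,WriteALU,1.01
2,SETBr,,WriteSETCC,1.01
"""
for cluster_id, rows in group_clusters(sample).items():
    classes = sorted({sched_class for _, sched_class, _ in rows})
    print(cluster_id, classes)
```

A cluster whose rows carry several different scheduling classes, as in the sample, is the kind of grouping the inconsistency analysis examines.
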
llvm-exegesis will also analyze the clusters to point out
inconsistencies in the scheduling information. The output is an HTML
file. For example, /tmp/inconsistencies.html will contain messages like
the following: [image]

Note that the scheduling class names are resolved only when
llvm-exegesis is compiled in debug mode; otherwise only the class id is
shown. This does not invalidate any of the analysis results.

-help
    Print a summary of command line options.

-opcode-index=<LLVM opcode index>
    Specify the opcode to measure, by index. Specifying -1 will result
    in measuring every existing opcode. See example 1 for details.
    Either opcode-index, opcode-name or snippets-file must be set.

-opcode-name=<opcode name 1>,<opcode name 2>,...
    Specify the opcode to measure, by name. Several opcodes can be
    specified as a comma-separated list. See example 1 for details.
    Either opcode-index, opcode-name or snippets-file must be set.

-snippets-file=<filename>
    Specify the custom code snippet to measure. See example 2 for
    details. Either opcode-index, opcode-name or snippets-file must be
    set.

-mode=[latency|uops|inverse_throughput|analysis]
    Specify the run mode. Note that some modes have additional
    requirements and options.

    latency mode can make use of either RDTSC or LBR. latency[LBR] is
    only available on X86 (at least Skylake). To run in latency[LBR]
    mode, a positive value must be specified for x86-lbr-sample-period
    and --repetition-mode=loop must be used.

    In analysis mode, you also need to specify at least one of
    -analysis-clusters-output-file= and
    -analysis-inconsistencies-output-file=.

--benchmark-phase=[prepare-snippet|prepare-and-assemble-snippet|assemble-measured-code|measure]
    By default, when -mode= is specified, the generated snippet is
    executed and measured. This requires running on the hardware for
    which the snippet was generated, and that hardware must support
    performance measurements. However, it is possible to stop at an
    earlier stage. The choices are:

    • prepare-snippet: Only generate the minimal instruction sequence.
    • prepare-and-assemble-snippet: Same as prepare-snippet, but also
      dump an excerpt of the sequence (hex encoded).
    • assemble-measured-code: Same as prepare-and-assemble-snippet, but
      also create the full sequence, which can be dumped to a file
      using --dump-object-to-disk.
    • measure: Same as assemble-measured-code, but also run the
      measurement.

-x86-lbr-sample-period=<nBranches/sample>
    Specify the LBR sampling period, i.e., how many branches occur
    before a sample is taken. When a positive value is specified for
    this option and the mode is latency, LBRs are used for measuring.
    When choosing the sampling period, a small value is preferred, but
    throttling could occur if sampling is too frequent. A prime number
    should be used to avoid consistently skipping certain blocks.

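Since a prime sampling period is recommended, a small helper can round a candidate period up to the next prime. This Python sketch uses plain trial division, which is adequate for the small values involved. The function names are hypothetical, and this page does not recommend any particular starting value.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, sufficient for small periods."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def next_prime(n: int) -> int:
    """Return the smallest prime that is greater than or equal to n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


# Round an arbitrary starting point up to a prime sampling period.
period = next_prime(200)
print(f"-x86-lbr-sample-period={period}")
```
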
-x86-disable-upper-sse-registers
    Using the upper xmm registers (xmm8-xmm15) forces a longer
    instruction encoding which may put greater pressure on the frontend
    fetch and decode stages, potentially reducing the rate that
    instructions are dispatched to the backend, particularly on older
    hardware. Comparing baseline results with this mode enabled can
    help determine the effects of the frontend and can be used to
    improve latency and throughput estimates.

-repetition-mode=[duplicate|loop|min]
    Specify the repetition mode. duplicate creates a large,
    straight-line basic block with num-repetitions instructions
    (repeating the snippet num-repetitions/snippet size times). loop
    optionally duplicates the snippet until the loop body contains at
    least loop-body-size instructions, and then wraps the result in a
    loop that executes num-repetitions instructions (thus, again,
    repeating the snippet num-repetitions/snippet size times). The loop
    mode, especially with loop unrolling, tends to better hide the
    effects of the CPU frontend on architectures that cache decoded
    instructions, but consumes a register for counting iterations. If
    performing an analysis over many opcodes, it may be best to use the
    min mode instead, which runs each of the other modes and reports
    the minimal measured result.

-num-repetitions=<Number of repetitions>
    Specify the target number of executed instructions. Note that the
    actual repetition count of the snippet will be
    num-repetitions/snippet size. Higher values lead to more accurate
    measurements but lengthen the benchmark.

-loop-body-size=<Preferred loop body size>
    Only effective for -repetition-mode=[loop|min]. Instead of looping
    over the snippet directly, first duplicate it so that the loop body
    contains at least this many instructions. This potentially results
    in the loop body being cached in the CPU Op Cache / Loop Cache,
    which may have higher throughput than the CPU decoders.

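The relationship between num-repetitions, the snippet size, and loop-body-size can be worked through numerically. The Python sketch below follows the description above; the rounding (always rounding up) is an assumption, since this page does not say how non-integer ratios are handled, and both function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def duplicate_copies(num_repetitions: int, snippet_size: int) -> int:
    """Snippet copies in the straight-line block of duplicate mode."""
    return ceil_div(num_repetitions, snippet_size)


def loop_shape(num_repetitions: int, snippet_size: int, loop_body_size: int = 0):
    """Return (snippet copies per loop body, loop iterations) for loop mode."""
    # Duplicate the snippet until the body holds at least loop_body_size
    # instructions; with no preference, the body is a single copy.
    copies_per_body = max(1, ceil_div(loop_body_size, snippet_size))
    body_instructions = copies_per_body * snippet_size
    iterations = ceil_div(num_repetitions, body_instructions)
    return copies_per_body, iterations


# A 3-instruction snippet with 10000 target instructions.
print(duplicate_copies(10000, 3))
print(loop_shape(10000, 3, loop_body_size=100))
```

With these assumptions, a 100-instruction preferred body holds 34 copies of a 3-instruction snippet (102 instructions), and 99 iterations are needed to reach the 10000-instruction target.
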
-max-configs-per-opcode=<value>
    Specify the maximum number of configurations that can be generated
    for each opcode. By default this is 1, meaning that a single
    measurement is assumed to be enough to characterize an opcode. This
    might not be true of all instructions: for example, the performance
    characteristics of the LEA instruction on X86 depend on the values
    of the assigned registers and immediates. Setting
    -max-configs-per-opcode to a value larger than 1 allows
    llvm-exegesis to explore more configurations to discover whether
    some register or immediate assignments lead to different
    performance characteristics.

-benchmarks-file=</path/to/file>
    File to read (analysis mode) or write
    (latency/uops/inverse_throughput modes) benchmark results. "-" uses
    stdin/stdout.

-analysis-clusters-output-file=</path/to/file>
    If provided, write the analysis clusters as CSV to this file. "-"
    prints to stdout. By default, this analysis is not run.

-analysis-inconsistencies-output-file=</path/to/file>
    If non-empty, write inconsistencies found during analysis to this
    file. "-" prints to stdout. By default, this analysis is not run.

-analysis-filter=[all|reg-only|mem-only]
    By default, all benchmark results are analysed, but sometimes it
    may be useful to only look at those that do not involve memory, or
    vice versa. This option allows you to keep all benchmarks, to
    filter out (ignore) all the ones that involve memory (i.e., involve
    instructions that may read or write to memory), or, conversely, to
    keep only such benchmarks.

-analysis-clustering=[dbscan,naive]
    Specify the clustering algorithm to use. By default, DBSCAN is
    used. The naive clustering algorithm is better suited to further
    work on the -analysis-inconsistencies-output-file= output: it
    creates one cluster per opcode and checks that the cluster is
    stable (all points are neighbours).

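The stability check performed by the naive algorithm (one cluster per opcode, all points neighbours) can be sketched in Python as follows. This is an illustration only: it assumes that two points are neighbours when their Euclidean distance is at most epsilon, which this page does not specify, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def is_stable(points, epsilon: float) -> bool:
    """True if every pair of measurement points lies within epsilon."""
    return all(math.dist(a, b) <= epsilon for a, b in combinations(points, 2))


def naive_clusters(benchmarks, epsilon: float) -> dict:
    """Group (opcode, point) pairs by opcode and flag unstable groups."""
    by_opcode = defaultdict(list)
    for opcode, point in benchmarks:
        by_opcode[opcode].append(point)
    return {op: (pts, is_stable(pts, epsilon)) for op, pts in by_opcode.items()}


# Hypothetical measurements: two configurations per opcode.
runs = [("ADD64rr", [1.00]), ("ADD64rr", [1.01]),
        ("LEA64r", [1.00]), ("LEA64r", [3.00])]
for opcode, (points, stable) in naive_clusters(runs, epsilon=0.1).items():
    print(opcode, "stable" if stable else "unstable")
```
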
-analysis-numpoints=<dbscan numPoints parameter>
    Specify the numPoints parameter to be used for DBSCAN clustering
    (analysis mode, DBSCAN only).

-analysis-clustering-epsilon=<dbscan epsilon parameter>
    Specify the epsilon parameter used for clustering of benchmark
    points (analysis mode).

-analysis-inconsistency-epsilon=<epsilon>
    Specify the epsilon parameter used to detect when a cluster differs
    from the LLVM schedule profile values (analysis mode).

-analysis-display-unstable-clusters
    If there is more than one benchmark for an opcode, those benchmarks
    may not end up in the same cluster if their measured performance
    characteristics differ. By default, all such opcodes are filtered
    out. This flag instead shows only such unstable opcodes.

-ignore-invalid-sched-class=false
    If set, ignore instructions that do not have a sched class (class
    idx = 0).

-mtriple=<triple name>
    Target triple. See -version for available targets.

-mcpu=<cpu name>
    If set, measure the CPU characteristics using the counters for this
    CPU. This is useful when creating new sched models (the host CPU is
    unknown to LLVM). (-mcpu=help for details)

--analysis-override-benchmark-triple-and-cpu
    By default, llvm-exegesis will analyze the benchmarks for the
    triple/CPU they were measured for, but if you want to analyze them
    for some other combination (specified via -mtriple/-mcpu), you can
    pass this flag.

--dump-object-to-disk=true
    If set, llvm-exegesis will dump the generated code to a temporary
    file to enable code inspection. Disabled by default.

--use-dummy-perf-counters
    If set, llvm-exegesis will not read any real performance counters
    and will return a dummy value instead. This can be used to ensure a
    snippet doesn't crash when hardware performance counters are
    unavailable and for debugging llvm-exegesis itself.

--execution-mode=[inprocess,subprocess]
    This option specifies what execution mode to use. The inprocess
    execution mode is the default. The subprocess execution mode allows
    for additional features such as memory annotations but is currently
    restricted to X86-64 on Linux.

llvm-exegesis returns 0 on success. Otherwise, an error message is
printed to standard error, and the tool returns a non-zero value.

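A script that wraps the tool can rely on this exit-status convention. The Python sketch below runs an arbitrary command, returns its standard output on success, and raises with the captured standard error otherwise; run_tool is a hypothetical helper and works for any command that follows the same convention.

```python
import subprocess


def run_tool(argv):
    """Run a command and return its stdout, raising on a non-zero exit."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    if completed.returncode != 0:
        # On failure the error message is expected on standard error.
        message = completed.stderr.strip()
        raise RuntimeError(message or f"exit status {completed.returncode}")
    return completed.stdout
```

For example, `run_tool(["llvm-exegesis", "-mode=latency", "-opcode-name=ADD64rr"])` would return the YAML document as a string on a machine where the tool is installed and the benchmark succeeds.
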
Maintained by the LLVM Team (https://llvm.org/).

2003-2023, LLVM Project

17                              2023-11-28                LLVM-EXEGESIS(1)