LLVM-EXEGESIS(1) LLVM LLVM-EXEGESIS(1)

NAME
llvm-exegesis - LLVM Machine Instruction Benchmark

SYNOPSIS
llvm-exegesis [options]

DESCRIPTION
llvm-exegesis is a benchmarking tool that uses information available in
LLVM to measure host machine instruction characteristics like latency,
throughput, or port decomposition.

Given an LLVM opcode name and a benchmarking mode, llvm-exegesis
generates a code snippet that makes execution as serial (resp. as
parallel) as possible so that we can measure the latency (resp. inverse
throughput/uop decomposition) of the instruction. The code snippet is
jitted and executed on the host subtarget. The time taken (resp.
resource usage) is measured using hardware performance counters. The
result is printed out as YAML to the standard output.

The main goal of this tool is to automatically (in)validate LLVM's
TableGen scheduling models. To that end, we also provide analysis of
the results.

llvm-exegesis can also benchmark arbitrary user-provided code snippets.

EXAMPLE 1: BENCHMARKING INSTRUCTIONS
Assume you have an X86-64 machine. To measure the latency of a single
instruction, run:

    $ llvm-exegesis -mode=latency -opcode-name=ADD64rr

Measuring the uop decomposition or inverse throughput of an instruction
works similarly:

    $ llvm-exegesis -mode=uops -opcode-name=ADD64rr
    $ llvm-exegesis -mode=inverse_throughput -opcode-name=ADD64rr

The output is a YAML document (the default is to write to stdout, but
you can redirect the output to a file using -benchmarks-file):

    ---
    key:
      opcode_name: ADD64rr
      mode: latency
      config: ''
    cpu_name: haswell
    llvm_triple: x86_64-unknown-linux-gnu
    num_repetitions: 10000
    measurements:
      - { key: latency, value: 1.0058, debug_string: '' }
    error: ''
    info: 'explicit self cycles, selecting one aliasing configuration.
    Snippet:
    ADD64rr R8, R8, R10
    '
    ...

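When post-processing such results in a script, the measurement entries
can be pulled out of the YAML text. The following is a minimal sketch,
not part of llvm-exegesis: the function name extract_measurements is
hypothetical, and it assumes measurements appear in the one-line
"- { key: ..., value: ... }" form shown above. A real YAML parser would
be more robust.

```python
import re

# Matches entries in the flow-mapping form shown in the example output,
# e.g. "- { key: latency, value: 1.0058, debug_string: '' }".
# Assumption: every measurement is written on a single line in this form.
MEASUREMENT_RE = re.compile(
    r"-\s*\{\s*key:\s*(?P<key>[^,]+?)\s*,\s*value:\s*(?P<value>[-+0-9.eE]+)"
)


def extract_measurements(yaml_text):
    """Return a list of (key, value) pairs found in the benchmark output."""
    return [
        (match.group("key"), float(match.group("value")))
        for match in MEASUREMENT_RE.finditer(yaml_text)
    ]
```

For the document above, this would yield a single pair with the key
"latency".
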
To measure the latency of all instructions for the host architecture,
run:

    $ llvm-exegesis -mode=latency -opcode-index=-1

EXAMPLE 2: BENCHMARKING A CUSTOM CODE SNIPPET
To measure the latency/uops of a custom piece of code, you can specify
the snippets-file option (- reads from standard input):

    $ echo "vzeroupper" | llvm-exegesis -mode=uops -snippets-file=-

Real-life code snippets typically depend on registers or memory.
llvm-exegesis checks the liveness of registers (i.e. any register use
has a corresponding def or is a "live in"). If your code depends on the
value of some registers, you have two options:

• Mark the register as requiring a definition. llvm-exegesis will
  automatically assign a value to the register. This can be done using
  the directive LLVM-EXEGESIS-DEFREG <reg name> <hex_value>, where
  <hex_value> is a bit pattern used to fill <reg name>. If <hex_value>
  is smaller than the register width, it will be sign-extended.

• Mark the register as a "live in". llvm-exegesis will benchmark using
  whatever value was in this register on entry. This can be done using
  the directive LLVM-EXEGESIS-LIVEIN <reg name>.

For example, the following code snippet depends on the values of XMM1
(which will be set by the tool) and the memory buffer passed in RDI
(live in).

    # LLVM-EXEGESIS-LIVEIN RDI
    # LLVM-EXEGESIS-DEFREG XMM1 42
    vmulps (%rdi), %xmm1, %xmm2
    vhaddps %xmm2, %xmm2, %xmm3
    addq $0x10, %rdi

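If snippets are generated by a script, the directive comments can be
emitted alongside the assembly. The sketch below is only an
illustration: render_snippet and its parameters are hypothetical names,
and it simply writes the two directives in the comment form used in the
example above.

```python
def render_snippet(asm_lines, live_ins=(), def_regs=None):
    """Return snippet text: directive comments followed by the assembly.

    live_ins: register names to mark with LLVM-EXEGESIS-LIVEIN.
    def_regs: mapping of register name to hex value string, emitted as
              LLVM-EXEGESIS-DEFREG directives.
    """
    def_regs = def_regs or {}
    lines = [f"# LLVM-EXEGESIS-LIVEIN {reg}" for reg in live_ins]
    lines += [
        f"# LLVM-EXEGESIS-DEFREG {reg} {hex_value}"
        for reg, hex_value in def_regs.items()
    ]
    lines += list(asm_lines)
    return "\n".join(lines) + "\n"
```

The resulting text could then be written to a file and passed with
-snippets-file, or piped to the tool with -snippets-file=-.
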
EXAMPLE 3: ANALYSIS
Assuming you have a set of benchmarked instructions (either latency or
uops) as YAML in file /tmp/benchmarks.yaml, you can analyze the results
using the following command:

    $ llvm-exegesis -mode=analysis \
        -benchmarks-file=/tmp/benchmarks.yaml \
        -analysis-clusters-output-file=/tmp/clusters.csv \
        -analysis-inconsistencies-output-file=/tmp/inconsistencies.html

This will group the instructions into clusters with the same
performance characteristics. The clusters will be written out to
/tmp/clusters.csv in the following format:

    cluster_id,opcode_name,config,sched_class
    ...
    2,ADD32ri8_DB,,WriteALU,1.00
    2,ADD32ri_DB,,WriteALU,1.01
    2,ADD32rr,,WriteALU,1.01
    2,ADD32rr_DB,,WriteALU,1.00
    2,ADD32rr_REV,,WriteALU,1.00
    2,ADD64i32,,WriteALU,1.01
    2,ADD64ri32,,WriteALU,1.01
    2,MOVSX64rr32,,BSWAP32r_BSWAP64r_MOVSX64rr32,1.00
    2,VPADDQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.02
    2,VPSUBQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.01
    2,ADD64ri8,,WriteALU,1.00
    2,SETBr,,WriteSETCC,1.01
    ...

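A cluster that spans several scheduling classes, as cluster 2 does
above, is easy to spot once the CSV is loaded. The following sketch is
an illustration only: sched_classes_by_cluster is a hypothetical helper,
and it assumes the first four columns are the ones named in the header
line, ignoring any trailing column.

```python
import csv
import io
from collections import defaultdict


def sched_classes_by_cluster(csv_text):
    """Map each cluster id to the set of sched classes seen in it."""
    clusters = defaultdict(set)
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip the header line, blank lines, and "..." elision markers.
        if len(row) < 4 or row[0] == "cluster_id":
            continue
        cluster_id, _opcode_name, _config, sched_class = row[:4]
        clusters[cluster_id].add(sched_class)
    return dict(clusters)
```

Clusters whose set contains more than one class are candidates for a
closer look in the inconsistencies report.
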
llvm-exegesis will also analyze the clusters to point out
inconsistencies in the scheduling information. The output is an HTML
file. For example, /tmp/inconsistencies.html will contain messages like
the following: [image]

Note that the scheduling class names will be resolved only when
llvm-exegesis is compiled in debug mode, else only the class id will be
shown. This does not invalidate any of the analysis results though.

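When the analysis step is driven from a script, the command line from
this example can be assembled programmatically. The sketch below is an
assumption-laden illustration: analysis_argv is a hypothetical helper,
it only uses the flags shown in this example, and it enforces the
requirement that at least one of the two analysis output files is given.

```python
def analysis_argv(benchmarks_file, clusters_file=None, inconsistencies_file=None):
    """Build the argument list for an analysis-mode run."""
    if clusters_file is None and inconsistencies_file is None:
        # Analysis mode needs at least one of the two output files.
        raise ValueError("specify a clusters file, an inconsistencies file, or both")
    argv = [
        "llvm-exegesis",
        "-mode=analysis",
        f"-benchmarks-file={benchmarks_file}",
    ]
    if clusters_file is not None:
        argv.append(f"-analysis-clusters-output-file={clusters_file}")
    if inconsistencies_file is not None:
        argv.append(f"-analysis-inconsistencies-output-file={inconsistencies_file}")
    return argv
```

The returned list can be handed to subprocess.run() as is; building a
list rather than a shell string avoids quoting problems in file names.
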
OPTIONS
-help
    Print a summary of command line options.

-opcode-index=<LLVM opcode index>
    Specify the opcode to measure, by index. Specifying -1 will result
    in measuring every existing opcode. See example 1 for details.
    Either opcode-index, opcode-name or snippets-file must be set.

-opcode-name=<opcode name 1>,<opcode name 2>,...
    Specify the opcode to measure, by name. Several opcodes can be
    specified as a comma-separated list. See example 1 for details.
    Either opcode-index, opcode-name or snippets-file must be set.

-snippets-file=<filename>
    Specify the custom code snippet to measure. See example 2 for
    details. Either opcode-index, opcode-name or snippets-file must be
    set.

-mode=[latency|uops|inverse_throughput|analysis]
    Specify the run mode. Note that some modes have additional
    requirements and options.

    latency mode can make use of either RDTSC or LBR. latency[LBR] is
    only available on X86 (at least Skylake). To run in latency[LBR]
    mode, a positive value must be specified for
    -x86-lbr-sample-period, together with -repetition-mode=loop.

    In analysis mode, you also need to specify at least one of the
    -analysis-clusters-output-file= and
    -analysis-inconsistencies-output-file= options.

-x86-lbr-sample-period=<nBranches/sample>
    Specify the LBR sampling period, i.e. how many branches occur before
    we take a sample. When a positive value is specified for this option
    and the mode is latency, we will use LBRs for measuring. When
    choosing the sampling period, a small value is preferred, but
    throttling could occur if the sampling is too frequent. A prime
    number should be used to avoid consistently skipping certain blocks.

-repetition-mode=[duplicate|loop|min]
    Specify the repetition mode. duplicate will create a large, straight
    line basic block with num-repetitions instructions (repeating the
    snippet num-repetitions/snippet size times). loop will, optionally,
    duplicate the snippet until the loop body contains at least
    loop-body-size instructions, and then wrap the result in a loop
    which will execute num-repetitions instructions (thus, again,
    repeating the snippet num-repetitions/snippet size times). The loop
    mode, especially with loop unrolling, tends to better hide the
    effects of the CPU frontend on architectures that cache decoded
    instructions, but consumes a register for counting iterations. If
    performing an analysis over many opcodes, it may be best to instead
    use the min mode, which will run each of the other modes and produce
    the minimal measured result.

-num-repetitions=<Number of repetitions>
    Specify the target number of executed instructions. Note that the
    actual repetition count of the snippet will be
    num-repetitions/snippet size. Higher values lead to more accurate
    measurements but lengthen the benchmark.

-loop-body-size=<Preferred loop body size>
    Only effective for -repetition-mode=[loop|min]. Instead of looping
    over the snippet directly, first duplicate it so that the loop body
    contains at least this many instructions. This potentially results
    in the loop body being cached in the CPU Op Cache / Loop Cache,
    which may have higher throughput than the CPU decoders.

-max-configs-per-opcode=<value>
    Specify the maximum number of configurations that can be generated
    for each opcode. By default this is 1, meaning that we assume that a
    single measurement is enough to characterize an opcode. This might
    not be true of all instructions: for example, the performance
    characteristics of the LEA instruction on X86 depend on the value of
    assigned registers and immediates. Setting -max-configs-per-opcode
    to a value larger than 1 allows llvm-exegesis to explore more
    configurations to discover if some register or immediate assignments
    lead to different performance characteristics.

-benchmarks-file=</path/to/file>
    File to read (analysis mode) or write
    (latency/uops/inverse_throughput modes) benchmark results. "-" uses
    stdin/stdout.

-analysis-clusters-output-file=</path/to/file>
    If provided, write the analysis clusters as CSV to this file. "-"
    prints to stdout. By default, this analysis is not run.

-analysis-inconsistencies-output-file=</path/to/file>
    If non-empty, write inconsistencies found during analysis to this
    file. "-" prints to stdout. By default, this analysis is not run.

-analysis-clustering=[dbscan,naive]
    Specify the clustering algorithm to use. By default DBSCAN will be
    used. The naive clustering algorithm is better suited to further
    work on the -analysis-inconsistencies-output-file= output: it
    creates one cluster per opcode and checks that the cluster is stable
    (all points are neighbours).

-analysis-numpoints=<dbscan numPoints parameter>
    Specify the numPoints parameter to be used for DBSCAN clustering
    (analysis mode, DBSCAN only).

-analysis-clustering-epsilon=<dbscan epsilon parameter>
    Specify the epsilon parameter used for clustering of benchmark
    points (analysis mode).

-analysis-inconsistency-epsilon=<epsilon>
    Specify the epsilon parameter used to detect when a cluster differs
    from the LLVM schedule profile values (analysis mode).

-analysis-display-unstable-clusters
    If there is more than one benchmark for an opcode, said benchmarks
    may end up not being clustered into the same cluster if the measured
    performance characteristics are different. By default all such
    opcodes are filtered out. This flag will instead show only such
    unstable opcodes.

-ignore-invalid-sched-class=false
    If set, ignore instructions that do not have a sched class (class
    idx = 0).

-mcpu=<cpu name>
    If set, measure the cpu characteristics using the counters for this
    CPU. This is useful when creating new sched models (the host CPU is
    unknown to LLVM).

--dump-object-to-disk=true
    By default, llvm-exegesis will dump the generated code to a
    temporary file to enable code inspection. You may disable it to
    speed up the execution and save disk space.

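The relationship between -num-repetitions, the snippet size and
-loop-body-size described above can be worked through with a few lines
of arithmetic. This is a sketch of the description only, not the tool's
actual implementation: the function names are hypothetical, and rounding
up whenever the division is not exact is an assumption.

```python
def ceil_div(numerator, denominator):
    """Integer division rounded up (assumed rounding, see lead-in)."""
    return (numerator + denominator - 1) // denominator


def duplicate_copies(num_repetitions, snippet_size):
    """Copies of the snippet in the straight-line block (duplicate mode)."""
    return ceil_div(num_repetitions, snippet_size)


def loop_plan(num_repetitions, snippet_size, loop_body_size=0):
    """Return (copies_per_iteration, iterations) for loop mode.

    The snippet is duplicated until the loop body holds at least
    loop_body_size instructions, then the loop runs until about
    num_repetitions instructions have executed.
    """
    copies = max(1, ceil_div(loop_body_size, snippet_size))
    body_instructions = copies * snippet_size
    iterations = ceil_div(num_repetitions, body_instructions)
    return copies, iterations
```

For a 3-instruction snippet with 10000 repetitions and a preferred loop
body size of 8, loop_plan gives 3 copies per iteration (a 9-instruction
body) and 1112 iterations under these assumptions.
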
EXIT STATUS
llvm-exegesis returns 0 on success. Otherwise, an error message is
printed to standard error, and the tool returns a non-zero value.

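A wrapper script can rely on this convention to separate results from
failures. The sketch below is a generic illustration using Python's
subprocess module; run_checked is a hypothetical helper, and nothing in
it is specific to llvm-exegesis beyond the exit status convention
described above.

```python
import subprocess


def run_checked(argv):
    """Run argv; return its stdout on exit status 0, raise otherwise."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        # On failure, the error message is expected on standard error.
        raise RuntimeError(
            f"{argv[0]} failed with status {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout
```

For example, run_checked(["llvm-exegesis", "-mode=latency",
"-opcode-name=ADD64rr"]) would return the YAML output on success and
raise with the tool's error message otherwise.
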
AUTHOR
Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT
2003-2022, LLVM Project

13 2022-07-21 LLVM-EXEGESIS(1)