LLVM-EXEGESIS(1)                     LLVM                     LLVM-EXEGESIS(1)

NAME
llvm-exegesis - LLVM Machine Instruction Benchmark

SYNOPSIS
llvm-exegesis [options]

DESCRIPTION
llvm-exegesis is a benchmarking tool that uses information available in
LLVM to measure host machine instruction characteristics like latency,
throughput, or port decomposition.

Given an LLVM opcode name and a benchmarking mode, llvm-exegesis
generates a code snippet that makes execution as serial (resp. as
parallel) as possible so that we can measure the latency (resp. inverse
throughput/uop decomposition) of the instruction. The code snippet is
jitted and executed on the host subtarget. The time taken (resp.
resource usage) is measured using hardware performance counters. The
result is printed out as YAML to the standard output.

The main goal of this tool is to automatically (in)validate LLVM's
TableGen scheduling models. To that end, we also provide analysis of
the results.

llvm-exegesis can also benchmark arbitrary user-provided code snippets.

EXAMPLE 1: BENCHMARKING INSTRUCTIONS
Assume you have an X86-64 machine. To measure the latency of a single
instruction, run:

$ llvm-exegesis -mode=latency -opcode-name=ADD64rr

Measuring the uop decomposition or inverse throughput of an instruction
works similarly:

$ llvm-exegesis -mode=uops -opcode-name=ADD64rr
$ llvm-exegesis -mode=inverse_throughput -opcode-name=ADD64rr

The output is a YAML document (the default is to write to stdout, but
you can redirect the output to a file using -benchmarks-file):

---
key:
  opcode_name: ADD64rr
  mode: latency
  config: ''
cpu_name: haswell
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 1.0058, debug_string: '' }
error: ''
info: 'explicit self cycles, selecting one aliasing configuration.
Snippet:
ADD64rr R8, R8, R10
'
...
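
If the report is post-processed by a script, the measured values can be pulled out of the measurements list. The following is only a minimal sketch in Python: it assumes the entries keep the flow-style layout shown above, it uses a regular expression rather than a real YAML parser, and the helper name extract_measurements is made up for this illustration. The multi-line info field is left out of the sample text for brevity.

```python
import re

# Report text copied from the example above (the "info" field is omitted).
REPORT = """\
---
key:
  opcode_name: ADD64rr
  mode: latency
  config: ''
cpu_name: haswell
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 1.0058, debug_string: '' }
error: ''
...
"""

# Matches one flow-style entry such as
#   - { key: latency, value: 1.0058, debug_string: '' }
# This layout is an assumption based on the sample report only.
MEASUREMENT_RE = re.compile(
    r"-\s*\{\s*key:\s*(?P<key>[^,]+?)\s*,\s*value:\s*(?P<value>[-+0-9.eE]+)"
)


def extract_measurements(report):
    """Return {measurement key: value} for every entry found in the report."""
    return {m["key"]: float(m["value"]) for m in MEASUREMENT_RE.finditer(report)}


measurements = extract_measurements(REPORT)
print(measurements)
```

For anything beyond a quick check, a proper YAML library is the more robust choice.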

To measure the latency of all instructions for the host architecture,
run:

$ llvm-exegesis -mode=latency -opcode-index=-1

EXAMPLE 2: BENCHMARKING A CUSTOM CODE SNIPPET
To measure the latency/uops of a custom piece of code, you can specify
the snippets-file option (- reads from standard input).

$ echo "vzeroupper" | llvm-exegesis -mode=uops -snippets-file=-

Real-life code snippets typically depend on registers or memory.
llvm-exegesis checks the liveness of registers (i.e. any register use
has a corresponding def or is a "live in"). If your code depends on the
value of some registers, you have two options:

• Mark the register as requiring a definition. llvm-exegesis will
  automatically assign a value to the register. This can be done using
  the directive LLVM-EXEGESIS-DEFREG <reg name> <hex_value>, where
  <hex_value> is a bit pattern used to fill <reg name>. If <hex_value>
  is smaller than the register width, it will be sign-extended.

• Mark the register as a "live in". llvm-exegesis will benchmark using
  whatever value was in this register on entry. This can be done using
  the directive LLVM-EXEGESIS-LIVEIN <reg name>.

For example, the following code snippet depends on the values of XMM1
(which will be set by the tool) and the memory buffer passed in RDI
(live in).

# LLVM-EXEGESIS-LIVEIN RDI
# LLVM-EXEGESIS-DEFREG XMM1 42
vmulps (%rdi), %xmm1, %xmm2
vhaddps %xmm2, %xmm2, %xmm3
addq $0x10, %rdi
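
When snippets are generated or checked by a script, it can help to list which registers a snippet declares before handing it to the tool. The Python sketch below is a hypothetical helper (scan_directives is not part of llvm-exegesis) and does not claim to mirror how the tool itself parses snippets. It only recognizes the two directives described above when they appear in "#" comment lines.

```python
# Snippet copied from the example above.
SNIPPET = """\
# LLVM-EXEGESIS-LIVEIN RDI
# LLVM-EXEGESIS-DEFREG XMM1 42
vmulps (%rdi), %xmm1, %xmm2
vhaddps %xmm2, %xmm2, %xmm3
addq $0x10, %rdi
"""


def scan_directives(snippet):
    """Collect LIVEIN registers and DEFREG (register -> hex string) pairs."""
    live_ins = []
    def_regs = {}
    for line in snippet.splitlines():
        stripped = line.strip()
        if not stripped.startswith("#"):
            continue  # directives live in comment lines only (assumption)
        words = stripped.lstrip("#").split()
        if len(words) == 2 and words[0] == "LLVM-EXEGESIS-LIVEIN":
            live_ins.append(words[1])
        elif len(words) == 3 and words[0] == "LLVM-EXEGESIS-DEFREG":
            # The value is kept as text: it is a hex bit pattern, not a decimal.
            def_regs[words[1]] = words[2]
    return live_ins, def_regs


live_ins, def_regs = scan_directives(SNIPPET)
print(live_ins, def_regs)
```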

EXAMPLE 3: ANALYSIS
Assuming you have a set of benchmarked instructions (either latency or
uops) as YAML in file /tmp/benchmarks.yaml, you can analyze the results
using the following command:

$ llvm-exegesis -mode=analysis \
      -benchmarks-file=/tmp/benchmarks.yaml \
      -analysis-clusters-output-file=/tmp/clusters.csv \
      -analysis-inconsistencies-output-file=/tmp/inconsistencies.html

This will group the instructions into clusters with the same
performance characteristics. The clusters will be written out to
/tmp/clusters.csv in the following format (the last field of each row
is the measured value):

cluster_id,opcode_name,config,sched_class
...
2,ADD32ri8_DB,,WriteALU,1.00
2,ADD32ri_DB,,WriteALU,1.01
2,ADD32rr,,WriteALU,1.01
2,ADD32rr_DB,,WriteALU,1.00
2,ADD32rr_REV,,WriteALU,1.00
2,ADD64i32,,WriteALU,1.01
2,ADD64ri32,,WriteALU,1.01
2,MOVSX64rr32,,BSWAP32r_BSWAP64r_MOVSX64rr32,1.00
2,VPADDQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.02
2,VPSUBQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.01
2,ADD64ri8,,WriteALU,1.00
2,SETBr,,WriteSETCC,1.01
...
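
The CSV file is easy to post-process. As a rough sketch, the Python code below groups rows by cluster id and lists the scheduling classes found in each cluster. It assumes the first four fields are in the order given by the header line above and ignores any trailing fields. The helper name sched_classes_by_cluster is invented for this example, and the sample rows are a subset of the listing above.

```python
import csv
import io
from collections import defaultdict

# A few rows taken from the listing above (header and "..." lines left out).
CLUSTERS_CSV = """\
2,ADD32ri8_DB,,WriteALU,1.00
2,ADD32rr,,WriteALU,1.01
2,MOVSX64rr32,,BSWAP32r_BSWAP64r_MOVSX64rr32,1.00
2,SETBr,,WriteSETCC,1.01
"""


def sched_classes_by_cluster(csv_text):
    """Map each cluster id to the set of sched classes seen in it."""
    classes = defaultdict(set)
    for row in csv.reader(io.StringIO(csv_text)):
        # Only the first four fields are used; anything after is ignored.
        cluster_id, _opcode, _config, sched_class = row[:4]
        classes[cluster_id].add(sched_class)
    return dict(classes)


by_cluster = sched_classes_by_cluster(CLUSTERS_CSV)
for cluster_id, classes in by_cluster.items():
    print(cluster_id, sorted(classes))
```

A cluster that spans several scheduling classes is not necessarily a problem, since distinct classes may be modeled with the same values.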

llvm-exegesis will also analyze the clusters to point out
inconsistencies in the scheduling information. The output is an HTML
file. For example, /tmp/inconsistencies.html will contain messages like
the following: [image]

Note that the scheduling class names will be resolved only when
llvm-exegesis is compiled in debug mode; otherwise only the class id
will be shown. This does not invalidate any of the analysis results,
though.

OPTIONS
-help
    Print a summary of command line options.

-opcode-index=<LLVM opcode index>
    Specify the opcode to measure, by index. Specifying -1 will result
    in measuring every existing opcode. See example 1 for details.
    Either opcode-index, opcode-name or snippets-file must be set.

-opcode-name=<opcode name 1>,<opcode name 2>,...
    Specify the opcode to measure, by name. Several opcodes can be
    specified as a comma-separated list. See example 1 for details.
    Either opcode-index, opcode-name or snippets-file must be set.

-snippets-file=<filename>
    Specify the custom code snippet to measure. See example 2 for
    details. Either opcode-index, opcode-name or snippets-file must be
    set.

-mode=[latency|uops|inverse_throughput|analysis]
    Specify the run mode. Note that some modes have additional
    requirements and options.

    latency mode can make use of either RDTSC or LBR. latency[LBR] is
    only available on X86 (at least Skylake). To run in latency[LBR]
    mode, a positive value must be specified for -x86-lbr-sample-period
    and -repetition-mode=loop must be used.

    In analysis mode, you also need to specify at least one of
    -analysis-clusters-output-file= and
    -analysis-inconsistencies-output-file=.

-x86-lbr-sample-period=<nBranches/sample>
    Specify the LBR sampling period, i.e. how many branches are executed
    before we take a sample. When a positive value is specified for
    this option and the mode is latency, we will use LBRs for
    measuring. When choosing the sampling period, a small value is
    preferred, but throttling could occur if the sampling is too
    frequent. A prime number should be used to avoid consistently
    skipping certain blocks.
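
One simple way to follow the prime-number advice is to round a candidate period up to the next prime. The Python sketch below does that and assembles a command line from options documented on this page. The starting value 200 is an arbitrary placeholder, not a recommendation, and the helper names are made up for this example.

```python
def is_prime(n):
    """Trial-division primality test; fine for small sampling periods."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def next_prime_at_least(n):
    """Return the smallest prime that is greater than or equal to n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


# 200 is an arbitrary starting point chosen for illustration only.
period = next_prime_at_least(200)

argv = [
    "llvm-exegesis",
    "-mode=latency",
    "-opcode-name=ADD64rr",
    "-x86-lbr-sample-period={}".format(period),
    "-repetition-mode=loop",
]
print(" ".join(argv))
```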

-repetition-mode=[duplicate|loop|min]
    Specify the repetition mode. duplicate will create a large,
    straight-line basic block with num-repetitions copies of the
    snippet. loop will wrap the snippet in a loop which will be run
    num-repetitions times. The loop mode tends to better hide the
    effects of the CPU frontend on architectures that cache decoded
    instructions, but consumes a register for counting iterations. If
    performing an analysis over many opcodes, it may be best to instead
    use the min mode, which will run each of the other modes and
    produce the minimal measured result.
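
The idea behind the min mode can be pictured as taking the smallest value across the other repetition modes. The Python sketch below only illustrates that idea: the numbers are made up, and the function is not taken from llvm-exegesis.

```python
def min_over_modes(per_mode):
    """per_mode maps a repetition mode to its measured value.

    Returns the (mode, value) pair with the smallest value.
    """
    best_mode = min(per_mode, key=per_mode.get)
    return best_mode, per_mode[best_mode]


# Made-up measurements for one opcode, for illustration only.
sample = {"duplicate": 1.04, "loop": 1.01}
print(min_over_modes(sample))
```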

-num-repetitions=<Number of repetitions>
    Specify the number of repetitions of the asm snippet. Higher values
    lead to more accurate measurements but lengthen the benchmark.

-max-configs-per-opcode=<value>
    Specify the maximum number of configurations that can be generated
    for each opcode. By default this is 1, meaning that we assume that
    a single measurement is enough to characterize an opcode. This
    might not be true of all instructions: for example, the performance
    characteristics of the LEA instruction on X86 depend on the value
    of assigned registers and immediates. Setting
    -max-configs-per-opcode to a value larger than 1 allows
    llvm-exegesis to explore more configurations to discover if some
    register or immediate assignments lead to different performance
    characteristics.

-benchmarks-file=</path/to/file>
    File to read (analysis mode) or write
    (latency/uops/inverse_throughput modes) benchmark results. "-" uses
    stdin/stdout.

-analysis-clusters-output-file=</path/to/file>
    If provided, write the analysis clusters as CSV to this file. "-"
    prints to stdout. By default, this analysis is not run.

-analysis-inconsistencies-output-file=</path/to/file>
    If non-empty, write inconsistencies found during analysis to this
    file. "-" prints to stdout. By default, this analysis is not run.

-analysis-clustering=[dbscan|naive]
    Specify the clustering algorithm to use. By default DBSCAN will be
    used. The naive clustering algorithm is better suited for doing
    further work on the -analysis-inconsistencies-output-file= output:
    it will create one cluster per opcode, and check that the cluster
    is stable (all points are neighbours).
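
As a rough illustration of the naive strategy, the Python sketch below builds one cluster per opcode and calls it stable when every pair of its points lies within a given distance. It is a simplification written for this page, not the tool's implementation: points are single numbers, "neighbours" is assumed to mean "no further apart than epsilon", and the measurements are made up.

```python
from collections import defaultdict
from itertools import combinations


def naive_clusters(points, epsilon):
    """Group (opcode, value) points per opcode and flag stable clusters.

    Returns {opcode: (values, is_stable)}. A cluster is stable when all
    pairs of values differ by at most epsilon (assumed neighbour test).
    """
    by_opcode = defaultdict(list)
    for opcode, value in points:
        by_opcode[opcode].append(value)

    result = {}
    for opcode, values in by_opcode.items():
        stable = all(abs(a - b) <= epsilon for a, b in combinations(values, 2))
        result[opcode] = (values, stable)
    return result


# Made-up measurements: two benchmark points per opcode.
points = [
    ("ADD64rr", 1.00),
    ("ADD64rr", 1.01),
    ("LEA64r", 1.00),
    ("LEA64r", 3.02),
]
clusters = naive_clusters(points, epsilon=0.1)
for opcode, (values, stable) in clusters.items():
    print(opcode, values, "stable" if stable else "unstable")
```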

-analysis-numpoints=<dbscan numPoints parameter>
    Specify the numPoints parameter to be used for DBSCAN clustering
    (analysis mode, DBSCAN only).

-analysis-clustering-epsilon=<dbscan epsilon parameter>
    Specify the epsilon parameter used for clustering of benchmark
    points (analysis mode).

-analysis-inconsistency-epsilon=<epsilon>
    Specify the epsilon parameter used for detecting when a cluster
    differs from the LLVM schedule profile values (analysis mode).

-analysis-display-unstable-clusters
    If there is more than one benchmark for an opcode, said benchmarks
    may end up not being clustered into the same cluster if the
    measured performance characteristics are different. By default all
    such opcodes are filtered out. This flag will instead show only
    such unstable opcodes.

-ignore-invalid-sched-class=false
    If set, ignore instructions that do not have a sched class (class
    idx = 0).

-mcpu=<cpu name>
    If set, measure the CPU characteristics using the counters for this
    CPU. This is useful when creating new sched models (the host CPU is
    unknown to LLVM).

--dump-object-to-disk=true
    By default, llvm-exegesis will dump the generated code to a
    temporary file to enable code inspection. You may disable it to
    speed up the execution and save disk space.

EXIT STATUS
llvm-exegesis returns 0 on success. Otherwise, an error message is
printed to standard error, and the tool returns a non-zero value.

AUTHOR
Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT
2003-2023, LLVM Project

12                                2023-07-20                  LLVM-EXEGESIS(1)