llvm-exegesis-15(1)

LLVM-EXEGESIS(1)                     LLVM                     LLVM-EXEGESIS(1)

NAME

       llvm-exegesis - LLVM Machine Instruction Benchmark

SYNOPSIS

       llvm-exegesis [options]

DESCRIPTION

       llvm-exegesis is a benchmarking tool that uses information available in
       LLVM to measure host machine instruction characteristics such as
       latency, throughput, or port decomposition.

       Given an LLVM opcode name and a benchmarking mode, llvm-exegesis
       generates a code snippet that makes execution as serial (resp. as
       parallel) as possible so that we can measure the latency (resp. inverse
       throughput/uop decomposition) of the instruction. The code snippet is
       jitted and executed on the host subtarget. The time taken (resp.
       resource usage) is measured using hardware performance counters. The
       result is printed out as YAML to the standard output.

       The main goal of this tool is to automatically (in)validate LLVM's
       TableGen scheduling models. To that end, we also provide analysis of
       the results.

       llvm-exegesis can also benchmark arbitrary user-provided code snippets.

EXAMPLE 1: BENCHMARKING INSTRUCTIONS

       Assume you have an X86-64 machine. To measure the latency of a single
       instruction, run:

          $ llvm-exegesis -mode=latency -opcode-name=ADD64rr

       Measuring the uop decomposition or inverse throughput of an instruction
       works similarly:

          $ llvm-exegesis -mode=uops -opcode-name=ADD64rr
          $ llvm-exegesis -mode=inverse_throughput -opcode-name=ADD64rr

       The output is a YAML document (the default is to write to stdout, but
       you can redirect the output to a file using -benchmarks-file):

          ---
          key:
            opcode_name:     ADD64rr
            mode:            latency
            config:          ''
          cpu_name:        haswell
          llvm_triple:     x86_64-unknown-linux-gnu
          num_repetitions: 10000
          measurements:
            - { key: latency, value: 1.0058, debug_string: '' }
          error:           ''
          info:            'explicit self cycles, selecting one aliasing configuration.
          Snippet:
          ADD64rr R8, R8, R10
          '
          ...

       To measure the latency of all instructions for the host architecture,
       run:

          $ llvm-exegesis -mode=latency -opcode-index=-1
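
       When many such runs are scripted, it can help to assemble the command
       line in one place. The following is a minimal sketch, not part of
       llvm-exegesis: the helper name build_benchmark_command is hypothetical,
       and it only uses the -mode, -opcode-name and -benchmarks-file options
       described in this page.

```python
def build_benchmark_command(mode, opcode_names, benchmarks_file=None):
    """Return an llvm-exegesis argument list for a benchmarking run.

    Hypothetical helper: it only assembles options documented in this page.
    """
    if mode not in ("latency", "uops", "inverse_throughput"):
        raise ValueError(f"not a benchmarking mode: {mode}")
    if not opcode_names:
        raise ValueError("at least one opcode name is required")
    cmd = [
        "llvm-exegesis",
        f"-mode={mode}",
        # Several opcodes are passed as one comma-separated list.
        "-opcode-name=" + ",".join(opcode_names),
    ]
    if benchmarks_file is not None:
        cmd.append(f"-benchmarks-file={benchmarks_file}")
    return cmd


print(" ".join(build_benchmark_command("latency", ["ADD64rr"])))
```

       The returned list can be handed to subprocess.run() on a machine where
       llvm-exegesis is installed.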

EXAMPLE 2: BENCHMARKING A CUSTOM CODE SNIPPET

       To measure the latency/uops of a custom piece of code, you can specify
       the -snippets-file option (- reads from standard input):

          $ echo "vzeroupper" | llvm-exegesis -mode=uops -snippets-file=-

       Real-life code snippets typically depend on registers or memory.
       llvm-exegesis checks the liveness of registers (i.e. any register use
       has a corresponding def or is a "live in"). If your code depends on the
       value of some registers, you have two options:

       • Mark the register as requiring a definition. llvm-exegesis will
         automatically assign a value to the register. This can be done using
         the directive LLVM-EXEGESIS-DEFREG <reg_name> <hex_value>, where
         <hex_value> is a bit pattern used to fill <reg_name>. If <hex_value>
         is smaller than the register width, it will be sign-extended.

       • Mark the register as a "live in". llvm-exegesis will benchmark using
         whatever value was in this register on entry. This can be done using
         the directive LLVM-EXEGESIS-LIVEIN <reg_name>.

       For example, the following code snippet depends on the values of XMM1
       (which will be set by the tool) and the memory buffer passed in RDI
       (live in).

          # LLVM-EXEGESIS-LIVEIN RDI
          # LLVM-EXEGESIS-DEFREG XMM1 42
          vmulps        (%rdi), %xmm1, %xmm2
          vhaddps       %xmm2, %xmm2, %xmm3
          addq $0x10, %rdi
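
       If snippet files are generated by a script, the two directives can be
       emitted mechanically. The sketch below is an illustration only: the
       function make_snippet and its parameters are hypothetical, and it
       assumes the directive comments are placed before the assembly lines, as
       in the example above.

```python
def make_snippet(body_lines, live_ins=(), def_regs=None):
    """Prefix assembly lines with LLVM-EXEGESIS register directives.

    live_ins: register names to mark as "live in".
    def_regs: mapping of register name -> hex value string to fill it with.
    """
    header = [f"# LLVM-EXEGESIS-LIVEIN {reg}" for reg in live_ins]
    for reg, hex_value in (def_regs or {}).items():
        header.append(f"# LLVM-EXEGESIS-DEFREG {reg} {hex_value}")
    return "\n".join(header + list(body_lines)) + "\n"


snippet = make_snippet(
    ["vmulps (%rdi), %xmm1, %xmm2", "addq $0x10, %rdi"],
    live_ins=["RDI"],
    def_regs={"XMM1": "42"},
)
print(snippet, end="")
```

       The resulting text can be written to a file and passed with
       -snippets-file, or piped to llvm-exegesis with -snippets-file=-.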

EXAMPLE 3: ANALYSIS

       Assuming you have a set of benchmarked instructions (either latency or
       uops) as YAML in file /tmp/benchmarks.yaml, you can analyze the results
       using the following command:

          $ llvm-exegesis -mode=analysis \
              -benchmarks-file=/tmp/benchmarks.yaml \
              -analysis-clusters-output-file=/tmp/clusters.csv \
              -analysis-inconsistencies-output-file=/tmp/inconsistencies.html

       This will group the instructions into clusters with the same
       performance characteristics. The clusters will be written out to
       /tmp/clusters.csv in the following format:

          cluster_id,opcode_name,config,sched_class
          ...
          2,ADD32ri8_DB,,WriteALU,1.00
          2,ADD32ri_DB,,WriteALU,1.01
          2,ADD32rr,,WriteALU,1.01
          2,ADD32rr_DB,,WriteALU,1.00
          2,ADD32rr_REV,,WriteALU,1.00
          2,ADD64i32,,WriteALU,1.01
          2,ADD64ri32,,WriteALU,1.01
          2,MOVSX64rr32,,BSWAP32r_BSWAP64r_MOVSX64rr32,1.00
          2,VPADDQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.02
          2,VPSUBQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.01
          2,ADD64ri8,,WriteALU,1.00
          2,SETBr,,WriteSETCC,1.01
          ...

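       The clusters file can be post-processed with any CSV reader. The
       following is a minimal sketch using Python's standard csv module; the
       function name group_by_cluster is hypothetical. Note that the data rows
       shown above carry one more field than the header lists, so the sketch
       assumes that every field after sched_class is a numeric measurement.

```python
import csv
import io
from collections import defaultdict


def group_by_cluster(csv_text):
    """Map cluster_id -> list of (opcode_name, sched_class, measurements)."""
    clusters = defaultdict(list)
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 4:
            continue  # blank or malformed line
        cluster_id, opcode_name, _config, sched_class = row[:4]
        # Assumption: every field after sched_class is a numeric measurement.
        measurements = [float(value) for value in row[4:]]
        clusters[cluster_id].append((opcode_name, sched_class, measurements))
    return dict(clusters)


sample = """cluster_id,opcode_name,config,sched_class
2,ADD32rr,,WriteALU,1.01
2,SETBr,,WriteSETCC,1.01
"""
for cluster_id, members in group_by_cluster(sample).items():
    print(cluster_id, [name for name, _, _ in members])
```

       Grouping by cluster id makes it easy to spot opcodes that share
       measured behaviour but have different scheduling classes.
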
       llvm-exegesis will also analyze the clusters to point out
       inconsistencies in the scheduling information. The output is an HTML
       file. For example, /tmp/inconsistencies.html will contain messages like
       the following: [image]

       Note that the scheduling class names will be resolved only when
       llvm-exegesis is compiled in debug mode; otherwise only the class id
       will be shown. This does not invalidate any of the analysis results
       though.

OPTIONS

       -help  Print a summary of command line options.

       -opcode-index=<LLVM opcode index>
              Specify the opcode to measure, by index. Specifying -1 will
              result in measuring every existing opcode. See example 1 for
              details. Either opcode-index, opcode-name or snippets-file must
              be set.

       -opcode-name=<opcode name 1>,<opcode name 2>,...
              Specify the opcode to measure, by name. Several opcodes can be
              specified as a comma-separated list. See example 1 for details.
              Either opcode-index, opcode-name or snippets-file must be set.

       -snippets-file=<filename>
              Specify the custom code snippet to measure. See example 2 for
              details. Either opcode-index, opcode-name or snippets-file must
              be set.

       -mode=[latency|uops|inverse_throughput|analysis]
              Specify the run mode. Note that some modes have additional
              requirements and options.

              latency mode can make use of either RDTSC or LBR. latency[LBR]
              is only available on X86 (at least Skylake). To use LBR in
              latency mode, a positive value must be specified for
              x86-lbr-sample-period, together with --repetition-mode=loop.

              In analysis mode, you also need to specify at least one of the
              -analysis-clusters-output-file= and
              -analysis-inconsistencies-output-file= options.

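              The analysis-mode requirement above can be checked before the
              tool is launched. This is a minimal sketch under the assumption
              that the command line is assembled by a script; the helper name
              build_analysis_command is hypothetical.

```python
def build_analysis_command(benchmarks_file, clusters_file=None,
                           inconsistencies_file=None):
    """Return an llvm-exegesis argument list for -mode=analysis.

    Hypothetical helper: rejects the call when neither output option is given,
    mirroring the requirement stated for analysis mode.
    """
    if clusters_file is None and inconsistencies_file is None:
        raise ValueError("analysis mode needs at least one output file option")
    cmd = [
        "llvm-exegesis",
        "-mode=analysis",
        f"-benchmarks-file={benchmarks_file}",
    ]
    if clusters_file is not None:
        cmd.append(f"-analysis-clusters-output-file={clusters_file}")
    if inconsistencies_file is not None:
        cmd.append(
            f"-analysis-inconsistencies-output-file={inconsistencies_file}")
    return cmd


print(" ".join(build_analysis_command("/tmp/benchmarks.yaml",
                                      clusters_file="/tmp/clusters.csv")))
```
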
       -x86-lbr-sample-period=<nBranches/sample>
              Specify the LBR sampling period, i.e. how many branches are
              executed before a sample is taken. When a positive value is
              specified for this option and the mode is latency, LBRs are used
              for measuring. When choosing the sampling period, a small value
              is preferred, but throttling could occur if the sampling is too
              frequent. A prime number should be used to avoid consistently
              skipping certain blocks.

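              To follow the prime-number advice, a candidate period can be
              rounded up to the nearest prime. The sketch below is only an
              illustration: the function names are hypothetical, and the
              starting value 200 is an arbitrary example, not a recommended
              sampling period.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small sampling periods."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


def next_prime(n):
    """Return the smallest prime greater than or equal to n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


# 200 is an arbitrary starting point chosen for illustration.
print(f"-x86-lbr-sample-period={next_prime(200)}")  # prints ...=211
```
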
       -repetition-mode=[duplicate|loop|min]
              Specify the repetition mode. duplicate will create a large,
              straight line basic block with num-repetitions instructions
              (repeating the snippet num-repetitions/snippet size times). loop
              will, optionally, duplicate the snippet until the loop body
              contains at least loop-body-size instructions, and then wrap the
              result in a loop which will execute num-repetitions instructions
              (thus, again, repeating the snippet num-repetitions/snippet size
              times). The loop mode, especially with loop unrolling, tends to
              better hide the effects of the CPU frontend on architectures
              that cache decoded instructions, but consumes a register for
              counting iterations. If performing an analysis over many
              opcodes, it may be best to instead use the min mode, which will
              run each other mode and produce the minimal measured result.

       -num-repetitions=<Number of repetitions>
              Specify the target number of executed instructions. Note that
              the actual repetition count of the snippet will be
              num-repetitions/snippet size. Higher values lead to more
              accurate measurements but lengthen the benchmark.

       -loop-body-size=<Preferred loop body size>
              Only effective for -repetition-mode=[loop|min]. Instead of
              looping over the snippet directly, first duplicate it so that
              the loop body contains at least this many instructions. This
              potentially results in the loop body being cached in the CPU Op
              Cache / Loop Cache, which may have higher throughput than the
              CPU decoders.

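              The arithmetic behind -num-repetitions and -loop-body-size can
              be sketched as follows. The function names are hypothetical.
              Rounding up for the loop body follows from "at least this many
              instructions"; rounding down for the repetition count is an
              assumption, since this page does not say how the tool rounds.

```python
import math


def snippet_copies_in_loop_body(snippet_size, loop_body_size):
    """Smallest number of snippet copies giving >= loop_body_size instructions."""
    if snippet_size <= 0:
        raise ValueError("snippet_size must be positive")
    return max(1, math.ceil(loop_body_size / snippet_size))


def snippet_repetitions(num_repetitions, snippet_size):
    """Approximate snippet executions: num-repetitions / snippet size.

    Assumption: the quotient is rounded down.
    """
    if snippet_size <= 0:
        raise ValueError("snippet_size must be positive")
    return num_repetitions // snippet_size


# A 3-instruction snippet with a preferred loop body of 10 instructions:
print(snippet_copies_in_loop_body(3, 10))  # 4 copies, i.e. 12 instructions
# The same snippet with the num_repetitions value from example 1:
print(snippet_repetitions(10000, 3))       # 3333
```
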
       -max-configs-per-opcode=<value>
              Specify the maximum number of configurations that can be
              generated for each opcode. By default this is 1, meaning that we
              assume that a single measurement is enough to characterize an
              opcode. This might not be true of all instructions: for example,
              the performance characteristics of the LEA instruction on X86
              depend on the value of assigned registers and immediates.
              Setting -max-configs-per-opcode to a value larger than 1 allows
              llvm-exegesis to explore more configurations to discover if some
              register or immediate assignments lead to different performance
              characteristics.

       -benchmarks-file=</path/to/file>
              File to read (analysis mode) or write
              (latency/uops/inverse_throughput modes) benchmark results. "-"
              uses stdin/stdout.

       -analysis-clusters-output-file=</path/to/file>
              If provided, write the analysis clusters as CSV to this file.
              "-" prints to stdout. By default, this analysis is not run.

       -analysis-inconsistencies-output-file=</path/to/file>
              If non-empty, write inconsistencies found during analysis to
              this file. "-" prints to stdout. By default, this analysis is
              not run.

       -analysis-clustering=[dbscan,naive]
              Specify the clustering algorithm to use. By default DBSCAN will
              be used. The naive clustering algorithm is better suited to
              further work on the -analysis-inconsistencies-output-file=
              output: it creates one cluster per opcode and checks that the
              cluster is stable (all points are neighbours).

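              The idea behind naive clustering can be illustrated with a short
              sketch. This is not the tool's implementation: the function name
              is hypothetical, each benchmark is reduced to a single value,
              and two points are assumed to be neighbours when their values
              differ by at most epsilon. The measurement values are made up.

```python
from collections import defaultdict
from itertools import combinations


def naive_clusters(points, epsilon):
    """Group (opcode, value) points per opcode and flag unstable clusters.

    Returns {opcode: (values, is_stable)}. Assumption: two points are
    neighbours when their values differ by at most epsilon.
    """
    by_opcode = defaultdict(list)
    for opcode, value in points:
        by_opcode[opcode].append(value)
    result = {}
    for opcode, values in by_opcode.items():
        # A cluster is stable when every pair of its points are neighbours.
        stable = all(abs(a - b) <= epsilon for a, b in combinations(values, 2))
        result[opcode] = (values, stable)
    return result


# Made-up measurements for illustration only.
points = [("ADD64rr", 1.00), ("ADD64rr", 1.02),
          ("LEA64r", 1.00), ("LEA64r", 3.00)]
for opcode, (values, stable) in naive_clusters(points, epsilon=0.1).items():
    print(opcode, values, "stable" if stable else "unstable")
```
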
       -analysis-numpoints=<dbscan numPoints parameter>
              Specify the numPoints parameter to be used for DBSCAN
              clustering (analysis mode, DBSCAN only).

       -analysis-clustering-epsilon=<dbscan epsilon parameter>
              Specify the epsilon parameter used for clustering of benchmark
              points (analysis mode).

       -analysis-inconsistency-epsilon=<epsilon>
              Specify the epsilon parameter used for detecting when the
              cluster is different from the LLVM schedule profile values
              (analysis mode).

       -analysis-display-unstable-clusters
              If there is more than one benchmark for an opcode, said
              benchmarks may end up not being clustered into the same cluster
              if the measured performance characteristics are different. By
              default all such opcodes are filtered out. This flag will
              instead show only such unstable opcodes.

       -ignore-invalid-sched-class=false
              If set, ignore instructions that do not have a sched class
              (class idx = 0).

       -mcpu=<cpu name>
              If set, measure the CPU characteristics using the counters for
              this CPU. This is useful when creating new sched models (the
              host CPU is unknown to LLVM).

       --dump-object-to-disk=true
              By default, llvm-exegesis will dump the generated code to a
              temporary file to enable code inspection. You may disable it to
              speed up the execution and save disk space.

EXIT STATUS

       llvm-exegesis returns 0 on success. Otherwise, an error message is
       printed to standard error, and the tool returns a non-zero value.

AUTHOR

       Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT

       2003-2023, LLVM Project

15                                2023-03-27                  LLVM-EXEGESIS(1)