llvm-exegesis-16(1)

LLVM-EXEGESIS(1)                     LLVM                     LLVM-EXEGESIS(1)

NAME

       llvm-exegesis - LLVM Machine Instruction Benchmark

SYNOPSIS

       llvm-exegesis [options]

DESCRIPTION

       llvm-exegesis is a benchmarking tool that uses information available in
       LLVM to measure host machine instruction characteristics like latency,
       throughput, or port decomposition.

       Given an LLVM opcode name and a benchmarking mode, llvm-exegesis
       generates a code snippet that makes execution as serial (resp. as
       parallel) as possible so that we can measure the latency (resp. inverse
       throughput/uop decomposition) of the instruction. The code snippet is
       jitted and, unless requested not to, executed on the host subtarget.
       The time taken (resp. resource usage) is measured using hardware
       performance counters. The result is printed out as YAML to the standard
       output.

       The main goal of this tool is to automatically (in)validate LLVM's
       TableGen scheduling models. To that end, we also provide analysis of
       the results.

       llvm-exegesis can also benchmark arbitrary user-provided code snippets.

EXAMPLE 1: BENCHMARKING INSTRUCTIONS

       Assume you have an X86-64 machine. To measure the latency of a single
       instruction, run:

          $ llvm-exegesis -mode=latency -opcode-name=ADD64rr

       Measuring the uop decomposition or inverse throughput of an instruction
       works similarly:

          $ llvm-exegesis -mode=uops -opcode-name=ADD64rr
          $ llvm-exegesis -mode=inverse_throughput -opcode-name=ADD64rr

       The output is a YAML document (the default is to write to stdout, but
       you can redirect the output to a file using -benchmarks-file):

          ---
          key:
            opcode_name:     ADD64rr
            mode:            latency
            config:          ''
          cpu_name:        haswell
          llvm_triple:     x86_64-unknown-linux-gnu
          num_repetitions: 10000
          measurements:
            - { key: latency, value: 1.0058, debug_string: '' }
          error:           ''
          info:            'explicit self cycles, selecting one aliasing configuration.
          Snippet:
          ADD64rr R8, R8, R10
          '
          ...

       To measure the latency of all instructions for the host architecture,
       run:

          $ llvm-exegesis -mode=latency -opcode-index=-1

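       As an illustration of consuming the YAML output, the following Python
       sketch pulls the measurement entries out of a benchmark document. It
       is not part of llvm-exegesis: the function name extract_measurements
       is hypothetical, and the regular expression assumes that measurements
       are written in the one-line form shown in the sample above. A real
       YAML parser is the more robust choice.

```python
import re

# Matches entries written like the sample output line:
#   - { key: latency, value: 1.0058, debug_string: '' }
# Assumption: measurements always use this one-line flow-mapping form.
MEASUREMENT_RE = re.compile(
    r"-\s*\{\s*key:\s*(?P<key>[\w-]+),\s*value:\s*(?P<value>[-+0-9.eE]+)"
)


def extract_measurements(yaml_text):
    """Return (key, value) pairs for every measurement entry in yaml_text."""
    return [
        (m.group("key"), float(m.group("value")))
        for m in MEASUREMENT_RE.finditer(yaml_text)
    ]


# Abbreviated copy of the sample document from this example.
SAMPLE = """\
---
key:
  opcode_name:     ADD64rr
  mode:            latency
  config:          ''
cpu_name:        haswell
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 1.0058, debug_string: '' }
error:           ''
...
"""

measurements = extract_measurements(SAMPLE)
```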

EXAMPLE 2: BENCHMARKING A CUSTOM CODE SNIPPET

       To measure the latency/uops of a custom piece of code, you can specify
       the -snippets-file option (- reads from standard input).

          $ echo "vzeroupper" | llvm-exegesis -mode=uops -snippets-file=-

       Real-life code snippets typically depend on registers or memory.
       llvm-exegesis checks the liveness of registers (i.e. any register use
       has a corresponding def or is a "live in"). If your code depends on the
       value of some registers, you have two options:

       • Mark the register as requiring a definition. llvm-exegesis will
         automatically assign a value to the register. This can be done using
         the directive LLVM-EXEGESIS-DEFREG <reg name> <hex_value>, where
         <hex_value> is a bit pattern used to fill <reg name>. If <hex_value>
         is smaller than the register width, it will be sign-extended.

       • Mark the register as a "live in". llvm-exegesis will benchmark using
         whatever value was in this register on entry. This can be done using
         the directive LLVM-EXEGESIS-LIVEIN <reg name>.

       For example, the following code snippet depends on the value of XMM1
       (which will be set by the tool) and the memory buffer passed in RDI
       (live in).

          # LLVM-EXEGESIS-LIVEIN RDI
          # LLVM-EXEGESIS-DEFREG XMM1 42
          vmulps        (%rdi), %xmm1, %xmm2
          vhaddps       %xmm2, %xmm2, %xmm3
          addq $0x10, %rdi

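       When many snippets are generated by a script, the directive lines can
       be produced programmatically. The Python sketch below only assembles
       text in the layout shown above. The helper name make_snippet and its
       parameters are hypothetical, and nothing here checks that the register
       names or values are accepted by llvm-exegesis.

```python
def make_snippet(body, live_ins=(), def_regs=None):
    """Prefix assembly lines with LLVM-EXEGESIS directive comments.

    body:     list of assembly lines
    live_ins: register names to mark with LLVM-EXEGESIS-LIVEIN
    def_regs: mapping of register name -> hex value string for
              LLVM-EXEGESIS-DEFREG
    """
    lines = [f"# LLVM-EXEGESIS-LIVEIN {reg}" for reg in live_ins]
    for reg, hex_value in (def_regs or {}).items():
        lines.append(f"# LLVM-EXEGESIS-DEFREG {reg} {hex_value}")
    lines.extend(body)
    return "\n".join(lines) + "\n"


# Rebuilds the snippet from this example.
snippet = make_snippet(
    body=[
        "vmulps        (%rdi), %xmm1, %xmm2",
        "vhaddps       %xmm2, %xmm2, %xmm3",
        "addq $0x10, %rdi",
    ],
    live_ins=["RDI"],
    def_regs={"XMM1": "42"},
)
```

       The resulting text can be written to a file and passed with
       -snippets-file, or piped to the tool with -snippets-file=-.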

EXAMPLE 3: ANALYSIS

       Assuming you have a set of benchmarked instructions (either latency or
       uops) as YAML in file /tmp/benchmarks.yaml, you can analyze the results
       using the following command:

          $ llvm-exegesis -mode=analysis \
              -benchmarks-file=/tmp/benchmarks.yaml \
              -analysis-clusters-output-file=/tmp/clusters.csv \
              -analysis-inconsistencies-output-file=/tmp/inconsistencies.html

       This will group the instructions into clusters with the same
       performance characteristics. The clusters will be written out to
       /tmp/clusters.csv in the following format:

          cluster_id,opcode_name,config,sched_class
          ...
          2,ADD32ri8_DB,,WriteALU,1.00
          2,ADD32ri_DB,,WriteALU,1.01
          2,ADD32rr,,WriteALU,1.01
          2,ADD32rr_DB,,WriteALU,1.00
          2,ADD32rr_REV,,WriteALU,1.00
          2,ADD64i32,,WriteALU,1.01
          2,ADD64ri32,,WriteALU,1.01
          2,MOVSX64rr32,,BSWAP32r_BSWAP64r_MOVSX64rr32,1.00
          2,VPADDQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.02
          2,VPSUBQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.01
          2,ADD64ri8,,WriteALU,1.00
          2,SETBr,,WriteSETCC,1.01
          ...

       llvm-exegesis will also analyze the clusters to point out
       inconsistencies in the scheduling information. The output is an HTML
       file. For example, /tmp/inconsistencies.html will contain messages like
       the following: [image]

       Note that the scheduling class names will be resolved only when
       llvm-exegesis is compiled in debug mode; otherwise only the class id
       will be shown. This does not invalidate any of the analysis results.

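       The clusters file from this example can be post-processed with a few
       lines of Python. The sketch below lists which scheduling classes
       appear in each cluster. The function name sched_classes_by_cluster is
       hypothetical, and the code assumes the first four fields are exactly
       those named in the header; the sample rows carry one more trailing
       field that the header does not name, and it is ignored here.

```python
import csv
import io
from collections import defaultdict


def sched_classes_by_cluster(csv_text):
    """Map each cluster id to the set of sched_class values seen in it."""
    clusters = defaultdict(set)
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    for row in reader:
        # Skip blank or elided rows; only rows starting with a numeric
        # cluster id and having at least four fields are used.
        if len(row) < 4 or not row[0].strip().isdigit():
            continue
        cluster_id, _opcode, _config, sched_class = row[:4]
        clusters[int(cluster_id)].add(sched_class)
    return dict(clusters)


# A few rows copied from the sample output in this example.
SAMPLE = """\
cluster_id,opcode_name,config,sched_class
2,ADD32rr,,WriteALU,1.01
2,ADD64ri8,,WriteALU,1.00
2,SETBr,,WriteSETCC,1.01
"""

by_cluster = sched_classes_by_cluster(SAMPLE)
```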

OPTIONS

       -help  Print a summary of command line options.

       -opcode-index=<LLVM opcode index>
              Specify the opcode to measure, by index. Specifying -1 will
              result in measuring every existing opcode. See example 1 for
              details. Either opcode-index, opcode-name or snippets-file must
              be set.

       -opcode-name=<opcode name 1>,<opcode name 2>,...
              Specify the opcode to measure, by name. Several opcodes can be
              specified as a comma-separated list. See example 1 for details.
              Either opcode-index, opcode-name or snippets-file must be set.

       -snippets-file=<filename>
              Specify the custom code snippet to measure. See example 2 for
              details. Either opcode-index, opcode-name or snippets-file must
              be set.

       -mode=[latency|uops|inverse_throughput|analysis]
              Specify the run mode. Note that some modes have additional
              requirements and options.

              latency mode can make use of either RDTSC or LBR. latency[LBR]
              is only available on X86 (at least Skylake). To measure latency
              using LBR, a positive value must be specified for
              x86-lbr-sample-period, together with --repetition-mode=loop.

              In analysis mode, you also need to specify at least one of
              -analysis-clusters-output-file= and
              -analysis-inconsistencies-output-file=.

       --benchmark-phase=[prepare-snippet|prepare-and-assemble-snippet|assemble-measured-code|measure]
              By default, when -mode= is specified, the generated snippet will
              be executed and measured. This requires running on the hardware
              for which the snippet was generated, and that hardware must
              support performance measurements. However, it is possible to
              stop at some stage before measuring. Choices are:

              • prepare-snippet: Only generate the minimal instruction
                sequence.

              • prepare-and-assemble-snippet: Same as prepare-snippet, but
                also dumps an excerpt of the sequence (hex encoded).

              • assemble-measured-code: Same as prepare-and-assemble-snippet,
                but also creates the full sequence that can be dumped to a
                file using --dump-object-to-disk.

              • measure: Same as assemble-measured-code, but also runs the
                measurement.

       -x86-lbr-sample-period=<nBranches/sample>
              Specify the LBR sampling period, i.e. how many branches occur
              before we take a sample. When a positive value is specified for
              this option and the mode is latency, we will use LBRs for
              measuring. When choosing the sampling period, a small value is
              preferred, but throttling could occur if the sampling is too
              frequent. A prime number should be used to avoid consistently
              skipping certain blocks.

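              One way to follow the prime-number advice is to round a desired
              period up to the next prime. The Python sketch below does that
              by trial division. The helper names are hypothetical, and the
              starting value 200 is an arbitrary illustration, not a
              recommended sampling period.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small periods."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime_at_or_above(n):
    """Return the smallest prime p with p >= n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


# Arbitrary starting point, chosen only for illustration.
period = next_prime_at_or_above(200)
option = f"-x86-lbr-sample-period={period}"
```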
       -x86-disable-upper-sse-registers
              Using the upper xmm registers (xmm8-xmm15) forces a longer
              instruction encoding which may put greater pressure on the
              frontend fetch and decode stages, potentially reducing the rate
              that instructions are dispatched to the backend, particularly on
              older hardware. Comparing baseline results with this mode
              enabled can help determine the effects of the frontend and can
              be used to improve latency and throughput estimates.

       -repetition-mode=[duplicate|loop|min]
              Specify the repetition mode. duplicate will create a large,
              straight line basic block with num-repetitions instructions
              (repeating the snippet num-repetitions/snippet size times). loop
              will, optionally, duplicate the snippet until the loop body
              contains at least loop-body-size instructions, and then wrap the
              result in a loop which will execute num-repetitions instructions
              (thus, again, repeating the snippet num-repetitions/snippet size
              times). The loop mode, especially with loop unrolling, tends to
              better hide the effects of the CPU frontend on architectures
              that cache decoded instructions, but consumes a register for
              counting iterations. If performing an analysis over many
              opcodes, it may be best to instead use the min mode, which will
              run each other mode, and produce the minimal measured result.

       -num-repetitions=<Number of repetitions>
              Specify the target number of executed instructions. Note that
              the actual repetition count of the snippet will be
              num-repetitions/snippet size. Higher values lead to more
              accurate measurements but lengthen the benchmark.

       -loop-body-size=<Preferred loop body size>
              Only effective for -repetition-mode=[loop|min]. Instead of
              looping over the snippet directly, first duplicate it so that
              the loop body contains at least this many instructions. This
              potentially results in the loop body being cached in the CPU Op
              Cache / Loop Cache, which may have higher throughput than the
              CPU decoders.

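              The relationship between num-repetitions, the snippet size and
              loop-body-size can be worked through with a little arithmetic.
              The Python sketch below is a reading of the descriptions above,
              not the tool's implementation: the function names are
              hypothetical, and rounding up whenever a division is inexact is
              an assumption.

```python
def ceil_div(a, b):
    """Integer division rounded up (assumed rounding, see lead-in)."""
    return -(-a // b)


def snippet_repetitions(num_repetitions, snippet_size):
    """How many times the snippet is repeated: num-repetitions/snippet size."""
    return ceil_div(num_repetitions, snippet_size)


def loop_plan(num_repetitions, snippet_size, loop_body_size=0):
    """Return (copies_in_body, body_instructions, loop_iterations).

    The snippet is duplicated until the loop body holds at least
    loop_body_size instructions, then the body is looped until about
    num_repetitions instructions have executed.
    """
    copies = max(1, ceil_div(loop_body_size, snippet_size))
    body_instructions = copies * snippet_size
    iterations = ceil_div(num_repetitions, body_instructions)
    return copies, body_instructions, iterations


# A 3-instruction snippet, 10000 target instructions, loop body >= 100.
reps = snippet_repetitions(10000, 3)
plan = loop_plan(10000, 3, loop_body_size=100)
```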
       -max-configs-per-opcode=<value>
              Specify the maximum number of configurations that can be
              generated for each opcode. By default this is 1, meaning that we
              assume that a single measurement is enough to characterize an
              opcode. This might not be true of all instructions: for example,
              the performance characteristics of the LEA instruction on X86
              depend on the value of assigned registers and immediates.
              Setting -max-configs-per-opcode to a value larger than 1 allows
              llvm-exegesis to explore more configurations to discover if some
              register or immediate assignments lead to different performance
              characteristics.

       -benchmarks-file=</path/to/file>
              File to read (analysis mode) or write
              (latency/uops/inverse_throughput modes) benchmark results. "-"
              uses stdin/stdout.

       -analysis-clusters-output-file=</path/to/file>
              If provided, write the analysis clusters as CSV to this file.
              "-" prints to stdout. By default, this analysis is not run.

       -analysis-inconsistencies-output-file=</path/to/file>
              If non-empty, write inconsistencies found during analysis to
              this file. "-" prints to stdout. By default, this analysis is
              not run.

       -analysis-filter=[all|reg-only|mem-only]
              By default, all benchmark results are analysed, but sometimes it
              may be useful to only look at those that do not involve memory,
              or vice versa. This option allows either keeping all benchmarks,
              filtering out (ignoring) all the ones that do involve memory
              (involve instructions that may read or write to memory), or the
              opposite, keeping only such benchmarks.

       -analysis-clustering=[dbscan,naive]
              Specify the clustering algorithm to use. By default DBSCAN will
              be used. The naive clustering algorithm is better for doing
              further work on the -analysis-inconsistencies-output-file=
              output: it will create one cluster per opcode, and check that
              the cluster is stable (all points are neighbours).

       -analysis-numpoints=<dbscan numPoints parameter>
              Specify the numPoints parameter to be used for DBSCAN
              clustering (analysis mode, DBSCAN only).

       -analysis-clustering-epsilon=<dbscan epsilon parameter>
              Specify the epsilon parameter used for clustering of benchmark
              points (analysis mode).

       -analysis-inconsistency-epsilon=<epsilon>
              Specify the epsilon parameter used for detection of when the
              cluster is different from the LLVM schedule profile values
              (analysis mode).

       -analysis-display-unstable-clusters
              If there is more than one benchmark for an opcode, said
              benchmarks may end up not being clustered into the same cluster
              if the measured performance characteristics are different. By
              default all such opcodes are filtered out. This flag will
              instead show only such unstable opcodes.

       -ignore-invalid-sched-class=false
              If set, ignore instructions that do not have a sched class
              (class idx = 0).

       -mtriple=<triple name>
              Target triple. See -version for available targets.

       -mcpu=<cpu name>
              If set, measure the cpu characteristics using the counters for
              this CPU. This is useful when creating new sched models (the
              host CPU is unknown to LLVM). Use -mcpu=help for details.

       --analysis-override-benchmark-triple-and-cpu
              By default, llvm-exegesis will analyze the benchmarks for the
              triple/CPU they were measured for, but if you want to analyze
              them for some other combination (specified via -mtriple/-mcpu),
              you can pass this flag.

       --dump-object-to-disk=true
              If set, llvm-exegesis will dump the generated code to a
              temporary file to enable code inspection. Disabled by default.

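       Several options above constrain each other. The Python sketch below
       checks two of the stated rules before a command line is assembled: a
       benchmarking mode needs an opcode or snippet selection, and analysis
       mode needs at least one analysis output file. The function name
       check_options is hypothetical, and applying the opcode/snippet rule
       only to the benchmarking modes is an assumption based on example 3.

```python
BENCHMARK_MODES = {"latency", "uops", "inverse_throughput"}
SELECTION_OPTIONS = ("opcode-index", "opcode-name", "snippets-file")
ANALYSIS_OUTPUTS = (
    "analysis-clusters-output-file",
    "analysis-inconsistencies-output-file",
)


def check_options(opts):
    """Return a list of problems for a dict of option name -> value."""
    problems = []
    mode = opts.get("mode")
    if mode in BENCHMARK_MODES:
        # Assumption: the selection rule applies to benchmarking modes only.
        if not any(name in opts for name in SELECTION_OPTIONS):
            problems.append(
                "one of opcode-index, opcode-name or snippets-file must be set"
            )
    elif mode == "analysis":
        if not any(name in opts for name in ANALYSIS_OUTPUTS):
            problems.append("analysis mode needs at least one output file")
    else:
        problems.append(f"unknown mode: {mode!r}")
    return problems


def to_argv(opts):
    """Render the options as an argument list (does not run anything)."""
    return ["llvm-exegesis"] + [f"-{name}={value}" for name, value in opts.items()]


argv = to_argv({"mode": "latency", "opcode-name": "ADD64rr"})
```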

EXIT STATUS

       llvm-exegesis returns 0 on success. Otherwise, an error message is
       printed to standard error, and the tool returns a non-zero value.

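       A calling script can rely on this convention to separate results from
       failures. The Python sketch below wraps an arbitrary command and
       reports success based on the exit status. The function name
       run_and_check is hypothetical, and the argument list in the comment is
       only an illustration taken from example 1.

```python
import subprocess


def run_and_check(argv):
    """Run argv and return (succeeded, stdout_text, stderr_text).

    succeeded is True only when the process exits with status 0.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout, proc.stderr


# Illustrative call, assuming llvm-exegesis is on PATH:
#   ok, out, err = run_and_check(
#       ["llvm-exegesis", "-mode=latency", "-opcode-name=ADD64rr"])
#   if not ok:
#       print(err)
```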

AUTHOR

       Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT

       2003-2023, LLVM Project

16                                2023-07-20                  LLVM-EXEGESIS(1)