llvm-exegesis-12(1)

LLVM-EXEGESIS(1)                     LLVM                     LLVM-EXEGESIS(1)

NAME

       llvm-exegesis - LLVM Machine Instruction Benchmark

SYNOPSIS

       llvm-exegesis [options]

DESCRIPTION

       llvm-exegesis is a benchmarking tool that uses information available in
       LLVM to measure host machine instruction characteristics like latency,
       throughput, or port decomposition.

       Given an LLVM opcode name and a benchmarking mode, llvm-exegesis
       generates a code snippet that makes execution as serial (resp. as
       parallel) as possible so that we can measure the latency (resp. inverse
       throughput/uop decomposition) of the instruction. The code snippet is
       jitted and executed on the host subtarget. The time taken (resp.
       resource usage) is measured using hardware performance counters. The
       result is printed out as YAML to the standard output.

       The main goal of this tool is to automatically (in)validate LLVM's
       TableGen scheduling models. To that end, we also provide analysis of
       the results.

       llvm-exegesis can also benchmark arbitrary user-provided code snippets.

EXAMPLE 1: BENCHMARKING INSTRUCTIONS

       Assume you have an X86-64 machine. To measure the latency of a single
       instruction, run:

          $ llvm-exegesis -mode=latency -opcode-name=ADD64rr

       Measuring the uop decomposition or inverse throughput of an instruction
       works similarly:

          $ llvm-exegesis -mode=uops -opcode-name=ADD64rr
          $ llvm-exegesis -mode=inverse_throughput -opcode-name=ADD64rr

       The output is a YAML document (the default is to write to stdout, but
       you can redirect the output to a file using -benchmarks-file):

          ---
          key:
            opcode_name:     ADD64rr
            mode:            latency
            config:          ''
          cpu_name:        haswell
          llvm_triple:     x86_64-unknown-linux-gnu
          num_repetitions: 10000
          measurements:
            - { key: latency, value: 1.0058, debug_string: '' }
          error:           ''
          info:            'explicit self cycles, selecting one aliasing configuration.
          Snippet:
          ADD64rr R8, R8, R10
          '
          ...

       To measure the latency of all instructions for the host architecture,
       run:

          $ llvm-exegesis -mode=latency -opcode-index=-1
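       As an illustration of consuming the YAML output shown in this example,
       the following Python sketch pulls the measurement keys and values out
       of a benchmark document. It is a regex-based shortcut that assumes the
       one-line flow-style entries seen above, not a full YAML parser; the
       names extract_measurements and SAMPLE are hypothetical.

```python
import re

# Matches flow-style measurement entries such as:
#   - { key: latency, value: 1.0058, debug_string: '' }
# Assumption: each measurement sits on one line in this exact style.
MEASUREMENT_RE = re.compile(
    r"-\s*\{\s*key:\s*(?P<key>[\w-]+)\s*,\s*value:\s*(?P<value>[-+0-9.eE]+)"
)


def extract_measurements(yaml_text):
    """Return a list of (key, value) pairs found in the benchmark output."""
    return [
        (m.group("key"), float(m.group("value")))
        for m in MEASUREMENT_RE.finditer(yaml_text)
    ]


# Abbreviated copy of the document shown in this example.
SAMPLE = """\
---
key:
  opcode_name:     ADD64rr
  mode:            latency
  config:          ''
cpu_name:        haswell
num_repetitions: 10000
measurements:
  - { key: latency, value: 1.0058, debug_string: '' }
error:           ''
...
"""

print(extract_measurements(SAMPLE))
```

       For anything beyond quick inspection, a real YAML library is likely
       the more robust choice.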

EXAMPLE 2: BENCHMARKING A CUSTOM CODE SNIPPET

       To measure the latency/uops of a custom piece of code, you can specify
       the snippets-file option (- reads from standard input):

          $ echo "vzeroupper" | llvm-exegesis -mode=uops -snippets-file=-

       Real-life code snippets typically depend on registers or memory.
       llvm-exegesis checks the liveness of registers (i.e. any register use
       has a corresponding def or is a "live in"). If your code depends on the
       value of some registers, you have two options:

       • Mark the register as requiring a definition. llvm-exegesis will
         automatically assign a value to the register. This can be done using
         the directive LLVM-EXEGESIS-DEFREG <reg_name> <hex_value>, where
         <hex_value> is a bit pattern used to fill <reg_name>. If <hex_value>
         is smaller than the register width, it will be sign-extended.

       • Mark the register as a "live in". llvm-exegesis will benchmark using
         whatever value was in this register on entry. This can be done using
         the directive LLVM-EXEGESIS-LIVEIN <reg_name>.

       For example, the following code snippet depends on the values of XMM1
       (which will be set by the tool) and the memory buffer passed in RDI
       (live in):

          # LLVM-EXEGESIS-LIVEIN RDI
          # LLVM-EXEGESIS-DEFREG XMM1 42
          vmulps        (%rdi), %xmm1, %xmm2
          vhaddps       %xmm2, %xmm2, %xmm3
          addq $0x10, %rdi
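       To make the sign-extension rule for LLVM-EXEGESIS-DEFREG concrete, the
       Python sketch below computes the bit pattern a register would hold for
       a short <hex_value>. It rests on an assumption this page does not
       state: that the width of the pattern is 4 bits per hex digit. The
       function name sign_extend_hex is hypothetical and is not part of
       llvm-exegesis.

```python
def sign_extend_hex(hex_value, register_bits):
    """Sign-extend a hex bit pattern to register_bits bits.

    Assumption (not stated in the man page): the pattern's own width is
    taken to be 4 bits per hex digit.
    """
    value_bits = 4 * len(hex_value)
    value = int(hex_value, 16)
    if value_bits >= register_bits:
        # Nothing to extend. Keeping only the low bits is also an assumption.
        return value & ((1 << register_bits) - 1)
    sign_bit = 1 << (value_bits - 1)
    if value & sign_bit:
        # Top bit of the pattern is set: fill the upper bits with ones.
        value |= ((1 << register_bits) - 1) & ~((1 << value_bits) - 1)
    return value


# "42" has its top bit clear, so the upper bits of a 128-bit XMM stay zero.
print(hex(sign_extend_hex("42", 128)))
# "ff" has its top bit set, so the upper bits are filled with ones.
print(hex(sign_extend_hex("ff", 16)))
```

       Under that assumption, the DEFREG XMM1 42 directive in the snippet
       above would leave only the low byte of XMM1 non-zero.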

EXAMPLE 3: ANALYSIS

       Assuming you have a set of benchmarked instructions (either latency or
       uops) as YAML in file /tmp/benchmarks.yaml, you can analyze the results
       using the following command:

          $ llvm-exegesis -mode=analysis \
              -benchmarks-file=/tmp/benchmarks.yaml \
              -analysis-clusters-output-file=/tmp/clusters.csv \
              -analysis-inconsistencies-output-file=/tmp/inconsistencies.html

       This will group the instructions into clusters with the same
       performance characteristics. The clusters will be written out to
       /tmp/clusters.csv in the following format:

          cluster_id,opcode_name,config,sched_class
          ...
          2,ADD32ri8_DB,,WriteALU,1.00
          2,ADD32ri_DB,,WriteALU,1.01
          2,ADD32rr,,WriteALU,1.01
          2,ADD32rr_DB,,WriteALU,1.00
          2,ADD32rr_REV,,WriteALU,1.00
          2,ADD64i32,,WriteALU,1.01
          2,ADD64ri32,,WriteALU,1.01
          2,MOVSX64rr32,,BSWAP32r_BSWAP64r_MOVSX64rr32,1.00
          2,VPADDQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.02
          2,VPSUBQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.01
          2,ADD64ri8,,WriteALU,1.00
          2,SETBr,,WriteSETCC,1.01
          ...

       llvm-exegesis will also analyze the clusters to point out
       inconsistencies in the scheduling information. The output is an HTML
       file. For example, /tmp/inconsistencies.html will contain messages like
       the following: [image]

       Note that the scheduling class names will be resolved only when
       llvm-exegesis is compiled in debug mode, else only the class id will be
       shown. This does not invalidate any of the analysis results though.
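       The clusters CSV described in this example can be post-processed with
       a few lines of script. The Python sketch below groups opcode names by
       cluster id. It assumes only what the listing shows: data rows start
       with a numeric cluster id followed by the opcode name. The names
       group_by_cluster and SAMPLE are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_by_cluster(csv_text):
    """Map cluster id -> list of opcode names found in the clusters CSV."""
    clusters = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        # Data rows start with a numeric cluster id; this skips the header
        # line, blank lines and any "..." elision markers.
        if len(row) < 2 or not row[0].strip().lstrip("-").isdigit():
            continue
        clusters[row[0].strip()].append(row[1].strip())
    return dict(clusters)


# A few rows copied from the listing in this example.
SAMPLE = """\
cluster_id,opcode_name,config,sched_class
2,ADD32rr,,WriteALU,1.01
2,ADD64ri8,,WriteALU,1.00
2,SETBr,,WriteSETCC,1.01
"""

print(group_by_cluster(SAMPLE))
```

       Rows are read by position rather than by header name because the data
       rows in the listing carry one more field than the header line names.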

OPTIONS

       -help  Print a summary of command line options.

       -opcode-index=<LLVM opcode index>
              Specify the opcode to measure, by index. Specifying -1 will
              result in measuring every existing opcode. See example 1 for
              details. Either opcode-index, opcode-name or snippets-file must
              be set.

       -opcode-name=<opcode name 1>,<opcode name 2>,...
              Specify the opcode to measure, by name. Several opcodes can be
              specified as a comma-separated list. See example 1 for details.
              Either opcode-index, opcode-name or snippets-file must be set.

       -snippets-file=<filename>
              Specify the custom code snippet to measure. See example 2 for
              details. Either opcode-index, opcode-name or snippets-file must
              be set.

       -mode=[latency|uops|inverse_throughput|analysis]
              Specify the run mode. Note that some modes have additional
              requirements and options.

              latency mode can make use of either RDTSC or LBR. latency[LBR]
              is only available on X86 (at least Skylake). To use LBR in
              latency mode, a positive value must be specified for
              -x86-lbr-sample-period, together with --repetition-mode=loop.

              In analysis mode, you also need to specify at least one of
              -analysis-clusters-output-file= and
              -analysis-inconsistencies-output-file=.

       -x86-lbr-sample-period=<nBranches/sample>
              Specify the LBR sampling period, i.e. how many branches occur
              before we take a sample. When a positive value is specified for
              this option and the mode is latency, we will use LBRs for
              measuring. When choosing the sampling period, a small value is
              preferred, but throttling could occur if the sampling is too
              frequent. A prime number should be used to avoid consistently
              skipping certain blocks.

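              The advice to use a prime sampling period can be followed with
              a small helper. The Python sketch below finds the smallest
              prime at or above a chosen starting point and prints the
              corresponding option. The starting value 200 is arbitrary and
              purely illustrative, and the helper names are hypothetical.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small sampling periods."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n):
    """Return the smallest prime greater than or equal to n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


# 200 is an arbitrary illustrative lower bound, not a recommended value.
period = next_prime(200)
print(f"-x86-lbr-sample-period={period}")
```

              How small the period can go before throttling sets in depends
              on the machine, so the lower bound is something to determine
              experimentally.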
       -repetition-mode=[duplicate|loop|min]
              Specify the repetition mode. duplicate will create a large,
              straight line basic block with num-repetitions copies of the
              snippet. loop will wrap the snippet in a loop which will be run
              num-repetitions times. The loop mode tends to better hide the
              effects of the CPU frontend on architectures that cache decoded
              instructions, but consumes a register for counting iterations.
              If performing an analysis over many opcodes, it may be best to
              instead use the min mode, which will run each of the other
              modes and report the minimal measured result.

       -num-repetitions=<Number of repetitions>
              Specify the number of repetitions of the asm snippet. Higher
              values lead to more accurate measurements but lengthen the
              benchmark.

       -max-configs-per-opcode=<value>
              Specify the maximum number of configurations that can be
              generated for each opcode. By default this is 1, meaning that we
              assume that a single measurement is enough to characterize an
              opcode. This might not be true of all instructions: for example,
              the performance characteristics of the LEA instruction on X86
              depend on the value of assigned registers and immediates.
              Setting -max-configs-per-opcode to a value larger than 1 allows
              llvm-exegesis to explore more configurations to discover if some
              register or immediate assignments lead to different performance
              characteristics.

       -benchmarks-file=</path/to/file>
              File to read (analysis mode) or write
              (latency/uops/inverse_throughput modes) benchmark results. "-"
              uses stdin/stdout.

       -analysis-clusters-output-file=</path/to/file>
              If provided, write the analysis clusters as CSV to this file.
              "-" prints to stdout. By default, this analysis is not run.

       -analysis-inconsistencies-output-file=</path/to/file>
              If non-empty, write inconsistencies found during analysis to
              this file. "-" prints to stdout. By default, this analysis is
              not run.

       -analysis-clustering=[dbscan,naive]
              Specify the clustering algorithm to use. By default DBSCAN will
              be used. The naive clustering algorithm is better suited to
              further work on the -analysis-inconsistencies-output-file=
              output: it will create one cluster per opcode, and check that
              the cluster is stable (all points are neighbours).

       -analysis-numpoints=<dbscan numPoints parameter>
              Specify the numPoints parameter to be used for DBSCAN
              clustering (analysis mode, DBSCAN only).

       -analysis-clustering-epsilon=<dbscan epsilon parameter>
              Specify the epsilon parameter used for clustering of benchmark
              points (analysis mode).

       -analysis-inconsistency-epsilon=<epsilon>
              Specify the epsilon parameter used for detection of when the
              cluster is different from the LLVM schedule profile values
              (analysis mode).

       -analysis-display-unstable-clusters
              If there is more than one benchmark for an opcode, said
              benchmarks may end up not being clustered into the same cluster
              if the measured performance characteristics are different. By
              default all such opcodes are filtered out. This flag will
              instead show only such unstable opcodes.

       -ignore-invalid-sched-class=false
              If set, ignore instructions that do not have a sched class
              (class idx = 0).

       -mcpu=<cpu name>
              If set, measure the cpu characteristics using the counters for
              this CPU. This is useful when creating new sched models (the
              host CPU is unknown to LLVM).

       --dump-object-to-disk=true
              By default, llvm-exegesis will dump the generated code to a
              temporary file to enable code inspection. You may disable it to
              speed up the execution and save disk space.

EXIT STATUS

       llvm-exegesis returns 0 on success. Otherwise, an error message is
       printed to standard error, and the tool returns a non-zero value.
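       A script driving the tool can rely on this convention. The Python
       sketch below runs a command, treats exit status 0 as success and
       returns whatever was printed to standard error otherwise. The function
       name run_and_check is hypothetical, and the demo uses the Python
       interpreter as a stand-in command so that the sketch runs without
       llvm-exegesis installed.

```python
import subprocess
import sys


def run_and_check(argv):
    """Run a command; return (ok, stderr_text) based on its exit status."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr


# Stand-in command that fails with a message on standard error. With the
# tool installed, argv could instead be something like
# ["llvm-exegesis", "-mode=latency", "-opcode-name=ADD64rr"].
demo = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('boom\\n'); sys.exit(1)",
]
ok, err = run_and_check(demo)
if not ok:
    print("command failed:", err.strip())
```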

AUTHOR

       Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT

       2003-2023, LLVM Project



12                                2023-07-20                  LLVM-EXEGESIS(1)