llvm-exegesis(1)

LLVM-EXEGESIS(1)                     LLVM                     LLVM-EXEGESIS(1)

NAME

       llvm-exegesis - LLVM Machine Instruction Benchmark

SYNOPSIS

       llvm-exegesis [options]

DESCRIPTION

       llvm-exegesis is a benchmarking tool that uses information available in
       LLVM to measure host machine instruction characteristics like latency,
       throughput, or port decomposition.

       Given an LLVM opcode name and a benchmarking mode, llvm-exegesis
       generates a code snippet that makes execution as serial (resp. as
       parallel) as possible so that we can measure the latency (resp. inverse
       throughput/uop decomposition) of the instruction. The code snippet is
       jitted and, unless requested not to, executed on the host subtarget. The
       time taken (resp. resource usage) is measured using hardware performance
       counters. The result is printed out as YAML to the standard output.

       The main goal of this tool is to automatically (in)validate LLVM's
       TableGen scheduling models. To that end, we also provide analysis of
       the results.

       llvm-exegesis can also benchmark arbitrary user-provided code snippets.

SUPPORTED PLATFORMS

       llvm-exegesis currently only supports X86 (64-bit only), ARM (AArch64
       only), MIPS, and PowerPC (PowerPC64LE only) on Linux for benchmarking.
       Not all benchmarking functionality is guaranteed to work on every
       platform. llvm-exegesis also has a separate analysis mode that is
       supported on every platform that LLVM supports.

SNIPPET ANNOTATIONS

       llvm-exegesis supports benchmarking arbitrary snippets of assembly.
       However, benchmarking these snippets often requires some setup so that
       they can execute properly. llvm-exegesis has four annotations and some
       additional utilities to help with setup so that snippets can be
       benchmarked properly.

       • LLVM-EXEGESIS-DEFREG <register name> - Adding this annotation to the
         text assembly snippet to be benchmarked marks the register as
         requiring a definition. A value will automatically be provided unless
         a second parameter, a hex value, is passed in. This is done with the
         LLVM-EXEGESIS-DEFREG <register name> <hex value> format. <hex value>
         is a bit pattern used to fill the register. If it is a value smaller
         than the register, it is sign extended to match the size of the
         register.

       • LLVM-EXEGESIS-LIVEIN <register name> - This annotation allows
         specifying registers that should keep their value upon starting the
         benchmark. Values can be passed through registers from the
         benchmarking setup in some cases. The registers and the values
         assigned to them that can be utilized in the benchmarking script with
         a LLVM-EXEGESIS-LIVEIN are as follows:

         • Scratch memory register - The specific register that this value is
           put in is platform dependent (e.g., it is the RDI register on X86
           Linux). Setting this register as a live in ensures that a pointer
           to a block of memory (1MB) is placed within this register that can
           be used by the snippet.

       • LLVM-EXEGESIS-MEM-DEF <value name> <size> <value> - This annotation
         allows specifying memory definitions that can later be mapped into
         the execution process of a snippet with the LLVM-EXEGESIS-MEM-MAP
         annotation. Each value is named using the <value name> argument so
         that it can be referenced later within a map annotation. The size is
         specified in bytes and the value is given in hexadecimal. If the size
         of the value is less than the specified size, the value will be
         repeated until it fills the entire section of memory. Using this
         annotation requires using the subprocess execution mode.

       • LLVM-EXEGESIS-MEM-MAP <value name> <address> - This annotation allows
         for mapping previously defined memory definitions into the execution
         context of a process. The value name refers to a previously defined
         memory definition and the address is a decimal number that specifies
         the address the memory definition should start at. Note that a single
         memory definition can be mapped multiple times. Using this annotation
         requires the subprocess execution mode.
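
       The two value rules in the list above (sign extension for
       LLVM-EXEGESIS-DEFREG, repetition for LLVM-EXEGESIS-MEM-DEF) can be
       modelled with a short Python sketch. It is only an illustration, not
       the tool's implementation: the function names are hypothetical, the
       width of a value is assumed to be four bits per hex digit, and the
       little-endian byte order used for the memory fill is an assumption.

```python
def sign_extend_hex(hex_value: str, register_bits: int) -> int:
    """Return the register bit pattern for a DEFREG hex value.

    Assumption: the value's own width is 4 bits per hex digit.
    """
    value_bits = 4 * len(hex_value)
    if value_bits > register_bits:
        raise ValueError("value is wider than the register")
    value = int(hex_value, 16)
    # If the top bit of the value is set, fill the upper register bits with ones.
    if value & (1 << (value_bits - 1)):
        value |= ((1 << register_bits) - 1) ^ ((1 << value_bits) - 1)
    return value


def fill_memory(hex_value: str, size_bytes: int) -> bytes:
    """Repeat a MEM-DEF value until it fills size_bytes bytes.

    Assumption: the value is laid out little-endian, two hex digits per byte.
    """
    digits = hex_value if len(hex_value) % 2 == 0 else "0" + hex_value
    pattern = bytes.fromhex(digits)[::-1]  # reverse into little-endian order
    copies = -(-size_bytes // len(pattern))  # ceiling division
    return (pattern * copies)[:size_bytes]
```

       Under these assumptions, sign_extend_hex("42", 64) leaves the value
       unchanged because its top bit is clear, while sign_extend_hex("ff", 16)
       fills the whole 16-bit register with ones.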

EXAMPLE 1: BENCHMARKING INSTRUCTIONS

       Assume you have an X86-64 machine. To measure the latency of a single
       instruction, run:

          $ llvm-exegesis -mode=latency -opcode-name=ADD64rr

       Measuring the uop decomposition or inverse throughput of an instruction
       works similarly:

          $ llvm-exegesis -mode=uops -opcode-name=ADD64rr
          $ llvm-exegesis -mode=inverse_throughput -opcode-name=ADD64rr

       The output is a YAML document (the default is to write to stdout, but
       you can redirect the output to a file using -benchmarks-file):

          ---
          key:
            opcode_name:     ADD64rr
            mode:            latency
            config:          ''
          cpu_name:        haswell
          llvm_triple:     x86_64-unknown-linux-gnu
          num_repetitions: 10000
          measurements:
            - { key: latency, value: 1.0058, debug_string: '' }
          error:           ''
          info:            'explicit self cycles, selecting one aliasing configuration.
          Snippet:
          ADD64rr R8, R8, R10
          '
          ...
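
       For quick scripting, the measurement entries of the YAML document
       shown above can be pulled out without a YAML library. The following
       Python sketch assumes the flow-style layout of the sample output; the
       function name is hypothetical, and a real YAML parser is the more
       robust choice for anything beyond a quick check.

```python
import re

# Matches entries such as: - { key: latency, value: 1.0058, debug_string: '' }
MEASUREMENT = re.compile(r"-\s*\{\s*key:\s*([^,\s]+)\s*,\s*value:\s*([-+0-9.eE]+)")


def read_measurements(yaml_text: str) -> dict:
    """Map each measurement key to its value as a float."""
    return {key: float(value) for key, value in MEASUREMENT.findall(yaml_text)}


sample = """\
measurements:
  - { key: latency, value: 1.0058, debug_string: '' }
"""
print(read_measurements(sample))  # {'latency': 1.0058}
```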

       To measure the latency of all instructions for the host architecture,
       run:

          $ llvm-exegesis -mode=latency -opcode-index=-1

EXAMPLE 2: BENCHMARKING A CUSTOM CODE SNIPPET

       To measure the latency/uops of a custom piece of code, you can specify
       the -snippets-file option (- reads from standard input).

          $ echo "vzeroupper" | llvm-exegesis -mode=uops -snippets-file=-

       Real-life code snippets typically depend on registers or memory.
       llvm-exegesis checks the liveness of registers (i.e. any register use
       has a corresponding def or is a "live in"). If your code depends on the
       value of some registers, you need to use snippet annotations to ensure
       setup is performed properly.

       For example, the following code snippet depends on the values of XMM1
       (which will be set by the tool) and the memory buffer passed in RDI
       (live in).

          # LLVM-EXEGESIS-LIVEIN RDI
          # LLVM-EXEGESIS-DEFREG XMM1 42
          vmulps        (%rdi), %xmm1, %xmm2
          vhaddps       %xmm2, %xmm2, %xmm3
          addq $0x10, %rdi
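
       When driving the tool from a script, the annotated snippet above can
       be fed through standard input. The following Python sketch assumes
       that llvm-exegesis is on PATH; the helper names are hypothetical, and
       only the -mode and -snippets-file options from this page are used.

```python
import subprocess

SNIPPET = """\
# LLVM-EXEGESIS-LIVEIN RDI
# LLVM-EXEGESIS-DEFREG XMM1 42
vmulps (%rdi), %xmm1, %xmm2
vhaddps %xmm2, %xmm2, %xmm3
addq $0x10, %rdi
"""


def snippet_command(mode: str) -> list:
    """Build the command line that reads the snippet from standard input."""
    return ["llvm-exegesis", f"-mode={mode}", "-snippets-file=-"]


def benchmark_snippet(snippet: str, mode: str = "latency") -> str:
    """Run llvm-exegesis on the snippet and return the YAML it prints."""
    result = subprocess.run(
        snippet_command(mode),
        input=snippet,
        text=True,
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

       Calling benchmark_snippet(SNIPPET, "uops") would then return the
       benchmark result as a YAML string.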

EXAMPLE 3: BENCHMARKING WITH MEMORY ANNOTATIONS

       Some snippets require memory setup in specific places to execute
       without crashing. Setting up memory can be accomplished with the
       LLVM-EXEGESIS-MEM-DEF and LLVM-EXEGESIS-MEM-MAP annotations. To execute
       the following snippet:

          movq $8192, %rax
          movq (%rax), %rdi

       We need to have at least eight bytes of memory allocated starting at
       0x2000 (8192 in decimal). We can create the necessary execution
       environment with the following annotations added to the snippet:

          # LLVM-EXEGESIS-MEM-DEF test1 4096 2147483647
          # LLVM-EXEGESIS-MEM-MAP test1 8192

          movq $8192, %rax
          movq (%rax), %rdi
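
       The annotation lines above can also be generated, which helps keep the
       hexadecimal value and the decimal address straight. The Python sketch
       below uses a hypothetical helper name; the command list adds
       --execution-mode=subprocess because memory annotations require the
       subprocess execution mode.

```python
def annotate_memory(snippet: str, name: str, size_bytes: int,
                    hex_value: str, address: int) -> str:
    """Prefix a snippet with one memory definition and one mapping."""
    int(hex_value, 16)  # raises ValueError if the value is not hexadecimal
    header = [
        f"# LLVM-EXEGESIS-MEM-DEF {name} {size_bytes} {hex_value}",
        f"# LLVM-EXEGESIS-MEM-MAP {name} {address:d}",  # address in decimal
    ]
    return "\n".join(header) + "\n\n" + snippet


COMMAND = [
    "llvm-exegesis",
    "-mode=latency",
    "--execution-mode=subprocess",
    "-snippets-file=-",
]

body = "movq $8192, %rax\nmovq (%rax), %rdi\n"
print(annotate_memory(body, "test1", 4096, "2147483647", 0x2000))
```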

EXAMPLE 4: ANALYSIS

       Assuming you have a set of benchmarked instructions (either latency or
       uops) as YAML in file /tmp/benchmarks.yaml, you can analyze the results
       using the following command:

          $ llvm-exegesis -mode=analysis \
              -benchmarks-file=/tmp/benchmarks.yaml \
              -analysis-clusters-output-file=/tmp/clusters.csv \
              -analysis-inconsistencies-output-file=/tmp/inconsistencies.html

       This will group the instructions into clusters with the same
       performance characteristics. The clusters will be written out to
       /tmp/clusters.csv in the following format:

          cluster_id,opcode_name,config,sched_class
          ...
          2,ADD32ri8_DB,,WriteALU,1.00
          2,ADD32ri_DB,,WriteALU,1.01
          2,ADD32rr,,WriteALU,1.01
          2,ADD32rr_DB,,WriteALU,1.00
          2,ADD32rr_REV,,WriteALU,1.00
          2,ADD64i32,,WriteALU,1.01
          2,ADD64ri32,,WriteALU,1.01
          2,MOVSX64rr32,,BSWAP32r_BSWAP64r_MOVSX64rr32,1.00
          2,VPADDQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.02
          2,VPSUBQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.01
          2,ADD64ri8,,WriteALU,1.00
          2,SETBr,,WriteSETCC,1.01
          ...
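
       A clusters file in the format above is easy to post-process. The
       Python sketch below groups opcode names by cluster id. The function
       name is hypothetical, and columns are read by position because the
       sample rows carry a trailing measurement value that the header line
       does not name.

```python
import csv
import io
from collections import defaultdict


def opcodes_by_cluster(csv_text: str) -> dict:
    """Map each cluster id to the opcode names assigned to it."""
    clusters = defaultdict(list)
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header row
    for row in rows:
        if len(row) < 4:
            continue  # skip blank or elided ("...") lines
        cluster_id, opcode_name = row[0], row[1]
        clusters[cluster_id].append(opcode_name)
    return dict(clusters)


sample = """\
cluster_id,opcode_name,config,sched_class
2,ADD32rr,,WriteALU,1.01
2,SETBr,,WriteSETCC,1.01
"""
print(opcodes_by_cluster(sample))  # {'2': ['ADD32rr', 'SETBr']}
```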

       llvm-exegesis will also analyze the clusters to point out
       inconsistencies in the scheduling information. The output is an HTML
       file. For example, /tmp/inconsistencies.html will contain messages like
       the following: [image]

       Note that the scheduling class names will be resolved only when
       llvm-exegesis is compiled in debug mode, else only the class id will be
       shown. This does not invalidate any of the analysis results though.

OPTIONS

       -help  Print a summary of command line options.

       -opcode-index=<LLVM opcode index>
              Specify the opcode to measure, by index. Specifying -1 will
              result in measuring every existing opcode. See example 1 for
              details. Either opcode-index, opcode-name or snippets-file must
              be set.

       -opcode-name=<opcode name 1>,<opcode name 2>,...
              Specify the opcode to measure, by name. Several opcodes can be
              specified as a comma-separated list. See example 1 for details.
              Either opcode-index, opcode-name or snippets-file must be set.

       -snippets-file=<filename>
              Specify the custom code snippet to measure. See example 2 for
              details. Either opcode-index, opcode-name or snippets-file must
              be set.

       -mode=[latency|uops|inverse_throughput|analysis]
              Specify the run mode. Note that some modes have additional
              requirements and options.

              latency mode can make use of either RDTSC or LBR. latency[LBR]
              is only available on X86 (at least Skylake). To run in
              latency[LBR] mode, a positive value must be specified for
              -x86-lbr-sample-period together with --repetition-mode=loop.

              In analysis mode, you also need to specify at least one of the
              -analysis-clusters-output-file= and
              -analysis-inconsistencies-output-file= options.

       --benchmark-phase=[prepare-snippet|prepare-and-assemble-snippet|assemble-measured-code|measure]
              By default, when -mode= is specified, the generated snippet will
              be executed and measured, and that requires that we are running
              on the hardware for which the snippet was generated, and that
              supports performance measurements. However, it is possible to
              stop at some stage before measuring. Choices are:

              • prepare-snippet: Only generate the minimal instruction
                sequence.

              • prepare-and-assemble-snippet: Same as prepare-snippet, but
                also dumps an excerpt of the sequence (hex encoded).

              • assemble-measured-code: Same as prepare-and-assemble-snippet,
                but also creates the full sequence that can be dumped to a
                file using --dump-object-to-disk.

              • measure: Same as assemble-measured-code, but also runs the
                measurement.
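
       Because each --benchmark-phase choice includes the work of the one
       before it, the choices form an ordered pipeline. The following Python
       sketch only illustrates that ordering; the function name is
       hypothetical and nothing here calls the tool.

```python
# Phase names in the order given by the option description.
PHASES = [
    "prepare-snippet",
    "prepare-and-assemble-snippet",
    "assemble-measured-code",
    "measure",
]


def phases_run(stop_at: str) -> list:
    """Return every phase whose work is done when stopping at stop_at."""
    return PHASES[: PHASES.index(stop_at) + 1]


print(phases_run("assemble-measured-code"))
```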

       -x86-lbr-sample-period=<nBranches/sample>
              Specify the LBR sampling period - how many branches before we
              take a sample. When a positive value is specified for this
              option and when the mode is latency, we will use LBRs for
              measuring. On choosing the "right" sampling period, a small
              value is preferred, but throttling could occur if the sampling
              is too frequent. A prime number should be used to avoid
              consistently skipping certain blocks.
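
       Since a prime sampling period is recommended, a small helper can round
       a candidate period up to the next prime. This Python sketch uses
       hypothetical function names, and the starting value of 200 is an
       arbitrary illustration rather than a recommended period.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small periods."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def next_prime(n: int) -> int:
    """Return the smallest prime that is greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n


print(next_prime(200))  # 211
```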

       -x86-disable-upper-sse-registers
              Using the upper xmm registers (xmm8-xmm15) forces a longer
              instruction encoding which may put greater pressure on the
              frontend fetch and decode stages, potentially reducing the rate
              that instructions are dispatched to the backend, particularly on
              older hardware. Comparing baseline results with this mode
              enabled can help determine the effects of the frontend and can
              be used to improve latency and throughput estimates.

       -repetition-mode=[duplicate|loop|min]
              Specify the repetition mode. duplicate will create a large,
              straight line basic block with num-repetitions instructions
              (repeating the snippet num-repetitions/snippet size times). loop
              will, optionally, duplicate the snippet until the loop body
              contains at least loop-body-size instructions, and then wrap the
              result in a loop which will execute num-repetitions instructions
              (thus, again, repeating the snippet num-repetitions/snippet size
              times). The loop mode, especially with loop unrolling, tends to
              better hide the effects of the CPU frontend on architectures
              that cache decoded instructions, but consumes a register for
              counting iterations. If performing an analysis over many
              opcodes, it may be best to instead use the min mode, which will
              run each other mode, and produce the minimal measured result.

       -num-repetitions=<Number of repetitions>
              Specify the target number of executed instructions. Note that
              the actual repetition count of the snippet will be
              num-repetitions/snippet size. Higher values lead to more
              accurate measurements but lengthen the benchmark.

       -loop-body-size=<Preferred loop body size>
              Only effective for -repetition-mode=[loop|min]. Instead of
              looping over the snippet directly, first duplicate it so that
              the loop body contains at least this many instructions. This
              potentially results in the loop body being cached in the CPU Op
              Cache / Loop Cache, which may have higher throughput than the
              CPU decoders.
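
       The arithmetic behind -num-repetitions and -loop-body-size can be
       sketched in Python as follows. The function names are hypothetical,
       and rounding up wherever a division is not exact is an assumption
       about the tool's behaviour, not something this page states.

```python
import math


def duplicate_copies(num_repetitions: int, snippet_size: int) -> int:
    """Snippet copies in duplicate mode: num-repetitions / snippet size."""
    return math.ceil(num_repetitions / snippet_size)


def loop_plan(num_repetitions: int, snippet_size: int, loop_body_size: int):
    """Return (snippet copies per loop body, loop iterations) for loop mode."""
    # Duplicate the snippet until the body holds at least loop_body_size
    # instructions.
    copies = max(1, math.ceil(loop_body_size / snippet_size))
    body_instructions = copies * snippet_size
    # Iterate until roughly num_repetitions instructions have executed.
    iterations = math.ceil(num_repetitions / body_instructions)
    return copies, iterations


print(duplicate_copies(10000, 3))  # 3334
print(loop_plan(10000, 3, 100))    # (34, 99)
```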

       -max-configs-per-opcode=<value>
              Specify the maximum configurations that can be generated for
              each opcode. By default this is 1, meaning that we assume that
              a single measurement is enough to characterize an opcode. This
              might not be true of all instructions: for example, the
              performance characteristics of the LEA instruction on X86 depend
              on the value of assigned registers and immediates. Setting a
              value of -max-configs-per-opcode larger than 1 allows
              llvm-exegesis to explore more configurations to discover if some
              register or immediate assignments lead to different performance
              characteristics.

       -benchmarks-file=</path/to/file>
              File to read (analysis mode) or write
              (latency/uops/inverse_throughput modes) benchmark results. "-"
              uses stdin/stdout.

       -analysis-clusters-output-file=</path/to/file>
              If provided, write the analysis clusters as CSV to this file.
              "-" prints to stdout. By default, this analysis is not run.

       -analysis-inconsistencies-output-file=</path/to/file>
              If non-empty, write inconsistencies found during analysis to
              this file. "-" prints to stdout. By default, this analysis is
              not run.

       -analysis-filter=[all|reg-only|mem-only]
              By default, all benchmark results are analysed, but sometimes it
              may be useful to only look at those that do not involve memory,
              or vice versa. This option allows you to either keep all
              benchmarks, or filter out (ignore) either all the ones that do
              involve memory (involve instructions that may read or write to
              memory), or the opposite, to only keep such benchmarks.

       -analysis-clustering=[dbscan,naive]
              Specify the clustering algorithm to use. By default DBSCAN will
              be used. The naive clustering algorithm is better for doing
              further work on the -analysis-inconsistencies-output-file=
              output: it will create one cluster per opcode, and check that
              the cluster is stable (all points are neighbours).
336       -analysis-numpoints=<dbscan numPoints parameter>
337              Specify the numPoints parameters to be used for DBSCAN  cluster‐
338              ing (analysis mode, DBSCAN only).
339
340       -analysis-clustering-epsilon=<dbscan epsilon parameter>
341              Specify  the  epsilon parameter used for clustering of benchmark
342              points (analysis mode).
343
344       -analysis-inconsistency-epsilon=<epsilon>
345              Specify the epsilon parameter used for  detection  of  when  the
346              cluster  is  different  from  the  LLVM  schedule profile values
347              (analysis mode).
348
349       -analysis-display-unstable-clusters
350              If there is more than one benchmark for an opcode,  said  bench‐
351              marks  may  end  up not being clustered into the same cluster if
352              the measured performance characteristics are different.  by  de‐
353              fault all such opcodes are filtered out.  This flag will instead
354              show only such unstable opcodes.
355
356       -ignore-invalid-sched-class=false
357              If set, ignore instructions that  do  not  have  a  sched  class
358              (class idx = 0).
359
360       -mtriple=<triple name>
361              Target triple. See -version for available targets.
362
363       -mcpu=<cpu name>
364              If  set,  measure the cpu characteristics using the counters for
365              this CPU. This is useful when creating  new  sched  models  (the
366              host CPU is unknown to LLVM).  (-mcpu=help for details)
367
368       --analysis-override-benchmark-triple-and-cpu
369              By  default,  llvm-exegesis  will analyze the benchmarks for the
370              triple/CPU they were measured for, but if you  want  to  analyze
371              them  for some other combination (specified via -mtriple/-mcpu),
372              you can pass this flag.
373
374       --dump-object-to-disk=true
375              If set,  llvm-exegesis will dump the generated code to a  tempo‐
376              rary file to enable code inspection. Disabled by default.
377
378       --use-dummy-perf-counters
379              If  set,  llvm-exegesis will not read any real performance coun‐
380              ters and return a dummy value instead. This can be used  to  en‐
381              sure  a snippet doesn't crash when hardware performance counters
382              are unavailable and for debugging llvm-exegesis itself.
383
384       --execution-mode=[inprocess,subprocess]
385              This option specifies what execution mode to use. The  inprocess
386              execution mode is the default. The subprocess execution mode al‐
387              lows for additional features such as memory annotations  but  is
388              currently restricted to X86-64 on Linux.
389

EXIT STATUS

       llvm-exegesis returns 0 on success. Otherwise, an error message is
       printed to standard error, and the tool returns a non-zero value.
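
       A wrapper script can rely on this convention to detect failures. The
       Python sketch below uses a hypothetical helper name and works for any
       command line; it surfaces the message printed to standard error when
       the exit status is not 0.

```python
import subprocess


def run_checked(argv: list) -> str:
    """Run a command, returning its stdout or raising on a non-zero status."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        # The error message is expected on standard error.
        raise RuntimeError(
            f"{argv[0]} failed with status {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout
```

       For example, run_checked(["llvm-exegesis", "-mode=latency",
       "-opcode-name=ADD64rr"]) would return the YAML result or raise an
       error carrying the tool's message.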

AUTHOR

       Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT

       2003-2023, LLVM Project

17                                2023-11-28                  LLVM-EXEGESIS(1)