LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)



NAME

       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS

       llvm-mca [options] [input]

DESCRIPTION

       llvm-mca is a performance analysis tool that uses information available
       in LLVM (e.g. scheduling models) to statically measure the  performance
       of machine code in a specific CPU.

       Performance is measured in terms of throughput as well as processor re‐
       source consumption. The tool currently  works  for  processors  with  a
       backend for which there is a scheduling model available in LLVM.

       The  main  goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help  with  diagnosing
       potential performance issues.

       Given  an  assembly  code sequence, llvm-mca estimates the Instructions
       Per Cycle (IPC), as well as hardware resource  pressure.  The  analysis
       and reporting style were inspired by the IACA tool from Intel.

       For example, you can compile code with clang, output assembly, and pipe
       it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

       (llvm-mca detects Intel syntax by the presence of an .intel_syntax  di‐
       rective  at  the  beginning of the input.  By default its output syntax
       matches that of its input.)

       Scheduling models are not just used to  compute  instruction  latencies
       and  throughput,  but  also  to understand what processor resources are
       available and how to simulate them.

       By design, the quality of the analysis conducted  by  llvm-mca  is  in‐
       evitably affected by the quality of the scheduling models in LLVM.

       If you see that the performance report is not accurate for a processor,
       please file a bug against the appropriate backend.

OPTIONS

       If input is "-" or omitted, llvm-mca reads from standard input.  Other‐
       wise, it will read from the specified filename.

       If  the  -o  option  is  omitted, then llvm-mca will send its output to
       standard output if the input is from standard input.  If the -o  option
       specifies "-", then the output will also be sent to standard output.

       -help  Print a summary of command line options.

       -o <filename>
              Use <filename> as the output filename. See the summary above for
              more details.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code.  It  de‐
              faults to the host default target.

       -mcpu=<cpuname>
              Specify  the  processor  for  which to analyze the code.  By de‐
              fault, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated  by
              the tool.  On x86, possible values are [0, 1]. A value of 0
              selects the AT&T assembly format, while a value of 1 selects the
              Intel assembly format for the code printed out by the tool in
              the analysis report.

       -print-imm-hex
              Prefer hex format for numeric literals in  the  output  assembly
              printed as part of the report.

       -dispatch=<width>
              Specify  a  different dispatch width for the processor. The dis‐
              patch width defaults to  field  'IssueWidth'  in  the  processor
              scheduling  model.   If width is zero, then the default dispatch
              width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this flag
              limits  how  many  physical registers are available for register
              renaming purposes. A value of zero for this flag  means  "unlim‐
              ited number of physical registers".

       -iterations=<number of iterations>
              Specify  the number of iterations to run. If this flag is set to
              0, then the tool sets the number  of  iterations  to  a  default
              value (i.e. 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias. This
              is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store  unit  emu‐
              lated by the tool.  By default, the tool assumes an unbounded
              number of entries in the load queue.  A value of zero for this
              flag is ignored, and the default load queue size is used
              instead.

       -squeue=<store queue size>
              Specify  the size of the store queue in the load/store unit emu‐
              lated by the tool. By default, the tool assumes an unbounded
              number of entries in the store queue. A value of zero for this
              flag is ignored, and the default store queue size is used
              instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view. By
              default, the timeline view prints information for up to 10 iter‐
              ations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view, or use 0 for no
              limit. By default, the number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view. This is enabled by default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable  extra  dispatch  statistics. This view collects and ana‐
              lyzes instruction dispatch events,  as  well  as  static/dynamic
              dispatch stall events. This view is disabled by default.

       -scheduler-stats
              Enable  extra  scheduler statistics. This view collects and ana‐
              lyzes instruction issue events. This view  is  disabled  by  de‐
              fault.

       -retire-stats
              Enable  extra  retire control unit statistics. This view is dis‐
              abled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -show-encoding
              Enable the printing of instruction encodings within the instruc‐
              tion info view.

       -all-stats
              Print all hardware statistics. This enables extra statistics re‐
              lated to the dispatch logic, the hardware schedulers, the regis‐
              ter  file(s),  and  the retire control unit. This option is dis‐
              abled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Prints resource pressure information based on the static  infor‐
              mation available from the processor model. This differs from the
              resource pressure view because it doesn't require that the  code
              is  simulated. It instead prints the theoretical uniform distri‐
              bution of resource pressure for every instruction in sequence.

       -bottleneck-analysis
              Print information about bottlenecks that affect the  throughput.
              This  analysis  can be expensive, and it is disabled by default.
              Bottlenecks are highlighted  in  the  summary  view.  Bottleneck
              analysis  is  currently  not  supported  for  processors with an
              in-order backend.

       -json  Print the requested views in valid JSON format. The instructions
              and  the  processor  resources are printed as members of special
              top level JSON objects.  The individual views refer to  them  by
              index. However, not all views are currently supported. For exam‐
              ple, the report from the bottleneck analysis is not printed  out
              in JSON. All the default views are currently supported.

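       The index-based cross-referencing described above can be sketched as
       follows. Note that the key names in this miniature report are an
       assumption for illustration only; consult the actual -json output for
       the real schema.

```python
import json

# Hypothetical miniature of a -json report: instructions and resources
# live in top-level arrays, and views refer to them by index.
report_text = json.dumps({
    "Instructions": ["vmulps %xmm0, %xmm1, %xmm2",
                     "vhaddps %xmm2, %xmm2, %xmm3"],
    "Resources": ["JFPU0", "JFPU1"],
})

report = json.loads(report_text)

def instruction_by_index(report, idx):
    # Resolve an index stored by a view back to the instruction text.
    return report["Instructions"][idx]

print(instruction_by_index(report, 1))
```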
       -disable-cb
              Force usage of the generic CustomBehaviour class rather than us‐
              ing the target specific class. The generic class  never  detects
              any custom hazards.

EXIT STATUS

       llvm-mca  returns  0 on success. Otherwise, an error message is printed
       to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS

       llvm-mca allows for the optional usage of special code comments to mark
       regions  of  the assembly code to be analyzed.  A comment starting with
       substring LLVM-MCA-BEGIN marks the beginning of a code region.  A  com‐
       ment  starting  with substring LLVM-MCA-END marks the end of a code re‐
       gion.  For example:

          # LLVM-MCA-BEGIN
            ...
          # LLVM-MCA-END

       If no user-defined region is specified, then llvm-mca assumes a default
       region  which  contains every instruction in the input file.  Every re‐
       gion is analyzed in isolation, and the final performance report is  the
       union of all the reports generated for every code region.

       Code regions can have names. For example:

          # LLVM-MCA-BEGIN A simple example
            add %eax, %eax
          # LLVM-MCA-END

       The  code from the example above defines a region named "A simple exam‐
       ple" with a single instruction in it. Note how the region name  doesn't
       have  to  be  repeated in the LLVM-MCA-END directive. In the absence of
       overlapping regions, an anonymous LLVM-MCA-END  directive  always  ends
       the currently active user defined region.

       Example of nesting regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END bar
          # LLVM-MCA-END foo

       Example of overlapping regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END foo
            add %eax, %edx
          # LLVM-MCA-END bar

       Note  that multiple anonymous regions cannot overlap. Also, overlapping
       regions cannot have the same name.

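       The marker rules above can be sketched as a small parser. This is a
       simplified model for illustration, not the actual llvm-mca
       implementation; function and variable names are invented here.

```python
def find_regions(asm_lines):
    """Collect (name, instructions) regions from marked assembly.

    Models the rules described above: an instruction belongs to every
    currently open region, and an anonymous LLVM-MCA-END closes the
    most recently opened region.
    """
    open_stack = []   # names of currently open regions, in open order
    collected = {}    # open region name -> instructions seen so far
    regions = []      # finished (name, instructions) pairs
    for line in asm_lines:
        text = line.strip()
        if text.startswith("# LLVM-MCA-BEGIN"):
            name = text[len("# LLVM-MCA-BEGIN"):].strip()
            open_stack.append(name)
            collected[name] = []
        elif text.startswith("# LLVM-MCA-END"):
            name = text[len("# LLVM-MCA-END"):].strip()
            if not name:                  # anonymous END: close current region
                name = open_stack[-1]
            open_stack.remove(name)
            regions.append((name, collected.pop(name)))
        else:
            for name in collected:        # instruction joins every open region
                collected[name].append(text)
    return regions
```

       Running this over the nesting example above yields region "bar" with
       only the sub instruction, and region "foo" with both instructions.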
       There is no support for marking regions from  high-level  source  code,
       like C or C++. As a workaround, inline assembly directives may be used:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo");
            a += 42;
            __asm volatile("# LLVM-MCA-END");
            a *= b;
            return a;
          }

       However, this interferes with optimizations like loop vectorization and
       may have an impact on the code generated. This  is  because  the  __asm
       statements  are  seen as real code having important side effects, which
       limits how the code around them can be transformed. If  users  want  to
       make use of inline assembly to emit markers, then the recommendation is
       to always verify that the output assembly is equivalent to the assembly
       generated  in  the absence of markers.  The Clang options to emit opti‐
       mization reports can also help in detecting missed optimizations.
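       One way to automate that check is to strip the marker comment lines
       from the instrumented assembly and compare the result against a build
       without markers. A minimal sketch (the assembly strings here are toy
       stand-ins for 'clang -S' output):

```python
def strip_mca_markers(asm_text):
    """Remove LLVM-MCA marker comment lines so two assembly dumps can
    be compared for equivalence."""
    kept = [line for line in asm_text.splitlines()
            if "LLVM-MCA-BEGIN" not in line and "LLVM-MCA-END" not in line]
    return "\n".join(kept)

# Toy stand-ins for compiler output with and without markers.
with_markers = ("foo:\n\t# LLVM-MCA-BEGIN foo\n"
                "\taddl $42, %edi\n\t# LLVM-MCA-END\n\tretq")
without_markers = "foo:\n\taddl $42, %edi\n\tretq"

# If this prints False, the markers perturbed code generation.
print(strip_mca_markers(with_markers) == without_markers)
```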

HOW LLVM-MCA WORKS

       llvm-mca takes assembly code as input. The assembly code is parsed into
       a sequence of MCInst with the help of the existing LLVM target assembly
       parsers. The parsed sequence of MCInst is then analyzed by  a  Pipeline
       module to generate a performance report.

       The  Pipeline  module  simulates  the execution of the machine code se‐
       quence in a loop of iterations (default is 100). During  this  process,
       the  pipeline collects a number of execution related statistics. At the
       end of this process, the pipeline generates and prints  a  report  from
       the collected statistics.

       Here  is an example of a performance report generated by the tool for a
       dot-product of two packed float vectors of four elements. The  analysis
       is  conducted  for target x86, cpu btver2.  This result can be produced
       via  the  following  command  using  the  example  located  at
       test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Total uOps:        900

          Dispatch Width:    2
          uOps Per Cycle:    1.48
          IPC:               1.48
          Block RThroughput: 2.0


          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps      %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps     %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps     %xmm3, %xmm3, %xmm4


          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL


          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps      %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm3, %xmm3, %xmm4

       According  to this report, the dot-product kernel has been executed 300
       times, for a total of 900 simulated instructions. The total  number  of
       simulated micro opcodes (uOps) is also 900.

       The  report  is  structured  in three main sections.  The first section
       collects a few performance numbers; the goal of this section is to give
       a  very quick overview of the performance throughput. Important perfor‐
       mance indicators are IPC, uOps Per Cycle, and  Block RThroughput (Block
       Reciprocal Throughput).

       Field  DispatchWidth  is  the  maximum number of micro opcodes that are
       dispatched to the out-of-order backend every simulated cycle. For  pro‐
       cessors  with  an in-order backend, DispatchWidth is the maximum number
       of micro opcodes issued to the backend every simulated cycle.

       IPC is computed by dividing the total number of simulated instructions
       by the total number of cycles.

       Field  Block  RThroughput  is  the  reciprocal of the block throughput.
       Block throughput is a theoretical quantity computed as the maximum num‐
       ber  of  blocks  (i.e.  iterations)  that can be executed per simulated
       clock cycle in the absence of loop carried dependencies. Block through‐
       put is bounded from above by the dispatch rate, and the availability of
       hardware resources.

       In the absence of loop-carried  data  dependencies,  the  observed  IPC
       tends  to  a  theoretical maximum which can be computed by dividing the
       number of instructions of a single iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is computed by dividing the total number of sim‐
       ulated micro opcodes by the total number of cycles. A delta between
       Dispatch Width and this field is an indicator of a performance  issue.
       In the  absence  of loop-carried data dependencies, the observed 'uOps
       Per Cycle' should tend to a theoretical maximum  throughput  which  can
       be computed  by  dividing  the number of uOps of a single iteration by
       the Block RThroughput.

       Field uOps Per Cycle is bounded from above by the dispatch width.  That
       is  because  the  dispatch  width limits the maximum size of a dispatch
       group. Both IPC and 'uOps Per Cycle' are limited by the amount of hard‐
       ware  parallelism.  The  availability of hardware resources affects the
       resource pressure distribution, and it limits the  number  of  instruc‐
       tions  that  can  be executed in parallel every cycle.  A delta between
       Dispatch Width and the theoretical maximum uOps per Cycle (computed  by
       dividing  the  number  of  uOps  of  a  single  iteration  by the Block
       RThroughput) is an indicator of a performance bottleneck caused by  the
       lack  of hardware resources.  In general, the lower the Block RThrough‐
       put, the better.

       In this example, uOps per iteration/Block RThroughput  is  1.50.  Since
       there  are no loop-carried dependencies, the observed uOps Per Cycle is
       expected to approach 1.50 when the number of iterations tends to infin‐
       ity.  The  delta between the Dispatch Width (2.00), and the theoretical
       maximum uOp throughput (1.50) is an indicator of a performance  bottle‐
       neck  caused  by the lack of hardware resources, and the Resource pres‐
       sure view can help to identify the problematic resource usage.

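       The arithmetic behind these summary fields can be reproduced directly
       from the numbers in the report above:

```python
# Values taken from the summary section of the report.
instructions = 900      # total simulated instructions
uops = 900              # total simulated micro opcodes
cycles = 610            # total simulated cycles
uops_per_iteration = 3  # three instructions, one uOp each
block_rthroughput = 2.0
dispatch_width = 2

ipc = instructions / cycles            # 900 / 610 ~= 1.48
uops_per_cycle = uops / cycles         # same here, since uOps == instructions
max_uop_throughput = uops_per_iteration / block_rthroughput  # 1.50

# The gap between dispatch_width (2) and max_uop_throughput (1.50)
# is the resource bottleneck discussed above.
print(round(ipc, 2), round(uops_per_cycle, 2), max_uop_throughput)
```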
       The second section of the report is the instruction info view. It shows
       the  latency  and reciprocal throughput of every instruction in the se‐
       quence. It also reports extra information related to the number of  mi‐
       cro  opcodes,  and  opcode properties (i.e., 'MayLoad', 'MayStore', and
       'HasSideEffects').

       Field RThroughput is the  reciprocal  of  the  instruction  throughput.
       Throughput  is computed as the maximum number of instructions of a same
       type that can be executed per clock cycle in the absence of operand de‐
       pendencies.  In  this  example,  the  reciprocal throughput of a vector
       float multiply is 1 cycle per instruction.  That is because the FP mul‐
       tiplier JFPM is only available from pipeline JFPU1.

       Instruction  encodings  are  displayed within the instruction info view
       when flag -show-encoding is specified.

       Below is an example of -show-encoding output for the  dot-product  ker‐
       nel:

          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)
          [7]: Encoding Size

          [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:                    Instructions:
           1      2     1.00                         4     c5 f0 59 d0                   vmulps %xmm0, %xmm1, %xmm2
           1      4     1.00                         4     c5 eb 7c da                   vhaddps        %xmm2, %xmm2, %xmm3
           1      4     1.00                         4     c5 e3 7c e3                   vhaddps        %xmm3, %xmm3, %xmm4

       The  Encoding Size column shows the size in bytes of instructions.  The
       Encodings column shows the actual instruction encodings (byte sequences
       in hex).

       The third section is the Resource pressure view.  This view reports the
       average number of resource cycles consumed every iteration by  instruc‐
       tions  for  every processor resource unit available on the target.  In‐
       formation is structured in two tables. The first table reports the num‐
       ber of resource cycles spent on average every iteration. The second ta‐
       ble correlates the resource cycles to the machine  instruction  in  the
       sequence. For example, every iteration of the instruction vmulps always
       executes on resource unit [6] (JFPU1 -  floating  point  pipeline  #1),
       consuming  an  average of 1 resource cycle per iteration.  Note that on
       AMD Jaguar, vector floating-point multiply can only be issued to  pipe‐
       line  JFPU1,  while horizontal floating-point additions can only be is‐
       sued to pipeline JFPU0.

       The resource pressure view helps with identifying bottlenecks caused by
       high  usage  of  specific hardware resources.  Situations with resource
       pressure mainly concentrated on a few resources should, in general,  be
       avoided.   Ideally,  pressure  should  be uniformly distributed between
       multiple resources.

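       The relationship between the two tables can be checked by hand: summing
       the per-instruction rows reproduces the per-iteration row. A small
       sketch using the nonzero columns of the tables above (dictionary keys
       are the Jaguar unit names from the Resources legend):

```python
# Per-instruction resource pressure from the second table.
pressure = {
    "vmulps":       {"JFPM": 1.0, "JFPU1": 1.0},
    "vhaddps (1)":  {"JFPA": 1.0, "JFPU0": 1.0},
    "vhaddps (2)":  {"JFPA": 1.0, "JFPU0": 1.0},
}

# Summing the rows yields the per-iteration table.
per_iteration = {}
for row in pressure.values():
    for unit, cycles in row.items():
        per_iteration[unit] = per_iteration.get(unit, 0.0) + cycles

print(per_iteration)
```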
   Timeline View
       The timeline view produces a  detailed  report  of  each  instruction's
       state  transitions  through  an instruction pipeline.  This view is en‐
       abled by the command line option -timeline.  As instructions transition
       through  the  various stages of the pipeline, their states are depicted
       in the view report.  These states  are  represented  by  the  following
       characters:

       • D : Instruction dispatched.

       • e : Instruction executing.

       • E : Instruction executed.

       • R : Instruction retired.

       • = : Instruction already dispatched, waiting to be executed.

       • - : Instruction executed, waiting to be retired.

       Below  is the timeline view for a subset of the dot-product example lo‐
       cated in test/tools/llvm-mca/X86/BtVer2/dot-product.s and processed  by
       llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
                 3     3.3    0.5    1.4       <total>

       The  timeline  view  is  interesting because it shows instruction state
       changes during execution.  It also gives an idea of how the  tool  pro‐
       cesses instructions executed on the target, and how their timing infor‐
       mation might be calculated.

       The timeline view is structured in two tables.  The first  table  shows
       instructions  changing state over time (measured in cycles); the second
       table (named Average Wait  times)  reports  useful  timing  statistics,
       which  should help diagnose performance bottlenecks caused by long data
       dependencies and sub-optimal usage of hardware resources.

       An instruction in the timeline view is identified by a pair of indices,
       where  the first index identifies an iteration, and the second index is
       the instruction index (i.e., where it appears in  the  code  sequence).
       Since this example was generated using 3 iterations: -iterations=3, the
       iteration indices range from 0 to 2 inclusive.

       Excluding the first and last column, the remaining columns are  in  cy‐
       cles.  Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       • Instruction [1,0] was dispatched at cycle 1.

       • Instruction [1,0] started executing at cycle 2.

       • Instruction [1,0] reached the write back stage at cycle 4.

       • Instruction [1,0] was retired at cycle 10.

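       These four facts can be read mechanically off the [1,0] row of the
       timeline, since the column position of each state character is the
       simulated cycle number. A minimal sketch (the helper function is
       illustrative, not part of llvm-mca):

```python
def decode_timeline_row(row):
    """Extract key cycle numbers from one timeline row string.

    'D' marks dispatch, the first 'e' the start of execution, 'E' the
    write-back, and 'R' the retire event; str.index gives the column,
    which equals the cycle.
    """
    return {
        "dispatched": row.index("D"),
        "started_executing": row.index("e"),
        "written_back": row.index("E"),
        "retired": row.index("R"),
    }

# Row [1,0] from the timeline above (leading/trailing padding trimmed).
events = decode_timeline_row(".DeeE-----R")
print(events)
```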
       Instruction  [1,0]  (i.e.,  vmulps  from iteration #1) does not have to
       wait in the scheduler's queue for the operands to become available.  By
       the  time  vmulps  is  dispatched,  operands are already available, and
       pipeline JFPU1 is ready to serve another instruction.  So the  instruc‐
       tion  can  be  immediately issued on the JFPU1 pipeline. That is demon‐
       strated by the fact that the instruction only spent 1cy in  the  sched‐
       uler's queue.

       There  is a gap of 5 cycles between the write-back stage and the retire
       event.  That is because instructions must retire in program  order,  so
       [1,0]  has  to wait for [0,2] to be retired first (i.e., it has to wait
       until cycle 10).

       In the example, all instructions are in a RAW (Read After Write) depen‐
       dency  chain.   Register %xmm2 written by vmulps is immediately used by
       the first vhaddps, and register %xmm3 written by the first  vhaddps  is
       used  by  the second vhaddps.  Long data dependencies negatively impact
       the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies  introduced  by
       instructions  from  different  iterations.  However, those dependencies
       can be removed at register renaming stage (at the  cost  of  allocating
       register aliases, and therefore consuming physical registers).

       Table  Average  Wait  times  helps diagnose performance issues that are
       caused by the presence of long  latency  instructions  and  potentially
       long  data  dependencies  which  may  limit the ILP. The last row,
       <total>, shows a global average over all instructions measured. Note
       that llvm-mca, by default, assumes at least 1cy between the dispatch
       event and the issue event.

       When the performance is limited by data dependencies  and/or  long  la‐
       tency instructions, the number of cycles spent while in the ready state
       is expected to be very small when compared with the total number of cy‐
       cles  spent  in  the scheduler's queue.  The difference between the two
       counters is a good indicator of how large of an impact  data  dependen‐
       cies  had  on  the  execution of the instructions.  When performance is
       mostly limited by the lack of hardware resources, the delta between the
       two  counters  is  small.   However,  the number of cycles spent in the
       queue tends to be larger (i.e., more than 1-3cy), especially when  com‐
       pared to other low latency instructions.

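       That heuristic can be sketched as a small classifier over columns [1]
       and [2] of the Average Wait times table. The 1.0-cycle threshold is an
       illustrative assumption, not something llvm-mca defines:

```python
def likely_dependency_bound(avg_in_queue, avg_ready):
    """Heuristic: a large gap between time spent in the scheduler's
    queue and time spent there while ready suggests the instruction
    was waiting on operands (data dependencies) rather than on
    hardware resources."""
    return (avg_in_queue - avg_ready) > 1.0

# Rows from the Average Wait times table above: (queue, ready).
print(likely_dependency_bound(5.7, 0.0))   # second vhaddps: waits on %xmm3
print(likely_dependency_bound(1.0, 1.0))   # vmulps: operands already ready
```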
   Bottleneck Analysis
       The  -bottleneck-analysis  command  line option enables the analysis of
       performance bottlenecks.

       This analysis is potentially expensive. It attempts  to  correlate  in‐
       creases  in  backend pressure (caused by pipeline resource pressure and
       data dependencies) to dynamic dispatch stalls.

       Below  is  an  example  of  -bottleneck-analysis  output  generated  by
       llvm-mca for 500 iterations of the dot-product example on btver2.

          Cycles with backend pressure increase [ 48.07% ]
          Throughput Bottlenecks:
            Resource Pressure       [ 47.77% ]
            - JFPA  [ 47.77% ]
            - JFPU0  [ 47.77% ]
            Data Dependencies:      [ 0.30% ]
            - Register Dependencies [ 0.30% ]
            - Memory Dependencies   [ 0.00% ]

          Critical sequence based on the simulation:

                        Instruction                         Dependency Information
           +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
           |
           |    < loop carried >
           |
           |      0.    vmulps  %xmm0, %xmm1, %xmm2
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
           +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
           |
           |    < loop carried >
           |
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]

       According  to  the analysis, throughput is limited by resource pressure
       and not by data dependencies.  The analysis observed increases in back‐
       end pressure during 48.07% of the simulated run. Almost all those pres‐
       sure increase events were caused by contention on  processor  resources
       JFPA/JFPU0.

       The  critical  sequence  is the most expensive sequence of instructions
       according to the simulation. It is annotated to provide extra  informa‐
       tion  about  critical  register dependencies and resource interferences
       between instructions.

       Instructions from the critical sequence are expected  to  significantly
       impact  performance.  By construction, the accuracy of this analysis is
       strongly dependent on the simulation and (as always) on the quality  of
       the processor model in LLVM.

       Bottleneck  analysis  is currently not supported for processors with an
       in-order backend.

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and perfor‐
       mance  counters  for the dispatch logic, the reorder buffer, the retire
       control unit, and the register file.

       Below is an example of -all-stats output generated by  llvm-mca for 300
       iterations  of  the  dot-product example discussed in the previous sec‐
       tions.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272  (44.6%)
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0


          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)


          Schedulers - number of cycles where we saw N micro opcodes issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)

          Scheduler's queue usage:
          [1] Resource name.
          [2] Average number of used buffer entries.
          [3] Maximum number of used buffer entries.
          [4] Total number of buffer entries.

           [1]            [2]        [3]        [4]
          JALU01           0          0          20
          JFPU01           17         18         18
          JLSAGU           0          0          12


          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
679           2,           399  (65.4%)
680
681          Total ROB Entries:                64
682          Max Used ROB Entries:             35  ( 54.7% )
683          Average Used ROB Entries per cy:  32  ( 50.0% )
684
685
686          Register File statistics:
687          Total number of mappings created:    900
688          Max number of mappings used:         35
689
690          *  Register File #1 -- JFpuPRF:
691             Number of physical registers:     72
692             Total number of mappings created: 900
693             Max number of mappings used:      35
694
695          *  Register File #2 -- JIntegerPRF:
696             Number of physical registers:     64
697             Total number of mappings created: 0
698             Max number of mappings used:      0
699
700       If we look at the Dynamic Dispatch Stall Cycles table, we see that  the
701       counter for SCHEDQ reports 272 cycles.  This counter is incremented ev‐
702       ery time the dispatch logic is unable to dispatch a full group  because
703       the scheduler's queue is full.
704
705       Looking  at the Dispatch Logic table, we see that the pipeline was only
706       able to dispatch two micro opcodes 51.5% of  the  time.   The  dispatch
707       group was limited to one micro opcode 44.6% of the cycles, which corre‐
708       sponds to 272 cycles.  The dispatch statistics are displayed  by  using
709       either the command option -all-stats or -dispatch-stats.
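The meaning of the SCHEDQ counter can be sketched with a toy dispatch loop. This is purely illustrative: the function name, the single bounded queue, and the fixed per-cycle drain rate are all assumptions for the sketch, not llvm-mca internals.

```python
def simulate_dispatch(num_uops, dispatch_width, queue_size, drain_per_cycle):
    """Toy model of dispatch stalls (not llvm-mca's implementation).

    Each cycle, dispatch up to `dispatch_width` micro opcodes into a
    scheduler queue of `queue_size` entries; `drain_per_cycle` entries
    leave the queue each cycle (standing in for instruction issue).
    Returns (total_cycles, schedq_stall_cycles).
    """
    queue = 0          # occupied scheduler-queue entries
    schedq_stalls = 0  # cycles where a full group could not be dispatched
    cycles = 0
    while num_uops > 0:
        free = queue_size - queue
        group = min(dispatch_width, num_uops, free)
        # SCHEDQ: the queue prevented dispatching a full group this cycle.
        if group < min(dispatch_width, num_uops):
            schedq_stalls += 1
        queue += group
        num_uops -= group
        queue = max(0, queue - drain_per_cycle)
        cycles += 1
    return cycles, schedq_stalls
```

When the queue drains more slowly than dispatch fills it, stall cycles accumulate, which is the same effect behind the 272 SCHEDQ cycles reported above.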
710
711       The  next  table, Schedulers, presents a histogram showing, for each N,
712       the number of cycles on which N micro opcodes were issued.   In   this
713       case, of the 610 simulated cycles, a single opcode was issued  on  306
714       cycles (50.2%) and there were 7 cycles where no opcodes  were  issued.
716
717       The  Scheduler's  queue  usage table shows the average and maximum
718       number of buffer entries (i.e., scheduler queue entries) used  at  run‐
719       time.   Resource  JFPU01  reached its maximum (18 of 18 queue entries).
720       Note that AMD Jaguar implements three schedulers:
721
722       • JALU01 - A scheduler for ALU instructions.
723
724       • JFPU01 - A scheduler for floating point operations.
725
726       • JLSAGU - A scheduler for address generation.
727
728       The dot-product is a kernel of three  floating  point  instructions  (a
729       vector  multiply  followed  by two horizontal adds).  That explains why
730       only the floating point scheduler appears to be used.
731
732       A full scheduler queue is either caused by data dependency chains or by
733       a  sub-optimal  usage of hardware resources.  Sometimes, resource pres‐
734       sure can be mitigated by rewriting the kernel using different  instruc‐
735       tions  that  consume  different scheduler resources.  Schedulers with a
736       small queue are less resilient to bottlenecks caused by the presence of
737       long  data dependencies.  The scheduler statistics are displayed by us‐
738       ing the command option -all-stats or -scheduler-stats.
739
740       The next table, Retire Control Unit, presents a histogram displaying  a
741       count,  representing  the number of instructions retired on some number
742       of cycles.  In this case, of the 610 simulated cycles, two instructions
743       were retired during the same cycle 399 times (65.4%) and there were 109
744       cycles where no instructions were retired.  The retire  statistics  are
745       displayed by using the command option -all-stats or -retire-stats.
746
747       The  last  table  presented is Register File statistics.  Each physical
748       register file (PRF) used by the pipeline is presented  in  this  table.
749       In the case of AMD Jaguar, there are two register files, one for float‐
750       ing-point registers (JFpuPRF) and one  for  integer  registers  (JInte‐
751       gerPRF).  The table shows that of the 900 instructions processed, there
752       were 900 mappings created.  Since  this  dot-product  example  utilized
753       only floating point registers, the JFpuPRF was responsible for creating
754       the 900 mappings.  However, we see that the pipeline only used a  maxi‐
755       mum of 35 of 72 available register slots at any given time. We can con‐
756       clude that the floating point PRF was the only register file  used  for
757       the  example, and that it was never resource constrained.  The register
758       file statistics are displayed by using the command option -all-stats or
759       -register-file-stats.
760
761       In this example, we can conclude that the IPC is mostly limited by data
762       dependencies, and not by resource pressure.
763
764   Instruction Flow
765       This section describes the instruction flow through the  default  pipe‐
766       line  of  llvm-mca,  as  well  as  the functional units involved in the
767       process.
768
769       The default pipeline implements the following sequence of  stages  used
770       to process instructions.
771
772       • Dispatch (Instruction is dispatched to the schedulers).
773
774       • Issue (Instruction is issued to the processor pipelines).
775
776       • Write Back (Instruction is executed, and results are written back).
777
778       • Retire  (Instruction  is  retired; writes are architecturally commit‐
779         ted).
780
781       The in-order pipeline implements the following sequence of stages:

782       • InOrderIssue (Instruction is issued to the processor pipelines).

783       • Retire (Instruction is retired; writes are architecturally committed).
784
785       llvm-mca assumes that instructions have all  been  decoded  and  placed
786       into a queue before the simulation starts. Therefore, the  instruction
787       fetch and decode stages are not modeled. Performance bottlenecks in the
788       frontend  are  not diagnosed. Also, llvm-mca does not model branch pre‐
789       diction.
790
791   Instruction Dispatch
792       During the dispatch stage, instructions are  picked  in  program  order
793       from  a queue of already decoded instructions, and dispatched in groups
794       to the simulated hardware schedulers.
795
796       The size of a dispatch group depends on the availability of  the  simu‐
797       lated hardware resources.  The processor dispatch width defaults to the
798       value of the IssueWidth in LLVM's scheduling model.
799
800       An instruction can be dispatched if:
801
802       • The size of the dispatch group is smaller than  the  processor's
803         dispatch width.
804
805       • There are enough entries in the reorder buffer.
806
807       • There are enough physical registers to do register renaming.
808
809       • The schedulers are not full.
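The four dispatch conditions above can be condensed into a small predicate. The parameter names are hypothetical and exist only to mirror the bullet list; this is not llvm-mca's actual API.

```python
def can_dispatch(group_size, dispatch_width, rob_free, uops_needed,
                 prf_free, regs_to_rename, scheduler_full):
    """True if one more instruction may join the current dispatch group."""
    return (group_size < dispatch_width       # room left in the dispatch group
            and rob_free >= uops_needed       # enough reorder-buffer entries
            and prf_free >= regs_to_rename    # enough physical registers to rename
            and not scheduler_full)           # scheduler queue not full
```

Any single failing condition stalls dispatch, which is exactly what the Dynamic Dispatch Stall Cycles counters (RAT, RCU, SCHEDQ, ...) break down per cause.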
810
811       Scheduling  models  can  optionally  specify  which  register files are
812       available on the processor. llvm-mca uses that information to  initial‐
813       ize  register file descriptors.  Users can limit the number of physical
814       registers that are globally available for register  renaming  by  using
815       the  command  option -register-file-size.  A value of zero for this op‐
816       tion means unbounded. By knowing how many registers are  available  for
817       renaming,  the  tool  can predict dispatch stalls caused by the lack of
818       physical registers.
819
820       The number of reorder buffer entries consumed by an instruction depends
821       on  the  number  of micro-opcodes specified for that instruction by the
822       target scheduling model.  The reorder buffer is responsible for  track‐
823       ing  the  progress  of  instructions that are "in-flight", and retiring
824       them in program order.  The number of entries in the reorder buffer de‐
825       faults  to the value specified by field MicroOpBufferSize in the target
826       scheduling model.
827
828       Instructions that are dispatched to the  schedulers  consume  scheduler
829       buffer  entries. llvm-mca queries the scheduling model to determine the
830       set of buffered resources consumed by  an  instruction.   Buffered  re‐
831       sources are treated like scheduler resources.
832
833   Instruction Issue
834       Each  processor  scheduler implements a buffer of instructions.  An in‐
835       struction has to wait in the scheduler's buffer  until  input  register
836       operands  become  available.   Only at that point does the instruction
837       become eligible for execution; it may then  be  issued   (potentially
838       out-of-order)  to the underlying pipelines.  Instruction latencies are
839       computed by llvm-mca with the help of the scheduling model.
840
841       llvm-mca's scheduler is designed to simulate multiple processor  sched‐
842       ulers.   The  scheduler  is responsible for tracking data dependencies,
843       and dynamically selecting which processor resources are consumed by in‐
844       structions.   It  delegates  the management of processor resource units
845       and resource groups to a resource manager.  The resource manager is re‐
846       sponsible  for  selecting  resource units that are consumed by instruc‐
847       tions.  For example, if an  instruction  consumes  1cy  of  a  resource
848       group, the resource manager selects one of the available units from the
849       group; by default, the resource manager uses a round-robin selector  to
850       guarantee  that  resource  usage  is  uniformly distributed between all
851       units of a group.
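The round-robin selection policy can be illustrated with a minimal sketch. The unit names are borrowed from the AMD Jaguar examples above; the class itself is an assumption for illustration, not the resource manager's real code.

```python
import itertools

class RoundRobinGroup:
    """Toy resource group that hands out units in round-robin order,
    so usage is uniformly distributed across all units of the group."""

    def __init__(self, units):
        self._units = itertools.cycle(units)

    def select(self):
        # Pick the next available unit in rotation.
        return next(self._units)

group = RoundRobinGroup(["JFPU0", "JFPU1"])
picks = [group.select() for _ in range(4)]
```

Four consecutive selections alternate between the two units, so an instruction consuming 1cy of the group never favors one pipe over the other.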
852
853       llvm-mca's scheduler internally groups instructions into three sets:
854
855       • WaitSet: a set of instructions whose operands are not ready.
856
857       • ReadySet: a set of instructions ready to execute.
858
859       • IssuedSet: a set of instructions executing.
860
861       Depending on operand availability, instructions  that  are  dispatched
862       to the scheduler are either placed into the WaitSet or into the
863       ReadySet.
864
865       Every cycle, the scheduler checks if instructions can be moved from the
866       WaitSet  to  the ReadySet, and if instructions from the ReadySet can be
867       issued to the underlying pipelines. The algorithm prioritizes older in‐
868       structions over younger instructions.
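One scheduler cycle over these sets might look like the following sketch. The (age, name) tuples and the operands_ready callback are invented for illustration; llvm-mca's real scheduler tracks much more state.

```python
def cycle_update(wait_set, ready_set, issue_width, operands_ready):
    """Move newly-ready instructions from the WaitSet to the ReadySet,
    then issue up to `issue_width` of them, oldest first.

    Instructions are (age, name) tuples; a smaller age means older.
    """
    for instr in list(wait_set):
        if operands_ready(instr):
            wait_set.remove(instr)
            ready_set.append(instr)
    ready_set.sort()                 # prioritize older instructions
    issued = ready_set[:issue_width]
    del ready_set[:issue_width]      # issued instructions leave the ReadySet
    return issued
```

With issue_width=2 and every operand ready, the two oldest instructions issue first regardless of the order in which they entered the ReadySet.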
869
870   Write-Back and Retire Stage
871       Issued  instructions  are  moved  from  the  ReadySet to the IssuedSet.
872       There, instructions wait until they reach  the  write-back  stage.   At
873       that point, they get removed from the queue and the retire control unit
874       is notified.
875
876       When an instruction is executed, the retire control unit flags  it  as
877       "ready to retire."
878
879       Instructions  are retired in program order.  The register file is noti‐
880       fied of the retirement so that it can free the physical registers  that
881       were allocated for the instruction during the register renaming stage.
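In-order retirement amounts to draining the reorder buffer from its oldest entry, as in this illustrative model (the function and its arguments are assumptions for the sketch, not llvm-mca code):

```python
def retire_in_order(rob, executed):
    """Retire instructions from the front of the reorder buffer (program
    order), stopping at the first one that has not finished executing.
    Retiring an entry is the point where its renamed registers would be
    freed back to the register file."""
    retired = []
    while rob and rob[0] in executed:
        retired.append(rob.pop(0))
    return retired
```

Note that a finished younger instruction cannot retire past an unfinished older one: with "i1" still executing, "i2" stays in the buffer even though it is done.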
882
883   Load/Store Unit and Memory Consistency Model
884       To  model  the  out-of-order execution of memory operations, llvm-mca
885       utilizes a simulated load/store unit (LSUnit) that tracks the  specula‐
886       tive execution of loads and stores.
887
888       Each  load  (or  store) consumes an entry in the load (or store) queue.
889       Users can specify flags -lqueue and -squeue to limit the number of  en‐
890       tries  in  the  load  and store queues respectively. The queues are un‐
891       bounded by default.
892
893       The LSUnit implements a relaxed consistency model for memory loads  and
894       stores.  The rules are:
895
896       1. A younger load is allowed to pass an older load only if there are no
897          intervening stores or barriers between the two loads.
898
899       2. A younger load is allowed to pass an older store provided  that  the
900          load does not alias with the store.
901
902       3. A younger store is not allowed to pass an older store.
903
904       4. A younger store is not allowed to pass an older load.
905
906       By  default,  the LSUnit optimistically assumes that loads do not alias
907       with store operations (-noalias=true).  Under this assumption,  younger
908       loads are always allowed to pass older stores.  Essentially, the LSUnit
909       does not attempt to run any alias analysis to predict when  loads  and
910       stores do not alias with each other.
911
912       Note  that,  in the case of write-combining memory, rule 3 could be re‐
913       laxed to allow reordering of non-aliasing store operations.  That being
914       said,  at the moment, there is no way to further relax the memory model
915       (-noalias is the only option).  Essentially,  there  is  no  option  to
916       specify  a  different  memory  type (e.g., write-back, write-combining,
917       write-through; etc.) and consequently to  weaken,  or  strengthen,  the
918       memory model.
919
920       Other limitations are:
921
922       • The LSUnit does not know when store-to-load forwarding may occur.
923
924       • The  LSUnit  does  not know anything about cache hierarchy and memory
925         types.
926
927       • The LSUnit does not know how to identify serializing  operations  and
928         memory fences.
929
930       The  LSUnit  does  not  attempt  to  predict if a load or store hits or
931       misses the L1 cache.  It only knows if an instruction "MayLoad"  and/or
932       "MayStore."   For  loads, the scheduling model provides an "optimistic"
933       load-to-use latency (which usually matches the load-to-use latency  for
934       when there is a hit in the L1D).
935
936       llvm-mca  does  not know about serializing operations or memory-barrier
937       like instructions.  The LSUnit conservatively assumes that an  instruc‐
938       tion which has both "MayLoad" and unmodeled side effects behaves like a
939       "soft" load-barrier.  That means, it serializes loads without forcing a
940       flush  of  the load queue.  Similarly, instructions that "MayStore" and
941       have unmodeled side effects are treated like store  barriers.   A  full
942       memory barrier is a "MayLoad" and "MayStore" instruction with unmodeled
943       side effects.  This is inaccurate, but it is the best that we can do at
944       the moment with the current information available in LLVM.
945
946       A  load/store  barrier  consumes  one entry of the load/store queue.  A
947       load/store barrier enforces ordering of loads/stores.  A  younger  load
948       cannot  pass a load barrier.  Also, a younger store cannot pass a store
949       barrier.  A younger load has to wait for the memory/load barrier to ex‐
950       ecute.   A  load/store barrier is "executed" when it becomes the oldest
951       entry in the load/store queue(s). That also means, by construction, all
952       of the older loads/stores have been executed.
953
954       In conclusion, the full set of load/store consistency rules is:
955
956       1. A store may not pass a previous store.
957
958       2. A store may not pass a previous load (regardless of -noalias).
959
960       3. A store has to wait until an older store barrier is fully executed.
961
962       4. A load may pass a previous load.
963
964       5. A load may not pass a previous store unless -noalias is set.
965
966       6. A load has to wait until an older load barrier is fully executed.
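The six rules can be condensed into a single predicate. This is a sketch that assumes simple string tags for the operation kinds; it is not the LSUnit's actual implementation.

```python
def may_pass(younger, older, noalias=False, barrier_executed=True):
    """May a younger memory operation execute before an older one?

    `younger` is 'load' or 'store'; `older` additionally allows
    'load_barrier' and 'store_barrier'.
    """
    if younger == 'store':
        if older in ('store', 'load'):
            return False                  # rules 1 and 2
        if older == 'store_barrier':
            return barrier_executed       # rule 3
    elif younger == 'load':
        if older == 'load':
            return True                   # rule 4
        if older == 'store':
            return noalias                # rule 5
        if older == 'load_barrier':
            return barrier_executed       # rule 6
    return True                           # no ordering constraint otherwise
```

Rule 5 is the only one affected by -noalias: with the optimistic default, a younger load may pass an older store; with -noalias=false it must wait.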
967
968   In-order Issue and Execute
969       In-order  processors  are modeled as a single InOrderIssueStage; it
970       bypasses the Dispatch, Scheduler, and Load/Store units.  Instructions are
971       issued  as  soon  as their operand registers are available and resource
972       requirements are met. Multiple instructions can be issued in one  cycle
973       according to the value of the IssueWidth parameter in LLVM's scheduling
974       model.
975
976       Once issued, an instruction is moved to the IssuedInst set until it  is
977       ready  to  retire. llvm-mca ensures that writes are committed in-order.
978       However,  an  instruction  is  allowed  to  commit  writes  and  retire
979       out-of-order if the RetireOOO property is true for at least one of  its
980       writes.
981
982   Custom Behaviour
983       Because certain instructions are not expressed perfectly within  their
984       scheduling  model,  llvm-mca  isn't  always able to simulate them accu‐
985       rately. Modifying the scheduling model isn't  always  a  viable  option
986       though (maybe because the instruction is modeled incorrectly on purpose
987       or the instruction's behaviour is quite complex).  The  CustomBehaviour
988       class can be used in these cases to enforce proper instruction modeling
989       (often by customizing data  dependencies  and  detecting  hazards  that
990       llvm-mca has no way of knowing about).
991
992       llvm-mca  comes with one generic and multiple target specific CustomBe‐
993       haviour classes. The generic class will be used if the -disable-cb flag
994       is used or if a target specific CustomBehaviour class doesn't exist for
995       that target. (The generic class does nothing.) Currently, the CustomBe‐
996       haviour  class  is  only a part of the in-order pipeline, but there are
997       plans to add it to the out-of-order pipeline in the future.
998
999       CustomBehaviour's main method is  checkCustomHazard()  which  uses  the
1000       current  instruction  and  a  list  of all instructions still executing
1001       within the pipeline to determine if the current instruction  should  be
1002       dispatched.   As output, the method returns an integer representing the
1003       number of cycles that the current instruction must stall for (this  can
1004       be an underestimate if the exact number isn't known; a  value  of  0
1005       represents no stall).
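The checkCustomHazard() contract can be illustrated with a Python sketch. The real interface is a C++ class method; the Instr record and the "a second divide must wait for the first" hazard below are entirely made up for demonstration.

```python
from collections import namedtuple

# Hypothetical instruction record, just for this sketch.
Instr = namedtuple("Instr", ["opcode", "cycles_left"])

def check_custom_hazard(executing, current):
    """Return the number of stall cycles for `current` (0 = no stall),
    mirroring the integer contract of checkCustomHazard()."""
    for instr in executing:
        # Invented hazard: a second divide waits for the in-flight one.
        if instr.opcode == "div" and current.opcode == "div":
            return instr.cycles_left
    return 0
```

The returned count is consumed by the pipeline as a dispatch stall; returning 0 lets the instruction proceed normally.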
1006
1007       If you'd like to add a CustomBehaviour class for a target that  doesn't
1008       already have one, refer to an existing implementation to see how to set
1009       it up. Remember to look at (and add to) /llvm-mca/lib/CMakeLists.txt.
1010

AUTHOR
1012       Maintained by the LLVM Team (https://llvm.org/).
1013
COPYRIGHT
1015       2003-2023, LLVM Project
1016
1017
1018
1019
13                                2023-07-20                       LLVM-MCA(1)