LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)

NAME

       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS

       llvm-mca [options] [input]

DESCRIPTION

       llvm-mca is a performance analysis tool that uses information available
       in LLVM (e.g. scheduling models) to statically measure the performance
       of machine code on a specific CPU.

       Performance is measured in terms of throughput as well as processor
       resource consumption. The tool currently works for processors with a
       backend for which there is a scheduling model available in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help with diagnosing
       potential performance issues.

       Given an assembly code sequence, llvm-mca estimates the Instructions
       Per Cycle (IPC), as well as hardware resource pressure. The analysis
       and reporting style were inspired by the IACA tool from Intel.

       For example, you can compile code with clang, output assembly, and pipe
       it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

       (llvm-mca detects Intel syntax by the presence of an .intel_syntax
       directive at the beginning of the input. By default its output syntax
       matches that of its input.)

       Scheduling models are not just used to compute instruction latencies
       and throughput, but also to understand what processor resources are
       available and how to simulate them.

       By design, the quality of the analysis conducted by llvm-mca is
       inevitably affected by the quality of the scheduling models in LLVM.

       If you see that the performance report is not accurate for a processor,
       please file a bug against the appropriate backend.

OPTIONS

       If input is "-" or omitted, llvm-mca reads from standard input.
       Otherwise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca will send its output to
       standard output if the input is from standard input. If the -o option
       specifies "-", then the output will also be sent to standard output.

       -help  Print a summary of command line options.

       -o <filename>
              Use <filename> as the output filename. See the summary above for
              more details.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code. It
              defaults to the host default target.

       -mcpu=<cpuname>
              Specify the processor for which to analyze the code. By
              default, the CPU name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated by
              the tool. On x86, possible values are [0, 1]. A value of 0
              selects the AT&T assembly format, and a value of 1 selects the
              Intel assembly format, for the code printed out by the tool in
              the analysis report.

       -print-imm-hex
              Prefer hex format for numeric literals in the output assembly
              printed as part of the report.

       -dispatch=<width>
              Specify a different dispatch width for the processor. The
              dispatch width defaults to field 'IssueWidth' in the processor
              scheduling model. If width is zero, then the default dispatch
              width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this flag
              limits how many physical registers are available for register
              renaming purposes. A value of zero for this flag means
              "unlimited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run. If this flag is set to
              0, then the tool sets the number of iterations to a default
              value (i.e. 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias. This
              is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool. By default, the tool assumes an unbounded
              number of entries in the load queue. A value of zero for this
              flag is ignored, and the default load queue size is used instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an unbounded
              number of entries in the store queue. A value of zero for this
              flag is ignored, and the default store queue size is used
              instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view. By
              default, the timeline view prints information for up to 10
              iterations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view, or use 0 for no
              limit. By default, the number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view. This is enabled by default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics. This view collects and
              analyzes instruction dispatch events, as well as static/dynamic
              dispatch stall events. This view is disabled by default.

       -scheduler-stats
              Enable extra scheduler statistics. This view collects and
              analyzes instruction issue events. This view is disabled by
              default.

       -retire-stats
              Enable extra retire control unit statistics. This view is
              disabled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -show-encoding
              Enable the printing of instruction encodings within the
              instruction info view.

       -show-barriers
              Enable the printing of LoadBarrier and StoreBarrier flags within
              the instruction info view.

       -all-stats
              Print all hardware statistics. This enables extra statistics
              related to the dispatch logic, the hardware schedulers, the
              register file(s), and the retire control unit. This option is
              disabled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Print resource pressure information based on the static
              information available from the processor model. This differs
              from the resource pressure view because it doesn't require that
              the code is simulated. It instead prints the theoretical uniform
              distribution of resource pressure for every instruction in
              sequence.

       -bottleneck-analysis
              Print information about bottlenecks that affect the throughput.
              This analysis can be expensive, and it is disabled by default.
              Bottlenecks are highlighted in the summary view. Bottleneck
              analysis is currently not supported for processors with an
              in-order backend.

       -json  Print the requested views in valid JSON format. The instructions
              and the processor resources are printed as members of special
              top level JSON objects. The individual views refer to them by
              index. However, not all views are currently supported. For
              example, the report from the bottleneck analysis is not printed
              out in JSON. All the default views are currently supported.

       -disable-cb
              Force usage of the generic CustomBehaviour and InstrPostProcess
              classes rather than using the target specific implementation.
              The generic classes never detect any custom hazards or make any
              post processing modifications to instructions.

EXIT STATUS

       llvm-mca returns 0 on success. Otherwise, an error message is printed
       to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS

       llvm-mca allows for the optional usage of special code comments to mark
       regions of the assembly code to be analyzed. A comment starting with
       the substring LLVM-MCA-BEGIN marks the beginning of a code region. A
       comment starting with the substring LLVM-MCA-END marks the end of a
       code region. For example:

          # LLVM-MCA-BEGIN
            ...
          # LLVM-MCA-END

       If no user-defined region is specified, then llvm-mca assumes a default
       region which contains every instruction in the input file. Every
       region is analyzed in isolation, and the final performance report is
       the union of all the reports generated for every code region.

       Code regions can have names. For example:

          # LLVM-MCA-BEGIN A simple example
            add %eax, %eax
          # LLVM-MCA-END

       The code from the example above defines a region named "A simple
       example" with a single instruction in it. Note how the region name
       doesn't have to be repeated in the LLVM-MCA-END directive. In the
       absence of overlapping regions, an anonymous LLVM-MCA-END directive
       always ends the currently active user-defined region.

       Example of nesting regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END bar
          # LLVM-MCA-END foo

       Example of overlapping regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END foo
            add %eax, %edx
          # LLVM-MCA-END bar

       Note that multiple anonymous regions cannot overlap. Also, overlapping
       regions cannot have the same name.

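       The region rules above can be sketched with a small Python model.
       This is only an illustration, not the parser used by llvm-mca: the
       names MARKER and split_regions are hypothetical, instructions outside
       every region are ignored (the default whole-file region is not
       modeled), and an anonymous LLVM-MCA-END is assumed to close the most
       recently opened region.

```python
import re

# Matches "# LLVM-MCA-BEGIN [name]" and "# LLVM-MCA-END [name]" comments.
MARKER = re.compile(r"#\s*LLVM-MCA-(BEGIN|END)\b[ \t]*(.*)")


def split_regions(asm_text):
    """Map each region name to the instruction lines it contains."""
    regions = {}   # region name -> instruction lines
    active = []    # names of the regions currently open, oldest first
    for raw in asm_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        marker = MARKER.match(line)
        if marker is None:
            if not line.startswith("#"):   # skip ordinary comments
                for name in active:
                    regions[name].append(line)
            continue
        kind, name = marker.group(1), marker.group(2).strip()
        if kind == "BEGIN":
            # Overlapping regions cannot share a name (or both be anonymous).
            if name in active:
                raise ValueError(f"region {name!r} overlaps itself")
            regions.setdefault(name, [])
            active.append(name)
        else:
            if not active:
                raise ValueError("LLVM-MCA-END without an active region")
            # Assumption: an anonymous END closes the most recent region.
            target = name or active[-1]
            if target not in active:
                raise ValueError(f"region {target!r} is not active")
            active.remove(target)
    return regions


overlapping = """
# LLVM-MCA-BEGIN foo
  add %eax, %edx
# LLVM-MCA-BEGIN bar
  sub %eax, %edx
# LLVM-MCA-END foo
  add %eax, %edx
# LLVM-MCA-END bar
"""
for name, lines in split_regions(overlapping).items():
    print(name, lines)
```

       For the overlapping example, region foo receives the first add and the
       sub, while region bar receives the sub and the second add.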
       There is no support for marking regions from high-level source code,
       like C or C++. As a workaround, inline assembly directives may be used:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo":::"memory");
            a += 42;
            __asm volatile("# LLVM-MCA-END":::"memory");
            a *= b;
            return a;
          }

       However, this interferes with optimizations like loop vectorization and
       may have an impact on the code generated. This is because the __asm
       statements are seen as real code having important side effects, which
       limits how the code around them can be transformed. If users want to
       make use of inline assembly to emit markers, then the recommendation is
       to always verify that the output assembly is equivalent to the assembly
       generated in the absence of markers. The Clang options to emit
       optimization reports can also help in detecting missed optimizations.

HOW LLVM-MCA WORKS

       llvm-mca takes assembly code as input. The assembly code is parsed into
       a sequence of MCInst with the help of the existing LLVM target assembly
       parsers. The parsed sequence of MCInst is then analyzed by a Pipeline
       module to generate a performance report.

       The Pipeline module simulates the execution of the machine code
       sequence in a loop of iterations (default is 100). During this process,
       the pipeline collects a number of execution related statistics. At the
       end of this process, the pipeline generates and prints a report from
       the collected statistics.

       Here is an example of a performance report generated by the tool for a
       dot-product of two packed float vectors of four elements. The analysis
       is conducted for target x86, cpu btver2. The report can be produced
       with the following command, using the example located at
       test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Total uOps:        900

          Dispatch Width:    2
          uOps Per Cycle:    1.48
          IPC:               1.48
          Block RThroughput: 2.0


          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps      %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps     %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps     %xmm3, %xmm3, %xmm4


          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL


          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps      %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm3, %xmm3, %xmm4

       According to this report, the dot-product kernel has been executed 300
       times, for a total of 900 simulated instructions. The total number of
       simulated micro opcodes (uOps) is also 900.

       The report is structured in three main sections. The first section
       collects a few performance numbers; the goal of this section is to give
       a very quick overview of the performance throughput. Important
       performance indicators are IPC, uOps Per Cycle, and Block RThroughput
       (Block Reciprocal Throughput).

       Field DispatchWidth is the maximum number of micro opcodes that are
       dispatched to the out-of-order backend every simulated cycle. For
       processors with an in-order backend, DispatchWidth is the maximum
       number of micro opcodes issued to the backend every simulated cycle.

       IPC is computed by dividing the total number of simulated instructions
       by the total number of cycles.

       Field Block RThroughput is the reciprocal of the block throughput.
       Block throughput is a theoretical quantity computed as the maximum
       number of blocks (i.e. iterations) that can be executed per simulated
       clock cycle in the absence of loop-carried dependencies. Block
       throughput is bounded from above by the dispatch rate and by the
       availability of hardware resources.

       In the absence of loop-carried data dependencies, the observed IPC
       tends to a theoretical maximum which can be computed by dividing the
       number of instructions of a single iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is computed by dividing the total number of
       simulated micro opcodes by the total number of cycles. A delta between
       Dispatch Width and this field is an indicator of a performance issue.
       In the absence of loop-carried data dependencies, the observed 'uOps
       Per Cycle' should tend to a theoretical maximum throughput which can be
       computed by dividing the number of uOps of a single iteration by the
       Block RThroughput.

       Field 'uOps Per Cycle' is bounded from above by the dispatch width.
       That is because the dispatch width limits the maximum size of a
       dispatch group. Both IPC and 'uOps Per Cycle' are limited by the amount
       of hardware parallelism. The availability of hardware resources affects
       the resource pressure distribution, and it limits the number of
       instructions that can be executed in parallel every cycle. A delta
       between Dispatch Width and the theoretical maximum uOps per Cycle
       (computed by dividing the number of uOps of a single iteration by the
       Block RThroughput) is an indicator of a performance bottleneck caused
       by the lack of hardware resources. In general, the lower the Block
       RThroughput, the better.

       In this example, uOps per iteration/Block RThroughput is 1.50. Since
       there are no loop-carried dependencies, the observed uOps Per Cycle is
       expected to approach 1.50 when the number of iterations tends to
       infinity. The delta between the Dispatch Width (2.00) and the
       theoretical maximum uOp throughput (1.50) is an indicator of a
       performance bottleneck caused by the lack of hardware resources, and
       the Resource pressure view can help to identify the problematic
       resource usage.

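       The arithmetic behind the summary fields can be reproduced with a
       short Python sketch. The function name summary_metrics and its
       parameter names are hypothetical; the inputs are simply the numbers
       printed in the summary section of the report.

```python
def summary_metrics(iterations, instructions, cycles, uops, block_rthroughput):
    """Recompute the summary indicators from the raw counters of a report."""
    insts_per_iteration = instructions / iterations
    uops_per_iteration = uops / iterations
    return {
        # Observed values: totals divided by the total number of cycles.
        "ipc": instructions / cycles,
        "uops_per_cycle": uops / cycles,
        # Theoretical maxima in the absence of loop-carried dependencies.
        "max_ipc": insts_per_iteration / block_rthroughput,
        "max_uops_per_cycle": uops_per_iteration / block_rthroughput,
    }


# Numbers taken from the dot-product report (300 iterations on btver2).
m = summary_metrics(iterations=300, instructions=900, cycles=610,
                    uops=900, block_rthroughput=2.0)
print(f"IPC:                {m['ipc']:.2f}")             # 1.48
print(f"uOps Per Cycle:     {m['uops_per_cycle']:.2f}")  # 1.48
print(f"max uOps per cycle: {m['max_uops_per_cycle']:.2f}")  # 1.50
```

       The gap between the dispatch width (2) and the computed maximum of
       1.50 uOps per cycle is the delta discussed above.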
       The second section of the report is the instruction info view. It shows
       the latency and reciprocal throughput of every instruction in the
       sequence. It also reports extra information related to the number of
       micro opcodes, and opcode properties (i.e., 'MayLoad', 'MayStore', and
       'HasSideEffects').

       Field RThroughput is the reciprocal of the instruction throughput.
       Throughput is computed as the maximum number of instructions of the
       same type that can be executed per clock cycle in the absence of
       operand dependencies. In this example, the reciprocal throughput of a
       vector float multiply is 1 cycle/instruction. That is because the FP
       multiplier JFPM is only available from pipeline JFPU1.

       Instruction encodings are displayed within the instruction info view
       when flag -show-encoding is specified.

       Below is an example of -show-encoding output for the dot-product
       kernel:

          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)
          [7]: Encoding Size

          [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:                    Instructions:
           1      2     1.00                         4     c5 f0 59 d0                   vmulps  %xmm0, %xmm1, %xmm2
           1      4     1.00                         4     c5 eb 7c da                   vhaddps %xmm2, %xmm2, %xmm3
           1      4     1.00                         4     c5 e3 7c e3                   vhaddps %xmm3, %xmm3, %xmm4

       The Encoding Size column shows the size in bytes of instructions. The
       Encodings column shows the actual instruction encodings (byte sequences
       in hex).

       The third section is the Resource pressure view. This view reports the
       average number of resource cycles consumed every iteration by
       instructions for every processor resource unit available on the target.
       Information is structured in two tables. The first table reports the
       number of resource cycles spent on average every iteration. The second
       table correlates the resource cycles to the machine instruction in the
       sequence. For example, every iteration of the instruction vmulps always
       executes on resource unit [6] (JFPU1 - floating point pipeline #1),
       consuming an average of 1 resource cycle per iteration. Note that on
       AMD Jaguar, vector floating-point multiply can only be issued to
       pipeline JFPU1, while horizontal floating-point additions can only be
       issued to pipeline JFPU0.

       The resource pressure view helps with identifying bottlenecks caused by
       high usage of specific hardware resources. Situations with resource
       pressure mainly concentrated on a few resources should, in general, be
       avoided. Ideally, pressure should be uniformly distributed between
       multiple resources.

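       As a rough illustration of how the dispatch rate and resource pressure
       bound the block throughput, the Python sketch below estimates a Block
       RThroughput from the "Resource pressure per iteration" row. It is a
       simplified model under stated assumptions, not the computation
       performed by llvm-mca: every resource is assumed to have a single
       unit, and the function names are hypothetical.

```python
def estimate_block_rthroughput(uops_per_iteration, dispatch_width, pressure):
    """Estimate cycles per iteration from dispatch and resource limits.

    pressure maps a resource name to the average number of resource
    cycles it consumes per iteration (single-unit resources assumed).
    """
    dispatch_bound = uops_per_iteration / dispatch_width
    resource_bound = max(pressure.values(), default=0.0)
    return max(dispatch_bound, resource_bound)


def busiest_resources(pressure):
    """Return the resources with the highest pressure per iteration."""
    if not pressure:
        return []
    peak = max(pressure.values())
    return sorted(name for name, cycles in pressure.items() if cycles == peak)


# Non-zero entries of the per-iteration row in the dot-product report.
pressure = {"JFPA": 2.00, "JFPM": 1.00, "JFPU0": 2.00, "JFPU1": 1.00}
print(estimate_block_rthroughput(3, 2, pressure))  # 2.0
print(busiest_resources(pressure))                 # ['JFPA', 'JFPU0']
```

       Under these assumptions the estimate agrees with the Block RThroughput
       of 2.0 shown in the report, and the busiest resources are the ones
       used by the two vhaddps instructions.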
   Timeline View
       The timeline view produces a detailed report of each instruction's
       state transitions through an instruction pipeline. This view is
       enabled by the command line option -timeline. As instructions
       transition through the various stages of the pipeline, their states are
       depicted in the view report. These states are represented by the
       following characters:

       • D : Instruction dispatched.

       • e : Instruction executing.

       • E : Instruction executed.

       • R : Instruction retired.

       • = : Instruction already dispatched, waiting to be executed.

       • - : Instruction executed, waiting to be retired.

       Below is the timeline view for a subset of the dot-product example
       located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and processed
       by llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
                 3     3.3    0.5    1.4       <total>

       The timeline view is interesting because it shows instruction state
       changes during execution. It also gives an idea of how the tool
       processes instructions executed on the target, and how their timing
       information might be calculated.

       The timeline view is structured in two tables. The first table shows
       instructions changing state over time (measured in cycles); the second
       table (named Average Wait times) reports useful timing statistics,
       which should help diagnose performance bottlenecks caused by long data
       dependencies and sub-optimal usage of hardware resources.

       An instruction in the timeline view is identified by a pair of indices,
       where the first index identifies an iteration, and the second index is
       the instruction index (i.e., where it appears in the code sequence).
       Since this example was generated using 3 iterations (-iterations=3),
       the iteration indices range from 0 to 2 inclusive.

       Excluding the first and last columns, the remaining columns are in
       cycles. Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       • Instruction [1,0] was dispatched at cycle 1.

       • Instruction [1,0] started executing at cycle 2.

       • Instruction [1,0] reached the write back stage at cycle 4.

       • Instruction [1,0] was retired at cycle 10.

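       The way these cycle numbers are read off a timeline row can be
       sketched in Python. The helper timeline_events is hypothetical; it
       assumes the caller has already cut out the timeline field of a row so
       that character position 0 corresponds to cycle 0.

```python
def timeline_events(field):
    """Return the cycle of each state transition found in a timeline field."""
    dispatched = field.index("D")
    executed = field.index("E")      # cycle of the write-back stage
    # Assumption: a row without 'e' starts executing on its 'E' cycle.
    issued = field.index("e") if "e" in field else executed
    retired = field.index("R")
    return {
        "dispatched": dispatched,
        "issued": issued,
        "executed": executed,
        "retired": retired,
        # Cycles between the dispatch event and the issue event.
        "queue_cycles": issued - dispatched,
        # Cycles spent waiting between write-back and retirement.
        "retire_wait": retired - executed - 1,
    }


# Timeline field of instruction [1,0] from the example above.
print(timeline_events(".DeeE-----R"))
```

       For instruction [1,0] this yields dispatch at cycle 1, issue at cycle
       2, write-back at cycle 4, and retirement at cycle 10, with 1 cycle
       spent in the scheduler's queue and 5 cycles spent waiting to retire.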
       Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
       wait in the scheduler's queue for the operands to become available. By
       the time vmulps is dispatched, operands are already available, and
       pipeline JFPU1 is ready to serve another instruction. So the
       instruction can be immediately issued on the JFPU1 pipeline. That is
       demonstrated by the fact that the instruction only spent 1cy in the
       scheduler's queue.

       There is a gap of 5 cycles between the write-back stage and the retire
       event. That is because instructions must retire in program order, so
       [1,0] has to wait for [0,2] to be retired first (i.e., it has to wait
       until cycle 10).

       In the example, all instructions are in a RAW (Read After Write)
       dependency chain. Register %xmm2 written by vmulps is immediately used
       by the first vhaddps, and register %xmm3 written by the first vhaddps
       is used by the second vhaddps. Long data dependencies negatively impact
       the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies introduced by
       instructions from different iterations. However, those dependencies
       can be removed at the register renaming stage (at the cost of
       allocating register aliases, and therefore consuming physical
       registers).

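       The dependencies described above can be listed with a small Python
       sketch. It is a deliberately naive model and not part of llvm-mca: it
       assumes register-only AT&T operands where the last operand is the
       destination and is not also read, it ignores intervening writes when
       looking for anti-dependencies, and all function names are
       hypothetical.

```python
def parse(line):
    """Split 'op src, src, dst' into (mnemonic, sources, destination)."""
    mnemonic, operands = line.split(None, 1)
    regs = [reg.strip() for reg in operands.split(",")]
    return mnemonic, regs[:-1], regs[-1]


def raw_dependencies(insts):
    """(writer, reader, register) pairs within a single iteration."""
    edges, last_writer = [], {}
    for index, (_, sources, dest) in enumerate(insts):
        for reg in dict.fromkeys(sources):      # de-duplicate, keep order
            if reg in last_writer:
                edges.append((last_writer[reg], index, reg))
        last_writer[dest] = index
    return edges


def cross_iteration_anti_dependencies(insts):
    """(reader in iteration k, writer in iteration k+1, register) pairs."""
    edges = []
    for writer, (_, _, dest) in enumerate(insts):
        for reader, (_, sources, _) in enumerate(insts):
            if dest in sources:
                edges.append((reader, writer, dest))
    return edges


kernel = [parse(line) for line in (
    "vmulps  %xmm0, %xmm1, %xmm2",
    "vhaddps %xmm2, %xmm2, %xmm3",
    "vhaddps %xmm3, %xmm3, %xmm4",
)]
print(raw_dependencies(kernel))
# [(0, 1, '%xmm2'), (1, 2, '%xmm3')]
print(cross_iteration_anti_dependencies(kernel))
# [(1, 0, '%xmm2'), (2, 1, '%xmm3')]
```

       The first list is the RAW chain through %xmm2 and %xmm3; the second
       list shows that the next iteration overwrites registers that the
       current iteration still reads, which is what register renaming
       resolves.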
       Table Average Wait times helps diagnose performance issues that are
       caused by the presence of long latency instructions and potentially
       long data dependencies which may limit the ILP. The last row, <total>,
       shows a global average over all instructions measured. Note that
       llvm-mca, by default, assumes at least 1cy between the dispatch event
       and the issue event.

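       Columns [1] and [3] of the Average Wait times table can be recomputed
       from the timeline rows, as the Python sketch below illustrates. The
       function average_wait_times is hypothetical and takes pre-extracted
       timeline fields (position 0 is cycle 0). Column [2] is not derived
       here, because the timeline characters alone do not show when an
       instruction became ready.

```python
from collections import defaultdict


def average_wait_times(rows):
    """rows: (instruction index, timeline field) pairs.

    Returns {index: (executions, avg queue cycles, avg retire-wait cycles)}.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for index, field in rows:
        dispatched = field.index("D")
        executed = field.index("E")
        # Assumption: a row without 'e' starts executing on its 'E' cycle.
        issued = field.index("e") if "e" in field else executed
        retired = field.index("R")
        entry = totals[index]
        entry[0] += 1
        entry[1] += issued - dispatched       # time in the scheduler's queue
        entry[2] += retired - executed - 1    # time from WB until retire
    return {i: (n, q / n, w / n) for i, (n, q, w) in totals.items()}


# Timeline fields copied from the 3-iteration dot-product example.
rows = [
    (0, "DeeER"), (1, "D==eeeER"), (2, ".D====eeeER"),
    (0, ".DeeE-----R"), (1, ". D=eeeE---R"), (2, ". D====eeeER"),
    (0, ".  DeeE-----R"), (1, ".  D====eeeER"), (2, ".   D======eeeER"),
]
for index, (count, queue, retire) in sorted(average_wait_times(rows).items()):
    print(f"{index}.  {count}  {queue:.1f}  {retire:.1f}")
```

       The printed averages (1.0 and 3.3 for vmulps, 3.3 and 1.0 for the
       first vhaddps, 5.7 and 0.0 for the second vhaddps) match columns [1]
       and [3] of the table above.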
       When the performance is limited by data dependencies and/or long
       latency instructions, the number of cycles spent while in the ready
       state is expected to be very small when compared with the total number
       of cycles spent in the scheduler's queue. The difference between the
       two counters is a good indicator of how large an impact data
       dependencies had on the execution of the instructions. When performance
       is mostly limited by the lack of hardware resources, the delta between
       the two counters is small. However, the number of cycles spent in the
       queue tends to be larger (i.e., more than 1-3cy), especially when
       compared to other low latency instructions.

   Bottleneck Analysis
       The -bottleneck-analysis command line option enables the analysis of
       performance bottlenecks.

       This analysis is potentially expensive. It attempts to correlate
       increases in backend pressure (caused by pipeline resource pressure and
       data dependencies) to dynamic dispatch stalls.

       Below is an example of -bottleneck-analysis output generated by
       llvm-mca for 500 iterations of the dot-product example on btver2.

          Cycles with backend pressure increase [ 48.07% ]
          Throughput Bottlenecks:
            Resource Pressure       [ 47.77% ]
            - JFPA  [ 47.77% ]
            - JFPU0  [ 47.77% ]
            Data Dependencies:      [ 0.30% ]
            - Register Dependencies [ 0.30% ]
            - Memory Dependencies   [ 0.00% ]

          Critical sequence based on the simulation:

                        Instruction                         Dependency Information
           +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
           |
           |    < loop carried >
           |
           |      0.    vmulps  %xmm0, %xmm1, %xmm2
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
           +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
           |
           |    < loop carried >
           |
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]

       According to the analysis, throughput is limited by resource pressure
       and not by data dependencies. The analysis observed increases in
       backend pressure during 48.07% of the simulated run. Almost all those
       pressure increase events were caused by contention on processor
       resources JFPA/JFPU0.

       The critical sequence is the most expensive sequence of instructions
       according to the simulation. It is annotated to provide extra
       information about critical register dependencies and resource
       interferences between instructions.

       Instructions from the critical sequence are expected to significantly
       impact performance. By construction, the accuracy of this analysis is
       strongly dependent on the simulation and (as always) on the quality of
       the processor model in LLVM.

       Bottleneck analysis is currently not supported for processors with an
       in-order backend.

637   Extra Statistics to Further Diagnose Performance Issues
638       The -all-stats command line option enables extra statistics and perfor‐
639       mance  counters  for the dispatch logic, the reorder buffer, the retire
640       control unit, and the register file.
641
642       Below is an example of -all-stats output generated by  llvm-mca for 300
643       iterations  of  the  dot-product example discussed in the previous sec‐
644       tions.
645
646          Dynamic Dispatch Stall Cycles:
647          RAT     - Register unavailable:                      0
648          RCU     - Retire tokens unavailable:                 0
649          SCHEDQ  - Scheduler full:                            272  (44.6%)
650          LQ      - Load queue full:                           0
651          SQ      - Store queue full:                          0
652          GROUP   - Static restrictions on the dispatch group: 0
653
654
655          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
656          [# dispatched], [# cycles]
657           0,              24  (3.9%)
658           1,              272  (44.6%)
659           2,              314  (51.5%)
660
661
662          Schedulers - number of cycles where we saw N micro opcodes issued:
663          [# issued], [# cycles]
664           0,          7  (1.1%)
665           1,          306  (50.2%)
666           2,          297  (48.7%)
667
668          Scheduler's queue usage:
669          [1] Resource name.
670          [2] Average number of used buffer entries.
671          [3] Maximum number of used buffer entries.
672          [4] Total number of buffer entries.
673
674           [1]            [2]        [3]        [4]
675          JALU01           0          0          20
676          JFPU01           17         18         18
677          JLSAGU           0          0          12
678
679
680          Retire Control Unit - number of cycles where we saw N instructions retired:
681          [# retired], [# cycles]
682           0,           109  (17.9%)
683           1,           102  (16.7%)
684           2,           399  (65.4%)
685
686          Total ROB Entries:                64
687          Max Used ROB Entries:             35  ( 54.7% )
688          Average Used ROB Entries per cy:  32  ( 50.0% )
689
690
691          Register File statistics:
692          Total number of mappings created:    900
693          Max number of mappings used:         35
694
695          *  Register File #1 -- JFpuPRF:
696             Number of physical registers:     72
697             Total number of mappings created: 900
698             Max number of mappings used:      35
699
700          *  Register File #2 -- JIntegerPRF:
701             Number of physical registers:     64
702             Total number of mappings created: 0
703             Max number of mappings used:      0
704
705       If we look at the Dynamic Dispatch  Stall  Cycles  table,  we  see  the
706       counter for SCHEDQ reports 272 cycles.  This counter is incremented ev‐
707       ery time the dispatch logic is unable to dispatch a full group  because
708       the scheduler's queue is full.
709
       Looking at the Dispatch Logic table, we see that the pipeline was able
       to dispatch two micro opcodes in only 51.5% of the cycles. The dispatch
       group was limited to one micro opcode in 44.6% of the cycles, which
       corresponds to the 272 cycles reported by the SCHEDQ counter. The
       dispatch statistics are displayed by using either the -all-stats or
       the -dispatch-stats command option.
715
       The next table, Schedulers, is a histogram of the number of micro
       opcodes issued per cycle. In this case, of the 610 simulated cycles,
       a single micro opcode was issued in 306 cycles (50.2%), and there were
       7 cycles in which no micro opcodes were issued.
721
       The Scheduler's queue usage table shows the average and maximum number
       of buffer entries (i.e., scheduler queue entries) used at runtime.
       Resource JFPU01 reached its maximum (18 of 18 queue entries). Note
       that AMD Jaguar implements three schedulers:
726
727       • JALU01 - A scheduler for ALU instructions.
728
       • JFPU01 - A scheduler for floating point operations.
730
731       • JLSAGU - A scheduler for address generation.
732
733       The dot-product is a kernel of three  floating  point  instructions  (a
734       vector  multiply  followed  by two horizontal adds).  That explains why
735       only the floating point scheduler appears to be used.
736
737       A full scheduler queue is either caused by data dependency chains or by
738       a  sub-optimal  usage of hardware resources.  Sometimes, resource pres‐
739       sure can be mitigated by rewriting the kernel using different  instruc‐
740       tions  that  consume  different scheduler resources.  Schedulers with a
741       small queue are less resilient to bottlenecks caused by the presence of
742       long  data dependencies.  The scheduler statistics are displayed by us‐
743       ing the command option -all-stats or -scheduler-stats.
744
       The next table, Retire Control Unit, is a histogram of the number of
       instructions retired per cycle. In this case, of the 610 simulated
       cycles, two instructions were retired in the same cycle 399 times
       (65.4%), and there were 109 cycles in which no instructions were
       retired. The retire statistics are displayed by using the command
       option -all-stats or -retire-stats.
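
       The three histograms discussed above (Dispatch Logic, Schedulers, and
       Retire Control Unit) can be checked against each other. The sketch
       below copies the bucket counts from the -all-stats output and
       recomputes the totals and percentages; the helper name summarize is
       made up for this illustration and is not part of llvm-mca.

```python
# Histograms copied from the -all-stats output above: {N per cycle: cycles}.
dispatched = {0: 24, 1: 272, 2: 314}
issued = {0: 7, 1: 306, 2: 297}
retired = {0: 109, 1: 102, 2: 399}

def summarize(histogram):
    """Return (total cycles, total events, percentage of cycles per bucket)."""
    cycles = sum(histogram.values())
    events = sum(n * count for n, count in histogram.items())
    percent = {n: round(100 * count / cycles, 1)
               for n, count in histogram.items()}
    return cycles, events, percent

for name, histogram in [("dispatch", dispatched),
                        ("issue", issued),
                        ("retire", retired)]:
    print(name, summarize(histogram))
```

       Each table covers the same 610 simulated cycles, and each one accounts
       for 900 events, which matches the 900 instructions processed in 300
       iterations of the three-instruction kernel.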
751
       The last table presented is Register File statistics. Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for
       floating-point registers (JFpuPRF) and one for integer registers
       (JIntegerPRF). The table shows that of the 900 instructions processed,
       there were 900 mappings created. Since this dot-product example
       utilized only floating point registers, the JFpuPRF was responsible
       for creating the 900 mappings. However, we see that the pipeline only
       used a maximum of 35 of 72 available register slots at any given time.
       We can conclude that the floating point PRF was the only register file
       used for the example, and that it was never resource constrained. The
       register file statistics are displayed by using the command option
       -all-stats or -register-file-stats.
765
766       In this example, we can conclude that the IPC is mostly limited by data
767       dependencies, and not by resource pressure.
768
769   Instruction Flow
770       This section describes the instruction flow through the  default  pipe‐
771       line  of  llvm-mca,  as  well  as  the functional units involved in the
772       process.
773
774       The default pipeline implements the following sequence of  stages  used
775       to process instructions.
776
777       • Dispatch (Instruction is dispatched to the schedulers).
778
779       • Issue (Instruction is issued to the processor pipelines).
780
781       • Write Back (Instruction is executed, and results are written back).
782
783       • Retire  (Instruction  is  retired; writes are architecturally commit‐
784         ted).
785
       The in-order pipeline implements the following sequence of stages:

       • InOrderIssue (Instruction is issued to the processor pipelines).

       • Retire (Instruction is retired; writes are architecturally
         committed).
789
       llvm-mca assumes that instructions have all been decoded and placed
       into a queue before the simulation starts. Therefore, the instruction
       fetch and decode stages are not modeled, and performance bottlenecks
       in the frontend are not diagnosed. llvm-mca also does not model branch
       prediction.
795
796   Instruction Dispatch
797       During the dispatch stage, instructions are  picked  in  program  order
798       from  a queue of already decoded instructions, and dispatched in groups
799       to the simulated hardware schedulers.
800
801       The size of a dispatch group depends on the availability of  the  simu‐
802       lated hardware resources.  The processor dispatch width defaults to the
803       value of the IssueWidth in LLVM's scheduling model.
804
805       An instruction can be dispatched if:
806
       • The size of the dispatch group is smaller than the processor's
         dispatch width.
809
810       • There are enough entries in the reorder buffer.
811
812       • There are enough physical registers to do register renaming.
813
814       • The schedulers are not full.
815
816       Scheduling  models  can  optionally  specify  which  register files are
817       available on the processor. llvm-mca uses that information to  initial‐
818       ize  register file descriptors.  Users can limit the number of physical
819       registers that are globally available for register  renaming  by  using
820       the  command  option -register-file-size.  A value of zero for this op‐
821       tion means unbounded. By knowing how many registers are  available  for
822       renaming,  the  tool  can predict dispatch stalls caused by the lack of
823       physical registers.
824
825       The number of reorder buffer entries consumed by an instruction depends
826       on  the  number  of micro-opcodes specified for that instruction by the
827       target scheduling model.  The reorder buffer is responsible for  track‐
828       ing  the  progress  of  instructions that are "in-flight", and retiring
829       them in program order.  The number of entries in the reorder buffer de‐
830       faults  to the value specified by field MicroOpBufferSize in the target
831       scheduling model.
832
833       Instructions that are dispatched to the  schedulers  consume  scheduler
834       buffer  entries. llvm-mca queries the scheduling model to determine the
835       set of buffered resources consumed by  an  instruction.   Buffered  re‐
836       sources are treated like scheduler resources.
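
       The dispatch conditions described in this section can be summarized as
       a single predicate. The following is a minimal sketch under stated
       assumptions, not the llvm-mca implementation: DispatchState,
       dispatch_stall_reason, and all field names are hypothetical, and the
       values in the usage example are illustrative numbers taken from the
       dot-product statistics shown earlier.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DispatchState:
    """Hypothetical snapshot of what the dispatch logic checks (not an LLVM type)."""
    dispatch_width: int            # defaults to IssueWidth in the scheduling model
    group_size: int                # micro opcodes already in this cycle's group
    free_rob_entries: int          # unused reorder buffer entries
    free_phys_regs: Optional[int]  # None models -register-file-size=0 (unbounded)
    free_scheduler_entries: int    # unused entries in the scheduler's buffer

def dispatch_stall_reason(state, num_uops, num_reg_writes):
    """Return None if the instruction can be dispatched, else a short reason."""
    if state.group_size >= state.dispatch_width:
        return "dispatch group full"
    # One reorder buffer entry is consumed per micro opcode.
    if num_uops > state.free_rob_entries:
        return "reorder buffer full"
    if state.free_phys_regs is not None and num_reg_writes > state.free_phys_regs:
        return "no physical registers"
    if state.free_scheduler_entries < 1:
        return "scheduler full"
    return None

# Illustrative state: every scheduler entry is in use (as for JFPU01 above).
state = DispatchState(dispatch_width=2, group_size=1, free_rob_entries=29,
                      free_phys_regs=37, free_scheduler_entries=0)
print(dispatch_stall_reason(state, num_uops=1, num_reg_writes=1))
```

       With a full scheduler buffer the sketch reports "scheduler full",
       which is the situation counted by the SCHEDQ stall counter in the
       -all-stats example.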
837
838   Instruction Issue
       Each processor scheduler implements a buffer of instructions. An
       instruction has to wait in the scheduler's buffer until its input
       register operands become available. Only at that point does the
       instruction become eligible for execution and may be issued
       (potentially out-of-order). Instruction latencies are computed by
       llvm-mca with the help of the scheduling model.
845
846       llvm-mca's scheduler is designed to simulate multiple processor  sched‐
847       ulers.   The  scheduler  is responsible for tracking data dependencies,
848       and dynamically selecting which processor resources are consumed by in‐
849       structions.   It  delegates  the management of processor resource units
850       and resource groups to a resource manager.  The resource manager is re‐
851       sponsible  for  selecting  resource units that are consumed by instruc‐
852       tions.  For example, if an  instruction  consumes  1cy  of  a  resource
853       group, the resource manager selects one of the available units from the
854       group; by default, the resource manager uses a round-robin selector  to
855       guarantee  that  resource  usage  is  uniformly distributed between all
856       units of a group.
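
       The round-robin selection described above can be pictured as follows.
       This is a simplified sketch, not the resource manager in llvm-mca: the
       class RoundRobinGroup, its methods, and the unit names U0 and U1 are
       hypothetical.

```python
class RoundRobinGroup:
    """Hypothetical resource group that hands out its units in rotation."""

    def __init__(self, units):
        self.units = list(units)
        self.busy_until = {unit: 0 for unit in self.units}  # cycle the unit frees up
        self.next_index = 0

    def select(self, now, cycles):
        """Reserve an available unit for `cycles` cycles; None if all are busy."""
        count = len(self.units)
        for offset in range(count):
            index = (self.next_index + offset) % count
            unit = self.units[index]
            if self.busy_until[unit] <= now:
                self.busy_until[unit] = now + cycles
                # Start the next search after the unit just used.
                self.next_index = (index + 1) % count
                return unit
        return None

group = RoundRobinGroup(["U0", "U1"])
picks = [group.select(now=cycle, cycles=1) for cycle in range(4)]
print(picks)
```

       Consecutive one-cycle requests alternate between the two units, so the
       usage is spread evenly across the group.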
857
858       llvm-mca's scheduler internally groups instructions into three sets:
859
860       • WaitSet: a set of instructions whose operands are not ready.
861
862       • ReadySet: a set of instructions ready to execute.
863
864       • IssuedSet: a set of instructions executing.
865
       Depending on operand availability, instructions that are dispatched to
       the scheduler are placed either into the WaitSet or into the ReadySet.
869
870       Every cycle, the scheduler checks if instructions can be moved from the
871       WaitSet  to  the ReadySet, and if instructions from the ReadySet can be
872       issued to the underlying pipelines. The algorithm prioritizes older in‐
873       structions over younger instructions.
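
       The per-cycle movement between the three sets can be sketched as a toy
       simulation. The code below is an illustration under simplifying
       assumptions: all instructions are dispatched at cycle 0, processor
       resources are ignored, and the function name, instruction names, and
       latencies are made up for the example.

```python
def simulate(instructions, issue_width):
    """Return {name: issue cycle} for (name, latency, deps) in program order.

    Names must be unique and dependencies must refer to earlier instructions.
    """
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    order = {}
    for index, (name, _, deps) in enumerate(instructions):
        if any(dep not in order for dep in deps):
            raise ValueError(f"{name} depends on an unknown or later instruction")
        order[name] = index

    wait_set = list(instructions)  # operands not ready yet
    ready_set = []                 # ready to execute
    result_ready = {}              # issued: name -> cycle the result is available
    issue_cycle = {}
    cycle = 0
    while wait_set or ready_set:
        # WaitSet -> ReadySet once every producer has written its result.
        still_waiting = []
        for inst in wait_set:
            deps = inst[2]
            if all(result_ready.get(dep, float("inf")) <= cycle for dep in deps):
                ready_set.append(inst)
            else:
                still_waiting.append(inst)
        wait_set = still_waiting
        # ReadySet -> IssuedSet, oldest instructions first.
        ready_set.sort(key=lambda inst: order[inst[0]])
        for name, latency, _ in ready_set[:issue_width]:
            issue_cycle[name] = cycle
            result_ready[name] = cycle + latency
        ready_set = ready_set[issue_width:]
        cycle += 1
    return issue_cycle

kernel = [("mul", 2, []), ("add1", 3, ["mul"]),
          ("add2", 3, ["add1"]), ("indep", 1, [])]
print(simulate(kernel, issue_width=1))
```

       In this run the independent instruction is issued at cycle 1, ahead of
       the older add1 and add2, which are still waiting for their operands.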
874
875   Write-Back and Retire Stage
876       Issued  instructions  are  moved  from  the  ReadySet to the IssuedSet.
877       There, instructions wait until they reach  the  write-back  stage.   At
878       that point, they get removed from the queue and the retire control unit
879       is notified.
880
       When an instruction is executed, the retire control unit flags it as
       "ready to retire."
883
884       Instructions  are retired in program order.  The register file is noti‐
885       fied of the retirement so that it can free the physical registers  that
886       were allocated for the instruction during the register renaming stage.
887
888   Load/Store Unit and Memory Consistency Model
       To simulate the out-of-order and speculative execution of memory
       operations, llvm-mca uses a simulated load/store unit (LSUnit).
892
893       Each  load  (or  store) consumes an entry in the load (or store) queue.
894       Users can specify flags -lqueue and -squeue to limit the number of  en‐
895       tries  in  the  load  and store queues respectively. The queues are un‐
896       bounded by default.
897
898       The LSUnit implements a relaxed consistency model for memory loads  and
899       stores.  The rules are:
900
901       1. A younger load is allowed to pass an older load only if there are no
902          intervening stores or barriers between the two loads.
903
904       2. A younger load is allowed to pass an older store provided  that  the
905          load does not alias with the store.
906
907       3. A younger store is not allowed to pass an older store.
908
909       4. A younger store is not allowed to pass an older load.
910
       By default (-noalias=true), the LSUnit optimistically assumes that
       loads do not alias with store operations. Under this assumption,
       younger loads are always allowed to pass older stores. Essentially,
       the LSUnit does not attempt to run any alias analysis to predict when
       loads and stores do not alias with each other.
916
       Note that, in the case of write-combining memory, rule 3 could be
       relaxed to allow reordering of non-aliasing store operations. That
       being said, at the moment there is no way to further relax the memory
       model (-noalias is the only option). In particular, there is no option
       to specify a different memory type (e.g., write-back, write-combining,
       write-through, etc.) and consequently to weaken or strengthen the
       memory model.
924
925       Other limitations are:
926
927       • The LSUnit does not know when store-to-load forwarding may occur.
928
929       • The  LSUnit  does  not know anything about cache hierarchy and memory
930         types.
931
932       • The LSUnit does not know how to identify serializing  operations  and
933         memory fences.
934
935       The  LSUnit  does  not  attempt  to  predict if a load or store hits or
936       misses the L1 cache.  It only knows if an instruction "MayLoad"  and/or
937       "MayStore."   For  loads, the scheduling model provides an "optimistic"
938       load-to-use latency (which usually matches the load-to-use latency  for
939       when there is a hit in the L1D).
940
       llvm-mca does not (on its own) know about serializing operations or
       memory-barrier-like instructions. The LSUnit used to conservatively
       use an instruction's "MayLoad", "MayStore", and unmodeled side effects
       flags to determine whether an instruction should be treated as a
       memory barrier. This was inaccurate in general, so each instruction
       now has an IsAStoreBarrier and an IsALoadBarrier flag. These flags are
       mca specific and default to false for every instruction. If an
       instruction should have either of these flags set, this should be done
       within the target's InstrPostProcess class. For an example, look at
       the X86InstrPostProcess::postProcessInstruction method within
       llvm/lib/Target/X86/MCA/X86CustomBehaviour.cpp.
952
953       A  load/store  barrier  consumes  one entry of the load/store queue.  A
954       load/store barrier enforces ordering of loads/stores.  A  younger  load
955       cannot  pass a load barrier.  Also, a younger store cannot pass a store
956       barrier.  A younger load has to wait for the memory/load barrier to ex‐
957       ecute.   A  load/store barrier is "executed" when it becomes the oldest
958       entry in the load/store queue(s). That also means, by construction, all
959       of the older loads/stores have been executed.
960
       In conclusion, the full set of load/store consistency rules is:
962
963       1. A store may not pass a previous store.
964
965       2. A store may not pass a previous load (regardless of -noalias).
966
967       3. A store has to wait until an older store barrier is fully executed.
968
969       4. A load may pass a previous load.
970
971       5. A load may not pass a previous store unless -noalias is set.
972
973       6. A load has to wait until an older load barrier is fully executed.
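
       The six rules above can be written as a pairwise check between a
       younger memory operation and an older one that has not executed yet.
       The sketch below is an illustration only: MemOp and may_pass are
       hypothetical names, not llvm-mca types, and the noalias parameter
       stands in for the -noalias option.

```python
from dataclasses import dataclass

@dataclass
class MemOp:
    """Hypothetical in-flight memory operation (not an llvm-mca type)."""
    is_load: bool = False
    is_store: bool = False
    is_load_barrier: bool = False
    is_store_barrier: bool = False

def may_pass(younger, older, noalias=True):
    """Return True if `younger` may execute before the still-pending `older`."""
    if younger.is_store:
        # Rules 1-3: a store never passes an older store, an older load,
        # or an older store barrier.
        if older.is_store or older.is_load or older.is_store_barrier:
            return False
    if younger.is_load:
        # Rule 6: a load waits for an older load barrier.
        if older.is_load_barrier:
            return False
        # Rule 5: a load passes an older store only under -noalias.
        if older.is_store and not noalias:
            return False
        # Rule 4: a load may pass an older load, so nothing else to check.
    return True

load, store = MemOp(is_load=True), MemOp(is_store=True)
print(may_pass(load, store, noalias=True), may_pass(load, store, noalias=False))
```

       A younger operation may run ahead only if the check succeeds against
       every older operation that is still pending.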
974
975   In-order Issue and Execute
       In-order processors are modelled as a single InOrderIssueStage stage,
       which bypasses the Dispatch, Scheduler and Load/Store unit.
       Instructions are issued as soon as their operand registers are
       available and resource requirements are met. Multiple instructions can
       be issued in one cycle according to the value of the IssueWidth
       parameter in LLVM's scheduling model.
982
       Once issued, an instruction is moved to the IssuedInst set until it is
       ready to retire. llvm-mca ensures that writes are committed in-order.
       However, an instruction is allowed to commit writes and retire
       out-of-order if the RetireOOO property is true for at least one of its
       writes.
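
       The issue behaviour of the in-order model can be sketched as follows.
       This is a simplified illustration, not the InOrderIssueStage
       implementation: resource requirements and retirement (including
       RetireOOO) are ignored, and the function name, instruction names, and
       latencies are made up for the example.

```python
def issue_in_order(instructions, issue_width):
    """Return {name: issue cycle} for (name, latency, deps) in program order.

    Names must be unique and dependencies must refer to earlier instructions.
    """
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    seen = set()
    for name, _, deps in instructions:
        if any(dep not in seen for dep in deps):
            raise ValueError(f"{name} depends on an unknown or later instruction")
        seen.add(name)

    result_ready = {}  # name -> cycle when its result becomes available
    issue_cycle = {}
    pending = list(instructions)
    cycle = 0
    while pending:
        issued_now = 0
        while pending and issued_now < issue_width:
            name, latency, deps = pending[0]
            if any(result_ready[dep] > cycle for dep in deps):
                break  # the oldest instruction stalls; nothing younger may pass
            issue_cycle[name] = cycle
            result_ready[name] = cycle + latency
            pending.pop(0)
            issued_now += 1
        cycle += 1
    return issue_cycle

kernel = [("mul", 2, []), ("add1", 3, ["mul"]),
          ("add2", 3, ["add1"]), ("indep", 1, [])]
print(issue_in_order(kernel, issue_width=2))
```

       Even though the last instruction has no dependencies, it cannot be
       issued before the older instructions in front of it.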
988
989   Custom Behaviour
       Because certain instructions are not expressed perfectly within their
       scheduling model, llvm-mca isn't always able to simulate them
       perfectly. Modifying the scheduling model isn't always a viable
       option, though (perhaps because the instruction is modeled incorrectly
       on purpose, or because the instruction's behaviour is quite complex).
       The CustomBehaviour class can be used in these cases to enforce proper
       instruction modeling (often by customizing data dependencies and
       detecting hazards that llvm-mca has no way of knowing about).
998
       llvm-mca comes with one generic and multiple target specific
       CustomBehaviour classes. The generic class, which does nothing, is
       used if the -disable-cb flag is passed or if no target specific
       CustomBehaviour class exists for the target. Currently, the
       CustomBehaviour class is only a part of the in-order pipeline, but
       there are plans to add it to the out-of-order pipeline in the future.
1005
       CustomBehaviour's main method is checkCustomHazard(), which uses the
       current instruction and a list of all instructions still executing
       within the pipeline to determine if the current instruction should be
       dispatched. As output, the method returns an integer representing the
       number of cycles that the current instruction must stall for. A value
       of 0 represents no stall, and the value can be an underestimate if the
       exact number is not known.
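
       The contract of checkCustomHazard() can be illustrated with a small
       language-neutral sketch. The Python function below is not the C++
       method and does not mirror its signature: the opcode strings WAIT_MEM
       and MEM_OP, and the (opcode, cycles_left) pairs, are assumptions made
       purely for this example.

```python
def check_custom_hazard(current, in_flight):
    """Return how many cycles `current` must stall; 0 means no stall.

    `current` is an opcode string and `in_flight` is a list of
    (opcode, cycles_left) pairs. Both shapes are assumptions made for this
    sketch and are not the types used by the C++ method.
    """
    if current != "WAIT_MEM":  # hypothetical opcode with a hidden hazard
        return 0
    # Stall until the slowest in-flight memory operation has finished.
    pending = [left for opcode, left in in_flight if opcode == "MEM_OP"]
    return max(pending, default=0)

executing = [("MEM_OP", 3), ("ADD", 7), ("MEM_OP", 1)]
print(check_custom_hazard("WAIT_MEM", executing))
```

       Only the hypothetical waiting instruction is delayed; every other
       instruction gets a stall of 0 and is handled by the regular pipeline
       checks.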
1013
1014       If you'd like to add a CustomBehaviour class for a target that  doesn't
1015       already have one, refer to an existing implementation to see how to set
1016       it up. The classes are implemented within the target  specific  backend
1017       (for  example  /llvm/lib/Target/AMDGPU/MCA/)  so  that  they can access
1018       backend symbols.
1019
1020   Custom Views
1021       llvm-mca comes with several Views such as the Timeline View and Summary
1022       View.  These Views are generic and can work with most (if not all) tar‐
1023       gets. If you wish to add a new View to llvm-mca and it does not require
1024       any  backend functionality that is not already exposed through MC layer
1025       classes (MCSubtargetInfo, MCInstrInfo, etc.),  please  add  it  to  the
1026       /tools/llvm-mca/View/  directory.  However,  if your new View is target
1027       specific AND requires unexposed backend symbols or  functionality,  you
1028       can define it in the /lib/Target/<TargetName>/MCA/ directory.
1029
       To enable this target specific View, you will have to use this
       target's CustomBehaviour class to override the
       CustomBehaviour::getViews() methods. There are three variations of
       these methods based on where you want your View to appear in the
       output: getStartViews(), getPostInstrInfoViews(), and getEndViews().
       Each of these methods returns a vector of Views, so you will want to
       return a vector containing all of the target specific Views for the
       target in question.
1037
1038       Because these target specific (and backend dependent) Views require the
1039       CustomBehaviour::getViews() variants, these Views will not  be  enabled
1040       if the -disable-cb flag is used.
1041
1042       Enabling  these  custom  Views does not affect the non-custom (generic)
1043       Views.  Continue to use the usual command line arguments  to  enable  /
1044       disable those Views.
1045

AUTHOR

1047       Maintained by the LLVM Team (https://llvm.org/).
1048
1050       2003-2023, LLVM Project
1051
1052
1053
1054
105514                                2023-07-20                       LLVM-MCA(1)