LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)

NAME

       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS

       llvm-mca [options] [input]

DESCRIPTION

       llvm-mca is a performance analysis tool that uses information available
       in LLVM (e.g. scheduling models) to statically measure the performance
       of machine code on a specific CPU.

       Performance is measured in terms of throughput as well as processor
       resource consumption. The tool currently works for processors with an
       out-of-order backend, for which there is a scheduling model available
       in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help with diagnosing
       potential performance issues.

       Given an assembly code sequence, llvm-mca estimates the Instructions
       Per Cycle (IPC), as well as hardware resource pressure. The analysis
       and reporting style were inspired by the IACA tool from Intel.

       For example, you can compile code with clang, output assembly, and pipe
       it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

       (llvm-mca detects Intel syntax by the presence of an .intel_syntax
       directive at the beginning of the input. By default its output syntax
       matches that of its input.)

       Scheduling models are used not just to compute instruction latencies
       and throughput, but also to understand what processor resources are
       available and how to simulate them.

       By design, the quality of the analysis conducted by llvm-mca is
       inevitably affected by the quality of the scheduling models in LLVM.

       If you see that the performance report is not accurate for a processor,
       please file a bug against the appropriate backend.

OPTIONS

       If input is "-" or omitted, llvm-mca reads from standard input.
       Otherwise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca will send its output to
       standard output if the input is from standard input. If the -o option
       specifies "-", then the output will also be sent to standard output.

       -help  Print a summary of command line options.

       -o <filename>
              Use <filename> as the output filename. See the summary above for
              more details.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code. It
              defaults to the host default target.

       -mcpu=<cpuname>
              Specify the processor for which to analyze the code. By
              default, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated by
              the tool. On x86, possible values are [0, 1]. A value of 0
              (respectively 1) for this flag enables the AT&T (respectively
              Intel) assembly format for the code printed out by the tool in
              the analysis report.

       -print-imm-hex
              Prefer hex format for numeric literals in the output assembly
              printed as part of the report.

       -dispatch=<width>
              Specify a different dispatch width for the processor. The
              dispatch width defaults to field 'IssueWidth' in the processor
              scheduling model. If width is zero, then the default dispatch
              width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this flag
              limits how many physical registers are available for register
              renaming purposes. A value of zero for this flag means
              "unlimited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run. If this flag is set to
              0, then the tool sets the number of iterations to a default
              value (i.e. 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias. This
              is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool. By default, the tool assumes an unbounded
              number of entries in the load queue. A value of zero for this
              flag is ignored, and the default load queue size is used
              instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an unbounded
              number of entries in the store queue. A value of zero for this
              flag is ignored, and the default store queue size is used
              instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view. By
              default, the timeline view prints information for up to 10
              iterations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view. By default, the
              number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view. This is enabled by default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics. This view collects and
              analyzes instruction dispatch events, as well as static/dynamic
              dispatch stall events. This view is disabled by default.

       -scheduler-stats
              Enable extra scheduler statistics. This view collects and
              analyzes instruction issue events. This view is disabled by
              default.

       -retire-stats
              Enable extra retire control unit statistics. This view is
              disabled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -show-encoding
              Enable the printing of instruction encodings within the
              instruction info view.

       -all-stats
              Print all hardware statistics. This enables extra statistics
              related to the dispatch logic, the hardware schedulers, the
              register file(s), and the retire control unit. This option is
              disabled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Prints resource pressure information based on the static
              information available from the processor model. This differs
              from the resource pressure view because it doesn't require that
              the code is simulated. It instead prints the theoretical uniform
              distribution of resource pressure for every instruction in
              sequence.

       -bottleneck-analysis
              Print information about bottlenecks that affect the throughput.
              This analysis can be expensive, and it is disabled by default.
              Bottlenecks are highlighted in the summary view.

EXIT STATUS

       llvm-mca returns 0 on success. Otherwise, an error message is printed
       to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS

       llvm-mca allows for the optional usage of special code comments to mark
       regions of the assembly code to be analyzed. A comment starting with
       the substring LLVM-MCA-BEGIN marks the beginning of a code region. A
       comment starting with the substring LLVM-MCA-END marks the end of a
       code region. For example:

          # LLVM-MCA-BEGIN
            ...
          # LLVM-MCA-END

       If no user-defined region is specified, then llvm-mca assumes a default
       region which contains every instruction in the input file. Every
       region is analyzed in isolation, and the final performance report is
       the union of all the reports generated for every code region.

       Code regions can have names. For example:

          # LLVM-MCA-BEGIN A simple example
            add %eax, %eax
          # LLVM-MCA-END

       The code from the example above defines a region named "A simple
       example" with a single instruction in it. Note how the region name
       doesn't have to be repeated in the LLVM-MCA-END directive. In the
       absence of overlapping regions, an anonymous LLVM-MCA-END directive
       always ends the currently active user-defined region.

       Example of nesting regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END bar
          # LLVM-MCA-END foo

       Example of overlapping regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END foo
            add %eax, %edx
          # LLVM-MCA-END bar

       Note that multiple anonymous regions cannot overlap. Also, overlapping
       regions cannot have the same name.

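       The region rules above can be illustrated with a small Python sketch.
       It is not how llvm-mca parses markers; collect_regions is a
       hypothetical helper that only models the documented behavior. It
       assumes that an anonymous LLVM-MCA-END closes the most recently opened
       region, and it treats every non-comment line as one instruction.

```python
import re

# Matches "# LLVM-MCA-BEGIN [name]" and "# LLVM-MCA-END [name]" comments.
MARKER = re.compile(r"#\s*LLVM-MCA-(BEGIN|END)\b\s*(.*)")


def collect_regions(asm_text):
    """Map each region name ("" for an anonymous region) to its instructions."""
    active = []   # names of the regions that are currently open
    regions = {}  # region name -> instructions seen while the region was open
    for raw in asm_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        marker = MARKER.match(line)
        if marker is None:
            if not line.startswith("#"):  # skip ordinary comments
                for name in active:
                    regions[name].append(line)
            continue
        kind, name = marker.group(1), marker.group(2).strip()
        if kind == "BEGIN":
            # Overlapping regions cannot share a name (or both be anonymous).
            if name in active:
                raise ValueError(f"region {name!r} is already open")
            active.append(name)
            regions.setdefault(name, [])
        else:
            if not active:
                raise ValueError("LLVM-MCA-END without an open region")
            # Assumption: an anonymous END closes the most recently opened region.
            target = name if name else active[-1]
            if target not in active:
                raise ValueError(f"region {target!r} is not open")
            active.remove(target)
    return regions


OVERLAPPING = """
# LLVM-MCA-BEGIN foo
  add %eax, %edx
# LLVM-MCA-BEGIN bar
  sub %eax, %edx
# LLVM-MCA-END foo
  add %eax, %edx
# LLVM-MCA-END bar
"""

regions = collect_regions(OVERLAPPING)
print(regions["foo"])  # ['add %eax, %edx', 'sub %eax, %edx']
print(regions["bar"])  # ['sub %eax, %edx', 'add %eax, %edx']
```

       In the overlapping example, the sub instruction belongs to both
       regions, which is why each region is analyzed in isolation.
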
       There is no support for marking regions from high-level source code,
       like C or C++. As a workaround, inline assembly directives may be used:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo");
            a += 42;
            __asm volatile("# LLVM-MCA-END");
            a *= b;
            return a;
          }

       However, this interferes with optimizations like loop vectorization and
       may have an impact on the code generated. This is because the __asm
       statements are seen as real code having important side effects, which
       limits how the code around them can be transformed. If users want to
       make use of inline assembly to emit markers, then the recommendation is
       to always verify that the output assembly is equivalent to the assembly
       generated in the absence of markers. The Clang options to emit
       optimization reports can also help in detecting missed optimizations.

HOW LLVM-MCA WORKS

       llvm-mca takes assembly code as input. The assembly code is parsed into
       a sequence of MCInst with the help of the existing LLVM target assembly
       parsers. The parsed sequence of MCInst is then analyzed by a Pipeline
       module to generate a performance report.

       The Pipeline module simulates the execution of the machine code
       sequence in a loop of iterations (default is 100). During this process,
       the pipeline collects a number of execution-related statistics. At the
       end of this process, the pipeline generates and prints a report from
       the collected statistics.

       Here is an example of a performance report generated by the tool for a
       dot-product of two packed float vectors of four elements. The analysis
       is conducted for target x86, cpu btver2. This report can be produced
       with the following command, using the example located at
       test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Total uOps:        900

          Dispatch Width:    2
          uOps Per Cycle:    1.48
          IPC:               1.48
          Block RThroughput: 2.0


          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps      %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps     %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps     %xmm3, %xmm3, %xmm4


          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL


          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps      %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm3, %xmm3, %xmm4

       According to this report, the dot-product kernel has been executed 300
       times, for a total of 900 simulated instructions. The total number of
       simulated micro opcodes (uOps) is also 900.

       The report is structured in three main sections. The first section
       collects a few performance numbers; the goal of this section is to give
       a very quick overview of the performance throughput. Important
       performance indicators are IPC, uOps Per Cycle, and Block RThroughput
       (Block Reciprocal Throughput).

       Field Dispatch Width is the maximum number of micro opcodes that are
       dispatched to the out-of-order backend every simulated cycle.

       IPC is computed by dividing the total number of simulated instructions
       by the total number of cycles.

       Field Block RThroughput is the reciprocal of the block throughput.
       Block throughput is a theoretical quantity computed as the maximum
       number of blocks (i.e. iterations) that can be executed per simulated
       clock cycle in the absence of loop-carried dependencies. Block
       throughput is bounded from above by the dispatch rate and by the
       availability of hardware resources.

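       As a rough illustration of those two bounds, the Python sketch below
       derives a block reciprocal throughput from the dot-product report. It
       is a simplified model and not necessarily the exact computation done by
       llvm-mca: block_rthroughput is a hypothetical helper, and it assumes
       that each listed resource has a single unit and that the larger of the
       two bounds wins.

```python
def block_rthroughput(uops_per_iteration, dispatch_width, resource_cycles):
    """Estimate cycles per iteration from dispatch and resource bounds."""
    # Dispatch bound: cycles needed just to dispatch the uOps of one iteration.
    dispatch_bound = uops_per_iteration / dispatch_width
    # Resource bound: cycles consumed on the busiest resource per iteration.
    resource_bound = max(resource_cycles.values(), default=0.0)
    return max(dispatch_bound, resource_bound)


# "Resource pressure per iteration" row of the dot-product report.
pressure = {"JFPA": 2.00, "JFPM": 1.00, "JFPU0": 2.00, "JFPU1": 1.00}

estimate = block_rthroughput(uops_per_iteration=3, dispatch_width=2,
                             resource_cycles=pressure)
print(estimate)  # 2.0
```

       Here the dispatch bound is 1.5 cycles and the resource bound is 2.0
       cycles (JFPA and JFPU0), which agrees with the Block RThroughput of 2.0
       printed in the report.
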
       In the absence of loop-carried data dependencies, the observed IPC
       tends to a theoretical maximum which can be computed by dividing the
       number of instructions of a single iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is computed by dividing the total number of
       simulated micro opcodes by the total number of cycles. A delta between
       Dispatch Width and this field is an indicator of a performance issue.
       In the absence of loop-carried data dependencies, the observed 'uOps
       Per Cycle' should tend to a theoretical maximum throughput which can be
       computed by dividing the number of uOps of a single iteration by the
       Block RThroughput.

       Field 'uOps Per Cycle' is bounded from above by the dispatch width.
       That is because the dispatch width limits the maximum size of a
       dispatch group. Both IPC and 'uOps Per Cycle' are limited by the amount
       of hardware parallelism. The availability of hardware resources affects
       the resource pressure distribution, and it limits the number of
       instructions that can be executed in parallel every cycle. A delta
       between Dispatch Width and the theoretical maximum 'uOps Per Cycle'
       (computed by dividing the number of uOps of a single iteration by the
       Block RThroughput) is an indicator of a performance bottleneck caused
       by the lack of hardware resources. In general, the lower the Block
       RThroughput, the better.

       In this example, uOps per iteration / Block RThroughput is 1.50. Since
       there are no loop-carried dependencies, the observed 'uOps Per Cycle'
       is expected to approach 1.50 when the number of iterations tends to
       infinity. The delta between the Dispatch Width (2.00) and the
       theoretical maximum uOp throughput (1.50) is an indicator of a
       performance bottleneck caused by the lack of hardware resources, and
       the Resource pressure view can help to identify the problematic
       resource usage.

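       The arithmetic behind these summary fields can be reproduced from the
       numbers in the report. The Python sketch below only restates the
       divisions described above; the variable names are illustrative and are
       not llvm-mca identifiers.

```python
# Values copied from the summary section of the dot-product report.
iterations = 300
instructions = 900
total_cycles = 610
total_uops = 900
dispatch_width = 2
block_rthroughput = 2.0

# Observed rates: totals divided by the number of simulated cycles.
ipc = instructions / total_cycles
uops_per_cycle = total_uops / total_cycles

# Theoretical maximum: uOps of one iteration divided by Block RThroughput.
uops_per_iteration = total_uops / iterations
max_uops_per_cycle = uops_per_iteration / block_rthroughput

print(f"IPC:                {ipc:.2f}")                  # 1.48
print(f"uOps Per Cycle:     {uops_per_cycle:.2f}")       # 1.48
print(f"Max uOps Per Cycle: {max_uops_per_cycle:.2f}")   # 1.50
print(f"Delta to width:     {dispatch_width - max_uops_per_cycle:.2f}")  # 0.50
```

       The observed 1.48 is already close to the 1.50 limit after 300
       iterations, and the remaining gap to the dispatch width of 2 is the
       part attributed to the lack of hardware resources.
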
       The second section of the report is the instruction info view. It shows
       the latency and reciprocal throughput of every instruction in the
       sequence. It also reports extra information related to the number of
       micro opcodes, and opcode properties (i.e., 'MayLoad', 'MayStore', and
       'HasSideEffects').

       Field RThroughput is the reciprocal of the instruction throughput.
       Throughput is computed as the maximum number of instructions of the
       same type that can be executed per clock cycle in the absence of
       operand dependencies. In this example, the reciprocal throughput of a
       vector float multiply is 1 cycle/instruction. That is because the FP
       multiplier JFPM is only available from pipeline JFPU1.

       Instruction encodings are displayed within the instruction info view
       when flag -show-encoding is specified.

       Below is an example of -show-encoding output for the dot-product
       kernel:

          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)
          [7]: Encoding Size

          [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:                    Instructions:
           1      2     1.00                         4     c5 f0 59 d0                   vmulps      %xmm0, %xmm1, %xmm2
           1      4     1.00                         4     c5 eb 7c da                   vhaddps     %xmm2, %xmm2, %xmm3
           1      4     1.00                         4     c5 e3 7c e3                   vhaddps     %xmm3, %xmm3, %xmm4

       The Encoding Size column shows the size in bytes of instructions. The
       Encodings column shows the actual instruction encodings (byte sequences
       in hex).

       The third section is the Resource pressure view. This view reports the
       average number of resource cycles consumed every iteration by
       instructions for every processor resource unit available on the target.
       Information is structured in two tables. The first table reports the
       number of resource cycles spent on average every iteration. The second
       table correlates the resource cycles to the machine instruction in the
       sequence. For example, every iteration of the instruction vmulps always
       executes on resource unit [6] (JFPU1 - floating point pipeline #1),
       consuming an average of 1 resource cycle per iteration. Note that on
       AMD Jaguar, vector floating-point multiply can only be issued to
       pipeline JFPU1, while horizontal floating-point additions can only be
       issued to pipeline JFPU0.

       The resource pressure view helps with identifying bottlenecks caused by
       high usage of specific hardware resources. Situations with resource
       pressure mainly concentrated on a few resources should, in general, be
       avoided. Ideally, pressure should be uniformly distributed between
       multiple resources.

   Timeline View
       The timeline view produces a detailed report of each instruction's
       state transitions through an instruction pipeline. This view is
       enabled by the command line option -timeline. As instructions
       transition through the various stages of the pipeline, their states are
       depicted in the view report. These states are represented by the
       following characters:

       • D : Instruction dispatched.

       • e : Instruction executing.

       • E : Instruction executed.

       • R : Instruction retired.

       • = : Instruction already dispatched, waiting to be executed.

       • - : Instruction executed, waiting to be retired.

       Below is the timeline view for a subset of the dot-product example
       located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and processed
       by llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
                 3     3.3    0.5    1.4       <total>

       The timeline view is interesting because it shows instruction state
       changes during execution. It also gives an idea of how the tool
       processes instructions executed on the target, and how their timing
       information might be calculated.

       The timeline view is structured in two tables. The first table shows
       instructions changing state over time (measured in cycles); the second
       table (named Average Wait times) reports useful timing statistics,
       which should help diagnose performance bottlenecks caused by long data
       dependencies and sub-optimal usage of hardware resources.

       An instruction in the timeline view is identified by a pair of indices,
       where the first index identifies an iteration, and the second index is
       the instruction index (i.e., where it appears in the code sequence).
       Since this example was generated with -iterations=3, the iteration
       indices range from 0 to 2 inclusive.

       Excluding the first and last column, the remaining columns are in
       cycles. Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       • Instruction [1,0] was dispatched at cycle 1.

       • Instruction [1,0] started executing at cycle 2.

       • Instruction [1,0] reached the write back stage at cycle 4.

       • Instruction [1,0] was retired at cycle 10.

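       The four events listed above can be read mechanically from the cycle
       columns of a timeline row. The Python sketch below assumes the row has
       already been isolated as a string whose first character is cycle 0;
       timeline_events is a hypothetical helper and not part of llvm-mca.

```python
def timeline_events(row):
    """Return the cycle of each state change for one timeline row."""
    # The position of a character in the row is the cycle it refers to.
    return {
        "dispatched": row.index("D"),
        "execution_started": row.index("e"),  # first 'e' in the row
        "write_back": row.index("E"),
        "retired": row.index("R"),
    }


# Cycle columns of instruction [1,0] (vmulps from iteration #1).
events = timeline_events(".DeeE-----R")
print(events)
# {'dispatched': 1, 'execution_started': 2, 'write_back': 4, 'retired': 10}
```

       The lookups are case-sensitive, so 'e' (executing) and 'E' (executed)
       are found independently.
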
       Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
       wait in the scheduler's queue for the operands to become available. By
       the time vmulps is dispatched, operands are already available, and
       pipeline JFPU1 is ready to serve another instruction. So the
       instruction can be immediately issued on the JFPU1 pipeline. That is
       demonstrated by the fact that the instruction only spent 1cy in the
       scheduler's queue.

       There is a gap of 5 cycles between the write-back stage and the retire
       event. That is because instructions must retire in program order, so
       [1,0] has to wait for [0,2] to be retired first (i.e., it has to wait
       until cycle 10).

       In the example, all instructions are in a RAW (Read After Write)
       dependency chain. Register %xmm2 written by vmulps is immediately used
       by the first vhaddps, and register %xmm3 written by the first vhaddps
       is used by the second vhaddps. Long data dependencies negatively impact
       the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies introduced by
       instructions from different iterations. However, those dependencies
       can be removed at the register renaming stage (at the cost of
       allocating register aliases, and therefore consuming physical
       registers).

       Table Average Wait times helps diagnose performance issues that are
       caused by the presence of long latency instructions and potentially
       long data dependencies which may limit the ILP. The last row, <total>,
       shows a global average over all instructions measured. Note that
       llvm-mca, by default, assumes at least 1cy between the dispatch event
       and the issue event.

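       Columns [1] and [3] of the Average Wait times table can be recomputed
       from the timeline rows shown earlier. The Python sketch below is an
       illustration under two assumptions: the time in the scheduler's queue
       is the distance between 'D' and the first 'e', and the time from
       write-back to retire is the number of '-' characters. The helper names
       are hypothetical.

```python
# Cycle columns of the timeline view, grouped by instruction index.
ROWS = {
    "0. vmulps":  ["DeeER", ".DeeE-----R", ".  DeeE-----R"],
    "1. vhaddps": ["D==eeeER", ". D=eeeE---R", ".  D====eeeER"],
    "2. vhaddps": [".D====eeeER", ". D====eeeER", ".   D======eeeER"],
}


def queue_cycles(row):
    """Cycles between the dispatch event and the first executing cycle."""
    return row.index("e") - row.index("D")


def retire_wait_cycles(row):
    """Cycles spent executed but not yet retired ('-' characters)."""
    return row.count("-")


def average(values):
    return sum(values) / len(values)


wait_times = {
    name: (average([queue_cycles(r) for r in rows]),
           average([retire_wait_cycles(r) for r in rows]))
    for name, rows in ROWS.items()
}

for name, (in_queue, until_retire) in wait_times.items():
    print(f"{name:<12} [1]={in_queue:.1f}  [3]={until_retire:.1f}")
```

       For these rows the sketch gives 1.0, 3.3 and 5.7 for column [1], and
       3.3, 1.0 and 0.0 for column [3], matching the table. Column [2] needs
       operand readiness information that the timeline characters alone do not
       expose.
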
       When the performance is limited by data dependencies and/or long
       latency instructions, the number of cycles spent while in the ready
       state is expected to be very small when compared with the total number
       of cycles spent in the scheduler's queue. The difference between the
       two counters is a good indicator of how large an impact data
       dependencies had on the execution of the instructions. When performance
       is mostly limited by the lack of hardware resources, the delta between
       the two counters is small. However, the number of cycles spent in the
       queue tends to be larger (i.e., more than 1-3cy), especially when
       compared to other low latency instructions.

   Bottleneck Analysis
       The -bottleneck-analysis command line option enables the analysis of
       performance bottlenecks.

       This analysis is potentially expensive. It attempts to correlate
       increases in backend pressure (caused by pipeline resource pressure and
       data dependencies) to dynamic dispatch stalls.

       Below is an example of -bottleneck-analysis output generated by
       llvm-mca for 500 iterations of the dot-product example on btver2.

          Cycles with backend pressure increase [ 48.07% ]
          Throughput Bottlenecks:
            Resource Pressure       [ 47.77% ]
            - JFPA  [ 47.77% ]
            - JFPU0  [ 47.77% ]
            Data Dependencies:      [ 0.30% ]
            - Register Dependencies [ 0.30% ]
            - Memory Dependencies   [ 0.00% ]

          Critical sequence based on the simulation:

                        Instruction                         Dependency Information
           +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
           |
           |    < loop carried >
           |
           |      0.    vmulps  %xmm0, %xmm1, %xmm2
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
           +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
           |
           |    < loop carried >
           |
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]

       According to the analysis, throughput is limited by resource pressure
       and not by data dependencies. The analysis observed increases in
       backend pressure during 48.07% of the simulated run. Almost all those
       pressure increase events were caused by contention on processor
       resources JFPA/JFPU0.

       The critical sequence is the most expensive sequence of instructions
       according to the simulation. It is annotated to provide extra
       information about critical register dependencies and resource
       interferences between instructions.

       Instructions from the critical sequence are expected to significantly
       impact performance. By construction, the accuracy of this analysis is
       strongly dependent on the simulation and (as always) on the quality of
       the processor model in LLVM.

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and
       performance counters for the dispatch logic, the reorder buffer, the
       retire control unit, and the register file.

       Below is an example of -all-stats output generated by llvm-mca for 300
       iterations of the dot-product example discussed in the previous
       sections.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272  (44.6%)
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0


          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)


          Schedulers - number of cycles where we saw N micro opcodes issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)

          Scheduler's queue usage:
          [1] Resource name.
          [2] Average number of used buffer entries.
          [3] Maximum number of used buffer entries.
          [4] Total number of buffer entries.

           [1]            [2]        [3]        [4]
          JALU01           0          0          20
          JFPU01           17         18         18
          JLSAGU           0          0          12


          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
           2,           399  (65.4%)

          Total ROB Entries:                64
          Max Used ROB Entries:             35  ( 54.7% )
          Average Used ROB Entries per cy:  32  ( 50.0% )


          Register File statistics:
          Total number of mappings created:    900
          Max number of mappings used:         35

          *  Register File #1 -- JFpuPRF:
             Number of physical registers:     72
             Total number of mappings created: 900
             Max number of mappings used:      35

          *  Register File #2 -- JIntegerPRF:
             Number of physical registers:     64
             Total number of mappings created: 0
             Max number of mappings used:      0

       If we look at the Dynamic Dispatch Stall Cycles table, we see the
       counter for SCHEDQ reports 272 cycles. This counter is incremented
       every time the dispatch logic is unable to dispatch a full group
       because the scheduler's queue is full.

       Looking at the Dispatch Logic table, we see that the pipeline was only
       able to dispatch two micro opcodes 51.5% of the time. The dispatch
       group was limited to one micro opcode 44.6% of the cycles, which
       corresponds to 272 cycles. The dispatch statistics are displayed by
       using either the command option -all-stats or -dispatch-stats.
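
       The figures in the Dispatch Logic table can be cross-checked with a
       little arithmetic. The sketch below is not part of llvm-mca; it only
       re-derives the percentages and totals from the histogram printed
       above. The function name summarize is made up for this illustration.

```python
# Dispatch histogram copied from the -all-stats output above:
# micro opcodes dispatched in a cycle -> number of cycles.
dispatch_histogram = {0: 24, 1: 272, 2: 314}


def summarize(histogram):
    """Return (total cycles, total micro opcodes, percent of cycles per bucket)."""
    total_cycles = sum(histogram.values())
    total_uops = sum(n * cycles for n, cycles in histogram.items())
    percentages = {
        n: round(100.0 * cycles / total_cycles, 1)
        for n, cycles in histogram.items()
    }
    return total_cycles, total_uops, percentages


cycles, uops, pct = summarize(dispatch_histogram)
print(cycles, uops, pct)
```

       The buckets add up to the 610 simulated cycles, and weighting each
       bucket by its micro opcode count gives the 900 micro opcodes of the
       300 iterations. The same check can be applied to the Schedulers and
       Retire Control Unit histograms.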
692
       The next table, Schedulers, is a histogram of the number of micro
       opcodes issued per cycle. In this case, of the 610 simulated cycles,
       a single micro opcode was issued in 306 cycles (50.2%), and there
       were 7 cycles in which no micro opcodes were issued.
698
       The Scheduler's queue usage table shows the average and maximum
       number of buffer entries (i.e., scheduler queue entries) used at
       runtime. Resource JFPU01 reached its maximum (18 of 18 queue entries).
       Note that AMD Jaguar implements three schedulers:
703
704       • JALU01 - A scheduler for ALU instructions.
705
       • JFPU01 - A scheduler for floating point operations.
707
708       • JLSAGU - A scheduler for address generation.
709
710       The  dot-product  is  a  kernel of three floating point instructions (a
711       vector multiply followed by two horizontal adds).   That  explains  why
712       only the floating point scheduler appears to be used.
713
       A full scheduler queue is caused either by data dependency chains or
       by sub-optimal usage of hardware resources. Sometimes, resource
       pressure can be mitigated by rewriting the kernel using different
       instructions that consume different scheduler resources. Schedulers
       with a small queue are less resilient to bottlenecks caused by the
       presence of long data dependencies. The scheduler statistics are
       displayed by using the command option -all-stats or -scheduler-stats.
721
       The next table, Retire Control Unit, is a histogram of the number of
       instructions retired per cycle. In this case, of the 610 simulated
       cycles, two instructions were retired during the same cycle 399 times
       (65.4%), and there were 109 cycles in which no instructions were
       retired. The retire statistics are displayed by using the command
       option -all-stats or -retire-stats.
728
       The last table presented is Register File statistics. Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for
       floating-point registers (JFpuPRF) and one for integer registers
       (JIntegerPRF). The table shows that of the 900 instructions processed,
       there were 900 mappings created. Since this dot-product example
       utilized only floating point registers, the JFpuPRF was responsible
       for creating the 900 mappings. However, we see that the pipeline only
       used a maximum of 35 of 72 available register slots at any given time.
       We can conclude that the floating point PRF was the only register file
       used for the example, and that it was never resource constrained. The
       register file statistics are displayed by using the command option
       -all-stats or -register-file-stats.
742
743       In this example, we can conclude that the IPC is mostly limited by data
744       dependencies, and not by resource pressure.
745
746   Instruction Flow
747       This  section  describes the instruction flow through the default pipe‐
748       line of llvm-mca, as well as  the  functional  units  involved  in  the
749       process.
750
751       The  default  pipeline implements the following sequence of stages used
752       to process instructions.
753
754       • Dispatch (Instruction is dispatched to the schedulers).
755
756       • Issue (Instruction is issued to the processor pipelines).
757
758       • Write Back (Instruction is executed, and results are written back).
759
760       • Retire (Instruction is retired; writes  are  architecturally  commit‐
761         ted).
762
       The default pipeline only models the out-of-order portion of a
       processor. Therefore, the instruction fetch and decode stages are not
       modeled, and performance bottlenecks in the frontend are not
       diagnosed. llvm-mca assumes that instructions have all been decoded
       and placed into a queue before the simulation starts. Also, llvm-mca
       does not model branch prediction.
769
770   Instruction Dispatch
771       During the dispatch stage, instructions are  picked  in  program  order
772       from  a queue of already decoded instructions, and dispatched in groups
773       to the simulated hardware schedulers.
774
775       The size of a dispatch group depends on the availability of  the  simu‐
776       lated hardware resources.  The processor dispatch width defaults to the
777       value of the IssueWidth in LLVM's scheduling model.
778
779       An instruction can be dispatched if:
780
       • The size of the dispatch group is smaller than the processor's
         dispatch width.
783
784       • There are enough entries in the reorder buffer.
785
786       • There are enough physical registers to do register renaming.
787
788       • The schedulers are not full.
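
       The four conditions above can be read as a single predicate. The
       following is a minimal sketch, not llvm-mca's implementation: the
       names DispatchState, Instruction and dispatch_blocker, as well as
       every field, are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class DispatchState:
    """Hypothetical snapshot of the simulated resources in one cycle."""
    dispatch_width: int           # maximum size of a dispatch group
    group_size: int               # size of the group built so far this cycle
    free_rob_entries: int
    free_physical_registers: int
    free_scheduler_entries: int


@dataclass
class Instruction:
    micro_ops: int                # assumed equal to the ROB entries consumed
    register_writes: int          # physical registers needed for renaming
    scheduler_entries: int = 1    # scheduler buffer entries consumed


def dispatch_blocker(inst, state):
    """Return None if inst can be dispatched, else a short reason string."""
    if state.group_size >= state.dispatch_width:
        return "dispatch group full"
    if inst.micro_ops > state.free_rob_entries:
        return "reorder buffer full"
    if inst.register_writes > state.free_physical_registers:
        return "no physical registers"
    if inst.scheduler_entries > state.free_scheduler_entries:
        return "scheduler full"
    return None


inst = Instruction(micro_ops=1, register_writes=1)
idle = DispatchState(2, 0, 64, 72, 18)
queue_full = DispatchState(2, 0, 64, 72, 0)
print(dispatch_blocker(inst, idle), dispatch_blocker(inst, queue_full))
```

       Returning a reason rather than a plain boolean mirrors the way the
       Dynamic Dispatch Stall Cycles table attributes each stall to a single
       cause.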
789
790       Scheduling  models  can  optionally  specify  which  register files are
791       available on the processor. llvm-mca uses that information to  initial‐
792       ize  register file descriptors.  Users can limit the number of physical
793       registers that are globally available for register  renaming  by  using
794       the  command  option -register-file-size.  A value of zero for this op‐
795       tion means unbounded. By knowing how many registers are  available  for
796       renaming,  the  tool  can predict dispatch stalls caused by the lack of
797       physical registers.
798
799       The number of reorder buffer entries consumed by an instruction depends
800       on  the  number  of micro-opcodes specified for that instruction by the
801       target scheduling model.  The reorder buffer is responsible for  track‐
802       ing  the  progress  of  instructions that are "in-flight", and retiring
803       them in program order.  The number of entries in the reorder buffer de‐
804       faults  to the value specified by field MicroOpBufferSize in the target
805       scheduling model.
806
807       Instructions that are dispatched to the  schedulers  consume  scheduler
808       buffer  entries. llvm-mca queries the scheduling model to determine the
809       set of buffered resources consumed by  an  instruction.   Buffered  re‐
810       sources are treated like scheduler resources.
811
812   Instruction Issue
       Each processor scheduler implements a buffer of instructions. An
       instruction has to wait in the scheduler's buffer until its input
       register operands become available. Only at that point does the
       instruction become eligible for execution, and it may then be issued
       (potentially out-of-order). Instruction latencies are computed by
       llvm-mca with the help of the scheduling model.
819
820       llvm-mca's scheduler is designed to simulate multiple processor  sched‐
821       ulers.   The  scheduler  is responsible for tracking data dependencies,
822       and dynamically selecting which processor resources are consumed by in‐
823       structions.   It  delegates  the management of processor resource units
824       and resource groups to a resource manager.  The resource manager is re‐
825       sponsible  for  selecting  resource units that are consumed by instruc‐
826       tions.  For example, if an  instruction  consumes  1cy  of  a  resource
827       group, the resource manager selects one of the available units from the
828       group; by default, the resource manager uses a round-robin selector  to
829       guarantee  that  resource  usage  is  uniformly distributed between all
830       units of a group.
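
       A round-robin selector of the kind described above could look like
       the sketch below. It is an illustration under stated assumptions, not
       the code used by llvm-mca: the class RoundRobinGroup, its methods and
       the unit names passed to it are all hypothetical.

```python
class RoundRobinGroup:
    """Hypothetical resource group that hands out its units in turn."""

    def __init__(self, units):
        self.units = list(units)
        # Cycle at which each unit becomes free again.
        self.busy_until = {unit: 0 for unit in self.units}
        self.next_index = 0

    def select(self, now, cycles=1):
        """Reserve the next free unit for `cycles` cycles; None if all are busy."""
        count = len(self.units)
        for offset in range(count):
            index = (self.next_index + offset) % count
            unit = self.units[index]
            if self.busy_until[unit] <= now:
                self.busy_until[unit] = now + cycles
                # Start the next search after the unit just chosen.
                self.next_index = (index + 1) % count
                return unit
        return None


group = RoundRobinGroup(["JFPU0", "JFPU1"])
picks = [group.select(now=cycle) for cycle in range(4)]
print(picks)
```

       Because the search always resumes after the last unit chosen, one
       cycle of work per request alternates between the two units instead
       of always landing on the first one.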
831
832       llvm-mca's scheduler internally groups instructions into three sets:
833
834       • WaitSet: a set of instructions whose operands are not ready.
835
836       • ReadySet: a set of instructions ready to execute.
837
838       • IssuedSet: a set of instructions executing.
839
       Depending on operand availability, instructions that are dispatched
       to the scheduler are placed either into the WaitSet or into the
       ReadySet.
843
844       Every cycle, the scheduler checks if instructions can be moved from the
845       WaitSet  to  the ReadySet, and if instructions from the ReadySet can be
846       issued to the underlying pipelines. The algorithm prioritizes older in‐
847       structions over younger instructions.
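
       The movement of instructions between the three sets can be sketched
       as a small loop. This is a toy model written for this page, not
       llvm-mca's scheduler: it ignores resource pressure, the names Inst
       and simulate are hypothetical, and the latencies in the example are
       made up. It assumes that dependencies only point to older
       instructions.

```python
from dataclasses import dataclass, field


@dataclass
class Inst:
    index: int                                # program order; lower is older
    latency: int
    deps: set = field(default_factory=set)    # indices of producer instructions


def simulate(instructions, issue_width=2):
    """Return {instruction index: cycle in which it was issued}."""
    wait_set = list(instructions)
    ready_set = []
    issued_set = {}      # instruction index -> cycle of write-back
    done = set()
    issue_cycle = {}
    cycle = 0
    while wait_set or ready_set or issued_set:
        # Write-back: executed instructions leave the IssuedSet.
        for idx in [i for i, wb in issued_set.items() if wb <= cycle]:
            del issued_set[idx]
            done.add(idx)
        # WaitSet -> ReadySet once every producer has written back.
        for inst in [i for i in wait_set if i.deps <= done]:
            wait_set.remove(inst)
            ready_set.append(inst)
        # ReadySet -> IssuedSet, older instructions first.
        ready_set.sort(key=lambda i: i.index)
        for inst in ready_set[:issue_width]:
            issued_set[inst.index] = cycle + inst.latency
            issue_cycle[inst.index] = cycle
        ready_set = ready_set[issue_width:]
        cycle += 1
    return issue_cycle


# A three-instruction dependency chain with made-up latencies.
chain = [Inst(0, 2), Inst(1, 3, {0}), Inst(2, 3, {1})]
print(simulate(chain))
```

       With a dependency chain, each instruction is issued only after its
       producer has written back, so the issue width is never the limiting
       factor in this example.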
848
849   Write-Back and Retire Stage
850       Issued  instructions  are  moved  from  the  ReadySet to the IssuedSet.
851       There, instructions wait until they reach  the  write-back  stage.   At
852       that point, they get removed from the queue and the retire control unit
853       is notified.
854
       When an instruction is executed, the retire control unit flags it as
       "ready to retire."
857
858       Instructions  are retired in program order.  The register file is noti‐
859       fied of the retirement so that it can free the physical registers  that
860       were allocated for the instruction during the register renaming stage.
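
       In-order retirement can be illustrated with a toy reorder buffer. The
       sketch below is an assumption-laden illustration, not llvm-mca code:
       the class ReorderBuffer, its methods and the max_per_cycle parameter
       are hypothetical, and register file bookkeeping is left out.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: entries are taken at dispatch and freed at in-order retire."""

    def __init__(self, size):
        self.size = size
        self.used = 0
        # [name, entries consumed, executed?] in program order.
        self.in_flight = deque()

    def dispatch(self, name, micro_ops):
        if self.used + micro_ops > self.size:
            return False          # not enough entries: dispatch stalls
        self.in_flight.append([name, micro_ops, False])
        self.used += micro_ops
        return True

    def mark_executed(self, name):
        for entry in self.in_flight:
            if entry[0] == name:
                entry[2] = True   # flagged "ready to retire"

    def retire(self, max_per_cycle=2):
        """Retire executed instructions from the head, in program order."""
        retired = []
        while (self.in_flight and self.in_flight[0][2]
               and len(retired) < max_per_cycle):
            name, micro_ops, _ = self.in_flight.popleft()
            self.used -= micro_ops
            retired.append(name)
        return retired


rob = ReorderBuffer(size=4)
rob.dispatch("a", 1)
rob.dispatch("b", 2)
rob.mark_executed("b")
blocked = rob.retire()        # "b" is done, but the older "a" is not
rob.mark_executed("a")
print(blocked, rob.retire(), rob.used)
```

       An instruction that finishes early still has to wait behind older
       instructions, which is why a long-latency instruction at the head of
       the buffer can fill it up.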
861
862   Load/Store Unit and Memory Consistency Model
       To simulate the out-of-order execution of memory operations, llvm-mca
       uses a simulated load/store unit (LSUnit) that models the speculative
       execution of loads and stores.
866
867       Each  load  (or  store) consumes an entry in the load (or store) queue.
868       Users can specify flags -lqueue and -squeue to limit the number of  en‐
869       tries  in  the  load  and store queues respectively. The queues are un‐
870       bounded by default.
871
872       The LSUnit implements a relaxed consistency model for memory loads  and
873       stores.  The rules are:
874
875       1. A younger load is allowed to pass an older load only if there are no
876          intervening stores or barriers between the two loads.
877
878       2. A younger load is allowed to pass an older store provided  that  the
879          load does not alias with the store.
880
881       3. A younger store is not allowed to pass an older store.
882
883       4. A younger store is not allowed to pass an older load.
884
       By default, the LSUnit optimistically assumes that loads do not alias
       store operations (-noalias=true). Under this assumption, younger loads
       are always allowed to pass older stores. Essentially, the LSUnit does
       not attempt to run any alias analysis to predict when loads and stores
       do not alias with each other.
890
       Note that, in the case of write-combining memory, rule 3 could be
       relaxed to allow reordering of non-aliasing store operations. That
       being said, at the moment, there is no way to further relax the memory
       model (-noalias is the only option). Essentially, there is no option
       to specify a different memory type (e.g., write-back, write-combining,
       write-through, etc.) and consequently to weaken, or strengthen, the
       memory model.
898
899       Other limitations are:
900
901       • The LSUnit does not know when store-to-load forwarding may occur.
902
903       • The  LSUnit  does  not know anything about cache hierarchy and memory
904         types.
905
906       • The LSUnit does not know how to identify serializing  operations  and
907         memory fences.
908
909       The  LSUnit  does  not  attempt  to  predict if a load or store hits or
910       misses the L1 cache.  It only knows if an instruction "MayLoad"  and/or
911       "MayStore."   For  loads, the scheduling model provides an "optimistic"
912       load-to-use latency (which usually matches the load-to-use latency  for
913       when there is a hit in the L1D).
914
       llvm-mca does not know about serializing operations or
       memory-barrier-like instructions. The LSUnit conservatively assumes
       that an instruction which has both "MayLoad" and unmodeled side
       effects behaves like a "soft" load-barrier. That is, it serializes
       loads without forcing a flush of the load queue. Similarly,
       instructions that "MayStore" and have unmodeled side effects are
       treated like store barriers. A full memory barrier is a "MayLoad" and
       "MayStore" instruction with unmodeled side effects. This is
       inaccurate, but it is the best that we can do at the moment with the
       current information available in LLVM.
924
925       A  load/store  barrier  consumes  one entry of the load/store queue.  A
926       load/store barrier enforces ordering of loads/stores.  A  younger  load
927       cannot  pass a load barrier.  Also, a younger store cannot pass a store
928       barrier.  A younger load has to wait for the memory/load barrier to ex‐
929       ecute.   A  load/store barrier is "executed" when it becomes the oldest
930       entry in the load/store queue(s). That also means, by construction, all
931       of the older loads/stores have been executed.
932
       In conclusion, the full set of load/store consistency rules is:
934
935       1. A store may not pass a previous store.
936
937       2. A store may not pass a previous load (regardless of -noalias).
938
939       3. A store has to wait until an older store barrier is fully executed.
940
941       4. A load may pass a previous load.
942
943       5. A load may not pass a previous store unless -noalias is set.
944
945       6. A load has to wait until an older load barrier is fully executed.
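
       The six rules above can be written as a pairwise check. This is a
       minimal sketch, not the LSUnit implementation: the names MemOp and
       may_pass are hypothetical, barriers are modeled as described earlier
       ("MayLoad" or "MayStore" plus unmodeled side effects), and only one
       younger/older pair is considered at a time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    may_load: bool = False
    may_store: bool = False
    side_effects: bool = False    # unmodeled side effects make it a barrier

    @property
    def is_load_barrier(self):
        return self.may_load and self.side_effects

    @property
    def is_store_barrier(self):
        return self.may_store and self.side_effects


def may_pass(younger, older, noalias=True):
    """Return True if `younger` may execute before `older`."""
    if younger.may_store:
        # Rules 1-3: a store never passes an older store, an older load,
        # or an older store barrier.
        if older.may_store or older.may_load or older.is_store_barrier:
            return False
    if younger.may_load:
        # Rule 6: a load waits for an older load barrier.
        if older.is_load_barrier:
            return False
        # Rule 5: a load passes an older store only when -noalias is set.
        if older.may_store and not noalias:
            return False
        # Rule 4: otherwise a load may pass an older load.
    return True


LOAD = MemOp(may_load=True)
STORE = MemOp(may_store=True)
LOAD_BARRIER = MemOp(may_load=True, side_effects=True)

print(may_pass(LOAD, STORE), may_pass(LOAD, STORE, noalias=False))
```

       A complete model would also have to walk the queue, since a load can
       pass an older load only when no barrier sits between them.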
946

AUTHOR

948       Maintained by the LLVM Team (https://llvm.org/).
949
951       2003-2023, LLVM Project
952
953
954
955
95611                                2023-07-20                       LLVM-MCA(1)