LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)

NAME

6       llvm-mca - LLVM Machine Code Analyzer
7

SYNOPSIS

9       llvm-mca [options] [input]
10

DESCRIPTION

       llvm-mca is a performance analysis tool that uses information available
       in LLVM (e.g. scheduling models) to statically measure the performance
       of machine code on a specific CPU.
15
16       Performance is measured in terms of throughput as well as processor re‐
17       source consumption. The tool currently works  for  processors  with  an
18       out-of-order  backend,  for which there is a scheduling model available
19       in LLVM.
20
       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help with diagnosing
       potential performance issues.
24
25       Given an assembly code sequence, llvm-mca  estimates  the  Instructions
26       Per  Cycle  (IPC),  as well as hardware resource pressure. The analysis
27       and reporting style were inspired by the IACA tool from Intel.
28
29       For example, you can compile code with clang, output assembly, and pipe
30       it directly into llvm-mca for analysis:
31
32          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2
33
34       Or for Intel syntax:
35
36          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2
37
38       Scheduling  models  are  not just used to compute instruction latencies
39       and throughput, but also to understand  what  processor  resources  are
40       available and how to simulate them.
41
42       By  design,  the  quality  of the analysis conducted by llvm-mca is in‐
43       evitably affected by the quality of the scheduling models in LLVM.
44
45       If you see that the performance report is not accurate for a processor,
46       please file a bug against the appropriate backend.
47

OPTIONS

49       If  input is "-" or omitted, llvm-mca reads from standard input. Other‐
50       wise, it will read from the specified filename.
51
52       If the -o option is omitted, then llvm-mca  will  send  its  output  to
53       standard  output if the input is from standard input.  If the -o option
54       specifies "-", then the output will also be sent to standard output.
55
56       -help  Print a summary of command line options.
57
58       -o <filename>
59              Use <filename> as the output filename. See the summary above for
60              more details.
61
62       -mtriple=<target triple>
63              Specify a target triple string.
64
65       -march=<arch>
66              Specify  the  architecture for which to analyze the code. It de‐
67              faults to the host default target.
68
69       -mcpu=<cpuname>
70              Specify the processor for which to analyze  the  code.   By  de‐
71              fault, the cpu name is autodetected from the host.
72
73       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated by
              the tool. On x86, possible values are [0, 1]. A value of 0
              selects the AT&T assembly format for the code printed by the
              tool in the analysis report; a value of 1 selects the Intel
              format.
79
80       -print-imm-hex
81              Prefer  hex  format  for numeric literals in the output assembly
82              printed as part of the report.
83
84       -dispatch=<width>
85              Specify a different dispatch width for the processor.  The  dis‐
86              patch  width  defaults  to  field  'IssueWidth' in the processor
87              scheduling model.  If width is zero, then the  default  dispatch
88              width is used.
89
90       -register-file-size=<size>
91              Specify the size of the register file. When specified, this flag
92              limits how many physical registers are  available  for  register
93              renaming  purposes.  A value of zero for this flag means "unlim‐
94              ited number of physical registers".
95
96       -iterations=<number of iterations>
97              Specify the number of iterations to run. If this flag is set  to
98              0,  then  the  tool  sets  the number of iterations to a default
99              value (i.e. 100).
100
101       -noalias=<bool>
102              If set, the tool assumes that loads and stores don't alias. This
103              is the default behavior.
104
105       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool. By default, the tool assumes an unbounded
              number of entries in the load queue. A value of zero for this
              flag is ignored, and the default load queue size is used instead.
110
111       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an unbounded
              number of entries in the store queue. A value of zero for this
              flag is ignored, and the default store queue size is used
              instead.
116
117       -timeline
118              Enable the timeline view.
119
120       -timeline-max-iterations=<iterations>
121              Limit the number of iterations to print in the timeline view. By
122              default, the timeline view prints information for up to 10 iter‐
123              ations.
124
125       -timeline-max-cycles=<cycles>
126              Limit the number of cycles in the timeline view. By default, the
127              number of cycles is set to 80.
128
129       -resource-pressure
130              Enable the resource pressure view. This is enabled by default.
131
132       -register-file-stats
133              Enable register file usage statistics.
134
135       -dispatch-stats
136              Enable extra dispatch statistics. This view  collects  and  ana‐
137              lyzes  instruction  dispatch  events,  as well as static/dynamic
138              dispatch stall events. This view is disabled by default.
139
140       -scheduler-stats
141              Enable extra scheduler statistics. This view collects  and  ana‐
142              lyzes  instruction  issue  events.  This view is disabled by de‐
143              fault.
144
145       -retire-stats
146              Enable extra retire control unit statistics. This view  is  dis‐
147              abled by default.
148
149       -instruction-info
150              Enable the instruction info view. This is enabled by default.
151
152       -show-encoding
153              Enable the printing of instruction encodings within the instruc‐
154              tion info view.
155
156       -all-stats
157              Print all hardware statistics. This enables extra statistics re‐
158              lated to the dispatch logic, the hardware schedulers, the regis‐
159              ter file(s), and the retire control unit. This  option  is  dis‐
160              abled by default.
161
162       -all-views
              Enable all the views.
164
165       -instruction-tables
166              Prints  resource pressure information based on the static infor‐
167              mation available from the processor model. This differs from the
168              resource  pressure view because it doesn't require that the code
169              is simulated. It instead prints the theoretical uniform  distri‐
170              bution of resource pressure for every instruction in sequence.
171
172       -bottleneck-analysis
173              Print  information about bottlenecks that affect the throughput.
174              This analysis can be expensive, and it is disabled  by  default.
175              Bottlenecks are highlighted in the summary view.
176

EXIT STATUS

178       llvm-mca  returns  0 on success. Otherwise, an error message is printed
179       to standard error, and the tool returns 1.
180

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS

182       llvm-mca allows for the optional usage of special code comments to mark
183       regions  of  the assembly code to be analyzed.  A comment starting with
184       substring LLVM-MCA-BEGIN marks the beginning of a code region.  A  com‐
185       ment  starting  with substring LLVM-MCA-END marks the end of a code re‐
186       gion.  For example:
187
188          # LLVM-MCA-BEGIN
189            ...
190          # LLVM-MCA-END
191
192       If no user-defined region is specified, then llvm-mca assumes a default
193       region  which  contains every instruction in the input file.  Every re‐
194       gion is analyzed in isolation, and the final performance report is  the
195       union of all the reports generated for every code region.
196
197       Code regions can have names. For example:
198
199          # LLVM-MCA-BEGIN A simple example
200            add %eax, %eax
201          # LLVM-MCA-END
202
203       The  code from the example above defines a region named "A simple exam‐
204       ple" with a single instruction in it. Note how the region name  doesn't
205       have  to  be  repeated in the LLVM-MCA-END directive. In the absence of
206       overlapping regions, an anonymous LLVM-MCA-END  directive  always  ends
207       the currently active user defined region.
208
209       Example of nesting regions:
210
211          # LLVM-MCA-BEGIN foo
212            add %eax, %edx
213          # LLVM-MCA-BEGIN bar
214            sub %eax, %edx
215          # LLVM-MCA-END bar
216          # LLVM-MCA-END foo
217
218       Example of overlapping regions:
219
220          # LLVM-MCA-BEGIN foo
221            add %eax, %edx
222          # LLVM-MCA-BEGIN bar
223            sub %eax, %edx
224          # LLVM-MCA-END foo
225            add %eax, %edx
226          # LLVM-MCA-END bar
227
228       Note  that multiple anonymous regions cannot overlap. Also, overlapping
229       regions cannot have the same name.
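
       The region rules above can be illustrated with a small Python sketch.
       It is not llvm-mca's parser: the function name collect_regions is
       hypothetical, and two behaviors are assumptions made for illustration
       (an anonymous LLVM-MCA-END closes the most recently opened region, and
       other comment lines are skipped). It only shows how an instruction
       ends up in every region that is active when it is seen.

```python
import re

# Matches "# LLVM-MCA-BEGIN [name]" and "# LLVM-MCA-END [name]" comments.
MARKER = re.compile(r"^\s*#\s*LLVM-MCA-(BEGIN|END)\b\s*(.*)$")


def collect_regions(asm_lines):
    """Map each region name to the instructions seen while it was active."""
    active = []   # names of currently open regions, oldest first
    regions = {}  # region name -> list of instruction strings
    for line in asm_lines:
        match = MARKER.match(line)
        if match is None:
            text = line.strip()
            # Assumption: any non-empty, non-comment line is an instruction.
            if text and not text.startswith("#"):
                for name in active:
                    regions[name].append(text)
            continue
        kind, name = match.group(1), match.group(2).strip()
        if kind == "BEGIN":
            # Overlapping regions cannot share a name (anonymous included).
            if name in active:
                raise ValueError(f"region {name!r} is already active")
            active.append(name)
            regions.setdefault(name, [])
        elif name:
            active.remove(name)  # named END closes that specific region
        else:
            active.pop()         # assumption: anonymous END closes the newest
    return regions


# The overlapping-regions example from the text above.
example = [
    "# LLVM-MCA-BEGIN foo",
    "  add %eax, %edx",
    "# LLVM-MCA-BEGIN bar",
    "  sub %eax, %edx",
    "# LLVM-MCA-END foo",
    "  add %eax, %edx",
    "# LLVM-MCA-END bar",
]
regions = collect_regions(example)
for region_name, instructions in regions.items():
    print(region_name, instructions)
```

       With the overlapping example, the sub instruction is attributed to
       both foo and bar, which is why each region gets its own report.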
230
231       There is no support for marking regions from  high-level  source  code,
232       like C or C++. As a workaround, inline assembly directives may be used:
233
234          int foo(int a, int b) {
235            __asm volatile("# LLVM-MCA-BEGIN foo");
236            a += 42;
237            __asm volatile("# LLVM-MCA-END");
238            a *= b;
239            return a;
240          }
241
242       However, this interferes with optimizations like loop vectorization and
243       may have an impact on the code generated. This  is  because  the  __asm
244       statements  are  seen as real code having important side effects, which
245       limits how the code around them can be transformed. If  users  want  to
246       make use of inline assembly to emit markers, then the recommendation is
247       to always verify that the output assembly is equivalent to the assembly
248       generated  in  the absence of markers.  The Clang options to emit opti‐
249       mization reports can also help in detecting missed optimizations.
250

HOW LLVM-MCA WORKS

252       llvm-mca takes assembly code as input. The assembly code is parsed into
253       a sequence of MCInst with the help of the existing LLVM target assembly
254       parsers. The parsed sequence of MCInst is then analyzed by  a  Pipeline
255       module to generate a performance report.
256
257       The  Pipeline  module  simulates  the execution of the machine code se‐
258       quence in a loop of iterations (default is 100). During  this  process,
259       the  pipeline collects a number of execution related statistics. At the
260       end of this process, the pipeline generates and prints  a  report  from
261       the collected statistics.
262
263       Here  is an example of a performance report generated by the tool for a
264       dot-product of two packed float vectors of four elements. The  analysis
265       is  conducted  for target x86, cpu btver2.  The following result can be
266       produced via  the  following  command  using  the  example  located  at
267       test/tools/llvm-mca/X86/BtVer2/dot-product.s:
268
269          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s
270
271          Iterations:        300
272          Instructions:      900
273          Total Cycles:      610
274          Total uOps:        900
275
276          Dispatch Width:    2
277          uOps Per Cycle:    1.48
278          IPC:               1.48
279          Block RThroughput: 2.0
280
281
282          Instruction Info:
283          [1]: #uOps
284          [2]: Latency
285          [3]: RThroughput
286          [4]: MayLoad
287          [5]: MayStore
288          [6]: HasSideEffects (U)
289
290          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
291           1      2     1.00                        vmulps      %xmm0, %xmm1, %xmm2
292           1      3     1.00                        vhaddps     %xmm2, %xmm2, %xmm3
293           1      3     1.00                        vhaddps     %xmm3, %xmm3, %xmm4
294
295
296          Resources:
297          [0]   - JALU0
298          [1]   - JALU1
299          [2]   - JDiv
300          [3]   - JFPA
301          [4]   - JFPM
302          [5]   - JFPU0
303          [6]   - JFPU1
304          [7]   - JLAGU
305          [8]   - JMul
306          [9]   - JSAGU
307          [10]  - JSTC
308          [11]  - JVALU0
309          [12]  - JVALU1
310          [13]  - JVIMUL
311
312
313          Resource pressure per iteration:
314          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
315           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -
316
317          Resource pressure by instruction:
318          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
319           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps      %xmm0, %xmm1, %xmm2
320           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm2, %xmm2, %xmm3
321           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm3, %xmm3, %xmm4
322
323       According  to this report, the dot-product kernel has been executed 300
324       times, for a total of 900 simulated instructions. The total  number  of
325       simulated micro opcodes (uOps) is also 900.
326
327       The  report  is  structured  in three main sections.  The first section
328       collects a few performance numbers; the goal of this section is to give
329       a  very quick overview of the performance throughput. Important perfor‐
330       mance indicators are IPC, uOps Per Cycle, and  Block RThroughput (Block
331       Reciprocal Throughput).
332
       Field Dispatch Width is the maximum number of micro opcodes that are
       dispatched to the out-of-order backend every simulated cycle.

       IPC is computed by dividing the total number of simulated instructions
       by the total number of cycles.
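
       As a quick check of the summary numbers, the following Python sketch
       recomputes IPC and uOps Per Cycle from the totals in the report above.
       The variable names are illustrative only; the input values are copied
       from the report.

```python
# Values taken from the summary section of the dot-product report.
iterations = 300
instructions_per_iteration = 3  # vmulps + 2 x vhaddps
uops_per_iteration = 3          # each instruction decodes to 1 uOp
total_cycles = 610

total_instructions = iterations * instructions_per_iteration
total_uops = iterations * uops_per_iteration

ipc = total_instructions / total_cycles
uops_per_cycle = total_uops / total_cycles

print(f"Instructions:   {total_instructions}")
print(f"IPC:            {ipc:.2f}")
print(f"uOps Per Cycle: {uops_per_cycle:.2f}")
```

       Both ratios round to 1.48 here because every instruction in this
       kernel is a single micro opcode.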
338
       Field Block RThroughput is the reciprocal of the block throughput.
       Block throughput is a theoretical quantity computed as the maximum
       number of blocks (i.e. iterations) that can be executed per simulated
       clock cycle in the absence of loop-carried dependencies. Block
       throughput is bounded from above by the dispatch rate and by the
       availability of hardware resources.
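
       The two limits can be made concrete with a short Python sketch. It
       assumes that Block RThroughput is the larger of a dispatch bound (uOps
       per iteration divided by the dispatch width) and a resource bound (the
       largest number of cycles any single resource is busy per iteration).
       That formula is an assumption for illustration; it ignores details
       such as resources with multiple units. The inputs are taken from the
       dot-product report above.

```python
dispatch_width = 2
uops_per_iteration = 3

# Resource cycles consumed per iteration, from the resource pressure table.
resource_cycles = {"JFPA": 2.0, "JFPM": 1.0, "JFPU0": 2.0, "JFPU1": 1.0}

# Cycles per iteration if only the dispatch rate were limiting.
dispatch_bound = uops_per_iteration / dispatch_width

# Cycles per iteration if only the busiest resource were limiting.
resource_bound = max(resource_cycles.values())

block_rthroughput = max(dispatch_bound, resource_bound)
print(f"Dispatch bound:    {dispatch_bound:.1f}")
print(f"Resource bound:    {resource_bound:.1f}")
print(f"Block RThroughput: {block_rthroughput:.1f}")
```

       For this kernel the resource bound dominates, which agrees with the
       Block RThroughput of 2.0 printed in the report.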
345
346       In the absence of loop-carried  data  dependencies,  the  observed  IPC
347       tends  to  a  theoretical maximum which can be computed by dividing the
348       number of instructions of a single iteration by the Block RThroughput.
349
       Field 'uOps Per Cycle' is computed by dividing the total number of
       simulated micro opcodes by the total number of cycles. A delta between
       Dispatch Width and this field is an indicator of a performance issue.
       In the absence of loop-carried data dependencies, the observed 'uOps
       Per Cycle' should tend to a theoretical maximum throughput which can
       be computed by dividing the number of uOps of a single iteration by
       the Block RThroughput.
357
358       Field uOps Per Cycle is bounded from above by the dispatch width.  That
359       is  because  the  dispatch  width limits the maximum size of a dispatch
360       group. Both IPC and 'uOps Per Cycle' are limited by the amount of hard‐
361       ware  parallelism.  The  availability of hardware resources affects the
362       resource pressure distribution, and it limits the  number  of  instruc‐
363       tions  that  can  be executed in parallel every cycle.  A delta between
364       Dispatch Width and the theoretical maximum uOps per Cycle (computed  by
365       dividing  the  number  of  uOps  of  a  single  iteration  by the Block
366       RThroughput) is an indicator of a performance bottleneck caused by  the
367       lack  of hardware resources.  In general, the lower the Block RThrough‐
368       put, the better.
369
370       In this example, uOps per iteration/Block RThroughput  is  1.50.  Since
371       there  are no loop-carried dependencies, the observed uOps Per Cycle is
372       expected to approach 1.50 when the number of iterations tends to infin‐
373       ity.  The  delta between the Dispatch Width (2.00), and the theoretical
374       maximum uOp throughput (1.50) is an indicator of a performance  bottle‐
375       neck  caused  by the lack of hardware resources, and the Resource pres‐
376       sure view can help to identify the problematic resource usage.
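
       The arithmetic behind this paragraph is sketched below in Python,
       using only numbers from the report. The variable names are
       illustrative.

```python
dispatch_width = 2.0
uops_per_iteration = 3
block_rthroughput = 2.0

# Theoretical maximum uOps per cycle without loop-carried dependencies.
theoretical_max = uops_per_iteration / block_rthroughput

# Observed value after 300 iterations (900 uOps in 610 cycles).
observed = 900 / 610

# Unused dispatch bandwidth; a non-zero delta hints at a resource bottleneck.
delta = dispatch_width - theoretical_max

print(f"Theoretical max uOps per cycle: {theoretical_max:.2f}")
print(f"Observed uOps per cycle:        {observed:.2f}")
print(f"Delta from Dispatch Width:      {delta:.2f}")
```

       The observed value (about 1.48) sits just below the 1.50 limit, and
       the 0.50 gap to the dispatch width is the part attributed to resource
       pressure.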
377
378       The second section of the report is the instruction info view. It shows
379       the  latency  and reciprocal throughput of every instruction in the se‐
380       quence. It also reports extra information related to the number of  mi‐
381       cro  opcodes,  and  opcode properties (i.e., 'MayLoad', 'MayStore', and
382       'HasSideEffects').
383
       Field RThroughput is the reciprocal of the instruction throughput.
       Throughput is computed as the maximum number of instructions of the
       same type that can be executed per clock cycle in the absence of
       operand dependencies. In this example, the reciprocal throughput of a
       vector float multiply is 1 cycle/instruction. That is because the FP
       multiplier JFPM is only available from pipeline JFPU1.
390
391       Instruction  encodings  are  displayed within the instruction info view
392       when flag -show-encoding is specified.
393
394       Below is an example of -show-encoding output for the  dot-product  ker‐
395       nel:
396
397          Instruction Info:
398          [1]: #uOps
399          [2]: Latency
400          [3]: RThroughput
401          [4]: MayLoad
402          [5]: MayStore
403          [6]: HasSideEffects (U)
404          [7]: Encoding Size
405
406          [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:                    Instructions:
407           1      2     1.00                         4     c5 f0 59 d0                   vmulps %xmm0, %xmm1, %xmm2
408           1      4     1.00                         4     c5 eb 7c da                   vhaddps        %xmm2, %xmm2, %xmm3
409           1      4     1.00                         4     c5 e3 7c e3                   vhaddps        %xmm3, %xmm3, %xmm4
410
411       The  Encoding Size column shows the size in bytes of instructions.  The
412       Encodings column shows the actual instruction encodings (byte sequences
413       in hex).
414
415       The third section is the Resource pressure view.  This view reports the
416       average number of resource cycles consumed every iteration by  instruc‐
417       tions  for  every processor resource unit available on the target.  In‐
418       formation is structured in two tables. The first table reports the num‐
419       ber of resource cycles spent on average every iteration. The second ta‐
420       ble correlates the resource cycles to the machine  instruction  in  the
421       sequence. For example, every iteration of the instruction vmulps always
422       executes on resource unit [6] (JFPU1 -  floating  point  pipeline  #1),
423       consuming  an  average of 1 resource cycle per iteration.  Note that on
424       AMD Jaguar, vector floating-point multiply can only be issued to  pipe‐
425       line  JFPU1,  while horizontal floating-point additions can only be is‐
426       sued to pipeline JFPU0.
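
       The relationship between the two tables can be sketched in Python: the
       per-iteration row is the column-wise sum of the per-instruction rows.
       The dictionary layout below is an illustrative choice, not a format
       produced by llvm-mca; the values are copied from the report above.

```python
from collections import defaultdict

# Per-instruction resource cycles, from "Resource pressure by instruction".
pressure_by_instruction = [
    ("vmulps  %xmm0, %xmm1, %xmm2", {"JFPM": 1.0, "JFPU1": 1.0}),
    ("vhaddps %xmm2, %xmm2, %xmm3", {"JFPA": 1.0, "JFPU0": 1.0}),
    ("vhaddps %xmm3, %xmm3, %xmm4", {"JFPA": 1.0, "JFPU0": 1.0}),
]

# Sum each resource column to obtain "Resource pressure per iteration".
pressure_per_iteration = defaultdict(float)
for _instruction, cycles in pressure_by_instruction:
    for resource, value in cycles.items():
        pressure_per_iteration[resource] += value

for resource in sorted(pressure_per_iteration):
    print(f"{resource:6s} {pressure_per_iteration[resource]:.2f}")
```

       JFPA and JFPU0 each accumulate 2.00 cycles per iteration, twice the
       load on JFPM and JFPU1, because both vhaddps instructions use them.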
427
428       The resource pressure view helps with identifying bottlenecks caused by
429       high  usage  of  specific hardware resources.  Situations with resource
430       pressure mainly concentrated on a few resources should, in general,  be
431       avoided.   Ideally,  pressure  should  be uniformly distributed between
432       multiple resources.
433
   Timeline View
435       The timeline view produces a  detailed  report  of  each  instruction's
436       state  transitions  through  an instruction pipeline.  This view is en‐
437       abled by the command line option -timeline.  As instructions transition
438       through  the  various stages of the pipeline, their states are depicted
439       in the view report.  These states  are  represented  by  the  following
440       characters:
441
442       • D : Instruction dispatched.
443
444       • e : Instruction executing.
445
446       • E : Instruction executed.
447
448       • R : Instruction retired.
449
450       • = : Instruction already dispatched, waiting to be executed.
451
452       • - : Instruction executed, waiting to be retired.
453
454       Below  is the timeline view for a subset of the dot-product example lo‐
455       cated in test/tools/llvm-mca/X86/BtVer2/dot-product.s and processed  by
456       llvm-mca using the following command:
457
458          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s
459
460          Timeline view:
461                              012345
462          Index     0123456789
463
464          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
465          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
466          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
467          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
468          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
469          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
470          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
471          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
472          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4
473
474
475          Average Wait times (based on the timeline view):
476          [0]: Executions
477          [1]: Average time spent waiting in a scheduler's queue
478          [2]: Average time spent waiting in a scheduler's queue while ready
479          [3]: Average time elapsed from WB until retire stage
480
481                [0]    [1]    [2]    [3]
482          0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
483          1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
484          2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
485                 3     3.3    0.5    1.4       <total>
486
487       The  timeline  view  is  interesting because it shows instruction state
488       changes during execution.  It also gives an idea of how the  tool  pro‐
489       cesses instructions executed on the target, and how their timing infor‐
490       mation might be calculated.
491
492       The timeline view is structured in two tables.  The first  table  shows
493       instructions  changing state over time (measured in cycles); the second
494       table (named Average Wait  times)  reports  useful  timing  statistics,
495       which  should help diagnose performance bottlenecks caused by long data
496       dependencies and sub-optimal usage of hardware resources.
497
       An instruction in the timeline view is identified by a pair of indices,
       where the first index identifies an iteration, and the second index is
       the instruction index (i.e., where it appears in the code sequence).
       Since this example was generated using 3 iterations (-iterations=3),
       the iteration indices range from 0 to 2, inclusive.
503
504       Excluding the first and last column, the remaining columns are  in  cy‐
505       cles.  Cycles are numbered sequentially starting from 0.
506
507       From the example output above, we know the following:
508
509       • Instruction [1,0] was dispatched at cycle 1.
510
511       • Instruction [1,0] started executing at cycle 2.
512
513       • Instruction [1,0] reached the write back stage at cycle 4.
514
515       • Instruction [1,0] was retired at cycle 10.
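
       The four facts listed above can be read off mechanically from the
       timeline row. The Python sketch below does so for instruction [1,0];
       the helper name decode_timeline_row is hypothetical, and it assumes
       the row contains at least one 'e' character, which holds for every
       row in this example.

```python
def decode_timeline_row(cells):
    """Return the cycle of each pipeline event in one timeline row.

    `cells` holds only the cycle columns of the row, so the character at
    position N describes the instruction's state at cycle N.
    """
    return {
        "dispatched": cells.index("D"),
        "execute_start": cells.index("e"),  # first cycle spent executing
        "write_back": cells.index("E"),     # execution completed
        "retired": cells.index("R"),
    }


# Cycle columns of row [1,0] (vmulps from iteration #1).
row_1_0 = ".DeeE-----R    ."
events = decode_timeline_row(row_1_0)
for name, cycle in events.items():
    print(f"{name:14s} cycle {cycle}")
```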
516
517       Instruction  [1,0]  (i.e.,  vmulps  from iteration #1) does not have to
518       wait in the scheduler's queue for the operands to become available.  By
519       the  time  vmulps  is  dispatched,  operands are already available, and
520       pipeline JFPU1 is ready to serve another instruction.  So the  instruc‐
521       tion  can  be  immediately issued on the JFPU1 pipeline. That is demon‐
522       strated by the fact that the instruction only spent 1cy in  the  sched‐
523       uler's queue.
524
525       There  is a gap of 5 cycles between the write-back stage and the retire
526       event.  That is because instructions must retire in program  order,  so
527       [1,0]  has  to wait for [0,2] to be retired first (i.e., it has to wait
528       until cycle 10).
529
530       In the example, all instructions are in a RAW (Read After Write) depen‐
531       dency  chain.   Register %xmm2 written by vmulps is immediately used by
532       the first vhaddps, and register %xmm3 written by the first  vhaddps  is
533       used  by  the second vhaddps.  Long data dependencies negatively impact
534       the ILP (Instruction Level Parallelism).
535
536       In the dot-product example, there are anti-dependencies  introduced  by
537       instructions  from  different  iterations.  However, those dependencies
538       can be removed at register renaming stage (at the  cost  of  allocating
539       register aliases, and therefore consuming physical registers).
540
541       Table  Average  Wait  times  helps diagnose performance issues that are
542       caused by the presence of long  latency  instructions  and  potentially
543       long  data  dependencies  which  may  limit the ILP. Last row, <total>,
544       shows a global  average  over  all  instructions  measured.  Note  that
545       llvm-mca,  by  default, assumes at least 1cy between the dispatch event
546       and the issue event.
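
       Columns [1] and [3] of the Average Wait times table can be reproduced
       from the timeline rows with a short Python sketch. It assumes that the
       time spent in the scheduler's queue is the distance between 'D' and
       the first 'e', and that the time from write-back to retire is the
       number of '-' characters. Those interpretations are assumptions,
       checked only against this example. Column [2] is not derived here
       because the timeline does not show when an instruction becomes ready.

```python
# Cycle columns of each timeline row, grouped by instruction index.
timeline = {
    0: ["DeeER.    .    .", ".DeeE-----R    .", ".  DeeE-----R  ."],
    1: ["D==eeeER  .    .", ". D=eeeE---R   .", ".  D====eeeER  ."],
    2: [".D====eeeER    .", ". D====eeeER   .", ".   D======eeeER"],
}


def queue_cycles(cells):
    # Includes the 1cy assumed between the dispatch and the issue event.
    return cells.index("e") - cells.index("D")


def retire_wait_cycles(cells):
    return cells.count("-")


def mean(values):
    return sum(values) / len(values)


avg_queue = {i: mean([queue_cycles(r) for r in rows]) for i, rows in timeline.items()}
avg_retire = {i: mean([retire_wait_cycles(r) for r in rows]) for i, rows in timeline.items()}

all_rows = [r for rows in timeline.values() for r in rows]
total_queue = mean([queue_cycles(r) for r in all_rows])
total_retire = mean([retire_wait_cycles(r) for r in all_rows])

for i in timeline:
    print(f"{i}.  [1]={avg_queue[i]:.1f}  [3]={avg_retire[i]:.1f}")
print(f"<total>  [1]={total_queue:.1f}  [3]={total_retire:.1f}")
```

       The 1.0 reported for vmulps in column [1] is therefore the minimum
       possible value: the instruction never waits on operands, but it still
       pays the one cycle between dispatch and issue.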
547
548       When the performance is limited by data dependencies  and/or  long  la‐
549       tency instructions, the number of cycles spent while in the ready state
550       is expected to be very small when compared with the total number of cy‐
551       cles  spent  in  the scheduler's queue.  The difference between the two
552       counters is a good indicator of how large of an impact  data  dependen‐
553       cies  had  on  the  execution of the instructions.  When performance is
554       mostly limited by the lack of hardware resources, the delta between the
555       two  counters  is  small.   However,  the number of cycles spent in the
556       queue tends to be larger (i.e., more than 1-3cy), especially when  com‐
557       pared to other low latency instructions.
558
   Bottleneck Analysis
560       The  -bottleneck-analysis  command  line option enables the analysis of
561       performance bottlenecks.
562
563       This analysis is potentially expensive. It attempts  to  correlate  in‐
564       creases  in  backend pressure (caused by pipeline resource pressure and
565       data dependencies) to dynamic dispatch stalls.
566
567       Below  is  an  example  of  -bottleneck-analysis  output  generated  by
568       llvm-mca for 500 iterations of the dot-product example on btver2.
569
570          Cycles with backend pressure increase [ 48.07% ]
571          Throughput Bottlenecks:
572            Resource Pressure       [ 47.77% ]
573            - JFPA  [ 47.77% ]
574            - JFPU0  [ 47.77% ]
575            Data Dependencies:      [ 0.30% ]
576            - Register Dependencies [ 0.30% ]
577            - Memory Dependencies   [ 0.00% ]
578
579          Critical sequence based on the simulation:
580
581                        Instruction                         Dependency Information
582           +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
583           |
584           |    < loop carried >
585           |
586           |      0.    vmulps  %xmm0, %xmm1, %xmm2
587           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
588           +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
589           |
590           |    < loop carried >
591           |
592           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
593
594       According  to  the analysis, throughput is limited by resource pressure
595       and not by data dependencies.  The analysis observed increases in back‐
596       end pressure during 48.07% of the simulated run. Almost all those pres‐
597       sure increase events were caused by contention on  processor  resources
598       JFPA/JFPU0.
599
600       The  critical  sequence  is the most expensive sequence of instructions
601       according to the simulation. It is annotated to provide extra  informa‐
602       tion  about  critical  register dependencies and resource interferences
603       between instructions.
604
       Instructions from the critical sequence are expected to significantly
       impact performance. By construction, the accuracy of this analysis is
       strongly dependent on the simulation and (as always) on the quality of
       the processor model in LLVM.
609
   Extra Statistics to Further Diagnose Performance Issues
611       The -all-stats command line option enables extra statistics and perfor‐
612       mance counters for the dispatch logic, the reorder buffer,  the  retire
613       control unit, and the register file.
614
615       Below is an example of -all-stats output generated by  llvm-mca for 300
616       iterations of the dot-product example discussed in  the  previous  sec‐
617       tions.
618
619          Dynamic Dispatch Stall Cycles:
620          RAT     - Register unavailable:                      0
621          RCU     - Retire tokens unavailable:                 0
622          SCHEDQ  - Scheduler full:                            272  (44.6%)
623          LQ      - Load queue full:                           0
624          SQ      - Store queue full:                          0
625          GROUP   - Static restrictions on the dispatch group: 0
626
627
628          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
629          [# dispatched], [# cycles]
630           0,              24  (3.9%)
631           1,              272  (44.6%)
632           2,              314  (51.5%)
633
634
635          Schedulers - number of cycles where we saw N micro opcodes issued:
636          [# issued], [# cycles]
637           0,          7  (1.1%)
638           1,          306  (50.2%)
639           2,          297  (48.7%)
640
641          Scheduler's queue usage:
642          [1] Resource name.
643          [2] Average number of used buffer entries.
644          [3] Maximum number of used buffer entries.
645          [4] Total number of buffer entries.
646
647           [1]            [2]        [3]        [4]
648          JALU01           0          0          20
649          JFPU01           17         18         18
650          JLSAGU           0          0          12
651
652
653          Retire Control Unit - number of cycles where we saw N instructions retired:
654          [# retired], [# cycles]
655           0,           109  (17.9%)
656           1,           102  (16.7%)
657           2,           399  (65.4%)
658
659          Total ROB Entries:                64
660          Max Used ROB Entries:             35  ( 54.7% )
661          Average Used ROB Entries per cy:  32  ( 50.0% )
662
663
664          Register File statistics:
665          Total number of mappings created:    900
666          Max number of mappings used:         35
667
668          *  Register File #1 -- JFpuPRF:
669             Number of physical registers:     72
670             Total number of mappings created: 900
671             Max number of mappings used:      35
672
673          *  Register File #2 -- JIntegerPRF:
674             Number of physical registers:     64
675             Total number of mappings created: 0
676             Max number of mappings used:      0
677
678       If  we  look  at  the  Dynamic  Dispatch Stall Cycles table, we see the
679       counter for SCHEDQ reports 272 cycles.  This counter is incremented ev‐
680       ery  time the dispatch logic is unable to dispatch a full group because
681       the scheduler's queue is full.
682
683       Looking at the Dispatch Logic table, we see that the pipeline was  only
684       able  to  dispatch  two  micro opcodes 51.5% of the time.  The dispatch
685       group was limited to one micro opcode 44.6% of the cycles, which corre‐
686       sponds  to 272 cycles.  The dispatch statistics are displayed by either
687       using the command option -all-stats or -dispatch-stats.
688
689       The next table, Schedulers, presents a histogram  displaying  a  count,
690       representing  the  number of micro opcodes issued on some number of cy‐
691       cles. In this case, of the 610 simulated cycles,  single  opcodes  were
692       issued  306 times (50.2%) and there were 7 cycles where no opcodes were
693       issued.
694
695       The Scheduler's queue usage table  shows  the  average  and  maximum
696       number  of  buffer entries (i.e., scheduler queue entries) used at run‐
697       time.  Resource JFPU01 reached its maximum (18 of  18  queue  entries).
698       Note that AMD Jaguar implements three schedulers:
699
700       • JALU01 - A scheduler for ALU instructions.
701
702       • JFPU01 - A scheduler for floating point operations.
703
704       • JLSAGU - A scheduler for address generation.
705
706       The  dot-product  is  a  kernel of three floating point instructions (a
707       vector multiply followed by two horizontal adds).   That  explains  why
708       only the floating point scheduler appears to be used.
709
710       A full scheduler queue is either caused by data dependency chains or by
711       a sub-optimal usage of hardware resources.  Sometimes,  resource  pres‐
712       sure  can be mitigated by rewriting the kernel using different instruc‐
713       tions that consume different scheduler resources.   Schedulers  with  a
714       small queue are less resilient to bottlenecks caused by the presence of
715       long data dependencies.  The scheduler statistics are displayed by  us‐
716       ing the command option -all-stats or -scheduler-stats.
717
718       The  next table, Retire Control Unit, presents a histogram displaying a
719       count, representing the number of instructions retired on  some  number
720       of cycles.  In this case, of the 610 simulated cycles, two instructions
721       were retired during the same cycle 399 times (65.4%) and there were 109
722       cycles  where  no instructions were retired.  The retire statistics are
723       displayed by using the command option -all-stats or -retire-stats.
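
       As a cross-check, the three histograms in the report above can be
       summed to recover the simulated cycle count and the total number of
       events. The sketch below is only illustrative arithmetic on the
       numbers printed in the report; the helper names are hypothetical and
       are not part of llvm-mca.

```python
# Histograms copied from the -all-stats report above:
# {events per cycle: number of cycles with that many events}.
dispatched = {0: 24, 1: 272, 2: 314}   # Dispatch Logic
issued = {0: 7, 1: 306, 2: 297}        # Schedulers
retired = {0: 109, 1: 102, 2: 399}     # Retire Control Unit


def total_cycles(hist):
    """Number of simulated cycles covered by a histogram."""
    return sum(hist.values())


def total_events(hist):
    """Total events (micro opcodes or instructions) across all cycles."""
    return sum(n * cycles for n, cycles in hist.items())


def share(hist, n):
    """Percentage of cycles in which exactly n events occurred."""
    return round(100.0 * hist[n] / total_cycles(hist), 1)


for name, hist in (("dispatched", dispatched),
                   ("issued", issued),
                   ("retired", retired)):
    print(name, total_cycles(hist), total_events(hist))
```

       Each histogram covers the same 610 cycles and accounts for 900
       events, which matches 300 iterations of a three-instruction kernel.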
724
725       The last table presented is Register File  statistics.   Each  physical
726       register  file  (PRF)  used by the pipeline is presented in this table.
727       In the case of AMD Jaguar, there are two register files, one for float‐
728       ing-point  registers  (JFpuPRF)  and  one for integer registers (JInte‐
729       gerPRF).  The table shows that of the 900 instructions processed, there
730       were  900  mappings  created.   Since this dot-product example utilized
731       only floating point registers, the JFpuPRF was responsible for creating
732       the  900 mappings.  However, we see that the pipeline only used a maxi‐
733       mum of 35 of 72 available register slots at any given time. We can con‐
734       clude  that  the floating point PRF was the only register file used for
735       the example, and that it was never resource constrained.  The  register
736       file statistics are displayed by using the command option -all-stats or
737       -register-file-stats.
738
739       In this example, we can conclude that the IPC is mostly limited by data
740       dependencies, and not by resource pressure.
741
742   Instruction Flow
743       This  section  describes the instruction flow through the default pipe‐
744       line of llvm-mca, as well as  the  functional  units  involved  in  the
745       process.
746
747       The  default  pipeline implements the following sequence of stages used
748       to process instructions.
749
750       • Dispatch (Instruction is dispatched to the schedulers).
751
752       • Issue (Instruction is issued to the processor pipelines).
753
754       • Write Back (Instruction is executed, and results are written back).
755
756       • Retire (Instruction is retired; writes  are  architecturally  commit‐
757         ted).
758
759       The  default pipeline only models the out-of-order portion of a proces‐
760       sor.  Therefore, the instruction fetch and decode stages are  not  mod‐
761       eled.  Performance  bottlenecks  in  the  frontend  are  not diagnosed.
762       llvm-mca assumes that instructions have all  been  decoded  and  placed
763       into  a  queue  before the simulation starts.  Also, llvm-mca does  not
764       model branch prediction.
765
766   Instruction Dispatch
767       During the dispatch stage, instructions are  picked  in  program  order
768       from  a queue of already decoded instructions, and dispatched in groups
769       to the simulated hardware schedulers.
770
771       The size of a dispatch group depends on the availability of  the  simu‐
772       lated hardware resources.  The processor dispatch width defaults to the
773       value of the IssueWidth in LLVM's scheduling model.
774
775       An instruction can be dispatched if:
776
777       • The size of the dispatch group is smaller than the processor's dispatch
778         width.
779
780       • There are enough entries in the reorder buffer.
781
782       • There are enough physical registers to do register renaming.
783
784       • The schedulers are not full.
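
       The four conditions above can be read as a single predicate. The
       following is a minimal sketch, not the llvm-mca implementation: the
       DispatchState fields and the can_dispatch() helper are hypothetical
       names, and the units used for each check (micro opcodes for the
       reorder buffer, register writes for renaming) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DispatchState:
    """Hypothetical snapshot of the simulated resources in one cycle."""
    dispatch_width: int       # maximum size of a dispatch group
    group_size: int           # entries already in the current group
    free_rob_entries: int     # unused reorder buffer entries
    free_phys_regs: int       # physical registers left for renaming
    free_sched_entries: int   # unused scheduler buffer entries


def can_dispatch(state, num_uops, num_reg_writes):
    """Return True if all four dispatch conditions hold."""
    return (
        state.group_size < state.dispatch_width       # room in the group
        and state.free_rob_entries >= num_uops        # enough ROB entries
        and state.free_phys_regs >= num_reg_writes    # enough registers
        and state.free_sched_entries > 0              # scheduler not full
    )


# A state with a full scheduler queue, as in the SCHEDQ stalls
# reported for the dot-product example.
stalled = DispatchState(dispatch_width=2, group_size=0,
                        free_rob_entries=29, free_phys_regs=37,
                        free_sched_entries=0)
print(can_dispatch(stalled, num_uops=1, num_reg_writes=1))
```

       In this sketch a full scheduler queue alone is enough to block
       dispatch, even when the reorder buffer and the register file still
       have free entries.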
785
786       Scheduling  models  can  optionally  specify  which  register files are
787       available on the processor. llvm-mca uses that information to  initial‐
788       ize  register file descriptors.  Users can limit the number of physical
789       registers that are globally available for register  renaming  by  using
790       the  command  option -register-file-size.  A value of zero for this op‐
791       tion means unbounded. By knowing how many registers are  available  for
792       renaming,  the  tool  can predict dispatch stalls caused by the lack of
793       physical registers.
794
795       The number of reorder buffer entries consumed by an instruction depends
796       on  the  number  of micro-opcodes specified for that instruction by the
797       target scheduling model.  The reorder buffer is responsible for  track‐
798       ing  the  progress  of  instructions that are "in-flight", and retiring
799       them in program order.  The number of entries in the reorder buffer de‐
800       faults  to the value specified by field MicroOpBufferSize in the target
801       scheduling model.
802
803       Instructions that are dispatched to the  schedulers  consume  scheduler
804       buffer  entries. llvm-mca queries the scheduling model to determine the
805       set of buffered resources consumed by  an  instruction.   Buffered  re‐
806       sources are treated like scheduler resources.
807
808   Instruction Issue
809       Each  processor  scheduler implements a buffer of instructions.  An in‐
810       struction has to wait in the scheduler's buffer  until  input  register
811       operands  become  available.   Only  at that point does the instruction
812       become   eligible  for  execution  and  may  be   issued   (potentially
813       out-of-order)  for  execution.   Instruction  latencies are computed by
814       llvm-mca with the help of the scheduling model.
815
816       llvm-mca's scheduler is designed to simulate multiple processor  sched‐
817       ulers.   The  scheduler  is responsible for tracking data dependencies,
818       and dynamically selecting which processor resources are consumed by in‐
819       structions.   It  delegates  the management of processor resource units
820       and resource groups to a resource manager.  The resource manager is re‐
821       sponsible  for  selecting  resource units that are consumed by instruc‐
822       tions.  For example, if an  instruction  consumes  1cy  of  a  resource
823       group, the resource manager selects one of the available units from the
824       group; by default, the resource manager uses a round-robin selector  to
825       guarantee  that  resource  usage  is  uniformly distributed between all
826       units of a group.
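
       The round-robin selection described above can be sketched as
       follows. This is an illustration of the technique only: the
       RoundRobinGroup class and the unit names "U0" and "U1" are
       hypothetical, and the real resource manager tracks more state than
       a set of busy units.

```python
class RoundRobinGroup:
    """Hypothetical resource group that hands out units in rotation."""

    def __init__(self, units):
        self.units = list(units)
        self.next_index = 0   # where the next search starts
        self.busy = set()     # units that cannot be selected right now

    def select(self):
        """Return the next available unit in round-robin order, or None."""
        for offset in range(len(self.units)):
            index = (self.next_index + offset) % len(self.units)
            unit = self.units[index]
            if unit not in self.busy:
                # Start the next search after the unit just selected.
                self.next_index = (index + 1) % len(self.units)
                return unit
        return None


group = RoundRobinGroup(["U0", "U1"])
picks = [group.select() for _ in range(4)]
print(picks)
```

       With both units available, consecutive selections alternate between
       them, so the usage of the group is spread evenly.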
827
828       llvm-mca's scheduler internally groups instructions into three sets:
829
830       • WaitSet: a set of instructions whose operands are not ready.
831
832       • ReadySet: a set of instructions ready to execute.
833
834       • IssuedSet: a set of instructions executing.
835
836       Depending  on  operand  availability,  instructions  that   are   dis‐
837       patched to the scheduler are either placed into the WaitSet or into the
838       ReadySet.
839
840       Every cycle, the scheduler checks if instructions can be moved from the
841       WaitSet  to  the ReadySet, and if instructions from the ReadySet can be
842       issued to the underlying pipelines. The algorithm prioritizes older in‐
843       structions over younger instructions.
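
       The movement of instructions between the three sets can be sketched
       as a small per-cycle loop. This is a simplified model under stated
       assumptions, not the llvm-mca scheduler: the Instr and Scheduler
       classes are hypothetical, dependencies are plain instruction
       indices, and a fixed issue width stands in for the availability of
       processor resources.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    index: int                                # program order; lower is older
    deps: set = field(default_factory=set)    # indices of producer instructions
    latency: int = 1                          # cycles spent executing
    cycles_left: int = 0


class Scheduler:
    def __init__(self, issue_width):
        self.issue_width = issue_width
        self.wait_set, self.ready_set, self.issued_set = [], [], []
        self.completed = set()

    def dispatch(self, instr):
        """Place a dispatched instruction into the WaitSet or the ReadySet."""
        ready = instr.deps <= self.completed
        (self.ready_set if ready else self.wait_set).append(instr)

    def cycle(self):
        """Simulate one cycle; return the indices issued in this cycle."""
        # Advance executing instructions; finished ones leave the IssuedSet.
        for instr in self.issued_set:
            instr.cycles_left -= 1
        for instr in [i for i in self.issued_set if i.cycles_left == 0]:
            self.issued_set.remove(instr)
            self.completed.add(instr.index)
        # Move instructions whose operands are now ready to the ReadySet.
        for instr in [i for i in self.wait_set if i.deps <= self.completed]:
            self.wait_set.remove(instr)
            self.ready_set.append(instr)
        # Issue ready instructions, oldest first.
        self.ready_set.sort(key=lambda i: i.index)
        to_issue = self.ready_set[:self.issue_width]
        self.ready_set = self.ready_set[self.issue_width:]
        for instr in to_issue:
            instr.cycles_left = instr.latency
            self.issued_set.append(instr)
        return [i.index for i in to_issue]


sched = Scheduler(issue_width=1)
sched.dispatch(Instr(0, latency=2))
sched.dispatch(Instr(1, deps={0}))   # waits for instruction 0
sched.dispatch(Instr(2))             # independent of the others
trace = [sched.cycle() for _ in range(3)]
print(trace)
```

       In this run the independent instruction 2 is issued before the
       older instruction 1, which is still waiting in the WaitSet for its
       operand. That is the out-of-order behavior the sets are meant to
       capture.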
844
845   Write-Back and Retire Stage
846       Issued  instructions  are  moved  from  the  ReadySet to the IssuedSet.
847       There, instructions wait until they reach  the  write-back  stage.   At
848       that point, they get removed from the queue and the retire control unit
849       is notified.
850
851       When an instruction is executed, the retire control unit  flags  it  as
852       "ready to retire."
853
854       Instructions  are retired in program order.  The register file is noti‐
855       fied of the retirement so that it can free the physical registers  that
856       were allocated for the instruction during the register renaming stage.
857
858   Load/Store Unit and Memory Consistency Model
859       To  simulate  an  out-of-order execution of memory operations, llvm-mca
860       utilizes a simulated load/store unit (LSUnit) to simulate the  specula‐
861       tive execution of loads and stores.
862
863       Each  load  (or  store) consumes an entry in the load (or store) queue.
864       Users can specify flags -lqueue and -squeue to limit the number of  en‐
865       tries  in  the  load  and store queues respectively. The queues are un‐
866       bounded by default.
867
868       The LSUnit implements a relaxed consistency model for memory loads  and
869       stores.  The rules are:
870
871       1. A younger load is allowed to pass an older load only if there are no
872          intervening stores or barriers between the two loads.
873
874       2. A younger load is allowed to pass an older store provided  that  the
875          load does not alias with the store.
876
877       3. A younger store is not allowed to pass an older store.
878
879       4. A younger store is not allowed to pass an older load.
880
881       By  default,  the LSUnit optimistically assumes that loads do not alias
882       with store operations (-noalias=true).  Under this assumption,  younger
883       loads are always allowed to pass older stores.  Essentially, the LSUnit
884       does  not  attempt  to run any alias analysis to predict when loads and
885       stores do not alias with each other.
886
887       Note  that,  in the case of write-combining memory, rule 3 could be re‐
888       laxed to allow reordering of non-aliasing store operations.  That being
889       said,  at the moment, there is no way to further relax the memory model
890       (-noalias is the only option).  Essentially,  there  is  no  option  to
891       specify  a  different  memory  type (e.g., write-back, write-combining,
892       write-through, etc.) and consequently to  weaken,  or  strengthen,  the
893       memory model.
894
895       Other limitations are:
896
897       • The LSUnit does not know when store-to-load forwarding may occur.
898
899       • The  LSUnit  does  not know anything about cache hierarchy and memory
900         types.
901
902       • The LSUnit does not know how to identify serializing  operations  and
903         memory fences.
904
905       The  LSUnit  does  not  attempt  to  predict if a load or store hits or
906       misses the L1 cache.  It only knows if an instruction "MayLoad"  and/or
907       "MayStore."   For  loads, the scheduling model provides an "optimistic"
908       load-to-use latency (which usually matches the load-to-use latency  for
909       when there is a hit in the L1D).
910
911       llvm-mca  does  not know about serializing operations or memory-barrier
912       like instructions.  The LSUnit conservatively assumes that an  instruc‐
913       tion which has both "MayLoad" and unmodeled side effects behaves like a
914       "soft" load-barrier.  That means, it serializes loads without forcing a
915       flush  of  the load queue.  Similarly, instructions that "MayStore" and
916       have unmodeled side effects are treated like store  barriers.   A  full
917       memory barrier is a "MayLoad" and "MayStore" instruction with unmodeled
918       side effects.  This is inaccurate, but it is the best that we can do at
919       the moment with the current information available in LLVM.
920
921       A  load/store  barrier  consumes  one entry of the load/store queue.  A
922       load/store barrier enforces ordering of loads/stores.  A  younger  load
923       cannot  pass a load barrier.  Also, a younger store cannot pass a store
924       barrier.  A younger load has to wait for the memory/load barrier to ex‐
925       ecute.   A  load/store barrier is "executed" when it becomes the oldest
926       entry in the load/store queue(s). That also means, by construction, all
927       of the older loads/stores have been executed.
928
929       In conclusion, the full set of load/store consistency rules is:
930
931       1. A store may not pass a previous store.
932
933       2. A store may not pass a previous load (regardless of -noalias).
934
935       3. A store has to wait until an older store barrier is fully executed.
936
937       4. A load may pass a previous load.
938
939       5. A load may not pass a previous store unless -noalias is set.
940
941       6. A load has to wait until an older load barrier is fully executed.
942

AUTHOR

944       Maintained by the LLVM Team (https://llvm.org/).
945
947       2003-2021, LLVM Project
948
949
950
951
95210                                2021-04-19                       LLVM-MCA(1)