LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)

NAME

       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS

       llvm-mca [options] [input]

DESCRIPTION

       llvm-mca is a performance analysis tool that uses information
       available in LLVM (e.g., scheduling models) to statically measure
       the performance of machine code on a specific CPU.

       Performance is measured in terms of throughput as well as processor
       resource consumption.  The tool currently works for processors with
       an out-of-order backend, for which there is a scheduling model
       available in LLVM.

       The main goal of this tool is not just to predict the performance
       of the code when run on the target, but also to help with
       diagnosing potential performance issues.

       Given an assembly code sequence, llvm-mca estimates the
       Instructions Per Cycle (IPC), as well as hardware resource
       pressure.  The analysis and reporting style were inspired by the
       IACA tool from Intel.

       For example, you can compile code with clang, output assembly, and
       pipe it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

       (llvm-mca detects Intel syntax by the presence of an .intel_syntax
       directive at the beginning of the input.  By default its output
       syntax matches that of its input.)

       Scheduling models are used not only to compute instruction
       latencies and throughput, but also to understand what processor
       resources are available and how to simulate them.

       By design, the quality of the analysis conducted by llvm-mca is
       inevitably affected by the quality of the scheduling models in
       LLVM.

       If you see that the performance report is not accurate for a
       processor, please file a bug against the appropriate backend.

OPTIONS

       If input is "-" or omitted, llvm-mca reads from standard input.
       Otherwise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca will send its output to
       standard output if the input is from standard input.  If the -o
       option specifies "-", then the output will also be sent to
       standard output.

       -help  Print a summary of command line options.

       -o <filename>
              Use <filename> as the output filename.  See the summary
              above for more details.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code.  It
              defaults to the host default target.

       -mcpu=<cpuname>
              Specify the processor for which to analyze the code.  By
              default, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated
              by the tool.  On x86, possible values are [0, 1].  A value
              of 0 selects the AT&T assembly format, while a value of 1
              selects the Intel assembly format for the code printed out
              by the tool in the analysis report.

       -print-imm-hex
              Prefer hex format for numeric literals in the output
              assembly printed as part of the report.

       -dispatch=<width>
              Specify a different dispatch width for the processor.  The
              dispatch width defaults to field 'IssueWidth' in the
              processor scheduling model.  If width is zero, then the
              default dispatch width is used.

       -register-file-size=<size>
              Specify the size of the register file.  When specified,
              this flag limits how many physical registers are available
              for register renaming purposes.  A value of zero for this
              flag means "unlimited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run.  If this flag is
              set to 0, then the tool sets the number of iterations to a
              default value (i.e., 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias.
              This is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool.  By default, the tool assumes an
              unbounded number of entries in the load queue.  A value of
              zero for this flag is ignored, and the default load queue
              size is used instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool.  By default, the tool assumes an
              unbounded number of entries in the store queue.  A value of
              zero for this flag is ignored, and the default store queue
              size is used instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline
              view.  By default, the timeline view prints information for
              up to 10 iterations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view.  By
              default, the number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view.  This is enabled by
              default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics.  This view collects and
              analyzes instruction dispatch events, as well as
              static/dynamic dispatch stall events.  This view is
              disabled by default.

       -scheduler-stats
              Enable extra scheduler statistics.  This view collects and
              analyzes instruction issue events.  This view is disabled
              by default.

       -retire-stats
              Enable extra retire control unit statistics.  This view is
              disabled by default.

       -instruction-info
              Enable the instruction info view.  This is enabled by
              default.

       -show-encoding
              Enable the printing of instruction encodings within the
              instruction info view.

       -all-stats
              Print all hardware statistics.  This enables extra
              statistics related to the dispatch logic, the hardware
              schedulers, the register file(s), and the retire control
              unit.  This option is disabled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Print resource pressure information based on the static
              information available from the processor model.  This
              differs from the resource pressure view because it doesn't
              require the code to be simulated.  It instead prints the
              theoretical uniform distribution of resource pressure for
              every instruction in sequence.

       -bottleneck-analysis
              Print information about bottlenecks that affect the
              throughput.  This analysis can be expensive, and it is
              disabled by default.  Bottlenecks are highlighted in the
              summary view.

       -json  Print the requested views in JSON format.  The instructions
              and the processor resources are printed as members of
              special top level JSON objects.  The individual views refer
              to them by index.

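       Views can be combined in a single invocation.  For example, the
       following command (where foo.s is a placeholder for an input
       assembly file) enables the timeline view and the bottleneck
       analysis on top of the default views:

          $ llvm-mca -mcpu=btver2 -timeline -bottleneck-analysis foo.s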

EXIT STATUS

       llvm-mca returns 0 on success.  Otherwise, an error message is
       printed to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS

       llvm-mca allows for the optional usage of special code comments to
       mark regions of the assembly code to be analyzed.  A comment
       starting with substring LLVM-MCA-BEGIN marks the beginning of a
       code region.  A comment starting with substring LLVM-MCA-END marks
       the end of a code region.  For example:

          # LLVM-MCA-BEGIN
            ...
          # LLVM-MCA-END

       If no user-defined region is specified, then llvm-mca assumes a
       default region which contains every instruction in the input file.
       Every region is analyzed in isolation, and the final performance
       report is the union of all the reports generated for every code
       region.

       Code regions can have names.  For example:

          # LLVM-MCA-BEGIN A simple example
            add %eax, %eax
          # LLVM-MCA-END

       The code from the example above defines a region named "A simple
       example" with a single instruction in it.  Note how the region
       name doesn't have to be repeated in the LLVM-MCA-END directive.
       In the absence of overlapping regions, an anonymous LLVM-MCA-END
       directive always ends the currently active user-defined region.

       Example of nesting regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END bar
          # LLVM-MCA-END foo

       Example of overlapping regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END foo
            add %eax, %edx
          # LLVM-MCA-END bar

       Note that multiple anonymous regions cannot overlap.  Also,
       overlapping regions cannot have the same name.

       There is no support for marking regions from high-level source
       code, like C or C++.  As a workaround, inline assembly directives
       may be used:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo");
            a += 42;
            __asm volatile("# LLVM-MCA-END");
            a *= b;
            return a;
          }

       However, this interferes with optimizations like loop
       vectorization and may have an impact on the code generated.  This
       is because the __asm statements are seen as real code having
       important side effects, which limits how the code around them can
       be transformed.  If users want to make use of inline assembly to
       emit markers, then the recommendation is to always verify that the
       output assembly is equivalent to the assembly generated in the
       absence of markers.  The Clang options to emit optimization
       reports can also help in detecting missed optimizations.

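       For instance, assuming the function above lives in a file named
       foo.c (the name is illustrative), one possible workflow is to
       request optimization remarks while emitting the marked assembly,
       and then feed that assembly to llvm-mca:

          $ clang foo.c -O2 -S -o foo.s -Rpass-missed=loop-vectorize
          $ llvm-mca -mcpu=btver2 foo.s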

HOW LLVM-MCA WORKS

       llvm-mca takes assembly code as input.  The assembly code is
       parsed into a sequence of MCInst with the help of the existing
       LLVM target assembly parsers.  The parsed sequence of MCInst is
       then analyzed by a Pipeline module to generate a performance
       report.

       The Pipeline module simulates the execution of the machine code
       sequence in a loop of iterations (default is 100).  During this
       process, the pipeline collects a number of execution related
       statistics.  At the end of this process, the pipeline generates
       and prints a report from the collected statistics.

       Here is an example of a performance report generated by the tool
       for a dot-product of two packed float vectors of four elements.
       The analysis is conducted for target x86, cpu btver2.  The
       following result can be produced via the following command using
       the example located at
       test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Total uOps:        900

          Dispatch Width:    2
          uOps Per Cycle:    1.48
          IPC:               1.48
          Block RThroughput: 2.0


          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps      %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps     %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps     %xmm3, %xmm3, %xmm4


          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL


          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps      %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm3, %xmm3, %xmm4

       According to this report, the dot-product kernel has been executed
       300 times, for a total of 900 simulated instructions.  The total
       number of simulated micro opcodes (uOps) is also 900.

       The report is structured in three main sections.  The first
       section collects a few performance numbers; the goal of this
       section is to give a very quick overview of the performance
       throughput.  Important performance indicators are IPC, uOps Per
       Cycle, and Block RThroughput (Block Reciprocal Throughput).

       Field DispatchWidth is the maximum number of micro opcodes that
       are dispatched to the out-of-order backend every simulated cycle.

       IPC is computed by dividing the total number of simulated
       instructions by the total number of cycles.

       Field Block RThroughput is the reciprocal of the block throughput.
       Block throughput is a theoretical quantity computed as the maximum
       number of blocks (i.e., iterations) that can be executed per
       simulated clock cycle in the absence of loop-carried dependencies.
       Block throughput is bounded from above by the dispatch rate and
       the availability of hardware resources.

       In the absence of loop-carried data dependencies, the observed IPC
       tends to a theoretical maximum which can be computed by dividing
       the number of instructions of a single iteration by the Block
       RThroughput.

       Field 'uOps Per Cycle' is computed by dividing the total number of
       simulated micro opcodes by the total number of cycles.  A delta
       between Dispatch Width and this field is an indicator of a
       performance issue.  In the absence of loop-carried data
       dependencies, the observed 'uOps Per Cycle' should tend to a
       theoretical maximum throughput which can be computed by dividing
       the number of uOps of a single iteration by the Block RThroughput.

       Field uOps Per Cycle is bounded from above by the dispatch width.
       That is because the dispatch width limits the maximum size of a
       dispatch group.  Both IPC and 'uOps Per Cycle' are limited by the
       amount of hardware parallelism.  The availability of hardware
       resources affects the resource pressure distribution, and it
       limits the number of instructions that can be executed in parallel
       every cycle.  A delta between Dispatch Width and the theoretical
       maximum uOps per Cycle (computed by dividing the number of uOps of
       a single iteration by the Block RThroughput) is an indicator of a
       performance bottleneck caused by the lack of hardware resources.
       In general, the lower the Block RThroughput, the better.

       In this example, the ratio of uOps per iteration to Block
       RThroughput is 1.50.  Since there are no loop-carried
       dependencies, the observed uOps Per Cycle is expected to approach
       1.50 when the number of iterations tends to infinity.  The delta
       between the Dispatch Width (2.00) and the theoretical maximum uOp
       throughput (1.50) is an indicator of a performance bottleneck
       caused by the lack of hardware resources, and the Resource
       pressure view can help to identify the problematic resource usage.

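       Using the numbers from the summary above, the arithmetic works out
       as follows:

          IPC            = 900 instructions / 610 cycles = ~1.48
          uOps Per Cycle = 900 uOps / 610 cycles         = ~1.48
          Theoretical maximum uOps Per Cycle
                         = 3 uOps per iteration / 2.0 (Block RThroughput)
                         = 1.50
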
       The second section of the report is the instruction info view.  It
       shows the latency and reciprocal throughput of every instruction
       in the sequence.  It also reports extra information related to the
       number of micro opcodes, and opcode properties (i.e., 'MayLoad',
       'MayStore', and 'HasSideEffects').

       Field RThroughput is the reciprocal of the instruction throughput.
       Throughput is computed as the maximum number of instructions of a
       same type that can be executed per clock cycle in the absence of
       operand dependencies.  In this example, the reciprocal throughput
       of a vector float multiply is 1 cycle/instruction.  That is
       because the FP multiplier JFPM is only available from pipeline
       JFPU1.

       Instruction encodings are displayed within the instruction info
       view when flag -show-encoding is specified.

       Below is an example of -show-encoding output for the dot-product
       kernel:

          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)
          [7]: Encoding Size

          [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:                    Instructions:
           1      2     1.00                         4     c5 f0 59 d0                   vmulps %xmm0, %xmm1, %xmm2
           1      4     1.00                         4     c5 eb 7c da                   vhaddps        %xmm2, %xmm2, %xmm3
           1      4     1.00                         4     c5 e3 7c e3                   vhaddps        %xmm3, %xmm3, %xmm4

       The Encoding Size column shows the size in bytes of instructions.
       The Encodings column shows the actual instruction encodings (byte
       sequences in hex).

       The third section is the Resource pressure view.  This view
       reports the average number of resource cycles consumed every
       iteration by instructions for every processor resource unit
       available on the target.  Information is structured in two tables.
       The first table reports the number of resource cycles spent on
       average every iteration.  The second table correlates the resource
       cycles to the machine instructions in the sequence.  For example,
       every iteration of the instruction vmulps always executes on
       resource unit [6] (JFPU1 - floating point pipeline #1), consuming
       an average of 1 resource cycle per iteration.  Note that on AMD
       Jaguar, vector floating-point multiply can only be issued to
       pipeline JFPU1, while horizontal floating-point additions can only
       be issued to pipeline JFPU0.

       The resource pressure view helps with identifying bottlenecks
       caused by high usage of specific hardware resources.  Situations
       with resource pressure mainly concentrated on a few resources
       should, in general, be avoided.  Ideally, pressure should be
       uniformly distributed between multiple resources.

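       For a purely static estimate of how resource pressure would be
       distributed, without running the simulation, the
       -instruction-tables option described above can be used instead;
       for example:

          $ llvm-mca -mcpu=btver2 -instruction-tables dot-product.s
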
   Timeline View
       The timeline view produces a detailed report of each instruction's
       state transitions through an instruction pipeline.  This view is
       enabled by the command line option -timeline.  As instructions
       transition through the various stages of the pipeline, their
       states are depicted in the view report.  These states are
       represented by the following characters:

       • D : Instruction dispatched.

       • e : Instruction executing.

       • E : Instruction executed.

       • R : Instruction retired.

       • = : Instruction already dispatched, waiting to be executed.

       • - : Instruction executed, waiting to be retired.

       Below is the timeline view for a subset of the dot-product example
       located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and
       processed by llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
                 3     3.3    0.5    1.4       <total>

       The timeline view is interesting because it shows instruction
       state changes during execution.  It also gives an idea of how the
       tool processes instructions executed on the target, and how their
       timing information might be calculated.

       The timeline view is structured in two tables.  The first table
       shows instructions changing state over time (measured in cycles);
       the second table (named Average Wait times) reports useful timing
       statistics, which should help diagnose performance bottlenecks
       caused by long data dependencies and sub-optimal usage of hardware
       resources.

       An instruction in the timeline view is identified by a pair of
       indices, where the first index identifies an iteration, and the
       second index is the instruction index (i.e., where it appears in
       the code sequence).  Since this example was generated using 3
       iterations (-iterations=3), the iteration indices range from 0 to
       2 inclusive.

       Excluding the first and last column, the remaining columns are in
       cycles.  Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       • Instruction [1,0] was dispatched at cycle 1.

       • Instruction [1,0] started executing at cycle 2.

       • Instruction [1,0] reached the write back stage at cycle 4.

       • Instruction [1,0] was retired at cycle 10.

       Instruction [1,0] (i.e., vmulps from iteration #1) does not have
       to wait in the scheduler's queue for the operands to become
       available.  By the time vmulps is dispatched, operands are already
       available, and pipeline JFPU1 is ready to serve another
       instruction.  So the instruction can be immediately issued on the
       JFPU1 pipeline.  That is demonstrated by the fact that the
       instruction only spent 1cy in the scheduler's queue.

       There is a gap of 5 cycles between the write-back stage and the
       retire event.  That is because instructions must retire in program
       order, so [1,0] has to wait for [0,2] to be retired first (i.e.,
       it has to wait until cycle 10).

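       As a cross-check, the 3.3 cycles reported for vmulps in column [3]
       of the Average Wait times table can be read directly off the
       timeline: the write-back-to-retire gaps of its three executions
       are 0, 5, and 5 cycles.

          [0,0]  DeeER       ->  0 cycles between write-back and retire
          [1,0]  DeeE-----R  ->  5 cycles between write-back and retire
          [2,0]  DeeE-----R  ->  5 cycles between write-back and retire

          (0 + 5 + 5) / 3 = 3.3 cycles on average
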
       In the example, all instructions are in a RAW (Read After Write)
       dependency chain.  Register %xmm2 written by vmulps is immediately
       used by the first vhaddps, and register %xmm3 written by the first
       vhaddps is used by the second vhaddps.  Long data dependencies
       negatively impact the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies introduced
       by instructions from different iterations.  However, those
       dependencies can be removed at register renaming stage (at the
       cost of allocating register aliases, and therefore consuming
       physical registers).

       Table Average Wait times helps diagnose performance issues that
       are caused by the presence of long latency instructions and
       potentially long data dependencies which may limit the ILP.  The
       last row, <total>, shows a global average over all instructions
       measured.  Note that llvm-mca, by default, assumes at least 1cy
       between the dispatch event and the issue event.

       When the performance is limited by data dependencies and/or long
       latency instructions, the number of cycles spent while in the
       ready state is expected to be very small when compared with the
       total number of cycles spent in the scheduler's queue.  The
       difference between the two counters is a good indicator of how
       large of an impact data dependencies had on the execution of the
       instructions.  When performance is mostly limited by the lack of
       hardware resources, the delta between the two counters is small.
       However, the number of cycles spent in the queue tends to be
       larger (i.e., more than 1-3cy), especially when compared to other
       low latency instructions.

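       The amount of timeline output can be capped with the
       -timeline-max-iterations and -timeline-max-cycles options
       described above.  For example, a command along these lines prints
       the timeline for at most two iterations:

          $ llvm-mca -mcpu=btver2 -iterations=300 -timeline -timeline-max-iterations=2 dot-product.s
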
   Bottleneck Analysis
       The -bottleneck-analysis command line option enables the analysis
       of performance bottlenecks.

       This analysis is potentially expensive.  It attempts to correlate
       increases in backend pressure (caused by pipeline resource
       pressure and data dependencies) to dynamic dispatch stalls.

       Below is an example of -bottleneck-analysis output generated by
       llvm-mca for 500 iterations of the dot-product example on btver2.

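       The output below can be reproduced with a command along these
       lines:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -bottleneck-analysis dot-product.s
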
          Cycles with backend pressure increase [ 48.07% ]
          Throughput Bottlenecks:
            Resource Pressure       [ 47.77% ]
            - JFPA  [ 47.77% ]
            - JFPU0  [ 47.77% ]
            Data Dependencies:      [ 0.30% ]
            - Register Dependencies [ 0.30% ]
            - Memory Dependencies   [ 0.00% ]

          Critical sequence based on the simulation:

                        Instruction                         Dependency Information
           +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
           |
           |    < loop carried >
           |
           |      0.    vmulps  %xmm0, %xmm1, %xmm2
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
           +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
           |
           |    < loop carried >
           |
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]

       According to the analysis, throughput is limited by resource
       pressure and not by data dependencies.  The analysis observed
       increases in backend pressure during 48.07% of the simulated run.
       Almost all those pressure increase events were caused by
       contention on processor resources JFPA/JFPU0.

       The critical sequence is the most expensive sequence of
       instructions according to the simulation.  It is annotated to
       provide extra information about critical register dependencies and
       resource interferences between instructions.

       Instructions from the critical sequence are expected to
       significantly impact performance.  By construction, the accuracy
       of this analysis is strongly dependent on the simulation and (as
       always) on the quality of the processor model in LLVM.

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and
       performance counters for the dispatch logic, the reorder buffer,
       the retire control unit, and the register file.

       Below is an example of -all-stats output generated by llvm-mca for
       300 iterations of the dot-product example discussed in the
       previous sections.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272  (44.6%)
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0


          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)


          Schedulers - number of cycles where we saw N micro opcodes issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)

          Scheduler's queue usage:
          [1] Resource name.
          [2] Average number of used buffer entries.
          [3] Maximum number of used buffer entries.
          [4] Total number of buffer entries.

           [1]            [2]        [3]        [4]
          JALU01           0          0          20
          JFPU01           17         18         18
          JLSAGU           0          0          12


          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
           2,           399  (65.4%)

          Total ROB Entries:                64
          Max Used ROB Entries:             35  ( 54.7% )
          Average Used ROB Entries per cy:  32  ( 50.0% )


          Register File statistics:
          Total number of mappings created:    900
          Max number of mappings used:         35

          *  Register File #1 -- JFpuPRF:
             Number of physical registers:     72
             Total number of mappings created: 900
             Max number of mappings used:      35

          *  Register File #2 -- JIntegerPRF:
             Number of physical registers:     64
             Total number of mappings created: 0
             Max number of mappings used:      0

       If we look at the Dynamic Dispatch Stall Cycles table, we see that
       the counter for SCHEDQ reports 272 cycles.  This counter is
       incremented every time the dispatch logic is unable to dispatch a
       full group because the scheduler's queue is full.

       Looking at the Dispatch Logic table, we see that the pipeline was
       only able to dispatch two micro opcodes 51.5% of the time.  The
       dispatch group was limited to one micro opcode 44.6% of the
       cycles, which corresponds to 272 cycles.  The dispatch statistics
       are displayed by using the command option -all-stats or
       -dispatch-stats.

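       As a sanity check, the three dispatch histogram buckets cover the
       whole simulated run:

          24 + 272 + 314 = 610 cycles (the Total Cycles from the summary)
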
       The next table, Schedulers, presents a histogram displaying a
       count, representing the number of micro opcodes issued on some
       number of cycles.  In this case, of the 610 simulated cycles,
       single opcodes were issued 306 times (50.2%) and there were 7
       cycles where no opcodes were issued.

       The Scheduler's queue usage table shows the average and maximum
       number of buffer entries (i.e., scheduler queue entries) used at
       runtime.  Resource JFPU01 reached its maximum (18 of 18 queue
       entries).  Note that AMD Jaguar implements three schedulers:

       • JALU01 - A scheduler for ALU instructions.

       • JFPU01 - A scheduler for floating point operations.

       • JLSAGU - A scheduler for address generation.

       The dot-product is a kernel of three floating point instructions
       (a vector multiply followed by two horizontal adds).  That
       explains why only the floating point scheduler appears to be used.

       A full scheduler queue is either caused by data dependency chains
       or by a sub-optimal usage of hardware resources.  Sometimes,
       resource pressure can be mitigated by rewriting the kernel using
       different instructions that consume different scheduler resources.
       Schedulers with a small queue are less resilient to bottlenecks
       caused by the presence of long data dependencies.  The scheduler
       statistics are displayed by using the command option -all-stats or
       -scheduler-stats.

       The next table, Retire Control Unit, presents a histogram
       displaying a count, representing the number of instructions
       retired on some number of cycles.  In this case, of the 610
       simulated cycles, two instructions were retired during the same
       cycle 399 times (65.4%) and there were 109 cycles where no
       instructions were retired.  The retire statistics are displayed by
       using the command option -all-stats or -retire-stats.

       The last table presented is Register File statistics.  Each
       physical register file (PRF) used by the pipeline is presented in
       this table.  In the case of AMD Jaguar, there are two register
       files, one for floating-point registers (JFpuPRF) and one for
       integer registers (JIntegerPRF).  The table shows that of the 900
       instructions processed, there were 900 mappings created.  Since
       this dot-product example utilized only floating point registers,
       the JFpuPRF was responsible for creating the 900 mappings.
       However, we see that the pipeline only used a maximum of 35 of 72
       available register slots at any given time.  We can conclude that
       the floating point PRF was the only register file used for the
       example, and that it was never resource constrained.  The register
       file statistics are displayed by using the command option
       -all-stats or -register-file-stats.

       In this example, we can conclude that the IPC is mostly limited by
       data dependencies, and not by resource pressure.

   Instruction Flow
       This section describes the instruction flow through the default
       pipeline of llvm-mca, as well as the functional units involved in
       the process.

       The default pipeline implements the following sequence of stages
       used to process instructions.

       • Dispatch (Instruction is dispatched to the schedulers).

       • Issue (Instruction is issued to the processor pipelines).

       • Write Back (Instruction is executed, and results are written
         back).

       • Retire (Instruction is retired; writes are architecturally
         committed).

       The default pipeline only models the out-of-order portion of a
       processor.  Therefore, the instruction fetch and decode stages are
       not modeled.  Performance bottlenecks in the frontend are not
       diagnosed.  llvm-mca assumes that instructions have all been
       decoded and placed into a queue before the simulation starts.
       Also, llvm-mca does not model branch prediction.

   Instruction Dispatch
       During the dispatch stage, instructions are picked in program
       order from a queue of already decoded instructions, and dispatched
       in groups to the simulated hardware schedulers.

       The size of a dispatch group depends on the availability of the
       simulated hardware resources.  The processor dispatch width
       defaults to the value of the IssueWidth field in LLVM's scheduling
       model.

       An instruction can be dispatched if:

       • The size of the dispatch group is smaller than the processor's
         dispatch width.

       • There are enough entries in the reorder buffer.

       • There are enough physical registers to do register renaming.

       • The schedulers are not full.

       Scheduling models can optionally specify which register files are
       available on the processor.  llvm-mca uses that information to
       initialize register file descriptors.  Users can limit the number
       of physical registers that are globally available for register
       renaming by using the command option -register-file-size.  A value
       of zero for this option means unbounded.  By knowing how many
       registers are available for renaming, the tool can predict
       dispatch stalls caused by the lack of physical registers.

       The number of reorder buffer entries consumed by an instruction
       depends on the number of micro opcodes specified for that
       instruction by the target scheduling model.  The reorder buffer is
       responsible for tracking the progress of instructions that are
       "in-flight", and retiring them in program order.  The number of
       entries in the reorder buffer defaults to the value specified by
       field MicroOpBufferSize in the target scheduling model.

       Instructions that are dispatched to the schedulers consume
       scheduler buffer entries.  llvm-mca queries the scheduling model
       to determine the set of buffered resources consumed by an
       instruction.  Buffered resources are treated like scheduler
       resources.

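       Both the dispatch width and the number of physical registers can
       be overridden from the command line to experiment with "what-if"
       configurations.  For example, a hypothetical invocation that
       simulates a 4-wide dispatch stage and a register file with only 64
       physical registers available for renaming would be:

          $ llvm-mca -mcpu=btver2 -dispatch=4 -register-file-size=64 dot-product.s
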
   Instruction Issue
       Each processor scheduler implements a buffer of instructions.  An
       instruction has to wait in the scheduler's buffer until input
       register operands become available.  Only at that point does the
       instruction become eligible for execution and may be issued
       (potentially out-of-order) for execution.  Instruction latencies
       are computed by llvm-mca with the help of the scheduling model.

       llvm-mca's scheduler is designed to simulate multiple processor
       schedulers.  The scheduler is responsible for tracking data
       dependencies, and dynamically selecting which processor resources
       are consumed by instructions.  It delegates the management of
       processor resource units and resource groups to a resource
       manager.  The resource manager is responsible for selecting
       resource units that are consumed by instructions.  For example, if
       an instruction consumes 1cy of a resource group, the resource
       manager selects one of the available units from the group; by
       default, the resource manager uses a round-robin selector to
       guarantee that resource usage is uniformly distributed between all
       units of a group.

       llvm-mca's scheduler internally groups instructions into three
       sets:

       • WaitSet: a set of instructions whose operands are not ready.

       • ReadySet: a set of instructions ready to execute.

       • IssuedSet: a set of instructions executing.

       Depending on the operands availability, instructions that are
       dispatched to the scheduler are either placed into the WaitSet or
       into the ReadySet.

       Every cycle, the scheduler checks if instructions can be moved
       from the WaitSet to the ReadySet, and if instructions from the
       ReadySet can be issued to the underlying pipelines.  The algorithm
       prioritizes older instructions over younger instructions.

   Write-Back and Retire Stage
       Issued instructions are moved from the ReadySet to the IssuedSet.
       There, instructions wait until they reach the write-back stage.
       At that point, they get removed from the queue and the retire
       control unit is notified.

       When instructions are executed, the retire control unit flags the
       instruction as "ready to retire."

       Instructions are retired in program order.  The register file is
       notified of the retirement so that it can free the physical
       registers that were allocated for the instruction during the
       register renaming stage.

   Load/Store Unit and Memory Consistency Model
       To simulate an out-of-order execution of memory operations,
       llvm-mca uses a simulated load/store unit (LSUnit) to model the
       speculative execution of loads and stores.

       Each load (or store) consumes an entry in the load (or store)
       queue.  Users can specify flags -lqueue and -squeue to limit the
       number of entries in the load and store queues respectively.  The
       queues are unbounded by default.

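       For example, a hypothetical invocation that models a 16-entry load
       queue and an 8-entry store queue (here foo.s stands for any input
       containing memory operations) would be:

          $ llvm-mca -mcpu=btver2 -lqueue=16 -squeue=8 foo.s
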
       The LSUnit implements a relaxed consistency model for memory loads
       and stores.  The rules are:

       1. A younger load is allowed to pass an older load only if there
          are no intervening stores or barriers between the two loads.

       2. A younger load is allowed to pass an older store provided that
          the load does not alias with the store.

       3. A younger store is not allowed to pass an older store.

       4. A younger store is not allowed to pass an older load.

       By default, the LSUnit optimistically assumes that loads do not
       alias with store operations (-noalias=true).  Under this
       assumption, younger loads are always allowed to pass older stores.
       Essentially, the LSUnit does not attempt to run any alias analysis
       to predict when loads and stores do not alias with each other.

       Note that, in the case of write-combining memory, rule 3 could be
       relaxed to allow reordering of non-aliasing store operations.
       That being said, at the moment, there is no way to further relax
       the memory model (-noalias is the only option).  Essentially,
       there is no option to specify a different memory type (e.g.,
       write-back, write-combining, write-through, etc.) and consequently
       to weaken, or strengthen, the memory model.

       Other limitations are:

       • The LSUnit does not know when store-to-load forwarding may
         occur.

       • The LSUnit does not know anything about cache hierarchy and
         memory types.

       • The LSUnit does not know how to identify serializing operations
         and memory fences.

       The LSUnit does not attempt to predict if a load or store hits or
       misses the L1 cache.  It only knows if an instruction "MayLoad"
       and/or "MayStore."  For loads, the scheduling model provides an
       "optimistic" load-to-use latency (which usually matches the
       load-to-use latency for when there is a hit in the L1D).

       llvm-mca does not know about serializing operations or
       memory-barrier like instructions.  The LSUnit conservatively
       assumes that an instruction which has both "MayLoad" and unmodeled
       side effects behaves like a "soft" load-barrier.  That means, it
       serializes loads without forcing a flush of the load queue.
       Similarly, instructions that "MayStore" and have unmodeled side
       effects are treated like store barriers.  A full memory barrier is
       a "MayLoad" and "MayStore" instruction with unmodeled side
       effects.  This is inaccurate, but it is the best that we can do at
       the moment with the current information available in LLVM.

       A load/store barrier consumes one entry of the load/store queue.
       A load/store barrier enforces ordering of loads/stores.  A younger
       load cannot pass a load barrier.  Also, a younger store cannot
       pass a store barrier.  A younger load has to wait for the
       memory/load barrier to execute.  A load/store barrier is
       "executed" when it becomes the oldest entry in the load/store
       queue(s).  That also means, by construction, all of the older
       loads/stores have been executed.

       In conclusion, the full set of load/store consistency rules is:

       1. A store may not pass a previous store.

       2. A store may not pass a previous load (regardless of -noalias).

       3. A store has to wait until an older store barrier is fully
          executed.

       4. A load may pass a previous load.

       5. A load may not pass a previous store unless -noalias is set.

       6. A load has to wait until an older load barrier is fully
          executed.

AUTHOR

       Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT

       2003-2023, LLVM Project

12                                2023-07-20                       LLVM-MCA(1)