LLVM-MCA(1)                          LLVM                         LLVM-MCA(1)

NAME
       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS
       llvm-mca [options] [input]

DESCRIPTION
       llvm-mca is a performance analysis tool that uses information
       available in LLVM (e.g. scheduling models) to statically measure the
       performance of machine code on a specific CPU.

       Performance is measured in terms of throughput as well as processor
       resource consumption. The tool currently works for processors with
       an out-of-order backend, for which there is a scheduling model
       available in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help with diagnosing
       potential performance issues.

       Given an assembly code sequence, llvm-mca estimates the Instructions
       Per Cycle (IPC), as well as hardware resource pressure. The analysis
       and reporting style were inspired by the IACA tool from Intel.

       For example, you can compile code with clang, output assembly, and
       pipe it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

       Scheduling models are not only used to compute instruction latencies
       and throughput, but also to understand what processor resources are
       available and how to simulate them.

       By design, the quality of the analysis conducted by llvm-mca is
       inevitably affected by the quality of the scheduling models in LLVM.

       If you see that the performance report is not accurate for a
       processor, please file a bug against the appropriate backend.

OPTIONS
       If input is "-" or omitted, llvm-mca reads from standard input.
       Otherwise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca will send its output to
       standard output if the input is from standard input. If the -o
       option specifies "-", then the output will also be sent to standard
       output.

       -help  Print a summary of command line options.

       -o <filename>
              Use <filename> as the output filename. See the summary above
              for more details.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code. It
              defaults to the host default target.

       -mcpu=<cpuname>
              Specify the processor for which to analyze the code. By
              default, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated
              by the tool. On x86, possible values are [0, 1]. A value of 0
              selects the AT&T assembly format, while a value of 1 selects
              the Intel assembly format for the code printed out by the
              tool in the analysis report.

       -dispatch=<width>
              Specify a different dispatch width for the processor. The
              dispatch width defaults to field 'IssueWidth' in the
              processor scheduling model. If width is zero, then the
              default dispatch width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this
              flag limits how many physical registers are available for
              register renaming purposes. A value of zero for this flag
              means "unlimited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run. If this flag is set
              to 0, then the tool sets the number of iterations to a
              default value (i.e. 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias.
              This is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the load queue. A value of
              zero for this flag is ignored, and the default load queue
              size is used instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the store queue. A value of
              zero for this flag is ignored, and the default store queue
              size is used instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view.
              By default, the timeline view prints information for up to 10
              iterations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view. By default,
              the number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view. This is enabled by
              default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics. This view collects and
              analyzes instruction dispatch events, as well as
              static/dynamic dispatch stall events. This view is disabled
              by default.

       -scheduler-stats
              Enable extra scheduler statistics. This view collects and
              analyzes instruction issue events. This view is disabled by
              default.

       -retire-stats
              Enable extra retire control unit statistics. This view is
              disabled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -all-stats
              Print all hardware statistics. This enables extra statistics
              related to the dispatch logic, the hardware schedulers, the
              register file(s), and the retire control unit. This option is
              disabled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Print resource pressure information based on the static
              information available from the processor model. This differs
              from the resource pressure view because it doesn't require
              that the code is simulated. It instead prints the theoretical
              uniform distribution of resource pressure for every
              instruction in sequence.

       -bottleneck-analysis
              Print information about bottlenecks that affect the
              throughput. This analysis can be expensive, and it is
              disabled by default. Bottlenecks are highlighted in the
              summary view.

EXIT STATUS
       llvm-mca returns 0 on success. Otherwise, an error message is
       printed to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE REGIONS
       llvm-mca allows for the optional usage of special code comments to
       mark regions of the assembly code to be analyzed. A comment starting
       with substring LLVM-MCA-BEGIN marks the beginning of a code region.
       A comment starting with substring LLVM-MCA-END marks the end of a
       code region. For example:

          # LLVM-MCA-BEGIN
          ...
          # LLVM-MCA-END

       If no user-defined region is specified, then llvm-mca assumes a
       default region which contains every instruction in the input file.
       Every region is analyzed in isolation, and the final performance
       report is the union of all the reports generated for every code
       region.

       Code regions can have names. For example:

          # LLVM-MCA-BEGIN A simple example
          add %eax, %eax
          # LLVM-MCA-END

       The code from the example above defines a region named "A simple
       example" with a single instruction in it. Note how the region name
       doesn't have to be repeated in the LLVM-MCA-END directive. In the
       absence of overlapping regions, an anonymous LLVM-MCA-END directive
       always ends the currently active user-defined region.

       Example of nesting regions:

          # LLVM-MCA-BEGIN foo
          add %eax, %edx
          # LLVM-MCA-BEGIN bar
          sub %eax, %edx
          # LLVM-MCA-END bar
          # LLVM-MCA-END foo

       Example of overlapping regions:

          # LLVM-MCA-BEGIN foo
          add %eax, %edx
          # LLVM-MCA-BEGIN bar
          sub %eax, %edx
          # LLVM-MCA-END foo
          add %eax, %edx
          # LLVM-MCA-END bar

       Note that multiple anonymous regions cannot overlap. Also,
       overlapping regions cannot have the same name.
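
       As an illustration, the marker semantics described above can be
       sketched in a few lines of Python. This is a hypothetical helper for
       exposition only, not part of llvm-mca: a named LLVM-MCA-END closes
       the region with the matching name, while an anonymous LLVM-MCA-END
       closes the most recently opened region.

```python
# Hypothetical sketch of the region-marker semantics (not llvm-mca code).
def parse_regions(lines):
    open_regions = []   # stack of (name, collected instructions)
    finished = {}       # region name -> list of instructions
    anon_count = 0
    for line in lines:
        text = line.strip()
        if text.startswith("# LLVM-MCA-BEGIN"):
            name = text[len("# LLVM-MCA-BEGIN"):].strip()
            if not name:                       # anonymous region
                name = "anonymous-%d" % anon_count
                anon_count += 1
            open_regions.append((name, []))
        elif text.startswith("# LLVM-MCA-END"):
            name = text[len("# LLVM-MCA-END"):].strip()
            if name:                           # close the named region
                idx = next(i for i, (n, _) in enumerate(open_regions)
                           if n == name)
            else:                              # close the most recent one
                idx = len(open_regions) - 1
            region_name, body = open_regions.pop(idx)
            finished[region_name] = body
        else:                                  # instruction line: goes to
            for _, body in open_regions:       # every currently open region
                body.append(text)
    return finished

# The overlapping-regions example: "sub" belongs to both foo and bar.
regions = parse_regions([
    "# LLVM-MCA-BEGIN foo",
    "add %eax, %edx",
    "# LLVM-MCA-BEGIN bar",
    "sub %eax, %edx",
    "# LLVM-MCA-END foo",
    "add %eax, %edx",
    "# LLVM-MCA-END bar",
])
```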

       There is no support for marking regions from high-level source code,
       like C or C++. As a workaround, inline assembly directives may be
       used:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo");
            a += 42;
            __asm volatile("# LLVM-MCA-END");
            a *= b;
            return a;
          }

       However, this interferes with optimizations like loop vectorization
       and may have an impact on the code generated. This is because the
       __asm statements are seen as real code having important side
       effects, which limits how the code around them can be transformed.
       If users want to make use of inline assembly to emit markers, then
       the recommendation is to always verify that the output assembly is
       equivalent to the assembly generated in the absence of markers. The
       Clang options to emit optimization reports can also help in
       detecting missed optimizations.

HOW LLVM-MCA WORKS
       llvm-mca takes assembly code as input. The assembly code is parsed
       into a sequence of MCInst with the help of the existing LLVM target
       assembly parsers. The parsed sequence of MCInst is then analyzed by
       a Pipeline module to generate a performance report.

       The Pipeline module simulates the execution of the machine code
       sequence in a loop of iterations (default is 100). During this
       process, the pipeline collects a number of execution related
       statistics. At the end of this process, the pipeline generates and
       prints a report from the collected statistics.

       Here is an example of a performance report generated by the tool for
       a dot-product of two packed float vectors of four elements. The
       analysis is conducted for target x86, cpu btver2. The following
       result can be produced via the following command using the example
       located at test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Total uOps:        900

          Dispatch Width:    2
          uOps Per Cycle:    1.48
          IPC:               1.48
          Block RThroughput: 2.0

          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps   %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps  %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps  %xmm3, %xmm3, %xmm4

          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL

          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps   %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm3, %xmm3, %xmm4

       According to this report, the dot-product kernel has been executed
       300 times, for a total of 900 simulated instructions. The total
       number of simulated micro opcodes (uOps) is also 900.

       The report is structured in three main sections. The first section
       collects a few performance numbers; the goal of this section is to
       give a very quick overview of the performance throughput. Important
       performance indicators are IPC, uOps Per Cycle, and Block
       RThroughput (Block Reciprocal Throughput).

       IPC is computed by dividing the total number of simulated
       instructions by the total number of cycles. In the absence of
       loop-carried data dependencies, the observed IPC tends to a
       theoretical maximum which can be computed by dividing the number of
       instructions of a single iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is computed by dividing the total number of
       simulated micro opcodes by the total number of cycles. A delta
       between Dispatch Width and this field is an indicator of a
       performance issue. In the absence of loop-carried data dependencies,
       the observed 'uOps Per Cycle' should tend to a theoretical maximum
       throughput which can be computed by dividing the number of uOps of a
       single iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is bounded from above by the dispatch width.
       That is because the dispatch width limits the maximum size of a
       dispatch group. Both IPC and 'uOps Per Cycle' are limited by the
       amount of hardware parallelism. The availability of hardware
       resources affects the resource pressure distribution, and it limits
       the number of instructions that can be executed in parallel every
       cycle. A delta between Dispatch Width and the theoretical maximum
       uOps per Cycle (computed by dividing the number of uOps of a single
       iteration by the Block RThroughput) is an indicator of a performance
       bottleneck caused by the lack of hardware resources. In general, the
       lower the Block RThroughput, the better.

       In this example, uOps per iteration/Block RThroughput is 1.50. Since
       there are no loop-carried dependencies, the observed uOps Per Cycle
       is expected to approach 1.50 when the number of iterations tends to
       infinity. The delta between the Dispatch Width (2.00) and the
       theoretical maximum uOp throughput (1.50) is an indicator of a
       performance bottleneck caused by the lack of hardware resources, and
       the Resource pressure view can help to identify the problematic
       resource usage.
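
       The summary arithmetic described above can be checked with a few
       lines of Python, using the numbers taken directly from the report
       shown earlier (300 iterations of a 3-instruction, 3-uOp kernel, 610
       total cycles):

```python
# Reproduce the first-section numbers of the dot-product report.
instructions = 900          # 300 iterations x 3 instructions
uops = 900                  # each instruction decodes to a single uOp here
cycles = 610                # Total Cycles from the report
uops_per_iteration = 3
block_rthroughput = 2.0     # Block RThroughput from the report

ipc = instructions / cycles                  # observed IPC
uops_per_cycle = uops / cycles               # observed uOps Per Cycle
theoretical_max = uops_per_iteration / block_rthroughput  # upper bound

print(round(ipc, 2), round(uops_per_cycle, 2), theoretical_max)
```

       As expected, both observed values are 1.48, below the theoretical
       maximum of 1.50 and well below the Dispatch Width of 2.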

       The second section of the report shows the latency and reciprocal
       throughput of every instruction in the sequence. That section also
       reports extra information related to the number of micro opcodes,
       and opcode properties (i.e., 'MayLoad', 'MayStore', and
       'HasSideEffects').

       The third section is the Resource pressure view. This view reports
       the average number of resource cycles consumed every iteration by
       instructions for every processor resource unit available on the
       target. Information is structured in two tables. The first table
       reports the number of resource cycles spent on average every
       iteration. The second table correlates the resource cycles to the
       machine instructions in the sequence. For example, every iteration
       of the instruction vmulps always executes on resource unit [6]
       (JFPU1 - floating point pipeline #1), consuming an average of 1
       resource cycle per iteration. Note that on AMD Jaguar, vector
       floating-point multiply can only be issued to pipeline JFPU1, while
       horizontal floating-point additions can only be issued to pipeline
       JFPU0.

       The resource pressure view helps with identifying bottlenecks caused
       by high usage of specific hardware resources. Situations with
       resource pressure mainly concentrated on a few resources should, in
       general, be avoided. Ideally, pressure should be uniformly
       distributed between multiple resources.

   Timeline View
       The timeline view produces a detailed report of each instruction's
       state transitions through an instruction pipeline. This view is
       enabled by the command line option -timeline. As instructions
       transition through the various stages of the pipeline, their states
       are depicted in the view report. These states are represented by the
       following characters:

       • D : Instruction dispatched.

       • e : Instruction executing.

       • E : Instruction executed.

       • R : Instruction retired.

       • = : Instruction already dispatched, waiting to be executed.

       • - : Instruction executed, waiting to be retired.
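
       A timeline row is just a string of these state characters, one per
       cycle, so its meaning can be recovered mechanically. The following
       hypothetical decoder (for exposition only, not part of llvm-mca)
       counts how many cycles an instruction spends in each state; '.'
       marks cycles outside the instruction's lifetime:

```python
# Decode one timeline row into per-state cycle counts.
def decode_timeline(row):
    counts = {"D": 0, "=": 0, "e": 0, "E": 0, "-": 0, "R": 0}
    for state in row:
        if state in counts:
            counts[state] += 1
    return counts

# Row for instruction [1,0] of the dot-product example: dispatched at
# cycle 1, executing for two cycles, write-back at cycle 4, then five
# cycles waiting to retire in program order.
states = decode_timeline(".DeeE-----R")
```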

       Below is the timeline view for a subset of the dot-product example
       located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and
       processed by llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4

          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3   vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0   vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0   vhaddps  %xmm3, %xmm3, %xmm4

       The timeline view is interesting because it shows instruction state
       changes during execution. It also gives an idea of how the tool
       processes instructions executed on the target, and how their timing
       information might be calculated.

       The timeline view is structured in two tables. The first table shows
       instructions changing state over time (measured in cycles); the
       second table (named Average Wait times) reports useful timing
       statistics, which should help diagnose performance bottlenecks
       caused by long data dependencies and sub-optimal usage of hardware
       resources.

       An instruction in the timeline view is identified by a pair of
       indices, where the first index identifies an iteration, and the
       second index is the instruction index (i.e., where it appears in the
       code sequence). Since this example was generated using 3 iterations
       (-iterations=3), the iteration indices range from 0 to 2 inclusive.

       Excluding the first and last column, the remaining columns are in
       cycles. Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       • Instruction [1,0] was dispatched at cycle 1.

       • Instruction [1,0] started executing at cycle 2.

       • Instruction [1,0] reached the write back stage at cycle 4.

       • Instruction [1,0] was retired at cycle 10.

       Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
       wait in the scheduler's queue for the operands to become available.
       By the time vmulps is dispatched, operands are already available,
       and pipeline JFPU1 is ready to serve another instruction. So the
       instruction can be immediately issued on the JFPU1 pipeline. That is
       demonstrated by the fact that the instruction only spent 1cy in the
       scheduler's queue.

       There is a gap of 5 cycles between the write-back stage and the
       retire event. That is because instructions must retire in program
       order, so [1,0] has to wait for [0,2] to be retired first (i.e., it
       has to wait until cycle 10).

       In the example, all instructions are in a RAW (Read After Write)
       dependency chain. Register %xmm2 written by vmulps is immediately
       used by the first vhaddps, and register %xmm3 written by the first
       vhaddps is used by the second vhaddps. Long data dependencies
       negatively impact the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies introduced
       by instructions from different iterations. However, those
       dependencies can be removed at the register renaming stage (at the
       cost of allocating register aliases, and therefore consuming
       physical registers).

       Table Average Wait times helps diagnose performance issues that are
       caused by the presence of long latency instructions and potentially
       long data dependencies which may limit the ILP. Note that llvm-mca,
       by default, assumes at least 1cy between the dispatch event and the
       issue event.

       When the performance is limited by data dependencies and/or long
       latency instructions, the number of cycles spent while in the ready
       state is expected to be very small when compared with the total
       number of cycles spent in the scheduler's queue. The difference
       between the two counters is a good indicator of how large of an
       impact data dependencies had on the execution of the instructions.
       When performance is mostly limited by the lack of hardware
       resources, the delta between the two counters is small. However,
       the number of cycles spent in the queue tends to be larger (i.e.,
       more than 1-3cy), especially when compared to other low latency
       instructions.
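
       This interpretation can be made concrete with the numbers from the
       Average Wait times table above. The gap between time-in-queue
       (column [1]) and time-ready (column [2]) estimates how long each
       instruction sat waiting for its operands:

```python
# Values from the Average Wait times table of the dot-product example.
vmulps_queue, vmulps_ready = 1.0, 1.0         # instruction 0
vhaddps2_queue, vhaddps2_ready = 5.7, 0.0     # instruction 2

# Cycles spent waiting for operands (queue time minus ready time).
vmulps_dep_stall = vmulps_queue - vmulps_ready          # 0.0: never starved
vhaddps2_dep_stall = vhaddps2_queue - vhaddps2_ready    # 5.7: dependency-bound
```

       The second vhaddps spends all of its queue time waiting for
       operands, which is consistent with the conclusion that this kernel
       is limited by its data dependency chain.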

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and
       performance counters for the dispatch logic, the reorder buffer, the
       retire control unit, and the register file.

       Below is an example of -all-stats output generated by llvm-mca for
       300 iterations of the dot-product example discussed in the previous
       sections.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272  (44.6%)
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0

          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)

          Schedulers - number of cycles where we saw N micro opcodes issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)

          Scheduler's queue usage:
          [1] Resource name.
          [2] Average number of used buffer entries.
          [3] Maximum number of used buffer entries.
          [4] Total number of buffer entries.

           [1]            [2]        [3]        [4]
          JALU01           0          0          20
          JFPU01           17         18         18
          JLSAGU           0          0          12

          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
           2,           399  (65.4%)

          Total ROB Entries:                64
          Max Used ROB Entries:             35  ( 54.7% )
          Average Used ROB Entries per cy:  32  ( 50.0% )

          Register File statistics:
          Total number of mappings created:    900
          Max number of mappings used:         35

          *  Register File #1 -- JFpuPRF:
             Number of physical registers:     72
             Total number of mappings created: 900
             Max number of mappings used:      35

          *  Register File #2 -- JIntegerPRF:
             Number of physical registers:     64
             Total number of mappings created: 0
             Max number of mappings used:      0

       If we look at the Dynamic Dispatch Stall Cycles table, we see that
       the counter for SCHEDQ reports 272 cycles. This counter is
       incremented every time the dispatch logic is unable to dispatch a
       full group because the scheduler's queue is full.

       Looking at the Dispatch Logic table, we see that the pipeline was
       only able to dispatch two micro opcodes 51.5% of the time. The
       dispatch group was limited to one micro opcode 44.6% of the cycles,
       which corresponds to 272 cycles. The dispatch statistics are
       displayed by using either the command option -all-stats or
       -dispatch-stats.

       The next table, Schedulers, presents a histogram displaying a count,
       representing the number of micro opcodes issued on some number of
       cycles. In this case, of the 610 simulated cycles, single opcodes
       were issued 306 times (50.2%) and there were 7 cycles where no
       opcodes were issued.

       The Scheduler's queue usage table shows the average and maximum
       number of buffer entries (i.e., scheduler queue entries) used at
       runtime. Resource JFPU01 reached its maximum (18 of 18 queue
       entries). Note that AMD Jaguar implements three schedulers:

       • JALU01 - A scheduler for ALU instructions.

       • JFPU01 - A scheduler for floating point operations.

       • JLSAGU - A scheduler for address generation.

       The dot-product is a kernel of three floating point instructions (a
       vector multiply followed by two horizontal adds). That explains why
       only the floating point scheduler appears to be used.

       A full scheduler queue is either caused by data dependency chains or
       by a sub-optimal usage of hardware resources. Sometimes, resource
       pressure can be mitigated by rewriting the kernel using different
       instructions that consume different scheduler resources. Schedulers
       with a small queue are less resilient to bottlenecks caused by the
       presence of long data dependencies. The scheduler statistics are
       displayed by using the command option -all-stats or
       -scheduler-stats.

       The next table, Retire Control Unit, presents a histogram displaying
       a count, representing the number of instructions retired on some
       number of cycles. In this case, of the 610 simulated cycles, two
       instructions were retired during the same cycle 399 times (65.4%)
       and there were 109 cycles where no instructions were retired. The
       retire statistics are displayed by using the command option
       -all-stats or -retire-stats.

       The last table presented is Register File statistics. Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for
       floating-point registers (JFpuPRF) and one for integer registers
       (JIntegerPRF). The table shows that of the 900 instructions
       processed, there were 900 mappings created. Since this dot-product
       example utilized only floating point registers, the JFpuPRF was
       responsible for creating the 900 mappings. However, we see that the
       pipeline only used a maximum of 35 of 72 available register slots
       at any given time. We can conclude that the floating point PRF was
       the only register file used for the example, and that it was never
       resource constrained. The register file statistics are displayed by
       using the command option -all-stats or -register-file-stats.

       In this example, we can conclude that the IPC is mostly limited by
       data dependencies, and not by resource pressure.

   Instruction Flow
       This section describes the instruction flow through the default
       pipeline of llvm-mca, as well as the functional units involved in
       the process.

       The default pipeline implements the following sequence of stages
       used to process instructions.

       • Dispatch (Instruction is dispatched to the schedulers).

       • Issue (Instruction is issued to the processor pipelines).

       • Write Back (Instruction is executed, and results are written
         back).

       • Retire (Instruction is retired; writes are architecturally
         committed).

       The default pipeline only models the out-of-order portion of a
       processor. Therefore, the instruction fetch and decode stages are
       not modeled. Performance bottlenecks in the frontend are not
       diagnosed. llvm-mca assumes that instructions have all been decoded
       and placed into a queue before the simulation starts. Also, llvm-mca
       does not model branch prediction.

   Instruction Dispatch
       During the dispatch stage, instructions are picked in program order
       from a queue of already decoded instructions, and dispatched in
       groups to the simulated hardware schedulers.

       The size of a dispatch group depends on the availability of the
       simulated hardware resources. The processor dispatch width defaults
       to the value of the IssueWidth in LLVM's scheduling model.

       An instruction can be dispatched if:

       • The size of the dispatch group is smaller than the processor's
         dispatch width.

       • There are enough entries in the reorder buffer.

       • There are enough physical registers to do register renaming.

       • The schedulers are not full.

       Scheduling models can optionally specify which register files are
       available on the processor. llvm-mca uses that information to
       initialize register file descriptors. Users can limit the number of
       physical registers that are globally available for register renaming
       by using the command option -register-file-size. A value of zero for
       this option means unbounded. By knowing how many registers are
       available for renaming, the tool can predict dispatch stalls caused
       by the lack of physical registers.
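
       The dispatch conditions above amount to a simple predicate. The
       following sketch is illustrative only (the names are hypothetical,
       not llvm-mca internals), with the Jaguar/btver2 numbers from the
       earlier report plugged in as an example:

```python
# Hedged sketch of the four dispatch hazards: group size, reorder buffer
# capacity, rename registers, and scheduler queue space.
def can_dispatch(group_uops, dispatch_width, rob_free, uops_needed,
                 phys_regs_free, renames_needed, scheduler_full):
    return (group_uops < dispatch_width       # room left in the group
            and rob_free >= uops_needed       # reorder buffer entries
            and phys_regs_free >= renames_needed  # rename registers
            and not scheduler_full)           # scheduler queue space

# Example: one uOp already in the group, dispatch width 2, 64 ROB
# entries and 72 FP rename registers free, scheduler not full.
ok = can_dispatch(group_uops=1, dispatch_width=2, rob_free=64,
                  uops_needed=1, phys_regs_free=72, renames_needed=1,
                  scheduler_full=False)
```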

       The number of reorder buffer entries consumed by an instruction
       depends on the number of micro-opcodes specified for that
       instruction by the target scheduling model. The reorder buffer is
       responsible for tracking the progress of instructions that are
       "in-flight", and retiring them in program order. The number of
       entries in the reorder buffer defaults to the value specified by
       field MicroOpBufferSize in the target scheduling model.

       Instructions that are dispatched to the schedulers consume scheduler
       buffer entries. llvm-mca queries the scheduling model to determine
       the set of buffered resources consumed by an instruction. Buffered
       resources are treated like scheduler resources.

   Instruction Issue
       Each processor scheduler implements a buffer of instructions. An
       instruction has to wait in the scheduler's buffer until input
       register operands become available. Only at that point does the
       instruction become eligible for execution and may be issued
       (potentially out-of-order) for execution. Instruction latencies are
       computed by llvm-mca with the help of the scheduling model.

       llvm-mca's scheduler is designed to simulate multiple processor
       schedulers. The scheduler is responsible for tracking data
       dependencies, and dynamically selecting which processor resources
       are consumed by instructions. It delegates the management of
       processor resource units and resource groups to a resource manager.
       The resource manager is responsible for selecting resource units
       that are consumed by instructions. For example, if an instruction
       consumes 1cy of a resource group, the resource manager selects one
       of the available units from the group; by default, the resource
       manager uses a round-robin selector to guarantee that resource
       usage is uniformly distributed between all units of a group.

       llvm-mca's scheduler internally groups instructions into three sets:

       • WaitSet: a set of instructions whose operands are not ready.

       • ReadySet: a set of instructions ready to execute.

       • IssuedSet: a set of instructions executing.

       Depending on the operands availability, instructions that are
       dispatched to the scheduler are either placed into the WaitSet or
       into the ReadySet.

       Every cycle, the scheduler checks if instructions can be moved from
       the WaitSet to the ReadySet, and if instructions from the ReadySet
       can be issued to the underlying pipelines. The algorithm prioritizes
       older instructions over younger instructions.
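
       The per-cycle bookkeeping just described can be sketched as follows.
       This is a deliberately minimal model (instruction age is represented
       by a program-order index, and the helper names are hypothetical);
       the real scheduler tracks far more state:

```python
# One simulated cycle of the WaitSet/ReadySet/IssuedSet bookkeeping.
def cycle_update(wait_set, ready_set, issued_set, operands_ready, width):
    # WaitSet -> ReadySet: operands became available this cycle.
    for instr in list(wait_set):
        if operands_ready(instr):
            wait_set.remove(instr)
            ready_set.append(instr)
    # ReadySet -> IssuedSet: issue up to `width` instructions, oldest
    # (lowest program-order index) first.
    ready_set.sort()
    issued = ready_set[:width]
    del ready_set[:width]
    issued_set.extend(issued)
    return issued

# Instructions 0 and 1 are ready, 2 and 3 wait; instruction 2's operands
# arrive this cycle; issue width is 1, so the oldest ready one goes first.
wait, ready, issued = [2, 3], [0, 1], []
just_issued = cycle_update(wait, ready, issued,
                           operands_ready=lambda i: i == 2, width=1)
```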

   Write-Back and Retire Stage
       Issued instructions are moved from the ReadySet to the IssuedSet.
       There, instructions wait until they reach the write-back stage. At
       that point, they get removed from the queue and the retire control
       unit is notified.

       When instructions are executed, the retire control unit flags the
       instruction as "ready to retire."

       Instructions are retired in program order. The register file is
       notified of the retirement so that it can free the physical
       registers that were allocated for the instruction during the
       register renaming stage.
753
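In-order retirement can be sketched as follows; the function, the
reorder-buffer representation, and the register names are assumptions
made for illustration, not llvm-mca's actual data structures:

```python
# Hedged sketch of in-order retirement: only the oldest instruction may
# retire, and retiring returns the physical registers allocated by the
# rename stage to the free list.
def retire_in_order(rob, free_list):
    """rob: list of (seq, done, phys_regs) tuples in program order."""
    retired = []
    while rob and rob[0][1]:          # is the oldest instruction done?
        seq, _done, phys_regs = rob.pop(0)
        free_list.extend(phys_regs)   # free registers on retirement
        retired.append(seq)
    return retired

rob = [(0, True, ["P5"]), (1, False, ["P6"]), (2, True, ["P7"])]
free = []
retired = retire_in_order(rob, free)
# instruction 2 is done, but must wait behind instruction 1
```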
Load/Store Unit and Memory Consistency Model
To simulate an out-of-order execution of memory operations, llvm-mca
uses a simulated load/store unit (LSUnit) to model the speculative
execution of loads and stores.

Each load (or store) consumes an entry in the load (or store) queue.
Users can specify the flags -lqueue and -squeue to limit the number of
entries in the load and store queues respectively. The queues are
unbounded by default.

The LSUnit implements a relaxed consistency model for memory loads and
stores. The rules are:

1. A younger load is allowed to pass an older load only if there are no
   intervening stores or barriers between the two loads.

2. A younger load is allowed to pass an older store provided that the
   load does not alias with the store.

3. A younger store is not allowed to pass an older store.

4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias
with store operations (-noalias=true). Under this assumption, younger
loads are always allowed to pass older stores. Essentially, the LSUnit
does not attempt to run any alias analysis to predict when loads and
stores do not alias with each other.

Note that, in the case of write-combining memory, rule 3 could be
relaxed to allow reordering of non-aliasing store operations. That
said, at the moment, there is no way to further relax the memory model
(-noalias is the only option). Essentially, there is no option to
specify a different memory type (e.g., write-back, write-combining,
write-through, etc.) and consequently to weaken, or strengthen, the
memory model.

Other limitations are:

• The LSUnit does not know when store-to-load forwarding may occur.

• The LSUnit does not know anything about cache hierarchy and memory
  types.

• The LSUnit does not know how to identify serializing operations and
  memory fences.

The LSUnit does not attempt to predict if a load or store hits or
misses the L1 cache. It only knows if an instruction "MayLoad" and/or
"MayStore." For loads, the scheduling model provides an "optimistic"
load-to-use latency (which usually matches the load-to-use latency for
when there is a hit in the L1D).

llvm-mca does not know about serializing operations or memory-barrier
like instructions. The LSUnit conservatively assumes that an
instruction which has both "MayLoad" and unmodeled side effects behaves
like a "soft" load barrier. That means, it serializes loads without
forcing a flush of the load queue. Similarly, instructions that
"MayStore" and have unmodeled side effects are treated like store
barriers. A full memory barrier is a "MayLoad" and "MayStore"
instruction with unmodeled side effects. This is inaccurate, but it is
the best that we can do at the moment with the current information
available in LLVM.

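The classification above can be summarized in a small decision
function. This is a sketch of the stated rules only (the function name
is an assumption, not an LLVM API):

```python
# Classify an instruction as a barrier from the three flags llvm-mca
# actually has: "MayLoad", "MayStore", and unmodeled side effects.
def classify_barrier(may_load, may_store, has_unmodeled_side_effects):
    if not has_unmodeled_side_effects:
        return None                    # not treated as any barrier
    if may_load and may_store:
        return "full memory barrier"
    if may_load:
        return "load barrier"          # serializes loads, no queue flush
    if may_store:
        return "store barrier"
    return None

assert classify_barrier(True, True, True) == "full memory barrier"
assert classify_barrier(True, False, True) == "load barrier"
assert classify_barrier(True, False, False) is None
```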
A load/store barrier consumes one entry of the load/store queue. A
load/store barrier enforces ordering of loads/stores. A younger load
cannot pass a load barrier. Also, a younger store cannot pass a store
barrier. A younger load has to wait for the memory/load barrier to
execute. A load/store barrier is "executed" when it becomes the oldest
entry in the load/store queue(s). That also means, by construction,
all of the older loads/stores have been executed.

In conclusion, the full set of load/store consistency rules are:

1. A store may not pass a previous store.

2. A store may not pass a previous load (regardless of -noalias).

3. A store has to wait until an older store barrier is fully executed.

4. A load may pass a previous load.

5. A load may not pass a previous store unless -noalias is set.

6. A load has to wait until an older load barrier is fully executed.

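The non-barrier rules above (1, 2, 4, and 5) can be condensed into a
single predicate. This is an illustrative encoding of the rules as
stated, not llvm-mca's implementation; barrier rules 3 and 6 involve
queue state and are deliberately left out:

```python
# May a younger memory operation execute before (pass) an older one?
# younger/older are each "load" or "store".
def may_pass(younger, older, noalias=True):
    if younger == "store":
        return False      # rules 1 and 2: a store passes nothing
    if older == "load":
        return True       # rule 4: a load may pass a previous load
    return noalias        # rule 5: a load passes a store only if -noalias

assert may_pass("load", "load") is True
assert may_pass("load", "store") is True            # -noalias default
assert may_pass("load", "store", noalias=False) is False
assert may_pass("store", "load") is False
```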
AUTHOR
Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT
2003-2021, LLVM Project

                               2021-07-22                      LLVM-MCA(1)