LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)

NAME

       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS

       llvm-mca [options] [input]

DESCRIPTION

       llvm-mca is a performance analysis tool that uses information
       available in LLVM (e.g., scheduling models) to statically measure
       the performance of machine code on a specific CPU.

       Performance is measured in terms of throughput as well as processor
       resource consumption. The tool currently works for processors with
       an out-of-order backend, for which there is a scheduling model
       available in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help diagnose potential
       performance issues.

       Given an assembly code sequence, llvm-mca estimates the Instructions
       Per Cycle (IPC), as well as hardware resource pressure. The analysis
       and reporting style were inspired by the IACA tool from Intel.

       llvm-mca allows the usage of special code comments to mark regions
       of the assembly code to be analyzed. A comment starting with
       substring LLVM-MCA-BEGIN marks the beginning of a code region. A
       comment starting with substring LLVM-MCA-END marks the end of a code
       region. For example:

          # LLVM-MCA-BEGIN My Code Region
            ...
          # LLVM-MCA-END

       Multiple regions can be specified provided that they do not overlap.
       A code region can have an optional description. If no user-defined
       region is specified, then llvm-mca assumes a default region which
       contains every instruction in the input file. Every region is
       analyzed in isolation, and the final performance report is the union
       of all the reports generated for every code region.
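
       For instance, two disjoint regions, each with its own optional
       description, could be marked like this (the region names here are
       arbitrary):

          # LLVM-MCA-BEGIN Region A
            ...
          # LLVM-MCA-END
          # LLVM-MCA-BEGIN Region B
            ...
          # LLVM-MCA-END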

       Inline assembly directives may be used from source code to annotate
       the assembly text:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo");
            a += 42;
            __asm volatile("# LLVM-MCA-END");
            a *= b;
            return a;
          }

       So for example, you can compile code with clang, output assembly,
       and pipe it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

OPTIONS

       If input is "-" or omitted, llvm-mca reads from standard input.
       Otherwise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca will send its output to
       standard output if the input is from standard input. If the -o
       option specifies "-", then the output will also be sent to standard
       output.

       -help  Print a summary of command line options.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code. It
              defaults to the host default target.

       -mcpu=<cpuname>
              Specify the processor for which to analyze the code. By
              default, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated
              by the tool. On x86, possible values are [0, 1]. A value of 0
              (respectively, 1) for this flag enables the AT&T
              (respectively, Intel) assembly format for the code printed
              out by the tool in the analysis report.

       -dispatch=<width>
              Specify a different dispatch width for the processor. The
              dispatch width defaults to field 'IssueWidth' in the
              processor scheduling model. If width is zero, then the
              default dispatch width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this
              flag limits how many physical registers are available for
              register renaming purposes. A value of zero for this flag
              means "unlimited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run. If this flag is set
              to 0, then the tool sets the number of iterations to a
              default value (i.e., 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias.
              This is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the load queue. A value of
              zero for this flag is ignored, and the default load queue
              size is used instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the store queue. A value of
              zero for this flag is ignored, and the default store queue
              size is used instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view.
              By default, the timeline view prints information for up to 10
              iterations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view. By default,
              the number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view. This is enabled by
              default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics. This view collects and
              analyzes instruction dispatch events, as well as
              static/dynamic dispatch stall events. This view is disabled
              by default.

       -scheduler-stats
              Enable extra scheduler statistics. This view collects and
              analyzes instruction issue events. This view is disabled by
              default.

       -retire-stats
              Enable extra retire control unit statistics. This view is
              disabled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -all-stats
              Print all hardware statistics. This enables extra statistics
              related to the dispatch logic, the hardware schedulers, the
              register file(s), and the retire control unit. This option is
              disabled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Prints resource pressure information based on the static
              information available from the processor model. This differs
              from the resource pressure view because it doesn't require
              the code to be simulated. Instead, it prints the theoretical
              uniform distribution of resource pressure for every
              instruction in sequence.
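
       For example, the instruction tables for an assembly file can be
       printed with a command along these lines (the input file name here
       is hypothetical):

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -instruction-tables foo.s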

EXIT STATUS

       llvm-mca returns 0 on success. Otherwise, an error message is
       printed to standard error, and the tool returns 1.

HOW LLVM-MCA WORKS

       llvm-mca takes assembly code as input. The assembly code is parsed
       into a sequence of MCInst with the help of the existing LLVM target
       assembly parsers. The parsed sequence of MCInst is then analyzed by
       a Pipeline module to generate a performance report.

       The Pipeline module simulates the execution of the machine code
       sequence in a loop of iterations (default is 100). During this
       process, the pipeline collects a number of execution related
       statistics. At the end of this process, the pipeline generates and
       prints a report from the collected statistics.

       Here is an example of a performance report generated by the tool for
       a dot-product of two packed float vectors of four elements. The
       analysis is conducted for target x86, cpu btver2. The following
       result can be produced with this command, using the example located
       at test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Dispatch Width:    2
          IPC:               1.48
          Block RThroughput: 2.0


          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps      %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps     %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps     %xmm3, %xmm3, %xmm4


          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL


          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps      %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm3, %xmm3, %xmm4

       According to this report, the dot-product kernel has been executed
       300 times, for a total of 900 dynamically executed instructions.

       The report is structured in three main sections. The first section
       collects a few performance numbers; the goal of this section is to
       give a very quick overview of the performance throughput. In this
       example, the two important performance indicators are IPC and Block
       RThroughput (Block Reciprocal Throughput).

       IPC is computed by dividing the total number of simulated
       instructions by the total number of cycles. A delta between Dispatch
       Width and IPC is an indicator of a performance issue. In the absence
       of loop-carried data dependencies, the observed IPC tends to a
       theoretical maximum which can be computed by dividing the number of
       instructions of a single iteration by the Block RThroughput.

       IPC is bounded from above by the dispatch width. That is because the
       dispatch width limits the maximum size of a dispatch group. IPC is
       also limited by the amount of hardware parallelism. The availability
       of hardware resources affects the resource pressure distribution,
       and it limits the number of instructions that can be executed in
       parallel every cycle. A delta between Dispatch Width and the
       theoretical maximum IPC is an indicator of a performance bottleneck
       caused by the lack of hardware resources. In general, the lower the
       Block RThroughput, the better.

       In this example, Instructions per iteration/Block RThroughput is
       1.50. Since there are no loop-carried dependencies, the observed IPC
       is expected to approach 1.50 when the number of iterations tends to
       infinity. The delta between the Dispatch Width (2.00) and the
       theoretical maximum IPC (1.50) is an indicator of a performance
       bottleneck caused by the lack of hardware resources, and the
       Resource pressure view can help to identify the problematic resource
       usage.
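
       Plugging the report's numbers into the two indicators just described
       gives:

          IPC                 = 900 instructions / 610 cycles    = 1.48
          theoretical max IPC = 3 instructions / 2.0 RThroughput = 1.50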

       The second section of the report shows the latency and reciprocal
       throughput of every instruction in the sequence. That section also
       reports extra information related to the number of micro opcodes,
       and opcode properties (i.e., 'MayLoad', 'MayStore', and
       'HasSideEffects').

       The third section is the Resource pressure view. This view reports
       the average number of resource cycles consumed every iteration by
       instructions for every processor resource unit available on the
       target. Information is structured in two tables. The first table
       reports the number of resource cycles spent on average every
       iteration. The second table correlates the resource cycles to the
       machine instructions in the sequence. For example, every iteration
       of the instruction vmulps always executes on resource unit [6]
       (JFPU1 - floating point pipeline #1), consuming an average of 1
       resource cycle per iteration. Note that on AMD Jaguar, vector
       floating-point multiply can only be issued to pipeline JFPU1, while
       horizontal floating-point additions can only be issued to pipeline
       JFPU0.

       The resource pressure view helps identify bottlenecks caused by high
       usage of specific hardware resources. Situations with resource
       pressure mainly concentrated on a few resources should, in general,
       be avoided. Ideally, pressure should be uniformly distributed
       between multiple resources.

   Timeline View
       The timeline view produces a detailed report of each instruction's
       state transitions through an instruction pipeline. This view is
       enabled by the command line option -timeline. As instructions
       transition through the various stages of the pipeline, their states
       are depicted in the view report. These states are represented by the
       following characters:

       · D : Instruction dispatched.

       · e : Instruction executing.

       · E : Instruction executed.

       · R : Instruction retired.

       · = : Instruction already dispatched, waiting to be executed.

       · - : Instruction executed, waiting to be retired.

       Below is the timeline view for a subset of the dot-product example
       located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and
       processed by llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4

       The timeline view is interesting because it shows instruction state
       changes during execution. It also gives an idea of how the tool
       processes instructions executed on the target, and how their timing
       information might be calculated.

       The timeline view is structured in two tables. The first table shows
       instructions changing state over time (measured in cycles); the
       second table (named Average Wait times) reports useful timing
       statistics, which should help diagnose performance bottlenecks
       caused by long data dependencies and sub-optimal usage of hardware
       resources.

       An instruction in the timeline view is identified by a pair of
       indices, where the first index identifies an iteration, and the
       second index is the instruction index (i.e., where it appears in the
       code sequence). Since this example was generated using 3 iterations
       (-iterations=3), the iteration indices range from 0 to 2 inclusive.

       Excluding the first and last column, the remaining columns are in
       cycles. Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       · Instruction [1,0] was dispatched at cycle 1.

       · Instruction [1,0] started executing at cycle 2.

       · Instruction [1,0] reached the write back stage at cycle 4.

       · Instruction [1,0] was retired at cycle 10.

       Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
       wait in the scheduler's queue for the operands to become available.
       By the time vmulps is dispatched, operands are already available,
       and pipeline JFPU1 is ready to serve another instruction. So the
       instruction can be immediately issued on the JFPU1 pipeline. That is
       demonstrated by the fact that the instruction only spent 1cy in the
       scheduler's queue.

       There is a gap of 5 cycles between the write-back stage and the
       retire event. That is because instructions must retire in program
       order, so [1,0] has to wait for [0,2] to be retired first (i.e., it
       has to wait until cycle 10).

       In the example, all instructions are in a RAW (Read After Write)
       dependency chain. Register %xmm2 written by vmulps is immediately
       used by the first vhaddps, and register %xmm3 written by the first
       vhaddps is used by the second vhaddps. Long data dependencies
       negatively impact the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies introduced
       by instructions from different iterations. However, those
       dependencies can be removed at the register renaming stage (at the
       cost of allocating register aliases, and therefore consuming
       physical registers).

       Table Average Wait times helps diagnose performance issues that are
       caused by the presence of long latency instructions and potentially
       long data dependencies which may limit the ILP. Note that llvm-mca,
       by default, assumes at least 1cy between the dispatch event and the
       issue event.
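
       As a worked example, consider row 2. (the second vhaddps). Counting
       the dispatch cycle plus the '=' cycles in the timeline gives waits
       of 5, 5, and 7 cycles across the three iterations, which matches the
       reported average:

          (5 + 5 + 7) / 3 = 5.7 cycles waiting in the scheduler's queue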

       When performance is limited by data dependencies and/or long latency
       instructions, the number of cycles spent while in the ready state is
       expected to be very small when compared with the total number of
       cycles spent in the scheduler's queue. The difference between the
       two counters is a good indicator of how large of an impact data
       dependencies had on the execution of the instructions. When
       performance is mostly limited by the lack of hardware resources, the
       delta between the two counters is small. However, the number of
       cycles spent in the queue tends to be larger (i.e., more than
       1-3cy), especially when compared to other low latency instructions.

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and
       performance counters for the dispatch logic, the reorder buffer, the
       retire control unit, and the register file.

       Below is an example of -all-stats output generated by MCA for the
       dot-product example discussed in the previous sections.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0


          Dispatch Logic - number of cycles where we saw N instructions dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)


          Schedulers - number of cycles where we saw N instructions issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)


          Scheduler's queue usage:
          JALU01,  0/20
          JFPU01,  18/18
          JLSAGU,  0/12


          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
           2,           399  (65.4%)


          Register File statistics:
          Total number of mappings created:    900
          Max number of mappings used:         35

          *  Register File #1 -- JFpuPRF:
             Number of physical registers:     72
             Total number of mappings created: 900
             Max number of mappings used:      35

          *  Register File #2 -- JIntegerPRF:
             Number of physical registers:     64
             Total number of mappings created: 0
             Max number of mappings used:      0

       If we look at the Dynamic Dispatch Stall Cycles table, we see the
       counter for SCHEDQ reports 272 cycles. This counter is incremented
       every time the dispatch logic is unable to dispatch a group of two
       instructions because the scheduler's queue is full.

       Looking at the Dispatch Logic table, we see that the pipeline was
       only able to dispatch two instructions 51.5% of the time. The
       dispatch group was limited to one instruction 44.6% of the cycles,
       which corresponds to 272 cycles. The dispatch statistics are
       displayed by using either the command option -all-stats or
       -dispatch-stats.
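
       These figures can be cross-checked against the 610 simulated cycles
       and 900 instructions reported earlier:

          272 / 610 = 44.6%     314 / 610 = 51.5%     24 / 610 = 3.9%
          0 * 24  +  1 * 272  +  2 * 314  =  900 dispatched instructions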

       The next table, Schedulers, presents a histogram displaying, for
       each possible issue count N, the number of cycles on which N
       instructions were issued. In this case, of the 610 simulated cycles,
       single instructions were issued 306 times (50.2%) and there were 7
       cycles where no instructions were issued.

       The Scheduler's queue usage table shows the maximum number of buffer
       entries (i.e., scheduler queue entries) used at runtime. Resource
       JFPU01 reached its maximum (18 of 18 queue entries). Note that AMD
       Jaguar implements three schedulers:

       · JALU01 - A scheduler for ALU instructions.

       · JFPU01 - A scheduler for floating point operations.

       · JLSAGU - A scheduler for address generation.

       The dot-product is a kernel of three floating point instructions (a
       vector multiply followed by two horizontal adds). That explains why
       only the floating point scheduler appears to be used.

       A full scheduler queue is either caused by data dependency chains or
       by a sub-optimal usage of hardware resources. Sometimes, resource
       pressure can be mitigated by rewriting the kernel using different
       instructions that consume different scheduler resources. Schedulers
       with a small queue are less resilient to bottlenecks caused by the
       presence of long data dependencies. The scheduler statistics are
       displayed by using the command option -all-stats or
       -scheduler-stats.

       The next table, Retire Control Unit, presents a histogram of the
       number of instructions retired per cycle. In this case, of the 610
       simulated cycles, two instructions were retired during the same
       cycle 399 times (65.4%) and there were 109 cycles where no
       instructions were retired. The retire statistics are displayed by
       using the command option -all-stats or -retire-stats.

       The last table presented is Register File statistics. Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for
       floating-point registers (JFpuPRF) and one for integer registers
       (JIntegerPRF). The table shows that of the 900 instructions
       processed, there were 900 mappings created. Since this dot-product
       example utilized only floating point registers, the JFpuPRF was
       responsible for creating the 900 mappings. However, we see that the
       pipeline only used a maximum of 35 of 72 available register slots at
       any given time. We can conclude that the floating point PRF was the
       only register file used for the example, and that it was never
       resource constrained. The register file statistics are displayed by
       using the command option -all-stats or -register-file-stats.

       In this example, we can conclude that the IPC is mostly limited by
       data dependencies, and not by resource pressure.

   Instruction Flow
       This section describes the instruction flow through MCA's default
       out-of-order pipeline, as well as the functional units involved in
       the process.

       The default pipeline implements the following sequence of stages
       used to process instructions.

       · Dispatch (Instruction is dispatched to the schedulers).

       · Issue (Instruction is issued to the processor pipelines).

       · Write Back (Instruction is executed, and results are written
         back).

       · Retire (Instruction is retired; writes are architecturally
         committed).

       The default pipeline only models the out-of-order portion of a
       processor. Therefore, the instruction fetch and decode stages are
       not modeled. Performance bottlenecks in the frontend are not
       diagnosed. MCA assumes that instructions have all been decoded and
       placed into a queue. Also, MCA does not model branch prediction.

   Instruction Dispatch
       During the dispatch stage, instructions are picked in program order
       from a queue of already decoded instructions, and dispatched in
       groups to the simulated hardware schedulers.

       The size of a dispatch group depends on the availability of the
       simulated hardware resources. The processor dispatch width defaults
       to the value of the IssueWidth in LLVM's scheduling model.

       An instruction can be dispatched if all of the following conditions
       are met (a sketch of this check follows the list):

       · The size of the dispatch group is smaller than the processor's
         dispatch width.

       · There are enough entries in the reorder buffer.

       · There are enough physical registers to do register renaming.

       · The schedulers are not full.
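
       The C fragment below is a minimal sketch of these four conditions;
       the structure and names are illustrative inventions, not llvm-mca's
       actual interfaces:

          #include <stdbool.h>

          /* Illustrative only: the four dispatch conditions listed above. */
          struct DispatchState {
            unsigned GroupSize;      /* instructions already in this group */
            unsigned DispatchWidth;  /* dispatch width (IssueWidth)        */
            unsigned FreeROBEntries; /* available reorder buffer entries   */
            unsigned FreePhysRegs;   /* registers left for renaming        */
            bool SchedulerFull;      /* target scheduler has no free entry */
          };

          static bool canDispatch(const struct DispatchState *S,
                                  unsigned NumMicroOps, unsigned NumRegWrites) {
            return S->GroupSize < S->DispatchWidth   /* group has room     */
                && S->FreeROBEntries >= NumMicroOps  /* ROB has room       */
                && S->FreePhysRegs >= NumRegWrites   /* renaming can occur */
                && !S->SchedulerFull;                /* scheduler has room */
          }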

       Scheduling models can optionally specify which register files are
       available on the processor. MCA uses that information to initialize
       register file descriptors. Users can limit the number of physical
       registers that are globally available for register renaming by using
       the command option -register-file-size. A value of zero for this
       option means unbounded. By knowing how many registers are available
       for renaming, MCA can predict dispatch stalls caused by the lack of
       registers.

       The number of reorder buffer entries consumed by an instruction
       depends on the number of micro-opcodes specified by the target
       scheduling model. MCA's reorder buffer tracks the progress of
       instructions that are "in-flight," and retires instructions in
       program order. The number of entries in the reorder buffer defaults
       to the MicroOpBufferSize provided by the target scheduling model.

       Instructions that are dispatched to the schedulers consume scheduler
       buffer entries. llvm-mca queries the scheduling model to determine
       the set of buffered resources consumed by an instruction. Buffered
       resources are treated like scheduler resources.

   Instruction Issue
       Each processor scheduler implements a buffer of instructions. An
       instruction has to wait in the scheduler's buffer until input
       register operands become available. Only at that point does the
       instruction become eligible for execution and may be issued
       (potentially out-of-order) for execution. Instruction latencies are
       computed by llvm-mca with the help of the scheduling model.

       llvm-mca's scheduler is designed to simulate multiple processor
       schedulers. The scheduler is responsible for tracking data
       dependencies, and dynamically selecting which processor resources
       are consumed by instructions. It delegates the management of
       processor resource units and resource groups to a resource manager.
       The resource manager is responsible for selecting resource units
       that are consumed by instructions. For example, if an instruction
       consumes 1cy of a resource group, the resource manager selects one
       of the available units from the group; by default, the resource
       manager uses a round-robin selector to guarantee that resource usage
       is uniformly distributed between all units of a group (a small
       sketch of such a selector follows).
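
       As a tiny illustration of the round-robin policy (hypothetical code,
       not llvm-mca's implementation), a selector only needs to remember
       the index of the next unit to hand out:

          /* Pick a unit from a group of NumUnits, rotating on each call
           * so that usage is uniformly distributed across the group. */
          static unsigned selectUnit(unsigned *NextIdx, unsigned NumUnits) {
            unsigned Chosen = *NextIdx;
            *NextIdx = (*NextIdx + 1) % NumUnits;
            return Chosen;
          }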

       llvm-mca's scheduler implements three instruction queues:

       · WaitQueue: a queue of instructions whose operands are not ready.

       · ReadyQueue: a queue of instructions ready to execute.

       · IssuedQueue: a queue of instructions executing.

       Depending on the operand availability, instructions that are
       dispatched to the scheduler are either placed into the WaitQueue or
       into the ReadyQueue.

       Every cycle, the scheduler checks if instructions can be moved from
       the WaitQueue to the ReadyQueue, and if instructions from the
       ReadyQueue can be issued to the underlying pipelines. The algorithm
       prioritizes older instructions over younger instructions.
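
       The fragment below sketches that per-cycle policy in C over a
       program-ordered array of instructions; the states and fields are
       invented for illustration and do not mirror llvm-mca's internals:

          #include <stdbool.h>

          enum State { WAIT, READY, ISSUED, DONE };

          struct Inst {
            enum State St;
            int DepIdx;          /* producer this entry waits on; -1 if none */
            unsigned CyclesLeft; /* execution cycles remaining once issued   */
          };

          /* One simulated cycle: complete executing instructions
           * (write-back), wake up waiting instructions whose producer has
           * finished, then issue ready ones oldest-first. */
          static void runCycle(struct Inst *I, unsigned N, unsigned IssueWidth) {
            for (unsigned i = 0; i < N; ++i)          /* write-back      */
              if (I[i].St == ISSUED && --I[i].CyclesLeft == 0)
                I[i].St = DONE;
            for (unsigned i = 0; i < N; ++i)          /* Wait -> Ready   */
              if (I[i].St == WAIT &&
                  (I[i].DepIdx < 0 || I[I[i].DepIdx].St == DONE))
                I[i].St = READY;
            unsigned Issued = 0;                      /* Ready -> Issued */
            for (unsigned i = 0; i < N && Issued < IssueWidth; ++i)
              if (I[i].St == READY) {
                I[i].St = ISSUED;
                ++Issued;
              }
          }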

   Write-Back and Retire Stage
       Issued instructions are moved from the ReadyQueue to the
       IssuedQueue. There, instructions wait until they reach the
       write-back stage. At that point, they get removed from the queue and
       the retire control unit is notified.

       When an instruction is executed, the retire control unit flags it as
       "ready to retire."

       Instructions are retired in program order. The register file is
       notified of the retirement so that it can free the physical
       registers that were allocated for the instruction during the
       register renaming stage.

   Load/Store Unit and Memory Consistency Model
       To simulate the out-of-order execution of memory operations,
       llvm-mca utilizes a simulated load/store unit (LSUnit) to model the
       speculative execution of loads and stores.

       Each load (or store) consumes an entry in the load (or store) queue.
       Users can specify flags -lqueue and -squeue to limit the number of
       entries in the load and store queues respectively. The queues are
       unbounded by default.
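
       For example, a run that models a 16-entry load queue and a 16-entry
       store queue could look like this (the queue sizes and input file
       name here are hypothetical):

          $ llvm-mca -mcpu=btver2 -lqueue=16 -squeue=16 foo.s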

       The LSUnit implements a relaxed consistency model for memory loads
       and stores. The rules are:

       1. A younger load is allowed to pass an older load only if there are
          no intervening stores or barriers between the two loads.

       2. A younger load is allowed to pass an older store provided that
          the load does not alias with the store.

       3. A younger store is not allowed to pass an older store.

       4. A younger store is not allowed to pass an older load.

       By default, the LSUnit optimistically assumes that loads do not
       alias with store operations (-noalias=true). Under this assumption,
       younger loads are always allowed to pass older stores. Essentially,
       the LSUnit does not attempt to run any alias analysis to predict
       when loads and stores do not alias with each other.

       Note that, in the case of write-combining memory, rule 3 could be
       relaxed to allow reordering of non-aliasing store operations. That
       being said, at the moment, there is no way to further relax the
       memory model (-noalias is the only option). Essentially, there is no
       option to specify a different memory type (e.g., write-back,
       write-combining, write-through, etc.) and consequently to weaken, or
       strengthen, the memory model.

       Other limitations are:

       · The LSUnit does not know when store-to-load forwarding may occur.

       · The LSUnit does not know anything about cache hierarchy and memory
         types.

       · The LSUnit does not know how to identify serializing operations
         and memory fences.

       The LSUnit does not attempt to predict if a load or store hits or
       misses the L1 cache. It only knows if an instruction "MayLoad"
       and/or "MayStore." For loads, the scheduling model provides an
       "optimistic" load-to-use latency (which usually matches the
       load-to-use latency for when there is a hit in the L1D).

       llvm-mca does not know about serializing operations or
       memory-barrier like instructions. The LSUnit conservatively assumes
       that an instruction which has both "MayLoad" and unmodeled side
       effects behaves like a "soft" load-barrier. That means it serializes
       loads without forcing a flush of the load queue. Similarly,
       instructions that "MayStore" and have unmodeled side effects are
       treated like store barriers. A full memory barrier is a "MayLoad"
       and "MayStore" instruction with unmodeled side effects. This is
       inaccurate, but it is the best that we can do at the moment with the
       current information available in LLVM.

       A load/store barrier consumes one entry of the load/store queue. A
       load/store barrier enforces ordering of loads/stores. A younger load
       cannot pass a load barrier. Also, a younger store cannot pass a
       store barrier. A younger load has to wait for the memory/load
       barrier to execute. A load/store barrier is "executed" when it
       becomes the oldest entry in the load/store queue(s). That also
       means, by construction, all of the older loads/stores have been
       executed.

       In conclusion, the full set of load/store consistency rules is (a
       sketch of these rules as a predicate follows the list):

       1. A store may not pass a previous store.

       2. A store may not pass a previous load (regardless of -noalias).

       3. A store has to wait until an older store barrier is fully
          executed.

       4. A load may pass a previous load.

       5. A load may not pass a previous store unless -noalias is set.

       6. A load has to wait until an older load barrier is fully executed.
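
       The predicate below restates these rules as a C sketch; the names
       are illustrative, and NoAlias mirrors the -noalias flag. It is
       slightly conservative in that it blocks a younger load behind any
       older barrier:

          #include <stdbool.h>

          struct MemOp {
            bool IsStore;   /* otherwise a load      */
            bool IsBarrier; /* load or store barrier */
          };

          /* May Younger bypass Older?  Rules 1-3: stores never pass
           * anything older; rule 6: loads wait for older barriers; rule
           * 5: loads pass stores only under -noalias; rule 4: loads may
           * always pass older loads. */
          static bool mayPass(const struct MemOp *Older,
                              const struct MemOp *Younger, bool NoAlias) {
            if (Younger->IsStore)
              return false;
            if (Older->IsBarrier)
              return false;
            if (Older->IsStore)
              return NoAlias;
            return true;
          }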

AUTHOR

       Maintained by The LLVM Team (http://llvm.org/).

COPYRIGHT
       2003-2019, LLVM Project

7                                 2019-07-25                      LLVM-MCA(1)