LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)


NAME

       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS

       llvm-mca [options] [input]

DESCRIPTION

       llvm-mca is a performance analysis tool that uses information
       available in LLVM (e.g. scheduling models) to statically measure the
       performance of machine code for a specific CPU.

       Performance is measured in terms of throughput as well as processor
       resource consumption. The tool currently works for processors with an
       out-of-order backend, for which there is a scheduling model available
       in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help with diagnosing
       potential performance issues.

       Given an assembly code sequence, llvm-mca estimates the Instructions
       Per Cycle (IPC), as well as hardware resource pressure. The analysis
       and reporting style were inspired by the IACA tool from Intel.

       llvm-mca allows the usage of special code comments to mark regions of
       the assembly code to be analyzed. A comment starting with substring
       LLVM-MCA-BEGIN marks the beginning of a code region. A comment
       starting with substring LLVM-MCA-END marks the end of a code region.
       For example:

          # LLVM-MCA-BEGIN My Code Region
            ...
          # LLVM-MCA-END

       Multiple regions can be specified provided that they do not overlap.
       A code region can have an optional description. If no user-defined
       region is specified, then llvm-mca assumes a default region which
       contains every instruction in the input file. Every region is
       analyzed in isolation, and the final performance report is the union
       of all the reports generated for every code region.
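
       For example, two disjoint regions with optional descriptions can be
       marked as follows (the region names here are only illustrative):

          # LLVM-MCA-BEGIN Region-A
            ...
          # LLVM-MCA-END
          # LLVM-MCA-BEGIN Region-B
            ...
          # LLVM-MCA-END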

       Inline assembly directives may be used from source code to annotate
       the assembly text:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo");
            a += 42;
            __asm volatile("# LLVM-MCA-END");
            a *= b;
            return a;
          }

       So for example, you can compile code with clang, output assembly, and
       pipe it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

OPTIONS

       If input is "-" or omitted, llvm-mca reads from standard input.
       Otherwise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca will send its output to
       standard output if the input is from standard input.  If the -o
       option specifies "-", then the output will also be sent to standard
       output.
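
       For example, the following invocation writes the report to a file
       (the file and input names here are only illustrative):

          $ llvm-mca -mcpu=btver2 -o report.txt dot-product.s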

       -help  Print a summary of command line options.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code.  It
              defaults to the host default target.

       -mcpu=<cpuname>
              Specify the processor for which to analyze the code.  By
              default, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated
              by the tool.  On x86, possible values are [0, 1]. A value of 0
              selects the AT&T assembly format, while a value of 1 selects
              the Intel assembly format for the code printed out by the tool
              in the analysis report.

       -dispatch=<width>
              Specify a different dispatch width for the processor.  The
              dispatch width defaults to field 'IssueWidth' in the processor
              scheduling model.  If width is zero, then the default dispatch
              width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this
              flag limits how many physical registers are available for
              register renaming purposes.  A value of zero for this flag
              means "unlimited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run. If this flag is set
              to 0, then the tool sets the number of iterations to a default
              value (i.e. 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias.
              This is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool.  By default, the tool assumes an
              unbounded number of entries in the load queue.  A value of
              zero for this flag is ignored, and the default load queue size
              is used instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the store queue. A value of
              zero for this flag is ignored, and the default store queue
              size is used instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view.
              By default, the timeline view prints information for up to 10
              iterations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view. By default,
              the number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view. This is enabled by default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics. This view collects and
              analyzes instruction dispatch events, as well as
              static/dynamic dispatch stall events. This view is disabled by
              default.

       -scheduler-stats
              Enable extra scheduler statistics. This view collects and
              analyzes instruction issue events.  This view is disabled by
              default.

       -retire-stats
              Enable extra retire control unit statistics. This view is
              disabled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -all-stats
              Print all hardware statistics. This enables extra statistics
              related to the dispatch logic, the hardware schedulers, the
              register file(s), and the retire control unit. This option is
              disabled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Prints resource pressure information based on the static
              information available from the processor model. This differs
              from the resource pressure view because it doesn't require the
              code to be simulated. It instead prints the theoretical
              uniform distribution of resource pressure for every
              instruction in sequence.
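
              For example (reusing the dot-product input discussed later in
              this document):

                 $ llvm-mca -mcpu=btver2 -instruction-tables dot-product.s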

EXIT STATUS

       llvm-mca returns 0 on success. Otherwise, an error message is printed
       to standard error, and the tool returns 1.

HOW LLVM-MCA WORKS

       llvm-mca takes assembly code as input. The assembly code is parsed
       into a sequence of MCInst with the help of the existing LLVM target
       assembly parsers.  The parsed sequence of MCInst is then analyzed by
       a Pipeline module to generate a performance report.

       The Pipeline module simulates the execution of the machine code
       sequence in a loop of iterations (default is 100). During this
       process, the pipeline collects a number of execution related
       statistics. At the end of this process, the pipeline generates and
       prints a report from the collected statistics.
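
       Conceptually, the simulation behaves like the following C sketch (an
       illustration of the iteration loop only, with hypothetical names;
       this is not llvm-mca's actual implementation):

          /* Replay the same instruction sequence for every iteration,
             accumulating execution related statistics along the way. */
          struct Stats { unsigned long instructions, cycles; };

          void simulate(struct Stats *s, unsigned iterations,
                        unsigned sequence_length) {
            for (unsigned i = 0; i < iterations; ++i)
              for (unsigned j = 0; j < sequence_length; ++j)
                s->instructions++;  /* each MCInst is replayed once */
            /* Cycle accounting is omitted here; the real pipeline advances
               one cycle at a time until every instruction has retired. */
          }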

       Here is an example of a performance report generated by the tool for
       a dot-product of two packed float vectors of four elements. The
       analysis is conducted for target x86, cpu btver2.  The following
       result can be produced via the following command using the example
       located at test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Dispatch Width:    2
          IPC:               1.48
          Block RThroughput: 2.0


          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps      %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps     %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps     %xmm3, %xmm3, %xmm4


          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL


          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps      %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm3, %xmm3, %xmm4

       According to this report, the dot-product kernel has been executed
       300 times, for a total of 900 dynamically executed instructions.

       The report is structured in three main sections.  The first section
       collects a few performance numbers; the goal of this section is to
       give a very quick overview of the performance throughput. In this
       example, the two important performance indicators are IPC and Block
       RThroughput (Block Reciprocal Throughput).

       IPC is computed by dividing the total number of simulated
       instructions by the total number of cycles.  A delta between Dispatch
       Width and IPC is an indicator of a performance issue. In the absence
       of loop-carried data dependencies, the observed IPC tends to a
       theoretical maximum which can be computed by dividing the number of
       instructions of a single iteration by the Block RThroughput.
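
       Using the numbers from the report above, both quantities can be
       reproduced with a few lines of C (a minimal sketch; the values are
       hard-coded from the report):

          #include <stdio.h>

          int main(void) {
            double instructions = 900.0; /* 300 iterations x 3 instructions */
            double cycles = 610.0;
            double insns_per_iteration = 3.0;
            double block_rthroughput = 2.0;

            /* Observed IPC: 900 / 610 = ~1.48. */
            printf("IPC:     %.2f\n", instructions / cycles);
            /* Theoretical maximum IPC: 3 / 2.0 = 1.50. */
            printf("Max IPC: %.2f\n",
                   insns_per_iteration / block_rthroughput);
            return 0;
          }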

       IPC is bounded from above by the dispatch width. That is because the
       dispatch width limits the maximum size of a dispatch group. IPC is
       also limited by the amount of hardware parallelism. The availability
       of hardware resources affects the resource pressure distribution, and
       it limits the number of instructions that can be executed in parallel
       every cycle.  A delta between Dispatch Width and the theoretical
       maximum IPC is an indicator of a performance bottleneck caused by the
       lack of hardware resources.  In general, the lower the Block
       RThroughput, the better.

       In this example, Instructions per iteration/Block RThroughput is
       1.50.  Since there are no loop-carried dependencies, the observed IPC
       is expected to approach 1.50 when the number of iterations tends to
       infinity.  The delta between the Dispatch Width (2.00) and the
       theoretical maximum IPC (1.50) is an indicator of a performance
       bottleneck caused by the lack of hardware resources, and the Resource
       pressure view can help to identify the problematic resource usage.

       The second section of the report shows the latency and reciprocal
       throughput of every instruction in the sequence. That section also
       reports extra information related to the number of micro opcodes, and
       opcode properties (i.e., 'MayLoad', 'MayStore', and
       'HasSideEffects').

       The third section is the Resource pressure view.  This view reports
       the average number of resource cycles consumed every iteration by
       instructions for every processor resource unit available on the
       target.  Information is structured in two tables. The first table
       reports the number of resource cycles spent on average every
       iteration. The second table correlates the resource cycles to the
       machine instruction in the sequence. For example, every iteration of
       the instruction vmulps always executes on resource unit [6] (JFPU1 -
       floating point pipeline #1), consuming an average of 1 resource cycle
       per iteration.  Note that on AMD Jaguar, vector floating-point
       multiply can only be issued to pipeline JFPU1, while horizontal
       floating-point additions can only be issued to pipeline JFPU0.

       The resource pressure view helps with identifying bottlenecks caused
       by high usage of specific hardware resources.  Situations with
       resource pressure mainly concentrated on a few resources should, in
       general, be avoided.  Ideally, pressure should be uniformly
       distributed between multiple resources.

   Timeline View
       The timeline view produces a detailed report of each instruction's
       state transitions through an instruction pipeline.  This view is
       enabled by the command line option -timeline.  As instructions
       transition through the various stages of the pipeline, their states
       are depicted in the view report.  These states are represented by the
       following characters:

       • D : Instruction dispatched.

       • e : Instruction executing.

       • E : Instruction executed.

       • R : Instruction retired.

       • = : Instruction already dispatched, waiting to be executed.

       • - : Instruction executed, waiting to be retired.

       Below is the timeline view for a subset of the dot-product example
       located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and processed
       by llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4

       The timeline view is interesting because it shows instruction state
       changes during execution.  It also gives an idea of how the tool
       processes instructions executed on the target, and how their timing
       information might be calculated.

       The timeline view is structured in two tables.  The first table shows
       instructions changing state over time (measured in cycles); the
       second table (named Average Wait times) reports useful timing
       statistics, which should help diagnose performance bottlenecks caused
       by long data dependencies and sub-optimal usage of hardware
       resources.

       An instruction in the timeline view is identified by a pair of
       indices, where the first index identifies an iteration, and the
       second index is the instruction index (i.e., where it appears in the
       code sequence).  Since this example was generated using 3 iterations
       (-iterations=3), the iteration indices range from 0 to 2 inclusive.

       Excluding the first and last column, the remaining columns are in
       cycles.  Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       • Instruction [1,0] was dispatched at cycle 1.

       • Instruction [1,0] started executing at cycle 2.

       • Instruction [1,0] reached the write back stage at cycle 4.

       • Instruction [1,0] was retired at cycle 10.

       Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
       wait in the scheduler's queue for the operands to become available.
       By the time vmulps is dispatched, operands are already available, and
       pipeline JFPU1 is ready to serve another instruction.  So the
       instruction can be immediately issued on the JFPU1 pipeline. That is
       demonstrated by the fact that the instruction only spent 1cy in the
       scheduler's queue.

       There is a gap of 5 cycles between the write-back stage and the
       retire event.  That is because instructions must retire in program
       order, so [1,0] has to wait for [0,2] to be retired first (i.e., it
       has to wait until cycle 10).

       In the example, all instructions are in a RAW (Read After Write)
       dependency chain.  Register %xmm2 written by vmulps is immediately
       used by the first vhaddps, and register %xmm3 written by the first
       vhaddps is used by the second vhaddps.  Long data dependencies
       negatively impact the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies introduced by
       instructions from different iterations.  However, those dependencies
       can be removed at the register renaming stage (at the cost of
       allocating register aliases, and therefore consuming physical
       registers).

       Table Average Wait times helps diagnose performance issues that are
       caused by the presence of long latency instructions and potentially
       long data dependencies which may limit the ILP.  Note that llvm-mca,
       by default, assumes at least 1cy between the dispatch event and the
       issue event.

       When the performance is limited by data dependencies and/or long
       latency instructions, the number of cycles spent while in the ready
       state is expected to be very small when compared with the total
       number of cycles spent in the scheduler's queue.  The difference
       between the two counters is a good indicator of how large of an
       impact data dependencies had on the execution of the instructions.
       When performance is mostly limited by the lack of hardware resources,
       the delta between the two counters is small.  However, the number of
       cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
       especially when compared to other low latency instructions.
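
       As a sketch of this heuristic, the following C fragment classifies
       the second vhaddps using row 2 of the Average Wait times table above
       (the 1.0 cycle threshold is a hypothetical value, chosen only for
       illustration):

          #include <stdio.h>

          int main(void) {
            double time_in_queue = 5.7; /* [1]: cycles waiting in queue   */
            double time_ready    = 0.0; /* [2]: cycles waiting while ready */

            /* A large gap between the two counters suggests the wait was
               mostly spent blocked on operands (a data dependency). */
            if (time_in_queue - time_ready > 1.0)
              printf("likely limited by data dependencies\n");
            else
              printf("likely limited by the lack of hardware resources\n");
            return 0;
          }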

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and
       performance counters for the dispatch logic, the reorder buffer, the
       retire control unit, and the register file.

       Below is an example of -all-stats output generated by MCA for the
       dot-product example discussed in the previous sections.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0


          Dispatch Logic - number of cycles where we saw N instructions dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)


          Schedulers - number of cycles where we saw N instructions issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)


          Scheduler's queue usage:
          JALU01,  0/20
          JFPU01,  18/18
          JLSAGU,  0/12


          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
           2,           399  (65.4%)


          Register File statistics:
          Total number of mappings created:    900
          Max number of mappings used:         35

          *  Register File #1 -- JFpuPRF:
             Number of physical registers:     72
             Total number of mappings created: 900
             Max number of mappings used:      35

          *  Register File #2 -- JIntegerPRF:
             Number of physical registers:     64
             Total number of mappings created: 0
             Max number of mappings used:      0

       If we look at the Dynamic Dispatch Stall Cycles table, we see the
       counter for SCHEDQ reports 272 cycles.  This counter is incremented
       every time the dispatch logic is unable to dispatch a group of two
       instructions because the scheduler's queue is full.

       Looking at the Dispatch Logic table, we see that the pipeline was
       only able to dispatch two instructions 51.5% of the time.  The
       dispatch group was limited to one instruction 44.6% of the cycles,
       which corresponds to 272 cycles.  The dispatch statistics are
       displayed by either using the command option -all-stats or
       -dispatch-stats.

       The next table, Schedulers, presents a histogram displaying a count,
       representing the number of instructions issued on some number of
       cycles.  In this case, of the 610 simulated cycles, single
       instructions were issued 306 times (50.2%) and there were 7 cycles
       where no instructions were issued.

       The Scheduler's queue usage table shows the maximum number of buffer
       entries (i.e., scheduler queue entries) used at runtime.  Resource
       JFPU01 reached its maximum (18 of 18 queue entries).  Note that AMD
       Jaguar implements three schedulers:

       • JALU01 - A scheduler for ALU instructions.

       • JFPU01 - A scheduler for floating point operations.

       • JLSAGU - A scheduler for address generation.

       The dot-product is a kernel of three floating point instructions (a
       vector multiply followed by two horizontal adds).  That explains why
       only the floating point scheduler appears to be used.

       A full scheduler queue is either caused by data dependency chains or
       by a sub-optimal usage of hardware resources.  Sometimes, resource
       pressure can be mitigated by rewriting the kernel using different
       instructions that consume different scheduler resources.  Schedulers
       with a small queue are less resilient to bottlenecks caused by the
       presence of long data dependencies.  The scheduler statistics are
       displayed by using the command option -all-stats or -scheduler-stats.

       The next table, Retire Control Unit, presents a histogram displaying
       a count, representing the number of instructions retired on some
       number of cycles.  In this case, of the 610 simulated cycles, two
       instructions were retired during the same cycle 399 times (65.4%) and
       there were 109 cycles where no instructions were retired.  The retire
       statistics are displayed by using the command option -all-stats or
       -retire-stats.

       The last table presented is Register File statistics.  Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for
       floating-point registers (JFpuPRF) and one for integer registers
       (JIntegerPRF).  The table shows that of the 900 instructions
       processed, there were 900 mappings created.  Since this dot-product
       example utilized only floating point registers, the JFpuPRF was
       responsible for creating the 900 mappings.  However, we see that the
       pipeline only used a maximum of 35 of 72 available register slots at
       any given time.  We can conclude that the floating point PRF was the
       only register file used for the example, and that it was never
       resource constrained.  The register file statistics are displayed by
       using the command option -all-stats or -register-file-stats.

       In this example, we can conclude that the IPC is mostly limited by
       data dependencies, and not by resource pressure.

   Instruction Flow
       This section describes the instruction flow through MCA's default
       out-of-order pipeline, as well as the functional units involved in
       the process.

       The default pipeline implements the following sequence of stages used
       to process instructions.

       • Dispatch (Instruction is dispatched to the schedulers).

       • Issue (Instruction is issued to the processor pipelines).

       • Write Back (Instruction is executed, and results are written back).

       • Retire (Instruction is retired; writes are architecturally
         committed).

       The default pipeline only models the out-of-order portion of a
       processor.  Therefore, the instruction fetch and decode stages are
       not modeled.  Performance bottlenecks in the frontend are not
       diagnosed.  MCA assumes that instructions have all been decoded and
       placed into a queue.  Also, MCA does not model branch prediction.

   Instruction Dispatch
       During the dispatch stage, instructions are picked in program order
       from a queue of already decoded instructions, and dispatched in
       groups to the simulated hardware schedulers.

       The size of a dispatch group depends on the availability of the
       simulated hardware resources.  The processor dispatch width defaults
       to the value of the IssueWidth in LLVM's scheduling model.
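
       For example, a different dispatch width can be forced from the
       command line (the value 4 here is only illustrative):

          $ llvm-mca -mcpu=btver2 -dispatch=4 dot-product.s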

       An instruction can be dispatched if (see the sketch after this list):

       • The size of the dispatch group is smaller than the processor's
         dispatch width.

       • There are enough entries in the reorder buffer.

       • There are enough physical registers to do register renaming.

       • The schedulers are not full.
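
       A minimal C sketch of these checks might look like the following
       (all names and fields here are hypothetical, not llvm-mca's actual
       API):

          #include <stdbool.h>

          /* Hypothetical pipeline state, for illustration only. */
          struct PipelineState {
            unsigned dispatch_group_size, dispatch_width;
            unsigned free_rob_entries, free_phys_regs;
            bool schedulers_full;
          };

          bool can_dispatch(const struct PipelineState *s,
                            unsigned uops, unsigned regs_needed) {
            return s->dispatch_group_size < s->dispatch_width &&
                   s->free_rob_entries >= uops &&      /* reorder buffer  */
                   s->free_phys_regs >= regs_needed && /* register rename */
                   !s->schedulers_full;
          }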

       Scheduling models can optionally specify which register files are
       available on the processor.  MCA uses that information to initialize
       register file descriptors.  Users can limit the number of physical
       registers that are globally available for register renaming by using
       the command option -register-file-size.  A value of zero for this
       option means unbounded.  By knowing how many registers are available
       for renaming, MCA can predict dispatch stalls caused by the lack of
       registers.

       The number of reorder buffer entries consumed by an instruction
       depends on the number of micro-opcodes specified by the target
       scheduling model.  MCA's reorder buffer's purpose is to track the
       progress of instructions that are "in-flight," and to retire
       instructions in program order.  The number of entries in the reorder
       buffer defaults to the MicroOpBufferSize provided by the target
       scheduling model.

       Instructions that are dispatched to the schedulers consume scheduler
       buffer entries.  llvm-mca queries the scheduling model to determine
       the set of buffered resources consumed by an instruction.  Buffered
       resources are treated like scheduler resources.

   Instruction Issue
       Each processor scheduler implements a buffer of instructions.  An
       instruction has to wait in the scheduler's buffer until input
       register operands become available.  Only at that point does the
       instruction become eligible for execution and may be issued
       (potentially out-of-order) for execution.  Instruction latencies are
       computed by llvm-mca with the help of the scheduling model.

       llvm-mca's scheduler is designed to simulate multiple processor
       schedulers.  The scheduler is responsible for tracking data
       dependencies, and dynamically selecting which processor resources are
       consumed by instructions.  It delegates the management of processor
       resource units and resource groups to a resource manager.  The
       resource manager is responsible for selecting resource units that are
       consumed by instructions.  For example, if an instruction consumes
       1cy of a resource group, the resource manager selects one of the
       available units from the group; by default, the resource manager uses
       a round-robin selector to guarantee that resource usage is uniformly
       distributed between all units of a group.
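
       A round-robin selector over the units of a group can be sketched as
       follows (a simplified illustration with hypothetical names, ignoring
       unit availability):

          /* Rotate through the units of a resource group so that usage is
             distributed uniformly between all of them. */
          struct ResourceGroup {
            unsigned num_units; /* assumed to be non-zero */
            unsigned next;      /* index of the unit to hand out next */
          };

          unsigned select_unit(struct ResourceGroup *g) {
            unsigned unit = g->next;
            g->next = (g->next + 1) % g->num_units;
            return unit;
          }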

       llvm-mca's scheduler implements three instruction queues:

       • WaitQueue: a queue of instructions whose operands are not ready.

       • ReadyQueue: a queue of instructions ready to execute.

       • IssuedQueue: a queue of instructions executing.

       Depending on the operand availability, instructions that are
       dispatched to the scheduler are either placed into the WaitQueue or
       into the ReadyQueue.

       Every cycle, the scheduler checks if instructions can be moved from
       the WaitQueue to the ReadyQueue, and if instructions from the
       ReadyQueue can be issued to the underlying pipelines.  The algorithm
       prioritizes older instructions over younger instructions.
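
       One simulated cycle of this logic might be sketched as follows (a
       simplified illustration with hypothetical names; the real scheduler
       operates on queues rather than a flat array):

          #include <stdbool.h>
          #include <stddef.h>

          enum State { WAIT, READY, ISSUED };

          struct Inst { enum State state; bool operands_ready; };

          /* The window is kept in program order, so scanning from index 0
             visits older instructions first and issue prioritizes them. */
          void cycle_update(struct Inst *win, size_t n, unsigned free_pipes) {
            for (size_t i = 0; i < n; ++i) {
              if (win[i].state == WAIT && win[i].operands_ready)
                win[i].state = READY;      /* WaitQueue -> ReadyQueue */
              if (win[i].state == READY && free_pipes > 0) {
                win[i].state = ISSUED;     /* ReadyQueue -> pipelines */
                --free_pipes;
              }
            }
          }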

   Write-Back and Retire Stage
       Issued instructions are moved from the ReadyQueue to the IssuedQueue.
       There, instructions wait until they reach the write-back stage.  At
       that point, they get removed from the queue and the retire control
       unit is notified.

       When instructions are executed, the retire control unit flags the
       instruction as "ready to retire."

       Instructions are retired in program order.  The register file is
       notified of the retirement so that it can free the physical registers
       that were allocated for the instruction during the register renaming
       stage.

   Load/Store Unit and Memory Consistency Model
       To simulate an out-of-order execution of memory operations, llvm-mca
       utilizes a simulated load/store unit (LSUnit) to model the
       speculative execution of loads and stores.

       Each load (or store) consumes an entry in the load (or store) queue.
       Users can specify flags -lqueue and -squeue to limit the number of
       entries in the load and store queues respectively.  The queues are
       unbounded by default.
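
       For example, both queues can be bounded from the command line (the
       sizes here are only illustrative):

          $ llvm-mca -mcpu=btver2 -lqueue=16 -squeue=8 dot-product.s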

       The LSUnit implements a relaxed consistency model for memory loads
       and stores.  The rules are:

       1. A younger load is allowed to pass an older load only if there are
          no intervening stores or barriers between the two loads.

       2. A younger load is allowed to pass an older store provided that the
          load does not alias with the store.

       3. A younger store is not allowed to pass an older store.

       4. A younger store is not allowed to pass an older load.

       By default, the LSUnit optimistically assumes that loads do not alias
       with store operations (-noalias=true).  Under this assumption,
       younger loads are always allowed to pass older stores.  Essentially,
       the LSUnit does not attempt to run any alias analysis to predict when
       loads and stores do not alias with each other.

       Note that, in the case of write-combining memory, rule 3 could be
       relaxed to allow reordering of non-aliasing store operations.  That
       being said, at the moment, there is no way to further relax the
       memory model (-noalias is the only option).  Essentially, there is no
       option to specify a different memory type (e.g., write-back,
       write-combining, write-through, etc.) and consequently to weaken, or
       strengthen, the memory model.

       Other limitations are:

       • The LSUnit does not know when store-to-load forwarding may occur.

       • The LSUnit does not know anything about cache hierarchy and memory
         types.

       • The LSUnit does not know how to identify serializing operations and
         memory fences.

       The LSUnit does not attempt to predict if a load or store hits or
       misses the L1 cache.  It only knows if an instruction "MayLoad"
       and/or "MayStore."  For loads, the scheduling model provides an
       "optimistic" load-to-use latency (which usually matches the
       load-to-use latency for when there is a hit in the L1D).

       llvm-mca does not know about serializing operations or memory-barrier
       like instructions.  The LSUnit conservatively assumes that an
       instruction which has both "MayLoad" and unmodeled side effects
       behaves like a "soft" load-barrier.  That means, it serializes loads
       without forcing a flush of the load queue.  Similarly, instructions
       that "MayStore" and have unmodeled side effects are treated like
       store barriers.  A full memory barrier is a "MayLoad" and "MayStore"
       instruction with unmodeled side effects.  This is inaccurate, but it
       is the best that we can do at the moment with the current information
       available in LLVM.

       A load/store barrier consumes one entry of the load/store queue.  A
       load/store barrier enforces ordering of loads/stores.  A younger load
       cannot pass a load barrier.  Also, a younger store cannot pass a
       store barrier.  A younger load has to wait for the memory/load
       barrier to execute.  A load/store barrier is "executed" when it
       becomes the oldest entry in the load/store queue(s).  That also
       means, by construction, all of the older loads/stores have been
       executed.

       In conclusion, the full set of load/store consistency rules is as
       follows (see the sketch after this list):

       1. A store may not pass a previous store.

       2. A store may not pass a previous load (regardless of -noalias).

       3. A store has to wait until an older store barrier is fully
          executed.

       4. A load may pass a previous load.

       5. A load may not pass a previous store unless -noalias is set.

       6. A load has to wait until an older load barrier is fully executed.
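
       A minimal C sketch of these rules might look like the following (the
       helper names and flags are hypothetical; each flag describes whether
       an older entry of that kind is still pending in the queues):

          #include <stdbool.h>

          bool load_may_issue(bool older_store_pending,
                              bool older_load_barrier_pending,
                              bool noalias) {
            if (older_load_barrier_pending) return false;      /* rule 6 */
            if (older_store_pending && !noalias) return false; /* rule 5 */
            return true; /* rule 4: loads may pass previous loads */
          }

          bool store_may_issue(bool older_store_pending,
                               bool older_load_pending,
                               bool older_store_barrier_pending) {
            if (older_store_pending) return false;         /* rule 1 */
            if (older_load_pending) return false;          /* rule 2 */
            if (older_store_barrier_pending) return false; /* rule 3 */
            return true;
          }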

AUTHOR

       Maintained by The LLVM Team (http://llvm.org/).

COPYRIGHT

       2003-2023, LLVM Project



7                                 2023-07-20                       LLVM-MCA(1)