LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)

NAME
       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS
       llvm-mca [options] [input]

DESCRIPTION
       llvm-mca is a performance analysis tool that uses information
       available in LLVM (e.g., scheduling models) to statically measure the
       performance of machine code on a specific CPU.

       Performance is measured in terms of throughput as well as processor
       resource consumption. The tool currently works for processors with an
       out-of-order backend, for which there is a scheduling model available
       in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help with diagnosing
       potential performance issues.

       Given an assembly code sequence, llvm-mca estimates the Instructions
       Per Cycle (IPC), as well as hardware resource pressure. The analysis
       and reporting style were inspired by the IACA tool from Intel.

       For example, you can compile code with clang, output assembly, and
       pipe it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

OPTIONS
       If input is "-" or omitted, llvm-mca reads from standard input.
       Otherwise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca will send its output to
       standard output if the input is from standard input. If the -o option
       specifies "-", then the output will also be sent to standard output.

       -help  Print a summary of command line options.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code. It
              defaults to the host default target.

       -mcpu=<cpuname>
              Specify the processor for which to analyze the code. By
              default, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated
              by the tool. On x86, possible values are [0, 1]. A value of 0
              selects the AT&T assembly format, while a value of 1 selects
              the Intel assembly format for the code printed out by the tool
              in the analysis report.

       -dispatch=<width>
              Specify a different dispatch width for the processor. The
              dispatch width defaults to field 'IssueWidth' in the processor
              scheduling model. If width is zero, then the default dispatch
              width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this
              flag limits how many physical registers are available for
              register renaming purposes. A value of zero for this flag
              means "unlimited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run. If this flag is set
              to 0, then the tool sets the number of iterations to a default
              value (i.e., 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias.
              This is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the load queue. A value of zero
              for this flag is ignored, and the default load queue size is
              used instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the store queue. A value of
              zero for this flag is ignored, and the default store queue
              size is used instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view.
              By default, the timeline view prints information for up to 10
              iterations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view. By default,
              the number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view. This is enabled by default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics. This view collects and
              analyzes instruction dispatch events, as well as
              static/dynamic dispatch stall events. This view is disabled by
              default.

       -scheduler-stats
              Enable extra scheduler statistics. This view collects and
              analyzes instruction issue events. This view is disabled by
              default.

       -retire-stats
              Enable extra retire control unit statistics. This view is
              disabled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -all-stats
              Print all hardware statistics. This enables extra statistics
              related to the dispatch logic, the hardware schedulers, the
              register file(s), and the retire control unit. This option is
              disabled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Print resource pressure information based on the static
              information available from the processor model. This differs
              from the resource pressure view because it doesn't require
              that the code is simulated. It instead prints the theoretical
              uniform distribution of resource pressure for every
              instruction in sequence.

EXIT STATUS
       llvm-mca returns 0 on success. Otherwise, an error message is printed
       to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
       llvm-mca allows for the optional usage of special code comments to
       mark regions of the assembly code to be analyzed. A comment starting
       with substring LLVM-MCA-BEGIN marks the beginning of a code region.
       A comment starting with substring LLVM-MCA-END marks the end of a
       code region. For example:

          # LLVM-MCA-BEGIN My Code Region
          ...
          # LLVM-MCA-END

       Multiple regions can be specified provided that they do not overlap.
       A code region can have an optional description. If no user-defined
       region is specified, then llvm-mca assumes a default region which
       contains every instruction in the input file. Every region is
       analyzed in isolation, and the final performance report is the union
       of all the reports generated for every code region.

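       The marker semantics described above can be sketched in a few lines
       of Python. This is an illustrative parser only, not llvm-mca's own
       implementation; the function name extract_regions and the
       "<default>" placeholder are hypothetical, and overlapping or
       unterminated regions (which llvm-mca diagnoses) are not handled.

```python
import re

def extract_regions(asm_text):
    """Collect the code regions delimited by LLVM-MCA-BEGIN/END comments.

    Illustrative sketch only: llvm-mca has its own region parser, and
    this version does not diagnose overlapping or unterminated regions.
    """
    regions = {}
    name = None
    body = []
    for line in asm_text.splitlines():
        begin = re.search(r"LLVM-MCA-BEGIN\s*(.*)", line)
        if begin:
            # The optional text after the marker is the region description.
            name = begin.group(1).strip() or "<default>"
            body = []
        elif "LLVM-MCA-END" in line:
            regions[name] = body
            name = None
        elif name is not None:
            body.append(line.strip())
    return regions

asm = """\
# LLVM-MCA-BEGIN My Code Region
vmulps %xmm0, %xmm1, %xmm2
# LLVM-MCA-END
"""
regions = extract_regions(asm)
```

       Feeding the three-line example above through this sketch yields a
       single region named "My Code Region" containing the vmulps line.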
       Inline assembly directives may be used from source code to annotate
       the assembly text:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo");
            a += 42;
            __asm volatile("# LLVM-MCA-END");
            a *= b;
            return a;
          }

HOW LLVM-MCA WORKS
       llvm-mca takes assembly code as input. The assembly code is parsed
       into a sequence of MCInst with the help of the existing LLVM target
       assembly parsers. The parsed sequence of MCInst is then analyzed by a
       Pipeline module to generate a performance report.

       The Pipeline module simulates the execution of the machine code
       sequence in a loop of iterations (default is 100). During this
       process, the pipeline collects a number of execution related
       statistics. At the end of this process, the pipeline generates and
       prints a report from the collected statistics.

       Here is an example of a performance report generated by the tool for
       a dot-product of two packed float vectors of four elements. The
       analysis is conducted for target x86, cpu btver2. The report was
       generated using the following command, with the example located at
       test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Total uOps:        900

          Dispatch Width:    2
          uOps Per Cycle:    1.48
          IPC:               1.48
          Block RThroughput: 2.0


          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps   %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps  %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps  %xmm3, %xmm3, %xmm4


          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL


          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps   %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm3, %xmm3, %xmm4

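       The relationship between the two pressure tables can be checked by
       hand: each column of the "Resource pressure per iteration" row is
       the column-wise sum of the per-instruction rows. A short sketch
       using only the nonzero columns from the example above (the row
       labels are illustrative, not llvm-mca output):

```python
# Nonzero resource-pressure columns from the example report, one row per
# instruction in the dot-product kernel.
per_instruction = [
    {"JFPM": 1.00, "JFPU1": 1.00},   # vmulps
    {"JFPA": 1.00, "JFPU0": 1.00},   # vhaddps (first)
    {"JFPA": 1.00, "JFPU0": 1.00},   # vhaddps (second)
]

def per_iteration(rows):
    """Column-wise sum: average resource cycles consumed per iteration."""
    totals = {}
    for row in rows:
        for resource, cycles in row.items():
            totals[resource] = totals.get(resource, 0.0) + cycles
    return totals

pressure = per_iteration(per_instruction)
```

       The sums reproduce the per-iteration row of the report: 2.00 cycles
       on JFPA and JFPU0, and 1.00 cycle on JFPM and JFPU1.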
       According to this report, the dot-product kernel has been executed
       300 times, for a total of 900 simulated instructions. The total
       number of simulated micro opcodes (uOps) is also 900.

       The report is structured in three main sections. The first section
       collects a few performance numbers; the goal of this section is to
       give a very quick overview of the performance throughput. Important
       performance indicators are IPC, uOps Per Cycle, and Block
       RThroughput (Block Reciprocal Throughput).

       IPC is computed by dividing the total number of simulated
       instructions by the total number of cycles. In the absence of
       loop-carried data dependencies, the observed IPC tends to a
       theoretical maximum which can be computed by dividing the number of
       instructions of a single iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is computed by dividing the total number of
       simulated micro opcodes by the total number of cycles. A delta
       between Dispatch Width and this field is an indicator of a
       performance issue. In the absence of loop-carried data dependencies,
       the observed 'uOps Per Cycle' should tend to a theoretical maximum
       throughput which can be computed by dividing the number of uOps of a
       single iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is bounded from above by the dispatch width.
       That is because the dispatch width limits the maximum size of a
       dispatch group. Both IPC and 'uOps Per Cycle' are limited by the
       amount of hardware parallelism. The availability of hardware
       resources affects the resource pressure distribution, and it limits
       the number of instructions that can be executed in parallel every
       cycle. A delta between Dispatch Width and the theoretical maximum
       uOps Per Cycle (computed by dividing the number of uOps of a single
       iteration by the Block RThroughput) is an indicator of a performance
       bottleneck caused by the lack of hardware resources. In general, the
       lower the Block RThroughput, the better.

       In this example, uOps per iteration/Block RThroughput is 1.50. Since
       there are no loop-carried dependencies, the observed uOps Per Cycle
       is expected to approach 1.50 when the number of iterations tends to
       infinity. The delta between the Dispatch Width (2.00), and the
       theoretical maximum uOp throughput (1.50) is an indicator of a
       performance bottleneck caused by the lack of hardware resources, and
       the Resource pressure view can help to identify the problematic
       resource usage.

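       The arithmetic behind these indicators can be reproduced directly
       from the numbers in the example report. The function below is just
       an illustrative sketch (its name and signature are not part of
       llvm-mca):

```python
def throughput_summary(instructions, uops, cycles, iterations,
                       block_rthroughput):
    """Recompute the headline indicators from the example report."""
    return {
        # IPC: simulated instructions divided by total cycles.
        "ipc": instructions / cycles,
        # uOps Per Cycle: simulated micro opcodes divided by total cycles.
        "uops_per_cycle": uops / cycles,
        # Theoretical maximum approached as iterations tend to infinity:
        # uOps of a single iteration divided by the Block RThroughput.
        "max_uops_per_cycle": (uops / iterations) / block_rthroughput,
    }

s = throughput_summary(instructions=900, uops=900, cycles=610,
                       iterations=300, block_rthroughput=2.0)
```

       With the example's numbers, both IPC and uOps Per Cycle come out as
       900/610, which rounds to the reported 1.48, and the theoretical
       maximum is 3/2.0 = 1.50.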
       The second section of the report shows the latency and reciprocal
       throughput of every instruction in the sequence. That section also
       reports extra information related to the number of micro opcodes,
       and opcode properties (i.e., 'MayLoad', 'MayStore', and
       'HasSideEffects').

       The third section is the Resource pressure view. This view reports
       the average number of resource cycles consumed every iteration by
       instructions for every processor resource unit available on the
       target. Information is structured in two tables. The first table
       reports the number of resource cycles spent on average every
       iteration. The second table correlates the resource cycles to the
       machine instructions in the sequence. For example, every iteration
       of the instruction vmulps always executes on resource unit [6]
       (JFPU1 - floating point pipeline #1), consuming an average of 1
       resource cycle per iteration. Note that on AMD Jaguar, vector
       floating-point multiply can only be issued to pipeline JFPU1, while
       horizontal floating-point additions can only be issued to pipeline
       JFPU0.

       The resource pressure view helps with identifying bottlenecks caused
       by high usage of specific hardware resources. Situations with
       resource pressure mainly concentrated on a few resources should, in
       general, be avoided. Ideally, pressure should be uniformly
       distributed between multiple resources.

   Timeline View
       The timeline view produces a detailed report of each instruction's
       state transitions through an instruction pipeline. This view is
       enabled by the command line option -timeline. As instructions
       transition through the various stages of the pipeline, their states
       are depicted in the view report. These states are represented by the
       following characters:

       • D : Instruction dispatched.

       • e : Instruction executing.

       • E : Instruction executed.

       • R : Instruction retired.

       • = : Instruction already dispatched, waiting to be executed.

       • - : Instruction executed, waiting to be retired.

       Below is the timeline view for a subset of the dot-product example
       located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and
       processed by llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4

       The timeline view is interesting because it shows instruction state
       changes during execution. It also gives an idea of how the tool
       processes instructions executed on the target, and how their timing
       information might be calculated.

       The timeline view is structured in two tables. The first table shows
       instructions changing state over time (measured in cycles); the
       second table (named Average Wait times) reports useful timing
       statistics, which should help diagnose performance bottlenecks
       caused by long data dependencies and sub-optimal usage of hardware
       resources.

       An instruction in the timeline view is identified by a pair of
       indices, where the first index identifies an iteration, and the
       second index is the instruction index (i.e., where it appears in the
       code sequence). Since this example was generated using 3 iterations
       (-iterations=3), the iteration indices range from 0 to 2 inclusive.

       Excluding the first and last column, the remaining columns are in
       cycles. Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       • Instruction [1,0] was dispatched at cycle 1.

       • Instruction [1,0] started executing at cycle 2.

       • Instruction [1,0] reached the write back stage at cycle 4.

       • Instruction [1,0] was retired at cycle 10.

       Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
       wait in the scheduler's queue for its operands to become available.
       By the time vmulps is dispatched, operands are already available,
       and pipeline JFPU1 is ready to serve another instruction. So the
       instruction can be immediately issued on the JFPU1 pipeline. That is
       demonstrated by the fact that the instruction only spent 1cy in the
       scheduler's queue.

       There is a gap of 5 cycles between the write-back stage and the
       retire event. That is because instructions must retire in program
       order, so [1,0] has to wait for [0,2] to be retired first (i.e., it
       has to wait until cycle 10).

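       The four bullets above can be read off the timeline string for
       [1,0] mechanically: the state characters mark the dispatch, first
       execution, write-back, and retire cycles. A small decoding sketch
       (the function is illustrative, not part of llvm-mca):

```python
def decode_timeline_row(row):
    """Map one timeline row to its key cycle numbers.

    'D' = dispatch, first 'e' = start of execution, 'E' = write-back,
    'R' = retire; cycles are numbered from 0 at the first column.
    """
    return {
        "dispatched": row.index("D"),
        "started_executing": row.index("e"),
        "write_back": row.index("E"),
        "retired": row.index("R"),
    }

# Instruction [1,0] from the timeline view above.
cycles = decode_timeline_row(".DeeE-----R")
```

       For ".DeeE-----R" this recovers exactly the cycles listed in the
       bullets: dispatched at 1, executing at 2, write-back at 4, retired
       at 10; the five '-' characters are the five cycles spent waiting to
       retire.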
       In the example, all instructions are in a RAW (Read After Write)
       dependency chain. Register %xmm2 written by vmulps is immediately
       used by the first vhaddps, and register %xmm3 written by the first
       vhaddps is used by the second vhaddps. Long data dependencies
       negatively impact the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies introduced
       by instructions from different iterations. However, those
       dependencies can be removed at the register renaming stage (at the
       cost of allocating register aliases, and therefore consuming
       physical registers).

       Table Average Wait times helps diagnose performance issues that are
       caused by the presence of long latency instructions and potentially
       long data dependencies which may limit the ILP. Note that llvm-mca,
       by default, assumes at least 1cy between the dispatch event and the
       issue event.

       When the performance is limited by data dependencies and/or long
       latency instructions, the number of cycles spent while in the ready
       state is expected to be very small when compared with the total
       number of cycles spent in the scheduler's queue. The difference
       between the two counters is a good indicator of how large of an
       impact data dependencies had on the execution of the instructions.
       When performance is mostly limited by the lack of hardware
       resources, the delta between the two counters is small. However, the
       number of cycles spent in the queue tends to be larger (i.e., more
       than 1-3cy), especially when compared to other low latency
       instructions.

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and
       performance counters for the dispatch logic, the reorder buffer, the
       retire control unit, and the register file.

       Below is an example of -all-stats output generated by llvm-mca for
       300 iterations of the dot-product example discussed in the previous
       sections.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272  (44.6%)
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0


          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)


          Schedulers - number of cycles where we saw N instructions issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)

          Scheduler's queue usage:
          [1] Resource name.
          [2] Average number of used buffer entries.
          [3] Maximum number of used buffer entries.
          [4] Total number of buffer entries.

           [1]            [2]        [3]        [4]
          JALU01           0          0          20
          JFPU01           17         18         18
          JLSAGU           0          0          12


          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
           2,           399  (65.4%)

          Total ROB Entries:                64
          Max Used ROB Entries:             35  ( 54.7% )
          Average Used ROB Entries per cy:  32  ( 50.0% )


          Register File statistics:
          Total number of mappings created:    900
          Max number of mappings used:         35

          *  Register File #1 -- JFpuPRF:
             Number of physical registers:     72
             Total number of mappings created: 900
             Max number of mappings used:      35

          *  Register File #2 -- JIntegerPRF:
             Number of physical registers:     64
             Total number of mappings created: 0
             Max number of mappings used:      0

       If we look at the Dynamic Dispatch Stall Cycles table, we see that
       the counter for SCHEDQ reports 272 cycles. This counter is
       incremented every time the dispatch logic is unable to dispatch a
       full group because the scheduler's queue is full.

       Looking at the Dispatch Logic table, we see that the pipeline was
       only able to dispatch two micro opcodes 51.5% of the time. The
       dispatch group was limited to one micro opcode 44.6% of the cycles,
       which corresponds to 272 cycles. The dispatch statistics are
       displayed by using either the command option -all-stats or
       -dispatch-stats.

       The next table, Schedulers, presents a histogram displaying a count,
       representing the number of instructions issued on some number of
       cycles. In this case, of the 610 simulated cycles, single
       instructions were issued 306 times (50.2%) and there were 7 cycles
       where no instructions were issued.

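       The percentages in these histograms are each cycle count divided by
       the 610 total simulated cycles. A quick cross-check of the figures
       quoted above (this is hand arithmetic, not tool output):

```python
TOTAL_CYCLES = 610  # from the example report

def pct(cycle_count):
    """Share of the total simulated cycles, rounded as in the report."""
    return round(100.0 * cycle_count / TOTAL_CYCLES, 1)
```

       For example, the 272 SCHEDQ stall cycles are 272/610, or 44.6% of
       the run, matching the figure printed next to the counter.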
       The Scheduler's queue usage table shows the average and maximum
       number of buffer entries (i.e., scheduler queue entries) used at
       runtime. Resource JFPU01 reached its maximum (18 of 18 queue
       entries). Note that AMD Jaguar implements three schedulers:

       • JALU01 - A scheduler for ALU instructions.

       • JFPU01 - A scheduler for floating point operations.

       • JLSAGU - A scheduler for address generation.

       The dot-product is a kernel of three floating point instructions (a
       vector multiply followed by two horizontal adds). That explains why
       only the floating point scheduler appears to be used.

       A full scheduler queue is either caused by data dependency chains or
       by a sub-optimal usage of hardware resources. Sometimes, resource
       pressure can be mitigated by rewriting the kernel using different
       instructions that consume different scheduler resources. Schedulers
       with a small queue are less resilient to bottlenecks caused by the
       presence of long data dependencies. The scheduler statistics are
       displayed by using either the command option -all-stats or
       -scheduler-stats.

       The next table, Retire Control Unit, presents a histogram displaying
       a count, representing the number of instructions retired on some
       number of cycles. In this case, of the 610 simulated cycles, two
       instructions were retired during the same cycle 399 times (65.4%)
       and there were 109 cycles where no instructions were retired. The
       retire statistics are displayed by using either the command option
       -all-stats or -retire-stats.

       The last table presented is Register File statistics. Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for
       floating-point registers (JFpuPRF) and one for integer registers
       (JIntegerPRF). The table shows that of the 900 instructions
       processed, there were 900 mappings created. Since this dot-product
       example utilized only floating point registers, the JFpuPRF was
       responsible for creating the 900 mappings. However, we see that the
       pipeline only used a maximum of 35 of 72 available register slots at
       any given time. We can conclude that the floating point PRF was the
       only register file used for the example, and that it was never
       resource constrained. The register file statistics are displayed by
       using either the command option -all-stats or -register-file-stats.

       In this example, we can conclude that the IPC is mostly limited by
       data dependencies, and not by resource pressure.

   Instruction Flow
       This section describes the instruction flow through the default
       pipeline of llvm-mca, as well as the functional units involved in
       the process.

       The default pipeline implements the following sequence of stages
       used to process instructions.

       • Dispatch (Instruction is dispatched to the schedulers).

       • Issue (Instruction is issued to the processor pipelines).

       • Write Back (Instruction is executed, and results are written
         back).

       • Retire (Instruction is retired; writes are architecturally
         committed).

       The default pipeline only models the out-of-order portion of a
       processor. Therefore, the instruction fetch and decode stages are
       not modeled. Performance bottlenecks in the frontend are not
       diagnosed. llvm-mca assumes that instructions have all been decoded
       and placed into a queue before the simulation starts. Also, llvm-mca
       does not model branch prediction.

   Instruction Dispatch
       During the dispatch stage, instructions are picked in program order
       from a queue of already decoded instructions, and dispatched in
       groups to the simulated hardware schedulers.

       The size of a dispatch group depends on the availability of the
       simulated hardware resources. The processor dispatch width defaults
       to the value of the IssueWidth field in LLVM's scheduling model.

       An instruction can be dispatched if:

       • The size of the dispatch group is smaller than the processor's
         dispatch width.

       • There are enough entries in the reorder buffer.

       • There are enough physical registers to do register renaming.

       • The schedulers are not full.

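       These four conditions combine into a single predicate. The sketch
       below is a simplification of that check; the function name and
       parameters are illustrative, not llvm-mca's internal API:

```python
def can_dispatch(group_size, dispatch_width, free_rob_entries, uops,
                 free_phys_regs, regs_needed, scheduler_full):
    """True if one more instruction fits into the current dispatch group."""
    return (group_size < dispatch_width          # group still has room
            and free_rob_entries >= uops         # reorder buffer has space
            and free_phys_regs >= regs_needed    # renaming can proceed
            and not scheduler_full)              # scheduler can accept it

# With a dispatch width of 2 (as on btver2), a second instruction fits,
# but a third in the same cycle does not.
ok = can_dispatch(group_size=1, dispatch_width=2, free_rob_entries=64,
                  uops=1, free_phys_regs=72, regs_needed=1,
                  scheduler_full=False)
```

       Any failed condition stalls dispatch for that cycle; the Dynamic
       Dispatch Stall Cycles counters seen earlier attribute each stall to
       one of these causes.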
       Scheduling models can optionally specify which register files are
       available on the processor. llvm-mca uses that information to
       initialize register file descriptors. Users can limit the number of
       physical registers that are globally available for register renaming
       by using the command option -register-file-size. A value of zero for
       this option means unbounded. By knowing how many registers are
       available for renaming, the tool can predict dispatch stalls caused
       by the lack of physical registers.

       The number of reorder buffer entries consumed by an instruction
       depends on the number of micro-opcodes specified for that
       instruction by the target scheduling model. The reorder buffer is
       responsible for tracking the progress of instructions that are
       "in-flight", and retiring them in program order. The number of
       entries in the reorder buffer defaults to the value specified by
       field MicroOpBufferSize in the target scheduling model.

       Instructions that are dispatched to the schedulers consume scheduler
       buffer entries. llvm-mca queries the scheduling model to determine
       the set of buffered resources consumed by an instruction. Buffered
       resources are treated like scheduler resources.

   Instruction Issue
       Each processor scheduler implements a buffer of instructions. An
       instruction has to wait in the scheduler's buffer until input
       register operands become available. Only at that point does the
       instruction become eligible for execution and may be issued
       (potentially out-of-order) for execution. Instruction latencies are
       computed by llvm-mca with the help of the scheduling model.

       llvm-mca's scheduler is designed to simulate multiple processor
       schedulers. The scheduler is responsible for tracking data
       dependencies, and dynamically selecting which processor resources
       are consumed by instructions. It delegates the management of
       processor resource units and resource groups to a resource manager.
       The resource manager is responsible for selecting resource units
       that are consumed by instructions. For example, if an instruction
       consumes 1cy of a resource group, the resource manager selects one
       of the available units from the group; by default, the resource
       manager uses a round-robin selector to guarantee that resource usage
       is uniformly distributed between all units of a group.

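       The default unit-selection policy can be sketched as a simple
       round-robin over the units of a group. This is illustrative only;
       llvm-mca's resource manager additionally has to account for unit
       availability, which this sketch ignores:

```python
from itertools import cycle

class RoundRobinSelector:
    """Hand out the units of a resource group in rotating order, so that
    usage is distributed uniformly across the group (sketch of the
    default policy described above)."""
    def __init__(self, units):
        self._order = cycle(units)

    def select(self):
        return next(self._order)

# A group with two units, like JFPU0/JFPU1 on AMD Jaguar.
fp_group = RoundRobinSelector(["JFPU0", "JFPU1"])
picks = [fp_group.select() for _ in range(4)]
```

       Four consecutive selections alternate between the two units, which
       is exactly the uniform distribution the policy aims for.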
       llvm-mca's scheduler internally groups instructions into three sets:

       • WaitSet: a set of instructions whose operands are not ready.

       • ReadySet: a set of instructions ready to execute.

       • IssuedSet: a set of instructions executing.

       Depending on operand availability, instructions that are dispatched
       to the scheduler are either placed into the WaitSet or into the
       ReadySet.

       Every cycle, the scheduler checks if instructions can be moved from
       the WaitSet to the ReadySet, and if instructions from the ReadySet
       can be issued to the underlying pipelines. The algorithm prioritizes
       older instructions over younger instructions.

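       One scheduler cycle can therefore be sketched as two moves between
       these sets. This is a simplification with illustrative names (real
       issue logic also tracks pipeline and resource availability):

```python
def scheduler_cycle(wait_set, ready_set, issued_set, operands_ready,
                    free_issue_slots):
    """Promote instructions whose operands became ready, then issue the
    oldest entries from the ReadySet (lower index = older instruction)."""
    # WaitSet -> ReadySet: operands became available this cycle.
    for instr in list(wait_set):
        if operands_ready(instr):
            wait_set.remove(instr)
            ready_set.append(instr)
    # ReadySet -> IssuedSet: issue oldest-first while slots remain.
    ready_set.sort()  # prioritize older instructions
    while ready_set and free_issue_slots > 0:
        issued_set.append(ready_set.pop(0))
        free_issue_slots -= 1

# Instructions are identified by their program-order index here.
wait, ready, issued = [2], [1, 0], []
scheduler_cycle(wait, ready, issued, operands_ready=lambda i: i == 2,
                free_issue_slots=2)
```

       In this run, instruction 2 becomes ready, and the two issue slots go
       to the two oldest ready instructions (0 and 1), leaving 2 waiting in
       the ReadySet for the next cycle.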
   Write-Back and Retire Stage
       Issued instructions are moved from the ReadySet to the IssuedSet.
       There, instructions wait until they reach the write-back stage. At
       that point, they get removed from the queue and the retire control
       unit is notified.

       When instructions are executed, the retire control unit flags the
       instruction as "ready to retire."

       Instructions are retired in program order. The register file is
       notified of the retirement so that it can free the physical
       registers that were allocated for the instruction during the
       register renaming stage.

   Load/Store Unit and Memory Consistency Model
       To simulate an out-of-order execution of memory operations, llvm-mca
       utilizes a simulated load/store unit (LSUnit) to model the
       speculative execution of loads and stores.

       Each load (or store) consumes an entry in the load (or store) queue.
       Users can specify flags -lqueue and -squeue to limit the number of
       entries in the load and store queues respectively. The queues are
       unbounded by default.

       The LSUnit implements a relaxed consistency model for memory loads
       and stores. The rules are:

       1. A younger load is allowed to pass an older load only if there are
          no intervening stores or barriers between the two loads.

       2. A younger load is allowed to pass an older store provided that
          the load does not alias with the store.

       3. A younger store is not allowed to pass an older store.

       4. A younger store is not allowed to pass an older load.

       By default, the LSUnit optimistically assumes that loads do not
       alias with store operations (-noalias=true). Under this assumption,
       younger loads are always allowed to pass older stores. Essentially,
       the LSUnit does not attempt to run any alias analysis to predict
       when loads and stores do not alias with each other.

       Note that, in the case of write-combining memory, rule 3 could be
       relaxed to allow reordering of non-aliasing store operations. That
       being said, at the moment, there is no way to further relax the
       memory model (-noalias is the only option). Essentially, there is no
       option to specify a different memory type (e.g., write-back,
       write-combining, write-through, etc.) and consequently to weaken, or
       strengthen, the memory model.

       Other limitations are:

       • The LSUnit does not know when store-to-load forwarding may occur.

       • The LSUnit does not know anything about cache hierarchy and memory
         types.

       • The LSUnit does not know how to identify serializing operations
         and memory fences.

       The LSUnit does not attempt to predict if a load or store hits or
       misses the L1 cache. It only knows if an instruction "MayLoad"
       and/or "MayStore." For loads, the scheduling model provides an
       "optimistic" load-to-use latency (which usually matches the
       load-to-use latency for when there is a hit in the L1D).

       llvm-mca does not know about serializing operations or
       memory-barrier-like instructions. The LSUnit conservatively assumes
       that an instruction which has both "MayLoad" and unmodeled side
       effects behaves like a "soft" load-barrier. That means, it
       serializes loads without forcing a flush of the load queue.
       Similarly, instructions that "MayStore" and have unmodeled side
       effects are treated like store barriers. A full memory barrier is a
       "MayLoad" and "MayStore" instruction with unmodeled side effects.
       This is inaccurate, but it is the best that we can do at the moment
       with the current information available in LLVM.

       A load/store barrier consumes one entry of the load/store queue. A
       load/store barrier enforces ordering of loads/stores. A younger load
       cannot pass a load barrier. Also, a younger store cannot pass a
       store barrier. A younger load has to wait for the memory/load
       barrier to execute. A load/store barrier is "executed" when it
       becomes the oldest entry in the load/store queue(s). That also
       means, by construction, all of the older loads/stores have been
       executed.

       In conclusion, the full set of load/store consistency rules is:

       1. A store may not pass a previous store.

       2. A store may not pass a previous load (regardless of -noalias).

       3. A store has to wait until an older store barrier is fully
          executed.

       4. A load may pass a previous load.

       5. A load may not pass a previous store unless -noalias is set.

       6. A load has to wait until an older load barrier is fully executed.

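       Ignoring the barrier rules (3 and 6), the remaining pairwise
       ordering rules reduce to a small predicate. The sketch below is
       illustrative, not LSUnit code, and it also ignores rule 1's
       "no intervening stores" condition by looking at only one pair of
       operations at a time:

```python
def may_pass(younger, older, noalias=True):
    """May a younger memory operation pass an older one?

    Encodes rules 1, 2, 4, and 5 above. 'younger' and 'older' are
    either "load" or "store"; barriers are omitted for brevity.
    """
    if younger == "store":
        return False        # rules 1 and 2: stores never pass anything
    if older == "load":
        return True         # rule 4: a load may pass a previous load
    return noalias          # rule 5: load passes store only with -noalias
```

       For example, a load may pass an older store under the default
       -noalias=true assumption, but not once that assumption is dropped.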
AUTHOR
       Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT
       2003-2022, LLVM Project

8                                 2022-01-20                      LLVM-MCA(1)