LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)

NAME
llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS
llvm-mca [options] [input]
DESCRIPTION
llvm-mca is a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance of machine code on a specific CPU.

Performance is measured in terms of throughput as well as processor resource consumption. The tool currently works for processors with an out-of-order backend, for which there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the code when run on the target, but also to help with diagnosing potential performance issues.

Given an assembly code sequence, llvm-mca estimates the Instructions Per Cycle (IPC), as well as hardware resource pressure. The analysis and reporting style were inspired by the IACA tool from Intel.

llvm-mca allows the usage of special code comments to mark regions of the assembly code to be analyzed. A comment starting with the substring LLVM-MCA-BEGIN marks the beginning of a code region. A comment starting with the substring LLVM-MCA-END marks the end of a code region. For example:

   # LLVM-MCA-BEGIN My Code Region
   ...
   # LLVM-MCA-END

Multiple regions can be specified, provided that they do not overlap. A code region can have an optional description. If no user-defined region is specified, then llvm-mca assumes a default region which contains every instruction in the input file. Every region is analyzed in isolation, and the final performance report is the union of all the reports generated for every code region.

Inline assembly directives may be used from source code to annotate the assembly text:

   int foo(int a, int b) {
     __asm volatile("# LLVM-MCA-BEGIN foo");
     a += 42;
     __asm volatile("# LLVM-MCA-END");
     a *= b;
     return a;
   }

So for example, you can compile code with clang, output assembly, and pipe it directly into llvm-mca for analysis:

   $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

Or for Intel syntax:

   $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

OPTIONS
If input is "-" or omitted, llvm-mca reads from standard input. Otherwise, it will read from the specified filename.

If the -o option is omitted, then llvm-mca will send its output to standard output if the input is from standard input. If the -o option specifies "-", then the output will also be sent to standard output.

-help  Print a summary of command line options.

-mtriple=<target triple>
       Specify a target triple string.

-march=<arch>
       Specify the architecture for which to analyze the code. It defaults to the host default target.

-mcpu=<cpuname>
       Specify the processor for which to analyze the code. By default, the cpu name is autodetected from the host.

-output-asm-variant=<variant id>
       Specify the output assembly variant for the report generated by the tool. On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly format, while a value of 1 selects the Intel assembly format for the code printed out by the tool in the analysis report.

-dispatch=<width>
       Specify a different dispatch width for the processor. The dispatch width defaults to field 'IssueWidth' in the processor scheduling model. If width is zero, then the default dispatch width is used.

-register-file-size=<size>
       Specify the size of the register file. When specified, this flag limits how many physical registers are available for register renaming purposes. A value of zero for this flag means "unlimited number of physical registers".

-iterations=<number of iterations>
       Specify the number of iterations to run. If this flag is set to 0, then the tool sets the number of iterations to a default value (i.e. 100).

-noalias=<bool>
       If set, the tool assumes that loads and stores don't alias. This is the default behavior.

-lqueue=<load queue size>
       Specify the size of the load queue in the load/store unit emulated by the tool. By default, the tool assumes an unbounded number of entries in the load queue. A value of zero for this flag is ignored, and the default load queue size is used instead.

-squeue=<store queue size>
       Specify the size of the store queue in the load/store unit emulated by the tool. By default, the tool assumes an unbounded number of entries in the store queue. A value of zero for this flag is ignored, and the default store queue size is used instead.

-timeline
       Enable the timeline view.

-timeline-max-iterations=<iterations>
       Limit the number of iterations to print in the timeline view. By default, the timeline view prints information for up to 10 iterations.

-timeline-max-cycles=<cycles>
       Limit the number of cycles in the timeline view. By default, the number of cycles is set to 80.

-resource-pressure
       Enable the resource pressure view. This is enabled by default.

-register-file-stats
       Enable register file usage statistics.

-dispatch-stats
       Enable extra dispatch statistics. This view collects and analyzes instruction dispatch events, as well as static/dynamic dispatch stall events. This view is disabled by default.

-scheduler-stats
       Enable extra scheduler statistics. This view collects and analyzes instruction issue events. This view is disabled by default.

-retire-stats
       Enable extra retire control unit statistics. This view is disabled by default.

-instruction-info
       Enable the instruction info view. This is enabled by default.

-all-stats
       Print all hardware statistics. This enables extra statistics related to the dispatch logic, the hardware schedulers, the register file(s), and the retire control unit. This option is disabled by default.

-all-views
       Enable all the views.

-instruction-tables
       Prints resource pressure information based on the static information available from the processor model. This differs from the resource pressure view because it doesn't require the code to be simulated. Instead, it prints the theoretical uniform distribution of resource pressure for every instruction in the sequence.

EXIT STATUS
llvm-mca returns 0 on success. Otherwise, an error message is printed to standard error, and the tool returns 1.

HOW LLVM-MCA WORKS
llvm-mca takes assembly code as input. The assembly code is parsed into a sequence of MCInst with the help of the existing LLVM target assembly parsers. The parsed sequence of MCInst is then analyzed by a Pipeline module to generate a performance report.

The Pipeline module simulates the execution of the machine code sequence in a loop of iterations (default is 100). During this process, the pipeline collects a number of execution related statistics. At the end of this process, the pipeline generates and prints a report from the collected statistics.

Here is an example of a performance report generated by the tool for a dot-product of two packed float vectors of four elements. The analysis is conducted for target x86, cpu btver2. The following result can be produced via the following command using the example located at test/tools/llvm-mca/X86/BtVer2/dot-product.s:

   $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

   Iterations:        300
   Instructions:      900
   Total Cycles:      610
   Dispatch Width:    2
   IPC:               1.48
   Block RThroughput: 2.0


   Instruction Info:
   [1]: #uOps
   [2]: Latency
   [3]: RThroughput
   [4]: MayLoad
   [5]: MayStore
   [6]: HasSideEffects (U)

   [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
    1      2     1.00                        vmulps   %xmm0, %xmm1, %xmm2
    1      3     1.00                        vhaddps  %xmm2, %xmm2, %xmm3
    1      3     1.00                        vhaddps  %xmm3, %xmm3, %xmm4


   Resources:
   [0]  - JALU0
   [1]  - JALU1
   [2]  - JDiv
   [3]  - JFPA
   [4]  - JFPM
   [5]  - JFPU0
   [6]  - JFPU1
   [7]  - JLAGU
   [8]  - JMul
   [9]  - JSAGU
   [10] - JSTC
   [11] - JVALU0
   [12] - JVALU1
   [13] - JVIMUL


   Resource pressure per iteration:
   [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
    -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

   Resource pressure by instruction:
   [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
    -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps   %xmm0, %xmm1, %xmm2
    -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm2, %xmm2, %xmm3
    -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm3, %xmm3, %xmm4

According to this report, the dot-product kernel has been executed 300 times, for a total of 900 dynamically executed instructions.

The report is structured in three main sections. The first section collects a few performance numbers; the goal of this section is to give a very quick overview of the performance throughput. In this example, the two important performance indicators are IPC and Block RThroughput (Block Reciprocal Throughput).

IPC is computed by dividing the total number of simulated instructions by the total number of cycles. A delta between Dispatch Width and IPC is an indicator of a performance issue. In the absence of loop-carried data dependencies, the observed IPC tends to a theoretical maximum which can be computed by dividing the number of instructions of a single iteration by the Block RThroughput.

IPC is bounded from above by the dispatch width. That is because the dispatch width limits the maximum size of a dispatch group. IPC is also limited by the amount of hardware parallelism. The availability of hardware resources affects the resource pressure distribution, and it limits the number of instructions that can be executed in parallel every cycle. A delta between Dispatch Width and the theoretical maximum IPC is an indicator of a performance bottleneck caused by the lack of hardware resources. In general, the lower the Block RThroughput, the better.

In this example, Instructions per iteration/Block RThroughput is 1.50. Since there are no loop-carried dependencies, the observed IPC is expected to approach 1.50 when the number of iterations tends to infinity. The delta between the Dispatch Width (2.00) and the theoretical maximum IPC (1.50) is an indicator of a performance bottleneck caused by the lack of hardware resources, and the Resource pressure view can help to identify the problematic resource usage.
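
The two indicators can be reproduced with a few lines of arithmetic. The figures below are copied straight from the summary section of the report above:

```python
# Summary numbers copied from the llvm-mca report above.
iterations = 300
instructions = 900           # 3 instructions per iteration, 300 iterations
total_cycles = 610
block_rthroughput = 2.0

# Observed IPC: total simulated instructions divided by total cycles.
ipc = instructions / total_cycles
print(f"IPC: {ipc:.2f}")                      # IPC: 1.48

# Theoretical maximum IPC: instructions per iteration divided by
# the Block RThroughput.
max_ipc = (instructions / iterations) / block_rthroughput
print(f"Theoretical max IPC: {max_ipc:.2f}")  # Theoretical max IPC: 1.50
```

The gap between max_ipc (1.50) and the Dispatch Width (2) is the resource bottleneck discussed above.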

The second section of the report shows the latency and reciprocal throughput of every instruction in the sequence. That section also reports extra information related to the number of micro opcodes and opcode properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

The third section is the Resource pressure view. This view reports the average number of resource cycles consumed every iteration by instructions for every processor resource unit available on the target. Information is structured in two tables. The first table reports the number of resource cycles spent on average every iteration. The second table correlates the resource cycles to the machine instructions in the sequence. For example, every iteration of the instruction vmulps always executes on resource unit [6] (JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle per iteration. Note that on AMD Jaguar, vector floating-point multiply can only be issued to pipeline JFPU1, while horizontal floating-point additions can only be issued to pipeline JFPU0.

The resource pressure view helps with identifying bottlenecks caused by high usage of specific hardware resources. Situations with resource pressure mainly concentrated on a few resources should, in general, be avoided. Ideally, pressure should be uniformly distributed between multiple resources.

Timeline View
The timeline view produces a detailed report of each instruction's state transitions through an instruction pipeline. This view is enabled by the command line option -timeline. As instructions transition through the various stages of the pipeline, their states are depicted in the view report. These states are represented by the following characters:

• D : Instruction dispatched.
• e : Instruction executing.
• E : Instruction executed.
• R : Instruction retired.
• = : Instruction already dispatched, waiting to be executed.
• - : Instruction executed, waiting to be retired.

Below is the timeline view for a subset of the dot-product example located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and processed by llvm-mca using the following command:

   $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

   Timeline view:
                       012345
   Index     0123456789

   [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
   [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
   [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
   [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
   [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
   [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
   [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
   [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
   [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


   Average Wait times (based on the timeline view):
   [0]: Executions
   [1]: Average time spent waiting in a scheduler's queue
   [2]: Average time spent waiting in a scheduler's queue while ready
   [3]: Average time elapsed from WB until retire stage

         [0]    [1]    [2]    [3]
   0.     3     1.0    1.0    3.3   vmulps   %xmm0, %xmm1, %xmm2
   1.     3     3.3    0.7    1.0   vhaddps  %xmm2, %xmm2, %xmm3
   2.     3     5.7    0.0    0.0   vhaddps  %xmm3, %xmm3, %xmm4

The timeline view is interesting because it shows instruction state changes during execution. It also gives an idea of how the tool processes instructions executed on the target, and how their timing information might be calculated.

The timeline view is structured in two tables. The first table shows instructions changing state over time (measured in cycles); the second table (named Average Wait times) reports useful timing statistics, which should help diagnose performance bottlenecks caused by long data dependencies and sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where the first index identifies an iteration, and the second index is the instruction index (i.e., where it appears in the code sequence). Since this example was generated using 3 iterations (-iterations=3), the iteration indices range from 0 to 2, inclusive.

Excluding the first and last column, the remaining columns are in cycles. Cycles are numbered sequentially starting from 0.

From the example output above, we know the following:

• Instruction [1,0] was dispatched at cycle 1.
• Instruction [1,0] started executing at cycle 2.
• Instruction [1,0] reached the write back stage at cycle 4.
• Instruction [1,0] was retired at cycle 10.
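
These four events can be read mechanically off the timeline string. The sketch below (illustrative only, not part of llvm-mca) extracts them for instruction [1,0]:

```python
# Timeline string for instruction [1,0], copied from the report above.
# '.' marks cycles outside the instruction's lifetime.
timeline = ".DeeE-----R"

dispatch  = timeline.index("D")  # cycle of the dispatch event
execute   = timeline.index("e")  # first cycle of execution
writeback = timeline.index("E")  # cycle the write-back stage is reached
retire    = timeline.index("R")  # cycle of retirement

print(dispatch, execute, writeback, retire)  # 1 2 4 10
```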

Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the scheduler's queue for the operands to become available. By the time vmulps is dispatched, operands are already available, and pipeline JFPU1 is ready to serve another instruction. So the instruction can be immediately issued on the JFPU1 pipeline. That is demonstrated by the fact that the instruction only spent 1cy in the scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire event. That is because instructions must retire in program order, so [1,0] has to wait for [0,2] to be retired first (i.e., it has to wait until cycle 10).

In the example, all instructions are in a RAW (Read After Write) dependency chain. Register %xmm2 written by vmulps is immediately used by the first vhaddps, and register %xmm3 written by the first vhaddps is used by the second vhaddps. Long data dependencies negatively impact the ILP (Instruction Level Parallelism).

In the dot-product example, there are anti-dependencies introduced by instructions from different iterations. However, those dependencies can be removed at the register renaming stage (at the cost of allocating register aliases, and therefore consuming physical registers).

The Average Wait times table helps diagnose performance issues that are caused by the presence of long latency instructions and potentially long data dependencies which may limit the ILP. Note that llvm-mca, by default, assumes at least 1cy between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency instructions, the number of cycles spent while in the ready state is expected to be very small when compared with the total number of cycles spent in the scheduler's queue. The difference between the two counters is a good indicator of how large of an impact data dependencies had on the execution of the instructions. When performance is mostly limited by the lack of hardware resources, the delta between the two counters is small. However, the number of cycles spent in the queue tends to be larger (i.e., more than 1-3cy), especially when compared to other low latency instructions.

Extra Statistics to Further Diagnose Performance Issues
The -all-stats command line option enables extra statistics and performance counters for the dispatch logic, the reorder buffer, the retire control unit, and the register file.

Below is an example of -all-stats output generated by MCA for the dot-product example discussed in the previous sections.

   Dynamic Dispatch Stall Cycles:
   RAT     - Register unavailable:                      0
   RCU     - Retire tokens unavailable:                 0
   SCHEDQ  - Scheduler full:                            272
   LQ      - Load queue full:                           0
   SQ      - Store queue full:                          0
   GROUP   - Static restrictions on the dispatch group: 0


   Dispatch Logic - number of cycles where we saw N instructions dispatched:
   [# dispatched], [# cycles]
    0,              24  (3.9%)
    1,              272 (44.6%)
    2,              314 (51.5%)


   Schedulers - number of cycles where we saw N instructions issued:
   [# issued], [# cycles]
    0,          7  (1.1%)
    1,          306 (50.2%)
    2,          297 (48.7%)


   Scheduler's queue usage:
   JALU01,  0/20
   JFPU01,  18/18
   JLSAGU,  0/12


   Retire Control Unit - number of cycles where we saw N instructions retired:
   [# retired], [# cycles]
    0,           109 (17.9%)
    1,           102 (16.7%)
    2,           399 (65.4%)


   Register File statistics:
   Total number of mappings created:   900
   Max number of mappings used:        35

   *  Register File #1 -- JFpuPRF:
      Number of physical registers:     72
      Total number of mappings created: 900
      Max number of mappings used:      35

   *  Register File #2 -- JIntegerPRF:
      Number of physical registers:     64
      Total number of mappings created: 0
      Max number of mappings used:      0

If we look at the Dynamic Dispatch Stall Cycles table, we see that the SCHEDQ counter reports 272 cycles. This counter is incremented every time the dispatch logic is unable to dispatch a group of two instructions because the scheduler's queue is full.

Looking at the Dispatch Logic table, we see that the pipeline was only able to dispatch two instructions 51.5% of the time. The dispatch group was limited to one instruction 44.6% of the cycles, which corresponds to 272 cycles. The dispatch statistics are displayed by using either the command option -all-stats or -dispatch-stats.
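
The percentages in the Dispatch Logic table follow directly from the cycle counts; the figures below are copied from the table above, and the calculation is plain arithmetic rather than llvm-mca code:

```python
# Cycle counts copied from the Dispatch Logic table above.
total_cycles = 610
dispatch_histogram = {0: 24, 1: 272, 2: 314}  # N dispatched -> cycle count

for n, cycles in sorted(dispatch_histogram.items()):
    print(f"{n} dispatched: {100 * cycles / total_cycles:.1f}%")

# Sanity check: the histogram accounts for all 900 simulated instructions.
assert sum(n * cycles for n, cycles in dispatch_histogram.items()) == 900
```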

The next table, Schedulers, presents a histogram displaying a count, representing the number of instructions issued on some number of cycles. In this case, of the 610 simulated cycles, single instructions were issued 306 times (50.2%) and there were 7 cycles where no instructions were issued.

The Scheduler's queue usage table shows the maximum number of buffer entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01 reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements three schedulers:

• JALU01 - A scheduler for ALU instructions.
• JFPU01 - A scheduler for floating point operations.
• JLSAGU - A scheduler for address generation.

The dot-product is a kernel of three floating point instructions (a vector multiply followed by two horizontal adds). That explains why only the floating point scheduler appears to be used.

A full scheduler queue is either caused by data dependency chains or by a sub-optimal usage of hardware resources. Sometimes, resource pressure can be mitigated by rewriting the kernel using different instructions that consume different scheduler resources. Schedulers with a small queue are less resilient to bottlenecks caused by the presence of long data dependencies. The scheduler statistics are displayed by using the command option -all-stats or -scheduler-stats.

The next table, Retire Control Unit, presents a histogram displaying a count, representing the number of instructions retired on some number of cycles. In this case, of the 610 simulated cycles, two instructions were retired during the same cycle 399 times (65.4%) and there were 109 cycles where no instructions were retired. The retire statistics are displayed by using the command option -all-stats or -retire-stats.

The last table presented is Register File statistics. Each physical register file (PRF) used by the pipeline is presented in this table. In the case of AMD Jaguar, there are two register files, one for floating-point registers (JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that of the 900 instructions processed, there were 900 mappings created. Since this dot-product example utilized only floating point registers, the JFpuPRF was responsible for creating the 900 mappings. However, we see that the pipeline only used a maximum of 35 of 72 available register slots at any given time. We can conclude that the floating point PRF was the only register file used for the example, and that it was never resource constrained. The register file statistics are displayed by using the command option -all-stats or -register-file-stats.

In this example, we can conclude that the IPC is mostly limited by data dependencies, and not by resource pressure.

Instruction Flow
This section describes the instruction flow through MCA's default out-of-order pipeline, as well as the functional units involved in the process.

The default pipeline implements the following sequence of stages used to process instructions.

• Dispatch (Instruction is dispatched to the schedulers).
• Issue (Instruction is issued to the processor pipelines).
• Write Back (Instruction is executed, and results are written back).
• Retire (Instruction is retired; writes are architecturally committed).

The default pipeline only models the out-of-order portion of a processor. Therefore, the instruction fetch and decode stages are not modeled. Performance bottlenecks in the frontend are not diagnosed. MCA assumes that instructions have all been decoded and placed into a queue. Also, MCA does not model branch prediction.

Instruction Dispatch
During the dispatch stage, instructions are picked in program order from a queue of already decoded instructions, and dispatched in groups to the simulated hardware schedulers.

The size of a dispatch group depends on the availability of the simulated hardware resources. The processor dispatch width defaults to the value of the IssueWidth in LLVM's scheduling model.

An instruction can be dispatched if:

• The size of the dispatch group is smaller than the processor's dispatch width.
• There are enough entries in the reorder buffer.
• There are enough physical registers to do register renaming.
• The schedulers are not full.
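
Taken together, the four conditions amount to a simple predicate. The sketch below is illustrative only; the function name, signature, and parameters are invented for this example and are not llvm-mca's API:

```python
# Hypothetical predicate mirroring the four dispatch conditions above.
def can_dispatch(group_size, dispatch_width,
                 rob_free_entries, uops,
                 free_phys_regs, regs_needed,
                 scheduler_full):
    return (group_size < dispatch_width        # room in the dispatch group
            and rob_free_entries >= uops       # enough reorder buffer entries
            and free_phys_regs >= regs_needed  # registers for renaming
            and not scheduler_full)            # target scheduler has space

print(can_dispatch(1, 2, 4, 1, 8, 1, False))  # True
print(can_dispatch(2, 2, 4, 1, 8, 1, False))  # False: dispatch group full
```

If any condition fails, the corresponding Dynamic Dispatch Stall Cycles counter (GROUP, RCU, RAT, SCHEDQ) is the one that grows.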

Scheduling models can optionally specify which register files are available on the processor. MCA uses that information to initialize register file descriptors. Users can limit the number of physical registers that are globally available for register renaming by using the command option -register-file-size. A value of zero for this option means unbounded. By knowing how many registers are available for renaming, MCA can predict dispatch stalls caused by the lack of registers.

The number of reorder buffer entries consumed by an instruction depends on the number of micro-opcodes specified by the target scheduling model. The purpose of MCA's reorder buffer is to track the progress of instructions that are "in-flight," and to retire instructions in program order. The number of entries in the reorder buffer defaults to the MicroOpBufferSize provided by the target scheduling model.

Instructions that are dispatched to the schedulers consume scheduler buffer entries. llvm-mca queries the scheduling model to determine the set of buffered resources consumed by an instruction. Buffered resources are treated like scheduler resources.

Instruction Issue
Each processor scheduler implements a buffer of instructions. An instruction has to wait in the scheduler's buffer until input register operands become available. Only at that point does the instruction become eligible for execution, and it may be issued (potentially out-of-order) to the pipelines. Instruction latencies are computed by llvm-mca with the help of the scheduling model.

llvm-mca's scheduler is designed to simulate multiple processor schedulers. The scheduler is responsible for tracking data dependencies, and dynamically selecting which processor resources are consumed by instructions. It delegates the management of processor resource units and resource groups to a resource manager. The resource manager is responsible for selecting resource units that are consumed by instructions. For example, if an instruction consumes 1cy of a resource group, the resource manager selects one of the available units from the group; by default, the resource manager uses a round-robin selector to guarantee that resource usage is uniformly distributed between all units of a group.
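
The round-robin selection can be sketched in a few lines. This is an illustrative model, not llvm-mca code; the class is invented, and the unit names are borrowed from the btver2 example above:

```python
from itertools import cycle

# Illustrative round-robin selector over a resource group.
class ResourceGroup:
    def __init__(self, units):
        self._next_unit = cycle(units)  # endless round-robin iterator

    def select(self):
        # Pick the next unit, so usage stays uniformly distributed.
        return next(self._next_unit)

group = ResourceGroup(["JFPU0", "JFPU1"])
print([group.select() for _ in range(4)])  # ['JFPU0', 'JFPU1', 'JFPU0', 'JFPU1']
```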

llvm-mca's scheduler implements three instruction queues:

• WaitQueue: a queue of instructions whose operands are not ready.
• ReadyQueue: a queue of instructions ready to execute.
• IssuedQueue: a queue of instructions executing.

Depending on the operand availability, instructions that are dispatched to the scheduler are either placed into the WaitQueue or into the ReadyQueue.

Every cycle, the scheduler checks if instructions can be moved from the WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be issued to the underlying pipelines. The algorithm prioritizes older instructions over younger instructions.
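
The per-cycle movement between the queues can be sketched as follows. This is an illustrative model under simplifying assumptions (a fixed issue width, instructions named by strings), not llvm-mca's implementation:

```python
# Illustrative sketch of the three scheduler queues and the per-cycle
# movement between them. Lists keep program order, so taking from the
# front prioritizes older instructions over younger ones.
wait_queue, ready_queue, issued_queue = [], [], []

def dispatch(instr, operands_ready):
    # Dispatched instructions land in the WaitQueue or the ReadyQueue
    # depending on operand availability.
    (ready_queue if operands_ready else wait_queue).append(instr)

def cycle_step(now_ready, issue_width=2):
    # Move instructions whose operands became available this cycle.
    for instr in list(wait_queue):
        if instr in now_ready:
            wait_queue.remove(instr)
            ready_queue.append(instr)
    # Issue the oldest ready instructions to the underlying pipelines.
    issued = ready_queue[:issue_width]
    del ready_queue[:issue_width]
    issued_queue.extend(issued)
    return issued

dispatch("vmulps", operands_ready=True)
dispatch("vhaddps", operands_ready=False)
print(cycle_step(now_ready={"vhaddps"}))  # ['vmulps', 'vhaddps']
```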

Write-Back and Retire Stage
Issued instructions are moved from the ReadyQueue to the IssuedQueue. There, instructions wait until they reach the write-back stage. At that point, they are removed from the queue and the retire control unit is notified.

When instructions are executed, the retire control unit flags the instruction as "ready to retire."

Instructions are retired in program order. The register file is notified of the retirement so that it can free the physical registers that were allocated for the instruction during the register renaming stage.

Load/Store Unit and Memory Consistency Model
To simulate an out-of-order execution of memory operations, llvm-mca uses a load/store unit (LSUnit) that models the speculative execution of loads and stores.

Each load (or store) consumes an entry in the load (or store) queue. Users can specify the flags -lqueue and -squeue to limit the number of entries in the load and store queues respectively. The queues are unbounded by default.

The LSUnit implements a relaxed consistency model for memory loads and stores. The rules are:

1. A younger load is allowed to pass an older load only if there are no intervening stores or barriers between the two loads.
2. A younger load is allowed to pass an older store provided that the load does not alias with the store.
3. A younger store is not allowed to pass an older store.
4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias with store operations (-noalias=true). Under this assumption, younger loads are always allowed to pass older stores. Essentially, the LSUnit does not attempt to run any alias analysis to predict when loads and stores do not alias with each other.

Note that, in the case of write-combining memory, rule 3 could be relaxed to allow reordering of non-aliasing store operations. That being said, at the moment, there is no way to further relax the memory model (-noalias is the only option). Essentially, there is no option to specify a different memory type (e.g., write-back, write-combining, write-through, etc.) and consequently to weaken, or strengthen, the memory model.

Other limitations are:

• The LSUnit does not know when store-to-load forwarding may occur.
• The LSUnit does not know anything about cache hierarchy and memory types.
• The LSUnit does not know how to identify serializing operations and memory fences.

The LSUnit does not attempt to predict if a load or store hits or misses the L1 cache. It only knows if an instruction "MayLoad" and/or "MayStore." For loads, the scheduling model provides an "optimistic" load-to-use latency (which usually matches the load-to-use latency for when there is a hit in the L1D).

llvm-mca does not know about serializing operations or memory-barrier like instructions. The LSUnit conservatively assumes that an instruction which has both "MayLoad" and unmodeled side effects behaves like a "soft" load-barrier. That means it serializes loads without forcing a flush of the load queue. Similarly, instructions that "MayStore" and have unmodeled side effects are treated like store barriers. A full memory barrier is a "MayLoad" and "MayStore" instruction with unmodeled side effects. This is inaccurate, but it is the best that we can do at the moment with the current information available in LLVM.

A load/store barrier consumes one entry of the load/store queue. A load/store barrier enforces ordering of loads/stores. A younger load cannot pass a load barrier. Also, a younger store cannot pass a store barrier. A younger load has to wait for the memory/load barrier to execute. A load/store barrier is "executed" when it becomes the oldest entry in the load/store queue(s). That also means, by construction, all of the older loads/stores have been executed.

In conclusion, the full set of load/store consistency rules is:

1. A store may not pass a previous store.
2. A store may not pass a previous load (regardless of -noalias).
3. A store has to wait until an older store barrier is fully executed.
4. A load may pass a previous load.
5. A load may not pass a previous store unless -noalias is set.
6. A load has to wait until an older load barrier is fully executed.
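
Ignoring the barrier rules (3 and 6), the ordering rules condense into a small predicate. The helper below is an illustrative sketch, not llvm-mca's implementation:

```python
# Hypothetical helper encoding rules 1, 2, 4 and 5 above (the barrier
# rules 3 and 6 are omitted for brevity).
def may_pass(younger, older, noalias=True):
    """May a younger memory operation bypass an older one?"""
    if younger == "store":
        return False  # rules 1 and 2: a store passes neither stores nor loads
    # The younger operation is a load.
    if older == "load":
        return True   # rule 4: a load may pass a previous load
    return noalias    # rule 5: a load passes a store only under -noalias

print(may_pass("load", "store"))                 # True (default -noalias)
print(may_pass("load", "store", noalias=False))  # False
print(may_pass("store", "load"))                 # False
```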

AUTHOR
Maintained by The LLVM Team (http://llvm.org/).

COPYRIGHT
2003-2023, LLVM Project

                                  2023-07-20                       LLVM-MCA(1)