LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)

NAME
       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS
       llvm-mca [options] [input]

DESCRIPTION
       llvm-mca is a performance analysis tool that uses information
       available in LLVM (e.g., scheduling models) to statically measure the
       performance of machine code on a specific CPU.

       Performance is measured in terms of throughput as well as processor
       resource consumption. The tool currently works for processors with an
       out-of-order backend, for which there is a scheduling model available
       in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help with diagnosing
       potential performance issues.

       Given an assembly code sequence, llvm-mca estimates the Instructions
       Per Cycle (IPC), as well as hardware resource pressure. The analysis
       and reporting style were inspired by the IACA tool from Intel.

       For example, you can compile code with clang, output assembly, and
       pipe it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

OPTIONS
       If input is "-" or omitted, llvm-mca reads from standard input.
       Otherwise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca will send its output to
       standard output if the input is from standard input. If the -o option
       specifies "-", then the output will also be sent to standard output.

       -help  Print a summary of command line options.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code. It
              defaults to the host default target.

       -mcpu=<cpuname>
              Specify the processor for which to analyze the code. By
              default, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated
              by the tool. On x86, possible values are [0, 1]. A value of 0
              selects the AT&T assembly format, while a value of 1 selects
              the Intel assembly format for the code printed out by the tool
              in the analysis report.

       -dispatch=<width>
              Specify a different dispatch width for the processor. The
              dispatch width defaults to field 'IssueWidth' in the processor
              scheduling model. If width is zero, then the default dispatch
              width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this
              flag limits how many physical registers are available for
              register renaming purposes. A value of zero for this flag
              means "unlimited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run. If this flag is set
              to 0, then the tool sets the number of iterations to a default
              value (i.e., 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias.
              This is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the load queue. A value of zero
              for this flag is ignored, and the default load queue size is
              used instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the store queue. A value of
              zero for this flag is ignored, and the default store queue
              size is used instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view.
              By default, the timeline view prints information for up to 10
              iterations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view. By default,
              the number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view. This is enabled by default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics. This view collects and
              analyzes instruction dispatch events, as well as
              static/dynamic dispatch stall events. This view is disabled by
              default.

       -scheduler-stats
              Enable extra scheduler statistics. This view collects and
              analyzes instruction issue events. This view is disabled by
              default.

       -retire-stats
              Enable extra retire control unit statistics. This view is
              disabled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -all-stats
              Print all hardware statistics. This enables extra statistics
              related to the dispatch logic, the hardware schedulers, the
              register file(s), and the retire control unit. This option is
              disabled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Print resource pressure information based on the static
              information available from the processor model. This differs
              from the resource pressure view because it doesn't require
              that the code is simulated. It instead prints the theoretical
              uniform distribution of resource pressure for every
              instruction in sequence.

EXIT STATUS
       llvm-mca returns 0 on success. Otherwise, an error message is printed
       to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
       llvm-mca allows for the optional usage of special code comments to
       mark regions of the assembly code to be analyzed. A comment starting
       with substring LLVM-MCA-BEGIN marks the beginning of a code region.
       A comment starting with substring LLVM-MCA-END marks the end of a
       code region. For example:

          # LLVM-MCA-BEGIN My Code Region
          ...
          # LLVM-MCA-END

       Multiple regions can be specified provided that they do not overlap.
       A code region can have an optional description. If no user-defined
       region is specified, then llvm-mca assumes a default region which
       contains every instruction in the input file. Every region is
       analyzed in isolation, and the final performance report is the union
       of all the reports generated for every code region.

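       The marker semantics described above can be sketched in a few lines
       of Python. This is an illustrative parser only, not llvm-mca's own
       implementation; the function name extract_regions and the
       "<default>" placeholder are hypothetical, and overlapping or
       unterminated regions (which llvm-mca diagnoses) are not handled.

```python
import re

def extract_regions(asm_text):
    """Collect the code regions delimited by LLVM-MCA-BEGIN/END comments.

    Illustrative sketch only: llvm-mca has its own region parser, and
    this version does not diagnose overlapping or unterminated regions.
    """
    regions = {}
    name = None
    body = []
    for line in asm_text.splitlines():
        begin = re.search(r"LLVM-MCA-BEGIN\s*(.*)", line)
        if begin:
            # The optional text after the marker is the region description.
            name = begin.group(1).strip() or "<default>"
            body = []
        elif "LLVM-MCA-END" in line:
            regions[name] = body
            name = None
        elif name is not None:
            body.append(line.strip())
    return regions

asm = """\
# LLVM-MCA-BEGIN My Code Region
vmulps %xmm0, %xmm1, %xmm2
# LLVM-MCA-END
"""
regions = extract_regions(asm)
```

       Feeding the three-line example above through this sketch yields a
       single region named "My Code Region" containing the vmulps line.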
       Inline assembly directives may be used from source code to annotate
       the assembly text:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo");
            a += 42;
            __asm volatile("# LLVM-MCA-END");
            a *= b;
            return a;
          }

HOW LLVM-MCA WORKS
       llvm-mca takes assembly code as input. The assembly code is parsed
       into a sequence of MCInst with the help of the existing LLVM target
       assembly parsers. The parsed sequence of MCInst is then analyzed by a
       Pipeline module to generate a performance report.

       The Pipeline module simulates the execution of the machine code
       sequence in a loop of iterations (default is 100). During this
       process, the pipeline collects a number of execution related
       statistics. At the end of this process, the pipeline generates and
       prints a report from the collected statistics.

       Here is an example of a performance report generated by the tool for
       a dot-product of two packed float vectors of four elements. The
       analysis is conducted for target x86, cpu btver2. The report was
       generated using the following command, with the example located at
       test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Total uOps:        900

          Dispatch Width:    2
          uOps Per Cycle:    1.48
          IPC:               1.48
          Block RThroughput: 2.0


          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps   %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps  %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps  %xmm3, %xmm3, %xmm4


          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL


          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps   %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm3, %xmm3, %xmm4

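       The relationship between the two pressure tables can be checked by
       hand: each column of the "Resource pressure per iteration" row is
       the column-wise sum of the per-instruction rows. A short sketch
       using only the nonzero columns from the example above (the row
       labels are illustrative, not llvm-mca output):

```python
# Nonzero resource-pressure columns from the example report, one row per
# instruction in the dot-product kernel.
per_instruction = [
    {"JFPM": 1.00, "JFPU1": 1.00},   # vmulps
    {"JFPA": 1.00, "JFPU0": 1.00},   # vhaddps (first)
    {"JFPA": 1.00, "JFPU0": 1.00},   # vhaddps (second)
]

def per_iteration(rows):
    """Column-wise sum: average resource cycles consumed per iteration."""
    totals = {}
    for row in rows:
        for resource, cycles in row.items():
            totals[resource] = totals.get(resource, 0.0) + cycles
    return totals

pressure = per_iteration(per_instruction)
```

       The sums reproduce the per-iteration row of the report: 2.00 cycles
       on JFPA and JFPU0, and 1.00 cycle on JFPM and JFPU1.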
       According to this report, the dot-product kernel has been executed
       300 times, for a total of 900 simulated instructions. The total
       number of simulated micro opcodes (uOps) is also 900.

       The report is structured in three main sections. The first section
       collects a few performance numbers; the goal of this section is to
       give a very quick overview of the performance throughput. Important
       performance indicators are IPC, uOps Per Cycle, and Block
       RThroughput (Block Reciprocal Throughput).

       IPC is computed by dividing the total number of simulated
       instructions by the total number of cycles. In the absence of
       loop-carried data dependencies, the observed IPC tends to a
       theoretical maximum which can be computed by dividing the number of
       instructions of a single iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is computed by dividing the total number of
       simulated micro opcodes by the total number of cycles. A delta
       between Dispatch Width and this field is an indicator of a
       performance issue. In the absence of loop-carried data dependencies,
       the observed 'uOps Per Cycle' should tend to a theoretical maximum
       throughput which can be computed by dividing the number of uOps of a
       single iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is bounded from above by the dispatch width.
       That is because the dispatch width limits the maximum size of a
       dispatch group. Both IPC and 'uOps Per Cycle' are limited by the
       amount of hardware parallelism. The availability of hardware
       resources affects the resource pressure distribution, and it limits
       the number of instructions that can be executed in parallel every
       cycle. A delta between Dispatch Width and the theoretical maximum
       uOps Per Cycle (computed by dividing the number of uOps of a single
       iteration by the Block RThroughput) is an indicator of a performance
       bottleneck caused by the lack of hardware resources. In general, the
       lower the Block RThroughput, the better.

       In this example, uOps per iteration/Block RThroughput is 1.50. Since
       there are no loop-carried dependencies, the observed uOps Per Cycle
       is expected to approach 1.50 when the number of iterations tends to
       infinity. The delta between the Dispatch Width (2.00), and the
       theoretical maximum uOp throughput (1.50) is an indicator of a
       performance bottleneck caused by the lack of hardware resources, and
       the Resource pressure view can help to identify the problematic
       resource usage.

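       The arithmetic behind these indicators can be reproduced directly
       from the numbers in the example report. The function below is just
       an illustrative sketch (its name and signature are not part of
       llvm-mca):

```python
def throughput_summary(instructions, uops, cycles, iterations,
                       block_rthroughput):
    """Recompute the headline indicators from the example report."""
    return {
        # IPC: simulated instructions divided by total cycles.
        "ipc": instructions / cycles,
        # uOps Per Cycle: simulated micro opcodes divided by total cycles.
        "uops_per_cycle": uops / cycles,
        # Theoretical maximum approached as iterations tend to infinity:
        # uOps of a single iteration divided by the Block RThroughput.
        "max_uops_per_cycle": (uops / iterations) / block_rthroughput,
    }

s = throughput_summary(instructions=900, uops=900, cycles=610,
                       iterations=300, block_rthroughput=2.0)
```

       With the example's numbers, both IPC and uOps Per Cycle come out as
       900/610, which rounds to the reported 1.48, and the theoretical
       maximum is 3/2.0 = 1.50.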
       The second section of the report shows the latency and reciprocal
       throughput of every instruction in the sequence. That section also
       reports extra information related to the number of micro opcodes,
       and opcode properties (i.e., 'MayLoad', 'MayStore', and
       'HasSideEffects').

       The third section is the Resource pressure view. This view reports
       the average number of resource cycles consumed every iteration by
       instructions for every processor resource unit available on the
       target. Information is structured in two tables. The first table
       reports the number of resource cycles spent on average every
       iteration. The second table correlates the resource cycles to the
       machine instructions in the sequence. For example, every iteration
       of the instruction vmulps always executes on resource unit [6]
       (JFPU1 - floating point pipeline #1), consuming an average of 1
       resource cycle per iteration. Note that on AMD Jaguar, vector
       floating-point multiply can only be issued to pipeline JFPU1, while
       horizontal floating-point additions can only be issued to pipeline
       JFPU0.

       The resource pressure view helps with identifying bottlenecks caused
       by high usage of specific hardware resources. Situations with
       resource pressure mainly concentrated on a few resources should, in
       general, be avoided. Ideally, pressure should be uniformly
       distributed between multiple resources.

   Timeline View
       The timeline view produces a detailed report of each instruction's
       state transitions through an instruction pipeline. This view is
       enabled by the command line option -timeline. As instructions
       transition through the various stages of the pipeline, their states
       are depicted in the view report. These states are represented by the
       following characters:

       • D : Instruction dispatched.

       • e : Instruction executing.

       • E : Instruction executed.

       • R : Instruction retired.

       • = : Instruction already dispatched, waiting to be executed.

       • - : Instruction executed, waiting to be retired.

       Below is the timeline view for a subset of the dot-product example
       located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and
       processed by llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4

       The timeline view is interesting because it shows instruction state
       changes during execution. It also gives an idea of how the tool
       processes instructions executed on the target, and how their timing
       information might be calculated.

       The timeline view is structured in two tables. The first table shows
       instructions changing state over time (measured in cycles); the
       second table (named Average Wait times) reports useful timing
       statistics, which should help diagnose performance bottlenecks
       caused by long data dependencies and sub-optimal usage of hardware
       resources.

       An instruction in the timeline view is identified by a pair of
       indices, where the first index identifies an iteration, and the
       second index is the instruction index (i.e., where it appears in the
       code sequence). Since this example was generated using 3 iterations
       (-iterations=3), the iteration indices range from 0 to 2 inclusive.

       Excluding the first and last column, the remaining columns are in
       cycles. Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       • Instruction [1,0] was dispatched at cycle 1.

       • Instruction [1,0] started executing at cycle 2.

       • Instruction [1,0] reached the write back stage at cycle 4.

       • Instruction [1,0] was retired at cycle 10.

       Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
       wait in the scheduler's queue for its operands to become available.
       By the time vmulps is dispatched, operands are already available,
       and pipeline JFPU1 is ready to serve another instruction. So the
       instruction can be immediately issued on the JFPU1 pipeline. That is
       demonstrated by the fact that the instruction only spent 1cy in the
       scheduler's queue.

       There is a gap of 5 cycles between the write-back stage and the
       retire event. That is because instructions must retire in program
       order, so [1,0] has to wait for [0,2] to be retired first (i.e., it
       has to wait until cycle 10).

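       The four bullets above can be read off the timeline string for
       [1,0] mechanically: the state characters mark the dispatch, first
       execution, write-back, and retire cycles. A small decoding sketch
       (the function is illustrative, not part of llvm-mca):

```python
def decode_timeline_row(row):
    """Map one timeline row to its key cycle numbers.

    'D' = dispatch, first 'e' = start of execution, 'E' = write-back,
    'R' = retire; cycles are numbered from 0 at the first column.
    """
    return {
        "dispatched": row.index("D"),
        "started_executing": row.index("e"),
        "write_back": row.index("E"),
        "retired": row.index("R"),
    }

# Instruction [1,0] from the timeline view above.
cycles = decode_timeline_row(".DeeE-----R")
```

       For ".DeeE-----R" this recovers exactly the cycles listed in the
       bullets: dispatched at 1, executing at 2, write-back at 4, retired
       at 10; the five '-' characters are the five cycles spent waiting to
       retire.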
       In the example, all instructions are in a RAW (Read After Write)
       dependency chain. Register %xmm2 written by vmulps is immediately
       used by the first vhaddps, and register %xmm3 written by the first
       vhaddps is used by the second vhaddps. Long data dependencies
       negatively impact the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies introduced
       by instructions from different iterations. However, those
       dependencies can be removed at the register renaming stage (at the
       cost of allocating register aliases, and therefore consuming
       physical registers).

       Table Average Wait times helps diagnose performance issues that are
       caused by the presence of long latency instructions and potentially
       long data dependencies which may limit the ILP. Note that llvm-mca,
       by default, assumes at least 1cy between the dispatch event and the
       issue event.

       When the performance is limited by data dependencies and/or long
       latency instructions, the number of cycles spent while in the ready
       state is expected to be very small when compared with the total
       number of cycles spent in the scheduler's queue. The difference
       between the two counters is a good indicator of how large of an
       impact data dependencies had on the execution of the instructions.
       When performance is mostly limited by the lack of hardware
       resources, the delta between the two counters is small. However, the
       number of cycles spent in the queue tends to be larger (i.e., more
       than 1-3cy), especially when compared to other low latency
       instructions.

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and
       performance counters for the dispatch logic, the reorder buffer, the
       retire control unit, and the register file.

       Below is an example of -all-stats output generated by llvm-mca for
       300 iterations of the dot-product example discussed in the previous
       sections.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272  (44.6%)
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0


          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)


          Schedulers - number of cycles where we saw N instructions issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)

          Scheduler's queue usage:
          [1] Resource name.
          [2] Average number of used buffer entries.
          [3] Maximum number of used buffer entries.
          [4] Total number of buffer entries.

           [1]            [2]        [3]        [4]
          JALU01           0          0          20
          JFPU01           17         18         18
          JLSAGU           0          0          12


          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
           2,           399  (65.4%)

          Total ROB Entries:                64
          Max Used ROB Entries:             35  ( 54.7% )
          Average Used ROB Entries per cy:  32  ( 50.0% )


          Register File statistics:
          Total number of mappings created:    900
          Max number of mappings used:         35

          *  Register File #1 -- JFpuPRF:
             Number of physical registers:     72
             Total number of mappings created: 900
             Max number of mappings used:      35

          *  Register File #2 -- JIntegerPRF:
             Number of physical registers:     64
             Total number of mappings created: 0
             Max number of mappings used:      0

       If we look at the Dynamic Dispatch Stall Cycles table, we see that
       the counter for SCHEDQ reports 272 cycles. This counter is
       incremented every time the dispatch logic is unable to dispatch a
       full group because the scheduler's queue is full.

       Looking at the Dispatch Logic table, we see that the pipeline was
       only able to dispatch two micro opcodes 51.5% of the time. The
       dispatch group was limited to one micro opcode 44.6% of the cycles,
       which corresponds to 272 cycles. The dispatch statistics are
       displayed by using either the command option -all-stats or
       -dispatch-stats.

       The next table, Schedulers, presents a histogram displaying a count,
       representing the number of instructions issued on some number of
       cycles. In this case, of the 610 simulated cycles, single
       instructions were issued 306 times (50.2%) and there were 7 cycles
       where no instructions were issued.

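       The percentages in these histograms are each cycle count divided by
       the 610 total simulated cycles. A quick cross-check of the figures
       quoted above (this is hand arithmetic, not tool output):

```python
TOTAL_CYCLES = 610  # from the example report

def pct(cycle_count):
    """Share of the total simulated cycles, rounded as in the report."""
    return round(100.0 * cycle_count / TOTAL_CYCLES, 1)
```

       For example, the 272 SCHEDQ stall cycles are 272/610, or 44.6% of
       the run, matching the figure printed next to the counter.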
       The Scheduler's queue usage table shows the average and maximum
       number of buffer entries (i.e., scheduler queue entries) used at
       runtime. Resource JFPU01 reached its maximum (18 of 18 queue
       entries). Note that AMD Jaguar implements three schedulers:

       • JALU01 - A scheduler for ALU instructions.

       • JFPU01 - A scheduler for floating point operations.

       • JLSAGU - A scheduler for address generation.

       The dot-product is a kernel of three floating point instructions (a
       vector multiply followed by two horizontal adds). That explains why
       only the floating point scheduler appears to be used.

       A full scheduler queue is either caused by data dependency chains or
       by a sub-optimal usage of hardware resources. Sometimes, resource
       pressure can be mitigated by rewriting the kernel using different
       instructions that consume different scheduler resources. Schedulers
       with a small queue are less resilient to bottlenecks caused by the
       presence of long data dependencies. The scheduler statistics are
       displayed by using either the command option -all-stats or
       -scheduler-stats.

       The next table, Retire Control Unit, presents a histogram displaying
       a count, representing the number of instructions retired on some
       number of cycles. In this case, of the 610 simulated cycles, two
       instructions were retired during the same cycle 399 times (65.4%)
       and there were 109 cycles where no instructions were retired. The
       retire statistics are displayed by using either the command option
       -all-stats or -retire-stats.

       The last table presented is Register File statistics. Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for
       floating-point registers (JFpuPRF) and one for integer registers
       (JIntegerPRF). The table shows that of the 900 instructions
       processed, there were 900 mappings created. Since this dot-product
       example utilized only floating point registers, the JFpuPRF was
       responsible for creating the 900 mappings. However, we see that the
       pipeline only used a maximum of 35 of 72 available register slots at
       any given time. We can conclude that the floating point PRF was the
       only register file used for the example, and that it was never
       resource constrained. The register file statistics are displayed by
       using either the command option -all-stats or -register-file-stats.

       In this example, we can conclude that the IPC is mostly limited by
       data dependencies, and not by resource pressure.

   Instruction Flow
       This section describes the instruction flow through the default
       pipeline of llvm-mca, as well as the functional units involved in
       the process.

       The default pipeline implements the following sequence of stages
       used to process instructions.

       • Dispatch (Instruction is dispatched to the schedulers).

       • Issue (Instruction is issued to the processor pipelines).

       • Write Back (Instruction is executed, and results are written
         back).

       • Retire (Instruction is retired; writes are architecturally
         committed).

       The default pipeline only models the out-of-order portion of a
       processor. Therefore, the instruction fetch and decode stages are
       not modeled. Performance bottlenecks in the frontend are not
       diagnosed. llvm-mca assumes that instructions have all been decoded
       and placed into a queue before the simulation starts. Also, llvm-mca
       does not model branch prediction.

   Instruction Dispatch
       During the dispatch stage, instructions are picked in program order
       from a queue of already decoded instructions, and dispatched in
       groups to the simulated hardware schedulers.

       The size of a dispatch group depends on the availability of the
       simulated hardware resources. The processor dispatch width defaults
       to the value of the IssueWidth field in LLVM's scheduling model.

       An instruction can be dispatched if:

       • The size of the dispatch group is smaller than the processor's
         dispatch width.

       • There are enough entries in the reorder buffer.

       • There are enough physical registers to do register renaming.

       • The schedulers are not full.

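       These four conditions combine into a single predicate. The sketch
       below is a simplification of that check; the function name and
       parameters are illustrative, not llvm-mca's internal API:

```python
def can_dispatch(group_size, dispatch_width, free_rob_entries, uops,
                 free_phys_regs, regs_needed, scheduler_full):
    """True if one more instruction fits into the current dispatch group."""
    return (group_size < dispatch_width          # group still has room
            and free_rob_entries >= uops         # reorder buffer has space
            and free_phys_regs >= regs_needed    # renaming can proceed
            and not scheduler_full)              # scheduler can accept it

# With a dispatch width of 2 (as on btver2), a second instruction fits,
# but a third in the same cycle does not.
ok = can_dispatch(group_size=1, dispatch_width=2, free_rob_entries=64,
                  uops=1, free_phys_regs=72, regs_needed=1,
                  scheduler_full=False)
```

       Any failed condition stalls dispatch for that cycle; the Dynamic
       Dispatch Stall Cycles counters seen earlier attribute each stall to
       one of these causes.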
       Scheduling models can optionally specify which register files are
       available on the processor. llvm-mca uses that information to
       initialize register file descriptors. Users can limit the number of
       physical registers that are globally available for register renaming
       by using the command option -register-file-size. A value of zero for
       this option means unbounded. By knowing how many registers are
       available for renaming, the tool can predict dispatch stalls caused
       by the lack of physical registers.

       The number of reorder buffer entries consumed by an instruction
       depends on the number of micro-opcodes specified for that
       instruction by the target scheduling model. The reorder buffer is
       responsible for tracking the progress of instructions that are
       "in-flight", and retiring them in program order. The number of
       entries in the reorder buffer defaults to the value specified by
       field MicroOpBufferSize in the target scheduling model.

       Instructions that are dispatched to the schedulers consume scheduler
       buffer entries. llvm-mca queries the scheduling model to determine
       the set of buffered resources consumed by an instruction. Buffered
       resources are treated like scheduler resources.

   Instruction Issue
       Each processor scheduler implements a buffer of instructions. An
       instruction has to wait in the scheduler's buffer until input
       register operands become available. Only at that point does the
       instruction become eligible for execution and may be issued
       (potentially out-of-order) for execution. Instruction latencies are
       computed by llvm-mca with the help of the scheduling model.

       llvm-mca's scheduler is designed to simulate multiple processor
       schedulers. The scheduler is responsible for tracking data
       dependencies, and dynamically selecting which processor resources
       are consumed by instructions. It delegates the management of
       processor resource units and resource groups to a resource manager.
       The resource manager is responsible for selecting resource units
       that are consumed by instructions. For example, if an instruction
       consumes 1cy of a resource group, the resource manager selects one
       of the available units from the group; by default, the resource
       manager uses a round-robin selector to guarantee that resource usage
       is uniformly distributed between all units of a group.

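       The default unit-selection policy can be sketched as a simple
       round-robin over the units of a group. This is illustrative only;
       llvm-mca's resource manager additionally has to account for unit
       availability, which this sketch ignores:

```python
from itertools import cycle

class RoundRobinSelector:
    """Hand out the units of a resource group in rotating order, so that
    usage is distributed uniformly across the group (sketch of the
    default policy described above)."""
    def __init__(self, units):
        self._order = cycle(units)

    def select(self):
        return next(self._order)

# A group with two units, like JFPU0/JFPU1 on AMD Jaguar.
fp_group = RoundRobinSelector(["JFPU0", "JFPU1"])
picks = [fp_group.select() for _ in range(4)]
```

       Four consecutive selections alternate between the two units, which
       is exactly the uniform distribution the policy aims for.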
       llvm-mca's scheduler internally groups instructions into three sets:

       • WaitSet: a set of instructions whose operands are not ready.

       • ReadySet: a set of instructions ready to execute.

       • IssuedSet: a set of instructions executing.

       Depending on operand availability, instructions that are dispatched
       to the scheduler are either placed into the WaitSet or into the
       ReadySet.

       Every cycle, the scheduler checks if instructions can be moved from
       the WaitSet to the ReadySet, and if instructions from the ReadySet
       can be issued to the underlying pipelines. The algorithm prioritizes
       older instructions over younger instructions.

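       One scheduler cycle can therefore be sketched as two moves between
       these sets. This is a simplification with illustrative names (real
       issue logic also tracks pipeline and resource availability):

```python
def scheduler_cycle(wait_set, ready_set, issued_set, operands_ready,
                    free_issue_slots):
    """Promote instructions whose operands became ready, then issue the
    oldest entries from the ReadySet (lower index = older instruction)."""
    # WaitSet -> ReadySet: operands became available this cycle.
    for instr in list(wait_set):
        if operands_ready(instr):
            wait_set.remove(instr)
            ready_set.append(instr)
    # ReadySet -> IssuedSet: issue oldest-first while slots remain.
    ready_set.sort()  # prioritize older instructions
    while ready_set and free_issue_slots > 0:
        issued_set.append(ready_set.pop(0))
        free_issue_slots -= 1

# Instructions are identified by their program-order index here.
wait, ready, issued = [2], [1, 0], []
scheduler_cycle(wait, ready, issued, operands_ready=lambda i: i == 2,
                free_issue_slots=2)
```

       In this run, instruction 2 becomes ready, and the two issue slots go
       to the two oldest ready instructions (0 and 1), leaving 2 waiting in
       the ReadySet for the next cycle.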
   Write-Back and Retire Stage
       Issued instructions are moved from the ReadySet to the IssuedSet.
       There, instructions wait until they reach the write-back stage. At
       that point, they get removed from the queue and the retire control
       unit is notified.

       When instructions are executed, the retire control unit flags the
       instruction as "ready to retire."

       Instructions are retired in program order. The register file is
       notified of the retirement so that it can free the physical
       registers that were allocated for the instruction during the
       register renaming stage.

   Load/Store Unit and Memory Consistency Model
       To simulate an out-of-order execution of memory operations, llvm-mca
       utilizes a simulated load/store unit (LSUnit) to model the
       speculative execution of loads and stores.

       Each load (or store) consumes an entry in the load (or store) queue.
       Users can specify flags -lqueue and -squeue to limit the number of
       entries in the load and store queues respectively. The queues are
       unbounded by default.

       The LSUnit implements a relaxed consistency model for memory loads
       and stores. The rules are:

       1. A younger load is allowed to pass an older load only if there are
          no intervening stores or barriers between the two loads.

       2. A younger load is allowed to pass an older store provided that
          the load does not alias with the store.

       3. A younger store is not allowed to pass an older store.

       4. A younger store is not allowed to pass an older load.

       By default, the LSUnit optimistically assumes that loads do not
       alias with store operations (-noalias=true). Under this assumption,
       younger loads are always allowed to pass older stores. Essentially,
       the LSUnit does not attempt to run any alias analysis to predict
       when loads and stores do not alias with each other.

       Note that, in the case of write-combining memory, rule 3 could be
       relaxed to allow reordering of non-aliasing store operations. That
       being said, at the moment, there is no way to further relax the
       memory model (-noalias is the only option). Essentially, there is no
       option to specify a different memory type (e.g., write-back,
       write-combining, write-through, etc.) and consequently to weaken, or
       strengthen, the memory model.

       Other limitations are:

       • The LSUnit does not know when store-to-load forwarding may occur.

       • The LSUnit does not know anything about cache hierarchy and memory
         types.

       • The LSUnit does not know how to identify serializing operations
         and memory fences.

       The LSUnit does not attempt to predict if a load or store hits or
       misses the L1 cache. It only knows if an instruction "MayLoad"
       and/or "MayStore." For loads, the scheduling model provides an
       "optimistic" load-to-use latency (which usually matches the
       load-to-use latency for when there is a hit in the L1D).

       llvm-mca does not know about serializing operations or
       memory-barrier-like instructions. The LSUnit conservatively assumes
       that an instruction which has both "MayLoad" and unmodeled side
       effects behaves like a "soft" load-barrier. That means, it
       serializes loads without forcing a flush of the load queue.
       Similarly, instructions that "MayStore" and have unmodeled side
       effects are treated like store barriers. A full memory barrier is a
       "MayLoad" and "MayStore" instruction with unmodeled side effects.
       This is inaccurate, but it is the best that we can do at the moment
       with the current information available in LLVM.

       A load/store barrier consumes one entry of the load/store queue. A
       load/store barrier enforces ordering of loads/stores. A younger load
       cannot pass a load barrier. Also, a younger store cannot pass a
       store barrier. A younger load has to wait for the memory/load
       barrier to execute. A load/store barrier is "executed" when it
       becomes the oldest entry in the load/store queue(s). That also
       means, by construction, all of the older loads/stores have been
       executed.

       In conclusion, the full set of load/store consistency rules is:

       1. A store may not pass a previous store.

       2. A store may not pass a previous load (regardless of -noalias).

       3. A store has to wait until an older store barrier is fully
          executed.

       4. A load may pass a previous load.

       5. A load may not pass a previous store unless -noalias is set.

       6. A load has to wait until an older load barrier is fully executed.

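       Ignoring the barrier rules (3 and 6), the remaining pairwise
       ordering rules reduce to a small predicate. The sketch below is
       illustrative, not LSUnit code, and it also ignores rule 1's
       "no intervening stores" condition by looking at only one pair of
       operations at a time:

```python
def may_pass(younger, older, noalias=True):
    """May a younger memory operation pass an older one?

    Encodes rules 1, 2, 4, and 5 above. 'younger' and 'older' are
    either "load" or "store"; barriers are omitted for brevity.
    """
    if younger == "store":
        return False        # rules 1 and 2: stores never pass anything
    if older == "load":
        return True         # rule 4: a load may pass a previous load
    return noalias          # rule 5: load passes store only with -noalias
```

       For example, a load may pass an older store under the default
       -noalias=true assumption, but not once that assumption is dropped.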
AUTHOR
       Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT
       2003-2022, LLVM Project

8                                 2022-01-20                      LLVM-MCA(1)