LLVM-MCA(1)                          LLVM                         LLVM-MCA(1)

NAME
       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS
       llvm-mca [options] [input]

DESCRIPTION
       llvm-mca is a performance analysis tool that uses information
       available in LLVM (e.g. scheduling models) to statically measure the
       performance of machine code on a specific CPU.

       Performance is measured in terms of throughput as well as processor
       resource consumption. The tool currently works for processors with
       an out-of-order backend, for which there is a scheduling model
       available in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help with diagnosing
       potential performance issues.

       Given an assembly code sequence, llvm-mca estimates the Instructions
       Per Cycle (IPC), as well as hardware resource pressure. The analysis
       and reporting style were inspired by the IACA tool from Intel.

       For example, you can compile code with clang, output assembly, and
       pipe it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

       Scheduling models are not only used to compute instruction latencies
       and throughput, but also to understand what processor resources are
       available and how to simulate them.

       By design, the quality of the analysis conducted by llvm-mca is
       inevitably affected by the quality of the scheduling models in LLVM.

       If you see that the performance report is not accurate for a
       processor, please file a bug against the appropriate backend.

OPTIONS
       If input is "-" or omitted, llvm-mca reads from standard input.
       Otherwise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca will send its output to
       standard output if the input is from standard input. If the -o
       option specifies "-", then the output will also be sent to standard
       output.

       -help  Print a summary of command line options.

       -o <filename>
              Use <filename> as the output filename. See the summary above
              for more details.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code. It
              defaults to the host default target.

       -mcpu=<cpuname>
              Specify the processor for which to analyze the code. By
              default, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated
              by the tool. On x86, possible values are [0, 1]. A value of 0
              selects the AT&T assembly format, while a value of 1 selects
              the Intel assembly format for the code printed out by the
              tool in the analysis report.

       -dispatch=<width>
              Specify a different dispatch width for the processor. The
              dispatch width defaults to field 'IssueWidth' in the
              processor scheduling model. If width is zero, then the
              default dispatch width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this
              flag limits how many physical registers are available for
              register renaming purposes. A value of zero for this flag
              means "unlimited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run. If this flag is set
              to 0, then the tool sets the number of iterations to a
              default value (i.e. 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias.
              This is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the load queue. A value of
              zero for this flag is ignored, and the default load queue
              size is used instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the store queue. A value of
              zero for this flag is ignored, and the default store queue
              size is used instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view.
              By default, the timeline view prints information for up to 10
              iterations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view. By default,
              the number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view. This is enabled by
              default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics. This view collects and
              analyzes instruction dispatch events, as well as
              static/dynamic dispatch stall events. This view is disabled
              by default.

       -scheduler-stats
              Enable extra scheduler statistics. This view collects and
              analyzes instruction issue events. This view is disabled by
              default.

       -retire-stats
              Enable extra retire control unit statistics. This view is
              disabled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -all-stats
              Print all hardware statistics. This enables extra statistics
              related to the dispatch logic, the hardware schedulers, the
              register file(s), and the retire control unit. This option is
              disabled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Print resource pressure information based on the static
              information available from the processor model. This differs
              from the resource pressure view because it doesn't require
              that the code is simulated. It instead prints the theoretical
              uniform distribution of resource pressure for every
              instruction in sequence.

       -bottleneck-analysis
              Print information about bottlenecks that affect the
              throughput. This analysis can be expensive, and it is
              disabled by default. Bottlenecks are highlighted in the
              summary view.

EXIT STATUS
       llvm-mca returns 0 on success. Otherwise, an error message is
       printed to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE REGIONS
       llvm-mca allows for the optional usage of special code comments to
       mark regions of the assembly code to be analyzed. A comment starting
       with substring LLVM-MCA-BEGIN marks the beginning of a code region.
       A comment starting with substring LLVM-MCA-END marks the end of a
       code region. For example:

          # LLVM-MCA-BEGIN
          ...
          # LLVM-MCA-END

       If no user-defined region is specified, then llvm-mca assumes a
       default region which contains every instruction in the input file.
       Every region is analyzed in isolation, and the final performance
       report is the union of all the reports generated for every code
       region.

       Code regions can have names. For example:

          # LLVM-MCA-BEGIN A simple example
          add %eax, %eax
          # LLVM-MCA-END

       The code from the example above defines a region named "A simple
       example" with a single instruction in it. Note how the region name
       doesn't have to be repeated in the LLVM-MCA-END directive. In the
       absence of overlapping regions, an anonymous LLVM-MCA-END directive
       always ends the currently active user-defined region.

       Example of nesting regions:

          # LLVM-MCA-BEGIN foo
          add %eax, %edx
          # LLVM-MCA-BEGIN bar
          sub %eax, %edx
          # LLVM-MCA-END bar
          # LLVM-MCA-END foo

       Example of overlapping regions:

          # LLVM-MCA-BEGIN foo
          add %eax, %edx
          # LLVM-MCA-BEGIN bar
          sub %eax, %edx
          # LLVM-MCA-END foo
          add %eax, %edx
          # LLVM-MCA-END bar

       Note that multiple anonymous regions cannot overlap. Also,
       overlapping regions cannot have the same name.
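
       As an illustration, the marker semantics described above can be
       sketched in a few lines of Python. This is a hypothetical helper for
       exposition only, not part of llvm-mca: a named LLVM-MCA-END closes
       the region with the matching name, while an anonymous LLVM-MCA-END
       closes the most recently opened region.

```python
# Hypothetical sketch of the region-marker semantics (not llvm-mca code).
def parse_regions(lines):
    open_regions = []   # stack of (name, collected instructions)
    finished = {}       # region name -> list of instructions
    anon_count = 0
    for line in lines:
        text = line.strip()
        if text.startswith("# LLVM-MCA-BEGIN"):
            name = text[len("# LLVM-MCA-BEGIN"):].strip()
            if not name:                       # anonymous region
                name = "anonymous-%d" % anon_count
                anon_count += 1
            open_regions.append((name, []))
        elif text.startswith("# LLVM-MCA-END"):
            name = text[len("# LLVM-MCA-END"):].strip()
            if name:                           # close the named region
                idx = next(i for i, (n, _) in enumerate(open_regions)
                           if n == name)
            else:                              # close the most recent one
                idx = len(open_regions) - 1
            region_name, body = open_regions.pop(idx)
            finished[region_name] = body
        else:                                  # instruction line: goes to
            for _, body in open_regions:       # every currently open region
                body.append(text)
    return finished

# The overlapping-regions example: "sub" belongs to both foo and bar.
regions = parse_regions([
    "# LLVM-MCA-BEGIN foo",
    "add %eax, %edx",
    "# LLVM-MCA-BEGIN bar",
    "sub %eax, %edx",
    "# LLVM-MCA-END foo",
    "add %eax, %edx",
    "# LLVM-MCA-END bar",
])
```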

       There is no support for marking regions from high-level source code,
       like C or C++. As a workaround, inline assembly directives may be
       used:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo");
            a += 42;
            __asm volatile("# LLVM-MCA-END");
            a *= b;
            return a;
          }

       However, this interferes with optimizations like loop vectorization
       and may have an impact on the code generated. This is because the
       __asm statements are seen as real code having important side
       effects, which limits how the code around them can be transformed.
       If users want to make use of inline assembly to emit markers, then
       the recommendation is to always verify that the output assembly is
       equivalent to the assembly generated in the absence of markers. The
       Clang options to emit optimization reports can also help in
       detecting missed optimizations.

HOW LLVM-MCA WORKS
       llvm-mca takes assembly code as input. The assembly code is parsed
       into a sequence of MCInst with the help of the existing LLVM target
       assembly parsers. The parsed sequence of MCInst is then analyzed by
       a Pipeline module to generate a performance report.

       The Pipeline module simulates the execution of the machine code
       sequence in a loop of iterations (default is 100). During this
       process, the pipeline collects a number of execution related
       statistics. At the end of this process, the pipeline generates and
       prints a report from the collected statistics.

       Here is an example of a performance report generated by the tool for
       a dot-product of two packed float vectors of four elements. The
       analysis is conducted for target x86, cpu btver2. The following
       result can be produced via the following command using the example
       located at test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Total uOps:        900

          Dispatch Width:    2
          uOps Per Cycle:    1.48
          IPC:               1.48
          Block RThroughput: 2.0

          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps   %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps  %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps  %xmm3, %xmm3, %xmm4

          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL

          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps   %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm3, %xmm3, %xmm4

       According to this report, the dot-product kernel has been executed
       300 times, for a total of 900 simulated instructions. The total
       number of simulated micro opcodes (uOps) is also 900.

       The report is structured in three main sections. The first section
       collects a few performance numbers; the goal of this section is to
       give a very quick overview of the performance throughput. Important
       performance indicators are IPC, uOps Per Cycle, and Block
       RThroughput (Block Reciprocal Throughput).

       IPC is computed by dividing the total number of simulated
       instructions by the total number of cycles. In the absence of
       loop-carried data dependencies, the observed IPC tends to a
       theoretical maximum which can be computed by dividing the number of
       instructions of a single iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is computed by dividing the total number of
       simulated micro opcodes by the total number of cycles. A delta
       between Dispatch Width and this field is an indicator of a
       performance issue. In the absence of loop-carried data dependencies,
       the observed 'uOps Per Cycle' should tend to a theoretical maximum
       throughput which can be computed by dividing the number of uOps of a
       single iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is bounded from above by the dispatch width.
       That is because the dispatch width limits the maximum size of a
       dispatch group. Both IPC and 'uOps Per Cycle' are limited by the
       amount of hardware parallelism. The availability of hardware
       resources affects the resource pressure distribution, and it limits
       the number of instructions that can be executed in parallel every
       cycle. A delta between Dispatch Width and the theoretical maximum
       uOps per Cycle (computed by dividing the number of uOps of a single
       iteration by the Block RThroughput) is an indicator of a performance
       bottleneck caused by the lack of hardware resources. In general, the
       lower the Block RThroughput, the better.

       In this example, uOps per iteration/Block RThroughput is 1.50. Since
       there are no loop-carried dependencies, the observed uOps Per Cycle
       is expected to approach 1.50 when the number of iterations tends to
       infinity. The delta between the Dispatch Width (2.00) and the
       theoretical maximum uOp throughput (1.50) is an indicator of a
       performance bottleneck caused by the lack of hardware resources, and
       the Resource pressure view can help to identify the problematic
       resource usage.
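
       The summary arithmetic described above can be checked with a few
       lines of Python, using the numbers taken directly from the report
       shown earlier (300 iterations of a 3-instruction, 3-uOp kernel, 610
       total cycles):

```python
# Reproduce the first-section numbers of the dot-product report.
instructions = 900          # 300 iterations x 3 instructions
uops = 900                  # each instruction decodes to a single uOp here
cycles = 610                # Total Cycles from the report
uops_per_iteration = 3
block_rthroughput = 2.0     # Block RThroughput from the report

ipc = instructions / cycles                  # observed IPC
uops_per_cycle = uops / cycles               # observed uOps Per Cycle
theoretical_max = uops_per_iteration / block_rthroughput  # upper bound

print(round(ipc, 2), round(uops_per_cycle, 2), theoretical_max)
```

       As expected, both observed values are 1.48, below the theoretical
       maximum of 1.50 and well below the Dispatch Width of 2.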

       The second section of the report shows the latency and reciprocal
       throughput of every instruction in the sequence. That section also
       reports extra information related to the number of micro opcodes,
       and opcode properties (i.e., 'MayLoad', 'MayStore', and
       'HasSideEffects').

       The third section is the Resource pressure view. This view reports
       the average number of resource cycles consumed every iteration by
       instructions for every processor resource unit available on the
       target. Information is structured in two tables. The first table
       reports the number of resource cycles spent on average every
       iteration. The second table correlates the resource cycles to the
       machine instructions in the sequence. For example, every iteration
       of the instruction vmulps always executes on resource unit [6]
       (JFPU1 - floating point pipeline #1), consuming an average of 1
       resource cycle per iteration. Note that on AMD Jaguar, vector
       floating-point multiply can only be issued to pipeline JFPU1, while
       horizontal floating-point additions can only be issued to pipeline
       JFPU0.

       The resource pressure view helps with identifying bottlenecks caused
       by high usage of specific hardware resources. Situations with
       resource pressure mainly concentrated on a few resources should, in
       general, be avoided. Ideally, pressure should be uniformly
       distributed between multiple resources.

   Timeline View
       The timeline view produces a detailed report of each instruction's
       state transitions through an instruction pipeline. This view is
       enabled by the command line option -timeline. As instructions
       transition through the various stages of the pipeline, their states
       are depicted in the view report. These states are represented by the
       following characters:

       • D : Instruction dispatched.

       • e : Instruction executing.

       • E : Instruction executed.

       • R : Instruction retired.

       • = : Instruction already dispatched, waiting to be executed.

       • - : Instruction executed, waiting to be retired.
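
       A timeline row is just a string of these state characters, one per
       cycle, so its meaning can be recovered mechanically. The following
       hypothetical decoder (for exposition only, not part of llvm-mca)
       counts how many cycles an instruction spends in each state; '.'
       marks cycles outside the instruction's lifetime:

```python
# Decode one timeline row into per-state cycle counts.
def decode_timeline(row):
    counts = {"D": 0, "=": 0, "e": 0, "E": 0, "-": 0, "R": 0}
    for state in row:
        if state in counts:
            counts[state] += 1
    return counts

# Row for instruction [1,0] of the dot-product example: dispatched at
# cycle 1, executing for two cycles, write-back at cycle 4, then five
# cycles waiting to retire in program order.
states = decode_timeline(".DeeE-----R")
```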

       Below is the timeline view for a subset of the dot-product example
       located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and
       processed by llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4

          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3   vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0   vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0   vhaddps  %xmm3, %xmm3, %xmm4

       The timeline view is interesting because it shows instruction state
       changes during execution. It also gives an idea of how the tool
       processes instructions executed on the target, and how their timing
       information might be calculated.

       The timeline view is structured in two tables. The first table shows
       instructions changing state over time (measured in cycles); the
       second table (named Average Wait times) reports useful timing
       statistics, which should help diagnose performance bottlenecks
       caused by long data dependencies and sub-optimal usage of hardware
       resources.

       An instruction in the timeline view is identified by a pair of
       indices, where the first index identifies an iteration, and the
       second index is the instruction index (i.e., where it appears in the
       code sequence). Since this example was generated using 3 iterations
       (-iterations=3), the iteration indices range from 0 to 2 inclusive.

       Excluding the first and last column, the remaining columns are in
       cycles. Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       • Instruction [1,0] was dispatched at cycle 1.

       • Instruction [1,0] started executing at cycle 2.

       • Instruction [1,0] reached the write back stage at cycle 4.

       • Instruction [1,0] was retired at cycle 10.

       Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
       wait in the scheduler's queue for the operands to become available.
       By the time vmulps is dispatched, operands are already available,
       and pipeline JFPU1 is ready to serve another instruction. So the
       instruction can be immediately issued on the JFPU1 pipeline. That is
       demonstrated by the fact that the instruction only spent 1cy in the
       scheduler's queue.

       There is a gap of 5 cycles between the write-back stage and the
       retire event. That is because instructions must retire in program
       order, so [1,0] has to wait for [0,2] to be retired first (i.e., it
       has to wait until cycle 10).

       In the example, all instructions are in a RAW (Read After Write)
       dependency chain. Register %xmm2 written by vmulps is immediately
       used by the first vhaddps, and register %xmm3 written by the first
       vhaddps is used by the second vhaddps. Long data dependencies
       negatively impact the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies introduced
       by instructions from different iterations. However, those
       dependencies can be removed at the register renaming stage (at the
       cost of allocating register aliases, and therefore consuming
       physical registers).

       Table Average Wait times helps diagnose performance issues that are
       caused by the presence of long latency instructions and potentially
       long data dependencies which may limit the ILP. Note that llvm-mca,
       by default, assumes at least 1cy between the dispatch event and the
       issue event.

       When the performance is limited by data dependencies and/or long
       latency instructions, the number of cycles spent while in the ready
       state is expected to be very small when compared with the total
       number of cycles spent in the scheduler's queue. The difference
       between the two counters is a good indicator of how large of an
       impact data dependencies had on the execution of the instructions.
       When performance is mostly limited by the lack of hardware
       resources, the delta between the two counters is small. However,
       the number of cycles spent in the queue tends to be larger (i.e.,
       more than 1-3cy), especially when compared to other low latency
       instructions.
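
       This interpretation can be made concrete with the numbers from the
       Average Wait times table above. The gap between time-in-queue
       (column [1]) and time-ready (column [2]) estimates how long each
       instruction sat waiting for its operands:

```python
# Values from the Average Wait times table of the dot-product example.
vmulps_queue, vmulps_ready = 1.0, 1.0         # instruction 0
vhaddps2_queue, vhaddps2_ready = 5.7, 0.0     # instruction 2

# Cycles spent waiting for operands (queue time minus ready time).
vmulps_dep_stall = vmulps_queue - vmulps_ready          # 0.0: never starved
vhaddps2_dep_stall = vhaddps2_queue - vhaddps2_ready    # 5.7: dependency-bound
```

       The second vhaddps spends all of its queue time waiting for
       operands, which is consistent with the conclusion that this kernel
       is limited by its data dependency chain.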

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and
       performance counters for the dispatch logic, the reorder buffer, the
       retire control unit, and the register file.

       Below is an example of -all-stats output generated by llvm-mca for
       300 iterations of the dot-product example discussed in the previous
       sections.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272  (44.6%)
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0

          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)

          Schedulers - number of cycles where we saw N micro opcodes issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)

          Scheduler's queue usage:
          [1] Resource name.
          [2] Average number of used buffer entries.
          [3] Maximum number of used buffer entries.
          [4] Total number of buffer entries.

           [1]            [2]        [3]        [4]
          JALU01           0          0          20
          JFPU01           17         18         18
          JLSAGU           0          0          12

          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
           2,           399  (65.4%)

          Total ROB Entries:                64
          Max Used ROB Entries:             35  ( 54.7% )
          Average Used ROB Entries per cy:  32  ( 50.0% )

          Register File statistics:
          Total number of mappings created:    900
          Max number of mappings used:         35

          *  Register File #1 -- JFpuPRF:
             Number of physical registers:     72
             Total number of mappings created: 900
             Max number of mappings used:      35

          *  Register File #2 -- JIntegerPRF:
             Number of physical registers:     64
             Total number of mappings created: 0
             Max number of mappings used:      0

       If we look at the Dynamic Dispatch Stall Cycles table, we see that
       the counter for SCHEDQ reports 272 cycles. This counter is
       incremented every time the dispatch logic is unable to dispatch a
       full group because the scheduler's queue is full.

       Looking at the Dispatch Logic table, we see that the pipeline was
       only able to dispatch two micro opcodes 51.5% of the time. The
       dispatch group was limited to one micro opcode 44.6% of the cycles,
       which corresponds to 272 cycles. The dispatch statistics are
       displayed by using either the command option -all-stats or
       -dispatch-stats.

       The next table, Schedulers, presents a histogram displaying a count,
       representing the number of micro opcodes issued on some number of
       cycles. In this case, of the 610 simulated cycles, single opcodes
       were issued 306 times (50.2%) and there were 7 cycles where no
       opcodes were issued.

       The Scheduler's queue usage table shows the average and maximum
       number of buffer entries (i.e., scheduler queue entries) used at
       runtime. Resource JFPU01 reached its maximum (18 of 18 queue
       entries). Note that AMD Jaguar implements three schedulers:

       • JALU01 - A scheduler for ALU instructions.

       • JFPU01 - A scheduler for floating point operations.

       • JLSAGU - A scheduler for address generation.

       The dot-product is a kernel of three floating point instructions (a
       vector multiply followed by two horizontal adds). That explains why
       only the floating point scheduler appears to be used.

       A full scheduler queue is either caused by data dependency chains or
       by a sub-optimal usage of hardware resources. Sometimes, resource
       pressure can be mitigated by rewriting the kernel using different
       instructions that consume different scheduler resources. Schedulers
       with a small queue are less resilient to bottlenecks caused by the
       presence of long data dependencies. The scheduler statistics are
       displayed by using the command option -all-stats or
       -scheduler-stats.

       The next table, Retire Control Unit, presents a histogram displaying
       a count, representing the number of instructions retired on some
       number of cycles. In this case, of the 610 simulated cycles, two
       instructions were retired during the same cycle 399 times (65.4%)
       and there were 109 cycles where no instructions were retired. The
       retire statistics are displayed by using the command option
       -all-stats or -retire-stats.

       The last table presented is Register File statistics. Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for
       floating-point registers (JFpuPRF) and one for integer registers
       (JIntegerPRF). The table shows that of the 900 instructions
       processed, there were 900 mappings created. Since this dot-product
       example utilized only floating point registers, the JFpuPRF was
       responsible for creating the 900 mappings. However, we see that the
       pipeline only used a maximum of 35 of 72 available register slots
       at any given time. We can conclude that the floating point PRF was
       the only register file used for the example, and that it was never
       resource constrained. The register file statistics are displayed by
       using the command option -all-stats or -register-file-stats.

       In this example, we can conclude that the IPC is mostly limited by
       data dependencies, and not by resource pressure.

   Instruction Flow
       This section describes the instruction flow through the default
       pipeline of llvm-mca, as well as the functional units involved in
       the process.

       The default pipeline implements the following sequence of stages
       used to process instructions.

       • Dispatch (Instruction is dispatched to the schedulers).

       • Issue (Instruction is issued to the processor pipelines).

       • Write Back (Instruction is executed, and results are written
         back).

       • Retire (Instruction is retired; writes are architecturally
         committed).

       The default pipeline only models the out-of-order portion of a
       processor. Therefore, the instruction fetch and decode stages are
       not modeled. Performance bottlenecks in the frontend are not
       diagnosed. llvm-mca assumes that instructions have all been decoded
       and placed into a queue before the simulation starts. Also, llvm-mca
       does not model branch prediction.

   Instruction Dispatch
       During the dispatch stage, instructions are picked in program order
       from a queue of already decoded instructions, and dispatched in
       groups to the simulated hardware schedulers.

       The size of a dispatch group depends on the availability of the
       simulated hardware resources. The processor dispatch width defaults
       to the value of the IssueWidth in LLVM's scheduling model.

       An instruction can be dispatched if:

       • The size of the dispatch group is smaller than the processor's
         dispatch width.

       • There are enough entries in the reorder buffer.

       • There are enough physical registers to do register renaming.

       • The schedulers are not full.

       Scheduling models can optionally specify which register files are
       available on the processor. llvm-mca uses that information to
       initialize register file descriptors. Users can limit the number of
       physical registers that are globally available for register renaming
       by using the command option -register-file-size. A value of zero for
       this option means unbounded. By knowing how many registers are
       available for renaming, the tool can predict dispatch stalls caused
       by the lack of physical registers.
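
       The dispatch conditions above amount to a simple predicate. The
       following sketch is illustrative only (the names are hypothetical,
       not llvm-mca internals), with the Jaguar/btver2 numbers from the
       earlier report plugged in as an example:

```python
# Hedged sketch of the four dispatch hazards: group size, reorder buffer
# capacity, rename registers, and scheduler queue space.
def can_dispatch(group_uops, dispatch_width, rob_free, uops_needed,
                 phys_regs_free, renames_needed, scheduler_full):
    return (group_uops < dispatch_width       # room left in the group
            and rob_free >= uops_needed       # reorder buffer entries
            and phys_regs_free >= renames_needed  # rename registers
            and not scheduler_full)           # scheduler queue space

# Example: one uOp already in the group, dispatch width 2, 64 ROB
# entries and 72 FP rename registers free, scheduler not full.
ok = can_dispatch(group_uops=1, dispatch_width=2, rob_free=64,
                  uops_needed=1, phys_regs_free=72, renames_needed=1,
                  scheduler_full=False)
```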

       The number of reorder buffer entries consumed by an instruction
       depends on the number of micro-opcodes specified for that
       instruction by the target scheduling model. The reorder buffer is
       responsible for tracking the progress of instructions that are
       "in-flight", and retiring them in program order. The number of
       entries in the reorder buffer defaults to the value specified by
       field MicroOpBufferSize in the target scheduling model.

       Instructions that are dispatched to the schedulers consume scheduler
       buffer entries. llvm-mca queries the scheduling model to determine
       the set of buffered resources consumed by an instruction. Buffered
       resources are treated like scheduler resources.

   Instruction Issue
       Each processor scheduler implements a buffer of instructions. An
       instruction has to wait in the scheduler's buffer until input
       register operands become available. Only at that point does the
       instruction become eligible for execution and may be issued
       (potentially out-of-order) for execution. Instruction latencies are
       computed by llvm-mca with the help of the scheduling model.

       llvm-mca's scheduler is designed to simulate multiple processor
       schedulers. The scheduler is responsible for tracking data
       dependencies, and dynamically selecting which processor resources
       are consumed by instructions. It delegates the management of
       processor resource units and resource groups to a resource manager.
       The resource manager is responsible for selecting resource units
       that are consumed by instructions. For example, if an instruction
       consumes 1cy of a resource group, the resource manager selects one
       of the available units from the group; by default, the resource
       manager uses a round-robin selector to guarantee that resource
       usage is uniformly distributed between all units of a group.

       llvm-mca's scheduler internally groups instructions into three sets:

       • WaitSet: a set of instructions whose operands are not ready.

       • ReadySet: a set of instructions ready to execute.

       • IssuedSet: a set of instructions executing.

       Depending on the operands availability, instructions that are
       dispatched to the scheduler are either placed into the WaitSet or
       into the ReadySet.

       Every cycle, the scheduler checks if instructions can be moved from
       the WaitSet to the ReadySet, and if instructions from the ReadySet
       can be issued to the underlying pipelines. The algorithm prioritizes
       older instructions over younger instructions.
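
       The per-cycle bookkeeping just described can be sketched as follows.
       This is a deliberately minimal model (instruction age is represented
       by a program-order index, and the helper names are hypothetical);
       the real scheduler tracks far more state:

```python
# One simulated cycle of the WaitSet/ReadySet/IssuedSet bookkeeping.
def cycle_update(wait_set, ready_set, issued_set, operands_ready, width):
    # WaitSet -> ReadySet: operands became available this cycle.
    for instr in list(wait_set):
        if operands_ready(instr):
            wait_set.remove(instr)
            ready_set.append(instr)
    # ReadySet -> IssuedSet: issue up to `width` instructions, oldest
    # (lowest program-order index) first.
    ready_set.sort()
    issued = ready_set[:width]
    del ready_set[:width]
    issued_set.extend(issued)
    return issued

# Instructions 0 and 1 are ready, 2 and 3 wait; instruction 2's operands
# arrive this cycle; issue width is 1, so the oldest ready one goes first.
wait, ready, issued = [2, 3], [0, 1], []
just_issued = cycle_update(wait, ready, issued,
                           operands_ready=lambda i: i == 2, width=1)
```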

   Write-Back and Retire Stage
       Issued instructions are moved from the ReadySet to the IssuedSet.
       There, instructions wait until they reach the write-back stage. At
       that point, they get removed from the queue and the retire control
       unit is notified.

       When instructions are executed, the retire control unit flags the
       instruction as "ready to retire."

       Instructions are retired in program order. The register file is
       notified of the retirement so that it can free the physical
       registers that were allocated for the instruction during the
       register renaming stage.
753
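In-order retirement can be sketched as follows; the function, the
reorder-buffer representation, and the register names are assumptions
made for illustration, not llvm-mca's actual data structures:

```python
# Hedged sketch of in-order retirement: only the oldest instruction may
# retire, and retiring returns the physical registers allocated by the
# rename stage to the free list.
def retire_in_order(rob, free_list):
    """rob: list of (seq, done, phys_regs) tuples in program order."""
    retired = []
    while rob and rob[0][1]:          # is the oldest instruction done?
        seq, _done, phys_regs = rob.pop(0)
        free_list.extend(phys_regs)   # free registers on retirement
        retired.append(seq)
    return retired

rob = [(0, True, ["P5"]), (1, False, ["P6"]), (2, True, ["P7"])]
free = []
retired = retire_in_order(rob, free)
# instruction 2 is done, but must wait behind instruction 1
```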
Load/Store Unit and Memory Consistency Model
To simulate an out-of-order execution of memory operations, llvm-mca
uses a simulated load/store unit (LSUnit) to model the speculative
execution of loads and stores.

Each load (or store) consumes an entry in the load (or store) queue.
Users can specify the flags -lqueue and -squeue to limit the number of
entries in the load and store queues respectively. The queues are
unbounded by default.

The LSUnit implements a relaxed consistency model for memory loads and
stores. The rules are:

1. A younger load is allowed to pass an older load only if there are no
   intervening stores or barriers between the two loads.

2. A younger load is allowed to pass an older store provided that the
   load does not alias with the store.

3. A younger store is not allowed to pass an older store.

4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias
with store operations (-noalias=true). Under this assumption, younger
loads are always allowed to pass older stores. Essentially, the LSUnit
does not attempt to run any alias analysis to predict when loads and
stores do not alias with each other.

Note that, in the case of write-combining memory, rule 3 could be
relaxed to allow reordering of non-aliasing store operations. That
said, at the moment, there is no way to further relax the memory model
(-noalias is the only option). Essentially, there is no option to
specify a different memory type (e.g., write-back, write-combining,
write-through, etc.) and consequently to weaken, or strengthen, the
memory model.

Other limitations are:

• The LSUnit does not know when store-to-load forwarding may occur.

• The LSUnit does not know anything about cache hierarchy and memory
  types.

• The LSUnit does not know how to identify serializing operations and
  memory fences.

The LSUnit does not attempt to predict if a load or store hits or
misses the L1 cache. It only knows if an instruction "MayLoad" and/or
"MayStore." For loads, the scheduling model provides an "optimistic"
load-to-use latency (which usually matches the load-to-use latency for
when there is a hit in the L1D).

llvm-mca does not know about serializing operations or memory-barrier
like instructions. The LSUnit conservatively assumes that an
instruction which has both "MayLoad" and unmodeled side effects behaves
like a "soft" load barrier. That means, it serializes loads without
forcing a flush of the load queue. Similarly, instructions that
"MayStore" and have unmodeled side effects are treated like store
barriers. A full memory barrier is a "MayLoad" and "MayStore"
instruction with unmodeled side effects. This is inaccurate, but it is
the best that we can do at the moment with the current information
available in LLVM.

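The classification above can be summarized in a small decision
function. This is a sketch of the stated rules only (the function name
is an assumption, not an LLVM API):

```python
# Classify an instruction as a barrier from the three flags llvm-mca
# actually has: "MayLoad", "MayStore", and unmodeled side effects.
def classify_barrier(may_load, may_store, has_unmodeled_side_effects):
    if not has_unmodeled_side_effects:
        return None                    # not treated as any barrier
    if may_load and may_store:
        return "full memory barrier"
    if may_load:
        return "load barrier"          # serializes loads, no queue flush
    if may_store:
        return "store barrier"
    return None

assert classify_barrier(True, True, True) == "full memory barrier"
assert classify_barrier(True, False, True) == "load barrier"
assert classify_barrier(True, False, False) is None
```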
A load/store barrier consumes one entry of the load/store queue. A
load/store barrier enforces ordering of loads/stores. A younger load
cannot pass a load barrier. Also, a younger store cannot pass a store
barrier. A younger load has to wait for the memory/load barrier to
execute. A load/store barrier is "executed" when it becomes the oldest
entry in the load/store queue(s). That also means, by construction,
all of the older loads/stores have been executed.

In conclusion, the full set of load/store consistency rules are:

1. A store may not pass a previous store.

2. A store may not pass a previous load (regardless of -noalias).

3. A store has to wait until an older store barrier is fully executed.

4. A load may pass a previous load.

5. A load may not pass a previous store unless -noalias is set.

6. A load has to wait until an older load barrier is fully executed.

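The non-barrier rules above (1, 2, 4, and 5) can be condensed into a
single predicate. This is an illustrative encoding of the rules as
stated, not llvm-mca's implementation; barrier rules 3 and 6 involve
queue state and are deliberately left out:

```python
# May a younger memory operation execute before (pass) an older one?
# younger/older are each "load" or "store".
def may_pass(younger, older, noalias=True):
    if younger == "store":
        return False      # rules 1 and 2: a store passes nothing
    if older == "load":
        return True       # rule 4: a load may pass a previous load
    return noalias        # rule 5: a load passes a store only if -noalias

assert may_pass("load", "load") is True
assert may_pass("load", "store") is True            # -noalias default
assert may_pass("load", "store", noalias=False) is False
assert may_pass("store", "load") is False
```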
AUTHOR
Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT
2003-2021, LLVM Project

                               2021-07-22                      LLVM-MCA(1)