LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)


NAME
llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS
llvm-mca [options] [input]

DESCRIPTION
llvm-mca is a performance analysis tool that uses information available
in LLVM (e.g. scheduling models) to statically measure the performance
of machine code in a specific CPU.

Performance is measured in terms of throughput as well as processor
resource consumption. The tool currently works for processors with an
out-of-order backend, for which there is a scheduling model available
in LLVM.

The main goal of this tool is not just to predict the performance of
the code when run on the target, but also to help with diagnosing
potential performance issues.

Given an assembly code sequence, llvm-mca estimates the Instructions
Per Cycle (IPC), as well as hardware resource pressure. The analysis
and reporting style were inspired by the IACA tool from Intel.

For example, you can compile code with clang, output assembly, and pipe
it directly into llvm-mca for analysis:

  $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

Or for Intel syntax:

  $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

(llvm-mca detects Intel syntax by the presence of an .intel_syntax
directive at the beginning of the input. By default its output syntax
matches that of its input.)

Scheduling models are not just used to compute instruction latencies
and throughput, but also to understand what processor resources are
available and how to simulate them.

By design, the quality of the analysis conducted by llvm-mca is
inevitably affected by the quality of the scheduling models in LLVM.

If you see that the performance report is not accurate for a
processor, please file a bug against the appropriate backend.

OPTIONS
If input is "-" or omitted, llvm-mca reads from standard input.
Otherwise, it will read from the specified filename.

If the -o option is omitted, then llvm-mca will send its output to
standard output if the input is from standard input. If the -o option
specifies "-", then the output will also be sent to standard output.
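
For example, the following invocation (with hypothetical file names)
reads assembly from foo.s and writes the analysis report to foo.report:

  $ llvm-mca -mcpu=btver2 foo.s -o foo.report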

-help  Print a summary of command line options.

-o <filename>
       Use <filename> as the output filename. See the summary above for
       more details.

-mtriple=<target triple>
       Specify a target triple string.

-march=<arch>
       Specify the architecture for which to analyze the code. It
       defaults to the host default target.

-mcpu=<cpuname>
       Specify the processor for which to analyze the code. By default,
       the cpu name is autodetected from the host.

-output-asm-variant=<variant id>
       Specify the output assembly variant for the report generated by
       the tool. On x86, possible values are [0, 1]. A value of 0
       selects the AT&T assembly format, while a value of 1 selects the
       Intel assembly format for the code printed out by the tool in
       the analysis report.

-print-imm-hex
       Prefer hex format for numeric literals in the output assembly
       printed as part of the report.

-dispatch=<width>
       Specify a different dispatch width for the processor. The
       dispatch width defaults to field 'IssueWidth' in the processor
       scheduling model. If width is zero, then the default dispatch
       width is used.
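
For example, assuming a hypothetical input file foo.s, the following
command simulates a variant of btver2 with a dispatch width of 4
(instead of its default of 2), as a what-if experiment:

  $ llvm-mca -mcpu=btver2 -dispatch=4 foo.s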

-register-file-size=<size>
       Specify the size of the register file. When specified, this flag
       limits how many physical registers are available for register
       renaming purposes. A value of zero for this flag means
       "unlimited number of physical registers".

-iterations=<number of iterations>
       Specify the number of iterations to run. If this flag is set to
       0, then the tool sets the number of iterations to a default
       value (i.e. 100).

-noalias=<bool>
       If set, the tool assumes that loads and stores don't alias. This
       is the default behavior.

-lqueue=<load queue size>
       Specify the size of the load queue in the load/store unit
       emulated by the tool. By default, the tool assumes an unbounded
       number of entries in the load queue. A value of zero for this
       flag is ignored, and the default load queue size is used
       instead.

-squeue=<store queue size>
       Specify the size of the store queue in the load/store unit
       emulated by the tool. By default, the tool assumes an unbounded
       number of entries in the store queue. A value of zero for this
       flag is ignored, and the default store queue size is used
       instead.
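
For example, the following command (with a hypothetical input file)
caps the simulated load and store queues at 8 entries each, instead of
the default unbounded queues:

  $ llvm-mca -mcpu=btver2 -lqueue=8 -squeue=8 memory-loop.s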

-timeline
       Enable the timeline view.

-timeline-max-iterations=<iterations>
       Limit the number of iterations to print in the timeline view. By
       default, the timeline view prints information for up to 10
       iterations.

-timeline-max-cycles=<cycles>
       Limit the number of cycles in the timeline view. By default, the
       number of cycles is set to 80.
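
For example, the following command (hypothetical input file) enables
the timeline view but restricts it to the first 2 iterations and the
first 50 cycles:

  $ llvm-mca -mcpu=btver2 -timeline -timeline-max-iterations=2 -timeline-max-cycles=50 foo.s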

-resource-pressure
       Enable the resource pressure view. This is enabled by default.

-register-file-stats
       Enable register file usage statistics.

-dispatch-stats
       Enable extra dispatch statistics. This view collects and
       analyzes instruction dispatch events, as well as static/dynamic
       dispatch stall events. This view is disabled by default.

-scheduler-stats
       Enable extra scheduler statistics. This view collects and
       analyzes instruction issue events. This view is disabled by
       default.

-retire-stats
       Enable extra retire control unit statistics. This view is
       disabled by default.

-instruction-info
       Enable the instruction info view. This is enabled by default.

-show-encoding
       Enable the printing of instruction encodings within the
       instruction info view.

-all-stats
       Print all hardware statistics. This enables extra statistics
       related to the dispatch logic, the hardware schedulers, the
       register file(s), and the retire control unit. This option is
       disabled by default.

-all-views
       Enable all the views.

-instruction-tables
       Prints resource pressure information based on the static
       information available from the processor model. This differs
       from the resource pressure view because it doesn't require the
       code to be simulated. It instead prints the theoretical uniform
       distribution of resource pressure for every instruction in
       sequence.
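
For example, the following command prints the theoretical resource
pressure distribution for a hypothetical input file foo.s without
running the simulation:

  $ llvm-mca -mcpu=btver2 -instruction-tables foo.s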

-bottleneck-analysis
       Print information about bottlenecks that affect the throughput.
       This analysis can be expensive, and it is disabled by default.
       Bottlenecks are highlighted in the summary view.

EXIT STATUS
llvm-mca returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
llvm-mca allows for the optional usage of special code comments to mark
regions of the assembly code to be analyzed. A comment starting with
substring LLVM-MCA-BEGIN marks the beginning of a code region. A
comment starting with substring LLVM-MCA-END marks the end of a code
region. For example:

  # LLVM-MCA-BEGIN
    ...
  # LLVM-MCA-END

If no user-defined region is specified, then llvm-mca assumes a default
region which contains every instruction in the input file. Every
region is analyzed in isolation, and the final performance report is
the union of all the reports generated for every code region.

Code regions can have names. For example:

  # LLVM-MCA-BEGIN A simple example
  add %eax, %eax
  # LLVM-MCA-END

The code from the example above defines a region named "A simple
example" with a single instruction in it. Note how the region name
doesn't have to be repeated in the LLVM-MCA-END directive. In the
absence of overlapping regions, an anonymous LLVM-MCA-END directive
always ends the currently active user-defined region.

Example of nesting regions:

  # LLVM-MCA-BEGIN foo
  add %eax, %edx
  # LLVM-MCA-BEGIN bar
  sub %eax, %edx
  # LLVM-MCA-END bar
  # LLVM-MCA-END foo

Example of overlapping regions:

  # LLVM-MCA-BEGIN foo
  add %eax, %edx
  # LLVM-MCA-BEGIN bar
  sub %eax, %edx
  # LLVM-MCA-END foo
  add %eax, %edx
  # LLVM-MCA-END bar

Note that multiple anonymous regions cannot overlap. Also, overlapping
regions cannot have the same name.

There is no support for marking regions from high-level source code,
like C or C++. As a workaround, inline assembly directives may be
used:

  int foo(int a, int b) {
    __asm volatile("# LLVM-MCA-BEGIN foo");
    a += 42;
    __asm volatile("# LLVM-MCA-END");
    a *= b;
    return a;
  }

However, this interferes with optimizations like loop vectorization
and may have an impact on the code generated. This is because the
__asm statements are seen as real code having important side effects,
which limits how the code around them can be transformed. If users
want to make use of inline assembly to emit markers, then the
recommendation is to always verify that the output assembly is
equivalent to the assembly generated in the absence of markers. The
Clang options to emit optimization reports can also help in detecting
missed optimizations.
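
For example, assuming foo.c contains the function above, Clang's
optimization remarks can show whether the markers inhibited a
transformation (the -Rpass family of flags belongs to Clang, not
llvm-mca):

  $ clang foo.c -O2 -S -o - -Rpass=loop-vectorize -Rpass-missed=loop-vectorize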

HOW LLVM-MCA WORKS
llvm-mca takes assembly code as input. The assembly code is parsed
into a sequence of MCInst with the help of the existing LLVM target
assembly parsers. The parsed sequence of MCInst is then analyzed by a
Pipeline module to generate a performance report.

The Pipeline module simulates the execution of the machine code
sequence in a loop of iterations (default is 100). During this
process, the pipeline collects a number of execution related
statistics. At the end of this process, the pipeline generates and
prints a report from the collected statistics.

Here is an example of a performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis
is conducted for target x86, cpu btver2. This report can be produced
with the following command, using the example located at
test/tools/llvm-mca/X86/BtVer2/dot-product.s:

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

  Iterations:        300
  Instructions:      900
  Total Cycles:      610
  Total uOps:        900

  Dispatch Width:    2
  uOps Per Cycle:    1.48
  IPC:               1.48
  Block RThroughput: 2.0


  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      2     1.00                        vmulps  %xmm0, %xmm1, %xmm2
   1      3     1.00                        vhaddps %xmm2, %xmm2, %xmm3
   1      3     1.00                        vhaddps %xmm3, %xmm3, %xmm4


  Resources:
  [0]   - JALU0
  [1]   - JALU1
  [2]   - JDiv
  [3]   - JFPA
  [4]   - JFPM
  [5]   - JFPU0
  [6]   - JFPU1
  [7]   - JLAGU
  [8]   - JMul
  [9]   - JSAGU
  [10]  - JSTC
  [11]  - JVALU0
  [12]  - JVALU1
  [13]  - JVIMUL


  Resource pressure per iteration:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps  %xmm0, %xmm1, %xmm2
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps %xmm2, %xmm2, %xmm3
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps %xmm3, %xmm3, %xmm4

According to this report, the dot-product kernel has been executed 300
times, for a total of 900 simulated instructions. The total number of
simulated micro opcodes (uOps) is also 900.

The report is structured in three main sections. The first section
collects a few performance numbers; the goal of this section is to
give a very quick overview of the performance throughput. Important
performance indicators are IPC, uOps Per Cycle, and Block RThroughput
(Block Reciprocal Throughput).

Field Dispatch Width is the maximum number of micro opcodes that are
dispatched to the out-of-order backend every simulated cycle.

IPC is computed by dividing the total number of simulated instructions
by the total number of cycles.

Field Block RThroughput is the reciprocal of the block throughput.
Block throughput is a theoretical quantity computed as the maximum
number of blocks (i.e. iterations) that can be executed per simulated
clock cycle in the absence of loop carried dependencies. Block
throughput is limited from above by the dispatch rate and by the
availability of hardware resources.

In the absence of loop-carried data dependencies, the observed IPC
tends to a theoretical maximum which can be computed by dividing the
number of instructions of a single iteration by the Block RThroughput.

Field 'uOps Per Cycle' is computed by dividing the total number of
simulated micro opcodes by the total number of cycles. A delta between
Dispatch Width and this field is an indicator of a performance issue.
In the absence of loop-carried data dependencies, the observed 'uOps
Per Cycle' should tend to a theoretical maximum throughput which can
be computed by dividing the number of uOps of a single iteration by
the Block RThroughput.

Field uOps Per Cycle is bounded from above by the dispatch width. That
is because the dispatch width limits the maximum size of a dispatch
group. Both IPC and 'uOps Per Cycle' are limited by the amount of
hardware parallelism. The availability of hardware resources affects
the resource pressure distribution, and it limits the number of
instructions that can be executed in parallel every cycle. A delta
between Dispatch Width and the theoretical maximum uOps per Cycle
(computed by dividing the number of uOps of a single iteration by the
Block RThroughput) is an indicator of a performance bottleneck caused
by the lack of hardware resources. In general, the lower the Block
RThroughput, the better.

In this example, uOps per iteration divided by Block RThroughput is
1.50 (i.e., 3 / 2.0). Since there are no loop-carried dependencies,
the observed uOps Per Cycle is expected to approach 1.50 when the
number of iterations tends to infinity. The delta between the Dispatch
Width (2.00), and the theoretical maximum uOp throughput (1.50) is an
indicator of a performance bottleneck caused by the lack of hardware
resources, and the Resource pressure view can help to identify the
problematic resource usage.

The second section of the report is the instruction info view. It
shows the latency and reciprocal throughput of every instruction in
the sequence. It also reports extra information related to the number
of micro opcodes, and opcode properties (i.e., 'MayLoad', 'MayStore',
and 'HasSideEffects').

Field RThroughput is the reciprocal of the instruction throughput.
Throughput is computed as the maximum number of instructions of a same
type that can be executed per clock cycle in the absence of operand
dependencies. In this example, the reciprocal throughput of a vector
float multiply is 1 cycle per instruction. That is because the FP
multiplier JFPM is only available from pipeline JFPU1.

Instruction encodings are displayed within the instruction info view
when flag -show-encoding is specified.
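
Assuming the same dot-product input used throughout this section, a
command along these lines produces that view:

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -show-encoding dot-product.s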

Below is an example of -show-encoding output for the dot-product
kernel:

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)
  [7]: Encoding Size

  [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:     Instructions:
   1      2     1.00                         4     c5 f0 59 d0    vmulps  %xmm0, %xmm1, %xmm2
   1      4     1.00                         4     c5 eb 7c da    vhaddps %xmm2, %xmm2, %xmm3
   1      4     1.00                         4     c5 e3 7c e3    vhaddps %xmm3, %xmm3, %xmm4

The Encoding Size column shows the size in bytes of instructions. The
Encodings column shows the actual instruction encodings (byte
sequences in hex).

The third section is the Resource pressure view. This view reports
the average number of resource cycles consumed every iteration by
instructions for every processor resource unit available on the
target. Information is structured in two tables. The first table
reports the number of resource cycles spent on average every
iteration. The second table correlates the resource cycles to the
machine instruction in the sequence. For example, every iteration of
the instruction vmulps always executes on resource unit [6] (JFPU1 -
floating point pipeline #1), consuming an average of 1 resource cycle
per iteration. Note that on AMD Jaguar, vector floating-point multiply
can only be issued to pipeline JFPU1, while horizontal floating-point
additions can only be issued to pipeline JFPU0.

The resource pressure view helps with identifying bottlenecks caused
by high usage of specific hardware resources. Situations with resource
pressure mainly concentrated on a few resources should, in general, be
avoided. Ideally, pressure should be uniformly distributed between
multiple resources.

Timeline View
The timeline view produces a detailed report of each instruction's
state transitions through an instruction pipeline. This view is
enabled by the command line option -timeline. As instructions
transition through the various stages of the pipeline, their states
are depicted in the view report. These states are represented by the
following characters:

• D : Instruction dispatched.

• e : Instruction executing.

• E : Instruction executed.

• R : Instruction retired.

• = : Instruction already dispatched, waiting to be executed.

• - : Instruction executed, waiting to be retired.

Below is the timeline view for a subset of the dot-product example
located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and processed
by llvm-mca using the following command:

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

  Timeline view:
                      012345
  Index     0123456789

  [0,0]     DeeER.    .    .   vmulps  %xmm0, %xmm1, %xmm2
  [0,1]     D==eeeER  .    .   vhaddps %xmm2, %xmm2, %xmm3
  [0,2]     .D====eeeER    .   vhaddps %xmm3, %xmm3, %xmm4
  [1,0]     .DeeE-----R    .   vmulps  %xmm0, %xmm1, %xmm2
  [1,1]     . D=eeeE---R   .   vhaddps %xmm2, %xmm2, %xmm3
  [1,2]     . D====eeeER   .   vhaddps %xmm3, %xmm3, %xmm4
  [2,0]     .  DeeE-----R  .   vmulps  %xmm0, %xmm1, %xmm2
  [2,1]     .  D====eeeER  .   vhaddps %xmm2, %xmm2, %xmm3
  [2,2]     .   D======eeeER   vhaddps %xmm3, %xmm3, %xmm4


  Average Wait times (based on the timeline view):
  [0]: Executions
  [1]: Average time spent waiting in a scheduler's queue
  [2]: Average time spent waiting in a scheduler's queue while ready
  [3]: Average time elapsed from WB until retire stage

        [0]    [1]    [2]    [3]
  0.     3     1.0    1.0    3.3   vmulps  %xmm0, %xmm1, %xmm2
  1.     3     3.3    0.7    1.0   vhaddps %xmm2, %xmm2, %xmm3
  2.     3     5.7    0.0    0.0   vhaddps %xmm3, %xmm3, %xmm4
         3     3.3    0.5    1.4   <total>

The timeline view is interesting because it shows instruction state
changes during execution. It also gives an idea of how the tool
processes instructions executed on the target, and how their timing
information might be calculated.

The timeline view is structured in two tables. The first table shows
instructions changing state over time (measured in cycles); the second
table (named Average Wait times) reports useful timing statistics,
which should help diagnose performance bottlenecks caused by long data
dependencies and sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of
indices, where the first index identifies an iteration, and the second
index is the instruction index (i.e., where it appears in the code
sequence). Since this example was generated using 3 iterations
(-iterations=3), the iteration indices range from 0 to 2 inclusive.

Excluding the first and last column, the remaining columns are in
cycles. Cycles are numbered sequentially starting from 0.

From the example output above, we know the following:

• Instruction [1,0] was dispatched at cycle 1.

• Instruction [1,0] started executing at cycle 2.

• Instruction [1,0] reached the write back stage at cycle 4.

• Instruction [1,0] was retired at cycle 10.

Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
wait in the scheduler's queue for the operands to become available. By
the time vmulps is dispatched, operands are already available, and
pipeline JFPU1 is ready to serve another instruction. So the
instruction can be immediately issued on the JFPU1 pipeline. That is
demonstrated by the fact that the instruction only spent 1cy in the
scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire
event. That is because instructions must retire in program order, so
[1,0] has to wait for [0,2] to be retired first (i.e., it has to wait
until cycle 10).

In the example, all instructions are in a RAW (Read After Write)
dependency chain. Register %xmm2 written by vmulps is immediately used
by the first vhaddps, and register %xmm3 written by the first vhaddps
is used by the second vhaddps. Long data dependencies negatively
impact the ILP (Instruction Level Parallelism).

In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations. However, those dependencies
can be removed at the register renaming stage (at the cost of
allocating register aliases, and therefore consuming physical
registers).

Table Average Wait times helps diagnose performance issues that are
caused by the presence of long latency instructions and potentially
long data dependencies which may limit the ILP. The last row,
<total>, shows a global average over all instructions measured. Note
that llvm-mca, by default, assumes at least 1cy between the dispatch
event and the issue event.

When the performance is limited by data dependencies and/or long
latency instructions, the number of cycles spent while in the ready
state is expected to be very small when compared with the total number
of cycles spent in the scheduler's queue. The difference between the
two counters is a good indicator of how large of an impact data
dependencies had on the execution of the instructions. When
performance is mostly limited by the lack of hardware resources, the
delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.

Bottleneck Analysis
The -bottleneck-analysis command line option enables the analysis of
performance bottlenecks.

This analysis is potentially expensive. It attempts to correlate
increases in backend pressure (caused by pipeline resource pressure
and data dependencies) to dynamic dispatch stalls.

Below is an example of -bottleneck-analysis output generated by
llvm-mca for 500 iterations of the dot-product example on btver2.
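
A command along these lines (reusing the dot-product input from the
previous sections) produces it:

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -bottleneck-analysis dot-product.s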

  Cycles with backend pressure increase [ 48.07% ]
  Throughput Bottlenecks:
    Resource Pressure       [ 47.77% ]
    - JFPA  [ 47.77% ]
    - JFPU0  [ 47.77% ]
    Data Dependencies:      [ 0.30% ]
    - Register Dependencies [ 0.30% ]
    - Memory Dependencies   [ 0.00% ]

  Critical sequence based on the simulation:

                Instruction                   Dependency Information
   +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
   |
   |    < loop carried >
   |
   |      0.    vmulps  %xmm0, %xmm1, %xmm2
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3   ## RESOURCE interference:  JFPA [ probability: 74% ]
   +----> 2.    vhaddps %xmm3, %xmm3, %xmm4   ## REGISTER dependency:  %xmm3
   |
   |    < loop carried >
   |
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3   ## RESOURCE interference:  JFPA [ probability: 74% ]

According to the analysis, throughput is limited by resource pressure
and not by data dependencies. The analysis observed increases in
backend pressure during 48.07% of the simulated run. Almost all those
pressure increase events were caused by contention on processor
resources JFPA/JFPU0.

The critical sequence is the most expensive sequence of instructions
according to the simulation. It is annotated to provide extra
information about critical register dependencies and resource
interferences between instructions.

Instructions from the critical sequence are expected to significantly
impact performance. By construction, the accuracy of this analysis is
strongly dependent on the simulation and (as always) on the quality of
the processor model in LLVM.

Extra Statistics to Further Diagnose Performance Issues
The -all-stats command line option enables extra statistics and
performance counters for the dispatch logic, the reorder buffer, the
retire control unit, and the register file.

Below is an example of -all-stats output generated by llvm-mca for 300
iterations of the dot-product example discussed in the previous
sections.
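
A command along these lines produces it:

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 -all-stats dot-product.s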

  Dynamic Dispatch Stall Cycles:
  RAT     - Register unavailable:                      0
  RCU     - Retire tokens unavailable:                 0
  SCHEDQ  - Scheduler full:                            272  (44.6%)
  LQ      - Load queue full:                           0
  SQ      - Store queue full:                          0
  GROUP   - Static restrictions on the dispatch group: 0


  Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
  [# dispatched], [# cycles]
   0,              24  (3.9%)
   1,              272  (44.6%)
   2,              314  (51.5%)


  Schedulers - number of cycles where we saw N micro opcodes issued:
  [# issued], [# cycles]
   0,          7  (1.1%)
   1,          306  (50.2%)
   2,          297  (48.7%)

  Scheduler's queue usage:
  [1] Resource name.
  [2] Average number of used buffer entries.
  [3] Maximum number of used buffer entries.
  [4] Total number of buffer entries.

   [1]            [2]        [3]        [4]
  JALU01           0          0          20
  JFPU01           17         18         18
  JLSAGU           0          0          12


  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Total ROB Entries:                64
  Max Used ROB Entries:             35  ( 54.7% )
  Average Used ROB Entries per cy:  32  ( 50.0% )


  Register File statistics:
  Total number of mappings created:    900
  Max number of mappings used:         35

  *  Register File #1 -- JFpuPRF:
     Number of physical registers:     72
     Total number of mappings created: 900
     Max number of mappings used:      35

  *  Register File #2 -- JIntegerPRF:
     Number of physical registers:     64
     Total number of mappings created: 0
     Max number of mappings used:      0

If we look at the Dynamic Dispatch Stall Cycles table, we see that the
counter for SCHEDQ reports 272 cycles. This counter is incremented
every time the dispatch logic is unable to dispatch a full group
because the scheduler's queue is full.

Looking at the Dispatch Logic table, we see that the pipeline was only
able to dispatch two micro opcodes 51.5% of the time. The dispatch
group was limited to one micro opcode 44.6% of the cycles, which
corresponds to 272 cycles. The dispatch statistics are displayed by
either using the command option -all-stats or -dispatch-stats.

The next table, Schedulers, presents a histogram displaying a count,
representing the number of micro opcodes issued on some number of
cycles. In this case, of the 610 simulated cycles, single opcodes were
issued 306 times (50.2%) and there were 7 cycles where no opcodes were
issued.

The Scheduler's queue usage table shows the average and maximum number
of buffer entries (i.e., scheduler queue entries) used at runtime.
Resource JFPU01 reached its maximum (18 of 18 queue entries). Note
that AMD Jaguar implements three schedulers:

• JALU01 - A scheduler for ALU instructions.

• JFPU01 - A scheduler for floating point operations.

• JLSAGU - A scheduler for address generation.

The dot-product is a kernel of three floating point instructions (a
vector multiply followed by two horizontal adds). That explains why
only the floating point scheduler appears to be used.

A full scheduler queue is either caused by data dependency chains or
by a sub-optimal usage of hardware resources. Sometimes, resource
pressure can be mitigated by rewriting the kernel using different
instructions that consume different scheduler resources. Schedulers
with a small queue are less resilient to bottlenecks caused by the
presence of long data dependencies. The scheduler statistics are
displayed by using the command option -all-stats or -scheduler-stats.

The next table, Retire Control Unit, presents a histogram displaying a
count, representing the number of instructions retired on some number
of cycles. In this case, of the 610 simulated cycles, two instructions
were retired during the same cycle 399 times (65.4%) and there were
109 cycles where no instructions were retired. The retire statistics
are displayed by using the command option -all-stats or -retire-stats.

The last table presented is Register File statistics. Each physical
register file (PRF) used by the pipeline is presented in this table.
In the case of AMD Jaguar, there are two register files, one for
floating-point registers (JFpuPRF) and one for integer registers
(JIntegerPRF). The table shows that of the 900 instructions processed,
there were 900 mappings created. Since this dot-product example
utilized only floating point registers, the JFpuPRF was responsible
for creating the 900 mappings. However, we see that the pipeline only
used a maximum of 35 of 72 available register slots at any given time.
We can conclude that the floating point PRF was the only register file
used for the example, and that it was never resource constrained. The
register file statistics are displayed by using the command option
-all-stats or -register-file-stats.

In this example, we can conclude that the IPC is mostly limited by
data dependencies, and not by resource pressure.

Instruction Flow
This section describes the instruction flow through the default
pipeline of llvm-mca, as well as the functional units involved in the
process.

The default pipeline implements the following sequence of stages used
to process instructions.

• Dispatch (Instruction is dispatched to the schedulers).

• Issue (Instruction is issued to the processor pipelines).

• Write Back (Instruction is executed, and results are written back).

• Retire (Instruction is retired; writes are architecturally
  committed).

The default pipeline only models the out-of-order portion of a
processor. Therefore, the instruction fetch and decode stages are not
modeled. Performance bottlenecks in the frontend are not diagnosed.
llvm-mca assumes that instructions have all been decoded and placed
into a queue before the simulation starts. Also, llvm-mca does not
model branch prediction.

Instruction Dispatch
During the dispatch stage, instructions are picked in program order
from a queue of already decoded instructions, and dispatched in groups
to the simulated hardware schedulers.

The size of a dispatch group depends on the availability of the
simulated hardware resources. The processor dispatch width defaults
to the value of the IssueWidth in LLVM's scheduling model.

An instruction can be dispatched if:

• The size of the dispatch group is smaller than the processor's
  dispatch width.

• There are enough entries in the reorder buffer.

• There are enough physical registers to do register renaming.

• The schedulers are not full.

Scheduling models can optionally specify which register files are
available on the processor. llvm-mca uses that information to
initialize register file descriptors. Users can limit the number of
physical registers that are globally available for register renaming
by using the command option -register-file-size. A value of zero for
this option means unbounded. By knowing how many registers are
available for renaming, the tool can predict dispatch stalls caused by
the lack of physical registers.
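
For example, the following command (hypothetical input file) restricts
the simulated processor to 32 physical registers for renaming; stalls
caused by running out of them would then show up in the RAT counter of
the dispatch statistics:

  $ llvm-mca -mcpu=btver2 -register-file-size=32 -dispatch-stats foo.s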

The number of reorder buffer entries consumed by an instruction
depends on the number of micro-opcodes specified for that instruction
by the target scheduling model. The reorder buffer is responsible for
tracking the progress of instructions that are "in-flight", and
retiring them in program order. The number of entries in the reorder
buffer defaults to the value specified by field MicroOpBufferSize in
the target scheduling model.

Instructions that are dispatched to the schedulers consume scheduler
buffer entries. llvm-mca queries the scheduling model to determine the
set of buffered resources consumed by an instruction. Buffered
resources are treated like scheduler resources.

Instruction Issue
Each processor scheduler implements a buffer of instructions. An
instruction has to wait in the scheduler's buffer until input register
operands become available. Only at that point does the instruction
become eligible for execution and may be issued (potentially
out-of-order) for execution. Instruction latencies are computed by
llvm-mca with the help of the scheduling model.

llvm-mca's scheduler is designed to simulate multiple processor
schedulers. The scheduler is responsible for tracking data
dependencies, and dynamically selecting which processor resources are
consumed by instructions. It delegates the management of processor
resource units and resource groups to a resource manager. The
resource manager is responsible for selecting resource units that are
consumed by instructions. For example, if an instruction consumes 1cy
of a resource group, the resource manager selects one of the available
units from the group; by default, the resource manager uses a
round-robin selector to guarantee that resource usage is uniformly
distributed between all units of a group.

llvm-mca's scheduler internally groups instructions into three sets:

• WaitSet: a set of instructions whose operands are not ready.

• ReadySet: a set of instructions ready to execute.

• IssuedSet: a set of instructions executing.

Depending on operand availability, instructions that are dispatched to
the scheduler are either placed into the WaitSet or into the ReadySet.

Every cycle, the scheduler checks if instructions can be moved from
the WaitSet to the ReadySet, and if instructions from the ReadySet can
be issued to the underlying pipelines. The algorithm prioritizes older
instructions over younger instructions.

Write-Back and Retire Stage
Issued instructions are moved from the ReadySet to the IssuedSet.
There, instructions wait until they reach the write-back stage. At
that point, they get removed from the queue and the retire control
unit is notified.

When instructions are executed, the retire control unit flags the
instruction as "ready to retire."

Instructions are retired in program order. The register file is
notified of the retirement so that it can free the physical registers
that were allocated for the instruction during the register renaming
stage.

Load/Store Unit and Memory Consistency Model
To model the out-of-order, speculative execution of memory operations,
llvm-mca uses a simulated load/store unit (LSUnit).

Each load (or store) consumes an entry in the load (or store) queue.
Users can specify flags -lqueue and -squeue to limit the number of
entries in the load and store queues respectively. The queues are
unbounded by default.

The LSUnit implements a relaxed consistency model for memory loads and
stores. The rules are:

1. A younger load is allowed to pass an older load only if there are
   no intervening stores or barriers between the two loads.

2. A younger load is allowed to pass an older store provided that the
   load does not alias with the store.

3. A younger store is not allowed to pass an older store.

4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias
with store operations (-noalias=true). Under this assumption, younger
loads are always allowed to pass older stores. Essentially, the
LSUnit does not attempt to run any alias analysis to predict when
loads and stores do not alias with each other.
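
As an illustration, consider the following hypothetical x86 sequence.
Under the default -noalias=true, the younger load may be issued before
the older store; with -noalias=false, rule 2 forces it to wait, since
the LSUnit performs no alias analysis and conservatively assumes the
two accesses may alias:

  movl (%rsi), %eax     # older load
  movl %eax, (%rdi)     # older store
  movl 8(%rsi), %ecx    # younger load: passes the store only with -noalias=true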

Note that, in the case of write-combining memory, rule 3 could be
relaxed to allow reordering of non-aliasing store operations. That
being said, at the moment, there is no way to further relax the memory
model (-noalias is the only option). Essentially, there is no option
to specify a different memory type (e.g., write-back, write-combining,
write-through; etc.) and consequently to weaken, or strengthen, the
memory model.

Other limitations are:

• The LSUnit does not know when store-to-load forwarding may occur.

• The LSUnit does not know anything about cache hierarchy and memory
  types.

• The LSUnit does not know how to identify serializing operations and
  memory fences.

The LSUnit does not attempt to predict if a load or store hits or
misses the L1 cache. It only knows if an instruction "MayLoad" and/or
"MayStore." For loads, the scheduling model provides an "optimistic"
load-to-use latency (which usually matches the load-to-use latency for
when there is a hit in the L1D).

llvm-mca does not know about serializing operations or memory-barrier
like instructions. The LSUnit conservatively assumes that an
instruction which has both "MayLoad" and unmodeled side effects
behaves like a "soft" load-barrier. That means, it serializes loads
without forcing a flush of the load queue. Similarly, instructions
that "MayStore" and have unmodeled side effects are treated like store
barriers. A full memory barrier is a "MayLoad" and "MayStore"
instruction with unmodeled side effects. This is inaccurate, but it is
the best that we can do at the moment with the current information
available in LLVM.

A load/store barrier consumes one entry of the load/store queue. A
load/store barrier enforces ordering of loads/stores. A younger load
cannot pass a load barrier. Also, a younger store cannot pass a store
barrier. A younger load has to wait for the memory/load barrier to
execute. A load/store barrier is "executed" when it becomes the
oldest entry in the load/store queue(s). That also means, by
construction, all of the older loads/stores have been executed.

In conclusion, the full set of load/store consistency rules are:

1. A store may not pass a previous store.

2. A store may not pass a previous load (regardless of -noalias).

3. A store has to wait until an older store barrier is fully executed.

4. A load may pass a previous load.

5. A load may not pass a previous store unless -noalias is set.

6. A load has to wait until an older load barrier is fully executed.

AUTHOR
Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT
2003-2023, LLVM Project

LLVM                              2023-07-20                      LLVM-MCA(1)