LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)

NAME
       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS
       llvm-mca [options] [input]

DESCRIPTION
       llvm-mca is a performance analysis tool that uses information
       available in LLVM (e.g. scheduling models) to statically measure the
       performance of machine code on a specific CPU.

       Performance is measured in terms of throughput as well as processor
       resource consumption. The tool currently works for processors with an
       out-of-order backend, for which there is a scheduling model available
       in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help diagnose potential
       performance issues.

       Given an assembly code sequence, llvm-mca estimates the Instructions
       Per Cycle (IPC), as well as hardware resource pressure. The analysis
       and reporting style were inspired by the IACA tool from Intel.

       For example, you can compile code with clang, output assembly, and
       pipe it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

       Scheduling models are not just used to compute instruction latencies
       and throughput, but also to understand what processor resources are
       available and how to simulate them.

       By design, the quality of the analysis conducted by llvm-mca is
       inevitably affected by the quality of the scheduling models in LLVM.

       If you see that the performance report is not accurate for a
       processor, please file a bug against the appropriate backend.

OPTIONS
       If input is "-" or omitted, llvm-mca reads from standard input.
       Otherwise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca will send its output to
       standard output if the input is from standard input. If the -o option
       specifies "-", then the output will also be sent to standard output.

       -help  Print a summary of command line options.

       -o <filename>
              Use <filename> as the output filename. See the summary above
              for more details.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code. It
              defaults to the host default target.

       -mcpu=<cpuname>
              Specify the processor for which to analyze the code. By
              default, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated
              by the tool. On x86, possible values are [0, 1]. A value of 0
              selects the AT&T assembly format, while a value of 1 selects
              the Intel assembly format for the code printed out by the tool
              in the analysis report.

       -print-imm-hex
              Prefer hex format for numeric literals in the output assembly
              printed as part of the report.

       -dispatch=<width>
              Specify a different dispatch width for the processor. The
              dispatch width defaults to field 'IssueWidth' in the processor
              scheduling model. If width is zero, then the default dispatch
              width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this
              flag limits how many physical registers are available for
              register renaming purposes. A value of zero for this flag
              means "unlimited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run. If this flag is set
              to 0, then the tool sets the number of iterations to a default
              value (i.e. 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias.
              This is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the load queue. A value of zero
              for this flag is ignored, and the default load queue size is
              used instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the store queue. A value of
              zero for this flag is ignored, and the default store queue
              size is used instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view.
              By default, the timeline view prints information for up to 10
              iterations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view. By default,
              the number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view. This is enabled by default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics. This view collects and
              analyzes instruction dispatch events, as well as
              static/dynamic dispatch stall events. This view is disabled by
              default.

       -scheduler-stats
              Enable extra scheduler statistics. This view collects and
              analyzes instruction issue events. This view is disabled by
              default.

       -retire-stats
              Enable extra retire control unit statistics. This view is
              disabled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -show-encoding
              Enable the printing of instruction encodings within the
              instruction info view.

       -all-stats
              Print all hardware statistics. This enables extra statistics
              related to the dispatch logic, the hardware schedulers, the
              register file(s), and the retire control unit. This option is
              disabled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Prints resource pressure information based on the static
              information available from the processor model. This differs
              from the resource pressure view because it doesn't require the
              code to be simulated. Instead, it prints the theoretical
              uniform distribution of resource pressure for every
              instruction in sequence.

       -bottleneck-analysis
              Print information about bottlenecks that affect the
              throughput. This analysis can be expensive, and it is disabled
              by default. Bottlenecks are highlighted in the summary view.
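
       For example, the following invocations are plausible ways to combine
       some of the views above; the input file name foo.s is only a
       placeholder:

          # Print the static per-instruction resource pressure table only.
          $ llvm-mca -mcpu=btver2 -instruction-tables foo.s

          # Simulate the kernel with the timeline and bottleneck analysis
          # views enabled.
          $ llvm-mca -mcpu=btver2 -timeline -bottleneck-analysis foo.s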

EXIT STATUS
       llvm-mca returns 0 on success. Otherwise, an error message is printed
       to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
       llvm-mca allows for the optional usage of special code comments to
       mark
       regions of the assembly code to be analyzed. A comment starting with
       substring LLVM-MCA-BEGIN marks the beginning of a code region. A
       comment starting with substring LLVM-MCA-END marks the end of a code
       region. For example:

          # LLVM-MCA-BEGIN
            ...
          # LLVM-MCA-END

       If no user-defined region is specified, then llvm-mca assumes a
       default region which contains every instruction in the input file.
       Every region is analyzed in isolation, and the final performance
       report is the union of all the reports generated for every code
       region.

       Code regions can have names. For example:

          # LLVM-MCA-BEGIN A simple example
            add %eax, %eax
          # LLVM-MCA-END

       The code from the example above defines a region named "A simple
       example" with a single instruction in it. Note how the region name
       doesn't have to be repeated in the LLVM-MCA-END directive. In the
       absence of overlapping regions, an anonymous LLVM-MCA-END directive
       always ends the currently active user-defined region.

       Example of nesting regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END bar
          # LLVM-MCA-END foo

       Example of overlapping regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END foo
            add %eax, %edx
          # LLVM-MCA-END bar

       Note that multiple anonymous regions cannot overlap. Also,
       overlapping regions cannot have the same name.
       There is no support for marking regions from high-level source code,
       like C or C++. As a workaround, inline assembly directives may be
       used:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo");
            a += 42;
            __asm volatile("# LLVM-MCA-END");
            a *= b;
            return a;
          }

       However, this interferes with optimizations like loop vectorization
       and may have an impact on the code generated. This is because the
       __asm statements are seen as real code having important side effects,
       which limits how the code around them can be transformed. If users
       want to make use of inline assembly to emit markers, then the
       recommendation is to always verify that the output assembly is
       equivalent to the assembly generated in the absence of markers. The
       Clang options to emit optimization reports can also help in detecting
       missed optimizations.
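
       For instance, assuming the foo() example above lives in foo.c, a
       Clang invocation along these lines requests vectorization remarks,
       which can reveal whether the markers inhibited an optimization (the
       remark passes worth asking for depend on the code being compiled):

          $ clang foo.c -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -S -o -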

HOW LLVM-MCA WORKS
       llvm-mca takes assembly code as input. The assembly code is parsed
       into a sequence of MCInst with the help of the existing LLVM target
       assembly parsers. The parsed sequence of MCInst is then analyzed by a
       Pipeline module to generate a performance report.

       The Pipeline module simulates the execution of the machine code
       sequence in a loop of iterations (default is 100). During this
       process, the pipeline collects a number of execution related
       statistics. At the end of this process, the pipeline generates and
       prints a report from the collected statistics.

       Here is an example of a performance report generated by the tool for
       a dot-product of two packed float vectors of four elements. The
       analysis is conducted for target x86, cpu btver2. The following
       result can be produced via the following command using the example
       located at test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Total uOps:        900

          Dispatch Width:    2
          uOps Per Cycle:    1.48
          IPC:               1.48
          Block RThroughput: 2.0


          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps     %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps    %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps    %xmm3, %xmm3, %xmm4


          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL


          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps     %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps    %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps    %xmm3, %xmm3, %xmm4

       According to this report, the dot-product kernel has been executed
       300 times, for a total of 900 simulated instructions. The total
       number of simulated micro opcodes (uOps) is also 900.

       The report is structured in three main sections. The first section
       collects a few performance numbers; the goal of this section is to
       give a very quick overview of the performance throughput. Important
       performance indicators are IPC, uOps Per Cycle, and Block
       RThroughput (Block Reciprocal Throughput).

       Field Dispatch Width is the maximum number of micro opcodes that are
       dispatched to the out-of-order backend every simulated cycle.

       IPC is computed by dividing the total number of simulated
       instructions by the total number of cycles. In this example, 900
       instructions over 610 cycles give an IPC of 1.48.

       Field Block RThroughput is the reciprocal of the block throughput.
       Block throughput is a theoretical quantity computed as the maximum
       number of blocks (i.e. iterations) that can be executed per simulated
       clock cycle in the absence of loop carried dependencies. Block
       throughput is limited from above by the dispatch rate and by the
       availability of hardware resources.

       In the absence of loop-carried data dependencies, the observed IPC
       tends to a theoretical maximum which can be computed by dividing the
       number of instructions of a single iteration by the Block
       RThroughput.

       Field 'uOps Per Cycle' is computed by dividing the total number of
       simulated micro opcodes by the total number of cycles. A delta
       between Dispatch Width and this field is an indicator of a
       performance issue. In the absence of loop-carried data dependencies,
       the observed 'uOps Per Cycle' should tend to a theoretical maximum
       throughput which can be computed by dividing the number of uOps of a
       single iteration by the Block RThroughput.

       Field uOps Per Cycle is bounded from above by the dispatch width.
       That is because the dispatch width limits the maximum size of a
       dispatch group. Both IPC and 'uOps Per Cycle' are limited by the
       amount of hardware parallelism. The availability of hardware
       resources affects the resource pressure distribution, and it limits
       the number of instructions that can be executed in parallel every
       cycle. A delta between Dispatch Width and the theoretical maximum
       uOps per Cycle (computed by dividing the number of uOps of a single
       iteration by the Block RThroughput) is an indicator of a performance
       bottleneck caused by the lack of hardware resources. In general, the
       lower the Block RThroughput, the better.

       In this example, uOps per iteration/Block RThroughput is 1.50 (i.e.,
       3 uOps divided by a Block RThroughput of 2.0). Since there are no
       loop-carried dependencies, the observed uOps Per Cycle is expected to
       approach 1.50 when the number of iterations tends to infinity. The
       delta between the Dispatch Width (2.00) and the theoretical maximum
       uOp throughput (1.50) is an indicator of a performance bottleneck
       caused by the lack of hardware resources, and the Resource pressure
       view can help to identify the problematic resource usage.

       The second section of the report is the instruction info view. It
       shows the latency and reciprocal throughput of every instruction in
       the sequence. It also reports extra information related to the number
       of micro opcodes, and opcode properties (i.e., 'MayLoad', 'MayStore',
       and 'HasSideEffects').

       Field RThroughput is the reciprocal of the instruction throughput.
       Throughput is computed as the maximum number of instructions of the
       same type that can be executed per clock cycle in the absence of
       operand dependencies. In this example, the reciprocal throughput of a
       vector float multiply is 1 cycle/instruction. That is because the FP
       multiplier JFPM is only available from pipeline JFPU1.

       Instruction encodings are displayed within the instruction info view
       when flag -show-encoding is specified.

       Below is an example of -show-encoding output for the dot-product
       kernel:

          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)
          [7]: Encoding Size

          [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:     Instructions:
           1      2     1.00                         4     c5 f0 59 d0    vmulps  %xmm0, %xmm1, %xmm2
           1      4     1.00                         4     c5 eb 7c da    vhaddps %xmm2, %xmm2, %xmm3
           1      4     1.00                         4     c5 e3 7c e3    vhaddps %xmm3, %xmm3, %xmm4

       The Encoding Size column shows the size in bytes of instructions.
       The Encodings column shows the actual instruction encodings (byte
       sequences in hex).

       The third section is the Resource pressure view. This view reports
       the average number of resource cycles consumed every iteration by
       instructions for every processor resource unit available on the
       target. Information is structured in two tables. The first table
       reports the number of resource cycles spent on average every
       iteration. The second table correlates the resource cycles to the
       machine instruction in the sequence. For example, every iteration of
       the instruction vmulps always executes on resource unit [6] (JFPU1 -
       floating point pipeline #1), consuming an average of 1 resource cycle
       per iteration. Note that on AMD Jaguar, vector floating-point
       multiply can only be issued to pipeline JFPU1, while horizontal
       floating-point additions can only be issued to pipeline JFPU0.

       The resource pressure view helps with identifying bottlenecks caused
       by high usage of specific hardware resources. Situations with
       resource pressure mainly concentrated on a few resources should, in
       general, be avoided. Ideally, pressure should be uniformly
       distributed between multiple resources.

   Timeline View
       The timeline view produces a detailed report of each instruction's
       state transitions through an instruction pipeline. This view is
       enabled by the command line option -timeline. As instructions
       transition through the various stages of the pipeline, their states
       are depicted in the view report. These states are represented by the
       following characters:

       • D : Instruction dispatched.

       • e : Instruction executing.

       • E : Instruction executed.

       • R : Instruction retired.

       • = : Instruction already dispatched, waiting to be executed.

       • - : Instruction executed, waiting to be retired.

       Below is the timeline view for a subset of the dot-product example
       located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and processed
       by llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
                 3     3.3    0.5    1.4       <total>

       The timeline view is interesting because it shows instruction state
       changes during execution. It also gives an idea of how the tool
       processes instructions executed on the target, and how their timing
       information might be calculated.

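       For long kernels, the table can be kept readable by trimming it with
       the options described earlier, for example (the limits below are
       illustrative values):

          $ llvm-mca -mcpu=btver2 -timeline -timeline-max-iterations=2 -timeline-max-cycles=50 dot-product.s
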
       The timeline view is structured in two tables. The first table shows
       instructions changing state over time (measured in cycles); the
       second table (named Average Wait times) reports useful timing
       statistics, which should help diagnose performance bottlenecks caused
       by long data dependencies and sub-optimal usage of hardware
       resources.

       An instruction in the timeline view is identified by a pair of
       indices, where the first index identifies an iteration, and the
       second index is the instruction index (i.e., where it appears in the
       code sequence). Since this example was generated using 3 iterations
       (-iterations=3), the iteration indices range from 0 to 2, inclusive.

       Excluding the first and last column, the remaining columns are in
       cycles. Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       • Instruction [1,0] was dispatched at cycle 1.

       • Instruction [1,0] started executing at cycle 2.

       • Instruction [1,0] reached the write back stage at cycle 4.

       • Instruction [1,0] was retired at cycle 10.

       Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
       wait in the scheduler's queue for the operands to become available.
       By the time vmulps is dispatched, operands are already available, and
       pipeline JFPU1 is ready to serve another instruction. So the
       instruction can be immediately issued on the JFPU1 pipeline. That is
       demonstrated by the fact that the instruction only spent 1cy in the
       scheduler's queue.

       There is a gap of 5 cycles between the write-back stage and the
       retire event. That is because instructions must retire in program
       order, so [1,0] has to wait for [0,2] to be retired first (i.e., it
       has to wait until cycle 10).

       In the example, all instructions are in a RAW (Read After Write)
       dependency chain. Register %xmm2 written by vmulps is immediately
       used by the first vhaddps, and register %xmm3 written by the first
       vhaddps is used by the second vhaddps. Long data dependencies
       negatively impact the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies introduced by
       instructions from different iterations. However, those dependencies
       can be removed at the register renaming stage (at the cost of
       allocating register aliases, and therefore consuming physical
       registers).

       Table Average Wait times helps diagnose performance issues that are
       caused by the presence of long latency instructions and potentially
       long data dependencies which may limit the ILP. The last row,
       <total>, shows a global average over all instructions measured. Note
       that llvm-mca, by default, assumes at least 1cy between the dispatch
       event and the issue event.

       When the performance is limited by data dependencies and/or long
       latency instructions, the number of cycles spent while in the ready
       state is expected to be very small when compared with the total
       number of cycles spent in the scheduler's queue. The difference
       between the two counters is a good indicator of how large of an
       impact data dependencies had on the execution of the instructions.
       When performance is mostly limited by the lack of hardware resources,
       the delta between the two counters is small. However, the number of
       cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
       especially when compared to other low latency instructions.

   Bottleneck Analysis
       The -bottleneck-analysis command line option enables the analysis of
       performance bottlenecks.

       This analysis is potentially expensive. It attempts to correlate
       increases in backend pressure (caused by pipeline resource pressure
       and data dependencies) to dynamic dispatch stalls.

       Below is an example of -bottleneck-analysis output generated by
       llvm-mca for 500 iterations of the dot-product example on btver2.

          Cycles with backend pressure increase [ 48.07% ]
          Throughput Bottlenecks:
            Resource Pressure       [ 47.77% ]
            - JFPA  [ 47.77% ]
            - JFPU0  [ 47.77% ]
            Data Dependencies:      [ 0.30% ]
            - Register Dependencies [ 0.30% ]
            - Memory Dependencies   [ 0.00% ]

          Critical sequence based on the simulation:

                        Instruction                    Dependency Information
           +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
           |
           |    < loop carried >
           |
           |      0.    vmulps  %xmm0, %xmm1, %xmm2
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3    ## RESOURCE interference:  JFPA [ probability: 74% ]
           +----> 2.    vhaddps %xmm3, %xmm3, %xmm4    ## REGISTER dependency:  %xmm3
           |
           |    < loop carried >
           |
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3    ## RESOURCE interference:  JFPA [ probability: 74% ]

       According to the analysis, throughput is limited by resource pressure
       and not by data dependencies. The analysis observed increases in
       backend pressure during 48.07% of the simulated run. Almost all those
       pressure increase events were caused by contention on processor
       resources JFPA/JFPU0.

       The critical sequence is the most expensive sequence of instructions
       according to the simulation. It is annotated to provide extra
       information about critical register dependencies and resource
       interferences between instructions.

       Instructions from the critical sequence are expected to significantly
       impact performance. By construction, the accuracy of this analysis is
       strongly dependent on the simulation and (as always) on the quality
       of the processor model in LLVM.

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and
       performance counters for the dispatch logic, the reorder buffer, the
       retire control unit, and the register file.

       Below is an example of -all-stats output generated by llvm-mca for
       300 iterations of the dot-product example discussed in the previous
       sections.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272  (44.6%)
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0


          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)


          Schedulers - number of cycles where we saw N micro opcodes issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)

          Scheduler's queue usage:
          [1] Resource name.
          [2] Average number of used buffer entries.
          [3] Maximum number of used buffer entries.
          [4] Total number of buffer entries.

           [1]            [2]        [3]        [4]
          JALU01           0          0          20
          JFPU01           17         18         18
          JLSAGU           0          0          12


          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
           2,           399  (65.4%)

          Total ROB Entries:                64
          Max Used ROB Entries:             35  ( 54.7% )
          Average Used ROB Entries per cy:  32  ( 50.0% )


          Register File statistics:
          Total number of mappings created:    900
          Max number of mappings used:         35

          *  Register File #1 -- JFpuPRF:
             Number of physical registers:     72
             Total number of mappings created: 900
             Max number of mappings used:      35

          *  Register File #2 -- JIntegerPRF:
             Number of physical registers:     64
             Total number of mappings created: 0
             Max number of mappings used:      0

       If we look at the Dynamic Dispatch Stall Cycles table, we see that
       the counter for SCHEDQ reports 272 cycles. This counter is
       incremented every time the dispatch logic is unable to dispatch a
       full group because the scheduler's queue is full.

       Looking at the Dispatch Logic table, we see that the pipeline was
       only able to dispatch two micro opcodes 51.5% of the time. The
       dispatch group was limited to one micro opcode 44.6% of the cycles,
       which corresponds to 272 cycles. The dispatch statistics are
       displayed by using either the command option -all-stats or
       -dispatch-stats.

       The next table, Schedulers, presents a histogram displaying a count,
       representing the number of micro opcodes issued on some number of
       cycles. In this case, of the 610 simulated cycles, single opcodes
       were issued 306 times (50.2%) and there were 7 cycles where no
       opcodes were issued.

       The Scheduler's queue usage table shows the average and maximum
       number of buffer entries (i.e., scheduler queue entries) used at
       runtime. Resource JFPU01 reached its maximum (18 of 18 queue
       entries). Note that AMD Jaguar implements three schedulers:

       • JALU01 - A scheduler for ALU instructions.

       • JFPU01 - A scheduler for floating point operations.

       • JLSAGU - A scheduler for address generation.

       The dot-product is a kernel of three floating point instructions (a
       vector multiply followed by two horizontal adds). That explains why
       only the floating point scheduler appears to be used.

       A full scheduler queue is either caused by data dependency chains or
       by a sub-optimal usage of hardware resources. Sometimes, resource
       pressure can be mitigated by rewriting the kernel using different
       instructions that consume different scheduler resources. Schedulers
       with a small queue are less resilient to bottlenecks caused by the
       presence of long data dependencies. The scheduler statistics are
       displayed by using either the command option -all-stats or
       -scheduler-stats.

       The next table, Retire Control Unit, presents a histogram displaying
       a count, representing the number of instructions retired on some
       number of cycles. In this case, of the 610 simulated cycles, two
       instructions were retired during the same cycle 399 times (65.4%)
       and there were 109 cycles where no instructions were retired. The
       retire statistics are displayed by using either the command option
       -all-stats or -retire-stats.

       The last table presented is Register File statistics. Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for
       floating-point registers (JFpuPRF) and one for integer registers
       (JIntegerPRF). The table shows that of the 900 instructions
       processed, there were 900 mappings created. Since this dot-product
       example utilized only floating point registers, the JFpuPRF was
       responsible for creating the 900 mappings. However, we see that the
       pipeline only used a maximum of 35 of 72 available register slots at
       any given time. We can conclude that the floating point PRF was the
       only register file used for the example, and that it was never
       resource constrained. The register file statistics are displayed by
       using either the command option -all-stats or -register-file-stats.

       In this example, we can conclude that the IPC is mostly limited by
       data dependencies, and not by resource pressure.

   Instruction Flow
       This section describes the instruction flow through the default
       pipeline of llvm-mca, as well as the functional units involved in the
       process.

       The default pipeline implements the following sequence of stages used
       to process instructions.

       • Dispatch (Instruction is dispatched to the schedulers).

       • Issue (Instruction is issued to the processor pipelines).

       • Write Back (Instruction is executed, and results are written back).

       • Retire (Instruction is retired; writes are architecturally
         committed).

       The default pipeline only models the out-of-order portion of a
       processor. Therefore, the instruction fetch and decode stages are not
       modeled. Performance bottlenecks in the frontend are not diagnosed.
       llvm-mca assumes that instructions have all been decoded and placed
       into a queue before the simulation starts. Also, llvm-mca does not
       model branch prediction.

   Instruction Dispatch
       During the dispatch stage, instructions are picked in program order
       from a queue of already decoded instructions, and dispatched in
       groups to the simulated hardware schedulers.

       The size of a dispatch group depends on the availability of the
       simulated hardware resources. The processor dispatch width defaults
       to the value of the IssueWidth field in LLVM's scheduling model.

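       As an illustration, one can experiment with a wider dispatch stage by
       overriding that default on the command line (the width of 4 below is
       an arbitrary value for the sake of the example):

          $ llvm-mca -mcpu=btver2 -dispatch=4 foo.s
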
       An instruction can be dispatched if:

       • The size of the dispatch group is smaller than the processor's
         dispatch width.

       • There are enough entries in the reorder buffer.

       • There are enough physical registers to do register renaming.

       • The schedulers are not full.

       Scheduling models can optionally specify which register files are
       available on the processor. llvm-mca uses that information to
       initialize register file descriptors. Users can limit the number of
       physical registers that are globally available for register renaming
       by using the command option -register-file-size. A value of zero for
       this option means unbounded. By knowing how many registers are
       available for renaming, the tool can predict dispatch stalls caused
       by the lack of physical registers.

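       As a sketch, a run that artificially constrains renaming to 32
       physical registers (an arbitrary value) might look like the
       following; the RAT stall counter in the -dispatch-stats output would
       then show whether renaming became a bottleneck:

          $ llvm-mca -mcpu=btver2 -register-file-size=32 -dispatch-stats foo.s
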
       The number of reorder buffer entries consumed by an instruction
       depends on the number of micro-opcodes specified for that instruction
       by the target scheduling model. The reorder buffer is responsible for
       tracking the progress of instructions that are "in-flight", and
       retiring them in program order. The number of entries in the reorder
       buffer defaults to the value specified by field MicroOpBufferSize in
       the target scheduling model.

       Instructions that are dispatched to the schedulers consume scheduler
       buffer entries. llvm-mca queries the scheduling model to determine
       the set of buffered resources consumed by an instruction. Buffered
       resources are treated like scheduler resources.

   Instruction Issue
       Each processor scheduler implements a buffer of instructions. An
       instruction has to wait in the scheduler's buffer until input
       register operands become available. Only at that point does the
       instruction become eligible for execution and may be issued
       (potentially out-of-order) to the underlying pipelines. Instruction
       latencies are computed by llvm-mca with the help of the scheduling
       model.

       llvm-mca's scheduler is designed to simulate multiple processor
       schedulers. The scheduler is responsible for tracking data
       dependencies, and dynamically selecting which processor resources are
       consumed by instructions. It delegates the management of processor
       resource units and resource groups to a resource manager. The
       resource manager is responsible for selecting resource units that are
       consumed by instructions. For example, if an instruction consumes 1cy
       of a resource group, the resource manager selects one of the
       available units from the group; by default, the resource manager uses
       a round-robin selector to guarantee that resource usage is uniformly
       distributed between all units of a group.

       llvm-mca's scheduler internally groups instructions into three sets:

       • WaitSet: a set of instructions whose operands are not ready.

       • ReadySet: a set of instructions ready to execute.

       • IssuedSet: a set of instructions executing.

       Depending on operand availability, instructions that are dispatched
       to the scheduler are either placed into the WaitSet or into the
       ReadySet.

       Every cycle, the scheduler checks if instructions can be moved from
       the WaitSet to the ReadySet, and if instructions from the ReadySet
       can be issued to the underlying pipelines. The algorithm prioritizes
       older instructions over younger instructions.

   Write-Back and Retire Stage
       Issued instructions are moved from the ReadySet to the IssuedSet.
       There, instructions wait until they reach the write-back stage. At
       that point, they get removed from the queue and the retire control
       unit is notified.

       When an instruction is executed, the retire control unit flags it as
       "ready to retire."

       Instructions are retired in program order. The register file is
       notified of the retirement so that it can free the physical registers
       that were allocated for the instruction during the register renaming
       stage.

   Load/Store Unit and Memory Consistency Model
       To simulate an out-of-order execution of memory operations, llvm-mca
       uses a simulated load/store unit (LSUnit) to model the speculative
       execution of loads and stores.

       Each load (or store) consumes an entry in the load (or store) queue.
       Users can specify flags -lqueue and -squeue to limit the number of
       entries in the load and store queues respectively. The queues are
       unbounded by default.

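       For example, a hypothetical run that models a 32-entry load queue and
       a 20-entry store queue (both sizes chosen arbitrarily for
       illustration) could look like this:

          $ llvm-mca -mcpu=btver2 -lqueue=32 -squeue=20 foo.s
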
       The LSUnit implements a relaxed consistency model for memory loads
       and stores. The rules are:

       1. A younger load is allowed to pass an older load only if there are
          no intervening stores or barriers between the two loads.

       2. A younger load is allowed to pass an older store provided that
          the load does not alias with the store.

       3. A younger store is not allowed to pass an older store.

       4. A younger store is not allowed to pass an older load.

       By default, the LSUnit optimistically assumes that loads do not alias
       with store operations (-noalias=true). Under this assumption, younger
       loads are always allowed to pass older stores. Essentially, the
       LSUnit does not attempt to run any alias analysis to predict when
       loads and stores do not alias with each other.

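       A more conservative simulation, in which younger loads are never
       allowed to pass older stores, can be requested by disabling that
       assumption:

          $ llvm-mca -mcpu=btver2 -noalias=false foo.s
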
       Note that, in the case of write-combining memory, rule 3 could be
       relaxed to allow reordering of non-aliasing store operations. That
       being said, at the moment, there is no way to further relax the
       memory model (-noalias is the only option). Essentially, there is no
       option to specify a different memory type (e.g., write-back,
       write-combining, write-through, etc.) and consequently to weaken, or
       strengthen, the memory model.

       Other limitations are:

       • The LSUnit does not know when store-to-load forwarding may occur.

       • The LSUnit does not know anything about cache hierarchy and memory
         types.

       • The LSUnit does not know how to identify serializing operations
         and memory fences.

       The LSUnit does not attempt to predict if a load or store hits or
       misses the L1 cache. It only knows if an instruction "MayLoad"
       and/or "MayStore." For loads, the scheduling model provides an
       "optimistic" load-to-use latency (which usually matches the
       load-to-use latency for when there is a hit in the L1D).

       llvm-mca does not know about serializing operations or
       memory-barrier-like instructions. The LSUnit conservatively assumes
       that an instruction which has both "MayLoad" and unmodeled side
       effects behaves like a "soft" load-barrier. That means, it serializes
       loads without forcing a flush of the load queue. Similarly,
       instructions that "MayStore" and have unmodeled side effects are
       treated like store barriers. A full memory barrier is a "MayLoad" and
       "MayStore" instruction with unmodeled side effects. This is
       inaccurate, but it is the best that we can do at the moment with the
       current information available in LLVM.

       A load/store barrier consumes one entry of the load/store queue. A
       load/store barrier enforces ordering of loads/stores. A younger load
       cannot pass a load barrier. Also, a younger store cannot pass a store
       barrier. A younger load has to wait for the memory/load barrier to
       execute. A load/store barrier is "executed" when it becomes the
       oldest entry in the load/store queue(s). That also means, by
       construction, all of the older loads/stores have been executed.

       In conclusion, the full set of load/store consistency rules is:

       1. A store may not pass a previous store.

       2. A store may not pass a previous load (regardless of -noalias).

       3. A store has to wait until an older store barrier is fully
          executed.

       4. A load may pass a previous load.

       5. A load may not pass a previous store unless -noalias is set.

       6. A load has to wait until an older load barrier is fully executed.

AUTHOR
       Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT
       2003-2021, LLVM Project

10                                2021-07-22                      LLVM-MCA(1)