LLVM-MCA(1)                          LLVM                         LLVM-MCA(1)

NAME
       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS
       llvm-mca [options] [input]
DESCRIPTION
       llvm-mca is a performance analysis tool that uses information
       available in LLVM (e.g. scheduling models) to statically measure the
       performance of machine code on a specific CPU.

       Performance is measured in terms of throughput as well as processor
       resource consumption. The tool currently works for processors with
       an out-of-order backend, for which there is a scheduling model
       available in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help with diagnosing
       potential performance issues.

       Given an assembly code sequence, llvm-mca estimates the Instructions
       Per Cycle (IPC), as well as hardware resource pressure. The analysis
       and reporting style were inspired by the IACA tool from Intel.

       For example, you can compile code with clang, output assembly, and
       pipe it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

       (llvm-mca detects Intel syntax by the presence of an .intel_syntax
       directive at the beginning of the input. By default its output
       syntax matches that of its input.)

       Scheduling models are not only used to compute instruction latencies
       and throughput, but also to understand what processor resources are
       available and how to simulate them.

       By design, the quality of the analysis conducted by llvm-mca is
       inevitably affected by the quality of the scheduling models in LLVM.

       If you see that the performance report is not accurate for a
       processor, please file a bug against the appropriate backend.

OPTIONS
       If input is "-" or omitted, llvm-mca reads from standard input.
       Otherwise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca will send its output to
       standard output if the input is from standard input. If the -o
       option specifies "-", then the output will also be sent to standard
       output.

       -help  Print a summary of command line options.

       -o <filename>
              Use <filename> as the output filename. See the summary above
              for more details.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code. It
              defaults to the host default target.

       -mcpu=<cpuname>
              Specify the processor for which to analyze the code. By
              default, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated
              by the tool. On x86, possible values are [0, 1]. A value of 0
              selects the AT&T assembly format, while a value of 1 selects
              the Intel assembly format for the code printed out by the
              tool in the analysis report.

       -print-imm-hex
              Prefer hex format for numeric literals in the output assembly
              printed as part of the report.

       -dispatch=<width>
              Specify a different dispatch width for the processor. The
              dispatch width defaults to field 'IssueWidth' in the
              processor scheduling model. If width is zero, then the
              default dispatch width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this
              flag limits how many physical registers are available for
              register renaming purposes. A value of zero for this flag
              means "unlimited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run. If this flag is set
              to 0, then the tool sets the number of iterations to a
              default value (i.e. 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias.
              This is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the load queue. A value of
              zero for this flag is ignored, and the default load queue
              size is used instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the store queue. A value of
              zero for this flag is ignored, and the default store queue
              size is used instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view.
              By default, the timeline view prints information for up to 10
              iterations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view. By default,
              the number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view. This is enabled by
              default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics. This view collects and
              analyzes instruction dispatch events, as well as
              static/dynamic dispatch stall events. This view is disabled
              by default.

       -scheduler-stats
              Enable extra scheduler statistics. This view collects and
              analyzes instruction issue events. This view is disabled by
              default.

       -retire-stats
              Enable extra retire control unit statistics. This view is
              disabled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -show-encoding
              Enable the printing of instruction encodings within the
              instruction info view.

       -all-stats
              Print all hardware statistics. This enables extra statistics
              related to the dispatch logic, the hardware schedulers, the
              register file(s), and the retire control unit. This option is
              disabled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Print resource pressure information based on the static
              information available from the processor model. This differs
              from the resource pressure view because it doesn't require
              that the code is simulated. It instead prints the theoretical
              uniform distribution of resource pressure for every
              instruction in sequence.

       -bottleneck-analysis
              Print information about bottlenecks that affect the
              throughput. This analysis can be expensive, and it is
              disabled by default. Bottlenecks are highlighted in the
              summary view.

       -json  Print the requested views in JSON format. The instructions
              and the processor resources are printed as members of special
              top level JSON objects. The individual views refer to them by
              index.

EXIT STATUS
       llvm-mca returns 0 on success. Otherwise, an error message is
       printed to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
       llvm-mca allows for the optional usage of special code comments to
       mark regions of the assembly code to be analyzed. A comment starting
       with substring LLVM-MCA-BEGIN marks the beginning of a code region.
       A comment starting with substring LLVM-MCA-END marks the end of a
       code region. For example:

          # LLVM-MCA-BEGIN
            ...
          # LLVM-MCA-END

       If no user-defined region is specified, then llvm-mca assumes a
       default region which contains every instruction in the input file.
       Every region is analyzed in isolation, and the final performance
       report is the union of all the reports generated for every code
       region.

       Code regions can have names. For example:

          # LLVM-MCA-BEGIN A simple example
            add %eax, %eax
          # LLVM-MCA-END

       The code from the example above defines a region named "A simple
       example" with a single instruction in it. Note how the region name
       doesn't have to be repeated in the LLVM-MCA-END directive. In the
       absence of overlapping regions, an anonymous LLVM-MCA-END directive
       always ends the currently active user-defined region.

       Example of nesting regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END bar
          # LLVM-MCA-END foo

       Example of overlapping regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END foo
            add %eax, %edx
          # LLVM-MCA-END bar

       Note that multiple anonymous regions cannot overlap. Also,
       overlapping regions cannot have the same name.
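
       As an illustration of the marker convention described above, here is
       a small Python sketch (not llvm-mca's actual parser; the function
       name and data layout are invented for this example) that collects
       the instructions belonging to each region, including nested and
       overlapping ones:

```python
# Illustrative sketch only: extract named code regions delimited by
# LLVM-MCA-BEGIN/LLVM-MCA-END comments from a block of assembly text.
def extract_regions(asm_text):
    regions = {}   # region name -> list of instruction lines
    active = []    # currently open region names, in opening order
    for line in asm_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("# LLVM-MCA-BEGIN"):
            name = stripped[len("# LLVM-MCA-BEGIN"):].strip() or "<anonymous>"
            active.append(name)
            regions.setdefault(name, [])
        elif stripped.startswith("# LLVM-MCA-END"):
            name = stripped[len("# LLVM-MCA-END"):].strip()
            # An anonymous END closes the most recently opened region.
            active.remove(name if name else active[-1])
        elif stripped:
            # An instruction belongs to every region open at this point.
            for name in active:
                regions[name].append(stripped)
    return regions

asm = """\
# LLVM-MCA-BEGIN foo
add %eax, %edx
# LLVM-MCA-BEGIN bar
sub %eax, %edx
# LLVM-MCA-END foo
add %eax, %edx
# LLVM-MCA-END bar
"""
print(extract_regions(asm))
```

       Running it on the overlapping example above shows how a single
       instruction (here the sub) can belong to several regions at once.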

       There is no support for marking regions from high-level source code,
       like C or C++. As a workaround, inline assembly directives may be
       used:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo");
            a += 42;
            __asm volatile("# LLVM-MCA-END");
            a *= b;
            return a;
          }

       However, this interferes with optimizations like loop vectorization
       and may have an impact on the code generated. This is because the
       __asm statements are seen as real code having important side
       effects, which limits how the code around them can be transformed.
       If users want to make use of inline assembly to emit markers, then
       the recommendation is to always verify that the output assembly is
       equivalent to the assembly generated in the absence of markers. The
       Clang options to emit optimization reports can also help in
       detecting missed optimizations.

HOW LLVM-MCA WORKS
       llvm-mca takes assembly code as input. The assembly code is parsed
       into a sequence of MCInst with the help of the existing LLVM target
       assembly parsers. The parsed sequence of MCInst is then analyzed by
       a Pipeline module to generate a performance report.

       The Pipeline module simulates the execution of the machine code
       sequence in a loop of iterations (default is 100). During this
       process, the pipeline collects a number of execution related
       statistics. At the end of this process, the pipeline generates and
       prints a report from the collected statistics.

       Here is an example of a performance report generated by the tool for
       a dot-product of two packed float vectors of four elements. The
       analysis is conducted for target x86, cpu btver2. The following
       result can be produced via the following command using the example
       located at test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s
          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Total uOps:        900

          Dispatch Width:    2
          uOps Per Cycle:    1.48
          IPC:               1.48
          Block RThroughput: 2.0

          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps   %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps  %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps  %xmm3, %xmm3, %xmm4

          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL

          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps   %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm3, %xmm3, %xmm4

       According to this report, the dot-product kernel has been executed
       300 times, for a total of 900 simulated instructions. The total
       number of simulated micro opcodes (uOps) is also 900.

       The report is structured in three main sections. The first section
       collects a few performance numbers; the goal of this section is to
       give a very quick overview of the performance throughput. Important
       performance indicators are IPC, uOps Per Cycle, and Block
       RThroughput (Block Reciprocal Throughput).

       Field DispatchWidth is the maximum number of micro opcodes that are
       dispatched to the out-of-order backend every simulated cycle.

       IPC is computed by dividing the total number of simulated
       instructions by the total number of cycles.

       Field Block RThroughput is the reciprocal of the block throughput.
       Block throughput is a theoretical quantity computed as the maximum
       number of blocks (i.e. iterations) that can be executed per
       simulated clock cycle in the absence of loop carried dependencies.
       Block throughput is bounded from above by the dispatch rate and by
       the availability of hardware resources.

       In the absence of loop-carried data dependencies, the observed IPC
       tends to a theoretical maximum which can be computed by dividing the
       number of instructions of a single iteration by the Block
       RThroughput.

       Field 'uOps Per Cycle' is computed by dividing the total number of
       simulated micro opcodes by the total number of cycles. A delta
       between Dispatch Width and this field is an indicator of a
       performance issue. In the absence of loop-carried data dependencies,
       the observed 'uOps Per Cycle' should tend to a theoretical maximum
       throughput which can be computed by dividing the number of uOps of a
       single iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is bounded from above by the dispatch width.
       That is because the dispatch width limits the maximum size of a
       dispatch group. Both IPC and 'uOps Per Cycle' are limited by the
       amount of hardware parallelism. The availability of hardware
       resources affects the resource pressure distribution, and it limits
       the number of instructions that can be executed in parallel every
       cycle. A delta between Dispatch Width and the theoretical maximum
       uOps per Cycle (computed by dividing the number of uOps of a single
       iteration by the Block RThroughput) is an indicator of a performance
       bottleneck caused by the lack of hardware resources. In general, the
       lower the Block RThroughput, the better.

       In this example, uOps per iteration/Block RThroughput is 1.50. Since
       there are no loop-carried dependencies, the observed uOps Per Cycle
       is expected to approach 1.50 when the number of iterations tends to
       infinity. The delta between the Dispatch Width (2.00) and the
       theoretical maximum uOp throughput (1.50) is an indicator of a
       performance bottleneck caused by the lack of hardware resources, and
       the Resource pressure view can help to identify the problematic
       resource usage.
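
       The arithmetic above can be checked directly against the summary
       numbers of this report (the values below are transcribed from the
       example; this is plain arithmetic, not an llvm-mca API):

```python
# Reproduce the summary arithmetic from the dot-product report.
instructions = 900      # 300 iterations x 3 instructions
uops = 900              # each instruction is a single micro opcode here
cycles = 610
uops_per_iteration = 3
block_rthroughput = 2.0

ipc = instructions / cycles
uops_per_cycle = uops / cycles
# Theoretical maximum throughput in the absence of loop-carried deps.
theoretical_max = uops_per_iteration / block_rthroughput

print(round(ipc, 2))             # 1.48, as in the report
print(round(uops_per_cycle, 2))  # 1.48
print(theoretical_max)           # 1.5
```

       The gap between the theoretical maximum (1.5) and the dispatch width
       (2) is exactly the resource-pressure bottleneck discussed above.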

       The second section of the report is the instruction info view. It
       shows the latency and reciprocal throughput of every instruction in
       the sequence. It also reports extra information related to the
       number of micro opcodes, and opcode properties (i.e., 'MayLoad',
       'MayStore', and 'HasSideEffects').

       Field RThroughput is the reciprocal of the instruction throughput.
       Throughput is computed as the maximum number of instructions of the
       same type that can be executed per clock cycle in the absence of
       operand dependencies. In this example, the reciprocal throughput of
       a vector float multiply is 1 cycle per instruction. That is because
       the FP multiplier JFPM is only available from pipeline JFPU1.

       Instruction encodings are displayed within the instruction info view
       when flag -show-encoding is specified.

       Below is an example of -show-encoding output for the dot-product
       kernel:

          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)
          [7]: Encoding Size

          [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:      Instructions:
           1      2     1.00                         4     c5 f0 59 d0     vmulps   %xmm0, %xmm1, %xmm2
           1      4     1.00                         4     c5 eb 7c da     vhaddps  %xmm2, %xmm2, %xmm3
           1      4     1.00                         4     c5 e3 7c e3     vhaddps  %xmm3, %xmm3, %xmm4

       The Encoding Size column shows the size in bytes of instructions.
       The Encodings column shows the actual instruction encodings (byte
       sequences in hex).

       The third section is the Resource pressure view. This view reports
       the average number of resource cycles consumed every iteration by
       instructions for every processor resource unit available on the
       target. Information is structured in two tables. The first table
       reports the number of resource cycles spent on average every
       iteration. The second table correlates the resource cycles to the
       machine instructions in the sequence. For example, every iteration
       of instruction vmulps executes on resource unit [6] (JFPU1 -
       floating point pipeline #1), consuming an average of 1 resource
       cycle per iteration. Note that on AMD Jaguar, vector floating-point
       multiply can only be issued to pipeline JFPU1, while horizontal
       floating-point additions can only be issued to pipeline JFPU0.

       The resource pressure view helps with identifying bottlenecks caused
       by high usage of specific hardware resources. Situations with
       resource pressure mainly concentrated on a few resources should, in
       general, be avoided. Ideally, pressure should be uniformly
       distributed between multiple resources.
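
       To make the relationship between the two tables concrete, the
       per-iteration row can be recomputed as the column-wise sum of the
       per-instruction rows (values transcribed from the report above; this
       is plain arithmetic, not llvm-mca output):

```python
# Per-instruction resource pressure from the dot-product report,
# keyed by the resource names used in the report's legend.
pressure = {
    "vmulps  %xmm0, %xmm1, %xmm2": {"JFPM": 1.00, "JFPU1": 1.00},
    "vhaddps %xmm2, %xmm2, %xmm3": {"JFPA": 1.00, "JFPU0": 1.00},
    "vhaddps %xmm3, %xmm3, %xmm4": {"JFPA": 1.00, "JFPU0": 1.00},
}

# Summing the columns reproduces the "Resource pressure per iteration"
# row: 2.00 on JFPA, 1.00 on JFPM, 2.00 on JFPU0, 1.00 on JFPU1.
totals = {}
for row in pressure.values():
    for unit, unit_cycles in row.items():
        totals[unit] = totals.get(unit, 0.0) + unit_cycles
print(totals)
```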

   Timeline View
       The timeline view produces a detailed report of each instruction's
       state transitions through an instruction pipeline. This view is
       enabled by the command line option -timeline. As instructions
       transition through the various stages of the pipeline, their states
       are depicted in the view report. These states are represented by the
       following characters:

       • D : Instruction dispatched.

       • e : Instruction executing.

       • E : Instruction executed.

       • R : Instruction retired.

       • = : Instruction already dispatched, waiting to be executed.

       • - : Instruction executed, waiting to be retired.
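
       A quick way to read a timeline row is to count how many cycles an
       instruction spends in each state. The helper below is an
       illustrative sketch based on the character encoding above (the
       function name is invented; this is not part of llvm-mca):

```python
# Count pipeline states in one timeline row, using the encoding above:
# '=' waiting to execute, 'e'/'E' executing cycles, '-' waiting to retire.
def state_counts(row):
    return {
        "dispatch_wait": row.count("="),
        "executing": row.count("e") + row.count("E"),
        "retire_wait": row.count("-"),
    }

# "DeeE-----R" is the row of instruction [1,0] in the example below:
# 3 execution cycles, then 5 cycles waiting to retire in program order.
print(state_counts("DeeE-----R"))
```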

       Below is the timeline view for a subset of the dot-product example
       located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and
       processed by llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4

          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
                 3     3.3    0.5    1.4       <total>

       The timeline view is interesting because it shows instruction state
       changes during execution. It also gives an idea of how the tool
       processes instructions executed on the target, and how their timing
       information might be calculated.

       The timeline view is structured in two tables. The first table shows
       instructions changing state over time (measured in cycles); the
       second table (named Average Wait times) reports useful timing
       statistics, which should help diagnose performance bottlenecks
       caused by long data dependencies and sub-optimal usage of hardware
       resources.

       An instruction in the timeline view is identified by a pair of
       indices, where the first index identifies an iteration, and the
       second index is the instruction index (i.e., where it appears in the
       code sequence). Since this example was generated using
       -iterations=3, the iteration indices range from 0 to 2 inclusive.

       Excluding the first and last column, the remaining columns are in
       cycles. Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       • Instruction [1,0] was dispatched at cycle 1.

       • Instruction [1,0] started executing at cycle 2.

       • Instruction [1,0] reached the write back stage at cycle 4.

       • Instruction [1,0] was retired at cycle 10.

       Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
       wait in the scheduler's queue for the operands to become available.
       By the time vmulps is dispatched, operands are already available,
       and pipeline JFPU1 is ready to serve another instruction. So the
       instruction can be immediately issued on the JFPU1 pipeline. That is
       demonstrated by the fact that the instruction only spent 1cy in the
       scheduler's queue.

       There is a gap of 5 cycles between the write-back stage and the
       retire event. That is because instructions must retire in program
       order, so [1,0] has to wait for [0,2] to be retired first (i.e., it
       has to wait until cycle 10).

       In the example, all instructions are in a RAW (Read After Write)
       dependency chain. Register %xmm2 written by vmulps is immediately
       used by the first vhaddps, and register %xmm3 written by the first
       vhaddps is used by the second vhaddps. Long data dependencies
       negatively impact the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies introduced
       by instructions from different iterations. However, those
       dependencies can be removed at the register renaming stage (at the
       cost of allocating register aliases, and therefore consuming
       physical registers).

       Table Average Wait times helps diagnose performance issues that are
       caused by the presence of long latency instructions and potentially
       long data dependencies which may limit the ILP. The last row,
       <total>, shows a global average over all instructions measured. Note
       that llvm-mca, by default, assumes at least 1cy between the dispatch
       event and the issue event.
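
       The <total> row can be approximated by averaging the per-instruction
       columns of the table above. Note the approximation: llvm-mca
       averages unrounded per-iteration counts, so column [2] comes out as
       0.6 here instead of the 0.5 shown in the report:

```python
# Per-instruction averages transcribed from the Average Wait times table.
queue_wait = [1.0, 3.3, 5.7]   # column [1]: waiting in a scheduler's queue
ready_wait = [1.0, 0.7, 0.0]   # column [2]: waiting while ready
retire_wait = [3.3, 1.0, 0.0]  # column [3]: from write-back to retire

def mean(xs):
    return round(sum(xs) / len(xs), 1)

print(mean(queue_wait), mean(ready_wait), mean(retire_wait))  # 3.3 0.6 1.4
```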

       When the performance is limited by data dependencies and/or long
       latency instructions, the number of cycles spent while in the ready
       state is expected to be very small when compared with the total
       number of cycles spent in the scheduler's queue. The difference
       between the two counters is a good indicator of how large of an
       impact data dependencies had on the execution of the instructions.
       When performance is mostly limited by the lack of hardware
       resources, the delta between the two counters is small. However, the
       number of cycles spent in the queue tends to be larger (i.e., more
       than 1-3cy), especially when compared to other low latency
       instructions.

   Bottleneck Analysis
       The -bottleneck-analysis command line option enables the analysis of
       performance bottlenecks.

       This analysis is potentially expensive. It attempts to correlate
       increases in backend pressure (caused by pipeline resource pressure
       and data dependencies) to dynamic dispatch stalls.

       Below is an example of -bottleneck-analysis output generated by
       llvm-mca for 500 iterations of the dot-product example on btver2.

          Cycles with backend pressure increase [ 48.07% ]
          Throughput Bottlenecks:
            Resource Pressure       [ 47.77% ]
            - JFPA  [ 47.77% ]
            - JFPU0  [ 47.77% ]
            Data Dependencies:      [ 0.30% ]
            - Register Dependencies [ 0.30% ]
            - Memory Dependencies   [ 0.00% ]

          Critical sequence based on the simulation:

                        Instruction                     Dependency Information
           +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
           |
           |    < loop carried >
           |
           |      0.    vmulps  %xmm0, %xmm1, %xmm2
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3    ## RESOURCE interference:  JFPA [ probability: 74% ]
           +----> 2.    vhaddps %xmm3, %xmm3, %xmm4    ## REGISTER dependency:  %xmm3
           |
           |    < loop carried >
           |
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3    ## RESOURCE interference:  JFPA [ probability: 74% ]

       According to the analysis, throughput is limited by resource
       pressure and not by data dependencies. The analysis observed
       increases in backend pressure during 48.07% of the simulated run.
       Almost all those pressure increase events were caused by contention
       on processor resources JFPA/JFPU0.

       The critical sequence is the most expensive sequence of instructions
       according to the simulation. It is annotated to provide extra
       information about critical register dependencies and resource
       interferences between instructions.

       Instructions from the critical sequence are expected to
       significantly impact performance. By construction, the accuracy of
       this analysis is strongly dependent on the simulation and (as
       always) on the quality of the processor model in LLVM.

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and
       performance counters for the dispatch logic, the reorder buffer, the
       retire control unit, and the register file.

       Below is an example of -all-stats output generated by llvm-mca for
       300 iterations of the dot-product example discussed in the previous
       sections.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272  (44.6%)
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0

          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)

          Schedulers - number of cycles where we saw N micro opcodes issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)

          Scheduler's queue usage:
          [1] Resource name.
          [2] Average number of used buffer entries.
          [3] Maximum number of used buffer entries.
          [4] Total number of buffer entries.

           [1]            [2]        [3]        [4]
          JALU01           0          0         20
          JFPU01          17         18         18
          JLSAGU           0          0         12

          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
           2,           399  (65.4%)

          Total ROB Entries:                64
          Max Used ROB Entries:             35  ( 54.7% )
          Average Used ROB Entries per cy:  32  ( 50.0% )

          Register File statistics:
          Total number of mappings created:    900
          Max number of mappings used:         35

          *  Register File #1 -- JFpuPRF:
             Number of physical registers:     72
             Total number of mappings created: 900
             Max number of mappings used:      35

          *  Register File #2 -- JIntegerPRF:
             Number of physical registers:     64
             Total number of mappings created: 0
             Max number of mappings used:      0

       If we look at the Dynamic Dispatch Stall Cycles table, we see that
       the counter for SCHEDQ reports 272 cycles. This counter is
       incremented every time the dispatch logic is unable to dispatch a
       full group because the scheduler's queue is full.

       Looking at the Dispatch Logic table, we see that the pipeline was
       only able to dispatch two micro opcodes 51.5% of the time. The
       dispatch group was limited to one micro opcode 44.6% of the cycles,
       which corresponds to 272 cycles. The dispatch statistics are
       displayed by using either the command option -all-stats or
       -dispatch-stats.

       The next table, Schedulers, presents a histogram displaying a count,
       representing the number of micro opcodes issued on some number of
       cycles. In this case, of the 610 simulated cycles, single opcodes
       were issued 306 times (50.2%) and there were 7 cycles where no
       opcodes were issued.
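
       The percentages in these histograms are simply each cycle count
       divided by the 610 simulated cycles. For example, for the Dispatch
       Logic table:

```python
# Recompute the Dispatch Logic histogram percentages from the report.
total_cycles = 610
dispatched = {0: 24, 1: 272, 2: 314}  # N uOps dispatched -> cycle count

# The cycle counts partition the whole run.
assert sum(dispatched.values()) == total_cycles

for n, n_cycles in dispatched.items():
    print(n, n_cycles, f"{100 * n_cycles / total_cycles:.1f}%")
# 0 24 3.9%
# 1 272 44.6%
# 2 314 51.5%
```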

       The Scheduler's queue usage table shows the average and maximum
       number of buffer entries (i.e., scheduler queue entries) used at
       runtime. Resource JFPU01 reached its maximum (18 of 18 queue
       entries). Note that AMD Jaguar implements three schedulers:

       • JALU01 - A scheduler for ALU instructions.

       • JFPU01 - A scheduler for floating point operations.

       • JLSAGU - A scheduler for address generation.

       The dot-product is a kernel of three floating point instructions (a
       vector multiply followed by two horizontal adds). That explains why
       only the floating point scheduler appears to be used.

       A full scheduler queue is caused either by data dependency chains or
       by a sub-optimal usage of hardware resources. Sometimes, resource
       pressure can be mitigated by rewriting the kernel using different
       instructions that consume different scheduler resources. Schedulers
       with a small queue are less resilient to bottlenecks caused by the
       presence of long data dependencies. The scheduler statistics are
       displayed by using the command option -all-stats or
       -scheduler-stats.

       The next table, Retire Control Unit, presents a histogram displaying
       a count, representing the number of instructions retired on some
       number of cycles. In this case, of the 610 simulated cycles, two
       instructions were retired during the same cycle 399 times (65.4%)
       and there were 109 cycles where no instructions were retired. The
       retire statistics are displayed by using the command option
       -all-stats or -retire-stats.

       The last table presented is Register File statistics. Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for
       floating-point registers (JFpuPRF) and one for integer registers
       (JIntegerPRF). The table shows that of the 900 instructions
       processed, there were 900 mappings created. Since this dot-product
       example utilized only floating point registers, the JFpuPRF was
       responsible for creating the 900 mappings. However, we see that the
       pipeline only used a maximum of 35 of 72 available register slots at
       any given time. We can conclude that the floating point PRF was the
       only register file used for the example, and that it was never
       resource constrained. The register file statistics are displayed by
       using the command option -all-stats or -register-file-stats.

       In this example, we can conclude that the IPC is mostly limited by
       data dependencies, and not by resource pressure.
750
Instruction Flow
This section describes the instruction flow through the default
pipeline of llvm-mca, as well as the functional units involved in the
process.

The default pipeline implements the following sequence of stages used
to process instructions.

• Dispatch (Instruction is dispatched to the schedulers).

• Issue (Instruction is issued to the processor pipelines).

• Write Back (Instruction is executed, and results are written back).

• Retire (Instruction is retired; writes are architecturally
  committed).

The default pipeline only models the out-of-order portion of a
processor.  Therefore, the instruction fetch and decode stages are not
modeled, and performance bottlenecks in the frontend are not
diagnosed.  llvm-mca assumes that instructions have all been decoded
and placed into a queue before the simulation starts.  Also, llvm-mca
does not model branch prediction.

Instruction Dispatch
During the dispatch stage, instructions are picked in program order
from a queue of already decoded instructions, and dispatched in groups
to the simulated hardware schedulers.

The size of a dispatch group depends on the availability of the
simulated hardware resources.  The processor dispatch width defaults
to the value of the IssueWidth field in LLVM's scheduling model.

An instruction can be dispatched if:

• The size of the dispatch group is smaller than the processor's
  dispatch width.

• There are enough entries in the reorder buffer.

• There are enough physical registers to do register renaming.

• The schedulers are not full.

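The four conditions above can be collapsed into a single predicate.
The following is an illustrative sketch, not llvm-mca's actual
implementation; all names (DispatchState, can_dispatch, and the field
names) are hypothetical.

```python
# Illustrative sketch of the dispatch checks described above; this is
# not llvm-mca code, and all names are hypothetical.
from dataclasses import dataclass

@dataclass
class DispatchState:
    dispatch_width: int          # micro-ops dispatchable per cycle
    dispatched_this_cycle: int   # micro-ops already dispatched
    rob_free_entries: int        # free reorder buffer entries
    free_phys_regs: int          # registers available for renaming
    scheduler_free_slots: int    # free scheduler buffer entries

def can_dispatch(state: DispatchState, micro_ops: int, reg_defs: int) -> bool:
    """True if an instruction made of `micro_ops` micro-opcodes that
    renames `reg_defs` destination registers can be dispatched now."""
    return (state.dispatched_this_cycle + micro_ops <= state.dispatch_width
            and state.rob_free_entries >= micro_ops
            and state.free_phys_regs >= reg_defs
            and state.scheduler_free_slots > 0)
```

For example, under this sketch a two micro-op instruction stalls in
dispatch when only one reorder buffer entry is free, even if every
other resource is available.
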
Scheduling models can optionally specify which register files are
available on the processor.  llvm-mca uses that information to
initialize register file descriptors.  Users can limit the number of
physical registers that are globally available for register renaming
by using the command option -register-file-size.  A value of zero for
this option means unbounded.  By knowing how many registers are
available for renaming, the tool can predict dispatch stalls caused by
the lack of physical registers.

The number of reorder buffer entries consumed by an instruction
depends on the number of micro-opcodes specified for that instruction
by the target scheduling model.  The reorder buffer is responsible for
tracking the progress of instructions that are "in-flight", and
retiring them in program order.  The number of entries in the reorder
buffer defaults to the value specified by the field MicroOpBufferSize
in the target scheduling model.

Instructions that are dispatched to the schedulers consume scheduler
buffer entries.  llvm-mca queries the scheduling model to determine
the set of buffered resources consumed by an instruction.  Buffered
resources are treated like scheduler resources.

Instruction Issue
Each processor scheduler implements a buffer of instructions.  An
instruction has to wait in the scheduler's buffer until its input
register operands become available.  Only at that point does the
instruction become eligible for execution and may be issued
(potentially out-of-order) for execution.  Instruction latencies are
computed by llvm-mca with the help of the scheduling model.

llvm-mca's scheduler is designed to simulate multiple processor
schedulers.  The scheduler is responsible for tracking data
dependencies, and dynamically selecting which processor resources are
consumed by instructions.  It delegates the management of processor
resource units and resource groups to a resource manager.  The
resource manager is responsible for selecting the resource units that
are consumed by instructions.  For example, if an instruction consumes
1cy of a resource group, the resource manager selects one of the
available units from the group; by default, the resource manager uses
a round-robin selector to guarantee that resource usage is uniformly
distributed between all units of a group.

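The round-robin selection just described can be sketched as follows.
This is illustrative only, not llvm-mca's resource manager; the class
name is hypothetical, and the unit names merely echo the Jaguar
floating-point pipes used in the earlier example.

```python
# Illustrative sketch of round-robin unit selection within a resource
# group; not llvm-mca code.
class RoundRobinSelector:
    def __init__(self, units):
        self.units = list(units)
        self.next_index = 0

    def select(self):
        """Return the next unit in the group, rotating so that usage
        is uniformly distributed across all units."""
        unit = self.units[self.next_index]
        self.next_index = (self.next_index + 1) % len(self.units)
        return unit

# A group with two floating-point pipes, as on AMD Jaguar:
selector = RoundRobinSelector(["JFPU0", "JFPU1"])
picks = [selector.select() for _ in range(4)]
# picks == ["JFPU0", "JFPU1", "JFPU0", "JFPU1"]
```
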
llvm-mca's scheduler internally groups instructions into three sets:

• WaitSet: a set of instructions whose operands are not ready.

• ReadySet: a set of instructions ready to execute.

• IssuedSet: a set of instructions executing.

Depending on operand availability, instructions that are dispatched to
the scheduler are either placed into the WaitSet or into the ReadySet.

Every cycle, the scheduler checks if instructions can be moved from
the WaitSet to the ReadySet, and if instructions from the ReadySet can
be issued to the underlying pipelines.  The algorithm prioritizes
older instructions over younger instructions.

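One simulated cycle of this bookkeeping might be sketched as below.
This is illustrative only; the instruction representation (a
(program_order, ready_cycle) tuple) and the function name are
hypothetical, not llvm-mca internals.

```python
# Illustrative sketch of one scheduler cycle: wake up instructions
# whose operands became available, then issue the oldest ready
# instructions first.  Not llvm-mca code.
def scheduler_cycle(wait_set, ready_set, issued_set, now, issue_width):
    # WaitSet -> ReadySet: operands became available this cycle.
    woken = [instr for instr in wait_set if instr[1] <= now]
    for instr in woken:
        wait_set.remove(instr)
        ready_set.append(instr)
    # ReadySet -> IssuedSet: issue oldest-first, up to issue_width.
    ready_set.sort(key=lambda instr: instr[0])
    issuing = ready_set[:issue_width]
    del ready_set[:issue_width]
    issued_set.extend(issuing)
    return issuing
```

Note how the sort by program order implements the "older before
younger" priority from the paragraph above.
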
Write-Back and Retire Stage
Issued instructions are moved from the ReadySet to the IssuedSet.
There, instructions wait until they reach the write-back stage.  At
that point, they are removed from the queue and the retire control
unit is notified.

When an instruction is executed, the retire control unit flags it as
"ready to retire."

Instructions are retired in program order.  The register file is
notified of the retirement so that it can free the physical registers
that were allocated for the instruction during the register renaming
stage.

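In-order retirement and the freeing of renamed registers can be
sketched like this.  The names are hypothetical and the reorder buffer
is reduced to a simple deque of (executed, allocated_phys_regs)
entries; this is not how llvm-mca represents it.

```python
# Illustrative sketch of in-order retirement; not llvm-mca code.
from collections import deque

def retire_cycle(rob, free_regs, retire_width):
    """rob holds (executed, allocated_phys_regs) entries in program
    order.  Retire from the front only; stop at the first entry that
    has not finished executing, or when retire_width is reached."""
    retired = 0
    while rob and retired < retire_width and rob[0][0]:
        _, regs = rob.popleft()
        free_regs.extend(regs)   # register file reclaims the mappings
        retired += 1
    return retired
```

Even if a younger instruction has already executed, it cannot retire
past an older instruction that is still in flight; that is what the
front-of-queue check expresses.
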
Load/Store Unit and Memory Consistency Model
To simulate the out-of-order execution of memory operations, llvm-mca
uses a simulated load/store unit (LSUnit) to model the speculative
execution of loads and stores.

Each load (or store) consumes an entry in the load (or store) queue.
Users can specify the flags -lqueue and -squeue to limit the number of
entries in the load and store queues respectively.  The queues are
unbounded by default.

The LSUnit implements a relaxed consistency model for memory loads and
stores.  The rules are:

1. A younger load is allowed to pass an older load only if there are
   no intervening stores or barriers between the two loads.

2. A younger load is allowed to pass an older store provided that the
   load does not alias with the store.

3. A younger store is not allowed to pass an older store.

4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias
with store operations (-noalias=true).  Under this assumption, younger
loads are always allowed to pass older stores.  Essentially, the
LSUnit does not attempt to run any alias analysis to predict when
loads and stores do not alias with each other.

Note that, in the case of write-combining memory, rule 3 could be
relaxed to allow reordering of non-aliasing store operations.  That
being said, at the moment there is no way to further relax the memory
model (-noalias is the only option).  Essentially, there is no option
to specify a different memory type (e.g., write-back, write-combining,
write-through) and consequently to weaken or strengthen the memory
model.

Other limitations are:

• The LSUnit does not know when store-to-load forwarding may occur.

• The LSUnit does not know anything about cache hierarchy and memory
  types.

• The LSUnit does not know how to identify serializing operations and
  memory fences.

The LSUnit does not attempt to predict if a load or store hits or
misses the L1 cache.  It only knows if an instruction "MayLoad" and/or
"MayStore."  For loads, the scheduling model provides an "optimistic"
load-to-use latency (which usually matches the load-to-use latency for
when there is a hit in the L1D).

llvm-mca does not know about serializing operations or
memory-barrier-like instructions.  The LSUnit conservatively assumes
that an instruction which has both "MayLoad" and unmodeled side
effects behaves like a "soft" load barrier.  That is, it serializes
loads without forcing a flush of the load queue.  Similarly,
instructions that "MayStore" and have unmodeled side effects are
treated like store barriers.  A full memory barrier is a "MayLoad" and
"MayStore" instruction with unmodeled side effects.  This is
inaccurate, but it is the best that we can do at the moment with the
information available in LLVM.

A load/store barrier consumes one entry of the load/store queue.  A
load/store barrier enforces ordering of loads/stores: a younger load
cannot pass a load barrier, and a younger store cannot pass a store
barrier.  A younger load has to wait for the load barrier to execute.
A load/store barrier is "executed" when it becomes the oldest entry in
the load/store queue(s).  That also means, by construction, that all
of the older loads/stores have been executed.

In conclusion, the full set of load/store consistency rules is:

1. A store may not pass a previous store.

2. A store may not pass a previous load (regardless of -noalias).

3. A store has to wait until an older store barrier is fully executed.

4. A load may pass a previous load.

5. A load may not pass a previous store unless -noalias is set.

6. A load has to wait until an older load barrier is fully executed.

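As a compact restatement, the six rules above can be sketched as a
predicate deciding whether a younger memory operation may pass an
older one.  This is illustrative only; llvm-mca exposes no such
function, and the operation kinds and the treatment of a younger store
versus a load barrier (not covered by the rules) are assumptions.

```python
# Illustrative sketch of the consistency rules; not llvm-mca code.
# Operation kinds: "load", "store", "load_barrier", "store_barrier".
def may_pass(younger, older, noalias=True):
    """True if the younger operation may execute before the older one."""
    if younger == "store":
        # Rules 1-3: a store never passes an older store, load, or
        # store barrier (conservatively, nothing at all).
        return False
    if younger == "load":
        if older == "load":
            return True            # rule 4
        if older == "store":
            return noalias         # rule 5: only under -noalias
        if older == "load_barrier":
            return False           # rule 6
        return True                # assumed: store barriers do not order loads
    return False                   # barriers execute in order
```

The `noalias` parameter mirrors the -noalias command option described
earlier: with the default optimistic assumption, younger loads freely
pass older stores; without it, they must wait.
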
AUTHOR
Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT
2003-2023, LLVM Project



12                            2023-07-20                   LLVM-MCA(1)