LLVM-MCA(1) LLVM LLVM-MCA(1)

6 llvm-mca - LLVM Machine Code Analyzer
7
9 llvm-mca [options] [input]
10
llvm-mca is a performance analysis tool that uses information available
in LLVM (e.g. scheduling models) to statically measure the performance
of machine code on a specific CPU.
15
16 Performance is measured in terms of throughput as well as processor re‐
17 source consumption. The tool currently works for processors with a
18 backend for which there is a scheduling model available in LLVM.
19
The main goal of this tool is not just to predict the performance of
the code when run on the target, but also to help diagnose potential
performance issues.
23
24 Given an assembly code sequence, llvm-mca estimates the Instructions
25 Per Cycle (IPC), as well as hardware resource pressure. The analysis
26 and reporting style were inspired by the IACA tool from Intel.
27
28 For example, you can compile code with clang, output assembly, and pipe
29 it directly into llvm-mca for analysis:
30
31 $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2
32
33 Or for Intel syntax:
34
35 $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2
36
37 (llvm-mca detects Intel syntax by the presence of an .intel_syntax di‐
38 rective at the beginning of the input. By default its output syntax
39 matches that of its input.)
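
The same pipeline can also be driven from a script. The following Python sketch assembles the two command lines shown above and connects them through a pipe. The helper names (clang_cmd, mca_cmd, analyze) are hypothetical, and the sketch assumes that clang and llvm-mca are available on PATH.

```python
import subprocess

def clang_cmd(source, intel_syntax=False):
    """Build the clang command that prints assembly on standard output."""
    cmd = ["clang", source, "-O2", "-target", "x86_64-unknown-unknown"]
    if intel_syntax:
        cmd += ["-mllvm", "-x86-asm-syntax=intel"]
    return cmd + ["-S", "-o", "-"]

def mca_cmd(cpu):
    """Build the llvm-mca command; with no input file it reads standard input."""
    return ["llvm-mca", f"-mcpu={cpu}"]

def analyze(source, cpu="btver2", intel_syntax=False):
    """Pipe clang's assembly output into llvm-mca and return the report text."""
    asm = subprocess.run(clang_cmd(source, intel_syntax),
                         check=True, capture_output=True, text=True).stdout
    return subprocess.run(mca_cmd(cpu), input=asm,
                          check=True, capture_output=True, text=True).stdout
```

Calling analyze("foo.c") would return the report as a string; nothing is executed until that call is made.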
40
41 Scheduling models are not just used to compute instruction latencies
42 and throughput, but also to understand what processor resources are
43 available and how to simulate them.
44
45 By design, the quality of the analysis conducted by llvm-mca is in‐
46 evitably affected by the quality of the scheduling models in LLVM.
47
48 If you see that the performance report is not accurate for a processor,
49 please file a bug against the appropriate backend.
50
52 If input is "-" or omitted, llvm-mca reads from standard input. Other‐
53 wise, it will read from the specified filename.
54
55 If the -o option is omitted, then llvm-mca will send its output to
56 standard output if the input is from standard input. If the -o option
57 specifies "-", then the output will also be sent to standard output.
58
59 -help Print a summary of command line options.
60
61 -o <filename>
62 Use <filename> as the output filename. See the summary above for
63 more details.
64
65 -mtriple=<target triple>
66 Specify a target triple string.
67
68 -march=<arch>
69 Specify the architecture for which to analyze the code. It de‐
70 faults to the host default target.
71
72 -mcpu=<cpuname>
73 Specify the processor for which to analyze the code. By de‐
74 fault, the cpu name is autodetected from the host.
75
76 -output-asm-variant=<variant id>
Specify the output assembly variant for the report generated by
the tool. On x86, possible values are [0, 1]. A value of 0 selects
the AT&T assembly format and a value of 1 selects the Intel
assembly format for the code printed out by the tool in the
analysis report.
82
83 -print-imm-hex
84 Prefer hex format for numeric literals in the output assembly
85 printed as part of the report.
86
87 -dispatch=<width>
88 Specify a different dispatch width for the processor. The dis‐
89 patch width defaults to field 'IssueWidth' in the processor
90 scheduling model. If width is zero, then the default dispatch
91 width is used.
92
93 -register-file-size=<size>
94 Specify the size of the register file. When specified, this flag
95 limits how many physical registers are available for register
96 renaming purposes. A value of zero for this flag means "unlim‐
97 ited number of physical registers".
98
99 -iterations=<number of iterations>
100 Specify the number of iterations to run. If this flag is set to
101 0, then the tool sets the number of iterations to a default
102 value (i.e. 100).
103
104 -noalias=<bool>
105 If set, the tool assumes that loads and stores don't alias. This
106 is the default behavior.
107
108 -lqueue=<load queue size>
Specify the size of the load queue in the load/store unit emulated
by the tool. By default, the tool assumes an unbounded number
of entries in the load queue. A value of zero for this flag
is ignored, and the default load queue size is used instead.
113
114 -squeue=<store queue size>
Specify the size of the store queue in the load/store unit emulated
by the tool. By default, the tool assumes an unbounded number
of entries in the store queue. A value of zero for this flag
is ignored, and the default store queue size is used instead.
119
120 -timeline
121 Enable the timeline view.
122
123 -timeline-max-iterations=<iterations>
124 Limit the number of iterations to print in the timeline view. By
125 default, the timeline view prints information for up to 10 iter‐
126 ations.
127
128 -timeline-max-cycles=<cycles>
129 Limit the number of cycles in the timeline view, or use 0 for no
130 limit. By default, the number of cycles is set to 80.
131
132 -resource-pressure
133 Enable the resource pressure view. This is enabled by default.
134
135 -register-file-stats
136 Enable register file usage statistics.
137
138 -dispatch-stats
139 Enable extra dispatch statistics. This view collects and ana‐
140 lyzes instruction dispatch events, as well as static/dynamic
141 dispatch stall events. This view is disabled by default.
142
143 -scheduler-stats
144 Enable extra scheduler statistics. This view collects and ana‐
145 lyzes instruction issue events. This view is disabled by de‐
146 fault.
147
148 -retire-stats
149 Enable extra retire control unit statistics. This view is dis‐
150 abled by default.
151
152 -instruction-info
153 Enable the instruction info view. This is enabled by default.
154
155 -show-encoding
156 Enable the printing of instruction encodings within the instruc‐
157 tion info view.
158
159 -show-barriers
160 Enable the printing of LoadBarrier and StoreBarrier flags within
161 the instruction info view.
162
163 -all-stats
164 Print all hardware statistics. This enables extra statistics re‐
165 lated to the dispatch logic, the hardware schedulers, the regis‐
166 ter file(s), and the retire control unit. This option is dis‐
167 abled by default.
168
169 -all-views
Enable all the views.
171
172 -instruction-tables
173 Prints resource pressure information based on the static infor‐
174 mation available from the processor model. This differs from the
175 resource pressure view because it doesn't require that the code
176 is simulated. It instead prints the theoretical uniform distri‐
177 bution of resource pressure for every instruction in sequence.
178
179 -bottleneck-analysis
180 Print information about bottlenecks that affect the throughput.
181 This analysis can be expensive, and it is disabled by default.
182 Bottlenecks are highlighted in the summary view. Bottleneck
183 analysis is currently not supported for processors with an
184 in-order backend.
185
186 -json Print the requested views in valid JSON format. The instructions
187 and the processor resources are printed as members of special
188 top level JSON objects. The individual views refer to them by
189 index. However, not all views are currently supported. For exam‐
190 ple, the report from the bottleneck analysis is not printed out
191 in JSON. All the default views are currently supported.
192
193 -disable-cb
194 Force usage of the generic CustomBehaviour and InstrPostProcess
195 classes rather than using the target specific implementation.
196 The generic classes never detect any custom hazards or make any
197 post processing modifications to instructions.
198
200 llvm-mca returns 0 on success. Otherwise, an error message is printed
201 to standard error, and the tool returns 1.
202
204 llvm-mca allows for the optional usage of special code comments to mark
205 regions of the assembly code to be analyzed. A comment starting with
206 substring LLVM-MCA-BEGIN marks the beginning of a code region. A com‐
207 ment starting with substring LLVM-MCA-END marks the end of a code re‐
208 gion. For example:
209
210 # LLVM-MCA-BEGIN
211 ...
212 # LLVM-MCA-END
213
214 If no user-defined region is specified, then llvm-mca assumes a default
215 region which contains every instruction in the input file. Every re‐
216 gion is analyzed in isolation, and the final performance report is the
217 union of all the reports generated for every code region.
218
219 Code regions can have names. For example:
220
221 # LLVM-MCA-BEGIN A simple example
222 add %eax, %eax
223 # LLVM-MCA-END
224
225 The code from the example above defines a region named "A simple exam‐
226 ple" with a single instruction in it. Note how the region name doesn't
227 have to be repeated in the LLVM-MCA-END directive. In the absence of
228 overlapping regions, an anonymous LLVM-MCA-END directive always ends
229 the currently active user defined region.
230
231 Example of nesting regions:
232
233 # LLVM-MCA-BEGIN foo
234 add %eax, %edx
235 # LLVM-MCA-BEGIN bar
236 sub %eax, %edx
237 # LLVM-MCA-END bar
238 # LLVM-MCA-END foo
239
240 Example of overlapping regions:
241
242 # LLVM-MCA-BEGIN foo
243 add %eax, %edx
244 # LLVM-MCA-BEGIN bar
245 sub %eax, %edx
246 # LLVM-MCA-END foo
247 add %eax, %edx
248 # LLVM-MCA-END bar
249
250 Note that multiple anonymous regions cannot overlap. Also, overlapping
251 regions cannot have the same name.
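
The marker rules above can be sketched in Python. This is only an illustration of the rules as described in this section, not the parser used by llvm-mca; the function name split_regions and the choice of "#" as the comment prefix are assumptions made for the example.

```python
BEGIN, END = "LLVM-MCA-BEGIN", "LLVM-MCA-END"

def split_regions(asm_text, comment="#"):
    """Return (name, instructions) pairs in the order the regions start."""
    active = []      # regions that are currently open
    opened = []      # every region seen so far, in start order
    everything = []  # all instructions, used for the default region
    for raw in asm_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(comment):
            body = line[len(comment):].strip()
            if body.startswith(BEGIN):
                name = body[len(BEGIN):].strip()
                # Covers both rules: two anonymous regions cannot overlap,
                # and overlapping regions cannot share a name.
                if any(region[0] == name for region in active):
                    raise ValueError(f"region {name!r} overlaps a region with the same name")
                region = (name, [])
                active.append(region)
                opened.append(region)
            elif body.startswith(END):
                name = body[len(END):].strip()
                # An anonymous END closes the most recently opened region.
                target = next((r for r in reversed(active)
                               if not name or r[0] == name), None)
                if target is None:
                    raise ValueError(f"no active region matches END {name!r}")
                active = [r for r in active if r is not target]
            continue  # ordinary comments are skipped
        everything.append(line)
        for _, instructions in active:
            instructions.append(line)
    # Without markers, a default region holds every instruction.
    return opened if opened else [("", everything)]
```

Applied to the overlapping example above, the sketch yields region foo with the add and sub instructions, and region bar with the sub and the second add.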
252
253 There is no support for marking regions from high-level source code,
254 like C or C++. As a workaround, inline assembly directives may be used:
255
256 int foo(int a, int b) {
257 __asm volatile("# LLVM-MCA-BEGIN foo":::"memory");
258 a += 42;
259 __asm volatile("# LLVM-MCA-END":::"memory");
260 a *= b;
261 return a;
262 }
263
264 However, this interferes with optimizations like loop vectorization and
265 may have an impact on the code generated. This is because the __asm
266 statements are seen as real code having important side effects, which
267 limits how the code around them can be transformed. If users want to
268 make use of inline assembly to emit markers, then the recommendation is
269 to always verify that the output assembly is equivalent to the assembly
270 generated in the absence of markers. The Clang options to emit opti‐
271 mization reports can also help in detecting missed optimizations.
272
274 llvm-mca takes assembly code as input. The assembly code is parsed into
275 a sequence of MCInst with the help of the existing LLVM target assembly
276 parsers. The parsed sequence of MCInst is then analyzed by a Pipeline
277 module to generate a performance report.
278
279 The Pipeline module simulates the execution of the machine code se‐
280 quence in a loop of iterations (default is 100). During this process,
281 the pipeline collects a number of execution related statistics. At the
282 end of this process, the pipeline generates and prints a report from
283 the collected statistics.
284
Here is an example of a performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis
is conducted for target x86, cpu btver2. The report can be produced
with the following command, using the example located at
test/tools/llvm-mca/X86/BtVer2/dot-product.s:
290
291 $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s
292
293 Iterations: 300
294 Instructions: 900
295 Total Cycles: 610
296 Total uOps: 900
297
298 Dispatch Width: 2
299 uOps Per Cycle: 1.48
300 IPC: 1.48
301 Block RThroughput: 2.0
302
303
304 Instruction Info:
305 [1]: #uOps
306 [2]: Latency
307 [3]: RThroughput
308 [4]: MayLoad
309 [5]: MayStore
310 [6]: HasSideEffects (U)
311
312 [1] [2] [3] [4] [5] [6] Instructions:
313 1 2 1.00 vmulps %xmm0, %xmm1, %xmm2
314 1 3 1.00 vhaddps %xmm2, %xmm2, %xmm3
315 1 3 1.00 vhaddps %xmm3, %xmm3, %xmm4
316
317
318 Resources:
319 [0] - JALU0
320 [1] - JALU1
321 [2] - JDiv
322 [3] - JFPA
323 [4] - JFPM
324 [5] - JFPU0
325 [6] - JFPU1
326 [7] - JLAGU
327 [8] - JMul
328 [9] - JSAGU
329 [10] - JSTC
330 [11] - JVALU0
331 [12] - JVALU1
332 [13] - JVIMUL
333
334
335 Resource pressure per iteration:
336 [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]
337 - - - 2.00 1.00 2.00 1.00 - - - - - - -
338
339 Resource pressure by instruction:
340 [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
341 - - - - 1.00 - 1.00 - - - - - - - vmulps %xmm0, %xmm1, %xmm2
342 - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm2, %xmm2, %xmm3
343 - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm3, %xmm3, %xmm4
344
345 According to this report, the dot-product kernel has been executed 300
346 times, for a total of 900 simulated instructions. The total number of
347 simulated micro opcodes (uOps) is also 900.
348
349 The report is structured in three main sections. The first section
350 collects a few performance numbers; the goal of this section is to give
351 a very quick overview of the performance throughput. Important perfor‐
352 mance indicators are IPC, uOps Per Cycle, and Block RThroughput (Block
353 Reciprocal Throughput).
354
355 Field DispatchWidth is the maximum number of micro opcodes that are
356 dispatched to the out-of-order backend every simulated cycle. For pro‐
357 cessors with an in-order backend, DispatchWidth is the maximum number
358 of micro opcodes issued to the backend every simulated cycle.
359
IPC is computed by dividing the total number of simulated instructions
by the total number of cycles.
362
Field Block RThroughput is the reciprocal of the block throughput.
Block throughput is a theoretical quantity computed as the maximum number
of blocks (i.e. iterations) that can be executed per simulated
clock cycle in the absence of loop-carried dependencies. Block throughput
is bounded from above by the dispatch rate and by the availability of
hardware resources.
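
One plausible way to combine the two limits is to take the larger of a dispatch bound and a resource bound. The Python sketch below applies that reading to the dot-product report; it is an illustration and not necessarily the exact formula implemented by the tool. The function name is hypothetical, and the sketch assumes one unit per processor resource.

```python
def block_rthroughput(uops_per_iteration, dispatch_width, resource_cycles):
    """Estimate the reciprocal block throughput in cycles per iteration."""
    # Dispatch limit: cycles needed to dispatch all uOps of one iteration.
    dispatch_bound = uops_per_iteration / dispatch_width
    # Resource limit: cycles consumed on the busiest resource
    # (assumes a single unit per resource).
    resource_bound = max(resource_cycles.values(), default=0.0)
    return max(dispatch_bound, resource_bound)

# Values taken from the "Resource pressure per iteration" table.
pressure = {"JFPA": 2.00, "JFPM": 1.00, "JFPU0": 2.00, "JFPU1": 1.00}
rthroughput = block_rthroughput(3, 2, pressure)
print(rthroughput)  # 2.0
```

Here the dispatch bound is 1.5 cycles and the resource bound is 2.0 cycles, so the result matches the reported Block RThroughput of 2.0.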
369
370 In the absence of loop-carried data dependencies, the observed IPC
371 tends to a theoretical maximum which can be computed by dividing the
372 number of instructions of a single iteration by the Block RThroughput.
373
Field 'uOps Per Cycle' is computed by dividing the total number of simulated
micro opcodes by the total number of cycles. A delta between Dispatch
Width and this field is an indicator of a performance issue. In
the absence of loop-carried data dependencies, the observed 'uOps Per
Cycle' should tend to a theoretical maximum throughput which can be
computed by dividing the number of uOps of a single iteration by the
Block RThroughput.
381
382 Field uOps Per Cycle is bounded from above by the dispatch width. That
383 is because the dispatch width limits the maximum size of a dispatch
384 group. Both IPC and 'uOps Per Cycle' are limited by the amount of hard‐
385 ware parallelism. The availability of hardware resources affects the
386 resource pressure distribution, and it limits the number of instruc‐
387 tions that can be executed in parallel every cycle. A delta between
388 Dispatch Width and the theoretical maximum uOps per Cycle (computed by
389 dividing the number of uOps of a single iteration by the Block
390 RThroughput) is an indicator of a performance bottleneck caused by the
391 lack of hardware resources. In general, the lower the Block RThrough‐
392 put, the better.
393
394 In this example, uOps per iteration/Block RThroughput is 1.50. Since
395 there are no loop-carried dependencies, the observed uOps Per Cycle is
396 expected to approach 1.50 when the number of iterations tends to infin‐
397 ity. The delta between the Dispatch Width (2.00), and the theoretical
398 maximum uOp throughput (1.50) is an indicator of a performance bottle‐
399 neck caused by the lack of hardware resources, and the Resource pres‐
400 sure view can help to identify the problematic resource usage.
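
The summary numbers can be reproduced with a few lines of arithmetic. The Python sketch below uses only values printed in the report above; the variable names are chosen for illustration.

```python
iterations = 300
instructions_per_iteration = 3
uops_per_iteration = 3
total_cycles = 610
dispatch_width = 2
block_rthroughput = 2.0

# Observed values, as printed in the summary section.
ipc = iterations * instructions_per_iteration / total_cycles
uops_per_cycle = iterations * uops_per_iteration / total_cycles

# Theoretical maximum in the absence of loop-carried dependencies.
max_uops_per_cycle = uops_per_iteration / block_rthroughput

# A non-zero delta hints at a bottleneck caused by hardware resources.
delta = dispatch_width - max_uops_per_cycle

print(f"{ipc:.2f} {uops_per_cycle:.2f} {max_uops_per_cycle:.2f} {delta:.2f}")
# prints: 1.48 1.48 1.50 0.50
```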
401
402 The second section of the report is the instruction info view. It shows
403 the latency and reciprocal throughput of every instruction in the se‐
404 quence. It also reports extra information related to the number of mi‐
405 cro opcodes, and opcode properties (i.e., 'MayLoad', 'MayStore', and
406 'HasSideEffects').
407
Field RThroughput is the reciprocal of the instruction throughput.
Throughput is computed as the maximum number of instructions of the same
type that can be executed per clock cycle in the absence of operand
dependencies. In this example, the reciprocal throughput of a vector
float multiply is 1 cycle/instruction. That is because the FP multiplier
JFPM is only available from pipeline JFPU1.
414
415 Instruction encodings are displayed within the instruction info view
416 when flag -show-encoding is specified.
417
418 Below is an example of -show-encoding output for the dot-product ker‐
419 nel:
420
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[7]: Encoding Size

[1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:     Instructions:
 1      2     1.00                         4     c5 f0 59 d0    vmulps   %xmm0, %xmm1, %xmm2
 1      3     1.00                         4     c5 eb 7c da    vhaddps  %xmm2, %xmm2, %xmm3
 1      3     1.00                         4     c5 e3 7c e3    vhaddps  %xmm3, %xmm3, %xmm4
434
435 The Encoding Size column shows the size in bytes of instructions. The
436 Encodings column shows the actual instruction encodings (byte sequences
437 in hex).
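
The relation between the two columns is simple: the encoding size is the number of bytes in the encoding. A short Python sketch, using the byte sequences from the listing above (the dictionary layout is just for illustration):

```python
encodings = {
    "vmulps %xmm0, %xmm1, %xmm2": "c5 f0 59 d0",
    "vhaddps %xmm2, %xmm2, %xmm3": "c5 eb 7c da",
    "vhaddps %xmm3, %xmm3, %xmm4": "c5 e3 7c e3",
}

# bytes.fromhex() skips the spaces between byte pairs.
sizes = {insn: len(bytes.fromhex(hex_bytes))
         for insn, hex_bytes in encodings.items()}

for insn, size in sizes.items():
    print(size, insn)
```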
438
439 The third section is the Resource pressure view. This view reports the
440 average number of resource cycles consumed every iteration by instruc‐
441 tions for every processor resource unit available on the target. In‐
442 formation is structured in two tables. The first table reports the num‐
443 ber of resource cycles spent on average every iteration. The second ta‐
444 ble correlates the resource cycles to the machine instruction in the
445 sequence. For example, every iteration of the instruction vmulps always
446 executes on resource unit [6] (JFPU1 - floating point pipeline #1),
447 consuming an average of 1 resource cycle per iteration. Note that on
448 AMD Jaguar, vector floating-point multiply can only be issued to pipe‐
449 line JFPU1, while horizontal floating-point additions can only be is‐
450 sued to pipeline JFPU0.
451
452 The resource pressure view helps with identifying bottlenecks caused by
453 high usage of specific hardware resources. Situations with resource
454 pressure mainly concentrated on a few resources should, in general, be
455 avoided. Ideally, pressure should be uniformly distributed between
456 multiple resources.
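
The first table of the resource pressure view is the column-wise sum of the second. The Python sketch below rebuilds the per-iteration row from the per-instruction rows of the dot-product report and picks out the busiest resources; only the non-empty columns are listed, and the labels "#1" and "#2" are added here to tell the two vhaddps instructions apart.

```python
by_instruction = {
    "vmulps":     {"JFPM": 1.00, "JFPU1": 1.00},
    "vhaddps #1": {"JFPA": 1.00, "JFPU0": 1.00},
    "vhaddps #2": {"JFPA": 1.00, "JFPU0": 1.00},
}

# Sum the resource cycles of every instruction, resource by resource.
per_iteration = {}
for usage in by_instruction.values():
    for resource, cycles in usage.items():
        per_iteration[resource] = per_iteration.get(resource, 0.0) + cycles

# Resources with the highest pressure are the likely bottleneck.
busiest = max(per_iteration.values())
hot_resources = sorted(r for r, c in per_iteration.items() if c == busiest)
print(hot_resources)  # ['JFPA', 'JFPU0']
```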
457
458 Timeline View
459 The timeline view produces a detailed report of each instruction's
460 state transitions through an instruction pipeline. This view is en‐
461 abled by the command line option -timeline. As instructions transition
462 through the various stages of the pipeline, their states are depicted
463 in the view report. These states are represented by the following
464 characters:
465
466 • D : Instruction dispatched.
467
468 • e : Instruction executing.
469
470 • E : Instruction executed.
471
472 • R : Instruction retired.
473
474 • = : Instruction already dispatched, waiting to be executed.
475
476 • - : Instruction executed, waiting to be retired.
477
478 Below is the timeline view for a subset of the dot-product example lo‐
479 cated in test/tools/llvm-mca/X86/BtVer2/dot-product.s and processed by
480 llvm-mca using the following command:
481
482 $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s
483
484 Timeline view:
485 012345
486 Index 0123456789
487
488 [0,0] DeeER. . . vmulps %xmm0, %xmm1, %xmm2
489 [0,1] D==eeeER . . vhaddps %xmm2, %xmm2, %xmm3
490 [0,2] .D====eeeER . vhaddps %xmm3, %xmm3, %xmm4
491 [1,0] .DeeE-----R . vmulps %xmm0, %xmm1, %xmm2
492 [1,1] . D=eeeE---R . vhaddps %xmm2, %xmm2, %xmm3
493 [1,2] . D====eeeER . vhaddps %xmm3, %xmm3, %xmm4
494 [2,0] . DeeE-----R . vmulps %xmm0, %xmm1, %xmm2
495 [2,1] . D====eeeER . vhaddps %xmm2, %xmm2, %xmm3
496 [2,2] . D======eeeER vhaddps %xmm3, %xmm3, %xmm4
497
498
499 Average Wait times (based on the timeline view):
500 [0]: Executions
501 [1]: Average time spent waiting in a scheduler's queue
502 [2]: Average time spent waiting in a scheduler's queue while ready
503 [3]: Average time elapsed from WB until retire stage
504
505 [0] [1] [2] [3]
506 0. 3 1.0 1.0 3.3 vmulps %xmm0, %xmm1, %xmm2
507 1. 3 3.3 0.7 1.0 vhaddps %xmm2, %xmm2, %xmm3
508 2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4
509 3 3.3 0.5 1.4 <total>
510
511 The timeline view is interesting because it shows instruction state
512 changes during execution. It also gives an idea of how the tool pro‐
513 cesses instructions executed on the target, and how their timing infor‐
514 mation might be calculated.
515
516 The timeline view is structured in two tables. The first table shows
517 instructions changing state over time (measured in cycles); the second
518 table (named Average Wait times) reports useful timing statistics,
519 which should help diagnose performance bottlenecks caused by long data
520 dependencies and sub-optimal usage of hardware resources.
521
An instruction in the timeline view is identified by a pair of indices,
where the first index identifies an iteration, and the second index is
the instruction index (i.e., where it appears in the code sequence).
Since this example was generated using 3 iterations (-iterations=3), the
iteration indices range from 0 to 2 inclusive.
527
528 Excluding the first and last column, the remaining columns are in cy‐
529 cles. Cycles are numbered sequentially starting from 0.
530
531 From the example output above, we know the following:
532
533 • Instruction [1,0] was dispatched at cycle 1.
534
535 • Instruction [1,0] started executing at cycle 2.
536
537 • Instruction [1,0] reached the write back stage at cycle 4.
538
539 • Instruction [1,0] was retired at cycle 10.
540
541 Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
542 wait in the scheduler's queue for the operands to become available. By
543 the time vmulps is dispatched, operands are already available, and
544 pipeline JFPU1 is ready to serve another instruction. So the instruc‐
545 tion can be immediately issued on the JFPU1 pipeline. That is demon‐
546 strated by the fact that the instruction only spent 1cy in the sched‐
547 uler's queue.
548
549 There is a gap of 5 cycles between the write-back stage and the retire
550 event. That is because instructions must retire in program order, so
551 [1,0] has to wait for [0,2] to be retired first (i.e., it has to wait
552 until cycle 10).
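
The cycle numbers quoted above can be read directly off a timeline row, because the character position in the row is the cycle number. A minimal Python sketch, using the rows for [0,2] and [1,0] from the listing (the function name is hypothetical):

```python
def timeline_events(row):
    """Map a timeline row to the cycle of each event; position 0 is cycle 0."""
    return {
        "dispatched": row.index("D"),
        "execute_start": row.index("e"),
        "write_back": row.index("E"),
        "retired": row.index("R"),
    }

previous = timeline_events(".D====eeeER")  # instruction [0,2]
current = timeline_events(".DeeE-----R")   # instruction [1,0]

# Cycles spent between write-back and retirement (the '-' characters).
gap = current["retired"] - current["write_back"] - 1
print(current, gap)
```

Both rows retire at cycle 10, which matches the statement that [1,0] cannot retire before [0,2] does.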
553
554 In the example, all instructions are in a RAW (Read After Write) depen‐
555 dency chain. Register %xmm2 written by vmulps is immediately used by
556 the first vhaddps, and register %xmm3 written by the first vhaddps is
557 used by the second vhaddps. Long data dependencies negatively impact
558 the ILP (Instruction Level Parallelism).
559
560 In the dot-product example, there are anti-dependencies introduced by
561 instructions from different iterations. However, those dependencies
562 can be removed at register renaming stage (at the cost of allocating
563 register aliases, and therefore consuming physical registers).
564
565 Table Average Wait times helps diagnose performance issues that are
566 caused by the presence of long latency instructions and potentially
567 long data dependencies which may limit the ILP. Last row, <total>,
568 shows a global average over all instructions measured. Note that
569 llvm-mca, by default, assumes at least 1cy between the dispatch event
570 and the issue event.
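
Columns [1] and [3] of the Average Wait times table can be recomputed from the timeline rows. The Python sketch below counts '=' and '-' characters per instruction across the three iterations. Adding one cycle to each queue time reflects the 1cy minimum between dispatch and issue mentioned above; that interpretation is an assumption of this sketch. Column [2] cannot be derived from the timeline characters alone.

```python
# State strings per instruction index, one per iteration (padding removed).
rows = {
    0: ["DeeER", "DeeE-----R", "DeeE-----R"],            # vmulps
    1: ["D==eeeER", "D=eeeE---R", "D====eeeER"],         # first vhaddps
    2: ["D====eeeER", "D====eeeER", "D======eeeER"],     # second vhaddps
}

def average_waits(states):
    """Return (avg cycles in scheduler queue, avg cycles from WB to retire)."""
    in_queue = sum(s.count("=") + 1 for s in states) / len(states)
    to_retire = sum(s.count("-") for s in states) / len(states)
    return round(in_queue, 1), round(to_retire, 1)

waits = {index: average_waits(states) for index, states in rows.items()}
print(waits)  # {0: (1.0, 3.3), 1: (3.3, 1.0), 2: (5.7, 0.0)}
```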
571
572 When the performance is limited by data dependencies and/or long la‐
573 tency instructions, the number of cycles spent while in the ready state
574 is expected to be very small when compared with the total number of cy‐
575 cles spent in the scheduler's queue. The difference between the two
576 counters is a good indicator of how large of an impact data dependen‐
577 cies had on the execution of the instructions. When performance is
578 mostly limited by the lack of hardware resources, the delta between the
579 two counters is small. However, the number of cycles spent in the
580 queue tends to be larger (i.e., more than 1-3cy), especially when com‐
581 pared to other low latency instructions.
582
583 Bottleneck Analysis
584 The -bottleneck-analysis command line option enables the analysis of
585 performance bottlenecks.
586
587 This analysis is potentially expensive. It attempts to correlate in‐
588 creases in backend pressure (caused by pipeline resource pressure and
589 data dependencies) to dynamic dispatch stalls.
590
591 Below is an example of -bottleneck-analysis output generated by
592 llvm-mca for 500 iterations of the dot-product example on btver2.
593
594 Cycles with backend pressure increase [ 48.07% ]
595 Throughput Bottlenecks:
596 Resource Pressure [ 47.77% ]
597 - JFPA [ 47.77% ]
598 - JFPU0 [ 47.77% ]
599 Data Dependencies: [ 0.30% ]
600 - Register Dependencies [ 0.30% ]
601 - Memory Dependencies [ 0.00% ]
602
603 Critical sequence based on the simulation:
604
605 Instruction Dependency Information
606 +----< 2. vhaddps %xmm3, %xmm3, %xmm4
607 |
608 | < loop carried >
609 |
610 | 0. vmulps %xmm0, %xmm1, %xmm2
611 +----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ]
612 +----> 2. vhaddps %xmm3, %xmm3, %xmm4 ## REGISTER dependency: %xmm3
613 |
614 | < loop carried >
615 |
616 +----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ]
617
618 According to the analysis, throughput is limited by resource pressure
619 and not by data dependencies. The analysis observed increases in back‐
620 end pressure during 48.07% of the simulated run. Almost all those pres‐
621 sure increase events were caused by contention on processor resources
622 JFPA/JFPU0.
623
624 The critical sequence is the most expensive sequence of instructions
625 according to the simulation. It is annotated to provide extra informa‐
626 tion about critical register dependencies and resource interferences
627 between instructions.
628
Instructions from the critical sequence are expected to significantly
impact performance. By construction, the accuracy of this analysis is
strongly dependent on the simulation and (as always) on the quality of
the processor model in LLVM.
633
634 Bottleneck analysis is currently not supported for processors with an
635 in-order backend.
636
637 Extra Statistics to Further Diagnose Performance Issues
638 The -all-stats command line option enables extra statistics and perfor‐
639 mance counters for the dispatch logic, the reorder buffer, the retire
640 control unit, and the register file.
641
642 Below is an example of -all-stats output generated by llvm-mca for 300
643 iterations of the dot-product example discussed in the previous sec‐
644 tions.
645
646 Dynamic Dispatch Stall Cycles:
647 RAT - Register unavailable: 0
648 RCU - Retire tokens unavailable: 0
649 SCHEDQ - Scheduler full: 272 (44.6%)
650 LQ - Load queue full: 0
651 SQ - Store queue full: 0
652 GROUP - Static restrictions on the dispatch group: 0
653
654
655 Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
656 [# dispatched], [# cycles]
657 0, 24 (3.9%)
658 1, 272 (44.6%)
659 2, 314 (51.5%)
660
661
662 Schedulers - number of cycles where we saw N micro opcodes issued:
663 [# issued], [# cycles]
664 0, 7 (1.1%)
665 1, 306 (50.2%)
666 2, 297 (48.7%)
667
668 Scheduler's queue usage:
669 [1] Resource name.
670 [2] Average number of used buffer entries.
671 [3] Maximum number of used buffer entries.
672 [4] Total number of buffer entries.
673
674 [1] [2] [3] [4]
675 JALU01 0 0 20
676 JFPU01 17 18 18
677 JLSAGU 0 0 12
678
679
680 Retire Control Unit - number of cycles where we saw N instructions retired:
681 [# retired], [# cycles]
682 0, 109 (17.9%)
683 1, 102 (16.7%)
684 2, 399 (65.4%)
685
686 Total ROB Entries: 64
687 Max Used ROB Entries: 35 ( 54.7% )
688 Average Used ROB Entries per cy: 32 ( 50.0% )
689
690
691 Register File statistics:
692 Total number of mappings created: 900
693 Max number of mappings used: 35
694
695 * Register File #1 -- JFpuPRF:
696 Number of physical registers: 72
697 Total number of mappings created: 900
698 Max number of mappings used: 35
699
700 * Register File #2 -- JIntegerPRF:
701 Number of physical registers: 64
702 Total number of mappings created: 0
703 Max number of mappings used: 0
704
705 If we look at the Dynamic Dispatch Stall Cycles table, we see the
706 counter for SCHEDQ reports 272 cycles. This counter is incremented ev‐
707 ery time the dispatch logic is unable to dispatch a full group because
708 the scheduler's queue is full.
709
Looking at the Dispatch Logic table, we see that the pipeline was only
able to dispatch two micro opcodes 51.5% of the time. The dispatch
group was limited to one micro opcode 44.6% of the cycles, which
corresponds to 272 cycles. The dispatch statistics are displayed by
using either the -all-stats or the -dispatch-stats command option.
715
The next table, Schedulers, presents a histogram of the number of micro
opcodes issued per cycle. In this case, of the 610 simulated cycles, a
single micro opcode was issued in 306 cycles (50.2%) and there were 7
cycles where no micro opcodes were issued.
721
The Scheduler's queue usage table shows the average and maximum
number of buffer entries (i.e., scheduler queue entries) used at runtime.
Resource JFPU01 reached its maximum (18 of 18 queue entries).
Note that AMD Jaguar implements three schedulers:
726
727 • JALU01 - A scheduler for ALU instructions.
728
• JFPU01 - A scheduler for floating point operations.
730
731 • JLSAGU - A scheduler for address generation.
732
733 The dot-product is a kernel of three floating point instructions (a
734 vector multiply followed by two horizontal adds). That explains why
735 only the floating point scheduler appears to be used.
736
737 A full scheduler queue is either caused by data dependency chains or by
738 a sub-optimal usage of hardware resources. Sometimes, resource pres‐
739 sure can be mitigated by rewriting the kernel using different instruc‐
740 tions that consume different scheduler resources. Schedulers with a
741 small queue are less resilient to bottlenecks caused by the presence of
742 long data dependencies. The scheduler statistics are displayed by us‐
743 ing the command option -all-stats or -scheduler-stats.
744
The next table, Retire Control Unit, presents a histogram of the number
of instructions retired per cycle. In this case, of the 610 simulated
cycles, two instructions were retired during the same cycle 399 times
(65.4%) and there were 109 cycles where no instructions were retired.
The retire statistics are displayed by using the command option
-all-stats or -retire-stats.
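
The three histograms can be cross-checked against the summary: each one covers all 610 simulated cycles and accounts for all 900 micro opcodes (or instructions). A short Python sketch using the counts printed above (the dictionary layout and function name are illustrative):

```python
histograms = {
    "dispatched": {0: 24, 1: 272, 2: 314},
    "issued":     {0: 7, 1: 306, 2: 297},
    "retired":    {0: 109, 1: 102, 2: 399},
}

def summarize(histogram):
    """Return (total cycles, total events, percentage of cycles per bucket)."""
    cycles = sum(histogram.values())
    events = sum(n * count for n, count in histogram.items())
    share = {n: round(100 * count / cycles, 1) for n, count in histogram.items()}
    return cycles, events, share

summary = {name: summarize(hist) for name, hist in histograms.items()}
for name, (cycles, events, share) in summary.items():
    print(name, cycles, events, share)
```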
751
752 The last table presented is Register File statistics. Each physical
753 register file (PRF) used by the pipeline is presented in this table.
754 In the case of AMD Jaguar, there are two register files, one for float‐
755 ing-point registers (JFpuPRF) and one for integer registers (JInte‐
756 gerPRF). The table shows that of the 900 instructions processed, there
757 were 900 mappings created. Since this dot-product example utilized
only floating point registers, the JFpuPRF was responsible for creating
759 the 900 mappings. However, we see that the pipeline only used a maxi‐
760 mum of 35 of 72 available register slots at any given time. We can con‐
761 clude that the floating point PRF was the only register file used for
762 the example, and that it was never resource constrained. The register
763 file statistics are displayed by using the command option -all-stats or
764 -register-file-stats.
765
766 In this example, we can conclude that the IPC is mostly limited by data
767 dependencies, and not by resource pressure.
768
769 Instruction Flow
770 This section describes the instruction flow through the default pipe‐
771 line of llvm-mca, as well as the functional units involved in the
772 process.
773
774 The default pipeline implements the following sequence of stages used
775 to process instructions.
776
777 • Dispatch (Instruction is dispatched to the schedulers).
778
779 • Issue (Instruction is issued to the processor pipelines).
780
781 • Write Back (Instruction is executed, and results are written back).
782
783 • Retire (Instruction is retired; writes are architecturally commit‐
784 ted).
785
786 The in-order pipeline implements the following sequence of stages:
787 InOrderIssue (Instruction is issued to the processor pipelines), then
788 Retire (Instruction is retired; writes are architecturally committed).
789
790 llvm-mca assumes that instructions have all been decoded and placed
791 into a queue before the simulation starts. Therefore, the instruction
792 fetch and decode stages are not modeled. Performance bottlenecks in the
793 frontend are not diagnosed. Also, llvm-mca does not model branch pre‐
794 diction.
795
796 Instruction Dispatch
797 During the dispatch stage, instructions are picked in program order
798 from a queue of already decoded instructions, and dispatched in groups
799 to the simulated hardware schedulers.
800
801 The size of a dispatch group depends on the availability of the simu‐
802 lated hardware resources. The processor dispatch width defaults to the
803 value of the IssueWidth in LLVM's scheduling model.
804
805 An instruction can be dispatched if:
806
807 • The size of the dispatch group is smaller than the processor's dispatch
808 width.
809
810 • There are enough entries in the reorder buffer.
811
812 • There are enough physical registers to do register renaming.
813
814 • The schedulers are not full.
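
The four conditions above can be read as a single predicate. The following Python sketch is an illustration only, not llvm-mca's implementation: the DispatchState fields and the stall-reason strings are hypothetical names, and the reorder buffer check assumes one entry per micro-op.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DispatchState:
    """Hypothetical snapshot of the simulated resources in the current cycle."""
    dispatch_width: int      # processor dispatch width
    group_size: int          # size of the dispatch group built so far
    rob_free_entries: int    # free reorder buffer entries
    free_phys_regs: int      # free physical registers for renaming
    unbounded_regs: bool     # True when the register file size is unbounded
    schedulers_full: bool    # True when no scheduler buffer entry is free


def dispatch_stall_reason(state: DispatchState, micro_ops: int,
                          regs_needed: int) -> Optional[str]:
    """Return None if the instruction can be dispatched, else why it stalls."""
    if state.group_size >= state.dispatch_width:
        return "dispatch group full"
    if state.rob_free_entries < micro_ops:
        return "reorder buffer full"
    if not state.unbounded_regs and state.free_phys_regs < regs_needed:
        return "no physical registers"
    if state.schedulers_full:
        return "scheduler full"
    return None
```

Returning a reason instead of a plain boolean mirrors the way the tool attributes dispatch stalls to a specific resource.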
815
816 Scheduling models can optionally specify which register files are
817 available on the processor. llvm-mca uses that information to initial‐
818 ize register file descriptors. Users can limit the number of physical
819 registers that are globally available for register renaming by using
820 the command option -register-file-size. A value of zero for this op‐
821 tion means unbounded. By knowing how many registers are available for
822 renaming, the tool can predict dispatch stalls caused by the lack of
823 physical registers.
824
825 The number of reorder buffer entries consumed by an instruction depends
826 on the number of micro-opcodes specified for that instruction by the
827 target scheduling model. The reorder buffer is responsible for track‐
828 ing the progress of instructions that are "in-flight", and retiring
829 them in program order. The number of entries in the reorder buffer de‐
830 faults to the value specified by field MicroOpBufferSize in the target
831 scheduling model.
832
833 Instructions that are dispatched to the schedulers consume scheduler
834 buffer entries. llvm-mca queries the scheduling model to determine the
835 set of buffered resources consumed by an instruction. Buffered re‐
836 sources are treated like scheduler resources.
837
838 Instruction Issue
839 Each processor scheduler implements a buffer of instructions. An in‐
840 struction has to wait in the scheduler's buffer until input register
841 operands become available. Only at that point does the instruction
842 become eligible for execution and may be issued (potentially
843 out-of-order) for execution. Instruction latencies are computed by
844 llvm-mca with the help of the scheduling model.
845
846 llvm-mca's scheduler is designed to simulate multiple processor sched‐
847 ulers. The scheduler is responsible for tracking data dependencies,
848 and dynamically selecting which processor resources are consumed by in‐
849 structions. It delegates the management of processor resource units
850 and resource groups to a resource manager. The resource manager is re‐
851 sponsible for selecting resource units that are consumed by instruc‐
852 tions. For example, if an instruction consumes 1cy of a resource
853 group, the resource manager selects one of the available units from the
854 group; by default, the resource manager uses a round-robin selector to
855 guarantee that resource usage is uniformly distributed between all
856 units of a group.
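
A round-robin selector of this kind can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not llvm-mca's actual data structure: the class name RoundRobinGroup, the unit names P0 and P1, and the busy_until bookkeeping are all hypothetical.

```python
class RoundRobinGroup:
    """Hypothetical resource group that hands out its units in rotation."""

    def __init__(self, units):
        self.units = list(units)
        # Cycle at which each unit becomes free again.
        self.busy_until = {unit: 0 for unit in self.units}
        self.next_index = 0

    def select(self, now, cycles=1):
        """Reserve the next free unit in rotation, or return None if all are busy."""
        count = len(self.units)
        for offset in range(count):
            index = (self.next_index + offset) % count
            unit = self.units[index]
            if self.busy_until[unit] <= now:
                self.busy_until[unit] = now + cycles
                # Start the next search after the unit just used.
                self.next_index = (index + 1) % count
                return unit
        return None


group = RoundRobinGroup(["P0", "P1"])
picks = [group.select(now=0), group.select(now=0), group.select(now=0)]
# picks == ["P0", "P1", None]: both units are consumed for cycle 0.
```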
857
858 llvm-mca's scheduler internally groups instructions into three sets:
859
860 • WaitSet: a set of instructions whose operands are not ready.
861
862 • ReadySet: a set of instructions ready to execute.
863
864 • IssuedSet: a set of instructions executing.
865
866 Depending on operand availability, instructions that are dis‐
867 patched to the scheduler are placed either into the WaitSet or into the
868 ReadySet.
869
870 Every cycle, the scheduler checks if instructions can be moved from the
871 WaitSet to the ReadySet, and if instructions from the ReadySet can be
872 issued to the underlying pipelines. The algorithm prioritizes older in‐
873 structions over younger instructions.
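
The movement between the three sets can be illustrated with a toy scheduler. The Python sketch below is a simplification and makes several assumptions that are not taken from llvm-mca: the class and field names are hypothetical, pipeline availability is reduced to a fixed number of issue slots per cycle, and operands are tracked as indices of producer instructions.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    index: int                      # position in program order (smaller is older)
    deps: frozenset = frozenset()   # indices of instructions producing its operands
    latency: int = 1


class ToyScheduler:
    def __init__(self, issue_slots=1):
        self.issue_slots = issue_slots  # assumption: stands in for pipeline availability
        self.wait_set, self.ready_set = [], []
        self.issued_set = {}            # index -> cycle at which the result is written back
        self.done = set()

    def dispatch(self, instr):
        """Place a dispatched instruction into the ReadySet or the WaitSet."""
        target = self.ready_set if instr.deps <= self.done else self.wait_set
        target.append(instr)

    def cycle(self, now):
        """Advance one cycle and return the indices issued in it."""
        # Instructions whose latency has elapsed leave the IssuedSet.
        for index in [i for i, end in self.issued_set.items() if end <= now]:
            del self.issued_set[index]
            self.done.add(index)
        # WaitSet -> ReadySet once every operand has been produced.
        for instr in [i for i in self.wait_set if i.deps <= self.done]:
            self.wait_set.remove(instr)
            self.ready_set.append(instr)
        # ReadySet -> IssuedSet, oldest instruction first.
        self.ready_set.sort(key=lambda i: i.index)
        issued_now = self.ready_set[:self.issue_slots]
        del self.ready_set[:self.issue_slots]
        for instr in issued_now:
            self.issued_set[instr.index] = now + instr.latency
        return [i.index for i in issued_now]


sched = ToyScheduler(issue_slots=1)
sched.dispatch(Instr(0, latency=2))
sched.dispatch(Instr(1, deps=frozenset({0})))   # waits for instruction 0
sched.dispatch(Instr(2))
order = [sched.cycle(c) for c in range(3)]
# order == [[0], [2], [1]]: instruction 2 overtakes 1, which is still waiting.
```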
874
875 Write-Back and Retire Stage
876 Issued instructions are moved from the ReadySet to the IssuedSet.
877 There, instructions wait until they reach the write-back stage. At
878 that point, they get removed from the queue and the retire control unit
879 is notified.
880
881 When an instruction is executed, the retire control unit flags it as
882 "ready to retire."
883
884 Instructions are retired in program order. The register file is noti‐
885 fied of the retirement so that it can free the physical registers that
886 were allocated for the instruction during the register renaming stage.
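
In-order retirement with register release can be sketched as a FIFO in which only the oldest entry may leave. The Python below is an illustrative model, not llvm-mca's retire control unit: the class name ToyRetireUnit and its methods are hypothetical, and the register file is reduced to a single counter of free registers.

```python
from collections import deque


class ToyRetireUnit:
    """Hypothetical retire control unit: a FIFO of in-flight instructions."""

    def __init__(self, free_registers):
        self.free_registers = free_registers
        self.queue = deque()   # entries: [index, registers_held, ready_to_retire]

    def dispatch(self, index, registers_needed):
        """Reserve an entry and allocate physical registers for the instruction."""
        self.free_registers -= registers_needed
        self.queue.append([index, registers_needed, False])

    def on_executed(self, index):
        """Flag an executed instruction as ready to retire."""
        for entry in self.queue:
            if entry[0] == index:
                entry[2] = True

    def retire(self):
        """Retire from the head of the queue only, freeing registers as we go."""
        retired = []
        while self.queue and self.queue[0][2]:
            index, registers_held, _ = self.queue.popleft()
            self.free_registers += registers_held
            retired.append(index)
        return retired


rcu = ToyRetireUnit(free_registers=72)
for i in range(3):
    rcu.dispatch(i, registers_needed=1)
rcu.on_executed(1)
first = rcu.retire()     # [] because instruction 0 has not executed yet
rcu.on_executed(0)
second = rcu.retire()    # [0, 1]
```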
887
888 Load/Store Unit and Memory Consistency Model
889 llvm-mca simulates out-of-order execution of memory operations with a
890 load/store unit model (LSUnit) that speculatively executes loads and
891 stores.
892
893 Each load (or store) consumes an entry in the load (or store) queue.
894 Users can specify flags -lqueue and -squeue to limit the number of en‐
895 tries in the load and store queues respectively. The queues are un‐
896 bounded by default.
897
898 The LSUnit implements a relaxed consistency model for memory loads and
899 stores. The rules are:
900
901 1. A younger load is allowed to pass an older load only if there are no
902 intervening stores or barriers between the two loads.
903
904 2. A younger load is allowed to pass an older store provided that the
905 load does not alias with the store.
906
907 3. A younger store is not allowed to pass an older store.
908
909 4. A younger store is not allowed to pass an older load.
910
911 By default, the LSUnit optimistically assumes that loads do not alias
912 store operations (-noalias=true). Under this assumption, younger loads
913 are always allowed to pass older stores. Essentially, the LSUnit does
914 not attempt to run any alias analysis to predict when loads and stores
915 do not alias with each other.
916
917 Note that, in the case of write-combining memory, rule 3 could be re‐
918 laxed to allow reordering of non-aliasing store operations. That being
919 said, at the moment, there is no way to further relax the memory model
920 (-noalias is the only option). Essentially, there is no option to
921 specify a different memory type (e.g., write-back, write-combining,
922 write-through, etc.) and consequently to weaken, or strengthen, the
923 memory model.
924
925 Other limitations are:
926
927 • The LSUnit does not know when store-to-load forwarding may occur.
928
929 • The LSUnit does not know anything about cache hierarchy and memory
930 types.
931
932 • The LSUnit does not know how to identify serializing operations and
933 memory fences.
934
935 The LSUnit does not attempt to predict if a load or store hits or
936 misses the L1 cache. It only knows if an instruction "MayLoad" and/or
937 "MayStore." For loads, the scheduling model provides an "optimistic"
938 load-to-use latency (which usually matches the load-to-use latency for
939 when there is a hit in the L1D).
940
941 llvm-mca does not (on its own) know about serializing operations or
942 memory-barrier-like instructions. The LSUnit used to conservatively
943 use an instruction's "MayLoad", "MayStore", and unmodeled side effects
944 flags to determine whether an instruction should be treated as a mem‐
945 ory-barrier. This was inaccurate in general and was changed so that now
946 each instruction has an IsAStoreBarrier and IsALoadBarrier flag. These
947 flags are mca specific and default to false for every instruction. If
948 any instruction should have either of these flags set, it should be
949 done within the target's InstrPostProcess class. For an example, look
950 at the X86InstrPostProcess::postProcessInstruction method within
951 llvm/lib/Target/X86/MCA/X86CustomBehaviour.cpp.
952
953 A load/store barrier consumes one entry of the load/store queue. A
954 load/store barrier enforces ordering of loads/stores. A younger load
955 cannot pass a load barrier. Also, a younger store cannot pass a store
956 barrier. A younger load has to wait for the memory/load barrier to ex‐
957 ecute. A load/store barrier is "executed" when it becomes the oldest
958 entry in the load/store queue(s). That also means, by construction, all
959 of the older loads/stores have been executed.
960
961 In conclusion, the full set of load/store consistency rules is:
962
963 1. A store may not pass a previous store.
964
965 2. A store may not pass a previous load (regardless of -noalias).
966
967 3. A store has to wait until an older store barrier is fully executed.
968
969 4. A load may pass a previous load.
970
971 5. A load may not pass a previous store unless -noalias is set.
972
973 6. A load has to wait until an older load barrier is fully executed.
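
The six rules above can be written as one predicate. The Python sketch below is a direct transcription of the list, not llvm-mca's LSUnit code: the function name may_issue and the string labels for operation kinds are hypothetical, and older_ops is assumed to hold only older memory operations that have not yet executed.

```python
def may_issue(op, older_ops, noalias=True):
    """Decide whether a load or store may execute ahead of older pending operations.

    op        -- "load" or "store"
    older_ops -- kinds of older, not yet executed memory operations:
                 "load", "store", "load_barrier" or "store_barrier"
    noalias   -- models the -noalias option (true by default)
    """
    if op == "store":
        # Rules 1-3: no passing an older store, load, or store barrier.
        return not any(o in ("store", "load", "store_barrier") for o in older_ops)
    if op == "load":
        if "load_barrier" in older_ops:            # rule 6
            return False
        if "store" in older_ops and not noalias:   # rule 5
            return False
        return True                                # rule 4
    raise ValueError(f"unknown memory operation: {op}")
```

With the default noalias=True, may_issue("load", ["store"]) is True; passing noalias=False makes the same call return False.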
974
975 In-order Issue and Execute
976 In-order processors are modelled as a single InOrderIssueStage stage.
977 It bypasses Dispatch, Scheduler and Load/Store unit. Instructions are
978 issued as soon as their operand registers are available and resource
979 requirements are met. Multiple instructions can be issued in one cycle
980 according to the value of the IssueWidth parameter in LLVM's scheduling
981 model.
982
983 Once issued, an instruction is moved to the IssuedInst set until it is
984 ready to retire. llvm-mca ensures that writes are committed in-order.
985 However, an instruction is allowed to commit writes and retire
986 out-of-order if the RetireOOO property is true for at least one of its
987 writes.
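
The effect of in-order issue with a given issue width can be illustrated by computing the cycle in which each instruction issues. The Python sketch below is a simplified model under stated assumptions: it checks operand availability only (resource requirements and RetireOOO are not modeled), and the function name and the (deps, latency) tuple format are hypothetical.

```python
def in_order_issue_cycles(program, issue_width):
    """Return the issue cycle of each instruction.

    program     -- list of (deps, latency) in program order, where deps holds
                   indices of earlier instructions that produce the operands
    issue_width -- maximum number of instructions issued per cycle
    """
    issue_cycle = []
    ready_at = []                  # cycle at which each result becomes available
    cycle, issued_this_cycle = 0, 0
    for deps, latency in program:
        operands_ready = max((ready_at[d] for d in deps), default=0)
        if operands_ready > cycle:
            # Stall until the operands arrive; younger instructions wait too.
            cycle, issued_this_cycle = operands_ready, 0
        elif issued_this_cycle == issue_width:
            # Issue width exhausted for this cycle.
            cycle, issued_this_cycle = cycle + 1, 0
        issue_cycle.append(cycle)
        ready_at.append(cycle + latency)
        issued_this_cycle += 1
    return issue_cycle


program = [((), 3), ((0,), 1), ((), 1), ((), 1)]
cycles = in_order_issue_cycles(program, issue_width=2)
# cycles == [0, 3, 3, 4]: the independent third instruction still waits
# behind the second one, because issue happens in program order.
```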
988
989 Custom Behaviour
990 Due to certain instructions not being expressed perfectly within their
991 scheduling model, llvm-mca isn't always able to simulate them per‐
992 fectly. Modifying the scheduling model isn't always a viable option
993 though (maybe because the instruction is modeled incorrectly on purpose
994 or the instruction's behaviour is quite complex). The CustomBehaviour
995 class can be used in these cases to enforce proper instruction modeling
996 (often by customizing data dependencies and detecting hazards that
997 llvm-mca has no way of knowing about).
998
999 llvm-mca comes with one generic and multiple target specific CustomBe‐
1000 haviour classes. The generic class will be used if the -disable-cb flag
1001 is used or if a target specific CustomBehaviour class doesn't exist for
1002 that target. (The generic class does nothing.) Currently, the CustomBe‐
1003 haviour class is only a part of the in-order pipeline, but there are
1004 plans to add it to the out-of-order pipeline in the future.
1005
1006 CustomBehaviour's main method is checkCustomHazard() which uses the
1007 current instruction and a list of all instructions still executing
1008 within the pipeline to determine if the current instruction should be
1009 dispatched. As output, the method returns an integer representing the
1010 number of cycles that the current instruction must stall for (this can
1011 be an underestimate if you don't know the exact number; a value of 0
1012 represents no stall).
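
The contract of such a hazard check (current instruction and in-flight instructions in, stall cycles out) can be modeled in Python. This sketch does not reproduce the real C++ signature of checkCustomHazard(); the InFlight record, the opcodes LOAD and WAIT_FOR_LOADS, and the hazard rule itself are hypothetical and serve only to show the return-value convention.

```python
from dataclasses import dataclass


@dataclass
class InFlight:
    opcode: str
    cycles_left: int   # cycles until this instruction finishes executing


def check_custom_hazard(current_opcode, in_flight):
    """Return how many cycles the current instruction must stall; 0 means no stall."""
    # Hypothetical rule: WAIT_FOR_LOADS may not proceed while loads are executing.
    if current_opcode != "WAIT_FOR_LOADS":
        return 0
    pending = [i.cycles_left for i in in_flight if i.opcode == "LOAD"]
    return max(pending, default=0)


pipeline = [InFlight("LOAD", 5), InFlight("ADD", 9), InFlight("LOAD", 2)]
stall = check_custom_hazard("WAIT_FOR_LOADS", pipeline)   # 5
```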
1013
1014 If you'd like to add a CustomBehaviour class for a target that doesn't
1015 already have one, refer to an existing implementation to see how to set
1016 it up. The classes are implemented within the target specific backend
1017 (for example /llvm/lib/Target/AMDGPU/MCA/) so that they can access
1018 backend symbols.
1019
1020 Custom Views
1021 llvm-mca comes with several Views such as the Timeline View and Summary
1022 View. These Views are generic and can work with most (if not all) tar‐
1023 gets. If you wish to add a new View to llvm-mca and it does not require
1024 any backend functionality that is not already exposed through MC layer
1025 classes (MCSubtargetInfo, MCInstrInfo, etc.), please add it to the
1026 /tools/llvm-mca/View/ directory. However, if your new View is target
1027 specific AND requires unexposed backend symbols or functionality, you
1028 can define it in the /lib/Target/<TargetName>/MCA/ directory.
1029
1030 To enable this target specific View, you will have to use this target's
1031 CustomBehaviour class to override the CustomBehaviour::getViews() meth‐
1032 ods. There are 3 variations of these methods based on where you want
1033 your View to appear in the output: getStartViews(), getPostInstrIn‐
1034 foViews(), and getEndViews(). These methods return a vector of Views
1035 so you will want to return a vector containing all of the target spe‐
1036 cific Views for the target in question.
1037
1038 Because these target specific (and backend dependent) Views require the
1039 CustomBehaviour::getViews() variants, these Views will not be enabled
1040 if the -disable-cb flag is used.
1041
1042 Enabling these custom Views does not affect the non-custom (generic)
1043 Views. Continue to use the usual command line arguments to enable /
1044 disable those Views.
1045
1047 Maintained by the LLVM Team (https://llvm.org/).
1048
1050 2003-2023, LLVM Project
1051
1052
1053
1054
105515 2023-07-20 LLVM-MCA(1)