LLVM-MCA(1)                          LLVM                         LLVM-MCA(1)


NAME
       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS
       llvm-mca [options] [input]

DESCRIPTION
       llvm-mca is a performance analysis tool that uses information
       available in LLVM (e.g. scheduling models) to statically measure the
       performance of machine code on a specific CPU.

       Performance is measured in terms of throughput as well as processor
       resource consumption. The tool currently works for processors with a
       backend for which there is a scheduling model available in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help with diagnosing
       potential performance issues.

       Given an assembly code sequence, llvm-mca estimates the Instructions
       Per Cycle (IPC), as well as hardware resource pressure. The analysis
       and reporting style were inspired by the IACA tool from Intel.

       For example, you can compile code with clang, output assembly, and
       pipe it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

       (llvm-mca detects Intel syntax by the presence of an .intel_syntax
       directive at the beginning of the input. By default its output
       syntax matches that of its input.)

       Scheduling models are not only used to compute instruction latencies
       and throughput, but also to understand which processor resources are
       available and how to simulate them.

       By design, the quality of the analysis conducted by llvm-mca is
       inevitably affected by the quality of the scheduling models in LLVM.

       If you see that the performance report is not accurate for a
       processor, please file a bug against the appropriate backend.

OPTIONS
       If input is "-" or omitted, llvm-mca reads from standard input.
       Otherwise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca will send its output to
       standard output if the input is from standard input. If the -o
       option specifies "-", then the output will also be sent to standard
       output.

       -help  Print a summary of command line options.

       -o <filename>
              Use <filename> as the output filename. See the summary above
              for more details.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code. It
              defaults to the host default target.

       -mcpu=<cpuname>
              Specify the processor for which to analyze the code. By
              default, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated
              by the tool. On x86, possible values are [0, 1]. A value of 0
              selects the AT&T assembly format, while a value of 1 selects
              the Intel assembly format for the code printed out by the
              tool in the analysis report.
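
              For example, to print the report in Intel syntax (foo.s is a
              placeholder input file):

                 $ llvm-mca -mcpu=btver2 -output-asm-variant=1 foo.s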

       -print-imm-hex
              Prefer hex format for numeric literals in the output assembly
              printed as part of the report.

       -dispatch=<width>
              Specify a different dispatch width for the processor. The
              dispatch width defaults to field 'IssueWidth' in the
              processor scheduling model. If width is zero, then the
              default dispatch width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this
              flag limits how many physical registers are available for
              register renaming purposes. A value of zero for this flag
              means "unlimited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run. If this flag is set
              to 0, then the tool sets the number of iterations to a
              default value (i.e. 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias.
              This is the default behavior.
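
              For example, to model conservative aliasing instead, so that
              younger loads are not allowed to pass older stores (foo.s is
              a placeholder input file):

                 $ llvm-mca -mcpu=btver2 -noalias=false foo.s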

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the load queue. A value of
              zero for this flag is ignored, and the default load queue
              size is used instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an
              unbounded number of entries in the store queue. A value of
              zero for this flag is ignored, and the default store queue
              size is used instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view.
              By default, the timeline view prints information for up to 10
              iterations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view, or use 0 for
              no limit. By default, the number of cycles is set to 80.
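
              For example, to print a timeline limited to the first two
              iterations (foo.s is a placeholder input file):

                 $ llvm-mca -mcpu=btver2 -timeline -timeline-max-iterations=2 foo.s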

       -resource-pressure
              Enable the resource pressure view. This is enabled by
              default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics. This view collects and
              analyzes instruction dispatch events, as well as
              static/dynamic dispatch stall events. This view is disabled
              by default.

       -scheduler-stats
              Enable extra scheduler statistics. This view collects and
              analyzes instruction issue events. This view is disabled by
              default.

       -retire-stats
              Enable extra retire control unit statistics. This view is
              disabled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -show-encoding
              Enable the printing of instruction encodings within the
              instruction info view.

       -all-stats
              Print all hardware statistics. This enables extra statistics
              related to the dispatch logic, the hardware schedulers, the
              register file(s), and the retire control unit. This option is
              disabled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Print resource pressure information based on the static
              information available from the processor model. This differs
              from the resource pressure view because it doesn't require
              that the code is simulated. It instead prints the theoretical
              uniform distribution of resource pressure for every
              instruction in sequence.

       -bottleneck-analysis
              Print information about bottlenecks that affect the
              throughput. This analysis can be expensive, and it is
              disabled by default. Bottlenecks are highlighted in the
              summary view. Bottleneck analysis is currently not supported
              for processors with an in-order backend.

       -json  Print the requested views in valid JSON format. The
              instructions and the processor resources are printed as
              members of special top level JSON objects. The individual
              views refer to them by index. However, not all views are
              currently supported. For example, the report from the
              bottleneck analysis is not printed out in JSON. All the
              default views are currently supported.

       -disable-cb
              Force usage of the generic CustomBehaviour class rather than
              using the target specific class. The generic class never
              detects any custom hazards.
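
       Views can be combined in a single run. For example, the following
       command enables the timeline view, the bottleneck analysis, and all
       the extra hardware statistics (foo.s is a placeholder input file):

          $ llvm-mca -mcpu=btver2 -timeline -bottleneck-analysis -all-stats foo.s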

EXIT STATUS
       llvm-mca returns 0 on success. Otherwise, an error message is
       printed to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE REGIONS
       llvm-mca allows for the optional usage of special code comments to
       mark regions of the assembly code to be analyzed. A comment starting
       with substring LLVM-MCA-BEGIN marks the beginning of a code region.
       A comment starting with substring LLVM-MCA-END marks the end of a
       code region. For example:

          # LLVM-MCA-BEGIN
            ...
          # LLVM-MCA-END

       If no user-defined region is specified, then llvm-mca assumes a
       default region which contains every instruction in the input file.
       Every region is analyzed in isolation, and the final performance
       report is the union of all the reports generated for every code
       region.

       Code regions can have names. For example:

          # LLVM-MCA-BEGIN A simple example
            add %eax, %eax
          # LLVM-MCA-END

       The code from the example above defines a region named "A simple
       example" with a single instruction in it. Note how the region name
       doesn't have to be repeated in the LLVM-MCA-END directive. In the
       absence of overlapping regions, an anonymous LLVM-MCA-END directive
       always ends the currently active user-defined region.

       Example of nesting regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END bar
          # LLVM-MCA-END foo

       Example of overlapping regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END foo
            add %eax, %edx
          # LLVM-MCA-END bar

       Note that multiple anonymous regions cannot overlap. Also,
       overlapping regions cannot have the same name.
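
       Multiple anonymous regions that do not overlap are allowed; for
       example, the following is valid:

          # LLVM-MCA-BEGIN
            add %eax, %edx
          # LLVM-MCA-END
          # LLVM-MCA-BEGIN
            sub %eax, %edx
          # LLVM-MCA-END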

       There is no support for marking regions from high-level source code,
       like C or C++. As a workaround, inline assembly directives may be
       used:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo");
            a += 42;
            __asm volatile("# LLVM-MCA-END");
            a *= b;
            return a;
          }

       However, this interferes with optimizations like loop vectorization
       and may have an impact on the code generated. This is because the
       __asm statements are seen as real code having important side
       effects, which limits how the code around them can be transformed.
       If users want to make use of inline assembly to emit markers, then
       the recommendation is to always verify that the output assembly is
       equivalent to the assembly generated in the absence of markers. The
       Clang options to emit optimization reports can also help in
       detecting missed optimizations.
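
       One way to perform that check is to build the file twice and compare
       the generated assembly. This is a sketch that assumes the markers in
       the source are wrapped in a hypothetical NO_MARKERS preprocessor
       guard:

          $ clang foo.c -O2 -S -o marked.s               # markers enabled
          $ clang foo.c -O2 -S -o plain.s -DNO_MARKERS   # markers compiled out
          $ diff marked.s plain.s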

HOW LLVM-MCA WORKS
       llvm-mca takes assembly code as input. The assembly code is parsed
       into a sequence of MCInst with the help of the existing LLVM target
       assembly parsers. The parsed sequence of MCInst is then analyzed by
       a Pipeline module to generate a performance report.

       The Pipeline module simulates the execution of the machine code
       sequence in a loop of iterations (default is 100). During this
       process, the pipeline collects a number of execution related
       statistics. At the end of this process, the pipeline generates and
       prints a report from the collected statistics.

       Here is an example of a performance report generated by the tool for
       a dot-product of two packed float vectors of four elements. The
       analysis is conducted for target x86, cpu btver2. The following
       report can be produced with the command below, using the example
       located at test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Total uOps:        900

          Dispatch Width:    2
          uOps Per Cycle:    1.48
          IPC:               1.48
          Block RThroughput: 2.0

          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps   %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps  %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps  %xmm3, %xmm3, %xmm4

          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL

          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps   %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps  %xmm3, %xmm3, %xmm4

       According to this report, the dot-product kernel has been executed
       300 times, for a total of 900 simulated instructions. The total
       number of simulated micro opcodes (uOps) is also 900.

       The report is structured in three main sections. The first section
       collects a few performance numbers; the goal of this section is to
       give a very quick overview of the performance throughput. Important
       performance indicators are IPC, uOps Per Cycle, and Block
       RThroughput (Block Reciprocal Throughput).

       Field DispatchWidth is the maximum number of micro opcodes that are
       dispatched to the out-of-order backend every simulated cycle. For
       processors with an in-order backend, DispatchWidth is the maximum
       number of micro opcodes issued to the backend every simulated cycle.

       IPC is computed by dividing the total number of simulated
       instructions by the total number of cycles.
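
       For the report above, that gives:

          IPC = 900 instructions / 610 cycles = 1.48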

       Field Block RThroughput is the reciprocal of the block throughput.
       Block throughput is a theoretical quantity computed as the maximum
       number of blocks (i.e. iterations) that can be executed per
       simulated clock cycle in the absence of loop carried dependencies.
       Block throughput is limited from above by the dispatch rate, and the
       availability of hardware resources.

       In the absence of loop-carried data dependencies, the observed IPC
       tends to a theoretical maximum which can be computed by dividing the
       number of instructions of a single iteration by the Block
       RThroughput.

       Field 'uOps Per Cycle' is computed by dividing the total number of
       simulated micro opcodes by the total number of cycles. A delta
       between Dispatch Width and this field is an indicator of a
       performance issue. In the absence of loop-carried data dependencies,
       the observed 'uOps Per Cycle' should tend to a theoretical maximum
       throughput which can be computed by dividing the number of uOps of a
       single iteration by the Block RThroughput.

       Field uOps Per Cycle is bounded from above by the dispatch width.
       That is because the dispatch width limits the maximum size of a
       dispatch group. Both IPC and 'uOps Per Cycle' are limited by the
       amount of hardware parallelism. The availability of hardware
       resources affects the resource pressure distribution, and it limits
       the number of instructions that can be executed in parallel every
       cycle. A delta between Dispatch Width and the theoretical maximum
       uOps per Cycle (computed by dividing the number of uOps of a single
       iteration by the Block RThroughput) is an indicator of a performance
       bottleneck caused by the lack of hardware resources. In general, the
       lower the Block RThroughput, the better.

       In this example, uOps per iteration/Block RThroughput is 1.50. Since
       there are no loop-carried dependencies, the observed uOps Per Cycle
       is expected to approach 1.50 when the number of iterations tends to
       infinity. The delta between the Dispatch Width (2.00), and the
       theoretical maximum uOp throughput (1.50) is an indicator of a
       performance bottleneck caused by the lack of hardware resources, and
       the Resource pressure view can help to identify the problematic
       resource usage.

       The second section of the report is the instruction info view. It
       shows the latency and reciprocal throughput of every instruction in
       the sequence. It also reports extra information related to the
       number of micro opcodes, and opcode properties (i.e., 'MayLoad',
       'MayStore', and 'HasSideEffects').

       Field RThroughput is the reciprocal of the instruction throughput.
       Throughput is computed as the maximum number of instructions of a
       same type that can be executed per clock cycle in the absence of
       operand dependencies. In this example, the reciprocal throughput of
       a vector float multiply is 1 cycle/instruction. That is because the
       FP multiplier JFPM is only available from pipeline JFPU1.

       Instruction encodings are displayed within the instruction info view
       when flag -show-encoding is specified.

       Below is an example of -show-encoding output for the dot-product
       kernel:

          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)
          [7]: Encoding Size

          [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:     Instructions:
           1      2     1.00                         4     c5 f0 59 d0    vmulps   %xmm0, %xmm1, %xmm2
           1      4     1.00                         4     c5 eb 7c da    vhaddps  %xmm2, %xmm2, %xmm3
           1      4     1.00                         4     c5 e3 7c e3    vhaddps  %xmm3, %xmm3, %xmm4

       The Encoding Size column shows the size in bytes of instructions.
       The Encodings column shows the actual instruction encodings (byte
       sequences in hex).

       The third section is the Resource pressure view. This view reports
       the average number of resource cycles consumed every iteration by
       instructions for every processor resource unit available on the
       target. Information is structured in two tables. The first table
       reports the number of resource cycles spent on average every
       iteration. The second table correlates the resource cycles to the
       machine instructions in the sequence. For example, every iteration
       of the instruction vmulps always executes on resource unit [6]
       (JFPU1 - floating point pipeline #1), consuming an average of 1
       resource cycle per iteration. Note that on AMD Jaguar, vector
       floating-point multiply can only be issued to pipeline JFPU1, while
       horizontal floating-point additions can only be issued to pipeline
       JFPU0.

       The resource pressure view helps with identifying bottlenecks caused
       by high usage of specific hardware resources. Situations with
       resource pressure mainly concentrated on a few resources should, in
       general, be avoided. Ideally, pressure should be uniformly
       distributed between multiple resources.

   Timeline View
       The timeline view produces a detailed report of each instruction's
       state transitions through an instruction pipeline. This view is
       enabled by the command line option -timeline. As instructions
       transition through the various stages of the pipeline, their states
       are depicted in the view report. These states are represented by the
       following characters:

       • D : Instruction dispatched.

       • e : Instruction executing.

       • E : Instruction executed.

       • R : Instruction retired.

       • = : Instruction already dispatched, waiting to be executed.

       • - : Instruction executed, waiting to be retired.

       Below is the timeline view for a subset of the dot-product example
       located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and
       processed by llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                                        012345
          Index         0123456789

          [0,0]         DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]         D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]         .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]         .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]         . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]         . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]         .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]         .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]         .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4

          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3    vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0    vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0    vhaddps  %xmm3, %xmm3, %xmm4
                 3     3.3    0.5    1.4    <total>

       The timeline view is interesting because it shows instruction state
       changes during execution. It also gives an idea of how the tool
       processes instructions executed on the target, and how their timing
       information might be calculated.

       The timeline view is structured in two tables. The first table shows
       instructions changing state over time (measured in cycles); the
       second table (named Average Wait times) reports useful timing
       statistics, which should help diagnose performance bottlenecks
       caused by long data dependencies and sub-optimal usage of hardware
       resources.

       An instruction in the timeline view is identified by a pair of
       indices, where the first index identifies an iteration, and the
       second index is the instruction index (i.e., where it appears in the
       code sequence). Since this example was generated using 3 iterations
       (-iterations=3), the iteration indices range from 0 to 2 inclusive.

       Excluding the first and last column, the remaining columns are in
       cycles. Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       • Instruction [1,0] was dispatched at cycle 1.

       • Instruction [1,0] started executing at cycle 2.

       • Instruction [1,0] reached the write back stage at cycle 4.

       • Instruction [1,0] was retired at cycle 10.

       Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
       wait in the scheduler's queue for the operands to become available.
       By the time vmulps is dispatched, operands are already available,
       and pipeline JFPU1 is ready to serve another instruction. So the
       instruction can be immediately issued on the JFPU1 pipeline. That is
       demonstrated by the fact that the instruction only spent 1cy in the
       scheduler's queue.

       There is a gap of 5 cycles between the write-back stage and the
       retire event. That is because instructions must retire in program
       order, so [1,0] has to wait for [0,2] to be retired first (i.e., it
       has to wait until cycle 10).

       In the example, all instructions are in a RAW (Read After Write)
       dependency chain. Register %xmm2 written by vmulps is immediately
       used by the first vhaddps, and register %xmm3 written by the first
       vhaddps is used by the second vhaddps. Long data dependencies
       negatively impact the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies introduced
       by instructions from different iterations. However, those
       dependencies can be removed at register renaming stage (at the cost
       of allocating register aliases, and therefore consuming physical
       registers).

       Table Average Wait times helps diagnose performance issues that are
       caused by the presence of long latency instructions and potentially
       long data dependencies which may limit the ILP. The last row,
       <total>, shows a global average over all instructions measured. Note
       that llvm-mca, by default, assumes at least 1cy between the dispatch
       event and the issue event.

       When the performance is limited by data dependencies and/or long
       latency instructions, the number of cycles spent while in the ready
       state is expected to be very small when compared with the total
       number of cycles spent in the scheduler's queue. The difference
       between the two counters is a good indicator of how large of an
       impact data dependencies had on the execution of the instructions.
       When performance is mostly limited by the lack of hardware
       resources, the delta between the two counters is small. However, the
       number of cycles spent in the queue tends to be larger (i.e., more
       than 1-3cy), especially when compared to other low latency
       instructions.

   Bottleneck Analysis
       The -bottleneck-analysis command line option enables the analysis of
       performance bottlenecks.

       This analysis is potentially expensive. It attempts to correlate
       increases in backend pressure (caused by pipeline resource pressure
       and data dependencies) to dynamic dispatch stalls.

       Below is an example of -bottleneck-analysis output generated by
       llvm-mca for 500 iterations of the dot-product example on btver2.

          Cycles with backend pressure increase [ 48.07% ]
          Throughput Bottlenecks:
            Resource Pressure       [ 47.77% ]
            - JFPA  [ 47.77% ]
            - JFPU0  [ 47.77% ]
            Data Dependencies:      [ 0.30% ]
            - Register Dependencies [ 0.30% ]
            - Memory Dependencies   [ 0.00% ]

          Critical sequence based on the simulation:

                        Instruction                 Dependency Information
           +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
           |
           |    < loop carried >
           |
           |      0.    vmulps  %xmm0, %xmm1, %xmm2
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3   ## RESOURCE interference:  JFPA [ probability: 74% ]
           +----> 2.    vhaddps %xmm3, %xmm3, %xmm4   ## REGISTER dependency:  %xmm3
           |
           |    < loop carried >
           |
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3   ## RESOURCE interference:  JFPA [ probability: 74% ]

       According to the analysis, throughput is limited by resource
       pressure and not by data dependencies. The analysis observed
       increases in backend pressure during 48.07% of the simulated run.
       Almost all those pressure increase events were caused by contention
       on processor resources JFPA/JFPU0.

       The critical sequence is the most expensive sequence of instructions
       according to the simulation. It is annotated to provide extra
       information about critical register dependencies and resource
       interferences between instructions.

       Instructions from the critical sequence are expected to
       significantly impact performance. By construction, the accuracy of
       this analysis is strongly dependent on the simulation and (as
       always) on the quality of the processor model in LLVM.

       Bottleneck analysis is currently not supported for processors with
       an in-order backend.

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and
       performance counters for the dispatch logic, the reorder buffer, the
       retire control unit, and the register file.

       Below is an example of -all-stats output generated by llvm-mca for
       300 iterations of the dot-product example discussed in the previous
       sections.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272  (44.6%)
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0

          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)

          Schedulers - number of cycles where we saw N micro opcodes issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)

          Scheduler's queue usage:
          [1] Resource name.
          [2] Average number of used buffer entries.
          [3] Maximum number of used buffer entries.
          [4] Total number of buffer entries.

           [1]            [2]        [3]        [4]
          JALU01           0          0          20
          JFPU01           17         18         18
          JLSAGU           0          0          12

          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
           2,           399  (65.4%)

          Total ROB Entries:                64
          Max Used ROB Entries:             35  ( 54.7% )
          Average Used ROB Entries per cy:  32  ( 50.0% )

          Register File statistics:
          Total number of mappings created:    900
          Max number of mappings used:         35

          *  Register File #1 -- JFpuPRF:
             Number of physical registers:     72
             Total number of mappings created: 900
             Max number of mappings used:      35

          *  Register File #2 -- JIntegerPRF:
             Number of physical registers:     64
             Total number of mappings created: 0
             Max number of mappings used:      0

       If we look at the Dynamic Dispatch Stall Cycles table, we see the
       counter for SCHEDQ reports 272 cycles. This counter is incremented
       every time the dispatch logic is unable to dispatch a full group
       because the scheduler's queue is full.

       Looking at the Dispatch Logic table, we see that the pipeline was
       only able to dispatch two micro opcodes 51.5% of the time. The
       dispatch group was limited to one micro opcode 44.6% of the cycles,
       which corresponds to 272 cycles. The dispatch statistics are
       displayed by using either the command option -all-stats or
       -dispatch-stats.

       The next table, Schedulers, presents a histogram displaying a count,
       representing the number of micro opcodes issued on some number of
       cycles. In this case, of the 610 simulated cycles, single opcodes
       were issued 306 times (50.2%) and there were 7 cycles where no
       opcodes were issued.

       The Scheduler's queue usage table shows the average and maximum
       number of buffer entries (i.e., scheduler queue entries) used at
       runtime. Resource JFPU01 reached its maximum (18 of 18 queue
       entries). Note that AMD Jaguar implements three schedulers:

       • JALU01 - A scheduler for ALU instructions.

       • JFPU01 - A scheduler for floating point operations.

       • JLSAGU - A scheduler for address generation.

       The dot-product is a kernel of three floating point instructions (a
       vector multiply followed by two horizontal adds). That explains why
       only the floating point scheduler appears to be used.

       A full scheduler queue is either caused by data dependency chains or
       by a sub-optimal usage of hardware resources. Sometimes, resource
       pressure can be mitigated by rewriting the kernel using different
       instructions that consume different scheduler resources. Schedulers
       with a small queue are less resilient to bottlenecks caused by the
       presence of long data dependencies. The scheduler statistics are
       displayed by using the command option -all-stats or
       -scheduler-stats.

       The next table, Retire Control Unit, presents a histogram displaying
       a count, representing the number of instructions retired on some
       number of cycles. In this case, of the 610 simulated cycles, two
       instructions were retired during the same cycle 399 times (65.4%)
       and there were 109 cycles where no instructions were retired. The
       retire statistics are displayed by using the command option
       -all-stats or -retire-stats.

       The last table presented is Register File statistics. Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for
       floating-point registers (JFpuPRF) and one for integer registers
       (JIntegerPRF). The table shows that of the 900 instructions
       processed, there were 900 mappings created. Since this dot-product
       example utilized only floating point registers, the JFpuPRF was
       responsible for creating the 900 mappings. However, we see that the
       pipeline only used a maximum of 35 of 72 available register slots at
       any given time. We can conclude that the floating point PRF was the
       only register file used for the example, and that it was never
       resource constrained. The register file statistics are displayed by
       using the command option -all-stats or -register-file-stats.

       In this example, we can conclude that the IPC is mostly limited by
       data dependencies, and not by resource pressure.

   Instruction Flow
       This section describes the instruction flow through the default
       pipeline of llvm-mca, as well as the functional units involved in
       the process.

       The default pipeline implements the following sequence of stages
       used to process instructions.

       • Dispatch (Instruction is dispatched to the schedulers).

       • Issue (Instruction is issued to the processor pipelines).

       • Write Back (Instruction is executed, and results are written
         back).

       • Retire (Instruction is retired; writes are architecturally
         committed).

       The in-order pipeline implements the following sequence of stages:

       • InOrderIssue (Instruction is issued to the processor pipelines).

       • Retire (Instruction is retired; writes are architecturally
         committed).

       llvm-mca assumes that instructions have all been decoded and placed
       into a queue before the simulation starts. Therefore, the
       instruction fetch and decode stages are not modeled. Performance
       bottlenecks in the frontend are not diagnosed. Also, llvm-mca does
       not model branch prediction.

   Instruction Dispatch
       During the dispatch stage, instructions are picked in program order
       from a queue of already decoded instructions, and dispatched in
       groups to the simulated hardware schedulers.

       The size of a dispatch group depends on the availability of the
       simulated hardware resources. The processor dispatch width defaults
       to the value of the IssueWidth in LLVM's scheduling model.

       An instruction can be dispatched if:

       • The size of the dispatch group is smaller than the processor's
         dispatch width.

       • There are enough entries in the reorder buffer.

       • There are enough physical registers to do register renaming.

       • The schedulers are not full.

       Scheduling models can optionally specify which register files are
       available on the processor. llvm-mca uses that information to
       initialize register file descriptors. Users can limit the number of
       physical registers that are globally available for register renaming
       by using the command option -register-file-size. A value of zero for
       this option means unbounded. By knowing how many registers are
       available for renaming, the tool can predict dispatch stalls caused
       by the lack of physical registers.
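
       For example, the following command artificially limits renaming to
       32 physical registers and prints the related register file
       statistics (foo.s is a placeholder input file):

          $ llvm-mca -mcpu=btver2 -register-file-size=32 -register-file-stats foo.s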

       The number of reorder buffer entries consumed by an instruction
       depends on the number of micro-opcodes specified for that
       instruction by the target scheduling model. The reorder buffer is
       responsible for tracking the progress of instructions that are
       "in-flight", and retiring them in program order. The number of
       entries in the reorder buffer defaults to the value specified by
       field MicroOpBufferSize in the target scheduling model.

       Instructions that are dispatched to the schedulers consume scheduler
       buffer entries. llvm-mca queries the scheduling model to determine
       the set of buffered resources consumed by an instruction. Buffered
       resources are treated like scheduler resources.

   Instruction Issue
       Each processor scheduler implements a buffer of instructions. An
       instruction has to wait in the scheduler's buffer until input
       register operands become available. Only at that point does the
       instruction become eligible for execution and may be issued
       (potentially out-of-order) for execution. Instruction latencies are
       computed by llvm-mca with the help of the scheduling model.

       llvm-mca's scheduler is designed to simulate multiple processor
       schedulers. The scheduler is responsible for tracking data
       dependencies, and dynamically selecting which processor resources
       are consumed by instructions. It delegates the management of
       processor resource units and resource groups to a resource manager.
       The resource manager is responsible for selecting resource units
       that are consumed by instructions. For example, if an instruction
       consumes 1cy of a resource group, the resource manager selects one
       of the available units from the group; by default, the resource
       manager uses a round-robin selector to guarantee that resource usage
       is uniformly distributed between all units of a group.

       llvm-mca's scheduler internally groups instructions into three sets:

       • WaitSet: a set of instructions whose operands are not ready.

       • ReadySet: a set of instructions ready to execute.

       • IssuedSet: a set of instructions executing.

       Depending on operand availability, instructions that are dispatched
       to the scheduler are either placed into the WaitSet or into the
       ReadySet.

       Every cycle, the scheduler checks if instructions can be moved from
       the WaitSet to the ReadySet, and if instructions from the ReadySet
       can be issued to the underlying pipelines. The algorithm prioritizes
       older instructions over younger instructions.

   Write-Back and Retire Stage
       Issued instructions are moved from the ReadySet to the IssuedSet.
       There, instructions wait until they reach the write-back stage. At
       that point, they get removed from the queue and the retire control
       unit is notified.

       When instructions are executed, the retire control unit flags the
       instruction as "ready to retire."

       Instructions are retired in program order. The register file is
       notified of the retirement so that it can free the physical
       registers that were allocated for the instruction during the
       register renaming stage.

   Load/Store Unit and Memory Consistency Model
       To simulate an out-of-order execution of memory operations, llvm-mca
       uses a load/store unit (LSUnit) to model the speculative execution
       of loads and stores.

       Each load (or store) consumes an entry in the load (or store) queue.
       Users can specify flags -lqueue and -squeue to limit the number of
       entries in the load and store queues respectively. The queues are
       unbounded by default.
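
       For example, the following command simulates a load queue of 16
       entries and a store queue of 8 entries (foo.s is a placeholder input
       file):

          $ llvm-mca -mcpu=btver2 -lqueue=16 -squeue=8 foo.s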

       The LSUnit implements a relaxed consistency model for memory loads
       and stores. The rules are:

       1. A younger load is allowed to pass an older load only if there are
          no intervening stores or barriers between the two loads.

       2. A younger load is allowed to pass an older store provided that
          the load does not alias with the store.

       3. A younger store is not allowed to pass an older store.

       4. A younger store is not allowed to pass an older load.

       By default, the LSUnit optimistically assumes that loads do not
       alias with store operations (-noalias=true). Under this assumption,
       younger loads are always allowed to pass older stores. Essentially,
       the LSUnit does not attempt to run any alias analysis to predict
       when loads and stores do not alias with each other.

       Note that, in the case of write-combining memory, rule 3 could be
       relaxed to allow reordering of non-aliasing store operations. That
       being said, at the moment, there is no way to further relax the
       memory model (-noalias is the only option). Essentially, there is no
       option to specify a different memory type (e.g., write-back,
       write-combining, write-through, etc.) and consequently to weaken, or
       strengthen, the memory model.

       Other limitations are:

       • The LSUnit does not know when store-to-load forwarding may occur.

       • The LSUnit does not know anything about cache hierarchy and memory
         types.

       • The LSUnit does not know how to identify serializing operations
         and memory fences.

       The LSUnit does not attempt to predict if a load or store hits or
       misses the L1 cache. It only knows if an instruction "MayLoad"
       and/or "MayStore." For loads, the scheduling model provides an
       "optimistic" load-to-use latency (which usually matches the
       load-to-use latency for when there is a hit in the L1D).

       llvm-mca does not know about serializing operations or
       memory-barrier-like instructions. The LSUnit conservatively assumes
       that an instruction which has both "MayLoad" and unmodeled side
       effects behaves like a "soft" load-barrier. That means it serializes
       loads without forcing a flush of the load queue. Similarly,
       instructions that "MayStore" and have unmodeled side effects are
       treated like store barriers. A full memory barrier is a "MayLoad"
       and "MayStore" instruction with unmodeled side effects. This is
       inaccurate, but it is the best that we can do at the moment with the
       current information available in LLVM.

       A load/store barrier consumes one entry of the load/store queue. A
       load/store barrier enforces ordering of loads/stores. A younger load
       cannot pass a load barrier. Also, a younger store cannot pass a
       store barrier. A younger load has to wait for the memory/load
       barrier to execute. A load/store barrier is "executed" when it
       becomes the oldest entry in the load/store queue(s). That also
       means, by construction, all of the older loads/stores have been
       executed.

       In conclusion, the full set of load/store consistency rules is:

       1. A store may not pass a previous store.

       2. A store may not pass a previous load (regardless of -noalias).

       3. A store has to wait until an older store barrier is fully
          executed.

       4. A load may pass a previous load.

       5. A load may not pass a previous store unless -noalias is set.

       6. A load has to wait until an older load barrier is fully executed.

   In-order Issue and Execute
       In-order processors are modeled as a single InOrderIssueStage stage.
       It bypasses the Dispatch, Scheduler and Load/Store unit.
       Instructions are issued as soon as their operand registers are
       available and resource requirements are met. Multiple instructions
       can be issued in one cycle according to the value of the IssueWidth
       parameter in LLVM's scheduling model.

       Once issued, an instruction is moved to the IssuedInst set until it
       is ready to retire. llvm-mca ensures that writes are committed
       in-order. However, an instruction is allowed to commit writes and
       retire out-of-order if the RetireOOO property is true for at least
       one of its writes.

   Custom Behaviour
       Due to certain instructions not being expressed perfectly within
       their scheduling model, llvm-mca isn't always able to simulate them
       perfectly. Modifying the scheduling model isn't always a viable
       option though (maybe because the instruction is modeled incorrectly
       on purpose or the instruction's behaviour is quite complex). The
       CustomBehaviour class can be used in these cases to enforce proper
       instruction modeling (often by customizing data dependencies and
       detecting hazards that llvm-mca has no way of knowing about).

       llvm-mca comes with one generic and multiple target specific
       CustomBehaviour classes. The generic class will be used if the
       -disable-cb flag is used or if a target specific CustomBehaviour
       class doesn't exist for that target. (The generic class does
       nothing.) Currently, the CustomBehaviour class is only a part of the
       in-order pipeline, but there are plans to add it to the out-of-order
       pipeline in the future.

       CustomBehaviour's main method is checkCustomHazard(), which uses the
       current instruction and a list of all instructions still executing
       within the pipeline to determine if the current instruction should
       be dispatched. As output, the method returns an integer representing
       the number of cycles that the current instruction must stall for
       (this can be an underestimate if you don't know the exact number,
       and a value of 0 represents no stall).

       If you'd like to add a CustomBehaviour class for a target that
       doesn't already have one, refer to an existing implementation to see
       how to set it up. Remember to look at (and add to)
       /llvm-mca/lib/CMakeLists.txt.

AUTHOR
       Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT
       2003-2023, LLVM Project

13                                2023-07-20                      LLVM-MCA(1)