LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)

NAME

       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS

       llvm-mca [options] [input]

DESCRIPTION

       llvm-mca is a performance analysis tool that uses information available
       in LLVM (e.g. scheduling models) to statically measure the performance
       of machine code on a specific CPU.

       Performance is measured in terms of throughput as well as processor
       resource consumption. The tool currently works for processors with a
       backend for which there is a scheduling model available in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help with diagnosing
       potential performance issues.

       Given an assembly code sequence, llvm-mca estimates the Instructions
       Per Cycle (IPC), as well as hardware resource pressure. The analysis
       and reporting style were inspired by the IACA tool from Intel.

       For example, you can compile code with clang, output assembly, and pipe
       it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

       (llvm-mca detects Intel syntax by the presence of an .intel_syntax
       directive at the beginning of the input.  By default its output syntax
       matches that of its input.)

       Scheduling models are not just used to compute instruction latencies
       and throughput, but also to understand what processor resources are
       available and how to simulate them.

       By design, the quality of the analysis conducted by llvm-mca is
       inevitably affected by the quality of the scheduling models in LLVM.

       If you see that the performance report is not accurate for a processor,
       please file a bug against the appropriate backend.

OPTIONS

       If input is "-" or omitted, llvm-mca reads from standard input.
       Otherwise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca will send its output to
       standard output if the input is from standard input.  If the -o option
       specifies "-", then the output will also be sent to standard output.

       -help  Print a summary of command line options.

       -o <filename>
              Use <filename> as the output filename. See the summary above for
              more details.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code.  It
              defaults to the host default target.

       -mcpu=<cpuname>
              Specify the processor for which to analyze the code.  By
              default, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated by
              the tool.  On x86, possible values are [0, 1]. A value of 0
              selects the AT&T assembly format, while a value of 1 selects the
              Intel assembly format for the code printed out by the tool in
              the analysis report.

       -print-imm-hex
              Prefer hex format for numeric literals in the output assembly
              printed as part of the report.

       -dispatch=<width>
              Specify a different dispatch width for the processor. The
              dispatch width defaults to field 'IssueWidth' in the processor
              scheduling model.  If width is zero, then the default dispatch
              width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this flag
              limits how many physical registers are available for register
              renaming purposes. A value of zero for this flag means
              "unlimited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run. If this flag is set to
              0, then the tool sets the number of iterations to a default
              value (i.e. 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias. This
              is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool.  By default, the tool assumes an unbounded
              number of entries in the load queue.  A value of zero for this
              flag is ignored, and the default load queue size is used
              instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an unbounded
              number of entries in the store queue. A value of zero for this
              flag is ignored, and the default store queue size is used
              instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view. By
              default, the timeline view prints information for up to 10
              iterations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view, or use 0 for no
              limit. By default, the number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view. This is enabled by default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics. This view collects and
              analyzes instruction dispatch events, as well as static/dynamic
              dispatch stall events. This view is disabled by default.

       -scheduler-stats
              Enable extra scheduler statistics. This view collects and
              analyzes instruction issue events. This view is disabled by
              default.

       -retire-stats
              Enable extra retire control unit statistics. This view is
              disabled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -show-encoding
              Enable the printing of instruction encodings within the
              instruction info view.

       -show-barriers
              Enable the printing of LoadBarrier and StoreBarrier flags within
              the instruction info view.

       -all-stats
              Print all hardware statistics. This enables extra statistics
              related to the dispatch logic, the hardware schedulers, the
              register file(s), and the retire control unit. This option is
              disabled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Prints resource pressure information based on the static
              information available from the processor model. This differs
              from the resource pressure view because it doesn't require the
              code to be simulated. It instead prints the theoretical uniform
              distribution of resource pressure for every instruction in
              sequence.

       -bottleneck-analysis
              Print information about bottlenecks that affect the throughput.
              This analysis can be expensive, and it is disabled by default.
              Bottlenecks are highlighted in the summary view. Bottleneck
              analysis is currently not supported for processors with an
              in-order backend.

       -json  Print the requested views in valid JSON format. The instructions
              and the processor resources are printed as members of special
              top level JSON objects.  The individual views refer to them by
              index. However, not all views are currently supported. For
              example, the report from the bottleneck analysis is not printed
              out in JSON. All the default views are currently supported. A
              sketch of how the JSON output can be consumed follows the option
              descriptions below.

       -disable-cb
              Force usage of the generic CustomBehaviour and InstrPostProcess
              classes rather than using the target specific implementation.
              The generic classes never detect any custom hazards or make any
              post processing modifications to instructions.

       -disable-im
              Force usage of the generic InstrumentManager rather than using
              the target specific implementation. The generic class creates
              Instruments that provide no extra information, and
              InstrumentManager never overrides the default schedule class for
              a given instruction.
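
       Below is a minimal Python sketch of how the -json output could be
       consumed (see the -json option above). The invocation mirrors the
       examples in this page; the exact JSON key layout is not assumed here,
       it is discovered by printing the top level keys:

          import json
          import subprocess

          # Run llvm-mca with -json on an input file (dot-product.s is just
          # an example name) and parse the report.
          out = subprocess.run(
              ["llvm-mca", "-mtriple=x86_64-unknown-unknown", "-mcpu=btver2",
               "-json", "dot-product.s"],
              capture_output=True, text=True, check=True).stdout
          report = json.loads(out)
          # Instructions and processor resources are members of special top
          # level objects which the views reference by index; print the keys
          # to discover the exact layout for your LLVM version.
          print(list(report.keys()))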

EXIT STATUS

       llvm-mca returns 0 on success. Otherwise, an error message is printed
       to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS

       llvm-mca allows for the optional usage of special code comments to mark
       regions of the assembly code to be analyzed.  A comment starting with
       substring LLVM-MCA-BEGIN marks the beginning of an analysis region.  A
       comment starting with substring LLVM-MCA-END marks the end of a region.
       For example:

          # LLVM-MCA-BEGIN
            ...
          # LLVM-MCA-END

       If no user-defined region is specified, then llvm-mca assumes a default
       region which contains every instruction in the input file.  Every
       region is analyzed in isolation, and the final performance report is
       the union of all the reports generated for every analysis region.

       Analysis regions can have names. For example:

          # LLVM-MCA-BEGIN A simple example
            add %eax, %eax
          # LLVM-MCA-END

       The code from the example above defines a region named "A simple
       example" with a single instruction in it. Note how the region name
       doesn't have to be repeated in the LLVM-MCA-END directive. In the
       absence of overlapping regions, an anonymous LLVM-MCA-END directive
       always ends the currently active user-defined region.

       Example of nesting regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END bar
          # LLVM-MCA-END foo

       Example of overlapping regions:

          # LLVM-MCA-BEGIN foo
            add %eax, %edx
          # LLVM-MCA-BEGIN bar
            sub %eax, %edx
          # LLVM-MCA-END foo
            add %eax, %edx
          # LLVM-MCA-END bar

       Note that multiple anonymous regions cannot overlap. Also, overlapping
       regions cannot have the same name.

       There is no support for marking regions from high-level source code,
       like C or C++. As a workaround, inline assembly directives may be used:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo":::"memory");
            a += 42;
            __asm volatile("# LLVM-MCA-END":::"memory");
            a *= b;
            return a;
          }

       However, this interferes with optimizations like loop vectorization and
       may have an impact on the code generated. This is because the __asm
       statements are seen as real code having important side effects, which
       limits how the code around them can be transformed. If users want to
       make use of inline assembly to emit markers, then the recommendation is
       to always verify that the output assembly is equivalent to the assembly
       generated in the absence of markers.  The Clang options to emit
       optimization reports can also help in detecting missed optimizations.

INSTRUMENT REGIONS

       An InstrumentRegion describes a region of assembly code guarded by
       special LLVM-MCA comment directives.

          # LLVM-MCA-<INSTRUMENT_TYPE> <data>
            ...  ## asm

       where INSTRUMENT_TYPE is a type defined by the target, and data is an
       argument interpreted by that instrument type.

       A comment starting with substring LLVM-MCA-<INSTRUMENT_TYPE> brings
       data into scope for llvm-mca to use in its analysis for all following
       instructions.

       If a comment with the same INSTRUMENT_TYPE is found later in the
       instruction list, then the original InstrumentRegion will be
       automatically ended, and a new InstrumentRegion will begin.

       If comments with different INSTRUMENT_TYPEs are present, then the data
       from each remains available. In contrast with an AnalysisRegion, an
       InstrumentRegion does not need a comment to end the region.
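
       The scoping rules can be pictured with a small Python sketch. This is
       illustrative only, not llvm-mca code: a later comment with the same
       INSTRUMENT_TYPE replaces the data in scope, while comments with
       different types coexist:

          # Items are ("comment", instrument_type, data) or ("inst", text).
          def active_instruments(stream):
              scope = {}      # instrument type -> data currently in scope
              for item in stream:
                  if item[0] == "comment":
                      _, itype, data = item
                      # Same type: the old region ends, a new one begins.
                      scope[itype] = data
                  else:
                      # An instruction sees every active instrument.
                      yield item[1], dict(scope)

          stream = [
              ("comment", "RISCV-LMUL", "M1"),
              ("inst", "vadd.vv v2, v2, v2"),
              ("comment", "RISCV-LMUL", "M8"),   # ends M1, begins M8
              ("inst", "vadd.vv v4, v4, v4"),
          ]
          for inst, scope in active_instruments(stream):
              print(inst, scope)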

       Comments that are prefixed with LLVM-MCA- but do not correspond to a
       valid INSTRUMENT_TYPE for the target cause an error, except for BEGIN
       and END, since those correspond to AnalysisRegions.  Comments that do
       not start with LLVM-MCA- are ignored by llvm-mca.

       An instruction (an MCInst) is added to an InstrumentRegion R only if
       its location is in the range [R.RangeStart, R.RangeEnd].

       On RISCV targets, vector instructions have different behaviour
       depending on the LMUL. Code can be instrumented with a comment that
       takes the following form:

          # LLVM-MCA-RISCV-LMUL <M1|M2|M4|M8|MF2|MF4|MF8>

       The RISCV InstrumentManager will override the schedule class for vector
       instructions to use the scheduling behaviour of its pseudo-instruction,
       which is LMUL dependent. It makes sense to place RISCV instrument
       comments directly after vset{i}vl{i} instructions, although they can be
       placed anywhere in the program.

       Example of program with no call to vset{i}vl{i}:

          # LLVM-MCA-RISCV-LMUL M2
          vadd.vv v2, v2, v2

       Example of program with call to vset{i}vl{i}:

          vsetvli zero, a0, e8, m1, tu, mu
          # LLVM-MCA-RISCV-LMUL M1
          vadd.vv v2, v2, v2

       Example of program with multiple calls to vset{i}vl{i}:

          vsetvli zero, a0, e8, m1, tu, mu
          # LLVM-MCA-RISCV-LMUL M1
          vadd.vv v2, v2, v2
          vsetvli zero, a0, e8, m8, tu, mu
          # LLVM-MCA-RISCV-LMUL M8
          vadd.vv v2, v2, v2

       Example of program with call to vsetvl:

          vsetvl rd, rs1, rs2
          # LLVM-MCA-RISCV-LMUL M1
          vadd.vv v12, v12, v12
          vsetvl rd, rs1, rs2
          # LLVM-MCA-RISCV-LMUL M4
          vadd.vv v12, v12, v12

HOW LLVM-MCA WORKS

       llvm-mca takes assembly code as input. The assembly code is parsed into
       a sequence of MCInst with the help of the existing LLVM target assembly
       parsers. The parsed sequence of MCInst is then analyzed by a Pipeline
       module to generate a performance report.

       The Pipeline module simulates the execution of the machine code
       sequence in a loop of iterations (default is 100). During this process,
       the pipeline collects a number of execution related statistics. At the
       end of this process, the pipeline generates and prints a report from
       the collected statistics.

       Here is an example of a performance report generated by the tool for a
       dot-product of two packed float vectors of four elements. The analysis
       is conducted for target x86, cpu btver2.  The report can be produced
       with the following command, using the example located at
       test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Total uOps:        900

          Dispatch Width:    2
          uOps Per Cycle:    1.48
          IPC:               1.48
          Block RThroughput: 2.0


          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps      %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps     %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps     %xmm3, %xmm3, %xmm4


          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL


          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps      %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm3, %xmm3, %xmm4

       According to this report, the dot-product kernel has been executed 300
       times, for a total of 900 simulated instructions. The total number of
       simulated micro opcodes (uOps) is also 900.

       The report is structured in three main sections.  The first section
       collects a few performance numbers; the goal of this section is to give
       a very quick overview of the performance throughput. Important
       performance indicators are IPC, uOps Per Cycle, and Block RThroughput
       (Block Reciprocal Throughput).

       Field Dispatch Width is the maximum number of micro opcodes that are
       dispatched to the out-of-order backend every simulated cycle. For
       processors with an in-order backend, Dispatch Width is the maximum
       number of micro opcodes issued to the backend every simulated cycle.

       IPC is computed by dividing the total number of simulated instructions
       by the total number of cycles.

       Field Block RThroughput is the reciprocal of the block throughput.
       Block throughput is a theoretical quantity computed as the maximum
       number of blocks (i.e. iterations) that can be executed per simulated
       clock cycle in the absence of loop carried dependencies. Block
       throughput is limited from above by the dispatch rate, and the
       availability of hardware resources.

       In the absence of loop-carried data dependencies, the observed IPC
       tends to a theoretical maximum which can be computed by dividing the
       number of instructions of a single iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is computed by dividing the total number of
       simulated micro opcodes by the total number of cycles. A delta between
       Dispatch Width and this field is an indicator of a performance issue.
       In the absence of loop-carried data dependencies, the observed 'uOps
       Per Cycle' should tend to a theoretical maximum throughput which can be
       computed by dividing the number of uOps of a single iteration by the
       Block RThroughput.

       Field uOps Per Cycle is bounded from above by the dispatch width.  That
       is because the dispatch width limits the maximum size of a dispatch
       group. Both IPC and 'uOps Per Cycle' are limited by the amount of
       hardware parallelism. The availability of hardware resources affects
       the resource pressure distribution, and it limits the number of
       instructions that can be executed in parallel every cycle.  A delta
       between Dispatch Width and the theoretical maximum uOps per Cycle
       (computed by dividing the number of uOps of a single iteration by the
       Block RThroughput) is an indicator of a performance bottleneck caused
       by the lack of hardware resources.  In general, the lower the Block
       RThroughput, the better.

       In this example, uOps per iteration/Block RThroughput is 1.50. Since
       there are no loop-carried dependencies, the observed uOps Per Cycle is
       expected to approach 1.50 when the number of iterations tends to
       infinity. The delta between the Dispatch Width (2.00), and the
       theoretical maximum uOp throughput (1.50) is an indicator of a
       performance bottleneck caused by the lack of hardware resources, and
       the Resource pressure view can help to identify the problematic
       resource usage.
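
       The summary numbers above can be recomputed by hand. The following
       Python sketch restates the formulas using the values from the report:

          # Values taken from the report above.
          instructions, uops, cycles, iterations = 900, 900, 610, 300
          block_rthroughput = 2.0

          ipc = instructions / cycles               # 900/610 ~= 1.48
          uops_per_cycle = uops / cycles            # 900/610 ~= 1.48
          # Theoretical maxima in the absence of loop-carried dependencies:
          max_ipc = (instructions / iterations) / block_rthroughput  # 1.5
          max_upc = (uops / iterations) / block_rthroughput          # 1.5
          print(round(ipc, 2), round(uops_per_cycle, 2), max_ipc, max_upc)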

       The second section of the report is the instruction info view. It shows
       the latency and reciprocal throughput of every instruction in the
       sequence. It also reports extra information related to the number of
       micro opcodes, and opcode properties (i.e., 'MayLoad', 'MayStore', and
       'HasSideEffects').

       Field RThroughput is the reciprocal of the instruction throughput.
       Throughput is computed as the maximum number of instructions of the
       same type that can be executed per clock cycle in the absence of
       operand dependencies.  In this example, the reciprocal throughput of a
       vector float multiply is 1 cycle per instruction.  That is because the
       FP multiplier JFPM is only available from pipeline JFPU1.

       Instruction encodings are displayed within the instruction info view
       when flag -show-encoding is specified.

       Below is an example of -show-encoding output for the dot-product
       kernel:

          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)
          [7]: Encoding Size

          [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:                    Instructions:
           1      2     1.00                         4     c5 f0 59 d0                   vmulps %xmm0, %xmm1, %xmm2
           1      4     1.00                         4     c5 eb 7c da                   vhaddps        %xmm2, %xmm2, %xmm3
           1      4     1.00                         4     c5 e3 7c e3                   vhaddps        %xmm3, %xmm3, %xmm4

       The Encoding Size column shows the size of instructions in bytes.  The
       Encodings column shows the actual instruction encodings (byte sequences
       in hex).

       The third section is the Resource pressure view.  This view reports the
       average number of resource cycles consumed every iteration by
       instructions for every processor resource unit available on the target.
       Information is structured in two tables. The first table reports the
       number of resource cycles spent on average every iteration. The second
       table correlates the resource cycles to the machine instruction in the
       sequence. For example, every iteration of the instruction vmulps always
       executes on resource unit [6] (JFPU1 - floating point pipeline #1),
       consuming an average of 1 resource cycle per iteration.  Note that on
       AMD Jaguar, vector floating-point multiply can only be issued to
       pipeline JFPU1, while horizontal floating-point additions can only be
       issued to pipeline JFPU0.

       The resource pressure view helps with identifying bottlenecks caused by
       high usage of specific hardware resources.  Situations with resource
       pressure mainly concentrated on a few resources should, in general, be
       avoided.  Ideally, pressure should be uniformly distributed between
       multiple resources.

   Timeline View
       The timeline view produces a detailed report of each instruction's
       state transitions through an instruction pipeline.  This view is
       enabled by the command line option -timeline.  As instructions
       transition through the various stages of the pipeline, their states are
       depicted in the view report.  These states are represented by the
       following characters:

       • D : Instruction dispatched.

       • e : Instruction executing.

       • E : Instruction executed.

       • R : Instruction retired.

       • = : Instruction already dispatched, waiting to be executed.

       • - : Instruction executed, waiting to be retired.

       Below is the timeline view for a subset of the dot-product example
       located in test/tools/llvm-mca/X86/BtVer2/dot-product.s and processed
       by llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4


          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
                 3     3.3    0.5    1.4       <total>

       The timeline view is interesting because it shows instruction state
       changes during execution.  It also gives an idea of how the tool
       processes instructions executed on the target, and how their timing
       information might be calculated.

       The timeline view is structured in two tables.  The first table shows
       instructions changing state over time (measured in cycles); the second
       table (named Average Wait times) reports useful timing statistics,
       which should help diagnose performance bottlenecks caused by long data
       dependencies and sub-optimal usage of hardware resources.

       An instruction in the timeline view is identified by a pair of indices,
       where the first index identifies an iteration, and the second index is
       the instruction index (i.e., where it appears in the code sequence).
       Since this example was generated using 3 iterations (-iterations=3),
       the iteration indices range from 0 to 2, inclusive.

       Excluding the first and last column, the remaining columns are in
       cycles.  Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       • Instruction [1,0] was dispatched at cycle 1.

       • Instruction [1,0] started executing at cycle 2.

       • Instruction [1,0] reached the write back stage at cycle 4.

       • Instruction [1,0] was retired at cycle 10.

       Instruction [1,0] (i.e., vmulps from iteration #1) does not have to
       wait in the scheduler's queue for the operands to become available.  By
       the time vmulps is dispatched, operands are already available, and
       pipeline JFPU1 is ready to serve another instruction.  So the
       instruction can be immediately issued on the JFPU1 pipeline. That is
       demonstrated by the fact that the instruction only spent 1cy in the
       scheduler's queue.

       There is a gap of 5 cycles between the write-back stage and the retire
       event.  That is because instructions must retire in program order, so
       [1,0] has to wait for [0,2] to be retired first (i.e., it has to wait
       until cycle 10).

       In the example, all instructions are in a RAW (Read After Write)
       dependency chain.  Register %xmm2 written by vmulps is immediately used
       by the first vhaddps, and register %xmm3 written by the first vhaddps
       is used by the second vhaddps.  Long data dependencies negatively
       impact the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies introduced by
       instructions from different iterations.  However, those dependencies
       can be removed at the register renaming stage (at the cost of
       allocating register aliases, and therefore consuming physical
       registers).

       Table Average Wait times helps diagnose performance issues that are
       caused by the presence of long latency instructions and potentially
       long data dependencies which may limit the ILP. The last row, <total>,
       shows a global average over all instructions measured.  Note that
       llvm-mca, by default, assumes at least 1cy between the dispatch event
       and the issue event.

       When the performance is limited by data dependencies and/or long
       latency instructions, the number of cycles spent while in the ready
       state is expected to be very small when compared with the total number
       of cycles spent in the scheduler's queue.  The difference between the
       two counters is a good indicator of how large of an impact data
       dependencies had on the execution of the instructions.  When
       performance is mostly limited by the lack of hardware resources, the
       delta between the two counters is small.  However, the number of cycles
       spent in the queue tends to be larger (i.e., more than 1-3cy),
       especially when compared to other low latency instructions.

   Bottleneck Analysis
       The -bottleneck-analysis command line option enables the analysis of
       performance bottlenecks.

       This analysis is potentially expensive. It attempts to correlate
       increases in backend pressure (caused by pipeline resource pressure and
       data dependencies) to dynamic dispatch stalls.

       Below is an example of -bottleneck-analysis output generated by
       llvm-mca for 500 iterations of the dot-product example on btver2.

          Cycles with backend pressure increase [ 48.07% ]
          Throughput Bottlenecks:
            Resource Pressure       [ 47.77% ]
            - JFPA  [ 47.77% ]
            - JFPU0  [ 47.77% ]
            Data Dependencies:      [ 0.30% ]
            - Register Dependencies [ 0.30% ]
            - Memory Dependencies   [ 0.00% ]

          Critical sequence based on the simulation:

                        Instruction                         Dependency Information
           +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
           |
           |    < loop carried >
           |
           |      0.    vmulps  %xmm0, %xmm1, %xmm2
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
           +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
           |
           |    < loop carried >
           |
           +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]

       According to the analysis, throughput is limited by resource pressure
       and not by data dependencies.  The analysis observed increases in
       backend pressure during 48.07% of the simulated run. Almost all those
       pressure increase events were caused by contention on processor
       resources JFPA/JFPU0.

       The critical sequence is the most expensive sequence of instructions
       according to the simulation. It is annotated to provide extra
       information about critical register dependencies and resource
       interferences between instructions.

       Instructions from the critical sequence are expected to significantly
       impact performance.  By construction, the accuracy of this analysis is
       strongly dependent on the simulation and (as always) on the quality of
       the processor model in LLVM.

       Bottleneck analysis is currently not supported for processors with an
       in-order backend.

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and
       performance counters for the dispatch logic, the reorder buffer, the
       retire control unit, and the register file.

       Below is an example of -all-stats output generated by llvm-mca for 300
       iterations of the dot-product example discussed in the previous
       sections.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272  (44.6%)
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0


          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)


          Schedulers - number of cycles where we saw N micro opcodes issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)

          Scheduler's queue usage:
          [1] Resource name.
          [2] Average number of used buffer entries.
          [3] Maximum number of used buffer entries.
          [4] Total number of buffer entries.

           [1]            [2]        [3]        [4]
          JALU01           0          0          20
          JFPU01           17         18         18
          JLSAGU           0          0          12


          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
           2,           399  (65.4%)

          Total ROB Entries:                64
          Max Used ROB Entries:             35  ( 54.7% )
          Average Used ROB Entries per cy:  32  ( 50.0% )


          Register File statistics:
          Total number of mappings created:    900
          Max number of mappings used:         35

          *  Register File #1 -- JFpuPRF:
             Number of physical registers:     72
             Total number of mappings created: 900
             Max number of mappings used:      35

          *  Register File #2 -- JIntegerPRF:
             Number of physical registers:     64
             Total number of mappings created: 0
             Max number of mappings used:      0

       If we look at the Dynamic Dispatch Stall Cycles table, we see that the
       counter for SCHEDQ reports 272 cycles.  This counter is incremented
       every time the dispatch logic is unable to dispatch a full group
       because the scheduler's queue is full.

       Looking at the Dispatch Logic table, we see that the pipeline was only
       able to dispatch two micro opcodes 51.5% of the time.  The dispatch
       group was limited to one micro opcode 44.6% of the cycles, which
       corresponds to 272 cycles.  The dispatch statistics are displayed by
       either using the command option -all-stats or -dispatch-stats.

       The next table, Schedulers, presents a histogram displaying a count,
       representing the number of micro opcodes issued on some number of
       cycles. In this case, of the 610 simulated cycles, single opcodes were
       issued 306 times (50.2%) and there were 7 cycles where no opcodes were
       issued.

       The Scheduler's queue usage table shows the average and maximum number
       of buffer entries (i.e., scheduler queue entries) used at runtime.
       Resource JFPU01 reached its maximum (18 of 18 queue entries).  Note
       that AMD Jaguar implements three schedulers:

       • JALU01 - A scheduler for ALU instructions.

       • JFPU01 - A scheduler for floating point operations.

       • JLSAGU - A scheduler for address generation.

       The dot-product is a kernel of three floating point instructions (a
       vector multiply followed by two horizontal adds).  That explains why
       only the floating point scheduler appears to be used.

       A full scheduler queue is either caused by data dependency chains or by
       a sub-optimal usage of hardware resources.  Sometimes, resource
       pressure can be mitigated by rewriting the kernel using different
       instructions that consume different scheduler resources.  Schedulers
       with a small queue are less resilient to bottlenecks caused by the
       presence of long data dependencies.  The scheduler statistics are
       displayed by using the command option -all-stats or -scheduler-stats.

       The next table, Retire Control Unit, presents a histogram displaying a
       count, representing the number of instructions retired on some number
       of cycles.  In this case, of the 610 simulated cycles, two instructions
       were retired during the same cycle 399 times (65.4%) and there were 109
       cycles where no instructions were retired.  The retire statistics are
       displayed by using the command option -all-stats or -retire-stats.

       The last table presented is Register File statistics.  Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for
       floating-point registers (JFpuPRF) and one for integer registers
       (JIntegerPRF).  The table shows that of the 900 instructions processed,
       there were 900 mappings created.  Since this dot-product example
       utilized only floating point registers, the JFpuPRF was responsible for
       creating the 900 mappings.  However, we see that the pipeline only used
       a maximum of 35 of 72 available register slots at any given time. We
       can conclude that the floating point PRF was the only register file
       used for the example, and that it was never resource constrained.  The
       register file statistics are displayed by using the command option
       -all-stats or -register-file-stats.

       In this example, we can conclude that the IPC is mostly limited by data
       dependencies, and not by resource pressure.

   Instruction Flow
       This section describes the instruction flow through the default
       pipeline of llvm-mca, as well as the functional units involved in the
       process.

       The default pipeline implements the following sequence of stages used
       to process instructions.

       • Dispatch (Instruction is dispatched to the schedulers).

       • Issue (Instruction is issued to the processor pipelines).

       • Write Back (Instruction is executed, and results are written back).

       • Retire (Instruction is retired; writes are architecturally
         committed).

       The in-order pipeline implements the following sequence of stages:

       • InOrderIssue (Instruction is issued to the processor pipelines).

       • Retire (Instruction is retired; writes are architecturally
         committed).

       llvm-mca assumes that instructions have all been decoded and placed
       into a queue before the simulation starts. Therefore, the instruction
       fetch and decode stages are not modeled. Performance bottlenecks in the
       frontend are not diagnosed. Also, llvm-mca does not model branch
       prediction.

   Instruction Dispatch
       During the dispatch stage, instructions are picked in program order
       from a queue of already decoded instructions, and dispatched in groups
       to the simulated hardware schedulers.

       The size of a dispatch group depends on the availability of the
       simulated hardware resources.  The processor dispatch width defaults to
       the value of the IssueWidth in LLVM's scheduling model.

       An instruction can be dispatched if all of the following conditions are
       met (see the sketch after this list):

       • The size of the dispatch group is smaller than the processor's
         dispatch width.

       • There are enough entries in the reorder buffer.

       • There are enough physical registers to do register renaming.

       • The schedulers are not full.
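
       The checks above can be summarized by a small predicate. This Python
       sketch is illustrative only; the field and state names are invented
       for the example and are not llvm-mca internals:

          def can_dispatch(inst, state):
              return (state["group_size"] < state["dispatch_width"]
                      and state["rob_free"] >= inst["num_uops"]
                      and state["phys_regs_free"] >= inst["num_defs"]
                      and not state["scheduler_full"])

          state = {"group_size": 1, "dispatch_width": 2, "rob_free": 64,
                   "phys_regs_free": 72, "scheduler_full": False}
          print(can_dispatch({"num_uops": 1, "num_defs": 1}, state))  # True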

       Scheduling models can optionally specify which register files are
       available on the processor. llvm-mca uses that information to
       initialize register file descriptors.  Users can limit the number of
       physical registers that are globally available for register renaming by
       using the command option -register-file-size.  A value of zero for this
       option means unbounded. By knowing how many registers are available for
       renaming, the tool can predict dispatch stalls caused by the lack of
       physical registers.

       The number of reorder buffer entries consumed by an instruction depends
       on the number of micro-opcodes specified for that instruction by the
       target scheduling model.  The reorder buffer is responsible for
       tracking the progress of instructions that are "in-flight", and
       retiring them in program order.  The number of entries in the reorder
       buffer defaults to the value specified by field MicroOpBufferSize in
       the target scheduling model.

       Instructions that are dispatched to the schedulers consume scheduler
       buffer entries. llvm-mca queries the scheduling model to determine the
       set of buffered resources consumed by an instruction.  Buffered
       resources are treated like scheduler resources.

   Instruction Issue
       Each processor scheduler implements a buffer of instructions.  An
       instruction has to wait in the scheduler's buffer until input register
       operands become available.  Only at that point does the instruction
       become eligible for execution and may be issued (potentially
       out-of-order) for execution.  Instruction latencies are computed by
       llvm-mca with the help of the scheduling model.

       llvm-mca's scheduler is designed to simulate multiple processor
       schedulers.  The scheduler is responsible for tracking data
       dependencies, and dynamically selecting which processor resources are
       consumed by instructions.  It delegates the management of processor
       resource units and resource groups to a resource manager.  The resource
       manager is responsible for selecting resource units that are consumed
       by instructions.  For example, if an instruction consumes 1cy of a
       resource group, the resource manager selects one of the available units
       from the group; by default, the resource manager uses a round-robin
       selector to guarantee that resource usage is uniformly distributed
       between all units of a group.
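
       The default selection policy can be illustrated with a toy round-robin
       selector in Python (a sketch, not the actual resource manager):

          from itertools import cycle

          class ResourceGroup:
              def __init__(self, units):
                  self._next = cycle(units)   # rotate through the group
              def select(self):
                  return next(self._next)

          jfpu = ResourceGroup(["JFPU0", "JFPU1"])
          # Consecutive requests alternate between the two pipelines.
          print([jfpu.select() for _ in range(4)])
          # ['JFPU0', 'JFPU1', 'JFPU0', 'JFPU1']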

       llvm-mca's scheduler internally groups instructions into three sets:

       • WaitSet: a set of instructions whose operands are not ready.

       • ReadySet: a set of instructions ready to execute.

       • IssuedSet: a set of instructions executing.

       Depending on operand availability, instructions that are dispatched to
       the scheduler are either placed into the WaitSet or into the ReadySet.

       Every cycle, the scheduler checks if instructions can be moved from the
       WaitSet to the ReadySet, and if instructions from the ReadySet can be
       issued to the underlying pipelines. The algorithm prioritizes older
       instructions over younger instructions.
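
       One scheduler cycle can be sketched as follows. This Python fragment
       is a conceptual model, not the actual implementation:

          def run_cycle(wait_set, ready_set, issued_set, issue_width):
              # WaitSet -> ReadySet: operands became available.
              for inst in list(wait_set):
                  if inst["operands_ready"]:
                      wait_set.remove(inst)
                      ready_set.append(inst)
              # ReadySet -> IssuedSet: older instructions have priority.
              ready_set.sort(key=lambda i: i["age"])
              for _ in range(issue_width):
                  if ready_set:
                      issued_set.append(ready_set.pop(0))

          wait, ready, issued = [{"age": 0, "operands_ready": True}], [], []
          run_cycle(wait, ready, issued, issue_width=2)
          print(len(wait), len(ready), len(issued))   # 0 0 1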

   Write-Back and Retire Stage
       Issued instructions are moved from the ReadySet to the IssuedSet.
       There, instructions wait until they reach the write-back stage.  At
       that point, they get removed from the queue and the retire control unit
       is notified.

       When an instruction is executed, the retire control unit flags it as
       "ready to retire."

       Instructions are retired in program order.  The register file is
       notified of the retirement so that it can free the physical registers
       that were allocated for the instruction during the register renaming
       stage.
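
       In-order retirement can be sketched as follows (illustrative Python,
       not tool code): only the oldest in-flight instruction may retire, and
       retiring an instruction frees the physical registers it renamed onto:

          from collections import deque

          def retire(rob, free_regs, retire_width):
              retired = 0
              # Program order: only the head of the reorder buffer may go.
              while rob and retired < retire_width and rob[0]["executed"]:
                  inst = rob.popleft()
                  free_regs.extend(inst["regs"])   # notify the register file
                  retired += 1
              return retired

          rob = deque([{"executed": True,  "regs": ["p3"]},
                       {"executed": False, "regs": ["p7"]}])
          free = []
          # Retires one instruction; the second must wait despite the width.
          print(retire(rob, free, retire_width=2), free)   # 1 ['p3']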

   Load/Store Unit and Memory Consistency Model
       To simulate an out-of-order execution of memory operations, llvm-mca
       uses a simulated load/store unit (LSUnit) to model the speculative
       execution of loads and stores.

       Each load (or store) consumes an entry in the load (or store) queue.
       Users can specify flags -lqueue and -squeue to limit the number of
       entries in the load and store queues respectively. The queues are
       unbounded by default.

       The LSUnit implements a relaxed consistency model for memory loads and
       stores.  The rules are:

       1. A younger load is allowed to pass an older load only if there are no
          intervening stores or barriers between the two loads.

       2. A younger load is allowed to pass an older store provided that the
          load does not alias with the store.

       3. A younger store is not allowed to pass an older store.

       4. A younger store is not allowed to pass an older load.

       By default, the LSUnit optimistically assumes that loads do not alias
       with store operations (-noalias=true).  Under this assumption, younger
       loads are always allowed to pass older stores.  Essentially, the LSUnit
       does not attempt to run any alias analysis to predict when loads and
       stores do not alias with each other.

       Note that, in the case of write-combining memory, rule 3 could be
       relaxed to allow reordering of non-aliasing store operations.  That
       being said, at the moment, there is no way to further relax the memory
       model (-noalias is the only option).  Essentially, there is no option
       to specify a different memory type (e.g., write-back, write-combining,
       write-through, etc.) and consequently to weaken, or strengthen, the
       memory model.

       Other limitations are:

       • The LSUnit does not know when store-to-load forwarding may occur.

       • The LSUnit does not know anything about cache hierarchy and memory
         types.

       • The LSUnit does not know how to identify serializing operations and
         memory fences.

       The LSUnit does not attempt to predict if a load or store hits or
       misses the L1 cache.  It only knows if an instruction "MayLoad" and/or
       "MayStore."  For loads, the scheduling model provides an "optimistic"
       load-to-use latency (which usually matches the load-to-use latency for
       when there is a hit in the L1D).

       llvm-mca does not (on its own) know about serializing operations or
       memory-barrier-like instructions.  The LSUnit used to conservatively
       use an instruction's "MayLoad", "MayStore", and unmodeled side effects
       flags to determine whether an instruction should be treated as a
       memory-barrier. This was inaccurate in general and was changed so that
       now each instruction has an IsAStoreBarrier and IsALoadBarrier flag.
       These flags are mca specific and default to false for every
       instruction. If any instruction should have either of these flags set,
       it should be done within the target's InstrPostProcess class.  For an
       example, look at the X86InstrPostProcess::postProcessInstruction method
       within llvm/lib/Target/X86/MCA/X86CustomBehaviour.cpp.

       A load/store barrier consumes one entry of the load/store queue.  A
       load/store barrier enforces ordering of loads/stores.  A younger load
       cannot pass a load barrier.  Also, a younger store cannot pass a store
       barrier.  A younger load has to wait for the memory/load barrier to
       execute.  A load/store barrier is "executed" when it becomes the oldest
       entry in the load/store queue(s). That also means, by construction, all
       of the older loads/stores have been executed.

       In conclusion, the full set of load/store consistency rules is as
       follows (see the sketch after this list):

       1. A store may not pass a previous store.

       2. A store may not pass a previous load (regardless of -noalias).

       3. A store has to wait until an older store barrier is fully executed.

       4. A load may pass a previous load.

       5. A load may not pass a previous store unless -noalias is set.

       6. A load has to wait until an older load barrier is fully executed.
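
       The rules can be condensed into a single predicate. The following
       Python sketch is illustrative only (kind is one of "load", "store",
       "load_barrier", "store_barrier"), and it is conservative where the
       rules are silent:

          def may_pass(younger, older, noalias):
              if younger["kind"] == "store":
                  # Rules 1-3: a store passes neither older stores, older
                  # loads, nor older barriers.
                  return False
              # The younger operation is a load:
              if older["kind"] == "load":
                  return True                 # rule 4
              if older["kind"] == "store":
                  return noalias              # rule 5
              return False                    # rule 6: wait for the barrier

          load, store = {"kind": "load"}, {"kind": "store"}
          print(may_pass(load, store, noalias=True))    # True  (rule 5)
          print(may_pass(store, load, noalias=True))    # False (rule 2)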

   In-order Issue and Execute
       In-order processors are modeled as a single InOrderIssueStage stage.
       It bypasses the Dispatch, Scheduler, and Load/Store unit.  Instructions
       are issued as soon as their operand registers are available and
       resource requirements are met. Multiple instructions can be issued in
       one cycle according to the value of the IssueWidth parameter in LLVM's
       scheduling model.

       Once issued, an instruction is moved to the IssuedInst set until it is
       ready to retire. llvm-mca ensures that writes are committed in-order.
       However, an instruction is allowed to commit writes and retire
       out-of-order if the RetireOOO property is true for at least one of its
       writes.

   Custom Behaviour
       Because certain instructions are not expressed perfectly within their
       scheduling model, llvm-mca isn't always able to simulate them
       perfectly. Modifying the scheduling model isn't always a viable option
       though (maybe because the instruction is modeled incorrectly on purpose
       or the instruction's behaviour is quite complex).  The CustomBehaviour
       class can be used in these cases to enforce proper instruction modeling
       (often by customizing data dependencies and detecting hazards that
       llvm-mca has no way of knowing about).

       llvm-mca comes with one generic and multiple target specific
       CustomBehaviour classes. The generic class will be used if the
       -disable-cb flag is used or if a target specific CustomBehaviour class
       doesn't exist for that target. (The generic class does nothing.)
       Currently, the CustomBehaviour class is only a part of the in-order
       pipeline, but there are plans to add it to the out-of-order pipeline in
       the future.

       CustomBehaviour's main method is checkCustomHazard(), which uses the
       current instruction and a list of all instructions still executing
       within the pipeline to determine if the current instruction should be
       dispatched.  As output, the method returns an integer representing the
       number of cycles that the current instruction must stall for (this can
       be an underestimate if the exact number is unknown; a value of 0
       represents no stall).
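
       The contract can be modeled in a few lines. The real hook is a C++
       virtual method; this Python sketch, with an invented hazard rule, only
       illustrates the inputs and the returned stall count:

          def check_custom_hazard(executing, current):
              # Hypothetical rule: "B" must not start while "A" is still in
              # flight; report a 1-cycle stall (an underestimate is fine if
              # the exact penalty is unknown).
              if current == "B" and "A" in executing:
                  return 1
              return 0    # 0 means no custom hazard; dispatch may proceed

          print(check_custom_hazard(["A"], "B"))   # 1 -> stall, then re-check
          print(check_custom_hazard([],    "B"))   # 0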

       If you'd like to add a CustomBehaviour class for a target that doesn't
       already have one, refer to an existing implementation to see how to set
       it up. The classes are implemented within the target specific backend
       (for example, /llvm/lib/Target/AMDGPU/MCA/) so that they can access
       backend symbols.

   Instrument Manager
       On certain architectures, the scheduling information for certain
       instructions does not contain all of the information required to
       identify the most precise schedule class. For example, data that can
       have an impact on scheduling can be stored in CSR registers.

       One example of this is on RISCV, where values in registers such as
       vtype and vl change the scheduling behaviour of vector instructions.
       Since MCA does not keep track of the values in registers, instrument
       comments can be used to specify these values.

       InstrumentManager's main function is getSchedClassID(), which has
       access to the MCInst and all of the instruments that are active for
       that MCInst.  This function can use the instruments to override the
       schedule class of the MCInst.
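
       Conceptually, the override looks like the following Python sketch.
       The real interface is C++, and the mapping table below is invented
       for the example:

          PSEUDO_SCHED_CLASS = {
              ("vadd.vv", "M1"): "PseudoVADD_VV_M1",
              ("vadd.vv", "M8"): "PseudoVADD_VV_M8",
          }

          def get_sched_class_id(opcode, default_class, instruments):
              # Data left in scope by an LLVM-MCA-RISCV-LMUL comment.
              lmul = instruments.get("RISCV-LMUL")
              return PSEUDO_SCHED_CLASS.get((opcode, lmul), default_class)

          print(get_sched_class_id("vadd.vv", "VADD_VV",
                                   {"RISCV-LMUL": "M8"}))
          print(get_sched_class_id("vadd.vv", "VADD_VV", {}))  # default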

       On RISCV, instrument comments containing LMUL information are used by
       getSchedClassID() to map a vector instruction and the active LMUL to
       the scheduling class of the pseudo-instruction that describes that base
       instruction for the active LMUL.

   Custom Views
       llvm-mca comes with several Views such as the Timeline View and Summary
       View.  These Views are generic and can work with most (if not all)
       targets. If you wish to add a new View to llvm-mca and it does not
       require any backend functionality that is not already exposed through
       MC layer classes (MCSubtargetInfo, MCInstrInfo, etc.), please add it to
       the /tools/llvm-mca/View/ directory. However, if your new View is
       target specific AND requires unexposed backend symbols or
       functionality, you can define it in the /lib/Target/<TargetName>/MCA/
       directory.

       To enable this target specific View, you will have to use this target's
       CustomBehaviour class to override the CustomBehaviour::getViews()
       methods.  There are 3 variations of these methods based on where you
       want your View to appear in the output: getStartViews(),
       getPostInstrInfoViews(), and getEndViews(). Each of these methods
       returns a vector of Views, so you will want to return a vector
       containing all of the target specific Views for the target in question.

       Because these target specific (and backend dependent) Views require the
       CustomBehaviour::getViews() variants, these Views will not be enabled
       if the -disable-cb flag is used.

       Enabling these custom Views does not affect the non-custom (generic)
       Views.  Continue to use the usual command line arguments to enable /
       disable those Views.

AUTHOR

       Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT

       2003-2023, LLVM Project
16                                2023-08-24                       LLVM-MCA(1)