PERF-C2C(1)                       perf Manual                      PERF-C2C(1)

NAME
    perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
    perf c2c record [<options>] <command>
    perf c2c record [<options>] -- [<record command options>] <command>
    perf c2c report [<options>]

DESCRIPTION
    C2C stands for Cache To Cache.

    The perf c2c tool provides means for Shared Data C2C/HITM analysis. It
    allows you to track down cacheline contention.

    On Intel, the tool is based on the load latency and precise store
    facility events provided by Intel CPUs. On PowerPC, the tool uses random
    instruction sampling with the thresholding feature. On AMD, the tool
    uses the IBS OP PMU (due to hardware limitations, perf c2c is not
    supported on Zen3 CPUs).

    These events provide:
      - memory address of the access
      - type of the access (load and store details)
      - latency (in cycles) of the load access

    The c2c tool provides means to record this data and report back access
    details for the cachelines with the highest contention, i.e. the highest
    number of HITM accesses.

    The basic workflow with this tool follows the standard record/report
    phases: the user uses the record command to record event data and the
    report command to display it.
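
    For example, a minimal session (the workload name my_workload is only a
    placeholder) might look like this:

        # record shared data accesses while the workload runs
        $ perf c2c record ./my_workload

        # display the contended cachelines from the resulting perf.data
        $ perf c2c report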

RECORD OPTIONS
    -e, --event=
        Select the PMU event. Use perf c2c record -e list to list available
        events.

    -v, --verbose
        Be more verbose (show counter open errors, etc).

    -l, --ldlat
        Configure mem-loads latency. Supported on Intel and Arm64 processors
        only. Ignored on other archs.

    -k, --all-kernel
        Configure all used events to run in kernel space.

    -u, --all-user
        Configure all used events to run in user space.
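
    As an illustration (the latency threshold of 50 cycles and the workload
    name are placeholders, and --ldlat only takes effect on Intel and Arm64),
    load latency filtering can be combined with user-space-only sampling:

        # sample only user-space accesses, counting loads with latency >= 50 cycles
        $ perf c2c record -l 50 -u ./my_workload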

REPORT OPTIONS
    -k, --vmlinux=<file>
        vmlinux pathname

    -v, --verbose
        Be more verbose (show counter open errors, etc).

    -i, --input
        Specify the input file to process.

    -N, --node-info
        Show extra node info in report (see NODE INFO section).

    -c, --coalesce
        Specify sorting fields for single cacheline display. The following
        fields are available: tid,pid,iaddr,dso (see COALESCE).

    -g, --call-graph
        Setup callchain parameters. Please refer to the perf-report man page
        for details.

    --stdio
        Force the stdio output (see STDIO OUTPUT).

    --stats
        Display only statistic tables and force stdio mode.

    --full-symbols
        Display full length of symbols.

    --no-source
        Do not display the Source:Line column.

    --show-all
        Show all captured HITM lines, with no regard to the HITM % 0.0005
        limit.

    -f, --force
        Don't do ownership validation.

    -d, --display
        Switch between the HITM types (rmt, lcl) or the peer snooping type
        (peer) to display and sort on. Total HITMs (tot) is the default,
        except on Arm64, which uses peer mode as the default.

    --stitch-lbr
        Show the call graph with stitched LBRs, which may yield a more
        complete call graph. The perf.data file must have been obtained
        using perf c2c record --call-graph lbr. Disabled by default. In
        common cases with call stack overflows, it can recreate better call
        stacks than the default LBR call stack output. However, this
        approach is not foolproof: there can be cases where it creates
        incorrect call stacks from incorrect matches. Known limitations
        include exception handling such as setjmp/longjmp, whose calls and
        returns will not match.
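
    For instance, a record/report pair exercising the LBR-based callchains
    described above (the workload name is a placeholder, and the choice of
    local HITM sorting is illustrative) could look like this:

        # record with LBR callchains, a prerequisite for --stitch-lbr
        $ perf c2c record --call-graph lbr ./my_workload

        # report with stitched LBRs, sorted on local HITMs, in stdio mode
        $ perf c2c report -d lcl --stitch-lbr --stdio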

C2C RECORD
    The perf c2c record command sets up options related to HITM cacheline
    analysis and calls the standard perf record command.

    The following perf record options are configured by default (check the
    perf record man page for details):

        -W,-d,--phys-data,--sample-cpu

    Unless specified otherwise with the -e option, the following events are
    monitored by default on Intel:

        cpu/mem-loads,ldlat=30/P
        cpu/mem-stores/P

    the following on AMD:

        ibs_op//

    and the following on PowerPC:

        cpu/mem-loads/
        cpu/mem-stores/

    The user can pass any perf record option behind the -- mark, for example
    (to enable callchains and system wide monitoring):

        $ perf c2c record -- -g -a

    Please check the RECORD OPTIONS section for c2c-specific record options.
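
    Building on the example above, a system wide session bounded by a sleep
    (the 10 second duration is only illustrative) could be recorded with:

        # sample all CPUs with callchains for 10 seconds
        $ perf c2c record -- -g -a sleep 10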

C2C REPORT
    The perf c2c report command displays the shared data analysis. It comes
    in two display modes: stdio and tui (default).

    The report command workflow is the following:
      - sort all the data based on the cacheline address
      - store access details for each cacheline
      - sort all cachelines based on user settings
      - display data
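
    For example (the choice of local HITMs is illustrative), the type used
    for sorting and display can be selected when invoking the report:

        # sort cachelines on local HITMs instead of total HITMs
        $ perf c2c report -d lcl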

    In general, the perf c2c report output consists of 2 basic views:
      1) most expensive cachelines list
      2) offset details for each cacheline

    For each cacheline in the 1) list we display the following data (both
    stdio and TUI modes display the same fields):

      Index
      - zero-based index to identify the cacheline

      Cacheline
      - cacheline address (hex number)

      Rmt/Lcl Hitm (Display with HITM types)
      - cacheline percentage of all Remote/Local HITM accesses

      Peer Snoop (Display with peer type)
      - cacheline percentage of all peer accesses

      LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
      - count of Total/Local/Remote load HITMs

      Load Peer - Total, Local, Remote (For display with peer type)
      - count of Total/Local/Remote loads from peer cache or DRAM

      Total records
      - sum of all cacheline accesses

      Total loads
      - sum of all load accesses

      Total stores
      - sum of all store accesses

      Store Reference - L1Hit, L1Miss, N/A
        L1Hit - store accesses that hit L1
        L1Miss - store accesses that missed L1
        N/A - store accesses for which the memory level is not available

      Core Load Hit - FB, L1, L2
      - count of load hits in FB (Fill Buffer), L1 and L2 cache

      LLC Load Hit - LlcHit, LclHitm
      - count of LLC load accesses, includes LLC hits and LLC HITMs

      RMT Load Hit - RmtHit, RmtHitm
      - count of remote load accesses, includes remote hits and remote
        HITMs; on Arm Neoverse cores, RmtHit is used to account for remote
        accesses, including remote DRAM or any upward cache level in the
        remote node

      Load Dram - Lcl, Rmt
      - count of local and remote DRAM accesses
    For each offset in the 2) list we display the following data:

      HITM - Rmt, Lcl (Display with HITM types)
      - % of Remote/Local HITM accesses for the given offset within the
        cacheline

      Peer Snoop - Rmt, Lcl (Display with peer type)
      - % of Remote/Local peer accesses for the given offset within the
        cacheline

      Store Refs - L1 Hit, L1 Miss, N/A
      - % of store accesses that hit L1, missed L1, or have no available
        (N/A) memory level, for the given offset within the cacheline

      Data address - Offset
      - offset address

      Pid
      - pid of the process responsible for the accesses

      Tid
      - tid of the process responsible for the accesses

      Code address
      - code address responsible for the accesses

      cycles - rmt hitm, lcl hitm, load (Display with HITM types)
      - sum of cycles for the given accesses - Remote/Local HITM and generic
        load

      cycles - rmt peer, lcl peer, load (Display with peer type)
      - sum of cycles for the given accesses - Remote/Local peer load and
        generic load

      cpu cnt
      - number of CPUs that participated in the access

      Symbol
      - code symbol related to the 'Code address' value

      Shared Object
      - shared object name related to the 'Code address' value

      Source:Line
      - source information related to the 'Code address' value

      Node
      - nodes participating in the access (see NODE INFO section)
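
    As a small illustration, the Symbol and Source:Line columns described
    above can be tuned from the command line:

        # show untruncated symbol names and drop the Source:Line column
        $ perf c2c report --full-symbols --no-source --stdio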

NODE INFO
    The Node field displays the nodes that access the given cacheline
    offset. Its output comes in 3 flavors:
      - node IDs separated by ','
      - node IDs with stats for each ID, in the following format:
          Node{cpus %hitms %stores}  (Display with HITM types)
          Node{cpus %peers %stores}  (Display with peer type)
      - node IDs with a list of affected CPUs, in the following format:
          Node{cpu list}

    The user can switch between the above flavors with the -N option or use
    the 'n' key to switch interactively in TUI mode.
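
    For example, the extra node info can be requested directly on the
    command line:

        # show the NODE INFO columns in the stdio report
        $ perf c2c report -N --stdio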

COALESCE
    The user can specify how to sort offsets for a cacheline.

    The following fields are available and govern the final set of output
    fields for the cacheline offsets output:

      tid   - coalesced by process TIDs
      pid   - coalesced by process PIDs
      iaddr - coalesced by code address; the following fields are displayed:
              Code address, Code symbol, Shared Object, Source line
      dso   - coalesced by shared object

    By default the coalescing is set up with pid,iaddr.
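
    As an illustration, coalescing by thread and code address could be
    requested with:

        # group offset details by tid and instruction address
        $ perf c2c report -c tid,iaddr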

STDIO OUTPUT
    The stdio output displays data on the standard output.

    The following tables are displayed:

      Trace Event Information
      - overall statistics of memory accesses

      Global Shared Cache Line Event Information
      - overall statistics on shared cachelines

      Shared Data Cache Line Table
      - list of the most expensive cachelines

      Shared Cache Line Distribution Pareto
      - list of all accessed offsets for each cacheline
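
    For example, only the statistic tables can be printed with:

        # print only the summary tables, forcing stdio mode
        $ perf c2c report --stats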

TUI OUTPUT
    The TUI output provides an interactive interface to navigate through the
    cachelines list and to display offset details.

    For details please refer to the help window by pressing the '?' key.

CREDITS
    Although Don Zickus, Dick Fowles and Joe Mario worked together to get
    this implemented, we got lots of early help from Arnaldo Carvalho de
    Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

REFERENCE
    Check Joe's blog on the c2c tool for a detailed use case explanation:
    https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
    perf-record(1), perf-mem(1)


perf                              01/12/2023                       PERF-C2C(1)