PERF-C2C(1)                       perf Manual                      PERF-C2C(1)

NAME
       perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
       perf c2c record [<options>] <command>
       perf c2c record [<options>] -- [<record command options>] <command>
       perf c2c report [<options>]

DESCRIPTION
       C2C stands for Cache To Cache.

       The perf c2c tool provides means for Shared Data C2C/HITM analysis. It
       allows you to track down cacheline contention.

       On Intel, the tool is based on the load latency and precise store
       facility events provided by Intel CPUs. On PowerPC, the tool uses
       random instruction sampling with the thresholding feature. On AMD, the
       tool uses the IBS OP PMU (due to hardware limitations, perf c2c is not
       supported on Zen3 CPUs). On Arm64 it uses SPE to sample load and store
       operations, therefore hardware and kernel support is required. See
       perf-arm-spe(1) for a setup guide. Due to the statistical nature of
       Arm SPE sampling, not every memory operation will be sampled.

       These events provide:
         - memory address of the access
         - type of the access (load and store details)
         - latency (in cycles) of the load access

       The c2c tool provides means to record this data and report back access
       details for the cachelines with the highest contention - the highest
       number of HITM accesses.

       The basic workflow with this tool follows the standard record/report
       phases: the user uses the record command to record the event data and
       the report command to display it.
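
       For example, a minimal session could look like the following (the
       ./workload binary here is only a placeholder for the program under
       investigation):

           $ perf c2c record ./workload
           $ perf c2c report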

RECORD OPTIONS
       -e, --event=
           Select the PMU event. Use perf c2c record -e list to list
           available events.
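
           For example, to list the events available for c2c sampling:

               $ perf c2c record -e list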

       -v, --verbose
           Be more verbose (show counter open errors, etc).

       -l, --ldlat
           Configure mem-loads latency. Supported on Intel and Arm64
           processors only. Ignored on other architectures.
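
           For example, to raise the load-latency sampling threshold to 50
           cycles (the value 50 and the ./workload binary are purely
           illustrative):

               $ perf c2c record -l 50 ./workload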

       -k, --all-kernel
           Configure all used events to run in kernel space.

       -u, --all-user
           Configure all used events to run in user space.

REPORT OPTIONS
       -k, --vmlinux=<file>
           vmlinux pathname

       -v, --verbose
           Be more verbose (show counter open errors, etc).

       -i, --input
           Specify the input file to process.

       -N, --node-info
           Show extra node info in report (see NODE INFO section).

       -c, --coalesce
           Specify sorting fields for single cacheline display. The following
           fields are available: tid,pid,iaddr,dso (see COALESCE).

       -g, --call-graph
           Setup callchain parameters. Please refer to the perf-report man
           page for details.

       --stdio
           Force the stdio output (see STDIO OUTPUT).

       --stats
           Display only statistic tables and force stdio mode.

       --full-symbols
           Display full length of symbols.

       --no-source
           Do not display the Source:Line column.

       --show-all
           Show all captured HITM lines, with no regard to the HITM % 0.0005
           limit.

       -f, --force
           Don't do ownership validation.

       -d, --display
           Switch to the HITM type (rmt, lcl) or peer snooping type (peer) to
           display and sort on. Total HITMs (tot) is the default, except on
           Arm64, which uses peer mode as the default.
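
           For example, to sort on local HITM accesses, or on peer accesses:

               $ perf c2c report -d lcl
               $ perf c2c report -d peer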

       --stitch-lbr
           Show the callgraph with stitched LBRs, which may produce a more
           complete callgraph. The perf.data file must have been obtained
           using perf c2c record --call-graph lbr. Disabled by default. In
           common cases with call stack overflows, it can recreate better
           call stacks than the default lbr call stack output. But this
           approach is not foolproof. There can be cases where it creates
           incorrect call stacks from incorrect matches. Known limitations
           include exception handling, such as setjmp/longjmp, where calls
           and returns will not match.
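
           For example (again with ./workload only as a placeholder), the LBR
           call stacks are recorded first and stitched at report time:

               $ perf c2c record --call-graph lbr ./workload
               $ perf c2c report --stitch-lbr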

       --double-cl
           Group the detection of shared cacheline events into double
           cacheline granularity. Some architectures have an Adjacent
           Cacheline Prefetch feature, which causes cacheline sharing to
           behave like the cacheline size is doubled.
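
           For example, to analyze contention at double cacheline granularity
           (e.g. 128 bytes on systems with 64-byte cachelines):

               $ perf c2c report --double-cl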

C2C RECORD
       The perf c2c record command sets up options related to HITM cacheline
       analysis and calls the standard perf record command.

       The following perf record options are configured by default (check the
       perf record man page for details):

           -W,-d,--phys-data,--sample-cpu

       Unless specified otherwise with the -e option, the following events
       are monitored by default on Intel:

           cpu/mem-loads,ldlat=30/P
           cpu/mem-stores/P

       the following on AMD:

           ibs_op//

       and the following on PowerPC:

           cpu/mem-loads/
           cpu/mem-stores/

       The user can pass any perf record option behind the -- mark, like (to
       enable callchains and system wide monitoring):

           $ perf c2c record -- -g -a

       Please check the RECORD OPTIONS section for specific c2c record
       options.
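
       Options specific to c2c record can be combined with pass-through perf
       record options and a workload, for example (with ./workload again only
       a placeholder):

           $ perf c2c record -l 50 -- -g ./workload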

C2C REPORT
       The perf c2c report command displays the shared data analysis. It
       comes in two display modes: stdio and tui (default).
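
       The report processes perf.data by default; a different input file can
       be selected with the -i option, for example (the c2c.data file name is
       arbitrary):

           $ perf c2c report -i c2c.data --stdio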

       The report command workflow is as follows:
         - sort all the data based on the cacheline address
         - store access details for each cacheline
         - sort all cachelines based on user settings
         - display data

       In general the perf c2c report output consists of 2 basic views:
         1) most expensive cachelines list
         2) offset details for each cacheline

       For each cacheline in the 1) list we display the following data (both
       stdio and TUI modes display the same fields):

       Index
           - zero based index to identify the cacheline

       Cacheline
           - cacheline address (hex number)

       Rmt/Lcl Hitm (Display with HITM types)
           - cacheline percentage of all Remote/Local HITM accesses

       Peer Snoop (Display with peer type)
           - cacheline percentage of all peer accesses

       LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
           - count of Total/Local/Remote load HITMs

       Load Peer - Total, Local, Remote (For display with peer type)
           - count of Total/Local/Remote loads from peer cache or DRAM

       Total records
           - sum of all cacheline accesses

       Total loads
           - sum of all load accesses

       Total stores
           - sum of all store accesses

       Store Reference - L1Hit, L1Miss, N/A
           L1Hit - store accesses that hit L1
           L1Miss - store accesses that missed L1
           N/A - store accesses for which the memory level is not available

       Core Load Hit - FB, L1, L2
           - count of load hits in FB (Fill Buffer), L1 and L2 cache

       LLC Load Hit - LlcHit, LclHitm
           - count of LLC load accesses, includes LLC hits and LLC HITMs

       RMT Load Hit - RmtHit, RmtHitm
           - count of remote load accesses, includes remote hits and remote
             HITMs; on Arm Neoverse cores, RmtHit is used to account for
             remote accesses, including remote DRAM or any upward cache level
             in the remote node

       Load Dram - Lcl, Rmt
           - count of local and remote DRAM accesses

       For each offset in the 2) list we display the following data:

       HITM - Rmt, Lcl (Display with HITM types)
           - % of Remote/Local HITM accesses for given offset within
             cacheline

       Peer Snoop - Rmt, Lcl (Display with peer type)
           - % of Remote/Local peer accesses for given offset within
             cacheline

       Store Refs - L1 Hit, L1 Miss, N/A
           - % of store accesses that hit L1, missed L1 and N/A (not
             available) memory level for given offset within cacheline

       Data address - Offset
           - offset address

       Pid
           - pid of the process responsible for the accesses

       Tid
           - tid of the process responsible for the accesses

       Code address
           - code address responsible for the accesses

       cycles - rmt hitm, lcl hitm, load (Display with HITM types)
           - sum of cycles for given accesses - Remote/Local HITM and generic
             load

       cycles - rmt peer, lcl peer, load (Display with peer type)
           - sum of cycles for given accesses - Remote/Local peer load and
             generic load

       cpu cnt
           - number of CPUs that participated in the access

       Symbol
           - code symbol related to the 'Code address' value

       Shared Object
           - shared object name related to the 'Code address' value

       Source:Line
           - source information related to the 'Code address' value

       Node
           - nodes participating in the access (see NODE INFO section)

NODE INFO
       The Node field displays the nodes that access the given cacheline
       offset. Its output comes in 3 flavors:
         - node IDs separated by ','
         - node IDs with stats for each ID, in the following format:
             Node{cpus %hitms %stores} (Display with HITM types)
             Node{cpus %peers %stores} (Display with peer type)
         - node IDs with the list of affected CPUs in the following format:
             Node{cpu list}

       The user can switch between the above flavors with the -N option or
       use the 'n' key to switch interactively in TUI mode.
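
       For example, to include the extra node information in the stdio
       report:

           $ perf c2c report -N --stdio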

COALESCE
       The user can specify how to sort offsets for a cacheline.

       The following fields are available and govern the final set of output
       fields for the cacheline offsets output:

           tid   - coalesced by process TIDs
           pid   - coalesced by process PIDs
           iaddr - coalesced by code address, following fields are displayed:
                     Code address, Code symbol, Shared Object, Source line
           dso   - coalesced by shared object

       By default the coalescing is set up with pid,iaddr.
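
       For example, to coalesce the cacheline offsets by shared object and
       code address instead of the default:

           $ perf c2c report -c dso,iaddr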

STDIO OUTPUT
       The stdio output displays data on standard output.

       The following tables are displayed:

       Trace Event Information
           - overall statistics of memory accesses

       Global Shared Cache Line Event Information
           - overall statistics on shared cachelines

       Shared Data Cache Line Table
           - list of most expensive cachelines

       Shared Cache Line Distribution Pareto
           - list of all accessed offsets for each cacheline
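
       For example, to display only the statistic tables (this also forces
       the stdio mode):

           $ perf c2c report --stats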

TUI OUTPUT
       The TUI output provides an interactive interface to navigate through
       the cacheline list and to display offset details.

       For details please refer to the help window by pressing the '?' key.

CREDITS
       Although Don Zickus, Dick Fowles and Joe Mario worked together to get
       this implemented, we got lots of early help from Arnaldo Carvalho de
       Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
       Check Joe's blog on the c2c tool for a detailed use case explanation:
       https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
       perf-record(1), perf-mem(1), perf-arm-spe(1)

perf                               11/28/2023                      PERF-C2C(1)