PERF-C2C(1)                       perf Manual                      PERF-C2C(1)

NAME
    perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
    perf c2c record [<options>] <command>
    perf c2c record [<options>] -- [<record command options>] <command>
    perf c2c report [<options>]

DESCRIPTION
    C2C stands for Cache To Cache.

    The perf c2c tool provides means for Shared Data C2C/HITM analysis. It
    allows you to track down cacheline contention.

    On Intel, the tool is based on the load latency and precise store
    facility events provided by Intel CPUs. On PowerPC, the tool uses random
    instruction sampling with the thresholding feature. On AMD, the tool
    uses the IBS OP PMU (due to hardware limitations, perf c2c is not
    supported on Zen3 CPUs).

    These events provide:
      - memory address of the access
      - type of the access (load and store details)
      - latency (in cycles) of the load access

    The c2c tool provides means to record this data and report back access
    details for the cachelines with the highest contention, i.e. the highest
    number of HITM accesses.

    The basic workflow with this tool follows the standard record/report
    phases: the user uses the record command to record event data and the
    report command to display it.
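
    For example, a minimal session (the workload name my_workload is only a
    placeholder) might look like this:

        # record shared data accesses while the workload runs
        $ perf c2c record ./my_workload

        # display the contended cachelines from the resulting perf.data
        $ perf c2c report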

RECORD OPTIONS
    -e, --event=
        Select the PMU event. Use perf c2c record -e list to list available
        events.

    -v, --verbose
        Be more verbose (show counter open errors, etc).

    -l, --ldlat
        Configure mem-loads latency. Supported on Intel and Arm64 processors
        only. Ignored on other archs.

    -k, --all-kernel
        Configure all used events to run in kernel space.

    -u, --all-user
        Configure all used events to run in user space.
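
    As an illustration (the latency threshold of 50 cycles and the workload
    name are placeholders, and --ldlat only takes effect on Intel and Arm64),
    load latency filtering can be combined with user-space-only sampling:

        # sample only user-space accesses, counting loads with latency >= 50 cycles
        $ perf c2c record -l 50 -u ./my_workload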

REPORT OPTIONS
    -k, --vmlinux=<file>
        vmlinux pathname

    -v, --verbose
        Be more verbose (show counter open errors, etc).

    -i, --input
        Specify the input file to process.

    -N, --node-info
        Show extra node info in report (see NODE INFO section).

    -c, --coalesce
        Specify sorting fields for single cacheline display. The following
        fields are available: tid,pid,iaddr,dso (see COALESCE).

    -g, --call-graph
        Setup callchain parameters. Please refer to the perf-report man page
        for details.

    --stdio
        Force the stdio output (see STDIO OUTPUT).

    --stats
        Display only statistic tables and force stdio mode.

    --full-symbols
        Display full length of symbols.

    --no-source
        Do not display the Source:Line column.

    --show-all
        Show all captured HITM lines, with no regard to the HITM % 0.0005
        limit.

    -f, --force
        Don't do ownership validation.

    -d, --display
        Switch between the HITM types (rmt, lcl) or the peer snooping type
        (peer) to display and sort on. Total HITMs (tot) is the default,
        except on Arm64, which uses peer mode as the default.

    --stitch-lbr
        Show the call graph with stitched LBRs, which may yield a more
        complete call graph. The perf.data file must have been obtained
        using perf c2c record --call-graph lbr. Disabled by default. In
        common cases with call stack overflows, it can recreate better call
        stacks than the default LBR call stack output. However, this
        approach is not foolproof: there can be cases where it creates
        incorrect call stacks from incorrect matches. Known limitations
        include exception handling such as setjmp/longjmp, whose calls and
        returns will not match.
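
    For instance, a record/report pair exercising the LBR-based callchains
    described above (the workload name is a placeholder, and the choice of
    local HITM sorting is illustrative) could look like this:

        # record with LBR callchains, a prerequisite for --stitch-lbr
        $ perf c2c record --call-graph lbr ./my_workload

        # report with stitched LBRs, sorted on local HITMs, in stdio mode
        $ perf c2c report -d lcl --stitch-lbr --stdio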

C2C RECORD
    The perf c2c record command sets up options related to HITM cacheline
    analysis and calls the standard perf record command.

    The following perf record options are configured by default (check the
    perf record man page for details):

        -W,-d,--phys-data,--sample-cpu

    Unless specified otherwise with the -e option, the following events are
    monitored by default on Intel:

        cpu/mem-loads,ldlat=30/P
        cpu/mem-stores/P

    the following on AMD:

        ibs_op//

    and the following on PowerPC:

        cpu/mem-loads/
        cpu/mem-stores/

    The user can pass any perf record option behind the -- mark, for example
    (to enable callchains and system wide monitoring):

        $ perf c2c record -- -g -a

    Please check the RECORD OPTIONS section for c2c-specific record options.
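
    Building on the example above, a system wide session bounded by a sleep
    (the 10 second duration is only illustrative) could be recorded with:

        # sample all CPUs with callchains for 10 seconds
        $ perf c2c record -- -g -a sleep 10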

C2C REPORT
    The perf c2c report command displays the shared data analysis. It comes
    in two display modes: stdio and tui (default).

    The report command workflow is the following:
      - sort all the data based on the cacheline address
      - store access details for each cacheline
      - sort all cachelines based on user settings
      - display data
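
    For example (the choice of local HITMs is illustrative), the type used
    for sorting and display can be selected when invoking the report:

        # sort cachelines on local HITMs instead of total HITMs
        $ perf c2c report -d lcl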

    In general, the perf c2c report output consists of 2 basic views:
      1) most expensive cachelines list
      2) offset details for each cacheline

    For each cacheline in the 1) list we display the following data (both
    stdio and TUI modes display the same fields):

      Index
      - zero-based index to identify the cacheline

      Cacheline
      - cacheline address (hex number)

      Rmt/Lcl Hitm (Display with HITM types)
      - cacheline percentage of all Remote/Local HITM accesses

      Peer Snoop (Display with peer type)
      - cacheline percentage of all peer accesses

      LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
      - count of Total/Local/Remote load HITMs

      Load Peer - Total, Local, Remote (For display with peer type)
      - count of Total/Local/Remote loads from peer cache or DRAM

      Total records
      - sum of all cacheline accesses

      Total loads
      - sum of all load accesses

      Total stores
      - sum of all store accesses

      Store Reference - L1Hit, L1Miss, N/A
        L1Hit - store accesses that hit L1
        L1Miss - store accesses that missed L1
        N/A - store accesses for which the memory level is not available

      Core Load Hit - FB, L1, L2
      - count of load hits in FB (Fill Buffer), L1 and L2 cache

      LLC Load Hit - LlcHit, LclHitm
      - count of LLC load accesses, includes LLC hits and LLC HITMs

      RMT Load Hit - RmtHit, RmtHitm
      - count of remote load accesses, includes remote hits and remote
        HITMs; on Arm Neoverse cores, RmtHit is used to account for remote
        accesses, including remote DRAM or any upward cache level in the
        remote node

      Load Dram - Lcl, Rmt
      - count of local and remote DRAM accesses
    For each offset in the 2) list we display the following data:

      HITM - Rmt, Lcl (Display with HITM types)
      - % of Remote/Local HITM accesses for the given offset within the
        cacheline

      Peer Snoop - Rmt, Lcl (Display with peer type)
      - % of Remote/Local peer accesses for the given offset within the
        cacheline

      Store Refs - L1 Hit, L1 Miss, N/A
      - % of store accesses that hit L1, missed L1, or have no available
        (N/A) memory level, for the given offset within the cacheline

      Data address - Offset
      - offset address

      Pid
      - pid of the process responsible for the accesses

      Tid
      - tid of the process responsible for the accesses

      Code address
      - code address responsible for the accesses

      cycles - rmt hitm, lcl hitm, load (Display with HITM types)
      - sum of cycles for the given accesses - Remote/Local HITM and generic
        load

      cycles - rmt peer, lcl peer, load (Display with peer type)
      - sum of cycles for the given accesses - Remote/Local peer load and
        generic load

      cpu cnt
      - number of CPUs that participated in the access

      Symbol
      - code symbol related to the 'Code address' value

      Shared Object
      - shared object name related to the 'Code address' value

      Source:Line
      - source information related to the 'Code address' value

      Node
      - nodes participating in the access (see NODE INFO section)
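
    As a small illustration, the Symbol and Source:Line columns described
    above can be tuned from the command line:

        # show untruncated symbol names and drop the Source:Line column
        $ perf c2c report --full-symbols --no-source --stdio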

NODE INFO
    The Node field displays the nodes that access the given cacheline
    offset. Its output comes in 3 flavors:
      - node IDs separated by ','
      - node IDs with stats for each ID, in the following format:
          Node{cpus %hitms %stores}  (Display with HITM types)
          Node{cpus %peers %stores}  (Display with peer type)
      - node IDs with a list of affected CPUs, in the following format:
          Node{cpu list}

    The user can switch between the above flavors with the -N option or use
    the 'n' key to switch interactively in TUI mode.
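
    For example, the extra node info can be requested directly on the
    command line:

        # show the NODE INFO columns in the stdio report
        $ perf c2c report -N --stdio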

COALESCE
    The user can specify how to sort offsets for a cacheline.

    The following fields are available and govern the final set of output
    fields for the cacheline offsets output:

      tid   - coalesced by process TIDs
      pid   - coalesced by process PIDs
      iaddr - coalesced by code address; the following fields are displayed:
              Code address, Code symbol, Shared Object, Source line
      dso   - coalesced by shared object

    By default the coalescing is set up with pid,iaddr.
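
    As an illustration, coalescing by thread and code address could be
    requested with:

        # group offset details by tid and instruction address
        $ perf c2c report -c tid,iaddr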

STDIO OUTPUT
    The stdio output displays data on the standard output.

    The following tables are displayed:

      Trace Event Information
      - overall statistics of memory accesses

      Global Shared Cache Line Event Information
      - overall statistics on shared cachelines

      Shared Data Cache Line Table
      - list of the most expensive cachelines

      Shared Cache Line Distribution Pareto
      - list of all accessed offsets for each cacheline
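
    For example, only the statistic tables can be printed with:

        # print only the summary tables, forcing stdio mode
        $ perf c2c report --stats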

TUI OUTPUT
    The TUI output provides an interactive interface to navigate through the
    cachelines list and to display offset details.

    For details please refer to the help window by pressing the '?' key.

CREDITS
    Although Don Zickus, Dick Fowles and Joe Mario worked together to get
    this implemented, we got lots of early help from Arnaldo Carvalho de
    Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

REFERENCE
    Check Joe's blog on the c2c tool for a detailed use case explanation:
    https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
    perf-record(1), perf-mem(1)


perf                              01/12/2023                       PERF-C2C(1)