PERF-C2C(1)                       perf Manual                      PERF-C2C(1)

NAME

       perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS

       perf c2c record [<options>] <command>
       perf c2c record [<options>] -- [<record command options>] <command>
       perf c2c report [<options>]

DESCRIPTION

       C2C stands for Cache To Cache.

       The perf c2c tool provides means for Shared Data C2C/HITM analysis. It
       allows you to track down cacheline contention.

       On Intel, the tool is based on the load latency and precise store
       facility events provided by Intel CPUs. On PowerPC, the tool uses
       random instruction sampling with the thresholding feature. On AMD, the
       tool uses the IBS OP PMU (due to hardware limitations, perf c2c is not
       supported on Zen3 CPUs). On Arm64 it uses SPE to sample load and store
       operations, therefore hardware and kernel support is required. See
       perf-arm-spe(1) for a setup guide. Due to the statistical nature of
       Arm SPE sampling, not every memory operation will be sampled.

       These events provide:

        - memory address of the access
        - type of the access (load and store details)
        - latency (in cycles) of the load access

       The c2c tool provides means to record this data and report back access
       details for the cachelines with the highest contention, i.e. the
       highest number of HITM accesses.

       The basic workflow with this tool follows the standard record/report
       phases. The user runs the record command to collect event data and the
       report command to display it.
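
       For example, assuming a contended multi-threaded workload started as
       ./my_workload (a placeholder name), a minimal session looks like:

           $ perf c2c record ./my_workload
           $ perf c2c report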

RECORD OPTIONS

       -e, --event=
           Select the PMU event. Use perf c2c record -e list to list
           available events.

       -v, --verbose
           Be more verbose (show counter open errors, etc).

       -l, --ldlat
           Configure mem-loads latency. Supported on Intel and Arm64
           processors only. Ignored on other archs.

       -k, --all-kernel
           Configure all used events to run in kernel space.

       -u, --all-user
           Configure all used events to run in user space.
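
       For example, to restrict sampling to user space and raise the load
       latency threshold (the threshold value and workload name are only
       illustrative):

           $ perf c2c record -u --ldlat 50 ./my_workload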

REPORT OPTIONS

       -k, --vmlinux=<file>
           vmlinux pathname

       -v, --verbose
           Be more verbose (show counter open errors, etc).

       -i, --input
           Specify the input file to process.

       -N, --node-info
           Show extra node info in report (see NODE INFO section)

       -c, --coalesce
           Specify sorting fields for single cacheline display. The following
           fields are available: tid,pid,iaddr,dso (see COALESCE)

       -g, --call-graph
           Set up callchain parameters. Please refer to the perf-report man
           page for details.

       --stdio
           Force the stdio output (see STDIO OUTPUT)

       --stats
           Display only statistic tables and force stdio mode.

       --full-symbols
           Display full length of symbols.

       --no-source
           Do not display Source:Line column.

       --show-all
           Show all captured HITM lines, ignoring the HITM % 0.0005 limit.

       -f, --force
           Don’t do ownership validation.

       -d, --display
           Switch to the HITM type (rmt, lcl) or peer snooping type (peer) to
           display and sort on. Total HITMs (tot) is the default, except on
           Arm64, which uses peer mode as the default.

       --stitch-lbr
           Show the callgraph with stitched LBRs, which may produce a more
           complete callgraph. The perf.data file must have been obtained
           using perf c2c record --call-graph lbr. Disabled by default. In
           common cases with call stack overflows, it can recreate better
           call stacks than the default lbr call stack output. But this
           approach is not foolproof. There can be cases where it creates
           incorrect call stacks from incorrect matches. Known limitations
           include exception handling such as setjmp/longjmp, where
           calls/returns will not match.

       --double-cl
           Group the detection of shared cacheline events into double
           cacheline granularity. Some architectures have an Adjacent
           Cacheline Prefetch feature, which causes cacheline sharing to
           behave like the cacheline size is doubled.
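
       For example, to produce a text report sorted on local HITMs and
       coalesced by TID and code address (the option values are only
       illustrative):

           $ perf c2c report --stdio -d lcl -c tid,iaddr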

C2C RECORD

       The perf c2c record command sets up options related to HITM cacheline
       analysis and calls the standard perf record command.

       The following perf record options are configured by default (check the
       perf record man page for details):

           -W,-d,--phys-data,--sample-cpu

       Unless specified otherwise with the -e option, the following events
       are monitored by default on Intel:

           cpu/mem-loads,ldlat=30/P
           cpu/mem-stores/P

       the following on AMD:

           ibs_op//

       and the following on PowerPC:

           cpu/mem-loads/
           cpu/mem-stores/

       The user can pass any perf record option behind the -- mark, for
       example (to enable callchains and system wide monitoring):

           $ perf c2c record -- -g -a

       Please check the RECORD OPTIONS section for specific c2c record
       options.
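
       As a further sketch, system wide recording with callchains for a fixed
       ten second window could look like this (the duration is arbitrary):

           $ perf c2c record -- -g -a sleep 10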

C2C REPORT

       The perf c2c report command displays shared data analysis. It comes in
       two display modes: stdio and tui (default).
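
       For example, to bypass the TUI and print the report to standard
       output:

           $ perf c2c report --stdio
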
       The report command workflow is as follows:

        - sort all the data based on the cacheline address
        - store access details for each cacheline
        - sort all cachelines based on user settings
        - display data

       In general the perf c2c report output consists of 2 basic views:
         1) the most expensive cachelines list
         2) offset details for each cacheline

       For each cacheline in the 1) list we display the following data (both
       stdio and TUI modes output the same fields):

           Index
           - zero based index to identify the cacheline

           Cacheline
           - cacheline address (hex number)

           Rmt/Lcl Hitm (Display with HITM types)
           - cacheline percentage of all Remote/Local HITM accesses

           Peer Snoop (Display with peer type)
           - cacheline percentage of all peer accesses

           LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
           - count of Total/Local/Remote load HITMs

           Load Peer - Total, Local, Remote (For display with peer type)
           - count of Total/Local/Remote loads from peer cache or DRAM

           Total records
           - sum of all cacheline accesses

           Total loads
           - sum of all load accesses

           Total stores
           - sum of all store accesses

           Store Reference - L1Hit, L1Miss, N/A
             L1Hit - store accesses that hit L1
             L1Miss - store accesses that missed L1
             N/A - store accesses where the memory level is not available

           Core Load Hit - FB, L1, L2
           - count of load hits in FB (Fill Buffer), L1 and L2 cache

           LLC Load Hit - LlcHit, LclHitm
           - count of LLC load accesses, includes LLC hits and LLC HITMs

           RMT Load Hit - RmtHit, RmtHitm
           - count of remote load accesses, includes remote hits and remote
             HITMs; on Arm Neoverse cores, RmtHit is used to account for
             remote accesses, including remote DRAM or any upward cache level
             in the remote node

           Load Dram - Lcl, Rmt
           - count of local and remote DRAM accesses

       For each offset in the 2) list we display the following data:

           HITM - Rmt, Lcl (Display with HITM types)
           - % of Remote/Local HITM accesses for given offset within cacheline

           Peer Snoop - Rmt, Lcl (Display with peer type)
           - % of Remote/Local peer accesses for given offset within cacheline

           Store Refs - L1 Hit, L1 Miss, N/A
           - % of store accesses that hit L1, missed L1 and N/A (not
             available) memory level for given offset within cacheline

           Data address - Offset
           - offset address

           Pid
           - pid of the process responsible for the accesses

           Tid
           - tid of the process responsible for the accesses

           Code address
           - code address responsible for the accesses

           cycles - rmt hitm, lcl hitm, load (Display with HITM types)
             - sum of cycles for given accesses - Remote/Local HITM and generic load

           cycles - rmt peer, lcl peer, load (Display with peer type)
             - sum of cycles for given accesses - Remote/Local peer load and generic load

           cpu cnt
             - number of cpus that participated in the access

           Symbol
             - code symbol related to the 'Code address' value

           Shared Object
             - shared object name related to the 'Code address' value

           Source:Line
             - source information related to the 'Code address' value

           Node
             - nodes participating in the access (see NODE INFO section)

NODE INFO

       The Node field displays the nodes that access the given cacheline
       offset. Its output comes in 3 flavors:

        - node IDs separated by ','
        - node IDs with stats for each ID, in the following format:
             Node{cpus %hitms %stores}    (Display with HITM types)
             Node{cpus %peers %stores}    (Display with peer type)
        - node IDs with the list of affected CPUs in the following format:
             Node{cpu list}

       The user can switch between the above flavors with the -N option or
       use the 'n' key to switch interactively in TUI mode.
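
       For example, to show the extra node information directly in a text
       report:

           $ perf c2c report -N --stdio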

COALESCE

       The user can specify how to sort offsets for a cacheline.

       The following fields are available and govern the final set of output
       fields for the cacheline offsets output:

           tid   - coalesced by process TIDs
           pid   - coalesced by process PIDs
           iaddr - coalesced by code address, the following fields are displayed:
                      Code address, Code symbol, Shared Object, Source line
           dso   - coalesced by shared object

       By default the coalescing is set up with pid,iaddr.
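
       For example, to coalesce offsets by thread and shared object instead
       of the default pid,iaddr:

           $ perf c2c report -c tid,dso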

STDIO OUTPUT

       The stdio output displays data on standard output.

       The following tables are displayed:

           Trace Event Information
           - overall statistics of memory accesses

           Global Shared Cache Line Event Information
           - overall statistics on shared cachelines

           Shared Data Cache Line Table
           - list of most expensive cachelines

           Shared Cache Line Distribution Pareto
           - list of all accessed offsets for each cacheline
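
       For example, to limit the output to the statistic tables:

           $ perf c2c report --stats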

TUI OUTPUT

       The TUI output provides an interactive interface to navigate through
       the cachelines list and to display offset details.

       For details please refer to the help window by pressing the '?' key.

CREDITS

       Although Don Zickus, Dick Fowles and Joe Mario worked together to get
       this implemented, we got lots of early help from Arnaldo Carvalho de
       Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG

       Check Joe’s blog on the c2c tool for a detailed use case explanation:
       https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO

       perf-record(1), perf-mem(1), perf-arm-spe(1)

perf                              11/28/2023                       PERF-C2C(1)