PERF-C2C(1)                       perf Manual                      PERF-C2C(1)

NAME
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
perf c2c record [<options>] <command>
perf c2c record [<options>] -- [<record command options>] <command>
perf c2c report [<options>]

DESCRIPTION
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It
allows you to track down cacheline contention.

On x86, the tool is based on the load latency and precise store facility
events provided by Intel CPUs. On PowerPC, the tool uses random
instruction sampling with the thresholding feature.

These events provide:
- memory address of the access
- type of the access (load and store details)
- latency (in cycles) of the load access

The c2c tool provides means to record this data and report back access
details for the cachelines with the highest contention, that is, the
highest number of HITM accesses (loads that hit a cacheline held in
modified state by another cache).

The basic workflow with this tool follows the standard record/report
phases. Use the record command to record event data and the report
command to display it.

RECORD OPTIONS
-e, --event=
    Select the PMU event. Use perf c2c record -e list to list available
    events.

-v, --verbose
    Be more verbose (show counter open errors, etc).

-l, --ldlat
    Configure the mem-loads latency threshold, in cycles. (x86 only)

-k, --all-kernel
    Configure all used events to run in kernel space.

-u, --all-user
    Configure all used events to run in user space.

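As an illustration of how the record options combine, the following sketch wraps a recording that sets the load latency threshold and samples user space only, with callchains requested from perf record after the -- marker. The function name c2c_record_user and its argument layout are hypothetical; only the options themselves come from this page.

```shell
# Hypothetical wrapper (not part of perf): record user-space accesses
# of a command with callchains enabled.
#   $1    load latency threshold in cycles (-l/--ldlat, x86 only)
#   $2..  command to profile, with its arguments
c2c_record_user() {
    ldlat=$1
    shift
    # Options before "--" belong to perf c2c record; "-g" after it is
    # handed to perf record, followed by the command to run.
    perf c2c record --ldlat "$ldlat" --all-user -- -g "$@"
}
```

For example, c2c_record_user 50 ./myapp would run perf c2c record --ldlat 50 --all-user -- -g ./myapp, where ./myapp stands for whatever program is being analyzed.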
REPORT OPTIONS
-k, --vmlinux=<file>
    vmlinux pathname

-v, --verbose
    Be more verbose (show counter open errors, etc).

-i, --input
    Specify the input file to process.

-N, --node-info
    Show extra node info in report (see NODE INFO section).

-c, --coalesce
    Specify sorting fields for single cacheline display. The following
    fields are available: tid,pid,iaddr,dso (see COALESCE).

-g, --call-graph
    Set up callchain parameters. Please refer to the perf-report man
    page for details.

--stdio
    Force the stdio output (see STDIO OUTPUT).

--stats
    Display only statistic tables and force stdio mode.

--full-symbols
    Display full length of symbols.

--no-source
    Do not display Source:Line column.

--show-all
    Show all captured HITM lines, regardless of the HITM % 0.0005
    limit.

-f, --force
    Don’t do ownership validation.

-d, --display
    Select the HITM type (rmt, lcl) to display and sort on. Total HITMs
    are used by default.

--stitch-lbr
    Show callgraph with stitched LBRs, which may produce a more complete
    callgraph. The perf.data file must have been obtained using perf
    c2c record --call-graph lbr. Disabled by default. In common cases
    with call stack overflows, it can recreate better call stacks than
    the default lbr call stack output. But this approach is not
    foolproof. There can be cases where it creates incorrect call stacks
    from incorrect matches. Known limitations include exception
    handling, such as setjmp/longjmp, where calls and returns will not
    match.

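A minimal sketch of a report invocation that combines several of the options above: it reads a given data file, sorts on local HITMs, adds node info and forces stdio output with full symbol names. The function name c2c_report_local is hypothetical, and the choice of options is only one plausible combination.

```shell
# Hypothetical wrapper (not part of perf): text report sorted on local
# HITMs, with extra node info and untruncated symbol names.
#   $1  input file produced by perf c2c record
c2c_report_local() {
    perf c2c report -i "$1" -d lcl -N --stdio --full-symbols
}
```

For example, c2c_report_local perf.data would print the report for perf.data to standard output, so it can be redirected to a file or piped to a pager.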
C2C RECORD
The perf c2c record command sets up options related to HITM cacheline
analysis and calls the standard perf record command.

The following perf record options are configured by default (check the
perf record man page for details):

    -W,-d,--phys-data,--sample-cpu

Unless specified otherwise with the -e option, the following events are
monitored by default on x86:

    cpu/mem-loads,ldlat=30/P
    cpu/mem-stores/P

and the following on PowerPC:

    cpu/mem-loads/
    cpu/mem-stores/

You can pass any perf record option after the -- marker, for example
(to enable callchains and system-wide monitoring):

    $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.

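To make the defaults concrete, the sketch below spells out the x86 defaults listed in this section as a plain perf record invocation. It is an approximation assembled only from the options and events named above; the command line that perf c2c record actually builds may differ between perf versions and CPUs. The function name c2c_record_expanded is hypothetical.

```shell
# Approximation only (x86 event names): the default perf record options
# and events listed in this section, written out by hand.
#   $@  command to profile, with its arguments
c2c_record_expanded() {
    perf record -W -d --phys-data --sample-cpu \
        -e 'cpu/mem-loads,ldlat=30/P' \
        -e 'cpu/mem-stores/P' \
        "$@"
}
```

This is meant as a reading aid for the defaults, not as a replacement for perf c2c record.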
C2C REPORT
The perf c2c report command displays shared data analysis. It comes in
two display modes: stdio and tui (default).

The report command workflow is as follows:
- sort all the data based on the cacheline address
- store access details for each cacheline
- sort all cachelines based on user settings
- display data

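The first workflow step groups samples by cacheline address. As a rough illustration of how a sampled data address relates to a cacheline address and an offset within that line, here is a sketch that assumes 64-byte cachelines. The line size and the function name cacheline_of are assumptions made for this example and are not taken from perf.

```shell
# Assumption: 64-byte cachelines; adjust LINE_SIZE for other systems.
LINE_SIZE=64

# Print the cacheline base address and the offset within the line for
# a data address given in hex (0x...) or decimal.
cacheline_of() {
    addr=$(( $1 ))
    base=$(( addr & ~(LINE_SIZE - 1) ))   # clear the low offset bits
    off=$(( addr & (LINE_SIZE - 1) ))     # keep only the offset bits
    printf '0x%x 0x%x\n' "$base" "$off"
}
```

Under that assumption, cacheline_of 0x7f3a1c58 prints 0x7f3a1c40 0x18: two accesses share a cacheline when their base addresses are equal, even if their offsets differ.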
In general, perf c2c report output consists of 2 basic views:
1) most expensive cachelines list
2) offsets details for each cacheline

For each cacheline in the 1) list we display the following data (both
stdio and TUI modes follow the same fields output):

Index
    - zero based index to identify the cacheline

Cacheline
    - cacheline address (hex number)

Rmt/Lcl Hitm
    - cacheline percentage of all Remote/Local HITM accesses

LLC Load Hitm - Total, LclHitm, RmtHitm
    - count of Total/Local/Remote load HITMs

Total records
    - sum of all accesses to the cacheline

Total loads
    - sum of all load accesses

Total stores
    - sum of all store accesses

Store Reference - L1Hit, L1Miss
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1

Core Load Hit - FB, L1, L2
    - count of load hits in FB (Fill Buffer), L1 and L2 cache

LLC Load Hit - LlcHit, LclHitm
    - count of LLC load accesses, includes LLC hits and LLC HITMs

RMT Load Hit - RmtHit, RmtHitm
    - count of remote load accesses, includes remote hits and remote HITMs

Load Dram - Lcl, Rmt
    - count of local and remote DRAM accesses

For each offset in the 2) list we display the following data:

HITM - Rmt, Lcl
    - % of Remote/Local HITM accesses for given offset within cacheline

Store Refs - L1 Hit, L1 Miss
    - % of store accesses that hit/missed L1 for given offset within cacheline

Data address - Offset
    - offset address

Pid
    - pid of the process responsible for the accesses

Tid
    - tid of the thread responsible for the accesses

Code address
    - code address responsible for the accesses

cycles - rmt hitm, lcl hitm, load
    - sum of cycles for given accesses - Remote/Local HITM and generic load

cpu cnt
    - number of cpus that participated in the access

Symbol
    - code symbol related to the 'Code address' value

Shared Object
    - shared object name related to the 'Code address' value

Source:Line
    - source information related to the 'Code address' value

Node
    - nodes participating in the access (see NODE INFO section)

NODE INFO
The Node field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
- node IDs separated by ,
- node IDs with stats for each ID, in the following format:
  Node{cpus %hitms %stores}
- node IDs with list of affected CPUs, in the following format:
  Node{cpu list}

You can switch between the above flavors with the -N option, or use
the n key to switch interactively in TUI mode.

COALESCE
You can specify how to sort offsets for a cacheline.

The following fields are available and govern the final set of output
fields for the cacheline offsets output:

    tid   - coalesced by process TIDs
    pid   - coalesced by process PIDs
    iaddr - coalesced by code address, following fields are displayed:
            Code address, Code symbol, Shared Object, Source line
    dso   - coalesced by shared object

By default the coalescing is set up with pid,iaddr.

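As a small sketch of overriding the default coalescing, the wrapper below passes a caller-supplied field list to -c and forces stdio output. The function name c2c_report_coalesced is hypothetical; the field names are the ones listed above.

```shell
# Hypothetical wrapper (not part of perf): text report with a custom
# coalescing of cacheline offsets.
#   $1  input file produced by perf c2c record
#   $2  comma-separated coalesce fields, e.g. tid,iaddr
c2c_report_coalesced() {
    perf c2c report -i "$1" -c "$2" --stdio
}
```

For example, c2c_report_coalesced perf.data tid,iaddr groups the offset details per thread and code address instead of the default pid,iaddr.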
STDIO OUTPUT
The stdio output displays data on standard output.

The following tables are displayed:

Trace Event Information
    - overall statistics of memory accesses

Global Shared Cache Line Event Information
    - overall statistics on shared cachelines

Shared Data Cache Line Table
    - list of most expensive cachelines

Shared Cache Line Distribution Pareto
    - list of all accessed offsets for each cacheline

TUI OUTPUT
The TUI output provides an interactive interface to navigate through
the cachelines list and to display offset details.

For details please refer to the help window by pressing the ? key.

CREDITS
Although Don Zickus, Dick Fowles and Joe Mario worked together to get
this implemented, we got lots of early help from Arnaldo Carvalho de
Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
Check Joe’s blog on the c2c tool for a detailed use case explanation:
https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
perf-record(1), perf-mem(1)

perf                              11/22/2021                       PERF-C2C(1)