PERF-C2C(1) perf Manual PERF-C2C(1)

NAME
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
perf c2c record [<options>] <command>
perf c2c record [<options>] -- [<record command options>] <command>
perf c2c report [<options>]

DESCRIPTION
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It
allows you to track down cacheline contention.

On x86, the tool is based on the load latency and precise store facility
events provided by Intel CPUs. On PowerPC, the tool uses random
instruction sampling with the thresholding feature.

These events provide:
- memory address of the access
- type of the access (load and store details)
- latency (in cycles) of the load access

The c2c tool provides means to record this data and report back access
details for the cachelines with the highest contention, that is, the
highest number of HITM accesses (loads that hit a cacheline modified in
another cache).

The basic workflow with this tool follows the standard record/report
phases. Use the record command to record event data and the report
command to display it.

RECORD OPTIONS
-e, --event=
    Select the PMU event. Use perf mem record -e list to list available
    events.

-v, --verbose
    Be more verbose (show counter open errors, etc).

-l, --ldlat
    Configure mem-loads latency. (x86 only)

-k, --all-kernel
    Configure all used events to run in kernel space.

-u, --all-user
    Configure all used events to run in user space.

REPORT OPTIONS
-k, --vmlinux=<file>
    vmlinux pathname

-v, --verbose
    Be more verbose (show counter open errors, etc).

-i, --input
    Specify the input file to process.

-N, --node-info
    Show extra node info in report (see NODE INFO section)

-c, --coalesce
    Specify sorting fields for the single cacheline display. The
    following fields are available: tid,pid,iaddr,dso (see COALESCE)

-g, --call-graph
    Set up callchain parameters. Please refer to the perf-report man
    page for details.

--stdio
    Force the stdio output (see STDIO OUTPUT)

--stats
    Display only statistic tables and force stdio mode.

--full-symbols
    Display full length of symbols.

--no-source
    Do not display Source:Line column.

--show-all
    Show all captured HITM lines, regardless of the HITM % 0.0005
    limit.

-f, --force
    Don’t do ownership validation.

-d, --display
    Select the HITM type (rmt, lcl) to display and sort on. Total HITMs
    is the default.

C2C RECORD
The perf c2c record command sets up options related to HITM cacheline
analysis and calls the standard perf record command.

The following perf record options are configured by default (check the
perf record man page for details):

    -W,-d,--phys-data,--sample-cpu

Unless specified otherwise with the -e option, the following events are
monitored by default on x86:

    cpu/mem-loads,ldlat=30/P
    cpu/mem-stores/P

and the following on PowerPC:

    cpu/mem-loads/
    cpu/mem-stores/

You can pass any perf record option behind the -- mark, for example to
enable callchains and system wide monitoring:

    $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.
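
The split between c2c record options and plain perf record options can be sketched as a small command builder. This is only an illustration of the argument order from the SYNOPSIS; the helper name build_c2c_record_cmd and the workload ./my_app are hypothetical and not part of perf.

```python
from typing import List, Sequence


def build_c2c_record_cmd(
    command: Sequence[str],
    c2c_opts: Sequence[str] = (),
    record_opts: Sequence[str] = (),
) -> List[str]:
    """Return an argv list for `perf c2c record` (hypothetical helper).

    c2c_opts    -- options handled by c2c record itself (e.g. "-u")
    record_opts -- plain perf record options, placed behind the "--" mark
    command     -- the workload to profile
    """
    argv = ["perf", "c2c", "record", *c2c_opts]
    if record_opts:
        # Everything behind "--" is meant for the underlying perf record.
        argv += ["--", *record_opts]
    argv += list(command)
    return argv


# "./my_app" is a placeholder workload name.
cmd = build_c2c_record_cmd(["./my_app"], c2c_opts=["-u"], record_opts=["-g", "-a"])
print(" ".join(cmd))  # perf c2c record -u -- -g -a ./my_app
```

The resulting list could be handed to subprocess.run(cmd, check=True) on a system where perf is installed.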

C2C REPORT
The perf c2c report command displays shared data analysis. It comes in
two display modes: stdio and tui (default).

The report command workflow is as follows:
- sort all the data based on the cacheline address
- store access details for each cacheline
- sort all cachelines based on user settings
- display data
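
The workflow above can be sketched in a few lines of Python. This is not the perf implementation: it assumes 64-byte cachelines, uses a made-up Sample record, and groups samples by cacheline in a dictionary instead of sorting them first, which yields the same per-cacheline buckets.

```python
from collections import defaultdict
from dataclasses import dataclass

CACHELINE_SIZE = 64  # assumption: 64-byte cachelines


@dataclass
class Sample:
    """Hypothetical, simplified view of one recorded memory access."""
    addr: int               # data address of the access
    is_store: bool
    rmt_hitm: bool = False  # load hit a modified line on a remote node
    lcl_hitm: bool = False  # load hit a modified line on the local node


def new_stats():
    return {"records": 0, "stores": 0, "rmt_hitm": 0, "lcl_hitm": 0,
            "tot_hitm": 0, "offsets": defaultdict(int)}


def cacheline_report(samples, sort_key="tot_hitm"):
    """Group samples per cacheline and sort cachelines by sort_key."""
    lines = defaultdict(new_stats)
    for s in samples:
        base = s.addr & ~(CACHELINE_SIZE - 1)  # cacheline address
        st = lines[base]
        st["records"] += 1
        st["stores"] += int(s.is_store)
        st["rmt_hitm"] += int(s.rmt_hitm)
        st["lcl_hitm"] += int(s.lcl_hitm)
        st["tot_hitm"] = st["rmt_hitm"] + st["lcl_hitm"]
        st["offsets"][s.addr - base] += 1      # per-offset access count
    # Most contended cachelines first.
    return sorted(lines.items(), key=lambda kv: kv[1][sort_key], reverse=True)


samples = [
    Sample(0x1008, is_store=False, rmt_hitm=True),
    Sample(0x1010, is_store=True),
    Sample(0x1038, is_store=False, lcl_hitm=True),
    Sample(0x2000, is_store=False, lcl_hitm=True),
]
for index, (cacheline, st) in enumerate(cacheline_report(samples)):
    print(index, hex(cacheline), st["records"], st["tot_hitm"])
```

Passing sort_key="rmt_hitm" or sort_key="lcl_hitm" mirrors the idea behind the -d option of sorting on one HITM type instead of the total.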

In general, the perf c2c report output consists of 2 basic views:
1) most expensive cachelines list
2) offsets details for each cacheline

For each cacheline in the 1) list we display the following data (both
stdio and TUI modes follow the same fields output):

Index
    - zero based index to identify the cacheline

Cacheline
    - cacheline address (hex number)

Total records
    - sum of all accesses to the cacheline

Rmt/Lcl Hitm
    - cacheline percentage of all Remote/Local HITM accesses

LLC Load Hitm - Total, Lcl, Rmt
    - count of Total/Local/Remote load HITMs

Store Reference - Total, L1Hit, L1Miss
    Total - all store accesses
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1

Load Dram
    - count of local and remote DRAM accesses

LLC Ld Miss
    - count of all accesses that missed LLC

Total Loads
    - sum of all load accesses

Core Load Hit - FB, L1, L2
    - count of load hits in FB (Fill Buffer), L1 and L2 cache

LLC Load Hit - Llc, Rmt
    - count of LLC and Remote load hits

For each offset in the 2) list we display the following data:

HITM - Rmt, Lcl
    - % of Remote/Local HITM accesses for the given offset within the cacheline

Store Refs - L1 Hit, L1 Miss
    - % of store accesses that hit/missed L1 for the given offset within the cacheline

Data address - Offset
    - offset address

Pid
    - pid of the process responsible for the accesses

Tid
    - tid of the thread responsible for the accesses

Code address
    - code address responsible for the accesses

cycles - rmt hitm, lcl hitm, load
    - sum of cycles for the given accesses - Remote/Local HITM and generic load

cpu cnt
    - number of cpus that participated in the access

Symbol
    - code symbol related to the 'Code address' value

Shared Object
    - shared object name related to the 'Code address' value

Source:Line
    - source information related to the 'Code address' value

Node
    - nodes participating in the access (see NODE INFO section)

NODE INFO
The Node field displays the nodes that access the given cacheline offset.
Its output comes in 3 flavors:
- node IDs separated by ,
- node IDs with stats for each ID, in the following format: Node{cpus %hitms %stores}
- node IDs with the list of affected CPUs, in the following format: Node{cpu list}

You can switch between the above flavors with the -N option, or use the
n key to switch interactively in TUI mode.
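
The three flavors can be pictured with a small formatter. This is a rough sketch, not perf's own output code: the input dictionary layout, the flavor names ids/stats/cpus, the reading of "cpus" as a CPU count, and the exact spacing are all assumptions made for illustration.

```python
def format_node_field(node_stats, flavor="ids"):
    """Render a Node column in one of three illustrative flavors.

    node_stats maps a node ID to a dict with hypothetical keys:
      "cpus"      -- list of CPU numbers on that node that took part
      "hitm_pct"  -- share of HITMs attributed to the node, in percent
      "store_pct" -- share of stores attributed to the node, in percent
    """
    if flavor not in ("ids", "stats", "cpus"):
        raise ValueError("unknown flavor: %s" % flavor)
    parts = []
    for node in sorted(node_stats):
        st = node_stats[node]
        if flavor == "ids":
            parts.append(str(node))
        elif flavor == "stats":
            # Node{cpus %hitms %stores}, with cpus taken as a CPU count
            parts.append("%d{%d %.1f%% %.1f%%}"
                         % (node, len(st["cpus"]), st["hitm_pct"], st["store_pct"]))
        else:
            # Node{cpu list}
            cpu_list = ",".join(str(c) for c in st["cpus"])
            parts.append("%d{%s}" % (node, cpu_list))
    separator = "," if flavor == "ids" else " "
    return separator.join(parts)


stats = {
    0: {"cpus": [0, 1], "hitm_pct": 75.0, "store_pct": 40.0},
    1: {"cpus": [8], "hitm_pct": 25.0, "store_pct": 60.0},
}
print(format_node_field(stats, "ids"))    # 0,1
print(format_node_field(stats, "stats"))  # 0{2 75.0% 40.0%} 1{1 25.0% 60.0%}
print(format_node_field(stats, "cpus"))   # 0{0,1} 1{8}
```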

COALESCE
You can specify how to sort offsets for a cacheline.

The following fields are available and govern the final set of output
fields for the cacheline offsets output:

    tid   - coalesced by process TIDs
    pid   - coalesced by process PIDs
    iaddr - coalesced by code address, following fields are displayed:
            Code address, Code symbol, Shared Object, Source line
    dso   - coalesced by shared object

By default the coalescing is set up with pid,iaddr.
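
The effect of the coalescing fields can be sketched as a group-by over the accesses of one cacheline. The record layout below is hypothetical, and the sketch assumes that rows are always keyed by the offset first and then by the chosen fields; it only counts accesses, whereas the real report carries many more columns.

```python
from collections import defaultdict

COALESCE_FIELDS = ("tid", "pid", "iaddr", "dso")


def coalesce(accesses, fields=("pid", "iaddr")):
    """Merge accesses that share the same offset and coalescing fields.

    Returns a dict mapping (offset, *field values) to an access count.
    """
    for name in fields:
        if name not in COALESCE_FIELDS:
            raise ValueError("unknown coalesce field: %s" % name)
    rows = defaultdict(int)
    for acc in accesses:
        key = (acc["offset"],) + tuple(acc[name] for name in fields)
        rows[key] += 1
    return dict(rows)


# Hypothetical accesses to a single cacheline.
accesses = [
    {"offset": 0x8, "pid": 100, "tid": 101, "iaddr": 0x400500, "dso": "app"},
    {"offset": 0x8, "pid": 100, "tid": 102, "iaddr": 0x400500, "dso": "app"},
    {"offset": 0x8, "pid": 100, "tid": 102, "iaddr": 0x400600, "dso": "app"},
]
print(len(coalesce(accesses)))                   # 2 rows with pid,iaddr
print(len(coalesce(accesses, fields=("dso",))))  # 1 row with dso
```

With the default pid,iaddr the two threads hitting the same code address fold into one row, while coalescing by tid would keep them apart.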

STDIO OUTPUT
The stdio output displays data on standard output.

The following tables are displayed:

Trace Event Information
    - overall statistics of memory accesses

Global Shared Cache Line Event Information
    - overall statistics on shared cachelines

Shared Data Cache Line Table
    - list of most expensive cachelines

Shared Cache Line Distribution Pareto
    - list of all accessed offsets for each cacheline

TUI OUTPUT
The TUI output provides an interactive interface to navigate through
the cachelines list and to display offset details.

For details, please refer to the help window by pressing the ? key.

CREDITS
Although Don Zickus, Dick Fowles and Joe Mario worked together to get
this implemented, we got lots of early help from Arnaldo Carvalho de
Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
Check Joe’s blog on the c2c tool for a detailed use case explanation:
https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
perf-record(1), perf-mem(1)

perf 06/03/2019 PERF-C2C(1)