PERF-C2C(1)                       perf Manual                       PERF-C2C(1)


NAME
       perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
       perf c2c record [<options>] <command>
       perf c2c record [<options>] -- [<record command options>] <command>
       perf c2c report [<options>]

DESCRIPTION
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It
allows you to track down cacheline contentions.

The tool is based on the x86 load latency and precise store facility
events provided by Intel CPUs. These events provide:
  - memory address of the access
  - type of the access (load and store details)
  - latency (in cycles) of the load access

The c2c tool provides means to record this data and report back access
details for the cachelines with the highest contention, i.e. the highest
number of HITM accesses.

The basic workflow with this tool follows the standard record/report
phases. The user uses the record command to record events data and the
report command to display it.

RECORD OPTIONS
-e, --event=
    Select the PMU event. Use 'perf mem record -e list' to list available
    events.

-v, --verbose
    Be more verbose (show counter open errors, etc).

-l, --ldlat
    Configure mem-loads latency.

-k, --all-kernel
    Configure all used events to run in kernel space.

-u, --all-user
    Configure all used events to run in user space.

REPORT OPTIONS
-k, --vmlinux=<file>
    vmlinux pathname

-v, --verbose
    Be more verbose (show counter open errors, etc).

-i, --input
    Specify the input file to process.

-N, --node-info
    Show extra node info in report (see NODE INFO section).

-c, --coalesce
    Specify sorting fields for single cacheline display. The following
    fields are available: tid,pid,iaddr,dso (see COALESCE).

-g, --call-graph
    Setup callchains parameters. Please refer to the perf-report man page
    for details.

--stdio
    Force the stdio output (see STDIO OUTPUT).

--stats
    Display only statistic tables and force stdio mode.

--full-symbols
    Display full length of symbols.

--no-source
    Do not display the Source:Line column.

--show-all
    Show all captured HITM lines, regardless of the HITM % 0.0005 limit.

-f, --force
    Don't do ownership validation.

-d, --display
    Select the HITM type (rmt, lcl) to display and sort on. Defaults to
    total HITMs.

C2C RECORD
The perf c2c record command sets up options related to HITM cacheline
analysis and calls the standard perf record command.

The following perf record options are configured by default: (check the
perf record man page for details)

  -W,-d,--sample-cpu

Unless specified otherwise with the -e option, the following events are
monitored by default:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

The user can pass any perf record option after the -- mark, e.g. (to
enable callchains and system wide monitoring):

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.

C2C REPORT
The perf c2c report command displays shared data analysis. It comes in
two display modes: stdio and tui (default).

The report command workflow is the following:
  - sort all the data based on the cacheline address
  - store access details for each cacheline
  - sort all cachelines based on user settings
  - display data

In general the perf c2c report output consists of 2 basic views:
  1) most expensive cachelines list
  2) offsets details for each cacheline

For each cacheline in the 1) list we display the following data: (Both
stdio and TUI modes follow the same fields output)
Index
    - zero based index to identify the cacheline

Cacheline
    - cacheline address (hex number)

Total records
    - sum of all cacheline accesses

Rmt/Lcl Hitm
    - cacheline percentage of all Remote/Local HITM accesses

LLC Load Hitm - Total, Lcl, Rmt
    - count of Total/Local/Remote load HITMs

Store Reference - Total, L1Hit, L1Miss
    Total  - all store accesses
    L1Hit  - store accesses that hit L1
    L1Miss - store accesses that missed L1

Load Dram
    - count of local and remote DRAM accesses

LLC Ld Miss
    - count of all accesses that missed LLC

Total Loads
    - sum of all load accesses

Core Load Hit - FB, L1, L2
    - count of load hits in FB (Fill Buffer), L1 and L2 cache

LLC Load Hit - Llc, Rmt
    - count of LLC and Remote load hits

For each offset in the 2) list we display the following data:

HITM - Rmt, Lcl
    - % of Remote/Local HITM accesses for given offset within cacheline

Store Refs - L1 Hit, L1 Miss
    - % of store accesses that hit/missed L1 for given offset within
      cacheline

Data address - Offset
    - offset address

Pid
    - pid of the process responsible for the accesses

Tid
    - tid of the process responsible for the accesses

Code address
    - code address responsible for the accesses

cycles - rmt hitm, lcl hitm, load
    - sum of cycles for given accesses - Remote/Local HITM and generic
      load

cpu cnt
    - number of cpus that participated in the access

Symbol
    - code symbol related to the 'Code address' value

Shared Object
    - shared object name related to the 'Code address' value

Source:Line
    - source information related to the 'Code address' value

Node
    - nodes participating in the access (see NODE INFO section)

NODE INFO
The Node field displays the nodes that access a given cacheline offset.
Its output comes in 3 flavors:
  - node IDs separated by ','
  - node IDs with stats for each ID, in the following format:
      Node{cpus %hitms %stores}
  - node IDs with the list of affected CPUs in the following format:
      Node{cpu list}

The user can switch between the above flavors with the -N option, or
use the 'n' key to switch interactively in TUI mode.

COALESCE
The user can specify how to sort offsets for a cacheline.

The following fields are available and govern the final output fields
set for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address; the following fields are displayed:
            Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with pid,iaddr.

STDIO OUTPUT
The stdio output displays data on standard output.

The following tables are displayed:

Trace Event Information
    - overall statistics of memory accesses

Global Shared Cache Line Event Information
    - overall statistics on shared cachelines

Shared Data Cache Line Table
    - list of most expensive cachelines

Shared Cache Line Distribution Pareto
    - list of all accessed offsets for each cacheline

TUI OUTPUT
The TUI output provides an interactive interface to navigate through
the cachelines list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
Although Don Zickus, Dick Fowles and Joe Mario worked together to get
this implemented, we got lots of early help from Arnaldo Carvalho de
Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
Check Joe's blog on the c2c tool for a detailed use case explanation:
https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
perf-record(1), perf-mem(1)



perf                              06/18/2019                        PERF-C2C(1)