PERF-C2C(1)                       perf Manual                      PERF-C2C(1)

NAME

       perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS

       perf c2c record [<options>] <command>
       perf c2c record [<options>] -- [<record command options>] <command>
       perf c2c report [<options>]

DESCRIPTION

       C2C stands for Cache To Cache.

       The perf c2c tool provides a means for Shared Data C2C/HITM analysis.
       It allows you to track down cacheline contention. (HITM refers to a
       load that hit in a modified cacheline held by another cache.)

       On x86, the tool is based on the load latency and precise store
       facility events provided by Intel CPUs. On PowerPC, the tool uses
       random instruction sampling with the thresholding feature.

       These events provide:
         - memory address of the access
         - type of the access (load and store details)
         - latency (in cycles) of the load access

       The c2c tool provides a means to record this data and report back
       access details for the cachelines with the highest contention, that
       is, the highest number of HITM accesses.

       The basic workflow with this tool follows the standard record/report
       phases. The user uses the record command to record event data and the
       report command to display it.
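The record/report workflow can be sketched as follows (./myapp is a hypothetical workload binary; any command works in its place):

```shell
# record shared-data related events while the workload runs;
# results are written to perf.data in the current directory
perf c2c record -- ./myapp

# analyze perf.data and print the contended cachelines
perf c2c report --stdio
```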

RECORD OPTIONS

       -e, --event=
           Select the PMU event. Use perf c2c record -e list to list
           available events.

       -v, --verbose
           Be more verbose (show counter open errors, etc).

       -l, --ldlat
           Configure mem-loads latency. (x86 only)

       -k, --all-kernel
           Configure all used events to run in kernel space.

       -u, --all-user
           Configure all used events to run in user space.

REPORT OPTIONS

       -k, --vmlinux=<file>
           vmlinux pathname

       -v, --verbose
           Be more verbose (show counter open errors, etc).

       -i, --input
           Specify the input file to process.

       -N, --node-info
           Show extra node info in report (see NODE INFO section).

       -c, --coalesce
           Specify sorting fields for single cacheline display. The
           following fields are available: tid,pid,iaddr,dso (see COALESCE).

       -g, --call-graph
           Setup callchain parameters. Please refer to the perf-report man
           page for details.

       --stdio
           Force the stdio output (see STDIO OUTPUT).

       --stats
           Display only statistic tables and force stdio mode.

       --full-symbols
           Display full length of symbols.

       --no-source
           Do not display the Source:Line column.

       --show-all
           Show all captured HITM lines, regardless of the HITM % 0.0005
           limit.

       -f, --force
           Don’t do ownership validation.

       -d, --display
           Switch the HITM type (rmt, lcl) to display and sort on. Total
           HITMs by default.

       --stitch-lbr
           Show callgraphs with stitched LBRs, which may be more complete.
           The perf.data file must have been obtained using perf c2c record
           --call-graph lbr. Disabled by default. In common cases with call
           stack overflows, it can recreate better call stacks than the
           default lbr call stack output. But this approach is not
           foolproof. There can be cases where it creates incorrect call
           stacks from incorrect matches. Known limitations include
           exception handling such as setjmp/longjmp, whose calls/returns
           will not match.
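As an illustration, the report options above can be combined; this sketch assumes an existing perf.data recorded by perf c2c record:

```shell
# text-mode report sorted on local HITMs, offsets coalesced
# by thread ID and code address
perf c2c report -d lcl -c tid,iaddr --stdio
```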

C2C RECORD

       The perf c2c record command sets up options related to HITM cacheline
       analysis and calls the standard perf record command.

       The following perf record options are configured by default: (check
       the perf record man page for details)

           -W,-d,--phys-data,--sample-cpu

       Unless specified otherwise with the -e option, the following events
       are monitored by default on x86:

           cpu/mem-loads,ldlat=30/P
           cpu/mem-stores/P

       and the following on PowerPC:

           cpu/mem-loads/
           cpu/mem-stores/

       User can pass any perf record option behind the -- mark, like (to
       enable callchains and system wide monitoring):

           $ perf c2c record -- -g -a

       Please check the RECORD OPTIONS section for specific c2c record
       options.
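Options on both sides of the -- mark can be combined. This sketch raises the load-latency threshold (the value 50 is illustrative; x86 only) and passes callchain and system-wide options through to perf record:

```shell
# sample loads with latency >= 50 cycles, system wide,
# with callchains, while sleeping for 10 seconds
perf c2c record -l 50 -- -g -a sleep 10
```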

C2C REPORT

       The perf c2c report command displays shared data analysis. It comes
       in two display modes: stdio and tui (default).

       The report command workflow is the following:
         - sort all the data based on the cacheline address
         - store access details for each cacheline
         - sort all cachelines based on user settings
         - display data

       In general perf report output consists of 2 basic views:
         1) most expensive cachelines list
         2) offsets details for each cacheline

       For each cacheline in the 1) list we display the following data:
       (Both stdio and TUI modes follow the same fields output)

           Index
           - zero based index to identify the cacheline

           Cacheline
           - cacheline address (hex number)

           Rmt/Lcl Hitm
           - cacheline percentage of all Remote/Local HITM accesses

           LLC Load Hitm - Total, LclHitm, RmtHitm
           - count of Total/Local/Remote load HITMs

           Total records
           - sum of all cacheline accesses

           Total loads
           - sum of all load accesses

           Total stores
           - sum of all store accesses

           Store Reference - L1Hit, L1Miss
             L1Hit - store accesses that hit L1
             L1Miss - store accesses that missed L1

           Core Load Hit - FB, L1, L2
           - count of load hits in FB (Fill Buffer), L1 and L2 cache

           LLC Load Hit - LlcHit, LclHitm
           - count of LLC load accesses, includes LLC hits and LLC HITMs

           RMT Load Hit - RmtHit, RmtHitm
           - count of remote load accesses, includes remote hits and remote
             HITMs

           Load Dram - Lcl, Rmt
           - count of local and remote DRAM accesses

       For each offset in the 2) list we display the following data:

           HITM - Rmt, Lcl
           - % of Remote/Local HITM accesses for given offset within
             cacheline

           Store Refs - L1 Hit, L1 Miss
           - % of store accesses that hit/missed L1 for given offset within
             cacheline

           Data address - Offset
           - offset address

           Pid
           - pid of the process responsible for the accesses

           Tid
           - tid of the process responsible for the accesses

           Code address
           - code address responsible for the accesses

           cycles - rmt hitm, lcl hitm, load
             - sum of cycles for given accesses - Remote/Local HITM and
               generic load

           cpu cnt
             - number of cpus that participated in the access

           Symbol
             - code symbol related to the 'Code address' value

           Shared Object
             - shared object name related to the 'Code address' value

           Source:Line
             - source information related to the 'Code address' value

           Node
             - nodes participating in the access (see NODE INFO section)

NODE INFO

       The Node field displays the nodes that access the given cacheline
       offset. Its output comes in 3 flavors:
         - node IDs separated by ','
         - node IDs with stats for each ID, in the following format:
             Node{cpus %hitms %stores}
         - node IDs with the list of affected CPUs in the following format:
             Node{cpu list}

       User can switch between the above flavors with the -N option or use
       the 'n' key to interactively switch in TUI mode.

COALESCE

       User can specify how to sort offsets for a cacheline.

       The following fields are available and govern the final output field
       set for the cacheline offsets output:

           tid   - coalesced by process TIDs
           pid   - coalesced by process PIDs
           iaddr - coalesced by code address, following fields are displayed:
                      Code address, Code symbol, Shared Object, Source line
           dso   - coalesced by shared object

       By default the coalescing is set up with pid,iaddr.
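For example, to coalesce offsets by shared object only (a sketch; assumes an existing perf.data):

```shell
# one line per shared object instead of per pid/code address
perf c2c report -c dso --stdio
```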

STDIO OUTPUT

       The stdio output displays data on standard output.

       The following tables are displayed:

           Trace Event Information
           - overall statistics of memory accesses

           Global Shared Cache Line Event Information
           - overall statistics on shared cachelines

           Shared Data Cache Line Table
           - list of most expensive cachelines

           Shared Cache Line Distribution Pareto
           - list of all accessed offsets for each cacheline

TUI OUTPUT

       The TUI output provides an interactive interface to navigate through
       the cachelines list and to display offset details.

       For details please refer to the help window by pressing the ? key.

CREDITS

       Although Don Zickus, Dick Fowles and Joe Mario worked together to get
       this implemented, we got lots of early help from Arnaldo Carvalho de
       Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG

       Check Joe’s blog on the c2c tool for a detailed use case explanation:
       https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO

       perf-record(1), perf-mem(1)
perf                              06/03/2021                       PERF-C2C(1)