NUMATOP(8)                  System Manager's Manual                  NUMATOP(8)


NAME
numatop - a tool for memory access locality characterization and analysis.

SYNOPSIS
numatop [-s] [-l] [-f] [-d]

numatop [-h]

DESCRIPTION
This manual page briefly documents the numatop command.
Most modern systems use a Non-Uniform Memory Access (NUMA) design for
multiprocessing. In NUMA systems, memory and processors are organized in such
a way that some parts of memory are closer to a given processor, while other
parts are farther from it. A processor can access memory that is closer to it
much faster than the memory that is farther from it. Hence, the latency
between the processors and different portions of the memory in a NUMA machine
may be significantly different.

numatop is an observation tool for runtime memory locality characterization
and analysis of processes and threads running on a NUMA system. It helps the
user to characterize the NUMA behavior of processes and threads and to
identify where the NUMA-related performance bottlenecks reside. The tool uses
hardware performance counter sampling technologies and associates the
performance data with Linux system runtime information to provide real-time
analysis in production systems. The tool can be used to:

A) Characterize the locality of all running processes and threads to identify
those with the poorest locality in the system.

B) Identify the "hot" memory areas, report the average memory access latency,
and provide the location where the accessed memory is allocated. A "hot"
memory area is one where a process's or thread's accesses are most frequent.
numatop has a metric called "ACCESS%" that specifies what percentage of memory
accesses is attributable to each memory area.

Note: numatop records only the memory accesses which have latencies greater
than a predefined threshold (128 CPU cycles).

C) Provide the call-chain(s) in the process/thread code that accesses a given
hot memory area.

D) Provide the call-chain(s) when the process/thread generates certain counter
events (RMA/LMA/IR/CYCLE). The call-chain(s) help to locate the source code
that generates the events.

RMA: Remote Memory Access.
LMA: Local Memory Access.
IR: Instruction Retired.
CYCLE: CPU cycles.

E) Provide per-node statistics for memory and CPU utilization. A node is a
region of memory in which every byte has the same distance from each CPU.
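
A quick way to see the nodes present on a system, outside of numatop, is
numactl(8) if it is installed. The output below is only illustrative, for a
hypothetical two-node machine:

    $ numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3
    node 0 size: 16384 MB
    node 1 cpus: 4 5 6 7
    node 1 size: 16384 MB
    node distances:
    node   0   1
      0:  10  21
      1:  21  10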

F) Show, using a user-friendly interface, the list of processes/threads sorted
by some metric (by default, sorted by CPU utilization), with the top process
having the highest CPU utilization in the system and the bottom one having the
lowest CPU utilization. Users can also use hotkeys to re-sort the output by
these metrics: RMA, LMA, RMA/LMA, CPI, and CPU%.

RMA/LMA: ratio of RMA to LMA.
CPI: CPU cycles per instruction.
CPU%: CPU utilization.

numatop is a GUI tool that periodically tracks and analyzes the NUMA activity
of processes and threads and displays useful metrics. Users can scroll up/down
by using the up or down keys to navigate in the current window, and can use
several hotkeys, shown at the bottom of the window, to switch between windows
or to change the running state of the tool. For example, hotkey 'R' refreshes
the data in the current window.

Below is a detailed description of the various display windows and the data
items that they display:

[WIN1 - Monitoring processes and threads]:
Get the locality characterization of all processes. This is the first window
shown at startup; it is numatop's "Home" window. This window displays a list
of processes. The top process has the highest system CPU utilization (CPU%),
while the bottom process has the lowest CPU% in the system. Generally,
memory-intensive processes are also CPU-intensive, so the processes shown in
this window are sorted by CPU% by default. The user can press hotkeys '1',
'2', '3', '4', or '5' to re-sort the output by "RMA", "LMA", "RMA/LMA", "CPI",
or "CPU%".

[KEY METRICS]:
RMA(K): number of Remote Memory Accesses (in thousands).
RMA(K) = RMA / 1000;
LMA(K): number of Local Memory Accesses (in thousands).
LMA(K) = LMA / 1000;
RMA/LMA: ratio of RMA to LMA.
CPI: CPU cycles per instruction.
CPU%: system CPU utilization (busy time across all CPUs).
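
For example, assuming that over one refresh interval a process generated
800,000 RMA events and 3,200,000 LMA events, retired 2,000,000,000
instructions (IR), and consumed 3,000,000,000 CPU cycles (hypothetical
numbers), the metrics above would be:

    RMA(K)  = 800000 / 1000             = 800
    LMA(K)  = 3200000 / 1000            = 3200
    RMA/LMA = 800000 / 3200000          = 0.25
    CPI     = 3000000000 / 2000000000   = 1.5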

[HOTKEY]:
Q: Quit the application.
H: Refresh WIN1 (the "Home" window).
R: Refresh to show the latest data.
I: Switch to WIN2 to show the normalized data.
N: Switch to WIN11 to show the per-node statistics.
1: Sort by RMA.
2: Sort by LMA.
3: Sort by RMA/LMA.
4: Sort by CPI.
5: Sort by CPU%.

[WIN2 - Monitoring processes and threads (normalized)]:
Get the normalized locality characterization of all processes.

[KEY METRICS]:
RPI(K): RMA normalized per 1000 instructions.
RPI(K) = RMA / (IR / 1000);
LPI(K): LMA normalized per 1000 instructions.
LPI(K) = LMA / (IR / 1000);
Other metrics remain the same.
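
Using the same hypothetical counts as in the WIN1 example above (RMA =
800,000, LMA = 3,200,000, IR = 2,000,000,000), the normalized metrics would
be:

    RPI(K) = 800000 / (2000000000 / 1000)   = 0.4
    LPI(K) = 3200000 / (2000000000 / 1000)  = 1.6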

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.
N: Switch to WIN11 to show the per-node statistics.
1: Sort by RPI.
2: Sort by LPI.
3: Sort by RMA/LMA.
4: Sort by CPI.
5: Sort by CPU%.

[WIN3 - Monitoring the process]:
Get the locality characterization with node affinity of a specified process.

[KEY METRICS]:
NODE: the node ID.
CPU%: per-node CPU utilization.
Other metrics remain the same.

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.
N: Switch to WIN11 to show the per-node statistics.
L: Show the latency information.
C: Show the call-chain.

[WIN4 - Monitoring all threads]:
Get the locality characterization of all threads in a specified process.

[KEY METRICS]:
CPU%: per-CPU CPU utilization.
Other metrics remain the same.

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.
N: Switch to WIN11 to show the per-node statistics.

[WIN5 - Monitoring the thread]:
Get the locality characterization with node affinity of a specified thread.

[KEY METRICS]:
CPU%: per-CPU CPU utilization.
Other metrics remain the same.

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.
N: Switch to WIN11 to show the per-node statistics.
L: Show the latency information.
C: Show the call-chain.

[WIN6 - Monitoring memory areas]:
Get the memory area usage, with the associated access latency, of a specified
process/thread.

[KEY METRICS]:
ADDR: starting address of the memory area.
SIZE: size of the memory area (K/M/G bytes).
ACCESS%: percentage of memory accesses that fall in this memory area.
LAT(ns): the average latency (nanoseconds) of memory accesses.
DESC: description of the memory area (from /proc/<pid>/maps).
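
The DESC column comes from the pathname field of the corresponding entry in
/proc/<pid>/maps. For example, a maps entry like the following (illustrative
addresses) would be reported with the description "[heap]":

    55e7c1a00000-55e7c1c00000 rw-p 00000000 00:00 0          [heap]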

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.
A: Show the memory access node distribution.
C: Show the call-chain when process/thread accesses the memory area.

[WIN7 - Memory access node distribution overview]:
Get the percentage of memory accesses originating from the process/thread to
each node.

[KEY METRICS]:
NODE: the node ID.
ACCESS%: percentage of memory accesses that go to this node.
LAT(ns): the average latency (nanoseconds) of memory accesses to this node.

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.

[WIN8 - Break down the memory area into physical memory on node]:
Break down the memory area into its physical mapping on each node, with the
associated access latency, for a process/thread.

[KEY METRICS]:
NODE: the node ID.
Other metrics remain the same.

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.

[WIN9 - Call-chain when process/thread generates the event
("RMA"/"LMA"/"CYCLE"/"IR")]:
Determine the call-chains to the code that generates "RMA"/"LMA"/"CYCLE"/"IR".

[KEY METRICS]:
Call-chain list: a list of call-chains.

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to the previous window.
R: Refresh to show the latest data.
1: Locate the call-chain when the process/thread generates "RMA".
2: Locate the call-chain when the process/thread generates "LMA".
3: Locate the call-chain when the process/thread generates "CYCLE" (CPU cycles).
4: Locate the call-chain when the process/thread generates "IR" (Instruction
Retired).

[WIN10 - Call-chain when process/thread accesses the memory area]:
Determine the call-chains to the code that references this memory area. The
latency must be greater than the predefined latency threshold (128 CPU
cycles).

[KEY METRICS]:
Call-chain list: a list of call-chains.
Other metrics remain the same.

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.

[WIN11 - Node Overview]:
Show the basic per-node statistics for this system.

[KEY METRICS]:
MEM.ALL: total usable RAM (physical RAM minus a few reserved bits and the
kernel binary code).
MEM.FREE: sum of LowFree + HighFree (overall stat).
CPU%: per-node CPU utilization.
Other metrics remain the same.

[WIN12 - Information of Node N]:
Show the memory usage and CPU utilization for the selected node.

[KEY METRICS]:
CPU: array of logical CPUs which belong to this node.
CPU%: per-node CPU utilization.
MEM active: the amount of memory that has been used more recently and is not
usually reclaimed unless absolutely necessary.
MEM inactive: the amount of memory that has not been used for a while and is
eligible to be swapped to disk.
Dirty: the amount of memory waiting to be written back to the disk.
Writeback: the amount of memory actively being written back to the disk.
Mapped: all pages mapped into a process.
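
The Linux kernel exports per-node memory statistics of this kind under
/sys/devices/system/node/. For example, to inspect the raw counters for node 0
directly:

    $ cat /sys/devices/system/node/node0/meminfo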

[HOTKEY]:
Q: Quit the application.
H: Switch to WIN1.
B: Back to previous window.
R: Refresh to show the latest data.

OPTIONS
The following options are supported by numatop:

-s sampling_precision
normal: balance precision and overhead (default)
high: high sampling precision (high overhead)
low: low sampling precision, suitable for a high-load system

-l log_level
Specifies the level of logging in the log file. Valid values are:
1: unknown (reserved for future use)
2: all

-f log_file
Specifies the log file where output will be written. If the log file is not
writable, the tool will prompt "Cannot open '<file name>' for writing."

-d dump_file
Specifies the dump file where the screen data will be written. Generally the
dump file is used for automated testing. If the dump file is not writable, the
tool will prompt "Cannot open <file name> for dump writing."

-h
Displays the command's usage.

EXAMPLES
Example 1: Launch numatop with high sampling precision
numatop -s high

Example 2: Write all warning messages to /tmp/numatop.log
numatop -l 2 -f /tmp/numatop.log

Example 3: Dump screen data to /tmp/dump.log
numatop -d /tmp/dump.log

EXIT STATUS
0: successful operation.
Other value: an error occurred.

USAGE
You must have root privileges to run numatop. Alternatively, set
/proc/sys/kernel/perf_event_paranoid to -1.
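
For example, as root (this change does not persist across reboots):

    # sysctl -w kernel.perf_event_paranoid=-1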

Note: The perf_event_paranoid setting has security implications, and a
non-root user probably does not have the authority to access /proc. It is
highly recommended that the user runs numatop as root.

NOTES
numatop requires a patch set to support the PEBS Load Latency functionality in
the kernel. The patch set has not been integrated into kernel 3.8; it will
probably be integrated in 3.9. The following steps show how to get and apply
the patch set:

1. git clone git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
2. cd tip
3. git checkout perf/x86
4. build the kernel as usual (see the example below)

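One common way to build and install the checked-out kernel, shown only as an
illustration (exact steps vary by distribution and configuration):

    $ cp /boot/config-$(uname -r) .config
    $ make olddefconfig
    $ make -j$(nproc)
    # make modules_install install
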
numatop supports the Intel Xeon processors: 5500-series, 6500/7500-series,
5600-series, E7-x8xx-series, and E5-16xx/24xx/26xx/46xx-series. Note: CPU
microcode version 0x618 or 0x70c or later is required on the
E5-16xx/24xx/26xx/46xx-series. It also supports IBM Power8 and Power9
processors.



                                 April 3, 2013                       NUMATOP(8)