numatop(8)

NUMATOP(8)                  System Manager's Manual                 NUMATOP(8)

NAME

       numatop - a tool for memory access locality characterization and
       analysis.

SYNOPSIS

       numatop [-s sampling_precision] [-l log_level] [-f log_file]
               [-d dump_file]

       numatop [-h]

DESCRIPTION

       This manual page briefly documents the numatop command.

       Most modern multiprocessor systems use a Non-Uniform Memory Access
       (NUMA) design. In a NUMA system, memory and processors are organized so
       that some parts of memory are closer to a given processor than others.
       A processor can access nearby memory much faster than memory that is
       farther away, so memory access latency on a NUMA machine can differ
       significantly depending on which processor touches which region.

       numatop is an observation tool for runtime memory locality
       characterization and analysis of processes and threads running on a
       NUMA system. It helps the user characterize the NUMA behavior of
       processes and threads and identify where NUMA-related performance
       bottlenecks reside. The tool uses Intel performance counter sampling
       technologies and associates the performance data with Linux system
       runtime information to provide real-time analysis on production
       systems. The tool can be used to:

       A) Characterize the locality of all running processes and threads to
       identify those with the poorest locality in the system.

       B) Identify the "hot" memory areas, report the average memory access
       latency, and show where the accessed memory is allocated. A "hot"
       memory area is one that a process or thread accesses most frequently.
       The "ACCESS%" metric gives the percentage of memory accesses
       attributable to each memory area.

       Note: numatop records only memory accesses whose latency is greater
       than a predefined threshold (128 CPU cycles).

       C) Provide the call-chain(s) in the process/thread code that access a
       given hot memory area.

       D) Provide the call-chain(s) when the process/thread generates certain
       counter events (RMA/LMA/IR/CYCLE). The call-chain(s) help locate the
       source code that generates the events.

       RMA: Remote Memory Access.
       LMA: Local Memory Access.
       IR: Instruction Retired.
       CYCLE: CPU cycles.

       E) Provide per-node statistics for memory and CPU utilization. A node
       is a region of memory in which every byte has the same distance from
       each CPU.

       F) Show, in a user-friendly interface, the list of processes/threads
       sorted by a chosen metric. By default the list is sorted by CPU
       utilization, with the process that has the highest CPU utilization at
       the top and the one with the lowest at the bottom. Users can also use
       hotkeys to re-sort the output by these metrics: RMA, LMA, RMA/LMA, CPI,
       and CPU%.

       RMA/LMA: ratio of RMA to LMA.
       CPI: CPU cycles per instruction.
       CPU%: CPU utilization.

       numatop is an interactive full-screen tool that periodically tracks and
       analyzes the NUMA activity of processes and threads and displays useful
       metrics. Users can scroll with the up and down keys to navigate in the
       current window, and can use the hotkeys shown at the bottom of the
       window to switch between windows or to change the running state of the
       tool. For example, hotkey 'R' refreshes the data in the current window.

       Below is a detailed description of the display windows and the data
       items that they show:

       [WIN1 - Monitoring processes and threads]:
       Get the locality characterization of all processes. This is the first
       window shown at startup and serves as numatop's "Home" window. It
       displays a list of processes. The top process has the highest system
       CPU utilization (CPU%), while the bottom process has the lowest CPU% in
       the system. A memory-intensive process is generally also CPU-intensive,
       so the processes in this window are sorted by CPU% by default. The user
       can press hotkey '1', '2', '3', '4', or '5' to re-sort the output by
       "RMA", "LMA", "RMA/LMA", "CPI", or "CPU%", respectively.

       [KEY METRICS]:
       RMA(K): number of Remote Memory Accesses (unit is 1000).
               RMA(K) = RMA / 1000;
       LMA(K): number of Local Memory Accesses (unit is 1000).
               LMA(K) = LMA / 1000;
       RMA/LMA: ratio of RMA to LMA.
       CPI: CPU cycles per instruction.
       CPU%: system CPU utilization (busy time across all CPUs).

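       The WIN1 formulas above can be illustrated with a short sketch. This
       is not numatop's own code: the function name win1_metrics, the sample
       counts, and the handling of zero denominators are assumptions made for
       the example.

```python
def win1_metrics(rma, lma, cycles, instructions):
    """Derive the WIN1 columns from raw event counts for one process."""
    if lma:
        ratio = rma / lma
    else:
        # Assumption: with no local accesses the ratio is reported as 0.0.
        ratio = 0.0
    return {
        "RMA(K)": rma / 1000,   # RMA(K) = RMA / 1000
        "LMA(K)": lma / 1000,   # LMA(K) = LMA / 1000
        "RMA/LMA": ratio,
        # Assumption: CPI is 0.0 when no instructions were retired.
        "CPI": cycles / instructions if instructions else 0.0,
    }


# Hypothetical raw counts for a single sampling interval.
row = win1_metrics(rma=250_000, lma=1_000_000,
                   cycles=3_000_000, instructions=2_000_000)
print(row)
```

       A high RMA/LMA value combined with a high CPI is the kind of pattern
       this window is meant to surface.
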
       [HOTKEY]:
       Q: Quit the application.
       H: WIN1 refresh.
       R: Refresh to show the latest data.
       I: Switch to WIN2 to show the normalized data.
       N: Switch to WIN11 to show the per-node statistics.
       1: Sort by RMA.
       2: Sort by LMA.
       3: Sort by RMA/LMA.
       4: Sort by CPI.
       5: Sort by CPU%.

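       The sort hotkeys can be pictured as a lookup from key to column
       followed by a sort. The sketch below is illustrative only: the names
       SORT_COLUMNS and resort and the sample rows are hypothetical, and
       descending order for every column is an assumption based on the
       default CPU% ordering described for WIN1.

```python
# Hotkey-to-column mapping taken from the WIN1 hotkey list.
SORT_COLUMNS = {"1": "RMA", "2": "LMA", "3": "RMA/LMA", "4": "CPI", "5": "CPU%"}


def resort(rows, hotkey="5"):
    """Return rows ordered by the column bound to hotkey, largest first."""
    column = SORT_COLUMNS[hotkey]
    return sorted(rows, key=lambda row: row[column], reverse=True)


# Hypothetical process rows.
rows = [
    {"PID": 101, "RMA": 10.0, "LMA": 90.0, "RMA/LMA": 0.11, "CPI": 0.9, "CPU%": 40.0},
    {"PID": 202, "RMA": 80.0, "LMA": 20.0, "RMA/LMA": 4.00, "CPI": 2.1, "CPU%": 25.0},
    {"PID": 303, "RMA": 30.0, "LMA": 30.0, "RMA/LMA": 1.00, "CPI": 1.4, "CPU%": 60.0},
]
by_cpu = [r["PID"] for r in resort(rows)]         # default: by CPU%
by_ratio = [r["PID"] for r in resort(rows, "3")]  # hotkey '3': by RMA/LMA
print(by_cpu, by_ratio)
```
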
       [WIN2 - Monitoring processes and threads (normalized)]:
       Get the normalized locality characterization of all processes.

       [KEY METRICS]:
       RPI(K): RMA per 1000 instructions.
               RPI(K) = RMA / (IR / 1000);
       LPI(K): LMA per 1000 instructions.
               LPI(K) = LMA / (IR / 1000);
       Other metrics remain the same.

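       As a worked example of the normalization used in WIN2, the sketch
       below applies the RPI(K) and LPI(K) formulas to made-up counts. The
       helper name per_kilo_instruction and the zero-instruction fallback
       are assumptions, not part of numatop.

```python
def per_kilo_instruction(count, instructions):
    """Normalize an event count to events per 1000 retired instructions."""
    if instructions == 0:
        # Assumption: report 0.0 rather than fail when IR is zero.
        return 0.0
    return count / (instructions / 1000)


# Hypothetical counts: 500 remote and 4000 local accesses over 2M instructions.
rpi_k = per_kilo_instruction(500, 2_000_000)    # RPI(K) = RMA / (IR / 1000)
lpi_k = per_kilo_instruction(4000, 2_000_000)   # LPI(K) = LMA / (IR / 1000)
print(rpi_k, lpi_k)
```

       Normalizing by instruction count makes processes comparable even when
       they ran for different amounts of work during the sampling interval.
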
       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.
       N: Switch to WIN11 to show the per-node statistics.
       1: Sort by RPI.
       2: Sort by LPI.
       3: Sort by RMA/LMA.
       4: Sort by CPI.
       5: Sort by CPU%.

       [WIN3 - Monitoring the process]:
       Get the locality characterization with node affinity of a specified
       process.

       [KEY METRICS]:
       NODE: the node ID.
       CPU%: per-node CPU utilization.
       Other metrics remain the same.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.
       N: Switch to WIN11 to show the per-node statistics.
       L: Show the latency information.
       C: Show the call-chain.

       [WIN4 - Monitoring all threads]:
       Get the locality characterization of all threads in a specified
       process.

       [KEY METRICS]:
       CPU%: per-CPU CPU utilization.
       Other metrics remain the same.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.
       N: Switch to WIN11 to show the per-node statistics.

       [WIN5 - Monitoring the thread]:
       Get the locality characterization with node affinity of a specified
       thread.

       [KEY METRICS]:
       CPU%: per-CPU CPU utilization.
       Other metrics remain the same.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.
       N: Switch to WIN11 to show the per-node statistics.
       L: Show the latency information.
       C: Show the call-chain.

       [WIN6 - Monitoring memory areas]:
       Get the memory area use, with the associated access latency, of a
       specified process/thread.

       [KEY METRICS]:
       ADDR: starting address of the memory area.
       SIZE: size of the memory area (K/M/G bytes).
       ACCESS%: percentage of memory accesses that are to this memory area.
       LAT(ns): the average latency (nanoseconds) of memory accesses.
       DESC: description of the memory area (from /proc/<pid>/maps).

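       To make ACCESS% and LAT(ns) concrete, here is a minimal sketch that
       derives both from a list of sampled accesses. It is not how numatop
       is implemented. The names summarize_areas and THRESHOLD_CYCLES, the
       sample data, and the cycles-to-nanoseconds conversion through an
       assumed cpu_ghz clock rate are all hypothetical; only the 128-cycle
       threshold comes from this page.

```python
THRESHOLD_CYCLES = 128  # accesses at or below this latency are not recorded


def summarize_areas(areas, samples, cpu_ghz):
    """Compute ACCESS% and LAT(ns) per memory area.

    areas:   list of (start_address, size_in_bytes, description)
    samples: list of (address, latency_in_cycles)
    cpu_ghz: assumed clock rate used to convert cycles to nanoseconds
    """
    kept = [(addr, lat) for addr, lat in samples if lat > THRESHOLD_CYCLES]
    rows = []
    for start, size, desc in areas:
        lats = [lat for addr, lat in kept if start <= addr < start + size]
        access_pct = 100.0 * len(lats) / len(kept) if kept else 0.0
        lat_ns = (sum(lats) / len(lats)) / cpu_ghz if lats else 0.0
        rows.append({"ADDR": hex(start), "SIZE": size, "ACCESS%": access_pct,
                     "LAT(ns)": lat_ns, "DESC": desc})
    return rows


# Hypothetical memory areas and sampled accesses.
areas = [(0x1000, 0x1000, "[heap]"), (0x8000, 0x2000, "[stack]")]
samples = [(0x1100, 200), (0x1200, 400), (0x8100, 300),
           (0x1300, 100),  # below the threshold, so it is ignored
           (0x8200, 500)]
table = summarize_areas(areas, samples, cpu_ghz=2.0)
for row in table:
    print(row)
```
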
       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.
       A: Show the memory access node distribution.
       C: Show the call-chain when process/thread accesses the memory area.

       [WIN7 - Memory access node distribution overview]:
       Get the percentage of memory accesses originating from the
       process/thread to each node.

       [KEY METRICS]:
       NODE: the node ID.
       ACCESS%: percentage of memory accesses that are to this node.
       LAT(ns): the average latency (nanoseconds) of memory accesses to this
       node.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.

       [WIN8 - Break down the memory area into physical memory on node]:
       Break down the memory area into its physical mapping on each node,
       with the associated access latency of a process/thread.

       [KEY METRICS]:
       NODE: the node ID.
       Other metrics remain the same.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.

       [WIN9 - Call-chain when process/thread generates the event
       ("RMA"/"LMA"/"CYCLE"/"IR")]:
       Determine the call-chains to the code that generates
       "RMA"/"LMA"/"CYCLE"/"IR".

       [KEY METRICS]:
       Call-chain list: a list of call-chains.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.
       1: Locate call-chain when process/thread generates "RMA".
       2: Locate call-chain when process/thread generates "LMA".
       3: Locate call-chain when process/thread generates "CYCLE" (CPU cycle).
       4: Locate call-chain when process/thread generates "IR" (Instruction
       Retired).

       [WIN10 - Call-chain when process/thread accesses the memory area]:
       Determine the call-chains to the code that references this memory area.
       Only accesses whose latency is greater than the predefined latency
       threshold (128 CPU cycles) are considered.

       [KEY METRICS]:
       Call-chain list: a list of call-chains.
       Other metrics remain the same.

       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.

       [WIN11 - Node Overview]:
       Show the basic per-node statistics for this system.

       [KEY METRICS]:
       MEM.ALL: total usable RAM (physical RAM minus a few reserved bits and
       the kernel binary code).
       MEM.FREE: sum of LowFree + HighFree (overall stat).
       CPU%: per-node CPU utilization.
       Other metrics remain the same.

       [WIN12 - Information of Node N]:
       Show the memory use and CPU utilization for the selected node.

       [KEY METRICS]:
       CPU: array of logical CPUs which belong to this node.
       CPU%: per-node CPU utilization.
       MEM active: the amount of memory that has been used more recently and
       is not usually reclaimed unless absolutely necessary.
       MEM inactive: the amount of memory that has not been used for a while
       and is eligible to be swapped to disk.
       Dirty: the amount of memory waiting to be written back to the disk.
       Writeback: the amount of memory actively being written back to the
       disk.
       Mapped: all pages mapped into a process.

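       On Linux, per-node figures with these names are exposed in
       /sys/devices/system/node/node<N>/meminfo. Whether numatop reads that
       file is not stated on this page. The sketch below only shows how such
       text could be parsed; the function name parse_node_meminfo and the
       sample values are made up for illustration.

```python
# Hypothetical excerpt in the format of /sys/devices/system/node/node0/meminfo.
SAMPLE = """\
Node 0 MemTotal:       16326880 kB
Node 0 MemFree:         8123456 kB
Node 0 Active:          4000000 kB
Node 0 Inactive:        2000000 kB
Node 0 Dirty:               128 kB
Node 0 Writeback:             0 kB
Node 0 Mapped:           350000 kB
"""


def parse_node_meminfo(text):
    """Map field name to its numeric value (kB for the fields shown here)."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: ["Node", "<id>", "<Name>:", "<value>", "kB"]
        if len(parts) >= 4 and parts[0] == "Node":
            fields[parts[2].rstrip(":")] = int(parts[3])
    return fields


info = parse_node_meminfo(SAMPLE)
print(info["Active"], info["Inactive"], info["Dirty"])
```

       To try it on a real system, pass the contents of a node's meminfo
       file instead of SAMPLE.
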
       [HOTKEY]:
       Q: Quit the application.
       H: Switch to WIN1.
       B: Back to previous window.
       R: Refresh to show the latest data.

OPTIONS

       The following options are supported by numatop:

       -s sampling_precision
       Specifies the sampling precision. Valid values are:
       normal: balance precision and overhead (default)
       high: high sampling precision (high overhead)
       low: low sampling precision, suitable for a heavily loaded system

       -l log_level
       Specifies the level of logging in the log file. Valid values are:
       1: unknown (reserved for future use)
       2: all

       -f log_file
       Specifies the log file where output will be written. If the log file is
       not writable, the tool prints "Cannot open '<file name>' for
       writting.".

       -d dump_file
       Specifies the dump file where the screen data will be written. The dump
       file is generally used for automated testing. If the dump file is not
       writable, the tool prints "Cannot open <file name> for dump writing."

       -h
       Displays the command's usage.

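       For scripted use, the options above can be assembled into an argument
       list before launching the tool. The following is a hypothetical
       helper, not something shipped with numatop: the function name
       build_numatop_argv and its validation rules simply mirror the values
       listed in this section.

```python
VALID_PRECISION = ("normal", "high", "low")
VALID_LOG_LEVEL = (1, 2)


def build_numatop_argv(precision=None, log_level=None,
                       log_file=None, dump_file=None):
    """Return an argv list for numatop built from the documented options."""
    argv = ["numatop"]
    if precision is not None:
        if precision not in VALID_PRECISION:
            raise ValueError(f"unsupported sampling precision: {precision!r}")
        argv += ["-s", precision]
    if log_level is not None:
        if log_level not in VALID_LOG_LEVEL:
            raise ValueError(f"unsupported log level: {log_level!r}")
        argv += ["-l", str(log_level)]
    if log_file is not None:
        argv += ["-f", log_file]
    if dump_file is not None:
        argv += ["-d", dump_file]
    return argv


argv = build_numatop_argv(precision="high", log_level=2,
                          log_file="/tmp/numatop.log")
print(" ".join(argv))
```

       The resulting list could be handed to subprocess.run() on a machine
       where numatop is installed.
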

EXAMPLES

       Example 1: Launch numatop with high sampling precision
       numatop -s high

       Example 2: Write all warning messages to /tmp/numatop.log
       numatop -l 2 -f /tmp/numatop.log

       Example 3: Dump screen data to /tmp/dump.log
       numatop -d /tmp/dump.log

EXIT STATUS

       0: successful operation.
       Other value: an error occurred.

USAGE

       You must have root privileges to run numatop, or
       /proc/sys/kernel/perf_event_paranoid must be set to -1.

       Note: The perf_event_paranoid setting has security implications, and a
       non-root user probably does not have the authority to access /proc. It
       is highly recommended that the user run numatop as root.

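       The two conditions above can be checked before starting the tool. The
       sketch below encodes exactly the rule stated in this section (root,
       or perf_event_paranoid set to -1). The function names are
       hypothetical and numatop performs its own checks independently.

```python
import os

PARANOID_PATH = "/proc/sys/kernel/perf_event_paranoid"


def may_run_numatop(euid, paranoid):
    """Apply the rule from this section: root, or perf_event_paranoid == -1."""
    return euid == 0 or paranoid == -1


def read_paranoid(path=PARANOID_PATH):
    """Read the current perf_event_paranoid level as an integer."""
    with open(path) as handle:
        return int(handle.read().strip())


if __name__ == "__main__":
    try:
        level = read_paranoid()
    except (OSError, ValueError):
        level = None  # file missing or unreadable; only root can qualify
    print("ok to run:", may_run_numatop(os.geteuid(), level))
```
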

VERSION

       numatop requires a patch set to support PEBS Load Latency functionality
       in the kernel. The patch set has not been integrated in 3.8. It will
       probably be integrated in 3.9. The following steps show how to get and
       apply the patch set.

       1. git clone git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
       2. cd tip
       3. git checkout perf/x86
       4. build kernel as usual

       numatop supports the following Intel Xeon processors: 5500-series,
       6500/7500-series, 5600-series, E7-x8xx-series, and
       E5-16xx/24xx/26xx/46xx-series. Note: CPU microcode version 0x618 or
       0x70c or later is required on E5-16xx/24xx/26xx/46xx-series.

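       On x86 Linux systems the running microcode revision is typically
       listed in the "microcode" field of /proc/cpuinfo. The sketch below
       shows one way to extract it and compare it with a minimum. The
       function names and the sample text are hypothetical. This page does
       not say which of the two revisions applies to which model, so the
       minimum is left to the caller.

```python
def microcode_revisions(cpuinfo_text):
    """Collect the distinct microcode revisions found in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "microcode":
            revisions.add(int(value.strip(), 16))  # e.g. "0x710" -> 1808
    return revisions


def meets_minimum(revisions, minimum):
    """True if every reported revision is at least the required minimum."""
    return bool(revisions) and min(revisions) >= minimum


# Hypothetical two-CPU excerpt in /proc/cpuinfo format.
SAMPLE = "processor\t: 0\nmicrocode\t: 0x710\n\nprocessor\t: 1\nmicrocode\t: 0x710\n"
revs = microcode_revisions(SAMPLE)
print(sorted(hex(r) for r in revs), meets_minimum(revs, 0x70C))
```
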

                                 April 3, 2013                      NUMATOP(8)