1CPUSET(7) Linux Programmer's Manual CPUSET(7)
2
3
4
6 cpuset - confine processes to processor and memory node subsets
7
9 The cpuset file system is a pseudo-file-system interface to the kernel
10 cpuset mechanism, which is used to control the processor placement and
11 memory placement of processes. It is commonly mounted at /dev/cpuset.
12
13 On systems with kernels compiled with built-in support for cpusets, all
14 processes are attached to a cpuset, and cpusets are always present. If
15 a system supports cpusets, then it will have the entry nodev cpuset in
16 the file /proc/filesystems. By mounting the cpuset file system (see
17 the EXAMPLE section below), the administrator can configure the cpusets
18 on a system to control the processor and memory placement of processes
19 on that system. By default, if the cpuset configuration on a system is
20 not modified or if the cpuset file system is not even mounted, then the
21 cpuset mechanism, though present, has no effect on the system's
22 behavior.
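
The support check described above can be sketched as a small shell function. This is only an illustration: the function name has_cpuset_support is hypothetical, and it takes the table path as an optional argument so that it can be tried against a copy of /proc/filesystems.

```shell
# Hypothetical helper: succeed if the given filesystems table lists cpuset.
# The path argument defaults to /proc/filesystems.
has_cpuset_support() {
    table=${1:-/proc/filesystems}
    # Each line is "[nodev]<TAB>name"; compare the last column exactly.
    awk '$NF == "cpuset" { found = 1 } END { exit !found }' "$table"
}
```

A caller might write `has_cpuset_support && echo "cpusets available"` before attempting to mount the cpuset file system.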
23
24 A cpuset defines a list of CPUs and memory nodes.
25
26 The CPUs of a system include all the logical processing units on which
27 a process can execute, including, if present, multiple processor cores
28 within a package and Hyper-Threads within a processor core. Memory
29 nodes include all distinct banks of main memory; small and SMP systems
30 typically have just one memory node that contains all the system's main
31 memory, while NUMA (non-uniform memory access) systems have multiple
32 memory nodes.
33
34 Cpusets are represented as directories in a hierarchical pseudo-file
35 system, where the top directory in the hierarchy (/dev/cpuset) repre‐
36 sents the entire system (all online CPUs and memory nodes) and any
37 cpuset that is the child (descendant) of another parent cpuset contains
38 a subset of that parent's CPUs and memory nodes. The directories and
39 files representing cpusets have normal file-system permissions.
40
41 Every process in the system belongs to exactly one cpuset. A process
42 is confined to only run on the CPUs in the cpuset it belongs to, and to
43 allocate memory only on the memory nodes in that cpuset. When a
44 process fork(2)s, the child process is placed in the same cpuset as its
45 parent. With sufficient privilege, a process may be moved from one
46 cpuset to another and the allowed CPUs and memory nodes of an existing
47 cpuset may be changed.
48
49 When the system begins booting, a single cpuset is defined that
50 includes all CPUs and memory nodes on the system, and all processes are
51 in that cpuset. During the boot process, or later during normal system
52 operation, other cpusets may be created, as subdirectories of this top
53 cpuset, under the control of the system administrator, and processes
54 may be placed in these other cpusets.
55
56 Cpusets are integrated with the sched_setaffinity(2) scheduling affin‐
57 ity mechanism and the mbind(2) and set_mempolicy(2) memory-placement
58 mechanisms in the kernel. None of these mechanisms lets a process
59 make use of a CPU or memory node that is not allowed by that process's
60 cpuset. If changes to a process's cpuset placement conflict with these
61 other mechanisms, then cpuset placement is enforced even if it means
62 overriding these other mechanisms. The kernel accomplishes this over‐
63 riding by silently restricting the CPUs and memory nodes requested by
64 these other mechanisms to those allowed by the invoking process's
65 cpuset. This can result in these other calls returning an error if,
66 for example, such a call ends up requesting an empty set of CPUs or
67 memory nodes after that request is restricted to the invoking
68 process's cpuset.
69
70 Typically, a cpuset is used to manage the CPU and memory-node confine‐
71 ment for a set of cooperating processes such as a batch scheduler job,
72 and these other mechanisms are used to manage the placement of individ‐
73 ual processes or memory regions within that set or job.
74
76 Each directory below /dev/cpuset represents a cpuset and contains a
77 fixed set of pseudo-files describing the state of that cpuset.
78
79 New cpusets are created using the mkdir(2) system call or the mkdir(1)
80 command. The properties of a cpuset, such as its flags, allowed CPUs
81 and memory nodes, and attached processes, are queried and modified by
82 reading or writing to the appropriate file in that cpuset's directory,
83 as listed below.
84
85 The pseudo-files in each cpuset directory are automatically created
86 when the cpuset is created, as a result of the mkdir(2) invocation. It
87 is not possible to directly add or remove these pseudo-files.
88
89 A cpuset directory that contains no child cpuset directories, and has
90 no attached processes, can be removed using rmdir(2) or rmdir(1). It
91 is not necessary, or possible, to remove the pseudo-files inside the
92 directory before removing it.
93
94 The pseudo-files in each cpuset directory are small text files that may
95 be read and written using traditional shell utilities such as cat(1),
96 and echo(1), or from a program by using file I/O library functions or
97 system calls, such as open(2), read(2), write(2), and close(2).
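
As a rough sketch of the directory and pseudo-file operations described above, the following function creates a cpuset and sets its CPUs and memory nodes. The function name cpuset_create and its argument order are assumptions made for illustration; the mount point is passed in rather than assumed.

```shell
# Hypothetical helper: create a cpuset and set its CPUs and memory nodes.
# root is the mount point of the cpuset file system (often /dev/cpuset).
cpuset_create() {
    root=$1 name=$2 cpus=$3 mems=$4
    mkdir "$root/$name" || return 1
    # On a real cpuset file system these pseudo-files already exist
    # after mkdir; writing to them updates kernel state.
    echo "$cpus" > "$root/$name/cpus" || return 1
    echo "$mems" > "$root/$name/mems" || return 1
}
```

For example, `cpuset_create /dev/cpuset batch 4-7 1` would, given sufficient privilege, create a cpuset named batch (a made-up name) restricted to CPUs 4-7 and memory node 1.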
98
99 The pseudo-files in a cpuset directory represent internal kernel state
100 and do not have any persistent image on disk. Each of these per-cpuset
101 files is listed and described below.
102
103 tasks List of the process IDs (PIDs) of the processes in that cpuset.
104 The list is formatted as a series of ASCII decimal numbers, each
105 followed by a newline. A process may be added to a cpuset
106 (automatically removing it from the cpuset that previously con‐
107 tained it) by writing its PID to that cpuset's tasks file (with
108 or without a trailing newline).
109
110 Warning: only one PID may be written to the tasks file at a
111 time. If a string is written that contains more than one PID,
112 only the first one will be used.
113
114 notify_on_release
115 Flag (0 or 1). If set (1), that cpuset will receive special
116 handling after it is released, that is, after all processes
117 cease using it (i.e., terminate or are moved to a different
118 cpuset) and all child cpuset directories have been removed. See
119 the Notify On Release section, below.
120
121 cpus List of the physical numbers of the CPUs on which processes in
122 that cpuset are allowed to execute. See List Format below for a
123 description of the format of cpus.
124
125 The CPUs allowed to a cpuset may be changed by writing a new
126 list to its cpus file.
127
128 cpu_exclusive
129 Flag (0 or 1). If set (1), the cpuset has exclusive use of its
130 CPUs (no sibling or cousin cpuset may overlap CPUs). By default
131 this is off (0). Newly created cpusets also initially default
132 this to off (0).
133
134 Two cpusets are sibling cpusets if they share the same parent
135 cpuset in the /dev/cpuset hierarchy. Two cpusets are cousin
136 cpusets if neither is the ancestor of the other. Regardless of
137 the cpu_exclusive setting, if one cpuset is the ancestor of
138 another, and if both of these cpusets have nonempty cpus, then
139 their cpus must overlap, because the cpus of any cpuset are
140 always a subset of the cpus of its parent cpuset.
141
142 mems List of memory nodes on which processes in this cpuset are
143 allowed to allocate memory. See List Format below for a
144 description of the format of mems.
145
146 mem_exclusive
147 Flag (0 or 1). If set (1), the cpuset has exclusive use of its
148 memory nodes (no sibling or cousin may overlap). Also if set
149 (1), the cpuset is a Hardwall cpuset (see below). By default
150 this is off (0). Newly created cpusets also initially default
151 this to off (0).
152
153 Regardless of the mem_exclusive setting, if one cpuset is the
154 ancestor of another, then their memory nodes must overlap,
155 because the memory nodes of any cpuset are always a subset of
156 the memory nodes of that cpuset's parent cpuset.
157
158 mem_hardwall (since Linux 2.6.26)
159 Flag (0 or 1). If set (1), the cpuset is a Hardwall cpuset (see
160 below). Unlike mem_exclusive, there is no constraint on whether
161 cpusets marked mem_hardwall may have overlapping memory nodes
162 with sibling or cousin cpusets. By default this is off (0).
163 Newly created cpusets also initially default this to off (0).
164
165 memory_migrate (since Linux 2.6.16)
166 Flag (0 or 1). If set (1), then memory migration is enabled.
167 By default this is off (0). See the Memory Migration section,
168 below.
169
170 memory_pressure (since Linux 2.6.16)
171 A measure of how much memory pressure the processes in this
172 cpuset are causing. See the Memory Pressure section, below.
173 Unless memory_pressure_enabled is enabled, always has value zero
174 (0). This file is read-only. See the WARNINGS section, below.
175
176 memory_pressure_enabled (since Linux 2.6.16)
177 Flag (0 or 1). This file is only present in the root cpuset,
178 normally /dev/cpuset. If set (1), the memory_pressure calcula‐
179 tions are enabled for all cpusets in the system. By default
180 this is off (0). See the Memory Pressure section, below.
181
182 memory_spread_page (since Linux 2.6.17)
183 Flag (0 or 1). If set (1), pages in the kernel page cache
184 (file-system buffers) are uniformly spread across the cpuset.
185 By default this is off (0) in the top cpuset, and inherited from
186 the parent cpuset in newly created cpusets. See the Memory
187 Spread section, below.
188
189 memory_spread_slab (since Linux 2.6.17)
190 Flag (0 or 1). If set (1), the kernel slab caches for file I/O
191 (directory and inode structures) are uniformly spread across the
192 cpuset. By default this is off (0) in the top cpuset, and
193 inherited from the parent cpuset in newly created cpusets. See
194 the Memory Spread section, below.
195
196 sched_load_balance (since Linux 2.6.24)
197 Flag (0 or 1). If set (1, the default) the kernel will automat‐
198 ically load balance processes in that cpuset over the allowed
199 CPUs in that cpuset. If cleared (0) the kernel will avoid load
200 balancing processes in this cpuset, unless some other cpuset
201 with overlapping CPUs has its sched_load_balance flag set. See
202 Scheduler Load Balancing, below, for further details.
203
204 sched_relax_domain_level (since Linux 2.6.26)
205 Integer, between -1 and a small positive value. The
206 sched_relax_domain_level controls the width of the range of CPUs
207 over which the kernel scheduler performs immediate rebalancing
208 of runnable tasks across CPUs. If sched_load_balance is dis‐
209 abled, then the setting of sched_relax_domain_level does not
210 matter, as no such load balancing is done. If sched_load_bal‐
211 ance is enabled, then the higher the value of the
212 sched_relax_domain_level, the wider the range of CPUs over which
213 immediate load balancing is attempted. See Scheduler Relax
214 Domain Level, below, for further details.
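
Because the tasks file accepts only one PID per write, moving several processes takes one write each. The following is a minimal sketch of that loop; the function name cpuset_attach is hypothetical, and the caller supplies the full path of the target tasks file.

```shell
# Hypothetical helper: attach each listed PID to the cpuset whose
# tasks file is given as the first argument.
cpuset_attach() {
    tasks_file=$1
    shift
    for pid in "$@"; do
        # One write per PID: only the first PID in a write is used.
        echo "$pid" > "$tasks_file" || return 1
    done
}
```

A process could move itself with `cpuset_attach /dev/cpuset/batch/tasks $$`, where batch is a made-up cpuset name.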
215
216 In addition to the above pseudo-files in each directory below
217 /dev/cpuset, each process has a pseudo-file, /proc/<pid>/cpuset, that
218 displays the path of the process's cpuset directory relative to the
219 root of the cpuset file system.
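
Since that path is relative to the root of the cpuset file system, user code has to prepend the mount point to obtain a usable directory name. A minimal sketch follows; the function name cpuset_dir_of is hypothetical, and the default mount point /dev/cpuset is an assumption that the caller can override.

```shell
# Hypothetical helper: print the cpuset directory for a process, given
# its /proc/<pid>/cpuset file and the cpuset mount point.
cpuset_dir_of() {
    proc_file=$1
    root=${2:-/dev/cpuset}
    rel=$(cat "$proc_file") || return 1
    # The root cpuset is reported as "/"; avoid a trailing slash.
    case $rel in
        /) echo "$root" ;;
        *) echo "$root$rel" ;;
    esac
}
```

For the current shell, this would be called as `cpuset_dir_of /proc/$$/cpuset`.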
220
221 Also the /proc/<pid>/status file for each process has four added lines,
222 displaying the process's Cpus_allowed (on which CPUs it may be sched‐
223 uled) and Mems_allowed (on which memory nodes it may obtain memory), in
224 the two formats Mask Format and List Format (see below) as shown in the
225 following example:
226
227 Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
228 Cpus_allowed_list: 0-127
229 Mems_allowed: ffffffff,ffffffff
230 Mems_allowed_list: 0-63
231
232 The "allowed" fields were added in Linux 2.6.24; the "allowed_list"
233 fields were added in Linux 2.6.26.
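
These fields can be extracted with ordinary text tools. The sketch below uses awk to print the value of one named field; the function name status_field is hypothetical, and the field is matched exactly so that Cpus_allowed does not also match Cpus_allowed_list.

```shell
# Hypothetical helper: print the value of one field from a status file.
# The file argument defaults to /proc/self/status.
status_field() {
    field=$1
    status_file=${2:-/proc/self/status}
    # Split on the colon and any blanks or tabs that follow it.
    awk -F':[ \t]*' -v f="$field" '$1 == f { print $2 }' "$status_file"
}
```

For example, `status_field Mems_allowed_list /proc/$$/status` prints the memory nodes available to the current shell in List Format.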
234
236 In addition to controlling which cpus and mems a process is allowed to
237 use, cpusets provide the following extended capabilities.
238
239 Exclusive Cpusets
240 If a cpuset is marked cpu_exclusive or mem_exclusive, no other cpuset,
241 other than a direct ancestor or descendant, may share any of the same
242 CPUs or memory nodes.
243
244 A cpuset that is mem_exclusive restricts kernel allocations for buffer
245 cache pages and other internal kernel data pages commonly shared by the
246 kernel across multiple users. All cpusets, whether mem_exclusive or
247 not, restrict allocations of memory for user space. This enables con‐
248 figuring a system so that several independent jobs can share common
249 kernel data, while isolating each job's user allocation in its own
250 cpuset. To do this, construct a large mem_exclusive cpuset to hold all
251 the jobs, and construct child, non-mem_exclusive cpusets for each indi‐
252 vidual job. Only a small amount of kernel memory, such as requests
253 from interrupt handlers, is allowed to be placed on memory nodes out‐
254 side even a mem_exclusive cpuset.
255
256 Hardwall
257 A cpuset that has mem_exclusive or mem_hardwall set is a hardwall
258 cpuset. A hardwall cpuset restricts kernel allocations for page, buf‐
259 fer, and other data commonly shared by the kernel across multiple
260 users. All cpusets, whether hardwall or not, restrict allocations of
261 memory for user space.
262
263 This enables configuring a system so that several independent jobs can
264 share common kernel data, such as file system pages, while isolating
265 each job's user allocation in its own cpuset. To do this, construct a
266 large hardwall cpuset to hold all the jobs, and construct child cpusets
267 for each individual job which are not hardwall cpusets.
268
269 Only a small amount of kernel memory, such as requests from interrupt
270 handlers, is allowed to be taken outside even a hardwall cpuset.
271
272 Notify On Release
273 If the notify_on_release flag is enabled (1) in a cpuset, then whenever
274 the last process in the cpuset leaves (exits or attaches to some other
275 cpuset) and the last child cpuset of that cpuset is removed, the kernel
276 will run the command /sbin/cpuset_release_agent, supplying the pathname
277 (relative to the mount point of the cpuset file system) of the aban‐
278 doned cpuset. This enables automatic removal of abandoned cpusets.
279
280 The default value of notify_on_release in the root cpuset at system
281 boot is disabled (0). The default value of other cpusets at creation
282 is the current value of their parent's notify_on_release setting.
283
284 The command /sbin/cpuset_release_agent is invoked, with the name
285 (/dev/cpuset relative path) of the to-be-released cpuset in argv[1].
286
287 The usual content of the command /sbin/cpuset_release_agent is simply
288 the shell script:
289
290 #!/bin/sh
291 rmdir /dev/cpuset/$1
292
293 As with other flag values below, this flag can be changed by writing an
294 ASCII number 0 or 1 (with optional trailing newline) into the file, to
295 clear or set the flag, respectively.
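
Writing a flag can be wrapped in a small function that rejects anything other than 0 or 1 before touching the file. This is a sketch only; the function name cpuset_set_flag is hypothetical.

```shell
# Hypothetical helper: set or clear a Boolean cpuset flag file.
cpuset_set_flag() {
    flag_file=$1 value=$2
    case $value in
        0|1) echo "$value" > "$flag_file" ;;
        *)   echo "flag value must be 0 or 1" >&2; return 1 ;;
    esac
}
```

For example, `cpuset_set_flag /dev/cpuset/batch/notify_on_release 1` would enable release notification for a cpuset named batch (a made-up name).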
296
297 Memory Pressure
298 The memory_pressure of a cpuset provides a simple per-cpuset running
299 average of the rate that the processes in a cpuset are attempting to
300 free up in-use memory on the nodes of the cpuset to satisfy additional
301 memory requests.
302
303 This enables batch managers that are monitoring jobs running in
304 dedicated cpusets to efficiently detect what level of memory pressure
305 each job is causing.
306
307 This is useful both on tightly managed systems running a wide mix of
308 submitted jobs, which may choose to terminate or reprioritize jobs that
309 are trying to use more memory than allowed on the nodes assigned them,
310 and with tightly coupled, long-running, massively parallel scientific
311 computing jobs that will dramatically fail to meet required performance
312 goals if they start to use more memory than allowed to them.
313
314 This mechanism provides a very economical way for the batch manager to
315 monitor a cpuset for signs of memory pressure. It's up to the batch
316 manager or other user code to decide what action to take if it detects
317 signs of memory pressure.
318
319 Unless memory pressure calculation is enabled by setting the pseudo-
320 file /dev/cpuset/memory_pressure_enabled, it is not computed for any
321 cpuset, and reads from any memory_pressure always return zero, as rep‐
322 resented by the ASCII string "0\n". See the WARNINGS section, below.
323
324 A per-cpuset, running average is employed for the following reasons:
325
326 * Because this meter is per-cpuset rather than per-process or per vir‐
327 tual memory region, the system load imposed by a batch scheduler
328 monitoring this metric is sharply reduced on large systems, because
329 a scan of the tasklist can be avoided on each set of queries.
330
331 * Because this meter is a running average rather than an accumulating
332 counter, a batch scheduler can detect memory pressure with a single
333 read, instead of having to read and accumulate results for a period
334 of time.
335
336 * Because this meter is per-cpuset rather than per-process, the batch
337 scheduler can obtain the key information — memory pressure in a
338 cpuset — with a single read, rather than having to query and accumu‐
339 late results over all the (dynamically changing) set of processes in
340 the cpuset.
341
342 The memory_pressure of a cpuset is calculated using a per-cpuset simple
343 digital filter that is kept within the kernel. For each cpuset, this
344 filter tracks the recent rate at which processes attached to that
345 cpuset enter the kernel direct reclaim code.
346
347 The kernel direct reclaim code is entered whenever a process has to
348 satisfy a memory page request by first finding some other page to
349 repurpose, due to lack of any readily available already free pages.
350 Dirty file system pages are repurposed by first writing them to disk.
351 Unmodified file system buffer pages are repurposed by simply dropping
352 them, though if that page is needed again, it will have to be reread
353 from disk.
354
355 The memory_pressure file provides an integer number representing the
356 recent (half-life of 10 seconds) rate of entries to the direct reclaim
357 code caused by any process in the cpuset, in units of reclaims
358 attempted per second, times 1000.
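
Because the value is scaled by 1000, a monitoring script has to divide it by 1000 to recover reclaim attempts per second. A minimal sketch of that conversion, with the hypothetical function name pressure_per_second:

```shell
# Hypothetical helper: convert a raw memory_pressure reading into
# direct reclaim attempts per second (the file value is scaled by 1000).
pressure_per_second() {
    awk -v raw="$1" 'BEGIN { printf "%.3f\n", raw / 1000 }'
}
```

A reading of 2500 therefore corresponds to 2.5 reclaim attempts per second. A batch manager might call it as `pressure_per_second "$(cat /dev/cpuset/batch/memory_pressure)"`, where batch is a made-up cpuset name.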
359
360 Memory Spread
361 There are two Boolean flag files per cpuset that control where the ker‐
362 nel allocates pages for the file-system buffers and related in-kernel
363 data structures. They are called memory_spread_page and mem‐
364 ory_spread_slab.
365
366 If the per-cpuset Boolean flag file memory_spread_page is set, then the
367 kernel will spread the file-system buffers (page cache) evenly over all
368 the nodes that the faulting process is allowed to use, instead of pre‐
369 ferring to put those pages on the node where the process is running.
370
371 If the per-cpuset Boolean flag file memory_spread_slab is set, then the
372 kernel will spread some file-system-related slab caches, such as those
373 for inodes and directory entries, evenly over all the nodes that the
374 faulting process is allowed to use, instead of preferring to put those
375 pages on the node where the process is running.
376
377 The setting of these flags does not affect the data segment (see
378 brk(2)) or stack segment pages of a process.
379
380 By default, both kinds of memory spreading are off and the kernel
381 prefers to allocate memory pages on the node local to where the
382 requesting process is running. If that node is not allowed by the
383 process's NUMA memory policy or cpuset configuration or if there are
384 insufficient free memory pages on that node, then the kernel looks for
385 the nearest node that is allowed and has sufficient free memory.
386
387 When new cpusets are created, they inherit the memory spread settings
388 of their parent.
389
390 Setting memory spreading causes allocations for the affected page or
391 slab caches to ignore the process's NUMA memory policy and be spread
392 instead. However, the effect of these changes in memory placement
393 caused by cpuset-specified memory spreading is hidden from the mbind(2)
394 or set_mempolicy(2) calls. These two NUMA memory policy calls always
395 appear to behave as if no cpuset-specified memory spreading is in
396 effect, even if it is. If cpuset memory spreading is subsequently
397 turned off, the NUMA memory policy most recently specified by these
398 calls is automatically reapplied.
399
400 Both memory_spread_page and memory_spread_slab are Boolean flag files.
401 By default they contain "0", meaning that the feature is off for that
402 cpuset. If a "1" is written to that file, that turns the named feature
403 on.
404
405 Cpuset-specified memory spreading behaves similarly to what is known
406 (in other contexts) as round-robin or interleave memory placement.
407
408 Cpuset-specified memory spreading can provide substantial performance
409 improvements for jobs that:
410
411 a) need to place thread-local data on memory nodes close to the CPUs
412 which are running the threads that most frequently access that data;
413 but also
414
415 b) need to access large file-system data sets that must be spread
416 across the several nodes in the job's cpuset in order to fit.
417
418 Without this policy, the memory allocation across the nodes in the
419 job's cpuset can become very uneven, especially for jobs that might
420 have just a single thread initializing or reading in the data set.
421
422 Memory Migration
423 Normally, under the default setting (disabled) of memory_migrate, once
424 a page is allocated (given a physical page of main memory) then that
425 page stays on whatever node it was allocated on, so long as it remains
426 allocated, even if the cpuset's memory-placement policy mems subse‐
427 quently changes.
428
429 When memory migration is enabled in a cpuset, if the mems setting of
430 the cpuset is changed, then any memory page in use by any process in
431 the cpuset that is on a memory node that is no longer allowed will be
432 migrated to a memory node that is allowed.
433
434 Furthermore, if a process is moved into a cpuset with memory_migrate
435 enabled, any memory pages it uses that were on memory nodes allowed in
436 its previous cpuset, but which are not allowed in its new cpuset, will
437 be migrated to a memory node allowed in the new cpuset.
438
439 The relative placement of a migrated page within the cpuset is pre‐
440 served during these migration operations if possible. For example, if
441 the page was on the second valid node of the prior cpuset, then the
442 page will be placed on the second valid node of the new cpuset, if pos‐
443 sible.
444
445 Scheduler Load Balancing
446 The kernel scheduler automatically load balances processes. If one CPU
447 is underutilized, the kernel will look for processes on other more
448 overloaded CPUs and move those processes to the underutilized CPU,
449 within the constraints of such placement mechanisms as cpusets and
450 sched_setaffinity(2).
451
452 The algorithmic cost of load balancing and its impact on key shared
453 kernel data structures such as the process list increase more than
454 linearly with the number of CPUs being balanced. For example, it costs
455 more to load balance across one large set of CPUs than it does to bal‐
456 ance across two smaller sets of CPUs, each of half the size of the
457 larger set. (The precise relationship between the number of CPUs being
458 balanced and the cost of load balancing depends on implementation
459 details of the kernel process scheduler, which is subject to change
460 over time, as improved kernel scheduler algorithms are implemented.)
461
462 The per-cpuset flag sched_load_balance provides a mechanism to suppress
463 this automatic scheduler load balancing in cases where it is not needed
464 and suppressing it would have worthwhile performance benefits.
465
466 By default, load balancing is done across all CPUs, except those marked
467 isolated using the kernel boot time "isolcpus=" argument. (See Sched‐
468 uler Relax Domain Level, below, to change this default.)
469
470 This default load balancing across all CPUs is not well suited to the
471 following two situations:
472
473 * On large systems, load balancing across many CPUs is expensive. If
474 the system is managed using cpusets to place independent jobs on
475 separate sets of CPUs, full load balancing is unnecessary.
476
477 * Systems supporting real-time on some CPUs need to minimize system
478 overhead on those CPUs, including avoiding process load balancing if
479 that is not needed.
480
481 When the per-cpuset flag sched_load_balance is enabled (the default
482 setting), it requests load balancing across all the CPUs in that
483 cpuset's allowed CPUs, ensuring that load balancing can move a process
484 (not otherwise pinned, as by sched_setaffinity(2)) from any CPU in that
485 cpuset to any other.
486
487 When the per-cpuset flag sched_load_balance is disabled, then the
488 scheduler will avoid load balancing across the CPUs in that cpuset,
489 except in so far as is necessary because some overlapping cpuset has
490 sched_load_balance enabled.
491
492 So, for example, if the top cpuset has the flag sched_load_balance
493 enabled, then the scheduler will load balance across all CPUs, and the
494 setting of the sched_load_balance flag in other cpusets has no effect,
495 as we're already fully load balancing.
496
497 Therefore in the above two situations, the flag sched_load_balance
498 should be disabled in the top cpuset, and only some of the smaller,
499 child cpusets would have this flag enabled.
500
501 When doing this, you don't usually want to leave any unpinned processes
502 in the top cpuset that might use nontrivial amounts of CPU, as such
503 processes may be artificially constrained to some subset of CPUs,
504 depending on the particulars of this flag setting in descendant
505 cpusets. Even if such a process could use spare CPU cycles in some
506 other CPUs, the kernel scheduler might not consider the possibility of
507 load balancing that process to the underused CPU.
508
509 Of course, processes pinned to a particular CPU can be left in a cpuset
510 that disables sched_load_balance as those processes aren't going any‐
511 where else anyway.
512
513 Scheduler Relax Domain Level
514 The kernel scheduler performs immediate load balancing whenever a CPU
515 becomes free or another task becomes runnable. This load balancing
516 works to ensure that as many CPUs as possible are usefully employed
517 running tasks. The kernel also performs periodic load balancing off
518 the software clock described in time(7). The setting of
519 sched_relax_domain_level only applies to immediate load balancing.
520 Regardless of the sched_relax_domain_level setting, periodic load bal‐
521 ancing is attempted over all CPUs (unless disabled by turning off
522 sched_load_balance.) In any case, of course, tasks will only be sched‐
523 uled to run on CPUs allowed by their cpuset, as modified by
524 sched_setaffinity(2) system calls.
525
526 On small systems, such as those with just a few CPUs, immediate load
527 balancing is useful to improve system interactivity and to minimize
528 wasteful idle CPU cycles. But on large systems, attempting immediate
529 load balancing across a large number of CPUs can be more costly than it
530 is worth, depending on the particular performance characteristics of
531 the job mix and the hardware.
532
533 The exact meaning of the small integer values of
534 sched_relax_domain_level will depend on internal implementation details
535 of the kernel scheduler code and on the non-uniform architecture of the
536 hardware. Both of these will evolve over time and vary by system
537 architecture and kernel version.
538
539 As of this writing, when this capability was introduced in Linux
540 2.6.26, on certain popular architectures, the positive values of
541 sched_relax_domain_level have the following meanings.
542
543 (1) Perform immediate load balancing across Hyper-Thread siblings on
544 the same core.
545 (2) Perform immediate load balancing across other cores in the same
546 package.
547 (3) Perform immediate load balancing across other CPUs on the same node
548 or blade.
549 (4) Perform immediate load balancing across several (implementation
550 detail) nodes [on NUMA systems].
551 (5) Perform immediate load balancing across all CPUs in the system [on
552 NUMA systems].
553
554 The sched_relax_domain_level value of zero (0) always means don't per‐
555 form immediate load balancing, hence that load balancing is only done
556 periodically, not immediately when a CPU becomes available or another
557 task becomes runnable.
558
559 The sched_relax_domain_level value of minus one (-1) always means use
560 the system default value. The system default value can vary by archi‐
561 tecture and kernel version. This system default value can be changed
562 by the kernel boot-time "relax_domain_level=" argument.
563
564 In the case of multiple overlapping cpusets which have conflicting
565 sched_relax_domain_level values, the highest such value applies to
566 all CPUs in any of the overlapping cpusets. In such cases, the value
567 minus one (-1) is the lowest value, overridden by any other value, and
568 the value zero (0) is the next lowest value.
569
571 The following formats are used to represent sets of CPUs and memory
572 nodes.
573
574 Mask Format
575 The Mask Format is used to represent CPU and memory-node bitmasks in
576 the /proc/<pid>/status file.
577
578 This format displays each 32-bit word in hexadecimal (using ASCII char‐
579 acters "0" - "9" and "a" - "f"); words are filled with leading zeros,
580 if required. For masks longer than one word, a comma separator is used
581 between words. Words are displayed in big-endian order, which has the
582 most significant bit first. The hex digits within a word are also in
583 big-endian order.
584
585 The number of 32-bit words displayed is the minimum number needed to
586 display all bits of the bitmask, based on the size of the bitmask.
587
588 Examples of the Mask Format:
589
590 00000001 # just bit 0 set
591 40000000,00000000,00000000 # just bit 94 set
592 00000001,00000000,00000000 # just bit 64 set
593 000000ff,00000000 # bits 32-39 set
594 00000000,000e3862 # 1,5,6,11-13,17-19 set
595
596 A mask with bits 0, 1, 2, 4, 8, 16, 32, and 64 set displays as:
597
598 00000001,00000001,00010117
599
600 The first "1" is for bit 64, the second for bit 32, the third for bit
601 16, the fourth for bit 8, the fifth for bit 4, and the "7" is for bits
602 2, 1, and 0.
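
The decoding walked through above can be sketched in portable shell arithmetic. The function name mask_to_bits is hypothetical; it processes the comma-separated words from right to left, since the rightmost word holds bits 0-31.

```shell
# Hypothetical helper: print the set bit numbers of a Mask Format string.
mask_to_bits() {
    mask=$1
    bits=""
    word_index=0
    # Walk the words from right (least significant) to left.
    while [ -n "$mask" ]; do
        word=${mask##*,}
        case $mask in
            *,*) mask=${mask%,*} ;;
            *)   mask="" ;;
        esac
        value=$((0x$word))
        bit=0
        while [ "$bit" -lt 32 ]; do
            if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
                bits="$bits $((word_index * 32 + bit))"
            fi
            bit=$((bit + 1))
        done
        word_index=$((word_index + 1))
    done
    echo "${bits# }"
}
```

Calling `mask_to_bits 00000001,00000001,00010117` prints `0 1 2 4 8 16 32 64`, matching the example mask above.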
603
604 List Format
605 The List Format for cpus and mems is a comma-separated list of CPU or
606 memory-node numbers and ranges of numbers, in ASCII decimal.
607
608 Examples of the List Format:
609
610 0-4,9 # bits 0, 1, 2, 3, 4, and 9 set
611 0-2,7,12-14 # bits 0, 1, 2, 7, 12, 13, and 14 set
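
Scripts that need individual CPU or node numbers have to expand the ranges in a List Format string. The following is a minimal sketch with the hypothetical function name expand_list; the function body runs in a subshell so that changing IFS does not affect the caller.

```shell
# Hypothetical helper: expand a List Format string into single numbers.
expand_list() (
    list=$1
    out=""
    IFS=,
    for item in $list; do
        case $item in
            *-*) first=${item%-*}; last=${item#*-} ;;
            *)   first=$item; last=$item ;;
        esac
        n=$first
        while [ "$n" -le "$last" ]; do
            out="$out $n"
            n=$((n + 1))
        done
    done
    echo "${out# }"
)
```

Calling `expand_list 0-2,7,12-14` prints `0 1 2 7 12 13 14`, as in the second example above.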
612
614 The following rules apply to each cpuset:
615
616 * Its CPUs and memory nodes must be a (possibly equal) subset of its
617 parent's.
618
619 * It can only be marked cpu_exclusive if its parent is.
620
621 * It can only be marked mem_exclusive if its parent is.
622
623 * If it is cpu_exclusive, its CPUs may not overlap any sibling.
624
625 * If it is mem_exclusive, its memory nodes may not overlap any
626 sibling.
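
The subset rule can be checked in user code before writing a new cpus or mems list. The sketch below compares two List Format strings with awk; the function name list_is_subset is hypothetical, and the kernel remains the authority on whether a write is accepted.

```shell
# Hypothetical helper: succeed if every number in the child list also
# appears in the parent list (both in List Format).
list_is_subset() {
    awk -v child="$1" -v parent="$2" '
    function mark(list, set,    n, items, i, r, lo, hi, k) {
        n = split(list, items, ",")
        for (i = 1; i <= n; i++) {
            if (split(items[i], r, "-") == 2) { lo = r[1]; hi = r[2] }
            else { lo = hi = items[i] }
            for (k = lo + 0; k <= hi + 0; k++) set[k] = 1
        }
    }
    BEGIN {
        mark(child, c); mark(parent, p)
        for (k in c) if (!(k in p)) exit 1
        exit 0
    }'
}
```

For example, `list_is_subset 0-1 0-3` succeeds, while `list_is_subset 0-4 0-3` fails because CPU 4 is not in the parent list.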
627
629 The permissions of a cpuset are determined by the permissions of the
630 directories and pseudo-files in the cpuset file system, normally
631 mounted at /dev/cpuset.
632
633 For instance, a process can put itself in some other cpuset (than its
634 current one) if it can write the tasks file for that cpuset. This
635 requires execute permission on the encompassing directories and write
636 permission on the tasks file.
637
638 An additional constraint is applied to requests to place some other
639 process in a cpuset. One process may not attach another to a cpuset
640 unless it would have permission to send that process a signal (see
641 kill(2)).
642
A process may create a child cpuset if it can access and write the
parent cpuset directory. It can modify the CPUs or memory nodes in a
cpuset if it can access that cpuset's directory (execute permission on
each of the parent directories) and write the corresponding cpus or
mems file.
648
649 There is one minor difference between the manner in which these permis‐
650 sions are evaluated and the manner in which normal file-system opera‐
651 tion permissions are evaluated. The kernel interprets relative path‐
652 names starting at a process's current working directory. Even if one
653 is operating on a cpuset file, relative pathnames are interpreted rela‐
654 tive to the process's current working directory, not relative to the
655 process's current cpuset. The only ways that cpuset paths relative to
656 a process's current cpuset can be used are if either the process's cur‐
657 rent working directory is its cpuset (it first did a cd or chdir(2) to
658 its cpuset directory beneath /dev/cpuset, which is a bit unusual) or if
659 some user code converts the relative cpuset path to a full file-system
660 path.
661
In theory, this means that user code should specify cpusets using
absolute pathnames, which requires knowing the mount point of the
cpuset file system (usually, but not necessarily, /dev/cpuset). In
practice, all user-level code that this author is aware of simply
assumes that if the cpuset file system is mounted, then it is mounted
at /dev/cpuset. Furthermore, carefully written user code commonly
checks for the pseudo-file /dev/cpuset/tasks to verify that the cpuset
pseudo-file system is currently mounted.
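
The two conventions just described, checking for the tasks pseudo-file
and converting a cpuset path into a full file-system path, might be
sketched as follows. The variable CPUSET_ROOT and both function names
are hypothetical; /dev/cpuset is only the customary mount point.

```shell
# Assumed mount point; set CPUSET_ROOT beforehand if it is mounted elsewhere.
CPUSET_ROOT=${CPUSET_ROOT:-/dev/cpuset}

# Succeed if the cpuset file system appears to be mounted at CPUSET_ROOT.
cpuset_mounted() {
    [ -f "$CPUSET_ROOT/tasks" ]
}

# Convert a cpuset path (as shown in /proc/<pid>/cpuset, for example
# "/Charlie") into a full file-system path.
cpuset_abspath() {
    case $1 in
        /*) printf '%s\n' "$CPUSET_ROOT$1" ;;
        *)  printf '%s\n' "$CPUSET_ROOT/$1" ;;
    esac
}

if cpuset_mounted; then
    echo "cpuset file system found at $CPUSET_ROOT"
else
    echo "cpuset file system not found at $CPUSET_ROOT"
fi
```

With the default mount point, cpuset_abspath /Charlie prints
/dev/cpuset/Charlie.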
670
672 Enabling memory_pressure
673 By default, the per-cpuset file memory_pressure always contains zero
674 (0). Unless this feature is enabled by writing "1" to the pseudo-file
675 /dev/cpuset/memory_pressure_enabled, the kernel does not compute per-
676 cpuset memory_pressure.
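
A minimal sketch of enabling the feature from a script follows. The
function name enable_memory_pressure is hypothetical, and the mount
point defaults to /dev/cpuset unless another directory is given.

```shell
# enable_memory_pressure [ROOT]: write "1" to memory_pressure_enabled in
# the top cpuset and print the value read back.
enable_memory_pressure() {
    root=${1:-/dev/cpuset}      # assumed mount point
    /bin/echo 1 > "$root/memory_pressure_enabled" || return 1
    cat "$root/memory_pressure_enabled"
}

# Example (requires a mounted cpuset file system and write permission):
#   enable_memory_pressure
```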
677
678 Using the echo command
679 When using the echo command at the shell prompt to change the values of
680 cpuset files, beware that the built-in echo command in some shells does
681 not display an error message if the write(2) system call fails. For
682 example, if the command:
683
    echo 19 > mems
685
686 failed because memory node 19 was not allowed (perhaps the current sys‐
687 tem does not have a memory node 19), then the echo command might not
688 display any error. It is better to use the /bin/echo external command
689 to change cpuset file settings, as this command will display write(2)
690 errors, as in the example:
691
    /bin/echo 19 > mems
    /bin/echo: write error: Invalid argument
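
In a script, the exit status of /bin/echo can be tested as well, so
that a failed write is not silently ignored. The following is a
sketch; the function name write_setting is hypothetical.

```shell
# write_setting VALUE FILE: write VALUE to a cpuset pseudo-file and
# report failure through the exit status.
write_setting() {
    if ! /bin/echo "$1" > "$2"; then
        echo "write_setting: could not write '$1' to $2" >&2
        return 1
    fi
}

# Example (inside a cpuset directory):
#   write_setting 19 mems || echo "memory node 19 was rejected"
```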
694
696 Memory placement
697 Not all allocations of system memory are constrained by cpusets, for
698 the following reasons.
699
700 If hot-plug functionality is used to remove all the CPUs that are cur‐
701 rently assigned to a cpuset, then the kernel will automatically update
702 the cpus_allowed of all processes attached to CPUs in that cpuset to
703 allow all CPUs. When memory hot-plug functionality for removing memory
704 nodes is available, a similar exception is expected to apply there as
705 well. In general, the kernel prefers to violate cpuset placement,
706 rather than starving a process that has had all its allowed CPUs or
707 memory nodes taken offline. User code should reconfigure cpusets to
708 only refer to online CPUs and memory nodes when using hot-plug to add
709 or remove such resources.
710
A few kernel-critical, internal memory-allocation requests, marked
GFP_ATOMIC, must be satisfied immediately. The kernel may drop a
request or malfunction if one of these allocations fails. If such a
request cannot be satisfied within the current process's cpuset, then
the kernel relaxes the cpuset and looks for memory anywhere it can
find it. It is better to violate the cpuset than to stress the kernel.
717
718 Allocations of memory requested by kernel drivers while processing an
719 interrupt lack any relevant process context, and are not confined by
720 cpusets.
721
722 Renaming cpusets
723 You can use the rename(2) system call to rename cpusets. Only simple
724 renaming is supported; that is, changing the name of a cpuset directory
725 is permitted, but moving a directory into a different directory is not
726 permitted.
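
From the shell, a simple rename can be done with mv(1), which normally
uses rename(2) when the source and destination are on the same file
system. The sketch below refuses the cases described above before
calling mv(1); the function name rename_cpuset is hypothetical.

```shell
# rename_cpuset PARENT OLD NEW: rename cpuset OLD to NEW within PARENT.
rename_cpuset() {
    case $3 in
        */*)
            echo "rename_cpuset: new name must not contain '/'" >&2
            return 1 ;;
    esac
    if [ -e "$1/$3" ]; then
        echo "rename_cpuset: $1/$3 already exists" >&2
        return 1
    fi
    mv "$1/$2" "$1/$3"
}

# Example:
#   rename_cpuset /dev/cpuset Charlie Delta
```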
727
729 The Linux kernel implementation of cpusets sets errno to specify the
730 reason for a failed system call affecting cpusets.
731
732 The possible errno settings and their meaning when set on a failed
733 cpuset call are as listed below.
734
735 E2BIG Attempted a write(2) on a special cpuset file with a length
736 larger than some kernel-determined upper limit on the length of
737 such writes.
738
739 EACCES Attempted to write(2) the process ID (PID) of a process to a
740 cpuset tasks file when one lacks permission to move that
741 process.
742
743 EACCES Attempted to add, using write(2), a CPU or memory node to a
744 cpuset, when that CPU or memory node was not already in its par‐
745 ent.
746
747 EACCES Attempted to set, using write(2), cpu_exclusive or mem_exclusive
748 on a cpuset whose parent lacks the same setting.
749
750 EACCES Attempted to write(2) a memory_pressure file.
751
752 EACCES Attempted to create a file in a cpuset directory.
753
754 EBUSY Attempted to remove, using rmdir(2), a cpuset with attached pro‐
755 cesses.
756
757 EBUSY Attempted to remove, using rmdir(2), a cpuset with child
758 cpusets.
759
760 EBUSY Attempted to remove a CPU or memory node from a cpuset that is
761 also in a child of that cpuset.
762
763 EEXIST Attempted to create, using mkdir(2), a cpuset that already
764 exists.
765
766 EEXIST Attempted to rename(2) a cpuset to a name that already exists.
767
EFAULT Attempted to read(2) or write(2) a cpuset file using a buffer
that is outside the writing process's accessible address space.
770
771 EINVAL Attempted to change a cpuset, using write(2), in a way that
772 would violate a cpu_exclusive or mem_exclusive attribute of that
773 cpuset or any of its siblings.
774
775 EINVAL Attempted to write(2) an empty cpus or mems list to a cpuset
776 which has attached processes or child cpusets.
777
778 EINVAL Attempted to write(2) a cpus or mems list which included a range
779 with the second number smaller than the first number.
780
781 EINVAL Attempted to write(2) a cpus or mems list which included an
782 invalid character in the string.
783
784 EINVAL Attempted to write(2) a list to a cpus file that did not include
785 any online CPUs.
786
787 EINVAL Attempted to write(2) a list to a mems file that did not include
788 any online memory nodes.
789
790 EINVAL Attempted to write(2) a list to a mems file that included a node
791 that held no memory.
792
793 EIO Attempted to write(2) a string to a cpuset tasks file that does
794 not begin with an ASCII decimal integer.
795
796 EIO Attempted to rename(2) a cpuset into a different directory.
797
798 ENAMETOOLONG
799 Attempted to read(2) a /proc/<pid>/cpuset file for a cpuset path
800 that is longer than the kernel page size.
801
802 ENAMETOOLONG
803 Attempted to create, using mkdir(2), a cpuset whose base direc‐
804 tory name is longer than 255 characters.
805
806 ENAMETOOLONG
807 Attempted to create, using mkdir(2), a cpuset whose full path‐
808 name, including the mount point (typically "/dev/cpuset/") pre‐
809 fix, is longer than 4095 characters.
810
811 ENODEV The cpuset was removed by another process at the same time as a
812 write(2) was attempted on one of the pseudo-files in the cpuset
813 directory.
814
815 ENOENT Attempted to create, using mkdir(2), a cpuset in a parent cpuset
816 that doesn't exist.
817
818 ENOENT Attempted to access(2) or open(2) a nonexistent file in a cpuset
819 directory.
820
821 ENOMEM Insufficient memory is available within the kernel; can occur on
822 a variety of system calls affecting cpusets, but only if the
823 system is extremely short of memory.
824
825 ENOSPC Attempted to write(2) the process ID (PID) of a process to a
826 cpuset tasks file when the cpuset had an empty cpus or empty
827 mems setting.
828
829 ENOSPC Attempted to write(2) an empty cpus or mems setting to a cpuset
830 that has tasks attached.
831
832 ENOTDIR
833 Attempted to rename(2) a nonexistent cpuset.
834
835 EPERM Attempted to remove a file from a cpuset directory.
836
837 ERANGE Specified a cpus or mems list to the kernel which included a
838 number too large for the kernel to set in its bitmasks.
839
840 ESRCH Attempted to write(2) the process ID (PID) of a nonexistent
841 process to a cpuset tasks file.
842
844 Cpusets appeared in version 2.6.12 of the Linux kernel.
845
847 Despite its name, the pid parameter is actually a thread ID, and each
848 thread in a threaded group can be attached to a different cpuset. The
849 value returned from a call to gettid(2) can be passed in the argument
850 pid.
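
Because the tasks file takes thread IDs, moving a whole multithreaded
process means writing the ID of each of its threads. The following
sketch lists the thread IDs found under /proc/<pid>/task and writes
them one at a time; both function names are hypothetical.

```shell
# list_threads PID: print the thread IDs of process PID, one per line.
list_threads() {
    for dir in /proc/"$1"/task/*; do
        [ -d "$dir" ] || continue       # no such process, or no /proc
        echo "${dir##*/}"
    done
}

# attach_all_threads PID CPUSET_DIR: write each thread ID of PID to the
# tasks file of CPUSET_DIR, one ID per write.
attach_all_threads() {
    list_threads "$1" | while read -r tid; do
        /bin/echo "$tid"
    done > "$2/tasks"
}

# Example:
#   attach_all_threads 1234 /dev/cpuset/Charlie
```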
851
853 memory_pressure cpuset files can be opened for writing, creation, or
854 truncation, but then the write(2) fails with errno set to EACCES, and
855 the creation and truncation options on open(2) have no effect.
856
858 The following examples demonstrate querying and setting cpuset options
859 using shell commands.
860
861 Creating and attaching to a cpuset.
862 To create a new cpuset and attach the current command shell to it, the
863 steps are:
864
865 1) mkdir /dev/cpuset (if not already done)
866 2) mount -t cpuset none /dev/cpuset (if not already done)
867 3) Create the new cpuset using mkdir(1).
868 4) Assign CPUs and memory nodes to the new cpuset.
869 5) Attach the shell to the new cpuset.
870
871 For example, the following sequence of commands will set up a cpuset
872 named "Charlie", containing just CPUs 2 and 3, and memory node 1, and
873 then attach the current shell to that cpuset.
874
    $ mkdir /dev/cpuset
    $ mount -t cpuset cpuset /dev/cpuset
    $ cd /dev/cpuset
    $ mkdir Charlie
    $ cd Charlie
    $ /bin/echo 2-3 > cpus
    $ /bin/echo 1 > mems
    $ /bin/echo $$ > tasks
    # The current shell is now running in cpuset Charlie
    # The next line should display '/Charlie'
    $ cat /proc/self/cpuset
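
The same steps could be wrapped in shell functions for use in scripts.
This is a sketch only: the names create_cpuset and attach_self are
hypothetical, and the cpuset file system is assumed to be mounted
already.

```shell
# create_cpuset PARENT NAME CPUS MEMS: create cpuset NAME below PARENT
# and assign its CPUs and memory nodes.
create_cpuset() {
    dir=$1/$2
    mkdir "$dir" || return 1
    /bin/echo "$3" > "$dir/cpus" || return 1
    /bin/echo "$4" > "$dir/mems" || return 1
}

# attach_self CPUSET_DIR: move the calling shell into the cpuset.
attach_self() {
    /bin/echo $$ > "$1/tasks"
}

# Example, equivalent to the command sequence above:
#   create_cpuset /dev/cpuset Charlie 2-3 1 && attach_self /dev/cpuset/Charlie
```

Each write is checked separately so that a rejected cpus or mems value
stops the sequence before the shell is attached.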
886
887 Migrating a job to different memory nodes.
888 To migrate a job (the set of processes attached to a cpuset) to differ‐
889 ent CPUs and memory nodes in the system, including moving the memory
890 pages currently allocated to that job, perform the following steps.
891
892 1) Let's say we want to move the job in cpuset alpha (CPUs 4-7 and
893 memory nodes 2-3) to a new cpuset beta (CPUs 16-19 and memory nodes
894 8-9).
895 2) First create the new cpuset beta.
896 3) Then allow CPUs 16-19 and memory nodes 8-9 in beta.
897 4) Then enable memory_migration in beta.
898 5) Then move each process from alpha to beta.
899
900 The following sequence of commands accomplishes this.
901
    $ cd /dev/cpuset
    $ mkdir beta
    $ cd beta
    $ /bin/echo 16-19 > cpus
    $ /bin/echo 8-9 > mems
    $ /bin/echo 1 > memory_migrate
    $ while read i; do /bin/echo $i; done < ../alpha/tasks > tasks
909
910 The above should move any processes in alpha to beta, and any memory
911 held by these processes on memory nodes 2-3 to memory nodes 8-9,
912 respectively.
913
914 Notice that the last step of the above sequence did not do:
915
    $ cp ../alpha/tasks tasks
917
918 The while loop, rather than the seemingly easier use of the cp(1) com‐
919 mand, was necessary because only one process PID at a time may be writ‐
920 ten to the tasks file.
921
922 The same effect (writing one PID at a time) as the while loop can be
923 accomplished more efficiently, in fewer keystrokes and in syntax that
924 works on any shell, but alas more obscurely, by using the -u
925 (unbuffered) option of sed(1):
926
    $ sed -un p < ../alpha/tasks > tasks
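
For scripted use, the migration can be sketched as a function that
takes the two cpuset directories as arguments. The name migrate_job is
hypothetical, and the destination cpuset is assumed to exist already
with its cpus and mems set.

```shell
# migrate_job SRC_DIR DST_DIR: enable memory migration in the destination
# cpuset, then move every task from the source to the destination.
migrate_job() {
    /bin/echo 1 > "$2/memory_migrate" || return 1
    # One /bin/echo per PID, so that each PID is written on its own.
    while read -r pid; do
        /bin/echo "$pid"
    done < "$1/tasks" > "$2/tasks"
}

# Example:
#   migrate_job /dev/cpuset/alpha /dev/cpuset/beta
```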
928
930 taskset(1), get_mempolicy(2), getcpu(2), mbind(2), sched_getaffin‐
931 ity(2), sched_setaffinity(2), sched_setscheduler(2), set_mempolicy(2),
932 CPU_SET(3), proc(5), numa(7), migratepages(8), numactl(8)
933
934 The kernel source file Documentation/cpusets.txt.
935
937 This page is part of release 3.25 of the Linux man-pages project. A
938 description of the project, and information about reporting bugs, can
939 be found at http://www.kernel.org/doc/man-pages/.
940
941
942
943Linux 2008-11-12 CPUSET(7)