CPUSET(7)                  Linux Programmer's Manual                 CPUSET(7)

6 cpuset - confine processes to processor and memory node subsets
7
9 The cpuset filesystem is a pseudo-filesystem interface to the kernel
10 cpuset mechanism, which is used to control the processor placement and
11 memory placement of processes. It is commonly mounted at /dev/cpuset.
12
13 On systems with kernels compiled with built in support for cpusets, all
14 processes are attached to a cpuset, and cpusets are always present. If
15 a system supports cpusets, then it will have the entry nodev cpuset in
16 the file /proc/filesystems. By mounting the cpuset filesystem (see the
17 EXAMPLES section below), the administrator can configure the cpusets on
18 a system to control the processor and memory placement of processes on
19 that system. By default, if the cpuset configuration on a system is
20 not modified or if the cpuset filesystem is not even mounted, then the
21 cpuset mechanism, though present, has no effect on the system's behav‐
22 ior.
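
The /proc/filesystems check described above can be scripted. The following is a minimal sketch in POSIX sh; the function name cpuset_supported and its optional file argument are illustrative choices, not part of any cpuset interface.

```sh
# Succeed if the given file lists the cpuset filesystem type.
# $1 (optional): file to inspect; defaults to /proc/filesystems.
cpuset_supported() {
    # The filesystem name is the last field on each line
    # (for example "nodev<TAB>cpuset").
    awk '$NF == "cpuset" { found = 1 } END { exit !found }' \
        "${1:-/proc/filesystems}"
}
```

A caller might write: cpuset_supported && echo "cpusets available".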
23
24 A cpuset defines a list of CPUs and memory nodes.
25
26 The CPUs of a system include all the logical processing units on which
27 a process can execute, including, if present, multiple processor cores
28 within a package and Hyper-Threads within a processor core. Memory
29 nodes include all distinct banks of main memory; small and SMP systems
30 typically have just one memory node that contains all the system's main
31 memory, while NUMA (non-uniform memory access) systems have multiple
32 memory nodes.
33
34 Cpusets are represented as directories in a hierarchical pseudo-
35 filesystem, where the top directory in the hierarchy (/dev/cpuset) rep‐
36 resents the entire system (all online CPUs and memory nodes) and any
37 cpuset that is the child (descendant) of another parent cpuset contains
38 a subset of that parent's CPUs and memory nodes. The directories and
39 files representing cpusets have normal filesystem permissions.
40
41 Every process in the system belongs to exactly one cpuset. A process
42 is confined to run only on the CPUs in the cpuset it belongs to, and to
43 allocate memory only on the memory nodes in that cpuset. When a
44 process fork(2)s, the child process is placed in the same cpuset as its
45 parent. With sufficient privilege, a process may be moved from one
46 cpuset to another and the allowed CPUs and memory nodes of an existing
47 cpuset may be changed.
48
49 When the system begins booting, a single cpuset is defined that in‐
50 cludes all CPUs and memory nodes on the system, and all processes are
51 in that cpuset. During the boot process, or later during normal system
52 operation, other cpusets may be created, as subdirectories of this top
53 cpuset, under the control of the system administrator, and processes
54 may be placed in these other cpusets.
55
Cpusets are integrated with the sched_setaffinity(2) scheduling affinity
mechanism and the mbind(2) and set_mempolicy(2) memory-placement
mechanisms in the kernel. Neither of these mechanisms lets a process
make use of a CPU or memory node that is not allowed by that process's
cpuset. If changes to a process's cpuset placement conflict with these
other mechanisms, then cpuset placement is enforced even if it means
overriding these other mechanisms. The kernel accomplishes this
overriding by silently restricting the CPUs and memory nodes requested
by these other mechanisms to those allowed by the invoking process's
cpuset. This can result in these other calls returning an error if,
for example, such a call ends up requesting an empty set of CPUs or
memory nodes after that request is restricted to the invoking
process's cpuset.
69
70 Typically, a cpuset is used to manage the CPU and memory-node confine‐
71 ment for a set of cooperating processes such as a batch scheduler job,
72 and these other mechanisms are used to manage the placement of individ‐
73 ual processes or memory regions within that set or job.
74
76 Each directory below /dev/cpuset represents a cpuset and contains a
77 fixed set of pseudo-files describing the state of that cpuset.
78
79 New cpusets are created using the mkdir(2) system call or the mkdir(1)
80 command. The properties of a cpuset, such as its flags, allowed CPUs
81 and memory nodes, and attached processes, are queried and modified by
82 reading or writing to the appropriate file in that cpuset's directory,
83 as listed below.
84
85 The pseudo-files in each cpuset directory are automatically created
86 when the cpuset is created, as a result of the mkdir(2) invocation. It
87 is not possible to directly add or remove these pseudo-files.
88
89 A cpuset directory that contains no child cpuset directories, and has
90 no attached processes, can be removed using rmdir(2) or rmdir(1). It
91 is not necessary, or possible, to remove the pseudo-files inside the
92 directory before removing it.
93
94 The pseudo-files in each cpuset directory are small text files that may
95 be read and written using traditional shell utilities such as cat(1),
96 and echo(1), or from a program by using file I/O library functions or
97 system calls, such as open(2), read(2), write(2), and close(2).
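
As an illustration of creating a cpuset with mkdir and configuring it with echo, here is a minimal sketch. The function name cpuset_create is hypothetical, and the file names assume the "cpuset." prefix used in this page's file list; on some mounts the files may be named differently.

```sh
# Create a child cpuset and set its CPUs and memory nodes.
# $1: parent cpuset directory   $2: name of the new cpuset
# $3: CPU list                  $4: memory-node list (both in List Format)
cpuset_create() {
    dir=$1/$2
    mkdir "$dir" || return 1
    # Assumed file names: cpuset.cpus and cpuset.mems.
    echo "$3" > "$dir/cpuset.cpus" || return 1
    echo "$4" > "$dir/cpuset.mems" || return 1
}
```

For example, cpuset_create /dev/cpuset batch 0-3 0 would request a cpuset named batch with CPUs 0-3 and memory node 0, assuming the filesystem is mounted at /dev/cpuset and the caller has sufficient privilege.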
98
99 The pseudo-files in a cpuset directory represent internal kernel state
100 and do not have any persistent image on disk. Each of these per-cpuset
101 files is listed and described below.
102
103 tasks List of the process IDs (PIDs) of the processes in that cpuset.
104 The list is formatted as a series of ASCII decimal numbers, each
105 followed by a newline. A process may be added to a cpuset (au‐
106 tomatically removing it from the cpuset that previously con‐
107 tained it) by writing its PID to that cpuset's tasks file (with
108 or without a trailing newline).
109
110 Warning: only one PID may be written to the tasks file at a
111 time. If a string is written that contains more than one PID,
112 only the first one will be used.
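
Because only one PID is accepted per write, attaching several processes needs one write per PID. A minimal sketch follows; the function name cpuset_attach is hypothetical.

```sh
# Attach each listed PID to the cpuset in directory $1.
# Each PID is written separately, since only the first PID of a
# multi-PID write would be used.
cpuset_attach() {
    tasks=$1/tasks
    shift
    for pid in "$@"; do
        echo "$pid" > "$tasks" || return 1
    done
}
```

For example, cpuset_attach /dev/cpuset/batch 1234 1235 performs two separate writes to the tasks file.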
113
114 notify_on_release
115 Flag (0 or 1). If set (1), that cpuset will receive special
116 handling after it is released, that is, after all processes
117 cease using it (i.e., terminate or are moved to a different
118 cpuset) and all child cpuset directories have been removed. See
119 the Notify On Release section, below.
120
121 cpuset.cpus
122 List of the physical numbers of the CPUs on which processes in
123 that cpuset are allowed to execute. See List Format below for a
124 description of the format of cpus.
125
126 The CPUs allowed to a cpuset may be changed by writing a new
127 list to its cpus file.
128
129 cpuset.cpu_exclusive
130 Flag (0 or 1). If set (1), the cpuset has exclusive use of its
131 CPUs (no sibling or cousin cpuset may overlap CPUs). By de‐
132 fault, this is off (0). Newly created cpusets also initially
133 default this to off (0).
134
135 Two cpusets are sibling cpusets if they share the same parent
136 cpuset in the /dev/cpuset hierarchy. Two cpusets are cousin
137 cpusets if neither is the ancestor of the other. Regardless of
138 the cpu_exclusive setting, if one cpuset is the ancestor of an‐
139 other, and if both of these cpusets have nonempty cpus, then
140 their cpus must overlap, because the cpus of any cpuset are al‐
141 ways a subset of the cpus of its parent cpuset.
142
143 cpuset.mems
144 List of memory nodes on which processes in this cpuset are al‐
145 lowed to allocate memory. See List Format below for a descrip‐
146 tion of the format of mems.
147
148 cpuset.mem_exclusive
149 Flag (0 or 1). If set (1), the cpuset has exclusive use of its
150 memory nodes (no sibling or cousin may overlap). Also if set
151 (1), the cpuset is a Hardwall cpuset (see below). By default,
152 this is off (0). Newly created cpusets also initially default
153 this to off (0).
154
155 Regardless of the mem_exclusive setting, if one cpuset is the
156 ancestor of another, then their memory nodes must overlap, be‐
157 cause the memory nodes of any cpuset are always a subset of the
158 memory nodes of that cpuset's parent cpuset.
159
160 cpuset.mem_hardwall (since Linux 2.6.26)
161 Flag (0 or 1). If set (1), the cpuset is a Hardwall cpuset (see
162 below). Unlike mem_exclusive, there is no constraint on whether
163 cpusets marked mem_hardwall may have overlapping memory nodes
164 with sibling or cousin cpusets. By default, this is off (0).
165 Newly created cpusets also initially default this to off (0).
166
167 cpuset.memory_migrate (since Linux 2.6.16)
168 Flag (0 or 1). If set (1), then memory migration is enabled.
169 By default, this is off (0). See the Memory Migration section,
170 below.
171
172 cpuset.memory_pressure (since Linux 2.6.16)
173 A measure of how much memory pressure the processes in this
174 cpuset are causing. See the Memory Pressure section, below.
175 Unless memory_pressure_enabled is enabled, always has value zero
176 (0). This file is read-only. See the WARNINGS section, below.
177
178 cpuset.memory_pressure_enabled (since Linux 2.6.16)
179 Flag (0 or 1). This file is present only in the root cpuset,
180 normally /dev/cpuset. If set (1), the memory_pressure calcula‐
181 tions are enabled for all cpusets in the system. By default,
182 this is off (0). See the Memory Pressure section, below.
183
184 cpuset.memory_spread_page (since Linux 2.6.17)
185 Flag (0 or 1). If set (1), pages in the kernel page cache
186 (filesystem buffers) are uniformly spread across the cpuset. By
187 default, this is off (0) in the top cpuset, and inherited from
188 the parent cpuset in newly created cpusets. See the Memory
189 Spread section, below.
190
cpuset.memory_spread_slab (since Linux 2.6.17)
Flag (0 or 1). If set (1), the kernel slab caches for file I/O
(directory and inode structures) are uniformly spread across the
cpuset. By default, this is off (0) in the top cpuset, and inherited
from the parent cpuset in newly created cpusets. See the Memory
Spread section, below.
197
cpuset.sched_load_balance (since Linux 2.6.24)
Flag (0 or 1). If set (1, the default), the kernel will automatically
load balance processes in that cpuset over the allowed CPUs in that
cpuset. If cleared (0), the kernel will avoid load balancing processes
in this cpuset, unless some other cpuset with overlapping CPUs has its
sched_load_balance flag set. See Scheduler Load Balancing, below, for
further details.
205
206 cpuset.sched_relax_domain_level (since Linux 2.6.26)
207 Integer, between -1 and a small positive value. The sched_re‐
208 lax_domain_level controls the width of the range of CPUs over
209 which the kernel scheduler performs immediate rebalancing of
210 runnable tasks across CPUs. If sched_load_balance is disabled,
211 then the setting of sched_relax_domain_level does not matter, as
212 no such load balancing is done. If sched_load_balance is en‐
213 abled, then the higher the value of the sched_relax_do‐
214 main_level, the wider the range of CPUs over which immediate
215 load balancing is attempted. See Scheduler Relax Domain Level,
216 below, for further details.
217
218 In addition to the above pseudo-files in each directory below
219 /dev/cpuset, each process has a pseudo-file, /proc/<pid>/cpuset, that
220 displays the path of the process's cpuset directory relative to the
221 root of the cpuset filesystem.
222
223 Also the /proc/<pid>/status file for each process has four added lines,
224 displaying the process's Cpus_allowed (on which CPUs it may be sched‐
225 uled) and Mems_allowed (on which memory nodes it may obtain memory), in
226 the two formats Mask Format and List Format (see below) as shown in the
227 following example:
228
229 Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
230 Cpus_allowed_list: 0-127
231 Mems_allowed: ffffffff,ffffffff
232 Mems_allowed_list: 0-63
233
234 The "allowed" fields were added in Linux 2.6.24; the "allowed_list"
235 fields were added in Linux 2.6.26.
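
The fields shown above can be extracted with standard text tools. The following sketch uses sed; the function name status_field and its optional file argument are illustrative.

```sh
# Print the value of one field (for example Cpus_allowed_list)
# from a status file.
# $1: field name   $2 (optional): file; defaults to /proc/self/status,
# which here describes the sed process that reads it.
status_field() {
    sed -n "s/^$1:[[:space:]]*//p" "${2:-/proc/self/status}"
}
```

For example, status_field Mems_allowed_list /proc/1234/status prints the memory nodes allowed to the process with PID 1234.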
236
238 In addition to controlling which cpus and mems a process is allowed to
239 use, cpusets provide the following extended capabilities.
240
241 Exclusive cpusets
242 If a cpuset is marked cpu_exclusive or mem_exclusive, no other cpuset,
243 other than a direct ancestor or descendant, may share any of the same
244 CPUs or memory nodes.
245
246 A cpuset that is mem_exclusive restricts kernel allocations for buffer
247 cache pages and other internal kernel data pages commonly shared by the
248 kernel across multiple users. All cpusets, whether mem_exclusive or
249 not, restrict allocations of memory for user space. This enables con‐
250 figuring a system so that several independent jobs can share common
251 kernel data, while isolating each job's user allocation in its own
252 cpuset. To do this, construct a large mem_exclusive cpuset to hold all
253 the jobs, and construct child, non-mem_exclusive cpusets for each indi‐
254 vidual job. Only a small amount of kernel memory, such as requests
255 from interrupt handlers, is allowed to be placed on memory nodes out‐
256 side even a mem_exclusive cpuset.
257
258 Hardwall
259 A cpuset that has mem_exclusive or mem_hardwall set is a hardwall
260 cpuset. A hardwall cpuset restricts kernel allocations for page, buf‐
261 fer, and other data commonly shared by the kernel across multiple
262 users. All cpusets, whether hardwall or not, restrict allocations of
263 memory for user space.
264
265 This enables configuring a system so that several independent jobs can
266 share common kernel data, such as filesystem pages, while isolating
267 each job's user allocation in its own cpuset. To do this, construct a
268 large hardwall cpuset to hold all the jobs, and construct child cpusets
269 for each individual job which are not hardwall cpusets.
270
271 Only a small amount of kernel memory, such as requests from interrupt
272 handlers, is allowed to be taken outside even a hardwall cpuset.
273
274 Notify on release
275 If the notify_on_release flag is enabled (1) in a cpuset, then whenever
276 the last process in the cpuset leaves (exits or attaches to some other
277 cpuset) and the last child cpuset of that cpuset is removed, the kernel
278 will run the command /sbin/cpuset_release_agent, supplying the pathname
279 (relative to the mount point of the cpuset filesystem) of the abandoned
280 cpuset. This enables automatic removal of abandoned cpusets.
281
282 The default value of notify_on_release in the root cpuset at system
283 boot is disabled (0). The default value of other cpusets at creation
284 is the current value of their parent's notify_on_release setting.
285
286 The command /sbin/cpuset_release_agent is invoked, with the name
287 (/dev/cpuset relative path) of the to-be-released cpuset in argv[1].
288
The usual content of the command /sbin/cpuset_release_agent is simply
the shell script:

    #!/bin/sh
    rmdir "/dev/cpuset/$1"
294
295 As with other flag values below, this flag can be changed by writing an
296 ASCII number 0 or 1 (with optional trailing newline) into the file, to
297 clear or set the flag, respectively.
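
A small helper can guard against writing anything other than 0 or 1 to a flag file. This is a sketch only; the function name cpuset_set_flag is hypothetical.

```sh
# Set or clear a flag file in a cpuset directory.
# $1: cpuset directory   $2: flag file name   $3: 0 or 1
cpuset_set_flag() {
    case $3 in
        0|1) echo "$3" > "$1/$2" ;;
        *)   echo "cpuset_set_flag: value must be 0 or 1" >&2
             return 2 ;;
    esac
}
```

For example, cpuset_set_flag /dev/cpuset/batch notify_on_release 1 sets the flag for a cpuset named batch.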
298
299 Memory pressure
The memory_pressure of a cpuset provides a simple per-cpuset running
average of the rate at which the processes in a cpuset are attempting
to free up in-use memory on the nodes of the cpuset to satisfy
additional memory requests.

This enables a batch manager that is monitoring jobs running in
dedicated cpusets to efficiently detect what level of memory pressure
each job is causing.
308
309 This is useful both on tightly managed systems running a wide mix of
310 submitted jobs, which may choose to terminate or reprioritize jobs that
311 are trying to use more memory than allowed on the nodes assigned them,
312 and with tightly coupled, long-running, massively parallel scientific
313 computing jobs that will dramatically fail to meet required performance
314 goals if they start to use more memory than allowed to them.
315
316 This mechanism provides a very economical way for the batch manager to
317 monitor a cpuset for signs of memory pressure. It's up to the batch
318 manager or other user code to decide what action to take if it detects
319 signs of memory pressure.
320
321 Unless memory pressure calculation is enabled by setting the pseudo-
322 file /dev/cpuset/cpuset.memory_pressure_enabled, it is not computed for
323 any cpuset, and reads from any memory_pressure always return zero, as
324 represented by the ASCII string "0\n". See the WARNINGS section, be‐
325 low.
326
327 A per-cpuset, running average is employed for the following reasons:
328
329 * Because this meter is per-cpuset rather than per-process or per vir‐
330 tual memory region, the system load imposed by a batch scheduler
331 monitoring this metric is sharply reduced on large systems, because
332 a scan of the tasklist can be avoided on each set of queries.
333
334 * Because this meter is a running average rather than an accumulating
335 counter, a batch scheduler can detect memory pressure with a single
336 read, instead of having to read and accumulate results for a period
337 of time.
338
339 * Because this meter is per-cpuset rather than per-process, the batch
340 scheduler can obtain the key information—memory pressure in a
341 cpuset—with a single read, rather than having to query and accumu‐
342 late results over all the (dynamically changing) set of processes in
343 the cpuset.
344
345 The memory_pressure of a cpuset is calculated using a per-cpuset simple
346 digital filter that is kept within the kernel. For each cpuset, this
347 filter tracks the recent rate at which processes attached to that
348 cpuset enter the kernel direct reclaim code.
349
350 The kernel direct reclaim code is entered whenever a process has to
351 satisfy a memory page request by first finding some other page to re‐
352 purpose, due to lack of any readily available already free pages.
353 Dirty filesystem pages are repurposed by first writing them to disk.
354 Unmodified filesystem buffer pages are repurposed by simply dropping
355 them, though if that page is needed again, it will have to be reread
356 from disk.
357
358 The cpuset.memory_pressure file provides an integer number representing
359 the recent (half-life of 10 seconds) rate of entries to the direct re‐
360 claim code caused by any process in the cpuset, in units of reclaims
361 attempted per second, times 1000.
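
Since the reported integer is the rate multiplied by 1000, dividing by 1000 recovers reclaim attempts per second. The sketch below does this with integer arithmetic only; the function name pressure_per_second is hypothetical, and the input is assumed to be a nonnegative integer as read from the file.

```sh
# Convert a raw memory_pressure reading to reclaim attempts per second,
# printed with three decimal places.
# $1: nonnegative integer read from the memory_pressure file
pressure_per_second() {
    printf '%d.%03d\n' $(( $1 / 1000 )) $(( $1 % 1000 ))
}
```

For example, a reading of 2500 corresponds to 2.500 reclaim attempts per second.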
362
363 Memory spread
364 There are two Boolean flag files per cpuset that control where the ker‐
365 nel allocates pages for the filesystem buffers and related in-kernel
366 data structures. They are called cpuset.memory_spread_page and
367 cpuset.memory_spread_slab.
368
369 If the per-cpuset Boolean flag file cpuset.memory_spread_page is set,
370 then the kernel will spread the filesystem buffers (page cache) evenly
371 over all the nodes that the faulting process is allowed to use, instead
372 of preferring to put those pages on the node where the process is run‐
373 ning.
374
375 If the per-cpuset Boolean flag file cpuset.memory_spread_slab is set,
376 then the kernel will spread some filesystem-related slab caches, such
377 as those for inodes and directory entries, evenly over all the nodes
378 that the faulting process is allowed to use, instead of preferring to
379 put those pages on the node where the process is running.
380
381 The setting of these flags does not affect the data segment (see
382 brk(2)) or stack segment pages of a process.
383
384 By default, both kinds of memory spreading are off and the kernel pre‐
385 fers to allocate memory pages on the node local to where the requesting
386 process is running. If that node is not allowed by the process's NUMA
387 memory policy or cpuset configuration or if there are insufficient free
388 memory pages on that node, then the kernel looks for the nearest node
389 that is allowed and has sufficient free memory.
390
391 When new cpusets are created, they inherit the memory spread settings
392 of their parent.
393
394 Setting memory spreading causes allocations for the affected page or
395 slab caches to ignore the process's NUMA memory policy and be spread
396 instead. However, the effect of these changes in memory placement
397 caused by cpuset-specified memory spreading is hidden from the mbind(2)
398 or set_mempolicy(2) calls. These two NUMA memory policy calls always
399 appear to behave as if no cpuset-specified memory spreading is in ef‐
400 fect, even if it is. If cpuset memory spreading is subsequently turned
401 off, the NUMA memory policy most recently specified by these calls is
402 automatically reapplied.
403
404 Both cpuset.memory_spread_page and cpuset.memory_spread_slab are Bool‐
405 ean flag files. By default, they contain "0", meaning that the feature
406 is off for that cpuset. If a "1" is written to that file, that turns
407 the named feature on.
408
409 Cpuset-specified memory spreading behaves similarly to what is known
410 (in other contexts) as round-robin or interleave memory placement.
411
412 Cpuset-specified memory spreading can provide substantial performance
413 improvements for jobs that:
414
a) need to place thread-local data on memory nodes close to the CPUs
   which are running the threads that most frequently access that data;
   but also

b) need to access large filesystem data sets that must be spread
   across the several nodes in the job's cpuset in order to fit.
421
422 Without this policy, the memory allocation across the nodes in the
423 job's cpuset can become very uneven, especially for jobs that might
424 have just a single thread initializing or reading in the data set.
425
426 Memory migration
Normally, under the default setting (disabled) of cpuset.memory_migrate,
once a page is allocated (given a physical page of main memory), that
page stays on whatever node it was allocated on, so long as it remains
allocated, even if the cpuset's memory-placement policy mems
subsequently changes.
432
433 When memory migration is enabled in a cpuset, if the mems setting of
434 the cpuset is changed, then any memory page in use by any process in
435 the cpuset that is on a memory node that is no longer allowed will be
436 migrated to a memory node that is allowed.
437
438 Furthermore, if a process is moved into a cpuset with memory_migrate
439 enabled, any memory pages it uses that were on memory nodes allowed in
440 its previous cpuset, but which are not allowed in its new cpuset, will
441 be migrated to a memory node allowed in the new cpuset.
442
443 The relative placement of a migrated page within the cpuset is pre‐
444 served during these migration operations if possible. For example, if
445 the page was on the second valid node of the prior cpuset, then the
446 page will be placed on the second valid node of the new cpuset, if pos‐
447 sible.
448
449 Scheduler load balancing
450 The kernel scheduler automatically load balances processes. If one CPU
451 is underutilized, the kernel will look for processes on other more
452 overloaded CPUs and move those processes to the underutilized CPU,
453 within the constraints of such placement mechanisms as cpusets and
454 sched_setaffinity(2).
455
456 The algorithmic cost of load balancing and its impact on key shared
457 kernel data structures such as the process list increases more than
458 linearly with the number of CPUs being balanced. For example, it costs
459 more to load balance across one large set of CPUs than it does to bal‐
460 ance across two smaller sets of CPUs, each of half the size of the
461 larger set. (The precise relationship between the number of CPUs being
462 balanced and the cost of load balancing depends on implementation de‐
463 tails of the kernel process scheduler, which is subject to change over
464 time, as improved kernel scheduler algorithms are implemented.)
465
466 The per-cpuset flag sched_load_balance provides a mechanism to suppress
467 this automatic scheduler load balancing in cases where it is not needed
468 and suppressing it would have worthwhile performance benefits.
469
470 By default, load balancing is done across all CPUs, except those marked
471 isolated using the kernel boot time "isolcpus=" argument. (See Sched‐
472 uler Relax Domain Level, below, to change this default.)
473
474 This default load balancing across all CPUs is not well suited to the
475 following two situations:
476
477 * On large systems, load balancing across many CPUs is expensive. If
478 the system is managed using cpusets to place independent jobs on
479 separate sets of CPUs, full load balancing is unnecessary.
480
481 * Systems supporting real-time on some CPUs need to minimize system
482 overhead on those CPUs, including avoiding process load balancing if
483 that is not needed.
484
485 When the per-cpuset flag sched_load_balance is enabled (the default
486 setting), it requests load balancing across all the CPUs in that
487 cpuset's allowed CPUs, ensuring that load balancing can move a process
488 (not otherwise pinned, as by sched_setaffinity(2)) from any CPU in that
489 cpuset to any other.
490
491 When the per-cpuset flag sched_load_balance is disabled, then the
492 scheduler will avoid load balancing across the CPUs in that cpuset, ex‐
493 cept in so far as is necessary because some overlapping cpuset has
494 sched_load_balance enabled.
495
496 So, for example, if the top cpuset has the flag sched_load_balance en‐
497 abled, then the scheduler will load balance across all CPUs, and the
498 setting of the sched_load_balance flag in other cpusets has no effect,
499 as we're already fully load balancing.
500
501 Therefore in the above two situations, the flag sched_load_balance
502 should be disabled in the top cpuset, and only some of the smaller,
503 child cpusets would have this flag enabled.
504
505 When doing this, you don't usually want to leave any unpinned processes
506 in the top cpuset that might use nontrivial amounts of CPU, as such
507 processes may be artificially constrained to some subset of CPUs, de‐
508 pending on the particulars of this flag setting in descendant cpusets.
509 Even if such a process could use spare CPU cycles in some other CPUs,
510 the kernel scheduler might not consider the possibility of load balanc‐
511 ing that process to the underused CPU.
512
513 Of course, processes pinned to a particular CPU can be left in a cpuset
514 that disables sched_load_balance as those processes aren't going any‐
515 where else anyway.
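
The arrangement described above, with load balancing disabled in the top cpuset and enabled only in selected child cpusets, might be scripted as follows. This is a sketch under assumptions: the function name partition_load_balancing is hypothetical, the children are assumed to exist already, and the file name assumes the "cpuset." prefix used in this page's file list.

```sh
# Disable load balancing in the top cpuset and enable it in the
# named child cpusets.
# $1: top cpuset directory; remaining arguments: child cpuset names
partition_load_balancing() {
    top=$1
    shift
    echo 0 > "$top/cpuset.sched_load_balance" || return 1
    for child in "$@"; do
        echo 1 > "$top/$child/cpuset.sched_load_balance" || return 1
    done
}
```

For example, partition_load_balancing /dev/cpuset jobA jobB leaves balancing active only within jobA and jobB.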
516
517 Scheduler relax domain level
The kernel scheduler performs immediate load balancing whenever a CPU
becomes free or another task becomes runnable. This load balancing
works to ensure that as many CPUs as possible are usefully employed
running tasks. The kernel also performs periodic load balancing driven
by the software clock described in time(7). The setting of
sched_relax_domain_level applies only to immediate load balancing.
Regardless of the sched_relax_domain_level setting, periodic load
balancing is attempted over all CPUs (unless disabled by turning off
sched_load_balance). In any case, of course, tasks will be scheduled to
run only on CPUs allowed by their cpuset, as modified by
sched_setaffinity(2) system calls.
529
530 On small systems, such as those with just a few CPUs, immediate load
531 balancing is useful to improve system interactivity and to minimize
532 wasteful idle CPU cycles. But on large systems, attempting immediate
533 load balancing across a large number of CPUs can be more costly than it
534 is worth, depending on the particular performance characteristics of
535 the job mix and the hardware.
536
537 The exact meaning of the small integer values of sched_relax_do‐
538 main_level will depend on internal implementation details of the kernel
539 scheduler code and on the non-uniform architecture of the hardware.
540 Both of these will evolve over time and vary by system architecture and
541 kernel version.
542
543 As of this writing, when this capability was introduced in Linux
544 2.6.26, on certain popular architectures, the positive values of
545 sched_relax_domain_level have the following meanings.
546
(1) Perform immediate load balancing across Hyper-Thread siblings on
    the same core.
(2) Perform immediate load balancing across other cores in the same
    package.
(3) Perform immediate load balancing across other CPUs on the same node
    or blade.
(4) Perform immediate load balancing over several (implementation
    detail) nodes [On NUMA systems].
(5) Perform immediate load balancing over all CPUs in system [On
    NUMA systems].
557
558 The sched_relax_domain_level value of zero (0) always means don't per‐
559 form immediate load balancing, hence that load balancing is done only
560 periodically, not immediately when a CPU becomes available or another
561 task becomes runnable.
562
563 The sched_relax_domain_level value of minus one (-1) always means use
564 the system default value. The system default value can vary by archi‐
565 tecture and kernel version. This system default value can be changed
566 by kernel boot-time "relax_domain_level=" argument.
567
In the case of multiple overlapping cpusets which have conflicting
sched_relax_domain_level values, the highest such value applies to
all CPUs in any of the overlapping cpusets. In such cases, the value
minus one (-1) is the lowest value, overridden by any other value, and
the value zero (0) is the next lowest value.
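
Because -1 ranks lowest and 0 next lowest, the rule above amounts to taking the numerical maximum of the conflicting values. The following sketch illustrates the rule only; the function name effective_relax_level is hypothetical and nothing here queries the kernel.

```sh
# Print the sched_relax_domain_level that applies when overlapping
# cpusets disagree: the numerically highest of the given values.
effective_relax_level() {
    best=$1
    shift
    for level in "$@"; do
        if [ "$level" -gt "$best" ]; then
            best=$level
        fi
    done
    printf '%s\n' "$best"
}
```

For example, effective_relax_level -1 0 2 prints 2, and effective_relax_level -1 0 prints 0.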
573
575 The following formats are used to represent sets of CPUs and memory
576 nodes.
577
578 Mask format
579 The Mask Format is used to represent CPU and memory-node bit masks in
580 the /proc/<pid>/status file.
581
582 This format displays each 32-bit word in hexadecimal (using ASCII char‐
583 acters "0" - "9" and "a" - "f"); words are filled with leading zeros,
584 if required. For masks longer than one word, a comma separator is used
585 between words. Words are displayed in big-endian order, which has the
586 most significant bit first. The hex digits within a word are also in
587 big-endian order.
588
589 The number of 32-bit words displayed is the minimum number needed to
590 display all bits of the bit mask, based on the size of the bit mask.
591
592 Examples of the Mask Format:
593
594 00000001 # just bit 0 set
595 40000000,00000000,00000000 # just bit 94 set
596 00000001,00000000,00000000 # just bit 64 set
597 000000ff,00000000 # bits 32-39 set
598 00000000,000e3862 # 1,5,6,11-13,17-19 set
599
600 A mask with bits 0, 1, 2, 4, 8, 16, 32, and 64 set displays as:
601
602 00000001,00000001,00010117
603
604 The first "1" is for bit 64, the second for bit 32, the third for bit
605 16, the fourth for bit 8, the fifth for bit 4, and the "7" is for bits
606 2, 1, and 0.
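
To make the word and bit ordering concrete, the sketch below decodes a Mask Format string into the numbers of its set bits, one per line in ascending order. The function name mask_to_bits is hypothetical; the code assumes a shell whose arithmetic accepts hexadecimal constants and at least 64-bit integers.

```sh
# Print the bit numbers set in a Mask Format string, one per line.
# $1: mask such as 00000001,00000001,00010117
mask_to_bits() (
    # The function body runs in a subshell, so IFS and set -f
    # do not leak into the caller.
    set -f
    IFS=,
    set -- $1                        # split into 32-bit words
    base=$(( ($# - 1) * 32 ))        # bit number of the first word's bit 0
    for word in "$@"; do
        value=$(( 0x$word ))
        bit=0
        while [ "$bit" -lt 32 ]; do
            if [ "$(( (value >> bit) & 1 ))" -eq 1 ]; then
                echo "$(( base + bit ))"
            fi
            bit=$(( bit + 1 ))
        done
        base=$(( base - 32 ))        # words are most significant first
    done | sort -n
)
```

Running it on the mask 00000001,00000001,00010117 from the example above prints 0, 1, 2, 4, 8, 16, 32, and 64.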
607
608 List format
609 The List Format for cpus and mems is a comma-separated list of CPU or
610 memory-node numbers and ranges of numbers, in ASCII decimal.
611
612 Examples of the List Format:
613
614 0-4,9 # bits 0, 1, 2, 3, 4, and 9 set
615 0-2,7,12-14 # bits 0, 1, 2, 7, 12, 13, and 14 set
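
A List Format string can be expanded into individual numbers with a few lines of shell. This is a minimal sketch with no input validation; the function name expand_list is hypothetical.

```sh
# Expand a List Format string into one number per line.
# $1: list such as 0-2,7,12-14
expand_list() (
    # Subshell body: IFS and set -f stay local to this function.
    set -f
    IFS=,
    for item in $1; do
        case $item in
            *-*) first=${item%-*}; last=${item#*-} ;;   # a range
            *)   first=$item; last=$item ;;             # a single number
        esac
        n=$first
        while [ "$n" -le "$last" ]; do
            echo "$n"
            n=$(( n + 1 ))
        done
    done
)
```

For example, expand_list 0-4,9 prints 0, 1, 2, 3, 4, and 9, matching the first example above.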
616
618 The following rules apply to each cpuset:
619
620 * Its CPUs and memory nodes must be a (possibly equal) subset of its
621 parent's.
622
623 * It can be marked cpu_exclusive only if its parent is.
624
625 * It can be marked mem_exclusive only if its parent is.
626
627 * If it is cpu_exclusive, its CPUs may not overlap any sibling.
628
* If it is mem_exclusive, its memory nodes may not overlap any
  sibling.
631
633 The permissions of a cpuset are determined by the permissions of the
634 directories and pseudo-files in the cpuset filesystem, normally mounted
635 at /dev/cpuset.
636
637 For instance, a process can move itself into a cpuset other than its
638 current one if it can write the tasks file for that cpuset. This
639 requires execute permission on the encompassing directories and write
640 permission on the tasks file.
641
642 An additional constraint is applied to requests to place some other
643 process in a cpuset. One process may not attach another to a cpuset
644 unless it would have permission to send that process a signal (see
645 kill(2)).
646
647 A process may create a child cpuset if it can access and write the par‐
648 ent cpuset directory. It can modify the CPUs or memory nodes in a
649 cpuset if it can access that cpuset's directory (execute permissions on
650 each of the parent directories) and write the corresponding cpus or
651 mems file.
652
653 There is one minor difference between the manner in which these permis‐
654 sions are evaluated and the manner in which normal filesystem operation
655 permissions are evaluated. The kernel interprets relative pathnames
656 starting at a process's current working directory. Even if one is op‐
657 erating on a cpuset file, relative pathnames are interpreted relative
658 to the process's current working directory, not relative to the
659 process's current cpuset. The only ways that cpuset paths relative to
660 a process's current cpuset can be used are if either the process's cur‐
661 rent working directory is its cpuset (it first did a cd or chdir(2) to
662 its cpuset directory beneath /dev/cpuset, which is a bit unusual) or if
663 some user code converts the relative cpuset path to a full filesystem
664 path.
665
666 In theory, this means that user code should specify cpusets using abso‐
667 lute pathnames, which requires knowing the mount point of the cpuset
668 filesystem (usually, but not necessarily, /dev/cpuset). In practice,
669 all user-level code that this author is aware of simply assumes that if
670 the cpuset filesystem is mounted, then it is mounted at /dev/cpuset.
671 Furthermore, it is common practice for carefully written user code to
672 verify the presence of the pseudo-file /dev/cpuset/tasks in order to
673 verify that the cpuset pseudo-filesystem is currently mounted.
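
The presence check described above might be sketched as follows. The function name cpuset_mounted is hypothetical, and the optional argument for a mount point other than /dev/cpuset is an addition made for illustration.

```sh
# cpuset_mounted [MOUNTPOINT]
# Succeed if the pseudo-file MOUNTPOINT/tasks exists, which is taken
# as evidence that the cpuset filesystem is mounted there.
# MOUNTPOINT defaults to /dev/cpuset.
# Hypothetical helper for illustration only.
cpuset_mounted() {
    [ -e "${1:-/dev/cpuset}/tasks" ]
}

# Possible use in a script:
#   cpuset_mounted || echo "cpuset filesystem is not mounted" >&2
```

Note that this only tests for a file named tasks; it does not confirm that the filesystem at that path is really of type cpuset.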
674
676 Enabling memory_pressure
677 By default, the per-cpuset file cpuset.memory_pressure always contains
678 zero (0). Unless this feature is enabled by writing "1" to the pseudo-
679 file /dev/cpuset/cpuset.memory_pressure_enabled, the kernel does not
680 compute per-cpuset memory_pressure.
681
682 Using the echo command
683 When using the echo command at the shell prompt to change the values of
684 cpuset files, beware that the built-in echo command in some shells does
685 not display an error message if the write(2) system call fails. For
686 example, if the command:
687
688 echo 19 > cpuset.mems
689
690 failed because memory node 19 was not allowed (perhaps the current sys‐
691 tem does not have a memory node 19), then the echo command might not
692 display any error. It is better to use the /bin/echo external command
693 to change cpuset file settings, as this command will display write(2)
694 errors, as in the example:
695
696 /bin/echo 19 > cpuset.mems
697 /bin/echo: write error: Invalid argument
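
In scripts, one way to avoid overlooking such failures is to test the exit status of the write. The wrapper below is a minimal sketch; the name cpuset_write is hypothetical, and it relies only on /bin/echo exiting with a nonzero status when the write fails.

```sh
# cpuset_write VALUE FILE
# Write VALUE to the cpuset pseudo-file FILE with the external
# /bin/echo command, and return nonzero if the write (or opening the
# file) fails.
# Hypothetical helper for illustration only.
cpuset_write() {
    if ! /bin/echo "$1" > "$2"; then
        printf 'cpuset_write: cannot write "%s" to %s\n' "$1" "$2" >&2
        return 1
    fi
}

# Possible use:
#   cpuset_write 19 cpuset.mems || exit 1
```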
698
700 Memory placement
701 Not all allocations of system memory are constrained by cpusets, for
702 the following reasons.
703
704 If hot-plug functionality is used to remove all the CPUs that are cur‐
705 rently assigned to a cpuset, then the kernel will automatically update
706 the cpus_allowed of all processes attached to CPUs in that cpuset to
707 allow all CPUs. When memory hot-plug functionality for removing memory
708 nodes is available, a similar exception is expected to apply there as
709 well. In general, the kernel prefers to violate cpuset placement,
710 rather than starving a process that has had all its allowed CPUs or
711 memory nodes taken offline. User code should reconfigure cpusets to
712 refer only to online CPUs and memory nodes when using hot-plug to add
713 or remove such resources.
714
715 A few kernel-critical, internal memory-allocation requests, marked
716 GFP_ATOMIC, must be satisfied immediately. The kernel may drop some
717 request or malfunction if one of these allocations fails. If such a
718 request cannot be satisfied within the current process's cpuset, then
719 the kernel relaxes the cpuset and looks for memory anywhere it can find
720 it. It is better to violate the cpuset than to stress the kernel.
721
722 Allocations of memory requested by kernel drivers while processing an
723 interrupt lack any relevant process context, and are not confined by
724 cpusets.
725
726 Renaming cpusets
727 You can use the rename(2) system call to rename cpusets. Only simple
728 renaming is supported; that is, changing the name of a cpuset directory
729 is permitted, but moving a directory into a different directory is not
730 permitted.
731
733 The Linux kernel implementation of cpusets sets errno to specify the
734 reason for a failed system call affecting cpusets.
735
736 The possible errno settings and their meaning when set on a failed
737 cpuset call are as listed below.
738
739 E2BIG Attempted a write(2) on a special cpuset file with a length
740 larger than some kernel-determined upper limit on the length of
741 such writes.
742
743 EACCES Attempted to write(2) the process ID (PID) of a process to a
744 cpuset tasks file when one lacks permission to move that
745 process.
746
747 EACCES Attempted to add, using write(2), a CPU or memory node to a
748 cpuset, when that CPU or memory node was not already in its par‐
749 ent.
750
751 EACCES Attempted to set, using write(2), cpuset.cpu_exclusive or
752 cpuset.mem_exclusive on a cpuset whose parent lacks the same
753 setting.
754
755 EACCES Attempted to write(2) a cpuset.memory_pressure file.
756
757 EACCES Attempted to create a file in a cpuset directory.
758
759 EBUSY Attempted to remove, using rmdir(2), a cpuset with attached pro‐
760 cesses.
761
762 EBUSY Attempted to remove, using rmdir(2), a cpuset with child
763 cpusets.
764
765 EBUSY Attempted to remove a CPU or memory node from a cpuset that is
766 also in a child of that cpuset.
767
768 EEXIST Attempted to create, using mkdir(2), a cpuset that already ex‐
769 ists.
770
771 EEXIST Attempted to rename(2) a cpuset to a name that already exists.
772
773 EFAULT Attempted to read(2) or write(2) a cpuset file using a buffer
774 that is outside the writing process's accessible address space.
775
776 EINVAL Attempted to change a cpuset, using write(2), in a way that
777 would violate a cpu_exclusive or mem_exclusive attribute of that
778 cpuset or any of its siblings.
779
780 EINVAL Attempted to write(2) an empty cpuset.cpus or cpuset.mems list
781 to a cpuset which has attached processes or child cpusets.
782
783 EINVAL Attempted to write(2) a cpuset.cpus or cpuset.mems list which
784 included a range with the second number smaller than the first
785 number.
786
787 EINVAL Attempted to write(2) a cpuset.cpus or cpuset.mems list which
788 included an invalid character in the string.
789
790 EINVAL Attempted to write(2) a list to a cpuset.cpus file that did not
791 include any online CPUs.
792
793 EINVAL Attempted to write(2) a list to a cpuset.mems file that did not
794 include any online memory nodes.
795
796 EINVAL Attempted to write(2) a list to a cpuset.mems file that included
797 a node that held no memory.
798
799 EIO Attempted to write(2) a string to a cpuset tasks file that does
800 not begin with an ASCII decimal integer.
801
802 EIO Attempted to rename(2) a cpuset into a different directory.
803
804 ENAMETOOLONG
805 Attempted to read(2) a /proc/<pid>/cpuset file for a cpuset path
806 that is longer than the kernel page size.
807
808 ENAMETOOLONG
809 Attempted to create, using mkdir(2), a cpuset whose base direc‐
810 tory name is longer than 255 characters.
811
812 ENAMETOOLONG
813 Attempted to create, using mkdir(2), a cpuset whose full path‐
814 name, including the mount point (typically "/dev/cpuset/") pre‐
815 fix, is longer than 4095 characters.
816
817 ENODEV The cpuset was removed by another process at the same time as a
818 write(2) was attempted on one of the pseudo-files in the cpuset
819 directory.
820
821 ENOENT Attempted to create, using mkdir(2), a cpuset in a parent cpuset
822 that doesn't exist.
823
824 ENOENT Attempted to access(2) or open(2) a nonexistent file in a cpuset
825 directory.
826
827 ENOMEM Insufficient memory is available within the kernel; can occur on
828 a variety of system calls affecting cpusets, but only if the
829 system is extremely short of memory.
830
831 ENOSPC Attempted to write(2) the process ID (PID) of a process to a
832 cpuset tasks file when the cpuset had an empty cpuset.cpus or
833 empty cpuset.mems setting.
834
835 ENOSPC Attempted to write(2) an empty cpuset.cpus or cpuset.mems set‐
836 ting to a cpuset that has tasks attached.
837
838 ENOTDIR
839 Attempted to rename(2) a nonexistent cpuset.
840
841 EPERM Attempted to remove a file from a cpuset directory.
842
843 ERANGE Specified a cpuset.cpus or cpuset.mems list to the kernel which
844 included a number too large for the kernel to set in its bit
845 masks.
846
847 ESRCH Attempted to write(2) the process ID (PID) of a nonexistent
848 process to a cpuset tasks file.
849
851 Cpusets appeared in version 2.6.12 of the Linux kernel.
852
854 Despite its name, the pid parameter is actually a thread ID, and each
855 thread in a threaded group can be attached to a different cpuset. The
856 value returned from a call to gettid(2) can be passed in the argument
857 pid.
858
860 cpuset.memory_pressure cpuset files can be opened for writing, cre‐
861 ation, or truncation, but then the write(2) fails with errno set to
862 EACCES, and the creation and truncation options on open(2) have no ef‐
863 fect.
864
866 The following examples demonstrate querying and setting cpuset options
867 using shell commands.
868
869 Creating and attaching to a cpuset.
870 To create a new cpuset and attach the current command shell to it, the
871 steps are:
872
873 1) mkdir /dev/cpuset (if not already done)
874 2) mount -t cpuset none /dev/cpuset (if not already done)
875 3) Create the new cpuset using mkdir(1).
876 4) Assign CPUs and memory nodes to the new cpuset.
877 5) Attach the shell to the new cpuset.
878
879 For example, the following sequence of commands will set up a cpuset
880 named "Charlie", containing just CPUs 2 and 3, and memory node 1, and
881 then attach the current shell to that cpuset.
882
883 $ mkdir /dev/cpuset
884 $ mount -t cpuset cpuset /dev/cpuset
885 $ cd /dev/cpuset
886 $ mkdir Charlie
887 $ cd Charlie
888 $ /bin/echo 2-3 > cpuset.cpus
889 $ /bin/echo 1 > cpuset.mems
890 $ /bin/echo $$ > tasks
891 # The current shell is now running in cpuset Charlie
892 # The next line should display '/Charlie'
893 $ cat /proc/self/cpuset
894
895 Migrating a job to different memory nodes.
896 To migrate a job (the set of processes attached to a cpuset) to differ‐
897 ent CPUs and memory nodes in the system, including moving the memory
898 pages currently allocated to that job, perform the following steps.
899
900 1) Let's say we want to move the job in cpuset alpha (CPUs 4–7 and
901 memory nodes 2–3) to a new cpuset beta (CPUs 16–19 and memory nodes
902 8–9).
903 2) First create the new cpuset beta.
904 3) Then allow CPUs 16–19 and memory nodes 8–9 in beta.
905 4) Then enable memory_migration in beta.
906 5) Then move each process from alpha to beta.
907
908 The following sequence of commands accomplishes this.
909
910 $ cd /dev/cpuset
911 $ mkdir beta
912 $ cd beta
913 $ /bin/echo 16-19 > cpuset.cpus
914 $ /bin/echo 8-9 > cpuset.mems
915 $ /bin/echo 1 > cpuset.memory_migrate
916 $ while read i; do /bin/echo $i; done < ../alpha/tasks > tasks
917
918 The above should move any processes in alpha to beta, and any memory
919 held by these processes on memory nodes 2–3 to memory nodes 8–9, re‐
920 spectively.
921
922 Notice that the last step of the above sequence did not do:
923
924 $ cp ../alpha/tasks tasks
925
926 The while loop, rather than the seemingly easier use of the cp(1) com‐
927 mand, was necessary because only one process PID at a time may be writ‐
928 ten to the tasks file.
929
930 The same effect (writing one PID at a time) as the while loop can be
931 accomplished more efficiently, in fewer keystrokes and in syntax that
932 works on any shell, but alas more obscurely, by using the -u (un‐
933 buffered) option of sed(1):
934
935 $ sed -un p < ../alpha/tasks > tasks
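
For use in scripts, the one-PID-per-write loop can be wrapped in a function that also reports failures, for example when a process exits before it can be moved. This is a sketch only; the name move_tasks is hypothetical.

```sh
# move_tasks SRC_TASKS DST_TASKS
# Read PIDs from SRC_TASKS and write them to DST_TASKS one at a time,
# using a separate /bin/echo (and hence a separate write) for each PID.
# Exits nonzero if any write fails or a file cannot be opened.
# Hypothetical helper for illustration only.
# The body runs in a subshell so its variables do not leak.
move_tasks() (
    status=0
    while read -r pid; do
        /bin/echo "$pid" || status=1
    done < "$1" > "$2" || status=1
    exit "$status"
)

# Possible use from within /dev/cpuset/beta:
#   move_tasks ../alpha/tasks tasks
```

The loop keeps going after a failed write so that one vanished process does not prevent the remaining ones from being moved.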
936
938 taskset(1), get_mempolicy(2), getcpu(2), mbind(2), sched_getaffin‐
939 ity(2), sched_setaffinity(2), sched_setscheduler(2), set_mempolicy(2),
940 CPU_SET(3), proc(5), cgroups(7), numa(7), sched(7), migratepages(8),
941 numactl(8)
942
943 Documentation/admin-guide/cgroup-v1/cpusets.rst in the Linux kernel
944 source tree (or Documentation/cgroup-v1/cpusets.txt before Linux 4.18,
945 and Documentation/cpusets.txt before Linux 2.6.29)
946
948 This page is part of release 5.13 of the Linux man-pages project. A
949 description of the project, information about reporting bugs, and the
950 latest version of this page, can be found at
951 https://www.kernel.org/doc/man-pages/.
952
953
954
955Linux 2020-11-01 CPUSET(7)