CGROUPS(7)                 Linux Programmer's Manual                CGROUPS(7)

NAME
       cgroups - Linux control groups

DESCRIPTION
       Control groups, usually referred to as cgroups, are a Linux kernel
       feature which allow processes to be organized into hierarchical
       groups whose usage of various types of resources can then be limited
       and monitored.  The kernel's cgroup interface is provided through a
       pseudo-filesystem called cgroupfs.  Grouping is implemented in the
       core cgroup kernel code, while resource tracking and limits are
       implemented in a set of per-resource-type subsystems (memory, CPU,
       and so on).

   Terminology
       A cgroup is a collection of processes that are bound to a set of
       limits or parameters defined via the cgroup filesystem.

       A subsystem is a kernel component that modifies the behavior of the
       processes in a cgroup.  Various subsystems have been implemented,
       making it possible to do things such as limiting the amount of CPU
       time and memory available to a cgroup, accounting for the CPU time
       used by a cgroup, and freezing and resuming execution of the
       processes in a cgroup.  Subsystems are sometimes also known as
       resource controllers (or simply, controllers).

       The cgroups for a controller are arranged in a hierarchy.  This
       hierarchy is defined by creating, removing, and renaming
       subdirectories within the cgroup filesystem.  At each level of the
       hierarchy, attributes (e.g., limits) can be defined.  The limits,
       control, and accounting provided by cgroups generally have effect
       throughout the subhierarchy underneath the cgroup where the
       attributes are defined.  Thus, for example, the limits placed on a
       cgroup at a higher level in the hierarchy cannot be exceeded by
       descendant cgroups.

   Cgroups version 1 and version 2
       The initial release of the cgroups implementation was in Linux
       2.6.24.  Over time, various cgroup controllers have been added to
       allow the management of various types of resources.  However, the
       development of these controllers was largely uncoordinated, with the
       result that many inconsistencies arose between controllers and
       management of the cgroup hierarchies became rather complex.  (A
       longer description of these problems can be found in the kernel
       source file Documentation/cgroup-v2.txt.)

       Because of the problems with the initial cgroups implementation
       (cgroups version 1), starting in Linux 3.10, work began on a new,
       orthogonal implementation to remedy these problems.  Initially
       marked experimental, and hidden behind the -o __DEVEL__sane_behavior
       mount option, the new version (cgroups version 2) was eventually
       made official with the release of Linux 4.5.  Differences between
       the two versions are described in the text below.

       Although cgroups v2 is intended as a replacement for cgroups v1, the
       older system continues to exist (and for compatibility reasons is
       unlikely to be removed).  Currently, cgroups v2 implements only a
       subset of the controllers available in cgroups v1.  The two systems
       are implemented so that both v1 controllers and v2 controllers can
       be mounted on the same system.  Thus, for example, it is possible to
       use those controllers that are supported under version 2, while also
       using version 1 controllers where version 2 does not yet support
       those controllers.  The only restriction here is that a controller
       can't be simultaneously employed in both a cgroups v1 hierarchy and
       in the cgroups v2 hierarchy.

CGROUPS VERSION 1
       Under cgroups v1, each controller may be mounted against a separate
       cgroup filesystem that provides its own hierarchical organization of
       the processes on the system.  It is also possible to comount
       multiple (or even all) cgroups v1 controllers against the same
       cgroup filesystem, meaning that the comounted controllers manage the
       same hierarchical organization of processes.

       For each mounted hierarchy, the directory tree mirrors the control
       group hierarchy.  Each control group is represented by a directory,
       with each of its child control groups represented as a child
       directory.  For instance, /user/joe/1.session represents control
       group 1.session, which is a child of cgroup joe, which is a child of
       /user.  Under each cgroup directory is a set of files which can be
       read or written to, reflecting resource limits and a few general
       cgroup properties.

   Tasks (threads) versus processes
       In cgroups v1, a distinction is drawn between processes and tasks.
       In this view, a process can consist of multiple tasks (more commonly
       called threads, from a user-space perspective, and called such in
       the remainder of this man page).  In cgroups v1, it is possible to
       independently manipulate the cgroup memberships of the threads in a
       process.

       The cgroups v1 ability to split threads across different cgroups
       caused problems in some cases.  For example, it made no sense for
       the memory controller, since all of the threads of a process share a
       single address space.  Because of these problems, the ability to
       independently manipulate the cgroup memberships of the threads in a
       process was removed in the initial cgroups v2 implementation, and
       subsequently restored in a more limited form (see the discussion of
       "thread mode" below).

   Mounting v1 controllers
       The use of cgroups requires a kernel built with the CONFIG_CGROUPS
       option.  In addition, each of the v1 controllers has an associated
       configuration option that must be set in order to employ that
       controller.

       In order to use a v1 controller, it must be mounted against a cgroup
       filesystem.  The usual place for such mounts is under a tmpfs(5)
       filesystem mounted at /sys/fs/cgroup.  Thus, one might mount the cpu
       controller as follows:

           mount -t cgroup -o cpu none /sys/fs/cgroup/cpu

       It is possible to comount multiple controllers against the same
       hierarchy.  For example, here the cpu and cpuacct controllers are
       comounted against a single hierarchy:

           mount -t cgroup -o cpu,cpuacct none /sys/fs/cgroup/cpu,cpuacct

       Comounting controllers has the effect that a process is in the same
       cgroup for all of the comounted controllers.  Separately mounting
       controllers allows a process to be in cgroup /foo1 for one
       controller while being in /foo2/foo3 for another, as illustrated in
       the sketch below.
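
       For example, the following sketch (the cgroup names /foo1 and
       /foo2/foo3 are arbitrary names chosen here) mounts the cpu and
       memory controllers separately and places the shell in a different
       cgroup under each hierarchy; the mount commands will fail if these
       controllers are already mounted, as they are on many systemd(1)-
       based systems:

           mount -t cgroup -o cpu none /sys/fs/cgroup/cpu
           mount -t cgroup -o memory none /sys/fs/cgroup/memory
           mkdir /sys/fs/cgroup/cpu/foo1
           mkdir -p /sys/fs/cgroup/memory/foo2/foo3
           echo $$ > /sys/fs/cgroup/cpu/foo1/cgroup.procs
           echo $$ > /sys/fs/cgroup/memory/foo2/foo3/cgroup.procs
           grep -E 'cpu|memory' /proc/self/cgroup   # Shows both memberships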

       It is possible to comount all v1 controllers against the same
       hierarchy:

           mount -t cgroup -o all cgroup /sys/fs/cgroup

       (One can achieve the same result by omitting -o all, since it is the
       default if no controllers are explicitly specified.)

       It is not possible to mount the same controller against multiple
       cgroup hierarchies.  For example, it is not possible to mount both
       the cpu and cpuacct controllers against one hierarchy, and to mount
       the cpu controller alone against another hierarchy.  It is possible
       to create multiple mount points with exactly the same set of
       comounted controllers.  However, in this case all that results is
       multiple mount points providing a view of the same hierarchy.

       Note that on many systems, the v1 controllers are automatically
       mounted under /sys/fs/cgroup; in particular, systemd(1)
       automatically creates such mount points.

   Unmounting v1 controllers
       A mounted cgroup filesystem can be unmounted using the umount(8)
       command, as in the following example:

           umount /sys/fs/cgroup/pids

       But note well: a cgroup filesystem is unmounted only if it is not
       busy, that is, it has no child cgroups.  If this is not the case,
       then the only effect of the umount(8) is to make the mount
       invisible.  Thus, to ensure that the mount point is really removed,
       one must first remove all child cgroups, which in turn can be done
       only after all member processes have been moved from those cgroups
       to the root cgroup.
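
       For example, the following sketch (assuming a single hypothetical
       child cgroup, cg1, under the pids hierarchy shown above) moves all
       member processes of the child back to the root cgroup, removes the
       child, and then unmounts the filesystem:

           # Move each member of cg1 to the root cgroup of the hierarchy
           while read pid; do
               echo $pid > /sys/fs/cgroup/pids/cgroup.procs
           done < /sys/fs/cgroup/pids/cg1/cgroup.procs

           rmdir /sys/fs/cgroup/pids/cg1    # Remove the now-empty cgroup
           umount /sys/fs/cgroup/pids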

   Cgroups version 1 controllers
       Each of the cgroups version 1 controllers is governed by a kernel
       configuration option (listed below).  Additionally, the availability
       of the cgroups feature is governed by the CONFIG_CGROUPS kernel
       configuration option.

       cpu (since Linux 2.6.24; CONFIG_CGROUP_SCHED)
              Cgroups can be guaranteed a minimum number of "CPU shares"
              when a system is busy.  This does not limit a cgroup's CPU
              usage if the CPUs are not busy.  For further information, see
              Documentation/scheduler/sched-design-CFS.txt.

              In Linux 3.2, this controller was extended to provide CPU
              "bandwidth" control.  If the kernel is configured with
              CONFIG_CFS_BANDWIDTH, then within each scheduling period
              (defined via a file in the cgroup directory), it is possible
              to define an upper limit on the CPU time allocated to the
              processes in a cgroup.  This upper limit applies even if
              there is no other competition for the CPU.  Further
              information can be found in the kernel source file
              Documentation/scheduler/sched-bwc.txt.
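
              As an illustration, the following sketch (using a
              hypothetical cgroup cg1, and assuming the cpu controller is
              mounted as described above) uses the cpu.cfs_period_us and
              cpu.cfs_quota_us files of the bandwidth controller to allow
              the processes in cg1 at most 50 ms of CPU time in each 100 ms
              scheduling period, i.e., half of one CPU:

                  mkdir /sys/fs/cgroup/cpu/cg1
                  echo 100000 > /sys/fs/cgroup/cpu/cg1/cpu.cfs_period_us
                  echo 50000 > /sys/fs/cgroup/cpu/cg1/cpu.cfs_quota_us
                  echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs

              (The period and quota values are expressed in microseconds.)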

       cpuacct (since Linux 2.6.24; CONFIG_CGROUP_CPUACCT)
              This provides accounting for CPU usage by groups of
              processes.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/cpuacct.txt.

       cpuset (since Linux 2.6.24; CONFIG_CPUSETS)
              This controller can be used to bind the processes in a cgroup
              to a specified set of CPUs and NUMA nodes.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/cpusets.txt.

       memory (since Linux 2.6.25; CONFIG_MEMCG)
              The memory controller supports reporting and limiting of
              process memory, kernel memory, and swap used by cgroups.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/memory.txt.

       devices (since Linux 2.6.26; CONFIG_CGROUP_DEVICE)
              This supports controlling which processes may create (mknod)
              devices as well as open them for reading or writing.  The
              policies may be specified as whitelists and blacklists.
              Hierarchy is enforced, so new rules must not violate existing
              rules for the target or ancestor cgroups.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/devices.txt.

       freezer (since Linux 2.6.28; CONFIG_CGROUP_FREEZER)
              The freezer cgroup can suspend and restore (resume) all
              processes in a cgroup.  Freezing a cgroup /A also causes its
              children, for example, processes in /A/B, to be frozen.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/freezer-subsystem.txt.

       net_cls (since Linux 2.6.29; CONFIG_CGROUP_NET_CLASSID)
              This places a classid, specified for the cgroup, on network
              packets created by a cgroup.  These classids can then be used
              in firewall rules, as well as used to shape traffic using
              tc(8).  This applies only to packets leaving the cgroup, not
              to traffic arriving at the cgroup.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/net_cls.txt.

       blkio (since Linux 2.6.33; CONFIG_BLK_CGROUP)
              The blkio cgroup controls and limits access to specified
              block devices by applying IO control in the form of
              throttling and upper limits against leaf nodes and
              intermediate nodes in the storage hierarchy.

              Two policies are available.  The first is a
              proportional-weight time-based division of disk implemented
              with CFQ.  This is in effect for leaf nodes using CFQ.  The
              second is a throttling policy which specifies upper I/O rate
              limits on a device.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/blkio-controller.txt.

       perf_event (since Linux 2.6.39; CONFIG_CGROUP_PERF)
              This controller allows perf monitoring of the set of
              processes grouped in a cgroup.

              Further information can be found in the kernel source file
              tools/perf/Documentation/perf-record.txt.

       net_prio (since Linux 3.3; CONFIG_CGROUP_NET_PRIO)
              This allows priorities to be specified, per network
              interface, for cgroups.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/net_prio.txt.

       hugetlb (since Linux 3.5; CONFIG_CGROUP_HUGETLB)
              This supports limiting the use of huge pages by cgroups.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/hugetlb.txt.

       pids (since Linux 4.3; CONFIG_CGROUP_PIDS)
              This controller permits limiting the number of processes that
              may be created in a cgroup (and its descendants).

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/pids.txt.

       rdma (since Linux 4.11; CONFIG_CGROUP_RDMA)
              The RDMA controller permits limiting the use of
              RDMA/IB-specific resources per cgroup.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/rdma.txt.

   Creating cgroups and moving processes
       A cgroup filesystem initially contains a single root cgroup, '/',
       which all processes belong to.  A new cgroup is created by creating
       a directory in the cgroup filesystem:

           mkdir /sys/fs/cgroup/cpu/cg1

       This creates a new empty cgroup.

       A process may be moved to this cgroup by writing its PID into the
       cgroup's cgroup.procs file:

           echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs

       Only one PID at a time should be written to this file.

       Writing the value 0 to a cgroup.procs file causes the writing
       process to be moved to the corresponding cgroup.

       When writing a PID into the cgroup.procs file, all threads in the
       process are moved into the new cgroup at once.

       Within a hierarchy, a process can be a member of exactly one cgroup.
       Writing a process's PID to a cgroup.procs file automatically removes
       it from the cgroup of which it was previously a member.

       The cgroup.procs file can be read to obtain a list of the processes
       that are members of a cgroup.  The returned list of PIDs is not
       guaranteed to be in order.  Nor is it guaranteed to be free of
       duplicates.  (For example, a PID may be recycled while reading from
       the list.)

       In cgroups v1, an individual thread can be moved to another cgroup
       by writing its thread ID (i.e., the kernel thread ID returned by
       clone(2) and gettid(2)) to the tasks file in a cgroup directory.
       This file can be read to discover the set of threads that are
       members of the cgroup.

   Removing cgroups
       To remove a cgroup, it must first have no child cgroups and contain
       no (nonzombie) processes.  So long as that is the case, one can
       simply remove the corresponding directory pathname.  Note that files
       in a cgroup directory cannot and need not be removed.
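
       Thus, for example, the cg1 cgroup created earlier can be removed
       with the following command once it is empty:

           rmdir /sys/fs/cgroup/cpu/cg1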

   Cgroups v1 release notification
       Two files can be used to determine whether the kernel provides
       notifications when a cgroup becomes empty.  A cgroup is considered
       to be empty when it contains no child cgroups and no member
       processes.

       A special file in the root directory of each cgroup hierarchy,
       release_agent, can be used to register the pathname of a program
       that may be invoked when a cgroup in the hierarchy becomes empty.
       The pathname of the newly empty cgroup (relative to the cgroup mount
       point) is provided as the sole command-line argument when the
       release_agent program is invoked.  The release_agent program might
       remove the cgroup directory, or perhaps repopulate it with a
       process.

       The default value of the release_agent file is empty, meaning that
       no release agent is invoked.

       The content of the release_agent file can also be specified via a
       mount option when the cgroup filesystem is mounted:

           mount -o release_agent=pathname ...

       Whether or not the release_agent program is invoked when a
       particular cgroup becomes empty is determined by the value in the
       notify_on_release file in the corresponding cgroup directory.  If
       this file contains the value 0, then the release_agent program is
       not invoked.  If it contains the value 1, the release_agent program
       is invoked.  The default value for this file in the root cgroup is
       0.  At the time when a new cgroup is created, the value in this file
       is inherited from the corresponding file in the parent cgroup.
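
       The following sketch ties these pieces together.  Suppose that the
       file /usr/local/sbin/cgroup-release-agent (the pathname is
       arbitrary) contains this script, which removes the newly empty
       cgroup whose hierarchy-relative pathname is passed as its sole
       argument:

           #!/bin/sh
           # $1 is the pathname of the empty cgroup, relative to the
           # mount point of the hierarchy
           rmdir "/sys/fs/cgroup/pids/$1"

       The agent can then be registered, and notification enabled for the
       hypothetical cgroup cg1, as follows:

           echo /usr/local/sbin/cgroup-release-agent > \
               /sys/fs/cgroup/pids/release_agent
           echo 1 > /sys/fs/cgroup/pids/cg1/notify_on_release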

   Cgroup v1 named hierarchies
       In cgroups v1, it is possible to mount a cgroup hierarchy that has
       no attached controllers:

           mount -t cgroup -o none,name=somename none /some/mount/point

       Multiple instances of such hierarchies can be mounted; each
       hierarchy must have a unique name.  The only purpose of such
       hierarchies is to track processes.  (See the discussion of release
       notification above.)  An example of this is the name=systemd cgroup
       hierarchy that is used by systemd(1) to track services and user
       sessions.

CGROUPS VERSION 2
       In cgroups v2, all mounted controllers reside in a single unified
       hierarchy.  While (different) controllers may be simultaneously
       mounted under the v1 and v2 hierarchies, it is not possible to mount
       the same controller simultaneously under both the v1 and the v2
       hierarchies.

       The new behaviors in cgroups v2 are summarized here, and in some
       cases elaborated in the following subsections.

       1. Cgroups v2 provides a unified hierarchy against which all
          controllers are mounted.

       2. "Internal" processes are not permitted.  With the exception of
          the root cgroup, processes may reside only in leaf nodes (cgroups
          that do not themselves contain child cgroups).  The details are
          somewhat more subtle than this, and are described below.

       3. Active controllers must be specified via the files
          cgroup.controllers and cgroup.subtree_control.

       4. The tasks file has been removed.  In addition, the
          cgroup.clone_children file that is employed by the cpuset
          controller has been removed.

       5. An improved mechanism for notification of empty cgroups is
          provided by the cgroup.events file.

       For more changes, see the Documentation/cgroup-v2.txt file in the
       kernel source.

       Some of the new behaviors listed above saw subsequent modification
       with the addition in Linux 4.14 of "thread mode" (described below).

   Cgroups v2 unified hierarchy
       In cgroups v1, the ability to mount different controllers against
       different hierarchies was intended to allow great flexibility for
       application design.  In practice, though, the flexibility turned out
       to be less useful than expected, and in many cases added complexity.
       Therefore, in cgroups v2, all available controllers are mounted
       against a single hierarchy.  The available controllers are
       automatically mounted, meaning that it is not necessary (or
       possible) to specify the controllers when mounting the cgroup v2
       filesystem using a command such as the following:

           mount -t cgroup2 none /mnt/cgroup2

       A cgroup v2 controller is available only if it is not currently in
       use via a mount against a cgroup v1 hierarchy.  Or, to put things
       another way, it is not possible to employ the same controller
       against both a v1 hierarchy and the unified v2 hierarchy.  This
       means that it may be necessary first to unmount a v1 controller (as
       described above) before that controller is available in v2.  Since
       systemd(1) makes heavy use of some v1 controllers by default, it can
       in some cases be simpler to boot the system with selected v1
       controllers disabled.  To do this, specify the cgroup_no_v1=list
       option on the kernel boot command line; list is a comma-separated
       list of the names of the controllers to disable, or the word all to
       disable all v1 controllers.  (This situation is correctly handled by
       systemd(1), which falls back to operating without the specified
       controllers.)

       Note that on many modern systems, systemd(1) automatically mounts
       the cgroup2 filesystem at /sys/fs/cgroup/unified during the boot
       process.

   Cgroups v2 controllers
       The following controllers, documented in the kernel source file
       Documentation/cgroup-v2.txt, are supported in cgroups version 2:

       io (since Linux 4.5)
              This is the successor of the version 1 blkio controller.

       memory (since Linux 4.5)
              This is the successor of the version 1 memory controller.

       pids (since Linux 4.5)
              This is the same as the version 1 pids controller.

       perf_event (since Linux 4.11)
              This is the same as the version 1 perf_event controller.

       rdma (since Linux 4.11)
              This is the same as the version 1 rdma controller.

       cpu (since Linux 4.15)
              This is the successor to the version 1 cpu and cpuacct
              controllers.

   Cgroups v2 subtree control
       Each cgroup in the v2 hierarchy contains the following two files:

       cgroup.controllers
              This read-only file exposes a list of the controllers that
              are available in this cgroup.  The contents of this file
              match the contents of the cgroup.subtree_control file in the
              parent cgroup.

       cgroup.subtree_control
              This is a list of controllers that are active (enabled) in
              the cgroup.  The set of controllers in this file is a subset
              of the set in the cgroup.controllers of this cgroup.  The set
              of active controllers is modified by writing strings to this
              file containing space-delimited controller names, each
              preceded by '+' (to enable a controller) or '-' (to disable a
              controller), as in the following example:

                  echo '+pids -memory' > x/y/cgroup.subtree_control

              An attempt to enable a controller that is not present in
              cgroup.controllers leads to an ENOENT error when writing to
              the cgroup.subtree_control file.

       Because the list of controllers in cgroup.subtree_control is a
       subset of those in cgroup.controllers, a controller that has been
       disabled in one cgroup in the hierarchy can never be re-enabled in
       the subtree below that cgroup.

       A cgroup's cgroup.subtree_control file determines the set of
       controllers that are exercised in the child cgroups.  When a
       controller (e.g., pids) is present in the cgroup.subtree_control
       file of a parent cgroup, then the corresponding controller-interface
       files (e.g., pids.max) are automatically created in the children of
       that cgroup and can be used to exert resource control in the child
       cgroups.
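
       For example, the following sketch (using the /mnt/cgroup2 mount
       point from the example above, and hypothetical cgroups grp and
       grp/child) enables the pids controller in successive levels of the
       hierarchy and then limits the number of processes in the child:

           echo '+pids' > /mnt/cgroup2/cgroup.subtree_control
           mkdir /mnt/cgroup2/grp
           cat /mnt/cgroup2/grp/cgroup.controllers   # Now includes "pids"
           echo '+pids' > /mnt/cgroup2/grp/cgroup.subtree_control
           mkdir /mnt/cgroup2/grp/child
           echo 30 > /mnt/cgroup2/grp/child/pids.max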

   Cgroups v2 "no internal processes" rule
       Cgroups v2 enforces a so-called "no internal processes" rule.
       Roughly speaking, this rule means that, with the exception of the
       root cgroup, processes may reside only in leaf nodes (cgroups that
       do not themselves contain child cgroups).  This avoids the need to
       decide how to partition resources between processes which are
       members of cgroup A and processes in child cgroups of A.

       For instance, if cgroup /cg1/cg2 exists, then a process may reside
       in /cg1/cg2, but not in /cg1.  This is to avoid an ambiguity in
       cgroups v1 with respect to the delegation of resources between
       processes in /cg1 and its child cgroups.  The recommended approach
       in cgroups v2 is to create a subdirectory called leaf for any
       nonleaf cgroup which should contain processes, but no child cgroups.
       Thus, processes which previously would have gone into /cg1 would now
       go into /cg1/leaf.  This has the advantage of making explicit the
       relationship between processes in /cg1/leaf and /cg1's other
       children.

       The "no internal processes" rule is in fact more subtle than stated
       above.  More precisely, the rule is that a (nonroot) cgroup can't
       both (1) have member processes, and (2) distribute resources into
       child cgroups—that is, have a nonempty cgroup.subtree_control file.
       Thus, it is possible for a cgroup to have both member processes and
       child cgroups, but before controllers can be enabled for that
       cgroup, the member processes must be moved out of the cgroup (e.g.,
       perhaps into the child cgroups).
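
       The following sketch illustrates the rule, using hypothetical
       cgroups cg1 and cg1/leaf under the /mnt/cgroup2 mount point, and
       assuming (as in the previous sketch) that the pids controller has
       been enabled in the root cgroup's cgroup.subtree_control file:

           mkdir /mnt/cgroup2/cg1
           echo $$ > /mnt/cgroup2/cg1/cgroup.procs   # cg1 has a member
           mkdir /mnt/cgroup2/cg1/leaf
           # Enabling a controller in cg1 requires first moving the
           # member process out of cg1 (here, into the leaf child cgroup):
           echo $$ > /mnt/cgroup2/cg1/leaf/cgroup.procs
           echo '+pids' > /mnt/cgroup2/cg1/cgroup.subtree_control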

       With the Linux 4.14 addition of "thread mode" (described below), the
       "no internal processes" rule has been relaxed in some cases.

   Cgroups v2 cgroup.events file
       With cgroups v2, a new mechanism is provided to obtain notification
       about when a cgroup becomes empty.  The cgroups v1 release_agent and
       notify_on_release files are removed, and replaced by a new, more
       general-purpose file, cgroup.events.  This read-only file contains
       key-value pairs (delimited by newline characters, with the key and
       value separated by spaces) that identify events or state for a
       cgroup.  Currently, only one key appears in this file, populated,
       which has either the value 0, meaning that the cgroup (and its
       descendants) contain no (nonzombie) processes, or 1, meaning that
       the cgroup contains member processes.

       The cgroup.events file can be monitored, in order to receive
       notification when a cgroup transitions between the populated and
       unpopulated states (or vice versa).  When monitoring this file using
       inotify(7), transitions generate IN_MODIFY events, and when
       monitoring the file using poll(2), transitions generate POLLPRI
       events.
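
       For example, if the inotifywait(1) tool from the (commonly
       available, but not standard) inotify-tools package is installed, a
       shell loop such as the following sketch can report population
       changes in the hypothetical cgroup cg1:

           while inotifywait -e modify /mnt/cgroup2/cg1/cgroup.events; do
               grep populated /mnt/cgroup2/cg1/cgroup.events
           done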

       The cgroups v2 release-notification mechanism provided by the
       populated field of the cgroup.events file offers at least two
       advantages over the cgroups v1 release_agent mechanism.  First, it
       allows for cheaper notification, since a single process can monitor
       multiple cgroup.events files.  By contrast, the cgroups v1 mechanism
       requires the creation of a process for each notification.  Second,
       notification can be delegated to a process that lives inside a
       container associated with the newly empty cgroup.

   Cgroups v2 cgroup.stat file
       Each cgroup in the v2 hierarchy contains a read-only cgroup.stat
       file (first introduced in Linux 4.14) that consists of lines
       containing key-value pairs.  The following keys currently appear in
       this file:

       nr_descendants
              This is the total number of visible (i.e., living) descendant
              cgroups underneath this cgroup.

       nr_dying_descendants
              This is the total number of dying descendant cgroups
              underneath this cgroup.  A cgroup enters the dying state
              after being deleted.  It remains in that state for an
              undefined period (which will depend on system load) while
              resources are freed before the cgroup is destroyed.  Note
              that the presence of some cgroups in the dying state is
              normal, and is not indicative of any problem.

              A process can't be made a member of a dying cgroup, and a
              dying cgroup can't be brought back to life.

   Limiting the number of descendant cgroups
       Each cgroup in the v2 hierarchy contains the following files, which
       can be used to view and set limits on the number of descendant
       cgroups under that cgroup:

       cgroup.max.depth (since Linux 4.14)
              This file defines a limit on the depth of nesting of
              descendant cgroups.  A value of 0 in this file means that no
              descendant cgroups can be created.  An attempt to create a
              descendant whose nesting level exceeds the limit fails
              (mkdir(2) fails with the error EAGAIN).

              Writing the string "max" to this file means that no limit is
              imposed.  The default value in this file is "max".

       cgroup.max.descendants (since Linux 4.14)
              This file defines a limit on the number of live descendant
              cgroups that this cgroup may have.  An attempt to create more
              descendants than allowed by the limit fails (mkdir(2) fails
              with the error EAGAIN).

              Writing the string "max" to this file means that no limit is
              imposed.  The default value in this file is "max".
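
       For example, the following sketch (again using the hypothetical
       cgroup cg1) limits cg1 to at most 10 live descendant cgroups, nested
       at most 3 levels deep:

           echo 10 > /mnt/cgroup2/cg1/cgroup.max.descendants
           echo 3 > /mnt/cgroup2/cg1/cgroup.max.depth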

   Cgroups v2 delegation: delegation to a less privileged user
       In the context of cgroups, delegation means passing management of
       some subtree of the cgroup hierarchy to a nonprivileged process.
       Cgroups v1 provided a form of delegation that was accidental and not
       fully secure.  Cgroups v2 supports delegation by explicit design.

       Some terminology is required in order to describe delegation.  A
       delegater is a privileged user (i.e., root) who owns a parent
       cgroup.  A delegatee is a nonprivileged user who will be granted the
       permissions needed to manage some subhierarchy under that parent
       cgroup, known as the delegated subtree.

       To perform delegation, the delegater makes certain directories and
       files writable by the delegatee, typically by changing the ownership
       of the objects to be the user ID of the delegatee.  Assuming that we
       want to delegate the hierarchy rooted at (say) /dlgt_grp and that
       there are not yet any child cgroups under that cgroup, the ownership
       of the following is changed to the user ID of the delegatee:

       /dlgt_grp
              Changing the ownership of the root of the subtree means that
              any new cgroups created under the subtree (and the files they
              contain) will also be owned by the delegatee.

       /dlgt_grp/cgroup.procs
              Changing the ownership of this file means that the delegatee
              can move processes into the root of the delegated subtree.

       /dlgt_grp/cgroup.subtree_control
              Changing the ownership of this file means that the delegatee
              can enable controllers (that are present in
              /dlgt_grp/cgroup.controllers) in order to further
              redistribute resources at lower levels in the subtree.  (As
              an alternative to changing the ownership of this file, the
              delegater might instead add selected controllers to this
              file.)

       /dlgt_grp/cgroup.threads
              Changing the ownership of this file is necessary if a
              threaded subtree is being delegated (see the description of
              "thread mode", below).  This permits the delegatee to write
              thread IDs to the file.  (The ownership of this file can also
              be changed when delegating a domain subtree, but currently
              this serves no purpose, since, as described below, it is not
              possible to move a thread between domain cgroups by writing
              its thread ID to the cgroup.threads file.)

       The delegater should not change the ownership of any of the
       controller interface files (e.g., pids.max, memory.high) in
       dlgt_grp.  Those files are used from the next level above the
       delegated subtree in order to distribute resources into the subtree,
       and the delegatee should not have permission to change the resources
       that are distributed into the delegated subtree.

       See also the discussion of the /sys/kernel/cgroup/delegate file in
       NOTES.

       After the aforementioned steps have been performed, the delegatee
       can create child cgroups within the delegated subtree (the cgroup
       subdirectories and the files they contain will be owned by the
       delegatee) and move processes between cgroups in the subtree.  If
       some controllers are present in dlgt_grp/cgroup.subtree_control, or
       the ownership of that file was passed to the delegatee, the
       delegatee can also control the further redistribution of the
       corresponding resources into the delegated subtree.
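
       Putting the above steps together, the delegation might be performed
       with a sketch like the following (cecilia, the delegatee, is the
       hypothetical user employed in the discussion below):

           chown cecilia /dlgt_grp
           chown cecilia /dlgt_grp/cgroup.procs
           chown cecilia /dlgt_grp/cgroup.subtree_control
           chown cecilia /dlgt_grp/cgroup.threads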

   Cgroups v2 delegation: nsdelegate and cgroup namespaces
       Starting with Linux 4.13, there is a second way to perform cgroup
       delegation.  This is done by mounting or remounting the cgroup v2
       filesystem with the nsdelegate mount option.  For example, if the
       cgroup v2 filesystem has already been mounted, we can remount it
       with the nsdelegate option as follows:

           mount -t cgroup2 -o remount,nsdelegate \
                 none /sys/fs/cgroup/unified

       The effect of this mount option is to cause cgroup namespaces to
       automatically become delegation boundaries.  More specifically, the
       following restrictions apply for processes inside the cgroup
       namespace:

       * Writes to controller interface files in the root directory of the
         namespace will fail with the error EPERM.  Processes inside the
         cgroup namespace can still write to delegatable files in the root
         directory of the cgroup namespace such as cgroup.procs and
         cgroup.subtree_control, and can create subhierarchy underneath the
         root directory.

       * Attempts to migrate processes across the namespace boundary are
         denied (with the error ENOENT).  Processes inside the cgroup
         namespace can still (subject to the containment rules described
         below) move processes between cgroups within the subhierarchy
         under the namespace root.

       The ability to define cgroup namespaces as delegation boundaries
       makes cgroup namespaces more useful.  To understand why, suppose
       that we already have one cgroup hierarchy that has been delegated to
       a nonprivileged user, cecilia, using the older delegation technique
       described above.  Suppose further that cecilia wanted to further
       delegate a subhierarchy under the existing delegated hierarchy.
       (For example, the delegated hierarchy might be associated with an
       unprivileged container run by cecilia.)  Even if a cgroup namespace
       was employed, because both hierarchies are owned by the unprivileged
       user cecilia, the following illegitimate actions could be performed:

       * A process in the inferior hierarchy could change the resource
         controller settings in the root directory of that hierarchy.
         (These resource controller settings are intended to allow control
         to be exercised from the parent cgroup; a process inside the child
         cgroup should not be allowed to modify them.)

       * A process inside the inferior hierarchy could move processes into
         and out of the inferior hierarchy if the cgroups in the superior
         hierarchy were somehow visible.

       Employing the nsdelegate mount option prevents both of these
       possibilities.

       The nsdelegate mount option only has an effect when performed in the
       initial mount namespace; in other mount namespaces, the option is
       silently ignored.

       Note: On some systems, systemd(1) automatically mounts the cgroup v2
       filesystem.  In order to experiment with the nsdelegate operation,
       it may be desirable to boot the system with the v1 controllers
       disabled (using the cgroup_no_v1 kernel boot option described
       earlier), so that the v2 filesystem can be remounted manually with
       the desired options.

   Cgroup v2 delegation containment rules
       Some delegation containment rules ensure that the delegatee can move
       processes between cgroups within the delegated subtree, but can't
       move processes from outside the delegated subtree into the subtree
       or vice versa.  A nonprivileged process (i.e., the delegatee) can
       write the PID of a "target" process into a cgroup.procs file only if
       all of the following are true:

       * The writer has write permission on the cgroup.procs file in the
         destination cgroup.

       * The writer has write permission on the cgroup.procs file in the
         common ancestor of the source and destination cgroups.  (In some
         cases, the common ancestor may be the source or destination cgroup
         itself.)

       * If the cgroup v2 filesystem was mounted with the nsdelegate
         option, the writer must be able to see the source and destination
         cgroups from its cgroup namespace.

       * Before Linux 4.11: the effective UID of the writer (i.e., the
         delegatee) matches the real user ID or the saved set-user-ID of
         the target process.  (This was a historical requirement inherited
         from cgroups v1 that was later deemed unnecessary, since the other
         rules suffice for containment in cgroups v2.)

       Note: one consequence of these delegation containment rules is that
       the unprivileged delegatee can't place the first process into the
       delegated subtree; instead, the delegater must place the first
       process (a process owned by the delegatee) into the delegated
       subtree.

CGROUPS VERSION 2 THREAD MODE
       Among the restrictions imposed by cgroups v2 that were not present
       in cgroups v1 are the following:

       * No thread-granularity control: all of the threads of a process
         must be in the same cgroup.

       * No internal processes: a cgroup can't both have member processes
         and exercise controllers on child cgroups.

       Both of these restrictions were added because the lack of these
       restrictions had caused problems in cgroups v1.  In particular, the
       cgroups v1 ability to allow thread-level granularity for cgroup
       membership made no sense for some controllers.  (A notable example
       was the memory controller: since threads share an address space, it
       made no sense to split threads across different memory cgroups.)

       Notwithstanding the initial design decision in cgroups v2, there
       were use cases for certain controllers, notably the cpu controller,
       for which thread-level granularity of control was meaningful and
       useful.  To accommodate such use cases, Linux 4.14 added thread mode
       for cgroups v2.

       Thread mode allows the following:

       * The creation of threaded subtrees in which the threads of a
         process may be spread across cgroups inside the tree.  (A threaded
         subtree may contain multiple multithreaded processes.)

       * The concept of threaded controllers, which can distribute
         resources across the cgroups in a threaded subtree.

       * A relaxation of the "no internal processes rule", so that, within
         a threaded subtree, a cgroup can both contain member threads and
         exercise resource control over child cgroups.

       With the addition of thread mode, each nonroot cgroup now contains a
       new file, cgroup.type, that exposes, and in some circumstances can
       be used to change, the "type" of a cgroup.  This file contains one
       of the following type values:

       domain This is a normal v2 cgroup that provides process-granularity
              control.  If a process is a member of this cgroup, then all
              threads of the process are (by definition) in the same
              cgroup.  This is the default cgroup type, and provides the
              same behavior that was provided for cgroups in the initial
              cgroups v2 implementation.

       threaded
              This cgroup is a member of a threaded subtree.  Threads can
              be added to this cgroup, and controllers can be enabled for
              the cgroup.

       domain threaded
              This is a domain cgroup that serves as the root of a threaded
              subtree.  This cgroup type is also known as "threaded root".

       domain invalid
              This is a cgroup inside a threaded subtree that is in an
              "invalid" state.  Processes can't be added to the cgroup, and
              controllers can't be enabled for the cgroup.  The only thing
              that can be done with this cgroup (other than deleting it) is
              to convert it to a threaded cgroup by writing the string
              "threaded" to the cgroup.type file.

              The rationale for the existence of this "interim" type during
              the creation of a threaded subtree (rather than the kernel
              simply immediately converting all cgroups under the threaded
              root to the type threaded) is to allow for possible future
              extensions to the thread mode model.

   Threaded versus domain controllers
       With the addition of thread mode, cgroups v2 now distinguishes two
       types of resource controllers:

       * Threaded controllers: these controllers support thread-granularity
         for resource control and can be enabled inside threaded subtrees,
         with the result that the corresponding controller-interface files
         appear inside the cgroups in the threaded subtree.  As at Linux
         4.15, the following controllers are threaded: cpu, perf_event, and
         pids.

       * Domain controllers: these controllers support only process
         granularity for resource control.  From the perspective of a
         domain controller, all threads of a process are always in the same
         cgroup.  Domain controllers can't be enabled inside a threaded
         subtree.

   Creating a threaded subtree
       There are two pathways that lead to the creation of a threaded
       subtree.  The first pathway proceeds as follows:

       1. We write the string "threaded" to the cgroup.type file of a
          cgroup y/z that currently has the type domain.  This has the
          following effects:

          * The type of the cgroup y/z becomes threaded.

          * The type of the parent cgroup, y, becomes domain threaded.  The
            parent cgroup is the root of a threaded subtree (also known as
            the "threaded root").

          * All other cgroups under y that were not already of type
            threaded (because they were inside already existing threaded
            subtrees under the new threaded root) are converted to type
            domain invalid.  Any subsequently created cgroups under y will
            also have the type domain invalid.

       2. We write the string "threaded" to each of the domain invalid
          cgroups under y, in order to convert them to the type threaded.
          As a consequence of this step, all of the cgroups under the
          threaded root now have the type threaded and the threaded subtree
          is now fully usable.  The requirement to write "threaded" to each
          of these cgroups is somewhat cumbersome, but allows for possible
          future extensions to the thread-mode model.
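
       In shell terms, the first pathway might look like the following
       sketch, using hypothetical cgroups y and y/z under the /mnt/cgroup2
       mount point employed earlier:

           mkdir -p /mnt/cgroup2/y/z
           echo threaded > /mnt/cgroup2/y/z/cgroup.type
           cat /mnt/cgroup2/y/cgroup.type     # Shows "domain threaded"
           cat /mnt/cgroup2/y/z/cgroup.type   # Shows "threaded"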

       The second way of creating a threaded subtree is as follows:

       1. In an existing cgroup, z, that currently has the type domain, we
          (1) enable one or more threaded controllers and (2) make a
          process a member of z.  (These two steps can be done in either
          order.)  This has the following consequences:

          * The type of z becomes domain threaded.

          * All of the descendant cgroups of z that were not already of
            type threaded are converted to type domain invalid.

       2. As before, we make the threaded subtree usable by writing the
          string "threaded" to each of the domain invalid cgroups under z,
          in order to convert them to the type threaded.

       One of the consequences of the above pathways to creating a threaded
       subtree is that the threaded root cgroup can be a parent only to
       threaded (and domain invalid) cgroups.  The threaded root cgroup
       can't be a parent of a domain cgroup, and a threaded cgroup can't
       have a sibling that is a domain cgroup.

   Using a threaded subtree
       Within a threaded subtree, threaded controllers can be enabled in
       each subgroup whose type has been changed to threaded; upon doing
       so, the corresponding controller interface files appear in the
       children of that cgroup.

       A process can be moved into a threaded subtree by writing its PID to
       the cgroup.procs file in one of the cgroups inside the tree.  This
       has the effect of making all of the threads in the process members
       of the corresponding cgroup and makes the process a member of the
       threaded subtree.  The threads of the process can then be spread
       across the threaded subtree by writing their thread IDs (see
       gettid(2)) to the cgroup.threads files in different cgroups inside
       the subtree.  The threads of a process must all reside in the same
       threaded subtree.

       As with writing to cgroup.procs, some containment rules apply when
       writing to the cgroup.threads file:

       * The writer must have write permission on the cgroup.threads file
         in the destination cgroup.

       * The writer must have write permission on the cgroup.procs file in
         the common ancestor of the source and destination cgroups.  (In
         some cases, the common ancestor may be the source or destination
         cgroup itself.)

       * The source and destination cgroups must be in the same threaded
         subtree.  (Outside a threaded subtree, an attempt to move a thread
         by writing its thread ID to the cgroup.threads file in a different
         domain cgroup fails with the error EOPNOTSUPP.)

       The cgroup.threads file is present in each cgroup (including domain
       cgroups) and can be read in order to discover the set of threads
       that is present in the cgroup.  The set of thread IDs obtained when
       reading this file is not guaranteed to be ordered or free of
       duplicates.

       The cgroup.procs file in the threaded root shows the PIDs of all
       processes that are members of the threaded subtree.  The
       cgroup.procs files in the other cgroups in the subtree are not
       readable.

       Domain controllers can't be enabled in a threaded subtree; no
       controller-interface files appear inside the cgroups underneath the
       threaded root.  From the point of view of a domain controller,
       threaded subtrees are invisible: a multithreaded process inside a
       threaded subtree appears to a domain controller as a process that
       resides in the threaded root cgroup.

       Within a threaded subtree, the "no internal processes" rule does not
       apply: a cgroup can both contain member processes (or threads) and
       exercise controllers on child cgroups.

   Rules for writing to cgroup.type and creating threaded subtrees
       A number of rules apply when writing to the cgroup.type file:

       * Only the string "threaded" may be written.  In other words, the
         only explicit transition that is possible is to convert a domain
         cgroup to type threaded.

       * The string "threaded" can be written only if the current value in
         cgroup.type is one of the following:

         · domain, to start the creation of a threaded subtree via the
           first of the pathways described above;

         · domain invalid, to convert one of the cgroups in a threaded
           subtree into a usable (i.e., threaded) state;

         · threaded, which has no effect (a "no-op").

       * We can't write to a cgroup.type file if the parent's type is
         domain invalid.  In other words, the cgroups of a threaded subtree
         must be converted to the threaded state in a top-down manner.

       There are also some constraints that must be satisfied in order to
       create a threaded subtree rooted at the cgroup x:

       * There can be no member processes in the descendant cgroups of x.
         (The cgroup x can itself have member processes.)

       * No domain controllers may be enabled in x's cgroup.subtree_control
         file.

       If any of the above constraints is violated, then an attempt to
       write "threaded" to a cgroup.type file fails with the error ENOTSUP.

   The "domain threaded" cgroup type
       According to the pathways described above, the type of a cgroup can
       change to domain threaded in either of the following cases:

       * The string "threaded" is written to a child cgroup.

       * A threaded controller is enabled inside the cgroup and a process
         is made a member of the cgroup.

       A domain threaded cgroup, x, can revert to the type domain if the
       above conditions no longer hold true—that is, if all threaded child
       cgroups of x are removed and either x no longer has threaded
       controllers enabled or no longer has member processes.

       When a domain threaded cgroup x reverts to the type domain:

       * All domain invalid descendants of x that are not in lower-level
         threaded subtrees revert to the type domain.

       * The root cgroups in any lower-level threaded subtrees revert to
         the type domain threaded.

   Exceptions for the root cgroup
       The root cgroup of the v2 hierarchy is treated exceptionally: it can
       be the parent of both domain and threaded cgroups.  If the string
       "threaded" is written to the cgroup.type file of one of the children
       of the root cgroup, then

       * The type of that cgroup becomes threaded.

       * The type of any descendants of that cgroup that are not part of
         lower-level threaded subtrees changes to domain invalid.

       Note that in this case, there is no cgroup whose type becomes domain
       threaded.  (Notionally, the root cgroup can be considered as the
       threaded root for the cgroup whose type was changed to threaded.)

       The aim of this exceptional treatment for the root cgroup is to
       allow a threaded cgroup that employs the cpu controller to be placed
       as high as possible in the hierarchy, so as to minimize the (small)
       cost of traversing the cgroup hierarchy.

   The cgroups v2 "cpu" controller and realtime processes
       As at Linux 4.15, the cgroups v2 cpu controller does not support
       control of realtime processes, and the controller can be enabled in
       the root cgroup only if all realtime threads are in the root cgroup.
       (If there are realtime processes in nonroot cgroups, then a write(2)
       of the string "+cpu" to the cgroup.subtree_control file fails with
       the error EINVAL.)  However, on some systems, systemd(1) places
       certain realtime processes in nonroot cgroups in the v2 hierarchy.
       On such systems, these processes must first be moved to the root
       cgroup before the cpu controller can be enabled.
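
       The following sketch shows one way of doing this (error handling is
       omitted), assuming the unified hierarchy is mounted at /mnt/cgroup2
       and that the rtprio output field of ps(1) is available for
       identifying realtime processes:

           # Move each process that has a realtime priority to the root
           # cgroup of the v2 hierarchy, then enable the cpu controller
           ps -e -o pid= -o rtprio= | awk '$2 != "-" {print $1}' |
           while read pid; do
               echo $pid > /mnt/cgroup2/cgroup.procs
           done
           echo '+cpu' > /mnt/cgroup2/cgroup.subtree_control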

ERRORS
       The following errors can occur for mount(2):

       EBUSY  An attempt to mount a cgroup version 1 filesystem specified
              neither the name= option (to mount a named hierarchy) nor a
              controller name (or all).

NOTES
       A child process created via fork(2) inherits its parent's cgroup
       memberships.  A process's cgroup memberships are preserved across
       execve(2).

   /proc files
       /proc/cgroups (since Linux 2.6.24)
              This file contains information about the controllers that are
              compiled into the kernel.  An example of the contents of this
              file (reformatted for readability) is the following:

                  #subsys_name    hierarchy    num_cgroups    enabled
                  cpuset          4            1              1
                  cpu             8            1              1
                  cpuacct         8            1              1
                  blkio           6            1              1
                  memory          3            1              1
                  devices         10           84             1
                  freezer         7            1              1
                  net_cls         9            1              1
                  perf_event      5            1              1
                  net_prio        9            1              1
                  hugetlb         0            1              0
                  pids            2            1              1

              The fields in this file are, from left to right:

              1. The name of the controller.

              2. The unique ID of the cgroup hierarchy on which this
                 controller is mounted.  If multiple cgroups v1 controllers
                 are bound to the same hierarchy, then each will show the
                 same hierarchy ID in this field.  The value in this field
                 will be 0 if:

                 a) the controller is not mounted on a cgroups v1
                    hierarchy;

                 b) the controller is bound to the cgroups v2 single
                    unified hierarchy; or

                 c) the controller is disabled (see below).

              3. The number of control groups in this hierarchy using this
                 controller.

              4. This field contains the value 1 if this controller is
                 enabled, or 0 if it has been disabled (via the
                 cgroup_disable kernel command-line boot parameter).

       /proc/[pid]/cgroup (since Linux 2.6.24)
              This file describes control groups to which the process with
              the corresponding PID belongs.  The displayed information
              differs for cgroups version 1 and version 2 hierarchies.

              For each cgroup hierarchy of which the process is a member,
              there is one entry containing three colon-separated fields:

                  hierarchy-ID:controller-list:cgroup-path

              For example:

                  5:cpuacct,cpu,cpuset:/daemons

              The colon-separated fields are, from left to right:

              1. For cgroups version 1 hierarchies, this field contains a
                 unique hierarchy ID number that can be matched to a
                 hierarchy ID in /proc/cgroups.  For the cgroups version 2
                 hierarchy, this field contains the value 0.

              2. For cgroups version 1 hierarchies, this field contains a
                 comma-separated list of the controllers bound to the
                 hierarchy.  For the cgroups version 2 hierarchy, this
                 field is empty.

              3. This field contains the pathname of the control group in
                 the hierarchy to which the process belongs.  This pathname
                 is relative to the mount point of the hierarchy.

   /sys/kernel/cgroup files
       /sys/kernel/cgroup/delegate (since Linux 4.15)
              This file exports a list of the cgroups v2 files (one per
              line) that are delegatable (i.e., whose ownership should be
              changed to the user ID of the delegatee).  In the future, the
              set of delegatable files may change or grow, and this file
              provides a way for the kernel to inform user-space
              applications of which files must be delegated.  As at Linux
              4.15, one sees the following when inspecting this file:

                  $ cat /sys/kernel/cgroup/delegate
                  cgroup.procs
                  cgroup.subtree_control
                  cgroup.threads

       /sys/kernel/cgroup/features (since Linux 4.15)
              Over time, the set of cgroups v2 features that are provided
              by the kernel may change or grow, or some features may not be
              enabled by default.  This file provides a way for user-space
              applications to discover what features the running kernel
              supports and has enabled.  Features are listed one per line:

                  $ cat /sys/kernel/cgroup/features
                  nsdelegate

              The entries that can appear in this file are:

              nsdelegate (since Linux 4.15)
                     The kernel supports the nsdelegate mount option.

SEE ALSO
       prlimit(1), systemd(1), systemd-cgls(1), systemd-cgtop(1), clone(2),
       ioprio_set(2), perf_event_open(2), setrlimit(2),
       cgroup_namespaces(7), cpuset(7), namespaces(7), sched(7),
       user_namespaces(7)

COLOPHON
       This page is part of release 4.15 of the Linux man-pages project.  A
       description of the project, information about reporting bugs, and
       the latest version of this page, can be found at
       https://www.kernel.org/doc/man-pages/.



Linux                             2018-02-02                        CGROUPS(7)