CGROUPS(7)                Linux Programmer's Manual               CGROUPS(7)

NAME
       cgroups - Linux control groups

DESCRIPTION
       Control groups, usually referred to as cgroups, are a Linux kernel
       feature which allow processes to be organized into hierarchical
       groups whose usage of various types of resources can then be
       limited and monitored.  The kernel's cgroup interface is provided
       through a pseudo-filesystem called cgroupfs.  Grouping is
       implemented in the core cgroup kernel code, while resource tracking
       and limits are implemented in a set of per-resource-type subsystems
       (memory, CPU, and so on).

   Terminology
       A cgroup is a collection of processes that are bound to a set of
       limits or parameters defined via the cgroup filesystem.

       A subsystem is a kernel component that modifies the behavior of the
       processes in a cgroup.  Various subsystems have been implemented,
       making it possible to do things such as limiting the amount of CPU
       time and memory available to a cgroup, accounting for the CPU time
       used by a cgroup, and freezing and resuming execution of the
       processes in a cgroup.  Subsystems are sometimes also known as
       resource controllers (or simply, controllers).

       The cgroups for a controller are arranged in a hierarchy.  This
       hierarchy is defined by creating, removing, and renaming
       subdirectories within the cgroup filesystem.  At each level of the
       hierarchy, attributes (e.g., limits) can be defined.  The limits,
       control, and accounting provided by cgroups generally have effect
       throughout the subhierarchy underneath the cgroup where the
       attributes are defined.  Thus, for example, the limits placed on a
       cgroup at a higher level in the hierarchy cannot be exceeded by
       descendant cgroups.

   Cgroups version 1 and version 2
       The initial release of the cgroups implementation was in Linux
       2.6.24.  Over time, various cgroup controllers have been added to
       allow the management of various types of resources.  However, the
       development of these controllers was largely uncoordinated, with
       the result that many inconsistencies arose between controllers and
       management of the cgroup hierarchies became rather complex.  (A
       longer description of these problems can be found in the kernel
       source file Documentation/cgroup-v2.txt.)

       Because of the problems with the initial cgroups implementation
       (cgroups version 1), starting in Linux 3.10, work began on a new,
       orthogonal implementation to remedy these problems.  Initially
       marked experimental, and hidden behind the -o __DEVEL__sane_behavior
       mount option, the new version (cgroups version 2) was eventually
       made official with the release of Linux 4.5.  Differences between
       the two versions are described in the text below.

       Although cgroups v2 is intended as a replacement for cgroups v1,
       the older system continues to exist (and for compatibility reasons
       is unlikely to be removed).  Currently, cgroups v2 implements only
       a subset of the controllers available in cgroups v1.  The two
       systems are implemented so that both v1 controllers and v2
       controllers can be mounted on the same system.  Thus, for example,
       it is possible to use those controllers that are supported under
       version 2, while also using version 1 controllers where version 2
       does not yet support those controllers.  The only restriction here
       is that a controller can't be simultaneously employed in both a
       cgroups v1 hierarchy and in the cgroups v2 hierarchy.

CGROUPS VERSION 1
       Under cgroups v1, each controller may be mounted against a separate
       cgroup filesystem that provides its own hierarchical organization
       of the processes on the system.  It is also possible to comount
       multiple (or even all) cgroups v1 controllers against the same
       cgroup filesystem, meaning that the comounted controllers manage
       the same hierarchical organization of processes.

       For each mounted hierarchy, the directory tree mirrors the control
       group hierarchy.  Each control group is represented by a directory,
       with each of its child control groups represented as a child
       directory.  For instance, /user/joe/1.session represents control
       group 1.session, which is a child of cgroup joe, which is a child
       of /user.  Under each cgroup directory is a set of files which can
       be read or written to, reflecting resource limits and a few general
       cgroup properties.

   Tasks (threads) versus processes
       In cgroups v1, a distinction is drawn between processes and tasks.
       In this view, a process can consist of multiple tasks (more
       commonly called threads, from a user-space perspective, and called
       such in the remainder of this man page).  In cgroups v1, it is
       possible to independently manipulate the cgroup memberships of the
       threads in a process.

       The cgroups v1 ability to split threads across different cgroups
       caused problems in some cases.  For example, it made no sense for
       the memory controller, since all of the threads of a process share
       a single address space.  Because of these problems, the ability to
       independently manipulate the cgroup memberships of the threads in
       a process was removed in the initial cgroups v2 implementation,
       and subsequently restored in a more limited form (see the
       discussion of "thread mode" below).

   Mounting v1 controllers
       The use of cgroups requires a kernel built with the CONFIG_CGROUPS
       option.  In addition, each of the v1 controllers has an associated
       configuration option that must be set in order to employ that
       controller.

       In order to use a v1 controller, it must be mounted against a
       cgroup filesystem.  The usual place for such mounts is under a
       tmpfs(5) filesystem mounted at /sys/fs/cgroup.  Thus, one might
       mount the cpu controller as follows:

           mount -t cgroup -o cpu none /sys/fs/cgroup/cpu

       It is possible to comount multiple controllers against the same
       hierarchy.  For example, here the cpu and cpuacct controllers are
       comounted against a single hierarchy:

           mount -t cgroup -o cpu,cpuacct none /sys/fs/cgroup/cpu,cpuacct

       Comounting controllers has the effect that a process is in the same
       cgroup for all of the comounted controllers.  Separately mounting
       controllers allows a process to be in cgroup /foo1 for one
       controller while being in /foo2/foo3 for another.

       It is possible to comount all v1 controllers against the same
       hierarchy:

           mount -t cgroup -o all cgroup /sys/fs/cgroup

       (One can achieve the same result by omitting -o all, since it is
       the default if no controllers are explicitly specified.)

       It is not possible to mount the same controller against multiple
       cgroup hierarchies.  For example, it is not possible to mount both
       the cpu and cpuacct controllers against one hierarchy, and to mount
       the cpu controller alone against another hierarchy.  It is possible
       to create multiple mount points with exactly the same set of
       comounted controllers.  However, in this case all that results is
       multiple mount points providing a view of the same hierarchy.

       Note that on many systems, the v1 controllers are automatically
       mounted under /sys/fs/cgroup; in particular, systemd(1)
       automatically creates such mount points.

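       One way to discover which cgroup filesystems (if any) are already
       mounted, and with which controllers, is to inspect /proc/mounts;
       this is shown as an illustration only, since the output depends on
       the system:

           grep cgroup /proc/mounts
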
   Unmounting v1 controllers
       A mounted cgroup filesystem can be unmounted using the umount(8)
       command, as in the following example:

           umount /sys/fs/cgroup/pids

       But note well: a cgroup filesystem is unmounted only if it is not
       busy, that is, it has no child cgroups.  If this is not the case,
       then the only effect of the umount(8) is to make the mount
       invisible.  Thus, to ensure that the mount point is really removed,
       one must first remove all child cgroups, which in turn can be done
       only after all member processes have been moved from those cgroups
       to the root cgroup.

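       For example, the following commands sketch that sequence for a
       hierarchy mounted at /sys/fs/cgroup/pids with a single child cgroup
       (the cgroup name cg1 is hypothetical):

           # Move all member processes of cg1 back to the root cgroup
           for p in $(cat /sys/fs/cgroup/pids/cg1/cgroup.procs); do
               echo $p > /sys/fs/cgroup/pids/cgroup.procs
           done

           rmdir /sys/fs/cgroup/pids/cg1    # Remove the now-empty child
           umount /sys/fs/cgroup/pids       # The unmount can now complete
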
   Cgroups version 1 controllers
       Each of the cgroups version 1 controllers is governed by a kernel
       configuration option (listed below).  Additionally, the
       availability of the cgroups feature is governed by the
       CONFIG_CGROUPS kernel configuration option.

       cpu (since Linux 2.6.24; CONFIG_CGROUP_SCHED)
              Cgroups can be guaranteed a minimum number of "CPU shares"
              when a system is busy.  This does not limit a cgroup's CPU
              usage if the CPUs are not busy.  For further information,
              see Documentation/scheduler/sched-design-CFS.txt.

              In Linux 3.2, this controller was extended to provide CPU
              "bandwidth" control.  If the kernel is configured with
              CONFIG_CFS_BANDWIDTH, then within each scheduling period
              (defined via a file in the cgroup directory), it is possible
              to define an upper limit on the CPU time allocated to the
              processes in a cgroup.  This upper limit applies even if
              there is no other competition for the CPU.  Further
              information can be found in the kernel source file
              Documentation/scheduler/sched-bwc.txt.

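              As an illustration (the cgroup name cg1 is an example), the
              following commands cap the processes in cg1 at half of one
              CPU, by allowing 50 ms of CPU time in each 100 ms period:

                  echo 100000 > /sys/fs/cgroup/cpu/cg1/cpu.cfs_period_us
                  echo 50000 > /sys/fs/cgroup/cpu/cg1/cpu.cfs_quota_us
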
       cpuacct (since Linux 2.6.24; CONFIG_CGROUP_CPUACCT)
              This provides accounting for CPU usage by groups of
              processes.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/cpuacct.txt.

       cpuset (since Linux 2.6.24; CONFIG_CPUSETS)
              This controller can be used to bind the processes in a
              cgroup to a specified set of CPUs and NUMA nodes.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/cpusets.txt.

       memory (since Linux 2.6.25; CONFIG_MEMCG)
              The memory controller supports reporting and limiting of
              process memory, kernel memory, and swap used by cgroups.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/memory.txt.

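              For example (using the illustrative cgroup name cg1), the
              following command limits the processes in cg1 to 100 MiB of
              memory:

                  echo $((100 * 1024 * 1024)) > \
                      /sys/fs/cgroup/memory/cg1/memory.limit_in_bytes
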
       devices (since Linux 2.6.26; CONFIG_CGROUP_DEVICE)
              This supports controlling which processes may create (mknod)
              devices as well as open them for reading or writing.  The
              policies may be specified as whitelists and blacklists.
              Hierarchy is enforced, so new rules must not violate
              existing rules for the target or ancestor cgroups.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/devices.txt.

       freezer (since Linux 2.6.28; CONFIG_CGROUP_FREEZER)
              The freezer cgroup can suspend and restore (resume) all
              processes in a cgroup.  Freezing a cgroup /A also causes
              its children, for example, processes in /A/B, to be frozen.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/freezer-subsystem.txt.

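              To sketch the interface (assuming the freezer controller is
              mounted at /sys/fs/cgroup/freezer and a cgroup A exists):

                  echo FROZEN > /sys/fs/cgroup/freezer/A/freezer.state
                  cat /sys/fs/cgroup/freezer/A/freezer.state
                  echo THAWED > /sys/fs/cgroup/freezer/A/freezer.state
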
       net_cls (since Linux 2.6.29; CONFIG_CGROUP_NET_CLASSID)
              This places a classid, specified for the cgroup, on network
              packets created by a cgroup.  These classids can then be
              used in firewall rules, as well as used to shape traffic
              using tc(8).  This applies only to packets leaving the
              cgroup, not to traffic arriving at the cgroup.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/net_cls.txt.

       blkio (since Linux 2.6.33; CONFIG_BLK_CGROUP)
              The blkio cgroup controls and limits access to specified
              block devices by applying IO control in the form of
              throttling and upper limits against leaf nodes and
              intermediate nodes in the storage hierarchy.

              Two policies are available.  The first is a
              proportional-weight time-based division of disk implemented
              with CFQ.  This is in effect for leaf nodes using CFQ.  The
              second is a throttling policy which specifies upper I/O rate
              limits on a device.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/blkio-controller.txt.

       perf_event (since Linux 2.6.39; CONFIG_CGROUP_PERF)
              This controller allows perf monitoring of the set of
              processes grouped in a cgroup.

              Further information can be found in the kernel source file
              tools/perf/Documentation/perf-record.txt.

       net_prio (since Linux 3.3; CONFIG_CGROUP_NET_PRIO)
              This allows priorities to be specified, per network
              interface, for cgroups.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/net_prio.txt.

       hugetlb (since Linux 3.5; CONFIG_CGROUP_HUGETLB)
              This supports limiting the use of huge pages by cgroups.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/hugetlb.txt.

       pids (since Linux 4.3; CONFIG_CGROUP_PIDS)
              This controller permits limiting the number of processes
              that may be created in a cgroup (and its descendants).

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/pids.txt.

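              For example (the cgroup name is again illustrative), the
              following commands limit cg1 to at most 32 processes and
              then show the current count:

                  echo 32 > /sys/fs/cgroup/pids/cg1/pids.max
                  cat /sys/fs/cgroup/pids/cg1/pids.current
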
       rdma (since Linux 4.11; CONFIG_CGROUP_RDMA)
              The RDMA controller permits limiting the use of
              RDMA/IB-specific resources per cgroup.

              Further information can be found in the kernel source file
              Documentation/cgroup-v1/rdma.txt.

   Creating cgroups and moving processes
       A cgroup filesystem initially contains a single root cgroup, '/',
       which all processes belong to.  A new cgroup is created by creating
       a directory in the cgroup filesystem:

           mkdir /sys/fs/cgroup/cpu/cg1

       This creates a new empty cgroup.

       A process may be moved to this cgroup by writing its PID into the
       cgroup's cgroup.procs file:

           echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs

       Only one PID at a time should be written to this file.

       Writing the value 0 to a cgroup.procs file causes the writing
       process to be moved to the corresponding cgroup.

       When writing a PID into the cgroup.procs file, all threads in the
       process are moved into the new cgroup at once.

       Within a hierarchy, a process can be a member of exactly one
       cgroup.  Writing a process's PID to a cgroup.procs file
       automatically removes it from the cgroup of which it was previously
       a member.

       The cgroup.procs file can be read to obtain a list of the processes
       that are members of a cgroup.  The returned list of PIDs is not
       guaranteed to be in order.  Nor is it guaranteed to be free of
       duplicates.  (For example, a PID may be recycled while reading from
       the list.)

       In cgroups v1, an individual thread can be moved to another cgroup
       by writing its thread ID (i.e., the kernel thread ID returned by
       clone(2) and gettid(2)) to the tasks file in a cgroup directory.
       This file can be read to discover the set of threads that are
       members of the cgroup.

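       For instance, the following (illustrative) commands move a single
       thread whose thread ID is in $TID into cg2, and then list the
       threads that are members of that cgroup:

           echo $TID > /sys/fs/cgroup/cpu/cg2/tasks
           cat /sys/fs/cgroup/cpu/cg2/tasks
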
   Removing cgroups
       To remove a cgroup, it must first have no child cgroups and contain
       no (nonzombie) processes.  So long as that is the case, one can
       simply remove the corresponding directory pathname.  Note that
       files in a cgroup directory cannot and need not be removed.

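       Thus, the removal is a simple directory removal, even though the
       directory still contains the cgroup interface files:

           rmdir /sys/fs/cgroup/cpu/cg1
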
   Cgroups v1 release notification
       Two files can be used to determine whether the kernel provides
       notifications when a cgroup becomes empty.  A cgroup is considered
       to be empty when it contains no child cgroups and no member
       processes.

       A special file in the root directory of each cgroup hierarchy,
       release_agent, can be used to register the pathname of a program
       that may be invoked when a cgroup in the hierarchy becomes empty.
       The pathname of the newly empty cgroup (relative to the cgroup
       mount point) is provided as the sole command-line argument when the
       release_agent program is invoked.  The release_agent program might
       remove the cgroup directory, or perhaps repopulate it with a
       process.

       The default value of the release_agent file is empty, meaning that
       no release agent is invoked.

       The content of the release_agent file can also be specified via a
       mount option when the cgroup filesystem is mounted:

           mount -o release_agent=pathname ...

       Whether or not the release_agent program is invoked when a
       particular cgroup becomes empty is determined by the value in the
       notify_on_release file in the corresponding cgroup directory.  If
       this file contains the value 0, then the release_agent program is
       not invoked.  If it contains the value 1, the release_agent program
       is invoked.  The default value for this file in the root cgroup is
       0.  At the time when a new cgroup is created, the value in this
       file is inherited from the corresponding file in the parent cgroup.

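       Putting the pieces together (the agent pathname below is purely
       illustrative), one might register an agent and enable notification
       for a particular cgroup as follows:

           echo /usr/local/sbin/cg-agent > \
               /sys/fs/cgroup/pids/release_agent
           echo 1 > /sys/fs/cgroup/pids/cg1/notify_on_release
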
   Cgroup v1 named hierarchies
       In cgroups v1, it is possible to mount a cgroup hierarchy that has
       no attached controllers:

           mount -t cgroup -o none,name=somename none /some/mount/point

       Multiple instances of such hierarchies can be mounted; each
       hierarchy must have a unique name.  The only purpose of such
       hierarchies is to track processes.  (See the discussion of release
       notification above.)  An example of this is the name=systemd cgroup
       hierarchy that is used by systemd(1) to track services and user
       sessions.

CGROUPS VERSION 2
       In cgroups v2, all mounted controllers reside in a single unified
       hierarchy.  While (different) controllers may be simultaneously
       mounted under the v1 and v2 hierarchies, it is not possible to
       mount the same controller simultaneously under both the v1 and the
       v2 hierarchies.

       The new behaviors in cgroups v2 are summarized here, and in some
       cases elaborated in the following subsections.

       1. Cgroups v2 provides a unified hierarchy against which all
          controllers are mounted.

       2. "Internal" processes are not permitted.  With the exception of
          the root cgroup, processes may reside only in leaf nodes
          (cgroups that do not themselves contain child cgroups).  The
          details are somewhat more subtle than this, and are described
          below.

       3. Active controllers must be specified via the files
          cgroup.controllers and cgroup.subtree_control.

       4. The tasks file has been removed.  In addition, the
          cgroup.clone_children file that is employed by the cpuset
          controller has been removed.

       5. An improved mechanism for notification of empty cgroups is
          provided by the cgroup.events file.

       For more changes, see the Documentation/cgroup-v2.txt file in the
       kernel source.

       Some of the new behaviors listed above saw subsequent modification
       with the addition in Linux 4.14 of "thread mode" (described below).

   Cgroups v2 unified hierarchy
       In cgroups v1, the ability to mount different controllers against
       different hierarchies was intended to allow great flexibility for
       application design.  In practice, though, the flexibility turned
       out to be less useful than expected, and in many cases added
       complexity.  Therefore, in cgroups v2, all available controllers
       are mounted against a single hierarchy.  The available controllers
       are automatically mounted, meaning that it is not necessary (or
       possible) to specify the controllers when mounting the cgroup v2
       filesystem using a command such as the following:

           mount -t cgroup2 none /mnt/cgroup2

       A cgroup v2 controller is available only if it is not currently in
       use via a mount against a cgroup v1 hierarchy.  Or, to put things
       another way, it is not possible to employ the same controller
       against both a v1 hierarchy and the unified v2 hierarchy.  This
       means that it may be necessary first to unmount a v1 controller
       (as described above) before that controller is available in v2.
       Since systemd(1) makes heavy use of some v1 controllers by default,
       it can in some cases be simpler to boot the system with selected v1
       controllers disabled.  To do this, specify the cgroup_no_v1=list
       option on the kernel boot command line; list is a comma-separated
       list of the names of the controllers to disable, or the word all
       to disable all v1 controllers.  (This situation is correctly
       handled by systemd(1), which falls back to operating without the
       specified controllers.)

       Note that on many modern systems, systemd(1) automatically mounts
       the cgroup2 filesystem at /sys/fs/cgroup/unified during the boot
       process.

   Cgroups v2 controllers
       The following controllers, documented in the kernel source file
       Documentation/cgroup-v2.txt, are supported in cgroups version 2:

       io (since Linux 4.5)
              This is the successor of the version 1 blkio controller.

       memory (since Linux 4.5)
              This is the successor of the version 1 memory controller.

       pids (since Linux 4.5)
              This is the same as the version 1 pids controller.

       perf_event (since Linux 4.11)
              This is the same as the version 1 perf_event controller.

       rdma (since Linux 4.11)
              This is the same as the version 1 rdma controller.

       cpu (since Linux 4.15)
              This is the successor to the version 1 cpu and cpuacct
              controllers.

   Cgroups v2 subtree control
       Each cgroup in the v2 hierarchy contains the following two files:

       cgroup.controllers
              This read-only file exposes a list of the controllers that
              are available in this cgroup.  The contents of this file
              match the contents of the cgroup.subtree_control file in
              the parent cgroup.

       cgroup.subtree_control
              This is a list of controllers that are active (enabled) in
              the cgroup.  The set of controllers in this file is a
              subset of the set in the cgroup.controllers file of this
              cgroup.  The set of active controllers is modified by
              writing strings to this file containing space-delimited
              controller names, each preceded by '+' (to enable a
              controller) or '-' (to disable a controller), as in the
              following example:

                  echo '+pids -memory' > x/y/cgroup.subtree_control

              An attempt to enable a controller that is not present in
              cgroup.controllers leads to an ENOENT error when writing to
              the cgroup.subtree_control file.

       Because the list of controllers in cgroup.subtree_control is a
       subset of those in cgroup.controllers, a controller that has been
       disabled in one cgroup in the hierarchy can never be re-enabled in
       the subtree below that cgroup.

       A cgroup's cgroup.subtree_control file determines the set of
       controllers that are exercised in the child cgroups.  When a
       controller (e.g., pids) is present in the cgroup.subtree_control
       file of a parent cgroup, then the corresponding
       controller-interface files (e.g., pids.max) are automatically
       created in the children of that cgroup and can be used to exert
       resource control in the child cgroups.

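       The following (illustrative) sequence, run against a cgroup v2
       filesystem mounted at /sys/fs/cgroup/unified, shows this cascade:
       enabling the pids controller in a cgroup's cgroup.subtree_control
       file causes pids.* interface files to appear in its children:

           echo '+pids' > /sys/fs/cgroup/unified/cgroup.subtree_control
           mkdir /sys/fs/cgroup/unified/grp1
           echo '+pids' > \
               /sys/fs/cgroup/unified/grp1/cgroup.subtree_control
           mkdir /sys/fs/cgroup/unified/grp1/grp2
           ls /sys/fs/cgroup/unified/grp1/grp2    # Includes pids.max
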
480 Cgroups v2 "no internal processes" rule
481 Cgroups v2 enforces a so-called "no internal processes" rule. Roughly
482 speaking, this rule means that, with the exception of the root cgroup,
483 processes may reside only in leaf nodes (cgroups that do not themselves
484 contain child cgroups). This avoids the need to decide how to parti‐
485 tion resources between processes which are members of cgroup A and pro‐
486 cesses in child cgroups of A.
487
488 For instance, if cgroup /cg1/cg2 exists, then a process may reside in
489 /cg1/cg2, but not in /cg1. This is to avoid an ambiguity in cgroups v1
490 with respect to the delegation of resources between processes in /cg1
491 and its child cgroups. The recommended approach in cgroups v2 is to
492 create a subdirectory called leaf for any nonleaf cgroup which should
493 contain processes, but no child cgroups. Thus, processes which previ‐
494 ously would have gone into /cg1 would now go into /cg1/leaf. This has
495 the advantage of making explicit the relationship between processes in
496 /cg1/leaf and /cg1's other children.
497
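       A minimal sketch of the leaf convention (the mount point and cgroup
       names are illustrative):

           mkdir /sys/fs/cgroup/unified/cg1
           mkdir /sys/fs/cgroup/unified/cg1/leaf
           echo $$ > /sys/fs/cgroup/unified/cg1/leaf/cgroup.procs
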
498 The "no internal processes" rule is in fact more subtle than stated
499 above. More precisely, the rule is that a (nonroot) cgroup can't both
500 (1) have member processes, and (2) distribute resources into child
501 cgroups—that is, have a nonempty cgroup.subtree_control file. Thus, it
502 is possible for a cgroup to have both member processes and child
503 cgroups, but before controllers can be enabled for that cgroup, the
504 member processes must be moved out of the cgroup (e.g., perhaps into
505 the child cgroups).
506
507 With the Linux 4.14 addition of "thread mode" (described below), the
508 "no internal processes" rule has been relaxed in some cases.
509
   Cgroups v2 cgroup.events file
       With cgroups v2, a new mechanism is provided to obtain notification
       about when a cgroup becomes empty.  The cgroups v1 release_agent
       and notify_on_release files are removed, and replaced by a new,
       more general-purpose file, cgroup.events.  This read-only file
       contains key-value pairs (delimited by newline characters, with the
       key and value separated by spaces) that identify events or state
       for a cgroup.  Currently, only one key appears in this file,
       populated, which has either the value 0, meaning that the cgroup
       (and its descendants) contain no (nonzombie) processes, or 1,
       meaning that the cgroup contains member processes.

       The cgroup.events file can be monitored, in order to receive
       notification when a cgroup transitions between the populated and
       unpopulated states (or vice versa).  When monitoring this file
       using inotify(7), transitions generate IN_MODIFY events, and when
       monitoring the file using poll(2), transitions cause the bits
       POLLPRI and POLLERR to be returned in the revents field.

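       As a sketch of such monitoring (assuming the inotifywait(1) tool
       from the inotify-tools package is installed, and an illustrative
       cgroup path), the following commands block until the next populated
       transition and then display the new state:

           inotifywait -e modify /sys/fs/cgroup/unified/cg1/cgroup.events
           cat /sys/fs/cgroup/unified/cg1/cgroup.events
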
       The cgroups v2 release-notification mechanism provided by the
       populated field of the cgroup.events file offers at least two
       advantages over the cgroups v1 release_agent mechanism.  First, it
       allows for cheaper notification, since a single process can monitor
       multiple cgroup.events files.  By contrast, the cgroups v1
       mechanism requires the creation of a process for each notification.
       Second, notification can be delegated to a process that lives
       inside a container associated with the newly empty cgroup.

   Cgroups v2 cgroup.stat file
       Each cgroup in the v2 hierarchy contains a read-only cgroup.stat
       file (first introduced in Linux 4.14) that consists of lines
       containing key-value pairs.  The following keys currently appear
       in this file:

       nr_descendants
              This is the total number of visible (i.e., living)
              descendant cgroups underneath this cgroup.

       nr_dying_descendants
              This is the total number of dying descendant cgroups
              underneath this cgroup.  A cgroup enters the dying state
              after being deleted.  It remains in that state for an
              undefined period (which will depend on system load) while
              resources are freed before the cgroup is destroyed.  Note
              that the presence of some cgroups in the dying state is
              normal, and is not indicative of any problem.

              A process can't be made a member of a dying cgroup, and a
              dying cgroup can't be brought back to life.

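       Reading the file might, for example, produce output such as the
       following (the values shown are illustrative):

           $ cat /sys/fs/cgroup/unified/cg1/cgroup.stat
           nr_descendants 4
           nr_dying_descendants 0
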
   Limiting the number of descendant cgroups
       Each cgroup in the v2 hierarchy contains the following files, which
       can be used to view and set limits on the number of descendant
       cgroups under that cgroup:

       cgroup.max.depth (since Linux 4.14)
              This file defines a limit on the depth of nesting of
              descendant cgroups.  A value of 0 in this file means that
              no descendant cgroups can be created.  An attempt to create
              a descendant whose nesting level exceeds the limit fails
              (mkdir(2) fails with the error EAGAIN).

              Writing the string "max" to this file means that no limit
              is imposed.  The default value in this file is "max".

       cgroup.max.descendants (since Linux 4.14)
              This file defines a limit on the number of live descendant
              cgroups that this cgroup may have.  An attempt to create
              more descendants than allowed by the limit fails (mkdir(2)
              fails with the error EAGAIN).

              Writing the string "max" to this file means that no limit
              is imposed.  The default value in this file is "max".

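       For example (paths illustrative), the following commands allow cg1
       at most 100 live descendants, nested at most 3 levels deep:

           echo 100 > /sys/fs/cgroup/unified/cg1/cgroup.max.descendants
           echo 3 > /sys/fs/cgroup/unified/cg1/cgroup.max.depth
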
   Cgroups v2 delegation: delegation to a less privileged user
       In the context of cgroups, delegation means passing management of
       some subtree of the cgroup hierarchy to a nonprivileged process.
       Cgroups v1 provides support for delegation that was accidental and
       not fully secure.  Cgroups v2 supports delegation by explicit
       design.

       Some terminology is required in order to describe delegation.  A
       delegater is a privileged user (i.e., root) who owns a parent
       cgroup.  A delegatee is a nonprivileged user who will be granted
       the permissions needed to manage some subhierarchy under that
       parent cgroup, known as the delegated subtree.

       To perform delegation, the delegater makes certain directories and
       files writable by the delegatee, typically by changing the
       ownership of the objects to be the user ID of the delegatee.
       Assuming that we want to delegate the hierarchy rooted at (say)
       /dlgt_grp and that there are not yet any child cgroups under that
       cgroup, the ownership of the following is changed to the user ID
       of the delegatee:

       /dlgt_grp
              Changing the ownership of the root of the subtree means
              that any new cgroups created under the subtree (and the
              files they contain) will also be owned by the delegatee.

       /dlgt_grp/cgroup.procs
              Changing the ownership of this file means that the
              delegatee can move processes into the root of the delegated
              subtree.

       /dlgt_grp/cgroup.subtree_control
              Changing the ownership of this file means that the
              delegatee can enable controllers (that are present in
              /dlgt_grp/cgroup.controllers) in order to further
              redistribute resources at lower levels in the subtree.  (As
              an alternative to changing the ownership of this file, the
              delegater might instead add selected controllers to this
              file.)

       /dlgt_grp/cgroup.threads
              Changing the ownership of this file is necessary if a
              threaded subtree is being delegated (see the description of
              "thread mode", below).  This permits the delegatee to write
              thread IDs to the file.  (The ownership of this file can
              also be changed when delegating a domain subtree, but
              currently this serves no purpose, since, as described
              below, it is not possible to move a thread between domain
              cgroups by writing its thread ID to the cgroup.threads
              file.)

       The delegater should not change the ownership of any of the
       controller interface files (e.g., pids.max, memory.high) in
       dlgt_grp.  Those files are used from the next level above the
       delegated subtree in order to distribute resources into the
       subtree, and the delegatee should not have permission to change
       the resources that are distributed into the delegated subtree.

       See also the discussion of the /sys/kernel/cgroup/delegate file in
       NOTES.

       After the aforementioned steps have been performed, the delegatee
       can create child cgroups within the delegated subtree (the cgroup
       subdirectories and the files they contain will be owned by the
       delegatee) and move processes between cgroups in the subtree.  If
       some controllers are present in dlgt_grp/cgroup.subtree_control,
       or the ownership of that file was passed to the delegatee, the
       delegatee can also control the further redistribution of the
       corresponding resources into the delegated subtree.

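       Expressed as shell commands (the user name cecilia is an example),
       the delegation steps described above might look like this:

           chown cecilia /dlgt_grp
           chown cecilia /dlgt_grp/cgroup.procs
           chown cecilia /dlgt_grp/cgroup.subtree_control
           chown cecilia /dlgt_grp/cgroup.threads
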
   Cgroups v2 delegation: nsdelegate and cgroup namespaces
       Starting with Linux 4.13, there is a second way to perform cgroup
       delegation.  This is done by mounting or remounting the cgroup v2
       filesystem with the nsdelegate mount option.  For example, if the
       cgroup v2 filesystem has already been mounted, we can remount it
       with the nsdelegate option as follows:

           mount -t cgroup2 -o remount,nsdelegate \
                 none /sys/fs/cgroup/unified

       The effect of this mount option is to cause cgroup namespaces to
       automatically become delegation boundaries.  More specifically,
       the following restrictions apply for processes inside the cgroup
       namespace:

       *  Writes to controller interface files in the root directory of
          the namespace will fail with the error EPERM.  Processes inside
          the cgroup namespace can still write to delegatable files in
          the root directory of the cgroup namespace such as cgroup.procs
          and cgroup.subtree_control, and can create a subhierarchy
          underneath the root directory.

       *  Attempts to migrate processes across the namespace boundary are
          denied (with the error ENOENT).  Processes inside the cgroup
          namespace can still (subject to the containment rules described
          below) move processes between cgroups within the subhierarchy
          under the namespace root.

       The ability to define cgroup namespaces as delegation boundaries
       makes cgroup namespaces more useful.  To understand why, suppose
       that we already have one cgroup hierarchy that has been delegated
       to a nonprivileged user, cecilia, using the older delegation
       technique described above.  Suppose further that cecilia wanted to
       further delegate a subhierarchy under the existing delegated
       hierarchy.  (For example, the delegated hierarchy might be
       associated with an unprivileged container run by cecilia.)  Even
       if a cgroup namespace was employed, because both hierarchies are
       owned by the unprivileged user cecilia, the following illegitimate
       actions could be performed:

       *  A process in the inferior hierarchy could change the resource
          controller settings in the root directory of that hierarchy.
          (These resource controller settings are intended to allow
          control to be exercised from the parent cgroup; a process
          inside the child cgroup should not be allowed to modify them.)

       *  A process inside the inferior hierarchy could move processes
          into and out of the inferior hierarchy if the cgroups in the
          superior hierarchy were somehow visible.

       Employing the nsdelegate mount option prevents both of these
       possibilities.

       The nsdelegate mount option only has an effect when performed in
       the initial mount namespace; in other mount namespaces, the option
       is silently ignored.

       Note: On some systems, systemd(1) automatically mounts the cgroup
       v2 filesystem.  In order to experiment with the nsdelegate
       operation, it may be desirable to boot the kernel with the
       following command-line options:

           cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller

       These options cause the kernel to boot with the cgroups v1
       controllers disabled (meaning that the controllers are available
       in the v2 hierarchy), and tell systemd(1) not to mount and use the
       cgroup v2 hierarchy, so that the v2 hierarchy can be manually
       mounted with the desired options after boot-up.

   Cgroup v2 delegation containment rules
       Some delegation containment rules ensure that the delegatee can
       move processes between cgroups within the delegated subtree, but
       can't move processes from outside the delegated subtree into the
       subtree or vice versa.  A nonprivileged process (i.e., the
       delegatee) can write the PID of a "target" process into a
       cgroup.procs file only if all of the following are true:

       *  The writer has write permission on the cgroup.procs file in the
          destination cgroup.

       *  The writer has write permission on the cgroup.procs file in the
          common ancestor of the source and destination cgroups.  (In
          some cases, the common ancestor may be the source or
          destination cgroup itself.)

       *  If the cgroup v2 filesystem was mounted with the nsdelegate
          option, the writer must be able to see the source and
          destination cgroups from its cgroup namespace.

       *  Before Linux 4.11: the effective UID of the writer (i.e., the
          delegatee) matches the real user ID or the saved set-user-ID of
          the target process.  (This was a historical requirement
          inherited from cgroups v1 that was later deemed unnecessary,
          since the other rules suffice for containment in cgroups v2.)

       Note: one consequence of these delegation containment rules is
       that the unprivileged delegatee can't place the first process into
       the delegated subtree; instead, the delegater must place the first
       process (a process owned by the delegatee) into the delegated
       subtree.

CGROUPS VERSION 2 THREAD MODE
       Among the restrictions imposed by cgroups v2 that were not present
       in cgroups v1 are the following:

       *  No thread-granularity control: all of the threads of a process
          must be in the same cgroup.

       *  No internal processes: a cgroup can't both have member
          processes and exercise controllers on child cgroups.

       Both of these restrictions were added because the lack of these
       restrictions had caused problems in cgroups v1.  In particular,
       the cgroups v1 ability to allow thread-level granularity for
       cgroup membership made no sense for some controllers.  (A notable
       example was the memory controller: since threads share an address
       space, it made no sense to split threads across different memory
       cgroups.)

       Notwithstanding the initial design decision in cgroups v2, there
       were use cases for certain controllers, notably the cpu
       controller, for which thread-level granularity of control was
       meaningful and useful.  To accommodate such use cases, Linux 4.14
       added thread mode for cgroups v2.

       Thread mode allows the following:

       *  The creation of threaded subtrees in which the threads of a
          process may be spread across cgroups inside the tree.  (A
          threaded subtree may contain multiple multithreaded processes.)

       *  The concept of threaded controllers, which can distribute
          resources across the cgroups in a threaded subtree.

       *  A relaxation of the "no internal processes rule", so that,
          within a threaded subtree, a cgroup can both contain member
          threads and exercise resource control over child cgroups.

       With the addition of thread mode, each nonroot cgroup now contains
       a new file, cgroup.type, that exposes, and in some circumstances
       can be used to change, the "type" of a cgroup.  This file contains
       one of the following type values:

       domain This is a normal v2 cgroup that provides process-granularity
              control.  If a process is a member of this cgroup, then all
              threads of the process are (by definition) in the same
              cgroup.  This is the default cgroup type, and provides the
              same behavior that was provided for cgroups in the initial
              cgroups v2 implementation.

       threaded
              This cgroup is a member of a threaded subtree.  Threads can
              be added to this cgroup, and controllers can be enabled for
              the cgroup.

       domain threaded
              This is a domain cgroup that serves as the root of a
              threaded subtree.  This cgroup type is also known as
              "threaded root".

       domain invalid
              This is a cgroup inside a threaded subtree that is in an
              "invalid" state.  Processes can't be added to the cgroup,
              and controllers can't be enabled for the cgroup.  The only
              thing that can be done with this cgroup (other than
              deleting it) is to convert it to a threaded cgroup by
              writing the string "threaded" to the cgroup.type file.

              The rationale for the existence of this "interim" type
              during the creation of a threaded subtree (rather than the
              kernel simply immediately converting all cgroups under the
              threaded root to the type threaded) is to allow for
              possible future extensions to the thread mode model.

   Threaded versus domain controllers
       With the addition of thread mode, cgroups v2 now distinguishes two
       types of resource controllers:

       *  Threaded controllers: these controllers support
          thread-granularity for resource control and can be enabled
          inside threaded subtrees, with the result that the
          corresponding controller-interface files appear inside the
          cgroups in the threaded subtree.  As at Linux 4.15, the
          following controllers are threaded: cpu, perf_event, and pids.

       *  Domain controllers: these controllers support only process
          granularity for resource control.  From the perspective of a
          domain controller, all threads of a process are always in the
          same cgroup.  Domain controllers can't be enabled inside a
          threaded subtree.

   Creating a threaded subtree
       There are two pathways that lead to the creation of a threaded
       subtree.  The first pathway proceeds as follows:

       1. We write the string "threaded" to the cgroup.type file of a
          cgroup y/z that currently has the type domain.  This has the
          following effects:

          *  The type of the cgroup y/z becomes threaded.

          *  The type of the parent cgroup, y, becomes domain threaded.
             The parent cgroup is the root of a threaded subtree (also
             known as the "threaded root").

          *  All other cgroups under y that were not already of type
             threaded (because they were inside already existing threaded
             subtrees under the new threaded root) are converted to type
             domain invalid.  Any subsequently created cgroups under y
             will also have the type domain invalid.

       2. We write the string "threaded" to each of the domain invalid
          cgroups under y, in order to convert them to the type threaded.
          As a consequence of this step, all cgroups under the threaded
          root now have the type threaded and the threaded subtree is now
          fully usable.  The requirement to write "threaded" to each of
          these cgroups is somewhat cumbersome, but allows for possible
          future extensions to the thread-mode model.

       The second way of creating a threaded subtree is as follows:

       1. In an existing cgroup, z, that currently has the type domain,
          we (1) enable one or more threaded controllers and (2) make a
          process a member of z.  (These two steps can be done in either
          order.)  This has the following consequences:

          *  The type of z becomes domain threaded.

          *  All of the descendant cgroups of z that were not already of
             type threaded are converted to type domain invalid.

       2. As before, we make the threaded subtree usable by writing the
          string "threaded" to each of the domain invalid cgroups under
          z, in order to convert them to the type threaded.

       One of the consequences of the above pathways to creating a
       threaded subtree is that the threaded root cgroup can be a parent
       only to threaded (and domain invalid) cgroups.  The threaded root
       cgroup can't be a parent of a domain cgroup, and a threaded cgroup
       can't have a sibling that is a domain cgroup.

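       The following shell commands sketch the first pathway, assuming a
       cgroup v2 mount at /sys/fs/cgroup/unified and illustrative cgroup
       names:

           mkdir -p /sys/fs/cgroup/unified/y/z
           echo threaded > /sys/fs/cgroup/unified/y/z/cgroup.type
           cat /sys/fs/cgroup/unified/y/cgroup.type    # "domain threaded"
           cat /sys/fs/cgroup/unified/y/z/cgroup.type  # "threaded"
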
   Using a threaded subtree
       Within a threaded subtree, threaded controllers can be enabled in
       each subgroup whose type has been changed to threaded; upon doing
       so, the corresponding controller-interface files appear in the
       children of that cgroup.

       A process can be moved into a threaded subtree by writing its PID
       to the cgroup.procs file in one of the cgroups inside the tree.
       This has the effect of making all of the threads in the process
       members of the corresponding cgroup and makes the process a member
       of the threaded subtree.  The threads of the process can then be
       spread across the threaded subtree by writing their thread IDs
       (see gettid(2)) to the cgroup.threads files in different cgroups
       inside the subtree.  The threads of a process must all reside in
       the same threaded subtree.

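       For example (with illustrative names, and $PID and $TID standing
       for a process ID and one of its thread IDs), a process might be
       placed in the threaded subtree and one of its threads then moved
       to a sibling cgroup as follows:

           echo $PID > /sys/fs/cgroup/unified/y/z/cgroup.procs
           mkdir /sys/fs/cgroup/unified/y/z2
           echo threaded > /sys/fs/cgroup/unified/y/z2/cgroup.type
           echo $TID > /sys/fs/cgroup/unified/y/z2/cgroup.threads
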
       As with writing to cgroup.procs, some containment rules apply when
       writing to the cgroup.threads file:

       *  The writer must have write permission on the cgroup.threads
          file in the destination cgroup.

       *  The writer must have write permission on the cgroup.procs file
          in the common ancestor of the source and destination cgroups.
          (In some cases, the common ancestor may be the source or
          destination cgroup itself.)

       *  The source and destination cgroups must be in the same threaded
          subtree.  (Outside a threaded subtree, an attempt to move a
          thread by writing its thread ID to the cgroup.threads file in
          a different domain cgroup fails with the error EOPNOTSUPP.)

       The cgroup.threads file is present in each cgroup (including
       domain cgroups) and can be read in order to discover the set of
       threads that is present in the cgroup.  The set of thread IDs
       obtained when reading this file is not guaranteed to be ordered or
       free of duplicates.

       The cgroup.procs file in the threaded root shows the PIDs of all
       processes that are members of the threaded subtree.  The
       cgroup.procs files in the other cgroups in the subtree are not
       readable.

       Domain controllers can't be enabled in a threaded subtree; no
       controller-interface files appear inside the cgroups underneath
       the threaded root.  From the point of view of a domain controller,
       threaded subtrees are invisible: a multithreaded process inside a
       threaded subtree appears to a domain controller as a process that
       resides in the threaded root cgroup.

       Within a threaded subtree, the "no internal processes" rule does
       not apply: a cgroup can both contain member processes (or threads)
       and exercise controllers on child cgroups.

   Rules for writing to cgroup.type and creating threaded subtrees
       A number of rules apply when writing to the cgroup.type file:

       *  Only the string "threaded" may be written.  In other words, the
          only explicit transition that is possible is to convert a
          domain cgroup to type threaded.

       *  The string "threaded" can be written only if the current value
          in cgroup.type is one of the following:

          ·  domain, to start the creation of a threaded subtree via the
             first of the pathways described above;

          ·  domain invalid, to convert one of the cgroups in a threaded
             subtree into a usable (i.e., threaded) state;

          ·  threaded, which has no effect (a "no-op").

       *  We can't write to a cgroup.type file if the parent's type is
          domain invalid.  In other words, the cgroups of a threaded
          subtree must be converted to the threaded state in a top-down
          manner.

       There are also some constraints that must be satisfied in order to
       create a threaded subtree rooted at the cgroup x:

       *  There can be no member processes in the descendant cgroups of
          x.  (The cgroup x can itself have member processes.)

       *  No domain controllers may be enabled in x's
          cgroup.subtree_control file.

       If any of the above constraints is violated, then an attempt to
       write "threaded" to a cgroup.type file fails with the error
       ENOTSUP.

959 The "domain threaded" cgroup type
960 According to the pathways described above, the type of a cgroup can
961 change to domain threaded in either of the following cases:
962
963 * The string "threaded" is written to a child cgroup.
964
965 * A threaded controller is enabled inside the cgroup and a process is
966 made a member of the cgroup.
967
968 A domain threaded cgroup, x, can revert to the type domain if the above
969 conditions no longer hold true—that is, if all threaded child cgroups
970 of x are removed and either x no longer has threaded controllers
971 enabled or no longer has member processes.
972
973 When a domain threaded cgroup x reverts to the type domain:
974
975 * All domain invalid descendants of x that are not in lower-level
976 threaded subtrees revert to the type domain.
977
978 * The root cgroups in any lower-level threaded subtrees revert to the
979 type domain threaded.
980
   Exceptions for the root cgroup
       The root cgroup of the v2 hierarchy is treated exceptionally: it
       can be the parent of both domain and threaded cgroups.  If the
       string "threaded" is written to the cgroup.type file of one of the
       children of the root cgroup, then

       *  The type of that cgroup becomes threaded.

       *  The type of any descendants of that cgroup that are not part of
          lower-level threaded subtrees changes to domain invalid.

       Note that in this case, there is no cgroup whose type becomes
       domain threaded.  (Notionally, the root cgroup can be considered
       as the threaded root for the cgroup whose type was changed to
       threaded.)

       The aim of this exceptional treatment for the root cgroup is to
       allow a threaded cgroup that employs the cpu controller to be
       placed as high as possible in the hierarchy, so as to minimize the
       (small) cost of traversing the cgroup hierarchy.

1001 The cgroups v2 "cpu" controller and realtime processes
1002 As at Linux 4.15, the cgroups v2 cpu controller does not support con‐
1003 trol of realtime processes, and the controller can be enabled in the
1004 root cgroup only if all realtime threads are in the root cgroup. (If
1005 there are realtime processes in nonroot cgroups, then a write(2) of the
1006 string "+cpu" to the cgroup.subtree_control file fails with the error
1007 EINVAL. However, on some systems, systemd(1) places certain realtime
1008 processes in nonroot cgroups in the v2 hierarchy. On such systems,
1009 these processes must first be moved to the root cgroup before the cpu
1010 controller can be enabled.
1011
ERRORS
       The following errors can occur for mount(2):

       EBUSY  An attempt to mount a cgroup version 1 filesystem specified
              neither the name= option (to mount a named hierarchy) nor a
              controller name (or all).

NOTES
       A child process created via fork(2) inherits its parent's cgroup
       memberships.  A process's cgroup memberships are preserved across
       execve(2).

   /proc files
       /proc/cgroups (since Linux 2.6.24)
              This file contains information about the controllers that
              are compiled into the kernel.  An example of the contents
              of this file (reformatted for readability) is the
              following:

                  #subsys_name  hierarchy  num_cgroups  enabled
                  cpuset        4          1            1
                  cpu           8          1            1
                  cpuacct       8          1            1
                  blkio         6          1            1
                  memory        3          1            1
                  devices       10         84           1
                  freezer       7          1            1
                  net_cls       9          1            1
                  perf_event    5          1            1
                  net_prio      9          1            1
                  hugetlb       0          1            0
                  pids          2          1            1

              The fields in this file are, from left to right:

              1. The name of the controller.

              2. The unique ID of the cgroup hierarchy on which this
                 controller is mounted.  If multiple cgroups v1
                 controllers are bound to the same hierarchy, then each
                 will show the same hierarchy ID in this field.  The
                 value in this field will be 0 if:

                 a) the controller is not mounted on a cgroups v1
                    hierarchy;

                 b) the controller is bound to the cgroups v2 single
                    unified hierarchy; or

                 c) the controller is disabled (see below).

              3. The number of control groups in this hierarchy using
                 this controller.

              4. This field contains the value 1 if this controller is
                 enabled, or 0 if it has been disabled (via the
                 cgroup_disable kernel command-line boot parameter).

       /proc/[pid]/cgroup (since Linux 2.6.24)
              This file describes control groups to which the process
              with the corresponding PID belongs.  The displayed
              information differs for cgroups version 1 and version 2
              hierarchies.

              For each cgroup hierarchy of which the process is a member,
              there is one entry containing three colon-separated fields:

                  hierarchy-ID:controller-list:cgroup-path

              For example:

                  5:cpuacct,cpu,cpuset:/daemons

              The colon-separated fields are, from left to right:

              1. For cgroups version 1 hierarchies, this field contains
                 a unique hierarchy ID number that can be matched to a
                 hierarchy ID in /proc/cgroups.  For the cgroups version
                 2 hierarchy, this field contains the value 0.

              2. For cgroups version 1 hierarchies, this field contains
                 a comma-separated list of the controllers bound to the
                 hierarchy.  For the cgroups version 2 hierarchy, this
                 field is empty.

              3. This field contains the pathname of the control group
                 in the hierarchy to which the process belongs.  This
                 pathname is relative to the mount point of the
                 hierarchy.

   /sys/kernel/cgroup files
       /sys/kernel/cgroup/delegate (since Linux 4.15)
              This file exports a list of the cgroups v2 files (one per
              line) that are delegatable (i.e., whose ownership should be
              changed to the user ID of the delegatee).  In the future,
              the set of delegatable files may change or grow, and this
              file provides a way for the kernel to inform user-space
              applications of which files must be delegated.  As at Linux
              4.15, one sees the following when inspecting this file:

                  $ cat /sys/kernel/cgroup/delegate
                  cgroup.procs
                  cgroup.subtree_control
                  cgroup.threads

       /sys/kernel/cgroup/features (since Linux 4.15)
              Over time, the set of cgroups v2 features that are provided
              by the kernel may change or grow, or some features may not
              be enabled by default.  This file provides a way for
              user-space applications to discover what features the
              running kernel supports and has enabled.  Features are
              listed one per line:

                  $ cat /sys/kernel/cgroup/features
                  nsdelegate

              The entries that can appear in this file are:

              nsdelegate (since Linux 4.15)
                     The kernel supports the nsdelegate mount option.

SEE ALSO
       prlimit(1), systemd(1), systemd-cgls(1), systemd-cgtop(1),
       clone(2), ioprio_set(2), perf_event_open(2), setrlimit(2),
       cgroup_namespaces(7), cpuset(7), namespaces(7), sched(7),
       user_namespaces(7)

COLOPHON
       This page is part of release 4.16 of the Linux man-pages project.
       A description of the project, information about reporting bugs,
       and the latest version of this page, can be found at
       https://www.kernel.org/doc/man-pages/.



Linux                            2018-02-02                       CGROUPS(7)