1 SYSTEMD.RESOURCE-CONTROL(5)    systemd.resource-control    SYSTEMD.RESOURCE-CONTROL(5)
2
3
4
5 NAME
6 systemd.resource-control - Resource control unit settings
7
8 SYNOPSIS
9 slice.slice, scope.scope, service.service, socket.socket, mount.mount,
10 swap.swap
11
12 DESCRIPTION
13 Unit configuration files for services, slices, scopes, sockets, mount
14 points, and swap devices share a subset of configuration options for
15 resource control of spawned processes. Internally, this relies on the
16 Linux Control Groups (cgroups) kernel concept for organizing processes
17 in a hierarchical tree of named groups for the purpose of resource
18 management.
19
20 This man page lists the configuration options shared by those six unit
21 types. See systemd.unit(5) for the common options of all unit
22 configuration files, and systemd.slice(5), systemd.scope(5),
23 systemd.service(5), systemd.socket(5), systemd.mount(5), and
24 systemd.swap(5) for more information on the specific unit configuration
25 files. The resource control configuration options are configured in the
26 [Slice], [Scope], [Service], [Socket], [Mount], or [Swap] sections,
27 depending on the unit type.
28
29 In addition, options which control resources available to programs
30 executed by systemd are listed in systemd.exec(5). Those options
31 complement options listed here.
32
33 Enabling and disabling controllers
34 Controllers in the cgroup hierarchy are hierarchical, and resource
35 control is realized by distributing resource assignments between
36 siblings in branches of the cgroup hierarchy. There is no need to
37 explicitly enable a cgroup controller for a unit. systemd will
38 instruct the kernel to enable a controller for a given unit when this
39 unit has configuration for a given controller. For example, when
40 CPUWeight= is set, the cpu controller will be enabled, and when
41 TasksMax= is set, the pids controller will be enabled. In addition,
42 various controllers may also be enabled explicitly via the
43 MemoryAccounting=/TasksAccounting=/IOAccounting= settings. Because of
44 how the cgroup hierarchy works, controllers will be automatically
45 enabled for all parent units and for any sibling units starting with
46 the lowest level at which a controller is enabled. Units for which a
47 controller is enabled may be subject to resource control even if they
48 don't have any explicit configuration.
49
50 Setting Delegate= enables any delegated controllers for that unit (see
51 below). The delegatee may then enable controllers for its children as
52 appropriate. In particular, if the delegatee is systemd (in the
53 user@.service unit), it will repeat the same logic as the system
54 instance and enable controllers for user units which have resource
55 limits configured, and their siblings and parents and parents'
56 siblings.
57
58 Controllers may be disabled for parts of the cgroup hierarchy with
59 DisableControllers= (see below).
60
61 Example 1. Enabling and disabling controllers
62
63 -.slice
64 / \
65 /-----/ \--------------\
66 / \
67 system.slice user.slice
68 / \ / \
69 / \ / \
70 / \ user@42.service user@1000.service
71 / \ Delegate= Delegate=yes
72 a.service b.slice / \
73 CPUWeight=20 DisableControllers=cpu / \
74 / \ app.slice session.slice
75 / \ CPUWeight=100 CPUWeight=100
76 / \
77 b1.service b2.service
78 CPUWeight=1000
79
80
81 In this hierarchy, the cpu controller is enabled for all units shown
82 except b1.service and b2.service. Because there is no explicit
83 configuration for system.slice and user.slice, CPU resources will be
84 split equally between them. Similarly, resources are allocated equally
85 between children of user.slice and between the child slices beneath
86 user@1000.service. Assuming that there is no further configuration of
87 resources or delegation below slices app.slice or session.slice, the
88 cpu controller would not be enabled for units in those slices and CPU
89 resources would be further allocated using other mechanisms, e.g. based
90 on nice levels. The manager for user 42 has delegation enabled without
91 any controllers, i.e. it can manipulate its subtree of the cgroup
92 hierarchy, but without resource control.
93
94 In the slice system.slice, a.service receives 1/6 of the CPU resources
95 and b.slice receives 5/6 (weights of 20 and 100), because b.slice gets
96 the default value of 100 for cpu.weight when CPUWeight= is not set.
97
98 The CPUWeight= setting in service b2.service is neutralized by
99 DisableControllers= in slice b.slice, so the cpu controller would not
100 be enabled for services b1.service and b2.service, and CPU resources
101 would be further allocated using other mechanisms, e.g. based on nice
102 levels.
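
The system.slice branch of the diagram could be written as the unit fragments below. This is only a sketch: it shows the resource-control lines from the diagram and omits the rest of each unit (for example ExecStart=). The Slice= line for b2.service is an assumption about how the service is placed in b.slice.

```ini
# a.service (fragment)
[Service]
CPUWeight=20

# b.slice (fragment)
[Slice]
DisableControllers=cpu

# b2.service (fragment)
[Service]
Slice=b.slice
# Has no effect here, because b.slice disables the cpu controller.
CPUWeight=1000
```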
103
104 Setting resource controls for a group of related units
105 As described in systemd.unit(5), the settings listed here may be set
106 through the main file of a unit and drop-in snippets in *.d/
107 directories. The list of directories searched for drop-ins includes
108 names formed by repeatedly truncating the unit name after all dashes.
109 This is particularly convenient for setting resource limits on a group
110 of units with similar names.
111
112 For example, every user gets their own slice user-nnn.slice. Drop-ins
113 with local configuration that affect user 1000 may be placed in
114 /etc/systemd/system/user-1000.slice,
115 /etc/systemd/system/user-1000.slice.d/*.conf, but also
116 /etc/systemd/system/user-.slice.d/*.conf. This last directory applies
117 to all user slices.
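
As a sketch of the drop-in mechanism just described, a single file could set limits for every user slice. The file name 50-limits.conf and the values are hypothetical; only the directory name comes from the text above.

```ini
# /etc/systemd/system/user-.slice.d/50-limits.conf
[Slice]
# Throttle each user's processes above 4 GiB of memory.
MemoryHigh=4G
# Allow at most 4096 tasks per user slice.
TasksMax=4096
```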
118
119 See the New Control Group Interfaces[1] for an introduction on how to
120 make use of resource control APIs from programs.
121
122 IMPLICIT DEPENDENCIES
123 The following dependencies are implicitly added:
124
125 • Units with the Slice= setting set automatically acquire Requires=
126 and After= dependencies on the specified slice unit.
127
128 OPTIONS
129 Units of the types listed above can have settings for resource control
130 configuration:
131
132 CPU Accounting and Control
133 CPUAccounting=
134 Turn on CPU usage accounting for this unit. Takes a boolean
135 argument. Note that turning on CPU accounting for one unit will
136 also implicitly turn it on for all units contained in the same
137 slice and for all its parent slices and the units contained
138 therein. The system default for this setting may be controlled with
139 DefaultCPUAccounting= in systemd-system.conf(5).
140
141 Under the unified cgroup hierarchy, CPU accounting is available for
142 all units and this setting has no effect.
143
144 CPUWeight=weight, StartupCPUWeight=weight
145 These settings control the cpu controller in the unified hierarchy.
146
147 These options accept an integer value or the special string
148 "idle":
149
150 • If set to an integer value, assign the specified CPU time
151 weight to the processes executed, if the unified control group
152 hierarchy is used on the system. These options control the
153 "cpu.weight" control group attribute. The allowed range is 1 to
154 10000. Defaults to unset, but the kernel default is 100. For
155 details about this control group attribute, see Control Groups
156 v2[2] and CFS Scheduler[3]. The available CPU time is split up
157 among all units within one slice relative to their CPU time
158 weight. A higher weight means more CPU time, a lower weight
159 means less.
160
161 • If set to the special string "idle", mark the cgroup for "idle
162 scheduling", which means that it will get CPU resources only
163 when there are no processes not marked in this way to execute
164 in this cgroup or its siblings. This setting corresponds to the
165 "cpu.idle" cgroup attribute.
166
167 Note that this value only has an effect on cgroup-v2; for
168 cgroup-v1 it is equivalent to the minimum weight.
169
170 While StartupCPUWeight= applies to the startup and shutdown phases
171 of the system, CPUWeight= applies to normal runtime of the system,
172 and if the former is not set also to the startup and shutdown
173 phases. Using StartupCPUWeight= allows prioritizing specific
174 services at boot-up and shutdown differently than during normal
175 runtime.
176
177 In addition to the resource allocation performed by the cpu
178 controller, the kernel may automatically divide resources based on
179 session-id grouping, see "The autogroup feature" in sched(7). The
180 effect of this feature is similar to the cpu controller with no
181 explicit configuration, so users should be careful to not mistake
182 one for the other.
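
A minimal sketch of both forms of the setting, using two hypothetical services. The weights are illustrative values inside the allowed range of 1 to 10000.

```ini
# important.service (hypothetical)
[Service]
# Twice the kernel default weight of 100 during normal runtime.
CPUWeight=200
# A higher weight while the system is booting or shutting down.
StartupCPUWeight=500

# background.service (hypothetical)
[Service]
# Run only when sibling cgroups have nothing else to execute.
CPUWeight=idle
```

Weights are relative: with these values, important.service would compete 2:1 against a sibling left at the default weight during normal runtime.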
183
184 CPUQuota=
185 This setting controls the cpu controller in the unified hierarchy.
186
187 Assign the specified CPU time quota to the processes executed.
188 Takes a percentage value, suffixed with "%". The percentage
189 specifies how much CPU time the unit shall get at maximum, relative
190 to the total CPU time available on one CPU. Use values > 100% for
191 allotting CPU time on more than one CPU. This controls the
192 "cpu.max" attribute on the unified control group hierarchy and
193 "cpu.cfs_quota_us" on legacy. For details about these control group
194 attributes, see Control Groups v2[2] and CFS Bandwidth Control[4].
195 Setting CPUQuota= to an empty value unsets the quota.
196
197 Example: CPUQuota=20% ensures that the executed processes will
198 never get more than 20% CPU time on one CPU.
199
200 CPUQuotaPeriodSec=
201 This setting controls the cpu controller in the unified hierarchy.
202
203 Assign the duration over which the CPU time quota specified by
204 CPUQuota= is measured. Takes a time duration value in seconds, with
205 an optional suffix such as "ms" for milliseconds (or "s" for
206 seconds). The default setting is 100ms. The period is clamped to
207 the range supported by the kernel, which is [1ms, 1000ms].
208 Additionally, the period is adjusted up so that the quota interval
209 is also at least 1ms. Setting CPUQuotaPeriodSec= to an empty value
210 resets it to the default.
211
212 This controls the second field of "cpu.max" attribute on the
213 unified control group hierarchy and "cpu.cfs_period_us" on legacy.
214 For details about these control group attributes, see Control
215 Groups v2[2] and CFS Scheduler[3].
216
217 Example: CPUQuotaPeriodSec=10ms to request that the CPU quota is
218 measured in periods of 10ms.
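
The quota and its period are typically set together. The fragment below is a sketch with illustrative values; the "cpu.max" content mentioned afterwards is derived by scaling the quota to the period and is an expectation, not something stated in the text above.

```ini
[Service]
# Up to one and a half CPUs' worth of time.
CPUQuota=150%
# Measure the quota over 50ms windows instead of the default 100ms.
CPUQuotaPeriodSec=50ms
```

Under that assumption the unit would be allowed 75ms of CPU time in every 50ms period, which should appear in "cpu.max" as "75000 50000" (both fields in microseconds).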
219
220 AllowedCPUs=, StartupAllowedCPUs=
221 These settings control the cpuset controller in the unified
222 hierarchy.
223
224 Restrict processes to be executed on specific CPUs. Takes a list of
225 CPU indices or ranges separated by either whitespace or commas. CPU
226 ranges are specified by the lower and upper CPU indices separated
227 by a dash.
228
229 Setting AllowedCPUs= or StartupAllowedCPUs= doesn't guarantee that
230 all of the CPUs will be used by the processes as it may be limited
231 by parent units. The effective configuration is reported as
232 EffectiveCPUs=.
233
234 While StartupAllowedCPUs= applies to the startup and shutdown
235 phases of the system, AllowedCPUs= applies to normal runtime of the
236 system, and if the former is not set also to the startup and
237 shutdown phases. Using StartupAllowedCPUs= allows prioritizing
238 specific services at boot-up and shutdown differently than during
239 normal runtime.
240
241 These settings are supported only with the unified control group
242 hierarchy.
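
A short sketch of the list syntax, mixing a range and a single index. The CPU numbers are illustrative and assume a machine with at least nine CPUs.

```ini
[Service]
# CPUs 0 through 3 plus CPU 8 during normal runtime.
AllowedCPUs=0-3,8
# A wider set while the system is booting or shutting down.
StartupAllowedCPUs=0-7
```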
243
244 Memory Accounting and Control
245 MemoryAccounting=
246 This setting controls the memory controller in the unified
247 hierarchy.
248
249 Turn on process and kernel memory accounting for this unit. Takes a
250 boolean argument. Note that turning on memory accounting for one
251 unit will also implicitly turn it on for all units contained in the
252 same slice and for all its parent slices and the units contained
253 therein. The system default for this setting may be controlled with
254 DefaultMemoryAccounting= in systemd-system.conf(5).
255
256 MemoryMin=bytes, MemoryLow=bytes
257 These settings control the memory controller in the unified
258 hierarchy.
259
260 Specify the memory usage protection of the executed processes in
261 this unit. When reclaiming memory, the unit is treated as if it were
262 using less memory, so that memory is preferentially
263 reclaimed from unprotected units. Using MemoryLow= results in a
264 weaker protection where memory may still be reclaimed to avoid
265 invoking the OOM killer in case there is no other reclaimable
266 memory.
267
268 For a protection to be effective, it is generally required to set a
269 corresponding allocation on all ancestors, which is then
270 distributed between children (with the exception of the root
271 slice). Any MemoryMin= or MemoryLow= allocation that is not
272 explicitly distributed to specific children is used to create a
273 shared protection for all children. As this is a shared protection,
274 the children will freely compete for the memory.
275
276 Takes a memory size in bytes. If the value is suffixed with K, M, G
277 or T, the specified memory size is parsed as Kilobytes, Megabytes,
278 Gigabytes, or Terabytes (with the base 1024), respectively.
279 Alternatively, a percentage value may be specified, which is taken
280 relative to the installed physical memory on the system. If
281 assigned the special value "infinity", all available memory is
282 protected, which may be useful in order to always inherit all of
283 the protection afforded by ancestors. This controls the
284 "memory.min" or "memory.low" control group attribute. For details
285 about this control group attribute, see Memory Interface Files[5].
286
287 Units may have their children use a default "memory.min" or
288 "memory.low" value by specifying DefaultMemoryMin= or
289 DefaultMemoryLow=, which have the same semantics as MemoryMin= and
290 MemoryLow=. These settings do not affect "memory.min" or
291 "memory.low" in the unit itself. Using them to set a default child
292 allocation is only useful on kernels older than 5.7, which do not
293 support the "memory_recursiveprot" cgroup2 mount option.
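
Because protection has to be granted by the ancestors as well, a protected service is usually configured together with its slice. The following is a sketch with hypothetical unit names and sizes.

```ini
# apps.slice (hypothetical)
[Slice]
# Protect 2 GiB for everything inside this slice.
MemoryLow=2G

# db.service (hypothetical, placed in apps.slice)
[Service]
Slice=apps.slice
# Claim 1 GiB of the slice's protection for this service.
MemoryLow=1G
```

In this sketch the 1 GiB that apps.slice does not hand to db.service explicitly remains a shared protection that all children of the slice compete for, as described above.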
294
295 MemoryHigh=bytes
296 This setting controls the memory controller in the unified
297 hierarchy.
298
299 Specify the throttling limit on memory usage of the executed
300 processes in this unit. Memory usage may go above the limit if
301 unavoidable, but the processes are heavily slowed down and memory
302 is taken away aggressively in such cases. This is the main
303 mechanism to control memory usage of a unit.
304
305 Takes a memory size in bytes. If the value is suffixed with K, M, G
306 or T, the specified memory size is parsed as Kilobytes, Megabytes,
307 Gigabytes, or Terabytes (with the base 1024), respectively.
308 Alternatively, a percentage value may be specified, which is taken
309 relative to the installed physical memory on the system. If
310 assigned the special value "infinity", no memory throttling is
311 applied. This controls the "memory.high" control group attribute.
312 For details about this control group attribute, see Memory
313 Interface Files[5].
314
315 MemoryMax=bytes
316 This setting controls the memory controller in the unified
317 hierarchy.
318
319 Specify the absolute limit on memory usage of the executed
320 processes in this unit. If memory usage cannot be contained under
321 the limit, out-of-memory killer is invoked inside the unit. It is
322 recommended to use MemoryHigh= as the main control mechanism and
323 use MemoryMax= as the last line of defense.
324
325 Takes a memory size in bytes. If the value is suffixed with K, M, G
326 or T, the specified memory size is parsed as Kilobytes, Megabytes,
327 Gigabytes, or Terabytes (with the base 1024), respectively.
328 Alternatively, a percentage value may be specified, which is taken
329 relative to the installed physical memory on the system. If
330 assigned the special value "infinity", no memory limit is applied.
331 This controls the "memory.max" control group attribute. For details
332 about this control group attribute, see Memory Interface Files[5].
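
The recommendation to combine the two limits can be sketched as follows. The sizes are illustrative only.

```ini
[Service]
# Main control: throttle and reclaim aggressively above 768 MiB.
MemoryHigh=768M
# Last line of defense: invoke the OOM killer inside the unit above 1 GiB.
MemoryMax=1G
```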
333
334 MemorySwapMax=bytes
335 This setting controls the memory controller in the unified
336 hierarchy.
337
338 Specify the absolute limit on swap usage of the executed processes
339 in this unit.
340
341 Takes a swap size in bytes. If the value is suffixed with K, M, G
342 or T, the specified swap size is parsed as Kilobytes, Megabytes,
343 Gigabytes, or Terabytes (with the base 1024), respectively. If
344 assigned the special value "infinity", no swap limit is applied.
345 This setting controls the "memory.swap.max" control group
346 attribute. For details about this control group attribute, see
347 Memory Interface Files[5].
348
349 MemoryZSwapMax=bytes
350 This setting controls the memory controller in the unified
351 hierarchy.
352
353 Specify the absolute limit on zswap usage of the processes in this
354 unit. Zswap is a lightweight compressed cache for swap pages. It
355 takes pages that are in the process of being swapped out and
356 attempts to compress them into a dynamically allocated RAM-based
357 memory pool. If the limit specified is hit, no entries from this
358 unit will be stored in the pool until existing entries are faulted
359 back or written out to disk. See the kernel's Zswap[6]
360 documentation for more details.
361
362 Takes a size in bytes. If the value is suffixed with K, M, G or T,
363 the specified size is parsed as Kilobytes, Megabytes, Gigabytes, or
364 Terabytes (with the base 1024), respectively. If assigned the
365 special value "infinity", no limit is applied. This setting
366 controls the "memory.zswap.max" control group attribute. For details
367 about this control group attribute, see Memory Interface Files[5].
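
Swap and zswap limits use the same size syntax as the memory limits. The fragment below is a sketch with illustrative values.

```ini
[Service]
# At most 256 MiB of this unit's memory may be swapped out.
MemorySwapMax=256M
# At most 64 MiB of the zswap pool may hold this unit's pages.
MemoryZSwapMax=64M
```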
368
369 AllowedMemoryNodes=, StartupAllowedMemoryNodes=
370 These settings control the cpuset controller in the unified
371 hierarchy.
372
373 Restrict processes to be executed on specific memory NUMA nodes.
374 Takes a list of memory NUMA node indices or ranges separated by
375 either whitespace or commas. Memory NUMA node ranges are specified
376 by the lower and upper NUMA node indices separated by a dash.
377
378 Setting AllowedMemoryNodes= or StartupAllowedMemoryNodes= doesn't
379 guarantee that all of the memory NUMA nodes will be used by the
380 processes as it may be limited by parent units. The effective
381 configuration is reported as EffectiveMemoryNodes=.
382
383 While StartupAllowedMemoryNodes= applies to the startup and
384 shutdown phases of the system, AllowedMemoryNodes= applies to
385 normal runtime of the system, and if the former is not set also to
386 the startup and shutdown phases. Using StartupAllowedMemoryNodes=
387 allows prioritizing specific services at boot-up and shutdown
388 differently than during normal runtime.
389
390 These settings are supported only with the unified control group
391 hierarchy.
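
A sketch of the node list syntax, assuming a machine with at least two memory NUMA nodes. The node numbers are illustrative.

```ini
[Service]
# Allocate memory from NUMA node 0 only during normal runtime.
AllowedMemoryNodes=0
# Allow nodes 0 and 1 while the system is booting or shutting down.
StartupAllowedMemoryNodes=0-1
```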
392
393 Process Accounting and Control
394 TasksAccounting=
395 This setting controls the pids controller in the unified hierarchy.
396
397 Turn on task accounting for this unit. Takes a boolean argument. If
398 enabled, the kernel will keep track of the total number of tasks in
399 the unit and its children. This number includes both kernel threads
400 and userspace processes, with each thread counted individually.
401 Note that turning on tasks accounting for one unit will also
402 implicitly turn it on for all units contained in the same slice and
403 for all its parent slices and the units contained therein. The
404 system default for this setting may be controlled with
405 DefaultTasksAccounting= in systemd-system.conf(5).
406
407 TasksMax=N
408 This setting controls the pids controller in the unified hierarchy.
409
410 Specify the maximum number of tasks that may be created in the
411 unit. This ensures that the number of tasks accounted for the unit
412 (see above) stays below a specific limit. This either takes an
413 absolute number of tasks or a percentage value that is taken
414 relative to the configured maximum number of tasks on the system.
415 If assigned the special value "infinity", no tasks limit is
416 applied. This controls the "pids.max" control group attribute. For
417 details about this control group attribute, see the pids controller[7].
418
419 The system default for this setting may be controlled with
420 DefaultTasksMax= in systemd-system.conf(5).
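
Both accepted forms, an absolute count and a percentage, are sketched below on two hypothetical units with illustrative values.

```ini
# worker.service (hypothetical)
[Service]
# At most 512 tasks, counting every thread individually.
TasksMax=512

# batch.slice (hypothetical)
[Slice]
# At most 20% of the system-wide task maximum for the whole slice.
TasksMax=20%
```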
421
422 IO Accounting and Control
423 IOAccounting=
424 This setting controls the io controller in the unified hierarchy.
425
426 Turn on Block I/O accounting for this unit, if the unified control
427 group hierarchy is used on the system. Takes a boolean argument.
428 Note that turning on block I/O accounting for one unit will also
429 implicitly turn it on for all units contained in the same slice and
430 for all its parent slices and the units contained therein. The
431 system default for this setting may be controlled with
432 DefaultIOAccounting= in systemd-system.conf(5).
433
434 IOWeight=weight, StartupIOWeight=weight
435 These settings control the io controller in the unified hierarchy.
436
437 Set the default overall block I/O weight for the executed
438 processes, if the unified control group hierarchy is used on the
439 system. Takes a single weight value (between 1 and 10000) to set
440 the default block I/O weight. This controls the "io.weight" control
441 group attribute, which defaults to 100. For details about this
442 control group attribute, see IO Interface Files[8]. The available
443 I/O bandwidth is split up among all units within one slice relative
444 to their block I/O weight. A higher weight means more I/O
445 bandwidth, a lower weight means less.
446
447 While StartupIOWeight= applies to the startup and shutdown phases
448 of the system, IOWeight= applies to the later runtime of the
449 system, and if the former is not set also to the startup and
450 shutdown phases. This allows prioritizing specific services at
451 boot-up and shutdown differently than during runtime.
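
A minimal sketch with illustrative weights inside the allowed range of 1 to 10000:

```ini
[Service]
# Half the default weight of 100 during normal runtime.
IOWeight=50
# Twice the default weight while the system is booting or shutting down.
StartupIOWeight=200
```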
452
453 IODeviceWeight=device weight
454 This setting controls the io controller in the unified hierarchy.
455
456 Set the per-device overall block I/O weight for the executed
457 processes, if the unified control group hierarchy is used on the
458 system. Takes a space-separated pair of a file path and a weight
459 value to specify the device specific weight value, between 1 and
460 10000. (Example: "/dev/sda 1000"). The file path may be specified
461 as path to a block device node or as any other file, in which case
462 the backing block device of the file system of the file is
463 determined. This controls the "io.weight" control group attribute,
464 which defaults to 100. Use this option multiple times to set
465 weights for multiple devices. For details about this control group
466 attribute, see IO Interface Files[8].
467
468 The specified device node should reference a block device that has
469 an I/O scheduler associated, i.e. should not refer to partition or
470 loopback block devices, but to the originating, physical device.
471 When a path to a regular file or directory is specified, systemd
472 attempts to discover the correct originating device backing the
473 file system of the specified path. This works correctly only for
474 simpler cases, where the file system is directly placed on a
475 partition or physical block device, or where simple 1:1 encryption
476 using dm-crypt/LUKS is used. This discovery does not cover complex
477 storage and in particular RAID and volume management storage
478 devices.
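
The option may be repeated, once per device. The sketch below assumes that /dev/sda and /dev/sdb are whole physical disks on the system; the weights are illustrative.

```ini
[Service]
# Favor this unit on the first disk.
IODeviceWeight=/dev/sda 1000
# Deprioritize it on the second disk.
IODeviceWeight=/dev/sdb 50
```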
479
480 IOReadBandwidthMax=device bytes, IOWriteBandwidthMax=device bytes
481 These settings control the io controller in the unified hierarchy.
482
483 Set the per-device overall block I/O bandwidth maximum limit for
484 the executed processes, if the unified control group hierarchy is
485 used on the system. This limit is not work-conserving and the
486 executed processes are not allowed to use more even if the device
487 has idle capacity. Takes a space-separated pair of a file path and
488 a bandwidth value (in bytes per second) to specify the device
489 specific bandwidth. The file path may be a path to a block device
490 node, or any other file, in which case the backing block device
491 of the file system of the file is used. If the bandwidth is
492 suffixed with K, M, G, or T, the specified bandwidth is parsed as
493 Kilobytes, Megabytes, Gigabytes, or Terabytes, respectively, to the
494 base of 1000. (Example:
495 "/dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0 5M"). This
496 controls the "io.max" control group attribute. Use this option
497 multiple times to set bandwidth limits for multiple devices. For
498 details about this control group attribute, see IO Interface
499 Files[8].
500
501 Similar restrictions on block device discovery as for
502 IODeviceWeight= apply, see above.
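
A sketch of separate read and write caps on one device. The device path and the values are illustrative; the suffixes are parsed to the base of 1000, as stated above.

```ini
[Service]
# Read at most 50 MB/s from /dev/sda.
IOReadBandwidthMax=/dev/sda 50M
# Write at most 10 MB/s to /dev/sda.
IOWriteBandwidthMax=/dev/sda 10M
```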
503
504 IOReadIOPSMax=device IOPS, IOWriteIOPSMax=device IOPS
505 These settings control the io controller in the unified hierarchy.
506
507 Set the per-device overall block I/O IOs-Per-Second maximum limit
508 for the executed processes, if the unified control group hierarchy
509 is used on the system. This limit is not work-conserving and the
510 executed processes are not allowed to use more even if the device
511 has idle capacity. Takes a space-separated pair of a file path and
512 an IOPS value to specify the device specific IOPS. The file path
513 may be a path to a block device node, or any other file, in which
514 case the backing block device of the file system of the file is
515 used. If the IOPS is suffixed with K, M, G, or T, the specified
516 IOPS is parsed as KiloIOPS, MegaIOPS, GigaIOPS, or TeraIOPS,
517 respectively, to the base of 1000. (Example:
518 "/dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0 1K"). This
519 controls the "io.max" control group attribute. Use this option
520 multiple times to set IOPS limits for multiple devices. For details
521 about this control group attribute, see IO Interface Files[8].
522
523 Similar restrictions on block device discovery as for
524 IODeviceWeight= apply, see above.
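
The IOPS limits follow the same pattern. The device path and the values below are illustrative.

```ini
[Service]
# At most 2000 read operations per second on /dev/sda.
IOReadIOPSMax=/dev/sda 2K
# At most 500 write operations per second on /dev/sda.
IOWriteIOPSMax=/dev/sda 500
```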
525
526 IODeviceLatencyTargetSec=device target
527 This setting controls the io controller in the unified hierarchy.
528
529 Set the per-device average target I/O latency for the executed
530 processes, if the unified control group hierarchy is used on the
531 system. Takes a file path and a timespan separated by a space to
532 specify the device specific latency target. (Example: "/dev/sda
533 25ms"). The file path may be specified as path to a block device
534 node or as any other file, in which case the backing block device
535 of the file system of the file is determined. This controls the
536 "io.latency" control group attribute. Use this option multiple
537 times to set latency targets for multiple devices. For details about
538 this control group attribute, see IO Interface Files[8].
539
540 Implies "IOAccounting=yes".
541
542 This setting is supported only if the unified control group
543 hierarchy is used.
544
545 Similar restrictions on block device discovery as for
546 IODeviceWeight= apply, see above.
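
A sketch with one target per device. The second device and both timespans are illustrative assumptions.

```ini
[Service]
# Aim for an average I/O latency of 25ms on the first disk.
IODeviceLatencyTargetSec=/dev/sda 25ms
# Use a looser target on a slower second disk.
IODeviceLatencyTargetSec=/dev/sdb 100ms
```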
547
548 Network Accounting and Control
549 IPAccounting=
550 Takes a boolean argument. If true, turns on IPv4 and IPv6 network
551 traffic accounting for packets sent or received by the unit. When
552 this option is turned on, all IPv4 and IPv6 sockets created by any
553 process of the unit are accounted for.
554
555 When this option is used in socket units, it applies to all IPv4
556 and IPv6 sockets associated with it (including both listening and
557 connection sockets where this applies). Note that for
558 socket-activated services, this configuration setting and the
559 accounting data of the service unit and the socket unit are kept
560 separate, and displayed separately. No propagation of the setting
561 and the collected statistics is done, in either direction.
562 Moreover, any traffic sent or received on any of the socket unit's
563 sockets is accounted to the socket unit — and never to the service
564 unit it might have activated, even if the socket is used by it.
565
566 The system default for this setting may be controlled with
567 DefaultIPAccounting= in systemd-system.conf(5).
568
569 IPAddressAllow=ADDRESS[/PREFIXLENGTH]...,
570 IPAddressDeny=ADDRESS[/PREFIXLENGTH]...
571 Turn on network traffic filtering for IP packets sent and received
572 over AF_INET and AF_INET6 sockets. Both directives take a
573 space-separated list of IPv4 or IPv6 addresses, each optionally suffixed
574 with an address prefix length in bits after a "/" character. If the
575 suffix is omitted, the address is considered a host address, i.e.
576 the filter covers the whole address (32 bits for IPv4, 128 bits for
577 IPv6).
578
579 The access lists configured with this option are applied to all
580 sockets created by processes of this unit (or in the case of socket
581 units, associated with it). The lists are implicitly combined with
582 any lists configured for any of the parent slice units this unit
583 might be a member of. By default both access lists are empty. Both
584 ingress and egress traffic is filtered by these settings. In case
585 of ingress traffic the source IP address is checked against these
586 access lists, in case of egress traffic the destination IP address
587 is checked. The following rules are applied in turn:
588
589 • Access is granted when the checked IP address matches an entry
590 in the IPAddressAllow= list.
591
592 • Otherwise, access is denied when the checked IP address matches
593 an entry in the IPAddressDeny= list.
594
595 • Otherwise, access is granted.
596
597 In order to implement an allow-listing IP firewall, it is
598 recommended to use an IPAddressDeny=any setting on an upper-level
599 slice unit (such as the root slice -.slice or the slice containing
600 all system services system.slice – see systemd.special(7) for
601 details on these slice units), plus individual per-service
602 IPAddressAllow= lines permitting network access to relevant
603 services, and only them.
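
The allow-listing setup described above could look like the following sketch. The drop-in file name, the service name, and the 192.0.2.0/24 network (a documentation range) are hypothetical.

```ini
# /etc/systemd/system/system.slice.d/10-ip-deny.conf
[Slice]
# Deny all IP traffic for every unit in system.slice by default.
IPAddressDeny=any

# web.service (hypothetical)
[Service]
# Matches in the allow list are checked first, so these peers get access.
IPAddressAllow=localhost
IPAddressAllow=192.0.2.0/24
```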
604
605 Note that for socket-activated services, the IP access list
606 configured on the socket unit applies to all sockets associated
607 with it directly, but not to any sockets created by the ultimately
608 activated services for it. Conversely, the IP access list
609 configured for the service is not applied to any sockets passed
610 into the service via socket activation. Thus, it is usually a good
611 idea to replicate the IP access lists on both the socket and the
612 service unit. Nevertheless, it may make sense to maintain one list
613 more open and the other one more restricted, depending on the
614 use case.
615
616 If these settings are used multiple times in the same unit the
617 specified lists are combined. If an empty string is assigned to
618 these settings the specific access list is reset and all previous
619 settings undone.
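
Resetting is mostly useful in drop-ins that need to replace a list from the main unit file instead of extending it. The file name and the network below are hypothetical.

```ini
# /etc/systemd/system/web.service.d/20-ip-override.conf
[Service]
# The empty assignment discards every IPAddressAllow= entry set so far.
IPAddressAllow=
# Only this network remains in the allow list.
IPAddressAllow=10.0.0.0/8
```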
620
621 In place of explicit IPv4 or IPv6 address and prefix length
622 specifications a small set of symbolic names may be used. The
623 following names are defined:
624
625 Table 1. Special address/network names
626 ┌──────────────┬─────────────────────┬─────────────────────┐
627 │Symbolic Name │ Definition │ Meaning │
628 ├──────────────┼─────────────────────┼─────────────────────┤
629 │any │ 0.0.0.0/0 ::/0 │ Any host │
630 ├──────────────┼─────────────────────┼─────────────────────┤
631 │localhost │ 127.0.0.0/8 ::1/128 │ All addresses on │
632 │ │ │ the local loopback │
633 ├──────────────┼─────────────────────┼─────────────────────┤
634 │link-local │ 169.254.0.0/16 │ All link-local IP │
635 │ │ fe80::/64 │ addresses │
636 ├──────────────┼─────────────────────┼─────────────────────┤
637 │multicast │ 224.0.0.0/4 │ All IP multicasting │
638 │ │ ff00::/8 │ addresses │
639 └──────────────┴─────────────────────┴─────────────────────┘
640 Note that these settings might not be supported on some systems
641 (for example if eBPF control group support is not enabled in the
642 underlying kernel or container manager). These settings will have
643 no effect in that case. If compatibility with such systems is
644 desired, it is hence recommended not to rely exclusively on them for
645 IP security.
646
647 This option cannot be bypassed by prefixing "+" to the executable
648 path in the service unit, as it applies to the whole control group.
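
As a minimal sketch of replicating the access lists for a socket-activated service, the two fragments below use the symbolic names from Table 1. The unit names, the port, and the choice of lists are hypothetical.

```ini
# example.socket (hypothetical unit name)
[Socket]
ListenStream=8080
# Applies to the listening socket owned by the socket unit.
IPAddressAllow=localhost
IPAddressDeny=any

# example.service (hypothetical unit name)
[Service]
# Applies to sockets the service creates itself, not to the one passed in.
IPAddressAllow=localhost
IPAddressDeny=any
```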
649
650 SocketBindAllow=bind-rule, SocketBindDeny=bind-rule
651 Allow or deny binding a socket address to a socket by matching it
652 with the bind-rule and applying a corresponding action if there is
653 a match.
654
655 bind-rule describes socket properties such as address-family,
656 transport-protocol and ip-ports.
657
658 bind-rule := { [address-family:][transport-protocol:][ip-ports] |
659 any }
660
661 address-family := { ipv4 | ipv6 }
662
663 transport-protocol := { tcp | udp }
664
665 ip-ports := { ip-port | ip-port-range }
666
667 An optional address-family expects ipv4 or ipv6 values. If not
668 specified, a rule will be matched for both IPv4 and IPv6 addresses
669 and applied depending on other socket fields, e.g.
670 transport-protocol, ip-port.
671
672 An optional transport-protocol expects tcp or udp transport
673 protocol names. If not specified, a rule will be matched for any
674 transport protocol.
675
676 An optional ip-port value must lie within the 1...65535 interval,
677 inclusive, i.e. dynamic port 0 is not allowed. A range of
678 sequential ports is described by ip-port-range :=
679 ip-port-low-ip-port-high, where ip-port-low is smaller than or
680 equal to ip-port-high and both are within 1...65535, inclusive.
681
682 A special value any can be used to apply a rule to any address
683 family, transport protocol and any port with a positive value.
684
685 To allow multiple rules, assign SocketBindAllow= or SocketBindDeny=
686 multiple times. To clear the existing assignments, pass an empty
687 SocketBindAllow= or SocketBindDeny= assignment.
688
689 For each of SocketBindAllow= and SocketBindDeny=, the maximum allowed
690 number of assignments is 128.
691
692 • Binding to a socket is allowed when a socket address matches an
693 entry in the SocketBindAllow= list.
694
695 • Otherwise, binding is denied when the socket address matches an
696 entry in the SocketBindDeny= list.
697
698 • Otherwise, binding is allowed.
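
The evaluation order above can be traced with a small fragment. This is a sketch only; the port range is an arbitrary assumption.

```ini
[Service]
# Step 1: a bind matching an allow entry succeeds
# (TCP ports 8080-8090, over IPv4 or IPv6).
SocketBindAllow=tcp:8080-8090
# Step 2: any other bind that matches a deny entry is refused,
# for example udp:8080 or tcp:9000.
SocketBindDeny=any
# Step 3 (allow by default) only applies to binds matched by neither list.
```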
699
700 The feature is implemented with cgroup/bind4 and cgroup/bind6
701 cgroup-bpf hooks.
702
703 Examples:
704
705 ...
706 # Allow binding IPv6 socket addresses with a port greater than or equal to 10000.
707 [Service]
708 SocketBindAllow=ipv6:10000-65535
709 SocketBindDeny=any
710 ...
711 # Allow binding IPv4 and IPv6 socket addresses with 1234 and 4321 ports.
712 [Service]
713 SocketBindAllow=1234
714 SocketBindAllow=4321
715 SocketBindDeny=any
716 ...
717 # Deny binding IPv6 socket addresses.
718 [Service]
719 SocketBindDeny=ipv6
720 ...
721 # Deny binding IPv4 and IPv6 socket addresses.
722 [Service]
723 SocketBindDeny=any
724 ...
725 # Allow binding only over TCP
726 [Service]
727 SocketBindAllow=tcp
728 SocketBindDeny=any
729 ...
730 # Allow binding only over IPv6/TCP
731 [Service]
732 SocketBindAllow=ipv6:tcp
733 SocketBindDeny=any
734 ...
735 # Allow binding ports within 10000-65535 range over IPv4/UDP.
736 [Service]
737 SocketBindAllow=ipv4:udp:10000-65535
738 SocketBindDeny=any
739 ...
740
741 This option cannot be bypassed by prefixing "+" to the executable
742 path in the service unit, as it applies to the whole control group.
743
744 RestrictNetworkInterfaces=
745 Takes a list of space-separated network interface names. This
746 option restricts the network interfaces that processes of this unit
747 can use. By default processes can only use the network interfaces
748 listed (allow-list). If the first character of the rule is "~", the
749 effect is inverted: the processes can only use network interfaces
750 not listed (deny-list).
751
752 This option can appear multiple times, in which case the network
753 interface names are merged. If the empty string is assigned, the set
754 is reset and all prior assignments will have no effect.
755
756 If you specify both types of this option (i.e. allow-listing and
757 deny-listing), the first one encountered takes precedence and
758 dictates the default action (allow vs. deny). Subsequent
759 occurrences of this option then add or remove the listed network
760 interface names from the set, depending on their type and the default
761 action.
762
763 The loopback interface ("lo") is not treated in any special way;
764 you have to configure it explicitly in the unit file.
765
766 Example 1: allow-list
767
768 RestrictNetworkInterfaces=eth1
769 RestrictNetworkInterfaces=eth2
770
771 Programs in the unit will only be able to use the eth1 and eth2
772 network interfaces.
773
774 Example 2: deny-list
775
776 RestrictNetworkInterfaces=~eth1 eth2
777
778 Programs in the unit will be able to use any network interface but
779 eth1 and eth2.
780
781 Example 3: mixed
782
783 RestrictNetworkInterfaces=eth1 eth2
784 RestrictNetworkInterfaces=~eth1
785
786 Programs in the unit will only be able to use the eth2 network
787 interface.
788
789 This option cannot be bypassed by prefixing "+" to the executable
790 path in the service unit, as it applies to the whole control group.
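
Because the loopback interface is not implied, an allow-list that should keep local communication working has to name it. A minimal sketch, where eth0 is an assumed interface name:

```ini
[Service]
# "lo" must be listed explicitly; eth0 is a hypothetical uplink interface.
RestrictNetworkInterfaces=lo eth0
```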
791
792 BPF Programs
793 IPIngressFilterPath=BPF_FS_PROGRAM_PATH,
794 IPEgressFilterPath=BPF_FS_PROGRAM_PATH
795 Add custom network traffic filters implemented as BPF programs,
796 applying to all IP packets sent and received over AF_INET and
797 AF_INET6 sockets. Takes an absolute path to a pinned BPF program in
798 the BPF virtual filesystem (/sys/fs/bpf/).
799
800 The filters configured with this option are applied to all sockets
801 created by processes of this unit (or in the case of socket units,
802 associated with it). The filters are loaded in addition to the filters
803 of any of the parent slice units this unit might be a member of, as
804 well as any IPAddressAllow= and IPAddressDeny= filters in any of
805 these units. By default there are no filters specified.
806
807 If these settings are used multiple times in the same unit, all the
808 specified programs are attached. If an empty string is assigned to
809 these settings, the program list is reset and all previously specified
810 programs are ignored.
811
812 If the path BPF_FS_PROGRAM_PATH in an IPIngressFilterPath= assignment
813 is already being handled by a BPFProgram= ingress hook, e.g.
814 BPFProgram=ingress:BPF_FS_PROGRAM_PATH, the assignment will still be
815 considered valid and the program will be attached to the
816 cgroup. The same applies to IPEgressFilterPath= and the egress hook.
817
818 Note that for socket-activated services, the IP filter programs
819 configured on the socket unit apply to all sockets associated with
820 it directly, but not to any sockets created by the ultimately
821 activated services for it. Conversely, the IP filter programs
822 configured for the service are not applied to any sockets passed
823 into the service via socket activation. Thus, it is usually a good
824 idea to replicate the IP filter programs on both the socket and
825 the service unit. However, it often makes sense to maintain one
826 configuration more open and the other one more restricted,
827 depending on the use case.
828
829 Note that these settings might not be supported on some systems
830 (for example if eBPF control group support is not enabled in the
831 underlying kernel or container manager). In that case, these settings
832 cause the service to fail. If compatibility with such systems is
833 desired, it is hence recommended to attach your filter manually
834 (requires Delegate=yes) instead of using this setting.
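
A minimal sketch of attaching pinned filter programs follows. Both paths are hypothetical, and the programs are assumed to have been loaded and pinned in the BPF file system before the unit starts.

```ini
[Service]
# Hypothetical pinned programs under the BPF virtual filesystem.
IPIngressFilterPath=/sys/fs/bpf/example-ingress
IPEgressFilterPath=/sys/fs/bpf/example-egress
```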
835
836 BPFProgram=type:program-path
837 BPFProgram= allows attaching custom BPF programs to the cgroup of a
838 unit. (This generalizes the functionality exposed via
839 IPEgressFilterPath= and IPIngressFilterPath= for other hooks.)
840 Cgroup-bpf hooks in the form of BPF programs loaded to the BPF
841 filesystem are attached with cgroup-bpf attach flags determined by
842 the unit. For details about attachment types and flags see
843 bpf.h[9]. Also refer to the general BPF documentation[10].
844
845 A BPF program specification consists of a BPF program type and a
846 program path in the file system, with ":" as the
847 separator: type:program-path.
848
849 The BPF program type is equivalent to the BPF attach type used in
850 bpftool. It may be one of egress, ingress, sock_create, sock_ops,
851 device, bind4, bind6, connect4, connect6, post_bind4, post_bind6,
852 sendmsg4, sendmsg6, sysctl, recvmsg4, recvmsg6, getsockopt,
853 setsockopt.
854
855 The specified program path must be an absolute path referencing a
856 BPF program inode in the bpffs file system (which generally means
857 it must begin with /sys/fs/bpf/). If a specified program does not
858 exist (i.e. has not been uploaded to the BPF subsystem of the
859 kernel yet), it will not be installed but unit activation will
860 continue (a warning will be printed to the logs).
861
862 Setting BPFProgram= to an empty value makes previous assignments
863 ineffective.
864
865 Multiple assignments of the same program type/path pair have the
866 same effect as a single assignment: the program will be attached
867 just once.
868
869 If a BPF egress program pinned at program-path is already being handled
870 by IPEgressFilterPath=, the BPFProgram= assignment is still considered
871 valid and the program is attached to the cgroup. The same applies to the
872 ingress hook and an IPIngressFilterPath= assignment.
873
874 BPF programs passed with BPFProgram= are attached to the cgroup of
875 a unit with the BPF attach flag multi, which allows further attachments
876 of the same type within the cgroup hierarchy topped by the unit cgroup.
877
878 Examples:
879
880 BPFProgram=egress:/sys/fs/bpf/egress-hook
881 BPFProgram=bind6:/sys/fs/bpf/sock-addr-hook
882
883 Device Access
884 DeviceAllow=
885 Control access to specific device nodes by the executed processes.
886 Takes two space-separated strings: a device node specifier followed
887 by a combination of r, w, m to control reading, writing, or
888 creation of the specific device nodes by the unit (mknod),
889 respectively. This functionality is implemented using eBPF
890 filtering.
891
892 When access to all physical devices should be disallowed,
893 PrivateDevices= may be used instead. See systemd.exec(5).
894
895 The device node specifier is either a path to a device node in the
896 file system, starting with /dev/, or a string starting with either
897 "char-" or "block-" followed by a device group name, as listed in
898 /proc/devices. The latter is useful to allow-list all current and
899 future devices belonging to a specific device group at once. The
900 device group is matched according to filename globbing rules; you
901 may hence use the "*" and "?" wildcards. (Note that such globbing
902 wildcards are not available for device node path specifications!)
903 In order to match device nodes by numeric major/minor, use device
904 node paths in the /dev/char/ and /dev/block/ directories. However,
905 matching devices by major/minor is generally not recommended as
906 assignments are neither stable nor portable between systems or
907 different kernel versions.
908
909 Examples: /dev/sda5 is a path to a device node, referring to an ATA
910 or SCSI block device. "char-pts" and "char-alsa" are specifiers
911 for all pseudo TTYs and all ALSA sound devices, respectively.
912 "char-cpu/*" is a specifier matching all CPU related device groups.
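
The specifiers above combine with the r, w, m access flags as sketched below. The access modes chosen here are illustrative assumptions, not recommendations.

```ini
[Service]
# Read-only access to a single block device node.
DeviceAllow=/dev/sda5 r
# Read and write access to all pseudo TTYs.
DeviceAllow=char-pts rw
# Read, write, and mknod for all ALSA sound devices.
DeviceAllow=char-alsa rwm
```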
913
914 Note that allow lists defined this way should only reference device
915 groups which are resolvable at the time the unit is started. Any
916 device groups not resolvable then are not added to the device allow
917 list. In order to work around this limitation, consider extending
918 service units with a pair of After=modprobe@xyz.service and
919 Wants=modprobe@xyz.service lines that load the necessary kernel
920 module implementing the device group if missing. Example:
921
922 ...
923 [Unit]
924 Wants=modprobe@loop.service
925 After=modprobe@loop.service
926
927 [Service]
928 DeviceAllow=block-loop
929 DeviceAllow=/dev/loop-control
930 ...
931
932 This option cannot be bypassed by prefixing "+" to the executable
933 path in the service unit, as it applies to the whole control group.
934
935 DevicePolicy=auto|closed|strict
936 Control the policy for allowing device access:
937
938 strict
939 means to only allow types of access that are explicitly
940 specified.
941
942 closed
943 in addition, allows access to standard pseudo devices including
944 /dev/null, /dev/zero, /dev/full, /dev/random, and /dev/urandom.
945
946 auto
947 in addition, allows access to all devices if no explicit
948 DeviceAllow= is present. This is the default.
949
950 This option cannot be bypassed by prefixing "+" to the executable
951 path in the service unit, as it applies to the whole control group.
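
As a rough illustration of how the policy interacts with DeviceAllow=, the fragment below permits the standard pseudo devices plus one explicitly listed node. The device /dev/ttyUSB0 is a hypothetical example.

```ini
[Service]
# "closed" permits /dev/null, /dev/zero, /dev/full, /dev/random and
# /dev/urandom in addition to whatever DeviceAllow= lists.
DevicePolicy=closed
# Hypothetical serial adapter the service needs to read and write.
DeviceAllow=/dev/ttyUSB0 rw
```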
952
953 Control Group Management
954 Slice=
955 The name of the slice unit to place the unit in. Defaults to
956 system.slice for all non-instantiated units of all unit types
957 (except for slice units themselves, see below). Instance units are
958 by default placed in a subslice of system.slice that is named after
959 the template name.
960
961 This option may be used to arrange systemd units in a hierarchy of
962 slices, each of which might have resource settings applied.
963
964 For units of type slice, the only accepted value for this setting
965 is the parent slice. Since the name of a slice unit implies the
966 parent slice, it is hence redundant to ever set this parameter
967 directly for slice units.
968
969 Special care should be taken when relying on the default slice
970 assignment in templated service units that have
971 DefaultDependencies=no set, see systemd.service(5), section
972 "Default Dependencies" for details.
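
A minimal sketch of placing a service in a custom slice follows. The unit names and the weight are hypothetical. The parent of the slice (batch.slice) follows from its name and is therefore not configured.

```ini
# batch-nightly.slice (hypothetical); implicitly a child of batch.slice
[Slice]
CPUWeight=50

# example.service (hypothetical)
[Service]
Slice=batch-nightly.slice
```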
973
974 Delegate=
975 Turns on delegation of further resource control partitioning to
976 processes of the unit. Units where this is enabled may create and
977 manage their own private subhierarchy of control groups below the
978 control group of the unit itself. For unprivileged services (i.e.
979 those using the User= setting) the unit's control group will be
980 made accessible to the relevant user.
981
982 When enabled, the service manager will refrain from manipulating
983 control groups or moving processes below the unit's control group,
984 so that a clear concept of ownership is established: the control
985 group tree at the level of the unit's control group and above (i.e.
986 towards the root control group) is owned and managed by the service
987 manager of the host, while the control group tree below the unit's
988 control group is owned and managed by the unit itself.
989
990 Takes either a boolean argument or a (possibly empty) list of
991 control group controller names. If true, delegation is turned on,
992 and all supported controllers are enabled for the unit, making them
993 available to the unit's processes for management. If false,
994 delegation is turned off entirely (and no additional controllers
995 are enabled). If set to a list of controllers, delegation is turned
996 on, and the specified controllers are enabled for the unit.
997 Assigning the empty string will enable delegation, but reset the
998 list of controllers, and all assignments prior to this will have no
999 effect. Note that additional controllers other than the ones
1000 specified might be made available as well, depending on
1001 configuration of the containing slice unit or other units contained
1002 in it. Defaults to false.
1003
1004 Note that controller delegation to less privileged code is only
1005 safe on the unified control group hierarchy. Accordingly, access to
1006 the specified controllers will not be granted to unprivileged
1007 services on the legacy hierarchy, even when requested.
1008
1009 The following controller names may be specified: cpu, cpuacct,
1010 cpuset, io, blkio, memory, devices, pids, bpf-firewall, and
1011 bpf-devices.
1012
1013 Not all of these controllers are available on all kernels, however,
1014 and some are specific to the unified hierarchy while others are
1015 specific to the legacy hierarchy. Also note that the kernel might
1016 support further controllers, which aren't covered here yet as
1017 delegation is either not supported at all for them or not defined
1018 cleanly.
1019
1020 Note that because of the hierarchical nature of the cgroup hierarchy,
1021 any controllers that are delegated will be enabled for the parent
1022 and sibling units of the unit with delegation.
1023
1024 For further details on the delegation model consult Control Group
1025 APIs and Delegation[11].
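
As a sketch of the list form, the fragment below delegates a subset of controllers to an unprivileged service. The user name is hypothetical, and the controller selection is only an example.

```ini
[Service]
# Hypothetical unprivileged user; the unit's cgroup is made accessible to it.
User=example
# Turn on delegation and enable these controllers for the unit.
Delegate=cpu memory pids
```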
1026
1027 DisableControllers=
1028 Disables controllers from being enabled for a unit's children. If a
1029 controller listed is already in use in its subtree, the controller
1030 will be removed from the subtree. This can be used to avoid
1031 configuration in child units from being able to implicitly or
1032 explicitly enable a controller. Defaults to empty.
1033
1034 Multiple controllers may be specified, separated by spaces. You may
1035 also pass DisableControllers= multiple times, in which case each
1036 new instance adds another controller to disable. Passing
1037 DisableControllers= by itself with no controller name present
1038 resets the disabled controller list.
1039
1040 It may not be possible to disable a controller after units have
1041 been started, if the unit or any child of the unit in question
1042 delegates controllers to its children, as any delegated subtree of
1043 the cgroup hierarchy is unmanaged by systemd.
1044
1045 The following controller names may be specified: cpu, cpuacct,
1046 cpuset, io, blkio, memory, devices, pids, bpf-firewall, and
1047 bpf-devices.
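
A minimal sketch, assuming a hypothetical slice whose children should not be able to enable the cpu and io controllers:

```ini
# example.slice (hypothetical)
[Slice]
# Space-separated list; repeated assignments would add to it.
DisableControllers=cpu io
```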
1048
1049 Memory Pressure Control
1050 ManagedOOMSwap=auto|kill, ManagedOOMMemoryPressure=auto|kill
1051 Specifies how systemd-oomd.service(8) will act on this unit's
1052 cgroups. Defaults to auto.
1053
1054 When set to kill, the unit becomes a candidate for monitoring by
1055 systemd-oomd. If the cgroup passes the limits set by oomd.conf(5)
1056 or the unit configuration, systemd-oomd will select a descendant
1057 cgroup and send SIGKILL to all of the processes under it. You can
1058 find more details on candidates and kill behavior in
1059 systemd-oomd.service(8) and oomd.conf(5).
1060
1061 Setting either of these properties to kill will also result in
1062 After= and Wants= dependencies on systemd-oomd.service unless
1063 DefaultDependencies=no.
1064
1065 When set to auto, systemd-oomd will not actively use this cgroup's
1066 data for monitoring and detection. However, if an ancestor cgroup
1067 has one of these properties set to kill, a unit with auto can still
1068 be a candidate for systemd-oomd to terminate.
1069
1070 ManagedOOMMemoryPressureLimit=
1071 Overrides the default memory pressure limit set by oomd.conf(5) for
1072 this unit (cgroup). Takes a percentage value between 0% and 100%,
1073 inclusive. This property is ignored unless
1074 ManagedOOMMemoryPressure=kill. Defaults to 0%, which means to use
1075 the default set by oomd.conf(5).
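
The settings above might be combined as follows. This is only a sketch, and the 50% limit is an arbitrary assumption rather than a tuned value.

```ini
[Service]
# Let systemd-oomd monitor this unit's cgroup for memory pressure.
ManagedOOMMemoryPressure=kill
# Override the oomd.conf(5) default for this unit (illustrative value).
ManagedOOMMemoryPressureLimit=50%
```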
1076
1077 ManagedOOMPreference=none|avoid|omit
1078 Allows deprioritizing or omitting this unit's cgroup as a candidate
1079 when systemd-oomd needs to act. Requires support for extended
1080 attributes (see xattr(7)) in order to use avoid or omit.
1081
1082 When calculating candidates to relieve swap usage, systemd-oomd
1083 will only respect these extended attributes if the unit's cgroup is
1084 owned by root.
1085
1086 When calculating candidates to relieve memory pressure,
1087 systemd-oomd will only respect these extended attributes if the
1088 unit's cgroup is owned by root, or if the unit's cgroup owner and
1089 the owner of the monitored ancestor cgroup are the same. For
1090 example, if systemd-oomd is calculating candidates for -.slice,
1091 then extended attributes set on descendants of
1092 /user.slice/user-1000.slice/user@1000.service/ will be ignored
1093 because the descendants are owned by UID 1000, and -.slice is owned
1094 by UID 0. But, if calculating candidates for
1095 /user.slice/user-1000.slice/user@1000.service/, then extended
1096 attributes set on the descendants would be respected.
1097
1098 If this property is set to avoid, the service manager will convey
1099 this to systemd-oomd, which will only select this cgroup if there
1100 are no other viable candidates.
1101
1102 If this property is set to omit, the service manager will convey
1103 this to systemd-oomd, which will ignore this cgroup as a candidate
1104 and will not perform any actions on it.
1105
1106 It is recommended to use avoid and omit sparingly, as they can
1107 adversely affect systemd-oomd's kill behavior. Also note that these
1108 extended attributes are not applied recursively to cgroups under
1109 this unit's cgroup.
1110
1111 Defaults to none, which means systemd-oomd will rank this unit's
1112 cgroup as defined in systemd-oomd.service(8) and oomd.conf(5).
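
For illustration, a service that should only be selected when no other candidate remains could be marked as sketched below. Whether this suits a given workload is an assumption left to the administrator.

```ini
[Service]
# systemd-oomd selects this cgroup only if there are no other viable candidates.
ManagedOOMPreference=avoid
```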
1113
1115 systemd 252
1116 Options for controlling the Legacy Control Group Hierarchy (Control
1117 Groups version 1[12]) are now fully deprecated: CPUShares=weight,
1118 StartupCPUShares=weight, MemoryLimit=bytes, BlockIOAccounting=,
1119 BlockIOWeight=weight, StartupBlockIOWeight=weight,
1120 BlockIODeviceWeight=device weight, BlockIOReadBandwidth=device
1121 bytes, BlockIOWriteBandwidth=device bytes. Please switch to the
1122 unified cgroup hierarchy.
1123
1125 systemd(1), systemd-system.conf(5), systemd.unit(5),
1126 systemd.service(5), systemd.slice(5), systemd.scope(5),
1127 systemd.socket(5), systemd.mount(5), systemd.swap(5), systemd.exec(5),
1128 systemd.directives(7), systemd.special(7), systemd-oomd.service(8), The
1129 documentation for control groups and specific controllers in the Linux
1130 kernel: Control Groups v2[2].
1131
1133 1. New Control Group Interfaces
1134 https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface
1135
1136 2. Control Groups v2
1137 https://docs.kernel.org/admin-guide/cgroup-v2.html
1138
1139 3. CFS Scheduler
1140 https://docs.kernel.org/scheduler/sched-design-CFS.html
1141
1142 4. CFS Bandwidth Control
1143 https://docs.kernel.org/scheduler/sched-bwc.html
1144
1145 5. Memory Interface Files
1146 https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-interface-files
1147
1148 6. Zswap
1149 https://www.kernel.org/doc/html/latest/admin-guide/mm/zswap.html
1150
1151 7. pids controller
1152 https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#pid
1153
1154 8. IO Interface Files
1155 https://docs.kernel.org/admin-guide/cgroup-v2.html#io-interface-files
1156
1157 9. bpf.h
1158 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/include/uapi/linux/bpf.h
1159
1160 10. BPF documentation
1161 https://docs.kernel.org/bpf/
1162
1163 11. Control Group APIs and Delegation
1164 https://systemd.io/CGROUP_DELEGATION
1165
1166 12. Control Groups version 1
1167 https://docs.kernel.org/admin-guide/cgroup-v1/index.html
1168
1169
1170
1171systemd 253 SYSTEMD.RESOURCE-CONTROL(5)