Capabilities(7)          Miscellaneous Information Manual          Capabilities(7)

capabilities - overview of Linux capabilities

For the purpose of performing permission checks, traditional UNIX
implementations distinguish two categories of processes: privileged
processes (whose effective user ID is 0, referred to as superuser or
root), and unprivileged processes (whose effective UID is nonzero).
Privileged processes bypass all kernel permission checks, while
unprivileged processes are subject to full permission checking based on
the process's credentials (usually: effective UID, effective GID, and
supplementary group list).

Starting with Linux 2.2, Linux divides the privileges traditionally
associated with superuser into distinct units, known as capabilities,
which can be independently enabled and disabled. Capabilities are a
per-thread attribute.
22
23 Capabilities list
24 The following list shows the capabilities implemented on Linux, and the
25 operations or behaviors that each capability permits:
26
27 CAP_AUDIT_CONTROL (since Linux 2.6.11)
28 Enable and disable kernel auditing; change auditing filter
29 rules; retrieve auditing status and filtering rules.
30
31 CAP_AUDIT_READ (since Linux 3.16)
32 Allow reading the audit log via a multicast netlink socket.
33
34 CAP_AUDIT_WRITE (since Linux 2.6.11)
Write records to the kernel auditing log.
36
37 CAP_BLOCK_SUSPEND (since Linux 3.5)
Employ features that can block system suspend (epoll(7) EPOLLWAKEUP,
/sys/power/wake_lock).
40
41 CAP_BPF (since Linux 5.8)
42 Employ privileged BPF operations; see bpf(2) and bpf-helpers(7).
43
44 This capability was added in Linux 5.8 to separate out BPF func‐
45 tionality from the overloaded CAP_SYS_ADMIN capability.
46
47 CAP_CHECKPOINT_RESTORE (since Linux 5.9)
48 • Update /proc/sys/kernel/ns_last_pid (see pid_namespaces(7));
49 • employ the set_tid feature of clone3(2);
50 • read the contents of the symbolic links in
51 /proc/pid/map_files for other processes.
52
53 This capability was added in Linux 5.9 to separate out check‐
54 point/restore functionality from the overloaded CAP_SYS_ADMIN
55 capability.
56
57 CAP_CHOWN
58 Make arbitrary changes to file UIDs and GIDs (see chown(2)).
59
60 CAP_DAC_OVERRIDE
61 Bypass file read, write, and execute permission checks. (DAC is
62 an abbreviation of "discretionary access control".)
63
64 CAP_DAC_READ_SEARCH
65 • Bypass file read permission checks and directory read and ex‐
66 ecute permission checks;
67 • invoke open_by_handle_at(2);
68 • use the linkat(2) AT_EMPTY_PATH flag to create a link to a
69 file referred to by a file descriptor.
70
71 CAP_FOWNER
72 • Bypass permission checks on operations that normally require
73 the filesystem UID of the process to match the UID of the
74 file (e.g., chmod(2), utime(2)), excluding those operations
75 covered by CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH;
76 • set inode flags (see ioctl_iflags(2)) on arbitrary files;
77 • set Access Control Lists (ACLs) on arbitrary files;
78 • ignore directory sticky bit on file deletion;
• modify user extended attributes on a sticky directory owned by
  any user;
81 • specify O_NOATIME for arbitrary files in open(2) and fc‐
82 ntl(2).
83
84 CAP_FSETID
85 • Don't clear set-user-ID and set-group-ID mode bits when a
86 file is modified;
87 • set the set-group-ID bit for a file whose GID does not match
88 the filesystem or any of the supplementary GIDs of the call‐
89 ing process.
90
91 CAP_IPC_LOCK
92 • Lock memory (mlock(2), mlockall(2), mmap(2), shmctl(2));
• allocate memory using huge pages (memfd_create(2), mmap(2),
  shmctl(2)).
95
96 CAP_IPC_OWNER
97 Bypass permission checks for operations on System V IPC objects.
98
99 CAP_KILL
100 Bypass permission checks for sending signals (see kill(2)).
101 This includes use of the ioctl(2) KDSIGACCEPT operation.
102
103 CAP_LEASE (since Linux 2.4)
104 Establish leases on arbitrary files (see fcntl(2)).
105
106 CAP_LINUX_IMMUTABLE
107 Set the FS_APPEND_FL and FS_IMMUTABLE_FL inode flags (see
108 ioctl_iflags(2)).
109
110 CAP_MAC_ADMIN (since Linux 2.6.25)
111 Allow MAC configuration or state changes. Implemented for the
112 Smack Linux Security Module (LSM).
113
114 CAP_MAC_OVERRIDE (since Linux 2.6.25)
115 Override Mandatory Access Control (MAC). Implemented for the
116 Smack LSM.
117
118 CAP_MKNOD (since Linux 2.4)
119 Create special files using mknod(2).
120
121 CAP_NET_ADMIN
122 Perform various network-related operations:
123 • interface configuration;
124 • administration of IP firewall, masquerading, and accounting;
125 • modify routing tables;
126 • bind to any address for transparent proxying;
127 • set type-of-service (TOS);
128 • clear driver statistics;
129 • set promiscuous mode;
• enable multicasting;
131 • use setsockopt(2) to set the following socket options: SO_DE‐
132 BUG, SO_MARK, SO_PRIORITY (for a priority outside the range 0
133 to 6), SO_RCVBUFFORCE, and SO_SNDBUFFORCE.
134
135 CAP_NET_BIND_SERVICE
136 Bind a socket to Internet domain privileged ports (port numbers
137 less than 1024).
138
139 CAP_NET_BROADCAST
140 (Unused) Make socket broadcasts, and listen to multicasts.
141
142 CAP_NET_RAW
143 • Use RAW and PACKET sockets;
144 • bind to any address for transparent proxying.
145
146 CAP_PERFMON (since Linux 5.8)
147 Employ various performance-monitoring mechanisms, including:
148
149 • call perf_event_open(2);
150 • employ various BPF operations that have performance implica‐
151 tions.
152
153 This capability was added in Linux 5.8 to separate out perfor‐
154 mance monitoring functionality from the overloaded CAP_SYS_ADMIN
155 capability. See also the kernel source file Documentation/ad‐
156 min-guide/perf-security.rst.
157
158 CAP_SETGID
159 • Make arbitrary manipulations of process GIDs and supplemen‐
160 tary GID list;
161 • forge GID when passing socket credentials via UNIX domain
162 sockets;
163 • write a group ID mapping in a user namespace (see user_name‐
164 spaces(7)).
165
166 CAP_SETFCAP (since Linux 2.6.24)
167 Set arbitrary capabilities on a file.
168
169 Since Linux 5.12, this capability is also needed to map user ID
170 0 in a new user namespace; see user_namespaces(7) for details.
171
172 CAP_SETPCAP
173 If file capabilities are supported (i.e., since Linux 2.6.24):
174 add any capability from the calling thread's bounding set to its
175 inheritable set; drop capabilities from the bounding set (via
176 prctl(2) PR_CAPBSET_DROP); make changes to the securebits flags.
177
178 If file capabilities are not supported (i.e., before Linux
179 2.6.24): grant or remove any capability in the caller's permit‐
180 ted capability set to or from any other process. (This property
181 of CAP_SETPCAP is not available when the kernel is configured to
182 support file capabilities, since CAP_SETPCAP has entirely dif‐
183 ferent semantics for such kernels.)
184
185 CAP_SETUID
186 • Make arbitrary manipulations of process UIDs (setuid(2), se‐
187 treuid(2), setresuid(2), setfsuid(2));
188 • forge UID when passing socket credentials via UNIX domain
189 sockets;
190 • write a user ID mapping in a user namespace (see user_name‐
191 spaces(7)).
192
193 CAP_SYS_ADMIN
194 Note: this capability is overloaded; see Notes to kernel devel‐
195 opers below.
196
197 • Perform a range of system administration operations includ‐
198 ing: quotactl(2), mount(2), umount(2), pivot_root(2),
199 swapon(2), swapoff(2), sethostname(2), and setdomainname(2);
200 • perform privileged syslog(2) operations (since Linux 2.6.37,
201 CAP_SYSLOG should be used to permit such operations);
• perform the VM86_REQUEST_IRQ vm86(2) command;
• access the same checkpoint/restore functionality that is governed
  by CAP_CHECKPOINT_RESTORE (but the latter, weaker capability is
  preferred for accessing that functionality);
• perform the same BPF operations as are governed by CAP_BPF
  (but the latter, weaker capability is preferred for accessing
  that functionality);
• employ the same performance monitoring mechanisms as are governed
  by CAP_PERFMON (but the latter, weaker capability is
  preferred for accessing that functionality);
212 • perform IPC_SET and IPC_RMID operations on arbitrary System V
213 IPC objects;
214 • override RLIMIT_NPROC resource limit;
215 • perform operations on trusted and security extended at‐
216 tributes (see xattr(7));
217 • use lookup_dcookie(2);
218 • use ioprio_set(2) to assign IOPRIO_CLASS_RT and (before Linux
219 2.6.25) IOPRIO_CLASS_IDLE I/O scheduling classes;
220 • forge PID when passing socket credentials via UNIX domain
221 sockets;
222 • exceed /proc/sys/fs/file-max, the system-wide limit on the
223 number of open files, in system calls that open files (e.g.,
224 accept(2), execve(2), open(2), pipe(2));
225 • employ CLONE_* flags that create new namespaces with clone(2)
226 and unshare(2) (but, since Linux 3.8, creating user name‐
227 spaces does not require any capability);
228 • access privileged perf event information;
229 • call setns(2) (requires CAP_SYS_ADMIN in the target name‐
230 space);
231 • call fanotify_init(2);
232 • perform privileged KEYCTL_CHOWN and KEYCTL_SETPERM keyctl(2)
233 operations;
234 • perform madvise(2) MADV_HWPOISON operation;
235 • employ the TIOCSTI ioctl(2) to insert characters into the in‐
236 put queue of a terminal other than the caller's controlling
237 terminal;
238 • employ the obsolete nfsservctl(2) system call;
239 • employ the obsolete bdflush(2) system call;
240 • perform various privileged block-device ioctl(2) operations;
241 • perform various privileged filesystem ioctl(2) operations;
242 • perform privileged ioctl(2) operations on the /dev/random de‐
243 vice (see random(4));
244 • install a seccomp(2) filter without first having to set the
245 no_new_privs thread attribute;
246 • modify allow/deny rules for device control groups;
247 • employ the ptrace(2) PTRACE_SECCOMP_GET_FILTER operation to
248 dump tracee's seccomp filters;
249 • employ the ptrace(2) PTRACE_SETOPTIONS operation to suspend
250 the tracee's seccomp protections (i.e., the PTRACE_O_SUS‐
251 PEND_SECCOMP flag);
252 • perform administrative operations on many device drivers;
253 • modify autogroup nice values by writing to /proc/pid/auto‐
254 group (see sched(7)).
255
256 CAP_SYS_BOOT
257 Use reboot(2) and kexec_load(2).
258
259 CAP_SYS_CHROOT
260 • Use chroot(2);
261 • change mount namespaces using setns(2).
262
263 CAP_SYS_MODULE
264 • Load and unload kernel modules (see init_module(2) and
265 delete_module(2));
266 • before Linux 2.6.25: drop capabilities from the system-wide
267 capability bounding set.
268
269 CAP_SYS_NICE
270 • Lower the process nice value (nice(2), setpriority(2)) and
271 change the nice value for arbitrary processes;
• set real-time scheduling policies for the calling process, and
  set scheduling policies and priorities for arbitrary processes
  (sched_setscheduler(2), sched_setparam(2), sched_setattr(2));
276 • set CPU affinity for arbitrary processes (sched_setaffin‐
277 ity(2));
278 • set I/O scheduling class and priority for arbitrary processes
279 (ioprio_set(2));
280 • apply migrate_pages(2) to arbitrary processes and allow pro‐
281 cesses to be migrated to arbitrary nodes;
282 • apply move_pages(2) to arbitrary processes;
283 • use the MPOL_MF_MOVE_ALL flag with mbind(2) and
284 move_pages(2).
285
286 CAP_SYS_PACCT
287 Use acct(2).
288
289 CAP_SYS_PTRACE
290 • Trace arbitrary processes using ptrace(2);
291 • apply get_robust_list(2) to arbitrary processes;
292 • transfer data to or from the memory of arbitrary processes
293 using process_vm_readv(2) and process_vm_writev(2);
294 • inspect processes using kcmp(2).
295
296 CAP_SYS_RAWIO
297 • Perform I/O port operations (iopl(2) and ioperm(2));
298 • access /proc/kcore;
299 • employ the FIBMAP ioctl(2) operation;
300 • open devices for accessing x86 model-specific registers
301 (MSRs, see msr(4));
302 • update /proc/sys/vm/mmap_min_addr;
303 • create memory mappings at addresses below the value specified
304 by /proc/sys/vm/mmap_min_addr;
305 • map files in /proc/bus/pci;
306 • open /dev/mem and /dev/kmem;
307 • perform various SCSI device commands;
308 • perform certain operations on hpsa(4) and cciss(4) devices;
309 • perform a range of device-specific operations on other de‐
310 vices.
311
312 CAP_SYS_RESOURCE
313 • Use reserved space on ext2 filesystems;
314 • make ioctl(2) calls controlling ext3 journaling;
315 • override disk quota limits;
316 • increase resource limits (see setrlimit(2));
317 • override RLIMIT_NPROC resource limit;
318 • override maximum number of consoles on console allocation;
319 • override maximum number of keymaps;
• allow more than 64 Hz interrupts from the real-time clock;
321 • raise msg_qbytes limit for a System V message queue above the
322 limit in /proc/sys/kernel/msgmnb (see msgop(2) and ms‐
323 gctl(2));
324 • allow the RLIMIT_NOFILE resource limit on the number of "in-
325 flight" file descriptors to be bypassed when passing file de‐
326 scriptors to another process via a UNIX domain socket (see
327 unix(7));
• override the /proc/sys/fs/pipe-max-size limit when setting
  the capacity of a pipe using the F_SETPIPE_SZ fcntl(2) command;
333 • override /proc/sys/fs/mqueue/queues_max,
334 /proc/sys/fs/mqueue/msg_max, and /proc/sys/fs/mqueue/msg‐
335 size_max limits when creating POSIX message queues (see
336 mq_overview(7));
337 • employ the prctl(2) PR_SET_MM operation;
338 • set /proc/pid/oom_score_adj to a value lower than the value
339 last set by a process with CAP_SYS_RESOURCE.
340
341 CAP_SYS_TIME
342 Set system clock (settimeofday(2), stime(2), adjtimex(2)); set
343 real-time (hardware) clock.
344
345 CAP_SYS_TTY_CONFIG
346 Use vhangup(2); employ various privileged ioctl(2) operations on
347 virtual terminals.
348
349 CAP_SYSLOG (since Linux 2.6.37)
350 • Perform privileged syslog(2) operations. See syslog(2) for
351 information on which operations require privilege.
• View kernel addresses exposed via /proc and other interfaces
  when /proc/sys/kernel/kptr_restrict has the value 1. (See
  the discussion of kptr_restrict in proc(5).)
355
356 CAP_WAKE_ALARM (since Linux 3.0)
357 Trigger something that will wake up the system (set CLOCK_REAL‐
358 TIME_ALARM and CLOCK_BOOTTIME_ALARM timers).
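
A capability set is commonly handled as a bit mask with one bit per
capability in the list above. The following is a minimal sketch of
decoding such a mask by name. The bit numbers are an assumption taken
from the kernel header <linux/capability.h>; they are not stated in this
page. The helper names decode_caps and encode_caps are hypothetical.

```python
# Capability names indexed by bit number. The ordering is an assumption
# taken from <linux/capability.h>; this page does not list the numbers.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask):
    """Return the names of the capabilities whose bits are set in mask."""
    names = []
    for bit in range(mask.bit_length()):
        if mask & (1 << bit):
            # Bits beyond the table are reported by number only.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else "cap_%d" % bit)
    return names


def encode_caps(names):
    """Return the bit mask that has exactly the named capabilities set."""
    mask = 0
    for name in names:
        mask |= 1 << CAP_NAMES.index(name)
    return mask
```

For example, decode_caps(encode_caps(["CAP_NET_RAW"])) returns a list
containing only that name.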
359
360 Past and current implementation
361 A full implementation of capabilities requires that:
362
363 • For all privileged operations, the kernel must check whether the
364 thread has the required capability in its effective set.
365
366 • The kernel must provide system calls allowing a thread's capability
367 sets to be changed and retrieved.
368
369 • The filesystem must support attaching capabilities to an executable
370 file, so that a process gains those capabilities when the file is
371 executed.
372
373 Before Linux 2.6.24, only the first two of these requirements are met;
374 since Linux 2.6.24, all three requirements are met.
375
376 Notes to kernel developers
377 When adding a new kernel feature that should be governed by a capabil‐
378 ity, consider the following points.
379
• The goal of capabilities is to divide the power of superuser into
  pieces, such that if a program that has one or more capabilities is
  compromised, its power to do damage to the system would be less than
  the same program running with root privilege.
384
385 • You have the choice of either creating a new capability for your new
386 feature, or associating the feature with one of the existing capa‐
387 bilities. In order to keep the set of capabilities to a manageable
388 size, the latter option is preferable, unless there are compelling
389 reasons to take the former option. (There is also a technical
390 limit: the size of capability sets is currently limited to 64 bits.)
391
392 • To determine which existing capability might best be associated with
393 your new feature, review the list of capabilities above in order to
394 find a "silo" into which your new feature best fits. One approach
395 to take is to determine if there are other features requiring capa‐
396 bilities that will always be used along with the new feature. If
397 the new feature is useless without these other features, you should
398 use the same capability as the other features.
399
400 • Don't choose CAP_SYS_ADMIN if you can possibly avoid it! A vast
401 proportion of existing capability checks are associated with this
402 capability (see the partial list above). It can plausibly be called
403 "the new root", since on the one hand, it confers a wide range of
404 powers, and on the other hand, its broad scope means that this is
405 the capability that is required by many privileged programs. Don't
406 make the problem worse. The only new features that should be asso‐
407 ciated with CAP_SYS_ADMIN are ones that closely match existing uses
408 in that silo.
409
410 • If you have determined that it really is necessary to create a new
411 capability for your feature, don't make or name it as a "single-use"
412 capability. Thus, for example, the addition of the highly specific
413 CAP_SYS_PACCT was probably a mistake. Instead, try to identify and
414 name your new capability as a broader silo into which other related
415 future use cases might fit.
416
417 Thread capability sets
418 Each thread has the following capability sets containing zero or more
419 of the above capabilities:
420
421 Permitted
422 This is a limiting superset for the effective capabilities that
423 the thread may assume. It is also a limiting superset for the
424 capabilities that may be added to the inheritable set by a
425 thread that does not have the CAP_SETPCAP capability in its ef‐
426 fective set.
427
428 If a thread drops a capability from its permitted set, it can
429 never reacquire that capability (unless it execve(2)s either a
430 set-user-ID-root program, or a program whose associated file ca‐
431 pabilities grant that capability).
432
433 Inheritable
434 This is a set of capabilities preserved across an execve(2).
435 Inheritable capabilities remain inheritable when executing any
436 program, and inheritable capabilities are added to the permitted
437 set when executing a program that has the corresponding bits set
438 in the file inheritable set.
439
440 Because inheritable capabilities are not generally preserved
441 across execve(2) when running as a non-root user, applications
442 that wish to run helper programs with elevated capabilities
443 should consider using ambient capabilities, described below.
444
445 Effective
446 This is the set of capabilities used by the kernel to perform
447 permission checks for the thread.
448
449 Bounding (per-thread since Linux 2.6.25)
450 The capability bounding set is a mechanism that can be used to
451 limit the capabilities that are gained during execve(2).
452
453 Since Linux 2.6.25, this is a per-thread capability set. In
454 older kernels, the capability bounding set was a system wide at‐
455 tribute shared by all threads on the system.
456
457 For more details, see Capability bounding set below.
458
459 Ambient (since Linux 4.3)
460 This is a set of capabilities that are preserved across an ex‐
461 ecve(2) of a program that is not privileged. The ambient capa‐
462 bility set obeys the invariant that no capability can ever be
463 ambient if it is not both permitted and inheritable.
464
465 The ambient capability set can be directly modified using
466 prctl(2). Ambient capabilities are automatically lowered if ei‐
467 ther of the corresponding permitted or inheritable capabilities
468 is lowered.
469
470 Executing a program that changes UID or GID due to the set-user-
471 ID or set-group-ID bits or executing a program that has any file
472 capabilities set will clear the ambient set. Ambient capabili‐
473 ties are added to the permitted set and assigned to the effec‐
474 tive set when execve(2) is called. If ambient capabilities
475 cause a process's permitted and effective capabilities to in‐
476 crease during an execve(2), this does not trigger the secure-ex‐
477 ecution mode described in ld.so(8).
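
The ambient-set invariant described above can be modelled on plain bit
masks. This is a sketch only: the function names are hypothetical, and
the model stands in for the kernel's behavior rather than calling
prctl(2).

```python
def raise_ambient(permitted, inheritable, ambient, cap_bit):
    """Model raising one ambient capability.

    The request is refused unless the capability is already both
    permitted and inheritable, which is the invariant stated above.
    """
    bit = 1 << cap_bit
    if not ((permitted & bit) and (inheritable & bit)):
        raise PermissionError("capability is not both permitted and inheritable")
    return ambient | bit


def clamp_ambient(permitted, inheritable, ambient):
    """Model the automatic lowering of ambient capabilities.

    After the permitted or inheritable set is lowered, only ambient
    bits still present in both sets survive.
    """
    return ambient & permitted & inheritable
```

Lowering the inheritable set to empty and then applying clamp_ambient()
yields an empty ambient set, mirroring the automatic lowering described
above.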
478
479 A child created via fork(2) inherits copies of its parent's capability
480 sets. For details on how execve(2) affects capabilities, see Transfor‐
481 mation of capabilities during execve() below.
482
483 Using capset(2), a thread may manipulate its own capability sets; see
484 Programmatically adjusting capability sets below.
485
486 Since Linux 3.2, the file /proc/sys/kernel/cap_last_cap exposes the nu‐
487 merical value of the highest capability supported by the running ker‐
488 nel; this can be used to determine the highest bit that may be set in a
489 capability set.
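
As a small illustration of the last point, the value read from
cap_last_cap can be turned into a mask with every supported capability
bit set. The function names below are hypothetical.

```python
def read_last_cap(path="/proc/sys/kernel/cap_last_cap"):
    """Return the highest supported capability number, or None if the
    file cannot be read (for example, on a kernel older than Linux 3.2)."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def full_capability_mask(last_cap):
    """Return a mask with bits 0..last_cap (inclusive) set."""
    return (1 << (last_cap + 1)) - 1
```

On a kernel whose highest capability is numbered 40, the resulting mask
has 41 bits set.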
490
491 File capabilities
492 Since Linux 2.6.24, the kernel supports associating capability sets
493 with an executable file using setcap(8). The file capability sets are
494 stored in an extended attribute (see setxattr(2) and xattr(7)) named
495 security.capability. Writing to this extended attribute requires the
496 CAP_SETFCAP capability. The file capability sets, in conjunction with
497 the capability sets of the thread, determine the capabilities of a
498 thread after an execve(2).
499
500 The three file capability sets are:
501
502 Permitted (formerly known as forced):
503 These capabilities are automatically permitted to the thread,
504 regardless of the thread's inheritable capabilities.
505
506 Inheritable (formerly known as allowed):
507 This set is ANDed with the thread's inheritable set to determine
508 which inheritable capabilities are enabled in the permitted set
509 of the thread after the execve(2).
510
511 Effective:
512 This is not a set, but rather just a single bit. If this bit is
513 set, then during an execve(2) all of the new permitted capabili‐
514 ties for the thread are also raised in the effective set. If
515 this bit is not set, then after an execve(2), none of the new
516 permitted capabilities is in the new effective set.
517
518 Enabling the file effective capability bit implies that any file
519 permitted or inheritable capability that causes a thread to ac‐
520 quire the corresponding permitted capability during an execve(2)
521 (see Transformation of capabilities during execve() below) will
522 also acquire that capability in its effective set. Therefore,
523 when assigning capabilities to a file (setcap(8),
524 cap_set_file(3), cap_set_fd(3)), if we specify the effective
525 flag as being enabled for any capability, then the effective
526 flag must also be specified as enabled for all other capabili‐
527 ties for which the corresponding permitted or inheritable flag
528 is enabled.
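
The all-or-nothing rule for the effective flag can be checked
mechanically. The sketch below assumes a hypothetical representation in
which each capability name maps to the set of flags ("p", "i", "e")
requested for it; this is not the interface of setcap(8) or libcap.

```python
def effective_flag_consistent(flags):
    """Check the rule for the file effective bit.

    flags maps a capability name to a set drawn from {"p", "i", "e"}.
    Because the file effective "set" is really a single bit, either no
    capability may carry "e", or every capability that carries "p" or
    "i" must carry "e" as well.
    """
    if not any("e" in f for f in flags.values()):
        return True
    return all("e" in f for f in flags.values() if "p" in f or "i" in f)
```

A request that enables the effective flag for one capability but only
the permitted flag for another is rejected by this check.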
529
530 File capability extended attribute versioning
531 To allow extensibility, the kernel supports a scheme to encode a ver‐
532 sion number inside the security.capability extended attribute that is
533 used to implement file capabilities. These version numbers are inter‐
534 nal to the implementation, and not directly visible to user-space ap‐
535 plications. To date, the following versions are supported:
536
537 VFS_CAP_REVISION_1
538 This was the original file capability implementation, which sup‐
539 ported 32-bit masks for file capabilities.
540
541 VFS_CAP_REVISION_2 (since Linux 2.6.25)
542 This version allows for file capability masks that are 64 bits
543 in size, and was necessary as the number of supported capabili‐
544 ties grew beyond 32. The kernel transparently continues to sup‐
545 port the execution of files that have 32-bit version 1 capabil‐
546 ity masks, but when adding capabilities to files that did not
547 previously have capabilities, or modifying the capabilities of
548 existing files, it automatically uses the version 2 scheme (or
549 possibly the version 3 scheme, as described below).
550
551 VFS_CAP_REVISION_3 (since Linux 4.14)
552 Version 3 file capabilities are provided to support namespaced
553 file capabilities (described below).
554
As with version 2 file capabilities, version 3 capability masks
are 64 bits in size. But in addition, the root user ID of the
namespace is encoded in the security.capability extended attribute.
(A namespace's root user ID is the value that user ID 0 inside
that namespace maps to in the initial user namespace.)
560
561 Version 3 file capabilities are designed to coexist with version
562 2 capabilities; that is, on a modern Linux system, there may be
563 some files with version 2 capabilities while others have version
564 3 capabilities.
565
566 Before Linux 4.14, the only kind of file capability extended attribute
567 that could be attached to a file was a VFS_CAP_REVISION_2 attribute.
568 Since Linux 4.14, the version of the security.capability extended at‐
569 tribute that is attached to a file depends on the circumstances in
570 which the attribute was created.
571
572 Starting with Linux 4.14, a security.capability extended attribute is
573 automatically created as (or converted to) a version 3 (VFS_CAP_REVI‐
574 SION_3) attribute if both of the following are true:
575
576 • The thread writing the attribute resides in a noninitial user name‐
577 space. (More precisely: the thread resides in a user namespace
578 other than the one from which the underlying filesystem was
579 mounted.)
580
581 • The thread has the CAP_SETFCAP capability over the file inode, mean‐
582 ing that (a) the thread has the CAP_SETFCAP capability in its own
583 user namespace; and (b) the UID and GID of the file inode have map‐
584 pings in the writer's user namespace.
585
586 When a VFS_CAP_REVISION_3 security.capability extended attribute is
587 created, the root user ID of the creating thread's user namespace is
588 saved in the extended attribute.
589
590 By contrast, creating or modifying a security.capability extended at‐
591 tribute from a privileged (CAP_SETFCAP) thread that resides in the
592 namespace where the underlying filesystem was mounted (this normally
593 means the initial user namespace) automatically results in the creation
594 of a version 2 (VFS_CAP_REVISION_2) attribute.
595
596 Note that the creation of a version 3 security.capability extended at‐
597 tribute is automatic. That is to say, when a user-space application
598 writes (setxattr(2)) a security.capability attribute in the version 2
599 format, the kernel will automatically create a version 3 attribute if
600 the attribute is created in the circumstances described above. Corre‐
601 spondingly, when a version 3 security.capability attribute is retrieved
602 (getxattr(2)) by a process that resides inside a user namespace that
603 was created by the root user ID (or a descendant of that user name‐
604 space), the returned attribute is (automatically) simplified to appear
605 as a version 2 attribute (i.e., the returned value is the size of a
version 2 attribute and does not include the root user ID). These
automatic translations mean that no changes are required to user-space
tools (e.g., setcap(8) and getcap(8)) in order for those tools to be
used to create and retrieve version 3 security.capability attributes.
610
611 Note that a file can have either a version 2 or a version 3 secu‐
612 rity.capability extended attribute associated with it, but not both:
613 creation or modification of the security.capability extended attribute
614 will automatically modify the version according to the circumstances in
615 which the extended attribute is created or modified.
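
For readers who want to inspect the raw attribute, the following sketch
decodes the three revisions. The byte layout (a little-endian 32-bit
magic word, then pairs of 32-bit permitted and inheritable words, then a
32-bit root user ID for version 3) and the constant values are
assumptions taken from the kernel header <linux/capability.h>; this page
does not specify them. The function name is hypothetical.

```python
import struct

# Constants assumed from <linux/capability.h>.
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001
# revision magic -> (version number, 32-bit words per capability set)
REVISIONS = {
    0x01000000: (1, 1),
    0x02000000: (2, 2),
    0x03000000: (3, 2),
}


def decode_security_capability(raw):
    """Decode the bytes of a security.capability extended attribute."""
    if len(raw) < 4:
        raise ValueError("attribute too short")
    (magic_etc,) = struct.unpack_from("<I", raw, 0)
    try:
        version, words = REVISIONS[magic_etc & VFS_CAP_REVISION_MASK]
    except KeyError:
        raise ValueError("unknown revision") from None
    # Version 3 appends one extra 32-bit word: the namespace root user ID.
    size = 4 + 8 * words + (4 if version == 3 else 0)
    if len(raw) != size:
        raise ValueError("unexpected attribute size")
    permitted = inheritable = 0
    for n in range(words):
        p, i = struct.unpack_from("<II", raw, 4 + 8 * n)
        permitted |= p << (32 * n)
        inheritable |= i << (32 * n)
    rootid = None
    if version == 3:
        (rootid,) = struct.unpack_from("<I", raw, 4 + 8 * words)
    return {
        "version": version,
        "effective": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
        "permitted": permitted,
        "inheritable": inheritable,
        "rootid": rootid,
    }
```

Because of the automatic translation described above, a process inside
the relevant user namespace that reads the attribute would be expected
to see the version 2 form, so the rootid field would be None there.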
616
617 Transformation of capabilities during execve()
618 During an execve(2), the kernel calculates the new capabilities of the
619 process using the following algorithm:
620
    P'(ambient)     = (file is privileged) ? 0 : P(ambient)

    P'(permitted)   = (P(inheritable) & F(inheritable)) |
                      (F(permitted) & P(bounding)) | P'(ambient)

    P'(effective)   = F(effective) ? P'(permitted) : P'(ambient)

    P'(inheritable) = P(inheritable)    [i.e., unchanged]

    P'(bounding)    = P(bounding)       [i.e., unchanged]

where:

    P()    denotes the value of a thread capability set before the
           execve(2)

    P'()   denotes the value of a thread capability set after the
           execve(2)

    F()    denotes a file capability set
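
The algorithm above translates directly into bit-mask arithmetic. The
following sketch models it in user space. The function and parameter
names are hypothetical, and the model covers only the equations shown,
not the special cases described elsewhere in this page.

```python
def execve_transform(p_inh, p_bnd, p_amb, f_inh, f_prm, f_eff, file_is_privileged):
    """Apply the execve(2) capability transformation to integer masks.

    p_* are the thread sets before the call, f_inh and f_prm are the
    file sets, f_eff is the file effective bit (a boolean), and
    file_is_privileged says whether the file has capabilities or a
    set-user-ID or set-group-ID bit.
    """
    new_amb = 0 if file_is_privileged else p_amb
    new_prm = (p_inh & f_inh) | (f_prm & p_bnd) | new_amb
    new_eff = new_prm if f_eff else new_amb
    return {
        "ambient": new_amb,
        "permitted": new_prm,
        "effective": new_eff,
        "inheritable": p_inh,  # unchanged
        "bounding": p_bnd,     # unchanged
    }
```

For instance, a thread with empty sets and a full bounding set that
executes a file whose permitted set holds one capability and whose
effective bit is enabled ends up with that capability in both its
permitted and effective sets.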
641
642 Note the following details relating to the above capability transforma‐
643 tion rules:
644
645 • The ambient capability set is present only since Linux 4.3. When
646 determining the transformation of the ambient set during execve(2),
647 a privileged file is one that has capabilities or has the set-user-
648 ID or set-group-ID bit set.
649
650 • Prior to Linux 2.6.25, the bounding set was a system-wide attribute
651 shared by all threads. That system-wide value was employed to cal‐
652 culate the new permitted set during execve(2) in the same manner as
653 shown above for P(bounding).
654
655 Note: during the capability transitions described above, file capabili‐
656 ties may be ignored (treated as empty) for the same reasons that the
657 set-user-ID and set-group-ID bits are ignored; see execve(2). File ca‐
658 pabilities are similarly ignored if the kernel was booted with the
659 no_file_caps option.
660
661 Note: according to the rules above, if a process with nonzero user IDs
662 performs an execve(2) then any capabilities that are present in its
663 permitted and effective sets will be cleared. For the treatment of ca‐
664 pabilities when a process with a user ID of zero performs an execve(2),
665 see Capabilities and execution of programs by root below.
666
667 Safety checking for capability-dumb binaries
668 A capability-dumb binary is an application that has been marked to have
669 file capabilities, but has not been converted to use the libcap(3) API
670 to manipulate its capabilities. (In other words, this is a traditional
671 set-user-ID-root program that has been switched to use file capabili‐
672 ties, but whose code has not been modified to understand capabilities.)
673 For such applications, the effective capability bit is set on the file,
674 so that the file permitted capabilities are automatically enabled in
675 the process effective set when executing the file. The kernel recog‐
676 nizes a file which has the effective capability bit set as capability-
677 dumb for the purpose of the check described here.
678
679 When executing a capability-dumb binary, the kernel checks if the
680 process obtained all permitted capabilities that were specified in the
681 file permitted set, after the capability transformations described
682 above have been performed. (The typical reason why this might not oc‐
683 cur is that the capability bounding set masked out some of the capabil‐
684 ities in the file permitted set.) If the process did not obtain the
685 full set of file permitted capabilities, then execve(2) fails with the
686 error EPERM. This prevents possible security risks that could arise
687 when a capability-dumb application is executed with less privilege than
688 it needs. Note that, by definition, the application could not itself
689 recognize this problem, since it does not employ the libcap(3) API.
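
The check described above amounts to a single mask comparison. The
sketch below models it; the function name is hypothetical, and the
return convention (0 or errno.EPERM) is chosen for illustration only.

```python
import errno


def check_capability_dumb(f_prm, f_eff, new_permitted):
    """Model the safety check for capability-dumb binaries.

    f_prm is the file permitted set, f_eff the file effective bit, and
    new_permitted the thread's permitted set after the transformation.
    Returns errno.EPERM if the effective bit is set and some file
    permitted capability was not obtained, otherwise 0.
    """
    if f_eff and (f_prm & ~new_permitted):
        return errno.EPERM
    return 0
```

If the bounding set masks out one of two file permitted capabilities,
the model reports EPERM, matching the failure of execve(2) described
above.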

Capabilities and execution of programs by root
In order to mirror traditional UNIX semantics, the kernel performs
special treatment of file capabilities when a process with UID 0 (root)
executes a program and when a set-user-ID-root program is executed.

After having performed any changes to the process effective ID that
were triggered by the set-user-ID mode bit of the binary (e.g.,
switching the effective user ID to 0 (root) because a set-user-ID-root
program was executed), the kernel calculates the file capability sets
as follows:

(1) If the real or effective user ID of the process is 0 (root), then
    the file inheritable and permitted sets are ignored; instead they
    are notionally considered to be all ones (i.e., all capabilities
    enabled). (There is one exception to this behavior, described in
    Set-user-ID-root programs that have file capabilities below.)

(2) If the effective user ID of the process is 0 (root) or the file
    effective bit is in fact enabled, then the file effective bit is
    notionally defined to be one (enabled).

These notional values for the file's capability sets are then used as
described above to calculate the transformation of the process's
capabilities during execve(2).

Thus, when a process with nonzero UIDs execve(2)s a set-user-ID-root
program that does not have capabilities attached, or when a process
whose real and effective UIDs are zero execve(2)s a program, the
calculation of the process's new permitted and effective capabilities
simplifies to:

    P'(permitted) = P(inheritable) | P(bounding)

    P'(effective) = P'(permitted)

Consequently, the process gains all capabilities in its permitted and
effective capability sets, except those masked out by the capability
bounding set. (In the calculation of P'(permitted), the P'(ambient)
term can be simplified away because it is by definition a subset of
P(inheritable).)

The special treatments of user ID 0 (root) described in this subsection
can be disabled using the securebits mechanism described below.

Set-user-ID-root programs that have file capabilities
There is one exception to the behavior described in Capabilities and
execution of programs by root above. If (a) the binary that is being
executed has capabilities attached and (b) the real user ID of the
process is not 0 (root) and (c) the effective user ID of the process is
0 (root), then the file capability bits are honored (i.e., they are not
notionally considered to be all ones). The usual way in which this
situation can arise is when executing a set-user-ID-root program that
also has file capabilities. When such a program is executed, the
process gains just the capabilities granted by the program (i.e., not
all capabilities, as would occur when executing a set-user-ID-root
program that does not have any associated file capabilities).

Note that one can assign empty capability sets to a program file, and
thus it is possible to create a set-user-ID-root program that changes
the effective and saved set-user-ID of the process that executes the
program to 0, but confers no capabilities to that process.

Capability bounding set
The capability bounding set is a security mechanism that can be used to
limit the capabilities that can be gained during an execve(2). The
bounding set is used in the following ways:

• During an execve(2), the capability bounding set is ANDed with the
  file permitted capability set, and the result of this operation is
  assigned to the thread's permitted capability set. The capability
  bounding set thus places a limit on the permitted capabilities that
  may be granted by an executable file.

• (Since Linux 2.6.25) The capability bounding set acts as a limiting
  superset for the capabilities that a thread can add to its
  inheritable set using capset(2). This means that if a capability is
  not in the bounding set, then a thread can't add this capability to
  its inheritable set, even if it was in its permitted capabilities,
  and thereby cannot have this capability preserved in its permitted
  set when it execve(2)s a file that has the capability in its
  inheritable set.

Note that the bounding set masks the file permitted capabilities, but
not the inheritable capabilities. If a thread maintains a capability in
its inheritable set that is not in its bounding set, then it can still
gain that capability in its permitted set by executing a file that has
the capability in its inheritable set.

Depending on the kernel version, the capability bounding set is either
a system-wide attribute, or a per-process attribute.

Capability bounding set from Linux 2.6.25 onward

From Linux 2.6.25, the capability bounding set is a per-thread
attribute. (The system-wide capability bounding set described below no
longer exists.)

The bounding set is inherited at fork(2) from the thread's parent, and
is preserved across an execve(2).

A thread may remove capabilities from its capability bounding set using
the prctl(2) PR_CAPBSET_DROP operation, provided it has the CAP_SETPCAP
capability. Once a capability has been dropped from the bounding set,
it cannot be restored to that set. A thread can determine if a
capability is in its bounding set using the prctl(2) PR_CAPBSET_READ
operation.

Removing capabilities from the bounding set is supported only if file
capabilities are compiled into the kernel. Before Linux 2.6.33, file
capabilities were an optional feature configurable via the
CONFIG_SECURITY_FILE_CAPABILITIES option. Since Linux 2.6.33, the
configuration option has been removed and file capabilities are always
part of the kernel. When file capabilities are compiled into the
kernel, the init process (the ancestor of all processes) begins with a
full bounding set. If file capabilities are not compiled into the
kernel, then init begins with a full bounding set minus CAP_SETPCAP,
because this capability has a different meaning when there are no file
capabilities.

Removing a capability from the bounding set does not remove it from the
thread's inheritable set. However, it does prevent the capability from
being added back into the thread's inheritable set in the future.

Capability bounding set prior to Linux 2.6.25

Before Linux 2.6.25, the capability bounding set is a system-wide
attribute that affects all threads on the system. The bounding set is
accessible via the file /proc/sys/kernel/cap-bound. (Confusingly, this
bit mask parameter is expressed as a signed decimal number in
/proc/sys/kernel/cap-bound.)

Only the init process may set capabilities in the capability bounding
set; other than that, the superuser (more precisely: a process with the
CAP_SYS_MODULE capability) may only clear capabilities from this set.

On a standard system the capability bounding set always masks out the
CAP_SETPCAP capability. To remove this restriction (dangerous!), modify
the definition of CAP_INIT_EFF_SET in include/linux/capability.h and
rebuild the kernel.

The system-wide capability bounding set feature was added to Linux
2.2.11.

Effect of user ID changes on capabilities
To preserve the traditional semantics for transitions between 0 and
nonzero user IDs, the kernel makes the following changes to a thread's
capability sets on changes to the thread's real, effective, saved set,
and filesystem user IDs (using setuid(2), setresuid(2), or similar):

• If one or more of the real, effective, or saved set user IDs was
  previously 0, and as a result of the UID changes all of these IDs
  have a nonzero value, then all capabilities are cleared from the
  permitted, effective, and ambient capability sets.

• If the effective user ID is changed from 0 to nonzero, then all
  capabilities are cleared from the effective set.

• If the effective user ID is changed from nonzero to 0, then the
  permitted set is copied to the effective set.

• If the filesystem user ID is changed from 0 to nonzero (see
  setfsuid(2)), then the following capabilities are cleared from the
  effective set: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH,
  CAP_FOWNER, CAP_FSETID, CAP_LINUX_IMMUTABLE (since Linux 2.6.30),
  CAP_MAC_OVERRIDE, and CAP_MKNOD (since Linux 2.6.30). If the
  filesystem UID is changed from nonzero to 0, then any of these
  capabilities that are enabled in the permitted set are enabled in the
  effective set.

If a thread that has a 0 value for one or more of its user IDs wants to
prevent its permitted capability set from being cleared when it resets
all of its user IDs to nonzero values, it can do so using the
SECBIT_KEEP_CAPS securebits flag described below.

Programmatically adjusting capability sets
A thread can retrieve and change its permitted, effective, and
inheritable capability sets using the capget(2) and capset(2) system
calls. However, the use of cap_get_proc(3) and cap_set_proc(3), both
provided in the libcap package, is preferred for this purpose. The
following rules govern changes to the thread capability sets:

• If the caller does not have the CAP_SETPCAP capability, the new
  inheritable set must be a subset of the combination of the existing
  inheritable and permitted sets.

• (Since Linux 2.6.25) The new inheritable set must be a subset of the
  combination of the existing inheritable set and the capability
  bounding set.

• The new permitted set must be a subset of the existing permitted set
  (i.e., it is not possible to acquire permitted capabilities that the
  thread does not currently have).

• The new effective set must be a subset of the new permitted set.

The securebits flags: establishing a capabilities-only environment
Starting with Linux 2.6.26, and with a kernel in which file
capabilities are enabled, Linux implements a set of per-thread
securebits flags that can be used to disable special handling of
capabilities for UID 0 (root). These flags are as follows:

SECBIT_KEEP_CAPS
    Setting this flag allows a thread that has one or more 0 UIDs to
    retain capabilities in its permitted set when it switches all of
    its UIDs to nonzero values. If this flag is not set, then such a
    UID switch causes the thread to lose all permitted capabilities.
    This flag is always cleared on an execve(2).

    Note that even with the SECBIT_KEEP_CAPS flag set, the effective
    capabilities of a thread are cleared when it switches its effective
    UID to a nonzero value. However, if the thread has set this flag
    and its effective UID is already nonzero, and the thread
    subsequently switches all other UIDs to nonzero values, then the
    effective capabilities will not be cleared.

    The setting of the SECBIT_KEEP_CAPS flag is ignored if the
    SECBIT_NO_SETUID_FIXUP flag is set. (The latter flag provides a
    superset of the effect of the former flag.)

    This flag provides the same functionality as the older prctl(2)
    PR_SET_KEEPCAPS operation.

SECBIT_NO_SETUID_FIXUP
    Setting this flag stops the kernel from adjusting the thread's
    permitted, effective, and ambient capability sets when the thread's
    effective and filesystem UIDs are switched between zero and nonzero
    values. See Effect of user ID changes on capabilities above.

SECBIT_NOROOT
    If this bit is set, then the kernel does not grant capabilities
    when a set-user-ID-root program is executed, or when a process with
    an effective or real UID of 0 calls execve(2). (See Capabilities
    and execution of programs by root above.)

SECBIT_NO_CAP_AMBIENT_RAISE
    Setting this flag disallows raising ambient capabilities via the
    prctl(2) PR_CAP_AMBIENT_RAISE operation.

Each of the above "base" flags has a companion "locked" flag. Setting
any of the "locked" flags is irreversible, and has the effect of
preventing further changes to the corresponding "base" flag. The locked
flags are: SECBIT_KEEP_CAPS_LOCKED, SECBIT_NO_SETUID_FIXUP_LOCKED,
SECBIT_NOROOT_LOCKED, and SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED.

The securebits flags can be modified and retrieved using the prctl(2)
PR_SET_SECUREBITS and PR_GET_SECUREBITS operations. The CAP_SETPCAP
capability is required to modify the flags. Note that the SECBIT_*
constants are available only after including the <linux/securebits.h>
header file.

The securebits flags are inherited by child processes. During an
execve(2), all of the flags are preserved, except SECBIT_KEEP_CAPS
which is always cleared.

An application can use the following call to lock itself, and all of
its descendants, into an environment where the only way of gaining
capabilities is by executing a program with associated file
capabilities:

    prctl(PR_SET_SECUREBITS,
            /* SECBIT_KEEP_CAPS off */
            SECBIT_KEEP_CAPS_LOCKED |
            SECBIT_NO_SETUID_FIXUP |
            SECBIT_NO_SETUID_FIXUP_LOCKED |
            SECBIT_NOROOT |
            SECBIT_NOROOT_LOCKED);
            /* Setting/locking SECBIT_NO_CAP_AMBIENT_RAISE
               is not required */

Per-user-namespace "set-user-ID-root" programs
A set-user-ID program whose UID matches the UID that created a user
namespace will confer capabilities in the process's permitted and
effective sets when executed by any process inside that namespace or
any descendant user namespace.

The rules about the transformation of the process's capabilities during
the execve(2) are exactly as described in Transformation of
capabilities during execve() and Capabilities and execution of programs
by root above, with the difference that, in the latter subsection,
"root" is the UID of the creator of the user namespace.

Namespaced file capabilities
Traditional (i.e., version 2) file capabilities associate only a set of
capability masks with a binary executable file. When a process executes
a binary with such capabilities, it gains the associated capabilities
(within its user namespace) as per the rules described in
Transformation of capabilities during execve() above.

Because version 2 file capabilities confer capabilities to the
executing process regardless of which user namespace it resides in,
only privileged processes are permitted to associate capabilities with
a file. Here, "privileged" means a process that has the CAP_SETFCAP
capability in the user namespace where the filesystem was mounted
(normally the initial user namespace). This limitation renders file
capabilities useless for certain use cases. For example, in
user-namespaced containers, it can be desirable to be able to create a
binary that confers capabilities only to processes executed inside that
container, but not to processes that are executed outside the
container.

Linux 4.14 added so-called namespaced file capabilities to support such
use cases. Namespaced file capabilities are recorded as version 3
(i.e., VFS_CAP_REVISION_3) security.capability extended attributes.
Such an attribute is automatically created in the circumstances
described in File capability extended attribute versioning above. When
a version 3 security.capability extended attribute is created, the
kernel records not just the capability masks in the extended attribute,
but also the namespace root user ID.

As with a binary that has VFS_CAP_REVISION_2 file capabilities, a
binary with VFS_CAP_REVISION_3 file capabilities confers capabilities
to a process during execve(). However, capabilities are conferred only
if the binary is executed by a process that resides in a user namespace
whose UID 0 maps to the root user ID that is saved in the extended
attribute, or when executed by a process that resides in a descendant
of such a namespace.

Interaction with user namespaces
For further information on the interaction of capabilities and user
namespaces, see user_namespaces(7).

STANDARDS
No standards govern capabilities, but the Linux capability
implementation is based on the withdrawn POSIX.1e draft standard
⟨https://archive.org/details/posix_1003.1e-990310⟩.

NOTES
When attempting to strace(1) binaries that have capabilities (or
set-user-ID-root binaries), you may find the -u <username> option
useful. Something like:

    $ sudo strace -o trace.log -u ceci ./myprivprog

From Linux 2.5.27 to Linux 2.6.26, capabilities were an optional kernel
component, and could be enabled/disabled via the
CONFIG_SECURITY_CAPABILITIES kernel configuration option.

The /proc/pid/task/TID/status file can be used to view the capability
sets of a thread. The /proc/pid/status file shows the capability sets
of a process's main thread. Before Linux 3.8, nonexistent capabilities
were shown as being enabled (1) in these sets. Since Linux 3.8, all
nonexistent capabilities (above CAP_LAST_CAP) are shown as disabled
(0).

The libcap package provides a suite of routines for setting and getting
capabilities that is more comfortable and less likely to change than
the interface provided by capset(2) and capget(2). This package also
provides the setcap(8) and getcap(8) programs. It can be found at
⟨https://git.kernel.org/pub/scm/libs/libcap/libcap.git/refs/⟩.

Before Linux 2.6.24, and from Linux 2.6.24 to Linux 2.6.32 if file
capabilities are not enabled, a thread with the CAP_SETPCAP capability
can manipulate the capabilities of threads other than itself. However,
this is only theoretically possible, since no thread ever has
CAP_SETPCAP in either of these cases:

• In the pre-2.6.25 implementation the system-wide capability bounding
  set, /proc/sys/kernel/cap-bound, always masks out the CAP_SETPCAP
  capability, and this cannot be changed without modifying the kernel
  source and rebuilding the kernel.

• If file capabilities are disabled (i.e., the kernel
  CONFIG_SECURITY_FILE_CAPABILITIES option is disabled), then init
  starts out with the CAP_SETPCAP capability removed from its
  per-process bounding set, and that bounding set is inherited by all
  other processes created on the system.

SEE ALSO
capsh(1), setpriv(1), prctl(2), setfsuid(2), cap_clear(3),
cap_copy_ext(3), cap_from_text(3), cap_get_file(3), cap_get_proc(3),
cap_init(3), capgetp(3), capsetp(3), libcap(3), proc(5),
credentials(7), pthreads(7), user_namespaces(7), captest(8),
filecap(8), getcap(8), getpcaps(8), netcap(8), pscap(8), setcap(8)

include/linux/capability.h in the Linux kernel source tree

Linux man-pages 6.05              2023-05-03              Capabilities(7)