CAPABILITIES(7)          Linux Programmer's Manual          CAPABILITIES(7)
2
3
4
NAME
6 capabilities - overview of Linux capabilities
7
DESCRIPTION
9 For the purpose of performing permission checks, traditional UNIX im‐
10 plementations distinguish two categories of processes: privileged pro‐
11 cesses (whose effective user ID is 0, referred to as superuser or
12 root), and unprivileged processes (whose effective UID is nonzero).
13 Privileged processes bypass all kernel permission checks, while unpriv‐
14 ileged processes are subject to full permission checking based on the
15 process's credentials (usually: effective UID, effective GID, and sup‐
16 plementary group list).
17
18 Starting with kernel 2.2, Linux divides the privileges traditionally
19 associated with superuser into distinct units, known as capabilities,
20 which can be independently enabled and disabled. Capabilities are a
21 per-thread attribute.
22
23 Capabilities list
24 The following list shows the capabilities implemented on Linux, and the
25 operations or behaviors that each capability permits:
26
27 CAP_AUDIT_CONTROL (since Linux 2.6.11)
28 Enable and disable kernel auditing; change auditing filter
29 rules; retrieve auditing status and filtering rules.
30
31 CAP_AUDIT_READ (since Linux 3.16)
32 Allow reading the audit log via a multicast netlink socket.
33
34 CAP_AUDIT_WRITE (since Linux 2.6.11)
35 Write records to kernel auditing log.
36
37 CAP_BLOCK_SUSPEND (since Linux 3.5)
38 Employ features that can block system suspend (epoll(7) EPOLL‐
39 WAKEUP, /proc/sys/wake_lock).
40
41 CAP_BPF (since Linux 5.8)
42 Employ privileged BPF operations; see bpf(2) and bpf-helpers(7).
43
44 This capability was added in Linux 5.8 to separate out BPF func‐
45 tionality from the overloaded CAP_SYS_ADMIN capability.
46
47 CAP_CHECKPOINT_RESTORE (since Linux 5.9)
48 * Update /proc/sys/kernel/ns_last_pid (see pid_namespaces(7));
49 * employ the set_tid feature of clone3(2);
50 * read the contents of the symbolic links in
51 /proc/[pid]/map_files for other processes.
52
53 This capability was added in Linux 5.9 to separate out check‐
54 point/restore functionality from the overloaded CAP_SYS_ADMIN
55 capability.
56
57 CAP_CHOWN
58 Make arbitrary changes to file UIDs and GIDs (see chown(2)).
59
60 CAP_DAC_OVERRIDE
61 Bypass file read, write, and execute permission checks. (DAC is
62 an abbreviation of "discretionary access control".)
63
64 CAP_DAC_READ_SEARCH
65 * Bypass file read permission checks and directory read and exe‐
66 cute permission checks;
67 * invoke open_by_handle_at(2);
68 * use the linkat(2) AT_EMPTY_PATH flag to create a link to a
69 file referred to by a file descriptor.
70
71 CAP_FOWNER
72 * Bypass permission checks on operations that normally require
73 the filesystem UID of the process to match the UID of the file
74 (e.g., chmod(2), utime(2)), excluding those operations covered
75 by CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH;
76 * set inode flags (see ioctl_iflags(2)) on arbitrary files;
77 * set Access Control Lists (ACLs) on arbitrary files;
78 * ignore directory sticky bit on file deletion;
79 * modify user extended attributes on sticky directory owned by
80 any user;
81 * specify O_NOATIME for arbitrary files in open(2) and fcntl(2).
82
83 CAP_FSETID
84 * Don't clear set-user-ID and set-group-ID mode bits when a file
85 is modified;
* set the set-group-ID bit for a file whose GID does not match
the filesystem GID or any of the supplementary GIDs of the
calling process.
89
CAP_IPC_LOCK
* Lock memory (mlock(2), mlockall(2), mmap(2), shmctl(2));
* allocate memory using huge pages (memfd_create(2), mmap(2),
shmctl(2)).

CAP_IPC_OWNER
Bypass permission checks for operations on System V IPC objects.

CAP_KILL
Bypass permission checks for sending signals (see kill(2)).
This includes use of the ioctl(2) KDSIGACCEPT operation.

CAP_LEASE (since Linux 2.4)
Establish leases on arbitrary files (see fcntl(2)).

CAP_LINUX_IMMUTABLE
Set the FS_APPEND_FL and FS_IMMUTABLE_FL inode flags (see
ioctl_iflags(2)).

CAP_MAC_ADMIN (since Linux 2.6.25)
Allow MAC configuration or state changes. Implemented for the
Smack Linux Security Module (LSM).

CAP_MAC_OVERRIDE (since Linux 2.6.25)
Override Mandatory Access Control (MAC). Implemented for the
Smack LSM.

CAP_MKNOD (since Linux 2.4)
Create special files using mknod(2).

CAP_NET_ADMIN
Perform various network-related operations:
* configure interfaces;
* administer IP firewall, masquerading, and accounting;
* modify routing tables;
* bind to any address for transparent proxying;
* set type-of-service (TOS);
* clear driver statistics;
* set promiscuous mode;
* enable multicasting;
* use setsockopt(2) to set the following socket options:
SO_DEBUG, SO_MARK, SO_PRIORITY (for a priority outside the
range 0 to 6), SO_RCVBUFFORCE, and SO_SNDBUFFORCE.
125
126 CAP_NET_BIND_SERVICE
127 Bind a socket to Internet domain privileged ports (port numbers
128 less than 1024).
129
130 CAP_NET_BROADCAST
131 (Unused) Make socket broadcasts, and listen to multicasts.
132
133 CAP_NET_RAW
134 * Use RAW and PACKET sockets;
135 * bind to any address for transparent proxying.
136
137 CAP_PERFMON (since Linux 5.8)
138 Employ various performance-monitoring mechanisms, including:
139
140 * call perf_event_open(2);
141 * employ various BPF operations that have performance implica‐
142 tions.
143
144 This capability was added in Linux 5.8 to separate out perfor‐
145 mance monitoring functionality from the overloaded CAP_SYS_ADMIN
146 capability. See also the kernel source file Documentation/ad‐
147 min-guide/perf-security.rst.
148
149 CAP_SETGID
150 * Make arbitrary manipulations of process GIDs and supplementary
151 GID list;
152 * forge GID when passing socket credentials via UNIX domain
153 sockets;
154 * write a group ID mapping in a user namespace (see user_name‐
155 spaces(7)).
156
157 CAP_SETFCAP (since Linux 2.6.24)
158 Set arbitrary capabilities on a file.
159
160 Since Linux 5.12, this capability is also needed to map user ID
161 0 in a new user namespace; see user_namespaces(7) for details.
162
163 CAP_SETPCAP
164 If file capabilities are supported (i.e., since Linux 2.6.24):
165 add any capability from the calling thread's bounding set to its
166 inheritable set; drop capabilities from the bounding set (via
167 prctl(2) PR_CAPBSET_DROP); make changes to the securebits flags.
168
169 If file capabilities are not supported (i.e., kernels before
170 Linux 2.6.24): grant or remove any capability in the caller's
171 permitted capability set to or from any other process. (This
172 property of CAP_SETPCAP is not available when the kernel is con‐
173 figured to support file capabilities, since CAP_SETPCAP has en‐
174 tirely different semantics for such kernels.)
175
176 CAP_SETUID
177 * Make arbitrary manipulations of process UIDs (setuid(2), se‐
178 treuid(2), setresuid(2), setfsuid(2));
179 * forge UID when passing socket credentials via UNIX domain
180 sockets;
181 * write a user ID mapping in a user namespace (see user_name‐
182 spaces(7)).
183
184 CAP_SYS_ADMIN
185 Note: this capability is overloaded; see Notes to kernel devel‐
186 opers, below.
187
188 * Perform a range of system administration operations including:
189 quotactl(2), mount(2), umount(2), pivot_root(2), swapon(2),
190 swapoff(2), sethostname(2), and setdomainname(2);
191 * perform privileged syslog(2) operations (since Linux 2.6.37,
192 CAP_SYSLOG should be used to permit such operations);
193 * perform VM86_REQUEST_IRQ vm86(2) command;
* access the same checkpoint/restore functionality that is
governed by CAP_CHECKPOINT_RESTORE (but the latter, weaker
capability is preferred for accessing that functionality);
* perform the same BPF operations as are governed by CAP_BPF
(but the latter, weaker capability is preferred for accessing
that functionality);
* employ the same performance monitoring mechanisms as are
governed by CAP_PERFMON (but the latter, weaker capability is
preferred for accessing that functionality);
203 * perform IPC_SET and IPC_RMID operations on arbitrary System V
204 IPC objects;
205 * override RLIMIT_NPROC resource limit;
206 * perform operations on trusted and security extended attributes
207 (see xattr(7));
208 * use lookup_dcookie(2);
209 * use ioprio_set(2) to assign IOPRIO_CLASS_RT and (before Linux
210 2.6.25) IOPRIO_CLASS_IDLE I/O scheduling classes;
211 * forge PID when passing socket credentials via UNIX domain
212 sockets;
213 * exceed /proc/sys/fs/file-max, the system-wide limit on the
214 number of open files, in system calls that open files (e.g.,
215 accept(2), execve(2), open(2), pipe(2));
216 * employ CLONE_* flags that create new namespaces with clone(2)
217 and unshare(2) (but, since Linux 3.8, creating user namespaces
218 does not require any capability);
219 * access privileged perf event information;
220 * call setns(2) (requires CAP_SYS_ADMIN in the target name‐
221 space);
222 * call fanotify_init(2);
223 * perform privileged KEYCTL_CHOWN and KEYCTL_SETPERM keyctl(2)
224 operations;
225 * perform madvise(2) MADV_HWPOISON operation;
226 * employ the TIOCSTI ioctl(2) to insert characters into the in‐
227 put queue of a terminal other than the caller's controlling
228 terminal;
229 * employ the obsolete nfsservctl(2) system call;
230 * employ the obsolete bdflush(2) system call;
231 * perform various privileged block-device ioctl(2) operations;
232 * perform various privileged filesystem ioctl(2) operations;
233 * perform privileged ioctl(2) operations on the /dev/random de‐
234 vice (see random(4));
235 * install a seccomp(2) filter without first having to set the
236 no_new_privs thread attribute;
237 * modify allow/deny rules for device control groups;
238 * employ the ptrace(2) PTRACE_SECCOMP_GET_FILTER operation to
239 dump tracee's seccomp filters;
240 * employ the ptrace(2) PTRACE_SETOPTIONS operation to suspend
241 the tracee's seccomp protections (i.e., the PTRACE_O_SUS‐
242 PEND_SECCOMP flag);
243 * perform administrative operations on many device drivers;
244 * modify autogroup nice values by writing to /proc/[pid]/auto‐
245 group (see sched(7)).
246
247 CAP_SYS_BOOT
248 Use reboot(2) and kexec_load(2).
249
250 CAP_SYS_CHROOT
251 * Use chroot(2);
252 * change mount namespaces using setns(2).
253
254 CAP_SYS_MODULE
255 * Load and unload kernel modules (see init_module(2) and
256 delete_module(2));
257 * in kernels before 2.6.25: drop capabilities from the system-
258 wide capability bounding set.
259
260 CAP_SYS_NICE
261 * Lower the process nice value (nice(2), setpriority(2)) and
262 change the nice value for arbitrary processes;
263 * set real-time scheduling policies for calling process, and set
264 scheduling policies and priorities for arbitrary processes
265 (sched_setscheduler(2), sched_setparam(2), sched_setattr(2));
266 * set CPU affinity for arbitrary processes (sched_setaffin‐
267 ity(2));
268 * set I/O scheduling class and priority for arbitrary processes
269 (ioprio_set(2));
270 * apply migrate_pages(2) to arbitrary processes and allow pro‐
271 cesses to be migrated to arbitrary nodes;
272 * apply move_pages(2) to arbitrary processes;
273 * use the MPOL_MF_MOVE_ALL flag with mbind(2) and move_pages(2).
274
275 CAP_SYS_PACCT
276 Use acct(2).
277
278 CAP_SYS_PTRACE
279 * Trace arbitrary processes using ptrace(2);
280 * apply get_robust_list(2) to arbitrary processes;
281 * transfer data to or from the memory of arbitrary processes us‐
282 ing process_vm_readv(2) and process_vm_writev(2);
283 * inspect processes using kcmp(2).
284
285 CAP_SYS_RAWIO
286 * Perform I/O port operations (iopl(2) and ioperm(2));
287 * access /proc/kcore;
288 * employ the FIBMAP ioctl(2) operation;
289 * open devices for accessing x86 model-specific registers (MSRs,
290 see msr(4));
291 * update /proc/sys/vm/mmap_min_addr;
292 * create memory mappings at addresses below the value specified
293 by /proc/sys/vm/mmap_min_addr;
294 * map files in /proc/bus/pci;
295 * open /dev/mem and /dev/kmem;
296 * perform various SCSI device commands;
297 * perform certain operations on hpsa(4) and cciss(4) devices;
298 * perform a range of device-specific operations on other de‐
299 vices.
300
301 CAP_SYS_RESOURCE
302 * Use reserved space on ext2 filesystems;
303 * make ioctl(2) calls controlling ext3 journaling;
304 * override disk quota limits;
305 * increase resource limits (see setrlimit(2));
306 * override RLIMIT_NPROC resource limit;
307 * override maximum number of consoles on console allocation;
308 * override maximum number of keymaps;
* allow more than 64 Hz interrupts from the real-time clock;
310 * raise msg_qbytes limit for a System V message queue above the
311 limit in /proc/sys/kernel/msgmnb (see msgop(2) and msgctl(2));
312 * allow the RLIMIT_NOFILE resource limit on the number of "in-
313 flight" file descriptors to be bypassed when passing file de‐
314 scriptors to another process via a UNIX domain socket (see
315 unix(7));
* use the F_SETPIPE_SZ fcntl(2) command to increase the capacity
of a pipe above the limit specified by
/proc/sys/fs/pipe-max-size;
320 * override /proc/sys/fs/mqueue/queues_max,
321 /proc/sys/fs/mqueue/msg_max, and /proc/sys/fs/mqueue/msg‐
322 size_max limits when creating POSIX message queues (see
323 mq_overview(7));
324 * employ the prctl(2) PR_SET_MM operation;
325 * set /proc/[pid]/oom_score_adj to a value lower than the value
326 last set by a process with CAP_SYS_RESOURCE.
327
328 CAP_SYS_TIME
329 Set system clock (settimeofday(2), stime(2), adjtimex(2)); set
330 real-time (hardware) clock.
331
332 CAP_SYS_TTY_CONFIG
333 Use vhangup(2); employ various privileged ioctl(2) operations on
334 virtual terminals.
335
336 CAP_SYSLOG (since Linux 2.6.37)
337 * Perform privileged syslog(2) operations. See syslog(2) for
338 information on which operations require privilege.
* View kernel addresses exposed via /proc and other interfaces
when /proc/sys/kernel/kptr_restrict has the value 1. (See the
discussion of kptr_restrict in proc(5).)
342
343 CAP_WAKE_ALARM (since Linux 3.0)
344 Trigger something that will wake up the system (set CLOCK_REAL‐
345 TIME_ALARM and CLOCK_BOOTTIME_ALARM timers).
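
A capability set is normally handled as a bit mask in which each capability occupies one bit. The following minimal sketch decodes such a mask into the names listed above. The bit numbers are an assumption taken from the kernel header <linux/capability.h>; they are not given in this page. The helper names cap_mask() and decode_caps() are hypothetical.

```python
# Bit numbers as defined in the kernel UAPI header <linux/capability.h>.
# Assumption of this sketch: the list index equals the capability number.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def cap_mask(*names):
    """Build a bit mask from capability names (hypothetical helper)."""
    mask = 0
    for name in names:
        mask |= 1 << CAP_NAMES.index(name)
    return mask


def decode_caps(mask):
    """Return the names of all capabilities whose bit is set in mask."""
    known = [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]
    # Bits beyond the table may exist on kernels newer than this sketch.
    unknown = [
        f"bit {bit}"
        for bit in range(len(CAP_NAMES), mask.bit_length())
        if (mask >> bit) & 1
    ]
    return known + unknown
```

For example, decode_caps(cap_mask("CAP_NET_RAW", "CAP_NET_ADMIN")) returns the two names in bit order, CAP_NET_ADMIN first.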
346
347 Past and current implementation
348 A full implementation of capabilities requires that:
349
350 1. For all privileged operations, the kernel must check whether the
351 thread has the required capability in its effective set.
352
353 2. The kernel must provide system calls allowing a thread's capability
354 sets to be changed and retrieved.
355
356 3. The filesystem must support attaching capabilities to an executable
357 file, so that a process gains those capabilities when the file is
358 executed.
359
Before kernel 2.6.24, only the first two of these requirements were
met; since kernel 2.6.24, all three requirements are met.
362
363 Notes to kernel developers
364 When adding a new kernel feature that should be governed by a capabil‐
365 ity, consider the following points.
366
* The goal of capabilities is to divide the power of superuser into
pieces, such that if a program that has one or more capabilities is
compromised, its power to do damage to the system would be less than
the same program running with root privilege.
371
372 * You have the choice of either creating a new capability for your new
373 feature, or associating the feature with one of the existing capa‐
374 bilities. In order to keep the set of capabilities to a manageable
375 size, the latter option is preferable, unless there are compelling
376 reasons to take the former option. (There is also a technical
377 limit: the size of capability sets is currently limited to 64 bits.)
378
379 * To determine which existing capability might best be associated with
380 your new feature, review the list of capabilities above in order to
381 find a "silo" into which your new feature best fits. One approach
382 to take is to determine if there are other features requiring capa‐
383 bilities that will always be used along with the new feature. If
384 the new feature is useless without these other features, you should
385 use the same capability as the other features.
386
387 * Don't choose CAP_SYS_ADMIN if you can possibly avoid it! A vast
388 proportion of existing capability checks are associated with this
389 capability (see the partial list above). It can plausibly be called
390 "the new root", since on the one hand, it confers a wide range of
391 powers, and on the other hand, its broad scope means that this is
392 the capability that is required by many privileged programs. Don't
393 make the problem worse. The only new features that should be asso‐
394 ciated with CAP_SYS_ADMIN are ones that closely match existing uses
395 in that silo.
396
397 * If you have determined that it really is necessary to create a new
398 capability for your feature, don't make or name it as a "single-use"
399 capability. Thus, for example, the addition of the highly specific
400 CAP_SYS_PACCT was probably a mistake. Instead, try to identify and
401 name your new capability as a broader silo into which other related
402 future use cases might fit.
403
404 Thread capability sets
405 Each thread has the following capability sets containing zero or more
406 of the above capabilities:
407
408 Permitted
409 This is a limiting superset for the effective capabilities that
410 the thread may assume. It is also a limiting superset for the
411 capabilities that may be added to the inheritable set by a
412 thread that does not have the CAP_SETPCAP capability in its ef‐
413 fective set.
414
415 If a thread drops a capability from its permitted set, it can
416 never reacquire that capability (unless it execve(2)s either a
417 set-user-ID-root program, or a program whose associated file ca‐
418 pabilities grant that capability).
419
420 Inheritable
421 This is a set of capabilities preserved across an execve(2).
422 Inheritable capabilities remain inheritable when executing any
423 program, and inheritable capabilities are added to the permitted
424 set when executing a program that has the corresponding bits set
425 in the file inheritable set.
426
427 Because inheritable capabilities are not generally preserved
428 across execve(2) when running as a non-root user, applications
429 that wish to run helper programs with elevated capabilities
430 should consider using ambient capabilities, described below.
431
432 Effective
433 This is the set of capabilities used by the kernel to perform
434 permission checks for the thread.
435
436 Bounding (per-thread since Linux 2.6.25)
437 The capability bounding set is a mechanism that can be used to
438 limit the capabilities that are gained during execve(2).
439
440 Since Linux 2.6.25, this is a per-thread capability set. In
441 older kernels, the capability bounding set was a system wide at‐
442 tribute shared by all threads on the system.
443
444 For more details on the capability bounding set, see below.
445
446 Ambient (since Linux 4.3)
447 This is a set of capabilities that are preserved across an ex‐
448 ecve(2) of a program that is not privileged. The ambient capa‐
449 bility set obeys the invariant that no capability can ever be
450 ambient if it is not both permitted and inheritable.
451
452 The ambient capability set can be directly modified using
453 prctl(2). Ambient capabilities are automatically lowered if ei‐
454 ther of the corresponding permitted or inheritable capabilities
455 is lowered.
456
457 Executing a program that changes UID or GID due to the set-user-
458 ID or set-group-ID bits or executing a program that has any file
459 capabilities set will clear the ambient set. Ambient capabili‐
460 ties are added to the permitted set and assigned to the effec‐
461 tive set when execve(2) is called. If ambient capabilities
462 cause a process's permitted and effective capabilities to in‐
463 crease during an execve(2), this does not trigger the secure-ex‐
464 ecution mode described in ld.so(8).
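
The ambient-set invariant can be illustrated with a small user-space model. This is only a sketch of the rules stated above, not kernel code: the class ThreadCaps and its methods are hypothetical, and the refusal to raise a capability that is not both permitted and inheritable is assumed to mirror the EPERM error documented for the PR_CAP_AMBIENT_RAISE operation in prctl(2).

```python
from dataclasses import dataclass


@dataclass
class ThreadCaps:
    """Toy model of three of a thread's capability sets (bit masks)."""
    permitted: int = 0
    inheritable: int = 0
    ambient: int = 0

    def raise_ambient(self, cap_bit):
        # A capability may become ambient only if it is both
        # permitted and inheritable.
        bit = 1 << cap_bit
        if not (self.permitted & self.inheritable & bit):
            raise PermissionError("capability not permitted and inheritable")
        self.ambient |= bit

    def drop_permitted(self, cap_bit):
        self.permitted &= ~(1 << cap_bit)
        self._restore_invariant()

    def drop_inheritable(self, cap_bit):
        self.inheritable &= ~(1 << cap_bit)
        self._restore_invariant()

    def _restore_invariant(self):
        # Ambient capabilities are lowered automatically when the
        # corresponding permitted or inheritable capability is lowered.
        self.ambient &= self.permitted & self.inheritable
```

In this model, dropping a capability from either the permitted or the inheritable set silently removes it from the ambient set as well.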
465
466 A child created via fork(2) inherits copies of its parent's capability
467 sets. See below for a discussion of the treatment of capabilities dur‐
468 ing execve(2).
469
470 Using capset(2), a thread may manipulate its own capability sets (see
471 below).
472
473 Since Linux 3.2, the file /proc/sys/kernel/cap_last_cap exposes the nu‐
474 merical value of the highest capability supported by the running ker‐
475 nel; this can be used to determine the highest bit that may be set in a
476 capability set.
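
As a rough illustration, the value in cap_last_cap can be turned into a mask of every capability the running kernel knows about. The sketch below also parses the CapInh, CapPrm, CapEff, CapBnd, and CapAmb fields of /proc/[pid]/status; those fields are described in proc(5) rather than in this page, so their use here is an assumption of the example. All function names are hypothetical.

```python
def full_capability_mask(last_cap):
    """Mask with bits 0..last_cap set (last_cap as read from cap_last_cap)."""
    return (1 << (last_cap + 1)) - 1


def read_last_cap(path="/proc/sys/kernel/cap_last_cap"):
    """Return the highest supported capability number, or None if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        # File absent (kernel before 3.2, or not Linux) or unparsable.
        return None


def parse_status_caps(status_text):
    """Extract the hexadecimal Cap* fields from /proc/[pid]/status text."""
    wanted = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")
    caps = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:
            caps[key] = int(value.strip(), 16)
    return caps
```

On a Linux system, parse_status_caps(open("/proc/self/status").read()) would yield the calling thread's sets; a bounding set equal to full_capability_mask(read_last_cap()) means that nothing has been dropped from it.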
477
478 File capabilities
479 Since kernel 2.6.24, the kernel supports associating capability sets
480 with an executable file using setcap(8). The file capability sets are
481 stored in an extended attribute (see setxattr(2) and xattr(7)) named
482 security.capability. Writing to this extended attribute requires the
483 CAP_SETFCAP capability. The file capability sets, in conjunction with
484 the capability sets of the thread, determine the capabilities of a
485 thread after an execve(2).
486
487 The three file capability sets are:
488
489 Permitted (formerly known as forced):
490 These capabilities are automatically permitted to the thread,
491 regardless of the thread's inheritable capabilities.
492
493 Inheritable (formerly known as allowed):
494 This set is ANDed with the thread's inheritable set to determine
495 which inheritable capabilities are enabled in the permitted set
496 of the thread after the execve(2).
497
498 Effective:
499 This is not a set, but rather just a single bit. If this bit is
500 set, then during an execve(2) all of the new permitted capabili‐
501 ties for the thread are also raised in the effective set. If
502 this bit is not set, then after an execve(2), none of the new
503 permitted capabilities is in the new effective set.
504
Enabling the file effective capability bit implies that any
capability that a thread acquires in its permitted set during an
execve(2) by way of the file permitted or inheritable sets (see
the transformation rules described below) is also acquired in
its effective set. Therefore, when assigning capabilities to a
file (setcap(8), cap_set_file(3), cap_set_fd(3)), if we specify
the effective flag as being enabled for any capability, then the
effective flag must also be specified as enabled for all other
capabilities for which the corresponding permitted or
inheritable flag is enabled.
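
The consistency rule for the effective flag can be sketched as a simple check. The representation used here, a dictionary mapping each capability name to a string of flag letters ("e", "i", "p"), is a hypothetical convenience for the example and not an interface of setcap(8) or libcap.

```python
def effective_flag_consistent(file_caps):
    """Check the all-or-nothing rule for the file effective flag.

    file_caps maps a capability name to a string of flag letters,
    e.g. {"CAP_NET_RAW": "ep"} (hypothetical representation).
    """
    any_effective = any("e" in flags for flags in file_caps.values())
    if not any_effective:
        # Effective bit clear: nothing to check.
        return True
    # Effective bit set: every capability that is permitted or
    # inheritable must also carry the effective flag.
    return all(
        "e" in flags
        for flags in file_caps.values()
        if "p" in flags or "i" in flags
    )
```

A specification such as {"CAP_NET_RAW": "ep", "CAP_NET_ADMIN": "p"} fails the check, because the single effective bit stored with the file cannot be enabled for one capability and disabled for another.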
515
516 File capability extended attribute versioning
517 To allow extensibility, the kernel supports a scheme to encode a ver‐
518 sion number inside the security.capability extended attribute that is
519 used to implement file capabilities. These version numbers are inter‐
520 nal to the implementation, and not directly visible to user-space ap‐
521 plications. To date, the following versions are supported:
522
523 VFS_CAP_REVISION_1
524 This was the original file capability implementation, which sup‐
525 ported 32-bit masks for file capabilities.
526
527 VFS_CAP_REVISION_2 (since Linux 2.6.25)
528 This version allows for file capability masks that are 64 bits
529 in size, and was necessary as the number of supported capabili‐
530 ties grew beyond 32. The kernel transparently continues to sup‐
531 port the execution of files that have 32-bit version 1 capabil‐
532 ity masks, but when adding capabilities to files that did not
533 previously have capabilities, or modifying the capabilities of
534 existing files, it automatically uses the version 2 scheme (or
535 possibly the version 3 scheme, as described below).
536
537 VFS_CAP_REVISION_3 (since Linux 4.14)
538 Version 3 file capabilities are provided to support namespaced
539 file capabilities (described below).
540
As with version 2 file capabilities, version 3 capability masks
are 64 bits in size. But in addition, the root user ID of the
namespace is encoded in the security.capability extended
attribute. (A namespace's root user ID is the value that user ID
0 inside that namespace maps to in the initial user namespace.)
546
547 Version 3 file capabilities are designed to coexist with version
548 2 capabilities; that is, on a modern Linux system, there may be
549 some files with version 2 capabilities while others have version
550 3 capabilities.
551
552 Before Linux 4.14, the only kind of file capability extended attribute
553 that could be attached to a file was a VFS_CAP_REVISION_2 attribute.
554 Since Linux 4.14, the version of the security.capability extended at‐
555 tribute that is attached to a file depends on the circumstances in
556 which the attribute was created.
557
558 Starting with Linux 4.14, a security.capability extended attribute is
559 automatically created as (or converted to) a version 3 (VFS_CAP_REVI‐
560 SION_3) attribute if both of the following are true:
561
562 (1) The thread writing the attribute resides in a noninitial user name‐
563 space. (More precisely: the thread resides in a user namespace
564 other than the one from which the underlying filesystem was
565 mounted.)
566
567 (2) The thread has the CAP_SETFCAP capability over the file inode,
568 meaning that (a) the thread has the CAP_SETFCAP capability in its
569 own user namespace; and (b) the UID and GID of the file inode have
570 mappings in the writer's user namespace.
571
572 When a VFS_CAP_REVISION_3 security.capability extended attribute is
573 created, the root user ID of the creating thread's user namespace is
574 saved in the extended attribute.
575
576 By contrast, creating or modifying a security.capability extended at‐
577 tribute from a privileged (CAP_SETFCAP) thread that resides in the
578 namespace where the underlying filesystem was mounted (this normally
579 means the initial user namespace) automatically results in the creation
580 of a version 2 (VFS_CAP_REVISION_2) attribute.
581
582 Note that the creation of a version 3 security.capability extended at‐
583 tribute is automatic. That is to say, when a user-space application
584 writes (setxattr(2)) a security.capability attribute in the version 2
585 format, the kernel will automatically create a version 3 attribute if
586 the attribute is created in the circumstances described above. Corre‐
587 spondingly, when a version 3 security.capability attribute is retrieved
588 (getxattr(2)) by a process that resides inside a user namespace that
589 was created by the root user ID (or a descendant of that user name‐
590 space), the returned attribute is (automatically) simplified to appear
591 as a version 2 attribute (i.e., the returned value is the size of a
version 2 attribute and does not include the root user ID). These
automatic translations mean that no changes are required to user-space
tools (e.g., setcap(8) and getcap(8)) in order for those tools to be
used to create and retrieve version 3 security.capability attributes.
596
597 Note that a file can have either a version 2 or a version 3 secu‐
598 rity.capability extended attribute associated with it, but not both:
599 creation or modification of the security.capability extended attribute
600 will automatically modify the version according to the circumstances in
601 which the extended attribute is created or modified.
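
For illustration only, the raw bytes of a security.capability attribute can be decoded as follows. The layout and constants are assumptions taken from the kernel header <linux/capability.h> (a little-endian 32-bit magic word carrying the revision and the effective flag, followed by pairs of 32-bit permitted and inheritable words and, for version 3, a 32-bit root user ID); they are not specified in this page. The function name is hypothetical.

```python
import struct

# Constants assumed from <linux/capability.h>.
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001
VFS_CAP_REVISION_1 = 0x01000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_REVISION_3 = 0x03000000

# Number of 32-bit (permitted, inheritable) word pairs per revision.
_WORDS = {VFS_CAP_REVISION_1: 1, VFS_CAP_REVISION_2: 2, VFS_CAP_REVISION_3: 2}


def parse_security_capability(raw):
    """Decode the raw value of a security.capability extended attribute."""
    if len(raw) < 4:
        raise ValueError("attribute too short")
    (magic_etc,) = struct.unpack_from("<I", raw, 0)
    revision = magic_etc & VFS_CAP_REVISION_MASK
    words = _WORDS.get(revision)
    if words is None:
        raise ValueError("unknown revision")
    has_rootid = revision == VFS_CAP_REVISION_3
    if len(raw) != 4 + 8 * words + (4 if has_rootid else 0):
        raise ValueError("unexpected attribute size")
    permitted = inheritable = 0
    for i in range(words):
        # Word pair i holds bits 32*i .. 32*i+31 of each mask.
        p, inh = struct.unpack_from("<II", raw, 4 + 8 * i)
        permitted |= p << (32 * i)
        inheritable |= inh << (32 * i)
    rootid = None
    if has_rootid:
        (rootid,) = struct.unpack_from("<I", raw, 4 + 8 * words)
    return {
        "version": revision >> 24,
        "effective": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
        "permitted": permitted,
        "inheritable": inheritable,
        "rootid": rootid,
    }
```

On Linux, the raw value could be obtained with os.getxattr(path, "security.capability"). As described above, a version 3 attribute may be presented as version 2 when it is read from inside the corresponding user namespace, in which case no root user ID is returned.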
602
603 Transformation of capabilities during execve()
604 During an execve(2), the kernel calculates the new capabilities of the
605 process using the following algorithm:
606
607 P'(ambient) = (file is privileged) ? 0 : P(ambient)
608
609 P'(permitted) = (P(inheritable) & F(inheritable)) |
610 (F(permitted) & P(bounding)) | P'(ambient)
611
612 P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
613
614 P'(inheritable) = P(inheritable) [i.e., unchanged]
615
616 P'(bounding) = P(bounding) [i.e., unchanged]
617
618 where:
619
620 P() denotes the value of a thread capability set before the ex‐
621 ecve(2)
622
623 P'() denotes the value of a thread capability set after the ex‐
624 ecve(2)
625
626 F() denotes a file capability set
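
The rules above translate directly into bit-mask arithmetic. The following sketch is a user-space model of that algorithm, not kernel code; the ThreadCaps and FileCaps types and the execve_transform() function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadCaps:
    """P(): thread capability sets as bit masks."""
    permitted: int = 0
    inheritable: int = 0
    effective: int = 0
    bounding: int = 0
    ambient: int = 0


@dataclass(frozen=True)
class FileCaps:
    """F(): file capability sets; effective is a single bit."""
    permitted: int = 0
    inheritable: int = 0
    effective: bool = False


def execve_transform(p, f, file_is_privileged):
    """Return P'() computed from P() and F() as in the rules above."""
    ambient = 0 if file_is_privileged else p.ambient
    permitted = (
        (p.inheritable & f.inheritable)
        | (f.permitted & p.bounding)
        | ambient
    )
    effective = permitted if f.effective else ambient
    return ThreadCaps(
        permitted=permitted,
        inheritable=p.inheritable,  # unchanged
        effective=effective,
        bounding=p.bounding,        # unchanged
        ambient=ambient,
    )
```

With an empty FileCaps() and an empty ambient set, the model yields empty permitted and effective sets, which corresponds to the case of a process with nonzero user IDs executing a program that has no file capabilities.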
627
628 Note the following details relating to the above capability transforma‐
629 tion rules:
630
631 * The ambient capability set is present only since Linux 4.3. When
632 determining the transformation of the ambient set during execve(2),
633 a privileged file is one that has capabilities or has the set-user-
634 ID or set-group-ID bit set.
635
636 * Prior to Linux 2.6.25, the bounding set was a system-wide attribute
637 shared by all threads. That system-wide value was employed to cal‐
638 culate the new permitted set during execve(2) in the same manner as
639 shown above for P(bounding).
640
641 Note: during the capability transitions described above, file capabili‐
642 ties may be ignored (treated as empty) for the same reasons that the
643 set-user-ID and set-group-ID bits are ignored; see execve(2). File ca‐
644 pabilities are similarly ignored if the kernel was booted with the
645 no_file_caps option.
646
647 Note: according to the rules above, if a process with nonzero user IDs
648 performs an execve(2) then any capabilities that are present in its
649 permitted and effective sets will be cleared. For the treatment of ca‐
650 pabilities when a process with a user ID of zero performs an execve(2),
651 see below under Capabilities and execution of programs by root.
652
653 Safety checking for capability-dumb binaries
654 A capability-dumb binary is an application that has been marked to have
655 file capabilities, but has not been converted to use the libcap(3) API
656 to manipulate its capabilities. (In other words, this is a traditional
657 set-user-ID-root program that has been switched to use file capabili‐
658 ties, but whose code has not been modified to understand capabilities.)
659 For such applications, the effective capability bit is set on the file,
660 so that the file permitted capabilities are automatically enabled in
661 the process effective set when executing the file. The kernel recog‐
662 nizes a file which has the effective capability bit set as capability-
663 dumb for the purpose of the check described here.
664
665 When executing a capability-dumb binary, the kernel checks if the
666 process obtained all permitted capabilities that were specified in the
667 file permitted set, after the capability transformations described
668 above have been performed. (The typical reason why this might not oc‐
669 cur is that the capability bounding set masked out some of the capabil‐
670 ities in the file permitted set.) If the process did not obtain the
671 full set of file permitted capabilities, then execve(2) fails with the
error EPERM. This prevents possible security risks that could arise
when a capability-dumb application is executed with less privilege than
it needs. Note that, by definition, the application could not itself
recognize this problem, since it does not employ the libcap(3) API.
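
The check described above can be modeled in a few lines. This is a sketch of the stated rule only; the function name and its return convention (0 for success, otherwise an errno value) are assumptions made for the example.

```python
import errno


def check_capability_dumb(file_permitted, file_effective_bit, new_permitted):
    """Return 0 if execve may proceed, or errno.EPERM if it must fail.

    new_permitted is P'(permitted), computed after the capability
    transformations have been applied.
    """
    if not file_effective_bit:
        # Not treated as capability-dumb: no check is performed.
        return 0
    # Capabilities the file asked for but the process did not obtain,
    # typically because the bounding set masked them out.
    missing = file_permitted & ~new_permitted
    return errno.EPERM if missing else 0
```

For instance, a file permitted set of 0b11 combined with a bounding set of 0b01 leaves bit 1 missing from the new permitted set, so the model reports EPERM when the file effective bit is set.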
676
677 Capabilities and execution of programs by root
678 In order to mirror traditional UNIX semantics, the kernel performs spe‐
679 cial treatment of file capabilities when a process with UID 0 (root)
680 executes a program and when a set-user-ID-root program is executed.
681
682 After having performed any changes to the process effective ID that
683 were triggered by the set-user-ID mode bit of the binary—e.g., switch‐
684 ing the effective user ID to 0 (root) because a set-user-ID-root pro‐
685 gram was executed—the kernel calculates the file capability sets as
686 follows:
687
688 1. If the real or effective user ID of the process is 0 (root), then
689 the file inheritable and permitted sets are ignored; instead they
690 are notionally considered to be all ones (i.e., all capabilities en‐
691 abled). (There is one exception to this behavior, described below
692 in Set-user-ID-root programs that have file capabilities.)
693
694 2. If the effective user ID of the process is 0 (root) or the file ef‐
695 fective bit is in fact enabled, then the file effective bit is no‐
696 tionally defined to be one (enabled).
697
698 These notional values for the file's capability sets are then used as
699 described above to calculate the transformation of the process's capa‐
700 bilities during execve(2).
701
702 Thus, when a process with nonzero UIDs execve(2)s a set-user-ID-root
703 program that does not have capabilities attached, or when a process
704 whose real and effective UIDs are zero execve(2)s a program, the calcu‐
705 lation of the process's new permitted capabilities simplifies to:
706
707 P'(permitted) = P(inheritable) | P(bounding)
708
709 P'(effective) = P'(permitted)
710
711 Consequently, the process gains all capabilities in its permitted and
712 effective capability sets, except those masked out by the capability
713 bounding set. (In the calculation of P'(permitted), the P'(ambient)
714 term can be simplified away because it is by definition a subset
715 of P(inheritable).)
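
The simplification can be checked with a small bit-mask model. This is
a sketch under stated assumptions: struct proc_caps,
exec_as_root_model(), and the all_caps parameter (a mask of every
capability the kernel knows) are hypothetical names, and the model
covers only the case where the effective UID is 0, so that both rules
above apply. The formulas are those from Transformation of capabilities
during execve().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the per-thread sets used in this model. */
struct proc_caps {
    uint64_t inheritable, permitted, effective, ambient, bounding;
};

/* Models execve(2) when the (possibly new) effective UID is 0. */
struct proc_caps
exec_as_root_model(struct proc_caps p, uint64_t all_caps,
                   bool file_privileged)
{
    /* Rule 1: file inheritable and permitted are notionally all ones. */
    const uint64_t f_inheritable = all_caps;
    const uint64_t f_permitted = all_caps;
    /* Rule 2: the file effective bit is notionally enabled. */
    const bool f_effective = true;
    struct proc_caps n = p;

    /* The ambient set is cleared when a privileged file is executed. */
    n.ambient = file_privileged ? 0 : p.ambient;
    n.permitted = (p.inheritable & f_inheritable) |
                  (f_permitted & p.bounding) | n.ambient;
    n.effective = f_effective ? n.permitted : n.ambient;
    return n;
}
```

Provided that the ambient set is a subset of the inheritable set, the
resulting permitted set equals P(inheritable) | P(bounding).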
716
717 The special treatments of user ID 0 (root) described in this subsection
718 can be disabled using the securebits mechanism described below.
719
720 Set-user-ID-root programs that have file capabilities
721 There is one exception to the behavior described under Capabilities and
722 execution of programs by root. If (a) the binary that is being exe‐
723 cuted has capabilities attached and (b) the real user ID of the process
724 is not 0 (root) and (c) the effective user ID of the process is 0
725 (root), then the file capability bits are honored (i.e., they are not
726 notionally considered to be all ones). The usual way in which this
727 situation can arise is when executing a set-UID-root program that also
728 has file capabilities. When such a program is executed, the process
729 gains just the capabilities granted by the program (i.e., not all capa‐
730 bilities, as would occur when executing a set-user-ID-root program that
731 does not have any associated file capabilities).
732
733 Note that one can assign empty capability sets to a program file, and
734 thus it is possible to create a set-user-ID-root program that changes
735 the effective and saved set-user-ID of the process that executes the
736 program to 0, but confers no capabilities to that process.
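
Conditions (a) to (c) can be written as a single predicate. The helper
below is a hypothetical illustration (file_caps_honored() is not a
kernel or libcap function); it reports whether the file capability bits
are used as stored, rather than being replaced by the notional all-ones
values.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical predicate: are the file capability bits used as stored? */
bool
file_caps_honored(bool has_file_caps, uid_t real_uid, uid_t effective_uid)
{
    /* No root ID involved: the special treatment does not apply at all. */
    if (real_uid != 0 && effective_uid != 0)
        return true;

    /* The exception: (a) file capabilities, (b) real UID nonzero,
       (c) effective UID 0. */
    return has_file_caps && real_uid != 0 && effective_uid == 0;
}
```

For a set-user-ID-root program carrying empty capability sets, the
predicate is true, and the process therefore gains no capabilities.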
737
738 Capability bounding set
739 The capability bounding set is a security mechanism that can be used to
740 limit the capabilities that can be gained during an execve(2). The
741 bounding set is used in the following ways:
742
743 * During an execve(2), the capability bounding set is ANDed with the
744 file permitted capability set, and the result of this operation is
745 assigned to the thread's permitted capability set. The capability
746 bounding set thus places a limit on the permitted capabilities that
747 may be granted by an executable file.
748
749 * (Since Linux 2.6.25) The capability bounding set acts as a limiting
750 superset for the capabilities that a thread can add to its inherita‐
751 ble set using capset(2). This means that if a capability is not in
752 the bounding set, then a thread can't add this capability to its in‐
753 heritable set, even if it was in its permitted capabilities, and
754 thereby cannot have this capability preserved in its permitted set
755 when it execve(2)s a file that has the capability in its inheritable
756 set.
757
758 Note that the bounding set masks the file permitted capabilities, but
759 not the inheritable capabilities. If a thread maintains a capability
760 in its inheritable set that is not in its bounding set, then it can
761 still gain that capability in its permitted set by executing a file
762 that has the capability in its inheritable set.
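
The asymmetry between the two terms can be seen in a short model. This
is a sketch only: exec_permitted_model() is a hypothetical name, each
set is represented as a uint64_t bit mask, and the file is assumed to
have capabilities attached, so that the ambient term is empty.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the new permitted set after execve(2) of a file
 * with capabilities attached. Only the file permitted term is masked
 * by the bounding set.
 */
uint64_t
exec_permitted_model(uint64_t p_inheritable, uint64_t p_bounding,
                     uint64_t f_inheritable, uint64_t f_permitted)
{
    return (p_inheritable & f_inheritable) | (f_permitted & p_bounding);
}
```

If a capability is absent from the bounding set, a file that lists it
in its permitted set grants nothing, whereas a thread that still holds
it in its inheritable set gains it from a file that lists it in its
inheritable set.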
763
764 Depending on the kernel version, the capability bounding set is either
765 a system-wide attribute or a per-thread attribute.
766
767 Capability bounding set from Linux 2.6.25 onward
768
769 From Linux 2.6.25, the capability bounding set is a per-thread attri‐
770 bute. (The system-wide capability bounding set described below no
771 longer exists.)
772
773 The bounding set is inherited at fork(2) from the thread's parent, and
774 is preserved across an execve(2).
775
776 A thread may remove capabilities from its capability bounding set using
777 the prctl(2) PR_CAPBSET_DROP operation, provided it has the CAP_SETPCAP
778 capability. Once a capability has been dropped from the bounding set,
779 it cannot be restored to that set. A thread can determine if a capa‐
780 bility is in its bounding set using the prctl(2) PR_CAPBSET_READ opera‐
781 tion.
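
The two operations can be wrapped as follows. This is a minimal sketch:
the wrapper names bounding_set_has() and bounding_set_drop() are
invented for illustration, while the prctl(2) operations and the CAP_*
constants from <linux/capability.h> are the real interfaces.

```c
#include <sys/prctl.h>
#include <linux/capability.h>

/* Returns 1 if cap is in the bounding set, 0 if not, -1 on error. */
int
bounding_set_has(int cap)
{
    return prctl(PR_CAPBSET_READ, (unsigned long) cap, 0UL, 0UL, 0UL);
}

/*
 * Removes cap from the bounding set. Returns 0 on success, or -1 with
 * errno set (EPERM if the caller lacks CAP_SETPCAP).
 */
int
bounding_set_drop(int cap)
{
    return prctl(PR_CAPBSET_DROP, (unsigned long) cap, 0UL, 0UL, 0UL);
}
```

A caller might, for example, test bounding_set_has(CAP_NET_RAW) before
calling bounding_set_drop(CAP_NET_RAW), bearing in mind that the drop
cannot be undone.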
782
783 Removing capabilities from the bounding set is supported only if file
784 capabilities are compiled into the kernel. In kernels before Linux
785 2.6.33, file capabilities were an optional feature configurable via the
786 CONFIG_SECURITY_FILE_CAPABILITIES option. Since Linux 2.6.33, the con‐
787 figuration option has been removed and file capabilities are always
788 part of the kernel. When file capabilities are compiled into the ker‐
789 nel, the init process (the ancestor of all processes) begins with a
790 full bounding set. If file capabilities are not compiled into the ker‐
791 nel, then init begins with a full bounding set minus CAP_SETPCAP, be‐
792 cause this capability has a different meaning when there are no file
793 capabilities.
794
795 Removing a capability from the bounding set does not remove it from the
796 thread's inheritable set. However, it does prevent the capability from
797 being added back into the thread's inheritable set in the future.
798
799 Capability bounding set prior to Linux 2.6.25
800
801 In kernels before 2.6.25, the capability bounding set is a system-wide
802 attribute that affects all threads on the system. The bounding set is
803 accessible via the file /proc/sys/kernel/cap-bound. (Confusingly, this
804 bit mask parameter is expressed as a signed decimal number in
805 /proc/sys/kernel/cap-bound.)
806
807 Only the init process may set capabilities in the capability bounding
808 set; other than that, the superuser (more precisely: a process with the
809 CAP_SYS_MODULE capability) may only clear capabilities from this set.
810
811 On a standard system the capability bounding set always masks out the
812 CAP_SETPCAP capability. To remove this restriction (dangerous!), mod‐
813 ify the definition of CAP_INIT_EFF_SET in include/linux/capability.h
814 and rebuild the kernel.
815
816 The system-wide capability bounding set feature was added to Linux
817 starting with kernel version 2.2.11.
818
819 Effect of user ID changes on capabilities
820 To preserve the traditional semantics for transitions between 0 and
821 nonzero user IDs, the kernel makes the following changes to a thread's
822 capability sets on changes to the thread's real, effective, saved set,
823 and filesystem user IDs (using setuid(2), setresuid(2), or similar):
824
825 1. If one or more of the real, effective, or saved set user IDs was
826 previously 0, and as a result of the UID changes all of these IDs
827 have a nonzero value, then all capabilities are cleared from the
828 permitted, effective, and ambient capability sets.
829
830 2. If the effective user ID is changed from 0 to nonzero, then all ca‐
831 pabilities are cleared from the effective set.
832
833 3. If the effective user ID is changed from nonzero to 0, then the per‐
834 mitted set is copied to the effective set.
835
836 4. If the filesystem user ID is changed from 0 to nonzero (see setf‐
837 suid(2)), then the following capabilities are cleared from the ef‐
838 fective set: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH,
839 CAP_FOWNER, CAP_FSETID, CAP_LINUX_IMMUTABLE (since Linux 2.6.30),
840 CAP_MAC_OVERRIDE, and CAP_MKNOD (since Linux 2.6.30). If the
841 filesystem UID is changed from nonzero to 0, then any of these capa‐
842 bilities that are enabled in the permitted set are enabled in the
843 effective set.
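
Rules 1 to 3 can be expressed as a small state transition. The code
below is an illustrative model, not the kernel implementation: the
struct and function names are hypothetical, capability sets are uint64_t
bit masks, rule 4 (filesystem UID) is left out, and no securebits flags
are assumed to be set.

```c
#include <stdint.h>
#include <sys/types.h>

struct uid_set   { uid_t real, effective, saved; };
struct cap_state { uint64_t permitted, effective, ambient; };

/* Applies rules 1 to 3 to the capability sets for a UID change. */
struct cap_state
uid_change_model(struct uid_set old, struct uid_set now, struct cap_state c)
{
    int had_root = old.real == 0 || old.effective == 0 || old.saved == 0;
    int has_root = now.real == 0 || now.effective == 0 || now.saved == 0;

    /* Rule 1: the last zero UID has gone. */
    if (had_root && !has_root)
        c.permitted = c.effective = c.ambient = 0;

    /* Rule 2: effective UID changed from 0 to nonzero. */
    if (old.effective == 0 && now.effective != 0)
        c.effective = 0;

    /* Rule 3: effective UID changed from nonzero to 0. */
    if (old.effective != 0 && now.effective == 0)
        c.effective = c.permitted;

    return c;
}
```

In this model, temporarily dropping only the effective UID clears the
effective set but keeps the permitted set, so switching the effective
UID back to 0 restores the effective capabilities.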
844
845 If a thread that has a 0 value for one or more of its user IDs wants to
846 prevent its permitted capability set being cleared when it resets all
847 of its user IDs to nonzero values, it can do so using the
848 SECBIT_KEEP_CAPS securebits flag described below.
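
The same effect is available through the older prctl(2)
PR_SET_KEEPCAPS operation, which is easier to show in a few lines. The
wrappers below are a sketch; the names set_keep_caps() and
get_keep_caps() are invented for illustration.

```c
#include <sys/prctl.h>

/* Sets (on != 0) or clears (on == 0) the keep-capabilities flag.
   Returns 0 on success, -1 on error. */
int
set_keep_caps(int on)
{
    return prctl(PR_SET_KEEPCAPS, (unsigned long) (on != 0), 0UL, 0UL, 0UL);
}

/* Returns the current value of the flag (0 or 1), or -1 on error. */
int
get_keep_caps(void)
{
    return prctl(PR_GET_KEEPCAPS, 0UL, 0UL, 0UL, 0UL);
}
```

A thread would typically call set_keep_caps(1), switch its user IDs
with setresuid(2), and then raise the capabilities it still needs from
its permitted set (for example, with cap_set_proc(3)), because the
effective set is cleared when the effective UID becomes nonzero.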
849
850 Programmatically adjusting capability sets
851 A thread can retrieve and change its permitted, effective, and inheri‐
852 table capability sets using the capget(2) and capset(2) system calls.
853 However, the use of cap_get_proc(3) and cap_set_proc(3), both provided
854 in the libcap package, is preferred for this purpose. The following
855 rules govern changes to the thread capability sets:
856
857 1. If the caller does not have the CAP_SETPCAP capability, the new in‐
858 heritable set must be a subset of the combination of the existing
859 inheritable and permitted sets.
860
861 2. (Since Linux 2.6.25) The new inheritable set must be a subset of the
862 combination of the existing inheritable set and the capability
863 bounding set.
864
865 3. The new permitted set must be a subset of the existing permitted set
866 (i.e., it is not possible to acquire permitted capabilities that the
867 thread does not currently have).
868
869 4. The new effective set must be a subset of the new permitted set.
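
The four rules can be collected into one validity check. This is a
model for illustration only: struct thread_caps and
capset_change_allowed() are hypothetical names, the sets are uint64_t
bit masks, and has_setpcap stands for "the caller has the CAP_SETPCAP
capability".

```c
#include <stdbool.h>
#include <stdint.h>

struct thread_caps { uint64_t inheritable, permitted, effective; };

/* Returns true if the proposed sets satisfy rules 1 to 4. */
bool
capset_change_allowed(struct thread_caps old, struct thread_caps proposed,
                      uint64_t bounding, bool has_setpcap)
{
    /* Rule 1: without CAP_SETPCAP, inheritable may grow only from
       the existing inheritable and permitted sets. */
    if (!has_setpcap &&
        (proposed.inheritable & ~(old.inheritable | old.permitted)) != 0)
        return false;

    /* Rule 2: inheritable is limited by the bounding set. */
    if ((proposed.inheritable & ~(old.inheritable | bounding)) != 0)
        return false;

    /* Rule 3: permitted can only shrink. */
    if ((proposed.permitted & ~old.permitted) != 0)
        return false;

    /* Rule 4: effective must lie within the new permitted set. */
    return (proposed.effective & ~proposed.permitted) == 0;
}
```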
870
871 The securebits flags: establishing a capabilities-only environment
872 Starting with kernel 2.6.26, and with a kernel in which file capabili‐
873 ties are enabled, Linux implements a set of per-thread securebits flags
874 that can be used to disable special handling of capabilities for UID 0
875 (root). These flags are as follows:
876
877 SECBIT_KEEP_CAPS
878 Setting this flag allows a thread that has one or more 0 UIDs to
879 retain capabilities in its permitted set when it switches all of
880 its UIDs to nonzero values. If this flag is not set, then such
881 a UID switch causes the thread to lose all permitted capabili‐
882 ties. This flag is always cleared on an execve(2).
883
884 Note that even with the SECBIT_KEEP_CAPS flag set, the effective
885 capabilities of a thread are cleared when it switches its effec‐
886 tive UID to a nonzero value. However, if the thread has set
887 this flag and its effective UID is already nonzero, and the
888 thread subsequently switches all other UIDs to nonzero values,
889 then the effective capabilities will not be cleared.
890
891 The setting of the SECBIT_KEEP_CAPS flag is ignored if the
892 SECBIT_NO_SETUID_FIXUP flag is set. (The latter flag provides a
893 superset of the effect of the former flag.)
894
895 This flag provides the same functionality as the older prctl(2)
896 PR_SET_KEEPCAPS operation.
897
898 SECBIT_NO_SETUID_FIXUP
899 Setting this flag stops the kernel from adjusting the thread's
900 permitted, effective, and ambient capability sets when the
901 thread's effective and filesystem UIDs are switched between zero
902 and nonzero values. (See the subsection Effect of user ID
903 changes on capabilities.)
904
905 SECBIT_NOROOT
906 If this bit is set, then the kernel does not grant capabilities
907 when a set-user-ID-root program is executed, or when a process
908 with an effective or real UID of 0 calls execve(2). (See the
909 subsection Capabilities and execution of programs by root.)
910
911 SECBIT_NO_CAP_AMBIENT_RAISE
912 Setting this flag disallows raising ambient capabilities via the
913 prctl(2) PR_CAP_AMBIENT_RAISE operation.
914
915 Each of the above "base" flags has a companion "locked" flag. Setting
916 any of the "locked" flags is irreversible, and has the effect of pre‐
917 venting further changes to the corresponding "base" flag. The locked
918 flags are: SECBIT_KEEP_CAPS_LOCKED, SECBIT_NO_SETUID_FIXUP_LOCKED,
919 SECBIT_NOROOT_LOCKED, and SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED.
920
921 The securebits flags can be modified and retrieved using the prctl(2)
922 PR_SET_SECUREBITS and PR_GET_SECUREBITS operations. The CAP_SETPCAP
923 capability is required to modify the flags. Note that the SECBIT_*
924 constants are available only after including the <linux/securebits.h>
925 header file.
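
Retrieving and decoding the flags might look like the following. This
is a sketch rather than a reference implementation: secbit_state() and
print_securebits() are hypothetical helper names, while
PR_GET_SECUREBITS and the SECBIT_* constants are the real interfaces.

```c
#include <stdio.h>
#include <sys/prctl.h>
#include <linux/securebits.h>

/* Bit 0 of the result: base flag is set; bit 1: base flag is locked. */
int
secbit_state(int bits, int base, int locked)
{
    return ((bits & base) ? 1 : 0) | ((bits & locked) ? 2 : 0);
}

/* Prints the calling thread's securebits flags.
   Returns the raw flag value, or -1 on error. */
int
print_securebits(FILE *out)
{
    static const struct { const char *name; int base, locked; } flags[] = {
        { "SECBIT_KEEP_CAPS", SECBIT_KEEP_CAPS, SECBIT_KEEP_CAPS_LOCKED },
        { "SECBIT_NO_SETUID_FIXUP", SECBIT_NO_SETUID_FIXUP,
          SECBIT_NO_SETUID_FIXUP_LOCKED },
        { "SECBIT_NOROOT", SECBIT_NOROOT, SECBIT_NOROOT_LOCKED },
        { "SECBIT_NO_CAP_AMBIENT_RAISE", SECBIT_NO_CAP_AMBIENT_RAISE,
          SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED },
    };
    int bits = prctl(PR_GET_SECUREBITS, 0UL, 0UL, 0UL, 0UL);

    if (bits == -1)
        return -1;

    for (size_t i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
        int s = secbit_state(bits, flags[i].base, flags[i].locked);

        fprintf(out, "%-28s %s%s\n", flags[i].name,
                (s & 1) ? "set" : "clear", (s & 2) ? ", locked" : "");
    }
    return bits;
}
```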
926
927 The securebits flags are inherited by child processes. During an ex‐
928 ecve(2), all of the flags are preserved, except SECBIT_KEEP_CAPS which
929 is always cleared.
930
931 An application can use the following call to lock itself, and all of
932 its descendants, into an environment where the only way of gaining ca‐
933 pabilities is by executing a program with associated file capabilities:
934
935 prctl(PR_SET_SECUREBITS,
936 /* SECBIT_KEEP_CAPS off */
937 SECBIT_KEEP_CAPS_LOCKED |
938 SECBIT_NO_SETUID_FIXUP |
939 SECBIT_NO_SETUID_FIXUP_LOCKED |
940 SECBIT_NOROOT |
941 SECBIT_NOROOT_LOCKED);
942 /* Setting/locking SECBIT_NO_CAP_AMBIENT_RAISE
943 is not required */
944
945 Per-user-namespace "set-user-ID-root" programs
946 A set-user-ID program whose UID matches the UID that created a user
947 namespace will confer capabilities in the process's permitted and ef‐
948 fective sets when executed by any process inside that namespace or any
949 descendant user namespace.
950
951 The rules about the transformation of the process's capabilities during
952 the execve(2) are exactly as described in the subsections Transforma‐
953 tion of capabilities during execve() and Capabilities and execution of
954 programs by root, with the difference that, in the latter subsection,
955 "root" is the UID of the creator of the user namespace.
956
957 Namespaced file capabilities
958 Traditional (i.e., version 2) file capabilities associate only a set of
959 capability masks with a binary executable file. When a process exe‐
960 cutes a binary with such capabilities, it gains the associated capabil‐
961 ities (within its user namespace) as per the rules described above in
962 "Transformation of capabilities during execve()".
963
964 Because version 2 file capabilities confer capabilities to the execut‐
965 ing process regardless of which user namespace it resides in, only
966 privileged processes are permitted to associate capabilities with a
967 file. Here, "privileged" means a process that has the CAP_SETFCAP ca‐
968 pability in the user namespace where the filesystem was mounted (nor‐
969 mally the initial user namespace). This limitation renders file capa‐
970 bilities useless for certain use cases. For example, in user-names‐
971 paced containers, it can be desirable to be able to create a binary
972 that confers capabilities only to processes executed inside that con‐
973 tainer, but not to processes that are executed outside the container.
974
975 Linux 4.14 added so-called namespaced file capabilities to support such
976 use cases. Namespaced file capabilities are recorded as version 3
977 (i.e., VFS_CAP_REVISION_3) security.capability extended attributes.
978 Such an attribute is automatically created in the circumstances de‐
979 scribed above under "File capability extended attribute versioning".
980 When a version 3 security.capability extended attribute is created, the
981 kernel records not just the capability masks in the extended attribute,
982 but also the namespace root user ID.
983
984 As with a binary that has VFS_CAP_REVISION_2 file capabilities, a bi‐
985 nary with VFS_CAP_REVISION_3 file capabilities confers capabilities to
986 a process during execve(). However, capabilities are conferred only if
987 the binary is executed by a process that resides in a user namespace
988 whose UID 0 maps to the root user ID that is saved in the extended at‐
989 tribute, or when executed by a process that resides in a descendant of
990 such a namespace.
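
The difference between the attribute versions is visible in the raw
security.capability value (as returned, for example, by getxattr(2)).
The decoder below is a sketch: struct fcap_info, le32_at(), and
decode_security_capability() are hypothetical names, and the layout
(a little-endian magic word, pairs of permitted and inheritable words,
then the root user ID for version 3) is assumed to match the
vfs_cap_data and vfs_ns_cap_data structures in <linux/capability.h>.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/capability.h>

struct fcap_info {
    unsigned int version;   /* 1, 2, or 3 */
    int effective;          /* file effective bit */
    uint64_t permitted;
    uint64_t inheritable;
    uint32_t rootid;        /* namespace root user ID (version 3 only) */
};

/* Reads one little-endian 32-bit word. */
static uint32_t
le32_at(const unsigned char *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Decodes a raw attribute value. Returns 0 on success, -1 if the value
   is malformed (the contents of *out are then unspecified). */
int
decode_security_capability(const unsigned char *buf, size_t len,
                           struct fcap_info *out)
{
    uint32_t magic;
    size_t expected, words, i;

    if (len < 4)
        return -1;
    magic = le32_at(buf);

    switch (magic & VFS_CAP_REVISION_MASK) {
    case VFS_CAP_REVISION_1:
        out->version = 1; words = VFS_CAP_U32_1; expected = XATTR_CAPS_SZ_1;
        break;
    case VFS_CAP_REVISION_2:
        out->version = 2; words = VFS_CAP_U32_2; expected = XATTR_CAPS_SZ_2;
        break;
    case VFS_CAP_REVISION_3:
        out->version = 3; words = VFS_CAP_U32_3; expected = XATTR_CAPS_SZ_3;
        break;
    default:
        return -1;
    }
    if (len != expected)
        return -1;

    out->effective = (magic & VFS_CAP_FLAGS_EFFECTIVE) != 0;
    out->permitted = out->inheritable = 0;
    for (i = 0; i < words; i++) {
        out->permitted   |= (uint64_t) le32_at(buf + 4 + 8 * i) << (32 * i);
        out->inheritable |= (uint64_t) le32_at(buf + 8 + 8 * i) << (32 * i);
    }
    /* Only version 3 carries the namespace root user ID. */
    out->rootid = (out->version == 3) ? le32_at(buf + 4 + 8 * words) : 0;
    return 0;
}
```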
991
992 Interaction with user namespaces
993 For further information on the interaction of capabilities and user
994 namespaces, see user_namespaces(7).
995
997 No standards govern capabilities, but the Linux capability implementa‐
998 tion is based on the withdrawn POSIX.1e draft standard; see
999 ⟨https://archive.org/details/posix_1003.1e-990310⟩.
1000
1002 When attempting to strace(1) binaries that have capabilities (or set-
1003 user-ID-root binaries), you may find the -u <username> option useful.
1004 For example:
1005
1006 $ sudo strace -o trace.log -u ceci ./myprivprog
1007
1008 From kernel 2.5.27 to kernel 2.6.26, capabilities were an optional ker‐
1009 nel component, and could be enabled/disabled via the CONFIG_SECU‐
1010 RITY_CAPABILITIES kernel configuration option.
1011
1012 The /proc/[pid]/task/TID/status file can be used to view the capability
1013 sets of a thread. The /proc/[pid]/status file shows the capability
1014 sets of a process's main thread. Before Linux 3.8, nonexistent capa‐
1015 bilities were shown as being enabled (1) in these sets. Since Linux
1016 3.8, all nonexistent capabilities (above CAP_LAST_CAP) are shown as
1017 disabled (0).
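
The capability sets appear in the status file as hexadecimal masks in
the CapInh, CapPrm, CapEff, CapBnd, and CapAmb fields. A minimal sketch
of reading one of them follows; parse_cap_line() and
read_own_cap_set() are hypothetical helper names.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parses a line such as "CapEff:\t0000000000002000".
   Returns 0 if the line holds the requested field, -1 otherwise. */
int
parse_cap_line(const char *line, const char *field, uint64_t *mask)
{
    size_t n = strlen(field);
    const char *value;
    char *end;
    unsigned long long v;

    if (strncmp(line, field, n) != 0 || line[n] != ':')
        return -1;

    value = line + n + 1;
    v = strtoull(value, &end, 16);   /* skips the leading tab */
    if (end == value)
        return -1;

    *mask = v;
    return 0;
}

/* Reads one capability set of the main thread of the calling process
   from /proc/self/status. Returns 0 on success, -1 on failure. */
int
read_own_cap_set(const char *field, uint64_t *mask)
{
    FILE *fp = fopen("/proc/self/status", "r");
    char line[256];
    int ret = -1;

    if (fp == NULL)
        return -1;
    while (ret != 0 && fgets(line, sizeof(line), fp) != NULL)
        ret = parse_cap_line(line, field, mask);
    fclose(fp);
    return ret;
}
```

For example, read_own_cap_set("CapBnd", &mask) would place the bounding
set of the main thread in mask.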
1018
1019 The libcap package provides a suite of routines for setting and getting
1020 capabilities that is more comfortable and less likely to change than
1021 the interface provided by capset(2) and capget(2). This package also
1022 provides the setcap(8) and getcap(8) programs. It can be found at
1023 ⟨https://git.kernel.org/pub/scm/libs/libcap/libcap.git/refs/⟩.
1024
1025 Before kernel 2.6.24, and from kernel 2.6.24 to kernel 2.6.32 if file
1026 capabilities are not enabled, a thread with the CAP_SETPCAP capability
1027 can manipulate the capabilities of threads other than itself. However,
1028 this is only theoretically possible, since no thread ever has CAP_SETP‐
1029 CAP in either of these cases:
1030
1031 * In the pre-2.6.25 implementation the system-wide capability bounding
1032 set, /proc/sys/kernel/cap-bound, always masks out the CAP_SETPCAP
1033 capability, and this cannot be changed without modifying the kernel
1034 source and rebuilding the kernel.
1035
1036 * If file capabilities are disabled (i.e., the kernel CONFIG_SECU‐
1037 RITY_FILE_CAPABILITIES option is disabled), then init starts out with
1038 the CAP_SETPCAP capability removed from its per-process bounding set,
1039 and that bounding set is inherited by all other processes created on
1040 the system.
1041
1043 capsh(1), setpriv(1), prctl(2), setfsuid(2), cap_clear(3),
1044 cap_copy_ext(3), cap_from_text(3), cap_get_file(3), cap_get_proc(3),
1045 cap_init(3), capgetp(3), capsetp(3), libcap(3), proc(5), creden‐
1046 tials(7), pthreads(7), user_namespaces(7), captest(8), filecap(8), get‐
1047 cap(8), getpcaps(8), netcap(8), pscap(8), setcap(8)
1048
1049 include/linux/capability.h in the Linux kernel source tree
1050
1052 This page is part of release 5.13 of the Linux man-pages project. A
1053 description of the project, information about reporting bugs, and the
1054 latest version of this page, can be found at
1055 https://www.kernel.org/doc/man-pages/.
1056
1057
1058
1059Linux 2021-08-27 CAPABILITIES(7)