USER_NAMESPACES(7)      Linux Programmer's Manual      USER_NAMESPACES(7)

user_namespaces - overview of Linux user namespaces

For an overview of namespaces, see namespaces(7).

User namespaces isolate security-related identifiers and attributes, in
particular, user IDs and group IDs (see credentials(7)), the root
directory, keys (see keyrings(7)), and capabilities (see
capabilities(7)). A process's user and group IDs can be different
inside and outside a user namespace. In particular, a process can have
a normal unprivileged user ID outside a user namespace while at the
same time having a user ID of 0 inside the namespace; in other words,
the process has full privileges for operations inside the user
namespace, but is unprivileged for operations outside the namespace.

Nested namespaces, namespace membership
User namespaces can be nested; that is, each user namespace, except the
initial ("root") namespace, has a parent user namespace, and can have
zero or more child user namespaces. The parent user namespace is the
user namespace of the process that creates the user namespace via a
call to unshare(2) or clone(2) with the CLONE_NEWUSER flag.

The kernel imposes (since version 3.11) a limit of 32 nested levels of
user namespaces. Calls to unshare(2) or clone(2) that would cause this
limit to be exceeded fail with the error EUSERS.

Each process is a member of exactly one user namespace. A process
created via fork(2) or clone(2) without the CLONE_NEWUSER flag is a
member of the same user namespace as its parent. A single-threaded
process can join another user namespace with setns(2) if it has the
CAP_SYS_ADMIN capability in that namespace; upon doing so, it gains a
full set of capabilities in that namespace.

A call to clone(2) or unshare(2) with the CLONE_NEWUSER flag makes the
new child process (for clone(2)) or the caller (for unshare(2)) a
member of the new user namespace created by the call.

The NS_GET_PARENT ioctl(2) operation can be used to discover the
parental relationship between user namespaces; see ioctl_ns(2).

Capabilities
The child process created by clone(2) with the CLONE_NEWUSER flag
starts out with a complete set of capabilities in the new user
namespace. Likewise, a process that creates a new user namespace using
unshare(2) or joins an existing user namespace using setns(2) gains a
full set of capabilities in that namespace. On the other hand, that
process has no capabilities in the parent (in the case of clone(2)) or
previous (in the case of unshare(2) and setns(2)) user namespace, even
if the new namespace is created or joined by the root user (i.e., a
process with user ID 0 in the root namespace).

Note that a call to execve(2) will cause a process's capabilities to be
recalculated in the usual way (see capabilities(7)). Consequently,
unless the process has a user ID of 0 within the namespace, or the
executable file has a nonempty inheritable capabilities mask, the
process will lose all capabilities. See the discussion of user and
group ID mappings, below.

A call to clone(2) or unshare(2) using the CLONE_NEWUSER flag or a call
to setns(2) that moves the caller into another user namespace sets the
"securebits" flags (see capabilities(7)) to their default values (all
flags disabled) in the child (for clone(2)) or caller (for unshare(2)
or setns(2)). Note that because the caller no longer has capabilities
in its original user namespace after a call to setns(2), it is not
possible for a process to reset its "securebits" flags while retaining
its user namespace membership by using a pair of setns(2) calls to move
to another user namespace and then return to its original user
namespace.

The rules for determining whether or not a process has a capability in
a particular user namespace are as follows:

1. A process has a capability inside a user namespace if it is a member
   of that namespace and it has the capability in its effective
   capability set. A process can gain capabilities in its effective
   capability set in various ways. For example, it may execute a
   set-user-ID program or an executable with associated file
   capabilities. In addition, a process may gain capabilities via the
   effect of clone(2), unshare(2), or setns(2), as already described.

2. If a process has a capability in a user namespace, then it has that
   capability in all child (and further removed descendant) namespaces
   as well.

3. When a user namespace is created, the kernel records the effective
   user ID of the creating process as being the "owner" of the
   namespace. A process that resides in the parent of the user
   namespace and whose effective user ID matches the owner of the
   namespace has all capabilities in the namespace. By virtue of the
   previous rule, this means that the process has all capabilities in
   all further removed descendant user namespaces as well. The
   NS_GET_OWNER_UID ioctl(2) operation can be used to discover the user
   ID of the owner of the namespace; see ioctl_ns(2).

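The three rules above can be read as a small decision procedure. The
following is a minimal sketch of that logic in C, not kernel code. The
types struct userns_model and struct proc_model and the function
has_cap_in_ns() are hypothetical names invented for this illustration,
and a single boolean stands in for "the capability is in the effective
set".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are not kernel or libc structures. */
struct userns_model {
    const struct userns_model *parent;  /* NULL for the initial namespace */
    unsigned int owner_uid;             /* Effective UID of the creator,
                                           as seen in the parent */
};

struct proc_model {
    const struct userns_model *ns;      /* Namespace the process is in */
    unsigned int euid;                  /* Effective UID within 'ns' */
    bool cap_effective;                 /* Capability is in effective set */
};

/* Return true if 'p' has the capability in 'target', following
   rules 1 to 3 in the text. */
static bool
has_cap_in_ns(const struct proc_model *p, const struct userns_model *target)
{
    const struct userns_model *ns;

    assert(p != NULL && target != NULL);

    for (ns = target; ns != NULL; ns = ns->parent) {
        /* Rules 1 and 2: the process is a member of 'target' or of one
           of its ancestors; the effective set decides */
        if (ns == p->ns)
            return p->cap_effective;

        /* Rule 3 (carried to descendants by rule 2): the process
           resides in the parent of 'ns' and its effective UID matches
           the owner of 'ns' */
        if (ns->parent == p->ns && ns->owner_uid == p->euid)
            return true;
    }
    return false;       /* 'p' is in neither 'target' nor an ancestor */
}
```

In this model, a process in the initial namespace whose effective UID
equals the owner_uid of a child namespace is reported as capable in
that child and in every namespace nested below it, even when its own
cap_effective is false.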
Effect of capabilities within a user namespace
Having a capability inside a user namespace permits a process to
perform operations (that require privilege) only on resources governed
by that namespace. In other words, having a capability in a user
namespace permits a process to perform privileged operations on
resources that are governed by (nonuser) namespaces owned by
(associated with) the user namespace (see the next subsection).

On the other hand, there are many privileged operations that affect
resources that are not associated with any namespace type, for example,
changing the system (i.e., calendar) time (governed by CAP_SYS_TIME),
loading a kernel module (governed by CAP_SYS_MODULE), and creating a
device (governed by CAP_MKNOD). Only a process with privileges in the
initial user namespace can perform such operations.

Holding CAP_SYS_ADMIN within the user namespace that owns a process's
mount namespace allows that process to create bind mounts and mount the
following types of filesystems:

    * /proc (since Linux 3.8)
    * /sys (since Linux 3.8)
    * devpts (since Linux 3.9)
    * tmpfs(5) (since Linux 3.9)
    * ramfs (since Linux 3.9)
    * mqueue (since Linux 3.9)
    * bpf (since Linux 4.4)

Holding CAP_SYS_ADMIN within the user namespace that owns a process's
cgroup namespace allows (since Linux 4.6) that process to mount the
cgroup version 2 filesystem and cgroup version 1 named hierarchies
(i.e., cgroup filesystems mounted with the "none,name=" option).

Holding CAP_SYS_ADMIN within the user namespace that owns a process's
PID namespace allows (since Linux 3.8) that process to mount /proc
filesystems.

Note, however, that mounting block-based filesystems can be done only
by a process that holds CAP_SYS_ADMIN in the initial user namespace.

Interaction of user namespaces and other types of namespaces
Starting in Linux 3.8, unprivileged processes can create user
namespaces, and the other types of namespaces can be created with just
the CAP_SYS_ADMIN capability in the caller's user namespace.

When a nonuser namespace is created, it is owned by the user namespace
in which the creating process was a member at the time of the creation
of the namespace. Privileged operations on resources governed by the
nonuser namespace require that the process has the necessary
capabilities in the user namespace that owns the nonuser namespace.

If CLONE_NEWUSER is specified along with other CLONE_NEW* flags in a
single clone(2) or unshare(2) call, the user namespace is guaranteed to
be created first, giving the child (clone(2)) or caller (unshare(2))
privileges over the remaining namespaces created by the call. Thus, it
is possible for an unprivileged caller to specify this combination of
flags.

When a new namespace (other than a user namespace) is created via
clone(2) or unshare(2), the kernel records the user namespace of the
creating process as the owner of the new namespace. (This association
can't be changed.) When a process in the new namespace subsequently
performs privileged operations that operate on global resources
isolated by the namespace, the permission checks are performed
according to the process's capabilities in the user namespace that the
kernel associated with the new namespace. For example, suppose that a
process attempts to change the hostname (sethostname(2)), a resource
governed by the UTS namespace. In this case, the kernel will determine
which user namespace owns the process's UTS namespace, and check
whether the process has the required capability (CAP_SYS_ADMIN) in that
user namespace.

The NS_GET_USERNS ioctl(2) operation can be used to discover the user
namespace that owns a nonuser namespace; see ioctl_ns(2).

User and group ID mappings: uid_map and gid_map
When a user namespace is created, it starts out without a mapping of
user IDs (group IDs) to the parent user namespace. The
/proc/[pid]/uid_map and /proc/[pid]/gid_map files (available since
Linux 3.5) expose the mappings for user and group IDs inside the user
namespace for the process pid. These files can be read to view the
mappings in a user namespace and written to (once) to define the
mappings.

The description in the following paragraphs explains the details for
uid_map; gid_map is exactly the same, but each instance of "user ID" is
replaced by "group ID".

The uid_map file exposes the mapping of user IDs from the user
namespace of the process pid to the user namespace of the process that
opened uid_map (but see a qualification to this point below). In other
words, processes that are in different user namespaces will potentially
see different values when reading from a particular uid_map file,
depending on the user ID mappings for the user namespaces of the
reading processes.

Each line in the uid_map file specifies a 1-to-1 mapping of a range of
contiguous user IDs between two user namespaces. (When a user
namespace is first created, this file is empty.) The specification in
each line takes the form of three numbers delimited by white space. The
first two numbers specify the starting user ID in each of the two user
namespaces. The third number specifies the length of the mapped range.
In detail, the fields are interpreted as follows:

(1) The start of the range of user IDs in the user namespace of the
    process pid.

(2) The start of the range of user IDs to which the user IDs specified
    by field one map. How field two is interpreted depends on whether
    the process that opened uid_map and the process pid are in the same
    user namespace, as follows:

    a) If the two processes are in different user namespaces: field two
       is the start of a range of user IDs in the user namespace of the
       process that opened uid_map.

    b) If the two processes are in the same user namespace: field two
       is the start of the range of user IDs in the parent user
       namespace of the process pid. This case enables the opener of
       uid_map (the common case here is opening /proc/self/uid_map) to
       see the mapping of user IDs into the user namespace of the
       process that created this user namespace.

(3) The length of the range of user IDs that is mapped between the two
    user namespaces.

System calls that return user IDs (group IDs), for example getuid(2),
getgid(2), and the credential fields in the structure returned by
stat(2), return the user ID (group ID) mapped into the caller's user
namespace.

When a process accesses a file, its user and group IDs are mapped into
the initial user namespace for the purpose of permission checking and
assigning IDs when creating a file. When a process retrieves file user
and group IDs via stat(2), the IDs are mapped in the opposite
direction, to produce values relative to the process user and group ID
mappings.

The initial user namespace has no parent namespace, but, for
consistency, the kernel provides dummy user and group ID mapping files
for this namespace. Looking at the uid_map file (gid_map is the same)
from a shell in the initial namespace shows:

    $ cat /proc/$$/uid_map
    0 0 4294967295

This mapping tells us that the range starting at user ID 0 in this
namespace maps to a range starting at 0 in the (nonexistent) parent
namespace, and the length of the range is the largest 32-bit unsigned
integer. This leaves 4294967295 (the 32-bit signed -1 value) unmapped.
This is deliberate: (uid_t) -1 is used in several interfaces (e.g.,
setreuid(2)) as a way to specify "no user ID". Leaving (uid_t) -1
unmapped and unusable guarantees that there will be no confusion when
using these interfaces.

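The three-field format can be made concrete with a short sketch that
parses one map line and translates an ID in either direction. This is
an illustration only: struct id_range, parse_id_range(),
map_id_to_outer(), and map_id_to_inner() are hypothetical helpers, not
part of any library, and a return value of -1 is used here to mean "no
mapping".

```c
#include <assert.h>
#include <stdio.h>

/* One uid_map or gid_map line; the struct name is illustrative. */
struct id_range {
    unsigned long inside;   /* Field 1: start of the range in the user
                               namespace of the process pid */
    unsigned long outside;  /* Field 2: start of the range in the other
                               user namespace */
    unsigned long length;   /* Field 3: number of IDs in the range */
};

/* Parse a line such as "0 1000 1"; return 0 on success, -1 on error. */
static int
parse_id_range(const char *line, struct id_range *r)
{
    assert(line != NULL && r != NULL);

    if (sscanf(line, "%lu %lu %lu", &r->inside, &r->outside,
               &r->length) != 3)
        return -1;
    return 0;
}

/* Translate an ID from field-1 space to field-2 space.
   Return -1 if 'id' is not covered by the range. */
static long long
map_id_to_outer(const struct id_range *r, unsigned long id)
{
    if (id < r->inside || id - r->inside >= r->length)
        return -1;
    return (long long) (r->outside + (id - r->inside));
}

/* Translate in the opposite direction (field 2 to field 1). */
static long long
map_id_to_inner(const struct id_range *r, unsigned long id)
{
    if (id < r->outside || id - r->outside >= r->length)
        return -1;
    return (long long) (r->inside + (id - r->outside));
}
```

Parsing the initial-namespace line "0 0 4294967295" with this sketch
and looking up ID 4294967294 yields 4294967294, while looking up
4294967295 yields -1, which matches the observation that (uid_t) -1 is
left unmapped.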
Defining user and group ID mappings: writing to uid_map and gid_map
After the creation of a new user namespace, the uid_map file of one of
the processes in the namespace may be written to once to define the
mapping of user IDs in the new user namespace. An attempt to write
more than once to a uid_map file in a user namespace fails with the
error EPERM. Similar rules apply for gid_map files.

The lines written to uid_map (gid_map) must conform to the following
rules:

* The three fields must be valid numbers, and the last field must be
  greater than 0.

* Lines are terminated by newline characters.

* There is a limit on the number of lines in the file. In Linux 4.14
  and earlier, this limit was (arbitrarily) set at 5 lines. Since
  Linux 4.15, the limit is 340 lines. In addition, the number of
  bytes written to the file must be less than the system page size,
  and the write must be performed at the start of the file (i.e.,
  lseek(2) and pwrite(2) can't be used to write to nonzero offsets in
  the file).

* The range of user IDs (group IDs) specified in each line cannot
  overlap with the ranges in any other lines. In the initial
  implementation (Linux 3.8), this requirement was satisfied by a
  simplistic implementation that imposed the further requirement that
  the values in both field 1 and field 2 of successive lines must be in
  ascending numerical order, which prevented some otherwise valid maps
  from being created. Linux 3.9 and later fix this limitation,
  allowing any valid set of nonoverlapping maps.

* At least one line must be written to the file.

Writes that violate the above rules fail with the error EINVAL.

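The format rules above can be checked in user space before attempting
the write. The following is a rough sketch of such a check, not a
reimplementation of the kernel's parser: check_map_text() and
MAX_MAP_LINES are hypothetical names, the overlap test is applied to
both field 1 and field 2 ranges (an assumption about how the rule is
meant), a missing newline after the final line is tolerated because
the example program in this page writes its map strings that way, and
the page-size and file-offset rules are not modeled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_MAP_LINES 340   /* Limit since Linux 4.15, per the text */

/* Return 0 if 'text' satisfies the format rules, or -1 if a write of
   'text' would be expected to fail with EINVAL. */
static int
check_map_text(const char *text)
{
    unsigned long long in[MAX_MAP_LINES], out[MAX_MAP_LINES],
                       len[MAX_MAP_LINES];
    size_t n = 0;
    const char *p = text;

    assert(text != NULL);

    while (*p != '\0') {
        const char *eol = strchr(p, '\n');
        const char *end;
        int used = 0;

        if (eol == NULL)                /* Final line without newline */
            eol = p + strlen(p);
        if (n == MAX_MAP_LINES)         /* Too many lines */
            return -1;
        if (sscanf(p, "%llu %llu %llu%n", &in[n], &out[n], &len[n],
                   &used) != 3)
            return -1;                  /* Fields must be valid numbers */

        /* Nothing but blanks may follow the third field on this line */
        for (end = p + used; *end == ' ' || *end == '\t'; end++)
            continue;
        if (end != eol || len[n] == 0)  /* Length must be greater than 0 */
            return -1;

        /* Ranges must not overlap those of any earlier line */
        for (size_t k = 0; k < n; k++) {
            if (in[n] < in[k] + len[k] && in[k] < in[n] + len[n])
                return -1;
            if (out[n] < out[k] + len[k] && out[k] < out[n] + len[n])
                return -1;
        }

        n++;
        p = (*eol == '\n') ? eol + 1 : eol;
    }
    return (n > 0) ? 0 : -1;            /* At least one line is required */
}
```

A caller might run this on the string it is about to write and report
a formatting problem early; a successful check says nothing about the
permission requirements, which are separate.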
In order for a process to write to the /proc/[pid]/uid_map
(/proc/[pid]/gid_map) file, all of the following requirements must be
met:

1. The writing process must have the CAP_SETUID (CAP_SETGID) capability
   in the user namespace of the process pid.

2. The writing process must either be in the user namespace of the
   process pid or be in the parent user namespace of the process pid.

3. The mapped user IDs (group IDs) must in turn have a mapping in the
   parent user namespace.

4. One of the following two cases applies:

   * Either the writing process has the CAP_SETUID (CAP_SETGID)
     capability in the parent user namespace.

     + No further restrictions apply: the process can make mappings
       to arbitrary user IDs (group IDs) in the parent user
       namespace.

   * Or otherwise all of the following restrictions apply:

     + The data written to uid_map (gid_map) must consist of a single
       line that maps the writing process's effective user ID (group
       ID) in the parent user namespace to a user ID (group ID) in
       the user namespace.

     + The writing process must have the same effective user ID as
       the process that created the user namespace.

     + In the case of gid_map, use of the setgroups(2) system call
       must first be denied by writing "deny" to the
       /proc/[pid]/setgroups file (see below) before writing to
       gid_map.

Writes that violate the above rules fail with the error EPERM.

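For the unprivileged case in requirement 4, the order of operations
matters: "deny" goes to the setgroups file first, then a single-line
map is written to each of uid_map and gid_map. The following is a
minimal sketch of that sequence for a process that maps itself to ID 0.
The helper names format_single_map(), write_file(), and
become_root_in_new_userns() are invented for this example, and the
sketch assumes a kernel that provides /proc/[pid]/setgroups and a
single-threaded caller.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Format the single-line map "<inside> <outside> 1\n" into 'buf'.
   Return the string length, or -1 if 'buf' is too small. */
static int
format_single_map(char *buf, size_t size, unsigned long inside,
                  unsigned long outside)
{
    int n;

    assert(buf != NULL && size > 0);

    n = snprintf(buf, size, "%lu %lu 1\n", inside, outside);
    return (n < 0 || (size_t) n >= size) ? -1 : n;
}

/* Write 'text' to 'path' in a single write(2) call at offset 0.
   Return 0 on success, -1 on error. */
static int
write_file(const char *path, const char *text)
{
    size_t len = strlen(text);
    ssize_t w;
    int fd = open(path, O_WRONLY);

    if (fd == -1)
        return -1;
    w = write(fd, text, len);
    close(fd);
    return (w == (ssize_t) len) ? 0 : -1;
}

/* Create a user namespace and map the caller's effective IDs to 0
   inside it. Return 0 on success, -1 on error. */
static int
become_root_in_new_userns(void)
{
    char map[64];
    uid_t euid = geteuid();     /* Fetch before unshare(2): afterward, */
    gid_t egid = getegid();     /* the IDs are unmapped in the new ns */

    if (unshare(CLONE_NEWUSER) == -1)
        return -1;
    if (write_file("/proc/self/setgroups", "deny") == -1)
        return -1;
    if (format_single_map(map, sizeof(map), 0, euid) == -1 ||
            write_file("/proc/self/uid_map", map) == -1)
        return -1;
    if (format_single_map(map, sizeof(map), 0, egid) == -1 ||
            write_file("/proc/self/gid_map", map) == -1)
        return -1;
    return 0;
}
```

The writes succeed here without privilege in the parent namespace
because the process holds a full set of capabilities in the namespace
it has just created, writes a single line mapping its own effective
ID, and has the same effective user ID as the creator of the
namespace, namely itself.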
Interaction with system calls that change process UIDs or GIDs
In a user namespace where the uid_map file has not been written, the
system calls that change user IDs will fail. Similarly, if the gid_map
file has not been written, the system calls that change group IDs will
fail. After the uid_map and gid_map files have been written, only the
mapped values may be used in system calls that change user and group
IDs.

For user IDs, the relevant system calls include setuid(2), setfsuid(2),
setreuid(2), and setresuid(2). For group IDs, the relevant system
calls include setgid(2), setfsgid(2), setregid(2), setresgid(2), and
setgroups(2).

Writing "deny" to the /proc/[pid]/setgroups file before writing to
/proc/[pid]/gid_map will permanently disable setgroups(2) in a user
namespace and allow writing to /proc/[pid]/gid_map without having the
CAP_SETGID capability in the parent user namespace.

The /proc/[pid]/setgroups file
The /proc/[pid]/setgroups file displays the string "allow" if processes
in the user namespace that contains the process pid are permitted to
employ the setgroups(2) system call; it displays "deny" if setgroups(2)
is not permitted in that user namespace. Note that regardless of the
value in the /proc/[pid]/setgroups file (and regardless of the
process's capabilities), calls to setgroups(2) are also not permitted
if /proc/[pid]/gid_map has not yet been set.

A privileged process (one with the CAP_SYS_ADMIN capability in the
namespace) may write either of the strings "allow" or "deny" to this
file before writing a group ID mapping for this user namespace to the
file /proc/[pid]/gid_map. Writing the string "deny" prevents any
process in the user namespace from employing setgroups(2).

The essence of the restrictions described in the preceding paragraph is
that it is permitted to write to /proc/[pid]/setgroups only so long as
calling setgroups(2) is disallowed because /proc/[pid]/gid_map has not
been set. This ensures that a process cannot transition from a state
where setgroups(2) is allowed to a state where setgroups(2) is denied;
a process can transition only from setgroups(2) being disallowed to
setgroups(2) being allowed.

The default value of this file in the initial user namespace is
"allow".

Once /proc/[pid]/gid_map has been written to (which has the effect of
enabling setgroups(2) in the user namespace), it is no longer possible
to disallow setgroups(2) by writing "deny" to /proc/[pid]/setgroups
(the write fails with the error EPERM).

A child user namespace inherits the /proc/[pid]/setgroups setting from
its parent.

If the setgroups file has the value "deny", then the setgroups(2)
system call can't subsequently be reenabled (by writing "allow" to the
file) in this user namespace. (Attempts to do so fail with the error
EPERM.) This restriction also propagates down to all child user
namespaces of this user namespace.

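The permitted transitions can be summarized as a two-flag state
machine. The following is a toy model written for this explanation
only: struct setgroups_model and the two functions are hypothetical,
the CAP_SYS_ADMIN requirement is not modeled, and the EINVAL result
for unrecognized strings is an assumption rather than something stated
above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-namespace state; not a kernel structure. */
struct setgroups_model {
    bool allowed;           /* Does the setgroups file read "allow"? */
    bool gid_map_written;   /* Has gid_map been written yet? */
};

/* Model a write to /proc/[pid]/setgroups.
   Return 0 on success, or an errno value on failure. */
static int
model_setgroups_write(struct setgroups_model *ns, const char *str)
{
    assert(ns != NULL && str != NULL);

    if (strcmp(str, "deny") == 0) {
        if (ns->gid_map_written)    /* Too late: gid_map already set */
            return EPERM;
        ns->allowed = false;
        return 0;
    }
    if (strcmp(str, "allow") == 0)  /* A previous "deny" can't be undone */
        return ns->allowed ? 0 : EPERM;

    return EINVAL;                  /* Assumed error for other strings */
}

/* Would setgroups(2) be permitted in this namespace, leaving the
   caller's capabilities aside? */
static bool
model_setgroups_permitted(const struct setgroups_model *ns)
{
    return ns->allowed && ns->gid_map_written;
}
```

Starting from the state "allow" with no gid_map, the model accepts
"deny" and then refuses a later "allow"; starting from "allow" with
gid_map already written, it refuses "deny".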
The /proc/[pid]/setgroups file was added in Linux 3.19, but was
backported to many earlier stable kernel series, because it addresses a
security issue. The issue concerned files with permissions such as
"rwx---rwx". Such files give fewer permissions to "group" than they do
to "other". This means that dropping groups using setgroups(2) might
allow a process file access that it did not formerly have. Before the
existence of user namespaces this was not a concern, since only a
privileged process (one with the CAP_SETGID capability) could call
setgroups(2). However, with the introduction of user namespaces, it
became possible for an unprivileged process to create a new namespace
in which the user had all privileges. This then allowed formerly
unprivileged users to drop groups and thus gain file access that they
did not previously have. The /proc/[pid]/setgroups file was added to
address this security issue, by denying any pathway for an unprivileged
process to drop groups with setgroups(2).

Unmapped user and group IDs
There are various places where an unmapped user ID (group ID) may be
exposed to user space. For example, the first process in a new user
namespace may call getuid(2) before a user ID mapping has been defined
for the namespace. In most such cases, an unmapped user ID is
converted to the overflow user ID (group ID); the default value for the
overflow user ID (group ID) is 65534. See the descriptions of
/proc/sys/kernel/overflowuid and /proc/sys/kernel/overflowgid in
proc(5).

The cases where unmapped IDs are mapped in this fashion include system
calls that return user IDs (getuid(2), getgid(2), and similar),
credentials passed over a UNIX domain socket, credentials returned by
stat(2), waitid(2), and the System V IPC "ctl" IPC_STAT operations,
credentials exposed by /proc/[pid]/status and the files in
/proc/sysvipc/*, credentials returned via the si_uid field in the
siginfo_t received with a signal (see sigaction(2)), credentials
written to the process accounting file (see acct(5)), and credentials
returned with POSIX message queue notifications (see mq_notify(3)).

There is one notable case where unmapped user and group IDs are not
converted to the corresponding overflow ID value. When viewing a
uid_map or gid_map file in which there is no mapping for the second
field, that field is displayed as 4294967295 (-1 as an unsigned
integer).

Accessing files
In order to determine permissions when an unprivileged process accesses
a file, the process credentials (UID, GID) and the file credentials are
in effect mapped back to what they would be in the initial user
namespace and then compared to determine the permissions that the
process has on the file. The same is also true of other objects that
employ the credentials plus permissions mask accessibility model, such
as System V IPC objects.

Operation of file-related capabilities
Certain capabilities allow a process to bypass various kernel-enforced
restrictions when performing operations on files owned by other users
or groups. These capabilities are: CAP_CHOWN, CAP_DAC_OVERRIDE,
CAP_DAC_READ_SEARCH, CAP_FOWNER, and CAP_FSETID.

Within a user namespace, these capabilities allow a process to bypass
the rules if the process has the relevant capability over the file,
meaning that:

* the process has the relevant effective capability in its user
  namespace; and

* the file's user ID and group ID both have valid mappings in the user
  namespace.

The CAP_FOWNER capability is treated somewhat exceptionally: it allows
a process to bypass the corresponding rules so long as at least the
file's user ID has a mapping in the user namespace (i.e., the file's
group ID does not need to have a valid mapping).

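The "capability over the file" test combines an effective-set check
with a lookup of the file's owner IDs in the namespace's mappings. The
following is a sketch of that combination under simplifying
assumptions: struct mapped_range, id_is_mapped(), and
cap_applies_to_file() are hypothetical names, and each mapping is
represented only by the outer side of its ranges, that is, by the IDs
as the file records them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Outer side of one uid_map or gid_map line; the names used in this
   example are illustrative. */
struct mapped_range {
    unsigned long first;    /* First mapped ID */
    unsigned long length;   /* Number of IDs in the range */
};

/* Return true if 'id' falls within one of the 'n' ranges in 'map'. */
static bool
id_is_mapped(const struct mapped_range *map, size_t n, unsigned long id)
{
    assert(map != NULL || n == 0);

    for (size_t k = 0; k < n; k++)
        if (id >= map[k].first && id - map[k].first < map[k].length)
            return true;
    return false;
}

/* Return true if a process may use a file-related capability on a
   file owned by file_uid:file_gid. 'cap_effective' says whether the
   capability is in the process's effective set; 'is_fowner' selects
   the relaxed rule for CAP_FOWNER. */
static bool
cap_applies_to_file(bool cap_effective, bool is_fowner,
                    const struct mapped_range *uid_map, size_t n_uid,
                    const struct mapped_range *gid_map, size_t n_gid,
                    unsigned long file_uid, unsigned long file_gid)
{
    if (!cap_effective)
        return false;
    if (!id_is_mapped(uid_map, n_uid, file_uid))
        return false;       /* The file's user ID must always be mapped */
    if (is_fowner)
        return true;        /* CAP_FOWNER: group ID need not be mapped */
    return id_is_mapped(gid_map, n_gid, file_gid);
}
```

With maps that cover only ID 1000, a file owned by user 1000 and group
0 passes the CAP_FOWNER form of the check but fails the form used for
the other four capabilities.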
Set-user-ID and set-group-ID programs
When a process inside a user namespace executes a set-user-ID
(set-group-ID) program, the process's effective user (group) ID inside
the namespace is changed to whatever value is mapped for the user
(group) ID of the file. However, if either the user or the group ID of
the file has no mapping inside the namespace, the set-user-ID
(set-group-ID) bit is silently ignored: the new program is executed,
but the process's effective user (group) ID is left unchanged. (This
mirrors the semantics of executing a set-user-ID or set-group-ID
program that resides on a filesystem that was mounted with the
MS_NOSUID flag, as described in mount(2).)

Miscellaneous
When a process's user and group IDs are passed over a UNIX domain
socket to a process in a different user namespace (see the description
of SCM_CREDENTIALS in unix(7)), they are translated into the
corresponding values as per the receiving process's user and group ID
mappings.

Namespaces are a Linux-specific feature.

Over the years, many features have been added to the Linux kernel that
are available only to privileged users because of their potential to
confuse set-user-ID-root applications. In general, it becomes safe to
allow the root user in a user namespace to use those features because
it is impossible, while in a user namespace, to gain more privilege
than the root user of a user namespace has.

Availability
Use of user namespaces requires a kernel that is configured with the
CONFIG_USER_NS option. User namespaces require support in a range of
subsystems across the kernel. When an unsupported subsystem is
configured into the kernel, it is not possible to configure user
namespace support.

As of Linux 3.8, most relevant subsystems supported user namespaces,
but a number of filesystems did not have the infrastructure needed to
map user and group IDs between user namespaces. Linux 3.9 added the
required infrastructure support for many of the remaining unsupported
filesystems (Plan 9 (9P), Andrew File System (AFS), Ceph, CIFS, CODA,
NFS, and OCFS2). Linux 3.12 added support for the last of the
unsupported major filesystems, XFS.

The program below is designed to allow experimenting with user
namespaces, as well as other types of namespaces. It creates
namespaces as specified by command-line options and then executes a
command inside those namespaces. The comments and usage() function
inside the program provide a full explanation of the program. The
following shell session demonstrates its use.

First, we look at the run-time environment:

    $ uname -rs     # Need Linux 3.8 or later
    Linux 3.8.0
    $ id -u         # Running as unprivileged user
    1000
    $ id -g
    1000

Now start a new shell in new user (-U), mount (-m), and PID (-p)
namespaces, with user ID (-M) and group ID (-G) 1000 mapped to 0 inside
the user namespace:

    $ ./userns_child_exec -p -m -U -M '0 1000 1' -G '0 1000 1' bash

The shell has PID 1, because it is the first process in the new PID
namespace:

    bash$ echo $$
    1

Mounting a new /proc filesystem and listing all of the processes
visible in the new PID namespace shows that the shell can't see any
processes outside the PID namespace:

    bash$ mount -t proc proc /proc
    bash$ ps ax
      PID TTY      STAT   TIME COMMAND
        1 pts/3    S      0:00 bash
       22 pts/3    R+     0:00 ps ax

Inside the user namespace, the shell has user and group ID 0, and a
full set of permitted and effective capabilities:

    bash$ cat /proc/$$/status | egrep '^[UG]id'
    Uid:    0    0    0    0
    Gid:    0    0    0    0
    bash$ cat /proc/$$/status | egrep '^Cap(Prm|Inh|Eff)'
    CapInh: 0000000000000000
    CapPrm: 0000001fffffffff
    CapEff: 0000001fffffffff

Program source

    /* userns_child_exec.c

       Licensed under GNU General Public License v2 or later

       Create a child process that executes a shell command in new
       namespace(s); allow UID and GID mappings to be specified when
       creating a user namespace.
    */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <signal.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <limits.h>
    #include <errno.h>

    /* A simple error-handling function: print an error message based
       on the value in 'errno' and terminate the calling process */

    #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                            } while (0)

    struct child_args {
        char **argv;        /* Command to be executed by child, with args */
        int    pipe_fd[2];  /* Pipe used to synchronize parent and child */
    };

    static int verbose;

    static void
    usage(char *pname)
    {
        fprintf(stderr, "Usage: %s [options] cmd [arg...]\n\n", pname);
        fprintf(stderr, "Create a child process that executes a shell "
                "command in a new user namespace,\n"
                "and possibly also other new namespace(s).\n\n");
        fprintf(stderr, "Options can be:\n\n");
    #define fpe(str) fprintf(stderr, " %s", str);
        fpe("-i New IPC namespace\n");
        fpe("-m New mount namespace\n");
        fpe("-n New network namespace\n");
        fpe("-p New PID namespace\n");
        fpe("-u New UTS namespace\n");
        fpe("-U New user namespace\n");
        fpe("-M uid_map Specify UID map for user namespace\n");
        fpe("-G gid_map Specify GID map for user namespace\n");
        fpe("-z Map user's UID and GID to 0 in user namespace\n");
        fpe(" (equivalent to: -M '0 <uid> 1' -G '0 <gid> 1')\n");
        fpe("-v Display verbose messages\n");
        fpe("\n");
        fpe("If -z, -M, or -G is specified, -U is required.\n");
        fpe("It is not permitted to specify both -z and either -M or -G.\n");
        fpe("\n");
        fpe("Map strings for -M and -G consist of records of the form:\n");
        fpe("\n");
        fpe(" ID-inside-ns ID-outside-ns len\n");
        fpe("\n");
        fpe("A map string can contain multiple records, separated"
            " by commas;\n");
        fpe("the commas are replaced by newlines before writing"
            " to map files.\n");

        exit(EXIT_FAILURE);
    }

    /* Update the mapping file 'map_file', with the value provided in
       'mapping', a string that defines a UID or GID mapping. A UID or
       GID mapping consists of one or more newline-delimited records
       of the form:

           ID-inside-ns    ID-outside-ns   length

       Requiring the user to supply a string that contains newlines is
       of course inconvenient for command-line use. Thus, we permit the
       use of commas to delimit records in this string, and replace them
       with newlines before writing the string to the file. */

    static void
    update_map(char *mapping, char *map_file)
    {
        int fd;
        size_t j;
        size_t map_len;     /* Length of 'mapping' */

        /* Replace commas in mapping string with newlines */

        map_len = strlen(mapping);
        for (j = 0; j < map_len; j++)
            if (mapping[j] == ',')
                mapping[j] = '\n';

        fd = open(map_file, O_RDWR);
        if (fd == -1) {
            fprintf(stderr, "ERROR: open %s: %s\n", map_file,
                    strerror(errno));
            exit(EXIT_FAILURE);
        }

        /* The whole map must be written in a single write(2) call */

        if (write(fd, mapping, map_len) != (ssize_t) map_len) {
            fprintf(stderr, "ERROR: write %s: %s\n", map_file,
                    strerror(errno));
            exit(EXIT_FAILURE);
        }

        close(fd);
    }

    /* Linux 3.19 made a change in the handling of setgroups(2) and the
       'gid_map' file to address a security issue. The issue allowed
       *unprivileged* users to employ user namespaces in order to drop
       groups. The upshot of the 3.19 changes is that in order to update
       the 'gid_map' file, use of the setgroups() system call in this
       user namespace must first be disabled by writing "deny" to one of
       the /proc/PID/setgroups files for this namespace. That is the
       purpose of the following function. */

    static void
    proc_setgroups_write(pid_t child_pid, char *str)
    {
        char setgroups_path[PATH_MAX];
        int fd;

        snprintf(setgroups_path, PATH_MAX, "/proc/%ld/setgroups",
                (long) child_pid);

        fd = open(setgroups_path, O_RDWR);
        if (fd == -1) {

            /* We may be on a system that doesn't support
               /proc/PID/setgroups. In that case, the file won't exist,
               and the system won't impose the restrictions that Linux 3.19
               added. That's fine: we don't need to do anything in order
               to permit 'gid_map' to be updated.

               However, if the error from open() was something other than
               the ENOENT error that is expected for that case, let the
               user know. */

            if (errno != ENOENT)
                fprintf(stderr, "ERROR: open %s: %s\n", setgroups_path,
                        strerror(errno));
            return;
        }

        if (write(fd, str, strlen(str)) == -1)
            fprintf(stderr, "ERROR: write %s: %s\n", setgroups_path,
                    strerror(errno));

        close(fd);
    }

static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct child_args *args = (struct child_args *) arg;
    char ch;

    /* Wait until the parent has updated the UID and GID mappings.
       See the comment in main(). We wait for end of file on a
       pipe that will be closed by the parent process once it has
       updated the mappings. */

    close(args->pipe_fd[1]);    /* Close our descriptor for the write
                                   end of the pipe so that we see EOF
                                   when parent closes its descriptor */
    if (read(args->pipe_fd[0], &ch, 1) != 0) {
        fprintf(stderr,
                "Failure in child: read from pipe returned != 0\n");
        exit(EXIT_FAILURE);
    }

    close(args->pipe_fd[0]);

    /* Execute a shell command */

    printf("About to exec %s\n", args->argv[0]);
    execvp(args->argv[0], args->argv);
    errExit("execvp");
}

#define STACK_SIZE (1024 * 1024)

static char child_stack[STACK_SIZE];    /* Space for child's stack */

int
main(int argc, char *argv[])
{
    int flags, opt, map_zero;
    pid_t child_pid;
    struct child_args args;
    char *uid_map, *gid_map;
    const int MAP_BUF_SIZE = 100;
    char map_buf[MAP_BUF_SIZE];
    char map_path[PATH_MAX];

    /* Parse command-line options. The initial '+' character in
       the final getopt() argument prevents GNU-style permutation
       of command-line options. That's useful, since sometimes
       the 'command' to be executed by this program itself
       has command-line options. We don't want getopt() to treat
       those as options to this program. */

    flags = 0;
    verbose = 0;
    gid_map = NULL;
    uid_map = NULL;
    map_zero = 0;
    while ((opt = getopt(argc, argv, "+imnpuUM:G:zv")) != -1) {
        switch (opt) {
        case 'i': flags |= CLONE_NEWIPC;        break;
        case 'm': flags |= CLONE_NEWNS;         break;
        case 'n': flags |= CLONE_NEWNET;        break;
        case 'p': flags |= CLONE_NEWPID;        break;
        case 'u': flags |= CLONE_NEWUTS;        break;
        case 'v': verbose = 1;                  break;
        case 'z': map_zero = 1;                 break;
        case 'M': uid_map = optarg;             break;
        case 'G': gid_map = optarg;             break;
        case 'U': flags |= CLONE_NEWUSER;       break;
        default:  usage(argv[0]);
        }
    }

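    /* -M, -G, or -z without -U is nonsensical, and -z cannot be
       combined with -M or -G */

    if (((uid_map != NULL || gid_map != NULL || map_zero) &&
                !(flags & CLONE_NEWUSER)) ||
            (map_zero && (uid_map != NULL || gid_map != NULL)))
        usage(argv[0]);

    /* A command to execute must be supplied; otherwise the child
       would pass a NULL pointer to execvp() */

    if (optind >= argc)
        usage(argv[0]);

    args.argv = &argv[optind];
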
    /* We use a pipe to synchronize the parent and child, in order to
       ensure that the parent sets the UID and GID maps before the child
       calls execve(). This ensures that the child maintains its
       capabilities during the execve() in the common case where we
       want to map the child's effective user ID to 0 in the new user
       namespace. Without this synchronization, the child would lose
       its capabilities if it performed an execve() with nonzero
       user IDs (see the capabilities(7) man page for details of the
       transformation of a process's capabilities during execve()). */

    if (pipe(args.pipe_fd) == -1)
        errExit("pipe");

    /* Create the child in new namespace(s) */

    child_pid = clone(childFunc, child_stack + STACK_SIZE,
                      flags | SIGCHLD, &args);
    if (child_pid == -1)
        errExit("clone");

    /* Parent falls through to here */

    if (verbose)
        printf("%s: PID of child created by clone() is %ld\n",
                argv[0], (long) child_pid);

    /* Update the UID and GID maps in the child */

    if (uid_map != NULL || map_zero) {
        snprintf(map_path, PATH_MAX, "/proc/%ld/uid_map",
                (long) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getuid());
            uid_map = map_buf;
        }
        update_map(uid_map, map_path);
    }

    if (gid_map != NULL || map_zero) {
        proc_setgroups_write(child_pid, "deny");

        snprintf(map_path, PATH_MAX, "/proc/%ld/gid_map",
                (long) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getgid());
            gid_map = map_buf;
        }
        update_map(gid_map, map_path);
    }

    /* Close the write end of the pipe, to signal to the child that we
       have updated the UID and GID maps */

    close(args.pipe_fd[1]);

    if (waitpid(child_pid, NULL, 0) == -1)      /* Wait for child */
        errExit("waitpid");

    if (verbose)
        printf("%s: terminating\n", argv[0]);

    exit(EXIT_SUCCESS);
}
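
The record-delimiter handling performed by update_map() in the program
above can be isolated as follows. This is only a minimal sketch: the
helper name commas_to_newlines() is hypothetical and is not part of
the example program. It converts a comma-delimited mapping string in
place into the newline-delimited form that the map files expect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the example program): replace
   each comma in 'mapping' with a newline, in place, and return the
   length of the string. The caller must pass a writable buffer,
   not a string literal. */

size_t
commas_to_newlines(char *mapping)
{
    size_t j, len;

    assert(mapping != NULL);

    len = strlen(mapping);
    for (j = 0; j < len; j++)
        if (mapping[j] == ',')
            mapping[j] = '\n';

    return len;
}
```

Applied to a buffer holding "0 1000 1,1 100000 65536", the helper
leaves two records, "0 1000 1" and "1 100000 65536", separated by a
newline. The length of the string is unchanged, so the returned value
can be passed directly to write(2).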
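
The pipe-based synchronization used between main() and childFunc() in
the program above can be sketched in isolation as follows. This sketch
makes two simplifying assumptions: it uses fork(2) rather than clone(2),
and the function name pipe_sync_demo() is hypothetical. The child
blocks in read(2) until the parent closes the write end of the pipe,
at which point read(2) returns 0 (end of file).

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demonstration (not part of the example program).
   Returns 0 if the child saw end of file on the pipe and exited
   successfully, or -1 on any error. */

int
pipe_sync_demo(void)
{
    int pipe_fd[2];
    int status;
    pid_t pid;
    char ch;

    if (pipe(pipe_fd) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(pipe_fd[0]);
        close(pipe_fd[1]);
        return -1;
    }

    if (pid == 0) {             /* Child */
        close(pipe_fd[1]);      /* Otherwise we would never see EOF */

        /* Blocks until every write-end descriptor is closed */

        if (read(pipe_fd[0], &ch, 1) != 0)
            _exit(EXIT_FAILURE);
        _exit(EXIT_SUCCESS);
    }

    /* Parent: work that must complete before the child proceeds
       (such as writing the UID and GID maps) would be done here */

    close(pipe_fd[0]);
    close(pipe_fd[1]);          /* Signals the child via EOF */

    if (waitpid(pid, &status, 0) == -1)
        return -1;

    if (WIFEXITED(status) && WEXITSTATUS(status) == EXIT_SUCCESS)
        return 0;
    return -1;
}
```

Closing the child's own copy of the write end is essential: while any
descriptor for the write end remains open, read(2) on the pipe does
not return end of file.
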
SEE ALSO
newgidmap(1), newuidmap(1), clone(2), ptrace(2), setns(2), unshare(2),
proc(5), subgid(5), subuid(5), capabilities(7), cgroup_namespaces(7),
credentials(7), namespaces(7), pid_namespaces(7)

The kernel source file Documentation/namespaces/resource-control.txt.

COLOPHON
This page is part of release 5.07 of the Linux man-pages project. A
description of the project, information about reporting bugs, and the
latest version of this page can be found at
https://www.kernel.org/doc/man-pages/.

Linux 2020-06-09 USER_NAMESPACES(7)