USER_NAMESPACES(7)        Linux Programmer's Manual        USER_NAMESPACES(7)

NAME
user_namespaces - overview of Linux user namespaces

DESCRIPTION
For an overview of namespaces, see namespaces(7).
10
11 User namespaces isolate security-related identifiers and attributes, in
12 particular, user IDs and group IDs (see credentials(7)), the root
13 directory, keys (see keyrings(7)), and capabilities (see capabili‐
14 ties(7)). A process's user and group IDs can be different inside and
15 outside a user namespace. In particular, a process can have a normal
16 unprivileged user ID outside a user namespace while at the same time
17 having a user ID of 0 inside the namespace; in other words, the process
18 has full privileges for operations inside the user namespace, but is
19 unprivileged for operations outside the namespace.
20
21 Nested namespaces, namespace membership
22 User namespaces can be nested; that is, each user namespace—except the
23 initial ("root") namespace—has a parent user namespace, and can have
24 zero or more child user namespaces. The parent user namespace is the
25 user namespace of the process that creates the user namespace via a
26 call to unshare(2) or clone(2) with the CLONE_NEWUSER flag.
27
28 The kernel imposes (since version 3.11) a limit of 32 nested levels of
29 user namespaces. Calls to unshare(2) or clone(2) that would cause this
30 limit to be exceeded fail with the error EUSERS.
31
32 Each process is a member of exactly one user namespace. A process cre‐
33 ated via fork(2) or clone(2) without the CLONE_NEWUSER flag is a member
34 of the same user namespace as its parent. A single-threaded process
can join another user namespace with setns(2) if it has the
CAP_SYS_ADMIN capability in that namespace; upon doing so, it gains a
full set of capabilities in that namespace.
38
39 A call to clone(2) or unshare(2) with the CLONE_NEWUSER flag makes the
40 new child process (for clone(2)) or the caller (for unshare(2)) a mem‐
41 ber of the new user namespace created by the call.
42
43 The NS_GET_PARENT ioctl(2) operation can be used to discover the
44 parental relationship between user namespaces; see ioctl_ns(2).
45
46 Capabilities
47 The child process created by clone(2) with the CLONE_NEWUSER flag
48 starts out with a complete set of capabilities in the new user names‐
49 pace. Likewise, a process that creates a new user namespace using
50 unshare(2) or joins an existing user namespace using setns(2) gains a
51 full set of capabilities in that namespace. On the other hand, that
52 process has no capabilities in the parent (in the case of clone(2)) or
53 previous (in the case of unshare(2) and setns(2)) user namespace, even
54 if the new namespace is created or joined by the root user (i.e., a
55 process with user ID 0 in the root namespace).
56
57 Note that a call to execve(2) will cause a process's capabilities to be
58 recalculated in the usual way (see capabilities(7)). Consequently,
59 unless the process has a user ID of 0 within the namespace, or the exe‐
60 cutable file has a nonempty inheritable capabilities mask, the process
61 will lose all capabilities. See the discussion of user and group ID
62 mappings, below.
63
64 A call to clone(2), unshare(2), or setns(2) using the CLONE_NEWUSER
65 flag sets the "securebits" flags (see capabilities(7)) to their default
66 values (all flags disabled) in the child (for clone(2)) or caller (for
67 unshare(2), or setns(2)). Note that because the caller no longer has
68 capabilities in its original user namespace after a call to setns(2),
69 it is not possible for a process to reset its "securebits" flags while
70 retaining its user namespace membership by using a pair of setns(2)
71 calls to move to another user namespace and then return to its original
72 user namespace.
73
74 The rules for determining whether or not a process has a capability in
75 a particular user namespace are as follows:
76
77 1. A process has a capability inside a user namespace if it is a member
78 of that namespace and it has the capability in its effective capa‐
79 bility set. A process can gain capabilities in its effective capa‐
80 bility set in various ways. For example, it may execute a set-user-
81 ID program or an executable with associated file capabilities. In
82 addition, a process may gain capabilities via the effect of
83 clone(2), unshare(2), or setns(2), as already described.
84
85 2. If a process has a capability in a user namespace, then it has that
86 capability in all child (and further removed descendant) namespaces
87 as well.
88
89 3. When a user namespace is created, the kernel records the effective
90 user ID of the creating process as being the "owner" of the names‐
91 pace. A process that resides in the parent of the user namespace
92 and whose effective user ID matches the owner of the namespace has
93 all capabilities in the namespace. By virtue of the previous rule,
94 this means that the process has all capabilities in all further
95 removed descendant user namespaces as well. The NS_GET_OWNER_UID
96 ioctl(2) operation can be used to discover the user ID of the owner
97 of the namespace; see ioctl_ns(2).
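
The three rules above can be modelled as a walk up the chain of parent namespaces. The following is a minimal sketch, not kernel code: struct userns, struct proc, and has_cap_in_ns() are hypothetical names invented for illustration, and a single boolean stands in for "the capability is in the effective set".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified model; these are not kernel structures */

struct userns {
    const struct userns *parent;    /* NULL for the initial namespace */
    uid_t owner;                    /* Effective UID of the creator,
                                       expressed in the parent namespace */
};

struct proc {
    const struct userns *ns;        /* Namespace the process is a member of */
    uid_t euid;                     /* Effective UID, expressed in 'ns' */
    bool cap_effective;             /* Capability is in the effective set? */
};

/* Return true if 'p' has the capability in 'target' under rules 1 to 3 */

static bool
has_cap_in_ns(const struct proc *p, const struct userns *target)
{
    assert(p != NULL && target != NULL);

    for (const struct userns *ns = target; ns != NULL; ns = ns->parent) {

        /* Rule 1; reached via an ancestor of 'target', this is rule 2 */

        if (ns == p->ns)
            return p->cap_effective;

        /* Rule 3; reached via an ancestor of 'target', rule 2 extends
           the owner's capabilities to the descendant */

        if (ns->parent == p->ns && ns->owner == p->euid)
            return true;
    }

    return false;       /* 'target' is not p's namespace or a descendant */
}
```

In this model, a process in the initial namespace whose effective UID owns a child namespace is reported as capable in that child and in all of its descendants, while a process inside the child is never capable in the initial namespace.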
98
99 Effect of capabilities within a user namespace
100 Having a capability inside a user namespace permits a process to per‐
101 form operations (that require privilege) only on resources governed by
102 that namespace. In other words, having a capability in a user names‐
103 pace permits a process to perform privileged operations on resources
104 that are governed by (nonuser) namespaces associated with the user
105 namespace (see the next subsection).
106
107 On the other hand, there are many privileged operations that affect
108 resources that are not associated with any namespace type, for example,
109 changing the system time (governed by CAP_SYS_TIME), loading a kernel
110 module (governed by CAP_SYS_MODULE), and creating a device (governed by
111 CAP_MKNOD). Only a process with privileges in the initial user names‐
112 pace can perform such operations.
113
114 Holding CAP_SYS_ADMIN within the user namespace associated with a
115 process's mount namespace allows that process to create bind mounts and
116 mount the following types of filesystems:
117
118 * /proc (since Linux 3.8)
119 * /sys (since Linux 3.8)
120 * devpts (since Linux 3.9)
121 * tmpfs(5) (since Linux 3.9)
122 * ramfs (since Linux 3.9)
123 * mqueue (since Linux 3.9)
124 * bpf (since Linux 4.4)
125
Holding CAP_SYS_ADMIN within the user namespace associated with a
process's cgroup namespace allows (since Linux 4.6) that process to
mount the cgroup version 2 filesystem and cgroup version 1 named
hierarchies (i.e., cgroup filesystems mounted with the "none,name="
option).
130
131 Holding CAP_SYS_ADMIN within the user namespace associated with a
132 process's PID namespace allows (since Linux 3.8) that process to mount
133 /proc filesystems.
134
135 Note however, that mounting block-based filesystems can be done only by
136 a process that holds CAP_SYS_ADMIN in the initial user namespace.
137
138 Interaction of user namespaces and other types of namespaces
Starting in Linux 3.8, unprivileged processes can create user
namespaces, and the other types of namespaces can be created with just
the CAP_SYS_ADMIN capability in the caller's user namespace.
142
143 When a non-user-namespace is created, it is owned by the user namespace
144 in which the creating process was a member at the time of the creation
145 of the namespace. Actions on the non-user-namespace require capabili‐
146 ties in the corresponding user namespace.
147
148 If CLONE_NEWUSER is specified along with other CLONE_NEW* flags in a
149 single clone(2) or unshare(2) call, the user namespace is guaranteed to
150 be created first, giving the child (clone(2)) or caller (unshare(2))
151 privileges over the remaining namespaces created by the call. Thus, it
152 is possible for an unprivileged caller to specify this combination of
153 flags.
154
155 When a new namespace (other than a user namespace) is created via
156 clone(2) or unshare(2), the kernel records the user namespace of the
157 creating process against the new namespace. (This association can't be
158 changed.) When a process in the new namespace subsequently performs
159 privileged operations that operate on global resources isolated by the
160 namespace, the permission checks are performed according to the
161 process's capabilities in the user namespace that the kernel associated
162 with the new namespace. For example, suppose that a process attempts
163 to change the hostname (sethostname(2)), a resource governed by the UTS
164 namespace. In this case, the kernel will determine which user names‐
165 pace is associated with the process's UTS namespace, and check whether
166 the process has the required capability (CAP_SYS_ADMIN) in that user
167 namespace.
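
The hostname example can be tried directly. The sketch below assumes a kernel that permits unprivileged user namespace creation; set_private_hostname() is a hypothetical helper name. It creates a user namespace and a UTS namespace in one unshare(2) call, so that the new UTS namespace is owned by the new user namespace, in which the caller holds CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <string.h>
#include <unistd.h>

/* Move the caller into new user and UTS namespaces, then change the
   hostname seen inside the new UTS namespace. Returns 0 on success,
   or -1 with 'errno' set (for example, EPERM if the system does not
   allow unprivileged user namespaces). */

static int
set_private_hostname(const char *name)
{
    assert(name != NULL);

    if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) == -1)
        return -1;

    /* The permission check for sethostname(2) is made against the user
       namespace that owns the caller's UTS namespace */

    return sethostname(name, strlen(name));
}
```

Because unshare(2) changes the namespaces of the calling process itself, a real program would normally call such a helper in a child process.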
168
169 The NS_GET_USERNS ioctl(2) operation can be used to discover the user
170 namespace with which a non-user namespace is associated; see
171 ioctl_ns(2).
172
173 User and group ID mappings: uid_map and gid_map
174 When a user namespace is created, it starts out without a mapping of
175 user IDs (group IDs) to the parent user namespace. The
176 /proc/[pid]/uid_map and /proc/[pid]/gid_map files (available since
177 Linux 3.5) expose the mappings for user and group IDs inside the user
178 namespace for the process pid. These files can be read to view the
179 mappings in a user namespace and written to (once) to define the map‐
180 pings.
181
182 The description in the following paragraphs explains the details for
183 uid_map; gid_map is exactly the same, but each instance of "user ID" is
184 replaced by "group ID".
185
186 The uid_map file exposes the mapping of user IDs from the user names‐
187 pace of the process pid to the user namespace of the process that
188 opened uid_map (but see a qualification to this point below). In other
189 words, processes that are in different user namespaces will potentially
190 see different values when reading from a particular uid_map file,
191 depending on the user ID mappings for the user namespaces of the read‐
192 ing processes.
193
194 Each line in the uid_map file specifies a 1-to-1 mapping of a range of
195 contiguous user IDs between two user namespaces. (When a user names‐
196 pace is first created, this file is empty.) The specification in each
197 line takes the form of three numbers delimited by white space. The
198 first two numbers specify the starting user ID in each of the two user
199 namespaces. The third number specifies the length of the mapped range.
200 In detail, the fields are interpreted as follows:
201
202 (1) The start of the range of user IDs in the user namespace of the
203 process pid.
204
205 (2) The start of the range of user IDs to which the user IDs specified
206 by field one map. How field two is interpreted depends on whether
207 the process that opened uid_map and the process pid are in the same
208 user namespace, as follows:
209
210 a) If the two processes are in different user namespaces: field two
211 is the start of a range of user IDs in the user namespace of the
212 process that opened uid_map.
213
214 b) If the two processes are in the same user namespace: field two
215 is the start of the range of user IDs in the parent user names‐
216 pace of the process pid. This case enables the opener of
217 uid_map (the common case here is opening /proc/self/uid_map) to
218 see the mapping of user IDs into the user namespace of the
219 process that created this user namespace.
220
221 (3) The length of the range of user IDs that is mapped between the two
222 user namespaces.
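
To make the three fields concrete, here is a small sketch that parses one map line and translates an ID through it. The names struct id_extent, parse_extent(), and map_inside_to_outside() are invented for this illustration; they are not part of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* One line of a uid_map (or gid_map) file */

struct id_extent {
    unsigned long inside;   /* Field 1: first ID in namespace of process pid */
    unsigned long outside;  /* Field 2: first ID in the other namespace */
    unsigned long length;   /* Field 3: number of consecutive IDs mapped */
};

/* Parse "inside outside length"; return true if three numbers were read */

static bool
parse_extent(const char *line, struct id_extent *e)
{
    assert(line != NULL && e != NULL);

    return sscanf(line, "%lu %lu %lu",
                  &e->inside, &e->outside, &e->length) == 3;
}

/* Translate 'id' from the inside range to the outside range.
   Return false if 'id' is not covered by this extent (unmapped). */

static bool
map_inside_to_outside(const struct id_extent *e, unsigned long id,
                      unsigned long *out)
{
    assert(e != NULL && out != NULL);

    if (id < e->inside || id - e->inside >= e->length)
        return false;

    *out = e->outside + (id - e->inside);
    return true;
}
```

With the line "0 1000 1", ID 0 translates to 1000 and every other ID is unmapped. With the line "0 0 4294967295", every ID up to 4294967294 maps to itself, and 4294967295 is not covered.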
223
224 System calls that return user IDs (group IDs)—for example, getuid(2),
225 getgid(2), and the credential fields in the structure returned by
226 stat(2)—return the user ID (group ID) mapped into the caller's user
227 namespace.
228
229 When a process accesses a file, its user and group IDs are mapped into
230 the initial user namespace for the purpose of permission checking and
231 assigning IDs when creating a file. When a process retrieves file user
232 and group IDs via stat(2), the IDs are mapped in the opposite direc‐
233 tion, to produce values relative to the process user and group ID map‐
234 pings.
235
236 The initial user namespace has no parent namespace, but, for consis‐
237 tency, the kernel provides dummy user and group ID mapping files for
238 this namespace. Looking at the uid_map file (gid_map is the same) from
239 a shell in the initial namespace shows:
240
    $ cat /proc/$$/uid_map
             0          0 4294967295
243
244 This mapping tells us that the range starting at user ID 0 in this
245 namespace maps to a range starting at 0 in the (nonexistent) parent
246 namespace, and the length of the range is the largest 32-bit unsigned
247 integer. This leaves 4294967295 (the 32-bit signed -1 value) unmapped.
248 This is deliberate: (uid_t) -1 is used in several interfaces (e.g.,
249 setreuid(2)) as a way to specify "no user ID". Leaving (uid_t) -1
250 unmapped and unusable guarantees that there will be no confusion when
251 using these interfaces.
252
253 Defining user and group ID mappings: writing to uid_map and gid_map
254 After the creation of a new user namespace, the uid_map file of one of
255 the processes in the namespace may be written to once to define the
256 mapping of user IDs in the new user namespace. An attempt to write
257 more than once to a uid_map file in a user namespace fails with the
258 error EPERM. Similar rules apply for gid_map files.
259
260 The lines written to uid_map (gid_map) must conform to the following
261 rules:
262
263 * The three fields must be valid numbers, and the last field must be
264 greater than 0.
265
266 * Lines are terminated by newline characters.
267
268 * There is a limit on the number of lines in the file. In Linux 4.14
269 and earlier, this limit was (arbitrarily) set at 5 lines. Since
270 Linux 4.15, the limit is 340 lines. In addition, the number of
271 bytes written to the file must be less than the system page size,
272 and the write must be performed at the start of the file (i.e.,
273 lseek(2) and pwrite(2) can't be used to write to nonzero offsets in
274 the file).
275
276 * The range of user IDs (group IDs) specified in each line cannot
277 overlap with the ranges in any other lines. In the initial imple‐
278 mentation (Linux 3.8), this requirement was satisfied by a simplis‐
279 tic implementation that imposed the further requirement that the
280 values in both field 1 and field 2 of successive lines must be in
281 ascending numerical order, which prevented some otherwise valid maps
282 from being created. Linux 3.9 and later fix this limitation, allow‐
283 ing any valid set of nonoverlapping maps.
284
285 * At least one line must be written to the file.
286
287 Writes that violate the above rules fail with the error EINVAL.
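
The formatting rules above can be checked in user space before attempting the write. The following is a rough sketch under stated assumptions, not the kernel's parser: map_text_is_valid() is a hypothetical helper, it is stricter than the kernel in some respects (for example, it rejects trailing blanks), and it treats IDs as 32-bit values. The caller supplies the page size, for instance from sysconf(_SC_PAGESIZE).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_MAP_LINES 340           /* Limit since Linux 4.15 */
#define MAX_ID 4294967295ULL        /* Assumption: fields are 32-bit values */

struct map_line {
    unsigned long long inside, outside, length;
};

static bool
overlaps(unsigned long long a, unsigned long long alen,
         unsigned long long b, unsigned long long blen)
{
    return a < b + blen && b < a + alen;
}

/* Return true if 'text' satisfies the rules listed above */

static bool
map_text_is_valid(const char *text, size_t page_size)
{
    struct map_line seen[MAX_MAP_LINES];
    size_t nseen = 0;
    size_t len;

    assert(text != NULL);
    len = strlen(text);

    /* At least one line, size limit, newline-terminated */

    if (len == 0 || len >= page_size || text[len - 1] != '\n')
        return false;

    for (const char *p = text; *p != '\0'; ) {
        const char *nl = strchr(p, '\n');   /* Never NULL: see check above */
        size_t linelen = (size_t) (nl - p);
        char line[128];
        struct map_line m;
        int used = 0;

        if (nseen == MAX_MAP_LINES || linelen >= sizeof(line))
            return false;
        memcpy(line, p, linelen);
        line[linelen] = '\0';

        /* Digits and blanks only; sscanf() alone would accept "-1" */

        if (strspn(line, "0123456789 \t") != linelen)
            return false;
        if (sscanf(line, "%llu %llu %llu%n", &m.inside, &m.outside,
                   &m.length, &used) != 3 || line[used] != '\0')
            return false;
        if (m.length == 0 || m.inside > MAX_ID || m.outside > MAX_ID ||
                m.length > MAX_ID)
            return false;

        /* Ranges must not overlap on either side of the mapping */

        for (size_t i = 0; i < nseen; i++)
            if (overlaps(m.inside, m.length,
                         seen[i].inside, seen[i].length) ||
                    overlaps(m.outside, m.length,
                             seen[i].outside, seen[i].length))
                return false;

        seen[nseen++] = m;
        p = nl + 1;
    }

    return true;
}
```

A check of this kind can only predict EINVAL failures; the permission rules that follow are enforced separately by the kernel.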
288
289 In order for a process to write to the /proc/[pid]/uid_map
290 (/proc/[pid]/gid_map) file, all of the following requirements must be
291 met:
292
293 1. The writing process must have the CAP_SETUID (CAP_SETGID) capability
294 in the user namespace of the process pid.
295
296 2. The writing process must either be in the user namespace of the
297 process pid or be in the parent user namespace of the process pid.
298
299 3. The mapped user IDs (group IDs) must in turn have a mapping in the
300 parent user namespace.
301
302 4. One of the following two cases applies:
303
304 * Either the writing process has the CAP_SETUID (CAP_SETGID) capa‐
305 bility in the parent user namespace.
306
307 + No further restrictions apply: the process can make mappings
308 to arbitrary user IDs (group IDs) in the parent user names‐
309 pace.
310
311 * Or otherwise all of the following restrictions apply:
312
313 + The data written to uid_map (gid_map) must consist of a single
314 line that maps the writing process's effective user ID (group
315 ID) in the parent user namespace to a user ID (group ID) in
316 the user namespace.
317
318 + The writing process must have the same effective user ID as
319 the process that created the user namespace.
320
321 + In the case of gid_map, use of the setgroups(2) system call
322 must first be denied by writing "deny" to the /proc/[pid]/set‐
323 groups file (see below) before writing to gid_map.
324
325 Writes that violate the above rules fail with the error EPERM.
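
For the unprivileged case in requirement 4, the sequence is: create the namespace, write "deny" to the setgroups file, then write a single-line gid_map and uid_map. The following minimal sketch shows that sequence for the calling process itself. The helper names write_once() and become_root_in_new_userns() are hypothetical, and the sketch assumes that the kernel permits unprivileged user namespace creation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write 'text' to 'path' with a single write(2) at offset 0.
   Returns 0 on success, or -1 with 'errno' set. */

static int
write_once(const char *path, const char *text)
{
    size_t len;
    ssize_t n;
    int fd, saved;

    assert(path != NULL && text != NULL);
    len = strlen(text);

    fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    n = write(fd, text, len);
    saved = errno;
    close(fd);
    errno = saved;              /* Report the error from write(), if any */

    return (n == (ssize_t) len) ? 0 : -1;
}

/* Create a new user namespace and map the caller's effective UID and
   GID to 0 inside it. Returns 0 on success, or -1 with 'errno' set. */

static int
become_root_in_new_userns(void)
{
    char map[64];
    uid_t outer_uid = geteuid();    /* Fetch IDs before unshare() */
    gid_t outer_gid = getegid();

    if (unshare(CLONE_NEWUSER) == -1)
        return -1;

    /* Must precede an unprivileged write to gid_map (Linux 3.19 and
       later); on older kernels the file does not exist */

    if (write_once("/proc/self/setgroups", "deny") == -1 &&
            errno != ENOENT)
        return -1;

    snprintf(map, sizeof(map), "0 %lu 1\n", (unsigned long) outer_gid);
    if (write_once("/proc/self/gid_map", map) == -1)
        return -1;

    snprintf(map, sizeof(map), "0 %lu 1\n", (unsigned long) outer_uid);
    return write_once("/proc/self/uid_map", map);
}
```

After a successful call, getuid(2) and getgid(2) return 0 in the caller. A second attempt to write either map file would be expected to fail with EPERM, as described at the start of this subsection.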
326
327 Interaction with system calls that change process UIDs or GIDs
328 In a user namespace where the uid_map file has not been written, the
329 system calls that change user IDs will fail. Similarly, if the gid_map
330 file has not been written, the system calls that change group IDs will
331 fail. After the uid_map and gid_map files have been written, only the
332 mapped values may be used in system calls that change user and group
333 IDs.
334
335 For user IDs, the relevant system calls include setuid(2), setfsuid(2),
336 setreuid(2), and setresuid(2). For group IDs, the relevant system
337 calls include setgid(2), setfsgid(2), setregid(2), setresgid(2), and
338 setgroups(2).
339
340 Writing "deny" to the /proc/[pid]/setgroups file before writing to
341 /proc/[pid]/gid_map will permanently disable setgroups(2) in a user
342 namespace and allow writing to /proc/[pid]/gid_map without having the
343 CAP_SETGID capability in the parent user namespace.
344
345 The /proc/[pid]/setgroups file
346 The /proc/[pid]/setgroups file displays the string "allow" if processes
347 in the user namespace that contains the process pid are permitted to
348 employ the setgroups(2) system call; it displays "deny" if setgroups(2)
349 is not permitted in that user namespace. Note that regardless of the
350 value in the /proc/[pid]/setgroups file (and regardless of the
351 process's capabilities), calls to setgroups(2) are also not permitted
352 if /proc/[pid]/gid_map has not yet been set.
353
354 A privileged process (one with the CAP_SYS_ADMIN capability in the
355 namespace) may write either of the strings "allow" or "deny" to this
356 file before writing a group ID mapping for this user namespace to the
357 file /proc/[pid]/gid_map. Writing the string "deny" prevents any
358 process in the user namespace from employing setgroups(2).
359
The essence of the restrictions described in the preceding paragraph is
that it is permitted to write to /proc/[pid]/setgroups only so long as
calling setgroups(2) is disallowed because /proc/[pid]/gid_map has not
been set. This ensures that a process cannot transition from a state
where setgroups(2) is allowed to a state where setgroups(2) is denied;
a process can transition only from setgroups(2) being disallowed to
setgroups(2) being allowed.
367
368 The default value of this file in the initial user namespace is
369 "allow".
370
371 Once /proc/[pid]/gid_map has been written to (which has the effect of
372 enabling setgroups(2) in the user namespace), it is no longer possible
373 to disallow setgroups(2) by writing "deny" to /proc/[pid]/setgroups
374 (the write fails with the error EPERM).
375
376 A child user namespace inherits the /proc/[pid]/setgroups setting from
377 its parent.
378
379 If the setgroups file has the value "deny", then the setgroups(2) sys‐
380 tem call can't subsequently be reenabled (by writing "allow" to the
381 file) in this user namespace. (Attempts to do so fail with the error
382 EPERM.) This restriction also propagates down to all child user names‐
383 paces of this user namespace.
384
385 The /proc/[pid]/setgroups file was added in Linux 3.19, but was back‐
386 ported to many earlier stable kernel series, because it addresses a
387 security issue. The issue concerned files with permissions such as
388 "rwx---rwx". Such files give fewer permissions to "group" than they do
389 to "other". This means that dropping groups using setgroups(2) might
390 allow a process file access that it did not formerly have. Before the
391 existence of user namespaces this was not a concern, since only a priv‐
392 ileged process (one with the CAP_SETGID capability) could call set‐
393 groups(2). However, with the introduction of user namespaces, it
394 became possible for an unprivileged process to create a new namespace
395 in which the user had all privileges. This then allowed formerly
396 unprivileged users to drop groups and thus gain file access that they
397 did not previously have. The /proc/[pid]/setgroups file was added to
398 address this security issue, by denying any pathway for an unprivileged
399 process to drop groups with setgroups(2).
400
401 Unmapped user and group IDs
402 There are various places where an unmapped user ID (group ID) may be
403 exposed to user space. For example, the first process in a new user
404 namespace may call getuid(2) before a user ID mapping has been defined
405 for the namespace. In most such cases, an unmapped user ID is con‐
406 verted to the overflow user ID (group ID); the default value for the
407 overflow user ID (group ID) is 65534. See the descriptions of
408 /proc/sys/kernel/overflowuid and /proc/sys/kernel/overflowgid in
409 proc(5).
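
The overflow IDs can be read from the files named above. A small sketch follows; read_overflow_id() is a hypothetical helper name, and the files are assumed to contain a single decimal number.

```c
#include <assert.h>
#include <stdio.h>

/* Return the number stored in 'path' (for example,
   "/proc/sys/kernel/overflowuid"), or -1 if it cannot be read */

static long
read_overflow_id(const char *path)
{
    FILE *fp;
    long value = -1;

    assert(path != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    if (fscanf(fp, "%ld", &value) != 1)
        value = -1;

    fclose(fp);
    return value;
}
```

A process that sees this value from getuid(2) inside a new user namespace is most likely running before a user ID mapping has been written.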
410
411 The cases where unmapped IDs are mapped in this fashion include system
412 calls that return user IDs (getuid(2), getgid(2), and similar), creden‐
413 tials passed over a UNIX domain socket, credentials returned by
414 stat(2), waitid(2), and the System V IPC "ctl" IPC_STAT operations,
415 credentials exposed by /proc/[pid]/status and the files in
416 /proc/sysvipc/*, credentials returned via the si_uid field in the sig‐
417 info_t received with a signal (see sigaction(2)), credentials written
418 to the process accounting file (see acct(5)), and credentials returned
419 with POSIX message queue notifications (see mq_notify(3)).
420
421 There is one notable case where unmapped user and group IDs are not
422 converted to the corresponding overflow ID value. When viewing a
423 uid_map or gid_map file in which there is no mapping for the second
424 field, that field is displayed as 4294967295 (-1 as an unsigned inte‐
425 ger).
426
427 Set-user-ID and set-group-ID programs
428 When a process inside a user namespace executes a set-user-ID (set-
429 group-ID) program, the process's effective user (group) ID inside the
430 namespace is changed to whatever value is mapped for the user (group)
431 ID of the file. However, if either the user or the group ID of the
432 file has no mapping inside the namespace, the set-user-ID (set-group-
433 ID) bit is silently ignored: the new program is executed, but the
434 process's effective user (group) ID is left unchanged. (This mirrors
435 the semantics of executing a set-user-ID or set-group-ID program that
436 resides on a filesystem that was mounted with the MS_NOSUID flag, as
437 described in mount(2).)
438
439 Miscellaneous
440 When a process's user and group IDs are passed over a UNIX domain
441 socket to a process in a different user namespace (see the description
442 of SCM_CREDENTIALS in unix(7)), they are translated into the corre‐
443 sponding values as per the receiving process's user and group ID map‐
444 pings.
445
CONFORMING TO
Namespaces are a Linux-specific feature.

NOTES
Over the years, many features have been added to the Linux kernel that
are available only to privileged users because of their potential to
confuse set-user-ID-root applications. In general, it becomes safe to
allow the root user in a user namespace to use those features because
it is impossible, while in a user namespace, to gain more privilege
than the root user of a user namespace has.
457
458 Availability
459 Use of user namespaces requires a kernel that is configured with the
460 CONFIG_USER_NS option. User namespaces require support in a range of
461 subsystems across the kernel. When an unsupported subsystem is config‐
462 ured into the kernel, it is not possible to configure user namespaces
463 support.
464
As of Linux 3.8, most relevant subsystems supported user namespaces,
but a number of filesystems did not have the infrastructure needed to
map user and group IDs between user namespaces. Linux 3.9 added the
required infrastructure support for many of the remaining unsupported
filesystems (Plan 9 (9P), Andrew File System (AFS), Ceph, CIFS, CODA,
NFS, and OCFS2). Linux 3.12 added support for the last of the
unsupported major filesystems, XFS.
472
EXAMPLE
The program below is designed to allow experimenting with user
namespaces, as well as other types of namespaces. It creates namespaces
as specified by command-line options and then executes a command inside
those namespaces. The comments and usage() function inside the program
provide a full explanation of the program. The following shell session
demonstrates its use.
480
481 First, we look at the run-time environment:
482
    $ uname -rs     # Need Linux 3.8 or later
    Linux 3.8.0
    $ id -u         # Running as unprivileged user
    1000
    $ id -g
    1000
489
Now start a new shell in new user (-U), mount (-m), and PID (-p)
namespaces, with user ID (-M) and group ID (-G) 1000 mapped to 0 inside
the user namespace:

    $ ./userns_child_exec -p -m -U -M '0 1000 1' -G '0 1000 1' bash
495
496 The shell has PID 1, because it is the first process in the new PID
497 namespace:
498
    bash$ echo $$
    1

Mounting a new /proc filesystem and listing all of the processes
visible in the new PID namespace shows that the shell can't see any
processes outside the PID namespace:

    bash$ mount -t proc proc /proc
    bash$ ps ax
      PID TTY      STAT   TIME COMMAND
        1 pts/3    S      0:00 bash
       22 pts/3    R+     0:00 ps ax
510
511 Inside the user namespace, the shell has user and group ID 0, and a
512 full set of permitted and effective capabilities:
513
    bash$ cat /proc/$$/status | egrep '^[UG]id'
    Uid: 0    0    0    0
    Gid: 0    0    0    0
    bash$ cat /proc/$$/status | egrep '^Cap(Prm|Inh|Eff)'
    CapInh:   0000000000000000
    CapPrm:   0000001fffffffff
    CapEff:   0000001fffffffff
521
Program source

/* userns_child_exec.c

   Licensed under GNU General Public License v2 or later

   Create a child process that executes a shell command in new
   namespace(s); allow UID and GID mappings to be specified when
   creating a user namespace.
*/
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <signal.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <errno.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

struct child_args {
    char **argv;        /* Command to be executed by child, with args */
    int    pipe_fd[2];  /* Pipe used to synchronize parent and child */
};

static int verbose;

static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] cmd [arg...]\n\n", pname);
    fprintf(stderr, "Create a child process that executes a shell "
            "command in a new user namespace,\n"
            "and possibly also other new namespace(s).\n\n");
    fprintf(stderr, "Options can be:\n\n");
#define fpe(str) fprintf(stderr, "    %s", str);
    fpe("-i          New IPC namespace\n");
    fpe("-m          New mount namespace\n");
    fpe("-n          New network namespace\n");
    fpe("-p          New PID namespace\n");
    fpe("-u          New UTS namespace\n");
    fpe("-U          New user namespace\n");
    fpe("-M uid_map  Specify UID map for user namespace\n");
    fpe("-G gid_map  Specify GID map for user namespace\n");
    fpe("-z          Map user's UID and GID to 0 in user namespace\n");
    fpe("            (equivalent to: -M '0 <uid> 1' -G '0 <gid> 1')\n");
    fpe("-v          Display verbose messages\n");
    fpe("\n");
    fpe("If -z, -M, or -G is specified, -U is required.\n");
    fpe("It is not permitted to specify both -z and either -M or -G.\n");
    fpe("\n");
    fpe("Map strings for -M and -G consist of records of the form:\n");
    fpe("\n");
    fpe("    ID-inside-ns   ID-outside-ns   len\n");
    fpe("\n");
    fpe("A map string can contain multiple records, separated"
        " by commas;\n");
    fpe("the commas are replaced by newlines before writing"
        " to map files.\n");

    exit(EXIT_FAILURE);
}

/* Update the mapping file 'map_file', with the value provided in
   'mapping', a string that defines a UID or GID mapping. A UID or
   GID mapping consists of one or more newline-delimited records
   of the form:

       ID-inside-ns    ID-outside-ns   length

   Requiring the user to supply a string that contains newlines is
   of course inconvenient for command-line use. Thus, we permit the
   use of commas to delimit records in this string, and replace them
   with newlines before writing the string to the file. */

static void
update_map(char *mapping, char *map_file)
{
    int fd;
    size_t j;
    size_t map_len;     /* Length of 'mapping' */

    /* Replace commas in mapping string with newlines */

    map_len = strlen(mapping);
    for (j = 0; j < map_len; j++)
        if (mapping[j] == ',')
            mapping[j] = '\n';

    fd = open(map_file, O_RDWR);
    if (fd == -1) {
        fprintf(stderr, "ERROR: open %s: %s\n", map_file,
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    if (write(fd, mapping, map_len) != (ssize_t) map_len) {
        fprintf(stderr, "ERROR: write %s: %s\n", map_file,
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    close(fd);
}

/* Linux 3.19 made a change in the handling of setgroups(2) and the
   'gid_map' file to address a security issue. The issue allowed
   *unprivileged* users to employ user namespaces in order to drop
   groups. The upshot of the 3.19 changes is that in order to update
   the 'gid_map' file, use of the setgroups() system call in this
   user namespace must first be disabled by writing "deny" to one of
   the /proc/PID/setgroups files for this namespace. That is the
   purpose of the following function. */

static void
proc_setgroups_write(pid_t child_pid, char *str)
{
    char setgroups_path[PATH_MAX];
    int fd;

    snprintf(setgroups_path, PATH_MAX, "/proc/%ld/setgroups",
            (long) child_pid);

    fd = open(setgroups_path, O_RDWR);
    if (fd == -1) {

        /* We may be on a system that doesn't support
           /proc/PID/setgroups. In that case, the file won't exist,
           and the system won't impose the restrictions that Linux 3.19
           added. That's fine: we don't need to do anything in order
           to permit 'gid_map' to be updated.

           However, if the error from open() was something other than
           the ENOENT error that is expected for that case, let the
           user know. */

        if (errno != ENOENT)
            fprintf(stderr, "ERROR: open %s: %s\n", setgroups_path,
                strerror(errno));
        return;
    }

    if (write(fd, str, strlen(str)) == -1)
        fprintf(stderr, "ERROR: write %s: %s\n", setgroups_path,
            strerror(errno));

    close(fd);
}

static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct child_args *args = (struct child_args *) arg;
    char ch;

    /* Wait until the parent has updated the UID and GID mappings.
       See the comment in main(). We wait for end of file on a
       pipe that will be closed by the parent process once it has
       updated the mappings. */

    close(args->pipe_fd[1]);    /* Close our descriptor for the write
                                   end of the pipe so that we see EOF
                                   when parent closes its descriptor */
    if (read(args->pipe_fd[0], &ch, 1) != 0) {
        fprintf(stderr,
                "Failure in child: read from pipe returned != 0\n");
        exit(EXIT_FAILURE);
    }

    close(args->pipe_fd[0]);

    /* Execute a shell command */

    printf("About to exec %s\n", args->argv[0]);
    execvp(args->argv[0], args->argv);
    errExit("execvp");
}

#define STACK_SIZE (1024 * 1024)

static char child_stack[STACK_SIZE];    /* Space for child's stack */

int
main(int argc, char *argv[])
{
    int flags, opt, map_zero;
    pid_t child_pid;
    struct child_args args;
    char *uid_map, *gid_map;
    const int MAP_BUF_SIZE = 100;
    char map_buf[MAP_BUF_SIZE];
    char map_path[PATH_MAX];

    /* Parse command-line options. The initial '+' character in
       the final getopt() argument prevents GNU-style permutation
       of command-line options. That's useful, since sometimes
       the 'command' to be executed by this program itself
       has command-line options. We don't want getopt() to treat
       those as options to this program. */

    flags = 0;
    verbose = 0;
    gid_map = NULL;
    uid_map = NULL;
    map_zero = 0;
    while ((opt = getopt(argc, argv, "+imnpuUM:G:zv")) != -1) {
        switch (opt) {
        case 'i': flags |= CLONE_NEWIPC;        break;
        case 'm': flags |= CLONE_NEWNS;         break;
        case 'n': flags |= CLONE_NEWNET;        break;
        case 'p': flags |= CLONE_NEWPID;        break;
        case 'u': flags |= CLONE_NEWUTS;        break;
        case 'v': verbose = 1;                  break;
        case 'z': map_zero = 1;                 break;
        case 'M': uid_map = optarg;             break;
        case 'G': gid_map = optarg;             break;
        case 'U': flags |= CLONE_NEWUSER;       break;
        default:  usage(argv[0]);
        }
    }

    /* -M, -G, or -z without -U is nonsensical, as is -z combined
       with -M or -G */

    if (((uid_map != NULL || gid_map != NULL || map_zero) &&
                !(flags & CLONE_NEWUSER)) ||
            (map_zero && (uid_map != NULL || gid_map != NULL)))
        usage(argv[0]);

    args.argv = &argv[optind];

    /* We use a pipe to synchronize the parent and child, in order to
       ensure that the parent sets the UID and GID maps before the child
       calls execve(). This ensures that the child maintains its
       capabilities during the execve() in the common case where we
       want to map the child's effective user ID to 0 in the new user
       namespace. Without this synchronization, the child would lose
       its capabilities if it performed an execve() with nonzero
       user IDs (see the capabilities(7) man page for details of the
       transformation of a process's capabilities during execve()). */

    if (pipe(args.pipe_fd) == -1)
        errExit("pipe");

    /* Create the child in new namespace(s) */

    child_pid = clone(childFunc, child_stack + STACK_SIZE,
                      flags | SIGCHLD, &args);
    if (child_pid == -1)
        errExit("clone");

    /* Parent falls through to here */

    if (verbose)
        printf("%s: PID of child created by clone() is %ld\n",
                argv[0], (long) child_pid);

    /* Update the UID and GID maps in the child */

    if (uid_map != NULL || map_zero) {
        snprintf(map_path, PATH_MAX, "/proc/%ld/uid_map",
                (long) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getuid());
            uid_map = map_buf;
        }
        update_map(uid_map, map_path);
    }

    if (gid_map != NULL || map_zero) {
        proc_setgroups_write(child_pid, "deny");

        snprintf(map_path, PATH_MAX, "/proc/%ld/gid_map",
                (long) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getgid());
            gid_map = map_buf;
        }
        update_map(gid_map, map_path);
    }

    /* Close the write end of the pipe, to signal to the child that we
       have updated the UID and GID maps */

    close(args.pipe_fd[1]);

    if (waitpid(child_pid, NULL, 0) == -1)      /* Wait for child */
        errExit("waitpid");

    if (verbose)
        printf("%s: terminating\n", argv[0]);

    exit(EXIT_SUCCESS);
}

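The pipe-based synchronization used by main() and childFunc() above can be shown in isolation. The following is a minimal sketch, not part of the program above: it uses fork(2) instead of clone(2) to stay short, and the function name sync_child_on_pipe_eof() is hypothetical. The child closes its copy of the write end and blocks in read(2); it proceeds only after the parent has finished its setup and closed the last remaining write descriptor, at which point read(2) returns 0 (end of file).

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (an assumption for illustration only): fork a
   child that waits for end of file on a pipe before continuing.
   Returns the child's exit status (0 if the child saw EOF), or -1
   on error. */
int
sync_child_on_pipe_eof(void)
{
    int pipe_fd[2];
    int status;
    char ch;
    pid_t pid;

    if (pipe(pipe_fd) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(pipe_fd[0]);
        close(pipe_fd[1]);
        return -1;
    }

    if (pid == 0) {                 /* Child */
        close(pipe_fd[1]);          /* Drop our copy of the write end;
                                       otherwise EOF would never arrive */

        /* read() returns 0 only once every write descriptor for the
           pipe has been closed, that is, once the parent is done */

        _exit(read(pipe_fd[0], &ch, 1) == 0 ? 0 : 1);
    }

    /* Parent */

    close(pipe_fd[0]);              /* The parent never reads */

    /* The setup that the child must wait for would go here; in the
       program above, that is the update of uid_map and gid_map */

    close(pipe_fd[1]);              /* Signal the child via EOF */

    if (waitpid(pid, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called from a small main(), the function is expected to return 0, since the child can see end of file only after the parent has closed its write descriptor. If the child skipped its own close() of the write end, its read() would block indefinitely, because the pipe would still have an open writer.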
SEE ALSO
newgidmap(1), newuidmap(1), clone(2), ptrace(2), setns(2), unshare(2),
proc(5), subgid(5), subuid(5), capabilities(7), cgroup_namespaces(7),
credentials(7), namespaces(7), pid_namespaces(7)

The kernel source file Documentation/namespaces/resource-control.txt.

This page is part of release 4.15 of the Linux man-pages project.  A
description of the project, information about reporting bugs, and the
latest version of this page, can be found at
https://www.kernel.org/doc/man-pages/.

Linux 2018-02-02 USER_NAMESPACES(7)