USER_NAMESPACES(7)        Linux Programmer's Manual        USER_NAMESPACES(7)

NAME
user_namespaces - overview of Linux user namespaces

DESCRIPTION
For an overview of namespaces, see namespaces(7).

User namespaces isolate security-related identifiers and attributes, in
particular, user IDs and group IDs (see credentials(7)), the root
directory, keys (see keyrings(7)), and capabilities (see
capabilities(7)). A process's user and group IDs can be different
inside and outside a user namespace. In particular, a process can have
a normal unprivileged user ID outside a user namespace while at the
same time having a user ID of 0 inside the namespace; in other words,
the process has full privileges for operations inside the user
namespace, but is unprivileged for operations outside the namespace.

Nested namespaces, namespace membership
User namespaces can be nested; that is, each user namespace, except the
initial ("root") namespace, has a parent user namespace, and can have
zero or more child user namespaces. The parent user namespace is the
user namespace of the process that creates the user namespace via a
call to unshare(2) or clone(2) with the CLONE_NEWUSER flag.

The kernel imposes (since version 3.11) a limit of 32 nested levels of
user namespaces. Calls to unshare(2) or clone(2) that would cause this
limit to be exceeded fail with the error EUSERS.

Each process is a member of exactly one user namespace. A process
created via fork(2) or clone(2) without the CLONE_NEWUSER flag is a
member of the same user namespace as its parent. A single-threaded
process can join another user namespace with setns(2) if it has the
CAP_SYS_ADMIN capability in that namespace; upon doing so, it gains a
full set of capabilities in that namespace.

A call to clone(2) or unshare(2) with the CLONE_NEWUSER flag makes the
new child process (for clone(2)) or the caller (for unshare(2)) a
member of the new user namespace created by the call.

The NS_GET_PARENT ioctl(2) operation can be used to discover the
parental relationship between user namespaces; see ioctl_ns(2).

Capabilities
The child process created by clone(2) with the CLONE_NEWUSER flag
starts out with a complete set of capabilities in the new user
namespace. Likewise, a process that creates a new user namespace using
unshare(2) or joins an existing user namespace using setns(2) gains a
full set of capabilities in that namespace. On the other hand, that
process has no capabilities in the parent (in the case of clone(2)) or
previous (in the case of unshare(2) and setns(2)) user namespace, even
if the new namespace is created or joined by the root user (i.e., a
process with user ID 0 in the root namespace).

Note that a call to execve(2) will cause a process's capabilities to be
recalculated in the usual way (see capabilities(7)). Consequently,
unless the process has a user ID of 0 within the namespace, or the
executable file has a nonempty inheritable capabilities mask, the
process will lose all capabilities. See the discussion of user and
group ID mappings, below.

A call to clone(2), unshare(2), or setns(2) using the CLONE_NEWUSER
flag sets the "securebits" flags (see capabilities(7)) to their default
values (all flags disabled) in the child (for clone(2)) or caller (for
unshare(2) or setns(2)). Note that because the caller no longer has
capabilities in its original user namespace after a call to setns(2),
it is not possible for a process to reset its "securebits" flags while
retaining its user namespace membership by using a pair of setns(2)
calls to move to another user namespace and then return to its original
user namespace.

The rules for determining whether or not a process has a capability in
a particular user namespace are as follows:

1. A process has a capability inside a user namespace if it is a member
   of that namespace and it has the capability in its effective
   capability set. A process can gain capabilities in its effective
   capability set in various ways. For example, it may execute a
   set-user-ID program or an executable with associated file
   capabilities. In addition, a process may gain capabilities via the
   effect of clone(2), unshare(2), or setns(2), as already described.

2. If a process has a capability in a user namespace, then it has that
   capability in all child (and further removed descendant) namespaces
   as well.

3. When a user namespace is created, the kernel records the effective
   user ID of the creating process as being the "owner" of the
   namespace. A process that resides in the parent of the user
   namespace and whose effective user ID matches the owner of the
   namespace has all capabilities in the namespace. By virtue of the
   previous rule, this means that the process has all capabilities in
   all further removed descendant user namespaces as well. The
   NS_GET_OWNER_UID ioctl(2) operation can be used to discover the user
   ID of the owner of the namespace; see ioctl_ns(2).
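
The three rules above can be modelled in a few lines of C. The
following is a toy sketch, not kernel code: struct user_ns, struct
proc, and has_cap_in_ns() are hypothetical names invented for this
illustration, and a single boolean stands in for one bit of the
effective capability set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Toy model only: these are hypothetical types, not kernel structures. */
struct user_ns {
    const struct user_ns *parent;   /* NULL for the initial namespace */
    uid_t owner;                    /* Creator's effective UID, expressed
                                       in the parent namespace (rule 3) */
};

struct proc {
    const struct user_ns *ns;       /* Namespace the process belongs to */
    uid_t euid;                     /* Effective UID, expressed in 'ns' */
    bool  cap_effective;            /* Capability in the effective set? */
};

/* Return true if 'p' has the capability in the namespace 'target'. */
static bool
has_cap_in_ns(const struct proc *p, const struct user_ns *target)
{
    const struct user_ns *ns;

    /* Walk from 'target' up toward the initial namespace. */
    for (ns = target; ns != NULL; ns = ns->parent) {

        /* Rule 1; reached via an ancestor of 'target', this is rule 2. */
        if (ns == p->ns)
            return p->cap_effective;

        /* Rule 3: 'p' lives in the parent of 'ns' and owns 'ns'. */
        if (ns->parent == p->ns && ns->owner == p->euid)
            return true;
    }

    return false;   /* 'target' is neither p->ns nor a descendant of it */
}
```

Walking upward from the target namespace means that rule 2 needs no
code of its own: a match found at any ancestor of the target applies to
the target as well.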

Effect of capabilities within a user namespace
Having a capability inside a user namespace permits a process to
perform operations (that require privilege) only on resources governed
by that namespace. In other words, having a capability in a user
namespace permits a process to perform privileged operations on
resources that are governed by (nonuser) namespaces owned by
(associated with) the user namespace (see the next subsection).

On the other hand, there are many privileged operations that affect
resources that are not associated with any namespace type, for example,
changing the system time (governed by CAP_SYS_TIME), loading a kernel
module (governed by CAP_SYS_MODULE), and creating a device (governed by
CAP_MKNOD). Only a process with privileges in the initial user
namespace can perform such operations.

Holding CAP_SYS_ADMIN within the user namespace that owns a process's
mount namespace allows that process to create bind mounts and mount the
following types of filesystems:

* /proc (since Linux 3.8)
* /sys (since Linux 3.8)
* devpts (since Linux 3.9)
* tmpfs(5) (since Linux 3.9)
* ramfs (since Linux 3.9)
* mqueue (since Linux 3.9)
* bpf (since Linux 4.4)

Holding CAP_SYS_ADMIN within the user namespace that owns a process's
cgroup namespace allows (since Linux 4.6) that process to mount the
cgroup version 2 filesystem and cgroup version 1 named hierarchies
(i.e., cgroup filesystems mounted with the "none,name=" option).

Holding CAP_SYS_ADMIN within the user namespace that owns a process's
PID namespace allows (since Linux 3.8) that process to mount /proc
filesystems.

Note, however, that mounting block-based filesystems can be done only
by a process that holds CAP_SYS_ADMIN in the initial user namespace.

Interaction of user namespaces and other types of namespaces
Starting in Linux 3.8, unprivileged processes can create user
namespaces, and the other types of namespaces can be created with just
the CAP_SYS_ADMIN capability in the caller's user namespace.

When a nonuser namespace is created, it is owned by the user namespace
in which the creating process was a member at the time of the creation
of the namespace. Actions on the nonuser namespace require
capabilities in the corresponding user namespace.

If CLONE_NEWUSER is specified along with other CLONE_NEW* flags in a
single clone(2) or unshare(2) call, the user namespace is guaranteed to
be created first, giving the child (clone(2)) or caller (unshare(2))
privileges over the remaining namespaces created by the call. Thus, it
is possible for an unprivileged caller to specify this combination of
flags.

When a new namespace (other than a user namespace) is created via
clone(2) or unshare(2), the kernel records the user namespace of the
creating process as the owner of the new namespace. (This association
can't be changed.) When a process in the new namespace subsequently
performs privileged operations that operate on global resources
isolated by the namespace, the permission checks are performed
according to the process's capabilities in the user namespace that the
kernel associated with the new namespace. For example, suppose that a
process attempts to change the hostname (sethostname(2)), a resource
governed by the UTS namespace. In this case, the kernel will determine
which user namespace owns the process's UTS namespace, and check
whether the process has the required capability (CAP_SYS_ADMIN) in that
user namespace.

The NS_GET_USERNS ioctl(2) operation can be used to discover the user
namespace that owns a nonuser namespace; see ioctl_ns(2).

User and group ID mappings: uid_map and gid_map
When a user namespace is created, it starts out without a mapping of
user IDs (group IDs) to the parent user namespace. The
/proc/[pid]/uid_map and /proc/[pid]/gid_map files (available since
Linux 3.5) expose the mappings for user and group IDs inside the user
namespace for the process pid. These files can be read to view the
mappings in a user namespace and written to (once) to define the
mappings.

The description in the following paragraphs explains the details for
uid_map; gid_map is exactly the same, but each instance of "user ID" is
replaced by "group ID".

The uid_map file exposes the mapping of user IDs from the user
namespace of the process pid to the user namespace of the process that
opened uid_map (but see a qualification to this point below). In other
words, processes that are in different user namespaces will potentially
see different values when reading from a particular uid_map file,
depending on the user ID mappings for the user namespaces of the
reading processes.

Each line in the uid_map file specifies a 1-to-1 mapping of a range of
contiguous user IDs between two user namespaces. (When a user
namespace is first created, this file is empty.) The specification in
each line takes the form of three numbers delimited by white space.
The first two numbers specify the starting user ID in each of the two
user namespaces. The third number specifies the length of the mapped
range. In detail, the fields are interpreted as follows:

(1) The start of the range of user IDs in the user namespace of the
    process pid.

(2) The start of the range of user IDs to which the user IDs specified
    by field one map. How field two is interpreted depends on whether
    the process that opened uid_map and the process pid are in the same
    user namespace, as follows:

    a) If the two processes are in different user namespaces: field two
       is the start of a range of user IDs in the user namespace of the
       process that opened uid_map.

    b) If the two processes are in the same user namespace: field two
       is the start of the range of user IDs in the parent user
       namespace of the process pid. This case enables the opener of
       uid_map (the common case here is opening /proc/self/uid_map) to
       see the mapping of user IDs into the user namespace of the
       process that created this user namespace.

(3) The length of the range of user IDs that is mapped between the two
    user namespaces.
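
The way the three fields define a translation can be sketched as a
lookup over a list of ranges. This is an illustrative sketch only: the
struct id_extent type, the ID_UNMAPPED constant, and the two functions
are hypothetical names, not kernel or libc interfaces.

```c
#include <stddef.h>
#include <stdint.h>

#define ID_UNMAPPED UINT32_MAX      /* Stands for "no mapping" here */

/* One uid_map (or gid_map) line; hypothetical type for illustration. */
struct id_extent {
    uint32_t inside;    /* Field 1: start of range inside the namespace */
    uint32_t outside;   /* Field 2: start of range outside the namespace */
    uint32_t count;     /* Field 3: length of the range */
};

/* Translate an ID used inside the namespace to the outer namespace. */
static uint32_t
map_to_outside(const struct id_extent *map, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++)
        if (id >= map[i].inside && id - map[i].inside < map[i].count)
            return map[i].outside + (id - map[i].inside);
    return ID_UNMAPPED;
}

/* Translate an outer ID to the value seen inside the namespace. */
static uint32_t
map_to_inside(const struct id_extent *map, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++)
        if (id >= map[i].outside && id - map[i].outside < map[i].count)
            return map[i].inside + (id - map[i].outside);
    return ID_UNMAPPED;
}
```

For example, with the two lines "0 1000 1" and "1 100000 65535", ID 0
inside the namespace corresponds to ID 1000 outside it, and ID 100001
outside corresponds to ID 2 inside.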

System calls that return user IDs (group IDs), for example, getuid(2),
getgid(2), and the credential fields in the structure returned by
stat(2), return the user ID (group ID) mapped into the caller's user
namespace.

When a process accesses a file, its user and group IDs are mapped into
the initial user namespace for the purpose of permission checking and
assigning IDs when creating a file. When a process retrieves file user
and group IDs via stat(2), the IDs are mapped in the opposite
direction, to produce values relative to the process user and group ID
mappings.

The initial user namespace has no parent namespace, but, for
consistency, the kernel provides dummy user and group ID mapping files
for this namespace. Looking at the uid_map file (gid_map is the same)
from a shell in the initial namespace shows:

    $ cat /proc/$$/uid_map
             0          0 4294967295

This mapping tells us that the range starting at user ID 0 in this
namespace maps to a range starting at 0 in the (nonexistent) parent
namespace, and the length of the range is the largest 32-bit unsigned
integer. This leaves 4294967295 (the 32-bit signed -1 value) unmapped.
This is deliberate: (uid_t) -1 is used in several interfaces (e.g.,
setreuid(2)) as a way to specify "no user ID". Leaving (uid_t) -1
unmapped and unusable guarantees that there will be no confusion when
using these interfaces.
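
The arithmetic behind this can be checked with a short sketch that
parses a map line and computes the last ID the range covers. The
function names are hypothetical and chosen only for this illustration.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Parse one "inside outside count" line as found in uid_map or gid_map.
   Returns 0 on success, or -1 if the line does not have three numbers. */
static int
parse_map_line(const char *line, uint32_t *inside, uint32_t *outside,
               uint32_t *count)
{
    if (sscanf(line, "%" SCNu32 " %" SCNu32 " %" SCNu32,
               inside, outside, count) != 3)
        return -1;
    return 0;
}

/* Last ID covered by a range that starts at 'start' and spans 'count'
   IDs ('count' must be greater than 0). The sum is computed in 64 bits
   so that it cannot wrap around. */
static uint64_t
last_mapped_id(uint32_t start, uint32_t count)
{
    return (uint64_t) start + count - 1;
}
```

For the line shown above, the range starts at 0 and has length
4294967295, so the last mapped ID is 4294967294, one less than the
value of (uid_t) -1.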

Defining user and group ID mappings: writing to uid_map and gid_map
After the creation of a new user namespace, the uid_map file of one of
the processes in the namespace may be written to once to define the
mapping of user IDs in the new user namespace. An attempt to write
more than once to a uid_map file in a user namespace fails with the
error EPERM. Similar rules apply for gid_map files.

The lines written to uid_map (gid_map) must conform to the following
rules:

* The three fields must be valid numbers, and the last field must be
  greater than 0.

* Lines are terminated by newline characters.

* There is a limit on the number of lines in the file. In Linux 4.14
  and earlier, this limit was (arbitrarily) set at 5 lines. Since
  Linux 4.15, the limit is 340 lines. In addition, the number of
  bytes written to the file must be less than the system page size,
  and the write must be performed at the start of the file (i.e.,
  lseek(2) and pwrite(2) can't be used to write to nonzero offsets in
  the file).

* The range of user IDs (group IDs) specified in each line cannot
  overlap with the ranges in any other lines. In the initial
  implementation (Linux 3.8), this requirement was satisfied by a
  simplistic implementation that imposed the further requirement that
  the values in both field 1 and field 2 of successive lines must be
  in ascending numerical order, which prevented some otherwise valid
  maps from being created. Linux 3.9 and later fix this limitation,
  allowing any valid set of nonoverlapping maps.

* At least one line must be written to the file.

Writes that violate the above rules fail with the error EINVAL.
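
A program that builds a mapping can check most of these rules before
writing. The following is a minimal sketch of such a check, assuming
the Linux 4.15 line limit; struct id_extent, MAX_MAP_LINES, and both
functions are hypothetical names, and the byte-count and file-offset
rules are not modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MAP_LINES 340   /* Limit since Linux 4.15 (5 before that) */

/* One map line; hypothetical type used only in this sketch. */
struct id_extent {
    uint32_t inside;        /* Field 1 */
    uint32_t outside;       /* Field 2 */
    uint32_t count;         /* Field 3 */
};

/* Do the ranges [a, a + a_count) and [b, b + b_count) intersect? */
static bool
ranges_overlap(uint32_t a, uint32_t a_count, uint32_t b, uint32_t b_count)
{
    uint64_t a_end = (uint64_t) a + a_count;    /* One past the last ID */
    uint64_t b_end = (uint64_t) b + b_count;

    return a < b_end && b < a_end;
}

/* Check the line-count, nonzero-length, and no-overlap rules. */
static bool
map_is_valid(const struct id_extent *map, size_t n)
{
    if (n == 0 || n > MAX_MAP_LINES)
        return false;

    for (size_t i = 0; i < n; i++) {
        if (map[i].count == 0)
            return false;

        /* Compare against every earlier line, on both sides of the map */
        for (size_t j = 0; j < i; j++) {
            if (ranges_overlap(map[i].inside, map[i].count,
                               map[j].inside, map[j].count) ||
                ranges_overlap(map[i].outside, map[i].count,
                               map[j].outside, map[j].count))
                return false;
        }
    }

    return true;
}
```

The sketch does not require lines to be in ascending order, which
matches the behavior described for Linux 3.9 and later.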

In order for a process to write to the /proc/[pid]/uid_map
(/proc/[pid]/gid_map) file, all of the following requirements must be
met:

1. The writing process must have the CAP_SETUID (CAP_SETGID) capability
   in the user namespace of the process pid.

2. The writing process must either be in the user namespace of the
   process pid or be in the parent user namespace of the process pid.

3. The mapped user IDs (group IDs) must in turn have a mapping in the
   parent user namespace.

4. One of the following two cases applies:

   * Either the writing process has the CAP_SETUID (CAP_SETGID)
     capability in the parent user namespace.

     + No further restrictions apply: the process can make mappings
       to arbitrary user IDs (group IDs) in the parent user
       namespace.

   * Or otherwise all of the following restrictions apply:

     + The data written to uid_map (gid_map) must consist of a single
       line that maps the writing process's effective user ID (group
       ID) in the parent user namespace to a user ID (group ID) in
       the user namespace.

     + The writing process must have the same effective user ID as
       the process that created the user namespace.

     + In the case of gid_map, use of the setgroups(2) system call
       must first be denied by writing "deny" to the
       /proc/[pid]/setgroups file (see below) before writing to
       gid_map.

Writes that violate the above rules fail with the error EPERM.
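
The four requirements can be read as a single decision procedure. The
sketch below restates them as a C predicate over precomputed facts; it
is a reading aid under that assumption, not the kernel's
implementation, and struct map_write and map_write_permitted() are
hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Facts about one attempted write to uid_map or gid_map (hypothetical
   type; the caller is assumed to have established each fact). */
struct map_write {
    bool     cap_in_ns;          /* 1: CAP_SETUID/CAP_SETGID in the
                                       target user namespace */
    bool     in_ns_or_parent;    /* 2: writer is in that namespace or
                                       in its parent */
    bool     ids_mapped_above;   /* 3: outside IDs are themselves mapped
                                       in the parent namespace */
    bool     cap_in_parent;      /* 4, first case: CAP_SETUID/CAP_SETGID
                                       in the parent namespace */
    size_t   lines;              /* Number of lines being written */
    uint32_t outside_id;         /* Field 2 of the first line */
    uint32_t count;              /* Field 3 of the first line */
    uint32_t writer_eid;         /* Writer's effective UID (GID for
                                    gid_map) in the parent namespace */
    bool     euid_matches_owner; /* Writer's effective UID equals that
                                    of the namespace creator */
    bool     is_gid_map;         /* Writing gid_map rather than uid_map */
    bool     setgroups_denied;   /* "deny" already written to setgroups */
};

/* Return true if the write passes the permission rules (else EPERM). */
static bool
map_write_permitted(const struct map_write *w)
{
    if (!w->cap_in_ns || !w->in_ns_or_parent || !w->ids_mapped_above)
        return false;

    if (w->cap_in_parent)       /* Privileged case: arbitrary mappings */
        return true;

    /* Unprivileged case: one line that maps only the writer's own ID */
    if (w->lines != 1 || w->count != 1 || w->outside_id != w->writer_eid)
        return false;
    if (!w->euid_matches_owner)
        return false;
    if (w->is_gid_map && !w->setgroups_denied)
        return false;

    return true;
}
```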

Interaction with system calls that change process UIDs or GIDs
In a user namespace where the uid_map file has not been written, the
system calls that change user IDs will fail. Similarly, if the gid_map
file has not been written, the system calls that change group IDs will
fail. After the uid_map and gid_map files have been written, only the
mapped values may be used in system calls that change user and group
IDs.

For user IDs, the relevant system calls include setuid(2), setfsuid(2),
setreuid(2), and setresuid(2). For group IDs, the relevant system
calls include setgid(2), setfsgid(2), setregid(2), setresgid(2), and
setgroups(2).

Writing "deny" to the /proc/[pid]/setgroups file before writing to
/proc/[pid]/gid_map will permanently disable setgroups(2) in a user
namespace and allow writing to /proc/[pid]/gid_map without having the
CAP_SETGID capability in the parent user namespace.

The /proc/[pid]/setgroups file
The /proc/[pid]/setgroups file displays the string "allow" if processes
in the user namespace that contains the process pid are permitted to
employ the setgroups(2) system call; it displays "deny" if setgroups(2)
is not permitted in that user namespace. Note that regardless of the
value in the /proc/[pid]/setgroups file (and regardless of the
process's capabilities), calls to setgroups(2) are also not permitted
if /proc/[pid]/gid_map has not yet been set.

A privileged process (one with the CAP_SYS_ADMIN capability in the
namespace) may write either of the strings "allow" or "deny" to this
file before writing a group ID mapping for this user namespace to the
file /proc/[pid]/gid_map. Writing the string "deny" prevents any
process in the user namespace from employing setgroups(2).

The essence of the restrictions described in the preceding paragraph is
that it is permitted to write to /proc/[pid]/setgroups only so long as
calling setgroups(2) is disallowed because /proc/[pid]/gid_map has not
been set. This ensures that a process cannot transition from a state
where setgroups(2) is allowed to a state where setgroups(2) is denied;
a process can transition only from setgroups(2) being disallowed to
setgroups(2) being allowed.

The default value of this file in the initial user namespace is
"allow".

Once /proc/[pid]/gid_map has been written to (which has the effect of
enabling setgroups(2) in the user namespace), it is no longer possible
to disallow setgroups(2) by writing "deny" to /proc/[pid]/setgroups
(the write fails with the error EPERM).

A child user namespace inherits the /proc/[pid]/setgroups setting from
its parent.

If the setgroups file has the value "deny", then the setgroups(2)
system call can't subsequently be reenabled (by writing "allow" to the
file) in this user namespace. (Attempts to do so fail with the error
EPERM.) This restriction also propagates down to all child user
namespaces of this user namespace.

The /proc/[pid]/setgroups file was added in Linux 3.19, but was
backported to many earlier stable kernel series, because it addresses a
security issue. The issue concerned files with permissions such as
"rwx---rwx". Such files give fewer permissions to "group" than they do
to "other". This means that dropping groups using setgroups(2) might
allow a process file access that it did not formerly have. Before the
existence of user namespaces this was not a concern, since only a
privileged process (one with the CAP_SETGID capability) could call
setgroups(2). However, with the introduction of user namespaces, it
became possible for an unprivileged process to create a new namespace
in which the user had all privileges. This then allowed formerly
unprivileged users to drop groups and thus gain file access that they
did not previously have. The /proc/[pid]/setgroups file was added to
address this security issue, by denying any pathway for an unprivileged
process to drop groups with setgroups(2).
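
The "rwx---rwx" problem follows from the way the traditional permission
check selects exactly one of the owner, group, and other classes. The
sketch below is a simplified model of that check (it ignores
capabilities, ACLs, and the distinction between the effective and
supplementary group IDs); may_access() is a hypothetical function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Simplified model of the traditional UNIX permission check. 'want' is
   a mask of the bits 4 (read), 2 (write), and 1 (execute). 'groups' is
   the full list of groups that the process belongs to. */
static bool
may_access(mode_t mode, uid_t file_uid, gid_t file_gid,
           uid_t uid, const gid_t *groups, size_t ngroups, int want)
{
    bool in_group = false;
    int bits;

    for (size_t i = 0; i < ngroups; i++)
        if (groups[i] == file_gid)
            in_group = true;

    /* Exactly one class applies; later classes are never consulted */
    if (uid == file_uid)
        bits = (mode >> 6) & 7;     /* Owner class */
    else if (in_group)
        bits = (mode >> 3) & 7;     /* Group class */
    else
        bits = mode & 7;            /* Other class */

    return (bits & want) == want;
}
```

With a file of mode 0707, a process that is a member of the file's
group is denied read access, while the same process is granted access
once that group has been removed from its group list.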

Unmapped user and group IDs
There are various places where an unmapped user ID (group ID) may be
exposed to user space. For example, the first process in a new user
namespace may call getuid(2) before a user ID mapping has been defined
for the namespace. In most such cases, an unmapped user ID is
converted to the overflow user ID (group ID); the default value for the
overflow user ID (group ID) is 65534. See the descriptions of
/proc/sys/kernel/overflowuid and /proc/sys/kernel/overflowgid in
proc(5).

The cases where unmapped IDs are mapped in this fashion include system
calls that return user IDs (getuid(2), getgid(2), and similar),
credentials passed over a UNIX domain socket, credentials returned by
stat(2), waitid(2), and the System V IPC "ctl" IPC_STAT operations,
credentials exposed by /proc/[pid]/status and the files in
/proc/sysvipc/*, credentials returned via the si_uid field in the
siginfo_t received with a signal (see sigaction(2)), credentials
written to the process accounting file (see acct(5)), and credentials
returned with POSIX message queue notifications (see mq_notify(3)).

There is one notable case where unmapped user and group IDs are not
converted to the corresponding overflow ID value. When viewing a
uid_map or gid_map file in which there is no mapping for the second
field, that field is displayed as 4294967295 (-1 as an unsigned
integer).

Accessing files
In order to determine permissions when an unprivileged process accesses
a file, the process credentials (UID, GID) and the file credentials are
in effect mapped back to what they would be in the initial user
namespace and then compared to determine the permissions that the
process has on the file. The same is also true of other objects that
employ the credentials plus permissions mask accessibility model, such
as System V IPC objects.

Operation of file-related capabilities
Certain capabilities allow a process to bypass various kernel-enforced
restrictions when performing operations on files owned by other users
or groups. These capabilities are: CAP_CHOWN, CAP_DAC_OVERRIDE,
CAP_DAC_READ_SEARCH, CAP_FOWNER, and CAP_FSETID.

Within a user namespace, these capabilities allow a process to bypass
the rules if the process has the relevant capability over the file,
meaning that:

* the process has the relevant effective capability in its user
  namespace; and

* the file's user ID and group ID both have valid mappings in the user
  namespace.

The CAP_FOWNER capability is treated somewhat exceptionally: it allows
a process to bypass the corresponding rules so long as at least the
file's user ID has a mapping in the user namespace (i.e., the file's
group ID does not need to have a valid mapping).
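
The conditions above can be sketched as a predicate. This is a toy
model that assumes a single level of nesting, so that file IDs are
expressed in the parent namespace; struct id_extent, id_is_mapped(),
and capable_over_file() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One uid_map or gid_map line; hypothetical type for illustration. */
struct id_extent {
    uint32_t inside;    /* Start of range inside the namespace */
    uint32_t outside;   /* Start of range in the parent namespace */
    uint32_t count;     /* Length of the range */
};

/* Does an ID from the parent namespace have a mapping in the map? */
static bool
id_is_mapped(const struct id_extent *map, size_t n, uint32_t outer_id)
{
    for (size_t i = 0; i < n; i++)
        if (outer_id >= map[i].outside &&
                outer_id - map[i].outside < map[i].count)
            return true;
    return false;
}

/* Does a process have a file-related capability "over" a file?
   'is_cap_fowner' selects the exception described for CAP_FOWNER. */
static bool
capable_over_file(bool has_effective_cap, bool is_cap_fowner,
                  const struct id_extent *uid_map, size_t n_uid,
                  const struct id_extent *gid_map, size_t n_gid,
                  uint32_t file_uid, uint32_t file_gid)
{
    if (!has_effective_cap)
        return false;

    /* The file's user ID must be mapped in every case */
    if (!id_is_mapped(uid_map, n_uid, file_uid))
        return false;

    /* CAP_FOWNER does not need the file's group ID to be mapped */
    if (is_cap_fowner)
        return true;

    return id_is_mapped(gid_map, n_gid, file_gid);
}
```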

Set-user-ID and set-group-ID programs
When a process inside a user namespace executes a set-user-ID
(set-group-ID) program, the process's effective user (group) ID inside
the namespace is changed to whatever value is mapped for the user
(group) ID of the file. However, if either the user or the group ID of
the file has no mapping inside the namespace, the set-user-ID
(set-group-ID) bit is silently ignored: the new program is executed,
but the process's effective user (group) ID is left unchanged. (This
mirrors the semantics of executing a set-user-ID or set-group-ID
program that resides on a filesystem that was mounted with the
MS_NOSUID flag, as described in mount(2).)

Miscellaneous
When a process's user and group IDs are passed over a UNIX domain
socket to a process in a different user namespace (see the description
of SCM_CREDENTIALS in unix(7)), they are translated into the
corresponding values as per the receiving process's user and group ID
mappings.

CONFORMING TO
Namespaces are a Linux-specific feature.

NOTES
Over the years, many features have been added to the Linux kernel that
are available only to privileged users because of their potential to
confuse set-user-ID-root applications. In general, it becomes safe to
allow the root user in a user namespace to use those features because
it is impossible, while in a user namespace, to gain more privilege
than the root user of a user namespace has.

Availability
Use of user namespaces requires a kernel that is configured with the
CONFIG_USER_NS option. User namespaces require support in a range of
subsystems across the kernel. When an unsupported subsystem is
configured into the kernel, it is not possible to configure user
namespaces support.

As of Linux 3.8, most relevant subsystems supported user namespaces,
but a number of filesystems did not have the infrastructure needed to
map user and group IDs between user namespaces. Linux 3.9 added the
required infrastructure support for many of the remaining unsupported
filesystems (Plan 9 (9P), Andrew File System (AFS), Ceph, CIFS, CODA,
NFS, and OCFS2). Linux 3.12 added support for the last of the
unsupported major filesystems, XFS.

EXAMPLE
The program below is designed to allow experimenting with user
namespaces, as well as other types of namespaces. It creates
namespaces as specified by command-line options and then executes a
command inside those namespaces. The comments and usage() function
inside the program provide a full explanation of the program. The
following shell session demonstrates its use.

First, we look at the run-time environment:

    $ uname -rs     # Need Linux 3.8 or later
    Linux 3.8.0
    $ id -u         # Running as unprivileged user
    1000
    $ id -g
    1000

Now start a new shell in new user (-U), mount (-m), and PID (-p)
namespaces, with user ID (-M) and group ID (-G) 1000 mapped to 0 inside
the user namespace:

    $ ./userns_child_exec -p -m -U -M '0 1000 1' -G '0 1000 1' bash

The shell has PID 1, because it is the first process in the new PID
namespace:

    bash$ echo $$
    1

Mounting a new /proc filesystem and listing all of the processes
visible in the new PID namespace shows that the shell can't see any
processes outside the PID namespace:

    bash$ mount -t proc proc /proc
    bash$ ps ax
      PID TTY      STAT   TIME COMMAND
        1 pts/3    S      0:00 bash
       22 pts/3    R+     0:00 ps ax

Inside the user namespace, the shell has user and group ID 0, and a
full set of permitted and effective capabilities:

    bash$ cat /proc/$$/status | egrep '^[UG]id'
    Uid:    0    0    0    0
    Gid:    0    0    0    0
    bash$ cat /proc/$$/status | egrep '^Cap(Prm|Inh|Eff)'
    CapInh: 0000000000000000
    CapPrm: 0000001fffffffff
    CapEff: 0000001fffffffff
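
The hexadecimal masks in this output can be decoded with a few helper
functions. The following is a small sketch; the function names and the
EXAMPLE_CAP_SYS_ADMIN macro are hypothetical, and the macro simply
repeats the value that <linux/capability.h> assigns to CAP_SYS_ADMIN.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define EXAMPLE_CAP_SYS_ADMIN 21    /* Value of CAP_SYS_ADMIN */

/* Parse a hexadecimal mask such as the value of a CapEff line in
   /proc/PID/status. Returns 0 on success, -1 on error. */
static int
parse_cap_mask(const char *hex, uint64_t *mask)
{
    char *end;
    unsigned long long v;

    errno = 0;
    v = strtoull(hex, &end, 16);
    if (errno != 0 || end == hex)
        return -1;

    *mask = v;
    return 0;
}

/* Is capability number 'cap' present in 'mask'? */
static bool
cap_is_set(uint64_t mask, int cap)
{
    return cap >= 0 && cap < 64 && ((mask >> cap) & 1) != 0;
}

/* Count the capabilities present in 'mask'. */
static int
cap_count(uint64_t mask)
{
    int n = 0;

    for (; mask != 0; mask >>= 1)
        n += (int) (mask & 1);
    return n;
}
```

For the value 0000001fffffffff shown above, the low-order 37 bits are
set (capability numbers 0 through 36), and these include
CAP_SYS_ADMIN.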

Program source

/* userns_child_exec.c

   Licensed under GNU General Public License v2 or later

   Create a child process that executes a shell command in new
   namespace(s); allow UID and GID mappings to be specified when
   creating a user namespace.
*/
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <signal.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <errno.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

struct child_args {
    char **argv;        /* Command to be executed by child, with args */
    int    pipe_fd[2];  /* Pipe used to synchronize parent and child */
};

static int verbose;

static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] cmd [arg...]\n\n", pname);
    fprintf(stderr, "Create a child process that executes a shell "
            "command in a new user namespace,\n"
            "and possibly also other new namespace(s).\n\n");
    fprintf(stderr, "Options can be:\n\n");
#define fpe(str) fprintf(stderr, "    %s", str)
    fpe("-i          New IPC namespace\n");
    fpe("-m          New mount namespace\n");
    fpe("-n          New network namespace\n");
    fpe("-p          New PID namespace\n");
    fpe("-u          New UTS namespace\n");
    fpe("-U          New user namespace\n");
    fpe("-M uid_map  Specify UID map for user namespace\n");
    fpe("-G gid_map  Specify GID map for user namespace\n");
    fpe("-z          Map user's UID and GID to 0 in user namespace\n");
    fpe("            (equivalent to: -M '0 <uid> 1' -G '0 <gid> 1')\n");
    fpe("-v          Display verbose messages\n");
    fpe("\n");
    fpe("If -z, -M, or -G is specified, -U is required.\n");
    fpe("It is not permitted to specify both -z and either -M or -G.\n");
    fpe("\n");
    fpe("Map strings for -M and -G consist of records of the form:\n");
    fpe("\n");
    fpe("    ID-inside-ns   ID-outside-ns   len\n");
    fpe("\n");
    fpe("A map string can contain multiple records, separated"
        " by commas;\n");
    fpe("the commas are replaced by newlines before writing"
        " to map files.\n");

    exit(EXIT_FAILURE);
}

/* Update the mapping file 'map_file', with the value provided in
   'mapping', a string that defines a UID or GID mapping. A UID or
   GID mapping consists of one or more newline-delimited records
   of the form:

       ID-inside-ns    ID-outside-ns   length

   Requiring the user to supply a string that contains newlines is
   of course inconvenient for command-line use. Thus, we permit the
   use of commas to delimit records in this string, and replace them
   with newlines before writing the string to the file. */

static void
update_map(char *mapping, char *map_file)
{
    int fd;
    size_t j;
    size_t map_len;     /* Length of 'mapping' */

    /* Replace commas in mapping string with newlines */

    map_len = strlen(mapping);
    for (j = 0; j < map_len; j++)
        if (mapping[j] == ',')
            mapping[j] = '\n';

    fd = open(map_file, O_RDWR);
    if (fd == -1) {
        fprintf(stderr, "ERROR: open %s: %s\n", map_file,
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    /* The whole mapping must be written in a single write() call */

    if (write(fd, mapping, map_len) != (ssize_t) map_len) {
        fprintf(stderr, "ERROR: write %s: %s\n", map_file,
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    close(fd);
}
663
664 /* Linux 3.19 made a change in the handling of setgroups(2) and the
665 'gid_map' file to address a security issue. The issue allowed
666 *unprivileged* users to employ user namespaces in order to drop
667 The upshot of the 3.19 changes is that in order to update the
668 'gid_maps' file, use of the setgroups() system call in this
669 user namespace must first be disabled by writing "deny" to one of
670 the /proc/PID/setgroups files for this namespace. That is the
671 purpose of the following function. */
672
static void
proc_setgroups_write(pid_t child_pid, char *str)
{
    char setgroups_path[PATH_MAX];
    int fd;

    snprintf(setgroups_path, PATH_MAX, "/proc/%ld/setgroups",
            (long) child_pid);

    fd = open(setgroups_path, O_RDWR);
    if (fd == -1) {

        /* We may be on a system that doesn't support
           /proc/PID/setgroups. In that case, the file won't exist,
           and the system won't impose the restrictions that Linux 3.19
           added. That's fine: we don't need to do anything in order
           to permit 'gid_map' to be updated.

           However, if the error from open() was something other than
           the ENOENT error that is expected for that case, let the
           user know. */

        if (errno != ENOENT)
            fprintf(stderr, "ERROR: open %s: %s\n", setgroups_path,
                    strerror(errno));
        return;
    }

    if (write(fd, str, strlen(str)) == -1)
        fprintf(stderr, "ERROR: write %s: %s\n", setgroups_path,
                strerror(errno));

    close(fd);
}

static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct child_args *args = (struct child_args *) arg;
    char ch;

    /* Wait until the parent has updated the UID and GID mappings.
       See the comment in main(). We wait for end of file on a
       pipe that will be closed by the parent process once it has
       updated the mappings. */

    close(args->pipe_fd[1]);    /* Close our descriptor for the write
                                   end of the pipe so that we see EOF
                                   when parent closes its descriptor */
    if (read(args->pipe_fd[0], &ch, 1) != 0) {
        fprintf(stderr,
                "Failure in child: read from pipe returned != 0\n");
        exit(EXIT_FAILURE);
    }

    close(args->pipe_fd[0]);

    /* Execute a shell command */

    printf("About to exec %s\n", args->argv[0]);
    execvp(args->argv[0], args->argv);
    errExit("execvp");
}

#define STACK_SIZE (1024 * 1024)

static char child_stack[STACK_SIZE];    /* Space for child's stack */

int
main(int argc, char *argv[])
{
    int flags, opt, map_zero;
    pid_t child_pid;
    struct child_args args;
    char *uid_map, *gid_map;
    const int MAP_BUF_SIZE = 100;
    char map_buf[MAP_BUF_SIZE];
    char map_path[PATH_MAX];

    /* Parse command-line options. The initial '+' character in
       the final getopt() argument prevents GNU-style permutation
       of command-line options. That's useful, since sometimes
       the 'command' to be executed by this program itself
       has command-line options. We don't want getopt() to treat
       those as options to this program. */

    flags = 0;
    verbose = 0;
    gid_map = NULL;
    uid_map = NULL;
    map_zero = 0;
    while ((opt = getopt(argc, argv, "+imnpuUM:G:zv")) != -1) {
        switch (opt) {
        case 'i': flags |= CLONE_NEWIPC;        break;
        case 'm': flags |= CLONE_NEWNS;         break;
        case 'n': flags |= CLONE_NEWNET;        break;
        case 'p': flags |= CLONE_NEWPID;        break;
        case 'u': flags |= CLONE_NEWUTS;        break;
        case 'v': verbose = 1;                  break;
        case 'z': map_zero = 1;                 break;
        case 'M': uid_map = optarg;             break;
        case 'G': gid_map = optarg;             break;
        case 'U': flags |= CLONE_NEWUSER;       break;
        default:  usage(argv[0]);
        }
    }

    /* -M, -G, or -z without -U is nonsensical, as is combining -z
       with -M or -G */

    if (((uid_map != NULL || gid_map != NULL || map_zero) &&
                !(flags & CLONE_NEWUSER)) ||
            (map_zero && (uid_map != NULL || gid_map != NULL)))
        usage(argv[0]);

    args.argv = &argv[optind];

    /* We use a pipe to synchronize the parent and child, in order to
       ensure that the parent sets the UID and GID maps before the child
       calls execve(). This ensures that the child maintains its
       capabilities during the execve() in the common case where we
       want to map the child's effective user ID to 0 in the new user
       namespace. Without this synchronization, the child would lose
       its capabilities if it performed an execve() with nonzero
       user IDs (see the capabilities(7) man page for details of the
       transformation of a process's capabilities during execve()). */

    if (pipe(args.pipe_fd) == -1)
        errExit("pipe");

    /* Create the child in new namespace(s) */

    child_pid = clone(childFunc, child_stack + STACK_SIZE,
                      flags | SIGCHLD, &args);
    if (child_pid == -1)
        errExit("clone");

    /* Parent falls through to here */

    if (verbose)
        printf("%s: PID of child created by clone() is %ld\n",
                argv[0], (long) child_pid);

    /* Update the UID and GID maps in the child */

    if (uid_map != NULL || map_zero) {
        snprintf(map_path, PATH_MAX, "/proc/%ld/uid_map",
                (long) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getuid());
            uid_map = map_buf;
        }
        update_map(uid_map, map_path);
    }

    if (gid_map != NULL || map_zero) {
        proc_setgroups_write(child_pid, "deny");

        snprintf(map_path, PATH_MAX, "/proc/%ld/gid_map",
                (long) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getgid());
            gid_map = map_buf;
        }
        update_map(gid_map, map_path);
    }

    /* Close the write end of the pipe, to signal to the child that we
       have updated the UID and GID maps */

    close(args.pipe_fd[1]);

    if (waitpid(child_pid, NULL, 0) == -1)      /* Wait for child */
        errExit("waitpid");

    if (verbose)
        printf("%s: terminating\n", argv[0]);

    exit(EXIT_SUCCESS);
}

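The map strings written by update_map() above consist of records of the form "ID-inside-ns ID-outside-ns length". As a minimal sketch of how such a record can be read back and applied, the following fragment parses one record and translates an ID seen inside the namespace to the corresponding ID outside it. The names struct map_record, parse_map_record(), and map_inside_to_outside() are hypothetical helpers introduced only for this illustration and are not part of the program above. The fragment assumes that a record maps the contiguous range of 'length' IDs starting at ID-inside-ns onto the range starting at ID-outside-ns.

```c
#include <assert.h>
#include <stdio.h>

/* One record of a uid_map or gid_map file, in the
   "ID-inside-ns ID-outside-ns length" form (hypothetical type) */

struct map_record {
    unsigned long inside;   /* First ID of the range inside the namespace */
    unsigned long outside;  /* First ID of the range outside the namespace */
    unsigned long length;   /* Number of consecutive IDs in the range */
};

/* Parse one record such as "0 1000 1". Returns 0 on success, or -1 if
   the line does not start with three unsigned integers. */

static int
parse_map_record(const char *line, struct map_record *rec)
{
    assert(line != NULL && rec != NULL);

    if (sscanf(line, "%lu %lu %lu", &rec->inside, &rec->outside,
               &rec->length) != 3)
        return -1;
    return 0;
}

/* Translate 'id', as seen inside the namespace, to the ID outside it.
   Returns 0 and stores the result in '*out' if 'id' falls within the
   range covered by the record, or -1 if the record does not map 'id'. */

static int
map_inside_to_outside(const struct map_record *rec, unsigned long id,
                      unsigned long *out)
{
    if (id < rec->inside || id - rec->inside >= rec->length)
        return -1;

    *out = rec->outside + (id - rec->inside);
    return 0;
}
```

For example, with the record "0 1000 1" (the kind of string that the -z option builds for a caller whose user ID is 1000), ID 0 inside the namespace translates to ID 1000 outside it, while ID 1 is not mapped by that record.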
SEE ALSO
newgidmap(1), newuidmap(1), clone(2), ptrace(2), setns(2), unshare(2),
proc(5), subgid(5), subuid(5), capabilities(7), cgroup_namespaces(7),
credentials(7), namespaces(7), pid_namespaces(7)

The kernel source file Documentation/namespaces/resource-control.txt.

COLOPHON
This page is part of release 5.02 of the Linux man-pages project. A
description of the project, information about reporting bugs, and the
latest version of this page, can be found at
https://www.kernel.org/doc/man-pages/.

Linux 2019-08-02 USER_NAMESPACES(7)