USER_NAMESPACES(7)        Linux Programmer's Manual       USER_NAMESPACES(7)

NAME
user_namespaces - overview of Linux user namespaces

DESCRIPTION
For an overview of namespaces, see namespaces(7).

User namespaces isolate security-related identifiers and attributes, in
particular, user IDs and group IDs (see credentials(7)), the root
directory, keys (see keyrings(7)), and capabilities (see
capabilities(7)). A process's user and group IDs can be different
inside and outside a user namespace. In particular, a process can have
a normal unprivileged user ID outside a user namespace while at the
same time having a user ID of 0 inside the namespace; in other words,
the process has full privileges for operations inside the user
namespace, but is unprivileged for operations outside the namespace.

Nested namespaces, namespace membership
User namespaces can be nested; that is, each user namespace, except the
initial ("root") namespace, has a parent user namespace, and can have
zero or more child user namespaces. The parent user namespace is the
user namespace of the process that creates the user namespace via a
call to unshare(2) or clone(2) with the CLONE_NEWUSER flag.

The kernel imposes (since version 3.11) a limit of 32 nested levels of
user namespaces. Calls to unshare(2) or clone(2) that would cause this
limit to be exceeded fail with the error EUSERS.

Each process is a member of exactly one user namespace. A process
created via fork(2) or clone(2) without the CLONE_NEWUSER flag is a
member of the same user namespace as its parent. A single-threaded
process can join another user namespace with setns(2) if it has the
CAP_SYS_ADMIN capability in that namespace; upon doing so, it gains a
full set of capabilities in that namespace.

A call to clone(2) or unshare(2) with the CLONE_NEWUSER flag makes the
new child process (for clone(2)) or the caller (for unshare(2)) a
member of the new user namespace created by the call.

The NS_GET_PARENT ioctl(2) operation can be used to discover the
parental relationship between user namespaces; see ioctl_ns(2).

Capabilities
The child process created by clone(2) with the CLONE_NEWUSER flag
starts out with a complete set of capabilities in the new user
namespace. Likewise, a process that creates a new user namespace using
unshare(2) or joins an existing user namespace using setns(2) gains a
full set of capabilities in that namespace. On the other hand, that
process has no capabilities in the parent (in the case of clone(2)) or
previous (in the case of unshare(2) and setns(2)) user namespace, even
if the new namespace is created or joined by the root user (i.e., a
process with user ID 0 in the root namespace).

Note that a call to execve(2) will cause a process's capabilities to be
recalculated in the usual way (see capabilities(7)). Consequently,
unless the process has a user ID of 0 within the namespace, or the
executable file has a nonempty inheritable capabilities mask, the
process will lose all capabilities. See the discussion of user and
group ID mappings, below.

A call to clone(2) or unshare(2) using the CLONE_NEWUSER flag or a call
to setns(2) that moves the caller into another user namespace sets the
"securebits" flags (see capabilities(7)) to their default values (all
flags disabled) in the child (for clone(2)) or caller (for unshare(2)
or setns(2)). Note that because the caller no longer has capabilities
in its original user namespace after a call to setns(2), it is not
possible for a process to reset its "securebits" flags while retaining
its user namespace membership by using a pair of setns(2) calls to move
to another user namespace and then return to its original user
namespace.

The rules for determining whether or not a process has a capability in
a particular user namespace are as follows:

1. A process has a capability inside a user namespace if it is a member
   of that namespace and it has the capability in its effective
   capability set. A process can gain capabilities in its effective
   capability set in various ways. For example, it may execute a
   set-user-ID program or an executable with associated file
   capabilities. In addition, a process may gain capabilities via the
   effect of clone(2), unshare(2), or setns(2), as already described.

2. If a process has a capability in a user namespace, then it has that
   capability in all child (and further removed descendant) namespaces
   as well.

3. When a user namespace is created, the kernel records the effective
   user ID of the creating process as being the "owner" of the
   namespace. A process that resides in the parent of the user
   namespace and whose effective user ID matches the owner of the
   namespace has all capabilities in the namespace. By virtue of the
   previous rule, this means that the process has all capabilities in
   all further removed descendant user namespaces as well. The
   NS_GET_OWNER_UID ioctl(2) operation can be used to discover the user
   ID of the owner of the namespace; see ioctl_ns(2).

Effect of capabilities within a user namespace
Having a capability inside a user namespace permits a process to
perform operations (that require privilege) only on resources governed
by that namespace. In other words, having a capability in a user
namespace permits a process to perform privileged operations on
resources that are governed by (nonuser) namespaces owned by
(associated with) the user namespace (see the next subsection).

On the other hand, there are many privileged operations that affect
resources that are not associated with any namespace type, for example,
changing the system (i.e., calendar) time (governed by CAP_SYS_TIME),
loading a kernel module (governed by CAP_SYS_MODULE), and creating a
device (governed by CAP_MKNOD). Only a process with privileges in the
initial user namespace can perform such operations.

Holding CAP_SYS_ADMIN within the user namespace that owns a process's
mount namespace allows that process to create bind mounts and mount the
following types of filesystems:

* /proc (since Linux 3.8)
* /sys (since Linux 3.8)
* devpts (since Linux 3.9)
* tmpfs(5) (since Linux 3.9)
* ramfs (since Linux 3.9)
* mqueue (since Linux 3.9)
* bpf (since Linux 4.4)

Holding CAP_SYS_ADMIN within the user namespace that owns a process's
cgroup namespace allows (since Linux 4.6) that process to mount the
cgroup version 2 filesystem and cgroup version 1 named hierarchies
(i.e., cgroup filesystems mounted with the "none,name=" option).

Holding CAP_SYS_ADMIN within the user namespace that owns a process's
PID namespace allows (since Linux 3.8) that process to mount /proc
filesystems.

Note, however, that mounting block-based filesystems can be done only
by a process that holds CAP_SYS_ADMIN in the initial user namespace.

Interaction of user namespaces and other types of namespaces
Starting in Linux 3.8, unprivileged processes can create user
namespaces, and the other types of namespaces can be created with just
the CAP_SYS_ADMIN capability in the caller's user namespace.

When a nonuser namespace is created, it is owned by the user namespace
in which the creating process was a member at the time of the creation
of the namespace. Privileged operations on resources governed by the
nonuser namespace require that the process has the necessary
capabilities in the user namespace that owns the nonuser namespace.

If CLONE_NEWUSER is specified along with other CLONE_NEW* flags in a
single clone(2) or unshare(2) call, the user namespace is guaranteed to
be created first, giving the child (clone(2)) or caller (unshare(2))
privileges over the remaining namespaces created by the call. Thus, it
is possible for an unprivileged caller to specify this combination of
flags.

When a new namespace (other than a user namespace) is created via
clone(2) or unshare(2), the kernel records the user namespace of the
creating process as the owner of the new namespace. (This association
can't be changed.) When a process in the new namespace subsequently
performs privileged operations that operate on global resources
isolated by the namespace, the permission checks are performed
according to the process's capabilities in the user namespace that the
kernel associated with the new namespace. For example, suppose that a
process attempts to change the hostname (sethostname(2)), a resource
governed by the UTS namespace. In this case, the kernel will determine
which user namespace owns the process's UTS namespace, and check
whether the process has the required capability (CAP_SYS_ADMIN) in that
user namespace.

The NS_GET_USERNS ioctl(2) operation can be used to discover the user
namespace that owns a nonuser namespace; see ioctl_ns(2).

User and group ID mappings: uid_map and gid_map
When a user namespace is created, it starts out without a mapping of
user IDs (group IDs) to the parent user namespace. The
/proc/[pid]/uid_map and /proc/[pid]/gid_map files (available since
Linux 3.5) expose the mappings for user and group IDs inside the user
namespace for the process pid. These files can be read to view the
mappings in a user namespace and written to (once) to define the
mappings.

The description in the following paragraphs explains the details for
uid_map; gid_map is exactly the same, but each instance of "user ID" is
replaced by "group ID".

The uid_map file exposes the mapping of user IDs from the user
namespace of the process pid to the user namespace of the process that
opened uid_map (but see a qualification to this point below). In other
words, processes that are in different user namespaces will potentially
see different values when reading from a particular uid_map file,
depending on the user ID mappings for the user namespaces of the
reading processes.

Each line in the uid_map file specifies a 1-to-1 mapping of a range of
contiguous user IDs between two user namespaces. (When a user
namespace is first created, this file is empty.) The specification in
each line takes the form of three numbers delimited by white space.
The first two numbers specify the starting user ID in each of the two
user namespaces. The third number specifies the length of the mapped
range. In detail, the fields are interpreted as follows:

(1) The start of the range of user IDs in the user namespace of the
    process pid.

(2) The start of the range of user IDs to which the user IDs specified
    by field one map. How field two is interpreted depends on whether
    the process that opened uid_map and the process pid are in the same
    user namespace, as follows:

    a) If the two processes are in different user namespaces: field two
       is the start of a range of user IDs in the user namespace of the
       process that opened uid_map.

    b) If the two processes are in the same user namespace: field two
       is the start of the range of user IDs in the parent user
       namespace of the process pid. This case enables the opener of
       uid_map (the common case here is opening /proc/self/uid_map) to
       see the mapping of user IDs into the user namespace of the
       process that created this user namespace.

(3) The length of the range of user IDs that is mapped between the two
    user namespaces.

System calls that return user IDs (group IDs), for example, getuid(2),
getgid(2), and the credential fields in the structure returned by
stat(2), return the user ID (group ID) mapped into the caller's user
namespace.

When a process accesses a file, its user and group IDs are mapped into
the initial user namespace for the purpose of permission checking and
assigning IDs when creating a file. When a process retrieves file user
and group IDs via stat(2), the IDs are mapped in the opposite
direction, to produce values relative to the process user and group ID
mappings.

The initial user namespace has no parent namespace, but, for
consistency, the kernel provides dummy user and group ID mapping files
for this namespace. Looking at the uid_map file (gid_map is the same)
from a shell in the initial namespace shows:

    $ cat /proc/$$/uid_map
             0          0 4294967295

This mapping tells us that the range starting at user ID 0 in this
namespace maps to a range starting at 0 in the (nonexistent) parent
namespace, and the length of the range is the largest 32-bit unsigned
integer. This leaves 4294967295 (the 32-bit signed -1 value) unmapped.
This is deliberate: (uid_t) -1 is used in several interfaces (e.g.,
setreuid(2)) as a way to specify "no user ID". Leaving (uid_t) -1
unmapped and unusable guarantees that there will be no confusion when
using these interfaces.

Defining user and group ID mappings: writing to uid_map and gid_map
After the creation of a new user namespace, the uid_map file of one of
the processes in the namespace may be written to once to define the
mapping of user IDs in the new user namespace. An attempt to write
more than once to a uid_map file in a user namespace fails with the
error EPERM. Similar rules apply for gid_map files.

The lines written to uid_map (gid_map) must conform to the following
rules:

* The three fields must be valid numbers, and the last field must be
  greater than 0.

* Lines are terminated by newline characters.

* There is a limit on the number of lines in the file. In Linux 4.14
  and earlier, this limit was (arbitrarily) set at 5 lines. Since
  Linux 4.15, the limit is 340 lines. In addition, the number of
  bytes written to the file must be less than the system page size,
  and the write must be performed at the start of the file (i.e.,
  lseek(2) and pwrite(2) can't be used to write to nonzero offsets in
  the file).

* The range of user IDs (group IDs) specified in each line cannot
  overlap with the ranges in any other lines. In the initial
  implementation (Linux 3.8), this requirement was satisfied by a
  simplistic implementation that imposed the further requirement that
  the values in both field 1 and field 2 of successive lines must be
  in ascending numerical order, which prevented some otherwise valid
  maps from being created. Linux 3.9 and later fix this limitation,
  allowing any valid set of nonoverlapping maps.

* At least one line must be written to the file.

Writes that violate the above rules fail with the error EINVAL.

In order for a process to write to the /proc/[pid]/uid_map
(/proc/[pid]/gid_map) file, all of the following requirements must be
met:

1. The writing process must have the CAP_SETUID (CAP_SETGID) capability
   in the user namespace of the process pid.

2. The writing process must either be in the user namespace of the
   process pid or be in the parent user namespace of the process pid.

3. The mapped user IDs (group IDs) must in turn have a mapping in the
   parent user namespace.

4. One of the following two cases applies:

   * Either the writing process has the CAP_SETUID (CAP_SETGID)
     capability in the parent user namespace.

     + No further restrictions apply: the process can make mappings
       to arbitrary user IDs (group IDs) in the parent user
       namespace.

   * Or otherwise all of the following restrictions apply:

     + The data written to uid_map (gid_map) must consist of a single
       line that maps the writing process's effective user ID (group
       ID) in the parent user namespace to a user ID (group ID) in
       the user namespace.

     + The writing process must have the same effective user ID as
       the process that created the user namespace.

     + In the case of gid_map, use of the setgroups(2) system call
       must first be denied by writing "deny" to the
       /proc/[pid]/setgroups file (see below) before writing to
       gid_map.

Writes that violate the above rules fail with the error EPERM.

Interaction with system calls that change process UIDs or GIDs
In a user namespace where the uid_map file has not been written, the
system calls that change user IDs will fail. Similarly, if the gid_map
file has not been written, the system calls that change group IDs will
fail. After the uid_map and gid_map files have been written, only the
mapped values may be used in system calls that change user and group
IDs.

For user IDs, the relevant system calls include setuid(2), setfsuid(2),
setreuid(2), and setresuid(2). For group IDs, the relevant system
calls include setgid(2), setfsgid(2), setregid(2), setresgid(2), and
setgroups(2).

Writing "deny" to the /proc/[pid]/setgroups file before writing to
/proc/[pid]/gid_map will permanently disable setgroups(2) in a user
namespace and allow writing to /proc/[pid]/gid_map without having the
CAP_SETGID capability in the parent user namespace.

The /proc/[pid]/setgroups file
The /proc/[pid]/setgroups file displays the string "allow" if processes
in the user namespace that contains the process pid are permitted to
employ the setgroups(2) system call; it displays "deny" if setgroups(2)
is not permitted in that user namespace. Note that regardless of the
value in the /proc/[pid]/setgroups file (and regardless of the
process's capabilities), calls to setgroups(2) are also not permitted
if /proc/[pid]/gid_map has not yet been set.

A privileged process (one with the CAP_SYS_ADMIN capability in the
namespace) may write either of the strings "allow" or "deny" to this
file before writing a group ID mapping for this user namespace to the
file /proc/[pid]/gid_map. Writing the string "deny" prevents any
process in the user namespace from employing setgroups(2).

The essence of the restrictions described in the preceding paragraph is
that it is permitted to write to /proc/[pid]/setgroups only so long as
calling setgroups(2) is disallowed because /proc/[pid]/gid_map has not
been set. This ensures that a process cannot transition from a state
where setgroups(2) is allowed to a state where setgroups(2) is denied;
a process can transition only from setgroups(2) being disallowed to
setgroups(2) being allowed.

The default value of this file in the initial user namespace is
"allow".

Once /proc/[pid]/gid_map has been written to (which has the effect of
enabling setgroups(2) in the user namespace), it is no longer possible
to disallow setgroups(2) by writing "deny" to /proc/[pid]/setgroups
(the write fails with the error EPERM).

A child user namespace inherits the /proc/[pid]/setgroups setting from
its parent.

If the setgroups file has the value "deny", then the setgroups(2)
system call can't subsequently be reenabled (by writing "allow" to the
file) in this user namespace. (Attempts to do so fail with the error
EPERM.) This restriction also propagates down to all child user
namespaces of this user namespace.

The /proc/[pid]/setgroups file was added in Linux 3.19, but was
backported to many earlier stable kernel series, because it addresses a
security issue. The issue concerned files with permissions such as
"rwx---rwx". Such files give fewer permissions to "group" than they do
to "other". This means that dropping groups using setgroups(2) might
allow a process file access that it did not formerly have. Before the
existence of user namespaces this was not a concern, since only a
privileged process (one with the CAP_SETGID capability) could call
setgroups(2). However, with the introduction of user namespaces, it
became possible for an unprivileged process to create a new namespace
in which the user had all privileges. This then allowed formerly
unprivileged users to drop groups and thus gain file access that they
did not previously have. The /proc/[pid]/setgroups file was added to
address this security issue, by denying any pathway for an unprivileged
process to drop groups with setgroups(2).

Unmapped user and group IDs
There are various places where an unmapped user ID (group ID) may be
exposed to user space. For example, the first process in a new user
namespace may call getuid(2) before a user ID mapping has been defined
for the namespace. In most such cases, an unmapped user ID is
converted to the overflow user ID (group ID); the default value for the
overflow user ID (group ID) is 65534. See the descriptions of
/proc/sys/kernel/overflowuid and /proc/sys/kernel/overflowgid in
proc(5).

The cases where unmapped IDs are mapped in this fashion include system
calls that return user IDs (getuid(2), getgid(2), and similar),
credentials passed over a UNIX domain socket, credentials returned by
stat(2), waitid(2), and the System V IPC "ctl" IPC_STAT operations,
credentials exposed by /proc/[pid]/status and the files in
/proc/sysvipc/*, credentials returned via the si_uid field in the
siginfo_t received with a signal (see sigaction(2)), credentials
written to the process accounting file (see acct(5)), and credentials
returned with POSIX message queue notifications (see mq_notify(3)).

There is one notable case where unmapped user and group IDs are not
converted to the corresponding overflow ID value. When viewing a
uid_map or gid_map file in which there is no mapping for the second
field, that field is displayed as 4294967295 (-1 as an unsigned
integer).

Accessing files
In order to determine permissions when an unprivileged process accesses
a file, the process credentials (UID, GID) and the file credentials are
in effect mapped back to what they would be in the initial user
namespace and then compared to determine the permissions that the
process has on the file. The same is also true of other objects that
employ the credentials plus permissions mask accessibility model, such
as System V IPC objects.

Operation of file-related capabilities
Certain capabilities allow a process to bypass various kernel-enforced
restrictions when performing operations on files owned by other users
or groups. These capabilities are: CAP_CHOWN, CAP_DAC_OVERRIDE,
CAP_DAC_READ_SEARCH, CAP_FOWNER, and CAP_FSETID.

Within a user namespace, these capabilities allow a process to bypass
the rules if the process has the relevant capability over the file,
meaning that:

* the process has the relevant effective capability in its user
  namespace; and

* the file's user ID and group ID both have valid mappings in the user
  namespace.

The CAP_FOWNER capability is treated somewhat exceptionally: it allows
a process to bypass the corresponding rules so long as at least the
file's user ID has a mapping in the user namespace (i.e., the file's
group ID does not need to have a valid mapping).

Set-user-ID and set-group-ID programs
When a process inside a user namespace executes a set-user-ID
(set-group-ID) program, the process's effective user (group) ID inside
the namespace is changed to whatever value is mapped for the user
(group) ID of the file. However, if either the user or the group ID of
the file has no mapping inside the namespace, the set-user-ID
(set-group-ID) bit is silently ignored: the new program is executed,
but the process's effective user (group) ID is left unchanged. (This
mirrors the semantics of executing a set-user-ID or set-group-ID
program that resides on a filesystem that was mounted with the
MS_NOSUID flag, as described in mount(2).)

Miscellaneous
When a process's user and group IDs are passed over a UNIX domain
socket to a process in a different user namespace (see the description
of SCM_CREDENTIALS in unix(7)), they are translated into the
corresponding values as per the receiving process's user and group ID
mappings.

CONFORMING TO
Namespaces are a Linux-specific feature.

NOTES
Over the years, many features have been added to the Linux kernel that
were made available only to privileged users because of their potential
to confuse set-user-ID-root applications. In general, it becomes safe
to allow the root user in a user namespace to use those features
because it is impossible, while in a user namespace, to gain more
privilege than the root user of a user namespace has.

Availability
Use of user namespaces requires a kernel that is configured with the
CONFIG_USER_NS option. User namespaces require support in a range of
subsystems across the kernel. When an unsupported subsystem is
configured into the kernel, it is not possible to configure user
namespaces support.

As of Linux 3.8, most relevant subsystems supported user namespaces,
but a number of filesystems did not have the infrastructure needed to
map user and group IDs between user namespaces. Linux 3.9 added the
required infrastructure support for many of the remaining unsupported
filesystems (Plan 9 (9P), Andrew File System (AFS), Ceph, CIFS, CODA,
NFS, and OCFS2). Linux 3.12 added support for the last of the
unsupported major filesystems, XFS.

EXAMPLES
The program below is designed to allow experimenting with user
namespaces, as well as other types of namespaces. It creates
namespaces as specified by command-line options and then executes a
command inside those namespaces. The comments and usage() function
inside the program provide a full explanation of the program. The
following shell session demonstrates its use.

First, we look at the run-time environment:

    $ uname -rs     # Need Linux 3.8 or later
    Linux 3.8.0
    $ id -u         # Running as unprivileged user
    1000
    $ id -g
    1000

Now start a new shell in new user (-U), mount (-m), and PID (-p)
namespaces, with user ID (-M) and group ID (-G) 1000 mapped to 0 inside
the user namespace:

    $ ./userns_child_exec -p -m -U -M '0 1000 1' -G '0 1000 1' bash

The shell has PID 1, because it is the first process in the new PID
namespace:

    bash$ echo $$
    1

Mounting a new /proc filesystem and listing all of the processes
visible in the new PID namespace shows that the shell can't see any
processes outside the PID namespace:

    bash$ mount -t proc proc /proc
    bash$ ps ax
      PID TTY      STAT   TIME COMMAND
        1 pts/3    S      0:00 bash
       22 pts/3    R+     0:00 ps ax

Inside the user namespace, the shell has user and group ID 0, and a
full set of permitted and effective capabilities:

    bash$ cat /proc/$$/status | egrep '^[UG]id'
    Uid:    0    0    0    0
    Gid:    0    0    0    0
    bash$ cat /proc/$$/status | egrep '^Cap(Prm|Inh|Eff)'
    CapInh: 0000000000000000
    CapPrm: 0000001fffffffff
    CapEff: 0000001fffffffff

Program source

/* userns_child_exec.c

   Licensed under GNU General Public License v2 or later

   Create a child process that executes a shell command in new
   namespace(s); allow UID and GID mappings to be specified when
   creating a user namespace.
*/
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <signal.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <errno.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

struct child_args {
    char **argv;        /* Command to be executed by child, with args */
    int    pipe_fd[2];  /* Pipe used to synchronize parent and child */
};

static int verbose;

static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] cmd [arg...]\n\n", pname);
    fprintf(stderr, "Create a child process that executes a shell "
            "command in a new user namespace,\n"
            "and possibly also other new namespace(s).\n\n");
    fprintf(stderr, "Options can be:\n\n");
#define fpe(str) fprintf(stderr, "    %s", str);
    fpe("-i          New IPC namespace\n");
    fpe("-m          New mount namespace\n");
    fpe("-n          New network namespace\n");
    fpe("-p          New PID namespace\n");
    fpe("-u          New UTS namespace\n");
    fpe("-U          New user namespace\n");
    fpe("-M uid_map  Specify UID map for user namespace\n");
    fpe("-G gid_map  Specify GID map for user namespace\n");
    fpe("-z          Map user's UID and GID to 0 in user namespace\n");
    fpe("            (equivalent to: -M '0 <uid> 1' -G '0 <gid> 1')\n");
    fpe("-v          Display verbose messages\n");
    fpe("\n");
    fpe("If -z, -M, or -G is specified, -U is required.\n");
    fpe("It is not permitted to specify both -z and either -M or -G.\n");
    fpe("\n");
    fpe("Map strings for -M and -G consist of records of the form:\n");
    fpe("\n");
    fpe("    ID-inside-ns   ID-outside-ns   len\n");
    fpe("\n");
    fpe("A map string can contain multiple records, separated"
        " by commas;\n");
    fpe("the commas are replaced by newlines before writing"
        " to map files.\n");

    exit(EXIT_FAILURE);
}

/* Update the mapping file 'map_file', with the value provided in
   'mapping', a string that defines a UID or GID mapping. A UID or
   GID mapping consists of one or more newline-delimited records
   of the form:

       ID-inside-ns    ID-outside-ns   length

   Requiring the user to supply a string that contains newlines is
   of course inconvenient for command-line use. Thus, we permit the
   use of commas to delimit records in this string, and replace them
   with newlines before writing the string to the file. */

static void
update_map(char *mapping, char *map_file)
{
    int fd;
    size_t map_len;     /* Length of 'mapping' */

    /* Replace commas in mapping string with newlines */

    map_len = strlen(mapping);
    for (size_t j = 0; j < map_len; j++)
        if (mapping[j] == ',')
            mapping[j] = '\n';

    fd = open(map_file, O_RDWR);
    if (fd == -1) {
        fprintf(stderr, "ERROR: open %s: %s\n", map_file,
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    /* The whole map must be written with a single write() call */

    if (write(fd, mapping, map_len) != (ssize_t) map_len) {
        fprintf(stderr, "ERROR: write %s: %s\n", map_file,
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    close(fd);
}

/* Linux 3.19 made a change in the handling of setgroups(2) and the
   'gid_map' file to address a security issue. The issue allowed
   *unprivileged* users to employ user namespaces in order to drop
   groups. The upshot of the 3.19 changes is that in order to update
   the 'gid_map' file, use of the setgroups() system call in this
   user namespace must first be disabled by writing "deny" to one of
   the /proc/PID/setgroups files for this namespace. That is the
   purpose of the following function. */

static void
proc_setgroups_write(pid_t child_pid, char *str)
{
    char setgroups_path[PATH_MAX];
    int fd;

    snprintf(setgroups_path, PATH_MAX, "/proc/%jd/setgroups",
            (intmax_t) child_pid);

    fd = open(setgroups_path, O_RDWR);
    if (fd == -1) {

        /* We may be on a system that doesn't support
           /proc/PID/setgroups. In that case, the file won't exist,
           and the system won't impose the restrictions that Linux 3.19
           added. That's fine: we don't need to do anything in order
           to permit 'gid_map' to be updated.

           However, if the error from open() was something other than
           the ENOENT error that is expected for that case, let the
           user know. */

        if (errno != ENOENT)
            fprintf(stderr, "ERROR: open %s: %s\n", setgroups_path,
                strerror(errno));
        return;
    }

    if (write(fd, str, strlen(str)) == -1)
        fprintf(stderr, "ERROR: write %s: %s\n", setgroups_path,
            strerror(errno));

    close(fd);
}

static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct child_args *args = arg;
    char ch;

    /* Wait until the parent has updated the UID and GID mappings.
       See the comment in main(). We wait for end of file on a
       pipe that will be closed by the parent process once it has
       updated the mappings. */

    close(args->pipe_fd[1]);    /* Close our descriptor for the write
                                   end of the pipe so that we see EOF
                                   when parent closes its descriptor */
    if (read(args->pipe_fd[0], &ch, 1) != 0) {
        fprintf(stderr,
                "Failure in child: read from pipe returned != 0\n");
        exit(EXIT_FAILURE);
    }

    close(args->pipe_fd[0]);

    /* Execute a shell command */

    printf("About to exec %s\n", args->argv[0]);
    execvp(args->argv[0], args->argv);
    errExit("execvp");
}

#define STACK_SIZE (1024 * 1024)

static char child_stack[STACK_SIZE];    /* Space for child's stack */

int
main(int argc, char *argv[])
{
    int flags, opt, map_zero;
    pid_t child_pid;
    struct child_args args;
    char *uid_map, *gid_map;
    const int MAP_BUF_SIZE = 100;
    char map_buf[MAP_BUF_SIZE];
    char map_path[PATH_MAX];

    /* Parse command-line options. The initial '+' character in
       the final getopt() argument prevents GNU-style permutation
       of command-line options. That's useful, since sometimes
       the 'command' to be executed by this program itself
       has command-line options. We don't want getopt() to treat
       those as options to this program. */

    flags = 0;
    verbose = 0;
    gid_map = NULL;
    uid_map = NULL;
    map_zero = 0;
    while ((opt = getopt(argc, argv, "+imnpuUM:G:zv")) != -1) {
        switch (opt) {
        case 'i': flags |= CLONE_NEWIPC;        break;
        case 'm': flags |= CLONE_NEWNS;         break;
        case 'n': flags |= CLONE_NEWNET;        break;
        case 'p': flags |= CLONE_NEWPID;        break;
        case 'u': flags |= CLONE_NEWUTS;        break;
        case 'v': verbose = 1;                  break;
        case 'z': map_zero = 1;                 break;
        case 'M': uid_map = optarg;             break;
        case 'G': gid_map = optarg;             break;
        case 'U': flags |= CLONE_NEWUSER;       break;
        default:  usage(argv[0]);
        }
    }

    /* -M, -G, or -z without -U is nonsensical, as is -z combined
       with -M or -G */

    if (((uid_map != NULL || gid_map != NULL || map_zero) &&
                !(flags & CLONE_NEWUSER)) ||
            (map_zero && (uid_map != NULL || gid_map != NULL)))
        usage(argv[0]);

    args.argv = &argv[optind];

    /* We use a pipe to synchronize the parent and child, in order to
       ensure that the parent sets the UID and GID maps before the child
       calls execve(). This ensures that the child maintains its
       capabilities during the execve() in the common case where we
       want to map the child's effective user ID to 0 in the new user
       namespace. Without this synchronization, the child would lose
       its capabilities if it performed an execve() with nonzero
       user IDs (see the capabilities(7) man page for details of the
       transformation of a process's capabilities during execve()). */

    if (pipe(args.pipe_fd) == -1)
        errExit("pipe");

    /* Create the child in new namespace(s) */

    child_pid = clone(childFunc, child_stack + STACK_SIZE,
                      flags | SIGCHLD, &args);
    if (child_pid == -1)
        errExit("clone");

    /* Parent falls through to here */

    if (verbose)
        printf("%s: PID of child created by clone() is %jd\n",
                argv[0], (intmax_t) child_pid);

    /* Update the UID and GID maps in the child */

    if (uid_map != NULL || map_zero) {
        snprintf(map_path, PATH_MAX, "/proc/%jd/uid_map",
                (intmax_t) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %jd 1",
                    (intmax_t) getuid());
            uid_map = map_buf;
        }
        update_map(uid_map, map_path);
    }

    if (gid_map != NULL || map_zero) {
        proc_setgroups_write(child_pid, "deny");

        snprintf(map_path, PATH_MAX, "/proc/%jd/gid_map",
                (intmax_t) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %jd 1",
                    (intmax_t) getgid());
            gid_map = map_buf;
        }
        update_map(gid_map, map_path);
    }

    /* Close the write end of the pipe, to signal to the child that we
       have updated the UID and GID maps */

    close(args.pipe_fd[1]);

    if (waitpid(child_pid, NULL, 0) == -1)      /* Wait for child */
        errExit("waitpid");

    if (verbose)
        printf("%s: terminating\n", argv[0]);

    exit(EXIT_SUCCESS);
}
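
The record format that update_map() writes can be checked before the
string is written to a map file. The following is a minimal sketch and
is not part of the program above: count_map_records() is a hypothetical
helper that replaces commas with newlines in the same way as
update_map() and then checks that each record consists of exactly three
numeric fields. It makes no attempt to reproduce the kernel's own
validation of the map (for example, the check for overlapping ranges),
and it assumes that no record is longer than 63 bytes.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration, not part of
   the example program): replace commas with newlines in 'mapping',
   as update_map() does, then check that every record has the form
   "ID-inside-ns ID-outside-ns length".

   Returns the number of records, or -1 if any record is malformed
   or longer than the local buffer. */

static int
count_map_records(char *mapping)
{
    int nrec = 0;
    char *line = mapping;

    /* Same comma-to-newline conversion as update_map() */

    for (size_t j = 0; mapping[j] != '\0'; j++)
        if (mapping[j] == ',')
            mapping[j] = '\n';

    while (*line != '\0') {
        unsigned long inside, outside, length;
        char extra;
        char rec[64];
        char *end = strchr(line, '\n');
        size_t len = (end != NULL) ? (size_t) (end - line) : strlen(line);

        if (len >= sizeof(rec))
            return -1;
        memcpy(rec, line, len);
        rec[len] = '\0';

        /* Exactly three numeric fields are expected; a fourth
           token (caught by %c) makes the record invalid */

        if (sscanf(rec, "%lu %lu %lu %c",
                   &inside, &outside, &length, &extra) != 3)
            return -1;
        nrec++;

        line += len;
        if (*line == '\n')      /* Skip the record delimiter */
            line++;
    }

    return nrec;
}
```

For example, given the string "0 1000 1,1 100000 65536", the function
rewrites the comma as a newline and returns 2; given "0 1000", it
returns -1.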
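
The pipe-based synchronization used by main() and childFunc() does not
itself depend on namespaces. The following is a minimal sketch of the
same handshake that uses fork(2) in place of clone(2), so that it can be
run without privileges. The names sync_child_on_pipe() and parent_setup
are hypothetical and were introduced only for this illustration.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical illustration of the EOF-on-pipe handshake. The child
   blocks in read() until the parent has finished its setup work and
   closed the write end of the pipe.

   Returns the child's exit status (0 if the child saw end of file,
   as expected), or -1 on error. */

static int
sync_child_on_pipe(void (*parent_setup)(pid_t child))
{
    int pipe_fd[2];
    int status;
    pid_t child;
    char ch;

    if (pipe(pipe_fd) == -1)
        return -1;

    child = fork();
    if (child == -1) {
        close(pipe_fd[0]);
        close(pipe_fd[1]);
        return -1;
    }

    if (child == 0) {
        close(pipe_fd[1]);      /* Otherwise we would never see EOF */

        /* read() returns 0 (EOF) once the parent closes its
           descriptor for the write end of the pipe */

        _exit(read(pipe_fd[0], &ch, 1) == 0 ? 0 : 1);
    }

    /* Parent: do the work that must precede the child's next step
       (in the program above, writing the UID and GID maps) */

    close(pipe_fd[0]);
    if (parent_setup != NULL)
        parent_setup(child);

    close(pipe_fd[1]);          /* Signal the child */

    if (waitpid(child, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child closes its own copy of the write end first. If it did not,
the pipe would still have an open writer after the parent's close(),
and the child's read() would block indefinitely.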
SEE ALSO
newgidmap(1), newuidmap(1), clone(2), ptrace(2), setns(2), unshare(2),
proc(5), subgid(5), subuid(5), capabilities(7), cgroup_namespaces(7),
credentials(7), namespaces(7), pid_namespaces(7)

The kernel source file Documentation/namespaces/resource-control.txt.

COLOPHON
This page is part of release 5.10 of the Linux man-pages project. A
description of the project, information about reporting bugs, and the
latest version of this page, can be found at
https://www.kernel.org/doc/man-pages/.

Linux 2020-11-01 USER_NAMESPACES(7)