user_namespaces(7)

USER_NAMESPACES(7)         Linux Programmer's Manual        USER_NAMESPACES(7)

NAME

       user_namespaces - overview of Linux user namespaces

DESCRIPTION

       For an overview of namespaces, see namespaces(7).

       User namespaces isolate security-related identifiers and attributes, in
       particular, user IDs and group IDs (see credentials(7)), the root
       directory, keys (see keyrings(7)), and capabilities (see
       capabilities(7)).  A process's user and group IDs can be different
       inside and outside a user namespace.  In particular, a process can have
       a normal unprivileged user ID outside a user namespace while at the same
       time having a user ID of 0 inside the namespace; in other words, the
       process has full privileges for operations inside the user namespace,
       but is unprivileged for operations outside the namespace.

   Nested namespaces, namespace membership
       User namespaces can be nested; that is, each user namespace, except the
       initial ("root") namespace, has a parent user namespace, and can have
       zero or more child user namespaces.  The parent user namespace is the
       user namespace of the process that creates the user namespace via a
       call to unshare(2) or clone(2) with the CLONE_NEWUSER flag.

       The kernel imposes (since version 3.11) a limit of 32 nested levels of
       user namespaces.  Calls to unshare(2) or clone(2) that would cause this
       limit to be exceeded fail with the error EUSERS.

       Each process is a member of exactly one user namespace.  A process
       created via fork(2) or clone(2) without the CLONE_NEWUSER flag is a
       member of the same user namespace as its parent.  A single-threaded
       process can join another user namespace with setns(2) if it has the
       CAP_SYS_ADMIN capability in that namespace; upon doing so, it gains a
       full set of capabilities in that namespace.

       A call to clone(2) or unshare(2) with the CLONE_NEWUSER flag makes the
       new child process (for clone(2)) or the caller (for unshare(2)) a
       member of the new user namespace created by the call.

       The NS_GET_PARENT ioctl(2) operation can be used to discover the
       parental relationship between user namespaces; see ioctl_ns(2).

   Capabilities
       The child process created by clone(2) with the CLONE_NEWUSER flag
       starts out with a complete set of capabilities in the new user
       namespace.  Likewise, a process that creates a new user namespace using
       unshare(2) or joins an existing user namespace using setns(2) gains a
       full set of capabilities in that namespace.  On the other hand, that
       process has no capabilities in the parent (in the case of clone(2)) or
       previous (in the case of unshare(2) and setns(2)) user namespace, even
       if the new namespace is created or joined by the root user (i.e., a
       process with user ID 0 in the root namespace).

       Note that a call to execve(2) will cause a process's capabilities to be
       recalculated in the usual way (see capabilities(7)).  Consequently,
       unless the process has a user ID of 0 within the namespace, or the
       executable file has a nonempty inheritable capabilities mask, the
       process will lose all capabilities.  See the discussion of user and
       group ID mappings, below.

       A call to clone(2), unshare(2), or setns(2) using the CLONE_NEWUSER
       flag sets the "securebits" flags (see capabilities(7)) to their default
       values (all flags disabled) in the child (for clone(2)) or caller (for
       unshare(2) or setns(2)).  Note that because the caller no longer has
       capabilities in its original user namespace after a call to setns(2),
       it is not possible for a process to reset its "securebits" flags while
       retaining its user namespace membership by using a pair of setns(2)
       calls to move to another user namespace and then return to its original
       user namespace.

       The rules for determining whether or not a process has a capability in
       a particular user namespace are as follows:

       1. A process has a capability inside a user namespace if it is a member
          of that namespace and it has the capability in its effective
          capability set.  A process can gain capabilities in its effective
          capability set in various ways.  For example, it may execute a
          set-user-ID program or an executable with associated file
          capabilities.  In addition, a process may gain capabilities via the
          effect of clone(2), unshare(2), or setns(2), as already described.

       2. If a process has a capability in a user namespace, then it has that
          capability in all child (and further removed descendant) namespaces
          as well.

       3. When a user namespace is created, the kernel records the effective
          user ID of the creating process as being the "owner" of the
          namespace.  A process that resides in the parent of the user
          namespace and whose effective user ID matches the owner of the
          namespace has all capabilities in the namespace.  By virtue of the
          previous rule, this means that the process has all capabilities in
          all further removed descendant user namespaces as well.  The
          NS_GET_OWNER_UID ioctl(2) operation can be used to discover the user
          ID of the owner of the namespace; see ioctl_ns(2).
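
       The three rules above can be sketched as a small model.  This is an
       illustration only: struct model_userns, struct model_proc, and
       model_has_cap() are hypothetical names invented for this sketch, not
       kernel or libc interfaces, and user IDs are compared as plain numbers,
       which glosses over ID mapping.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical model types for illustration; not kernel structures. */
struct model_userns {
    const struct model_userns *parent;  /* NULL for the initial namespace */
    uid_t owner;                        /* effective UID of the creator */
};

struct model_proc {
    const struct model_userns *ns;      /* namespace the process is in */
    uid_t euid;                         /* effective user ID */
    bool cap_effective;                 /* capability is in effective set */
};

/* Return true if 'p' has the capability in 'target', per rules 1 to 3. */
bool
model_has_cap(const struct model_proc *p, const struct model_userns *target)
{
    const struct model_userns *ns;

    /* Walk from 'target' up toward the initial namespace. */
    for (ns = target; ns != NULL; ns = ns->parent) {
        /* Rules 1 and 2: 'p' is a member of 'target' or of one of its
           ancestors, so its effective set decides. */
        if (ns == p->ns)
            return p->cap_effective;

        /* Rule 3 (extended to descendants by rule 2): 'p' resides in
           the parent of 'ns' and its effective UID owns 'ns'. */
        if (ns->parent == p->ns && ns->owner == p->euid)
            return true;
    }

    /* 'target' is neither p's namespace nor a descendant of it. */
    return false;
}
```

       In this model, a process in the initial namespace with effective UID
       1000 and an empty effective set still has the capability in a child
       namespace owned by UID 1000, and in that child's descendants, but not
       in the initial namespace itself.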

   Effect of capabilities within a user namespace
       Having a capability inside a user namespace permits a process to
       perform operations (that require privilege) only on resources governed
       by that namespace.  In other words, having a capability in a user
       namespace permits a process to perform privileged operations on
       resources that are governed by (nonuser) namespaces associated with the
       user namespace (see the next subsection).

       On the other hand, there are many privileged operations that affect
       resources that are not associated with any namespace type, for example,
       changing the system time (governed by CAP_SYS_TIME), loading a kernel
       module (governed by CAP_SYS_MODULE), and creating a device (governed by
       CAP_MKNOD).  Only a process with privileges in the initial user
       namespace can perform such operations.

       Holding CAP_SYS_ADMIN within the user namespace associated with a
       process's mount namespace allows that process to create bind mounts and
       mount the following types of filesystems:

           * /proc (since Linux 3.8)
           * /sys (since Linux 3.8)
           * devpts (since Linux 3.9)
           * tmpfs(5) (since Linux 3.9)
           * ramfs (since Linux 3.9)
           * mqueue (since Linux 3.9)
           * bpf (since Linux 4.4)

       Holding CAP_SYS_ADMIN within the user namespace associated with a
       process's cgroup namespace allows (since Linux 4.6) that process to
       mount the cgroup version 2 filesystem and cgroup version 1 named
       hierarchies (i.e., cgroup filesystems mounted with the "none,name="
       option).

       Holding CAP_SYS_ADMIN within the user namespace associated with a
       process's PID namespace allows (since Linux 3.8) that process to mount
       /proc filesystems.

       Note, however, that mounting block-based filesystems can be done only
       by a process that holds CAP_SYS_ADMIN in the initial user namespace.

   Interaction of user namespaces and other types of namespaces
       Starting in Linux 3.8, unprivileged processes can create user
       namespaces, and the other types of namespaces can be created with just
       the CAP_SYS_ADMIN capability in the caller's user namespace.

       When a non-user namespace is created, it is owned by the user namespace
       in which the creating process was a member at the time of the creation
       of the namespace.  Actions on the non-user namespace require
       capabilities in the corresponding user namespace.

       If CLONE_NEWUSER is specified along with other CLONE_NEW* flags in a
       single clone(2) or unshare(2) call, the user namespace is guaranteed to
       be created first, giving the child (clone(2)) or caller (unshare(2))
       privileges over the remaining namespaces created by the call.  Thus, it
       is possible for an unprivileged caller to specify this combination of
       flags.

       When a new namespace (other than a user namespace) is created via
       clone(2) or unshare(2), the kernel records the user namespace of the
       creating process against the new namespace.  (This association can't be
       changed.)  When a process in the new namespace subsequently performs
       privileged operations that operate on global resources isolated by the
       namespace, the permission checks are performed according to the
       process's capabilities in the user namespace that the kernel associated
       with the new namespace.  For example, suppose that a process attempts
       to change the hostname (sethostname(2)), a resource governed by the UTS
       namespace.  In this case, the kernel will determine which user
       namespace is associated with the process's UTS namespace, and check
       whether the process has the required capability (CAP_SYS_ADMIN) in that
       user namespace.

       The NS_GET_USERNS ioctl(2) operation can be used to discover the user
       namespace with which a non-user namespace is associated; see
       ioctl_ns(2).
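
       As a rough sketch of how NS_GET_USERNS might be used, the function
       below asks which user namespace owns a given namespace and compares it
       with the caller's own user namespace.  The function name
       owned_by_my_userns() is hypothetical; the sketch assumes that the
       kernel headers provide <linux/nsfs.h> and that /proc is mounted.  The
       ioctl can fail (for example, with EPERM) when the owning user
       namespace is outside the caller's scope.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/nsfs.h>     /* NS_GET_USERNS */
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return 1 if the namespace referred to by 'ns_path' (for example,
   "/proc/self/ns/uts") is owned by the caller's own user namespace,
   0 if it is owned by a different one, and -1 on error. */
int
owned_by_my_userns(const char *ns_path)
{
    struct stat owner_sb, mine_sb;
    int ns_fd, owner_fd, saved_errno, result = -1;

    ns_fd = open(ns_path, O_RDONLY);
    if (ns_fd == -1)
        return -1;

    /* On success, the ioctl returns a new file descriptor that refers
       to the owning user namespace. */
    owner_fd = ioctl(ns_fd, NS_GET_USERNS);
    saved_errno = errno;
    close(ns_fd);
    if (owner_fd == -1) {
        errno = saved_errno;
        return -1;
    }

    /* Two descriptors refer to the same namespace if their device and
       inode numbers match. */
    if (fstat(owner_fd, &owner_sb) == 0 &&
            stat("/proc/self/ns/user", &mine_sb) == 0)
        result = owner_sb.st_dev == mine_sb.st_dev &&
                 owner_sb.st_ino == mine_sb.st_ino;

    close(owner_fd);
    return result;
}
```

       For a process that has not changed any of its namespaces, a call such
       as owned_by_my_userns("/proc/self/ns/uts") would be expected to return
       1.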

   User and group ID mappings: uid_map and gid_map
       When a user namespace is created, it starts out without a mapping of
       user IDs (group IDs) to the parent user namespace.  The
       /proc/[pid]/uid_map and /proc/[pid]/gid_map files (available since
       Linux 3.5) expose the mappings for user and group IDs inside the user
       namespace for the process pid.  These files can be read to view the
       mappings in a user namespace and written to (once) to define the
       mappings.

       The description in the following paragraphs explains the details for
       uid_map; gid_map is exactly the same, but each instance of "user ID" is
       replaced by "group ID".

       The uid_map file exposes the mapping of user IDs from the user
       namespace of the process pid to the user namespace of the process that
       opened uid_map (but see a qualification to this point below).  In other
       words, processes that are in different user namespaces will potentially
       see different values when reading from a particular uid_map file,
       depending on the user ID mappings for the user namespaces of the
       reading processes.

       Each line in the uid_map file specifies a 1-to-1 mapping of a range of
       contiguous user IDs between two user namespaces.  (When a user
       namespace is first created, this file is empty.)  The specification in
       each line takes the form of three numbers delimited by white space.
       The first two numbers specify the starting user ID in each of the two
       user namespaces.  The third number specifies the length of the mapped
       range.  In detail, the fields are interpreted as follows:

       (1) The start of the range of user IDs in the user namespace of the
           process pid.

       (2) The start of the range of user IDs to which the user IDs specified
           by field one map.  How field two is interpreted depends on whether
           the process that opened uid_map and the process pid are in the same
           user namespace, as follows:

           a) If the two processes are in different user namespaces: field two
              is the start of a range of user IDs in the user namespace of the
              process that opened uid_map.

           b) If the two processes are in the same user namespace: field two
              is the start of the range of user IDs in the parent user
              namespace of the process pid.  This case enables the opener of
              uid_map (the common case here is opening /proc/self/uid_map) to
              see the mapping of user IDs into the user namespace of the
              process that created this user namespace.

       (3) The length of the range of user IDs that is mapped between the two
           user namespaces.
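
       The three-field format can be illustrated with a short sketch that
       parses one line and translates an ID from field one's namespace to
       field two's.  The names struct id_extent, parse_extent(), and
       map_id_out() are hypothetical and exist only for this illustration;
       the kernel's own lookup is not shown here.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One line of a uid_map or gid_map file (hypothetical type). */
struct id_extent {
    uint32_t inside;    /* field 1: start of range in namespace of pid */
    uint32_t outside;   /* field 2: start of range it maps to */
    uint32_t length;    /* field 3: number of IDs in the range */
};

/* Parse one map line; return 0 on success, -1 on malformed input. */
int
parse_extent(const char *line, struct id_extent *e)
{
    if (sscanf(line, "%" SCNu32 " %" SCNu32 " %" SCNu32,
               &e->inside, &e->outside, &e->length) != 3)
        return -1;
    return 0;
}

/* Translate 'id' through the 'n' extents in 'e'.  Return 0 and store
   the result in '*out', or return -1 if 'id' is unmapped. */
int
map_id_out(const struct id_extent *e, size_t n, uint32_t id, uint32_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (id >= e[i].inside && id - e[i].inside < e[i].length) {
            *out = e[i].outside + (id - e[i].inside);
            return 0;
        }
    }
    return -1;
}
```

       With the two lines "0 1000 1" and "1 100000 65535", ID 0 translates to
       1000 and ID 1000 translates to 100999, while ID 65536 falls outside
       both ranges and is unmapped.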

       System calls that return user IDs (group IDs), for example, getuid(2),
       getgid(2), and the credential fields in the structure returned by
       stat(2), return the user ID (group ID) mapped into the caller's user
       namespace.

       When a process accesses a file, its user and group IDs are mapped into
       the initial user namespace for the purpose of permission checking and
       assigning IDs when creating a file.  When a process retrieves file user
       and group IDs via stat(2), the IDs are mapped in the opposite
       direction, to produce values relative to the process user and group ID
       mappings.

       The initial user namespace has no parent namespace, but, for
       consistency, the kernel provides dummy user and group ID mapping files
       for this namespace.  Looking at the uid_map file (gid_map is the same)
       from a shell in the initial namespace shows:

           $ cat /proc/$$/uid_map
                    0          0 4294967295

       This mapping tells us that the range starting at user ID 0 in this
       namespace maps to a range starting at 0 in the (nonexistent) parent
       namespace, and the length of the range is the largest 32-bit unsigned
       integer.  This leaves 4294967295 (the 32-bit signed -1 value) unmapped.
       This is deliberate: (uid_t) -1 is used in several interfaces (e.g.,
       setreuid(2)) as a way to specify "no user ID".  Leaving (uid_t) -1
       unmapped and unusable guarantees that there will be no confusion when
       using these interfaces.

   Defining user and group ID mappings: writing to uid_map and gid_map
       After the creation of a new user namespace, the uid_map file of one of
       the processes in the namespace may be written to once to define the
       mapping of user IDs in the new user namespace.  An attempt to write
       more than once to a uid_map file in a user namespace fails with the
       error EPERM.  Similar rules apply for gid_map files.

       The lines written to uid_map (gid_map) must conform to the following
       rules:

       *  The three fields must be valid numbers, and the last field must be
          greater than 0.

       *  Lines are terminated by newline characters.

       *  There is a limit on the number of lines in the file.  In Linux 4.14
          and earlier, this limit was (arbitrarily) set at 5 lines.  Since
          Linux 4.15, the limit is 340 lines.  In addition, the number of
          bytes written to the file must be less than the system page size,
          and the write must be performed at the start of the file (i.e.,
          lseek(2) and pwrite(2) can't be used to write to nonzero offsets in
          the file).

       *  The range of user IDs (group IDs) specified in each line cannot
          overlap with the ranges in any other lines.  In the initial
          implementation (Linux 3.8), this requirement was satisfied by a
          simplistic implementation that imposed the further requirement that
          the values in both field 1 and field 2 of successive lines must be
          in ascending numerical order, which prevented some otherwise valid
          maps from being created.  Linux 3.9 and later fix this limitation,
          allowing any valid set of nonoverlapping maps.

       *  At least one line must be written to the file.

       Writes that violate the above rules fail with the error EINVAL.
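
       A user-space pre-check of these format rules might look like the
       sketch below.  The names check_map_text() and MAX_MAP_LINES are
       hypothetical.  The sketch makes several assumptions that go beyond the
       text: it tests both field 1 and field 2 ranges for overlap, it accepts
       a final line without a trailing newline, it relies on sscanf(3) for
       number parsing (which is more permissive than a strict parser), and it
       does not check the page-size limit.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_MAP_LINES 340   /* line limit since Linux 4.15 */

/* Return 1 if [a, a+alen) and [b, b+blen) share at least one ID. */
static int
ranges_overlap(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Check newline-delimited map records in 'text'.  Return 0 if they
   look acceptable, or -1 if a write would be expected to fail with
   EINVAL under the rules listed above. */
int
check_map_text(const char *text)
{
    uint32_t in[MAX_MAP_LINES], out[MAX_MAP_LINES], len[MAX_MAP_LINES];
    const char *p = text;
    int n = 0;

    while (*p != '\0') {
        const char *nl = strchr(p, '\n');
        size_t linelen = (nl != NULL) ? (size_t) (nl - p) : strlen(p);
        char line[64], extra;

        if (n == MAX_MAP_LINES || linelen >= sizeof(line))
            return -1;
        memcpy(line, p, linelen);
        line[linelen] = '\0';

        /* Exactly three numbers, with nothing else on the line */
        if (sscanf(line, "%" SCNu32 " %" SCNu32 " %" SCNu32 " %c",
                   &in[n], &out[n], &len[n], &extra) != 3)
            return -1;
        if (len[n] == 0)
            return -1;

        /* The new range must not overlap any earlier range */
        for (int i = 0; i < n; i++)
            if (ranges_overlap(in[i], len[i], in[n], len[n]) ||
                    ranges_overlap(out[i], len[i], out[n], len[n]))
                return -1;

        n++;
        p += linelen + (nl != NULL ? 1 : 0);
    }

    return (n > 0) ? 0 : -1;    /* At least one line is required */
}
```

       Such a check can only approximate the kernel's behavior; the write
       itself remains the authoritative test.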

       In order for a process to write to the /proc/[pid]/uid_map
       (/proc/[pid]/gid_map) file, all of the following requirements must be
       met:

       1. The writing process must have the CAP_SETUID (CAP_SETGID) capability
          in the user namespace of the process pid.

       2. The writing process must either be in the user namespace of the
          process pid or be in the parent user namespace of the process pid.

       3. The mapped user IDs (group IDs) must in turn have a mapping in the
          parent user namespace.

       4. One of the following two cases applies:

          *  Either the writing process has the CAP_SETUID (CAP_SETGID)
             capability in the parent user namespace.

             +  No further restrictions apply: the process can make mappings
                to arbitrary user IDs (group IDs) in the parent user
                namespace.

          *  Or otherwise all of the following restrictions apply:

             +  The data written to uid_map (gid_map) must consist of a single
                line that maps the writing process's effective user ID (group
                ID) in the parent user namespace to a user ID (group ID) in
                the user namespace.

             +  The writing process must have the same effective user ID as
                the process that created the user namespace.

             +  In the case of gid_map, use of the setgroups(2) system call
                must first be denied by writing "deny" to the
                /proc/[pid]/setgroups file (see below) before writing to
                gid_map.

       Writes that violate the above rules fail with the error EPERM.
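
       The second case of requirement 4 (a writer without CAP_SETUID or
       CAP_SETGID in the parent namespace) can be sketched as follows.  The
       helper names write_once() and become_ns_root() are hypothetical.  The
       sketch assumes a single-threaded caller, a mounted /proc, and a kernel
       that permits unprivileged user namespace creation; any step may fail
       on systems where those assumptions do not hold.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write 'data' to 'path' with a single write(2) at offset 0.
   Return 0 on success, -1 on error. */
int
write_once(const char *path, const char *data)
{
    size_t len = strlen(data);
    ssize_t n;
    int fd;

    fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;
    n = write(fd, data, len);
    close(fd);
    return (n == (ssize_t) len) ? 0 : -1;
}

/* Move the caller into a new user namespace and map its effective
   user and group IDs to 0 there.  Return 0 on success, -1 on error. */
int
become_ns_root(void)
{
    char map[64];
    uid_t euid = geteuid();     /* Fetch the IDs before unshare(2): */
    gid_t egid = getegid();     /* afterward they are unmapped */

    if (unshare(CLONE_NEWUSER) == -1)
        return -1;

    /* A single line mapping the writer's own effective user ID */
    snprintf(map, sizeof(map), "0 %ld 1", (long) euid);
    if (write_once("/proc/self/uid_map", map) == -1)
        return -1;

    /* setgroups(2) must be denied before gid_map can be written
       without CAP_SETGID in the parent namespace */
    if (write_once("/proc/self/setgroups", "deny") == -1)
        return -1;

    snprintf(map, sizeof(map), "0 %ld 1", (long) egid);
    if (write_once("/proc/self/gid_map", map) == -1)
        return -1;

    return 0;
}
```

       If become_ns_root() succeeds, geteuid(2) and getegid(2) return 0 in
       the caller from then on.  Because unshare(2) changes the calling
       process itself, a real program would often perform these steps in a
       child process instead.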

   Interaction with system calls that change process UIDs or GIDs
       In a user namespace where the uid_map file has not been written, the
       system calls that change user IDs will fail.  Similarly, if the gid_map
       file has not been written, the system calls that change group IDs will
       fail.  After the uid_map and gid_map files have been written, only the
       mapped values may be used in system calls that change user and group
       IDs.

       For user IDs, the relevant system calls include setuid(2), setfsuid(2),
       setreuid(2), and setresuid(2).  For group IDs, the relevant system
       calls include setgid(2), setfsgid(2), setregid(2), setresgid(2), and
       setgroups(2).

       Writing "deny" to the /proc/[pid]/setgroups file before writing to
       /proc/[pid]/gid_map will permanently disable setgroups(2) in a user
       namespace and allow writing to /proc/[pid]/gid_map without having the
       CAP_SETGID capability in the parent user namespace.

   The /proc/[pid]/setgroups file
       The /proc/[pid]/setgroups file displays the string "allow" if processes
       in the user namespace that contains the process pid are permitted to
       employ the setgroups(2) system call; it displays "deny" if setgroups(2)
       is not permitted in that user namespace.  Note that regardless of the
       value in the /proc/[pid]/setgroups file (and regardless of the
       process's capabilities), calls to setgroups(2) are also not permitted
       if /proc/[pid]/gid_map has not yet been set.

       A privileged process (one with the CAP_SYS_ADMIN capability in the
       namespace) may write either of the strings "allow" or "deny" to this
       file before writing a group ID mapping for this user namespace to the
       file /proc/[pid]/gid_map.  Writing the string "deny" prevents any
       process in the user namespace from employing setgroups(2).

       The essence of the restrictions described in the preceding paragraph is
       that it is permitted to write to /proc/[pid]/setgroups only so long as
       calling setgroups(2) is disallowed because /proc/[pid]/gid_map has not
       been set.  This ensures that a process cannot transition from a state
       where setgroups(2) is allowed to a state where setgroups(2) is denied;
       a process can transition only from setgroups(2) being disallowed to
       setgroups(2) being allowed.

       The default value of this file in the initial user namespace is
       "allow".

       Once /proc/[pid]/gid_map has been written to (which has the effect of
       enabling setgroups(2) in the user namespace), it is no longer possible
       to disallow setgroups(2) by writing "deny" to /proc/[pid]/setgroups
       (the write fails with the error EPERM).

       A child user namespace inherits the /proc/[pid]/setgroups setting from
       its parent.

       If the setgroups file has the value "deny", then the setgroups(2)
       system call can't subsequently be reenabled (by writing "allow" to the
       file) in this user namespace.  (Attempts to do so fail with the error
       EPERM.)  This restriction also propagates down to all child user
       namespaces of this user namespace.

       The /proc/[pid]/setgroups file was added in Linux 3.19, but was
       backported to many earlier stable kernel series, because it addresses a
       security issue.  The issue concerned files with permissions such as
       "rwx---rwx".  Such files give fewer permissions to "group" than they do
       to "other".  This means that dropping groups using setgroups(2) might
       allow a process file access that it did not formerly have.  Before the
       existence of user namespaces this was not a concern, since only a
       privileged process (one with the CAP_SETGID capability) could call
       setgroups(2).  However, with the introduction of user namespaces, it
       became possible for an unprivileged process to create a new namespace
       in which the user had all privileges.  This then allowed formerly
       unprivileged users to drop groups and thus gain file access that they
       did not previously have.  The /proc/[pid]/setgroups file was added to
       address this security issue, by denying any pathway for an unprivileged
       process to drop groups with setgroups(2).

   Unmapped user and group IDs
       There are various places where an unmapped user ID (group ID) may be
       exposed to user space.  For example, the first process in a new user
       namespace may call getuid(2) before a user ID mapping has been defined
       for the namespace.  In most such cases, an unmapped user ID is
       converted to the overflow user ID (group ID); the default value for the
       overflow user ID (group ID) is 65534.  See the descriptions of
       /proc/sys/kernel/overflowuid and /proc/sys/kernel/overflowgid in
       proc(5).
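
       A program that wants to know which value stands in for unmapped user
       IDs can read it from /proc, as in the following sketch.  The function
       name read_overflow_uid() is hypothetical, and the sketch assumes that
       /proc is mounted and readable.

```c
#include <stdio.h>

/* Return the overflow user ID from /proc/sys/kernel/overflowuid,
   or -1 if the file cannot be read or parsed. */
long
read_overflow_uid(void)
{
    FILE *fp;
    long id = -1;

    fp = fopen("/proc/sys/kernel/overflowuid", "r");
    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%ld", &id) != 1)
        id = -1;
    fclose(fp);
    return id;
}
```

       Comparing the result of getuid(2) against this value is only a
       heuristic for detecting a missing mapping, since a genuinely mapped
       user may have the same numeric ID.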

       The cases where unmapped IDs are mapped in this fashion include system
       calls that return user IDs (getuid(2), getgid(2), and similar),
       credentials passed over a UNIX domain socket, credentials returned by
       stat(2), waitid(2), and the System V IPC "ctl" IPC_STAT operations,
       credentials exposed by /proc/[pid]/status and the files in
       /proc/sysvipc/*, credentials returned via the si_uid field in the
       siginfo_t received with a signal (see sigaction(2)), credentials
       written to the process accounting file (see acct(5)), and credentials
       returned with POSIX message queue notifications (see mq_notify(3)).

       There is one notable case where unmapped user and group IDs are not
       converted to the corresponding overflow ID value.  When viewing a
       uid_map or gid_map file in which there is no mapping for the second
       field, that field is displayed as 4294967295 (-1 as an unsigned
       integer).

   Set-user-ID and set-group-ID programs
       When a process inside a user namespace executes a set-user-ID
       (set-group-ID) program, the process's effective user (group) ID inside
       the namespace is changed to whatever value is mapped for the user
       (group) ID of the file.  However, if either the user or the group ID of
       the file has no mapping inside the namespace, the set-user-ID
       (set-group-ID) bit is silently ignored: the new program is executed,
       but the process's effective user (group) ID is left unchanged.  (This
       mirrors the semantics of executing a set-user-ID or set-group-ID
       program that resides on a filesystem that was mounted with the
       MS_NOSUID flag, as described in mount(2).)

   Miscellaneous
       When a process's user and group IDs are passed over a UNIX domain
       socket to a process in a different user namespace (see the description
       of SCM_CREDENTIALS in unix(7)), they are translated into the
       corresponding values as per the receiving process's user and group ID
       mappings.

CONFORMING TO

       Namespaces are a Linux-specific feature.

NOTES

       Over the years, there have been a lot of features that have been added
       to the Linux kernel that have been made available only to privileged
       users because of their potential to confuse set-user-ID-root
       applications.  In general, it becomes safe to allow the root user in a
       user namespace to use those features because it is impossible, while in
       a user namespace, to gain more privilege than the root user of a user
       namespace has.

   Availability
       Use of user namespaces requires a kernel that is configured with the
       CONFIG_USER_NS option.  User namespaces require support in a range of
       subsystems across the kernel.  When an unsupported subsystem is
       configured into the kernel, it is not possible to configure user
       namespaces support.

       As of Linux 3.8, most relevant subsystems supported user namespaces,
       but a number of filesystems did not have the infrastructure needed to
       map user and group IDs between user namespaces.  Linux 3.9 added the
       required infrastructure support for many of the remaining unsupported
       filesystems (Plan 9 (9P), Andrew File System (AFS), Ceph, CIFS, CODA,
       NFS, and OCFS2).  Linux 3.12 added support for the last of the
       unsupported major filesystems, XFS.

EXAMPLE

       The program below is designed to allow experimenting with user
       namespaces, as well as other types of namespaces.  It creates
       namespaces as specified by command-line options and then executes a
       command inside those namespaces.  The comments and usage() function
       inside the program provide a full explanation of the program.  The
       following shell session demonstrates its use.

       First, we look at the run-time environment:

           $ uname -rs     # Need Linux 3.8 or later
           Linux 3.8.0
           $ id -u         # Running as unprivileged user
           1000
           $ id -g
           1000

       Now start a new shell in new user (-U), mount (-m), and PID (-p)
       namespaces, with user ID (-M) and group ID (-G) 1000 mapped to 0 inside
       the user namespace:

           $ ./userns_child_exec -p -m -U -M '0 1000 1' -G '0 1000 1' bash

       The shell has PID 1, because it is the first process in the new PID
       namespace:

           bash$ echo $$
           1

       Mounting a new /proc filesystem and listing all of the processes
       visible in the new PID namespace shows that the shell can't see any
       processes outside the PID namespace:

           bash$ mount -t proc proc /proc
           bash$ ps ax
             PID TTY      STAT   TIME COMMAND
               1 pts/3    S      0:00 bash
              22 pts/3    R+     0:00 ps ax

       Inside the user namespace, the shell has user and group ID 0, and a
       full set of permitted and effective capabilities:

           bash$ cat /proc/$$/status | egrep '^[UG]id'
           Uid: 0    0    0    0
           Gid: 0    0    0    0
           bash$ cat /proc/$$/status | egrep '^Cap(Prm|Inh|Eff)'
           CapInh:   0000000000000000
           CapPrm:   0000001fffffffff
           CapEff:   0000001fffffffff

   Program source

525       /* userns_child_exec.c
526
527          Licensed under GNU General Public License v2 or later
528
529          Create a child process that executes a shell command in new
530          namespace(s); allow UID and GID mappings to be specified when
531          creating a user namespace.
532       */
533       #define _GNU_SOURCE
534       #include <sched.h>
535       #include <unistd.h>
536       #include <stdlib.h>
537       #include <sys/wait.h>
538       #include <signal.h>
539       #include <fcntl.h>
540       #include <stdio.h>
541       #include <string.h>
542       #include <limits.h>
543       #include <errno.h>
544
545       /* A simple error-handling function: print an error message based
546          on the value in 'errno' and terminate the calling process */
547
548       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
549                               } while (0)
550
       struct child_args {
           char **argv;        /* Command to be executed by child, with args */
           int    pipe_fd[2];  /* Pipe used to synchronize parent and child */
       };

       static int verbose;

       static void
       usage(char *pname)
       {
           fprintf(stderr, "Usage: %s [options] cmd [arg...]\n\n", pname);
           fprintf(stderr, "Create a child process that executes a shell "
                   "command in a new user namespace,\n"
                   "and possibly also other new namespace(s).\n\n");
           fprintf(stderr, "Options can be:\n\n");
       #define fpe(str) fprintf(stderr, "    %s", str);
           fpe("-i          New IPC namespace\n");
           fpe("-m          New mount namespace\n");
           fpe("-n          New network namespace\n");
           fpe("-p          New PID namespace\n");
           fpe("-u          New UTS namespace\n");
           fpe("-U          New user namespace\n");
           fpe("-M uid_map  Specify UID map for user namespace\n");
           fpe("-G gid_map  Specify GID map for user namespace\n");
           fpe("-z          Map user's UID and GID to 0 in user namespace\n");
           fpe("            (equivalent to: -M '0 <uid> 1' -G '0 <gid> 1')\n");
           fpe("-v          Display verbose messages\n");
           fpe("\n");
           fpe("If -z, -M, or -G is specified, -U is required.\n");
           fpe("It is not permitted to specify both -z and either -M or -G.\n");
           fpe("\n");
           fpe("Map strings for -M and -G consist of records of the form:\n");
           fpe("\n");
           fpe("    ID-inside-ns   ID-outside-ns   len\n");
           fpe("\n");
           fpe("A map string can contain multiple records, separated"
               " by commas;\n");
           fpe("the commas are replaced by newlines before writing"
               " to map files.\n");

           exit(EXIT_FAILURE);
       }

       /* Update the mapping file 'map_file', with the value provided in
          'mapping', a string that defines a UID or GID mapping. A UID or
          GID mapping consists of one or more newline-delimited records
          of the form:

              ID-inside-ns    ID-outside-ns   length

          Requiring the user to supply a string that contains newlines is
          of course inconvenient for command-line use. Thus, we permit the
          use of commas to delimit records in this string, and replace them
          with newlines before writing the string to the file. */

       static void
       update_map(char *mapping, char *map_file)
       {
           int fd;
           size_t j;
           size_t map_len;     /* Length of 'mapping' */

           /* Replace commas in mapping string with newlines */

           map_len = strlen(mapping);
           for (j = 0; j < map_len; j++)
               if (mapping[j] == ',')
                   mapping[j] = '\n';

           fd = open(map_file, O_RDWR);
           if (fd == -1) {
               fprintf(stderr, "ERROR: open %s: %s\n", map_file,
                       strerror(errno));
               exit(EXIT_FAILURE);
           }

           /* The whole mapping must be written in a single write() call */

           if (write(fd, mapping, map_len) != (ssize_t) map_len) {
               fprintf(stderr, "ERROR: write %s: %s\n", map_file,
                       strerror(errno));
               exit(EXIT_FAILURE);
           }

           close(fd);
       }

       /* Linux 3.19 made a change in the handling of setgroups(2) and the
          'gid_map' file to address a security issue. The issue allowed
          *unprivileged* users to employ user namespaces in order to drop
          groups. The upshot of the 3.19 changes is that in order to update
          the 'gid_map' file, use of the setgroups() system call in this
          user namespace must first be disabled by writing "deny" to one of
          the /proc/PID/setgroups files for this namespace.  That is the
          purpose of the following function. */

       static void
       proc_setgroups_write(pid_t child_pid, char *str)
       {
           char setgroups_path[PATH_MAX];
           int fd;

           snprintf(setgroups_path, PATH_MAX, "/proc/%ld/setgroups",
                   (long) child_pid);

           fd = open(setgroups_path, O_RDWR);
           if (fd == -1) {

               /* We may be on a system that doesn't support
                  /proc/PID/setgroups. In that case, the file won't exist,
                  and the system won't impose the restrictions that Linux 3.19
                  added. That's fine: we don't need to do anything in order
                  to permit 'gid_map' to be updated.

                  However, if the error from open() was something other than
                  the ENOENT error that is expected for that case, let the
                  user know. */

               if (errno != ENOENT)
                   fprintf(stderr, "ERROR: open %s: %s\n", setgroups_path,
                       strerror(errno));
               return;
           }

           if (write(fd, str, strlen(str)) == -1)
               fprintf(stderr, "ERROR: write %s: %s\n", setgroups_path,
                   strerror(errno));

           close(fd);
       }

       static int              /* Start function for cloned child */
       childFunc(void *arg)
       {
           struct child_args *args = (struct child_args *) arg;
           char ch;

           /* Wait until the parent has updated the UID and GID mappings.
              See the comment in main(). We wait for end of file on a
              pipe that will be closed by the parent process once it has
              updated the mappings. */

           close(args->pipe_fd[1]);    /* Close our descriptor for the write
                                          end of the pipe so that we see EOF
                                          when parent closes its descriptor */
           if (read(args->pipe_fd[0], &ch, 1) != 0) {
               fprintf(stderr,
                       "Failure in child: read from pipe returned != 0\n");
               exit(EXIT_FAILURE);
           }

           close(args->pipe_fd[0]);

           /* Execute a shell command */

           printf("About to exec %s\n", args->argv[0]);
           execvp(args->argv[0], args->argv);
           errExit("execvp");
       }

       #define STACK_SIZE (1024 * 1024)

       static char child_stack[STACK_SIZE];    /* Space for child's stack */

       int
       main(int argc, char *argv[])
       {
           int flags, opt, map_zero;
           pid_t child_pid;
           struct child_args args;
           char *uid_map, *gid_map;
           const int MAP_BUF_SIZE = 100;
           char map_buf[MAP_BUF_SIZE];
           char map_path[PATH_MAX];

           /* Parse command-line options. The initial '+' character in
              the final getopt() argument prevents GNU-style permutation
              of command-line options. That's useful, since sometimes
              the 'command' to be executed by this program itself
              has command-line options. We don't want getopt() to treat
              those as options to this program. */

           flags = 0;
           verbose = 0;
           gid_map = NULL;
           uid_map = NULL;
           map_zero = 0;
           while ((opt = getopt(argc, argv, "+imnpuUM:G:zv")) != -1) {
               switch (opt) {
               case 'i': flags |= CLONE_NEWIPC;        break;
               case 'm': flags |= CLONE_NEWNS;         break;
               case 'n': flags |= CLONE_NEWNET;        break;
               case 'p': flags |= CLONE_NEWPID;        break;
               case 'u': flags |= CLONE_NEWUTS;        break;
               case 'v': verbose = 1;                  break;
               case 'z': map_zero = 1;                 break;
               case 'M': uid_map = optarg;             break;
               case 'G': gid_map = optarg;             break;
               case 'U': flags |= CLONE_NEWUSER;       break;
               default:  usage(argv[0]);
               }
           }

           /* -z, -M, or -G without -U is nonsensical, as is combining
              -z with either -M or -G */

           if (((uid_map != NULL || gid_map != NULL || map_zero) &&
                       !(flags & CLONE_NEWUSER)) ||
                   (map_zero && (uid_map != NULL || gid_map != NULL)))
               usage(argv[0]);

           args.argv = &argv[optind];

           /* We use a pipe to synchronize the parent and child, in order to
              ensure that the parent sets the UID and GID maps before the child
              calls execve(). This ensures that the child maintains its
              capabilities during the execve() in the common case where we
              want to map the child's effective user ID to 0 in the new user
              namespace. Without this synchronization, the child would lose
              its capabilities if it performed an execve() with nonzero
              user IDs (see the capabilities(7) man page for details of the
              transformation of a process's capabilities during execve()). */

           if (pipe(args.pipe_fd) == -1)
               errExit("pipe");

           /* Create the child in new namespace(s) */

           child_pid = clone(childFunc, child_stack + STACK_SIZE,
                             flags | SIGCHLD, &args);
           if (child_pid == -1)
               errExit("clone");

           /* Parent falls through to here */

           if (verbose)
               printf("%s: PID of child created by clone() is %ld\n",
                       argv[0], (long) child_pid);

           /* Update the UID and GID maps in the child */

           if (uid_map != NULL || map_zero) {
               snprintf(map_path, PATH_MAX, "/proc/%ld/uid_map",
                       (long) child_pid);
               if (map_zero) {
                   snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getuid());
                   uid_map = map_buf;
               }
               update_map(uid_map, map_path);
           }

           if (gid_map != NULL || map_zero) {
               proc_setgroups_write(child_pid, "deny");

               snprintf(map_path, PATH_MAX, "/proc/%ld/gid_map",
                       (long) child_pid);
               if (map_zero) {
                   snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getgid());
                   gid_map = map_buf;
               }
               update_map(gid_map, map_path);
           }

           /* Close the write end of the pipe, to signal to the child that we
              have updated the UID and GID maps */

           close(args.pipe_fd[1]);

           if (waitpid(child_pid, NULL, 0) == -1)      /* Wait for child */
               errExit("waitpid");

           if (verbose)
               printf("%s: terminating\n", argv[0]);

           exit(EXIT_SUCCESS);
       }
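
       The pipe-based synchronization used by the program above can be
       tried in isolation, without creating any namespaces.  The following
       is a minimal sketch, not part of userns_child_exec.c: the function
       name sync_child_on_pipe_eof() is hypothetical, and fork(2) is used
       in place of clone(2) to keep the example short.  The child blocks
       in read(2) until the parent closes the write end of the pipe.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of the program above): fork a child
   that waits for end of file on a pipe, then report the child's exit
   status. Returns 0 if the child saw EOF, and -1 on any error. */

static int
sync_child_on_pipe_eof(void)
{
    int pipe_fd[2];
    int status;
    pid_t pid;
    char ch;

    if (pipe(pipe_fd) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(pipe_fd[0]);
        close(pipe_fd[1]);
        return -1;
    }

    if (pid == 0) {             /* Child */

        /* Close our copy of the write end; otherwise read() would
           never return 0, since a writer would still exist */

        close(pipe_fd[1]);

        /* read() returns 0 (EOF) once the parent closes its write end */

        _exit(read(pipe_fd[0], &ch, 1) == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
    }

    /* Parent: any setup that must finish before the child proceeds
       (such as writing the UID and GID maps) would be done here */

    close(pipe_fd[0]);
    close(pipe_fd[1]);          /* Child now sees EOF */

    if (waitpid(pid, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

       In userns_child_exec.c, the same pattern guarantees that the maps
       are in place before the child calls execvp(3).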
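
       The comma-to-newline conversion performed by update_map() can also
       be exercised on its own.  The sketch below assumes a hypothetical
       helper, commas_to_newlines(), which is not part of the program
       above; it rewrites a map string in place and returns the number
       of records, without touching any /proc file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the program above): convert a
   comma-separated map string to the newline-delimited form expected
   by /proc/PID/uid_map and /proc/PID/gid_map. The string is modified
   in place. Returns the number of records, or 0 for an empty string. */

static size_t
commas_to_newlines(char *mapping)
{
    size_t records;
    char *p;

    assert(mapping != NULL);

    if (mapping[0] == '\0')
        return 0;

    records = 1;
    for (p = mapping; *p != '\0'; p++) {
        if (*p == ',') {
            *p = '\n';
            records++;
        }
    }

    return records;
}
```

       For instance, applying the helper to the string
       "0 1000 1,1 100000 65536" yields two records separated by a
       newline; the ID values in that string are purely illustrative.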

COLOPHON

       This page is part of release 4.16 of the Linux  man-pages  project.   A
       description  of  the project, information about reporting bugs, and the
       latest    version    of    this    page,    can     be     found     at
       https://www.kernel.org/doc/man-pages/.



Linux                             2018-02-02                USER_NAMESPACES(7)
