CLONE(2)                   Linux Programmer's Manual                  CLONE(2)

NAME
       clone, __clone2, clone3 - create a child process

SYNOPSIS
       /* Prototype for the glibc wrapper function */

       #define _GNU_SOURCE
       #include <sched.h>

       int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...
                 /* pid_t *parent_tid, void *tls, pid_t *child_tid */ );

       /* For the prototype of the raw clone() system call, see NOTES */

       #include <linux/sched.h>    /* Definition of struct clone_args */
       #include <sched.h>          /* Definition of CLONE_* constants */
       #include <sys/syscall.h>    /* Definition of SYS_* constants */
       #include <unistd.h>

       long syscall(SYS_clone3, struct clone_args *cl_args, size_t size);

       Note: glibc provides no wrapper for clone3(), necessitating the use of
       syscall(2).

DESCRIPTION
       These system calls create a new ("child") process, in a manner similar
       to fork(2).

       By contrast with fork(2), these system calls provide more precise control
       over what pieces of execution context are shared between the calling
       process and the child process.  For example, using these system
       calls, the caller can control whether or not the two processes share
       the virtual address space, the table of file descriptors, and the table
       of signal handlers.  These system calls also allow the new child
       process to be placed in separate namespaces(7).

       Note that in this manual page, "calling process" normally corresponds
       to "parent process".  But see the descriptions of CLONE_PARENT and
       CLONE_THREAD below.

       This page describes the following interfaces:

       *  The glibc clone() wrapper function and the underlying system call on
          which it is based.  The main text describes the wrapper function;
          the differences for the raw system call are described toward the end
          of this page.

       *  The newer clone3() system call.

       In the remainder of this page, the terminology "the clone call" is used
       when noting details that apply to all of these interfaces.

   The clone() wrapper function
       When the child process is created with the clone() wrapper function, it
       commences execution by calling the function pointed to by the argument
       fn.  (This differs from fork(2), where execution continues in the child
       from the point of the fork(2) call.)  The arg argument is passed as the
       argument of the function fn.

       When the fn(arg) function returns, the child process terminates.  The
       integer returned by fn is the exit status for the child process.  The
       child process may also terminate explicitly by calling exit(2) or after
       receiving a fatal signal.

       The stack argument specifies the location of the stack used by the
       child process.  Since the child and calling process may share memory,
       it is not possible for the child process to execute in the same stack
       as the calling process.  The calling process must therefore set up memory
       space for the child stack and pass a pointer to this space to
       clone().  Stacks grow downward on all processors that run Linux (except
       the HP PA processors), so stack usually points to the topmost address
       of the memory space set up for the child stack.  Note that clone() does
       not provide a means whereby the caller can inform the kernel of the
       size of the stack area.

       The remaining arguments to clone() are discussed below.
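
       The following is a minimal sketch of the wrapper in use, not a
       definitive implementation.  The names run_clone_child and child_fn
       and the 1 MiB stack size are assumptions made for illustration.  The
       sketch allocates a child stack with mmap(2), passes the topmost
       address of that area as stack, requests SIGCHLD as the termination
       signal, and reads back the value returned by fn as the exit status.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (1024 * 1024)  /* illustrative size, an assumption */

/* Child entry point: the returned integer becomes the exit status. */
static int child_fn(void *arg)
{
    return *(int *)arg;
}

/* Run child_fn in a clone() child; return its exit status, or -1 on error. */
int run_clone_child(int value)
{
    char *stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* The stack grows downward, so pass the topmost address of the area. */
    char *stack_top = stack + CHILD_STACK_SIZE;

    /* No sharing flags are set, so the child works on copies, much as
       with fork(2).  SIGCHLD in the low byte is the termination signal. */
    pid_t pid = clone(child_fn, stack_top, SIGCHLD, &value);

    int status = 0;
    int result = -1;
    if (pid != -1 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    munmap(stack, CHILD_STACK_SIZE);
    return result;
}
```

       Because CLONE_VM is not set here, the child reads its own copy of
       value; with CLONE_VM the pointer would refer to memory shared with
       the caller.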

   clone3()
       The clone3() system call provides a superset of the functionality of
       the older clone() interface.  It also provides a number of API improvements,
       including: space for additional flags bits; cleaner separation
       in the use of various arguments; and the ability to specify the size of
       the child's stack area.

       As with fork(2), clone3() returns in both the parent and the child.  It
       returns 0 in the child process and returns the PID of the child in the
       parent.

       The cl_args argument of clone3() is a structure of the following form:

           struct clone_args {
               u64 flags;        /* Flags bit mask */
               u64 pidfd;        /* Where to store PID file descriptor
                                    (int *) */
               u64 child_tid;    /* Where to store child TID,
                                    in child's memory (pid_t *) */
               u64 parent_tid;   /* Where to store child TID,
                                    in parent's memory (pid_t *) */
               u64 exit_signal;  /* Signal to deliver to parent on
                                    child termination */
               u64 stack;        /* Pointer to lowest byte of stack */
               u64 stack_size;   /* Size of stack */
               u64 tls;          /* Location of new TLS */
               u64 set_tid;      /* Pointer to a pid_t array
                                    (since Linux 5.5) */
               u64 set_tid_size; /* Number of elements in set_tid
                                    (since Linux 5.5) */
               u64 cgroup;       /* File descriptor for target cgroup
                                    of child (since Linux 5.7) */
           };

       The size argument that is supplied to clone3() should be initialized to
       the size of this structure.  (The existence of the size argument permits
       future extensions to the clone_args structure.)

       The stack for the child process is specified via cl_args.stack, which
       points to the lowest byte of the stack area, and cl_args.stack_size,
       which specifies the size of the stack in bytes.  In the case where the
       CLONE_VM flag (see below) is specified, a stack must be explicitly allocated
       and specified.  Otherwise, these two fields can be specified as
       NULL and 0, which causes the child to use the same stack area as the
       parent (in the child's own virtual address space).

       The remaining fields in the cl_args argument are discussed below.

   Equivalence between clone() and clone3() arguments
       Unlike the older clone() interface, where arguments are passed individually,
       in the newer clone3() interface the arguments are packaged into
       the clone_args structure shown above.  This structure allows for a superset
       of the information passed via the clone() arguments.

       The following table shows the equivalence between the arguments of
       clone() and the fields in the clone_args argument supplied to clone3():

           clone()         clone3()        Notes
                           cl_args field
           flags & ~0xff   flags           For most flags; details
                                           below
           parent_tid      pidfd           See CLONE_PIDFD
           child_tid       child_tid       See CLONE_CHILD_SETTID
           parent_tid      parent_tid      See CLONE_PARENT_SETTID
           flags & 0xff    exit_signal
           stack           stack
           ---             stack_size
           tls             tls             See CLONE_SETTLS
           ---             set_tid         See below for details
           ---             set_tid_size
           ---             cgroup          See CLONE_INTO_CGROUP
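
       The first and fifth rows of the table can be sketched as a small
       helper that splits a clone() flags argument into the two clone3()
       fields.  The struct split_flags type and the function name are
       hypothetical, and, as the table notes, the mapping holds only for
       most flags.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdint.h>

/* Hypothetical container for the two clone3() fields that are derived
   from the single clone() flags argument. */
struct split_flags {
    uint64_t flags;        /* flags & ~0xff */
    uint64_t exit_signal;  /* flags & 0xff  */
};

/* Split a clone() flags value as described by the table above. */
struct split_flags split_clone_flags(unsigned long clone_flags)
{
    struct split_flags out;

    out.flags = clone_flags & ~0xffUL;       /* sharing and namespace flags */
    out.exit_signal = clone_flags & 0xffUL;  /* child termination signal */
    return out;
}
```

       For example, CLONE_VM | CLONE_FS | SIGCHLD splits into a flags value
       of CLONE_VM | CLONE_FS and an exit_signal of SIGCHLD.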

   The child termination signal
       When the child process terminates, a signal may be sent to the parent.
       The termination signal is specified in the low byte of flags (clone())
       or in cl_args.exit_signal (clone3()).  If this signal is specified as
       anything other than SIGCHLD, then the parent process must specify the
       __WALL or __WCLONE options when waiting for the child with wait(2).  If
       no signal (i.e., zero) is specified, then the parent process is not
       signaled when the child terminates.

   The set_tid array
       By default, the kernel chooses the next sequential PID for the new
       process in each of the PID namespaces where it is present.  When creating
       a process with clone3(), the set_tid array (available since Linux
       5.5) can be used to select specific PIDs for the process in some or all
       of the PID namespaces where it is present.  If the PID of the newly
       created process should be set only for the current PID namespace or in
       the newly created PID namespace (if flags contains CLONE_NEWPID) then
       the first element in the set_tid array has to be the desired PID and
       set_tid_size needs to be 1.

       If the PID of the newly created process should have a certain value in
       multiple PID namespaces, then the set_tid array can have multiple entries.
       The first entry defines the PID in the most deeply nested PID
       namespace and each of the following entries contains the PID in the
       corresponding ancestor PID namespace.  The number of PID namespaces in
       which a PID should be set is defined by set_tid_size which cannot be
       larger than the number of currently nested PID namespaces.

       To create a process with the following PIDs in a PID namespace hierarchy:

           PID NS level   Requested PID   Notes
           0              31496           Outermost PID namespace
           1              42
           2              7               Innermost PID namespace

       Set the array to:

           set_tid[0] = 7;
           set_tid[1] = 42;
           set_tid[2] = 31496;
           set_tid_size = 3;

       If only the PIDs in the two innermost PID namespaces need to be specified,
       set the array to:

           set_tid[0] = 7;
           set_tid[1] = 42;
           set_tid_size = 2;

       The PID in the PID namespaces outside the two innermost PID namespaces
       is selected the same way as any other PID is selected.

       The set_tid feature requires CAP_SYS_ADMIN or (since Linux 5.9)
       CAP_CHECKPOINT_RESTORE in all owning user namespaces of the target PID
       namespaces.

       Callers may only choose a PID greater than 1 in a given PID namespace
       if an init process (i.e., a process with PID 1) already exists in that
       namespace.  Otherwise the PID entry for this PID namespace must be 1.
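
       The innermost-first ordering is easy to get backwards, so the
       following sketch builds the array from a list of PIDs given in
       namespace-level order (outermost first, as in the table above).  The
       helper name fill_set_tid is hypothetical; it only prepares the array
       and does not itself call clone3().

```c
#include <stddef.h>
#include <sys/types.h>

/* by_level[0] is the PID wanted in the outermost namespace and
   by_level[levels - 1] the PID wanted in the innermost one.
   Fill set_tid[] innermost first with the `count` innermost entries.
   Returns count, or 0 if count exceeds the number of levels. */
size_t fill_set_tid(const pid_t *by_level, size_t levels,
                    pid_t *set_tid, size_t count)
{
    if (count > levels)
        return 0;   /* set_tid_size cannot exceed the nesting depth */

    for (size_t i = 0; i < count; i++)
        set_tid[i] = by_level[levels - 1 - i];
    return count;
}
```

       The filled array would then be passed by storing its address in
       cl_args.set_tid and the returned count in cl_args.set_tid_size.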

   The flags mask
       Both clone() and clone3() allow a flags bit mask that modifies their
       behavior and allows the caller to specify what is shared between the
       calling process and the child process.  This bit mask—the flags argument
       of clone() or the cl_args.flags field passed to clone3()—is referred
       to as the flags mask in the remainder of this page.

       The flags mask is specified as a bitwise-OR of zero or more of the constants
       listed below.  Except as noted below, these flags are available
       (and have the same effect) in both clone() and clone3().

       CLONE_CHILD_CLEARTID (since Linux 2.5.49)
              Clear (zero) the child thread ID at the location pointed to by
              child_tid (clone()) or cl_args.child_tid (clone3()) in child
              memory when the child exits, and do a wakeup on the futex at
              that address.  The address involved may be changed by the
              set_tid_address(2) system call.  This is used by threading libraries.

       CLONE_CHILD_SETTID (since Linux 2.5.49)
              Store the child thread ID at the location pointed to by
              child_tid (clone()) or cl_args.child_tid (clone3()) in the
              child's memory.  The store operation completes before the clone
              call returns control to user space in the child process.  (Note
              that the store operation may not have completed before the clone
              call returns in the parent process, which is relevant if the
              CLONE_VM flag is also employed.)

       CLONE_CLEAR_SIGHAND (since Linux 5.5)
              By default, signal dispositions in the child thread are the same
              as in the parent.  If this flag is specified, then all signals
              that are handled in the parent are reset to their default dispositions
              (SIG_DFL) in the child.

              Specifying this flag together with CLONE_SIGHAND is nonsensical
              and disallowed.

       CLONE_DETACHED (historical)
              For a while (during the Linux 2.5 development series) there was
              a CLONE_DETACHED flag, which caused the parent not to receive a
              signal when the child terminated.  Ultimately, the effect of
              this flag was subsumed under the CLONE_THREAD flag and by the
              time Linux 2.6.0 was released, this flag had no effect.  Starting
              in Linux 2.6.2, the need to give this flag together with
              CLONE_THREAD disappeared.

              This flag is still defined, but it is usually ignored when calling
              clone().  However, see the description of CLONE_PIDFD for
              some exceptions.

       CLONE_FILES (since Linux 2.0)
              If CLONE_FILES is set, the calling process and the child process
              share the same file descriptor table.  Any file descriptor created
              by the calling process or by the child process is also
              valid in the other process.  Similarly, if one of the processes
              closes a file descriptor, or changes its associated flags (using
              the fcntl(2) F_SETFD operation), the other process is also affected.
              If a process sharing a file descriptor table calls execve(2),
              its file descriptor table is duplicated (unshared).

              If CLONE_FILES is not set, the child process inherits a copy of
              all file descriptors opened in the calling process at the time
              of the clone call.  Subsequent operations that open or close
              file descriptors, or change file descriptor flags, performed by
              either the calling process or the child process do not affect
              the other process.  Note, however, that the duplicated file descriptors
              in the child refer to the same open file descriptions
              as the corresponding file descriptors in the calling process,
              and thus share file offsets and file status flags (see open(2)).
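
              The distinction between a copied descriptor table and a shared
              open file description can be observed with a short experiment.
              This is a sketch under stated assumptions: the function names,
              the scratch file under /tmp, and the stack size are all
              illustrative.  The child, created without CLONE_FILES, moves
              the file offset and then closes its copy of the descriptor.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (256 * 1024)  /* illustrative size, an assumption */

/* Child: move the shared file offset, then close the child's copy. */
static int child_fn(void *arg)
{
    int fd = *(int *)arg;

    if (lseek(fd, 10, SEEK_SET) == -1)
        return 1;
    close(fd);
    return 0;
}

/* Returns 1 if, after the child ran, the parent's descriptor is still
   open and its offset is 10; 0 if not; -1 on a setup error. */
int fd_copy_shares_offset(void)
{
    char path[] = "/tmp/clone-fd-demoXXXXXX";  /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);

    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED) {
        close(fd);
        return -1;
    }

    int result = -1;
    int status = 0;
    /* CLONE_FILES is deliberately not set. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE, SIGCHLD, &fd);
    if (pid != -1 && waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        int still_open = fcntl(fd, F_GETFD) != -1;  /* close() was private */
        off_t offset = lseek(fd, 0, SEEK_CUR);      /* offset was shared */
        result = (still_open && offset == 10);
    }

    munmap(stack, STACK_SIZE);
    close(fd);
    return result;
}
```

              With CLONE_FILES added to the flags, the child's close(2)
              would instead also invalidate the descriptor in the caller.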

       CLONE_FS (since Linux 2.0)
              If CLONE_FS is set, the caller and the child process share the
              same filesystem information.  This includes the root of the
              filesystem, the current working directory, and the umask.  Any
              call to chroot(2), chdir(2), or umask(2) performed by the calling
              process or the child process also affects the other process.

              If CLONE_FS is not set, the child process works on a copy of the
              filesystem information of the calling process at the time of the
              clone call.  Calls to chroot(2), chdir(2), or umask(2) performed
              later by one of the processes do not affect the other process.
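
              A hedged sketch of this behavior follows.  The helper name
              child_chdir_visible is hypothetical, and the sketch assumes
              that /tmp exists and is a directory other than the root.  The
              child calls chdir("/"), and the caller then checks whether its
              own working directory moved.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <sched.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (256 * 1024)  /* illustrative size, an assumption */

/* Child: change directory to the root. */
static int child_fn(void *arg)
{
    (void)arg;
    return chdir("/") == 0 ? 0 : 1;
}

/* extra_flags is CLONE_FS or 0.  Returns 1 if the child's chdir("/")
   changed the caller's working directory, 0 if not, -1 on error. */
int child_chdir_visible(int extra_flags)
{
    int result = -1;
    int status = 0;
    char *stack = MAP_FAILED;
    int saved = open(".", O_RDONLY | O_DIRECTORY);  /* to restore cwd later */

    if (saved == -1)
        return -1;
    if (chdir("/tmp") == -1)
        goto out;

    stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        goto out;

    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      extra_flags | SIGCHLD, NULL);
    if (pid != -1 && waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        char cwd[PATH_MAX];
        if (getcwd(cwd, sizeof(cwd)) != NULL)
            result = (strcmp(cwd, "/") == 0);
    }

out:
    if (stack != MAP_FAILED)
        munmap(stack, STACK_SIZE);
    if (fchdir(saved) == -1)     /* put the caller back where it started */
        result = -1;
    close(saved);
    return result;
}
```

              Under these assumptions the call with CLONE_FS reports that
              the change was visible, and the call with 0 reports that it
              was not.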

       CLONE_INTO_CGROUP (since Linux 5.7)
              By default, a child process is placed in the same version 2
              cgroup as its parent.  The CLONE_INTO_CGROUP flag allows the
              child process to be created in a different version 2 cgroup.
              (Note that CLONE_INTO_CGROUP has effect only for version 2
              cgroups.)

              In order to place the child process in a different cgroup, the
              caller specifies CLONE_INTO_CGROUP in cl_args.flags and passes a
              file descriptor that refers to a version 2 cgroup in the
              cl_args.cgroup field.  (This file descriptor can be obtained by
              opening a cgroup v2 directory using either the O_RDONLY or the
              O_PATH flag.)  Note that all of the usual restrictions (described
              in cgroups(7)) on placing a process into a version 2
              cgroup apply.

              Among the possible use cases for CLONE_INTO_CGROUP are the following:

              *  Spawning a process into a cgroup different from the parent's
                 cgroup makes it possible for a service manager to directly
                 spawn new services into dedicated cgroups.  This eliminates
                 the accounting jitter that would be caused if the child
                 process was first created in the same cgroup as the parent
                 and then moved into the target cgroup.  Furthermore, spawning
                 the child process directly into a target cgroup is significantly
                 cheaper than moving the child process into the target
                 cgroup after it has been created.

              *  The CLONE_INTO_CGROUP flag also allows the creation of frozen
                 child processes by spawning them into a frozen cgroup.  (See
                 cgroups(7) for a description of the freezer controller.)

              *  For threaded applications (or even thread implementations
                 which make use of cgroups to limit individual threads), it is
                 possible to establish a fixed cgroup layout before spawning
                 each thread directly into its target cgroup.

       CLONE_IO (since Linux 2.6.25)
              If CLONE_IO is set, then the new process shares an I/O context
              with the calling process.  If this flag is not set, then (as
              with fork(2)) the new process has its own I/O context.

              The I/O context is the I/O scope of the disk scheduler (i.e.,
              what the I/O scheduler uses to model scheduling of a process's
              I/O).  If processes share the same I/O context, they are treated
              as one by the I/O scheduler.  As a consequence, they get to
              share disk time.  For some I/O schedulers, if two processes
              share an I/O context, they will be allowed to interleave their
              disk access.  If several threads are doing I/O on behalf of the
              same process (aio_read(3), for instance), they should employ
              CLONE_IO to get better I/O performance.

              If the kernel is not configured with the CONFIG_BLOCK option,
              this flag is a no-op.

       CLONE_NEWCGROUP (since Linux 4.6)
              Create the process in a new cgroup namespace.  If this flag is
              not set, then (as with fork(2)) the process is created in the
              same cgroup namespace as the calling process.

              For further information on cgroup namespaces, see
              cgroup_namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ
              CLONE_NEWCGROUP.

       CLONE_NEWIPC (since Linux 2.6.19)
              If CLONE_NEWIPC is set, then create the process in a new IPC
              namespace.  If this flag is not set, then (as with fork(2)), the
              process is created in the same IPC namespace as the calling
              process.

              For further information on IPC namespaces, see
              ipc_namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ
              CLONE_NEWIPC.  This flag can't be specified in conjunction with
              CLONE_SYSVSEM.

       CLONE_NEWNET (since Linux 2.6.24)
              (The implementation of this flag was completed only by about
              kernel version 2.6.29.)

              If CLONE_NEWNET is set, then create the process in a new network
              namespace.  If this flag is not set, then (as with fork(2)) the
              process is created in the same network namespace as the calling
              process.

              For further information on network namespaces, see
              network_namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ
              CLONE_NEWNET.

       CLONE_NEWNS (since Linux 2.4.19)
              If CLONE_NEWNS is set, the cloned child is started in a new
              mount namespace, initialized with a copy of the namespace of the
              parent.  If CLONE_NEWNS is not set, the child lives in the same
              mount namespace as the parent.

              For further information on mount namespaces, see namespaces(7)
              and mount_namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ
              CLONE_NEWNS.  It is not permitted to specify both CLONE_NEWNS
              and CLONE_FS in the same clone call.

       CLONE_NEWPID (since Linux 2.6.24)
              If CLONE_NEWPID is set, then create the process in a new PID
              namespace.  If this flag is not set, then (as with fork(2)) the
              process is created in the same PID namespace as the calling
              process.

              For further information on PID namespaces, see namespaces(7) and
              pid_namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ
              CLONE_NEWPID.  This flag can't be specified in conjunction with
              CLONE_THREAD or CLONE_PARENT.

       CLONE_NEWUSER
              (This flag first became meaningful for clone() in Linux 2.6.23,
              the current clone() semantics were merged in Linux 3.5, and the
              final pieces to make the user namespaces completely usable were
              merged in Linux 3.8.)

              If CLONE_NEWUSER is set, then create the process in a new user
              namespace.  If this flag is not set, then (as with fork(2)) the
              process is created in the same user namespace as the calling
              process.

              For further information on user namespaces, see namespaces(7)
              and user_namespaces(7).

              Before Linux 3.8, use of CLONE_NEWUSER required that the caller
              have three capabilities: CAP_SYS_ADMIN, CAP_SETUID, and
              CAP_SETGID.  Starting with Linux 3.8, no privileges are needed to
              create a user namespace.

              This flag can't be specified in conjunction with CLONE_THREAD or
              CLONE_PARENT.  For security reasons, CLONE_NEWUSER cannot be
              specified in conjunction with CLONE_FS.

       CLONE_NEWUTS (since Linux 2.6.19)
              If CLONE_NEWUTS is set, then create the process in a new UTS
              namespace, whose identifiers are initialized by duplicating the
              identifiers from the UTS namespace of the calling process.  If
              this flag is not set, then (as with fork(2)) the process is created
              in the same UTS namespace as the calling process.

              For further information on UTS namespaces, see
              uts_namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ
              CLONE_NEWUTS.

       CLONE_PARENT (since Linux 2.3.12)
              If CLONE_PARENT is set, then the parent of the new child (as returned
              by getppid(2)) will be the same as that of the calling
              process.

              If CLONE_PARENT is not set, then (as with fork(2)) the child's
              parent is the calling process.

              Note that it is the parent process, as returned by getppid(2),
              which is signaled when the child terminates, so that if
              CLONE_PARENT is set, then the parent of the calling process,
              rather than the calling process itself, is signaled.

              The CLONE_PARENT flag can't be used in clone calls by the global
              init process (PID 1 in the initial PID namespace) and init processes
              in other PID namespaces.  This restriction prevents the
              creation of multi-rooted process trees as well as the creation
              of unreapable zombies in the initial PID namespace.

       CLONE_PARENT_SETTID (since Linux 2.5.49)
              Store the child thread ID at the location pointed to by
              parent_tid (clone()) or cl_args.parent_tid (clone3()) in the
              parent's memory.  (In Linux 2.5.32-2.5.48 there was a flag
              CLONE_SETTID that did this.)  The store operation completes before
              the clone call returns control to user space.
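
              As an illustrative sketch, the wrapper's first optional
              argument can be used to receive the child's thread ID.  The
              helper name parent_tid_matches and the stack size are
              assumptions.  Because the store completes before clone()
              returns, the caller can compare the stored value with the
              return value immediately.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)  /* illustrative size, an assumption */

/* Child: nothing to do, exit successfully. */
static int child_fn(void *arg)
{
    (void)arg;
    return 0;
}

/* Returns 1 if the TID stored through parent_tid equals the value
   returned by clone(), 0 if it differs, -1 on error. */
int parent_tid_matches(void)
{
    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    pid_t parent_tid = -1;
    int result = -1;

    /* parent_tid is the first of the optional trailing arguments. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_PARENT_SETTID | SIGCHLD, NULL, &parent_tid);
    if (pid != -1) {
        result = (parent_tid == pid);  /* store is complete on return */
        waitpid(pid, NULL, 0);
    }

    munmap(stack, STACK_SIZE);
    return result;
}
```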

       CLONE_PID (Linux 2.0 to 2.5.15)
              If CLONE_PID is set, the child process is created with the same
              process ID as the calling process.  This is good for hacking the
              system, but otherwise of not much use.  From Linux 2.3.21 onward,
              this flag could be specified only by the system boot
              process (PID 0).  The flag disappeared completely from the kernel
              sources in Linux 2.5.16.  Subsequently, the kernel silently
              ignored this bit if it was specified in the flags mask.  Much
              later, the same bit was recycled for use as the CLONE_PIDFD
              flag.

       CLONE_PIDFD (since Linux 5.2)
              If this flag is specified, a PID file descriptor referring to
              the child process is allocated and placed at a specified location
              in the parent's memory.  The close-on-exec flag is set on
              this new file descriptor.  PID file descriptors can be used for
              the purposes described in pidfd_open(2).

              *  When using clone3(), the PID file descriptor is placed at the
                 location pointed to by cl_args.pidfd.

              *  When using clone(), the PID file descriptor is placed at the
                 location pointed to by parent_tid.  Since the parent_tid argument
                 is used to return the PID file descriptor, CLONE_PIDFD
                 cannot be used with CLONE_PARENT_SETTID when calling clone().

              It is currently not possible to use this flag together with
              CLONE_THREAD.  This means that the process identified by the PID
              file descriptor will always be a thread group leader.

              If the obsolete CLONE_DETACHED flag is specified alongside
              CLONE_PIDFD when calling clone(), an error is returned.  An error
              also results if CLONE_DETACHED is specified when calling
              clone3().  This error behavior ensures that the bit corresponding
              to CLONE_DETACHED can be reused for further PID file descriptor
              features in the future.
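
              The clone() case can be sketched as follows.  The helper name
              pidfd_signals_exit, the stack size, and the five-second
              timeout are assumptions, and the use of poll(2) to wait for
              termination relies on behavior documented in pidfd_open(2)
              rather than on this page, so it may not apply on the oldest
              kernels that support CLONE_PIDFD.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <poll.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000   /* fallback for older C library headers */
#endif

#define STACK_SIZE (256 * 1024)  /* illustrative size, an assumption */

/* Child: exit immediately. */
static int child_fn(void *arg)
{
    (void)arg;
    return 0;
}

/* Returns 1 if the PID file descriptor has close-on-exec set and became
   readable once the child terminated, 0 if not, and -1 if no PID file
   descriptor could be obtained (for example, on an older kernel). */
int pidfd_signals_exit(void)
{
    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    int pidfd = -1;
    int result = -1;

    /* With clone(), the descriptor is returned through parent_tid. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_PIDFD | SIGCHLD, NULL, &pidfd);
    if (pid != -1) {
        if (pidfd != -1) {
            struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
            int ready = poll(&pfd, 1, 5000);
            int cloexec = fcntl(pidfd, F_GETFD) & FD_CLOEXEC;

            result = (ready == 1 && (pfd.revents & POLLIN) && cloexec);
            close(pidfd);
        }
        waitpid(pid, NULL, 0);   /* the child must still be reaped */
    }

    munmap(stack, STACK_SIZE);
    return result;
}
```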

       CLONE_PTRACE (since Linux 2.2)
              If CLONE_PTRACE is specified, and the calling process is being
              traced, then trace the child also (see ptrace(2)).

       CLONE_SETTLS (since Linux 2.5.32)
              The TLS (Thread Local Storage) descriptor is set to tls.

              The interpretation of tls and the resulting effect is architecture
              dependent.  On x86, tls is interpreted as a struct
              user_desc * (see set_thread_area(2)).  On x86-64 it is the new
              value to be set for the %fs base register (see the ARCH_SET_FS
              argument to arch_prctl(2)).  On architectures with a dedicated
              TLS register, it is the new value of that register.

              Use of this flag requires detailed knowledge and generally it
              should not be used except in libraries implementing threading.

       CLONE_SIGHAND (since Linux 2.0)
              If CLONE_SIGHAND is set, the calling process and the child
              process share the same table of signal handlers.  If the calling
              process or child process calls sigaction(2) to change the behavior
              associated with a signal, the behavior is changed in the
              other process as well.  However, the calling process and child
              processes still have distinct signal masks and sets of pending
              signals.  So, one of them may block or unblock signals using
              sigprocmask(2) without affecting the other process.

              If CLONE_SIGHAND is not set, the child process inherits a copy
              of the signal handlers of the calling process at the time of the
              clone call.  Calls to sigaction(2) performed later by one of the
              processes have no effect on the other process.

              Since Linux 2.6.0, the flags mask must also include CLONE_VM if
              CLONE_SIGHAND is specified.
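
              Several of the flag descriptions so far state combinations
              that are not permitted.  The sketch below collects those
              stated restrictions into a pre-flight check.  The function
              name clone_flags_plausible is hypothetical, the check covers
              only the flags that the C library's <sched.h> defines, and
              the kernel may reject a flags mask for other reasons as well.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

/* Returns false if `flags` contains a combination that the text above
   describes as disallowed; true otherwise.  A true result does not
   guarantee that the clone call will succeed. */
bool clone_flags_plausible(unsigned long flags)
{
    /* CLONE_SIGHAND requires CLONE_VM (since Linux 2.6.0). */
    if ((flags & CLONE_SIGHAND) && !(flags & CLONE_VM))
        return false;

    /* CLONE_NEWNS and CLONE_FS may not be combined. */
    if ((flags & CLONE_NEWNS) && (flags & CLONE_FS))
        return false;

    /* CLONE_NEWUSER excludes CLONE_THREAD, CLONE_PARENT, and CLONE_FS. */
    if ((flags & CLONE_NEWUSER) &&
        (flags & (CLONE_THREAD | CLONE_PARENT | CLONE_FS)))
        return false;

    /* CLONE_NEWPID excludes CLONE_THREAD and CLONE_PARENT. */
    if ((flags & CLONE_NEWPID) && (flags & (CLONE_THREAD | CLONE_PARENT)))
        return false;

    /* CLONE_NEWIPC excludes CLONE_SYSVSEM. */
    if ((flags & CLONE_NEWIPC) && (flags & CLONE_SYSVSEM))
        return false;

    return true;
}
```

              A caller might run such a check before the clone call to
              produce a clearer diagnostic than a bare error return.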
551
552       CLONE_STOPPED (since Linux 2.6.0)
553              If CLONE_STOPPED is set, then the child is initially stopped (as
554              though it was sent a SIGSTOP signal), and  must  be  resumed  by
555              sending it a SIGCONT signal.
556
557              This  flag  was deprecated from Linux 2.6.25 onward, and was re‐
558              moved altogether  in  Linux  2.6.38.   Since  then,  the  kernel
559              silently ignores it without error.  Starting with Linux 4.6, the
560              same bit was reused for the CLONE_NEWCGROUP flag.
561
562       CLONE_SYSVSEM (since Linux 2.5.10)
563              If CLONE_SYSVSEM is set, then the child and the calling  process
564              share  a  single  list of System V semaphore adjustment (semadj)
565              values (see semop(2)).  In this case, the  shared  list  accumu‐
566              lates  semadj  values across all processes sharing the list, and
567              semaphore adjustments are performed only when the  last  process
568              that  is sharing the list terminates (or ceases sharing the list
569              using unshare(2)).  If this flag is not set, then the child  has
570              a separate semadj list that is initially empty.
571
572       CLONE_THREAD (since Linux 2.4.0)
573              If  CLONE_THREAD  is set, the child is placed in the same thread
574              group as the calling process.  To make the remainder of the dis‐
575              cussion of CLONE_THREAD more readable, the term "thread" is used
576              to refer to the processes within a thread group.
577
578              Thread groups were a feature added in Linux 2.4 to  support  the
579              POSIX  threads  notion  of  a set of threads that share a single
580              PID.  Internally, this shared PID is the so-called thread  group
581              identifier  (TGID) for the thread group.  Since Linux 2.4, calls
582              to getpid(2) return the TGID of the caller.
583
584              The threads within a group can be distinguished by  their  (sys‐
585              tem-wide) unique thread IDs (TID).  A new thread's TID is avail‐
586              able as the function result returned to the caller, and a thread
587              can obtain its own TID using gettid(2).
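
              The relationship between the TGID and the TID can be sketched
              as follows.  This is an illustration under stated assumptions,
              not a replacement for pthreads(7): the names spawnThread() and
              recordIds() are hypothetical, the stack is assumed to grow
              downward, and CLONE_PARENT_SETTID plus CLONE_CHILD_CLEARTID
              with futex(2) are used here only so that the caller knows when
              the new thread has exited.  The new thread restricts itself to
              raw system calls because no separate thread-local storage is
              set up for it.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)      /* Arbitrary size chosen for this sketch */

struct threadIds {
    pid_t pid;                      /* getpid() result seen by new thread */
    pid_t tid;                      /* gettid() result seen by new thread */
};

static int                          /* Start function for the new thread */
recordIds(void *arg)
{
    struct threadIds *ids = arg;

    ids->pid = syscall(SYS_getpid); /* TGID, shared by the whole group */
    ids->tid = syscall(SYS_gettid); /* Unique to this thread */
    return 0;                       /* Terminates only this thread */
}

/* Hypothetical helper: create a thread in the caller's thread group, wait
   for it to exit, and return the TID that clone() reported (-1 on error). */
static pid_t
spawnThread(struct threadIds *ids)
{
    pid_t futexWord = 0;            /* Set to TID, cleared on thread exit */
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    int flags = CLONE_VM | CLONE_SIGHAND | CLONE_THREAD |
                CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;

    pid_t tid = clone(recordIds, stack + STACK_SIZE, flags, ids,
                      &futexWord, NULL, &futexWord);
    if (tid == -1) {
        free(stack);
        return -1;
    }

    /* The kernel zeroes futexWord and wakes waiters when the thread exits */
    for (;;) {
        pid_t cur = __atomic_load_n(&futexWord, __ATOMIC_ACQUIRE);
        if (cur == 0)
            break;
        syscall(SYS_futex, &futexWord, FUTEX_WAIT, cur, NULL, NULL, 0);
    }

    free(stack);
    return tid;
}
```

              After spawnThread() returns, ids.pid should equal the caller's
              getpid(2) result, while ids.tid should equal the value returned
              by clone() and differ from the PID.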
588
589              When  a clone call is made without specifying CLONE_THREAD, then
590              the resulting thread is placed in a new thread group whose  TGID
591              is  the  same as the thread's TID.  This thread is the leader of
592              the new thread group.
593
594              A new thread created  with  CLONE_THREAD  has  the  same  parent
595              process  as  the  process  that  made the clone call (i.e., like
596              CLONE_PARENT), so that calls to getppid(2) return the same value
597              for  all  of the threads in a thread group.  When a CLONE_THREAD
598              thread terminates, the thread that created  it  is  not  sent  a
599              SIGCHLD  (or  other  termination)  signal; nor can the status of
600              such a thread be obtained using wait(2).  (The thread is said to
601              be detached.)
602
              After all of the threads in a thread group terminate, the parent
              process of the thread group is sent a SIGCHLD (or other
              termination) signal.
606
607              If  any  of the threads in a thread group performs an execve(2),
608              then all threads other than the thread group leader  are  termi‐
609              nated,  and  the  new  program  is  executed in the thread group
610              leader.
611
612              If one of the threads in a thread group creates  a  child  using
613              fork(2),  then  any  thread  in  the  group can wait(2) for that
614              child.
615
616              Since Linux 2.5.35, the flags mask must also include  CLONE_SIG‐
617              HAND  if  CLONE_THREAD  is specified (and note that, since Linux
618              2.6.0, CLONE_SIGHAND also requires CLONE_VM to be included).
619
620              Signal dispositions and actions are process-wide: if  an  unhan‐
621              dled  signal is delivered to a thread, then it will affect (ter‐
622              minate, stop, continue, be ignored in) all members of the thread
623              group.
624
625              Each thread has its own signal mask, as set by sigprocmask(2).
626
627              A signal may be process-directed or thread-directed.  A process-
628              directed signal is targeted at a thread group  (i.e.,  a  TGID),
629              and  is  delivered  to an arbitrarily selected thread from among
630              those that are  not  blocking  the  signal.   A  signal  may  be
631              process-directed because it was generated by the kernel for rea‐
632              sons other than a hardware exception, or because it was sent us‐
633              ing  kill(2)  or  sigqueue(3).  A thread-directed signal is tar‐
634              geted at (i.e., delivered to) a specific thread.  A  signal  may
              be thread-directed because it was sent using tgkill(2) or
636              pthread_sigqueue(3), or because the thread  executed  a  machine
637              language  instruction that triggered a hardware exception (e.g.,
638              invalid memory access triggering SIGSEGV or a floating-point ex‐
639              ception triggering SIGFPE).
640
641              A  call  to sigpending(2) returns a signal set that is the union
642              of the pending process-directed signals and the signals that are
643              pending for the calling thread.
644
645              If a process-directed signal is delivered to a thread group, and
646              the thread group has installed a handler for  the  signal,  then
647              the handler is invoked in exactly one, arbitrarily selected mem‐
648              ber of the thread group that has not  blocked  the  signal.   If
649              multiple  threads in a group are waiting to accept the same sig‐
650              nal using sigwaitinfo(2), the kernel will arbitrarily select one
651              of these threads to receive the signal.
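
              The interaction between a blocked process-directed signal,
              sigpending(2), and sigwaitinfo(2) can be sketched as follows.
              The sketch assumes a single-threaded caller, so the "arbitrarily
              selected" thread is necessarily the caller itself; the helper
              name acceptBlockedSignal() and the choice of SIGUSR1 are
              assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper: block SIGUSR1, send it to the process with kill(2),
   report via *wasPending whether sigpending(2) shows it, then accept it
   with sigwaitinfo(2).  Returns the accepted signal number, or -1 on error. */
static int
acceptBlockedSignal(int *wasPending)
{
    sigset_t block, pending, old;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);

    /* Each thread has its own mask; this changes only the caller's */
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    /* kill(2) generates a process-directed signal */
    if (kill(getpid(), SIGUSR1) == -1)
        return -1;

    /* The result includes pending process-directed signals */
    if (sigpending(&pending) == -1)
        return -1;
    *wasPending = sigismember(&pending, SIGUSR1);

    /* Accept the signal synchronously; no handler is invoked */
    int sig = sigwaitinfo(&block, NULL);

    sigprocmask(SIG_SETMASK, &old, NULL);   /* Restore the original mask */
    return sig;
}
```

              In a multithreaded program, the same pattern is normally used
              with the signal blocked in every thread, so that only the
              thread calling sigwaitinfo(2) receives it.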
652
653       CLONE_UNTRACED (since Linux 2.5.46)
654              If  CLONE_UNTRACED  is  specified, then a tracing process cannot
655              force CLONE_PTRACE on this child process.
656
657       CLONE_VFORK (since Linux 2.2)
658              If CLONE_VFORK is set, the execution of the calling  process  is
659              suspended  until the child releases its virtual memory resources
660              via a call to execve(2) or _exit(2) (as with vfork(2)).
661
662              If CLONE_VFORK is not set, then both the calling process and the
663              child  are schedulable after the call, and an application should
664              not rely on execution occurring in any particular order.
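
              The suspension of the caller can be made visible with a pipe.
              In the following sketch (names such as writeThenExit() are
              hypothetical, and the stack is assumed to grow downward), the
              child delays briefly before writing a byte.  Because
              CLONE_VFORK is set, the byte should already be readable when
              clone() returns in the caller.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)      /* Arbitrary size chosen for this sketch */

static int                          /* Start function for the child */
writeThenExit(void *arg)
{
    int fd = *(int *) arg;

    usleep(100 * 1000);             /* Delay; the caller stays suspended */
    return (write(fd, "x", 1) == 1) ? 0 : 1;
}

/* Hypothetical helper: returns 1 if the child's byte was already in the
   pipe when clone() returned, 0 if it was not, or -1 on error. */
static int
childFinishedBeforeReturn(void)
{
    int pfd[2];
    char buf;

    if (pipe2(pfd, O_NONBLOCK) == -1)
        return -1;

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    pid_t pid = clone(writeThenExit, stack + STACK_SIZE,
                      CLONE_VFORK | SIGCHLD, &pfd[1]);
    if (pid == -1) {
        free(stack);
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    /* Nonblocking read: succeeds only if the child has already written */
    ssize_t n = read(pfd[0], &buf, 1);

    waitpid(pid, NULL, 0);
    free(stack);
    close(pfd[0]);
    close(pfd[1]);
    return n == 1;
}
```

              Without CLONE_VFORK, the read would be expected to fail with
              EAGAIN in most runs, since the caller would not wait for the
              child's delay to elapse.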
665
666       CLONE_VM (since Linux 2.0)
667              If CLONE_VM is set, the calling process and  the  child  process
668              run in the same memory space.  In particular, memory writes per‐
669              formed by the calling process or by the child process  are  also
670              visible  in  the other process.  Moreover, any memory mapping or
671              unmapping performed with mmap(2) or munmap(2) by  the  child  or
672              calling process also affects the other process.
673
674              If  CLONE_VM  is  not  set, the child process runs in a separate
675              copy of the memory space of the calling process at the  time  of
676              the  clone call.  Memory writes or file mappings/unmappings per‐
677              formed by one of the processes do not affect the other, as  with
678              fork(2).
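
              The difference between the two cases can be sketched as
              follows.  The helper name counterAfterChild() is an assumption
              made for illustration, and the stack is assumed to grow
              downward.  The child increments a counter that lives in the
              caller's memory; the caller inspects it after the child has
              terminated.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)      /* Arbitrary size chosen for this sketch */

static int                          /* Start function for the child */
bumpCounter(void *arg)
{
    int *counter = arg;

    *counter += 1;                  /* Visible to caller only with CLONE_VM */
    return 0;
}

/* Hypothetical helper: run bumpCounter() in a child created with
   'extraFlags' and return the counter value the caller sees afterward
   (-1 on error). */
static int
counterAfterChild(int extraFlags)
{
    int counter = 0;
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    pid_t pid = clone(bumpCounter, stack + STACK_SIZE,
                      extraFlags | SIGCHLD, &counter);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    waitpid(pid, NULL, 0);          /* Child has finished writing by now */
    free(stack);
    return counter;
}
```

              counterAfterChild(0) should return 0, because the child
              modified its own copy, whereas counterAfterChild(CLONE_VM)
              should return 1.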
679
680              If  the  CLONE_VM  flag is specified and the CLONE_VFORK flag is
681              not specified, then any alternate signal stack that  was  estab‐
682              lished by sigaltstack(2) is cleared in the child process.
683

RETURN VALUE

685       On  success,  the  thread  ID  of  the child process is returned in the
686       caller's thread of execution.   On  failure,  -1  is  returned  in  the
687       caller's  context, no child process is created, and errno is set to in‐
688       dicate the error.
689

ERRORS

691       EAGAIN Too many processes are already running; see fork(2).
692
693       EBUSY (clone3() only)
694              CLONE_INTO_CGROUP was specified in cl_args.flags, but  the  file
695              descriptor  specified  in  cl_args.cgroup  refers to a version 2
696              cgroup in which a domain controller is enabled.
697
698       EEXIST (clone3() only)
699              One (or more) of the PIDs specified in set_tid already exists in
700              the corresponding PID namespace.
701
702       EINVAL Both CLONE_SIGHAND and CLONE_CLEAR_SIGHAND were specified in the
703              flags mask.
704
705       EINVAL CLONE_SIGHAND was specified in the flags mask, but CLONE_VM  was
706              not.  (Since Linux 2.6.0.)
707
708       EINVAL CLONE_THREAD  was specified in the flags mask, but CLONE_SIGHAND
709              was not.  (Since Linux 2.5.35.)
710
711       EINVAL CLONE_THREAD was specified in the flags mask,  but  the  current
712              process  previously called unshare(2) with the CLONE_NEWPID flag
713              or used setns(2) to reassociate itself with a PID namespace.
714
715       EINVAL Both CLONE_FS and CLONE_NEWNS were specified in the flags mask.
716
717       EINVAL (since Linux 3.9)
718              Both CLONE_NEWUSER and CLONE_FS  were  specified  in  the  flags
719              mask.
720
721       EINVAL Both  CLONE_NEWIPC and CLONE_SYSVSEM were specified in the flags
722              mask.
723
724       EINVAL One (or both) of CLONE_NEWPID or CLONE_NEWUSER and one (or both)
725              of  CLONE_THREAD  or  CLONE_PARENT  were  specified in the flags
726              mask.
727
728       EINVAL (since Linux 2.6.32)
729              CLONE_PARENT was specified, and the caller is an init process.
730
731       EINVAL Returned by the glibc clone() wrapper function when fn or  stack
732              is specified as NULL.
733
734       EINVAL CLONE_NEWIPC was specified in the flags mask, but the kernel was
735              not configured with the  CONFIG_SYSVIPC  and  CONFIG_IPC_NS  op‐
736              tions.
737
738       EINVAL CLONE_NEWNET was specified in the flags mask, but the kernel was
739              not configured with the CONFIG_NET_NS option.
740
741       EINVAL CLONE_NEWPID was specified in the flags mask, but the kernel was
742              not configured with the CONFIG_PID_NS option.
743
744       EINVAL CLONE_NEWUSER  was  specified  in the flags mask, but the kernel
745              was not configured with the CONFIG_USER_NS option.
746
747       EINVAL CLONE_NEWUTS was specified in the flags mask, but the kernel was
748              not configured with the CONFIG_UTS_NS option.
749
750       EINVAL stack  is  not aligned to a suitable boundary for this architec‐
751              ture.  For example, on aarch64, stack must be a multiple of 16.
752
753       EINVAL (clone3() only)
754              CLONE_DETACHED was specified in the flags mask.
755
756       EINVAL (clone() only)
757              CLONE_PIDFD was specified together with  CLONE_DETACHED  in  the
758              flags mask.
759
760       EINVAL CLONE_PIDFD  was  specified  together  with  CLONE_THREAD in the
761              flags mask.
762
763       EINVAL (clone() only)
764              CLONE_PIDFD was specified together with  CLONE_PARENT_SETTID  in
765              the flags mask.
766
767       EINVAL (clone3() only)
768              set_tid_size  is  greater  than  the  number of nested PID name‐
769              spaces.
770
771       EINVAL (clone3() only)
              One of the PIDs specified in set_tid was invalid.
773
774       EINVAL (AArch64 only, Linux 4.6 and earlier)
775              stack was not aligned to a 128-bit boundary.
776
       ENOMEM Insufficient memory was available to allocate a task structure
              for the child, or to copy those parts of the caller's context
              that need to be copied.
780
781       ENOSPC (since Linux 3.7)
782              CLONE_NEWPID was specified in the flags mask, but the  limit  on
783              the  nesting  depth  of PID namespaces would have been exceeded;
784              see pid_namespaces(7).
785
786       ENOSPC (since Linux 4.9; beforehand EUSERS)
787              CLONE_NEWUSER was specified in the  flags  mask,  and  the  call
788              would cause the limit on the number of nested user namespaces to
789              be exceeded.  See user_namespaces(7).
790
791              From Linux 3.11 to Linux 4.8, the error diagnosed in  this  case
792              was EUSERS.
793
794       ENOSPC (since Linux 4.9)
795              One  of the values in the flags mask specified the creation of a
796              new user namespace, but doing so would have caused the limit de‐
797              fined  by  the  corresponding  file  in /proc/sys/user to be ex‐
798              ceeded.  For further details, see namespaces(7).
799
800       EOPNOTSUPP (clone3() only)
801              CLONE_INTO_CGROUP was specified in cl_args.flags, but  the  file
802              descriptor  specified  in  cl_args.cgroup  refers to a version 2
803              cgroup that is in the domain invalid state.
804
805       EPERM  CLONE_NEWCGROUP,   CLONE_NEWIPC,   CLONE_NEWNET,    CLONE_NEWNS,
806              CLONE_NEWPID,  or  CLONE_NEWUTS was specified by an unprivileged
807              process (process without CAP_SYS_ADMIN).
808
809       EPERM  CLONE_PID was specified by  a  process  other  than  process  0.
810              (This error occurs only on Linux 2.5.15 and earlier.)
811
812       EPERM  CLONE_NEWUSER  was  specified  in the flags mask, but either the
813              effective user ID or the effective group ID of the  caller  does
814              not  have  a  mapping  in  the  parent namespace (see user_name‐
815              spaces(7)).
816
817       EPERM (since Linux 3.9)
818              CLONE_NEWUSER was specified in the flags mask and the caller  is
819              in  a chroot environment (i.e., the caller's root directory does
820              not match the root directory of the mount namespace in which  it
821              resides).
822
823       EPERM (clone3() only)
824              set_tid_size  was  greater  than  zero, and the caller lacks the
825              CAP_SYS_ADMIN capability in one or more of the  user  namespaces
826              that own the corresponding PID namespaces.
827
828       ERESTARTNOINTR (since Linux 2.6.17)
829              System  call  was interrupted by a signal and will be restarted.
830              (This can be seen only during a trace.)
831
832       EUSERS (Linux 3.11 to Linux 4.8)
833              CLONE_NEWUSER was specified in the flags mask, and the limit  on
834              the number of nested user namespaces would be exceeded.  See the
835              discussion of the ENOSPC error above.
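
       As an illustration of how these errors surface in a program, the
       following sketch triggers the EINVAL case that is diagnosed by the
       glibc wrapper itself when fn or stack is NULL.  The helper names are
       hypothetical, and the behavior shown is that of the glibc wrapper,
       not of the raw system call.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

static int                          /* Never runs in this sketch */
unusedChild(void *arg)
{
    (void) arg;
    return 0;
}

/* Hypothetical helper: call the glibc clone() wrapper with the given start
   function and stack, and return the resulting errno value (0 if a child
   was unexpectedly created). */
static int
cloneErrno(int (*fn)(void *), void *stack)
{
    errno = 0;
    pid_t pid = clone(fn, stack, SIGCHLD, NULL);
    if (pid == -1)
        return errno;               /* errno is valid only after a failure */

    waitpid(pid, NULL, 0);
    return 0;
}

/* Both calls are expected to fail inside the wrapper with EINVAL */
static int
nullStackErrno(void)
{
    return cloneErrno(unusedChild, NULL);
}

static int
nullFnErrno(void)
{
    static char buf[4096] __attribute__((aligned(16)));

    return cloneErrno(NULL, buf + sizeof(buf));
}
```

       Since many distinct conditions map to EINVAL, a caller usually has to
       inspect the flags it passed in order to determine which rule was
       violated.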
836

VERSIONS

838       The clone3() system call first appeared in Linux 5.3.
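
       A program that must run on older kernels may want to probe for
       clone3() at run time.  The following is a sketch under stated
       assumptions: the helper name probeClone3() is hypothetical, the
       installed kernel headers are assumed to define struct clone_args and
       SYS_clone3, and a sandbox (for example, a seccomp filter) may cause
       the call to fail with an error other than ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/sched.h>            /* Definition of struct clone_args */
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: create a fork-like child with clone3().  Returns 0
   if the call worked (the child is reaped), or the errno value otherwise;
   ENOSYS indicates a kernel without clone3(). */
static int
probeClone3(void)
{
    struct clone_args args;

    memset(&args, 0, sizeof(args)); /* Unused fields must be zero */
    args.exit_signal = SIGCHLD;     /* Behave like fork(2) */

    long ret = syscall(SYS_clone3, &args, sizeof(args));
    if (ret == -1)
        return errno;

    if (ret == 0)                   /* Child: continues from point of call */
        _exit(0);

    waitpid((pid_t) ret, NULL, 0);  /* Parent: reap the child */
    return 0;
}
```

       When probeClone3() returns ENOSYS, a caller could fall back to
       clone() or fork(2).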
839

CONFORMING TO

841       These system calls are Linux-specific and should not be  used  in  pro‐
842       grams intended to be portable.
843

NOTES

       One use of these system calls is to implement threads: multiple flows
       of control in a program that run concurrently in a shared address
       space.
848
849       Note  that the glibc clone() wrapper function makes some changes in the
850       memory pointed to by stack (changes required to set the stack  up  cor‐
851       rectly  for the child) before invoking the clone() system call.  So, in
852       cases where clone() is used to recursively create children, do not  use
853       the buffer employed for the parent's stack as the stack of the child.
854
855       The kcmp(2) system call can be used to test whether two processes share
856       various resources such as a file descriptor table, System  V  semaphore
857       undo operations, or a virtual address space.
858
859       Handlers  registered  using pthread_atfork(3) are not executed during a
860       clone call.
861
862       In the Linux 2.4.x series, CLONE_THREAD generally  does  not  make  the
863       parent of the new thread the same as the parent of the calling process.
864       However, for kernel versions 2.4.7 to 2.4.18 the CLONE_THREAD flag  im‐
865       plied the CLONE_PARENT flag (as in Linux 2.6.0 and later).
866
867       On  i386,  clone()  should not be called through vsyscall, but directly
868       through int $0x80.
869
870   C library/kernel differences
871       The raw clone() system call corresponds more closely to fork(2) in that
872       execution  in the child continues from the point of the call.  As such,
873       the fn and arg arguments of the clone() wrapper function are omitted.
874
875       In contrast to the glibc wrapper, the raw clone() system  call  accepts
876       NULL as a stack argument (and clone3() likewise allows cl_args.stack to
877       be NULL).  In this case, the child uses a  duplicate  of  the  parent's
878       stack.   (Copy-on-write  semantics  ensure that the child gets separate
879       copies of stack pages when either process modifies the stack.)  In this
880       case,  for  correct operation, the CLONE_VM option should not be speci‐
881       fied.  (If the child shares the parent's memory because of the  use  of
882       the  CLONE_VM  flag, then no copy-on-write duplication occurs and chaos
883       is likely to result.)
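
       The fork-like behavior of the raw system call with a NULL stack can
       be sketched as follows.  The helper name rawCloneLikeFork() is
       hypothetical.  The argument order used is the one described on this
       page for x86-64; because the last three arguments are all zero here,
       the sketch is also expected to work on architectures that merely
       swap tls and child_tid, but not on those with a different layout for
       the first arguments.  CLONE_VM is deliberately not specified.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: create a child using the raw clone() system call
   with a NULL stack, have the child exit with 'exitCode', and return the
   exit status collected by the parent (-1 on error). */
static int
rawCloneLikeFork(int exitCode)
{
    int status;

    /* flags, stack, parent_tid, child_tid, tls (x86-64 argument order) */
    long pid = syscall(SYS_clone, (unsigned long) SIGCHLD,
                       NULL, NULL, NULL, 0UL);
    if (pid == -1)
        return -1;

    if (pid == 0)                   /* Child: execution continues here */
        _exit(exitCode);            /* Runs on a copy of the parent's stack */

    if (waitpid((pid_t) pid, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

       The child calls only _exit(2), because the raw system call bypasses
       any bookkeeping that the C library performs around fork(2).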
884
885       The order of the arguments also differs in the  raw  system  call,  and
886       there are variations in the arguments across architectures, as detailed
887       in the following paragraphs.
888
889       The raw system call interface on x86-64 and  some  other  architectures
890       (including sh, tile, and alpha) is:
891
           long clone(unsigned long flags, void *stack,
                      int *parent_tid, int *child_tid,
                      unsigned long tls);
895
896       On  x86-32,  and  several  other common architectures (including score,
897       ARM, ARM 64, PA-RISC, arc, Power PC, xtensa, and MIPS),  the  order  of
898       the last two arguments is reversed:
899
           long clone(unsigned long flags, void *stack,
                      int *parent_tid, unsigned long tls,
                      int *child_tid);
903
904       On  the  cris  and s390 architectures, the order of the first two argu‐
905       ments is reversed:
906
           long clone(void *stack, unsigned long flags,
                      int *parent_tid, int *child_tid,
                      unsigned long tls);
910
911       On the microblaze architecture, an additional argument is supplied:
912
           long clone(unsigned long flags, void *stack,
                      int stack_size,         /* Size of stack */
                      int *parent_tid, int *child_tid,
                      unsigned long tls);
917
918   blackfin, m68k, and sparc
919       The argument-passing conventions on blackfin, m68k, and sparc are  dif‐
920       ferent  from  the descriptions above.  For details, see the kernel (and
921       glibc) source.
922
923   ia64
924       On ia64, a different interface is used:
925
           int __clone2(int (*fn)(void *),
                        void *stack_base, size_t stack_size,
                        int flags, void *arg, ...
                     /* pid_t *parent_tid, struct user_desc *tls,
                        pid_t *child_tid */ );
931
932       The prototype shown above is for the glibc wrapper  function;  for  the
933       system  call  itself,  the prototype can be described as follows (it is
934       identical to the clone() prototype on microblaze):
935
           long clone2(unsigned long flags, void *stack_base,
                       int stack_size,         /* Size of stack */
                       int *parent_tid, int *child_tid,
                       unsigned long tls);
940
941       __clone2() operates in the same way as clone(), except that  stack_base
942       points  to the lowest address of the child's stack area, and stack_size
943       specifies the size of the stack pointed to by stack_base.
944
945   Linux 2.4 and earlier
946       In Linux 2.4 and earlier, clone() does not take  arguments  parent_tid,
947       tls, and child_tid.
948

BUGS

950       GNU C library versions 2.3.4 up to and including 2.24 contained a wrap‐
951       per function for  getpid(2)  that  performed  caching  of  PIDs.   This
952       caching relied on support in the glibc wrapper for clone(), but limita‐
953       tions in the implementation meant that the cache was not up to date  in
954       some  circumstances.   In  particular, if a signal was delivered to the
955       child immediately after the clone() call, then a call to getpid(2) in a
956       handler  for  the  signal  could  return the PID of the calling process
957       ("the parent"), if the clone wrapper had not yet had a chance to update
958       the  PID  cache  in the child.  (This discussion ignores the case where
959       the child was created using CLONE_THREAD, when getpid(2) should  return
960       the  same  value  in  the child and in the process that called clone(),
961       since the caller and the child are  in  the  same  thread  group.   The
962       stale-cache  problem also does not occur if the flags argument includes
963       CLONE_VM.)  To get the truth, it was sometimes necessary  to  use  code
964       such as the following:
965
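           #include <sys/syscall.h>    /* Definition of SYS_getpid */
           #include <unistd.h>         /* Declaration of syscall() */

           pid_t mypid;

           mypid = syscall(SYS_getpid);    /* Bypasses the glibc PID cache */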
971
972       Because  of the stale-cache problem, as well as other problems noted in
973       getpid(2), the PID caching feature was removed in glibc 2.25.
974

EXAMPLES

976       The following program demonstrates the use of clone() to create a child
977       process  that  executes in a separate UTS namespace.  The child changes
978       the hostname in its UTS namespace.  Both parent and child then  display
979       the  system  hostname, making it possible to see that the hostname dif‐
980       fers in the UTS namespaces of the parent and child.  For an example  of
981       the use of this program, see setns(2).
982
983       Within  the  sample  program, we allocate the memory that is to be used
984       for the child's stack using mmap(2) rather than malloc(3) for the  fol‐
985       lowing reasons:
986
987       *  mmap(2)  allocates  a block of memory that starts on a page boundary
988          and is a multiple of the page size.  This is useful if  we  want  to
989          establish a guard page (a page with protection PROT_NONE) at the end
990          of the stack using mprotect(2).
991
992       *  We can specify the MAP_STACK flag to request a mapping that is suit‐
993          able  for  a  stack.  For the moment, this flag is a no-op on Linux,
994          but it exists and has effect on some other systems, so we should in‐
995          clude it for portability.
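
       The guard page mentioned in the first point is not set up by the
       sample program.  The following is a minimal sketch of how it could be
       done, assuming that the stack grows downward, so that the "end" of
       the stack is the lowest page of the mapping.  The helper names
       allocGuardedStack() and guardPageFaults() are hypothetical; the
       second helper uses fork(2) only to show, without crashing the
       caller, that a store into the guard page raises SIGSEGV.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: map 'size' usable bytes plus one guard page below
   them.  Returns the base of the mapping (the guard page), or NULL on
   error; *stackTop receives the address to pass to clone(). */
static char *
allocGuardedStack(size_t size, char **stackTop)
{
    size_t pageSize = (size_t) sysconf(_SC_PAGESIZE);
    char *base = mmap(NULL, pageSize + size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Revoke all access to the lowest page: stack overflow hits it first */
    if (mprotect(base, pageSize, PROT_NONE) == -1) {
        munmap(base, pageSize + size);
        return NULL;
    }

    *stackTop = base + pageSize + size;
    return base;
}

/* Hypothetical helper: returns 1 if a store into the guard page kills a
   forked child with SIGSEGV, 0 if it does not, or -1 on error. */
static int
guardPageFaults(void)
{
    const size_t size = 64 * 1024;  /* Arbitrary size for this sketch */
    char *top;
    char *base = allocGuardedStack(size, &top);
    if (base == NULL)
        return -1;

    int status = 0;
    int result = -1;
    pid_t pid = fork();
    if (pid == 0) {
        *(volatile char *) base = 1;    /* Store into the guard page */
        _exit(0);                       /* Reached only if no fault */
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid)
        result = WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;

    munmap(base, (size_t) sysconf(_SC_PAGESIZE) + size);
    return result;
}
```

       With such a helper, the value stored in *stackTop would be passed as
       the stack argument of clone(), in place of stack + STACK_SIZE.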
996
997   Program source
       #define _GNU_SOURCE
       #include <sys/wait.h>
       #include <sys/utsname.h>
       #include <sched.h>
       #include <string.h>
       #include <stdint.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <sys/mman.h>

       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                               } while (0)

       static int              /* Start function for cloned child */
       childFunc(void *arg)
       {
           struct utsname uts;

           /* Change hostname in UTS namespace of child. */

           if (sethostname(arg, strlen(arg)) == -1)
               errExit("sethostname");

           /* Retrieve and display hostname. */

           if (uname(&uts) == -1)
               errExit("uname");
           printf("uts.nodename in child:  %s\n", uts.nodename);

           /* Keep the namespace open for a while, by sleeping.
              This allows some experimentation--for example, another
              process might join the namespace. */

           sleep(200);

           return 0;           /* Child terminates now */
       }

       #define STACK_SIZE (1024 * 1024)    /* Stack size for cloned child */

       int
       main(int argc, char *argv[])
       {
           char *stack;                    /* Start of stack buffer */
           char *stackTop;                 /* End of stack buffer */
           pid_t pid;
           struct utsname uts;

           if (argc < 2) {
               fprintf(stderr, "Usage: %s <child-hostname>\n", argv[0]);
               exit(EXIT_FAILURE);         /* Missing argument is an error */
           }

           /* Allocate memory to be used for the stack of the child. */

           stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
           if (stack == MAP_FAILED)
               errExit("mmap");

           stackTop = stack + STACK_SIZE;  /* Assume stack grows downward */

           /* Create child that has its own UTS namespace;
              child commences execution in childFunc(). */

           pid = clone(childFunc, stackTop, CLONE_NEWUTS | SIGCHLD, argv[1]);
           if (pid == -1)
               errExit("clone");
           printf("clone() returned %jd\n", (intmax_t) pid);

           /* Parent falls through to here */

           sleep(1);           /* Give child time to change its hostname */

           /* Display hostname in parent's UTS namespace. This will be
              different from hostname in child's UTS namespace. */

           if (uname(&uts) == -1)
               errExit("uname");
           printf("uts.nodename in parent: %s\n", uts.nodename);

           if (waitpid(pid, NULL, 0) == -1)    /* Wait for child */
               errExit("waitpid");
           printf("child has terminated\n");

           exit(EXIT_SUCCESS);
       }

1086

COLOPHON

1094       This page is part of release 5.13 of the Linux  man-pages  project.   A
1095       description  of  the project, information about reporting bugs, and the
1096       latest    version    of    this    page,    can     be     found     at
1097       https://www.kernel.org/doc/man-pages/.
1098
1099
1100
Linux                             2021-03-22                          CLONE(2)