__clone2(2)

CLONE(2)                   Linux Programmer's Manual                  CLONE(2)

NAME

       clone, __clone2, clone3 - create a child process

SYNOPSIS

       /* Prototype for the glibc wrapper function */

       #define _GNU_SOURCE
       #include <sched.h>

       int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...
                 /* pid_t *parent_tid, void *tls, pid_t *child_tid */ );

       /* For the prototype of the raw clone() system call, see NOTES */

       long clone3(struct clone_args *cl_args, size_t size);

       Note: There is not yet a glibc wrapper for clone3(); see NOTES.

DESCRIPTION

24       These  system calls create a new ("child") process, in a manner similar
25       to fork(2).
26
27       By contrast with fork(2), these system calls provide more precise  con‐
28       trol over what pieces of execution context are shared between the call‐
29       ing process and the child process.  For  example,  using  these  system
30       calls,  the  caller  can control whether or not the two processes share
31       the virtual address space, the table of file descriptors, and the table
32       of  signal  handlers.   These  system  calls  also  allow the new child
33       process to be placed in separate namespaces(7).
34
35       Note that in this manual page, "calling process"  normally  corresponds
36       to  "parent  process".   But  see  the descriptions of CLONE_PARENT and
37       CLONE_THREAD below.
38
39       This page describes the following interfaces:
40
41       *  The glibc clone() wrapper function and the underlying system call on
42          which  it  is  based.  The main text describes the wrapper function;
43          the differences for the raw system call are described toward the end
44          of this page.
45
46       *  The newer clone3() system call.
47
       In the remainder of this page, the terminology "the clone call" is used
       when noting details that apply to all of these interfaces.
50
51   The clone() wrapper function
52       When the child process is created with the clone() wrapper function, it
53       commences  execution by calling the function pointed to by the argument
54       fn.  (This differs from fork(2), where execution continues in the child
55       from the point of the fork(2) call.)  The arg argument is passed as the
56       argument of the function fn.
57
58       When the fn(arg) function returns, the child process  terminates.   The
59       integer  returned  by fn is the exit status for the child process.  The
60       child process may also terminate explicitly by calling exit(2) or after
61       receiving a fatal signal.
62
63       The  stack  argument  specifies  the  location of the stack used by the
64       child process.  Since the child and calling process may  share  memory,
65       it  is  not possible for the child process to execute in the same stack
66       as the calling process.  The calling process must therefore set up mem‐
67       ory  space  for  the  child  stack  and pass a pointer to this space to
68       clone().  Stacks grow downward on all processors that run Linux (except
69       the  HP  PA processors), so stack usually points to the topmost address
70       of the memory space set up for the child stack.  Note that clone() does
71       not  provide  a  means  whereby the caller can inform the kernel of the
72       size of the stack area.
73
74       The remaining arguments to clone() are discussed below.
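
       The following is a minimal sketch of the wrapper in use, not a
       definitive implementation.  The names child_fn, run_child_with_clone,
       and STACK_SIZE are illustrative assumptions.  The stack is obtained
       with mmap(2), the topmost address is passed as stack, and the value
       returned by fn is collected as the child's exit status.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)    /* illustrative size for the child stack */

/* Runs in the child; its return value becomes the child's exit status. */
static int child_fn(void *arg)
{
    return *(int *)arg;
}

/* Creates a child with clone() and returns its exit status, or -1 on error. */
int run_child_with_clone(int status)
{
    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* Stacks grow downward, so pass the topmost address of the region.
       SIGCHLD in the low byte of flags is the termination signal. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE, SIGCHLD, &status);

    int wstatus = 0;
    int result = -1;
    if (pid != -1 && waitpid(pid, &wstatus, 0) == pid && WIFEXITED(wstatus))
        result = WEXITSTATUS(wstatus);

    munmap(stack, STACK_SIZE);
    return result;
}
```

       Because no sharing flags are given here, the child works on a copy of
       the caller's memory, much as after fork(2).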
75
76   clone3()
77       The clone3() system call provides a superset of  the  functionality  of
78       the older clone() interface.  It also provides a number of API improve‐
79       ments, including: space for additional flags bits;  cleaner  separation
80       in the use of various arguments; and the ability to specify the size of
81       the child's stack area.
82
83       As with fork(2), clone3() returns in both the parent and the child.  It
84       returns  0 in the child process and returns the PID of the child in the
85       parent.
86
87       The cl_args argument of clone3() is a structure of the following form:
88
           struct clone_args {
               u64 flags;        /* Flags bit mask */
               u64 pidfd;        /* Where to store PID file descriptor
                                    (int *) */
               u64 child_tid;    /* Where to store child TID,
                                    in child's memory (pid_t *) */
               u64 parent_tid;   /* Where to store child TID,
                                    in parent's memory (pid_t *) */
               u64 exit_signal;  /* Signal to deliver to parent on
                                    child termination */
               u64 stack;        /* Pointer to lowest byte of stack */
               u64 stack_size;   /* Size of stack */
               u64 tls;          /* Location of new TLS */
               u64 set_tid;      /* Pointer to a pid_t array
                                    (since Linux 5.5) */
               u64 set_tid_size; /* Number of elements in set_tid
                                    (since Linux 5.5) */
               u64 cgroup;       /* File descriptor for target cgroup
                                    of child (since Linux 5.7) */
           };
109
110       The size argument that is supplied to clone3() should be initialized to
111       the  size  of this structure.  (The existence of the size argument per‐
112       mits future extensions to the clone_args structure.)
113
114       The stack for the child process is specified via  cl_args.stack,  which
115       points  to  the  lowest byte of the stack area, and cl_args.stack_size,
116       which specifies the size of the stack in bytes.  In the case where  the
117       CLONE_VM  flag (see below) is specified, a stack must be explicitly al‐
118       located and specified.  Otherwise, these two fields can be specified as
119       NULL  and  0,  which causes the child to use the same stack area as the
120       parent (in the child's own virtual address space).
121
122       The remaining fields in the cl_args argument are discussed below.
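
       Since there is no glibc wrapper, a caller has to go through
       syscall(2).  The sketch below assumes kernel headers that provide
       struct clone_args in <linux/sched.h> and a definition of SYS_clone3.
       The helper names sys_clone3 and clone3_exit_status are hypothetical.
       It makes a fork-like call (no CLONE_VM, so stack and stack_size are
       left as 0) and reports the child's exit status.

```c
#include <linux/sched.h>    /* struct clone_args (assumes recent kernel headers) */
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thin hypothetical helper around the raw system call. */
static long sys_clone3(struct clone_args *args, size_t size)
{
    return syscall(SYS_clone3, args, size);
}

/* Returns the child's exit status, or -1 if clone3() or the wait failed
   (for example with ENOSYS on kernels that lack clone3()). */
int clone3_exit_status(int status)
{
    struct clone_args args;
    memset(&args, 0, sizeof(args));     /* unused fields must be zero */
    args.exit_signal = SIGCHLD;         /* termination signal for the parent */

    long pid = sys_clone3(&args, sizeof(args));
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(status);                  /* child: clone3() returned 0 */

    int wstatus = 0;
    if (waitpid((pid_t)pid, &wstatus, 0) != (pid_t)pid || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

       Passing sizeof(args) as the size argument is what lets the kernel
       tell which version of the structure the caller was built against.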
123
124   Equivalence between clone() and clone3() arguments
125       Unlike the older clone() interface, where arguments are passed individ‐
126       ually,  in the newer clone3() interface the arguments are packaged into
127       the clone_args structure shown above.  This structure allows for a  su‐
128       perset of the information passed via the clone() arguments.
129
130       The  following  table  shows  the  equivalence between the arguments of
131       clone() and the fields in the clone_args argument supplied to clone3():
132
              clone()         clone3()        Notes
                              cl_args field
              flags & ~0xff   flags           For most flags; details below
              parent_tid      pidfd           See CLONE_PIDFD
              child_tid       child_tid       See CLONE_CHILD_SETTID
              parent_tid      parent_tid      See CLONE_PARENT_SETTID
              flags & 0xff    exit_signal
              stack           stack
              ---             stack_size
              tls             tls             See CLONE_SETTLS
              ---             set_tid         See below for details
              ---             set_tid_size
              ---             cgroup          See CLONE_INTO_CGROUP
146
147   The child termination signal
148       When the child process terminates, a signal may be sent to the  parent.
149       The  termination signal is specified in the low byte of flags (clone())
150       or in cl_args.exit_signal (clone3()).  If this signal is  specified  as
151       anything  other  than SIGCHLD, then the parent process must specify the
152       __WALL or __WCLONE options when waiting for the child with wait(2).  If
153       no  signal  (i.e.,  zero)  is specified, then the parent process is not
154       signaled when the child terminates.
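
       As an illustration of the rule above, the following sketch creates a
       child with no termination signal (the low byte of flags is zero) and
       therefore waits with __WALL.  The names exit_with, reap_silent_child,
       and STACK_SIZE are assumptions made for this example.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)     /* illustrative size for the child stack */

static int exit_with(void *arg)
{
    return *(int *)arg;             /* exit status of the child */
}

/* Creates a child whose termination sends no signal to the parent and
   returns the child's exit status, or -1 on error. */
int reap_silent_child(int status)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* flags == 0: no sharing flags and no termination signal. */
    pid_t pid = clone(exit_with, stack + STACK_SIZE, 0, &status);

    int wstatus = 0;
    int result = -1;
    /* The termination signal is not SIGCHLD, so __WALL (or __WCLONE)
       is required for the wait to consider this child. */
    if (pid != -1 && waitpid(pid, &wstatus, __WALL) == pid &&
        WIFEXITED(wstatus))
        result = WEXITSTATUS(wstatus);

    free(stack);
    return result;
}
```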
155
156   The set_tid array
157       By default, the kernel chooses the next  sequential  PID  for  the  new
158       process in each of the PID namespaces where it is present.  When creat‐
159       ing a process with clone3(), the set_tid array (available  since  Linux
160       5.5) can be used to select specific PIDs for the process in some or all
161       of the PID namespaces where it is present.  If the  PID  of  the  newly
162       created  process should be set only for the current PID namespace or in
163       the newly created PID namespace (if flags contains  CLONE_NEWPID)  then
164       the  first  element  in the set_tid array has to be the desired PID and
165       set_tid_size needs to be 1.
166
167       If the PID of the newly created process should have a certain value  in
168       multiple  PID  namespaces, then the set_tid array can have multiple en‐
169       tries.  The first entry defines the PID in the most deeply  nested  PID
170       namespace  and  each  of  the following entries contains the PID in the
171       corresponding ancestor PID namespace.  The number of PID namespaces  in
172       which  a  PID  should be set is defined by set_tid_size which cannot be
173       larger than the number of currently nested PID namespaces.
174
175       To create a process with the following PIDs in a PID namespace  hierar‐
176       chy:
177
              PID NS level   Requested PID   Notes
              0              31496           Outermost PID namespace
              1              42
              2              7               Innermost PID namespace

       Set the array to:

           set_tid[0] = 7;
           set_tid[1] = 42;
           set_tid[2] = 31496;
           set_tid_size = 3;

       If  only the PIDs in the two innermost PID namespaces need to be speci‐
       fied, set the array to:

           set_tid[0] = 7;
           set_tid[1] = 42;
           set_tid_size = 2;
196
197       The PID in the PID namespaces outside the two innermost PID  namespaces
198       will be selected the same way as any other PID is selected.
199
200       The  set_tid  feature  requires  CAP_SYS_ADMIN  or  (since  Linux  5.9)
201       CAP_CHECKPOINT_RESTORE in all owning user namespaces of the target  PID
202       namespaces.
203
204       Callers  may  only choose a PID greater than 1 in a given PID namespace
205       if an init process (i.e., a process with PID 1) already exists in  that
206       namespace.  Otherwise the PID entry for this PID namespace must be 1.
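
       The ordering rule is easy to get backwards, so here is a small sketch
       of it.  The helper build_set_tid is hypothetical and is not part of
       any library.  It takes the requested PIDs listed from the outermost
       namespace inward (as in the table above) and fills set_tid starting
       with the innermost namespace.  The return value is what a caller
       would store in set_tid_size.

```c
#include <stddef.h>
#include <sys/types.h>

/* outer_first[0] is the PID wanted in the outermost namespace and
   outer_first[levels - 1] the PID wanted in the innermost one.
   'wanted' is how many of the innermost namespaces get an explicit PID.
   Returns the number of entries written, or 0 if 'wanted' exceeds the
   number of nested namespaces. */
size_t build_set_tid(const pid_t *outer_first, size_t levels,
                     size_t wanted, pid_t *set_tid)
{
    if (wanted > levels)
        return 0;

    /* set_tid[0] is the innermost namespace; each following entry is
       the next ancestor namespace. */
    for (size_t i = 0; i < wanted; i++)
        set_tid[i] = outer_first[levels - 1 - i];

    return wanted;
}
```

       With clone3(), the array address would then be stored in
       cl_args.set_tid as an integer (for example via a cast through
       uintptr_t) and the returned count in cl_args.set_tid_size.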
207
208   The flags mask
209       Both  clone()  and  clone3() allow a flags bit mask that modifies their
210       behavior and allows the caller to specify what is  shared  between  the
211       calling  process  and the child process.  This bit mask—the flags argu‐
212       ment of clone() or the cl_args.flags field passed  to  clone3()—is  re‐
213       ferred to as the flags mask in the remainder of this page.
214
215       The flags mask is specified as a bitwise-OR of zero or more of the con‐
216       stants listed below.  Except as noted below, these flags are  available
217       (and have the same effect) in both clone() and clone3().
218
219       CLONE_CHILD_CLEARTID (since Linux 2.5.49)
220              Clear  (zero)  the child thread ID at the location pointed to by
221              child_tid (clone()) or  cl_args.child_tid  (clone3())  in  child
222              memory  when  the  child  exits, and do a wakeup on the futex at
223              that address.  The  address  involved  may  be  changed  by  the
224              set_tid_address(2)  system  call.  This is used by threading li‐
225              braries.
226
227       CLONE_CHILD_SETTID (since Linux 2.5.49)
228              Store the  child  thread  ID  at  the  location  pointed  to  by
229              child_tid  (clone())  or  cl_args.child_tid  (clone3())  in  the
230              child's memory.  The store operation completes before the  clone
231              call  returns control to user space in the child process.  (Note
232              that the store operation may not have completed before the clone
233              call  returns  in  the parent process, which will be relevant if
234              the CLONE_VM flag is also employed.)
235
236       CLONE_CLEAR_SIGHAND (since Linux 5.5)
237              By default, signal dispositions in the child thread are the same
238              as  in  the parent.  If this flag is specified, then all signals
239              that are handled in the parent are reset to their default dispo‐
240              sitions (SIG_DFL) in the child.
241
242              Specifying  this flag together with CLONE_SIGHAND is nonsensical
243              and disallowed.
244
245       CLONE_DETACHED (historical)
246              For a while (during the Linux 2.5 development series) there  was
247              a  CLONE_DETACHED flag, which caused the parent not to receive a
248              signal when the child terminated.   Ultimately,  the  effect  of
249              this  flag  was  subsumed under the CLONE_THREAD flag and by the
250              time Linux 2.6.0 was released, this flag had no effect.   Start‐
251              ing  in  Linux  2.6.2,  the need to give this flag together with
252              CLONE_THREAD disappeared.
253
254              This flag is still defined, but it is usually ignored when call‐
255              ing  clone().   However,  see the description of CLONE_PIDFD for
256              some exceptions.
257
258       CLONE_FILES (since Linux 2.0)
259              If CLONE_FILES is set, the calling process and the child process
260              share  the same file descriptor table.  Any file descriptor cre‐
261              ated by the calling process or by  the  child  process  is  also
262              valid  in the other process.  Similarly, if one of the processes
263              closes a file descriptor, or changes its associated flags (using
264              the  fcntl(2)  F_SETFD operation), the other process is also af‐
265              fected.  If a process sharing a file descriptor table calls  ex‐
266              ecve(2), its file descriptor table is duplicated (unshared).
267
268              If  CLONE_FILES is not set, the child process inherits a copy of
269              all file descriptors opened in the calling process at  the  time
270              of  the  clone  call.   Subsequent operations that open or close
271              file descriptors, or change file descriptor flags, performed  by
272              either  the  calling  process or the child process do not affect
273              the other process.  Note, however, that the duplicated file  de‐
274              scriptors  in the child refer to the same open file descriptions
275              as the corresponding file descriptors in  the  calling  process,
276              and thus share file offsets and file status flags (see open(2)).
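
              The difference can be observed with a sketch like the one
              below, in which the child closes a descriptor and the parent
              then checks whether its own descriptor is still valid.  The
              names close_fd, fd_closed_in_parent, and STACK_SIZE are
              assumptions made for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (256 * 1024)     /* illustrative size for the child stack */

/* Child: close the descriptor whose number is passed via arg. */
static int close_fd(void *arg)
{
    return close(*(int *)arg) == 0 ? 0 : 1;
}

/* Returns 1 if a close() in the child also closed the parent's
   descriptor, 0 if the parent's descriptor survived, -1 on error. */
int fd_closed_in_parent(int share_table)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    int flags = SIGCHLD | (share_table ? CLONE_FILES : 0);
    pid_t pid = clone(close_fd, stack + STACK_SIZE, flags, &fds[0]);
    if (pid != -1)
        waitpid(pid, NULL, 0);

    /* F_GETFD fails with EBADF if the descriptor is no longer open. */
    int closed = (fcntl(fds[0], F_GETFD) == -1 && errno == EBADF);

    if (!closed)
        close(fds[0]);
    close(fds[1]);
    free(stack);
    return pid == -1 ? -1 : closed;
}
```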
277
278       CLONE_FS (since Linux 2.0)
279              If  CLONE_FS  is set, the caller and the child process share the
280              same filesystem information.  This  includes  the  root  of  the
281              filesystem,  the  current working directory, and the umask.  Any
282              call to chroot(2), chdir(2), or umask(2) performed by the  call‐
283              ing process or the child process also affects the other process.
284
285              If CLONE_FS is not set, the child process works on a copy of the
286              filesystem information of the calling process at the time of the
287              clone call.  Calls to chroot(2), chdir(2), or umask(2) performed
288              later by one of the processes do not affect the other process.
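
              A sketch of the same idea using the umask follows.  The child
              changes its umask and the parent checks whether its own umask
              changed.  The names tighten_umask, umask_change_visible, and
              STACK_SIZE are illustrative assumptions.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)     /* illustrative size for the child stack */

/* Child: change the umask of whatever filesystem information it uses. */
static int tighten_umask(void *arg)
{
    (void)arg;
    umask(077);
    return 0;
}

/* Returns 1 if the child's umask(077) is visible in the parent,
   0 if it is not, -1 on error.  The caller's umask is restored. */
int umask_change_visible(int share_fs)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    mode_t saved = umask(022);      /* known starting value */

    int flags = SIGCHLD | (share_fs ? CLONE_FS : 0);
    pid_t pid = clone(tighten_umask, stack + STACK_SIZE, flags, NULL);
    if (pid != -1)
        waitpid(pid, NULL, 0);

    mode_t seen = umask(saved);     /* read current value and restore */

    free(stack);
    if (pid == -1)
        return -1;
    return seen == 077;
}
```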
289
290       CLONE_INTO_CGROUP (since Linux 5.7)
291              By default, a child process is placed  in  the  same  version  2
292              cgroup  as  its  parent.   The CLONE_INTO_CGROUP flag allows the
293              child process to be created in a  different  version  2  cgroup.
294              (Note  that  CLONE_INTO_CGROUP  has  effect  only  for version 2
295              cgroups.)
296
297              In order to place the child process in a different  cgroup,  the
298              caller specifies CLONE_INTO_CGROUP in cl_args.flags and passes a
299              file descriptor that  refers  to  a  version  2  cgroup  in  the
300              cl_args.cgroup  field.  (This file descriptor can be obtained by
301              opening a cgroup v2 directory using either the O_RDONLY  or  the
302              O_PATH  flag.)   Note  that  all  of the usual restrictions (de‐
303              scribed in cgroups(7)) on placing a process  into  a  version  2
304              cgroup apply.
305
306              Among  the possible use cases for CLONE_INTO_CGROUP are the fol‐
307              lowing:
308
309              *  Spawning a process into a cgroup different from the  parent's
310                 cgroup  makes  it  possible for a service manager to directly
311                 spawn new services into dedicated cgroups.   This  eliminates
312                 the  accounting  jitter  that  would  be  caused if the child
313                 process was first created in the same cgroup  as  the  parent
314                 and then moved into the target cgroup.  Furthermore, spawning
315                 the child process directly into a target cgroup  is  signifi‐
316                 cantly  cheaper than moving the child process into the target
317                 cgroup after it has been created.
318
319              *  The CLONE_INTO_CGROUP flag also allows the creation of frozen
320                 child  processes by spawning them into a frozen cgroup.  (See
321                 cgroups(7) for a description of the freezer controller.)
322
323              *  For threaded applications  (or  even  thread  implementations
324                 which make use of cgroups to limit individual threads), it is
325                 possible to establish a fixed cgroup layout  before  spawning
326                 each thread directly into its target cgroup.
327
328       CLONE_IO (since Linux 2.6.25)
329              If  CLONE_IO  is set, then the new process shares an I/O context
330              with the calling process.  If this flag is  not  set,  then  (as
331              with fork(2)) the new process has its own I/O context.
332
333              The  I/O  context  is the I/O scope of the disk scheduler (i.e.,
334              what the I/O scheduler uses to model scheduling of  a  process's
335              I/O).  If processes share the same I/O context, they are treated
336              as one by the I/O scheduler.  As  a  consequence,  they  get  to
337              share  disk  time.   For  some  I/O schedulers, if two processes
338              share an I/O context, they will be allowed to  interleave  their
339              disk  access.  If several threads are doing I/O on behalf of the
340              same process (aio_read(3), for  instance),  they  should  employ
341              CLONE_IO to get better I/O performance.
342
343              If  the  kernel  is not configured with the CONFIG_BLOCK option,
344              this flag is a no-op.
345
346       CLONE_NEWCGROUP (since Linux 4.6)
              Create the process in a new cgroup namespace.  If this  flag  is
              not  set,  then  (as with fork(2)) the process is created in the
              same cgroup namespace as the calling process.
350
351              For further information on cgroup namespaces,  see  cgroup_name‐
352              spaces(7).
353
354              Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWC‐
355              GROUP.
356
357       CLONE_NEWIPC (since Linux 2.6.19)
358              If CLONE_NEWIPC is set, then create the process  in  a  new  IPC
359              namespace.  If this flag is not set, then (as with fork(2)), the
360              process is created in the same  IPC  namespace  as  the  calling
361              process.
362
363              For   further  information  on  IPC  namespaces,  see  ipc_name‐
364              spaces(7).
365
366              Only   a   privileged   process   (CAP_SYS_ADMIN)   can   employ
367              CLONE_NEWIPC.   This flag can't be specified in conjunction with
368              CLONE_SYSVSEM.
369
370       CLONE_NEWNET (since Linux 2.6.24)
371              (The implementation of this flag was  completed  only  by  about
372              kernel version 2.6.29.)
373
374              If CLONE_NEWNET is set, then create the process in a new network
375              namespace.  If this flag is not set, then (as with fork(2))  the
376              process  is created in the same network namespace as the calling
377              process.
378
379              For further information on network namespaces, see network_name‐
380              spaces(7).
381
382              Only   a   privileged   process   (CAP_SYS_ADMIN)   can   employ
383              CLONE_NEWNET.
384
385       CLONE_NEWNS (since Linux 2.4.19)
386              If CLONE_NEWNS is set, the cloned child  is  started  in  a  new
387              mount namespace, initialized with a copy of the namespace of the
388              parent.  If CLONE_NEWNS is not set, the child lives in the  same
389              mount namespace as the parent.
390
391              For  further  information on mount namespaces, see namespaces(7)
392              and mount_namespaces(7).
393
394              Only   a   privileged   process   (CAP_SYS_ADMIN)   can   employ
395              CLONE_NEWNS.   It  is  not permitted to specify both CLONE_NEWNS
396              and CLONE_FS in the same clone call.
397
398       CLONE_NEWPID (since Linux 2.6.24)
399              If CLONE_NEWPID is set, then create the process  in  a  new  PID
400              namespace.   If this flag is not set, then (as with fork(2)) the
401              process is created in the same  PID  namespace  as  the  calling
402              process.
403
404              For further information on PID namespaces, see namespaces(7) and
405              pid_namespaces(7).
406
407              Only a privileged process (CAP_SYS_ADMIN) can employ  CLONE_NEW‐
408              PID.    This   flag  can't  be  specified  in  conjunction  with
409              CLONE_THREAD or CLONE_PARENT.
410
411       CLONE_NEWUSER
412              (This flag first became meaningful for clone() in Linux  2.6.23,
413              the  current clone() semantics were merged in Linux 3.5, and the
414              final pieces to make the user namespaces completely usable  were
415              merged in Linux 3.8.)
416
417              If  CLONE_NEWUSER  is set, then create the process in a new user
418              namespace.  If this flag is not set, then (as with fork(2))  the
419              process  is  created  in  the same user namespace as the calling
420              process.
421
422              For further information on user  namespaces,  see  namespaces(7)
423              and user_namespaces(7).
424
425              Before  Linux 3.8, use of CLONE_NEWUSER required that the caller
426              have three capabilities: CAP_SYS_ADMIN, CAP_SETUID, and CAP_SET‐
427              GID.   Starting with Linux 3.8, no privileges are needed to cre‐
428              ate a user namespace.
429
430              This flag can't be specified in conjunction with CLONE_THREAD or
431              CLONE_PARENT.   For  security  reasons,  CLONE_NEWUSER cannot be
432              specified in conjunction with CLONE_FS.
433
434       CLONE_NEWUTS (since Linux 2.6.19)
435              If CLONE_NEWUTS is set, then create the process  in  a  new  UTS
436              namespace,  whose identifiers are initialized by duplicating the
437              identifiers from the UTS namespace of the calling  process.   If
438              this flag is not set, then (as with fork(2)) the process is cre‐
439              ated in the same UTS namespace as the calling process.
440
441              For  further  information  on  UTS  namespaces,  see   uts_name‐
442              spaces(7).
443
444              Only   a   privileged   process   (CAP_SYS_ADMIN)   can   employ
445              CLONE_NEWUTS.
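
              A hedged sketch of the isolation this provides: the child
              changes the hostname inside its new UTS namespace and the
              parent checks that its own hostname is unchanged.  The names
              rename_host, hostname_isolated, and STACK_SIZE are assumptions
              made for this example.  Without CAP_SYS_ADMIN the clone call is
              expected to fail, in which case the helper returns -1.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (256 * 1024)     /* illustrative size for the child stack */

/* Child: set the hostname in the child's own UTS namespace. */
static int rename_host(void *arg)
{
    const char *name = arg;
    return sethostname(name, strlen(name)) == 0 ? 0 : 1;
}

/* Returns 1 if the parent's hostname is unchanged after the child
   renamed its host, 0 if it changed, -1 if the child could not be
   created (for example, EPERM for an unprivileged caller). */
int hostname_isolated(const char *child_name)
{
    char before[256] = {0};
    char after[256] = {0};

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    gethostname(before, sizeof(before) - 1);

    pid_t pid = clone(rename_host, stack + STACK_SIZE,
                      CLONE_NEWUTS | SIGCHLD, (void *)child_name);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    waitpid(pid, NULL, 0);

    gethostname(after, sizeof(after) - 1);
    free(stack);
    return strcmp(before, after) == 0;
}
```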
446
447       CLONE_PARENT (since Linux 2.3.12)
448              If CLONE_PARENT is set, then the parent of the new child (as re‐
449              turned  by  getppid(2))  will be the same as that of the calling
450              process.
451
452              If CLONE_PARENT is not set, then (as with fork(2))  the  child's
453              parent is the calling process.
454
455              Note  that  it is the parent process, as returned by getppid(2),
456              which  is  signaled  when  the  child  terminates,  so  that  if
457              CLONE_PARENT  is  set,  then  the parent of the calling process,
458              rather than the calling process itself, will be signaled.
459
460              The CLONE_PARENT flag can't be used in clone calls by the global
461              init  process (PID 1 in the initial PID namespace) and init pro‐
462              cesses in other PID namespaces.  This restriction  prevents  the
463              creation  of  multi-rooted process trees as well as the creation
464              of unreapable zombies in the initial PID namespace.
465
466       CLONE_PARENT_SETTID (since Linux 2.5.49)
467              Store the child thread ID at the location  pointed  to  by  par‐
468              ent_tid  (clone())  or cl_args.parent_tid (clone3()) in the par‐
469              ent's  memory.   (In  Linux  2.5.32-2.5.48  there  was  a   flag
470              CLONE_SETTID  that did this.)  The store operation completes be‐
471              fore the clone call returns control to user space.
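
              With the clone() wrapper, parent_tid is the first of the
              optional trailing arguments.  The sketch below (with the
              hypothetical names exit_at_once, parent_tid_matches, and
              STACK_SIZE) checks that the stored thread ID equals the value
              returned by the call.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)     /* illustrative size for the child stack */

static int exit_at_once(void *arg)
{
    (void)arg;
    return 0;
}

/* Returns 1 if the TID stored via CLONE_PARENT_SETTID equals the value
   returned by clone(), 0 if it differs, -1 on error. */
int parent_tid_matches(void)
{
    pid_t tid = -1;

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* &tid is passed as the optional parent_tid argument. */
    pid_t pid = clone(exit_at_once, stack + STACK_SIZE,
                      CLONE_PARENT_SETTID | SIGCHLD, NULL, &tid);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    /* The store has completed by the time clone() returns here. */
    int match = (tid == pid);

    waitpid(pid, NULL, 0);
    free(stack);
    return match;
}
```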
472
473       CLONE_PID (Linux 2.0 to 2.5.15)
474              If CLONE_PID is set, the child process is created with the  same
475              process ID as the calling process.  This is good for hacking the
476              system, but otherwise of not much use.  From  Linux  2.3.21  on‐
477              ward,  this  flag  could  be  specified  only by the system boot
478              process (PID 0).  The flag disappeared completely from the  ker‐
479              nel  sources in Linux 2.5.16.  Subsequently, the kernel silently
480              ignored this bit if it was specified in the  flags  mask.   Much
481              later,  the  same  bit  was  recycled for use as the CLONE_PIDFD
482              flag.
483
484       CLONE_PIDFD (since Linux 5.2)
485              If this flag is specified, a PID file  descriptor  referring  to
486              the  child  process is allocated and placed at a specified loca‐
487              tion in the parent's memory.  The close-on-exec flag is  set  on
488              this  new file descriptor.  PID file descriptors can be used for
489              the purposes described in pidfd_open(2).
490
491              *  When using clone3(), the PID file descriptor is placed at the
492                 location pointed to by cl_args.pidfd.
493
494              *  When  using clone(), the PID file descriptor is placed at the
495                 location pointed to by parent_tid.  Since the parent_tid  ar‐
496                 gument is used to return the PID file descriptor, CLONE_PIDFD
497                 cannot be used with CLONE_PARENT_SETTID when calling clone().
498
499              It is currently not possible to  use  this  flag  together  with
500              CLONE_THREAD.  This means that the process identified by the PID
501              file descriptor will always be a thread group leader.
502
503              If the  obsolete  CLONE_DETACHED  flag  is  specified  alongside
504              CLONE_PIDFD  when calling clone(), an error is returned.  An er‐
505              ror also results if CLONE_DETACHED  is  specified  when  calling
506              clone3().   This error behavior ensures that the bit correspond‐
507              ing to CLONE_DETACHED can be reused for  further  PID  file  de‐
508              scriptor features in the future.
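
              The following sketch obtains a PID file descriptor through the
              clone() wrapper and waits for it to become readable, which is
              one of the uses described in pidfd_open(2).  The names
              exit_at_once, wait_via_pidfd, and STACK_SIZE are assumptions
              for this example, and the fallback definition of CLONE_PIDFD is
              intended only for C library headers that predate the flag.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <poll.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000      /* fallback for older C library headers */
#endif

#define STACK_SIZE (256 * 1024)     /* illustrative size for the child stack */

static int exit_at_once(void *arg)
{
    (void)arg;
    return 0;
}

/* Returns 1 if a close-on-exec PID file descriptor was obtained and
   signaled the child's termination, 0 if one of those checks failed,
   -1 if no PID file descriptor could be obtained. */
int wait_via_pidfd(void)
{
    int pidfd = -1;

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* With clone(), the descriptor is stored where parent_tid points. */
    pid_t pid = clone(exit_at_once, stack + STACK_SIZE,
                      CLONE_PIDFD | SIGCHLD, NULL, &pidfd);
    if (pid == -1 || pidfd < 0) {
        if (pid != -1)
            waitpid(pid, NULL, 0);
        free(stack);
        return -1;
    }

    int fdflags = fcntl(pidfd, F_GETFD);
    int cloexec = (fdflags != -1) && (fdflags & FD_CLOEXEC);

    /* Wait (up to 5 seconds) for the descriptor to become readable. */
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int ready = poll(&pfd, 1, 5000);

    waitpid(pid, NULL, 0);
    close(pidfd);
    free(stack);
    return (cloexec && ready == 1) ? 1 : 0;
}
```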
509
510       CLONE_PTRACE (since Linux 2.2)
511              If  CLONE_PTRACE  is specified, and the calling process is being
512              traced, then trace the child also (see ptrace(2)).
513
514       CLONE_SETTLS (since Linux 2.5.32)
515              The TLS (Thread Local Storage) descriptor is set to tls.
516
517              The interpretation of tls and the resulting effect is  architec‐
518              ture  dependent.   On  x86,  tls  is  interpreted  as  a  struct
519              user_desc * (see set_thread_area(2)).  On x86-64 it is  the  new
520              value  to  be set for the %fs base register (see the ARCH_SET_FS
521              argument to arch_prctl(2)).  On architectures with  a  dedicated
522              TLS register, it is the new value of that register.
523
524              Use  of  this  flag requires detailed knowledge and generally it
525              should not be used except in libraries implementing threading.
526
527       CLONE_SIGHAND (since Linux 2.0)
528              If CLONE_SIGHAND is set,  the  calling  process  and  the  child
529              process share the same table of signal handlers.  If the calling
530              process or child process calls sigaction(2) to change the behav‐
531              ior  associated  with  a  signal, the behavior is changed in the
532              other process as well.  However, the calling process  and  child
533              processes  still  have distinct signal masks and sets of pending
534              signals.  So, one of them may block  or  unblock  signals  using
535              sigprocmask(2) without affecting the other process.
536
537              If  CLONE_SIGHAND  is not set, the child process inherits a copy
538              of the signal handlers of the calling process at the time of the
539              clone call.  Calls to sigaction(2) performed later by one of the
540              processes have no effect on the other process.
541
              Since Linux 2.6.0, the flags mask must also include CLONE_VM  if
              CLONE_SIGHAND is specified.
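
              Both points can be illustrated with a short sketch.  The first
              helper shares the handler table (together with CLONE_VM) and
              checks that a sigaction(2) call in the child is visible in the
              parent.  The second omits CLONE_VM and reports the resulting
              error.  The names ignore_usr2, sighand_shared,
              sighand_without_vm_errno, and STACK_SIZE are illustrative
              assumptions.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)     /* illustrative size for the child stack */

/* Child: set the disposition of SIGUSR2 to "ignore". */
static int ignore_usr2(void *arg)
{
    struct sigaction sa;
    (void)arg;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_IGN;
    return sigaction(SIGUSR2, &sa, NULL) == 0 ? 0 : 1;
}

/* Returns 1 if the child's sigaction() changed the parent's
   disposition, 0 if not, -1 on error.  The disposition is restored. */
int sighand_shared(void)
{
    struct sigaction saved, dfl, now;
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    memset(&dfl, 0, sizeof(dfl));
    dfl.sa_handler = SIG_DFL;
    sigaction(SIGUSR2, &dfl, &saved);   /* known starting disposition */

    pid_t pid = clone(ignore_usr2, stack + STACK_SIZE,
                      CLONE_VM | CLONE_SIGHAND | SIGCHLD, NULL);
    if (pid != -1)
        waitpid(pid, NULL, 0);

    sigaction(SIGUSR2, NULL, &now);     /* what the parent sees now */
    sigaction(SIGUSR2, &saved, NULL);   /* restore original disposition */
    free(stack);
    return pid == -1 ? -1 : (now.sa_handler == SIG_IGN);
}

/* Tries CLONE_SIGHAND without CLONE_VM and returns the errno value of
   the failed call, or 0 if the call unexpectedly succeeded. */
int sighand_without_vm_errno(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return ENOMEM;

    pid_t pid = clone(ignore_usr2, stack + STACK_SIZE,
                      CLONE_SIGHAND | SIGCHLD, NULL);
    int err = (pid == -1) ? errno : 0;
    if (pid != -1)
        waitpid(pid, NULL, 0);
    free(stack);
    return err;
}
```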
544
545       CLONE_STOPPED (since Linux 2.6.0)
546              If CLONE_STOPPED is set, then the child is initially stopped (as
547              though it was sent a SIGSTOP signal), and  must  be  resumed  by
548              sending it a SIGCONT signal.
549
550              This  flag  was deprecated from Linux 2.6.25 onward, and was re‐
551              moved altogether  in  Linux  2.6.38.   Since  then,  the  kernel
552              silently ignores it without error.  Starting with Linux 4.6, the
553              same bit was reused for the CLONE_NEWCGROUP flag.
554
555       CLONE_SYSVSEM (since Linux 2.5.10)
556              If CLONE_SYSVSEM is set, then the child and the calling  process
557              share  a  single  list of System V semaphore adjustment (semadj)
558              values (see semop(2)).  In this case, the  shared  list  accumu‐
559              lates  semadj  values across all processes sharing the list, and
560              semaphore adjustments are performed only when the  last  process
561              that  is sharing the list terminates (or ceases sharing the list
562              using unshare(2)).  If this flag is not set, then the child  has
563              a separate semadj list that is initially empty.
564
565       CLONE_THREAD (since Linux 2.4.0)
566              If  CLONE_THREAD  is set, the child is placed in the same thread
567              group as the calling process.  To make the remainder of the dis‐
568              cussion of CLONE_THREAD more readable, the term "thread" is used
569              to refer to the processes within a thread group.
570
571              Thread groups were a feature added in Linux 2.4 to  support  the
572              POSIX  threads  notion  of  a set of threads that share a single
573              PID.  Internally, this shared PID is the so-called thread  group
574              identifier  (TGID) for the thread group.  Since Linux 2.4, calls
575              to getpid(2) return the TGID of the caller.
576
577              The threads within a group can be distinguished by  their  (sys‐
578              tem-wide) unique thread IDs (TID).  A new thread's TID is avail‐
579              able as the function result returned to the caller, and a thread
580              can obtain its own TID using gettid(2).
581
582              When  a clone call is made without specifying CLONE_THREAD, then
583              the resulting thread is placed in a new thread group whose  TGID
584              is  the  same as the thread's TID.  This thread is the leader of
585              the new thread group.
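
              The relationship between TGID and TID described above can be
              checked from user space.  The following is a minimal sketch and
              not part of the original page: the helper name
              is_thread_group_leader() is hypothetical, and the raw SYS_gettid
              call is used on the assumption that the C library may not provide
              a gettid() wrapper.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return 1 if the calling thread is the leader of
   its thread group, that is, if its TID equals the TGID that getpid(2)
   reports; return 0 otherwise. */
static int
is_thread_group_leader(void)
{
    pid_t tgid = getpid();                    /* TGID of the caller */
    pid_t tid = (pid_t) syscall(SYS_gettid);  /* TID of the caller */

    return tid == tgid;
}
```

              Called from the initial thread of a process, the helper should
              return 1; called from a thread created with CLONE_THREAD, it
              should return 0.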
586
587              A new thread created  with  CLONE_THREAD  has  the  same  parent
588              process  as  the  process  that  made the clone call (i.e., like
589              CLONE_PARENT), so that calls to getppid(2) return the same value
590              for  all  of the threads in a thread group.  When a CLONE_THREAD
591              thread terminates, the thread that created  it  is  not  sent  a
592              SIGCHLD  (or  other  termination)  signal; nor can the status of
593              such a thread be obtained using wait(2).  (The thread is said to
594              be detached.)
595
596              After all of the threads in a thread group terminate, the parent
597              process of the thread group is sent a SIGCHLD (or other termina‐
598              tion) signal.
599
600              If  any  of the threads in a thread group performs an execve(2),
601              then all threads other than the thread group leader  are  termi‐
602              nated,  and  the  new  program  is  executed in the thread group
603              leader.
604
605              If one of the threads in a thread group creates  a  child  using
606              fork(2),  then  any  thread  in  the  group can wait(2) for that
607              child.
608
609              Since Linux 2.5.35, the flags mask must also include  CLONE_SIG‐
610              HAND  if  CLONE_THREAD  is specified (and note that, since Linux
611              2.6.0, CLONE_SIGHAND also requires CLONE_VM to be included).
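
              The dependency chain (CLONE_THREAD requires CLONE_SIGHAND, which
              in turn requires CLONE_VM) can be checked before making the call.
              The following sketch is illustrative only; the function name
              thread_flags_consistent() is hypothetical, and the check covers
              only these two rules, not every combination the kernel rejects.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: return 1 if flags honors the two rules
   CLONE_THREAD -> CLONE_SIGHAND and CLONE_SIGHAND -> CLONE_VM,
   0 otherwise.  Other invalid combinations are not examined. */
static int
thread_flags_consistent(unsigned long flags)
{
    if ((flags & CLONE_THREAD) && !(flags & CLONE_SIGHAND))
        return 0;
    if ((flags & CLONE_SIGHAND) && !(flags & CLONE_VM))
        return 0;
    return 1;
}
```

              A mask of CLONE_VM | CLONE_SIGHAND | CLONE_THREAD passes the
              check, whereas CLONE_THREAD | CLONE_VM does not.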
612
613              Signal dispositions and actions are process-wide: if  an  unhan‐
614              dled  signal is delivered to a thread, then it will affect (ter‐
615              minate, stop, continue, be ignored in) all members of the thread
616              group.
617
618              Each thread has its own signal mask, as set by sigprocmask(2).
619
620              A signal may be process-directed or thread-directed.  A process-
621              directed signal is targeted at a thread group  (i.e.,  a  TGID),
622              and  is  delivered  to an arbitrarily selected thread from among
623              those that are  not  blocking  the  signal.   A  signal  may  be
624              process-directed because it was generated by the kernel for rea‐
625              sons other than a hardware exception, or because it was sent us‐
626              ing  kill(2)  or  sigqueue(3).  A thread-directed signal is tar‐
627              geted at (i.e., delivered to) a specific thread.  A  signal  may
628              be thread-directed because it was sent using tgkill(2) or
629              pthread_sigqueue(3), or because the thread  executed  a  machine
630              language  instruction that triggered a hardware exception (e.g.,
631              invalid memory access triggering SIGSEGV or a floating-point ex‐
632              ception triggering SIGFPE).
633
634              A  call  to sigpending(2) returns a signal set that is the union
635              of the pending process-directed signals and the signals that are
636              pending for the calling thread.
637
638              If a process-directed signal is delivered to a thread group, and
639              the thread group has installed a handler for  the  signal,  then
640              the handler will be invoked in exactly one, arbitrarily selected
641              member of the thread group that has not blocked the signal.   If
642              multiple  threads in a group are waiting to accept the same sig‐
643              nal using sigwaitinfo(2), the kernel will arbitrarily select one
644              of these threads to receive the signal.
645
646       CLONE_UNTRACED (since Linux 2.5.46)
647              If  CLONE_UNTRACED  is  specified, then a tracing process cannot
648              force CLONE_PTRACE on this child process.
649
650       CLONE_VFORK (since Linux 2.2)
651              If CLONE_VFORK is set, the execution of the calling  process  is
652              suspended  until the child releases its virtual memory resources
653              via a call to execve(2) or _exit(2) (as with vfork(2)).
654
655              If CLONE_VFORK is not set, then both the calling process and the
656              child  are schedulable after the call, and an application should
657              not rely on execution occurring in any particular order.
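
              The suspension of the caller can be observed by combining
              CLONE_VFORK with CLONE_VM.  The sketch below is illustrative
              only: the names vfork_child(), run_vfork_child(), and
              VFORK_STACK_SIZE are hypothetical, and the stack is assumed to
              grow downward.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define VFORK_STACK_SIZE (64 * 1024)    /* hypothetical child stack size */

static int
vfork_child(void *arg)
{
    *(volatile int *) arg = 42;     /* Lands in the parent's memory (CLONE_VM) */
    return 0;                       /* Returning terminates the child */
}

/* Returns the value stored by the child, or -1 on error. */
static int
run_vfork_child(void)
{
    volatile int value = 0;
    int result;
    char *stack;
    pid_t pid;

    stack = mmap(NULL, VFORK_STACK_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    pid = clone(vfork_child, stack + VFORK_STACK_SIZE,
                CLONE_VM | CLONE_VFORK | SIGCHLD, (void *) &value);
    if (pid == -1) {
        munmap(stack, VFORK_STACK_SIZE);
        return -1;
    }

    /* Because of CLONE_VFORK, clone() returns here only after the child
       has released the shared memory, so the store is already visible. */
    result = value;

    waitpid(pid, NULL, 0);          /* Reap the child */
    munmap(stack, VFORK_STACK_SIZE);
    return result;
}
```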
658
659       CLONE_VM (since Linux 2.0)
660              If CLONE_VM is set, the calling process and  the  child  process
661              run in the same memory space.  In particular, memory writes per‐
662              formed by the calling process or by the child process  are  also
663              visible  in  the other process.  Moreover, any memory mapping or
664              unmapping performed with mmap(2) or munmap(2) by  the  child  or
665              calling process also affects the other process.
666
667              If  CLONE_VM  is  not  set, the child process runs in a separate
668              copy of the memory space of the calling process at the  time  of
669              the  clone call.  Memory writes or file mappings/unmappings per‐
670              formed by one of the processes do not affect the other, as  with
671              fork(2).
672
673              If  the  CLONE_VM flag is specified and the CLONE_VFORK flag is not
674              specified, then any alternate signal stack that was  established
675              by sigaltstack(2) is cleared in the child process.
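
              The effect of CLONE_VM on memory writes can be seen by running
              the same child function with and without the flag.  The
              following is a sketch under stated assumptions, not a definitive
              implementation: the names shared_counter, bump_counter(),
              counter_after_child(), and DEMO_STACK_SIZE are hypothetical, and
              the stack is assumed to grow downward.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (64 * 1024)     /* hypothetical child stack size */

static volatile int shared_counter;

static int
bump_counter(void *arg)
{
    (void) arg;
    shared_counter++;       /* Seen by the parent only with CLONE_VM */
    return 0;
}

/* Run bump_counter() in a child created with extra_flags and return the
   parent's view of shared_counter after the child exits (-1 on error). */
static int
counter_after_child(int extra_flags)
{
    char *stack;
    pid_t pid;
    int result = -1;

    stack = mmap(NULL, DEMO_STACK_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    shared_counter = 0;
    pid = clone(bump_counter, stack + DEMO_STACK_SIZE,
                extra_flags | SIGCHLD, NULL);
    if (pid != -1 && waitpid(pid, NULL, 0) == pid)
        result = shared_counter;

    munmap(stack, DEMO_STACK_SIZE);
    return result;
}
```

              With CLONE_VM the parent observes the child's increment; with
              no sharing flags the child modifies only its own copy, as with
              fork(2).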
676

RETURN VALUE

678       On  success,  the  thread  ID  of  the child process is returned in the
679       caller's thread of execution.   On  failure,  -1  is  returned  in  the
680       caller's  context,  no child process will be created, and errno will be
681       set appropriately.
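
       The failure convention can be illustrated with a call that the kernel
       is documented to reject.  The sketch below is not part of the original
       page; it passes CLONE_SIGHAND without CLONE_VM, which should fail with
       EINVAL on Linux 2.6.0 and later.  The names noop_child() and
       errno_for_sighand_without_vm() are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

static int
noop_child(void *arg)
{
    (void) arg;
    return 0;
}

/* Returns the errno value set by the failed clone() call, or 0 if the
   call unexpectedly succeeded. */
static int
errno_for_sighand_without_vm(void)
{
    static _Alignas(16) char stack[16 * 1024];
    pid_t pid;

    pid = clone(noop_child, stack + sizeof(stack),
                CLONE_SIGHAND | SIGCHLD, NULL);
    if (pid == -1)
        return errno;           /* No child was created */

    waitpid(pid, NULL, 0);      /* Unexpected success: reap the child */
    return 0;
}
```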
682

ERRORS

684       EAGAIN Too many processes are already running; see fork(2).
685
686       EBUSY (clone3() only)
687              CLONE_INTO_CGROUP was specified in cl_args.flags, but  the  file
688              descriptor  specified  in  cl_args.cgroup  refers to a version 2
689              cgroup in which a domain controller is enabled.
690
691       EEXIST (clone3() only)
692              One (or more) of the PIDs specified in set_tid already exists in
693              the corresponding PID namespace.
694
695       EINVAL Both CLONE_SIGHAND and CLONE_CLEAR_SIGHAND were specified in the
696              flags mask.
697
698       EINVAL CLONE_SIGHAND was specified in the flags mask, but CLONE_VM  was
699              not.  (Since Linux 2.6.0.)
700
701       EINVAL CLONE_THREAD  was specified in the flags mask, but CLONE_SIGHAND
702              was not.  (Since Linux 2.5.35.)
703
704       EINVAL CLONE_THREAD was specified in the flags mask,  but  the  current
705              process  previously called unshare(2) with the CLONE_NEWPID flag
706              or used setns(2) to reassociate itself with a PID namespace.
707
708       EINVAL Both CLONE_FS and CLONE_NEWNS were specified in the flags mask.
709
710       EINVAL (since Linux 3.9)
711              Both CLONE_NEWUSER and CLONE_FS  were  specified  in  the  flags
712              mask.
713
714       EINVAL Both  CLONE_NEWIPC and CLONE_SYSVSEM were specified in the flags
715              mask.
716
717       EINVAL One (or both) of CLONE_NEWPID or CLONE_NEWUSER and one (or both)
718              of  CLONE_THREAD  or  CLONE_PARENT  were  specified in the flags
719              mask.
720
721       EINVAL (since Linux 2.6.32)
722              CLONE_PARENT was specified, and the caller is an init process.
723
724       EINVAL Returned by the glibc clone() wrapper function when fn or  stack
725              is specified as NULL.
726
727       EINVAL CLONE_NEWIPC was specified in the flags mask, but the kernel was
728              not configured with the  CONFIG_SYSVIPC  and  CONFIG_IPC_NS  op‐
729              tions.
730
731       EINVAL CLONE_NEWNET was specified in the flags mask, but the kernel was
732              not configured with the CONFIG_NET_NS option.
733
734       EINVAL CLONE_NEWPID was specified in the flags mask, but the kernel was
735              not configured with the CONFIG_PID_NS option.
736
737       EINVAL CLONE_NEWUSER  was  specified  in the flags mask, but the kernel
738              was not configured with the CONFIG_USER_NS option.
739
740       EINVAL CLONE_NEWUTS was specified in the flags mask, but the kernel was
741              not configured with the CONFIG_UTS_NS option.
742
743       EINVAL stack  is  not aligned to a suitable boundary for this architec‐
744              ture.  For example, on aarch64, stack must be a multiple of 16.
745
746       EINVAL (clone3() only)
747              CLONE_DETACHED was specified in the flags mask.
748
749       EINVAL (clone() only)
750              CLONE_PIDFD was specified together with  CLONE_DETACHED  in  the
751              flags mask.
752
753       EINVAL CLONE_PIDFD  was  specified  together  with  CLONE_THREAD in the
754              flags mask.
755
756       EINVAL (clone() only)
757              CLONE_PIDFD was specified together with  CLONE_PARENT_SETTID  in
758              the flags mask.
759
760       EINVAL (clone3() only)
761              set_tid_size  is  greater  than  the  number of nested PID name‐
762              spaces.
763
764       EINVAL (clone3() only)
765              One of the PIDs specified in set_tid was invalid.
766
767       EINVAL (AArch64 only, Linux 4.6 and earlier)
768              stack was not aligned to a 128-bit boundary.
769
770       ENOMEM Cannot allocate sufficient memory to allocate a  task  structure
771              for  the  child,  or to copy those parts of the caller's context
772              that need to be copied.
773
774       ENOSPC (since Linux 3.7)
775              CLONE_NEWPID was specified in the flags mask, but the  limit  on
776              the  nesting  depth  of PID namespaces would have been exceeded;
777              see pid_namespaces(7).
778
779       ENOSPC (since Linux 4.9; beforehand EUSERS)
780              CLONE_NEWUSER was specified in the  flags  mask,  and  the  call
781              would cause the limit on the number of nested user namespaces to
782              be exceeded.  See user_namespaces(7).
783
784              From Linux 3.11 to Linux 4.8, the error diagnosed in  this  case
785              was EUSERS.
786
787       ENOSPC (since Linux 4.9)
788              One  of the values in the flags mask specified the creation of a
789              new user namespace, but doing so would have caused the limit de‐
790              fined  by  the  corresponding  file  in /proc/sys/user to be ex‐
791              ceeded.  For further details, see namespaces(7).
792
793       EOPNOTSUPP (clone3() only)
794              CLONE_INTO_CGROUP was specified in cl_args.flags, but  the  file
795              descriptor  specified  in  cl_args.cgroup  refers to a version 2
796              cgroup that is in the domain invalid state.
797
798       EPERM  CLONE_NEWCGROUP,   CLONE_NEWIPC,   CLONE_NEWNET,    CLONE_NEWNS,
799              CLONE_NEWPID,  or  CLONE_NEWUTS was specified by an unprivileged
800              process (process without CAP_SYS_ADMIN).
801
802       EPERM  CLONE_PID was specified by  a  process  other  than  process  0.
803              (This error occurs only on Linux 2.5.15 and earlier.)
804
805       EPERM  CLONE_NEWUSER  was  specified  in the flags mask, but either the
806              effective user ID or the effective group ID of the  caller  does
807              not  have  a  mapping  in  the  parent namespace (see user_name‐
808              spaces(7)).
809
810       EPERM (since Linux 3.9)
811              CLONE_NEWUSER was specified in the flags mask and the caller  is
812              in  a chroot environment (i.e., the caller's root directory does
813              not match the root directory of the mount namespace in which  it
814              resides).
815
816       EPERM (clone3() only)
817              set_tid_size  was  greater  than  zero, and the caller lacks the
818              CAP_SYS_ADMIN capability in one or more of the  user  namespaces
819              that own the corresponding PID namespaces.
820
821       ERESTARTNOINTR (since Linux 2.6.17)
822              System  call  was interrupted by a signal and will be restarted.
823              (This can be seen only during a trace.)
824
825       EUSERS (Linux 3.11 to Linux 4.8)
826              CLONE_NEWUSER was specified in the flags mask, and the limit  on
827              the number of nested user namespaces would be exceeded.  See the
828              discussion of the ENOSPC error above.
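
       Several of the EINVAL cases above concern stack alignment.  One way to
       avoid them is to round the top-of-stack address down before passing it
       to clone().  The following sketch assumes a 16-byte requirement (as
       stated above for aarch64) and a stack that grows downward; the
       function name aligned_stack_top() is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the highest 16-byte-aligned address that
   is not above the end of the buffer buf[0..size-1]. */
static void *
aligned_stack_top(void *buf, size_t size)
{
    uintptr_t top = (uintptr_t) buf + size;

    return (void *) (top & ~(uintptr_t) 15);    /* Round down to 16 */
}
```

       The buffer should be larger than 16 bytes, so that the rounded
       address still lies within it.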
829

VERSIONS

831       The clone3() system call first appeared in Linux 5.3.
832

CONFORMING TO

834       These system calls are Linux-specific and should not be  used  in  pro‐
835       grams intended to be portable.
836

NOTES

838       One  use of these system calls is to implement threads: multiple flows
839       of control in a program that  run  concurrently  in  a  shared  address
840       space.
841
842       Glibc   does  not  provide  a  wrapper  for  clone3();  call  it  using
843       syscall(2).
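
       A call through syscall(2) might look like the following sketch.  It is
       not taken from the original page and rests on two assumptions: that
       the installed kernel headers define SYS_clone3 and struct clone_args
       (in <linux/sched.h>), and that the running kernel is Linux 5.3 or
       later.  The function name clone3_exit_status() is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/sched.h>        /* struct clone_args */
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a fork-like child with clone3(); the child exits with code,
   and the parent returns the child's exit status (-1 on error, with
   errno set by the failing call). */
static int
clone3_exit_status(int code)
{
    struct clone_args args;
    int status;
    pid_t pid;

    memset(&args, 0, sizeof(args));     /* Unused fields must be zero */
    args.exit_signal = SIGCHLD;         /* No CLONE_* flags: like fork(2) */

    pid = (pid_t) syscall(SYS_clone3, &args, sizeof(args));
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);                    /* Child */

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

       On kernels or sandboxes that do not permit clone3(), the call can be
       expected to fail with ENOSYS (or possibly EPERM).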
844
845       Note that the glibc clone() wrapper function makes some changes in  the
846       memory  pointed  to by stack (changes required to set the stack up cor‐
847       rectly for the child) before invoking the clone() system call.  So,  in
848       cases  where clone() is used to recursively create children, do not use
849       the buffer employed for the parent's stack as the stack of the child.
850
851       The kcmp(2) system call can be used to test whether two processes share
852       various  resources  such as a file descriptor table, System V semaphore
853       undo operations, or a virtual address space.
854
855       Handlers registered using pthread_atfork(3) are not executed  during  a
856       clone call.
857
858       In  the  Linux  2.4.x  series, CLONE_THREAD generally does not make the
859       parent of the new thread the same as the parent of the calling process.
860       However,  for kernel versions 2.4.7 to 2.4.18 the CLONE_THREAD flag im‐
861       plied the CLONE_PARENT flag (as in Linux 2.6.0 and later).
862
863       On i386, clone() should not be called through  vsyscall,  but  directly
864       through int $0x80.
865
866   C library/kernel differences
867       The raw clone() system call corresponds more closely to fork(2) in that
868       execution in the child continues from the point of the call.  As  such,
869       the fn and arg arguments of the clone() wrapper function are omitted.
870
871       In  contrast  to the glibc wrapper, the raw clone() system call accepts
872       NULL as a stack argument (and clone3() likewise allows cl_args.stack to
873       be  NULL).   In  this  case, the child uses a duplicate of the parent's
874       stack.  (Copy-on-write semantics ensure that the  child  gets  separate
875       copies of stack pages when either process modifies the stack.)  In this
876       case, for correct operation, the CLONE_VM option should not  be  speci‐
877       fied.   (If  the child shares the parent's memory because of the use of
878       the CLONE_VM flag, then no copy-on-write duplication occurs  and  chaos
879       is likely to result.)
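
       The fork-like behavior of the raw system call with a NULL stack can be
       sketched as follows.  This example is not part of the original page.
       It passes flags first and stack second, so it assumes an architecture
       such as x86-64, x86-32, or ARM rather than cris or s390; since the
       remaining arguments are all zero, their order does not matter here.
       The function name raw_clone_exit_status() is hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a child with the raw clone() system call and a NULL stack.
   The child continues from the point of the call and exits with code;
   the parent returns the child's exit status (-1 on error). */
static int
raw_clone_exit_status(int code)
{
    int status;
    pid_t pid;

    /* CLONE_VM is deliberately omitted: the child runs on a
       copy-on-write duplicate of the parent's stack. */
    pid = (pid_t) syscall(SYS_clone, (unsigned long) SIGCHLD,
                          NULL, NULL, NULL, 0UL);
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);                    /* Child */

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```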
880
881       The  order  of  the  arguments also differs in the raw system call, and
882       there are variations in the arguments across architectures, as detailed
883       in the following paragraphs.
884
885       The  raw  system  call interface on x86-64 and some other architectures
886       (including sh, tile, and alpha) is:
887
           long clone(unsigned long flags, void *stack,
                      int *parent_tid, int *child_tid,
                      unsigned long tls);
891
892       On x86-32, and several other  common  architectures  (including  score,
893       ARM,  ARM  64,  PA-RISC, arc, Power PC, xtensa, and MIPS), the order of
894       the last two arguments is reversed:
895
           long clone(unsigned long flags, void *stack,
                      int *parent_tid, unsigned long tls,
                      int *child_tid);
899
900       On the cris and s390 architectures, the order of the  first  two  argu‐
901       ments is reversed:
902
           long clone(void *stack, unsigned long flags,
                      int *parent_tid, int *child_tid,
                      unsigned long tls);
906
907       On the microblaze architecture, an additional argument is supplied:
908
           long clone(unsigned long flags, void *stack,
                      int stack_size,         /* Size of stack */
                      int *parent_tid, int *child_tid,
                      unsigned long tls);
913
914   blackfin, m68k, and sparc
915       The  argument-passing conventions on blackfin, m68k, and sparc are dif‐
916       ferent from the descriptions above.  For details, see the  kernel  (and
917       glibc) source.
918
919   ia64
920       On ia64, a different interface is used:
921
           int __clone2(int (*fn)(void *),
                        void *stack_base, size_t stack_size,
                        int flags, void *arg, ...
                     /* pid_t *parent_tid, struct user_desc *tls,
                        pid_t *child_tid */ );
927
928       The  prototype  shown  above is for the glibc wrapper function; for the
929       system call itself, the prototype can be described as  follows  (it  is
930       identical to the clone() prototype on microblaze):
931
           long clone2(unsigned long flags, void *stack_base,
                       int stack_size,         /* Size of stack */
                       int *parent_tid, int *child_tid,
                       unsigned long tls);
936
937       __clone2()  operates in the same way as clone(), except that stack_base
938       points to the lowest address of the child's stack area, and  stack_size
939       specifies the size of the stack pointed to by stack_base.
940
941   Linux 2.4 and earlier
942       In  Linux  2.4 and earlier, clone() does not take arguments parent_tid,
943       tls, and child_tid.
944

BUGS

946       GNU C library versions 2.3.4 up to and including 2.24 contained a wrap‐
947       per  function  for  getpid(2)  that  performed  caching  of PIDs.  This
948       caching relied on support in the glibc wrapper for clone(), but limita‐
949       tions  in the implementation meant that the cache was not up to date in
950       some circumstances.  In particular, if a signal was  delivered  to  the
951       child immediately after the clone() call, then a call to getpid(2) in a
952       handler for the signal could return the  PID  of  the  calling  process
953       ("the parent"), if the clone wrapper had not yet had a chance to update
954       the PID cache in the child.  (This discussion ignores  the  case  where
955       the  child was created using CLONE_THREAD, when getpid(2) should return
956       the same value in the child and in the  process  that  called  clone(),
957       since  the  caller  and  the  child  are in the same thread group.  The
958       stale-cache problem also does not occur if the flags argument  includes
959       CLONE_VM.)   To  get  the truth, it was sometimes necessary to use code
960       such as the following:
961
           #include <syscall.h>        /* SYS_getpid */
           #include <unistd.h>         /* syscall() */

           pid_t mypid;

           /* Bypass the glibc wrapper and its PID cache */
           mypid = syscall(SYS_getpid);
967
968       Because of the stale-cache problem, as well as other problems noted  in
969       getpid(2), the PID caching feature was removed in glibc 2.25.
970

EXAMPLES

972       The following program demonstrates the use of clone() to create a child
973       process that executes in a separate UTS namespace.  The  child  changes
974       the  hostname in its UTS namespace.  Both parent and child then display
975       the system hostname, making it possible to see that the  hostname  dif‐
976       fers  in the UTS namespaces of the parent and child.  For an example of
977       the use of this program, see setns(2).
978
979       Within the sample program, we allocate the memory that is  to  be  used
980       for  the child's stack using mmap(2) rather than malloc(3) for the fol‐
981       lowing reasons:
982
983       *  mmap(2) allocates a block of memory that starts on a  page  boundary
984          and  is  a  multiple of the page size.  This is useful if we want to
985          establish a guard page (a page with protection PROT_NONE) at the end
986          of the stack using mprotect(2).
987
988       *  We can specify the MAP_STACK flag to request a mapping that is suit‐
989          able for a stack.  For the moment, this flag is a  no-op  on  Linux,
990          but it exists and has effect on some other systems, so we should in‐
991          clude it for portability.
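
       The guard page mentioned in the list above is not set up by the sample
       program.  A minimal sketch of how it might be done follows; the
       function name alloc_stack_with_guard() is hypothetical, and the stack
       is assumed to grow downward, so that its end is the lowest page of the
       mapping.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map size bytes for use as a child stack and turn
   the lowest page into a guard page.  Returns the base address of the
   mapping, or NULL on error.  size must exceed one page. */
static char *
alloc_stack_with_guard(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    char *stack;

    if (page == -1 || size <= (size_t) page)
        return NULL;

    stack = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return NULL;

    /* An overflow past the end of the stack now faults instead of
       silently overwriting adjacent memory. */
    if (mprotect(stack, (size_t) page, PROT_NONE) == -1) {
        munmap(stack, size);
        return NULL;
    }

    return stack;
}
```

       The address to pass to clone() would then be the returned base plus
       size, as with stackTop in the program below.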
992
993   Program source
       #define _GNU_SOURCE
       #include <sys/wait.h>
       #include <sys/utsname.h>
       #include <sched.h>
       #include <string.h>
       #include <stdint.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <sys/mman.h>

       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                               } while (0)

       static int              /* Start function for cloned child */
       childFunc(void *arg)
       {
           struct utsname uts;

           /* Change hostname in UTS namespace of child */

           if (sethostname(arg, strlen(arg)) == -1)
               errExit("sethostname");

           /* Retrieve and display hostname */

           if (uname(&uts) == -1)
               errExit("uname");
           printf("uts.nodename in child:  %s\n", uts.nodename);

           /* Keep the namespace open for a while, by sleeping.
              This allows some experimentation--for example, another
              process might join the namespace. */

           sleep(200);

           return 0;           /* Child terminates now */
       }

       #define STACK_SIZE (1024 * 1024)    /* Stack size for cloned child */

       int
       main(int argc, char *argv[])
       {
           char *stack;                    /* Start of stack buffer */
           char *stackTop;                 /* End of stack buffer */
           pid_t pid;
           struct utsname uts;

           if (argc < 2) {
               fprintf(stderr, "Usage: %s <child-hostname>\n", argv[0]);
               exit(EXIT_SUCCESS);
           }

           /* Allocate memory to be used for the stack of the child */

           stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
           if (stack == MAP_FAILED)
               errExit("mmap");

           stackTop = stack + STACK_SIZE;  /* Assume stack grows downward */

           /* Create child that has its own UTS namespace;
              child commences execution in childFunc() */

           pid = clone(childFunc, stackTop, CLONE_NEWUTS | SIGCHLD, argv[1]);
           if (pid == -1)
               errExit("clone");
           printf("clone() returned %jd\n", (intmax_t) pid);

           /* Parent falls through to here */

           sleep(1);           /* Give child time to change its hostname */

           /* Display hostname in parent's UTS namespace. This will be
              different from hostname in child's UTS namespace. */

           if (uname(&uts) == -1)
               errExit("uname");
           printf("uts.nodename in parent: %s\n", uts.nodename);

           if (waitpid(pid, NULL, 0) == -1)    /* Wait for child */
               errExit("waitpid");
           printf("child has terminated\n");

           exit(EXIT_SUCCESS);
       }
1082

COLOPHON

1090       This  page  is  part of release 5.10 of the Linux man-pages project.  A
1091       description of the project, information about reporting bugs,  and  the
1092       latest     version     of     this    page,    can    be    found    at
1093       https://www.kernel.org/doc/man-pages/.
1094
1095
1096
1097Linux                             2020-11-01                          CLONE(2)