CLONE(2)                 Linux Programmer's Manual                 CLONE(2)

NAME
clone, __clone2, clone3 - create a child process
7
SYNOPSIS
/* Prototype for the glibc wrapper function */

#define _GNU_SOURCE
#include <sched.h>

int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...
          /* pid_t *parent_tid, void *tls, pid_t *child_tid */ );

/* For the prototype of the raw clone() system call, see NOTES */

long clone3(struct clone_args *cl_args, size_t size);

Note: There is not yet a glibc wrapper for clone3(); see NOTES.
22
DESCRIPTION
These system calls create a new ("child") process, in a manner similar
to fork(2).
26
27 By contrast with fork(2), these system calls provide more precise con‐
28 trol over what pieces of execution context are shared between the call‐
29 ing process and the child process. For example, using these system
30 calls, the caller can control whether or not the two processes share
31 the virtual address space, the table of file descriptors, and the table
32 of signal handlers. These system calls also allow the new child
33 process to be placed in separate namespaces(7).
34
35 Note that in this manual page, "calling process" normally corresponds
36 to "parent process". But see the descriptions of CLONE_PARENT and
37 CLONE_THREAD below.
38
39 This page describes the following interfaces:
40
41 * The glibc clone() wrapper function and the underlying system call on
42 which it is based. The main text describes the wrapper function;
43 the differences for the raw system call are described toward the end
44 of this page.
45
46 * The newer clone3() system call.
47
In the remainder of this page, the term "the clone call" is used
when noting details that apply to all of these interfaces.
50
51 The clone() wrapper function
52 When the child process is created with the clone() wrapper function, it
53 commences execution by calling the function pointed to by the argument
54 fn. (This differs from fork(2), where execution continues in the child
55 from the point of the fork(2) call.) The arg argument is passed as the
56 argument of the function fn.
57
58 When the fn(arg) function returns, the child process terminates. The
59 integer returned by fn is the exit status for the child process. The
60 child process may also terminate explicitly by calling exit(2) or after
61 receiving a fatal signal.
62
63 The stack argument specifies the location of the stack used by the
64 child process. Since the child and calling process may share memory,
65 it is not possible for the child process to execute in the same stack
66 as the calling process. The calling process must therefore set up mem‐
67 ory space for the child stack and pass a pointer to this space to
68 clone(). Stacks grow downward on all processors that run Linux (except
69 the HP PA processors), so stack usually points to the topmost address
70 of the memory space set up for the child stack. Note that clone() does
71 not provide a means whereby the caller can inform the kernel of the
72 size of the stack area.
73
74 The remaining arguments to clone() are discussed below.
75
76 clone3()
77 The clone3() system call provides a superset of the functionality of
78 the older clone() interface. It also provides a number of API improve‐
79 ments, including: space for additional flags bits; cleaner separation
80 in the use of various arguments; and the ability to specify the size of
81 the child's stack area.
82
83 As with fork(2), clone3() returns in both the parent and the child. It
84 returns 0 in the child process and returns the PID of the child in the
85 parent.
86
87 The cl_args argument of clone3() is a structure of the following form:
88
struct clone_args {
    u64 flags;        /* Flags bit mask */
    u64 pidfd;        /* Where to store PID file descriptor
                         (int *) */
    u64 child_tid;    /* Where to store child TID,
                         in child's memory (pid_t *) */
    u64 parent_tid;   /* Where to store child TID,
                         in parent's memory (pid_t *) */
    u64 exit_signal;  /* Signal to deliver to parent on
                         child termination */
    u64 stack;        /* Pointer to lowest byte of stack */
    u64 stack_size;   /* Size of stack */
    u64 tls;          /* Location of new TLS */
    u64 set_tid;      /* Pointer to a pid_t array
                         (since Linux 5.5) */
    u64 set_tid_size; /* Number of elements in set_tid
                         (since Linux 5.5) */
    u64 cgroup;       /* File descriptor for target cgroup
                         of child (since Linux 5.7) */
};
109
110 The size argument that is supplied to clone3() should be initialized to
111 the size of this structure. (The existence of the size argument per‐
112 mits future extensions to the clone_args structure.)
113
114 The stack for the child process is specified via cl_args.stack, which
115 points to the lowest byte of the stack area, and cl_args.stack_size,
116 which specifies the size of the stack in bytes. In the case where the
117 CLONE_VM flag (see below) is specified, a stack must be explicitly
118 allocated and specified. Otherwise, these two fields can be specified
119 as NULL and 0, which causes the child to use the same stack area as the
120 parent (in the child's own virtual address space).
121
122 The remaining fields in the cl_args argument are discussed below.
123
124 Equivalence between clone() and clone3() arguments
125 Unlike the older clone() interface, where arguments are passed individ‐
126 ually, in the newer clone3() interface the arguments are packaged into
127 the clone_args structure shown above. This structure allows for a
128 superset of the information passed via the clone() arguments.
129
130 The following table shows the equivalence between the arguments of
131 clone() and the fields in the clone_args argument supplied to clone3():
132
clone()          clone3()         Notes
                 cl_args field
flags & ~0xff    flags            For most flags; details below
parent_tid       pidfd            See CLONE_PIDFD
child_tid        child_tid        See CLONE_CHILD_SETTID
parent_tid       parent_tid       See CLONE_PARENT_SETTID
flags & 0xff     exit_signal
stack            stack
---              stack_size
tls              tls              See CLONE_SETTLS
---              set_tid          See below for details
---              set_tid_size
---              cgroup           See CLONE_INTO_CGROUP
146
147 The child termination signal
148 When the child process terminates, a signal may be sent to the parent.
149 The termination signal is specified in the low byte of flags (clone())
150 or in cl_args.exit_signal (clone3()). If this signal is specified as
151 anything other than SIGCHLD, then the parent process must specify the
152 __WALL or __WCLONE options when waiting for the child with wait(2). If
153 no signal (i.e., zero) is specified, then the parent process is not
154 signaled when the child terminates.
155
156 The set_tid array
157 By default, the kernel chooses the next sequential PID for the new
158 process in each of the PID namespaces where it is present. When creat‐
159 ing a process with clone3(), the set_tid array (available since Linux
160 5.5) can be used to select specific PIDs for the process in some or all
161 of the PID namespaces where it is present. If the PID of the newly
162 created process should be set only for the current PID namespace or in
163 the newly created PID namespace (if flags contains CLONE_NEWPID) then
164 the first element in the set_tid array has to be the desired PID and
165 set_tid_size needs to be 1.
166
167 If the PID of the newly created process should have a certain value in
168 multiple PID namespaces, then the set_tid array can have multiple
169 entries. The first entry defines the PID in the most deeply nested PID
170 namespace and each of the following entries contains the PID in the
corresponding ancestor PID namespace. The number of PID namespaces in
which a PID should be set is defined by set_tid_size, which cannot be
larger than the number of currently nested PID namespaces.
174
175 To create a process with the following PIDs in a PID namespace hierar‐
176 chy:
177
PID NS level   Requested PID   Notes
0              31496           Outermost PID namespace
1              42
2              7               Innermost PID namespace
182
183 Set the array to:
184
185 set_tid[0] = 7;
186 set_tid[1] = 42;
187 set_tid[2] = 31496;
188 set_tid_size = 3;
189
190 If only the PIDs in the two innermost PID namespaces need to be speci‐
191 fied, set the array to:
192
193 set_tid[0] = 7;
194 set_tid[1] = 42;
195 set_tid_size = 2;
196
197 The PID in the PID namespaces outside the two innermost PID namespaces
198 will be selected the same way as any other PID is selected.
199
200 The set_tid feature requires CAP_SYS_ADMIN in all owning user names‐
201 paces of the target PID namespaces.
202
203 Callers may only choose a PID greater than 1 in a given PID namespace
204 if an init process (i.e., a process with PID 1) already exists in that
205 namespace. Otherwise the PID entry for this PID namespace must be 1.
206
207 The flags mask
208 Both clone() and clone3() allow a flags bit mask that modifies their
209 behavior and allows the caller to specify what is shared between the
210 calling process and the child process. This bit mask—the flags argu‐
211 ment of clone() or the cl_args.flags field passed to clone3()—is
212 referred to as the flags mask in the remainder of this page.
213
214 The flags mask is specified as a bitwise-OR of zero or more of the con‐
215 stants listed below. Except as noted below, these flags are available
216 (and have the same effect) in both clone() and clone3().
217
218 CLONE_CHILD_CLEARTID (since Linux 2.5.49)
219 Clear (zero) the child thread ID at the location pointed to by
220 child_tid (clone()) or cl_args.child_tid (clone3()) in child
221 memory when the child exits, and do a wakeup on the futex at
222 that address. The address involved may be changed by the
223 set_tid_address(2) system call. This is used by threading
224 libraries.
225
226 CLONE_CHILD_SETTID (since Linux 2.5.49)
227 Store the child thread ID at the location pointed to by
228 child_tid (clone()) or cl_args.child_tid (clone3()) in the
229 child's memory. The store operation completes before the clone
230 call returns control to user space in the child process. (Note
231 that the store operation may not have completed before the clone
232 call returns in the parent process, which will be relevant if
233 the CLONE_VM flag is also employed.)
234
235 CLONE_CLEAR_SIGHAND (since Linux 5.5)
236 By default, signal dispositions in the child thread are the same
237 as in the parent. If this flag is specified, then all signals
238 that are handled in the parent are reset to their default dispo‐
239 sitions (SIG_DFL) in the child.
240
241 Specifying this flag together with CLONE_SIGHAND is nonsensical
242 and disallowed.
243
244 CLONE_DETACHED (historical)
245 For a while (during the Linux 2.5 development series) there was
246 a CLONE_DETACHED flag, which caused the parent not to receive a
247 signal when the child terminated. Ultimately, the effect of
248 this flag was subsumed under the CLONE_THREAD flag and by the
249 time Linux 2.6.0 was released, this flag had no effect. Start‐
250 ing in Linux 2.6.2, the need to give this flag together with
251 CLONE_THREAD disappeared.
252
253 This flag is still defined, but it is usually ignored when call‐
254 ing clone(). However, see the description of CLONE_PIDFD for
255 some exceptions.
256
257 CLONE_FILES (since Linux 2.0)
258 If CLONE_FILES is set, the calling process and the child process
259 share the same file descriptor table. Any file descriptor cre‐
260 ated by the calling process or by the child process is also
261 valid in the other process. Similarly, if one of the processes
262 closes a file descriptor, or changes its associated flags (using
263 the fcntl(2) F_SETFD operation), the other process is also
264 affected. If a process sharing a file descriptor table calls
265 execve(2), its file descriptor table is duplicated (unshared).
266
267 If CLONE_FILES is not set, the child process inherits a copy of
268 all file descriptors opened in the calling process at the time
269 of the clone call. Subsequent operations that open or close
270 file descriptors, or change file descriptor flags, performed by
271 either the calling process or the child process do not affect
272 the other process. Note, however, that the duplicated file
273 descriptors in the child refer to the same open file descrip‐
274 tions as the corresponding file descriptors in the calling
275 process, and thus share file offsets and file status flags (see
276 open(2)).
277
278 CLONE_FS (since Linux 2.0)
279 If CLONE_FS is set, the caller and the child process share the
280 same filesystem information. This includes the root of the
281 filesystem, the current working directory, and the umask. Any
282 call to chroot(2), chdir(2), or umask(2) performed by the call‐
283 ing process or the child process also affects the other process.
284
285 If CLONE_FS is not set, the child process works on a copy of the
286 filesystem information of the calling process at the time of the
287 clone call. Calls to chroot(2), chdir(2), or umask(2) performed
288 later by one of the processes do not affect the other process.
289
290 CLONE_INTO_CGROUP (since Linux 5.7)
291 By default, a child process is placed in the same version 2
292 cgroup as its parent. The CLONE_INTO_CGROUP flag allows the
293 child process to be created in a different version 2 cgroup.
294 (Note that CLONE_INTO_CGROUP has effect only for version 2
295 cgroups.)
296
297 In order to place the child process in a different cgroup, the
298 caller specifies CLONE_INTO_CGROUP in cl_args.flags and passes a
299 file descriptor that refers to a version 2 cgroup in the
300 cl_args.cgroup field. (This file descriptor can be obtained by
301 opening a cgroup v2 directory using either the O_RDONLY or the
302 O_PATH flag.) Note that all of the usual restrictions
303 (described in cgroups(7)) on placing a process into a version 2
304 cgroup apply.
305
306 Among the possible use cases for CLONE_INTO_CGROUP are the fol‐
307 lowing:
308
309 * Spawning a process into a cgroup different from the parent's
310 cgroup makes it possible for a service manager to directly
311 spawn new services into dedicated cgroups. This eliminates
312 the accounting jitter that would be caused if the child
313 process was first created in the same cgroup as the parent
314 and then moved into the target cgroup. Furthermore, spawning
315 the child process directly into a target cgroup is signifi‐
316 cantly cheaper than moving the child process into the target
317 cgroup after it has been created.
318
319 * The CLONE_INTO_CGROUP flag also allows the creation of frozen
320 child processes by spawning them into a frozen cgroup. (See
321 cgroups(7) for a description of the freezer controller.)
322
323 * For threaded applications (or even thread implementations
324 which make use of cgroups to limit individual threads), it is
325 possible to establish a fixed cgroup layout before spawning
326 each thread directly into its target cgroup.
327
328 CLONE_IO (since Linux 2.6.25)
329 If CLONE_IO is set, then the new process shares an I/O context
330 with the calling process. If this flag is not set, then (as
331 with fork(2)) the new process has its own I/O context.
332
333 The I/O context is the I/O scope of the disk scheduler (i.e.,
334 what the I/O scheduler uses to model scheduling of a process's
335 I/O). If processes share the same I/O context, they are treated
336 as one by the I/O scheduler. As a consequence, they get to
337 share disk time. For some I/O schedulers, if two processes
338 share an I/O context, they will be allowed to interleave their
339 disk access. If several threads are doing I/O on behalf of the
340 same process (aio_read(3), for instance), they should employ
341 CLONE_IO to get better I/O performance.
342
343 If the kernel is not configured with the CONFIG_BLOCK option,
344 this flag is a no-op.
345
346 CLONE_NEWCGROUP (since Linux 4.6)
347 Create the process in a new cgroup namespace. If this flag is
348 not set, then (as with fork(2)) the process is created in the
349 same cgroup namespaces as the calling process.
350
351 For further information on cgroup namespaces, see cgroup_names‐
352 paces(7).
353
354 Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWC‐
355 GROUP.
356
357 CLONE_NEWIPC (since Linux 2.6.19)
358 If CLONE_NEWIPC is set, then create the process in a new IPC
359 namespace. If this flag is not set, then (as with fork(2)), the
360 process is created in the same IPC namespace as the calling
361 process.
362
363 For further information on IPC namespaces, see ipc_names‐
364 paces(7).
365
366 Only a privileged process (CAP_SYS_ADMIN) can employ
367 CLONE_NEWIPC. This flag can't be specified in conjunction with
368 CLONE_SYSVSEM.
369
370 CLONE_NEWNET (since Linux 2.6.24)
371 (The implementation of this flag was completed only by about
372 kernel version 2.6.29.)
373
374 If CLONE_NEWNET is set, then create the process in a new network
375 namespace. If this flag is not set, then (as with fork(2)) the
376 process is created in the same network namespace as the calling
377 process.
378
379 For further information on network namespaces, see net‐
380 work_namespaces(7).
381
382 Only a privileged process (CAP_SYS_ADMIN) can employ
383 CLONE_NEWNET.
384
385 CLONE_NEWNS (since Linux 2.4.19)
386 If CLONE_NEWNS is set, the cloned child is started in a new
387 mount namespace, initialized with a copy of the namespace of the
388 parent. If CLONE_NEWNS is not set, the child lives in the same
389 mount namespace as the parent.
390
391 For further information on mount namespaces, see namespaces(7)
392 and mount_namespaces(7).
393
394 Only a privileged process (CAP_SYS_ADMIN) can employ
395 CLONE_NEWNS. It is not permitted to specify both CLONE_NEWNS
396 and CLONE_FS in the same clone call.
397
398 CLONE_NEWPID (since Linux 2.6.24)
399 If CLONE_NEWPID is set, then create the process in a new PID
400 namespace. If this flag is not set, then (as with fork(2)) the
401 process is created in the same PID namespace as the calling
402 process.
403
404 For further information on PID namespaces, see namespaces(7) and
405 pid_namespaces(7).
406
407 Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEW‐
408 PID. This flag can't be specified in conjunction with
409 CLONE_THREAD or CLONE_PARENT.
410
411 CLONE_NEWUSER
412 (This flag first became meaningful for clone() in Linux 2.6.23,
413 the current clone() semantics were merged in Linux 3.5, and the
414 final pieces to make the user namespaces completely usable were
415 merged in Linux 3.8.)
416
417 If CLONE_NEWUSER is set, then create the process in a new user
418 namespace. If this flag is not set, then (as with fork(2)) the
419 process is created in the same user namespace as the calling
420 process.
421
422 For further information on user namespaces, see namespaces(7)
423 and user_namespaces(7).
424
425 Before Linux 3.8, use of CLONE_NEWUSER required that the caller
426 have three capabilities: CAP_SYS_ADMIN, CAP_SETUID, and CAP_SET‐
427 GID. Starting with Linux 3.8, no privileges are needed to cre‐
428 ate a user namespace.
429
430 This flag can't be specified in conjunction with CLONE_THREAD or
431 CLONE_PARENT. For security reasons, CLONE_NEWUSER cannot be
432 specified in conjunction with CLONE_FS.
433
434 CLONE_NEWUTS (since Linux 2.6.19)
435 If CLONE_NEWUTS is set, then create the process in a new UTS
436 namespace, whose identifiers are initialized by duplicating the
437 identifiers from the UTS namespace of the calling process. If
438 this flag is not set, then (as with fork(2)) the process is cre‐
439 ated in the same UTS namespace as the calling process.
440
441 For further information on UTS namespaces, see uts_names‐
442 paces(7).
443
444 Only a privileged process (CAP_SYS_ADMIN) can employ
445 CLONE_NEWUTS.
446
447 CLONE_PARENT (since Linux 2.3.12)
448 If CLONE_PARENT is set, then the parent of the new child (as
449 returned by getppid(2)) will be the same as that of the calling
450 process.
451
452 If CLONE_PARENT is not set, then (as with fork(2)) the child's
453 parent is the calling process.
454
455 Note that it is the parent process, as returned by getppid(2),
456 which is signaled when the child terminates, so that if
457 CLONE_PARENT is set, then the parent of the calling process,
458 rather than the calling process itself, will be signaled.
459
460 The CLONE_PARENT flag can't be used in clone calls by the global
461 init process (PID 1 in the initial PID namespace) and init pro‐
462 cesses in other PID namespaces. This restriction prevents the
463 creation of multi-rooted process trees as well as the creation
464 of unreapable zombies in the initial PID namespace.
465
466 CLONE_PARENT_SETTID (since Linux 2.5.49)
467 Store the child thread ID at the location pointed to by par‐
468 ent_tid (clone()) or cl_args.parent_tid (clone3()) in the par‐
469 ent's memory. (In Linux 2.5.32-2.5.48 there was a flag
470 CLONE_SETTID that did this.) The store operation completes
471 before the clone call returns control to user space.
472
473 CLONE_PID (Linux 2.0 to 2.5.15)
474 If CLONE_PID is set, the child process is created with the same
475 process ID as the calling process. This is good for hacking the
476 system, but otherwise of not much use. From Linux 2.3.21
477 onward, this flag could be specified only by the system boot
478 process (PID 0). The flag disappeared completely from the ker‐
479 nel sources in Linux 2.5.16. Subsequently, the kernel silently
480 ignored this bit if it was specified in the flags mask. Much
481 later, the same bit was recycled for use as the CLONE_PIDFD
482 flag.
483
484 CLONE_PIDFD (since Linux 5.2)
485 If this flag is specified, a PID file descriptor referring to
486 the child process is allocated and placed at a specified loca‐
487 tion in the parent's memory. The close-on-exec flag is set on
488 this new file descriptor. PID file descriptors can be used for
489 the purposes described in pidfd_open(2).
490
491 * When using clone3(), the PID file descriptor is placed at the
492 location pointed to by cl_args.pidfd.
493
494 * When using clone(), the PID file descriptor is placed at the
495 location pointed to by parent_tid. Since the parent_tid
496 argument is used to return the PID file descriptor,
497 CLONE_PIDFD cannot be used with CLONE_PARENT_SETTID when
498 calling clone().
499
500 It is currently not possible to use this flag together with
501 CLONE_THREAD. This means that the process identified by the PID
502 file descriptor will always be a thread group leader.
503
504 If the obsolete CLONE_DETACHED flag is specified alongside
505 CLONE_PIDFD when calling clone(), an error is returned. An
506 error also results if CLONE_DETACHED is specified when calling
507 clone3(). This error behavior ensures that the bit correspond‐
508 ing to CLONE_DETACHED can be reused for further PID file
509 descriptor features in the future.
510
511 CLONE_PTRACE (since Linux 2.2)
512 If CLONE_PTRACE is specified, and the calling process is being
513 traced, then trace the child also (see ptrace(2)).
514
515 CLONE_SETTLS (since Linux 2.5.32)
516 The TLS (Thread Local Storage) descriptor is set to tls.
517
518 The interpretation of tls and the resulting effect is architec‐
519 ture dependent. On x86, tls is interpreted as a struct
520 user_desc * (see set_thread_area(2)). On x86-64 it is the new
521 value to be set for the %fs base register (see the ARCH_SET_FS
522 argument to arch_prctl(2)). On architectures with a dedicated
523 TLS register, it is the new value of that register.
524
525 Use of this flag requires detailed knowledge and generally it
526 should not be used except in libraries implementing threading.
527
528 CLONE_SIGHAND (since Linux 2.0)
529 If CLONE_SIGHAND is set, the calling process and the child
530 process share the same table of signal handlers. If the calling
531 process or child process calls sigaction(2) to change the behav‐
532 ior associated with a signal, the behavior is changed in the
533 other process as well. However, the calling process and child
534 processes still have distinct signal masks and sets of pending
535 signals. So, one of them may block or unblock signals using
536 sigprocmask(2) without affecting the other process.
537
538 If CLONE_SIGHAND is not set, the child process inherits a copy
539 of the signal handlers of the calling process at the time of the
540 clone call. Calls to sigaction(2) performed later by one of the
541 processes have no effect on the other process.
542
Since Linux 2.6.0, the flags mask must also include CLONE_VM if
CLONE_SIGHAND is specified.
545
546 CLONE_STOPPED (since Linux 2.6.0)
547 If CLONE_STOPPED is set, then the child is initially stopped (as
548 though it was sent a SIGSTOP signal), and must be resumed by
549 sending it a SIGCONT signal.
550
551 This flag was deprecated from Linux 2.6.25 onward, and was
552 removed altogether in Linux 2.6.38. Since then, the kernel
553 silently ignores it without error. Starting with Linux 4.6, the
554 same bit was reused for the CLONE_NEWCGROUP flag.
555
556 CLONE_SYSVSEM (since Linux 2.5.10)
557 If CLONE_SYSVSEM is set, then the child and the calling process
558 share a single list of System V semaphore adjustment (semadj)
559 values (see semop(2)). In this case, the shared list accumu‐
560 lates semadj values across all processes sharing the list, and
561 semaphore adjustments are performed only when the last process
562 that is sharing the list terminates (or ceases sharing the list
563 using unshare(2)). If this flag is not set, then the child has
564 a separate semadj list that is initially empty.
565
566 CLONE_THREAD (since Linux 2.4.0)
567 If CLONE_THREAD is set, the child is placed in the same thread
568 group as the calling process. To make the remainder of the dis‐
569 cussion of CLONE_THREAD more readable, the term "thread" is used
570 to refer to the processes within a thread group.
571
572 Thread groups were a feature added in Linux 2.4 to support the
573 POSIX threads notion of a set of threads that share a single
574 PID. Internally, this shared PID is the so-called thread group
575 identifier (TGID) for the thread group. Since Linux 2.4, calls
576 to getpid(2) return the TGID of the caller.
577
578 The threads within a group can be distinguished by their (sys‐
579 tem-wide) unique thread IDs (TID). A new thread's TID is avail‐
580 able as the function result returned to the caller, and a thread
581 can obtain its own TID using gettid(2).
582
583 When a clone call is made without specifying CLONE_THREAD, then
584 the resulting thread is placed in a new thread group whose TGID
585 is the same as the thread's TID. This thread is the leader of
586 the new thread group.
587
588 A new thread created with CLONE_THREAD has the same parent
589 process as the process that made the clone call (i.e., like
590 CLONE_PARENT), so that calls to getppid(2) return the same value
591 for all of the threads in a thread group. When a CLONE_THREAD
592 thread terminates, the thread that created it is not sent a
593 SIGCHLD (or other termination) signal; nor can the status of
594 such a thread be obtained using wait(2). (The thread is said to
595 be detached.)
596
After all of the threads in a thread group terminate, the parent
process of the thread group is sent a SIGCHLD (or other termination)
signal.
600
601 If any of the threads in a thread group performs an execve(2),
602 then all threads other than the thread group leader are termi‐
603 nated, and the new program is executed in the thread group
604 leader.
605
606 If one of the threads in a thread group creates a child using
607 fork(2), then any thread in the group can wait(2) for that
608 child.
609
610 Since Linux 2.5.35, the flags mask must also include CLONE_SIG‐
611 HAND if CLONE_THREAD is specified (and note that, since Linux
612 2.6.0, CLONE_SIGHAND also requires CLONE_VM to be included).
613
614 Signal dispositions and actions are process-wide: if an unhan‐
615 dled signal is delivered to a thread, then it will affect (ter‐
616 minate, stop, continue, be ignored in) all members of the thread
617 group.
618
619 Each thread has its own signal mask, as set by sigprocmask(2).
620
621 A signal may be process-directed or thread-directed. A process-
622 directed signal is targeted at a thread group (i.e., a TGID),
623 and is delivered to an arbitrarily selected thread from among
624 those that are not blocking the signal. A signal may be
625 process-directed because it was generated by the kernel for rea‐
626 sons other than a hardware exception, or because it was sent
627 using kill(2) or sigqueue(3). A thread-directed signal is tar‐
628 geted at (i.e., delivered to) a specific thread. A signal may
629 be thread directed because it was sent using tgkill(2) or
630 pthread_sigqueue(3), or because the thread executed a machine
631 language instruction that triggered a hardware exception (e.g.,
632 invalid memory access triggering SIGSEGV or a floating-point
633 exception triggering SIGFPE).
634
635 A call to sigpending(2) returns a signal set that is the union
636 of the pending process-directed signals and the signals that are
637 pending for the calling thread.
638
639 If a process-directed signal is delivered to a thread group, and
640 the thread group has installed a handler for the signal, then
641 the handler will be invoked in exactly one, arbitrarily selected
642 member of the thread group that has not blocked the signal. If
643 multiple threads in a group are waiting to accept the same sig‐
644 nal using sigwaitinfo(2), the kernel will arbitrarily select one
645 of these threads to receive the signal.
646
647 CLONE_UNTRACED (since Linux 2.5.46)
648 If CLONE_UNTRACED is specified, then a tracing process cannot
649 force CLONE_PTRACE on this child process.
650
651 CLONE_VFORK (since Linux 2.2)
652 If CLONE_VFORK is set, the execution of the calling process is
653 suspended until the child releases its virtual memory resources
654 via a call to execve(2) or _exit(2) (as with vfork(2)).
655
656 If CLONE_VFORK is not set, then both the calling process and the
657 child are schedulable after the call, and an application should
658 not rely on execution occurring in any particular order.
659
660 CLONE_VM (since Linux 2.0)
661 If CLONE_VM is set, the calling process and the child process
662 run in the same memory space. In particular, memory writes per‐
663 formed by the calling process or by the child process are also
664 visible in the other process. Moreover, any memory mapping or
665 unmapping performed with mmap(2) or munmap(2) by the child or
666 calling process also affects the other process.
667
668 If CLONE_VM is not set, the child process runs in a separate
669 copy of the memory space of the calling process at the time of
670 the clone call. Memory writes or file mappings/unmappings per‐
671 formed by one of the processes do not affect the other, as with
672 fork(2).
673
RETURN VALUE
On success, the thread ID of the child process is returned in the
caller's thread of execution. On failure, -1 is returned in the caller's
context, no child process will be created, and errno will be set
appropriately.
679
ERRORS
EAGAIN Too many processes are already running; see fork(2).
682
683 EBUSY (clone3() only)
684 CLONE_INTO_CGROUP was specified in cl_args.flags, but the file
685 descriptor specified in cl_args.cgroup refers to a version 2
686 cgroup in which a domain controller is enabled.
687
688 EEXIST (clone3() only)
689 One (or more) of the PIDs specified in set_tid already exists in
690 the corresponding PID namespace.

EINVAL Both CLONE_SIGHAND and CLONE_CLEAR_SIGHAND were specified in the
       flags mask.

EINVAL CLONE_SIGHAND was specified in the flags mask, but CLONE_VM was
       not. (Since Linux 2.6.0.)

EINVAL CLONE_THREAD was specified in the flags mask, but CLONE_SIGHAND
       was not. (Since Linux 2.5.35.)

EINVAL CLONE_THREAD was specified in the flags mask, but the current
       process previously called unshare(2) with the CLONE_NEWPID flag
       or used setns(2) to reassociate itself with a PID namespace.

EINVAL Both CLONE_FS and CLONE_NEWNS were specified in the flags mask.

EINVAL (since Linux 3.9)
       Both CLONE_NEWUSER and CLONE_FS were specified in the flags
       mask.

EINVAL Both CLONE_NEWIPC and CLONE_SYSVSEM were specified in the flags
       mask.

EINVAL One (or both) of CLONE_NEWPID or CLONE_NEWUSER and one (or both)
       of CLONE_THREAD or CLONE_PARENT were specified in the flags
       mask.

EINVAL (since Linux 2.6.32)
       CLONE_PARENT was specified, and the caller is an init process.

EINVAL Returned by the glibc clone() wrapper function when fn or stack
       is specified as NULL.

EINVAL CLONE_NEWIPC was specified in the flags mask, but the kernel was
       not configured with the CONFIG_SYSVIPC and CONFIG_IPC_NS
       options.

EINVAL CLONE_NEWNET was specified in the flags mask, but the kernel was
       not configured with the CONFIG_NET_NS option.

EINVAL CLONE_NEWPID was specified in the flags mask, but the kernel was
       not configured with the CONFIG_PID_NS option.

EINVAL CLONE_NEWUSER was specified in the flags mask, but the kernel
       was not configured with the CONFIG_USER_NS option.

EINVAL CLONE_NEWUTS was specified in the flags mask, but the kernel was
       not configured with the CONFIG_UTS_NS option.

EINVAL stack is not aligned to a suitable boundary for this
       architecture. For example, on aarch64, stack must be a multiple
       of 16.

EINVAL (clone3() only)
       CLONE_DETACHED was specified in the flags mask.

EINVAL (clone() only)
       CLONE_PIDFD was specified together with CLONE_DETACHED in the
       flags mask.

EINVAL CLONE_PIDFD was specified together with CLONE_THREAD in the
       flags mask.

EINVAL (clone() only)
       CLONE_PIDFD was specified together with CLONE_PARENT_SETTID in
       the flags mask.

EINVAL (clone3() only)
       set_tid_size is greater than the number of nested PID
       namespaces.

EINVAL (clone3() only)
       One of the PIDs specified in set_tid was invalid.

EINVAL (AArch64 only, Linux 4.6 and earlier)
       stack was not aligned to a 128-bit boundary.

ENOMEM Cannot allocate sufficient memory to allocate a task structure
       for the child, or to copy those parts of the caller's context
       that need to be copied.

ENOSPC (since Linux 3.7)
       CLONE_NEWPID was specified in the flags mask, but the limit on
       the nesting depth of PID namespaces would have been exceeded;
       see pid_namespaces(7).

ENOSPC (since Linux 4.9; beforehand EUSERS)
       CLONE_NEWUSER was specified in the flags mask, and the call
       would cause the limit on the number of nested user namespaces to
       be exceeded. See user_namespaces(7).

       From Linux 3.11 to Linux 4.8, the error diagnosed in this case
       was EUSERS.

ENOSPC (since Linux 4.9)
       One of the values in the flags mask specified the creation of a
       new user namespace, but doing so would have caused the limit
       defined by the corresponding file in /proc/sys/user to be
       exceeded. For further details, see namespaces(7).

EOPNOTSUPP (clone3() only)
       CLONE_INTO_CGROUP was specified in cl_args.flags, but the file
       descriptor specified in cl_args.cgroup refers to a version 2
       cgroup that is in the domain invalid state.

EPERM  CLONE_NEWCGROUP, CLONE_NEWIPC, CLONE_NEWNET, CLONE_NEWNS,
       CLONE_NEWPID, or CLONE_NEWUTS was specified by an unprivileged
       process (process without CAP_SYS_ADMIN).

EPERM  CLONE_PID was specified by a process other than process 0.
       (This error occurs only on Linux 2.5.15 and earlier.)

EPERM  CLONE_NEWUSER was specified in the flags mask, but either the
       effective user ID or the effective group ID of the caller does
       not have a mapping in the parent namespace (see
       user_namespaces(7)).

EPERM (since Linux 3.9)
       CLONE_NEWUSER was specified in the flags mask and the caller is
       in a chroot environment (i.e., the caller's root directory does
       not match the root directory of the mount namespace in which it
       resides).

EPERM (clone3() only)
       set_tid_size was greater than zero, and the caller lacks the
       CAP_SYS_ADMIN capability in one or more of the user namespaces
       that own the corresponding PID namespaces.

ERESTARTNOINTR (since Linux 2.6.17)
       System call was interrupted by a signal and will be restarted.
       (This can be seen only during a trace.)

EUSERS (Linux 3.11 to Linux 4.8)
       CLONE_NEWUSER was specified in the flags mask, and the limit on
       the number of nested user namespaces would be exceeded. See the
       discussion of the ENOSPC error above.
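
Several of the errors listed above can be anticipated or classified in
user space. The following sketch is illustrative only: the helper names
cloneFlagsConflict(), alignStackTop(), and isNamespaceLimitError() are
hypothetical and are not part of glibc or the kernel. The flag check
covers only some of the EINVAL combinations described above (it omits
CLONE_CLEAR_SIGHAND and the CLONE_PIDFD rules), and the 16-byte alignment
is an assumption taken from the aarch64 example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdbool.h>
#include <stdint.h>

#define ASSUMED_STACK_ALIGN 16      /* Bytes; matches the aarch64 example */

/* Hypothetical pre-check: return true if 'flags' contains one of the
   combinations that the EINVAL entries above describe as invalid. */
bool
cloneFlagsConflict(unsigned long flags)
{
    if ((flags & CLONE_SIGHAND) && !(flags & CLONE_VM))
        return true;
    if ((flags & CLONE_THREAD) && !(flags & CLONE_SIGHAND))
        return true;
    if ((flags & CLONE_FS) && (flags & (CLONE_NEWNS | CLONE_NEWUSER)))
        return true;
    if ((flags & CLONE_NEWIPC) && (flags & CLONE_SYSVSEM))
        return true;
    if ((flags & (CLONE_NEWPID | CLONE_NEWUSER)) &&
            (flags & (CLONE_THREAD | CLONE_PARENT)))
        return true;
    return false;
}

/* Round a stack-top address down to the assumed alignment boundary. */
void *
alignStackTop(void *top)
{
    uintptr_t mask = (uintptr_t) ASSUMED_STACK_ALIGN - 1;

    return (void *) ((uintptr_t) top & ~mask);
}

/* Return true if 'err' is one of the two values that have been used to
   report a namespace limit (ENOSPC now, EUSERS on Linux 3.11 to 4.8). */
bool
isNamespaceLimitError(int err)
{
    return err == ENOSPC || err == EUSERS;
}
```

A pre-check of this kind cannot replace testing the return value of the
clone call; the kernel remains the authority on which combinations are
accepted, and the rules differ between kernel versions.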

The clone3() system call first appeared in Linux 5.3.

These system calls are Linux-specific and should not be used in
programs intended to be portable.

One use of these system calls is to implement threads: multiple flows
of control in a program that run concurrently in a shared address
space.

Glibc does not provide a wrapper for clone3(); call it using
syscall(2).
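
As an illustration of calling clone3() through syscall(2), the following
minimal sketch creates a fork(2)-like child and collects its exit status.
It assumes kernel headers recent enough to provide struct clone_args (in
<linux/sched.h>) and SYS_clone3; the function name clone3ForkDemo() and
the exit status 7 are arbitrary choices made for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/sched.h>    /* Definition of struct clone_args */
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#define DEMO_EXIT_STATUS 7  /* Arbitrary value reported by the child */

/* Create a child with clone3() and return its exit status.
   Returns -1 on failure; if clone3() itself failed, errno is left as
   set by the system call (for example, ENOSYS on kernels before 5.3). */
int
clone3ForkDemo(void)
{
    struct clone_args args;
    int wstatus;
    long pid;

    memset(&args, 0, sizeof(args));     /* Unused fields must be zero */
    args.exit_signal = SIGCHLD;         /* Lets the parent use waitpid(2) */

    pid = syscall(SYS_clone3, &args, sizeof(args));
    if (pid == -1)
        return -1;

    if (pid == 0)                       /* Child: no stack was supplied, */
        _exit(DEMO_EXIT_STATUS);        /* so it runs on a copy of ours */

    if (waitpid((pid_t) pid, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

Because no CLONE_* flags are set and cl_args.stack is zero, the child
continues from the point of the call, much as with fork(2). Some
sandboxed environments filter clone3(), in which case the call may fail
with ENOSYS or EPERM.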

Note that the glibc clone() wrapper function makes some changes in the
memory pointed to by stack (changes required to set the stack up
correctly for the child) before invoking the clone() system call. So, in
cases where clone() is used to recursively create children, do not use
the buffer employed for the parent's stack as the stack of the child.

The kcmp(2) system call can be used to test whether two processes share
various resources such as a file descriptor table, System V semaphore
undo operations, or a virtual address space.
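
A check of this kind might look as follows. This is a sketch under stated
assumptions: glibc is assumed to provide no wrapper for kcmp(2), so the
call goes through syscall(2), and the helper name shareAddressSpace() is
hypothetical. The kernel must have been built with kcmp(2) support, and
the caller needs suitable ptrace access to both processes.

```c
#define _GNU_SOURCE
#include <linux/kcmp.h>     /* Definition of KCMP_VM */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if the two processes share a virtual address space,
   0 if they do not, or -1 on error (with errno set by kcmp(2)). */
int
shareAddressSpace(pid_t pid1, pid_t pid2)
{
    long res;

    /* For KCMP_VM, the two trailing index arguments are ignored */
    res = syscall(SYS_kcmp, pid1, pid2, KCMP_VM, 0UL, 0UL);
    if (res == -1)
        return -1;

    return res == 0;        /* kcmp(2) returns 0 when the resources match */
}
```

For example, calling shareAddressSpace() with the IDs of the caller and
of a child created with CLONE_VM would be expected to report sharing,
whereas a child created without CLONE_VM would not.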

Handlers registered using pthread_atfork(3) are not executed during a
clone call.

In the Linux 2.4.x series, CLONE_THREAD generally does not make the
parent of the new thread the same as the parent of the calling process.
However, for kernel versions 2.4.7 to 2.4.18 the CLONE_THREAD flag
implied the CLONE_PARENT flag (as in Linux 2.6.0 and later).

On i386, clone() should not be called through vsyscall, but directly
through int $0x80.

C library/kernel differences
The raw clone() system call corresponds more closely to fork(2) in that
execution in the child continues from the point of the call. As such,
the fn and arg arguments of the clone() wrapper function are omitted.

In contrast to the glibc wrapper, the raw clone() system call accepts
NULL as a stack argument (and clone3() likewise allows cl_args.stack to
be NULL). In this case, the child uses a duplicate of the parent's
stack. (Copy-on-write semantics ensure that the child gets separate
copies of stack pages when either process modifies the stack.) In this
case, for correct operation, the CLONE_VM option should not be
specified. (If the child shares the parent's memory because of the use
of the CLONE_VM flag, then no copy-on-write duplication occurs and chaos
is likely to result.)

The order of the arguments also differs in the raw system call, and
there are variations in the arguments across architectures, as detailed
in the following paragraphs.

The raw system call interface on x86-64 and some other architectures
(including sh, tile, and alpha) is:

    long clone(unsigned long flags, void *stack,
               int *parent_tid, int *child_tid,
               unsigned long tls);
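
The difference in how a NULL stack is treated can be sketched as follows.
The example assumes the x86-64 argument order shown above (on
architectures that reverse the first two arguments it would need to be
adapted) and a glibc wrapper; the function names and the exit status 42
are arbitrary choices for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static int                  /* Never runs: the wrapper call below fails */
unusedChild(void *arg)
{
    (void) arg;
    return 0;
}

/* Return the errno value produced by the glibc wrapper when stack is
   NULL, or 0 if the call unexpectedly succeeded. */
int
wrapperErrnoForNullStack(void)
{
    pid_t pid;

    pid = clone(unusedChild, NULL, SIGCHLD, NULL);
    if (pid == -1)
        return errno;

    waitpid(pid, NULL, 0);  /* Reap the child if one was created */
    return 0;
}

/* Invoke the raw system call with a NULL stack and return the child's
   exit status, or -1 on error. */
int
rawCloneNullStack(void)
{
    int wstatus;
    long pid;

    /* x86-64 order: flags, stack, parent_tid, child_tid, tls */
    pid = syscall(SYS_clone, (unsigned long) SIGCHLD, NULL, NULL, NULL, 0UL);
    if (pid == -1)
        return -1;

    if (pid == 0)           /* Child continues from the point of the call */
        _exit(42);

    if (waitpid((pid_t) pid, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

With glibc, wrapperErrnoForNullStack() is expected to return EINVAL,
while rawCloneNullStack() is expected to return 42: CLONE_VM is not set,
so the child runs on a copy-on-write duplicate of the parent's stack.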

On x86-32, and several other common architectures (including score,
ARM, ARM 64, PA-RISC, arc, Power PC, xtensa, and MIPS), the order of
the last two arguments is reversed:

    long clone(unsigned long flags, void *stack,
               int *parent_tid, unsigned long tls,
               int *child_tid);

On the cris and s390 architectures, the order of the first two
arguments is reversed:

    long clone(void *stack, unsigned long flags,
               int *parent_tid, int *child_tid,
               unsigned long tls);

On the microblaze architecture, an additional argument is supplied:

    long clone(unsigned long flags, void *stack,
               int stack_size,         /* Size of stack */
               int *parent_tid, int *child_tid,
               unsigned long tls);

blackfin, m68k, and sparc
The argument-passing conventions on blackfin, m68k, and sparc are
different from the descriptions above. For details, see the kernel (and
glibc) source.

ia64
On ia64, a different interface is used:

    int __clone2(int (*fn)(void *),
                 void *stack_base, size_t stack_size,
                 int flags, void *arg, ...
              /* pid_t *parent_tid, struct user_desc *tls,
                 pid_t *child_tid */ );

The prototype shown above is for the glibc wrapper function; for the
system call itself, the prototype can be described as follows (it is
identical to the clone() prototype on microblaze):

    long clone2(unsigned long flags, void *stack_base,
                int stack_size,         /* Size of stack */
                int *parent_tid, int *child_tid,
                unsigned long tls);

__clone2() operates in the same way as clone(), except that stack_base
points to the lowest address of the child's stack area, and stack_size
specifies the size of the stack pointed to by stack_base.

Linux 2.4 and earlier
In Linux 2.4 and earlier, clone() does not take arguments parent_tid,
tls, and child_tid.

GNU C library versions 2.3.4 up to and including 2.24 contained a
wrapper function for getpid(2) that performed caching of PIDs. This
caching relied on support in the glibc wrapper for clone(), but
limitations in the implementation meant that the cache was not up to
date in some circumstances. In particular, if a signal was delivered to
the child immediately after the clone() call, then a call to getpid(2)
in a handler for the signal could return the PID of the calling process
("the parent"), if the clone wrapper had not yet had a chance to update
the PID cache in the child. (This discussion ignores the case where
the child was created using CLONE_THREAD, when getpid(2) should return
the same value in the child and in the process that called clone(),
since the caller and the child are in the same thread group. The
stale-cache problem also does not occur if the flags argument includes
CLONE_VM.) To get the truth, it was sometimes necessary to use code
such as the following:

    #include <syscall.h>
    #include <unistd.h>

    pid_t mypid;

    mypid = syscall(SYS_getpid);

Because of the stale-cache problem, as well as other problems noted in
getpid(2), the PID caching feature was removed in glibc 2.25.

The following program demonstrates the use of clone() to create a child
process that executes in a separate UTS namespace. The child changes
the hostname in its UTS namespace. Both parent and child then display
the system hostname, making it possible to see that the hostname
differs in the UTS namespaces of the parent and child. For an example of
the use of this program, see setns(2).

Within the sample program, we allocate the memory that is to be used
for the child's stack using mmap(2) rather than malloc(3) for the
following reasons:

*  mmap(2) allocates a block of memory that starts on a page boundary
   and is a multiple of the page size. This is useful if we want to
   establish a guard page (a page with protection PROT_NONE) at the end
   of the stack using mprotect(2).

*  We can specify the MAP_STACK flag to request a mapping that is
   suitable for a stack. For the moment, this flag is a no-op on Linux,
   but it exists and has effect on some other systems, so we should
   include it for portability.
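
The guard page mentioned in the first point is not set up by the sample
program. A minimal sketch of how it might be done is shown here; the
function name allocStackWithGuard() is hypothetical, and the sketch
assumes a stack that grows downward, so that the "end" of the stack is
the lowest page of the mapping.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map 'size' bytes for use as a child stack and make the lowest page
   inaccessible. Returns the base of the mapping, or NULL on failure.
   The usable stack is [base + page size, base + size); the value to
   pass to clone() as 'stack' is base + size. */
char *
allocStackWithGuard(size_t size)
{
    long page;
    char *base;

    page = sysconf(_SC_PAGESIZE);
    if (page == -1 || size < 2 * (size_t) page)
        return NULL;        /* Need the guard plus at least one page */

    base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* An overflowing stack runs off the low end and hits this page,
       producing SIGSEGV instead of silently corrupting other memory */
    if (mprotect(base, (size_t) page, PROT_NONE) == -1) {
        munmap(base, size);
        return NULL;
    }

    return base;
}
```

This works only because the mapping returned by mmap(2) is page-aligned;
mprotect(2) requires a page-aligned address, which a buffer obtained from
malloc(3) does not guarantee.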

Program source
    #define _GNU_SOURCE
    #include <sys/wait.h>
    #include <sys/utsname.h>
    #include <sched.h>
    #include <signal.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                            } while (0)

    static int              /* Start function for cloned child */
    childFunc(void *arg)
    {
        struct utsname uts;

        /* Change hostname in UTS namespace of child */

        if (sethostname(arg, strlen(arg)) == -1)
            errExit("sethostname");

        /* Retrieve and display hostname */

        if (uname(&uts) == -1)
            errExit("uname");
        printf("uts.nodename in child:  %s\n", uts.nodename);

        /* Keep the namespace open for a while, by sleeping.
           This allows some experimentation--for example, another
           process might join the namespace. */

        sleep(200);

        return 0;           /* Child terminates now */
    }

    #define STACK_SIZE (1024 * 1024)    /* Stack size for cloned child */

    int
    main(int argc, char *argv[])
    {
        char *stack;                    /* Start of stack buffer */
        char *stackTop;                 /* End of stack buffer */
        pid_t pid;
        struct utsname uts;

        if (argc < 2) {
            fprintf(stderr, "Usage: %s <child-hostname>\n", argv[0]);
            exit(EXIT_FAILURE);
        }

        /* Allocate memory to be used for the stack of the child */

        stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
        if (stack == MAP_FAILED)
            errExit("mmap");

        stackTop = stack + STACK_SIZE;  /* Assume stack grows downward */

        /* Create child that has its own UTS namespace;
           child commences execution in childFunc() */

        pid = clone(childFunc, stackTop, CLONE_NEWUTS | SIGCHLD, argv[1]);
        if (pid == -1)
            errExit("clone");
        printf("clone() returned %ld\n", (long) pid);

        /* Parent falls through to here */

        sleep(1);           /* Give child time to change its hostname */

        /* Display hostname in parent's UTS namespace. This will be
           different from hostname in child's UTS namespace. */

        if (uname(&uts) == -1)
            errExit("uname");
        printf("uts.nodename in parent: %s\n", uts.nodename);

        if (waitpid(pid, NULL, 0) == -1)    /* Wait for child */
            errExit("waitpid");
        printf("child has terminated\n");

        exit(EXIT_SUCCESS);
    }

fork(2), futex(2), getpid(2), gettid(2), kcmp(2), mmap(2),
pidfd_open(2), set_thread_area(2), set_tid_address(2), setns(2),
tkill(2), unshare(2), wait(2), capabilities(7), namespaces(7),
pthreads(7)

This page is part of release 5.07 of the Linux man-pages project. A
description of the project, information about reporting bugs, and the
latest version of this page, can be found at
https://www.kernel.org/doc/man-pages/.

Linux 2020-06-09 CLONE(2)