clone(2)                  System Calls Manual                  clone(2)

NAME
clone, __clone2, clone3 - create a child process

LIBRARY
Standard C library (libc, -lc)

SYNOPSIS
/* Prototype for the glibc wrapper function */

#define _GNU_SOURCE
#include <sched.h>

int clone(int (*fn)(void *_Nullable), void *stack, int flags,
          void *_Nullable arg, ...  /* pid_t *_Nullable parent_tid,
                                       void *_Nullable tls,
                                       pid_t *_Nullable child_tid */ );

/* For the prototype of the raw clone() system call, see NOTES */

#include <linux/sched.h>    /* Definition of struct clone_args */
#include <sched.h>          /* Definition of CLONE_* constants */
#include <sys/syscall.h>    /* Definition of SYS_* constants */
#include <unistd.h>

long syscall(SYS_clone3, struct clone_args *cl_args, size_t size);

Note: glibc provides no wrapper for clone3(), necessitating the use of
syscall(2).

DESCRIPTION
35 These system calls create a new ("child") process, in a manner similar
36 to fork(2).
37
38 By contrast with fork(2), these system calls provide more precise con‐
39 trol over what pieces of execution context are shared between the call‐
40 ing process and the child process. For example, using these system
41 calls, the caller can control whether or not the two processes share
42 the virtual address space, the table of file descriptors, and the table
43 of signal handlers. These system calls also allow the new child
44 process to be placed in separate namespaces(7).
45
46 Note that in this manual page, "calling process" normally corresponds
47 to "parent process". But see the descriptions of CLONE_PARENT and
48 CLONE_THREAD below.
49
50 This page describes the following interfaces:
51
52 • The glibc clone() wrapper function and the underlying system call on
53 which it is based. The main text describes the wrapper function;
54 the differences for the raw system call are described toward the end
55 of this page.
56
57 • The newer clone3() system call.
58
In the remainder of this page, the terminology "the clone call" is used
when noting details that apply to all of these interfaces.
61
62 The clone() wrapper function
63 When the child process is created with the clone() wrapper function, it
64 commences execution by calling the function pointed to by the argument
65 fn. (This differs from fork(2), where execution continues in the child
66 from the point of the fork(2) call.) The arg argument is passed as the
67 argument of the function fn.
68
69 When the fn(arg) function returns, the child process terminates. The
70 integer returned by fn is the exit status for the child process. The
71 child process may also terminate explicitly by calling exit(2) or after
72 receiving a fatal signal.
73
74 The stack argument specifies the location of the stack used by the
75 child process. Since the child and calling process may share memory,
76 it is not possible for the child process to execute in the same stack
77 as the calling process. The calling process must therefore set up mem‐
78 ory space for the child stack and pass a pointer to this space to
79 clone(). Stacks grow downward on all processors that run Linux (except
80 the HP PA processors), so stack usually points to the topmost address
81 of the memory space set up for the child stack. Note that clone() does
82 not provide a means whereby the caller can inform the kernel of the
83 size of the stack area.
84
85 The remaining arguments to clone() are discussed below.
86
87 clone3()
88 The clone3() system call provides a superset of the functionality of
89 the older clone() interface. It also provides a number of API improve‐
90 ments, including: space for additional flags bits; cleaner separation
91 in the use of various arguments; and the ability to specify the size of
92 the child's stack area.
93
94 As with fork(2), clone3() returns in both the parent and the child. It
95 returns 0 in the child process and returns the PID of the child in the
96 parent.
97
98 The cl_args argument of clone3() is a structure of the following form:
99
struct clone_args {
    u64 flags;        /* Flags bit mask */
    u64 pidfd;        /* Where to store PID file descriptor
                         (int *) */
    u64 child_tid;    /* Where to store child TID,
                         in child's memory (pid_t *) */
    u64 parent_tid;   /* Where to store child TID,
                         in parent's memory (pid_t *) */
    u64 exit_signal;  /* Signal to deliver to parent on
                         child termination */
    u64 stack;        /* Pointer to lowest byte of stack */
    u64 stack_size;   /* Size of stack */
    u64 tls;          /* Location of new TLS */
    u64 set_tid;      /* Pointer to a pid_t array
                         (since Linux 5.5) */
    u64 set_tid_size; /* Number of elements in set_tid
                         (since Linux 5.5) */
    u64 cgroup;       /* File descriptor for target cgroup
                         of child (since Linux 5.7) */
};
120
121 The size argument that is supplied to clone3() should be initialized to
122 the size of this structure. (The existence of the size argument per‐
123 mits future extensions to the clone_args structure.)
124
125 The stack for the child process is specified via cl_args.stack, which
126 points to the lowest byte of the stack area, and cl_args.stack_size,
127 which specifies the size of the stack in bytes. In the case where the
128 CLONE_VM flag (see below) is specified, a stack must be explicitly al‐
129 located and specified. Otherwise, these two fields can be specified as
130 NULL and 0, which causes the child to use the same stack area as the
131 parent (in the child's own virtual address space).
132
133 The remaining fields in the cl_args argument are discussed below.
134
135 Equivalence between clone() and clone3() arguments
136 Unlike the older clone() interface, where arguments are passed individ‐
137 ually, in the newer clone3() interface the arguments are packaged into
138 the clone_args structure shown above. This structure allows for a su‐
139 perset of the information passed via the clone() arguments.
140
141 The following table shows the equivalence between the arguments of
142 clone() and the fields in the clone_args argument supplied to clone3():
143
clone()          clone3()          Notes
                 cl_args field
flags & ~0xff    flags             For most flags; details below
parent_tid       pidfd             See CLONE_PIDFD
child_tid        child_tid         See CLONE_CHILD_SETTID
parent_tid       parent_tid        See CLONE_PARENT_SETTID
flags & 0xff     exit_signal
stack            stack
---              stack_size
tls              tls               See CLONE_SETTLS
---              set_tid           See below for details
---              set_tid_size
---              cgroup            See CLONE_INTO_CGROUP
158
159 The child termination signal
160 When the child process terminates, a signal may be sent to the parent.
161 The termination signal is specified in the low byte of flags (clone())
162 or in cl_args.exit_signal (clone3()). If this signal is specified as
163 anything other than SIGCHLD, then the parent process must specify the
164 __WALL or __WCLONE options when waiting for the child with wait(2). If
165 no signal (i.e., zero) is specified, then the parent process is not
166 signaled when the child terminates.
167
168 The set_tid array
169 By default, the kernel chooses the next sequential PID for the new
170 process in each of the PID namespaces where it is present. When creat‐
171 ing a process with clone3(), the set_tid array (available since Linux
172 5.5) can be used to select specific PIDs for the process in some or all
173 of the PID namespaces where it is present. If the PID of the newly
174 created process should be set only for the current PID namespace or in
175 the newly created PID namespace (if flags contains CLONE_NEWPID) then
176 the first element in the set_tid array has to be the desired PID and
177 set_tid_size needs to be 1.
178
179 If the PID of the newly created process should have a certain value in
180 multiple PID namespaces, then the set_tid array can have multiple en‐
181 tries. The first entry defines the PID in the most deeply nested PID
182 namespace and each of the following entries contains the PID in the
183 corresponding ancestor PID namespace. The number of PID namespaces in
184 which a PID should be set is defined by set_tid_size which cannot be
185 larger than the number of currently nested PID namespaces.
186
187 To create a process with the following PIDs in a PID namespace hierar‐
188 chy:
189
PID NS level   Requested PID   Notes
0              31496           Outermost PID namespace
1              42
2              7               Innermost PID namespace

Set the array to:

    set_tid[0] = 7;
    set_tid[1] = 42;
    set_tid[2] = 31496;
    set_tid_size = 3;
201
202 If only the PIDs in the two innermost PID namespaces need to be speci‐
203 fied, set the array to:
204
    set_tid[0] = 7;
    set_tid[1] = 42;
    set_tid_size = 2;
208
209 The PID in the PID namespaces outside the two innermost PID namespaces
210 is selected the same way as any other PID is selected.
211
212 The set_tid feature requires CAP_SYS_ADMIN or (since Linux 5.9)
213 CAP_CHECKPOINT_RESTORE in all owning user namespaces of the target PID
214 namespaces.
215
216 Callers may only choose a PID greater than 1 in a given PID namespace
217 if an init process (i.e., a process with PID 1) already exists in that
218 namespace. Otherwise the PID entry for this PID namespace must be 1.
219
220 The flags mask
221 Both clone() and clone3() allow a flags bit mask that modifies their
222 behavior and allows the caller to specify what is shared between the
223 calling process and the child process. This bit mask—the flags argu‐
224 ment of clone() or the cl_args.flags field passed to clone3()—is re‐
225 ferred to as the flags mask in the remainder of this page.
226
227 The flags mask is specified as a bitwise OR of zero or more of the con‐
228 stants listed below. Except as noted below, these flags are available
229 (and have the same effect) in both clone() and clone3().
230
231 CLONE_CHILD_CLEARTID (since Linux 2.5.49)
232 Clear (zero) the child thread ID at the location pointed to by
233 child_tid (clone()) or cl_args.child_tid (clone3()) in child
234 memory when the child exits, and do a wakeup on the futex at
235 that address. The address involved may be changed by the
236 set_tid_address(2) system call. This is used by threading li‐
237 braries.
238
239 CLONE_CHILD_SETTID (since Linux 2.5.49)
240 Store the child thread ID at the location pointed to by
241 child_tid (clone()) or cl_args.child_tid (clone3()) in the
242 child's memory. The store operation completes before the clone
243 call returns control to user space in the child process. (Note
244 that the store operation may not have completed before the clone
245 call returns in the parent process, which is relevant if the
246 CLONE_VM flag is also employed.)
247
248 CLONE_CLEAR_SIGHAND (since Linux 5.5)
249 By default, signal dispositions in the child thread are the same
250 as in the parent. If this flag is specified, then all signals
251 that are handled in the parent are reset to their default dispo‐
252 sitions (SIG_DFL) in the child.
253
254 Specifying this flag together with CLONE_SIGHAND is nonsensical
255 and disallowed.
256
257 CLONE_DETACHED (historical)
258 For a while (during the Linux 2.5 development series) there was
259 a CLONE_DETACHED flag, which caused the parent not to receive a
260 signal when the child terminated. Ultimately, the effect of
261 this flag was subsumed under the CLONE_THREAD flag and by the
262 time Linux 2.6.0 was released, this flag had no effect. Start‐
263 ing in Linux 2.6.2, the need to give this flag together with
264 CLONE_THREAD disappeared.
265
266 This flag is still defined, but it is usually ignored when call‐
267 ing clone(). However, see the description of CLONE_PIDFD for
268 some exceptions.
269
270 CLONE_FILES (since Linux 2.0)
271 If CLONE_FILES is set, the calling process and the child process
272 share the same file descriptor table. Any file descriptor cre‐
273 ated by the calling process or by the child process is also
274 valid in the other process. Similarly, if one of the processes
275 closes a file descriptor, or changes its associated flags (using
276 the fcntl(2) F_SETFD operation), the other process is also af‐
277 fected. If a process sharing a file descriptor table calls ex‐
278 ecve(2), its file descriptor table is duplicated (unshared).
279
280 If CLONE_FILES is not set, the child process inherits a copy of
281 all file descriptors opened in the calling process at the time
282 of the clone call. Subsequent operations that open or close
283 file descriptors, or change file descriptor flags, performed by
284 either the calling process or the child process do not affect
285 the other process. Note, however, that the duplicated file de‐
286 scriptors in the child refer to the same open file descriptions
287 as the corresponding file descriptors in the calling process,
288 and thus share file offsets and file status flags (see open(2)).
289
290 CLONE_FS (since Linux 2.0)
291 If CLONE_FS is set, the caller and the child process share the
292 same filesystem information. This includes the root of the
293 filesystem, the current working directory, and the umask. Any
294 call to chroot(2), chdir(2), or umask(2) performed by the call‐
295 ing process or the child process also affects the other process.
296
297 If CLONE_FS is not set, the child process works on a copy of the
298 filesystem information of the calling process at the time of the
299 clone call. Calls to chroot(2), chdir(2), or umask(2) performed
300 later by one of the processes do not affect the other process.
301
302 CLONE_INTO_CGROUP (since Linux 5.7)
303 By default, a child process is placed in the same version 2
304 cgroup as its parent. The CLONE_INTO_CGROUP flag allows the
305 child process to be created in a different version 2 cgroup.
306 (Note that CLONE_INTO_CGROUP has effect only for version 2
307 cgroups.)
308
309 In order to place the child process in a different cgroup, the
310 caller specifies CLONE_INTO_CGROUP in cl_args.flags and passes a
311 file descriptor that refers to a version 2 cgroup in the
312 cl_args.cgroup field. (This file descriptor can be obtained by
313 opening a cgroup v2 directory using either the O_RDONLY or the
314 O_PATH flag.) Note that all of the usual restrictions (de‐
315 scribed in cgroups(7)) on placing a process into a version 2
316 cgroup apply.
317
318 Among the possible use cases for CLONE_INTO_CGROUP are the fol‐
319 lowing:
320
321 • Spawning a process into a cgroup different from the parent's
322 cgroup makes it possible for a service manager to directly
323 spawn new services into dedicated cgroups. This eliminates
324 the accounting jitter that would be caused if the child
325 process was first created in the same cgroup as the parent
326 and then moved into the target cgroup. Furthermore, spawning
327 the child process directly into a target cgroup is signifi‐
328 cantly cheaper than moving the child process into the target
329 cgroup after it has been created.
330
331 • The CLONE_INTO_CGROUP flag also allows the creation of frozen
332 child processes by spawning them into a frozen cgroup. (See
333 cgroups(7) for a description of the freezer controller.)
334
335 • For threaded applications (or even thread implementations
336 which make use of cgroups to limit individual threads), it is
337 possible to establish a fixed cgroup layout before spawning
338 each thread directly into its target cgroup.
339
340 CLONE_IO (since Linux 2.6.25)
341 If CLONE_IO is set, then the new process shares an I/O context
342 with the calling process. If this flag is not set, then (as
343 with fork(2)) the new process has its own I/O context.
344
345 The I/O context is the I/O scope of the disk scheduler (i.e.,
346 what the I/O scheduler uses to model scheduling of a process's
347 I/O). If processes share the same I/O context, they are treated
348 as one by the I/O scheduler. As a consequence, they get to
349 share disk time. For some I/O schedulers, if two processes
350 share an I/O context, they will be allowed to interleave their
351 disk access. If several threads are doing I/O on behalf of the
352 same process (aio_read(3), for instance), they should employ
353 CLONE_IO to get better I/O performance.
354
355 If the kernel is not configured with the CONFIG_BLOCK option,
356 this flag is a no-op.
357
358 CLONE_NEWCGROUP (since Linux 4.6)
Create the process in a new cgroup namespace. If this flag is
not set, then (as with fork(2)) the process is created in the
same cgroup namespace as the calling process.
362
363 For further information on cgroup namespaces, see cgroup_name‐
364 spaces(7).
365
366 Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWC‐
367 GROUP.
368
369 CLONE_NEWIPC (since Linux 2.6.19)
370 If CLONE_NEWIPC is set, then create the process in a new IPC
371 namespace. If this flag is not set, then (as with fork(2)), the
372 process is created in the same IPC namespace as the calling
373 process.
374
375 For further information on IPC namespaces, see ipc_name‐
376 spaces(7).
377
378 Only a privileged process (CAP_SYS_ADMIN) can employ
379 CLONE_NEWIPC. This flag can't be specified in conjunction with
380 CLONE_SYSVSEM.
381
382 CLONE_NEWNET (since Linux 2.6.24)
383 (The implementation of this flag was completed only by about
384 Linux 2.6.29.)
385
386 If CLONE_NEWNET is set, then create the process in a new network
387 namespace. If this flag is not set, then (as with fork(2)) the
388 process is created in the same network namespace as the calling
389 process.
390
391 For further information on network namespaces, see network_name‐
392 spaces(7).
393
394 Only a privileged process (CAP_SYS_ADMIN) can employ
395 CLONE_NEWNET.
396
397 CLONE_NEWNS (since Linux 2.4.19)
398 If CLONE_NEWNS is set, the cloned child is started in a new
399 mount namespace, initialized with a copy of the namespace of the
400 parent. If CLONE_NEWNS is not set, the child lives in the same
401 mount namespace as the parent.
402
403 For further information on mount namespaces, see namespaces(7)
404 and mount_namespaces(7).
405
406 Only a privileged process (CAP_SYS_ADMIN) can employ
407 CLONE_NEWNS. It is not permitted to specify both CLONE_NEWNS
408 and CLONE_FS in the same clone call.
409
410 CLONE_NEWPID (since Linux 2.6.24)
411 If CLONE_NEWPID is set, then create the process in a new PID
412 namespace. If this flag is not set, then (as with fork(2)) the
413 process is created in the same PID namespace as the calling
414 process.
415
416 For further information on PID namespaces, see namespaces(7) and
417 pid_namespaces(7).
418
419 Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEW‐
420 PID. This flag can't be specified in conjunction with
421 CLONE_THREAD or CLONE_PARENT.
422
423 CLONE_NEWUSER
424 (This flag first became meaningful for clone() in Linux 2.6.23,
425 the current clone() semantics were merged in Linux 3.5, and the
426 final pieces to make the user namespaces completely usable were
427 merged in Linux 3.8.)
428
429 If CLONE_NEWUSER is set, then create the process in a new user
430 namespace. If this flag is not set, then (as with fork(2)) the
431 process is created in the same user namespace as the calling
432 process.
433
434 For further information on user namespaces, see namespaces(7)
435 and user_namespaces(7).
436
437 Before Linux 3.8, use of CLONE_NEWUSER required that the caller
438 have three capabilities: CAP_SYS_ADMIN, CAP_SETUID, and CAP_SET‐
439 GID. Starting with Linux 3.8, no privileges are needed to cre‐
440 ate a user namespace.
441
442 This flag can't be specified in conjunction with CLONE_THREAD or
443 CLONE_PARENT. For security reasons, CLONE_NEWUSER cannot be
444 specified in conjunction with CLONE_FS.
445
446 CLONE_NEWUTS (since Linux 2.6.19)
447 If CLONE_NEWUTS is set, then create the process in a new UTS
448 namespace, whose identifiers are initialized by duplicating the
449 identifiers from the UTS namespace of the calling process. If
450 this flag is not set, then (as with fork(2)) the process is cre‐
451 ated in the same UTS namespace as the calling process.
452
453 For further information on UTS namespaces, see uts_name‐
454 spaces(7).
455
456 Only a privileged process (CAP_SYS_ADMIN) can employ
457 CLONE_NEWUTS.
458
459 CLONE_PARENT (since Linux 2.3.12)
460 If CLONE_PARENT is set, then the parent of the new child (as re‐
461 turned by getppid(2)) will be the same as that of the calling
462 process.
463
464 If CLONE_PARENT is not set, then (as with fork(2)) the child's
465 parent is the calling process.
466
467 Note that it is the parent process, as returned by getppid(2),
468 which is signaled when the child terminates, so that if
469 CLONE_PARENT is set, then the parent of the calling process,
470 rather than the calling process itself, is signaled.
471
472 The CLONE_PARENT flag can't be used in clone calls by the global
473 init process (PID 1 in the initial PID namespace) and init pro‐
474 cesses in other PID namespaces. This restriction prevents the
475 creation of multi-rooted process trees as well as the creation
476 of unreapable zombies in the initial PID namespace.
477
478 CLONE_PARENT_SETTID (since Linux 2.5.49)
479 Store the child thread ID at the location pointed to by par‐
480 ent_tid (clone()) or cl_args.parent_tid (clone3()) in the par‐
481 ent's memory. (In Linux 2.5.32-2.5.48 there was a flag
482 CLONE_SETTID that did this.) The store operation completes be‐
483 fore the clone call returns control to user space.
484
485 CLONE_PID (Linux 2.0 to Linux 2.5.15)
486 If CLONE_PID is set, the child process is created with the same
487 process ID as the calling process. This is good for hacking the
488 system, but otherwise of not much use. From Linux 2.3.21 on‐
489 ward, this flag could be specified only by the system boot
490 process (PID 0). The flag disappeared completely from the ker‐
491 nel sources in Linux 2.5.16. Subsequently, the kernel silently
492 ignored this bit if it was specified in the flags mask. Much
493 later, the same bit was recycled for use as the CLONE_PIDFD
494 flag.
495
496 CLONE_PIDFD (since Linux 5.2)
497 If this flag is specified, a PID file descriptor referring to
498 the child process is allocated and placed at a specified loca‐
499 tion in the parent's memory. The close-on-exec flag is set on
500 this new file descriptor. PID file descriptors can be used for
501 the purposes described in pidfd_open(2).
502
503 • When using clone3(), the PID file descriptor is placed at the
504 location pointed to by cl_args.pidfd.
505
506 • When using clone(), the PID file descriptor is placed at the
507 location pointed to by parent_tid. Since the parent_tid ar‐
508 gument is used to return the PID file descriptor, CLONE_PIDFD
509 cannot be used with CLONE_PARENT_SETTID when calling clone().
510
511 It is currently not possible to use this flag together with
512 CLONE_THREAD. This means that the process identified by the PID
513 file descriptor will always be a thread group leader.
514
515 If the obsolete CLONE_DETACHED flag is specified alongside
516 CLONE_PIDFD when calling clone(), an error is returned. An er‐
517 ror also results if CLONE_DETACHED is specified when calling
518 clone3(). This error behavior ensures that the bit correspond‐
519 ing to CLONE_DETACHED can be reused for further PID file de‐
520 scriptor features in the future.
521
522 CLONE_PTRACE (since Linux 2.2)
523 If CLONE_PTRACE is specified, and the calling process is being
524 traced, then trace the child also (see ptrace(2)).
525
526 CLONE_SETTLS (since Linux 2.5.32)
527 The TLS (Thread Local Storage) descriptor is set to tls.
528
529 The interpretation of tls and the resulting effect is architec‐
530 ture dependent. On x86, tls is interpreted as a struct
531 user_desc * (see set_thread_area(2)). On x86-64 it is the new
532 value to be set for the %fs base register (see the ARCH_SET_FS
533 argument to arch_prctl(2)). On architectures with a dedicated
534 TLS register, it is the new value of that register.
535
536 Use of this flag requires detailed knowledge and generally it
537 should not be used except in libraries implementing threading.
538
539 CLONE_SIGHAND (since Linux 2.0)
540 If CLONE_SIGHAND is set, the calling process and the child
541 process share the same table of signal handlers. If the calling
542 process or child process calls sigaction(2) to change the behav‐
543 ior associated with a signal, the behavior is changed in the
544 other process as well. However, the calling process and child
545 processes still have distinct signal masks and sets of pending
546 signals. So, one of them may block or unblock signals using
547 sigprocmask(2) without affecting the other process.
548
549 If CLONE_SIGHAND is not set, the child process inherits a copy
550 of the signal handlers of the calling process at the time of the
551 clone call. Calls to sigaction(2) performed later by one of the
552 processes have no effect on the other process.
553
554 Since Linux 2.6.0, the flags mask must also include CLONE_VM if
555 CLONE_SIGHAND is specified.
556
557 CLONE_STOPPED (since Linux 2.6.0)
558 If CLONE_STOPPED is set, then the child is initially stopped (as
559 though it was sent a SIGSTOP signal), and must be resumed by
560 sending it a SIGCONT signal.
561
562 This flag was deprecated from Linux 2.6.25 onward, and was re‐
563 moved altogether in Linux 2.6.38. Since then, the kernel
564 silently ignores it without error. Starting with Linux 4.6, the
565 same bit was reused for the CLONE_NEWCGROUP flag.
566
567 CLONE_SYSVSEM (since Linux 2.5.10)
568 If CLONE_SYSVSEM is set, then the child and the calling process
569 share a single list of System V semaphore adjustment (semadj)
570 values (see semop(2)). In this case, the shared list accumu‐
571 lates semadj values across all processes sharing the list, and
572 semaphore adjustments are performed only when the last process
573 that is sharing the list terminates (or ceases sharing the list
574 using unshare(2)). If this flag is not set, then the child has
575 a separate semadj list that is initially empty.
576
577 CLONE_THREAD (since Linux 2.4.0)
578 If CLONE_THREAD is set, the child is placed in the same thread
579 group as the calling process. To make the remainder of the dis‐
580 cussion of CLONE_THREAD more readable, the term "thread" is used
581 to refer to the processes within a thread group.
582
583 Thread groups were a feature added in Linux 2.4 to support the
584 POSIX threads notion of a set of threads that share a single
585 PID. Internally, this shared PID is the so-called thread group
586 identifier (TGID) for the thread group. Since Linux 2.4, calls
587 to getpid(2) return the TGID of the caller.
588
589 The threads within a group can be distinguished by their (sys‐
590 tem-wide) unique thread IDs (TID). A new thread's TID is avail‐
591 able as the function result returned to the caller, and a thread
592 can obtain its own TID using gettid(2).
593
594 When a clone call is made without specifying CLONE_THREAD, then
595 the resulting thread is placed in a new thread group whose TGID
596 is the same as the thread's TID. This thread is the leader of
597 the new thread group.
598
599 A new thread created with CLONE_THREAD has the same parent
600 process as the process that made the clone call (i.e., like
601 CLONE_PARENT), so that calls to getppid(2) return the same value
602 for all of the threads in a thread group. When a CLONE_THREAD
603 thread terminates, the thread that created it is not sent a
604 SIGCHLD (or other termination) signal; nor can the status of
605 such a thread be obtained using wait(2). (The thread is said to
606 be detached.)
607
After all of the threads in a thread group terminate, the parent
process of the thread group is sent a SIGCHLD (or other termination)
signal.
611
612 If any of the threads in a thread group performs an execve(2),
613 then all threads other than the thread group leader are termi‐
614 nated, and the new program is executed in the thread group
615 leader.
616
617 If one of the threads in a thread group creates a child using
618 fork(2), then any thread in the group can wait(2) for that
619 child.
620
621 Since Linux 2.5.35, the flags mask must also include CLONE_SIG‐
622 HAND if CLONE_THREAD is specified (and note that, since Linux
623 2.6.0, CLONE_SIGHAND also requires CLONE_VM to be included).
624
625 Signal dispositions and actions are process-wide: if an unhan‐
626 dled signal is delivered to a thread, then it will affect (ter‐
627 minate, stop, continue, be ignored in) all members of the thread
628 group.
629
630 Each thread has its own signal mask, as set by sigprocmask(2).
631
632 A signal may be process-directed or thread-directed. A process-
633 directed signal is targeted at a thread group (i.e., a TGID),
634 and is delivered to an arbitrarily selected thread from among
635 those that are not blocking the signal. A signal may be
636 process-directed because it was generated by the kernel for rea‐
637 sons other than a hardware exception, or because it was sent us‐
638 ing kill(2) or sigqueue(3). A thread-directed signal is tar‐
639 geted at (i.e., delivered to) a specific thread. A signal may
640 be thread directed because it was sent using tgkill(2) or
641 pthread_sigqueue(3), or because the thread executed a machine
642 language instruction that triggered a hardware exception (e.g.,
643 invalid memory access triggering SIGSEGV or a floating-point ex‐
644 ception triggering SIGFPE).
645
646 A call to sigpending(2) returns a signal set that is the union
647 of the pending process-directed signals and the signals that are
648 pending for the calling thread.
649
650 If a process-directed signal is delivered to a thread group, and
651 the thread group has installed a handler for the signal, then
652 the handler is invoked in exactly one, arbitrarily selected mem‐
653 ber of the thread group that has not blocked the signal. If
654 multiple threads in a group are waiting to accept the same sig‐
655 nal using sigwaitinfo(2), the kernel will arbitrarily select one
656 of these threads to receive the signal.
657
658 CLONE_UNTRACED (since Linux 2.5.46)
659 If CLONE_UNTRACED is specified, then a tracing process cannot
660 force CLONE_PTRACE on this child process.
661
662 CLONE_VFORK (since Linux 2.2)
663 If CLONE_VFORK is set, the execution of the calling process is
664 suspended until the child releases its virtual memory resources
665 via a call to execve(2) or _exit(2) (as with vfork(2)).
666
667 If CLONE_VFORK is not set, then both the calling process and the
668 child are schedulable after the call, and an application should
669 not rely on execution occurring in any particular order.
670
671 CLONE_VM (since Linux 2.0)
672 If CLONE_VM is set, the calling process and the child process
673 run in the same memory space. In particular, memory writes per‐
674 formed by the calling process or by the child process are also
675 visible in the other process. Moreover, any memory mapping or
676 unmapping performed with mmap(2) or munmap(2) by the child or
677 calling process also affects the other process.
678
679 If CLONE_VM is not set, the child process runs in a separate
680 copy of the memory space of the calling process at the time of
681 the clone call. Memory writes or file mappings/unmappings per‐
682 formed by one of the processes do not affect the other, as with
683 fork(2).
684
685 If the CLONE_VM flag is specified and the CLONE_VFORK flag is
686 not specified, then any alternate signal stack that was estab‐
687 lished by sigaltstack(2) is cleared in the child process.

RETURN VALUE
690 On success, the thread ID of the child process is returned in the
691 caller's thread of execution. On failure, -1 is returned in the
692 caller's context, no child process is created, and errno is set to in‐
693 dicate the error.

ERRORS
696 EACCES (clone3() only)
697 CLONE_INTO_CGROUP was specified in cl_args.flags, but the re‐
698 strictions (described in cgroups(7)) on placing the child
699 process into the version 2 cgroup referred to by cl_args.cgroup
700 are not met.
701
702 EAGAIN Too many processes are already running; see fork(2).
703
704 EBUSY (clone3() only)
705 CLONE_INTO_CGROUP was specified in cl_args.flags, but the file
706 descriptor specified in cl_args.cgroup refers to a version 2
707 cgroup in which a domain controller is enabled.
708
709 EEXIST (clone3() only)
710 One (or more) of the PIDs specified in set_tid already exists in
711 the corresponding PID namespace.
712
713 EINVAL Both CLONE_SIGHAND and CLONE_CLEAR_SIGHAND were specified in the
714 flags mask.
715
716 EINVAL CLONE_SIGHAND was specified in the flags mask, but CLONE_VM was
717 not. (Since Linux 2.6.0.)
718
719 EINVAL CLONE_THREAD was specified in the flags mask, but CLONE_SIGHAND
720 was not. (Since Linux 2.5.35.)
721
722 EINVAL CLONE_THREAD was specified in the flags mask, but the current
723 process previously called unshare(2) with the CLONE_NEWPID flag
724 or used setns(2) to reassociate itself with a PID namespace.
725
726 EINVAL Both CLONE_FS and CLONE_NEWNS were specified in the flags mask.
727
728 EINVAL (since Linux 3.9)
729 Both CLONE_NEWUSER and CLONE_FS were specified in the flags
730 mask.
731
732 EINVAL Both CLONE_NEWIPC and CLONE_SYSVSEM were specified in the flags
733 mask.
734
735 EINVAL One (or both) of CLONE_NEWPID or CLONE_NEWUSER and one (or both)
736 of CLONE_THREAD or CLONE_PARENT were specified in the flags
737 mask.
738
739 EINVAL (since Linux 2.6.32)
740 CLONE_PARENT was specified, and the caller is an init process.
741
742 EINVAL Returned by the glibc clone() wrapper function when fn or stack
743 is specified as NULL.
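
As a minimal sketch of this wrapper-level check (assuming glibc; other C libraries may behave differently), the following passes a NULL stack and reports the resulting errno value. The names noop_child() and null_stack_errno() are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>

static int
noop_child(void *arg)
{
    (void) arg;
    return 0;
}

/* Call the wrapper with a NULL stack. Returns the errno value set by
   the failed call, or 0 if a child was unexpectedly created. */
static int
null_stack_errno(void)
{
    errno = 0;
    if (clone(noop_child, NULL, SIGCHLD, NULL) == -1)
        return errno;
    return 0;
}
```

With the glibc wrapper, null_stack_errno() is expected to return EINVAL without any child being created.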
744
745 EINVAL CLONE_NEWIPC was specified in the flags mask, but the kernel was
746 not configured with the CONFIG_SYSVIPC and CONFIG_IPC_NS op‐
747 tions.
748
749 EINVAL CLONE_NEWNET was specified in the flags mask, but the kernel was
750 not configured with the CONFIG_NET_NS option.
751
752 EINVAL CLONE_NEWPID was specified in the flags mask, but the kernel was
753 not configured with the CONFIG_PID_NS option.
754
755 EINVAL CLONE_NEWUSER was specified in the flags mask, but the kernel
756 was not configured with the CONFIG_USER_NS option.
757
758 EINVAL CLONE_NEWUTS was specified in the flags mask, but the kernel was
759 not configured with the CONFIG_UTS_NS option.
760
761 EINVAL stack is not aligned to a suitable boundary for this architec‐
762 ture. For example, on aarch64, stack must be a multiple of 16.
763
764 EINVAL (clone3() only)
765 CLONE_DETACHED was specified in the flags mask.
766
767 EINVAL (clone() only)
768 CLONE_PIDFD was specified together with CLONE_DETACHED in the
769 flags mask.
770
771 EINVAL CLONE_PIDFD was specified together with CLONE_THREAD in the
772 flags mask.
773
774 EINVAL (clone() only)
775 CLONE_PIDFD was specified together with CLONE_PARENT_SETTID in
776 the flags mask.
777
778 EINVAL (clone3() only)
779 set_tid_size is greater than the number of nested PID name‐
780 spaces.
781
782 EINVAL (clone3() only)
783 One of the PIDs specified in set_tid was invalid.
784
785 EINVAL (clone3() only)
786 CLONE_THREAD or CLONE_PARENT was specified in the flags mask,
787 but a signal was specified in exit_signal.
788
789 EINVAL (AArch64 only, Linux 4.6 and earlier)
790 stack was not aligned to a 128-bit boundary.
791
792 ENOMEM Cannot allocate sufficient memory to allocate a task structure
793 for the child, or to copy those parts of the caller's context
794 that need to be copied.
795
796 ENOSPC (since Linux 3.7)
797 CLONE_NEWPID was specified in the flags mask, but the limit on
798 the nesting depth of PID namespaces would have been exceeded;
799 see pid_namespaces(7).
800
801 ENOSPC (since Linux 4.9; beforehand EUSERS)
802 CLONE_NEWUSER was specified in the flags mask, and the call
803 would cause the limit on the number of nested user namespaces to
804 be exceeded. See user_namespaces(7).
805
806 From Linux 3.11 to Linux 4.8, the error diagnosed in this case
807 was EUSERS.
808
809 ENOSPC (since Linux 4.9)
810 One of the values in the flags mask specified the creation of a
811 new user namespace, but doing so would have caused the limit de‐
812 fined by the corresponding file in /proc/sys/user to be ex‐
813 ceeded. For further details, see namespaces(7).
814
815 EOPNOTSUPP (clone3() only)
816 CLONE_INTO_CGROUP was specified in cl_args.flags, but the file
817 descriptor specified in cl_args.cgroup refers to a version 2
818 cgroup that is in the domain invalid state.
819
820 EPERM CLONE_NEWCGROUP, CLONE_NEWIPC, CLONE_NEWNET, CLONE_NEWNS,
821 CLONE_NEWPID, or CLONE_NEWUTS was specified by an unprivileged
822 process (process without CAP_SYS_ADMIN).
823
824 EPERM CLONE_PID was specified by a process other than process 0.
825 (This error occurs only on Linux 2.5.15 and earlier.)
826
827 EPERM CLONE_NEWUSER was specified in the flags mask, but either the
828 effective user ID or the effective group ID of the caller does
829 not have a mapping in the parent namespace (see user_name‐
830 spaces(7)).
831
832 EPERM (since Linux 3.9)
833 CLONE_NEWUSER was specified in the flags mask and the caller is
834 in a chroot environment (i.e., the caller's root directory does
835 not match the root directory of the mount namespace in which it
836 resides).
837
838 EPERM (clone3() only)
839 set_tid_size was greater than zero, and the caller lacks the
840 CAP_SYS_ADMIN capability in one or more of the user namespaces
841 that own the corresponding PID namespaces.
842
843 ERESTARTNOINTR (since Linux 2.6.17)
844 System call was interrupted by a signal and will be restarted.
845 (This can be seen only during a trace.)
846
847 EUSERS (Linux 3.11 to Linux 4.8)
848 CLONE_NEWUSER was specified in the flags mask, and the limit on
849 the number of nested user namespaces would be exceeded. See the
850 discussion of the ENOSPC error above.
851
VERSIONS
853 The glibc clone() wrapper function makes some changes in the memory
854 pointed to by stack (changes required to set the stack up correctly for
855 the child) before invoking the clone() system call. So, in cases where
856 clone() is used to recursively create children, do not use the buffer
857 employed for the parent's stack as the stack of the child.
858
859 On i386, clone() should not be called through vsyscall, but directly
860 through int $0x80.
861
862 C library/kernel differences
863 The raw clone() system call corresponds more closely to fork(2) in that
864 execution in the child continues from the point of the call. As such,
865 the fn and arg arguments of the clone() wrapper function are omitted.
866
867 In contrast to the glibc wrapper, the raw clone() system call accepts
868 NULL as a stack argument (and clone3() likewise allows cl_args.stack to
869 be NULL). In this case, the child uses a duplicate of the parent's
870 stack. (Copy-on-write semantics ensure that the child gets separate
871 copies of stack pages when either process modifies the stack.) In this
872 case, for correct operation, the CLONE_VM option should not be speci‐
873 fied. (If the child shares the parent's memory because of the use of
874 the CLONE_VM flag, then no copy-on-write duplication occurs and chaos
875 is likely to result.)
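
The fork-like behavior of the raw system call with a NULL stack can be sketched as follows. This is an illustration only: it assumes an architecture on which flags is the first argument of the raw system call (for example, x86-64), and raw_clone_demo() and the exit status 7 are hypothetical choices. CLONE_VM is deliberately not specified, for the reason given above.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a child with the raw clone() system call and a NULL stack.
   Returns the child's exit status, or -1 on error. */
static int
raw_clone_demo(void)
{
    int   status;
    long  pid;

    /* All remaining arguments are zero, so the ordering of the
       parent_tid, child_tid, and tls arguments does not matter here. */
    pid = syscall(SYS_clone, (unsigned long) SIGCHLD, NULL, NULL, NULL, 0UL);
    if (pid == -1)
        return -1;

    if (pid == 0)               /* Child: continues from the point of the call */
        _exit(7);

    if (waitpid((pid_t) pid, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

As with fork(2), the call returns 0 in the child and the child's ID in the caller, and the child runs on a copy-on-write duplicate of the caller's stack.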
876
877 The order of the arguments also differs in the raw system call, and
878 there are variations in the arguments across architectures, as detailed
879 in the following paragraphs.
880
881 The raw system call interface on x86-64 and some other architectures
882 (including sh, tile, and alpha) is:
883
    long clone(unsigned long flags, void *stack,
               int *parent_tid, int *child_tid,
               unsigned long tls);
887
888 On x86-32, and several other common architectures (including score,
889 ARM, ARM 64, PA-RISC, arc, Power PC, xtensa, and MIPS), the order of
890 the last two arguments is reversed:
891
    long clone(unsigned long flags, void *stack,
               int *parent_tid, unsigned long tls,
               int *child_tid);
895
896 On the cris and s390 architectures, the order of the first two argu‐
897 ments is reversed:
898
    long clone(void *stack, unsigned long flags,
               int *parent_tid, int *child_tid,
               unsigned long tls);
902
903 On the microblaze architecture, an additional argument is supplied:
904
    long clone(unsigned long flags, void *stack,
               int stack_size,         /* Size of stack */
               int *parent_tid, int *child_tid,
               unsigned long tls);
909
910 blackfin, m68k, and sparc
911 The argument-passing conventions on blackfin, m68k, and sparc are dif‐
912 ferent from the descriptions above. For details, see the kernel (and
913 glibc) source.
914
915 ia64
916 On ia64, a different interface is used:
917
    int __clone2(int (*fn)(void *),
                 void *stack_base, size_t stack_size,
                 int flags, void *arg, ...
                 /* pid_t *parent_tid, struct user_desc *tls,
                    pid_t *child_tid */ );
923
924 The prototype shown above is for the glibc wrapper function; for the
925 system call itself, the prototype can be described as follows (it is
926 identical to the clone() prototype on microblaze):
927
    long clone2(unsigned long flags, void *stack_base,
                int stack_size,         /* Size of stack */
                int *parent_tid, int *child_tid,
                unsigned long tls);
932
933 __clone2() operates in the same way as clone(), except that stack_base
934 points to the lowest address of the child's stack area, and stack_size
935 specifies the size of the stack pointed to by stack_base.
936
STANDARDS
938 Linux.
939
HISTORY
941 clone3()
942 Linux 5.3.
943
944 Linux 2.4 and earlier
945 In the Linux 2.4.x series, CLONE_THREAD generally does not make the
946 parent of the new thread the same as the parent of the calling process.
947 However, from Linux 2.4.7 to Linux 2.4.18 the CLONE_THREAD flag implied
948 the CLONE_PARENT flag (as in Linux 2.6.0 and later).
949
950 In Linux 2.4 and earlier, clone() does not take arguments parent_tid,
951 tls, and child_tid.
952
NOTES
954 One use of these system calls is to implement threads: multiple flows
955 of control in a program that run concurrently in a shared address
956 space.
957
958 The kcmp(2) system call can be used to test whether two processes share
959 various resources such as a file descriptor table, System V semaphore
960 undo operations, or a virtual address space.
961
962 Handlers registered using pthread_atfork(3) are not executed during a
963 clone call.
964
BUGS
966 GNU C library versions 2.3.4 up to and including 2.24 contained a wrap‐
967 per function for getpid(2) that performed caching of PIDs. This
968 caching relied on support in the glibc wrapper for clone(), but limita‐
969 tions in the implementation meant that the cache was not up to date in
970 some circumstances. In particular, if a signal was delivered to the
971 child immediately after the clone() call, then a call to getpid(2) in a
972 handler for the signal could return the PID of the calling process
973 ("the parent"), if the clone wrapper had not yet had a chance to update
974 the PID cache in the child. (This discussion ignores the case where
975 the child was created using CLONE_THREAD, when getpid(2) should return
976 the same value in the child and in the process that called clone(),
977 since the caller and the child are in the same thread group. The
978 stale-cache problem also does not occur if the flags argument includes
979 CLONE_VM.) To get the truth, it was sometimes necessary to use code
980 such as the following:
981
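    #include <sys/syscall.h>    /* Definition of SYS_getpid */
    #include <sys/types.h>      /* Definition of pid_t */
    #include <unistd.h>         /* Declaration of syscall() */

    pid_t mypid;

    mypid = syscall(SYS_getpid);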
987
988 Because of the stale-cache problem, as well as other problems noted in
989 getpid(2), the PID caching feature was removed in glibc 2.25.
990
EXAMPLES
992 The following program demonstrates the use of clone() to create a child
993 process that executes in a separate UTS namespace. The child changes
994 the hostname in its UTS namespace. Both parent and child then display
995 the system hostname, making it possible to see that the hostname dif‐
996 fers in the UTS namespaces of the parent and child. For an example of
997 the use of this program, see setns(2).
998
999 Within the sample program, we allocate the memory that is to be used
1000 for the child's stack using mmap(2) rather than malloc(3) for the fol‐
1001 lowing reasons:
1002
1003 • mmap(2) allocates a block of memory that starts on a page boundary
1004 and is a multiple of the page size. This is useful if we want to
1005 establish a guard page (a page with protection PROT_NONE) at the end
1006 of the stack using mprotect(2).
1007
1008 • We can specify the MAP_STACK flag to request a mapping that is suit‐
1009 able for a stack. For the moment, this flag is a no-op on Linux,
1010 but it exists and has effect on some other systems, so we should in‐
1011 clude it for portability.
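
The guard page mentioned in the first point is not set up by the program below. One way it might be done is sketched here, assuming a stack that grows downward (so that the guard page belongs at the lowest address) and a size that is a multiple of the page size. The helper names alloc_guarded_stack() and free_guarded_stack() are hypothetical and are not part of any library.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map 'size' usable bytes plus one inaccessible guard page below them.
   Returns a pointer to the top of the stack (the value to pass to
   clone()), or NULL on error. */
static char *
alloc_guarded_stack(size_t size)
{
    char    *base;
    long    page;
    size_t  total;

    page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;
    total = size + (size_t) page;

    base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Make the lowest page inaccessible, so that overrunning the
       stack faults instead of silently corrupting adjacent memory. */
    if (mprotect(base, (size_t) page, PROT_NONE) == -1) {
        munmap(base, total);
        return NULL;
    }

    return base + total;
}

/* Release a stack obtained from alloc_guarded_stack() with the same
   'size'. Returns 0 on success, or -1 on error. */
static int
free_guarded_stack(char *top, size_t size)
{
    long  page = sysconf(_SC_PAGESIZE);

    if (page <= 0)
        return -1;
    return munmap(top - size - (size_t) page, size + (size_t) page);
}
```

In the program below, the value returned by such a helper would take the place of stackTop.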
1012
1013 Program source
    #define _GNU_SOURCE
    #include <err.h>
    #include <sched.h>
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/utsname.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int              /* Start function for cloned child */
    childFunc(void *arg)
    {
        struct utsname uts;

        /* Change hostname in UTS namespace of child. */

        if (sethostname(arg, strlen(arg)) == -1)
            err(EXIT_FAILURE, "sethostname");

        /* Retrieve and display hostname. */

        if (uname(&uts) == -1)
            err(EXIT_FAILURE, "uname");
        printf("uts.nodename in child: %s\n", uts.nodename);

        /* Keep the namespace open for a while, by sleeping.
           This allows some experimentation--for example, another
           process might join the namespace. */

        sleep(200);

        return 0;           /* Child terminates now */
    }

    #define STACK_SIZE (1024 * 1024)    /* Stack size for cloned child */

    int
    main(int argc, char *argv[])
    {
        char            *stack;         /* Start of stack buffer */
        char            *stackTop;      /* End of stack buffer */
        pid_t           pid;
        struct utsname  uts;

        if (argc < 2) {
            fprintf(stderr, "Usage: %s <child-hostname>\n", argv[0]);
            exit(EXIT_FAILURE);         /* Missing argument is an error */
        }

        /* Allocate memory to be used for the stack of the child. */

        stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
        if (stack == MAP_FAILED)
            err(EXIT_FAILURE, "mmap");

        stackTop = stack + STACK_SIZE;  /* Assume stack grows downward */

        /* Create child that has its own UTS namespace;
           child commences execution in childFunc(). */

        pid = clone(childFunc, stackTop, CLONE_NEWUTS | SIGCHLD, argv[1]);
        if (pid == -1)
            err(EXIT_FAILURE, "clone");
        printf("clone() returned %jd\n", (intmax_t) pid);

        /* Parent falls through to here */

        sleep(1);           /* Give child time to change its hostname */

        /* Display hostname in parent's UTS namespace. This will be
           different from hostname in child's UTS namespace. */

        if (uname(&uts) == -1)
            err(EXIT_FAILURE, "uname");
        printf("uts.nodename in parent: %s\n", uts.nodename);

        if (waitpid(pid, NULL, 0) == -1)    /* Wait for child */
            err(EXIT_FAILURE, "waitpid");
        printf("child has terminated\n");

        exit(EXIT_SUCCESS);
    }
1101
SEE ALSO
1103 fork(2), futex(2), getpid(2), gettid(2), kcmp(2), mmap(2),
1104 pidfd_open(2), set_thread_area(2), set_tid_address(2), setns(2),
1105 tkill(2), unshare(2), wait(2), capabilities(7), namespaces(7),
1106 pthreads(7)
1107
1108
1109
1110Linux man-pages 6.05 2023-05-03 clone(2)