1FUTEX(2) Linux Programmer's Manual FUTEX(2)
2
3
4
6 futex - fast user-space locking
7
9 #include <linux/futex.h> /* Definition of FUTEX_* constants */
10 #include <sys/syscall.h> /* Definition of SYS_* constants */
11 #include <unistd.h>
12
13 long syscall(SYS_futex, uint32_t *uaddr, int futex_op, uint32_t val,
14 const struct timespec *timeout, /* or: uint32_t val2 */
15 uint32_t *uaddr2, uint32_t val3);
16
17 Note: glibc provides no wrapper for futex(), necessitating the use of
18 syscall(2).
19
21 The futex() system call provides a method for waiting until a certain
22 condition becomes true. It is typically used as a blocking construct
23 in the context of shared-memory synchronization. When using futexes,
24 the majority of the synchronization operations are performed in user
25 space. A user-space program employs the futex() system call only when
26 it is likely that the program has to block for a longer time until the
27 condition becomes true. Other futex() operations can be used to wake
28 any processes or threads waiting for a particular condition.
29
30 A futex is a 32-bit value—referred to below as a futex word—whose ad‐
31 dress is supplied to the futex() system call. (Futexes are 32 bits in
32 size on all platforms, including 64-bit systems.) All futex operations
33 are governed by this value. In order to share a futex between pro‐
34 cesses, the futex is placed in a region of shared memory, created using
35 (for example) mmap(2) or shmat(2). (Thus, the futex word may have dif‐
36 ferent virtual addresses in different processes, but these addresses
37 all refer to the same location in physical memory.) In a multithreaded
38 program, it is sufficient to place the futex word in a global variable
39 shared by all threads.
40
41 When executing a futex operation that requests to block a thread, the
42 kernel will block only if the futex word has the value that the calling
43 thread supplied (as one of the arguments of the futex() call) as the
44 expected value of the futex word. The loading of the futex word's
45 value, the comparison of that value with the expected value, and the
46 actual blocking will happen atomically and will be totally ordered with
47 respect to concurrent operations performed by other threads on the same
48 futex word. Thus, the futex word is used to connect the synchroniza‐
49 tion in user space with the implementation of blocking by the kernel.
50 Analogously to an atomic compare-and-exchange operation that poten‐
51 tially changes shared memory, blocking via a futex is an atomic com‐
52 pare-and-block operation.
53
54 One use of futexes is for implementing locks. The state of the lock
55 (i.e., acquired or not acquired) can be represented as an atomically
56 accessed flag in shared memory. In the uncontended case, a thread can
57 access or modify the lock state with atomic instructions, for example
58 atomically changing it from not acquired to acquired using an atomic
59 compare-and-exchange instruction. (Such instructions are performed en‐
60 tirely in user mode, and the kernel maintains no information about the
61 lock state.) On the other hand, a thread may be unable to acquire a
62 lock because it is already acquired by another thread. It then may
63 pass the lock's flag as a futex word and the value representing the ac‐
64 quired state as the expected value to a futex() wait operation. This
65 futex() operation will block if and only if the lock is still acquired
66 (i.e., the value in the futex word still matches the "acquired state").
67 When releasing the lock, a thread has to first reset the lock state to
68 not acquired and then execute a futex operation that wakes threads
69 blocked on the lock flag used as a futex word (this can be further op‐
70 timized to avoid unnecessary wake-ups). See futex(7) for more detail
71 on how to use futexes.
72
73 Besides the basic wait and wake-up futex functionality, there are fur‐
74 ther futex operations aimed at supporting more complex use cases.
75
76 Note that no explicit initialization or destruction is necessary to use
77 futexes; the kernel maintains a futex (i.e., the kernel-internal imple‐
78 mentation artifact) only while operations such as FUTEX_WAIT, described
79 below, are being performed on a particular futex word.
80
81 Arguments
82 The uaddr argument points to the futex word. On all platforms, futexes
83 are four-byte integers that must be aligned on a four-byte boundary.
84 The operation to perform on the futex is specified in the futex_op ar‐
85 gument; val is a value whose meaning and purpose depends on futex_op.
86
87 The remaining arguments (timeout, uaddr2, and val3) are required only
88 for certain of the futex operations described below. Where one of
89 these arguments is not required, it is ignored.
90
91 For several blocking operations, the timeout argument is a pointer to a
92 timespec structure that specifies a timeout for the operation. How‐
93 ever, notwithstanding the prototype shown above, for some operations,
94 the least significant four bytes of this argument are instead used as
95 an integer whose meaning is determined by the operation. For these op‐
96 erations, the kernel casts the timeout value first to unsigned long,
97 then to uint32_t, and in the remainder of this page, this argument is
98 referred to as val2 when interpreted in this fashion.
99
100 Where it is required, the uaddr2 argument is a pointer to a second fu‐
101 tex word that is employed by the operation.
102
103 The interpretation of the final integer argument, val3, depends on the
104 operation.
105
106 Futex operations
107 The futex_op argument consists of two parts: a command that specifies
108 the operation to be performed, bitwise ORed with zero or more options
109 that modify the behaviour of the operation. The options that may be
110 included in futex_op are as follows:
111
112 FUTEX_PRIVATE_FLAG (since Linux 2.6.22)
113 This option bit can be employed with all futex operations. It
114 tells the kernel that the futex is process-private and not
115 shared with another process (i.e., it is being used for synchro‐
116 nization only between threads of the same process). This allows
117 the kernel to make some additional performance optimizations.
118
119 As a convenience, <linux/futex.h> defines a set of constants
120 with the suffix _PRIVATE that are equivalents of all of the op‐
121 erations listed below, but with the FUTEX_PRIVATE_FLAG ORed into
122 the constant value. Thus, there are FUTEX_WAIT_PRIVATE, FU‐
123 TEX_WAKE_PRIVATE, and so on.
124
125 FUTEX_CLOCK_REALTIME (since Linux 2.6.28)
126 This option bit can be employed only with the FUTEX_WAIT_BITSET,
127 FUTEX_WAIT_REQUEUE_PI, (since Linux 4.5) FUTEX_WAIT, and (since
128 Linux 5.14) FUTEX_LOCK_PI2 operations.
129
130 If this option is set, the kernel measures the timeout against
131 the CLOCK_REALTIME clock.
132
133 If this option is not set, the kernel measures the timeout
134 against the CLOCK_MONOTONIC clock.
135
136 The operation specified in futex_op is one of the following:
137
138 FUTEX_WAIT (since Linux 2.6.0)
139 This operation tests that the value at the futex word pointed to
140 by the address uaddr still contains the expected value val, and
141 if so, then sleeps waiting for a FUTEX_WAKE operation on the fu‐
142 tex word. The load of the value of the futex word is an atomic
143 memory access (i.e., using atomic machine instructions of the
144 respective architecture). This load, the comparison with the
145 expected value, and starting to sleep are performed atomically
146 and totally ordered with respect to other futex operations on
147 the same futex word. If the thread starts to sleep, it is con‐
148 sidered a waiter on this futex word. If the futex value does
149 not match val, then the call fails immediately with the error
150 EAGAIN.
151
152 The purpose of the comparison with the expected value is to pre‐
153 vent lost wake-ups. If another thread changed the value of the
154 futex word after the calling thread decided to block based on
155 the prior value, and if the other thread executed a FUTEX_WAKE
156 operation (or similar wake-up) after the value change and before
157 this FUTEX_WAIT operation, then the calling thread will observe
158 the value change and will not start to sleep.
159
160 If the timeout is not NULL, the structure it points to specifies
161 a timeout for the wait. (This interval will be rounded up to
162 the system clock granularity, and is guaranteed not to expire
163 early.) The timeout is by default measured according to the
164 CLOCK_MONOTONIC clock, but, since Linux 4.5, the CLOCK_REALTIME
165 clock can be selected by specifying FUTEX_CLOCK_REALTIME in fu‐
166 tex_op. If timeout is NULL, the call blocks indefinitely.
167
168 Note: for FUTEX_WAIT, timeout is interpreted as a relative
169 value. This differs from other futex operations, where timeout
170 is interpreted as an absolute value. To obtain the equivalent
171 of FUTEX_WAIT with an absolute timeout, employ FUTEX_WAIT_BITSET
172 with val3 specified as FUTEX_BITSET_MATCH_ANY.
173
174 The arguments uaddr2 and val3 are ignored.
175
176 FUTEX_WAKE (since Linux 2.6.0)
177 This operation wakes at most val of the waiters that are waiting
178 (e.g., inside FUTEX_WAIT) on the futex word at the address
179 uaddr. Most commonly, val is specified as either 1 (wake up a
180 single waiter) or INT_MAX (wake up all waiters). No guarantee
181 is provided about which waiters are awoken (e.g., a waiter with
182 a higher scheduling priority is not guaranteed to be awoken in
183 preference to a waiter with a lower priority).
184
185 The arguments timeout, uaddr2, and val3 are ignored.
186
187 FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25)
188 This operation creates a file descriptor that is associated with
189 the futex at uaddr. The caller must close the returned file de‐
190 scriptor after use. When another process or thread performs a
191 FUTEX_WAKE on the futex word, the file descriptor indicates as
192 being readable with select(2), poll(2), and epoll(7)
193
194 The file descriptor can be used to obtain asynchronous notifica‐
195 tions: if val is nonzero, then, when another process or thread
196 executes a FUTEX_WAKE, the caller will receive the signal number
197 that was passed in val.
198
199 The arguments timeout, uaddr2, and val3 are ignored.
200
201 Because it was inherently racy, FUTEX_FD has been removed from
202 Linux 2.6.26 onward.
203
204 FUTEX_REQUEUE (since Linux 2.6.0)
205 This operation performs the same task as FUTEX_CMP_REQUEUE (see
206 below), except that no check is made using the value in val3.
207 (The argument val3 is ignored.)
208
209 FUTEX_CMP_REQUEUE (since Linux 2.6.7)
210 This operation first checks whether the location uaddr still
211 contains the value val3. If not, the operation fails with the
212 error EAGAIN. Otherwise, the operation wakes up a maximum of
213 val waiters that are waiting on the futex at uaddr. If there
214 are more than val waiters, then the remaining waiters are re‐
215 moved from the wait queue of the source futex at uaddr and added
216 to the wait queue of the target futex at uaddr2. The val2 argu‐
217 ment specifies an upper limit on the number of waiters that are
218 requeued to the futex at uaddr2.
219
220 The load from uaddr is an atomic memory access (i.e., using
221 atomic machine instructions of the respective architecture).
222 This load, the comparison with val3, and the requeueing of any
223 waiters are performed atomically and totally ordered with re‐
224 spect to other operations on the same futex word.
225
226 Typical values to specify for val are 0 or 1. (Specifying
227 INT_MAX is not useful, because it would make the FUTEX_CMP_RE‐
228 QUEUE operation equivalent to FUTEX_WAKE.) The limit value
229 specified via val2 is typically either 1 or INT_MAX. (Specify‐
230 ing the argument as 0 is not useful, because it would make the
231 FUTEX_CMP_REQUEUE operation equivalent to FUTEX_WAIT.)
232
233 The FUTEX_CMP_REQUEUE operation was added as a replacement for
234 the earlier FUTEX_REQUEUE. The difference is that the check of
235 the value at uaddr can be used to ensure that requeueing happens
236 only under certain conditions, which allows race conditions to
237 be avoided in certain use cases.
238
239 Both FUTEX_REQUEUE and FUTEX_CMP_REQUEUE can be used to avoid
240 "thundering herd" wake-ups that could occur when using FU‐
241 TEX_WAKE in cases where all of the waiters that are woken need
242 to acquire another futex. Consider the following scenario,
243 where multiple waiter threads are waiting on B, a wait queue im‐
244 plemented using a futex:
245
246 lock(A)
247 while (!check_value(V)) {
248 unlock(A);
249 block_on(B);
250 lock(A);
251 };
252 unlock(A);
253
254 If a waker thread used FUTEX_WAKE, then all waiters waiting on B
255 would be woken up, and they would all try to acquire lock A.
256 However, waking all of the threads in this manner would be
257 pointless because all except one of the threads would immedi‐
258 ately block on lock A again. By contrast, a requeue operation
259 wakes just one waiter and moves the other waiters to lock A, and
260 when the woken waiter unlocks A then the next waiter can pro‐
261 ceed.
262
263 FUTEX_WAKE_OP (since Linux 2.6.14)
264 This operation was added to support some user-space use cases
265 where more than one futex must be handled at the same time. The
266 most notable example is the implementation of pthread_cond_sig‐
267 nal(3), which requires operations on two futexes, the one used
268 to implement the mutex and the one used in the implementation of
269 the wait queue associated with the condition variable. FU‐
270 TEX_WAKE_OP allows such cases to be implemented without leading
271 to high rates of contention and context switching.
272
273 The FUTEX_WAKE_OP operation is equivalent to executing the fol‐
274 lowing code atomically and totally ordered with respect to other
275 futex operations on any of the two supplied futex words:
276
277 uint32_t oldval = *(uint32_t *) uaddr2;
278 *(uint32_t *) uaddr2 = oldval op oparg;
279 futex(uaddr, FUTEX_WAKE, val, 0, 0, 0);
280 if (oldval cmp cmparg)
281 futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0);
282
283 In other words, FUTEX_WAKE_OP does the following:
284
285 * saves the original value of the futex word at uaddr2 and per‐
286 forms an operation to modify the value of the futex at
287 uaddr2; this is an atomic read-modify-write memory access
288 (i.e., using atomic machine instructions of the respective
289 architecture)
290
291 * wakes up a maximum of val waiters on the futex for the futex
292 word at uaddr; and
293
294 * dependent on the results of a test of the original value of
295 the futex word at uaddr2, wakes up a maximum of val2 waiters
296 on the futex for the futex word at uaddr2.
297
298 The operation and comparison that are to be performed are en‐
299 coded in the bits of the argument val3. Pictorially, the encod‐
300 ing is:
301
302 +---+---+-----------+-----------+
303 |op |cmp| oparg | cmparg |
304 +---+---+-----------+-----------+
305 4 4 12 12 <== # of bits
306
307 Expressed in code, the encoding is:
308
309 #define FUTEX_OP(op, oparg, cmp, cmparg) \
310 (((op & 0xf) << 28) | \
311 ((cmp & 0xf) << 24) | \
312 ((oparg & 0xfff) << 12) | \
313 (cmparg & 0xfff))
314
315 In the above, op and cmp are each one of the codes listed below.
316 The oparg and cmparg components are literal numeric values, ex‐
317 cept as noted below.
318
319 The op component has one of the following values:
320
321 FUTEX_OP_SET 0 /* uaddr2 = oparg; */
322 FUTEX_OP_ADD 1 /* uaddr2 += oparg; */
323 FUTEX_OP_OR 2 /* uaddr2 |= oparg; */
324 FUTEX_OP_ANDN 3 /* uaddr2 &= ~oparg; */
325 FUTEX_OP_XOR 4 /* uaddr2 ^= oparg; */
326
327 In addition, bitwise ORing the following value into op causes
328 (1 << oparg) to be used as the operand:
329
330 FUTEX_OP_ARG_SHIFT 8 /* Use (1 << oparg) as operand */
331
332 The cmp field is one of the following:
333
334 FUTEX_OP_CMP_EQ 0 /* if (oldval == cmparg) wake */
335 FUTEX_OP_CMP_NE 1 /* if (oldval != cmparg) wake */
336 FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake */
337 FUTEX_OP_CMP_LE 3 /* if (oldval <= cmparg) wake */
338 FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake */
339 FUTEX_OP_CMP_GE 5 /* if (oldval >= cmparg) wake */
340
341 The return value of FUTEX_WAKE_OP is the sum of the number of
342 waiters woken on the futex uaddr plus the number of waiters wo‐
343 ken on the futex uaddr2.
344
345 FUTEX_WAIT_BITSET (since Linux 2.6.25)
346 This operation is like FUTEX_WAIT except that val3 is used to
347 provide a 32-bit bit mask to the kernel. This bit mask, in
348 which at least one bit must be set, is stored in the kernel-in‐
349 ternal state of the waiter. See the description of FU‐
350 TEX_WAKE_BITSET for further details.
351
352 If timeout is not NULL, the structure it points to specifies an
353 absolute timeout for the wait operation. If timeout is NULL,
354 the operation can block indefinitely.
355
356 The uaddr2 argument is ignored.
357
358 FUTEX_WAKE_BITSET (since Linux 2.6.25)
359 This operation is the same as FUTEX_WAKE except that the val3
360 argument is used to provide a 32-bit bit mask to the kernel.
361 This bit mask, in which at least one bit must be set, is used to
362 select which waiters should be woken up. The selection is done
363 by a bitwise AND of the "wake" bit mask (i.e., the value in
364 val3) and the bit mask which is stored in the kernel-internal
365 state of the waiter (the "wait" bit mask that is set using FU‐
366 TEX_WAIT_BITSET). All of the waiters for which the result of
367 the AND is nonzero are woken up; the remaining waiters are left
368 sleeping.
369
370 The effect of FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET is to al‐
371 low selective wake-ups among multiple waiters that are blocked
372 on the same futex. However, note that, depending on the use
373 case, employing this bit-mask multiplexing feature on a futex
374 can be less efficient than simply using multiple futexes, be‐
375 cause employing bit-mask multiplexing requires the kernel to
376 check all waiters on a futex, including those that are not in‐
377 terested in being woken up (i.e., they do not have the relevant
378 bit set in their "wait" bit mask).
379
380 The constant FUTEX_BITSET_MATCH_ANY, which corresponds to all 32
381 bits set in the bit mask, can be used as the val3 argument for
382 FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET. Other than differences
383 in the handling of the timeout argument, the FUTEX_WAIT opera‐
384 tion is equivalent to FUTEX_WAIT_BITSET with val3 specified as
385 FUTEX_BITSET_MATCH_ANY; that is, allow a wake-up by any waker.
386 The FUTEX_WAKE operation is equivalent to FUTEX_WAKE_BITSET with
387 val3 specified as FUTEX_BITSET_MATCH_ANY; that is, wake up any
388 waiter(s).
389
390 The uaddr2 and timeout arguments are ignored.
391
392 Priority-inheritance futexes
393 Linux supports priority-inheritance (PI) futexes in order to handle
394 priority-inversion problems that can be encountered with normal futex
395 locks. Priority inversion is the problem that occurs when a high-pri‐
396 ority task is blocked waiting to acquire a lock held by a low-priority
397 task, while tasks at an intermediate priority continuously preempt the
398 low-priority task from the CPU. Consequently, the low-priority task
399 makes no progress toward releasing the lock, and the high-priority task
400 remains blocked.
401
402 Priority inheritance is a mechanism for dealing with the priority-in‐
403 version problem. With this mechanism, when a high-priority task be‐
404 comes blocked by a lock held by a low-priority task, the priority of
405 the low-priority task is temporarily raised to that of the high-prior‐
406 ity task, so that it is not preempted by any intermediate level tasks,
407 and can thus make progress toward releasing the lock. To be effective,
408 priority inheritance must be transitive, meaning that if a high-prior‐
409 ity task blocks on a lock held by a lower-priority task that is itself
410 blocked by a lock held by another intermediate-priority task (and so
411 on, for chains of arbitrary length), then both of those tasks (or more
412 generally, all of the tasks in a lock chain) have their priorities
413 raised to be the same as the high-priority task.
414
415 From a user-space perspective, what makes a futex PI-aware is a policy
416 agreement (described below) between user space and the kernel about the
417 value of the futex word, coupled with the use of the PI-futex opera‐
418 tions described below. (Unlike the other futex operations described
419 above, the PI-futex operations are designed for the implementation of
420 very specific IPC mechanisms.)
421
422 The PI-futex operations described below differ from the other futex op‐
423 erations in that they impose policy on the use of the value of the fu‐
424 tex word:
425
426 * If the lock is not acquired, the futex word's value shall be 0.
427
428 * If the lock is acquired, the futex word's value shall be the thread
429 ID (TID; see gettid(2)) of the owning thread.
430
431 * If the lock is owned and there are threads contending for the lock,
432 then the FUTEX_WAITERS bit shall be set in the futex word's value;
433 in other words, this value is:
434
435 FUTEX_WAITERS | TID
436
437 (Note that is invalid for a PI futex word to have no owner and FU‐
438 TEX_WAITERS set.)
439
440 With this policy in place, a user-space application can acquire an un‐
441 acquired lock or release a lock using atomic instructions executed in
442 user mode (e.g., a compare-and-swap operation such as cmpxchg on the
443 x86 architecture). Acquiring a lock simply consists of using compare-
444 and-swap to atomically set the futex word's value to the caller's TID
445 if its previous value was 0. Releasing a lock requires using compare-
446 and-swap to set the futex word's value to 0 if the previous value was
447 the expected TID.
448
449 If a futex is already acquired (i.e., has a nonzero value), waiters
450 must employ the FUTEX_LOCK_PI or FUTEX_LOCK_PI2 operations to acquire
451 the lock. If other threads are waiting for the lock, then the FU‐
452 TEX_WAITERS bit is set in the futex value; in this case, the lock owner
453 must employ the FUTEX_UNLOCK_PI operation to release the lock.
454
455 In the cases where callers are forced into the kernel (i.e., required
456 to perform a futex() call), they then deal directly with a so-called
457 RT-mutex, a kernel locking mechanism which implements the required pri‐
458 ority-inheritance semantics. After the RT-mutex is acquired, the futex
459 value is updated accordingly, before the calling thread returns to user
460 space.
461
462 It is important to note that the kernel will update the futex word's
463 value prior to returning to user space. (This prevents the possibility
464 of the futex word's value ending up in an invalid state, such as having
465 an owner but the value being 0, or having waiters but not having the
466 FUTEX_WAITERS bit set.)
467
468 If a futex has an associated RT-mutex in the kernel (i.e., there are
469 blocked waiters) and the owner of the futex/RT-mutex dies unexpectedly,
470 then the kernel cleans up the RT-mutex and hands it over to the next
471 waiter. This in turn requires that the user-space value is updated ac‐
472 cordingly. To indicate that this is required, the kernel sets the FU‐
473 TEX_OWNER_DIED bit in the futex word along with the thread ID of the
474 new owner. User space can detect this situation via the presence of
475 the FUTEX_OWNER_DIED bit and is then responsible for cleaning up the
476 stale state left over by the dead owner.
477
478 PI futexes are operated on by specifying one of the values listed below
479 in futex_op. Note that the PI futex operations must be used as paired
480 operations and are subject to some additional requirements:
481
482 * FUTEX_LOCK_PI, FUTEX_LOCK_PI2, and FUTEX_TRYLOCK_PI pair with FU‐
483 TEX_UNLOCK_PI. FUTEX_UNLOCK_PI must be called only on a futex owned
484 by the calling thread, as defined by the value policy, otherwise the
485 error EPERM results.
486
487 * FUTEX_WAIT_REQUEUE_PI pairs with FUTEX_CMP_REQUEUE_PI. This must be
488 performed from a non-PI futex to a distinct PI futex (or the error
489 EINVAL results). Additionally, val (the number of waiters to be wo‐
490 ken) must be 1 (or the error EINVAL results).
491
492 The PI futex operations are as follows:
493
494 FUTEX_LOCK_PI (since Linux 2.6.18)
495 This operation is used after an attempt to acquire the lock via
496 an atomic user-mode instruction failed because the futex word
497 has a nonzero value—specifically, because it contained the (PID-
498 namespace-specific) TID of the lock owner.
499
500 The operation checks the value of the futex word at the address
501 uaddr. If the value is 0, then the kernel tries to atomically
502 set the futex value to the caller's TID. If the futex word's
503 value is nonzero, the kernel atomically sets the FUTEX_WAITERS
504 bit, which signals the futex owner that it cannot unlock the fu‐
505 tex in user space atomically by setting the futex value to 0.
506 After that, the kernel:
507
508 1. Tries to find the thread which is associated with the owner
509 TID.
510
511 2. Creates or reuses kernel state on behalf of the owner. (If
512 this is the first waiter, there is no kernel state for this
513 futex, so kernel state is created by locking the RT-mutex and
514 the futex owner is made the owner of the RT-mutex. If there
515 are existing waiters, then the existing state is reused.)
516
517 3. Attaches the waiter to the futex (i.e., the waiter is en‐
518 queued on the RT-mutex waiter list).
519
520 If more than one waiter exists, the enqueueing of the waiter is
521 in descending priority order. (For information on priority or‐
522 dering, see the discussion of the SCHED_DEADLINE, SCHED_FIFO,
523 and SCHED_RR scheduling policies in sched(7).) The owner inher‐
524 its either the waiter's CPU bandwidth (if the waiter is sched‐
525 uled under the SCHED_DEADLINE policy) or the waiter's priority
526 (if the waiter is scheduled under the SCHED_RR or SCHED_FIFO
527 policy). This inheritance follows the lock chain in the case of
528 nested locking and performs deadlock detection.
529
530 The timeout argument provides a timeout for the lock attempt.
531 If timeout is not NULL, the structure it points to specifies an
532 absolute timeout, measured against the CLOCK_REALTIME clock. If
533 timeout is NULL, the operation will block indefinitely.
534
535 The uaddr2, val, and val3 arguments are ignored.
536
537 FUTEX_LOCK_PI2 (since Linux 5.14)
538 This operation is the same as FUTEX_LOCK_PI, except that the
539 clock against which timeout is measured is selectable. By de‐
540 fault, the (absolute) timeout specified in timeout is measured
541 againt the CLOCK_MONOTONIC clock, but if the FUTEX_CLOCK_REAL‐
542 TIME flag is specified in futex_op, then the timeout is measured
543 against the CLOCK_REALTIME clock.
544
545 FUTEX_TRYLOCK_PI (since Linux 2.6.18)
546 This operation tries to acquire the lock at uaddr. It is in‐
547 voked when a user-space atomic acquire did not succeed because
548 the futex word was not 0.
549
550 Because the kernel has access to more state information than
551 user space, acquisition of the lock might succeed if performed
552 by the kernel in cases where the futex word (i.e., the state in‐
553 formation accessible to use-space) contains stale state (FU‐
554 TEX_WAITERS and/or FUTEX_OWNER_DIED). This can happen when the
555 owner of the futex died. User space cannot handle this condi‐
556 tion in a race-free manner, but the kernel can fix this up and
557 acquire the futex.
558
559 The uaddr2, val, timeout, and val3 arguments are ignored.
560
561 FUTEX_UNLOCK_PI (since Linux 2.6.18)
562 This operation wakes the top priority waiter that is waiting in
563 FUTEX_LOCK_PI or FUTEX_LOCK_PI2 on the futex address provided by
564 the uaddr argument.
565
566 This is called when the user-space value at uaddr cannot be
567 changed atomically from a TID (of the owner) to 0.
568
569 The uaddr2, val, timeout, and val3 arguments are ignored.
570
571 FUTEX_CMP_REQUEUE_PI (since Linux 2.6.31)
572 This operation is a PI-aware variant of FUTEX_CMP_REQUEUE. It
573 requeues waiters that are blocked via FUTEX_WAIT_REQUEUE_PI on
574 uaddr from a non-PI source futex (uaddr) to a PI target futex
575 (uaddr2).
576
577 As with FUTEX_CMP_REQUEUE, this operation wakes up a maximum of
578 val waiters that are waiting on the futex at uaddr. However,
579 for FUTEX_CMP_REQUEUE_PI, val is required to be 1 (since the
580 main point is to avoid a thundering herd). The remaining wait‐
581 ers are removed from the wait queue of the source futex at uaddr
582 and added to the wait queue of the target futex at uaddr2.
583
584 The val2 and val3 arguments serve the same purposes as for FU‐
585 TEX_CMP_REQUEUE.
586
587 FUTEX_WAIT_REQUEUE_PI (since Linux 2.6.31)
588 Wait on a non-PI futex at uaddr and potentially be requeued (via
589 a FUTEX_CMP_REQUEUE_PI operation in another task) onto a PI fu‐
590 tex at uaddr2. The wait operation on uaddr is the same as for
591 FUTEX_WAIT.
592
593 The waiter can be removed from the wait on uaddr without re‐
594 queueing on uaddr2 via a FUTEX_WAKE operation in another task.
595 In this case, the FUTEX_WAIT_REQUEUE_PI operation fails with the
596 error EAGAIN.
597
598 If timeout is not NULL, the structure it points to specifies an
599 absolute timeout for the wait operation. If timeout is NULL,
600 the operation can block indefinitely.
601
602 The val3 argument is ignored.
603
604 The FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI were added to
605 support a fairly specific use case: support for priority-inheri‐
606 tance-aware POSIX threads condition variables. The idea is that
607 these operations should always be paired, in order to ensure
608 that user space and the kernel remain in sync. Thus, in the FU‐
609 TEX_WAIT_REQUEUE_PI operation, the user-space application pre-
610 specifies the target of the requeue that takes place in the FU‐
611 TEX_CMP_REQUEUE_PI operation.
612
614 In the event of an error (and assuming that futex() was invoked via
615 syscall(2)), all operations return -1 and set errno to indicate the er‐
616 ror.
617
618 The return value on success depends on the operation, as described in
619 the following list:
620
621 FUTEX_WAIT
622 Returns 0 if the caller was woken up. Note that a wake-up can
623 also be caused by common futex usage patterns in unrelated code
624 that happened to have previously used the futex word's memory
625 location (e.g., typical futex-based implementations of Pthreads
626 mutexes can cause this under some conditions). Therefore, call‐
627 ers should always conservatively assume that a return value of 0
628 can mean a spurious wake-up, and use the futex word's value
629 (i.e., the user-space synchronization scheme) to decide whether
630 to continue to block or not.
631
632 FUTEX_WAKE
633 Returns the number of waiters that were woken up.
634
635 FUTEX_FD
636 Returns the new file descriptor associated with the futex.
637
638 FUTEX_REQUEUE
639 Returns the number of waiters that were woken up.
640
641 FUTEX_CMP_REQUEUE
642 Returns the total number of waiters that were woken up or re‐
643 queued to the futex for the futex word at uaddr2. If this value
644 is greater than val, then the difference is the number of wait‐
645 ers requeued to the futex for the futex word at uaddr2.
646
647 FUTEX_WAKE_OP
648 Returns the total number of waiters that were woken up. This is
649 the sum of the woken waiters on the two futexes for the futex
650 words at uaddr and uaddr2.
651
652 FUTEX_WAIT_BITSET
653 Returns 0 if the caller was woken up. See FUTEX_WAIT for how to
654 interpret this correctly in practice.
655
656 FUTEX_WAKE_BITSET
657 Returns the number of waiters that were woken up.
658
659 FUTEX_LOCK_PI
660 Returns 0 if the futex was successfully locked.
661
662 FUTEX_LOCK_PI2
663 Returns 0 if the futex was successfully locked.
664
665 FUTEX_TRYLOCK_PI
666 Returns 0 if the futex was successfully locked.
667
668 FUTEX_UNLOCK_PI
669 Returns 0 if the futex was successfully unlocked.
670
671 FUTEX_CMP_REQUEUE_PI
672 Returns the total number of waiters that were woken up or re‐
673 queued to the futex for the futex word at uaddr2. If this value
674 is greater than val, then difference is the number of waiters
675 requeued to the futex for the futex word at uaddr2.
676
677 FUTEX_WAIT_REQUEUE_PI
678 Returns 0 if the caller was successfully requeued to the futex
679 for the futex word at uaddr2.
680
682 EACCES No read access to the memory of a futex word.
683
684 EAGAIN (FUTEX_WAIT, FUTEX_WAIT_BITSET, FUTEX_WAIT_REQUEUE_PI) The value
685 pointed to by uaddr was not equal to the expected value val at
686 the time of the call.
687
688 Note: on Linux, the symbolic names EAGAIN and EWOULDBLOCK (both
689 of which appear in different parts of the kernel futex code)
690 have the same value.
691
692 EAGAIN (FUTEX_CMP_REQUEUE, FUTEX_CMP_REQUEUE_PI) The value pointed to
693 by uaddr is not equal to the expected value val3.
694
695 EAGAIN (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_RE‐
696 QUEUE_PI) The futex owner thread ID of uaddr (for FUTEX_CMP_RE‐
697 QUEUE_PI: uaddr2) is about to exit, but has not yet handled the
698 internal state cleanup. Try again.
699
700 EDEADLK
701 (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_RE‐
702 QUEUE_PI) The futex word at uaddr is already locked by the
703 caller.
704
705 EDEADLK
706 (FUTEX_CMP_REQUEUE_PI) While requeueing a waiter to the PI futex
707 for the futex word at uaddr2, the kernel detected a deadlock.
708
709 EFAULT A required pointer argument (i.e., uaddr, uaddr2, or timeout)
710 did not point to a valid user-space address.
711
712 EINTR A FUTEX_WAIT or FUTEX_WAIT_BITSET operation was interrupted by a
713 signal (see signal(7)). In kernels before Linux 2.6.22, this
714 error could also be returned for a spurious wakeup; since Linux
715 2.6.22, this no longer happens.
716
717 EINVAL The operation in futex_op is one of those that employs a time‐
718 out, but the supplied timeout argument was invalid (tv_sec was
719 less than zero, or tv_nsec was not less than 1,000,000,000).
720
721 EINVAL The operation specified in futex_op employs one or both of the
722 pointers uaddr and uaddr2, but one of these does not point to a
723 valid object—that is, the address is not four-byte-aligned.
724
725 EINVAL (FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET) The bit mask supplied in
726 val3 is zero.
727
728 EINVAL (FUTEX_CMP_REQUEUE_PI) uaddr equals uaddr2 (i.e., an attempt was
729 made to requeue to the same futex).
730
731 EINVAL (FUTEX_FD) The signal number supplied in val is invalid.
732
733 EINVAL (FUTEX_WAKE, FUTEX_WAKE_OP, FUTEX_WAKE_BITSET, FUTEX_REQUEUE,
734 FUTEX_CMP_REQUEUE) The kernel detected an inconsistency between
735 the user-space state at uaddr and the kernel state—that is, it
736 detected a waiter which waits in FUTEX_LOCK_PI or FUTEX_LOCK_PI2
737 on uaddr.
738
739 EINVAL (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_UN‐
740 LOCK_PI) The kernel detected an inconsistency between the user-
741 space state at uaddr and the kernel state. This indicates ei‐
742 ther state corruption or that the kernel found a waiter on uaddr
743 which is waiting via FUTEX_WAIT or FUTEX_WAIT_BITSET.
744
745 EINVAL (FUTEX_CMP_REQUEUE_PI) The kernel detected an inconsistency be‐
746 tween the user-space state at uaddr2 and the kernel state; that
747 is, the kernel detected a waiter which waits via FUTEX_WAIT or
748 FUTEX_WAIT_BITSET on uaddr2.
749
750 EINVAL (FUTEX_CMP_REQUEUE_PI) The kernel detected an inconsistency be‐
751 tween the user-space state at uaddr and the kernel state; that
752 is, the kernel detected a waiter which waits via FUTEX_WAIT or
753 FUTEX_WAIT_BITSET on uaddr.
754
755 EINVAL (FUTEX_CMP_REQUEUE_PI) The kernel detected an inconsistency be‐
756 tween the user-space state at uaddr and the kernel state; that
757 is, the kernel detected a waiter which waits on uaddr via FU‐
758 TEX_LOCK_PI or FUTEX_LOCK_PI2 (instead of FUTEX_WAIT_RE‐
759 QUEUE_PI).
760
761 EINVAL (FUTEX_CMP_REQUEUE_PI) An attempt was made to requeue a waiter
762 to a futex other than that specified by the matching FU‐
763 TEX_WAIT_REQUEUE_PI call for that waiter.
764
765 EINVAL (FUTEX_CMP_REQUEUE_PI) The val argument is not 1.
766
767 EINVAL Invalid argument.
768
769 ENFILE (FUTEX_FD) The system-wide limit on the total number of open
770 files has been reached.
771
772 ENOMEM (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_RE‐
773 QUEUE_PI) The kernel could not allocate memory to hold state in‐
774 formation.
775
776 ENOSYS Invalid operation specified in futex_op.
777
778 ENOSYS The FUTEX_CLOCK_REALTIME option was specified in futex_op, but
779 the accompanying operation was neither FUTEX_WAIT, FU‐
780 TEX_WAIT_BITSET, FUTEX_WAIT_REQUEUE_PI, nor FUTEX_LOCK_PI2.
781
782 ENOSYS (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_UN‐
783 LOCK_PI, FUTEX_CMP_REQUEUE_PI, FUTEX_WAIT_REQUEUE_PI) A run-time
784 check determined that the operation is not available. The PI-
785 futex operations are not implemented on all architectures and
786 are not supported on some CPU variants.
787
788 EPERM (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_RE‐
789 QUEUE_PI) The caller is not allowed to attach itself to the fu‐
790 tex at uaddr (for FUTEX_CMP_REQUEUE_PI: the futex at uaddr2).
791 (This may be caused by a state corruption in user space.)
792
793 EPERM (FUTEX_UNLOCK_PI) The caller does not own the lock represented
794 by the futex word.
795
796 ESRCH (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_RE‐
797 QUEUE_PI) The thread ID in the futex word at uaddr does not ex‐
798 ist.
799
800 ESRCH (FUTEX_CMP_REQUEUE_PI) The thread ID in the futex word at uaddr2
801 does not exist.
802
803 ETIMEDOUT
804 The operation in futex_op employed the timeout specified in
805 timeout, and the timeout expired before the operation completed.
806
808 Futexes were first made available in a stable kernel release with Linux
809 2.6.0.
810
811 Initial futex support was merged in Linux 2.5.7 but with different se‐
812 mantics from what was described above. A four-argument system call
813 with the semantics described in this page was introduced in Linux
814 2.5.40. A fifth argument was added in Linux 2.5.70, and a sixth argu‐
815 ment was added in Linux 2.6.7.
816
818 This system call is Linux-specific.
819
821 Several higher-level programming abstractions are implemented via fu‐
822 texes, including POSIX semaphores and various POSIX threads synchro‐
823 nization mechanisms (mutexes, condition variables, read-write locks,
824 and barriers).
825
827 The program below demonstrates use of futexes in a program where a par‐
828 ent process and a child process use a pair of futexes located inside a
829 shared anonymous mapping to synchronize access to a shared resource:
830 the terminal. The two processes each write nloops (a command-line ar‐
831 gument that defaults to 5 if omitted) messages to the terminal and em‐
832 ploy a synchronization protocol that ensures that they alternate in
833 writing messages. Upon running this program we see output such as the
834 following:
835
836 $ ./futex_demo
837 Parent (18534) 0
838 Child (18535) 0
839 Parent (18534) 1
840 Child (18535) 1
841 Parent (18534) 2
842 Child (18535) 2
843 Parent (18534) 3
844 Child (18535) 3
845 Parent (18534) 4
846 Child (18535) 4
847
848 Program source
849
850 /* futex_demo.c
851
852 Usage: futex_demo [nloops]
853 (Default: 5)
854
855 Demonstrate the use of futexes in a program where parent and child
856 use a pair of futexes located inside a shared anonymous mapping to
857 synchronize access to a shared resource: the terminal. The two
858 processes each write 'num-loops' messages to the terminal and employ
859 a synchronization protocol that ensures that they alternate in
860 writing messages.
861 */
862 #define _GNU_SOURCE
863 #include <stdio.h>
864 #include <errno.h>
865 #include <stdatomic.h>
866 #include <stdint.h>
867 #include <stdlib.h>
868 #include <unistd.h>
869 #include <sys/wait.h>
870 #include <sys/mman.h>
871 #include <sys/syscall.h>
872 #include <linux/futex.h>
873 #include <sys/time.h>
874
875 #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
876 } while (0)
877
878 static uint32_t *futex1, *futex2, *iaddr;
879
880 static int
881 futex(uint32_t *uaddr, int futex_op, uint32_t val,
882 const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3)
883 {
884 return syscall(SYS_futex, uaddr, futex_op, val,
885 timeout, uaddr2, val3);
886 }
887
888 /* Acquire the futex pointed to by 'futexp': wait for its value to
889 become 1, and then set the value to 0. */
890
891 static void
892 fwait(uint32_t *futexp)
893 {
894 long s;
895
896 /* atomic_compare_exchange_strong(ptr, oldval, newval)
897 atomically performs the equivalent of:
898
899 if (*ptr == *oldval)
900 *ptr = newval;
901
902 It returns true if the test yielded true and *ptr was updated. */
903
904 while (1) {
905
906 /* Is the futex available? */
907 const uint32_t one = 1;
908 if (atomic_compare_exchange_strong(futexp, &one, 0))
909 break; /* Yes */
910
911 /* Futex is not available; wait. */
912
913 s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0);
914 if (s == -1 && errno != EAGAIN)
915 errExit("futex-FUTEX_WAIT");
916 }
917 }
918
919 /* Release the futex pointed to by 'futexp': if the futex currently
920 has the value 0, set its value to 1 and the wake any futex waiters,
921 so that if the peer is blocked in fwait(), it can proceed. */
922
923 static void
924 fpost(uint32_t *futexp)
925 {
926 long s;
927
928 /* atomic_compare_exchange_strong() was described
929 in comments above. */
930
931 const uint32_t zero = 0;
932 if (atomic_compare_exchange_strong(futexp, &zero, 1)) {
933 s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0);
934 if (s == -1)
935 errExit("futex-FUTEX_WAKE");
936 }
937 }
938
939 int
940 main(int argc, char *argv[])
941 {
942 pid_t childPid;
943 int nloops;
944
945 setbuf(stdout, NULL);
946
947 nloops = (argc > 1) ? atoi(argv[1]) : 5;
948
949 /* Create a shared anonymous mapping that will hold the futexes.
950 Since the futexes are being shared between processes, we
951 subsequently use the "shared" futex operations (i.e., not the
952 ones suffixed "_PRIVATE"). */
953
954 iaddr = mmap(NULL, sizeof(*iaddr) * 2, PROT_READ | PROT_WRITE,
955 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
956 if (iaddr == MAP_FAILED)
957 errExit("mmap");
958
959 futex1 = &iaddr[0];
960 futex2 = &iaddr[1];
961
962 *futex1 = 0; /* State: unavailable */
963 *futex2 = 1; /* State: available */
964
965 /* Create a child process that inherits the shared anonymous
966 mapping. */
967
968 childPid = fork();
969 if (childPid == -1)
970 errExit("fork");
971
972 if (childPid == 0) { /* Child */
973 for (int j = 0; j < nloops; j++) {
974 fwait(futex1);
975 printf("Child (%jd) %d\n", (intmax_t) getpid(), j);
976 fpost(futex2);
977 }
978
979 exit(EXIT_SUCCESS);
980 }
981
982 /* Parent falls through to here. */
983
984 for (int j = 0; j < nloops; j++) {
985 fwait(futex2);
986 printf("Parent (%jd) %d\n", (intmax_t) getpid(), j);
987 fpost(futex1);
988 }
989
990 wait(NULL);
991
992 exit(EXIT_SUCCESS);
993 }
994
996 get_robust_list(2), restart_syscall(2), pthread_mutexattr_getproto‐
997 col(3), futex(7), sched(7)
998
999 The following kernel source files:
1000
1001 * Documentation/pi-futex.txt
1002
1003 * Documentation/futex-requeue-pi.txt
1004
1005 * Documentation/locking/rt-mutex.txt
1006
1007 * Documentation/locking/rt-mutex-design.txt
1008
1009 * Documentation/robust-futex-ABI.txt
1010
1011 Franke, H., Russell, R., and Kirwood, M., 2002. Fuss, Futexes and Fur‐
1012 wocks: Fast Userlevel Locking in Linux (from proceedings of the Ottawa
1013 Linux Symposium 2002),
1014 ⟨http://kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf⟩
1015
1016 Hart, D., 2009. A futex overview and update,
1017 ⟨http://lwn.net/Articles/360699/⟩
1018
1019 Hart, D. and Guniguntala, D., 2009. Requeue-PI: Making Glibc Condvars
1020 PI-Aware (from proceedings of the 2009 Real-Time Linux Workshop),
1021 ⟨http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf⟩
1022
1023 Drepper, U., 2011. Futexes Are Tricky,
1024 ⟨http://www.akkadia.org/drepper/futex.pdf⟩
1025
1026 Futex example library, futex-*.tar.bz2 at
1027 ⟨ftp://ftp.kernel.org/pub/linux/kernel/people/rusty/⟩
1028
1030 This page is part of release 5.13 of the Linux man-pages project. A
1031 description of the project, information about reporting bugs, and the
1032 latest version of this page, can be found at
1033 https://www.kernel.org/doc/man-pages/.
1034
1035
1036
1037Linux 2021-08-27 FUTEX(2)