1futex(2) System Calls Manual futex(2)
2
3
4
6 futex - fast user-space locking
7
9 Standard C library (libc, -lc)
10
12 #include <linux/futex.h> /* Definition of FUTEX_* constants */
13 #include <sys/syscall.h> /* Definition of SYS_* constants */
14 #include <unistd.h>
15
16 long syscall(SYS_futex, uint32_t *uaddr, int futex_op, uint32_t val,
17 const struct timespec *timeout, /* or: uint32_t val2 */
18 uint32_t *uaddr2, uint32_t val3);
19
20 Note: glibc provides no wrapper for futex(), necessitating the use of
21 syscall(2).
22
24 The futex() system call provides a method for waiting until a certain
25 condition becomes true. It is typically used as a blocking construct
26 in the context of shared-memory synchronization. When using futexes,
27 the majority of the synchronization operations are performed in user
28 space. A user-space program employs the futex() system call only when
29 it is likely that the program has to block for a longer time until the
30 condition becomes true. Other futex() operations can be used to wake
31 any processes or threads waiting for a particular condition.
32
33 A futex is a 32-bit value—referred to below as a futex word—whose ad‐
34 dress is supplied to the futex() system call. (Futexes are 32 bits in
35 size on all platforms, including 64-bit systems.) All futex operations
36 are governed by this value. In order to share a futex between pro‐
37 cesses, the futex is placed in a region of shared memory, created using
38 (for example) mmap(2) or shmat(2). (Thus, the futex word may have dif‐
39 ferent virtual addresses in different processes, but these addresses
40 all refer to the same location in physical memory.) In a multithreaded
41 program, it is sufficient to place the futex word in a global variable
42 shared by all threads.
43
44 When executing a futex operation that requests to block a thread, the
45 kernel will block only if the futex word has the value that the calling
46 thread supplied (as one of the arguments of the futex() call) as the
47 expected value of the futex word. The loading of the futex word's
48 value, the comparison of that value with the expected value, and the
49 actual blocking will happen atomically and will be totally ordered with
50 respect to concurrent operations performed by other threads on the same
51 futex word. Thus, the futex word is used to connect the synchroniza‐
52 tion in user space with the implementation of blocking by the kernel.
53 Analogously to an atomic compare-and-exchange operation that poten‐
54 tially changes shared memory, blocking via a futex is an atomic com‐
55 pare-and-block operation.
56
57 One use of futexes is for implementing locks. The state of the lock
58 (i.e., acquired or not acquired) can be represented as an atomically
59 accessed flag in shared memory. In the uncontended case, a thread can
60 access or modify the lock state with atomic instructions, for example
61 atomically changing it from not acquired to acquired using an atomic
62 compare-and-exchange instruction. (Such instructions are performed en‐
63 tirely in user mode, and the kernel maintains no information about the
64 lock state.) On the other hand, a thread may be unable to acquire a
65 lock because it is already acquired by another thread. It then may
66 pass the lock's flag as a futex word and the value representing the ac‐
67 quired state as the expected value to a futex() wait operation. This
68 futex() operation will block if and only if the lock is still acquired
69 (i.e., the value in the futex word still matches the "acquired state").
70 When releasing the lock, a thread has to first reset the lock state to
71 not acquired and then execute a futex operation that wakes threads
72 blocked on the lock flag used as a futex word (this can be further op‐
73 timized to avoid unnecessary wake-ups). See futex(7) for more detail
74 on how to use futexes.
75
76 Besides the basic wait and wake-up futex functionality, there are fur‐
77 ther futex operations aimed at supporting more complex use cases.
78
79 Note that no explicit initialization or destruction is necessary to use
80 futexes; the kernel maintains a futex (i.e., the kernel-internal imple‐
81 mentation artifact) only while operations such as FUTEX_WAIT, described
82 below, are being performed on a particular futex word.
83
84 Arguments
85 The uaddr argument points to the futex word. On all platforms, futexes
86 are four-byte integers that must be aligned on a four-byte boundary.
87 The operation to perform on the futex is specified in the futex_op ar‐
88 gument; val is a value whose meaning and purpose depends on futex_op.
89
90 The remaining arguments (timeout, uaddr2, and val3) are required only
91 for certain of the futex operations described below. Where one of
92 these arguments is not required, it is ignored.
93
94 For several blocking operations, the timeout argument is a pointer to a
95 timespec structure that specifies a timeout for the operation. How‐
96 ever, notwithstanding the prototype shown above, for some operations,
97 the least significant four bytes of this argument are instead used as
98 an integer whose meaning is determined by the operation. For these op‐
99 erations, the kernel casts the timeout value first to unsigned long,
100 then to uint32_t, and in the remainder of this page, this argument is
101 referred to as val2 when interpreted in this fashion.
102
103 Where it is required, the uaddr2 argument is a pointer to a second fu‐
104 tex word that is employed by the operation.
105
106 The interpretation of the final integer argument, val3, depends on the
107 operation.
108
109 Futex operations
110 The futex_op argument consists of two parts: a command that specifies
111 the operation to be performed, bitwise ORed with zero or more options
112 that modify the behaviour of the operation. The options that may be
113 included in futex_op are as follows:
114
115 FUTEX_PRIVATE_FLAG (since Linux 2.6.22)
116 This option bit can be employed with all futex operations. It
117 tells the kernel that the futex is process-private and not
118 shared with another process (i.e., it is being used for synchro‐
119 nization only between threads of the same process). This allows
120 the kernel to make some additional performance optimizations.
121
122 As a convenience, <linux/futex.h> defines a set of constants
123 with the suffix _PRIVATE that are equivalents of all of the op‐
124 erations listed below, but with the FUTEX_PRIVATE_FLAG ORed into
125 the constant value. Thus, there are FUTEX_WAIT_PRIVATE, FU‐
126 TEX_WAKE_PRIVATE, and so on.
127
128 FUTEX_CLOCK_REALTIME (since Linux 2.6.28)
129 This option bit can be employed only with the FUTEX_WAIT_BITSET,
130 FUTEX_WAIT_REQUEUE_PI, (since Linux 4.5) FUTEX_WAIT, and (since
131 Linux 5.14) FUTEX_LOCK_PI2 operations.
132
133 If this option is set, the kernel measures the timeout against
134 the CLOCK_REALTIME clock.
135
136 If this option is not set, the kernel measures the timeout
137 against the CLOCK_MONOTONIC clock.
138
139 The operation specified in futex_op is one of the following:
140
141 FUTEX_WAIT (since Linux 2.6.0)
142 This operation tests that the value at the futex word pointed to
143 by the address uaddr still contains the expected value val, and
144 if so, then sleeps waiting for a FUTEX_WAKE operation on the fu‐
145 tex word. The load of the value of the futex word is an atomic
146 memory access (i.e., using atomic machine instructions of the
147 respective architecture). This load, the comparison with the
148 expected value, and starting to sleep are performed atomically
149 and totally ordered with respect to other futex operations on
150 the same futex word. If the thread starts to sleep, it is con‐
151 sidered a waiter on this futex word. If the futex value does
152 not match val, then the call fails immediately with the error
153 EAGAIN.
154
155 The purpose of the comparison with the expected value is to pre‐
156 vent lost wake-ups. If another thread changed the value of the
157 futex word after the calling thread decided to block based on
158 the prior value, and if the other thread executed a FUTEX_WAKE
159 operation (or similar wake-up) after the value change and before
160 this FUTEX_WAIT operation, then the calling thread will observe
161 the value change and will not start to sleep.
162
163 If the timeout is not NULL, the structure it points to specifies
164 a timeout for the wait. (This interval will be rounded up to
165 the system clock granularity, and is guaranteed not to expire
166 early.) The timeout is by default measured according to the
167 CLOCK_MONOTONIC clock, but, since Linux 4.5, the CLOCK_REALTIME
168 clock can be selected by specifying FUTEX_CLOCK_REALTIME in fu‐
169 tex_op. If timeout is NULL, the call blocks indefinitely.
170
171 Note: for FUTEX_WAIT, timeout is interpreted as a relative
172 value. This differs from other futex operations, where timeout
173 is interpreted as an absolute value. To obtain the equivalent
174 of FUTEX_WAIT with an absolute timeout, employ FUTEX_WAIT_BITSET
175 with val3 specified as FUTEX_BITSET_MATCH_ANY.
176
177 The arguments uaddr2 and val3 are ignored.
178
179 FUTEX_WAKE (since Linux 2.6.0)
180 This operation wakes at most val of the waiters that are waiting
181 (e.g., inside FUTEX_WAIT) on the futex word at the address
182 uaddr. Most commonly, val is specified as either 1 (wake up a
183 single waiter) or INT_MAX (wake up all waiters). No guarantee
184 is provided about which waiters are awoken (e.g., a waiter with
185 a higher scheduling priority is not guaranteed to be awoken in
186 preference to a waiter with a lower priority).
187
188 The arguments timeout, uaddr2, and val3 are ignored.
189
190 FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25)
191 This operation creates a file descriptor that is associated with
192 the futex at uaddr. The caller must close the returned file de‐
193 scriptor after use. When another process or thread performs a
194 FUTEX_WAKE on the futex word, the file descriptor indicates as
195 being readable with select(2), poll(2), and epoll(7)
196
197 The file descriptor can be used to obtain asynchronous notifica‐
198 tions: if val is nonzero, then, when another process or thread
199 executes a FUTEX_WAKE, the caller will receive the signal number
200 that was passed in val.
201
202 The arguments timeout, uaddr2, and val3 are ignored.
203
204 Because it was inherently racy, FUTEX_FD has been removed from
205 Linux 2.6.26 onward.
206
207 FUTEX_REQUEUE (since Linux 2.6.0)
208 This operation performs the same task as FUTEX_CMP_REQUEUE (see
209 below), except that no check is made using the value in val3.
210 (The argument val3 is ignored.)
211
212 FUTEX_CMP_REQUEUE (since Linux 2.6.7)
213 This operation first checks whether the location uaddr still
214 contains the value val3. If not, the operation fails with the
215 error EAGAIN. Otherwise, the operation wakes up a maximum of
216 val waiters that are waiting on the futex at uaddr. If there
217 are more than val waiters, then the remaining waiters are re‐
218 moved from the wait queue of the source futex at uaddr and added
219 to the wait queue of the target futex at uaddr2. The val2 argu‐
220 ment specifies an upper limit on the number of waiters that are
221 requeued to the futex at uaddr2.
222
223 The load from uaddr is an atomic memory access (i.e., using
224 atomic machine instructions of the respective architecture).
225 This load, the comparison with val3, and the requeueing of any
226 waiters are performed atomically and totally ordered with re‐
227 spect to other operations on the same futex word.
228
229 Typical values to specify for val are 0 or 1. (Specifying
230 INT_MAX is not useful, because it would make the FUTEX_CMP_RE‐
231 QUEUE operation equivalent to FUTEX_WAKE.) The limit value
232 specified via val2 is typically either 1 or INT_MAX. (Specify‐
233 ing the argument as 0 is not useful, because it would make the
234 FUTEX_CMP_REQUEUE operation equivalent to FUTEX_WAIT.)
235
236 The FUTEX_CMP_REQUEUE operation was added as a replacement for
237 the earlier FUTEX_REQUEUE. The difference is that the check of
238 the value at uaddr can be used to ensure that requeueing happens
239 only under certain conditions, which allows race conditions to
240 be avoided in certain use cases.
241
242 Both FUTEX_REQUEUE and FUTEX_CMP_REQUEUE can be used to avoid
243 "thundering herd" wake-ups that could occur when using FU‐
244 TEX_WAKE in cases where all of the waiters that are woken need
245 to acquire another futex. Consider the following scenario,
246 where multiple waiter threads are waiting on B, a wait queue im‐
247 plemented using a futex:
248
249 lock(A)
250 while (!check_value(V)) {
251 unlock(A);
252 block_on(B);
253 lock(A);
254 };
255 unlock(A);
256
257 If a waker thread used FUTEX_WAKE, then all waiters waiting on B
258 would be woken up, and they would all try to acquire lock A.
259 However, waking all of the threads in this manner would be
260 pointless because all except one of the threads would immedi‐
261 ately block on lock A again. By contrast, a requeue operation
262 wakes just one waiter and moves the other waiters to lock A, and
263 when the woken waiter unlocks A then the next waiter can pro‐
264 ceed.
265
266 FUTEX_WAKE_OP (since Linux 2.6.14)
267 This operation was added to support some user-space use cases
268 where more than one futex must be handled at the same time. The
269 most notable example is the implementation of pthread_cond_sig‐
270 nal(3), which requires operations on two futexes, the one used
271 to implement the mutex and the one used in the implementation of
272 the wait queue associated with the condition variable. FU‐
273 TEX_WAKE_OP allows such cases to be implemented without leading
274 to high rates of contention and context switching.
275
276 The FUTEX_WAKE_OP operation is equivalent to executing the fol‐
277 lowing code atomically and totally ordered with respect to other
278 futex operations on any of the two supplied futex words:
279
280 uint32_t oldval = *(uint32_t *) uaddr2;
281 *(uint32_t *) uaddr2 = oldval op oparg;
282 futex(uaddr, FUTEX_WAKE, val, 0, 0, 0);
283 if (oldval cmp cmparg)
284 futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0);
285
286 In other words, FUTEX_WAKE_OP does the following:
287
288 • saves the original value of the futex word at uaddr2 and per‐
289 forms an operation to modify the value of the futex at
290 uaddr2; this is an atomic read-modify-write memory access
291 (i.e., using atomic machine instructions of the respective
292 architecture)
293
294 • wakes up a maximum of val waiters on the futex for the futex
295 word at uaddr; and
296
297 • dependent on the results of a test of the original value of
298 the futex word at uaddr2, wakes up a maximum of val2 waiters
299 on the futex for the futex word at uaddr2.
300
301 The operation and comparison that are to be performed are en‐
302 coded in the bits of the argument val3. Pictorially, the encod‐
303 ing is:
304
305 +---+---+-----------+-----------+
306 |op |cmp| oparg | cmparg |
307 +---+---+-----------+-----------+
308 4 4 12 12 <== # of bits
309
310 Expressed in code, the encoding is:
311
312 #define FUTEX_OP(op, oparg, cmp, cmparg) \
313 (((op & 0xf) << 28) | \
314 ((cmp & 0xf) << 24) | \
315 ((oparg & 0xfff) << 12) | \
316 (cmparg & 0xfff))
317
318 In the above, op and cmp are each one of the codes listed below.
319 The oparg and cmparg components are literal numeric values, ex‐
320 cept as noted below.
321
322 The op component has one of the following values:
323
324 FUTEX_OP_SET 0 /* uaddr2 = oparg; */
325 FUTEX_OP_ADD 1 /* uaddr2 += oparg; */
326 FUTEX_OP_OR 2 /* uaddr2 |= oparg; */
327 FUTEX_OP_ANDN 3 /* uaddr2 &= ~oparg; */
328 FUTEX_OP_XOR 4 /* uaddr2 ^= oparg; */
329
330 In addition, bitwise ORing the following value into op causes
331 (1 << oparg) to be used as the operand:
332
333 FUTEX_OP_ARG_SHIFT 8 /* Use (1 << oparg) as operand */
334
335 The cmp field is one of the following:
336
337 FUTEX_OP_CMP_EQ 0 /* if (oldval == cmparg) wake */
338 FUTEX_OP_CMP_NE 1 /* if (oldval != cmparg) wake */
339 FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake */
340 FUTEX_OP_CMP_LE 3 /* if (oldval <= cmparg) wake */
341 FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake */
342 FUTEX_OP_CMP_GE 5 /* if (oldval >= cmparg) wake */
343
344 The return value of FUTEX_WAKE_OP is the sum of the number of
345 waiters woken on the futex uaddr plus the number of waiters wo‐
346 ken on the futex uaddr2.
347
348 FUTEX_WAIT_BITSET (since Linux 2.6.25)
349 This operation is like FUTEX_WAIT except that val3 is used to
350 provide a 32-bit bit mask to the kernel. This bit mask, in
351 which at least one bit must be set, is stored in the kernel-in‐
352 ternal state of the waiter. See the description of FU‐
353 TEX_WAKE_BITSET for further details.
354
355 If timeout is not NULL, the structure it points to specifies an
356 absolute timeout for the wait operation. If timeout is NULL,
357 the operation can block indefinitely.
358
359 The uaddr2 argument is ignored.
360
361 FUTEX_WAKE_BITSET (since Linux 2.6.25)
362 This operation is the same as FUTEX_WAKE except that the val3
363 argument is used to provide a 32-bit bit mask to the kernel.
364 This bit mask, in which at least one bit must be set, is used to
365 select which waiters should be woken up. The selection is done
366 by a bitwise AND of the "wake" bit mask (i.e., the value in
367 val3) and the bit mask which is stored in the kernel-internal
368 state of the waiter (the "wait" bit mask that is set using FU‐
369 TEX_WAIT_BITSET). All of the waiters for which the result of
370 the AND is nonzero are woken up; the remaining waiters are left
371 sleeping.
372
373 The effect of FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET is to al‐
374 low selective wake-ups among multiple waiters that are blocked
375 on the same futex. However, note that, depending on the use
376 case, employing this bit-mask multiplexing feature on a futex
377 can be less efficient than simply using multiple futexes, be‐
378 cause employing bit-mask multiplexing requires the kernel to
379 check all waiters on a futex, including those that are not in‐
380 terested in being woken up (i.e., they do not have the relevant
381 bit set in their "wait" bit mask).
382
383 The constant FUTEX_BITSET_MATCH_ANY, which corresponds to all 32
384 bits set in the bit mask, can be used as the val3 argument for
385 FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET. Other than differences
386 in the handling of the timeout argument, the FUTEX_WAIT opera‐
387 tion is equivalent to FUTEX_WAIT_BITSET with val3 specified as
388 FUTEX_BITSET_MATCH_ANY; that is, allow a wake-up by any waker.
389 The FUTEX_WAKE operation is equivalent to FUTEX_WAKE_BITSET with
390 val3 specified as FUTEX_BITSET_MATCH_ANY; that is, wake up any
391 waiter(s).
392
393 The uaddr2 and timeout arguments are ignored.
394
395 Priority-inheritance futexes
396 Linux supports priority-inheritance (PI) futexes in order to handle
397 priority-inversion problems that can be encountered with normal futex
398 locks. Priority inversion is the problem that occurs when a high-pri‐
399 ority task is blocked waiting to acquire a lock held by a low-priority
400 task, while tasks at an intermediate priority continuously preempt the
401 low-priority task from the CPU. Consequently, the low-priority task
402 makes no progress toward releasing the lock, and the high-priority task
403 remains blocked.
404
405 Priority inheritance is a mechanism for dealing with the priority-in‐
406 version problem. With this mechanism, when a high-priority task be‐
407 comes blocked by a lock held by a low-priority task, the priority of
408 the low-priority task is temporarily raised to that of the high-prior‐
409 ity task, so that it is not preempted by any intermediate level tasks,
410 and can thus make progress toward releasing the lock. To be effective,
411 priority inheritance must be transitive, meaning that if a high-prior‐
412 ity task blocks on a lock held by a lower-priority task that is itself
413 blocked by a lock held by another intermediate-priority task (and so
414 on, for chains of arbitrary length), then both of those tasks (or more
415 generally, all of the tasks in a lock chain) have their priorities
416 raised to be the same as the high-priority task.
417
418 From a user-space perspective, what makes a futex PI-aware is a policy
419 agreement (described below) between user space and the kernel about the
420 value of the futex word, coupled with the use of the PI-futex opera‐
421 tions described below. (Unlike the other futex operations described
422 above, the PI-futex operations are designed for the implementation of
423 very specific IPC mechanisms.)
424
425 The PI-futex operations described below differ from the other futex op‐
426 erations in that they impose policy on the use of the value of the fu‐
427 tex word:
428
429 • If the lock is not acquired, the futex word's value shall be 0.
430
431 • If the lock is acquired, the futex word's value shall be the thread
432 ID (TID; see gettid(2)) of the owning thread.
433
434 • If the lock is owned and there are threads contending for the lock,
435 then the FUTEX_WAITERS bit shall be set in the futex word's value;
436 in other words, this value is:
437
438 FUTEX_WAITERS | TID
439
440 (Note that is invalid for a PI futex word to have no owner and FU‐
441 TEX_WAITERS set.)
442
443 With this policy in place, a user-space application can acquire an un‐
444 acquired lock or release a lock using atomic instructions executed in
445 user mode (e.g., a compare-and-swap operation such as cmpxchg on the
446 x86 architecture). Acquiring a lock simply consists of using compare-
447 and-swap to atomically set the futex word's value to the caller's TID
448 if its previous value was 0. Releasing a lock requires using compare-
449 and-swap to set the futex word's value to 0 if the previous value was
450 the expected TID.
451
452 If a futex is already acquired (i.e., has a nonzero value), waiters
453 must employ the FUTEX_LOCK_PI or FUTEX_LOCK_PI2 operations to acquire
454 the lock. If other threads are waiting for the lock, then the FU‐
455 TEX_WAITERS bit is set in the futex value; in this case, the lock owner
456 must employ the FUTEX_UNLOCK_PI operation to release the lock.
457
458 In the cases where callers are forced into the kernel (i.e., required
459 to perform a futex() call), they then deal directly with a so-called
460 RT-mutex, a kernel locking mechanism which implements the required pri‐
461 ority-inheritance semantics. After the RT-mutex is acquired, the futex
462 value is updated accordingly, before the calling thread returns to user
463 space.
464
465 It is important to note that the kernel will update the futex word's
466 value prior to returning to user space. (This prevents the possibility
467 of the futex word's value ending up in an invalid state, such as having
468 an owner but the value being 0, or having waiters but not having the
469 FUTEX_WAITERS bit set.)
470
471 If a futex has an associated RT-mutex in the kernel (i.e., there are
472 blocked waiters) and the owner of the futex/RT-mutex dies unexpectedly,
473 then the kernel cleans up the RT-mutex and hands it over to the next
474 waiter. This in turn requires that the user-space value is updated ac‐
475 cordingly. To indicate that this is required, the kernel sets the FU‐
476 TEX_OWNER_DIED bit in the futex word along with the thread ID of the
477 new owner. User space can detect this situation via the presence of
478 the FUTEX_OWNER_DIED bit and is then responsible for cleaning up the
479 stale state left over by the dead owner.
480
481 PI futexes are operated on by specifying one of the values listed below
482 in futex_op. Note that the PI futex operations must be used as paired
483 operations and are subject to some additional requirements:
484
485 • FUTEX_LOCK_PI, FUTEX_LOCK_PI2, and FUTEX_TRYLOCK_PI pair with FU‐
486 TEX_UNLOCK_PI. FUTEX_UNLOCK_PI must be called only on a futex owned
487 by the calling thread, as defined by the value policy, otherwise the
488 error EPERM results.
489
490 • FUTEX_WAIT_REQUEUE_PI pairs with FUTEX_CMP_REQUEUE_PI. This must be
491 performed from a non-PI futex to a distinct PI futex (or the error
492 EINVAL results). Additionally, val (the number of waiters to be wo‐
493 ken) must be 1 (or the error EINVAL results).
494
495 The PI futex operations are as follows:
496
497 FUTEX_LOCK_PI (since Linux 2.6.18)
498 This operation is used after an attempt to acquire the lock via
499 an atomic user-mode instruction failed because the futex word
500 has a nonzero value—specifically, because it contained the (PID-
501 namespace-specific) TID of the lock owner.
502
503 The operation checks the value of the futex word at the address
504 uaddr. If the value is 0, then the kernel tries to atomically
505 set the futex value to the caller's TID. If the futex word's
506 value is nonzero, the kernel atomically sets the FUTEX_WAITERS
507 bit, which signals the futex owner that it cannot unlock the fu‐
508 tex in user space atomically by setting the futex value to 0.
509 After that, the kernel:
510
511 (1) Tries to find the thread which is associated with the owner
512 TID.
513
514 (2) Creates or reuses kernel state on behalf of the owner. (If
515 this is the first waiter, there is no kernel state for this
516 futex, so kernel state is created by locking the RT-mutex
517 and the futex owner is made the owner of the RT-mutex. If
518 there are existing waiters, then the existing state is
519 reused.)
520
521 (3) Attaches the waiter to the futex (i.e., the waiter is en‐
522 queued on the RT-mutex waiter list).
523
524 If more than one waiter exists, the enqueueing of the waiter is
525 in descending priority order. (For information on priority or‐
526 dering, see the discussion of the SCHED_DEADLINE, SCHED_FIFO,
527 and SCHED_RR scheduling policies in sched(7).) The owner inher‐
528 its either the waiter's CPU bandwidth (if the waiter is sched‐
529 uled under the SCHED_DEADLINE policy) or the waiter's priority
530 (if the waiter is scheduled under the SCHED_RR or SCHED_FIFO
531 policy). This inheritance follows the lock chain in the case of
532 nested locking and performs deadlock detection.
533
534 The timeout argument provides a timeout for the lock attempt.
535 If timeout is not NULL, the structure it points to specifies an
536 absolute timeout, measured against the CLOCK_REALTIME clock. If
537 timeout is NULL, the operation will block indefinitely.
538
539 The uaddr2, val, and val3 arguments are ignored.
540
541 FUTEX_LOCK_PI2 (since Linux 5.14)
542 This operation is the same as FUTEX_LOCK_PI, except that the
543 clock against which timeout is measured is selectable. By de‐
544 fault, the (absolute) timeout specified in timeout is measured
545 against the CLOCK_MONOTONIC clock, but if the FUTEX_CLOCK_REAL‐
546 TIME flag is specified in futex_op, then the timeout is measured
547 against the CLOCK_REALTIME clock.
548
549 FUTEX_TRYLOCK_PI (since Linux 2.6.18)
550 This operation tries to acquire the lock at uaddr. It is in‐
551 voked when a user-space atomic acquire did not succeed because
552 the futex word was not 0.
553
554 Because the kernel has access to more state information than
555 user space, acquisition of the lock might succeed if performed
556 by the kernel in cases where the futex word (i.e., the state in‐
557 formation accessible to use-space) contains stale state (FU‐
558 TEX_WAITERS and/or FUTEX_OWNER_DIED). This can happen when the
559 owner of the futex died. User space cannot handle this condi‐
560 tion in a race-free manner, but the kernel can fix this up and
561 acquire the futex.
562
563 The uaddr2, val, timeout, and val3 arguments are ignored.
564
565 FUTEX_UNLOCK_PI (since Linux 2.6.18)
566 This operation wakes the top priority waiter that is waiting in
567 FUTEX_LOCK_PI or FUTEX_LOCK_PI2 on the futex address provided by
568 the uaddr argument.
569
570 This is called when the user-space value at uaddr cannot be
571 changed atomically from a TID (of the owner) to 0.
572
573 The uaddr2, val, timeout, and val3 arguments are ignored.
574
575 FUTEX_CMP_REQUEUE_PI (since Linux 2.6.31)
576 This operation is a PI-aware variant of FUTEX_CMP_REQUEUE. It
577 requeues waiters that are blocked via FUTEX_WAIT_REQUEUE_PI on
578 uaddr from a non-PI source futex (uaddr) to a PI target futex
579 (uaddr2).
580
581 As with FUTEX_CMP_REQUEUE, this operation wakes up a maximum of
582 val waiters that are waiting on the futex at uaddr. However,
583 for FUTEX_CMP_REQUEUE_PI, val is required to be 1 (since the
584 main point is to avoid a thundering herd). The remaining wait‐
585 ers are removed from the wait queue of the source futex at uaddr
586 and added to the wait queue of the target futex at uaddr2.
587
588 The val2 and val3 arguments serve the same purposes as for FU‐
589 TEX_CMP_REQUEUE.
590
591 FUTEX_WAIT_REQUEUE_PI (since Linux 2.6.31)
592 Wait on a non-PI futex at uaddr and potentially be requeued (via
593 a FUTEX_CMP_REQUEUE_PI operation in another task) onto a PI fu‐
594 tex at uaddr2. The wait operation on uaddr is the same as for
595 FUTEX_WAIT.
596
597 The waiter can be removed from the wait on uaddr without re‐
598 queueing on uaddr2 via a FUTEX_WAKE operation in another task.
599 In this case, the FUTEX_WAIT_REQUEUE_PI operation fails with the
600 error EAGAIN.
601
602 If timeout is not NULL, the structure it points to specifies an
603 absolute timeout for the wait operation. If timeout is NULL,
604 the operation can block indefinitely.
605
606 The val3 argument is ignored.
607
608 The FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI were added to
609 support a fairly specific use case: support for priority-inheri‐
610 tance-aware POSIX threads condition variables. The idea is that
611 these operations should always be paired, in order to ensure
612 that user space and the kernel remain in sync. Thus, in the FU‐
613 TEX_WAIT_REQUEUE_PI operation, the user-space application pre-
614 specifies the target of the requeue that takes place in the FU‐
615 TEX_CMP_REQUEUE_PI operation.
616
618 In the event of an error (and assuming that futex() was invoked via
619 syscall(2)), all operations return -1 and set errno to indicate the er‐
620 ror.
621
622 The return value on success depends on the operation, as described in
623 the following list:
624
625 FUTEX_WAIT
626 Returns 0 if the caller was woken up. Note that a wake-up can
627 also be caused by common futex usage patterns in unrelated code
628 that happened to have previously used the futex word's memory
629 location (e.g., typical futex-based implementations of Pthreads
630 mutexes can cause this under some conditions). Therefore, call‐
631 ers should always conservatively assume that a return value of 0
632 can mean a spurious wake-up, and use the futex word's value
633 (i.e., the user-space synchronization scheme) to decide whether
634 to continue to block or not.
635
636 FUTEX_WAKE
637 Returns the number of waiters that were woken up.
638
639 FUTEX_FD
640 Returns the new file descriptor associated with the futex.
641
642 FUTEX_REQUEUE
643 Returns the number of waiters that were woken up.
644
645 FUTEX_CMP_REQUEUE
646 Returns the total number of waiters that were woken up or re‐
647 queued to the futex for the futex word at uaddr2. If this value
648 is greater than val, then the difference is the number of wait‐
649 ers requeued to the futex for the futex word at uaddr2.
650
651 FUTEX_WAKE_OP
652 Returns the total number of waiters that were woken up. This is
653 the sum of the woken waiters on the two futexes for the futex
654 words at uaddr and uaddr2.
655
656 FUTEX_WAIT_BITSET
657 Returns 0 if the caller was woken up. See FUTEX_WAIT for how to
658 interpret this correctly in practice.
659
660 FUTEX_WAKE_BITSET
661 Returns the number of waiters that were woken up.
662
663 FUTEX_LOCK_PI
664 Returns 0 if the futex was successfully locked.
665
666 FUTEX_LOCK_PI2
667 Returns 0 if the futex was successfully locked.
668
669 FUTEX_TRYLOCK_PI
670 Returns 0 if the futex was successfully locked.
671
672 FUTEX_UNLOCK_PI
673 Returns 0 if the futex was successfully unlocked.
674
675 FUTEX_CMP_REQUEUE_PI
676 Returns the total number of waiters that were woken up or re‐
677 queued to the futex for the futex word at uaddr2. If this value
678 is greater than val, then difference is the number of waiters
679 requeued to the futex for the futex word at uaddr2.
680
681 FUTEX_WAIT_REQUEUE_PI
682 Returns 0 if the caller was successfully requeued to the futex
683 for the futex word at uaddr2.
684
686 EACCES No read access to the memory of a futex word.
687
688 EAGAIN (FUTEX_WAIT, FUTEX_WAIT_BITSET, FUTEX_WAIT_REQUEUE_PI) The value
689 pointed to by uaddr was not equal to the expected value val at
690 the time of the call.
691
692 Note: on Linux, the symbolic names EAGAIN and EWOULDBLOCK (both
693 of which appear in different parts of the kernel futex code)
694 have the same value.
695
696 EAGAIN (FUTEX_CMP_REQUEUE, FUTEX_CMP_REQUEUE_PI) The value pointed to
697 by uaddr is not equal to the expected value val3.
698
699 EAGAIN (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_RE‐
700 QUEUE_PI) The futex owner thread ID of uaddr (for FUTEX_CMP_RE‐
701 QUEUE_PI: uaddr2) is about to exit, but has not yet handled the
702 internal state cleanup. Try again.
703
704 EDEADLK
705 (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_RE‐
706 QUEUE_PI) The futex word at uaddr is already locked by the
707 caller.
708
709 EDEADLK
710 (FUTEX_CMP_REQUEUE_PI) While requeueing a waiter to the PI futex
711 for the futex word at uaddr2, the kernel detected a deadlock.
712
713 EFAULT A required pointer argument (i.e., uaddr, uaddr2, or timeout)
714 did not point to a valid user-space address.
715
716 EINTR A FUTEX_WAIT or FUTEX_WAIT_BITSET operation was interrupted by a
717 signal (see signal(7)). Before Linux 2.6.22, this error could
718 also be returned for a spurious wakeup; since Linux 2.6.22, this
719 no longer happens.
720
721 EINVAL The operation in futex_op is one of those that employs a time‐
722 out, but the supplied timeout argument was invalid (tv_sec was
723 less than zero, or tv_nsec was not less than 1,000,000,000).
724
725 EINVAL The operation specified in futex_op employs one or both of the
726 pointers uaddr and uaddr2, but one of these does not point to a
727 valid object—that is, the address is not four-byte-aligned.
728
729 EINVAL (FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET) The bit mask supplied in
730 val3 is zero.
731
732 EINVAL (FUTEX_CMP_REQUEUE_PI) uaddr equals uaddr2 (i.e., an attempt was
733 made to requeue to the same futex).
734
735 EINVAL (FUTEX_FD) The signal number supplied in val is invalid.
736
737 EINVAL (FUTEX_WAKE, FUTEX_WAKE_OP, FUTEX_WAKE_BITSET, FUTEX_REQUEUE,
738 FUTEX_CMP_REQUEUE) The kernel detected an inconsistency between
739 the user-space state at uaddr and the kernel state—that is, it
740 detected a waiter which waits in FUTEX_LOCK_PI or FUTEX_LOCK_PI2
741 on uaddr.
742
743 EINVAL (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_UN‐
744 LOCK_PI) The kernel detected an inconsistency between the user-
745 space state at uaddr and the kernel state. This indicates ei‐
746 ther state corruption or that the kernel found a waiter on uaddr
747 which is waiting via FUTEX_WAIT or FUTEX_WAIT_BITSET.
748
749 EINVAL (FUTEX_CMP_REQUEUE_PI) The kernel detected an inconsistency be‐
750 tween the user-space state at uaddr2 and the kernel state; that
751 is, the kernel detected a waiter which waits via FUTEX_WAIT or
752 FUTEX_WAIT_BITSET on uaddr2.
753
754 EINVAL (FUTEX_CMP_REQUEUE_PI) The kernel detected an inconsistency be‐
755 tween the user-space state at uaddr and the kernel state; that
756 is, the kernel detected a waiter which waits via FUTEX_WAIT or
757 FUTEX_WAIT_BITSET on uaddr.
758
759 EINVAL (FUTEX_CMP_REQUEUE_PI) The kernel detected an inconsistency be‐
760 tween the user-space state at uaddr and the kernel state; that
761 is, the kernel detected a waiter which waits on uaddr via FU‐
762 TEX_LOCK_PI or FUTEX_LOCK_PI2 (instead of FUTEX_WAIT_RE‐
763 QUEUE_PI).
764
765 EINVAL (FUTEX_CMP_REQUEUE_PI) An attempt was made to requeue a waiter
766 to a futex other than that specified by the matching FU‐
767 TEX_WAIT_REQUEUE_PI call for that waiter.
768
769 EINVAL (FUTEX_CMP_REQUEUE_PI) The val argument is not 1.
770
771 EINVAL Invalid argument.
772
773 ENFILE (FUTEX_FD) The system-wide limit on the total number of open
774 files has been reached.
775
776 ENOMEM (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_RE‐
777 QUEUE_PI) The kernel could not allocate memory to hold state in‐
778 formation.
779
780 ENOSYS Invalid operation specified in futex_op.
781
782 ENOSYS The FUTEX_CLOCK_REALTIME option was specified in futex_op, but
783 the accompanying operation was neither FUTEX_WAIT, FU‐
784 TEX_WAIT_BITSET, FUTEX_WAIT_REQUEUE_PI, nor FUTEX_LOCK_PI2.
785
786 ENOSYS (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_UN‐
787 LOCK_PI, FUTEX_CMP_REQUEUE_PI, FUTEX_WAIT_REQUEUE_PI) A run-time
788 check determined that the operation is not available. The PI-
789 futex operations are not implemented on all architectures and
790 are not supported on some CPU variants.
791
792 EPERM (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_RE‐
793 QUEUE_PI) The caller is not allowed to attach itself to the fu‐
794 tex at uaddr (for FUTEX_CMP_REQUEUE_PI: the futex at uaddr2).
795 (This may be caused by a state corruption in user space.)
796
797 EPERM (FUTEX_UNLOCK_PI) The caller does not own the lock represented
798 by the futex word.
799
800 ESRCH (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_RE‐
801 QUEUE_PI) The thread ID in the futex word at uaddr does not ex‐
802 ist.
803
804 ESRCH (FUTEX_CMP_REQUEUE_PI) The thread ID in the futex word at uaddr2
805 does not exist.
806
807 ETIMEDOUT
808 The operation in futex_op employed the timeout specified in
809 timeout, and the timeout expired before the operation completed.
810
812 Linux.
813
815 Linux 2.6.0.
816
817 Initial futex support was merged in Linux 2.5.7 but with different se‐
818 mantics from what was described above. A four-argument system call
819 with the semantics described in this page was introduced in Linux
820 2.5.40. A fifth argument was added in Linux 2.5.70, and a sixth argu‐
821 ment was added in Linux 2.6.7.
822
824 The program below demonstrates use of futexes in a program where a par‐
825 ent process and a child process use a pair of futexes located inside a
826 shared anonymous mapping to synchronize access to a shared resource:
827 the terminal. The two processes each write nloops (a command-line ar‐
828 gument that defaults to 5 if omitted) messages to the terminal and em‐
829 ploy a synchronization protocol that ensures that they alternate in
830 writing messages. Upon running this program we see output such as the
831 following:
832
833 $ ./futex_demo
834 Parent (18534) 0
835 Child (18535) 0
836 Parent (18534) 1
837 Child (18535) 1
838 Parent (18534) 2
839 Child (18535) 2
840 Parent (18534) 3
841 Child (18535) 3
842 Parent (18534) 4
843 Child (18535) 4
844
845 Program source
846
847 /* futex_demo.c
848
849 Usage: futex_demo [nloops]
850 (Default: 5)
851
852 Demonstrate the use of futexes in a program where parent and child
853 use a pair of futexes located inside a shared anonymous mapping to
854 synchronize access to a shared resource: the terminal. The two
855 processes each write 'num-loops' messages to the terminal and employ
856 a synchronization protocol that ensures that they alternate in
857 writing messages.
858 */
859 #define _GNU_SOURCE
860 #include <err.h>
861 #include <errno.h>
862 #include <linux/futex.h>
863 #include <stdatomic.h>
864 #include <stdint.h>
865 #include <stdio.h>
866 #include <stdlib.h>
867 #include <sys/mman.h>
868 #include <sys/syscall.h>
869 #include <sys/time.h>
870 #include <sys/wait.h>
871 #include <unistd.h>
872
873 static uint32_t *futex1, *futex2, *iaddr;
874
875 static int
876 futex(uint32_t *uaddr, int futex_op, uint32_t val,
877 const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3)
878 {
879 return syscall(SYS_futex, uaddr, futex_op, val,
880 timeout, uaddr2, val3);
881 }
882
883 /* Acquire the futex pointed to by 'futexp': wait for its value to
884 become 1, and then set the value to 0. */
885
886 static void
887 fwait(uint32_t *futexp)
888 {
889 long s;
890 const uint32_t one = 1;
891
892 /* atomic_compare_exchange_strong(ptr, oldval, newval)
893 atomically performs the equivalent of:
894
895 if (*ptr == *oldval)
896 *ptr = newval;
897
898 It returns true if the test yielded true and *ptr was updated. */
899
900 while (1) {
901
902 /* Is the futex available? */
903 if (atomic_compare_exchange_strong(futexp, &one, 0))
904 break; /* Yes */
905
906 /* Futex is not available; wait. */
907
908 s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0);
909 if (s == -1 && errno != EAGAIN)
910 err(EXIT_FAILURE, "futex-FUTEX_WAIT");
911 }
912 }
913
914 /* Release the futex pointed to by 'futexp': if the futex currently
915 has the value 0, set its value to 1 and then wake any futex waiters,
916 so that if the peer is blocked in fwait(), it can proceed. */
917
918 static void
919 fpost(uint32_t *futexp)
920 {
921 long s;
922 const uint32_t zero = 0;
923
924 /* atomic_compare_exchange_strong() was described
925 in comments above. */
926
927 if (atomic_compare_exchange_strong(futexp, &zero, 1)) {
928 s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0);
929 if (s == -1)
930 err(EXIT_FAILURE, "futex-FUTEX_WAKE");
931 }
932 }
933
934 int
935 main(int argc, char *argv[])
936 {
937 pid_t childPid;
938 unsigned int nloops;
939
940 setbuf(stdout, NULL);
941
942 nloops = (argc > 1) ? atoi(argv[1]) : 5;
943
944 /* Create a shared anonymous mapping that will hold the futexes.
945 Since the futexes are being shared between processes, we
946 subsequently use the "shared" futex operations (i.e., not the
947 ones suffixed "_PRIVATE"). */
948
949 iaddr = mmap(NULL, sizeof(*iaddr) * 2, PROT_READ | PROT_WRITE,
950 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
951 if (iaddr == MAP_FAILED)
952 err(EXIT_FAILURE, "mmap");
953
954 futex1 = &iaddr[0];
955 futex2 = &iaddr[1];
956
957 *futex1 = 0; /* State: unavailable */
958 *futex2 = 1; /* State: available */
959
960 /* Create a child process that inherits the shared anonymous
961 mapping. */
962
963 childPid = fork();
964 if (childPid == -1)
965 err(EXIT_FAILURE, "fork");
966
967 if (childPid == 0) { /* Child */
968 for (unsigned int j = 0; j < nloops; j++) {
969 fwait(futex1);
970 printf("Child (%jd) %u\n", (intmax_t) getpid(), j);
971 fpost(futex2);
972 }
973
974 exit(EXIT_SUCCESS);
975 }
976
977 /* Parent falls through to here. */
978
979 for (unsigned int j = 0; j < nloops; j++) {
980 fwait(futex2);
981 printf("Parent (%jd) %u\n", (intmax_t) getpid(), j);
982 fpost(futex1);
983 }
984
985 wait(NULL);
986
987 exit(EXIT_SUCCESS);
988 }
989
991 get_robust_list(2), restart_syscall(2), pthread_mutexattr_getproto‐
992 col(3), futex(7), sched(7)
993
994 The following kernel source files:
995
996 • Documentation/pi-futex.txt
997
998 • Documentation/futex-requeue-pi.txt
999
1000 • Documentation/locking/rt-mutex.txt
1001
1002 • Documentation/locking/rt-mutex-design.txt
1003
1004 • Documentation/robust-futex-ABI.txt
1005
1006 Franke, H., Russell, R., and Kirwood, M., 2002. Fuss, Futexes and Fur‐
1007 wocks: Fast Userlevel Locking in Linux (from proceedings of the Ottawa
1008 Linux Symposium 2002),
1009 ⟨http://kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf⟩
1010
1011 Hart, D., 2009. A futex overview and update,
1012 ⟨http://lwn.net/Articles/360699/⟩
1013
1014 Hart, D. and Guniguntala, D., 2009. Requeue-PI: Making glibc Condvars
1015 PI-Aware (from proceedings of the 2009 Real-Time Linux Workshop),
1016 ⟨http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf⟩
1017
1018 Drepper, U., 2011. Futexes Are Tricky,
1019 ⟨http://www.akkadia.org/drepper/futex.pdf⟩
1020
1021 Futex example library, futex-*.tar.bz2 at
1022 ⟨https://mirrors.kernel.org/pub/linux/kernel/people/rusty/⟩
1023
1024
1025
1026Linux man-pages 6.05 2023-05-03 futex(2)