1FUTEX(2) Linux Programmer's Manual FUTEX(2)
2
3
4
6 futex - fast user-space locking
7
9 #include <linux/futex.h> /* Definition of FUTEX_* constants */
10 #include <sys/syscall.h> /* Definition of SYS_* constants */
11 #include <unistd.h>
12
13 long syscall(SYS_futex, uint32_t *uaddr, int futex_op, uint32_t val,
14 const struct timespec *timeout, /* or: uint32_t val2 */
15 uint32_t *uaddr2, uint32_t val3);
16
17 Note: glibc provides no wrapper for futex(), necessitating the use of
18 syscall(2).
19
21 The futex() system call provides a method for waiting until a certain
22 condition becomes true. It is typically used as a blocking construct
23 in the context of shared-memory synchronization. When using futexes,
24 the majority of the synchronization operations are performed in user
25 space. A user-space program employs the futex() system call only when
26 it is likely that the program has to block for a longer time until the
27 condition becomes true. Other futex() operations can be used to wake
28 any processes or threads waiting for a particular condition.
29
30 A futex is a 32-bit value—referred to below as a futex word—whose ad‐
31 dress is supplied to the futex() system call. (Futexes are 32 bits in
32 size on all platforms, including 64-bit systems.) All futex operations
33 are governed by this value. In order to share a futex between pro‐
34 cesses, the futex is placed in a region of shared memory, created using
35 (for example) mmap(2) or shmat(2). (Thus, the futex word may have dif‐
36 ferent virtual addresses in different processes, but these addresses
37 all refer to the same location in physical memory.) In a multithreaded
38 program, it is sufficient to place the futex word in a global variable
39 shared by all threads.
40
41 When executing a futex operation that requests to block a thread, the
42 kernel will block only if the futex word has the value that the calling
43 thread supplied (as one of the arguments of the futex() call) as the
44 expected value of the futex word. The loading of the futex word's
45 value, the comparison of that value with the expected value, and the
46 actual blocking will happen atomically and will be totally ordered with
47 respect to concurrent operations performed by other threads on the same
48 futex word. Thus, the futex word is used to connect the synchroniza‐
49 tion in user space with the implementation of blocking by the kernel.
50 Analogously to an atomic compare-and-exchange operation that poten‐
51 tially changes shared memory, blocking via a futex is an atomic com‐
52 pare-and-block operation.
53
54 One use of futexes is for implementing locks. The state of the lock
55 (i.e., acquired or not acquired) can be represented as an atomically
56 accessed flag in shared memory. In the uncontended case, a thread can
57 access or modify the lock state with atomic instructions, for example
58 atomically changing it from not acquired to acquired using an atomic
59 compare-and-exchange instruction. (Such instructions are performed en‐
60 tirely in user mode, and the kernel maintains no information about the
61 lock state.) On the other hand, a thread may be unable to acquire a
62 lock because it is already acquired by another thread. It then may
63 pass the lock's flag as a futex word and the value representing the ac‐
64 quired state as the expected value to a futex() wait operation. This
65 futex() operation will block if and only if the lock is still acquired
66 (i.e., the value in the futex word still matches the "acquired state").
67 When releasing the lock, a thread has to first reset the lock state to
68 not acquired and then execute a futex operation that wakes threads
69 blocked on the lock flag used as a futex word (this can be further op‐
70 timized to avoid unnecessary wake-ups). See futex(7) for more detail
71 on how to use futexes.
72
73 Besides the basic wait and wake-up futex functionality, there are fur‐
74 ther futex operations aimed at supporting more complex use cases.
75
76 Note that no explicit initialization or destruction is necessary to use
77 futexes; the kernel maintains a futex (i.e., the kernel-internal imple‐
78 mentation artifact) only while operations such as FUTEX_WAIT, described
79 below, are being performed on a particular futex word.
80
81 Arguments
82 The uaddr argument points to the futex word. On all platforms, futexes
83 are four-byte integers that must be aligned on a four-byte boundary.
84 The operation to perform on the futex is specified in the futex_op ar‐
85 gument; val is a value whose meaning and purpose depends on futex_op.
86
87 The remaining arguments (timeout, uaddr2, and val3) are required only
88 for certain of the futex operations described below. Where one of
89 these arguments is not required, it is ignored.
90
91 For several blocking operations, the timeout argument is a pointer to a
92 timespec structure that specifies a timeout for the operation. How‐
93 ever, notwithstanding the prototype shown above, for some operations,
94 the least significant four bytes of this argument are instead used as
95 an integer whose meaning is determined by the operation. For these op‐
96 erations, the kernel casts the timeout value first to unsigned long,
97 then to uint32_t, and in the remainder of this page, this argument is
98 referred to as val2 when interpreted in this fashion.
99
100 Where it is required, the uaddr2 argument is a pointer to a second fu‐
101 tex word that is employed by the operation.
102
103 The interpretation of the final integer argument, val3, depends on the
104 operation.
105
106 Futex operations
107 The futex_op argument consists of two parts: a command that specifies
108 the operation to be performed, bitwise ORed with zero or more options
109 that modify the behaviour of the operation. The options that may be
110 included in futex_op are as follows:
111
112 FUTEX_PRIVATE_FLAG (since Linux 2.6.22)
113 This option bit can be employed with all futex operations. It
114 tells the kernel that the futex is process-private and not
115 shared with another process (i.e., it is being used for synchro‐
116 nization only between threads of the same process). This allows
117 the kernel to make some additional performance optimizations.
118
119 As a convenience, <linux/futex.h> defines a set of constants
120 with the suffix _PRIVATE that are equivalents of all of the op‐
121 erations listed below, but with the FUTEX_PRIVATE_FLAG ORed into
122 the constant value. Thus, there are FUTEX_WAIT_PRIVATE, FU‐
123 TEX_WAKE_PRIVATE, and so on.
124
125 FUTEX_CLOCK_REALTIME (since Linux 2.6.28)
126 This option bit can be employed only with the FUTEX_WAIT_BITSET,
127 FUTEX_WAIT_REQUEUE_PI, and (since Linux 4.5) FUTEX_WAIT opera‐
128 tions.
129
130 If this option is set, the kernel measures the timeout against
131 the CLOCK_REALTIME clock.
132
133 If this option is not set, the kernel measures the timeout
134 against the CLOCK_MONOTONIC clock.
135
136 The operation specified in futex_op is one of the following:
137
138 FUTEX_WAIT (since Linux 2.6.0)
139 This operation tests that the value at the futex word pointed to
140 by the address uaddr still contains the expected value val, and
141 if so, then sleeps waiting for a FUTEX_WAKE operation on the fu‐
142 tex word. The load of the value of the futex word is an atomic
143 memory access (i.e., using atomic machine instructions of the
144 respective architecture). This load, the comparison with the
145 expected value, and starting to sleep are performed atomically
146 and totally ordered with respect to other futex operations on
147 the same futex word. If the thread starts to sleep, it is con‐
148 sidered a waiter on this futex word. If the futex value does
149 not match val, then the call fails immediately with the error
150 EAGAIN.
151
152 The purpose of the comparison with the expected value is to pre‐
153 vent lost wake-ups. If another thread changed the value of the
154 futex word after the calling thread decided to block based on
155 the prior value, and if the other thread executed a FUTEX_WAKE
156 operation (or similar wake-up) after the value change and before
157 this FUTEX_WAIT operation, then the calling thread will observe
158 the value change and will not start to sleep.
159
160 If the timeout is not NULL, the structure it points to specifies
161 a timeout for the wait. (This interval will be rounded up to
162 the system clock granularity, and is guaranteed not to expire
163 early.) The timeout is by default measured according to the
164 CLOCK_MONOTONIC clock, but, since Linux 4.5, the CLOCK_REALTIME
165 clock can be selected by specifying FUTEX_CLOCK_REALTIME in fu‐
166 tex_op. If timeout is NULL, the call blocks indefinitely.
167
168 Note: for FUTEX_WAIT, timeout is interpreted as a relative
169 value. This differs from other futex operations, where timeout
170 is interpreted as an absolute value. To obtain the equivalent
171 of FUTEX_WAIT with an absolute timeout, employ FUTEX_WAIT_BITSET
172 with val3 specified as FUTEX_BITSET_MATCH_ANY.
173
174 The arguments uaddr2 and val3 are ignored.
175
176 FUTEX_WAKE (since Linux 2.6.0)
177 This operation wakes at most val of the waiters that are waiting
178 (e.g., inside FUTEX_WAIT) on the futex word at the address
179 uaddr. Most commonly, val is specified as either 1 (wake up a
180 single waiter) or INT_MAX (wake up all waiters). No guarantee
181 is provided about which waiters are awoken (e.g., a waiter with
182 a higher scheduling priority is not guaranteed to be awoken in
183 preference to a waiter with a lower priority).
184
185 The arguments timeout, uaddr2, and val3 are ignored.
186
187 FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25)
188 This operation creates a file descriptor that is associated with
189 the futex at uaddr. The caller must close the returned file de‐
190 scriptor after use. When another process or thread performs a
191 FUTEX_WAKE on the futex word, the file descriptor indicates as
192 being readable with select(2), poll(2), and epoll(7)
193
194 The file descriptor can be used to obtain asynchronous notifica‐
195 tions: if val is nonzero, then, when another process or thread
196 executes a FUTEX_WAKE, the caller will receive the signal number
197 that was passed in val.
198
199 The arguments timeout, uaddr2, and val3 are ignored.
200
201 Because it was inherently racy, FUTEX_FD has been removed from
202 Linux 2.6.26 onward.
203
204 FUTEX_REQUEUE (since Linux 2.6.0)
205 This operation performs the same task as FUTEX_CMP_REQUEUE (see
206 below), except that no check is made using the value in val3.
207 (The argument val3 is ignored.)
208
209 FUTEX_CMP_REQUEUE (since Linux 2.6.7)
210 This operation first checks whether the location uaddr still
211 contains the value val3. If not, the operation fails with the
212 error EAGAIN. Otherwise, the operation wakes up a maximum of
213 val waiters that are waiting on the futex at uaddr. If there
214 are more than val waiters, then the remaining waiters are re‐
215 moved from the wait queue of the source futex at uaddr and added
216 to the wait queue of the target futex at uaddr2. The val2 argu‐
217 ment specifies an upper limit on the number of waiters that are
218 requeued to the futex at uaddr2.
219
220 The load from uaddr is an atomic memory access (i.e., using
221 atomic machine instructions of the respective architecture).
222 This load, the comparison with val3, and the requeueing of any
223 waiters are performed atomically and totally ordered with re‐
224 spect to other operations on the same futex word.
225
226 Typical values to specify for val are 0 or 1. (Specifying
227 INT_MAX is not useful, because it would make the FUTEX_CMP_RE‐
228 QUEUE operation equivalent to FUTEX_WAKE.) The limit value
229 specified via val2 is typically either 1 or INT_MAX. (Specify‐
230 ing the argument as 0 is not useful, because it would make the
231 FUTEX_CMP_REQUEUE operation equivalent to FUTEX_WAIT.)
232
233 The FUTEX_CMP_REQUEUE operation was added as a replacement for
234 the earlier FUTEX_REQUEUE. The difference is that the check of
235 the value at uaddr can be used to ensure that requeueing happens
236 only under certain conditions, which allows race conditions to
237 be avoided in certain use cases.
238
239 Both FUTEX_REQUEUE and FUTEX_CMP_REQUEUE can be used to avoid
240 "thundering herd" wake-ups that could occur when using FU‐
241 TEX_WAKE in cases where all of the waiters that are woken need
242 to acquire another futex. Consider the following scenario,
243 where multiple waiter threads are waiting on B, a wait queue im‐
244 plemented using a futex:
245
246 lock(A)
247 while (!check_value(V)) {
248 unlock(A);
249 block_on(B);
250 lock(A);
251 };
252 unlock(A);
253
254 If a waker thread used FUTEX_WAKE, then all waiters waiting on B
255 would be woken up, and they would all try to acquire lock A.
256 However, waking all of the threads in this manner would be
257 pointless because all except one of the threads would immedi‐
258 ately block on lock A again. By contrast, a requeue operation
259 wakes just one waiter and moves the other waiters to lock A, and
260 when the woken waiter unlocks A then the next waiter can pro‐
261 ceed.
262
263 FUTEX_WAKE_OP (since Linux 2.6.14)
264 This operation was added to support some user-space use cases
265 where more than one futex must be handled at the same time. The
266 most notable example is the implementation of pthread_cond_sig‐
267 nal(3), which requires operations on two futexes, the one used
268 to implement the mutex and the one used in the implementation of
269 the wait queue associated with the condition variable. FU‐
270 TEX_WAKE_OP allows such cases to be implemented without leading
271 to high rates of contention and context switching.
272
273 The FUTEX_WAKE_OP operation is equivalent to executing the fol‐
274 lowing code atomically and totally ordered with respect to other
275 futex operations on any of the two supplied futex words:
276
277 uint32_t oldval = *(uint32_t *) uaddr2;
278 *(uint32_t *) uaddr2 = oldval op oparg;
279 futex(uaddr, FUTEX_WAKE, val, 0, 0, 0);
280 if (oldval cmp cmparg)
281 futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0);
282
283 In other words, FUTEX_WAKE_OP does the following:
284
285 * saves the original value of the futex word at uaddr2 and per‐
286 forms an operation to modify the value of the futex at
287 uaddr2; this is an atomic read-modify-write memory access
288 (i.e., using atomic machine instructions of the respective
289 architecture)
290
291 * wakes up a maximum of val waiters on the futex for the futex
292 word at uaddr; and
293
294 * dependent on the results of a test of the original value of
295 the futex word at uaddr2, wakes up a maximum of val2 waiters
296 on the futex for the futex word at uaddr2.
297
298 The operation and comparison that are to be performed are en‐
299 coded in the bits of the argument val3. Pictorially, the encod‐
300 ing is:
301
302 +---+---+-----------+-----------+
303 |op |cmp| oparg | cmparg |
304 +---+---+-----------+-----------+
305 4 4 12 12 <== # of bits
306
307 Expressed in code, the encoding is:
308
309 #define FUTEX_OP(op, oparg, cmp, cmparg) \
310 (((op & 0xf) << 28) | \
311 ((cmp & 0xf) << 24) | \
312 ((oparg & 0xfff) << 12) | \
313 (cmparg & 0xfff))
314
315 In the above, op and cmp are each one of the codes listed below.
316 The oparg and cmparg components are literal numeric values, ex‐
317 cept as noted below.
318
319 The op component has one of the following values:
320
321 FUTEX_OP_SET 0 /* uaddr2 = oparg; */
322 FUTEX_OP_ADD 1 /* uaddr2 += oparg; */
323 FUTEX_OP_OR 2 /* uaddr2 |= oparg; */
324 FUTEX_OP_ANDN 3 /* uaddr2 &= ~oparg; */
325 FUTEX_OP_XOR 4 /* uaddr2 ^= oparg; */
326
327 In addition, bitwise ORing the following value into op causes
328 (1 << oparg) to be used as the operand:
329
330 FUTEX_OP_ARG_SHIFT 8 /* Use (1 << oparg) as operand */
331
332 The cmp field is one of the following:
333
334 FUTEX_OP_CMP_EQ 0 /* if (oldval == cmparg) wake */
335 FUTEX_OP_CMP_NE 1 /* if (oldval != cmparg) wake */
336 FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake */
337 FUTEX_OP_CMP_LE 3 /* if (oldval <= cmparg) wake */
338 FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake */
339 FUTEX_OP_CMP_GE 5 /* if (oldval >= cmparg) wake */
340
341 The return value of FUTEX_WAKE_OP is the sum of the number of
342 waiters woken on the futex uaddr plus the number of waiters wo‐
343 ken on the futex uaddr2.
344
345 FUTEX_WAIT_BITSET (since Linux 2.6.25)
346 This operation is like FUTEX_WAIT except that val3 is used to
347 provide a 32-bit bit mask to the kernel. This bit mask, in
348 which at least one bit must be set, is stored in the kernel-in‐
349 ternal state of the waiter. See the description of FU‐
350 TEX_WAKE_BITSET for further details.
351
352 If timeout is not NULL, the structure it points to specifies an
353 absolute timeout for the wait operation. If timeout is NULL,
354 the operation can block indefinitely.
355
356 The uaddr2 argument is ignored.
357
358 FUTEX_WAKE_BITSET (since Linux 2.6.25)
359 This operation is the same as FUTEX_WAKE except that the val3
360 argument is used to provide a 32-bit bit mask to the kernel.
361 This bit mask, in which at least one bit must be set, is used to
362 select which waiters should be woken up. The selection is done
363 by a bitwise AND of the "wake" bit mask (i.e., the value in
364 val3) and the bit mask which is stored in the kernel-internal
365 state of the waiter (the "wait" bit mask that is set using FU‐
366 TEX_WAIT_BITSET). All of the waiters for which the result of
367 the AND is nonzero are woken up; the remaining waiters are left
368 sleeping.
369
370 The effect of FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET is to al‐
371 low selective wake-ups among multiple waiters that are blocked
372 on the same futex. However, note that, depending on the use
373 case, employing this bit-mask multiplexing feature on a futex
374 can be less efficient than simply using multiple futexes, be‐
375 cause employing bit-mask multiplexing requires the kernel to
376 check all waiters on a futex, including those that are not in‐
377 terested in being woken up (i.e., they do not have the relevant
378 bit set in their "wait" bit mask).
379
380 The constant FUTEX_BITSET_MATCH_ANY, which corresponds to all 32
381 bits set in the bit mask, can be used as the val3 argument for
382 FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET. Other than differences
383 in the handling of the timeout argument, the FUTEX_WAIT opera‐
384 tion is equivalent to FUTEX_WAIT_BITSET with val3 specified as
385 FUTEX_BITSET_MATCH_ANY; that is, allow a wake-up by any waker.
386 The FUTEX_WAKE operation is equivalent to FUTEX_WAKE_BITSET with
387 val3 specified as FUTEX_BITSET_MATCH_ANY; that is, wake up any
388 waiter(s).
389
390 The uaddr2 and timeout arguments are ignored.
391
392 Priority-inheritance futexes
393 Linux supports priority-inheritance (PI) futexes in order to handle
394 priority-inversion problems that can be encountered with normal futex
395 locks. Priority inversion is the problem that occurs when a high-pri‐
396 ority task is blocked waiting to acquire a lock held by a low-priority
397 task, while tasks at an intermediate priority continuously preempt the
398 low-priority task from the CPU. Consequently, the low-priority task
399 makes no progress toward releasing the lock, and the high-priority task
400 remains blocked.
401
402 Priority inheritance is a mechanism for dealing with the priority-in‐
403 version problem. With this mechanism, when a high-priority task be‐
404 comes blocked by a lock held by a low-priority task, the priority of
405 the low-priority task is temporarily raised to that of the high-prior‐
406 ity task, so that it is not preempted by any intermediate level tasks,
407 and can thus make progress toward releasing the lock. To be effective,
408 priority inheritance must be transitive, meaning that if a high-prior‐
409 ity task blocks on a lock held by a lower-priority task that is itself
410 blocked by a lock held by another intermediate-priority task (and so
411 on, for chains of arbitrary length), then both of those tasks (or more
412 generally, all of the tasks in a lock chain) have their priorities
413 raised to be the same as the high-priority task.
414
415 From a user-space perspective, what makes a futex PI-aware is a policy
416 agreement (described below) between user space and the kernel about the
417 value of the futex word, coupled with the use of the PI-futex opera‐
418 tions described below. (Unlike the other futex operations described
419 above, the PI-futex operations are designed for the implementation of
420 very specific IPC mechanisms.)
421
422 The PI-futex operations described below differ from the other futex op‐
423 erations in that they impose policy on the use of the value of the fu‐
424 tex word:
425
426 * If the lock is not acquired, the futex word's value shall be 0.
427
428 * If the lock is acquired, the futex word's value shall be the thread
429 ID (TID; see gettid(2)) of the owning thread.
430
431 * If the lock is owned and there are threads contending for the lock,
432 then the FUTEX_WAITERS bit shall be set in the futex word's value;
433 in other words, this value is:
434
435 FUTEX_WAITERS | TID
436
437 (Note that is invalid for a PI futex word to have no owner and FU‐
438 TEX_WAITERS set.)
439
440 With this policy in place, a user-space application can acquire an un‐
441 acquired lock or release a lock using atomic instructions executed in
442 user mode (e.g., a compare-and-swap operation such as cmpxchg on the
443 x86 architecture). Acquiring a lock simply consists of using compare-
444 and-swap to atomically set the futex word's value to the caller's TID
445 if its previous value was 0. Releasing a lock requires using compare-
446 and-swap to set the futex word's value to 0 if the previous value was
447 the expected TID.
448
449 If a futex is already acquired (i.e., has a nonzero value), waiters
450 must employ the FUTEX_LOCK_PI operation to acquire the lock. If other
451 threads are waiting for the lock, then the FUTEX_WAITERS bit is set in
452 the futex value; in this case, the lock owner must employ the FUTEX_UN‐
453 LOCK_PI operation to release the lock.
454
455 In the cases where callers are forced into the kernel (i.e., required
456 to perform a futex() call), they then deal directly with a so-called
457 RT-mutex, a kernel locking mechanism which implements the required pri‐
458 ority-inheritance semantics. After the RT-mutex is acquired, the futex
459 value is updated accordingly, before the calling thread returns to user
460 space.
461
462 It is important to note that the kernel will update the futex word's
463 value prior to returning to user space. (This prevents the possibility
464 of the futex word's value ending up in an invalid state, such as having
465 an owner but the value being 0, or having waiters but not having the
466 FUTEX_WAITERS bit set.)
467
468 If a futex has an associated RT-mutex in the kernel (i.e., there are
469 blocked waiters) and the owner of the futex/RT-mutex dies unexpectedly,
470 then the kernel cleans up the RT-mutex and hands it over to the next
471 waiter. This in turn requires that the user-space value is updated ac‐
472 cordingly. To indicate that this is required, the kernel sets the FU‐
473 TEX_OWNER_DIED bit in the futex word along with the thread ID of the
474 new owner. User space can detect this situation via the presence of
475 the FUTEX_OWNER_DIED bit and is then responsible for cleaning up the
476 stale state left over by the dead owner.
477
478 PI futexes are operated on by specifying one of the values listed below
479 in futex_op. Note that the PI futex operations must be used as paired
480 operations and are subject to some additional requirements:
481
482 * FUTEX_LOCK_PI and FUTEX_TRYLOCK_PI pair with FUTEX_UNLOCK_PI. FU‐
483 TEX_UNLOCK_PI must be called only on a futex owned by the calling
484 thread, as defined by the value policy, otherwise the error EPERM
485 results.
486
487 * FUTEX_WAIT_REQUEUE_PI pairs with FUTEX_CMP_REQUEUE_PI. This must be
488 performed from a non-PI futex to a distinct PI futex (or the error
489 EINVAL results). Additionally, val (the number of waiters to be wo‐
490 ken) must be 1 (or the error EINVAL results).
491
492 The PI futex operations are as follows:
493
494 FUTEX_LOCK_PI (since Linux 2.6.18)
495 This operation is used after an attempt to acquire the lock via
496 an atomic user-mode instruction failed because the futex word
497 has a nonzero value—specifically, because it contained the (PID-
498 namespace-specific) TID of the lock owner.
499
500 The operation checks the value of the futex word at the address
501 uaddr. If the value is 0, then the kernel tries to atomically
502 set the futex value to the caller's TID. If the futex word's
503 value is nonzero, the kernel atomically sets the FUTEX_WAITERS
504 bit, which signals the futex owner that it cannot unlock the fu‐
505 tex in user space atomically by setting the futex value to 0.
506 After that, the kernel:
507
508 1. Tries to find the thread which is associated with the owner
509 TID.
510
511 2. Creates or reuses kernel state on behalf of the owner. (If
512 this is the first waiter, there is no kernel state for this
513 futex, so kernel state is created by locking the RT-mutex and
514 the futex owner is made the owner of the RT-mutex. If there
515 are existing waiters, then the existing state is reused.)
516
517 3. Attaches the waiter to the futex (i.e., the waiter is en‐
518 queued on the RT-mutex waiter list).
519
520 If more than one waiter exists, the enqueueing of the waiter is
521 in descending priority order. (For information on priority or‐
522 dering, see the discussion of the SCHED_DEADLINE, SCHED_FIFO,
523 and SCHED_RR scheduling policies in sched(7).) The owner inher‐
524 its either the waiter's CPU bandwidth (if the waiter is sched‐
525 uled under the SCHED_DEADLINE policy) or the waiter's priority
526 (if the waiter is scheduled under the SCHED_RR or SCHED_FIFO
527 policy). This inheritance follows the lock chain in the case of
528 nested locking and performs deadlock detection.
529
530 The timeout argument provides a timeout for the lock attempt.
531 If timeout is not NULL, the structure it points to specifies an
532 absolute timeout, measured against the CLOCK_REALTIME clock. If
533 timeout is NULL, the operation will block indefinitely.
534
535 The uaddr2, val, and val3 arguments are ignored.
536
537 FUTEX_TRYLOCK_PI (since Linux 2.6.18)
538 This operation tries to acquire the lock at uaddr. It is in‐
539 voked when a user-space atomic acquire did not succeed because
540 the futex word was not 0.
541
542 Because the kernel has access to more state information than
543 user space, acquisition of the lock might succeed if performed
544 by the kernel in cases where the futex word (i.e., the state in‐
545 formation accessible to use-space) contains stale state (FU‐
546 TEX_WAITERS and/or FUTEX_OWNER_DIED). This can happen when the
547 owner of the futex died. User space cannot handle this condi‐
548 tion in a race-free manner, but the kernel can fix this up and
549 acquire the futex.
550
551 The uaddr2, val, timeout, and val3 arguments are ignored.
552
553 FUTEX_UNLOCK_PI (since Linux 2.6.18)
554 This operation wakes the top priority waiter that is waiting in
555 FUTEX_LOCK_PI on the futex address provided by the uaddr argu‐
556 ment.
557
558 This is called when the user-space value at uaddr cannot be
559 changed atomically from a TID (of the owner) to 0.
560
561 The uaddr2, val, timeout, and val3 arguments are ignored.
562
563 FUTEX_CMP_REQUEUE_PI (since Linux 2.6.31)
564 This operation is a PI-aware variant of FUTEX_CMP_REQUEUE. It
565 requeues waiters that are blocked via FUTEX_WAIT_REQUEUE_PI on
566 uaddr from a non-PI source futex (uaddr) to a PI target futex
567 (uaddr2).
568
569 As with FUTEX_CMP_REQUEUE, this operation wakes up a maximum of
570 val waiters that are waiting on the futex at uaddr. However,
571 for FUTEX_CMP_REQUEUE_PI, val is required to be 1 (since the
572 main point is to avoid a thundering herd). The remaining wait‐
573 ers are removed from the wait queue of the source futex at uaddr
574 and added to the wait queue of the target futex at uaddr2.
575
576 The val2 and val3 arguments serve the same purposes as for FU‐
577 TEX_CMP_REQUEUE.
578
579 FUTEX_WAIT_REQUEUE_PI (since Linux 2.6.31)
580 Wait on a non-PI futex at uaddr and potentially be requeued (via
581 a FUTEX_CMP_REQUEUE_PI operation in another task) onto a PI fu‐
582 tex at uaddr2. The wait operation on uaddr is the same as for
583 FUTEX_WAIT.
584
585 The waiter can be removed from the wait on uaddr without re‐
586 queueing on uaddr2 via a FUTEX_WAKE operation in another task.
587 In this case, the FUTEX_WAIT_REQUEUE_PI operation fails with the
588 error EAGAIN.
589
590 If timeout is not NULL, the structure it points to specifies an
591 absolute timeout for the wait operation. If timeout is NULL,
592 the operation can block indefinitely.
593
594 The val3 argument is ignored.
595
596 The FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI were added to
597 support a fairly specific use case: support for priority-inheri‐
598 tance-aware POSIX threads condition variables. The idea is that
599 these operations should always be paired, in order to ensure
600 that user space and the kernel remain in sync. Thus, in the FU‐
601 TEX_WAIT_REQUEUE_PI operation, the user-space application pre-
602 specifies the target of the requeue that takes place in the FU‐
603 TEX_CMP_REQUEUE_PI operation.
604
606 In the event of an error (and assuming that futex() was invoked via
607 syscall(2)), all operations return -1 and set errno to indicate the er‐
608 ror.
609
610 The return value on success depends on the operation, as described in
611 the following list:
612
613 FUTEX_WAIT
614 Returns 0 if the caller was woken up. Note that a wake-up can
615 also be caused by common futex usage patterns in unrelated code
616 that happened to have previously used the futex word's memory
617 location (e.g., typical futex-based implementations of Pthreads
618 mutexes can cause this under some conditions). Therefore, call‐
619 ers should always conservatively assume that a return value of 0
620 can mean a spurious wake-up, and use the futex word's value
621 (i.e., the user-space synchronization scheme) to decide whether
622 to continue to block or not.
623
624 FUTEX_WAKE
625 Returns the number of waiters that were woken up.
626
627 FUTEX_FD
628 Returns the new file descriptor associated with the futex.
629
630 FUTEX_REQUEUE
631 Returns the number of waiters that were woken up.
632
633 FUTEX_CMP_REQUEUE
634 Returns the total number of waiters that were woken up or re‐
635 queued to the futex for the futex word at uaddr2. If this value
636 is greater than val, then the difference is the number of wait‐
637 ers requeued to the futex for the futex word at uaddr2.
638
639 FUTEX_WAKE_OP
640 Returns the total number of waiters that were woken up. This is
641 the sum of the woken waiters on the two futexes for the futex
642 words at uaddr and uaddr2.
643
644 FUTEX_WAIT_BITSET
645 Returns 0 if the caller was woken up. See FUTEX_WAIT for how to
646 interpret this correctly in practice.
647
648 FUTEX_WAKE_BITSET
649 Returns the number of waiters that were woken up.
650
651 FUTEX_LOCK_PI
652 Returns 0 if the futex was successfully locked.
653
654 FUTEX_TRYLOCK_PI
655 Returns 0 if the futex was successfully locked.
656
657 FUTEX_UNLOCK_PI
658 Returns 0 if the futex was successfully unlocked.
659
660 FUTEX_CMP_REQUEUE_PI
661 Returns the total number of waiters that were woken up or re‐
662 queued to the futex for the futex word at uaddr2. If this value
663 is greater than val, then difference is the number of waiters
664 requeued to the futex for the futex word at uaddr2.
665
666 FUTEX_WAIT_REQUEUE_PI
667 Returns 0 if the caller was successfully requeued to the futex
668 for the futex word at uaddr2.
669
671 EACCES No read access to the memory of a futex word.
672
673 EAGAIN (FUTEX_WAIT, FUTEX_WAIT_BITSET, FUTEX_WAIT_REQUEUE_PI) The value
674 pointed to by uaddr was not equal to the expected value val at
675 the time of the call.
676
677 Note: on Linux, the symbolic names EAGAIN and EWOULDBLOCK (both
678 of which appear in different parts of the kernel futex code)
679 have the same value.
680
681 EAGAIN (FUTEX_CMP_REQUEUE, FUTEX_CMP_REQUEUE_PI) The value pointed to
682 by uaddr is not equal to the expected value val3.
683
684 EAGAIN (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI) The fu‐
685 tex owner thread ID of uaddr (for FUTEX_CMP_REQUEUE_PI: uaddr2)
686 is about to exit, but has not yet handled the internal state
687 cleanup. Try again.
688
689 EDEADLK
690 (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI) The fu‐
691 tex word at uaddr is already locked by the caller.
692
693 EDEADLK
694 (FUTEX_CMP_REQUEUE_PI) While requeueing a waiter to the PI futex
695 for the futex word at uaddr2, the kernel detected a deadlock.
696
697 EFAULT A required pointer argument (i.e., uaddr, uaddr2, or timeout)
698 did not point to a valid user-space address.
699
700 EINTR A FUTEX_WAIT or FUTEX_WAIT_BITSET operation was interrupted by a
701 signal (see signal(7)). In kernels before Linux 2.6.22, this
702 error could also be returned for a spurious wakeup; since Linux
703 2.6.22, this no longer happens.
704
705 EINVAL The operation in futex_op is one of those that employs a time‐
706 out, but the supplied timeout argument was invalid (tv_sec was
707 less than zero, or tv_nsec was not less than 1,000,000,000).
708
709 EINVAL The operation specified in futex_op employs one or both of the
710 pointers uaddr and uaddr2, but one of these does not point to a
711 valid object—that is, the address is not four-byte-aligned.
712
713 EINVAL (FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET) The bit mask supplied in
714 val3 is zero.
715
716 EINVAL (FUTEX_CMP_REQUEUE_PI) uaddr equals uaddr2 (i.e., an attempt was
717 made to requeue to the same futex).
718
719 EINVAL (FUTEX_FD) The signal number supplied in val is invalid.
720
721 EINVAL (FUTEX_WAKE, FUTEX_WAKE_OP, FUTEX_WAKE_BITSET, FUTEX_REQUEUE,
722 FUTEX_CMP_REQUEUE) The kernel detected an inconsistency between
723 the user-space state at uaddr and the kernel state—that is, it
724 detected a waiter which waits in FUTEX_LOCK_PI on uaddr.
725
726 EINVAL (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_UNLOCK_PI) The kernel
727 detected an inconsistency between the user-space state at uaddr
728 and the kernel state. This indicates either state corruption or
729 that the kernel found a waiter on uaddr which is waiting via FU‐
730 TEX_WAIT or FUTEX_WAIT_BITSET.
731
732 EINVAL (FUTEX_CMP_REQUEUE_PI) The kernel detected an inconsistency be‐
733 tween the user-space state at uaddr2 and the kernel state; that
734 is, the kernel detected a waiter which waits via FUTEX_WAIT or
735 FUTEX_WAIT_BITSET on uaddr2.
736
737 EINVAL (FUTEX_CMP_REQUEUE_PI) The kernel detected an inconsistency be‐
738 tween the user-space state at uaddr and the kernel state; that
739 is, the kernel detected a waiter which waits via FUTEX_WAIT or
740 FUTEX_WAIT_BITSET on uaddr.
741
742 EINVAL (FUTEX_CMP_REQUEUE_PI) The kernel detected an inconsistency be‐
743 tween the user-space state at uaddr and the kernel state; that
744 is, the kernel detected a waiter which waits on uaddr via FU‐
745 TEX_LOCK_PI (instead of FUTEX_WAIT_REQUEUE_PI).
746
747 EINVAL (FUTEX_CMP_REQUEUE_PI) An attempt was made to requeue a waiter
748 to a futex other than that specified by the matching FU‐
749 TEX_WAIT_REQUEUE_PI call for that waiter.
750
751 EINVAL (FUTEX_CMP_REQUEUE_PI) The val argument is not 1.
752
753 EINVAL Invalid argument.
754
755 ENFILE (FUTEX_FD) The system-wide limit on the total number of open
756 files has been reached.
757
758 ENOMEM (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI) The ker‐
759 nel could not allocate memory to hold state information.
760
761 ENOSYS Invalid operation specified in futex_op.
762
763 ENOSYS The FUTEX_CLOCK_REALTIME option was specified in futex_op, but
764 the accompanying operation was neither FUTEX_WAIT, FU‐
765 TEX_WAIT_BITSET, nor FUTEX_WAIT_REQUEUE_PI.
766
767 ENOSYS (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_UNLOCK_PI, FUTEX_CMP_RE‐
768 QUEUE_PI, FUTEX_WAIT_REQUEUE_PI) A run-time check determined
769 that the operation is not available. The PI-futex operations
770 are not implemented on all architectures and are not supported
771 on some CPU variants.
772
773 EPERM (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI) The
774 caller is not allowed to attach itself to the futex at uaddr
775 (for FUTEX_CMP_REQUEUE_PI: the futex at uaddr2). (This may be
776 caused by a state corruption in user space.)
777
778 EPERM (FUTEX_UNLOCK_PI) The caller does not own the lock represented
779 by the futex word.
780
781 ESRCH (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI) The
782 thread ID in the futex word at uaddr does not exist.
783
784 ESRCH (FUTEX_CMP_REQUEUE_PI) The thread ID in the futex word at uaddr2
785 does not exist.
786
787 ETIMEDOUT
788 The operation in futex_op employed the timeout specified in
789 timeout, and the timeout expired before the operation completed.
790
792 Futexes were first made available in a stable kernel release with Linux
793 2.6.0.
794
795 Initial futex support was merged in Linux 2.5.7 but with different se‐
796 mantics from what was described above. A four-argument system call
797 with the semantics described in this page was introduced in Linux
798 2.5.40. A fifth argument was added in Linux 2.5.70, and a sixth argu‐
799 ment was added in Linux 2.6.7.
800
802 This system call is Linux-specific.
803
805 Several higher-level programming abstractions are implemented via fu‐
806 texes, including POSIX semaphores and various POSIX threads synchro‐
807 nization mechanisms (mutexes, condition variables, read-write locks,
808 and barriers).
809
811 The program below demonstrates use of futexes in a program where a par‐
812 ent process and a child process use a pair of futexes located inside a
813 shared anonymous mapping to synchronize access to a shared resource:
814 the terminal. The two processes each write nloops (a command-line ar‐
815 gument that defaults to 5 if omitted) messages to the terminal and em‐
816 ploy a synchronization protocol that ensures that they alternate in
817 writing messages. Upon running this program we see output such as the
818 following:
819
820 $ ./futex_demo
821 Parent (18534) 0
822 Child (18535) 0
823 Parent (18534) 1
824 Child (18535) 1
825 Parent (18534) 2
826 Child (18535) 2
827 Parent (18534) 3
828 Child (18535) 3
829 Parent (18534) 4
830 Child (18535) 4
831
832 Program source
833
834 /* futex_demo.c
835
836 Usage: futex_demo [nloops]
837 (Default: 5)
838
839 Demonstrate the use of futexes in a program where parent and child
840 use a pair of futexes located inside a shared anonymous mapping to
841 synchronize access to a shared resource: the terminal. The two
842 processes each write 'num-loops' messages to the terminal and employ
843 a synchronization protocol that ensures that they alternate in
844 writing messages.
845 */
846 #define _GNU_SOURCE
847 #include <stdio.h>
848 #include <errno.h>
849 #include <stdatomic.h>
850 #include <stdint.h>
851 #include <stdlib.h>
852 #include <unistd.h>
853 #include <sys/wait.h>
854 #include <sys/mman.h>
855 #include <sys/syscall.h>
856 #include <linux/futex.h>
857 #include <sys/time.h>
858
859 #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
860 } while (0)
861
862 static uint32_t *futex1, *futex2, *iaddr;
863
864 static int
865 futex(uint32_t *uaddr, int futex_op, uint32_t val,
866 const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3)
867 {
868 return syscall(SYS_futex, uaddr, futex_op, val,
869 timeout, uaddr2, val3);
870 }
871
872 /* Acquire the futex pointed to by 'futexp': wait for its value to
873 become 1, and then set the value to 0. */
874
875 static void
876 fwait(uint32_t *futexp)
877 {
878 long s;
879
880 /* atomic_compare_exchange_strong(ptr, oldval, newval)
881 atomically performs the equivalent of:
882
883 if (*ptr == *oldval)
884 *ptr = newval;
885
886 It returns true if the test yielded true and *ptr was updated. */
887
888 while (1) {
889
890 /* Is the futex available? */
891 const uint32_t one = 1;
892 if (atomic_compare_exchange_strong(futexp, &one, 0))
893 break; /* Yes */
894
895 /* Futex is not available; wait. */
896
897 s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0);
898 if (s == -1 && errno != EAGAIN)
899 errExit("futex-FUTEX_WAIT");
900 }
901 }
902
903 /* Release the futex pointed to by 'futexp': if the futex currently
904 has the value 0, set its value to 1 and the wake any futex waiters,
905 so that if the peer is blocked in fwait(), it can proceed. */
906
907 static void
908 fpost(uint32_t *futexp)
909 {
910 long s;
911
912 /* atomic_compare_exchange_strong() was described
913 in comments above. */
914
915 const uint32_t zero = 0;
916 if (atomic_compare_exchange_strong(futexp, &zero, 1)) {
917 s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0);
918 if (s == -1)
919 errExit("futex-FUTEX_WAKE");
920 }
921 }
922
923 int
924 main(int argc, char *argv[])
925 {
926 pid_t childPid;
927 int nloops;
928
929 setbuf(stdout, NULL);
930
931 nloops = (argc > 1) ? atoi(argv[1]) : 5;
932
933 /* Create a shared anonymous mapping that will hold the futexes.
934 Since the futexes are being shared between processes, we
935 subsequently use the "shared" futex operations (i.e., not the
936 ones suffixed "_PRIVATE"). */
937
938 iaddr = mmap(NULL, sizeof(*iaddr) * 2, PROT_READ | PROT_WRITE,
939 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
940 if (iaddr == MAP_FAILED)
941 errExit("mmap");
942
943 futex1 = &iaddr[0];
944 futex2 = &iaddr[1];
945
946 *futex1 = 0; /* State: unavailable */
947 *futex2 = 1; /* State: available */
948
949 /* Create a child process that inherits the shared anonymous
950 mapping. */
951
952 childPid = fork();
953 if (childPid == -1)
954 errExit("fork");
955
956 if (childPid == 0) { /* Child */
957 for (int j = 0; j < nloops; j++) {
958 fwait(futex1);
959 printf("Child (%jd) %d\n", (intmax_t) getpid(), j);
960 fpost(futex2);
961 }
962
963 exit(EXIT_SUCCESS);
964 }
965
966 /* Parent falls through to here. */
967
968 for (int j = 0; j < nloops; j++) {
969 fwait(futex2);
970 printf("Parent (%jd) %d\n", (intmax_t) getpid(), j);
971 fpost(futex1);
972 }
973
974 wait(NULL);
975
976 exit(EXIT_SUCCESS);
977 }
978
980 get_robust_list(2), restart_syscall(2), pthread_mutexattr_getproto‐
981 col(3), futex(7), sched(7)
982
983 The following kernel source files:
984
985 * Documentation/pi-futex.txt
986
987 * Documentation/futex-requeue-pi.txt
988
989 * Documentation/locking/rt-mutex.txt
990
991 * Documentation/locking/rt-mutex-design.txt
992
993 * Documentation/robust-futex-ABI.txt
994
995 Franke, H., Russell, R., and Kirwood, M., 2002. Fuss, Futexes and Fur‐
996 wocks: Fast Userlevel Locking in Linux (from proceedings of the Ottawa
997 Linux Symposium 2002),
998 ⟨http://kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf⟩
999
1000 Hart, D., 2009. A futex overview and update,
1001 ⟨http://lwn.net/Articles/360699/⟩
1002
1003 Hart, D. and Guniguntala, D., 2009. Requeue-PI: Making Glibc Condvars
1004 PI-Aware (from proceedings of the 2009 Real-Time Linux Workshop),
1005 ⟨http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf⟩
1006
1007 Drepper, U., 2011. Futexes Are Tricky,
1008 ⟨http://www.akkadia.org/drepper/futex.pdf⟩
1009
1010 Futex example library, futex-*.tar.bz2 at
1011 ⟨ftp://ftp.kernel.org/pub/linux/kernel/people/rusty/⟩
1012
1014 This page is part of release 5.12 of the Linux man-pages project. A
1015 description of the project, information about reporting bugs, and the
1016 latest version of this page, can be found at
1017 https://www.kernel.org/doc/man-pages/.
1018
1019
1020
1021Linux 2021-03-22 FUTEX(2)