USERFAULTFD(2)            Linux Programmer's Manual            USERFAULTFD(2)

NAME
userfaultfd - create a file descriptor for handling page faults in user
space

SYNOPSIS
#include <sys/types.h>
#include <linux/userfaultfd.h>

int userfaultfd(int flags);

Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
userfaultfd() creates a new userfaultfd object that can be used for
delegation of page-fault handling to a user-space application, and
returns a file descriptor that refers to the new object. The new
userfaultfd object is configured using ioctl(2).

Once the userfaultfd object is configured, the application can use
read(2) to receive userfaultfd notifications. Reads from the userfaultfd
may be blocking or non-blocking, depending on the value of flags used
when the userfaultfd was created or on subsequent calls to fcntl(2).

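As a minimal sketch of the fcntl(2) route mentioned above, the helper
below switches O_NONBLOCK on or off for an already open descriptor. The
name set_nonblocking() is hypothetical and is not part of any userfaultfd
API; the same call sequence works for any file descriptor.

```c
#include <fcntl.h>

/* Turn O_NONBLOCK on (enable != 0) or off (enable == 0) for fd.
   Returns 0 on success, or -1 with errno set by fcntl(). */
static int
set_nonblocking(int fd, int enable)
{
    int fl = fcntl(fd, F_GETFL);

    if (fl == -1)
        return -1;

    fl = enable ? (fl | O_NONBLOCK) : (fl & ~O_NONBLOCK);
    return fcntl(fd, F_SETFL, fl);
}
```

A descriptor created without O_NONBLOCK could be switched later with
set_nonblocking(uffd, 1), where uffd is the caller's own descriptor.
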
The following values may be bitwise ORed in flags to change the behavior
of userfaultfd():

O_CLOEXEC
    Enable the close-on-exec flag for the new userfaultfd file
    descriptor. See the description of the O_CLOEXEC flag in open(2).

O_NONBLOCK
    Enable non-blocking operation for the userfaultfd object. See the
    description of the O_NONBLOCK flag in open(2).

When the last file descriptor referring to a userfaultfd object is
closed, all memory ranges that were registered with the object are
unregistered and unread events are flushed.

Usage
The userfaultfd mechanism is designed to allow a thread in a
multithreaded program to perform user-space paging for the other threads
in the process. When a page fault occurs for one of the regions
registered to the userfaultfd object, the faulting thread is put to sleep
and an event is generated that can be read via the userfaultfd file
descriptor. The fault-handling thread reads events from this file
descriptor and services them using the operations described in
ioctl_userfaultfd(2). When servicing the page-fault events, the
fault-handling thread can trigger a wake-up for the sleeping thread.

It is possible for the faulting threads and the fault-handling threads
to run in the context of different processes. In this case, these
threads may belong to different programs, and the program that executes
the faulting threads will not necessarily cooperate with the program
that handles the page faults. In such non-cooperative mode, the process
that monitors userfaultfd and handles page faults needs to be aware of
the changes in the virtual memory layout of the faulting process to
avoid memory corruption.

Starting from Linux 4.11, userfaultfd can also notify the fault-handling
threads about changes in the virtual memory layout of the faulting
process. In addition, if the faulting process invokes fork(2), the
userfaultfd objects associated with the parent may be duplicated into
the child process and the userfaultfd monitor will be notified (via the
UFFD_EVENT_FORK described below) about the file descriptor associated
with the userfault objects created for the child process, which allows
the userfaultfd monitor to perform user-space paging for the child
process. Unlike page faults, which have to be synchronous and require an
explicit or implicit wakeup, all other events are delivered
asynchronously and the non-cooperative process resumes execution as soon
as the userfaultfd manager executes read(2). The userfaultfd manager
should carefully synchronize calls to UFFDIO_COPY with the processing of
events.

The current asynchronous model of event delivery is optimal for
single-threaded non-cooperative userfaultfd manager implementations.

Userfaultfd operation
After the userfaultfd object is created with userfaultfd(), the
application must enable it using the UFFDIO_API ioctl(2) operation. This
operation allows a handshake between the kernel and user space to
determine the API version and supported features. This operation must be
performed before any of the other ioctl(2) operations described below
(otherwise, those operations fail with the error EINVAL).

After a successful UFFDIO_API operation, the application then registers
memory address ranges using the UFFDIO_REGISTER ioctl(2) operation.
After successful completion of a UFFDIO_REGISTER operation, a page fault
occurring in the requested memory range, and satisfying the mode defined
at the registration time, will be forwarded by the kernel to the
user-space application. The application can then use the UFFDIO_COPY or
UFFDIO_ZEROPAGE ioctl(2) operations to resolve the page fault.

Starting from Linux 4.14, if the application sets the
UFFD_FEATURE_SIGBUS feature bit using the UFFDIO_API ioctl(2), no
page-fault notification will be forwarded to user space. Instead, a
SIGBUS signal is delivered to the faulting process. With this feature,
userfaultfd can be used for robustness purposes to simply catch any
access to areas within the registered address range that do not have
pages allocated, without having to listen to userfaultfd events. No
userfaultfd monitor will be required for dealing with such memory
accesses. For example, this feature can be useful for applications that
want to prevent the kernel from automatically allocating pages and
filling holes in sparse files when the hole is accessed through a memory
mapping.

The UFFD_FEATURE_SIGBUS feature is implicitly inherited through fork(2)
if used in combination with UFFD_FEATURE_EVENT_FORK.

Details of the various ioctl(2) operations can be found in
ioctl_userfaultfd(2).

Since Linux 4.11, events other than page-fault may be enabled during the
UFFDIO_API operation.

Before Linux 4.11, userfaultfd can be used only with anonymous private
memory mappings. Since Linux 4.11, userfaultfd can also be used with
hugetlbfs and shared memory mappings.

Reading from the userfaultfd structure
Each read(2) from the userfaultfd file descriptor returns one or more
uffd_msg structures, each of which describes a page-fault event or an
event required for the non-cooperative userfaultfd usage:

    struct uffd_msg {
        __u8  event;            /* Type of event */
        ...
        union {
            struct {
                __u64 flags;    /* Flags describing fault */
                __u64 address;  /* Faulting address */
            } pagefault;

            struct {            /* Since Linux 4.11 */
                __u32 ufd;      /* Userfault file descriptor
                                   of the child process */
            } fork;

            struct {            /* Since Linux 4.11 */
                __u64 from;     /* Old address of remapped area */
                __u64 to;       /* New address of remapped area */
                __u64 len;      /* Original mapping length */
            } remap;

            struct {            /* Since Linux 4.11 */
                __u64 start;    /* Start address of removed area */
                __u64 end;      /* End address of removed area */
            } remove;
            ...
        } arg;

        /* Padding fields omitted */
    } __packed;

If multiple events are available and the supplied buffer is large
enough, read(2) returns as many events as will fit in the supplied
buffer. If the buffer supplied to read(2) is smaller than the size of
the uffd_msg structure, the read(2) fails with the error EINVAL.

The fields set in the uffd_msg structure are as follows:

event
    The type of event. Depending on the event type, different fields of
    the arg union represent details required for the event processing.
    The non-page-fault events are generated only when the appropriate
    feature is enabled during the API handshake with the UFFDIO_API
    ioctl(2).

    The following values can appear in the event field:

    UFFD_EVENT_PAGEFAULT (since Linux 4.3)
        A page-fault event. The page-fault details are available in the
        pagefault field.

    UFFD_EVENT_FORK (since Linux 4.11)
        Generated when the faulting process invokes fork(2) (or clone(2)
        without the CLONE_VM flag). The event details are available in
        the fork field.

    UFFD_EVENT_REMAP (since Linux 4.11)
        Generated when the faulting process invokes mremap(2). The event
        details are available in the remap field.

    UFFD_EVENT_REMOVE (since Linux 4.11)
        Generated when the faulting process invokes madvise(2) with
        MADV_DONTNEED or MADV_REMOVE advice. The event details are
        available in the remove field.

    UFFD_EVENT_UNMAP (since Linux 4.11)
        Generated when the faulting process unmaps a memory range,
        either explicitly using munmap(2) or implicitly during mmap(2)
        or mremap(2). The event details are available in the remove
        field.

pagefault.address
    The address that triggered the page fault.

pagefault.flags
    A bit mask of flags that describe the event. For
    UFFD_EVENT_PAGEFAULT, the following flag may appear:

    UFFD_PAGEFAULT_FLAG_WRITE
        If the address is in a range that was registered with the
        UFFDIO_REGISTER_MODE_MISSING flag (see ioctl_userfaultfd(2)) and
        this flag is set, this is a write fault; otherwise it is a read
        fault.

fork.ufd
    The file descriptor associated with the userfault object created for
    the child created by fork(2).

remap.from
    The original address of the memory range that was remapped using
    mremap(2).

remap.to
    The new address of the memory range that was remapped using
    mremap(2).

remap.len
    The original length of the memory range that was remapped using
    mremap(2).

remove.start
    The start address of the memory range that was freed using
    madvise(2) or unmapped.

remove.end
    The end address of the memory range that was freed using madvise(2)
    or unmapped.

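The mapping from event type to arg union member listed above might be
exercised by a small decoder such as the following sketch.
describe_event() and its output strings are hypothetical and exist only
for illustration. The read/write distinction assumes the range was
registered with UFFDIO_REGISTER_MODE_MISSING, as noted above.

```c
#include <stdio.h>
#include <linux/userfaultfd.h>

/* Format a one-line description of *msg into buf (at most n bytes).
   Returns buf. */
static const char *
describe_event(const struct uffd_msg *msg, char *buf, size_t n)
{
    switch (msg->event) {
    case UFFD_EVENT_PAGEFAULT:
        snprintf(buf, n, "%s fault at 0x%llx",
                 (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
                     ? "write" : "read",
                 (unsigned long long) msg->arg.pagefault.address);
        break;
    case UFFD_EVENT_FORK:
        snprintf(buf, n, "fork: child uffd %u",
                 (unsigned int) msg->arg.fork.ufd);
        break;
    case UFFD_EVENT_REMAP:
        snprintf(buf, n, "remap: 0x%llx to 0x%llx, len %llu",
                 (unsigned long long) msg->arg.remap.from,
                 (unsigned long long) msg->arg.remap.to,
                 (unsigned long long) msg->arg.remap.len);
        break;
    case UFFD_EVENT_REMOVE:
    case UFFD_EVENT_UNMAP:      /* Both report their range in 'remove' */
        snprintf(buf, n, "%s: 0x%llx - 0x%llx",
                 msg->event == UFFD_EVENT_UNMAP ? "unmap" : "remove",
                 (unsigned long long) msg->arg.remove.start,
                 (unsigned long long) msg->arg.remove.end);
        break;
    default:
        snprintf(buf, n, "unknown event %u", (unsigned int) msg->event);
        break;
    }
    return buf;
}
```
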
A read(2) on a userfaultfd file descriptor can fail with the following
error:

EINVAL
    The userfaultfd object has not yet been enabled using the UFFDIO_API
    ioctl(2) operation.

If the O_NONBLOCK flag is enabled in the associated open file
description, the userfaultfd file descriptor can be monitored with
poll(2), select(2), and epoll(7). When events are available, the file
descriptor is indicated as readable. If the O_NONBLOCK flag is not
enabled, then poll(2) (always) indicates the file as having a POLLERR
condition, and select(2) indicates the file descriptor as both readable
and writable.

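A monitor that polls the descriptor may want to treat POLLERR as a
distinct outcome, since that is what a userfaultfd opened without
O_NONBLOCK reports. The following is a minimal sketch under that
assumption. wait_readable() is a hypothetical helper name.

```c
#include <poll.h>

/* Wait up to timeout_ms milliseconds (-1 means no timeout) for fd to
   become readable. Returns 1 if POLLIN is set, 0 if the wait timed out
   or only other conditions were reported, and -1 if poll() failed or
   reported POLLERR (errno is meaningful only when poll() failed). */
static int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int nready = poll(&pfd, 1, timeout_ms);

    if (nready == -1)
        return -1;
    if (nready == 0)
        return 0;
    if (pfd.revents & POLLERR)
        return -1;

    return (pfd.revents & POLLIN) ? 1 : 0;
}
```
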
RETURN VALUE
On success, userfaultfd() returns a new file descriptor that refers to
the userfaultfd object. On error, -1 is returned, and errno is set
appropriately.

ERRORS
EINVAL
    An unsupported value was specified in flags.

EMFILE
    The per-process limit on the number of open file descriptors has
    been reached.

ENFILE
    The system-wide limit on the total number of open files has been
    reached.

ENOMEM
    Insufficient kernel memory was available.

EPERM (since Linux 5.2)
    The caller is not privileged (does not have the CAP_SYS_PTRACE
    capability in the initial user namespace), and
    /proc/sys/vm/unprivileged_userfaultfd has the value 0.

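Before attempting the call, an unprivileged program could inspect the
sysctl file named in the EPERM entry above. The helper below is a sketch
with a hypothetical name. It assumes the file, when present, holds a
single decimal integer.

```c
#include <stdio.h>

/* Return the integer stored in /proc/sys/vm/unprivileged_userfaultfd,
   or -1 if the file cannot be opened or parsed (for example, on kernels
   that do not provide it). */
static int
unprivileged_userfaultfd_setting(void)
{
    FILE *fp = fopen("/proc/sys/vm/unprivileged_userfaultfd", "r");
    int val = -1;

    if (fp == NULL)
        return -1;

    if (fscanf(fp, "%d", &val) != 1)
        val = -1;

    fclose(fp);
    return val;
}
```
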
VERSIONS
The userfaultfd() system call first appeared in Linux 4.3.

The support for hugetlbfs and shared memory areas and non-page-fault
events was added in Linux 4.11.

CONFORMING TO
userfaultfd() is Linux-specific and should not be used in programs
intended to be portable.

NOTES
Glibc does not provide a wrapper for this system call; call it using
syscall(2).

The userfaultfd mechanism can be used as an alternative to traditional
user-space paging techniques based on the use of the SIGSEGV signal and
mmap(2). It can also be used to implement lazy restore for
checkpoint/restore mechanisms, as well as post-copy migration to allow
(nearly) uninterrupted execution when transferring virtual machines and
Linux containers from one host to another.

BUGS
If the UFFD_FEATURE_EVENT_FORK is enabled and a system call from the
fork(2) family is interrupted by a signal or failed, a stale userfaultfd
descriptor might be created. In this case, a spurious UFFD_EVENT_FORK
will be delivered to the userfaultfd monitor.

EXAMPLES
The program below demonstrates the use of the userfaultfd mechanism. The
program creates two threads, one of which acts as the page-fault handler
for the process, for the pages in a demand-zero paged region created
using mmap(2).

The program takes one command-line argument, which is the number of
pages that will be created in a mapping whose page faults will be
handled via userfaultfd. After creating a userfaultfd object, the
program then creates an anonymous private mapping of the specified size
and registers the address range of that mapping using the
UFFDIO_REGISTER ioctl(2) operation. The program then creates a second
thread that will perform the task of handling page faults.

The main thread then walks through the pages of the mapping fetching
bytes from successive pages. Because the pages have not yet been
accessed, the first access of a byte in each page will trigger a
page-fault event on the userfaultfd file descriptor.

Each of the page-fault events is handled by the second thread, which
sits in a loop processing input from the userfaultfd file descriptor. In
each loop iteration, the second thread first calls poll(2) to check the
state of the file descriptor, and then reads an event from the file
descriptor. All such events should be UFFD_EVENT_PAGEFAULT events, which
the thread handles by copying a page of data into the faulting region
using the UFFDIO_COPY ioctl(2) operation.

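Since UFFDIO_COPY works in units of pages, the handler has to round the
faulting address down to a page boundary before copying. That step can be
isolated as in the sketch below. page_floor() is a hypothetical helper,
and it assumes the page size is a power of two.

```c
#include <stdint.h>

/* Round addr down to the start of the page that contains it.
   page_size must be a power of two. */
static uint64_t
page_floor(uint64_t addr, uint64_t page_size)
{
    return addr & ~(page_size - 1);
}
```

With a 4096-byte page, page_floor(0x7fd30106c00f, 4096) yields
0x7fd30106c000, the address returned by mmap() in the sample run that
follows.
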
The following is an example of what we see when running the program:

    $ ./userfaultfd_demo 3
    Address returned by mmap() = 0x7fd30106c000

    fault_handler_thread():
        poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
        UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106c00f
            (uffdio_copy.copy returned 4096)
    Read address 0x7fd30106c00f in main(): A
    Read address 0x7fd30106c40f in main(): A
    Read address 0x7fd30106c80f in main(): A
    Read address 0x7fd30106cc0f in main(): A

    fault_handler_thread():
        poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
        UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106d00f
            (uffdio_copy.copy returned 4096)
    Read address 0x7fd30106d00f in main(): B
    Read address 0x7fd30106d40f in main(): B
    Read address 0x7fd30106d80f in main(): B
    Read address 0x7fd30106dc0f in main(): B

    fault_handler_thread():
        poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
        UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106e00f
            (uffdio_copy.copy returned 4096)
    Read address 0x7fd30106e00f in main(): C
    Read address 0x7fd30106e40f in main(): C
    Read address 0x7fd30106e80f in main(): C
    Read address 0x7fd30106ec0f in main(): C

Program source

    /* userfaultfd_demo.c

       Licensed under the GNU General Public License version 2 or later.
    */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <stdio.h>
    #include <linux/userfaultfd.h>
    #include <pthread.h>
    #include <errno.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <poll.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>

    #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                            } while (0)

    static int page_size;

    static void *
    fault_handler_thread(void *arg)
    {
        static struct uffd_msg msg;   /* Data read from userfaultfd */
        static int fault_cnt = 0;     /* Number of faults so far handled */
        long uffd;                    /* userfaultfd file descriptor */
        static char *page = NULL;
        struct uffdio_copy uffdio_copy;
        ssize_t nread;

        uffd = (long) arg;

        /* Create a page that will be copied into the faulting region */

        if (page == NULL) {
            page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (page == MAP_FAILED)
                errExit("mmap");
        }

        /* Loop, handling incoming events on the userfaultfd
           file descriptor */

        for (;;) {

            /* See what poll() tells us about the userfaultfd */

            struct pollfd pollfd;
            int nready;
            pollfd.fd = uffd;
            pollfd.events = POLLIN;
            nready = poll(&pollfd, 1, -1);
            if (nready == -1)
                errExit("poll");

            printf("\nfault_handler_thread():\n");
            printf("    poll() returns: nready = %d; "
                   "POLLIN = %d; POLLERR = %d\n", nready,
                   (pollfd.revents & POLLIN) != 0,
                   (pollfd.revents & POLLERR) != 0);

            /* Read an event from the userfaultfd */

            nread = read(uffd, &msg, sizeof(msg));
            if (nread == 0) {
                printf("EOF on userfaultfd!\n");
                exit(EXIT_FAILURE);
            }

            if (nread == -1)
                errExit("read");

            /* We expect only one kind of event; verify that assumption */

            if (msg.event != UFFD_EVENT_PAGEFAULT) {
                fprintf(stderr, "Unexpected event on userfaultfd\n");
                exit(EXIT_FAILURE);
            }

            /* Display info about the page-fault event */

            printf("    UFFD_EVENT_PAGEFAULT event: ");
            printf("flags = %llx; ", msg.arg.pagefault.flags);
            printf("address = %llx\n", msg.arg.pagefault.address);

            /* Copy the page pointed to by 'page' into the faulting
               region. Vary the contents that are copied in, so that it
               is more obvious that each fault is handled separately. */

            memset(page, 'A' + fault_cnt % 20, page_size);
            fault_cnt++;

            uffdio_copy.src = (unsigned long) page;

            /* We need to handle page faults in units of pages(!).
               So, round faulting address down to page boundary */

            uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
                                               ~(page_size - 1);
            uffdio_copy.len = page_size;
            uffdio_copy.mode = 0;
            uffdio_copy.copy = 0;
            if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
                errExit("ioctl-UFFDIO_COPY");

            printf("        (uffdio_copy.copy returned %lld)\n",
                    uffdio_copy.copy);
        }
    }

    int
    main(int argc, char *argv[])
    {
        long uffd;          /* userfaultfd file descriptor */
        char *addr;         /* Start of region handled by userfaultfd */
        unsigned long len;  /* Length of region handled by userfaultfd */
        pthread_t thr;      /* ID of thread that handles page faults */
        struct uffdio_api uffdio_api;
        struct uffdio_register uffdio_register;
        int s;

        if (argc != 2) {
            fprintf(stderr, "Usage: %s num-pages\n", argv[0]);
            exit(EXIT_FAILURE);
        }

        page_size = sysconf(_SC_PAGE_SIZE);
        len = strtoul(argv[1], NULL, 0) * page_size;

        /* Create and enable userfaultfd object */

        uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        if (uffd == -1)
            errExit("userfaultfd");

        uffdio_api.api = UFFD_API;
        uffdio_api.features = 0;
        if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
            errExit("ioctl-UFFDIO_API");

        /* Create a private anonymous mapping. The memory will be
           demand-zero paged (that is, not yet allocated). When we
           actually touch the memory, it will be allocated via
           the userfaultfd. */

        addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED)
            errExit("mmap");

        printf("Address returned by mmap() = %p\n", addr);

        /* Register the memory range of the mapping we just created for
           handling by the userfaultfd object. In mode, we request to track
           missing pages (i.e., pages that have not yet been faulted in). */

        uffdio_register.range.start = (unsigned long) addr;
        uffdio_register.range.len = len;
        uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
        if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
            errExit("ioctl-UFFDIO_REGISTER");

        /* Create a thread that will process the userfaultfd events */

        s = pthread_create(&thr, NULL, fault_handler_thread, (void *) uffd);
        if (s != 0) {
            errno = s;
            errExit("pthread_create");
        }

        /* Main thread now touches memory in the mapping, touching
           locations 1024 bytes apart. This will trigger userfaultfd
           events for all pages in the region. */

        unsigned long l;    /* Same type as len, so the comparison
                               below is not mixed signed/unsigned */
        l = 0xf;    /* Ensure that faulting address is not on a page
                       boundary, in order to test that we correctly
                       handle that case in fault_handler_thread() */
        while (l < len) {
            char c = addr[l];
            printf("Read address %p in main(): ", addr + l);
            printf("%c\n", c);
            l += 1024;
            usleep(100000);         /* Slow things down a little */
        }

        exit(EXIT_SUCCESS);
    }

SEE ALSO
fcntl(2), ioctl(2), ioctl_userfaultfd(2), madvise(2), mmap(2)

Documentation/admin-guide/mm/userfaultfd.rst in the Linux kernel source
tree

COLOPHON
This page is part of release 5.07 of the Linux man-pages project. A
description of the project, information about reporting bugs, and the
latest version of this page, can be found at
https://www.kernel.org/doc/man-pages/.

Linux                            2020-06-09                    USERFAULTFD(2)