USERFAULTFD(2)             Linux Programmer's Manual            USERFAULTFD(2)

NAME
6 userfaultfd - create a file descriptor for handling page faults in user
7 space
8
SYNOPSIS
10 #include <sys/types.h>
11 #include <linux/userfaultfd.h>
12
13 int userfaultfd(int flags);
14
15 Note: There is no glibc wrapper for this system call; see NOTES.
16
DESCRIPTION
18 userfaultfd() creates a new userfaultfd object that can be used for
19 delegation of page-fault handling to a user-space application, and
20 returns a file descriptor that refers to the new object. The new user‐
21 faultfd object is configured using ioctl(2).
22
23 Once the userfaultfd object is configured, the application can use
24 read(2) to receive userfaultfd notifications. The reads from user‐
25 faultfd may be blocking or non-blocking, depending on the value of
26 flags used for the creation of the userfaultfd or subsequent calls to
27 fcntl(2).
28
29 The following values may be bitwise ORed in flags to change the behav‐
30 ior of userfaultfd():
31
32 O_CLOEXEC
33 Enable the close-on-exec flag for the new userfaultfd file
34 descriptor. See the description of the O_CLOEXEC flag in
35 open(2).
36
37 O_NONBLOCK
Enable non-blocking operation for the userfaultfd object. See
the description of the O_NONBLOCK flag in open(2).
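
The creation step described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper name open_userfaultfd() is hypothetical, and it calls the system call through syscall(2) because glibc provides no wrapper. Depending on kernel version and configuration the call may fail, so the caller is expected to check the return value and errno.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>          /* O_CLOEXEC, O_NONBLOCK */
#include <sys/syscall.h>    /* __NR_userfaultfd */
#include <unistd.h>         /* syscall() */

/* Hypothetical helper (not a library function): create a userfaultfd
   object via the raw system call. 'flags' is 0 or a bitwise OR of
   O_CLOEXEC and O_NONBLOCK. Returns the new file descriptor, or -1
   with errno set. */
int
open_userfaultfd(int flags)
{
    return (int) syscall(__NR_userfaultfd, flags);
}
```

A caller would typically write open_userfaultfd(O_CLOEXEC | O_NONBLOCK) and treat a return value of -1 as a fatal error.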
40
41 When the last file descriptor referring to a userfaultfd object is
42 closed, all memory ranges that were registered with the object are
43 unregistered and unread events are flushed.
44
45 Usage
46 The userfaultfd mechanism is designed to allow a thread in a multi‐
47 threaded program to perform user-space paging for the other threads in
48 the process. When a page fault occurs for one of the regions regis‐
49 tered to the userfaultfd object, the faulting thread is put to sleep
50 and an event is generated that can be read via the userfaultfd file
51 descriptor. The fault-handling thread reads events from this file
52 descriptor and services them using the operations described in
53 ioctl_userfaultfd(2). When servicing the page fault events, the fault-
54 handling thread can trigger a wake-up for the sleeping thread.
55
56 It is possible for the faulting threads and the fault-handling threads
57 to run in the context of different processes. In this case, these
58 threads may belong to different programs, and the program that executes
59 the faulting threads will not necessarily cooperate with the program
60 that handles the page faults. In such non-cooperative mode, the
61 process that monitors userfaultfd and handles page faults needs to be
62 aware of the changes in the virtual memory layout of the faulting
63 process to avoid memory corruption.
64
Starting from Linux 4.11, userfaultfd can also notify the fault-handling
threads about changes in the virtual memory layout of the faulting
process. In addition, if the faulting process invokes fork(2), the
userfaultfd objects associated with the parent may be duplicated into
the child process and the userfaultfd monitor will be notified (via the
UFFD_EVENT_FORK described below) about the file descriptor associated
with the userfaultfd objects created for the child process, which allows
the userfaultfd monitor to perform user-space paging for the child
process. Unlike page faults, which have to be synchronous and require
an explicit or implicit wakeup, all other events are delivered
asynchronously and the non-cooperative process resumes execution as soon
as the userfaultfd manager executes read(2). The userfaultfd manager
should carefully synchronize calls to UFFDIO_COPY with the processing
of events.
79
The current asynchronous model of event delivery is optimal for
single-threaded non-cooperative userfaultfd manager implementations.
82
83 Userfaultfd operation
84 After the userfaultfd object is created with userfaultfd(), the appli‐
85 cation must enable it using the UFFDIO_API ioctl(2) operation. This
86 operation allows a handshake between the kernel and user space to
87 determine the API version and supported features. This operation must
88 be performed before any of the other ioctl(2) operations described
89 below (or those operations fail with the EINVAL error).
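
The handshake described above can be sketched as follows. The helper name uffd_handshake() and its parameters are hypothetical, chosen for illustration; the sketch assumes that, on success, the kernel reports its feature mask in the features field of struct uffdio_api.

```c
#include <assert.h>
#include <linux/userfaultfd.h>  /* struct uffdio_api, UFFDIO_API, UFFD_API */
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: perform the UFFDIO_API handshake on 'uffd'.
   'wanted' is a mask of UFFD_FEATURE_* bits to request (0 for none).
   On success, returns 0 and stores the feature mask reported by the
   kernel in '*supported'. On failure, returns -1 with errno set by
   ioctl(2) and leaves '*supported' untouched. */
int
uffd_handshake(int uffd, __u64 wanted, __u64 *supported)
{
    struct uffdio_api api;

    assert(supported != NULL);

    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = wanted;

    if (ioctl(uffd, UFFDIO_API, &api) == -1)
        return -1;

    *supported = api.features;
    return 0;
}
```

Passing 0 for wanted requests no optional features, which matches what the example program at the end of this page does.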
90
91 After a successful UFFDIO_API operation, the application then registers
92 memory address ranges using the UFFDIO_REGISTER ioctl(2) operation.
93 After successful completion of a UFFDIO_REGISTER operation, a page
94 fault occurring in the requested memory range, and satisfying the mode
95 defined at the registration time, will be forwarded by the kernel to
96 the user-space application. The application can then use the UFF‐
97 DIO_COPY or UFFDIO_ZEROPAGE ioctl(2) operations to resolve the page
98 fault.
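
As a rough sketch of the register-then-resolve sequence described above, the two helpers below register a range for missing-page faults and resolve one fault with a zero-filled page. Both function names are hypothetical. The sketch assumes that the registered range is page-aligned and that page_size is a power of two; the full set of modes and error cases is described in ioctl_userfaultfd(2).

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: register [addr, addr + len) so that faults on
   missing pages are forwarded to 'uffd'. Assumes 'addr' and 'len' are
   page-aligned. Returns 0 on success, -1 with errno set on failure. */
int
uffd_register_missing(int uffd, void *addr, unsigned long len)
{
    struct uffdio_register reg;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (unsigned long) addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;

    return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

/* Hypothetical helper: resolve the fault at 'fault_addr' by mapping a
   zero-filled page over the page that contains it. Returns 0 on
   success, -1 with errno set on failure. */
int
uffd_resolve_zero(int uffd, unsigned long fault_addr,
                  unsigned long page_size)
{
    struct uffdio_zeropage zp;

    /* The rounding below is only valid for power-of-two page sizes */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    memset(&zp, 0, sizeof(zp));
    zp.range.start = fault_addr & ~(page_size - 1);
    zp.range.len = page_size;
    zp.mode = 0;

    return ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
}
```

A handler that needs to supply real data rather than zeros would use UFFDIO_COPY instead, as the example program at the end of this page does.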
99
100 Starting from Linux 4.14, if the application sets the UFFD_FEATURE_SIG‐
101 BUS feature bit using the UFFDIO_API ioctl(2), no page-fault notifica‐
102 tion will be forwarded to user space. Instead a SIGBUS signal is
103 delivered to the faulting process. With this feature, userfaultfd can
104 be used for robustness purposes to simply catch any access to areas
105 within the registered address range that do not have pages allocated,
106 without having to listen to userfaultfd events. No userfaultfd monitor
107 will be required for dealing with such memory accesses. For example,
108 this feature can be useful for applications that want to prevent the
109 kernel from automatically allocating pages and filling holes in sparse
110 files when the hole is accessed through a memory mapping.
111
The UFFD_FEATURE_SIGBUS feature is implicitly inherited through fork(2)
if used in combination with UFFD_FEATURE_EVENT_FORK.
114
115 Details of the various ioctl(2) operations can be found in ioctl_user‐
116 faultfd(2).
117
Since Linux 4.11, events other than page faults may be enabled during
the UFFDIO_API operation.
120
Before Linux 4.11, userfaultfd could be used only with anonymous private
memory mappings. Since Linux 4.11, userfaultfd can also be used with
hugetlbfs and shared memory mappings.
124
125 Reading from the userfaultfd structure
126 Each read(2) from the userfaultfd file descriptor returns one or more
127 uffd_msg structures, each of which describes a page-fault event or an
128 event required for the non-cooperative userfaultfd usage:
129
    struct uffd_msg {
        __u8  event;            /* Type of event */
        ...
        union {
            struct {
                __u64 flags;    /* Flags describing fault */
                __u64 address;  /* Faulting address */
            } pagefault;

            struct {            /* Since Linux 4.11 */
                __u32 ufd;      /* Userfault file descriptor
                                   of the child process */
            } fork;

            struct {            /* Since Linux 4.11 */
                __u64 from;     /* Old address of remapped area */
                __u64 to;       /* New address of remapped area */
                __u64 len;      /* Original mapping length */
            } remap;

            struct {            /* Since Linux 4.11 */
                __u64 start;    /* Start address of removed area */
                __u64 end;      /* End address of removed area */
            } remove;
            ...
        } arg;

        /* Padding fields omitted */
    } __packed;
159
160 If multiple events are available and the supplied buffer is large
161 enough, read(2) returns as many events as will fit in the supplied buf‐
162 fer. If the buffer supplied to read(2) is smaller than the size of the
163 uffd_msg structure, the read(2) fails with the error EINVAL.
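
The batching behavior described above can be sketched as follows. The helper name read_uffd_events() is hypothetical. It passes read(2) a buffer sized for several uffd_msg structures and derives the number of events from the byte count, on the assumption that the kernel returns only whole structures.

```c
#include <assert.h>
#include <linux/userfaultfd.h>  /* struct uffd_msg */
#include <sys/types.h>          /* ssize_t */
#include <unistd.h>             /* read() */

/* Hypothetical helper: read up to 'max' events from 'fd' into 'buf'.
   Returns the number of complete uffd_msg structures read (possibly 0),
   or -1 with errno set by read(2). */
ssize_t
read_uffd_events(int fd, struct uffd_msg *buf, size_t max)
{
    ssize_t nread;

    assert(buf != NULL && max > 0);

    nread = read(fd, buf, max * sizeof(*buf));
    if (nread == -1)
        return -1;

    return nread / (ssize_t) sizeof(*buf);
}
```

With a non-blocking descriptor, a return of -1 with errno set to EAGAIN simply means that no event is pending.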
164
165 The fields set in the uffd_msg structure are as follows:
166
event  The type of event. Depending on the event type, different
       fields of the arg union represent details required for the event
       processing. The non-page-fault events are generated only when
       the appropriate feature is enabled during the API handshake with
       the UFFDIO_API ioctl(2).
172
173 The following values can appear in the event field:
174
175 UFFD_EVENT_PAGEFAULT (since Linux 4.3)
176 A page-fault event. The page-fault details are available
177 in the pagefault field.
178
179 UFFD_EVENT_FORK (since Linux 4.11)
180 Generated when the faulting process invokes fork(2) (or
181 clone(2) without the CLONE_VM flag). The event details
182 are available in the fork field.
183
184 UFFD_EVENT_REMAP (since Linux 4.11)
185 Generated when the faulting process invokes mremap(2).
186 The event details are available in the remap field.
187
188 UFFD_EVENT_REMOVE (since Linux 4.11)
189 Generated when the faulting process invokes madvise(2)
190 with MADV_DONTNEED or MADV_REMOVE advice. The event
191 details are available in the remove field.
192
193 UFFD_EVENT_UNMAP (since Linux 4.11)
194 Generated when the faulting process unmaps a memory
195 range, either explicitly using munmap(2) or implicitly
196 during mmap(2) or mremap(2). The event details are
197 available in the remove field.
198
199 pagefault.address
200 The address that triggered the page fault.
201
202 pagefault.flags
203 A bit mask of flags that describe the event. For
204 UFFD_EVENT_PAGEFAULT, the following flag may appear:
205
206 UFFD_PAGEFAULT_FLAG_WRITE
       If the address is in a range that was registered with the
       UFFDIO_REGISTER_MODE_MISSING flag (see ioctl_userfaultfd(2))
       and this flag is set, this is a write fault; otherwise it is
       a read fault.
211
212 fork.ufd
213 The file descriptor associated with the userfault object created
214 for the child created by fork(2).
215
216 remap.from
217 The original address of the memory range that was remapped using
218 mremap(2).
219
220 remap.to
221 The new address of the memory range that was remapped using
222 mremap(2).
223
224 remap.len
225 The original length of the memory range that was remapped using
226 mremap(2).
227
remove.start
       The start address of the memory range that was freed using
       madvise(2) or unmapped.

remove.end
       The end address of the memory range that was freed using
       madvise(2) or unmapped.
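
The mapping from event type to the relevant member of the arg union, as listed above, can be sketched like this. Both function names are hypothetical. The read/write label printed for page faults assumes that the range was registered with UFFDIO_REGISTER_MODE_MISSING.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stdio.h>

/* Hypothetical helper: return a short name for the event type */
const char *
uffd_event_name(const struct uffd_msg *msg)
{
    assert(msg != NULL);

    switch (msg->event) {
    case UFFD_EVENT_PAGEFAULT: return "pagefault";
    case UFFD_EVENT_FORK:      return "fork";
    case UFFD_EVENT_REMAP:     return "remap";
    case UFFD_EVENT_REMOVE:    return "remove";
    case UFFD_EVENT_UNMAP:     return "unmap";
    default:                   return "unknown";
    }
}

/* Hypothetical helper: print the fields that are valid for the event */
void
print_uffd_event(const struct uffd_msg *msg)
{
    printf("%s event", uffd_event_name(msg));

    switch (msg->event) {
    case UFFD_EVENT_PAGEFAULT:
        /* Assumes a range registered with UFFDIO_REGISTER_MODE_MISSING */
        printf(": address = %llx (%s fault)\n",
               (unsigned long long) msg->arg.pagefault.address,
               (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) ?
                   "write" : "read");
        break;
    case UFFD_EVENT_FORK:
        printf(": child ufd = %u\n", (unsigned int) msg->arg.fork.ufd);
        break;
    case UFFD_EVENT_REMAP:
        printf(": from = %llx; to = %llx; len = %llx\n",
               (unsigned long long) msg->arg.remap.from,
               (unsigned long long) msg->arg.remap.to,
               (unsigned long long) msg->arg.remap.len);
        break;
    case UFFD_EVENT_REMOVE:
    case UFFD_EVENT_UNMAP:      /* Both report details in 'remove' */
        printf(": start = %llx; end = %llx\n",
               (unsigned long long) msg->arg.remove.start,
               (unsigned long long) msg->arg.remove.end);
        break;
    default:
        printf("\n");
        break;
    }
}
```

Note that UFFD_EVENT_UNMAP has no union member of its own; as stated above, its details are delivered in the remove field.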
235
236 A read(2) on a userfaultfd file descriptor can fail with the following
237 errors:
238
EINVAL The userfaultfd object has not yet been enabled using the
       UFFDIO_API ioctl(2) operation.
241
If the O_NONBLOCK flag is enabled in the associated open file
description, the userfaultfd file descriptor can be monitored with
poll(2), select(2), and epoll(7). When events are available, the file
descriptor is reported as readable. If the O_NONBLOCK flag is not
enabled, then poll(2) (always) indicates the file as having a POLLERR
condition, and select(2) indicates the file descriptor as both readable
and writable.
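
The monitoring behavior described above can be sketched with poll(2) as follows. The helper name uffd_wait_readable() is hypothetical. Treating POLLERR as a failure is a design choice of this sketch: it surfaces the case of a descriptor created without O_NONBLOCK instead of looping forever.

```c
#include <poll.h>

/* Hypothetical helper: wait up to 'timeout_ms' milliseconds (-1 means
   wait indefinitely) for an event on 'uffd'. Returns 1 if the
   descriptor is readable, 0 if the wait timed out, and -1 if poll(2)
   failed or reported POLLERR or POLLNVAL. */
int
uffd_wait_readable(int uffd, int timeout_ms)
{
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    int nready;

    nready = poll(&pfd, 1, timeout_ms);
    if (nready == -1)
        return -1;
    if (nready == 0)
        return 0;               /* Timed out; nothing to read */

    if (pfd.revents & (POLLERR | POLLNVAL))
        return -1;

    return (pfd.revents & POLLIN) ? 1 : 0;
}
```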
248
RETURN VALUE
250 On success, userfaultfd() returns a new file descriptor that refers to
251 the userfaultfd object. On error, -1 is returned, and errno is set
252 appropriately.
253
ERRORS
255 EINVAL An unsupported value was specified in flags.
256
EMFILE The per-process limit on the number of open file descriptors has
       been reached.
259
260 ENFILE The system-wide limit on the total number of open files has been
261 reached.
262
263 ENOMEM Insufficient kernel memory was available.
264
VERSIONS
266 The userfaultfd() system call first appeared in Linux 4.3.
267
Support for hugetlbfs and shared memory areas, and for non-page-fault
events, was added in Linux 4.11.
270
CONFORMING TO
272 userfaultfd() is Linux-specific and should not be used in programs
273 intended to be portable.
274
NOTES
276 Glibc does not provide a wrapper for this system call; call it using
277 syscall(2).
278
279 The userfaultfd mechanism can be used as an alternative to traditional
280 user-space paging techniques based on the use of the SIGSEGV signal and
281 mmap(2). It can also be used to implement lazy restore for check‐
282 point/restore mechanisms, as well as post-copy migration to allow
283 (nearly) uninterrupted execution when transferring virtual machines and
284 Linux containers from one host to another.
285
BUGS
If the UFFD_FEATURE_EVENT_FORK feature is enabled and a system call from
the fork(2) family is interrupted by a signal or fails, a stale
userfaultfd descriptor might be created. In this case, a spurious
UFFD_EVENT_FORK will be delivered to the userfaultfd monitor.
291
EXAMPLE
The program below demonstrates the use of the userfaultfd mechanism.
The program creates two threads, one of which acts as the page-fault
handler for the process, for the pages in a demand-zero paged region
created using mmap(2).
297
298 The program takes one command-line argument, which is the number of
299 pages that will be created in a mapping whose page faults will be han‐
300 dled via userfaultfd. After creating a userfaultfd object, the program
301 then creates an anonymous private mapping of the specified size and
302 registers the address range of that mapping using the UFFDIO_REGISTER
303 ioctl(2) operation. The program then creates a second thread that will
304 perform the task of handling page faults.
305
306 The main thread then walks through the pages of the mapping fetching
307 bytes from successive pages. Because the pages have not yet been
308 accessed, the first access of a byte in each page will trigger a page-
309 fault event on the userfaultfd file descriptor.
310
311 Each of the page-fault events is handled by the second thread, which
312 sits in a loop processing input from the userfaultfd file descriptor.
313 In each loop iteration, the second thread first calls poll(2) to check
314 the state of the file descriptor, and then reads an event from the file
315 descriptor. All such events should be UFFD_EVENT_PAGEFAULT events,
316 which the thread handles by copying a page of data into the faulting
317 region using the UFFDIO_COPY ioctl(2) operation.
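
Because faults are resolved in units of whole pages, the handler has to round each faulting address down to a page boundary before copying. A minimal sketch of that arithmetic follows; the function name page_align_down() is hypothetical, and the sketch assumes that the page size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round 'addr' down to the start of the page that
   contains it. Only valid when 'page_size' is a power of two. */
uint64_t
page_align_down(uint64_t addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    return addr & ~(page_size - 1);
}
```

With a page size of 4096, an address such as 0x7fd30106c00f rounds down to 0x7fd30106c000.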
318
319 The following is an example of what we see when running the program:
320
    $ ./userfaultfd_demo 3
    Address returned by mmap() = 0x7fd30106c000

    fault_handler_thread():
        poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
        UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106c00f
            (uffdio_copy.copy returned 4096)
    Read address 0x7fd30106c00f in main(): A
    Read address 0x7fd30106c40f in main(): A
    Read address 0x7fd30106c80f in main(): A
    Read address 0x7fd30106cc0f in main(): A

    fault_handler_thread():
        poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
        UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106d00f
            (uffdio_copy.copy returned 4096)
    Read address 0x7fd30106d00f in main(): B
    Read address 0x7fd30106d40f in main(): B
    Read address 0x7fd30106d80f in main(): B
    Read address 0x7fd30106dc0f in main(): B

    fault_handler_thread():
        poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
        UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106e00f
            (uffdio_copy.copy returned 4096)
    Read address 0x7fd30106e00f in main(): C
    Read address 0x7fd30106e40f in main(): C
    Read address 0x7fd30106e80f in main(): C
    Read address 0x7fd30106ec0f in main(): C
350
351 Program source
352
    /* userfaultfd_demo.c

       Licensed under the GNU General Public License version 2 or later.
    */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <stdio.h>
    #include <linux/userfaultfd.h>
    #include <pthread.h>
    #include <errno.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <poll.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>

    #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                            } while (0)

    static int page_size;

    static void *
    fault_handler_thread(void *arg)
    {
        static struct uffd_msg msg;   /* Data read from userfaultfd */
        static int fault_cnt = 0;     /* Number of faults so far handled */
        long uffd;                    /* userfaultfd file descriptor */
        static char *page = NULL;
        struct uffdio_copy uffdio_copy;
        ssize_t nread;

        uffd = (long) arg;

        /* Create a page that will be copied into the faulting region */

        if (page == NULL) {
            page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (page == MAP_FAILED)
                errExit("mmap");
        }

        /* Loop, handling incoming events on the userfaultfd
           file descriptor */

        for (;;) {

            /* See what poll() tells us about the userfaultfd */

            struct pollfd pollfd;
            int nready;
            pollfd.fd = uffd;
            pollfd.events = POLLIN;
            nready = poll(&pollfd, 1, -1);
            if (nready == -1)
                errExit("poll");

            printf("\nfault_handler_thread():\n");
            printf("    poll() returns: nready = %d; "
                    "POLLIN = %d; POLLERR = %d\n", nready,
                    (pollfd.revents & POLLIN) != 0,
                    (pollfd.revents & POLLERR) != 0);

            /* Read an event from the userfaultfd */

            nread = read(uffd, &msg, sizeof(msg));
            if (nread == 0) {
                printf("EOF on userfaultfd!\n");
                exit(EXIT_FAILURE);
            }

            if (nread == -1)
                errExit("read");

            /* We expect only one kind of event; verify that assumption */

            if (msg.event != UFFD_EVENT_PAGEFAULT) {
                fprintf(stderr, "Unexpected event on userfaultfd\n");
                exit(EXIT_FAILURE);
            }

            /* Display info about the page-fault event */

            printf("    UFFD_EVENT_PAGEFAULT event: ");
            printf("flags = %llx; ", msg.arg.pagefault.flags);
            printf("address = %llx\n", msg.arg.pagefault.address);

            /* Copy the page pointed to by 'page' into the faulting
               region. Vary the contents that are copied in, so that it
               is more obvious that each fault is handled separately. */

            memset(page, 'A' + fault_cnt % 20, page_size);
            fault_cnt++;

            uffdio_copy.src = (unsigned long) page;

            /* We need to handle page faults in units of pages(!).
               So, round faulting address down to page boundary */

            uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
                                               ~(page_size - 1);
            uffdio_copy.len = page_size;
            uffdio_copy.mode = 0;
            uffdio_copy.copy = 0;
            if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
                errExit("ioctl-UFFDIO_COPY");

            printf("        (uffdio_copy.copy returned %lld)\n",
                    uffdio_copy.copy);
        }
    }

    int
    main(int argc, char *argv[])
    {
        long uffd;          /* userfaultfd file descriptor */
        char *addr;         /* Start of region handled by userfaultfd */
        unsigned long len;  /* Length of region handled by userfaultfd */
        pthread_t thr;      /* ID of thread that handles page faults */
        struct uffdio_api uffdio_api;
        struct uffdio_register uffdio_register;
        int s;

        if (argc != 2) {
            fprintf(stderr, "Usage: %s num-pages\n", argv[0]);
            exit(EXIT_FAILURE);
        }

        page_size = sysconf(_SC_PAGE_SIZE);
        len = strtoul(argv[1], NULL, 0) * page_size;

        /* Create and enable userfaultfd object */

        uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        if (uffd == -1)
            errExit("userfaultfd");

        uffdio_api.api = UFFD_API;
        uffdio_api.features = 0;
        if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
            errExit("ioctl-UFFDIO_API");

        /* Create a private anonymous mapping. The memory will be
           demand-zero paged--that is, not yet allocated. When we
           actually touch the memory, it will be allocated via
           the userfaultfd. */

        addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED)
            errExit("mmap");

        printf("Address returned by mmap() = %p\n", addr);

        /* Register the memory range of the mapping we just created for
           handling by the userfaultfd object. In mode, we request to track
           missing pages (i.e., pages that have not yet been faulted in). */

        uffdio_register.range.start = (unsigned long) addr;
        uffdio_register.range.len = len;
        uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
        if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
            errExit("ioctl-UFFDIO_REGISTER");

        /* Create a thread that will process the userfaultfd events */

        s = pthread_create(&thr, NULL, fault_handler_thread, (void *) uffd);
        if (s != 0) {
            errno = s;
            errExit("pthread_create");
        }

        /* Main thread now touches memory in the mapping, touching
           locations 1024 bytes apart. This will trigger userfaultfd
           events for all pages in the region. */

        unsigned long l;    /* Unsigned, to match the type of 'len' */
        l = 0xf;    /* Ensure that faulting address is not on a page
                       boundary, in order to test that we correctly
                       handle that case in fault_handler_thread() */
        while (l < len) {
            char c = addr[l];
            printf("Read address %p in main(): ", addr + l);
            printf("%c\n", c);
            l += 1024;
            usleep(100000);         /* Slow things down a little */
        }

        exit(EXIT_SUCCESS);
    }
548
SEE ALSO
550 fcntl(2), ioctl(2), ioctl_userfaultfd(2), madvise(2), mmap(2)
551
552 Documentation/admin-guide/mm/userfaultfd.rst in the Linux kernel source
553 tree
554
COLOPHON
556 This page is part of release 5.02 of the Linux man-pages project. A
557 description of the project, information about reporting bugs, and the
558 latest version of this page, can be found at
559 https://www.kernel.org/doc/man-pages/.
560

Linux                             2019-03-06                    USERFAULTFD(2)