USERFAULTFD(2)            Linux Programmer's Manual           USERFAULTFD(2)

NAME
userfaultfd - create a file descriptor for handling page faults in user
space

SYNOPSIS
#include <sys/types.h>
#include <linux/userfaultfd.h>

int userfaultfd(int flags);

Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
userfaultfd() creates a new userfaultfd object that can be used for
delegation of page-fault handling to a user-space application, and
returns a file descriptor that refers to the new object. The new
userfaultfd object is configured using ioctl(2).

Once the userfaultfd object is configured, the application can use
read(2) to receive userfaultfd notifications. Reads from the userfaultfd
may be blocking or non-blocking, depending on the value of flags used
when creating the userfaultfd or on subsequent calls to fcntl(2).

The following values may be bitwise ORed in flags to change the behavior
of userfaultfd():

O_CLOEXEC
    Enable the close-on-exec flag for the new userfaultfd file
    descriptor. See the description of the O_CLOEXEC flag in open(2).

O_NONBLOCK
    Enable non-blocking operation for the userfaultfd object. See the
    description of the O_NONBLOCK flag in open(2).

When the last file descriptor referring to a userfaultfd object is
closed, all memory ranges that were registered with the object are
unregistered and unread events are flushed.

Usage
The userfaultfd mechanism is designed to allow a thread in a
multithreaded program to perform user-space paging for the other threads
in the process. When a page fault occurs for one of the regions
registered to the userfaultfd object, the faulting thread is put to
sleep and an event is generated that can be read via the userfaultfd
file descriptor. The fault-handling thread reads events from this file
descriptor and services them using the operations described in
ioctl_userfaultfd(2). When servicing the page-fault events, the
fault-handling thread can trigger a wake-up for the sleeping thread.

It is possible for the faulting threads and the fault-handling threads
to run in the context of different processes. In this case, these
threads may belong to different programs, and the program that executes
the faulting threads will not necessarily cooperate with the program
that handles the page faults. In such non-cooperative mode, the process
that monitors userfaultfd and handles page faults needs to be aware of
the changes in the virtual memory layout of the faulting process to
avoid memory corruption.

Starting from Linux 4.11, userfaultfd can also notify the fault-handling
threads about changes in the virtual memory layout of the faulting
process. In addition, if the faulting process invokes fork(2), the
userfaultfd objects associated with the parent may be duplicated into
the child process, and the userfaultfd monitor will be notified (via the
UFFD_EVENT_FORK event described below) about the file descriptor
associated with the userfaultfd objects created for the child process,
which allows the userfaultfd monitor to perform user-space paging for
the child process. Unlike page faults, which have to be synchronous and
require an explicit or implicit wake-up, all other events are delivered
asynchronously, and the non-cooperative process resumes execution as
soon as the userfaultfd manager executes read(2). The userfaultfd
manager should carefully synchronize calls to UFFDIO_COPY with the
processing of events.

The current asynchronous model of event delivery is optimal for
single-threaded non-cooperative userfaultfd manager implementations.

Userfaultfd operation
After the userfaultfd object is created with userfaultfd(), the
application must enable it using the UFFDIO_API ioctl(2) operation. This
operation allows a handshake between the kernel and user space to
determine the API version and supported features. This operation must be
performed before any of the other ioctl(2) operations described below;
otherwise, those operations fail with the error EINVAL.

After a successful UFFDIO_API operation, the application then registers
memory address ranges using the UFFDIO_REGISTER ioctl(2) operation.
After successful completion of a UFFDIO_REGISTER operation, a page fault
occurring in the requested memory range, and satisfying the mode defined
at registration time, will be forwarded by the kernel to the user-space
application. The application can then use the UFFDIO_COPY or
UFFDIO_ZEROPAGE ioctl(2) operations to resolve the page fault.

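As a rough illustration of these two steps, the sketch below registers a
range for missing-page faults and resolves a fault by mapping a
zero-filled page. The function names are hypothetical; the sketch
assumes that the range is page-aligned and that page_size is a power of
two.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Round 'addr' down to the start of its page ('page_size' must be a
   power of two). */
uint64_t
page_base(uint64_t addr, uint64_t page_size)
{
    return addr & ~(page_size - 1);
}

/* Hypothetical helper: ask the kernel to forward faults on missing
   pages in [start, start + len) to the userfaultfd 'uffd'. */
int
uffd_register_missing(int uffd, void *start, uint64_t len)
{
    struct uffdio_register reg;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (uint64_t) (uintptr_t) start;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

/* Hypothetical helper: resolve a fault at 'fault_addr' by mapping a
   zero-filled page over the page that contains it. */
int
uffd_resolve_with_zero_page(int uffd, uint64_t fault_addr,
                            uint64_t page_size)
{
    struct uffdio_zeropage zp;

    memset(&zp, 0, sizeof(zp));
    zp.range.start = page_base(fault_addr, page_size);
    zp.range.len = page_size;
    zp.mode = 0;
    return ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
}
```
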
Starting from Linux 4.14, if the application sets the
UFFD_FEATURE_SIGBUS feature bit using the UFFDIO_API ioctl(2), no
page-fault notification will be forwarded to user space. Instead, a
SIGBUS signal is delivered to the faulting process. With this feature,
userfaultfd can be used for robustness purposes to simply catch any
access to areas within the registered address range that do not have
pages allocated, without having to listen to userfaultfd events. No
userfaultfd monitor will be required for dealing with such memory
accesses. For example, this feature can be useful for applications that
want to prevent the kernel from automatically allocating pages and
filling holes in sparse files when the hole is accessed through a memory
mapping.

The UFFD_FEATURE_SIGBUS feature is implicitly inherited through fork(2)
if used in combination with UFFD_FEATURE_EVENT_FORK.

Details of the various ioctl(2) operations can be found in
ioctl_userfaultfd(2).

Since Linux 4.11, events other than page faults may be enabled during
the UFFDIO_API operation.

Before Linux 4.11, userfaultfd can be used only with anonymous private
memory mappings. Since Linux 4.11, userfaultfd can also be used with
hugetlbfs and shared memory mappings.

Reading from the userfaultfd structure
Each read(2) from the userfaultfd file descriptor returns one or more
uffd_msg structures, each of which describes a page-fault event or an
event required for the non-cooperative userfaultfd usage:

    struct uffd_msg {
        __u8  event;            /* Type of event */
        ...
        union {
            struct {
                __u64 flags;    /* Flags describing fault */
                __u64 address;  /* Faulting address */
            } pagefault;

            struct {            /* Since Linux 4.11 */
                __u32 ufd;      /* Userfault file descriptor
                                   of the child process */
            } fork;

            struct {            /* Since Linux 4.11 */
                __u64 from;     /* Old address of remapped area */
                __u64 to;       /* New address of remapped area */
                __u64 len;      /* Original mapping length */
            } remap;

            struct {            /* Since Linux 4.11 */
                __u64 start;    /* Start address of removed area */
                __u64 end;      /* End address of removed area */
            } remove;
            ...
        } arg;

        /* Padding fields omitted */
    } __packed;

If multiple events are available and the supplied buffer is large
enough, read(2) returns as many events as will fit in the supplied
buffer. If the buffer supplied to read(2) is smaller than the size of
the uffd_msg structure, the read(2) fails with the error EINVAL.

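A reader that wants to collect several events per call can size its
buffer as a multiple of sizeof(struct uffd_msg). A minimal sketch
follows; the helper name read_events() is hypothetical, and treating
EAGAIN as "no events pending" assumes a descriptor in non-blocking mode.

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: read up to 'max' events from 'fd' into
   'msgs'. Returns the number of complete events read, 0 if no event
   is pending on a non-blocking descriptor (EAGAIN), or -1 on any
   other error. */
long
read_events(int fd, struct uffd_msg *msgs, size_t max)
{
    ssize_t nread;

    nread = read(fd, msgs, max * sizeof(msgs[0]));
    if (nread == -1)
        return (errno == EAGAIN) ? 0 : -1;

    return (long) (nread / (ssize_t) sizeof(msgs[0]));
}
```

The return value tells the caller how many entries of msgs are valid and
should be examined.
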
The fields set in the uffd_msg structure are as follows:

event  The type of event. Depending on the event type, different fields
       of the arg union represent details required for the event
       processing. The non-page-fault events are generated only when
       the appropriate feature is enabled during the API handshake with
       the UFFDIO_API ioctl(2).

       The following values can appear in the event field:

       UFFD_EVENT_PAGEFAULT (since Linux 4.3)
           A page-fault event. The page-fault details are available in
           the pagefault field.

       UFFD_EVENT_FORK (since Linux 4.11)
           Generated when the faulting process invokes fork(2) (or
           clone(2) without the CLONE_VM flag). The event details are
           available in the fork field.

       UFFD_EVENT_REMAP (since Linux 4.11)
           Generated when the faulting process invokes mremap(2). The
           event details are available in the remap field.

       UFFD_EVENT_REMOVE (since Linux 4.11)
           Generated when the faulting process invokes madvise(2) with
           MADV_DONTNEED or MADV_REMOVE advice. The event details are
           available in the remove field.

       UFFD_EVENT_UNMAP (since Linux 4.11)
           Generated when the faulting process unmaps a memory range,
           either explicitly using munmap(2) or implicitly during
           mmap(2) or mremap(2). The event details are available in the
           remove field.

pagefault.address
    The address that triggered the page fault.

pagefault.flags
    A bit mask of flags that describe the event. For
    UFFD_EVENT_PAGEFAULT, the following flag may appear:

    UFFD_PAGEFAULT_FLAG_WRITE
        If the address is in a range that was registered with the
        UFFDIO_REGISTER_MODE_MISSING flag (see ioctl_userfaultfd(2))
        and this flag is set, this is a write fault; otherwise it is a
        read fault.

fork.ufd
    The file descriptor associated with the userfaultfd object created
    for the child created by fork(2).

remap.from
    The original address of the memory range that was remapped using
    mremap(2).

remap.to
    The new address of the memory range that was remapped using
    mremap(2).

remap.len
    The original length of the memory range that was remapped using
    mremap(2).

remove.start
    The start address of the memory range that was freed using
    madvise(2) or unmapped.

remove.end
    The end address of the memory range that was freed using madvise(2)
    or unmapped.

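The mapping from event type to arg field can be summarized in code. The
following sketch formats a one-line description of an event; the
function name describe_event() and the message wording are assumptions
made for illustration. Note that UFFD_EVENT_REMOVE and UFFD_EVENT_UNMAP
both report their range in the remove field.

```c
#include <linux/userfaultfd.h>
#include <stdio.h>

/* Hypothetical helper: write a description of '*msg' into 'buf'
   (at most 'size' bytes). Returns the value of snprintf(3). */
int
describe_event(const struct uffd_msg *msg, char *buf, size_t size)
{
    switch (msg->event) {
    case UFFD_EVENT_PAGEFAULT:
        return snprintf(buf, size, "%s fault at 0x%llx",
                (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) ?
                    "write" : "read",
                (unsigned long long) msg->arg.pagefault.address);

    case UFFD_EVENT_FORK:
        return snprintf(buf, size, "fork: child userfaultfd is fd %u",
                (unsigned int) msg->arg.fork.ufd);

    case UFFD_EVENT_REMAP:
        return snprintf(buf, size, "remap: 0x%llx -> 0x%llx (%llu bytes)",
                (unsigned long long) msg->arg.remap.from,
                (unsigned long long) msg->arg.remap.to,
                (unsigned long long) msg->arg.remap.len);

    case UFFD_EVENT_REMOVE:
    case UFFD_EVENT_UNMAP:
        /* Both events describe their range in the 'remove' field */
        return snprintf(buf, size, "%s: 0x%llx-0x%llx",
                (msg->event == UFFD_EVENT_REMOVE) ? "remove" : "unmap",
                (unsigned long long) msg->arg.remove.start,
                (unsigned long long) msg->arg.remove.end);

    default:
        return snprintf(buf, size, "unknown event 0x%x",
                (unsigned int) msg->event);
    }
}
```
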
A read(2) on a userfaultfd file descriptor can fail with the following
error:

EINVAL The userfaultfd object has not yet been enabled using the
       UFFDIO_API ioctl(2) operation.

If the O_NONBLOCK flag is enabled in the associated open file
description, the userfaultfd file descriptor can be monitored with
poll(2), select(2), and epoll(7). When events are available, the file
descriptor is indicated as readable. If the O_NONBLOCK flag is not
enabled, then poll(2) (always) indicates the file as having a POLLERR
condition, and select(2) indicates the file descriptor as both readable
and writable.

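A monitor that uses poll(2) therefore needs to treat POLLERR as a
failure rather than as readiness. The following sketch shows one way to
do so; the helper name wait_for_events() and its return convention are
assumptions made for illustration.

```c
#include <poll.h>

/* Hypothetical helper: wait up to 'timeout_ms' milliseconds (-1 to
   wait indefinitely) for 'fd' to become readable. Returns 1 if
   events can be read, 0 on timeout, and -1 on error, including the
   POLLERR condition reported for a userfaultfd that was opened
   without O_NONBLOCK. */
int
wait_for_events(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int nready;

    nready = poll(&pfd, 1, timeout_ms);
    if (nready == -1)
        return -1;
    if (nready == 0)
        return 0;
    if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
        return -1;

    return (pfd.revents & POLLIN) ? 1 : 0;
}
```
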
RETURN VALUE
On success, userfaultfd() returns a new file descriptor that refers to
the userfaultfd object. On error, -1 is returned, and errno is set
appropriately.

ERRORS
EINVAL An unsupported value was specified in flags.

EMFILE The per-process limit on the number of open file descriptors has
       been reached.

ENFILE The system-wide limit on the total number of open files has been
       reached.

ENOMEM Insufficient kernel memory was available.

VERSIONS
The userfaultfd() system call first appeared in Linux 4.3.

Support for hugetlbfs and shared memory areas and for non-page-fault
events was added in Linux 4.11.

CONFORMING TO
userfaultfd() is Linux-specific and should not be used in programs
intended to be portable.

NOTES
Glibc does not provide a wrapper for this system call; call it using
syscall(2).

The userfaultfd mechanism can be used as an alternative to traditional
user-space paging techniques based on the use of the SIGSEGV signal and
mmap(2). It can also be used to implement lazy restore for
checkpoint/restore mechanisms, as well as post-copy migration to allow
(nearly) uninterrupted execution when transferring virtual machines and
Linux containers from one host to another.

BUGS
If the UFFD_FEATURE_EVENT_FORK feature is enabled and a system call from
the fork(2) family is interrupted by a signal or fails, a stale
userfaultfd descriptor might be created. In this case, a spurious
UFFD_EVENT_FORK will be delivered to the userfaultfd monitor.

EXAMPLE
The program below demonstrates the use of the userfaultfd mechanism. The
program creates two threads, one of which acts as the page-fault handler
for the process, for the pages in a demand-zero paged region created
using mmap(2).

The program takes one command-line argument, which is the number of
pages that will be created in a mapping whose page faults will be
handled via userfaultfd. After creating a userfaultfd object, the
program then creates an anonymous private mapping of the specified size
and registers the address range of that mapping using the
UFFDIO_REGISTER ioctl(2) operation. The program then creates a second
thread that will perform the task of handling page faults.

The main thread then walks through the pages of the mapping fetching
bytes from successive pages. Because the pages have not yet been
accessed, the first access of a byte in each page will trigger a
page-fault event on the userfaultfd file descriptor.

Each of the page-fault events is handled by the second thread, which
sits in a loop processing input from the userfaultfd file descriptor. In
each loop iteration, the second thread first calls poll(2) to check the
state of the file descriptor, and then reads an event from the file
descriptor. All such events should be UFFD_EVENT_PAGEFAULT events, which
the thread handles by copying a page of data into the faulting region
using the UFFDIO_COPY ioctl(2) operation.

The following is an example of what we see when running the program:

    $ ./userfaultfd_demo 3
    Address returned by mmap() = 0x7fd30106c000

    fault_handler_thread():
        poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
        UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106c00f
            (uffdio_copy.copy returned 4096)
    Read address 0x7fd30106c00f in main(): A
    Read address 0x7fd30106c40f in main(): A
    Read address 0x7fd30106c80f in main(): A
    Read address 0x7fd30106cc0f in main(): A

    fault_handler_thread():
        poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
        UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106d00f
            (uffdio_copy.copy returned 4096)
    Read address 0x7fd30106d00f in main(): B
    Read address 0x7fd30106d40f in main(): B
    Read address 0x7fd30106d80f in main(): B
    Read address 0x7fd30106dc0f in main(): B

    fault_handler_thread():
        poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
        UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106e00f
            (uffdio_copy.copy returned 4096)
    Read address 0x7fd30106e00f in main(): C
    Read address 0x7fd30106e40f in main(): C
    Read address 0x7fd30106e80f in main(): C
    Read address 0x7fd30106ec0f in main(): C

Program source

    /* userfaultfd_demo.c

       Licensed under the GNU General Public License version 2 or later.
    */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <stdio.h>
    #include <linux/userfaultfd.h>
    #include <pthread.h>
    #include <errno.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <poll.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>

    #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                            } while (0)

    static int page_size;

    static void *
    fault_handler_thread(void *arg)
    {
        static struct uffd_msg msg;   /* Data read from userfaultfd */
        static int fault_cnt = 0;     /* Number of faults so far handled */
        long uffd;                    /* userfaultfd file descriptor */
        static char *page = NULL;
        struct uffdio_copy uffdio_copy;
        ssize_t nread;

        uffd = (long) arg;

        /* Create a page that will be copied into the faulting region */

        if (page == NULL) {
            page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (page == MAP_FAILED)
                errExit("mmap");
        }

        /* Loop, handling incoming events on the userfaultfd
           file descriptor */

        for (;;) {

            /* See what poll() tells us about the userfaultfd */

            struct pollfd pollfd;
            int nready;
            pollfd.fd = uffd;
            pollfd.events = POLLIN;
            nready = poll(&pollfd, 1, -1);
            if (nready == -1)
                errExit("poll");

            printf("\nfault_handler_thread():\n");
            printf("    poll() returns: nready = %d; "
                    "POLLIN = %d; POLLERR = %d\n", nready,
                    (pollfd.revents & POLLIN) != 0,
                    (pollfd.revents & POLLERR) != 0);

            /* Read an event from the userfaultfd */

            nread = read(uffd, &msg, sizeof(msg));
            if (nread == 0) {
                printf("EOF on userfaultfd!\n");
                exit(EXIT_FAILURE);
            }

            if (nread == -1)
                errExit("read");

            /* We expect only one kind of event; verify that assumption */

            if (msg.event != UFFD_EVENT_PAGEFAULT) {
                fprintf(stderr, "Unexpected event on userfaultfd\n");
                exit(EXIT_FAILURE);
            }

            /* Display info about the page-fault event */

            printf("    UFFD_EVENT_PAGEFAULT event: ");
            printf("flags = %llx; ", msg.arg.pagefault.flags);
            printf("address = %llx\n", msg.arg.pagefault.address);

            /* Copy the page pointed to by 'page' into the faulting
               region. Vary the contents that are copied in, so that it
               is more obvious that each fault is handled separately. */

            memset(page, 'A' + fault_cnt % 20, page_size);
            fault_cnt++;

            uffdio_copy.src = (unsigned long) page;

            /* We need to handle page faults in units of pages(!).
               So, round faulting address down to page boundary */

            uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
                                               ~(page_size - 1);
            uffdio_copy.len = page_size;
            uffdio_copy.mode = 0;
            uffdio_copy.copy = 0;
            if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
                errExit("ioctl-UFFDIO_COPY");

            printf("        (uffdio_copy.copy returned %lld)\n",
                    uffdio_copy.copy);
        }
    }

    int
    main(int argc, char *argv[])
    {
        long uffd;          /* userfaultfd file descriptor */
        char *addr;         /* Start of region handled by userfaultfd */
        unsigned long len;  /* Length of region handled by userfaultfd */
        pthread_t thr;      /* ID of thread that handles page faults */
        struct uffdio_api uffdio_api;
        struct uffdio_register uffdio_register;
        int s;

        if (argc != 2) {
            fprintf(stderr, "Usage: %s num-pages\n", argv[0]);
            exit(EXIT_FAILURE);
        }

        page_size = sysconf(_SC_PAGE_SIZE);
        len = strtoul(argv[1], NULL, 0) * page_size;

        /* Create and enable userfaultfd object */

        uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        if (uffd == -1)
            errExit("userfaultfd");

        uffdio_api.api = UFFD_API;
        uffdio_api.features = 0;
        if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
            errExit("ioctl-UFFDIO_API");

        /* Create a private anonymous mapping. The memory will be
           demand-zero paged--that is, not yet allocated. When we
           actually touch the memory, it will be allocated via
           the userfaultfd. */

        addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED)
            errExit("mmap");

        printf("Address returned by mmap() = %p\n", addr);

        /* Register the memory range of the mapping we just created for
           handling by the userfaultfd object. In mode, we request to track
           missing pages (i.e., pages that have not yet been faulted in). */

        uffdio_register.range.start = (unsigned long) addr;
        uffdio_register.range.len = len;
        uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
        if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
            errExit("ioctl-UFFDIO_REGISTER");

        /* Create a thread that will process the userfaultfd events */

        s = pthread_create(&thr, NULL, fault_handler_thread, (void *) uffd);
        if (s != 0) {
            errno = s;
            errExit("pthread_create");
        }

        /* Main thread now touches memory in the mapping, touching
           locations 1024 bytes apart. This will trigger userfaultfd
           events for all pages in the region. */

        unsigned long l;
        l = 0xf;    /* Ensure that faulting address is not on a page
                       boundary, in order to test that we correctly
                       handle that case in fault_handler_thread() */
        while (l < len) {
            char c = addr[l];
            printf("Read address %p in main(): ", addr + l);
            printf("%c\n", c);
            l += 1024;
            usleep(100000);         /* Slow things down a little */
        }

        exit(EXIT_SUCCESS);
    }

SEE ALSO
fcntl(2), ioctl(2), ioctl_userfaultfd(2), madvise(2), mmap(2)

Documentation/vm/userfaultfd.txt in the Linux kernel source tree

COLOPHON
This page is part of release 4.15 of the Linux man-pages project. A
description of the project, information about reporting bugs, and the
latest version of this page, can be found at
https://www.kernel.org/doc/man-pages/.

Linux                            2017-09-15                   USERFAULTFD(2)