USERFAULTFD(2)             Linux Programmer's Manual            USERFAULTFD(2)

NAME

       userfaultfd - create a file descriptor for handling page faults in user
       space

SYNOPSIS

       #include <sys/types.h>
       #include <linux/userfaultfd.h>

       int userfaultfd(int flags);

       Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION

       userfaultfd() creates a new userfaultfd object that can be used for
       delegation of page-fault handling to a user-space application, and
       returns a file descriptor that refers to the new object.  The new
       userfaultfd object is configured using ioctl(2).

       Once the userfaultfd object is configured, the application can use
       read(2) to receive userfaultfd notifications.  Reads from the
       userfaultfd may be blocking or non-blocking, depending on the value of
       flags used for the creation of the userfaultfd or subsequent calls to
       fcntl(2).

       The following values may be bitwise ORed in flags to change the
       behavior of userfaultfd():

       O_CLOEXEC
              Enable the close-on-exec flag for the new userfaultfd file
              descriptor.  See the description of the O_CLOEXEC flag in
              open(2).

       O_NONBLOCK
              Enable non-blocking operation for the userfaultfd object.  See
              the description of the O_NONBLOCK flag in open(2).

       When the last file descriptor referring to a userfaultfd object is
       closed, all memory ranges that were registered with the object are
       unregistered and unread events are flushed.
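
       As a minimal sketch of the creation step described above, the
       following wraps the raw system call.  The helper name uffd_create()
       is hypothetical and not part of any library; the sketch assumes that
       __NR_userfaultfd is defined by the installed kernel headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: create a userfaultfd object with the given
   flags (for example, O_CLOEXEC | O_NONBLOCK). Returns the new file
   descriptor, or -1 with errno set (for example, EINVAL for an
   unsupported flag, or another error on systems where the call is
   restricted or unavailable). */
static int
uffd_create(int flags)
{
    return (int) syscall(__NR_userfaultfd, flags);
}
```

       A caller would typically pass O_CLOEXEC | O_NONBLOCK and check the
       result against -1 before going any further.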

   Usage
       The userfaultfd mechanism is designed to allow a thread in a
       multithreaded program to perform user-space paging for the other
       threads in the process.  When a page fault occurs for one of the
       regions registered to the userfaultfd object, the faulting thread is
       put to sleep and an event is generated that can be read via the
       userfaultfd file descriptor.  The fault-handling thread reads events
       from this file descriptor and services them using the operations
       described in ioctl_userfaultfd(2).  When servicing the page-fault
       events, the fault-handling thread can trigger a wake-up for the
       sleeping thread.

       It is possible for the faulting threads and the fault-handling threads
       to run in the context of different processes.  In this case, these
       threads may belong to different programs, and the program that executes
       the faulting threads will not necessarily cooperate with the program
       that handles the page faults.  In such non-cooperative mode, the
       process that monitors userfaultfd and handles page faults needs to be
       aware of the changes in the virtual memory layout of the faulting
       process to avoid memory corruption.

       Starting from Linux 4.11, userfaultfd can also notify the
       fault-handling threads about changes in the virtual memory layout of
       the faulting process.  In addition, if the faulting process invokes
       fork(2), the userfaultfd objects associated with the parent may be
       duplicated into the child process and the userfaultfd monitor will be
       notified (via the UFFD_EVENT_FORK described below) about the file
       descriptor associated with the userfaultfd objects created for the
       child process, which allows the userfaultfd monitor to perform
       user-space paging for the child process.  Unlike page faults, which
       have to be synchronous and require an explicit or implicit wake-up,
       all other events are delivered asynchronously, and the non-cooperative
       process resumes execution as soon as the userfaultfd manager executes
       read(2).  The userfaultfd manager should carefully synchronize calls
       to UFFDIO_COPY with the processing of events.

       The current asynchronous model of event delivery is optimal for
       single-threaded non-cooperative userfaultfd manager implementations.

   Userfaultfd operation
       After the userfaultfd object is created with userfaultfd(), the
       application must enable it using the UFFDIO_API ioctl(2) operation.
       This operation allows a handshake between the kernel and user space to
       determine the API version and supported features.  This operation must
       be performed before any of the other ioctl(2) operations described
       below (otherwise, those operations fail with the error EINVAL).
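
       The handshake can be sketched as follows.  The helper name
       uffd_handshake() and its parameters are hypothetical; the structure
       and constants come from <linux/userfaultfd.h>, and the sketch assumes
       that the kernel reports its supported feature bits back in the
       features field, as described in ioctl_userfaultfd(2).

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: perform the UFFDIO_API handshake on 'uffd',
   requesting the feature bits in 'features' (0 for none).
   On success, returns 0 and, if 'supported' is not NULL, stores the
   feature bits reported by the kernel there. On failure, returns -1
   with errno set and leaves '*supported' untouched. */
static int
uffd_handshake(int uffd, unsigned long long features,
               unsigned long long *supported)
{
    struct uffdio_api api;

    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = features;

    if (ioctl(uffd, UFFDIO_API, &api) == -1)
        return -1;

    if (supported != NULL)
        *supported = api.features;
    return 0;
}
```

       Passing 0 for features requests only page-fault notifications, which
       matches what the example program at the end of this page does.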

       After a successful UFFDIO_API operation, the application then registers
       memory address ranges using the UFFDIO_REGISTER ioctl(2) operation.
       After successful completion of a UFFDIO_REGISTER operation, a page
       fault occurring in the requested memory range, and satisfying the mode
       defined at the registration time, will be forwarded by the kernel to
       the user-space application.  The application can then use the
       UFFDIO_COPY or UFFDIO_ZEROPAGE ioctl(2) operations to resolve the page
       fault.
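
       A rough sketch of registration, and of resolving a fault with a
       zero-filled page, might look like the following.  The helper names
       uffd_register_missing() and uffd_resolve_zero() are hypothetical, and
       the sketch assumes that the caller supplies page-aligned addresses
       and lengths.

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: register [addr, addr + len) with 'uffd' so that
   faults on missing pages are reported. 'addr' and 'len' are assumed
   to be page-aligned. Returns 0, or -1 with errno set. */
static int
uffd_register_missing(int uffd, void *addr, unsigned long len)
{
    struct uffdio_register reg;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (unsigned long) addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

/* Hypothetical helper: resolve a fault by mapping zero-filled pages
   over [page_addr, page_addr + len). Returns 0, or -1 with errno set. */
static int
uffd_resolve_zero(int uffd, unsigned long page_addr, unsigned long len)
{
    struct uffdio_zeropage zp;

    memset(&zp, 0, sizeof(zp));
    zp.range.start = page_addr;
    zp.range.len = len;
    zp.mode = 0;
    return ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
}
```

       UFFDIO_COPY follows the same pattern with a struct uffdio_copy that
       also names a source buffer; the example program at the end of this
       page shows that variant.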

       Starting from Linux 4.14, if the application sets the
       UFFD_FEATURE_SIGBUS feature bit using the UFFDIO_API ioctl(2), no
       page-fault notification will be forwarded to user space.  Instead, a
       SIGBUS signal is delivered to the faulting process.  With this
       feature, userfaultfd can be used for robustness purposes to simply
       catch any access to areas within the registered address range that do
       not have pages allocated, without having to listen to userfaultfd
       events.  No userfaultfd monitor will be required for dealing with such
       memory accesses.  For example, this feature can be useful for
       applications that want to prevent the kernel from automatically
       allocating pages and filling holes in sparse files when the hole is
       accessed through a memory mapping.

       The UFFD_FEATURE_SIGBUS feature is implicitly inherited through fork(2)
       if used in combination with UFFD_FEATURE_EVENT_FORK.

       Details of the various ioctl(2) operations can be found in
       ioctl_userfaultfd(2).

       Since Linux 4.11, events other than page-fault may be enabled during
       the UFFDIO_API operation.

       Before Linux 4.11, userfaultfd could be used only with anonymous
       private memory mappings.  Since Linux 4.11, userfaultfd can also be
       used with hugetlbfs and shared memory mappings.

   Reading from the userfaultfd structure
       Each read(2) from the userfaultfd file descriptor returns one or more
       uffd_msg structures, each of which describes a page-fault event or an
       event required for the non-cooperative userfaultfd usage:

           struct uffd_msg {
               __u8  event;            /* Type of event */
               ...
               union {
                   struct {
                       __u64 flags;    /* Flags describing fault */
                       __u64 address;  /* Faulting address */
                   } pagefault;

                   struct {            /* Since Linux 4.11 */
                       __u32 ufd;      /* Userfault file descriptor
                                          of the child process */
                   } fork;

                   struct {            /* Since Linux 4.11 */
                       __u64 from;     /* Old address of remapped area */
                       __u64 to;       /* New address of remapped area */
                       __u64 len;      /* Original mapping length */
                   } remap;

                   struct {            /* Since Linux 4.11 */
                       __u64 start;    /* Start address of removed area */
                       __u64 end;      /* End address of removed area */
                   } remove;
                   ...
               } arg;

               /* Padding fields omitted */
           } __packed;

       If multiple events are available and the supplied buffer is large
       enough, read(2) returns as many events as will fit in the supplied
       buffer.  If the buffer supplied to read(2) is smaller than the size of
       the uffd_msg structure, the read(2) fails with the error EINVAL.
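
       Reading a batch of events can be sketched as below.  The helper name
       uffd_read_events() is hypothetical.  The sketch assumes that a
       non-blocking descriptor with no pending event makes read(2) fail with
       EAGAIN (an error not listed on this page), and it reports that case
       as zero events.

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to 'max' events from 'uffd' into 'msgs'.
   Returns the number of complete uffd_msg structures read (0 when the
   read fails with EAGAIN or returns no data), or -1 with errno set.
   A 'max' of 0 is rejected here with EINVAL, mirroring the error that
   read(2) reports for a buffer smaller than one structure. */
static ssize_t
uffd_read_events(int uffd, struct uffd_msg *msgs, size_t max)
{
    ssize_t nread;

    if (max == 0) {
        errno = EINVAL;
        return -1;
    }

    nread = read(uffd, msgs, max * sizeof(*msgs));
    if (nread == -1)
        return (errno == EAGAIN) ? 0 : -1;

    /* The byte count is converted to a count of whole structures */
    return nread / (ssize_t) sizeof(*msgs);
}
```

       Sizing the buffer for several structures lets a busy monitor drain
       multiple pending events with a single system call.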

       The fields set in the uffd_msg structure are as follows:

       event  The type of event.  Depending on the event type, different
              fields of the arg union represent details required for the event
              processing.  The non-page-fault events are generated only when
              the appropriate feature is enabled during the API handshake with
              the UFFDIO_API ioctl(2).

              The following values can appear in the event field:

              UFFD_EVENT_PAGEFAULT (since Linux 4.3)
                     A page-fault event.  The page-fault details are available
                     in the pagefault field.

              UFFD_EVENT_FORK (since Linux 4.11)
                     Generated when the faulting process invokes fork(2) (or
                     clone(2) without the CLONE_VM flag).  The event details
                     are available in the fork field.

              UFFD_EVENT_REMAP (since Linux 4.11)
                     Generated when the faulting process invokes mremap(2).
                     The event details are available in the remap field.

              UFFD_EVENT_REMOVE (since Linux 4.11)
                     Generated when the faulting process invokes madvise(2)
                     with MADV_DONTNEED or MADV_REMOVE advice.  The event
                     details are available in the remove field.

              UFFD_EVENT_UNMAP (since Linux 4.11)
                     Generated when the faulting process unmaps a memory
                     range, either explicitly using munmap(2) or implicitly
                     during mmap(2) or mremap(2).  The event details are
                     available in the remove field.

       pagefault.address
              The address that triggered the page fault.

       pagefault.flags
              A bit mask of flags that describe the event.  For
              UFFD_EVENT_PAGEFAULT, the following flag may appear:

              UFFD_PAGEFAULT_FLAG_WRITE
                     If the address is in a range that was registered with the
                     UFFDIO_REGISTER_MODE_MISSING flag (see
                     ioctl_userfaultfd(2)) and this flag is set, this is a
                     write fault; otherwise it is a read fault.

       fork.ufd
              The file descriptor associated with the userfaultfd object
              created for the child created by fork(2).

       remap.from
              The original address of the memory range that was remapped using
              mremap(2).

       remap.to
              The new address of the memory range that was remapped using
              mremap(2).

       remap.len
              The original length of the memory range that was remapped using
              mremap(2).

       remove.start
              The start address of the memory range that was freed using
              madvise(2) or unmapped.

       remove.end
              The end address of the memory range that was freed using
              madvise(2) or unmapped.
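
       To illustrate how the event field selects the member of the arg union,
       the following sketch formats a one-line description of a message.
       The helper name uffd_describe() and the output wording are
       hypothetical; the read/write distinction assumes a range registered
       with UFFDIO_REGISTER_MODE_MISSING.

```c
#include <linux/userfaultfd.h>
#include <stdio.h>

/* Hypothetical helper: format a one-line description of 'msg' into
   'buf' (of size 'size'), choosing the member of the arg union that
   matches the event type. Returns the snprintf(3) result. */
static int
uffd_describe(const struct uffd_msg *msg, char *buf, size_t size)
{
    switch (msg->event) {
    case UFFD_EVENT_PAGEFAULT:
        return snprintf(buf, size, "pagefault %s at 0x%llx",
                (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) ?
                        "write" : "read",
                (unsigned long long) msg->arg.pagefault.address);
    case UFFD_EVENT_FORK:
        return snprintf(buf, size, "fork: child ufd %u",
                (unsigned int) msg->arg.fork.ufd);
    case UFFD_EVENT_REMAP:
        return snprintf(buf, size, "remap 0x%llx -> 0x%llx (len %llu)",
                (unsigned long long) msg->arg.remap.from,
                (unsigned long long) msg->arg.remap.to,
                (unsigned long long) msg->arg.remap.len);
    case UFFD_EVENT_REMOVE:
    case UFFD_EVENT_UNMAP:
        /* Both events report their range in the remove field */
        return snprintf(buf, size, "%s 0x%llx-0x%llx",
                (msg->event == UFFD_EVENT_REMOVE) ? "remove" : "unmap",
                (unsigned long long) msg->arg.remove.start,
                (unsigned long long) msg->arg.remove.end);
    default:
        return snprintf(buf, size, "unknown event %u",
                (unsigned int) msg->event);
    }
}
```

       A monitor that enables the non-page-fault features would dispatch on
       the same switch, replacing the formatting with real handling.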

       A read(2) on a userfaultfd file descriptor can fail with the following
       errors:

       EINVAL The userfaultfd object has not yet been enabled using the
              UFFDIO_API ioctl(2) operation.

       If the O_NONBLOCK flag is enabled in the associated open file
       description, the userfaultfd file descriptor can be monitored with
       poll(2), select(2), and epoll(7).  When events are available, the file
       descriptor is indicated as readable.  If the O_NONBLOCK flag is not
       enabled, then poll(2) (always) indicates the file as having a POLLERR
       condition, and select(2) indicates the file descriptor as both readable
       and writable.
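
       Waiting for events with poll(2) can be sketched as follows.  The
       helper name wait_readable() and its return convention are
       hypothetical; it treats POLLERR as a failure, which is what a
       userfaultfd created without O_NONBLOCK would report.

```c
#include <poll.h>

/* Hypothetical helper: wait up to 'timeout_ms' milliseconds (-1 to wait
   indefinitely) for 'fd' to become readable. Returns 1 if POLLIN is
   reported, 0 on timeout, or -1 if poll(2) fails, reports POLLERR, or
   reports any other condition without POLLIN. */
static int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int nready = poll(&pfd, 1, timeout_ms);

    if (nready == -1)
        return -1;
    if (nready == 0)
        return 0;
    if (pfd.revents & POLLERR)
        return -1;
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

       The helper works on any pollable descriptor, so it can be exercised
       with a pipe before being pointed at a userfaultfd.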

RETURN VALUE

       On success, userfaultfd() returns a new file descriptor that refers to
       the userfaultfd object.  On error, -1 is returned, and errno is set
       appropriately.

ERRORS

       EINVAL An unsupported value was specified in flags.

       EMFILE The per-process limit on the number of open file descriptors has
              been reached.

       ENFILE The system-wide limit on the total number of open files has been
              reached.

       ENOMEM Insufficient kernel memory was available.

VERSIONS

       The userfaultfd() system call first appeared in Linux 4.3.

       Support for hugetlbfs and shared memory areas and non-page-fault
       events was added in Linux 4.11.

CONFORMING TO

       userfaultfd() is Linux-specific and should not be used in programs
       intended to be portable.

NOTES

       Glibc does not provide a wrapper for this system call; call it using
       syscall(2).

       The userfaultfd mechanism can be used as an alternative to traditional
       user-space paging techniques based on the use of the SIGSEGV signal and
       mmap(2).  It can also be used to implement lazy restore for
       checkpoint/restore mechanisms, as well as post-copy migration to allow
       (nearly) uninterrupted execution when transferring virtual machines and
       Linux containers from one host to another.

BUGS

       If the UFFD_FEATURE_EVENT_FORK feature is enabled and a system call
       from the fork(2) family is interrupted by a signal or fails, a stale
       userfaultfd descriptor might be created.  In this case, a spurious
       UFFD_EVENT_FORK will be delivered to the userfaultfd monitor.

EXAMPLE

       The program below demonstrates the use of the userfaultfd mechanism.
       The program creates two threads, one of which acts as the page-fault
       handler for the process, for the pages in a demand-zero paged region
       created using mmap(2).

       The program takes one command-line argument, which is the number of
       pages that will be created in a mapping whose page faults will be
       handled via userfaultfd.  After creating a userfaultfd object, the
       program then creates an anonymous private mapping of the specified size
       and registers the address range of that mapping using the
       UFFDIO_REGISTER ioctl(2) operation.  The program then creates a second
       thread that will perform the task of handling page faults.

       The main thread then walks through the pages of the mapping fetching
       bytes from successive pages.  Because the pages have not yet been
       accessed, the first access of a byte in each page will trigger a
       page-fault event on the userfaultfd file descriptor.

       Each of the page-fault events is handled by the second thread, which
       sits in a loop processing input from the userfaultfd file descriptor.
       In each loop iteration, the second thread first calls poll(2) to check
       the state of the file descriptor, and then reads an event from the file
       descriptor.  All such events should be UFFD_EVENT_PAGEFAULT events,
       which the thread handles by copying a page of data into the faulting
       region using the UFFDIO_COPY ioctl(2) operation.
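
       Because UFFDIO_COPY works in units of pages, the handler has to round
       the faulting address down to a page boundary before copying.  A
       minimal sketch of that step follows; the helper name page_base() is
       hypothetical, and the page size is assumed to be a power of two.

```c
#include <stdint.h>

/* Hypothetical helper: round a faulting address down to the start of
   its page. 'page_size' is assumed to be a power of two, as returned
   by sysconf(_SC_PAGE_SIZE). */
static uint64_t
page_base(uint64_t fault_addr, uint64_t page_size)
{
    return fault_addr & ~(page_size - 1);
}
```

       With a 4096-byte page, a fault at address 0x7fd30106c00f resolves to
       the page starting at 0x7fd30106c000.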

       The following is an example of what we see when running the program:

           $ ./userfaultfd_demo 3
           Address returned by mmap() = 0x7fd30106c000

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106c00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106c00f in main(): A
           Read address 0x7fd30106c40f in main(): A
           Read address 0x7fd30106c80f in main(): A
           Read address 0x7fd30106cc0f in main(): A

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106d00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106d00f in main(): B
           Read address 0x7fd30106d40f in main(): B
           Read address 0x7fd30106d80f in main(): B
           Read address 0x7fd30106dc0f in main(): B

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106e00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106e00f in main(): C
           Read address 0x7fd30106e40f in main(): C
           Read address 0x7fd30106e80f in main(): C
           Read address 0x7fd30106ec0f in main(): C

   Program source

       /* userfaultfd_demo.c

          Licensed under the GNU General Public License version 2 or later.
       */
       #define _GNU_SOURCE
       #include <sys/types.h>
       #include <stdio.h>
       #include <linux/userfaultfd.h>
       #include <pthread.h>
       #include <errno.h>
       #include <unistd.h>
       #include <stdlib.h>
       #include <fcntl.h>
       #include <signal.h>
       #include <poll.h>
       #include <string.h>
       #include <sys/mman.h>
       #include <sys/syscall.h>
       #include <sys/ioctl.h>

       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                               } while (0)

       static int page_size;

       static void *
       fault_handler_thread(void *arg)
       {
           static struct uffd_msg msg;   /* Data read from userfaultfd */
           static int fault_cnt = 0;     /* Number of faults so far handled */
           long uffd;                    /* userfaultfd file descriptor */
           static char *page = NULL;
           struct uffdio_copy uffdio_copy;
           ssize_t nread;

           uffd = (long) arg;

           /* Create a page that will be copied into the faulting region */

           if (page == NULL) {
               page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
               if (page == MAP_FAILED)
                   errExit("mmap");
           }

           /* Loop, handling incoming events on the userfaultfd
              file descriptor */

           for (;;) {

               /* See what poll() tells us about the userfaultfd */

               struct pollfd pollfd;
               int nready;
               pollfd.fd = uffd;
               pollfd.events = POLLIN;
               nready = poll(&pollfd, 1, -1);
               if (nready == -1)
                   errExit("poll");

               printf("\nfault_handler_thread():\n");
               printf("    poll() returns: nready = %d; "
                       "POLLIN = %d; POLLERR = %d\n", nready,
                       (pollfd.revents & POLLIN) != 0,
                       (pollfd.revents & POLLERR) != 0);

               /* Read an event from the userfaultfd */

               nread = read(uffd, &msg, sizeof(msg));
               if (nread == 0) {
                   printf("EOF on userfaultfd!\n");
                   exit(EXIT_FAILURE);
               }

               if (nread == -1)
                   errExit("read");

               /* We expect only one kind of event; verify that assumption */

               if (msg.event != UFFD_EVENT_PAGEFAULT) {
                   fprintf(stderr, "Unexpected event on userfaultfd\n");
                   exit(EXIT_FAILURE);
               }

               /* Display info about the page-fault event */

               printf("    UFFD_EVENT_PAGEFAULT event: ");
               printf("flags = %llx; ", msg.arg.pagefault.flags);
               printf("address = %llx\n", msg.arg.pagefault.address);

               /* Copy the page pointed to by 'page' into the faulting
                  region. Vary the contents that are copied in, so that it
                  is more obvious that each fault is handled separately. */

               memset(page, 'A' + fault_cnt % 20, page_size);
               fault_cnt++;

               uffdio_copy.src = (unsigned long) page;

               /* We need to handle page faults in units of pages(!).
                  So, round faulting address down to page boundary */

               uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
                                                  ~(page_size - 1);
               uffdio_copy.len = page_size;
               uffdio_copy.mode = 0;
               uffdio_copy.copy = 0;
               if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
                   errExit("ioctl-UFFDIO_COPY");

               printf("        (uffdio_copy.copy returned %lld)\n",
                       uffdio_copy.copy);
           }
       }

       int
       main(int argc, char *argv[])
       {
           long uffd;          /* userfaultfd file descriptor */
           char *addr;         /* Start of region handled by userfaultfd */
           unsigned long len;  /* Length of region handled by userfaultfd */
           pthread_t thr;      /* ID of thread that handles page faults */
           struct uffdio_api uffdio_api;
           struct uffdio_register uffdio_register;
           int s;

           if (argc != 2) {
               fprintf(stderr, "Usage: %s num-pages\n", argv[0]);
               exit(EXIT_FAILURE);
           }

           page_size = sysconf(_SC_PAGE_SIZE);
           len = strtoul(argv[1], NULL, 0) * page_size;

           /* Create and enable userfaultfd object */

           uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
           if (uffd == -1)
               errExit("userfaultfd");

           uffdio_api.api = UFFD_API;
           uffdio_api.features = 0;
           if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
               errExit("ioctl-UFFDIO_API");

           /* Create a private anonymous mapping. The memory will be
              demand-zero paged--that is, not yet allocated. When we
              actually touch the memory, it will be allocated via
              the userfaultfd. */

           addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
           if (addr == MAP_FAILED)
               errExit("mmap");

           printf("Address returned by mmap() = %p\n", addr);

           /* Register the memory range of the mapping we just created for
              handling by the userfaultfd object. In mode, we request to track
              missing pages (i.e., pages that have not yet been faulted in). */

           uffdio_register.range.start = (unsigned long) addr;
           uffdio_register.range.len = len;
           uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
           if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
               errExit("ioctl-UFFDIO_REGISTER");

           /* Create a thread that will process the userfaultfd events */

           s = pthread_create(&thr, NULL, fault_handler_thread, (void *) uffd);
           if (s != 0) {
               errno = s;
               errExit("pthread_create");
           }

           /* Main thread now touches memory in the mapping, touching
              locations 1024 bytes apart. This will trigger userfaultfd
              events for all pages in the region. */

           unsigned long l;
           l = 0xf;    /* Ensure that faulting address is not on a page
                          boundary, in order to test that we correctly
                          handle that case in fault_handler_thread() */
           while (l < len) {
               char c = addr[l];
               printf("Read address %p in main(): ", addr + l);
               printf("%c\n", c);
               l += 1024;
               usleep(100000);         /* Slow things down a little */
           }

           exit(EXIT_SUCCESS);
       }

SEE ALSO

       fcntl(2), ioctl(2), ioctl_userfaultfd(2), madvise(2), mmap(2)

       Documentation/admin-guide/mm/userfaultfd.rst in the Linux kernel source
       tree

COLOPHON

       This page is part of release 5.04 of the Linux man-pages project.  A
       description of the project, information about reporting bugs, and the
       latest version of this page, can be found at
       https://www.kernel.org/doc/man-pages/.

Linux                             2019-03-06                    USERFAULTFD(2)