USERFAULTFD(2)             Linux Programmer's Manual            USERFAULTFD(2)

NAME

       userfaultfd - create a file descriptor for handling page faults in user
       space

SYNOPSIS

       #include <sys/types.h>
       #include <linux/userfaultfd.h>

       int userfaultfd(int flags);

       Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION

       userfaultfd() creates a new userfaultfd object that can be used for
       delegation of page-fault handling to a user-space application, and
       returns a file descriptor that refers to the new object.  The new
       userfaultfd object is configured using ioctl(2).

       Once the userfaultfd object is configured, the application can use
       read(2) to receive userfaultfd notifications.  Reads from the
       userfaultfd may be blocking or non-blocking, depending on the value of
       flags used for the creation of the userfaultfd or subsequent calls to
       fcntl(2).

       The following values may be bitwise ORed in flags to change the
       behavior of userfaultfd():

       O_CLOEXEC
              Enable the close-on-exec flag for the new userfaultfd file
              descriptor.  See the description of the O_CLOEXEC flag in
              open(2).

       O_NONBLOCK
              Enable non-blocking operation for the userfaultfd object.  See
              the description of the O_NONBLOCK flag in open(2).

       When the last file descriptor referring to a userfaultfd object is
       closed, all memory ranges that were registered with the object are
       unregistered and unread events are flushed.

   Usage
       The userfaultfd mechanism is designed to allow a thread in a
       multithreaded program to perform user-space paging for the other
       threads in the process.  When a page fault occurs for one of the
       regions registered to the userfaultfd object, the faulting thread is
       put to sleep and an event is generated that can be read via the
       userfaultfd file descriptor.  The fault-handling thread reads events
       from this file descriptor and services them using the operations
       described in ioctl_userfaultfd(2).  When servicing the page-fault
       events, the fault-handling thread can trigger a wake-up for the
       sleeping thread.

       It is possible for the faulting threads and the fault-handling threads
       to run in the context of different processes.  In this case, these
       threads may belong to different programs, and the program that executes
       the faulting threads will not necessarily cooperate with the program
       that handles the page faults.  In such non-cooperative mode, the
       process that monitors userfaultfd and handles page faults needs to be
       aware of the changes in the virtual memory layout of the faulting
       process to avoid memory corruption.

       Starting from Linux 4.11, userfaultfd can also notify the
       fault-handling threads about changes in the virtual memory layout of
       the faulting process.  In addition, if the faulting process invokes
       fork(2), the userfaultfd objects associated with the parent may be
       duplicated into the child process and the userfaultfd monitor will be
       notified (via the UFFD_EVENT_FORK described below) about the file
       descriptor associated with the userfaultfd objects created for the
       child process, which allows the userfaultfd monitor to perform
       user-space paging for the child process.  Unlike page faults, which
       have to be synchronous and require an explicit or implicit wakeup, all
       other events are delivered asynchronously and the non-cooperative
       process resumes execution as soon as the userfaultfd manager executes
       read(2).  The userfaultfd manager should carefully synchronize calls to
       UFFDIO_COPY with the processing of events.

       The current asynchronous model of the event delivery is optimal for
       single-threaded non-cooperative userfaultfd manager implementations.

   Userfaultfd operation
       After the userfaultfd object is created with userfaultfd(), the
       application must enable it using the UFFDIO_API ioctl(2) operation.
       This operation allows a handshake between the kernel and user space to
       determine the API version and supported features.  This operation must
       be performed before any of the other ioctl(2) operations described
       below (or those operations fail with the EINVAL error).

       After a successful UFFDIO_API operation, the application then registers
       memory address ranges using the UFFDIO_REGISTER ioctl(2) operation.
       After successful completion of a UFFDIO_REGISTER operation, a page
       fault occurring in the requested memory range, and satisfying the mode
       defined at the registration time, will be forwarded by the kernel to
       the user-space application.  The application can then use the
       UFFDIO_COPY or UFFDIO_ZEROPAGE ioctl(2) operations to resolve the page
       fault.
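
       To illustrate the resolve step, the sketch below fills in the argument
       structure for a UFFDIO_COPY operation.  The helper name
       make_copy_request() and its parameters are hypothetical; the one detail
       it is meant to show is that the destination must be the start of the
       page containing the faulting address, not the faulting address itself.

```c
#include <assert.h>
#include <string.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: build the argument for a UFFDIO_COPY ioctl(2)
   that resolves a fault at 'fault_addr' by copying one page from
   'src_page'.  'page_size' must be a power of two. */
static struct uffdio_copy
make_copy_request(__u64 fault_addr, const void *src_page, __u64 page_size)
{
    struct uffdio_copy copy;

    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    memset(&copy, 0, sizeof(copy));
    copy.src = (__u64) (unsigned long) src_page;
    copy.dst = fault_addr & ~(page_size - 1);   /* Round down to page */
    copy.len = page_size;
    copy.mode = 0;
    copy.copy = 0;              /* Set by the kernel: bytes copied */
    return copy;
}
```

       The returned structure would then be passed by address to
       ioctl(uffd, UFFDIO_COPY, ...), as the fault-handling thread in the
       example program does.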

       Starting from Linux 4.14, if the application sets the
       UFFD_FEATURE_SIGBUS feature bit using the UFFDIO_API ioctl(2), no
       page-fault notification will be forwarded to user space.  Instead, a
       SIGBUS signal is delivered to the faulting process.  With this
       feature, userfaultfd can be used for robustness purposes to simply
       catch any access to areas within the registered address range that do
       not have pages allocated, without having to listen to userfaultfd
       events.  No userfaultfd monitor will be required for dealing with such
       memory accesses.  For example, this feature can be useful for
       applications that want to prevent the kernel from automatically
       allocating pages and filling holes in sparse files when the hole is
       accessed through a memory mapping.

       The UFFD_FEATURE_SIGBUS feature is implicitly inherited through fork(2)
       if used in combination with UFFD_FEATURE_EVENT_FORK.

       Details of the various ioctl(2) operations can be found in
       ioctl_userfaultfd(2).

       Since Linux 4.11, events other than page-fault may be enabled during
       the UFFDIO_API operation.

       Before Linux 4.11, userfaultfd could be used only with anonymous
       private memory mappings.  Since Linux 4.11, userfaultfd can also be
       used with hugetlbfs and shared memory mappings.

   Reading from the userfaultfd structure
       Each read(2) from the userfaultfd file descriptor returns one or more
       uffd_msg structures, each of which describes a page-fault event or an
       event required for the non-cooperative userfaultfd usage:

           struct uffd_msg {
               __u8  event;            /* Type of event */
               ...
               union {
                   struct {
                       __u64 flags;    /* Flags describing fault */
                       __u64 address;  /* Faulting address */
                   } pagefault;

                   struct {            /* Since Linux 4.11 */
                       __u32 ufd;      /* Userfault file descriptor
                                          of the child process */
                   } fork;

                   struct {            /* Since Linux 4.11 */
                       __u64 from;     /* Old address of remapped area */
                       __u64 to;       /* New address of remapped area */
                       __u64 len;      /* Original mapping length */
                   } remap;

                   struct {            /* Since Linux 4.11 */
                       __u64 start;    /* Start address of removed area */
                       __u64 end;      /* End address of removed area */
                   } remove;
                   ...
               } arg;

               /* Padding fields omitted */
           } __packed;

       If multiple events are available and the supplied buffer is large
       enough, read(2) returns as many events as will fit in the supplied
       buffer.  If the buffer supplied to read(2) is smaller than the size of
       the uffd_msg structure, the read(2) fails with the error EINVAL.
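
       The batching behavior described above can be sketched as follows.  The
       helper read_events() is a hypothetical name; it assumes the caller
       supplies an array of at least one uffd_msg structure, and it derives
       the number of events from the byte count returned by read(2).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: read up to 'cap' events from 'fd' into 'msgs'
   with a single read(2).  Returns the number of complete uffd_msg
   structures read, or -1 with errno set by read(2). */
static ssize_t
read_events(int fd, struct uffd_msg *msgs, size_t cap)
{
    ssize_t nread;

    assert(msgs != NULL && cap > 0);

    nread = read(fd, msgs, cap * sizeof(struct uffd_msg));
    if (nread == -1)
        return -1;

    return nread / (ssize_t) sizeof(struct uffd_msg);
}
```

       Sizing the buffer in whole multiples of sizeof(struct uffd_msg) keeps
       the request above the minimum size that would otherwise produce
       EINVAL.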

       The fields set in the uffd_msg structure are as follows:

       event  The type of event.  Depending on the event type, different
              fields of the arg union represent details required for the event
              processing.  The non-page-fault events are generated only when
              the appropriate feature is enabled during the API handshake with
              the UFFDIO_API ioctl(2).

              The following values can appear in the event field:

              UFFD_EVENT_PAGEFAULT (since Linux 4.3)
                     A page-fault event.  The page-fault details are available
                     in the pagefault field.

              UFFD_EVENT_FORK (since Linux 4.11)
                     Generated when the faulting process invokes fork(2) (or
                     clone(2) without the CLONE_VM flag).  The event details
                     are available in the fork field.

              UFFD_EVENT_REMAP (since Linux 4.11)
                     Generated when the faulting process invokes mremap(2).
                     The event details are available in the remap field.

              UFFD_EVENT_REMOVE (since Linux 4.11)
                     Generated when the faulting process invokes madvise(2)
                     with MADV_DONTNEED or MADV_REMOVE advice.  The event
                     details are available in the remove field.

              UFFD_EVENT_UNMAP (since Linux 4.11)
                     Generated when the faulting process unmaps a memory
                     range, either explicitly using munmap(2) or implicitly
                     during mmap(2) or mremap(2).  The event details are
                     available in the remove field.

       pagefault.address
              The address that triggered the page fault.

       pagefault.flags
              A bit mask of flags that describe the event.  For
              UFFD_EVENT_PAGEFAULT, the following flag may appear:

              UFFD_PAGEFAULT_FLAG_WRITE
                     If the address is in a range that was registered with the
                     UFFDIO_REGISTER_MODE_MISSING flag (see
                     ioctl_userfaultfd(2)) and this flag is set, this is a
                     write fault; otherwise it is a read fault.

       fork.ufd
              The file descriptor associated with the userfaultfd object
              created for the child created by fork(2).

       remap.from
              The original address of the memory range that was remapped using
              mremap(2).

       remap.to
              The new address of the memory range that was remapped using
              mremap(2).

       remap.len
              The original length of the memory range that was remapped using
              mremap(2).

       remove.start
              The start address of the memory range that was freed using
              madvise(2) or unmapped.

       remove.end
              The end address of the memory range that was freed using
              madvise(2) or unmapped.
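
       The mapping from event type to arg member listed above can be sketched
       as a dispatch routine.  The helper describe_event() and its output
       format are hypothetical and serve only to show which union member is
       valid for each event value; the non-page-fault constants assume kernel
       headers from Linux 4.11 or later.

```c
#include <assert.h>
#include <stdio.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: write a one-line description of 'msg' into
   'buf', choosing the member of the arg union that matches the event
   type.  Returns the value returned by snprintf(3). */
static int
describe_event(const struct uffd_msg *msg, char *buf, size_t len)
{
    assert(msg != NULL && buf != NULL && len > 0);

    switch (msg->event) {
    case UFFD_EVENT_PAGEFAULT:
        return snprintf(buf, len, "pagefault %s addr=0x%llx",
                (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) ?
                "write" : "read",
                (unsigned long long) msg->arg.pagefault.address);
    case UFFD_EVENT_FORK:
        return snprintf(buf, len, "fork ufd=%u",
                (unsigned int) msg->arg.fork.ufd);
    case UFFD_EVENT_REMAP:
        return snprintf(buf, len, "remap from=0x%llx to=0x%llx len=%llu",
                (unsigned long long) msg->arg.remap.from,
                (unsigned long long) msg->arg.remap.to,
                (unsigned long long) msg->arg.remap.len);
    case UFFD_EVENT_REMOVE:
    case UFFD_EVENT_UNMAP:      /* Both report their range in 'remove' */
        return snprintf(buf, len, "%s start=0x%llx end=0x%llx",
                msg->event == UFFD_EVENT_REMOVE ? "remove" : "unmap",
                (unsigned long long) msg->arg.remove.start,
                (unsigned long long) msg->arg.remove.end);
    default:
        return snprintf(buf, len, "unknown event 0x%x",
                (unsigned int) msg->event);
    }
}
```

       A monitor that enabled only page-fault events would normally see just
       the first case; the remaining cases matter only when the corresponding
       features were requested during the UFFDIO_API handshake.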

       A read(2) on a userfaultfd file descriptor can fail with the following
       errors:

       EINVAL The userfaultfd object has not yet been enabled using the
              UFFDIO_API ioctl(2) operation.

       If the O_NONBLOCK flag is enabled in the associated open file
       description, the userfaultfd file descriptor can be monitored with
       poll(2), select(2), and epoll(7).  When events are available, the file
       descriptor is indicated as readable.  If the O_NONBLOCK flag is not
       enabled, then poll(2) (always) indicates the file as having a POLLERR
       condition, and select(2) indicates the file descriptor as both readable
       and writable.
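
       The polling behavior described above might be wrapped as in the
       following sketch.  The helper wait_readable() is a hypothetical name;
       it treats POLLERR as a failure, which is the condition a userfaultfd
       opened without O_NONBLOCK is documented above to report.

```c
#include <assert.h>
#include <poll.h>

/* Hypothetical helper: wait up to 'timeout_ms' milliseconds for 'fd'
   to become readable.  Returns 1 if POLLIN is reported, 0 on timeout,
   and -1 if poll(2) fails or reports a condition other than POLLIN
   (including POLLERR). */
static int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int nready;

    assert(fd >= 0);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    nready = poll(&pfd, 1, timeout_ms);
    if (nready == -1 || (pfd.revents & POLLERR))
        return -1;
    if (nready == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

       A return value of 1 indicates that a subsequent read(2) should find at
       least one event available.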

RETURN VALUE

       On success, userfaultfd() returns a new file descriptor that refers to
       the userfaultfd object.  On error, -1 is returned, and errno is set
       appropriately.

ERRORS

       EINVAL An unsupported value was specified in flags.

       EMFILE The per-process limit on the number of open file descriptors has
              been reached.

       ENFILE The system-wide limit on the total number of open files has been
              reached.

       ENOMEM Insufficient kernel memory was available.

VERSIONS

       The userfaultfd() system call first appeared in Linux 4.3.

       Support for hugetlbfs and shared memory areas and non-page-fault
       events was added in Linux 4.11.

CONFORMING TO

       userfaultfd() is Linux-specific and should not be used in programs
       intended to be portable.

NOTES

       Glibc does not provide a wrapper for this system call; call it using
       syscall(2).

       The userfaultfd mechanism can be used as an alternative to traditional
       user-space paging techniques based on the use of the SIGSEGV signal and
       mmap(2).  It can also be used to implement lazy restore for
       checkpoint/restore mechanisms, as well as post-copy migration to allow
       (nearly) uninterrupted execution when transferring virtual machines and
       Linux containers from one host to another.

BUGS

       If the UFFD_FEATURE_EVENT_FORK feature is enabled and a system call
       from the fork(2) family is interrupted by a signal or fails, a stale
       userfaultfd descriptor might be created.  In this case, a spurious
       UFFD_EVENT_FORK will be delivered to the userfaultfd monitor.

EXAMPLE

       The program below demonstrates the use of the userfaultfd mechanism.
       The program creates two threads, one of which acts as the page-fault
       handler for the process, for the pages in a demand-zero paged region
       created using mmap(2).

       The program takes one command-line argument, which is the number of
       pages that will be created in a mapping whose page faults will be
       handled via userfaultfd.  After creating a userfaultfd object, the
       program then creates an anonymous private mapping of the specified size
       and registers the address range of that mapping using the
       UFFDIO_REGISTER ioctl(2) operation.  The program then creates a second
       thread that will perform the task of handling page faults.

       The main thread then walks through the pages of the mapping fetching
       bytes from successive pages.  Because the pages have not yet been
       accessed, the first access of a byte in each page will trigger a
       page-fault event on the userfaultfd file descriptor.

       Each of the page-fault events is handled by the second thread, which
       sits in a loop processing input from the userfaultfd file descriptor.
       In each loop iteration, the second thread first calls poll(2) to check
       the state of the file descriptor, and then reads an event from the file
       descriptor.  All such events should be UFFD_EVENT_PAGEFAULT events,
       which the thread handles by copying a page of data into the faulting
       region using the UFFDIO_COPY ioctl(2) operation.

       The following is an example of what we see when running the program:

           $ ./userfaultfd_demo 3
           Address returned by mmap() = 0x7fd30106c000

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106c00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106c00f in main(): A
           Read address 0x7fd30106c40f in main(): A
           Read address 0x7fd30106c80f in main(): A
           Read address 0x7fd30106cc0f in main(): A

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106d00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106d00f in main(): B
           Read address 0x7fd30106d40f in main(): B
           Read address 0x7fd30106d80f in main(): B
           Read address 0x7fd30106dc0f in main(): B

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106e00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106e00f in main(): C
           Read address 0x7fd30106e40f in main(): C
           Read address 0x7fd30106e80f in main(): C
           Read address 0x7fd30106ec0f in main(): C

   Program source

       /* userfaultfd_demo.c

          Licensed under the GNU General Public License version 2 or later.
       */
       #define _GNU_SOURCE
       #include <sys/types.h>
       #include <stdio.h>
       #include <linux/userfaultfd.h>
       #include <pthread.h>
       #include <errno.h>
       #include <unistd.h>
       #include <stdlib.h>
       #include <fcntl.h>
       #include <signal.h>
       #include <poll.h>
       #include <string.h>
       #include <sys/mman.h>
       #include <sys/syscall.h>
       #include <sys/ioctl.h>

       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                               } while (0)

       static int page_size;

       static void *
       fault_handler_thread(void *arg)
       {
           static struct uffd_msg msg;   /* Data read from userfaultfd */
           static int fault_cnt = 0;     /* Number of faults so far handled */
           long uffd;                    /* userfaultfd file descriptor */
           static char *page = NULL;
           struct uffdio_copy uffdio_copy;
           ssize_t nread;

           uffd = (long) arg;

           /* Create a page that will be copied into the faulting region */

           if (page == NULL) {
               page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
               if (page == MAP_FAILED)
                   errExit("mmap");
           }

           /* Loop, handling incoming events on the userfaultfd
              file descriptor */

           for (;;) {

               /* See what poll() tells us about the userfaultfd */

               struct pollfd pollfd;
               int nready;
               pollfd.fd = uffd;
               pollfd.events = POLLIN;
               nready = poll(&pollfd, 1, -1);
               if (nready == -1)
                   errExit("poll");

               printf("\nfault_handler_thread():\n");
               printf("    poll() returns: nready = %d; "
                       "POLLIN = %d; POLLERR = %d\n", nready,
                       (pollfd.revents & POLLIN) != 0,
                       (pollfd.revents & POLLERR) != 0);

               /* Read an event from the userfaultfd */

               nread = read(uffd, &msg, sizeof(msg));
               if (nread == 0) {
                   printf("EOF on userfaultfd!\n");
                   exit(EXIT_FAILURE);
               }

               if (nread == -1)
                   errExit("read");

               /* We expect only one kind of event; verify that assumption */

               if (msg.event != UFFD_EVENT_PAGEFAULT) {
                   fprintf(stderr, "Unexpected event on userfaultfd\n");
                   exit(EXIT_FAILURE);
               }

               /* Display info about the page-fault event */

               printf("    UFFD_EVENT_PAGEFAULT event: ");
               printf("flags = %llx; ", msg.arg.pagefault.flags);
               printf("address = %llx\n", msg.arg.pagefault.address);

               /* Copy the page pointed to by 'page' into the faulting
                  region. Vary the contents that are copied in, so that it
                  is more obvious that each fault is handled separately. */

               memset(page, 'A' + fault_cnt % 20, page_size);
               fault_cnt++;

               uffdio_copy.src = (unsigned long) page;

               /* We need to handle page faults in units of pages(!).
                  So, round faulting address down to page boundary */

               uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
                                                  ~(page_size - 1);
               uffdio_copy.len = page_size;
               uffdio_copy.mode = 0;
               uffdio_copy.copy = 0;
               if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
                   errExit("ioctl-UFFDIO_COPY");

               printf("        (uffdio_copy.copy returned %lld)\n",
                       uffdio_copy.copy);
           }
       }

       int
       main(int argc, char *argv[])
       {
           long uffd;          /* userfaultfd file descriptor */
           char *addr;         /* Start of region handled by userfaultfd */
           unsigned long len;  /* Length of region handled by userfaultfd */
           pthread_t thr;      /* ID of thread that handles page faults */
           struct uffdio_api uffdio_api;
           struct uffdio_register uffdio_register;
           int s;

           if (argc != 2) {
               fprintf(stderr, "Usage: %s num-pages\n", argv[0]);
               exit(EXIT_FAILURE);
           }

           page_size = sysconf(_SC_PAGE_SIZE);
           len = strtoul(argv[1], NULL, 0) * page_size;

           /* Create and enable userfaultfd object */

           uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
           if (uffd == -1)
               errExit("userfaultfd");

           uffdio_api.api = UFFD_API;
           uffdio_api.features = 0;
           if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
               errExit("ioctl-UFFDIO_API");

           /* Create a private anonymous mapping. The memory will be
              demand-zero paged--that is, not yet allocated. When we
              actually touch the memory, it will be allocated via
              the userfaultfd. */

           addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
           if (addr == MAP_FAILED)
               errExit("mmap");

           printf("Address returned by mmap() = %p\n", addr);

           /* Register the memory range of the mapping we just created for
              handling by the userfaultfd object. In mode, we request to track
              missing pages (i.e., pages that have not yet been faulted in). */

           uffdio_register.range.start = (unsigned long) addr;
           uffdio_register.range.len = len;
           uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
           if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
               errExit("ioctl-UFFDIO_REGISTER");

           /* Create a thread that will process the userfaultfd events */

           s = pthread_create(&thr, NULL, fault_handler_thread, (void *) uffd);
           if (s != 0) {
               errno = s;
               errExit("pthread_create");
           }

           /* Main thread now touches memory in the mapping, touching
              locations 1024 bytes apart. This will trigger userfaultfd
              events for all pages in the region. */

           unsigned long l;    /* Same type as 'len' to avoid a signed/
                                  unsigned comparison */
           l = 0xf;    /* Ensure that faulting address is not on a page
                          boundary, in order to test that we correctly
                          handle that case in fault_handler_thread() */
           while (l < len) {
               char c = addr[l];
               printf("Read address %p in main(): ", addr + l);
               printf("%c\n", c);
               l += 1024;
               usleep(100000);         /* Slow things down a little */
           }

           exit(EXIT_SUCCESS);
       }

SEE ALSO

       fcntl(2), ioctl(2), ioctl_userfaultfd(2), madvise(2), mmap(2)

       Documentation/vm/userfaultfd.txt in the Linux kernel source tree

COLOPHON

       This page is part of release 4.16 of the Linux man-pages project.  A
       description of the project, information about reporting bugs, and the
       latest version of this page, can be found at
       https://www.kernel.org/doc/man-pages/.

Linux                             2017-09-15                    USERFAULTFD(2)