USERFAULTFD(2)             Linux Programmer's Manual            USERFAULTFD(2)

NAME

       userfaultfd - create a file descriptor for handling page faults in user
       space

SYNOPSIS

       #include <sys/types.h>
       #include <linux/userfaultfd.h>

       int userfaultfd(int flags);

       Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION

       userfaultfd() creates a new userfaultfd object that can be used for
       delegation of page-fault handling to a user-space application, and
       returns a file descriptor that refers to the new object.  The new
       userfaultfd object is configured using ioctl(2).

       Once the userfaultfd object is configured, the application can use
       read(2) to receive userfaultfd notifications.  Reads from the
       userfaultfd may be blocking or non-blocking, depending on the value of
       flags used for the creation of the userfaultfd or subsequent calls to
       fcntl(2).
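
       As a minimal sketch of the fcntl(2) route mentioned above, the helper
       below toggles O_NONBLOCK on an already-open descriptor.  The name
       set_nonblocking() is hypothetical and is not part of any userfaultfd
       API; the helper works on any file descriptor.

```c
#include <fcntl.h>

/* Hypothetical helper: enable (enable != 0) or disable (enable == 0)
   O_NONBLOCK on fd. Returns 0 on success, -1 on error (errno is set). */
int
set_nonblocking(int fd, int enable)
{
    int fl = fcntl(fd, F_GETFL);
    if (fl == -1)
        return -1;

    if (enable)
        fl |= O_NONBLOCK;
    else
        fl &= ~O_NONBLOCK;

    return fcntl(fd, F_SETFL, fl) == -1 ? -1 : 0;
}
```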

       The following values may be bitwise ORed in flags to change the
       behavior of userfaultfd():

       O_CLOEXEC
              Enable the close-on-exec flag for the new userfaultfd file
              descriptor.  See the description of the O_CLOEXEC flag in
              open(2).

       O_NONBLOCK
              Enable non-blocking operation for the userfaultfd object.  See
              the description of the O_NONBLOCK flag in open(2).

       When the last file descriptor referring to a userfaultfd object is
       closed, all memory ranges that were registered with the object are
       unregistered and unread events are flushed.

   Usage
       The userfaultfd mechanism is designed to allow a thread in a
       multithreaded program to perform user-space paging for the other
       threads in the process.  When a page fault occurs for one of the
       regions registered to the userfaultfd object, the faulting thread is
       put to sleep and an event is generated that can be read via the
       userfaultfd file descriptor.  The fault-handling thread reads events
       from this file descriptor and services them using the operations
       described in ioctl_userfaultfd(2).  When servicing the page-fault
       events, the fault-handling thread can trigger a wake-up for the
       sleeping thread.

       It is possible for the faulting threads and the fault-handling threads
       to run in the context of different processes.  In this case, these
       threads may belong to different programs, and the program that executes
       the faulting threads will not necessarily cooperate with the program
       that handles the page faults.  In such non-cooperative mode, the
       process that monitors userfaultfd and handles page faults needs to be
       aware of the changes in the virtual memory layout of the faulting
       process to avoid memory corruption.

       Starting from Linux 4.11, userfaultfd can also notify the
       fault-handling threads about changes in the virtual memory layout of
       the faulting process.  In addition, if the faulting process invokes
       fork(2), the userfaultfd objects associated with the parent may be
       duplicated into the child process and the userfaultfd monitor will be
       notified (via the UFFD_EVENT_FORK described below) about the file
       descriptor associated with the userfaultfd objects created for the
       child process, which allows the userfaultfd monitor to perform
       user-space paging for the child process.  Unlike page faults, which
       have to be synchronous and require an explicit or implicit wakeup, all
       other events are delivered asynchronously and the non-cooperative
       process resumes execution as soon as the userfaultfd manager executes
       read(2).  The userfaultfd manager should carefully synchronize calls to
       UFFDIO_COPY with the processing of events.

       The current asynchronous model of event delivery is optimal for
       single-threaded non-cooperative userfaultfd manager implementations.

   Userfaultfd operation
       After the userfaultfd object is created with userfaultfd(), the
       application must enable it using the UFFDIO_API ioctl(2) operation.
       This operation allows a handshake between the kernel and user space to
       determine the API version and supported features.  This operation must
       be performed before any of the other ioctl(2) operations described
       below (otherwise, those operations fail with the EINVAL error).
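
       The handshake described above might be sketched as follows.  The
       helper name uffd_handshake() and its parameters are hypothetical; the
       structure and constants come from <linux/userfaultfd.h>.  The sketch
       assumes that, on success, the kernel reports its feature mask back in
       the features field; see ioctl_userfaultfd(2) for the authoritative
       description.

```c
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <stddef.h>

/* Hypothetical helper: perform the UFFDIO_API handshake on uffd,
   requesting the feature bits in 'wanted' (0 requests none).
   On success, returns 0 and, if 'supported' is not NULL, stores the
   feature mask reported back by the kernel. Returns -1 on error. */
int
uffd_handshake(int uffd, __u64 wanted, __u64 *supported)
{
    struct uffdio_api api;

    api.api = UFFD_API;
    api.features = wanted;
    api.ioctls = 0;

    if (ioctl(uffd, UFFDIO_API, &api) == -1)
        return -1;

    if (supported != NULL)
        *supported = api.features;
    return 0;
}
```

       Calling uffd_handshake(uffd, 0, NULL) corresponds to the handshake
       performed in the program under EXAMPLES.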

       After a successful UFFDIO_API operation, the application then registers
       memory address ranges using the UFFDIO_REGISTER ioctl(2) operation.
       After successful completion of a UFFDIO_REGISTER operation, a page
       fault occurring in the requested memory range, and satisfying the mode
       defined at the registration time, will be forwarded by the kernel to
       the user-space application.  The application can then use the
       UFFDIO_COPY or UFFDIO_ZEROPAGE ioctl(2) operations to resolve the page
       fault.
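
       The register-then-resolve sequence above could look roughly like the
       following.  Both helper names are hypothetical.  The sketch resolves a
       fault with UFFDIO_ZEROPAGE (the program under EXAMPLES uses
       UFFDIO_COPY instead) and assumes that the registered range is
       page-aligned and that the page size is a power of two.

```c
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <stddef.h>

/* Hypothetical helper: register [addr, addr + len) for missing-page
   faults. addr and len are assumed to be page-aligned. */
int
uffd_register_missing(int uffd, void *addr, size_t len)
{
    struct uffdio_register reg;

    reg.range.start = (unsigned long) addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    reg.ioctls = 0;

    return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

/* Hypothetical helper: resolve a fault at fault_addr by mapping a
   zero-filled page there. page_size is assumed to be a power of two. */
int
uffd_resolve_zero(int uffd, unsigned long fault_addr,
                  unsigned long page_size)
{
    struct uffdio_zeropage zp;

    /* Faults are resolved in whole pages: round down to a boundary */
    zp.range.start = fault_addr & ~(page_size - 1);
    zp.range.len = page_size;
    zp.mode = 0;
    zp.zeropage = 0;

    return ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
}
```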

       Starting from Linux 4.14, if the application sets the
       UFFD_FEATURE_SIGBUS feature bit using the UFFDIO_API ioctl(2), no
       page-fault notification will be forwarded to user space.  Instead, a
       SIGBUS signal is delivered to the faulting process.  With this feature,
       userfaultfd can be used for robustness purposes to simply catch any
       access to areas within the registered address range that do not have
       pages allocated, without having to listen to userfaultfd events.  No
       userfaultfd monitor will be required for dealing with such memory
       accesses.  For example, this feature can be useful for applications
       that want to prevent the kernel from automatically allocating pages
       and filling holes in sparse files when the hole is accessed through a
       memory mapping.

       The UFFD_FEATURE_SIGBUS feature is implicitly inherited through fork(2)
       if used in combination with UFFD_FEATURE_EVENT_FORK.

       Details of the various ioctl(2) operations can be found in
       ioctl_userfaultfd(2).

       Since Linux 4.11, events other than page-fault may be enabled during
       the UFFDIO_API operation.

       Before Linux 4.11, userfaultfd could be used only with anonymous
       private memory mappings.  Since Linux 4.11, userfaultfd can also be
       used with hugetlbfs and shared memory mappings.

   Reading from the userfaultfd structure
       Each read(2) from the userfaultfd file descriptor returns one or more
       uffd_msg structures, each of which describes a page-fault event or an
       event required for the non-cooperative userfaultfd usage:

           struct uffd_msg {
               __u8  event;            /* Type of event */
               ...
               union {
                   struct {
                       __u64 flags;    /* Flags describing fault */
                       __u64 address;  /* Faulting address */
                   } pagefault;

                   struct {            /* Since Linux 4.11 */
                       __u32 ufd;      /* Userfault file descriptor
                                          of the child process */
                   } fork;

                   struct {            /* Since Linux 4.11 */
                       __u64 from;     /* Old address of remapped area */
                       __u64 to;       /* New address of remapped area */
                       __u64 len;      /* Original mapping length */
                   } remap;

                   struct {            /* Since Linux 4.11 */
                       __u64 start;    /* Start address of removed area */
                       __u64 end;      /* End address of removed area */
                   } remove;
                   ...
               } arg;

               /* Padding fields omitted */
           } __packed;

       If multiple events are available and the supplied buffer is large
       enough, read(2) returns as many events as will fit in the supplied
       buffer.  If the buffer supplied to read(2) is smaller than the size of
       the uffd_msg structure, the read(2) fails with the error EINVAL.
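
       Because a single read(2) may return several events, a monitor can read
       into an array of uffd_msg structures.  The following is a minimal
       sketch; the helper name uffd_read_batch(), the callback parameter, and
       the batch size of 16 are all assumptions made for illustration.

```c
#include <linux/userfaultfd.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

#define BATCH 16    /* Arbitrary batch size chosen for this sketch */

/* Hypothetical helper: read up to BATCH events from uffd with a single
   read(2) and pass each one to handle(). Returns the number of events
   delivered, 0 if a non-blocking descriptor had nothing to read, or
   -1 on error. */
int
uffd_read_batch(int uffd, void (*handle)(const struct uffd_msg *))
{
    struct uffd_msg msgs[BATCH];
    ssize_t nread;
    int n, i;

    nread = read(uffd, msgs, sizeof(msgs));
    if (nread == -1)
        return (errno == EAGAIN) ? 0 : -1;

    /* The byte count is a whole number of uffd_msg structures */
    n = (int) (nread / (ssize_t) sizeof(msgs[0]));
    for (i = 0; i < n; i++)
        handle(&msgs[i]);
    return n;
}
```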

       The fields set in the uffd_msg structure are as follows:

       event  The type of event.  Depending on the event type, different
              fields of the arg union represent details required for the event
              processing.  The non-page-fault events are generated only when
              the appropriate feature is enabled during the API handshake with
              the UFFDIO_API ioctl(2).

              The following values can appear in the event field:

              UFFD_EVENT_PAGEFAULT (since Linux 4.3)
                     A page-fault event.  The page-fault details are available
                     in the pagefault field.

              UFFD_EVENT_FORK (since Linux 4.11)
                     Generated when the faulting process invokes fork(2) (or
                     clone(2) without the CLONE_VM flag).  The event details
                     are available in the fork field.

              UFFD_EVENT_REMAP (since Linux 4.11)
                     Generated when the faulting process invokes mremap(2).
                     The event details are available in the remap field.

              UFFD_EVENT_REMOVE (since Linux 4.11)
                     Generated when the faulting process invokes madvise(2)
                     with MADV_DONTNEED or MADV_REMOVE advice.  The event
                     details are available in the remove field.

              UFFD_EVENT_UNMAP (since Linux 4.11)
                     Generated when the faulting process unmaps a memory
                     range, either explicitly using munmap(2) or implicitly
                     during mmap(2) or mremap(2).  The event details are
                     available in the remove field.

       pagefault.address
              The address that triggered the page fault.

       pagefault.flags
              A bit mask of flags that describe the event.  For
              UFFD_EVENT_PAGEFAULT, the following flag may appear:

              UFFD_PAGEFAULT_FLAG_WRITE
                     If the address is in a range that was registered with the
                     UFFDIO_REGISTER_MODE_MISSING flag (see
                     ioctl_userfaultfd(2)) and this flag is set, this is a
                     write fault; otherwise it is a read fault.

       fork.ufd
              The file descriptor associated with the userfaultfd object
              created for the child created by fork(2).

       remap.from
              The original address of the memory range that was remapped using
              mremap(2).

       remap.to
              The new address of the memory range that was remapped using
              mremap(2).

       remap.len
              The original length of the memory range that was remapped using
              mremap(2).

       remove.start
              The start address of the memory range that was freed using
              madvise(2) or unmapped.

       remove.end
              The end address of the memory range that was freed using
              madvise(2) or unmapped.
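
       To illustrate how the event field selects the member of the arg union,
       the sketch below formats a one-line description of a message.  The
       helper name uffd_describe() and the output strings are hypothetical.
       The read/write distinction follows the description of
       UFFD_PAGEFAULT_FLAG_WRITE above and so assumes a range registered with
       UFFDIO_REGISTER_MODE_MISSING.

```c
#include <linux/userfaultfd.h>
#include <stdio.h>

/* Hypothetical helper: format a one-line description of msg into buf.
   Returns the value returned by snprintf(3). */
int
uffd_describe(const struct uffd_msg *msg, char *buf, size_t size)
{
    switch (msg->event) {
    case UFFD_EVENT_PAGEFAULT:
        return snprintf(buf, size, "pagefault %s at 0x%llx",
                (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) ?
                        "write" : "read",
                (unsigned long long) msg->arg.pagefault.address);

    case UFFD_EVENT_FORK:
        return snprintf(buf, size, "fork: child uffd %u",
                (unsigned int) msg->arg.fork.ufd);

    case UFFD_EVENT_REMAP:
        return snprintf(buf, size, "remap from 0x%llx to 0x%llx, len %llu",
                (unsigned long long) msg->arg.remap.from,
                (unsigned long long) msg->arg.remap.to,
                (unsigned long long) msg->arg.remap.len);

    case UFFD_EVENT_REMOVE:
        return snprintf(buf, size, "remove start 0x%llx end 0x%llx",
                (unsigned long long) msg->arg.remove.start,
                (unsigned long long) msg->arg.remove.end);

    case UFFD_EVENT_UNMAP:      /* Details are in the remove field */
        return snprintf(buf, size, "unmap start 0x%llx end 0x%llx",
                (unsigned long long) msg->arg.remove.start,
                (unsigned long long) msg->arg.remove.end);

    default:
        return snprintf(buf, size, "unknown event 0x%x",
                (unsigned int) msg->event);
    }
}
```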

       A read(2) on a userfaultfd file descriptor can fail with the following
       errors:

       EINVAL The userfaultfd object has not yet been enabled using the
              UFFDIO_API ioctl(2) operation.

       If the O_NONBLOCK flag is enabled in the associated open file
       description, the userfaultfd file descriptor can be monitored with
       poll(2), select(2), and epoll(7).  When events are available, the file
       descriptor is reported as readable.  If the O_NONBLOCK flag is not
       enabled, then poll(2) (always) indicates the file as having a POLLERR
       condition, and select(2) indicates the file descriptor as both readable
       and writable.

RETURN VALUE

       On success, userfaultfd() returns a new file descriptor that refers to
       the userfaultfd object.  On error, -1 is returned, and errno is set
       appropriately.

ERRORS

       EINVAL An unsupported value was specified in flags.

       EMFILE The per-process limit on the number of open file descriptors has
              been reached.

       ENFILE The system-wide limit on the total number of open files has been
              reached.

       ENOMEM Insufficient kernel memory was available.

       EPERM (since Linux 5.2)
              The caller is not privileged (does not have the CAP_SYS_PTRACE
              capability in the initial user namespace), and
              /proc/sys/vm/unprivileged_userfaultfd has the value 0.
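
       Before calling userfaultfd(), an unprivileged program might check the
       setting mentioned under EPERM.  The following sketch reads the file
       named above; the helper name read_unprivileged_userfaultfd() is
       hypothetical, and a return value of -1 only means that the file could
       not be read or parsed (for example, on a kernel that lacks it).

```c
#include <stdio.h>

/* Hypothetical helper: return the current value of
   /proc/sys/vm/unprivileged_userfaultfd, or -1 if the file cannot be
   read or parsed. */
int
read_unprivileged_userfaultfd(void)
{
    FILE *fp;
    int value = -1;

    fp = fopen("/proc/sys/vm/unprivileged_userfaultfd", "r");
    if (fp == NULL)
        return -1;

    if (fscanf(fp, "%d", &value) != 1)
        value = -1;

    fclose(fp);
    return value;
}
```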

VERSIONS

       The userfaultfd() system call first appeared in Linux 4.3.

       Support for hugetlbfs and shared memory areas and non-page-fault
       events was added in Linux 4.11.

CONFORMING TO

       userfaultfd() is Linux-specific and should not be used in programs
       intended to be portable.

NOTES

       Glibc does not provide a wrapper for this system call; call it using
       syscall(2).
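
       A minimal sketch of such a call follows.  The wrapper name uffd_open()
       is hypothetical; __NR_userfaultfd is the system call number made
       available through <sys/syscall.h>.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: invoke the userfaultfd() system call through
   syscall(2). Returns a new file descriptor, or -1 with errno set. */
int
uffd_open(int flags)
{
    return (int) syscall(__NR_userfaultfd, flags);
}
```

       A caller would typically pass O_CLOEXEC, O_NONBLOCK, or both (from
       <fcntl.h>), as the program under EXAMPLES does.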

       The userfaultfd mechanism can be used as an alternative to traditional
       user-space paging techniques based on the use of the SIGSEGV signal and
       mmap(2).  It can also be used to implement lazy restore for
       checkpoint/restore mechanisms, as well as post-copy migration to allow
       (nearly) uninterrupted execution when transferring virtual machines and
       Linux containers from one host to another.

BUGS

       If the UFFD_FEATURE_EVENT_FORK feature is enabled and a system call
       from the fork(2) family is interrupted by a signal or fails, a stale
       userfaultfd descriptor might be created.  In this case, a spurious
       UFFD_EVENT_FORK will be delivered to the userfaultfd monitor.

EXAMPLES

       The program below demonstrates the use of the userfaultfd mechanism.
       The program uses two threads, one of which acts as the page-fault
       handler for the process, for the pages in a demand-zero paged region
       created using mmap(2).

       The program takes one command-line argument, which is the number of
       pages that will be created in a mapping whose page faults will be
       handled via userfaultfd.  After creating a userfaultfd object, the
       program then creates an anonymous private mapping of the specified size
       and registers the address range of that mapping using the
       UFFDIO_REGISTER ioctl(2) operation.  The program then creates a second
       thread that will perform the task of handling page faults.

       The main thread then walks through the pages of the mapping, fetching
       bytes from successive pages.  Because the pages have not yet been
       accessed, the first access of a byte in each page will trigger a
       page-fault event on the userfaultfd file descriptor.

       Each of the page-fault events is handled by the second thread, which
       sits in a loop processing input from the userfaultfd file descriptor.
       In each loop iteration, the second thread first calls poll(2) to check
       the state of the file descriptor, and then reads an event from the file
       descriptor.  All such events should be UFFD_EVENT_PAGEFAULT events,
       which the thread handles by copying a page of data into the faulting
       region using the UFFDIO_COPY ioctl(2) operation.

       The following is an example of what we see when running the program:

           $ ./userfaultfd_demo 3
           Address returned by mmap() = 0x7fd30106c000

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106c00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106c00f in main(): A
           Read address 0x7fd30106c40f in main(): A
           Read address 0x7fd30106c80f in main(): A
           Read address 0x7fd30106cc0f in main(): A

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106d00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106d00f in main(): B
           Read address 0x7fd30106d40f in main(): B
           Read address 0x7fd30106d80f in main(): B
           Read address 0x7fd30106dc0f in main(): B

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106e00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106e00f in main(): C
           Read address 0x7fd30106e40f in main(): C
           Read address 0x7fd30106e80f in main(): C
           Read address 0x7fd30106ec0f in main(): C

   Program source

       /* userfaultfd_demo.c

          Licensed under the GNU General Public License version 2 or later.
       */
       #define _GNU_SOURCE
       #include <sys/types.h>
       #include <stdio.h>
       #include <linux/userfaultfd.h>
       #include <pthread.h>
       #include <errno.h>
       #include <unistd.h>
       #include <stdlib.h>
       #include <fcntl.h>
       #include <signal.h>
       #include <poll.h>
       #include <string.h>
       #include <sys/mman.h>
       #include <sys/syscall.h>
       #include <sys/ioctl.h>

       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                               } while (0)

       static int page_size;

       static void *
       fault_handler_thread(void *arg)
       {
           static struct uffd_msg msg;   /* Data read from userfaultfd */
           static int fault_cnt = 0;     /* Number of faults so far handled */
           long uffd;                    /* userfaultfd file descriptor */
           static char *page = NULL;
           struct uffdio_copy uffdio_copy;
           ssize_t nread;

           uffd = (long) arg;

           /* Create a page that will be copied into the faulting region */

           if (page == NULL) {
               page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
               if (page == MAP_FAILED)
                   errExit("mmap");
           }

           /* Loop, handling incoming events on the userfaultfd
              file descriptor */

           for (;;) {

               /* See what poll() tells us about the userfaultfd */

               struct pollfd pollfd;
               int nready;
               pollfd.fd = uffd;
               pollfd.events = POLLIN;
               nready = poll(&pollfd, 1, -1);
               if (nready == -1)
                   errExit("poll");

               printf("\nfault_handler_thread():\n");
               printf("    poll() returns: nready = %d; "
                       "POLLIN = %d; POLLERR = %d\n", nready,
                       (pollfd.revents & POLLIN) != 0,
                       (pollfd.revents & POLLERR) != 0);

               /* Read an event from the userfaultfd */

               nread = read(uffd, &msg, sizeof(msg));
               if (nread == 0) {
                   printf("EOF on userfaultfd!\n");
                   exit(EXIT_FAILURE);
               }

               if (nread == -1)
                   errExit("read");

               /* We expect only one kind of event; verify that assumption */

               if (msg.event != UFFD_EVENT_PAGEFAULT) {
                   fprintf(stderr, "Unexpected event on userfaultfd\n");
                   exit(EXIT_FAILURE);
               }

               /* Display info about the page-fault event */

               printf("    UFFD_EVENT_PAGEFAULT event: ");
               printf("flags = %llx; ", msg.arg.pagefault.flags);
               printf("address = %llx\n", msg.arg.pagefault.address);

               /* Copy the page pointed to by 'page' into the faulting
                  region. Vary the contents that are copied in, so that it
                  is more obvious that each fault is handled separately. */

               memset(page, 'A' + fault_cnt % 20, page_size);
               fault_cnt++;

               uffdio_copy.src = (unsigned long) page;

               /* We need to handle page faults in units of pages(!).
                  So, round faulting address down to page boundary */

               uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
                                                  ~(page_size - 1);
               uffdio_copy.len = page_size;
               uffdio_copy.mode = 0;
               uffdio_copy.copy = 0;
               if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
                   errExit("ioctl-UFFDIO_COPY");

               printf("        (uffdio_copy.copy returned %lld)\n",
                       uffdio_copy.copy);
           }
       }

       int
       main(int argc, char *argv[])
       {
           long uffd;          /* userfaultfd file descriptor */
           char *addr;         /* Start of region handled by userfaultfd */
           unsigned long len;  /* Length of region handled by userfaultfd */
           pthread_t thr;      /* ID of thread that handles page faults */
           struct uffdio_api uffdio_api;
           struct uffdio_register uffdio_register;
           int s;

           if (argc != 2) {
               fprintf(stderr, "Usage: %s num-pages\n", argv[0]);
               exit(EXIT_FAILURE);
           }

           page_size = sysconf(_SC_PAGE_SIZE);
           len = strtoul(argv[1], NULL, 0) * page_size;

           /* Create and enable userfaultfd object */

           uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
           if (uffd == -1)
               errExit("userfaultfd");

           uffdio_api.api = UFFD_API;
           uffdio_api.features = 0;
           if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
               errExit("ioctl-UFFDIO_API");

           /* Create a private anonymous mapping. The memory will be
              demand-zero paged--that is, not yet allocated. When we
              actually touch the memory, it will be allocated via
              the userfaultfd. */

           addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
           if (addr == MAP_FAILED)
               errExit("mmap");

           printf("Address returned by mmap() = %p\n", addr);

           /* Register the memory range of the mapping we just created for
              handling by the userfaultfd object. In mode, we request to track
              missing pages (i.e., pages that have not yet been faulted in). */

           uffdio_register.range.start = (unsigned long) addr;
           uffdio_register.range.len = len;
           uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
           if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
               errExit("ioctl-UFFDIO_REGISTER");

           /* Create a thread that will process the userfaultfd events */

           s = pthread_create(&thr, NULL, fault_handler_thread, (void *) uffd);
           if (s != 0) {
               errno = s;
               errExit("pthread_create");
           }

           /* Main thread now touches memory in the mapping, touching
              locations 1024 bytes apart. This will trigger userfaultfd
              events for all pages in the region. */

           unsigned long l;    /* Unsigned, to match the type of 'len' */
           l = 0xf;    /* Ensure that faulting address is not on a page
                          boundary, in order to test that we correctly
                          handle that case in fault_handler_thread() */
           while (l < len) {
               char c = addr[l];
               printf("Read address %p in main(): ", addr + l);
               printf("%c\n", c);
               l += 1024;
               usleep(100000);         /* Slow things down a little */
           }

           exit(EXIT_SUCCESS);
       }

SEE ALSO

       fcntl(2), ioctl(2), ioctl_userfaultfd(2), madvise(2), mmap(2)

       Documentation/admin-guide/mm/userfaultfd.rst in the Linux kernel source
       tree

COLOPHON

       This page is part of release 5.07 of the Linux man-pages project.  A
       description of the project, information about reporting bugs, and the
       latest version of this page, can be found at
       https://www.kernel.org/doc/man-pages/.

Linux                             2020-06-09                    USERFAULTFD(2)