USERFAULTFD(2)             Linux Programmer's Manual            USERFAULTFD(2)

NAME

       userfaultfd - create a file descriptor for handling page faults in user
       space

SYNOPSIS

       #include <sys/types.h>
       #include <linux/userfaultfd.h>

       int userfaultfd(int flags);

       Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION

       userfaultfd() creates a new userfaultfd object that can be used for
       delegation of page-fault handling to a user-space application, and
       returns a file descriptor that refers to the new object.  The new
       userfaultfd object is configured using ioctl(2).

       Once the userfaultfd object is configured, the application can use
       read(2) to receive userfaultfd notifications.  The reads from
       userfaultfd may be blocking or non-blocking, depending on the value of
       flags used for the creation of the userfaultfd or subsequent calls to
       fcntl(2).

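       As a minimal sketch of the fcntl(2) route mentioned above, the
       following helper toggles O_NONBLOCK on an already open descriptor.
       The name set_nonblocking() is hypothetical and not part of any
       library; it relies only on the standard F_GETFL and F_SETFL
       operations.

```c
#include <assert.h>
#include <fcntl.h>

/* Hypothetical helper (illustration only): switch the open file
   description behind 'fd' between blocking and non-blocking mode.
   Returns 0 on success, or -1 with errno set by fcntl(2). */
static int
set_nonblocking(int fd, int enable)
{
    int fl;

    assert(fd >= 0);

    fl = fcntl(fd, F_GETFL);            /* Fetch current status flags */
    if (fl == -1)
        return -1;

    if (enable)
        fl |= O_NONBLOCK;
    else
        fl &= ~O_NONBLOCK;

    return fcntl(fd, F_SETFL, fl) == -1 ? -1 : 0;
}
```

       Calling set_nonblocking(uffd, 1) on a userfaultfd descriptor that was
       created without O_NONBLOCK should have the same effect on later
       read(2) calls as passing O_NONBLOCK to userfaultfd() in the first
       place.
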
       The following values may be bitwise ORed in flags to change the
       behavior of userfaultfd():

       O_CLOEXEC
              Enable the close-on-exec flag for the new userfaultfd file
              descriptor.  See the description of the O_CLOEXEC flag in
              open(2).

       O_NONBLOCK
              Enables non-blocking operation for the userfaultfd object.  See
              the description of the O_NONBLOCK flag in open(2).

       When the last file descriptor referring to a userfaultfd object is
       closed, all memory ranges that were registered with the object are
       unregistered and unread events are flushed.

   Usage
       The userfaultfd mechanism is designed to allow a thread in a
       multithreaded program to perform user-space paging for the other
       threads in the process.  When a page fault occurs for one of the
       regions registered to the userfaultfd object, the faulting thread is
       put to sleep and an event is generated that can be read via the
       userfaultfd file descriptor.  The fault-handling thread reads events
       from this file descriptor and services them using the operations
       described in ioctl_userfaultfd(2).  When servicing the page-fault
       events, the fault-handling thread can trigger a wake-up for the
       sleeping thread.

       It is possible for the faulting threads and the fault-handling threads
       to run in the context of different processes.  In this case, these
       threads may belong to different programs, and the program that
       executes the faulting threads will not necessarily cooperate with the
       program that handles the page faults.  In such non-cooperative mode,
       the process that monitors userfaultfd and handles page faults needs to
       be aware of the changes in the virtual memory layout of the faulting
       process to avoid memory corruption.

       Starting from Linux 4.11, userfaultfd can also notify the
       fault-handling threads about changes in the virtual memory layout of
       the faulting process.  In addition, if the faulting process invokes
       fork(2), the userfaultfd objects associated with the parent may be
       duplicated into the child process and the userfaultfd monitor will be
       notified (via the UFFD_EVENT_FORK described below) about the file
       descriptor associated with the userfaultfd objects created for the
       child process, which allows the userfaultfd monitor to perform
       user-space paging for the child process.  Unlike page faults, which
       have to be synchronous and require an explicit or implicit wakeup, all
       other events are delivered asynchronously and the non-cooperative
       process resumes execution as soon as the userfaultfd manager executes
       read(2).  The userfaultfd manager should carefully synchronize calls
       to UFFDIO_COPY with the processing of events.

       The current asynchronous model of the event delivery is optimal for
       single-threaded non-cooperative userfaultfd manager implementations.

   Userfaultfd operation
       After the userfaultfd object is created with userfaultfd(), the
       application must enable it using the UFFDIO_API ioctl(2) operation.
       This operation allows a handshake between the kernel and user space to
       determine the API version and supported features.  This operation must
       be performed before any of the other ioctl(2) operations described
       below (or those operations fail with the EINVAL error).

       After a successful UFFDIO_API operation, the application then
       registers memory address ranges using the UFFDIO_REGISTER ioctl(2)
       operation.  After successful completion of a UFFDIO_REGISTER
       operation, a page fault occurring in the requested memory range, and
       satisfying the mode defined at the registration time, will be
       forwarded by the kernel to the user-space application.  The
       application can then use the UFFDIO_COPY or UFFDIO_ZEROPAGE ioctl(2)
       operations to resolve the page fault.

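       The register-then-resolve sequence can be sketched as below, here
       using UFFDIO_ZEROPAGE rather than UFFDIO_COPY.  The helper names
       page_floor(), uffd_register_missing(), and uffd_zero_page() are
       hypothetical; the structures and request codes are those declared in
       <linux/userfaultfd.h>.  The sketch assumes that the page size is a
       power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Round 'addr' down to the start of its page.  Faults must be
   resolved in units of whole pages. */
static unsigned long
page_floor(unsigned long addr, unsigned long page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}

/* Hypothetical helper: ask for missing-page faults in the range
   [addr, addr + len) to be forwarded to 'uffd'.
   Returns 0 on success, or -1 with errno set. */
static int
uffd_register_missing(int uffd, void *addr, size_t len)
{
    struct uffdio_register reg;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (unsigned long) addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;

    return ioctl(uffd, UFFDIO_REGISTER, &reg) == -1 ? -1 : 0;
}

/* Hypothetical helper: resolve a fault at 'fault_addr' by mapping a
   zero-filled page over the page that contains it.
   Returns 0 on success, or -1 with errno set. */
static int
uffd_zero_page(int uffd, unsigned long fault_addr, unsigned long page_size)
{
    struct uffdio_zeropage zp;

    memset(&zp, 0, sizeof(zp));
    zp.range.start = page_floor(fault_addr, page_size);
    zp.range.len = page_size;
    zp.mode = 0;                        /* Wake the faulting thread */

    return ioctl(uffd, UFFDIO_ZEROPAGE, &zp) == -1 ? -1 : 0;
}
```

       A fault handler would typically pass the address taken from a
       page-fault event to uffd_zero_page(); a handler that needs non-zero
       contents would use UFFDIO_COPY instead.
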
       Starting from Linux 4.14, if the application sets the
       UFFD_FEATURE_SIGBUS feature bit using the UFFDIO_API ioctl(2), no
       page-fault notification will be forwarded to user space.  Instead, a
       SIGBUS signal is delivered to the faulting process.  With this
       feature, userfaultfd can be used for robustness purposes to simply
       catch any access to areas within the registered address range that do
       not have pages allocated, without having to listen to userfaultfd
       events.  No userfaultfd monitor will be required for dealing with such
       memory accesses.  For example, this feature can be useful for
       applications that want to prevent the kernel from automatically
       allocating pages and filling holes in sparse files when the hole is
       accessed through a memory mapping.

       The UFFD_FEATURE_SIGBUS feature is implicitly inherited through
       fork(2) if used in combination with UFFD_FEATURE_EVENT_FORK.

       Details of the various ioctl(2) operations can be found in
       ioctl_userfaultfd(2).

       Since Linux 4.11, events other than page-fault may be enabled during
       the UFFDIO_API operation.

       Before Linux 4.11, userfaultfd could be used only with anonymous
       private memory mappings.  Since Linux 4.11, userfaultfd can also be
       used with hugetlbfs and shared memory mappings.

   Reading from the userfaultfd structure
       Each read(2) from the userfaultfd file descriptor returns one or more
       uffd_msg structures, each of which describes a page-fault event or an
       event required for the non-cooperative userfaultfd usage:

           struct uffd_msg {
               __u8  event;            /* Type of event */
               ...
               union {
                   struct {
                       __u64 flags;    /* Flags describing fault */
                       __u64 address;  /* Faulting address */
                   } pagefault;

                   struct {            /* Since Linux 4.11 */
                       __u32 ufd;      /* Userfault file descriptor
                                          of the child process */
                   } fork;

                   struct {            /* Since Linux 4.11 */
                       __u64 from;     /* Old address of remapped area */
                       __u64 to;       /* New address of remapped area */
                       __u64 len;      /* Original mapping length */
                   } remap;

                   struct {            /* Since Linux 4.11 */
                       __u64 start;    /* Start address of removed area */
                       __u64 end;      /* End address of removed area */
                   } remove;
                   ...
               } arg;

               /* Padding fields omitted */
           } __packed;

       If multiple events are available and the supplied buffer is large
       enough, read(2) returns as many events as will fit in the supplied
       buffer.  If the buffer supplied to read(2) is smaller than the size of
       the uffd_msg structure, the read(2) fails with the error EINVAL.

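       Reading several events in one call might be done along these lines.
       uffd_read_events() is a hypothetical helper; treating EAGAIN as "no
       events pending" assumes that the descriptor is in non-blocking mode.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: read up to 'cap' events from 'fd' into 'msgs'.
   Returns the number of complete events read, 0 if a non-blocking
   descriptor has nothing pending, or -1 on any other error. */
static long
uffd_read_events(int fd, struct uffd_msg *msgs, size_t cap)
{
    ssize_t n;

    assert(msgs != NULL && cap > 0);

    n = read(fd, msgs, cap * sizeof(msgs[0]));
    if (n == -1)
        return (errno == EAGAIN) ? 0 : -1;

    /* Events arrive as whole structures; anything else suggests
       that 'fd' is not a userfaultfd descriptor. */
    if (n % (ssize_t) sizeof(msgs[0]) != 0) {
        errno = EIO;
        return -1;
    }

    return (long) (n / (ssize_t) sizeof(msgs[0]));
}
```

       A caller would then loop over the first N elements of msgs, where N
       is the return value, and dispatch on each element's event field.
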
       The fields set in the uffd_msg structure are as follows:

       event  The type of event.  Depending on the event type, different
              fields of the arg union represent details required for the
              event processing.  The non-page-fault events are generated only
              when the appropriate feature is enabled during the API
              handshake with the UFFDIO_API ioctl(2).

              The following values can appear in the event field:

              UFFD_EVENT_PAGEFAULT (since Linux 4.3)
                     A page-fault event.  The page-fault details are
                     available in the pagefault field.

              UFFD_EVENT_FORK (since Linux 4.11)
                     Generated when the faulting process invokes fork(2) (or
                     clone(2) without the CLONE_VM flag).  The event details
                     are available in the fork field.

              UFFD_EVENT_REMAP (since Linux 4.11)
                     Generated when the faulting process invokes mremap(2).
                     The event details are available in the remap field.

              UFFD_EVENT_REMOVE (since Linux 4.11)
                     Generated when the faulting process invokes madvise(2)
                     with MADV_DONTNEED or MADV_REMOVE advice.  The event
                     details are available in the remove field.

              UFFD_EVENT_UNMAP (since Linux 4.11)
                     Generated when the faulting process unmaps a memory
                     range, either explicitly using munmap(2) or implicitly
                     during mmap(2) or mremap(2).  The event details are
                     available in the remove field.

       pagefault.address
              The address that triggered the page fault.

       pagefault.flags
              A bit mask of flags that describe the event.  For
              UFFD_EVENT_PAGEFAULT, the following flag may appear:

              UFFD_PAGEFAULT_FLAG_WRITE
                     If the address is in a range that was registered with
                     the UFFDIO_REGISTER_MODE_MISSING flag (see
                     ioctl_userfaultfd(2)) and this flag is set, this is a
                     write fault; otherwise it is a read fault.

       fork.ufd
              The file descriptor associated with the userfaultfd object
              created for the child created by fork(2).

       remap.from
              The original address of the memory range that was remapped
              using mremap(2).

       remap.to
              The new address of the memory range that was remapped using
              mremap(2).

       remap.len
              The original length of the memory range that was remapped using
              mremap(2).

       remove.start
              The start address of the memory range that was freed using
              madvise(2) or unmapped.

       remove.end
              The end address of the memory range that was freed using
              madvise(2) or unmapped.

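       The mapping from event type to the relevant member of the arg union
       can be illustrated with a small formatting routine.  The function
       name uffd_describe() and the wording of its output are hypothetical;
       only the event constants and field names are taken from
       <linux/userfaultfd.h>.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: write a one-line description of 'msg' into
   'buf'.  Returns whatever snprintf(3) returns. */
static int
uffd_describe(const struct uffd_msg *msg, char *buf, size_t size)
{
    assert(msg != NULL && buf != NULL);

    switch (msg->event) {
    case UFFD_EVENT_PAGEFAULT:
        return snprintf(buf, size, "%s fault at 0x%llx",
                (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) ?
                    "write" : "read",
                (unsigned long long) msg->arg.pagefault.address);

    case UFFD_EVENT_FORK:
        return snprintf(buf, size, "fork: child uffd %u",
                (unsigned int) msg->arg.fork.ufd);

    case UFFD_EVENT_REMAP:
        return snprintf(buf, size, "remap 0x%llx -> 0x%llx, len %llu",
                (unsigned long long) msg->arg.remap.from,
                (unsigned long long) msg->arg.remap.to,
                (unsigned long long) msg->arg.remap.len);

    case UFFD_EVENT_REMOVE:             /* Both use the remove field */
    case UFFD_EVENT_UNMAP:
        return snprintf(buf, size, "%s 0x%llx..0x%llx",
                msg->event == UFFD_EVENT_REMOVE ? "remove" : "unmap",
                (unsigned long long) msg->arg.remove.start,
                (unsigned long long) msg->arg.remove.end);

    default:
        return snprintf(buf, size, "unknown event 0x%x",
                (unsigned int) msg->event);
    }
}
```

       The non-page-fault cases are reached only if the corresponding
       features were requested during the UFFDIO_API handshake.
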
       A read(2) on a userfaultfd file descriptor can fail with the following
       errors:

       EINVAL The userfaultfd object has not yet been enabled using the
              UFFDIO_API ioctl(2) operation.

       If the O_NONBLOCK flag is enabled in the associated open file
       description, the userfaultfd file descriptor can be monitored with
       poll(2), select(2), and epoll(7).  When events are available, the file
       descriptor is indicated as readable.  If the O_NONBLOCK flag is not
       enabled, then poll(2) (always) indicates the file as having a POLLERR
       condition, and select(2) indicates the file descriptor as both
       readable and writable.

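       Waiting for readability with poll(2) could be wrapped as in the
       following sketch.  uffd_wait_readable() is a hypothetical helper, and
       reporting POLLERR as a failure is a design choice made for this
       example, motivated by the POLLERR behavior described above.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to 'timeout_ms' milliseconds (-1 means
   wait indefinitely) for 'fd' to become readable.  Returns 1 if it is
   readable, 0 on timeout, or -1 if poll(2) failed or reported POLLERR
   (as it would for a userfaultfd opened without O_NONBLOCK). */
static int
uffd_wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int n;

    assert(timeout_ms >= -1);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    n = poll(&pfd, 1, timeout_ms);
    if (n <= 0)
        return n;                       /* 0: timeout; -1: errno is set */

    if (pfd.revents & POLLERR)
        return -1;

    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

       A return value of 1 would normally be followed by a read(2) of one or
       more uffd_msg structures.
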

RETURN VALUE

       On success, userfaultfd() returns a new file descriptor that refers to
       the userfaultfd object.  On error, -1 is returned, and errno is set
       appropriately.

ERRORS

       EINVAL An unsupported value was specified in flags.

       EMFILE The per-process limit on the number of open file descriptors
              has been reached.

       ENFILE The system-wide limit on the total number of open files has
              been reached.

       ENOMEM Insufficient kernel memory was available.

       EPERM (since Linux 5.2)
              The caller is not privileged (does not have the CAP_SYS_PTRACE
              capability in the initial user namespace), and
              /proc/sys/vm/unprivileged_userfaultfd has the value 0.

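       An application might turn these error codes into diagnostics as in
       the sketch below.  uffd_errno_hint() is a hypothetical helper, and the
       message strings are paraphrases of the descriptions above rather than
       anything produced by the kernel or the C library.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper: map an errno value from a failed userfaultfd()
   call to a short hint.  Unlisted values fall back to strerror(3). */
static const char *
uffd_errno_hint(int err)
{
    assert(err > 0);

    switch (err) {
    case EINVAL:
        return "unsupported value in flags";
    case EMFILE:
        return "per-process file descriptor limit reached";
    case ENFILE:
        return "system-wide open file limit reached";
    case ENOMEM:
        return "insufficient kernel memory";
    case EPERM:
        return "no CAP_SYS_PTRACE and "
               "/proc/sys/vm/unprivileged_userfaultfd is 0";
    default:
        return strerror(err);
    }
}
```

       The EPERM hint is the one most likely to matter in practice on
       systems where unprivileged use has been disabled by the
       administrator.
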

VERSIONS

       The userfaultfd() system call first appeared in Linux 4.3.

       Support for hugetlbfs and shared memory areas and non-page-fault
       events was added in Linux 4.11.

CONFORMING TO

       userfaultfd() is Linux-specific and should not be used in programs
       intended to be portable.

NOTES

       Glibc does not provide a wrapper for this system call; call it using
       syscall(2).

       The userfaultfd mechanism can be used as an alternative to traditional
       user-space paging techniques based on the use of the SIGSEGV signal
       and mmap(2).  It can also be used to implement lazy restore for
       checkpoint/restore mechanisms, as well as post-copy migration to allow
       (nearly) uninterrupted execution when transferring virtual machines
       and Linux containers from one host to another.

BUGS

       If the UFFD_FEATURE_EVENT_FORK feature is enabled and a system call
       from the fork(2) family is interrupted by a signal or fails, a stale
       userfaultfd descriptor might be created.  In this case, a spurious
       UFFD_EVENT_FORK will be delivered to the userfaultfd monitor.

EXAMPLES

       The program below demonstrates the use of the userfaultfd mechanism.
       The program creates two threads, one of which acts as the page-fault
       handler for the process, for the pages in a demand-zero paged region
       created using mmap(2).

       The program takes one command-line argument, which is the number of
       pages that will be created in a mapping whose page faults will be
       handled via userfaultfd.  After creating a userfaultfd object, the
       program then creates an anonymous private mapping of the specified
       size and registers the address range of that mapping using the
       UFFDIO_REGISTER ioctl(2) operation.  The program then creates a second
       thread that will perform the task of handling page faults.

       The main thread then walks through the pages of the mapping, fetching
       bytes from successive pages.  Because the pages have not yet been
       accessed, the first access of a byte in each page will trigger a
       page-fault event on the userfaultfd file descriptor.

       Each of the page-fault events is handled by the second thread, which
       sits in a loop processing input from the userfaultfd file descriptor.
       In each loop iteration, the second thread first calls poll(2) to check
       the state of the file descriptor, and then reads an event from the
       file descriptor.  All such events should be UFFD_EVENT_PAGEFAULT
       events, which the thread handles by copying a page of data into the
       faulting region using the UFFDIO_COPY ioctl(2) operation.

       The following is an example of what we see when running the program:

           $ ./userfaultfd_demo 3
           Address returned by mmap() = 0x7fd30106c000

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106c00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106c00f in main(): A
           Read address 0x7fd30106c40f in main(): A
           Read address 0x7fd30106c80f in main(): A
           Read address 0x7fd30106cc0f in main(): A

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106d00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106d00f in main(): B
           Read address 0x7fd30106d40f in main(): B
           Read address 0x7fd30106d80f in main(): B
           Read address 0x7fd30106dc0f in main(): B

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106e00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106e00f in main(): C
           Read address 0x7fd30106e40f in main(): C
           Read address 0x7fd30106e80f in main(): C
           Read address 0x7fd30106ec0f in main(): C

   Program source

       /* userfaultfd_demo.c

          Licensed under the GNU General Public License version 2 or later.
       */
       #define _GNU_SOURCE
       #include <inttypes.h>
       #include <sys/types.h>
       #include <stdio.h>
       #include <linux/userfaultfd.h>
       #include <pthread.h>
       #include <errno.h>
       #include <unistd.h>
       #include <stdlib.h>
       #include <fcntl.h>
       #include <signal.h>
       #include <poll.h>
       #include <string.h>
       #include <sys/mman.h>
       #include <sys/syscall.h>
       #include <sys/ioctl.h>

       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                               } while (0)

       static int page_size;

       static void *
       fault_handler_thread(void *arg)
       {
           static struct uffd_msg msg;   /* Data read from userfaultfd */
           static int fault_cnt = 0;     /* Number of faults so far handled */
           long uffd;                    /* userfaultfd file descriptor */
           static char *page = NULL;
           struct uffdio_copy uffdio_copy;
           ssize_t nread;

           uffd = (long) arg;

           /* Create a page that will be copied into the faulting region */

           if (page == NULL) {
               page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
               if (page == MAP_FAILED)
                   errExit("mmap");
           }

           /* Loop, handling incoming events on the userfaultfd
              file descriptor */

           for (;;) {

               /* See what poll() tells us about the userfaultfd */

               struct pollfd pollfd;
               int nready;
               pollfd.fd = uffd;
               pollfd.events = POLLIN;
               nready = poll(&pollfd, 1, -1);
               if (nready == -1)
                   errExit("poll");

               printf("\nfault_handler_thread():\n");
               printf("    poll() returns: nready = %d; "
                       "POLLIN = %d; POLLERR = %d\n", nready,
                       (pollfd.revents & POLLIN) != 0,
                       (pollfd.revents & POLLERR) != 0);

               /* Read an event from the userfaultfd */

               nread = read(uffd, &msg, sizeof(msg));
               if (nread == 0) {
                   printf("EOF on userfaultfd!\n");
                   exit(EXIT_FAILURE);
               }

               if (nread == -1)
                   errExit("read");

               /* We expect only one kind of event; verify that assumption */

               if (msg.event != UFFD_EVENT_PAGEFAULT) {
                   fprintf(stderr, "Unexpected event on userfaultfd\n");
                   exit(EXIT_FAILURE);
               }

               /* Display info about the page-fault event */

               printf("    UFFD_EVENT_PAGEFAULT event: ");
               printf("flags = %"PRIx64"; ", msg.arg.pagefault.flags);
               printf("address = %"PRIx64"\n", msg.arg.pagefault.address);

               /* Copy the page pointed to by 'page' into the faulting
                  region. Vary the contents that are copied in, so that it
                  is more obvious that each fault is handled separately. */

               memset(page, 'A' + fault_cnt % 20, page_size);
               fault_cnt++;

               uffdio_copy.src = (unsigned long) page;

               /* We need to handle page faults in units of pages(!).
                  So, round faulting address down to page boundary */

               uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
                                                  ~(page_size - 1);
               uffdio_copy.len = page_size;
               uffdio_copy.mode = 0;
               uffdio_copy.copy = 0;
               if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
                   errExit("ioctl-UFFDIO_COPY");

               printf("        (uffdio_copy.copy returned %"PRId64")\n",
                       uffdio_copy.copy);
           }
       }

       int
       main(int argc, char *argv[])
       {
           long uffd;          /* userfaultfd file descriptor */
           char *addr;         /* Start of region handled by userfaultfd */
           uint64_t len;       /* Length of region handled by userfaultfd */
           pthread_t thr;      /* ID of thread that handles page faults */
           struct uffdio_api uffdio_api;
           struct uffdio_register uffdio_register;
           int s;

           if (argc != 2) {
               fprintf(stderr, "Usage: %s num-pages\n", argv[0]);
               exit(EXIT_FAILURE);
           }

           page_size = sysconf(_SC_PAGE_SIZE);
           len = strtoull(argv[1], NULL, 0) * page_size;

           /* Create and enable userfaultfd object */

           uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
           if (uffd == -1)
               errExit("userfaultfd");

           uffdio_api.api = UFFD_API;
           uffdio_api.features = 0;
           if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
               errExit("ioctl-UFFDIO_API");

           /* Create a private anonymous mapping. The memory will be
              demand-zero paged--that is, not yet allocated. When we
              actually touch the memory, it will be allocated via
              the userfaultfd. */

           addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
           if (addr == MAP_FAILED)
               errExit("mmap");

           printf("Address returned by mmap() = %p\n", addr);

           /* Register the memory range of the mapping we just created for
              handling by the userfaultfd object. In mode, we request to track
              missing pages (i.e., pages that have not yet been faulted in). */

           uffdio_register.range.start = (unsigned long) addr;
           uffdio_register.range.len = len;
           uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
           if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
               errExit("ioctl-UFFDIO_REGISTER");

           /* Create a thread that will process the userfaultfd events */

           s = pthread_create(&thr, NULL, fault_handler_thread, (void *) uffd);
           if (s != 0) {
               errno = s;
               errExit("pthread_create");
           }

           /* Main thread now touches memory in the mapping, touching
              locations 1024 bytes apart. This will trigger userfaultfd
              events for all pages in the region. */

           uint64_t l; /* Offset into the mapping; same type as 'len' */
           l = 0xf;    /* Ensure that faulting address is not on a page
                          boundary, in order to test that we correctly
                          handle that case in fault_handler_thread() */
           while (l < len) {
               char c = addr[l];
               printf("Read address %p in main(): ", addr + l);
               printf("%c\n", c);
               l += 1024;
               usleep(100000);         /* Slow things down a little */
           }

           exit(EXIT_SUCCESS);
       }

SEE ALSO

       fcntl(2), ioctl(2), ioctl_userfaultfd(2), madvise(2), mmap(2)

       Documentation/admin-guide/mm/userfaultfd.rst in the Linux kernel source
       tree

COLOPHON

       This page is part of release 5.10 of the Linux man-pages project.  A
       description of the project, information about reporting bugs, and the
       latest version of this page, can be found at
       https://www.kernel.org/doc/man-pages/.

Linux                             2020-11-01                    USERFAULTFD(2)