memfd_secret(2)

memfd_secret(2)               System Calls Manual              memfd_secret(2)

NAME

       memfd_secret - create an anonymous RAM-based file to access secret mem‐
       ory regions

LIBRARY

       Standard C library (libc, -lc)

SYNOPSIS

       #include <sys/syscall.h>      /* Definition of SYS_* constants */
       #include <unistd.h>

       int syscall(SYS_memfd_secret, unsigned int flags);

       Note: glibc provides no wrapper for memfd_secret(),  necessitating  the
       use of syscall(2).
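
       As a minimal sketch of the call shown above, the  following  wrapper
       invokes   the   system   call   through   syscall(2).    The   name
       memfd_secret_wrapper() is hypothetical and not part of any library.
       The  #ifdef fallback for headers that lack SYS_memfd_secret is also an
       assumption of this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>   /* Definition of SYS_* constants */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper name; not provided by glibc or the kernel headers. */
static int memfd_secret_wrapper(unsigned int flags)
{
#ifdef SYS_memfd_secret
    /* syscall() returns long; the result is a file descriptor or -1. */
    return (int) syscall(SYS_memfd_secret, flags);
#else
    /* Assumption: headers without the constant predate the system call,
       so report it as unavailable. */
    (void) flags;
    errno = ENOSYS;
    return -1;
#endif
}
```

       A caller would typically check for -1 and inspect errno, then close
       the descriptor with close(2) when it is no longer needed.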

DESCRIPTION

       memfd_secret()  creates  an anonymous RAM-based file and returns a file
       descriptor that refers to it.  The file provides a way  to  create  and
       access  memory  regions  with  stronger protection than usual RAM-based
       files and anonymous memory mappings.  Once all open references  to  the
       file are closed, it is automatically released.  The initial size of the
       file is set to 0.  Following the call, the file size should be set  us‐
       ing ftruncate(2).

       The memory areas backing the file created with memfd_secret(2) are vis‐
       ible only to the processes that have access  to  the  file  descriptor.
       The  memory  region is removed from the kernel page tables and only the
       page tables of the processes holding the file descriptor map the corre‐
       sponding  physical memory.  (Thus, the pages in the region can't be ac‐
       cessed by the kernel itself, so that, for example, pointers to the  re‐
       gion can't be passed to system calls.)

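       The sequence of creating the file, sizing it with  ftruncate(2),  and
       mapping  it  with  mmap(2)  might look like the following sketch.  The
       helper name secret_alloc() is hypothetical.  The  sketch  assumes  that
       the  mapping  must  be  shared  (MAP_SHARED) and that the descriptor can
       be closed once the mapping exists, because the mapping itself  holds  a
       reference to the file.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of secret memory.
   Returns the mapping on success, or NULL with errno set. */
static void *secret_alloc(size_t len)
{
#ifdef SYS_memfd_secret
    int fd = (int) syscall(SYS_memfd_secret, 0u);
    if (fd == -1)
        return NULL;

    /* The file starts with size 0, so size it before mapping. */
    if (ftruncate(fd, (off_t) len) == -1) {
        int trunc_errno = errno;
        close(fd);
        errno = trunc_errno;
        return NULL;
    }

    /* Assumption: the region is mapped shared, not private. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    int map_errno = errno;

    /* The mapping keeps its own reference to the file. */
    close(fd);
    errno = map_errno;
    return p == MAP_FAILED ? NULL : p;
#else
    /* Assumption: headers without the constant predate the system call. */
    (void) len;
    errno = ENOSYS;
    return NULL;
#endif
}
```

       The caller releases the region with munmap(2), passing the same length.
       Pages are allocated only when first touched, so a failure may still
       occur after secret_alloc() has returned.
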
       The following values may be bitwise ORed in flags to control the behav‐
       ior of memfd_secret():

       FD_CLOEXEC
              Set the close-on-exec flag on the  new  file  descriptor,  which
              causes  the  region to be removed from the process on execve(2).
              See the description of the O_CLOEXEC flag in open(2).

       As its return value, memfd_secret() returns a new file descriptor  that
       refers  to  an anonymous file.  This file descriptor is opened for both
       reading and writing (O_RDWR) and O_LARGEFILE is set for  the  file  de‐
       scriptor.

       With  respect  to  fork(2) and execve(2), the usual semantics apply for
       the file descriptor created by memfd_secret().  A copy of the file  de‐
       scriptor  is  inherited  by the child produced by fork(2) and refers to
       the same file.  The file descriptor is preserved across execve(2),  un‐
       less the close-on-exec flag has been set.

       The  memory  region  is  locked  into  memory  in  the same way as with
       mlock(2), so that it will never be written into swap,  and  hibernation
       is  inhibited  for  as  long  as any memfd_secret() descriptions exist.
       However, the implementation of memfd_secret() will not try to  populate
       the  whole  range during the mmap(2) call that attaches the region into
       the process's address space; instead, the pages are only actually allo‐
       cated  as they are faulted in.  The amount of memory allowed for memory
       mappings of the file descriptor obeys the same rules  as  mlock(2)  and
       cannot exceed RLIMIT_MEMLOCK.
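
       Because  the  mappings  count against RLIMIT_MEMLOCK, a program may
       want to compare a requested size with the limit before mapping.   The
       following  is a rough sketch.  The helper name fits_memlock_limit() is
       hypothetical, and the check is deliberately simplified: it looks  only
       at  the  soft limit and ignores memory the process has already locked
       as well as any privilege that would exempt it from the limit.

```c
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical helper: returns 1 if `len` bytes fit under the soft
   RLIMIT_MEMLOCK limit, 0 if they do not, and -1 if the limit could
   not be queried.  Simplification: memory already locked by the
   process is not taken into account. */
static int fits_memlock_limit(size_t len)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) == -1)
        return -1;

    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;

    return (rlim_t) len <= rl.rlim_cur ? 1 : 0;
}
```

       A result of 0 suggests that mmap(2) on the  descriptor  is  likely  to
       fail  for  that  size  unless  the  limit  is  raised, for example with
       setrlimit(2).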

RETURN VALUE

       On success, memfd_secret() returns a new file descriptor.  On error, -1
       is returned and errno is set to indicate the error.

ERRORS

       EINVAL flags included unknown bits.

       EMFILE The per-process limit on the number of open file descriptors has
              been reached.

       ENFILE The system-wide limit on the total number of open files has been
              reached.

       ENOMEM There was insufficient memory to create a new anonymous file.

       ENOSYS memfd_secret() is not implemented on this architecture,  or  has
              not  been  enabled  on  the  kernel  command  line  with
              secretmem.enable=1.
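
       A caller might translate these error codes into diagnostics along  the
       lines  of  the  following  sketch.   The  function  name
       memfd_secret_hint() and the message strings are hypothetical; only the
       errno values come from the list above.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: map an errno value from a failed memfd_secret()
   call to a short explanation.  The wording of the messages is
   illustrative only. */
static const char *memfd_secret_hint(int err)
{
    switch (err) {
    case EINVAL:
        return "flags included unknown bits";
    case EMFILE:
        return "per-process limit on open file descriptors reached";
    case ENFILE:
        return "system-wide limit on open files reached";
    case ENOMEM:
        return "insufficient memory to create a new anonymous file";
    case ENOSYS:
        return "not implemented on this architecture, or not enabled";
    default:
        /* Any other value falls back to the generic description. */
        return strerror(err);
    }
}
```

       A program that can run without secret memory could treat ENOSYS as  a
       signal  to  fall  back to an ordinary anonymous mapping, at the cost of
       the protections described in NOTES.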

STANDARDS

       Linux.

HISTORY

       Linux 5.14.

NOTES

       The memfd_secret() system  call  is  designed  to  allow  a  user-space
       process  to  create  a  range of memory that is inaccessible to anybody
       else, the kernel included.  There is no  absolute  guarantee  that  the
       kernel can never access memory ranges backed by memfd_secret(), but it
       is much harder to exfiltrate data from these regions.

       memfd_secret() provides the following protections:

       •  Enhanced protection (in conjunction with all the other in-kernel at‐
          tack prevention systems) against ROP attacks.  The absence  of  any
          in-kernel  primitive  for  accessing memory backed by memfd_secret()
          means that a one-gadget ROP attack can't work to perform  data  ex‐
          filtration.   The  attacker would need to find enough ROP gadgets to
          reconstruct the missing page table entries, which significantly  in‐
          creases  the difficulty of the attack, especially when other protec‐
          tions like the kernel stack size limit and address space layout ran‐
          domization are in place.

       •  Prevent  cross-process  user-space  memory exposures.  Once a region
          for a memfd_secret() memory mapping is allocated, the user can't ac‐
          cidentally pass it into the kernel to be transmitted somewhere.  The
          memory pages in this region cannot be accessed via  the  direct  map
          and they are disallowed in get_user_pages().

       •  Harden  against  exploited  kernel flaws.  In order to access memory
          areas backed by memfd_secret(), a kernel-side attack would  need  to
          either  walk  the  page  tables  and create new ones, or spawn a new
          privileged user-space process to perform secrets exfiltration  using
          ptrace(2).

       The  way memfd_secret() allocates and locks the memory may impact over‐
       all system performance; therefore, the system call is disabled  by  de‐
       fault  and  is  available only if the system administrator has turned it
       on using the "secretmem.enable=y" kernel parameter.

       To prevent potential data leaks of memory regions backed  by  memfd_se‐
       cret()  from  a  hibernation image, hibernation is prevented when there
       are active memfd_secret() users.