madvise(2) System Calls Manual madvise(2)
2
3
4
NAME
6 madvise - give advice about use of memory
7
LIBRARY
9 Standard C library (libc, -lc)
10
SYNOPSIS
12 #include <sys/mman.h>
13
14 int madvise(void addr[.length], size_t length, int advice);
15
16 Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
17
18 madvise():
19 Since glibc 2.19:
20 _DEFAULT_SOURCE
21 Up to and including glibc 2.19:
22 _BSD_SOURCE
23
DESCRIPTION
25 The madvise() system call is used to give advice or directions to the
26 kernel about the address range beginning at address addr and with size
length. madvise() operates only on whole pages; therefore, addr must be
28 page-aligned. The value of length is rounded up to a multiple of page
29 size. In most cases, the goal of such advice is to improve system or
30 application performance.
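
Because addr must be page-aligned while application buffers often are not, a caller typically rounds the range out to page boundaries before calling madvise(). The following is a minimal sketch of that step. The helper name advise_pages() is hypothetical, the code assumes the page size is a power of two, and it is suitable only for non-destructive hints, since rounding outward also covers neighboring bytes in the first and last page.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes madvise() and MAP_ANONYMOUS in glibc */
#endif
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: apply a non-destructive hint to every page that
   contains part of the byte range [p, p + n).  Assumes the page size is
   a power of two. */
int advise_pages(void *p, size_t n, int advice)
{
    uintptr_t page  = (uintptr_t) sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t) p & ~(page - 1);    /* round addr down */
    size_t    len   = (size_t) ((uintptr_t) p + n - start);

    /* The kernel rounds len up to a multiple of the page size. */
    return madvise((void *) start, len, advice);
}
```

Passing an unaligned pointer directly to madvise() fails with EINVAL, whereas the same range passed through the helper is accepted.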
31
32 Initially, the system call supported a set of "conventional" advice
33 values, which are also available on several other implementations.
34 (Note, though, that madvise() is not specified in POSIX.) Subse‐
35 quently, a number of Linux-specific advice values have been added.
36
37 Conventional advice values
38 The advice values listed below allow an application to tell the kernel
39 how it expects to use some mapped or shared memory areas, so that the
40 kernel can choose appropriate read-ahead and caching techniques. These
41 advice values do not influence the semantics of the application (except
42 in the case of MADV_DONTNEED), but may influence its performance. All
43 of the advice values listed here have analogs in the POSIX-specified
44 posix_madvise(3) function, and the values have the same meanings, with
45 the exception of MADV_DONTNEED.
46
47 The advice is indicated in the advice argument, which is one of the
48 following:
49
50 MADV_NORMAL
51 No special treatment. This is the default.
52
53 MADV_RANDOM
Expect page references in random order. (Hence, read ahead may
be less useful than normal.)
56
57 MADV_SEQUENTIAL
58 Expect page references in sequential order. (Hence, pages in
59 the given range can be aggressively read ahead, and may be freed
60 soon after they are accessed.)
61
62 MADV_WILLNEED
63 Expect access in the near future. (Hence, it might be a good
64 idea to read some pages ahead.)
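
As an illustration of the read-ahead hints described so far, the sketch below maps a file read-only, marks the mapping MADV_SEQUENTIAL, and then reads it front to back. The function name sum_file_sequential() is hypothetical. Since the hint is purely advisory, the example deliberately ignores the return value of madvise().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes madvise() in glibc */
#endif
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: sum all bytes of a non-empty file through a
   read-only mapping.  Returns -1 on error. */
long sum_file_sequential(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    size_t len = (size_t) st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                    /* the mapping keeps the file referenced */
    if (p == MAP_FAILED)
        return -1;

    /* Advisory only: carry on with the read even if the hint fails. */
    (void) madvise(p, len, MADV_SEQUENTIAL);

    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];

    munmap(p, len);
    return sum;
}
```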
65
66 MADV_DONTNEED
67 Do not expect access in the near future. (For the time being,
68 the application is finished with the given range, so the kernel
69 can free resources associated with it.)
70
71 After a successful MADV_DONTNEED operation, the semantics of
72 memory access in the specified region are changed: subsequent
73 accesses of pages in the range will succeed, but will result in
74 either repopulating the memory contents from the up-to-date con‐
75 tents of the underlying mapped file (for shared file mappings,
76 shared anonymous mappings, and shmem-based techniques such as
77 System V shared memory segments) or zero-fill-on-demand pages
78 for anonymous private mappings.
79
80 Note that, when applied to shared mappings, MADV_DONTNEED might
81 not lead to immediate freeing of the pages in the range. The
82 kernel is free to delay freeing the pages until an appropriate
moment. The resident set size (RSS) of the calling process will,
however, be reduced immediately.
85
86 MADV_DONTNEED cannot be applied to locked pages, or VM_PFNMAP
87 pages. (Pages marked with the kernel-internal VM_PFNMAP flag
88 are special memory areas that are not managed by the virtual
89 memory subsystem. Such pages are typically created by device
90 drivers that map the pages into user space.)
91
92 Support for Huge TLB pages was added in Linux v5.18. Addresses
93 within a mapping backed by Huge TLB pages must be aligned to the
94 underlying Huge TLB page size, and the range length is rounded
95 up to a multiple of the underlying Huge TLB page size.
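
The change in access semantics after MADV_DONTNEED can be observed directly on a private anonymous mapping. The following sketch (dontneed_zero_fills() is a hypothetical name) dirties one page, discards it, and checks that the next read sees a zero-filled page.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes madvise() and MAP_ANONYMOUS in glibc */
#endif
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if a private anonymous page reads back as
   zero after MADV_DONTNEED, 0 if it does not, -1 on setup failure. */
int dontneed_zero_fills(void)
{
    size_t len = (size_t) sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, len);                 /* dirty the page */
    if (madvise(p, len, MADV_DONTNEED) == -1) {
        munmap(p, len);
        return -1;
    }

    /* The access below succeeds and faults in a fresh zero page. */
    int zeroed = (p[0] == 0 && p[len - 1] == 0);
    munmap(p, len);
    return zeroed;
}
```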
96
97 Linux-specific advice values
98 The following Linux-specific advice values have no counterparts in the
99 POSIX-specified posix_madvise(3), and may or may not have counterparts
100 in the madvise() interface available on other implementations. Note
101 that some of these operations change the semantics of memory accesses.
102
103 MADV_REMOVE (since Linux 2.6.16)
104 Free up a given range of pages and its associated backing store.
105 This is equivalent to punching a hole in the corresponding range
106 of the backing store (see fallocate(2)). Subsequent accesses in
107 the specified address range will see data with a value of zero.
108
109 The specified address range must be mapped shared and writable.
110 This flag cannot be applied to locked pages, or VM_PFNMAP pages.
111
112 In the initial implementation, only tmpfs(5) supported MADV_RE‐
113 MOVE; but since Linux 3.5, any filesystem which supports the
114 fallocate(2) FALLOC_FL_PUNCH_HOLE mode also supports MADV_RE‐
115 MOVE. Filesystems which do not support MADV_REMOVE fail with
116 the error EOPNOTSUPP.
117
118 Support for the Huge TLB filesystem was added in Linux v4.3.
119
120 MADV_DONTFORK (since Linux 2.6.16)
121 Do not make the pages in this range available to the child after
122 a fork(2). This is useful to prevent copy-on-write semantics
123 from changing the physical location of a page if the parent
124 writes to it after a fork(2). (Such page relocations cause
125 problems for hardware that DMAs into the page.)
126
127 MADV_DOFORK (since Linux 2.6.16)
128 Undo the effect of MADV_DONTFORK, restoring the default behav‐
129 ior, whereby a mapping is inherited across fork(2).
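
The effect of MADV_DONTFORK can be sketched as follows: the parent marks a page, forks, and the child faults when it touches the address because the range was not inherited. The function name dontfork_hides_from_child() is hypothetical, and the example assumes the child inherits the default SIGSEGV disposition.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes madvise() and MAP_ANONYMOUS in glibc */
#endif
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if a child is killed by SIGSEGV when it
   touches a MADV_DONTFORK range, 0 if it is not, -1 on failure. */
int dontfork_hides_from_child(void)
{
    size_t len = (size_t) sysconf(_SC_PAGESIZE);
    volatile unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    p[0] = 1;                                  /* populate the page */

    int result = -1;
    if (madvise((void *) p, len, MADV_DONTFORK) == 0) {
        pid_t pid = fork();
        if (pid == 0) {
            (void) p[0];                       /* not mapped in the child */
            _exit(0);
        }
        int status;
        if (pid > 0 && waitpid(pid, &status, 0) == pid)
            result = WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
    }
    munmap((void *) p, len);
    return result;
}
```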
130
131 MADV_HWPOISON (since Linux 2.6.32)
132 Poison the pages in the range specified by addr and length and
133 handle subsequent references to those pages like a hardware mem‐
134 ory corruption. This operation is available only for privileged
135 (CAP_SYS_ADMIN) processes. This operation may result in the
136 calling process receiving a SIGBUS and the page being unmapped.
137
138 This feature is intended for testing of memory error-handling
139 code; it is available only if the kernel was configured with
140 CONFIG_MEMORY_FAILURE.
141
142 MADV_MERGEABLE (since Linux 2.6.32)
143 Enable Kernel Samepage Merging (KSM) for the pages in the range
144 specified by addr and length. The kernel regularly scans those
145 areas of user memory that have been marked as mergeable, looking
146 for pages with identical content. These are replaced by a sin‐
147 gle write-protected page (which is automatically copied if a
148 process later wants to update the content of the page). KSM
149 merges only private anonymous pages (see mmap(2)).
150
151 The KSM feature is intended for applications that generate many
152 instances of the same data (e.g., virtualization systems such as
153 KVM). It can consume a lot of processing power; use with care.
154 See the Linux kernel source file Documentation/ad‐
155 min-guide/mm/ksm.rst for more details.
156
157 The MADV_MERGEABLE and MADV_UNMERGEABLE operations are available
158 only if the kernel was configured with CONFIG_KSM.
159
160 MADV_UNMERGEABLE (since Linux 2.6.32)
161 Undo the effect of an earlier MADV_MERGEABLE operation on the
162 specified address range; KSM unmerges whatever pages it had
163 merged in the address range specified by addr and length.
164
165 MADV_SOFT_OFFLINE (since Linux 2.6.33)
166 Soft offline the pages in the range specified by addr and
167 length. The memory of each page in the specified range is pre‐
168 served (i.e., when next accessed, the same content will be visi‐
169 ble, but in a new physical page frame), and the original page is
170 offlined (i.e., no longer used, and taken out of normal memory
171 management). The effect of the MADV_SOFT_OFFLINE operation is
172 invisible to (i.e., does not change the semantics of) the call‐
173 ing process.
174
175 This feature is intended for testing of memory error-handling
176 code; it is available only if the kernel was configured with
177 CONFIG_MEMORY_FAILURE.
178
179 MADV_HUGEPAGE (since Linux 2.6.38)
180 Enable Transparent Huge Pages (THP) for pages in the range spec‐
181 ified by addr and length. The kernel will regularly scan the
182 areas marked as huge page candidates to replace them with huge
183 pages. The kernel will also allocate huge pages directly when
the region is naturally aligned to the huge page size (see
posix_memalign(3)).
186
187 This feature is primarily aimed at applications that use large
188 mappings of data and access large regions of that memory at a
189 time (e.g., virtualization systems such as QEMU). It can very
190 easily waste memory (e.g., a 2 MB mapping that only ever ac‐
191 cesses 1 byte will result in 2 MB of wired memory instead of one
192 4 KB page). See the Linux kernel source file Documentation/ad‐
193 min-guide/mm/transhuge.rst for more details.
194
Most common kernel configurations provide MADV_HUGEPAGE-style
behavior by default, and thus MADV_HUGEPAGE is normally not
necessary. It is mostly intended for embedded systems, where
MADV_HUGEPAGE-style behavior may not be enabled by default in
the kernel. On such systems, this flag can be used to
selectively enable THP. Whenever MADV_HUGEPAGE is used, it
should be applied only to regions of memory whose access pattern
the developer knows in advance will not increase the memory
footprint of the application when transparent hugepages are
enabled.
205
206 Since Linux 5.4, automatic scan of eligible areas and replace‐
207 ment by huge pages works with private anonymous pages (see
208 mmap(2)), shmem pages, and file-backed pages. For all memory
types, memory may only be replaced by huge pages on
hugepage-aligned boundaries. For file-mapped memory, including
tmpfs (see tmpfs(5)), the mapping must also be naturally
hugepage-aligned within the file. Additionally, for file-backed,
non-tmpfs memory, the file must not be open for write and the
mapping must be executable.
215
216 The VMA must not be marked VM_NOHUGEPAGE, VM_HUGETLB, VM_IO,
217 VM_DONTEXPAND, VM_MIXEDMAP, or VM_PFNMAP, nor can it be stack
218 memory or backed by a DAX-enabled device (unless the DAX device
219 is hot-plugged as System RAM). The process must also not have
220 PR_SET_THP_DISABLE set (see prctl(2)).
221
The MADV_HUGEPAGE, MADV_NOHUGEPAGE, and MADV_COLLAPSE operations
are available only if the kernel was configured with
CONFIG_TRANSPARENT_HUGEPAGE; file/shmem memory is supported only
if the kernel was configured with CONFIG_READ_ONLY_THP_FOR_FS.
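
The alignment requirement for MADV_HUGEPAGE can be met with posix_memalign(3), as in the sketch below. The helper alloc_thp_hint() and the constant HUGE_ALIGN are hypothetical; 2 MiB is an assumption that matches the usual huge page size on x86-64 but not on every architecture. The caller is expected to pass a length that is a multiple of HUGE_ALIGN. Because the hint is optional, a failing madvise() (for example, EINVAL on a kernel without THP) is ignored.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes madvise() in glibc */
#endif
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_ALIGN (2UL * 1024 * 1024)   /* assumed huge page size */

/* Hypothetical helper: allocate len bytes aligned to the assumed huge
   page size and ask for THP backing.  len should be a multiple of
   HUGE_ALIGN.  Release the memory with free(). */
void *alloc_thp_hint(size_t len)
{
    void *p = NULL;

    if (posix_memalign(&p, HUGE_ALIGN, len) != 0)
        return NULL;
#ifdef MADV_HUGEPAGE
    (void) madvise(p, len, MADV_HUGEPAGE);   /* best effort */
#endif
    return p;
}
```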
226
227 MADV_NOHUGEPAGE (since Linux 2.6.38)
228 Ensures that memory in the address range specified by addr and
229 length will not be backed by transparent hugepages.
230
231 MADV_COLLAPSE (since Linux 6.1)
232 Perform a best-effort synchronous collapse of the native pages
233 mapped by the memory range into Transparent Huge Pages (THPs).
234 MADV_COLLAPSE operates on the current state of memory of the
235 calling process and makes no persistent changes or guarantees on
236 how pages will be mapped, constructed, or faulted in the future.
237
238 MADV_COLLAPSE supports private anonymous pages (see mmap(2)),
239 shmem pages, and file-backed pages. See MADV_HUGEPAGE for gen‐
240 eral information on memory requirements for THP. If the range
241 provided spans multiple VMAs, the semantics of the collapse over
242 each VMA is independent from the others. If collapse of a given
243 huge page-aligned/sized region fails, the operation may continue
244 to attempt collapsing the remainder of the specified memory.
245 MADV_COLLAPSE will automatically clamp the provided range to be
246 hugepage-aligned.
247
248 All non-resident pages covered by the range will first be
249 swapped/faulted-in, before being copied onto a freshly allocated
250 hugepage. If the native pages compose the same PTE-mapped
251 hugepage, and are suitably aligned, allocation of a new hugepage
252 may be elided and collapse may happen in-place. Unmapped pages
253 will have their data directly initialized to 0 in the new
254 hugepage. However, for every eligible hugepage-aligned/sized
255 region to be collapsed, at least one page must currently be
256 backed by physical memory.
257
258 MADV_COLLAPSE is independent of any sysfs (see sysfs(5)) setting
259 under /sys/kernel/mm/transparent_hugepage, both in terms of de‐
260 termining THP eligibility, and allocation semantics. See Linux
261 kernel source file Documentation/admin-guide/mm/transhuge.rst
for more information. MADV_COLLAPSE also ignores the huge= tmpfs
mount option when operating on tmpfs files. Allocation for the new
264 hugepage may enter direct reclaim and/or compaction, regardless
265 of VMA flags (though VM_NOHUGEPAGE is still respected).
266
267 When the system has multiple NUMA nodes, the hugepage will be
268 allocated from the node providing the most native pages.
269
270 If all hugepage-sized/aligned regions covered by the provided
271 range were either successfully collapsed, or were already PMD-
272 mapped THPs, this operation will be deemed successful. Note
that this doesn't guarantee anything about other possible
mappings of the memory. If multiple hugepage-aligned/sized
areas fail to collapse, only the error code of the most recent
failure will be set in errno.
277
278 MADV_DONTDUMP (since Linux 3.4)
279 Exclude from a core dump those pages in the range specified by
280 addr and length. This is useful in applications that have large
281 areas of memory that are known not to be useful in a core dump.
282 The effect of MADV_DONTDUMP takes precedence over the bit mask
283 that is set via the /proc/pid/coredump_filter file (see
284 core(5)).
285
286 MADV_DODUMP (since Linux 3.4)
287 Undo the effect of an earlier MADV_DONTDUMP.
288
289 MADV_FREE (since Linux 4.5)
The application no longer requires the pages in the range
specified by addr and length. The kernel can thus free these pages, but
292 the freeing could be delayed until memory pressure occurs. For
293 each of the pages that has been marked to be freed but has not
294 yet been freed, the free operation will be canceled if the
295 caller writes into the page. After a successful MADV_FREE oper‐
296 ation, any stale data (i.e., dirty, unwritten pages) will be
297 lost when the kernel frees the pages. However, subsequent
writes to pages in the range will succeed, and the kernel then cannot
free those dirtied pages, so the caller can always see the data it
has just written. If there is no subsequent write, the kernel can
301 free the pages at any time. Once pages in the range have been
302 freed, the caller will see zero-fill-on-demand pages upon subse‐
303 quent page references.
304
305 The MADV_FREE operation can be applied only to private anonymous
306 pages (see mmap(2)). Before Linux 4.12, when freeing pages on a
307 swapless system, the pages in the given range are freed in‐
308 stantly, regardless of memory pressure.
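
Memory allocators are a common user of MADV_FREE. The sketch below shows one possible release routine; release_pages() is a hypothetical name, and the fallback to MADV_DONTNEED on EINVAL is a design choice of this example (for kernels older than Linux 4.5), not something this page prescribes. After a successful call, the old contents must be treated as indeterminate until the range is written again.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes madvise() and MAP_ANONYMOUS in glibc */
#endif
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: tell the kernel that the contents of a private
   anonymous range are no longer needed.  Returns 0 on success, -1 with
   errno set on failure. */
int release_pages(void *addr, size_t len)
{
#ifdef MADV_FREE
    if (madvise(addr, len, MADV_FREE) == 0)
        return 0;                 /* pages are freed lazily */
    if (errno != EINVAL)
        return -1;
#endif
    /* Assumed fallback: discard the pages immediately instead. */
    return madvise(addr, len, MADV_DONTNEED);
}
```

A page that is written after release_pages() returns holds the newly written data, whichever of the two advice values was applied.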
309
310 MADV_WIPEONFORK (since Linux 4.14)
311 Present the child process with zero-filled memory in this range
312 after a fork(2). This is useful in forking servers in order to
313 ensure that sensitive per-process data (for example, PRNG seeds,
314 cryptographic secrets, and so on) is not handed to child pro‐
315 cesses.
316
317 The MADV_WIPEONFORK operation can be applied only to private
318 anonymous pages (see mmap(2)).
319
320 Within the child created by fork(2), the MADV_WIPEONFORK setting
321 remains in place on the specified address range. This setting
322 is cleared during execve(2).
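
The following sketch illustrates the behavior: the parent stores a marker byte in a MADV_WIPEONFORK region, forks, and the child reports the byte it observes through its exit status. The function name wipeonfork_child_view() is hypothetical. On a kernel or with headers that lack MADV_WIPEONFORK, the function reports failure rather than guessing.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes madvise() and MAP_ANONYMOUS in glibc */
#endif
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the first byte that a child sees in a
   MADV_WIPEONFORK region whose first byte the parent set to 0x5A, or
   -1 on any failure. */
int wipeonfork_child_view(void)
{
    size_t len = (size_t) sysconf(_SC_PAGESIZE);
    volatile unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    p[0] = 0x5A;                              /* stand-in for a secret */

    int result = -1;
#ifdef MADV_WIPEONFORK
    if (madvise((void *) p, len, MADV_WIPEONFORK) == 0) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(p[0]);                      /* child reports its view */
        int status;
        if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = WEXITSTATUS(status);
    }
#endif
    munmap((void *) p, len);
    return result;
}
```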
323
324 MADV_KEEPONFORK (since Linux 4.14)
325 Undo the effect of an earlier MADV_WIPEONFORK.
326
327 MADV_COLD (since Linux 5.4)
328 Deactivate a given range of pages. This will make the pages a
more probable reclaim target should there be memory pressure.
330 This is a nondestructive operation. The advice might be ignored
331 for some pages in the range when it is not applicable.
332
333 MADV_PAGEOUT (since Linux 5.4)
334 Reclaim a given range of pages. This is done to free up memory
335 occupied by these pages. If a page is anonymous, it will be
336 swapped out. If a page is file-backed and dirty, it will be
337 written back to the backing storage. The advice might be ig‐
338 nored for some pages in the range when it is not applicable.
339
340 MADV_POPULATE_READ (since Linux 5.14)
Populate (prefault) page tables readable, faulting in all pages
342 in the range just as if manually reading from each page; how‐
343 ever, avoid the actual memory access that would have been per‐
344 formed after handling the fault.
345
346 In contrast to MAP_POPULATE, MADV_POPULATE_READ does not hide
347 errors, can be applied to (parts of) existing mappings and will
348 always populate (prefault) page tables readable. One example
349 use case is prefaulting a file mapping, reading all file content
350 from disk; however, pages won't be dirtied and consequently
351 won't have to be written back to disk when evicting the pages
352 from memory.
353
354 Depending on the underlying mapping, map the shared zeropage,
355 preallocate memory or read the underlying file; files with holes
356 might or might not preallocate blocks. If populating fails, a
357 SIGBUS signal is not generated; instead, an error is returned.
358
359 If MADV_POPULATE_READ succeeds, all page tables have been popu‐
360 lated (prefaulted) readable once. If MADV_POPULATE_READ fails,
361 some page tables might have been populated.
362
363 MADV_POPULATE_READ cannot be applied to mappings without read
364 permissions and special mappings, for example, mappings marked
365 with kernel-internal flags such as VM_PFNMAP or VM_IO, or secret
366 memory regions created using memfd_secret(2).
367
368 Note that with MADV_POPULATE_READ, the process can be killed at
369 any moment when the system runs out of memory.
370
371 MADV_POPULATE_WRITE (since Linux 5.14)
372 Populate (prefault) page tables writable, faulting in all pages
in the range just as if manually writing to each page; however,
avoid the actual memory access that would have been performed
after handling the fault.
376
377 In contrast to MAP_POPULATE, MADV_POPULATE_WRITE does not hide
378 errors, can be applied to (parts of) existing mappings and will
379 always populate (prefault) page tables writable. One example
380 use case is preallocating memory, breaking any CoW (Copy on
381 Write).
382
383 Depending on the underlying mapping, preallocate memory or read
384 the underlying file; files with holes will preallocate blocks.
385 If populating fails, a SIGBUS signal is not generated; instead,
386 an error is returned.
387
388 If MADV_POPULATE_WRITE succeeds, all page tables have been popu‐
389 lated (prefaulted) writable once. If MADV_POPULATE_WRITE fails,
390 some page tables might have been populated.
391
392 MADV_POPULATE_WRITE cannot be applied to mappings without write
393 permissions and special mappings, for example, mappings marked
394 with kernel-internal flags such as VM_PFNMAP or VM_IO, or secret
395 memory regions created using memfd_secret(2).
396
397 Note that with MADV_POPULATE_WRITE, the process can be killed at
398 any moment when the system runs out of memory.
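
One way to use MADV_POPULATE_WRITE while remaining compatible with older kernels is sketched below. The helper prefault_writable() is hypothetical, as is the decision to fall back to touching each page when the kernel reports EINVAL. The fallback definition of the constant uses the value from the generic Linux UAPI header and is an assumption for systems whose libc headers predate Linux 5.14. The range is assumed to be mapped writable.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes madvise() and MAP_ANONYMOUS in glibc */
#endif
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23   /* assumed: generic Linux UAPI value */
#endif

/* Hypothetical helper: back every page of a writable range with memory.
   Returns 0 on success, -1 with errno set on failure. */
int prefault_writable(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
        return 0;
    if (errno != EINVAL)
        return -1;               /* for example ENOMEM, EFAULT, EHWPOISON */

    /* EINVAL: treat the advice as unsupported and touch one byte per
       page instead.  Unlike madvise(), a failure on this path is
       delivered as a signal rather than as an error return. */
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    volatile unsigned char *p = addr;
    for (size_t off = 0; off < len; off += page)
        p[off] = p[off];         /* write back the same value */
    return 0;
}
```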
399
RETURN VALUE
401 On success, madvise() returns zero. On error, it returns -1 and errno
402 is set to indicate the error.
403
ERRORS
405 EACCES advice is MADV_REMOVE, but the specified address range is not a
406 shared writable mapping.
407
408 EAGAIN A kernel resource was temporarily unavailable.
409
410 EBADF The map exists, but the area maps something that isn't a file.
411
412 EBUSY (for MADV_COLLAPSE) Could not charge hugepage to cgroup: cgroup
413 limit exceeded.
414
415 EFAULT advice is MADV_POPULATE_READ or MADV_POPULATE_WRITE, and popu‐
416 lating (prefaulting) page tables failed because a SIGBUS would
417 have been generated on actual memory access and the reason is
418 not a HW poisoned page (HW poisoned pages can, for example, be
419 created using the MADV_HWPOISON flag described elsewhere in this
420 page).
421
422 EINVAL addr is not page-aligned or length is negative.
423
EINVAL advice is not valid.
425
426 EINVAL advice is MADV_COLD or MADV_PAGEOUT and the specified address
427 range includes locked, Huge TLB pages, or VM_PFNMAP pages.
428
429 EINVAL advice is MADV_DONTNEED or MADV_REMOVE and the specified address
430 range includes locked, Huge TLB pages, or VM_PFNMAP pages.
431
432 EINVAL advice is MADV_MERGEABLE or MADV_UNMERGEABLE, but the kernel was
433 not configured with CONFIG_KSM.
434
435 EINVAL advice is MADV_FREE or MADV_WIPEONFORK but the specified address
436 range includes file, Huge TLB, MAP_SHARED, or VM_PFNMAP ranges.
437
438 EINVAL advice is MADV_POPULATE_READ or MADV_POPULATE_WRITE, but the
439 specified address range includes ranges with insufficient per‐
440 missions or special mappings, for example, mappings marked with
kernel-internal flags such as VM_IO or VM_PFNMAP, or secret
memory regions created using memfd_secret(2).
443
444 EIO (for MADV_WILLNEED) Paging in this area would exceed the
445 process's maximum resident set size.
446
447 ENOMEM (for MADV_WILLNEED) Not enough memory: paging in failed.
448
449 ENOMEM (for MADV_COLLAPSE) Not enough memory: could not allocate
450 hugepage.
451
452 ENOMEM Addresses in the specified range are not currently mapped, or
453 are outside the address space of the process.
454
455 ENOMEM advice is MADV_POPULATE_READ or MADV_POPULATE_WRITE, and popu‐
456 lating (prefaulting) page tables failed because there was not
457 enough memory.
458
459 EPERM advice is MADV_HWPOISON, but the caller does not have the
460 CAP_SYS_ADMIN capability.
461
462 EHWPOISON
463 advice is MADV_POPULATE_READ or MADV_POPULATE_WRITE, and popu‐
464 lating (prefaulting) page tables failed because a HW poisoned
465 page (HW poisoned pages can, for example, be created using the
466 MADV_HWPOISON flag described elsewhere in this page) was encoun‐
467 tered.
468
VERSIONS
470 Versions of this system call, implementing a wide variety of advice
471 values, exist on many other implementations. Other implementations
typically implement at least the flags listed above under Conventional
advice values, albeit with some variation in semantics.
474
POSIX.1-2001 describes posix_madvise(3) with constants
POSIX_MADV_NORMAL, POSIX_MADV_RANDOM, POSIX_MADV_SEQUENTIAL,
POSIX_MADV_WILLNEED, and POSIX_MADV_DONTNEED, with behavior close to
the similarly named flags listed above.
479
480 Linux
481 The Linux implementation requires that the address addr be page-
482 aligned, and allows length to be zero. If there are some parts of the
483 specified address range that are not mapped, the Linux version of mad‐
484 vise() ignores them and applies the call to the rest (but returns
485 ENOMEM from the system call, as it should).
486
madvise(0, 0, advice) returns zero if and only if advice is supported
by the kernel, so this call can be relied on to probe for support.
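
The probing technique can be wrapped in a small function, sketched below. The name advice_supported() and its three-way return convention are hypothetical. Errors other than EINVAL (for example, ENOSYS on a kernel built without the system call) are reported separately so that the caller can tell "unsupported advice" from "no madvise() at all".

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes madvise() in glibc */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: returns 1 if the running kernel accepts advice,
   0 if it rejects it with EINVAL, and -1 for any other error. */
int advice_supported(int advice)
{
    if (madvise(NULL, 0, advice) == 0)   /* zero-length range: no effect */
        return 1;
    return errno == EINVAL ? 0 : -1;
}
```

A program might call advice_supported(MADV_COLLAPSE) once at startup, where the constant is defined, and select a code path accordingly.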
489
STANDARDS
491 None.
492
HISTORY
494 First appeared in 4.4BSD.
495
496 Since Linux 3.18, support for this system call is optional, depending
497 on the setting of the CONFIG_ADVISE_SYSCALLS configuration option.
498
SEE ALSO
500 getrlimit(2), memfd_secret(2), mincore(2), mmap(2), mprotect(2),
501 msync(2), munmap(2), prctl(2), process_madvise(2), posix_madvise(3),
502 core(5)
503
504
505
Linux man-pages 6.04 2023-04-03 madvise(2)