BTRFS-MAN5(5)                    Btrfs Manual                    BTRFS-MAN5(5)

NAME
       btrfs-man5 - topics about the BTRFS filesystem (mount options,
       supported file attributes and other)

DESCRIPTION
       This document describes topics related to BTRFS that are not specific
       to the tools. Currently covers:

        1. mount options

        2. filesystem features

        3. checksum algorithms

        4. compression

        5. filesystem exclusive operations

        6. filesystem limits

        7. bootloader support

        8. file attributes

        9. zoned mode

       10. control device

       11. filesystems with multiple block group profiles

       12. seeding device

       13. raid56 status and recommended practices

       14. storage model

       15. hardware considerations

MOUNT OPTIONS
       This section describes mount options specific to BTRFS. For the
       generic mount options please refer to the mount(8) manpage. The
       options are sorted alphabetically (discarding the no prefix).

       Note
           Most mount options apply to the whole filesystem, and only the
           options in the first mounted subvolume take effect. This is due
           to lack of implementation and may change in the future. This
           means that (for example) you can’t set per-subvolume nodatacow,
           nodatasum, or compress using mount options. This should
           eventually be fixed, but it has proved to be difficult to
           implement correctly within the Linux VFS framework.

       Mount options are processed in order; only the last occurrence of an
       option takes effect, and it may disable other options due to
       constraints (see eg. nodatacow and compress). The output of the
       mount command shows which options have been applied.

   acl, noacl
       (default: on)

       Enable/disable support for POSIX Access Control Lists (ACLs). See
       the acl(5) manual page for more information about ACLs.

       Support for ACLs is configured at build time (BTRFS_FS_POSIX_ACL)
       and the mount fails if acl is requested but the feature is not
       compiled in.

   autodefrag, noautodefrag
       (since: 3.0, default: off)

       Enable automatic file defragmentation. When enabled, small random
       writes into files (in a range of tens of kilobytes, currently 64KiB)
       are detected and queued up for the defragmentation process. Not well
       suited for large database workloads.

       The read latency may increase due to reading the adjacent blocks
       that make up the range for defragmentation; successive writes will
       merge the blocks in the new location.

       Warning
           Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2, as
           well as with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12
           or ≥ 3.13.4, will break up the reflinks of COW data (for example
           files copied with cp --reflink, snapshots or de-duplicated
           data). This may cause a considerable increase of space usage
           depending on the broken up reflinks.

   barrier, nobarrier
       (default: on)

       Ensure that all IO write operations make it through the device cache
       and are stored permanently when the filesystem is at its consistency
       checkpoint. This typically means that a flush command is sent to the
       device that will synchronize all pending data and ordinary metadata
       blocks, then the superblock is written and another flush is issued.

       The write flushes incur a slight performance hit and also prevent
       the IO block scheduler from reordering requests more effectively.
       Disabling barriers gets rid of that penalty but will most certainly
       lead to a corrupted filesystem in case of a crash or power loss: the
       ordinary metadata blocks could still be unwritten at the time the
       new superblock is stored permanently, even though that superblock
       expects the block pointers to metadata to have been stored
       permanently before.

       On a device with a battery-backed write-back cache, the nobarrier
       option will not lead to filesystem corruption, as the pending blocks
       are supposed to make it to the permanent storage.

   check_int, check_int_data, check_int_print_mask=value
       (since: 3.0, default: off)

       These debugging options control the behavior of the integrity
       checking module (the BTRFS_FS_CHECK_INTEGRITY config option is
       required). The main goal is to verify that all blocks from a given
       transaction period are properly linked.

       check_int enables the integrity checker module, which examines all
       block write requests to ensure on-disk consistency, at a large
       memory and CPU cost.

       check_int_data includes extent data in the integrity checks, and
       implies the check_int option.

       check_int_print_mask takes a bitmask of BTRFSIC_PRINT_MASK_* values
       as defined in fs/btrfs/check-integrity.c, to control the integrity
       checker module behavior.

       See comments at the top of fs/btrfs/check-integrity.c for more
       information.

   clear_cache
       Force clearing and rebuilding of the disk space cache if something
       has gone wrong. See also: space_cache.

   commit=seconds
       (since: 3.12, default: 30)

       Set the interval of the periodic transaction commit when data are
       synchronized to permanent storage. Higher interval values lead to a
       larger amount of unwritten data, which has obvious consequences when
       the system crashes. The upper bound is not enforced, but a warning
       is printed if it’s more than 300 seconds (5 minutes). Use with care.

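       As an illustration only (the device and mountpoint below are
       placeholders, and 120 is an arbitrary value), a longer commit
       interval trades crash-freshness for fewer periodic flushes:

       ```shell
       # Mount with a 2-minute transaction commit interval
       mount -o commit=120 /dev/sdx /mnt
       # The applied options, including commit=, are visible here
       mount | grep ' /mnt '
       ```

       Intervals above 300 seconds trigger the kernel warning mentioned
       above but are still accepted.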
   compress, compress=type[:level], compress-force,
   compress-force=type[:level]
       (default: off, level support since: 5.1)

       Control BTRFS file data compression. Type may be specified as zlib,
       lzo, zstd or no (for no compression, used for remounting). If no
       type is specified, zlib is used. If compress-force is specified,
       then compression will always be attempted, but the data may end up
       uncompressed if compression would make them larger.

       Both zlib and zstd (since version 5.1) expose the compression level
       as a tunable knob, with higher levels trading speed and memory
       (zstd) for higher compression ratios. The level can be set by
       appending a colon and the desired value. Zlib accepts the range
       [1, 9] and zstd accepts [1, 15]. If no level is set, both currently
       use a default level of 3. The value 0 is an alias for the default
       level.

       Otherwise some simple heuristics are applied to detect an
       incompressible file. If the first blocks written to a file are not
       compressible, the whole file is permanently marked to skip
       compression. As this is too simple, compress-force is a workaround
       that will compress most of the files at the cost of some wasted CPU
       cycles on failed attempts. Since kernel 4.15, the heuristic
       algorithms have been improved by using frequency sampling, repeated
       pattern detection and Shannon entropy calculation to avoid that.

       Note
           If compression is enabled, nodatacow and nodatasum are disabled.

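       A minimal sketch of the syntax above (device and mountpoint are
       placeholders):

       ```shell
       # Enable zstd compression at its default level (3)
       mount -o compress=zstd /dev/sdx /mnt
       # Remount with an explicit level and forced compression, bypassing
       # the incompressibility heuristics
       mount -o remount,compress-force=zstd:6 /mnt
       # Disable compression again on a later remount
       mount -o remount,compress=no /mnt
       ```

       Remounting changes the compression setting only for newly written
       data; existing extents keep whatever compression they were written
       with.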
   datacow, nodatacow
       (default: on)

       Enable data copy-on-write for newly created files. Nodatacow implies
       nodatasum, and disables compression. All files created under
       nodatacow also get the NOCOW file attribute set (see chattr(1)).

       Note
           If nodatacow or nodatasum are enabled, compression is disabled.

       In-place updates improve performance for workloads that do frequent
       overwrites, at the cost of potential partial writes in case the
       write is interrupted (system crash, device failure).

   datasum, nodatasum
       (default: on)

       Enable data checksumming for newly created files. Datasum implies
       datacow, ie. the normal mode of operation. All files created under
       nodatasum inherit the "no checksums" property, however there’s no
       corresponding file attribute (see chattr(1)).

       Note
           If nodatacow or nodatasum are enabled, compression is disabled.

       There is a slight performance gain when checksums are turned off, as
       the corresponding metadata blocks holding the checksums do not need
       to be updated. The cost of checksumming the blocks in memory is much
       lower than the IO; modern CPUs feature hardware support for the
       checksumming algorithm.

   degraded
       (default: off)

       Allow mounts with fewer devices than the RAID profile constraints
       require. A read-write mount (or remount) may fail when there are too
       many devices missing, for example if a stripe member is completely
       missing from RAID0.

       Since 4.14, the constraint checks have been improved and are
       verified on the chunk level, not on the device level. This allows
       degraded mounts of filesystems with mixed RAID profiles for data and
       metadata, even if the device number constraints would not be
       satisfied for some of the profiles.

       Example: metadata — raid1, data — single, devices — /dev/sda,
       /dev/sdb

       Suppose the data are completely stored on sda. Then a missing sdb
       will not prevent the mount, even though 1 missing device would
       normally prevent (any) single profile from mounting. In case some of
       the data chunks are stored on sdb, the constraint of single/data is
       not satisfied and the filesystem cannot be mounted.

   device=devicepath
       Specify a path to a device that will be scanned for a BTRFS
       filesystem during mount. This is usually done automatically by a
       device manager (like udev) or using the btrfs device scan command
       (eg. run from the initial ramdisk). In cases where this is not
       possible the device mount option can help.

       Note
           Booting eg. a RAID1 system may fail even if all the
           filesystem’s device paths are provided, as the actual device
           nodes may not be discovered by the system at that point.

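       A sketch for a two-device filesystem on which no device scan has
       run (the device names and mountpoint are placeholders):

       ```shell
       # Make both member devices known at mount time, instead of relying
       # on udev or a prior 'btrfs device scan'
       mount -o device=/dev/sdb,device=/dev/sdc /dev/sdb /mnt
       ```

       The option can be repeated, once per member device.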
   discard, discard=sync, discard=async, nodiscard
       (default: off, async support since: 5.6)

       Enable discarding of freed file blocks. This is useful for SSD
       devices, thinly provisioned LUNs, or virtual machine images;
       however, every storage layer must support discard for it to work.

       In the synchronous mode (sync, or without an option value), lack of
       asynchronous queued TRIM on the backing device can severely degrade
       performance, because a synchronous TRIM operation will be attempted
       instead. Queued TRIM requires chipsets and devices newer than SATA
       revision 3.1.

       The asynchronous mode (async) gathers extents in larger chunks
       before sending them to the devices for TRIM. The overhead and
       performance impact should be negligible compared to the previous
       mode and it’s supposed to be the preferred mode if needed.

       If it is not necessary to immediately discard freed blocks, then the
       fstrim tool can be used to discard all free blocks in a batch.
       Scheduling a TRIM during a period of low system activity will
       prevent it from interfering with the performance of other
       operations. Also, a device may ignore the TRIM command if the range
       is too small, so running a batch discard has a greater probability
       of actually discarding the blocks.

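       The batch alternative described above could look like this (the
       mountpoint is a placeholder; fstrim(8) operates on a mounted
       filesystem and requires root):

       ```shell
       # Discard all unused blocks in one batch, reporting the amount
       fstrim -v /mnt
       # On systemd-based distributions, a periodic batch discard is
       # typically preferred over the discard mount option
       systemctl enable fstrim.timer
       ```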
   enospc_debug, noenospc_debug
       (default: off)

       Enable verbose output for some ENOSPC conditions. It’s safe to use
       but can be noisy if the system reaches a near-full state.

   fatal_errors=action
       (since: 3.4, default: bug)

       Action to take when encountering a fatal error.

       bug
           BUG() on a fatal error; the system will stay in the crashed
           state and may be still partially usable, but a reboot is
           required for full operation

       panic
           panic() on a fatal error; depending on other system
           configuration, this may be followed by a reboot. Please refer
           to the documentation of kernel boot parameters, eg. panic,
           oops or crashkernel.

   flushoncommit, noflushoncommit
       (default: off)

       This option forces any data dirtied by a write in a prior
       transaction to commit as part of the current commit, effectively a
       full filesystem sync.

       This makes the committed state a fully consistent view of the
       filesystem from the application’s perspective (i.e. it includes all
       completed filesystem operations). This was previously the behavior
       only when a snapshot was created.

       When off, the filesystem is consistent but buffered writes may span
       more than one transaction commit.

   fragment=type
       (depends on compile-time option BTRFS_DEBUG, since: 4.4, default:
       off)

       A debugging helper to intentionally fragment the given type of block
       groups. The type can be data, metadata or all. This mount option
       should not be used outside of debugging environments and is not
       recognized if the kernel config option BTRFS_DEBUG is not enabled.

   nologreplay
       (default: off, even read-only)

       The tree-log contains pending updates to the filesystem until the
       full commit. The log is normally replayed on next mount; this option
       disables the replay. See also treelog. Note that nologreplay is the
       same as norecovery.

       Warning
           Currently, the tree log is replayed even with a read-only
           mount! To disable that behaviour, mount also with nologreplay.

   max_inline=bytes
       (default: min(2048, page size) )

       Specify the maximum amount of space that can be inlined in a
       metadata B-tree leaf. The value is specified in bytes, optionally
       with a K suffix (case insensitive). In practice, this value is
       limited by the filesystem block size (named sectorsize at mkfs
       time) and the memory page size of the system. In case of the
       sectorsize limit, some space is unavailable due to leaf headers.
       For example, with a 4k sectorsize, the maximum size of inline data
       is about 3900 bytes.

       Inlining can be completely turned off by specifying 0. This will
       increase data block slack if file sizes are much smaller than the
       block size but will reduce metadata consumption in return.

       Note
           The default value has changed to 2048 in kernel 4.6.

   metadata_ratio=value
       (default: 0, internal logic)

       Specifies that 1 metadata chunk should be allocated after every
       value data chunks. The default behaviour depends on internal logic:
       some percentage of unused metadata space is attempted to be
       maintained, but that is not always possible if there’s not enough
       space left for chunk allocation. The option could be useful to
       override the internal logic in favor of metadata allocation if the
       expected workload is supposed to be metadata intense (snapshots,
       reflinks, xattrs, inlined files).

   norecovery
       (since: 4.5, default: off)

       Do not attempt any data recovery at mount time. This will disable
       logreplay and avoids other write operations. Note that this option
       is the same as nologreplay.

       Note
           The opposite option recovery used to have a different meaning
           but was changed for consistency with other filesystems, where
           norecovery is used for skipping log replay. BTRFS does the same
           and in general will try to avoid any write operations.

   rescan_uuid_tree
       (since: 3.12, default: off)

       Force a check and rebuild procedure of the UUID tree. This should
       not normally be needed.

   rescue
       (since: 5.9)

       Modes allowing mount with damaged filesystem structures.

       •   usebackuproot (since: 5.9, replaces standalone option
           usebackuproot)

       •   nologreplay (since: 5.9, replaces standalone option
           nologreplay)

       •   ignorebadroots, ibadroots (since: 5.11)

       •   ignoredatacsums, idatacsums (since: 5.11)

       •   all (since: 5.9)

   skip_balance
       (since: 3.3, default: off)

       Skip automatic resume of an interrupted balance operation. The
       operation can later be resumed with btrfs balance resume, or the
       paused state can be removed with btrfs balance cancel. The default
       behaviour is to resume an interrupted balance immediately after a
       volume is mounted.

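       A possible sequence (device and mountpoint are illustrative): mount
       with the automatic resume skipped, then decide what to do with the
       paused balance:

       ```shell
       # Mount without resuming the interrupted balance
       mount -o skip_balance /dev/sdx /mnt
       # Later, either continue the operation...
       btrfs balance resume /mnt
       # ...or drop the paused state entirely
       btrfs balance cancel /mnt
       ```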
   space_cache, space_cache=version, nospace_cache
       (nospace_cache since: 3.2, space_cache=v1 and space_cache=v2 since
       4.5, default: space_cache=v1)

       Options to control the free space cache. The free space cache
       greatly improves performance when reading block group free space
       into memory. However, managing the space cache consumes some
       resources, including a small amount of disk space.

       There are two implementations of the free space cache. The original
       one, referred to as v1, is the safe default. The v1 space cache can
       be disabled at mount time with nospace_cache without clearing.

       On very large filesystems (many terabytes) and certain workloads,
       the performance of the v1 space cache may degrade drastically. The
       v2 implementation, which adds a new B-tree called the free space
       tree, addresses this issue. Once enabled, the v2 space cache will
       always be used and cannot be disabled unless it is cleared. Use
       clear_cache,space_cache=v1 or clear_cache,nospace_cache to do so.
       If v2 is enabled, kernels without v2 support will only be able to
       mount the filesystem in read-only mode.

       The btrfs-check(8) and mkfs.btrfs(8) commands have full v2 free
       space cache support since v4.19.

       If a version is not explicitly specified, the default
       implementation will be chosen, which is v1.

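       Switching an existing filesystem to the v2 free space tree, and
       back, could be sketched as follows (one-time mounts; the device and
       mountpoint are placeholders):

       ```shell
       # One-time mount that builds the free space tree (v2)
       mount -o clear_cache,space_cache=v2 /dev/sdx /mnt
       # Reverting: clear the tree and return to the v1 cache
       umount /mnt
       mount -o clear_cache,space_cache=v1 /dev/sdx /mnt
       ```

       After the one-time conversion the setting is persistent; subsequent
       mounts do not need the option.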
   ssd, ssd_spread, nossd, nossd_spread
       (default: SSD autodetected)

       Options to control SSD allocation schemes. By default, BTRFS will
       enable or disable SSD optimizations depending on the status of a
       device with respect to rotational or non-rotational type. This is
       determined by the contents of /sys/block/DEV/queue/rotational. If
       it is 0, the ssd option is turned on. The option nossd will disable
       the autodetection.

       The optimizations make use of the absence of the seek penalty
       that’s inherent to rotational devices. The blocks can typically be
       written faster and are not offloaded to separate threads.

       Note
           Since 4.14, the block layout optimizations have been dropped.
           They used to help with the first generations of SSD devices,
           whose FTL (flash translation layer) was not effective, and the
           optimization was supposed to improve the wear by better
           aligning blocks. This is no longer true with modern SSD
           devices and the optimization had no real benefit. Furthermore
           it caused increased fragmentation. The layout tuning has been
           kept intact for the option ssd_spread.

       The ssd_spread mount option attempts to allocate into bigger and
       aligned chunks of unused space, and may perform better on low-end
       SSDs. ssd_spread implies ssd, enabling all other SSD heuristics as
       well. The option nossd will disable all SSD options while
       nossd_spread only disables ssd_spread.

   subvol=path
       Mount the subvolume from path rather than the toplevel subvolume.
       The path is always treated as relative to the toplevel subvolume.
       This mount option overrides the default subvolume set for the given
       filesystem.

   subvolid=subvolid
       Mount the subvolume specified by a subvolid number rather than the
       toplevel subvolume. You can use btrfs subvolume list or btrfs
       subvolume show to see subvolume ID numbers. This mount option
       overrides the default subvolume set for the given filesystem.

       Note
           If both subvolid and subvol are specified, they must point at
           the same subvolume, otherwise the mount will fail.

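       For illustration (subvolume name, ID, device and mountpoints are
       placeholders), a subvolume can be mounted by path or by ID after
       looking the ID up:

       ```shell
       # List subvolumes to find the numeric ID
       btrfs subvolume list /mnt
       # Mount by path relative to the toplevel subvolume...
       mount -o subvol=home /dev/sdx /home
       # ...or equivalently by its ID
       mount -o subvolid=256 /dev/sdx /home
       ```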
   thread_pool=number
       (default: min(NRCPUS + 2, 8) )

       The number of worker threads to start. NRCPUS is the number of
       on-line CPUs detected at the time of mount. A small number leads to
       less parallelism in processing data and metadata, while higher
       numbers could lead to a performance hit due to increased locking
       contention, process scheduling, cache-line bouncing or costly data
       transfers between local CPU memories.

   treelog, notreelog
       (default: on)

       Enable the tree logging used for fsync and O_SYNC writes. The tree
       log stores changes without the need of a full filesystem sync. The
       log operations are flushed at sync and transaction commit. If the
       system crashes between two such syncs, the pending tree log
       operations are replayed during mount.

       Warning
           Currently, the tree log is replayed even with a read-only
           mount! To disable that behaviour, also mount with nologreplay.

       The tree log could contain new files/directories; these would not
       exist on a mounted filesystem if the log is not replayed.

   usebackuproot
       (since: 4.6, default: off)

       Enable autorecovery attempts if a bad tree root is found at mount
       time. Currently this scans a backup list of several previous tree
       roots and tries to use the first readable one. This can be used
       with read-only mounts as well.

       Note
           This option has replaced recovery.

   user_subvol_rm_allowed
       (default: off)

       Allow subvolumes to be deleted by their respective owner.
       Otherwise, only the root user can do that.

       Note
           Historically, any user could create a snapshot even if they
           were not the owner of the source subvolume, and subvolume
           deletion has been restricted for that reason. Subvolume
           creation has since been restricted too, but this mount option
           is still required. This is a usability issue. Since 4.18, the
           rmdir(2) syscall can delete an empty subvolume just like an
           ordinary directory. Whether this is possible can be detected at
           runtime, see the rmdir_subvol feature in FILESYSTEM FEATURES.

   DEPRECATED MOUNT OPTIONS
       List of mount options that have been removed, kept for backward
       compatibility.

       recovery
           (since: 3.2, default: off, deprecated since: 4.5)

           Note
               This option has been replaced by usebackuproot and should
               not be used, but will still work on 4.5+ kernels.

       inode_cache, noinode_cache
           (removed in: 5.11, since: 3.0, default: off)

           Note
               The functionality has been removed in 5.11. Any stale data
               created by previous use of the inode_cache option can be
               removed by btrfs check --clear-ino-cache.

   NOTES ON GENERIC MOUNT OPTIONS
       Some of the general mount options from mount(8) affect BTRFS and
       are worth mentioning.

       noatime
           Under read-intensive workloads, specifying noatime
           significantly improves performance because no new access time
           information needs to be written. Without this option, the
           default is relatime, which only reduces the number of inode
           atime updates in comparison to the traditional strictatime.
           The worst case for atime updates under relatime occurs when
           many files are read whose atime is older than 24 h and which
           are freshly snapshotted. In that case the atime is updated and
           COW happens - for each file - in bulk. See also
           https://lwn.net/Articles/499293/ - Atime and btrfs: a bad
           combination? (LWN, 2012-05-31).

           Note that noatime may break applications that rely on atime
           updates, like the venerable Mutt (unless you use maildir
           mailboxes).

FILESYSTEM FEATURES
       The basic set of filesystem features gets extended over time.
       Backward compatibility is maintained, and the features are
       optional: they need to be explicitly asked for, so accidental use
       will not create incompatibilities.

       There are several classes, with respective tools to manage the
       features:

       at mkfs time only
           This is namely for core structures, like the b-tree nodesize or
           checksum algorithm, see mkfs.btrfs(8) for more details.

       after mkfs, on an unmounted filesystem
           Features that may optimize internal structures or add new
           structures to support new functionality, see btrfstune(8). The
           command btrfs inspect-internal dump-super device will dump a
           superblock; you can map the value of incompat_flags to the
           features listed below.

       after mkfs, on a mounted filesystem
           The features of a filesystem (with a given UUID) are listed in
           /sys/fs/btrfs/UUID/features/, one file per feature. The status
           is stored inside the file. The value 1 is for enabled and
           active, while 0 means the feature was enabled at mount time but
           turned off afterwards.

           Whether a particular feature can be turned on for a mounted
           filesystem can be found in the directory
           /sys/fs/btrfs/features/, one file per feature. The value 1
           means the feature can be enabled.

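       A quick way to inspect both sysfs directories from the shell
       (substitute the actual filesystem UUID for UUID):

       ```shell
       # Features supported by the running kernel
       ls /sys/fs/btrfs/features/
       # Per-filesystem feature status, one value per file
       for f in /sys/fs/btrfs/UUID/features/*; do
           echo "$f: $(cat "$f")"
       done
       ```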
       List of features (see also mkfs.btrfs(8) section FILESYSTEM
       FEATURES):

       big_metadata
           (since: 3.4)

           the filesystem uses nodesize for metadata blocks, this can be
           bigger than the page size

       compress_lzo
           (since: 2.6.38)

           the lzo compression has been used on the filesystem, either as
           a mount option or via btrfs filesystem defrag

       compress_zstd
           (since: 4.14)

           the zstd compression has been used on the filesystem, either as
           a mount option or via btrfs filesystem defrag

       default_subvol
           (since: 2.6.34)

           the default subvolume has been set on the filesystem

       extended_iref
           (since: 3.7)

           increased hardlink limit per file in a directory to 65536;
           older kernels supported a varying number of hardlinks depending
           on the sum of all file name sizes that can be stored into one
           metadata block

       free_space_tree
           (since: 4.5)

           free space representation using a dedicated b-tree, successor
           of the v1 space cache

       metadata_uuid
           (since: 5.0)

           the main filesystem UUID is the metadata_uuid, which stores the
           new UUID only in the superblock while all metadata blocks still
           have the UUID set at mkfs time, see btrfstune(8) for more

       mixed_backref
           (since: 2.6.31)

           the last major disk format change, improved backreferences, now
           default

       mixed_groups
           (since: 2.6.37)

           mixed data and metadata block groups, ie. the data and metadata
           are not separated and occupy the same block groups; this mode
           is suitable for small volumes as there are no constraints on
           how the remaining space should be used (compared to the split
           mode, where empty metadata space cannot be used for data and
           vice versa)

           on the other hand, the final layout is quite unpredictable and
           possibly highly fragmented, which means worse performance

       no_holes
           (since: 3.14)

           improved representation of file extents where holes are not
           explicitly stored as an extent, saves a few percent of metadata
           if sparse files are used

       raid1c34
           (since: 5.5)

           extended RAID1 mode with copies on 3 or 4 devices respectively

       raid56
           (since: 3.9)

           the filesystem contains or contained a raid56 profile of block
           groups

       rmdir_subvol
           (since: 4.18)

           indicates that the rmdir(2) syscall can delete an empty
           subvolume just like an ordinary directory. Note that this
           feature only depends on the kernel version.

       skinny_metadata
           (since: 3.10)

           reduced-size metadata for extent references, saves a few
           percent of metadata

       send_stream_version
           (since: 5.10)

           number of the highest supported send stream version

       supported_checksums
           (since: 5.5)

           list of checksum algorithms supported by the kernel module; the
           respective modules or built-in implementations of the
           algorithms need to be present to mount the filesystem, see
           CHECKSUM ALGORITHMS

       supported_sectorsizes
           (since: 5.13)

           list of values that are accepted as sector sizes (mkfs.btrfs
           --sectorsize) by the running kernel

       supported_rescue_options
           (since: 5.11)

           list of values for the mount option rescue that are supported
           by the running kernel, see btrfs(5)

       zoned
           (since: 5.12)

           zoned mode is allocation/write friendly to host-managed zoned
           devices; allocation space is partitioned into fixed-size zones
           that must be written sequentially, see ZONED MODE

   SWAPFILE SUPPORT
       A swapfile is supported since kernel 5.0. Use swapon(8) to activate
       the swapfile. There are some limitations of the implementation in
       btrfs and the linux swap subsystem:

       •   filesystem - must be only single device

       •   filesystem - must have only single data profile

       •   swapfile - the containing subvolume cannot be snapshotted

       •   swapfile - must be preallocated

       •   swapfile - must be nodatacow (ie. also nodatasum)

       •   swapfile - must not be compressed

       The limitations come namely from the COW-based design and the
       mapping layer of blocks that allows advanced features like
       relocation and multi-device filesystems. However, the swap
       subsystem expects simpler mapping and no background changes of the
       file blocks once they’ve been attached to swap.

       With active swapfiles, the following whole-filesystem operations
       will skip swapfile extents or may fail:

       •   balance - block groups with swapfile extents are skipped and
           reported, the rest will be processed normally

       •   resize grow - unaffected

       •   resize shrink - works as long as the extents are outside of the
           shrunk range

       •   device add - a new device does not interfere with an existing
           swapfile and this operation will work, though no new swapfile
           can be activated afterwards

       •   device delete - if the device has been added as above, it can
           be also deleted

       •   device replace - ditto

       While a whole-filesystem exclusive operation is running (ie.
       balance, device delete, shrink), swapfiles cannot be activated,
       even temporarily. The operation must finish first.

       To create and activate a swapfile run the following commands:

           # truncate -s 0 swapfile
           # chattr +C swapfile
           # fallocate -l 2G swapfile
           # chmod 0600 swapfile
           # mkswap swapfile
           # swapon swapfile

       Please note that the UUID returned by the mkswap utility identifies
       the swap "filesystem" and, because it’s stored in a file, it’s not
       generally visible and usable as an identifier, unlike if it was on
       a block device.

       The file will appear in /proc/swaps:

           # cat /proc/swaps
           Filename          Type        Size        Used     Priority
           /path/swapfile    file        2097152     0        -2

       The swapfile can be created as a one-time operation or, once
       properly created, activated on each boot by the swapon -a command
       (usually started by the service manager). Add the following entry
       to /etc/fstab, assuming the filesystem that provides the /path has
       already been mounted at this point. Additional mount options
       relevant for the swapfile can be set too (like priority; not the
       btrfs mount options).

           /path/swapfile        none        swap        defaults      0 0

803 There are several checksum algorithms supported. The default and
804 backward compatible is crc32c. Since kernel 5.5 there are three more
805 with different characteristics and trade-offs regarding speed and
806 strength. The following list may help you to decide which one to
807 select.
808
809 CRC32C (32bit digest)
810 default, best backward compatibility, very fast, modern CPUs have
811 instruction-level support, not collision-resistant but still good
812 error detection capabilities
813
814 XXHASH (64bit digest)
815 can be used as CRC32C successor, very fast, optimized for modern
816 CPUs utilizing instruction pipelining, good collision resistance
817 and error detection
818
819 SHA256 (256bit digest)
820 a cryptographic-strength hash, relatively slow but with possible
821 CPU instruction acceleration or specialized hardware cards, FIPS
822 certified and in wide use
823
824 BLAKE2b (256bit digest)
825 a cryptographic-strength hash, relatively fast with possible CPU
826 acceleration using SIMD extensions, not standardized but based on
827 BLAKE which was a SHA3 finalist, in wide use, the algorithm used is
828 BLAKE2b-256 that’s optimized for 64bit platforms
829
830 The digest size affects overall size of data block checksums stored in
831 the filesystem. The metadata blocks have a fixed area up to 256bits (32
832 bytes), so there’s no increase. Each data block has a separate checksum
833 stored, with additional overhead of the b-tree leaves.
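
       The impact of the digest size can be put into numbers. A small sketch
       (the helper name and the 4KiB block size are illustrative; the b-tree
       leaf overhead is ignored):

       ```python
       # Rough estimate of the space taken by data block checksums.
       # One checksum is stored per data block; the additional overhead of
       # the b-tree leaves holding them is not counted here.
       DIGEST_BITS = {"crc32c": 32, "xxhash": 64, "sha256": 256, "blake2b": 256}

       def checksum_bytes(data_bytes, algo, block_size=4096):
           blocks = (data_bytes + block_size - 1) // block_size  # round up
           return blocks * (DIGEST_BITS[algo] // 8)

       # For 1TiB of data: about 1GiB of crc32c checksums vs 8GiB for sha256.
       ```

       For instance, moving from a 32bit to a 256bit digest multiplies the
       space consumed by data checksums by eight.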
834
835 Approximate relative performance of the algorithms, measured against
836    CRC32C using reference software implementations on a 3.5GHz Intel CPU:
837
838 ┌────────┬─────────────┬───────┬─────────────────┐
839 │ │ │ │ │
840 │Digest │ Cycles/4KiB │ Ratio │ Implementation │
841 ├────────┼─────────────┼───────┼─────────────────┤
842 │ │ │ │ │
843 │CRC32C │ 1700 │ 1.00 │ CPU instruction │
844 ├────────┼─────────────┼───────┼─────────────────┤
845 │ │ │ │ │
846 │XXHASH │ 2500 │ 1.44 │ reference impl. │
847 ├────────┼─────────────┼───────┼─────────────────┤
848 │ │ │ │ │
849 │SHA256 │ 105000 │ 61 │ reference impl. │
850 ├────────┼─────────────┼───────┼─────────────────┤
851 │ │ │ │ │
852 │SHA256 │ 36000 │ 21 │ libgcrypt/AVX2 │
853 ├────────┼─────────────┼───────┼─────────────────┤
854 │ │ │ │ │
855 │SHA256 │ 63000 │ 37 │ libsodium/AVX2 │
856 ├────────┼─────────────┼───────┼─────────────────┤
857 │ │ │ │ │
858 │BLAKE2b │ 22000 │ 13 │ reference impl. │
859 ├────────┼─────────────┼───────┼─────────────────┤
860 │ │ │ │ │
861 │BLAKE2b │ 19000 │ 11 │ libgcrypt/AVX2 │
862 ├────────┼─────────────┼───────┼─────────────────┤
863 │ │ │ │ │
864 │BLAKE2b │ 19000 │ 11 │ libsodium/AVX2 │
865 └────────┴─────────────┴───────┴─────────────────┘
866
867 Many kernels are configured with SHA256 as built-in and not as a
868 module. The accelerated versions are however provided by the modules
869 and must be loaded explicitly (modprobe sha256) before mounting the
870 filesystem to make use of them. You can check in
871 /sys/fs/btrfs/FSID/checksum which one is used. If you see
872 sha256-generic, then you may want to unmount and mount the filesystem
873    again; changing that on a mounted filesystem is not possible. Check the
874    file /proc/crypto: when the implementation is built-in, you’d find
875
876 name : sha256
877 driver : sha256-generic
878 module : kernel
879 priority : 100
880 ...
881
882    while an accelerated implementation is e.g.
883
884 name : sha256
885 driver : sha256-avx2
886 module : sha256_ssse3
887 priority : 170
888 ...
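
       Picking the active driver out of such /proc/crypto output can be
       scripted; a sketch (the function name is made up, and the sample is
       the abbreviated output from above):

       ```python
       def sha256_driver(crypto_text):
           """Return the driver of the highest-priority sha256 entry
           found in /proc/crypto-style text, or None if there is none."""
           best_driver, best_prio = None, -1
           entry = {}
           # Entries are key : value lines separated by blank lines.
           for line in crypto_text.splitlines() + [""]:
               if not line.strip():
                   if entry.get("name") == "sha256":
                       prio = int(entry.get("priority", 0))
                       if prio > best_prio:
                           best_driver, best_prio = entry.get("driver"), prio
                   entry = {}
               elif ":" in line:
                   key, value = line.split(":", 1)
                   entry[key.strip()] = value.strip()
           return best_driver

       sample = """\
       name         : sha256
       driver       : sha256-generic
       module       : kernel
       priority     : 100

       name         : sha256
       driver       : sha256-avx2
       module       : sha256_ssse3
       priority     : 170
       """
       ```

       On the sample above the accelerated sha256-avx2 driver wins due to
       its higher priority.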
889
COMPRESSION
891    Btrfs supports transparent file compression. There are three
892    algorithms available: ZLIB, LZO and ZSTD (since v4.14). Compression
893    happens on a file-by-file basis. A single btrfs mount point can have
894    some files that are uncompressed, some compressed with LZO and some
895    with ZLIB, for instance (though you may not want it that way, it is
896    supported).
897
898 To enable compression, mount the filesystem with options compress or
899 compress-force. Please refer to section MOUNT OPTIONS. Once compression
900 is enabled, all new writes will be subject to compression. Some files
901 may not compress very well, and these are typically not recompressed
902 but still written uncompressed.
903
904    Each compression algorithm has different speed/ratio trade-offs. The
905 levels can be selected by a mount option and affect only the resulting
906 size (ie. no compatibility issues).
907
908 Basic characteristics:
909
910
911    ZLIB
912        slower, higher compression ratio
913
914        • levels: 1 to 9, mapped directly, default level is 3
915
916        • good backward compatibility
917
918    LZO
919        faster compression and decompression than zlib, worse compression
920        ratio, designed to be fast
921
922        • no levels
923
924        • good backward compatibility
925
926    ZSTD
927        compression comparable to zlib with higher
928        compression/decompression speeds and different ratio
929
930        • levels: 1 to 15
931
932        • since 4.14, levels since 5.1
949
950
951 The differences depend on the actual data set and cannot be expressed
952 by a single number or recommendation. Higher levels consume more CPU
953 time and may not bring a significant improvement, lower levels are
954 close to real time.
955
956    The algorithms can be mixed within one file as they’re stored per
957    extent. The compression of a file can be changed by the btrfs
958    filesystem defrag command using the -c option, or by btrfs property
959    set using the compression property. Setting compression by the chattr
960    +c utility will set it to zlib.
961
962    INCOMPRESSIBLE DATA
963    A simple decision logic handles files with already compressed data,
964    or with data that won’t compress well within the CPU and memory
965    constraints of the kernel implementations. If the first portion of
966    data being compressed is not smaller than the original, compression
967    of the file is disabled, unless the filesystem is mounted with
968    compress-force. In that case compression will always be attempted on
969    the file, only to be discarded later. This is not optimal and is
970    subject to optimizations and further development.
971
972 If a file is identified as incompressible, a flag is set (NOCOMPRESS)
973 and it’s sticky. On that file compression won’t be performed unless
974 forced. The flag can be also set by chattr +m (since e2fsprogs 1.46.2)
975 or by properties with value no or none. Empty value will reset it to
976 the default that’s currently applicable on the mounted filesystem.
977
978 There are two ways to detect incompressible data:
979
980 • actual compression attempt - data are compressed, if the result is
981 not smaller, it’s discarded, so this depends on the algorithm and
982 level
983
984    • pre-compression heuristics - a quick statistical evaluation of the
985      data is performed and based on the result compression is either
986      performed or skipped; the NOCOMPRESS bit is not set by the
987      heuristic alone, only if the compression algorithm does not make an
988      improvement
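
       The first detection method can be illustrated with ordinary zlib
       (the kernel implementation differs in details such as chunk sizes
       and levels; this is only a sketch):

       ```python
       import os
       import zlib

       def try_compress(chunk, force=False):
           # Compress the first portion of the file; if the result is not
           # smaller, give up -- in btrfs this is where the NOCOMPRESS flag
           # would be set. With force (as with compress-force) the attempt
           # is always made and a larger result discarded later.
           out = zlib.compress(chunk, 3)
           if len(out) >= len(chunk) and not force:
               return None
           return out

       # Repetitive data shrinks, random data does not:
       assert try_compress(b"ab" * 4096) is not None
       assert try_compress(os.urandom(8192)) is None
       ```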
989
990    PRE-COMPRESSION HEURISTICS
991    The heuristics aim to do a few quick statistical tests on the data to
992    be compressed in order to avoid a probably costly compression that
993    would turn out to be inefficient. Compression algorithms could have
994    internal detection of incompressible data too but this leads to more
995    overhead as the compression is done in another thread and has to write
996    the data anyway. The heuristic is read-only and can utilize cached
997    memory.
998
999    The tests performed are based on the following: data sampling, long
1000   repeated pattern detection, byte frequency, Shannon entropy.
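
       A toy version of the entropy part of the heuristic, just to
       illustrate the idea (the in-kernel code samples the data and also
       uses byte core-set and repeated pattern checks, which are omitted
       here; the 7-bit threshold is arbitrary):

       ```python
       import math
       from collections import Counter

       def looks_compressible(sample):
           # Shannon entropy of the byte distribution: close to 8 bits per
           # byte means nearly random data, probably not worth compressing.
           counts = Counter(sample)
           n = len(sample)
           entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
           return entropy < 7.0
       ```

       Text scores low (around 4 bits per byte) and would be compressed;
       random or already compressed data scores close to 8 and would be
       skipped.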
1001
1002 COMPATIBILITY WITH OTHER FEATURES
1003 Compression is done using the COW mechanism so it’s incompatible with
1004 nodatacow. Direct IO works on compressed files but will fall back to
1005 buffered writes. Currently nodatasum and compression don’t work
1006 together.
1007
FILESYSTEM EXCLUSIVE OPERATIONS
1009   There are several operations that affect the whole filesystem and
1010   cannot be run in parallel. An attempt to start one while another is
1011   running will fail.
1012
1013   Since kernel 5.10 the currently running operation can be obtained from
1014   /sys/fs/btrfs/UUID/exclusive_operation with the following values and operations:
1015
1016 • balance
1017
1018 • device add
1019
1020 • device delete
1021
1022 • device replace
1023
1024 • resize
1025
1026 • swapfile activate
1027
1028 • none
1029
1030 Enqueuing is supported for several btrfs subcommands so they can be
1031 started at once and then serialized.
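
       A script that wants to start its own exclusive operation can poll
       that sysfs file first; this helper is a hypothetical convenience,
       not part of btrfs-progs:

       ```python
       import time
       from pathlib import Path

       def wait_for_exclusive_op(sysfs_file, timeout=60.0, poll=0.5):
           # Poll /sys/fs/btrfs/<UUID>/exclusive_operation (kernel 5.10+)
           # until it reads 'none', i.e. no exclusive operation is running.
           deadline = time.monotonic() + timeout
           while True:
               if Path(sysfs_file).read_text().strip() == "none":
                   return True
               if time.monotonic() + poll > deadline:
                   return False
               time.sleep(poll)
       ```

       Once this returns True, a balance, resize or device operation can be
       started without racing another exclusive operation (modulo the usual
       time-of-check/time-of-use window).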
1032
FILESYSTEM LIMITS
1034   maximum file name length
1035 255
1036
1037 maximum symlink target length
1038 depends on the nodesize value, for 4k it’s 3949 bytes, for larger
1039 nodesize it’s 4095 due to the system limit PATH_MAX
1040
1041       The symlink target need not be a valid path, ie. the path name
1042       components can exceed the limits (NAME_MAX); there’s no content
1043       validation at symlink(3) creation.
1044
1045 maximum number of inodes
1046 2^64 but depends on the available metadata space as the inodes are
1047 created dynamically
1048
1049 inode numbers
1050 minimum number: 256 (for subvolumes), regular files and
1051 directories: 257
1052
1053 maximum file length
1054       the inherent limit of btrfs is 2^64 (16 EiB) but the Linux VFS
1055       limit is 2^63 (8 EiB)
1056
1057   maximum number of subvolumes
1058       the subvolume ids can go up to 2^64 but the number of actual
1059       subvolumes depends on the available metadata space; the space
1060       consumed by all subvolume metadata, which includes bookkeeping of
1061       shared extents, can be large (MiB, GiB)
1062
1063 maximum number of hardlinks of a file in a directory
1064 65536 when the extref feature is turned on during mkfs (default),
1065 roughly 100 otherwise
1066
1067   minimum filesystem size
1068       the minimal size of each device depends on the mixed-bg feature;
1069       without it (the default) it’s about 109MiB, with mixed-bg it’s
1070       16MiB
1071
BOOTLOADERS
1073   GRUB2 (https://www.gnu.org/software/grub) has the most advanced
1074   support for booting from BTRFS with respect to features.
1075
1076 U-boot (https://www.denx.de/wiki/U-Boot/) has decent support for
1077 booting but not all BTRFS features are implemented, check the
1078 documentation.
1079
1080 EXTLINUX (from the https://syslinux.org project) can boot but does not
1081 support all features. Please check the upstream documentation before
1082 you use it.
1083
1084   The first 1MiB on each device is unused with the exception of the
1085   primary superblock, which is at offset 64KiB and spans 4KiB.
1086
FILE ATTRIBUTES
1088   The btrfs filesystem supports setting file attributes or flags. Note
1089 there are old and new interfaces, with confusing names. The following
1090 list should clarify that:
1091
1092 • attributes: chattr(1) or lsattr(1) utilities (the ioctls are
1093 FS_IOC_GETFLAGS and FS_IOC_SETFLAGS), due to the ioctl names the
1094 attributes are also called flags
1095
1096   • xflags: to distinguish from the previous, these are extended flags,
1097     with tunable bits similar to the attributes but extensible; new bits
1098     will be added in the future (the ioctls are FS_IOC_FSGETXATTR and
1099     FS_IOC_FSSETXATTR but they are not related to extended attributes,
1100     also called xattrs); there’s no standard tool to change the bits,
1101     but there’s support in xfs_io(8) as the command xfs_io -c chattr
1102
1103 ATTRIBUTES
1104 a
1105 append only, new writes are always written at the end of the file
1106
1107 A
1108 no atime updates
1109
1110 c
1111 compress data, all data written after this attribute is set will be
1112 compressed. Please note that compression is also affected by the
1113 mount options or the parent directory attributes.
1114
1115 When set on a directory, all newly created files will inherit this
1116 attribute. This attribute cannot be set with m at the same time.
1117
1118 C
1119 no copy-on-write, file data modifications are done in-place
1120
1121 When set on a directory, all newly created files will inherit this
1122 attribute.
1123
1124 Note
1125 due to implementation limitations, this flag can be set/unset
1126 only on empty files.
1127
1128   d
1129       no dump; makes sense with 3rd party tools like dump(8). On BTRFS
1130       the attribute can be set/unset but no other special handling is
1131       done
1132
1133 D
1134 synchronous directory updates, for more details search open(2) for
1135 O_SYNC and O_DSYNC
1136
1137 i
1138 immutable, no file data and metadata changes allowed even to the
1139 root user as long as this attribute is set (obviously the exception
1140 is unsetting the attribute)
1141
1142 m
1143 no compression, permanently turn off compression on the given file.
1144 Any compression mount options will not affect this file. (chattr
1145 support added in 1.46.2)
1146
1147 When set on a directory, all newly created files will inherit this
1148 attribute. This attribute cannot be set with c at the same time.
1149
1150 S
1151 synchronous updates, for more details search open(2) for O_SYNC and
1152 O_DSYNC
1153
1154 No other attributes are supported. For the complete list please refer
1155 to the chattr(1) manual page.
1156
1157 XFLAGS
1158   The letters assigned to the bits overlap with the attributes; this
1159   list refers to what xfs_io(8) provides:
1160
1161 i
1162 immutable, same as the attribute
1163
1164 a
1165 append only, same as the attribute
1166
1167 s
1168 synchronous updates, same as the attribute S
1169
1170 A
1171 no atime updates, same as the attribute
1172
1173 d
1174 no dump, same as the attribute
1175
ZONED MODE
1177   Since version 5.12 btrfs supports the so-called zoned mode. This is a
1178   special on-disk format and allocation/write strategy that’s friendly to
1179   zoned devices. In short, a device is partitioned into fixed-size zones
1180   and each zone can be updated in an append-only manner, or reset. As
1181   btrfs has no fixed data structures, except the super blocks, the zoned
1182   mode only requires block placement that follows the device constraints.
1183   You can learn about the whole architecture at https://zonedstorage.io .
1184
1185   The devices are also called SMR/ZBC/ZNS, in host-managed mode. Note
1186   that there are devices that appear as non-zoned but actually are; this
1187   is the drive-managed mode and using the zoned mode won’t help there.
1188
1189 The zone size depends on the device, typical sizes are 256MiB or 1GiB.
1190   In general it must be a power of two. Emulated zoned devices like
1191   null_blk allow setting various zone sizes.
1192
1193 REQUIREMENTS, LIMITATIONS
1194 • all devices must have the same zone size
1195
1196 • maximum zone size is 8GiB
1197
1198   • mixing zoned and non-zoned devices is possible, the zone writes are
1199     emulated, but this is mainly intended for testing
1200
1201 • the super block is handled in a special way and is at different
1202 locations than on a non-zoned filesystem:
1203
1204 • primary: 0B (and the next two zones)
1205
1206     • secondary: 512GiB (and the next two zones)
1207
1208 • tertiary: 4TiB (4096GiB, and the next two zones)
1209
1210 INCOMPATIBLE FEATURES
1211   The main constraint of the zoned devices is lack of in-place update of
1212   the data. This is inherently incompatible with some features:
1213
1214 • nodatacow - overwrite in-place, cannot create such files
1215
1216 • fallocate - preallocating space for in-place first write
1217
1218 • mixed-bg - unordered writes to data and metadata, fixing that means
1219 using separate data and metadata block groups
1220
1221 • booting - the zone at offset 0 contains superblock, resetting the
1222 zone would destroy the bootloader data
1223
1224 Initial support lacks some features but they’re planned:
1225
1226   • only the single profile is supported
1227
1228 • fstrim - due to dependency on free space cache v1
1229
1230 SUPER BLOCK
1231   As said above, the super block is handled in a special way. In order to
1232   be crash safe, at least one zone in a known location must contain a
1233   valid superblock. This is implemented as a ring buffer in two
1234   consecutive zones, starting from the known offsets 0, 512GiB and 4TiB.
1235   The values are different than on non-zoned devices. Each new super
1236   block is appended to the end of the zone; once it’s filled, the zone is
1237   reset and writes continue to the next one. Looking up the latest super
1238   block needs to read both zones and determine the last written version.
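
       The lookup can be pictured with generation numbers only (real super
       blocks also carry checksums that must validate; this is a simplified
       sketch):

       ```python
       def latest_super_generation(zone_a, zone_b):
           # Each zone holds super blocks in append order; the entry with
           # the highest generation in either zone is the current one.
           candidates = [zone[-1] for zone in (zone_a, zone_b) if zone]
           if not candidates:
               raise ValueError("no super block found")
           return max(candidates)

       # Zone A filled up with generations 1-4, writes moved on to zone B:
       assert latest_super_generation([1, 2, 3, 4], [5, 6]) == 6
       # After zone B fills up, zone A is reset and reused:
       assert latest_super_generation([7, 8], [5, 6]) == 8
       ```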
1239
1240 The amount of space reserved for super block depends on the zone size.
1241 The secondary and tertiary copies are at distant offsets as the
1242 capacity of the devices is expected to be large, tens of terabytes.
1243 Maximum zone size supported is 8GiB, which would mean that eg. offset
1244 0-16GiB would be reserved just for the super block on a hypothetical
1245 device of that zone size. This is wasteful but required to guarantee
1246 crash safety.
1247
CONTROL DEVICE
1249   There’s a character special device /dev/btrfs-control with major and
1250 minor numbers 10 and 234 (the device can be found under the misc
1251 category).
1252
1253 $ ls -l /dev/btrfs-control
1254 crw------- 1 root root 10, 234 Jan 1 12:00 /dev/btrfs-control
1255
1256 The device accepts some ioctl calls that can perform following actions
1257 on the filesystem module:
1258
1259   • scan devices for a btrfs filesystem (ie. to let multi-device
1260     filesystems mount automatically) and register them with the kernel
1261     module
1262
1263 • similar to scan, but also wait until the device scanning process is
1264 finished for a given filesystem
1265
1266 • get the supported features (can be also found under
1267 /sys/fs/btrfs/features)
1268
1269   The device is created when btrfs is initialized, either as a module or
1270   as built-in functionality, and makes sense only in connection with
1271   that. Running eg. mkfs without the module loaded will not register the
1272   device and will probably warn about that.
1273
1274 In rare cases when the module is loaded but the device is not present
1275 (most likely accidentally deleted), it’s possible to recreate it by
1276
1277 # mknod --mode=600 /dev/btrfs-control c 10 234
1278
1279 or (since 5.11) by a convenience command
1280
1281 # btrfs rescue create-control-device
1282
1283 The control device is not strictly required but the device scanning
1284 will not work and a workaround would need to be used to mount a
1285 multi-device filesystem. The mount option device can trigger the device
1286 scanning during mount, see also btrfs device scan.
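
       A quick sanity check for the device node can be scripted as well;
       the helper name is made up for illustration:

       ```python
       import os
       import stat

       def control_device_ok(path="/dev/btrfs-control"):
           # The control device must be a character device with major 10
           # (the misc category) and minor 234; if this returns False,
           # recreate it with the mknod command shown above.
           try:
               st = os.stat(path)
           except FileNotFoundError:
               return False
           return (stat.S_ISCHR(st.st_mode)
                   and os.major(st.st_rdev) == 10
                   and os.minor(st.st_rdev) == 234)
       ```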
1287
FILESYSTEMS WITH MULTIPLE BLOCK GROUP PROFILES
1289   It is possible that a btrfs filesystem contains multiple block group
1290 profiles of the same type. This could happen when a profile conversion
1291 using balance filters is interrupted (see btrfs-balance(8)). Some btrfs
1292 commands perform a test to detect this kind of condition and print a
1293 warning like this:
1294
1295 WARNING: Multiple block group profiles detected, see 'man btrfs(5)'.
1296 WARNING: Data: single, raid1
1297 WARNING: Metadata: single, raid1
1298
1299 The corresponding output of btrfs filesystem df might look like:
1300
1301 WARNING: Multiple block group profiles detected, see 'man btrfs(5)'.
1302 WARNING: Data: single, raid1
1303 WARNING: Metadata: single, raid1
1304 Data, RAID1: total=832.00MiB, used=0.00B
1305 Data, single: total=1.63GiB, used=0.00B
1306 System, single: total=4.00MiB, used=16.00KiB
1307 Metadata, single: total=8.00MiB, used=112.00KiB
1308 Metadata, RAID1: total=64.00MiB, used=32.00KiB
1309 GlobalReserve, single: total=16.25MiB, used=0.00B
1310
1311 There’s more than one line for type Data and Metadata, while the
1312 profiles are single and RAID1.
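
       The same test that the btrfs commands perform can be approximated by
       parsing the df output; a sketch using the sample above (without the
       line numbers):

       ```python
       def multiple_profiles(df_output):
           # Collect the block group profiles seen per type; more than one
           # profile for a type indicates an interrupted conversion.
           seen = {}
           for line in df_output.splitlines():
               if ":" not in line or "=" not in line:
                   continue
               bgtype, _, profile = line.split(":", 1)[0].partition(", ")
               seen.setdefault(bgtype, set()).add(profile)
           return {t: sorted(p) for t, p in seen.items() if len(p) > 1}

       sample = """\
       Data, RAID1: total=832.00MiB, used=0.00B
       Data, single: total=1.63GiB, used=0.00B
       System, single: total=4.00MiB, used=16.00KiB
       Metadata, single: total=8.00MiB, used=112.00KiB
       Metadata, RAID1: total=64.00MiB, used=32.00KiB
       GlobalReserve, single: total=16.25MiB, used=0.00B
       """
       ```

       On the sample it reports Data and Metadata, each with both RAID1 and
       single profiles, just like the warning above.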
1313
1314   This state of the filesystem is OK but most likely needs the
1315   user/administrator to take an action and finish the interrupted tasks.
1316   This cannot be easily done automatically; also, only the user knows
1317   the expected final profiles.
1318
1319   In the example above, the filesystem started as a single device with
1320   the single block group profile. Then another device was added, followed
1321   by a balance with convert=raid1 that for some reason hasn’t finished.
1322   Restarting the balance with convert=raid1 will continue and end up with
1323   a filesystem with all block group profiles RAID1.
1324
1325 Note
1326 If you’re familiar with balance filters, you can use
1327 convert=raid1,profiles=single,soft, which will take only the
1328 unconverted single profiles and convert them to raid1. This may
1329 speed up the conversion as it would not try to rewrite the already
1330       converted raid1 profiles.
1331
1332 Having just one profile is desired as this also clearly defines the
1333 profile of newly allocated block groups, otherwise this depends on
1334 internal allocation policy. When there are multiple profiles present,
1335 the order of selection is RAID6, RAID5, RAID10, RAID1, RAID0 as long as
1336 the device number constraints are satisfied.
1337
1338 Commands that print the warning were chosen so they’re brought to user
1339 attention when the filesystem state is being changed in that regard.
1340 This is: device add, device delete, balance cancel, balance pause.
1341 Commands that report space usage: filesystem df, device usage. The
1342 command filesystem usage provides a line in the overall summary:
1343
1344 Multiple profiles: yes (data, metadata)
1345
SEEDING DEVICE
1347   The COW mechanism and multiple devices under one hood enable an
1348   interesting concept, called a seeding device: extending a read-only
1349   filesystem on a single device with another device that captures all
1350   writes. For example, imagine an immutable golden image of an operating
1351   system enhanced with another device that allows using the data from
1352   the golden image together with normal operation. This idea originated
1353   on CD-ROMs with a base OS, allowing them to be used for live systems,
1354   but that became obsolete. There are technologies providing similar
1355   functionality, like unionmount, overlayfs or qcow2 image snapshots.
1356
1357   The seeding device starts as a normal filesystem; once the contents are
1358   ready, btrfstune -S 1 is used to flag it as a seeding device. Mounting
1359   such a device will not allow any writes, except adding a new device by
1360   btrfs device add. Then the filesystem can be remounted as read-write.
1361
1362   Given that the filesystem on the seeding device is always recognized as
1363   read-only, it can be used to seed multiple filesystems at the same
1364   time. The UUID that is normally attached to a device is automatically
1365   changed to a random UUID on each mount.
1366
1367   Once the seeding device is mounted, it needs the writable device. After
1368   adding it, something like mount -o remount,rw /path makes the
1369   filesystem at /path ready for use. The simplest usecase is to throw
1370   away all changes by unmounting the filesystem when convenient.
1371
1372 Alternatively, deleting the seeding device from the filesystem can turn
1373 it into a normal filesystem, provided that the writable device can also
1374 contain all the data from the seeding device.
1375
1376   The seeding device flag can be cleared again by btrfstune -f -S 0, eg.
1377   allowing it to be updated with newer data, but please note that this
1378   will invalidate all existing filesystems that use this particular
1379   seeding device. This works for some usecases, not for others, and the
1380   forcing flag to the command is mandatory to avoid accidental mistakes.
1381
1382 Example how to create and use one seeding device:
1383
1384 # mkfs.btrfs /dev/sda
1385 # mount /dev/sda /mnt/mnt1
1386 # ... fill mnt1 with data
1387 # umount /mnt/mnt1
1388 # btrfstune -S 1 /dev/sda
1389 # mount /dev/sda /mnt/mnt1
1390       # btrfs device add /dev/sdb /mnt/mnt1
1391 # mount -o remount,rw /mnt/mnt1
1392 # ... /mnt/mnt1 is now writable
1393
1394 Now /mnt/mnt1 can be used normally. The device /dev/sda can be mounted
1395   again with another writable device:
1396
1397 # mount /dev/sda /mnt/mnt2
1398 # btrfs device add /dev/sdc /mnt/mnt2
1399 # mount -o remount,rw /mnt/mnt2
1400 # ... /mnt/mnt2 is now writable
1401
1402 The writable device (/dev/sdb) can be decoupled from the seeding device
1403 and used independently:
1404
1405 # btrfs device delete /dev/sda /mnt/mnt1
1406
1407 As the contents originated in the seeding device, it’s possible to turn
1408   /dev/sdb into a seeding device again and repeat the whole process.
1409
1410 A few things to note:
1411
1412   • it’s recommended to use only a single device as the seeding device;
1413     it works for multiple devices but the single profile must be used
1414     in order to make the seeding device deletion work
1415
1416 • block group profiles single and dup support the usecases above
1417
1418 • the label is copied from the seeding device and can be changed by
1419 btrfs filesystem label
1420
1421 • each new mount of the seeding device gets a new random UUID
1422
RAID56 STATUS AND RECOMMENDED PRACTICES
1424   The RAID56 feature provides striping and parity over several devices,
1425   the same as the traditional RAID5/6. There are some implementation and
1426   design deficiencies that make it unreliable for some corner cases and
1427   the feature should not be used in production, only for evaluation or
1428   testing. The power failure safety for metadata with RAID56 is not 100%.
1429
1430 Metadata
1431 Do not use raid5 nor raid6 for metadata. Use raid1 or raid1c3
1432 respectively.
1433
1434   The substitute profiles provide the same guarantees against loss of 1
1435   or 2 devices, and in some respects can be an improvement. Recovering
1436   from one missing device will only need to access the remaining 1st or
1437   2nd copy, which in general may be stored on some other devices due to
1438   the way RAID1 works on btrfs, unlike a striped profile (similar to
1439   raid0) that would need all devices all the time.
1440
1441   The space allocation pattern and consumption is different (eg. on N
1442   devices): for raid5 as an example, a 1GiB chunk is reserved on each
1443   device, while with raid1 each 1GiB chunk is stored on 2 devices. The
1444   consumption of each 1GiB of used metadata is then N * 1GiB for raid5 vs
1445   2 * 1GiB for raid1. Using raid1 is also more convenient for balancing/
1446   converting to other profiles due to the lower chunk space requirement.
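
       The difference in raw space reservation can be written down
       explicitly; a simplified model of the example above (the function
       name is illustrative):

       ```python
       GIB = 1 << 30

       def raw_bytes_per_metadata_chunk(profile, num_devices, chunk=GIB):
           # raid5 stripes a chunk across all N devices (1GiB reserved on
           # each); raid1 stores exactly two copies, whatever N is. This
           # models only the allocation footprint discussed above and
           # ignores that part of the raid5 reservation is usable space.
           if profile == "raid5":
               return num_devices * chunk
           if profile == "raid1":
               return 2 * chunk
           raise ValueError("unhandled profile: " + profile)

       # On a 6-device filesystem: raid5 ties up 6GiB per metadata chunk,
       # raid1 only 2GiB.
       ```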
1447
1448   Missing/incomplete support
1449   When RAID56 is on the same filesystem with different raid profiles, the
1450   space reporting is inaccurate, eg. df, btrfs filesystem df or btrfs
1451   filesystem usage. When there’s only one profile per block group type
1452   (eg. raid5 for data) the reporting is accurate.
1453
1454   When scrub is started on a RAID56 filesystem, it runs on all devices at
1455   once, which degrades the performance. The workaround is to start it on
1456   each device separately. Due to that, the device stats may not match the
1457   actual state and some errors might get reported multiple times.
1458
1459   The write hole problem (parity may be left inconsistent with the data after a power failure during a partial stripe write) is not solved.
1460
STORAGE MODEL
1462   A storage model is a model that captures key physical aspects of data
1463   structure in a data store. A filesystem is the logical structure
1464   organizing data on top of the storage device.
1465
1466 The filesystem assumes several features or limitations of the storage
1467 device and utilizes them or applies measures to guarantee reliability.
1468 BTRFS in particular is based on a COW (copy on write) mode of writing,
1469 ie. not updating data in place but rather writing a new copy to a
1470 different location and then atomically switching the pointers.
1471
1472 In an ideal world, the device does what it promises. The filesystem
1473 assumes that this may not be true so additional mechanisms are applied
1474 to either detect misbehaving hardware or get valid data by other means.
1475 The devices may (and do) apply their own detection and repair
1476 mechanisms but we won’t assume any.
1477
1478 The following assumptions about storage devices are considered (sorted
1479 by importance, numbers are for further reference):
1480
1481 1. atomicity of reads and writes of blocks/sectors (the smallest unit
1482 of data the device presents to the upper layers)
1483
1484 2. there’s a flush command that instructs the device to forcibly order
1485 writes before and after the command; alternatively there’s a
1486 barrier command that facilitates the ordering but may not flush the
1487 data
1488
1489   3. data sent to be written to a given device offset will be written
1490      without further changes to the data and to the offset
1491
1492 4. writes can be reordered by the device, unless explicitly serialized
1493 by the flush command
1494
1495 5. reads and writes can be freely reordered and interleaved
1496
1497   The consistency model of BTRFS builds on these assumptions. The logical
1498   data updates are grouped into a generation, written to the device,
1499   serialized by the flush command, and then the super block is written,
1500   ending the generation. All logical links among metadata comprising a
1501   consistent view of the data may not cross the generation boundary.
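
       The ordering can be sketched as pseudo block layer calls; Device
       here is a stand-in for the real block device that only records the
       request order:

       ```python
       class Device:
           # Minimal stand-in for a block device, recording request order.
           def __init__(self):
               self.log = []
           def write(self, block):
               self.log.append(("write", block))
           def flush(self):
               self.log.append(("flush",))

       def commit_generation(dev, blocks, generation):
           for b in blocks:                  # data and metadata updates
               dev.write(b)
           dev.flush()                       # persisted before the super
           dev.write(("super", generation))  # super block ends the generation
           dev.flush()

       dev = Device()
       commit_generation(dev, ["data", "metadata"], 42)
       ```

       If the flush is honored, a crash can only lose the generation being
       written, never mix it with the previous one.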
1502
1503 WHEN THINGS GO WRONG
1504   No or partial atomicity of block reads/writes (1)
1505
1506   • Problem: partial block contents are written (torn write), eg. due
1507     to a power glitch or other electronics failure during the
1508     read/write
1509
1510 • Detection: checksum mismatch on read
1511
1512 • Repair: use another copy or rebuild from multiple blocks using some
1513 encoding scheme
1514
1515 The flush command does not flush (2)
1516
1517   This is perhaps the most serious problem and impossible to mitigate by
1518   the filesystem without limitations and design restrictions. What could
1519   happen in the worst case is that writes from one generation bleed into
1520   another one, while still letting the filesystem consider the
1521   generations isolated. A crash at any point would leave data on the
1522   device in an inconsistent state, without any hint of what exactly got
1523   written and what is missing, leading to stale metadata link information.
1524
1525 Devices usually honor the flush command, but for performance reasons
1526 may do internal caching, where the flushed data are not yet
1527 persistently stored. A power failure could lead to a similar scenario
1528 as above, although it’s less likely that later writes would be written
1529 before the cached ones. This is beyond what a filesystem can take into
1530 account. Devices or controllers are usually equipped with batteries or
1531 capacitors to write the cache contents even after power is cut.
1532 (Battery backed write cache)
1533
1534 Data get silently changed on write (3)
1535
1536   Such a thing should not happen frequently, but still can happen
1537   spuriously due to the complex internal workings of devices or physical
1538   effects of the storage media itself.
1539
1540 • Problem: while the data are written atomically, the contents get
1541 changed
1542
1543 • Detection: checksum mismatch on read
1544
1545 • Repair: use another copy or rebuild from multiple blocks using some
1546 encoding scheme
1547
1548 Data get silently written to another offset (3)
1549
1550   This would be another serious problem as the filesystem has no
1551   information when it happens. For that reason, measures have to be
1552   taken ahead of time. This is commonly called a ghost write.
1553
1554   The metadata blocks have the checksum embedded in the blocks, so a
1555   correct atomic write would not corrupt the checksum. It’s likely that
1556   after reading such a block the data inside would not be consistent
1557   with the rest. To rule that out, there’s an embedded block number in
1558   the metadata block. It’s the logical block number because this is
1559   what the logical structure expects and verifies.
1560
HARDWARE CONSIDERATIONS
1562   The following is based on information publicly available, user
1563 feedback, community discussions or bug report analyses. It’s not
1564 complete and further research is encouraged when in doubt.
1565
1566 MAIN MEMORY
1567 The data structures and raw data blocks are temporarily stored in
1568 computer memory before they get written to the device. It is critical
1569 that memory is reliable because even simple bit flips can have vast
1570 consequences and lead to damaged structures, not only in the filesystem
1571 but in the whole operating system.
1572
1573 Based on experience in the community, memory bit flips are more common
1574 than one would think. When it happens, it’s reported by the
1575 tree-checker or by a checksum mismatch after reading blocks. There are
1576 some very obvious instances of bit flips that happen, e.g. in an
1577 ordered sequence of keys in metadata blocks. We can easily infer from
1578 the other data what values get damaged and how. However, fixing that is
1579 not straightforward and would require cross-referencing data from the
1580 entire filesystem to see the scope.

       If available, ECC memory should lower the chances of bit flips,
       but this type of memory is not available in all cases. A memory
       test should be performed in case there's a visible bit flip
       pattern, though this may not detect a faulty memory module,
       because the actual load of the system could be the factor that
       makes the problems appear. In recent years, attacks on how memory
       modules operate have been demonstrated (rowhammer), managing to
       flip specific target bits. While these were targeted, they show
       that a series of reads or writes can affect unrelated parts of
       memory.

       Further reading:

       • https://en.wikipedia.org/wiki/Row_hammer

       What to do:

       • run memtest; note that memory errors sometimes happen only
         when the system is under a heavy load that the default memtest
         cannot trigger

       • memory errors may appear as the filesystem going read-only due
         to the "pre write" check, which verifies metadata before it is
         written and rejects it when basic consistency checks fail

   DIRECT MEMORY ACCESS (DMA)
       Another class of errors is related to DMA (direct memory access)
       performed by device drivers. While this could be considered a
       software error, the data transfers that happen without CPU
       assistance may accidentally corrupt other pages. Storage devices
       utilize DMA for performance reasons; filesystem structures and
       data pages are passed back and forth, making errors possible in
       case page lifetime is not properly tracked.

       There are lots of quirks (device-specific workarounds) in Linux
       kernel drivers (regarding not only DMA) that are added as they
       are found. The quirks may avoid specific errors or disable some
       features to avoid worse problems.

       What to do:

       • use an up-to-date kernel (recent releases or maintained long
         term support versions)

       • as this may be caused by faulty drivers, keep the rest of the
         system up-to-date as well

   ROTATIONAL DISKS (HDD)
       Rotational HDDs typically fail at the level of individual sectors
       or small clusters. Read failures are caught in the layers below
       the filesystem and are returned to the user as EIO - Input/output
       error. Reading the blocks repeatedly may eventually return the
       data, but this is better done by specialized tools; the
       filesystem takes the result of the lower layers as-is. Rewriting
       the sectors may trigger internal remapping, but this inevitably
       loses the data the sectors previously held.
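
       At the application level such a failure surfaces only as errno
       EIO, with nothing more specific to act on. A minimal sketch of
       handling it; the reader callback is a stand-in for a real device
       read:

```python
# Sketch: a read failing in the layers below the filesystem reaches
# the application as errno EIO; callers can only report the error,
# fall back to another copy, or fail.
import errno

def read_block(reader):
    """Call reader(); return its data, or None when EIO is reported."""
    try:
        return reader()
    except OSError as e:
        if e.errno == errno.EIO:
            return None                 # Input/output error from below
        raise                           # any other error is unexpected

def bad_sector():
    # simulated unreadable sector
    raise OSError(errno.EIO, "Input/output error")

print(read_block(lambda: b"data"))      # b'data'
print(read_block(bad_sector))           # None
```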

       Disk firmware is technically software, but from the filesystem
       perspective it is part of the hardware. IO requests are
       processed, and caching or various other optimizations are
       performed, which may lead to bugs under high load, unexpected
       physical conditions or unsupported use cases.

       Disks are connected by cables with two ends, both of which can
       cause problems when not attached properly. Data transfers are
       protected by checksums and the lower layers try hard to transfer
       the data correctly or not at all. Errors from badly connected
       cables may manifest as a large amount of failed read or write
       requests, or as short error bursts, depending on physical
       conditions.

       What to do:

       • check smartctl output for potential issues

   SOLID STATE DRIVES (SSD)
       The mechanism of information storage is different from HDDs and
       this affects the failure mode as well. The data are stored in
       cells grouped in large blocks with a limited number of resets and
       other write constraints. The firmware tries to avoid unnecessary
       resets and performs optimizations to maximize the storage media
       lifetime. Known techniques are deduplication (blocks with the
       same fingerprint/hash are mapped to the same physical block),
       compression, and internal remapping and garbage collection of
       used memory cells. Due to the additional processing there are
       measures to verify the data, e.g. by ECC codes.

       Observations of failing SSDs show that the electronics fail as a
       whole at once, or that a lot of data is affected (e.g. everything
       stored on one chip). Recovering such data may need specialized
       equipment, and reading the data repeatedly does not help as it
       may with HDDs.

       There are several technologies of memory cells with different
       characteristics and prices. The lifetime is directly affected by
       the type and frequency of data written. Writing "too much"
       distinct data (e.g. encrypted) may render the internal
       deduplication ineffective and lead to a lot of rewrites and
       increased wear of the memory cells.
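
       The deduplication effect can be sketched with a toy content-
       addressed store. Hash-based dedup as described above; the details
       of real SSD firmware are unknown and vendor-specific:

```python
# Sketch: firmware-level dedup maps blocks with the same hash to one
# physical block; encrypted or random-looking data defeats this, so
# every block costs a real write and adds wear.
import hashlib
import os

def physical_blocks(blocks):
    # number of distinct physical blocks after hash-based dedup
    return len({hashlib.sha256(b).digest() for b in blocks})

plain = [b"\x00" * 4096] * 100                      # identical blocks
encrypted = [os.urandom(4096) for _ in range(100)]  # all distinct

print(physical_blocks(plain))           # 1 - dedup collapses them
print(physical_blocks(encrypted))       # 100 - no dedup possible
```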

       There are several technologies and manufacturers, so it's hard to
       describe them all, but some exhibit similar behaviour:

       • an expensive SSD will use more durable memory cells and is
         optimized for reliability and high load

       • a cheap SSD is designed for a lower load ("desktop user") and
         is optimized for cost; it may employ the optimizations and/or
         extended error reporting only partially or not at all

       It's not possible to reliably determine the expected lifetime of
       an SSD due to lack of information about how it works or due to
       lack of reliable stats provided by the device.

       Metadata writes tend to be the biggest component of lifetime
       writes to an SSD, so there is some value in reducing them.
       Depending on the device class (high end/low end), features like
       the DUP block group profile may affect the reliability either
       way:

       • high end devices are typically more reliable, and using single
         for data and metadata could be suitable to reduce device wear

       • low end devices could lack the ability to identify errors, so
         additional redundancy at the filesystem level (checksums, DUP)
         could help

       Only users who consume 50 to 100% of the SSD's actual lifetime
       writes need to be concerned by the write amplification of btrfs
       DUP metadata. Most users will be far below 50% of the actual
       lifetime, or will write the drive to death and discover how many
       writes 100% of the actual lifetime was. SSD firmware often adds
       its own write multipliers that can be arbitrary, unpredictable
       and dependent on application behavior, and these will typically
       have a far greater effect on SSD lifespan than DUP metadata. It's
       more or less impossible to predict when an SSD will run out of
       lifetime writes to within a factor of two, so it's hard to
       justify wear reduction as a benefit.

       Further reading:

       • https://www.snia.org/educational-library/ssd-and-deduplication-end-spinning-disk-2012

       • https://www.snia.org/educational-library/realities-solid-state-storage-2013-2013

       • https://www.snia.org/educational-library/ssd-performance-primer-2013

       • https://www.snia.org/educational-library/how-controllers-maximize-ssd-life-2013

       What to do:

       • run smartctl or self-tests to look for potential issues

       • keep the firmware up-to-date

   NVM EXPRESS, NON-VOLATILE MEMORY (NVMe)
       NVMe is a type of persistent memory usually connected over a
       system bus (PCIe) or a similar interface, with speeds an order of
       magnitude faster than SATA SSDs. It is also a non-rotating type
       of storage and is not typically connected by a cable. It's not a
       SCSI type device either, but rather a complete specification for
       a logical device interface.

       In a way the errors could be compared to a combination of the SSD
       class and regular memory. Errors may exhibit as random bit flips
       or IO failures. There are tools to access the internal log (nvme
       log and nvme-cli) for a more detailed analysis.

       There are separate error detection and correction steps performed
       e.g. on the bus level, and in most cases errors never make it to
       the filesystem level. Once they do, it could mean there's a
       systematic error like overheating or a bad physical connection of
       the device. You may want to run self-tests (using smartctl).

       Further reading:

       • https://en.wikipedia.org/wiki/NVM_Express

       • https://www.smartmontools.org/wiki/NVMe_Support

   DRIVE FIRMWARE
       Firmware is technically still software, but embedded into the
       hardware. As all software has bugs, so does firmware. Storage
       devices can update the firmware to fix known bugs. In some cases
       it's possible to avoid certain bugs by quirks (device-specific
       workarounds) in the Linux kernel.

       A faulty firmware can cause a wide range of corruptions, from
       small and localized ones to large ones affecting lots of data.
       Self-repair capabilities may not be sufficient.

       What to do:

       • check for firmware updates in case there are known problems;
         note that updating firmware can be risky in itself

       • use an up-to-date kernel (recent releases or maintained long
         term support versions)

   SD FLASH CARDS
       There are a lot of devices with low power consumption, and thus
       they use low-power storage media too, typically flash memory
       stored on a chip enclosed in a detachable card package. An
       improperly inserted card may be damaged by electrical spikes when
       the device is turned on or off. The chips storing the data may in
       turn be damaged permanently. All types of flash memory have a
       limited number of rewrites, so the data are internally translated
       by the FTL (flash translation layer). This is implemented in
       firmware (technically software) and is prone to bugs that
       manifest as hardware errors.

       Adding redundancy, like using DUP profiles for both data and
       metadata, can help in some cases, but a full backup might be the
       best option once problems appear, and replacing the card could be
       required as well.

   HARDWARE AS THE MAIN SOURCE OF FILESYSTEM CORRUPTIONS
       If you use unreliable hardware and don't know about that, don't
       blame the filesystem when it tells you.

SEE ALSO
       acl(5), btrfs(8), chattr(1), fstrim(8), ioctl(2), mkfs.btrfs(8),
       mount(8), swapon(8)



Btrfs v5.15.1                    11/22/2021                  BTRFS-MAN5(5)