BTRFS(5)                              BTRFS                             BTRFS(5)

NAME
   btrfs - topics about the BTRFS filesystem (mount options, supported file
   attributes and other)

DESCRIPTION
   This document describes topics related to BTRFS that are not specific to
   the tools.  Currently covers:

   1. mount options

   2. filesystem features

   3. checksum algorithms

   4. compression

   5. sysfs interface

   6. filesystem exclusive operations

   7. filesystem limits

   8. bootloader support

   9. file attributes

   10. zoned mode

   11. control device

   12. filesystems with multiple block group profiles

   13. seeding device

   14. RAID56 status and recommended practices

   15. storage model, hardware considerations

BTRFS SPECIFIC MOUNT OPTIONS
   This section describes mount options specific to BTRFS.  For the generic
   mount options please refer to the mount(8) manual page.  The options are
   sorted alphabetically (discarding the no prefix).

   NOTE:
      Most mount options apply to the whole filesystem and only options in
      the first mounted subvolume will take effect.  This is due to lack of
      implementation and may change in the future.  This means that (for
      example) you can't set per-subvolume nodatacow, nodatasum, or
      compress using mount options.  This should eventually be fixed, but
      it has proved to be difficult to implement correctly within the
      Linux VFS framework.

   Mount options are processed in order; only the last occurrence of an
   option takes effect, and it may disable other options due to
   constraints (see e.g. nodatacow and compress).  The output of the mount
   command shows which options have been applied.

   acl, noacl
          (default: on)

          Enable/disable support for POSIX Access Control Lists (ACLs).
          See the acl(5) manual page for more information about ACLs.

          Support for ACLs is configured at build time
          (BTRFS_FS_POSIX_ACL) and the mount fails if acl is requested but
          the feature is not compiled in.

   autodefrag, noautodefrag
          (since: 3.0, default: off)

          Enable automatic file defragmentation.  When enabled, small
          random writes into files (in a range of tens of kilobytes,
          currently it's 64KiB) are detected and queued up for the
          defragmentation process.  Not well suited for large database
          workloads.

          The read latency may increase due to reading the adjacent blocks
          that make up the range for defragmentation; successive writes
          will merge the blocks in the new location.

          WARNING:
             Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2
             as well as with Linux stable kernel versions ≥ 3.10.31, ≥
             3.12.12 or ≥ 3.13.4 will break up the reflinks of COW data
             (for example files copied with cp --reflink, snapshots or
             de-duplicated data).  This may cause considerable increase of
             space usage depending on the broken up reflinks.

   barrier, nobarrier
          (default: on)

          Ensure that all IO write operations make it through the device
          cache and are stored permanently when the filesystem is at its
          consistency checkpoint.  This typically means that a flush
          command is sent to the device that will synchronize all pending
          data and ordinary metadata blocks, then the superblock is
          written and another flush is issued.

          The write flushes incur a slight performance hit and also
          prevent the IO block scheduler from reordering requests in a
          more effective way.  Disabling barriers gets rid of that penalty
          but will most certainly lead to a corrupted filesystem in case
          of a crash or power loss: the ordinary metadata blocks could
          still be unwritten at the time the new superblock is stored
          permanently, which expects that the block pointers to metadata
          were stored permanently before.

          On a device with a volatile battery-backed write-back cache, the
          nobarrier option will not lead to filesystem corruption as the
          pending blocks are supposed to make it to the permanent storage.

   check_int, check_int_data, check_int_print_mask=<value>
          (since: 3.0, default: off)

          These debugging options control the behavior of the integrity
          checking module (the BTRFS_FS_CHECK_INTEGRITY config option is
          required).  The main goal is to verify that all blocks from a
          given transaction period are properly linked.

          check_int enables the integrity checker module, which examines
          all block write requests to ensure on-disk consistency, at a
          large memory and CPU cost.

          check_int_data includes extent data in the integrity checks, and
          implies the check_int option.

          check_int_print_mask takes a bitmask of BTRFSIC_PRINT_MASK_*
          values as defined in fs/btrfs/check-integrity.c, to control the
          integrity checker module behavior.

          See comments at the top of fs/btrfs/check-integrity.c for more
          information.

   clear_cache
          Force clearing and rebuilding of the disk space cache if
          something has gone wrong.  See also: space_cache.

   commit=<seconds>
          (since: 3.12, default: 30)

          Set the interval of periodic transaction commit when data are
          synchronized to permanent storage.  Higher interval values lead
          to a larger amount of unwritten data, which has obvious
          consequences when the system crashes.  The upper bound is not
          enforced, but a warning is printed if it's more than 300 seconds
          (5 minutes).  Use with care.

   compress, compress=<type[:level]>, compress-force,
   compress-force=<type[:level]>
          (default: off, level support since: 5.1)

          Control BTRFS file data compression.  Type may be specified as
          zlib, lzo, zstd or no (for no compression, used for remounting).
          If no type is specified, zlib is used.  If compress-force is
          specified, then compression will always be attempted, but the
          data may end up uncompressed if the compression would make it
          larger.

          Both zlib and zstd (since version 5.1) expose the compression
          level as a tunable knob with higher levels trading speed and
          memory (zstd) for higher compression ratios.  This can be set by
          appending a colon and the desired level.  ZLIB accepts the range
          [1, 9] and ZSTD accepts [1, 15].  If no level is set, both
          currently use a default level of 3.  The value 0 is an alias for
          the default level.

          Otherwise some simple heuristics are applied to detect an
          incompressible file.  If the first blocks written to a file are
          not compressible, the whole file is permanently marked to skip
          compression.  As this is too simple, compress-force is a
          workaround that will compress most of the files at the cost of
          some wasted CPU cycles on failed attempts.  Since kernel 4.15,
          the heuristic algorithms have been improved by using frequency
          sampling, repeated pattern detection and Shannon entropy
          calculation to avoid that.

          NOTE:
             If compression is enabled, nodatacow and nodatasum are
             disabled.

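          As a sketch (device and mount point paths are illustrative),
          zstd at an explicit level can be enabled at mount time, or
          compression can be forced regardless of the heuristics:

```shell
# Mount with zstd compression at level 3 (also the current default level).
mount -o compress=zstd:3 /dev/sdx /mnt

# Remount, forcing compression attempts even for data the heuristics
# would mark as incompressible.
mount -o remount,compress-force=zstd /mnt
```
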
   datacow, nodatacow
          (default: on)

          Enable data copy-on-write for newly created files.  Nodatacow
          implies nodatasum, and disables compression.  All files created
          under nodatacow also have the NOCOW file attribute set (see
          chattr(1)).

          NOTE:
             If nodatacow or nodatasum are enabled, compression is
             disabled.

          In-place updates improve performance for workloads that do
          frequent overwrites, at the cost of potential partial writes in
          case the write is interrupted (system crash, device failure).

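          Instead of the whole-filesystem mount option, COW can be
          disabled per file with the NOCOW attribute; note the attribute
          only takes effect reliably on empty files (paths are
          illustrative):

```shell
# Create an empty file and disable COW for it before writing any data.
touch /mnt/vm-image.raw
chattr +C /mnt/vm-image.raw

# The 'C' flag now shows up in the attribute listing.
lsattr /mnt/vm-image.raw
```
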
   datasum, nodatasum
          (default: on)

          Enable data checksumming for newly created files.  Datasum
          implies datacow, i.e. the normal mode of operation.  All files
          created under nodatasum inherit the "no checksums" property,
          however there's no corresponding file attribute (see chattr(1)).

          NOTE:
             If nodatacow or nodatasum are enabled, compression is
             disabled.

          There is a slight performance gain when checksums are turned
          off, as the corresponding metadata blocks holding the checksums
          do not need to be updated.  The cost of checksumming the blocks
          in memory is much lower than the IO; modern CPUs feature
          hardware support for the checksumming algorithm.

   degraded
          (default: off)

          Allow mounts with fewer devices than the RAID profile
          constraints require.  A read-write mount (or remount) may fail
          when there are too many devices missing, for example if a stripe
          member is completely missing from RAID0.

          Since 4.14, the constraint checks have been improved and are
          verified on the chunk level, not at the device level.  This
          allows degraded mounts of filesystems with mixed RAID profiles
          for data and metadata, even if the device number constraints
          would not be satisfied for some of the profiles.

          Example: metadata -- raid1, data -- single, devices -- /dev/sda,
          /dev/sdb

          Suppose the data are completely stored on sda; then the missing
          sdb will not prevent the mount, even if 1 missing device would
          normally prevent any single profile from mounting.  In case some
          of the data chunks are stored on sdb, then the constraint of
          single/data is not satisfied and the filesystem cannot be
          mounted.

   device=<devicepath>
          Specify a path to a device that will be scanned for a BTRFS
          filesystem during mount.  This is usually done automatically by
          a device manager (like udev) or using the btrfs device scan
          command (e.g. run from the initial ramdisk).  In cases where
          this is not possible the device mount option can help.

          NOTE:
             Booting e.g. a RAID1 system may fail even if all filesystem's
             device paths are provided as the actual device nodes may not
             be discovered by the system at that point.

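          For example, a two-device filesystem could be mounted like this
          when automatic scanning is unavailable (device paths are
          illustrative):

```shell
# Point the kernel at all member devices explicitly at mount time.
mount -o device=/dev/sdb,device=/dev/sdc /dev/sdb /mnt
```
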
   discard, discard=sync, discard=async, nodiscard
          (default: off, async support since: 5.6)

          Enable discarding of freed file blocks.  This is useful for SSD
          devices, thinly provisioned LUNs, or virtual machine images;
          however, every storage layer must support discard for it to
          work.

          In the synchronous mode (sync or without option value), lack of
          queued TRIM support on the backing device can severely degrade
          performance, because a synchronous TRIM operation will be
          attempted instead.  Queued TRIM requires chipsets and devices
          newer than SATA revision 3.1.

          The asynchronous mode (async) gathers extents in larger chunks
          before sending them to the devices for TRIM.  The overhead and
          performance impact should be negligible compared to the previous
          mode and it's supposed to be the preferred mode if needed.

          If it is not necessary to immediately discard freed blocks, then
          the fstrim tool can be used to discard all free blocks in a
          batch.  Scheduling a TRIM during a period of low system activity
          will prevent latent interference with the performance of other
          operations.  Also, a device may ignore the TRIM command if the
          range is too small, so running a batch discard has a greater
          probability of actually discarding the blocks.

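          A batch discard can be run manually with fstrim or scheduled
          during low activity (mount point is illustrative):

```shell
# Discard all unused blocks of the filesystem mounted at /mnt, verbosely.
fstrim -v /mnt

# Many distributions ship a periodic timer for exactly this purpose.
systemctl enable --now fstrim.timer
```
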
   enospc_debug, noenospc_debug
          (default: off)

          Enable verbose output for some ENOSPC conditions.  It's safe to
          use but can be noisy if the system reaches near-full state.

   fatal_errors=<action>
          (since: 3.4, default: bug)

          Action to take when encountering a fatal error.

          bug    BUG() on a fatal error; the system will stay in the
                 crashed state and may be still partially usable, but a
                 reboot is required for full operation

          panic  panic() on a fatal error; depending on other system
                 configuration, this may be followed by a reboot.  Please
                 refer to the documentation of kernel boot parameters,
                 e.g. panic, oops or crashkernel.

   flushoncommit, noflushoncommit
          (default: off)

          This option forces any data dirtied by a write in a prior
          transaction to commit as part of the current commit, effectively
          a full filesystem sync.

          This makes the committed state a fully consistent view of the
          file system from the application's perspective (i.e. it includes
          all completed file system operations).  This was previously the
          behavior only when a snapshot was created.

          When off, the filesystem is consistent but buffered writes may
          span more than one transaction commit.

   fragment=<type>
          (depends on compile-time option BTRFS_DEBUG, since: 4.4,
          default: off)

          A debugging helper to intentionally fragment the given type of
          block groups.  The type can be data, metadata or all.  This
          mount option should not be used outside of debugging
          environments and is not recognized if the kernel config option
          BTRFS_DEBUG is not enabled.

   nologreplay
          (default: off, even read-only)

          The tree-log contains pending updates to the filesystem until
          the full commit.  The log is replayed on next mount; this can be
          disabled by this option.  See also treelog.  Note that
          nologreplay is the same as norecovery.

          WARNING:
             Currently, the tree log is replayed even with a read-only
             mount!  To disable that behaviour, mount also with
             nologreplay.

   max_inline=<bytes>
          (default: min(2048, page size) )

          Specify the maximum amount of space that can be inlined in a
          metadata b-tree leaf.  The value is specified in bytes,
          optionally with a K suffix (case insensitive).  In practice,
          this value is limited by the filesystem block size (named
          sectorsize at mkfs time) and memory page size of the system.  In
          case of the sectorsize limit, there's some space unavailable due
          to leaf headers.  For example, with a 4KiB sectorsize, the
          maximum size of inline data is about 3900 bytes.

          Inlining can be completely turned off by specifying 0.  This
          will increase data block slack if file sizes are much smaller
          than the block size but will reduce metadata consumption in
          return.

          NOTE:
             The default value has changed to 2048 in kernel 4.6.

   metadata_ratio=<value>
          (default: 0, internal logic)

          Specifies that 1 metadata chunk should be allocated after every
          <value> data chunks.  Default behaviour depends on internal
          logic: some percentage of unused metadata space is attempted to
          be maintained but is not always possible if there's not enough
          space left for chunk allocation.  The option could be useful to
          override the internal logic in favor of metadata allocation if
          the expected workload is supposed to be metadata intense
          (snapshots, reflinks, xattrs, inlined files).

   norecovery
          (since: 4.5, default: off)

          Do not attempt any data recovery at mount time.  This will
          disable logreplay and avoids other write operations.  Note that
          this option is the same as nologreplay.

          NOTE:
             The opposite option recovery used to have a different meaning
             but was changed for consistency with other filesystems, where
             norecovery is used for skipping log replay.  BTRFS does the
             same and in general will try to avoid any write operations.

   rescan_uuid_tree
          (since: 3.12, default: off)

          Force check and rebuild procedure of the UUID tree.  This should
          not normally be needed.

   rescue (since: 5.9)

          Modes allowing mount with damaged filesystem structures.

          • usebackuproot (since: 5.9, replaces standalone option
            usebackuproot)

          • nologreplay (since: 5.9, replaces standalone option
            nologreplay)

          • ignorebadroots, ibadroots (since: 5.11)

          • ignoredatacsums, idatacsums (since: 5.11)

          • all (since: 5.9)

   skip_balance
          (since: 3.3, default: off)

          Skip automatic resume of an interrupted balance operation.  The
          operation can later be resumed with btrfs balance resume, or the
          paused state can be removed with btrfs balance cancel.  The
          default behaviour is to resume an interrupted balance
          immediately after a volume is mounted.

   space_cache, space_cache=<version>, nospace_cache
          (nospace_cache since: 3.2, space_cache=v1 and space_cache=v2
          since 4.5, default: space_cache=v1)

          Options to control the free space cache.  The free space cache
          greatly improves performance when reading block group free space
          into memory.  However, managing the space cache consumes some
          resources, including a small amount of disk space.

          There are two implementations of the free space cache.  The
          original one, referred to as v1, is the safe default.  The v1
          space cache can be disabled at mount time with nospace_cache
          without clearing.

          On very large filesystems (many terabytes) and certain
          workloads, the performance of the v1 space cache may degrade
          drastically.  The v2 implementation, which adds a new b-tree
          called the free space tree, addresses this issue.  Once enabled,
          the v2 space cache will always be used and cannot be disabled
          unless it is cleared.  Use clear_cache,space_cache=v1 or
          clear_cache,nospace_cache to do so.  If v2 is enabled, kernels
          without v2 support will only be able to mount the filesystem in
          read-only mode.

          The btrfs-check(8) and mkfs.btrfs(8) commands have full v2 free
          space cache support since v4.19.

          If a version is not explicitly specified, the default
          implementation will be chosen, which is v1.

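          A sketch of switching between the implementations with one-time
          mounts (device path is illustrative):

```shell
# Enable the v2 free space tree; the choice persists across later mounts.
mount -o clear_cache,space_cache=v2 /dev/sdx /mnt

# Going back to v1 requires clearing the v2 tree first.
umount /mnt
mount -o clear_cache,space_cache=v1 /dev/sdx /mnt
```
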
   ssd, ssd_spread, nossd, nossd_spread
          (default: SSD autodetected)

          Options to control SSD allocation schemes.  By default, BTRFS
          will enable or disable SSD optimizations depending on the status
          of a device, i.e. whether it is of rotational or non-rotational
          type.  This is determined by the contents of
          /sys/block/DEV/queue/rotational.  If it is 0, the ssd option is
          turned on.  The option nossd will disable the autodetection.

          The optimizations make use of the absence of the seek penalty
          that's inherent to rotational devices.  The blocks can typically
          be written faster and are not offloaded to separate threads.

          NOTE:
             Since 4.14, the block layout optimizations have been dropped.
             This used to help with the first generations of SSD devices.
             Their FTL (flash translation layer) was not effective and the
             optimization was supposed to improve the wear by better
             aligning blocks.  This is no longer true with modern SSD
             devices and the optimization had no real benefit.
             Furthermore it caused increased fragmentation.  The layout
             tuning has been kept intact for the option ssd_spread.

          The ssd_spread mount option attempts to allocate into bigger and
          aligned chunks of unused space, and may perform better on
          low-end SSDs.  ssd_spread implies ssd, enabling all other SSD
          heuristics as well.  The option nossd will disable all SSD
          options while nossd_spread only disables ssd_spread.

   subvol=<path>
          Mount the subvolume from path rather than the toplevel
          subvolume.  The path is always treated as relative to the
          toplevel subvolume.  This mount option overrides the default
          subvolume set for the given filesystem.

   subvolid=<subvolid>
          Mount the subvolume specified by a subvolid number rather than
          the toplevel subvolume.  You can use btrfs subvolume list or
          btrfs subvolume show to see subvolume ID numbers.  This mount
          option overrides the default subvolume set for the given
          filesystem.

          NOTE:
             If both subvolid and subvol are specified, they must point at
             the same subvolume, otherwise the mount will fail.

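          For example, mounting the same subvolume by path or by id
          (names and numbers are illustrative):

```shell
# Mount the subvolume named 'home' relative to the toplevel subvolume.
mount -o subvol=home /dev/sdx /home

# Equivalent, using the numeric id printed by 'btrfs subvolume list'.
mount -o subvolid=257 /dev/sdx /home
```
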
   thread_pool=<number>
          (default: min(NRCPUS + 2, 8) )

          The number of worker threads to start.  NRCPUS is the number of
          on-line CPUs detected at the time of mount.  A small number
          leads to less parallelism in processing data and metadata;
          higher numbers could lead to a performance hit due to increased
          locking contention, process scheduling, cache-line bouncing or
          costly data transfers between local CPU memories.

   treelog, notreelog
          (default: on)

          Enable the tree logging used for fsync and O_SYNC writes.  The
          tree log stores changes without the need of a full filesystem
          sync.  The log operations are flushed at sync and transaction
          commit.  If the system crashes between two such syncs, the
          pending tree log operations are replayed during mount.

          WARNING:
             Currently, the tree log is replayed even with a read-only
             mount!  To disable that behaviour, also mount with
             nologreplay.

          The tree log could contain new files/directories; these would
          not exist on a mounted filesystem if the log is not replayed.

   usebackuproot
          (since: 4.6, default: off)

          Enable autorecovery attempts if a bad tree root is found at
          mount time.  Currently this scans a backup list of several
          previous tree roots and tries to use the first readable one.
          This can be used with read-only mounts as well.

          NOTE:
             This option has replaced recovery.

   user_subvol_rm_allowed
          (default: off)

          Allow subvolumes to be deleted by their respective owner.
          Otherwise, only the root user can do that.

          NOTE:
             Historically, any user could create a snapshot even if they
             were not the owner of the source subvolume; subvolume
             deletion has been restricted for that reason.  Subvolume
             creation has been restricted too, but this mount option is
             still required.  This is a usability issue.  Since 4.18, the
             rmdir(2) syscall can delete an empty subvolume just like an
             ordinary directory.  Whether this is possible can be detected
             at runtime, see the rmdir_subvol feature in FILESYSTEM
             FEATURES.

DEPRECATED MOUNT OPTIONS
   List of mount options that have been removed, kept for backward
   compatibility.

   recovery
          (since: 3.2, default: off, deprecated since: 4.5)

          NOTE:
             This option has been replaced by usebackuproot and should not
             be used but will work on 4.5+ kernels.

   inode_cache, noinode_cache
          (removed in: 5.11, since: 3.0, default: off)

          NOTE:
             The functionality has been removed in 5.11; any stale data
             created by previous use of the inode_cache option can be
             removed by btrfs check --clear-ino-cache.

NOTES ON GENERIC MOUNT OPTIONS
   Some of the general mount options from mount(8) that affect BTRFS and
   are worth mentioning.

   noatime
          Under read intensive work-loads, specifying noatime
          significantly improves performance because no new access time
          information needs to be written.  Without this option, the
          default is relatime, which only reduces the number of inode
          atime updates in comparison to the traditional strictatime.  The
          worst case for atime updates under relatime occurs when many
          files are read whose atime is older than 24 h and which are
          freshly snapshotted.  In that case the atime is updated and COW
          happens - for each file - in bulk.  See also
          https://lwn.net/Articles/499293/ - Atime and btrfs: a bad
          combination? (LWN, 2012-05-31).

          Note that noatime may break applications that rely on atime
          updates like the venerable Mutt (unless you use maildir
          mailboxes).

FILESYSTEM FEATURES
   The basic set of filesystem features gets extended over time.  The
   backward compatibility is maintained and the features are optional;
   they need to be explicitly asked for, so accidental use will not create
   incompatibilities.

   There are several classes and the respective tools to manage the
   features:

   at mkfs time only
          This is namely for core structures, like the b-tree nodesize or
          checksum algorithm, see mkfs.btrfs(8) for more details.

   after mkfs, on an unmounted filesystem
          Features that may optimize internal structures or add new
          structures to support new functionality, see btrfstune(8).  The
          command btrfs inspect-internal dump-super /dev/sdx will dump a
          superblock; you can map the value of incompat_flags to the
          features listed below.

   after mkfs, on a mounted filesystem
          The features of a filesystem (with a given UUID) are listed in
          /sys/fs/btrfs/UUID/features/, one file per feature.  The status
          is stored inside the file.  The value 1 is for enabled and
          active, while 0 means the feature was enabled at mount time but
          turned off afterwards.

          Whether a particular feature can be turned on for a mounted
          filesystem can be found in the directory
          /sys/fs/btrfs/features/, one file per feature.  The value 1
          means the feature can be enabled.

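   The sysfs directories can be inspected with ordinary shell tools, for
   example (mount point is illustrative):

```shell
# Features supported by the running kernel module.
ls /sys/fs/btrfs/features/

# Features enabled on a mounted filesystem, looked up by its UUID.
UUID=$(findmnt -no UUID /mnt)
ls "/sys/fs/btrfs/$UUID/features/"
```
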
   List of features (see also mkfs.btrfs(8) section FILESYSTEM FEATURES):

   big_metadata
          (since: 3.4)

          the filesystem uses nodesize for metadata blocks, this can be
          bigger than the page size

   compress_lzo
          (since: 2.6.38)

          the lzo compression has been used on the filesystem, either as a
          mount option or via btrfs filesystem defrag

   compress_zstd
          (since: 4.14)

          the zstd compression has been used on the filesystem, either as
          a mount option or via btrfs filesystem defrag

   default_subvol
          (since: 2.6.34)

          the default subvolume has been set on the filesystem

   extended_iref
          (since: 3.7)

          increased hardlink limit per file in a directory to 65536; older
          kernels supported a varying number of hardlinks depending on the
          sum of all file name sizes that can be stored into one metadata
          block

   free_space_tree
          (since: 4.5)

          free space representation using a dedicated b-tree, successor of
          the v1 space cache

   metadata_uuid
          (since: 5.0)

          the main filesystem UUID is the metadata_uuid, which stores the
          new UUID only in the superblock while all metadata blocks still
          have the UUID set at mkfs time, see btrfstune(8) for more

   mixed_backref
          (since: 2.6.31)

          the last major disk format change, improved backreferences, now
          default

   mixed_groups
          (since: 2.6.37)

          mixed data and metadata block groups, i.e. the data and metadata
          are not separated and occupy the same block groups; this mode is
          suitable for small volumes as there are no constraints on how
          the remaining space should be used (compared to the split mode,
          where empty metadata space cannot be used for data and vice
          versa)

          on the other hand, the final layout is quite unpredictable and
          possibly highly fragmented, which means worse performance

   no_holes
          (since: 3.14)

          improved representation of file extents where holes are not
          explicitly stored as an extent, saves a few percent of metadata
          if sparse files are used

   raid1c34
          (since: 5.5)

          extended RAID1 mode with copies on 3 or 4 devices respectively

   raid56 (since: 3.9)

          the filesystem contains or contained a RAID56 profile of block
          groups

   rmdir_subvol
          (since: 4.18)

          indicates that the rmdir(2) syscall can delete an empty
          subvolume just like an ordinary directory.  Note that this
          feature only depends on the kernel version.

   skinny_metadata
          (since: 3.10)

          reduced-size metadata for extent references, saves a few percent
          of metadata

   send_stream_version
          (since: 5.10)

          number of the highest supported send stream version

   supported_checksums
          (since: 5.5)

          list of checksum algorithms supported by the kernel module; the
          respective modules or built-in implementations of the algorithms
          need to be present to mount the filesystem, see CHECKSUM
          ALGORITHMS

   supported_sectorsizes
          (since: 5.13)

          list of values that are accepted as sector sizes (mkfs.btrfs
          --sectorsize) by the running kernel

   supported_rescue_options
          (since: 5.11)

          list of values for the mount option rescue that are supported by
          the running kernel, see btrfs(5)

   zoned  (since: 5.12)

          zoned mode is allocation/write friendly to host-managed zoned
          devices; allocation space is partitioned into fixed-size zones
          that must be written sequentially, see ZONED MODE

SWAPFILE SUPPORT
   A swapfile is file-backed memory that the system uses to temporarily
   offload the RAM.  It is supported since kernel 5.0.  Use swapon(8) to
   activate the swapfile.  There are some limitations of the
   implementation in BTRFS and the Linux swap subsystem:

   • filesystem - must be only single device

   • filesystem - must have only single data profile

   • swapfile - the containing subvolume cannot be snapshotted

   • swapfile - must be preallocated (i.e. no holes)

   • swapfile - must be NODATACOW (i.e. also NODATASUM, no compression)

   The limitations come namely from the COW-based design and the mapping
   layer of blocks that allows the advanced features like relocation and
   multi-device filesystems.  However, the swap subsystem expects simpler
   mapping and no background changes of the file block location once
   they've been assigned to swap.

   With active swapfiles, the following whole-filesystem operations will
   skip swapfile extents or may fail:

   • balance - block groups with swapfile extents are skipped and
     reported, the rest will be processed normally

   • resize grow - unaffected

   • resize shrink - works as long as the extents are outside of the
     shrunk range

   • device add - a new device does not interfere with existing swapfile
     and this operation will work, though no new swapfile can be activated
     afterwards

   • device delete - if the device has been added as above, it can be also
     deleted

   • device replace - ditto

   When there are no active swapfiles and a whole-filesystem exclusive
   operation is running (e.g. balance, device delete, shrink), the
   swapfiles cannot be temporarily activated.  The operation must finish
   first.

   To create and activate a swapfile run the following commands:

      # truncate -s 0 swapfile
      # chattr +C swapfile
      # fallocate -l 2G swapfile
      # chmod 0600 swapfile
      # mkswap swapfile
      # swapon swapfile

   Since version 6.1 it's possible to create the swapfile in a single
   command (except the activation):

      # btrfs filesystem mkswapfile swapfile
      # swapon swapfile

   Please note that the UUID returned by the mkswap utility identifies the
   swap "filesystem" and, because it's stored in a file, it's not
   generally visible and usable as an identifier, unlike if it was on a
   block device.

   The file will appear in /proc/swaps:

      # cat /proc/swaps
      Filename          Type        Size      Used      Priority
      /path/swapfile    file        2097152   0         -2

   The swapfile can be created as a one-time operation or, once properly
   created, activated on each boot by the swapon -a command (usually
   started by the service manager).  Add the following entry to
   /etc/fstab, assuming the filesystem that provides the /path has been
   already mounted at this point.  Additional mount options relevant for
   the swapfile can be set too (like priority, but not the BTRFS mount
   options).

      /path/swapfile        none        swap        defaults      0 0

821 A swapfile can be used for hibernation but it's not straightforward.
822 Before hibernation a resume offset must be written to file
823 /sys/power/resume_offset or the kernel command line parameter re‐
824 sume_offset must be set.
825
826 The value is the physical offset on the device. Note that this is not
827 the same value that filefrag prints as physical offset!
828
829 Btrfs filesystem uses mapping between logical and physical addresses
830 but here the physical can still map to one or more device-specific
831 physical block addresses. It's the device-specific physical offset that
832 is suitable as resume offset.
833
834 Since version 6.1 there's a command btrfs inspect-internal map-swapfile
835 that will print the device physical offset and the adjusted value for
836 /sys/power/resume_offset. Note that the value is divided by page size,
837 i.e. it's not the offset itself.
838
839 # btrfs filesystem mkswapfile swapfile
840 # btrfs inspect-internal map-swapfile swapfile
841 Physical start: 811511726080
842 Resume offset: 198122980
843
844 For scripting and convenience the option -r will print just the offset:
845
846 # btrfs inspect-internal map-swapfile -r swapfile
847 198122980
848
849 The command map-swapfile also verifies all the requirements, i.e. no
850 holes, single device, etc.
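
       The relationship between the two values from the example above is a
       plain division by the page size; a minimal sketch (assuming the
       common 4KiB page size, the helper name is illustrative):

```python
# Derive the /sys/power/resume_offset value from the device physical
# offset printed by `btrfs inspect-internal map-swapfile`.
# Assumes the common 4KiB page size; query os.sysconf("SC_PAGE_SIZE")
# for the real value on a given system.
PAGE_SIZE = 4096

def resume_offset(physical_start: int, page_size: int = PAGE_SIZE) -> int:
    # A non-aligned offset cannot be used for resume.
    if physical_start % page_size != 0:
        raise ValueError("physical start is not page aligned")
    return physical_start // page_size

# Values from the example output above:
print(resume_offset(811511726080))  # 198122980
```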

   Troubleshooting
853 If the swapfile activation fails please verify that you followed all
854 the steps above or check the system log (e.g. dmesg or journalctl) for
855 more information.
856
857 Notably, the swapon utility exits with a message that does not say what
858 failed:
859
860 # swapon /path/swapfile
861 swapon: /path/swapfile: swapon failed: Invalid argument
862
863 The specific reason is likely to be printed to the system log by the
864 btrfs module:
865
866 # journalctl -t kernel | grep swapfile
867 kernel: BTRFS warning (device sda): swapfile must have single data profile

CHECKSUM ALGORITHMS
       Data and metadata are checksummed by default. The checksum is
       calculated before write and verified after reading the blocks from
       the devices. A whole metadata block has its checksum stored inline in
       the b-tree node header; each data block has a detached checksum
       stored in the checksum tree.
875
       Several checksum algorithms are supported. The default, and the most
       backward compatible, is crc32c. Since kernel 5.5 there are three more
       with different characteristics and trade-offs regarding speed and
       strength. The following list may help you decide which one to select.
880
881 CRC32C (32bit digest)
882 default, best backward compatibility, very fast, modern CPUs
883 have instruction-level support, not collision-resistant but
884 still good error detection capabilities
885
886 XXHASH (64bit digest)
887 can be used as CRC32C successor, very fast, optimized for modern
888 CPUs utilizing instruction pipelining, good collision resistance
889 and error detection
890
891 SHA256 (256bit digest)
892 a cryptographic-strength hash, relatively slow but with possible
893 CPU instruction acceleration or specialized hardware cards, FIPS
894 certified and in wide use
895
896 BLAKE2b (256bit digest)
897 a cryptographic-strength hash, relatively fast with possible CPU
898 acceleration using SIMD extensions, not standardized but based
899 on BLAKE which was a SHA3 finalist, in wide use, the algorithm
900 used is BLAKE2b-256 that's optimized for 64bit platforms
901
       The digest size affects the overall size of the data block checksums
       stored in the filesystem. Metadata blocks have a fixed checksum area
       of up to 256 bits (32 bytes), so there's no increase there. Each data
       block has a separate checksum stored, with the additional overhead of
       the b-tree leaves holding them.
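
       As a rough illustration of that overhead, the checksum space consumed
       by data blocks can be estimated from the digest size (a sketch that
       ignores the b-tree leaf overhead mentioned above):

```python
# Estimate the checksum-tree space needed for a given amount of data.
# Each 4KiB data block stores one digest; b-tree leaf overhead is ignored.
BLOCK_SIZE = 4096
DIGEST_BYTES = {"crc32c": 4, "xxhash": 8, "sha256": 32, "blake2b": 32}

def csum_space(data_bytes: int, algo: str) -> int:
    blocks = (data_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE
    return blocks * DIGEST_BYTES[algo]

one_tib = 1 << 40
print(csum_space(one_tib, "crc32c") >> 30)  # 1 (GiB of checksums per TiB)
print(csum_space(one_tib, "sha256") >> 30)  # 8 (GiB of checksums per TiB)
```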
906
       Approximate relative performance of the algorithms, measured against
       CRC32C using reference software implementations on a 3.5GHz Intel
       CPU:
909
910 ┌────────┬─────────────┬───────┬─────────────────┐
911 │Digest │ Cycles/4KiB │ Ratio │ Implementation │
912 ├────────┼─────────────┼───────┼─────────────────┤
913 │CRC32C │ 1700 │ 1.00 │ CPU instruction │
914 ├────────┼─────────────┼───────┼─────────────────┤
915 │XXHASH │ 2500 │ 1.44 │ reference impl. │
916 ├────────┼─────────────┼───────┼─────────────────┤
917 │SHA256 │ 105000 │ 61 │ reference impl. │
918 ├────────┼─────────────┼───────┼─────────────────┤
919 │SHA256 │ 36000 │ 21 │ libgcrypt/AVX2 │
920 ├────────┼─────────────┼───────┼─────────────────┤
921 │SHA256 │ 63000 │ 37 │ libsodium/AVX2 │
922 ├────────┼─────────────┼───────┼─────────────────┤
923 │BLAKE2b │ 22000 │ 13 │ reference impl. │
924 ├────────┼─────────────┼───────┼─────────────────┤
925 │BLAKE2b │ 19000 │ 11 │ libgcrypt/AVX2 │
926 ├────────┼─────────────┼───────┼─────────────────┤
927 │BLAKE2b │ 19000 │ 11 │ libsodium/AVX2 │
928 └────────┴─────────────┴───────┴─────────────────┘
929
       Many kernels are configured with SHA256 built-in, not as a module.
       The accelerated versions are however provided by modules and must be
       loaded explicitly (modprobe sha256) before mounting the filesystem to
       make use of them. You can check which one is used in
       /sys/fs/btrfs/FSID/checksum. If you see sha256-generic, you may want
       to unmount and mount the filesystem again; changing the
       implementation on a mounted filesystem is not possible. Check the
       file /proc/crypto; when the implementation is built-in, you'd find
938
939 name : sha256
940 driver : sha256-generic
941 module : kernel
942 priority : 100
943 ...
944
       while an accelerated implementation is e.g.
946
947 name : sha256
948 driver : sha256-avx2
949 module : sha256_ssse3
950 priority : 170
951 ...
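
       The kernel crypto layer selects the implementation with the highest
       priority value. The choice can be illustrated by parsing such
       records (a sketch working on the sample text mirroring the records
       above, not on the live /proc/crypto):

```python
# Pick the highest-priority driver for an algorithm from /proc/crypto-style
# records (blocks of "key : value" lines separated by blank lines).
def parse_records(text):
    records, current = [], {}
    for line in text.splitlines():
        if ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            current[key] = value
        elif current:
            records.append(current)
            current = {}
    if current:
        records.append(current)
    return records

def best_driver(text, name):
    candidates = [r for r in parse_records(text) if r.get("name") == name]
    return max(candidates, key=lambda r: int(r.get("priority", 0)))["driver"]

sample = """\
name     : sha256
driver   : sha256-generic
priority : 100

name     : sha256
driver   : sha256-avx2
priority : 170
"""
print(best_driver(sample, "sha256"))  # sha256-avx2
```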

COMPRESSION
954 Btrfs supports transparent file compression. There are three algorithms
955 available: ZLIB, LZO and ZSTD (since v4.14), with various levels. The
956 compression happens on the level of file extents and the algorithm is
957 selected by file property, mount option or by a defrag command. You
958 can have a single btrfs mount point that has some files that are uncom‐
959 pressed, some that are compressed with LZO, some with ZLIB, for in‐
960 stance (though you may not want it that way, it is supported).
961
       Once compression is set, all newly written data will be compressed,
       while existing data remain untouched. Data are split into smaller
       chunks (128KiB) before compression to make random rewrites possible
       without a high performance hit. Due to the increased number of
       extents the metadata consumption is higher. The chunks are compressed
       in parallel.
967
968 The algorithms can be characterized as follows regarding the speed/ra‐
969 tio trade-offs:
970
971 ZLIB
972
973 • slower, higher compression ratio
974
975 • levels: 1 to 9, mapped directly, default level is 3
976
977 • good backward compatibility
978
979 LZO
980
981 • faster compression and decompression than ZLIB, worse compres‐
982 sion ratio, designed to be fast
983
984 • no levels
985
986 • good backward compatibility
987
988 ZSTD
989
990 • compression comparable to ZLIB with higher compression/decom‐
991 pression speeds and different ratio
992
993 • levels: 1 to 15, mapped directly (higher levels are not avail‐
994 able)
995
996 • since 4.14, levels since 5.1
997
       The differences depend on the actual data set and cannot be expressed
       by a single number or recommendation. Higher levels consume more CPU
       time and may not bring a significant improvement; lower levels are
       close to real-time.
1002
       Typically compression is enabled for the whole filesystem, specified
       at the mount point. Note that the compression mount options are
       shared among all mounts of the same filesystem, whether bind mounts
       or subvolume mounts. Please refer to section MOUNT OPTIONS.
1008
1009 $ mount -o compress=zstd /dev/sdx /mnt
1010
       This will enable the zstd algorithm at the default level (which is
       3). The level can also be specified manually, like zstd:3. Higher
       levels compress better at the cost of time, which in turn may cause
       increased write latency; low levels are suitable for real-time
       compression and on a reasonably fast CPU don't cause noticeable
       performance drops.
1016
1017 $ btrfs filesystem defrag -czstd file
1018
1019 The command above will start defragmentation of the whole file and ap‐
1020 ply the compression, regardless of the mount option. (Note: specifying
1021 level is not yet implemented). The compression algorithm is not persis‐
1022 tent and applies only to the defragmentation command, for any other
1023 writes other compression settings apply.
1024
1025 Persistent settings on a per-file basis can be set in two ways:
1026
1027 $ chattr +c file
1028 $ btrfs property set file compression zstd
1029
       The first command uses the legacy file attribute interface inherited
       from the ext2 filesystem and is not flexible, so by default the zlib
       compression is set. The other command sets a property on the file
       with the given algorithm. (Note: setting the level that way is not
       yet implemented.)

   Compression levels
       Level support for ZLIB was added in v4.14; LZO does not support
       levels (the kernel implementation provides only one); ZSTD level
       support was added in v5.1.
1040
       There are 9 levels of ZLIB supported (1 to 9), mapping 1:1 from the
       mount option to the algorithm-defined level. The default is level 3,
       which provides a reasonably good compression ratio and is still
       reasonably fast. The difference in compression gain of levels 7, 8
       and 9 is comparable but the higher levels take longer.
1046
1047 The ZSTD support includes levels 1 to 15, a subset of full range of
1048 what ZSTD provides. Levels 1-3 are real-time, 4-8 slower with improved
1049 compression and 9-15 try even harder though the resulting size may not
1050 be significantly improved.
1051
1052 Level 0 always maps to the default. The compression level does not af‐
1053 fect compatibility.

   Incompressible data
       A simple decision logic handles files with already compressed data,
       or with data that won't compress well within the CPU and memory
       constraints of the kernel implementations. If the first portion of
       data being compressed is not smaller than the original, compression
       of the file is disabled -- unless the filesystem is mounted with
       compress-force. In that case compression will always be attempted on
       the file, only to be discarded later. This is not optimal and is
       subject to optimizations and further development.
1064
       If a file is identified as incompressible, a flag is set (NOCOMPRESS)
       and it's sticky. Compression won't be performed on that file unless
       forced. The flag can also be set by chattr +m (since e2fsprogs
       1.46.2) or by properties with the value no or none. An empty value
       will reset it to the default that's currently applicable on the
       mounted filesystem.
1070
1071 There are two ways to detect incompressible data:
1072
1073 • actual compression attempt - data are compressed, if the result is
1074 not smaller, it's discarded, so this depends on the algorithm and
1075 level
1076
1077 • pre-compression heuristics - a quick statistical evaluation on the
1078 data is performed and based on the result either compression is per‐
1079 formed or skipped, the NOCOMPRESS bit is not set just by the heuris‐
1080 tic, only if the compression algorithm does not make an improvement
1081
1082 $ lsattr file
1083 ---------------------m file
1084
       Forcing compression is not recommended; the heuristics are supposed
       to decide that, and compression algorithms internally detect
       incompressible data too.

   Pre-compression heuristics
       The heuristics aim to do a few quick statistical tests on the data to
       be compressed in order to avoid a probably costly compression that
       would turn out to be inefficient. Compression algorithms could have
       internal detection of incompressible data too, but this leads to more
       overhead as the compression is done in another thread and has to
       write the data anyway. The heuristic is read-only and can utilize
       cached memory.

       The tests performed are based on the following: data sampling, long
       repeated pattern detection, byte frequency, Shannon entropy.
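
       The last test can be illustrated with a simplified Shannon entropy
       computation (just the idea, not the kernel's implementation): data
       close to the maximum of 8 bits of entropy per byte is unlikely to
       compress, while repetitive data scores much lower:

```python
import math
import random
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte: 8.0 looks random, 0.0 is one repeated byte."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Repetitive data: only two distinct bytes -> exactly 1 bit per byte
print(shannon_entropy(b"abab" * 1024))  # 1.0

# Pseudo-random data: close to the 8 bits/byte maximum
random.seed(0)
noise = bytes(random.randrange(256) for _ in range(65536))
print(round(shannon_entropy(noise)))  # 8
```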

   Compatibility
1101 Compression is done using the COW mechanism so it's incompatible with
1102 nodatacow. Direct IO works on compressed files but will fall back to
1103 buffered writes and leads to recompression. Currently nodatasum and
1104 compression don't work together.
1105
1106 The compression algorithms have been added over time so the version
1107 compatibility should be also considered, together with other tools that
1108 may access the compressed data like bootloaders.

SYSFS INTERFACE
1111 Btrfs has a sysfs interface to provide extra knobs.
1112
1113 The top level path is /sys/fs/btrfs/, and the main directory layout is
1114 the following:
1115
1116 ┌─────────────────────────────┬─────────────────────┬─────────┐
1117 │Relative Path │ Description │ Version │
1118 ├─────────────────────────────┼─────────────────────┼─────────┤
1119 │features/ │ All supported fea‐ │ 3.14+ │
1120 │ │ tures │ │
1121 ├─────────────────────────────┼─────────────────────┼─────────┤
1122 │<UUID>/ │ Mounted fs UUID │ 3.14+ │
1123 ├─────────────────────────────┼─────────────────────┼─────────┤
1124 │<UUID>/allocation/ │ Space allocation │ 3.14+ │
1125 │ │ info │ │
1126 ├─────────────────────────────┼─────────────────────┼─────────┤
1127 │<UUID>/features/ │ Features of the │ 3.14+ │
1128 │ │ filesystem │ │
1129 ├─────────────────────────────┼─────────────────────┼─────────┤
1130 │<UUID>/devices/<DE‐ │ Symlink to each │ 5.6+ │
1131 │VID>/ │ block device sysfs │ │
1132 ├─────────────────────────────┼─────────────────────┼─────────┤
1133 │<UUID>/devinfo/<DE‐ │ Btrfs specific info │ 5.6+ │
1134 │VID>/ │ for each device │ │
1135 ├─────────────────────────────┼─────────────────────┼─────────┤
1136 │<UUID>/qgroups/ │ Global qgroup info │ 5.9+ │
1137 ├─────────────────────────────┼─────────────────────┼─────────┤
1138 │<UUID>/qgroups/<LEVEL>_<ID>/ │ Info for each │ 5.9+ │
1139 │ │ qgroup │ │
1140 └─────────────────────────────┴─────────────────────┴─────────┘
1141
1142 For /sys/fs/btrfs/features/ directory, each file means a supported fea‐
1143 ture for the current kernel.
1144
1145 For /sys/fs/btrfs/<UUID>/features/ directory, each file means an en‐
1146 abled feature for the mounted filesystem.
1147
       The features share the same names as in section FILESYSTEM FEATURES.
1149
1150 Files in /sys/fs/btrfs/<UUID>/ directory are:
1151
1152 bg_reclaim_threshold
1153 (RW, since: 5.19)
1154
          Used space percentage of total device space to start automatic
          block group reclaim. Mostly for zoned devices.
1157
1158 checksum
1159 (RO, since: 5.5)
1160
1161 The checksum used for the mounted filesystem. This includes
1162 both the checksum type (see section CHECKSUM ALGORITHMS) and the
1163 implemented driver (mostly shows if it's hardware accelerated).
1164
1165 clone_alignment
1166 (RO, since: 3.16)
1167
1168 The bytes alignment for clone and dedupe ioctls.
1169
1170 commit_stats
1171 (RW, since: 6.0)
1172
1173 The performance statistics for btrfs transaction commit. Mostly
1174 for debug purposes.
1175
1176 Writing into this file will reset the maximum commit duration to
1177 the input value.
1178
1179 exclusive_operation
1180 (RO, since: 5.10)
1181
1182 Shows the running exclusive operation. Check section FILESYSTEM
1183 EXCLUSIVE OPERATIONS for details.
1184
1185 generation
1186 (RO, since: 5.11)
1187
1188 Show the generation of the mounted filesystem.
1189
1190 label (RW, since: 3.14)
1191
1192 Show the current label of the mounted filesystem.
1193
1194 metadata_uuid
1195 (RO, since: 5.0)
1196
1197 Shows the metadata uuid of the mounted filesystem. Check meta‐
1198 data_uuid feature for more details.
1199
1200 nodesize
1201 (RO, since: 3.14)
1202
1203 Show the nodesize of the mounted filesystem.
1204
1205 quota_override
1206 (RW, since: 4.13)
1207
          Shows the current quota override status. 0 means no quota
          override. 1 means quota override is enabled and quota can ignore
          the existing limit settings.
1211
1212 read_policy
1213 (RW, since: 5.11)
1214
1215 Shows the current balance policy for reads. Currently only
1216 "pid" (balance using pid value) is supported.
1217
1218 sectorsize
1219 (RO, since: 3.14)
1220
1221 Shows the sectorsize of the mounted filesystem.
1222
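       The knobs can be read with plain file access; a small helper that
       collects every readable file in one of these directories (the UUID in
       the usage note is illustrative):

```python
import os

def dump_sysfs_dir(path: str) -> dict:
    """Map file name -> stripped contents for readable regular files in path."""
    values = {}
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full) and os.access(full, os.R_OK):
            try:
                with open(full) as f:
                    values[name] = f.read().strip()
            except OSError:
                pass  # transient or write-only entries
    return values

# Example (illustrative UUID):
#   dump_sysfs_dir("/sys/fs/btrfs/0cd65479-.../")
```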
       Files and directories in the /sys/fs/btrfs/<UUID>/allocation
       directory are:
1225
1226 global_rsv_reserved
1227 (RO, since: 3.14)
1228
1229 The used bytes of the global reservation.
1230
1231 global_rsv_size
1232 (RO, since: 3.14)
1233
1234 The total size of the global reservation.
1235
1236 data/, metadata/ and system/ directories
1237 (RO, since: 5.14)
1238
1239 Space info accounting for the 3 chunk types. Mostly for debug
1240 purposes.
1241
       Files in the /sys/fs/btrfs/<UUID>/allocation/{data,metadata,system}
       directories are:
1244
1245 bg_reclaim_threshold
1246 (RW, since: 5.19)
1247
1248 Reclaimable space percentage of block group's size (excluding
1249 permanently unusable space) to reclaim the block group. Can be
1250 used on regular or zoned devices.
1251
1252 chunk_size
1253 (RW, since: 6.0)
1254
1255 Shows the chunk size. Can be changed for data and metadata.
1256 Cannot be set for zoned devices.
1257
1258 Files in /sys/fs/btrfs/<UUID>/devinfo/<DEVID> directory are:
1259
1260 error_stats:
1261 (RO, since: 5.14)
1262
1263 Shows all the history error numbers of the device.
1264
1265 fsid: (RO, since: 5.17)
1266
1267 Shows the fsid which the device belongs to. It can be different
1268 than the <UUID> if it's a seed device.
1269
1270 in_fs_metadata
1271 (RO, since: 5.6)
1272
          Shows whether the device has been found. Should always be 1; if it
          turns to 0, the <DEVID> directory is removed automatically.
1276
1277 missing
1278 (RO, since: 5.6)
1279
1280 Shows whether the device is missing.
1281
1282 replace_target
1283 (RO, since: 5.6)
1284
1285 Shows whether the device is the replace target. If no dev-re‐
1286 place is running, this value should be 0.
1287
1288 scrub_speed_max
1289 (RW, since: 5.14)
1290
1291 Shows the scrub speed limit for this device. The unit is
1292 Bytes/s. 0 means no limit.
1293
1294 writeable
1295 (RO, since: 5.6)
1296
1297 Show if the device is writeable.
1298
1299 Files in /sys/fs/btrfs/<UUID>/qgroups/ directory are:
1300
1301 enabled
1302 (RO, since: 6.1)
1303
          Shows if qgroup is enabled. If qgroup gets disabled, the qgroups
          directory is removed automatically.
1306
1307 inconsistent
1308 (RO, since: 6.1)
1309
1310 Shows if the qgroup numbers are inconsistent. If 1, it's recom‐
1311 mended to do a qgroup rescan.
1312
1313 drop_subtree_threshold
1314 (RW, since: 6.1)
1315
1316 Shows the subtree drop threshold to automatically mark qgroup
1317 inconsistent.
1318
          When dropping large subvolumes with qgroup enabled, there would be
          a huge load for qgroup accounting. If there is a subtree whose
          level is equal to or larger than this value, qgroup accounting is
          not triggered at all; instead the qgroup is marked inconsistent to
          avoid the huge workload.

          The default value is 8, at which no subtree drop can trigger this.

          A lower value can reduce the qgroup workload, at the cost of an
          extra qgroup rescan to recalculate the numbers.
1329
       Files in the /sys/fs/btrfs/<UUID>/qgroups/<LEVEL>_<ID>/ directory
       are:
1331
1332 exclusive
1333 (RO, since: 5.9)
1334
1335 Shows the exclusively owned bytes of the qgroup.
1336
1337 limit_flags
1338 (RO, since: 5.9)
1339
          Shows the numeric value of the limit flags. If 0, no limit is
          applied.
1342
1343 max_exclusive
1344 (RO, since: 5.9)
1345
1346 Shows the limits on exclusively owned bytes.
1347
1348 max_referenced
1349 (RO, since: 5.9)
1350
1351 Shows the limits on referenced bytes.
1352
1353 referenced
1354 (RO, since: 5.9)
1355
1356 Shows the referenced bytes of the qgroup.
1357
1358 rsv_data
1359 (RO, since: 5.9)
1360
1361 Shows the reserved bytes for data.
1362
1363 rsv_meta_pertrans
1364 (RO, since: 5.9)
1365
1366 Shows the reserved bytes for per transaction metadata.
1367
1368 rsv_meta_prealloc
1369 (RO, since: 5.9)
1370
1371 Shows the reserved bytes for preallocated metadata.

FILESYSTEM EXCLUSIVE OPERATIONS
       There are several operations that affect the whole filesystem and
       cannot be run in parallel. An attempt to start one while another is
       running will fail (see exceptions below).

       Since kernel 5.10 the currently running operation can be read from
       /sys/fs/btrfs/UUID/exclusive_operation with the following values and
       operations:
1380
1381 • balance
1382
1383 • balance paused (since 5.17)
1384
1385 • device add
1386
1387 • device delete
1388
1389 • device replace
1390
1391 • resize
1392
1393 • swapfile activate
1394
1395 • none
1396
1397 Enqueuing is supported for several btrfs subcommands so they can be
1398 started at once and then serialized.
1399
       There's an exception: a paused balance allows a device add operation
       to start, as they don't really collide and this can be used to add
       more space for the balance to finish.

FILESYSTEM LIMITS
1405 maximum file name length
1406 255
1407
1408 This limit is imposed by Linux VFS, the structures of BTRFS
1409 could store larger file names.
1410
1411 maximum symlink target length
1412 depends on the nodesize value, for 4KiB it's 3949 bytes, for
1413 larger nodesize it's 4095 due to the system limit PATH_MAX
1414
1415 The symlink target may not be a valid path, i.e. the path name
1416 components can exceed the limits (NAME_MAX), there's no content
1417 validation at symlink(3) creation.
1418
       maximum number of inodes
          2^64, but depends on the available metadata space as the inodes
          are created dynamically
1422
1423 Each subvolume is an independent namespace of inodes and thus
1424 their numbers, so the limit is per subvolume, not for the whole
1425 filesystem.
1426
       inode numbers
          minimum number: 256 (for subvolumes), regular files and
          directories: 257, maximum number: (2^64 - 256)

          The inode numbers that can be assigned to user created files are
          from the whole 64bit space except the first 256 and last 256 in
          that range, which are reserved for internal b-tree identifiers.
1434
       maximum file length
          the inherent limit of BTRFS is 2^64 (16 EiB) but the practical
          limit of Linux VFS is 2^63 (8 EiB)
1438
       maximum number of subvolumes
          the subvolume ids can go up to 2^48 but the number of actual
          subvolumes depends on the available metadata space

          The space consumed by all subvolume metadata, which includes
          bookkeeping of shared extents, can be large (MiB, GiB). The range
          is not the full 64bit range because qgroups use the upper 16 bits
          for other purposes.
1447
1448 maximum number of hardlinks of a file in a directory
1449 65536 when the extref feature is turned on during mkfs (de‐
1450 fault), roughly 100 otherwise
1451
       minimum filesystem size
          the minimal size of each device depends on the mixed-bg feature;
          without that (the default) it's about 109MiB, with mixed-bg it's
          16MiB

BOOTLOADER SUPPORT
1458 GRUB2 (https://www.gnu.org/software/grub) has the most advanced support
1459 of booting from BTRFS with respect to features.
1460
1461 U-boot (https://www.denx.de/wiki/U-Boot/) has decent support for boot‐
1462 ing but not all BTRFS features are implemented, check the documenta‐
1463 tion.
1464
       EXTLINUX (from the https://syslinux.org project) has limited support
       for BTRFS boot and hasn't been updated for a long time, so it is not
       recommended as a bootloader.
1468
       In general, the first 1MiB on each device is unused with the
       exception of the primary superblock, which is at offset 64KiB and
       spans 4KiB. The rest can be freely used by bootloaders or for other
       system information. Note that booting from a filesystem on a zoned
       device is not supported.

FILE ATTRIBUTES
1475 The btrfs filesystem supports setting file attributes or flags. Note
1476 there are old and new interfaces, with confusing names. The following
1477 list should clarify that:
1478
1479 • attributes: chattr(1) or lsattr(1) utilities (the ioctls are
1480 FS_IOC_GETFLAGS and FS_IOC_SETFLAGS), due to the ioctl names the at‐
1481 tributes are also called flags
1482
       • xflags: to distinguish from the previous, these are extended flags,
         with tunable bits similar to the attributes but extensible; new
         bits will be added in the future (the ioctls are FS_IOC_FSGETXATTR
         and FS_IOC_FSSETXATTR but they are not related to extended
         attributes, also called xattrs). There's no standard tool to change
         the bits; there's support in xfs_io(8) as the command xfs_io -c
         chattr
1489
1490 Attributes
1491 a append only, new writes are always written at the end of the
1492 file
1493
1494 A no atime updates
1495
1496 c compress data, all data written after this attribute is set will
1497 be compressed. Please note that compression is also affected by
1498 the mount options or the parent directory attributes.
1499
1500 When set on a directory, all newly created files will inherit
1501 this attribute. This attribute cannot be set with 'm' at the
1502 same time.
1503
1504 C no copy-on-write, file data modifications are done in-place
1505
1506 When set on a directory, all newly created files will inherit
1507 this attribute.
1508
1509 NOTE:
1510 Due to implementation limitations, this flag can be set/unset
1511 only on empty files.
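
       Because the attribute can be changed only on empty files, the usual
       approach for no-CoW files (e.g. for databases or VM images) is to set
       the attribute on a directory first so that new files inherit it (the
       directory name is illustrative):

          # mkdir vm-images
          # chattr +C vm-images
          # touch vm-images/disk.raw
          # lsattr vm-images/disk.raw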
1512
1513 d no dump, makes sense with 3rd party tools like dump(8), on BTRFS
1514 the attribute can be set/unset but no other special handling is
1515 done
1516
1517 D synchronous directory updates, for more details search open(2)
1518 for O_SYNC and O_DSYNC
1519
1520 i immutable, no file data and metadata changes allowed even to the
1521 root user as long as this attribute is set (obviously the excep‐
1522 tion is unsetting the attribute)
1523
1524 m no compression, permanently turn off compression on the given
1525 file. Any compression mount options will not affect this file.
1526 (chattr support added in 1.46.2)
1527
1528 When set on a directory, all newly created files will inherit
1529 this attribute. This attribute cannot be set with c at the same
1530 time.
1531
1532 S synchronous updates, for more details search open(2) for O_SYNC
1533 and O_DSYNC
1534
1535 No other attributes are supported. For the complete list please refer
1536 to the chattr(1) manual page.
1537
1538 XFLAGS
1539 There's overlap of letters assigned to the bits with the attributes,
1540 this list refers to what xfs_io(8) provides:
1541
1542 i immutable, same as the attribute
1543
1544 a append only, same as the attribute
1545
1546 s synchronous updates, same as the attribute S
1547
1548 A no atime updates, same as the attribute
1549
1550 d no dump, same as the attribute

ZONED MODE
       Since version 5.12 btrfs supports so-called zoned mode. This is a
       special on-disk format and allocation/write strategy that's friendly
       to zoned devices. In short, a device is partitioned into fixed-size
       zones and each zone can be updated in an append-only manner, or
       reset. As btrfs has no fixed data structures, except the super
       blocks, the zoned mode only requires block placement that follows the
       device constraints. You can learn about the whole architecture at
       https://zonedstorage.io .
1560
       The devices are also called SMR/ZBC/ZNS, in host-managed mode. Note
       that there are devices that appear as non-zoned but actually are;
       these are drive-managed and using zoned mode won't help.
1564
       The zone size depends on the device; typical sizes are 256MiB or
       1GiB. In general it must be a power of two. Emulated zoned devices
       like null_blk allow setting various zone sizes.
1568
1569 Requirements, limitations
1570 • all devices must have the same zone size
1571
1572 • maximum zone size is 8GiB
1573
1574 • minimum zone size is 4MiB
1575
       • mixing zoned and non-zoned devices is possible; the zone writes are
         then emulated, but this is mainly for testing
1578
       • the super block is handled in a special way and is at different
         locations than on a non-zoned filesystem:
1583
1584 • primary: 0B (and the next two zones)
1585
1586 • secondary: 512GiB (and the next two zones)
1587
1588 • tertiary: 4TiB (4096GiB, and the next two zones)
1589
1590 Incompatible features
1591 The main constraint of the zoned devices is lack of in-place update of
1592 the data. This is inherently incompatible with some features:
1593
1594 • NODATACOW - overwrite in-place, cannot create such files
1595
1596 • fallocate - preallocating space for in-place first write
1597
1598 • mixed-bg - unordered writes to data and metadata, fixing that means
1599 using separate data and metadata block groups
1600
1601 • booting - the zone at offset 0 contains superblock, resetting the
1602 zone would destroy the bootloader data
1603
1604 Initial support lacks some features but they're planned:
1605
1606 • only single profile is supported
1607
1608 • fstrim - due to dependency on free space cache v1
1609
1610 Super block
1611 As said above, super block is handled in a special way. In order to be
1612 crash safe, at least one zone in a known location must contain a valid
1613 superblock. This is implemented as a ring buffer in two consecutive
1614 zones, starting from known offsets 0B, 512GiB and 4TiB.
1615
       The values are different than on non-zoned devices. Each new super
       block is appended to the end of the zone; once it's filled, the zone
       is reset and writes continue in the next one. Looking up the latest
       super block means reading both zones and determining the last written
       version.
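
       The lookup can be sketched as follows: walk the slots of both zones
       and keep the valid copy with the highest generation (the structure
       here is illustrative, not the on-disk format):

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class SuperBlock:
    generation: int   # transaction id, grows with every commit
    valid: bool       # magic and checksum verified

def latest_super(slots: Iterable[Optional[SuperBlock]]) -> Optional[SuperBlock]:
    """Return the newest valid superblock found in the ring buffer slots."""
    best = None
    for sb in slots:
        if sb is None or not sb.valid:
            continue
        if best is None or sb.generation > best.generation:
            best = sb
    return best

# Append-only ring across two zones; empty slots are None:
ring = [SuperBlock(10, True), SuperBlock(11, True), SuperBlock(12, True), None]
print(latest_super(ring).generation)  # 12
```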
1621
1622 The amount of space reserved for super block depends on the zone size.
1623 The secondary and tertiary copies are at distant offsets as the capac‐
1624 ity of the devices is expected to be large, tens of terabytes. Maximum
1625 zone size supported is 8GiB, which would mean that e.g. offset 0-16GiB
1626 would be reserved just for the super block on a hypothetical device of
1627 that zone size. This is wasteful but required to guarantee crash
1628 safety.

CONTROL DEVICE
1631 There's a character special device /dev/btrfs-control with major and
1632 minor numbers 10 and 234 (the device can be found under the 'misc' cat‐
1633 egory).
1634
1635 $ ls -l /dev/btrfs-control
1636 crw------- 1 root root 10, 234 Jan 1 12:00 /dev/btrfs-control
1637
       The device accepts some ioctl calls that can perform the following
       actions on the filesystem module:
1640
1641 • scan devices for btrfs filesystem (i.e. to let multi-device filesys‐
1642 tems mount automatically) and register them with the kernel module
1643
1644 • similar to scan, but also wait until the device scanning process is
1645 finished for a given filesystem
1646
1647 • get the supported features (can be also found under
1648 /sys/fs/btrfs/features)
1649
       The device is created when btrfs is initialized, either as a module
       or as built-in functionality, and makes sense only in connection with
       that. Running e.g. mkfs without the module loaded will not register
       the device and will probably warn about that.
1654
1655 In rare cases when the module is loaded but the device is not present
1656 (most likely accidentally deleted), it's possible to recreate it by
1657
1658 # mknod --mode=600 /dev/btrfs-control c 10 234
1659
1660 or (since 5.11) by a convenience command
1661
1662 # btrfs rescue create-control-device
1663
       The control device is not strictly required, but without it device
       scanning will not work and a workaround would be needed to mount a
       multi-device filesystem. The mount option device can trigger the
       device scanning during mount; see also btrfs device scan.

FILESYSTEMS WITH MULTIPLE BLOCK GROUP PROFILES
1670 It is possible that a btrfs filesystem contains multiple block group
1671 profiles of the same type. This could happen when a profile conversion
1672 using balance filters is interrupted (see btrfs-balance(8)). Some
1673 btrfs commands perform a test to detect this kind of condition and
1674 print a warning like this:
1675
1676 WARNING: Multiple block group profiles detected, see 'man btrfs(5)'.
1677 WARNING: Data: single, raid1
1678 WARNING: Metadata: single, raid1
1679
1680 The corresponding output of btrfs filesystem df might look like:
1681
1682 WARNING: Multiple block group profiles detected, see 'man btrfs(5)'.
1683 WARNING: Data: single, raid1
1684 WARNING: Metadata: single, raid1
1685 Data, RAID1: total=832.00MiB, used=0.00B
1686 Data, single: total=1.63GiB, used=0.00B
1687 System, single: total=4.00MiB, used=16.00KiB
1688 Metadata, single: total=8.00MiB, used=112.00KiB
1689 Metadata, RAID1: total=64.00MiB, used=32.00KiB
1690 GlobalReserve, single: total=16.25MiB, used=0.00B
1691
1692 There's more than one line for type Data and Metadata, while the pro‐
1693 files are single and RAID1.
1694
 1695   This state of the filesystem is OK but most likely needs the user/ad‐
 1696   ministrator to take action and finish the interrupted tasks. This can‐
 1697   not be easily done automatically, as only the user knows the expected
 1698   final profiles.
1699
 1700   In the example above, the filesystem started as a single device and
 1701   single block group profile. Then another device was added, followed by
 1702   a balance with convert=raid1 that for some reason hasn't finished.
 1703   Restarting the balance with convert=raid1 will continue and end up
 1704   with a filesystem with all block group profiles RAID1.
1705
1706 NOTE:
1707 If you're familiar with balance filters, you can use con‐
1708 vert=raid1,profiles=single,soft, which will take only the uncon‐
1709 verted single profiles and convert them to raid1. This may speed up
 1710      the conversion as it would not try to rewrite the already converted
 1711      raid1 block groups.
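
           Such a restart might look like this (the mount point is illus‐
           trative):

              # btrfs balance start -dconvert=raid1,profiles=single,soft /mnt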
1712
1713 Having just one profile is desired as this also clearly defines the
 1714   profile of newly allocated block groups; otherwise this depends on in‐
 1715   ternal allocation policy. When there are multiple profiles present, the
 1716   order of selection is RAID56, RAID10, RAID1, RAID0, as long as the de‐
1717 vice number constraints are satisfied.
1718
 1719   Commands that print the warning were chosen so that the condition is
 1720   brought to the user's attention when the filesystem state is being
 1721   changed in that regard. These are: device add, device delete, balance
 1722   cancel, balance pause, and the commands that report space usage:
 1723   filesystem df, device usage. The command filesystem usage provides a
        line in the overall summary:
1724
1725 Multiple profiles: yes (data, metadata)
1726
1727SEEDING DEVICE
 1728   The COW mechanism and multiple devices under one hood enable an inter‐
 1729   esting concept, called a seeding device: extending a read-only filesys‐
 1730   tem on a device with another device that captures all writes. For exam‐
 1731   ple, imagine an immutable golden image of an operating system enhanced
 1732   with another device that allows using the data from the golden image
 1733   during normal operation. This idea originated with CD-ROMs holding a
 1734   base OS that could be used for live systems, though this has become
 1735   obsolete. There are technologies providing similar functionality, like
 1736   unionmount, overlayfs or qcow2 image snapshots.
1737
 1738   The seeding device starts as a normal filesystem; once the contents
 1739   are ready, btrfstune -S 1 is used to flag it as a seeding device.
 1740   Mounting such a device will not allow any writes, except adding a new
 1741   device by btrfs device add. Then the filesystem can be remounted as
        read-write.
1742
1743 Given that the filesystem on the seeding device is always recognized as
1744 read-only, it can be used to seed multiple filesystems from one device
1745 at the same time. The UUID that is normally attached to a device is au‐
1746 tomatically changed to a random UUID on each mount.
1747
 1748   Once the seeding device is mounted, it needs the writable device. Af‐
 1749   ter adding it, something like mount -o remount,rw /path makes the
 1750   filesystem at /path ready for use. The simplest use case is to throw
 1751   away all changes by unmounting the filesystem when convenient.
1752
1753 Alternatively, deleting the seeding device from the filesystem can turn
1754 it into a normal filesystem, provided that the writable device can also
1755 contain all the data from the seeding device.
1756
 1757   The seeding device flag can be cleared again by btrfstune -f -S 0,
 1758   e.g. allowing it to be updated with newer data, but please note that
 1759   this will invalidate all existing filesystems that use this particular
 1760   seeding device. This works for some use cases but not for others, and
 1761   the forcing flag to the command is mandatory to avoid accidental
        mistakes.
1762
 1763   An example of how to create and use a seeding device:
1764
1765 # mkfs.btrfs /dev/sda
1766 # mount /dev/sda /mnt/mnt1
1767 ... fill mnt1 with data
1768 # umount /mnt/mnt1
1769
1770 # btrfstune -S 1 /dev/sda
1771
1772 # mount /dev/sda /mnt/mnt1
1773 # btrfs device add /dev/sdb /mnt/mnt1
1774 # mount -o remount,rw /mnt/mnt1
1775 ... /mnt/mnt1 is now writable
1776
 1777   Now /mnt/mnt1 can be used normally. The device /dev/sda can be mounted
 1778   again with another writable device:
1779
1780 # mount /dev/sda /mnt/mnt2
1781 # btrfs device add /dev/sdc /mnt/mnt2
1782 # mount -o remount,rw /mnt/mnt2
1783 ... /mnt/mnt2 is now writable
1784
1785 The writable device (/dev/sdb) can be decoupled from the seeding device
1786 and used independently:
1787
1788 # btrfs device delete /dev/sda /mnt/mnt1
1789
 1790   As the contents originated in the seeding device, it's possible to turn
 1791   /dev/sdb into a seeding device again and repeat the whole process.
1792
1793 A few things to note:
1794
 1795   • it's recommended to use only a single device as the seeding device;
 1796     it works for multiple devices, but the single profile must be used
 1797     in order to make the seeding device deletion work
1798
1799 • block group profiles single and dup support the use cases above
1800
1801 • the label is copied from the seeding device and can be changed by
1802 btrfs filesystem label
1803
1804 • each new mount of the seeding device gets a new random UUID
1805
1806 Chained seeding devices
 1807   Though it's not recommended and is rather an obscure and untested use
 1808   case, chaining seeding devices is possible. In the first example, the
 1809   writable device /dev/sdb can be turned into another seeding device,
 1810   depending on the unchanged seeding device /dev/sda. Then, using
 1811   /dev/sdb as the primary seeding device, it can be extended with an‐
 1812   other writable device, say /dev/sdc, and it continues as before as a
 1813   simple tree structure on devices.
1814
1815 # mkfs.btrfs /dev/sda
1816 # mount /dev/sda /mnt/mnt1
1817 ... fill mnt1 with data
1818 # umount /mnt/mnt1
1819
1820 # btrfstune -S 1 /dev/sda
1821
1822 # mount /dev/sda /mnt/mnt1
1823 # btrfs device add /dev/sdb /mnt/mnt1
1824 # mount -o remount,rw /mnt/mnt1
1825 ... /mnt/mnt1 is now writable
1826 # umount /mnt/mnt1
1827
1828 # btrfstune -S 1 /dev/sdb
1829
1830 # mount /dev/sdb /mnt/mnt1
 1831      # btrfs device add /dev/sdc /mnt/mnt1
1832 # mount -o remount,rw /mnt/mnt1
1833 ... /mnt/mnt1 is now writable
1834 # umount /mnt/mnt1
1835
1836 As a result we have:
1837
1838 • sda is a single seeding device, with its initial contents
1839
 1840   • sdb is a seeding device but requires sda; the contents are from the
 1841     time when sdb was made seeding, i.e. the contents of sda with any
 1842     later changes
 1843
 1844   • sdc is the last writable device; it can be made a seeding device the
 1845     same way as sdb was, preserving its contents and depending on sda
        and sdb
1846
1847 As long as the seeding devices are unmodified and available, they can
1848 be used to start another branch.
1849
1850RAID56 STATUS AND RECOMMENDED PRACTICES
 1851   The RAID56 feature provides striping and parity over several devices,
 1852   the same as the traditional RAID5/6. There are some implementation and
 1853   design deficiencies that make it unreliable for some corner cases, and
 1854   the feature should not be used in production, only for evaluation or
 1855   testing. The power failure safety for metadata with RAID56 is not
        100%.
1856
1857 Metadata
1858 Do not use raid5 nor raid6 for metadata. Use raid1 or raid1c3 respec‐
1859 tively.
1860
 1861   The substitute profiles provide the same guarantees against the loss
 1862   of 1 or 2 devices, and in some respects can be an improvement. Recov‐
 1863   ering from one missing device will only need to access the remaining
 1864   1st or 2nd copy, which in general may be stored on some other devices
 1865   due to the way RAID1 works on btrfs, unlike a striped profile (similar
 1866   to raid0) that would need all devices all the time.
1867
 1868   The space allocation pattern and consumption is different (e.g. on N
 1869   devices): for raid5 as an example, a 1GiB chunk is reserved on each de‐
 1870   vice, while with raid1 each 1GiB chunk is stored on 2 devices. The
 1871   consumption of each 1GiB of used metadata is then N * 1GiB for raid5
 1872   vs 2 * 1GiB for raid1. Using raid1 is also more convenient for balanc‐
 1873   ing/converting to another profile due to the lower requirement on the
        available chunk space.
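
        For example, metadata could be converted away from raid5 or raid6
        like this (mount point illustrative; use raid1c3 instead of raid1
        to keep protection against the loss of 2 devices):

           # btrfs balance start -mconvert=raid1 /mnt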
1874
1875 Missing/incomplete support
 1876   When RAID56 is on the same filesystem with different raid profiles, the
 1877   space reporting is inaccurate, e.g. in df, btrfs filesystem df or btrfs
 1878   filesystem usage. When there's only one profile per block group type
 1879   (e.g. RAID5 for data) the reporting is accurate.
1880
 1881   When scrub is started on a RAID56 filesystem, it's started on all de‐
 1882   vices at once, which degrades the performance. The workaround is to
 1883   start it on each device separately. Due to that, the device stats may
 1884   not match the actual state and some errors might get reported multiple
        times.
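
        Starting the scrub on each device separately might look like this
        (device names illustrative):

           # btrfs scrub start /dev/sda
           # btrfs scrub start /dev/sdb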
1885
 1886   The write hole problem. An unclean shutdown could leave a partially
 1887   written stripe in a state where some stripe ranges and the parity are
 1888   from the old writes and some are new. The information about which is
 1889   which is not tracked, and a write journal is not implemented. Alterna‐
 1890   tively, a full read-modify-write would make sure that a full stripe is
 1891   always written, avoiding the write hole completely, but performance in
 1892   that case turned out to be too bad for use.
1893
1894 The striping happens on all available devices (at the time the chunks
1895 were allocated), so in case a new device is added it may not be uti‐
1896 lized immediately and would require a rebalance. A fixed configured
1897 stripe width is not implemented.
1898
1899STORAGE MODEL, HARDWARE CONSIDERATIONS
 1900   Storage model
 1901   A storage model is a model that captures key physical aspects of data
 1902   structure in a data store. A filesystem is the logical structure orga‐
 1903   nizing data on top of the storage device.
1904
1905 The filesystem assumes several features or limitations of the storage
1906 device and utilizes them or applies measures to guarantee reliability.
1907 BTRFS in particular is based on a COW (copy on write) mode of writing,
1908 i.e. not updating data in place but rather writing a new copy to a dif‐
1909 ferent location and then atomically switching the pointers.
1910
1911 In an ideal world, the device does what it promises. The filesystem as‐
1912 sumes that this may not be true so additional mechanisms are applied to
1913 either detect misbehaving hardware or get valid data by other means.
1914 The devices may (and do) apply their own detection and repair mecha‐
1915 nisms but we won't assume any.
1916
1917 The following assumptions about storage devices are considered (sorted
1918 by importance, numbers are for further reference):
1919
1920 1. atomicity of reads and writes of blocks/sectors (the smallest unit
1921 of data the device presents to the upper layers)
1922
1923 2. there's a flush command that instructs the device to forcibly order
1924 writes before and after the command; alternatively there's a barrier
1925 command that facilitates the ordering but may not flush the data
1926
1927 3. data sent to write to a given device offset will be written without
1928 further changes to the data and to the offset
1929
1930 4. writes can be reordered by the device, unless explicitly serialized
1931 by the flush command
1932
1933 5. reads and writes can be freely reordered and interleaved
1934
 1935   The consistency model of BTRFS builds on these assumptions. The logical
 1936   data updates are grouped into a generation, written to the device, se‐
 1937   rialized by the flush command, and then the super block is written,
 1938   ending the generation. All logical links among metadata comprising a
 1939   consistent view of the data must not cross the generation boundary.
1940
1941 When things go wrong
1942 No or partial atomicity of block reads/writes (1)
1943
 1944   • Problem: partial block contents are written (torn write), e.g. due
 1945     to a power glitch or other electronics failure during the read/write
1946
1947 • Detection: checksum mismatch on read
1948
1949 • Repair: use another copy or rebuild from multiple blocks using some
1950 encoding scheme
1951
1952 The flush command does not flush (2)
1953
 1954   This is perhaps the most serious problem, and impossible for the file‐
 1955   system to mitigate without limitations and design restrictions. What
 1956   could happen in the worst case is that writes from one generation bleed
 1957   into another one, while still letting the filesystem consider the gen‐
 1958   erations isolated. A crash at any point would leave data on the device
 1959   in an inconsistent state without any hint of what exactly got written
 1960   and what is missing, leading to stale metadata link information.
1961
1962 Devices usually honor the flush command, but for performance reasons
1963 may do internal caching, where the flushed data are not yet persis‐
1964 tently stored. A power failure could lead to a similar scenario as
1965 above, although it's less likely that later writes would be written be‐
1966 fore the cached ones. This is beyond what a filesystem can take into
1967 account. Devices or controllers are usually equipped with batteries or
1968 capacitors to write the cache contents even after power is cut. (Bat‐
1969 tery backed write cache)
1970
1971 Data get silently changed on write (3)
1972
 1973   Such a thing should not happen frequently, but still can happen spuri‐
 1974   ously due to the complex internal workings of devices or physical ef‐
 1975   fects of the storage media itself.
1976
1977 • Problem: while the data are written atomically, the contents get
1978 changed
1979
1980 • Detection: checksum mismatch on read
1981
1982 • Repair: use another copy or rebuild from multiple blocks using some
1983 encoding scheme
1984
1985 Data get silently written to another offset (3)
1986
 1987   This would be another serious problem as the filesystem has no informa‐
 1988   tion when it happens. For that reason the measures have to be taken
 1989   ahead of time. This problem is also commonly called ghost write.
1990
 1991   The metadata blocks have the checksum embedded in the blocks, so a cor‐
 1992   rect atomic write would not corrupt the checksum. It's likely that af‐
 1993   ter reading such a block the data inside would not be consistent with
 1994   the rest. To rule that out, the block number is also embedded in the
 1995   metadata block. It's the logical block number because this is what the
 1996   logical structure expects and verifies.
1997
1998 The following is based on information publicly available, user feed‐
1999 back, community discussions or bug report analyses. It's not complete
2000 and further research is encouraged when in doubt.
2001
2002 Main memory
2003 The data structures and raw data blocks are temporarily stored in com‐
2004 puter memory before they get written to the device. It is critical that
2005 memory is reliable because even simple bit flips can have vast conse‐
2006 quences and lead to damaged structures, not only in the filesystem but
2007 in the whole operating system.
2008
2009 Based on experience in the community, memory bit flips are more common
2010 than one would think. When it happens, it's reported by the
2011 tree-checker or by a checksum mismatch after reading blocks. There are
2012 some very obvious instances of bit flips that happen, e.g. in an or‐
2013 dered sequence of keys in metadata blocks. We can easily infer from the
2014 other data what values get damaged and how. However, fixing that is not
2015 straightforward and would require cross-referencing data from the en‐
2016 tire filesystem to see the scope.
2017
2018 If available, ECC memory should lower the chances of bit flips, but
2019 this type of memory is not available in all cases. A memory test should
2020 be performed in case there's a visible bit flip pattern, though this
2021 may not detect a faulty memory module because the actual load of the
2022 system could be the factor making the problems appear. In recent years
 2023   attacks on how memory modules operate have been demonstrated (rowham‐
 2024   mer), achieving flips of specific bits. While these were targeted,
 2025   this shows that a series of reads or writes can affect unrelated
 2026   parts of memory.
2027
2028 Further reading:
2029
2030 • https://en.wikipedia.org/wiki/Row_hammer
2031
2032 What to do:
2033
 2034   • run memtest; note that sometimes memory errors happen only when the
 2035     system is under a heavy load that the default memtest cannot trigger
2036
 2037   • memory errors may appear as the filesystem going read-only due to the
 2038     "pre-write" check, which verifies metadata before it gets written but
 2039     fails some basic consistency checks
2040
2041 Direct memory access (DMA)
2042 Another class of errors is related to DMA (direct memory access) per‐
2043 formed by device drivers. While this could be considered a software er‐
2044 ror, the data transfers that happen without CPU assistance may acciden‐
2045 tally corrupt other pages. Storage devices utilize DMA for performance
2046 reasons, the filesystem structures and data pages are passed back and
 2047   forth, making errors possible in case page lifetime is not properly
 2048   tracked.
2049
2050 There are lots of quirks (device-specific workarounds) in Linux kernel
2051 drivers (regarding not only DMA) that are added when found. The quirks
2052 may avoid specific errors or disable some features to avoid worse prob‐
2053 lems.
2054
2055 What to do:
2056
2057 • use up-to-date kernel (recent releases or maintained long term sup‐
2058 port versions)
2059
2060 • as this may be caused by faulty drivers, keep the systems up-to-date
2061
2062 Rotational disks (HDD)
2063 Rotational HDDs typically fail at the level of individual sectors or
2064 small clusters. Read failures are caught on the levels below the
2065 filesystem and are returned to the user as EIO - Input/output error.
 2066   Reading the blocks repeatedly may return the data eventually, but this
 2067   is better done by specialized tools and the filesystem takes the result
 2068   of the lower layers. Rewriting the sectors may trigger internal remap‐
 2069   ping, but this inevitably leads to data loss.
2070
2071 Disk firmware is technically software but from the filesystem perspec‐
2072 tive is part of the hardware. IO requests are processed, and caching or
2073 various other optimizations are performed, which may lead to bugs under
2074 high load or unexpected physical conditions or unsupported use cases.
2075
2076 Disks are connected by cables with two ends, both of which can cause
2077 problems when not attached properly. Data transfers are protected by
2078 checksums and the lower layers try hard to transfer the data correctly
 2079   or not at all. The errors from badly-connected cables may manifest as
 2080   a large amount of failed read or write requests, or as short error
 2081   bursts depending on physical conditions.
2082
2083 What to do:
2084
2085 • check smartctl for potential issues
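
        For example, to print the device attributes and start a long
        self-test (device name illustrative):

           # smartctl -a /dev/sda
           # smartctl -t long /dev/sda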
2086
2087 Solid state drives (SSD)
2088 The mechanism of information storage is different from HDDs and this
 2089   affects the failure mode as well. The data are stored in cells grouped
 2090   in large blocks with a limited number of resets and other write con‐
 2091   straints. The firmware tries to avoid unnecessary resets and performs
 2092   optimizations to maximize the storage media lifetime. The known tech‐
 2093   niques are deduplication (blocks with the same fingerprint/hash are
 2094   mapped to the same physical block), compression or internal remapping
 2095   and garbage collection of used memory cells. Due to the additional
 2096   processing there are measures to verify the data, e.g. by ECC codes.
2097
 2098   The observations of failing SSDs show that the whole electronics fail
 2099   at once or affect a lot of data (e.g. stored on one chip). Recovering
 2100   such data may need specialized equipment, and reading data repeatedly
 2101   does not help the way it does with HDDs.
2102
2103 There are several technologies of the memory cells with different char‐
2104 acteristics and price. The lifetime is directly affected by the type
2105 and frequency of data written. Writing "too much" distinct data (e.g.
2106 encrypted) may render the internal deduplication ineffective and lead
2107 to a lot of rewrites and increased wear of the memory cells.
2108
 2109   There are several technologies and manufacturers so it's hard to de‐
 2110   scribe them generally, but some exhibit similar behaviour:
2111
 2112   • an expensive SSD will use more durable memory cells and is optimized
 2113     for reliability and high load
 2114
 2115   • a cheap SSD is designed for a lower load ("desktop user") and is op‐
 2116     timized for cost; it may employ the optimizations and/or extended
 2117     error reporting partially or not at all
2118
2119 It's not possible to reliably determine the expected lifetime of an SSD
2120 due to lack of information about how it works or due to lack of reli‐
2121 able stats provided by the device.
2122
2123 Metadata writes tend to be the biggest component of lifetime writes to
2124 a SSD, so there is some value in reducing them. Depending on the device
2125 class (high end/low end) the features like DUP block group profiles may
2126 affect the reliability in both ways:
2127
2128 • high end are typically more reliable and using single for data and
2129 metadata could be suitable to reduce device wear
2130
2131 • low end could lack ability to identify errors so an additional redun‐
2132 dancy at the filesystem level (checksums, DUP) could help
2133
2134 Only users who consume 50 to 100% of the SSD's actual lifetime writes
2135 need to be concerned by the write amplification of btrfs DUP metadata.
2136 Most users will be far below 50% of the actual lifetime, or will write
2137 the drive to death and discover how many writes 100% of the actual
2138 lifetime was. SSD firmware often adds its own write multipliers that
2139 can be arbitrary and unpredictable and dependent on application behav‐
2140 ior, and these will typically have far greater effect on SSD lifespan
 2141   than DUP metadata. It's more or less impossible to predict when an SSD
 2142   will run out of lifetime writes to within a factor of two, so it's hard
 2143   to justify wear reduction as a benefit.
2144
2145 Further reading:
2146
2147 • https://www.snia.org/educational-library/ssd-and-deduplication-end-spinning-disk-2012
2148
2149 • https://www.snia.org/educational-library/realities-solid-state-storage-2013-2013
2150
2151 • https://www.snia.org/educational-library/ssd-performance-primer-2013
2152
2153 • https://www.snia.org/educational-library/how-controllers-maximize-ssd-life-2013
2154
2155 What to do:
2156
2157 • run smartctl or self-tests to look for potential issues
2158
2159 • keep the firmware up-to-date
2160
2161 NVM express, non-volatile memory (NVMe)
2162 NVMe is a type of persistent memory usually connected over a system bus
2163 (PCIe) or similar interface and the speeds are an order of magnitude
 2164   faster than an SSD. It is also a non-rotating type of storage, and is
 2165   not typically connected by a cable. It's not a SCSI type device either
 2166   but rather a complete specification for a logical device interface.
2167
2168 In a way the errors could be compared to a combination of SSD class and
2169 regular memory. Errors may exhibit as random bit flips or IO failures.
2170 There are tools to access the internal log (nvme log and nvme-cli) for
2171 a more detailed analysis.
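
        For example, using nvme-cli (device name illustrative):

           # nvme smart-log /dev/nvme0
           # nvme error-log /dev/nvme0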
2172
 2173   There are separate error detection and correction steps performed,
 2174   e.g. on the bus level, in most cases never making it to the filesystem
 2175   level. Once an error does, it could mean there's some systematic prob‐
 2176   lem like overheating or a bad physical connection of the device. You
 2177   may want to run self-tests (using smartctl).
2178
2179 • https://en.wikipedia.org/wiki/NVM_Express
2180
2181 • https://www.smartmontools.org/wiki/NVMe_Support
2182
2183 Drive firmware
 2184   Firmware is technically still software, but embedded into the hardware.
 2185   As all software has bugs, so does firmware. Storage devices can update
 2186   the firmware and fix known bugs. In some cases it's possible to avoid
 2187   certain bugs by quirks (device-specific workarounds) in the Linux
 2188   kernel.
2189
2190 A faulty firmware can cause wide range of corruptions from small and
2191 localized to large affecting lots of data. Self-repair capabilities may
2192 not be sufficient.
2193
2194 What to do:
2195
 2196   • check for firmware updates in case there are known problems; note
 2197     that updating firmware can be risky in itself
2198
2199 • use up-to-date kernel (recent releases or maintained long term sup‐
2200 port versions)
2201
2202 SD flash cards
2203 There are a lot of devices with low power consumption and thus using
2204 storage media based on low power consumption too, typically flash mem‐
2205 ory stored on a chip enclosed in a detachable card package. An improp‐
2206 erly inserted card may be damaged by electrical spikes when the device
2207 is turned on or off. The chips storing data in turn may be damaged per‐
 2208   manently. All types of flash memory have a limited number of rewrites,
 2209   so the data are internally translated by the FTL (flash translation
 2210   layer). This is implemented in firmware (technically software) and is
 2211   prone to bugs that manifest as hardware errors.
2212
2213 Adding redundancy like using DUP profiles for both data and metadata
2214 can help in some cases but a full backup might be the best option once
2215 problems appear and replacing the card could be required as well.
2216
2217 Hardware as the main source of filesystem corruptions
2218 If you use unreliable hardware and don't know about that, don't blame
2219 the filesystem when it tells you.
2220
2221SEE ALSO
 2222   acl(5), btrfs(8), chattr(1), fstrim(8), ioctl(2), mkfs.btrfs(8),
2223 mount(8), swapon(8)
2224
2225
2226
2227
22286.1.3 Jan 25, 2023 BTRFS(5)