1 MD(4) Kernel Interfaces Manual MD(4)
2
3
4
5 NAME
6 md - Multiple Device driver aka Linux Software RAID
7
8 SYNOPSIS
9 /dev/mdn
10 /dev/md/n
11 /dev/md/name
12
13 DESCRIPTION
14 The md driver provides virtual devices that are created from one or
15 more independent underlying devices. This array of devices often con‐
16 tains redundancy and the devices are often disk drives, hence the acro‐
17 nym RAID which stands for a Redundant Array of Independent Disks.
18
19 md supports RAID levels 1 (mirroring), 4 (striped array with parity de‐
20 vice), 5 (striped array with distributed parity information), 6
21 (striped array with distributed dual redundancy information), and 10
22 (striped and mirrored). If some number of underlying devices fails
23 while using one of these levels, the array will continue to function;
24 this number is one for RAID levels 4 and 5, two for RAID level 6, and
25 all but one (N-1) for RAID level 1, and dependent on configuration for
26 level 10.
27
28 md also supports a number of pseudo RAID (non-redundant) configurations
29 including RAID0 (striped array), LINEAR (catenated array), MULTIPATH (a
30 set of different interfaces to the same device), and FAULTY (a layer
31 over a single device into which errors can be injected).
32
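As an illustration of these levels, arrays are normally created with mdadm(8); a minimal sketch, with /dev/md0 and the component devices chosen purely as examples:

      # RAID1 mirror of two devices
      mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

      # RAID5 over three devices
      mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1

      # the state of all running arrays is summarised in /proc/mdstat
      cat /proc/mdstat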
33
34 MD METADATA
35 Each device in an array may have some metadata stored in the device.
36 This metadata is sometimes called a superblock. The metadata records
37 information about the structure and state of the array. This allows
38 the array to be reliably re-assembled after a shutdown.
39
40 From Linux kernel version 2.6.10, md provides support for two different
41 formats of metadata, and other formats can be added. Prior to this re‐
42 lease, only one format was supported.
43
44 The common format — known as version 0.90 — has a superblock that is 4K
45 long and is written into a 64K aligned block that starts at least 64K
46 and less than 128K from the end of the device (i.e. to get the address
47 of the superblock round the size of the device down to a multiple of
48 64K and then subtract 64K). The available size of each device is the
49 amount of space before the super block, so between 64K and 128K is lost
50 when a device is incorporated into an MD array. This superblock stores
51 multi-byte fields in a processor-dependent manner, so arrays cannot
52 easily be moved between computers with different processors.
53
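As a sketch of that arithmetic, the 0.90 superblock offset for a component device (here the purely illustrative /dev/sdb1) can be computed from its size in a shell:

      SIZE=$(blockdev --getsize64 /dev/sdb1)
      # round the size down to a multiple of 64K, then step back one 64K block
      SB_OFFSET=$(( SIZE / 65536 * 65536 - 65536 ))
      echo $SB_OFFSET
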
54 The new format — known as version 1 — has a superblock that is normally
55 1K long, but can be longer. It is normally stored between 8K and 12K
56 from the end of the device, on a 4K boundary, though variations can be
57 stored at the start of the device (version 1.1) or 4K from the start of
58 the device (version 1.2). This metadata format stores multibyte data
59 in a processor-independent format and supports up to hundreds of compo‐
60 nent devices (version 0.90 only supports 28).
61
62 The metadata contains, among other things:
63
64 LEVEL The manner in which the devices are arranged into the array
65 (LINEAR, RAID0, RAID1, RAID4, RAID5, RAID10, MULTIPATH).
66
67 UUID a 128 bit Universally Unique Identifier that identifies the ar‐
68 ray that contains this device.
69
70
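The metadata stored on a component device can be inspected with mdadm(8); for example (the device name is illustrative):

      mdadm --examine /dev/sda1
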
71 When a version 0.90 array is being reshaped (e.g. adding extra devices
72 to a RAID5), the version number is temporarily set to 0.91. This en‐
73 sures that if the reshape process is stopped in the middle (e.g. by a
74 system crash) and the machine boots into an older kernel that does not
75 support reshaping, then the array will not be assembled (which would
76 cause data corruption) but will be left untouched until a kernel that
77 can complete the reshape process is used.
78
79
80 ARRAYS WITHOUT METADATA
81 While it is usually best to create arrays with superblocks so that they
82 can be assembled reliably, there are some circumstances when an array
83 without superblocks is preferred. These include:
84
85 LEGACY ARRAYS
86 Early versions of the md driver only supported LINEAR and RAID0
87 configurations and did not use a superblock (which is less crit‐
88 ical with these configurations). While such arrays should be
89 rebuilt with superblocks if possible, md continues to support
90 them.
91
92 FAULTY Being a largely transparent layer over a different device, the
93 FAULTY personality doesn't gain anything from having a su‐
94 perblock.
95
96 MULTIPATH
97 It is often possible to detect devices which are different paths
98 to the same storage directly rather than having a distinctive
99 superblock written to the device and searched for on all paths.
100 In this case, a MULTIPATH array with no superblock makes sense.
101
102 RAID1 In some configurations it might be desired to create a RAID1
103 configuration that does not use a superblock, and to maintain
104 the state of the array elsewhere. While not encouraged for gen‐
105 eral use, it does have special-purpose uses and is supported.
106
107
108 ARRAYS WITH EXTERNAL METADATA
109 From release 2.6.28, the md driver supports arrays with externally man‐
110 aged metadata. That is, the metadata is not managed by the kernel but
111 rather by a user-space program which is external to the kernel. This
112 allows support for a variety of metadata formats without cluttering the
113 kernel with lots of details.
114
115 md is able to communicate with the user-space program through various
116 sysfs attributes so that it can make appropriate changes to the meta‐
117 data - for example to mark a device as faulty. When necessary, md will
118 wait for the program to acknowledge the event by writing to a sysfs at‐
119 tribute. The manual page for mdmon(8) contains more detail about this
120 interaction.
121
122
123 CONTAINERS
124 Many metadata formats use a single block of metadata to describe a num‐
125 ber of different arrays which all use the same set of devices. In this
126 case it is helpful for the kernel to know about the full set of devices
127 as a whole. This set is known to md as a container. A container is an
128 md array with externally managed metadata and with device offset and
129 size so that it just covers the metadata part of the devices. The re‐
130 mainder of each device is available to be incorporated into various ar‐
131 rays.
132
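As an illustration, this is how mdadm(8) is typically used with the IMSM external metadata format: a container is created first, and arrays are then created inside it (all device names are examples):

      # create the container that holds the external metadata
      mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=2 /dev/sda /dev/sdb

      # create a RAID1 volume inside the container
      mdadm --create /dev/md/vol0 --level=1 --raid-devices=2 /dev/md/imsm0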
133
134 LINEAR
135 A LINEAR array simply catenates the available space on each drive to
136 form one large virtual drive.
137
138 One advantage of this arrangement over the more common RAID0 arrange‐
139 ment is that the array may be reconfigured at a later time with an ex‐
140 tra drive, so the array is made bigger without disturbing the data that
141 is on the array. This can even be done on a live array.
142
143 If a chunksize is given with a LINEAR array, the usable space on each
144 device is rounded down to a multiple of this chunksize.
145
146
147 RAID0
148 A RAID0 array (which has zero redundancy) is also known as a striped
149 array. A RAID0 array is configured at creation with a Chunk Size which
150 must be a power of two (prior to Linux 2.6.31), and at least 4
151 kibibytes.
152
153 The RAID0 driver assigns the first chunk of the array to the first de‐
154 vice, the second chunk to the second device, and so on until all drives
155 have been assigned one chunk. This collection of chunks forms a
156 stripe. Further chunks are gathered into stripes in the same way, and
157 are assigned to the remaining space in the drives.
158
159 If devices in the array are not all the same size, then once the small‐
160 est device has been exhausted, the RAID0 driver starts collecting
161 chunks into smaller stripes that only span the drives which still have
162 remaining space.
163
164 A bug was introduced in Linux 3.14 which changed the layout of blocks
165 in a RAID0 beyond the region that is striped over all devices. This
166 bug does not affect an array with all devices the same size, but can
167 affect other RAID0 arrays.
168
169 Linux 5.4 (and some stable kernels to which the change was backported)
170 will not normally assemble such an array as it cannot know which layout
171 to use. There is a module parameter "raid0.default_layout" which can
172 be set to "1" to force the kernel to use the pre-3.14 layout or to "2"
173 to force it to use the 3.14-and-later layout. When creating a new
174 RAID0 array, mdadm will record the chosen layout in the metadata in a
175 way that allows newer kernels to assemble the array without needing a
176 module parameter.
177
178 To assemble an old array on a new kernel without using the module pa‐
179 rameter, use either the --update=layout-original option or the --up‐
180 date=layout-alternate option.
181
182 Once you have updated the layout you will not be able to mount the ar‐
183 ray on an older kernel. If you need to revert to an older kernel, the
184 layout information can be erased with the --update=layout-unspecified
185 option. If you use this option to --assemble while running a newer
186 kernel, the array will NOT assemble, but the metadata will be updated so
187 that it can be assembled on an older kernel.
188
189 Note that setting the layout to "unspecified" removes protections against
190 this bug, and you must be sure that the kernel you use matches the lay‐
191 out of the array.
192
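As a sketch, an affected multi-zone RAID0 array could be handled in either of these ways (device names are illustrative):

      # one-off: tell the kernel which layout to assume, on the kernel command line
      #     raid0.default_layout=1        (pre-3.14 layout)
      #     raid0.default_layout=2        (3.14-and-later layout)

      # or record the layout in the metadata while assembling
      mdadm --assemble /dev/md0 --update=layout-original /dev/sda1 /dev/sdb1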
193
194 RAID1
195 A RAID1 array is also known as a mirrored set (though mirrors tend to
196 provide reflected images, which RAID1 does not) or a plex.
197
198 Once initialised, each device in a RAID1 array contains exactly the
199 same data. Changes are written to all devices in parallel. Data is
200 read from any one device. The driver attempts to distribute read re‐
201 quests across all devices to maximise performance.
202
203 All devices in a RAID1 array should be the same size. If they are not,
204 then only the amount of space available on the smallest device is used
205 (any extra space on other devices is wasted).
206
207 Note that the read balancing done by the driver does not make the RAID1
208 performance profile be the same as for RAID0; a single stream of se‐
209 quential input will not be accelerated (e.g. a single dd), but multiple
210 sequential streams or a random workload will use more than one spindle.
211 In theory, having an N-disk RAID1 will allow N sequential threads to
212 read from all disks.
213
214 Individual devices in a RAID1 can be marked as "write-mostly". These
215 drives are excluded from the normal read balancing and will only be
216 read from when there is no other option. This can be useful for de‐
217 vices connected over a slow link.
218
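For example, a device can be flagged as write-mostly when the array is created (device names are illustrative):

      mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 --write-mostly /dev/sdb1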
219
220 RAID4
221 A RAID4 array is like a RAID0 array with an extra device for storing
222 parity. This device is the last of the active devices in the array. Un‐
223 like RAID0, RAID4 also requires that all stripes span all drives, so
224 extra space on devices that are larger than the smallest is wasted.
225
226 When any block in a RAID4 array is modified, the parity block for that
227 stripe (i.e. the block in the parity device at the same device offset
228 as the stripe) is also modified so that the parity block always con‐
229 tains the "parity" for the whole stripe. I.e. its content is equiva‐
230 lent to the result of performing an exclusive-or operation between all
231 the data blocks in the stripe.
232
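For example, in a four-device RAID4 a stripe holds three data blocks D1, D2 and D3 plus the parity block P:

      P  = D1 xor D2 xor D3
      D2 = P  xor D1 xor D3      (recovering D2 after its device fails)
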
233 This allows the array to continue to function if one device fails. The
234 data that was on that device can be calculated as needed from the par‐
235 ity block and the other data blocks.
236
237
238 RAID5
239 RAID5 is very similar to RAID4. The difference is that the parity
240 blocks for each stripe, instead of being on a single device, are dis‐
241 tributed across all devices. This allows more parallelism when writ‐
242 ing, as two different block updates will quite possibly affect parity
243 blocks on different devices so there is less contention.
244
245 This also allows more parallelism when reading, as read requests are
246 distributed over all the devices in the array instead of all but one.
247
248
249 RAID6
250 RAID6 is similar to RAID5, but can handle the loss of any two devices
251 without data loss. Accordingly, it requires N+2 drives to store N
252 drives worth of data.
253
254 The performance for RAID6 is slightly lower but comparable to RAID5 in
255 normal mode and single disk failure mode. It is very slow in dual disk
256 failure mode, however.
257
258
259 RAID10
260 RAID10 provides a combination of RAID1 and RAID0, and is sometimes
261 known as RAID1+0. Every datablock is duplicated some number of times,
262 and the resulting collection of datablocks is distributed over multi‐
263 ple drives.
264
265 When configuring a RAID10 array, it is necessary to specify the number
266 of replicas of each data block that are required (this will usually
267 be 2) and whether their layout should be "near", "far" or "offset"
268 (with "offset" being available since Linux 2.6.18).
269
270 About the RAID10 Layout Examples:
271 The examples below visualise the chunk distribution on the underlying
272 devices for the respective layout.
273
274 For simplicity it is assumed that the size of the chunks equals the
275 size of the blocks of the underlying devices as well as those of the
276 RAID10 device exported by the kernel (for example /dev/md/name).
277 Therefore the chunks / chunk numbers map directly to the blocks / block
278 addresses of the exported RAID10 device.
279
280 Decimal numbers (0, 1, 2, ...) are the chunks of the RAID10 and due to
281 the above assumption also the blocks and block addresses of the ex‐
282 ported RAID10 device.
283 Repeated numbers mean copies of a chunk / block (obviously on different
284 underlying devices).
285 Hexadecimal numbers (0x00, 0x01, 0x02, ...) are the block addresses of
286 the underlying devices.
287
288
289 "near" Layout
290 When "near" replicas are chosen, the multiple copies of a given
291 chunk are laid out consecutively ("as close to each other as
292 possible") across the stripes of the array.
293
294 With an even number of devices, they will likely (unless some
295 misalignment is present) lie at the very same offset on the dif‐
296 ferent devices.
297 This is the same as the "classic" RAID1+0; that is, two groups of mirrored
298 devices (in the example below the groups Device #1 / #2 and De‐
299 vice #3 / #4 are each a RAID1) both in turn forming a striped
300 RAID0.
301
302 Example with 2 copies per chunk and an even number (4) of de‐
303 vices:
304
305        ┌───────────┬───────────┬───────────┬───────────┐
306        │ Device #1 │ Device #2 │ Device #3 │ Device #4 │
307 ┌─────┼───────────┼───────────┼───────────┼───────────┤
308 │0x00 │     0     │     0     │     1     │     1     │
309 │0x01 │     2     │     2     │     3     │     3     │
310 │...  │    ...    │    ...    │    ...    │    ...    │
311 │  :  │     :     │     :     │     :     │     :     │
312 │...  │    ...    │    ...    │    ...    │    ...    │
313 │0x80 │    254    │    254    │    255    │    255    │
314 └─────┴───────────┴───────────┴───────────┴───────────┘
315          \---------v---------/   \---------v---------/
316                  RAID1                   RAID1
317          \---------------------v---------------------/
318                               RAID0
319
320 Example with 2 copies per chunk and an odd number (5) of de‐
321 vices:
322
323        ┌────────┬────────┬────────┬────────┬────────┐
324        │ Dev #1 │ Dev #2 │ Dev #3 │ Dev #4 │ Dev #5 │
325 ┌─────┼────────┼────────┼────────┼────────┼────────┤
326 │0x00 │   0    │   0    │   1    │   1    │   2    │
327 │0x01 │   2    │   3    │   3    │   4    │   4    │
328 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │
329 │  :  │   :    │   :    │   :    │   :    │   :    │
330 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │
331 │0x80 │  317   │  318   │  318   │  319   │  319   │
332 └─────┴────────┴────────┴────────┴────────┴────────┘
333
334
335
336 "far" Layout
337 When "far" replicas are chosen, the multiple copies of a given
338 chunk are laid out quite distant ("as far as reasonably possi‐
339 ble") from each other.
340
341 First a complete sequence of all data blocks (that is all the
342 data one sees on the exported RAID10 block device) is striped
343 over the devices. Then another (though "shifted") complete se‐
344 quence of all data blocks; and so on (in the case of more than
345 2 copies per chunk).
346
347 The "shift" needed to prevent placing copies of the same chunks
348 on the same devices is actually a cyclic permutation with off‐
349 set 1 of each of the stripes within a complete sequence of
350 chunks.
351 The offset 1 is relative to the previous complete sequence of
352 chunks, so in case of more than 2 copies per chunk one gets the
353 following offsets:
354 1. complete sequence of chunks: offset = 0
355 2. complete sequence of chunks: offset = 1
356 3. complete sequence of chunks: offset = 2
357 :
358 n. complete sequence of chunks: offset = n-1
359
360 Example with 2 copies per chunk and an even number (4) of de‐
361 vices:
362
363        ┌───────────┬───────────┬───────────┬───────────┐
364        │ Device #1 │ Device #2 │ Device #3 │ Device #4 │
365 ┌─────┼───────────┼───────────┼───────────┼───────────┤
366 │0x00 │     0     │     1     │     2     │     3     │  \
367 │0x01 │     4     │     5     │     6     │     7     │  > [#]
368 │...  │    ...    │    ...    │    ...    │    ...    │  :
369 │  :  │     :     │     :     │     :     │     :     │  :
370 │...  │    ...    │    ...    │    ...    │    ...    │  :
371 │0x40 │    252    │    253    │    254    │    255    │  /
372 │0x41 │     3     │     0     │     1     │     2     │  \
373 │0x42 │     7     │     4     │     5     │     6     │  > [#]~
374 │...  │    ...    │    ...    │    ...    │    ...    │  :
375 │  :  │     :     │     :     │     :     │     :     │  :
376 │...  │    ...    │    ...    │    ...    │    ...    │  :
377 │0x80 │    255    │    252    │    253    │    254    │  /
378 └─────┴───────────┴───────────┴───────────┴───────────┘
379
380 Example with 2 copies per chunk and an odd number (5) of de‐
381 vices:
382
383        ┌────────┬────────┬────────┬────────┬────────┐
384        │ Dev #1 │ Dev #2 │ Dev #3 │ Dev #4 │ Dev #5 │
385 ┌─────┼────────┼────────┼────────┼────────┼────────┤
386 │0x00 │   0    │   1    │   2    │   3    │   4    │  \
387 │0x01 │   5    │   6    │   7    │   8    │   9    │  > [#]
388 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │  :
389 │  :  │   :    │   :    │   :    │   :    │   :    │  :
390 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │  :
391 │0x40 │  315   │  316   │  317   │  318   │  319   │  /
392 │0x41 │   4    │   0    │   1    │   2    │   3    │  \
393 │0x42 │   9    │   5    │   6    │   7    │   8    │  > [#]~
394 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │  :
395 │  :  │   :    │   :    │   :    │   :    │   :    │  :
396 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │  :
397 │0x80 │  319   │  315   │  316   │  317   │  318   │  /
398 └─────┴────────┴────────┴────────┴────────┴────────┘
399
400 With [#] being the complete sequence of chunks and [#]~ the
401 cyclic permutation with offset 1 thereof (in the case of more
402 than 2 copies per chunk there would be
403 ([#]~)~, (([#]~)~)~, ...).
404
405 The advantage of this layout is that MD can easily spread se‐
406 quential reads over the devices, making them similar to RAID0 in
407 terms of speed.
408 The cost is more seeking for writes, making them substantially
409 slower.
410
411
412 "offset" Layout
413 When "offset" replicas are chosen, all the copies of a given
414 chunk are striped consecutively ("offset by the stripe length
415 after each other") over the devices.
416
417 Explained in detail, <number of devices> consecutive chunks are
418 striped over the devices, immediately followed by a "shifted"
419 copy of these chunks (and by further such "shifted" copies in
420 the case of more than 2 copies per chunk).
421 This pattern repeats for all further consecutive chunks of the
422 exported RAID10 device (in other words: all further data
423 blocks).
424
425 The "shift" needed to prevent placing copies of the same chunks
426 on the same devices is actually a cyclic permutation with off‐
427 set 1 of each of the striped copies of <number of devices> con‐
428 secutive chunks.
429 The offset 1 is relative to the previous striped copy of <number
430 of devices> consecutive chunks, so in case of more than 2 copies
431 per chunk one gets the following offsets:
432 1. <number of devices> consecutive chunks: offset = 0
433 2. <number of devices> consecutive chunks: offset = 1
434 3. <number of devices> consecutive chunks: offset = 2
435 :
436 n. <number of devices> consecutive chunks: offset = n-1
437
438 Example with 2 copies per chunk and an even number (4) of de‐
439 vices:
440
441        ┌───────────┬───────────┬───────────┬───────────┐
442        │ Device #1 │ Device #2 │ Device #3 │ Device #4 │
443 ┌─────┼───────────┼───────────┼───────────┼───────────┤
444 │0x00 │     0     │     1     │     2     │     3     │  ) AA
445 │0x01 │     3     │     0     │     1     │     2     │  ) AA~
446 │0x02 │     4     │     5     │     6     │     7     │  ) AB
447 │0x03 │     7     │     4     │     5     │     6     │  ) AB~
448 │...  │    ...    │    ...    │    ...    │    ...    │  ) ...
449 │  :  │     :     │     :     │     :     │     :     │    :
450 │...  │    ...    │    ...    │    ...    │    ...    │  ) ...
451 │0x79 │    251    │    252    │    253    │    254    │  ) EX
452 │0x80 │    254    │    251    │    252    │    253    │  ) EX~
453 └─────┴───────────┴───────────┴───────────┴───────────┘
454
455 Example with 2 copies per chunk and an odd number (5) of de‐
456 vices:
457
458        ┌────────┬────────┬────────┬────────┬────────┐
459        │ Dev #1 │ Dev #2 │ Dev #3 │ Dev #4 │ Dev #5 │
460 ┌─────┼────────┼────────┼────────┼────────┼────────┤
461 │0x00 │   0    │   1    │   2    │   3    │   4    │  ) AA
462 │0x01 │   4    │   0    │   1    │   2    │   3    │  ) AA~
463 │0x02 │   5    │   6    │   7    │   8    │   9    │  ) AB
464 │0x03 │   9    │   5    │   6    │   7    │   8    │  ) AB~
465 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │  ) ...
466 │  :  │   :    │   :    │   :    │   :    │   :    │    :
467 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │  ) ...
468 │0x79 │  314   │  315   │  316   │  317   │  318   │  ) EX
469 │0x80 │  318   │  314   │  315   │  316   │  317   │  ) EX~
470 └─────┴────────┴────────┴────────┴────────┴────────┘
471
472 With AA, AB, ..., AZ, BA, ... being the sets of <number of de‐
473 vices> consecutive chunks and AA~, AB~, ..., AZ~, BA~, ... the
474 cyclic permutations with offset 1 thereof (in the case of more
475 than 2 copies per chunk there would be (AA~)~, ... as well as
476 ((AA~)~)~, ... and so on).
477
478 This should give similar read characteristics to "far" if a
479 suitably large chunk size is used, but without as much seeking
480 for writes.
481
482 It should be noted that the number of devices in a RAID10 array need
483 not be a multiple of the number of replicas of each data block; however,
484 there must be at least as many devices as replicas.
485
486 If, for example, an array is created with 5 devices and 2 replicas,
487 then space equivalent to 2.5 of the devices will be available, and ev‐
488 ery block will be stored on two different devices.
489
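For example, mdadm(8) expresses the layout and replica count as a single --layout value, a letter ("n", "f" or "o") followed by the number of copies (device names are illustrative):

      # 2 "near" copies over 4 devices
      mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=4 /dev/sd[abcd]1

      # the same devices with 2 "far" copies
      mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 /dev/sd[abcd]1
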
490 Finally, it is possible to have an array with both "near" and "far"
491 copies. If an array is configured with 2 near copies and 2 far copies,
492 then there will be a total of 4 copies of each block, each on a differ‐
493 ent drive. This is an artifact of the implementation and is unlikely
494 to be of real value.
495
496
497 MULTIPATH
498 MULTIPATH is not really a RAID at all as there is only one real device
499 in a MULTIPATH md array. However there are multiple access points
500 (paths) to this device, and one of these paths might fail, so there are
501 some similarities.
502
503 A MULTIPATH array is composed of a number of logically different de‐
504 vices, often fibre channel interfaces, that all refer to the same real
505 device. If one of these interfaces fails (e.g. due to cable problems),
506 the MULTIPATH driver will attempt to redirect requests to another in‐
507 terface.
508
509 The MULTIPATH driver is not receiving any ongoing development and should
510 be considered a legacy driver. The device-mapper based multipath driv‐
511 ers should be preferred for new installations.
512
513
514 FAULTY
515 The FAULTY md module is provided for testing purposes. A FAULTY array
516 has exactly one component device and is normally assembled without a
517 superblock, so the md array created provides direct access to all of
518 the data in the component device.
519
520 The FAULTY module may be requested to simulate faults to allow testing
521 of other md levels or of filesystems. Faults can be chosen to trigger
522 on read requests or write requests, and can be transient (a subsequent
523 read/write at the address will probably succeed) or persistent (subse‐
524 quent read/write of the same address will fail). Further, read faults
525 can be "fixable" meaning that they persist until a write request at the
526 same address.
527
528 Fault types can be requested with a period. In this case, the fault
529 will recur repeatedly after the given number of requests of the rele‐
530 vant type. For example if persistent read faults have a period of 100,
531 then every 100th read request would generate a fault, and the faulty
532 sector would be recorded so that subsequent reads on that sector would
533 also fail.
534
535 There is a limit to the number of faulty sectors that are remembered.
536 Faults generated after this limit is exhausted are treated as tran‐
537 sient.
538
539 The list of faulty sectors can be flushed, and the active list of fail‐
540 ure modes can be cleared.
541
542
543 UNCLEAN SHUTDOWN
544 When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array
545 there is a possibility of inconsistency for short periods of time as
546 each update requires at least two blocks to be written to different de‐
547 vices, and these writes probably won't happen at exactly the same time.
548 Thus if a system with one of these arrays is shutdown in the middle of
549 a write operation (e.g. due to power failure), the array may not be
550 consistent.
551
552 To handle this situation, the md driver marks an array as "dirty" be‐
553 fore writing any data to it, and marks it as "clean" when the array is
554 being disabled, e.g. at shutdown. If the md driver finds an array to
555 be dirty at startup, it proceeds to correct any possible inconsistency.
556 For RAID1, this involves copying the contents of the first drive onto
557 all other drives. For RAID4, RAID5 and RAID6 this involves recalculat‐
558 ing the parity for each stripe and making sure that the parity block
559 has the correct data. For RAID10 it involves copying one of the repli‐
560 cas of each block onto all the others. This process, known as "resyn‐
561 chronising" or "resync" is performed in the background. The array can
562 still be used, though possibly with reduced performance.
563
564 If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
565 drive, two for RAID6) when it is restarted after an unclean shutdown,
566 it cannot recalculate parity, and so it is possible that data might be
567 undetectably corrupted. The 2.4 md driver does not alert the operator
568 to this condition. The 2.6 md driver will fail to start an array in
569 this condition without manual intervention, though this behaviour can
570 be overridden by a kernel parameter.
571
572
573 RECOVERY
574 If the md driver detects a write error on a device in a RAID1, RAID4,
575 RAID5, RAID6, or RAID10 array, it immediately disables that device
576 (marking it as faulty) and continues operation on the remaining de‐
577 vices. If there are spare drives, the driver will start recreating on
578 one of the spare drives the data which was on that failed drive, either
579 by copying a working drive in a RAID1 configuration, or by doing calcu‐
580 lations with the parity block on RAID4, RAID5 or RAID6, or by finding
581 and copying originals for RAID10.
582
583 In kernels prior to about 2.6.15, a read error would cause the same ef‐
584 fect as a write error. In later kernels, a read-error will instead
585 cause md to attempt a recovery by overwriting the bad block. i.e. it
586 will find the correct data from elsewhere, write it over the block that
587 failed, and then try to read it back again. If either the write or the
588 re-read fail, md will treat the error the same way that a write error
589 is treated, and will fail the whole device.
590
591 While this recovery process is happening, the md driver will monitor
592 accesses to the array and will slow down the rate of recovery if other
593 activity is happening, so that normal access to the array will not be
594 unduly affected. When no other activity is happening, the recovery
595 process proceeds at full speed. The actual speed targets for the two
596 different situations can be controlled by the speed_limit_min and
597 speed_limit_max control files mentioned below.
598
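For example, on an otherwise idle system the limits can be raised so that a recovery finishes sooner (values are illustrative, in KiB/s per device):

      echo 50000  > /proc/sys/dev/raid/speed_limit_min
      echo 500000 > /proc/sys/dev/raid/speed_limit_max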
599
600 SCRUBBING AND MISMATCHES
601 As storage devices can develop bad blocks at any time it is valuable to
602 regularly read all blocks on all devices in an array so as to catch
603 such bad blocks early. This process is called scrubbing.
604
605 md arrays can be scrubbed by writing either check or repair to the file
606 md/sync_action in the sysfs directory for the device.
607
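For example, assuming the array is /dev/md0, a scrub can be started and its outcome inspected like this:

      echo check > /sys/block/md0/md/sync_action
      cat /sys/block/md0/md/sync_action       # reports the running operation
      cat /sys/block/md0/md/mismatch_cnt      # inspect after the scrub finishes
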
608 Requesting a scrub will cause md to read every block on every device in
609 the array, and check that the data is consistent. For RAID1 and
610 RAID10, this means checking that the copies are identical. For RAID4,
611 RAID5, RAID6 this means checking that the parity block is (or blocks
612 are) correct.
613
614 If a read error is detected during this process, the normal read-error
615 handling causes correct data to be found from other devices and to be
616 written back to the faulty device. In many cases this will effectively
617 fix the bad block.
618
619 If all blocks read successfully but are found to not be consistent,
620 then this is regarded as a mismatch.
621
622 If check was used, then no action is taken to handle the mismatch, it
623 is simply recorded. If repair was used, then a mismatch will be re‐
624 paired in the same way that resync repairs arrays. For RAID5/RAID6 new
625 parity blocks are written. For RAID1/RAID10, all but one block are
626 overwritten with the content of that one block.
627
628 A count of mismatches is recorded in the sysfs file md/mismatch_cnt.
629 This is set to zero when a scrub starts and is incremented whenever a
630 sector is found that is a mismatch. md normally works in units much
631 larger than a single sector and when it finds a mismatch, it does not
632 determine exactly how many actual sectors were affected but simply adds
633 the number of sectors in the IO unit that was used. So a value of 128
634 could simply mean that a single 64KB check found an error (128 x
635 512bytes = 64KB).
636
637 If an array is created by mdadm with --assume-clean then a subsequent
638 check could be expected to find some mismatches.
639
640 On a truly clean RAID5 or RAID6 array, any mismatches should indicate a
641 hardware problem at some level - software issues should never cause
642 such a mismatch.
643
644 However on RAID1 and RAID10 it is possible for software issues to cause
645 a mismatch to be reported. This does not necessarily mean that the
646 data on the array is corrupted. It could simply be that the system
647 does not care what is stored on that part of the array - it is unused
648 space.
649
650 The most likely cause for an unexpected mismatch on RAID1 or RAID10 oc‐
651 curs if a swap partition or swap file is stored on the array.
652
653 When the swap subsystem wants to write a page of memory out, it flags
654 the page as 'clean' in the memory manager and requests the swap device
655 to write it out. It is quite possible that the memory will be changed
656 while the write-out is happening. In that case the 'clean' flag will
657 be found to be clear when the write completes and so the swap subsystem
658 will simply forget that the swapout had been attempted, and will possi‐
659 bly choose a different page to write out.
660
661 If the swap device was on RAID1 (or RAID10), then the data is sent from
662 memory to a device twice (or more depending on the number of devices in
663 the array). Thus it is possible that the memory gets changed between
664 the times it is sent, so different data can be written to the different
665 devices in the array. This will be detected by check as a mismatch.
666 However it does not reflect any corruption as the block where this mis‐
667 match occurs is being treated by the swap system as being empty, and
668 the data will never be read from that block.
669
670 It is conceivable for a similar situation to occur on non-swap files,
671 though it is less likely.
672
673 Thus the mismatch_cnt value can not be interpreted very reliably on
674 RAID1 or RAID10, especially when the device is used for swap.
675
676
677
678 BITMAP WRITE-INTENT LOGGING
679 From Linux 2.6.13, md supports a bitmap based write-intent log. If
680 configured, the bitmap is used to record which blocks of the array may
681 be out of sync. Before any write request is honoured, md will make
682 sure that the corresponding bit in the log is set. After a period of
683 time with no writes to an area of the array, the corresponding bit will
684 be cleared.
685
686 This bitmap is used for two optimisations.
687
688 Firstly, after an unclean shutdown, the resync process will consult the
689 bitmap and only resync those blocks that correspond to bits in the bit‐
690 map that are set. This can dramatically reduce resync time.
691
692 Secondly, when a drive fails and is removed from the array, md stops
693 clearing bits in the intent log. If that same drive is re-added to the
694 array, md will notice and will only recover the sections of the drive
695 that are covered by bits in the intent log that are set. This can al‐
696 low a device to be temporarily removed and reinserted without causing
697 an enormous recovery cost.
698
699 The intent log can be stored in a file on a separate device, or it can
700 be stored near the superblocks of an array which has superblocks.
701
702 It is possible to add an intent log to an active array, or remove an
703 intent log if one is present.
704
705 In 2.6.13, intent bitmaps are only supported with RAID1. Other levels
706 with redundancy are supported from 2.6.15.
707
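For example, an internal write-intent bitmap can be added to, or removed from, an active array with mdadm(8) (the array name is illustrative):

      mdadm --grow /dev/md0 --bitmap=internal
      mdadm --grow /dev/md0 --bitmap=none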
708
709 BAD BLOCK LIST
710 From Linux 3.5 each device in an md array can store a list of known-
711 bad-blocks. This list is 4K in size and usually positioned at the end
712 of the space between the superblock and the data.
713
714 When a block cannot be read and cannot be repaired by writing data re‐
715 covered from other devices, the address of the block is stored in the
716 bad block list. Similarly if an attempt to write a block fails, the
717 address will be recorded as a bad block. If attempting to record the
718 bad block fails, the whole device will be marked faulty.
719
720 Attempting to read from a known bad block will cause a read error. At‐
721 tempting to write to a known bad block will be ignored if any write er‐
722 rors have been reported by the device. If there have been no write er‐
723 rors then the data will be written to the known bad block and if that
724 succeeds, the address will be removed from the list.
725
726 This allows an array to fail more gracefully - a few blocks on differ‐
727 ent devices can be faulty without taking the whole array out of action.
728
729 The list is particularly useful when recovering to a spare. If a few
730 blocks cannot be read from the other devices, the bulk of the recovery
731 can complete and those few bad blocks will be recorded in the bad block
732 list.
733
734
735 RAID WRITE HOLE
736 Due to the non-atomic nature of RAID write operations, interruption
737 of a write (system crash, etc.) to a RAID456 array can lead to in‐
738 consistent parity and data loss (the so-called RAID-5 write hole).
739 To plug the write hole, md supports the two mechanisms described below.
740
741
742 DIRTY STRIPE JOURNAL
743 From Linux 4.4, md supports a write-ahead journal for RAID456.
744 When the array is created, an additional journal device can be
745 added to the array through the write-journal option. The RAID
746 write journal works similarly to file system journals. Before
747 writing to the data disks, md persists data AND parity of the
748 stripe to the journal device. After a crash, md searches the
749 journal device for incomplete write operations, and replays them
750 to the data disks.
751
752 When the journal device fails, the RAID array is forced to run
753 in read-only mode.
754
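For example, the journal device is supplied at creation time with the --write-journal option of mdadm(8) (device names are illustrative; the journal device should be fast and reliable):

      mdadm --create /dev/md0 --level=5 --raid-devices=3 \
            --write-journal=/dev/nvme0n1p1 /dev/sda1 /dev/sdb1 /dev/sdc1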
755
756 PARTIAL PARITY LOG
757 From Linux 4.12 md supports Partial Parity Log (PPL) for RAID5
758 arrays only. Partial parity for a write operation is the XOR of
759 stripe data chunks not modified by the write. PPL is stored in
760 the metadata region of RAID member drives, no additional journal
761 drive is needed. After crashes, if one of the not modified data
762 disks of the stripe is missing, this updated parity can be used
763 to recover its data.
764
765 This mechanism is documented more fully in the file Documenta‐
766 tion/md/raid5-ppl.rst
767
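As a sketch, PPL is selected through the consistency policy option of recent mdadm(8) versions (device names are illustrative):

      mdadm --create /dev/md0 --level=5 --raid-devices=3 \
            --consistency-policy=ppl /dev/sda1 /dev/sdb1 /dev/sdc1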
768
769 WRITE-BEHIND
770 From Linux 2.6.14, md supports WRITE-BEHIND on RAID1 arrays.
771
772 This allows certain devices in the array to be flagged as write-mostly.
773 MD will only read from such devices if there is no other option.
774
775 If a write-intent bitmap is also provided, write requests to write-
776 mostly devices will be treated as write-behind requests and md will not
777 wait for those writes to complete before reporting the
778 write as complete to the filesystem.
779
780 This allows for a RAID1 with WRITE-BEHIND to be used to mirror data
781 over a slow link to a remote computer (providing the link isn't too
782 slow). The extra latency of the remote link will not slow down normal
783 operations, but the remote system will still have a reasonably up-to-
784 date copy of all data.
785
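As a sketch, such a mirror combines a write-intent bitmap, a write-behind limit and a write-mostly device (device names are illustrative; /dev/sdb1 stands for the remote, slower device):

      mdadm --create /dev/md0 --level=1 --raid-devices=2 \
            --bitmap=internal --write-behind=256 \
            /dev/sda1 --write-mostly /dev/sdb1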
786
787 FAILFAST
788 From Linux 4.10, md supports FAILFAST for RAID1 and RAID10 arrays.
789 This is a flag that can be set on individual drives, though it is usu‐
790 ally set on all drives, or no drives.
791
792 When md sends an I/O request to a drive that is marked as FAILFAST, and
793 when the array could survive the loss of that drive without losing
794 data, md will request that the underlying device does not perform any
795 retries. This means that a failure will be reported to md promptly,
796 and it can mark the device as faulty and continue using the other de‐
797 vice(s). md cannot control the timeout that the underlying devices use
798 to determine failure. Any changes desired to that timeout must be set
799 explicitly on the underlying device, separately from using mdadm.
800
801 If a FAILFAST request does fail, and if it is still safe to mark the
802 device as faulty without data loss, that will be done and the array
803 will continue functioning on a reduced number of devices. If it is not
804 possible to safely mark the device as faulty, md will retry the request
805 without disabling retries in the underlying device. In any case, md
806 will not attempt to repair read errors on a device marked as FAILFAST
807 by writing out the correct data. It will just mark the device as faulty.
808
809 FAILFAST is appropriate for storage arrays that have a low probability
810 of true failure, but will sometimes introduce unacceptable delays to
811 I/O requests while performing internal maintenance. The value of set‐
812 ting FAILFAST involves a trade-off. The gain is that the chance of un‐
813 acceptable delays is substantially reduced. The cost is that the un‐
814 likely event of data-loss on one device is slightly more likely to re‐
815 sult in data-loss for the array.
816
817 When a device in an array using FAILFAST is marked as faulty, it will
818 usually become usable again in a short while. mdadm makes no attempt
819 to detect that possibility. Some separate mechanism, tuned to the spe‐
820 cific details of the expected failure modes, needs to be created to
821 monitor devices to see when they return to full functionality, and to
822 then re-add them to the array. In order for this "re-add" functionality
823 to be effective, an array using FAILFAST should always have a write-in‐
824 tent bitmap.
825
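As a sketch, devices can be marked FAILFAST when the array is created, while the device's own command timeout is adjusted separately through the block layer (paths and values are illustrative):

      mdadm --create /dev/md0 --level=1 --raid-devices=2 --failfast /dev/sda1 /dev/sdb1

      # the drive's own timeout is outside md's control, e.g. for a SCSI/SATA disk:
      echo 5 > /sys/block/sda/device/timeout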
826
827 RESTRIPING
828 Restriping, also known as Reshaping, is the process of re-arranging
829 the data stored in each stripe into a new layout. This might involve
830 changing the number of devices in the array (so the stripes are wider),
831 changing the chunk size (so stripes are deeper or shallower), or chang‐
832 ing the arrangement of data and parity (possibly changing the RAID
833 level, e.g. 1 to 5 or 5 to 6).
834
835 As of Linux 2.6.35, md can reshape a RAID4, RAID5, or RAID6 array to
836 have a different number of devices (more or fewer) and to have a dif‐
837 ferent layout or chunk size. It can also convert between these differ‐
838 ent RAID levels. It can also convert between RAID0 and RAID10, and be‐
839 tween RAID0 and RAID4 or RAID5. Other possibilities may follow in fu‐
840 ture kernels.
841
842 During any restripe process there is a 'critical section' during which
843 live data is being overwritten on disk. For the operation of increas‐
844 ing the number of drives in a RAID5, this critical section covers the
845 first few stripes (the number being the product of the old and new num‐
846 ber of devices). After this critical section is passed, data is only
847 written to areas of the array which no longer hold live data — the live
848 data has already been relocated elsewhere.
849
850 For a reshape which reduces the number of devices, the 'critical sec‐
851 tion' is at the end of the reshape process.
852
853 md is not able to ensure data preservation if there is a crash (e.g.
854 power failure) during the critical section. If md is asked to start an
855 array which failed during a critical section of restriping, it will
856 fail to start the array.
857
858 To deal with this possibility, a user-space program must
859
860 • Disable writes to that section of the array (using the sysfs inter‐
861 face),
862
863 • take a copy of the data somewhere (i.e. make a backup),
864
865 • allow the process to continue and invalidate the backup and restore
866 write access once the critical section is passed, and
867
868 • provide for restoring the critical data before restarting the array
869 after a system crash.
870
871 mdadm versions from 2.4 do this for growing a RAID5 array.
872
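For example, growing a RAID5 from 4 to 5 devices looks like this (a sketch; the backup file protects the critical section and must not be stored on the array being reshaped):

      mdadm --add /dev/md0 /dev/sde1
      mdadm --grow /dev/md0 --raid-devices=5 --backup-file=/root/md0-grow.bak
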
873 For operations that do not change the size of the array, like simply
874 increasing chunk size, or converting RAID5 to RAID6 with one extra de‐
875 vice, the entire process is the critical section. In this case, the
876 restripe will need to progress in stages, as a section is suspended,
877 backed up, restriped, and released.
878
879
880 SYSFS INTERFACE
881 Each block device appears as a directory in sysfs (which is usually
882 mounted at /sys). For MD devices, this directory will contain a subdi‐
883 rectory called md which contains various files for providing access to
884 information about the array.
885
886 This interface is documented more fully in the file Documentation/ad‐
887 min-guide/md.rst which is distributed with the kernel sources. That
888 file should be consulted for full documentation. The following are
889 just a selection of attribute files that are available.
890
891
892 md/sync_speed_min
893 This value, if set, overrides the system-wide setting in
894 /proc/sys/dev/raid/speed_limit_min for this array only. Writing
895 the value "system" to this file will cause the system-wide setting
896 to have effect.
897
898
899 md/sync_speed_max
900 This is the partner of md/sync_speed_min and overrides
901 /proc/sys/dev/raid/speed_limit_max described below.
902
903
904 md/sync_action
905 This can be used to monitor and control the resync/recovery
906 process of MD. In particular, writing "check" here will cause
907 the array to read all data blocks and check that they are consis‐
908 tent (e.g. parity is correct, or all mirror replicas are the
909 same). Any discrepancies found are NOT corrected.
910
911 A count of problems found will be stored in md/mismatch_cnt.
912
913 Alternately, "repair" can be written which will cause the same
914 check to be performed, but any errors will be corrected.
915
916 Finally, "idle" can be written to stop the check/repair process.
917
918
919 md/stripe_cache_size
920 This is only available on RAID5 and RAID6. It records the size
921 (in pages per device) of the stripe cache which is used for
922 synchronising all write operations to the array and all read op‐
923 erations if the array is degraded. The default is 256. Valid
924 values are 17 to 32768. Increasing this number can increase
925 performance in some situations, at some cost in system memory.
926 Note, setting this value too high can result in an "out of mem‐
927 ory" condition for the system.
928
929 memory_consumed = system_page_size * nr_disks *
930 stripe_cache_size
931
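For example, with 4 KiB pages, a 4-device array and a stripe_cache_size of 4096, the cache consumes 4 KiB * 4 * 4096 = 64 MiB. The value is changed by writing to the attribute (the array name is illustrative):

      echo 4096 > /sys/block/md0/md/stripe_cache_size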
932
933 md/preread_bypass_threshold
934 This is only available on RAID5 and RAID6. This variable sets
935 the number of times MD will service a full-stripe-write before
936 servicing a stripe that requires some "prereading". For fair‐
937 ness this defaults to 1. Valid values are 0 to
938 stripe_cache_size. Setting this to 0 maximizes sequential-write
939 throughput at the cost of fairness to threads doing small or
940 random writes.
941
942
943 md/bitmap/backlog
944 The value stored in this file only has any effect on RAID1 when
945 write-mostly devices are active, and write requests to those de‐
946 vices are processed in the background.
947
948 This variable sets a limit on the number of concurrent back‐
949 ground writes. The valid values are 0 to 16383; 0 means that
950 write-behind is not allowed, while any other number means it can
951 happen. If there are more write requests than this number, new
952 writes will be synchronous.
953
954
955 md/bitmap/can_clear
956 This is for externally managed bitmaps, where the kernel writes
957 the bitmap itself, but metadata describing the bitmap is managed
958 by mdmon or similar.
959
960 When the array is degraded, bits mustn't be cleared. When the
961 array becomes optimal again, bits can be cleared, but first the
962 metadata needs to record the current event count. So md sets
963 this to 'false' and notifies mdmon, then mdmon updates the meta‐
964 data and writes 'true'.
965
966 There is no code in mdmon to actually do this, so maybe it
967 doesn't even work.
968
969
970 md/bitmap/chunksize
971 The bitmap chunksize can only be changed when no bitmap is ac‐
972 tive, and the value should be a power of 2 and at least 512.
973
974
975 md/bitmap/location
976 This indicates where the write-intent bitmap for the array is
977 stored. It can be "none" or "file" or a signed offset from the
978 array metadata - measured in sectors. You cannot set a file by
979 writing here - that can only be done with the SET_BITMAP_FILE
980 ioctl.
981
982 Writing 'none' to 'bitmap/location' will clear the bitmap, and
983 the previous location value must be written back to restore it.
984
985
986 md/bitmap/max_backlog_used
987 This keeps track of the maximum number of concurrent write-be‐
988 hind requests for an md array, writing any value to this file
989 will clear it.
990
991
992 md/bitmap/metadata
993 This can be 'internal' or 'clustered' or 'external'. 'internal'
994 is set by default, which means the metadata for bitmap is stored
995 in the first 256 bytes of the bitmap space. 'clustered' means
996 separate bitmap metadata are used for each cluster node. 'exter‐
997 nal' means that bitmap metadata is managed externally to the
998 kernel.
999
1000
1001 md/bitmap/space
1002 This shows the space (in sectors) which is available at md/bit‐
1003 map/location, and allows the kernel to know when it is safe to
1004 resize the bitmap to match a resized array. It should be big enough
1005 to contain the total bytes in the bitmap.
1006
1007 For 1.0 metadata, the bitmap can use space up to the superblock
1008 if it is stored before the superblock, else up to 4K beyond the
1009 superblock. For other metadata versions, assume no change is possible.
1010
1011
1012 md/bitmap/time_base
1013 This shows the time (in seconds) between disk flushes, and is
1014 used when looking for bits in the bitmap to be cleared.
1015
1016 The default value is 5 seconds, and it should be an unsigned
1017 long value.
1018
1019
1020 KERNEL PARAMETERS
1021 The md driver recognises several different kernel parameters.
1022
1023 raid=noautodetect
1024 This will disable the normal detection of md arrays that happens
1025 at boot time. If a drive is partitioned with MS-DOS style par‐
1026 titions, then if any of the 4 main partitions has a partition
1027 type of 0xFD, then that partition will normally be inspected to
1028 see if it is part of an MD array, and if any full arrays are
1029 found, they are started. This kernel parameter disables this
1030 behaviour.
1031
1032
1033 raid=partitionable
1034
1035 raid=part
1036 These are available in 2.6 and later kernels only. They indi‐
1037 cate that autodetected MD arrays should be created as partition‐
1038 able arrays, with a different major device number to the origi‐
1039 nal non-partitionable md arrays. The device number is listed as
1040 mdp in /proc/devices.
1041
1042
1043 md_mod.start_ro=1
1044
1045 /sys/module/md_mod/parameters/start_ro
1046 This tells md to start all arrays in read-only mode. This is a
1047 soft read-only that will automatically switch to read-write on
1048 the first write request. However until that write request,
1049 nothing is written to any device by md, and in particular, no
1050 resync or recovery operation is started.
1051
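For example, the parameter can be given on the kernel command line, set persistently as a module option, or changed at runtime (the configuration file name is illustrative):

      # kernel command line
      md_mod.start_ro=1

      # module option, e.g. in /etc/modprobe.d/md.conf
      options md_mod start_ro=1

      # at runtime
      echo 1 > /sys/module/md_mod/parameters/start_ro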
1052
1053 md_mod.start_dirty_degraded=1
1054
1055 /sys/module/md_mod/parameters/start_dirty_degraded
1056 As mentioned above, md will not normally start a RAID4, RAID5,
1057 or RAID6 that is both dirty and degraded as this situation can
1058 imply hidden data loss. This can be awkward if the root
1059 filesystem is affected. Using this module parameter allows such
1060 arrays to be started at boot time. It should be understood that
1061 there is a real (though small) risk of data corruption in this
1062 situation.
1063
1064
1065 md=n,dev,dev,...
1066
1067 md=dn,dev,dev,...
1068 This tells the md driver to assemble /dev/md n from the listed
1069 devices. It is only necessary to start the device holding the
1070 root filesystem this way. Other arrays are best started once
1071 the system is booted.
1072
1073 In 2.6 kernels, the d immediately after the = indicates that a
1074 partitionable device (e.g. /dev/md/d0) should be created rather
1075 than the original non-partitionable device.
1076
1077
1078 md=n,l,c,i,dev...
1079 This tells the md driver to assemble a legacy RAID0 or LINEAR
1080 array without a superblock. n gives the md device number, l
1081 gives the level, 0 for RAID0 or -1 for LINEAR, c gives the chunk
1082 size as a base-2 logarithm offset by twelve, so 0 means 4K, 1
1083 means 8K. i is ignored (legacy support).
1084
1085
1087 /proc/mdstat
1088 Contains information about the status of currently running ar‐
1089 rays.
1090
1091 /proc/sys/dev/raid/speed_limit_min
1092 A readable and writable file that reflects the current "goal"
1093 rebuild speed for times when non-rebuild activity is current on
1094 an array. The speed is in Kibibytes per second, and is a per-
1095 device rate, not a per-array rate (which means that an array
1096 with more disks will shuffle more data for a given speed). The
1097 default is 1000.
1098
1099
1100 /proc/sys/dev/raid/speed_limit_max
1101 A readable and writable file that reflects the current "goal"
1102 rebuild speed for times when no non-rebuild activity is current
1103 on an array. The default is 200,000.
1104
1105
1106 SEE ALSO
1107 mdadm(8),
1108
1109
1110
1111 MD(4)