1 MD(4) Kernel Interfaces Manual MD(4)
2
3
4
5 NAME
6 md - Multiple Device driver aka Linux Software RAID
7
8 SYNOPSIS
9 /dev/mdn
10 /dev/md/n
11 /dev/md/name
12
13 DESCRIPTION
14 The md driver provides virtual devices that are created from one or
15 more independent underlying devices. This array of devices often con‐
16 tains redundancy and the devices are often disk drives, hence the acro‐
17 nym RAID which stands for a Redundant Array of Independent Disks.
18
19 md supports RAID levels 1 (mirroring), 4 (striped array with parity de‐
20 vice), 5 (striped array with distributed parity information), 6
21 (striped array with distributed dual redundancy information), and 10
22 (striped and mirrored). If some number of underlying devices fails
23 while using one of these levels, the array will continue to function;
24 this number is one for RAID levels 4 and 5, two for RAID level 6, and
25 all but one (N-1) for RAID level 1, and dependent on configuration for
26 level 10.
27
28 md also supports a number of pseudo RAID (non-redundant) configurations
29 including RAID0 (striped array), LINEAR (catenated array), MULTIPATH (a
30 set of different interfaces to the same device), and FAULTY (a layer
31 over a single device into which errors can be injected).
32
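As an illustration of these levels, arrays are normally created with mdadm(8); a minimal sketch, with /dev/md0 and the component devices chosen purely as examples:

      # RAID1 mirror of two devices
      mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

      # RAID5 over three devices
      mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1

      # the state of all running arrays is summarised in /proc/mdstat
      cat /proc/mdstat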
33
34 MD METADATA
35 Each device in an array may have some metadata stored in the device.
36 This metadata is sometimes called a superblock. The metadata records
37 information about the structure and state of the array. This allows
38 the array to be reliably re-assembled after a shutdown.
39
40 From Linux kernel version 2.6.10, md provides support for two different
41 formats of metadata, and other formats can be added. Prior to this re‐
42 lease, only one format was supported.
43
44 The common format — known as version 0.90 — has a superblock that is 4K
45 long and is written into a 64K aligned block that starts at least 64K
46 and less than 128K from the end of the device (i.e. to get the address
47 of the superblock round the size of the device down to a multiple of
48 64K and then subtract 64K). The available size of each device is the
49 amount of space before the super block, so between 64K and 128K is lost
50 when a device is incorporated into an MD array. This superblock stores
51 multi-byte fields in a processor-dependent manner, so arrays cannot
52 easily be moved between computers with different processors.
53
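As a sketch of that arithmetic, the 0.90 superblock offset for a component device (here the purely illustrative /dev/sdb1) can be computed from its size in a shell:

      SIZE=$(blockdev --getsize64 /dev/sdb1)
      # round the size down to a multiple of 64K, then step back one 64K block
      SB_OFFSET=$(( SIZE / 65536 * 65536 - 65536 ))
      echo $SB_OFFSET
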
54 The new format — known as version 1 — has a superblock that is normally
55 1K long, but can be longer. It is normally stored between 8K and 12K
56 from the end of the device, on a 4K boundary, though variations can be
57 stored at the start of the device (version 1.1) or 4K from the start of
58 the device (version 1.2). This metadata format stores multibyte data
59 in a processor-independent format and supports up to hundreds of compo‐
60 nent devices (version 0.90 only supports 28).
61
62 The metadata contains, among other things:
63
64 LEVEL The manner in which the devices are arranged into the array
65 (LINEAR, RAID0, RAID1, RAID4, RAID5, RAID10, MULTIPATH).
66
67 UUID a 128 bit Universally Unique Identifier that identifies the ar‐
68 ray that contains this device.
69
70
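The metadata stored on a component device can be inspected with mdadm(8); for example (the device name is illustrative):

      mdadm --examine /dev/sda1
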
71 When a version 0.90 array is being reshaped (e.g. adding extra devices
72 to a RAID5), the version number is temporarily set to 0.91. This en‐
73 sures that if the reshape process is stopped in the middle (e.g. by a
74 system crash) and the machine boots into an older kernel that does not
75 support reshaping, then the array will not be assembled (which would
76 cause data corruption) but will be left untouched until a kernel that
77 can complete the reshape process is used.
78
79
80 ARRAYS WITHOUT METADATA
81 While it is usually best to create arrays with superblocks so that they
82 can be assembled reliably, there are some circumstances when an array
83 without superblocks is preferred. These include:
84
85 LEGACY ARRAYS
86 Early versions of the md driver only supported LINEAR and RAID0
87 configurations and did not use a superblock (which is less crit‐
88 ical with these configurations). While such arrays should be
89 rebuilt with superblocks if possible, md continues to support
90 them.
91
92 FAULTY Being a largely transparent layer over a different device, the
93 FAULTY personality doesn't gain anything from having a su‐
94 perblock.
95
96 MULTIPATH
97 It is often possible to detect devices which are different paths
98 to the same storage directly rather than having a distinctive
99 superblock written to the device and searched for on all paths.
100 In this case, a MULTIPATH array with no superblock makes sense.
101
102 RAID1 In some configurations it might be desired to create a RAID1
103 configuration that does not use a superblock, and to maintain
104 the state of the array elsewhere. While not encouraged for gen‐
105 eral use, it does have special-purpose uses and is supported.
106
107
108 ARRAYS WITH EXTERNAL METADATA
109 From release 2.6.28, the md driver supports arrays with externally man‐
110 aged metadata. That is, the metadata is not managed by the kernel but
111 rather by a user-space program which is external to the kernel. This
112 allows support for a variety of metadata formats without cluttering the
113 kernel with lots of details.
114
115 md is able to communicate with the user-space program through various
116 sysfs attributes so that it can make appropriate changes to the meta‐
117 data - for example to mark a device as faulty. When necessary, md will
118 wait for the program to acknowledge the event by writing to a sysfs at‐
119 tribute. The manual page for mdmon(8) contains more detail about this
120 interaction.
121
122
123 CONTAINERS
124 Many metadata formats use a single block of metadata to describe a num‐
125 ber of different arrays which all use the same set of devices. In this
126 case it is helpful for the kernel to know about the full set of devices
127 as a whole. This set is known to md as a container. A container is an
128 md array with externally managed metadata and with device offset and
129 size so that it just covers the metadata part of the devices. The re‐
130 mainder of each device is available to be incorporated into various ar‐
131 rays.
132
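As an illustration, this is how mdadm(8) is typically used with the IMSM external metadata format: a container is created first, and arrays are then created inside it (all device names are examples):

      # create the container that holds the external metadata
      mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=2 /dev/sda /dev/sdb

      # create a RAID1 volume inside the container
      mdadm --create /dev/md/vol0 --level=1 --raid-devices=2 /dev/md/imsm0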
133
134 LINEAR
135 A LINEAR array simply catenates the available space on each drive to
136 form one large virtual drive.
137
138 One advantage of this arrangement over the more common RAID0 arrange‐
139 ment is that the array may be reconfigured at a later time with an ex‐
140 tra drive, so the array is made bigger without disturbing the data that
141 is on the array. This can even be done on a live array.
142
143 If a chunksize is given with a LINEAR array, the usable space on each
144 device is rounded down to a multiple of this chunksize.
145
146
147 RAID0
148 A RAID0 array (which has zero redundancy) is also known as a striped
149 array. A RAID0 array is configured at creation with a Chunk Size which
150 must be a power of two (prior to Linux 2.6.31), and at least 4
151 kibibytes.
152
153 The RAID0 driver assigns the first chunk of the array to the first de‐
154 vice, the second chunk to the second device, and so on until all drives
155 have been assigned one chunk. This collection of chunks forms a
156 stripe. Further chunks are gathered into stripes in the same way, and
157 are assigned to the remaining space in the drives.
158
159 If devices in the array are not all the same size, then once the small‐
160 est device has been exhausted, the RAID0 driver starts collecting
161 chunks into smaller stripes that only span the drives which still have
162 remaining space.
163
164 A bug was introduced in Linux 3.14 which changed the layout of blocks
165 in a RAID0 beyond the region that is striped over all devices. This
166 bug does not affect an array with all devices the same size, but can
167 affect other RAID0 arrays.
168
169 Linux 5.4 (and some stable kernels to which the change was backported)
170 will not normally assemble such an array as it cannot know which layout
171 to use. There is a module parameter "raid0.default_layout" which can
172 be set to "1" to force the kernel to use the pre-3.14 layout or to "2"
173 to force it to use the 3.14-and-later layout. When creating a new
174 RAID0 array, mdadm will record the chosen layout in the metadata in a
175 way that allows newer kernels to assemble the array without needing a
176 module parameter.
177
178 To assemble an old array on a new kernel without using the module pa‐
179 rameter, use either the --update=layout-original option or the --up‐
180 date=layout-alternate option.
181
182 Once you have updated the layout you will not be able to mount the ar‐
183 ray on an older kernel. If you need to revert to an older kernel, the
184 layout information can be erased with the --update=layout-unspecified
185 option. If you use this option to --assemble while running a newer
186 kernel, the array will NOT assemble, but the metadata will be updated so
187 that it can be assembled on an older kernel.
188
189 Note that setting the layout to "unspecified" removes protections against
190 this bug, and you must be sure that the kernel you use matches the lay‐
191 out of the array.
192
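As a sketch, an affected multi-zone RAID0 array could be handled in either of these ways (device names are illustrative):

      # one-off: tell the kernel which layout to assume, on the kernel command line
      #     raid0.default_layout=1        (pre-3.14 layout)
      #     raid0.default_layout=2        (3.14-and-later layout)

      # or record the layout in the metadata while assembling
      mdadm --assemble /dev/md0 --update=layout-original /dev/sda1 /dev/sdb1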
193
194 RAID1
195 A RAID1 array is also known as a mirrored set (though mirrors tend to
196 provide reflected images, which RAID1 does not) or a plex.
197
198 Once initialised, each device in a RAID1 array contains exactly the
199 same data. Changes are written to all devices in parallel. Data is
200 read from any one device. The driver attempts to distribute read re‐
201 quests across all devices to maximise performance.
202
203 All devices in a RAID1 array should be the same size. If they are not,
204 then only the amount of space available on the smallest device is used
205 (any extra space on other devices is wasted).
206
207 Note that the read balancing done by the driver does not make the RAID1
208 performance profile be the same as for RAID0; a single stream of se‐
209 quential input will not be accelerated (e.g. a single dd), but multiple
210 sequential streams or a random workload will use more than one spindle.
211 In theory, having an N-disk RAID1 will allow N sequential threads to
212 read from all disks.
213
214 Individual devices in a RAID1 can be marked as "write-mostly". These
215 drives are excluded from the normal read balancing and will only be
216 read from when there is no other option. This can be useful for de‐
217 vices connected over a slow link.
218
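For example, a device can be flagged as write-mostly when the array is created (device names are illustrative):

      mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 --write-mostly /dev/sdb1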
219
220 RAID4
221 A RAID4 array is like a RAID0 array with an extra device for storing
222 parity. This device is the last of the active devices in the array. Un‐
223 like RAID0, RAID4 also requires that all stripes span all drives, so
224 extra space on devices that are larger than the smallest is wasted.
225
226 When any block in a RAID4 array is modified, the parity block for that
227 stripe (i.e. the block in the parity device at the same device offset
228 as the stripe) is also modified so that the parity block always con‐
229 tains the "parity" for the whole stripe. I.e. its content is equiva‐
230 lent to the result of performing an exclusive-or operation between all
231 the data blocks in the stripe.
232
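For example, in a four-device RAID4 a stripe holds three data blocks D1, D2 and D3 plus the parity block P:

      P  = D1 xor D2 xor D3
      D2 = P  xor D1 xor D3      (recovering D2 after its device fails)
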
233 This allows the array to continue to function if one device fails. The
234 data that was on that device can be calculated as needed from the par‐
235 ity block and the other data blocks.
236
237
238 RAID5
239 RAID5 is very similar to RAID4. The difference is that the parity
240 blocks for each stripe, instead of being on a single device, are dis‐
241 tributed across all devices. This allows more parallelism when writ‐
242 ing, as two different block updates will quite possibly affect parity
243 blocks on different devices so there is less contention.
244
245 This also allows more parallelism when reading, as read requests are
246 distributed over all the devices in the array instead of all but one.
247
248
249 RAID6
250 RAID6 is similar to RAID5, but can handle the loss of any two devices
251 without data loss. Accordingly, it requires N+2 drives to store N
252 drives worth of data.
253
254 The performance for RAID6 is slightly lower but comparable to RAID5 in
255 normal mode and single disk failure mode. It is very slow in dual disk
256 failure mode, however.
257
258
259 RAID10
260 RAID10 provides a combination of RAID1 and RAID0, and is sometimes
261 known as RAID1+0. Every datablock is duplicated some number of times,
262 and the resulting collection of datablocks is distributed over multi‐
263 ple drives.
264
265 When configuring a RAID10 array, it is necessary to specify the number
266 of replicas of each data block that are required (this will usually
267 be 2) and whether their layout should be "near", "far" or "offset"
268 (with "offset" being available since Linux 2.6.18).
269
270 About the RAID10 Layout Examples:
271 The examples below visualise the chunk distribution on the underlying
272 devices for the respective layout.
273
274 For simplicity it is assumed that the size of the chunks equals the
275 size of the blocks of the underlying devices as well as those of the
276 RAID10 device exported by the kernel (for example /dev/md/name).
277 Therefore the chunks / chunk numbers map directly to the blocks / block
278 addresses of the exported RAID10 device.
279
280 Decimal numbers (0, 1, 2, ...) are the chunks of the RAID10 and due to
281 the above assumption also the blocks and block addresses of the ex‐
282 ported RAID10 device.
283 Repeated numbers mean copies of a chunk / block (obviously on different
284 underlying devices).
285 Hexadecimal numbers (0x00, 0x01, 0x02, ...) are the block addresses of
286 the underlying devices.
287
288
289 "near" Layout
290 When "near" replicas are chosen, the multiple copies of a given
291 chunk are laid out consecutively ("as close to each other as
292 possible") across the stripes of the array.
293
294 With an even number of devices, they will likely (unless some
295 misalignment is present) lie at the very same offset on the dif‐
296 ferent devices.
297 This is the same as the "classic" RAID1+0; that is, two groups of mirrored
298 devices (in the example below the groups Device #1 / #2 and De‐
299 vice #3 / #4 are each a RAID1) both in turn forming a striped
300 RAID0.
301
302 Example with 2 copies per chunk and an even number (4) of de‐
303 vices:
304
305        ┌───────────┬───────────┬───────────┬───────────┐
306        │ Device #1 │ Device #2 │ Device #3 │ Device #4 │
307 ┌─────┼───────────┼───────────┼───────────┼───────────┤
308 │0x00 │     0     │     0     │     1     │     1     │
309 │0x01 │     2     │     2     │     3     │     3     │
310 │...  │    ...    │    ...    │    ...    │    ...    │
311 │  :  │     :     │     :     │     :     │     :     │
312 │...  │    ...    │    ...    │    ...    │    ...    │
313 │0x80 │    254    │    254    │    255    │    255    │
314 └─────┴───────────┴───────────┴───────────┴───────────┘
315          \---------v---------/   \---------v---------/
316                  RAID1                   RAID1
317          \---------------------v---------------------/
318                               RAID0
319
320 Example with 2 copies per chunk and an odd number (5) of de‐
321 vices:
322
323        ┌────────┬────────┬────────┬────────┬────────┐
324        │ Dev #1 │ Dev #2 │ Dev #3 │ Dev #4 │ Dev #5 │
325 ┌─────┼────────┼────────┼────────┼────────┼────────┤
326 │0x00 │   0    │   0    │   1    │   1    │   2    │
327 │0x01 │   2    │   3    │   3    │   4    │   4    │
328 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │
329 │  :  │   :    │   :    │   :    │   :    │   :    │
330 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │
331 │0x80 │  317   │  318   │  318   │  319   │  319   │
332 └─────┴────────┴────────┴────────┴────────┴────────┘
333
334
335
336 "far" Layout
337 When "far" replicas are chosen, the multiple copies of a given
338 chunk are laid out quite distant ("as far as reasonably possi‐
339 ble") from each other.
340
341 First a complete sequence of all data blocks (that is all the
342 data one sees on the exported RAID10 block device) is striped
343 over the devices. Then another (though "shifted") complete se‐
344 quence of all data blocks; and so on (in the case of more than
345 2 copies per chunk).
346
347 The "shift" needed to prevent placing copies of the same chunks
348 on the same devices is actually a cyclic permutation with off‐
349 set 1 of each of the stripes within a complete sequence of
350 chunks.
351 The offset 1 is relative to the previous complete sequence of
352 chunks, so in case of more than 2 copies per chunk one gets the
353 following offsets:
354 1. complete sequence of chunks: offset = 0
355 2. complete sequence of chunks: offset = 1
356 3. complete sequence of chunks: offset = 2
357 :
358 n. complete sequence of chunks: offset = n-1
359
360 Example with 2 copies per chunk and an even number (4) of de‐
361 vices:
362
363        ┌───────────┬───────────┬───────────┬───────────┐
364        │ Device #1 │ Device #2 │ Device #3 │ Device #4 │
365 ┌─────┼───────────┼───────────┼───────────┼───────────┤
366 │0x00 │     0     │     1     │     2     │     3     │  \
367 │0x01 │     4     │     5     │     6     │     7     │  > [#]
368 │...  │    ...    │    ...    │    ...    │    ...    │  :
369 │  :  │     :     │     :     │     :     │     :     │  :
370 │...  │    ...    │    ...    │    ...    │    ...    │  :
371 │0x40 │    252    │    253    │    254    │    255    │  /
372 │0x41 │     3     │     0     │     1     │     2     │  \
373 │0x42 │     7     │     4     │     5     │     6     │  > [#]~
374 │...  │    ...    │    ...    │    ...    │    ...    │  :
375 │  :  │     :     │     :     │     :     │     :     │  :
376 │...  │    ...    │    ...    │    ...    │    ...    │  :
377 │0x80 │    255    │    252    │    253    │    254    │  /
378 └─────┴───────────┴───────────┴───────────┴───────────┘
379
380 Example with 2 copies per chunk and an odd number (5) of de‐
381 vices:
382
383        ┌────────┬────────┬────────┬────────┬────────┐
384        │ Dev #1 │ Dev #2 │ Dev #3 │ Dev #4 │ Dev #5 │
385 ┌─────┼────────┼────────┼────────┼────────┼────────┤
386 │0x00 │   0    │   1    │   2    │   3    │   4    │  \
387 │0x01 │   5    │   6    │   7    │   8    │   9    │  > [#]
388 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │  :
389 │  :  │   :    │   :    │   :    │   :    │   :    │  :
390 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │  :
391 │0x40 │  315   │  316   │  317   │  318   │  319   │  /
392 │0x41 │   4    │   0    │   1    │   2    │   3    │  \
393 │0x42 │   9    │   5    │   6    │   7    │   8    │  > [#]~
394 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │  :
395 │  :  │   :    │   :    │   :    │   :    │   :    │  :
396 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │  :
397 │0x80 │  319   │  315   │  316   │  317   │  318   │  /
398 └─────┴────────┴────────┴────────┴────────┴────────┘
399
400 With [#] being the complete sequence of chunks and [#]~ the
401 cyclic permutation with offset 1 thereof (in the case of more
402 than 2 copies per chunk there would be
403 ([#]~)~, (([#]~)~)~, ...).
404
405 The advantage of this layout is that MD can easily spread se‐
406 quential reads over the devices, making them similar to RAID0 in
407 terms of speed.
408 The cost is more seeking for writes, making them substantially
409 slower.
410
411
412 "offset" Layout
413 When "offset" replicas are chosen, all the copies of a given
414 chunk are striped consecutively ("offset by the stripe length
415 after each other") over the devices.
416
417 Explained in detail, <number of devices> consecutive chunks are
418 striped over the devices, immediately followed by a "shifted"
419 copy of these chunks (and by further such "shifted" copies in
420 the case of more than 2 copies per chunk).
421 This pattern repeats for all further consecutive chunks of the
422 exported RAID10 device (in other words: all further data
423 blocks).
424
425 The "shift" needed to prevent placing copies of the same chunks
426 on the same devices is actually a cyclic permutation with off‐
427 set 1 of each of the striped copies of <number of devices> con‐
428 secutive chunks.
429 The offset 1 is relative to the previous striped copy of <number
430 of devices> consecutive chunks, so in case of more than 2 copies
431 per chunk one gets the following offsets:
432 1. <number of devices> consecutive chunks: offset = 0
433 2. <number of devices> consecutive chunks: offset = 1
434 3. <number of devices> consecutive chunks: offset = 2
435 :
436 n. <number of devices> consecutive chunks: offset = n-1
437
438 Example with 2 copies per chunk and an even number (4) of de‐
439 vices:
440
441        ┌───────────┬───────────┬───────────┬───────────┐
442        │ Device #1 │ Device #2 │ Device #3 │ Device #4 │
443 ┌─────┼───────────┼───────────┼───────────┼───────────┤
444 │0x00 │     0     │     1     │     2     │     3     │  ) AA
445 │0x01 │     3     │     0     │     1     │     2     │  ) AA~
446 │0x02 │     4     │     5     │     6     │     7     │  ) AB
447 │0x03 │     7     │     4     │     5     │     6     │  ) AB~
448 │...  │    ...    │    ...    │    ...    │    ...    │  ) ...
449 │  :  │     :     │     :     │     :     │     :     │    :
450 │...  │    ...    │    ...    │    ...    │    ...    │  ) ...
451 │0x79 │    251    │    252    │    253    │    254    │  ) EX
452 │0x80 │    254    │    251    │    252    │    253    │  ) EX~
453 └─────┴───────────┴───────────┴───────────┴───────────┘
454
455 Example with 2 copies per chunk and an odd number (5) of de‐
456 vices:
457
458        ┌────────┬────────┬────────┬────────┬────────┐
459        │ Dev #1 │ Dev #2 │ Dev #3 │ Dev #4 │ Dev #5 │
460 ┌─────┼────────┼────────┼────────┼────────┼────────┤
461 │0x00 │   0    │   1    │   2    │   3    │   4    │  ) AA
462 │0x01 │   4    │   0    │   1    │   2    │   3    │  ) AA~
463 │0x02 │   5    │   6    │   7    │   8    │   9    │  ) AB
464 │0x03 │   9    │   5    │   6    │   7    │   8    │  ) AB~
465 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │  ) ...
466 │  :  │   :    │   :    │   :    │   :    │   :    │    :
467 │...  │  ...   │  ...   │  ...   │  ...   │  ...   │  ) ...
468 │0x79 │  314   │  315   │  316   │  317   │  318   │  ) EX
469 │0x80 │  318   │  314   │  315   │  316   │  317   │  ) EX~
470 └─────┴────────┴────────┴────────┴────────┴────────┘
471
472 With AA, AB, ..., AZ, BA, ... being the sets of <number of de‐
473 vices> consecutive chunks and AA~, AB~, ..., AZ~, BA~, ... the
474 cyclic permutations with offset 1 thereof (in the case of more
475 than 2 copies per chunk there would be (AA~)~, ... as well as
476 ((AA~)~)~, ... and so on).
477
478 This should give similar read characteristics to "far" if a
479 suitably large chunk size is used, but without as much seeking
480 for writes.
481
482 It should be noted that the number of devices in a RAID10 array need
483 not be a multiple of the number of replicas of each data block; however,
484 there must be at least as many devices as replicas.
485
486 If, for example, an array is created with 5 devices and 2 replicas,
487 then space equivalent to 2.5 of the devices will be available, and ev‐
488 ery block will be stored on two different devices.
489
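For example, mdadm(8) expresses the layout and replica count as a single --layout value, a letter ("n", "f" or "o") followed by the number of copies (device names are illustrative):

      # 2 "near" copies over 4 devices
      mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=4 /dev/sd[abcd]1

      # the same devices with 2 "far" copies
      mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 /dev/sd[abcd]1
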
490 Finally, it is possible to have an array with both "near" and "far"
491 copies. If an array is configured with 2 near copies and 2 far copies,
492 then there will be a total of 4 copies of each block, each on a differ‐
493 ent drive. This is an artifact of the implementation and is unlikely
494 to be of real value.
495
496
497 MULTIPATH
498 MULTIPATH is not really a RAID at all as there is only one real device
499 in a MULTIPATH md array. However there are multiple access points
500 (paths) to this device, and one of these paths might fail, so there are
501 some similarities.
502
503 A MULTIPATH array is composed of a number of logically different de‐
504 vices, often fibre channel interfaces, that all refer to the same real
505 device. If one of these interfaces fails (e.g. due to cable problems),
506 the MULTIPATH driver will attempt to redirect requests to another in‐
507 terface.
508
509 The MULTIPATH driver is not receiving any ongoing development and should
510 be considered a legacy driver. The device-mapper based multipath driv‐
511 ers should be preferred for new installations.
512
513
514 FAULTY
515 The FAULTY md module is provided for testing purposes. A FAULTY array
516 has exactly one component device and is normally assembled without a
517 superblock, so the md array created provides direct access to all of
518 the data in the component device.
519
520 The FAULTY module may be requested to simulate faults to allow testing
521 of other md levels or of filesystems. Faults can be chosen to trigger
522 on read requests or write requests, and can be transient (a subsequent
523 read/write at the address will probably succeed) or persistent (subse‐
524 quent read/write of the same address will fail). Further, read faults
525 can be "fixable" meaning that they persist until a write request at the
526 same address.
527
528 Fault types can be requested with a period. In this case, the fault
529 will recur repeatedly after the given number of requests of the rele‐
530 vant type. For example if persistent read faults have a period of 100,
531 then every 100th read request would generate a fault, and the faulty
532 sector would be recorded so that subsequent reads on that sector would
533 also fail.
534
535 There is a limit to the number of faulty sectors that are remembered.
536 Faults generated after this limit is exhausted are treated as tran‐
537 sient.
538
539 The list of faulty sectors can be flushed, and the active list of fail‐
540 ure modes can be cleared.
541
542
543 UNCLEAN SHUTDOWN
544 When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array
545 there is a possibility of inconsistency for short periods of time as
546 each update requires at least two blocks to be written to different de‐
547 vices, and these writes probably won't happen at exactly the same time.
548 Thus if a system with one of these arrays is shutdown in the middle of
549 a write operation (e.g. due to power failure), the array may not be
550 consistent.
551
552 To handle this situation, the md driver marks an array as "dirty" be‐
553 fore writing any data to it, and marks it as "clean" when the array is
554 being disabled, e.g. at shutdown. If the md driver finds an array to
555 be dirty at startup, it proceeds to correct any possible inconsistency.
556 For RAID1, this involves copying the contents of the first drive onto
557 all other drives. For RAID4, RAID5 and RAID6 this involves recalculat‐
558 ing the parity for each stripe and making sure that the parity block
559 has the correct data. For RAID10 it involves copying one of the repli‐
560 cas of each block onto all the others. This process, known as "resyn‐
561 chronising" or "resync" is performed in the background. The array can
562 still be used, though possibly with reduced performance.
563
564 If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
565 drive, two for RAID6) when it is restarted after an unclean shutdown,
566 it cannot recalculate parity, and so it is possible that data might be
567 undetectably corrupted. The 2.4 md driver does not alert the operator
568 to this condition. The 2.6 md driver will fail to start an array in
569 this condition without manual intervention, though this behaviour can
570 be overridden by a kernel parameter.
571
572
573 RECOVERY
574 If the md driver detects a write error on a device in a RAID1, RAID4,
575 RAID5, RAID6, or RAID10 array, it immediately disables that device
576 (marking it as faulty) and continues operation on the remaining de‐
577 vices. If there are spare drives, the driver will start recreating on
578 one of the spare drives the data which was on that failed drive, either
579 by copying a working drive in a RAID1 configuration, or by doing calcu‐
580 lations with the parity block on RAID4, RAID5 or RAID6, or by finding
581 and copying originals for RAID10.
582
583 In kernels prior to about 2.6.15, a read error would cause the same ef‐
584 fect as a write error. In later kernels, a read-error will instead
585 cause md to attempt a recovery by overwriting the bad block. i.e. it
586 will find the correct data from elsewhere, write it over the block that
587 failed, and then try to read it back again. If either the write or the
588 re-read fail, md will treat the error the same way that a write error
589 is treated, and will fail the whole device.
590
591 While this recovery process is happening, the md driver will monitor
592 accesses to the array and will slow down the rate of recovery if other
593 activity is happening, so that normal access to the array will not be
594 unduly affected. When no other activity is happening, the recovery
595 process proceeds at full speed. The actual speed targets for the two
596 different situations can be controlled by the speed_limit_min and
597 speed_limit_max control files mentioned below.
598
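For example, on an otherwise idle system the limits can be raised so that a recovery finishes sooner (values are illustrative, in KiB/s per device):

      echo 50000  > /proc/sys/dev/raid/speed_limit_min
      echo 500000 > /proc/sys/dev/raid/speed_limit_max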
599
600 SCRUBBING AND MISMATCHES
601 As storage devices can develop bad blocks at any time it is valuable to
602 regularly read all blocks on all devices in an array so as to catch
603 such bad blocks early. This process is called scrubbing.
604
605 md arrays can be scrubbed by writing either check or repair to the file
606 md/sync_action in the sysfs directory for the device.
607
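For example, assuming the array is /dev/md0, a scrub can be started and its outcome inspected like this:

      echo check > /sys/block/md0/md/sync_action
      cat /sys/block/md0/md/sync_action       # reports the running operation
      cat /sys/block/md0/md/mismatch_cnt      # inspect after the scrub finishes
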
608 Requesting a scrub will cause md to read every block on every device in
609 the array, and check that the data is consistent. For RAID1 and
610 RAID10, this means checking that the copies are identical. For RAID4,
611 RAID5, RAID6 this means checking that the parity block is (or blocks
612 are) correct.
613
614 If a read error is detected during this process, the normal read-error
615 handling causes correct data to be found from other devices and to be
616 written back to the faulty device. In many cases this will effectively
617 fix the bad block.
618
619 If all blocks read successfully but are found to not be consistent,
620 then this is regarded as a mismatch.
621
622 If check was used, then no action is taken to handle the mismatch, it
623 is simply recorded. If repair was used, then a mismatch will be re‐
624 paired in the same way that resync repairs arrays. For RAID5/RAID6 new
625 parity blocks are written. For RAID1/RAID10, all but one block are
626 overwritten with the content of that one block.
627
628 A count of mismatches is recorded in the sysfs file md/mismatch_cnt.
629 This is set to zero when a scrub starts and is incremented whenever a
630 sector is found that is a mismatch. md normally works in units much
631 larger than a single sector and when it finds a mismatch, it does not
632 determine exactly how many actual sectors were affected but simply adds
633 the number of sectors in the IO unit that was used. So a value of 128
634 could simply mean that a single 64KB check found an error (128 x
635 512bytes = 64KB).
636
637 If an array is created by mdadm with --assume-clean then a subsequent
638 check could be expected to find some mismatches.
639
640 On a truly clean RAID5 or RAID6 array, any mismatches should indicate a
641 hardware problem at some level - software issues should never cause
642 such a mismatch.
643
644 However on RAID1 and RAID10 it is possible for software issues to cause
645 a mismatch to be reported. This does not necessarily mean that the
646 data on the array is corrupted. It could simply be that the system
647 does not care what is stored on that part of the array - it is unused
648 space.
649
650 The most likely cause for an unexpected mismatch on RAID1 or RAID10 oc‐
651 curs if a swap partition or swap file is stored on the array.
652
653 When the swap subsystem wants to write a page of memory out, it flags
654 the page as 'clean' in the memory manager and requests the swap device
655 to write it out. It is quite possible that the memory will be changed
656 while the write-out is happening. In that case the 'clean' flag will
657 be found to be clear when the write completes and so the swap subsystem
658 will simply forget that the swapout had been attempted, and will possi‐
659 bly choose a different page to write out.
660
661 If the swap device was on RAID1 (or RAID10), then the data is sent from
662 memory to a device twice (or more depending on the number of devices in
663 the array). Thus it is possible that the memory gets changed between
664 the times it is sent, so different data can be written to the different
665 devices in the array. This will be detected by check as a mismatch.
666 However it does not reflect any corruption as the block where this mis‐
667 match occurs is being treated by the swap system as being empty, and
668 the data will never be read from that block.
669
670 It is conceivable for a similar situation to occur on non-swap files,
671 though it is less likely.
672
673 Thus the mismatch_cnt value can not be interpreted very reliably on
674 RAID1 or RAID10, especially when the device is used for swap.
675
676
677
678 BITMAP WRITE-INTENT LOGGING
679 From Linux 2.6.13, md supports a bitmap based write-intent log. If
680 configured, the bitmap is used to record which blocks of the array may
681 be out of sync. Before any write request is honoured, md will make
682 sure that the corresponding bit in the log is set. After a period of
683 time with no writes to an area of the array, the corresponding bit will
684 be cleared.
685
686 This bitmap is used for two optimisations.
687
688 Firstly, after an unclean shutdown, the resync process will consult the
689 bitmap and only resync those blocks that correspond to bits in the bit‐
690 map that are set. This can dramatically reduce resync time.
691
692 Secondly, when a drive fails and is removed from the array, md stops
693 clearing bits in the intent log. If that same drive is re-added to the
694 array, md will notice and will only recover the sections of the drive
695 that are covered by bits in the intent log that are set. This can al‐
696 low a device to be temporarily removed and reinserted without causing
697 an enormous recovery cost.
698
699 The intent log can be stored in a file on a separate device, or it can
700 be stored near the superblocks of an array which has superblocks.
701
702 It is possible to add an intent log to an active array, or remove an
703 intent log if one is present.
704
705 In 2.6.13, intent bitmaps are only supported with RAID1. Other levels
706 with redundancy are supported from 2.6.15.
707
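For example, an internal write-intent bitmap can be added to, or removed from, an active array with mdadm(8) (the array name is illustrative):

      mdadm --grow /dev/md0 --bitmap=internal
      mdadm --grow /dev/md0 --bitmap=none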
708
709 BAD BLOCK LIST
710 From Linux 3.5 each device in an md array can store a list of known-
711 bad-blocks. This list is 4K in size and usually positioned at the end
712 of the space between the superblock and the data.
713
714 When a block cannot be read and cannot be repaired by writing data re‐
715 covered from other devices, the address of the block is stored in the
716 bad block list. Similarly if an attempt to write a block fails, the
717 address will be recorded as a bad block. If attempting to record the
718 bad block fails, the whole device will be marked faulty.
719
720 Attempting to read from a known bad block will cause a read error. At‐
721 tempting to write to a known bad block will be ignored if any write er‐
722 rors have been reported by the device. If there have been no write er‐
723 rors then the data will be written to the known bad block and if that
724 succeeds, the address will be removed from the list.
725
726 This allows an array to fail more gracefully - a few blocks on differ‐
727 ent devices can be faulty without taking the whole array out of action.
728
729 The list is particularly useful when recovering to a spare. If a few
730 blocks cannot be read from the other devices, the bulk of the recovery
731 can complete and those few bad blocks will be recorded in the bad block
732 list.
733
734
735 RAID WRITE HOLE
736 Due to the non-atomic nature of RAID write operations, interruption
737 of a write (system crash, etc.) to a RAID456 array can lead to in‐
738 consistent parity and data loss (the so-called RAID-5 write hole).
739 To plug the write hole, md supports the two mechanisms described below.
740
741
742 DIRTY STRIPE JOURNAL
743 From Linux 4.4, md supports a write-ahead journal for RAID456.
744 When the array is created, an additional journal device can be
745 added to the array through the write-journal option. The RAID
746 write journal works similarly to file system journals. Before
747 writing to the data disks, md persists data AND parity of the
748 stripe to the journal device. After a crash, md searches the
749 journal device for incomplete write operations, and replays them
750 to the data disks.
751
752 When the journal device fails, the RAID array is forced to run
753 in read-only mode.
754
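For example, the journal device is supplied at creation time with the --write-journal option of mdadm(8) (device names are illustrative; the journal device should be fast and reliable):

      mdadm --create /dev/md0 --level=5 --raid-devices=3 \
            --write-journal=/dev/nvme0n1p1 /dev/sda1 /dev/sdb1 /dev/sdc1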
755
756 PARTIAL PARITY LOG
757 From Linux 4.12 md supports Partial Parity Log (PPL) for RAID5
758 arrays only. Partial parity for a write operation is the XOR of
759 stripe data chunks not modified by the write. PPL is stored in
760 the metadata region of RAID member drives, no additional journal
761 drive is needed. After crashes, if one of the not modified data
762 disks of the stripe is missing, this updated parity can be used
763 to recover its data.
764
765 This mechanism is documented more fully in the file Documenta‐
766 tion/md/raid5-ppl.rst
767
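As a sketch, PPL is selected through the consistency policy option of recent mdadm(8) versions (device names are illustrative):

      mdadm --create /dev/md0 --level=5 --raid-devices=3 \
            --consistency-policy=ppl /dev/sda1 /dev/sdb1 /dev/sdc1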
768
769 WRITE-BEHIND
770 From Linux 2.6.14, md supports WRITE-BEHIND on RAID1 arrays.
771
772 This allows certain devices in the array to be flagged as write-mostly.
773 MD will only read from such devices if there is no other option.
774
775 If a write-intent bitmap is also provided, write requests to write-
776 mostly devices will be treated as write-behind requests and md will not
777 wait for those writes to complete before reporting the
778 write as complete to the filesystem.
779
780 This allows for a RAID1 with WRITE-BEHIND to be used to mirror data
781 over a slow link to a remote computer (providing the link isn't too
782 slow). The extra latency of the remote link will not slow down normal
783 operations, but the remote system will still have a reasonably up-to-
784 date copy of all data.
785
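As a sketch, such a mirror combines a write-intent bitmap, a write-behind limit and a write-mostly device (device names are illustrative; /dev/sdb1 stands for the remote, slower device):

      mdadm --create /dev/md0 --level=1 --raid-devices=2 \
            --bitmap=internal --write-behind=256 \
            /dev/sda1 --write-mostly /dev/sdb1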
786
787 FAILFAST
788 From Linux 4.10, md supports FAILFAST for RAID1 and RAID10 arrays.
789 This is a flag that can be set on individual drives, though it is usu‐
790 ally set on all drives, or no drives.
791
792 When md sends an I/O request to a drive that is marked as FAILFAST, and
793 when the array could survive the loss of that drive without losing
794 data, md will request that the underlying device does not perform any
795 retries. This means that a failure will be reported to md promptly,
796 and it can mark the device as faulty and continue using the other de‐
797 vice(s). md cannot control the timeout that the underlying devices use
798 to determine failure. Any changes desired to that timeout must be set
799 explicitly on the underlying device, separately from using mdadm.
800
801 If a FAILFAST request does fail, and if it is still safe to mark the
802 device as faulty without data loss, that will be done and the array
803 will continue functioning on a reduced number of devices. If it is not
804 possible to safely mark the device as faulty, md will retry the request
805 without disabling retries in the underlying device. In any case, md
806 will not attempt to repair read errors on a device marked as FAILFAST
807 by writing out the correct data. It will just mark the device as faulty.
808
809 FAILFAST is appropriate for storage arrays that have a low probability
810 of true failure, but will sometimes introduce unacceptable delays to
811 I/O requests while performing internal maintenance. The value of set‐
812 ting FAILFAST involves a trade-off. The gain is that the chance of un‐
813 acceptable delays is substantially reduced. The cost is that the un‐
814 likely event of data-loss on one device is slightly more likely to re‐
815 sult in data-loss for the array.
816
817 When a device in an array using FAILFAST is marked as faulty, it will
818 usually become usable again in a short while. mdadm makes no attempt
819 to detect that possibility. Some separate mechanism, tuned to the spe‐
820 cific details of the expected failure modes, needs to be created to
821 monitor devices to see when they return to full functionality, and to
822 then re-add them to the array. In order for this "re-add" functionality
823 to be effective, an array using FAILFAST should always have a write-in‐
824 tent bitmap.
825
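As a sketch, devices can be marked FAILFAST when the array is created, while the device's own command timeout is adjusted separately through the block layer (paths and values are illustrative):

      mdadm --create /dev/md0 --level=1 --raid-devices=2 --failfast /dev/sda1 /dev/sdb1

      # the drive's own timeout is outside md's control, e.g. for a SCSI/SATA disk:
      echo 5 > /sys/block/sda/device/timeout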
826
827 RESTRIPING
828 Restriping, also known as Reshaping, is the process of re-arranging
829 the data stored in each stripe into a new layout. This might involve
830 changing the number of devices in the array (so the stripes are wider),
831 changing the chunk size (so stripes are deeper or shallower), or chang‐
832 ing the arrangement of data and parity (possibly changing the RAID
833 level, e.g. 1 to 5 or 5 to 6).
834
835 As of Linux 2.6.35, md can reshape a RAID4, RAID5, or RAID6 array to
836 have a different number of devices (more or fewer) and to have a dif‐
837 ferent layout or chunk size. It can also convert between these differ‐
838 ent RAID levels. It can also convert between RAID0 and RAID10, and be‐
839 tween RAID0 and RAID4 or RAID5. Other possibilities may follow in fu‐
840 ture kernels.
841
842 During any restripe process there is a 'critical section' during which
843 live data is being overwritten on disk. For the operation of increas‐
844 ing the number of drives in a RAID5, this critical section covers the
845 first few stripes (the number being the product of the old and new num‐
846 ber of devices). After this critical section is passed, data is only
847 written to areas of the array which no longer hold live data — the live
848 data has already been relocated elsewhere.
849
850 For a reshape which reduces the number of devices, the 'critical sec‐
851 tion' is at the end of the reshape process.
852
853 md is not able to ensure data preservation if there is a crash (e.g.
854 power failure) during the critical section. If md is asked to start an
855 array which failed during a critical section of restriping, it will
856 fail to start the array.
857
858 To deal with this possibility, a user-space program must
859
860 • Disable writes to that section of the array (using the sysfs inter‐
861 face),
862
863 • take a copy of the data somewhere (i.e. make a backup),
864
865 • allow the process to continue and invalidate the backup and restore
866 write access once the critical section is passed, and
867
868 • provide for restoring the critical data before restarting the array
869 after a system crash.
870
871 mdadm versions from 2.4 do this for growing a RAID5 array.
872
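For example, growing a RAID5 from 4 to 5 devices looks like this (a sketch; the backup file protects the critical section and must not be stored on the array being reshaped):

      mdadm --add /dev/md0 /dev/sde1
      mdadm --grow /dev/md0 --raid-devices=5 --backup-file=/root/md0-grow.bak
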
873 For operations that do not change the size of the array, like simply
874 increasing chunk size, or converting RAID5 to RAID6 with one extra de‐
875 vice, the entire process is the critical section. In this case, the
876 restripe will need to progress in stages, as a section is suspended,
877 backed up, restriped, and released.
878
879
880 SYSFS INTERFACE
881 Each block device appears as a directory in sysfs (which is usually
882 mounted at /sys). For MD devices, this directory will contain a subdi‐
883 rectory called md which contains various files for providing access to
884 information about the array.
885
886 This interface is documented more fully in the file Documentation/ad‐
887 min-guide/md.rst which is distributed with the kernel sources. That
888 file should be consulted for full documentation. The following are
889 just a selection of attribute files that are available.
890
891
892 md/sync_speed_min
893 This value, if set, overrides the system-wide setting in
894 /proc/sys/dev/raid/speed_limit_min for this array only. Writing
895 the value "system" to this file will cause the system-wide setting
896 to have effect.
897
898
899 md/sync_speed_max
900 This is the partner of md/sync_speed_min and overrides
901 /proc/sys/dev/raid/speed_limit_max described below.
902
903
904 md/sync_action
905 This can be used to monitor and control the resync/recovery
906 process of MD. In particular, writing "check" here will cause
907 the array to read all data blocks and check that they are consis‐
908 tent (e.g. parity is correct, or all mirror replicas are the
909 same). Any discrepancies found are NOT corrected.
910
911 A count of problems found will be stored in md/mismatch_cnt.
912
913 Alternately, "repair" can be written which will cause the same
914 check to be performed, but any errors will be corrected.
915
916 Finally, "idle" can be written to stop the check/repair process.
917
918
919 md/stripe_cache_size
920 This is only available on RAID5 and RAID6. It records the size
921 (in pages per device) of the stripe cache which is used for
922 synchronising all write operations to the array and all read op‐
923 erations if the array is degraded. The default is 256. Valid
924 values are 17 to 32768. Increasing this number can increase
925 performance in some situations, at some cost in system memory.
926 Note, setting this value too high can result in an "out of mem‐
927 ory" condition for the system.
928
929 memory_consumed = system_page_size * nr_disks *
930 stripe_cache_size
931
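For example, with 4 KiB pages, a 4-device array and a stripe_cache_size of 4096, the cache consumes 4 KiB * 4 * 4096 = 64 MiB. The value is changed by writing to the attribute (the array name is illustrative):

      echo 4096 > /sys/block/md0/md/stripe_cache_size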
932
933 md/preread_bypass_threshold
934 This is only available on RAID5 and RAID6. This variable sets
935 the number of times MD will service a full-stripe-write before
936 servicing a stripe that requires some "prereading". For fair‐
937 ness this defaults to 1. Valid values are 0 to
938 stripe_cache_size. Setting this to 0 maximizes sequential-write
939 throughput at the cost of fairness to threads doing small or
940 random writes.
941
942
943 md/bitmap/backlog
944 The value stored in this file only has any effect on RAID1 when
945 write-mostly devices are active, and write requests to those de‐
946 vices are processed in the background.
947
948 This variable sets a limit on the number of concurrent back‐
949 ground writes. The valid values are 0 to 16383; 0 means that
950 write-behind is not allowed, while any other number means it can
951 happen. If there are more write requests than this number, new
952 writes will be synchronous.
953
954
955 md/bitmap/can_clear
956 This is for externally managed bitmaps, where the kernel writes
957 the bitmap itself, but metadata describing the bitmap is managed
958 by mdmon or similar.
959
960 When the array is degraded, bits mustn't be cleared. When the
961 array becomes optimal again, bits can be cleared, but first the
962 metadata needs to record the current event count. So md sets
963 this to 'false' and notifies mdmon, then mdmon updates the meta‐
964 data and writes 'true'.
965
966 There is no code in mdmon to actually do this, so maybe it
967 doesn't even work.
968
969
970 md/bitmap/chunksize
971 The bitmap chunksize can only be changed when no bitmap is ac‐
972 tive, and the value should be a power of 2 and at least 512.
973
974
975 md/bitmap/location
976 This indicates where the write-intent bitmap for the array is
977 stored. It can be "none" or "file" or a signed offset from the
978 array metadata - measured in sectors. You cannot set a file by
979 writing here - that can only be done with the SET_BITMAP_FILE
980 ioctl.
981
982 Writing 'none' to 'bitmap/location' will clear the bitmap, and
983 the previous location value must be written back to restore it.
984
985
986 md/bitmap/max_backlog_used
987 This keeps track of the maximum number of concurrent write-be‐
988 hind requests for an md array, writing any value to this file
989 will clear it.
990
991
992 md/bitmap/metadata
993 This can be 'internal' or 'clustered' or 'external'. 'internal'
994 is set by default, which means the metadata for bitmap is stored
995 in the first 256 bytes of the bitmap space. 'clustered' means
996 separate bitmap metadata are used for each cluster node. 'exter‐
997 nal' means that bitmap metadata is managed externally to the
998 kernel.
999
1000
1001 md/bitmap/space
1002 This shows the space (in sectors) which is available at md/bit‐
1003 map/location, and allows the kernel to know when it is safe to
1004 resize the bitmap to match a resized array. It should be big enough
1005 to contain the total bytes in the bitmap.
1006
1007 For 1.0 metadata, the bitmap can use space up to the superblock
1008 if it is stored before the superblock, else up to 4K beyond the
1009 superblock. For other metadata versions, assume no change is possible.
1010
1011
1012 md/bitmap/time_base
1013 This shows the time (in seconds) between disk flushes, and is
1014 used when looking for bits in the bitmap to be cleared.
1015
1016 The default value is 5 seconds, and it should be an unsigned
1017 long value.
1018
1019
1020 KERNEL PARAMETERS
1021 The md driver recognises several different kernel parameters.
1022
1023 raid=noautodetect
1024 This will disable the normal detection of md arrays that happens
1025 at boot time. If a drive is partitioned with MS-DOS style par‐
1026 titions, then if any of the 4 main partitions has a partition
1027 type of 0xFD, then that partition will normally be inspected to
1028 see if it is part of an MD array, and if any full arrays are
1029 found, they are started. This kernel parameter disables this
1030 behaviour.
1031
1032
1033 raid=partitionable
1034
1035 raid=part
1036 These are available in 2.6 and later kernels only. They indi‐
1037 cate that autodetected MD arrays should be created as partition‐
1038 able arrays, with a different major device number to the origi‐
1039 nal non-partitionable md arrays. The device number is listed as
1040 mdp in /proc/devices.
1041
1042
1043 md_mod.start_ro=1
1044
1045 /sys/module/md_mod/parameters/start_ro
1046 This tells md to start all arrays in read-only mode. This is a
1047 soft read-only that will automatically switch to read-write on
1048 the first write request. However until that write request,
1049 nothing is written to any device by md, and in particular, no
1050 resync or recovery operation is started.
1051
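For example, the parameter can be given on the kernel command line, set persistently as a module option, or changed at runtime (the configuration file name is illustrative):

      # kernel command line
      md_mod.start_ro=1

      # module option, e.g. in /etc/modprobe.d/md.conf
      options md_mod start_ro=1

      # at runtime
      echo 1 > /sys/module/md_mod/parameters/start_ro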
1052
1053 md_mod.start_dirty_degraded=1
1054
1055 /sys/module/md_mod/parameters/start_dirty_degraded
1056 As mentioned above, md will not normally start a RAID4, RAID5,
1057 or RAID6 that is both dirty and degraded as this situation can
1058 imply hidden data loss. This can be awkward if the root
1059 filesystem is affected. Using this module parameter allows such
1060 arrays to be started at boot time. It should be understood that
1061 there is a real (though small) risk of data corruption in this
1062 situation.
1063
1064
1065 md=n,dev,dev,...
1066
1067 md=dn,dev,dev,...
1068 This tells the md driver to assemble /dev/md n from the listed
1069 devices. It is only necessary to start the device holding the
1070 root filesystem this way. Other arrays are best started once
1071 the system is booted.
1072
1073 In 2.6 kernels, the d immediately after the = indicates that a
1074 partitionable device (e.g. /dev/md/d0) should be created rather
1075 than the original non-partitionable device.
1076
1077
1078 md=n,l,c,i,dev...
1079 This tells the md driver to assemble a legacy RAID0 or LINEAR
1080 array without a superblock. n gives the md device number, l
1081 gives the level, 0 for RAID0 or -1 for LINEAR, c gives the chunk
1082 size as a base-2 logarithm offset by twelve, so 0 means 4K, 1
1083 means 8K. i is ignored (legacy support).
1084
1085
1087 /proc/mdstat
1088 Contains information about the status of currently running ar‐
1089 rays.
1090
1091 /proc/sys/dev/raid/speed_limit_min
1092 A readable and writable file that reflects the current "goal"
1093 rebuild speed for times when non-rebuild activity is current on
1094 an array. The speed is in Kibibytes per second, and is a per-
1095 device rate, not a per-array rate (which means that an array
1096 with more disks will shuffle more data for a given speed). The
1097 default is 1000.
1098
1099
1100 /proc/sys/dev/raid/speed_limit_max
1101 A readable and writable file that reflects the current "goal"
1102 rebuild speed for times when no non-rebuild activity is current
1103 on an array. The default is 200,000.
1104
1105
1106 SEE ALSO
1107 mdadm(8),
1108
1109
1110
1111 MD(4)