MD(4)                      Kernel Interfaces Manual                      MD(4)



NAME
md - Multiple Device driver aka Linux Software RAID

SYNOPSIS
/dev/mdn
/dev/md/n

DESCRIPTION
The md driver provides virtual devices that are created from one or
more independent underlying devices.  This array of devices often
contains redundancy, and hence the acronym RAID, which stands for a
Redundant Array of Independent Devices.

md supports RAID levels 1 (mirroring), 4 (striped array with parity
device), 5 (striped array with distributed parity information), 6
(striped array with distributed dual redundancy information), and 10
(striped and mirrored).  If some number of underlying devices fails
while using one of these levels, the array will continue to function;
this number is one for RAID levels 4 and 5, two for RAID level 6, all
but one (N-1) for RAID level 1, and dependent on configuration for
level 10.

md also supports a number of pseudo RAID (non-redundant) configurations
including RAID0 (striped array), LINEAR (catenated array), MULTIPATH (a
set of different interfaces to the same device), and FAULTY (a layer
over a single device into which errors can be injected).


MD SUPER BLOCK
Each device in an array may have a superblock which records information
about the structure and state of the array.  This allows the array to
be reliably re-assembled after a shutdown.

From Linux kernel version 2.6.10, md provides support for two different
formats of this superblock, and other formats can be added.  Prior to
this release, only one format was supported.

The common format — known as version 0.90 — has a superblock that is 4K
long and is written into a 64K aligned block that starts at least 64K
and less than 128K from the end of the device (i.e. to get the address
of the superblock, round the size of the device down to a multiple of
64K and then subtract 64K).  The available size of each device is the
amount of space before the superblock, so between 64K and 128K is lost
when a device is incorporated into an MD array.  This superblock stores
multi-byte fields in a processor-dependent manner, so arrays cannot
easily be moved between computers with different processors.
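
As an illustration only (this helper is not part of md and its name is
arbitrary), the location of a version 0.90 superblock can be computed
from the device size as follows:

   def sb_offset_kib(device_size_kib):
       # Round the device size down to a multiple of 64 KiB, then step
       # back one further 64 KiB block; the superblock therefore starts
       # at least 64 KiB and less than 128 KiB from the end of the device.
       return (device_size_kib // 64) * 64 - 64

   print(sb_offset_kib(1000300))   # 1000192, i.e. 108 KiB from the end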

The new format — known as version 1 — has a superblock that is normally
1K long, but can be longer.  It is normally stored between 8K and 12K
from the end of the device, on a 4K boundary, though variations can be
stored at the start of the device (version 1.1) or 4K from the start of
the device (version 1.2).  This superblock format stores multi-byte
data in a processor-independent format and supports up to hundreds of
component devices (version 0.90 only supports 28).

The superblock contains, among other things:

LEVEL  The manner in which the devices are arranged into the array
       (linear, raid0, raid1, raid4, raid5, raid10, multipath).

UUID   a 128 bit Universally Unique Identifier that identifies the
       array that this device is part of.

When a version 0.90 array is being reshaped (e.g. adding extra devices
to a RAID5), the version number is temporarily set to 0.91.  This
ensures that if the reshape process is stopped in the middle (e.g. by a
system crash) and the machine boots into an older kernel that does not
support reshaping, then the array will not be assembled (which would
cause data corruption) but will be left untouched until a kernel that
can complete the reshape process is used.

ARRAYS WITHOUT SUPERBLOCKS
While it is usually best to create arrays with superblocks so that they
can be assembled reliably, there are some circumstances where an array
without superblocks is preferred.  These include:

LEGACY ARRAYS
       Early versions of the md driver only supported Linear and Raid0
       configurations and did not use a superblock (which is less
       critical with these configurations).  While such arrays should
       be rebuilt with superblocks if possible, md continues to support
       them.

FAULTY Being a largely transparent layer over a different device, the
       FAULTY personality doesn't gain anything from having a
       superblock.

MULTIPATH
       It is often possible to detect devices which are different paths
       to the same storage directly rather than having a distinctive
       superblock written to the device and searched for on all paths.
       In this case, a MULTIPATH array with no superblock makes sense.

RAID1  In some configurations it might be desired to create a raid1
       configuration that does not use a superblock, and to maintain
       the state of the array elsewhere.  While not encouraged for
       general use, it does have special-purpose uses and is supported.

LINEAR
A linear array simply catenates the available space on each drive
together to form one large virtual drive.

One advantage of this arrangement over the more common RAID0
arrangement is that the array may be reconfigured at a later time with
an extra drive and so the array is made bigger without disturbing the
data that is on the array.  However this cannot be done on a live
array.

If a chunksize is given with a LINEAR array, the usable space on each
device is rounded down to a multiple of this chunksize.

RAID0
A RAID0 array (which has zero redundancy) is also known as a striped
array.  A RAID0 array is configured at creation with a Chunk Size which
must be a power of two, and at least 4 kibibytes.

The RAID0 driver assigns the first chunk of the array to the first
device, the second chunk to the second device, and so on until all
drives have been assigned one chunk.  This collection of chunks forms a
stripe.  Further chunks are gathered into stripes in the same way and
assigned to the remaining space on the drives.

If devices in the array are not all the same size, then once the
smallest device has been exhausted, the RAID0 driver starts collecting
chunks into smaller stripes that only span the drives which still have
remaining space.
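
For the simple case where all devices are the same size, the mapping
just described can be sketched as follows (the helper and its names are
illustrative only, not an md interface):

   def raid0_map(chunk_index, n_devices, chunk_size):
       # Chunk 0 goes to device 0, chunk 1 to device 1, and so on; each
       # full round of n_devices chunks forms one stripe, so the offset
       # within a device advances by one chunk per stripe.
       stripe = chunk_index // n_devices
       device = chunk_index % n_devices
       return device, stripe * chunk_size

   # Chunk 5 of a 3-device array with 64 KiB chunks lands on device 2,
   # 64 KiB into that device:
   print(raid0_map(5, 3, 64 * 1024))   # (2, 65536)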


RAID1
A RAID1 array is also known as a mirrored set (though mirrors tend to
provide reflected images, which RAID1 does not) or a plex.

Once initialised, each device in a RAID1 array contains exactly the
same data.  Changes are written to all devices in parallel.  Data is
read from any one device.  The driver attempts to distribute read
requests across all devices to maximise performance.

All devices in a RAID1 array should be the same size.  If they are not,
then only the amount of space available on the smallest device is used.
Any extra space on other devices is wasted.

RAID4
A RAID4 array is like a RAID0 array with an extra device for storing
parity.  This device is the last of the active devices in the array.
Unlike RAID0, RAID4 also requires that all stripes span all drives, so
extra space on devices that are larger than the smallest is wasted.

When any block in a RAID4 array is modified, the parity block for that
stripe (i.e. the block in the parity device at the same device offset
as the stripe) is also modified so that the parity block always
contains the "parity" for the whole stripe, i.e. its contents are
equivalent to the result of performing an exclusive-or operation
between all the data blocks in the stripe.

This allows the array to continue to function if one device fails.  The
data that was on that device can be calculated as needed from the
parity block and the other data blocks.
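
The parity relationship can be demonstrated with a short sketch
(illustrative only; the block contents here are made up):

   from functools import reduce

   def parity_block(blocks):
       # Byte-wise exclusive-or of all the blocks in a stripe.
       return bytes(reduce(lambda a, b: a ^ b, column)
                    for column in zip(*blocks))

   d0, d1, d2 = b"\x01\x02", b"\x04\x08", b"\x10\x20"
   p = parity_block([d0, d1, d2])

   # XOR-ing the surviving blocks with the parity block recovers the
   # block that was on the failed device:
   assert parity_block([d0, d2, p]) == d1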

RAID5
RAID5 is very similar to RAID4.  The difference is that the parity
blocks for each stripe, instead of being on a single device, are
distributed across all devices.  This allows more parallelism when
writing, as two different block updates will quite possibly affect
parity blocks on different devices, so there is less contention.

This also allows more parallelism when reading, as read requests are
distributed over all the devices in the array instead of all but one.


RAID6
RAID6 is similar to RAID5, but can handle the loss of any two devices
without data loss.  Accordingly, it requires N+2 drives to store N
drives worth of data.

The performance for RAID6 is slightly lower but comparable to RAID5 in
normal mode and single disk failure mode.  It is very slow in dual disk
failure mode, however.

RAID10
RAID10 provides a combination of RAID1 and RAID0, and is sometimes
known as RAID1+0.  Every datablock is duplicated some number of times,
and the resulting collection of datablocks is distributed over multiple
drives.

When configuring a RAID10 array it is necessary to specify the number
of replicas of each data block that are required (this will normally be
2) and whether the replicas should be 'near', 'offset' or 'far'.  (Note
that the 'offset' layout is only available from 2.6.18.)

When 'near' replicas are chosen, the multiple copies of a given chunk
are laid out consecutively across the stripes of the array, so the two
copies of a datablock will likely be at the same offset on two adjacent
devices.

When 'far' replicas are chosen, the multiple copies of a given chunk
are laid out quite distant from each other.  The first copy of all data
blocks will be striped across the early part of all drives in RAID0
fashion, and then the next copy of all blocks will be striped across a
later section of all drives, always ensuring that all copies of any
given block are on different drives.

The 'far' arrangement can give sequential read performance equal to
that of a RAID0 array, but at the cost of degraded write performance.

When 'offset' replicas are chosen, the multiple copies of a given chunk
are laid out on consecutive drives and at consecutive offsets.
Effectively each stripe is duplicated and the copies are offset by one
device.  This should give similar read characteristics to 'far' if a
suitably large chunk size is used, but without as much seeking for
writes.

It should be noted that the number of devices in a RAID10 array need
not be a multiple of the number of replicas of each data block, though
there must be at least as many devices as replicas.

If, for example, an array is created with 5 devices and 2 replicas,
then space equivalent to 2.5 of the devices will be available, and
every block will be stored on two different devices.
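
Ignoring chunk-size rounding, the usable capacity follows directly from
the replica count; a minimal sketch of the arithmetic for the example
above (the helper is illustrative, not an md interface):

   def raid10_capacity(n_devices, device_size, n_replicas):
       # Total space divided by the number of copies of each data block.
       return n_devices * device_size // n_replicas

   GiB = 1024 ** 3
   # Five 100 GiB devices with 2 replicas: roughly 250 GiB usable,
   # i.e. the equivalent of 2.5 devices.
   print(raid10_capacity(5, 100 * GiB, 2) // GiB)   # 250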

Finally, it is possible to have an array with both 'near' and 'far'
copies.  If an array is configured with 2 near copies and 2 far copies,
then there will be a total of 4 copies of each block, each on a
different drive.  This is an artifact of the implementation and is
unlikely to be of real value.


MULTIPATH
MULTIPATH is not really a RAID at all, as there is only one real device
in a MULTIPATH md array.  However there are multiple access points
(paths) to this device, and one of these paths might fail, so there are
some similarities.

A MULTIPATH array is composed of a number of logically different
devices, often fibre channel interfaces, that all refer to the same
real device.  If one of these interfaces fails (e.g. due to cable
problems), the multipath driver will attempt to redirect requests to
another interface.

FAULTY
The FAULTY md module is provided for testing purposes.  A faulty array
has exactly one component device and is normally assembled without a
superblock, so the md array created provides direct access to all of
the data in the component device.

The FAULTY module may be requested to simulate faults to allow testing
of other md levels or of filesystems.  Faults can be chosen to trigger
on read requests or write requests, and can be transient (a subsequent
read/write at the address will probably succeed) or persistent
(subsequent read/write of the same address will fail).  Further, read
faults can be "fixable" meaning that they persist until a write request
at the same address.

Fault types can be requested with a period.  In this case the fault
will recur repeatedly after the given number of requests of the
relevant type.  For example if persistent read faults have a period of
100, then every 100th read request would generate a fault, and the
faulty sector would be recorded so that subsequent reads on that sector
would also fail.
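
The counting behaviour can be modelled with a toy sketch (this is not
the kernel implementation, merely an illustration of the description
above):

   class PeriodicReadFaults:
       def __init__(self, period):
           self.period = period        # e.g. 100: every 100th read faults
           self.reads = 0
           self.bad_sectors = set()    # sectors remembered as faulty

       def read(self, sector):
           self.reads += 1
           if sector in self.bad_sectors:
               return "error"          # persistent fault on a known sector
           if self.period and self.reads % self.period == 0:
               self.bad_sectors.add(sector)
               return "error"          # this request triggers a new fault
           return "ok"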

There is a limit to the number of faulty sectors that are remembered.
Faults generated after this limit is exhausted are treated as
transient.

The list of faulty sectors can be flushed, and the active list of
failure modes can be cleared.


UNCLEAN SHUTDOWN
When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array
there is a possibility of inconsistency for short periods of time, as
each update requires at least two blocks to be written to different
devices, and these writes probably won't happen at exactly the same
time.  Thus if a system with one of these arrays is shut down in the
middle of a write operation (e.g. due to power failure), the array may
not be consistent.

To handle this situation, the md driver marks an array as "dirty"
before writing any data to it, and marks it as "clean" when the array
is being disabled, e.g. at shutdown.  If the md driver finds an array
to be dirty at startup, it proceeds to correct any possible
inconsistency.  For RAID1, this involves copying the contents of the
first drive onto all other drives.  For RAID4, RAID5 and RAID6 this
involves recalculating the parity for each stripe and making sure that
the parity block has the correct data.  For RAID10 it involves copying
one of the replicas of each block onto all the others.  This process,
known as "resynchronising" or "resync", is performed in the background.
The array can still be used, though possibly with reduced performance.

If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
drive) when it is restarted after an unclean shutdown, it cannot
recalculate parity, and so it is possible that data might be
undetectably corrupted.  The 2.4 md driver does not alert the operator
to this condition.  The 2.6 md driver will fail to start an array in
this condition without manual intervention, though this behaviour can
be overridden by a kernel parameter.


RECOVERY
If the md driver detects a write error on a device in a RAID1, RAID4,
RAID5, RAID6, or RAID10 array, it immediately disables that device
(marking it as faulty) and continues operation on the remaining
devices.  If there is a spare drive, the driver will start recreating
on one of the spare drives the data that was on the failed drive,
either by copying a working drive in a RAID1 configuration, or by doing
calculations with the parity block on RAID4, RAID5 or RAID6, or by
finding and copying originals for RAID10.

In kernels prior to about 2.6.15, a read error would cause the same
effect as a write error.  In later kernels, a read error will instead
cause md to attempt a recovery by overwriting the bad block, i.e. it
will find the correct data from elsewhere, write it over the block that
failed, and then try to read it back again.  If either the write or the
re-read fail, md will treat the error the same way that a write error
is treated, and will fail the whole device.

While this recovery process is happening, the md driver will monitor
accesses to the array and will slow down the rate of recovery if other
activity is happening, so that normal access to the array will not be
unduly affected.  When no other activity is happening, the recovery
process proceeds at full speed.  The actual speed targets for the two
different situations can be controlled by the speed_limit_min and
speed_limit_max control files mentioned below.


BITMAP WRITE-INTENT LOGGING
From Linux 2.6.13, md supports a bitmap based write-intent log.  If
configured, the bitmap is used to record which blocks of the array may
be out of sync.  Before any write request is honoured, md will make
sure that the corresponding bit in the log is set.  After a period of
time with no writes to an area of the array, the corresponding bit will
be cleared.

This bitmap is used for two optimisations.

Firstly, after an unclean shutdown, the resync process will consult the
bitmap and only resync those blocks that correspond to bits in the
bitmap that are set.  This can dramatically reduce resync time.

Secondly, when a drive fails and is removed from the array, md stops
clearing bits in the intent log.  If that same drive is re-added to the
array, md will notice and will only recover the sections of the drive
that are covered by bits in the intent log that are set.  This can
allow a device to be temporarily removed and reinserted without causing
an enormous recovery cost.

The intent log can be stored in a file on a separate device, or it can
be stored near the superblocks of an array which has superblocks.

It is possible to add an intent log to an active array, or remove an
intent log if one is present.

In 2.6.13, intent bitmaps are only supported with RAID1.  Other levels
with redundancy are supported from 2.6.15.


WRITE-BEHIND
From Linux 2.6.14, md supports WRITE-BEHIND on RAID1 arrays.

This allows certain devices in the array to be flagged as write-mostly.
MD will only read from such devices if there is no other option.

If a write-intent bitmap is also provided, write requests to
write-mostly devices will be treated as write-behind requests and md
will not wait for writes to those devices to complete before reporting
the write as complete to the filesystem.

This allows for a RAID1 with WRITE-BEHIND to be used to mirror data
over a slow link to a remote computer (providing the link isn't too
slow).  The extra latency of the remote link will not slow down normal
operations, but the remote system will still have a reasonably
up-to-date copy of all data.


RESTRIPING
Restriping, also known as Reshaping, is the process of re-arranging the
data stored in each stripe into a new layout.  This might involve
changing the number of devices in the array (so the stripes are wider),
changing the chunk size (so stripes are deeper or shallower), or
changing the arrangement of data and parity, possibly changing the raid
level (e.g. 1 to 5 or 5 to 6).

As of Linux 2.6.17, md can reshape a raid5 array to have more devices.
Other possibilities may follow in future kernels.

During any restripe process there is a 'critical section' during which
live data is being overwritten on disk.  For the operation of
increasing the number of drives in a raid5, this critical section
covers the first few stripes (the number being the product of the old
and new number of devices).  After this critical section is passed,
data is only written to areas of the array which no longer hold live
data — the live data has already been relocated away.

md is not able to ensure data preservation if there is a crash (e.g.
power failure) during the critical section.  If md is asked to start an
array which failed during a critical section of restriping, it will
fail to start the array.
409
410 To deal with this possibility, a user-space program must
411
412 · Disable writes to that section of the array (using the sysfs inter‐
413 face),
414
415 · Take a copy of the data somewhere (i.e. make a backup)
416
417 · Allow the process to continue and invalidate the backup and restore
418 write access once the critical section is passed, and
419
420 · Provide for restoring the critical data before restarting the array
421 after a system crash.
422
423 mdadm version 2.4 and later will do this for growing a RAID5 array.
424
425 For operations that do not change the size of the array, like simply
426 increasing chunk size, or converting RAID5 to RAID6 with one extra
427 device, the entire process is the critical section. In this case the
428 restripe will need to progress in stages as a section is suspended,
429 backed up, restriped, and released. This is not yet implemented.


SYSFS INTERFACE
All block devices appear as a directory in sysfs (usually mounted at
/sys).  For MD devices, this directory will contain a subdirectory
called md which contains various files for providing access to
information about the array.

This interface is documented more fully in the file
Documentation/md.txt which is distributed with the kernel sources.
That file should be consulted for full documentation.  The following
are just a selection of attribute files that are available.

md/sync_speed_min
       This value, if set, overrides the system-wide setting in
       /proc/sys/dev/raid/speed_limit_min for this array only.  Writing
       the value "system" to this file causes the system-wide setting
       to take effect.

md/sync_speed_max
       This is the partner of md/sync_speed_min and overrides
       /proc/sys/dev/raid/speed_limit_max described below.

md/sync_action
       This can be used to monitor and control the resync/recovery
       process of MD.  In particular, writing "check" here will cause
       the array to read all data blocks and check that they are
       consistent (e.g. parity is correct, or all mirror replicas are
       the same).  Any discrepancies found are NOT corrected.

       A count of problems found will be stored in md/mismatch_cnt.

       Alternately, "repair" can be written which will cause the same
       check to be performed, but any errors will be corrected.

       Finally, "idle" can be written to stop the check/repair process.
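
       For example, a check can be started and its result read back
       through these attribute files; the sketch below assumes the
       array is md0, that the sysfs layout is as described in this
       section, and that the caller has sufficient privileges:

          md = "/sys/block/md0/md"       # md0 is only an example

          with open(md + "/sync_action", "w") as f:
              f.write("check")           # begin a consistency check

          # ... later, once the check has finished ...
          with open(md + "/mismatch_cnt") as f:
              print("mismatches:", f.read().strip())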

md/stripe_cache_size
       This is only available on RAID5 and RAID6.  It records the size
       (in pages per device) of the stripe cache which is used for
       synchronising all read and write operations to the array.  The
       default is 128.  Increasing this number can increase performance
       in some situations, at some cost in system memory.
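
       For example, the cache of a hypothetical array md0 could be
       doubled from its default with a write to this attribute (root
       privileges assumed):

          # Raise the stripe cache from the default 128 to 256 pages
          # per device.
          with open("/sys/block/md0/md/stripe_cache_size", "w") as f:
              f.write("256")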


KERNEL PARAMETERS
The md driver recognises several different kernel parameters.

raid=noautodetect
       This will disable the normal detection of md arrays that happens
       at boot time.  If a drive is partitioned with MS-DOS style
       partitions, then if any of the 4 main partitions has a partition
       type of 0xFD, then that partition will normally be inspected to
       see if it is part of an MD array, and if any full arrays are
       found, they are started.  This kernel parameter disables this
       behaviour.

raid=partitionable

raid=part
       These are available in 2.6 and later kernels only.  They
       indicate that autodetected MD arrays should be created as
       partitionable arrays, with a different major device number to
       the original non-partitionable md arrays.  The device number is
       listed as mdp in /proc/devices.

md_mod.start_ro=1
       This tells md to start all arrays in read-only mode.  This is a
       soft read-only that will automatically switch to read-write on
       the first write request.  However until that write request,
       nothing is written to any device by md, and in particular, no
       resync or recovery operation is started.

md_mod.start_dirty_degraded=1
       As mentioned above, md will not normally start a RAID4, RAID5,
       or RAID6 that is both dirty and degraded as this situation can
       imply hidden data loss.  This can be awkward if the root
       filesystem is affected.  Using the module parameter allows such
       arrays to be started at boot time.  It should be understood that
       there is a real (though small) risk of data corruption in this
       situation.

md=n,dev,dev,...

md=dn,dev,dev,...
       This tells the md driver to assemble /dev/mdn from the listed
       devices.  It is only necessary to start the device holding the
       root filesystem this way.  Other arrays are best started once
       the system is booted.

       In 2.6 kernels, the d immediately after the = indicates that a
       partitionable device (e.g. /dev/md/d0) should be created rather
       than the original non-partitionable device.

md=n,l,c,i,dev...
       This tells the md driver to assemble a legacy RAID0 or LINEAR
       array without a superblock.  n gives the md device number, l
       gives the level, 0 for RAID0 or -1 for LINEAR, c gives the chunk
       size as a base-2 logarithm offset by twelve, so 0 means 4K, 1
       means 8K.  i is ignored (legacy support).
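
       The chunk-size encoding can be illustrated with a small sketch
       (the helper is only an illustration; the device names in the
       comment are examples):

          def chunk_size_bytes(c):
              # c is a base-2 logarithm offset by twelve:
              # 0 -> 4K, 1 -> 8K, 2 -> 16K, ...
              return 4096 << c

          # A legacy two-device RAID0 with 8 KiB chunks could therefore
          # be requested as:  md=0,0,1,0,/dev/sda1,/dev/sdb1
          print(chunk_size_bytes(1))   # 8192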


FILES
/proc/mdstat
       Contains information about the status of currently running
       arrays.

/proc/sys/dev/raid/speed_limit_min
       A readable and writable file that reflects the current goal
       rebuild speed for times when non-rebuild activity is current on
       an array.  The speed is in Kibibytes per second, and is a
       per-device rate, not a per-array rate (which means that an array
       with more discs will shuffle more data for a given speed).  The
       default is 100.

/proc/sys/dev/raid/speed_limit_max
       A readable and writable file that reflects the current goal
       rebuild speed for times when no non-rebuild activity is current
       on an array.  The default is 100,000.
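
       For example, the current status and rebuild limits can be
       inspected and adjusted from a short script; the value written
       below is only an example and writing requires root:

          with open("/proc/mdstat") as f:
              print(f.read())                    # current array status

          with open("/proc/sys/dev/raid/speed_limit_min", "w") as f:
              f.write("5000")                    # KiB/s per device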

SEE ALSO
mdadm(8), mkraid(8).



                                                                         MD(4)