MD(4)                      Kernel Interfaces Manual                      MD(4)


NAME
       md - Multiple Device driver aka Linux Software Raid

SYNOPSIS
       /dev/mdn
       /dev/md/n

DESCRIPTION
       The md driver provides virtual devices that are created from one or
       more independent underlying devices.  This array of devices often
       contains redundancy, and hence the acronym RAID, which stands for a
       Redundant Array of Independent Devices.

       md supports RAID levels 1 (mirroring), 4 (striped array with parity
       device), 5 (striped array with distributed parity information), 6
       (striped array with distributed dual redundancy information), and 10
       (striped and mirrored).  If some number of underlying devices fails
       while using one of these levels, the array will continue to function;
       this number is one for RAID levels 4 and 5, two for RAID level 6, all
       but one (N-1) for RAID level 1, and dependent on the configuration
       for level 10.

       md also supports a number of pseudo RAID (non-redundant)
       configurations including RAID0 (striped array), LINEAR (catenated
       array), MULTIPATH (a set of different interfaces to the same device),
       and FAULTY (a layer over a single device into which errors can be
       injected).

   MD SUPER BLOCK
       Each device in an array may have a superblock which records
       information about the structure and state of the array.  This allows
       the array to be reliably re-assembled after a shutdown.

       From Linux kernel version 2.6.10, md provides support for two
       different formats of this superblock, and other formats can be added.
       Prior to this release, only one format was supported.

       The common format — known as version 0.90 — has a superblock that is
       4K long and is written into a 64K aligned block that starts at least
       64K and less than 128K from the end of the device (i.e. to get the
       address of the superblock round the size of the device down to a
       multiple of 64K and then subtract 64K).  The available size of each
       device is the amount of space before the super block, so between 64K
       and 128K is lost when a device is incorporated into an MD array.
       This superblock stores multi-byte fields in a processor-dependent
       manner, so arrays cannot easily be moved between computers with
       different processors.

       The new format — known as version 1 — has a superblock that is
       normally 1K long, but can be longer.  It is normally stored between
       8K and 12K from the end of the device, on a 4K boundary, though
       variations can be stored at the start of the device (version 1.1) or
       4K from the start of the device (version 1.2).  This superblock
       format stores multibyte data in a processor-independent format and
       supports up to hundreds of component devices (version 0.90 only
       supports 28).

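       As an illustration only (this calculation is not part of any md
       interface), superblock locations consistent with the rules above can
       be derived from the device size roughly as follows:

           # Illustrative sketch, not md driver code: compute superblock
           # offsets consistent with the rules described in this section.
           def sb_offset_0_90(device_size):
               """Version 0.90: round down to a 64K multiple, minus 64K."""
               return (device_size // (64 * 1024)) * (64 * 1024) - 64 * 1024

           def sb_offset_1_0(device_size):
               """Version 1.0: a 4K-aligned offset 8K-12K from the end."""
               return ((device_size - 8 * 1024) // (4 * 1024)) * (4 * 1024)

           # For a 1 GiB component device, sb_offset_0_90(1 << 30) is 64K
           # from the end of the device, and that offset is also the
           # usable size of the device.
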
       The superblock contains, among other things:

       LEVEL  The manner in which the devices are arranged into the array
              (linear, raid0, raid1, raid4, raid5, raid10, multipath).

       UUID   a 128 bit Universally Unique Identifier that identifies the
              array that this device is part of.

              When a version 0.90 array is being reshaped (e.g. adding extra
              devices to a RAID5), the version number is temporarily set to
              0.91.  This ensures that if the reshape process is stopped in
              the middle (e.g. by a system crash) and the machine boots into
              an older kernel that does not support reshaping, then the
              array will not be assembled (which would cause data
              corruption) but will be left untouched until a kernel that can
              complete the reshape process is used.

   ARRAYS WITHOUT SUPERBLOCKS
       While it is usually best to create arrays with superblocks so that
       they can be assembled reliably, there are some circumstances where an
       array without superblocks is preferred.  These include:

       LEGACY ARRAYS
              Early versions of the md driver only supported Linear and
              Raid0 configurations and did not use a superblock (which is
              less critical with these configurations).  While such arrays
              should be rebuilt with superblocks if possible, md continues
              to support them.

       FAULTY Being a largely transparent layer over a different device, the
              FAULTY personality doesn't gain anything from having a
              superblock.

       MULTIPATH
              It is often possible to detect devices which are different
              paths to the same storage directly rather than having a
              distinctive superblock written to the device and searched for
              on all paths.  In this case, a MULTIPATH array with no
              superblock makes sense.

       RAID1  In some configurations it might be desired to create a RAID1
              configuration that does not use a superblock, and to maintain
              the state of the array elsewhere.  While not encouraged for
              general use, it does have special-purpose uses and is
              supported.

   LINEAR
       A linear array simply catenates the available space on each drive
       together to form one large virtual drive.

       One advantage of this arrangement over the more common RAID0
       arrangement is that the array may be reconfigured at a later time
       with an extra drive, and so the array is made bigger without
       disturbing the data that is on the array.  However this cannot be
       done on a live array.

       If a chunksize is given with a LINEAR array, the usable space on each
       device is rounded down to a multiple of this chunksize.

   RAID0
       A RAID0 array (which has zero redundancy) is also known as a striped
       array.  A RAID0 array is configured at creation with a Chunk Size
       which must be a power of two, and at least 4 kibibytes.

       The RAID0 driver assigns the first chunk of the array to the first
       device, the second chunk to the second device, and so on until all
       drives have been assigned one chunk.  This collection of chunks forms
       a stripe.  Further chunks are gathered into stripes in the same way,
       and are assigned to the remaining space in the drives.

       If devices in the array are not all the same size, then once the
       smallest device has been exhausted, the RAID0 driver starts
       collecting chunks into smaller stripes that only span the drives
       which still have remaining space.

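       As a sketch only (not the driver's code, and assuming equally sized
       devices), the round-robin placement of chunks can be written as:

           # Illustrative sketch: map a chunk index in a RAID0 array of
           # equally sized devices to (device index, chunk offset on device).
           def raid0_map(chunk_index, n_devices):
               stripe = chunk_index // n_devices   # which stripe it is in
               device = chunk_index % n_devices    # round-robin placement
               return device, stripe

           # With 3 devices, chunks 0,1,2 form the first stripe and
           # raid0_map(4, 3) == (1, 1): the second device, second chunk.
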
   RAID1
       A RAID1 array is also known as a mirrored set (though mirrors tend to
       provide reflected images, which RAID1 does not) or a plex.

       Once initialised, each device in a RAID1 array contains exactly the
       same data.  Changes are written to all devices in parallel.  Data is
       read from any one device.  The driver attempts to distribute read
       requests across all devices to maximise performance.

       All devices in a RAID1 array should be the same size.  If they are
       not, then only the amount of space available on the smallest device
       is used.  Any extra space on other devices is wasted.

   RAID4
       A RAID4 array is like a RAID0 array with an extra device for storing
       parity.  This device is the last of the active devices in the array.
       Unlike RAID0, RAID4 also requires that all stripes span all drives,
       so extra space on devices that are larger than the smallest is
       wasted.

       When any block in a RAID4 array is modified, the parity block for
       that stripe (i.e. the block in the parity device at the same device
       offset as the stripe) is also modified so that the parity block
       always contains the "parity" for the whole stripe.  That is, its
       contents are equivalent to the result of performing an exclusive-or
       operation between all the data blocks in the stripe.

       This allows the array to continue to function if one device fails.
       The data that was on that device can be calculated as needed from
       the parity block and the other data blocks.

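       A short sketch (not driver code) illustrates the parity relationship
       and how a missing block is rebuilt from the remaining blocks:

           # Illustrative sketch: the parity block is the byte-wise XOR of
           # the data blocks, and any one missing block can be rebuilt the
           # same way from the blocks that survive.
           from functools import reduce

           def xor_blocks(blocks):
               return bytes(reduce(lambda a, b: a ^ b, column)
                            for column in zip(*blocks))

           data = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks of one stripe
           parity = xor_blocks(data)            # stored on the parity device

           # Rebuild the second block as if its device had failed:
           assert xor_blocks([data[0], data[2], parity]) == data[1]
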
   RAID5
       RAID5 is very similar to RAID4.  The difference is that the parity
       blocks for each stripe, instead of being on a single device, are
       distributed across all devices.  This allows more parallelism when
       writing, as two different block updates will quite possibly affect
       parity blocks on different devices, so there is less contention.

       This also allows more parallelism when reading, as read requests are
       distributed over all the devices in the array instead of all but one.

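       For illustration only (md supports several parity-placement layouts;
       this simple rotation is just a sketch of the idea), distributing the
       parity blocks might look like:

           # Illustrative sketch: in stripe s of an n-device RAID5, one
           # device holds the parity block and the others hold data.
           def raid5_parity_device(stripe, n_devices):
               return (n_devices - 1 - stripe) % n_devices

           # With 4 devices the parity lands on device 3, 2, 1, 0, 3, ...
           # so successive stripe updates usually hit different devices.
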
   RAID6
       RAID6 is similar to RAID5, but can handle the loss of any two devices
       without data loss.  Accordingly, it requires N+2 drives to store N
       drives worth of data.

       The performance for RAID6 is slightly lower than, but comparable to,
       RAID5 in normal mode and single disk failure mode.  It is very slow
       in dual disk failure mode, however.

   RAID10
       RAID10 provides a combination of RAID1 and RAID0, and is sometimes
       known as RAID1+0.  Every datablock is duplicated some number of
       times, and the resulting collection of datablocks is distributed
       over multiple drives.

       When configuring a RAID10 array it is necessary to specify the number
       of replicas of each data block that are required (this will normally
       be 2) and whether the replicas should be 'near', 'offset' or 'far'.
       (Note that the 'offset' layout is only available from 2.6.18.)

       When 'near' replicas are chosen, the multiple copies of a given chunk
       are laid out consecutively across the stripes of the array, so the
       two copies of a datablock will likely be at the same offset on two
       adjacent devices.

       When 'far' replicas are chosen, the multiple copies of a given chunk
       are laid out quite distant from each other.  The first copy of all
       data blocks will be striped across the early part of all drives in
       RAID0 fashion, and then the next copy of all blocks will be striped
       across a later section of all drives, always ensuring that all copies
       of any given block are on different drives.

       The 'far' arrangement can give sequential read performance equal to
       that of a RAID0 array, but at the cost of degraded write performance.

       When 'offset' replicas are chosen, the multiple copies of a given
       chunk are laid out on consecutive drives and at consecutive offsets.
       Effectively each stripe is duplicated and the copies are offset by
       one device.  This should give similar read characteristics to 'far'
       if a suitably large chunk size is used, but without as much seeking
       for writes.

       It should be noted that the number of devices in a RAID10 array need
       not be a multiple of the number of replicas of each data block;
       there must, however, be at least as many devices as replicas.

       If, for example, an array is created with 5 devices and 2 replicas,
       then space equivalent to 2.5 of the devices will be available, and
       every block will be stored on two different devices.

       Finally, it is possible to have an array with both 'near' and 'far'
       copies.  If an array is configured with 2 near copies and 2 far
       copies, then there will be a total of 4 copies of each block, each
       on a different drive.  This is an artifact of the implementation and
       is unlikely to be of real value.

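       The capacity arithmetic in the example above can be sketched as
       follows (illustration only, assuming equally sized devices):

           # Illustrative sketch: the usable space of a RAID10 array is the
           # total space divided by the number of replicas of each block.
           def raid10_capacity(n_devices, device_size, replicas=2):
               return n_devices * device_size // replicas

           # 5 devices of 1000 units with 2 replicas:
           # raid10_capacity(5, 1000) == 2500, i.e. 2.5 devices worth.
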
   MULTIPATH
       MULTIPATH is not really a RAID at all, as there is only one real
       device in a MULTIPATH md array.  However there are multiple access
       points (paths) to this device, and one of these paths might fail, so
       there are some similarities.

       A MULTIPATH array is composed of a number of logically different
       devices, often fibre channel interfaces, that all refer to the same
       real device.  If one of these interfaces fails (e.g. due to cable
       problems), the multipath driver will attempt to redirect requests to
       another interface.

   FAULTY
       The FAULTY md module is provided for testing purposes.  A faulty
       array has exactly one component device and is normally assembled
       without a superblock, so the md array created provides direct access
       to all of the data in the component device.

       The FAULTY module may be requested to simulate faults to allow
       testing of other md levels or of filesystems.  Faults can be chosen
       to trigger on read requests or write requests, and can be transient
       (a subsequent read/write at the address will probably succeed) or
       persistent (subsequent read/write of the same address will fail).
       Further, read faults can be "fixable", meaning that they persist
       until a write request at the same address.

       Fault types can be requested with a period.  In this case the fault
       will recur repeatedly after the given number of requests of the
       relevant type.  For example, if persistent read faults have a period
       of 100, then every 100th read request would generate a fault, and
       the faulty sector would be recorded so that subsequent reads on that
       sector would also fail.

       There is a limit to the number of faulty sectors that are
       remembered.  Faults generated after this limit is exhausted are
       treated as transient.

       The list of faulty sectors can be flushed, and the active list of
       failure modes can be cleared.

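       The periodic behaviour can be pictured with a toy model (this is an
       illustration only, not the FAULTY module's implementation):

           # Toy model: persistent read faults with a period N fail every
           # Nth read and remember the sector, so later reads of that
           # sector also fail.
           class FaultyReads:
               def __init__(self, period):
                   self.period = period
                   self.reads = 0
                   self.bad_sectors = set()

               def read(self, sector):
                   self.reads += 1
                   if (sector in self.bad_sectors
                           or self.reads % self.period == 0):
                       self.bad_sectors.add(sector)
                       raise IOError("simulated fault at sector %d" % sector)
                   return b"data"
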
   UNCLEAN SHUTDOWN
       When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10
       array, there is a possibility of inconsistency for short periods of
       time, as each update requires at least two blocks to be written to
       different devices, and these writes probably won't happen at exactly
       the same time.  Thus if a system with one of these arrays is shut
       down in the middle of a write operation (e.g. due to power failure),
       the array may not be consistent.

       To handle this situation, the md driver marks an array as "dirty"
       before writing any data to it, and marks it as "clean" when the
       array is being disabled, e.g. at shutdown.  If the md driver finds
       an array to be dirty at startup, it proceeds to correct any possible
       inconsistency.  For RAID1, this involves copying the contents of the
       first drive onto all other drives.  For RAID4, RAID5 and RAID6 this
       involves recalculating the parity for each stripe and making sure
       that the parity block has the correct data.  For RAID10 it involves
       copying one of the replicas of each block onto all the others.  This
       process, known as "resynchronising" or "resync", is performed in the
       background.  The array can still be used, though possibly with
       reduced performance.

       If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
       drive) when it is restarted after an unclean shutdown, it cannot
       recalculate parity, and so it is possible that data might be
       undetectably corrupted.  The 2.4 md driver does not alert the
       operator to this condition.  The 2.6 md driver will fail to start an
       array in this condition without manual intervention, though this
       behaviour can be overridden by a kernel parameter.

   RECOVERY
       If the md driver detects a write error on a device in a RAID1,
       RAID4, RAID5, RAID6, or RAID10 array, it immediately disables that
       device (marking it as faulty) and continues operation on the
       remaining devices.  If there is a spare drive, the driver will start
       recreating on one of the spare drives the data that was on the
       failed drive, either by copying a working drive in a RAID1
       configuration, or by doing calculations with the parity block on
       RAID4, RAID5 or RAID6, or by finding and copying originals for
       RAID10.

       In kernels prior to about 2.6.15, a read error would cause the same
       effect as a write error.  In later kernels, a read error will
       instead cause md to attempt a recovery by overwriting the bad block,
       i.e. it will find the correct data from elsewhere, write it over the
       block that failed, and then try to read it back again.  If either
       the write or the re-read fail, md will treat the error the same way
       that a write error is treated, and will fail the whole device.

       While this recovery process is happening, the md driver will monitor
       accesses to the array and will slow down the rate of recovery if
       other activity is happening, so that normal access to the array will
       not be unduly affected.  When no other activity is happening, the
       recovery process proceeds at full speed.  The actual speed targets
       for the two different situations can be controlled by the
       speed_limit_min and speed_limit_max control files mentioned below.

   BITMAP WRITE-INTENT LOGGING
       From Linux 2.6.13, md supports a bitmap based write-intent log.  If
       configured, the bitmap is used to record which blocks of the array
       may be out of sync.  Before any write request is honoured, md will
       make sure that the corresponding bit in the log is set.  After a
       period of time with no writes to an area of the array, the
       corresponding bit will be cleared.

       This bitmap is used for two optimisations.

       Firstly, after an unclean shutdown, the resync process will consult
       the bitmap and only resync those blocks that correspond to bits in
       the bitmap that are set.  This can dramatically reduce resync time.

       Secondly, when a drive fails and is removed from the array, md stops
       clearing bits in the intent log.  If that same drive is re-added to
       the array, md will notice and will only recover the sections of the
       drive that are covered by bits in the intent log that are set.  This
       can allow a device to be temporarily removed and reinserted without
       causing an enormous recovery cost.

       The intent log can be stored in a file on a separate device, or it
       can be stored near the superblocks of an array which has
       superblocks.

       It is possible to add an intent log to an active array, or remove an
       intent log if one is present.

       In 2.6.13, intent bitmaps are only supported with RAID1.  Other
       levels with redundancy are supported from 2.6.15.

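       The bookkeeping can be sketched with a simplified model (this is an
       illustration of the idea, not the kernel implementation):

           # Simplified model: one bit per region of the array.  A region's
           # bit is set before it is written and cleared once the region
           # has been idle; resync only needs regions whose bit is set.
           class WriteIntentBitmap:
               def __init__(self, region_size):
                   self.region_size = region_size
                   self.dirty = set()      # regions that may be out of sync

               def before_write(self, offset):
                   self.dirty.add(offset // self.region_size)

               def after_idle(self, offset):
                   self.dirty.discard(offset // self.region_size)

               def regions_to_resync(self):
                   return sorted(self.dirty)
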
   WRITE-BEHIND
       From Linux 2.6.14, md supports WRITE-BEHIND on RAID1 arrays.

       This allows certain devices in the array to be flagged as
       write-mostly.  MD will only read from such devices if there is no
       other option.

       If a write-intent bitmap is also provided, write requests to
       write-mostly devices will be treated as write-behind requests and md
       will not wait for writes to those devices to complete before
       reporting the write as complete to the filesystem.

       This allows for a RAID1 with WRITE-BEHIND to be used to mirror data
       over a slow link to a remote computer (providing the link isn't too
       slow).  The extra latency of the remote link will not slow down
       normal operations, but the remote system will still have a
       reasonably up-to-date copy of all data.

   RESTRIPING
       Restriping, also known as Reshaping, is the process of re-arranging
       the data stored in each stripe into a new layout.  This might
       involve changing the number of devices in the array (so the stripes
       are wider), changing the chunk size (so stripes are deeper or
       shallower), or changing the arrangement of data and parity, possibly
       changing the raid level (e.g. 1 to 5 or 5 to 6).

       As of Linux 2.6.17, md can reshape a raid5 array to have more
       devices.  Other possibilities may follow in future kernels.

       During any restripe process there is a 'critical section' during
       which live data is being overwritten on disk.  For the operation of
       increasing the number of drives in a raid5, this critical section
       covers the first few stripes (the number being the product of the
       old and new number of devices).  After this critical section is
       passed, data is only written to areas of the array which no longer
       hold live data — the live data has already been relocated away.

       md is not able to ensure data preservation if there is a crash (e.g.
       power failure) during the critical section.  If md is asked to start
       an array which failed during a critical section of restriping, it
       will fail to start the array.

       To deal with this possibility, a user-space program must

       ·   Disable writes to that section of the array (using the sysfs
           interface),

       ·   Take a copy of the data somewhere (i.e. make a backup),

       ·   Allow the process to continue and invalidate the backup and
           restore write access once the critical section is passed, and

       ·   Provide for restoring the critical data before restarting the
           array after a system crash.

       mdadm version 2.4 and later will do this for growing a RAID5 array.

       For operations that do not change the size of the array, like simply
       increasing chunk size, or converting RAID5 to RAID6 with one extra
       device, the entire process is the critical section.  In this case
       the restripe will need to progress in stages, as a section is
       suspended, backed up, restriped, and released.  This is not yet
       implemented.

   SYSFS INTERFACE
       All block devices appear as a directory in sysfs (usually mounted at
       /sys).  For MD devices, this directory will contain a subdirectory
       called md which contains various files for providing access to
       information about the array.

       This interface is documented more fully in the file
       Documentation/md.txt which is distributed with the kernel sources.
       That file should be consulted for full documentation.  The following
       are just a selection of the attribute files that are available.

       md/sync_speed_min
              This value, if set, overrides the system-wide setting in
              /proc/sys/dev/raid/speed_limit_min for this array only.
              Writing the value "system" to this file causes the
              system-wide setting to have effect.

       md/sync_speed_max
              This is the partner of md/sync_speed_min and overrides
              /proc/sys/dev/raid/speed_limit_max described below.

       md/sync_action
              This can be used to monitor and control the resync/recovery
              process of MD.  In particular, writing "check" here will
              cause the array to read all data blocks and check that they
              are consistent (e.g. parity is correct, or all mirror
              replicas are the same).  Any discrepancies found are NOT
              corrected.

              A count of problems found will be stored in
              md/mismatch_count.

              Alternately, "repair" can be written, which will cause the
              same check to be performed but any errors will be corrected.

              Finally, "idle" can be written to stop the check/repair
              process.

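              For example, a consistency check could be started and
              monitored from user space roughly as follows (a sketch only;
              /sys/block/md0/md is assumed to be the array's md directory,
              and the attribute names used are the ones listed above, so
              consult Documentation/md.txt for the authoritative details):

                  # Sketch: start a "check" on the array behind /dev/md0
                  # and read the count of discrepancies once it finishes.
                  from pathlib import Path

                  md = Path("/sys/block/md0/md")
                  (md / "sync_action").write_text("check\n")
                  # ... wait for the check to complete, then:
                  print((md / "mismatch_count").read_text().strip())
                  # (md / "sync_action").write_text("idle\n")  # stop early
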
       md/stripe_cache_size
              This is only available on RAID5 and RAID6.  It records the
              size (in pages per device) of the stripe cache which is used
              for synchronising all read and write operations to the array.
              The default is 128.  Increasing this number can increase
              performance in some situations, at some cost in system
              memory.

   KERNEL PARAMETERS
       The md driver recognises several different kernel parameters.

       raid=noautodetect
              This will disable the normal detection of md arrays that
              happens at boot time.  If a drive is partitioned with MS-DOS
              style partitions, then if any of the 4 main partitions has a
              partition type of 0xFD, then that partition will normally be
              inspected to see if it is part of an MD array, and if any
              full arrays are found, they are started.  This kernel
              parameter disables this behaviour.

       raid=partitionable

       raid=part
              These are available in 2.6 and later kernels only.  They
              indicate that autodetected MD arrays should be created as
              partitionable arrays, with a different major device number to
              the original non-partitionable md arrays.  The device number
              is listed as mdp in /proc/devices.

       md_mod.start_ro=1
              This tells md to start all arrays in read-only mode.  This is
              a soft read-only that will automatically switch to read-write
              on the first write request.  However, until that write
              request, nothing is written to any device by md, and in
              particular, no resync or recovery operation is started.

       md_mod.start_dirty_degraded=1
              As mentioned above, md will not normally start a RAID4,
              RAID5, or RAID6 that is both dirty and degraded, as this
              situation can imply hidden data loss.  This can be awkward if
              the root filesystem is affected.  Using this module parameter
              allows such arrays to be started at boot time.  It should be
              understood that there is a real (though small) risk of data
              corruption in this situation.

       md=n,dev,dev,...

       md=dn,dev,dev,...
              This tells the md driver to assemble /dev/mdn from the listed
              devices.  It is only necessary to start the device holding
              the root filesystem this way.  Other arrays are best started
              once the system is booted.

              In 2.6 kernels, the d immediately after the = indicates that
              a partitionable device (e.g. /dev/md/d0) should be created
              rather than the original non-partitionable device.

       md=n,l,c,i,dev...
              This tells the md driver to assemble a legacy RAID0 or LINEAR
              array without a superblock.  n gives the md device number, l
              gives the level, 0 for RAID0 or -1 for LINEAR, c gives the
              chunk size as a base-2 logarithm offset by twelve, so 0 means
              4K and 1 means 8K.  i is ignored (legacy support).

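              The chunk size encoding used by this parameter can be
              illustrated with a short sketch:

                  # Sketch: the c value is a base-2 logarithm offset by
                  # twelve, so the chunk size in bytes is 2 ** (c + 12).
                  def legacy_chunk_size(c):
                      return 1 << (c + 12)

                  # legacy_chunk_size(0) == 4096   (4K)
                  # legacy_chunk_size(1) == 8192   (8K)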

FILES
       /proc/mdstat
              Contains information about the status of currently running
              arrays.

       /proc/sys/dev/raid/speed_limit_min
              A readable and writable file that reflects the current goal
              rebuild speed for times when non-rebuild activity is current
              on an array.  The speed is in Kibibytes per second, and is a
              per-device rate, not a per-array rate (which means that an
              array with more discs will shuffle more data for a given
              speed).  The default is 100.

       /proc/sys/dev/raid/speed_limit_max
              A readable and writable file that reflects the current goal
              rebuild speed for times when no non-rebuild activity is
              current on an array.  The default is 100,000.
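
       These files can be read and written like ordinary proc files; for
       example (a sketch only, and writing them normally requires root
       privileges):

           # Sketch: inspect and raise the per-device rebuild speed goals.
           from pathlib import Path

           raid = Path("/proc/sys/dev/raid")
           print(raid.joinpath("speed_limit_min").read_text().strip())
           print(raid.joinpath("speed_limit_max").read_text().strip())
           raid.joinpath("speed_limit_min").write_text("1000\n")  # KiB/s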

SEE ALSO
       mdadm(8), mkraid(8).



                                                                         MD(4)