QDisk(5)                      Cluster Quorum Disk                     QDisk(5)

NAME

       qdisk - a disk-based quorum daemon for CMAN / Linux-Cluster

1. Overview

1.1. Problem

       In some situations, it may be necessary or desirable to sustain a
       majority node failure of a cluster without introducing the need for
       asymmetric cluster configurations (e.g. client-server, or heavily-
       weighted voting nodes).

1.2. Design Requirements

       * Ability to sustain 1..(n-1) of n simultaneous node failures, without
       the danger of a simple network partition causing a split brain.  That
       is, we need to be able to ensure that the majority failure case is not
       merely the result of a network partition.

       * Ability to use external reasons for deciding which partition is the
       quorate partition in a partitioned cluster.  For example, a user may
       have a service running on one node, and that node must always be the
       master in the event of a network partition.  Or, a node might lose all
       network connectivity except the cluster communication path - in which
       case, a user may wish that node to be evicted from the cluster.

       * Integration with CMAN.  We must not require CMAN to run with us (or
       without us).  Linux-Cluster does not require a quorum disk normally -
       introducing new requirements on how Linux-Cluster operates is not
       allowed.

       * Data integrity.  In order to recover from a majority failure,
       fencing is required.  The fencing subsystem is already provided by
       Linux-Cluster.

       * Non-reliance on hardware- or protocol-specific methods (e.g. SCSI
       reservations).  This ensures the quorum disk algorithm can be used on
       the widest range of hardware configurations possible.

       * Little or no memory allocation after initialization.  In critical
       paths during failover, we do not want to have to worry about being
       killed during a memory-pressure situation because we trigger a page
       fault and the Linux OOM killer responds...

1.3. Hardware Considerations and Requirements

1.3.1. Concurrent, Synchronous, Read/Write Access

       This quorum daemon requires a shared block device with concurrent
       read/write access from all nodes in the cluster.  The shared block
       device can be a multi-port SCSI RAID array, a Fibre Channel RAID SAN,
       a RAIDed iSCSI target, or even GNBD.  The quorum daemon uses O_DIRECT
       to write to the device.

1.3.2. Bargain-basement JBODs need not apply

       There is a minimum performance requirement inherent in using disk-
       based cluster quorum algorithms, so design your cluster accordingly.
       Using a cheap JBOD with old SCSI-2 disks on a multi-initiator bus will
       cause problems at the first load spike.  Plan your loads accordingly;
       a node's inability to write to the quorum disk in a timely manner will
       cause the cluster to evict the node.  Using host-RAID or multi-
       initiator parallel SCSI configurations with the qdisk daemon is
       unlikely to work, and will probably cause administrators a lot of
       frustration.  That said, because the timeouts are configurable, most
       hardware should work if the timeouts are set high enough.

1.3.3. Fencing is Required

       In order to maintain data integrity under all failure scenarios, use
       of this quorum daemon requires adequate fencing, preferably power-
       based fencing.  Watchdog timers and software-based solutions to reboot
       the node internally, while possibly sufficient, are not considered
       'fencing' for the purposes of using the quorum disk.

1.4. Limitations

       * At this time, this daemon supports a maximum of 16 nodes.  This is
       primarily a scalability issue: as we increase the node count, we
       increase the amount of synchronous I/O contention on the shared quorum
       disk.

       * Cluster node IDs must be statically configured in cluster.conf and
       must be numbered from 1..16 (there can be gaps, of course).

       * Cluster node votes must all be 1.

       * CMAN must be running before the qdisk program can operate at full
       capacity.  If CMAN is not running, qdisk will wait for it.

       * CMAN's eviction timeout should be at least 2x the quorum daemon's to
       give the quorum daemon adequate time to converge on a master during a
       failure + load spike situation.  See section 3.1.1 for specific
       details.

       * For 'all-but-one' failure operation, the total number of votes
       assigned to the quorum device should be equal to or greater than the
       total number of node-votes in the cluster.  While it is possible to
       assign only one (or a few) votes to the quorum device, the effects of
       doing so have not been explored.

       * For 'tiebreaker' operation in a two-node cluster, unset CMAN's
       two_node flag (or set it to 0), set CMAN's expected votes to '3', set
       each node's vote to '1', and leave qdisk's vote count unset.  This
       will allow the cluster to operate if either both nodes are online, or
       a single node whose heuristics pass.  See section 3.3.2 for an
       example.

       * Currently, the quorum disk daemon is difficult to use with CLVM if
       the quorum disk resides on a CLVM logical volume.  CLVM requires a
       quorate cluster to operate correctly, which introduces a chicken-and-
       egg problem for starting the cluster: CLVM needs quorum, but the
       quorum daemon needs CLVM (if and only if the quorum device lies on
       CLVM-managed storage).  One way to work around this is to *not* set
       the cluster's expected votes to include the quorum daemon's votes.
       Bring all nodes online, and start the quorum daemon *after* the whole
       cluster is running.  This will allow the expected votes to increase
       naturally, as sketched below.
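
       For a three-node cluster whose quorum device contributes 3 votes (as
       in the example in section 3.3.1), a minimal sketch of this workaround
       (the values are illustrative) is to count only the node votes at
       startup:

            <cman expected_votes="3" .../>
            <!-- rather than expected_votes="6"; qdiskd's 3 votes are added
                 once the daemon is started on the running cluster -->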

2. Algorithms

2.1. Heartbeating & Liveliness Determination

       Nodes update individual status blocks on the quorum disk at a user-
       defined rate.  Each write of a status block alters the timestamp,
       which is what other nodes use to decide whether a node has hung or
       not.  After a user-defined number of 'misses' (that is, failures to
       update a timestamp), a node is declared offline.  After a certain
       number of 'hits' (changed timestamp + "i am alive" state), the node is
       declared online.

       The status block contains additional information, such as a bitmask of
       the nodes that the writing node believes are online.  Some of this
       information is used by the master, while some is just for performance
       recording, and may be used at a later time.  The most important pieces
       of information a node writes to its status block are:

            - Timestamp
            - Internal state (available / not available)
            - Score
            - Known max score (may be used in the future to detect invalid
              configurations)
            - Vote/bid messages
            - Other nodes it thinks are online
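
       As a concrete illustration of the liveliness determination above,
       with interval="1" and tko="10" (see section 3.1), a hung node is
       declared offline after roughly 10 seconds of missed writes, and is
       declared online again after floor(10/3) = 3 consecutive timely
       updates (the default tko_up).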

2.2. Scoring & Heuristics

       The administrator can configure up to 10 purely arbitrary heuristics,
       and must exercise caution in doing so.  At least one administrator-
       defined heuristic is required for operation, but it is generally a
       good idea to have more than one heuristic.  By default, only nodes
       scoring over 1/2 of the total maximum score will claim they are
       available via the quorum disk, and a node (master or otherwise) whose
       score drops too low will remove itself (usually, by rebooting).

       The heuristics themselves can be any command executable by 'sh -c'.
       For example, in early testing the following was used:

            <heuristic program="[ -f /quorum ]" score="10" interval="2"/>

       This is a literal sh-ism which tests for the existence of a file
       called "/quorum".  Without that file, the node would claim it was
       unavailable.  This is an awful example, and should never, ever be used
       in production, but is provided as an example of what one could do...

       Typically, the heuristics should be snippets of shell code or commands
       which help determine a node's usefulness to the cluster or clients.
       Ideally, you want to add traces for all of your network paths (e.g.
       check links, or ping routers), and methods to detect availability of
       shared storage; a sketch of such a heuristic set is shown below.
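
       For instance, a heuristic set along those lines might look like the
       following sketch (the router address 192.168.1.1 and the device path
       /dev/mapper/shared are illustrative placeholders, not defaults):

            <heuristic program="ping -c1 -w1 192.168.1.1" score="1"
                       interval="2" tko="3"/>
            <heuristic program="[ -b /dev/mapper/shared ]" score="1"
                       interval="2" tko="3"/>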

2.3. Master Election

       Only one master is present at any one time in the cluster, regardless
       of how many partitions exist within the cluster itself.  The master is
       elected by a simple voting scheme in which the node with the lowest
       node ID which believes it is capable of running (i.e. scores high
       enough) bids for master status.  If the other nodes agree, it becomes
       the master.  This algorithm is run whenever no master is present.

       If another node comes online with a lower node ID while a node is
       still bidding for master status, the bidding node will rescind its bid
       and vote for the lower node ID.  If a master dies or a bidding node
       dies, the voting algorithm is started over.  The voting algorithm
       typically takes two passes to complete.

       Master deaths take marginally longer to recover from than non-master
       deaths, because a new master must be elected before the old master can
       be evicted & fenced.

2.4. Master Duties

       The master node decides who is or is not in the master partition, and
       handles eviction of dead nodes (both via the quorum disk and via the
       linux-cluster fencing system, using the cman_kill_node() API).

2.5. How it All Ties Together

       When a master is present, and if the master believes a node to be
       online, that node will advertise to CMAN that the quorum disk is
       available.  The master will only grant a node membership if:

            (a) CMAN believes the node to be online, and
            (b) that node has made enough consecutive, timely writes to the
                quorum disk, and
            (c) the node has a high enough score to consider itself online.

3. Configuration

3.1. The <quorumd> tag

       This tag is a child of the top-level <cluster> tag.

        <quorumd
         interval="1"
            This is the frequency of read/write cycles, in seconds.

         tko="10"
            This is the number of cycles a node must miss in order to be
            declared dead.  The default for this number is dependent on the
            configured token timeout.

         tko_up="X"
            This is the number of cycles a node must be seen in order to be
            declared online.  Default is floor(tko/3).

         upgrade_wait="2"
            This is the number of cycles a node must wait before initiating a
            bid for master status after heuristic scoring becomes sufficient.
            The default is 2.  This cannot be set to 0, and should not exceed
            tko.

         master_wait="X"
            This is the number of cycles a node must wait for votes before
            declaring itself master after making a bid.  Default is
            floor(tko/2).  This cannot be less than 2, must be greater than
            tko_up, and should not exceed tko.

         votes="3"
            This is the number of votes the quorum daemon advertises to CMAN
            when it has a high enough score.  The default is the number of
            nodes in the cluster minus 1.  For example, in a 4 node cluster,
            the default is 3.  This value may change during normal operation,
            for example when adding or removing a node from the cluster.

         log_level="4"
            This controls the verbosity of the quorum daemon in the system
            logs.  0 = emergencies; 7 = debug.  This option is deprecated.

         log_facility="daemon"
            This controls the syslog facility used by the quorum daemon when
            logging.  For a complete list of available facilities, see
            syslog.conf(5).  The default value for this is 'daemon'.  This
            option is deprecated.

         status_file="/foo"
            Write internal states out to this file periodically ("-" = use
            stdout).  This is primarily used for debugging.  The default
            value for this attribute is undefined.  This option can be
            changed while qdiskd is running.

         min_score="3"
            Absolute minimum score for a node to consider itself "alive".
            If omitted, or set to 0, the default function "floor((n+1)/2)"
            is used, where n is the sum of all defined heuristics' score
            attributes.  For example, with three heuristics scored 1, 2, and
            3, n = 6, so the default minimum score is floor((6+1)/2) = 3.
            This must never exceed the sum of the heuristic scores, or else
            the quorum disk will never be available.
         reboot="1"
            If set to 0 (off), qdiskd will *not* reboot after a negative
            transition as a result of a change in score (see section 2.2).
            The default for this value is 1 (on).  This option can be changed
            while qdiskd is running.

         master_wins="0"
            If set to 1 (on), only the qdiskd master will advertise its votes
            to CMAN.  In a network partition, only the qdisk master will
            provide votes to CMAN.  Consequently, that node will
            automatically "win" in a fence race.

            This option requires careful tuning of the CMAN timeout, the
            qdiskd timeout, and CMAN's quorum_dev_poll value.  As a rule of
            thumb, CMAN's quorum_dev_poll value should be equal to Totem's
            token timeout, and qdiskd's timeout (interval*tko) should be less
            than half of Totem's token timeout.  See section 3.1.1 for more
            information.

            This option only takes effect if there are no heuristics
            configured, and it is valid only for a 2-node cluster.  This
            option is automatically disabled if heuristics are defined or the
            cluster has more than 2 nodes configured.

            In a two-node cluster with no heuristics and no defined vote
            count (see above), this mode is turned on by default.  If enabled
            in this way at startup and a node is later added to the cluster
            configuration or the vote count is set to a value other than 1,
            this mode will be disabled.

         allow_kill="1"
            If set to 0 (off), qdiskd will *not* instruct CMAN to kill nodes
            it thinks are dead (as a result of not writing to the quorum
            disk).  The default for this value is 1 (on).  This option can be
            changed while qdiskd is running.

         paranoid="0"
            If set to 1 (on), qdiskd will watch internal timers and reboot
            the node if it takes more than (interval * tko) seconds to
            complete a quorum disk pass.  The default for this value is 0
            (off).  This option can be changed while qdiskd is running.

         io_timeout="0"
            If set to 1 (on), qdiskd will watch internal timers and reboot
            the node if qdisk is not able to write to disk after (interval *
            tko) seconds.  The default for this value is 0 (off).  If
            io_timeout is active, max_error_cycles is overridden and set to
            off.

         scheduler="rr"
            Valid values are 'rr', 'fifo', and 'other'.  Selects the
            scheduling queue in the Linux kernel for operation of the main &
            score threads (does not affect the heuristics; they are always
            run in the 'other' queue).  Default is 'rr'.  See
            sched_setscheduler(2) for more details.

         priority="1"
            Valid values for 'rr' and 'fifo' are 1..100 inclusive.  Valid
            values for 'other' are -20..20 inclusive.  Sets the priority of
            the main & score threads.  The default value is 1 (in the RR and
            FIFO queues, higher numbers denote higher priority; in OTHER,
            lower values denote higher priority).  This option can be changed
            while qdiskd is running.

         stop_cman="0"
            Ordinarily, cluster membership is left up to CMAN, not qdisk.  If
            this parameter is set to 1 (on), qdiskd will tell CMAN to leave
            the cluster if it is unable to initialize the quorum disk during
            startup.  This can be used to prevent cluster participation by a
            node which has been disconnected from the SAN.  The default for
            this value is 0 (off).  This option can be changed while qdiskd
            is running.

         use_uptime="1"
            If this parameter is set to 1 (on), qdiskd will use values from
            /proc/uptime for internal timings.  This is a bit less precise
            than gettimeofday(2), but the benefit is that changing the system
            clock will not affect qdiskd's behavior - even if paranoid is
            enabled.  If set to 0, qdiskd will use gettimeofday(2), which is
            more precise.  The default for this value is 1 (on / use uptime).

         device="/dev/sda1"
            This is the device the quorum daemon will use.  This device must
            be the same on all nodes.

         label="mylabel"
            This overrides the device field if present.  If specified, the
            quorum daemon will read /proc/partitions and check for qdisk
            signatures on every block device found, comparing the label
            against the specified label.  This is useful in configurations
            where the block device name differs on a per-node basis.

         cman_label="mylabel"
            This overrides the label advertised to CMAN if present.  If
            specified, the quorum daemon will register with this name instead
            of the actual device name.

         max_error_cycles="0"
            If we receive an I/O error during a cycle, we do not poll CMAN
            and tell it we are alive.  If specified, this value will cause
            qdiskd to exit after the specified number of consecutive cycles
            during which I/O errors occur.  The default is 0 (no maximum).
            This option can be changed while qdiskd is running.  This option
            is ignored if io_timeout is set to 1.
        />

3.1.1. Quorum Disk Timings

       Qdiskd should not be used in environments requiring failure detection
       times of less than approximately 10 seconds.

       Qdiskd will attempt to automatically configure timings based on the
       totem timeout and the TKO.  If configuring manually, Totem's token
       timeout must be set to a value at least 1 interval greater than the
       following function:

         interval * (tko + master_wait + upgrade_wait)

       So, if you have an interval of 2, a tko of 7, master_wait of 2 and
       upgrade_wait of 2, the token timeout should be at least 24 seconds
       (24000 msec): 2 * (7 + 2 + 2) = 22, plus one interval of 2.

       It is recommended to have at least 3 intervals to reduce the risk of
       quorum loss during heavy I/O load.  As a rule of thumb, using a totem
       timeout more than 2x of qdiskd's timeout (interval * tko) will result
       in good behavior.

       An improper timing configuration will cause CMAN to give up on qdiskd,
       causing a temporary loss of quorum during master transition.
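
       To express the worked example above in cluster.conf (a sketch;
       recompute the token value from your own interval, tko, master_wait,
       and upgrade_wait):

            <totem token="24000"/>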

3.2. The <heuristic> tag

       This tag is a child of the <quorumd> tag.  Heuristics may not be
       changed while qdiskd is running.

        <heuristic
         program="/test.sh"
            This is the program used to determine if this heuristic is alive.
            This can be anything which may be executed by /bin/sh -c.  A
            return value of zero indicates success; anything else indicates
            failure.  This is required.

         score="1"
            This is the weight of this heuristic.  Be careful when
            determining scores for heuristics.  The default score for each
            heuristic is 1.

         interval="2"
            This is the frequency (in seconds) at which we poll the
            heuristic.  The default interval is determined by the qdiskd
            timeout.

         tko="1"
            After this many failed attempts to run the heuristic, it is
            considered DOWN, and its score is removed.  The default tko for
            each heuristic is determined by the qdiskd timeout.
        />

3.3. Examples

3.3.1. 3 cluster nodes & 3 routers

        <cman expected_votes="6" .../>
        <clusternodes>
            <clusternode name="node1" votes="1" ... />
            <clusternode name="node2" votes="1" ... />
            <clusternode name="node3" votes="1" ... />
        </clusternodes>
        <quorumd interval="1" tko="10" votes="3" label="testing">
            <heuristic program="ping A -c1 -w1" score="1" interval="2"
                       tko="3"/>
            <heuristic program="ping B -c1 -w1" score="1" interval="2"
                       tko="3"/>
            <heuristic program="ping C -c1 -w1" score="1" interval="2"
                       tko="3"/>
        </quorumd>

3.3.2. 2 cluster nodes & 1 IP tiebreaker

        <cman two_node="0" expected_votes="3" .../>
        <clusternodes>
            <clusternode name="node1" votes="1" ... />
            <clusternode name="node2" votes="1" ... />
        </clusternodes>
        <quorumd interval="1" tko="10" votes="1" label="testing">
            <heuristic program="ping A -c1 -w1" score="1" interval="2"
                       tko="3"/>
        </quorumd>

3.4. Heuristic score considerations

       * Heuristic timeouts should be set high enough to allow the previous
       run of a given heuristic to complete.

       * Heuristic scripts returning anything except 0 as their return code
       are considered failed.

       * The worst case for improperly configured quorum heuristics is a race
       to fence where two partitions simultaneously try to kill each other.

3.5. Creating a quorum disk partition

       The mkqdisk utility can create and list currently configured quorum
       disks visible to the local node; see mkqdisk(8) for more details.
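
       For instance (a sketch - consult mkqdisk(8) for the exact options on
       your version; the device path is a placeholder):

            # mkqdisk -c /dev/sdb1 -l testing
            # mkqdisk -L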

SEE ALSO

       mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5), gettimeofday(2)


                                  12 Oct 2011                         QDisk(5)