QDisk(5)                    Cluster Quorum Disk                    QDisk(5)


NAME
       qdisk - a disk-based quorum daemon for CMAN / Linux-Cluster

1. Overview

1.1. Problem

In some situations, it may be necessary or desirable to sustain a
majority node failure of a cluster without introducing the need for
asymmetric cluster configurations (e.g. client-server, or
heavily-weighted voting nodes).

1.2. Design Requirements

* Ability to sustain 1..(n-1)/n simultaneous node failures, without
  the danger of a simple network partition causing a split brain.
  That is, we need to be able to ensure that the majority failure
  case is not merely the result of a network partition.

* Ability to use external reasons for deciding which partition is
  the quorate partition in a partitioned cluster.  For example, a
  user may have a service running on one node, and that node must
  always be the master in the event of a network partition.  Or, a
  node might lose all network connectivity except the cluster
  communication path, in which case a user may wish that node to be
  evicted from the cluster.

* Integration with CMAN.  We must not require CMAN to run with us
  (or without us).  Linux-Cluster does not require a quorum disk
  normally; introducing new requirements on how the base of
  Linux-Cluster operates is not allowed.

* Data integrity.  In order to recover from a majority failure,
  fencing is required.  The fencing subsystem is already provided by
  Linux-Cluster.

* Non-reliance on hardware- or protocol-specific methods (i.e. SCSI
  reservations).  This ensures the quorum disk algorithm can be used
  on the widest range of hardware configurations possible.

* Little or no memory allocation after initialization.  In critical
  paths during failover, we do not want to have to worry about being
  killed during a memory pressure situation because we incur a page
  fault and the Linux OOM killer responds...


1.3. Hardware Considerations and Requirements

1.3.1. Concurrent, Synchronized Access

This quorum daemon requires a shared block device with concurrent
read/write access from all nodes in the cluster.  The shared block
device can be a multi-port SCSI RAID array, a Fibre Channel RAID
SAN, a RAIDed iSCSI target, or even GNBD.  The quorum daemon uses
O_DIRECT to write to the device.

1.3.2. Bargain-basement JBODs need not apply

There is a minimum performance requirement inherent in disk-based
cluster quorum algorithms, so design your cluster accordingly.
Using a cheap JBOD with old SCSI2 disks on a multi-initiator bus
will cause problems at the first load spike.  Plan your loads
accordingly; a node's inability to write to the quorum disk in a
timely manner will cause the cluster to evict the node.  Using
host-RAID or multi-initiator parallel SCSI configurations with the
qdisk daemon is unlikely to work, and will probably cause
administrators a lot of frustration.  That having been said, because
the timeouts are configurable, most hardware should work if the
timeouts are set high enough.

1.3.3. Fencing is Required

In order to maintain data integrity under all failure scenarios, use
of this quorum daemon requires adequate fencing, preferably
power-based fencing.  Watchdog timers and software-based solutions
to reboot the node internally, while possibly sufficient, are not
considered 'fencing' for the purposes of using the quorum disk.

1.4. Limitations

* At this time, this daemon supports a maximum of 16 nodes.  This is
  primarily a scalability issue: as we increase the node count, we
  increase the amount of synchronous I/O contention on the shared
  quorum disk.

* Cluster node IDs must be statically configured in cluster.conf and
  must be numbered from 1..16 (there can be gaps, of course).

* Cluster node votes must all be 1.

* CMAN must be running before the qdisk program can operate at full
  capacity.  If CMAN is not running, qdisk will wait for it.

* CMAN's eviction timeout should be at least 2x the quorum daemon's,
  to give the quorum daemon adequate time to converge on a master
  during a failure + load spike situation.  See section 3.3.1 for
  specific details.

* For 'all-but-one' failure operation, the total number of votes
  assigned to the quorum device should be equal to or greater than
  the total number of node votes in the cluster.  While it is
  possible to assign only one (or a few) votes to the quorum device,
  the effects of doing so have not been explored.

* For 'tiebreaker' operation in a two-node cluster, unset CMAN's
  two_node flag (or set it to 0), set CMAN's expected votes to '3',
  set each node's vote to '1', and leave qdisk's vote count unset.
  This will allow the cluster to operate if both nodes are online,
  or if a single node is online and its heuristics pass.

* Currently, the quorum disk daemon is difficult to use with CLVM if
  the quorum disk resides on a CLVM logical volume.  CLVM requires a
  quorate cluster to operate correctly, which introduces a
  chicken-and-egg problem for starting the cluster: CLVM needs
  quorum, but the quorum daemon needs CLVM (if and only if the
  quorum device lies on CLVM-managed storage).  One way to work
  around this is to *not* set the cluster's expected votes to
  include the quorum daemon's votes.  Bring all nodes online, and
  start the quorum daemon *after* the whole cluster is running.
  This will allow the expected votes to increase naturally.
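The vote arithmetic behind the 'all-but-one' and 'tiebreaker' modes
above can be sketched as follows.  This is illustrative Python, not
qdiskd code, and CMAN's quorum rule is simplified here to a bare
majority of expected votes:

```python
# Illustrative sketch of the quorum arithmetic described above; not
# qdiskd code.  CMAN quorum is modeled as a simple majority.

def quorum_threshold(expected_votes):
    """Votes needed for quorum: a simple majority of expected votes."""
    return expected_votes // 2 + 1

# 'All-but-one' operation: a 4-node cluster where the quorum device
# carries the default nodes - 1 = 3 votes, so expected votes total 7.
nodes, qdisk_votes = 4, 3
expected = nodes + qdisk_votes
assert quorum_threshold(expected) == 4
# A single surviving node plus the quorum disk still reaches quorum:
assert 1 + qdisk_votes >= quorum_threshold(expected)

# 'Tiebreaker' operation: two nodes, expected_votes="3", qdisk vote 1.
expected = 3
assert 1 + 1 >= quorum_threshold(expected)   # one node + quorum disk
assert 1 < quorum_threshold(expected)        # a lone node is inquorate
```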


2. Algorithms

2.1. Heartbeating & Liveliness

Nodes update individual status blocks on the quorum disk at a
user-defined rate.  Each write of a status block alters the
timestamp, which is what other nodes use to decide whether a node
has hung or not.  After a user-defined number of 'misses' (that is,
failures to update the timestamp), a node is declared offline.
After a certain number of 'hits' (changed timestamp + "I am alive"
state), the node is declared online.
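The miss/hit bookkeeping above can be sketched as follows.  This is
an illustrative model only; the class and method names are invented,
not qdiskd internals:

```python
# Illustrative sketch of the hit/miss liveliness rule described
# above; names are invented and do not reflect qdiskd's internals.
class NodeLiveness:
    def __init__(self, tko, tko_up):
        self.tko = tko          # consecutive misses to declare offline
        self.tko_up = tko_up    # consecutive hits to declare online
        self.misses = self.hits = 0
        self.online = False

    def observe(self, timestamp_changed):
        """Record one read of the node's status block."""
        if timestamp_changed:
            self.misses = 0
            self.hits += 1
            if self.hits >= self.tko_up:
                self.online = True
        else:
            self.hits = 0
            self.misses += 1
            if self.misses >= self.tko:
                self.online = False
        return self.online

node = NodeLiveness(tko=10, tko_up=3)
for _ in range(3):            # three consecutive timely writes
    node.observe(True)
assert node.online
node.observe(False)           # a single miss does not evict the node
assert node.online
for _ in range(10):           # ten consecutive misses do
    node.observe(False)
assert not node.online
```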

The status block contains additional information, such as a bitmask
of the nodes that node believes are online.  Some of this
information is used by the master, while some is recorded purely for
performance purposes and may be used at a later time.  The most
important pieces of information a node writes to its status block
are:

   - Timestamp
   - Internal state (available / not available)
   - Score
   - Known max score (may be used in the future to detect invalid
     configurations)
   - Vote/bid messages
   - Other nodes it thinks are online


2.2. Scoring & Heuristics

The administrator can configure up to 10 purely arbitrary
heuristics, and must exercise caution in doing so.  At least one
administrator-defined heuristic is required for operation, but it is
generally a good idea to have more than one.  By default, only nodes
scoring over half of the total maximum score will claim they are
available via the quorum disk, and a node (master or otherwise)
whose score drops too low will remove itself (usually, by
rebooting).
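The scoring rule above (together with the default min_score function
described in section 3.1) can be sketched numerically.  This is
illustrative Python, not qdiskd code, and it assumes "available"
means the current score meets or exceeds the threshold:

```python
# Illustrative sketch of the scoring rule described above; not
# qdiskd code.  min_score defaults to floor((n+1)/2), where n is the
# sum of all defined heuristics' score attributes.
import math

def default_min_score(total_heuristic_score):
    return math.floor((total_heuristic_score + 1) / 2)

def node_available(passing_scores, total_heuristic_score, min_score=0):
    """True if the scores of the passing heuristics meet the threshold."""
    threshold = min_score or default_min_score(total_heuristic_score)
    return sum(passing_scores) >= threshold

# Three heuristics worth 1 point each: min_score defaults to 2, so a
# node needs at least two passing heuristics to stay available.
assert default_min_score(3) == 2
assert node_available([1, 1], 3)
assert not node_available([1], 3)
```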

The heuristics themselves can be any command executable by 'sh -c'.
For example, in early testing the following was used:

    <heuristic program="[ -f /quorum ]" score="10" interval="2"/>

This is a literal sh-ism which tests for the existence of a file
called "/quorum".  Without that file, the node would claim it was
unavailable.  This is an awful example, and should never, ever be
used in production, but it illustrates what one could do...

Typically, the heuristics should be snippets of shell code or
commands which help determine a node's usefulness to the cluster or
its clients.  Ideally, you want to add checks for all of your
network paths (e.g. check links, or ping routers), and methods to
detect availability of shared storage.


2.3. Master Election

Only one master is present at any one time in the cluster,
regardless of how many partitions exist within the cluster itself.
The master is elected by a simple voting scheme in which the
lowest-numbered node which believes it is capable of running (i.e.
scores high enough) bids for master status.  If the other nodes
agree, it becomes the master.  This algorithm is run whenever no
master is present.

If another node comes online with a lower node ID while a node is
still bidding for master status, the bidding node will rescind its
bid and vote for the lower node ID.  If a master dies or a bidding
node dies, the voting algorithm is started over.  The voting
algorithm typically takes two passes to complete.
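The election rule above reduces to "the lowest capable node ID
wins".  A toy sketch (illustrative Python, not qdiskd's actual
bid/vote message exchange):

```python
# Toy sketch of the election outcome described above; the real
# algorithm exchanges bid and vote messages via the quorum disk.

def elect_master(capable_node_ids):
    """A bidding node rescinds in favor of any lower ID it sees, so
    the lowest capable node ID ends up as master."""
    return min(capable_node_ids) if capable_node_ids else None

assert elect_master([3, 1, 2]) == 1   # node 1 wins; 2 and 3 rescind
assert elect_master([4]) == 4         # a sole capable node masters itself
assert elect_master([]) is None       # no capable node, no master
```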

Master deaths take marginally longer to recover from than non-master
deaths, because a new master must be elected before the old master
can be evicted & fenced.


2.4. Master Duties

The master node decides who is or is not in the master partition,
and handles eviction of dead nodes (both via the quorum disk and via
the linux-cluster fencing system, by using the cman_kill_node()
API).


2.5. How Membership is Determined

When a master is present, and if the master believes a node to be
online, that node will advertise to CMAN that the quorum disk is
available.  The master will only grant a node membership if:

   (a) CMAN believes the node to be online,
   (b) the node has made enough consecutive, timely writes to the
       quorum disk, and
   (c) the node has a high enough score to consider itself online.
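The three conditions above can be written as a single predicate.
This is a sketch only; the parameter names are invented, not
qdiskd's internal API:

```python
# Illustrative predicate for the membership conditions above; the
# parameter names are invented for this sketch.

def grant_membership(cman_online, timely_writes, tko_up, score_ok):
    return (cman_online                   # (a) CMAN sees the node
            and timely_writes >= tko_up   # (b) enough consecutive writes
            and score_ok)                 # (c) heuristic score high enough

assert grant_membership(True, 3, 3, True)
assert not grant_membership(True, 2, 3, True)    # too few timely writes
assert not grant_membership(True, 3, 3, False)   # score too low
```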


3. Configuration

3.1. The <quorumd> tag

This tag is a child of the top-level <cluster> tag.

<quorumd
    interval="1"
        This is the frequency of read/write cycles, in seconds.

    tko="10"
        This is the number of cycles a node must miss in order to be
        declared dead.  The default for this number is dependent on
        the configured token timeout.

    tko_up="X"
        This is the number of cycles a node must be seen in order to
        be declared online.  The default is floor(tko/3).

    upgrade_wait="2"
        This is the number of cycles a node must wait before
        initiating a bid for master status after heuristic scoring
        becomes sufficient.  The default is 2.  This cannot be set
        to 0, and should not exceed tko.

    master_wait="X"
        This is the number of cycles a node must wait for votes
        before declaring itself master after making a bid.  The
        default is floor(tko/2).  This cannot be less than 2, must
        be greater than tko_up, and should not exceed tko.

    votes="3"
        This is the number of votes the quorum daemon advertises to
        CMAN when it has a high enough score.  The default is the
        number of nodes in the cluster minus 1.  For example, in a
        4-node cluster, the default is 3.  This value may change
        during normal operation, for example when adding or removing
        a node from the cluster.

    log_level="4"
        This controls the verbosity of the quorum daemon in the
        system logs.  0 = emergencies; 7 = debug.  This option is
        deprecated.

    log_facility="daemon"
        This controls the syslog facility used by the quorum daemon
        when logging.  For a complete list of available facilities,
        see syslog.conf(5).  The default value for this is 'daemon'.
        This option is deprecated.

    status_file="/foo"
        Write internal states out to this file periodically ("-" =
        use stdout).  This is primarily used for debugging.  The
        default value for this attribute is undefined.  This option
        can be changed while qdiskd is running.

    min_score="3"
        Absolute minimum score required to consider oneself "alive".
        If omitted, or set to 0, the default function floor((n+1)/2)
        is used, where n is the sum of all defined heuristics' score
        attributes.  This must never exceed the sum of the heuristic
        scores, or else the quorum disk will never be available.

    reboot="1"
        If set to 0 (off), qdiskd will *not* reboot after a negative
        transition as a result of a change in score (see section
        2.2).  The default for this value is 1 (on).  This option
        can be changed while qdiskd is running.

    master_wins="0"
        If set to 1 (on), only the qdiskd master will advertise its
        votes to CMAN.  In a network partition, only the qdisk
        master will provide votes to CMAN.  Consequently, that node
        will automatically "win" in a fence race.

        This option requires careful tuning of the CMAN timeout, the
        qdiskd timeout, and CMAN's quorum_dev_poll value.  As a rule
        of thumb, CMAN's quorum_dev_poll value should be equal to
        Totem's token timeout, and qdiskd's timeout (interval*tko)
        should be less than half of Totem's token timeout.  See
        section 3.3.1 for more information.

        This option only takes effect if there are no heuristics
        configured.  Usage of this option in configurations with
        more than two cluster nodes is undefined and should not be
        done.

        In a two-node cluster with no heuristics and no defined vote
        count (see above), this mode is turned on by default.  If
        enabled in this way at startup, and a node is later added to
        the cluster configuration or the vote count is set to a
        value other than 1, this mode will be disabled.

    allow_kill="1"
        If set to 0 (off), qdiskd will *not* instruct CMAN to kill
        nodes it thinks are dead (as a result of not writing to the
        quorum disk).  The default for this value is 1 (on).  This
        option can be changed while qdiskd is running.

    paranoid="0"
        If set to 1 (on), qdiskd will watch internal timers and
        reboot the node if it takes more than (interval * tko)
        seconds to complete a quorum disk pass.  The default for
        this value is 0 (off).  This option can be changed while
        qdiskd is running.

    io_timeout="0"
        If set to 1 (on), qdiskd will watch internal timers and
        reboot the node if qdiskd is not able to write to the disk
        within (interval * tko) seconds.  The default for this value
        is 0 (off).  If io_timeout is active, max_error_cycles is
        overridden and set to off.

    scheduler="rr"
        Valid values are 'rr', 'fifo', and 'other'.  Selects the
        scheduling queue in the Linux kernel for operation of the
        main & score threads (does not affect the heuristics; they
        are always run in the 'other' queue).  The default is 'rr'.
        See sched_setscheduler(2) for more details.

    priority="1"
        Valid values for 'rr' and 'fifo' are 1..100 inclusive.
        Valid values for 'other' are -20..20 inclusive.  Sets the
        priority of the main & score threads.  The default value is
        1 (in the RR and FIFO queues, higher numbers denote higher
        priority; in OTHER, lower values denote higher priority).
        This option can be changed while qdiskd is running.

    stop_cman="0"
        Ordinarily, cluster membership is left up to CMAN, not
        qdiskd.  If this parameter is set to 1 (on), qdiskd will
        tell CMAN to leave the cluster if it is unable to initialize
        the quorum disk during startup.  This can be used to prevent
        cluster participation by a node which has been disconnected
        from the SAN.  The default for this value is 0 (off).  This
        option can be changed while qdiskd is running.

    use_uptime="1"
        If this parameter is set to 1 (on), qdiskd will use values
        from /proc/uptime for internal timings.  This is a bit less
        precise than gettimeofday(2), but the benefit is that
        changing the system clock will not affect qdiskd's behavior,
        even if paranoid is enabled.  If set to 0, qdiskd will use
        gettimeofday(2), which is more precise.  The default for
        this value is 1 (on / use uptime).

    device="/dev/sda1"
        This is the device the quorum daemon will use.  This device
        must be the same on all nodes.

    label="mylabel"
        This overrides the device field, if present.  If specified,
        the quorum daemon will read /proc/partitions and check for
        qdisk signatures on every block device found, comparing the
        label against the specified label.  This is useful in
        configurations where the block device name differs on a
        per-node basis.

    cman_label="mylabel"
        This overrides the label advertised to CMAN, if present.  If
        specified, the quorum daemon will register with this name
        instead of the actual device name.

    max_error_cycles="0"
        If we receive an I/O error during a cycle, we do not poll
        CMAN and tell it we are alive.  If specified, this value
        will cause qdiskd to exit after the specified number of
        consecutive cycles during which I/O errors occur.  The
        default is 0 (no maximum).  This option can be changed while
        qdiskd is running.  This option is ignored if io_timeout is
        set to 1.
/>


3.3.1. Quorum Disk Timings

Qdiskd should not be used in environments requiring failure
detection times of less than approximately 10 seconds.

Qdiskd will attempt to automatically configure timings based on the
totem timeout and the TKO.  If configuring manually, Totem's token
timeout must be set to a value at least one interval greater than
the following function:

    interval * (tko + master_wait + upgrade_wait)

So, if you have an interval of 2, a tko of 7, a master_wait of 2,
and an upgrade_wait of 2, the token timeout should be at least 24
seconds (24000 msec).
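The arithmetic above can be checked directly.  This is a sketch of
the calculation only, not of qdiskd's auto-configuration logic:

```python
# Sketch of the timing rule described above: Totem's token timeout
# must exceed interval * (tko + master_wait + upgrade_wait) by at
# least one interval.

def min_token_timeout(interval, tko, master_wait, upgrade_wait):
    """Smallest acceptable Totem token timeout, in seconds."""
    return interval * (tko + master_wait + upgrade_wait) + interval

# interval=2, tko=7, master_wait=2, upgrade_wait=2  ->  24 seconds.
assert min_token_timeout(2, 7, 2, 2) == 24

# Rule of thumb from below: a token timeout over 2x qdiskd's own
# timeout (interval * tko) gives good behavior; 2 * 2 * 7 = 28 > 24.
assert 2 * (2 * 7) > min_token_timeout(2, 7, 2, 2)
```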

It is recommended to have at least 3 intervals of headroom to reduce
the risk of quorum loss during heavy I/O load.  As a rule of thumb,
a totem timeout of more than 2x qdiskd's timeout will result in good
behavior.

An improper timing configuration will cause CMAN to give up on
qdiskd, causing a temporary loss of quorum during master transition.


3.2. The <heuristic> tag

This tag is a child of the <quorumd> tag.  Heuristics may not be
changed while qdiskd is running.

<heuristic
    program="/test.sh"
        This is the program used to determine if this heuristic is
        alive.  This can be anything which may be executed by
        /bin/sh -c.  A return value of zero indicates success;
        anything else indicates failure.  This is required.

    score="1"
        This is the weight of this heuristic.  Be careful when
        determining scores for heuristics.  The default score for
        each heuristic is 1.

    interval="2"
        This is the frequency (in seconds) at which we poll the
        heuristic.  The default interval is determined by the qdiskd
        timeout.

    tko="1"
        After this many failed attempts to run the heuristic, it is
        considered DOWN, and its score is removed.  The default tko
        for each heuristic is determined by the qdiskd timeout.
/>


3.4. Examples

3.4.1. 3 cluster nodes & 3 routers

    <cman expected_votes="6" .../>
    <clusternodes>
        <clusternode name="node1" votes="1" ... />
        <clusternode name="node2" votes="1" ... />
        <clusternode name="node3" votes="1" ... />
    </clusternodes>
    <quorumd interval="1" tko="10" votes="3" label="testing">
        <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
        <heuristic program="ping B -c1 -t1" score="1" interval="2" tko="3"/>
        <heuristic program="ping C -c1 -t1" score="1" interval="2" tko="3"/>
    </quorumd>


3.4.2. 2 cluster nodes & 1 IP tiebreaker

    <cman two_node="0" expected_votes="3" .../>
    <clusternodes>
        <clusternode name="node1" votes="1" ... />
        <clusternode name="node2" votes="1" ... />
    </clusternodes>
    <quorumd interval="1" tko="10" votes="1" label="testing">
        <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
    </quorumd>


3.5. Heuristic score considerations

* Heuristic timeouts should be set high enough to allow the previous
  run of a given heuristic to complete.

* Heuristic scripts returning anything except 0 as their return code
  are considered failed.

* The worst case for improperly configured quorum heuristics is a
  fence race, in which two partitions simultaneously try to kill
  each other.


3.6. Creating a quorum disk partition

The mkqdisk utility can create and list currently configured quorum
disks visible to the local node; see mkqdisk(8) for more details.


SEE ALSO
       mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5), gettimeofday(2)


                             20 Feb 2007                          QDisk(5)