QDisk(8)                      Cluster Quorum Disk                      QDisk(8)


NAME
QDisk 1.0 - a disk-based quorum daemon for CMAN / Linux-Cluster

1. Overview
1.1. Problem
In some situations, it may be necessary or desirable to sustain a
majority node failure of a cluster without introducing the need for
asymmetric cluster configurations (e.g. client-server, or heavily-
weighted voting nodes).

1.2. Design Requirements
* Ability to sustain 1..(n-1)/n simultaneous node failures, without the
danger of a simple network partition causing a split brain. That is,
we need to be able to ensure that the majority failure case is not
merely the result of a network partition.

* Ability to use external reasons for deciding which partition is the
quorate partition in a partitioned cluster. For example, a user may
have a service running on one node, and that node must always be the
master in the event of a network partition. Or, a node might lose all
network connectivity except the cluster communication path - in which
case, a user may wish that node to be evicted from the cluster.

* Integration with CMAN. We must not require CMAN to run with the
quorum daemon (or without it). Linux-Cluster does not require a quorum
disk normally - introducing new requirements on the basic operation of
Linux-Cluster is not allowed.

* Data integrity. In order to recover from a majority failure, fencing
is required. The fencing subsystem is already provided by
Linux-Cluster.

* Non-reliance on hardware- or protocol-specific methods (e.g. SCSI
reservations). This ensures the quorum disk algorithm can be used on
the widest range of hardware configurations possible.

* Little or no memory allocation after initialization. In critical
paths during failover, we do not want to have to worry about being
killed under memory pressure because we trigger a page fault and the
Linux OOM killer responds...

1.3. Hardware Considerations and Requirements
1.3.1. Concurrent, Synchronized Access
This quorum daemon requires a shared block device with concurrent
read/write access from all nodes in the cluster. The shared block
device can be a multi-port SCSI RAID array, a Fibre Channel RAID SAN,
a RAIDed iSCSI target, or even GNBD. The quorum daemon uses O_DIRECT
to write to the device.

1.3.2. Performance Requirements
There is a minimum performance requirement inherent in using disk-based
cluster quorum algorithms, so design your cluster accordingly. Using a
cheap JBOD with old SCSI-2 disks on a multi-initiator bus will cause
problems at the first load spike. Plan your loads accordingly; a node's
inability to write to the quorum disk in a timely manner will cause the
cluster to evict the node. Using host-RAID or multi-initiator parallel
SCSI configurations with the qdisk daemon is unlikely to work, and will
probably cause administrators a lot of frustration. That said, because
the timeouts are configurable, most hardware should work if the
timeouts are set high enough.

1.3.3. Fencing
In order to maintain data integrity under all failure scenarios, use of
this quorum daemon requires adequate fencing, preferably power-based
fencing. Watchdog timers and software-based solutions which reboot the
node internally, while possibly sufficient, are not considered 'fencing'
for the purposes of using the quorum disk.

1.4. Limitations
* At this time, this daemon supports a maximum of 16 nodes. This is
primarily a scalability issue: as we increase the node count, we
increase the amount of synchronous I/O contention on the shared quorum
disk.

* Cluster node IDs must be statically configured in cluster.conf and
must be numbered from 1..16 (there can be gaps, of course).

* Cluster node votes should be more or less equal.

* CMAN must be running before the qdisk program can start.

* CMAN's eviction timeout should be at least 2x the quorum daemon's to
give the quorum daemon adequate time to converge on a master during a
combined failure and load spike.

* The total number of votes assigned to the quorum device should be
equal to or greater than the total number of node-votes in the cluster.
While it is possible to assign only one (or a few) votes to the quorum
device, the effects of doing so have not been explored.  (A sketch
illustrating a typical vote assignment appears after this list.)

* Currently, the quorum disk daemon is difficult to use with CLVM if
the quorum disk resides on a CLVM logical volume. CLVM requires a
quorate cluster to operate correctly, which introduces a chicken-and-egg
problem for starting the cluster: CLVM needs quorum, but the quorum
daemon needs CLVM (if and only if the quorum device lies on CLVM-managed
storage). One way to work around this is to *not* set the cluster's
expected votes to include the quorum daemon's votes. Bring all nodes
online, and start the quorum daemon *after* the whole cluster is
running. This will allow the expected votes to increase naturally.

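As a rough sketch (the cluster name, node names, and exact values here
are illustrative only, and fencing configuration is omitted), a
three-node cluster following the guidelines above could assign one vote
per node and three votes to the quorum device:

<cluster name="example" config_version="1">
    <!-- Illustrative values: 3 node votes + 3 quorum-device votes = 6
         expected votes.  A single surviving node plus the quorum device
         (1 + 3 = 4 votes) is still enough to keep the cluster quorate. -->
    <cman expected_votes="6"/>
    <clusternodes>
        <clusternode name="node1" nodeid="1" votes="1"/>
        <clusternode name="node2" nodeid="2" votes="1"/>
        <clusternode name="node3" nodeid="3" votes="1"/>
    </clusternodes>
    <quorumd interval="1" tko="10" votes="3" label="example_qdisk">
        <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2"/>
    </quorumd>
</cluster>

For the CLVM workaround described above, expected_votes would instead be
left at the node-vote total (3 here), so that the cluster can become
quorate before qdiskd is started.
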
2. Algorithms
2.1. Heartbeating & Liveliness Determination
Nodes update individual status blocks on the quorum disk at a user-
defined rate. Each write of a status block alters the timestamp, which
is what other nodes use to decide whether a node has hung or not.
After a user-defined number of 'misses' (that is, failures to update
the timestamp), a node is declared offline. After a certain number of
'hits' (changed timestamp + "I am alive" state), the node is declared
online.

The status block contains additional information, such as a bitmask of
the nodes that node believes are online. Some of this information is
used by the master, while some is recorded purely for performance
analysis and may be used at a later time. The most important pieces of
information a node writes to its status block are:

- Timestamp
- Internal state (available / not available)
- Score
- Known max score (may be used in the future to detect invalid
  configurations)
- Vote/bid messages
- Other nodes it thinks are online

2.2. Scoring & Heuristics
The administrator can configure up to 10 purely arbitrary heuristics,
and must exercise caution in doing so. At least one administrator-
defined heuristic is required for operation, but it is generally a good
idea to have more than one. By default, only nodes scoring over 1/2 of
the total maximum score will claim they are available via the quorum
disk, and a node (master or otherwise) whose score drops too low will
remove itself (usually, by rebooting).

The heuristics themselves can be any command executable by 'sh -c'.
For example, in early testing the following was used:

<heuristic program="[ -f /quorum ]" score="10" interval="2"/>

This is a literal sh-ism which tests for the existence of a file called
"/quorum". Without that file, the node would claim it was unavailable.
This is an awful example, and should never, ever be used in production,
but is provided as an illustration of what one could do...

Typically, the heuristics should be snippets of shell code or commands
which help determine a node's usefulness to the cluster or its clients.
Ideally, you want to add checks for all of your network paths (e.g.
check links, or ping routers), and methods to detect availability of
shared storage.

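For instance (the device path and address below are placeholders, and
any such check must complete well within its interval), heuristics for
a router and for shared storage might look like:

<heuristic program="ping -c1 -w1 192.168.1.1" score="1" interval="2"/>
<!-- Fails if the shared LUN is no longer visible as a block device. -->
<heuristic program="test -b /dev/mapper/shared_lun" score="1" interval="2"/>
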
2.3. Master Election
Only one master is present at any one time in the cluster, regardless
of how many partitions exist within the cluster itself. The master is
elected by a simple voting scheme in which the node with the lowest
node ID that believes it is capable of running (i.e. scores high
enough) bids for master status. If the other nodes agree, it becomes
the master. This algorithm is run whenever no master is present.

If another node comes online with a lower node ID while a node is still
bidding for master status, the bidding node will rescind its bid and
vote for the lower node ID. If a master or a bidding node dies, the
voting algorithm is started over. The voting algorithm typically takes
two passes to complete.

Master deaths take marginally longer to recover from than non-master
deaths, because a new master must be elected before the old master can
be evicted & fenced.

2.4. Master Duties
The master node decides who is or is not in the master partition, and
handles eviction of dead nodes (both via the quorum disk and via the
Linux-Cluster fencing system, using the cman_kill_node() API).

2.5. How a node gains quorum disk membership
When a master is present, and if the master believes a node to be
online, that node will advertise to CMAN that the quorum disk is
available. The master will only grant a node membership if:

(a) CMAN believes the node to be online, and
(b) that node has made enough consecutive, timely writes
    to the quorum disk, and
(c) the node has a high enough score to consider itself online.

3. Configuration
3.1. The <quorumd> tag
This tag is a child of the top-level <cluster> tag.

<quorumd
    interval="1"
        This is the frequency, in seconds, of read/write cycles.

    tko="10"
        This is the number of cycles a node must miss in order to be
        declared dead.

    votes="3"
        This is the number of votes the quorum daemon advertises to
        CMAN when it has a high enough score.

    log_level="4"
        This controls the verbosity of the quorum daemon in the system
        logs. 0 = emergencies; 7 = debug.

    log_facility="local4"
        This controls the syslog facility used by the quorum daemon
        when logging. For a complete list of available facilities, see
        syslog.conf(5).

    status_file="/foo"
        Write internal states out to this file periodically ("-" = use
        stdout). This is primarily used for debugging.

    min_score="3"
        Absolute minimum score for a node to consider itself "alive".
        If omitted, or set to 0, the default function "floor((n+1)/2)"
        is used, where n is the sum of all defined heuristics' score
        attributes. For example, with three heuristics scored 1, 1,
        and 2, n = 4 and the default minimum score is floor((4+1)/2),
        which is 2.

    reboot="1"
        If set to 0 (off), qdiskd will *not* reboot after a negative
        transition as a result of a change in score (see section 2.2).
        The default for this value is 1 (on).

    device="/dev/sda1"
        This is the device the quorum daemon will use. This device
        must be the same on all nodes.

    label="mylabel"/>
        This overrides the device field if present. If specified, the
        quorum daemon will read /proc/partitions and check for qdisk
        signatures on every block device found, comparing the label
        found against the specified label. This is useful in
        configurations where the block device name differs on a
        per-node basis.

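Putting these attributes together, a minimal sketch using a fixed
device path rather than a label might look like the following (the
device, status file path, address, and timing values are only
examples):

<quorumd interval="1" tko="10" votes="3" log_level="4"
         device="/dev/sdb1" status_file="/var/run/qdisk_status">
    <!-- With interval="1" and tko="10", a node missing roughly 10
         seconds of writes is declared dead; per section 1.4, CMAN's
         eviction timeout should then be at least about 20 seconds. -->
    <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2"/>
</quorumd>
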
3.2. The <heuristic> tag
This tag is a child of the <quorumd> tag.

<heuristic
    program="/test.sh"
        This is the program used to determine if this heuristic is
        alive. This can be anything which may be executed by
        /bin/sh -c. A return value of zero indicates success; anything
        else indicates failure.

    score="1"
        This is the weight of this heuristic. Be careful when
        determining scores for heuristics.

    interval="2"/>
        This is the frequency, in seconds, at which we poll the
        heuristic.

3.3. Example
<quorumd interval="1" tko="10" votes="3" label="testing">
    <heuristic program="ping A -c1 -t1" score="1" interval="2"/>
    <heuristic program="ping B -c1 -t1" score="1" interval="2"/>
    <heuristic program="ping C -c1 -t1" score="1" interval="2"/>
</quorumd>

3.3.1. Notes
* Heuristic timeouts should be set high enough to allow the previous
run of a given heuristic to complete.

* Heuristic scripts returning anything except 0 as their return code
are considered failed.

* The worst-case scenario for improperly configured quorum heuristics
is a race to fence, where two partitions simultaneously try to kill
each other.

3.4. Creating a quorum disk partition
The mkqdisk utility can create and list currently configured quorum
disks visible to the local node; see mkqdisk(8) for more details.

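For example, something along these lines is commonly used to initialize
and then verify a quorum disk (the device path and label are
placeholders; check mkqdisk(8) for the exact options on your version):

mkqdisk -c /dev/sdb1 -l example_qdisk
mkqdisk -L
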
SEE ALSO
mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5)



July 2006                                                              QDisk(8)