QDisk(8)                      Cluster Quorum Disk                      QDisk(8)


NAME
QDisk 1.0 - a disk-based quorum daemon for CMAN / Linux-Cluster

1. Overview
1.1. Problem
In some situations, it may be necessary or desirable to sustain a
majority node failure of a cluster without introducing the need for
asymmetric cluster configurations (e.g. client-server, or heavily-
weighted voting nodes).

1.2. Design Requirements
* Ability to sustain 1..(n-1)/n simultaneous node failures, without the
danger of a simple network partition causing a split brain. That is,
we need to be able to ensure that the majority failure case is not
merely the result of a network partition.

* Ability to use external reasons for deciding which partition is the
quorate partition in a partitioned cluster. For example, a user may
have a service running on one node, and that node must always be the
master in the event of a network partition. Or, a node might lose all
network connectivity except the cluster communication path - in which
case, a user may wish that node to be evicted from the cluster.

* Integration with CMAN. We must not require CMAN to run with the
quorum daemon (or without it). Linux-Cluster does not require a quorum
disk normally - introducing new requirements on the basic operation of
Linux-Cluster is not allowed.

* Data integrity. In order to recover from a majority failure, fencing
is required. The fencing subsystem is already provided by
Linux-Cluster.

* Non-reliance on hardware- or protocol-specific methods (e.g. SCSI
reservations). This ensures the quorum disk algorithm can be used on
the widest range of hardware configurations possible.

* Little or no memory allocation after initialization. In critical
paths during failover, we do not want to have to worry about being
killed under memory pressure because we trigger a page fault and the
Linux OOM killer responds...

1.3. Hardware Considerations and Requirements
1.3.1. Concurrent, Synchronized Access
This quorum daemon requires a shared block device with concurrent
read/write access from all nodes in the cluster. The shared block
device can be a multi-port SCSI RAID array, a Fibre Channel RAID SAN,
a RAIDed iSCSI target, or even GNBD. The quorum daemon uses O_DIRECT
to write to the device.

1.3.2. Performance Requirements
There is a minimum performance requirement inherent in using disk-based
cluster quorum algorithms, so design your cluster accordingly. Using a
cheap JBOD with old SCSI-2 disks on a multi-initiator bus will cause
problems at the first load spike. Plan your loads accordingly; a node's
inability to write to the quorum disk in a timely manner will cause the
cluster to evict the node. Using host-RAID or multi-initiator parallel
SCSI configurations with the qdisk daemon is unlikely to work, and will
probably cause administrators a lot of frustration. That said, because
the timeouts are configurable, most hardware should work if the
timeouts are set high enough.

1.3.3. Fencing
In order to maintain data integrity under all failure scenarios, use of
this quorum daemon requires adequate fencing, preferably power-based
fencing. Watchdog timers and software-based solutions which reboot the
node internally, while possibly sufficient, are not considered 'fencing'
for the purposes of using the quorum disk.

1.4. Limitations
* At this time, this daemon supports a maximum of 16 nodes. This is
primarily a scalability issue: as we increase the node count, we
increase the amount of synchronous I/O contention on the shared quorum
disk.

* Cluster node IDs must be statically configured in cluster.conf and
must be numbered from 1..16 (there can be gaps, of course).

* Cluster node votes should be more or less equal.

* CMAN must be running before the qdisk program can start.

* CMAN's eviction timeout should be at least 2x the quorum daemon's to
give the quorum daemon adequate time to converge on a master during a
combined failure and load spike.

* The total number of votes assigned to the quorum device should be
equal to or greater than the total number of node-votes in the cluster.
While it is possible to assign only one (or a few) votes to the quorum
device, the effects of doing so have not been explored.  (A sketch
illustrating a typical vote assignment appears after this list.)

* Currently, the quorum disk daemon is difficult to use with CLVM if
the quorum disk resides on a CLVM logical volume. CLVM requires a
quorate cluster to operate correctly, which introduces a chicken-and-egg
problem for starting the cluster: CLVM needs quorum, but the quorum
daemon needs CLVM (if and only if the quorum device lies on CLVM-managed
storage). One way to work around this is to *not* set the cluster's
expected votes to include the quorum daemon's votes. Bring all nodes
online, and start the quorum daemon *after* the whole cluster is
running. This will allow the expected votes to increase naturally.

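As a rough sketch (the cluster name, node names, and exact values here
are illustrative only, and fencing configuration is omitted), a
three-node cluster following the guidelines above could assign one vote
per node and three votes to the quorum device:

<cluster name="example" config_version="1">
    <!-- Illustrative values: 3 node votes + 3 quorum-device votes = 6
         expected votes.  A single surviving node plus the quorum device
         (1 + 3 = 4 votes) is still enough to keep the cluster quorate. -->
    <cman expected_votes="6"/>
    <clusternodes>
        <clusternode name="node1" nodeid="1" votes="1"/>
        <clusternode name="node2" nodeid="2" votes="1"/>
        <clusternode name="node3" nodeid="3" votes="1"/>
    </clusternodes>
    <quorumd interval="1" tko="10" votes="3" label="example_qdisk">
        <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2"/>
    </quorumd>
</cluster>

For the CLVM workaround described above, expected_votes would instead be
left at the node-vote total (3 here), so that the cluster can become
quorate before qdiskd is started.
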
2. Algorithms
2.1. Heartbeating & Liveliness Determination
Nodes update individual status blocks on the quorum disk at a user-
defined rate. Each write of a status block alters the timestamp, which
is what other nodes use to decide whether a node has hung or not.
After a user-defined number of 'misses' (that is, failures to update
the timestamp), a node is declared offline. After a certain number of
'hits' (changed timestamp + "I am alive" state), the node is declared
online.

The status block contains additional information, such as a bitmask of
the nodes that node believes are online. Some of this information is
used by the master, while some is recorded purely for performance
analysis and may be used at a later time. The most important pieces of
information a node writes to its status block are:

- Timestamp
- Internal state (available / not available)
- Score
- Known max score (may be used in the future to detect invalid
  configurations)
- Vote/bid messages
- Other nodes it thinks are online

2.2. Scoring & Heuristics
The administrator can configure up to 10 purely arbitrary heuristics,
and must exercise caution in doing so. At least one administrator-
defined heuristic is required for operation, but it is generally a good
idea to have more than one. By default, only nodes scoring over 1/2 of
the total maximum score will claim they are available via the quorum
disk, and a node (master or otherwise) whose score drops too low will
remove itself (usually, by rebooting).

The heuristics themselves can be any command executable by 'sh -c'.
For example, in early testing the following was used:

<heuristic program="[ -f /quorum ]" score="10" interval="2"/>

This is a literal sh-ism which tests for the existence of a file called
"/quorum". Without that file, the node would claim it was unavailable.
This is an awful example, and should never, ever be used in production,
but is provided as an illustration of what one could do...

Typically, the heuristics should be snippets of shell code or commands
which help determine a node's usefulness to the cluster or its clients.
Ideally, you want to add checks for all of your network paths (e.g.
check links, or ping routers), and methods to detect availability of
shared storage.

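For instance (the device path and address below are placeholders, and
any such check must complete well within its interval), heuristics for
a router and for shared storage might look like:

<heuristic program="ping -c1 -w1 192.168.1.1" score="1" interval="2"/>
<!-- Fails if the shared LUN is no longer visible as a block device. -->
<heuristic program="test -b /dev/mapper/shared_lun" score="1" interval="2"/>
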
2.3. Master Election
Only one master is present at any one time in the cluster, regardless
of how many partitions exist within the cluster itself. The master is
elected by a simple voting scheme in which the node with the lowest
node ID that believes it is capable of running (i.e. scores high
enough) bids for master status. If the other nodes agree, it becomes
the master. This algorithm is run whenever no master is present.

If another node comes online with a lower node ID while a node is still
bidding for master status, the bidding node will rescind its bid and
vote for the lower node ID. If a master or a bidding node dies, the
voting algorithm is started over. The voting algorithm typically takes
two passes to complete.

Master deaths take marginally longer to recover from than non-master
deaths, because a new master must be elected before the old master can
be evicted & fenced.

2.4. Master Duties
The master node decides who is or is not in the master partition, and
handles eviction of dead nodes (both via the quorum disk and via the
Linux-Cluster fencing system, using the cman_kill_node() API).

2.5. How a node gains quorum disk membership
When a master is present, and if the master believes a node to be
online, that node will advertise to CMAN that the quorum disk is
available. The master will only grant a node membership if:

(a) CMAN believes the node to be online, and
(b) that node has made enough consecutive, timely writes
    to the quorum disk, and
(c) the node has a high enough score to consider itself online.

3. Configuration
3.1. The <quorumd> tag
This tag is a child of the top-level <cluster> tag.

<quorumd
    interval="1"
        This is the frequency, in seconds, of read/write cycles.

    tko="10"
        This is the number of cycles a node must miss in order to be
        declared dead.

    votes="3"
        This is the number of votes the quorum daemon advertises to
        CMAN when it has a high enough score.

    log_level="4"
        This controls the verbosity of the quorum daemon in the system
        logs. 0 = emergencies; 7 = debug.

    log_facility="local4"
        This controls the syslog facility used by the quorum daemon
        when logging. For a complete list of available facilities, see
        syslog.conf(5).

    status_file="/foo"
        Write internal states out to this file periodically ("-" = use
        stdout). This is primarily used for debugging.

    min_score="3"
        Absolute minimum score for a node to consider itself "alive".
        If omitted, or set to 0, the default function "floor((n+1)/2)"
        is used, where n is the sum of all defined heuristics' score
        attributes. For example, with three heuristics scored 1, 1,
        and 2, n = 4 and the default minimum score is floor((4+1)/2),
        which is 2.

    reboot="1"
        If set to 0 (off), qdiskd will *not* reboot after a negative
        transition as a result of a change in score (see section 2.2).
        The default for this value is 1 (on).

    device="/dev/sda1"
        This is the device the quorum daemon will use. This device
        must be the same on all nodes.

    label="mylabel"/>
        This overrides the device field if present. If specified, the
        quorum daemon will read /proc/partitions and check for qdisk
        signatures on every block device found, comparing the label
        found against the specified label. This is useful in
        configurations where the block device name differs on a
        per-node basis.

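Putting these attributes together, a minimal sketch using a fixed
device path rather than a label might look like the following (the
device, status file path, address, and timing values are only
examples):

<quorumd interval="1" tko="10" votes="3" log_level="4"
         device="/dev/sdb1" status_file="/var/run/qdisk_status">
    <!-- With interval="1" and tko="10", a node missing roughly 10
         seconds of writes is declared dead; per section 1.4, CMAN's
         eviction timeout should then be at least about 20 seconds. -->
    <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2"/>
</quorumd>
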
3.2. The <heuristic> tag
This tag is a child of the <quorumd> tag.

<heuristic
    program="/test.sh"
        This is the program used to determine if this heuristic is
        alive. This can be anything which may be executed by
        /bin/sh -c. A return value of zero indicates success; anything
        else indicates failure.

    score="1"
        This is the weight of this heuristic. Be careful when
        determining scores for heuristics.

    interval="2"/>
        This is the frequency, in seconds, at which we poll the
        heuristic.

3.3. Example
<quorumd interval="1" tko="10" votes="3" label="testing">
    <heuristic program="ping A -c1 -t1" score="1" interval="2"/>
    <heuristic program="ping B -c1 -t1" score="1" interval="2"/>
    <heuristic program="ping C -c1 -t1" score="1" interval="2"/>
</quorumd>

3.3.1. Notes
* Heuristic timeouts should be set high enough to allow the previous
run of a given heuristic to complete.

* Heuristic scripts returning anything except 0 as their return code
are considered failed.

* The worst-case scenario for improperly configured quorum heuristics
is a race to fence, where two partitions simultaneously try to kill
each other.

3.4. Creating a quorum disk partition
The mkqdisk utility can create and list currently configured quorum
disks visible to the local node; see mkqdisk(8) for more details.

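For example, something along these lines is commonly used to initialize
and then verify a quorum disk (the device path and label are
placeholders; check mkqdisk(8) for the exact options on your version):

mkqdisk -c /dev/sdb1 -l example_qdisk
mkqdisk -L
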
SEE ALSO
mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5)



July 2006                                                              QDisk(8)