QDisk(5)                     Cluster Quorum Disk                     QDisk(5)


NAME
qdisk - a disk-based quorum daemon for CMAN / Linux-Cluster

1. Overview

1.1. Problem

In some situations, it may be necessary or desirable to sustain a
majority node failure of a cluster without introducing the need for
asymmetric cluster configurations (e.g. client-server, or
heavily-weighted voting nodes).

1.2. Design Requirements

* Ability to sustain 1..(n-1)/n simultaneous node failures, without the
danger of a simple network partition causing a split brain. That is,
we need to be able to ensure that the majority failure case is not
merely the result of a network partition.

* Ability to use external reasons for deciding which partition is the
quorate partition in a partitioned cluster. For example, a user may
have a service running on one node, and that node must always be the
master in the event of a network partition. Or, a node might lose all
network connectivity except the cluster communication path - in which
case, a user may wish that node to be evicted from the cluster.

* Integration with CMAN. We must not require CMAN to run with us (or
without us). Linux-Cluster does not require a quorum disk normally -
introducing new requirements on the base of how Linux-Cluster operates
is not allowed.

* Data integrity. In order to recover from a majority failure, fencing
is required. The fencing subsystem is already provided by
Linux-Cluster.

* Non-reliance on hardware- or protocol-specific methods (e.g. SCSI
reservations). This ensures the quorum disk algorithm can be used on
the widest range of hardware configurations possible.

* Little or no memory allocation after initialization. In critical
paths during failover, we do not want to have to worry about being
killed during a memory pressure situation because we incur a page
fault, and the Linux OOM killer responds...

1.3. Hardware Considerations and Requirements

1.3.1. Concurrent, Synchronized Access

This quorum daemon requires a shared block device with concurrent
read/write access from all nodes in the cluster. The shared block
device can be a multi-port SCSI RAID array, a Fibre Channel RAID SAN,
a RAIDed iSCSI target, or even GNBD. The quorum daemon uses O_DIRECT
to write to the device.

1.3.2. Bargain-basement JBODs need not apply

There is a minimum performance requirement inherent in using
disk-based cluster quorum algorithms, so design your cluster
accordingly. Using a cheap JBOD with old SCSI2 disks on a
multi-initiator bus will cause problems at the first load spike. Plan
your loads accordingly; a node's inability to write to the quorum disk
in a timely manner will cause the cluster to evict the node. Using
host-RAID or multi-initiator parallel SCSI configurations with the
qdisk daemon is unlikely to work, and will probably cause
administrators a lot of frustration. That having been said, because
the timeouts are configurable, most hardware should work if the
timeouts are set high enough.

1.3.3. Fencing is Required

In order to maintain data integrity under all failure scenarios, use
of this quorum daemon requires adequate fencing, preferably power-based
fencing. Watchdog timers and software-based solutions to reboot the
node internally, while possibly sufficient, are not considered
'fencing' for the purposes of using the quorum disk.

1.4. Limitations

* At this time, this daemon supports a maximum of 16 nodes. This is
primarily a scalability issue: as we increase the node count, we
increase the amount of synchronous I/O contention on the shared quorum
disk.

* Cluster node IDs must be statically configured in cluster.conf and
must be numbered from 1..16 (there can be gaps, of course).

* Cluster node votes must all be 1.

* CMAN must be running before the qdisk program can operate at full
capacity. If CMAN is not running, qdisk will wait for it.

* CMAN's eviction timeout should be at least 2x the quorum daemon's to
give the quorum daemon adequate time to converge on a master during a
failure + load spike situation. See section 3.1.1 for specific
details.

* For 'all-but-one' failure operation, the total number of votes
assigned to the quorum device should be equal to or greater than the
total number of node-votes in the cluster. While it is possible to
assign only one (or a few) votes to the quorum device, the effects of
doing so have not been explored.

* For 'tiebreaker' operation in a two-node cluster, unset CMAN's
two_node flag (or set it to 0), set CMAN's expected votes to '3', set
each node's vote to '1', and leave qdisk's vote count unset. This will
allow the cluster to operate if either both nodes are online, or a
single node is online and its heuristics pass.

* Currently, the quorum disk daemon is difficult to use with CLVM if
the quorum disk resides on a CLVM logical volume. CLVM requires a
quorate cluster to operate correctly, which introduces a
chicken-and-egg problem for starting the cluster: CLVM needs quorum,
but the quorum daemon needs CLVM (if and only if the quorum device
lies on CLVM-managed storage). One way to work around this is to
*not* set the cluster's expected votes to include the quorum daemon's
votes. Bring all nodes online, and start the quorum daemon *after*
the whole cluster is running. This will allow the expected votes to
increase naturally.
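As an illustrative sketch of that workaround for a three-node cluster
(all values here are hypothetical): leave expected_votes at the node
total rather than nodes + qdisk votes, so the cluster can become
quorate before qdiskd starts; once qdiskd registers, the expected
votes increase naturally.

```
<!-- 3 nodes with 1 vote each; expected_votes deliberately excludes
     the votes qdiskd will contribute once it is started. -->
<cman expected_votes="3"/>
<quorumd interval="1" tko="10" votes="2" label="myqdisk"/>
```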

2. Algorithms

2.1. Heartbeating & Liveliness Determination

Nodes update individual status blocks on the quorum disk at a
user-defined rate. Each write of a status block alters the timestamp,
which is what other nodes use to decide whether a node has hung or
not. After a user-defined number of 'misses' (that is, failures to
update the timestamp), a node is declared offline. After a certain
number of 'hits' (changed timestamp + "i am alive" state), the node is
declared online.

The status block contains additional information, such as a bitmask of
the nodes that node believes are online. Some of this information is
used by the master, while some is just for performance recording and
may be used at a later time. The most important pieces of information
a node writes to its status block are:

- Timestamp
- Internal state (available / not available)
- Score
- Known max score (may be used in the future to detect invalid
  configurations)
- Vote/bid messages
- Other nodes it thinks are online
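The hit/miss bookkeeping above can be sketched as a small state
machine. This is a simplified illustration, not qdiskd's actual code;
the class name is made up, with tko and tko_up as described under the
<quorumd> tag below.

```python
class NodeTracker:
    """Tracks one peer's liveliness from its quorum-disk timestamps."""

    def __init__(self, tko=10, tko_up=3):
        self.tko = tko          # consecutive misses before declared dead
        self.tko_up = tko_up    # consecutive hits before declared online
        self.misses = 0
        self.hits = 0
        self.online = False
        self.last_timestamp = None

    def observe(self, timestamp):
        """Called once per cycle with the timestamp read from the
        node's status block; returns whether the node is online."""
        if timestamp != self.last_timestamp:   # timestamp changed: a 'hit'
            self.last_timestamp = timestamp
            self.misses = 0
            self.hits += 1
            if not self.online and self.hits >= self.tko_up:
                self.online = True
        else:                                  # unchanged: a 'miss'
            self.hits = 0
            self.misses += 1
            if self.online and self.misses >= self.tko:
                self.online = False
        return self.online
```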

2.2. Scoring & Heuristics

The administrator can configure up to 10 purely arbitrary heuristics,
and must exercise caution in doing so. At least one
administrator-defined heuristic is required for operation, but it is
generally a good idea to have more than one. By default, only nodes
scoring over 1/2 of the total maximum score will claim they are
available via the quorum disk, and a node (master or otherwise) whose
score drops too low will remove itself (usually, by rebooting).

The heuristics themselves can be any command executable by 'sh -c'.
For example, in early testing the following was used:

    <heuristic program="[ -f /quorum ]" score="10" interval="2"/>

This is a literal sh-ism which tests for the existence of a file
called "/quorum". Without that file, the node would claim it was
unavailable. This is an awful example, and should never, ever be used
in production, but is provided as an example of what one could do...

Typically, the heuristics should be snippets of shell code or commands
which help determine a node's usefulness to the cluster or clients.
Ideally, you want to add traces for all of your network paths (e.g.
check links, or ping routers), and methods to detect availability of
shared storage.
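For instance, a more realistic set of heuristics along those lines
might look like the following (the router address, interface name, and
mount point are hypothetical placeholders):

```
<quorumd interval="1" tko="10" label="myqdisk">
    <!-- Can we reach the upstream router? -->
    <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2" tko="3"/>
    <!-- Is the client-facing link up? -->
    <heuristic program="ip link show eth0 | grep -q 'state UP'" score="1" interval="2" tko="3"/>
    <!-- Is the shared filesystem still mounted? -->
    <heuristic program="mountpoint -q /shared" score="1" interval="2" tko="3"/>
</quorumd>
```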

2.3. Master Election

Only one master is present at any one time in the cluster, regardless
of how many partitions exist within the cluster itself. The master is
elected by a simple voting scheme in which the lowest node which
believes it is capable of running (i.e. scores high enough) bids for
master status. If the other nodes agree, it becomes the master. This
algorithm is run whenever no master is present.

If another node comes online with a lower node ID while a node is
still bidding for master status, the bidding node will rescind its bid
and vote for the lower node ID. If a master dies or a bidding node
dies, the voting algorithm is started over. The voting algorithm
typically takes two passes to complete.

Master deaths take marginally longer to recover from than non-master
deaths, because a new master must be elected before the old master can
be evicted & fenced.
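A toy model of the bidding rule (illustrative only; real qdiskd
exchanges bids and votes through the status blocks on disk): among the
nodes whose score is high enough, the lowest node ID wins the
election.

```python
def expected_master(node_scores, min_score):
    """node_scores: {node_id: current_score}.
    Returns the node ID expected to win the master election,
    or None if no node is eligible to bid."""
    eligible = [nid for nid, score in node_scores.items()
                if score >= min_score]
    return min(eligible) if eligible else None
```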

2.4. Master Duties

The master node decides who is or is not in the master partition, and
handles eviction of dead nodes (both via the quorum disk and via the
linux-cluster fencing system, using the cman_kill_node() API).

2.5. How it All Ties Together

When a master is present, and if the master believes a node to be
online, that node will advertise to CMAN that the quorum disk is
available. The master will only grant a node membership if:

  (a) CMAN believes the node to be online, and
  (b) that node has made enough consecutive, timely writes to the
      quorum disk, and
  (c) the node has a high enough score to consider itself online.

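The three conditions can be written as a single predicate (a sketch
with invented parameter names, not the cman/qdiskd API):

```python
def master_grants_membership(cman_sees_node, consecutive_timely_writes,
                             tko_up, score, min_score):
    """True if the qdiskd master would count this node as a member."""
    return (cman_sees_node                             # (a) CMAN sees it
            and consecutive_timely_writes >= tko_up    # (b) timely writes
            and score >= min_score)                    # (c) score is high enough
```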

3. Configuration

3.1. The <quorumd> tag

This tag is a child of the top-level <cluster> tag.

<quorumd
interval="1"
    This is the frequency of read/write cycles, in seconds.

tko="10"
    This is the number of cycles a node must miss in order to be
    declared dead. The default for this number is dependent on the
    configured token timeout.

tko_up="X"
    This is the number of cycles a node must be seen in order to be
    declared online. Default is floor(tko/3).

upgrade_wait="2"
    This is the number of cycles a node must wait before initiating a
    bid for master status after heuristic scoring becomes sufficient.
    The default is 2. This cannot be set to 0, and should not exceed
    tko.

master_wait="X"
    This is the number of cycles a node must wait for votes before
    declaring itself master after making a bid. Default is
    floor(tko/2). This cannot be less than 2, must be greater than
    tko_up, and should not exceed tko.

votes="3"
    This is the number of votes the quorum daemon advertises to CMAN
    when it has a high enough score. The default is the number of
    nodes in the cluster minus 1. For example, in a 4 node cluster,
    the default is 3. This value may change during normal operation,
    for example when adding or removing a node from the cluster.

log_level="4"
    This controls the verbosity of the quorum daemon in the system
    logs. 0 = emergencies; 7 = debug. This option is deprecated.

log_facility="daemon"
    This controls the syslog facility used by the quorum daemon when
    logging. For a complete list of available facilities, see
    syslog.conf(5). The default value for this is 'daemon'. This
    option is deprecated.

status_file="/foo"
    Write internal states out to this file periodically ("-" = use
    stdout). This is primarily used for debugging. The default value
    for this attribute is undefined. This option can be changed while
    qdiskd is running.
min_score="3"
    Absolute minimum score required to consider oneself "alive". If
    omitted, or set to 0, the default function floor((n+1)/2) is used,
    where n is the sum of all defined heuristics' score attributes.
    This must never exceed the sum of the heuristic scores, or else
    the quorum disk will never be available.
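    For example, three heuristics with score="1" each give n = 3, so
    the default minimum score is floor((3+1)/2) = 2. A worked sketch
    of the default function (the function name is illustrative):

```python
import math

def default_min_score(n):
    """Default min_score when omitted or 0: floor((n + 1) / 2),
    where n is the sum of all defined heuristics' scores."""
    return math.floor((n + 1) / 2)

print(default_min_score(3))   # → 2
print(default_min_score(10))  # → 5
```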

reboot="1"
    If set to 0 (off), qdiskd will *not* reboot after a negative
    transition as a result of a change in score (see section 2.2).
    The default for this value is 1 (on). This option can be changed
    while qdiskd is running.

master_wins="0"
    If set to 1 (on), only the qdiskd master will advertise its votes
    to CMAN. In a network partition, only the qdiskd master will
    provide votes to CMAN; consequently, that node will automatically
    "win" in a fence race.

    This option requires careful tuning of the CMAN timeout, the
    qdiskd timeout, and CMAN's quorum_dev_poll value. As a rule of
    thumb, CMAN's quorum_dev_poll value should be equal to Totem's
    token timeout, and qdiskd's timeout (interval*tko) should be less
    than half of Totem's token timeout. See section 3.1.1 for more
    information.

    This option only takes effect if no heuristics are configured, and
    it is valid only for two-node clusters. It is automatically
    disabled if heuristics are defined or if the cluster has more than
    two nodes configured.

    In a two-node cluster with no heuristics and no defined vote count
    (see above), this mode is turned on by default. If enabled in
    this way at startup, and a node is later added to the cluster
    configuration or the vote count is set to a value other than 1,
    this mode will be disabled.

allow_kill="1"
    If set to 0 (off), qdiskd will *not* instruct CMAN to kill nodes
    it thinks are dead (as a result of not writing to the quorum
    disk). The default for this value is 1 (on). This option can be
    changed while qdiskd is running.

paranoid="0"
    If set to 1 (on), qdiskd will watch internal timers and reboot the
    node if it takes more than (interval * tko) seconds to complete a
    quorum disk pass. The default for this value is 0 (off). This
    option can be changed while qdiskd is running.

io_timeout="0"
    If set to 1 (on), qdiskd will watch internal timers and reboot the
    node if qdiskd is not able to write to the disk within (interval *
    tko) seconds. The default for this value is 0 (off). If
    io_timeout is active, max_error_cycles is overridden and set to
    off.

scheduler="rr"
    Valid values are 'rr', 'fifo', and 'other'. Selects the
    scheduling queue in the Linux kernel for operation of the main &
    score threads (does not affect the heuristics; they are always run
    in the 'other' queue). Default is 'rr'. See sched_setscheduler(2)
    for more details.

priority="1"
    Valid values for 'rr' and 'fifo' are 1..100 inclusive. Valid
    values for 'other' are -20..20 inclusive. Sets the priority of
    the main & score threads. The default value is 1 (in the RR and
    FIFO queues, higher numbers denote higher priority; in OTHER,
    lower values denote higher priority). This option can be changed
    while qdiskd is running.

stop_cman="0"
    Ordinarily, cluster membership is left up to CMAN, not qdisk. If
    this parameter is set to 1 (on), qdiskd will tell CMAN to leave
    the cluster if it is unable to initialize the quorum disk during
    startup. This can be used to prevent cluster participation by a
    node which has been disconnected from the SAN. The default for
    this value is 0 (off). This option can be changed while qdiskd is
    running.

use_uptime="1"
    If this parameter is set to 1 (on), qdiskd will use values from
    /proc/uptime for internal timings. This is a bit less precise
    than gettimeofday(2), but the benefit is that changing the system
    clock will not affect qdiskd's behavior - even if paranoid is
    enabled. If set to 0, qdiskd will use gettimeofday(2), which is
    more precise. The default for this value is 1 (on / use uptime).

device="/dev/sda1"
    This is the device the quorum daemon will use. This device must
    be the same on all nodes.

label="mylabel"
    This overrides the device field if present. If specified, the
    quorum daemon will read /proc/partitions and check for qdisk
    signatures on every block device found, comparing each signature's
    label against the specified label. This is useful in
    configurations where the block device name differs on a per-node
    basis.

cman_label="mylabel"
    This overrides the label advertised to CMAN if present. If
    specified, the quorum daemon will register with this name instead
    of the actual device name.

max_error_cycles="0"
    If qdiskd receives an I/O error during a cycle, it does not poll
    CMAN to tell it we are alive. If specified, this value will cause
    qdiskd to exit after the specified number of consecutive cycles
    during which I/O errors occur. The default is 0 (no maximum).
    This option can be changed while qdiskd is running. This option
    is ignored if io_timeout is set to 1.
/>

3.1.1. Quorum Disk Timings

Qdiskd should not be used in environments requiring failure detection
times of less than approximately 10 seconds.

Qdiskd will attempt to automatically configure timings based on the
totem timeout and the TKO. If configuring manually, Totem's token
timeout must be set to a value at least 1 interval greater than the
following function:

    interval * (tko + master_wait + upgrade_wait)

So, if you have an interval of 2, a tko of 7, a master_wait of 2 and
an upgrade_wait of 2, the token timeout should be at least 24 seconds
(24000 msec).
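That arithmetic can be checked directly (a worked example of the
formula above; the function name is illustrative):

```python
def min_token_timeout(interval, tko, master_wait, upgrade_wait):
    """Lower bound on Totem's token timeout, in seconds: the
    qdiskd convergence time plus at least one extra interval."""
    return interval * (tko + master_wait + upgrade_wait) + interval

print(min_token_timeout(2, 7, 2, 2))  # → 24
```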

It is recommended to have at least 3 intervals to reduce the risk of
quorum loss during heavy I/O load. As a rule of thumb, a totem
timeout of more than 2x qdiskd's timeout will result in good behavior.

An improper timing configuration will cause CMAN to give up on qdiskd,
causing a temporary loss of quorum during master transition.


3.2. The <heuristic> tag

This tag is a child of the <quorumd> tag. Heuristics may not be
changed while qdiskd is running.

<heuristic
program="/test.sh"
    This is the program used to determine if this heuristic is alive.
    This can be anything which may be executed by /bin/sh -c. A
    return value of zero indicates success; anything else indicates
    failure. This is required.

score="1"
    This is the weight of this heuristic. Be careful when determining
    scores for heuristics. The default score for each heuristic is 1.

interval="2"
    This is the frequency (in seconds) at which we poll the heuristic.
    The default interval is determined by the qdiskd timeout.

tko="1"
    After this many failed attempts to run the heuristic, it is
    considered DOWN, and its score is removed. The default tko for
    each heuristic is determined by the qdiskd timeout.
/>


3.3. Examples

3.3.1. 3 cluster nodes & 3 routers

    <cman expected_votes="6" .../>
    <clusternodes>
        <clusternode name="node1" votes="1" ... />
        <clusternode name="node2" votes="1" ... />
        <clusternode name="node3" votes="1" ... />
    </clusternodes>
    <quorumd interval="1" tko="10" votes="3" label="testing">
        <heuristic program="ping A -c1 -w1" score="1" interval="2" tko="3"/>
        <heuristic program="ping B -c1 -w1" score="1" interval="2" tko="3"/>
        <heuristic program="ping C -c1 -w1" score="1" interval="2" tko="3"/>
    </quorumd>

3.3.2. 2 cluster nodes & 1 IP tiebreaker

    <cman two_node="0" expected_votes="3" .../>
    <clusternodes>
        <clusternode name="node1" votes="1" ... />
        <clusternode name="node2" votes="1" ... />
    </clusternodes>
    <quorumd interval="1" tko="10" votes="1" label="testing">
        <heuristic program="ping A -c1 -w1" score="1" interval="2" tko="3"/>
    </quorumd>


3.4. Heuristic score considerations

* Heuristic timeouts should be set high enough to allow the previous
run of a given heuristic to complete.

* Heuristic scripts returning anything except 0 as their return code
are considered failed.

* The worst case for improperly configured quorum heuristics is a race
to fence, where two partitions simultaneously try to kill each other.

3.5. Creating a quorum disk partition

The mkqdisk utility can create and list currently configured quorum
disks visible to the local node; see mkqdisk(8) for more details.

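For example (the device path and label here are placeholders;
initializing a quorum disk destroys any existing data on the
partition):

```
# Initialize a shared partition as a quorum disk with a label
mkqdisk -c /dev/sdb1 -l myqdisk

# List the quorum disks visible to this node
mkqdisk -L
```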

SEE ALSO
mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5), gettimeofday(2)


                                 12 Oct 2011                         QDisk(5)