QDisk(5)                    Cluster Quorum Disk                    QDisk(5)


NAME
       qdisk - a disk-based quorum daemon for CMAN / Linux-Cluster

1. Overview

1.1. Problem

In some situations, it may be necessary or desirable to sustain a
majority node failure of a cluster without introducing the need for
asymmetric cluster configurations (e.g. client-server, or
heavily-weighted voting nodes).

1.2. Design Requirements

* Ability to sustain 1..(n-1)/n simultaneous node failures, without
  the danger of a simple network partition causing a split brain.
  That is, we need to be able to ensure that the majority failure
  case is not merely the result of a network partition.

* Ability to use external reasons for deciding which partition is
  the quorate partition in a partitioned cluster.  For example, a
  user may have a service running on one node, and that node must
  always be the master in the event of a network partition.  Or, a
  node might lose all network connectivity except the cluster
  communication path, in which case a user may wish that node to be
  evicted from the cluster.

* Integration with CMAN.  We must not require CMAN to run with us
  (or without us).  Linux-Cluster does not require a quorum disk
  normally; introducing new requirements on how the base of
  Linux-Cluster operates is not allowed.

* Data integrity.  In order to recover from a majority failure,
  fencing is required.  The fencing subsystem is already provided by
  Linux-Cluster.

* Non-reliance on hardware- or protocol-specific methods (i.e. SCSI
  reservations).  This ensures the quorum disk algorithm can be used
  on the widest range of hardware configurations possible.

* Little or no memory allocation after initialization.  In critical
  paths during failover, we do not want to have to worry about being
  killed during a memory pressure situation because we incur a page
  fault and the Linux OOM killer responds...


1.3. Hardware Considerations and Requirements

1.3.1. Concurrent, Synchronized Access

This quorum daemon requires a shared block device with concurrent
read/write access from all nodes in the cluster.  The shared block
device can be a multi-port SCSI RAID array, a Fibre Channel RAID
SAN, a RAIDed iSCSI target, or even GNBD.  The quorum daemon uses
O_DIRECT to write to the device.

1.3.2. Bargain-basement JBODs need not apply

There is a minimum performance requirement inherent in disk-based
cluster quorum algorithms, so design your cluster accordingly.
Using a cheap JBOD with old SCSI2 disks on a multi-initiator bus
will cause problems at the first load spike.  Plan your loads
accordingly; a node's inability to write to the quorum disk in a
timely manner will cause the cluster to evict the node.  Using
host-RAID or multi-initiator parallel SCSI configurations with the
qdisk daemon is unlikely to work, and will probably cause
administrators a lot of frustration.  That having been said, because
the timeouts are configurable, most hardware should work if the
timeouts are set high enough.

1.3.3. Fencing is Required

In order to maintain data integrity under all failure scenarios, use
of this quorum daemon requires adequate fencing, preferably
power-based fencing.  Watchdog timers and software-based solutions
to reboot the node internally, while possibly sufficient, are not
considered 'fencing' for the purposes of using the quorum disk.

1.4. Limitations

* At this time, this daemon supports a maximum of 16 nodes.  This is
  primarily a scalability issue: as we increase the node count, we
  increase the amount of synchronous I/O contention on the shared
  quorum disk.

* Cluster node IDs must be statically configured in cluster.conf and
  must be numbered from 1..16 (there can be gaps, of course).

* Cluster node votes must all be 1.

* CMAN must be running before the qdisk program can operate at full
  capacity.  If CMAN is not running, qdisk will wait for it.

* CMAN's eviction timeout should be at least 2x the quorum daemon's,
  to give the quorum daemon adequate time to converge on a master
  during a failure + load spike situation.  See section 3.3.1 for
  specific details.

* For 'all-but-one' failure operation, the total number of votes
  assigned to the quorum device should be equal to or greater than
  the total number of node votes in the cluster.  While it is
  possible to assign only one (or a few) votes to the quorum device,
  the effects of doing so have not been explored.

* For 'tiebreaker' operation in a two-node cluster, unset CMAN's
  two_node flag (or set it to 0), set CMAN's expected votes to '3',
  set each node's vote to '1', and leave qdisk's vote count unset.
  This will allow the cluster to operate if both nodes are online,
  or if a single node is online and its heuristics pass.

* Currently, the quorum disk daemon is difficult to use with CLVM if
  the quorum disk resides on a CLVM logical volume.  CLVM requires a
  quorate cluster to operate correctly, which introduces a
  chicken-and-egg problem for starting the cluster: CLVM needs
  quorum, but the quorum daemon needs CLVM (if and only if the
  quorum device lies on CLVM-managed storage).  One way to work
  around this is to *not* set the cluster's expected votes to
  include the quorum daemon's votes.  Bring all nodes online, and
  start the quorum daemon *after* the whole cluster is running.
  This will allow the expected votes to increase naturally.
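The vote arithmetic behind the 'all-but-one' and 'tiebreaker' modes
above can be sketched as follows.  This is illustrative Python, not
qdiskd code, and CMAN's quorum rule is simplified here to a bare
majority of expected votes:

```python
# Illustrative sketch of the quorum arithmetic described above; not
# qdiskd code.  CMAN quorum is modeled as a simple majority.

def quorum_threshold(expected_votes):
    """Votes needed for quorum: a simple majority of expected votes."""
    return expected_votes // 2 + 1

# 'All-but-one' operation: a 4-node cluster where the quorum device
# carries the default nodes - 1 = 3 votes, so expected votes total 7.
nodes, qdisk_votes = 4, 3
expected = nodes + qdisk_votes
assert quorum_threshold(expected) == 4
# A single surviving node plus the quorum disk still reaches quorum:
assert 1 + qdisk_votes >= quorum_threshold(expected)

# 'Tiebreaker' operation: two nodes, expected_votes="3", qdisk vote 1.
expected = 3
assert 1 + 1 >= quorum_threshold(expected)   # one node + quorum disk
assert 1 < quorum_threshold(expected)        # a lone node is inquorate
```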


2. Algorithms

2.1. Heartbeating & Liveliness

Nodes update individual status blocks on the quorum disk at a
user-defined rate.  Each write of a status block alters the
timestamp, which is what other nodes use to decide whether a node
has hung or not.  After a user-defined number of 'misses' (that is,
failures to update the timestamp), a node is declared offline.
After a certain number of 'hits' (changed timestamp + "I am alive"
state), the node is declared online.
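The miss/hit bookkeeping above can be sketched as follows.  This is
an illustrative model only; the class and method names are invented,
not qdiskd internals:

```python
# Illustrative sketch of the hit/miss liveliness rule described
# above; names are invented and do not reflect qdiskd's internals.
class NodeLiveness:
    def __init__(self, tko, tko_up):
        self.tko = tko          # consecutive misses to declare offline
        self.tko_up = tko_up    # consecutive hits to declare online
        self.misses = self.hits = 0
        self.online = False

    def observe(self, timestamp_changed):
        """Record one read of the node's status block."""
        if timestamp_changed:
            self.misses = 0
            self.hits += 1
            if self.hits >= self.tko_up:
                self.online = True
        else:
            self.hits = 0
            self.misses += 1
            if self.misses >= self.tko:
                self.online = False
        return self.online

node = NodeLiveness(tko=10, tko_up=3)
for _ in range(3):            # three consecutive timely writes
    node.observe(True)
assert node.online
node.observe(False)           # a single miss does not evict the node
assert node.online
for _ in range(10):           # ten consecutive misses do
    node.observe(False)
assert not node.online
```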

The status block contains additional information, such as a bitmask
of the nodes that node believes are online.  Some of this
information is used by the master, while some is recorded purely for
performance purposes and may be used at a later time.  The most
important pieces of information a node writes to its status block
are:

   - Timestamp
   - Internal state (available / not available)
   - Score
   - Known max score (may be used in the future to detect invalid
     configurations)
   - Vote/bid messages
   - Other nodes it thinks are online


2.2. Scoring & Heuristics

The administrator can configure up to 10 purely arbitrary
heuristics, and must exercise caution in doing so.  At least one
administrator-defined heuristic is required for operation, but it is
generally a good idea to have more than one.  By default, only nodes
scoring over half of the total maximum score will claim they are
available via the quorum disk, and a node (master or otherwise)
whose score drops too low will remove itself (usually, by
rebooting).
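The scoring rule above (together with the default min_score function
described in section 3.1) can be sketched numerically.  This is
illustrative Python, not qdiskd code, and it assumes "available"
means the current score meets or exceeds the threshold:

```python
# Illustrative sketch of the scoring rule described above; not
# qdiskd code.  min_score defaults to floor((n+1)/2), where n is the
# sum of all defined heuristics' score attributes.
import math

def default_min_score(total_heuristic_score):
    return math.floor((total_heuristic_score + 1) / 2)

def node_available(passing_scores, total_heuristic_score, min_score=0):
    """True if the scores of the passing heuristics meet the threshold."""
    threshold = min_score or default_min_score(total_heuristic_score)
    return sum(passing_scores) >= threshold

# Three heuristics worth 1 point each: min_score defaults to 2, so a
# node needs at least two passing heuristics to stay available.
assert default_min_score(3) == 2
assert node_available([1, 1], 3)
assert not node_available([1], 3)
```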

The heuristics themselves can be any command executable by 'sh -c'.
For example, in early testing the following was used:

    <heuristic program="[ -f /quorum ]" score="10" interval="2"/>

This is a literal sh-ism which tests for the existence of a file
called "/quorum".  Without that file, the node would claim it was
unavailable.  This is an awful example, and should never, ever be
used in production, but it illustrates what one could do...

Typically, the heuristics should be snippets of shell code or
commands which help determine a node's usefulness to the cluster or
its clients.  Ideally, you want to add checks for all of your
network paths (e.g. check links, or ping routers), and methods to
detect availability of shared storage.


2.3. Master Election

Only one master is present at any one time in the cluster,
regardless of how many partitions exist within the cluster itself.
The master is elected by a simple voting scheme in which the
lowest-numbered node which believes it is capable of running (i.e.
scores high enough) bids for master status.  If the other nodes
agree, it becomes the master.  This algorithm is run whenever no
master is present.

If another node comes online with a lower node ID while a node is
still bidding for master status, the bidding node will rescind its
bid and vote for the lower node ID.  If a master dies or a bidding
node dies, the voting algorithm is started over.  The voting
algorithm typically takes two passes to complete.
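The election rule above reduces to "the lowest capable node ID
wins".  A toy sketch (illustrative Python, not qdiskd's actual
bid/vote message exchange):

```python
# Toy sketch of the election outcome described above; the real
# algorithm exchanges bid and vote messages via the quorum disk.

def elect_master(capable_node_ids):
    """A bidding node rescinds in favor of any lower ID it sees, so
    the lowest capable node ID ends up as master."""
    return min(capable_node_ids) if capable_node_ids else None

assert elect_master([3, 1, 2]) == 1   # node 1 wins; 2 and 3 rescind
assert elect_master([4]) == 4         # a sole capable node masters itself
assert elect_master([]) is None       # no capable node, no master
```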

Master deaths take marginally longer to recover from than non-master
deaths, because a new master must be elected before the old master
can be evicted & fenced.


2.4. Master Duties

The master node decides who is or is not in the master partition,
and handles eviction of dead nodes (both via the quorum disk and via
the linux-cluster fencing system, by using the cman_kill_node()
API).


2.5. How Membership is Determined

When a master is present, and if the master believes a node to be
online, that node will advertise to CMAN that the quorum disk is
available.  The master will only grant a node membership if:

   (a) CMAN believes the node to be online,
   (b) the node has made enough consecutive, timely writes to the
       quorum disk, and
   (c) the node has a high enough score to consider itself online.
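The three conditions above can be written as a single predicate.
This is a sketch only; the parameter names are invented, not
qdiskd's internal API:

```python
# Illustrative predicate for the membership conditions above; the
# parameter names are invented for this sketch.

def grant_membership(cman_online, timely_writes, tko_up, score_ok):
    return (cman_online                   # (a) CMAN sees the node
            and timely_writes >= tko_up   # (b) enough consecutive writes
            and score_ok)                 # (c) heuristic score high enough

assert grant_membership(True, 3, 3, True)
assert not grant_membership(True, 2, 3, True)    # too few timely writes
assert not grant_membership(True, 3, 3, False)   # score too low
```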


3. Configuration

3.1. The <quorumd> tag

This tag is a child of the top-level <cluster> tag.

<quorumd
    interval="1"
        This is the frequency of read/write cycles, in seconds.

    tko="10"
        This is the number of cycles a node must miss in order to be
        declared dead.  The default for this number is dependent on
        the configured token timeout.

    tko_up="X"
        This is the number of cycles a node must be seen in order to
        be declared online.  The default is floor(tko/3).

    upgrade_wait="2"
        This is the number of cycles a node must wait before
        initiating a bid for master status after heuristic scoring
        becomes sufficient.  The default is 2.  This cannot be set
        to 0, and should not exceed tko.

    master_wait="X"
        This is the number of cycles a node must wait for votes
        before declaring itself master after making a bid.  The
        default is floor(tko/2).  This cannot be less than 2, must
        be greater than tko_up, and should not exceed tko.

    votes="3"
        This is the number of votes the quorum daemon advertises to
        CMAN when it has a high enough score.  The default is the
        number of nodes in the cluster minus 1.  For example, in a
        4-node cluster, the default is 3.  This value may change
        during normal operation, for example when adding or removing
        a node from the cluster.

    log_level="4"
        This controls the verbosity of the quorum daemon in the
        system logs.  0 = emergencies; 7 = debug.  This option is
        deprecated.

    log_facility="daemon"
        This controls the syslog facility used by the quorum daemon
        when logging.  For a complete list of available facilities,
        see syslog.conf(5).  The default value for this is 'daemon'.
        This option is deprecated.

    status_file="/foo"
        Write internal states out to this file periodically ("-" =
        use stdout).  This is primarily used for debugging.  The
        default value for this attribute is undefined.  This option
        can be changed while qdiskd is running.

    min_score="3"
        Absolute minimum score required to consider oneself "alive".
        If omitted, or set to 0, the default function floor((n+1)/2)
        is used, where n is the sum of all defined heuristics' score
        attributes.  This must never exceed the sum of the heuristic
        scores, or else the quorum disk will never be available.

    reboot="1"
        If set to 0 (off), qdiskd will *not* reboot after a negative
        transition as a result of a change in score (see section
        2.2).  The default for this value is 1 (on).  This option
        can be changed while qdiskd is running.

    master_wins="0"
        If set to 1 (on), only the qdiskd master will advertise its
        votes to CMAN.  In a network partition, only the qdisk
        master will provide votes to CMAN.  Consequently, that node
        will automatically "win" in a fence race.

        This option requires careful tuning of the CMAN timeout, the
        qdiskd timeout, and CMAN's quorum_dev_poll value.  As a rule
        of thumb, CMAN's quorum_dev_poll value should be equal to
        Totem's token timeout, and qdiskd's timeout (interval*tko)
        should be less than half of Totem's token timeout.  See
        section 3.3.1 for more information.

        This option only takes effect if there are no heuristics
        configured.  Usage of this option in configurations with
        more than two cluster nodes is undefined and should not be
        done.

        In a two-node cluster with no heuristics and no defined vote
        count (see above), this mode is turned on by default.  If
        enabled in this way at startup, and a node is later added to
        the cluster configuration or the vote count is set to a
        value other than 1, this mode will be disabled.

    allow_kill="1"
        If set to 0 (off), qdiskd will *not* instruct CMAN to kill
        nodes it thinks are dead (as a result of not writing to the
        quorum disk).  The default for this value is 1 (on).  This
        option can be changed while qdiskd is running.

    paranoid="0"
        If set to 1 (on), qdiskd will watch internal timers and
        reboot the node if it takes more than (interval * tko)
        seconds to complete a quorum disk pass.  The default for
        this value is 0 (off).  This option can be changed while
        qdiskd is running.

    io_timeout="0"
        If set to 1 (on), qdiskd will watch internal timers and
        reboot the node if qdiskd is not able to write to the disk
        within (interval * tko) seconds.  The default for this value
        is 0 (off).  If io_timeout is active, max_error_cycles is
        overridden and set to off.

    scheduler="rr"
        Valid values are 'rr', 'fifo', and 'other'.  Selects the
        scheduling queue in the Linux kernel for operation of the
        main & score threads (does not affect the heuristics; they
        are always run in the 'other' queue).  The default is 'rr'.
        See sched_setscheduler(2) for more details.

    priority="1"
        Valid values for 'rr' and 'fifo' are 1..100 inclusive.
        Valid values for 'other' are -20..20 inclusive.  Sets the
        priority of the main & score threads.  The default value is
        1 (in the RR and FIFO queues, higher numbers denote higher
        priority; in OTHER, lower values denote higher priority).
        This option can be changed while qdiskd is running.

    stop_cman="0"
        Ordinarily, cluster membership is left up to CMAN, not
        qdiskd.  If this parameter is set to 1 (on), qdiskd will
        tell CMAN to leave the cluster if it is unable to initialize
        the quorum disk during startup.  This can be used to prevent
        cluster participation by a node which has been disconnected
        from the SAN.  The default for this value is 0 (off).  This
        option can be changed while qdiskd is running.

    use_uptime="1"
        If this parameter is set to 1 (on), qdiskd will use values
        from /proc/uptime for internal timings.  This is a bit less
        precise than gettimeofday(2), but the benefit is that
        changing the system clock will not affect qdiskd's behavior,
        even if paranoid is enabled.  If set to 0, qdiskd will use
        gettimeofday(2), which is more precise.  The default for
        this value is 1 (on / use uptime).

    device="/dev/sda1"
        This is the device the quorum daemon will use.  This device
        must be the same on all nodes.

    label="mylabel"
        This overrides the device field, if present.  If specified,
        the quorum daemon will read /proc/partitions and check for
        qdisk signatures on every block device found, comparing the
        label against the specified label.  This is useful in
        configurations where the block device name differs on a
        per-node basis.

    cman_label="mylabel"
        This overrides the label advertised to CMAN, if present.  If
        specified, the quorum daemon will register with this name
        instead of the actual device name.

    max_error_cycles="0"
        If we receive an I/O error during a cycle, we do not poll
        CMAN and tell it we are alive.  If specified, this value
        will cause qdiskd to exit after the specified number of
        consecutive cycles during which I/O errors occur.  The
        default is 0 (no maximum).  This option can be changed while
        qdiskd is running.  This option is ignored if io_timeout is
        set to 1.
/>


3.3.1. Quorum Disk Timings

Qdiskd should not be used in environments requiring failure
detection times of less than approximately 10 seconds.

Qdiskd will attempt to automatically configure timings based on the
totem timeout and the TKO.  If configuring manually, Totem's token
timeout must be set to a value at least one interval greater than
the following function:

    interval * (tko + master_wait + upgrade_wait)

So, if you have an interval of 2, a tko of 7, a master_wait of 2,
and an upgrade_wait of 2, the token timeout should be at least 24
seconds (24000 msec).
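The arithmetic above can be checked directly.  This is a sketch of
the calculation only, not of qdiskd's auto-configuration logic:

```python
# Sketch of the timing rule described above: Totem's token timeout
# must exceed interval * (tko + master_wait + upgrade_wait) by at
# least one interval.

def min_token_timeout(interval, tko, master_wait, upgrade_wait):
    """Smallest acceptable Totem token timeout, in seconds."""
    return interval * (tko + master_wait + upgrade_wait) + interval

# interval=2, tko=7, master_wait=2, upgrade_wait=2  ->  24 seconds.
assert min_token_timeout(2, 7, 2, 2) == 24

# Rule of thumb from below: a token timeout over 2x qdiskd's own
# timeout (interval * tko) gives good behavior; 2 * 2 * 7 = 28 > 24.
assert 2 * (2 * 7) > min_token_timeout(2, 7, 2, 2)
```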

It is recommended to have at least 3 intervals of headroom to reduce
the risk of quorum loss during heavy I/O load.  As a rule of thumb,
a totem timeout of more than 2x qdiskd's timeout will result in good
behavior.

An improper timing configuration will cause CMAN to give up on
qdiskd, causing a temporary loss of quorum during master transition.


3.2. The <heuristic> tag

This tag is a child of the <quorumd> tag.  Heuristics may not be
changed while qdiskd is running.

<heuristic
    program="/test.sh"
        This is the program used to determine if this heuristic is
        alive.  This can be anything which may be executed by
        /bin/sh -c.  A return value of zero indicates success;
        anything else indicates failure.  This is required.

    score="1"
        This is the weight of this heuristic.  Be careful when
        determining scores for heuristics.  The default score for
        each heuristic is 1.

    interval="2"
        This is the frequency (in seconds) at which we poll the
        heuristic.  The default interval is determined by the qdiskd
        timeout.

    tko="1"
        After this many failed attempts to run the heuristic, it is
        considered DOWN, and its score is removed.  The default tko
        for each heuristic is determined by the qdiskd timeout.
/>


3.4. Examples

3.4.1. 3 cluster nodes & 3 routers

    <cman expected_votes="6" .../>
    <clusternodes>
        <clusternode name="node1" votes="1" ... />
        <clusternode name="node2" votes="1" ... />
        <clusternode name="node3" votes="1" ... />
    </clusternodes>
    <quorumd interval="1" tko="10" votes="3" label="testing">
        <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
        <heuristic program="ping B -c1 -t1" score="1" interval="2" tko="3"/>
        <heuristic program="ping C -c1 -t1" score="1" interval="2" tko="3"/>
    </quorumd>


3.4.2. 2 cluster nodes & 1 IP tiebreaker

    <cman two_node="0" expected_votes="3" .../>
    <clusternodes>
        <clusternode name="node1" votes="1" ... />
        <clusternode name="node2" votes="1" ... />
    </clusternodes>
    <quorumd interval="1" tko="10" votes="1" label="testing">
        <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
    </quorumd>


3.5. Heuristic score considerations

* Heuristic timeouts should be set high enough to allow the previous
  run of a given heuristic to complete.

* Heuristic scripts returning anything except 0 as their return code
  are considered failed.

* The worst case for improperly configured quorum heuristics is a
  fence race, in which two partitions simultaneously try to kill
  each other.


3.6. Creating a quorum disk partition

The mkqdisk utility can create and list currently configured quorum
disks visible to the local node; see mkqdisk(8) for more details.


SEE ALSO
       mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5), gettimeofday(2)


                             20 Feb 2007                          QDisk(5)