SBD(8)                       STONITH Block Device                      SBD(8)


NAME
       sbd - STONITH Block Device daemon

SYNOPSIS
       sbd <-d /dev/...> [options] "command"

DESCRIPTION
       SBD provides a node fencing mechanism (Shoot the other node in the
       head, STONITH) for Pacemaker-based clusters through the exchange of
       messages via shared block storage such as, for example, a SAN, iSCSI,
       or FCoE device. This isolates the fencing mechanism from changes in
       firmware version or dependencies on specific firmware controllers, and
       it can be used as a STONITH mechanism in all configurations that have
       reliable shared storage.

       SBD can also be used without any shared storage. In this mode, the
       watchdog device will be used to reset the node if it loses quorum, if
       any monitored daemon is lost and not recovered, or if Pacemaker
       decides that the node requires fencing.

       The sbd binary implements both the daemon that watches the message
       slots as well as the management tool for interacting with the block
       storage device(s). This mode of operation is specified via the
       "command" parameter; some of these modes take additional parameters.

       To use SBD with shared storage, you must first "create" the messaging
       layout on one to three block devices. Second, configure
       /etc/sysconfig/sbd to list those devices (and possibly adjust other
       options), and restart the cluster stack on each node to ensure that
       "sbd" is started. Third, configure the "external/sbd" fencing resource
       in the Pacemaker CIB.

       Each of these steps is documented in more detail below the description
       of the command options.

       "sbd" can only be used as root.

GENERAL OPTIONS
       -d /dev/...
           Specify the block device(s) to be used. If you have more than one,
           specify this option up to three times. This parameter is mandatory
           for all modes, since SBD always needs a block device to interact
           with.

           This man page uses /dev/sda1, /dev/sdb1, and /dev/sdc1 as example
           device names for brevity. However, in your production environment,
           you should instead always refer to them by using the long, stable
           device name (e.g.,
           /dev/disk/by-id/dm-uuid-part1-mpath-3600508b400105b5a0001500000250000).

       -v  Enable some verbose debug logging.

       -h  Display a concise summary of "sbd" options.

       -n node
           Set local node name; defaults to "uname -n". This should not need
           to be set.

       -R  Do not enable realtime priority. By default, "sbd" runs at
           realtime priority, locks itself into memory, and also acquires
           highest IO priority to protect itself against interference from
           other processes on the system. This is a debugging-only option.

       -I N
           Async IO timeout (defaults to 3 seconds, optional). You should not
           need to adjust this unless your IO setup is really very slow.

           (In daemon mode, the watchdog is refreshed when the majority of
           devices could be read within this time.)

create
       Example usage:

           sbd -d /dev/sdc2 -d /dev/sdd3 create

       If you specify the create command, sbd will write a metadata header to
       the device(s) specified and also initialize the messaging slots for up
       to 255 nodes.

       Warning: This command will not prompt for confirmation. Roughly the
       first megabyte of the specified block device(s) will be overwritten
       immediately and without backup.

       This command accepts a few options to adjust the default timings that
       are written to the metadata (to ensure they are identical across all
       nodes accessing the device).

       -1 N
           Set watchdog timeout to N seconds. This depends mostly on your
           storage latency; the majority of devices must be successfully read
           within this time, or else the node will self-fence.

           If your sbd device(s) reside on a multipath setup or iSCSI, this
           should be the time required to detect a path failure. You may be
           able to reduce this if your device outages are independent, or if
           you are using the Pacemaker integration.

       -2 N
           Set slot allocation timeout to N seconds. You should not need to
           tune this.

       -3 N
           Set daemon loop timeout to N seconds. You should not need to tune
           this.

       -4 N
           Set msgwait timeout to N seconds. This should be twice the
           watchdog timeout. This is the time after which a message written
           to a node's slot will be considered delivered. (Or long enough for
           the node to detect that it needed to self-fence.)

           This also affects the stonith-timeout in Pacemaker's CIB; see
           below.

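       For example, on multipath or iSCSI storage where detecting a path
       failure may take up to a minute, the watchdog timeout could be raised
       to 60 seconds and msgwait set to twice that value (the device name
       below is a placeholder):

           sbd -d /dev/sda1 -1 60 -4 120 create
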
list
       Example usage:

           # sbd -d /dev/sda1 list
           0       hex-0   clear
           1       hex-7   clear
           2       hex-9   clear

       List all allocated slots on the device, and any messages. You should
       see all cluster nodes that have ever been started against this device.
       Nodes that are currently running should have a clear state; nodes that
       have been fenced, but not yet restarted, will show the appropriate
       fencing message.

dump
       Example usage:

           # sbd -d /dev/sda1 dump
           ==Dumping header on disk /dev/sda1
           Header version     : 2
           Number of slots    : 255
           Sector size        : 512
           Timeout (watchdog) : 15
           Timeout (allocate) : 2
           Timeout (loop)     : 1
           Timeout (msgwait)  : 30
           ==Header on disk /dev/sda1 is dumped

       Dump the meta-data header from the device.

watch
       Example usage:

           sbd -d /dev/sdc2 -d /dev/sdd3 -P watch

       This command will make "sbd" start in daemon mode. It will constantly
       monitor the message slot of the local node for incoming messages,
       monitor the reachability of the devices, and optionally take
       Pacemaker's state into account.

       "sbd" must be started on boot before the cluster stack! See below for
       enabling this according to your boot environment.

       The options for this mode are rarely specified on the command line
       directly, but most frequently set via /etc/sysconfig/sbd.

       It also constantly monitors connectivity to the storage device(s), and
       self-fences in case the partition becomes unreachable, guaranteeing
       that it does not disconnect from fencing messages.

       A node slot is automatically allocated on the device(s) the first time
       the daemon starts watching the device; hence, manual allocation is not
       usually required.

       If a watchdog is used together with "sbd", as is strongly recommended,
       the watchdog is activated at the initial start of the sbd daemon. The
       watchdog is refreshed every time the majority of SBD devices has been
       successfully read. Using a watchdog provides additional protection
       against "sbd" crashing.

       If the Pacemaker integration is activated, "sbd" will not self-fence
       when device majority is lost, provided that:

       1.  the partition the node is in is still quorate according to the
           CIB;

       2.  it is still quorate according to Corosync's node count;

       3.  the node itself is considered online and healthy by Pacemaker.

       This allows "sbd" to survive temporary outages of the majority of
       devices. However, while the cluster is in such a degraded state, it
       can neither successfully fence nor be shut down cleanly (as taking the
       cluster below the quorum threshold will immediately cause all
       remaining nodes to self-fence). In short, it will not tolerate any
       further faults. Please repair the system before continuing.

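       Whether these conditions currently hold can be checked with the usual
       cluster tools; for example (illustrative commands, not part of "sbd"
       itself):

           # check Corosync quorum state
           corosync-quorumtool -s
           # check node status as seen by Pacemaker
           crm_mon -1
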
       There is one "sbd" process that acts as a master to which all watchers
       report; one per device to monitor the node's slot; and, optionally,
       one that handles the Pacemaker integration.

       -W  Enable or disable use of the system watchdog to protect against
           the sbd processes failing and the node being left in an undefined
           state. Specify this once to enable, twice to disable.

           Defaults to enabled.

       -w /dev/watchdog
           This can be used to override the default watchdog device used and
           should not usually be necessary.

       -p /var/run/sbd.pid
           This option can be used to specify a pidfile for the main sbd
           process.

       -F N
           Number of failures before a failing servant process will no longer
           be restarted immediately, but only after the dampening delay has
           expired. If set to zero, servants will be restarted immediately
           and indefinitely. If set to one, a failed servant will be
           restarted once every -t seconds. If set to a different value, the
           servant will be restarted that many times within the dampening
           period before the delay applies.

           Defaults to 1.

       -t N
           Dampening delay before faulty servants are restarted. Combined
           with "-F 1", this is the most logical way to tune the restart
           frequency of servant processes. Default is 5 seconds.

           If set to zero, processes will be restarted indefinitely and
           immediately.

       -P  Enable Pacemaker integration which checks Pacemaker quorum and
           node health. Specify this once to enable, twice to disable.

           Defaults to enabled.

       -S N
           Set the start mode. (Defaults to 0.)

           If this is set to zero, sbd will always start up unconditionally,
           regardless of whether the node was previously fenced or not.

           If set to one, sbd will only start if the node was previously shut
           down cleanly (as indicated by an exit request message in the
           slot), or if the slot is empty. A reset, crashdump, or power-off
           request in any slot will halt the start-up.

           This is useful to prevent nodes from rejoining if they were
           faulty. The node must be manually "unfenced" by sending an empty
           message to it:

               sbd -d /dev/sda1 message node1 clear

       -s N
           Set the start-up wait time for devices. (Defaults to 120.)

           Dynamic block devices such as iSCSI might not be fully initialized
           and present yet. This allows setting a timeout for waiting for
           devices to appear on start-up. If set to 0, start-up will be
           aborted immediately if no devices are available.

       -Z  Enable trace mode. Warning: this is unsafe for production, use at
           your own risk! Specifying this once will turn all reboots or
           power-offs, be they caused by self-fence decisions or messages,
           into a crashdump. Specifying this twice will just log them but not
           continue running.

       -T  By default, the daemon will set the watchdog timeout as specified
           in the device metadata. However, this does not work for every
           watchdog device. In this case, you must manually ensure that the
           watchdog timeout used by the system correctly matches the SBD
           settings, and then specify this option to allow "sbd" to continue
           with start-up.

       -5 N
           Warn if the time interval for tickling the watchdog exceeds this
           many seconds. Since the node is unable to log the watchdog expiry
           (it reboots immediately without a chance to write its logs to
           disk), this is very useful for getting an indication that the
           watchdog timeout is too short for the IO load of the system.

           Default is 3 seconds, set to zero to disable.

       -C N
           Watchdog timeout to set before crashdumping. If SBD is set to
           crashdump instead of reboot (either via the trace mode settings or
           the external/sbd fencing agent's parameter), SBD will adjust the
           watchdog timeout to this setting before triggering the dump.
           Otherwise, the watchdog might trigger and prevent a successful
           crashdump from ever being written.

           Defaults to 240 seconds. Set to zero to disable.

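       For example, a three-device setup with the watchdog and the Pacemaker
       integration enabled might be started as follows (normally the init
       system does this using the settings from /etc/sysconfig/sbd rather
       than invoking the daemon by hand):

           sbd -d /dev/sda1 -d /dev/sdb1 -d /dev/sdc1 -W -P watch
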
allocate
       Example usage:

           sbd -d /dev/sda1 allocate node1

       Explicitly allocates a slot for the specified node name. This should
       rarely be necessary, as every node will automatically allocate itself
       a slot the first time it starts up in watch mode.

message
       Example usage:

           sbd -d /dev/sda1 message node1 test

       Writes the specified message to the node's slot. This is rarely done
       directly, but rather abstracted via the "external/sbd" fencing agent
       configured as a cluster resource.

       Supported message types are:

       test
           This only generates a log message on the receiving node and can be
           used to check if SBD is seeing the device. Note that this could
           overwrite a fencing request sent by the cluster, so it should not
           be used during production.

       reset
           Reset the target upon receipt of this message.

       off Power-off the target.

       crashdump
           Cause the target node to crashdump.

       exit
           This will make the "sbd" daemon exit cleanly on the target. You
           should not send this message manually; this is handled properly
           during shutdown of the cluster stack. Manually stopping the daemon
           means the node is unprotected!

       clear
           This message indicates that no real message has been sent to the
           node. You should not set this manually; "sbd" will clear the
           message slot automatically during start-up, and setting this
           manually could overwrite a fencing message from the cluster.

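       For example, to manually fence the node "node2" (normally the cluster
       does this for you via the fencing resource), a reset message would be
       written to its slot on all configured devices:

           sbd -d /dev/sda1 -d /dev/sdb1 -d /dev/sdc1 message node2 reset
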
query-watchdog
       Example usage:

           sbd query-watchdog

       Check for available watchdog devices and print some info.

       Warning: This command will arm the watchdog during the query, and if
       your watchdog refuses disarming (for example, if its kernel module has
       the 'nowayout' parameter set) this will reset your system.

test-watchdog
       Example usage:

           sbd test-watchdog [-w /dev/watchdog3]

       Test the specified watchdog device (/dev/watchdog by default).

       Warning: This command will arm the watchdog and reset your system if
       the watchdog is working properly! If issued from an interactive
       session, it will prompt for confirmation.

Base system configuration
   Configure a watchdog
       It is highly recommended that you configure your Linux system to load
       a watchdog driver with hardware assistance (as is available on most
       modern systems), such as hpwdt, iTCO_wdt, or others. As a fall-back,
       you can use the softdog module.

       No other software must access the watchdog timer; it can only be
       accessed by one process at any given time. Some hardware vendors ship
       systems management software that uses the watchdog for system resets
       (e.g., the HP ASR daemon). Such software must be disabled if the
       watchdog is to be used by SBD.

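       On systems using the systemd module-loading mechanism, the softdog
       fall-back could, for example, be loaded at boot and then verified with
       "sbd" (note that the query arms the watchdog, see the warning above):

           echo softdog > /etc/modules-load.d/watchdog.conf
           modprobe softdog
           sbd query-watchdog
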
   Choosing and initializing the block device(s)
       First, you have to decide if you want to use one, two, or three
       devices.

       If you are using multiple ones, they should reside on independent
       storage setups. Putting all three of them on the same logical unit,
       for example, would not provide any additional redundancy.

       The SBD device can be connected via Fibre Channel, Fibre Channel over
       Ethernet, or even iSCSI. Thus, an iSCSI target can become a sort-of
       network-based quorum server; the advantage is that it does not require
       a smart host at your third location, just block storage.

       The SBD partitions themselves must not be mirrored (via MD, DRBD, or
       the storage layer itself), since this could result in a split-mirror
       scenario. Nor can they reside on cLVM2 volume groups, since they must
       be accessed by the cluster stack before it has started the cLVM2
       daemons; hence, these should be either raw partitions or logical units
       on (multipath) storage.

       The block device(s) must be accessible from all nodes. (While it is
       not necessary that they share the same path name on all nodes, this is
       considered a very good idea.)

       SBD will only use about one megabyte per device, so you can easily
       create a small partition, or very small logical units. (The size of
       the SBD device depends on the block size of the underlying device.
       Thus, 1MB is fine on plain SCSI devices and SAN storage with 512 byte
       blocks. On the IBM s390x architecture in particular, disks default to
       4k blocks, and thus require roughly 4MB.)

       The number of devices will affect the operation of SBD as follows:

       One device
           In its most simple implementation, you use one device only. This
           is appropriate for clusters where all your data is on the same
           shared storage (with internal redundancy) anyway; the SBD device
           does not introduce an additional single point of failure then.

           If the SBD device is not accessible, the daemon will fail to start
           and inhibit openais startup.

       Two devices
           This configuration is a trade-off, primarily aimed at environments
           where host-based mirroring is used, but no third storage device is
           available.

           SBD will not commit suicide if it loses access to one mirror leg;
           this allows the cluster to continue to function even in the face
           of one outage.

           However, SBD will not fence the other side while only one mirror
           leg is available, since it does not have enough knowledge to
           detect an asymmetric split of the storage. So it will not be able
           to automatically tolerate a second failure while one of the
           storage arrays is down. (Though you can use the appropriate crm
           command to acknowledge the fence manually.)

           It will not start unless both devices are accessible on boot.

       Three devices
           In this most reliable and recommended configuration, SBD will only
           self-fence if more than one device is lost; hence, this
           configuration is resilient against temporary single device outages
           (be it due to failures or maintenance). Fencing messages can still
           be successfully relayed if at least two devices remain accessible.

           This configuration is appropriate for more complex scenarios where
           storage is not confined to a single array. For example, host-based
           mirroring solutions could have one SBD per mirror leg (not
           mirrored itself), and an additional tie-breaker on iSCSI.

           It will only start if at least two devices are accessible on boot.

       After you have chosen the devices and created the appropriate
       partitions and perhaps multipath alias names to ease management, use
       the "sbd create" command described above to initialize the SBD
       metadata on them.

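       For example, to initialize three such devices in one step (the short
       device names are placeholders; use the stable /dev/disk/by-id/ names
       in production, as noted under GENERAL OPTIONS):

           sbd -d /dev/sda1 -d /dev/sdb1 -d /dev/sdc1 create
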
   Sharing the block device(s) between multiple clusters
       It is possible to share the block devices between multiple clusters,
       provided the total number of nodes accessing them does not exceed 255,
       and provided they all share the same SBD timeouts (since these are
       part of the metadata).

       If you are using multiple devices, this can reduce the setup overhead
       required. However, you should not share devices between clusters in
       different security domains.

   Configure SBD to start on boot
       On systems using "sysvinit", the "openais" or "corosync" system start-
       up scripts must handle starting or stopping "sbd" as required before
       starting the rest of the cluster stack.

       For "systemd", sbd simply has to be enabled using

           systemctl enable sbd.service

       The daemon is brought online on each node before corosync and
       Pacemaker are started, and terminated only after all other cluster
       components have been shut down - ensuring that cluster resources are
       never activated without SBD supervision.

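       Whether the unit is enabled and running can be checked, for example,
       with:

           systemctl is-enabled sbd.service
           systemctl status sbd.service
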
   Configuration via sysconfig
       The system instance of "sbd" is configured via /etc/sysconfig/sbd. In
       this file, you must specify the device(s) used, as well as any options
       to pass to the daemon:

           SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1"
           SBD_PACEMAKER="true"

       "sbd" will fail to start if no "SBD_DEVICE" is specified. See the
       installed template for more options that can be configured here.

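       As an illustration, a somewhat fuller configuration might look like
       the following (the additional variable names below are taken from the
       configuration template shipped with recent sbd packages; verify them
       against your installed template):

           SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1"
           SBD_PACEMAKER="true"
           SBD_WATCHDOG_DEV="/dev/watchdog"
           # extra command-line options passed to the daemon, see "watch"
           SBD_OPTS=""
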
   Testing the sbd installation
       After a restart of the cluster stack on this node, you can now try
       sending a test message to it as root, from this or any other node:

           sbd -d /dev/sda1 message node1 test

       The node will acknowledge the receipt of the message in the system
       logs:

           Aug 29 14:10:00 node1 sbd: [13412]: info: Received command test from node2

       This confirms that SBD is indeed up and running on the node, and that
       it is ready to receive messages.

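       You can also list the slots on the device; every cluster node that has
       started "sbd" against it should appear with a clear state, for
       example:

           sbd -d /dev/sda1 list
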
       Make sure that /etc/sysconfig/sbd is identical on all cluster nodes,
       and that all cluster nodes are running the daemon.

Pacemaker configuration
   Fencing resource
       Pacemaker can only interact with SBD to issue a node fence if there is
       a configured fencing resource. This should be a primitive, not a
       clone, as follows:

           primitive fencing-sbd stonith:external/sbd \
                   params pcmk_delay_max=30

       This will automatically use the same devices as configured in
       /etc/sysconfig/sbd.

       While you should not configure this as a clone (as Pacemaker will
       register the fencing device on each node automatically), the
       pcmk_delay_max setting enables a random fencing delay, which ensures
       that if a split-brain scenario does occur in a two-node cluster, one
       of the nodes has a better chance to survive, avoiding double fencing.

       SBD also supports turning the reset request into a crash request,
       which may be helpful for debugging if you have kernel crashdumping
       configured; then, every fence request will cause the node to dump
       core. You can enable this via the crashdump="true" parameter on the
       fencing resource. This is not recommended for production use, but only
       for debugging phases.

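       For example, the fencing resource shown above could be extended with
       this parameter as follows:

           primitive fencing-sbd stonith:external/sbd \
                   params pcmk_delay_max=30 crashdump="true"
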
   General cluster properties
       You must also enable STONITH in general, and set the STONITH timeout
       to be at least twice the msgwait timeout you have configured, to allow
       enough time for the fencing message to be delivered. If your msgwait
       timeout is 60 seconds, this is a possible configuration:

           property stonith-enabled="true"
           property stonith-timeout="120s"

       Caution: if stonith-timeout is too low for msgwait and the system
       overhead, sbd will never be able to successfully complete a fence
       request. This will create a fencing loop.

       Note that the sbd fencing agent will try to detect this and
       automatically extend the stonith-timeout setting to a reasonable
       value, on the assumption that sbd modifying your configuration is
       preferable to not fencing.

Management
   Recovering from temporary SBD device outage
       If you have multiple devices, failure of a single device is not
       immediately fatal. "sbd" will retry restarting the monitor for the
       device every 5 seconds by default. However, you can tune this via the
       options to the watch command.

       In case you wish to immediately force a restart of all currently
       disabled monitor processes, you can send a SIGUSR1 to the SBD
       inquisitor process.

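       For example, assuming the main process uses the default pidfile shown
       for the -p option above, all disabled monitors could be restarted
       with:

           kill -USR1 $(cat /var/run/sbd.pid)
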
LICENSE
       Copyright (C) 2008-2013 Lars Marowsky-Bree

       This program is free software; you can redistribute it and/or modify
       it under the terms of the GNU General Public License as published by
       the Free Software Foundation; either version 2 of the License, or (at
       your option) any later version.

       This software is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
       General Public License for more details.

       For details see the GNU General Public License at
       http://www.gnu.org/licenses/gpl-2.0.html (version 2) and/or
       http://www.gnu.org/licenses/gpl.html (the newest as per "any later").



SBD                               2018-11-20                            SBD(8)