SBD(8)                       STONITH Block Device                      SBD(8)



NAME
       sbd - STONITH Block Device daemon

SYNOPSIS
       sbd <-d /dev/...> [options] "command"

DESCRIPTION
       SBD provides a node fencing mechanism (Shoot the other node in the
       head, STONITH) for Pacemaker-based clusters through the exchange of
       messages via shared block storage such as a SAN, iSCSI, or FCoE.
       This isolates the fencing mechanism from changes in firmware version
       or dependencies on specific firmware controllers, and it can be used
       as a STONITH mechanism in all configurations that have reliable
       shared storage.

       SBD can also be used without any shared storage. In this mode, the
       watchdog device will be used to reset the node if it loses quorum,
       if any monitored daemon is lost and not recovered, or if Pacemaker
       decides that the node requires fencing.

       The sbd binary implements both the daemon that watches the message
       slots and the management tool for interacting with the block storage
       device(s). The mode of operation is selected via the "command"
       parameter; some of these commands take additional parameters.

       To use SBD with shared storage, you must first "create" the
       messaging layout on one to three block devices. Second, configure
       /etc/sysconfig/sbd to list those devices (and possibly adjust other
       options), and restart the cluster stack on each node to ensure that
       "sbd" is started. Third, configure the "external/sbd" fencing
       resource in the Pacemaker CIB.

       Each of these steps is documented in more detail below the
       description of the command options.
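
       For orientation, a compressed sketch of these three steps might look
       as follows (the device name, and the exact commands for restarting
       the cluster stack, are illustrative; adapt them to your
       environment):

               # step 1: initialize the messaging layout (overwrites data!)
               sbd -d /dev/sda1 create

               # step 2: list the device in /etc/sysconfig/sbd on every
               # node, then restart the cluster stack
               echo 'SBD_DEVICE="/dev/sda1"' >> /etc/sysconfig/sbd

               # step 3: configure the fencing resource in the CIB,
               # e.g. via crmsh
               crm configure primitive fencing-sbd stonith:external/sbd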

       "sbd" can only be used as root.

GENERAL OPTIONS
       -d /dev/...
           Specify the block device(s) to be used. If you have more than
           one, specify this option up to three times. This parameter is
           mandatory for all modes, since SBD always needs a block device
           to interact with.

           This man page uses /dev/sda1, /dev/sdb1, and /dev/sdc1 as
           example device names for brevity. However, in your production
           environment, you should instead always refer to them by using
           the long, stable device name (e.g.,
           /dev/disk/by-id/dm-uuid-part1-mpath-3600508b400105b5a0001500000250000).

       -v|-vv|-vvv
           Enable verbose|debug|debug-library logging (optional).

       -h  Display a concise summary of "sbd" options.

       -n node
           Set local node name; defaults to "uname -n". This should not
           need to be set.

       -R  Do not enable realtime priority. By default, "sbd" runs at
           realtime priority, locks itself into memory, and also acquires
           the highest IO priority to protect itself against interference
           from other processes on the system. This is a debugging-only
           option.

       -I N
           Async IO timeout (defaults to 3 seconds, optional). You should
           not need to adjust this unless your IO setup is really very
           slow.

           (In daemon mode, the watchdog is refreshed when the majority of
           devices could be read within this time.)

   create
       Example usage:

               sbd -d /dev/sdc2 -d /dev/sdd3 create

       If you specify the create command, sbd will write a metadata header
       to the device(s) specified and also initialize the messaging slots
       for up to 255 nodes.

       Warning: This command will not prompt for confirmation. Roughly the
       first megabyte of the specified block device(s) will be overwritten
       immediately and without backup.

       This command accepts a few options to adjust the default timings
       that are written to the metadata (to ensure they are identical
       across all nodes accessing the device).

       -1 N
           Set watchdog timeout to N seconds. This depends mostly on your
           storage latency; the majority of devices must be successfully
           read within this time, or else the node will self-fence.

           If your sbd device(s) reside on a multipath setup or iSCSI,
           this should be the time required to detect a path failure. You
           may be able to reduce this if your device outages are
           independent, or if you are using the Pacemaker integration.

       -2 N
           Set slot allocation timeout to N seconds. You should not need
           to tune this.

       -3 N
           Set daemon loop timeout to N seconds. You should not need to
           tune this.

       -4 N
           Set msgwait timeout to N seconds. This should be twice the
           watchdog timeout. This is the time after which a message
           written to a node's slot will be considered delivered. (Or long
           enough for the node to detect that it needed to self-fence.)

           This also affects the stonith-timeout in Pacemaker's CIB; see
           below.
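
           For example, to create the layout with a 60-second watchdog
           timeout and the corresponding doubled msgwait timeout (values
           illustrative, chosen per the guidance above):

               sbd -d /dev/sda1 -1 60 -4 120 create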

   list
       Example usage:

               # sbd -d /dev/sda1 list
               0       hex-0   clear
               1       hex-7   clear
               2       hex-9   clear

       List all allocated slots on a device, and any messages. You should
       see all cluster nodes that have ever been started against this
       device. Nodes that are currently running should have a clear state;
       nodes that have been fenced, but not yet restarted, will show the
       appropriate fencing message.

   dump
       Example usage:

               # sbd -d /dev/sda1 dump
               ==Dumping header on disk /dev/sda1
               Header version     : 2
               Number of slots    : 255
               Sector size        : 512
               Timeout (watchdog) : 15
               Timeout (allocate) : 2
               Timeout (loop)     : 1
               Timeout (msgwait)  : 30
               ==Header on disk /dev/sda1 is dumped

       Dump the meta-data header from a device.

   watch
       Example usage:

               sbd -d /dev/sdc2 -d /dev/sdd3 -P watch

       This command will make "sbd" start in daemon mode. It will
       constantly monitor the message slot of the local node for incoming
       messages and device reachability, and optionally take Pacemaker's
       state into account.

       "sbd" must be started on boot before the cluster stack! See below
       for enabling this according to your boot environment.

       The options for this mode are rarely specified directly on the
       command line, but are most frequently set via /etc/sysconfig/sbd.

       It also constantly monitors connectivity to the storage device, and
       self-fences in case the partition becomes unreachable, guaranteeing
       that it does not disconnect from fencing messages.

       A node slot is automatically allocated on the device(s) the first
       time the daemon starts watching the device; hence, manual
       allocation is not usually required.

       If a watchdog is used together with "sbd", as is strongly
       recommended, the watchdog is activated at the initial start of the
       sbd daemon. The watchdog is refreshed every time the majority of
       SBD devices has been successfully read. Using a watchdog provides
       additional protection against "sbd" crashing.

       If the Pacemaker integration is activated, "sbd" will not
       self-fence if device majority is lost, if:

       1.  the partition the node is in is still quorate according to the
           CIB;

       2.  it is still quorate according to Corosync's node count;

       3.  the node itself is considered online and healthy by Pacemaker.

       This allows "sbd" to survive temporary outages of the majority of
       devices. However, while the cluster is in such a degraded state, it
       can neither successfully fence nor be shut down cleanly (as taking
       the cluster below the quorum threshold will immediately cause all
       remaining nodes to self-fence). In short, it will not tolerate any
       further faults. Please repair the system before continuing.

       There is one "sbd" process that acts as a master to which all
       watchers report; one per device to monitor the node's slot; and,
       optionally, one that handles the Pacemaker integration.

       -W  Enable or disable use of the system watchdog to protect against
           the sbd processes failing and the node being left in an
           undefined state. Specify this once to enable, twice to disable.

           Defaults to enabled.

       -w /dev/watchdog
           This can be used to override the default watchdog device used
           and should not usually be necessary.

       -p /var/run/sbd.pid
           This option can be used to specify a pidfile for the main sbd
           process.

       -F N
           Number of failures after which a failing servant process is no
           longer restarted immediately, but only once the dampening delay
           has expired. If set to zero, servants will be restarted
           immediately and indefinitely. If set to one, a failed servant
           will be restarted once every -t seconds. If set to a higher
           value, the servant will be restarted that many times within the
           dampening period and then delayed.

           Defaults to 1.

       -t N
           Dampening delay before faulty servants are restarted. Combined
           with "-F 1", this is the most logical way to tune the restart
           frequency of servant processes. Default is 5 seconds.

           If set to zero, processes will be restarted indefinitely and
           immediately.
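
           For example, to allow each failed servant one immediate restart
           and then a 10-second dampening delay (values illustrative):

               sbd -d /dev/sda1 -F 1 -t 10 watch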

       -P  Enable Pacemaker integration, which checks Pacemaker quorum and
           node health. Specify this once to enable, twice to disable.

           Defaults to enabled.

       -S N
           Set the start mode. (Defaults to 0.)

           If this is set to zero, sbd will always start up
           unconditionally, regardless of whether the node was previously
           fenced or not.

           If set to one, sbd will only start if the node was previously
           shut down cleanly (as indicated by an exit request message in
           the slot), or if the slot is empty. A reset, crashdump, or
           power-off request in any slot will halt the start-up.

           This is useful to prevent nodes from rejoining if they were
           faulty. The node must be manually "unfenced" by sending an
           empty message to it:

               sbd -d /dev/sda1 message node1 clear

       -s N
           Set the start-up wait time for devices. (Defaults to 120.)

           Dynamic block devices such as iSCSI might not be fully
           initialized and present yet. This allows one to set a timeout
           for waiting for devices to appear on start-up. If set to 0,
           start-up will be aborted immediately if no devices are
           available.

       -Z  Enable trace mode. Warning: this is unsafe for production, use
           at your own risk! Specifying this once will turn all reboots or
           power-offs, be they caused by self-fence decisions or messages,
           into a crashdump. Specifying this twice will just log them but
           not continue running.

       -T  By default, the daemon will set the watchdog timeout as
           specified in the device metadata. However, this does not work
           for every watchdog device. In this case, you must manually
           ensure that the watchdog timeout used by the system correctly
           matches the SBD settings, and then specify this option to allow
           "sbd" to continue with start-up.

       -5 N
           Warn if the time interval for tickling the watchdog exceeds
           this many seconds. Since the node is unable to log the watchdog
           expiry (it reboots immediately without a chance to write its
           logs to disk), this is very useful for getting an indication
           that the watchdog timeout is too short for the IO load of the
           system.

           Default is 3 seconds; set to zero to disable.

       -C N
           Watchdog timeout to set before crashdumping. If SBD is set to
           crashdump instead of reboot - either via the trace mode
           settings or the external/sbd fencing agent's parameter - SBD
           will adjust the watchdog timeout to this setting before
           triggering the dump. Otherwise, the watchdog might trigger and
           prevent a successful crashdump from ever being written.

           Set to zero (= default) to disable.

       -r N
           Actions to be executed when the watchers do not report to the
           sbd master process in time, or when one of the watchers detects
           that the master process has died.

           Set timeout-action to a comma-separated combination of
           noflush|flush plus reboot|crashdump|off. If only one of the two
           is given, the other stays at its default.

           This does not affect actions such as off, crashdump, or reboot
           that are explicitly triggered via message slots. Nor does it
           configure the action the watchdog itself triggers should it
           expire (there is no generic interface for that).

           Defaults to flush,reboot.

   allocate
       Example usage:

               sbd -d /dev/sda1 allocate node1

       Explicitly allocates a slot for the specified node name. This
       should rarely be necessary, as every node will automatically
       allocate itself a slot the first time it starts up in watch mode.

   message
       Example usage:

               sbd -d /dev/sda1 message node1 test

       Writes the specified message to the node's slot. This is rarely
       done directly, but rather abstracted via the "external/sbd" fencing
       agent configured as a cluster resource.

       Supported message types are:

       test
           This only generates a log message on the receiving node and can
           be used to check if SBD is seeing the device. Note that this
           could overwrite a fencing request sent by the cluster, so it
           should not be used during production.

       reset
           Reset the target upon receipt of this message.

       off Power-off the target.

       crashdump
           Cause the target node to crashdump.

       exit
           This will make the "sbd" daemon exit cleanly on the target. You
           should not send this message manually; this is handled properly
           during shutdown of the cluster stack. Manually stopping the
           daemon means the node is unprotected!

       clear
           This message indicates that no real message has been sent to
           the node. You should not set this manually; "sbd" will clear
           the message slot automatically during start-up, and setting
           this manually could overwrite a fencing message by the cluster.

   query-watchdog
       Example usage:

               sbd query-watchdog

       Check for available watchdog devices and print some information.

       Warning: This command will arm the watchdog during the query, and
       if your watchdog refuses disarming (for example, if its kernel
       module has the 'nowayout' parameter set), this will reset your
       system.
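
       To check in advance whether a loaded watchdog module has 'nowayout'
       set (shown here for the softdog module as an example; substitute
       your module's name in the path):

               cat /sys/module/softdog/parameters/nowayout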

   test-watchdog
       Example usage:

               sbd test-watchdog [-w /dev/watchdog3]

       Test the specified watchdog device (/dev/watchdog by default).

       Warning: This command will arm the watchdog and have your system
       reset in case your watchdog is working properly! If issued from an
       interactive session, it will prompt for confirmation.

BASE CONFIGURATION
   Configure a watchdog
       It is highly recommended that you configure your Linux system to
       load a watchdog driver with hardware assistance (as is available on
       most modern systems), such as hpwdt, iTCO_wdt, or others. As a
       fall-back, you can use the softdog module.
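
       A minimal sketch of loading the fallback module by hand (making
       this persistent across reboots depends on your distribution, e.g.
       via a file under /etc/modules-load.d/):

               modprobe softdog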

       No other software must access the watchdog timer; it can only be
       accessed by one process at any given time. Some hardware vendors
       ship systems management software that uses the watchdog for system
       resets (e.g., the HP ASR daemon). Such software must be disabled if
       the watchdog is to be used by SBD.

   Choosing and initializing the block device(s)
       First, you have to decide if you want to use one, two, or three
       devices.

       If you are using multiple devices, they should reside on
       independent storage setups. Putting all three of them on the same
       logical unit, for example, would not provide any additional
       redundancy.

       The SBD device can be connected via Fibre Channel, Fibre Channel
       over Ethernet, or even iSCSI. Thus, an iSCSI target can become a
       sort-of network-based quorum server; the advantage is that it does
       not require a smart host at your third location, just block
       storage.

       The SBD partitions themselves must not be mirrored (via MD, DRBD,
       or the storage layer itself), since this could result in a
       split-mirror scenario. Nor can they reside on cLVM2 volume groups,
       since they must be accessed by the cluster stack before it has
       started the cLVM2 daemons; hence, these should be either raw
       partitions or logical units on (multipath) storage.

       The block device(s) must be accessible from all nodes. (While it is
       not necessary that they share the same path name on all nodes, this
       is considered a very good idea.)

       SBD will only use about one megabyte per device, so you can easily
       create a small partition, or very small logical units. (The size of
       the SBD device depends on the block size of the underlying device.
       Thus, 1MB is fine on plain SCSI devices and SAN storage with 512
       byte blocks. On the IBM s390x architecture in particular, disks
       default to 4k blocks, and thus require roughly 4MB.)
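
       To check which logical block size a device reports (an illustrative
       sanity check; /dev/sda1 is an example name):

               blockdev --getss /dev/sda1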

       The number of devices will affect the operation of SBD as follows:

       One device
           In its simplest implementation, you use one device only. This
           is appropriate for clusters where all your data is on the same
           shared storage (with internal redundancy) anyway; the SBD
           device then does not introduce an additional single point of
           failure.

           If the SBD device is not accessible, the daemon will fail to
           start and inhibit start-up of cluster services.

       Two devices
           This configuration is a trade-off, primarily aimed at
           environments where host-based mirroring is used, but no third
           storage device is available.

           SBD will not commit suicide if it loses access to one mirror
           leg; this allows the cluster to continue to function even in
           the face of one outage.

           However, SBD will not fence the other side while only one
           mirror leg is available, since it does not have enough
           knowledge to detect an asymmetric split of the storage. So it
           will not be able to automatically tolerate a second failure
           while one of the storage arrays is down. (Though you can use
           the appropriate crm command to acknowledge the fence manually.)

           It will not start unless both devices are accessible on boot.

       Three devices
           In this most reliable and recommended configuration, SBD will
           only self-fence if more than one device is lost; hence, this
           configuration is resilient against temporary single device
           outages (be it due to failures or maintenance). Fencing
           messages can still be successfully relayed if at least two
           devices remain accessible.

           This configuration is appropriate for more complex scenarios
           where storage is not confined to a single array. For example,
           host-based mirroring solutions could have one SBD per mirror
           leg (not mirrored itself), and an additional tie-breaker on
           iSCSI.

           It will only start if at least two devices are accessible on
           boot.

       After you have chosen the devices and created the appropriate
       partitions (and perhaps multipath alias names to ease management),
       use the "sbd create" command described above to initialize the SBD
       metadata on them.

   Sharing the block device(s) between multiple clusters
       It is possible to share the block devices between multiple
       clusters, provided the total number of nodes accessing them does
       not exceed 255 and all clusters share the same SBD timeouts (since
       these are part of the metadata).

       If you are using multiple devices, this can reduce the setup
       overhead required. However, you should not share devices between
       clusters in different security domains.

   Configure SBD to start on boot
       On systems using "sysvinit", the "openais" or "corosync" system
       start-up scripts must handle starting or stopping "sbd" as required
       before starting the rest of the cluster stack.

       For "systemd", sbd simply has to be enabled using

               systemctl enable sbd.service

       The daemon is brought online on each node before corosync and
       Pacemaker are started, and terminated only after all other cluster
       components have been shut down - ensuring that cluster resources
       are never activated without SBD supervision.
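
       To verify that the unit is enabled (a quick sanity check; the exact
       output wording varies by distribution and systemd version):

               systemctl is-enabled sbd.service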

   Configuration via sysconfig
       The system instance of "sbd" is configured via /etc/sysconfig/sbd.
       In this file, you must specify the device(s) used, as well as any
       options to pass to the daemon:

               SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1"
               SBD_PACEMAKER="true"

       "sbd" will fail to start if no "SBD_DEVICE" is specified. See the
       installed template, or the section on configuration via the
       environment below, for more options that can be configured here. In
       general, configuration passed via command-line parameters takes
       precedence over the configuration file.

   Configuration via environment
       SBD_DEVICE
           Allows "string", defaulting to "".

           SBD_DEVICE specifies the devices to use for exchanging sbd
           messages and to monitor. If specifying more than one path, use
           ";" as separator.

       SBD_PACEMAKER
           Allows "yesno", defaulting to "yes".

           Whether to enable the Pacemaker integration.

       SBD_STARTMODE
           Allows "always / clean", defaulting to "always".

           Specify the start mode for sbd. Setting this to "clean" will
           only allow sbd to start if the node was not previously fenced.
           See the -S option above.

       SBD_DELAY_START
           Allows "yesno / integer", defaulting to "no".

           Whether to delay after starting sbd on boot for "msgwait"
           seconds. This may be necessary if your cluster nodes reboot so
           fast that the other nodes are still waiting in the fence
           acknowledgement phase. This is an occasional issue with virtual
           machines.

           This can also be enabled by setting it to a specific delay
           value, in seconds. Sometimes a longer delay than the default,
           "msgwait", is needed, for example when it is considered safer
           to wait longer than: corosync token timeout + consensus timeout
           + pcmk_delay_max + msgwait (see the worked example below).

           Be aware that the special value "1" means "yes" rather than
           "1s".

           Consider that you might have to adapt the start-up timeout
           accordingly if the default is not sufficient (TimeoutStartSec
           for systemd).

           This option may be ignored at a later point, once pacemaker
           handles this case better.
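
           As a worked example of the sum above (all values illustrative):
           with a corosync token timeout of 10s, a consensus timeout of
           12s, pcmk_delay_max=30s, and msgwait=30s, the sum is 82s, so
           SBD_DELAY_START=90 would leave a small margin.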

       SBD_WATCHDOG_DEV
           Allows "string", defaulting to "/dev/watchdog".

           Watchdog device to use. If set to /dev/null, no watchdog device
           will be used.

       SBD_WATCHDOG_TIMEOUT
           Allows "integer", defaulting to 5.

           How long, in seconds, the watchdog will wait before panicking
           the node if no one tickles it.

           This depends mostly on your storage latency; the majority of
           devices must be successfully read within this time, or else the
           node will self-fence.

           If your sbd device(s) reside on a multipath setup or iSCSI,
           this should be the time required to detect a path failure.

           Be aware that the watchdog timeout set in the on-disk metadata
           takes precedence.

       SBD_TIMEOUT_ACTION
           Allows "string", defaulting to "flush,reboot".

           Actions to be executed when the watchers do not report to the
           sbd master process in time, or when one of the watchers detects
           that the master process has died.

           Set timeout-action to a comma-separated combination of
           noflush|flush plus reboot|crashdump|off. If only one of the two
           is given, the other stays at its default.

           This does not affect actions such as off, crashdump, or reboot
           that are explicitly triggered via message slots. Nor does it
           configure the action the watchdog itself triggers should it
           expire (there is no generic interface for that).

       SBD_MOVE_TO_ROOT_CGROUP
           Allows "yesno / auto", defaulting to "auto".

           If CPUAccounting is enabled, the system default is not to
           assign any realtime budget to system.slice, which prevents sbd
           from running with round-robin (RR) scheduling.

           One way to escape that issue is to move the sbd processes from
           the slice they were originally started in to the root slice.
           Of course, starting sbd in a certain slice might be
           intentional, so in auto mode sbd will check whether the slice
           has a realtime budget assigned. If that is the case, sbd will
           stay in that slice, while it will be moved to the root slice
           otherwise.

       SBD_OPTS
           Allows "string", defaulting to "".

           Additional options for starting sbd.
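
       Putting these together, a complete /etc/sysconfig/sbd might look
       like this (all values illustrative; only SBD_DEVICE is required):

               SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1"
               SBD_PACEMAKER="yes"
               SBD_STARTMODE="clean"
               SBD_DELAY_START="no"
               SBD_WATCHDOG_DEV="/dev/watchdog"
               SBD_WATCHDOG_TIMEOUT="5"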

   Testing the sbd installation
       After a restart of the cluster stack on this node, you can now try
       sending a test message to it as root, from this or any other node:

               sbd -d /dev/sda1 message node1 test

       The node will acknowledge the receipt of the message in the system
       logs:

               Aug 29 14:10:00 node1 sbd: [13412]: info: Received command test from node2

       This confirms that SBD is indeed up and running on the node, and
       that it is ready to receive messages.

       Make sure that /etc/sysconfig/sbd is identical on all cluster
       nodes, and that all cluster nodes are running the daemon.
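
       One quick way to compare the file across nodes (the node names are
       hypothetical):

               for n in node1 node2 node3; do
                       ssh $n md5sum /etc/sysconfig/sbd
               done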

PACEMAKER CONFIGURATION
   Fencing resource
       Pacemaker can only interact with SBD to issue a node fence if there
       is a configured fencing resource. This should be a primitive, not a
       clone, as follows:

               primitive fencing-sbd stonith:external/sbd \
                       params pcmk_delay_max=30

       This will automatically use the same devices as configured in
       /etc/sysconfig/sbd.

       While you should not configure this as a clone (as Pacemaker will
       register the fencing device on each node automatically), the
       pcmk_delay_max setting enables a random fencing delay which
       ensures, should a split brain occur in a two-node cluster, that one
       of the nodes has a better chance to survive, avoiding double
       fencing.

       SBD also supports turning the reset request into a crash request,
       which may be helpful for debugging if you have kernel crashdumping
       configured; then, every fence request will cause the node to dump
       core. You can enable this via the crashdump="true" parameter on the
       fencing resource. This is not recommended for production use, but
       only for debugging phases.
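
       For example (debugging only, as stated above; illustrative crmsh
       syntax):

               primitive fencing-sbd stonith:external/sbd \
                       params pcmk_delay_max=30 crashdump="true"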

   General cluster properties
       You must also enable STONITH in general, and set the STONITH
       timeout to be at least twice the msgwait timeout you have
       configured, to allow enough time for the fencing message to be
       delivered. If your msgwait timeout is 60 seconds, this is a
       possible configuration:

               property stonith-enabled="true"
               property stonith-timeout="120s"

       Caution: if stonith-timeout is too low for msgwait and the system
       overhead, sbd will never be able to successfully complete a fence
       request. This will create a fencing loop.

       Note that the sbd fencing agent will try to detect this and
       automatically extend the stonith-timeout setting to a reasonable
       value, on the assumption that sbd modifying your configuration is
       preferable to not fencing.

MANAGEMENT TASKS
   Recovering from temporary SBD device outage
       If you have multiple devices, failure of a single device is not
       immediately fatal. "sbd" will try to restart the monitor for the
       device every 5 seconds by default. However, you can tune this via
       the options to the watch command.

       In case you wish to immediately force a restart of all currently
       disabled monitor processes, you can send a SIGUSR1 to the SBD
       inquisitor process.
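
       For example, assuming the main process wrote its pidfile to the
       location shown for the -p option above:

               kill -USR1 $(cat /var/run/sbd.pid)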

LICENSE
       Copyright (C) 2008-2013 Lars Marowsky-Bree

       This program is free software; you can redistribute it and/or
       modify it under the terms of the GNU General Public License as
       published by the Free Software Foundation; either version 2 of the
       License, or (at your option) any later version.

       This software is distributed in the hope that it will be useful,
       but WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
       General Public License for more details.

       For details see the GNU General Public License at
       http://www.gnu.org/licenses/gpl-2.0.html (version 2) and/or
       http://www.gnu.org/licenses/gpl.html (the newest as per "any
       later").



SBD                              2020-03-05                            SBD(8)