1SBD(8)                       STONITH Block Device                       SBD(8)
2
3
4

NAME

6       sbd - STONITH Block Device daemon
7

SYNOPSIS

9       sbd <-d /dev/...> [options] "command"
10

SUMMARY

12       SBD provides a node fencing mechanism (Shoot the other node in the
13       head, STONITH) for Pacemaker-based clusters through the exchange of
14       messages via shared block storage such as a SAN, iSCSI, or
15       FCoE. This isolates the fencing mechanism from changes in firmware
16       version or dependencies on specific firmware controllers, and it can be
17       used as a STONITH mechanism in all configurations that have reliable
18       shared storage.
19
20       SBD can also be used without any shared storage. In this mode, the
21       watchdog device will be used to reset the node if it loses quorum, if
22       any monitored daemon is lost and not recovered, or if Pacemaker decides
23       that the node requires fencing.
24
25       The sbd binary implements both the daemon that watches the message
26       slots and the management tool for interacting with the block
27       storage device(s). This mode of operation is specified via the
28       "command" parameter; some of these modes take additional parameters.
29
30       To use SBD with shared storage, you must first "create" the messaging
31       layout on one to three block devices. Second, configure
32       /etc/sysconfig/sbd to list those devices (and possibly adjust other
33       options), and restart the cluster stack on each node to ensure that
34       "sbd" is started. Third, configure the "external/sbd" fencing resource
35       in the Pacemaker CIB.
36
37       Each of these steps is documented in more detail below the description
38       of the command options.
39
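       As a quick orientation, the three steps might look roughly as follows.
       This is only a sketch; the device name, the sysconfig contents, and the
       use of the crm shell are illustrative, not prescriptive.

               # 1. initialize the messaging layout (once, from any node)
               sbd -d /dev/sda1 create

               # 2. on every node, list the device(s) in /etc/sysconfig/sbd and
               #    restart the cluster stack so that "sbd" is started with it
               echo 'SBD_DEVICE="/dev/sda1"' >> /etc/sysconfig/sbd

               # 3. configure the fencing resource in the Pacemaker CIB,
               #    for example with the crm shell
               crm configure primitive fencing-sbd stonith:external/sbd
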
40       "sbd" can only be used as root.
41
42   GENERAL OPTIONS
43       -d /dev/...
44           Specify the block device(s) to be used. If you have more than one,
45           specify this option up to three times. This parameter is mandatory
46           for all modes, since SBD always needs a block device to interact
47           with.
48
49           This man page uses /dev/sda1, /dev/sdb1, and /dev/sdc1 as example
50           device names for brevity. However, in your production environment,
51           you should instead always refer to them by using the long, stable
52           device name (e.g.,
53           /dev/disk/by-id/dm-uuid-part1-mpath-3600508b400105b5a0001500000250000).
54
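           For example, an administrative command run against three devices
           repeats the option once per device; the by-id paths below are
           purely illustrative:

                   sbd -d /dev/disk/by-id/dm-uuid-mpath-a \
                       -d /dev/disk/by-id/dm-uuid-mpath-b \
                       -d /dev/disk/by-id/dm-uuid-mpath-c dump
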
55       -v|-vv|-vvv
56           Enable verbose|debug|debug-library logging (optional)
57
58       -h  Display a concise summary of "sbd" options.
59
60       -n node
61           Set local node name; defaults to "uname -n". This should not need
62           to be set.
63
64       -R  Do not enable realtime priority. By default, "sbd" runs at realtime
65           priority, locks itself into memory, and also acquires highest IO
66           priority to protect itself against interference from other
67           processes on the system. This is a debugging-only option.
68
69       -I N
70           Async IO timeout (defaults to 3 seconds, optional). You should not
71           need to adjust this unless your IO setup is really very slow.
72
73           (In daemon mode, the watchdog is refreshed when the majority of
74           devices could be read within this time.)
75
76   create
77       Example usage:
78
79               sbd -d /dev/sdc2 -d /dev/sdd3 create
80
81       If you specify the create command, sbd will write a metadata header to
82       the device(s) specified and also initialize the messaging slots for up
83       to 255 nodes.
84
85       Warning: This command will not prompt for confirmation. Roughly the
86       first megabyte of the specified block device(s) will be overwritten
87       immediately and without backup.
88
89       This command accepts a few options to adjust the default timings that
90       are written to the metadata (to ensure they are identical across all
91       nodes accessing the device).
92
93       -1 N
94           Set watchdog timeout to N seconds. This depends mostly on your
95           storage latency; the majority of devices must be successfully read
96           within this time, or else the node will self-fence.
97
98           If your sbd device(s) reside on a multipath setup or iSCSI, this
99           should be the time required to detect a path failure. You may be
100           able to reduce this if your device outages are independent, or if
101           you are using the Pacemaker integration.
102
103       -2 N
104           Set slot allocation timeout to N seconds. You should not need to
105           tune this.
106
107       -3 N
108           Set daemon loop timeout to N seconds. You should not need to tune
109           this.
110
111       -4 N
112           Set msgwait timeout to N seconds. This should be twice the watchdog
113           timeout. This is the time after which a message written to a node's
114           slot will be considered delivered. (Or long enough for the node to
115           detect that it needed to self-fence.)
116
117           This also affects the stonith-timeout in Pacemaker's CIB; see
118           below.
119
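       For example, to initialize a device with a 20 second watchdog timeout
       and the corresponding 40 second msgwait (the values are purely
       illustrative; choose them to match your storage):

               sbd -d /dev/sda1 -1 20 -4 40 create
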
120   list
121       Example usage:
122
123               # sbd -d /dev/sda1 list
124               0       hex-0   clear
125               1       hex-7   clear
126               2       hex-9   clear
127
128       List all allocated slots on device, and messages. You should see all
129       cluster nodes that have ever been started against this device. Nodes
130       that are currently running should have a clear state; nodes that have
131       been fenced, but not yet restarted, will show the appropriate fencing
132       message.
133
134   dump
135       Example usage:
136
137               # sbd -d /dev/sda1 dump
138               ==Dumping header on disk /dev/sda1
139               Header version     : 2
140               Number of slots    : 255
141               Sector size        : 512
142               Timeout (watchdog) : 15
143               Timeout (allocate) : 2
144               Timeout (loop)     : 1
145               Timeout (msgwait)  : 30
146               ==Header on disk /dev/sda1 is dumped
147
148       Dump meta-data header from device.
149
150   watch
151       Example usage:
152
153               sbd -d /dev/sdc2 -d /dev/sdd3 -P watch
154
155       This command will make "sbd" start in daemon mode. It will constantly
156       monitor the message slot of the local node for incoming messages and
157       device reachability, and optionally take Pacemaker's state into account.
158
159       "sbd" must be started on boot before the cluster stack! See below for
160       enabling this according to your boot environment.
161
162       The options for this mode are rarely specified directly on the
163       command line, but are most frequently set via /etc/sysconfig/sbd.
164
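       For example, instead of editing init scripts, additional watch-mode
       options are typically passed through SBD_OPTS in /etc/sysconfig/sbd (a
       sketch; the values shown are only illustrative):

               SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1"
               SBD_PACEMAKER="yes"
               SBD_OPTS="-F 1 -t 10"
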
165       It also constantly monitors connectivity to the storage device, and
166       self-fences in case the partition becomes unreachable, guaranteeing
167       that it does not disconnect from fencing messages.
168
169       A node slot is automatically allocated on the device(s) the first time
170       the daemon starts watching the device; hence, manual allocation is not
171       usually required.
172
173       If a watchdog is used together with "sbd", as is strongly
174       recommended, the watchdog is activated at the initial start of the sbd
175       daemon. The watchdog is refreshed every time the majority of SBD
176       devices has been successfully read. Using a watchdog provides
177       additional protection against "sbd" crashing.
178
179       If the Pacemaker integration is activated, "sbd" will not self-fence if
180       device majority is lost, if:
181
182       1.  The partition the node is in is still quorate according to the CIB;
183
184       2.  it is still quorate according to Corosync's node count;
185
186       3.  the node itself is considered online and healthy by Pacemaker.
187
188       This allows "sbd" to survive temporary outages of the majority of
189       devices. However, while the cluster is in such a degraded state, it can
190       neither successfully fence nor be shut down cleanly (as taking the
191       cluster below the quorum threshold will immediately cause all remaining
192       nodes to self-fence). In short, it will not tolerate any further
193       faults.  Please repair the system before continuing.
194
195       There is one "sbd" process that acts as a master to which all watchers
196       report; one per device to monitor the node's slot; and, optionally, one
197       that handles the Pacemaker integration.
198
199       -W  Enable or disable use of the system watchdog to protect against the
200           sbd processes failing and the node being left in an undefined
201           state. Specify this once to enable, twice to disable.
202
203           Defaults to enabled.
204
205       -w /dev/watchdog
206           This can be used to override the default watchdog device used and
207           should not usually be necessary.
208
209       -p /var/run/sbd.pid
210           This option can be used to specify a pidfile for the main sbd
211           process.
212
213       -F N
214           Number of failures a servant process may accumulate before it is no
215           longer restarted immediately but only after the dampening delay has
216           expired. If set to zero, servants will be restarted immediately and
217           indefinitely. If set to one, a failed servant will be restarted once
218           every -t seconds. If set to a higher value, the servant will be
219           restarted that many times within the dampening period, then delayed.
220
221           Defaults to 1.
222
223       -t N
224           Dampening delay before faulty servants are restarted. Combined with
225           "-F 1", this is the most logical way to tune the restart frequency
226           of servant processes.  Default is 5 seconds.
227
228           If set to zero, processes will be restarted indefinitely and
229           immediately.
230
231       -P  Enable Pacemaker integration which checks Pacemaker quorum and node
232           health.  Specify this once to enable, twice to disable.
233
234           Defaults to enabled.
235
236       -S N
237           Set the start mode. (Defaults to 0.)
238
239           If this is set to zero, sbd will always start up unconditionally,
240           regardless of whether the node was previously fenced or not.
241
242           If set to one, sbd will only start if the node was previously
243           shut down cleanly (as indicated by an exit request message in the
244           slot), or if the slot is empty. A reset, crashdump, or power-off
245           request in any slot will halt the start-up.
246
247           This is useful to prevent nodes from rejoining if they were faulty.
248           The node must be manually "unfenced" by sending an empty message to
249           it:
250
251                   sbd -d /dev/sda1 message node1 clear
252
253       -s N
254           Set the start-up wait time for devices. (Defaults to 120.)
255
256           Dynamic block devices such as iSCSI might not be fully initialized
257           and present yet. This allows one to set a timeout for waiting for
258           devices to appear on start-up. If set to 0, start-up will be
259           aborted immediately if no devices are available.
260
261       -Z  Enable trace mode. Warning: this is unsafe for production, use at
262           your own risk! Specifying this once will turn all reboots or power-
263           offs, be they caused by self-fence decisions or messages, into a
264           crashdump.  Specifying this twice will just log them but not
265           continue running.
266
267       -T  By default, the daemon will set the watchdog timeout as specified
268           in the device metadata. However, this does not work for every
269           watchdog device.  In this case, you must manually ensure that the
270           watchdog timeout used by the system correctly matches the SBD
271           settings, and then specify this option to allow "sbd" to continue
272           with start-up.
273
274       -5 N
275           Warn if the time interval for tickling the watchdog exceeds this
276           many seconds.  Since the node is unable to log the watchdog expiry
277           (it reboots immediately without a chance to write its logs to
278           disk), this is very useful for getting an indication that the
279           watchdog timeout is too short for the IO load of the system.
280
281           Default is 3 seconds, set to zero to disable.
282
283       -C N
284           Watchdog timeout to set before crashdumping. If SBD is set to
285           crashdump instead of reboot (either via the trace mode settings or
286           the external/sbd fencing agent's parameter), SBD will adjust the
287           watchdog timeout to this setting before triggering the dump.
288           Otherwise, the watchdog might trigger and prevent a successful
289           crashdump from ever being written.
290
291           Set to zero (= default) to disable.
292
293       -r N
294           Actions to be executed when the watchers do not report to the sbd
295           master process in time, or when one of the watchers detects that
296           the master process has died.
297
298           Set timeout-action to a comma-separated combination of noflush|flush
299           plus reboot|crashdump|off.  If only one of the two is given, the
300           other stays at its default.
301
302           This does not affect actions like off, crashdump, or reboot that are
303           explicitly triggered via message slots.  Nor does it configure the
304           action the hardware watchdog triggers when it expires (there is no
305           generic interface for that).
306
307           Defaults to flush,reboot.
308
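           For example, to request a crashdump without flushing caches on such
           an internal timeout (only an illustration; if only one part is
           given, the other keeps its default as described above):

                   sbd -d /dev/sda1 -r noflush,crashdump watch
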
309   allocate
310       Example usage:
311
312               sbd -d /dev/sda1 allocate node1
313
314       Explicitly allocates a slot for the specified node name. This should
315       rarely be necessary, as every node will automatically allocate itself a
316       slot the first time it starts up in watch mode.
317
318   message
319       Example usage:
320
321               sbd -d /dev/sda1 message node1 test
322
323       Writes the specified message to the node's slot. This is rarely done
324       directly, but rather abstracted via the "external/sbd" fencing agent
325       configured as a cluster resource.
326
327       Supported message types are:
328
329       test
330           This only generates a log message on the receiving node and can be
331           used to check if SBD is seeing the device. Note that this could
332           overwrite a fencing request sent by the cluster, so it should not
333           be used in production.
334
335       reset
336           Reset the target upon receipt of this message.
337
338       off Power-off the target.
339
340       crashdump
341           Cause the target node to crashdump.
342
343       exit
344           This will make the "sbd" daemon exit cleanly on the target. You
345           should not send this message manually; this is handled properly
346           during shutdown of the cluster stack. Manually stopping the daemon
347           means the node is unprotected!
348
349       clear
350           This message indicates that no real message has been sent to the
351           node.  You should not set this manually; "sbd" will clear the
352           message slot automatically during start-up, and setting this
353           manually could overwrite a fencing message by the cluster.
354
355   query-watchdog
356       Example usage:
357
358               sbd query-watchdog
359
360       Check for available watchdog devices and print some info.
361
362       Warning: This command will arm the watchdog during query, and if your
363       watchdog refuses disarming (for example, if its kernel module has the
364       'nowayout' parameter set) this will reset your system.
365
366   test-watchdog
367       Example usage:
368
369               sbd test-watchdog [-w /dev/watchdog3]
370
371       Test specified watchdog device (/dev/watchdog by default).
372
373       Warning: This command will arm the watchdog and have your system reset
374       in case your watchdog is working properly! If issued from an
375       interactive session, it will prompt for confirmation.
376

Base system configuration

378   Configure a watchdog
379       It is highly recommended that you configure your Linux system to load a
380       watchdog driver with hardware assistance (as is available on most
381       modern systems), such as hpwdt, iTCO_wdt, or others. As a fall-back,
382       you can use the softdog module.
383
384       No other software may access the watchdog timer; it can only be
385       accessed by one process at any given time. Some hardware vendors ship
386       systems management software that uses the watchdog for system resets
387       (e.g., the HP ASR daemon). Such software has to be disabled if the
388       watchdog is to be used by SBD.
389
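       On a test system without hardware watchdog support, the softdog module
       can, for instance, be loaded manually and then checked with the query
       command described above (how the module is loaded persistently across
       boots depends on your distribution):

               modprobe softdog
               sbd query-watchdog
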
390   Choosing and initializing the block device(s)
391       First, you have to decide if you want to use one, two, or three
392       devices.
393
394       If you are using multiple ones, they should reside on independent
395       storage setups. Putting all three of them on the same logical unit,
396       for example, would not provide any additional redundancy.
397
398       The SBD device can be connected via Fibre Channel, Fibre Channel over
399       Ethernet, or even iSCSI. Thus, an iSCSI target can become a sort-of
400       network-based quorum server; the advantage is that it does not require
401       a smart host at your third location, just block storage.
402
403       The SBD partitions themselves must not be mirrored (via MD, DRBD, or
404       the storage layer itself), since this could result in a split-mirror
405       scenario. Nor can they reside on cLVM2 volume groups, since they must
406       be accessed by the cluster stack before it has started the cLVM2
407       daemons; hence, these should be either raw partitions or logical units
408       on (multipath) storage.
409
410       The block device(s) must be accessible from all nodes. (While it is not
411       necessary that they share the same path name on all nodes, this is
412       considered a very good idea.)
413
414       SBD will only use about one megabyte per device, so you can easily
415       create a small partition, or very small logical units.  (The size of
416       the SBD device depends on the block size of the underlying device.
417       Thus, 1MB is fine on plain SCSI devices and SAN storage with 512 byte
418       blocks. On the IBM s390x architecture in particular, disks default to
419       4k blocks, and thus require roughly 4MB.)
420
421       The number of devices will affect the operation of SBD as follows:
422
423       One device
424           In its most simple implementation, you use one device only. This is
425           appropriate for clusters where all your data is on the same shared
426           storage (with internal redundancy) anyway; the SBD device does not
427           introduce an additional single point of failure then.
428
429           If the SBD device is not accessible, the daemon will fail to start
430           and inhibit startup of cluster services.
431
432       Two devices
433           This configuration is a trade-off, primarily aimed at environments
434           where host-based mirroring is used, but no third storage device is
435           available.
436
437           SBD will not commit suicide if it loses access to one mirror leg;
438           this allows the cluster to continue to function even in the face of
439           one outage.
440
441           However, SBD will not fence the other side while only one mirror
442           leg is available, since it does not have enough knowledge to detect
443           an asymmetric split of the storage. So it will not be able to
444           automatically tolerate a second failure while one of the storage
445           arrays is down. (Though you can use the appropriate crm command to
446           acknowledge the fence manually.)
447
448           It will not start unless both devices are accessible on boot.
449
450       Three devices
451           In this most reliable and recommended configuration, SBD will only
452           self-fence if more than one device is lost; hence, this
453           configuration is resilient against temporary single device outages
454           (be it due to failures or maintenance).  Fencing messages can still
455           be successfully relayed if at least two devices remain accessible.
456
457           This configuration is appropriate for more complex scenarios where
458           storage is not confined to a single array. For example, host-based
459           mirroring solutions could have one SBD per mirror leg (not mirrored
460           itself), and an additional tie-breaker on iSCSI.
461
462           It will only start if at least two devices are accessible on boot.
463
464       After you have chosen the devices and created the appropriate
465       partitions and perhaps multipath alias names to ease management, use
466       the "sbd create" command described above to initialize the SBD metadata
467       on them.
468
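       For example, with three small partitions prepared as described above,
       the metadata can be initialized in one invocation (device names are
       illustrative):

               sbd -d /dev/sda1 -d /dev/sdb1 -d /dev/sdc1 create
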
469       Sharing the block device(s) between multiple clusters
470
471       It is possible to share the block devices between multiple clusters,
472       provided the total number of nodes accessing them does not exceed 255,
473       and provided they all share the same SBD timeouts (since these are
474       part of the metadata).
475
476       If you are using multiple devices this can reduce the setup overhead
477       required. However, you should not share devices between clusters in
478       different security domains.
479
480   Configure SBD to start on boot
481       On systems using "sysvinit", the "openais" or "corosync" system start-
482       up scripts must handle starting or stopping "sbd" as required before
483       starting the rest of the cluster stack.
484
485       For "systemd", sbd simply has to be enabled using
486
487               systemctl enable sbd.service
488
489       The daemon is brought online on each node before corosync and Pacemaker
490       are started, and terminated only after all other cluster components
491       have been shut down - ensuring that cluster resources are never
492       activated without SBD supervision.
493
494   Configuration via sysconfig
495       The system instance of "sbd" is configured via /etc/sysconfig/sbd.  In
496       this file, you must specify the device(s) used, as well as any options
497       to pass to the daemon:
498
499               SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1"
500               SBD_PACEMAKER="true"
501
502       "sbd" will fail to start if no "SBD_DEVICE" is specified. See the
503       installed template or section for configuration via environment for
504       more options that can be configured here.  In general configuration
505       done via parameters takes precedence over the configuration from the
506       configuration file.
507
508   Configuration via environment
509       SBD_DEVICE
510           Allows "string" defaulting to ""
511
512           SBD_DEVICE specifies the devices to use for exchanging sbd messages
513           and to monitor. If specifying more than one path, use ";" as
514           separator.
515
516       SBD_PACEMAKER
517           Allows "yesno" defaulting to "yes"
518
519           Whether to enable the pacemaker integration.
520
521       SBD_STARTMODE
522           Allows "always / clean" defaulting to "always"
523
524           Specify the start mode for sbd. Setting this to "clean" will only
525           allow sbd to start if it was not previously fenced. See the -S
526           option in the man page.
527
528       SBD_DELAY_START
529           Allows "yesno / integer" defaulting to "no"
530
531           Whether to delay after starting sbd on boot for "msgwait" seconds.
532           This may be necessary if your cluster nodes reboot so fast that the
533           other nodes are still waiting in the fence acknowledgement phase.
534           This is an occasional issue with virtual machines.
535
536           This can also be enabled by setting it to a specific delay value,
537           in seconds. Sometimes a delay longer than the default, "msgwait",
538           is needed, for example when it is considered safer to wait longer
539           than: corosync token timeout + consensus timeout + pcmk_delay_max +
540           msgwait (see the example after this list).
541
542           Be aware that the special value "1" means "yes" rather than "1s".
543
544           Consider that you might have to adapt the start-up timeout
545           accordingly if the default is not sufficient (TimeoutStartSec for
546           systemd).
547
548           This option may be ignored at a later point, once pacemaker handles
549           this case better.
550
551       SBD_WATCHDOG_DEV
552           Allows "string" defaulting to "/dev/watchdog"
553
554           Watchdog device to use. If set to /dev/null, no watchdog device
555           will be used.
556
557       SBD_WATCHDOG_TIMEOUT
558           Allows "integer" defaulting to 5
559
560           How long, in seconds, the watchdog will wait before panicking the
561           node if no-one tickles it.
562
563           This depends mostly on your storage latency; the majority of
564           devices must be successfully read within this time, or else the
565           node will self-fence.
566
567           If your sbd device(s) reside on a multipath setup or iSCSI, this
568           should be the time required to detect a path failure.
569
570           Be aware that the watchdog timeout set in the on-disk metadata
571           takes precedence.
572
573       SBD_TIMEOUT_ACTION
574           Allows "string" defaulting to "flush,reboot"
575
576           Actions to be executed when the watchers do not report to the sbd
577           master process in time, or when one of the watchers detects that
578           the master process has died.
579
580           Set timeout-action to a comma-separated combination of noflush|flush
581           plus reboot|crashdump|off.  If only one of the two is given, the
582           other stays at its default.
583
584           This does not affect actions like off, crashdump, or reboot that are
585           explicitly triggered via message slots.  Nor does it configure the
586           action the hardware watchdog triggers when it expires (there is no
587           generic interface for that).
588
589       SBD_MOVE_TO_ROOT_CGROUP
590           Allows "yesno / auto" defaulting to "auto"
591
592           If CPUAccounting is enabled, the default is not to assign any
593           RT-budget to system.slice, which prevents sbd from running with
594           realtime (round-robin) scheduling.
595
596           One way to escape that issue is to move the sbd processes from the
597           slice they were originally started in to the root slice.  Of course,
598           starting sbd in a certain slice might be intentional.  Thus, in auto
599           mode sbd checks whether its slice has an RT-budget assigned; if so,
600           sbd stays in that slice; otherwise it is moved to the root slice.
601
602       SBD_OPTS
603           Allows "string" defaulting to ""
604
605           Additional options for starting sbd
606
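       As referenced under SBD_DELAY_START above, a delay longer than
       "msgwait" can be configured explicitly. A sketch with made-up timeout
       values:

               # corosync token (10s) + consensus (12s)
               #   + pcmk_delay_max (30s) + msgwait (30s) = 82 seconds
               SBD_DELAY_START=82
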
607   Testing the sbd installation
608       After a restart of the cluster stack on this node, you can now try
609       sending a test message to it as root, from this or any other node:
610
611               sbd -d /dev/sda1 message node1 test
612
613       The node will acknowledge the receipt of the message in the system
614       logs:
615
616               Aug 29 14:10:00 node1 sbd: [13412]: info: Received command test from node2
617
618       This confirms that SBD is indeed up and running on the node, and that
619       it is ready to receive messages.
620
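       You can also verify that every node has allocated its slot and that no
       fencing message is pending, using the list command described above:

               sbd -d /dev/sda1 list
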
621       Make sure that /etc/sysconfig/sbd is identical on all cluster nodes,
622       and that all cluster nodes are running the daemon.
623

Pacemaker CIB integration

625   Fencing resource
626       Pacemaker can only interact with SBD to issue a node fence if there is
627       a configured fencing resource. This should be a primitive, not a clone,
628       as follows:
629
630               primitive fencing-sbd stonith:external/sbd \
631                       params pcmk_delay_max=30
632
633       This will automatically use the same devices as configured in
634       /etc/sysconfig/sbd.
635
636       While you should not configure this as a clone (as Pacemaker will
637       register the fencing device on each node automatically), the
638       pcmk_delay_max setting enables a random fencing delay which ensures
639       that, should a split-brain occur in a two-node cluster, one of the
640       nodes has a better chance to survive, thus avoiding the two nodes
641       fencing each other (double fencing).
642
643       SBD also supports turning the reset request into a crash request, which
644       may be helpful for debugging if you have kernel crashdumping
645       configured; then, every fence request will cause the node to dump core.
646       You can enable this via the crashdump="true" parameter on the fencing
647       resource. This is not recommended for production use, but only for
648       debugging phases.
649
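       For such a debugging phase, the fencing resource might, for example, be
       configured as follows (a sketch only, not for production use):

               primitive fencing-sbd stonith:external/sbd \
                       params pcmk_delay_max=30 crashdump="true"
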
650   General cluster properties
651       You must also enable STONITH in general, and set the STONITH timeout to
652       be at least twice the msgwait timeout you have configured, to allow
653       enough time for the fencing message to be delivered. If your msgwait
654       timeout is 60 seconds, this is a possible configuration:
655
656               property stonith-enabled="true"
657               property stonith-timeout="120s"
658
659       Caution: if stonith-timeout is too low for msgwait and the system
660       overhead, sbd will never be able to successfully complete a fence
661       request. This will create a fencing loop.
662
663       Note that the sbd fencing agent will try to detect this and
664       automatically extend the stonith-timeout setting to a reasonable value,
665       on the assumption that sbd modifying your configuration is preferable
666       to not fencing.
667

Management tasks

669   Recovering from temporary SBD device outage
670       If you have multiple devices, failure of a single device is not
671       immediately fatal. "sbd" will try to restart the monitor for the
672       device every 5 seconds by default. However, you can tune this via the
673       options to the watch command.
674
675       If you wish to immediately force a restart of all currently
676       disabled monitor processes, you can send a SIGUSR1 to the SBD
677       inquisitor process.
678
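       Assuming the default pidfile location shown for the -p option above,
       this could, for instance, be done with:

               kill -USR1 "$(cat /var/run/sbd.pid)"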

LICENSE

680       Copyright (C) 2008-2013 Lars Marowsky-Bree
681
682       This program is free software; you can redistribute it and/or modify it
683       under the terms of the GNU General Public License as published by the
684       Free Software Foundation; either version 2 of the License, or (at your
685       option) any later version.
686
687       This software is distributed in the hope that it will be useful, but
688       WITHOUT ANY WARRANTY; without even the implied warranty of
689       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
690       General Public License for more details.
691
692       For details see the GNU General Public License at
693       http://www.gnu.org/licenses/gpl-2.0.html (version 2) and/or
694       http://www.gnu.org/licenses/gpl.html (the newest as per "any later").
695
696
697
698SBD                               2020-03-05                            SBD(8)