SBD(8)                      STONITH Block Device                        SBD(8)

NAME
       sbd - STONITH Block Device daemon

SYNOPSIS
       sbd <-d /dev/...> [options] "command"

SUMMARY
       SBD provides a node fencing mechanism (Shoot the other node in the
       head, STONITH) for Pacemaker-based clusters through the exchange of
       messages via shared block storage such as a SAN, iSCSI, or FCoE.
       This isolates the fencing mechanism from changes in firmware version
       or dependencies on specific firmware controllers, and it can be used
       as a STONITH mechanism in all configurations that have reliable
       shared storage.

       SBD can also be used without any shared storage. In this mode, the
       watchdog device will be used to reset the node if it loses quorum, if
       any monitored daemon is lost and not recovered, or if Pacemaker
       decides that the node requires fencing.

       The sbd binary implements both the daemon that watches the message
       slots and the management tool for interacting with the block storage
       device(s). The mode of operation is specified via the "command"
       parameter; some of these modes take additional parameters.

       To use SBD with shared storage, you must first "create" the messaging
       layout on one to three block devices. Second, configure
       /etc/sysconfig/sbd to list those devices (and possibly adjust other
       options), and restart the cluster stack on each node to ensure that
       "sbd" is started. Third, configure the "external/sbd" fencing
       resource in the Pacemaker CIB.

       Each of these steps is documented in more detail below the
       description of the command options.

       "sbd" can only be used as root.

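       Taken together, the three steps above can be sketched as the
       following outline. The device path, sysconfig edit, and crm shell
       usage are illustrative assumptions; adapt them to your environment,
       and see the sections below for details.

```shell
# 1. Initialize the SBD metadata on the shared device (run once, from any
#    node). WARNING: this overwrites the start of the device without
#    confirmation. The device path is a placeholder.
sbd -d /dev/disk/by-id/my-sbd-disk create

# 2. On every node, list the device in /etc/sysconfig/sbd, then restart the
#    cluster stack so that sbd is started alongside it.
echo 'SBD_DEVICE="/dev/disk/by-id/my-sbd-disk"' >> /etc/sysconfig/sbd

# 3. Configure the fencing resource in the Pacemaker CIB (crm shell syntax).
crm configure primitive fencing-sbd stonith:external/sbd
```
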
   GENERAL OPTIONS
       -d /dev/...
           Specify the block device(s) to be used. If you have more than
           one, specify this option up to three times. This parameter is
           mandatory for all modes, since SBD always needs a block device to
           interact with.

           This man page uses /dev/sda1, /dev/sdb1, and /dev/sdc1 as example
           device names for brevity. However, in your production
           environment, you should always refer to them by their long,
           stable device names (e.g.,
           /dev/disk/by-id/dm-uuid-part1-mpath-3600508b400105b5a0001500000250000).

       -v  Enable some verbose debug logging.

       -h  Display a concise summary of "sbd" options.

       -n node
           Set the local node name; defaults to "uname -n". This should not
           need to be set.

       -R  Do not enable realtime priority. By default, "sbd" runs at
           realtime priority, locks itself into memory, and also acquires
           the highest IO priority to protect itself against interference
           from other processes on the system. This is a debugging-only
           option.

       -I N
           Async IO timeout (defaults to 3 seconds, optional). You should
           not need to adjust this unless your IO setup is really very slow.

           (In daemon mode, the watchdog is refreshed when the majority of
           devices could be read within this time.)

   create
       Example usage:

               sbd -d /dev/sdc2 -d /dev/sdd3 create

       If you specify the create command, sbd will write a metadata header
       to the device(s) specified and also initialize the messaging slots
       for up to 255 nodes.

       Warning: This command will not prompt for confirmation. Roughly the
       first megabyte of the specified block device(s) will be overwritten
       immediately and without backup.

       This command accepts a few options to adjust the default timings that
       are written to the metadata (to ensure they are identical across all
       nodes accessing the device).

       -1 N
           Set the watchdog timeout to N seconds. This depends mostly on
           your storage latency; the majority of devices must be
           successfully read within this time, or else the node will
           self-fence.

           If your sbd device(s) reside on a multipath setup or iSCSI, this
           should be the time required to detect a path failure. You may be
           able to reduce this if your device outages are independent, or if
           you are using the Pacemaker integration.

       -2 N
           Set the slot allocation timeout to N seconds. You should not need
           to tune this.

       -3 N
           Set the daemon loop timeout to N seconds. You should not need to
           tune this.

       -4 N
           Set the msgwait timeout to N seconds. This should be twice the
           watchdog timeout. This is the time after which a message written
           to a node's slot will be considered delivered. (Or long enough
           for the node to detect that it needed to self-fence.)

           This also affects the stonith-timeout in Pacemaker's CIB; see
           below.

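       As a sketch of adjusting these timings, the following initializes a
       device with a 30-second watchdog timeout (an illustrative value for
       slow multipath storage) while keeping msgwait at twice the watchdog
       timeout, as recommended above:

```shell
# Illustrative timeouts; pick the watchdog timeout to cover your worst-case
# path failover time.
WATCHDOG_TIMEOUT=30
MSGWAIT=$(( WATCHDOG_TIMEOUT * 2 ))   # msgwait should be 2x watchdog
sbd -d /dev/sda1 -1 "$WATCHDOG_TIMEOUT" -4 "$MSGWAIT" create
```
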
   list
       Example usage:

               # sbd -d /dev/sda1 list
               0       hex-0   clear
               1       hex-7   clear
               2       hex-9   clear

       List all allocated slots on the device, and any messages. You should
       see all cluster nodes that have ever been started against this
       device. Nodes that are currently running should have a clear state;
       nodes that have been fenced, but not yet restarted, will show the
       appropriate fencing message.

   dump
       Example usage:

               # sbd -d /dev/sda1 dump
               ==Dumping header on disk /dev/sda1
               Header version     : 2
               Number of slots    : 255
               Sector size        : 512
               Timeout (watchdog) : 15
               Timeout (allocate) : 2
               Timeout (loop)     : 1
               Timeout (msgwait)  : 30
               ==Header on disk /dev/sda1 is dumped

       Dump the meta-data header from the device.

   watch
       Example usage:

               sbd -d /dev/sdc2 -d /dev/sdd3 -P watch

       This command will make "sbd" start in daemon mode. It will constantly
       monitor the message slot of the local node for incoming messages and
       device reachability, and optionally take Pacemaker's state into
       account.

       "sbd" must be started on boot before the cluster stack! See below for
       enabling this according to your boot environment.

       The options for this mode are rarely specified directly on the
       command line, but are most frequently set via /etc/sysconfig/sbd.

       It also constantly monitors connectivity to the storage device, and
       self-fences in case the partition becomes unreachable, guaranteeing
       that it does not miss fencing messages.

       A node slot is automatically allocated on the device(s) the first
       time the daemon starts watching the device; hence, manual allocation
       is not usually required.

       If a watchdog is used together with "sbd", as is strongly
       recommended, the watchdog is activated at the initial start of the
       sbd daemon. The watchdog is refreshed every time the majority of SBD
       devices has been successfully read. Using a watchdog provides
       additional protection against "sbd" crashing.

       If the Pacemaker integration is activated, "sbd" will not self-fence
       on loss of device majority as long as:

       1.  the partition the node is in is still quorate according to the
           CIB;

       2.  it is still quorate according to Corosync's node count;

       3.  the node itself is considered online and healthy by Pacemaker.

       This allows "sbd" to survive temporary outages of the majority of
       devices. However, while the cluster is in such a degraded state, it
       can neither successfully fence nor be shut down cleanly (as taking
       the cluster below the quorum threshold will immediately cause all
       remaining nodes to self-fence). In short, it will not tolerate any
       further faults. Please repair the system before continuing.

       There is one "sbd" process that acts as a master to which all
       watchers report; one per device to monitor the node's slot; and,
       optionally, one that handles the Pacemaker integration.

       -W  Enable or disable use of the system watchdog to protect against
           the sbd processes failing and the node being left in an undefined
           state. Specify this once to enable, twice to disable.

           Defaults to enabled.

       -w /dev/watchdog
           This can be used to override the default watchdog device used,
           and should not usually be necessary.

       -p /var/run/sbd.pid
           This option can be used to specify a pidfile for the main sbd
           process.

       -F N
           Number of failures a servant process may accumulate before it is
           no longer restarted immediately, but only after the dampening
           delay has expired. If set to zero, servants will be restarted
           immediately and indefinitely. If set to one, a failed servant
           will be restarted once every -t seconds. If set to a higher
           value, the servant will be restarted that many times within the
           dampening period before the delay applies.

           Defaults to 1.

       -t N
           Dampening delay before faulty servants are restarted. Combined
           with "-F 1", this is the most logical way to tune the restart
           frequency of servant processes. Default is 5 seconds.

           If set to zero, processes will be restarted immediately and
           indefinitely.

       -P  Enable Pacemaker integration, which checks Pacemaker quorum and
           node health. Specify this once to enable, twice to disable.

           Defaults to enabled.

       -S N
           Set the start mode. (Defaults to 0.)

           If this is set to zero, sbd will always start up unconditionally,
           regardless of whether the node was previously fenced or not.

           If set to one, sbd will only start if the node was previously
           shut down cleanly (as indicated by an exit request message in the
           slot), or if the slot is empty. A reset, crashdump, or power-off
           request in any slot will halt the start-up.

           This is useful to prevent nodes from rejoining if they were
           faulty. The node must be manually "unfenced" by sending an empty
           message to it:

                   sbd -d /dev/sda1 message node1 clear

       -s N
           Set the start-up wait time for devices. (Defaults to 120.)

           Dynamic block devices such as iSCSI might not be fully
           initialized and present yet. This allows setting a timeout for
           waiting for devices to appear on start-up. If set to 0, start-up
           will be aborted immediately if no devices are available.

       -Z  Enable trace mode. Warning: this is unsafe for production, use at
           your own risk! Specifying this once will turn all reboots or
           power-offs, be they caused by self-fence decisions or messages,
           into a crashdump. Specifying this twice will just log them but
           not continue running.

       -T  By default, the daemon will set the watchdog timeout as specified
           in the device metadata. However, this does not work for every
           watchdog device. In this case, you must manually ensure that the
           watchdog timeout used by the system correctly matches the SBD
           settings, and then specify this option to allow "sbd" to continue
           with start-up.

       -5 N
           Warn if the time interval for tickling the watchdog exceeds this
           many seconds. Since the node is unable to log the watchdog expiry
           (it reboots immediately without a chance to write its logs to
           disk), this is very useful for getting an indication that the
           watchdog timeout is too short for the IO load of the system.

           Default is 3 seconds; set to zero to disable.

       -C N
           Watchdog timeout to set before crashdumping. If SBD is set to
           crashdump instead of reboot - either via the trace mode settings
           or the external/sbd fencing agent's parameter - SBD will adjust
           the watchdog timeout to this setting before triggering the dump.
           Otherwise, the watchdog might trigger and prevent a successful
           crashdump from ever being written.

           Defaults to 240 seconds. Set to zero to disable.

       -r N
           Actions to be executed when the watchers don't report to the sbd
           master process in a timely fashion, or when one of the watchers
           detects that the master process has died.

           Set the timeout-action to a comma-separated combination of one
           value from noflush|flush and one from reboot|crashdump|off. If
           only one of the two is given, the other stays at its default.

           This does not affect actions such as off, crashdump, or reboot
           that are explicitly triggered via message slots. Nor does it
           configure the action the hardware watchdog itself triggers should
           it expire (there is no generic interface for that).

           Defaults to flush,reboot.

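       The same behavior is typically set via /etc/sysconfig/sbd rather
       than on the command line. The variable name below is taken from
       recent sbd sysconfig templates; verify it against the template
       installed on your system:

```shell
# Fragment of /etc/sysconfig/sbd: keep the cache flush, but crashdump
# instead of rebooting when watchers time out.
SBD_TIMEOUT_ACTION="flush,crashdump"
```
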
   allocate
       Example usage:

               sbd -d /dev/sda1 allocate node1

       Explicitly allocates a slot for the specified node name. This should
       rarely be necessary, as every node will automatically allocate itself
       a slot the first time it starts up in watch mode.

   message
       Example usage:

               sbd -d /dev/sda1 message node1 test

       Writes the specified message to the node's slot. This is rarely done
       directly, but rather abstracted via the "external/sbd" fencing agent
       configured as a cluster resource.

       Supported message types are:

       test
           This only generates a log message on the receiving node and can
           be used to check if SBD is seeing the device. Note that this
           could overwrite a fencing request sent by the cluster, so it
           should not be used in production.

       reset
           Reset the target upon receipt of this message.

       off Power-off the target.

       crashdump
           Cause the target node to crashdump.

       exit
           This will make the "sbd" daemon exit cleanly on the target. You
           should not send this message manually; this is handled properly
           during shutdown of the cluster stack. Manually stopping the
           daemon means the node is unprotected!

       clear
           This message indicates that no real message has been sent to the
           node. You should not set this manually; "sbd" will clear the
           message slot automatically during start-up, and setting this
           manually could overwrite a fencing message from the cluster.

   query-watchdog
       Example usage:

               sbd query-watchdog

       Check for available watchdog devices and print some information
       about them.

       Warning: This command will arm the watchdog during the query, and if
       your watchdog refuses disarming (for example, if its kernel module
       has the 'nowayout' parameter set) this will reset your system.

   test-watchdog
       Example usage:

               sbd test-watchdog [-w /dev/watchdog3]

       Test the specified watchdog device (/dev/watchdog by default).

       Warning: This command will arm the watchdog and will reset your
       system if the watchdog is working properly! If issued from an
       interactive session, it will prompt for confirmation.

Base system configuration
   Configure a watchdog
       It is highly recommended that you configure your Linux system to load
       a watchdog driver with hardware assistance (as is available on most
       modern systems), such as hpwdt, iTCO_wdt, or others. As a fall-back,
       you can use the softdog module.

       No other software must access the watchdog timer; it can only be
       accessed by one process at any given time. Some hardware vendors ship
       systems management software that uses the watchdog for system resets
       (e.g., the HP ASR daemon). Such software must be disabled if the
       watchdog is to be used by SBD.

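       As a sketch, loading the softdog fall-back module and making it
       persistent might look as follows on a systemd-based system. The
       config file path is an assumption; prefer your platform's hardware
       watchdog module where available:

```shell
# Load the software watchdog as a fall-back (prefer a hardware module such
# as iTCO_wdt or hpwdt where available).
modprobe softdog

# Make the module load on every boot (systemd's modules-load.d mechanism).
echo softdog > /etc/modules-load.d/watchdog.conf

# Verify that the watchdog device node now exists.
ls -l /dev/watchdog
```
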
   Choosing and initializing the block device(s)
       First, you have to decide if you want to use one, two, or three
       devices.

       If you are using multiple devices, they should reside on independent
       storage setups. For example, putting all three of them on the same
       logical unit would not provide any additional redundancy.

       The SBD device can be connected via Fibre Channel, Fibre Channel over
       Ethernet, or even iSCSI. Thus, an iSCSI target can become a sort-of
       network-based quorum server; the advantage is that it does not
       require a smart host at your third location, just block storage.

       The SBD partitions themselves must not be mirrored (via MD, DRBD, or
       the storage layer itself), since this could result in a split-mirror
       scenario. Nor can they reside on cLVM2 volume groups, since they must
       be accessed by the cluster stack before it has started the cLVM2
       daemons; hence, they should be either raw partitions or logical units
       on (multipath) storage.

       The block device(s) must be accessible from all nodes. (While it is
       not necessary that they share the same path name on all nodes, this
       is considered a very good idea.)

       SBD will only use about one megabyte per device, so you can easily
       create a small partition, or very small logical units. (The size of
       the SBD device depends on the block size of the underlying device.
       Thus, 1MB is fine on plain SCSI devices and SAN storage with 512 byte
       blocks. On the IBM s390x architecture in particular, disks default to
       4k blocks, and thus require roughly 4MB.)

       The number of devices will affect the operation of SBD as follows:

       One device
           In its simplest implementation, you use one device only. This is
           appropriate for clusters where all your data is on the same
           shared storage (with internal redundancy) anyway; the SBD device
           does not introduce an additional single point of failure then.

           If the SBD device is not accessible, the daemon will fail to
           start and inhibit openais start-up.

       Two devices
           This configuration is a trade-off, primarily aimed at
           environments where host-based mirroring is used, but no third
           storage device is available.

           SBD will not self-fence if it loses access to one mirror leg;
           this allows the cluster to continue to function even in the face
           of one outage.

           However, SBD will not fence the other side while only one mirror
           leg is available, since it does not have enough knowledge to
           detect an asymmetric split of the storage. So it will not be able
           to automatically tolerate a second failure while one of the
           storage arrays is down. (Though you can use the appropriate crm
           command to acknowledge the fence manually.)

           It will not start unless both devices are accessible on boot.

       Three devices
           In this most reliable and recommended configuration, SBD will
           only self-fence if more than one device is lost; hence, this
           configuration is resilient against temporary single device
           outages (be it due to failures or maintenance). Fencing messages
           can still be successfully relayed if at least two devices remain
           accessible.

           This configuration is appropriate for more complex scenarios
           where storage is not confined to a single array. For example,
           host-based mirroring solutions could have one SBD per mirror leg
           (not mirrored itself), and an additional tie-breaker on iSCSI.

           It will only start if at least two devices are accessible on
           boot.

       After you have chosen the devices and created the appropriate
       partitions and perhaps multipath alias names to ease management, use
       the "sbd create" command described above to initialize the SBD
       metadata on them.

       Sharing the block device(s) between multiple clusters

       It is possible to share the block devices between multiple clusters,
       provided that the total number of nodes accessing them does not
       exceed 255, and that they all share the same SBD timeouts (since
       these are part of the metadata).

       If you are using multiple devices, this can reduce the setup overhead
       required. However, you should not share devices between clusters in
       different security domains.

   Configure SBD to start on boot
       On systems using "sysvinit", the "openais" or "corosync" system
       start-up scripts must handle starting or stopping "sbd" as required
       before starting the rest of the cluster stack.

       For "systemd", sbd simply has to be enabled using

               systemctl enable sbd.service

       The daemon is brought online on each node before corosync and
       Pacemaker are started, and terminated only after all other cluster
       components have been shut down - ensuring that cluster resources are
       never activated without SBD supervision.

   Configuration via sysconfig
       The system instance of "sbd" is configured via /etc/sysconfig/sbd.
       In this file, you must specify the device(s) used, as well as any
       options to pass to the daemon:

               SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1"
               SBD_PACEMAKER="true"

       "sbd" will fail to start if no "SBD_DEVICE" is specified. See the
       installed template for more options that can be configured here. In
       general, configuration done via command-line parameters takes
       precedence over the configuration file.

   Testing the sbd installation
       After a restart of the cluster stack on this node, you can now try
       sending a test message to it as root, from this or any other node:

               sbd -d /dev/sda1 message node1 test

       The node will acknowledge the receipt of the message in the system
       logs:

               Aug 29 14:10:00 node1 sbd: [13412]: info: Received command test from node2

       This confirms that SBD is indeed up and running on the node, and that
       it is ready to receive messages.

       Make sure that /etc/sysconfig/sbd is identical on all cluster nodes,
       and that all cluster nodes are running the daemon.
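       One quick way to verify that /etc/sysconfig/sbd is identical
       everywhere is to compare checksums across nodes. The node names and
       passwordless ssh access are assumptions about your environment:

```shell
# Collect the checksum of /etc/sysconfig/sbd from every node; if all copies
# are identical, exactly one unique checksum remains.
for node in node1 node2 node3; do
    ssh "$node" md5sum /etc/sysconfig/sbd
done | awk '{print $1}' | sort -u | wc -l   # "1" means all copies match
```
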

Pacemaker CIB integration
   Fencing resource
       Pacemaker can only interact with SBD to issue a node fence if there
       is a configured fencing resource. This should be a primitive, not a
       clone, as follows:

               primitive fencing-sbd stonith:external/sbd \
                       params pcmk_delay_max=30

       This will automatically use the same devices as configured in
       /etc/sysconfig/sbd.

       While you should not configure this as a clone (as Pacemaker will
       register the fencing device on each node automatically), the
       pcmk_delay_max setting enables a random fencing delay, which ensures
       that if a split brain occurs in a two-node cluster, one of the nodes
       has a better chance to survive, avoiding double fencing.

       SBD also supports turning the reset request into a crash request,
       which may be helpful for debugging if you have kernel crashdumping
       configured; then, every fence request will cause the node to dump
       core. You can enable this via the crashdump="true" parameter on the
       fencing resource. This is not recommended for production use, but
       only for debugging phases.

   General cluster properties
       You must also enable STONITH in general, and set the STONITH timeout
       to be at least twice the msgwait timeout you have configured, to
       allow enough time for the fencing message to be delivered. If your
       msgwait timeout is 60 seconds, this is a possible configuration:

               property stonith-enabled="true"
               property stonith-timeout="120s"

       Caution: if stonith-timeout is too low for msgwait and the system
       overhead, sbd will never be able to successfully complete a fence
       request. This will create a fencing loop.

       Note that the sbd fencing agent will try to detect this and
       automatically extend the stonith-timeout setting to a reasonable
       value, on the assumption that sbd modifying your configuration is
       preferable to not fencing.
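       The relationship can be sanity-checked with a little shell
       arithmetic; this is just the "at least twice msgwait" rule from
       above, not an official sizing formula:

```shell
# msgwait as reported by "sbd dump" ("Timeout (msgwait)"); the example
# above assumes 60 seconds.
msgwait=60
stonith_timeout=$(( msgwait * 2 ))   # minimum safe stonith-timeout
echo "stonith-timeout=${stonith_timeout}s"
```

       This prints stonith-timeout=120s, matching the property example
       above.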

Management tasks
   Recovering from temporary SBD device outage
       If you have multiple devices, failure of a single device is not
       immediately fatal. "sbd" will retry restarting the monitor for the
       device every 5 seconds by default. However, you can tune this via the
       options to the watch command.

       If you wish to immediately force a restart of all currently disabled
       monitor processes, you can send a SIGUSR1 to the SBD inquisitor
       process.
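       For example, assuming the default pidfile location from the -p
       option above:

```shell
# Signal the sbd inquisitor (master) process to immediately restart all
# currently disabled monitor processes.
kill -USR1 "$(cat /var/run/sbd.pid)"
```
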

LICENSE
       Copyright (C) 2008-2013 Lars Marowsky-Bree

       This program is free software; you can redistribute it and/or modify
       it under the terms of the GNU General Public License as published by
       the Free Software Foundation; either version 2 of the License, or (at
       your option) any later version.

       This software is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
       General Public License for more details.

       For details see the GNU General Public License at
       http://www.gnu.org/licenses/gpl-2.0.html (version 2) and/or
       http://www.gnu.org/licenses/gpl.html (the newest as per "any later").

SBD                               2019-10-07                            SBD(8)