SBD(8)                       STONITH Block Device                      SBD(8)


NAME
       sbd - STONITH Block Device daemon

SYNOPSIS
       sbd <-d /dev/...> [options] "command"

DESCRIPTION
       SBD provides a node fencing mechanism (Shoot the other node in the
       head, STONITH) for Pacemaker-based clusters through the exchange of
       messages via shared block storage such as, for example, a SAN, iSCSI,
       or FCoE device. This isolates the fencing mechanism from changes in
       firmware version or dependencies on specific firmware controllers, and
       it can be used as a STONITH mechanism in all configurations that have
       reliable shared storage.

       SBD can also be used without any shared storage. In this mode, the
       watchdog device will be used to reset the node if it loses quorum, if
       any monitored daemon is lost and not recovered, or if Pacemaker
       decides that the node requires fencing.

       The sbd binary implements both the daemon that watches the message
       slots as well as the management tool for interacting with the block
       storage device(s). This mode of operation is specified via the
       "command" parameter; some of these modes take additional parameters.

       To use SBD with shared storage, you must first "create" the messaging
       layout on one to three block devices. Second, configure
       /etc/sysconfig/sbd to list those devices (and possibly adjust other
       options), and restart the cluster stack on each node to ensure that
       "sbd" is started. Third, configure the "external/sbd" fencing resource
       in the Pacemaker CIB.

       Each of these steps is documented in more detail below the description
       of the command options.

       "sbd" can only be used as root.

GENERAL OPTIONS
       -d /dev/...
           Specify the block device(s) to be used. If you have more than one,
           specify this option up to three times. This parameter is mandatory
           for all modes, since SBD always needs a block device to interact
           with.

           This man page uses /dev/sda1, /dev/sdb1, and /dev/sdc1 as example
           device names for brevity. However, in your production environment,
           you should instead always refer to them by using the long, stable
           device name (e.g.,
           /dev/disk/by-id/dm-uuid-part1-mpath-3600508b400105b5a0001500000250000).

       -v  Enable some verbose debug logging.

       -h  Display a concise summary of "sbd" options.

       -n node
           Set local node name; defaults to "uname -n". This should not need
           to be set.

       -R  Do not enable realtime priority. By default, "sbd" runs at
           realtime priority, locks itself into memory, and also acquires
           highest IO priority to protect itself against interference from
           other processes on the system. This is a debugging-only option.

       -I N
           Async IO timeout (defaults to 3 seconds, optional). You should not
           need to adjust this unless your IO setup is really very slow.

           (In daemon mode, the watchdog is refreshed when the majority of
           devices could be read within this time.)

create
       Example usage:

           sbd -d /dev/sdc2 -d /dev/sdd3 create

       If you specify the create command, sbd will write a metadata header to
       the device(s) specified and also initialize the messaging slots for up
       to 255 nodes.

       Warning: This command will not prompt for confirmation. Roughly the
       first megabyte of the specified block device(s) will be overwritten
       immediately and without backup.

       This command accepts a few options to adjust the default timings that
       are written to the metadata (to ensure they are identical across all
       nodes accessing the device).

       -1 N
           Set watchdog timeout to N seconds. This depends mostly on your
           storage latency; the majority of devices must be successfully read
           within this time, or else the node will self-fence.

           If your sbd device(s) reside on a multipath setup or iSCSI, this
           should be the time required to detect a path failure. You may be
           able to reduce this if your device outages are independent, or if
           you are using the Pacemaker integration.

       -2 N
           Set slot allocation timeout to N seconds. You should not need to
           tune this.

       -3 N
           Set daemon loop timeout to N seconds. You should not need to tune
           this.

       -4 N
           Set msgwait timeout to N seconds. This should be twice the
           watchdog timeout. This is the time after which a message written
           to a node's slot will be considered delivered. (Or long enough for
           the node to detect that it needed to self-fence.)

           This also affects the stonith-timeout in Pacemaker's CIB; see
           below.

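       For example, on multipath or iSCSI storage where detecting a path
       failure may take up to a minute, the watchdog timeout could be raised
       to 60 seconds and msgwait set to twice that value (the device name
       below is a placeholder):

           sbd -d /dev/sda1 -1 60 -4 120 create
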
list
       Example usage:

           # sbd -d /dev/sda1 list
           0       hex-0   clear
           1       hex-7   clear
           2       hex-9   clear

       List all allocated slots on the device, and any messages. You should
       see all cluster nodes that have ever been started against this device.
       Nodes that are currently running should have a clear state; nodes that
       have been fenced, but not yet restarted, will show the appropriate
       fencing message.

dump
       Example usage:

           # sbd -d /dev/sda1 dump
           ==Dumping header on disk /dev/sda1
           Header version     : 2
           Number of slots    : 255
           Sector size        : 512
           Timeout (watchdog) : 15
           Timeout (allocate) : 2
           Timeout (loop)     : 1
           Timeout (msgwait)  : 30
           ==Header on disk /dev/sda1 is dumped

       Dump the meta-data header from the device.

watch
       Example usage:

           sbd -d /dev/sdc2 -d /dev/sdd3 -P watch

       This command will make "sbd" start in daemon mode. It will constantly
       monitor the message slot of the local node for incoming messages,
       monitor the reachability of the devices, and optionally take
       Pacemaker's state into account.

       "sbd" must be started on boot before the cluster stack! See below for
       enabling this according to your boot environment.

       The options for this mode are rarely specified on the command line
       directly, but most frequently set via /etc/sysconfig/sbd.

       It also constantly monitors connectivity to the storage device(s), and
       self-fences in case the partition becomes unreachable, guaranteeing
       that it does not disconnect from fencing messages.

       A node slot is automatically allocated on the device(s) the first time
       the daemon starts watching the device; hence, manual allocation is not
       usually required.

       If a watchdog is used together with "sbd", as is strongly recommended,
       the watchdog is activated at the initial start of the sbd daemon. The
       watchdog is refreshed every time the majority of SBD devices has been
       successfully read. Using a watchdog provides additional protection
       against "sbd" crashing.

       If the Pacemaker integration is activated, "sbd" will not self-fence
       when device majority is lost, provided that:

       1.  the partition the node is in is still quorate according to the
           CIB;

       2.  it is still quorate according to Corosync's node count;

       3.  the node itself is considered online and healthy by Pacemaker.

       This allows "sbd" to survive temporary outages of the majority of
       devices. However, while the cluster is in such a degraded state, it
       can neither successfully fence nor be shut down cleanly (as taking the
       cluster below the quorum threshold will immediately cause all
       remaining nodes to self-fence). In short, it will not tolerate any
       further faults. Please repair the system before continuing.

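       Whether these conditions currently hold can be checked with the usual
       cluster tools; for example (illustrative commands, not part of "sbd"
       itself):

           # check Corosync quorum state
           corosync-quorumtool -s
           # check node status as seen by Pacemaker
           crm_mon -1
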
       There is one "sbd" process that acts as a master to which all watchers
       report; one per device to monitor the node's slot; and, optionally,
       one that handles the Pacemaker integration.

       -W  Enable or disable use of the system watchdog to protect against
           the sbd processes failing and the node being left in an undefined
           state. Specify this once to enable, twice to disable.

           Defaults to enabled.

       -w /dev/watchdog
           This can be used to override the default watchdog device used and
           should not usually be necessary.

       -p /var/run/sbd.pid
           This option can be used to specify a pidfile for the main sbd
           process.

       -F N
           Number of failures before a failing servant process will no longer
           be restarted immediately, but only after the dampening delay has
           expired. If set to zero, servants will be restarted immediately
           and indefinitely. If set to one, a failed servant will be
           restarted once every -t seconds. If set to a different value, the
           servant will be restarted that many times within the dampening
           period before the delay applies.

           Defaults to 1.

       -t N
           Dampening delay before faulty servants are restarted. Combined
           with "-F 1", this is the most logical way to tune the restart
           frequency of servant processes. Default is 5 seconds.

           If set to zero, processes will be restarted indefinitely and
           immediately.

       -P  Enable Pacemaker integration which checks Pacemaker quorum and
           node health. Specify this once to enable, twice to disable.

           Defaults to enabled.

       -S N
           Set the start mode. (Defaults to 0.)

           If this is set to zero, sbd will always start up unconditionally,
           regardless of whether the node was previously fenced or not.

           If set to one, sbd will only start if the node was previously shut
           down cleanly (as indicated by an exit request message in the
           slot), or if the slot is empty. A reset, crashdump, or power-off
           request in any slot will halt the start-up.

           This is useful to prevent nodes from rejoining if they were
           faulty. The node must be manually "unfenced" by sending an empty
           message to it:

               sbd -d /dev/sda1 message node1 clear

       -s N
           Set the start-up wait time for devices. (Defaults to 120.)

           Dynamic block devices such as iSCSI might not be fully initialized
           and present yet. This allows setting a timeout for waiting for
           devices to appear on start-up. If set to 0, start-up will be
           aborted immediately if no devices are available.

       -Z  Enable trace mode. Warning: this is unsafe for production, use at
           your own risk! Specifying this once will turn all reboots or
           power-offs, be they caused by self-fence decisions or messages,
           into a crashdump. Specifying this twice will just log them but not
           continue running.

       -T  By default, the daemon will set the watchdog timeout as specified
           in the device metadata. However, this does not work for every
           watchdog device. In this case, you must manually ensure that the
           watchdog timeout used by the system correctly matches the SBD
           settings, and then specify this option to allow "sbd" to continue
           with start-up.

       -5 N
           Warn if the time interval for tickling the watchdog exceeds this
           many seconds. Since the node is unable to log the watchdog expiry
           (it reboots immediately without a chance to write its logs to
           disk), this is very useful for getting an indication that the
           watchdog timeout is too short for the IO load of the system.

           Default is 3 seconds, set to zero to disable.

       -C N
           Watchdog timeout to set before crashdumping. If SBD is set to
           crashdump instead of reboot (either via the trace mode settings or
           the external/sbd fencing agent's parameter), SBD will adjust the
           watchdog timeout to this setting before triggering the dump.
           Otherwise, the watchdog might trigger and prevent a successful
           crashdump from ever being written.

           Defaults to 240 seconds. Set to zero to disable.

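       For example, a three-device setup with the watchdog and the Pacemaker
       integration enabled might be started as follows (normally the init
       system does this using the settings from /etc/sysconfig/sbd rather
       than invoking the daemon by hand):

           sbd -d /dev/sda1 -d /dev/sdb1 -d /dev/sdc1 -W -P watch
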
allocate
       Example usage:

           sbd -d /dev/sda1 allocate node1

       Explicitly allocates a slot for the specified node name. This should
       rarely be necessary, as every node will automatically allocate itself
       a slot the first time it starts up in watch mode.

message
       Example usage:

           sbd -d /dev/sda1 message node1 test

       Writes the specified message to the node's slot. This is rarely done
       directly, but rather abstracted via the "external/sbd" fencing agent
       configured as a cluster resource.

       Supported message types are:

       test
           This only generates a log message on the receiving node and can be
           used to check if SBD is seeing the device. Note that this could
           overwrite a fencing request sent by the cluster, so it should not
           be used during production.

       reset
           Reset the target upon receipt of this message.

       off Power-off the target.

       crashdump
           Cause the target node to crashdump.

       exit
           This will make the "sbd" daemon exit cleanly on the target. You
           should not send this message manually; this is handled properly
           during shutdown of the cluster stack. Manually stopping the daemon
           means the node is unprotected!

       clear
           This message indicates that no real message has been sent to the
           node. You should not set this manually; "sbd" will clear the
           message slot automatically during start-up, and setting this
           manually could overwrite a fencing message from the cluster.

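       For example, to manually fence the node "node2" (normally the cluster
       does this for you via the fencing resource), a reset message would be
       written to its slot on all configured devices:

           sbd -d /dev/sda1 -d /dev/sdb1 -d /dev/sdc1 message node2 reset
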
query-watchdog
       Example usage:

           sbd query-watchdog

       Check for available watchdog devices and print some info.

       Warning: This command will arm the watchdog during the query, and if
       your watchdog refuses disarming (for example, if its kernel module has
       the 'nowayout' parameter set) this will reset your system.

test-watchdog
       Example usage:

           sbd test-watchdog [-w /dev/watchdog3]

       Test the specified watchdog device (/dev/watchdog by default).

       Warning: This command will arm the watchdog and reset your system if
       the watchdog is working properly! If issued from an interactive
       session, it will prompt for confirmation.

Base system configuration
   Configure a watchdog
       It is highly recommended that you configure your Linux system to load
       a watchdog driver with hardware assistance (as is available on most
       modern systems), such as hpwdt, iTCO_wdt, or others. As a fall-back,
       you can use the softdog module.

       No other software must access the watchdog timer; it can only be
       accessed by one process at any given time. Some hardware vendors ship
       systems management software that uses the watchdog for system resets
       (e.g., the HP ASR daemon). Such software must be disabled if the
       watchdog is to be used by SBD.

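       On systems using the systemd module-loading mechanism, the softdog
       fall-back could, for example, be loaded at boot and then verified with
       "sbd" (note that the query arms the watchdog, see the warning above):

           echo softdog > /etc/modules-load.d/watchdog.conf
           modprobe softdog
           sbd query-watchdog
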
   Choosing and initializing the block device(s)
       First, you have to decide if you want to use one, two, or three
       devices.

       If you are using multiple ones, they should reside on independent
       storage setups. Putting all three of them on the same logical unit,
       for example, would not provide any additional redundancy.

       The SBD device can be connected via Fibre Channel, Fibre Channel over
       Ethernet, or even iSCSI. Thus, an iSCSI target can become a sort-of
       network-based quorum server; the advantage is that it does not require
       a smart host at your third location, just block storage.

       The SBD partitions themselves must not be mirrored (via MD, DRBD, or
       the storage layer itself), since this could result in a split-mirror
       scenario. Nor can they reside on cLVM2 volume groups, since they must
       be accessed by the cluster stack before it has started the cLVM2
       daemons; hence, these should be either raw partitions or logical units
       on (multipath) storage.

       The block device(s) must be accessible from all nodes. (While it is
       not necessary that they share the same path name on all nodes, this is
       considered a very good idea.)

       SBD will only use about one megabyte per device, so you can easily
       create a small partition, or very small logical units. (The size of
       the SBD device depends on the block size of the underlying device.
       Thus, 1MB is fine on plain SCSI devices and SAN storage with 512 byte
       blocks. On the IBM s390x architecture in particular, disks default to
       4k blocks, and thus require roughly 4MB.)

       The number of devices will affect the operation of SBD as follows:

       One device
           In its most simple implementation, you use one device only. This
           is appropriate for clusters where all your data is on the same
           shared storage (with internal redundancy) anyway; the SBD device
           does not introduce an additional single point of failure then.

           If the SBD device is not accessible, the daemon will fail to start
           and inhibit openais startup.

       Two devices
           This configuration is a trade-off, primarily aimed at environments
           where host-based mirroring is used, but no third storage device is
           available.

           SBD will not commit suicide if it loses access to one mirror leg;
           this allows the cluster to continue to function even in the face
           of one outage.

           However, SBD will not fence the other side while only one mirror
           leg is available, since it does not have enough knowledge to
           detect an asymmetric split of the storage. So it will not be able
           to automatically tolerate a second failure while one of the
           storage arrays is down. (Though you can use the appropriate crm
           command to acknowledge the fence manually.)

           It will not start unless both devices are accessible on boot.

       Three devices
           In this most reliable and recommended configuration, SBD will only
           self-fence if more than one device is lost; hence, this
           configuration is resilient against temporary single device outages
           (be it due to failures or maintenance). Fencing messages can still
           be successfully relayed if at least two devices remain accessible.

           This configuration is appropriate for more complex scenarios where
           storage is not confined to a single array. For example, host-based
           mirroring solutions could have one SBD per mirror leg (not
           mirrored itself), and an additional tie-breaker on iSCSI.

           It will only start if at least two devices are accessible on boot.

       After you have chosen the devices and created the appropriate
       partitions and perhaps multipath alias names to ease management, use
       the "sbd create" command described above to initialize the SBD
       metadata on them.

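       For example, to initialize three such devices in one step (the short
       device names are placeholders; use the stable /dev/disk/by-id/ names
       in production, as noted under GENERAL OPTIONS):

           sbd -d /dev/sda1 -d /dev/sdb1 -d /dev/sdc1 create
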
   Sharing the block device(s) between multiple clusters
       It is possible to share the block devices between multiple clusters,
       provided the total number of nodes accessing them does not exceed 255,
       and provided they all share the same SBD timeouts (since these are
       part of the metadata).

       If you are using multiple devices, this can reduce the setup overhead
       required. However, you should not share devices between clusters in
       different security domains.

   Configure SBD to start on boot
       On systems using "sysvinit", the "openais" or "corosync" system start-
       up scripts must handle starting or stopping "sbd" as required before
       starting the rest of the cluster stack.

       For "systemd", sbd simply has to be enabled using

           systemctl enable sbd.service

       The daemon is brought online on each node before corosync and
       Pacemaker are started, and terminated only after all other cluster
       components have been shut down - ensuring that cluster resources are
       never activated without SBD supervision.

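       Whether the unit is enabled and running can be checked, for example,
       with:

           systemctl is-enabled sbd.service
           systemctl status sbd.service
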
   Configuration via sysconfig
       The system instance of "sbd" is configured via /etc/sysconfig/sbd. In
       this file, you must specify the device(s) used, as well as any options
       to pass to the daemon:

           SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1"
           SBD_PACEMAKER="true"

       "sbd" will fail to start if no "SBD_DEVICE" is specified. See the
       installed template for more options that can be configured here.

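       As an illustration, a somewhat fuller configuration might look like
       the following (the additional variable names below are taken from the
       configuration template shipped with recent sbd packages; verify them
       against your installed template):

           SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1"
           SBD_PACEMAKER="true"
           SBD_WATCHDOG_DEV="/dev/watchdog"
           # extra command-line options passed to the daemon, see "watch"
           SBD_OPTS=""
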
   Testing the sbd installation
       After a restart of the cluster stack on this node, you can now try
       sending a test message to it as root, from this or any other node:

           sbd -d /dev/sda1 message node1 test

       The node will acknowledge the receipt of the message in the system
       logs:

           Aug 29 14:10:00 node1 sbd: [13412]: info: Received command test from node2

       This confirms that SBD is indeed up and running on the node, and that
       it is ready to receive messages.

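       You can also list the slots on the device; every cluster node that has
       started "sbd" against it should appear with a clear state, for
       example:

           sbd -d /dev/sda1 list
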
       Make sure that /etc/sysconfig/sbd is identical on all cluster nodes,
       and that all cluster nodes are running the daemon.

Pacemaker configuration
   Fencing resource
       Pacemaker can only interact with SBD to issue a node fence if there is
       a configured fencing resource. This should be a primitive, not a
       clone, as follows:

           primitive fencing-sbd stonith:external/sbd \
                   params pcmk_delay_max=30

       This will automatically use the same devices as configured in
       /etc/sysconfig/sbd.

       While you should not configure this as a clone (as Pacemaker will
       register the fencing device on each node automatically), the
       pcmk_delay_max setting enables a random fencing delay, which ensures
       that if a split-brain scenario does occur in a two-node cluster, one
       of the nodes has a better chance to survive, avoiding double fencing.

       SBD also supports turning the reset request into a crash request,
       which may be helpful for debugging if you have kernel crashdumping
       configured; then, every fence request will cause the node to dump
       core. You can enable this via the crashdump="true" parameter on the
       fencing resource. This is not recommended for production use, but only
       for debugging phases.

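       For example, the fencing resource shown above could be extended with
       this parameter as follows:

           primitive fencing-sbd stonith:external/sbd \
                   params pcmk_delay_max=30 crashdump="true"
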
   General cluster properties
       You must also enable STONITH in general, and set the STONITH timeout
       to be at least twice the msgwait timeout you have configured, to allow
       enough time for the fencing message to be delivered. If your msgwait
       timeout is 60 seconds, this is a possible configuration:

           property stonith-enabled="true"
           property stonith-timeout="120s"

       Caution: if stonith-timeout is too low for msgwait and the system
       overhead, sbd will never be able to successfully complete a fence
       request. This will create a fencing loop.

       Note that the sbd fencing agent will try to detect this and
       automatically extend the stonith-timeout setting to a reasonable
       value, on the assumption that sbd modifying your configuration is
       preferable to not fencing.

Management
   Recovering from temporary SBD device outage
       If you have multiple devices, failure of a single device is not
       immediately fatal. "sbd" will retry restarting the monitor for the
       device every 5 seconds by default. However, you can tune this via the
       options to the watch command.

       In case you wish to immediately force a restart of all currently
       disabled monitor processes, you can send a SIGUSR1 to the SBD
       inquisitor process.

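       For example, assuming the main process uses the default pidfile shown
       for the -p option above, all disabled monitors could be restarted
       with:

           kill -USR1 $(cat /var/run/sbd.pid)
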
LICENSE
       Copyright (C) 2008-2013 Lars Marowsky-Bree

       This program is free software; you can redistribute it and/or modify
       it under the terms of the GNU General Public License as published by
       the Free Software Foundation; either version 2 of the License, or (at
       your option) any later version.

       This software is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
       General Public License for more details.

       For details see the GNU General Public License at
       http://www.gnu.org/licenses/gpl-2.0.html (version 2) and/or
       http://www.gnu.org/licenses/gpl.html (the newest as per "any later").



SBD                               2018-11-20                            SBD(8)