SBD(8)                        STONITH Block Device                        SBD(8)



NAME
       sbd - STONITH Block Device daemon

SYNOPSIS
       sbd <-d /dev/...> [options] "command"

SUMMARY
       SBD provides a node fencing mechanism (Shoot the other node in the
       head, STONITH) for Pacemaker-based clusters through the exchange of
       messages via shared block storage such as a SAN, iSCSI, or FCoE
       device. This isolates the fencing mechanism from changes in firmware
       version or dependencies on specific firmware controllers, and it can
       be used as a STONITH mechanism in all configurations that have
       reliable shared storage.

       SBD can also be used without any shared storage. In this mode, the
       watchdog device will be used to reset the node if it loses quorum, if
       any monitored daemon is lost and not recovered, or if Pacemaker
       decides that the node requires fencing.

       The sbd binary implements both the daemon that watches the message
       slots and the management tool for interacting with the block storage
       device(s). The mode of operation is selected via the "command"
       parameter; some of these modes take additional parameters.

       To use SBD with shared storage, you must first "create" the messaging
       layout on one to three block devices. Second, configure
       /etc/sysconfig/sbd to list those devices (and possibly adjust other
       options), and restart the cluster stack on each node to ensure that
       "sbd" is started. Third, configure the "external/sbd" fencing resource
       in the Pacemaker CIB.

       Each of these steps is documented in more detail below the description
       of the command options.
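
       In outline, the procedure looks roughly like this (device names are
       examples, and the crm shell syntax in the last step is illustrative;
       each step is explained in the sections below):

           # 1. initialize the message layout on the shared device(s)
           sbd -d /dev/sda1 -d /dev/sdb1 -d /dev/sdc1 create

           # 2. list the devices in /etc/sysconfig/sbd on every node, e.g.
           #    SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1"

           # 3. configure the fencing resource in the Pacemaker CIB
           crm configure primitive fencing-sbd stonith:external/sbd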

       "sbd" can only be used as root.

GENERAL OPTIONS
       -d /dev/...
           Specify the block device(s) to be used. If you have more than one,
           specify this option up to three times. This parameter is mandatory
           for all modes, since SBD always needs a block device to interact
           with.

           This man page uses /dev/sda1, /dev/sdb1, and /dev/sdc1 as example
           device names for brevity. However, in your production environment,
           you should instead always refer to them by using the long, stable
           device name (e.g.,
           /dev/disk/by-id/dm-uuid-part1-mpath-3600508b400105b5a0001500000250000).

       -v  Enable some verbose debug logging.

       -h  Display a concise summary of "sbd" options.

       -n node
           Set local node name; defaults to "uname -n". This should not need
           to be set.

       -R  Do not enable realtime priority. By default, "sbd" runs at
           realtime priority, locks itself into memory, and also acquires
           highest IO priority to protect itself against interference from
           other processes on the system. This is a debugging-only option.

       -I N
           Async IO timeout (defaults to 3 seconds, optional). You should not
           need to adjust this unless your IO setup is really very slow.

           (In daemon mode, the watchdog is refreshed when the majority of
           devices could be read within this time.)

   create
       Example usage:

           sbd -d /dev/sdc2 -d /dev/sdd3 create

       If you specify the create command, sbd will write a metadata header to
       the device(s) specified and also initialize the messaging slots for up
       to 255 nodes.

       Warning: This command will not prompt for confirmation. Roughly the
       first megabyte of the specified block device(s) will be overwritten
       immediately and without backup.

       This command accepts a few options to adjust the default timings that
       are written to the metadata (to ensure they are identical across all
       nodes accessing the device).

       -1 N
           Set watchdog timeout to N seconds. This depends mostly on your
           storage latency; the majority of devices must be successfully read
           within this time, or else the node will self-fence.

           If your sbd device(s) reside on a multipath setup or iSCSI, this
           should be the time required to detect a path failure. You may be
           able to reduce this if your device outages are independent, or if
           you are using the Pacemaker integration.

       -2 N
           Set slot allocation timeout to N seconds. You should not need to
           tune this.

       -3 N
           Set daemon loop timeout to N seconds. You should not need to tune
           this.

       -4 N
           Set msgwait timeout to N seconds. This should be twice the
           watchdog timeout. This is the time after which a message written
           to a node's slot will be considered delivered. (Or long enough for
           the node to detect that it needed to self-fence.)

           This also affects the stonith-timeout in Pacemaker's CIB; see
           below.

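       For example, to initialize two devices on multipath storage with a
       30-second watchdog timeout and the matching 60-second msgwait (the
       values are illustrative; derive them from your actual path-failover
       times):

           sbd -d /dev/sdc2 -d /dev/sdd3 -1 30 -4 60 create
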
   list
       Example usage:

           # sbd -d /dev/sda1 list
           0       hex-0   clear
           1       hex-7   clear
           2       hex-9   clear

       List all allocated slots on the device, and any messages. You should
       see all cluster nodes that have ever been started against this device.
       Nodes that are currently running should have a clear state; nodes that
       have been fenced, but not yet restarted, will show the appropriate
       fencing message.

   dump
       Example usage:

           # sbd -d /dev/sda1 dump
           ==Dumping header on disk /dev/sda1
           Header version     : 2
           Number of slots    : 255
           Sector size        : 512
           Timeout (watchdog) : 15
           Timeout (allocate) : 2
           Timeout (loop)     : 1
           Timeout (msgwait)  : 30
           ==Header on disk /dev/sda1 is dumped

       Dump meta-data header from device.

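       Since the timeouts are stored in the metadata and have to be identical
       on all devices, it can be useful to dump every configured device and
       compare the output (a simple sketch, using the example device names
       from this page):

           for d in /dev/sda1 /dev/sdb1 /dev/sdc1; do
               sbd -d "$d" dump
           done
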
   watch
       Example usage:

           sbd -d /dev/sdc2 -d /dev/sdd3 -P watch

       This command will make "sbd" start in daemon mode. It will constantly
       monitor the local node's message slot for incoming messages and device
       reachability, and optionally take Pacemaker's state into account.

       "sbd" must be started on boot before the cluster stack! See below for
       enabling this according to your boot environment.

       The options for this mode are rarely specified on the command line
       directly, but are most frequently set via /etc/sysconfig/sbd.

       It also constantly monitors connectivity to the storage device, and
       self-fences in case the partition becomes unreachable, guaranteeing
       that it does not disconnect from fencing messages.

       A node slot is automatically allocated on the device(s) the first time
       the daemon starts watching the device; hence, manual allocation is not
       usually required.

       If a watchdog is used together with "sbd", as is strongly recommended,
       the watchdog is activated at the initial start of the sbd daemon. The
       watchdog is refreshed every time the majority of SBD devices has been
       successfully read. Using a watchdog provides additional protection
       against "sbd" crashing.

       If the Pacemaker integration is activated, "sbd" will not self-fence
       if device majority is lost, provided that:

       1. the partition the node is in is still quorate according to the CIB;

       2. it is still quorate according to Corosync's node count;

       3. the node itself is considered online and healthy by Pacemaker.

       This allows "sbd" to survive temporary outages of the majority of
       devices. However, while the cluster is in such a degraded state, it
       can neither successfully fence nor be shut down cleanly (as taking the
       cluster below the quorum threshold will immediately cause all
       remaining nodes to self-fence). In short, it will not tolerate any
       further faults. Please repair the system before continuing.

       There is one "sbd" process that acts as a master to which all watchers
       report; one per device to monitor the node's slot; and, optionally,
       one that handles the Pacemaker integration.

       -W  Enable or disable use of the system watchdog to protect against
           the sbd processes failing and the node being left in an undefined
           state. Specify this once to enable, twice to disable.

           Defaults to enabled.

       -w /dev/watchdog
           This can be used to override the default watchdog device used and
           should not usually be necessary.

       -p /var/run/sbd.pid
           This option can be used to specify a pidfile for the main sbd
           process.

       -F N
           Number of failures allowed before a failing servant process is no
           longer restarted immediately, but only after the dampening delay
           (-t) has expired. If set to zero, servants will be restarted
           immediately and indefinitely. If set to one, a failed servant will
           be restarted once every -t seconds. If set to a higher value, the
           servant will be restarted that many times within the dampening
           period before the delay is applied.

           Defaults to 1.

       -t N
           Dampening delay before faulty servants are restarted. Combined
           with "-F 1", this is the most logical way to tune the restart
           frequency of servant processes. Default is 5 seconds.

           If set to zero, processes will be restarted indefinitely and
           immediately.

       -P  Enable Pacemaker integration which checks Pacemaker quorum and
           node health. Specify this once to enable, twice to disable.

           Defaults to enabled.

       -S N
           Set the start mode. (Defaults to 0.)

           If this is set to zero, sbd will always start up unconditionally,
           regardless of whether the node was previously fenced or not.

           If set to one, sbd will only start if the node was previously shut
           down cleanly (as indicated by an exit request message in the
           slot), or if the slot is empty. A reset, crashdump, or power-off
           request in any slot will halt the start-up.

           This is useful to prevent nodes from rejoining if they were
           faulty. The node must be manually "unfenced" by sending an empty
           message to it:

               sbd -d /dev/sda1 message node1 clear

       -s N
           Set the start-up wait time for devices. (Defaults to 120.)

           Dynamic block devices such as iSCSI might not be fully initialized
           and present yet. This allows one to set a timeout for waiting for
           devices to appear on start-up. If set to 0, start-up will be
           aborted immediately if no devices are available.

       -Z  Enable trace mode. Warning: this is unsafe for production, use at
           your own risk! Specifying this once will turn all reboots or
           power-offs, be they caused by self-fence decisions or messages,
           into a crashdump. Specifying this twice will merely log such
           events while "sbd" continues running.

       -T  By default, the daemon will set the watchdog timeout as specified
           in the device metadata. However, this does not work for every
           watchdog device. In this case, you must manually ensure that the
           watchdog timeout used by the system correctly matches the SBD
           settings, and then specify this option to allow "sbd" to continue
           with start-up.

       -5 N
           Warn if the time interval for tickling the watchdog exceeds this
           many seconds. Since the node is unable to log the watchdog expiry
           (it reboots immediately without a chance to write its logs to
           disk), this is very useful for getting an indication that the
           watchdog timeout is too short for the IO load of the system.

           Default is 3 seconds; set to zero to disable.

       -C N
           Watchdog timeout to set before crashdumping. If SBD is set to
           crashdump instead of reboot (either via the trace mode settings or
           the external/sbd fencing agent's crashdump parameter), SBD will
           adjust the watchdog timeout to this setting before triggering the
           dump. Otherwise, the watchdog might trigger and prevent a
           successful crashdump from ever being written.

           Defaults to 240 seconds. Set to zero to disable.

       -r N
           Actions to be executed when the watchers fail to report to the sbd
           master process in time, or when one of the watchers detects that
           the master process has died.

           Set the timeout-action to a comma-separated combination of
           noflush|flush plus reboot|crashdump|off. If only one of the two is
           given, the other stays at its default.

           This doesn't affect actions like off, crashdump, or reboot that
           are explicitly triggered via message slots. Nor does it configure
           the action a watchdog would trigger should it expire (there is no
           generic interface).

           Defaults to flush,reboot.

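       As an illustration, a daemon invocation that enables the Pacemaker
       integration, waits up to 60 seconds for devices to appear, and warns
       when tickling the watchdog takes longer than 5 seconds would look like
       this (values are illustrative; in practice these options usually go
       into /etc/sysconfig/sbd rather than onto the command line):

           sbd -d /dev/sda1 -d /dev/sdb1 -d /dev/sdc1 -P -s 60 -5 5 watch
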
   allocate
       Example usage:

           sbd -d /dev/sda1 allocate node1

       Explicitly allocates a slot for the specified node name. This should
       rarely be necessary, as every node will automatically allocate itself
       a slot the first time it starts up in watch mode.

   message
       Example usage:

           sbd -d /dev/sda1 message node1 test

       Writes the specified message to the node's slot. This is rarely done
       directly, but rather abstracted via the "external/sbd" fencing agent
       configured as a cluster resource.

       Supported message types are:

       test
           This only generates a log message on the receiving node and can be
           used to check if SBD is seeing the device. Note that this could
           overwrite a fencing request sent by the cluster, so it should not
           be used in production.

       reset
           Reset the target upon receipt of this message.

       off
           Power-off the target.

       crashdump
           Cause the target node to crashdump.

       exit
           This will make the "sbd" daemon exit cleanly on the target. You
           should not send this message manually; this is handled properly
           during shutdown of the cluster stack. Manually stopping the daemon
           means the node is unprotected!

       clear
           This message indicates that no real message has been sent to the
           node. You should not set this manually; "sbd" will clear the
           message slot automatically during start-up, and setting this
           manually could overwrite a fencing message by the cluster.

   query-watchdog
       Example usage:

           sbd query-watchdog

       Check for available watchdog devices and print some info.

       Warning: This command will arm the watchdog during the query, and if
       your watchdog refuses disarming (for example, if its kernel module has
       the 'nowayout' parameter set), this will reset your system.

   test-watchdog
       Example usage:

           sbd test-watchdog [-w /dev/watchdog3]

       Test the specified watchdog device (/dev/watchdog by default).

       Warning: This command will arm the watchdog and have your system reset
       in case your watchdog is working properly! If issued from an
       interactive session, it will prompt for confirmation.

Base system configuration
   Configure a watchdog
       It is highly recommended that you configure your Linux system to load
       a watchdog driver with hardware assistance (as is available on most
       modern systems), such as hpwdt, iTCO_wdt, or others. As a fall-back,
       you can use the softdog module.

       No other software must access the watchdog timer; it can only be
       accessed by one process at any given time. Some hardware vendors ship
       systems management software that uses the watchdog for system resets
       (e.g., the HP ASR daemon). Such software has to be disabled if the
       watchdog is to be used by SBD.

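       A minimal sketch for the softdog fall-back on a systemd-based system
       (the modules-load.d mechanism is standard systemd; prefer your
       hardware's own driver over softdog where available):

           # load the software watchdog now and on every boot
           modprobe softdog
           echo softdog > /etc/modules-load.d/watchdog.conf

           # verify that a watchdog device is now visible to sbd
           sbd query-watchdog
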
   Choosing and initializing the block device(s)
       First, you have to decide if you want to use one, two, or three
       devices.

       If you are using multiple ones, they should reside on independent
       storage setups. Putting all three of them on the same logical unit,
       for example, would not provide any additional redundancy.

       The SBD device can be connected via Fibre Channel, Fibre Channel over
       Ethernet, or even iSCSI. Thus, an iSCSI target can become a sort-of
       network-based quorum server; the advantage is that it does not require
       a smart host at your third location, just block storage.

       The SBD partitions themselves must not be mirrored (via MD, DRBD, or
       the storage layer itself), since this could result in a split-mirror
       scenario. Nor can they reside on cLVM2 volume groups, since they must
       be accessed by the cluster stack before it has started the cLVM2
       daemons; hence, these should be either raw partitions or logical units
       on (multipath) storage.

       The block device(s) must be accessible from all nodes. (While it is
       not necessary that they share the same path name on all nodes, this is
       considered a very good idea.)

       SBD will only use about one megabyte per device, so you can easily
       create a small partition, or very small logical units. (The size of
       the SBD device depends on the block size of the underlying device.
       Thus, 1MB is fine on plain SCSI devices and SAN storage with 512 byte
       blocks. On the IBM s390x architecture in particular, disks default to
       4k blocks, and thus require roughly 4MB.)

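       If you need to carve out such a small partition first, a sketch using
       parted might look like this (device name and sizes are purely
       illustrative):

           parted -s /dev/sdc mklabel gpt
           parted -s /dev/sdc mkpart sbd 1MiB 9MiB
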
       The number of devices will affect the operation of SBD as follows:

       One device
           In its most simple implementation, you use one device only. This
           is appropriate for clusters where all your data is on the same
           shared storage (with internal redundancy) anyway; the SBD device
           does not introduce an additional single point of failure then.

           If the SBD device is not accessible, the daemon will fail to start
           and inhibit openais startup.

       Two devices
           This configuration is a trade-off, primarily aimed at environments
           where host-based mirroring is used, but no third storage device is
           available.

           SBD will not commit suicide if it loses access to one mirror leg;
           this allows the cluster to continue to function even in the face
           of one outage.

           However, SBD will not fence the other side while only one mirror
           leg is available, since it does not have enough knowledge to
           detect an asymmetric split of the storage. So it will not be able
           to automatically tolerate a second failure while one of the
           storage arrays is down. (Though you can use the appropriate crm
           command to acknowledge the fence manually.)

           It will not start unless both devices are accessible on boot.

       Three devices
           In this most reliable and recommended configuration, SBD will only
           self-fence if more than one device is lost; hence, this
           configuration is resilient against temporary single device outages
           (be it due to failures or maintenance). Fencing messages can still
           be successfully relayed if at least two devices remain accessible.

           This configuration is appropriate for more complex scenarios where
           storage is not confined to a single array. For example, host-based
           mirroring solutions could have one SBD per mirror leg (not
           mirrored itself), and an additional tie-breaker on iSCSI.

           It will only start if at least two devices are accessible on boot.

       After you have chosen the devices and created the appropriate
       partitions and perhaps multipath alias names to ease management, use
       the "sbd create" command described above to initialize the SBD
       metadata on them.

   Sharing the block device(s) between multiple clusters
       It is possible to share the block devices between multiple clusters,
       provided the total number of nodes accessing them does not exceed 255,
       and all clusters use the same SBD timeouts (since these are part of
       the metadata).

       If you are using multiple devices, this can reduce the setup overhead
       required. However, you should not share devices between clusters in
       different security domains.

   Configure SBD to start on boot
       On systems using "sysvinit", the "openais" or "corosync" system start-
       up scripts must handle starting or stopping "sbd" as required before
       starting the rest of the cluster stack.

       For "systemd", sbd simply has to be enabled using

           systemctl enable sbd.service

       The daemon is brought online on each node before corosync and
       Pacemaker are started, and terminated only after all other cluster
       components have been shut down - ensuring that cluster resources are
       never activated without SBD supervision.

   Configuration via sysconfig
       The system instance of "sbd" is configured via /etc/sysconfig/sbd. In
       this file, you must specify the device(s) used, as well as any options
       to pass to the daemon:

           SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1"
           SBD_PACEMAKER="true"

       "sbd" will fail to start if no "SBD_DEVICE" is specified. See the
       installed template for more options that can be configured here. In
       general, configuration done via command-line parameters takes
       precedence over the configuration file.

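       Most installed templates also provide a pass-through variable for
       additional daemon options; a hedged sketch, assuming your template
       names it SBD_OPTS (check the template shipped with your package):

           SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1"
           SBD_PACEMAKER="true"
           SBD_OPTS="-S 1 -5 5"
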
   Testing the sbd installation
       After a restart of the cluster stack on this node, you can now try
       sending a test message to it as root, from this or any other node:

           sbd -d /dev/sda1 message node1 test

       The node will acknowledge the receipt of the message in the system
       logs:

           Aug 29 14:10:00 node1 sbd: [13412]: info: Received command test from node2

       This confirms that SBD is indeed up and running on the node, and that
       it is ready to receive messages.

       Make sure that /etc/sysconfig/sbd is identical on all cluster nodes,
       and that all cluster nodes are running the daemon.

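       One way to check both conditions from a single node (node names and
       passwordless ssh access are assumed purely for illustration):

           for n in node1 node2 node3; do
               ssh "$n" 'md5sum /etc/sysconfig/sbd; systemctl is-active sbd'
           done
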
Pacemaker CIB integration
   Fencing resource
       Pacemaker can only interact with SBD to issue a node fence if there is
       a configured fencing resource. This should be a primitive, not a
       clone, as follows:

           primitive fencing-sbd stonith:external/sbd \
                   params pcmk_delay_max=30

       This will automatically use the same devices as configured in
       /etc/sysconfig/sbd.

       While you should not configure this as a clone (as Pacemaker will
       register the fencing device on each node automatically), the
       pcmk_delay_max setting enables a random fencing delay, which ensures
       that if a split-brain scenario does occur in a two-node cluster, one
       of the nodes has a better chance to survive, avoiding double fencing.

       SBD also supports turning the reset request into a crash request,
       which may be helpful for debugging if you have kernel crashdumping
       configured; then, every fence request will cause the node to dump
       core. You can enable this via the crashdump="true" parameter on the
       fencing resource. This is not recommended for production use, but only
       for debugging phases.

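       Continuing the crm shell example above, the parameter would be added
       to the fencing primitive like this (shown for illustration; use it
       during debugging phases only):

           primitive fencing-sbd stonith:external/sbd \
                   params pcmk_delay_max=30 crashdump="true"
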
   General cluster properties
       You must also enable STONITH in general, and set the STONITH timeout
       to be at least twice the msgwait timeout you have configured, to allow
       enough time for the fencing message to be delivered. If your msgwait
       timeout is 60 seconds, this is a possible configuration:

           property stonith-enabled="true"
           property stonith-timeout="120s"

       Caution: if stonith-timeout is too low for msgwait and the system
       overhead, sbd will never be able to successfully complete a fence
       request. This will create a fencing loop.

       Note that the sbd fencing agent will try to detect this and
       automatically extend the stonith-timeout setting to a reasonable
       value, on the assumption that sbd modifying your configuration is
       preferable to not fencing.

Management tasks
   Recovering from temporary SBD device outage
       If you have multiple devices, failure of a single device is not
       immediately fatal. "sbd" will try to restart the monitor for the
       device every 5 seconds by default. However, you can tune this via the
       options to the watch command.

       In case you wish to immediately force a restart of all currently
       disabled monitor processes, you can send SIGUSR1 to the SBD
       inquisitor process.

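       For example, assuming the default pidfile location shown for the -p
       option above:

           kill -USR1 "$(cat /var/run/sbd.pid)"
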
LICENSE
       Copyright (C) 2008-2013 Lars Marowsky-Bree

       This program is free software; you can redistribute it and/or modify
       it under the terms of the GNU General Public License as published by
       the Free Software Foundation; either version 2 of the License, or (at
       your option) any later version.

       This software is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
       General Public License for more details.

       For details see the GNU General Public License at
       http://www.gnu.org/licenses/gpl-2.0.html (version 2) and/or
       http://www.gnu.org/licenses/gpl.html (the newest as per "any later").



SBD                                2019-07-26                             SBD(8)