FENCED(8)                         cluster                         FENCED(8)



NAME
       fenced - the I/O Fencing daemon

SYNOPSIS
       fenced [OPTIONS]

DESCRIPTION
       The fencing daemon, fenced, fences cluster nodes that have failed.
       Fencing a node generally means rebooting it or otherwise preventing
       it from writing to storage, e.g. disabling its port on a SAN switch.
       Fencing involves interacting with a hardware device, e.g. network
       power switch, SAN switch, storage array.  Different "fencing agents"
       are run by fenced to interact with various hardware devices.

       Software related to sharing storage among nodes in a cluster, e.g.
       GFS, usually requires fencing to be configured to prevent corruption
       of the storage in the presence of node failure and recovery.  GFS
       will not allow a node to mount a GFS file system unless the node is
       running fenced.

       Once started, fenced waits for the fence_tool(8) join command to be
       run, telling it to join the fence domain: a group of nodes that will
       fence group members that fail.  When the cluster does not have
       quorum, fencing operations are postponed until quorum is restored.
       If a failed fence domain member is reset and rejoins the cluster
       before the remaining domain members have fenced it, the fencing is
       no longer needed and will be skipped.
       fenced uses the corosync cluster membership system, its closed
       process group library (libcpg), and the cman quorum and
       configuration libraries (libcman, libccs).

       The cman init script usually starts the fenced daemon and runs
       fence_tool join and leave.
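
       When managing fenced by hand rather than through the cman init
       script, the equivalent sequence is, as a minimal sketch (assuming
       the commands are run as root):

              # start the daemon, then join the fence domain
              fenced
              fence_tool join

              # leave the fence domain before stopping the daemon
              fence_tool leave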

   Node failure
       When a fence domain member fails, fenced runs an agent to fence it.
       The specific agent to run and the agent parameters are all read from
       the cluster.conf file (using libccs) at the time of fencing.  The
       fencing operation against a failed node is not considered complete
       until the exec'ed agent exits.  The exit value of the agent
       indicates the success or failure of the operation.  If the operation
       failed, fenced will retry (possibly with a different agent,
       depending on the configuration) until fencing succeeds.  Other
       systems such as DLM and GFS wait for fencing to complete before
       starting their own recovery for a failed node.  Information about
       fencing operations will also appear in syslog.

       When a domain member fails, the actual fencing operation can be
       delayed by a configurable number of seconds (cluster.conf
       post_fail_delay or -f).  Within this time, the failed node could be
       reset and rejoin the cluster to avoid being fenced.  This delay is 0
       by default to minimize the time that other systems are blocked.
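
       For example, to give a failed member ten seconds to reset and rejoin
       before fencing proceeds, the delay could be raised from its default;
       the value 10 below is only illustrative:

              <fence_daemon post_fail_delay="10"/>

       The same delay can be set on the command line with fenced -f 10.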

   Domain startup
       When the fence domain is first created in the cluster (by the first
       node to join it) and subsequently enabled (by the cluster gaining
       quorum) any nodes listed in cluster.conf that are not presently
       members of the corosync cluster are fenced.  The status of these
       nodes is unknown, and to be safe they are assumed to need fencing.
       This startup fencing can be disabled, but it's only truly safe to do
       so if an operator is present to verify that no cluster nodes are in
       need of fencing.

       The following example illustrates why startup fencing is important.
       Take a three node cluster with nodes A, B and C; all three have a
       GFS file system mounted.  All three nodes experience a low-level
       kernel hang at about the same time.  A watchdog triggers a reboot on
       nodes A and B, but not C.  A and B reboot, form the cluster again,
       gain quorum, join the fence domain, _don't_ fence node C which is
       still hung and unresponsive, and mount the GFS fs again.  If C were
       to come back to life, it could corrupt the fs.  So, A and B need to
       fence C when they reform the fence domain since they don't know the
       state of C.  If C _had_ been reset by a watchdog like A and B, but
       was just slow in rebooting, then A and B might be fencing C
       unnecessarily when they do startup fencing.

       The first way to avoid fencing nodes unnecessarily on startup is to
       ensure that all nodes have joined the cluster before any of the
       nodes start the fence daemon.  This method is difficult to automate.

       A second way to avoid fencing nodes unnecessarily on startup is
       using the cluster.conf post_join_delay setting (or -j option).  This
       is the number of seconds fenced will delay before actually fencing
       any victims after nodes join the domain.  This delay gives nodes
       that have been tagged for fencing a chance to join the cluster and
       avoid being fenced.  A delay of -1 here will cause the daemon to
       wait indefinitely for all nodes to join the cluster and no nodes
       will actually be fenced on startup.
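
       A sketch of that indefinite wait, using the cluster.conf setting
       described below:

              <fence_daemon post_join_delay="-1"/>

       or, equivalently, fenced -j -1 on the command line.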

       To disable fencing at domain-creation time entirely, the
       cluster.conf clean_start setting (or -c option) can be used to
       declare that all nodes are in a clean or safe state to start.  This
       setting/option should not generally be used since it risks not
       fencing a node that needs it, which can lead to corruption in other
       applications (like GFS) that depend on fencing.

       Avoiding unnecessary fencing at startup is primarily a concern when
       nodes are fenced by power cycling.  If nodes are fenced by disabling
       their SAN access, then unnecessarily fencing a node is usually less
       disruptive.

   Fencing override
       If a fencing device fails, the agent may repeatedly return errors as
       fenced tries to fence a failed node.  In this case, the admin can
       manually reset the failed node, and then use fence_ack_manual(8) to
       tell fenced to continue without fencing the node.
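
       A sketch of that manual override, assuming the failed node is named
       node2 (see fence_ack_manual(8) for the exact arguments it accepts):

              # reset or power off node2 by hand, verify it is down, then:
              fence_ack_manual node2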

OPTIONS
       Command line options override a corresponding setting in
       cluster.conf.

       -D     Enable debugging to stderr and don't fork.
              See also fence_tool dump in fence_tool(8).

       -L     Enable debugging to log file.
              See also logging in cluster.conf(5).

       -g num
              groupd compatibility mode, 0 off, 1 on.  Default 0.

       -r path
              Register a directory that needs to be empty for the daemon to
              start.  Use a dash (-) to skip default directories
              /sys/fs/gfs, /sys/fs/gfs2, /sys/kernel/dlm.

       -c     All nodes are in a clean state to start.  Do no startup
              fencing.

       -q     Disable dbus signals.

       -j secs
              Post-join fencing delay.  Default 6.

       -f secs
              Post-fail fencing delay.  Default 0.

       -R secs
              Number of seconds to wait for a manual override after a
              failed fencing attempt before the next attempt.  Default 3.

       -O path
              Location of a FIFO used for communication between fenced and
              fence_ack_manual.

       -h     Print a help message describing available options, then
              exit.

       -V     Print program version information, then exit.
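
       For example, to run the daemon in the foreground with debugging
       output and a longer post-join delay (the values are illustrative):

              fenced -D -j 30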

FILES
       cluster.conf(5) is usually located at /etc/cluster/cluster.conf.  It
       is not read directly.  Other cluster components load the contents
       into memory, and the values are accessed through the libccs library.

       Configuration options for fenced are added to the <fence_daemon />
       section of cluster.conf, within the top level <cluster> section.

       post_join_delay
              is the number of seconds the daemon will wait before fencing
              any victims after a node joins the domain.  Default 6.

              <fence_daemon post_join_delay="6"/>

       post_fail_delay
              is the number of seconds the daemon will wait before fencing
              any victims after a domain member fails.  Default 0.

              <fence_daemon post_fail_delay="0"/>

       clean_start
              is used to prevent any startup fencing the daemon might do.
              It indicates that the daemon should assume all nodes are in a
              clean state to start.  Default 0.

              <fence_daemon clean_start="0"/>

       override_path
              is the location of a FIFO used for communication between
              fenced and fence_ack_manual.  Default shown.

              <fence_daemon override_path="/var/run/cluster/fenced_override"/>

       override_time
              is the number of seconds to wait for administrator
              intervention between fencing attempts following fence agent
              failures.  Default 3.

              <fence_daemon override_time="3"/>
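
       These settings are all attributes of the one <fence_daemon/>
       element and can be combined in it; a sketch with illustrative
       values:

              <fence_daemon post_join_delay="30" post_fail_delay="5"
                            override_time="10"/>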

   Per-node fencing settings
       The per-node fencing configuration is partly dependent on the
       specific agent/hardware being used.  The general framework begins
       like this:

              <clusternodes>

              <clusternode name="node1" nodeid="1">
                 <fence>
                 </fence>
              </clusternode>

              <clusternode name="node2" nodeid="2">
                 <fence>
                 </fence>
              </clusternode>

              </clusternodes>

       The simple fragment above is a valid configuration: there is no way
       to fence these nodes.  If one of these nodes is in the fence domain
       and fails, fenced will repeatedly fail in its attempts to fence it.
       The admin will need to manually reset the failed node and then use
       fence_ack_manual to tell fenced to continue without fencing it (see
       override above).

       There is typically a single method used to fence each node (the
       name given to the method is not significant).  A method refers to a
       specific device listed in the separate <fencedevices> section, and
       then lists any node-specific parameters related to using the
       device.

              <clusternodes>

              <clusternode name="node1" nodeid="1">
                 <fence>
                    <method name="1">
                       <device name="myswitch" foo="x"/>
                    </method>
                 </fence>
              </clusternode>

              <clusternode name="node2" nodeid="2">
                 <fence>
                    <method name="1">
                       <device name="myswitch" foo="y"/>
                    </method>
                 </fence>
              </clusternode>

              </clusternodes>

   Fence device settings
       This section defines properties of the devices used to fence nodes.
       There may be one or more devices listed.  The per-node fencing
       sections above reference one of these fence devices by name.

              <fencedevices>
              <fencedevice name="myswitch" agent="..." something="..."/>
              </fencedevices>
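
       The agent and its parameters depend entirely on the hardware being
       used.  Purely as an illustration, assuming an APC power switch
       driven by the fence_apc agent (the address and credentials below
       are hypothetical; see fence_apc(8) for the parameters it actually
       accepts), the device and a per-node reference to it might look
       like:

              <clusternode name="node1" nodeid="1">
                 <fence>
                    <method name="1">
                       <device name="myswitch" port="1"/>
                    </method>
                 </fence>
              </clusternode>

              <fencedevices>
              <fencedevice name="myswitch" agent="fence_apc"
                           ipaddr="10.1.1.1" login="apc" passwd="apc"/>
              </fencedevices>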

   Multiple methods for a node
       In more advanced configurations, multiple fencing methods can be
       defined for a node.  If fencing fails using the first method, fenced
       will try the next method, and continue to cycle through methods
       until one succeeds.

              <clusternode name="node1" nodeid="1">
                 <fence>
                    <method name="1">
                       <device name="myswitch" foo="x"/>
                    </method>
                    <method name="2">
                       <device name="another" bar="123"/>
                    </method>
                 </fence>
              </clusternode>

              <fencedevices>
              <fencedevice name="myswitch" agent="..." something="..."/>
              <fencedevice name="another" agent="..."/>
              </fencedevices>

   Dual path, redundant power
       Sometimes fencing a node requires disabling two power ports or two
       I/O paths.  This is done by specifying two or more devices within a
       method.  fenced will run the agent for the device twice, once for
       each device line, and both must succeed for fencing to be
       considered successful.

              <clusternode name="node1" nodeid="1">
                 <fence>
                    <method name="1">
                       <device name="sanswitch1" port="11"/>
                       <device name="sanswitch2" port="11"/>
                    </method>
                 </fence>
              </clusternode>

       When using power switches to fence nodes with dual power supplies,
       the agents must be told to turn off both power ports before
       restoring power to either port.  The default off-on behavior of the
       agent could result in the power never being fully disabled to the
       node.

              <clusternode name="node1" nodeid="1">
                 <fence>
                    <method name="1">
                       <device name="nps1" port="11" action="off"/>
                       <device name="nps2" port="11" action="off"/>
                       <device name="nps1" port="11" action="on"/>
                       <device name="nps2" port="11" action="on"/>
                    </method>
                 </fence>
              </clusternode>

   NOTES
       Due to a limitation in XML/DTD validation, the name="" value within
       the device or fencedevice section cannot start with a number.
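
       For instance, using illustrative names, a device named

              <fencedevice name="apc1" agent="..."/>

       validates, while one named "1apc" would be rejected.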

   Hardware-specific settings
       Documentation for configuring a specific device is found in the man
       page of the corresponding fence agent.

SEE ALSO
       fence_tool(8), fence_ack_manual(8), fence_node(8), cluster.conf(5)



cluster                           2009-12-21                      FENCED(8)