watchdog(8)

WATCHDOG(8)                 System Manager's Manual                WATCHDOG(8)

NAME

       watchdog - a software watchdog daemon

SYNOPSIS

       watchdog [-f|--force] [-c filename|--config-file filename]
       [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]

DESCRIPTION

       The Linux kernel can reset the system if serious problems are detected.
       This can be implemented via special watchdog hardware, or via a
       slightly less reliable software-only watchdog inside the kernel. Either
       way, there needs to be a daemon that tells the kernel the system is
       working fine. If the daemon stops doing that, the system is reset.

       watchdog is such a daemon. It opens /dev/watchdog and keeps writing to
       it often enough to keep the kernel from resetting, at least once per
       minute. Each write delays the reboot by another minute. After a minute
       of inactivity the watchdog hardware will cause the reset. In the case
       of the software watchdog, the ability to reboot depends on the state
       of the machine and its interrupts.

       The watchdog daemon can be stopped without causing a reboot if the
       device /dev/watchdog is closed correctly, unless your kernel is
       compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.

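       The write-and-close behaviour described above can be sketched as
       follows. This is a minimal illustration, not the daemon's actual code.
       It assumes the "magic close" convention of the Linux kernel watchdog
       driver API, where writing the character V immediately before closing
       the device disarms the timer; this page only says the device must be
       "closed correctly". The function names and the device_path parameter
       are hypothetical.

```python
import os


def keepalive(fd):
    """Write one byte; any write to the device restarts the countdown."""
    os.write(fd, b"\0")


def close_cleanly(fd):
    """Assumption: writing 'V' before close() asks the driver to disarm.

    With CONFIG_WATCHDOG_NOWAYOUT the kernel ignores this and still resets.
    """
    os.write(fd, b"V")
    os.close(fd)


def demo(device_path, writes=3):
    """Open the device, send a few keepalives, then close it cleanly."""
    fd = os.open(device_path, os.O_WRONLY)
    for _ in range(writes):
        keepalive(fd)
    close_cleanly(fd)
```

       Pointing device_path at an ordinary file is a safe way to see which
       bytes would be written; opening the real /dev/watchdog arms the timer.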

TESTS

       The watchdog daemon does several tests to check the system status:

       ·  Is the process table full?

       ·  Is there enough free memory?

       ·  Are some files accessible?

       ·  Have some files changed within a given interval?

       ·  Is the average work load too high?

       ·  Has a file table overflow occurred?

       ·  Is a process still running? The process is specified by a pid file.

       ·  Do some IP addresses answer to ping?

       ·  Do network interfaces receive traffic?

       ·  Is the temperature too high? (Temperature data is not always
          available.)

       ·  Execute a user-defined command to do arbitrary tests.

       ·  Execute one or more test/repair commands found in /etc/watchdog.d.
          These commands are called with the argument test or repair.

       If any of these checks fails, watchdog will cause a shutdown. Should
       any of these tests, except the user-defined binary, last longer than
       one minute, the machine will be rebooted, too.

OPTIONS

       Available command line options are the following:

       -v, --verbose
              Set verbose mode. Only implemented if compiled with the SYSLOG
              feature. In this mode watchdog logs several pieces of
              information to LOG_DAEMON with priority LOG_INFO. This is useful
              if you want to see exactly what happened before watchdog
              rebooted the system. Currently it logs the temperature (if
              available), the load average, the change date of the files it
              checks, and how often it went to sleep.

       -s, --sync
              Try to synchronize the filesystem every time the process is
              awake. Note that the system is rebooted if for any reason the
              synchronizing lasts longer than a minute.

       -b, --softboot
              Soft-boot the system if an error occurs during the main loop,
              e.g. if a given file is not accessible via the stat(2) call.
              Note that this does not apply to the opening of /dev/watchdog
              and /proc/loadavg, which are opened before the main loop starts.

       -f, --force
              Force the usage of the interval given or the maximal load
              average given in the config file.

       -c config-file, --config-file config-file
              Use config-file as the configuration file instead of the default
              /etc/watchdog.conf.

       -q, --no-action
              Do not reboot or halt the machine. This is for testing purposes.
              All checks are executed and the results are logged as usual, but
              no action is taken. In addition, your hardware card or the
              kernel software watchdog driver is not enabled. Temperature
              checking is also disabled since this triggers the hardware
              watchdog on some cards.

FUNCTION

       After watchdog starts, it puts itself into the background and then
       tries all checks specified in its configuration file in turn. Between
       every two tests it writes to the kernel device to prevent a reset.
       After finishing all tests watchdog goes to sleep for some time. The
       kernel driver expects a write to the watchdog device every minute;
       otherwise the system will be reset. By default watchdog sleeps for
       only 10 seconds, so it triggers the device early enough.

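       The cycle described above can be sketched roughly as follows. This is
       an illustration under stated assumptions, not the daemon's source: the
       checks are plain callables returning True on success, the keepalive
       and sleep functions are injected so the loop can be exercised without
       a real device, and the keepalive is written after every check as an
       approximation of "between every two tests". All names are
       hypothetical.

```python
import time


def run_cycle(checks, keepalive):
    """Run each (name, check) pair in turn, writing to the device after each.

    Returns the names of the checks that reported failure.
    """
    failed = []
    for name, check in checks:
        if not check():
            failed.append(name)
        keepalive()
    return failed


def main_loop(checks, keepalive, interval=10, cycles=None, sleep=time.sleep):
    """Repeat the check cycle, sleeping `interval` seconds in between.

    Stops and returns the failed check names as soon as a cycle fails.
    `cycles=None` means run until a failure occurs.
    """
    done = 0
    while cycles is None or done < cycles:
        failed = run_cycle(checks, keepalive)
        if failed:
            return failed
        sleep(interval)
        done += 1
    return []
```

       The 10 second default for interval mirrors the default sleep time
       mentioned above, which leaves several chances to write before the one
       minute deadline.
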
       Under high system load watchdog might be swapped out of memory and may
       not be swapped back in in time. Under these circumstances the Linux
       kernel will reset the machine. To avoid unnecessary reboots, make sure
       the variable realtime is set to yes in the configuration file
       watchdog.conf. This adds real-time support to watchdog: it will lock
       itself into memory, and there should be no problem even under the
       highest of loads.

       You can also specify a maximal allowed load average. Once this load
       average is reached the system is rebooted. You may specify maximal load
       averages for 1 minute, 5 minutes or 15 minutes. By default this test is
       disabled. Be careful not to set this parameter too low. To set a value
       less than the predefined minimal value of 2, you have to use the -f
       option.

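       A minimal sketch of the load average rule, assuming that "reached"
       means greater than or equal to the limit and that a limit of None
       stands for a disabled test. The function names are hypothetical and
       the comparison used by the real daemon may differ.

```python
MIN_MAX_LOAD = 2  # predefined minimal value named in the text


def limit_allowed(limit, force=False):
    """A limit below the minimum is only accepted when -f (force) is given."""
    return force or limit >= MIN_MAX_LOAD


def load_exceeded(loadavg, limits):
    """Compare (1 min, 5 min, 15 min) load averages against their limits.

    A limit of None disables that comparison, matching the default of
    having the test switched off.
    """
    return any(
        limit is not None and current >= limit
        for current, limit in zip(loadavg, limits)
    )
```

       On a live system the first argument could come from os.getloadavg(),
       which returns the same three averages found in /proc/loadavg.
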
       You can also specify a minimal amount of virtual memory you want to
       have available as free. As soon as more virtual memory is used, action
       is taken by watchdog. Note, however, that watchdog does not distinguish
       between different types of memory usage. It just checks for free
       virtual memory.

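       The memory test might look like the sketch below. It assumes that
       "free virtual memory" is the sum of the MemFree and SwapFree fields of
       /proc/meminfo and that the threshold is given in kB; this page does
       not define either, so both are assumptions, and the function names
       are hypothetical.

```python
def free_virtual_kb(meminfo_text):
    """Return MemFree + SwapFree in kB from the text of /proc/meminfo.

    Raises KeyError when either field is missing, which corresponds to the
    "invalid data" situation.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields["MemFree"] + fields["SwapFree"]


def memory_low(meminfo_text, min_free_kb):
    """True when less than `min_free_kb` of virtual memory is free."""
    return free_virtual_kb(meminfo_text) < min_free_kb
```
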
       If you have a watchdog card with a temperature sensor you can specify
       the maximal allowed temperature. Once this temperature is reached the
       system is halted. The default value is 120. There is no unit
       conversion, so make sure you use the same unit as your hardware.
       watchdog will issue warnings once the temperature reaches 90%, 95% and
       98% of this temperature.

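       The thresholds above can be expressed as a small function. This is
       only a sketch: the return strings and the function name are made up
       for illustration, and integer arithmetic is used so that the 90%, 95%
       and 98% boundaries are compared exactly.

```python
WARN_PERCENTS = (90, 95, 98)  # warning levels named in the text


def temperature_status(current, maximum=120):
    """Classify a reading against the maximal allowed temperature.

    Returns "halt" once the maximum is reached, "warn-NN" for the highest
    warning level reached, and "ok" otherwise. No unit conversion is done.
    """
    if current >= maximum:
        return "halt"
    reached = [p for p in WARN_PERCENTS if current * 100 >= maximum * p]
    if reached:
        return "warn-%d" % max(reached)
    return "ok"
```

       With the default maximum of 120, the warning levels fall at 108, 114
       and 117.6 in whatever unit the hardware reports.
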
       When using file mode watchdog will try to stat(2) the given files.
       Errors returned by stat will not cause a reboot. For a reboot the stat
       call has to last at least one minute. This may happen if the file is
       located on an NFS mounted filesystem. If your system relies on an NFS
       mounted filesystem you might try this option. However, in such a case
       the sync option may not work if the NFS server is not answering.

       watchdog can read the pid from a pid file and see whether the process
       still exists. If not, action is taken by watchdog. So you can, for
       instance, restart the server from your repair-binary.

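       One common way to implement such a pid file test is to send signal 0
       to the process, which checks for existence without delivering
       anything. The sketch below takes that approach; it is an assumption
       about the technique, not a statement about how watchdog does it, and
       the function name is hypothetical.

```python
import os


def process_alive(pidfile):
    """Return True if the pid stored in `pidfile` refers to a live process."""
    try:
        with open(pidfile) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed pid file: treat as not running.
        return False
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```
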
       watchdog will periodically try to fork itself to see whether the
       process table is full. This leaves a zombie process until watchdog
       wakes up again and catches it; this is harmless, don't worry about it.

       In ping mode watchdog tries to ping the given IP addresses. These
       addresses do not have to be a single machine. It is possible to ping a
       broadcast address instead to see if at least one machine in a subnet
       is still alive.

       Do not use this broadcast ping unless your MIS person a) knows about it
       and b) has given you explicit permission to use it!

       watchdog will send out three ping packets and wait up to <interval>
       seconds for the reply, with <interval> being the time it goes to sleep
       between two times triggering the watchdog device. Thus an unreachable
       network will not cause a hard reset but a soft reboot.

       You can also test passively for an unreachable network by just
       monitoring a given interface for traffic. If no traffic arrives the
       network is considered unreachable, causing a soft reboot or action from
       the repair binary.

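       A passive traffic test could be approximated by comparing the receive
       byte counter of an interface between two readings of /proc/net/dev.
       This page does not say which data source watchdog uses, so treat the
       sketch below as one plausible approach; the function names are
       hypothetical.

```python
def rx_bytes(proc_net_dev_text, interface):
    """Return the received-bytes counter for `interface`.

    In /proc/net/dev each interface line has the form "name: counters...",
    and the first counter after the colon is the number of bytes received.
    Raises KeyError if the interface is not listed.
    """
    for line in proc_net_dev_text.splitlines():
        name, sep, counters = line.partition(":")
        if sep and name.strip() == interface:
            return int(counters.split()[0])
    raise KeyError(interface)


def traffic_arrived(before_text, after_text, interface):
    """True if the receive counter changed between the two snapshots."""
    return rx_bytes(after_text, interface) != rx_bytes(before_text, interface)
```
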
       watchdog can run an external command for user-defined tests. A return
       code other than 0 means an error occurred and watchdog should react. If
       the external command is killed by an uncaught signal this is considered
       an error by watchdog too. The command may take longer than the time
       slice defined for the kernel device without a problem. However, error
       messages are written to the syslog facility. If you have enabled
       softboot on error the machine will be rebooted if the binary doesn't
       exit in half the time watchdog sleeps between two tries triggering the
       kernel device.

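       The three failure cases named above (non-zero return code, death by
       signal, and running too long) can be told apart as in the sketch
       below. It uses Python's subprocess module, where a negative return
       code means the child was killed by that signal. The function name,
       the returned strings, and the single timeout parameter are
       illustrative assumptions, not watchdog's interface.

```python
import subprocess


def run_check(argv, timeout):
    """Run an external test command.

    Returns None on success, or a short description of the failure.
    """
    try:
        proc = subprocess.run(argv, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising.
        return "timeout"
    if proc.returncode < 0:
        return "killed by signal %d" % -proc.returncode
    if proc.returncode != 0:
        return "failed with code %d" % proc.returncode
    return None
```
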
       If you specify a repair binary it will be started instead of shutting
       down the system. If this binary is not able to fix the problem watchdog
       will still cause a reboot afterwards.

       If the machine is halted an email is sent to notify a human that the
       machine is going down. Starting with version 4.4 watchdog will also
       notify the human in charge if the machine is rebooted.

SOFT REBOOT

       A soft reboot (i.e. controlled shutdown and reboot) is initiated for
       every error that is found. Since there might be no more processes
       available, watchdog does it all by itself. That means:

       1.  Kill all processes with SIGTERM.

       2.  After a short pause kill all remaining processes with SIGKILL.

       3.  Record a shutdown entry in wtmp.

       4.  Save the random seed from /dev/urandom. If the device is
           nonexistent or there is no filename for saving, this step is
           skipped.

       5.  Turn off accounting.

       6.  Turn off quota and swap.

       7.  Unmount all partitions except the root partition.

       8.  Remount the root partition read-only.

       9.  Shut down all network interfaces.

       10. Finally reboot.

CHECK BINARY

       If the return code of the check binary is not zero watchdog will assume
       an error and reboot the system. Be careful with this if you are using
       the real-time properties of watchdog since watchdog will wait for the
       return of this binary before proceeding. A positive exit code is
       interpreted as a system error code (see errno.h for details). Negative
       values are special to watchdog:

       -1     Reboot the system. This is not exactly an error message but a
              command to watchdog. If the return code is -1 watchdog will not
              try to run a shutdown script instead.

       -2     Reset the system. This is not exactly an error message but a
              command to watchdog. If the return code is -2 watchdog will
              simply refuse to write to the kernel device again.

       -3     Maximum load average exceeded.

       -4     The temperature inside is too high.

       -5     /proc/loadavg contains no (or not enough) data.

       -6     The given file was not changed in the given interval.

       -7     /proc/meminfo contains invalid data.

       -8     Child process was killed by a signal.

       -9     Child process did not return in time.

       -10    Free for personal use.

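       When writing a check binary, keep in mind that a process exit status
       only carries 8 bits, so a program that exits with -1 is seen by its
       parent as 255. The sketch below decodes a status under the assumption
       that values above 127 stand for negative codes; how watchdog itself
       maps the status is not stated on this page. The table simply restates
       the list above, and the function names are hypothetical.

```python
RESERVED = {
    -1: "reboot the system",
    -2: "reset the system",
    -3: "maximum load average exceeded",
    -4: "temperature too high",
    -5: "/proc/loadavg contains no (or not enough) data",
    -6: "file was not changed in the given interval",
    -7: "/proc/meminfo contains invalid data",
    -8: "child process was killed by a signal",
    -9: "child process did not return in time",
    -10: "free for personal use",
}


def as_signed(status):
    """Map an 8-bit exit status (0..255) to a signed value (assumption)."""
    return status - 256 if status > 127 else status


def describe(status):
    """Return a human-readable meaning for a check binary's exit status."""
    code = as_signed(status)
    if code == 0:
        return "ok"
    if code > 0:
        return "system error code %d" % code
    return RESERVED.get(code, "unknown negative code %d" % code)
```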

REPAIR BINARY

       The repair binary is started with one parameter: the error number that
       caused watchdog to initiate the reboot. After trying to repair the
       system the binary should exit with 0 if the system was successfully
       repaired and thus there is no need to reboot anymore. A return value
       other than 0 tells watchdog to reboot. The return code of the repair
       binary should be the error number of the error causing watchdog to
       reboot. Be careful with this if you are using the real-time properties
       since watchdog will wait for the return of this binary before
       proceeding.

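       A repair binary following this contract could be structured as in the
       sketch below: read the error number from the first argument, try a
       matching fix, and return 0 on success or the original error number
       otherwise. The fixers table, the function names, and the fallback
       code 1 for a missing argument are assumptions made for illustration.

```python
def repair(error_number, fixers):
    """Try the fixer registered for `error_number`.

    Returns 0 if a fixer exists and reports success, else the error number,
    which tells watchdog to go ahead with the reboot.
    """
    fixer = fixers.get(error_number)
    if fixer is not None and fixer():
        return 0
    return error_number


def main(argv, fixers):
    """Entry point: argv[1] is the error number passed by watchdog."""
    try:
        error_number = int(argv[1])
    except (IndexError, ValueError):
        return 1  # hypothetical choice: no usable argument, request reboot
    return repair(error_number, fixers)
```

       A real script would end with sys.exit(main(sys.argv, FIXERS)), where
       FIXERS maps error numbers to functions that return True on success.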

TEST DIRECTORY

       Executables placed in the test directory are discovered by watchdog on
       startup and are automatically executed. They are bounded time-wise by
       the test-timeout directive in watchdog.conf.

       These executables are called with either "test" as the first argument
       (if a test is being performed) or "repair" as the first argument (if a
       repair for a previously-failed "test" operation is being performed).

       As with test binaries and repair binaries, the expected exit code for
       a successful test or repair operation is always zero.

       If an executable's test operation fails, the same executable is
       automatically called with the "repair" argument as well as the return
       code of the previously-failed test operation.

       For example, if the following execution returns 42:

           /etc/watchdog.d/my-test test

       The watchdog daemon will attempt to repair the problem by calling:

           /etc/watchdog.d/my-test repair 42

       This enables administrators and application developers to make
       intelligent test/repair commands. If the "repair" operation is not
       required (or is not likely to succeed), it is important that the author
       of the command return a non-zero value so the machine will still reboot
       as expected.

       Note that the watchdog daemon may interpret and act upon any of the
       reserved return codes noted in the Check Binary section prior to
       calling a given command in "repair" mode.

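       The test-then-repair sequence from the example above can be sketched
       from the daemon's point of view as follows. The run parameter is a
       hypothetical hook that takes an argument list and returns the exit
       code; it defaults to subprocess.call. The sketch leaves out the
       test-timeout handling and the special treatment of reserved return
       codes.

```python
import subprocess


def check_and_repair(path, run=subprocess.call):
    """Call `path test`; on failure call `path repair <code>`.

    Returns 0 if the test passed or the repair succeeded, otherwise the
    repair command's non-zero exit code.
    """
    code = run([path, "test"])
    if code == 0:
        return 0
    return run([path, "repair", str(code)])
```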

BUGS

       None known so far.

AUTHORS

       The original code is an example written by Alan Cox
       <alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All
       additions were written by Michael Meskes <meskes@debian.org>. Johnie
       Ingram <johnie@netgod.net> had the idea of testing the load average. He
       also took over the Debian-specific work. Dave Cinege
       <dcinege@psychosis.com> brought up some hardware watchdog issues and
       helped with testing.

FILES

       /dev/watchdog
              The watchdog device.

       /var/run/watchdog.pid
              The pid file of the running watchdog.