WATCHDOG(8)                 System Manager's Manual                WATCHDOG(8)

NAME

       watchdog - a software watchdog daemon

SYNOPSIS

       watchdog [-F|--foreground] [-f|--force] [-c filename|--config-file
       filename] [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]
       [-X num|--loop-exit num]

DESCRIPTION

       The Linux kernel can reset the system if serious problems are detected.
       This can be implemented via special watchdog hardware, or via a
       slightly less reliable software-only watchdog inside the kernel. Either
       way, there needs to be a daemon that tells the kernel the system is
       working fine. If the daemon stops doing that, the system is reset.

       watchdog is such a daemon. It opens /dev/watchdog and keeps writing to
       it often enough to keep the kernel from resetting, at least once per
       minute. Each write delays the reboot time by another minute. After a
       minute of inactivity the watchdog hardware will cause the reset. In the
       case of the software watchdog, the ability to reboot will depend on the
       state of the machine and its interrupts.

       The watchdog daemon can be stopped without causing a reboot if the
       device /dev/watchdog is closed correctly, unless your kernel is
       compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
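
       The keep-alive cycle described above can be sketched in a few lines.
       This is a minimal illustration, not the daemon's actual code: the
       function name keepalive_loop and its parameters are hypothetical, and
       it writes to an in-memory buffer so that it is safe to run. The final
       'V' byte is the "magic close" character that many Linux watchdog
       drivers accept as a request to disarm on close.

```python
import io
import time


def keepalive_loop(device, interval=1.0, loops=3, sleep=time.sleep):
    """Write one byte per loop to `device`, then request a clean stop."""
    for _ in range(loops):
        device.write(b"\0")  # any write restarts the kernel's countdown
        device.flush()
        sleep(interval)
    # Magic close: drivers that support it disarm on close if the last
    # byte written was 'V'. With CONFIG_WATCHDOG_NOWAYOUT this is ignored.
    device.write(b"V")
    device.flush()


# Demonstration against an in-memory buffer instead of the real device.
fake_device = io.BytesIO()
keepalive_loop(fake_device, interval=0, loops=3, sleep=lambda seconds: None)
written = fake_device.getvalue()
```

       On a real system the device would be opened with something like
       open("/dev/watchdog", "wb", buffering=0) as root. Note that merely
       opening the real device arms the watchdog.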

TESTS

       The watchdog daemon does several tests to check the system status:

       ·  Is the process table full?

       ·  Is there enough free memory?

       ·  Is there enough allocatable memory?

       ·  Are some files accessible?

       ·  Have some files changed within a given interval?

       ·  Is the average work load too high?

       ·  Has a file table overflow occurred?

       ·  Is a process still running? The process is specified by a pid file.

       ·  Do some IP addresses answer a ping?

       ·  Do network interfaces receive traffic?

       ·  Is the temperature too high? (Temperature data is not always
          available.)

       ·  Execute a user-defined command to do arbitrary tests.

       ·  Execute one or more test/repair commands found in /etc/watchdog.d.
          These commands are called with the argument test or repair.

       If any of these checks fail, watchdog will cause a shutdown. Should any
       of these tests, except the user-defined binary, last longer than one
       minute, the machine will be rebooted too.

OPTIONS

       Available command line options are the following:

       -v, --verbose
              Set verbose mode. Only implemented if compiled with the SYSLOG
              feature. This mode logs several pieces of information to the
              LOG_DAEMON facility with priority LOG_DEBUG. This is useful if
              you want to see exactly what happened before the watchdog
              rebooted the system. Currently it logs the temperature (if
              available), the load average, the change date of the files it
              checks and how often it went to sleep.

       -s, --sync
              Try to synchronize the filesystem every time the process is
              awake. Note that the system is rebooted if for any reason the
              synchronizing lasts longer than a minute.

       -b, --softboot
              Soft-boot the system if an error occurs during the main loop,
              e.g. if a given file is not accessible via the stat(2) call.
              Note that this does not apply to the opening of /dev/watchdog
              and /proc/loadavg, which are opened before the main loop starts.
              This is now implemented by disabling the error re-try timer.

       -F, --foreground
              Run in foreground mode, useful for running under systemd (for
              example).

       -f, --force
              Force the usage of the interval given or the maximal load
              average given in the config file. Without this option these
              values are sanity checked.

       -c config-file, --config-file config-file
              Use config-file as the configuration file instead of the default
              /etc/watchdog.conf.

       -q, --no-action
              Do not reboot or halt the machine. This is for testing purposes.
              All checks are executed and the results are logged as usual, but
              no action is taken. Your hardware card or the kernel software
              watchdog driver is also not enabled. NOTE: This still allows
              'repair' actions to run, but the daemon itself will not attempt
              a reboot.

       -X num, --loop-exit num
              Run for 'num' loops, then exit as if SIGTERM had been received.
              Intended for test/debug (e.g. using valgrind for checking memory
              access). If the daemon exits on a loop counter and you have the
              CONFIG_WATCHDOG_NOWAYOUT option compiled for the kernel or
              device driver, then an unplanned reboot will follow. Be warned!

FUNCTION

       After watchdog starts, it puts itself into the background and then
       tries all checks specified in its configuration file in turn. Between
       each two tests it will write to the kernel device to prevent a reset.
       After finishing all tests watchdog goes to sleep for some time. The
       kernel driver expects a write to the watchdog device every minute.
       Otherwise the system will be reset. watchdog will sleep for a
       configurable interval that defaults to 1 second to make sure it
       triggers the device early enough.

       Under high system load watchdog might be swapped out of memory and may
       fail to make it back in time. Under these circumstances the Linux
       kernel will reset the machine. To avoid unnecessary reboots, make sure
       you have the variable realtime set to yes in the configuration file
       watchdog.conf. This adds real-time support to watchdog: it will lock
       itself into memory and there should be no problem even under the
       highest of loads.

       On a system running out of memory the kernel will try to free enough
       memory by killing processes. The watchdog daemon itself is exempted
       from this so-called out-of-memory killer.

       You can also specify a maximal allowed load average. Once this load
       average is reached the system is rebooted. You may specify maximal load
       averages for 1 minute, 5 minutes or 15 minutes. By default this test is
       disabled. Be careful not to set this parameter too low. To set a value
       less than the predefined minimal value of 2, you have to use the -f
       option.
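
       The load average test can be illustrated with a short sketch that
       parses the first three fields of /proc/loadavg. The function name
       load_too_high and its parameters are hypothetical; treating a limit of
       0 as "test disabled" and using a strict comparison are assumptions made
       for this example, not a statement about the daemon's source.

```python
def load_too_high(loadavg_text, max_load_1=0, max_load_5=0, max_load_15=0):
    """Return True if any enabled limit is exceeded.

    `loadavg_text` is the content of /proc/loadavg, whose first three
    fields are the 1, 5 and 15 minute load averages.
    """
    fields = loadavg_text.split()
    if len(fields) < 3:
        raise ValueError("not enough data in loadavg text")
    loads = [float(value) for value in fields[:3]]
    limits = (max_load_1, max_load_5, max_load_15)
    # Assumption: a limit of 0 disables the corresponding check.
    return any(limit > 0 and load > limit
               for load, limit in zip(loads, limits))


calm = load_too_high("0.52 0.61 0.70 1/123 4567", max_load_1=24)
busy = load_too_high("30.00 12.00 5.00 3/200 999", max_load_1=24)
unchecked = load_too_high("30.00 12.00 5.00 3/200 999")
```

       On a live system the text would come from reading /proc/loadavg.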
       You can also specify a minimal amount of virtual memory you want to
       have available as free. As soon as more virtual memory is used, action
       is taken by watchdog. Note, however, that watchdog does not distinguish
       between different types of memory usage. It just checks for free
       virtual memory.

       If you have a machine with temperature sensor(s) you can specify the
       maximal allowed temperature. Once this temperature is reached on any
       sensor the system is powered off. The default value is 90 C. Typically
       the temperature information is provided by the sensors package as files
       in the virtual filesystem /sys/devices and can be found using, for
       example, the command

           find /sys -name 'temp*input' -print

       These files hold the temperature in milli-Celsius. You can have
       multiple sensors used in the config file. For example, to change to a
       75 C maximum and to check two virtual files for the system temperature
       you might have this:

           max-temperature = 75
           temperature-sensor = /sys/class/hwmon/hwmon0/device/temp1_input
           temperature-sensor = /sys/class/hwmon/hwmon0/device/temp2_input

       The watchdog will issue warnings once the temperature reaches 90%, 95%
       and 98% of the configured maximum temperature.
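
       As a worked example of the thresholds above, the following sketch
       converts a milli-Celsius reading and classifies it against the
       configured maximum. The function name temperature_status and its
       return strings are hypothetical, and the use of "greater than or equal"
       at each threshold is an assumption.

```python
def temperature_status(milli_celsius, max_celsius=90):
    """Classify one sensor reading given in milli-Celsius."""
    celsius = milli_celsius / 1000.0
    if celsius >= max_celsius:
        return "power-off"
    # Check the highest warning level first so only one label is returned.
    for percent in (98, 95, 90):
        if celsius >= max_celsius * percent / 100.0:
            return f"warning-{percent}"
    return "ok"


# With max-temperature = 75 the warning levels are 67.5, 71.25 and 73.5 C.
readings = [45000, 68000, 72000, 74000, 75000]
statuses = [temperature_status(value, max_celsius=75) for value in readings]
```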
       When using file mode watchdog will try to stat(2) the given files.
       Errors returned by stat will not cause a reboot. For a reboot the stat
       call has to last at least the re-try time-out value (default 1 minute).
       This may happen if the file is located on an NFS-mounted filesystem. If
       your system relies on an NFS-mounted filesystem you might try this
       option. However, in such a case the sync option may not work if the
       NFS server is not answering.
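
       A rough sketch of a file check follows: it calls stat on a path and,
       optionally, reports whether the file was modified within a given
       interval. The function name file_check, its return strings and the
       change_interval parameter are hypothetical. The sketch does not model
       the case of a stat call that hangs, which is what the re-try time-out
       above is about.

```python
import os
import tempfile
import time


def file_check(path, change_interval=None, now=None):
    """Return 'ok', 'stale' or 'stat-error' for one monitored file."""
    try:
        info = os.stat(path)
    except OSError:
        return "stat-error"
    if change_interval is not None:
        now = time.time() if now is None else now
        # Stale means the last modification is older than the interval.
        if now - info.st_mtime > change_interval:
            return "stale"
    return "ok"


with tempfile.NamedTemporaryFile() as handle:
    fresh = file_check(handle.name, change_interval=60)
    # Pretend two minutes have passed since the file was last modified.
    stale = file_check(handle.name, change_interval=60,
                       now=os.stat(handle.name).st_mtime + 120)
missing = file_check(handle.name)  # the temporary file is gone by now
```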
       watchdog can read the pid from a pid file and see whether the process
       still exists. If not, action is taken by watchdog. So you can, for
       instance, restart the server from your repair binary. See the Systemd
       section below for additional information.
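
       The pid file test can be approximated as follows. This is a sketch
       under the assumption that sending signal 0 with kill(2) is an
       acceptable way to probe for a process; the function name
       pid_is_running is hypothetical.

```python
import os
import tempfile


def pid_is_running(pidfile):
    """Return True if the pid stored in `pidfile` names a live process."""
    try:
        with open(pidfile) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False  # unreadable or malformed pid file
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


# Demonstration: a pid file that names this very process.
with tempfile.NamedTemporaryFile("w", suffix=".pid") as handle:
    handle.write(f"{os.getpid()}\n")
    handle.flush()
    own_process_alive = pid_is_running(handle.name)
missing_file_alive = pid_is_running(handle.name)  # file was deleted
```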
       watchdog will periodically try to fork itself to see whether the
       process table is full. This leaves a zombie process until watchdog
       wakes up again and reaps it; this is harmless, don't worry about it.

       In ping mode watchdog tries to ping the given IPv4 addresses. These
       addresses do not have to be a single machine. It is possible to ping a
       broadcast address instead to see if at least one machine in a subnet
       is still alive.

       Do not use this broadcast ping unless your MIS person a) knows about it
       and b) has given you explicit permission to use it!

       watchdog will send out three ping packets and wait up to <interval>
       seconds for the reply, with <interval> being the time it goes to sleep
       between two triggers of the watchdog device. Thus an unreachable
       network will not cause a hard reset but a soft reboot.

       You can also test passively for an unreachable network by just
       monitoring a given interface for traffic. If no traffic arrives the
       network is considered unreachable, causing a soft reboot or action from
       the repair binary.

       To start the watchdog when the network is available see the Systemd
       section below.

       watchdog can run an external command for user-defined tests. A return
       code not equal to 0 means an error occurred and watchdog should react.
       If the external command is killed by an uncaught signal this is
       considered an error by watchdog too. The command may take longer than
       the time slice defined for the kernel device without a problem.
       However, error messages are logged via syslog. If you have enabled
       softboot on error the machine will be rebooted if the binary doesn't
       exit in half the time watchdog sleeps between two tries triggering the
       kernel device.

       If you specify a repair binary it will be started instead of shutting
       down the system. If this binary is not able to fix the problem watchdog
       will still cause a reboot afterwards.

       If the machine is halted an email is sent to notify a human that the
       machine is going down. Starting with version 4.4 watchdog will also
       notify the human in charge if the machine is rebooted.

       The re-try timer applies to most errors, except reset/reboot calls and
       over-temperature. It allows a given error source to recover, and most
       tests are treated in this way. Exceptions are the file handle test,
       load averages, and system memory. If set to the minimum time of 1
       second it will still allow a single re-try at any polling interval of
       the system.

SOFT REBOOT

       A soft reboot (i.e. controlled shutdown and reboot) is initiated for
       every error that is found. Since there might be no more processes
       available, watchdog does it all by itself. That means:

       1.  Kill all processes with SIGTERM.

       2.  After a short pause kill all remaining processes with SIGKILL.

       3.  Record a shutdown entry in wtmp.

       4.  Save the random seed from /dev/urandom. If the device is
           nonexistent or there is no filename for saving, this step is
           skipped.

       5.  Turn off accounting.

       6.  Turn off quota and swap.

       7.  Unmount all partitions except the root partition.

       8.  Remount the root partition read-only.

       9.  Shut down all network interfaces.

       10. Finally reboot.

CHECK BINARY

       If the return code of the check binary is not zero watchdog will assume
       an error and reboot the system. Be careful with this if you are using
       the real-time properties of watchdog since watchdog will wait for the
       return of this binary before proceeding. An exit code smaller than 245
       is interpreted as a system error code (see errno.h for details). Values
       of 245 or larger are special to watchdog:

       255 (based on -1 as unsigned 8-bit number)
              Reboot the system. This is not exactly an error message but a
              command to watchdog. If this code is returned, watchdog will
              not try to run a shutdown script instead.

       254    Reset the system. This is not exactly an error message but a
              command to watchdog. If this code is returned, watchdog will
              attempt to hard-reset the machine without attempting any sort
              of orderly stopping of processes, unmounting of file systems,
              etc.

       253    Maximum load average exceeded.

       252    The temperature inside is too high.

       251    /proc/loadavg contains no (or not enough) data.

       250    The given file was not changed in the given interval.

       249    /proc/meminfo contains invalid data.

       248    Child process was killed by a signal.

       247    Child process did not return in time.

       246    Free for personal watchdog-specific use (was -10 as an unsigned
              8-bit number).

       245    Reserved for an unknown result, for example a slow background
              test that is still running, so neither a success nor an error.

       With an enforcing SELinux policy, please use the
       /usr/libexec/watchdog/scripts/ directory for your test-binary
       configuration.
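
       The return code convention above can be summarised in a small lookup
       helper, for example when reading log entries. The names SPECIAL_CODES
       and describe_exit_code are hypothetical; the table is copied from the
       list above, and codes below 245 are passed to os.strerror on the
       assumption that they are errno values.

```python
import os

# Return codes that are special to watchdog, as listed in this section.
SPECIAL_CODES = {
    255: "reboot the system",
    254: "reset the system",
    253: "maximum load average exceeded",
    252: "temperature too high",
    251: "/proc/loadavg contains no (or not enough) data",
    250: "file was not changed in the given interval",
    249: "/proc/meminfo contains invalid data",
    248: "child process was killed by a signal",
    247: "child process did not return in time",
    246: "free for personal watchdog-specific use",
    245: "unknown result (neither success nor error)",
}


def describe_exit_code(code):
    """Translate a check binary's exit code into a readable description."""
    if code == 0:
        return "success"
    if code in SPECIAL_CODES:
        return SPECIAL_CODES[code]
    # Anything else below 245 is treated as a system error number.
    return "system error: " + os.strerror(code)


examples = {code: describe_exit_code(code) for code in (0, 2, 247, 253)}
```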

REPAIR BINARY

       The repair binary is started with one parameter: the error number that
       caused watchdog to initiate the reboot process. After trying to repair
       the system the binary should exit with 0 if the system was successfully
       repaired and thus there is no need to reboot anymore. A return value
       not equal to 0 tells watchdog to reboot. The return code of the repair
       binary should be the error number of the error causing watchdog to
       reboot. Be careful with this if you are using the real-time properties
       since watchdog will wait for the return of this binary before
       proceeding.

       The configuration file parameter repair-maximum controls the number of
       successive repair attempts that report 0 (i.e. success) but fail to
       clear the tested fault. If this is exceeded then a reboot takes place.
       If set to zero then a reboot can always be blocked by the repair
       program reporting success.
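
       One way to read the repair-maximum rule is as a counter of successive
       "successful" repairs that did not clear the fault. The sketch below
       encodes that reading; the function name reboot_after_repair and its
       parameters are hypothetical, and the exact bookkeeping inside the
       daemon may differ.

```python
def reboot_after_repair(repair_exit_code, fault_cleared, streak, repair_maximum):
    """Return (reboot, new_streak) after one repair attempt.

    `streak` counts successive repairs that reported 0 but left the
    tested fault in place.
    """
    if repair_exit_code != 0:
        return True, streak          # repair failed: reboot
    if fault_cleared:
        return False, 0              # genuinely fixed: reset the counter
    streak += 1
    # Assumption: repair_maximum == 0 means the limit is never enforced.
    if repair_maximum > 0 and streak > repair_maximum:
        return True, streak
    return False, streak


# Three successive "successful" repairs that never clear the fault,
# with repair-maximum set to 2.
history = []
streak = 0
for _ in range(3):
    reboot, streak = reboot_after_repair(0, False, streak, repair_maximum=2)
    history.append(reboot)
```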
       With an enforcing SELinux policy, please use the
       /usr/libexec/watchdog/scripts/ directory for your repair-binary
       configuration.

TEST DIRECTORY

       Executables placed in the test directory are discovered by watchdog on
       startup and are automatically executed. They are bounded time-wise by
       the test-timeout directive in watchdog.conf.

       These executables are called with either "test" as the first argument
       (if a test is being performed) or "repair" as the first argument (if a
       repair for a previously failed "test" operation is being performed).

       As with test binaries and repair binaries, the expected exit code for
       a successful test or repair operation is always zero.

       If an executable's test operation fails, the same executable is
       automatically called with the "repair" argument as well as the return
       code of the previously failed test operation.

       For example, if the following execution returns 42:

           /etc/watchdog.d/my-test test

       The watchdog daemon will attempt to repair the problem by calling:

           /etc/watchdog.d/my-test repair 42

       This enables administrators and application developers to make
       intelligent test/repair commands. If the "repair" operation is not
       required (or is not likely to succeed), it is important that the author
       of the command return a non-zero value so the machine will still reboot
       as expected.

       Note that the watchdog daemon may interpret and act upon any of the
       reserved return codes noted in the Check Binary section prior to
       calling a given command in "repair" mode.

       As for the repair binary, the configuration parameter repair-maximum
       also controls the number of successive repair attempts that report
       success (return 0) but fail to clear the fault.
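
       A test directory executable following the calling convention above
       might be structured like this. It is only a sketch: the marker file
       path, the function name run and the choice of 42 as the failure code
       are all hypothetical, chosen to mirror the my-test example.

```python
import os

MARKER = "/tmp/my-service.ok"  # hypothetical health marker for this example


def run(argv, marker=MARKER):
    """Handle 'test' and 'repair <code>' as the daemon would call them."""
    mode = argv[1] if len(argv) > 1 else ""
    if mode == "test":
        # Zero means healthy; 42 is this script's own failure code.
        return 0 if os.path.exists(marker) else 42
    if mode == "repair":
        # argv[2] carries the exit code of the failed test, e.g. "42".
        failed_code = int(argv[2]) if len(argv) > 2 else 0
        if failed_code != 42:
            # Unknown fault: return non-zero so the machine still reboots.
            return failed_code or 1
        open(marker, "w").close()  # the "repair": recreate the marker
        return 0
    return 1


# A real script would end with: sys.exit(run(sys.argv))
```

       Returning the original code for faults the script cannot repair keeps
       the reboot path intact, as recommended above.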

SYSTEMD

       To start watchdog after the network is available:

           systemctl disable watchdog
           systemctl enable NetworkManager-wait-online
           systemctl enable watchdog-ping

       When using a custom service pid check with a custom service systemd
       unit file, please be aware that "Requires=" causes the dependent
       service to be deactivated. Using "Before=watchdog.service" or
       "Before=watchdog-ping.service" in the custom service unit file may be
       the desired operation instead. See the systemd.unit documentation for
       more details.

SELINUX

       The directories /etc/watchdog.d/ and /usr/libexec/watchdog/scripts/ are
       recognized locations for custom executables.

BUGS

       None known so far.

AUTHORS

       The original code is an example written by Alan Cox
       <alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All
       additions were written by Michael Meskes <meskes@debian.org>. Johnie
       Ingram <johnie@netgod.net> had the idea of testing the load average. He
       also took over the Debian-specific work. Dave Cinege
       <dcinege@psychosis.com> brought up some hardware watchdog issues and
       helped with testing.

FILES

       /dev/watchdog
              The watchdog device.

       /var/run/watchdog.pid
              The pid file of the running watchdog.