WATCHDOG(8)                 System Manager's Manual                WATCHDOG(8)

NAME

       watchdog - a software watchdog daemon

SYNOPSIS

       watchdog [-f|--force] [-c filename|--config-file filename]
       [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]

DESCRIPTION

       The Linux kernel can reset the system if serious problems are detected.
       This can be implemented via special watchdog hardware, or via a slightly
       less reliable software-only watchdog inside the kernel. Either way,
       there needs to be a daemon that tells the kernel the system is working
       fine. If the daemon stops doing that, the system is reset.

       watchdog is such a daemon. It opens /dev/watchdog and keeps writing to
       it often enough to keep the kernel from resetting, at least once per
       minute. Each write delays the reboot by another minute. After a minute
       of inactivity the watchdog hardware causes the reset. In the case of
       the software watchdog, the ability to reboot depends on the state of
       the machine and its interrupts.

       The watchdog daemon can be stopped without causing a reboot if the
       device /dev/watchdog is closed correctly, unless your kernel is
       compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
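
       The keep-alive cycle described above can be sketched in a few lines of
       Python. This is an illustration, not the daemon's code: the function
       name keep_alive and its beats and interval parameters are made up for
       the example, and the final 'V' write relies on the "magic close"
       feature that many, but not all, Linux watchdog drivers support. The
       text above only says the device must be "closed correctly".

```python
import io
import time


def keep_alive(dev, beats, interval=10.0):
    """Write `beats` keep-alive bytes to `dev`, then announce a clean stop.

    `dev` is any writable binary file object. On a real system it would be
    something like open("/dev/watchdog", "wb", buffering=0).
    """
    for i in range(beats):
        dev.write(b"\0")      # any write resets the kernel's timer
        dev.flush()
        if i < beats - 1:
            time.sleep(interval)
    # Assumption: the driver honours "magic close", where a final 'V'
    # tells the kernel that the close which follows is intentional.
    dev.write(b"V")
    dev.flush()


if __name__ == "__main__":
    # Dry run against an in-memory buffer instead of the real device.
    fake = io.BytesIO()
    keep_alive(fake, beats=3, interval=0)
    print(fake.getvalue())
```

       The caller is expected to close the device after keep_alive returns.
       With CONFIG_WATCHDOG_NOWAYOUT enabled the timer keeps running
       regardless of how the device is closed.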

TESTS

       The watchdog daemon runs several tests to check the system status:

       ·  Is the process table full?

       ·  Is there enough free memory?

       ·  Are some files accessible?

       ·  Have some files changed within a given interval?

       ·  Is the average work load too high?

       ·  Has a file table overflow occurred?

       ·  Is a process still running? The process is specified by a pid file.

       ·  Do some IP addresses answer to ping?

       ·  Do network interfaces receive traffic?

       ·  Is the temperature too high? (Temperature data is not always
          available.)

       ·  Execute a user-defined command to do arbitrary tests.

       If any of these checks fails, watchdog will cause a shutdown. If any of
       these tests, except the user-defined binary, lasts longer than one
       minute, the machine will be rebooted as well.

OPTIONS

       Available command line options are the following:

       -v, --verbose
              Set verbose mode. Only implemented if compiled with the SYSLOG
              feature. This mode logs several pieces of information to
              LOG_DAEMON with priority LOG_INFO. This is useful if you want to
              see exactly what happened before watchdog rebooted the system.
              Currently it logs the temperature (if available), the load
              average, the change date of the files it checks and how often it
              went to sleep.

       -s, --sync
              Try to synchronize the filesystem every time the process is
              awake. Note that the system is rebooted if for any reason the
              synchronizing lasts longer than a minute.

       -b, --softboot
              Soft-boot the system if an error occurs during the main loop,
              e.g. if a given file is not accessible via the stat(2) call.
              Note that this does not apply to the opening of /dev/watchdog
              and /proc/loadavg, which are opened before the main loop starts.

       -f, --force
              Force the usage of the interval given or the maximal load
              average given in the config file.

       -c config-file, --config-file config-file
              Use config-file as the configuration file instead of the default
              /etc/watchdog.conf.

       -q, --no-action
              Do not reboot or halt the machine. This is for testing purposes.
              All checks are executed and the results are logged as usual, but
              no action is taken. The hardware card or the kernel software
              watchdog driver is not enabled either. Temperature checking is
              also disabled since this triggers the hardware watchdog on some
              cards.

FUNCTION

       After watchdog starts, it puts itself into the background and then
       tries all checks specified in its configuration file in turn. Between
       each pair of tests it writes to the kernel device to prevent a reset.
       After finishing all tests watchdog goes to sleep for some time. The
       kernel driver expects a write to the watchdog device every minute.
       Otherwise the system will be reset. By default watchdog sleeps for only
       10 seconds so it triggers the device early enough.

       Under high system load watchdog might be swapped out of memory and may
       not be swapped back in soon enough. Under these circumstances the Linux
       kernel will reset the machine. To avoid unnecessary reboots, set the
       variable realtime to yes in the configuration file watchdog.conf. This
       adds real-time support to watchdog: it locks itself into memory, and
       there should be no problem even under the highest of loads.

       You can also specify a maximal allowed load average. Once this load
       average is reached the system is rebooted. You may specify maximal load
       averages for 1 minute, 5 minutes or 15 minutes. By default this test is
       disabled. Be careful not to set this parameter too low. To set a value
       less than the predefined minimal value of 2, you have to use the -f
       option.
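
       The load average test can be illustrated with a short Python sketch. It
       is not taken from the daemon: the function names are invented, the
       convention that a limit of 0 disables a check is an assumption, and
       "reached" is read here as greater than or equal to the limit.

```python
MIN_LOAD = 2  # predefined minimal value mentioned in the text


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("loadavg contains no (or not enough) data")
    return tuple(float(f) for f in fields[:3])


def validate_limit(limit, force=False):
    """Reject limits below MIN_LOAD unless forced, mirroring the -f option."""
    if 0 < limit < MIN_LOAD and not force:
        raise ValueError("limit below %d requires force" % MIN_LOAD)
    return limit


def load_exceeded(loads, limits):
    """True if any average has reached its limit.

    Assumption: a limit of 0 means the corresponding check is disabled.
    """
    return any(limit > 0 and load >= limit
               for load, limit in zip(loads, limits))


if __name__ == "__main__":
    sample = "0.50 1.20 3.40 1/234 5678"  # made-up /proc/loadavg content
    print(load_exceeded(parse_loadavg(sample), (24, 18, 12)))
```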

       You can also specify a minimal amount of virtual memory you want to
       have available as free. As soon as more virtual memory is used, action
       is taken by watchdog. Note, however, that watchdog does not distinguish
       between different types of memory usage. It just checks for free
       virtual memory.

       If you have a watchdog card with a temperature sensor you can specify
       the maximal allowed temperature. Once this temperature is reached the
       system is halted. The default value is 120. There is no unit conversion
       so make sure you use the same unit as your hardware. watchdog will
       issue warnings once the temperature reaches 90%, 95% and 98% of this
       temperature.
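
       As a worked example of the warning thresholds, the following Python
       sketch classifies a reading against the maximal temperature. The
       function name and the returned strings are invented for illustration.
       Integer arithmetic is used so that readings sitting exactly on a
       threshold are not affected by floating point rounding.

```python
WARN_PERCENTAGES = (90, 95, 98)  # warning levels named in the text


def temperature_status(temp, max_temp=120):
    """Classify a reading as 'ok', 'warn N%' or 'halt'.

    `temp` and `max_temp` must use the same unit; nothing is converted.
    """
    if temp >= max_temp:
        return "halt"
    # Highest warning level whose threshold the reading has reached.
    reached = [p for p in WARN_PERCENTAGES if temp * 100 >= p * max_temp]
    if reached:
        return "warn %d%%" % reached[-1]
    return "ok"


if __name__ == "__main__":
    for reading in (100, 108, 115, 118, 120):
        print(reading, temperature_status(reading))
```

       With the default maximum of 120 the three warning thresholds work out
       to 108, 114 and 117.6.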

       When using file mode watchdog will try to stat(2) the given files.
       Errors returned by stat will not cause a reboot. For a reboot the stat
       call has to last at least one minute. This may happen if the file is
       located on an NFS mounted filesystem. If your system relies on an NFS
       mounted filesystem you might try this option. However, in such a case
       the sync option may not work if the NFS server is not answering.

       watchdog can read the pid from a pid file and see whether the process
       still exists. If not, action is taken by watchdog. You can, for
       instance, restart the server from your repair binary.
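
       A pid file check of this kind can be sketched in Python as follows. The
       text does not say how watchdog probes the process; sending signal 0
       with kill(2), which performs the existence and permission checks
       without delivering a signal, is a common technique and is used here as
       an assumption. The function name process_alive is invented.

```python
import os
import tempfile


def process_alive(pidfile):
    """Return True if the pid stored in `pidfile` names an existing process."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False              # missing, unreadable or malformed pid file
    if pid <= 0:
        return False              # 0 and negative pids address process groups
    try:
        os.kill(pid, 0)           # signal 0: probe only, nothing is delivered
    except ProcessLookupError:
        return False              # no such process
    except PermissionError:
        return True               # process exists but belongs to another user
    return True


if __name__ == "__main__":
    # Demonstrate with a temporary pid file holding our own pid.
    with tempfile.NamedTemporaryFile("w", suffix=".pid", delete=False) as f:
        f.write("%d\n" % os.getpid())
    print(process_alive(f.name))
    os.unlink(f.name)
```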

       watchdog periodically tries to fork itself to see whether the process
       table is full. The child remains as a zombie process until watchdog
       wakes up again and reaps it; this is harmless.

       In ping mode watchdog tries to ping the given IP addresses. An address
       does not have to belong to a single machine. It is possible to ping a
       broadcast address instead to see if at least one machine in a subnet
       is still alive.

       Do not use this broadcast ping unless your MIS person a) knows about it
       and b) has given you explicit permission to use it!

       watchdog will send out three ping packets and wait up to <interval>
       seconds for the reply, with <interval> being the time it goes to sleep
       between two writes to the watchdog device. Thus an unreachable network
       will not cause a hard reset but a soft reboot.

       You can also test passively for an unreachable network by just
       monitoring a given interface for traffic. If no traffic arrives the
       network is considered unreachable, causing a soft reboot or action from
       the repair binary.

       watchdog can run an external command for user-defined tests. A return
       code not equal to 0 means an error occurred and watchdog should react.
       If the external command is killed by an uncaught signal this is
       considered an error by watchdog too. The command may take longer than
       the time slice defined for the kernel device without a problem.
       However, error messages are then written to syslog. If you have enabled
       softboot on error the machine will be rebooted if the binary doesn't
       exit in half the time watchdog sleeps between two attempts to trigger
       the kernel device.

       If you specify a repair binary it will be started instead of shutting
       down the system. If this binary is not able to fix the problem watchdog
       will still cause a reboot afterwards.

       If the machine is halted an email is sent to notify a human that the
       machine is going down. Starting with version 4.4 watchdog will also
       notify the human in charge if the machine is rebooted.

SOFT REBOOT

       A soft reboot (i.e. controlled shutdown and reboot) is initiated for
       every error that is found. Since there might be no more processes
       available, watchdog does it all by itself. That means:

       1.  Kill all processes with SIGTERM.

       2.  After a short pause kill all remaining processes with SIGKILL.

       3.  Record a shutdown entry in wtmp.

       4.  Save the random seed from /dev/urandom. If the device is
           nonexistent or there is no filename for saving, this step is
           skipped.

       5.  Turn off accounting.

       6.  Turn off quota and swap.

       7.  Unmount all partitions except the root partition.

       8.  Remount the root partition read-only.

       9.  Shut down all network interfaces.

       10. Finally reboot.

CHECK BINARY

       If the return code of the check binary is not zero watchdog will assume
       an error and reboot the system. Be careful with this if you are using
       the real-time properties of watchdog since watchdog will wait for the
       return of this binary before proceeding. A positive exit code is
       interpreted as a system error code (see errno.h for details). Negative
       values are special to watchdog:

       -1     Reboot the system. This is not exactly an error message but a
              command to watchdog. If the return code is -1 watchdog will not
              try to run a shutdown script instead.

       -2     Reset the system. This is not exactly an error message but a
              command to watchdog. If the return code is -2 watchdog will
              simply refuse to write the kernel device again.

       -3     Maximum load average exceeded.

       -4     The temperature inside is too high.

       -5     /proc/loadavg contains no (or not enough) data.

       -6     The given file was not changed in the given interval.

       -7     /proc/meminfo contains invalid data.

       -8     Child process was killed by a signal.

       -9     Child process did not return in time.

       -10    Free for personal use.
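
       One practical detail when writing a check binary: on POSIX systems a
       parent only sees the low 8 bits of a child's exit status, so a binary
       that exits with -1 is reported as 255. The Python sketch below maps
       such a status back to the codes listed above. It is an illustration
       under that assumption, not a description of how this version of
       watchdog decodes the value; the names SPECIAL_CODES, to_signed and
       describe are invented.

```python
import os

# Special codes as listed in the table above.
SPECIAL_CODES = {
    -1: "reboot the system",
    -2: "reset the system",
    -3: "maximum load average exceeded",
    -4: "temperature too high",
    -5: "/proc/loadavg contains no (or not enough) data",
    -6: "file was not changed in the given interval",
    -7: "/proc/meminfo contains invalid data",
    -8: "child process was killed by a signal",
    -9: "child process did not return in time",
    -10: "free for personal use",
}


def to_signed(status):
    """Reinterpret an 8-bit exit status (0..255) as a value in -128..127."""
    return status - 256 if status > 127 else status


def describe(status):
    """Return a human-readable meaning for a check binary's exit status."""
    code = to_signed(status)
    if code == 0:
        return "ok"
    if code in SPECIAL_CODES:
        return SPECIAL_CODES[code]
    if code > 0:
        return os.strerror(code)  # positive values are errno codes
    return "unknown"


if __name__ == "__main__":
    for status in (0, 2, 255, 246):
        print(status, describe(status))
```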

REPAIR BINARY

       The repair binary is started with one parameter: the error number that
       caused watchdog to initiate the reboot process. After trying to repair
       the system the binary should exit with 0 if the system was successfully
       repaired and there is thus no need to reboot anymore. A return value
       not equal to 0 tells watchdog to reboot. The return code of the repair
       binary should be the error number of the error causing watchdog to
       reboot. Be careful with this if you are using the real-time properties
       since watchdog will wait for the return of this binary before
       proceeding.
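
       The calling convention above suggests a simple skeleton for a repair
       binary, sketched here in Python. The REPAIRS table and its handlers are
       hypothetical placeholders; a real script would fill in actions that
       make sense for the local system.

```python
# Hypothetical table: error number -> function returning True on success.
REPAIRS = {}


def main(argv):
    """Return the exit code for a repair run started as `argv`."""
    if len(argv) != 2:
        return 1                  # called without the error number
    try:
        error = int(argv[1])
    except ValueError:
        return 1                  # argument is not a number
    repair = REPAIRS.get(error)
    if repair is not None and repair():
        return 0                  # repaired: no reboot needed
    # Not repaired: hand the original error number back to watchdog.
    # A nonzero value is required to signal failure.
    return error if error != 0 else 1


# Installed as a script, the file would end with:
#     import sys
#     sys.exit(main(sys.argv))
```

       Keeping the handlers short matters because watchdog waits for the
       binary to return before it proceeds.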

BUGS

       None known so far.

AUTHORS

       The original code is an example written by Alan Cox
       <alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All
       additions were written by Michael Meskes <meskes@debian.org>. Johnie
       Ingram <johnie@netgod.net> had the idea of testing the load average. He
       also took over the Debian-specific work. Dave Cinege
       <dcinege@psychosis.com> brought up some hardware watchdog issues and
       helped with testing.

FILES

       /dev/watchdog
              The watchdog device.

       /var/run/watchdog.pid
              The pid file of the running watchdog.

SEE ALSO

       watchdog.conf(5)



4th Berkeley Distribution        January 2005                      WATCHDOG(8)