WATCHDOG(8)                 System Manager's Manual                WATCHDOG(8)

NAME

       watchdog - a software watchdog daemon

SYNOPSIS

       watchdog [-F|--foreground] [-f|--force] [-c filename|--config-file
       filename] [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]

DESCRIPTION

       The Linux kernel can reset the system if serious problems are detected.
       This can be implemented via special watchdog hardware, or via a
       slightly less reliable software-only watchdog inside the kernel. Either
       way, there needs to be a daemon that tells the kernel the system is
       working fine. If the daemon stops doing that, the system is reset.

       watchdog is such a daemon. It opens /dev/watchdog and keeps writing to
       it often enough to keep the kernel from resetting the system, at least
       once per minute. Each write postpones the reset by another minute.
       After a minute of inactivity the watchdog hardware causes the reset.
       With the software watchdog, the ability to reboot depends on the state
       of the machine and its interrupts.

       The watchdog daemon can be stopped without causing a reboot if the
       device /dev/watchdog is closed correctly, unless your kernel is
       compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
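
       The write-then-close cycle described above can be sketched in shell.
       This is only an illustration, not the daemon's code: pet_watchdog is a
       hypothetical helper, and writing the character V just before closing
       assumes a driver that implements the kernel's "magic close" feature.
       Pointing it at the real /dev/watchdog arms the timer, so try it on a
       scratch file first.

```shell
#!/bin/sh
# Sketch only: pet_watchdog is a hypothetical helper, not part of watchdog.
# Usage: pet_watchdog DEVICE COUNT INTERVAL
pet_watchdog() {
    dev=$1
    count=$2
    interval=$3
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            # Any write tells the kernel the system is still alive.
            printf '.' >&3
            sleep "$interval"
            i=$((i + 1))
        done
        # Assumption: the driver supports magic close, so a final 'V'
        # lets the device be closed without a reset (unless NOWAYOUT is set).
        printf 'V' >&3
    } 3>"$dev"
}
```

       For example, pet_watchdog /tmp/fake-watchdog 3 1 writes three
       keepalives one second apart into a scratch file and then closes it.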

TESTS

       The watchdog daemon performs several tests to check the system status:

       ·  Is the process table full?

       ·  Is there enough free memory?

       ·  Are some files accessible?

       ·  Have some files changed within a given interval?

       ·  Is the average work load too high?

       ·  Has a file table overflow occurred?

       ·  Is a process still running? The process is specified by a pid file.

       ·  Do some IP addresses answer to ping?

       ·  Do network interfaces receive traffic?

       ·  Is the temperature too high? (Temperature data is not always
          available.)

       ·  Execute a user-defined command to do arbitrary tests.

       ·  Execute one or more test/repair commands found in /etc/watchdog.d.
          These commands are called with the argument test or repair.

       If any of these checks fails, watchdog will cause a shutdown. Should
       any of these tests, except the user-defined binary, last longer than
       one minute, the machine will be rebooted too.

OPTIONS

       Available command line options are the following:

       -v, --verbose
              Set verbose mode. Only implemented if compiled with the SYSLOG
              feature. In this mode several pieces of information are logged
              to the LOG_DAEMON facility with priority LOG_INFO. This is
              useful if you want to see exactly what happened before watchdog
              rebooted the system. Currently it logs the temperature (if
              available), the load average, the change date of the files it
              checks and how often it went to sleep.

       -s, --sync
              Try to synchronize the filesystem every time the process is
              awake. Note that the system is rebooted if for any reason the
              synchronizing lasts longer than a minute.

       -b, --softboot
              Soft-boot the system if an error occurs during the main loop,
              e.g. if a given file is not accessible via the stat(2) call.
              Note that this does not apply to the opening of /dev/watchdog
              and /proc/loadavg, which are opened before the main loop starts.

       -F, --foreground
              Run in foreground mode, useful for running under systemd (for
              example).

       -f, --force
              Force the use of the interval or the maximal load average given
              in the config file.

       -c config-file, --config-file config-file
              Use config-file as the configuration file instead of the default
              /etc/watchdog.conf.

       -q, --no-action
              Do not reboot or halt the machine. This is for testing purposes.
              All checks are executed and the results are logged as usual, but
              no action is taken. The hardware card or the kernel software
              watchdog driver is not enabled either. Temperature checking is
              also disabled since this triggers the hardware watchdog on some
              cards.

FUNCTION

       After watchdog starts, it puts itself into the background and then
       tries all checks specified in its configuration file in turn. Between
       any two tests it writes to the kernel device to prevent a reset. After
       finishing all tests watchdog goes to sleep for some time. The kernel
       driver expects a write to the watchdog device every minute; otherwise
       the system will be reset. By default watchdog sleeps for only 1 second,
       so it triggers the device early enough.

       Under high system load watchdog might be swapped out of memory and may
       fail to make it back in time. Under these circumstances the Linux
       kernel will reset the machine. To avoid unnecessary reboots, make sure
       the variable realtime is set to yes in the configuration file
       watchdog.conf. This adds real-time support to watchdog: it locks itself
       into memory, and there should be no problem even under the highest of
       loads.

       On a system running out of memory the kernel will try to free enough
       memory by killing processes. The watchdog daemon itself is exempted
       from this so-called out-of-memory killer.

       You can also specify a maximal allowed load average. Once this load
       average is reached the system is rebooted. You may specify maximal load
       averages for 1 minute, 5 minutes or 15 minutes. By default this test is
       disabled. Be careful not to set this parameter too low. To set a value
       less than the predefined minimal value of 2, you have to use the -f
       option.

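       The load average comparison can be illustrated with a small shell
       sketch. It is not how the daemon is implemented; load_exceeds is a
       hypothetical helper that reads the first field of a file laid out like
       /proc/loadavg and compares it with a threshold. The return values
       (0 exceeded, 1 within limit, 2 unreadable data) are this sketch's own
       convention.

```shell
#!/bin/sh
# Sketch only: load_exceeds is a hypothetical helper.
# Usage: load_exceeds FILE MAX
# Returns 0 if the 1-minute load in FILE is above MAX, 1 if it is not,
# and 2 if FILE is empty or does not start with a number.
load_exceeds() {
    file=$1
    max=$2
    awk -v max="$max" '
        NR == 1 {
            # First field of /proc/loadavg is the 1-minute average.
            if ($1 !~ /^[0-9]+(\.[0-9]+)?$/) exit 2
            if (($1 + 0) > (max + 0)) exit 0
            exit 1
        }
        END { if (NR == 0) exit 2 }
    ' "$file"
}
```

       For example, load_exceeds /proc/loadavg 24 && echo "load too high"
       prints the message only when the 1-minute average is above 24.
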
       You can also specify a minimal amount of virtual memory you want to
       have available as free. As soon as more virtual memory is used, action
       is taken by watchdog. Note, however, that watchdog does not distinguish
       between different types of memory usage. It just checks for free
       virtual memory.

       If you have a watchdog card with a temperature sensor you can specify
       the maximal allowed temperature. Once this temperature is reached the
       system is halted. The default value is 120. There is no unit
       conversion, so make sure you use the same unit as your hardware.
       watchdog will issue warnings once the temperature reaches 90%, 95% and
       98% of this temperature.

       When using file mode watchdog will try to stat(2) the given files.
       Errors returned by stat will not cause a reboot. For a reboot the stat
       call has to last at least one minute. This may happen if the file is
       located on an NFS-mounted filesystem. If your system relies on an
       NFS-mounted filesystem you might try this option. However, in such a
       case the sync option may not work if the NFS server is not answering.

       watchdog can read the pid from a pid file and see whether the process
       still exists. If not, action is taken by watchdog, so you can, for
       instance, restart the server from your repair binary. See the Systemd
       section below for additional information.

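       A pid file check of this kind can be approximated in shell. The text
       above does not say how the daemon probes the process, so the use of
       kill -0 here is an assumption of this sketch, and pid_alive is a
       hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: pid_alive is a hypothetical helper.
# Usage: pid_alive PIDFILE
# Returns 0 if the process named in PIDFILE can be signalled, 1 if it
# cannot, and 2 if PIDFILE is missing or does not hold a positive number.
pid_alive() {
    pidfile=$1
    [ -r "$pidfile" ] || return 2
    pid=
    # read fails on a file without a trailing newline but still fills pid.
    read -r pid < "$pidfile" || [ -n "$pid" ] || return 2
    case "$pid" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$pid" -gt 0 ] || return 2
    # Signal 0 sends nothing; it only checks that the process can be
    # signalled. A process owned by another user may also fail this test.
    kill -0 "$pid" 2>/dev/null
}
```

       For example, pid_alive /var/run/watchdog.pid reports whether the
       running watchdog daemon itself is still present.
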
       watchdog will periodically try to fork itself to see whether the
       process table is full. This leaves a zombie process until watchdog
       wakes up again and catches it; this is harmless, don't worry about it.

       In ping mode watchdog tries to ping the given IP addresses. An address
       does not have to identify a single machine. It is possible to ping a
       broadcast address instead to see if at least one machine in a subnet
       is still alive.

       Do not use this broadcast ping unless your MIS person a) knows about it
       and b) has given you explicit permission to use it!

       watchdog will send out three ping packets and wait up to <interval>
       seconds for the reply, with <interval> being the time it sleeps between
       two writes to the watchdog device. Thus an unreachable network will not
       cause a hard reset but a soft reboot.

       You can also test passively for an unreachable network by just
       monitoring a given interface for traffic. If no traffic arrives the
       network is considered unreachable, causing a soft reboot or action from
       the repair binary.

       To start watchdog when the network is available, see the Systemd
       section below.

       watchdog can run an external command for user-defined tests. A non-zero
       return code means an error occurred and watchdog should react. If the
       external command is killed by an uncaught signal this is considered an
       error by watchdog too. The command may take longer than the time slice
       defined for the kernel device without a problem. However, error
       messages are written to syslog. If you have enabled softboot on error
       the machine will be rebooted if the binary doesn't exit within half the
       time watchdog sleeps between two attempts to trigger the kernel device.

       If you specify a repair binary it will be started instead of shutting
       down the system. If this binary is not able to fix the problem watchdog
       will still cause a reboot afterwards.

       If the machine is halted an email is sent to notify a human that the
       machine is going down. Starting with version 4.4 watchdog will also
       notify the human in charge if the machine is rebooted.
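
       The checks described in this section are enabled through directives in
       watchdog.conf. The fragment below is an illustrative sketch only. Of
       the names shown, only realtime appears in this page; the others follow
       the sample configuration documented in watchdog.conf(5) and should be
       verified against your installed version. All paths, addresses and
       thresholds are placeholders.

```conf
# Illustrative watchdog.conf fragment; every value is a placeholder.
watchdog-device = /dev/watchdog

# Sleep one second between rounds of checks.
interval = 1

# Lock the daemon into memory so it is not swapped out under load.
realtime = yes
priority = 1

# Reboot once the 1-minute load average reaches this value.
max-load-1 = 24

# Minimal amount of free virtual memory.
min-memory = 1

# Hypothetical service pid file, ping target and interface.
pidfile = /var/run/example.pid
ping = 192.0.2.1
interface = eth0
```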

SOFT REBOOT

       A soft reboot (i.e. a controlled shutdown and reboot) is initiated for
       every error that is found. Since there might be no more processes
       available, watchdog does it all by itself. That means:

       1.  Kill all processes with SIGTERM.

       2.  After a short pause kill all remaining processes with SIGKILL.

       3.  Record a shutdown entry in wtmp.

       4.  Save the random seed from /dev/urandom. If the device does not
           exist or no filename is given for saving the seed, this step is
           skipped.

       5.  Turn off accounting.

       6.  Turn off quota and swap.

       7.  Unmount all partitions except the root partition.

       8.  Remount the root partition read-only.

       9.  Shut down all network interfaces.

       10. Finally reboot.

CHECK BINARY

       If the return code of the check binary is not zero watchdog will assume
       an error and reboot the system. Be careful with this if you are using
       the real-time properties of watchdog, since watchdog will wait for the
       return of this binary before proceeding. A positive exit code is
       interpreted as a system error code (see errno.h for details). Negative
       values are special to watchdog:

       -1     Reboot the system. This is not exactly an error message but a
              command to watchdog. If the return code is -1 watchdog will not
              try to run a shutdown script instead.

       -2     Reset the system. This is not exactly an error message but a
              command to watchdog. If the return code is -2 watchdog will
              simply refuse to write to the kernel device again.

       -3     Maximum load average exceeded.

       -4     The temperature inside is too high.

       -5     /proc/loadavg contains no (or not enough) data.

       -6     The given file was not changed in the given interval.

       -7     /proc/meminfo contains invalid data.

       -8     Child process was killed by a signal.

       -9     Child process did not return in time.

       -10    Free for personal use.

       With an enforcing SELinux policy please use
       /usr/libexec/watchdog/scripts/ for your test-binary configuration.
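
       A check binary can be as small as a shell script. The sketch below is
       a hypothetical example, not something shipped with watchdog: it tests
       whether a directory accepts new files and reports failure with a
       positive errno-style code (ENOENT is 2 and EACCES is 13 in the Linux
       errno.h). Choosing those two codes is a decision of this sketch, and
       it does not attempt to produce the reserved negative values.

```shell
#!/bin/sh
# Sketch only: a hypothetical check binary that verifies a directory
# is present and writable.
ENOENT=2     # "No such file or directory" in the Linux errno.h
EACCES=13    # "Permission denied" in the Linux errno.h

# Usage: check_writable_dir DIR
check_writable_dir() {
    dir=$1
    [ -d "$dir" ] || return "$ENOENT"
    probe="$dir/.watchdog-probe.$$"
    # The subshell keeps a failed redirection from ending the script.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return "$EACCES"
}

# Behave as a check binary only when a directory is given as argument.
if [ "$#" -gt 0 ]; then
    check_writable_dir "$1"
    exit $?
fi
```

       Because watchdog waits for the check binary to return, a script like
       this should avoid anything that can block for long.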

REPAIR BINARY

       The repair binary is started with one parameter: the error number that
       caused watchdog to initiate the reboot. After trying to repair the
       system the binary should exit with 0 if the system was successfully
       repaired and there is thus no need to reboot anymore. A non-zero return
       value tells watchdog to reboot. The return code of the repair binary
       should be the error number of the error causing watchdog to reboot. Be
       careful with this if you are using the real-time properties, since
       watchdog will wait for the return of this binary before proceeding.

       With an enforcing SELinux policy please use
       /usr/libexec/watchdog/scripts/ for your repair-binary configuration.
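
       The contract described above (one error number in, 0 for "repaired",
       the same error number back otherwise) can be sketched as a shell
       script. Everything specific here is hypothetical: the SPOOL_DIR path,
       the repair function name, and the decision to treat error 2 (ENOENT
       in the Linux errno.h) as a missing directory that can be recreated.

```shell
#!/bin/sh
# Sketch only: a hypothetical repair binary.
SPOOL_DIR=${SPOOL_DIR:-/var/spool/example}   # hypothetical directory
ENOENT=2                                     # value from the Linux errno.h

# Usage: repair ERROR_NUMBER
repair() {
    err=$1
    case "$err" in
        "$ENOENT")
            # Assumed cause: the directory vanished. Recreate it and
            # report success so that no reboot is needed.
            mkdir -p "$SPOOL_DIR" && return 0
            return "$err"
            ;;
        *)
            # Nothing this script can fix: hand the error number back
            # so watchdog proceeds with the reboot.
            return "$err"
            ;;
    esac
}

# Behave as a repair binary only when an error number is given.
if [ "$#" -gt 0 ]; then
    repair "$1"
    exit $?
fi
```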

TEST DIRECTORY

       Executables placed in the test directory are discovered by watchdog on
       startup and are automatically executed. They are bounded time-wise by
       the test-timeout directive in watchdog.conf.

       These executables are called with either "test" as the first argument
       (if a test is being performed) or "repair" as the first argument (if a
       repair for a previously-failed "test" operation is being performed).

       As with test binaries and repair binaries, the expected exit code for
       a successful test or repair operation is always zero.

       If an executable's test operation fails, the same executable is
       automatically called with the "repair" argument as well as the return
       code of the previously-failed test operation.

       For example, if the following execution returns 42:

           /etc/watchdog.d/my-test test

       The watchdog daemon will attempt to repair the problem by calling:

           /etc/watchdog.d/my-test repair 42

       This enables administrators and application developers to make
       intelligent test/repair commands. If the "repair" operation is not
       required (or is not likely to succeed), it is important that the author
       of the command return a non-zero value so the machine will still reboot
       as expected.

       Note that the watchdog daemon may interpret and act upon any of the
       reserved return codes noted in the Check Binary section prior to
       calling a given command in "repair" mode.
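
       A command for the test directory has to handle both modes. The sketch
       below shows one possible shape for the my-test example above; the
       HEARTBEAT path and the run_hook function are hypothetical, and
       recreating the file stands in for whatever a real repair would do
       (typically restarting the service that owns the file).

```shell
#!/bin/sh
# Sketch only: a hypothetical /etc/watchdog.d/my-test.
HEARTBEAT=${HEARTBEAT:-/run/example/heartbeat}   # hypothetical file
E_MISSING=42                                     # code used in the example above

# Usage: run_hook test | run_hook repair CODE
run_hook() {
    mode=$1
    code=${2:-0}
    case "$mode" in
        test)
            [ -e "$HEARTBEAT" ] && return 0
            return "$E_MISSING"
            ;;
        repair)
            if [ "$code" -eq "$E_MISSING" ]; then
                mkdir -p "$(dirname "$HEARTBEAT")" &&
                    touch "$HEARTBEAT" && return 0
            fi
            # Not repairable here: stay non-zero so the machine reboots.
            [ "$code" -ne 0 ] && return "$code"
            return 1
            ;;
        *)
            return 1
            ;;
    esac
}

# Dispatch only when called with arguments, as watchdog does.
if [ "$#" -gt 0 ]; then
    run_hook "$@"
    exit $?
fi
```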

SYSTEMD

       To start watchdog after the network is available:

       systemctl disable watchdog
       systemctl enable NetworkManager-wait-online
       systemctl enable watchdog-ping

       When a pid check monitors a custom service that has its own systemd
       unit file, be aware that "Requires=" also deactivates dependent
       services. Using "Before=watchdog.service" or
       "Before=watchdog-ping.service" in the custom service unit file may be
       the desired behaviour instead. See the systemd.unit documentation for
       more details.

SELINUX

       The directories /etc/watchdog.d/ and /usr/libexec/watchdog/scripts/ are
       recognized locations for custom executables.

BUGS

       None known so far.

AUTHORS

       The original code is an example written by Alan Cox
       <alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All
       additions were written by Michael Meskes <meskes@debian.org>. Johnie
       Ingram <johnie@netgod.net> had the idea of testing the load average. He
       also took over the Debian-specific work. Dave Cinege
       <dcinege@psychosis.com> brought up some hardware watchdog issues and
       helped with testing.

FILES

       /dev/watchdog
              The watchdog device.

       /var/run/watchdog.pid
              The pid file of the running watchdog.

SEE ALSO

       watchdog.conf(5), systemd.unit(5)

4th Berkeley Distribution        January 2005                      WATCHDOG(8)