watchdog(8)

WATCHDOG(8)                 System Manager's Manual                WATCHDOG(8)

NAME

       watchdog - a software watchdog daemon

SYNOPSIS

       watchdog [-F|--foreground] [-f|--force] [-c filename|--config-file
       filename] [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]

DESCRIPTION

       The Linux kernel can reset the system if serious problems are detected.
       This can be implemented via special watchdog hardware, or via a
       slightly less reliable software-only watchdog inside the kernel. Either
       way, there needs to be a daemon that tells the kernel the system is
       working fine. If the daemon stops doing that, the system is reset.

       watchdog is such a daemon. It opens /dev/watchdog and keeps writing to
       it often enough to keep the kernel from resetting, at least once per
       minute. Each write delays the reboot by another minute. After a minute
       of inactivity the watchdog hardware will cause the reset. In the case
       of the software watchdog, the ability to reboot will depend on the
       state of the machine and its interrupts.

       The watchdog daemon can be stopped without causing a reboot if the
       device /dev/watchdog is closed correctly, unless your kernel is
       compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
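
       The keepalive behaviour described above can be illustrated with a short
       Python sketch. It is not the daemon's actual code. The function name
       keep_alive and its parameters are hypothetical, and the final 'V' write
       assumes a driver that honours the kernel's "magic close" convention for
       closing the device correctly.

```python
import time


def keep_alive(device_path, beats, interval=1.0, magic_close=True):
    """Write `beats` keepalives to a watchdog device, then close it."""
    # buffering=0 so that every write reaches the driver immediately.
    with open(device_path, "wb", buffering=0) as dev:
        for _ in range(beats):
            dev.write(b"\0")      # any write restarts the countdown
            time.sleep(interval)
        if magic_close:
            # Assumption: the driver supports the magic close character 'V',
            # which asks it to stop the timer when the device is closed.
            dev.write(b"V")
```

       On a real system the path would be /dev/watchdog and the caller needs
       sufficient privileges. Once the device is open, a crash of the caller
       leads to a reset, so this is best tried on a test machine only.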

TESTS

       The watchdog daemon does several tests to check the system status:

       •  Is the process table full?

       •  Is there enough free memory?

       •  Is there enough allocatable memory?

       •  Are some files accessible?

       •  Have some files changed within a given interval?

       •  Is the average work load too high?

       •  Has a file table overflow occurred?

       •  Is a process still running? The process is specified by a pid file.

       •  Do some IP addresses respond to ping?

       •  Do network interfaces receive traffic?

       •  Is the temperature too high? (Temperature data is not always
          available.)

       •  Execute a user-defined command to do arbitrary tests.

       •  Execute one or more test/repair commands found in /etc/watchdog.d.
          These commands are called with the argument test or repair.

       If any of these checks fails, watchdog will cause a shutdown. Should
       any of these tests, except the user-defined binary, last longer than
       one minute, the machine will be rebooted too.

OPTIONS

       Available command line options are the following:

       -v, --verbose
              Set verbose mode. Only implemented if compiled with the SYSLOG
              feature. This mode logs several pieces of information to
              LOG_DAEMON with priority LOG_DEBUG. This is useful if you want
              to see exactly what happened before watchdog rebooted the
              system. Currently it logs the temperature (if available), the
              load average, the change date of the files it checks and how
              often it went to sleep. You can use this option twice to enable
              more verbose debug messages for testing.

       -s, --sync
              Try to synchronize the filesystem every time the process is
              awake. Note that the system is rebooted if for any reason the
              synchronizing lasts longer than a minute.

       -b, --softboot
              Soft-boot the system if an error occurs during the main loop,
              e.g. if a given file is not accessible via the stat(2) call.
              Note that this does not apply to the opening of /dev/watchdog
              and /proc/loadavg, which are opened before the main loop starts.
              This is now implemented by disabling the error re-try timer.

       -F, --foreground
              Run in foreground mode, useful for running under systemd (for
              example).

       -f, --force
              Force the usage of the interval given or the maximal load
              average given in the config file. Without this option these
              values are sanity checked.

       -c config-file, --config-file config-file
              Use config-file as the configuration file instead of the default
              /etc/watchdog.conf.

       -q, --no-action
              Do not reboot or halt the machine. This is for testing purposes.
              All checks are executed and the results are logged as usual, but
              no action is taken. Your hardware card or the kernel software
              watchdog driver is not enabled either. NOTE: This still allows
              'repair' actions to run, but the daemon itself will not attempt
              a reboot.

       -X num, --loop-exit num
              Run for 'num' loops then exit as if SIGTERM was received.
              Intended for test/debug (e.g. using valgrind for checking memory
              access). If the daemon exits on a loop counter and you have the
              CONFIG_WATCHDOG_NOWAYOUT option compiled for the kernel or
              device-driver then an unplanned reboot will follow. Be warned!

FUNCTION

       After watchdog starts, it puts itself into the background and then
       tries all checks specified in its configuration file in turn. Between
       every two tests it writes to the kernel device to prevent a reset.
       After finishing all tests watchdog goes to sleep for some time. The
       kernel driver expects a write to the watchdog device every minute;
       otherwise the system will be reset. watchdog sleeps for a configurable
       interval that defaults to 1 second to make sure it triggers the device
       early enough.

       Under high system load watchdog might be swapped out of memory and may
       fail to make it back in time. Under these circumstances the Linux
       kernel will reset the machine. To avoid unnecessary reboots, make sure
       the variable realtime is set to yes in the configuration file
       watchdog.conf. This adds real-time support to watchdog: it will lock
       itself into memory and there should be no problem even under the
       highest of loads.

       On a system running out of memory the kernel will try to free enough
       memory by killing processes. The watchdog daemon itself is exempted
       from this so-called out-of-memory killer.

       You can also specify a maximal allowed load average. Once this load
       average is reached the system is rebooted. You may specify maximal load
       averages for 1 minute, 5 minutes or 15 minutes. By default this test is
       disabled. Be careful not to set this parameter too low. To set a value
       less than the predefined minimal value of 2, you have to use the -f
       option.

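       To make the load average test concrete, the following Python sketch
       parses text in the format of /proc/loadavg and compares the three
       averages against limits. The function names are hypothetical, and
       treating a limit of 0 as "test disabled" is an assumption made for this
       example rather than a statement about the daemon's internals.

```python
def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages as floats."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("not enough data in loadavg")
    return tuple(float(field) for field in fields[:3])


def load_too_high(loads, limits):
    """True if any average has reached its limit (a limit of 0 is skipped)."""
    return any(limit > 0 and load >= limit
               for load, limit in zip(loads, limits))
```

       A caller might read /proc/loadavg, pass its contents to parse_loadavg,
       and then call load_too_high with the three configured limits.
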
       You can also specify a minimal amount of virtual memory you want to
       have available as free. As soon as more virtual memory is used, action
       is taken by watchdog. Note, however, that watchdog does not distinguish
       between different types of memory usage. It just checks for free
       virtual memory.

       If you have a machine with temperature sensor(s) you can specify the
       maximal allowed temperature. Once this temperature is reached on any
       sensor the system is powered off. The default value is 90 C. Typically
       the temperature information is provided by the sensors package as files
       in the virtual filesystem /sys/devices and can be found using, for
       example, the command

           find /sys -name 'temp*input' -print

       These files hold the temperature in milli-Celsius. You can have
       multiple sensors used in the config file. For example, to change to a
       75 C maximum and to check two virtual files for the system temperature
       you might have this:

           max-temperature = 75
           temperature-sensor = /sys/class/hwmon/hwmon0/device/temp1_input
           temperature-sensor = /sys/class/hwmon/hwmon0/device/temp2_input

       The watchdog will issue warnings once the temperature reaches 90%, 95%
       and 98% of the configured maximum temperature.

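       The relation between the milli-Celsius sensor value, the configured
       maximum and the warning levels can be sketched in Python as follows.
       The function name and the returned labels are hypothetical, and the
       use of "greater than or equal" for the warning levels is an assumption.

```python
def temperature_status(raw, max_celsius=90):
    """Classify a raw sensor reading given in milli-Celsius."""
    celsius = int(raw) / 1000.0          # e.g. "45000\n" -> 45.0
    if celsius >= max_celsius:
        return "power-off"
    # Check the highest warning level first so the closest one is reported.
    for percent in (98, 95, 90):
        if celsius >= max_celsius * percent / 100.0:
            return "warning-%d" % percent
    return "ok"
```

       With max-temperature = 75, for instance, a sensor value of 68000 falls
       in the 90% warning band, because 90% of 75 C is 67.5 C.
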
       When using file mode watchdog will try to stat(2) the given files.
       Errors returned by stat will not cause a reboot. For a reboot the stat
       call has to last at least the re-try time-out value (default 1 minute).
       This may happen if the file is located on an NFS mounted filesystem. If
       your system relies on an NFS mounted filesystem you might try this
       option. However, in such a case the sync option may not work if the NFS
       server is not answering.

       watchdog can read the pid from a pid file and see whether the process
       still exists. If not, action is taken by watchdog. So you can, for
       instance, restart the server from your repair binary. See the Systemd
       section below for additional information.

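       A pid file check of this kind can be approximated in Python with
       signal 0, which tests for the existence of a process without delivering
       anything to it. This is a sketch of the general technique, not the
       daemon's own implementation, and the function name is hypothetical.

```python
import os


def process_alive(pid_file):
    """Return True if the pid stored in pid_file refers to an existing process."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False              # unreadable or malformed pid file
    if pid <= 0:
        return False              # 0 and negative values address process groups
    try:
        os.kill(pid, 0)           # signal 0: existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True               # the process exists but belongs to another user
    return True
```

       Note that a pid can be reused by an unrelated process after the
       original one has exited, so a positive result is only a hint.
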
       watchdog will periodically try to fork itself to see whether the
       process table is full. The forked child remains a zombie process until
       watchdog wakes up again and reaps it; this is harmless.

       In ping mode watchdog tries to ping the given IPv4 addresses. These
       addresses do not have to be a single machine. It is possible to ping a
       broadcast address instead to see if at least one machine in a subnet is
       still alive.

       Do not use this broadcast ping unless your MIS person a) knows about it
       and b) has given you explicit permission to use it!

       watchdog will send out three ping packets and wait up to <interval>
       seconds for the reply, with <interval> being the time it sleeps between
       two successive triggers of the watchdog device. Thus an unreachable
       network will not cause a hard reset but a soft reboot.

       You can also test passively for an unreachable network by just
       monitoring a given interface for traffic. If no traffic arrives the
       network is considered unreachable, causing a soft reboot or action from
       the repair binary.

       To start the watchdog when the network is available see the Systemd
       section below.

       watchdog can run an external command for user-defined tests. A return
       code not equal to 0 means an error occurred and watchdog should react.
       If the external command is killed by an uncaught signal this is
       considered an error by watchdog too. The command may take longer than
       the time slice defined for the kernel device without a problem.
       However, error messages are written to the syslog facility. If you have
       enabled softboot on error the machine will be rebooted if the binary
       doesn't exit in half the time watchdog sleeps between two tries
       triggering the kernel device.

       If you specify a repair binary it will be started instead of shutting
       down the system. If this binary is not able to fix the problem watchdog
       will still cause a reboot afterwards.

       If the machine is halted an email is sent to notify a human that the
       machine is going down. Starting with version 4.4 watchdog will also
       notify the human in charge if the machine is rebooted.

       The re-try timer applies to most errors, except reset/reboot requests
       and over-temperature. It gives an error source time to recover before
       action is taken. Further exceptions are the file handle test, the load
       averages, and system memory. If set to the minimum time of 1 second it
       will still allow a single re-try at any polling interval of the system.

SOFT REBOOT

       A soft reboot (i.e. controlled shutdown and reboot) is initiated for
       every error that is found. Since there might be no more processes
       available, watchdog does it all by itself. That means:

       1.  Kill all processes with SIGTERM.

       2.  After a short pause kill all remaining processes with SIGKILL.

       3.  Record a shutdown entry in wtmp.

       4.  Save the random seed from /dev/urandom. If the device does not
           exist or there is no filename for saving, this step is skipped.

       5.  Turn off accounting.

       6.  Turn off quota and swap.

       7.  Unmount all partitions.

       8.  Finally reboot.

CHECK BINARY

       If the return code of the check binary is not zero watchdog will assume
       an error and reboot the system. Be careful with this if you are using
       the real-time properties of watchdog since watchdog will wait for the
       return of this binary before proceeding. An exit code smaller than 245
       is interpreted as a system error code (see errno.h for details).
       Values of 245 or larger are special to watchdog:

       255    (based on -1 as unsigned 8-bit number) Reboot the system. This
              is not exactly an error message but a command to watchdog. With
              this return code watchdog will not try to run a shutdown script
              instead.

       254    Reset the system. This is not exactly an error message but a
              command to watchdog. With this return code watchdog will attempt
              to hard-reset the machine without attempting any sort of orderly
              stopping of processes, unmounting of file systems, etc.

       253    Maximum load average exceeded.

       252    The temperature inside is too high.

       251    /proc/loadavg contains no (or not enough) data.

       250    The given file was not changed in the given interval.

       249    /proc/meminfo contains invalid data.

       248    Child process was killed by a signal.

       247    Child process did not return in time.

       246    Free for personal watchdog-specific use (was -10 as an unsigned
              8-bit number).

       245    Reserved for an unknown result, for example a slow background
              test that is still running, so neither a success nor an error.

       With enforcing SELinux policy please use the
       /usr/libexec/watchdog/scripts/ directory for your test-binary
       configuration.
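
       For scripts that need to report or log these codes, the table above can
       be expressed as a small Python lookup. The names SPECIAL_CODES and
       describe_exit_code are hypothetical; the mapping itself only restates
       the values listed in this section, and codes below 245 are passed to
       os.strerror as ordinary system error numbers.

```python
import os

# Return codes that are special to watchdog, as listed above.
SPECIAL_CODES = {
    255: "reboot the system",
    254: "reset the system",
    253: "maximum load average exceeded",
    252: "temperature too high",
    251: "/proc/loadavg contains no (or not enough) data",
    250: "file was not changed in the given interval",
    249: "/proc/meminfo contains invalid data",
    248: "child process was killed by a signal",
    247: "child process did not return in time",
    246: "free for personal watchdog-specific use",
    245: "unknown result",
}


def describe_exit_code(code):
    """Return a human-readable description of a check binary's exit code."""
    if code == 0:
        return "success"
    if code in SPECIAL_CODES:
        return SPECIAL_CODES[code]
    return os.strerror(code)      # codes below 245 are errno values
```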

REPAIR BINARY

       The repair binary is started with one parameter: the error number that
       caused watchdog to initiate the reboot process. After trying to repair
       the system the binary should exit with 0 if the system was successfully
       repaired and thus there is no need to reboot anymore. A return value
       not equal to 0 tells watchdog to reboot. The return code of the repair
       binary should be the error number of the error causing watchdog to
       reboot. Be careful with this if you are using the real-time properties
       since watchdog will wait for the return of this binary before
       proceeding.

       The configuration file parameter repair-maximum controls the number of
       successive repair attempts that report 0 (i.e. success) but fail to
       clear the tested fault. If this is exceeded then a reboot takes place.
       If set to zero then a reboot can always be blocked by the repair
       program reporting success.

       With enforcing SELinux policy please use the
       /usr/libexec/watchdog/scripts/ directory for your repair-binary
       configuration.
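
       One possible reading of the repair-maximum rule is sketched below in
       Python. The class and method names are hypothetical, and the exact
       point at which the daemon counts and resets attempts is an assumption;
       the sketch only models "more than repair-maximum successive repairs
       that report success without clearing the fault leads to a reboot".

```python
class RepairCounter:
    """Count successive repairs that reported 0 without clearing the fault."""

    def __init__(self, repair_maximum):
        self.repair_maximum = repair_maximum
        self.unhelpful_repairs = 0

    def test_passed(self):
        """The fault is gone, so the streak of unhelpful repairs ends."""
        self.unhelpful_repairs = 0

    def repair_reported_success(self):
        """Record a repair that returned 0 while the test still failed.

        Returns True if a reboot should follow.
        """
        self.unhelpful_repairs += 1
        if self.repair_maximum == 0:
            return False          # zero: the repair program can always block
        return self.unhelpful_repairs > self.repair_maximum
```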

TEST DIRECTORY

       Executables placed in the test directory are discovered by watchdog on
       startup and are automatically executed. They are bounded time-wise by
       the test-timeout directive in watchdog.conf.

       These executables are called with either "test" as the first argument
       (if a test is being performed) or "repair" as the first argument (if a
       repair for a previously-failed "test" operation is being performed).

       As with test binaries and repair binaries, the expected exit code for a
       successful test or repair operation is always zero.

       If an executable's test operation fails, the same executable is
       automatically called with the "repair" argument as well as the return
       code of the previously-failed test operation.

       For example, if the following execution returns 42:

           /etc/watchdog.d/my-test test

       The watchdog daemon will attempt to repair the problem by calling:

           /etc/watchdog.d/my-test repair 42

       This enables administrators and application developers to make
       intelligent test/repair commands. If the "repair" operation is not
       required (or is not likely to succeed), it is important that the author
       of the command return a non-zero value so the machine will still reboot
       as expected.

       Note that the watchdog daemon may interpret and act upon any of the
       reserved return codes noted in the Check Binary section prior to
       calling a given command in "repair" mode.

       As for the repair binary, the configuration parameter repair-maximum
       also controls the number of successive repair attempts that report
       success (return 0) but fail to clear the fault.
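
       The calling convention above can be illustrated with a minimal Python
       sketch of a combined test/repair command. Everything specific in it is
       hypothetical: the marker file path, the function name run, and the use
       of 42 as the failure code (taken from the example above). A real
       command would check and repair an actual service.

```python
import os

# Hypothetical file whose presence this example treats as "healthy".
MARKER = "/run/my-service.ready"


def run(argv, marker=MARKER):
    """Return the exit code for `test` or `repair <code>` invocations."""
    mode = argv[1] if len(argv) > 1 else ""
    if mode == "test":
        return 0 if os.path.exists(marker) else 42
    if mode == "repair":
        failed_code = int(argv[2]) if len(argv) > 2 else 0
        if failed_code != 42:
            # Not a fault this command knows how to fix: stay non-zero
            # so that the machine still reboots as expected.
            return failed_code or 1
        open(marker, "w").close()     # the "repair": recreate the marker
        return 0
    return 1                          # unknown invocation

# An installed executable would finish with: sys.exit(run(sys.argv))
```

       Returning the original code for faults the command cannot repair keeps
       the exit status non-zero, in line with the advice above.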

SYSTEMD

       To start watchdog after the network is available:

           systemctl disable watchdog
           systemctl enable NetworkManager-wait-online
           systemctl enable watchdog-ping

       When using a pid check for a custom service that has its own systemd
       unit file, please be aware that "Requires=" also deactivates dependent
       services. Using "Before=watchdog.service" or
       "Before=watchdog-ping.service" in the custom service unit file may be
       the desired operation instead. See the systemd.unit documentation for
       more details.

SELINUX

       The directories /etc/watchdog.d/ and /usr/libexec/watchdog/scripts/ are
       recognized locations for custom executables.

BUGS

       None known so far.

AUTHORS

       The original code is an example written by Alan Cox
       <alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All
       additions were written by Michael Meskes <meskes@debian.org>. Johnie
       Ingram <johnie@netgod.net> had the idea of testing the load average. He
       also took over the Debian-specific work. Dave Cinege
       <dcinege@psychosis.com> brought up some hardware watchdog issues and
       helped with testing.

FILES

       /dev/watchdog
              The watchdog device.

       /var/run/watchdog.pid
              The pid file of the running watchdog.