WATCHDOG(8)               System Manager's Manual              WATCHDOG(8)

watchdog - a software watchdog daemon

watchdog [-F|--foreground] [-f|--force] [-c filename|--config-file
filename] [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]

The Linux kernel can reset the system if serious problems are detected.
This can be implemented via special watchdog hardware, or via a
slightly less reliable software-only watchdog inside the kernel. Either
way, there needs to be a daemon that tells the kernel the system is
working fine. If the daemon stops doing that, the system is reset.

watchdog is such a daemon. It opens /dev/watchdog and keeps writing to
it often enough to keep the kernel from resetting, at least once per
minute. Each write delays the reboot time by another minute. After a
minute of inactivity the watchdog hardware will cause the reset. In the
case of the software watchdog, the ability to reboot will depend on the
state of the machine and its interrupts.

The watchdog daemon can be stopped without causing a reboot if the
device /dev/watchdog is closed correctly, unless your kernel is compiled
with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
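
The keep-alive cycle described above can be sketched in shell. This is
only an illustration, not the daemon's implementation: feed_watchdog is
a hypothetical helper, and it assumes the driver honours the kernel's
"magic close" convention, in which writing the character V immediately
before closing the device disarms the timer. Opening the real
/dev/watchdog arms the timer, so try it on a scratch file first.

```shell
# feed_watchdog DEVICE TICKS PAUSE
# Hypothetical helper: holds DEVICE open, writes one keep-alive byte per
# tick, then writes 'V' and closes so a cooperating driver disarms.
feed_watchdog() {
    dev=$1
    ticks=$2
    pause=$3

    # Keep a single descriptor open for the whole run; reopening the
    # device for every write would also close it every time.
    exec 3>"$dev" || return 1

    i=0
    while [ "$i" -lt "$ticks" ]; do
        printf '.' >&3      # any write resets the kernel timer
        sleep "$pause"
        i=$((i + 1))
    done

    printf 'V' >&3          # magic close character (assumed supported)
    exec 3>&-
}
```

For example, feed_watchdog /tmp/fake-watchdog 3 1 leaves the four
characters "...V" in the scratch file.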

The watchdog daemon does several tests to check the system status:

•  Is the process table full?

•  Is there enough free memory?

•  Is there enough allocatable memory?

•  Are some files accessible?

•  Have some files changed within a given interval?

•  Is the average work load too high?

•  Has a file table overflow occurred?

•  Is a process still running? The process is specified by a pid file.

•  Do some IP addresses respond to ping?

•  Do network interfaces receive traffic?

•  Is the temperature too high? (Temperature data is not always
   available.)

•  Execute a user-defined command to do arbitrary tests.

•  Execute one or more test/repair commands found in /etc/watchdog.d.
   These commands are called with the argument test or repair.

If any of these checks fails, watchdog will cause a shutdown. Should any
of these tests except the user-defined binary last longer than one
minute, the machine will be rebooted, too.

Available command line options are the following:

-v, --verbose
    Set verbose mode. Only implemented if compiled with the SYSLOG
    feature. This mode logs several pieces of information to LOG_DAEMON
    with priority LOG_DEBUG. This is useful if you want to see exactly
    what happened before the watchdog rebooted the system. Currently it
    logs the temperature (if available), the load average, the change
    date of the files it checks and how often it went to sleep. You can
    use this option twice to enable more verbose debug messages for
    testing.

-s, --sync
    Try to synchronize the filesystem every time the process is awake.
    Note that the system is rebooted if for any reason the
    synchronizing lasts longer than a minute.

-b, --softboot
    Soft-boot the system if an error occurs during the main loop, e.g.
    if a given file is not accessible via the stat(2) call. Note that
    this does not apply to the opening of /dev/watchdog and
    /proc/loadavg, which are opened before the main loop starts. This
    is now implemented by disabling the error re-try timer.

-F, --foreground
    Run in foreground mode, useful for running under systemd (for
    example).

-f, --force
    Force the usage of the interval given or the maximal load average
    given in the config file. Without this option these values are
    sanity checked.

-c config-file, --config-file config-file
    Use config-file as the configuration file instead of the default
    /etc/watchdog.conf.

-q, --no-action
    Do not reboot or halt the machine. This is for testing purposes.
    All checks are executed and the results are logged as usual, but
    no action is taken. Also, your hardware card or the kernel software
    watchdog driver is not enabled. NOTE: This still allows 'repair'
    actions to run, but the daemon itself will not attempt a reboot.

-X num, --loop-exit num
    Run for 'num' loops, then exit as if SIGTERM was received. Intended
    for test/debug (e.g. using valgrind for checking memory access). If
    the daemon exits on a loop counter and you have the
    CONFIG_WATCHDOG_NOWAYOUT option compiled for the kernel or
    device-driver, then an unplanned reboot will follow - be warned!

After watchdog starts, it puts itself into the background and then
tries all checks specified in its configuration file in turn. Between
any two tests it will write to the kernel device to prevent a reset.
After finishing all tests watchdog goes to sleep for some time. The
kernel driver expects a write to the watchdog device every minute.
Otherwise the system will be reset. watchdog will sleep for a
configurable interval that defaults to 1 second to make sure it
triggers the device early enough.

Under high system load watchdog might be swapped out of memory and may
not be swapped back in in time. Under these circumstances the Linux
kernel will reset the machine. To avoid unnecessary reboots, make sure
you have the variable realtime set to yes in the configuration file
watchdog.conf. This adds real-time support to watchdog: it will lock
itself into memory and there should be no problem even under the
highest of loads.

On a system running out of memory the kernel will try to free enough
memory by killing processes. The watchdog daemon itself is exempted
from this so-called out-of-memory killer.

You can also specify a maximal allowed load average. Once this load
average is reached the system is rebooted. You may specify maximal load
averages for 1 minute, 5 minutes or 15 minutes. By default this test is
disabled. Be careful not to set this parameter too low. To set a value
less than the predefined minimal value of 2, you have to use the -f
option.
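
The comparison can be sketched as follows, assuming the usual
/proc/loadavg layout in which the first three fields are the 1, 5 and
15 minute averages. load_exceeds is a hypothetical helper, not part of
watchdog, and treating a load exactly equal to the limit as "reached"
is an assumption taken from the wording above.

```shell
# load_exceeds FILE MAX1 [MAX5 [MAX15]]
# Exit status 0 if any average in FILE has reached its limit, 1 otherwise.
# A limit of 0 (or an omitted limit) disables that comparison.
load_exceeds() {
    awk -v m1="${2:-0}" -v m5="${3:-0}" -v m15="${4:-0}" '
        NR == 1 {
            if ((m1  > 0 && $1 >= m1) ||
                (m5  > 0 && $2 >= m5) ||
                (m15 > 0 && $3 >= m15))
                hit = 1
        }
        END { exit(hit ? 0 : 1) }
    ' "$1"
}
```

A call such as load_exceeds /proc/loadavg 24 18 12 succeeds only when
one of the three limits has been reached.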

You can also specify a minimal amount of virtual memory you want to
have available as free. As soon as more virtual memory is used, action
is taken by watchdog. Note, however, that watchdog does not
distinguish between different types of memory usage. It just checks for
free virtual memory.

If you have a machine with temperature sensor(s) you can specify the
maximal allowed temperature. Once this temperature is reached on any
sensor the system is powered off. The default value is 90 C. Typically
the temperature information is provided by the sensors package as files
in the virtual filesystem /sys/device and can be found using, for
example, the command

    find /sys -name 'temp*input' -print

These files hold the temperature in milli-Celsius. You can have
multiple sensors used in the config file. For example, to change to a
75 C maximum and to check two virtual files for the system temperature
you might have this:

    max-temperature = 75
    temperature-sensor = /sys/class/hwmon/hwmon0/device/temp1_input
    temperature-sensor = /sys/class/hwmon/hwmon0/device/temp2_input

The watchdog will issue warnings once the temperature reaches 90%, 95%
and 98% of the configured maximum temperature.
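
To make the units concrete, here is a small sketch of how a raw sensor
reading relates to the warning levels. temp_level is a hypothetical
helper, and using "greater than or equal" at each threshold is an
assumption; the daemon's exact comparison may differ.

```shell
# temp_level MILLI_C MAX_C
# Prints ok, warn90, warn95, warn98 or critical for one sensor reading.
# MILLI_C is the raw value from a temp*_input file (milli-Celsius);
# MAX_C is the configured max-temperature in whole degrees Celsius.
temp_level() {
    reading=$1
    max=$2
    # MAX_C * 1000 * pct / 100 simplifies to MAX_C * 10 * pct,
    # which keeps everything in integer arithmetic.
    if   [ "$reading" -ge $((max * 1000)) ];    then echo critical
    elif [ "$reading" -ge $((max * 10 * 98)) ]; then echo warn98
    elif [ "$reading" -ge $((max * 10 * 95)) ]; then echo warn95
    elif [ "$reading" -ge $((max * 10 * 90)) ]; then echo warn90
    else echo ok
    fi
}
```

With max-temperature = 75, the first warning corresponds to a file
content of 67500, and power-off to 75000.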

When using file mode watchdog will try to stat(2) the given files.
Errors returned by stat will not cause a reboot. For a reboot the stat
call has to last at least the re-try time-out value (default 1 minute).
This may happen if the file is located on an NFS mounted filesystem. If
your system relies on an NFS mounted filesystem you might try this
option. However, in such a case the sync option may not work if the NFS
server is not answering.
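
The distinction between a stat call that fails and one that hangs can
be illustrated from the shell. This sketch assumes GNU coreutils
timeout(1), which exits with status 124 when the time limit expires;
check_file is a hypothetical name. Only the "hung" outcome corresponds
to the reboot case described above.

```shell
# check_file PATH LIMIT_SECONDS
# Prints ok, error (stat failed quickly) or hung (stat did not return).
check_file() {
    timeout "$2" stat "$1" >/dev/null 2>&1
    case $? in
        0)   echo ok ;;
        124) echo hung ;;    # GNU timeout's exit status on expiry
        *)   echo error ;;
    esac
}
```

For example, check_file /mnt/nfs/heartbeat 60 would report "hung" if
the NFS server stopped answering for a full minute (the path is made up
for illustration).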

watchdog can read the pid from a pid file and see whether the process
still exists. If not, action is taken by watchdog. So you can, for
instance, restart the server from your repair binary. See the Systemd
section below for additional information.
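
A comparable existence check can be written in shell with signal 0,
which tests for a process without delivering anything to it. pid_alive
is a hypothetical helper, and this is a sketch of the idea rather than
the daemon's code. Note that kill -0 also fails for a process owned by
another user unless the caller is root.

```shell
# pid_alive PIDFILE
# Exit status 0 if the pid recorded in PIDFILE names an existing process.
pid_alive() {
    [ -r "$1" ] || return 1
    # Take the first line and drop any surrounding whitespace.
    pid=$(head -n 1 "$1" | tr -d '[:space:]')
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;    # reject empty or non-numeric content
    esac
    kill -0 "$pid" 2>/dev/null      # signal 0: existence check only
}
```

A repair script might use it as pid_alive /var/run/myserver.pid ||
restart the server, where the pid file name is an assumption.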

watchdog will periodically try to fork itself to see whether the
process table is full. This process will leave a zombie process until
watchdog wakes up again and catches it; this is harmless, don't worry
about it.

In ping mode watchdog tries to ping the given IPv4 addresses. These
addresses do not have to be a single machine. It is possible to ping a
broadcast address instead to see if at least one machine in a subnet is
still alive.

Do not use this broadcast ping unless your MIS person a) knows about it
and b) has given you explicit permission to use it!

watchdog will send out three ping packets and wait up to <interval>
seconds for the reply, with <interval> being the time it goes to sleep
between two times triggering the watchdog device. Thus an unreachable
network will not cause a hard reset but a soft reboot.

You can also test passively for an unreachable network by just
monitoring a given interface for traffic. If no traffic arrives the
network is considered unreachable, causing a soft reboot or action from
the repair binary.
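
Passive monitoring of this kind can be approximated by sampling the
receive counter of an interface twice and comparing the values. The
sketch below parses a file in /proc/net/dev format; that the daemon
looks at received bytes in this file is an assumption, and rx_bytes and
traffic_seen are hypothetical helpers.

```shell
# rx_bytes FILE IFACE
# Prints the received-bytes counter for IFACE from a /proc/net/dev
# style FILE; prints nothing if the interface is not listed.
rx_bytes() {
    awk -v ifc="$2" '
        NR > 2 {                        # skip the two header lines
            sub(/:/, " ")               # "eth0:123" and "eth0: 123" both split
            if ($1 == ifc) { print $2; exit }
        }
    ' "$1"
}

# traffic_seen BEFORE AFTER
# Exit status 0 if both samples exist and the counter changed.
traffic_seen() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}
```

Two calls to rx_bytes /proc/net/dev eth0 separated by a sleep give the
BEFORE and AFTER values for traffic_seen.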

To start the watchdog when network is available see the Systemd section
below.

watchdog can run an external command for user-defined tests. A return
code not equal to 0 means an error occurred and watchdog should react.
If the external command is killed by an uncaught signal this is
considered an error by watchdog too. The command may take longer than
the time slice defined for the kernel device without a problem.
However, error messages are generated into the syslog facility. If you
have enabled softboot on error the machine will be rebooted if the
binary doesn't exit in half the time watchdog sleeps between two tries
triggering the kernel device.

If you specify a repair binary it will be started instead of shutting
down the system. If this binary is not able to fix the problem watchdog
will still cause a reboot afterwards.

If the machine is halted an email is sent to notify a human that the
machine is going down. Starting with version 4.4 watchdog will also
notify the human in charge if the machine is rebooted.

The re-try timer applies to most errors, except reset/reboot calls and
excessive temperature. It allows a given error source to recover, and
treats most tests in this way. Exceptions are the file handle test,
load averages, and system memory. If set to the minimum time of 1
second it will still allow a single re-try at any polling interval of
the system.

A soft reboot (i.e. controlled shutdown and reboot) is initiated for
every error that is found. Since there might be no more processes
available, watchdog does it all by itself. That means:

1. Kill all processes with SIGTERM.

2. After a short pause kill all remaining processes with SIGKILL.

3. Record a shutdown entry in wtmp.

4. Save the random seed from /dev/urandom. If the device is
   nonexistent or there is no filename for saving, this step is
   skipped.

5. Turn off accounting.

6. Turn off quota and swap.

7. Unmount all partitions.

8. Finally reboot.

If the return code of the check binary is not zero watchdog will assume
an error and reboot the system. Be careful with this if you are using
the real-time properties of watchdog since watchdog will wait for the
return of this binary before proceeding. An exit code smaller than 245
is interpreted as a system error code (see errno.h for details). Values
of 245 or larger are special to watchdog:

255 (based on -1 as unsigned 8-bit number) Reboot the system. This is
    not exactly an error message but a command to watchdog. With this
    return code watchdog will not try to run a shutdown script
    instead.

254 Reset the system. This is not exactly an error message but a
    command to watchdog. With this return code watchdog will attempt to
    hard-reset the machine without attempting any sort of orderly
    stopping of processes, unmounting of file systems, etc.

253 Maximum load average exceeded.

252 The temperature inside is too high.

251 /proc/loadavg contains no (or not enough) data.

250 The given file was not changed in the given interval.

249 /proc/meminfo contains invalid data.

248 Child process was killed by a signal.

247 Child process did not return in time.

246 Free for personal watchdog-specific use (was -10 as an unsigned
    8-bit number).

With enforcing SELinux policy please use the
/usr/libexec/watchdog/scripts/ directory for your test-binary
configuration.

245 Reserved for an unknown result, for example a slow background test
    that is still running, so neither a success nor an error.
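
When reading logs or writing wrapper scripts it can help to have the
table above as a lookup. The function below simply restates that table;
describe_check_code is a hypothetical name and the message wording is
abbreviated from the list.

```shell
# describe_check_code CODE
# Prints the meaning of a check-binary exit status as listed above.
describe_check_code() {
    case "$1" in
        0)   echo "success" ;;
        255) echo "reboot requested" ;;
        254) echo "hard reset requested" ;;
        253) echo "maximum load average exceeded" ;;
        252) echo "temperature too high" ;;
        251) echo "/proc/loadavg contains no (or not enough) data" ;;
        250) echo "file not changed in the given interval" ;;
        249) echo "/proc/meminfo contains invalid data" ;;
        248) echo "child process killed by a signal" ;;
        247) echo "child process did not return in time" ;;
        246) echo "free for personal watchdog-specific use" ;;
        245) echo "unknown result" ;;
        *)   echo "system error code $1 (see errno.h)" ;;
    esac
}
```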

The repair binary is started with one parameter: the error number that
caused watchdog to initiate the reboot process. After trying to repair
the system the binary should exit with 0 if the system was successfully
repaired and thus there is no need to reboot anymore. A return value
not equal to 0 tells watchdog to reboot. The return code of the repair
binary should be the error number of the error causing watchdog to
reboot. Be careful with this if you are using the real-time properties
since watchdog will wait for the return of this binary before
proceeding.
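
The calling convention can be sketched as a shell function. Everything
specific here is an assumption made for illustration: repair_main is a
hypothetical name, RESTART_NET_CMD stands for whatever site-specific
command brings networking back, and 101 is the value of ENETUNREACH on
Linux.

```shell
# repair_main ERRNO
# Hypothetical repair logic. RESTART_NET_CMD is an assumed, site-specific
# command; it is not provided by watchdog.
repair_main() {
    err=$1
    case "$err" in
        101)    # ENETUNREACH on Linux
            if ${RESTART_NET_CMD:-false}; then
                return 0            # repaired: no reboot needed
            fi
            ;;
    esac
    return "$err"                   # hand the original error back
}
```

A repair script built on this would end with repair_main "$1" followed
by exit $?, so that an unrepaired error reaches watchdog unchanged.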

The configuration file parameter repair-maximum controls the number of
successive repair attempts that report 0 (i.e. success) but fail to
clear the tested fault. If this is exceeded then a reboot takes place.
If set to zero then a reboot can always be blocked by the repair
program reporting success.

With enforcing SELinux policy please use the
/usr/libexec/watchdog/scripts/ directory for your repair-binary
configuration.

Executables placed in the test directory are discovered by watchdog on
startup and are automatically executed. They are bounded time-wise by
the test-timeout directive in watchdog.conf.

These executables are called with either "test" as the first argument
(if a test is being performed) or "repair" as the first argument (if a
repair for a previously-failed "test" operation is being performed).

As with test binaries and repair binaries, the expected exit code for a
successful test or repair operation is always zero.

If an executable's test operation fails, the same executable is
automatically called with the "repair" argument as well as the return
code of the previously-failed test operation.

For example, if the following execution returns 42:

    /etc/watchdog.d/my-test test

The watchdog daemon will attempt to repair the problem by calling:

    /etc/watchdog.d/my-test repair 42

This enables administrators and application developers to make
intelligent test/repair commands. If the "repair" operation is not
required (or is not likely to succeed), it is important that the author
of the command return a non-zero value so the machine will still reboot
as expected.

Note that the watchdog daemon may interpret and act upon any of the
reserved return codes noted in the Check Binary section prior to
calling a given command in "repair" mode.

As for the repair binary, the configuration parameter repair-maximum
also controls the number of successive repair attempts that report
success (return 0) but fail to clear the fault.
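
A combined test/repair command along these lines might look like the
sketch below. The check itself is invented for illustration:
watchdog_d_main is a hypothetical name, SPOOL_DIR is an assumed
directory that must exist and be writable, and 42 is an arbitrary
failure code below 245.

```shell
# watchdog_d_main MODE [CODE]
# Hypothetical body for an executable in /etc/watchdog.d.
# SPOOL_DIR is an assumed directory that must stay writable.
watchdog_d_main() {
    dir=${SPOOL_DIR:-/var/spool/myapp}
    case "$1" in
        test)
            [ -d "$dir" ] && [ -w "$dir" ] && return 0
            return 42               # arbitrary code below 245
            ;;
        repair)
            # $2 is the code the failed test returned.
            [ "$2" = 42 ] && mkdir -p "$dir" 2>/dev/null && return 0
            return "${2:-1}"        # could not repair: let watchdog reboot
            ;;
        *)
            return 1
            ;;
    esac
}
```

Saved as an executable script that ends with watchdog_d_main "$@" and
exit $?, it follows the test and repair 42 calling sequence shown
above.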

To start watchdog after the network is available:

    systemctl disable watchdog
    systemctl enable NetworkManager-wait-online
    systemctl enable watchdog-ping

When using a pid check for a custom service that has its own systemd
unit file, please be aware that "Requires=" causes the dependent
service to be deactivated. Using "Before=watchdog.service" or
"Before=watchdog-ping.service" in the custom service unit file may be
the desired operation instead. See systemd.unit documentation for more
details.

The directories /etc/watchdog.d/ and /usr/libexec/watchdog/scripts/ are
recognized locations for custom executables.

None known so far.

The original code is an example written by Alan Cox
<alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All
additions were written by Michael Meskes <meskes@debian.org>. Johnie
Ingram <johnie@netgod.net> had the idea of testing the load average. He
also took over the Debian specific work. Dave Cinege
<dcinege@psychosis.com> brought up some hardware watchdog issues and
helped testing this stuff.

/dev/watchdog
    The watchdog device.

/var/run/watchdog.pid
    The pid file of the running watchdog.

watchdog.conf(5), systemd.unit(5)

4th Berkeley Distribution        February 2019                WATCHDOG(8)