WATCHDOG(8)              System Manager's Manual              WATCHDOG(8)
2
3
4
NAME
6 watchdog - a software watchdog daemon
7
SYNOPSIS
watchdog [-F|--foreground] [-f|--force] [-c filename|--config-file
filename] [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]
[-X num|--loop-exit num]
11
DESCRIPTION
13 The Linux kernel can reset the system if serious problems are detected.
14 This can be implemented via special watchdog hardware, or via a
15 slightly less reliable software-only watchdog inside the kernel. Either
16 way, there needs to be a daemon that tells the kernel the system is
17 working fine. If the daemon stops doing that, the system is reset.
18
watchdog is such a daemon. It opens /dev/watchdog and keeps writing to
it often enough to keep the kernel from resetting, at least once per
minute. Each write postpones the reboot by another minute. After a
minute of inactivity the watchdog hardware will cause the reset. In the
case of the software watchdog the ability to reboot depends on the
state of the machine and its interrupts.
25
26 The watchdog daemon can be stopped without causing a reboot if the
27 device /dev/watchdog is closed correctly, unless your kernel is com‐
28 piled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
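
As an illustration of the keep-alive protocol described above, the
following Python sketch writes to an already opened watchdog file
descriptor and later closes it cleanly. The function names are
hypothetical and this is not the daemon's own code. Writing the
character V before closing is the "magic close" convention of the Linux
watchdog driver API; whether a particular driver honours it is an
assumption here.

```python
import os


def keep_alive(fd):
    """Write one byte to the watchdog device to postpone the reset."""
    os.write(fd, b"\0")


def close_cleanly(fd):
    """Send the magic close character, then close the device.

    With CONFIG_WATCHDOG_NOWAYOUT the timer keeps running regardless,
    so the machine will still be reset after the timeout.
    """
    os.write(fd, b"V")
    os.close(fd)
```

On a real system the descriptor would come from
os.open("/dev/watchdog", os.O_WRONLY). Opening the device arms the
timer, so this should only be tried on a machine that may be reset.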
29
TESTS
31 The watchdog daemon does several tests to check the system status:
32
33 · Is the process table full?
34
35 · Is there enough free memory?
36
37 · Is there enough allocatable memory?
38
39 · Are some files accessible?
40
41 · Have some files changed within a given interval?
42
43 · Is the average work load too high?
44
45 · Has a file table overflow occurred?
46
47 · Is a process still running? The process is specified by a pid file.
48
49 · Do some IP addresses answer to ping?
50
51 · Do network interfaces receive traffic?
52
53 · Is the temperature too high? (Temperature data not always avail‐
54 able.)
55
56 · Execute a user defined command to do arbitrary tests.
57
58 · Execute one or more test/repair commands found in /etc/watchdog.d.
59 These commands are called with the argument test or repair.
60
If any of these checks fails, watchdog will cause a shutdown. Should
any of these tests, except the user-defined binary, last longer than
one minute, the machine will be rebooted, too.
64
OPTIONS
66 Available command line options are the following:
67
68 -v, --verbose
Set verbose mode. Only implemented if compiled with the SYSLOG
feature. This mode logs several pieces of information to the
LOG_DAEMON facility with priority LOG_DEBUG. This is useful if you
want to see exactly what happened before watchdog rebooted the
system. Currently it logs the temperature (if available), the load
average, the change date of the files it checks and how often it
went to sleep.
76
77 -s, --sync
78 Try to synchronize the filesystem every time the process is
79 awake. Note that the system is rebooted if for any reason the
80 synchronizing lasts longer than a minute.
81
82 -b, --softboot
83 Soft-boot the system if an error occurs during the main loop,
84 e.g. if a given file is not accessible via the stat(2) call.
85 Note that this does not apply to the opening of /dev/watchdog
86 and /proc/loadavg, which are opened before the main loop starts.
This is now implemented by disabling the error re-try timer.
88
89 -F, --foreground
90 Run in foreground mode, useful for running under systemd (for
91 example).
92
93 -f, --force
94 Force the usage of the interval given or the maximal load aver‐
95 age given in the config file. Without this option these values
96 are sanity checked.
97
98 -c config-file, --config-file config-file
99 Use config-file as the configuration file instead of the default
100 /etc/watchdog.conf.
101
102 -q, --no-action
103 Do not reboot or halt the machine. This is for testing purposes.
104 All checks are executed and the results are logged as usual, but
105 no action is taken. Also your hardware card or the kernel soft‐
106 ware watchdog driver is not enabled. NOTE: This still allows
107 'repair' actions to run, but the daemon itself will not attempt
108 a reboot.
109
110 -X num, --loop-exit num
111 Run for 'num' loops then exit as if SIGTERM was received.
112 Intended for test/debug (e.g. using valgrind for checking memory
113 access). If the daemon exits on a loop counter and you have the
114 CONFIG_WATCHDOG_NOWAYOUT option compiled for the kernel or
115 device-driver then an unplanned reboot will follow - be warned!
116
FUNCTION
After watchdog starts, it puts itself into the background and then
tries all checks specified in its configuration file in turn. Between
each two tests it will write to the kernel device to prevent a reset.
After finishing all tests watchdog goes to sleep for some time. The
kernel driver expects a write to the watchdog device every minute;
otherwise the system will be reset. watchdog sleeps for a configurable
interval that defaults to 1 second to make sure it triggers the device
early enough.
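
The cycle described above can be sketched in Python as follows. This is
a simplified model under stated assumptions, not the daemon's actual
loop: run_cycle, checks and keep_alive are hypothetical names, each
check is assumed to be a callable returning 0 on success, and error
handling (repair, re-try timer, reboot) is left out.

```python
import time


def run_cycle(checks, keep_alive, interval=1.0, sleep=time.sleep):
    """Run every check once, triggering the device between tests.

    Returns the first non-zero result, or 0 if all checks passed.
    """
    for check in checks:
        result = check()
        keep_alive()          # write to the kernel device between tests
        if result != 0:
            return result     # caller decides on repair or reboot
    sleep(interval)           # the interval defaults to 1 second
    return 0
```

Passing sleep as a parameter is only a convenience for trying the
function without waiting.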
126
Under high system load watchdog might be swapped out of memory and may
fail to be swapped back in in time. Under these circumstances the Linux
kernel will reset the machine. To avoid unnecessary reboots, make sure
the variable realtime is set to yes in the configuration file
watchdog.conf. This adds real-time support to watchdog: it will lock
itself into memory, and there should be no problem even under the
highest of loads.
134
On a system running out of memory the kernel will try to free enough
memory by killing processes. The watchdog daemon itself is exempted
from this so-called out-of-memory killer.
138
You can also specify a maximal allowed load average. Once this load
average is reached the system is rebooted. You may specify maximal load
averages for 1 minute, 5 minutes or 15 minutes. By default this test is
disabled. Be careful not to set this parameter too low. To set a value
less than the predefined minimal value of 2, you have to use the -f
option.
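
A load average test of this kind might look like the Python sketch
below. The helper names are hypothetical. Treating a limit of 0 as
"test disabled" and comparing with >= (to match "reached") are
assumptions made for the illustration, not confirmed details of the
daemon.

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("/proc/loadavg contains no (or not enough) data")
    return tuple(float(field) for field in fields[:3])


def load_too_high(loads, limits):
    """Return True if any load reaches its limit.

    Assumption: a limit of 0 means that the test is disabled.
    """
    return any(limit > 0 and load >= limit
               for load, limit in zip(loads, limits))
```

A caller would read the text from /proc/loadavg and pass the three
configured maxima as limits.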
145
146 You can also specify a minimal amount of virtual memory you want to
147 have available as free. As soon as more virtual memory is used action
148 is taken by watchdog. Note, however, that watchdog does not distin‐
149 guish between different types of memory usage. It just checks for free
150 virtual memory.
151
If you have a machine with temperature sensor(s) you can specify the
maximal allowed temperature. Once this temperature is reached on any
sensor the system is powered off. The default value is 90 C. Typically
the temperature information is provided by the sensors package as files
in the virtual filesystem /sys/devices and can be found using, for
example, the command
158
    find /sys -name 'temp*input' -print
160
These files hold the temperature in milli-Celsius. You can use multiple
sensors in the config file. For example, to lower the maximum to 75 C
and to check two virtual files for the system temperature you might
have this:
165
    max-temperature = 75
    temperature-sensor = /sys/class/hwmon/hwmon0/device/temp1_input
    temperature-sensor = /sys/class/hwmon/hwmon0/device/temp2_input
169
watchdog will issue warnings once the temperature reaches 90%, 95% and
98% of the configured maximum temperature.
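
To make the arithmetic concrete, the following Python sketch converts a
milli-Celsius reading and applies the thresholds named above. The
function name and the returned labels are hypothetical; only the 90%,
95% and 98% warning levels, the 90 C default and the power-off at the
maximum come from this page.

```python
def temperature_status(milli_celsius, max_celsius=90):
    """Classify a sensor reading given in milli-Celsius."""
    celsius = milli_celsius / 1000.0
    if celsius >= max_celsius:
        return "poweroff"
    # Check the highest warning level first.
    for percent in (98, 95, 90):
        if celsius >= max_celsius * percent / 100.0:
            return "warning-%d" % percent
    return "ok"
```

With max-temperature = 75 the warning levels fall at 67.5 C, 71.25 C
and 73.5 C, so a sensor file containing 72000 would be classified as
warning-95.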
172
173 When using file mode watchdog will try to stat(2) the given files.
174 Errors returned by stat will not cause a reboot. For a reboot the stat
175 call has to last at least the re-try time-out value (default 1 minute).
176 This may happen if the file is located on an NFS mounted filesystem. If
177 your system relies on an NFS mounted filesystem you might try this
178 option. However, in such a case the sync option may not work if the
179 NFS server is not answering.
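
The file tests listed earlier (accessibility and change within a given
interval) both rest on stat(2). A minimal Python sketch of the change
test follows; the function name and the change_interval parameter are
hypothetical, and the time-out handling for a hanging stat call is not
modelled.

```python
import os
import time


def file_is_stale(path, change_interval, now=None):
    """Return True if path was last modified more than
    change_interval seconds before now."""
    if now is None:
        now = time.time()
    # os.stat raises OSError if the file is not accessible.
    return now - os.stat(path).st_mtime > change_interval
```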
180
watchdog can read the pid from a pid file and see whether the process
still exists. If not, action is taken by watchdog. So you can, for
instance, restart the server from your repair binary. See the Systemd
section below for additional information.
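
One common way to perform such a pid file check is to send signal 0 to
the process, which tests for existence without delivering a signal. The
Python sketch below illustrates that technique; it is an assumption
that the daemon works the same way, and process_alive is a hypothetical
name.

```python
import os


def process_alive(pidfile):
    """Return True if the pid stored in pidfile refers to an existing process."""
    with open(pidfile) as handle:
        pid = int(handle.read().strip())
    try:
        os.kill(pid, 0)       # signal 0: existence check only
    except ProcessLookupError:
        return False          # no such process
    except PermissionError:
        return True           # process exists but belongs to another user
    return True
```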
185
186 watchdog will try periodically to fork itself to see whether the
187 process table is full. This process will leave a zombie process until
188 watchdog wakes up again and catches it; this is harmless, don't worry
189 about it.
190
In ping mode watchdog tries to ping the given IPv4 addresses. An
address does not have to belong to a single machine. It is possible to
ping a broadcast address instead to see if at least one machine in a
subnet is still alive.
195
196 Do not use this broadcast ping unless your MIS person a) knows about it
197 and b) has given you explicit permission to use it!
198
watchdog will send out three ping packets and wait up to <interval>
seconds for the reply, with <interval> being the time it goes to sleep
between two times triggering the watchdog device. Thus an unreachable
network will not cause a hard reset but a soft reboot.
203
204 You can also test passively for an unreachable network by just monitor‐
205 ing a given interface for traffic. If no traffic arrives the network is
206 considered unreachable causing a soft reboot or action from the repair
207 binary.
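
A passive traffic test can be approximated by comparing the receive
counter of an interface between two polls. The Python sketch below
parses text in the format of /proc/net/dev; that the daemon uses this
file and this counter is an assumption, and both function names are
hypothetical.

```python
def received_bytes(net_dev_text, interface):
    """Return the receive byte counter for interface from /proc/net/dev text."""
    for line in net_dev_text.splitlines():
        name, sep, counters = line.partition(":")
        if sep and name.strip() == interface:
            return int(counters.split()[0])
    raise KeyError(interface)


def traffic_seen(previous, current):
    """Return True if the counter changed between two polls."""
    return current != previous
```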
208
To start watchdog when the network is available see the Systemd section
below.
211
watchdog can run an external command for user-defined tests. A return
code not equal to 0 means an error occurred and watchdog should react.
If the external command is killed by an uncaught signal this is
considered an error by watchdog too. The command may take longer than
the time slice defined for the kernel device without a problem.
However, error messages are then logged via syslog. If you have enabled
softboot on error the machine will be rebooted if the binary doesn't
exit in half the time watchdog sleeps between two tries triggering the
kernel device.
221
222 If you specify a repair binary it will be started instead of shutting
223 down the system. If this binary is not able to fix the problem watchdog
224 will still cause a reboot afterwards.
225
226 If the machine is halted an email is sent to notify a human that the
227 machine is going down. Starting with version 4.4 watchdog will also
228 notify the human in charge if the machine is rebooted.
229
230 The re-try timer applies to most errors, except reset/reboot calls and
231 too hot. It allows a given error source to recover, and treats most
232 tests in this way. Exceptions are file handle test, load averages, and
233 system memory. If set to the minimum time of 1 second it will still
234 allow a single re-try at any polling interval of the system.
235
SOFT REBOOT
A soft reboot (i.e. controlled shutdown and reboot) is initiated for
every error that is found. Since there might be no more processes
available, watchdog does it all by itself. That means:
240
241 1. Kill all processes with SIGTERM.
242
243 2. After a short pause kill all remaining processes with SIGKILL.
244
245 3. Record a shutdown entry in wtmp.
246
4. Save the random seed from /dev/urandom. If the device is nonexistent
or there is no filename for saving, this step is skipped.
249
250 5. Turn off accounting.
251
252 6. Turn off quota and swap.
253
254 7. Unmount all partitions except the root partition.
255
256 8. Remount the root partition read-only.
257
258 9. Shut down all network interfaces.
259
260 10. Finally reboot.
261
CHECK BINARY
If the return code of the check binary is not zero watchdog will assume
an error and reboot the system. Be careful with this if you are using
the real-time properties of watchdog since watchdog will wait for the
return of this binary before proceeding. An exit code smaller than 245
is interpreted as a system error code (see errno.h for details). Values
of 245 or larger are special to watchdog:
269
270 255 (based on -1 as unsigned 8-bit number)
Reboot the system. This is not exactly an error message but a
command to watchdog. With this return code watchdog will not try
to run a shutdown script instead.
274
254 Reset the system. This is not exactly an error message but a
command to watchdog. With this return code watchdog will attempt
to hard-reset the machine without any sort of orderly stopping of
processes, unmounting of file systems, etc.
280
281 253 Maximum load average exceeded.
282
283 252 The temperature inside is too high.
284
285 251 /proc/loadavg contains no (or not enough) data.
286
287 250 The given file was not changed in the given interval.
288
289 249 /proc/meminfo contains invalid data.
290
291 248 Child process was killed by a signal.
292
293 247 Child process did not return in time.
294
295 246 Free for personal watchdog-specific use (was -10 as an unsigned
296 8-bit number).
297
298 With enforcing SELinux policy please use the /usr/libexec/watch‐
299 dog/scripts/ for your test-binary configuration.
300
301 245 Reserved for an unknown result, for example a slow background
302 test that is still running so neither a success nor an error.
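
For scripts that need to interpret these codes, the table above can be
written down as a small lookup. The Python sketch below is only an
illustration: SPECIAL_CODES and describe_exit_code are hypothetical
names, and codes below 245 are passed to os.strerror on the assumption
that they are errno values as stated above.

```python
import os

# Return codes that are special to watchdog, as listed above.
SPECIAL_CODES = {
    255: "reboot the system",
    254: "reset the system",
    253: "maximum load average exceeded",
    252: "temperature too high",
    251: "/proc/loadavg contains no (or not enough) data",
    250: "file was not changed in the given interval",
    249: "/proc/meminfo contains invalid data",
    248: "child process was killed by a signal",
    247: "child process did not return in time",
    246: "free for personal watchdog-specific use",
    245: "reserved for an unknown result",
}


def describe_exit_code(code):
    """Return a human-readable description of a check binary exit code."""
    if code == 0:
        return "success"
    if code in SPECIAL_CODES:
        return SPECIAL_CODES[code]
    return os.strerror(code)  # codes below 245 are system error codes
```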
303
REPAIR BINARY
The repair binary is started with one parameter: the error number that
caused watchdog to initiate the reboot process. After trying to repair
the system the binary should exit with 0 if the system was successfully
repaired and thus there is no need to reboot anymore. A return value
not equal to 0 tells watchdog to reboot. The return code of the repair
binary should be the error number of the error causing watchdog to
reboot. Be careful with this if you are using the real-time properties
since watchdog will wait for the return of this binary before
proceeding.
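
The calling convention for a repair binary can be sketched in Python as
below. repair_main and the repair callable are hypothetical names; the
callable is assumed to return True when it fixed the problem.

```python
def repair_main(argv, repair):
    """Entry point sketch: argv[1] is the error number passed by watchdog."""
    error = int(argv[1])
    if repair(error):
        return 0              # repaired, no reboot needed
    return error              # hand the original error back to watchdog
```

An executable wrapper would finish with sys.exit(repair_main(sys.argv,
my_repair)), where my_repair is whatever site-specific function
attempts the fix, and it should return quickly because watchdog waits
for it.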
313
314 The configuration file parameter repair-maximum controls the number of
315 successive repair attempts that report 0 (i.e. success) but fail to
316 clear the tested fault. If this is exceeded then a reboot takes place.
317 If set to zero then a reboot can always be blocked by the repair pro‐
318 gram reporting success.
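
The effect of repair-maximum can be modelled with a small counter, as
in the Python sketch below. The class and method names are
hypothetical, and the exact point at which the daemon counts or resets
attempts is an assumption; only the rule "reboot once the limit is
exceeded, never if the limit is zero" is taken from the text.

```python
class RepairCounter:
    """Count successive repairs that reported success without clearing the fault."""

    def __init__(self, repair_maximum):
        self.repair_maximum = repair_maximum
        self.count = 0

    def test_passed(self):
        """The tested fault has cleared, so start counting afresh."""
        self.count = 0

    def repair_reported_success(self):
        """Record a repair that returned 0 while the test still fails.

        Returns True if a reboot should now take place.
        """
        self.count += 1
        return self.repair_maximum > 0 and self.count > self.repair_maximum
```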
319
320 With enforcing SELinux policy please use the /usr/libexec/watch‐
321 dog/scripts/ for your repair-binary configuration.
322
TEST DIRECTORY
324 Executables placed in the test directory are discovered by watchdog on
325 startup and are automatically executed. They are bounded time-wise by
326 the test-timeout directive in watchdog.conf.
327
These executables are called with either "test" as the first argument
(if a test is being performed) or "repair" as the first argument (if a
repair for a previously failed "test" operation is being performed).

As with test binaries and repair binaries, the expected exit code for a
successful test or repair operation is always zero.
334
335 If an executable's test operation fails, the same executable is auto‐
336 matically called with the "repair" argument as well as the return code
337 of the previously-failed test operation.
338
339 For example, if the following execution returns 42:
340
    /etc/watchdog.d/my-test test
342
343 The watchdog daemon will attempt to repair the problem by calling:
344
    /etc/watchdog.d/my-test repair 42
346
347 This enables administrators and application developers to make intelli‐
348 gent test/repair commands. If the "repair" operation is not required
349 (or is not likely to succeed), it is important that the author of the
350 command return a non-zero value so the machine will still reboot as
351 expected.
352
353 Note that the watchdog daemon may interpret and act upon any of the
354 reserved return codes noted in the Check Binary section prior to call‐
355 ing a given command in "repair" mode.
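
A command for the test directory has to handle both calling modes. The
Python sketch below shows one possible dispatch; dispatch, test and
repair are hypothetical names, and returning 1 for an unrecognised
mode is an assumption made so that misuse is reported as a failure.

```python
def dispatch(argv, test, repair):
    """Handle 'test' and 'repair <code>' as described above.

    test() returns 0 on success or an error code; repair(code)
    returns 0 if the fault was fixed and non-zero otherwise.
    """
    mode = argv[1] if len(argv) > 1 else ""
    if mode == "test":
        return test()
    if mode == "repair" and len(argv) > 2:
        return repair(int(argv[2]))
    return 1                  # assumption: unknown usage counts as failure
```

For the my-test example above, a test function returning 42 leads to a
second call whose repair function receives 42 and decides whether a
reboot can be avoided.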
356
357 As for the repair binary, the configuration parameter repair-maximum
358 also controls the number of successive repair attempts that report suc‐
359 cess (return 0) but fail to clear the fault.
360
SYSTEMD
362 To start watchdog after the network is available:
363
    systemctl disable watchdog
    systemctl enable NetworkManager-wait-online
    systemctl enable watchdog-ping
367
When using the pid check for a custom service that has its own systemd
unit file, please be aware that "Requires=" also stops the dependent
service when the unit it requires is stopped. Using
"Before=watchdog.service" or "Before=watchdog-ping.service" in the
custom service unit file may be the desired operation instead. See the
systemd.unit documentation for more details.
373
374
376 The directories /etc/watchdog.d/ and /usr/libexec/watchdog/scripts/ are
377 recognized locations for custom executables.
378
BUGS
380 None known so far.
381
AUTHORS
383 The original code is an example written by Alan Cox
384 <alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All addi‐
385 tions were written by Michael Meskes <meskes@debian.org>. Johnie Ingram
386 <johnie@netgod.net> had the idea of testing the load average. He also
387 took over the Debian specific work. Dave Cinege <dcinege@psychosis.com>
388 brought up some hardware watchdog issues and helped testing this stuff.
389
FILES
391 /dev/watchdog
392 The watchdog device.
393
394 /var/run/watchdog.pid
395 The pid file of the running watchdog.
396
SEE ALSO
watchdog.conf(5), systemd.unit(5)
399
400
401
4th Berkeley Distribution        January 2016        WATCHDOG(8)