WATCHDOG(8)                System Manager's Manual                WATCHDOG(8)

NAME
watchdog - a software watchdog daemon

SYNOPSIS
watchdog [-F|--foreground] [-f|--force] [-c filename|--config-file
filename] [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]

DESCRIPTION
The Linux kernel can reset the system if serious problems are detected.
This can be implemented via special watchdog hardware, or via a
slightly less reliable software-only watchdog inside the kernel. Either
way, there needs to be a daemon that tells the kernel the system is
working fine. If the daemon stops doing that, the system is reset.

watchdog is such a daemon. It opens /dev/watchdog and keeps writing to
it often enough to keep the kernel from resetting the system, at least
once per minute. Each write delays the reboot by another minute. After
a minute of inactivity the watchdog hardware will cause the reset. In
the case of the software watchdog, the ability to reboot depends on the
state of the machine and its interrupts.

The watchdog daemon can be stopped without causing a reboot if the
device /dev/watchdog is closed correctly, unless your kernel is
compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
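
The write-and-close behaviour described above can be sketched in Python
as follows. This is a minimal illustration, not the daemon's code: the
name keep_alive and its parameters are hypothetical, and the byte "V"
written before closing is the "magic close" character from the kernel's
watchdog API documentation, which this page does not spell out. The
sketch runs against an in-memory buffer so that trying it cannot arm a
real watchdog.

```python
import io
import time

# Assumption: "magic close" byte taken from the kernel watchdog API docs.
MAGIC_CLOSE = b"V"


def keep_alive(dev, beats, interval=1.0, sleep=time.sleep):
    """Write one keepalive per beat to an open device, then disarm it."""
    for _ in range(beats):
        dev.write(b"\0")    # any write restarts the kernel's countdown
        dev.flush()
        sleep(interval)
    dev.write(MAGIC_CLOSE)  # ask the driver to disarm before closing
    dev.flush()


# Demonstration on a buffer. A real daemon would use
# open("/dev/watchdog", "wb", buffering=0), which arms the watchdog.
demo = io.BytesIO()
keep_alive(demo, beats=3, interval=0)
```

With CONFIG_WATCHDOG_NOWAYOUT enabled the final write has no effect and
the timer keeps running after the close.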

TESTS
The watchdog daemon does several tests to check the system status:

· Is the process table full?

· Is there enough free memory?

· Are some files accessible?

· Have some files changed within a given interval?

· Is the average work load too high?

· Has a file table overflow occurred?

· Is a process still running? The process is specified by a pid file.

· Do some IP addresses answer to ping?

· Do network interfaces receive traffic?

· Is the temperature too high? (Temperature data is not always
  available.)

· Execute a user-defined command to do arbitrary tests.

· Execute one or more test/repair commands found in /etc/watchdog.d.
  These commands are called with the argument test or repair.

If any of these checks fails, watchdog will cause a shutdown. Should
any of these tests, except the user-defined binary, last longer than
one minute, the machine will be rebooted too.

OPTIONS
Available command line options are the following:

-v, --verbose
    Set verbose mode. Only implemented if compiled with the SYSLOG
    feature. This mode logs several pieces of information to the
    LOG_DAEMON facility with priority LOG_INFO. This is useful if you
    want to see exactly what happened before watchdog rebooted the
    system. Currently it logs the temperature (if available), the load
    average, the change date of the files it checks and how often it
    went to sleep.

-s, --sync
    Try to synchronize the filesystem every time the process is awake.
    Note that the system is rebooted if for any reason the
    synchronizing lasts longer than a minute.

-b, --softboot
    Soft-boot the system if an error occurs during the main loop, e.g.
    if a given file is not accessible via the stat(2) call. Note that
    this does not apply to the opening of /dev/watchdog and
    /proc/loadavg, which are opened before the main loop starts.

-F, --foreground
    Run in foreground mode, useful for running under systemd (for
    example).

-f, --force
    Force the usage of the interval given or the maximal load average
    given in the config file.

-c config-file, --config-file config-file
    Use config-file as the configuration file instead of the default
    /etc/watchdog.conf.

-q, --no-action
    Do not reboot or halt the machine. This is for testing purposes.
    All checks are executed and the results are logged as usual, but no
    action is taken. Also, your hardware card or the kernel software
    watchdog driver is not enabled. Temperature checking is also
    disabled since this triggers the hardware watchdog on some cards.

FUNCTION
After watchdog starts, it puts itself into the background and then
tries all checks specified in its configuration file in turn. Between
every two tests it writes to the kernel device to prevent a reset.
After finishing all tests watchdog goes to sleep for some time. The
kernel driver expects a write to the watchdog device every minute;
otherwise the system will be reset. By default watchdog sleeps for only
1 second, so it triggers the device early enough.
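
The cycle described above (run the checks in turn, write to the device
between them, then sleep) might look roughly like this in Python. It is
a sketch under stated assumptions: run_cycle, main_loop and their
parameters are hypothetical names, checks are plain callables returning
True on success, and the keepalive write is injected as a function so
that no real device is touched.

```python
import time


def run_cycle(checks, keepalive):
    """Run every check once, writing to the device between checks.

    Returns the name of the first failing check, or None if all passed.
    """
    keepalive()
    for name, check in checks:
        if not check():
            return name      # caller decides: repair, soft reboot, ...
        keepalive()          # write to the device between two tests
    return None


def main_loop(checks, keepalive, interval=1, cycles=None, sleep=time.sleep):
    """Repeat run_cycle, sleeping `interval` seconds after each pass."""
    done = 0
    while cycles is None or done < cycles:
        failed = run_cycle(checks, keepalive)
        if failed is not None:
            return failed
        sleep(interval)
        done += 1
    return None
```

Keeping the sleep short relative to the one-minute device timeout is
what leaves room for slow checks within a single pass.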

Under high system load watchdog might be swapped out of memory and may
fail to be swapped back in in time. Under these circumstances the Linux
kernel will reset the machine. To avoid unnecessary reboots, make sure
the variable realtime is set to yes in the configuration file
watchdog.conf. This adds real-time support to watchdog: it will lock
itself into memory and there should be no problem even under the
highest of loads.

On a system running out of memory the kernel will try to free enough
memory by killing processes. The watchdog daemon itself is exempted
from this so-called out-of-memory killer.

You can also specify a maximal allowed load average. Once this load
average is reached the system is rebooted. You may specify maximal load
averages for 1 minute, 5 minutes or 15 minutes. By default this test is
disabled. Be careful not to set this parameter too low. To set a value
less than the predefined minimal value of 2, you have to use the -f
option.
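
As an illustration of the load test, the following Python sketch
compares the three averages in /proc/loadavg against optional limits.
The function name and parameters are hypothetical, a limit of None
stands for "test disabled", and treating "reached" as greater-or-equal
is an assumption about the exact comparison.

```python
def load_exceeded(loadavg_text, max1=None, max5=None, max15=None):
    """Return True if any configured load limit is reached.

    loadavg_text is the content of /proc/loadavg, for example
    "0.42 0.37 0.30 1/123 4567". A limit of None disables that test.
    """
    fields = loadavg_text.split()
    if len(fields) < 3:
        raise ValueError("not enough data in loadavg")
    loads = [float(f) for f in fields[:3]]   # 1, 5 and 15 minute averages
    limits = [max1, max5, max15]
    return any(lim is not None and load >= lim
               for load, lim in zip(loads, limits))
```

On a live system the text would come from
open("/proc/loadavg").read().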

You can also specify a minimal amount of virtual memory you want to
have available as free. As soon as less than this amount is free,
watchdog takes action. Note, however, that watchdog does not
distinguish between different types of memory usage. It just checks for
free virtual memory.
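
A rough Python sketch of such a memory test is shown below. It rests on
two assumptions that this page does not confirm: that "free virtual
memory" means MemFree plus SwapFree from /proc/meminfo, and that the
limit is given in kB. All names are hypothetical.

```python
def free_virtual_kb(meminfo_text):
    """Sum MemFree and SwapFree (in kB) from /proc/meminfo content."""
    wanted = {"MemFree": None, "SwapFree": None}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            wanted[key] = int(rest.split()[0])   # "1000000 kB" -> 1000000
    if None in wanted.values():
        raise ValueError("meminfo contains invalid data")
    return wanted["MemFree"] + wanted["SwapFree"]


def memory_low(meminfo_text, min_free_kb):
    """True if less than min_free_kb of virtual memory is free."""
    return free_virtual_kb(meminfo_text) < min_free_kb
```

Page cache and buffers are deliberately ignored here, matching the
remark that no distinction is made between types of memory usage.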

If you have a watchdog card with a temperature sensor you can specify
the maximal allowed temperature. Once this temperature is reached the
system is halted. The default value is 120. There is no unit conversion,
so make sure you use the same unit as your hardware. watchdog will
issue warnings once the temperature reaches 90%, 95% and 98% of this
limit.
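
The warning thresholds can be worked out as in this small Python
sketch. The function name and the returned labels are hypothetical, and
using greater-or-equal for "reached" is an assumption.

```python
def temperature_status(temp, max_temp=120):
    """Classify a reading against max_temp (same unit as the hardware)."""
    if temp >= max_temp:
        return "halt"
    for percent in (98, 95, 90):
        # Integer-friendly form of: temp >= max_temp * percent / 100
        if temp * 100 >= max_temp * percent:
            return "warn-%d" % percent
    return "ok"
```

With the default limit of 120 the warnings would start at 108, 114 and
117.6 respectively.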

When using file mode watchdog will try to stat(2) the given files.
Errors returned by stat will not cause a reboot. For a reboot the stat
call has to last at least one minute. This may happen if the file is
located on an NFS mounted filesystem. If your system relies on an NFS
mounted filesystem you might try this option. However, in such a case
the sync option may not work if the NFS server is not answering.
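
The two file tests (accessibility and "changed within a given
interval") can be sketched in Python as below. The function name,
parameters and result strings are hypothetical. The sketch cannot show
the hanging stat call on an unresponsive NFS server; there the call
simply would not return, and it is the missing device write that
resets the machine.

```python
import os
import time


def file_check(path, change=None, now=None):
    """Return "ok", "stat-error" or "not-changed" for one monitored file.

    change is the maximum allowed age of the last modification in
    seconds; None means that only accessibility is tested.
    """
    try:
        st = os.stat(path)
    except OSError:
        return "stat-error"      # reported, but not a reboot by itself
    if change is not None:
        now = time.time() if now is None else now
        if now - st.st_mtime > change:
            return "not-changed"
    return "ok"
```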

watchdog can read the pid from a pid file and see whether the process
still exists. If not, watchdog takes action, so you can, for instance,
restart the server from your repair-binary. See the Systemd section
below for additional information.
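
A pid-file test of this kind is commonly written with signal 0, which
checks for existence without delivering anything. The Python sketch
below shows that general technique; it is not taken from the daemon,
and all function names are hypothetical.

```python
import os


def parse_pid(text):
    """Return the pid stored in pid-file content, or None if malformed."""
    try:
        pid = int(text.split()[0])
    except (ValueError, IndexError):
        return None
    return pid if pid > 0 else None   # 0 and negatives address groups


def process_exists(pid):
    """True if a process with this pid currently exists."""
    try:
        os.kill(pid, 0)               # signal 0: existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True                   # exists, but owned by another user
    return True


def pid_alive(pidfile):
    """True if pidfile is readable and names a running process."""
    try:
        with open(pidfile) as fh:
            pid = parse_pid(fh.read())
    except OSError:
        return False
    return pid is not None and process_exists(pid)
```

Note that a pid can be reused by an unrelated process after the
original one has exited, so this test proves only that some process
holds the number.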

watchdog will periodically try to fork itself to see whether the
process table is full. The forked child remains a zombie process until
watchdog wakes up again and reaps it; this is harmless, so don't worry
about it.
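
The fork probe can be illustrated with a few lines of Python. The
function name is hypothetical, and unlike the daemon this sketch reaps
the child immediately instead of leaving a zombie until the next
wake-up.

```python
import os


def process_table_has_room():
    """Try to fork; a failure suggests that the process table is full."""
    try:
        pid = os.fork()
    except OSError:
        return False
    if pid == 0:
        os._exit(0)          # child: exit at once, skipping cleanup
    os.waitpid(pid, 0)       # parent: reap the child right away
    return True
```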

In ping mode watchdog tries to ping the given IP addresses. An address
does not have to belong to a single machine. It is possible to ping a
broadcast address instead, to see if at least one machine in a subnet
is still alive.

Do not use this broadcast ping unless your MIS person a) knows about it
and b) has given you explicit permission to use it!

watchdog will send out three ping packets and wait up to <interval>
seconds for the reply, with <interval> being the time it sleeps between
two writes to the watchdog device. Thus an unreachable network will not
cause a hard reset but a soft reboot.

You can also test passively for an unreachable network by just
monitoring a given interface for traffic. If no traffic arrives the
network is considered unreachable, causing a soft reboot or action from
the repair binary.
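
Passive monitoring of an interface can be approximated by comparing the
received-bytes counter in /proc/net/dev between two samples, as in this
Python sketch. Whether the daemon reads exactly this counter is an
assumption, and the function names are hypothetical.

```python
def rx_bytes(netdev_text, iface):
    """Return the received-bytes counter of iface from /proc/net/dev."""
    for line in netdev_text.splitlines():
        name, sep, counters = line.partition(":")
        if sep and name.strip() == iface:
            return int(counters.split()[0])   # first column: RX bytes
    raise KeyError(iface)


def interface_idle(before_text, after_text, iface):
    """True if no bytes arrived between the two samples."""
    return rx_bytes(after_text, iface) == rx_bytes(before_text, iface)
```

The two samples would be taken one sleep interval apart; the two header
lines of /proc/net/dev contain no colon and are skipped automatically.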

To start watchdog when the network is available see the Systemd section
below.

watchdog can run an external command for user-defined tests. A return
code not equal to 0 means an error occurred and watchdog should react.
If the external command is killed by an uncaught signal this is
considered an error by watchdog too. The command may take longer than
the time slice defined for the kernel device without a problem.
However, error messages are then written to the syslog facility. If you
have enabled softboot on error the machine will be rebooted if the
binary doesn't exit in half the time watchdog sleeps between two tries
triggering the kernel device.
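
The three outcomes named above (non-zero return code, death by signal,
not returning in time) can be told apart as in the following Python
sketch. It only illustrates the classification; run_test_command and
its result strings are hypothetical, and the timeout handling is that
of Python's subprocess module, not of the daemon.

```python
import subprocess
import sys


def run_test_command(argv, timeout):
    """Run a test command and return (ok, reason)."""
    try:
        proc = subprocess.run(argv, timeout=timeout)
    except subprocess.TimeoutExpired:
        return (False, "did not return in time")
    if proc.returncode < 0:      # Python reports death by signal N as -N
        return (False, "killed by signal %d" % -proc.returncode)
    if proc.returncode != 0:
        return (False, "exit code %d" % proc.returncode)
    return (True, "ok")


# Example: a child that exits with status 3 is reported as a failure.
print(run_test_command([sys.executable, "-c", "raise SystemExit(3)"], 10))
```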

If you specify a repair binary it will be started instead of shutting
down the system. If this binary is not able to fix the problem watchdog
will still cause a reboot afterwards.

If the machine is halted an email is sent to notify a human that the
machine is going down. Starting with version 4.4 watchdog will also
notify the human in charge if the machine is rebooted.

SOFT REBOOT
A soft reboot (i.e. controlled shutdown and reboot) is initiated for
every error that is found. Since there might be no more processes
available, watchdog does it all by itself. That means:

1. Kill all processes with SIGTERM.

2. After a short pause kill all remaining processes with SIGKILL.

3. Record a shutdown entry in wtmp.

4. Save the random seed from /dev/urandom. If the device is
   nonexistent or there is no filename for saving, this step is
   skipped.

5. Turn off accounting.

6. Turn off quota and swap.

7. Unmount all partitions except the root partition.

8. Remount the root partition read-only.

9. Shut down all network interfaces.

10. Finally reboot.

CHECK BINARY
If the return code of the check binary is not zero watchdog will assume
an error and reboot the system. Be careful with this if you are using
the real-time properties of watchdog, since watchdog will wait for the
return of this binary before proceeding. A positive exit code is
interpreted as a system error code (see errno.h for details). Negative
values are special to watchdog:

-1  Reboot the system. This is not exactly an error message but a
    command to watchdog. If the return code is -1 watchdog will not
    try to run a shutdown script instead.

-2  Reset the system. This is not exactly an error message but a
    command to watchdog. If the return code is -2 watchdog will
    simply refuse to write the kernel device again.

-3  Maximum load average exceeded.

-4  The temperature inside is too high.

-5  /proc/loadavg contains no (or not enough) data.

-6  The given file was not changed in the given interval.

-7  /proc/meminfo contains invalid data.

-8  Child process was killed by a signal.

-9  Child process did not return in time.

-10 Free for personal use.
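
One practical point when writing a check binary: the operating system
truncates a process exit status to 8 bits, so a command that exits with
-1 is seen by its parent as 255. The Python sketch below maps such a
status back to the table above. Reading the status as a signed 8-bit
value is an assumption about how the codes are meant to be matched, and
the names RESERVED, as_signed and describe are hypothetical.

```python
import os

# Meanings copied from the list of special return codes above.
RESERVED = {
    -1: "reboot the system",
    -2: "reset the system",
    -3: "maximum load average exceeded",
    -4: "temperature too high",
    -5: "/proc/loadavg contains no (or not enough) data",
    -6: "file was not changed in the given interval",
    -7: "/proc/meminfo contains invalid data",
    -8: "child process was killed by a signal",
    -9: "child process did not return in time",
    -10: "free for personal use",
}


def as_signed(status):
    """Map an 8-bit exit status (0..255) to the range -128..127."""
    status &= 0xFF
    return status - 256 if status > 127 else status


def describe(status):
    """Return a human-readable meaning for an exit status."""
    code = as_signed(status)
    if code == 0:
        return "success"
    if code in RESERVED:
        return RESERVED[code]
    if code > 0:
        return os.strerror(code)      # positive: system error code
    return "unassigned negative code %d" % code
```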

With enforcing SELinux policy please use the
/usr/libexec/watchdog/scripts/ directory for your test-binary
configuration.

REPAIR BINARY
The repair binary is started with one parameter: the error number that
caused watchdog to initiate the reboot process. After trying to repair
the system the binary should exit with 0 if the system was successfully
repaired and thus there is no need to reboot anymore. A return value
not equal to 0 tells watchdog to reboot. The return code of the repair
binary should be the error number of the error causing watchdog to
reboot. Be careful with this if you are using the real-time properties,
since watchdog will wait for the return of this binary before
proceeding.
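
The contract of a repair binary (one argument in, 0 or the original
error number out) could be implemented along these lines in Python.
This is only a sketch: the fixer table, restart_network and the exit
code 1 for bad usage are all hypothetical choices, not part of
watchdog.

```python
import errno


def repair(error_number, fixers):
    """Try the fixer registered for error_number.

    Returns 0 if the problem was repaired, otherwise the original error
    number so that watchdog proceeds with the reboot.
    """
    fixer = fixers.get(error_number)
    if fixer is not None and fixer():
        return 0
    return error_number


def main(argv, fixers):
    """Parse the single error-number argument and attempt a repair."""
    try:
        error_number = int(argv[1])
    except (IndexError, ValueError):
        return 1                 # hypothetical choice for bad usage
    return repair(error_number, fixers)


def restart_network():
    """Hypothetical fixer; a real one would restart the failed service."""
    return False                 # placeholder: report "not repaired"


FIXERS = {errno.ENETUNREACH: restart_network}
# As a script one would end with:
#     import sys; sys.exit(main(sys.argv, FIXERS))
```

Because watchdog waits for the repair binary to return, each fixer
should be bounded in time.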

With enforcing SELinux policy please use the
/usr/libexec/watchdog/scripts/ directory for your repair-binary
configuration.

TEST DIRECTORY
Executables placed in the test directory are discovered by watchdog on
startup and are automatically executed. They are bounded time-wise by
the test-timeout directive in watchdog.conf.

These executables are called with either "test" as the first argument
(if a test is being performed) or "repair" as the first argument (if a
repair for a previously failed "test" operation is being performed).

As with test binaries and repair binaries, the expected exit code for a
successful test or repair operation is always zero.

If an executable's test operation fails, the same executable is
automatically called with the "repair" argument as well as the return
code of the previously failed test operation.

For example, if the following execution returns 42:

    /etc/watchdog.d/my-test test

The watchdog daemon will attempt to repair the problem by calling:

    /etc/watchdog.d/my-test repair 42

This enables administrators and application developers to make
intelligent test/repair commands. If the "repair" operation is not
required (or is not likely to succeed), it is important that the author
of the command return a non-zero value so the machine will still reboot
as expected.

Note that the watchdog daemon may interpret and act upon any of the
reserved return codes noted in the Check Binary section prior to
calling a given command in "repair" mode.
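
A test/repair executable following this calling convention might be
structured like the Python sketch below. Everything specific in it is
hypothetical: the monitored path, the failure code 42 (borrowed from
the example above), the optional fix callable and the exit code 2 for
an unknown mode.

```python
import os

MONITORED = "/var/run/hypothetical-service.sock"   # hypothetical path


def main(argv, path=MONITORED, fix=None):
    """Entry point for a test/repair executable; returns the exit code."""
    mode = argv[1] if len(argv) > 1 else ""
    if mode == "test":
        return 0 if os.path.exists(path) else 42   # 42: arbitrary code
    if mode == "repair":
        failed_code = int(argv[2]) if len(argv) > 2 else 0
        if fix is not None and fix(failed_code):
            return 0             # repaired, no reboot needed
        return failed_code or 1  # non-zero: let watchdog reboot
    return 2                     # unknown mode (hypothetical convention)


# As a script one would end with:
#     import sys; sys.exit(main(sys.argv))
```

Without a working fix the repair branch hands the failed code straight
back, which is the "return a non-zero value" behaviour recommended
above.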

SYSTEMD
To start watchdog after the network is available:

    systemctl disable watchdog
    systemctl enable NetworkManager-wait-online
    systemctl enable watchdog-ping

When using a pid check for a custom service that has its own systemd
unit file, please be aware that "Requires=" also deactivates dependent
services. Using "Before=watchdog.service" or
"Before=watchdog-ping.service" in the custom service unit file may be
the desired operation instead. See the systemd.unit documentation for
more details.

The directories /etc/watchdog.d/ and /usr/libexec/watchdog/scripts/ are
recognized locations for custom executables.

BUGS
None known so far.

AUTHORS
The original code is an example written by Alan Cox
<alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All
additions were written by Michael Meskes <meskes@debian.org>. Johnie
Ingram <johnie@netgod.net> had the idea of testing the load average. He
also took over the Debian-specific work. Dave Cinege
<dcinege@psychosis.com> brought up some hardware watchdog issues and
helped test this stuff.

FILES
/dev/watchdog
    The watchdog device.

/var/run/watchdog.pid
    The pid file of the running watchdog.

SEE ALSO
watchdog.conf(5), systemd.unit(5)

4th Berkeley Distribution        January 2005        WATCHDOG(8)