WATCHDOG(8)              System Manager's Manual              WATCHDOG(8)

NAME
watchdog - a software watchdog daemon

SYNOPSIS
watchdog [-f|--force] [-c filename|--config-file filename] [-v|--verbose]
         [-s|--sync] [-b|--softboot] [-q|--no-action]

DESCRIPTION
The Linux kernel can reset the system if serious problems are detected.
This can be implemented via special watchdog hardware, or via a slightly
less reliable software-only watchdog inside the kernel. Either way, there
needs to be a daemon that tells the kernel the system is working fine. If
the daemon stops doing that, the system is reset.

watchdog is such a daemon. It opens /dev/watchdog and keeps writing to it
often enough to keep the kernel from resetting the system, at least once
per minute. Each write postpones the reset by another minute. After a
minute of inactivity the watchdog hardware will cause the reset. In the
case of the software watchdog, the ability to reboot depends on the state
of the machine and its interrupts.

The watchdog daemon can be stopped without causing a reboot if the device
/dev/watchdog is closed correctly, unless your kernel is compiled with the
CONFIG_WATCHDOG_NOWAYOUT option enabled.
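
The keep-alive cycle described above can be sketched as follows. This is
an illustration only, not the daemon's code. The function names and the
rounds parameter are hypothetical. The "magic close" character V is the
convention of the Linux watchdog driver API for a clean close. It only
disarms the timer if the driver supports it and the kernel was built
without CONFIG_WATCHDOG_NOWAYOUT. Note that opening the real device arms
the timer, so try this against an ordinary file first.

```python
import os
import time

MAGIC_CLOSE = b"V"  # "magic close" character of the Linux watchdog API


def keep_alive(fd: int) -> None:
    """Write one byte to the device; any write restarts the countdown."""
    os.write(fd, b"\0")


def stop_cleanly(fd: int) -> None:
    """Send the magic close character, then close the device."""
    os.write(fd, MAGIC_CLOSE)
    os.close(fd)


def run(path: str = "/dev/watchdog", interval: float = 10.0,
        rounds: int = 3) -> None:
    """Write to the device `rounds` times, then close it cleanly.

    WARNING: opening the real device starts the kernel timer.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        for _ in range(rounds):
            keep_alive(fd)
            time.sleep(interval)
    finally:
        stop_cleanly(fd)
```

Calling run("/tmp/fake-watchdog", interval=0) on an existing scratch file
shows the bytes that would be written without touching the real device.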

TESTS
The watchdog daemon does several tests to check the system status:

· Is the process table full?

· Is there enough free memory?

· Are some files accessible?

· Have some files changed within a given interval?

· Is the average work load too high?

· Has a file table overflow occurred?

· Is a process still running? The process is specified by a pid file.

· Do some IP addresses answer to ping?

· Do network interfaces receive traffic?

· Is the temperature too high? (Temperature data is not always
  available.)

· Execute a user-defined command to do arbitrary tests.

If any of these checks fails, watchdog will cause a shutdown. Should any
of these tests, except the user-defined binary, last longer than one
minute, the machine will be rebooted too.

OPTIONS
Available command line options are the following:

-v, --verbose
    Set verbose mode. Only implemented if compiled with the SYSLOG
    feature. In this mode watchdog logs several pieces of information
    to the LOG_DAEMON facility with priority LOG_INFO. This is useful
    if you want to see exactly what happened before watchdog rebooted
    the system. Currently it logs the temperature (if available), the
    load average, the change date of the files it checks and how often
    it went to sleep.

-s, --sync
    Try to synchronize the filesystem every time the process is awake.
    Note that the system is rebooted if for any reason the
    synchronizing lasts longer than a minute.

-b, --softboot
    Soft-boot the system if an error occurs during the main loop, e.g.
    if a given file is not accessible via the stat(2) call. Note that
    this does not apply to the opening of /dev/watchdog and
    /proc/loadavg, which are opened before the main loop starts.

-f, --force
    Force the usage of the interval given or the maximal load average
    given in the config file.

-c config-file, --config-file config-file
    Use config-file as the configuration file instead of the default
    /etc/watchdog.conf.

-q, --no-action
    Do not reboot or halt the machine. This is for testing purposes.
    All checks are executed and the results are logged as usual, but
    no action is taken. The hardware card or the kernel software
    watchdog driver is not enabled either. Temperature checking is
    also disabled, since it triggers the hardware watchdog on some
    cards.

FUNCTION
After watchdog starts, it puts itself into the background and then tries
all checks specified in its configuration file in turn. Between every two
tests it writes to the kernel device to prevent a reset. After finishing
all tests watchdog goes to sleep for some time. The kernel driver expects
a write to the watchdog device every minute. Otherwise the system will be
reset. By default watchdog sleeps for only 10 seconds, so it triggers the
device early enough.
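
As a rough sketch of this cycle, assuming each check is a callable that
returns True when the system looks healthy: run_cycle, main_loop and the
injectable sleep parameter are hypothetical names chosen for illustration
and say nothing about how the daemon is actually structured.

```python
import time
from typing import Callable, List, Sequence


def run_cycle(checks: Sequence[Callable[[], bool]],
              keep_alive: Callable[[], None]) -> List[int]:
    """Run every check once and write to the device after each one.

    Returns the indices of the checks that failed.
    """
    failed = []
    for index, check in enumerate(checks):
        if not check():
            failed.append(index)
        keep_alive()  # keep the kernel timer from expiring mid-cycle
    return failed


def main_loop(checks: Sequence[Callable[[], bool]],
              keep_alive: Callable[[], None],
              interval: float = 10.0,
              sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Repeat the cycle until a check fails, sleeping in between."""
    while True:
        failed = run_cycle(checks, keep_alive)
        if failed:
            return failed  # the caller decides what action to take
        sleep(interval)
```

Passing the sleep function as a parameter keeps the loop testable without
waiting for real time to pass.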

Under high system load watchdog might be swapped out of memory and may
fail to be swapped back in soon enough. Under these circumstances the
Linux kernel will reset the machine. To avoid unnecessary reboots, set the
variable realtime to yes in the configuration file watchdog.conf. This
adds real-time support to watchdog: it will lock itself into memory, and
there should be no problem even under the highest of loads.

You can also specify a maximal allowed load average. Once this load
average is reached the system is rebooted. You may specify maximal load
averages for 1 minute, 5 minutes or 15 minutes. The default is to disable
this test. Be careful not to set this parameter too low. To set a value
less than the predefined minimal value of 2, you have to use the -f
option.
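
A minimal sketch of such a load test, assuming the usual /proc/loadavg
layout in which the first three fields are the 1, 5 and 15 minute
averages. The function names are hypothetical, and a limit of None stands
for "test disabled". The minimal value of 2 and the -f override are not
modelled here.

```python
from typing import Optional, Sequence, Tuple


def parse_loadavg(text: str) -> Tuple[float, float, float]:
    """Extract the 1, 5 and 15 minute averages from /proc/loadavg text."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("loadavg contains no (or not enough) data")
    return float(fields[0]), float(fields[1]), float(fields[2])


def load_exceeded(loads: Sequence[float],
                  limits: Sequence[Optional[float]]) -> bool:
    """True if any average is above its limit; None disables a limit."""
    return any(limit is not None and load > limit
               for load, limit in zip(loads, limits))
```

With the text read from /proc/loadavg, load_exceeded(parse_loadavg(text),
(None, None, 2.5)) would test only the 15 minute average.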

You can also specify a minimal amount of virtual memory you want to have
available as free. As soon as less than this amount is free, watchdog
takes action. Note, however, that watchdog does not distinguish between
different types of memory usage. It just checks for free virtual memory.
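
The following sketch shows one way such a test could be written. It
assumes that "free virtual memory" means MemFree plus SwapFree as reported
by /proc/meminfo, and it works in kilobytes. Both are assumptions made for
illustration; the unit the daemon expects in its configuration file may
differ. All function names are hypothetical.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Map each /proc/meminfo field name to its leading integer value."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def free_virtual_kb(values: Dict[str, int]) -> int:
    """Assumed definition: free RAM plus free swap, in kB.

    A missing field raises KeyError, i.e. the data is invalid.
    """
    return values["MemFree"] + values["SwapFree"]


def memory_low(text: str, min_free_kb: int) -> bool:
    """True if less than min_free_kb of virtual memory is free."""
    return free_virtual_kb(parse_meminfo(text)) < min_free_kb
```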

If you have a watchdog card with a temperature sensor you can specify the
maximal allowed temperature. Once this temperature is reached the system
is halted. The default value is 120. There is no unit conversion, so make
sure you use the same unit as your hardware. watchdog will issue warnings
once the temperature reaches 90%, 95% and 98% of this limit.
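
The thresholds work out as follows. With the default limit of 120, the
warnings fall at 108, 114 and 117.6, and the halt at 120. The function
below is a sketch with a hypothetical name and return values. It uses
integer percentages to avoid rounding surprises.

```python
WARN_PERCENT = (98, 95, 90)  # checked from the highest level down


def temperature_status(current: float, maximum: float = 120) -> str:
    """Classify a reading as "ok", "warn N%" or "halt".

    No unit conversion is done: `current` and `maximum` must use the
    unit reported by the hardware.
    """
    if current >= maximum:
        return "halt"
    for percent in WARN_PERCENT:
        if current * 100 >= maximum * percent:
            return "warn %d%%" % percent
    return "ok"
```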

When using file mode watchdog will try to stat(2) the given files. Errors
returned by stat will not cause a reboot. For a reboot the stat call has
to last at least one minute. This may happen if the file is located on an
NFS-mounted filesystem. If your system relies on an NFS-mounted filesystem
you might try this option. However, in such a case the sync option may not
work if the NFS server is not answering.
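
For the related test of whether a file has changed within a given
interval, a sketch might compare the modification time returned by stat
with the current time. The function name and the now parameter are
hypothetical. A stat error is deliberately not treated as "unchanged"
here, in line with the statement above that such errors alone do not
cause a reboot.

```python
import os
import time
from typing import Optional


def file_unchanged(path: str, change_interval: float,
                   now: Optional[float] = None) -> bool:
    """True if `path` was last modified more than change_interval ago."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return False  # a stat error is a separate condition
    return now - mtime > change_interval
```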

watchdog can read the pid from a pid file and see whether the process
still exists. If not, watchdog takes action. So you can, for instance,
restart the server from your repair binary.
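
A common way to test for a live process is to send it signal 0, which
performs the permission and existence checks without delivering a signal.
The sketch below uses that technique. It is not claimed to be what the
daemon does internally, and pid_alive is a hypothetical name.

```python
import os


def pid_alive(pidfile: str) -> bool:
    """True if the pid stored in `pidfile` refers to an existing process."""
    try:
        with open(pidfile) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False  # unreadable or malformed pid file
    if pid <= 0:
        return False  # 0 and negative values address process groups
    try:
        os.kill(pid, 0)  # signal 0: check only, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```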

watchdog will periodically try to fork itself to see whether the process
table is full. This leaves a zombie process until watchdog wakes up again
and reaps it. This is harmless and nothing to worry about.

In ping mode watchdog tries to ping the given IP addresses. An address
does not have to belong to a single machine. It is possible to ping a
broadcast address instead, to see if at least one machine in a subnet is
still alive.

Do not use this broadcast ping unless your MIS person a) knows about it
and b) has given you explicit permission to use it!

watchdog will send out three ping packets and wait up to <interval>
seconds for the reply, where <interval> is the time it sleeps between two
writes to the watchdog device. Thus an unreachable network will not cause
a hard reset but a soft reboot.

You can also test passively for an unreachable network by just monitoring
a given interface for traffic. If no traffic arrives, the network is
considered unreachable, causing a soft reboot or action from the repair
binary.
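
One way to sketch the passive test is to compare the received-bytes
counter of an interface between two snapshots of /proc/net/dev. Whether
the daemon reads this particular file is an assumption made for this
example, and the function names are hypothetical.

```python
from typing import Optional


def rx_bytes(proc_net_dev: str, interface: str) -> Optional[int]:
    """Return the received-bytes counter of `interface`, or None."""
    for line in proc_net_dev.splitlines():
        name, sep, counters = line.partition(":")
        if sep and name.strip() == interface:
            return int(counters.split()[0])  # first counter is RX bytes
    return None


def traffic_seen(before: str, after: str, interface: str) -> bool:
    """True if the RX counter changed between the two snapshots."""
    old = rx_bytes(before, interface)
    new = rx_bytes(after, interface)
    return old is not None and new is not None and new != old
```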

watchdog can run an external command for user-defined tests. A return code
other than 0 means an error occurred and watchdog should react. If the
external command is killed by an uncaught signal, this is considered an
error by watchdog too. The command may take longer than the time slice
defined for the kernel device without a problem. However, error messages
are then written to syslog. If you have enabled softboot on error, the
machine will be rebooted if the binary does not exit within half the time
watchdog sleeps between two writes to the kernel device.
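
The three failure cases named above (non-zero return code, death by
signal, failure to return in time) can be told apart as in this sketch.
run_check and its string results are hypothetical. The negative
returncode used to detect a signal is a convention of Python's subprocess
module, not one of watchdog's own error numbers.

```python
import subprocess
from typing import Sequence


def run_check(argv: Sequence[str], timeout: float) -> str:
    """Run a test command and classify how it ended."""
    try:
        result = subprocess.run(argv, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout"  # run() has already killed the child
    if result.returncode < 0:
        return "signal"   # subprocess reports -N for death by signal N
    return "ok" if result.returncode == 0 else "error"
```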

If you specify a repair binary it will be started instead of shutting down
the system. If this binary is not able to fix the problem watchdog will
still cause a reboot afterwards.

If the machine is halted an email is sent to notify a human that the
machine is going down. Starting with version 4.4 watchdog will also notify
the human in charge if the machine is rebooted.

SOFT REBOOT
A soft reboot (i.e. controlled shutdown and reboot) is initiated for every
error that is found. Since there might be no more processes available,
watchdog does it all by itself. That means:

1.  Kill all processes with SIGTERM.

2.  After a short pause kill all remaining processes with SIGKILL.

3.  Record a shutdown entry in wtmp.

4.  Save the random seed from /dev/urandom. If the device is nonexistent
    or there is no filename for saving, this step is skipped.

5.  Turn off accounting.

6.  Turn off quota and swap.

7.  Unmount all partitions except the root partition.

8.  Remount the root partition read-only.

9.  Shut down all network interfaces.

10. Finally reboot.
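
Steps 1 and 2 follow the familiar pattern of asking politely first and
forcing second. The sketch below applies that pattern to a single child
process only, which is far less than the daemon does. stop_process and
the grace parameter are hypothetical.

```python
import subprocess


def stop_process(proc: subprocess.Popen, grace: float = 5.0) -> int:
    """Send SIGTERM, wait up to `grace` seconds, then send SIGKILL.

    Returns the child's return code (negative signal number on POSIX).
    """
    proc.terminate()  # SIGTERM
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()
    return proc.returncode
```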

CHECK BINARY
If the return code of the check binary is not zero watchdog will assume an
error and reboot the system. Be careful with this if you are using the
real-time properties of watchdog, since watchdog will wait for this binary
to return before proceeding. A positive exit code is interpreted as a
system error code (see errno.h for details). Negative values are special
to watchdog:

-1   Reboot the system. This is not exactly an error message but a
     command to watchdog. If the return code is -1 watchdog will not
     try to run a shutdown script instead.

-2   Reset the system. This is not exactly an error message but a
     command to watchdog. If the return code is -2 watchdog will
     simply refuse to write to the kernel device again.

-3   Maximum load average exceeded.

-4   The temperature inside is too high.

-5   /proc/loadavg contains no (or not enough) data.

-6   The given file was not changed in the given interval.

-7   /proc/meminfo contains invalid data.

-8   Child process was killed by a signal.

-9   Child process did not return in time.

-10  Free for personal use.

REPAIR BINARY
The repair binary is started with one parameter: the error number that
caused watchdog to initiate the reboot process. After trying to repair the
system, the binary should exit with 0 if the system was successfully
repaired and there is thus no need to reboot anymore. A return value other
than 0 tells watchdog to reboot. The return code of the repair binary
should be the error number of the error causing watchdog to reboot. Be
careful with this if you are using the real-time properties, since
watchdog will wait for this binary to return before proceeding.
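
A repair binary following this contract could be organized as in the
sketch below. The fixes table, which maps error numbers to repair
functions, is a hypothetical design and not part of watchdog. Returning 1
for a malformed invocation is likewise an arbitrary choice.

```python
from typing import Callable, Dict, Sequence


def repair(error_code: int, fixes: Dict[int, Callable[[], bool]]) -> int:
    """Try the fix registered for error_code.

    Returns 0 if the fix succeeded, otherwise the original error number
    so that the caller can hand it back to watchdog.
    """
    fix = fixes.get(error_code)
    if fix is not None and fix():
        return 0
    return error_code


def main(argv: Sequence[str], fixes: Dict[int, Callable[[], bool]]) -> int:
    """argv[1] is the error number passed in by watchdog."""
    try:
        error_code = int(argv[1])
    except (IndexError, ValueError):
        return 1  # arbitrary non-zero value for a malformed call
    return repair(error_code, fixes)


# In a real script the last line would be something like:
#     sys.exit(main(sys.argv, FIXES))
# Exit statuses are 8 bits wide on POSIX systems, so a negative number
# passed to sys.exit() is truncated by the operating system.
```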

BUGS
None known so far.

AUTHORS
The original code is an example written by Alan Cox
<alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All additions
were written by Michael Meskes <meskes@debian.org>. Johnie Ingram
<johnie@netgod.net> had the idea of testing the load average. He also took
over the Debian-specific work. Dave Cinege <dcinege@psychosis.com> brought
up some hardware watchdog issues and helped with testing.

FILES
/dev/watchdog
    The watchdog device.

/var/run/watchdog.pid
    The pid file of the running watchdog.

SEE ALSO
watchdog.conf(5)

4th Berkeley Distribution        January 2005        WATCHDOG(8)