WATCHDOG.CONF(5)              File Formats Manual             WATCHDOG.CONF(5)

NAME
       watchdog.conf - configuration file for the watchdog daemon

DESCRIPTION
       This file holds all configuration options for the Linux watchdog
       daemon. Each option must be written on a line of its own. Comments
       start with '#'. Blanks are ignored except after the '=' sign. An
       empty value after the '=' sign disables the feature, as long as
       that makes sense.
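
       As an illustration of the syntax only (the values are arbitrary
       examples, not recommendations), a minimal file might look like
       this:

       ```
       # Comment lines start with '#'.
       interval = 1
       watchdog-device = /dev/watchdog
       # An empty value disables a feature where that makes sense,
       # here the notification mail described under admin.
       admin =
       ```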

OPTIONS
       interval = <interval>
              Set the maximum interval between two writes to the watchdog
              device. The device is triggered after each check, regardless
              of the time the check took. After finishing all checks,
              watchdog sleeps for a full cycle of <interval> seconds. The
              default value is 1 second. Kernel drivers typically expect a
              write at least once a minute; otherwise the system is
              rebooted. An interval of more than a minute can therefore
              only be used with the force command-line option
              [--force | -f].

       logtick = <logtick>
              If you enable verbose logging, a message is written to the
              syslog or a logfile. Useful as this is, a message every
              interval is rarely necessary, and it fills up the disk and
              costs CPU time. logtick sets the number of intervals skipped
              before a log message is written. With logtick = 60 and
              interval = 10, a message is written only every 10 minutes
              (600 seconds). This may make the exact time of a crash
              harder to find, but it greatly reduces disk usage and spares
              the administrator's nerves when looking for a particular
              syslog entry in between watchdog messages.
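
              The arithmetic above, written out as a configuration
              fragment (the values are the ones from the example, not
              recommendations):

              ```
              interval = 10
              # One log message per 60 intervals: 60 * 10 s = 600 s.
              logtick = 60
              ```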

       max-load-1 = <load1>
              Set the maximum allowed load average over a 1 minute span.
              Once this load average is reached, the system is rebooted.
              The default value is 0, which means the load average check
              is disabled. Be careful not to set this parameter too low.
              To set a value less than the predefined minimum of 2, you
              have to use the -f command-line option.

       max-load-5 = <load5>
              Set the maximum allowed load average over a 5 minute span.
              Once this load average is reached, the system is rebooted.
              The default value is 3/4*max-load-1. Be careful not to set
              this parameter too low. To set a value less than the
              predefined minimum of 2, you have to use the -f command-line
              option.

       max-load-15 = <load15>
              Set the maximum allowed load average over a 15 minute span.
              Once this load average is reached, the system is rebooted.
              The default value is 1/2*max-load-1. Be careful not to set
              this parameter too low. To set a value less than the
              predefined minimum of 2, you have to use the -f command-line
              option.
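
              To make the derived defaults concrete: with max-load-1 = 24
              (an arbitrary example value), the formulas above give
              3/4 * 24 = 18 and 1/2 * 24 = 12. Written out explicitly,
              that would correspond to:

              ```
              max-load-1 = 24
              max-load-5 = 18
              max-load-15 = 12
              ```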

       min-memory = <minpage>
              Set the minimum amount of memory that has to stay free. Note
              that this is in memory pages (4kB on x86). The default value
              is 0 pages, which means this test is disabled. The page size
              is taken from the system include files. The usable memory is
              computed as MemFree + Buffers + Cached, since buffers and
              cache typically expand to use most free memory but the
              kernel reclaims them as needed. NOTE: If this measure drops
              below a few tens of MB, the system will swap aggressively
              and have poorer file system performance due to the lack of
              caching. This is a 'passive' test and works by reading
              /proc/meminfo.
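
              Because the value is in pages, a size in MB has to be
              converted first. A sketch, assuming 4kB pages and an
              arbitrary threshold of 64MB:

              ```
              # 64 MB = 65536 kB; 65536 kB / 4 kB per page = 16384 pages
              min-memory = 16384
              ```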

       allocatable-memory = <minpage>
              Set the minimum amount of allocatable memory available on
              the system. Note that this is in pages. Default value is 0
              pages which means the test is disabled. As with min-memory,
              the page size is taken from the system include files. This
              is an 'active' test and it works by attempting to memory-map
              a block of the configured size.

       max-swap = <maxpage>
              Set the maximum amount of swap use. Note that this is in
              memory pages (4kB on x86). The default value is 0 pages,
              which means this test is disabled. Often this should be a
              large portion of the available swap, but remember that
              paging 1GB of swap can take several seconds or even tens of
              seconds. This is a 'passive' test and works by reading
              /proc/meminfo.
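
              As with min-memory, the limit has to be converted to pages.
              A sketch, assuming 4kB pages and an arbitrary limit of 1GB
              of swap in use:

              ```
              # 1 GB = 1048576 kB; 1048576 kB / 4 kB per page = 262144 pages
              max-swap = 262144
              ```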

       watchdog-device = <device>
              Set the watchdog device name, typically /dev/watchdog.
              Default is to disable keep alive support. This should be
              tested by running the daemon from the command line before
              configuring it to start automatically on booting.

       watchdog-refresh-use-settimeout = <auto|yes|no>
              Refresh the watchdog timer by setting its timeout instead of
              using a normal watchdog refresh operation. This might help
              if your watchdog trips by itself when the first timeout
              interval elapses. The default is 'auto', which applies the
              fix-up for IT87; it can be disabled with 'no' or forced for
              other modules with 'yes'.

       watchdog-refresh-ignore-errors = <yes|no>
              Ignore errors reported by writing to the watchdog device.
              Typically this is used for systems that have broken
              implementations of the IPMI driver to avoid a reboot loop.

       watchdog-timeout = <timeout>
              Set the watchdog device timeout during startup. If not set,
              a built-in default is used; that default should be set to
              the kernel timer margin at compile time.
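
              A sketch of a basic device setup. The 60 second timeout is
              an assumption chosen to match the one-minute expectation
              described under interval; check what your driver supports:

              ```
              watchdog-device = /dev/watchdog
              watchdog-timeout = 60
              ```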

       temperature-sensor = <temp-virtual-file>
              Set the temperature sensor name. This is normally a 'virtual
              file' under /sys that contains the temperature in
              milli-Celsius. Usually these files are generated by the
              sensors package, but take care, as device enumeration may
              not be fixed. The default is to disable temperature
              checking. Multiple sensors can be used by repeating the
              temperature-sensor entry. Because of the enumeration
              problem, any missing temperature sensor is simply ignored
              and not treated as a reboot trigger.

       max-temperature = <temp>
              Set the maximum allowed temperature in Celsius. Once this
              temperature is reached, the system is stopped. The default
              value is 90 C. Watchdog issues warnings once the temperature
              reaches 90%, 95% and 98% of this value.
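
              For illustration, with the default limit the warnings would
              come at 81 C, 85.5 C and 88.2 C (90%, 95% and 98% of 90 C).
              The sensor paths below are assumptions; the actual names
              depend on the hardware and on enumeration order:

              ```
              temperature-sensor = /sys/class/hwmon/hwmon0/temp1_input
              temperature-sensor = /sys/class/hwmon/hwmon1/temp1_input
              max-temperature = 90
              ```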

       temp-power-off = <yes|no>
              Set the watchdog action on overheating. With 'yes' (the
              default) the machine is powered off; with 'no' the machine
              is halted and a Ctrl-Alt-Del reboot is allowed.

       file = <filename>
              Set file name for file mode. This option can be given as
              often as you like to check several files.

       change = <mtime>
              Set the change interval time for file mode. This option
              always belongs to the active filename; that is, when
              watchdog finds a 'change =' line, it assumes the line
              belongs to the most recently read 'file =' line. The two do
              not have to follow each other directly, but you cannot
              specify a 'change =' before a 'file ='. The default is to
              only stat the file and not look for changes. Using this
              feature to monitor changes in /var/log/messages might
              require some special syslog daemon configuration, e.g.
              rsyslog needs "$ActionWriteAllMarkMessages on" to be set to
              make sure the marks are written no matter what.
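
              The pairing rule can be sketched as follows. The second path
              is hypothetical, and the change value is assumed to be in
              seconds:

              ```
              # Checked for modification within the change interval.
              file = /var/log/messages
              change = 1800
              # No 'change =' follows, so this file is only stat'ed.
              file = /var/run/example.status
              ```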

       pidfile = <pidfilename>
              Set pidfile name for daemon test mode. This option can be
              given as often as you like to check several daemons,
              assuming they write their post-forking PID to the specified
              files. See the Systemd section in watchdog(8) for more
              information.
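
              For example, to watch two daemons at once (both pidfile
              paths are hypothetical and depend on how the daemons are
              packaged on your system):

              ```
              pidfile = /var/run/sshd.pid
              pidfile = /var/run/crond.pid
              ```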

       ping = <ip-addr>
              Set IPv4 address for ping mode. This option can be used more
              than once to check different connections.

       ping-count = <ping-per-interval>
              Set the number of ping attempts in each 'interval' of time.
              The default is 3; the check completes on the first
              successful ping.

       interface = <if-name>
              Set interface name for network mode. This option can be used
              more than once to check different interfaces. Note it is
              only possible to check physical interfaces, and not aliased
              IP interfaces.
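
              A sketch of a network check combining the three options
              above. The addresses come from the documentation range and
              the interface name is an assumption; substitute your own:

              ```
              ping = 192.0.2.1
              ping = 192.0.2.254
              ping-count = 3
              interface = eth0
              ```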

       test-binary = <testbin>
              Execute the given binary to do some user defined tests. With
              enforcing SELinux policy, please use
              /usr/libexec/watchdog/scripts/ for your test-binary
              configuration.

       test-timeout = <timeout in seconds>
              User defined tests may only run for <timeout> seconds. Set
              to 0 for unlimited.

       repair-binary = <repbin>
              Execute the given binary in case of a problem instead of
              shutting down the system. With enforcing SELinux policy,
              please use /usr/libexec/watchdog/scripts/ for your
              repair-binary configuration.

       repair-timeout = <timeout in seconds>
              The repair command may only run for <timeout> seconds. Set
              to 0 for 'unlimited', but note that the hardware timer is
              not refreshed in this case, so the system will hard-reset at
              some point.
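
              A sketch of a test and repair pair with bounded run times.
              The script names are hypothetical and the 60 second limits
              are arbitrary example values:

              ```
              test-binary = /usr/libexec/watchdog/scripts/check-service.sh
              test-timeout = 60
              repair-binary = /usr/libexec/watchdog/scripts/repair-service.sh
              repair-timeout = 60
              ```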

       retry-timeout = <timeout in seconds>
              Allow most error conditions to persist for <timeout>
              seconds. Set to 0 for immediate action (like softboot
              behaviour).

       repair-maximum = <count>
              Allow no more than <count> repair attempts against a given
              fault that report success (i.e. return 0) but fail to clear
              the fault, before a reboot is initiated anyway. If set to
              zero, a repairable fault can always be blocked by a repair
              program reporting success (previous daemon behaviour).
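
              As a sketch with arbitrary example values: the fragment
              below tolerates an error condition for up to a minute and
              allows one "successful" repair that does not actually clear
              the fault before a reboot is initiated:

              ```
              retry-timeout = 60
              repair-maximum = 1
              ```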

       softboot-option = <yes|no>
              This acts like the -b / --softboot command-line option and
              simply sets the retry timeout to zero.

       admin = <mail-address>
              Email address to send admin mail to, that is, who is to be
              notified that the machine is being halted or rebooted. The
              default is 'root'. If you want to disable notification via
              email, just set admin to an empty string.

       realtime = <yes|no>
              If set to yes watchdog will lock itself into memory so it is
              never swapped out.

       priority = <schedule priority>
              Set the schedule priority for realtime mode passed to
              sched_setscheduler().
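
              The two realtime options are normally set together. A
              sketch, where the priority value of 1 is an assumption
              rather than a recommendation:

              ```
              realtime = yes
              priority = 1
              ```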

       test-directory = <test directory>
              Set the directory to run user test/repair scripts. The
              default is '/etc/watchdog.d'. The /etc/watchdog.d/ directory
              is recognized by SELinux policy. See the Test Directory
              section in watchdog(8) for more information.

       log-dir = <log directory>
              Set the log directory to capture the standard output and
              standard error from repair-binary and test-binary execution.
              Default is '/var/log/watchdog'.

       sigterm-delay = <time in seconds>
              Set the time during shutdown between first sending SIGTERM
              to all processes and then sending SIGKILL. The default is 5
              seconds, which is generally enough, but systems with large
              databases or virtual machines might need longer.
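
              For a host running a large database, for instance, the delay
              could be raised. The value of 30 seconds is an arbitrary
              example:

              ```
              sigterm-delay = 30
              ```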

       verbose = <level>
              This overrides the command line --verbose option. Generally
              the verbose mode is only enabled for debugging as it creates
              a lot of syslog chatter, so use this option with
              consideration. Zero is "normal" operation (quiet), while 1
              is typically used for debugging. Values of 2 or more usually
              generate far too many messages.

       heartbeat-file = <filename>
              For debugging, this allows a rolling set of status values to
              be kept on disk.

       heartbeat-stamps = <interval>
              For debugging, this sets the number of entries in the
              <heartbeat-file>.
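
              A sketch of the two heartbeat options used together. The
              file name and the entry count are hypothetical example
              values:

              ```
              heartbeat-file = /var/log/watchdog/heartbeat.log
              heartbeat-stamps = 300
              ```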

       log-killed-pids = <yes|no>
              This acts like enabling 'verbose' logging, but only for a
              system reboot, where it enables logging of the PID values of
              all processes that are being killed. In this case the
              results are written to the killall5.log file in the log
              directory (if at all possible). Intended for debugging cases
              where you would like to know what was running at the point
              the machine triggered the watchdog, but do not want syslog
              filling up with the usual chatter of activity.

FILES
       /etc/watchdog.conf
              The watchdog configuration file

       /etc/watchdog.d
              A directory containing test-or-repair commands. See the Test
              Directory section in watchdog(8) for more information.

SEE ALSO
       watchdog(8)

4th Berkeley Distribution         February 2019         WATCHDOG.CONF(5)