mon(8)              Parallel Service Monitoring Daemon              mon(8)

6 mon - monitor services for availability, sending alarms upon failures.
7
9 mon [-dfhlMSv] [-a dir] [-A authfile] [-b dir] [-B dir] [-c config] [-D
10 dir] [-i secs] [-k num] [-l [statetype]] [-L dir] [-m num] [-p num] [-P
11 pidfile] [-r delay] [-s dir]
12
14 mon is a general-purpose scheduler for monitoring service availability
15 and triggering alerts upon detecting failures. mon was designed to be
16 open in the sense that it supports arbitrary monitoring facilities and
17 alert methods via a common interface, which are easily implemented
18 through programs (in C, Perl, shell, etc.), SNMP traps, and special Mon
19 (UDP packet) traps.
20
21
23 -a dir Path to alert scripts. Default is
24 /usr/local/lib/mon/alert.d:alert.d. Multiple alert paths may be
25 specified by separating them with a colon. Non-absolute paths
26 are taken to be relative to the base directory (/usr/lib/mon by
27 default).
28
29 -b dir Base directory for mon. scriptdir, alertdir, and statedir are
30 all relative to this directory unless specified from /. Default
31 is /usr/lib/mon.
32
33 -B dir Configuration file base directory. All config files are located
34 here, including mon.cf, monusers.cf, and auth.cf.
35
36 -A authfile
37 Authentication configuration file. By default this is
38 /etc/mon/auth.cf if the /etc/mon directory exists, or
39 /usr/lib/mon/auth.cf otherwise.
40
41 -c file
Read configuration from file. This defaults to /etc/mon/mon.cf
if the /etc/mon directory exists, otherwise to /etc/mon.cf.
45
46 -d Enable debugging mode.
47
48 -D dir Path to state directory. Default is the first of
49 /var/state/mon, /var/lib/mon, and /usr/lib/mon/state.d which
50 exists.
51
52 -f Fork and run as a daemon process. This is the preferred way to
53 run mon.
54
55 -h Print help information.
56
57 -i secs
58 Sleep interval, in seconds. Defaults to 1. This shouldn't need
59 to be adjusted for any reason.
60
61 -k num Set log history to a maximum of num entries. Defaults to 100.
62
63 -l statetype
64 Load state from the last saved state file. The supported saved
65 state types are disabled for disabled watches, services, and
66 hosts, opstatus for failure/alert/ack status of all services,
67 and all for both. If no statetype is provided, disabled is
68 assumed.
69
70 -L dir Sets the log dir. See also logdir in the configuration file.
71 The default is /var/log/mon if that directory exists, otherwise
72 log.d in the base directory.
73
74 -M Pre-process the configuration file with the macro expansion
75 package m4.
76
77 -m num Set the throttle for the maximum number of processes to num.
78
79 -p num Make server listen on port num. This defaults to 2583.
80
81 -S Start with the scheduler stopped.
82
83 -P pidfile
Store the server's pid in pidfile. The default is the first of
/var/run/mon/mon.pid, /var/run/mon.pid, and /etc/mon.pid whose
directory exists. An empty value tells mon not to use a pid
file.
88
89 -r delay
90 Sets the number of seconds used to randomize the startup delay
91 before each service is scheduled. Refer to the global randstart
92 variable in the configuration file.
93
-s dir Path to monitor scripts. Default is
/usr/local/lib/mon/mon.d:mon.d. Multiple monitor paths may be
specified by separating them with a colon. Non-absolute paths
are taken to be relative to the base directory (/usr/lib/mon by
default).
99
100 -v Print version information.
101
102
104 monitor
105 A program which tests for a certain condition, returns either
106 true or false, and optionally produces output to be passed back
107 to the scheduler. Common monitors detect host reachability via
108 ICMP echo messages, or connection to TCP services.
109
110 period A period in time as interpreted by the Time::Period module.
111
112 alert A program which sends a message when invoked by the scheduler.
113 The scheduler calls upon an alert when it detects a failure from
114 a monitor. An alert program accepts a set of command-line argu‐
115 ments from the scheduler, in addition to data via standard
116 input.
117
118 hostgroup
119 A single host or list of hosts, specified as names or IP
120 addresses.
121
122 service
123 A collection of parameters used to deal with monitoring a par‐
124 ticular resource which is provided by a group. Services are usu‐
125 ally modeled after things such as an SMTP server, ICMP echo
126 capability, server disk space availability, or SNMP events.
127
view A collection of hostgroups, used to filter mon output for client
display. For example, a 'network-services' view might be defined so
your network staff can see just the hostgroups which matter to
them, without having to see all hostgroups defined in Mon.
132
133 watch A collection of services which apply to a particular group.
134
136 When the mon scheduler starts, it reads a configuration file to deter‐
137 mine the services it needs to monitor. The configuration file defaults
138 to /etc/mon.cf, and can be specified using the -c parameter. If the -M
139 option is specified, then the configuration file is pre-processed with
140 m4. If the configuration file ends with .m4, the file is also pro‐
141 cessed by m4 automatically.
142
The scheduler enters a loop which handles client connections, monitor
invocations, and failure alerts. Each service has a timer, specified in
the configuration file as the interval variable, which tells the
scheduler how frequently to invoke a monitor process. The scheduler may
be temporarily stopped. While it is stopped, client access still
functions, but no monitors are scheduled. This is useful when resetting
the server: save the hosts and services which are disabled, reset the
server with the scheduler stopped, re-disable those hosts and services,
then start the scheduler. Stopping the scheduler also allows making
atomic changes across several client connections. See the moncmd man
page for more information.
154
155
Monitor processes are invoked with the arguments specified in the
configuration file, followed by the hosts from the applicable host
group. For example, if the watch group is "servers", which contains the
hostnames "smtp", "nntp", and "ns", and the monitor line reads as
follows,
    monitor fping.monitor -t 4000 -r 2
then the executable "fping.monitor" will be executed with these
parameters:
    MONITOR_DIR/fping.monitor -t 4000 -r 2 smtp nntp ns
165
166 MONITOR_DIR is actually a search path, by default
167 /usr/local/lib/mon/mon.d then /usr/lib/mon/mon.d, but it can be over‐
168 ridden by the -s option or in the configuration file. If all hosts in
169 the hostgroup have been disabled, then a warning is sent to syslog and
170 the monitor is not run. This behavior may be overridden with the
171 "allow_empty_group" option in the service definition. If the final
172 argument to the "monitor" line is ";;" (it must be preceded by white‐
173 space), then the host list will not be appended to the parameter list.
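
The argument-building rule above can be sketched in a few lines of Python. This is only an illustration, not mon's own code: the function name build_monitor_argv is hypothetical, shlex is used as a stand-in for the "shell-like quoting" the configuration file follows, and the search for the script along MONITOR_DIR is left out.

```python
import shlex


def build_monitor_argv(monitor_args, hosts):
    """Return the argument vector for one monitor invocation.

    monitor_args is the text after the "monitor" keyword; hosts is the
    list of enabled hosts in the watch's hostgroup. (Hypothetical helper,
    not part of mon.)
    """
    argv = shlex.split(monitor_args)  # stand-in for shell-like quoting
    if argv and argv[-1] == ";;":
        # A trailing ";;" word means the host list is not appended.
        return argv[:-1]
    return argv + list(hosts)


print(build_monitor_argv("fping.monitor -t 4000 -r 2", ["smtp", "nntp", "ns"]))
```

Because shlex splits only on whitespace here, ";;" is recognized only when it is a separate word, which matches the requirement that it be preceded by whitespace.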
174
In addition to environment variables defined by the user in the service
definition, mon passes certain variables to the monitor process.
177
178
179 MON_LAST_SUMMARY
180 The first line of the output from the last time the monitor
181 exited. This is not the summary of the current monitor run, but
182 the previous one. This may be used by an alert script to pro‐
183 vide historical context in an alert.
184
185
186 MON_LAST_OUTPUT
187 The entire output of the monitor from the last time it exited.
188 This is not the output of the current monitor run, but the pre‐
189 vious one. This may be used by an alert script to provide his‐
190 torical context in an alert.
191
192
193
194 MON_LAST_FAILURE
195 The time(2) of the last failure for this service.
196
197
198 MON_FIRST_FAILURE
199 The time(2) of the first time this service failed.
200
201
202 MON_LAST_SUCCESS
203 The time(2) of the last time this service passed.
204
205
206 MON_DESCRIPTION
207 The description of this service, as defined in the configuration
208 file using the description tag.
209
210
211 MON_DEPEND_STATUS
The depend status: "0" if a dependency has failed, "1" otherwise.
213
214
215 MON_LOGDIR
The directory where log files should be placed, as indicated by
the logdir global configuration variable.
218
219
220 MON_STATEDIR
221 The directory where state files should be kept, as indicated by
222 the statedir global configuration variable.
223
224
225 MON_CFBASEDIR
226 The directory where configuration files should be kept, as indi‐
227 cated by the cfbasedir global configuration variable.
228
229
230 "fping.monitor" should return an exit status of 0 if it completed suc‐
231 cessfully (found no problems), or nonzero if a problem was detected.
232 The first line of output from the monitor script has a special meaning:
233 it is used as a brief summary of the exact failure which was detected,
234 and is passed to the alert program. All remaining output is also passed
235 to the alert program, but it has no required interpretation.
236
237 If a monitor for a particular service is still running, and the time
238 comes for mon to run another monitor for that service, it will not
239 start another monitor. For example, if the interval is 10s, and the
240 monitor does not finish running within 10 seconds, then mon will wait
241 until the first monitor exits before running another one.
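
A monitor that follows the exit-status and first-line conventions above might look like the following Python sketch. The names tcp_check, run_monitor and main are hypothetical, the TCP port is an arbitrary choice, and using the failed host names as the summary line is an assumption made for this example rather than something mon requires.

```python
import socket
import sys


def tcp_check(host, port=25, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_monitor(hosts, check=tcp_check):
    """Check every host and return (exit_status, output_text)."""
    failed = [h for h in hosts if not check(h)]
    if not failed:
        return 0, ""                       # 0: no problems found
    lines = [" ".join(failed)]             # first line: brief summary
    lines += [f"{h}: check failed" for h in failed]  # free-form detail
    return 1, "\n".join(lines) + "\n"      # nonzero: problem detected


def main(argv):
    """Entry point: argv holds the host names appended by the scheduler."""
    status, output = run_monitor(argv)
    sys.stdout.write(output)
    return status

# A real script would finish with: sys.exit(main(sys.argv[1:]))
```

Keeping the check itself separate from run_monitor makes the output convention easy to exercise without touching the network.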
242
243
245 Upon a non-zero or zero exit status, the associated alert or upalert
246 program (respectively) is started, pending the following conditions: If
247 an alert for a specific service is disabled, do not send an alert. If
248 dep_behavior is set to 'a', or alertdepend is set, and a parent depen‐
249 dency is failing, then suppress the alert. If the alert has previously
250 been acknowledged, do not send the alert, unless it is an upalert. If
251 an alert is not within the specified period, record the failure via
252 syslog(3) and do not send an alert. If the failure does not fall
253 within a defined period, do not send an alert. No upalerts are sent
254 without corresponding down alerts, unless no_comp_alerts is defined in
255 the period section. An upalert will only be sent if the previous state
256 is a failure. If an alert was already sent within the last alertevery
257 interval, do not send another alert, unless the summary output from the
258 current monitor program differs from the last monitor process. Other‐
259 wise, send an alert using each alert program listed for that period.
260 The observe_detail argument to alertevery affects this behavior by
261 observing the changes in the detail part of the output in addition to
262 the summary line. If a monitor has successive failures and the summary
263 output changes in each of them, alertevery will not suppress multiple
264 consecutive alerts. The reasoning is that if the summary output
265 changes, then a significant event occurred and the user should be
alerted. The "strict" argument to alertevery both suppresses the
comparison of the previous monitor output with the current output and
prevents a successful return value of the monitor from resetting the
alertevery timer. For example, "alertevery 24h strict" will only send
out an alert once every 24 hours, regardless of whether the monitor
output changes, or whether the service stops failing and then starts
failing again.
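
The alertevery part of this decision can be modelled as a small Python function. It is a deliberately simplified sketch, not mon's implementation: the function name and parameters are hypothetical, treating an exactly elapsed interval as "send" is an assumption, and it ignores periods, acknowledgements, dependencies, observe_detail, and the timer reset that a successful monitor run causes in non-strict mode.

```python
def alertevery_allows(now, last_alert, alertevery, summary, last_summary,
                      strict=False):
    """Return True if the alertevery rule lets an alert through.

    now and last_alert are time(2)-style seconds (last_alert is None if
    no alert has been sent yet); alertevery is the interval in seconds.
    """
    if last_alert is None:
        return True                   # nothing has been sent yet
    if now - last_alert >= alertevery:
        return True                   # the alertevery interval has elapsed
    if strict:
        return False                  # strict: output changes are ignored
    return summary != last_summary    # a changed summary is a new event
```

With strict=False, a changed summary line gets through even inside the interval; with strict=True, only the elapsed time matters.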
272
273
Alert programs are found in the path supplied with the -a parameter or,
if it is not specified, in the default alert path
(/usr/local/lib/mon/alert.d:alert.d, as given for the -a option).
They are invoked with the following command-line parameters:
278
279
280 -s service
281 Service tag from the configuration file.
282
283 -g group
284 Host group name from the configuration file.
285
286 -h hosts
287 The expanded version of the host group, space delimited, but
288 contained in one shell "word".
289
290 -l alertevery
291 The number of seconds until the next alarm will be sent.
292
-O This option is supplied to an alert only if the alert is
being generated as a result of an expected trap timing out.
295
296 -t time
297 The time (in time(2) format) of when this failure condition was
298 detected.
299
300 -T This option is supplied to an alert only if the alert was trig‐
301 gered by a trap
302
303 -u This option is supplied to an alert only if it is being called
304 as an upalert.
305
306
307 The remaining arguments are supplied from the trailing parameters in
308 the configuration file, after the "alert" service parameter.
309
310 As with monitor programs, alert programs are invoked with environment
311 variables defined by the user in the service definition, in addition to
312 the following which are explicitly set by the server:
313
314
315 MON_LAST_SUMMARY
316 The first line of the output from the last time the monitor
317 exited.
318
319
320 MON_LAST_OUTPUT
321 The entire output of the monitor from the last time it exited.
322
323
324 MON_LAST_FAILURE
325 The time(2) of the last failure for this service.
326
327
328 MON_FIRST_FAILURE
329 The time(2) of the first time this service failed.
330
331
332 MON_LAST_SUCCESS
333 The time(2) of the last time this service passed.
334
335
336 MON_DESCRIPTION
337 The description of this service, as defined in the configuration
338 file using the description tag.
339
340
341 MON_GROUP
342 The watch group which triggered this alarm
343
344
345 MON_SERVICE
346 The service heading which generated this alert
347
348
349 MON_RETVAL
350 The exit value of the failed monitor program, or return value as
351 accepted from a trap.
352
353
354 MON_OPSTATUS
355 The operational status of the service.
356
357
358 MON_ALERTTYPE
359 Has one of the following values: "failure", "up", "startup",
360 "trap", or "traptimeout", and signifies the type of alert which
361 was triggered.
362
363
364 MON_TRAP_INTENDED
This is only set when an unknown mon trap is received and caught
by the default/default watch/service. This contains colon-separated
entries of the trap's intended watch group and service name.
369
370
371 MON_LOGDIR
The directory where log files should be placed, as indicated by
the logdir global configuration variable.
374
375
376 MON_STATEDIR
377 The directory where state files should be kept, as indicated by
378 the statedir global configuration variable.
379
380
381 MON_CFBASEDIR
382 The directory where configuration files should be kept, as indi‐
383 cated by the cfbasedir global configuration variable.
384
385
386 The first line from standard input must be used as a brief summary of
387 the problem, normally supplied as the subject line of an email, or text
388 sent to an alphanumeric pager. Interpretation of all subsequent lines
389 read from stdin is left up to the alerting program. The usual parame‐
390 ters are a list of recipients to deliver the notification to. The
391 interpretation of the recipients is not specified, and is up to the
392 alert program.
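
As an illustration of the calling convention described above, the following Python sketch parses the options the scheduler supplies and splits standard input into the summary line and the remaining detail. The function name and the returned dictionary layout are hypothetical, and treating the trailing arguments as recipients follows the "usual parameters" mentioned above rather than any fixed rule.

```python
import argparse


def parse_alert_invocation(argv, stdin_text):
    """Parse scheduler-supplied options and the text read from stdin."""
    parser = argparse.ArgumentParser(add_help=False)  # -h means hosts here
    parser.add_argument("-s", dest="service")
    parser.add_argument("-g", dest="group")
    parser.add_argument("-h", dest="hosts", default="")
    parser.add_argument("-l", dest="alertevery")
    parser.add_argument("-t", dest="time")
    parser.add_argument("-O", dest="trap_timeout", action="store_true")
    parser.add_argument("-T", dest="trap", action="store_true")
    parser.add_argument("-u", dest="upalert", action="store_true")
    parser.add_argument("recipients", nargs="*")  # trailing config arguments
    args = parser.parse_args(argv)

    # First line of stdin is the summary; the rest has no fixed meaning.
    summary, _, detail = stdin_text.partition("\n")
    return {
        "service": args.service,
        "group": args.group,
        "hosts": args.hosts.split(),  # one shell word, space delimited
        "upalert": args.upalert,
        "recipients": args.recipients,
        "summary": summary,
        "detail": detail,
    }
```

A real alert would call this with sys.argv[1:] and sys.stdin.read(), then deliver the summary as, for example, the subject line of an email.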
393
394
396 The configuration file consists of zero or more global variable defini‐
397 tions, zero or more hostgroup definitions, and one or more watch defi‐
398 nitions. Each watch definition may have one or more service defini‐
399 tions. A watch definition is terminated by a blank line, another defi‐
400 nition, or the end of the file. A line beginning with optional leading
401 whitespace and a pound ("#") is regarded as a comment, and is ignored.
402
403 Lines are parsed as they are read. Long lines may be continued by end‐
404 ing them with a backslash ("\"). If a line is continued, then the
405 backslash, the trailing whitespace after the backslash, and the leading
406 whitespace of the following line are removed. The end result is assem‐
407 bled into a single line.
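
The continuation rule can be sketched in Python as follows. This is an illustration of the rule as stated, not mon's parser: the function name is hypothetical, and comment lines are passed through untouched rather than dropped.

```python
import re


def join_continued_lines(text):
    """Assemble backslash-continued physical lines into logical lines."""
    logical = []
    pending = None                        # text carried over from a continued line
    for raw in text.splitlines():
        # Leading whitespace is removed only on a line that continues another.
        line = raw if pending is None else pending + raw.lstrip()
        match = re.search(r"\\\s*$", line)  # backslash, then optional whitespace
        if match:
            pending = line[:match.start()]  # drop the backslash and what follows
        else:
            logical.append(line)
            pending = None
    if pending is not None:
        logical.append(pending)           # file ended on a continued line
    return logical


print(join_continued_lines("hostgroup servers a b \\\n    c d\n"))
```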
408
409 Typically the configuration file has the following layout:
410
411 1. Global variable definitions
412
413 2. Hostgroup definitions
414
415 3. Watch definitions
416
See the "etc/example.cf" file which comes with the distribution for an
example.
419
420
421 Global Variables
422 The following variables may be set to override compiled-in defaults.
423 Command-line options will have a higher precedence than these defini‐
424 tions.
425
426
427 alertdir = dir
428 dir is the full path to the alert scripts. This is the value set
429 by the -a command-line parameter.
430
431 Multiple alert paths may be specified by separating them with a
432 colon. Non-absolute paths are taken to be relative to the base
433 directory (/usr/lib/mon by default).
434
435 When the configuration file is read, all alerts referenced from
436 the configuration will be looked up in each of these paths, and
437 the full path to the first instance of the alert found is stored
438 in a hash. This hash is only generated upon startup or after a
439 "reset" command, so newly added alert scripts will not be recog‐
440 nized until a "reset" is performed.
441
442
443 mondir = dir
444 dir is the full path to the monitor scripts. This value may also
445 be set by the -s command-line parameter. If this path does not
446 begin with a "/", it will be relative to basedir.
447
Multiple monitor paths may be specified by separating them with a
colon. All paths must be absolute.
450
451 When the configuration file is read, all monitors referenced
452 from the configuration will be looked up in each of these paths,
453 and the full path to the first instance of the monitor found is
454 stored in a hash. This hash is only generated upon startup or
455 after a "reset" command, so newly added monitor scripts will not
456 be recognized until a "reset" is performed.
457
458
459 statedir = dir
460 dir is the full path to the state directory. mon uses this
461 directory to save various state information. If this path does
462 not begin with a "/", it will be relative to basedir.
463
464
465 logdir = dir
466 dir is the full path to the log directory. mon uses this direc‐
467 tory to save various logs, including the downtime log. If this
468 path does not begin with a "/", it will be relative to basedir.
469
470
471 basedir = dir
472 dir is the full path for the state, log, monitor, and alert
473 directories.
474
475
476 cfbasedir = dir
477 dir is the full path where all the config files can be found
478 (monusers.cf, auth.cf, etc.).
479
480
481 authfile = file
482 file is the path to the authentication file. If the path does
483 not begin with a "/", it will be relative to cfbasedir.
484
485
486 authtype = type [type...]
type is the type of authentication to use. A space-separated
list of types may be specified, and they will be checked in the
order they are listed. As soon as a successful authentication is
performed, the user is considered authenticated by mon for the
duration of the session and no more authentication checks are
performed.
493
494 If type is getpwnam, then the standard Unix passwd file authen‐
495 tication method will be used (calls getpwnam(3) on the user and
496 compares the crypt(3)ed version of the password with what it
497 gets from getpwnam). This will not work if shadow passwords are
498 enabled on the system.
499
500 If type is userfile, then usernames and hashed passwords are
501 read from userfile, which is defined via the userfile configura‐
502 tion variable.
503
504 If type is pam, then PAM (pluggable authentication modules) will
505 be used for authentication. The service specified by the pam‐
506 service global will be used. If no global is given, the PAM
507 passwd service will be used.
508
If type is trustlocal, then if the client connection comes from
localhost, the username passed from the client will be trusted,
and the password will be ignored. This can be used when you
want the client to handle the authentication for you, for example
a CGI script using one of the many Apache authentication methods.
514
515
516 userfile = file
517 This file is used when authtype is set to userfile. It consists
518 of a sequence of lines of the format 'username : password'.
519 password is stored as the hash returned by the standard Unix
520 crypt(3) function. NOTE: the format of this file is compatible
521 with the Apache file based username/password file format. It is
522 possible to use the htpasswd program supplied with Apache to
523 manage the mon userfile.
524
525 Blank lines and lines beginning with # are ignored.
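
A reader for this file format could be sketched in Python as below. The function name is hypothetical, tolerating whitespace before a "#" and skipping lines without a colon are assumptions of this sketch, and no password verification is attempted.

```python
def parse_userfile(text):
    """Map usernames to their stored crypt(3) hashes."""
    users = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue                      # blank line or comment
        name, sep, hashed = stripped.partition(":")
        if not sep:
            continue                      # assumption: skip lines without a colon
        users[name.strip()] = hashed.strip()
    return users


# The hash values below are placeholders, not real crypt(3) output.
sample = "# mon users\n\nalice : PLACEHOLDERHASH1\nbob:PLACEHOLDERHASH2\n"
print(parse_userfile(sample))
```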
526
527
528 pamservice = service
529 The PAM service used for authentication. This is applicable only
530 if "pam" is specified as a parameter to the authtype setting. If
531 this global is not defined, it defaults to passwd.
532
533
534 serverbind = addr
535
536
537 trapbind = addr
538
539 serverbind and trapbind specify which address to bind the server
540 and trap ports to, respectively. If these are not defined, the
541 default address is INADDR_ANY, which allows connections on all
542 interfaces. For security reasons, it could be a good idea to
543 bind only to the loopback interface.
544
545
546 dtlogfile = file
file is a file which will be used to record the downtime log.
Whenever a service fails for some amount of time and then stops
failing, this event is written to the log. If this parameter is
not set, no logging is done. The format of the file is as
follows (# is a comment and may be ignored):
552
553 timenoticed group service firstfail downtime interval summary.
554
555 timenoticed is the time(2) the service came back up.
556
557 group service is the group and service which failed.
558
559 firstfail is the time(2) when the service began to fail.
560
561 downtime is the number of seconds the service failed.
562
563 interval is the frequency (in seconds) that the service is
564 polled.
565
566 summary is the summary line from when the service was failing.
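
A downtime log line in this format could be read with a Python sketch such as the following. The class and function names are hypothetical, and the sketch assumes that fields are separated by whitespace, that the numeric fields parse as numbers, and that the summary is everything after the sixth field.

```python
from typing import NamedTuple, Optional


class DowntimeEntry(NamedTuple):
    timenoticed: float   # time(2) the service came back up
    group: str
    service: str
    firstfail: float     # time(2) the service began to fail
    downtime: float      # seconds the service was failing
    interval: float      # polling frequency in seconds
    summary: str


def parse_dtlog_line(line: str) -> Optional[DowntimeEntry]:
    """Parse one downtime log line; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 6)          # summary keeps its inner spaces
    if len(fields) < 6:
        return None                       # assumption: ignore short lines
    summary = fields[6] if len(fields) > 6 else ""
    return DowntimeEntry(float(fields[0]), fields[1], fields[2],
                         float(fields[3]), float(fields[4]),
                         float(fields[5]), summary)
```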
567
568
569 monerrfile = filename
570 By default, when mon daemonizes itself, it connects stdout and
571 stderr to /dev/null. If monerrfile is set to a file, then stdout
572 and stderr will be appended to that file. In all cases stdin is
573 connected to /dev/null. If mon is told to run in the foreground
574 and to not daemonize, then none of this applies, since
575 stdin/stdout/stderr stay connected to whatever they were at the
576 time of invocation.
577
578
579 dtlogging = yes/no
580
581 Turns downtime logging on or off. The default is off.
582
583
584 histlength = num
num is the maximum number of events to be retained in the
history list. The default is 100. This value may also be set by
the -k command-line parameter.
588
589
590 historicfile = file
591 If this variable is set, then alerts are logged to file, and
592 upon startup, some (or all) of the past history is read into
593 memory.
594
595
596 historictime = timeval
timeval is the amount of the history file to read upon startup.
"Now" - timeval is read. See the explanation of interval in the
"Service Definitions" section for a description of timeval.
600
601
602 serverport = port
603 port is the TCP port number that the server should bind to. This
604 value may also be set by the -p command-line parameter. Normally
605 this port is looked up via getservbyname(3), and it defaults to
606 2583.
607
608
609 trapport = port
610 port is the UDP port number that the trap server should bind to.
611 Normally this port is looked up via getservbyname(3), and it
612 defaults to 2583.
613
614
615 pidfile = path
path is the file the server will store its pid in. This value
may also be set by the -P command-line parameter.
618
619
620 maxprocs = num
621 Throttles the number of concurrently forked processes to num.
622 The intent is to provide a safety net for the unlikely situation
623 when the server tries to take on too many tasks at once. Note
624 that this situation has only been reported to happen when trying
625 to use a garbled configuration file! You don't want to use a
626 garbled configuration file now, do you?
627
628
629 cltimeout = secs
630 Sets the client inactivity timeout to secs. This is meant to
631 help thwart denial of service attacks or recover from crashed
632 clients. secs is interpreted as a "1h/1m/1s" string, where "1m"
633 = 60 seconds.
634
635
636 randstart = interval
637 When the server starts, normally all services will not be sched‐
638 uled until the interval defined in the respective service sec‐
639 tion. This can cause long delays before the first check of a
640 service, and possibly a high load on the server if multiple
641 things are scheduled at the same intervals. This option is used
642 to randomize the scheduling of the first test for all services
643 during the startup period, and immediately after the reset com‐
644 mand. If randstart is defined, the scheduled run time of all
645 services of all watch groups will be a random number between
646 zero and randstart seconds.
647
648
649 dep_recur_limit = depth
Limit dependency recursion level to depth. If dependency
recursion (dependencies which depend on other dependencies) tries to
go beyond depth, then the recursion is aborted and a message is
logged to syslog. The default limit is 10.
654
655
656 dep_behavior = {a|m|hm}
657 dep_behavior controls whether the dependency expression sup‐
658 presses one of: the running of alerts, the running of monitors,
659 or the passing of individual hosts to the monitors. Read more
660 about the behavior in the "Service Definitions" section below.
661
662 This is a global setting which controls the default settings for
663 the service-specified variable.
664
665
666 dep_memory = timeval
667 If set, dep_memory will cause dependencies to continue to pre‐
668 vent alerts/monitoring for a period of time after the service
669 returns to a normal state. This can be used to prevent over-
670 eager alerting when a machine is rebooting, for example. See
671 the explanation of interval in the "Service Definitions" section
672 for a description of timeval.
673
674 This is a global setting which controls the default settings for
675 the service-specified variable.
676
677
678 syslog_facility = facility
679 Specifies the syslog facility used for logging. daemon is the
680 default.
681
682
683
684
685 startupalerts_on_reset = {yes|no}
686
687 If set to "yes", startupalerts will be invoked when the reset
688 client command is executed. The default is "no".
689
690
691 monremote = program
692
693 If set, this external program will be called by Mon when various
694 client requests are processed. This can be used to propagate
695 those changes from one Mon server to another, if you have multi‐
696 ple monitoring machines. An example script, monremote.pl is
697 available in the clients directory.
698
699
700 Hostgroup Entries
701 Hostgroup entries begin with the keyword hostgroup, and are followed by
702 a hostgroup tag and one or more hostnames or IP addresses, separated by
703 whitespace. The hostgroup tag must be composed of alphanumeric charac‐
704 ters, a dash ("-"), a period ("."), or an underscore ("_"). Non-blank
705 lines following the first hostgroup line are interpreted as more host‐
706 names. The hostgroup definition ends with a blank line. For example:
707
708 hostgroup servers nameserver smtpserver nntpserver
709 nfsserver httpserver smbserver
710
711 hostgroup router_group cisco7000 agsplus
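
The hostgroup rules above can be sketched in Python as follows. The function name is hypothetical, and the sketch handles only hostgroup lines, their continuation lines, and the terminating blank line; comments, views, and watch definitions are out of scope.

```python
import re

# Alphanumerics, dash, period, underscore, as described above.
TAG_RE = re.compile(r"^[A-Za-z0-9._-]+$")


def parse_hostgroups(text):
    """Return a dict mapping each hostgroup tag to its list of hosts."""
    groups = {}
    current = None                        # host list of the open definition
    for line in text.splitlines():
        words = line.split()
        if not words:
            current = None                # a blank line ends the definition
        elif words[0] == "hostgroup":
            if len(words) < 2 or not TAG_RE.match(words[1]):
                raise ValueError(f"bad hostgroup line: {line!r}")
            current = groups.setdefault(words[1], [])
            current.extend(words[2:])
        elif current is not None:
            current.extend(words)         # non-blank line: more hostnames
    return groups
```

Applied to the example above, this yields a "servers" group with six hosts and a "router_group" group with two.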
712
713
714 View Entries
715 View entries begin with the keyword view, and are followed by a view
716 tag and the names of one or more hostgroups. The view tag must be com‐
717 posed of alphanumeric characters, a dash ("-"), a period ("."), or an
718 underscore ("_"). Non-blank lines following the first view line are
719 interpreted as more hostgroup names. The view definition ends with a
720 blank line. For example:
721
722 view servers dns-servers web-servers file-servers
723 mail-servers
724
725 view network-services routers switches vpn-servers
726
727
728
729 Watch Group Entries
730 Watch entries begin with a line that starts with the keyword watch,
731 followed by whitespace and a single word which normally refers to a
732 pre-defined hostgroup. If the second word is not recognized as a host‐
733 group tag, a new hostgroup is created whose tag is that word, and that
734 word is its only member.
735
736 Watch entries consist of one or more service definitions.
737
738 A watch group is terminated by a blank line, the end of the file, or by
739 a subsequent definition, "watch", "hostgroup", or otherwise.
740
741 There may be a special watch group entry called "default". If a default
742 watch group is defined with a service entry named "default", then this
743 definition will be used in handling traps received for an unrecognized
744 watch and service.
745
746
747 Service Definitions
748 service servicename
749 A service definition begins with they keyword service followed
750 by a word which is the tag for this service. This word must be
751 unique among all services defined for the same watch group.
752
753 The components of a service are an interval, monitor, and one or
754 more time period definitions, as defined below.
755
756 If a service name of "default" is defined within a watch group
757 called "dafault" (see above), then the default/default defini‐
758 tion will be used for handling unknown mon traps.
759
760 The following configuration parameters are valid only following
761 a service definition:
762
763
764 VARIABLE=value
765 Environment variables may be defined for each service, which
766 will be included in the environment of monitors and alerts.
767 Variables must be specified in all capital letters, must begin
768 with an alphabetical character or an underscore, and there must
769 be no spaces to the left of the equal sign.
770
771
772 interval timeval
The keyword interval followed by a time value specifies the
frequency that a monitor script will be triggered. Time values are
defined as "30s", "5m", "1h", or "1d", meaning 30 seconds, 5
minutes, 1 hour, or 1 day. The numeric portion may be a
fraction, such as "1.5h" for an hour and a half. This format of a
time specification will be referred to as timeval.
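
The timeval format can be converted to seconds with a short Python sketch. The function name is hypothetical, and rejecting values without a unit suffix is an assumption, since the text above does not say how a bare number is treated.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
_TIMEVAL_RE = re.compile(r"^(\d+(?:\.\d+)?)([smhd])$")


def parse_timeval(text):
    """Convert a timeval such as "30s" or "1.5h" to seconds (float)."""
    match = _TIMEVAL_RE.match(text.strip())
    if not match:
        # Assumption: a missing or unknown unit is treated as an error.
        raise ValueError(f"not a timeval: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_SECONDS[unit]


print(parse_timeval("1.5h"))
```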
779
780
781 failure_interval timeval
782 Adjusts the polling interval to timeval when the service check
783 is failing. Resets the interval to the original when the service
784 succeeds.
785
786
787 traptimeout timeval
788 This keyword takes the same time specification argument as
789 interval, and makes the service expect a trap from an external
790 source at least that often, else a failure will be registered.
791 This is used for a heartbeat-style service.
792
793
794 trapduration timeval
795 If a trap is received, the status of the service the trap was
796 delivered to will normally remain constant. If trapduration is
797 specified, the status of the service will remain in a failure
798 state for the duration specified by timeval, and then it will be
799 reset to "success".
800
801
802 randskew timeval
Rather than schedule the monitor script to run at the start of
each interval, randomly adjust the interval specified by the
interval parameter by plus or minus randskew. The skew value
is specified in the same format as the interval parameter: "30s",
"5m", etc. For example, if interval is "1m" and randskew is "5s",
then mon will schedule the monitor script to run every 55 to 65
seconds. The intent is to help distribute the load on the server
when many services are scheduled at the same intervals.
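
The effect of randskew can be illustrated with a few lines of Python. The function name is hypothetical, and drawing the skew from a uniform distribution is an assumption of this sketch; the text above only states the plus-or-minus range.

```python
import random


def next_delay(interval, randskew, rng=random):
    """Return the seconds until the next run: interval +/- randskew."""
    # Assumption: the skew is drawn uniformly from [-randskew, +randskew].
    return interval + rng.uniform(-randskew, randskew)


# interval 1m with randskew 5s gives a delay between 55 and 65 seconds.
print(next_delay(60, 5))
```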
812
813
814 monitor monitor-name [arg...]
815 The keyword monitor followed by a script name and arguments
816 specifies the monitor to run when the timer expires. Shell-like
817 quoting conventions are followed when specifying the arguments
818 to send to the monitor script. The script is invoked from the
819 directory given with the -s argument, and all following words
820 are supplied as arguments to the monitor program, followed by
821 the list of hosts in the group referred to by the current watch
822 group. If the monitor line ends with ";;" as a separate word,
823 the host groups are not appended to the argument list when the
824 program is invoked.
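The way the argument list is assembled can be sketched like this. The function name build_monitor_argv, the directory, and the monitor names in the usage note are hypothetical, and Python's shlex module only approximates the "shell-like quoting conventions" that mon itself applies.

```python
import posixpath
import shlex


def build_monitor_argv(monitor_line, script_dir, hosts):
    """Build the argument vector for a monitor invocation.

    monitor_line is the text that follows the "monitor" keyword.
    """
    words = shlex.split(monitor_line)
    append_hosts = True
    # A trailing ";;" as a separate word suppresses the host list.
    if words and words[-1] == ";;":
        words = words[:-1]
        append_hosts = False
    if not words:
        raise ValueError("monitor line names no script")
    # The script is looked up in the directory given with -s.
    argv = [posixpath.join(script_dir, words[0])] + words[1:]
    if append_hosts:
        argv += list(hosts)
    return argv
```

For example, the line "local.monitor -x ;;" in a watch whose hostgroup contains www1 produces only the script path and "-x", with no host appended.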
825
826
827 allow_empty_group
828 The allow_empty_group option will allow a monitor to be invoked
829 even when the hostgroup for that watch is empty because of dis‐
830 abled hosts. The default behavior is not to invoke the monitor
831 when all hosts in a hostgroup have been disabled.
832
833
834 description descriptiontext
835 The text following description is queried by client programs,
836 passed to alerts and monitors via an environment variable. It
837 should contain a brief description of the service, suitable for
838 inclusion in an email or on a web page.
839
840
841 exclude_hosts host [host...]
842 Any hosts listed after exclude_hosts will be excluded from the
843 service check.
844
845
846 exclude_period periodspec
847 Do not run a scheduled monitor during the time identified by
848 periodspec.
849
850
851 depend dependexpression
The depend keyword is used to specify a dependency expression,
which evaluates to either true or false, in the boolean sense.
Dependencies are actual Perl expressions, and must obey all
syntactical rules. The expressions are evaluated in their own
package space so that they do not accidentally cause unwanted
side effects. If a syntax error is found when evaluating the
expression, it is logged via syslog.
859
860 Before evaluation, the following substitutions on the expression
861 occur: phrases which look like "group:service" are substituted
862 with the value of the current operational status of that speci‐
863 fied service. These opstatus substitutions are computed recur‐
864 sively, so if service A depends upon service B, and service B
865 depends upon service C, then service A depends upon service C.
866 Successful operational statuses (which evaluate to "1") are
867 "STAT_OK", "STAT_COLDSTART", "STAT_WARMSTART", and
868 "STAT_UNKNOWN". The word "SELF" (in all caps) can be used for
869 the group (e.g. "SELF:service"), and is an abbreviation for the
870 current watch group.
871
872 This feature can be used to control alerts for services which
873 are dependent on other services, e.g. an SMTP test which is
874 dependent upon the machine being ping-reachable.
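The substitution step described for depend can be sketched as follows. This is a simplified illustration, not mon's implementation: the function name, the regular expression used to recognize "group:service" phrases, and the "STAT_FAIL" status used in the usage note are assumptions. The sketch does not model the recursive computation, and it does not evaluate the resulting Perl expression.

```python
import re

# Operational statuses that this manual lists as successful.
OK_STATUSES = {"STAT_OK", "STAT_COLDSTART", "STAT_WARMSTART", "STAT_UNKNOWN"}

# Assumed shape of a "group:service" phrase.
_REF_RE = re.compile(r"([A-Za-z0-9_.-]+):([A-Za-z0-9_.-]+)")


def substitute_opstatus(expression, current_group, opstatus):
    """Replace each group:service phrase with "1" or "0".

    opstatus maps (group, service) pairs to a status name. "SELF"
    stands for the current watch group.
    """
    def replace(match):
        group, service = match.groups()
        if group == "SELF":
            group = current_group
        return "1" if opstatus[(group, service)] in OK_STATUSES else "0"

    return _REF_RE.sub(replace, expression)
```

If routers:ping is "STAT_OK" and the ping service of the current group is in a failing state such as "STAT_FAIL", the expression "routers:ping && SELF:ping" becomes "1 && 0", which is false.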
875
876
877 dep_behavior {a|m|hm}
878 The evaluation of the dependency graphs specified via the depend
879 keyword can control the suppression of alert or monitor invoca‐
880 tions, or the suppression of individual hosts passed to the mon‐
881 itor.
882
883 Alert suppression. If this option is set to "a", then the
884 dependency expression will be evaluated after the monitor for
885 the service exits or after a trap is received. An alert will
886 only be sent if the evaluation succeeds, meaning that none of
887 the nodes in the dependency graph indicate failure.
888
Monitor suppression. If it is set to "m", then the dependency
expression will be evaluated before the monitor for the service
is about to run. If the evaluation succeeds, then the monitor
will be run. Otherwise, the monitor will not be run and the
status of the service will remain the same.
894
895 Host suppression. If it is set to "hm" then Mon will extract
896 the list of "parent" services from the dependency expression.
897 (In fact the expression can be just a list of services.) Then
898 when the monitor for the service is about to be run, for each
899 host in the current hostgroup Mon will search all the parent
900 services which are currently failing and look for the hostname
901 in the current summary output. If the hostname is found, this
902 host will be excluded from this run of the monitor. This can be
903 used to e.g. allow an SMTP test on a group of hosts to still be
904 run even when a single host is not ping-reachable. If all the
905 rest of the hosts are working fine, the service will be in an OK
906 state, but if another host fails the SMTP test Mon can still
907 alert about that host even though the parent dependency was
908 failing. The dependency expression will not be used recursively
909 in this case.
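Host suppression can be sketched as a filter over the hostgroup. The function name is hypothetical, and the sketch assumes that each failing parent service's summary output is a whitespace-separated list of host names, which is the common case but not a guarantee.

```python
def hosts_to_monitor(hosts, failing_parent_summaries):
    """Drop hosts named in the summary of any failing parent service.

    failing_parent_summaries holds the current summary output of each
    parent service that is currently failing.
    """
    failing = set()
    for summary in failing_parent_summaries:
        failing.update(summary.split())
    return [host for host in hosts if host not in failing]
```

If the parent ping service is failing with the summary "mail2", a hostgroup of mail1, mail2, and mail3 is reduced to mail1 and mail3 for this run of the monitor.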
910
911
912 alertdepend dependexpression
913
914 monitordepend dependexpression
915
916 hostdepend dependexpression
917 These keywords allow you to specify multiple dependency expres‐
918 sions of different types. Each one corresponds to the different
919 dep_behavior settings listed above. They will be evaluated
920 independently in the different contexts as listed above. If
921 depend is present, it takes precedence over the matching key‐
922 word, depending on the dep_behavior setting.
923
924
925 dep_memory timeval
926 If set, dep_memory will cause dependencies to continue to pre‐
927 vent alerts/monitoring for a period of time after the service
928 returns to a normal state. This can be used to prevent over-
929 eager alerting when a machine is rebooting, for example. See
930 the explanation of interval in the "Service Definitions" section
931 for a description of timeval.
932
933
934 redistribute alert [arg...]
A service may have one redistribute option, which is a special
form of an alert definition. This alert will be called on
every service status update, even sequential success status
updates. This can be used to integrate Mon with another
monitoring system, or to link together multiple Mon servers via an
alert script that generates Mon traps. See the "ALERT PROGRAMS"
section above for a list of the parameters mon will pass
automatically to alert programs.
943
944
945 unack_summary
946 Remove the "acknowledged" state from a service if the summary
947 component of the failure message changes. In most common usage
948 the summary is the list of hosts that are failing, so additional
949 hosts failing would remove an ack.
950
951
952
953 Period Definitions
954 Periods are used to define the conditions which should allow alerts to
955 be delivered.
956
957
958 period [label:] periodspec
959 A period groups one or more alarms and variables which control
960 how often an alert happens when there is a failure. The period
961 definition has two forms. The first takes an argument which is a
962 period specification from Patrick Ryan's Time::Period Perl 5
963 module. Refer to "perldoc Time::Period" for more information.
964
965 The second form requires a label followed by a period specifica‐
966 tion, as defined above. The label is a tag consisting of an
967 alphabetic character or underscore followed by zero or more
968 alphanumerics or underscores and ending with a colon. This form
969 allows multiple periods with the same period definition. One use
970 is to have a period definition which has no alertafter or
971 alertevery parameters for a particular time period, and another
972 for the same time period with a different set of alerts that
973 does contain those parameters.
974
975 Period definitions, in either the first or second form, must be
976 unique within each service definition. For example, if you need
977 to define two periods both for "wd {Sun-Sat}", then one or both
978 of the period definitions must specify a label such as "period
979 t1: wd {Sun-Sat}" and "period t2: wd {Sun-Sat}".
980
981
982 alertevery timeval [observe_detail | strict]
The alertevery keyword (within a period definition) takes the
same type of argument as the interval variable, and limits the
number of times an alert is sent when the service continues to
fail. For example, if alertevery is "1h", then the alerts in the
period section will be triggered only once every hour. If the
alertevery keyword is omitted in a period entry, an alert will be
sent out every time a failure is detected. By default, if the
summary output of two successive failures changes, then the
alertevery interval is overridden, and an alert will be sent. If
the string "observe_detail" is the last argument, then both the
summary and detail output lines will be considered when comparing
the output of successive failures. If the string "strict" is the
last argument, then the output of the monitor or the state change
of the service will have no effect on when alerts are sent. That
is, "alertevery 24h strict" will send only one alert every 24
hours, no matter what. Please refer to the ALERT DECISION LOGIC
section for a detailed explanation of how alerts are suppressed.
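The three alertevery variants can be sketched as one decision function. This is a simplified reading of the paragraph above, not mon's actual logic, which is described in the ALERT DECISION LOGIC section. The function name and argument layout are hypothetical, times are in seconds, and the choice to allow an alert exactly when the interval has elapsed is an assumption.

```python
def alertevery_allows(now, last_alert, alertevery, mode, current, previous):
    """Decide whether alertevery permits another alert.

    last_alert is None if no alert has been sent yet. mode is None,
    "observe_detail", or "strict". current and previous are
    (summary, detail) tuples of the two most recent failures.
    """
    if last_alert is None or now - last_alert >= alertevery:
        return True
    if mode == "strict":
        # Output changes never override the interval.
        return False
    if mode == "observe_detail":
        # A change in summary or detail overrides the interval.
        return current != previous
    # Default: only a change in the summary overrides the interval.
    return current[0] != previous[0]
```

With "alertevery 1h", a second failure 30 minutes after the last alert is suppressed unless its summary differs from the previous one; with "strict" it is suppressed regardless.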
1001
1002
1003 alertafter num
1004
1005
1006 alertafter num timeval
1007
1008
1009 alertafter timeval
1010 The alertafter keyword (within a period section) has three
1011 forms: only with the "num" argument, or with the "num timeval"
1012 arguments, or only with the "timeval" argument. In the first
1013 form, an alert will only be invoked after "num" consecutive
1014 failures.
1015
1016 In the second form, the arguments are a positive integer fol‐
1017 lowed by an interval, as described by the interval variable
1018 above. If these parameters are specified, then the alerts for
1019 that period will only be called after that many failures happen
1020 within that interval. For example, if alertafter is given the
1021 arguments "3 30m", then the alert will be called if 3 failures
1022 happen within 30 minutes.
1023
In the third form, the argument is an interval, as described by
the interval variable above. Alerts for that period will only
be called if the service has been in a failure state for more
than the length of time described by the interval, regardless of
the number of failures noticed within that interval.
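The three forms of alertafter can be compared side by side in a short sketch. It is an interpretation of the text above, not mon's code. The function name is hypothetical, times are in seconds, and failure_times is assumed to hold the ascending timestamps of the failures in the current unbroken run of failures, including the latest one.

```python
def alertafter_allows(failure_times, now, num=None, window=None):
    """Decide whether alertafter permits an alert.

    Pass num alone for "alertafter num", num and window for
    "alertafter num timeval", and window alone for
    "alertafter timeval".
    """
    if num is not None and window is None:
        # First form: at least num consecutive failures.
        return len(failure_times) >= num
    if num is not None:
        # Second form: at least num failures within the window.
        recent = [t for t in failure_times if now - t <= window]
        return len(recent) >= num
    if window is not None:
        # Third form: failing for longer than the window.
        return bool(failure_times) and now - failure_times[0] > window
    # No alertafter setting: nothing is held back.
    return True
```

For "alertafter 3 30m", failures at 0, 10, and 20 minutes allow an alert at the third failure, while failures at 0, 20, and 40 minutes do not, because only two of them fall within the last 30 minutes.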
1029
1030
1031 numalerts num
1032
1033 This variable tells the server to call no more than num alerts
1034 during a failure. The alert counter is kept on a per-period
1035 basis, and is reset upon each success.
1036
1037
1038 no_comp_alerts
1039
1040 If this option is specified, then upalerts will be called when‐
1041 ever the service state changes from failure to success, rather
1042 than only after a corresponding "down" alert.
1043
1044
1045 alert alert [arg...]
A period may contain multiple alerts, which are triggered upon
failure of the service. An alert is specified with the alert
keyword, followed by an optional exit parameter, and arguments
which are interpreted the same as the monitor definition, but
without the ";;" exception. The exit parameter takes the form of
exit=x or exit=x-y and has the effect that the alert is only
called if the exit status of the monitor script falls within the
range of the exit parameter. If, for example, the alert line is
"alert exit=10-20 mail.alert mis", then mail.alert will only be
invoked with mis as its argument if the monitor program's exit
value is between 10 and 20. This feature allows you to trigger
different alerts at different severity levels (like when free
disk space goes from 8% to 3%).

See the ALERT PROGRAMS section above for a list of the
parameters mon will pass automatically to alert programs.
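The exit parameter check can be sketched as a range test. The function name is hypothetical, and the sketch assumes that both ends of the range are inclusive, which the manual does not state explicitly.

```python
import re

# Matches "exit=x" or "exit=x-y" with non-negative integers.
_EXIT_RE = re.compile(r"^exit=(\d+)(?:-(\d+))?$")


def exit_matches(exit_param, status):
    """Return True if the monitor's exit status falls in the range."""
    match = _EXIT_RE.match(exit_param)
    if match is None:
        raise ValueError(f"not an exit parameter: {exit_param!r}")
    low = int(match.group(1))
    # "exit=x" is treated as the single-value range x-x.
    high = int(match.group(2)) if match.group(2) is not None else low
    return low <= status <= high
```

Under this reading, "exit=10-20" selects the alert for a monitor exit status of 15 but not for 21.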
1062
1063
1064 upalert alert [arg...]
An upalert is the complement of an alert. An upalert is called
when a service makes the state transition from failure to
success, if a corresponding "down" alert was previously sent. The
upalert script is called with the same parameters as the alert
script, with the addition of the -u parameter, which simply lets
an alert script know that it is being called as an upalert.
Multiple upalerts may be specified for each period definition.
Set the per-period no_comp_alerts option to send an upalert
regardless of whether a "down" alert was sent.
1075
1076
1077 startupalert alert [arg...]
1078 A startupalert is only called when the mon server starts execu‐
1079 tion, or when a "reset" command was issued to the server,
1080 depending on the setting of the startupalerts_on_reset global.
1081 Unlike other alerts, startupalerts are not called following the
1082 exit of a monitor, i.e. they are called in their own right,
1083 therefore the "exit=" argument is not applicable to startu‐
1084 palert.
1085
1086
1087 upalertafter timeval
The upalertafter parameter is specified as a string that follows
the syntax of the interval parameter ("30s", "1m", etc.), and
controls the triggering of an upalert. If a service comes back
up after being down for a time greater than or equal to the
value of this option, an upalert will be called. Use this option
to prevent upalerts from being called because of "blips" (brief
outages).
1095
1096
The file specified by the authfile variable in the configuration file
(or passed via the -A parameter) will be loaded upon startup. This
file defines which client commands may be executed by which users. It
is a text file which consists of comments, command definitions, and
trap authentication parameters. A comment line begins with optional
whitespace followed by a pound sign. Blank lines are ignored.
1105
1106 The file is separated into a command section and a trap section. Sec‐
1107 tions are specified by a single line containing one of the following
1108 statements:
1109
1110 command section
1111
1112 or
1113
1114 trap section
1115
1116 Lines following one of the above statements apply to that section until
1117 either the end of the file or another section begins.
1118
1119 A command definition consists of a command, followed by a colon, fol‐
1120 lowed by a comma-separated list of users who may execute the command.
1121 The default is that no users may execute any commands unless they are
1122 explicitly allowed in this configuration file. For clarity, a user can
1123 be denied by prefixing the user name with "!". If the word "AUTH_ANY"
1124 is used for a username, then any authenticated user will be allowed to
1125 execute the command. If the word "all" is used for a username, then
1126 that command may be executed by any user, authenticated or not.
1127
The trap section configures which users may send traps from which
hosts. The syntax is a source host (name or IP address), whitespace,
a username, whitespace, and a plaintext password for that user. If
the source host is "*", then traps are allowed from any host. If the
username is "*", then traps are accepted without regard for the
username or password. If no hosts or users are specified, then no
traps will be accepted.
1135
1136 An example configuration file:
1137
    command section
        list:       all
        reset:      root,admin
        loadstate:  root
        savestate:  root

    trap section
        127.0.0.1  root  r@@tp4sswrd
1146
This means that all clients are able to perform the "list" command,
"root" is able to perform "reset", "loadstate", and "savestate", and
"admin" is able to perform "reset". Traps are accepted only from
127.0.0.1 for the user "root" with the given password.
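The file format described in this section can be sketched as a small parser with a permission check. This is an illustration, not mon's parser: the function names are hypothetical, lines that appear before any section statement are ignored, a "!" entry is assumed to win over any other entry, and a user who is named explicitly is assumed to need authentication.

```python
def parse_auth(text):
    """Parse auth file text into (commands, traps).

    commands maps a command name to its list of user entries.
    traps is a list of (host, user, password) tuples.
    """
    commands, traps = {}, []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        if line in ("command section", "trap section"):
            section = line.split()[0]
        elif section == "command":
            name, _, users = line.partition(":")
            commands[name.strip()] = [
                user.strip() for user in users.split(",") if user.strip()
            ]
        elif section == "trap":
            host, user, password = line.split(None, 2)
            traps.append((host, user, password))
    return commands, traps


def may_execute(commands, command, user, authenticated=True):
    """Return True if user may execute command; the default is deny."""
    users = commands.get(command, [])
    if "!" + user in users:
        return False
    if "all" in users:
        return True
    if authenticated and "AUTH_ANY" in users:
        return True
    return authenticated and user in users
```

Applied to the example file above, the sketch allows "admin" to run "reset" but not "loadstate", and allows an unauthenticated client to run "list".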
1150
1151
1153 The server listens on TCP port 2583, which may be overridden using the
1154 -p port option. Commands are a single line each, terminated by a new‐
1155 line. The server can handle any number of simultaneous client connec‐
1156 tions.
1157
1158
1160 See manual page for moncmd.
1161
1162
Mon has the facility to receive special "mon traps" from any local or
remote machine. Currently, the only available method for sending mon
traps is through the Mon::Client Perl interface, though the UDP packet
format is defined well enough to permit the writing of traps in other
languages.
1169
1170 Traps are handled similarly to monitors: a trap sends an operational
1171 status, summary line, and description text, and mon generates an alert
1172 or upalert as necessary.
1173
Traps can be caught by any watch/service group set up in the mon
configuration file; however, it is suggested that you configure
watch/service groups specifically for the traps you expect to receive.
When defining a special watch/service group for traps, do not include
a "monitor" directive (as no monitor need be invoked). Since a monitor
is not being invoked, it is not necessary for the watch definition to
have a hostgroup which contains real host names. Just make up a useful
name, and mon will automatically create the watch group for you.
1182
1183 Here is a simple config file example:
1184
    watch trap-service
        service host1-disks
            description TRAP: for host1 disk status
            period wd {Sun-Sat}
                alert mail.alert someone@your.org
                upalert mail.alert -u someone@your.org
1191
1192
1193 Since mon listens on a UDP port for any trap, a default facility is
1194 available for handling traps to unknown groups or services. To enable
1195 this facility, you must include a "default" watch group with a
1196 "default" service entry containing the specifics of alarms. If a
1197 default/default watch group and service are not configured, then
1198 unknown traps get logged via syslog, and no alarm is sent. NOTE: The
1199 default/default facility is a single entity as far as accounting and
1200 alarming go. Alarm programs which are not aware of this fact may send
1201 confusing information when a failure trap comes from one machine, fol‐
1202 lowed by a success (ok) trap from a different machine. See the alarm
1203 environment variable MON_TRAP_INTENDED above for a possible way around
1204 this. It is intended that default/default be used as a facility to
1205 catch unknown traps, and should not be relied upon to catch all traps
1206 in a production environment. If you are lazy and only want to use
1207 default/default for catching all traps, it would be best to disable
1208 upalerts, and use the MON_TRAP_INTENDED environment variable in alert
1209 scripts to make the alerts more meaningful to you.
1210
1211 Here is an example default facility:
1212
    watch default
        service default
            description Default trap service
            period wd {Sun-Sat}
                alert mail.alert someone@your.org
                upalert mail.alert -u someone@your.org
1219
1220
1221
1223 The mon distribution comes with an example configuration called exam‐
1224 ple.cf. Refer to that file for more information.
1225
1226
1228 moncmd(1), Time::Period(3pm), Mon::Client(3pm)
1229
1231 mon was written because I couldn't find anything out there that did
1232 just what I needed, and nothing was worth modifying to add the features
1233 I wanted. It doesn't have a cool name, and that bothers me because I
1234 couldn't think of one.
1235
1237 Report bugs to the email address below.
1238
1240 Jim Trocki <trockij@arctic.org>
1241
1242
1243
1244Linux $Date: 2007/06/25 13:10:07 $ mon(8)