
mon(8)                Parallel Service Monitoring Daemon                mon(8)

NAME

6       mon - monitor services for availability, sending alarms upon failures.
7

SYNOPSIS

9       mon [-dfhlMSv] [-a dir] [-A authfile] [-b dir] [-B dir] [-c config] [-D
10       dir] [-i secs] [-k num] [-l [statetype]] [-L dir] [-m num] [-p num] [-P
11       pidfile] [-r delay] [-s dir]
12

DESCRIPTION

14       mon  is a general-purpose scheduler for monitoring service availability
15       and triggering alerts upon detecting failures.  mon was designed to  be
16       open  in the sense that it supports arbitrary monitoring facilities and
17       alert methods via a common  interface,  which  are  easily  implemented
18       through programs (in C, Perl, shell, etc.), SNMP traps, and special Mon
19       (UDP packet) traps.
20
21

OPTIONS

       -a dir Path to alert scripts.  Default is
              /usr/local/lib/mon/alert.d:alert.d.  Multiple alert paths may be
              specified by separating them with a colon.  Non-absolute paths
              are taken to be relative to the base directory (/usr/lib/mon by
              default).
28
29       -b dir Base directory for mon. scriptdir, alertdir,  and  statedir  are
30              all relative to this directory unless specified from /.  Default
31              is /usr/lib/mon.
32
33       -B dir Configuration file base directory. All config files are  located
34              here, including mon.cf, monusers.cf, and auth.cf.
35
36       -A authfile
37              Authentication   configuration   file.   By   default   this  is
38              /etc/mon/auth.cf  if   the   /etc/mon   directory   exists,   or
39              /usr/lib/mon/auth.cf otherwise.
40
41       -c file
              Read configuration from file.  This defaults to /etc/mon/mon.cf
              if the /etc/mon directory exists, otherwise to /etc/mon.cf.
45
46       -d     Enable debugging mode.
47
48       -D dir Path   to   state   directory.    Default   is   the   first  of
49              /var/state/mon,  /var/lib/mon,  and  /usr/lib/mon/state.d  which
50              exists.
51
52       -f     Fork  and  run as a daemon process. This is the preferred way to
53              run mon.
54
55       -h     Print help information.
56
57       -i secs
58              Sleep interval, in seconds. Defaults to 1. This  shouldn't  need
59              to be adjusted for any reason.
60
61       -k num Set log history to a maximum of num entries. Defaults to 100.
62
63       -l statetype
64              Load  state  from the last saved state file. The supported saved
65              state types are disabled for  disabled  watches,  services,  and
66              hosts,  opstatus  for  failure/alert/ack status of all services,
67              and all for both.  If no  statetype  is  provided,  disabled  is
68              assumed.
69
70       -L dir Sets  the  log  dir.  See also logdir in the configuration file.
71              The default is /var/log/mon if that directory exists,  otherwise
72              log.d in the base directory.
73
74       -M     Pre-process  the  configuration  file  with  the macro expansion
75              package m4.
76
77       -m num Set the throttle for the maximum number of processes to num.
78
79       -p num Make server listen on port num.  This defaults to 2583.
80
81       -S     Start with the scheduler stopped.
82
83       -P pidfile
84              Store the server's pid in pidfile, the default is the  first  of
85              /var/run/mon/mon.pid,  /var/run/mon.pid,  and /etc/mon.pid whose
86              directory exists.  An empty value tells mon not  to  use  a  pid
87              file.
88
89       -r delay
90              Sets  the  number of seconds used to randomize the startup delay
91              before each service is scheduled. Refer to the global  randstart
92              variable in the configuration file.
93
       -s dir Path to monitor scripts.  Default is
              /usr/local/lib/mon/mon.d:mon.d.  Multiple monitor paths may be
              specified by separating them with a colon.  Non-absolute paths
              are taken to be relative to the base directory (/usr/lib/mon by
              default).
99
100       -v     Print version information.
101
102

DEFINITIONS

104       monitor
105              A  program  which  tests for a certain condition, returns either
106              true or false, and optionally produces output to be passed  back
107              to  the scheduler.  Common monitors detect host reachability via
108              ICMP echo messages, or connection to TCP services.
109
110       period A period in time as interpreted by the Time::Period module.
111
112       alert  A program which sends a message when invoked by  the  scheduler.
113              The scheduler calls upon an alert when it detects a failure from
114              a monitor.  An alert program accepts a set of command-line argu‐
115              ments  from  the  scheduler,  in  addition  to data via standard
116              input.
117
118       hostgroup
119              A single host or  list  of  hosts,  specified  as  names  or  IP
120              addresses.
121
122       service
123              A  collection  of parameters used to deal with monitoring a par‐
124              ticular resource which is provided by a group. Services are usu‐
125              ally  modeled  after  things  such  as an SMTP server, ICMP echo
126              capability, server disk space availability, or SNMP events.
127
       view   A collection of hostgroups, used to filter mon output for client
              display.  For example, a 'network-services' view might be
              defined so that your network staff can see just the hostgroups
              which matter to them, without having to see all hostgroups
              defined in Mon.
132
133       watch  A collection of services which apply to a particular group.
134

OPERATION

136       When  the mon scheduler starts, it reads a configuration file to deter‐
137       mine the services it needs to monitor. The configuration file  defaults
138       to  /etc/mon.cf, and can be specified using the -c parameter. If the -M
139       option is specified, then the configuration file is pre-processed  with
140       m4.   If  the  configuration  file ends with .m4, the file is also pro‐
141       cessed by m4 automatically.
142
       The scheduler enters a loop which handles client connections, monitor
       invocations, and failure alerts.  Each service has a timer, specified
       in the configuration file as the interval variable, which tells the
       scheduler how frequently to invoke a monitor process.  The scheduler
       may be temporarily stopped.  While it is stopped, client access still
       functions, but nothing is scheduled.  This is useful when resetting
       the server, because you can save the list of disabled hosts and
       services, reset the server with the scheduler stopped, re-disable
       those hosts and services, and then start the scheduler.  It also
       allows making atomic changes across several client connections.  See
       the moncmd man page for more information.
154
155

MONITOR PROGRAMS

       Monitor processes are invoked with the arguments specified in the
       configuration file, followed by the hosts from the applicable host
       group.  For example, if the watch group is "servers", which contains
       the hostnames "smtp", "nntp", and "ns", and the monitor line reads as
       follows,
        monitor fping.monitor -t 4000 -r 2
       then the executable "fping.monitor" will be executed with these
       parameters:
        MONITOR_DIR/fping.monitor -t 4000 -r 2 smtp nntp ns
165
166       MONITOR_DIR    is    actually    a    search    path,    by     default
167       /usr/local/lib/mon/mon.d  then  /usr/lib/mon/mon.d, but it can be over‐
168       ridden by the -s option or in the configuration file.  If all hosts  in
169       the  hostgroup have been disabled, then a warning is sent to syslog and
170       the monitor is not run.  This  behavior  may  be  overridden  with  the
171       "allow_empty_group"  option  in  the  service definition.  If the final
172       argument to the "monitor" line is ";;" (it must be preceded  by  white‐
173       space), then the host list will not be appended to the parameter list.
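
The argument handling described above can be sketched in shell. This is only
an illustration, not mon's actual code: build_monitor_cmd and the monitor name
"local.monitor" are hypothetical, and the MONITOR_DIR search path lookup is
left out.

```shell
# Hypothetical sketch of how the monitor command line is assembled.
# $1 = the arguments of the "monitor" line, $2 = space-separated host list.
build_monitor_cmd() {
    monitor_line=$1
    hosts=$2
    if printf '%s\n' "$monitor_line" | grep -q '[[:space:]];;$'; then
        # Final argument is ";;" preceded by whitespace:
        # drop it and do not append the host list.
        printf '%s\n' "$monitor_line" | sed 's/[[:space:]]*;;$//'
    else
        # Normal case: the hosts of the group are appended.
        printf '%s %s\n' "$monitor_line" "$hosts"
    fi
}

build_monitor_cmd 'fping.monitor -t 4000 -r 2' 'smtp nntp ns'
build_monitor_cmd 'local.monitor -x ;;' 'smtp nntp ns'
```

The first call prints the same parameter list as the fping.monitor example
above (without the MONITOR_DIR prefix); the second prints the command with no
hosts appended.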
174
       In addition to environment variables defined by the user in the service
       definition, mon passes certain variables to the monitor process.
177
178
179       MON_LAST_SUMMARY
180              The first line of the output from  the  last  time  the  monitor
181              exited.  This is not the summary of the current monitor run, but
182              the previous one.  This may be used by an alert script  to  pro‐
183              vide historical context in an alert.
184
185
186       MON_LAST_OUTPUT
187              The  entire  output of the monitor from the last time it exited.
188              This is not the output of the current monitor run, but the  pre‐
189              vious  one.  This may be used by an alert script to provide his‐
190              torical context in an alert.
191
192
193
194       MON_LAST_FAILURE
195              The time(2) of the last failure for this service.
196
197
198       MON_FIRST_FAILURE
199              The time(2) of the first time this service failed.
200
201
202       MON_LAST_SUCCESS
203              The time(2) of the last time this service passed.
204
205
206       MON_DESCRIPTION
207              The description of this service, as defined in the configuration
208              file using the description tag.
209
210
       MON_DEPEND_STATUS
              The depend status: "0" if there is a dependency failure, "1"
              otherwise.
213
214
       MON_LOGDIR
              The directory where log files should be placed, as indicated by
              the logdir global configuration variable.
218
219
220       MON_STATEDIR
221              The directory where state files should be kept, as indicated  by
222              the statedir global configuration variable.
223
224
225       MON_CFBASEDIR
226              The directory where configuration files should be kept, as indi‐
227              cated by the cfbasedir global configuration variable.
228
229
230       "fping.monitor" should return an exit status of 0 if it completed  suc‐
231       cessfully  (found  no  problems), or nonzero if a problem was detected.
232       The first line of output from the monitor script has a special meaning:
233       it  is used as a brief summary of the exact failure which was detected,
234       and is passed to the alert program. All remaining output is also passed
235       to the alert program, but it has no required interpretation.
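
A minimal monitor following this contract might look like the sketch below.
The check_host function is a hypothetical stand-in for a real probe (here a
host "fails" if its name starts with "down"), and using the failed host names
as the summary line is an assumed convention, not something this page
requires.

```shell
# Hypothetical probe: succeeds unless the host name starts with "down".
check_host() {
    case $1 in
        down*) return 1 ;;
        *)     return 0 ;;
    esac
}

# run_monitor HOST...
# Prints a one-line summary first, then free-form detail.
# Returns 0 if every host passed, 1 if any host failed.
run_monitor() {
    failed=""
    detail=""
    for host in "$@"; do
        if ! check_host "$host"; then
            failed="${failed:+$failed }$host"
            detail="${detail}${host}: check failed
"
        fi
    done
    if [ -n "$failed" ]; then
        printf '%s\n' "$failed"   # first line: brief summary
        printf '%s' "$detail"     # remaining lines: detail for the alert
        return 1
    fi
    return 0
}

# A real monitor script would end with:  run_monitor "$@"; exit $?
run_monitor smtp down-nntp ns || echo "exit status: $?"
```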
236
237       If  a  monitor  for a particular service is still running, and the time
238       comes for mon to run another monitor for  that  service,  it  will  not
239       start  another  monitor.  For  example, if the interval is 10s, and the
240       monitor does not finish running within 10 seconds, then mon  will  wait
241       until the first monitor exits before running another one.
242
243

ALERT DECISION LOGIC

       Upon a non-zero or zero exit status, the associated alert or upalert
       program (respectively) is started, subject to the following
       conditions:

       1. If an alert for a specific service is disabled, do not send an
          alert.

       2. If dep_behavior is set to 'a', or alertdepend is set, and a parent
          dependency is failing, then suppress the alert.

       3. If the alert has previously been acknowledged, do not send the
          alert, unless it is an upalert.

       4. If an alert is not within the specified period, record the failure
          via syslog(3) and do not send an alert.

       5. If the failure does not fall within a defined period, do not send
          an alert.

       6. No upalerts are sent without corresponding down alerts, unless
          no_comp_alerts is defined in the period section.  An upalert will
          only be sent if the previous state is a failure.

       7. If an alert was already sent within the last alertevery interval,
          do not send another alert, unless the summary output from the
          current monitor program differs from that of the last monitor
          process.

       8. Otherwise, send an alert using each alert program listed for that
          period.

       The observe_detail argument to alertevery affects this behavior by
       also observing changes in the detail part of the output, in addition
       to the summary line.  If a monitor has successive failures and the
       summary output changes in each of them, alertevery will not suppress
       multiple consecutive alerts.  The reasoning is that if the summary
       output changes, then a significant event occurred and the user should
       be alerted.

       The "strict" argument to alertevery will both suppress the comparison
       of the output from the previous monitor run with the current one and
       prevent a successful return value of the monitor from resetting the
       alertevery timer.  For example, "alertevery 24h strict" will send out
       an alert only once every 24 hours, regardless of whether the monitor
       output changes, or whether the service stops failing and then starts
       failing again.
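
The alertevery part of this logic can be sketched as a small shell function.
It is a simplification under stated assumptions: should_alert and its
argument order are hypothetical, only the summary line is compared
(observe_detail is not modeled), and the timer reset on a successful monitor
run is not modeled either.

```shell
# should_alert NOW LAST_ALERT EVERY SUMMARY LAST_SUMMARY STRICT
# NOW and LAST_ALERT are time(2) values; LAST_ALERT is empty if no alert
# has been sent yet.  EVERY is the alertevery interval in seconds.
# STRICT is 1 if "strict" was given, 0 otherwise.
# Returns 0 if an alert should be sent, 1 if it should be suppressed.
should_alert() {
    now=$1; last_alert=$2; every=$3
    summary=$4; last_summary=$5; strict=$6

    # No alert sent so far: send one.
    [ -z "$last_alert" ] && return 0

    # The alertevery interval has elapsed: send again.
    if [ $((now - last_alert)) -ge "$every" ]; then
        return 0
    fi

    # Inside the interval, "strict" suppresses unconditionally.
    if [ "$strict" -eq 1 ]; then
        return 1
    fi

    # Otherwise send only if the summary line changed.
    [ "$summary" != "$last_summary" ]
}

should_alert 1000 900 3600 "smtp" "smtp" 0 && echo send || echo suppress
should_alert 1000 900 3600 "smtp nntp" "smtp" 0 && echo send || echo suppress
```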
272
273

ALERT PROGRAMS

       Alert programs are found in the path supplied with the -a parameter,
       or, if it is not specified, in the default alert path
       (/usr/local/lib/mon/alert.d:alert.d, as described under the -a option).
       They are invoked with the following command-line parameters:
278
279
280       -s service
281              Service tag from the configuration file.
282
283       -g group
284              Host group name from the configuration file.
285
286       -h hosts
287              The  expanded  version  of  the host group, space delimited, but
288              contained in one shell "word".
289
290       -l alertevery
291              The number of seconds until the next alarm will be sent.
292
       -O     This option is supplied to an alert only if the alert is being
              generated as a result of an expected trap timing out.
295
296       -t time
297              The  time (in time(2) format) of when this failure condition was
298              detected.
299
300       -T     This option is supplied to an alert only if the alert was  trig‐
301              gered by a trap
302
303       -u     This  option  is supplied to an alert only if it is being called
304              as an upalert.
305
306
307       The remaining arguments are supplied from the  trailing  parameters  in
308       the configuration file, after the "alert" service parameter.
309
310       As  with  monitor programs, alert programs are invoked with environment
311       variables defined by the user in the service definition, in addition to
312       the following which are explicitly set by the server:
313
314
315       MON_LAST_SUMMARY
316              The  first  line  of  the  output from the last time the monitor
317              exited.
318
319
320       MON_LAST_OUTPUT
321              The entire output of the monitor from the last time it exited.
322
323
324       MON_LAST_FAILURE
325              The time(2) of the last failure for this service.
326
327
328       MON_FIRST_FAILURE
329              The time(2) of the first time this service failed.
330
331
332       MON_LAST_SUCCESS
333              The time(2) of the last time this service passed.
334
335
336       MON_DESCRIPTION
337              The description of this service, as defined in the configuration
338              file using the description tag.
339
340
341       MON_GROUP
342              The watch group which triggered this alarm
343
344
345       MON_SERVICE
346              The service heading which generated this alert
347
348
349       MON_RETVAL
350              The exit value of the failed monitor program, or return value as
351              accepted from a trap.
352
353
354       MON_OPSTATUS
355              The operational status of the service.
356
357
358       MON_ALERTTYPE
359              Has one of the following  values:  "failure",  "up",  "startup",
360              "trap",  or "traptimeout", and signifies the type of alert which
361              was triggered.
362
363
       MON_TRAP_INTENDED
              This is only set when an unknown mon trap is received and caught
              by the default/default watch/service.  It contains the trap's
              intended watch group and service name, separated by a colon.
369
370
       MON_LOGDIR
              The directory where log files should be placed, as indicated by
              the logdir global configuration variable.
374
375
376       MON_STATEDIR
377              The directory where state files should be kept, as indicated  by
378              the statedir global configuration variable.
379
380
381       MON_CFBASEDIR
382              The directory where configuration files should be kept, as indi‐
383              cated by the cfbasedir global configuration variable.
384
385
386       The first line from standard input must be used as a brief  summary  of
387       the problem, normally supplied as the subject line of an email, or text
388       sent to an alphanumeric pager. Interpretation of all  subsequent  lines
389       read  from  stdin is left up to the alerting program. The usual parame‐
390       ters are a list of recipients to  deliver  the  notification  to.   The
391       interpretation  of  the  recipients  is not specified, and is up to the
392       alert program.
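
As an illustration of this calling convention, the sketch below parses the
options listed above with getopts, reads the summary from the first line of
standard input, and prints a mail-like message.  The function name
handle_alert, the output layout, and the recipient address are assumptions
made for the example; a real alert would hand the message to a mailer or
pager gateway.

```shell
# handle_alert [options] RECIPIENT...
# Reads the monitor output on stdin; the first line is the summary.
handle_alert() {
    service= group= hosts= alert_time= kind=failure
    OPTIND=1
    while getopts 's:g:h:l:Ot:Tu' opt; do
        case $opt in
            s) service=$OPTARG ;;
            g) group=$OPTARG ;;
            h) hosts=$OPTARG ;;
            l) ;;                      # seconds until next alarm, unused here
            t) alert_time=$OPTARG ;;
            O) kind=traptimeout ;;
            T) kind=trap ;;
            u) kind=up ;;
            *) return 2 ;;
        esac
    done
    shift $((OPTIND - 1))              # what remains: trailing parameters

    IFS= read -r summary               # first line of stdin: brief summary
    printf 'To: %s\n' "$*"
    printf 'Subject: [%s] %s/%s: %s\n' "$kind" "$group" "$service" "$summary"
    printf 'Hosts: %s\n' "$hosts"
    printf 'Time: %s\n' "$alert_time"
    printf '\n'
    cat                                # remaining stdin: detail, passed through
}

printf 'nntp\nnntp: connection refused\n' |
    handle_alert -s smtp -g servers -h 'smtp nntp ns' -l 600 \
                 -t 1700000000 admin@example.com
```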
393
394

CONFIGURATION FILE

396       The configuration file consists of zero or more global variable defini‐
397       tions,  zero or more hostgroup definitions, and one or more watch defi‐
398       nitions. Each watch definition may have one  or  more  service  defini‐
399       tions.  A watch definition is terminated by a blank line, another defi‐
400       nition, or the end of the file. A line beginning with optional  leading
401       whitespace and a pound ("#") is regarded as a comment, and is ignored.
402
403       Lines  are parsed as they are read. Long lines may be continued by end‐
404       ing them with a backslash ("\").  If a  line  is  continued,  then  the
405       backslash, the trailing whitespace after the backslash, and the leading
406       whitespace of the following line are removed. The end result is  assem‐
407       bled into a single line.
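
The continuation rule can be illustrated with a short shell sketch.  It is
not mon's parser; join_continued is a hypothetical helper that applies the
three removals described above (the backslash, whitespace after it, and
leading whitespace of the following line).

```shell
# join_continued: read configuration text on stdin, print it with
# backslash-continued lines assembled into single lines.
join_continued() {
    buf=""
    continued=0
    while IFS= read -r line || [ -n "$line" ]; do
        if [ "$continued" -eq 1 ]; then
            # Leading whitespace of a continuation line is removed.
            line=$(printf '%s\n' "$line" | sed 's/^[[:space:]]*//')
        fi
        # Remove a trailing backslash and any whitespace after it.
        stripped=$(printf '%s\n' "$line" | sed 's/\\[[:space:]]*$//')
        if [ "$stripped" != "$line" ]; then
            buf="$buf$stripped"
            continued=1
        else
            printf '%s\n' "$buf$line"
            buf=""
            continued=0
        fi
    done
    # Flush a dangling continuation at end of input.
    [ -z "$buf" ] || printf '%s\n' "$buf"
}

join_continued <<'EOF'
hostgroup servers nameserver smtpserver \
     nntpserver
EOF
```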
408
409       Typically the configuration file has the following layout:
410
411       1. Global variable definitions
412
413       2. Hostgroup definitions
414
415       3. Watch definitions
416
       See the "etc/example.cf" file which comes with the distribution for an
       example.
419
420
421   Global Variables
422       The following variables may be set to  override  compiled-in  defaults.
423       Command-line  options  will have a higher precedence than these defini‐
424       tions.
425
426
427       alertdir = dir
428              dir is the full path to the alert scripts. This is the value set
429              by the -a command-line parameter.
430
431              Multiple  alert paths may be specified by separating them with a
432              colon.  Non-absolute paths are taken to be relative to the  base
433              directory (/usr/lib/mon by default).
434
435              When  the configuration file is read, all alerts referenced from
436              the configuration will be looked up in each of these paths,  and
437              the full path to the first instance of the alert found is stored
438              in a hash. This hash is only generated upon startup or  after  a
439              "reset" command, so newly added alert scripts will not be recog‐
440              nized until a "reset" is performed.
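
The lookup described above (first match in a colon-separated path, with
non-absolute entries taken relative to the base directory) can be sketched as
follows.  find_script and the script name demo.alert are hypothetical, and
testing for an executable file is an assumption of this sketch.

```shell
# find_script NAME SEARCH_PATH BASEDIR
# Prints the full path of the first match and returns 0, or returns 1.
find_script() {
    name=$1; search_path=$2; basedir=$3
    old_ifs=$IFS
    IFS=:
    set -f                          # no globbing while splitting the path
    for dir in $search_path; do
        case $dir in
            /*) ;;                  # absolute entry: use as is
            *)  dir=$basedir/$dir ;;  # relative entry: under the base directory
        esac
        if [ -x "$dir/$name" ]; then
            IFS=$old_ifs; set +f
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs; set +f
    return 1
}

find_script demo.alert /usr/local/lib/mon/alert.d:alert.d /usr/lib/mon ||
    echo "demo.alert not found"
```

Because mon stores the result only at startup or after a "reset", a lookup
like this would not be repeated for each alert.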
441
442
443       mondir = dir
444              dir is the full path to the monitor scripts. This value may also
445              be  set  by the -s command-line parameter. If this path does not
446              begin with a "/", it will be relative to basedir.
447
              Multiple monitor paths may be specified by separating them with
              a colon.  All paths must be absolute.
450
451              When  the  configuration  file  is read, all monitors referenced
452              from the configuration will be looked up in each of these paths,
453              and  the full path to the first instance of the monitor found is
454              stored in a hash. This hash is only generated  upon  startup  or
455              after a "reset" command, so newly added monitor scripts will not
456              be recognized until a "reset" is performed.
457
458
459       statedir = dir
460              dir is the full path to the  state  directory.   mon  uses  this
461              directory  to  save various state information. If this path does
462              not begin with a "/", it will be relative to basedir.
463
464
465       logdir = dir
466              dir is the full path to the log directory.  mon uses this direc‐
467              tory  to  save various logs, including the downtime log. If this
468              path does not begin with a "/", it will be relative to basedir.
469
470
471       basedir = dir
472              dir is the full path for the  state,  log,  monitor,  and  alert
473              directories.
474
475
476       cfbasedir = dir
477              dir  is  the  full  path where all the config files can be found
478              (monusers.cf, auth.cf, etc.).
479
480
481       authfile = file
482              file is the path to the authentication file. If  the  path  does
483              not begin with a "/", it will be relative to cfbasedir.
484
485
486       authtype = type [type...]
              type is the type of authentication to use.  A space-separated
              list of types may be specified, and they will be checked in the
              order they are listed.  As soon as a successful authentication
              is performed, the user is considered authenticated by mon for
              the duration of the session and no more authentication checks
              are performed.
493
494              If type is getpwnam, then the standard Unix passwd file  authen‐
495              tication  method will be used (calls getpwnam(3) on the user and
496              compares the crypt(3)ed version of the  password  with  what  it
497              gets  from getpwnam). This will not work if shadow passwords are
498              enabled on the system.
499
500              If type is userfile, then usernames  and  hashed  passwords  are
501              read from userfile, which is defined via the userfile configura‐
502              tion variable.
503
504              If type is pam, then PAM (pluggable authentication modules) will
505              be  used  for authentication.  The service specified by the pam‐
506              service global will be used. If no  global  is  given,  the  PAM
507              passwd service will be used.
508
              If type is trustlocal, then if the client connection comes from
              localhost, the username passed from the client will be trusted,
              and the password will be ignored.  This can be used when you
              want the client to handle the authentication for you, for
              example a CGI script using one of the many Apache
              authentication methods.
514
515
516       userfile = file
517              This file is used when authtype is set to userfile.  It consists
518              of a sequence of lines of  the  format  'username  :  password'.
519              password  is  stored  as  the hash returned by the standard Unix
520              crypt(3) function.  NOTE: the format of this file is  compatible
521              with  the Apache file based username/password file format. It is
522              possible to use the htpasswd program  supplied  with  Apache  to
523              manage the mon userfile.
524
525              Blank lines and lines beginning with # are ignored.
526
527
528       pamservice = service
529              The PAM service used for authentication. This is applicable only
530              if "pam" is specified as a parameter to the authtype setting. If
531              this global is not defined, it defaults to passwd.
532
533
534       serverbind = addr
535
536
537       trapbind = addr
538
539              serverbind and trapbind specify which address to bind the server
540              and trap ports to, respectively.  If these are not defined,  the
541              default  address  is INADDR_ANY, which allows connections on all
542              interfaces. For security reasons, it could be  a  good  idea  to
543              bind only to the loopback interface.
544
545
546       dtlogfile = file
              file is a file which will be used to record the downtime log.
              Whenever a service fails for some amount of time and then stops
              failing, this event is written to the log.  If this parameter is
              not set, no logging is done.  The format of the file is as
              follows (# is a comment and may be ignored):
552
553              timenoticed group service firstfail downtime interval summary.
554
555              timenoticed is the time(2) the service came back up.
556
557              group service is the group and service which failed.
558
559              firstfail is the time(2) when the service began to fail.
560
561              downtime is the number of seconds the service failed.
562
563              interval  is  the  frequency  (in  seconds)  that the service is
564              polled.
565
566              summary is the summary line from when the service was failing.
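
A downtime log in this format can be read with a plain shell loop, as in the
sketch below.  It assumes the fields are separated by whitespace and that
summary is everything after the sixth field; parse_dtlog and the sample
record are hypothetical.

```shell
# parse_dtlog: read downtime log records on stdin, print one line per event.
parse_dtlog() {
    while read -r timenoticed group service firstfail downtime interval summary
    do
        case $timenoticed in
            ''|'#'*) continue ;;    # skip blank lines and comments
        esac
        printf '%s/%s down %ss (polled every %ss): %s\n' \
            "$group" "$service" "$downtime" "$interval" "$summary"
    done
}

parse_dtlog <<'EOF'
# sample record, values invented for illustration
1700000600 servers smtp 1700000000 600 60 smtp nntp
EOF
```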
567
568
569       monerrfile = filename
570              By default, when mon daemonizes itself, it connects  stdout  and
571              stderr to /dev/null. If monerrfile is set to a file, then stdout
572              and stderr will be appended to that file. In all cases stdin  is
573              connected  to /dev/null. If mon is told to run in the foreground
574              and  to  not  daemonize,  then  none  of  this  applies,   since
575              stdin/stdout/stderr  stay connected to whatever they were at the
576              time of invocation.
577
578
579       dtlogging = yes/no
580
581              Turns downtime logging on or off. The default is off.
582
583
       histlength = num
              num is the maximum number of events to be retained in the
              history list.  The default is 100.  This value may also be set
              by the -k command-line parameter.
588
589
590       historicfile = file
591              If this variable is set, then alerts are  logged  to  file,  and
592              upon  startup,  some  (or  all) of the past history is read into
593              memory.
594
595
596       historictime = timeval
              timeval is the amount of the history file to read upon startup.
              The period from "Now" - timeval up to the present is read.  See
              the explanation of interval in the "Service Definitions" section
              for a description of timeval.
600
601
602       serverport = port
603              port is the TCP port number that the server should bind to. This
604              value may also be set by the -p command-line parameter. Normally
605              this port is looked up via getservbyname(3), and it defaults  to
606              2583.
607
608
609       trapport = port
610              port is the UDP port number that the trap server should bind to.
611              Normally this port is looked up  via  getservbyname(3),  and  it
612              defaults to 2583.
613
614
615       pidfile = path
              path is the file the server will store its pid in.  This value
              may also be set by the -P command-line parameter.
618
619
620       maxprocs = num
621              Throttles the number of concurrently forked  processes  to  num.
622              The intent is to provide a safety net for the unlikely situation
623              when the server tries to take on too many tasks at  once.   Note
624              that this situation has only been reported to happen when trying
625              to use a garbled configuration file! You don't  want  to  use  a
626              garbled configuration file now, do you?
627
628
629       cltimeout = secs
630              Sets  the  client  inactivity timeout to secs.  This is meant to
631              help thwart denial of service attacks or  recover  from  crashed
632              clients.  secs is interpreted as a "1h/1m/1s" string, where "1m"
633              = 60 seconds.
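
The "1h/1m/1s" notation can be converted to seconds as in the sketch below.
It assumes a single whole number followed by exactly one of the suffixes h,
m, or s; anything else is rejected.  timeval_to_secs is a hypothetical helper,
not part of mon.

```shell
# timeval_to_secs VALUE: print VALUE ("2h", "30m", "45s") in seconds.
timeval_to_secs() {
    num=${1%[hms]}                  # strip one trailing unit letter
    case $num in
        ''|*[!0-9]*)
            echo "unsupported timeval: $1" >&2
            return 1 ;;
    esac
    case $1 in
        *h) echo $((num * 3600)) ;;
        *m) echo $((num * 60)) ;;
        *s) echo "$num" ;;
        *)  echo "unsupported timeval: $1" >&2
            return 1 ;;
    esac
}

timeval_to_secs 1m
timeval_to_secs 1h
```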
634
635
636       randstart = interval
              When the server starts, normally no service is scheduled until
              the interval defined in its service section has elapsed.  This
              can cause long delays before the first check of a service, and
              possibly a high load on the server if multiple things are
              scheduled at the same intervals.  This option is used to
              randomize the scheduling of the first test for all services
              during the startup period, and immediately after the reset
              command.  If randstart is defined, the scheduled run time of
              all services of all watch groups will be a random number
              between zero and randstart seconds.
647
648
649       dep_recur_limit = depth
              Limit dependency recursion level to depth.  If dependency
              recursion (dependencies which depend on other dependencies)
              tries to go beyond depth, then the recursion is aborted and a
              message is logged to syslog.  The default limit is 10.
654
655
656       dep_behavior = {a|m|hm}
657              dep_behavior controls whether  the  dependency  expression  sup‐
658              presses  one of: the running of alerts, the running of monitors,
659              or the passing of individual hosts to the monitors.   Read  more
660              about the behavior in the "Service Definitions" section below.
661
662              This is a global setting which controls the default settings for
663              the service-specified variable.
664
665
666       dep_memory = timeval
667              If set, dep_memory will cause dependencies to continue  to  pre‐
668              vent  alerts/monitoring  for  a period of time after the service
669              returns to a normal state.  This can be used  to  prevent  over-
670              eager  alerting  when  a machine is rebooting, for example.  See
671              the explanation of interval in the "Service Definitions" section
672              for a description of timeval.
673
674              This is a global setting which controls the default settings for
675              the service-specified variable.
676
677
678       syslog_facility = facility
679              Specifies the syslog facility used for logging.  daemon  is  the
680              default.
681
682
683
684
685       startupalerts_on_reset = {yes|no}
686
687              If  set  to  "yes", startupalerts will be invoked when the reset
688              client command is executed. The default is "no".
689
690
691       monremote = program
692
693              If set, this external program will be called by Mon when various
694              client  requests  are  processed.  This can be used to propagate
              those changes from one Mon server to another, if you have
              multiple monitoring machines.  An example script, monremote.pl,
              is available in the clients directory.
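
              As a rough illustration, the global variables described above
              might appear together near the top of a configuration file as
              shown here.  All values, and the location of monremote.pl, are
              assumptions chosen for the example:

                     # illustrative values only
                     dep_recur_limit = 5
                     dep_behavior = a
                     dep_memory = 10m
                     syslog_facility = local0
                     startupalerts_on_reset = yes
                     # hypothetical install path
                     monremote = /usr/lib/mon/clients/monremote.pl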
698
699
700   Hostgroup Entries
701       Hostgroup entries begin with the keyword hostgroup, and are followed by
702       a hostgroup tag and one or more hostnames or IP addresses, separated by
703       whitespace. The hostgroup tag must be composed of alphanumeric  charac‐
704       ters,  a  dash ("-"), a period ("."), or an underscore ("_"). Non-blank
705       lines following the first hostgroup line are interpreted as more  host‐
706       names.  The hostgroup definition ends with a blank line. For example:
707
708              hostgroup servers nameserver smtpserver nntpserver
709                   nfsserver httpserver smbserver
710
711              hostgroup router_group cisco7000 agsplus
712
713
714   View Entries
715       View  entries  begin  with the keyword view, and are followed by a view
716       tag and the names of one or more hostgroups.  The view tag must be com‐
717       posed  of  alphanumeric characters, a dash ("-"), a period ("."), or an
718       underscore ("_"). Non-blank lines following the  first  view  line  are
719       interpreted  as  more hostgroup names.  The view definition ends with a
720       blank line. For example:
721
722              view servers dns-servers web-servers file-servers
723                   mail-servers
724
725              view network-services routers switches vpn-servers
726
727
728
729   Watch Group Entries
730       Watch entries begin with a line that starts  with  the  keyword  watch,
731       followed  by  whitespace  and  a single word which normally refers to a
732       pre-defined hostgroup. If the second word is not recognized as a  host‐
733       group  tag, a new hostgroup is created whose tag is that word, and that
734       word is its only member.
735
736       Watch entries consist of one or more service definitions.
737
738       A watch group is terminated by a blank line, the end of the file, or by
739       a subsequent definition, "watch", "hostgroup", or otherwise.
740
741       There may be a special watch group entry called "default". If a default
742       watch group is defined with a service entry named "default", then  this
743       definition  will be used in handling traps received for an unrecognized
744       watch and service.
745
746
747   Service Definitions
748       service servicename
              A service definition begins with the keyword service followed
750              by  a word which is the tag for this service.  This word must be
751              unique among all services defined for the same watch group.
752
753              The components of a service are an interval, monitor, and one or
754              more time period definitions, as defined below.
755
              If a service name of "default" is defined within a watch group
              called "default" (see above), then the default/default
              definition will be used for handling unknown mon traps.
759
760              The  following configuration parameters are valid only following
761              a service definition:
762
763
764       VARIABLE=value
765              Environment variables may be defined  for  each  service,  which
766              will  be  included  in  the  environment of monitors and alerts.
767              Variables must be specified in all capital letters,  must  begin
768              with  an alphabetical character or an underscore, and there must
769              be no spaces to the left of the equal sign.
770
771
772       interval timeval
773              The keyword interval followed by a time value specifies the fre‐
774              quency that a monitor script will be triggered.  Time values are
775              defined as "30s", "5m", "1h", or "1d",  meaning  30  seconds,  5
              minutes, 1 hour, or 1 day.  The numeric portion may be a
              fraction, such as "1.5h", meaning an hour and a half.  This
              format of time specification will be referred to as timeval.
779
780
781       failure_interval timeval
782              Adjusts  the  polling interval to timeval when the service check
783              is failing. Resets the interval to the original when the service
784              succeeds.
785
786
787       traptimeout timeval
788              This  keyword  takes  the  same  time  specification argument as
789              interval, and makes the service expect a trap from  an  external
790              source  at  least that often, else a failure will be registered.
791              This is used for a heartbeat-style service.
792
793
794       trapduration timeval
795              If a trap is received, the status of the service  the  trap  was
796              delivered  to  will normally remain constant. If trapduration is
797              specified, the status of the service will remain  in  a  failure
798              state for the duration specified by timeval, and then it will be
799              reset to "success".
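
              A heartbeat-style service that combines traptimeout and
              trapduration might be sketched as follows.  The watch and
              service names and the time values are made up for illustration.
              Here a failure is registered if no trap arrives within ten
              minutes, and a failure status is reset to "success" after
              thirty minutes:

                     # hypothetical names and values; no monitor is defined
                     watch backup-host
                          service heartbeat
                               description Heartbeat trap from backup-host
                               traptimeout 10m
                               trapduration 30m
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org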
800
801
802       randskew timeval
803              Rather than schedule the monitor script to run at the  start  of
804              each  interval,  randomly  adjust  the interval specified by the
              interval parameter by plus-or-minus randskew.  The skew value is
              specified in the same way as the interval parameter ("30s",
              "5m", etc.).  For example, if interval is "1m" and randskew is
              "5s", then mon will schedule the monitor script at intervals of
              between 55 and 65 seconds.  The intent is to help distribute the
              load on
810              the  server  when many services are scheduled at the same inter‐
811              vals.
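
              For instance, with the following settings (values chosen only
              for illustration), the monitor would be scheduled at intervals
              of between 4.5 and 5.5 minutes, and the polling interval would
              be adjusted to one minute while the service check is failing:

                     # illustrative values only
                     interval 5m
                     failure_interval 1m
                     randskew 30s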
812
813
814       monitor monitor-name [arg...]
815              The keyword monitor followed by  a  script  name  and  arguments
816              specifies  the monitor to run when the timer expires. Shell-like
817              quoting conventions are followed when specifying  the  arguments
818              to  send  to the monitor script.  The script is invoked from the
819              directory given with the -s argument, and  all  following  words
820              are  supplied  as  arguments to the monitor program, followed by
821              the list of hosts in the group referred to by the current  watch
822              group.   If  the monitor line ends with ";;" as a separate word,
823              the host groups are not appended to the argument list  when  the
824              program is invoked.
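
              The two forms might look like the lines below, shown as
              alternatives rather than as one service.  fping.monitor stands
              in for any monitor installed in the -s directory, and
              check-queue.monitor with its --max option is hypothetical:

                     # the hosts of the watch group are appended
                     monitor fping.monitor

                     # ";;" is the last word, so no hosts are appended
                     monitor check-queue.monitor --max 100 ;;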
825
826
827       allow_empty_group
828              The  allow_empty_group option will allow a monitor to be invoked
829              even when the hostgroup for that watch is empty because of  dis‐
830              abled  hosts.  The default behavior is not to invoke the monitor
831              when all hosts in a hostgroup have been disabled.
832
833
834       description descriptiontext
835              The text following description is queried  by  client  programs,
836              passed  to  alerts  and monitors via an environment variable. It
837              should contain a brief description of the service, suitable  for
838              inclusion in an email or on a web page.
839
840
841       exclude_hosts host [host...]
842              Any  hosts  listed after exclude_hosts will be excluded from the
843              service check.
844
845
846       exclude_period periodspec
847              Do not run a scheduled monitor during  the  time  identified  by
848              periodspec.
849
850
851       depend dependexpression
852              The  depend  keyword is used to specify a dependency expression,
              which evaluates to either true or false, in the boolean sense.
854              Dependencies are actual Perl expressions, and must obey all syn‐
855              tactical rules. The expressions are evaluated in their own pack‐
856              age  space  so  as  to not accidentally have some unwanted side-
857              effect.  If a syntax error is found when evaluating the  expres‐
858              sion, it is logged via syslog.
859
860              Before evaluation, the following substitutions on the expression
861              occur: phrases which look like "group:service"  are  substituted
862              with  the value of the current operational status of that speci‐
863              fied service. These opstatus substitutions are  computed  recur‐
864              sively,  so  if  service A depends upon service B, and service B
865              depends upon service C, then service A depends upon  service  C.
866              Successful  operational  statuses  (which  evaluate  to "1") are
867              "STAT_OK",     "STAT_COLDSTART",      "STAT_WARMSTART",      and
868              "STAT_UNKNOWN".   The  word "SELF" (in all caps) can be used for
869              the group (e.g. "SELF:service"), and is an abbreviation for  the
870              current watch group.
871
872              This  feature  can  be used to control alerts for services which
873              are dependent on other services, e.g.  an  SMTP  test  which  is
874              dependent upon the machine being ping-reachable.
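
              As a sketch, the SMTP case mentioned above could be written as
              follows.  The group, monitor, and recipient names are examples
              only, and a "routers" watch group with a "ping" service is
              assumed to be defined elsewhere in the same file:

                     # assumes a "routers" group with a "ping" service
                     watch servers
                          service ping
                               interval 1m
                               monitor fping.monitor
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org
                          service smtp
                               interval 5m
                               monitor smtp.monitor
                               depend SELF:ping && routers:ping
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org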
875
876
877       dep_behavior {a|m|hm}
878              The evaluation of the dependency graphs specified via the depend
879              keyword can control the suppression of alert or monitor  invoca‐
880              tions, or the suppression of individual hosts passed to the mon‐
881              itor.
882
883              Alert suppression.  If this option  is  set  to  "a",  then  the
884              dependency  expression  will  be evaluated after the monitor for
885              the service exits or after a trap is received.   An  alert  will
886              only  be  sent  if the evaluation succeeds, meaning that none of
887              the nodes in the dependency graph indicate failure.
888
889              Monitor suppression.  If it is set to "m", then  the  dependency
              expression will be evaluated before the monitor for the service
              is about to run.  If the evaluation succeeds, then the monitor
892              will be run. Otherwise, the monitor will not be run and the sta‐
893              tus of the service will remain the same.
894
895              Host suppression.  If it is set to "hm" then  Mon  will  extract
896              the  list  of  "parent" services from the dependency expression.
897              (In fact the expression can be just a list  of  services.)  Then
898              when  the  monitor  for the service is about to be run, for each
899              host in the current hostgroup Mon will  search  all  the  parent
900              services  which  are currently failing and look for the hostname
901              in the current summary output.  If the hostname is  found,  this
902              host will be excluded from this run of the monitor.  This can be
903              used to e.g. allow an SMTP test on a group of hosts to still  be
904              run  even  when a single host is not ping-reachable.  If all the
905              rest of the hosts are working fine, the service will be in an OK
906              state,  but  if  another  host fails the SMTP test Mon can still
907              alert about that host even  though  the  parent  dependency  was
908              failing.  The dependency expression will not be used recursively
909              in this case.
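
              A minimal sketch of host suppression, assuming a "ping" service
              exists in the same watch group (the monitor name is an example
              only).  Hosts named in the summary of a failing ping service
              would be left out of the SMTP check:

                     # assumes a "ping" service in the same watch group
                     service smtp
                          interval 5m
                          monitor smtp.monitor
                          depend SELF:ping
                          dep_behavior hm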
910
911
912       alertdepend dependexpression
913
914       monitordepend dependexpression
915
916       hostdepend dependexpression
917              These keywords allow you to specify multiple dependency  expres‐
918              sions of different types.  Each one corresponds to the different
919              dep_behavior settings listed  above.   They  will  be  evaluated
920              independently  in  the  different  contexts as listed above.  If
921              depend is present, it takes precedence over  the  matching  key‐
922              word, depending on the dep_behavior setting.
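
              As an illustration, a service might use one expression for host
              suppression and another for alert suppression.  The group and
              service names below are assumptions:

                     # assumes "ping" in this group and a "routers" group
                     service smtp
                          interval 5m
                          monitor smtp.monitor
                          hostdepend SELF:ping
                          alertdepend routers:ping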
923
924
925       dep_memory timeval
926              If  set,  dep_memory will cause dependencies to continue to pre‐
927              vent alerts/monitoring for a period of time  after  the  service
928              returns  to  a  normal state.  This can be used to prevent over-
929              eager alerting when a machine is rebooting,  for  example.   See
930              the explanation of interval in the "Service Definitions" section
931              for a description of timeval.
932
933
934       redistribute alert [arg...]
935              A service may have one redistribute option, which is  a  special
              form of an alert definition.  This alert will be called on
937              every service status  update,  even  sequential  success  status
938              updates.   This  can be used to integrate Mon with another moni‐
939              toring system, or to link together multiple Mon servers  via  an
940              alert script that generates Mon traps.  See the "ALERT PROGRAMS"
941              section above for a list of the parameters mon will  pass  auto‐
942              matically to alert programs.
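
              A sketch of a redistribute line is shown below.  The script
              name forward.alert and its host argument are hypothetical and
              stand in for whatever alert script passes status updates to the
              other system:

                     # forward.alert and its argument are hypothetical
                     service ping
                          interval 1m
                          monitor fping.monitor
                          redistribute forward.alert central-mon.example.com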
943
944
945       unack_summary
946              Remove  the  "acknowledged"  state from a service if the summary
947              component of the failure message changes.  In most common  usage
948              the summary is the list of hosts that are failing, so additional
949              hosts failing would remove an ack.
950
951
952
953   Period Definitions
954       Periods are used to define the conditions which should allow alerts  to
955       be delivered.
956
957
958       period [label:] periodspec
959              A  period  groups one or more alarms and variables which control
960              how often an alert happens when there is a failure.  The  period
961              definition has two forms. The first takes an argument which is a
962              period specification from Patrick  Ryan's  Time::Period  Perl  5
963              module. Refer to "perldoc Time::Period" for more information.
964
965              The second form requires a label followed by a period specifica‐
966              tion, as defined above. The label is  a  tag  consisting  of  an
967              alphabetic  character  or  underscore  followed  by zero or more
968              alphanumerics or underscores and ending with a colon. This  form
969              allows multiple periods with the same period definition. One use
970              is to have a  period  definition  which  has  no  alertafter  or
971              alertevery  parameters for a particular time period, and another
972              for the same time period with a different  set  of  alerts  that
973              does contain those parameters.
974
975              Period  definitions, in either the first or second form, must be
976              unique within each service definition. For example, if you  need
977              to  define two periods both for "wd {Sun-Sat}", then one or both
978              of the period definitions must specify a label such  as  "period
979              t1: wd {Sun-Sat}" and "period t2: wd {Sun-Sat}".
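
              The two forms might be combined within one service as in the
              following sketch.  The label, the period specifications, and
              the recipients are assumptions.  The labeled period repeats the
              period specification of the first one but omits alertevery:

                     # hypothetical label, periods, and recipients
                     period wd {Mon-Fri} hr {9am-5pm}
                          alert mail.alert someone@your.org
                          alertevery 1h
                     period NOTIFY: wd {Mon-Fri} hr {9am-5pm}
                          alert mail.alert manager@your.org
                     period wd {Sat-Sun}
                          alert mail.alert oncall@your.org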
980
981
982       alertevery timeval [observe_detail | strict]
983              The  alertevery  keyword  (within a period definition) takes the
984              same type of argument as the interval variable, and  limits  the
985              number  of  times an alert is sent when the service continues to
              fail.  For example, if the timeval is "1h", then the alerts in
              the period section will be triggered only once every hour.  If
              the alertevery keyword is omitted in a period entry, an
989              alert  will  be  sent  out  every time a failure is detected. By
990              default, if  the  summary  output  of  two  successive  failures
991              changes,  then  the  alertevery  interval  is overridden, and an
992              alert will be sent.  If the string "observe_detail" is the  last
993              argument,  then both the summary and detail output lines will be
994              considered when comparing the output of successive failures.  If
995              the string "strict" is the last argument, then the output of the
996              monitor or the state change of the service will have  no  effect
997              on  when  alerts are sent. That is, "alertevery 24h strict" will
998              send only one alert every 24  hours,  no  matter  what.   Please
999              refer  to the ALERT DECISION LOGIC section for a detailed expla‐
1000              nation of how alerts are suppressed.
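
              The three variants might be written as follows, with one
              alertevery line per period.  The time values are illustrative:

                     # a change in the summary overrides the one-hour limit
                     alertevery 1h
                     # a change in the summary or detail overrides the limit
                     alertevery 1h observe_detail
                     # at most one alert every 24 hours, no matter what
                     alertevery 24h strict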
1001
1002
1003       alertafter num
1004
1005
1006       alertafter num timeval
1007
1008
1009       alertafter timeval
1010              The alertafter keyword  (within  a  period  section)  has  three
1011              forms:  only  with the "num" argument, or with the "num timeval"
1012              arguments, or only with the "timeval" argument.   In  the  first
1013              form,  an  alert  will  only  be invoked after "num" consecutive
1014              failures.
1015
1016              In the second form, the arguments are a  positive  integer  fol‐
1017              lowed  by  an  interval,  as  described by the interval variable
1018              above.  If these parameters are specified, then the  alerts  for
1019              that  period will only be called after that many failures happen
1020              within that interval. For example, if alertafter  is  given  the
1021              arguments  "3 30m",  then the alert will be called if 3 failures
1022              happen within 30 minutes.
1023
1024              In the third form, the argument is an interval, as described  by
1025              the  interval  variable above.  Alerts for that period will only
1026              be called if the service has been in a failure  state  for  more
              than the length of time described by the interval, regardless of
1028              the number of failures noticed within that interval.
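
              The three forms, shown as alternatives with illustrative
              values:

                     # first form: after 3 consecutive failures
                     alertafter 3
                     # second form: after 3 failures within 30 minutes
                     alertafter 3 30m
                     # third form: after more than 30 minutes in failure
                     alertafter 30m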
1029
1030
1031       numalerts num
1032
1033              This variable tells the server to call no more than  num  alerts
1034              during  a  failure.  The  alert  counter is kept on a per-period
1035              basis, and is reset upon each success.
1036
1037
1038       no_comp_alerts
1039
1040              If this option is specified, then upalerts will be called  when‐
1041              ever  the  service state changes from failure to success, rather
1042              than only after a corresponding "down" alert.
1043
1044
1045       alert alert [arg...]
1046              A period may contain multiple alerts, which are  triggered  upon
1047              failure  of  the  service.  An alert is specified with the alert
1048              keyword, followed by an optional exit parameter,  and  arguments
1049              which  are  interpreted  the same as the monitor definition, but
1050              without the ";;" exception. The exit parameter takes the form of
1051              exit=x  or  exit=x-y  and  has the effect that the alert is only
1052              called if the exit status of the monitor script falls within the
1053              range  of the exit parameter. If, for example, the alert line is
              alert exit=10-20 mail.alert mis then  mail.alert  will  only  be
1055              invoked  with mis as its arguments if the monitor program's exit
1056              value is between 10 and 20. This feature allows you  to  trigger
1057              different  alerts  at  different severity levels (like when free
1058              disk space goes from 8% to 3%).
1059
              See the ALERT PROGRAMS section above for a list of the
              parameters mon will pass automatically to alert programs.
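
              As a sketch of severity levels, two alerts in one period might
              be tied to different exit ranges.  The ranges and recipients
              are assumptions and depend on the exit values that the monitor
              in use actually returns:

                     period wd {Sun-Sat}
                          # hypothetical exit ranges and recipients
                          alert exit=1-9 mail.alert someone@your.org
                          alert exit=10-20 mail.alert mis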
1062
1063
1064       upalert alert [arg...]
              An upalert is the complement of an alert.  An upalert is called
              when a service makes the state transition from failure to
              success, if a corresponding "down" alert was previously sent.  The
1068              upalert script is called supplying the same  parameters  as  the
1069              alert  script,  with  the  addition of the -u parameter which is
1070              simply used to let an alert script know that it is being  called
1071              as  an  upalert.  Multiple  upalerts  may  be specified for each
1072              period definition.  Set the per-period no_comp_alerts option  to
              send an upalert regardless of whether or not a "down" alert was
1074              sent.
1075
1076
1077       startupalert alert [arg...]
1078              A startupalert is only called when the mon server starts  execu‐
1079              tion,  or  when  a  "reset"  command  was  issued to the server,
1080              depending on the setting of the  startupalerts_on_reset  global.
1081              Unlike  other alerts, startupalerts are not called following the
1082              exit of a monitor, i.e. they are  called  in  their  own  right,
1083              therefore  the  "exit="  argument  is  not applicable to startu‐
1084              palert.
1085
1086
1087       upalertafter timeval
1088              The upalertafter parameter is specified as a string that follows
1089              the  syntax  of  the interval parameter ("30s", "1m", etc.), and
1090              controls the triggering of an upalert.  If a service comes  back
1091              up  after  being  down  for  a time greater than or equal to the
1092              value of this option, an upalert will be called. Use this option
              to prevent upalerts from being called because of "blips" (brief
              outages).
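
              For example, a period might pair an alert with an upalert and
              suppress upalerts for outages shorter than five minutes.  The
              recipient and the time value are placeholders:

                     period wd {Sun-Sat}
                          # placeholder recipient and time value
                          alert mail.alert someone@your.org
                          upalert mail.alert someone@your.org
                          upalertafter 5m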
1095
1096

AUTHENTICATION CONFIGURATION FILE

1098       The file specified by the authfile variable in the  configuration  file
1099       (or  passed  via  the  -A parameter) will be loaded upon startup.  This
1100       file defines restrictions upon which client commands may be executed by
1101       which users. It is a text file which consists of comments, command def‐
1102       initions, and trap authentication parameters.  A  comment  line  begins
       with optional whitespace followed by a pound sign.  Blank lines are
1104       ignored.
1105
1106       The file is separated into a command section and a trap  section.  Sec‐
1107       tions  are  specified  by a single line containing one of the following
1108       statements:
1109
1110                   command section
1111
1112       or
1113
1114                   trap section
1115
1116       Lines following one of the above statements apply to that section until
1117       either the end of the file or another section begins.
1118
1119       A  command  definition consists of a command, followed by a colon, fol‐
1120       lowed by a comma-separated list of users who may execute  the  command.
1121       The  default  is that no users may execute any commands unless they are
1122       explicitly allowed in this configuration file. For clarity, a user  can
1123       be  denied  by prefixing the user name with "!". If the word "AUTH_ANY"
1124       is used for a username, then any authenticated user will be allowed  to
1125       execute  the  command.  If  the word "all" is used for a username, then
1126       that command may be executed by any user, authenticated or not.
1127
1128       The trap section allows configuration of which  users  may  send  traps
1129       from  which  hosts.  The  syntax is a source host (name or ip address),
1130       whitespace, a username, whitespace, and a plaintext password  for  that
1131       user. If the source host is "*", then allow traps from any host. If the
1132       username is "*", then accept traps without regard for the  username  or
1133       password.  If  no  hosts  or users are specified, then no traps will be
1134       accepted.
1135
1136       An example configuration file:
1137
1138              command section
1139              list:          all
1140              reset:         root,admin
1141              loadstate:          root
1142              savestate:          root
1143
1144              trap section
1145              127.0.0.1 root r@@tp4sswrd
1146
1147       This means that all clients are  able  to  perform  the  list  command,
1148       "root"  is  able  to  perform  "reset",  "loadstate",  "savestate", and
1149       "admin" is able to execute the "reset" command.
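
       A further sketch, using hypothetical user names and a placeholder
       password, shows "AUTH_ANY" and a wildcard trap source.  Here any
       authenticated user may run "disable" and "enable", and traps from
       "trapuser" are accepted from any host:

              # hypothetical users and password
              command section
              list:          all
              disable:       AUTH_ANY
              enable:        AUTH_ANY
              reset:         root

              trap section
              *  trapuser  s3cr3t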
1150
1151

CLIENT-SERVER INTERFACE

1153       The server listens on TCP port 2583, which may be overridden using  the
1154       -p port  option.  Commands are a single line each, terminated by a new‐
1155       line.  The server can handle any number of simultaneous client  connec‐
1156       tions.
1157
1158

CLIENT INTERFACE COMMANDS

1160       See manual page for moncmd.
1161
1162

MON TRAPPING

1164       Mon  has  the facility to receive special "mon traps" from any local or
       remote machine.  Currently, the only available method for sending mon
       traps is through the Mon::Client perl interface, though the UDP packet
1167       format is defined well enough to permit the writing of traps  in  other
1168       languages.
1169
1170       Traps  are  handled  similarly to monitors: a trap sends an operational
1171       status, summary line, and description text, and mon generates an  alert
1172       or upalert as necessary.
1173
       Traps can be caught by any watch/service group set up in the mon
       configuration file; however, it is suggested that you configure
       watch/service groups specifically for the traps you expect to receive.
       When
1177       defining a special watch/service group for  traps,  do  not  include  a
1178       "monitor" directive (as no monitor need be invoked). Since a monitor is
1179       not being invoked, it is not necessary for the watch definition to have
1180       a  hostgroup  which  contains  real  host names.  Just make up a useful
1181       name, and mon will automatically create the watch group for you.
1182
1183       Here is a simple config file example:
1184
1185              watch trap-service
1186                   service host1-disks
1187                        description TRAP: for host1 disk status
1188                        period wd {Sun-Sat}
1189                             alert mail.alert someone@your.org
1190                             upalert mail.alert -u someone@your.org
1191
1192
1193       Since mon listens on a UDP port for any trap,  a  default  facility  is
1194       available  for handling traps to unknown groups or services.  To enable
1195       this facility,  you  must  include  a  "default"  watch  group  with  a
1196       "default"  service  entry  containing  the  specifics  of alarms.  If a
1197       default/default watch  group  and  service  are  not  configured,  then
1198       unknown  traps  get logged via syslog, and no alarm is sent.  NOTE: The
1199       default/default facility is a single entity as far  as  accounting  and
1200       alarming  go.  Alarm programs which are not aware of this fact may send
1201       confusing information when a failure trap comes from one machine,  fol‐
1202       lowed  by  a  success (ok) trap from a different machine. See the alarm
1203       environment variable MON_TRAP_INTENDED above for a possible way  around
1204       this.  It  is  intended  that  default/default be used as a facility to
1205       catch unknown traps, and should not be relied upon to catch  all  traps
1206       in  a  production  environment.  If  you  are lazy and only want to use
1207       default/default for catching all traps, it would  be  best  to  disable
1208       upalerts,  and  use the MON_TRAP_INTENDED environment variable in alert
1209       scripts to make the alerts more meaningful to you.
1210
1211       Here is an example default facility:
1212
1213              watch default
1214                   service default
1215                        description Default trap service
1216                        period wd {Sun-Sat}
1217                             alert mail.alert someone@your.org
1218                             upalert mail.alert -u someone@your.org
1219
1220
1221

EXAMPLES

1223       The mon distribution comes with an example configuration  called  exam‐
1224       ple.cf.  Refer to that file for more information.
1225
1226

HISTORY

1231       mon  was  written  because  I couldn't find anything out there that did
1232       just what I needed, and nothing was worth modifying to add the features
1233       I  wanted.  It  doesn't have a cool name, and that bothers me because I
1234       couldn't think of one.
1235

BUGS

1237       Report bugs to the email address below.
1238

AUTHOR

1240       Jim Trocki <trockij@arctic.org>
1241
1242
1243
1244Linux                    $Date: 2007/06/25 13:10:07 $                   mon(8)