mon(8)                Parallel Service Monitoring Daemon                mon(8)


NAME
       mon - monitor services for availability, sending alarms upon failures.


SYNOPSIS
       mon [-dfhlMSv] [-a dir] [-A authfile] [-b dir] [-B dir] [-c config] [-D
       dir] [-i secs] [-k num] [-l [statetype]] [-L dir] [-m num] [-p num] [-P
       pidfile] [-r delay] [-s dir]


DESCRIPTION
       mon is a general-purpose scheduler for monitoring service availability
       and triggering alerts upon detecting failures.  mon was designed to be
       open in the sense that it supports arbitrary monitoring facilities and
       alert methods via a common interface.  These are easily implemented
       through programs (in C, Perl, shell, etc.), SNMP traps, and special Mon
       (UDP packet) traps.


OPTIONS
       -a dir Path to alert scripts.  Default is
              /usr/local/lib/mon/alert.d:alert.d.  Multiple alert paths may be
              specified by separating them with a colon.  Non-absolute paths
              are taken to be relative to the base directory (/usr/lib/mon by
              default).
29       -b dir Base directory for mon. scriptdir, alertdir,  and  statedir  are
30              all relative to this directory unless specified from /.  Default
31              is /usr/lib/mon.
33       -B dir Configuration file base directory. All config files are  located
34              here, including mon.cf, monusers.cf, and auth.cf.
36       -A authfile
37              Authentication   configuration   file.   By   default   this  is
38              /etc/mon/auth.cf  if   the   /etc/mon   directory   exists,   or
39              /usr/lib/mon/auth.cf otherwise.
       -c file
              Read configuration from file.  This defaults to /etc/mon/mon.cf
              if the /etc/mon directory exists, otherwise to /etc/mon.cf.
46       -d     Enable debugging mode.
48       -D dir Path   to   state   directory.    Default   is   the   first  of
49              /var/state/mon,  /var/lib/mon,  and  /usr/lib/mon/state.d  which
50              exists.
52       -f     Fork  and  run as a daemon process. This is the preferred way to
53              run mon.
55       -h     Print help information.
57       -i secs
58              Sleep interval, in seconds. Defaults to 1. This  shouldn't  need
59              to be adjusted for any reason.
61       -k num Set log history to a maximum of num entries. Defaults to 100.
63       -l statetype
64              Load  state  from the last saved state file. The supported saved
65              state types are disabled for  disabled  watches,  services,  and
66              hosts,  opstatus  for  failure/alert/ack status of all services,
67              and all for both.  If no  statetype  is  provided,  disabled  is
68              assumed.
70       -L dir Sets  the  log  dir.  See also logdir in the configuration file.
71              The default is /var/log/mon if that directory exists,  otherwise
72              log.d in the base directory.
74       -M     Pre-process  the  configuration  file  with  the macro expansion
75              package m4.
77       -m num Set the throttle for the maximum number of processes to num.
79       -p num Make server listen on port num.  This defaults to 2583.
81       -S     Start with the scheduler stopped.
83       -P pidfile
84              Store the server's pid in pidfile, the default is the  first  of
85              /var/run/mon/mon.pid,  /var/run/mon.pid,  and /etc/mon.pid whose
86              directory exists.  An empty value tells mon not  to  use  a  pid
87              file.
89       -r delay
90              Sets  the  number of seconds used to randomize the startup delay
91              before each service is scheduled. Refer to the global  randstart
92              variable in the configuration file.
       -s dir Path to monitor scripts.  Default is
              /usr/local/lib/mon/mon.d:mon.d.  Multiple monitor paths may be
              specified by separating them with a colon.  Non-absolute paths
              are taken to be relative to the base directory (/usr/lib/mon by
              default).
100       -v     Print version information.


104       monitor
105              A  program  which  tests for a certain condition, returns either
106              true or false, and optionally produces output to be passed  back
107              to  the scheduler.  Common monitors detect host reachability via
108              ICMP echo messages, or connection to TCP services.
110       period A period in time as interpreted by the Time::Period module.
112       alert  A program which sends a message when invoked by  the  scheduler.
113              The scheduler calls upon an alert when it detects a failure from
114              a monitor.  An alert program accepts a set of command-line argu‐
115              ments  from  the  scheduler,  in  addition  to data via standard
116              input.
118       hostgroup
119              A single host or  list  of  hosts,  specified  as  names  or  IP
120              addresses.
122       service
123              A  collection  of parameters used to deal with monitoring a par‐
124              ticular resource which is provided by a group. Services are usu‐
125              ally  modeled  after  things  such  as an SMTP server, ICMP echo
126              capability, server disk space availability, or SNMP events.
       view   A collection of hostgroups, used to filter mon output for client
              display.  For example, a 'network-services' view might be
              defined so your network staff can see just the hostgroups which
              matter to them, without having to see all hostgroups defined in
              Mon.
133       watch  A collection of services which apply to a particular group.


       When the mon scheduler starts, it reads a configuration file to
       determine the services it needs to monitor. The configuration file
       defaults to /etc/mon/mon.cf if the /etc/mon directory exists and to
       /etc/mon.cf otherwise, and can be specified using the -c parameter. If
       the -M option is specified, then the configuration file is
       pre-processed with m4.  If the name of the configuration file ends with
       .m4, the file is also processed by m4 automatically.
       The scheduler enters a loop which handles client connections, monitor
       invocations, and failure alerts. Each service has a timer, specified in
       the configuration file as the interval variable, which tells the
       scheduler how frequently to invoke a monitor process.  The scheduler
       may be temporarily stopped. While it is stopped, client access still
       functions, but nothing is scheduled. This is useful when resetting the
       server, because you can save the hosts and services which are
       disabled, reset the server with the scheduler stopped, re-disable
       those hosts and services, and then start the scheduler. It also allows
       making atomic changes across several client connections.  See the
       moncmd man page for more information.


       Monitor processes are invoked with the arguments specified in the
       configuration file, followed by the hosts from the applicable host
       group.  For example, if the watch group is "servers", which contains
       the hostnames "smtp", "nntp", and "ns", and the monitor line reads as
       follows,
        monitor fping.monitor -t 4000 -r 2
       then the executable "fping.monitor" will be executed with these
       parameters:
        MONITOR_DIR/fping.monitor -t 4000 -r 2 smtp nntp ns
166       MONITOR_DIR    is    actually    a    search    path,    by     default
167       /usr/local/lib/mon/mon.d  then  /usr/lib/mon/mon.d, but it can be over‐
168       ridden by the -s option or in the configuration file.  If all hosts  in
169       the  hostgroup have been disabled, then a warning is sent to syslog and
170       the monitor is not run.  This  behavior  may  be  overridden  with  the
171       "allow_empty_group"  option  in  the  service definition.  If the final
172       argument to the "monitor" line is ";;" (it must be preceded  by  white‐
173       space), then the host list will not be appended to the parameter list.
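       The argument handling described above can be sketched as follows.
       This is only an illustration, not mon's own code: the function name
       build_monitor_argv is hypothetical, and it assumes the monitor line
       is split on plain whitespace with no quoting rules.

```python
def build_monitor_argv(monitor_line, hosts):
    """Return the argument vector for one monitor invocation.

    monitor_line is the text after the "monitor" keyword (assumed to be
    split on whitespace); hosts is the list of enabled hosts in the group.
    """
    words = monitor_line.split()
    if words and words[-1] == ";;":
        # A final ";;" word means the host list is not appended.
        return words[:-1]
    return words + list(hosts)


print(build_monitor_argv("fping.monitor -t 4000 -r 2", ["smtp", "nntp", "ns"]))
```

       The first word would still have to be resolved against the monitor
       search path before the program is executed.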
       In addition to environment variables defined by the user in the service
       definition, mon passes certain variables to the monitor process:
180              The first line of the output from  the  last  time  the  monitor
181              exited.  This is not the summary of the current monitor run, but
182              the previous one.  This may be used by an alert script  to  pro‐
183              vide historical context in an alert.
187              The  entire  output of the monitor from the last time it exited.
188              This is not the output of the current monitor run, but the  pre‐
189              vious  one.  This may be used by an alert script to provide his‐
190              torical context in an alert.
195              The time(2) of the last failure for this service.
199              The time(2) of the first time this service failed.
203              The time(2) of the last time this service passed.
207              The description of this service, as defined in the configuration
208              file using the description tag.
              The depend status, "0" if dependency failure, "1" otherwise.
215       MON_LOGDIR
              The directory where log files should be placed, as indicated by
              the logdir global configuration variable.
220       MON_STATEDIR
221              The directory where state files should be kept, as indicated  by
222              the statedir global configuration variable.
226              The directory where configuration files should be kept, as indi‐
227              cated by the cfbasedir global configuration variable.
230       "fping.monitor" should return an exit status of 0 if it completed  suc‐
231       cessfully  (found  no  problems), or nonzero if a problem was detected.
232       The first line of output from the monitor script has a special meaning:
233       it  is used as a brief summary of the exact failure which was detected,
234       and is passed to the alert program. All remaining output is also passed
235       to the alert program, but it has no required interpretation.
237       If  a  monitor  for a particular service is still running, and the time
238       comes for mon to run another monitor for  that  service,  it  will  not
239       start  another  monitor.  For  example, if the interval is 10s, and the
240       monitor does not finish running within 10 seconds, then mon  will  wait
241       until the first monitor exits before running another one.
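       A monitor that follows the conventions above (exit status 0 when no
       problem is found, nonzero otherwise, and a one-line summary first)
       might be structured like this minimal sketch.  The names run_monitor
       and is_up are hypothetical, and the per-host check is left to the
       caller; listing the failed hosts on the summary line is an assumption
       made for the example.

```python
import sys


def run_monitor(hosts, is_up):
    """Check each host with is_up(host); return (exit_status, output)."""
    failed = [h for h in hosts if not is_up(h)]
    if not failed:
        return 0, ""                      # success: nothing to report
    summary = " ".join(failed)            # first line: brief summary
    detail = "\n".join(f"{h}: check failed" for h in failed)
    return 1, summary + "\n" + detail + "\n"


def main(argv, is_up):
    """Hosts arrive as trailing command-line arguments, as mon passes them."""
    status, output = run_monitor(argv[1:], is_up)
    sys.stdout.write(output)
    return status
```

       A real script would end with something like sys.exit(main(sys.argv,
       some_check)), where some_check performs the actual test.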


       Upon a nonzero or zero exit status, the associated alert or upalert
       program (respectively) is started, pending the following conditions:

       1. If an alert for a specific service is disabled, do not send an
          alert.

       2. If dep_behavior is set to 'a', or alertdepend is set, and a parent
          dependency is failing, then suppress the alert.

       3. If the alert has previously been acknowledged, do not send the
          alert, unless it is an upalert.

       4. If an alert is not within the specified period, record the failure
          via syslog(3) and do not send an alert.

       5. If the failure does not fall within a defined period, do not send
          an alert.

       6. No upalerts are sent without corresponding down alerts, unless
          no_comp_alerts is defined in the period section. An upalert will
          only be sent if the previous state is a failure.

       7. If an alert was already sent within the last alertevery interval,
          do not send another alert, unless the summary output from the
          current monitor run differs from that of the previous run.

       8. Otherwise, send an alert using each alert program listed for that
          period.

       The observe_detail argument to alertevery affects this behavior by
       observing changes in the detail part of the output in addition to the
       summary line.  If a monitor has successive failures and the summary
       output changes in each of them, alertevery will not suppress multiple
       consecutive alerts.  The reasoning is that if the summary output
       changes, then a significant event occurred and the user should be
       alerted.  The "strict" argument to alertevery suppresses the comparison
       of the previous monitor output with the current one, and also prevents
       a successful return value of the monitor from resetting the alertevery
       timer.  For example, "alertevery 24h strict" will only send out an
       alert once every 24 hours, regardless of whether the monitor output
       changes, or whether the service stops failing and then starts failing
       again.
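       The alertevery rule alone can be sketched as below.  This is a
       simplified model rather than mon's implementation: the function and
       parameter names are hypothetical, times are plain seconds, and
       neither observe_detail nor the resetting of the timer on success is
       modelled.

```python
def alertevery_allows(now, last_alert, alertevery, summary, last_summary,
                      strict=False):
    """Return True if the alertevery rule permits an alert at time now."""
    if last_alert is None:
        return True                   # no alert has been sent yet
    if now - last_alert >= alertevery:
        return True                   # the alertevery interval has elapsed
    if strict:
        return False                  # strict: output changes are ignored
    return summary != last_summary    # a changed summary alerts again
```

       The other conditions (disabled alerts, dependencies, acknowledgements
       and periods) would be checked separately.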


       Alert programs are found in the path supplied with the -a parameter, or
       in the default alert path (/usr/local/lib/mon/alert.d:alert.d, see the
       -a option) if it is not specified.  They are invoked with the following
       command-line parameters:
280       -s service
281              Service tag from the configuration file.
283       -g group
284              Host group name from the configuration file.
286       -h hosts
287              The  expanded  version  of  the host group, space delimited, but
288              contained in one shell "word".
290       -l alertevery
291              The number of seconds until the next alarm will be sent.
       -O     This option is supplied to an alert only if the alert is being
              generated as a result of an expected trap timing out.
       -t time
              The time (in time(2) format) of when this failure condition was
              detected.
       -T     This option is supplied to an alert only if the alert was
              triggered by a trap.
303       -u     This  option  is supplied to an alert only if it is being called
304              as an upalert.
307       The remaining arguments are supplied from the  trailing  parameters  in
308       the configuration file, after the "alert" service parameter.
310       As  with  monitor programs, alert programs are invoked with environment
311       variables defined by the user in the service definition, in addition to
312       the following which are explicitly set by the server:
316              The  first  line  of  the  output from the last time the monitor
317              exited.
321              The entire output of the monitor from the last time it exited.
325              The time(2) of the last failure for this service.
329              The time(2) of the first time this service failed.
333              The time(2) of the last time this service passed.
337              The description of this service, as defined in the configuration
338              file using the description tag.
341       MON_GROUP
342              The watch group which triggered this alarm
345       MON_SERVICE
346              The service heading which generated this alert
349       MON_RETVAL
350              The exit value of the failed monitor program, or return value as
351              accepted from a trap.
354       MON_OPSTATUS
355              The operational status of the service.
359              Has one of the following  values:  "failure",  "up",  "startup",
360              "trap",  or "traptimeout", and signifies the type of alert which
361              was triggered.
              This is only set when an unknown mon trap is received and caught
              by the default/default watch/service. This contains
              colon-separated entries of the trap's intended watch group and
              service name.
371       MON_LOGDIR
              The directory where log files should be placed, as indicated by
              the logdir global configuration variable.
376       MON_STATEDIR
377              The directory where state files should be kept, as indicated  by
378              the statedir global configuration variable.
382              The directory where configuration files should be kept, as indi‐
383              cated by the cfbasedir global configuration variable.
386       The first line from standard input must be used as a brief  summary  of
387       the problem, normally supplied as the subject line of an email, or text
388       sent to an alphanumeric pager. Interpretation of all  subsequent  lines
389       read  from  stdin is left up to the alerting program. The usual parame‐
390       ters are a list of recipients to  deliver  the  notification  to.   The
391       interpretation  of  the  recipients  is not specified, and is up to the
392       alert program.
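       An alert program written in Python might read its arguments and
       standard input along these lines.  This is a hedged sketch: the
       function names and dictionary keys are hypothetical, and only the
       options listed above are handled.

```python
import getopt
import io


def parse_alert_args(argv):
    """Split an alert's arguments into mon's options and the trailing words."""
    opts, rest = getopt.getopt(argv, "s:g:h:l:t:OTu")
    info = {"upalert": False, "trap": False, "trap_timeout": False}
    names = {"-s": "service", "-g": "group", "-l": "alertevery", "-t": "time"}
    for flag, value in opts:
        if flag in names:
            info[names[flag]] = value
        elif flag == "-h":
            info["hosts"] = value.split()   # one shell word, space delimited
        elif flag == "-O":
            info["trap_timeout"] = True
        elif flag == "-T":
            info["trap"] = True
        elif flag == "-u":
            info["upalert"] = True
    return info, rest                       # rest: e.g. the recipient list


def read_alert_input(stream):
    """Return (summary, detail): the first line and everything after it."""
    summary = stream.readline().rstrip("\n")
    return summary, stream.read()


if __name__ == "__main__":
    demo = io.StringIO("smtp\nsmtp is unreachable\n")
    print(read_alert_input(demo))
```

       In a real alert, parse_alert_args would be given sys.argv[1:] and
       read_alert_input would be given sys.stdin.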


396       The configuration file consists of zero or more global variable defini‐
397       tions,  zero or more hostgroup definitions, and one or more watch defi‐
398       nitions. Each watch definition may have one  or  more  service  defini‐
399       tions.  A watch definition is terminated by a blank line, another defi‐
400       nition, or the end of the file. A line beginning with optional  leading
401       whitespace and a pound ("#") is regarded as a comment, and is ignored.
403       Lines  are parsed as they are read. Long lines may be continued by end‐
404       ing them with a backslash ("\").  If a  line  is  continued,  then  the
405       backslash, the trailing whitespace after the backslash, and the leading
406       whitespace of the following line are removed. The end result is  assem‐
407       bled into a single line.
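       The continuation rule above can be illustrated with a small sketch.
       It is not mon's parser; join_continued_lines is a hypothetical helper
       that applies only the three removals described (the backslash, the
       whitespace after it, and the leading whitespace of the next line).

```python
import re


def join_continued_lines(lines):
    """Merge backslash-continued lines into single logical lines."""
    result = []
    pending = None                      # text of an unfinished logical line
    for raw in lines:
        line = raw.rstrip("\n")
        if pending is not None:
            line = pending + line.lstrip()
            pending = None
        match = re.search(r"\\\s*$", line)
        if match:
            pending = line[:match.start()]  # drop backslash and what follows
        else:
            result.append(line)
    if pending is not None:
        result.append(pending)          # file ended on a continued line
    return result
```

       Note that whitespace before the backslash is kept, so it serves as
       the separator between the joined parts.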
409       Typically the configuration file has the following layout:
411       1. Global variable definitions
413       2. Hostgroup definitions
415       3. Watch definitions
       See the "etc/example.cf" file which comes with the distribution for an
       example.
421   Global Variables
422       The following variables may be set to  override  compiled-in  defaults.
423       Command-line  options  will have a higher precedence than these defini‐
424       tions.
427       alertdir = dir
428              dir is the full path to the alert scripts. This is the value set
429              by the -a command-line parameter.
431              Multiple  alert paths may be specified by separating them with a
432              colon.  Non-absolute paths are taken to be relative to the  base
433              directory (/usr/lib/mon by default).
435              When  the configuration file is read, all alerts referenced from
436              the configuration will be looked up in each of these paths,  and
437              the full path to the first instance of the alert found is stored
438              in a hash. This hash is only generated upon startup or  after  a
439              "reset" command, so newly added alert scripts will not be recog‐
440              nized until a "reset" is performed.
443       mondir = dir
444              dir is the full path to the monitor scripts. This value may also
445              be  set  by the -s command-line parameter. If this path does not
446              begin with a "/", it will be relative to basedir.
              Multiple monitor paths may be specified by separating them with
              a colon. All paths must be absolute.
451              When  the  configuration  file  is read, all monitors referenced
452              from the configuration will be looked up in each of these paths,
453              and  the full path to the first instance of the monitor found is
454              stored in a hash. This hash is only generated  upon  startup  or
455              after a "reset" command, so newly added monitor scripts will not
456              be recognized until a "reset" is performed.
459       statedir = dir
460              dir is the full path to the  state  directory.   mon  uses  this
461              directory  to  save various state information. If this path does
462              not begin with a "/", it will be relative to basedir.
465       logdir = dir
466              dir is the full path to the log directory.  mon uses this direc‐
467              tory  to  save various logs, including the downtime log. If this
468              path does not begin with a "/", it will be relative to basedir.
471       basedir = dir
472              dir is the full path for the  state,  log,  monitor,  and  alert
473              directories.
476       cfbasedir = dir
477              dir  is  the  full  path where all the config files can be found
478              (monusers.cf, auth.cf, etc.).
481       authfile = file
482              file is the path to the authentication file. If  the  path  does
483              not begin with a "/", it will be relative to cfbasedir.
486       authtype = type [type...]
487              type  is  the  type  of authentication to use. A space-separated
              list of types may be specified, and they will be checked in the
              order they are listed. As soon as a successful authentication is
490              performed, the user is considered authenticated by mon  for  the
491              duration  of  the  session and no more authentication checks are
492              performed.
494              If type is getpwnam, then the standard Unix passwd file  authen‐
495              tication  method will be used (calls getpwnam(3) on the user and
496              compares the crypt(3)ed version of the  password  with  what  it
497              gets  from getpwnam). This will not work if shadow passwords are
498              enabled on the system.
500              If type is userfile, then usernames  and  hashed  passwords  are
501              read from userfile, which is defined via the userfile configura‐
502              tion variable.
504              If type is pam, then PAM (pluggable authentication modules) will
505              be  used  for authentication.  The service specified by the pam‐
506              service global will be used. If no  global  is  given,  the  PAM
507              passwd service will be used.
              If type is trustlocal, then if the client connection comes from
              localhost, the username passed from the client will be trusted,
              and the password will be ignored.  This can be used when you
              want the client to handle the authentication for you, e.g. a
              CGI script using one of the many Apache authentication methods.
516       userfile = file
517              This file is used when authtype is set to userfile.  It consists
518              of a sequence of lines of  the  format  'username  :  password'.
519              password  is  stored  as  the hash returned by the standard Unix
520              crypt(3) function.  NOTE: the format of this file is  compatible
521              with  the Apache file based username/password file format. It is
522              possible to use the htpasswd program  supplied  with  Apache  to
523              manage the mon userfile.
525              Blank lines and lines beginning with # are ignored.
528       pamservice = service
529              The PAM service used for authentication. This is applicable only
530              if "pam" is specified as a parameter to the authtype setting. If
531              this global is not defined, it defaults to passwd.
534       serverbind = addr
537       trapbind = addr
539              serverbind and trapbind specify which address to bind the server
540              and trap ports to, respectively.  If these are not defined,  the
541              default  address  is INADDR_ANY, which allows connections on all
542              interfaces. For security reasons, it could be  a  good  idea  to
543              bind only to the loopback interface.
546       dtlogfile = file
              file is a file which will be used to record the downtime log.
              Whenever a service fails for some amount of time and then stops
              failing, this event is written to the log. If this parameter is
              not set, no logging is done. The format of the file is as
              follows (lines beginning with # are comments and may be
              ignored):
553              timenoticed group service firstfail downtime interval summary.
555              timenoticed is the time(2) the service came back up.
557              group service is the group and service which failed.
559              firstfail is the time(2) when the service began to fail.
561              downtime is the number of seconds the service failed.
563              interval  is  the  frequency  (in  seconds)  that the service is
564              polled.
566              summary is the summary line from when the service was failing.
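              A reader for this format could look like the sketch below.  It
              assumes the fields are separated by whitespace and that the
              four numeric fields are whole seconds; neither detail is
              stated above, and the names used are hypothetical.

```python
from collections import namedtuple

DowntimeEntry = namedtuple(
    "DowntimeEntry",
    "timenoticed group service firstfail downtime interval summary",
)


def parse_dtlog_line(line):
    """Parse one downtime log line; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 6)        # the summary keeps its own spaces
    if len(fields) < 6:
        raise ValueError(f"too few fields: {line!r}")
    summary = fields[6] if len(fields) > 6 else ""
    return DowntimeEntry(int(fields[0]), fields[1], fields[2],
                         int(fields[3]), int(fields[4]), int(fields[5]),
                         summary)
```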
569       monerrfile = filename
570              By default, when mon daemonizes itself, it connects  stdout  and
571              stderr to /dev/null. If monerrfile is set to a file, then stdout
572              and stderr will be appended to that file. In all cases stdin  is
573              connected  to /dev/null. If mon is told to run in the foreground
574              and  to  not  daemonize,  then  none  of  this  applies,   since
575              stdin/stdout/stderr  stay connected to whatever they were at the
576              time of invocation.
579       dtlogging = yes/no
581              Turns downtime logging on or off. The default is off.
584       histlength = num
              num is the maximum number of events to be retained in the
              history list. The default is 100.  This value may also be set by
              the -k command-line parameter.
590       historicfile = file
591              If this variable is set, then alerts are  logged  to  file,  and
592              upon  startup,  some  (or  all) of the past history is read into
593              memory.
596       historictime = timeval
              timeval is the amount of the history file to read upon startup.
              "Now" - timeval is read. See the explanation of interval in the
              "Service Definitions" section for a description of timeval.
602       serverport = port
603              port is the TCP port number that the server should bind to. This
604              value may also be set by the -p command-line parameter. Normally
605              this port is looked up via getservbyname(3), and it defaults  to
606              2583.
609       trapport = port
610              port is the UDP port number that the trap server should bind to.
611              Normally this port is looked up  via  getservbyname(3),  and  it
612              defaults to 2583.
615       pidfile = path
              path is the file the server will store its pid in.  This value
              may also be set by the -P command-line parameter.
620       maxprocs = num
621              Throttles the number of concurrently forked  processes  to  num.
622              The intent is to provide a safety net for the unlikely situation
623              when the server tries to take on too many tasks at  once.   Note
624              that this situation has only been reported to happen when trying
625              to use a garbled configuration file! You don't  want  to  use  a
626              garbled configuration file now, do you?
629       cltimeout = secs
630              Sets  the  client  inactivity timeout to secs.  This is meant to
631              help thwart denial of service attacks or  recover  from  crashed
632              clients.  secs is interpreted as a "1h/1m/1s" string, where "1m"
633              = 60 seconds.
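              As an illustration of the "1h/1m/1s" notation, the sketch
              below converts such a string to seconds.  It covers only the
              three suffixes named above with whole-number values; whether
              mon accepts other units or fractions is not stated here, and
              parse_timeval is a hypothetical name.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_timeval(text):
    """Convert a string such as "30s", "1m" or "1h" to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported time value: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
```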
636       randstart = interval
              When the server starts, normally no service is scheduled until
              the interval defined in its service section has elapsed.  This
              can cause long delays before the first check of a service, and
              possibly a high load on the server if multiple things are
              scheduled at the same intervals.  This option is used to
              randomize the scheduling of the first test for all services
              during the startup period, and immediately after the reset
              command.  If randstart is defined, the scheduled run time of
              all services of all watch groups will be a random number
              between zero and randstart seconds.
649       dep_recur_limit = depth
              Limit dependency recursion level to depth.  If dependency
              recursion (dependencies which depend on other dependencies)
              tries to go beyond depth, then the recursion is aborted and a
              message is logged to syslog.  The default limit is 10.
656       dep_behavior = {a|m|hm}
657              dep_behavior controls whether  the  dependency  expression  sup‐
658              presses  one of: the running of alerts, the running of monitors,
659              or the passing of individual hosts to the monitors.   Read  more
660              about the behavior in the "Service Definitions" section below.
662              This is a global setting which controls the default settings for
663              the service-specified variable.
666       dep_memory = timeval
667              If set, dep_memory will cause dependencies to continue  to  pre‐
668              vent  alerts/monitoring  for  a period of time after the service
669              returns to a normal state.  This can be used  to  prevent  over-
670              eager  alerting  when  a machine is rebooting, for example.  See
671              the explanation of interval in the "Service Definitions" section
672              for a description of timeval.
674              This is a global setting which controls the default settings for
675              the service-specified variable.
678       syslog_facility = facility
679              Specifies the syslog facility used for logging.  daemon  is  the
680              default.
685       startupalerts_on_reset = {yes|no}
687              If  set  to  "yes", startupalerts will be invoked when the reset
688              client command is executed. The default is "no".
691       monremote = program
693              If set, this external program will be called by Mon when various
694              client  requests  are  processed.  This can be used to propagate
695              those changes from one Mon server to another, if you have multi‐
696              ple  monitoring  machines.   An  example script, monremote.pl is
697              available in the clients directory.
700   Hostgroup Entries
701       Hostgroup entries begin with the keyword hostgroup, and are followed by
702       a hostgroup tag and one or more hostnames or IP addresses, separated by
703       whitespace. The hostgroup tag must be composed of alphanumeric  charac‐
704       ters,  a  dash ("-"), a period ("."), or an underscore ("_"). Non-blank
705       lines following the first hostgroup line are interpreted as more  host‐
706       names.  The hostgroup definition ends with a blank line. For example:
708              hostgroup servers nameserver smtpserver nntpserver
709                   nfsserver httpserver smbserver
711              hostgroup router_group cisco7000 agsplus
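       The hostgroup syntax above can be read with a few lines of code.  The
       sketch below is illustrative only: parse_hostgroups is a hypothetical
       helper, it does not validate the tag characters, and it ignores
       comments and every other kind of definition.

```python
def parse_hostgroups(text):
    """Return {tag: [hostnames]} for the hostgroup entries in text."""
    groups = {}
    current = None                      # host list of the open definition
    for line in text.splitlines():
        words = line.split()
        if not words:
            current = None              # a blank line ends the definition
        elif words[0] == "hostgroup" and len(words) >= 2:
            current = groups.setdefault(words[1], [])
            current.extend(words[2:])
        elif current is not None:
            current.extend(words)       # non-blank line: more hostnames
    return groups
```

       Applied to the example above, this yields a "servers" group with six
       members and a "router_group" group with two.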
714   View Entries
715       View  entries  begin  with the keyword view, and are followed by a view
716       tag and the names of one or more hostgroups.  The view tag must be com‐
717       posed  of  alphanumeric characters, a dash ("-"), a period ("."), or an
718       underscore ("_"). Non-blank lines following the  first  view  line  are
719       interpreted  as  more hostgroup names.  The view definition ends with a
720       blank line. For example:
722              view servers dns-servers web-servers file-servers
723                   mail-servers
725              view network-services routers switches vpn-servers
729   Watch Group Entries
730       Watch entries begin with a line that starts  with  the  keyword  watch,
731       followed  by  whitespace  and  a single word which normally refers to a
732       pre-defined hostgroup. If the second word is not recognized as a  host‐
733       group  tag, a new hostgroup is created whose tag is that word, and that
734       word is its only member.
736       Watch entries consist of one or more service definitions.
738       A watch group is terminated by a blank line, the end of the file, or by
739       a subsequent definition, "watch", "hostgroup", or otherwise.
741       There may be a special watch group entry called "default". If a default
742       watch group is defined with a service entry named "default", then  this
743       definition  will be used in handling traps received for an unrecognized
744       watch and service.
747   Service Definitions
748       service servicename
749              A service definition begins with the keyword service followed
750              by  a word which is the tag for this service.  This word must be
751              unique among all services defined for the same watch group.
753              The components of a service are an interval, monitor, and one or
754              more time period definitions, as defined below.
756              If  a  service name of "default" is defined within a watch group
757              called "default" (see above), then the default/default
758              definition will be used for handling unknown mon traps.
760              The  following configuration parameters are valid only following
761              a service definition:
764       VARIABLE=value
765              Environment variables may be defined  for  each  service,  which
766              will  be  included  in  the  environment of monitors and alerts.
767              Variables must be specified in all capital letters,  must  begin
768              with  an alphabetical character or an underscore, and there must
769              be no spaces to the left of the equal sign.
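              As an illustration, a service might export a variable for its
              monitor and alert scripts to read.  This is only a sketch:
              HTTP_TIMEOUT is a made-up variable name, and whether a script
              consults it depends entirely on that script.

                     service http
                          HTTP_TIMEOUT=10
                          interval 5m
                          monitor http.monitor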
772       interval timeval
773              The keyword interval followed by a time value specifies the fre‐
774              quency that a monitor script will be triggered.  Time values are
775              defined as "30s", "5m", "1h", or "1d",  meaning  30  seconds,  5
776              minutes, 1 hour, or 1 day. The numeric portion may be a
777              fraction, such as "1.5h" for an hour and a half.  This format
778              of a time specification will be referred to as timeval.
781       failure_interval timeval
782              Adjusts  the  polling interval to timeval when the service check
783              is failing. Resets the interval to the original when the service
784              succeeds.
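              A minimal sketch combining interval and failure_interval,
              assuming fping.monitor is installed in the monitor directory:
              the service is polled every 5 minutes while it succeeds and
              every minute while it is failing.

                     service ping
                          interval 5m
                          failure_interval 1m
                          monitor fping.monitor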
787       traptimeout timeval
788              This  keyword  takes  the  same  time  specification argument as
789              interval, and makes the service expect a trap from  an  external
790              source  at  least that often, else a failure will be registered.
791              This is used for a heartbeat-style service.
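              For example, a heartbeat service might look like the following
              sketch.  The watch and service names and the mail address are
              placeholders.  A failure is registered if no trap arrives for
              this service within 10 minutes.

                     watch heartbeats
                          service host1-alive
                               traptimeout 10m
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org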
794       trapduration timeval
795              If a trap is received, the status of the service  the  trap  was
796              delivered  to  will normally remain constant. If trapduration is
797              specified, the status of the service will remain  in  a  failure
798              state for the duration specified by timeval, and then it will be
799              reset to "success".
802       randskew timeval
803              Rather than schedule the monitor script to run at the  start  of
804              each  interval,  randomly  adjust  the interval specified by the
805              interval parameter by plus-or-minus randskew.  The skew value
806              is specified in the same format as the interval parameter
807              ("30s", "5m", etc.).  For example, if interval is "1m" and
808              randskew is "5s", then mon will schedule the monitor script every
809              55 to 65 seconds.  The intent is to help distribute the load on
810              the  server  when many services are scheduled at the same inter‐
811              vals.
814       monitor monitor-name [arg...]
815              The keyword monitor followed by  a  script  name  and  arguments
816              specifies  the monitor to run when the timer expires. Shell-like
817              quoting conventions are followed when specifying  the  arguments
818              to  send  to the monitor script.  The script is invoked from the
819              directory given with the -s argument, and  all  following  words
820              are  supplied  as  arguments to the monitor program, followed by
821              the list of hosts in the group referred to by the current  watch
822              group.   If  the monitor line ends with ";;" as a separate word,
823              the host groups are not appended to the argument list  when  the
824              program is invoked.
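              The two invocation styles can be sketched as follows.  The
              script backup-age.monitor and its --max-hours option are
              hypothetical and serve only to show the ";;" form.  The period
              definitions are omitted for brevity.

                     hostgroup servers nameserver smtpserver

                     watch servers
                          service ping
                               interval 5m
                               monitor fping.monitor
                          service backup
                               interval 1d
                               monitor backup-age.monitor --max-hours 26 ;;

              Under these assumptions, fping.monitor is run with the
              arguments "nameserver smtpserver", while backup-age.monitor is
              run with "--max-hours 26" only.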
827       allow_empty_group
828              The  allow_empty_group option will allow a monitor to be invoked
829              even when the hostgroup for that watch is empty because of  dis‐
830              abled  hosts.  The default behavior is not to invoke the monitor
831              when all hosts in a hostgroup have been disabled.
834       description descriptiontext
835              The text following description is queried  by  client  programs,
836              passed  to  alerts  and monitors via an environment variable. It
837              should contain a brief description of the service, suitable  for
838              inclusion in an email or on a web page.
841       exclude_hosts host [host...]
842              Any  hosts  listed after exclude_hosts will be excluded from the
843              service check.
846       exclude_period periodspec
847              Do not run a scheduled monitor during  the  time  identified  by
848              periodspec.
851       depend dependexpression
852              The  depend  keyword is used to specify a dependency expression,
853              which evaluates to either true or false, in the boolean sense.
854              Dependencies are actual Perl expressions, and must obey all syn‐
855              tactical rules. The expressions are evaluated in their own pack‐
856              age  space  so  as  to not accidentally have some unwanted side-
857              effect.  If a syntax error is found when evaluating the  expres‐
858              sion, it is logged via syslog.
860              Before evaluation, the following substitutions on the expression
861              occur: phrases which look like "group:service"  are  substituted
862              with  the value of the current operational status of that speci‐
863              fied service. These opstatus substitutions are  computed  recur‐
864              sively,  so  if  service A depends upon service B, and service B
865              depends upon service C, then service A depends upon  service  C.
866              Successful  operational  statuses  (which  evaluate  to "1") are
867              "STAT_OK",     "STAT_COLDSTART",      "STAT_WARMSTART",      and
868              "STAT_UNKNOWN".   The  word "SELF" (in all caps) can be used for
869              the group (e.g. "SELF:service"), and is an abbreviation for  the
870              current watch group.
872              This  feature  can  be used to control alerts for services which
873              are dependent on other services, e.g.  an  SMTP  test  which  is
874              dependent upon the machine being ping-reachable.
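              A sketch of such a configuration, with placeholder group names
              and mail address.  It assumes that a separate "routers" watch
              group with a "ping" service is defined elsewhere.  After
              substitution, the expression below becomes an ordinary Perl
              boolean expression such as "1 && 1".

                     watch mail-servers
                          service ping
                               interval 5m
                               monitor fping.monitor
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org
                          service smtp
                               interval 10m
                               monitor smtp.monitor
                               depend routers:ping && SELF:ping
                               dep_behavior a
                               period wd {Sun-Sat}
                                    alert mail.alert someone@your.org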
877       dep_behavior {a|m|hm}
878              The evaluation of the dependency graphs specified via the depend
879              keyword can control the suppression of alert or monitor  invoca‐
880              tions, or the suppression of individual hosts passed to the mon‐
881              itor.
883              Alert suppression.  If this option  is  set  to  "a",  then  the
884              dependency  expression  will  be evaluated after the monitor for
885              the service exits or after a trap is received.   An  alert  will
886              only  be  sent  if the evaluation succeeds, meaning that none of
887              the nodes in the dependency graph indicate failure.
889              Monitor suppression.  If it is set to "m", then  the  dependency
890              expression will be evaluated before the monitor for the service
891              is about to run.  If the evaluation succeeds, then the monitor
892              will be run. Otherwise, the monitor will not be run and the sta‐
893              tus of the service will remain the same.
895              Host suppression.  If it is set to "hm" then  Mon  will  extract
896              the  list  of  "parent" services from the dependency expression.
897              (In fact the expression can be just a list  of  services.)  Then
898              when  the  monitor  for the service is about to be run, for each
899              host in the current hostgroup Mon will  search  all  the  parent
900              services  which  are currently failing and look for the hostname
901              in the current summary output.  If the hostname is  found,  this
902              host will be excluded from this run of the monitor.  This can be
903              used to e.g. allow an SMTP test on a group of hosts to still  be
904              run  even  when a single host is not ping-reachable.  If all the
905              rest of the hosts are working fine, the service will be in an OK
906              state,  but  if  another  host fails the SMTP test Mon can still
907              alert about that host even  though  the  parent  dependency  was
908              failing.  The dependency expression will not be used recursively
909              in this case.
912       alertdepend dependexpression
914       monitordepend dependexpression
916       hostdepend dependexpression
917              These keywords allow you to specify multiple dependency  expres‐
918              sions of different types.  Each one corresponds to the different
919              dep_behavior settings listed  above.   They  will  be  evaluated
920              independently  in  the  different  contexts as listed above.  If
921              depend is present, it takes precedence over  the  matching  key‐
922              word, depending on the dep_behavior setting.
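              As a sketch, a service could combine two dependency types
              instead of using depend and dep_behavior.  The "routers" group
              is a placeholder assumed to be defined elsewhere.  Here alerts
              are suppressed while the routers fail their ping check, and
              hosts that currently fail the local ping service are left out
              of the SMTP monitor run.

                     service smtp
                          interval 10m
                          monitor smtp.monitor
                          alertdepend routers:ping
                          hostdepend SELF:ping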
925       dep_memory timeval
926              If  set,  dep_memory will cause dependencies to continue to pre‐
927              vent alerts/monitoring for a period of time  after  the  service
928              returns  to  a  normal state.  This can be used to prevent over-
929              eager alerting when a machine is rebooting,  for  example.   See
930              the explanation of interval in the "Service Definitions" section
931              for a description of timeval.
934       redistribute alert [arg...]
935              A service may have one redistribute option, which is  a  special
936              form of an alert definition.  This alert will be called on
937              every service status  update,  even  sequential  success  status
938              updates.   This  can be used to integrate Mon with another moni‐
939              toring system, or to link together multiple Mon servers  via  an
940              alert script that generates Mon traps.  See the "ALERT PROGRAMS"
941              section above for a list of the parameters mon will  pass  auto‐
942              matically to alert programs.
945       unack_summary
946              Remove  the  "acknowledged"  state from a service if the summary
947              component of the failure message changes.  In most common  usage
948              the summary is the list of hosts that are failing, so additional
949              hosts failing would remove an ack.
953   Period Definitions
954       Periods are used to define the conditions which should allow alerts  to
955       be delivered.
958       period [label:] periodspec
959              A  period  groups one or more alarms and variables which control
960              how often an alert happens when there is a failure.  The  period
961              definition has two forms. The first takes an argument which is a
962              period specification from Patrick  Ryan's  Time::Period  Perl  5
963              module. Refer to "perldoc Time::Period" for more information.
965              The second form requires a label followed by a period specifica‐
966              tion, as defined above. The label is  a  tag  consisting  of  an
967              alphabetic  character  or  underscore  followed  by zero or more
968              alphanumerics or underscores and ending with a colon. This  form
969              allows multiple periods with the same period definition. One use
970              is to have a  period  definition  which  has  no  alertafter  or
971              alertevery  parameters for a particular time period, and another
972              for the same time period with a different  set  of  alerts  that
973              does contain those parameters.
975              Period  definitions, in either the first or second form, must be
976              unique within each service definition. For example, if you  need
977              to  define two periods both for "wd {Sun-Sat}", then one or both
978              of the period definitions must specify a label such  as  "period
979              t1: wd {Sun-Sat}" and "period t2: wd {Sun-Sat}".
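              A sketch of the labeled form, using placeholder labels and
              mail addresses.  Both periods cover the same time range.  The
              first rate-limits its alert, and the second sends an alert on
              every detected failure.

                     service http
                          interval 5m
                          monitor http.monitor
                          period limited: wd {Sun-Sat}
                               alertevery 1h
                               alertafter 3
                               alert mail.alert oncall@your.org
                          period every: wd {Sun-Sat}
                               alert mail.alert log@your.org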
982       alertevery timeval [observe_detail | strict]
983              The  alertevery  keyword  (within a period definition) takes the
984              same type of argument as the interval variable, and  limits  the
985              number  of  times an alert is sent when the service continues to
986              fail.  For example, if alertevery is set to "1h", then the
987              alerts in the period section will be triggered only once every
988              hour. If the alertevery keyword is omitted in a period entry, an
989              alert  will  be  sent  out  every time a failure is detected. By
990              default, if  the  summary  output  of  two  successive  failures
991              changes,  then  the  alertevery  interval  is overridden, and an
992              alert will be sent.  If the string "observe_detail" is the  last
993              argument,  then both the summary and detail output lines will be
994              considered when comparing the output of successive failures.  If
995              the string "strict" is the last argument, then the output of the
996              monitor or the state change of the service will have  no  effect
997              on  when  alerts are sent. That is, "alertevery 24h strict" will
998              send only one alert every 24  hours,  no  matter  what.   Please
999              refer  to the ALERT DECISION LOGIC section for a detailed expla‐
1000              nation of how alerts are suppressed.
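              The variants can be sketched with labeled periods as follows.
              The labels and mail addresses are placeholders, and a real
              configuration would normally use only one of these variants.

                     period summary: wd {Sun-Sat}
                          alertevery 1h
                          alert mail.alert someone@your.org
                     period detail: wd {Sun-Sat}
                          alertevery 1h observe_detail
                          alert mail.alert someone@your.org
                     period daily: wd {Sun-Sat}
                          alertevery 24h strict
                          alert mail.alert archive@your.org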
1003       alertafter num
1006       alertafter num timeval
1009       alertafter timeval
1010              The alertafter keyword  (within  a  period  section)  has  three
1011              forms:  only  with the "num" argument, or with the "num timeval"
1012              arguments, or only with the "timeval" argument.   In  the  first
1013              form,  an  alert  will  only  be invoked after "num" consecutive
1014              failures.
1016              In the second form, the arguments are a  positive  integer  fol‐
1017              lowed  by  an  interval,  as  described by the interval variable
1018              above.  If these parameters are specified, then the  alerts  for
1019              that  period will only be called after that many failures happen
1020              within that interval. For example, if alertafter  is  given  the
1021              arguments  "3 30m",  then the alert will be called if 3 failures
1022              happen within 30 minutes.
1024              In the third form, the argument is an interval, as described  by
1025              the  interval  variable above.  Alerts for that period will only
1026              be called if the service has been in a failure  state  for  more
1027              than the length of time described by the interval, regardless of
1028              the number of failures noticed within that interval.
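              The three forms described above can be sketched side by side.
              Labeled periods are used only so that all three fit in one
              fragment.  The labels and the mail address are placeholders,
              and a real service would typically use just one form.

                     period consecutive: wd {Sun-Sat}
                          alertafter 3
                          alert mail.alert someone@your.org
                     period windowed: wd {Sun-Sat}
                          alertafter 3 30m
                          alert mail.alert someone@your.org
                     period elapsed: wd {Sun-Sat}
                          alertafter 15m
                          alert mail.alert someone@your.org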
1031       numalerts num
1033              This variable tells the server to call no more than  num  alerts
1034              during  a  failure.  The  alert  counter is kept on a per-period
1035              basis, and is reset upon each success.
1038       no_comp_alerts
1040              If this option is specified, then upalerts will be called  when‐
1041              ever  the  service state changes from failure to success, rather
1042              than only after a corresponding "down" alert.
1045       alert alert [arg...]
1046              A period may contain multiple alerts, which are  triggered  upon
1047              failure  of  the  service.  An alert is specified with the alert
1048              keyword, followed by an optional exit parameter,  and  arguments
1049              which  are  interpreted  the same as the monitor definition, but
1050              without the ";;" exception. The exit parameter takes the form of
1051              exit=x  or  exit=x-y  and  has the effect that the alert is only
1052              called if the exit status of the monitor script falls within the
1053              range  of the exit parameter. If, for example, the alert line is
1054              "alert exit=10-20 mail.alert mis", then mail.alert will only be
1055              invoked  with mis as its arguments if the monitor program's exit
1056              value is between 10 and 20. This feature allows you  to  trigger
1057              different  alerts  at  different severity levels (like when free
1058              disk space goes from 8% to 3%).
1060              See the ALERT PROGRAMS section above for a list of the
1061              parameters mon will pass automatically to alert programs.
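              As a sketch of severity levels, suppose a monitor exits with a
              status from 1 to 9 for warnings and from 10 to 20 for critical
              failures.  These exit codes and the mail addresses are
              assumptions made for this example, not a convention defined by
              mon.

                     period wd {Sun-Sat}
                          alert exit=1-9 mail.alert someone@your.org
                          alert exit=10-20 mail.alert oncall@your.org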
1064       upalert alert [arg...]
1065              An upalert is the complement of an alert.  An upalert is called
1066              when a service makes the state transition from failure to
1067              success, if a corresponding "down" alert was previously sent. The
1068              upalert script is called supplying the same  parameters  as  the
1069              alert  script,  with  the  addition of the -u parameter which is
1070              simply used to let an alert script know that it is being  called
1071              as  an  upalert.  Multiple  upalerts  may  be specified for each
1072              period definition.  Set the per-period no_comp_alerts option  to
1073              send an upalert regardless of whether or not a "down" alert was
1074              sent.
1077       startupalert alert [arg...]
1078              A startupalert is only called when the mon server starts  execu‐
1079              tion,  or  when  a  "reset"  command  was  issued to the server,
1080              depending on the setting of the  startupalerts_on_reset  global.
1081              Unlike  other alerts, startupalerts are not called following the
1082              exit of a monitor, i.e. they are  called  in  their  own  right,
1083              therefore  the  "exit="  argument  is  not applicable to startu‐
1084              palert.
1087       upalertafter timeval
1088              The upalertafter parameter is specified as a string that follows
1089              the  syntax  of  the interval parameter ("30s", "1m", etc.), and
1090              controls the triggering of an upalert.  If a service comes  back
1091              up  after  being  down  for  a time greater than or equal to the
1092              value of this option, an upalert will be called. Use this option
1093              to prevent upalerts from being called because of "blips" (brief
1094              outages).
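              For instance, the following sketch (with a placeholder mail
              address) calls the upalert only when the service recovers
              after having been down for at least 5 minutes.

                     period wd {Sun-Sat}
                          alert mail.alert someone@your.org
                          upalert mail.alert someone@your.org
                          upalertafter 5m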


1098       The file specified by the authfile variable in the  configuration  file
1099       (or  passed  via  the  -A parameter) will be loaded upon startup.  This
1100       file defines restrictions upon which client commands may be executed by
1101       which users. It is a text file which consists of comments, command def‐
1102       initions, and trap authentication parameters.  A  comment  line  begins
1103       with optional whitespace followed by a pound sign.  Blank lines are
1104       ignored.
1106       The file is separated into a command section and a trap  section.  Sec‐
1107       tions  are  specified  by a single line containing one of the following
1108       statements:
1110                   command section
1112       or
1114                   trap section
1116       Lines following one of the above statements apply to that section until
1117       either the end of the file or another section begins.
1119       A  command  definition consists of a command, followed by a colon, fol‐
1120       lowed by a comma-separated list of users who may execute  the  command.
1121       The  default  is that no users may execute any commands unless they are
1122       explicitly allowed in this configuration file. For clarity, a user  can
1123       be  denied  by prefixing the user name with "!". If the word "AUTH_ANY"
1124       is used for a username, then any authenticated user will be allowed  to
1125       execute  the  command.  If  the word "all" is used for a username, then
1126       that command may be executed by any user, authenticated or not.
1128       The trap section allows configuration of which  users  may  send  traps
1129       from  which  hosts.  The  syntax is a source host (name or ip address),
1130       whitespace, a username, whitespace, and a plaintext password  for  that
1131       user. If the source host is "*", then allow traps from any host. If the
1132       username is "*", then accept traps without regard for the  username  or
1133       password.  If  no  hosts  or users are specified, then no traps will be
1134       accepted.
1136       An example configuration file:
1138              command section
1139              list:          all
1140              reset:         root,admin
1141              loadstate:          root
1142              savestate:          root
1144              trap section
1145     root r@@tp4sswrd
1147       This means that all clients are able to perform the "list" command,
1148       "root" is able to perform "reset", "loadstate", and "savestate", and
1149       "admin" is able to execute the "reset" command.
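       A further sketch shows the special user names and the wildcard host
       described above.  The user names, the password, and the choice of the
       "disable" and "enable" commands are assumptions made for illustration
       only.

              command section
              list:          all
              disable:       AUTH_ANY
              enable:        AUTH_ANY
              reset:         root,admin,!guest

              trap section
              *    trapuser  s3cr3t

       Under these assumptions, any authenticated user may disable or enable
       items, "guest" is explicitly denied "reset", and traps are accepted
       from any host when sent as "trapuser" with the given password.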


1153       The server listens on TCP port 2583, which may be overridden using  the
1154       -p port  option.  Commands are a single line each, terminated by a new‐
1155       line.  The server can handle any number of simultaneous client  connec‐
1156       tions.


1160       See manual page for moncmd.


1164       Mon  has  the facility to receive special "mon traps" from any local or
1165       remote machine. Currently, the only available method  for  sending  mon
1166       traps are through the Mon::Client perl interface, though the UDP packet
1167       format is defined well enough to permit the writing of traps  in  other
1168       languages.
1170       Traps  are  handled  similarly to monitors: a trap sends an operational
1171       status, summary line, and description text, and mon generates an  alert
1172       or upalert as necessary.
1174       Traps  can  be caught by any watch/service group set up in the mon con‐
1175       figuration file, however it is suggested that you configure  watch/ser‐
1176       vice  groups  specifically  for  the  traps you expect to receive. When
1177       defining a special watch/service group for  traps,  do  not  include  a
1178       "monitor" directive (as no monitor need be invoked). Since a monitor is
1179       not being invoked, it is not necessary for the watch definition to have
1180       a  hostgroup  which  contains  real  host names.  Just make up a useful
1181       name, and mon will automatically create the watch group for you.
1183       Here is a simple config file example:
1185              watch trap-service
1186                   service host1-disks
1187                        description TRAP: for host1 disk status
1188                        period wd {Sun-Sat}
1189                             alert mail.alert someone@your.org
1190                             upalert mail.alert -u someone@your.org
1193       Since mon listens on a UDP port for any trap,  a  default  facility  is
1194       available  for handling traps to unknown groups or services.  To enable
1195       this facility,  you  must  include  a  "default"  watch  group  with  a
1196       "default"  service  entry  containing  the  specifics  of alarms.  If a
1197       default/default watch  group  and  service  are  not  configured,  then
1198       unknown  traps  get logged via syslog, and no alarm is sent.  NOTE: The
1199       default/default facility is a single entity as far  as  accounting  and
1200       alarming  go.  Alarm programs which are not aware of this fact may send
1201       confusing information when a failure trap comes from one machine,  fol‐
1202       lowed  by  a  success (ok) trap from a different machine. See the alarm
1203       environment variable MON_TRAP_INTENDED above for a possible way  around
1204       this.  It  is  intended  that  default/default be used as a facility to
1205       catch unknown traps, and should not be relied upon to catch  all  traps
1206       in  a  production  environment.  If  you  are lazy and only want to use
1207       default/default for catching all traps, it would  be  best  to  disable
1208       upalerts,  and  use the MON_TRAP_INTENDED environment variable in alert
1209       scripts to make the alerts more meaningful to you.
1211       Here is an example default facility:
1213              watch default
1214                   service default
1215                        description Default trap service
1216                        period wd {Sun-Sat}
1217                             alert mail.alert someone@your.org
1218                             upalert mail.alert -u someone@your.org


1223       The mon distribution comes with an example configuration  called  exam‐
1224       ple.cf.  Refer to that file for more information.


1228       moncmd(1), Time::Period(3pm), Mon::Client(3pm)


1231       mon  was  written  because  I couldn't find anything out there that did
1232       just what I needed, and nothing was worth modifying to add the features
1233       I  wanted.  It  doesn't have a cool name, and that bothers me because I
1234       couldn't think of one.


1237       Report bugs to the email address below.


1240       Jim Trocki <trockij@arctic.org>
1244Linux                    $Date: 2007/06/25 13:10:07 $                   mon(8)