SAM_OVERVIEW(8)   Corosync Cluster Engine Programmer's Manual  SAM_OVERVIEW(8)

NAME

       sam_overview - Overview of the Simple Availability Manager

OVERVIEW

       The SAM library provides a tool to check the health of an application.
       The main purpose of SAM is to restart a local process when it fails to
       respond to a healthcheck request within a configured time interval.

       During sam_initialize(3), a duplicate copy of the process is created
       using the fork(2) system call.  This duplicate process contains the
       logic for executing the SAM server.  The SAM server is responsible for
       requesting healthchecks from the active process and for controlling
       the lifecycle of the active process when it fails.  If the active
       process fails to respond to the healthcheck request sent by the SAM
       server, it will be sent a user-configurable signal (default SIGTERM)
       to request shutdown of the application.  After a configured time
       interval, the process will be forcibly killed by being sent a SIGKILL
       signal.  Once the active process terminates, the SAM server will
       create a new active process.

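       The warn-then-kill sequence described above can be pictured with a
       small POSIX sketch.  It is an illustration only and is not taken from
       the SAM sources; the names stop_with_escalation() and
       spawn_stubborn_child() are hypothetical, and the 10 ms polling step
       is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Illustration only, not SAM's implementation: ask the child `pid` to shut
 * down with `warn_signal`, wait up to `grace_ms` milliseconds, then force
 * termination with SIGKILL.
 * Returns the signal that ended the child, 0 if it exited on its own,
 * or -1 on error.
 */
int stop_with_escalation(pid_t pid, int warn_signal, int grace_ms)
{
    const struct timespec tick = { 0, 10 * 1000 * 1000 };  /* 10 ms */
    int status = 0;

    assert(grace_ms >= 0);
    if (kill(pid, warn_signal) == -1)
        return -1;

    /* Poll for termination in 10 ms steps until the grace period is over. */
    for (int waited = 0; waited < grace_ms; waited += 10) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
        if (r == -1 && errno != EINTR)
            return -1;
        nanosleep(&tick, NULL);
    }

    /* Grace period expired: SIGKILL cannot be caught or ignored. */
    if (kill(pid, SIGKILL) == -1)
        return -1;
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Hypothetical helper for trying the sketch: a child that ignores SIGTERM. */
pid_t spawn_stubborn_child(void)
{
    struct sigaction ign, old;
    pid_t pid;

    memset(&ign, 0, sizeof ign);
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    /* Ignore SIGTERM before fork() so the child inherits the disposition. */
    if (sigaction(SIGTERM, &ign, &old) == -1)
        return -1;

    pid = fork();
    if (pid == 0) {
        for (;;)
            pause();            /* never exits voluntarily */
    }
    sigaction(SIGTERM, &old, NULL);  /* restore the parent's disposition */
    return pid;
}
```

       Calling stop_with_escalation(spawn_stubborn_child(), SIGTERM, 50)
       sends SIGTERM, waits about 50 ms, and then falls back to SIGKILL
       because the child ignores the first signal.
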
       The Simple Availability Manager is meant to be used in conjunction with
       the cpg service.  Used together, they make it possible to restart a cpg
       process that fails healthchecking during operation.

       The main features of SAM include:

              ·  A configurable recovery policy.

              ·  A configurable time interval for health check operations.

              ·  A notification via signal before recovery action is taken.

              ·  A mechanism to indicate to the application the number of
                 times an active process has been created by the SAM server.

              ·  Both application driven health checking and event driven
                 health checking.

Initializing SAM

       The SAM library is initialized by sam_initialize(3).  sam_initialize(3)
       may only be called once per process.  Calling it more than once has
       undefined results and is neither recommended nor tested.

Setting warning callback

       A user-configurable signal (default SIGTERM) is sent to the application
       when a recovery action is planned.  The application can use the
       signal(2) system call to install a handler for this signal.

       There are no special constraints on which SAM APIs may be called in a
       warning callback.  After time_interval expires, a SIGKILL signal is
       sent to the active process to force its termination.
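
       A minimal sketch of a warning handler follows.  It uses sigaction(2)
       rather than signal(2), and it only records the event so that the main
       loop can react to it; both are choices made for this example and are
       not requirements of SAM.  The names install_warning_handler() and
       warning_pending() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Written by the handler; sig_atomic_t is safe to store from a handler. */
static volatile sig_atomic_t warning_seen = 0;

static void on_warning(int signo)
{
    (void)signo;
    warning_seen = 1;   /* only record the event here */
}

/*
 * Install the handler for the warning signal (SIGTERM unless the
 * application configured another one).  Returns 0 on success, -1 on error.
 */
int install_warning_handler(int warn_signal)
{
    struct sigaction sa;

    assert(warn_signal > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_warning;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(warn_signal, &sa, NULL);
}

/* Called from the main loop: returns 1 once per warning, otherwise 0. */
int warning_pending(void)
{
    if (!warning_seen)
        return 0;
    warning_seen = 0;
    return 1;
}
```

       The main loop would call warning_pending() regularly and save state
       or shut down cleanly before the SIGKILL arrives.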

Registering the active process

       The active process is registered with SAM by calling sam_register(3).
       This function should be called only once in a process.  After a
       recovery action is taken, the new active process begins execution at
       the line of code that follows the call to sam_register(3) in the user
       process.
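
       The restart behaviour can be pictured as a fork(2) loop in which each
       new child receives an instance number, which plays the role of the
       restart count mentioned in the feature list.  The sketch below is an
       assumption about the general technique, not SAM's code; supervise()
       and demo_work() are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Illustration only: run `work` in a forked child and create a new child
 * each time the previous one fails (non-zero exit or death by signal).
 * `work` receives a 1-based instance number.  Returns the number of
 * children that were created, or -1 if fork() or waitpid() failed.
 */
int supervise(int (*work)(unsigned int instance_id), unsigned int max_instances)
{
    assert(work != NULL);

    for (unsigned int id = 1; id <= max_instances; id++) {
        int status = 0;
        pid_t pid = fork();

        if (pid == -1)
            return -1;
        if (pid == 0)
            _exit(work(id) == 0 ? 0 : 1);   /* child: run one instance */

        while (waitpid(pid, &status, 0) == -1) {
            if (errno != EINTR)
                return -1;
        }
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return (int)id;                 /* clean exit, no recovery */
    }
    return (int)max_instances;              /* gave up after the limit */
}

/* Hypothetical workload: the first two instances fail, the third succeeds. */
int demo_work(unsigned int instance_id)
{
    return instance_id < 3 ? 1 : 0;
}
```

       With demo_work(), supervise(demo_work, 5) creates three children: two
       that fail and one that exits cleanly.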

Enabling event driven healthchecking

       Two types of healthchecking are available to the user.  In the first
       model, the user application sends healthchecks during its normal
       operation.  It is never requested to healthcheck, and if the active
       process doesn't send a healthcheck within the time interval, the
       process will be restarted.

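       In the application driven model, the supervising side only has to
       compare the time of the last healthcheck with the configured
       interval.  A hedged sketch of that bookkeeping follows; struct
       hc_state and both functions are hypothetical, and times are assumed
       to come from a monotonic clock in milliseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping a supervisor could keep for the active process. */
struct hc_state {
    uint64_t last_hc_ms;    /* monotonic time of the last healthcheck */
    uint64_t interval_ms;   /* configured time interval */
};

/* Record a healthcheck received at time now_ms. */
void hc_record(struct hc_state *s, uint64_t now_ms)
{
    assert(s != NULL);
    s->last_hc_ms = now_ms;
}

/* True when more than interval_ms has passed since the last healthcheck. */
bool hc_missed(const struct hc_state *s, uint64_t now_ms)
{
    assert(s != NULL);
    return now_ms >= s->last_hc_ms &&
           now_ms - s->last_hc_ms > s->interval_ms;
}
```

       The drawback of this model is visible in the sketch: the application
       must decide on its own how often to report in.
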
       A more useful mechanism for healthchecking is event driven
       healthchecking.  Because this model is directed by the SAM server, it
       isn't necessary to guess or add timers to the active process to signal
       that a healthcheck operation is successful.  To use event driven
       healthchecking, the sam_hc_callback_register(3) function should be
       executed.
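
       The request and reply flow of event driven healthchecking can be
       sketched with two pipes: the supervisor writes a request, the active
       process runs its callback and answers only when it is healthy, and
       the supervisor waits for the answer with a timeout.  This is an
       illustration under assumptions, not SAM's wire protocol; every name
       below is hypothetical, including the convention that the callback
       returns 0 when healthy.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Callback convention assumed for this sketch: return 0 when healthy. */
typedef int (*hc_callback_fn)(void);

/* Supervisor side: ask the active process for a healthcheck. */
int hc_request(int req_fd)
{
    char c = 'Q';
    return write(req_fd, &c, 1) == 1 ? 0 : -1;
}

/*
 * Active process side: consume one request and run the callback.
 * A reply is written only when the callback reports a healthy state.
 */
int hc_serve_once(int req_fd, int rsp_fd, hc_callback_fn cb)
{
    char c;

    assert(cb != NULL);
    if (read(req_fd, &c, 1) != 1)
        return -1;
    if (cb() != 0)
        return 0;       /* unhealthy: stay silent, supervisor times out */
    c = 'A';
    return write(rsp_fd, &c, 1) == 1 ? 0 : -1;
}

/* Supervisor side: 1 if a reply arrived in time, 0 on timeout, -1 on error. */
int hc_wait_reply(int rsp_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = rsp_fd, .events = POLLIN };
    char c;
    int n = poll(&pfd, 1, timeout_ms);

    if (n <= 0)
        return n;       /* 0 means timeout, -1 means poll() failed */
    return read(rsp_fd, &c, 1) == 1 ? 1 : -1;
}

/* Hypothetical callbacks used to try the sketch. */
int hc_healthy(void) { return 0; }
int hc_stuck(void)   { return 1; }
```

       A timeout from hc_wait_reply() is the point at which a supervisor
       would start its recovery action.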

Quorum integration

       SAM has special policies (SAM_RECOVERY_POLICY_QUIT and
       SAM_RECOVERY_POLICY_RESTART) for integration with the quorum service.
       These policies change SAM behaviour in two respects:

              ·  A call to sam_start(3) blocks until corosync becomes
                 quorate.

              ·  The user-selected recovery action is taken immediately
                 after loss of quorum.

Storing user data

       Sometimes there is a need to store data that survives between
       instances.  In such a case one can use files, databases, ... or the
       much simpler in-memory solution provided by the sam_data_store(3),
       sam_data_restore(3) and sam_data_getsize(3) functions.

Confdb integration

       SAM has a policy flag used for confdb system integration
       (SAM_RECOVERY_POLICY_CONFDB).  If a process is registered with this
       flag, a new confdb object PROCESS_NAME:PID is created with the
       following keys:

              ·  recovery - will be quit or restart depending on policy

              ·  poll_period - period of health checking in milliseconds

              ·  last_updated - timestamp (in nanoseconds) of the last health
                 check

              ·  state - state of the process (one of registered, started,
                 failed, waiting for quorum)

       The object is automatically deleted if the process exits with health
       checking stopped.

       Confdb integration with the corosync watchdog can be used in an
       implicit or an explicit way.

       The implicit way is achieved by setting the recovery policy to QUIT
       and letting the process exit with health checking started.  If this
       happens, the object is not deleted and the corosync watchdog will take
       the required action.

       The explicit way is useful for situations in which the developer can
       deal with some non-fatal failure of the application.  This mode is
       achieved by setting the policy to RESTART and using SAM the same way
       as without confdb integration.  If a real failure needs to be reported
       (for example too many restarts in total or per second), it is possible
       to call sam_mark_failed(3) and let the corosync watchdog take the
       required action.

BUGS

SEE ALSO

       sam_initialize(3), sam_data_getsize(3), sam_data_restore(3),
       sam_data_store(3), sam_finalize(3), sam_mark_failed(3), sam_start(3),
       sam_stop(3), sam_register(3), sam_warn_signal_set(3), sam_hc_send(3),
       sam_hc_callback_register(3)

corosync Man Page                 21/05/2010                   SAM_OVERVIEW(8)