SAM_OVERVIEW(8) Corosync Cluster Engine Programmer's Manual SAM_OVERVIEW(8)



sam_overview - Overview of the Simple Availability Manager


The SAM library provides a tool to check the health of an application.
The main purpose of SAM is to restart a local process when it fails to
respond to a healthcheck request within a configured time interval.


During sam_initialize(3), a duplicate copy of the process is created
using the fork(2) system call. This duplicate process copy contains
the logic for executing the SAM server. The SAM server is responsible
for requesting healthchecks from the active process, and for
controlling the lifecycle of the active process when it fails. If the
active process fails to respond to the healthcheck request sent by the
SAM server, it will be sent a user-configurable signal (default
SIGTERM) to request shutdown of the application. After a configured
time interval, the process will be forcibly killed by being sent a
SIGKILL signal. Once the active process terminates, the SAM server
will create a new active process.
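
The warn-then-kill sequence described above can be sketched with plain
POSIX calls. This is not libsam code and does not show how the SAM
server is actually implemented; the helper names spawn_idle_child and
stop_with_escalation, the 10 ms polling step and the return convention
are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for an unresponsive active process: a child that
 * ignores `ignored_sig` and then idles. The ignore disposition is set
 * before fork() so the child inherits it without a startup race. */
static pid_t spawn_idle_child(int ignored_sig)
{
    void (*previous)(int) = signal(ignored_sig, SIG_IGN);
    pid_t pid;

    if (previous == SIG_ERR)
        return -1;
    pid = fork();
    if (pid == 0) {
        for (;;)
            pause();                   /* child: idle until killed */
    }
    signal(ignored_sig, previous);     /* parent: restore disposition */
    return pid;                        /* child pid, or -1 on failure */
}

/* Send `warn_signal`, wait up to `grace_ms` for the child to go away,
 * then fall back to SIGKILL. Returns the signal that ended the child,
 * 0 if it exited normally, or -1 on error. */
static int stop_with_escalation(pid_t pid, int warn_signal, int grace_ms)
{
    const struct timespec step = { 0, 10 * 1000 * 1000 };  /* 10 ms */
    int status;

    assert(pid > 0);
    if (kill(pid, warn_signal) != 0)
        return -1;

    for (int waited = 0; waited < grace_ms; waited += 10) {
        pid_t done = waitpid(pid, &status, WNOHANG);
        if (done == pid)
            return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
        if (done < 0)
            return -1;
        nanosleep(&step, NULL);
    }

    /* Grace period expired: force termination. */
    if (kill(pid, SIGKILL) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A child that ignores the warning signal is ended by SIGKILL once the
grace period runs out, while a child that leaves the warning signal at
its default disposition is ended by the warning signal itself.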


The Simple Availability Manager is meant to be used in conjunction with
the cpg service. Used together, they make it possible to restart a cpg
process that fails healthchecking during operation.


The main features of SAM include:


· A configurable recovery policy.

· A configurable time interval for health check operations.

· A notification via signal before recovery action is taken.

· A mechanism to indicate to the application the number of
times an active process has been created by the SAM server.

· Both application driven health checking and event driven
health checking.


The SAM library is initialized by sam_initialize(3). sam_initialize(3)
may only be called once per process. Calling it more than once has
undefined results and is neither recommended nor tested.


A user-configurable signal (default SIGTERM) is sent to the application
when a recovery action is planned. The application can use the
signal(3) system call to monitor for this signal.
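
One way an application might watch for the warning signal is sketched
below, using sigaction(2) in place of signal(3). Only POSIX calls are
used. The names recovery_pending, on_warning and watch_warning_signal
are hypothetical, and the signal number passed in is assumed to be the
one configured as the SAM warning signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set to 1 by the handler; the main loop can poll it and start an
 * orderly shutdown before the forced kill arrives. */
static volatile sig_atomic_t recovery_pending = 0;

/* Handler: only sets a flag, which is async-signal-safe. */
static void on_warning(int signo)
{
    (void)signo;
    recovery_pending = 1;
}

/* Install the handler for `warn_signal`. Returns 0 on success and -1
 * on failure, as sigaction(2) does. */
static int watch_warning_signal(int warn_signal)
{
    struct sigaction sa;

    assert(warn_signal > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_warning;
    sigemptyset(&sa.sa_mask);
    return sigaction(warn_signal, &sa, NULL);
}
```

Keeping the handler down to a single flag assignment leaves the actual
cleanup to normal program flow, where any function may be called
safely.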


There are no special constraints on which SAM APIs may be called in a
warning callback. After time_interval expires, a SIGKILL signal is
sent to the active process to force its termination.


The active process is registered with SAM by calling sam_register(3).
This function should only be called once per process. After a
recovery action is taken, the new active process begins execution
at the next line of code in the user process after sam_register(3).


Two types of healthchecking are available to the user. The first model
is one where the user application healthchecks during its normal
operation. It is never requested to healthcheck, and if the active
process doesn't healthcheck within the time interval, the process will
be restarted.


A more useful mechanism for healthchecking is event driven
healthchecking. Because this model is directed by the SAM server, it
isn't necessary to guess or add timers to the active process to signal
that a healthcheck operation is successful. To use event driven
healthchecking, the sam_hc_callback_register(3) function should be
called.
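
The request/reply shape of event driven healthchecking can be
illustrated over a local socket pair. This sketch does not reproduce
the libsam protocol or its API. The one-byte messages, the helper names
and the convention that a callback returns 0 when healthy are all
assumptions made for the example.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical callback used in the example: always reports healthy. */
static int always_healthy(void)
{
    return 0;
}

/* Server side: send one request and wait up to `timeout_ms` for a
 * reply. Returns 1 if a reply arrived, 0 on timeout, -1 on error. */
static int request_healthcheck(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char request = 'H';
    char reply;
    int ready;

    if (write(fd, &request, 1) != 1)
        return -1;
    ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;                      /* no reply in time */
    return read(fd, &reply, 1) == 1 ? 1 : -1;
}

/* Active-process side: consume one pending request, run the callback,
 * and reply only if it reports healthy. Returns 1 if a reply was sent,
 * 0 if the callback reported a problem, -1 on error. */
static int answer_healthcheck(int fd, int (*callback)(void))
{
    char request;
    char reply = 'K';

    assert(callback != NULL);
    if (read(fd, &request, 1) != 1)
        return -1;
    if (callback() != 0)
        return 0;                      /* unhealthy: stay silent */
    return write(fd, &reply, 1) == 1 ? 1 : -1;
}
```

Because the server decides when to ask and how long to wait, the
active process needs no timer of its own. It only has to answer when
asked.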


SAM has special policies (SAM_RECOVERY_POLICY_QUORUM_QUIT and
SAM_RECOVERY_POLICY_QUORUM_RESTART) for integration with the quorum
service. These policies change SAM behaviour in two respects.

· A call to sam_start(3) blocks until corosync becomes quorate.

· The user-selected recovery action is taken immediately after
loss of quorum.


Sometimes there is a need to store data that survives between
instances. In such cases one can use files or databases, or the much
simpler in-memory solution provided by the sam_data_store(3),
sam_data_restore(3) and sam_data_getsize(3) functions.


SAM has a policy flag used for confdb system integration
(SAM_RECOVERY_POLICY_CONFDB). If a process is registered with this
flag, a new confdb object PROCESS_NAME:PID is created with the
following keys:

· recovery - will be quit or restart depending on policy

· poll_period - period of health checking in milliseconds

· last_updated - timestamp (in nanoseconds) of the last health
check

· state - state of the process (one of registered, started,
failed, waiting for quorum)
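
A tool that inspects these objects needs the PROCESS_NAME:PID name and
the state values. The sketch below builds the name and maps the state
strings listed above to an enum. The enum, the helper names and the
assumption that the stored strings match the list exactly are
illustrative only, and no confdb calls are made.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* States named in the list above; the enum itself is hypothetical. */
enum proc_state {
    STATE_REGISTERED,
    STATE_STARTED,
    STATE_FAILED,
    STATE_WAITING_FOR_QUORUM,
    STATE_COUNT
};

static const char *const state_names[STATE_COUNT] = {
    "registered", "started", "failed", "waiting for quorum"
};

/* Write "PROCESS_NAME:PID" into buf. Returns 0 on success, or -1 if
 * the result would not fit in `size` bytes. */
static int format_object_name(char *buf, size_t size,
                              const char *process_name, pid_t pid)
{
    int n;

    assert(buf != NULL && process_name != NULL);
    n = snprintf(buf, size, "%s:%ld", process_name, (long)pid);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* Map a state string to its enum value, or -1 if it is not known. */
static int parse_state(const char *value)
{
    assert(value != NULL);
    for (int i = 0; i < STATE_COUNT; i++) {
        if (strcmp(value, state_names[i]) == 0)
            return i;
    }
    return -1;
}
```

For a process named mydaemon with PID 1234, format_object_name produces
the string mydaemon:1234.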


The object is automatically deleted if the process exits with health
checking stopped.


Confdb integration with the corosync watchdog can be used in an
implicit or an explicit way.


The implicit way is achieved by setting the recovery policy to QUIT and
letting the process exit with health checking started. If this
happens, the object is not deleted and the corosync watchdog will take
the required action.


The explicit way is useful for situations where the developer can deal
with some non-fatal failures of the application. This mode is achieved
by setting the policy to RESTART and using SAM the same way as without
confdb integration. If a real failure needs to be reported (for
example too many restarts in total or per second), it's possible to
call sam_mark_failed(3) and let the corosync watchdog take the required
action.
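
The decision of when a restart loop counts as a real failure is left to
the application. A minimal sketch of such a check follows, assuming the
application keeps a list of its own start timestamps across instances.
The struct, the function name and the limits are hypothetical. The
point where sam_mark_failed(3) would be called is only indicated in a
comment.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical limits chosen by the application. */
struct restart_limits {
    size_t max_total;        /* most starts tolerated overall */
    size_t max_per_window;   /* most starts tolerated inside the window */
    double window_sec;       /* length of the window in seconds */
};

/* Return 1 if the start history exceeds either limit, else 0. The
 * caller would then call sam_mark_failed(3) and exit. */
static int should_mark_failed(const struct restart_limits *lim,
                              const time_t *starts, size_t n_starts,
                              time_t now)
{
    size_t recent = 0;

    assert(lim != NULL);
    assert(starts != NULL || n_starts == 0);

    if (n_starts > lim->max_total)
        return 1;

    /* Count the starts that fall inside the trailing window. */
    for (size_t i = 0; i < n_starts; i++) {
        if (difftime(now, starts[i]) < lim->window_sec)
            recent++;
    }
    return recent > lim->max_per_window;
}
```

With limits of 5 starts in total and 2 per 10 seconds, three starts
spread over several minutes pass the check, while three starts within
five seconds do not.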


sam_initialize(3), sam_data_getsize(3), sam_data_restore(3),
sam_data_store(3), sam_finalize(3), sam_mark_failed(3), sam_start(3),
sam_stop(3), sam_register(3), sam_warn_signal_set(3), sam_hc_send(3),
sam_hc_callback_register(3)



corosync Man Page 21/05/2010 SAM_OVERVIEW(8)