rgmanager(8)              Red Hat Cluster Suite              rgmanager(8)

rgmanager - Resource Group (Cluster Service) Manager Daemon

rgmanager handles management of user-defined cluster services (also
known as resource groups). This includes handling user requests such
as service start, service disable, service relocate, and service
restart. The service manager daemon also restarts and relocates
services in the event of failures.

The service manager is spawned by an init script after the cluster
infrastructure has been started, and only functions when the cluster is
quorate and locks are working.

During initialization, the service manager runs scripts which ensure
that all services are clear to be started. After that, it determines
which services need to be started and starts them.

When an event is received, members which are no longer online have
their services taken away from them. Whenever fencing is available,
this event should occur only after the member has been fenced.

When a cluster member determines that it is no longer in the cluster
quorum, the service manager stops all services and waits for a new
quorum to form.


Rgmanager is configured via cluster.conf. With the exception of
logging, all of rgmanager's configuration resides within the <rm> tag.
The general parameters for rgmanager are as follows:

central_processing - Enable central processing mode (requires a
cluster-wide shutdown and restart of rgmanager). This alternative mode
of handling failures externalizes most of rgmanager's features into a
user-editable script. This mode is disabled by default.

status_poll_interval - The amount of time, in seconds, rgmanager waits
between resource tree scans for status checks. Decreasing this value
may improve rgmanager's ability to detect failures in services, but at
the cost of decreased performance and increased system utilization.
The default is 10 seconds.

status_child_max - Maximum number of status check threads (default =
5). Changing this value is not recommended. It simply controls how
many instances of clustat queries may be outstanding on a single node
at any given time.

transition_throttling - The amount of time the event processing thread
stays alive after the last event has been processed. The default is 5
seconds. Changing this value is not recommended.

log_level - DEPRECATED; DO NOT USE. Controls log level filtering to
syslog. Default is 5; valid values range from 0-7. See cluster.conf(5)
for the current method to configure logging.

log_facility - DEPRECATED; DO NOT USE. Controls the facility used when
sending messages to syslog. Default is "daemon". See cluster.conf(5)
for the current method to configure logging.

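As an illustration of how the defaults above combine with explicit
settings, the following Python sketch reads the <rm> tag from a
cluster.conf fragment and fills in the documented defaults for any
attribute that is absent. It is not part of rgmanager; the function
name rm_settings and the sample fragment are hypothetical, and only
the three numeric parameters described above are handled.

```python
import xml.etree.ElementTree as ET

# Defaults as documented above; an attribute missing from <rm> falls
# back to these values.
RM_DEFAULTS = {
    "status_poll_interval": 10,   # seconds between status scans
    "status_child_max": 5,        # status check threads
    "transition_throttling": 5,   # seconds the event thread lingers
}


def rm_settings(cluster_conf_xml):
    """Return the effective numeric <rm> settings for a cluster.conf string."""
    root = ET.fromstring(cluster_conf_xml)
    # Accept either a whole <cluster> document or a bare <rm> element.
    rm = root if root.tag == "rm" else root.find("rm")
    settings = dict(RM_DEFAULTS)
    if rm is not None:
        for name in RM_DEFAULTS:
            if name in rm.attrib:
                settings[name] = int(rm.attrib[name])
    return settings


# Hypothetical sample: only status_poll_interval is overridden.
SAMPLE = """
<cluster name="example" config_version="1">
  <rm status_poll_interval="30"/>
</cluster>
"""

if __name__ == "__main__":
    print(rm_settings(SAMPLE))
```

A helper like this can be handy for checking which values are in
effect before deciding whether to override one of them.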

Resource agents define the resource classes rgmanager can manage.
Rgmanager follows the Open Cluster Framework Resource Agent API v1.0
(draft) standard, with the following two notable exceptions:

* Rgmanager does not call monitor; it only calls status
* Rgmanager looks for resource agents in /usr/share/cluster

Rgmanager uses the metadata from resource agents to determine what
parameters to look for in cluster.conf for each resource type. Viewing
the resource agent metadata is the best way to understand all the
various resource agent parameters.


A service or resource group is a collection of resources defined in
cluster.conf for rgmanager's use. Resource groups are also called
resource trees.

A resource group is the atomic unit of failover in rgmanager. That is,
even though rgmanager calls out to various resource agents individually
in order to start or stop various resources, everything in the resource
group is always moved around together in the event of a relocation or
failover.


Rgmanager supports only two startup policies, both selected with the
autostart parameter:

autostart - if set to 1 (the default), the service is started when a
quorum forms. If set to 0, the service is not automatically started.

Startup Policy Configuration:
  <rm>
    <service name="service1" autostart="[0|1]" .../>
    ...
  </rm>


Rgmanager supports three recovery policies for services; the policy is
configured by the recovery parameter in the service definition.

restart - attempt to restart the resource group in place in the event
of one or more failures of individual resources. This can be further
tuned with the max_restarts and restart_expire_time parameters, which
define a tolerance for the number of service restarts over the given
amount of time.

relocate - move the resource group to another host in the cluster
instead of restarting on the same host.

disable - do not try to recover the resource group. Instead, just
place it into the disabled state.

Recovery Configuration:
  <rm>
    <service name="service1" recovery="[restart|relocate|disable]" .../>
    ...
  </rm>

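The restart tolerance can be pictured as a sliding window. The Python
sketch below is one plausible reading of max_restarts and
restart_expire_time, not rgmanager's actual accounting: it assumes
that restarts older than restart_expire_time seconds are forgotten,
and that once the tolerance is used up the service is moved elsewhere
instead of being restarted in place. The class name RestartTracker and
the returned strings are hypothetical.

```python
from collections import deque


class RestartTracker:
    """Sliding-window model of a restart tolerance (illustrative only)."""

    def __init__(self, max_restarts, restart_expire_time):
        self.max_restarts = max_restarts
        self.restart_expire_time = restart_expire_time  # seconds
        self.history = deque()  # timestamps of remembered restarts

    def on_failure(self, now):
        """Decide what to do about a failure observed at time `now`."""
        # Forget restarts that have aged out of the window.
        while self.history and now - self.history[0] >= self.restart_expire_time:
            self.history.popleft()
        if len(self.history) < self.max_restarts:
            self.history.append(now)
            return "restart"
        # Assumption: tolerance exhausted, so stop restarting in place.
        return "relocate"


if __name__ == "__main__":
    tracker = RestartTracker(max_restarts=2, restart_expire_time=60)
    for t in (0, 10, 20, 70):
        print(t, tracker.on_failure(t))
```

With a tolerance of 2 restarts per 60 seconds, failures at t=0 and
t=10 are restarted in place, a third at t=20 exceeds the tolerance,
and by t=70 the earlier restarts have expired.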

A failover domain is an ordered subset of members to which a service
may be bound. The following list describes how the different
configuration options affect the behavior of a failover domain:

preferred node or preferred member : The preferred node was the member
designated to run a given service if the member is online. This
behavior can be emulated by specifying an unordered, unrestricted
failover domain of exactly one member.

restricted domain : Services bound to the domain may only run on
cluster members which are also members of the failover domain. If no
members of the failover domain are available, the service is placed in
the stopped state.

unrestricted domain : Services bound to this domain may run on all
cluster members, but will run on a member of the domain whenever one is
available. This means that if a service is running outside of the
domain and a member of the domain comes online, the service will
migrate to that member.

ordered domain : The order specified in the configuration dictates the
order of preference of members within the domain. The highest-ranking
member of the domain will run the service whenever it is online. This
means that if member A has a higher rank than member B, and A
transitions from offline to online while the service is running on B,
the service will migrate to A.

unordered domain : Members of the domain have no order of preference;
any member may run the service. Even in an unordered domain, however,
services will always migrate to members of their failover domain
whenever possible.

nofailback : Enabling this option for an ordered failover domain will
prevent automated fail-back after a more-preferred node rejoins the
cluster. Consequently, nofailback requires an ordered domain in order
to be meaningful. When nofailback is used, the following two behaviors
should be noted:
  * If a subset of cluster nodes forms a quorum, the node with the
    highest priority in the failover domain is selected to run a
    service bound to the domain. After this point, a higher priority
    member joining the cluster will not trigger a relocation.
  * When a service is running outside of its unrestricted failover
    domain and a cluster member boots which is a part of the service's
    failover domain, the service will relocate to that member. That is,
    nofailback does not prevent transitions from outside of a failover
    domain to inside a failover domain. After this point, a higher
    priority member joining the cluster will not trigger a relocation.

Ordering, restriction, and nofailback are flags and may be combined in
almost any way (e.g., ordered+restricted, unordered+unrestricted,
etc.). These combinations affect both where services start after
initial quorum formation and which cluster members will take over
services in the event that the service has failed.

Failover Domain Configuration:
  <rm>
    <failoverdomains>
      <failoverdomain name="NAME" ordered="[0|1]" restricted="[0|1]"
                      nofailback="[0|1]" >
        <failoverdomainnode name="node1" priority="[1..100]" />
        ...
      </failoverdomain>
    </failoverdomains>
    ...
  </rm>

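The interaction of the ordered, restricted, and nofailback flags can
be sketched as a small selection function. This Python model follows
the semantics described above and is not taken from rgmanager's
source. The function name pick_target is hypothetical, and two
details are assumptions: a smaller priority number is treated as more
preferred, and where any member may run the service the current owner
is kept to avoid a needless move.

```python
def pick_target(online, domain=None, ordered=False, restricted=False,
                nofailback=False, owner=None):
    """Return the member that should run the service, or None to stop it.

    online: set of online member names
    domain: dict of member name -> priority, or None for no domain
    owner:  current owner, or None if the service is not running
    """
    def any_of(nodes):
        # Any node will do: keep the current owner if it qualifies,
        # otherwise pick the first name so the result is deterministic.
        if owner in nodes:
            return owner
        return min(nodes) if nodes else None

    if not domain:
        return any_of(online)

    members = {name for name in domain if name in online}
    if not members:
        # No domain member is available: a restricted service stops,
        # an unrestricted one may run anywhere.
        return None if restricted else any_of(online)

    if not ordered:
        return any_of(members)

    if nofailback and owner in members:
        # Already inside the domain: do not fail back to a better node.
        return owner

    # Assumption: smaller priority number means more preferred.
    return min(members, key=lambda name: (domain[name], name))


if __name__ == "__main__":
    dom = {"node1": 1, "node2": 2}
    everyone = {"node1", "node2", "node3"}
    print(pick_target(everyone, dom, ordered=True, owner="node2"))
    print(pick_target(everyone, dom, ordered=True, nofailback=True,
                      owner="node2"))
```

In the first call the service fails back to node1; in the second,
nofailback keeps it on node2. A service running on node3 (outside the
domain) would still move into the domain in both cases.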

These are how the basic user-initiated service operations (via
clusvcadm) work.

enable - start the service, optionally on a preferred target and
optionally according to failover domain rules. In the absence of
either, the local host where clusvcadm is run will start the service.
If the original start fails, the service behaves as though a relocate
operation was requested (see below). If the operation succeeds, the
service is placed in the started state.

disable - stop the service and place it into the disabled state. This
is the only permissible operation when a service is in the failed
state.

relocate - move the service to another node. Optionally, the
administrator may specify a preferred node to receive the service, but
the inability of the service to run on that host (e.g. if the service
fails to start or the host is offline) does not prevent relocation, and
another node is chosen. Rgmanager attempts to start the service on
every permissible node in the cluster. If no permissible target node in
the cluster successfully starts the service, the relocation fails and
rgmanager attempts to restart the service on the original owner. If
the original owner cannot restart the service, the service is placed in
the stopped state.

stop - stop the service and place it into the stopped state.

migrate - migrate the virtual machine to another node. The
administrator must specify a target node. Depending on the failure, a
failed migration may leave the virtual machine in the failed state or
in the started state on the original owner.

freeze - freeze the service or virtual machine in place and prevent
status checks from occurring. Administrators may do this in order to
perform maintenance on one or more parts of a given service without
having rgmanager interfere. It is very important that the
administrator unfreezes the service once maintenance is complete, as a
frozen service will not fail over. Freezing a service does NOT affect
its operational state. For example, it does not 'pause' virtual
machines or suspend them to disk.

unfreeze - unfreeze (thaw) the service or virtual machine. This
command makes rgmanager perform status checks on the service again.

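The relocate operation described above amounts to a fallback chain:
preferred node first, then every other permissible node, then the
original owner, and finally the stopped state. The Python sketch
below models only that ordering. The function name relocate and the
try_start callback are hypothetical stand-ins for whatever actually
starts the service on a node.

```python
def relocate(try_start, owner, permissible, preferred=None):
    """Model of the relocate fallback chain (illustrative only).

    try_start:   callable taking a node name, True if the start succeeds
    owner:       node currently running the service
    permissible: nodes allowed to run the service
    preferred:   optional node to try first
    Returns a (state, node) tuple.
    """
    targets = [node for node in permissible if node != owner]
    if preferred in targets:
        # A preferred node is tried first but is never mandatory.
        targets.remove(preferred)
        targets.insert(0, preferred)

    for node in targets:
        if try_start(node):
            return ("started", node)

    # Relocation failed everywhere: fall back to the original owner.
    if try_start(owner):
        return ("started", owner)
    return ("stopped", None)


if __name__ == "__main__":
    healthy = {"node3"}
    print(relocate(lambda node: node in healthy, "node1",
                   ["node1", "node2", "node3"], preferred="node2"))
```

Here node2 is preferred but cannot start the service, so the sketch
moves on and reports the service as started on node3.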

These are the most common service states.

disabled - The service will remain in the disabled state until either
an administrator re-enables the service or the cluster loses quorum
(when the cluster regains quorum, the autostart parameter is
evaluated). An administrator may enable the service from this state.

failed - The service is presumed dead. A service is placed into this
state whenever a resource's stop operation fails. After a service is
placed into this state, the administrator must verify that there are
no allocated resources (mounted file systems, etc.) prior to issuing a
disable request. The only operation which can take place when a
service has entered this state is a disable.

stopped - When in the stopped state, the service will be evaluated for
starting after the next service or node transition. This is considered
a temporary state. An administrator may disable or enable the service
from this state.

recovering - The cluster is trying to recover the service. An
administrator may disable the service to prevent recovery if desired.

started - If a service status check fails, rgmanager recovers the
service according to the service recovery policy. If the host running
the service fails, rgmanager recovers it following failover domain and
exclusive service rules. An administrator may relocate, stop, disable,
and (with virtual machines) migrate the service from this state.

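The administrator operations named for each state above can be
collected into a lookup table. The Python sketch below does only
that: it covers just the operations this section lists per state
(freeze and unfreeze are not listed per state, so they are left out),
and the names ALLOWED_OPERATIONS and is_permitted are hypothetical.

```python
# Operations an administrator may issue from each state, as listed above.
ALLOWED_OPERATIONS = {
    "disabled": {"enable"},
    "failed": {"disable"},
    "stopped": {"enable", "disable"},
    "recovering": {"disable"},
    "started": {"relocate", "stop", "disable", "migrate"},
}


def is_permitted(state, operation, is_vm=False):
    """Return True if `operation` is listed for `state` in the table."""
    if operation == "migrate" and not is_vm:
        # Migration applies to virtual machines only.
        return False
    return operation in ALLOWED_OPERATIONS.get(state, set())


if __name__ == "__main__":
    print(is_permitted("failed", "enable"))
    print(is_permitted("started", "migrate", is_vm=True))
```

The first call reports that a failed service cannot be enabled
directly (it must be disabled first); the second reports that a
started virtual machine may be migrated.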

Apart from what is noted in the VM resource agent, rgmanager provides a
few convenience features when dealing with virtual machines.
  * it will use live migration when transferring a virtual machine to a
    more-preferred host in the cluster as a consequence of failover
    domain operation
  * it will search the other instances of rgmanager in the cluster in
    the case that a user accidentally moves a virtual machine using
    other management tools
  * unlike services, adding a virtual machine to rgmanager's
    configuration will not cause the virtual machine to be restarted
  * removing a virtual machine from rgmanager's configuration will
    leave the virtual machine running.


-f     Run in the foreground (do not fork).

-d     Enable debug-level logging.

-q     Disable DBus signals which are normally sent when services
       change state.

-w     Disable internal process monitoring (for debugging).

-N     Do not perform stop-before-start. Combined with the -Z flag to
       clusvcadm, this can be used to allow rgmanager to be upgraded
       without stopping a given user service or set of services.

-C     Explicitly disable or enable CPG-based locking. The default is
       to enable this when RRP is turned on (which requires a cluster
       outage). This option MUST be the same on all hosts in the
       cluster and must only be enabled or disabled with all instances
       of rgmanager turned off.


http://sources.redhat.com/cluster/wiki/RGManager

clusvcadm(8), cluster.conf(5), cpglockd(8)

Jul 2010                                                     rgmanager(8)