rgmanager(8)

rgmanager(8)                 Red Hat Cluster Suite                rgmanager(8)

NAME

       rgmanager - Resource Group (Cluster Service) Manager Daemon

DESCRIPTION

       rgmanager handles management of user-defined cluster services (also
       known as resource groups).  This includes handling user requests such
       as service start, service disable, service relocate, and service
       restart.  The service manager daemon also handles restarting and
       relocating services in the event of failures.

HOW IT WORKS

       The service manager is spawned by an init script after the cluster
       infrastructure has been started and only functions when the cluster is
       quorate and locks are working.

       During initialization, the service manager runs scripts which ensure
       that all services are clear to be started.  After that, it determines
       which services need to be started and starts them.

       When a membership event is received, members which are no longer online
       have their services taken away from them.  Whenever fencing is
       available, this event should occur only after the member has been
       fenced.

       When a cluster member determines that it is no longer in the cluster
       quorum, the service manager stops all services and waits for a new
       quorum to form.

CONFIGURATION

       Rgmanager is configured via cluster.conf.  With the exception of
       logging, all of rgmanager's configuration resides within the <rm> tag.
       The general parameters for rgmanager are as follows:

       central_processing - Enable central processing mode (requires
       cluster-wide shut down and restart of rgmanager).  This alternative
       mode of handling failures externalizes most of rgmanager's features
       into a user-editable script.  This mode is disabled by default.

       status_poll_interval - This defines the amount of time, in seconds,
       rgmanager waits between resource tree scans for status checks.
       Decreasing this value may improve rgmanager's ability to detect
       failures in services, but at a cost of decreased performance and
       increased system utilization.  The default is 10 seconds.

       status_child_max - Maximum number of status check threads (default =
       5).  It is not recommended that this ever be changed.  This simply
       controls how many instances of clustat queries may be outstanding on a
       single node at any given time.

       transition_throttling - This is the amount of time the event processing
       thread stays alive after the last event has been processed.  The
       default is 5 seconds.  It is not recommended that this ever be changed.

       log_level - DEPRECATED; DO NOT USE.  Controls log level filtering to
       syslog.  Default is 5; valid values range from 0-7.  See
       cluster.conf(5) for the current method to configure logging.

       log_facility - DEPRECATED; DO NOT USE.  Controls the facility used
       when sending messages to syslog.  Default is "daemon".  See
       cluster.conf(5) for the current method to configure logging.
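
       As an illustration of the defaults listed above, the following Python
       sketch reads the general parameters from the <rm> tag and falls back
       to the documented defaults when an attribute is absent.  It is not
       part of rgmanager.  The function name read_rm_settings, the sample
       cluster name, and the choice to represent the disabled
       central_processing default as 0 are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Defaults as documented above. central_processing is "disabled by default";
# encoding that as 0 is an assumption of this sketch.
RM_DEFAULTS = {
    "central_processing": 0,
    "status_poll_interval": 10,
    "status_child_max": 5,
    "transition_throttling": 5,
}


def read_rm_settings(cluster_conf_xml):
    """Return the general <rm> parameters, filling in documented defaults."""
    root = ET.fromstring(cluster_conf_xml.strip())
    # <rm> normally sits under <cluster>; a bare <rm> element is accepted too.
    rm = root if root.tag == "rm" else root.find("rm")
    settings = dict(RM_DEFAULTS)
    if rm is None:
        return settings
    for name in RM_DEFAULTS:
        value = rm.get(name)
        if value is not None:
            settings[name] = int(value)
    return settings


# Hypothetical configuration fragment used only for this example.
EXAMPLE = """
<cluster name="example" config_version="1">
  <rm status_poll_interval="30"/>
</cluster>
"""

settings = read_rm_settings(EXAMPLE)
print(settings)
```

       Only status_poll_interval is overridden in the sample fragment, so the
       remaining parameters keep the defaults given in this section.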

RESOURCE AGENTS

       Resource agents define resource classes rgmanager can manage.
       Rgmanager follows the Open Cluster Framework Resource Agent API v1.0
       (draft) standard, with the following two notable exceptions:

        * Rgmanager does not call monitor; it only calls status
        * Rgmanager looks for resource agents in /usr/share/cluster

       Rgmanager uses the metadata from resource agents to determine what
       parameters to look for in cluster.conf for each resource type.
       Viewing the resource agent metadata is the best way to understand all
       the various resource agent parameters.

SERVICES / RESOURCE GROUPS

       A service or resource group is a collection of resources defined in
       cluster.conf for rgmanager's use.  Resource groups are also called
       resource trees.

       A resource group is the atomic unit of failover in rgmanager.  That is,
       even though rgmanager calls out to various resource agents individually
       in order to start or stop various resources, everything in the resource
       group is always moved around together in the event of a relocation or
       failover.

STARTUP POLICIES

       Rgmanager supports only two startup policies, selected by the autostart
       parameter in the service definition.

       autostart - if set to 1 (the default), the service is started when a
       quorum forms.  If set to 0, the service is not automatically started.

       Startup Policy Configuration:
        <rm>
          <service name="service1" autostart="[0|1]" .../>
           ...
        </rm>

RECOVERY POLICIES

       Rgmanager supports three recovery policies for services; this is
       configured by the recovery parameter in the service definition.

       restart - means to attempt to restart the resource group in place in
       the event of one or more failures of individual resources.  This can
       further be augmented by the max_restarts and restart_expire_time
       parameters, which define a tolerance for the number of service
       restarts over the given amount of time.

       relocate - means to move the resource group to another host in the
       cluster instead of restarting on the same host.

       disable - means to not try to recover the resource group.  Instead,
       just place it into the disabled state.

       Recovery Configuration:
        <rm>
          <service name="service1" recovery="[restart|relocate|disable]" .../>
           ...
        </rm>
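
       The restart tolerance described above can be pictured as a sliding
       time window.  The Python sketch below is one possible reading of
       max_restarts and restart_expire_time, not rgmanager's actual code.
       The function name restart_allowed, the treatment of a max_restarts
       value of 0 as "no limit", and the exact boundary comparisons are
       assumptions of this example; what happens once the tolerance is
       exceeded is not specified in this section.

```python
def restart_allowed(restart_times, now, max_restarts, restart_expire_time):
    """Return True if one more in-place restart fits within the tolerance.

    restart_times:       timestamps (seconds) of earlier restarts
    now:                 current time (seconds)
    max_restarts:        restarts tolerated inside the window
    restart_expire_time: seconds after which a restart is forgotten
    """
    if max_restarts <= 0:
        # Assumption: 0 means no limit was configured.
        return True
    # Only restarts that have not yet expired count against the tolerance.
    recent = [t for t in restart_times if now - t < restart_expire_time]
    return len(recent) < max_restarts


# Two restarts in the last 300 seconds, three tolerated: restart in place.
print(restart_allowed([100, 160], now=200, max_restarts=3,
                      restart_expire_time=300))
# Three recent restarts already used up the tolerance.
print(restart_allowed([100, 160, 190], now=200, max_restarts=3,
                      restart_expire_time=300))
```

       When the function returns False, a caller would have to choose another
       recovery action instead of restarting the service in place.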

FAILOVER DOMAINS

       A failover domain is an ordered subset of members to which a service
       may be bound.  The following terms describe how the different
       configuration options affect the behavior of a failover domain:

       preferred node or preferred member : The preferred node was the member
       designated to run a given service if the member is online.  This
       behavior can be emulated by specifying an unordered, unrestricted
       failover domain of exactly one member.

       restricted domain : Services bound to the domain may only run on
       cluster members which are also members of the failover domain.  If no
       members of the failover domain are available, the service is placed in
       the stopped state.

       unrestricted domain : Services bound to this domain may run on all
       cluster members, but will run on a member of the domain whenever one is
       available.  This means that if a service is running outside of the
       domain and a member of the domain comes online, the service will
       migrate to that member.

       ordered domain : The order specified in the configuration dictates the
       order of preference of members within the domain.  The highest-ranking
       member of the domain will run the service whenever it is online.  This
       means that if member A has a higher rank than member B and the service
       is running on B, the service will migrate to A when A transitions from
       offline to online.

       unordered domain : Members of the domain have no order of preference;
       any member may run the service.  Even in an unordered domain, however,
       services will always migrate to members of their failover domain
       whenever possible.

       nofailback : Enabling this option for an ordered failover domain will
       prevent automated fail-back after a more-preferred node rejoins the
       cluster.  Consequently, nofailback requires an ordered domain in order
       to be meaningful.  When nofailback is used, the following two behaviors
       should be noted:
        * If a subset of cluster nodes forms a quorum, the node with the
        highest priority in the failover domain is selected to run a service
        bound to the domain.  After this point, a higher priority member
        joining the cluster will not trigger a relocation.
        * When a service is running outside of its unrestricted failover
        domain and a cluster member boots which is a part of the service's
        failover domain, the service will relocate to that member.  That is,
        nofailback does not prevent transitions from outside of a failover
        domain to inside a failover domain.  After this point, a higher
        priority member joining the cluster will not trigger a relocation.

       Ordering, restriction, and nofailback are flags and may be combined in
       almost any way (i.e., ordered+restricted, unordered+unrestricted, etc.).
       These combinations affect both where services start after initial
       quorum formation and which cluster members will take over services in
       the event that the service has failed.

       Failover Domain Configuration:
        <rm>
          <failoverdomains>
            <failoverdomain name="NAME" ordered="[0|1]" restricted="[0|1]"
            nofailback="[0|1]">
              <failoverdomainnode name="node1" priority="[1..100]" />
               ...
            </failoverdomain>
          </failoverdomains>
           ...
        </rm>
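
       The way the ordered, restricted, and nofailback flags combine can be
       sketched as a small selection function.  The Python below is an
       illustration of the semantics described in this section, not
       rgmanager's implementation.  The function name pick_owner, the
       convention that priority 1 is the most preferred, and the alphabetical
       tie-break among otherwise equal members are assumptions of this
       example.

```python
def pick_owner(online, domain, ordered=False, restricted=False,
               nofailback=False, current=None):
    """Pick the member that should run a service, or None for 'stopped'.

    online:  set of member names currently online
    domain:  dict of domain member name -> priority (assumed: 1 is best)
    current: member running the service now, if any
    """
    candidates = [m for m in domain if m in online]
    if not candidates:
        if restricted:
            return None  # no domain member available: service is stopped
        # Unrestricted: any online member may run the service.
        if current in online:
            return current
        return min(online) if online else None
    if current in candidates:
        if not ordered:
            return current  # unordered: no preference inside the domain
        if nofailback:
            return current  # no fail-back to a more-preferred member
    if ordered:
        return min(candidates, key=lambda m: (domain[m], m))
    return min(candidates)


domain = {"nodeA": 1, "nodeB": 2}
# Ordered domain: nodeA comes back online while the service runs on nodeB.
print(pick_owner({"nodeA", "nodeB"}, domain, ordered=True, current="nodeB"))
# Same situation with nofailback set: the service stays on nodeB.
print(pick_owner({"nodeA", "nodeB"}, domain, ordered=True, nofailback=True,
                 current="nodeB"))
```

       In the sketch, a service running outside its domain always moves to an
       online domain member, which matches the note that nofailback does not
       prevent transitions from outside a failover domain to inside it.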

SERVICE OPERATIONS

       This section describes how the basic user-initiated service operations
       (via clusvcadm) work.

       enable - start the service, optionally on a preferred target and
       optionally according to failover domain rules.  In the absence of
       either, the local host where clusvcadm is run will start the service.
       If the original start fails, the service behaves as though a relocate
       operation was requested (see below).  If the operation succeeds, the
       service is placed in the started state.

       disable - stop the service and place it into the disabled state.  This
       is the only permissible operation when a service is in the failed
       state.

       relocate - move the service to another node.  Optionally, the
       administrator may specify a preferred node to receive the service, but
       the inability of the service to run on that host (e.g. if the service
       fails to start or the host is offline) does not prevent relocation, and
       another node is chosen.  Rgmanager attempts to start the service on
       every permissible node in the cluster.  If no permissible target node
       in the cluster successfully starts the service, the relocation fails
       and rgmanager attempts to restart the service on the original owner.
       If the original owner cannot restart the service, the service is
       placed in the stopped state.

       stop - stop the service and place it into the stopped state.

       migrate - migrate the virtual machine to another node.  The
       administrator must specify a target node.  Depending on the failure, a
       failed migration may leave the virtual machine in the failed state or
       in the started state on the original owner.

       freeze - freeze the service or virtual machine in place and prevent
       status checks from occurring.  Administrators may do this in order to
       perform maintenance on one or more parts of a given service without
       having rgmanager interfere.  It is very important that the
       administrator unfreezes the service once maintenance is complete, as a
       frozen service will not fail over.  Freezing a service does NOT affect
       its operational state.  For example, it does not 'pause' virtual
       machines or suspend them to disk.

       unfreeze - unfreeze (thaw) the service or virtual machine.  This
       command makes rgmanager perform status checks on the service again.
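
       The relocate operation described above amounts to a try-each-target
       loop with a fallback to the original owner.  The following Python
       sketch shows that control flow only; it is not rgmanager's code.  The
       function name relocate, the try_start callback, and the node names are
       hypothetical.

```python
def relocate(owner, permitted, try_start, preferred=None):
    """Return (new_owner, state) after a relocate request.

    owner:     member currently running the service
    permitted: members allowed to run the service, in the order to try them
    try_start: callable(member) -> bool, True if the service started there
    preferred: optional member requested by the administrator
    """
    targets = [m for m in permitted if m != owner]
    if preferred in targets:
        # The administrator's choice is tried first but is only a preference.
        targets.remove(preferred)
        targets.insert(0, preferred)
    for member in targets:
        if try_start(member):
            return member, "started"
    # No target could start the service: retry on the original owner.
    if try_start(owner):
        return owner, "started"
    return None, "stopped"


# Hypothetical cluster where only node3 is able to start the service.
result = relocate("node1", ["node2", "node3"],
                  try_start=lambda member: member == "node3",
                  preferred="node2")
print(result)
```

       Here the preferred node2 fails to start the service, so the loop moves
       on to node3 rather than aborting the relocation.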

SERVICE STATES

       These are the most common service states.

       disabled - The service will remain in the disabled state until either
       an administrator re-enables the service or the cluster loses quorum
       (when the cluster regains quorum, the autostart parameter is
       evaluated).  An administrator may enable the service from this state.

       failed - The service is presumed dead.  A service is placed into this
       state whenever a resource's stop operation fails.  After a service is
       placed into this state, the administrator must verify that there are
       no allocated resources (mounted file systems, etc.) prior to issuing a
       disable request.  The only operation which can take place when a
       service has entered this state is a disable.

       stopped - When in the stopped state, the service will be evaluated for
       starting after the next service or node transition.  This is considered
       a temporary state.  An administrator may disable or enable the service
       from this state.

       recovering - The cluster is trying to recover the service.  An
       administrator may disable the service to prevent recovery if desired.

       started - If a service status check fails, rgmanager recovers the
       service according to the service recovery policy.  If the host running
       the service fails, rgmanager recovers it following failover domain and
       exclusive service rules.  An administrator may relocate, stop, disable,
       and (with virtual machines) migrate the service from this state.
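
       The administrator operations named for each state above can be
       collected into a lookup table.  The Python sketch below encodes only
       what this section states explicitly; it is an illustration, and
       clusvcadm may accept operations (such as freeze) that are not listed
       per state here.  The names ALLOWED_OPERATIONS and operation_permitted
       are hypothetical.

```python
# State -> operations this section says an administrator may request.
ALLOWED_OPERATIONS = {
    "disabled": {"enable"},
    "failed": {"disable"},
    "stopped": {"enable", "disable"},
    "recovering": {"disable"},
    "started": {"relocate", "stop", "disable", "migrate"},
}


def operation_permitted(state, operation, is_vm=False):
    """Return True if the operation is listed for the given state."""
    if operation == "migrate" and not is_vm:
        return False  # migrate applies to virtual machines only
    return operation in ALLOWED_OPERATIONS.get(state, set())


# A failed service can only be disabled.
print(operation_permitted("failed", "disable"))
print(operation_permitted("failed", "enable"))
```

       Unknown states fall through to an empty set, so the function answers
       False rather than guessing.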

VIRTUAL MACHINE FEATURES

       Apart from what is noted in the VM resource agent, rgmanager provides a
       few convenience features when dealing with virtual machines.
        * it will use live migration when transferring a virtual machine to a
        more-preferred host in the cluster as a consequence of failover domain
        operation
        * it will search the other instances of rgmanager in the cluster in
        the case that a user accidentally moves a virtual machine using other
        management tools
        * unlike services, adding a virtual machine to rgmanager's
        configuration will not cause the virtual machine to be restarted
        * removing a virtual machine from rgmanager's configuration will leave
        the virtual machine running.

COMMAND LINE OPTIONS

       -f     Run in the foreground (do not fork).

       -d     Enable debug-level logging.

       -q     Disable DBus signals which are normally sent when services
              change state.

       -w     Disable internal process monitoring (for debugging).

       -N     Do not perform stop-before-start.  Combined with the -Z flag to
              clusvcadm, this can be used to allow rgmanager to be upgraded
              without stopping a given user service or set of services.