SANLOCK(8)                  System Manager's Manual                 SANLOCK(8)


NAME
       sanlock - shared storage lock manager

SYNOPSIS
       sanlock [COMMAND] [ACTION] ...

DESCRIPTION
       The sanlock daemon manages leases for applications running on a
       cluster of hosts with shared storage.  All lease management and
       coordination is done through reading and writing blocks on the shared
       storage.  Two types of leases are used, each based on a different
       algorithm:

       "delta leases" are slow to acquire and require regular i/o to shared
       storage.  A delta lease exists in a single sector of storage.
       Acquiring a delta lease involves reads and writes to that sector
       separated by specific delays.  Once acquired, a lease must be renewed
       by regularly updating a timestamp in the sector.  sanlock uses a
       delta lease internally to hold a lease on a host_id.  host_id leases
       prevent two hosts from using the same host_id and provide basic host
       liveness information based on the renewals.

       "paxos leases" are generally fast to acquire and sanlock makes them
       available to applications as general purpose resource leases.  A
       paxos lease exists in 1MB of shared storage (8MB for 4k sectors).
       Acquiring a paxos lease involves reads and writes to max_hosts (2000)
       sectors in a specific sequence specified by the Disk Paxos algorithm.
       paxos leases use host_id's internally to indicate the owner of the
       lease, and the algorithm fails if different hosts use the same
       host_id.  So, delta leases provide the unique host_id's used in paxos
       leases.  paxos leases also refer to delta leases to check whether a
       host_id is alive.
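
       The space arithmetic above can be sketched as follows (illustrative
       code, not part of sanlock): the size of the lease area for 2000 delta
       leases, or equivalently for one paxos lease, follows from the sector
       size and the 2000-host maximum, rounded up to the alignment boundary
       described under "offsets" below.

       ```c
       #include <assert.h>
       #include <stdio.h>

       #define MAX_HOSTS 2000

       /* One sector per host_id; the area is rounded up to the required
          alignment: 1MB for 512-byte sectors, 8MB for 4096-byte sectors. */
       static unsigned long lease_area_bytes(unsigned int sector_size)
       {
           unsigned long bytes = (unsigned long)sector_size * MAX_HOSTS;
           unsigned long align = (sector_size == 4096) ? 8UL * 1024 * 1024
                                                       : 1UL * 1024 * 1024;
           /* round up to the alignment boundary */
           return ((bytes + align - 1) / align) * align;
       }

       int main(void)
       {
           assert(lease_area_bytes(512)  == 1048576);   /* 1MB */
           assert(lease_area_bytes(4096) == 8388608);   /* 8MB */
           printf("512B sectors: %lu bytes, 4096B sectors: %lu bytes\n",
                  lease_area_bytes(512), lease_area_bytes(4096));
           return 0;
       }
       ```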

       Before sanlock can be used, the user must assign each host a host_id,
       which is a number between 1 and 2000.  Two hosts should not be given
       the same host_id (even though delta leases attempt to detect this
       mistake).

       sanlock views a pool of storage as a "lockspace".  Each distinct pool
       of storage, e.g. from different sources, would typically be defined
       as a separate lockspace, with a unique lockspace name.

       Part of this storage space must be reserved and initialized for
       sanlock to store delta leases.  Each host that wants to use the
       lockspace must first acquire a delta lease on its host_id number
       within the lockspace.  (See the add_lockspace action/api.)  The space
       required for 2000 delta leases in the lockspace (for 2000 possible
       host_id's) is 1MB (8MB for 4k sectors).  (This is the same size
       required for a single paxos lease.)

       More storage space must be reserved and initialized for paxos leases,
       according to the needs of the applications using sanlock.

       The following steps illustrate these concepts using the command line.
       Applications may choose to do these same steps through libsanlock.

       1. Create storage pools and reserve and initialize host_id leases on
       two different LUNs on a SAN: /dev/sdb, /dev/sdc
       # vgcreate pool1 /dev/sdb
       # vgcreate pool2 /dev/sdc
       # lvcreate -n hostid_leases -L 1MB pool1
       # lvcreate -n hostid_leases -L 1MB pool2
       # sanlock direct init -s LS1:0:/dev/pool1/hostid_leases:0
       # sanlock direct init -s LS2:0:/dev/pool2/hostid_leases:0

       2. Start the sanlock daemon on each host
       # sanlock daemon

       3. Add each lockspace to be used
       host1:
       # sanlock client add_lockspace -s LS1:1:/dev/pool1/hostid_leases:0
       # sanlock client add_lockspace -s LS2:1:/dev/pool2/hostid_leases:0
       host2:
       # sanlock client add_lockspace -s LS1:2:/dev/pool1/hostid_leases:0
       # sanlock client add_lockspace -s LS2:2:/dev/pool2/hostid_leases:0

       4. Applications can now reserve/initialize space for resource leases,
       and then acquire the leases as they need to access the resources.

       The resource leases that are created, and how they are used, depend
       on the application.  For example, say application A, running on host1
       and host2, needs to synchronize access to data it stores on
       /dev/pool1/Adata.  A could use a resource lease as follows:

       5. Reserve and initialize a single resource lease for Adata
       # lvcreate -n Adata_lease -L 1MB pool1
       # sanlock direct init -r LS1:Adata:/dev/pool1/Adata_lease:0

       6. Acquire the lease from the app using libsanlock (see
       sanlock_register, sanlock_acquire).  If the app is already running as
       pid 123, and has registered with the sanlock daemon, the lease can be
       added for it manually.
       # sanlock client acquire -r LS1:Adata:/dev/pool1/Adata_lease:0 -p 123

       offsets

       Offsets must be 1MB aligned for disks with 512 byte sectors, and 8MB
       aligned for disks with 4096 byte sectors.

       Offsets may be used to place multiple leases on the same device,
       rather than using a separate device with offset 0 for each lease as
       shown in the examples above.  For example, these commands from above:
       # sanlock direct init -s LS1:0:/dev/pool1/hostid_leases:0
       # sanlock direct init -r LS1:Adata:/dev/pool1/Adata_lease:0
       could be replaced by:
       # sanlock direct init -s LS1:0:/dev/pool1/leases:0
       # sanlock direct init -r LS1:Adata:/dev/pool1/leases:1048576
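
       The offset arithmetic can be sketched as follows (illustrative helper
       names, not part of sanlock): lease slot i on a device begins at
       i times the alignment size.

       ```c
       #include <assert.h>

       /* Byte offset of lease slot i on a device: leases are spaced at the
          alignment size, 1MB for 512-byte sectors, 8MB for 4096-byte. */
       static unsigned long slot_offset(unsigned int i,
                                        unsigned int sector_size)
       {
           unsigned long align = (sector_size == 4096) ? 8UL * 1024 * 1024
                                                       : 1UL * 1024 * 1024;
           return (unsigned long)i * align;
       }

       static int offset_is_aligned(unsigned long offset,
                                    unsigned int sector_size)
       {
           unsigned long align = (sector_size == 4096) ? 8UL * 1024 * 1024
                                                       : 1UL * 1024 * 1024;
           return offset % align == 0;
       }

       int main(void)
       {
           /* slot 1 on a 512-byte-sector device is the 1048576 offset
              used in the example above */
           assert(slot_offset(1, 512) == 1048576);
           assert(offset_is_aligned(1048576, 512));
           assert(!offset_is_aligned(1048576, 4096));
           return 0;
       }
       ```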

       failures

       If a process holding resource leases fails or exits without releasing
       its leases, sanlock will release the leases for it automatically.

       If the sanlock daemon cannot renew a lockspace host_id for a specific
       period of time (usually because storage access is lost), sanlock will
       kill any process holding a resource lease within the lockspace.

       If the sanlock daemon crashes or gets stuck, it will no longer renew
       the expiry time of its per-host_id connections to the wdmd daemon,
       and the watchdog device will reset the host.

       watchdog

       sanlock uses the wdmd(8) daemon to access /dev/watchdog.  A separate
       wdmd connection is maintained with wdmd for each host_id being
       renewed.  Each host_id connection has an expiry time some seconds in
       the future.  After each successful host_id renewal, sanlock updates
       the associated expiry time in wdmd.  If wdmd finds any connection
       expired, it will not pet /dev/watchdog.  After enough successive
       expired/failed checks, the watchdog device will fire and reset the
       host.
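
       The expiry check can be sketched as follows (hypothetical types and
       names, not the actual wdmd implementation): petting stops as soon as
       any registered connection's expiry time has passed.

       ```c
       #include <assert.h>

       /* Hypothetical connection record: wdmd tracks an expiry time per
          registered connection and only pets /dev/watchdog while no
          connection has expired. */
       struct wdmd_conn { long expiry; };

       static int should_pet_watchdog(const struct wdmd_conn *conns,
                                      int n, long now)
       {
           for (int i = 0; i < n; i++)
               if (conns[i].expiry <= now)
                   return 0;   /* an expired connection blocks petting */
           return 1;
       }

       int main(void)
       {
           struct wdmd_conn c[2] = { { .expiry = 100 }, { .expiry = 200 } };
           assert(should_pet_watchdog(c, 2, 50));   /* renewals current */
           assert(!should_pet_watchdog(c, 2, 150)); /* c[0] expired */
           c[0].expiry = 300;                       /* successful renewal */
           assert(should_pet_watchdog(c, 2, 150));
           return 0;
       }
       ```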

       After a number of failed attempts to renew a host_id, sanlock kills
       any process using that lockspace.  Once all those processes have
       exited, sanlock will unregister the associated wdmd connection.  wdmd
       will no longer find the expired connection, and will resume petting
       /dev/watchdog (assuming it finds no other failed/expired tests).  If
       the killed processes do not exit quickly enough, the expired wdmd
       connection will not be unregistered, and /dev/watchdog will reset the
       host.

       Using these known timeout values and the time of the last host_id
       renewal, sanlock on another host can calculate when the failed host
       will have been reset by its watchdog (or will have killed all the
       necessary processes).
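
       That calculation can be sketched as follows (illustrative code; the
       HOST_DEAD_SECONDS constant below is a hypothetical placeholder, not
       sanlock's actual timeout): a remote host is certain the failed host
       has been reset once the agreed-upon timeout has elapsed since its
       last observed renewal.

       ```c
       #include <assert.h>

       /* Hypothetical fixed timeout all hosts agree on (placeholder). */
       #define HOST_DEAD_SECONDS 140

       static long host_dead_time(long last_renewal)
       {
           return last_renewal + HOST_DEAD_SECONDS;
       }

       static int host_is_safely_dead(long last_renewal, long now)
       {
           return now >= host_dead_time(last_renewal);
       }

       int main(void)
       {
           assert(!host_is_safely_dead(1000, 1100)); /* too soon */
           assert(host_is_safely_dead(1000, 1140));  /* watchdog fired */
           return 0;
       }
       ```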

       If the sanlock daemon itself fails, crashes, or gets stuck, it will
       no longer update the expiry time for its host_id connections to
       wdmd, which will also lead to the watchdog resetting the host.

       safety

       sanlock leases are meant to guarantee that two processes on two
       hosts are never allowed to hold the same resource lease at once.  If
       they were, the resource being protected could be corrupted.  There
       are three levels of protection built into sanlock itself:

       1. The paxos leases and delta leases themselves.

       2. If the leases cannot function because storage access is lost
       (host_id's cannot be renewed), the sanlock daemon kills any pids
       using resource leases in the lockspace.

       3. If the pids do not exit after being killed, or if the sanlock
       daemon fails, the watchdog device resets the host.

OPTIONS
       COMMAND can be one of three primary top level choices:

       sanlock daemon    start the daemon
       sanlock client    send a request to the daemon (default command if
                         none is given)
       sanlock direct    access storage directly (no coordination with the
                         daemon)

       sanlock daemon [options]

       -D        no fork; print all logging to stderr

       -Q 0|1    quiet error messages for common lock contention

       -R 0|1    renewal debugging; log debug info for each renewal

       -L pri    write logging at this priority level and up to the logfile
                 (-1 none)

       -S pri    write logging at this priority level and up to syslog
                 (-1 none)

       -U uid    user id

       -G gid    group id

       -t num    max worker threads

       -g sec    seconds for graceful recovery

       -w 0|1    use the watchdog through wdmd

       -h 0|1    use high priority (RR) scheduling

       -l num    use mlockall (0 none, 1 current, 2 current and future)

       -a 0|1    use async i/o

       sanlock client action [options]

       sanlock client status

       Print the processes, lockspaces, and resources being managed by the
       sanlock daemon.  Add -D to show extra internal daemon status for
       debugging.  Add -o p to show resources by pid, or -o s to show
       resources by lockspace.

       sanlock client host_status

       Print the state of the host_id delta leases read during the last
       renewal.  The state of all lockspaces is shown (use -s to select
       one).  Add -D to show extra internal daemon status for debugging.

       sanlock client gets

       Print the lockspaces being managed by the sanlock daemon.  The
       LOCKSPACE string will be followed by ADD or REM if the lockspace is
       currently being added or removed.  Add -h 1 to also show the hosts
       in each lockspace.

       sanlock client log_dump

       Print the sanlock daemon's internal debug log.

       sanlock client shutdown

       Ask the sanlock daemon to exit.  Without the force option (-f 0),
       the command will be ignored if any lockspaces exist.  With the force
       option (-f 1), any registered processes will be killed, their
       resource leases released, and the lockspaces removed.

       sanlock client init -s LOCKSPACE

       Tell the sanlock daemon to initialize a lockspace on disk.  The -o
       option can be used to specify the io timeout to be written in the
       host_id leases.  (Also see sanlock direct init.)

       sanlock client init -r RESOURCE

       Tell the sanlock daemon to initialize a resource lease on disk.
       (Also see sanlock direct init.)

       sanlock client read -s LOCKSPACE

       Tell the sanlock daemon to read a lockspace from disk.  Only the
       LOCKSPACE path and offset are required.  If host_id is zero, the
       first record at the offset (host_id 1) is used.  The complete
       LOCKSPACE and io timeout are printed.

       sanlock client read -r RESOURCE

       Tell the sanlock daemon to read a resource lease from disk.  Only
       the RESOURCE path and offset are required.  The complete RESOURCE is
       printed.  (Also see sanlock direct read_leader.)

       sanlock client align -s LOCKSPACE

       Tell the sanlock daemon to report the required lease alignment for a
       storage path.  Only the path is used from the LOCKSPACE argument.

       sanlock client add_lockspace -s LOCKSPACE

       Tell the sanlock daemon to acquire the specified host_id in the
       lockspace.  This will allow resources to be acquired in the
       lockspace.  The -o option can be used to specify the io timeout of
       the acquiring host, which will be written in the host_id lease.

       sanlock client inq_lockspace -s LOCKSPACE

       Inquire about the state of the lockspace in the sanlock daemon:
       whether it is being added, being removed, or is joined.

       sanlock client rem_lockspace -s LOCKSPACE

       Tell the sanlock daemon to release the specified host_id in the
       lockspace.  Any processes holding resource leases in this lockspace
       will be killed, and their resource leases will not be released.

       sanlock client command -r RESOURCE -c path args

       Register with the sanlock daemon, acquire the specified resource
       lease, and exec the command at path with args.  When the command
       exits, the sanlock daemon will release the lease.  -c must be the
       final option.

       sanlock client acquire -r RESOURCE -p pid
       sanlock client release -r RESOURCE -p pid

       Tell the sanlock daemon to acquire or release the specified resource
       lease for the given pid.  The pid must be registered with the
       sanlock daemon.  acquire can optionally take a versioned RESOURCE
       string RESOURCE:lver, where lver is the version of the lease that
       must be acquired, or the command fails.

       sanlock client inquire -p pid

       Print the resource leases held by the given pid.  The format is a
       versioned RESOURCE string "RESOURCE:lver" where lver is the version
       of the lease held.

       sanlock client request -r RESOURCE -f force_mode

       Request that the owner of a resource do something specified by
       force_mode.  A versioned RESOURCE:lver string must be used, with a
       greater version than is presently held.  A zero lver and force_mode
       clears the request.

       sanlock client examine -r RESOURCE

       Examine the request record for the currently held resource lease and
       carry out the action specified by the requested force_mode.

       sanlock client examine -s LOCKSPACE

       Examine the requests for all resource leases currently held in the
       named lockspace.  Only lockspace_name is used from the LOCKSPACE
       argument.

       sanlock direct action [options]

       -a 0|1    use async i/o

       -o sec    io timeout in seconds

       sanlock direct init -s LOCKSPACE
       sanlock direct init -r RESOURCE

       Initialize storage for 2000 host_id (delta) leases for the given
       lockspace, or initialize storage for one resource (paxos) lease.
       Both options require 1MB of space (8MB for 4k sectors).  The host_id
       in the LOCKSPACE string is not relevant to initialization, so the
       value is ignored.  (The default of 2000 host_id's can be changed for
       special cases using the -n num_hosts and -m max_hosts options.)
       With -s, the -o option specifies the io timeout to be written in the
       host_id leases.

       sanlock direct read_leader -s LOCKSPACE
       sanlock direct read_leader -r RESOURCE

       Read a leader record from disk and print its fields.  The leader
       record is the single sector of a delta lease, or the first sector of
       a paxos lease.

       sanlock direct dump path[:offset]

       Read disk sectors and print the leader records for delta or paxos
       leases.  Add -f 1 to also print the request record values for paxos
       leases, and the host_id's set in delta lease bitmaps.

   LOCKSPACE option string
       -s lockspace_name:host_id:path:offset

       lockspace_name    name of the lockspace
       host_id           local host identifier in the lockspace
       path              path to storage reserved for leases
       offset            offset on path (bytes)

   RESOURCE option string
       -r lockspace_name:resource_name:path:offset

       lockspace_name    name of the lockspace
       resource_name     name of the resource
       path              path to storage reserved for leases
       offset            offset on path (bytes)

   RESOURCE option string with version
       -r lockspace_name:resource_name:path:offset:lver

       lver              leader version, or SH for a shared lease

   Defaults
       sanlock help shows the default values for the options above.

       sanlock version shows the build version.

USAGE
   Request/Examine
       The first part of making a request for a resource is writing the
       request record of the resource (the sector following the leader
       record).  To make a successful request:

       ·  RESOURCE:lver must be greater than the lver presently held by the
          other host.  This implies that the leader record must be read to
          discover the lver prior to making a request.

       ·  RESOURCE:lver must be greater than or equal to the lver presently
          written to the request record.  Two hosts may write a new request
          at the same time for the same lver, in which case both would
          succeed, but the force_mode from the last write would win.

       ·  The force_mode must be greater than zero.

       ·  To unconditionally clear the request record (set both lver and
          force_mode to 0), make a request with RESOURCE:0 and force_mode 0.
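
       These rules can be sketched as a single predicate (illustrative code,
       not sanlock's implementation):

       ```c
       #include <assert.h>

       /* Decide whether writing a request record is allowed: the new lver
          must exceed the lver currently held, be at least the lver already
          in the request record, and carry a nonzero force_mode; lver 0
          with force_mode 0 unconditionally clears the record. */
       static int request_allowed(unsigned long held_lver,
                                  unsigned long req_record_lver,
                                  unsigned long new_lver,
                                  unsigned int force_mode)
       {
           if (new_lver == 0 && force_mode == 0)
               return 1;                       /* explicit clear */
           if (new_lver <= held_lver)
               return 0;
           if (new_lver < req_record_lver)
               return 0;
           return force_mode > 0;
       }

       int main(void)
       {
           assert(request_allowed(5, 0, 6, 1));  /* exceeds held lver */
           assert(!request_allowed(5, 0, 5, 1)); /* must exceed held */
           assert(!request_allowed(5, 7, 6, 1)); /* below pending req */
           assert(request_allowed(5, 6, 6, 2));  /* equal to pending ok */
           assert(!request_allowed(5, 0, 6, 0)); /* force_mode > 0 */
           assert(request_allowed(5, 6, 0, 0));  /* clear the record */
           return 0;
       }
       ```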

       The owner of the requested resource will not know of the request
       unless it is explicitly told to examine its resources via the
       "examine" api/command, or is otherwise notified.

       The second part of making a request is notifying the resource lease
       owner that it should examine the request records of its resource
       leases.  The notification will cause the lease owner to
       automatically run the equivalent of "sanlock client examine -s
       LOCKSPACE" for the lockspace of the requested resource.

       The notification is made using a bitmap in each host_id delta lease.
       Each bit represents one of the possible host_ids (1-2000).  If host
       A wants to notify host B to examine its resources, A sets the bit in
       its own bitmap that corresponds to the host_id of B.  When B next
       renews its delta lease, it reads the delta leases for all hosts and
       checks each bitmap to see if its own host_id has been set.  It finds
       the bit for its own host_id set in A's bitmap, and examines its
       resource request records.  (The bit remains set in A's bitmap for
       request_finish_seconds.)
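
       The bitmap operations can be sketched as follows (an illustrative
       layout, not the on-disk delta lease format): one bit per possible
       host_id, set by the notifying host and tested by the renewing host.

       ```c
       #include <assert.h>
       #include <string.h>

       #define MAX_HOSTS 2000
       /* Illustrative bitmap: one bit for each host_id 1-2000. */
       struct bitmap { unsigned char bits[MAX_HOSTS / 8]; };

       static void set_host_bit(struct bitmap *b, int host_id)
       {
           int i = host_id - 1;               /* host_ids are 1-based */
           b->bits[i / 8] |= (unsigned char)(1 << (i % 8));
       }

       static int test_host_bit(const struct bitmap *b, int host_id)
       {
           int i = host_id - 1;
           return (b->bits[i / 8] >> (i % 8)) & 1;
       }

       int main(void)
       {
           struct bitmap a;
           memset(&a, 0, sizeof(a));
           set_host_bit(&a, 2);           /* host A notifies host_id 2 */
           assert(test_host_bit(&a, 2));  /* B sees its bit on renewal */
           assert(!test_host_bit(&a, 3)); /* other hosts see nothing */
           return 0;
       }
       ```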

       force_mode determines the action the resource lease owner should
       take:

       1 (FORCE): kill the process holding the resource lease.  When the
       process has exited, the resource lease will be released, and can
       then be acquired by anyone.  The kill signal is SIGKILL (or SIGTERM
       if SIGKILL is restricted).

       2 (GRACEFUL): run the program configured by sanlock_killpath against
       the process holding the resource lease.  If no killpath is defined,
       then FORCE is used.

   Graceful recovery
       When a lockspace host_id cannot be renewed for a specific period of
       time, sanlock enters a recovery mode in which it attempts to
       forcibly release any resource leases in that lockspace.  If the
       leases are not all released within 60 seconds, the watchdog will
       fire, resetting the host.

       The most immediate way of releasing the resource leases in the
       failed lockspace is by sending SIGKILL to all pids holding the
       leases, and automatically releasing the resource leases as the pids
       exit.  After all pids have exited, no resource leases are held in
       the lockspace, the watchdog expiration is removed, and the host
       avoids the watchdog reset.
446
447       A slightly more graceful approach is to send SIGTERM to  a  pid  before
448       escalating  to  SIGKILL.   sanlock does this by sending SIGTERM to each
449       pid, once a second, for the first N  seconds,  before  sending  SIGKILL
450       once a second for the remaining M seconds (N/M can be tuned with the -g
451       daemon option.)
452
453       An even more graceful approach is to configure a program for sanlock to
454       run that will terminate or suspend each pid, and explicitly release the
455       leases it held.  sanlock will run this program for each pid.  It has  N
456       seconds  to  terminate  the pid or explicitly release its leases before
457       sanlock escalates to SIGKILL for the remaining M seconds.
458
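
       The escalation schedule can be sketched as follows (illustrative
       code, not sanlock's implementation; the 20/40 split of the 60-second
       window below is an assumed example, not the default):

       ```c
       #include <assert.h>
       #include <signal.h>

       /* Signal to send at a given second of recovery: SIGTERM for the
          first n seconds, then SIGKILL for the remaining m seconds. */
       static int recovery_signal(int elapsed_sec, int n, int m)
       {
           if (elapsed_sec < n)
               return SIGTERM;
           if (elapsed_sec < n + m)
               return SIGKILL;
           return 0;   /* past the window: the watchdog fires instead */
       }

       int main(void)
       {
           int n = 20, m = 40;     /* assumed split of the 60s window */
           assert(recovery_signal(0, n, m)  == SIGTERM);
           assert(recovery_signal(19, n, m) == SIGTERM);
           assert(recovery_signal(20, n, m) == SIGKILL);
           assert(recovery_signal(59, n, m) == SIGKILL);
           assert(recovery_signal(60, n, m) == 0);
           return 0;
       }
       ```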

SEE ALSO
       wdmd(8)



                                  2011-08-05                        SANLOCK(8)