SANLOCK(8)                  System Manager's Manual                 SANLOCK(8)

NAME

       sanlock - shared storage lock manager

SYNOPSIS

       sanlock [COMMAND] [ACTION] ...

DESCRIPTION

14       sanlock  is  a lock manager built on shared storage.  Hosts with access
15       to the storage can perform locking.   An  application  running  on  the
16       hosts  is  given  a small amount of space on the shared block device or
17       file, and uses sanlock for its  own  application-specific  synchroniza‐
18       tion.   Internally,  the  sanlock  daemon manages locks using two disk-
19       based lease algorithms: delta leases and paxos leases.
20
21
22       · delta leases are slow to acquire and demand  regular  i/o  to  shared
23         storage.   sanlock  only  uses them internally to hold a lease on its
24         "host_id" (an integer host identifier from 1-2000).  They prevent two
25         hosts  from using the same host identifier.  The delta lease renewals
26         also indicate if a host is alive.  ("Light-Weight Leases for Storage-
27         Centric Coordination", Chockler and Malkhi.)
28
29
30       · paxos  leases are fast to acquire and sanlock makes them available to
31         applications as general purpose  resource  leases.   The  disk  paxos
32         algorithm uses host_id's internally to represent different hosts, and
33         the owner of a paxos lease.  delta leases  provide  unique  host_id's
34         for  implementing  paxos  leases, and delta lease renewals serve as a
35         proxy for paxos lease renewal.  ("Disk Paxos", Eli Gafni  and  Leslie
36         Lamport.)
37
38
39       Externally, the sanlock daemon exposes a locking interface through lib‐
40       sanlock in terms of "lockspaces" and "resources".   A  lockspace  is  a
41       locking  context that an application creates for itself on shared stor‐
42       age.  When the application on each host  is  started,  it  "joins"  the
43       lockspace.  It can then create "resources" on the shared storage.  Each
44       resource represents an application-specific  entity.   The  application
45       can acquire and release leases on resources.
46
47       To use sanlock from an application:
48
49
50       · Allocate  shared  storage for an application, e.g. a shared LUN or LV
51         from a SAN, or files from NFS.
52
53
54       · Provide the storage to the application.
55
56
57       · The application  uses  this  storage  with  libsanlock  to  create  a
58         lockspace and resources for itself.
59
60
61       · The application joins the lockspace when it starts.
62
63
64       · The application acquires and releases leases on resources.
65
66
67       How lockspaces and resources translate to delta leases and paxos leases
68       within sanlock:
69
70       Lockspaces
71
72
73       · A lockspace is based on delta leases held  by  each  host  using  the
74         lockspace.
75
76
77       · A  lockspace  is  a series of 2000 delta leases on disk, and requires
78         1MB of storage.
79
80
81       · A lockspace can support up to 2000 concurrent hosts  using  it,  each
82         using a different delta lease.
83
84
85       · Applications  can  i)  create,  ii)  join and iii) leave a lockspace,
86         which corresponds to i) initializing the set of delta leases on disk,
87         ii)  acquiring  one  of the delta leases and iii) releasing the delta
88         lease.
89
90
       · When a lockspace is created, a unique lockspace name and disk
         location are provided by the application.
93
94
95       · When a lockspace is created/initialized, sanlock formats the sequence
96         of 2000 on-disk delta lease structures on  the  file  or  disk,  e.g.
97         /mnt/leasefile (NFS) or /dev/vg/lv (SAN).
98
99
100       · The  2000  individual  delta  leases in a lockspace are identified by
101         number: 1,2,3,...,2000.
102
103
104       · Each delta lease is a 512 byte sector in the 1MB lockspace, offset by
105         its  number,  e.g. delta lease 1 is offset 0, delta lease 2 is offset
106         512, delta lease 2000 is offset 1023488.
107
108
109       · When an application joins a lockspace, it must specify the  lockspace
110         name,  the  lockspace  location  on  shared  disk/file, and the local
111         host's host_id.  sanlock then acquires the delta lease  corresponding
112         to  the  host_id,  e.g. joining the lockspace with host_id 1 acquires
113         delta lease 1.
114
115
       · The terms delta lease, lockspace lease, and host_id lease are used
         interchangeably.
118
119
120       · sanlock  acquires  a delta lease by writing the host's unique name to
121         the delta lease disk sector, reading it back after a delay, and veri‐
122         fying it is the same.
123
124
125       · If  a  unique host name is not specified, sanlock generates a uuid to
126         use as the host's name.  The delta lease algorithm depends  on  hosts
127         using unique names.
128
129
130       · The  application  on  each  host  should  be configured with a unique
131         host_id, where the host_id is an integer 1-2000.
132
133
134       · If hosts are misconfigured and have the same host_id, the delta lease
135         algorithm is designed to detect this conflict, and only one host will
136         be able to acquire the delta lease for that host_id.
137
138
139       · A delta lease ensures that a lockspace host_id is  being  used  by  a
140         single host with the unique name specified in the delta lease.
141
142
143       · Resolving  delta  lease  conflicts  is slow, because the algorithm is
144         based on waiting and watching for some time for other hosts to  write
145         to  the  same  delta  lease sector.  If multiple hosts try to use the
146         same delta lease, the delay is increased substantially.   So,  it  is
147         best  to configure applications to use unique host_id's that will not
148         conflict.
149
150
151       · After sanlock acquires a delta lease, the lease must be renewed until
152         the  application leaves the lockspace (which corresponds to releasing
153         the delta lease on the host_id.)
154
155
156       · sanlock renews delta leases every 20 seconds (by default) by  writing
157         a new timestamp into the delta lease sector.
158
159
160       · When a host acquires a delta lease in a lockspace, it can be referred
161         to as "joining" the lockspace.  Once it has joined the lockspace,  it
162         can use resources associated with the lockspace.
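
       The offset rule for delta leases described above can be sketched as a
       small shell function.  This is only an illustration that assumes 512
       byte sectors; delta_lease_offset is a hypothetical helper name and is
       not part of sanlock.

```shell
# Hypothetical helper (not a sanlock command): print the byte offset of the
# delta lease for a given host_id inside its lockspace, assuming 512 byte
# sectors as in the text above.
delta_lease_offset() {
    host_id=$1
    sector_size=512
    if [ "$host_id" -lt 1 ] || [ "$host_id" -gt 2000 ]; then
        echo "host_id must be in the range 1-2000" >&2
        return 1
    fi
    # Delta lease N occupies sector N-1 of the lockspace.
    echo $(( (host_id - 1) * sector_size ))
}

delta_lease_offset 1      # prints 0
delta_lease_offset 2      # prints 512
delta_lease_offset 2000   # prints 1023488
```

       The printed values match the examples given for delta leases 1, 2 and
       2000.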
163
164
165       Resources
166
167
168       · A  lockspace  is  a  context  for  resources  that  can be locked and
169         unlocked by an application.
170
171
       · sanlock uses paxos leases to implement leases on resources.  The
         terms paxos lease and resource lease are used interchangeably.
174
175
176       · A paxos lease exists on shared storage and requires 1MB of space.  It
177         contains a unique resource name and the name of the lockspace.
178
179
180       · An application assigns its own meaning to a sanlock resource and  the
181         leases  on it.  A sanlock resource could represent some shared object
182         like a file, or some unique role among the hosts.
183
184
185       · Resource leases are associated with a specific lockspace and can only
186         be  used by hosts that have joined that lockspace (they are holding a
187         delta lease on a host_id in that lockspace.)
188
189
190       · An  application  must  keep  track  of  the  disk  locations  of  its
191         lockspaces  and  resources.  sanlock does not maintain any persistent
192         index or directory of lockspaces or resources that have been  created
193         by  applications,  so  applications  need to remember where they have
194         placed their own leases (which files or disks and offsets).
195
196
197       · sanlock does not renew paxos leases  directly  (although  it  could).
198         Instead,  the  renewal of a host's delta lease represents the renewal
199         of all that host's paxos  leases  in  the  associated  lockspace.  In
200         effect,  many  paxos  lease  renewals are factored out into one delta
201         lease renewal.  This reduces i/o when many paxos leases are used.
202
203
204       · The disk paxos algorithm allows multiple  hosts  to  all  attempt  to
205         acquire  the same paxos lease at once, and will produce a single win‐
206         ner/owner of the resource lease.  (Shared resource  leases  are  also
207         possible in addition to the default exclusive leases.)
208
209
210       · The  disk paxos algorithm involves a specific sequence of reading and
211         writing the sectors of the paxos lease disk area.  Each  host  has  a
212         dedicated  512  byte  sector  in  the  paxos lease disk area where it
213         writes its own "ballot", and each host reads the entire disk area  to
214         see the ballots of other hosts.  The first sector of the disk area is
215         the "leader record" that holds the result of the last  paxos  ballot.
216         The winner of the paxos ballot writes the result of the ballot to the
217         leader record (the winner of the ballot  may  have  selected  another
218         contending host as the owner of the paxos lease.)
219
220
221       · After  a paxos lease is acquired, no further i/o is done in the paxos
222         lease disk area.
223
224
225       · Releasing the paxos lease involves writing a single sector  to  clear
226         the current owner in the leader record.
227
228
229       · If  a  host  holding  a paxos lease fails, the disk area of the paxos
230         lease still indicates that the paxos lease is  owned  by  the  failed
231         host.  If another host attempts to acquire the paxos lease, and finds
232         the lease is held by another host_id, it will check the  delta  lease
233         of that host_id.  If the delta lease of the host_id is being renewed,
234         then the paxos lease is owned and cannot be acquired.  If  the  delta
235         lease  of  the  owner's  host_id has expired, then the paxos lease is
236         expired and can be taken (by going  through  the  paxos  lease  algo‐
237         rithm.)
238
239
240       · The  "interaction" or "awareness" between hosts of each other is lim‐
241         ited to the case where they attempt to acquire the same paxos  lease,
242         and need to check if the referenced delta lease has expired or not.
243
244
245       · When  hosts  do  not attempt to lock the same resources concurrently,
246         there is no host interaction or awareness.  The state or  actions  of
247         one host have no effect on others.
248
249
250       · To  speed  up checking delta lease expiration (in the case of a paxos
251         lease conflict), sanlock keeps track of past renewals of other  delta
252         leases in the lockspace.
253
254
255       Expiration
256
257
       · If a host fails to renew its delta lease, e.g. it loses access to
         the storage, its delta lease will eventually expire and another host
         will be able to take over any resource leases held by the host.
         sanlock must ensure that the application on two different hosts is
         not holding and using the same lease concurrently.
263
264
265       · When  sanlock has failed to renew a delta lease for a period of time,
266         it will begin taking measures to stop local processes  (applications)
267         from using any resource leases associated with the expiring lockspace
268         delta lease.  sanlock enters this "recovery mode" well ahead  of  the
269         time  when  another  host  could  take over the locally owned leases.
270         sanlock must have sufficient time to stop all  local  processes  that
271         are using the expiring leases.
272
273
274       · sanlock  uses  three  methods  to stop local processes that are using
275         expiring leases:
276
277         1. Graceful shutdown.  sanlock will  execute  a  "graceful  shutdown"
278         program that the application previously specified for this case.  The
279         shutdown program tells the  application  to  shut  down  because  its
280         leases  are  expiring.   The application must respond by stopping its
281         activities and releasing its leases (or  exit).   If  an  application
282         does  not  specify a graceful shutdown program, sanlock sends SIGTERM
283         to the process instead.  The process must release its leases or  exit
284         in  a  prescribed amount of time (see -g), or sanlock proceeds to the
285         next method of stopping.
286
287         2. Forced shutdown.  sanlock will send SIGKILL to processes using the
288         expiring  leases.   The processes have a fixed amount of time to exit
289         after receiving SIGKILL.  If any do not exit in  this  time,  sanlock
290         will proceed to the next method.
291
292         3.  Host  reset.   sanlock will trigger the host's watchdog device to
293         forcibly reset it.  sanlock  carefully  manages  the  timing  of  the
294         watchdog  device so that it fires shortly before any other host could
295         take over the resource leases held by local processes.
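
       The first two stopping methods follow a common escalation pattern:
       ask the process to stop, wait for a bounded time, then force it.  The
       sketch below illustrates that pattern only.  It is not sanlock's
       implementation, stop_with_escalation is a hypothetical helper name,
       the grace argument merely plays the role that -g plays for the daemon,
       and the third method (host reset through the watchdog) is not
       modelled.

```shell
# Hypothetical helper (not part of sanlock): stop a process by escalation.
#   $1  pid of the process to stop
#   $2  grace period in seconds before SIGKILL is sent
stop_with_escalation() {
    pid=$1
    grace=$2

    # Graceful step: SIGTERM.  If the signal cannot be delivered, the
    # process is already gone and there is nothing left to do.
    kill -TERM "$pid" 2>/dev/null || return 0

    waited=0
    while [ "$waited" -lt "$grace" ]; do
        # kill -0 only tests whether the pid still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        waited=$((waited + 1))
    done

    # Forced step: SIGKILL.  Errors are ignored because the process may
    # have exited between the last check and this call.
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

       A caller would invoke it as "stop_with_escalation PID SECONDS".  An
       exited child process still counts as existing until its parent has
       waited for it, so in that case the helper waits out the full grace
       period.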
296
297
298       Failures
299
300       If a process holding resource leases fails or exits  without  releasing
301       its  leases,  sanlock  will  release  the  leases  for it automatically
302       (unless persistent resource leases were used.)
303
304       If the sanlock daemon cannot renew a lockspace delta lease for  a  spe‐
305       cific  period  of  time  (see Expiration), sanlock will enter "recovery
306       mode" where it attempts to  stop  and/or  kill  any  processes  holding
307       resource  leases  in  the  expiring lockspace.  If the processes do not
308       exit in time, sanlock will force the host to be reset using  the  local
309       watchdog device.
310
311       If  the  sanlock  daemon crashes or hangs, it will not renew the expiry
312       time of the per-lockspace connections it had to the wdmd daemon.   This
313       will  lead to the expiration of the local watchdog device, and the host
314       will be reset.
315
316       Watchdog
317
318       sanlock uses the wdmd(8) daemon to access /dev/watchdog.   wdmd  multi‐
319       plexes  multiple  timeouts  onto  the  single  watchdog timer.  This is
320       required because delta leases for each lockspace are renewed and expire
321       independently.
322
323       sanlock  maintains  a  wdmd  connection  for each lockspace delta lease
324       being renewed.  Each connection has an expiry time for some seconds  in
325       the future.  After each successful delta lease renewal, the expiry time
326       is renewed for the associated wdmd connection.  If wdmd finds any  con‐
327       nection  expired,  it  will  not  renew the /dev/watchdog timer.  Given
328       enough successive failed renewals, the watchdog device  will  fire  and
329       reset  the host.  (Given the multiplexing nature of wdmd, shorter over‐
330       lapping renewal failures from multiple lockspaces could cause  spurious
331       watchdog firing.)
332
333       The direct link between delta lease renewals and watchdog renewals pro‐
334       vides a predictable watchdog firing time based on delta  lease  renewal
335       timestamps  that  are visible from other hosts.  sanlock knows the time
336       the watchdog on another host has fired based on the delta  lease  time.
337       Furthermore,  if the watchdog device on another host fails to fire when
338       it should, the continuation of delta lease renewals from the other host
339       will  make  this  evident  and prevent leases from being taken from the
340       failed host.
341
       If sanlock is able to stop/kill all processes using an expiring
       lockspace, the associated wdmd connection for that lockspace is
       removed.  The expired wdmd connection will no longer block
       /dev/watchdog renewals, and the host should avoid being reset.
346
347       Storage
348
349       On  devices  with 512 byte sectors, lockspaces and resources are 1MB in
350       size.  On devices with 4096 byte sectors, lockspaces and resources  are
351       8MB  in size.  sanlock uses 512 byte sectors when shared files are used
352       in place of shared block devices.  Offsets of leases or resources  must
353       be multiples of 1MB/8MB according to the sector size.
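
       The alignment rule can be checked before initializing a lease.  The
       following is a minimal sketch; lease_alignment and offset_is_aligned
       are hypothetical helper names, and only the two sector sizes named
       above are handled.

```shell
# Hypothetical helpers (not part of sanlock).

# Print the required lease alignment in bytes for a sector size.
lease_alignment() {
    case "$1" in
        512)  echo $(( 1 * 1024 * 1024 )) ;;   # 1MB leases
        4096) echo $(( 8 * 1024 * 1024 )) ;;   # 8MB leases
        *)    echo "unsupported sector size: $1" >&2; return 1 ;;
    esac
}

# Succeed if OFFSET ($1) is a multiple of the alignment for SECTOR_SIZE ($2).
offset_is_aligned() {
    align=$(lease_alignment "$2") || return 1
    [ $(( $1 % align )) -eq 0 ]
}

offset_is_aligned 2097152 512  && echo "2097152 is valid with 512 byte sectors"
offset_is_aligned 1048576 4096 || echo "1048576 is not valid with 4096 byte sectors"
```

       On a running system, "sanlock client align -s LOCKSPACE" reports the
       alignment the daemon requires for a storage path, which is preferable
       to hard-coding the sector size.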
354
355       Using  sanlock  on shared block devices that do host based mirroring or
356       replication is not likely to work correctly.   When  using  sanlock  on
357       shared files, all sanlock io should go to one file server.
358
359       Example
360
361       This  is an example of creating and using lockspaces and resources from
362       the command line.  (Most applications would use sanlock through libsan‐
363       lock rather than through the command line.)
364
365
       1.  Allocate shared storage for sanlock leases.

           This example assumes 512 byte sectors on the device, in which case
           the lockspace needs 1MB and each resource needs 1MB.

           # vgcreate vg /dev/sdb
           # lvcreate -n leases -L 1GB vg


       2.  Start sanlock on all hosts.

           The -w 0 option disables use of the watchdog for testing.

           # sanlock daemon -w 0


       3.  Start a dummy application on all hosts.

           This sanlock command registers with sanlock, then execs the sleep
           command which inherits the registered fd.  The sleep process acts
           as the dummy application.  Because the sleep process is registered
           with sanlock, leases can be acquired for it.

           # sanlock client command -c /bin/sleep 600 &


       4.  Create a lockspace for the application (from one host).

           The lockspace is named "test".

           # sanlock client init -s test:0:/dev/vg/leases:0


       5.  Join the lockspace for the application.

           Use a unique host_id on each host.

           host1:
           # sanlock client add_lockspace -s test:1:/dev/vg/leases:0
           host2:
           # sanlock client add_lockspace -s test:2:/dev/vg/leases:0


       6.  Create two resources for the application (from one host).

           The resources are named "RA" and "RB".  Offsets are used on the
           same device as the lockspace.  Different LVs or files could also be
           used.

           # sanlock client init -r test:RA:/dev/vg/leases:1048576
           # sanlock client init -r test:RB:/dev/vg/leases:2097152


       7.  Acquire resource leases for the application on host1.

           Acquire an exclusive lease (the default) on the first resource, and
           a shared lease (SH) on the second resource.

           # export P=`pidof sleep`
           # sanlock client acquire -r test:RA:/dev/vg/leases:1048576 -p $P
           # sanlock client acquire -r test:RB:/dev/vg/leases:2097152:SH -p $P


       8.  Acquire resource leases for the application on host2.

           Acquiring the exclusive lease on the first resource will fail
           because it is held by host1.  Acquiring the shared lease on the
           second resource will succeed.

           # export P=`pidof sleep`
           # sanlock client acquire -r test:RA:/dev/vg/leases:1048576 -p $P
           # sanlock client acquire -r test:RB:/dev/vg/leases:2097152:SH -p $P


       9.  Release resource leases for the application on both hosts.

           The sleep pid could also be killed, which will result in the
           sanlock daemon releasing its leases when it exits.

           # sanlock client release -r test:RA:/dev/vg/leases:1048576 -p $P
           # sanlock client release -r test:RB:/dev/vg/leases:2097152 -p $P


       10. Leave the lockspace for the application.

           host1:
           # sanlock client rem_lockspace -s test:1:/dev/vg/leases:0
           host2:
           # sanlock client rem_lockspace -s test:2:/dev/vg/leases:0


       11. Stop sanlock on all hosts.

           # sanlock shutdown
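
       When an application needs more than a couple of resources, the
       offsets used in step 6 can be generated instead of typed by hand.  The
       sketch below only prints the "sanlock client init -r" commands and
       does not run them.  print_resource_inits is a hypothetical helper
       name, and the layout is an assumption taken from this example: 512
       byte sectors, with the lockspace at offset 0 and resources following
       at 1MB intervals on the same path.

```shell
# Hypothetical helper (not part of sanlock): print, without executing, the
# init command for each named resource.
#   $1      lockspace name
#   $2      path to the shared device or file
#   $3...   resource names
print_resource_inits() {
    lockspace=$1
    path=$2
    shift 2
    mb=1048576      # size of one lease with 512 byte sectors
    offset=$mb      # offset 0 is assumed to hold the lockspace itself
    for name in "$@"; do
        echo "sanlock client init -r $lockspace:$name:$path:$offset"
        offset=$((offset + mb))
    done
}

print_resource_inits test /dev/vg/leases RA RB
```

       For the names RA and RB this prints the same two commands shown in
       step 6.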
460
461
462

OPTIONS

       COMMAND can be one of three primary top level choices:

       sanlock daemon    start daemon
       sanlock client    send request to daemon (default command if none given)
       sanlock direct    access storage directly (no coordination with daemon)


   Daemon Command
       sanlock daemon [options]

       -D        no fork and print all logging to stderr

       -Q 0|1    quiet error messages for common lock contention

       -R 0|1    renewal debugging, log debug info for each renewal

       -L pri    write logging at priority level and up to logfile (-1 none)

       -S pri    write logging at priority level and up to syslog (-1 none)

       -U uid    user id

       -G gid    group id

       -t num    max worker threads

       -g sec    seconds for graceful recovery

       -w 0|1    use watchdog through wdmd

       -h 0|1    use high priority (RR) scheduling

       -l num    use mlockall (0 none, 1 current, 2 current and future)

       -b sec    seconds a host id bit will remain set in delta lease bitmap

       -e str    local host name used in delta leases
501
502
503
504   Client Command
505       sanlock client action [options]
506
507       sanlock client status
508
509       Print processes, lockspaces, and resources being managed by the sanlock
510       daemon.  Add -D to show extra internal  daemon  status  for  debugging.
511       Add  -o  p  to  show  resources  by  pid,  or -o s to show resources by
512       lockspace.
513
514       sanlock client host_status
515
516       Print state of host_id delta  leases  read  during  the  last  renewal.
517       State  of  all  lockspaces  is shown (use -s to select one).  Add -D to
518       show extra internal daemon status for debugging.
519
520       sanlock client gets
521
522       Print lockspaces being managed by the sanlock  daemon.   The  LOCKSPACE
523       string  will  be  followed  by ADD or REM if the lockspace is currently
524       being added or removed.  Add -h 1 to also show hosts in each lockspace.
525
526       sanlock client renewal -s LOCKSPACE
527
528       Print a history of renewals with timing details.  See the Renewal  his‐
529       tory section below.
530
531       sanlock client log_dump
532
533       Print the sanlock daemon internal debug log.
534
535       sanlock client shutdown
536
537       Ask  the  sanlock daemon to exit.  Without the force option (-f 0), the
538       command will be ignored if any lockspaces exist.  With the force option
539       (-f  1), any registered processes will be killed, their resource leases
540       released, and lockspaces removed.  With the wait  option  (-w  1),  the
541       command  will  wait for a result from the daemon indicating that it has
542       shut down and is exiting, or cannot shut down because lockspaces  exist
543       (command fails).
544
545       sanlock client init -s LOCKSPACE
546
547       Tell  the  sanlock  daemon  to  initialize a lockspace on disk.  The -o
548       option can be used to specify the io  timeout  to  be  written  in  the
549       host_id leases.  (Also see sanlock direct init.)
550
551       sanlock client init -r RESOURCE
552
553       Tell  the sanlock daemon to initialize a resource lease on disk.  (Also
554       see sanlock direct init.)
555
556       sanlock client read -s LOCKSPACE
557
558       Tell the sanlock daemon to  read  a  lockspace  from  disk.   Only  the
559       LOCKSPACE  path and offset are required.  If host_id is zero, the first
560       record at offset (host_id 1) is used.  The complete  LOCKSPACE  and  io
561       timeout are printed.
562
563       sanlock client read -r RESOURCE
564
565       Tell  the  sanlock daemon to read a resource lease from disk.  Only the
566       RESOURCE path and  offset  are  required.   The  complete  RESOURCE  is
567       printed.  (Also see sanlock direct read_leader.)
568
569       sanlock client align -s LOCKSPACE
570
571       Tell  the  sanlock  daemon to report the required lease alignment for a
572       storage path.  Only path is used from the LOCKSPACE argument.
573
574       sanlock client add_lockspace -s LOCKSPACE
575
576       Tell the sanlock  daemon  to  acquire  the  specified  host_id  in  the
577       lockspace.   This will allow resources to be acquired in the lockspace.
578       The -o option can be used to specify the io timeout  of  the  acquiring
579       host, and will be written in the host_id lease.
580
581       sanlock client inq_lockspace -s LOCKSPACE
582
583       Inquire about the state of the lockspace in the sanlock daemon, whether
584       it is being added or removed, or is joined.
585
586       sanlock client rem_lockspace -s LOCKSPACE
587
588       Tell the sanlock  daemon  to  release  the  specified  host_id  in  the
589       lockspace.   Any  processes  holding  resource leases in this lockspace
590       will be killed, and the resource leases not released.
591
592       sanlock client command -r RESOURCE -c path args
593
594       Register with the sanlock daemon, acquire the specified resource lease,
595       and  exec  the  command at path with args.  When the command exits, the
596       sanlock daemon will release the lease.  -c must be the final option.
597
598       sanlock client acquire -r RESOURCE -p pid
599       sanlock client release -r RESOURCE -p pid
600
601       Tell the sanlock daemon to acquire or release  the  specified  resource
602       lease  for  the given pid.  The pid must be registered with the sanlock
603       daemon.  acquire  can  optionally  take  a  versioned  RESOURCE  string
604       RESOURCE:lver,  where  lver  is  the  version of the lease that must be
605       acquired, or fail.
606
607       sanlock client convert -r RESOURCE -p pid
608
609       Tell the sanlock daemon to convert the mode of the  specified  resource
610       lease  for the given pid.  If the existing mode is exclusive (default),
611       the mode of the lease can be converted to shared with RESOURCE:SH.   If
612       the  existing mode is shared, the mode of the lease can be converted to
613       exclusive with RESOURCE (no :SH suffix).
614
615       sanlock client inquire -p pid
616
       Print the resource leases held by the given pid.  The format is a
       versioned RESOURCE string "RESOURCE:lver" where lver is the version of
       the lease held.
620
621       sanlock client request -r RESOURCE -f force_mode
622
623       Request the owner of a resource do something specified  by  force_mode.
624       A  versioned  RESOURCE:lver  string must be used with a greater version
625       than is presently held.  Zero lver and force_mode clears the request.
626
627       sanlock client examine -r RESOURCE
628
629       Examine the request record for the currently held  resource  lease  and
630       carry out the action specified by the requested force_mode.
631
632       sanlock client examine -s LOCKSPACE
633
634       Examine  requests  for  all resource leases currently held in the named
635       lockspace.  Only lockspace_name is used from the LOCKSPACE argument.
636
637       sanlock client set_event -s LOCKSPACE -i host_id -g gen -e num -d num
638
639       Set an event for another host.  When the sanlock daemon next renews its
640       delta  lease  for the lockspace it will: set the bit for the host_id in
641       its bitmap, and set the generation, event and data values  in  its  own
642       delta  lease.   An application that has registered for events from this
643       lockspace on the destination host will get the event that has been  set
644       when  the  destination  sees  the  event  during  its  next delta lease
645       renewal.
646
647       sanlock client set_config -s LOCKSPACE
648
649       Set a configuration value for a lockspace.  Only lockspace_name is used
650       from  the  LOCKSPACE  argument.  The USED flag has the same effect on a
651       lockspace as a process holding a resource lease  that  will  not  exit.
652       The  USED_BY_ORPHANS flag means that an orphan resource lease will have
653       the same effect as the USED.
654       -u 0|1 Set (1) or clear (0) the USED flag.
655       -O 0|1 Set (1) or clear (0) the USED_BY_ORPHANS flag.
656
657
658   Direct Command
659       sanlock direct action [options]
660
661
662       -o sec io timeout in seconds
663
664       sanlock direct init -s LOCKSPACE
665       sanlock direct init -r RESOURCE
666
667       Initialize storage for  2000  host_id  (delta)  leases  for  the  given
668       lockspace,  or initialize storage for one resource (paxos) lease.  Both
669       options require 1MB of space.  The host_id in the LOCKSPACE  string  is
670       not  relevant to initialization, so the value is ignored.  (The default
671       of 2000 host_ids  can  be  changed  for  special  cases  using  the  -n
672       num_hosts  and -m max_hosts options.)  With -s, the -o option specifies
673       the io timeout to be written in the host_id leases.  With -r, the -z  1
674       option  invalidates  the  resource  lease  on disk so it cannot be used
675       until reinitialized normally.
676
677       sanlock direct read_leader -s LOCKSPACE
678       sanlock direct read_leader -r RESOURCE
679
680       Read a leader record from disk and print the fields.  The leader record
681       is  the  single sector of a delta lease, or the first sector of a paxos
682       lease.
683
684       sanlock direct dump path[:offset[:size]]
685
686       Read disk sectors and print leader records for delta or  paxos  leases.
687       Add  -f  1  to  print  the  request record values for paxos leases, and
688       host_ids set in delta lease bitmaps.
689
690
   LOCKSPACE option string
       -s lockspace_name:host_id:path:offset

       lockspace_name    name of lockspace
       host_id           local host identifier in lockspace
       path              path to storage reserved for leases
       offset            offset on path (bytes)


   RESOURCE option string
       -r lockspace_name:resource_name:path:offset

       lockspace_name    name of lockspace
       resource_name     name of resource
       path              path to storage reserved for leases
       offset            offset on path (bytes)


   RESOURCE option string with suffix
       -r lockspace_name:resource_name:path:offset:lver

       lver    leader version

       -r lockspace_name:resource_name:path:offset:SH

       SH      indicates shared mode
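
       Scripts that call sanlock repeatedly may find it convenient to build
       these option strings in one place.  A minimal sketch follows;
       lockspace_string and resource_string are hypothetical helper names,
       and the optional fifth argument of resource_string is passed through
       unchanged as either an lver or SH suffix.

```shell
# Hypothetical helpers (not part of sanlock).

# lockspace_string NAME HOST_ID PATH OFFSET
lockspace_string() {
    echo "$1:$2:$3:$4"
}

# resource_string LOCKSPACE_NAME RESOURCE_NAME PATH OFFSET [SUFFIX]
# SUFFIX is optional and is either a leader version (lver) or SH.
resource_string() {
    base="$1:$2:$3:$4"
    if [ -n "${5:-}" ]; then
        echo "$base:$5"
    else
        echo "$base"
    fi
}

lockspace_string test 1 /dev/vg/leases 0
resource_string test RA /dev/vg/leases 1048576
resource_string test RB /dev/vg/leases 2097152 SH
```

       The three calls print the strings used with -s and -r in the Example
       section.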
717
718
719   Defaults
720       sanlock help shows the default values for the options above.
721
722       sanlock version shows the build version.
723
724

OTHER

726   Request/Examine
727       The first part of making a  request  for  a  resource  is  writing  the
728       request  record  of  the  resource  (the  sector  following  the leader
729       record).  To make a successful request:
730
731       · RESOURCE:lver must be greater than the lver  presently  held  by  the
732         other  host.  This implies the leader record must be read to discover
733         the lver, prior to making a request.
734
735       · RESOURCE:lver must be greater than or equal  to  the  lver  presently
736         written  to the request record.  Two hosts may write a new request at
737         the same time for the same lver, in which case  both  would  succeed,
738         but the force_mode from the last would win.
739
740       · The force_mode must be greater than zero.
741
742       · To  unconditionally  clear  the  request  record  (set  both lver and
743         force_mode to 0), make request with RESOURCE:0 and force_mode 0.
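
       The conditions above can be summarized in a short Python sketch.  The
       function and parameter names are hypothetical; held_lver stands for the
       lver read from the leader record and recorded_lver for the lver
       currently in the request record.  It models only the rules as stated
       here, not the sanlock implementation.

```python
def request_allowed(req_lver: int, force_mode: int,
                    held_lver: int, recorded_lver: int) -> bool:
    """Return True if a request with these values would be accepted."""
    # RESOURCE:0 with force_mode 0 unconditionally clears the request record.
    if req_lver == 0 and force_mode == 0:
        return True
    return (req_lver > held_lver            # newer than the lease currently held
            and req_lver >= recorded_lver   # not older than an existing request
            and force_mode > 0)
```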
744
745
       The owner of the requested resource will not know of the request unless
       it  is  explicitly  told  to  examine  its  resources via the "examine"
       api/command, or otherwise notified.
749
750       The second part of making a request is  notifying  the  resource  lease
751       owner  that  it  should  examine  the  request  records of its resource
752       leases.  The notification will cause the lease owner  to  automatically
753       run  the  equivalent  of  "sanlock client examine -s LOCKSPACE" for the
754       lockspace of the requested resource.
755
       The notification is made using a bitmap in each  host_id  delta  lease.
       Each bit represents one of the possible host_ids (1-2000).  If  host  A
       wants to notify host B to examine its resources, A sets the bit in  its
       own  bitmap  that  corresponds to the host_id of B.  When B next renews
       its delta lease, it reads the delta leases for  all  hosts  and  checks
       each  bitmap  to see if the bit for its own host_id has been  set.   It
       finds the bit for its own host_id set in A's bitmap, and  examines  its
       resource request records.  (The bit remains set in A's bitmap for
       set_bitmap_seconds.)
765
766       force_mode determines the action the resource lease owner should take:
767
768
       · FORCE (1): kill the process holding the  resource  lease.   When  the
         process has exited, the resource lease will be released, and can then
         be acquired by anyone.  The kill signal is  SIGKILL  (or  SIGTERM  if
         SIGKILL is restricted).
773
774
775       · GRACEFUL  (2): run the program configured by sanlock_killpath against
776         the process holding the resource lease.  If no killpath  is  defined,
777         then FORCE is used.
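
       The choice between the two modes, including the fallback when no
       killpath is configured, can be sketched as follows.  The function name
       and the returned strings are hypothetical labels, not sanlock
       identifiers.

```python
from typing import Optional

FORCE = 1
GRACEFUL = 2


def owner_action(force_mode: int, killpath: Optional[str]) -> str:
    """Pick the action the lease owner takes for a given force_mode."""
    if force_mode == FORCE:
        return "kill"
    if force_mode == GRACEFUL:
        # Without a configured killpath, GRACEFUL behaves like FORCE.
        return "run_killpath" if killpath else "kill"
    raise ValueError(f"unsupported force_mode: {force_mode}")
```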
778
779
780   Persistent and orphan resource leases
       A  resource  lease can be acquired with the PERSISTENT flag (-P 1).  If
       the process holding the lease exits, the lease will  not  be  released,
       but  kept  on  an  orphan  list.   Another local process can acquire or
       release an orphan lease using the ORPHAN flag (-O 1).  All orphan
       leases in a lockspace can be released by setting the lockspace name
       (-s lockspace_name) with no resource name.
787
788
789   Renewal history
790       sanlock saves a limited history of lease renewal  information  in  each
791       lockspace.   See sanlock.conf renewal_history_size to set the amount of
792       history or to disable (set to 0).
793
794       IO times are measured in delta lease renewal (each delta lease  renewal
795       includes one read and one write).
796
797       For each successful renewal, a record is saved that includes:
798
799       · the timestamp written in the delta lease by the renewal
800
801       · the time in milliseconds taken by the delta lease read
802
803       · the time in milliseconds taken by the delta lease write
804
805
       Also  counted  and  recorded are the number of io timeouts and other io
       errors that occur between successful renewals.
808
809       Two consecutive successful renewals would be recorded as:
810       timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
811       timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0
812
813       Those fields are:
814
815
816       · timestamp is the value written  into  the  delta  lease  during  that
817         renewal.
818
819
820       · read_ms/write_ms   are   the   milliseconds  taken  for  the  renewal
821         read/write ios.
822
823
       · next_timeouts is the number of io timeouts that  occurred  after  the
         renewal recorded on that line, and before the next successful renewal
         on the following line.


       · next_errors is the number of io errors (not timeouts)  that  occurred
         after the renewal recorded on that line, and before the next
         successful renewal on the following line.
832
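       Records in the format shown above are simple to process.  The following
       Python sketch parses one line into a dictionary of integers; the
       function name is hypothetical, and the sketch assumes every field has
       the form name=integer, as in the two sample lines.

```python
from typing import Dict


def parse_renewal(line: str) -> Dict[str, int]:
    """Parse 'timestamp=... read_ms=... write_ms=...' into a dict of ints."""
    return {key: int(value)
            for key, value in (field.split("=", 1) for field in line.split())}


first = parse_renewal(
    "timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0")
second = parse_renewal(
    "timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0")

# Difference between the timestamps written by two consecutive renewals.
interval = second["timestamp"] - first["timestamp"]
```

       For the two sample records, interval is 21.
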
833
834       The command 'sanlock client renewal -s lockspace_name' reports the full
835       history  of renewals saved by sanlock, which by default is 180 records,
836       about 1 hour of history when using a 20 second renewal interval  for  a
837       10 second io timeout.
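
       The one hour figure follows from multiplying the number of records by
       the renewal interval, as this small sketch shows (the function name is
       hypothetical):

```python
def history_span_seconds(records: int = 180, renewal_interval: int = 20) -> int:
    """Approximate time covered by the saved renewal history, in seconds."""
    return records * renewal_interval


span_minutes = history_span_seconds() // 60
```

       With the defaults, 180 records at 20 seconds each cover 3600 seconds,
       so span_minutes is 60.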
838
839

INTERNALS

841   Disk Format
842       · This example uses 512 byte sectors.
843
844       · Each  lockspace  is 1MB.  It holds 2000 delta_leases, one per sector,
845         supporting up to 2000 hosts.
846
847       · Each paxos_lease is 1MB.  It is used as a lease for one resource.
848
849       · The leader_record structure is used differently by each lease type.
850
851       · To display all leader_record fields, see sanlock direct read_leader.
852
853       · A lockspace is often followed on disk by the paxos_leases used within
854         that lockspace, but this layout is not required.
855
856       · The request_record and host_id bitmap are used for requests/events.
857
858       · The mode_block contains the SHARED flag indicating a lease is held in
859         the shared mode.
860
861       · In a  lockspace,  the  host  using  host_id  N  writes  to  a  single
862         delta_lease in sector N-1.  No other hosts write to this sector.  All
863         hosts read all lockspace sectors when renewing their own delta_lease,
864         and are able to monitor renewals of all delta_leases.
865
866       · In a paxos_lease, each host has a dedicated sector it writes to, con‐
867         taining its own paxos_dblock and mode_block structures.   Its  sector
868         is based on its host_id; host_id 1 writes to the dblock/mode_block in
869         sector 2 of the paxos_lease.
870
871       · The paxos_dblock structures are used by  the  paxos_lease  algorithm,
872         and the result is written to the leader_record.
873
874
875       0x000000 lockspace foo:0:/path:0
876
877       (There  is  no representation on disk of the lockspace in general, only
878       the sequence of specific delta_leases which collectively represent  the
879       lockspace.)
880
881       delta_lease foo:1:/path:0
882       0x000 0         leader_record         (sector 0, for host_id 1)
883                       magic: 0x12212010
884                       space_name: foo
885                       resource_name: host uuid/name
886                       ...
887                       host_id bitmap        (leader_record + 256)
888
889       delta_lease foo:2:/path:0
890       0x200 512       leader_record         (sector 1, for host_id 2)
891                       magic: 0x12212010
892                       space_name: foo
893                       resource_name: host uuid/name
894                       ...
895                       host_id bitmap        (leader_record + 256)
896
897       delta_lease foo:3:/path:0
898       0x400 1024      leader_record         (sector 2, for host_id 3)
899                       magic: 0x12212010
900                       space_name: foo
901                       resource_name: host uuid/name
902                       ...
903                       host_id bitmap        (leader_record + 256)
904
905       delta_lease foo:2000:/path:0
906       0xF9E00         leader_record         (sector 1999, for host_id 2000)
907                       magic: 0x12212010
908                       space_name: foo
909                       resource_name: host uuid/name
910                       ...
911                       host_id bitmap        (leader_record + 256)
912
913       0x100000 paxos_lease foo:example1:/path:1048576
914       0x000 0         leader_record         (sector 0)
915                       magic: 0x06152010
916                       space_name: foo
917                       resource_name: example1
918
919       0x200 512       request_record        (sector 1)
920                       magic: 0x08292011
921
922       0x400 1024      paxos_dblock          (sector 2, for host_id 1)
923       0x480 1152      mode_block            (paxos_dblock + 128)
924
925       0x600 1536      paxos_dblock          (sector 3, for host_id 2)
926       0x680 1664      mode_block            (paxos_dblock + 128)
927
928       0x800 2048      paxos_dblock          (sector 4, for host_id 3)
929       0x880 2176      mode_block            (paxos_dblock + 128)
930
931       0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
932       0xFA280         mode_block            (paxos_dblock + 128)
933
934       0x200000 paxos_lease foo:example2:/path:2097152
935       0x000 0         leader_record         (sector 0)
936                       magic: 0x06152010
937                       space_name: foo
938                       resource_name: example2
939
940       0x200 512       request_record        (sector 1)
941                       magic: 0x08292011
942
943       0x400 1024      paxos_dblock          (sector 2, for host_id 1)
944       0x480 1152      mode_block            (paxos_dblock + 128)
945
946       0x600 1536      paxos_dblock          (sector 3, for host_id 2)
947       0x680 1664      mode_block            (paxos_dblock + 128)
948
949       0x800 2048      paxos_dblock          (sector 4, for host_id 3)
950       0x880 2176      mode_block            (paxos_dblock + 128)
951
952       0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
953       0xFA280         mode_block            (paxos_dblock + 128)
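
       The offsets in the layout above follow directly from the sector
       numbering.  As a sketch, assuming 512 byte sectors as in this example
       (the function names are hypothetical):

```python
SECTOR_SIZE = 512        # this example uses 512 byte sectors
MODE_BLOCK_OFFSET = 128  # mode_block follows paxos_dblock in the same sector


def delta_lease_offset(lockspace_offset: int, host_id: int) -> int:
    """host_id N writes its delta_lease in sector N-1 of the lockspace."""
    return lockspace_offset + (host_id - 1) * SECTOR_SIZE


def paxos_dblock_offset(lease_offset: int, host_id: int) -> int:
    """Sectors 0 and 1 hold the leader and request records, so host_id N
    uses sector N+1 of the paxos_lease."""
    return lease_offset + (host_id + 1) * SECTOR_SIZE


def mode_block_offset(lease_offset: int, host_id: int) -> int:
    return paxos_dblock_offset(lease_offset, host_id) + MODE_BLOCK_OFFSET
```

       For host_id 2000, delta_lease_offset(0, 2000) is 0xF9E00 and
       paxos_dblock_offset(0, 2000) is 0xFA200, matching the last entries
       shown for the lockspace and for each paxos_lease.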
954
955
956   Lease ownership
957       Not  shown  in  the  leader_record  structures  above are the owner_id,
958       owner_generation and timestamp  fields.   These  are  the  fields  that
959       define the lease owner.
960
961       The  delta_lease at sector N for host_id N+1 has leader_record.owner_id
962       N+1.  The leader_record.owner_generation is incremented each  time  the
963       delta_lease   is   acquired.   When  a  delta_lease  is  acquired,  the
964       leader_record.timestamp field is set to the time of the  host  and  the
965       leader_record.resource_name  is  set  to  the  unique name of the host.
966       When   the   host   renews   the   delta_lease,   it   writes   a   new
967       leader_record.timestamp.  When a host releases a delta_lease, it writes
968       zero to leader_record.timestamp.
969
970       When a host acquires a  paxos_lease,  it  uses  the  host_id/generation
971       value  from  the  delta_lease  it holds in the lockspace.  It uses this
972       host_id/generation to identify itself in the paxos_dblock when  running
973       the  paxos  algorithm.   The  result  of  the  algorithm is the winning
974       host_id/generation - the new owner of  the  paxos_lease.   The  winning
975       host_id/generation      are      written     to     the     paxos_lease
976       leader_record.owner_id and  leader_record.owner_generation  fields  and
977       leader_record.timestamp is set.  When a host releases a paxos_lease, it
978       sets leader_record.timestamp to 0.
979
980       When a paxos_lease is free  (leader_record.timestamp  is  0),  multiple
981       hosts  may  attempt  to  acquire  it.   The  paxos algorithm, using the
982       paxos_dblock structures, will select only one of the hosts as  the  new
983       owner, and that owner is written in the leader_record.  The paxos_lease
984       will no longer be free (non-zero timestamp).  Other hosts will see this
985       and will not attempt to acquire the paxos_lease until it is free again.
986
987       If  a  paxos_lease is owned (non-zero timestamp), but the owner has not
988       renewed its delta_lease for a specific length of time, then  the  owner
989       value  in the paxos_lease becomes expired, and other hosts will use the
990       paxos algorithm to acquire the paxos_lease, and set a new owner.
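
       The three states described above (free, owned, expired) can be
       expressed as a small decision function.  This is a sketch only: the
       function name, the parameter names and the expire_seconds threshold
       are hypothetical, since this section does not give the exact timeout.

```python
def paxos_lease_state(timestamp: int,
                      seconds_since_owner_renewal: float,
                      expire_seconds: float) -> str:
    """Classify a paxos_lease from its leader_record.timestamp and the age
    of the owner's last delta_lease renewal."""
    if timestamp == 0:
        return "free"      # any host may run the paxos algorithm to acquire it
    if seconds_since_owner_renewal >= expire_seconds:
        return "expired"   # owner stopped renewing; others may acquire it
    return "owned"         # other hosts must wait until it is free again
```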
991
992

FILES

994       /etc/sanlock/sanlock.conf
995
996

SEE ALSO

998       wdmd(8)
999
1000
1001
1002
1003                                  2015-01-23                        SANLOCK(8)