SANLOCK(8)                  System Manager's Manual                 SANLOCK(8)

NAME

       sanlock - shared storage lock manager

SYNOPSIS

       sanlock [COMMAND] [ACTION] ...

DESCRIPTION

       sanlock is a lock manager built on shared storage.  Hosts with access
       to the storage can perform locking.  An application running on the
       hosts is given a small amount of space on the shared block device or
       file, and uses sanlock for its own application-specific synchroniza‐
       tion.  Internally, the sanlock daemon manages locks using two disk-
       based lease algorithms: delta leases and paxos leases.

       • delta leases are slow to acquire and demand regular i/o to shared
         storage.  sanlock only uses them internally to hold a lease on its
         "host_id" (an integer host identifier from 1-2000).  They prevent two
         hosts from using the same host identifier.  The delta lease renewals
         also indicate if a host is alive.  ("Light-Weight Leases for Storage-
         Centric Coordination", Chockler and Malkhi.)

       • paxos leases are fast to acquire and sanlock makes them available to
         applications as general purpose resource leases.  The disk paxos al‐
         gorithm uses host_id's internally to represent different hosts, and
         the owner of a paxos lease.  delta leases provide unique host_id's
         for implementing paxos leases, and delta lease renewals serve as a
         proxy for paxos lease renewal.  ("Disk Paxos", Eli Gafni and Leslie
         Lamport.)

       Externally, the sanlock daemon exposes a locking interface through lib‐
       sanlock in terms of "lockspaces" and "resources".  A lockspace is a
       locking context that an application creates for itself on shared stor‐
       age.  When the application on each host is started, it "joins" the
       lockspace.  It can then create "resources" on the shared storage.  Each
       resource represents an application-specific entity.  The application
       can acquire and release leases on resources.

       To use sanlock from an application:

       • Allocate shared storage for an application, e.g. a shared LUN or LV
         from a SAN, or files from NFS.

       • Provide the storage to the application.

       • The application uses this storage with libsanlock to create a
         lockspace and resources for itself.

       • The application joins the lockspace when it starts.

       • The application acquires and releases leases on resources.

       How lockspaces and resources translate to delta leases and paxos leases
       within sanlock:

       Lockspaces

       • A lockspace is based on delta leases held by each host using the
         lockspace.

       • A lockspace is a series of 2000 delta leases on disk, and requires
         1MB of storage.  (See Storage below for size variations.)

       • A lockspace can support up to 2000 concurrent hosts using it, each
         using a different delta lease.

       • Applications can i) create, ii) join and iii) leave a lockspace,
         which corresponds to i) initializing the set of delta leases on disk,
         ii) acquiring one of the delta leases and iii) releasing the delta
         lease.

       • When a lockspace is created, a unique lockspace name and disk loca‐
         tion are provided by the application.

       • When a lockspace is created/initialized, sanlock formats the sequence
         of 2000 on-disk delta lease structures on the file or disk, e.g.
         /mnt/leasefile (NFS) or /dev/vg/lv (SAN).

       • The 2000 individual delta leases in a lockspace are identified by
         number: 1,2,3,...,2000.

       • Each delta lease is a 512 byte sector in the 1MB lockspace, offset by
         its number, e.g. delta lease 1 is offset 0, delta lease 2 is offset
         512, delta lease 2000 is offset 1023488.  (See Storage below for size
         variations.)

       • When an application joins a lockspace, it must specify the lockspace
         name, the lockspace location on shared disk/file, and the local
         host's host_id.  sanlock then acquires the delta lease corresponding
         to the host_id, e.g. joining the lockspace with host_id 1 acquires
         delta lease 1.

       • The terms delta lease, lockspace lease, and host_id lease are used
         interchangeably.

       • sanlock acquires a delta lease by writing the host's unique name to
         the delta lease disk sector, reading it back after a delay, and veri‐
         fying it is the same.

       • If a unique host name is not specified, sanlock generates a uuid to
         use as the host's name.  The delta lease algorithm depends on hosts
         using unique names.

       • The application on each host should be configured with a unique
         host_id, where the host_id is an integer 1-2000.

       • If hosts are misconfigured and have the same host_id, the delta lease
         algorithm is designed to detect this conflict, and only one host will
         be able to acquire the delta lease for that host_id.

       • A delta lease ensures that a lockspace host_id is being used by a
         single host with the unique name specified in the delta lease.

       • Resolving delta lease conflicts is slow, because the algorithm is
         based on waiting and watching for some time for other hosts to write
         to the same delta lease sector.  If multiple hosts try to use the
         same delta lease, the delay is increased substantially.  So, it is
         best to configure applications to use unique host_id's that will not
         conflict.

       • After sanlock acquires a delta lease, the lease must be renewed until
         the application leaves the lockspace (which corresponds to releasing
         the delta lease on the host_id.)

       • sanlock renews delta leases every 20 seconds (by default) by writing
         a new timestamp into the delta lease sector.

       • When a host acquires a delta lease in a lockspace, it can be referred
         to as "joining" the lockspace.  Once it has joined the lockspace, it
         can use resources associated with the lockspace.

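
       The delta lease layout described in the list above reduces to simple
       arithmetic.  The following is a minimal sketch, not part of sanlock:
       delta_lease_offset is a hypothetical helper name, and it assumes the
       default layout of one 512 byte sector per delta lease.  Applying the
       same one-sector-per-lease rule to other sector sizes is an assumption
       of this sketch.

```shell
# Hypothetical helper (not a sanlock command): byte offset of the delta
# lease for a given host_id, relative to the start of the lockspace.
# $1: host_id (1-2000)
# $2: sector size in bytes (optional, defaults to 512; other values are
#     an assumption of this sketch)
delta_lease_offset() {
    host_id=$1
    sector_size=${2:-512}
    echo $(( (host_id - 1) * sector_size ))
}

delta_lease_offset 1      # prints 0
delta_lease_offset 2      # prints 512
delta_lease_offset 2000   # prints 1023488
```

       The three calls reproduce the offsets quoted for delta leases 1, 2
       and 2000.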

       Resources

       • A lockspace is a context for resources that can be locked and un‐
         locked by an application.

       • sanlock uses paxos leases to implement leases on resources.  The
         terms paxos lease and resource lease are used interchangeably.

       • A paxos lease exists on shared storage and requires 1MB of space.  It
         contains a unique resource name and the name of the lockspace.

       • An application assigns its own meaning to a sanlock resource and the
         leases on it.  A sanlock resource could represent some shared object
         like a file, or some unique role among the hosts.

       • Resource leases are associated with a specific lockspace and can only
         be used by hosts that have joined that lockspace (they are holding a
         delta lease on a host_id in that lockspace.)

       • An application must keep track of the disk locations of its
         lockspaces and resources.  sanlock does not maintain any persistent
         index or directory of lockspaces or resources that have been created
         by applications, so applications need to remember where they have
         placed their own leases (which files or disks and offsets).

       • sanlock does not renew paxos leases directly (although it could).
         Instead, the renewal of a host's delta lease represents the renewal
         of all that host's paxos leases in the associated lockspace.  In ef‐
         fect, many paxos lease renewals are factored out into one delta lease
         renewal.  This reduces i/o when many paxos leases are used.

       • The disk paxos algorithm allows multiple hosts to all attempt to ac‐
         quire the same paxos lease at once, and will produce a single win‐
         ner/owner of the resource lease.  (Shared resource leases are also
         possible in addition to the default exclusive leases.)

       • The disk paxos algorithm involves a specific sequence of reading and
         writing the sectors of the paxos lease disk area.  Each host has a
         dedicated 512 byte sector in the paxos lease disk area where it
         writes its own "ballot", and each host reads the entire disk area to
         see the ballots of other hosts.  The first sector of the disk area is
         the "leader record" that holds the result of the last paxos ballot.
         The winner of the paxos ballot writes the result of the ballot to the
         leader record (the winner of the ballot may have selected another
         contending host as the owner of the paxos lease.)

       • After a paxos lease is acquired, no further i/o is done in the paxos
         lease disk area.

       • Releasing the paxos lease involves writing a single sector to clear
         the current owner in the leader record.

       • If a host holding a paxos lease fails, the disk area of the paxos
         lease still indicates that the paxos lease is owned by the failed
         host.  If another host attempts to acquire the paxos lease, and finds
         the lease is held by another host_id, it will check the delta lease
         of that host_id.  If the delta lease of the host_id is being renewed,
         then the paxos lease is owned and cannot be acquired.  If the delta
         lease of the owner's host_id has expired, then the paxos lease is ex‐
         pired and can be taken (by going through the paxos lease algorithm.)

       • The "interaction" or "awareness" between hosts of each other is lim‐
         ited to the case where they attempt to acquire the same paxos lease,
         and need to check if the referenced delta lease has expired or not.

       • When hosts do not attempt to lock the same resources concurrently,
         there is no host interaction or awareness.  The state or actions of
         one host have no effect on others.

       • To speed up checking delta lease expiration (in the case of a paxos
         lease conflict), sanlock keeps track of past renewals of other delta
         leases in the lockspace.

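
       The takeover rule for a paxos lease whose owner may have failed can
       be summarized as a small decision function.  This is an illustrative
       model only, not sanlock code: paxos_lease_state is a hypothetical
       name, using 0 to mean "no owner recorded" is an assumption of the
       sketch, and the caller is assumed to already know whether the owner's
       delta lease is still being renewed.

```shell
# Hypothetical model of the ownership check described above.
# $1: host_id recorded as owner in the leader record
#     (0 = no owner; this encoding is an assumption of the sketch)
# $2: state of that owner's delta lease: "renewing" or "expired"
paxos_lease_state() {
    owner_host_id=$1
    owner_delta_lease=$2
    if [ "$owner_host_id" -eq 0 ]; then
        echo free        # released: owner was cleared in the leader record
    elif [ "$owner_delta_lease" = renewing ]; then
        echo owned       # owner is alive: the lease cannot be acquired
    else
        echo expired     # owner's delta lease expired: lease can be taken
    fi
}

paxos_lease_state 0 -          # prints free
paxos_lease_state 2 renewing   # prints owned
paxos_lease_state 2 expired    # prints expired
```

       In the "expired" case the new owner still has to go through the paxos
       lease algorithm; the function only models the eligibility check.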

       Resource Index

       The resource index (rindex) is an optional sanlock feature that appli‐
       cations can use to keep track of resource lease offsets.  Without the
       rindex, an application must keep track of where its resource leases ex‐
       ist on disk and find available locations when creating new leases.

       The sanlock rindex uses two align-size areas on disk following the
       lockspace.  The first area holds rindex entries; each entry records a
       resource lease name and location.  The second area holds a private
       paxos lease, used by sanlock internally to protect rindex updates.

       The application creates the rindex on disk with the "format" function.
       Format is a disk-only operation and does not interact with the live
       lockspace, so it can be called without first calling add_lockspace.
       The application needs to follow the convention of writing the lockspace
       at the start of the device (offset 0) and formatting the rindex immedi‐
       ately following the lockspace area.  When formatting, the application
       must set flags for sector size and align size to match those for the
       lockspace.

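
       The on-disk convention described above (lockspace at offset 0,
       followed by the two align-size rindex areas) can be written out as
       offsets.  This is a sketch under stated assumptions: rindex_layout is
       a hypothetical helper, the align size defaults to 1MB, and the
       position of the first resource lease directly after the rindex areas
       is an assumption rather than something stated in the text.

```shell
# Hypothetical helper: print the byte offsets implied by the rindex
# convention for a given align size.
# $1: align size in bytes (optional, defaults to 1048576)
rindex_layout() {
    align_size=${1:-1048576}
    echo "lockspace_offset=0"
    echo "rindex_entries_offset=$align_size"
    echo "rindex_lease_offset=$(( 2 * align_size ))"
    # Assumption: resource leases begin after the two rindex areas.
    echo "first_resource_offset=$(( 3 * align_size ))"
}

rindex_layout
```

       With the default 1MB align size this prints offsets 0, 1048576,
       2097152 and 3145728.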
       To use the rindex, the application:

       • Uses the "create" function to create a new resource lease on disk.
         This takes the place of the write_resource function.  The create
         function requires the location of the rindex and the name of the new
         resource lease.  sanlock finds a free lease area, writes the new re‐
         source lease at that location, updates the rindex with the name:off‐
         set, and returns the offset to the caller.  The caller uses this off‐
         set when acquiring the resource lease.

       • Uses the "delete" function to remove a resource lease on disk (also
         corresponding to the write_resource function.)  sanlock clears the
         resource lease and the rindex entry for it.  A subsequent call to
         create may use this same disk location for a different resource
         lease.

       • Uses the "lookup" function to discover the offset of a resource lease
         given the resource lease name.  The caller would typically call this
         prior to acquiring the resource lease.

       • Uses the "rebuild" function to recreate the rindex if it is damaged
         or becomes inconsistent.  This function scans the disk for resource
         leases and creates new rindex entries to match the leases it finds.

       • The "update" function manipulates rindex entries directly and should
         not normally be used by the application.  In normal usage, the create
         and delete functions manipulate rindex entries.  Update is mainly
         useful for testing or repairs.

       Expiration

       • If a host fails to renew its delta lease, e.g. it loses access to
         the storage, its delta lease will eventually expire and another host
         will be able to take over any resource leases held by the host.  san‐
         lock must ensure that the application on two different hosts is not
         holding and using the same lease concurrently.

       • When sanlock has failed to renew a delta lease for a period of time,
         it will begin taking measures to stop local processes (applications)
         from using any resource leases associated with the expiring lockspace
         delta lease.  sanlock enters this "recovery mode" well ahead of the
         time when another host could take over the locally owned leases.
         sanlock must have sufficient time to stop all local processes that
         are using the expiring leases.

       • sanlock uses three methods to stop local processes that are using ex‐
         piring leases:

         1. Graceful shutdown.  sanlock will execute a "graceful shutdown"
         program that the application previously specified for this case.  The
         shutdown program tells the application to shut down because its
         leases are expiring.  The application must respond by stopping its
         activities and releasing its leases (or exit).  If an application
         does not specify a graceful shutdown program, sanlock sends SIGTERM
         to the process instead.  The process must release its leases or exit
         in a prescribed amount of time (see -g), or sanlock proceeds to the
         next method of stopping.

         2. Forced shutdown.  sanlock will send SIGKILL to processes using the
         expiring leases.  The processes have a fixed amount of time to exit
         after receiving SIGKILL.  If any do not exit in this time, sanlock
         will proceed to the next method.

         3. Host reset.  sanlock will trigger the host's watchdog device to
         forcibly reset it.  sanlock carefully manages the timing of the
         watchdog device so that it fires shortly before any other host could
         take over the resource leases held by local processes.

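
       The order of escalation in recovery mode can be modelled as a function
       of elapsed time.  This is a simplified sketch, not sanlock's actual
       timing logic: recovery_stage is a hypothetical name, the graceful
       period stands in for the value set with -g, and the length of the
       forced-shutdown period is a made-up parameter (the text only says it
       is fixed).  The real host reset is driven by the watchdog timer rather
       than by a simple sum of the two periods.

```shell
# Hypothetical model: which stopping method applies to a process that is
# still holding expiring leases.
# $1: seconds since recovery mode began
# $2: graceful period in seconds (compare the daemon -g option)
# $3: forced-shutdown period in seconds (illustrative value only)
recovery_stage() {
    elapsed=$1
    graceful=$2
    forced=$3
    if [ "$elapsed" -lt "$graceful" ]; then
        echo graceful    # shutdown program or SIGTERM
    elif [ "$elapsed" -lt $(( graceful + forced )) ]; then
        echo forced      # SIGKILL
    else
        echo reset       # left to the watchdog device
    fi
}

recovery_stage 5 40 60     # prints graceful
recovery_stage 40 40 60    # prints forced
recovery_stage 100 40 60   # prints reset
```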

       Failures

       If a process holding resource leases fails or exits without releasing
       its leases, sanlock will release the leases for it automatically (un‐
       less persistent resource leases were used.)

       If the sanlock daemon cannot renew a lockspace delta lease for a spe‐
       cific period of time (see Expiration), sanlock will enter "recovery
       mode" where it attempts to stop and/or kill any processes holding re‐
       source leases in the expiring lockspace.  If the processes do not exit
       in time, sanlock will force the host to be reset using the local watch‐
       dog device.

       If the sanlock daemon crashes or hangs, it will not renew the expiry
       time of the per-lockspace connections it had to the wdmd daemon.  This
       will lead to the expiration of the local watchdog device, and the host
       will be reset.

       Watchdog

       sanlock uses the wdmd(8) daemon to access /dev/watchdog.  wdmd multi‐
       plexes multiple timeouts onto the single watchdog timer.  This is re‐
       quired because delta leases for each lockspace are renewed and expire
       independently.

       sanlock maintains a wdmd connection for each lockspace delta lease be‐
       ing renewed.  Each connection has an expiry time for some seconds in
       the future.  After each successful delta lease renewal, the expiry time
       is renewed for the associated wdmd connection.  If wdmd finds any con‐
       nection expired, it will not renew the /dev/watchdog timer.  Given
       enough successive failed renewals, the watchdog device will fire and
       reset the host.  (Given the multiplexing nature of wdmd, shorter over‐
       lapping renewal failures from multiple lockspaces could cause spurious
       watchdog firing.)

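
       The multiplexing rule in the preceding paragraph can be modelled in a
       few lines.  This is a sketch only, not wdmd code: wdmd_should_pet is
       a hypothetical name, times are plain integer seconds, and treating an
       expiry time equal to the current time as already expired is an
       assumption.

```shell
# Hypothetical model: decide whether the /dev/watchdog timer would be
# renewed, given the current time and the expiry time of each
# per-lockspace connection.
# $1: current time in seconds
# remaining arguments: expiry time of each connection
wdmd_should_pet() {
    now=$1
    shift
    for expiry in "$@"; do
        if [ "$expiry" -le "$now" ]; then
            echo no      # one expired connection blocks the renewal
            return 0
        fi
    done
    echo yes
}

wdmd_should_pet 100 120 130   # prints yes
wdmd_should_pet 100 120 90    # prints no
```

       The second call shows why one lockspace with failed renewals is
       enough to stop watchdog renewals for the whole host.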
       The direct link between delta lease renewals and watchdog renewals pro‐
       vides a predictable watchdog firing time based on delta lease renewal
       timestamps that are visible from other hosts.  sanlock knows the time
       the watchdog on another host has fired based on the delta lease time.
       Furthermore, if the watchdog device on another host fails to fire when
       it should, the continuation of delta lease renewals from the other host
       will make this evident and prevent leases from being taken from the
       failed host.

       If sanlock is able to stop/kill all processes using an expiring
       lockspace, the associated wdmd connection for that lockspace is re‐
       moved.  The expired wdmd connection will no longer block /dev/watchdog
       renewals, and the host should avoid being reset.

       Storage

       The sector size and the align size should be specified when creating
       lockspaces and resources (and rindex).  The "align size" is the size on
       disk of a lockspace or a resource, i.e. the amount of disk space it
       uses.  Lockspaces and resources should use matching sector and align
       sizes, and must use offsets in multiples of the align size.  The max
       number of hosts that can use a lockspace or resource depends on the
       combination of sector size and align size, shown below.  The host_id of
       hosts using the lockspace can be no larger than the max_hosts value for
       the lockspace.

       Accepted combinations of sector size and align size, and the corre‐
       sponding max_hosts (and max host_id) are:

       sector_size 512, align_size 1M, max_hosts 2000
       sector_size 4096, align_size 1M, max_hosts 250
       sector_size 4096, align_size 2M, max_hosts 500
       sector_size 4096, align_size 4M, max_hosts 1000
       sector_size 4096, align_size 8M, max_hosts 2000

       When sector_size and align_size are not specified, the behavior matches
       the behavior before these sizes could be configured: on devices which
       report sector size 512, 512/1M/2000 is used, on devices which report
       sector size 4096, 4096/8M/2000 is used, and on files, 512/1M/2000 is
       always used.  (Other combinations are not compatible with sanlock ver‐
       sion 3.6 or earlier.)

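
       The accepted combinations listed above can be expressed as a lookup,
       which is convenient when validating a host_id before joining a
       lockspace.  This is a minimal sketch: max_hosts is a hypothetical
       helper name, and the align size is passed in the same "1M" style
       notation that the table uses.

```shell
# Hypothetical helper: print max_hosts for an accepted combination of
# sector size and align size, or fail for any other combination.
# $1: sector size in bytes (512 or 4096)
# $2: align size as written in the table (1M, 2M, 4M or 8M)
max_hosts() {
    case "$1:$2" in
        512:1M)  echo 2000 ;;
        4096:1M) echo 250 ;;
        4096:2M) echo 500 ;;
        4096:4M) echo 1000 ;;
        4096:8M) echo 2000 ;;
        *)
            echo "unsupported sector_size/align_size: $1/$2" >&2
            return 1
            ;;
    esac
}

max_hosts 512 1M    # prints 2000
max_hosts 4096 2M   # prints 500
```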
       Using sanlock on shared block devices that do host based mirroring or
       replication is not likely to work correctly.  When using sanlock on
       shared files, all sanlock io should go to one file server.

       Example

       This is an example of creating and using lockspaces and resources from
       the command line.  (Most applications would use sanlock through libsan‐
       lock rather than through the command line.)

       1.  Allocate shared storage for sanlock leases.

           This example assumes 512 byte sectors on the device, in which case
           the lockspace needs 1MB and each resource needs 1MB.

           The example shared block device accessible to all hosts is
           /dev/leases.

       2.  Start sanlock on all hosts.

           The -w 0 disables use of the watchdog for testing.

           # sanlock daemon -w 0

       3.  Start a dummy application on all hosts.

           This sanlock command registers with sanlock, then execs the sleep
           command which inherits the registered fd.  The sleep process acts
           as the dummy application.  Because the sleep process is registered
           with sanlock, leases can be acquired for it.

           # sanlock client command -c /bin/sleep 600 &

       4.  Create a lockspace for the application (from one host).

           The lockspace is named "test".

           # sanlock client init -s test:0:/dev/leases:0

       5.  Join the lockspace for the application.

           Use a unique host_id on each host.

           host1:
           # sanlock client add_lockspace -s test:1:/dev/leases:0
           host2:
           # sanlock client add_lockspace -s test:2:/dev/leases:0

       6.  Create two resources for the application (from one host).

           The resources are named "RA" and "RB".  Offsets are used on the
           same device as the lockspace.  Different LVs or files could also be
           used.

           # sanlock client init -r test:RA:/dev/leases:1048576
           # sanlock client init -r test:RB:/dev/leases:2097152

       7.  Acquire resource leases for the application on host1.

           Acquire an exclusive lease (the default) on the first resource, and
           a shared lease (SH) on the second resource.

           # export P=`pidof sleep`
           # sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
           # sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P

       8.  Acquire resource leases for the application on host2.

           Acquiring the exclusive lease on the first resource will fail be‐
           cause it is held by host1.  Acquiring the shared lease on the sec‐
           ond resource will succeed.

           # export P=`pidof sleep`
           # sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
           # sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P

       9.  Release resource leases for the application on both hosts.

           The sleep pid could also be killed, which will result in the san‐
           lock daemon releasing its leases when it exits.

           # sanlock client release -r test:RA:/dev/leases:1048576 -p $P
           # sanlock client release -r test:RB:/dev/leases:2097152 -p $P

       10. Leave the lockspace for the application.

           host1:
           # sanlock client rem_lockspace -s test:1:/dev/leases:0
           host2:
           # sanlock client rem_lockspace -s test:2:/dev/leases:0

       11. Stop sanlock on all hosts.

           # sanlock shutdown

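
       The lockspace and resource strings used in the steps above follow a
       fixed pattern: the lockspace sits at offset 0 and each resource at a
       multiple of the 1MB align size.  The sketch below rebuilds those
       strings and prints the matching commands instead of running them.  The
       helper names lockspace_arg and resource_arg and the three variables
       are hypothetical, chosen for this illustration.

```shell
# Values taken from the example above.
DEVICE=/dev/leases
LOCKSPACE_NAME=test
ALIGN_SIZE=1048576

# Hypothetical helper: lockspace string for a host_id (offset 0).
lockspace_arg() {
    echo "${LOCKSPACE_NAME}:${1}:${DEVICE}:0"
}

# Hypothetical helper: resource string for a resource name placed in the
# Nth align-size slot after the lockspace (N = 1, 2, ...).
resource_arg() {
    echo "${LOCKSPACE_NAME}:${1}:${DEVICE}:$(( $2 * ALIGN_SIZE ))"
}

# Print (do not execute) the commands from steps 4 and 6.
echo "sanlock client init -s $(lockspace_arg 0)"
echo "sanlock client init -r $(resource_arg RA 1)"
echo "sanlock client init -r $(resource_arg RB 2)"
```

       Printing the commands first makes it easy to check the offsets before
       anything is written to the shared device.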

OPTIONS

       COMMAND can be one of three primary top level choices:

       sanlock daemon    start daemon
       sanlock client    send request to daemon (default command if none given)
       sanlock direct    access storage directly (no coordination with daemon)

   Daemon Command
       sanlock daemon [options]

       -D no fork and print all logging to stderr

       -Q 0|1 quiet error messages for common lock contention

       -R 0|1 renewal debugging, log debug info for each renewal

       -L pri write logging at priority level and up to logfile (-1 none)

       -S pri write logging at priority level and up to syslog (-1 none)

       -U uid user id

       -G gid group id

       -H num renewal history size

       -t num max worker threads

       -g sec seconds for graceful recovery

       -w 0|1 use watchdog through wdmd

       -h 0|1 use high priority (RR) scheduling

       -l num use mlockall (0 none, 1 current, 2 current and future)

       -b sec seconds a host id bit will remain set in delta lease bitmap

       -e str local host name used in delta leases

   Client Command
       sanlock client action [options]

       sanlock client status

       Print processes, lockspaces, and resources being managed by the sanlock
       daemon.  Add -D to show extra internal daemon status for debugging.
       Add -o p to show resources by pid, or -o s to show resources by
       lockspace.

       sanlock client host_status

       Print state of host_id delta leases read during the last renewal.
       State of all lockspaces is shown (use -s to select one).  Add -D to
       show extra internal daemon status for debugging.

       sanlock client gets

       Print lockspaces being managed by the sanlock daemon.  The LOCKSPACE
       string will be followed by ADD or REM if the lockspace is currently be‐
       ing added or removed.  Add -h 1 to also show hosts in each lockspace.

       sanlock client renewal -s LOCKSPACE

       Print a history of renewals with timing details.  See the Renewal his‐
       tory section below.

       sanlock client log_dump

       Print the sanlock daemon internal debug log.

       sanlock client shutdown

       Ask the sanlock daemon to exit.  Without the force option (-f 0), the
       command will be ignored if any lockspaces exist.  With the force option
       (-f 1), any registered processes will be killed, their resource leases
       released, and lockspaces removed.  With the wait option (-w 1), the
       command will wait for a result from the daemon indicating that it has
       shut down and is exiting, or cannot shut down because lockspaces exist
       (command fails).

       sanlock client init -s LOCKSPACE

       Tell the sanlock daemon to initialize a lockspace on disk.  The -o op‐
       tion can be used to specify the io timeout to be written in the host_id
       leases.  The -Z and -A options can be used to specify the sector size
       and align size, and both should be set together.  (Also see sanlock di‐
       rect init.)

       sanlock client init -r RESOURCE

       Tell the sanlock daemon to initialize a resource lease on disk.  The -Z
       and -A options can be used to specify the sector size and align size,
       and both should be set together.  (Also see sanlock direct init.)

       sanlock client read -s LOCKSPACE

       Tell the sanlock daemon to read a lockspace from disk.  Only the
       LOCKSPACE path and offset are required.  If host_id is zero, the first
       record at offset (host_id 1) is used.  The complete LOCKSPACE is
       printed.  Add -D to print other details.  (Also see sanlock direct
       read_leader.)

       sanlock client read -r RESOURCE

       Tell the sanlock daemon to read a resource lease from disk.  Only the
       RESOURCE path and offset are required.  The complete RESOURCE is
       printed.  Add -D to print other details.  (Also see sanlock direct
       read_leader.)

       sanlock client add_lockspace -s LOCKSPACE

       Tell the sanlock daemon to acquire the specified host_id in the
       lockspace.  This will allow resources to be acquired in the lockspace.
       The -o option can be used to specify the io timeout of the acquiring
       host, and will be written in the host_id lease.

       sanlock client inq_lockspace -s LOCKSPACE

       Inquire about the state of the lockspace in the sanlock daemon, whether
       it is being added or removed, or is joined.

       sanlock client rem_lockspace -s LOCKSPACE

       Tell the sanlock daemon to release the specified host_id in the
       lockspace.  Any processes holding resource leases in this lockspace
       will be killed, and their resource leases will not be released.

       sanlock client command -r RESOURCE -c path args

       Register with the sanlock daemon, acquire the specified resource lease,
       and exec the command at path with args.  When the command exits, the
       sanlock daemon will release the lease.  -c must be the final option.

       sanlock client acquire -r RESOURCE -p pid
       sanlock client release -r RESOURCE -p pid

       Tell the sanlock daemon to acquire or release the specified resource
       lease for the given pid.  The pid must be registered with the sanlock
       daemon.  acquire can optionally take a versioned RESOURCE string RE‐
       SOURCE:lver, where lver is the version of the lease that must be ac‐
       quired, or fail.

       sanlock client convert -r RESOURCE -p pid

       Tell the sanlock daemon to convert the mode of the specified resource
       lease for the given pid.  If the existing mode is exclusive (default),
       the mode of the lease can be converted to shared with RESOURCE:SH.  If
       the existing mode is shared, the mode of the lease can be converted to
       exclusive with RESOURCE (no :SH suffix).

       sanlock client inquire -p pid

       Print the resource leases held by the given pid.  The format is a ver‐
       sioned RESOURCE string "RESOURCE:lver" where lver is the version of the
       lease held.

       sanlock client request -r RESOURCE -f force_mode

       Request that the owner of a resource do something specified by
       force_mode.  A versioned RESOURCE:lver string must be used with a
       greater version than is presently held.  Zero lver and force_mode
       clears the request.

       sanlock client examine -r RESOURCE

       Examine the request record for the currently held resource lease and
       carry out the action specified by the requested force_mode.

       sanlock client examine -s LOCKSPACE

       Examine requests for all resource leases currently held in the named
       lockspace.  Only lockspace_name is used from the LOCKSPACE argument.
714
715       sanlock client set_event -s LOCKSPACE -i host_id -g gen -e num -d num
716
717       Set an event for another host.  When the sanlock daemon next renews its
718       delta lease for the lockspace it will: set the bit for the  host_id  in
719       its  bitmap,  and  set the generation, event and data values in its own
720       delta lease.  An application that has registered for events  from  this
721       lockspace  on the destination host will get the event that has been set
722       when the destination sees the event during its  next  delta  lease  re‐
723       newal.
724
725       sanlock client set_config -s LOCKSPACE
726
727       Set a configuration value for a lockspace.  Only lockspace_name is used
728       from the LOCKSPACE argument.  The USED flag has the same  effect  on  a
729       lockspace  as  a  process  holding a resource lease that will not exit.
       The USED_BY_ORPHANS flag means that an orphan resource lease will have
       the same effect as the USED flag.
732       -u 0|1 Set (1) or clear (0) the USED flag.
733       -O 0|1 Set (1) or clear (0) the USED_BY_ORPHANS flag.
734
735       sanlock client format -x RINDEX
736
737       Create  a resource index on disk.  Use -Z and -A to set the sector size
738       and align size to match the lockspace.
739
740       sanlock client create -x RINDEX -e resource_name
741
742       Create a new resource lease on disk, using the rindex to  find  a  free
743       offset.
744
745       sanlock client delete -x RINDEX -e resource_name[:offset]
746
747       Delete an existing resource lease on disk.
748
749       sanlock client lookup -x RINDEX -e resource_name
750
       Look up the offset of an existing resource lease by name on disk, using
       the rindex.  With no -e option, lookup returns the next free lease
       offset.  If -e specifies both name and offset, the lookup verifies both
       are correct.
755
756       sanlock client update -x RINDEX -e resource_name[:offset] [-z 0|1]
757
758       Add (-z 0) or remove (-z 1) an rindex entry on disk.
759
760       sanlock client rebuild -x RINDEX
761
762       Rebuild the rindex entries by scanning the disk for resource leases.
763
764
765
766   Direct Command
767       sanlock direct action [options]
768
769
770       -o sec io timeout in seconds
771
772       sanlock direct init -s LOCKSPACE
773       sanlock direct init -r RESOURCE
774
775       Initialize storage for a lockspace or resource.   Use  the  -Z  and  -A
776       flags  to  specify  the sector size and align size.  The max hosts that
777       can use the lockspace/resource (and the max possible host_id) is deter‐
778       mined by the sector/align size combination.  Possible combinations are:
779       512/1M, 4096/1M, 4096/2M, 4096/4M, 4096/8M.  Lockspaces  and  resources
780       both  use  the  same amount of space (align_size) for each combination.
781       When initializing a lockspace, sanlock  initializes  delta  leases  for
782       max_hosts  in  the  given space.  When initializing a resource, sanlock
783       initializes a single paxos lease in the space.  With -s, the -o  option
784       specifies the io timeout to be written in the host_id leases.  With -r,
785       the -z 1 option invalidates the resource lease on disk so it cannot  be
786       used until reinitialized normally.
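
       The space accounting described above can be sketched in Python.  This
       is an illustration only, not part of sanlock: the helper name
       lease_area_offsets is hypothetical, and it assumes nothing beyond the
       listed sector/align combinations and the statement that each lockspace
       or resource occupies align_size bytes.

```python
MiB = 1024 * 1024

# Sector/align size combinations listed in the text above.
VALID_COMBINATIONS = {
    (512, 1 * MiB),
    (4096, 1 * MiB),
    (4096, 2 * MiB),
    (4096, 4 * MiB),
    (4096, 8 * MiB),
}


def lease_area_offsets(sector_size, align_size, count, start=0):
    """Return byte offsets for `count` back-to-back lease areas.

    Hypothetical helper: each lockspace or resource is assumed to
    occupy exactly align_size bytes, as the text states.
    """
    if (sector_size, align_size) not in VALID_COMBINATIONS:
        raise ValueError("unsupported sector/align combination")
    return [start + i * align_size for i in range(count)]
```

       With 512/1M, three consecutive areas start at offsets 0, 1048576 and
       2097152, which matches the offsets used in the Disk Format example.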
787
788       sanlock direct read_leader -s LOCKSPACE
789       sanlock direct read_leader -r RESOURCE
790
791       Read a leader record from disk and print the fields.  The leader record
792       is the single sector of a delta lease, or the first sector of  a  paxos
793       lease.
794
795       sanlock direct dump path[:offset[:size]]
796
797       Read  disk  sectors and print leader records for delta or paxos leases.
798       Add -f 1 to print the request record values for paxos leases,  host_ids
799       set in delta lease bitmaps, and rindex entries.
800
801       sanlock direct format -x RINDEX
802       sanlock direct lookup -x RINDEX -e resource_name
803       sanlock direct update -x RINDEX -e resource_name[:offset] [-z 0|1]
804       sanlock direct rebuild -x RINDEX
805
806       Access  the  resource  index  on disk without going through the sanlock
807       daemon.  This precludes using  the  internal  paxos  lease  to  protect
808       rindex modifications.  See client equivalents for descriptions.
809
810
811
812   LOCKSPACE option string
813       -s lockspace_name:host_id:path:offset
814
815       lockspace_name name of lockspace
816       host_id local host identifier in lockspace
817       path path to storage to use for leases
818       offset offset on path (bytes)
819
820
821   RESOURCE option string
822       -r lockspace_name:resource_name:path:offset
823
824       lockspace_name name of lockspace
825       resource_name name of resource
       path path to storage to use for leases
827       offset offset on path (bytes)
828
829
830   RESOURCE option string with suffix
831       -r lockspace_name:resource_name:path:offset:lver
832
833       lver leader version
834
835       -r lockspace_name:resource_name:path:offset:SH
836
837       SH indicates shared mode
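
       As an illustration of the RESOURCE string layouts above, the following
       Python sketch splits a string into its fields.  The names Resource and
       parse_resource are hypothetical, and the sketch assumes that no field
       contains a ':' character.

```python
from typing import NamedTuple, Optional


class Resource(NamedTuple):
    lockspace_name: str
    resource_name: str
    path: str
    offset: int
    lver: Optional[int] = None   # leader version suffix, if given
    shared: bool = False         # True when the :SH suffix is present


def parse_resource(arg: str) -> Resource:
    """Split lockspace_name:resource_name:path:offset[:lver|:SH].

    Assumption: no field contains a ':' character.
    """
    fields = arg.split(":")
    if len(fields) not in (4, 5):
        raise ValueError("expected 4 or 5 colon-separated fields")
    lockspace_name, resource_name, path, offset = fields[:4]
    lver, shared = None, False
    if len(fields) == 5:
        if fields[4] == "SH":
            shared = True
        else:
            lver = int(fields[4])
    return Resource(lockspace_name, resource_name, path, int(offset),
                    lver, shared)
```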
838
839
840   RINDEX option string
841       -x lockspace_name:path:offset
842
843       lockspace_name name of lockspace
844       path path to storage to use for leases
845       offset offset on path (bytes) of rindex
846
847
848
849   Defaults
850       sanlock help shows the default values for the options above.
851
852       sanlock version shows the build version.
853
854

OTHER

856   Request/Examine
857       The  first  part  of making a request for a resource is writing the re‐
858       quest record of the resource (the sector following the leader  record).
859       To make a successful request:
860
861       • RESOURCE:lver  must  be  greater  than the lver presently held by the
862         other host.  This implies the leader record must be read to  discover
863         the lver, prior to making a request.
864
865       • RESOURCE:lver  must  be  greater  than or equal to the lver presently
866         written to the request record.  Two hosts may write a new request  at
867         the  same  time  for the same lver, in which case both would succeed,
868         but the force_mode from the last would win.
869
870       • The force_mode must be greater than zero.
871
872       • To unconditionally clear  the  request  record  (set  both  lver  and
873         force_mode to 0), make request with RESOURCE:0 and force_mode 0.
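
       The rules above can be restated as a small Python check.  This is a
       sketch of the listed conditions only, not sanlock's code; the function
       name check_request and its return values are hypothetical.

```python
def check_request(req_lver, force_mode, held_lver, recorded_lver):
    """Return 'clear', 'write' or 'reject' for a proposed request.

    held_lver:     lver in the leader record (held by the current owner)
    recorded_lver: lver currently written in the request record
    """
    if req_lver == 0 and force_mode == 0:
        return "clear"       # unconditional clear of the request record
    if force_mode <= 0:
        return "reject"      # force_mode must be greater than zero
    if req_lver <= held_lver:
        return "reject"      # must exceed the lver presently held
    if req_lver < recorded_lver:
        return "reject"      # must be >= the lver already requested
    return "write"
```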
874
875
       The owner of the requested resource will not know of the request unless
       it is explicitly told to examine its resources via the "examine"
       api/command, or otherwise notified.
879
880       The  second  part  of  making a request is notifying the resource lease
881       owner that it should  examine  the  request  records  of  its  resource
882       leases.   The  notification will cause the lease owner to automatically
883       run the equivalent of "sanlock client examine  -s  LOCKSPACE"  for  the
884       lockspace of the requested resource.
885
       The notification is made using a bitmap in each host_id delta lease.
       Each bit represents one of the possible host_ids (1-2000).  If host A
       wants to notify host B to examine its resources, A sets the bit in its
       own bitmap that corresponds to the host_id of B.  When B next renews
       its delta lease, it reads the delta leases for all hosts and checks
       each bitmap to see if its own host_id has been set.  It finds the bit
       for its own host_id set in A's bitmap, and examines its resource
       request records.  (The bit remains set in A's bitmap for
       set_bitmap_seconds.)
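
       The bitmap exchange can be modelled in a few lines of Python.  The bit
       layout used here (host_id N stored as bit N-1, least significant bit
       first within each byte) is an assumption made for illustration, and
       all function names are hypothetical.

```python
BITMAP_BYTES = 2000 // 8   # one bit for each possible host_id (1-2000)


def set_host_bit(bitmap, host_id):
    """Set the bit for host_id in a bytearray (assumed layout)."""
    index = host_id - 1
    bitmap[index // 8] |= 1 << (index % 8)


def host_bit_is_set(bitmap, host_id):
    index = host_id - 1
    return bool(bitmap[index // 8] & (1 << (index % 8)))


def hosts_notifying(my_host_id, bitmaps):
    """Return the host_ids whose bitmap has the bit for my_host_id set.

    bitmaps maps each host_id to the bitmap read from its delta lease.
    """
    return sorted(host for host, bitmap in bitmaps.items()
                  if host != my_host_id
                  and host_bit_is_set(bitmap, my_host_id))
```

       For example, if host 1 calls set_host_bit on its own bitmap for
       host_id 3, then hosts_notifying(3, bitmaps) on host 3 returns [1].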
895
896       force_mode determines the action the resource lease owner should take:
897
898
899       • FORCE  (1):  kill  the  process holding the resource lease.  When the
900         process has exited, the resource lease will be released, and can then
901         be  acquired  by  anyone.   The kill signal is SIGKILL (or SIGTERM if
902         SIGKILL is restricted.)
903
904
905       • GRACEFUL (2): run the program configured by sanlock_killpath  against
906         the  process  holding the resource lease.  If no killpath is defined,
907         then FORCE is used.
908
909
910   Persistent and orphan resource leases
911       A resource lease can be acquired with the PERSISTENT flag (-P  1).   If
912       the  process  holding  the lease exits, the lease will not be released,
913       but kept on an orphan list.  Another local process can acquire  an  or‐
914       phan  lease  using  the ORPHAN flag (-O 1), or release the orphan lease
915       using the ORPHAN flag (-O 1).  All orphan leases  can  be  released  by
916       setting the lockspace name (-s lockspace_name) with no resource name.
917
918
919   Renewal history
920       sanlock  saves  a  limited history of lease renewal information in each
921       lockspace.  See sanlock.conf renewal_history_size to set the amount  of
922       history or to disable (set to 0).
923
924       IO  times are measured in delta lease renewal (each delta lease renewal
925       includes one read and one write).
926
927       For each successful renewal, a record is saved that includes:
928
929       • the timestamp written in the delta lease by the renewal
930
931       • the time in milliseconds taken by the delta lease read
932
933       • the time in milliseconds taken by the delta lease write
934
935
       Also counted and recorded are the number of io timeouts and other io
       errors that occur between successful renewals.
938
939       Two consecutive successful renewals would be recorded as:
940       timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
941       timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0
942
943       Those fields are:
944
945
946       • timestamp  is  the value written into the delta lease during that re‐
947         newal.
948
949
950       • read_ms/write_ms  are  the  milliseconds  taken   for   the   renewal
951         read/write ios.
952
953
       • next_timeouts is the number of io timeouts that occurred after the
         renewal recorded on that line, and before the next successful renewal
         on the following line.


       • next_errors is the number of io errors (not timeouts) that occurred
         after the renewal recorded on that line, and before the next
         successful renewal on the following line.
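
       Records in this form are easy to post-process.  The Python sketch
       below parses the two example lines shown above; it assumes that every
       field is a key=value pair with an integer value, as in that example,
       and the function name parse_renewal_line is hypothetical.

```python
def parse_renewal_line(line):
    """Turn 'key=value key=value ...' into a dict of ints."""
    return {key: int(value)
            for key, value in (field.split("=", 1)
                               for field in line.split())}


sample = """\
timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0
"""

records = [parse_renewal_line(line)
           for line in sample.splitlines() if line.strip()]

# Slowest renewal write, and the gap between the two renewal timestamps.
slowest_write = max(record["write_ms"] for record in records)
interval = records[1]["timestamp"] - records[0]["timestamp"]
print(slowest_write, interval)   # prints "5525 21"
```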
962
963
964       The command 'sanlock client renewal -s lockspace_name' reports the full
965       history of renewals saved by sanlock, which by default is 180  records,
966       about  1  hour of history when using a 20 second renewal interval for a
967       10 second io timeout.
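
       The "about 1 hour" figure follows from multiplying the record count by
       the renewal interval.  A minimal Python check, using a hypothetical
       helper name:

```python
def history_span_seconds(records=180, renewal_interval_sec=20):
    """Approximate time covered by a full renewal history."""
    return records * renewal_interval_sec


print(history_span_seconds() / 3600)   # prints 1.0 (hours)
```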
968
969

INTERNALS

971   Disk Format
972       • This example uses 512 byte sectors.
973
974       • Each lockspace is 1MB.  It holds 2000 delta_leases, one  per  sector,
975         supporting up to 2000 hosts.
976
977       • Each paxos_lease is 1MB.  It is used as a lease for one resource.
978
979       • The leader_record structure is used differently by each lease type.
980
981       • To display all leader_record fields, see sanlock direct read_leader.
982
983       • A lockspace is often followed on disk by the paxos_leases used within
984         that lockspace, but this layout is not required.
985
986       • The request_record and host_id bitmap are used for requests/events.
987
988       • The mode_block contains the SHARED flag indicating a lease is held in
989         the shared mode.
990
991       • In  a  lockspace,  the  host  using  host_id  N  writes  to  a single
992         delta_lease in sector N-1.  No other hosts write to this sector.  All
993         hosts read all lockspace sectors when renewing their own delta_lease,
994         and are able to monitor renewals of all delta_leases.
995
996       • In a paxos_lease, each host has a dedicated sector it writes to, con‐
997         taining  its  own paxos_dblock and mode_block structures.  Its sector
998         is based on its host_id; host_id 1 writes to the dblock/mode_block in
999         sector 2 of the paxos_lease.
1000
1001       • The  paxos_dblock  structures  are used by the paxos_lease algorithm,
1002         and the result is written to the leader_record.
1003
1004
1005       0x000000 lockspace foo:0:/path:0
1006
1007       (There is no representation on disk of the lockspace in  general,  only
1008       the  sequence of specific delta_leases which collectively represent the
1009       lockspace.)
1010
1011       delta_lease foo:1:/path:0
1012       0x000 0         leader_record         (sector 0, for host_id 1)
1013                       magic: 0x12212010
1014                       space_name: foo
1015                       resource_name: host uuid/name
1016                       ...
1017                       host_id bitmap        (leader_record + 256)
1018
1019       delta_lease foo:2:/path:0
1020       0x200 512       leader_record         (sector 1, for host_id 2)
1021                       magic: 0x12212010
1022                       space_name: foo
1023                       resource_name: host uuid/name
1024                       ...
1025                       host_id bitmap        (leader_record + 256)
1026
1027       delta_lease foo:3:/path:0
1028       0x400 1024      leader_record         (sector 2, for host_id 3)
1029                       magic: 0x12212010
1030                       space_name: foo
1031                       resource_name: host uuid/name
1032                       ...
1033                       host_id bitmap        (leader_record + 256)
1034
1035       delta_lease foo:2000:/path:0
1036       0xF9E00         leader_record         (sector 1999, for host_id 2000)
1037                       magic: 0x12212010
1038                       space_name: foo
1039                       resource_name: host uuid/name
1040                       ...
1041                       host_id bitmap        (leader_record + 256)
1042
1043       0x100000 paxos_lease foo:example1:/path:1048576
1044       0x000 0         leader_record         (sector 0)
1045                       magic: 0x06152010
1046                       space_name: foo
1047                       resource_name: example1
1048
1049       0x200 512       request_record        (sector 1)
1050                       magic: 0x08292011
1051
1052       0x400 1024      paxos_dblock          (sector 2, for host_id 1)
1053       0x480 1152      mode_block            (paxos_dblock + 128)
1054
1055       0x600 1536      paxos_dblock          (sector 3, for host_id 2)
1056       0x680 1664      mode_block            (paxos_dblock + 128)
1057
1058       0x800 2048      paxos_dblock          (sector 4, for host_id 3)
1059       0x880 2176      mode_block            (paxos_dblock + 128)
1060
1061       0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
1062       0xFA280         mode_block            (paxos_dblock + 128)
1063
1064       0x200000 paxos_lease foo:example2:/path:2097152
1065       0x000 0         leader_record         (sector 0)
1066                       magic: 0x06152010
1067                       space_name: foo
1068                       resource_name: example2
1069
1070       0x200 512       request_record        (sector 1)
1071                       magic: 0x08292011
1072
1073       0x400 1024      paxos_dblock          (sector 2, for host_id 1)
1074       0x480 1152      mode_block            (paxos_dblock + 128)
1075
1076       0x600 1536      paxos_dblock          (sector 3, for host_id 2)
1077       0x680 1664      mode_block            (paxos_dblock + 128)
1078
1079       0x800 2048      paxos_dblock          (sector 4, for host_id 3)
1080       0x880 2176      mode_block            (paxos_dblock + 128)
1081
1082       0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
1083       0xFA280         mode_block            (paxos_dblock + 128)
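
       The offsets in the layout above follow a simple pattern, sketched here
       in Python for the 512 byte sector example.  The function names are
       hypothetical; the constants are taken from the layout shown.

```python
SECTOR_SIZE = 512          # this example uses 512 byte sectors
BITMAP_OFFSET = 256        # host_id bitmap at leader_record + 256
MODE_BLOCK_OFFSET = 128    # mode_block at paxos_dblock + 128


def delta_lease_offset(lockspace_offset, host_id):
    """host_id N uses sector N-1 of the lockspace."""
    return lockspace_offset + (host_id - 1) * SECTOR_SIZE


def bitmap_offset(lockspace_offset, host_id):
    return delta_lease_offset(lockspace_offset, host_id) + BITMAP_OFFSET


def request_record_offset(lease_offset):
    """Sector 1 of a paxos_lease, following the leader_record."""
    return lease_offset + SECTOR_SIZE


def paxos_dblock_offset(lease_offset, host_id):
    """Sectors 0 and 1 hold the leader_record and request_record,
    so host_id N uses sector N+1 of the paxos_lease."""
    return lease_offset + (host_id + 1) * SECTOR_SIZE


def mode_block_offset(lease_offset, host_id):
    return paxos_dblock_offset(lease_offset, host_id) + MODE_BLOCK_OFFSET
```

       For example, hex(delta_lease_offset(0, 2000)) gives 0xf9e00 and
       hex(paxos_dblock_offset(0, 2000)) gives 0xfa200, matching the last
       entries of the lockspace and paxos_lease listings.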
1084
1085
1086   Lease ownership
1087       Not shown in the  leader_record  structures  above  are  the  owner_id,
1088       owner_generation  and  timestamp fields.  These are the fields that de‐
1089       fine the lease owner.
1090
1091       The delta_lease at sector N for host_id N+1 has  leader_record.owner_id
1092       N+1.   The  leader_record.owner_generation is incremented each time the
1093       delta_lease  is  acquired.   When  a  delta_lease  is   acquired,   the
1094       leader_record.timestamp  field  is  set to the time of the host and the
1095       leader_record.resource_name is set to the  unique  name  of  the  host.
1096       When   the   host   renews   the   delta_lease,   it   writes   a   new
1097       leader_record.timestamp.  When a host releases a delta_lease, it writes
1098       zero to leader_record.timestamp.
1099
1100       When  a  host  acquires  a  paxos_lease, it uses the host_id/generation
1101       value from the delta_lease it holds in the  lockspace.   It  uses  this
1102       host_id/generation  to identify itself in the paxos_dblock when running
1103       the paxos algorithm.  The  result  of  the  algorithm  is  the  winning
1104       host_id/generation  -  the  new  owner of the paxos_lease.  The winning
1105       host_id/generation     are     written     to      the      paxos_lease
1106       leader_record.owner_id  and  leader_record.owner_generation  fields and
1107       leader_record.timestamp is set.  When a host releases a paxos_lease, it
1108       sets leader_record.timestamp to 0.
1109
1110       When  a  paxos_lease  is  free (leader_record.timestamp is 0), multiple
1111       hosts may attempt to  acquire  it.   The  paxos  algorithm,  using  the
1112       paxos_dblock  structures,  will select only one of the hosts as the new
1113       owner, and that owner is written in the leader_record.  The paxos_lease
1114       will no longer be free (non-zero timestamp).  Other hosts will see this
1115       and will not attempt to acquire the paxos_lease until it is free again.
1116
1117       If a paxos_lease is owned (non-zero timestamp), but the owner  has  not
1118       renewed  its  delta_lease for a specific length of time, then the owner
1119       value in the paxos_lease becomes expired, and other hosts will use  the
1120       paxos algorithm to acquire the paxos_lease, and set a new owner.
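
       The three ownership states described above can be summarized in a
       short Python sketch.  It is a simplification: expire_sec stands in for
       the unspecified "specific length of time", owner_last_renewal is taken
       to be the local time at which the observer last saw the owner's
       delta_lease renewed, and all names are hypothetical.

```python
def paxos_lease_state(leader_timestamp, owner_last_renewal, now,
                      expire_sec):
    """Classify a paxos_lease as 'free', 'owned' or 'expired'.

    leader_timestamp:   timestamp field of the paxos_lease leader_record
    owner_last_renewal: when the owner's delta_lease was last seen renewed
    expire_sec:         assumed expiry interval (not given in the text)
    """
    if leader_timestamp == 0:
        return "free"        # released; any host may run paxos to acquire
    if now - owner_last_renewal >= expire_sec:
        return "expired"     # owner stopped renewing its delta_lease
    return "owned"           # other hosts do not attempt to acquire
```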
1121
1122

FILES

1124       /etc/sanlock/sanlock.conf
1125
1126
1127       • quiet_fail = 1
1128         See -Q
1129
1130
1131       • debug_renew = 0
1132         See -R
1133
1134
1135       • logfile_priority = 4
1136         See -L
1137
1138
1139       • logfile_use_utc = 0
1140         Use UTC instead of local time in log messages.
1141
1142
1143       • syslog_priority = 3
1144         See -S
1145
1146
1147       • names_log_priority = 4
1148         Log  resource names at this priority level (uses syslog priority num‐
1149         bers).  If this is greater than or equal  to  logfile_priority,  each
1150         requested resource name and location is recorded in sanlock.log.
1151
1152
1153       • use_watchdog = 1
1154         See -w
1155
1156
1157       • high_priority = 1
1158         See -h
1159
1160
1161       • mlock_level = 1
1162         See -l
1163
1164
1165       • sh_retries = 8
1166         The  number  of times to try acquiring a paxos lease when acquiring a
1167         shared lease when the paxos lease is held by another host acquiring a
1168         shared lease.
1169
1170
1171       • uname = sanlock
1172         See -U
1173
1174
1175       • gname = sanlock
1176         See -G
1177
1178
1179       • our_host_name = <str>
1180         See -e
1181
1182
1183       • renewal_read_extend_sec = <seconds>
1184         If  a  renewal  read i/o times out, wait this many additional seconds
1185         for that read to complete at the start of the subsequent renewal  at‐
1186         tempt.  When not configured, sanlock waits for an additional io_time‐
1187         out seconds for a previous timed out read to complete.
1188
1189
1190       • renewal_history_size = 180
1191         See -H
1192
1193
1194       • paxos_debug_all = 0
1195         Include all details in the paxos debug logging.
1196
1197
1198       • debug_io = <str>
1199         Add debug logging for each i/o.  "submit" (no quotes) produces  debug
1200         output  at  submission time, "complete" produces debug output at com‐
1201         pletion time, and "submit,complete" (no space) produces both.
1202
1203
1204       • max_sectors_kb = <str>|<num>
1205         Set to "ignore" (no quotes)  to  prevent  sanlock  from  checking  or
1206         changing  max_sectors_kb  for  the  lockspace  disk  when  starting a
1207         lockspace.  Set to "align" (no quotes) to set max_sectors_kb for  the
1208         lockspace  disk  to the align size of the lockspace.  Set to a number
1209         to set a specific number of KB for all lockspace disks.
1210
1211
1212       • debug_clients = 0
1213         Enable or disable debug logging for all  client  connections  to  the
1214         sanlock daemon.
1215
1216
1217       • debug_cmd = +|-<name>
1218         Enable  (+name)  or disable (-name) debug logging at the command pro‐
1219         cessing level for specifically named commands, e.g. "debug_cmd = +ac‐
1220         quire",  or  "debug_cmd = -inq_lockspace".  Repeat this line for each
1221         command name.  Use a plus prefix before the name to enable and a  mi‐
1222         nus  prefix  to  disable.   By  default sanlock disables some command
1223         level debugging for commands that are often repetitive and  fill  the
1224         in  memory debug buffer.  This only affects debug logging, not errors
1225         or warnings, and disabling command level debugging for a command does
1226         not  disable  lower level debugging for that command.  Special values
1227         +all and -all can be used to enable or disable all commands, and  can
1228         be used before or after other debug_cmd lines.
1229
1230
1231       • write_init_io_timeout = <seconds>
         The io timeout to use when initializing on-disk lease structures for
         a lockspace or resource.  This timeout is not used as a part of
         either lease algorithm (unlike the standard io_timeout).
1235
1236
1237       • max_worker_threads = <num>
1238         See -t
1239
1240

SEE ALSO

1242       wdmd(8)
1243
1244
1245
1246
1247                                  2015-01-23                        SANLOCK(8)