SANLOCK(8)                  System Manager's Manual                 SANLOCK(8)

NAME

       sanlock - shared storage lock manager

SYNOPSIS

       sanlock [COMMAND] [ACTION] ...

DESCRIPTION

14       sanlock  is  a lock manager built on shared storage.  Hosts with access
15       to the storage can perform locking.   An  application  running  on  the
16       hosts  is  given  a small amount of space on the shared block device or
17       file, and uses sanlock for its  own  application-specific  synchroniza‐
18       tion.   Internally,  the  sanlock  daemon manages locks using two disk-
19       based lease algorithms: delta leases and paxos leases.
20
21
22       · delta leases are slow to acquire and demand  regular  i/o  to  shared
23         storage.   sanlock  only  uses them internally to hold a lease on its
24         "host_id" (an integer host identifier from 1-2000).  They prevent two
25         hosts  from using the same host identifier.  The delta lease renewals
26         also indicate if a host is alive.  ("Light-Weight Leases for Storage-
27         Centric Coordination", Chockler and Malkhi.)
28
29
30       · paxos  leases are fast to acquire and sanlock makes them available to
31         applications as general purpose  resource  leases.   The  disk  paxos
32         algorithm uses host_id's internally to represent different hosts, and
33         the owner of a paxos lease.  delta leases  provide  unique  host_id's
34         for  implementing  paxos  leases, and delta lease renewals serve as a
35         proxy for paxos lease renewal.  ("Disk Paxos", Eli Gafni  and  Leslie
36         Lamport.)
37
38
39       Externally, the sanlock daemon exposes a locking interface through lib‐
40       sanlock in terms of "lockspaces" and "resources".   A  lockspace  is  a
41       locking  context that an application creates for itself on shared stor‐
42       age.  When the application on each host  is  started,  it  "joins"  the
43       lockspace.  It can then create "resources" on the shared storage.  Each
44       resource represents an application-specific  entity.   The  application
45       can acquire and release leases on resources.
46
47       To use sanlock from an application:
48
49
50       · Allocate  shared  storage for an application, e.g. a shared LUN or LV
51         from a SAN, or files from NFS.
52
53
54       · Provide the storage to the application.
55
56
57       · The application  uses  this  storage  with  libsanlock  to  create  a
58         lockspace and resources for itself.
59
60
61       · The application joins the lockspace when it starts.
62
63
64       · The application acquires and releases leases on resources.
65
66
67       How lockspaces and resources translate to delta leases and paxos leases
68       within sanlock:
69
70       Lockspaces
71
72
73       · A lockspace is based on delta leases held  by  each  host  using  the
74         lockspace.
75
76
77       · A  lockspace  is  a series of 2000 delta leases on disk, and requires
78         1MB of storage.  (See Storage below for size variations.)
79
80
81       · A lockspace can support up to 2000 concurrent hosts  using  it,  each
82         using a different delta lease.
83
84
85       · Applications  can  i)  create,  ii)  join and iii) leave a lockspace,
86         which corresponds to i) initializing the set of delta leases on disk,
87         ii)  acquiring  one  of the delta leases and iii) releasing the delta
88         lease.
89
90
91       · When a lockspace is created, a unique lockspace name and  disk  loca‐
92         tion is provided by the application.
93
94
95       · When a lockspace is created/initialized, sanlock formats the sequence
96         of 2000 on-disk delta lease structures on  the  file  or  disk,  e.g.
97         /mnt/leasefile (NFS) or /dev/vg/lv (SAN).
98
99
100       · The  2000  individual  delta  leases in a lockspace are identified by
101         number: 1,2,3,...,2000.
102
103
104       · Each delta lease is a 512 byte sector in the 1MB lockspace, offset by
105         its  number,  e.g. delta lease 1 is offset 0, delta lease 2 is offset
106         512, delta lease 2000 is offset 1023488.  (See Storage below for size
107         variations.)
108
109
110       · When  an application joins a lockspace, it must specify the lockspace
111         name, the lockspace location  on  shared  disk/file,  and  the  local
112         host's  host_id.  sanlock then acquires the delta lease corresponding
113         to the host_id, e.g. joining the lockspace with  host_id  1  acquires
114         delta lease 1.
115
116
       · The terms delta lease, lockspace lease, and host_id lease are used
         interchangeably.
119
120
121       · sanlock acquires a delta lease by writing the host's unique  name  to
122         the delta lease disk sector, reading it back after a delay, and veri‐
123         fying it is the same.
124
125
126       · If a unique host name is not specified, sanlock generates a  uuid  to
127         use  as  the host's name.  The delta lease algorithm depends on hosts
128         using unique names.
129
130
131       · The application on each host  should  be  configured  with  a  unique
132         host_id, where the host_id is an integer 1-2000.
133
134
135       · If hosts are misconfigured and have the same host_id, the delta lease
136         algorithm is designed to detect this conflict, and only one host will
137         be able to acquire the delta lease for that host_id.
138
139
140       · A  delta  lease  ensures  that a lockspace host_id is being used by a
141         single host with the unique name specified in the delta lease.
142
143
144       · Resolving delta lease conflicts is slow,  because  the  algorithm  is
145         based  on waiting and watching for some time for other hosts to write
146         to the same delta lease sector.  If multiple hosts  try  to  use  the
147         same  delta  lease,  the delay is increased substantially.  So, it is
148         best to configure applications to use unique host_id's that will  not
149         conflict.
150
151
152       · After sanlock acquires a delta lease, the lease must be renewed until
153         the application leaves the lockspace (which corresponds to  releasing
154         the delta lease on the host_id.)
155
156
157       · sanlock  renews delta leases every 20 seconds (by default) by writing
158         a new timestamp into the delta lease sector.
159
160
161       · When a host acquires a delta lease in a lockspace, it can be referred
162         to  as "joining" the lockspace.  Once it has joined the lockspace, it
163         can use resources associated with the lockspace.
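
       The offset rule for delta leases given in the list above can be
       sketched in a few lines of Python.  This is only an illustration
       for the default 512 byte sector layout; the function name
       delta_lease_offset and its parameters are hypothetical and are not
       part of libsanlock.

```python
SECTOR_SIZE = 512   # bytes per delta lease in the default layout
MAX_HOSTS = 2000    # delta leases in a 1MB lockspace


def delta_lease_offset(host_id, lockspace_offset=0,
                       sector_size=SECTOR_SIZE, max_hosts=MAX_HOSTS):
    """Return the byte offset of the delta lease for host_id.

    Hypothetical helper: delta lease N occupies sector N-1 of the
    lockspace area that begins at lockspace_offset.
    """
    if not 1 <= host_id <= max_hosts:
        raise ValueError("host_id must be in 1..%d" % max_hosts)
    return lockspace_offset + (host_id - 1) * sector_size


if __name__ == "__main__":
    # Reproduces the offsets quoted in the text for leases 1, 2 and 2000.
    for host_id in (1, 2, 2000):
        print(host_id, delta_lease_offset(host_id))
```

       Because every host writes only to its own sector, two hosts with
       different host_ids never touch the same delta lease.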
164
165
166       Resources
167
168
169       · A lockspace is a  context  for  resources  that  can  be  locked  and
170         unlocked by an application.
171
172
       · sanlock uses paxos leases to implement leases on resources.  The
         terms paxos lease and resource lease are used interchangeably.
175
176
177       · A paxos lease exists on shared storage and requires 1MB of space.  It
178         contains a unique resource name and the name of the lockspace.
179
180
181       · An  application assigns its own meaning to a sanlock resource and the
182         leases on it.  A sanlock resource could represent some shared  object
183         like a file, or some unique role among the hosts.
184
185
186       · Resource leases are associated with a specific lockspace and can only
187         be used by hosts that have joined that lockspace (they are holding  a
188         delta lease on a host_id in that lockspace.)
189
190
191       · An  application  must  keep  track  of  the  disk  locations  of  its
192         lockspaces and resources.  sanlock does not maintain  any  persistent
193         index  or directory of lockspaces or resources that have been created
194         by applications, so applications need to  remember  where  they  have
195         placed their own leases (which files or disks and offsets).
196
197
198       · sanlock  does  not  renew  paxos leases directly (although it could).
199         Instead, the renewal of a host's delta lease represents  the  renewal
200         of  all  that  host's  paxos  leases  in the associated lockspace. In
201         effect, many paxos lease renewals are factored  out  into  one  delta
202         lease renewal.  This reduces i/o when many paxos leases are used.
203
204
205       · The  disk  paxos  algorithm  allows  multiple hosts to all attempt to
206         acquire the same paxos lease at once, and will produce a single  win‐
207         ner/owner  of  the  resource lease.  (Shared resource leases are also
208         possible in addition to the default exclusive leases.)
209
210
211       · The disk paxos algorithm involves a specific sequence of reading  and
212         writing  the  sectors  of the paxos lease disk area.  Each host has a
213         dedicated 512 byte sector in the  paxos  lease  disk  area  where  it
214         writes  its own "ballot", and each host reads the entire disk area to
215         see the ballots of other hosts.  The first sector of the disk area is
216         the  "leader  record" that holds the result of the last paxos ballot.
217         The winner of the paxos ballot writes the result of the ballot to the
218         leader  record  (the  winner  of the ballot may have selected another
219         contending host as the owner of the paxos lease.)
220
221
222       · After a paxos lease is acquired, no further i/o is done in the  paxos
223         lease disk area.
224
225
226       · Releasing  the  paxos lease involves writing a single sector to clear
227         the current owner in the leader record.
228
229
230       · If a host holding a paxos lease fails, the disk  area  of  the  paxos
231         lease  still  indicates  that  the paxos lease is owned by the failed
232         host.  If another host attempts to acquire the paxos lease, and finds
233         the  lease  is held by another host_id, it will check the delta lease
234         of that host_id.  If the delta lease of the host_id is being renewed,
235         then  the  paxos lease is owned and cannot be acquired.  If the delta
236         lease of the owner's host_id has expired, then  the  paxos  lease  is
237         expired  and  can  be  taken  (by going through the paxos lease algo‐
238         rithm.)
239
240
241       · The "interaction" or "awareness" between hosts of each other is  lim‐
242         ited  to the case where they attempt to acquire the same paxos lease,
243         and need to check if the referenced delta lease has expired or not.
244
245
246       · When hosts do not attempt to lock the  same  resources  concurrently,
247         there  is  no host interaction or awareness.  The state or actions of
248         one host have no effect on others.
249
250
251       · To speed up checking delta lease expiration (in the case of  a  paxos
252         lease  conflict), sanlock keeps track of past renewals of other delta
253         leases in the lockspace.
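
       The way a contending host decides whether a paxos lease can be
       taken, as described above, can be modelled roughly as follows.
       This is a toy sketch and not the disk paxos algorithm itself; the
       function name, the use of None for a released lease, and the
       dictionary of renewal states are all assumptions made for
       illustration.

```python
def paxos_lease_state(owner_host_id, delta_lease_renewed):
    """Classify a paxos lease as seen by a host that wants to acquire it.

    owner_host_id: host_id recorded as owner in the leader record, or
        None when the owner was cleared by a release (illustrative
        convention, not the on-disk encoding).
    delta_lease_renewed: dict mapping host_id -> True while that host's
        delta lease is still being renewed.
    """
    if owner_host_id is None:
        return "free"
    if delta_lease_renewed.get(owner_host_id, False):
        return "owned"      # owner is alive; the lease cannot be acquired
    return "expired"        # owner's delta lease expired; lease may be taken


if __name__ == "__main__":
    renewals = {1: True, 2: False}
    print(paxos_lease_state(1, renewals))
    print(paxos_lease_state(2, renewals))
    print(paxos_lease_state(None, renewals))
```

       In the "expired" case the lease is still taken by running the
       paxos lease algorithm, so concurrent contenders end up with a
       single owner.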
254
255
256       Resource Index
257
258       The resource index (rindex) is an optional sanlock feature that  appli‐
259       cations  can  use to keep track of resource lease offsets.  Without the
260       rindex, an application must keep track of  where  its  resource  leases
261       exist on disk and find available locations when creating new leases.
262
263       The  sanlock  rindex  uses  two  align-size areas on disk following the
264       lockspace.  The first area holds rindex entries; each entry  records  a
265       resource  lease  name  and  location.   The second area holds a private
266       paxos lease, used by sanlock internally to protect rindex updates.
267
268       The application creates the rindex on disk with the "format"  function.
269       Format  is  a  disk-only  operation and does not interact with the live
270       lockspace, so it can be called  without  first  calling  add_lockspace.
271       The application needs to follow the convention of writing the lockspace
272       at the start of the device (offset 0) and formatting the rindex immedi‐
273       ately  following  the lockspace area.  When formatting, the application
274       must set flags for sector size and align size to match  those  for  the
275       lockspace.
276
277       To use the rindex, the application:
278
279
280       · Uses  the  "create"  function to create a new resource lease on disk.
281         This takes the place of  the  write_resource  function.   The  create
282         function  requires the location of the rindex and the name of the new
283         resource lease.  sanlock finds a free  lease  area,  writes  the  new
284         resource  lease  at  that  location,  updates  the  rindex  with  the
285         name:offset, and returns the offset to the caller.  The  caller  uses
286         this offset when acquiring the resource lease.
287
288
       · Uses the "delete" function to remove a resource lease on disk (also
         corresponding to the write_resource function).  sanlock clears the
         resource lease and the rindex entry for it.  A subsequent call to
         create may use this same disk location for a different resource
         lease.
294
295
296       · Uses the "lookup" function to discover the offset of a resource lease
297         given the resource lease name.  The caller would typically call  this
298         prior to acquiring the resource lease.
299
300
301       · Uses  the  "rebuild" function to recreate the rindex if it is damaged
302         or becomes inconsistent.  This function scans the disk  for  resource
303         leases and creates new rindex entries to match the leases it finds.
304
305
306       · The  "update" function manipulates rindex entries directly and should
307         not normally be used by the application.  In normal usage, the create
308         and  delete  functions  manipulate  rindex entries.  Update is mainly
309         useful for testing or repairs.
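
       The bookkeeping that the create, delete and lookup functions
       perform can be pictured with a small in-memory model.  This is a
       sketch only: the class ToyRindex is hypothetical, it keeps entries
       in a Python dict rather than on disk, and choosing the lowest free
       lease area is an assumption (the text only says that sanlock finds
       a free lease area).  The example also assumes that resource leases
       start right after the two rindex areas.

```python
class ToyRindex:
    """In-memory stand-in for the rindex: maps lease names to offsets."""

    def __init__(self, align_size, first_lease_offset, num_slots):
        self.align_size = align_size
        self.first = first_lease_offset
        self.num_slots = num_slots
        self.entries = {}          # resource name -> byte offset

    def create(self, name):
        """Pick the lowest free lease area and record name -> offset."""
        if name in self.entries:
            raise KeyError("resource exists: " + name)
        used = set(self.entries.values())
        for slot in range(self.num_slots):
            offset = self.first + slot * self.align_size
            if offset not in used:
                self.entries[name] = offset
                return offset
        raise RuntimeError("no free lease area")

    def delete(self, name):
        """Forget the entry so its lease area can be reused."""
        del self.entries[name]

    def lookup(self, name):
        """Return the offset to use when acquiring the lease."""
        return self.entries[name]


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Lockspace at offset 0, two rindex areas at 1M and 2M (assumed layout).
    rindex = ToyRindex(align_size=MIB, first_lease_offset=3 * MIB, num_slots=4)
    print(rindex.create("RA"), rindex.create("RB"))
    rindex.delete("RA")
    print(rindex.create("RC"))     # reuses the area freed by RA
```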
310
311
312       Expiration
313
314
       · If a host fails to renew its delta lease, e.g. it loses access to
         the storage, its delta lease will eventually expire and another host
         will be able to take over any resource leases held by the host.
         sanlock must ensure that the application on two different hosts is
         not holding and using the same lease concurrently.
320
321
322       · When sanlock has failed to renew a delta lease for a period of  time,
323         it  will begin taking measures to stop local processes (applications)
324         from using any resource leases associated with the expiring lockspace
325         delta  lease.   sanlock enters this "recovery mode" well ahead of the
326         time when another host could take  over  the  locally  owned  leases.
327         sanlock  must  have  sufficient time to stop all local processes that
328         are using the expiring leases.
329
330
331       · sanlock uses three methods to stop local  processes  that  are  using
332         expiring leases:
333
334         1.  Graceful  shutdown.   sanlock  will execute a "graceful shutdown"
335         program that the application previously specified for this case.  The
336         shutdown  program  tells  the  application  to  shut down because its
337         leases are expiring.  The application must respond  by  stopping  its
338         activities  and  releasing  its  leases (or exit).  If an application
339         does not specify a graceful shutdown program, sanlock  sends  SIGTERM
340         to  the process instead.  The process must release its leases or exit
341         in a prescribed amount of time (see -g), or sanlock proceeds  to  the
342         next method of stopping.
343
344         2. Forced shutdown.  sanlock will send SIGKILL to processes using the
345         expiring leases.  The processes have a fixed amount of time  to  exit
346         after  receiving  SIGKILL.   If any do not exit in this time, sanlock
347         will proceed to the next method.
348
349         3. Host reset.  sanlock will trigger the host's  watchdog  device  to
350         forcibly  reset  it.   sanlock  carefully  manages  the timing of the
351         watchdog device so that it fires shortly before any other host  could
352         take over the resource leases held by local processes.
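
       The order of the three methods above can be summarized as a small
       decision function.  This is a simplified sketch, not sanlock's
       implementation: the function name, the string results and the
       kill_sec parameter are hypothetical, and the time values used in
       the example are arbitrary rather than sanlock defaults.

```python
def recovery_action(elapsed, processes_left, graceful_sec, kill_sec):
    """Return the stop method that applies at a point in recovery mode.

    elapsed: seconds since recovery mode began.
    processes_left: local processes still using the expiring leases.
    graceful_sec: time allowed for graceful shutdown (compare -g).
    kill_sec: time allowed after SIGKILL; the text only says this
        period is fixed, so the value is left to the caller.
    """
    if processes_left == 0:
        return "none"        # leases are no longer in use locally
    if elapsed < graceful_sec:
        return "graceful"    # shutdown program, or SIGTERM
    if elapsed < graceful_sec + kill_sec:
        return "sigkill"     # forced shutdown
    return "watchdog"        # host reset


if __name__ == "__main__":
    for t in (0, 45, 60):
        print(t, recovery_action(t, processes_left=1,
                                 graceful_sec=40, kill_sec=20))
```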
353
354
355       Failures
356
357       If  a  process holding resource leases fails or exits without releasing
358       its leases, sanlock  will  release  the  leases  for  it  automatically
359       (unless persistent resource leases were used.)
360
361       If  the  sanlock daemon cannot renew a lockspace delta lease for a spe‐
362       cific period of time (see Expiration),  sanlock  will  enter  "recovery
363       mode"  where  it  attempts  to  stop  and/or kill any processes holding
364       resource leases in the expiring lockspace.  If  the  processes  do  not
365       exit  in  time, sanlock will force the host to be reset using the local
366       watchdog device.
367
368       If the sanlock daemon crashes or hangs, it will not  renew  the  expiry
369       time  of the per-lockspace connections it had to the wdmd daemon.  This
370       will lead to the expiration of the local watchdog device, and the  host
371       will be reset.
372
373       Watchdog
374
375       sanlock  uses  the wdmd(8) daemon to access /dev/watchdog.  wdmd multi‐
376       plexes multiple timeouts onto  the  single  watchdog  timer.   This  is
377       required because delta leases for each lockspace are renewed and expire
378       independently.
379
380       sanlock maintains a wdmd connection  for  each  lockspace  delta  lease
381       being  renewed.  Each connection has an expiry time for some seconds in
382       the future.  After each successful delta lease renewal, the expiry time
383       is  renewed for the associated wdmd connection.  If wdmd finds any con‐
384       nection expired, it will not  renew  the  /dev/watchdog  timer.   Given
385       enough  successive  failed  renewals, the watchdog device will fire and
386       reset the host.  (Given the multiplexing nature of wdmd, shorter  over‐
387       lapping  renewal failures from multiple lockspaces could cause spurious
388       watchdog firing.)
389
390       The direct link between delta lease renewals and watchdog renewals pro‐
391       vides  a  predictable watchdog firing time based on delta lease renewal
392       timestamps that are visible from other hosts.  sanlock knows  the  time
393       the  watchdog  on another host has fired based on the delta lease time.
394       Furthermore, if the watchdog device on another host fails to fire  when
395       it should, the continuation of delta lease renewals from the other host
396       will make this evident and prevent leases from  being  taken  from  the
397       failed host.
398
       If sanlock is able to stop/kill all processes using an expiring
       lockspace, the associated wdmd connection for that lockspace is
       removed.  The expired wdmd connection will no longer block
       /dev/watchdog renewals, and the host should avoid being reset.
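
       The multiplexing described in this section can be illustrated with
       a toy model in which each lockspace has its own expiry time and
       the single watchdog timer is renewed only while none of them has
       expired.  The class ToyWdmd and its method names are hypothetical
       and do not reflect the wdmd protocol; times are plain numbers of
       seconds.

```python
class ToyWdmd:
    """Multiplex several per-lockspace expiry times onto one timer."""

    def __init__(self):
        self.expiry = {}   # connection name -> expiry time in seconds

    def renew(self, name, now, timeout):
        """Called after a successful delta lease renewal."""
        self.expiry[name] = now + timeout

    def remove(self, name):
        """Called once all processes using the lockspace are stopped."""
        self.expiry.pop(name, None)

    def should_renew_watchdog(self, now):
        """Renew /dev/watchdog only if no connection has expired."""
        return all(now < t for t in self.expiry.values())


if __name__ == "__main__":
    wdmd = ToyWdmd()
    wdmd.renew("ls1", now=0, timeout=60)
    wdmd.renew("ls2", now=0, timeout=60)
    wdmd.renew("ls1", now=30, timeout=60)   # only ls1 keeps renewing
    print(wdmd.should_renew_watchdog(70))   # ls2 has expired
    wdmd.remove("ls2")                      # its processes were stopped
    print(wdmd.should_renew_watchdog(70))
```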
403
404       Storage
405
406       The sector size and the align size should be  specified  when  creating
407       lockspaces and resources (and rindex).  The "align size" is the size on
408       disk of a lockspace or a resource, i.e. the amount  of  disk  space  it
409       uses.   Lockspaces  and  resources should use matching sector and align
410       sizes, and must use offsets in multiples of the align  size.   The  max
411       number  of  hosts  that  can use a lockspace or resource depends on the
412       combination of sector size and align size, shown below.  The host_id of
413       hosts using the lockspace can be no larger than the max_hosts value for
414       the lockspace.
415
416       Accepted combinations of sector size and align  size,  and  the  corre‐
417       sponding max_hosts (and max host_id) are:
418
       sector_size 512, align_size 1M, max_hosts 2000
       sector_size 4096, align_size 1M, max_hosts 250
       sector_size 4096, align_size 2M, max_hosts 500
       sector_size 4096, align_size 4M, max_hosts 1000
       sector_size 4096, align_size 8M, max_hosts 2000
424
425       When sector_size and align_size are not specified, the behavior matches
426       the behavior before these sizes could be configured: on  devices  which
427       report  sector  size  512, 512/1M/2000 is used, on devices which report
428       sector size 4096, 4096/8M/2000 is used, and on  files,  512/1M/2000  is
429       always  used.  (Other combinations are not compatible with sanlock ver‐
430       sion 3.6 or earlier.)
431
432       Using sanlock on shared block devices that do host based  mirroring  or
433       replication  is  not  likely  to work correctly.  When using sanlock on
434       shared files, all sanlock io should go to one file server.
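
       The size rules in this section lend themselves to a simple check
       that an application could run before initializing leases.  The
       sketch below copies the accepted combinations listed earlier in
       this section into a table; the function check_layout is
       hypothetical, and 1M is taken to mean 1048576 bytes, matching the
       offsets used in the Example section.

```python
MIB = 1024 * 1024

# (sector_size, align_size) -> max_hosts, from the accepted combinations.
MAX_HOSTS = {
    (512, 1 * MIB): 2000,
    (4096, 1 * MIB): 250,
    (4096, 2 * MIB): 500,
    (4096, 4 * MIB): 1000,
    (4096, 8 * MIB): 2000,
}


def check_layout(sector_size, align_size, offset, host_id):
    """Return max_hosts, or raise ValueError if a size rule is broken."""
    key = (sector_size, align_size)
    if key not in MAX_HOSTS:
        raise ValueError("unsupported sector/align combination")
    if offset % align_size:
        raise ValueError("offset is not a multiple of align size")
    if not 1 <= host_id <= MAX_HOSTS[key]:
        raise ValueError("host_id exceeds max_hosts")
    return MAX_HOSTS[key]


if __name__ == "__main__":
    print(check_layout(512, MIB, offset=2 * MIB, host_id=2000))
    print(check_layout(4096, 2 * MIB, offset=0, host_id=500))
```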
435
436       Example
437
       This is an example of creating and using lockspaces and resources from
       the command line.  (Most applications would use sanlock through
       libsanlock rather than through the command line.)

       1.  Allocate shared storage for sanlock leases.

           This example assumes 512 byte sectors on the device, in which case
           the lockspace needs 1MB and each resource needs 1MB.

           The example shared block device accessible to all hosts is
           /dev/leases.

       2.  Start sanlock on all hosts.

           The -w 0 disables use of the watchdog for testing.

           # sanlock daemon -w 0

       3.  Start a dummy application on all hosts.

           This sanlock command registers with sanlock, then execs the sleep
           command which inherits the registered fd.  The sleep process acts
           as the dummy application.  Because the sleep process is registered
           with sanlock, leases can be acquired for it.

           # sanlock client command -c /bin/sleep 600 &

       4.  Create a lockspace for the application (from one host).

           The lockspace is named "test".

           # sanlock client init -s test:0:/dev/leases:0

       5.  Join the lockspace for the application.

           Use a unique host_id on each host.

           host1:
           # sanlock client add_lockspace -s test:1:/dev/leases:0
           host2:
           # sanlock client add_lockspace -s test:2:/dev/leases:0

       6.  Create two resources for the application (from one host).

           The resources are named "RA" and "RB".  Offsets are used on the
           same device as the lockspace.  Different LVs or files could also be
           used.

           # sanlock client init -r test:RA:/dev/leases:1048576
           # sanlock client init -r test:RB:/dev/leases:2097152

       7.  Acquire resource leases for the application on host1.

           Acquire an exclusive lease (the default) on the first resource, and
           a shared lease (SH) on the second resource.

           # export P=`pidof sleep`
           # sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
           # sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P

       8.  Acquire resource leases for the application on host2.

           Acquiring the exclusive lease on the first resource will fail
           because it is held by host1.  Acquiring the shared lease on the
           second resource will succeed.

           # export P=`pidof sleep`
           # sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
           # sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P

       9.  Release resource leases for the application on both hosts.

           The sleep pid could also be killed, which will result in the
           sanlock daemon releasing its leases when it exits.

           # sanlock client release -r test:RA:/dev/leases:1048576 -p $P
           # sanlock client release -r test:RB:/dev/leases:2097152 -p $P

       10. Leave the lockspace for the application.

           host1:
           # sanlock client rem_lockspace -s test:1:/dev/leases:0
           host2:
           # sanlock client rem_lockspace -s test:2:/dev/leases:0

       11. Stop sanlock on all hosts.

           # sanlock shutdown
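
       The -s and -r arguments in the steps above follow a fixed
       colon-separated pattern, which a script driving the command line
       might assemble as sketched here.  The two helper functions are
       hypothetical conveniences, not part of sanlock; the field order is
       inferred from the commands in this example.

```python
MIB = 1024 * 1024


def lockspace_string(name, host_id, path, offset=0):
    """Build a LOCKSPACE argument: name:host_id:path:offset."""
    return "%s:%d:%s:%d" % (name, host_id, path, offset)


def resource_string(lockspace, resource, path, offset, shared=False):
    """Build a RESOURCE argument: lockspace:resource:path:offset[:SH]."""
    s = "%s:%s:%s:%d" % (lockspace, resource, path, offset)
    return s + ":SH" if shared else s


if __name__ == "__main__":
    # Strings matching steps 5 and 7 of the example.
    print(lockspace_string("test", 1, "/dev/leases"))
    print(resource_string("test", "RA", "/dev/leases", 1 * MIB))
    print(resource_string("test", "RB", "/dev/leases", 2 * MIB, shared=True))
```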
537
538
539

OPTIONS

       COMMAND can be one of three primary top level choices:

       sanlock daemon    start daemon
       sanlock client    send request to daemon (default command if none given)
       sanlock direct    access storage directly (no coordination with daemon)

   Daemon Command
       sanlock daemon [options]

       -D        no fork and print all logging to stderr

       -Q 0|1    quiet error messages for common lock contention

       -R 0|1    renewal debugging, log debug info for each renewal

       -L pri    write logging at priority level and up to logfile (-1 none)

       -S pri    write logging at priority level and up to syslog (-1 none)

       -U uid    user id

       -G gid    group id

       -H num    renewal history size

       -t num    max worker threads

       -g sec    seconds for graceful recovery

       -w 0|1    use watchdog through wdmd

       -h 0|1    use high priority (RR) scheduling

       -l num    use mlockall (0 none, 1 current, 2 current and future)

       -b sec    seconds a host id bit will remain set in delta lease bitmap

       -e str    local host name used in delta leases
580
581
582
583   Client Command
584       sanlock client action [options]
585
586       sanlock client status
587
588       Print processes, lockspaces, and resources being managed by the sanlock
589       daemon.  Add -D to show extra internal  daemon  status  for  debugging.
590       Add  -o  p  to  show  resources  by  pid,  or -o s to show resources by
591       lockspace.
592
593       sanlock client host_status
594
595       Print state of host_id delta  leases  read  during  the  last  renewal.
596       State  of  all  lockspaces  is shown (use -s to select one).  Add -D to
597       show extra internal daemon status for debugging.
598
599       sanlock client gets
600
601       Print lockspaces being managed by the sanlock  daemon.   The  LOCKSPACE
602       string  will  be  followed  by ADD or REM if the lockspace is currently
603       being added or removed.  Add -h 1 to also show hosts in each lockspace.
604
605       sanlock client renewal -s LOCKSPACE
606
607       Print a history of renewals with timing details.  See the Renewal  his‐
608       tory section below.
609
610       sanlock client log_dump
611
612       Print the sanlock daemon internal debug log.
613
614       sanlock client shutdown
615
616       Ask  the  sanlock daemon to exit.  Without the force option (-f 0), the
617       command will be ignored if any lockspaces exist.  With the force option
618       (-f  1), any registered processes will be killed, their resource leases
619       released, and lockspaces removed.  With the wait  option  (-w  1),  the
620       command  will  wait for a result from the daemon indicating that it has
621       shut down and is exiting, or cannot shut down because lockspaces  exist
622       (command fails).
623
624       sanlock client init -s LOCKSPACE
625
626       Tell  the  sanlock  daemon  to  initialize a lockspace on disk.  The -o
627       option can be used to specify the io  timeout  to  be  written  in  the
628       host_id  leases.  The -Z and -A options can be used to specify the sec‐
629       tor size and align size, and both should be set  together.   (Also  see
630       sanlock direct init.)
631
632       sanlock client init -r RESOURCE
633
634       Tell the sanlock daemon to initialize a resource lease on disk.  The -Z
635       and -A options can be used to specify the sector size and  align  size,
636       and both should be set together.  (Also see sanlock direct init.)
637
638       sanlock client read -s LOCKSPACE
639
640       Tell  the  sanlock  daemon  to  read  a  lockspace from disk.  Only the
641       LOCKSPACE path and offset are required.  If host_id is zero, the  first
642       record  at  offset  (host_id  1)  is  used.   The complete LOCKSPACE is
643       printed.  Add -D to print other  details.   (Also  see  sanlock  direct
644       read_leader.)
645
646       sanlock client read -r RESOURCE
647
648       Tell  the  sanlock daemon to read a resource lease from disk.  Only the
649       RESOURCE path and  offset  are  required.   The  complete  RESOURCE  is
650       printed.   Add  -D  to  print  other details.  (Also see sanlock direct
651       read_leader.)
652
653       sanlock client add_lockspace -s LOCKSPACE
654
655       Tell the sanlock  daemon  to  acquire  the  specified  host_id  in  the
656       lockspace.   This will allow resources to be acquired in the lockspace.
657       The -o option can be used to specify the io timeout  of  the  acquiring
658       host, and will be written in the host_id lease.
659
660       sanlock client inq_lockspace -s LOCKSPACE
661
662       Inquire about the state of the lockspace in the sanlock daemon, whether
663       it is being added or removed, or is joined.
664
665       sanlock client rem_lockspace -s LOCKSPACE
666
667       Tell the sanlock  daemon  to  release  the  specified  host_id  in  the
668       lockspace.   Any  processes  holding  resource leases in this lockspace
669       will be killed, and the resource leases not released.
670
671       sanlock client command -r RESOURCE -c path args
672
673       Register with the sanlock daemon, acquire the specified resource lease,
674       and  exec  the  command at path with args.  When the command exits, the
675       sanlock daemon will release the lease.  -c must be the final option.
676
677       sanlock client acquire -r RESOURCE -p pid
678       sanlock client release -r RESOURCE -p pid
679
680       Tell the sanlock daemon to acquire or release  the  specified  resource
681       lease  for  the given pid.  The pid must be registered with the sanlock
682       daemon.  acquire  can  optionally  take  a  versioned  RESOURCE  string
683       RESOURCE:lver,  where  lver  is  the  version of the lease that must be
684       acquired, or fail.
685
686       sanlock client convert -r RESOURCE -p pid
687
688       Tell the sanlock daemon to convert the mode of the  specified  resource
689       lease  for the given pid.  If the existing mode is exclusive (default),
690       the mode of the lease can be converted to shared with RESOURCE:SH.   If
691       the  existing mode is shared, the mode of the lease can be converted to
692       exclusive with RESOURCE (no :SH suffix).
693
694       sanlock client inquire -p pid
695
       Print the resource leases held by the given pid.  The format is a
       versioned RESOURCE string "RESOURCE:lver" where lver is the version of
       the lease held.
699
700       sanlock client request -r RESOURCE -f force_mode
701
702       Request the owner of a resource do something specified  by  force_mode.
703       A  versioned  RESOURCE:lver  string must be used with a greater version
704       than is presently held.  Zero lver and force_mode clears the request.
705
706       sanlock client examine -r RESOURCE
707
708       Examine the request record for the currently held  resource  lease  and
709       carry out the action specified by the requested force_mode.
710
711       sanlock client examine -s LOCKSPACE
712
713       Examine  requests  for  all resource leases currently held in the named
714       lockspace.  Only lockspace_name is used from the LOCKSPACE argument.
715
716       sanlock client set_event -s LOCKSPACE -i host_id -g gen -e num -d num
717
718       Set an event for another host.  When the sanlock daemon next renews its
719       delta  lease  for the lockspace it will: set the bit for the host_id in
720       its bitmap, and set the generation, event and data values  in  its  own
721       delta  lease.   An application that has registered for events from this
722       lockspace on the destination host will get the event that has been  set
723       when  the  destination  sees  the  event  during  its  next delta lease
724       renewal.
725
726       sanlock client set_config -s LOCKSPACE
727
728       Set a configuration value for a lockspace.  Only lockspace_name is used
729       from  the  LOCKSPACE  argument.  The USED flag has the same effect on a
730       lockspace as a process holding a resource lease  that  will  not  exit.
731       The  USED_BY_ORPHANS flag means that an orphan resource lease will have
732       the same effect as the USED.
733       -u 0|1 Set (1) or clear (0) the USED flag.
734       -O 0|1 Set (1) or clear (0) the USED_BY_ORPHANS flag.
735
736       sanlock client format -x RINDEX
737
738       Create a resource index on disk.  Use -Z and -A to set the sector  size
739       and align size to match the lockspace.
740
741       sanlock client create -x RINDEX -e resource_name
742
743       Create  a  new  resource lease on disk, using the rindex to find a free
744       offset.
745
746       sanlock client delete -x RINDEX -e resource_name[:offset]
747
748       Delete an existing resource lease on disk.
749
750       sanlock client lookup -x RINDEX -e resource_name
751
       Look up the offset of an existing resource lease by name on disk, using
       the rindex.  With no -e option, lookup returns the next free lease
       offset.  If -e specifies both name and offset, the lookup verifies that
       both are correct.
756
757       sanlock client update -x RINDEX -e resource_name[:offset] [-z 0|1]
758
759       Add (-z 0) or remove (-z 1) an rindex entry on disk.
760
761       sanlock client rebuild -x RINDEX
762
763       Rebuild the rindex entries by scanning the disk for resource leases.
764
765
766
767   Direct Command
768       sanlock direct action [options]
769
770
771       -o sec io timeout in seconds
772
773       sanlock direct init -s LOCKSPACE
774       sanlock direct init -r RESOURCE
775
776       Initialize  storage  for  a  lockspace  or resource.  Use the -Z and -A
777       flags to specify the sector size and align size.  The  max  hosts  that
778       can use the lockspace/resource (and the max possible host_id) is deter‐
779       mined by the sector/align size combination.  Possible combinations are:
780       512/1M,  4096/1M,  4096/2M, 4096/4M, 4096/8M.  Lockspaces and resources
781       both use the same amount of space (align_size)  for  each  combination.
782       When  initializing  a  lockspace,  sanlock initializes delta leases for
783       max_hosts in the given space.  When initializing  a  resource,  sanlock
784       initializes  a single paxos lease in the space.  With -s, the -o option
785       specifies the io timeout to be written in the host_id leases.  With -r,
786       the  -z 1 option invalidates the resource lease on disk so it cannot be
787       used until reinitialized normally.
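
       As a rough illustration of the sector/align size combinations listed
       above, the following Python sketch rejects unlisted combinations and
       reports how many sectors fit in one lockspace or resource area.  The
       names VALID_COMBINATIONS and sectors_per_area are hypothetical and are
       not part of sanlock.  The sector count is only an upper bound on the
       number of hosts; the exact max hosts per combination is not stated
       here, other than 2000 hosts for 512/1M in the Disk Format section.

```python
MIB = 1024 * 1024

# Sector size / align size combinations named in the text (in bytes).
VALID_COMBINATIONS = {
    (512, 1 * MIB),
    (4096, 1 * MIB),
    (4096, 2 * MIB),
    (4096, 4 * MIB),
    (4096, 8 * MIB),
}


def sectors_per_area(sector_size: int, align_size: int) -> int:
    """Return the number of sectors in one lockspace or resource area.

    Raises ValueError for a combination the text does not list.
    This is an upper bound on usable host_ids, not the exact max_hosts.
    """
    if (sector_size, align_size) not in VALID_COMBINATIONS:
        raise ValueError(
            f"unsupported sector/align combination: {sector_size}/{align_size}"
        )
    return align_size // sector_size


if __name__ == "__main__":
    for sector_size, align_size in sorted(VALID_COMBINATIONS):
        count = sectors_per_area(sector_size, align_size)
        print(f"{sector_size}/{align_size // MIB}M: {count} sectors")
```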
788
789       sanlock direct read_leader -s LOCKSPACE
790       sanlock direct read_leader -r RESOURCE
791
792       Read a leader record from disk and print the fields.  The leader record
793       is  the  single sector of a delta lease, or the first sector of a paxos
794       lease.
795
796       sanlock direct dump path[:offset[:size]]
797
798       Read disk sectors and print leader records for delta or  paxos  leases.
799       Add  -f 1 to print the request record values for paxos leases, host_ids
800       set in delta lease bitmaps, and rindex entries.
801
802       sanlock direct format -x RINDEX
803       sanlock direct lookup -x RINDEX -e resource_name
804       sanlock direct update -x RINDEX -e resource_name[:offset] [-z 0|1]
805       sanlock direct rebuild -x RINDEX
806
807       Access the resource index on disk without  going  through  the  sanlock
808       daemon.   This  precludes  using  the  internal  paxos lease to protect
809       rindex modifications.  See client equivalents for descriptions.
810
811
812
813   LOCKSPACE option string
814       -s lockspace_name:host_id:path:offset
815
816       lockspace_name name of lockspace
817       host_id local host identifier in lockspace
818       path path to storage to use for leases
819       offset offset on path (bytes)
820
821
822   RESOURCE option string
823       -r lockspace_name:resource_name:path:offset
824
825       lockspace_name name of lockspace
826       resource_name name of resource
       path path to storage to use for leases
828       offset offset on path (bytes)
829
830
831   RESOURCE option string with suffix
832       -r lockspace_name:resource_name:path:offset:lver
833
834       lver leader version
835
836       -r lockspace_name:resource_name:path:offset:SH
837
838       SH indicates shared mode
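
       The RESOURCE string forms above can be split into fields with a few
       lines of Python.  This is a minimal sketch, not sanlock's own parser:
       the Resource class and parse_resource function are hypothetical names,
       and the sketch assumes that no field (including path) contains a
       colon.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    lockspace_name: str
    resource_name: str
    path: str
    offset: int                  # offset on path, in bytes
    lver: Optional[int] = None   # leader version suffix, if one was given
    shared: bool = False         # True when the suffix is SH


def parse_resource(text: str) -> Resource:
    """Parse lockspace_name:resource_name:path:offset[:lver|:SH].

    Assumption: fields never contain ':' themselves.
    """
    fields = text.split(":")
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 colon-separated fields: {text!r}")
    res = Resource(fields[0], fields[1], fields[2], int(fields[3]))
    if len(fields) == 5:
        if fields[4] == "SH":
            res.shared = True
        else:
            res.lver = int(fields[4])
    return res


if __name__ == "__main__":
    print(parse_resource("foo:example1:/path:1048576"))
    print(parse_resource("foo:example1:/path:1048576:SH"))
    print(parse_resource("foo:example1:/path:1048576:3"))
```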
839
840
841   RINDEX option string
842       -x lockspace_name:path:offset
843
844       lockspace_name name of lockspace
845       path path to storage to use for leases
846       offset offset on path (bytes) of rindex
847
848
849
850   Defaults
851       sanlock help shows the default values for the options above.
852
853       sanlock version shows the build version.
854
855

OTHER

857   Request/Examine
858       The first part of making a  request  for  a  resource  is  writing  the
859       request  record  of  the  resource  (the  sector  following  the leader
860       record).  To make a successful request:
861
862       · RESOURCE:lver must be greater than the lver  presently  held  by  the
863         other  host.  This implies the leader record must be read to discover
864         the lver, prior to making a request.
865
866       · RESOURCE:lver must be greater than or equal  to  the  lver  presently
867         written  to the request record.  Two hosts may write a new request at
868         the same time for the same lver, in which case  both  would  succeed,
869         but the force_mode from the last would win.
870
871       · The force_mode must be greater than zero.
872
       · To unconditionally clear the request record (set both lver and
         force_mode to 0), make a request with RESOURCE:0 and force_mode 0.
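
       The four rules above can be restated as a small decision function.
       The following Python sketch only models the rules as written; it does
       not read or write any disk sector.  check_request and its parameter
       names are hypothetical, and held_lver and recorded_lver stand for
       values a caller would first read from the leader record and the
       request record.

```python
FORCE = 1      # force_mode values described under Request/Examine
GRACEFUL = 2


def check_request(req_lver: int, force_mode: int,
                  held_lver: int, recorded_lver: int) -> str:
    """Return "clear" or "write" for a request, or raise ValueError.

    req_lver      -- lver given in RESOURCE:lver
    held_lver     -- lver presently held by the owner (leader record)
    recorded_lver -- lver presently written in the request record
    """
    # Unconditional clear: both lver and force_mode are zero.
    if req_lver == 0 and force_mode == 0:
        return "clear"
    if force_mode <= 0:
        raise ValueError("force_mode must be greater than zero")
    if req_lver <= held_lver:
        raise ValueError("lver must be greater than the lver presently held")
    if req_lver < recorded_lver:
        raise ValueError("lver must be >= the lver in the request record")
    return "write"


if __name__ == "__main__":
    print(check_request(5, FORCE, held_lver=4, recorded_lver=5))
    print(check_request(0, 0, held_lver=4, recorded_lver=5))
```

       Note that a request with an lver equal to the recorded lver is
       accepted, which is why two hosts writing the same lver can both
       succeed, with the last force_mode written taking effect.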
875
876
       The owner of the requested resource will not know of the request unless
       it is explicitly told to examine its resources via the "examine"
       api/command, or is otherwise notified.
880
881       The second part of making a request is  notifying  the  resource  lease
882       owner  that  it  should  examine  the  request  records of its resource
883       leases.  The notification will cause the lease owner  to  automatically
884       run  the  equivalent  of  "sanlock client examine -s LOCKSPACE" for the
885       lockspace of the requested resource.
886
       The notification is made using a bitmap in each host_id delta lease.
       Each bit represents one of the possible host_ids (1-2000).  If host A
889       wants to notify host B to examine its resources, A sets the bit in  its
890       own  bitmap  that  corresponds to the host_id of B.  When B next renews
891       its delta lease, it reads the delta leases for  all  hosts  and  checks
892       each  bitmap  to see if its own host_id has been set.  It finds the bit
893       for its own host_id set  in  A's  bitmap,  and  examines  its  resource
894       request  records.   (The  bit  remains  set  in A's bitmap for set_bit‐
895       map_seconds.)
896
897       force_mode determines the action the resource lease owner should take:
898
899
900       · FORCE (1): kill the process holding the  resource  lease.   When  the
901         process has exited, the resource lease will be released, and can then
902         be acquired by anyone.  The kill signal is  SIGKILL  (or  SIGTERM  if
903         SIGKILL is restricted.)
904
905
906       · GRACEFUL  (2): run the program configured by sanlock_killpath against
907         the process holding the resource lease.  If no killpath  is  defined,
908         then FORCE is used.
909
910
911   Persistent and orphan resource leases
912       A  resource  lease can be acquired with the PERSISTENT flag (-P 1).  If
913       the process holding the lease exits, the lease will  not  be  released,
914       but  kept  on  an  orphan  list.   Another local process can acquire an
915       orphan lease using the ORPHAN flag (-O 1), or release the orphan  lease
916       using  the  ORPHAN  flag  (-O 1).  All orphan leases can be released by
917       setting the lockspace name (-s lockspace_name) with no resource name.
918
919
920   Renewal history
921       sanlock saves a limited history of lease renewal  information  in  each
922       lockspace.   See sanlock.conf renewal_history_size to set the amount of
923       history or to disable (set to 0).
924
925       IO times are measured in delta lease renewal (each delta lease  renewal
926       includes one read and one write).
927
928       For each successful renewal, a record is saved that includes:
929
930       · the timestamp written in the delta lease by the renewal
931
932       · the time in milliseconds taken by the delta lease read
933
934       · the time in milliseconds taken by the delta lease write
935
936
       Also counted and recorded are the number of io timeouts and other io
       errors that occur between successful renewals.
939
940       Two consecutive successful renewals would be recorded as:
941       timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
942       timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0
943
944       Those fields are:
945
946
947       · timestamp is the value written  into  the  delta  lease  during  that
948         renewal.
949
950
951       · read_ms/write_ms   are   the   milliseconds  taken  for  the  renewal
952         read/write ios.
953
954
       · next_timeouts is the number of io timeouts that occurred after the
         renewal recorded on that line, and before the next successful renewal
         on the following line.
958
959
       · next_errors is the number of io errors (not timeouts) that occurred
         after the renewal recorded on that line, and before the next
         successful renewal on the following line.
963
964
965       The command 'sanlock client renewal -s lockspace_name' reports the full
966       history  of renewals saved by sanlock, which by default is 180 records,
967       about 1 hour of history when using a 20 second renewal interval  for  a
968       10 second io timeout.
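
       Renewal records in the key=value form shown above are easy to process
       with a script.  The following Python sketch parses the two example
       records and reproduces the "about 1 hour" figure.  It assumes that
       every field is an integer and that the output contains only such
       record lines; parse_renewal and history_span_seconds are hypothetical
       helper names.

```python
from typing import Dict, List

EXAMPLE = """\
timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0
"""


def parse_renewal(line: str) -> Dict[str, int]:
    """Split one renewal record into a dict of integer fields."""
    record = {}
    for field in line.split():
        key, value = field.split("=", 1)
        record[key] = int(value)
    return record


def history_span_seconds(records: int = 180, renewal_interval: int = 20) -> int:
    """Approximate time covered by a full history, in seconds."""
    return records * renewal_interval


if __name__ == "__main__":
    history: List[Dict[str, int]] = [
        parse_renewal(line) for line in EXAMPLE.splitlines()
    ]
    slowest_write = max(r["write_ms"] for r in history)
    gap = history[1]["timestamp"] - history[0]["timestamp"]
    print(f"slowest write: {slowest_write} ms, gap between renewals: {gap}")
    print(f"default history span: {history_span_seconds()} seconds")
```

       With the defaults, 180 records at a 20 second renewal interval cover
       3600 seconds.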
969
970

INTERNALS

972   Disk Format
973       · This example uses 512 byte sectors.
974
975       · Each  lockspace  is 1MB.  It holds 2000 delta_leases, one per sector,
976         supporting up to 2000 hosts.
977
978       · Each paxos_lease is 1MB.  It is used as a lease for one resource.
979
980       · The leader_record structure is used differently by each lease type.
981
982       · To display all leader_record fields, see sanlock direct read_leader.
983
984       · A lockspace is often followed on disk by the paxos_leases used within
985         that lockspace, but this layout is not required.
986
987       · The request_record and host_id bitmap are used for requests/events.
988
989       · The mode_block contains the SHARED flag indicating a lease is held in
990         the shared mode.
991
992       · In a  lockspace,  the  host  using  host_id  N  writes  to  a  single
993         delta_lease in sector N-1.  No other hosts write to this sector.  All
994         hosts read all lockspace sectors when renewing their own delta_lease,
995         and are able to monitor renewals of all delta_leases.
996
997       · In a paxos_lease, each host has a dedicated sector it writes to, con‐
998         taining its own paxos_dblock and mode_block structures.   Its  sector
999         is based on its host_id; host_id 1 writes to the dblock/mode_block in
1000         sector 2 of the paxos_lease.
1001
1002       · The paxos_dblock structures are used by  the  paxos_lease  algorithm,
1003         and the result is written to the leader_record.
1004
1005
1006       0x000000 lockspace foo:0:/path:0
1007
1008       (There  is  no representation on disk of the lockspace in general, only
1009       the sequence of specific delta_leases which collectively represent  the
1010       lockspace.)
1011
1012       delta_lease foo:1:/path:0
1013       0x000 0         leader_record         (sector 0, for host_id 1)
1014                       magic: 0x12212010
1015                       space_name: foo
1016                       resource_name: host uuid/name
1017                       ...
1018                       host_id bitmap        (leader_record + 256)
1019
1020       delta_lease foo:2:/path:0
1021       0x200 512       leader_record         (sector 1, for host_id 2)
1022                       magic: 0x12212010
1023                       space_name: foo
1024                       resource_name: host uuid/name
1025                       ...
1026                       host_id bitmap        (leader_record + 256)
1027
1028       delta_lease foo:3:/path:0
1029       0x400 1024      leader_record         (sector 2, for host_id 3)
1030                       magic: 0x12212010
1031                       space_name: foo
1032                       resource_name: host uuid/name
1033                       ...
1034                       host_id bitmap        (leader_record + 256)
1035
1036       delta_lease foo:2000:/path:0
1037       0xF9E00         leader_record         (sector 1999, for host_id 2000)
1038                       magic: 0x12212010
1039                       space_name: foo
1040                       resource_name: host uuid/name
1041                       ...
1042                       host_id bitmap        (leader_record + 256)
1043
1044       0x100000 paxos_lease foo:example1:/path:1048576
1045       0x000 0         leader_record         (sector 0)
1046                       magic: 0x06152010
1047                       space_name: foo
1048                       resource_name: example1
1049
1050       0x200 512       request_record        (sector 1)
1051                       magic: 0x08292011
1052
1053       0x400 1024      paxos_dblock          (sector 2, for host_id 1)
1054       0x480 1152      mode_block            (paxos_dblock + 128)
1055
1056       0x600 1536      paxos_dblock          (sector 3, for host_id 2)
1057       0x680 1664      mode_block            (paxos_dblock + 128)
1058
1059       0x800 2048      paxos_dblock          (sector 4, for host_id 3)
1060       0x880 2176      mode_block            (paxos_dblock + 128)
1061
1062       0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
1063       0xFA280         mode_block            (paxos_dblock + 128)
1064
1065       0x200000 paxos_lease foo:example2:/path:2097152
1066       0x000 0         leader_record         (sector 0)
1067                       magic: 0x06152010
1068                       space_name: foo
1069                       resource_name: example2
1070
1071       0x200 512       request_record        (sector 1)
1072                       magic: 0x08292011
1073
1074       0x400 1024      paxos_dblock          (sector 2, for host_id 1)
1075       0x480 1152      mode_block            (paxos_dblock + 128)
1076
1077       0x600 1536      paxos_dblock          (sector 3, for host_id 2)
1078       0x680 1664      mode_block            (paxos_dblock + 128)
1079
1080       0x800 2048      paxos_dblock          (sector 4, for host_id 3)
1081       0x880 2176      mode_block            (paxos_dblock + 128)
1082
1083       0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
1084       0xFA280         mode_block            (paxos_dblock + 128)
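
       The offsets in the layout above follow directly from the rules listed
       at the start of this section.  The Python sketch below recomputes them
       for the 512 byte sector example; the function names are hypothetical,
       and the constants are taken from the layout (host_id bitmap at
       leader_record + 256, mode_block at paxos_dblock + 128).

```python
SECTOR_SIZE = 512        # this example uses 512 byte sectors
MAX_HOSTS = 2000
BITMAP_OFFSET = 256      # host_id bitmap = leader_record + 256
MODE_BLOCK_OFFSET = 128  # mode_block = paxos_dblock + 128


def _check(host_id: int) -> None:
    if not 1 <= host_id <= MAX_HOSTS:
        raise ValueError(f"host_id out of range: {host_id}")


def delta_lease_offset(lockspace_offset: int, host_id: int) -> int:
    """host_id N writes its delta_lease in sector N-1 of the lockspace."""
    _check(host_id)
    return lockspace_offset + (host_id - 1) * SECTOR_SIZE


def host_id_bitmap_offset(lockspace_offset: int, host_id: int) -> int:
    return delta_lease_offset(lockspace_offset, host_id) + BITMAP_OFFSET


def request_record_offset(resource_offset: int) -> int:
    """The request_record is the sector following the leader_record."""
    return resource_offset + SECTOR_SIZE


def paxos_dblock_offset(resource_offset: int, host_id: int) -> int:
    """Sectors 0 and 1 hold the leader and request records, so
    host_id N uses sector N+1 of the paxos_lease."""
    _check(host_id)
    return resource_offset + (host_id + 1) * SECTOR_SIZE


def mode_block_offset(resource_offset: int, host_id: int) -> int:
    return paxos_dblock_offset(resource_offset, host_id) + MODE_BLOCK_OFFSET


if __name__ == "__main__":
    print(hex(delta_lease_offset(0, 2000)))    # 0xf9e00
    print(hex(paxos_dblock_offset(0, 2000)))   # 0xfa200
    print(hex(mode_block_offset(0, 2000)))     # 0xfa280
```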
1085
1086
1087   Lease ownership
1088       Not  shown  in  the  leader_record  structures  above are the owner_id,
1089       owner_generation and timestamp  fields.   These  are  the  fields  that
1090       define the lease owner.
1091
1092       The  delta_lease at sector N for host_id N+1 has leader_record.owner_id
1093       N+1.  The leader_record.owner_generation is incremented each  time  the
1094       delta_lease   is   acquired.   When  a  delta_lease  is  acquired,  the
1095       leader_record.timestamp field is set to the time of the  host  and  the
1096       leader_record.resource_name  is  set  to  the  unique name of the host.
1097       When   the   host   renews   the   delta_lease,   it   writes   a   new
1098       leader_record.timestamp.  When a host releases a delta_lease, it writes
1099       zero to leader_record.timestamp.
1100
1101       When a host acquires a  paxos_lease,  it  uses  the  host_id/generation
1102       value  from  the  delta_lease  it holds in the lockspace.  It uses this
1103       host_id/generation to identify itself in the paxos_dblock when  running
1104       the  paxos  algorithm.   The  result  of  the  algorithm is the winning
1105       host_id/generation - the new owner of  the  paxos_lease.   The  winning
1106       host_id/generation      are      written     to     the     paxos_lease
1107       leader_record.owner_id and  leader_record.owner_generation  fields  and
1108       leader_record.timestamp is set.  When a host releases a paxos_lease, it
1109       sets leader_record.timestamp to 0.
1110
1111       When a paxos_lease is free  (leader_record.timestamp  is  0),  multiple
1112       hosts  may  attempt  to  acquire  it.   The  paxos algorithm, using the
1113       paxos_dblock structures, will select only one of the hosts as  the  new
1114       owner, and that owner is written in the leader_record.  The paxos_lease
1115       will no longer be free (non-zero timestamp).  Other hosts will see this
1116       and will not attempt to acquire the paxos_lease until it is free again.
1117
1118       If  a  paxos_lease is owned (non-zero timestamp), but the owner has not
1119       renewed its delta_lease for a specific length of time, then  the  owner
1120       value  in the paxos_lease becomes expired, and other hosts will use the
1121       paxos algorithm to acquire the paxos_lease, and set a new owner.
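
       The three ownership states described above (free, owned, expired) can
       be summarized in a short function.  This Python sketch is a deliberate
       simplification: paxos_lease_state and its parameters are hypothetical,
       the expiry length is passed in because it is not specified here, and
       all times are assumed to be comparable on a single clock, which real
       hosts cannot take for granted.

```python
def paxos_lease_state(leader_timestamp: int,
                      owner_last_renewal: int,
                      now: int,
                      expire_seconds: int) -> str:
    """Classify a paxos_lease as "free", "owned" or "expired".

    leader_timestamp   -- leader_record.timestamp of the paxos_lease
    owner_last_renewal -- time the owner last renewed its delta_lease
    now                -- current time on the same (assumed) clock
    expire_seconds     -- hypothetical expiry length; not given in the text
    """
    if leader_timestamp == 0:
        return "free"        # released: any host may run the paxos algorithm
    if now - owner_last_renewal >= expire_seconds:
        return "expired"     # owner stopped renewing: others may acquire
    return "owned"           # other hosts do not attempt to acquire


if __name__ == "__main__":
    print(paxos_lease_state(0, 0, now=1000, expire_seconds=80))
    print(paxos_lease_state(900, 990, now=1000, expire_seconds=80))
    print(paxos_lease_state(900, 900, now=1000, expire_seconds=80))
```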
1122
1123

FILES

1125       /etc/sanlock/sanlock.conf
1126
1127
1128       · quiet_fail = 1
1129         See -Q
1130
1131
1132       · debug_renew = 0
1133         See -R
1134
1135
1136       · logfile_priority = 4
1137         See -L
1138
1139
1140       · logfile_use_utc = 0
1141         Use UTC instead of local time in log messages.
1142
1143
1144       · syslog_priority = 3
1145         See -S
1146
1147
1148       · names_log_priority = 4
1149         Log resource names at this priority level (uses syslog priority  num‐
1150         bers).   If  this  is greater than or equal to logfile_priority, each
1151         requested resource name and location is recorded in sanlock.log.
1152
1153
1154       · use_watchdog = 1
1155         See -w
1156
1157
1158       · high_priority = 1
1159         See -h
1160
1161
1162       · mlock_level = 1
1163         See -l
1164
1165
1166       · sh_retries = 8
         The number of times to retry acquiring a paxos lease for a shared
         lease while the paxos lease is held by another host that is also
         acquiring a shared lease.
1170
1171
1172       · uname = sanlock
1173         See -U
1174
1175
1176       · gname = sanlock
1177         See -G
1178
1179
1180       · our_host_name = <str>
1181         See -e
1182
1183
1184       · renewal_read_extend_sec = <seconds>
1185         If a renewal read i/o times out, wait this  many  additional  seconds
1186         for  that  read  to  complete  at the start of the subsequent renewal
1187         attempt.  When  not  configured,  sanlock  waits  for  an  additional
1188         io_timeout seconds for a previous timed out read to complete.
1189
1190
1191       · renewal_history_size = 180
1192         See -H
1193
1194
1195       · paxos_debug_all = 0
1196         Include all details in the paxos debug logging.
1197
1198
1199       · debug_io = <str>
1200         Add  debug logging for each i/o.  "submit" (no quotes) produces debug
1201         output at submission time, "complete" produces debug output  at  com‐
1202         pletion time, and "submit,complete" (no space) produces both.
1203
1204
1205       · max_sectors_kb = <str>|<num>
1206         Set  to  "ignore"  (no  quotes)  to  prevent sanlock from checking or
1207         changing max_sectors_kb  for  the  lockspace  disk  when  starting  a
1208         lockspace.   Set to "align" (no quotes) to set max_sectors_kb for the
1209         lockspace disk to the align size of the lockspace.  Set to  a  number
1210         to set a specific number of KB for all lockspace disks.
1211
1212
1213       · debug_clients = 0
1214         Enable  or  disable  debug  logging for all client connections to the
1215         sanlock daemon.
1216
1217
1218       · debug_cmd = +|-<name>
1219         Enable (+name) or disable (-name) debug logging at the  command  pro‐
1220         cessing  level  for  specifically  named  commands, e.g. "debug_cmd =
1221         +acquire", or "debug_cmd = -inq_lockspace".   Repeat  this  line  for
1222         each command name.  Use a plus prefix before the name to enable and a
         minus prefix to disable.  By default sanlock disables some command
         level debugging for commands that are often repetitive and fill the
         in-memory debug buffer.  This only affects debug logging, not errors
1226         or warnings, and disabling command level debugging for a command does
1227         not disable lower level debugging for that command.   Special  values
1228         +all  and -all can be used to enable or disable all commands, and can
1229         be used before or after other debug_cmd lines.
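
       For tooling that needs to inspect these settings, a "key = value" file
       of this shape can be read with a few lines of Python.  This is a
       sketch under assumptions, not sanlock's own parser: parse_conf is a
       hypothetical name, and the handling of blank lines, '#' comments and
       surrounding whitespace is assumed rather than taken from the text.
       Repeated debug_cmd lines are collected in order, as the description
       above allows one line per command name.

```python
from typing import Dict, List, Tuple


def parse_conf(text: str) -> Tuple[Dict[str, str], List[str]]:
    """Return (settings, debug_cmds) from sanlock.conf style text.

    settings   -- last value seen for each key other than debug_cmd
    debug_cmds -- every debug_cmd value (e.g. "+acquire"), in file order
    """
    settings: Dict[str, str] = {}
    debug_cmds: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):   # assumed comment syntax
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'key = value': {raw!r}")
        key, value = key.strip(), value.strip()
        if key == "debug_cmd":
            debug_cmds.append(value)
        else:
            settings[key] = value
    return settings, debug_cmds


if __name__ == "__main__":
    sample = """\
    use_watchdog = 1
    renewal_history_size = 180
    debug_cmd = +acquire
    debug_cmd = -inq_lockspace
    """
    settings, debug_cmds = parse_conf(sample)
    print(settings)
    print(debug_cmds)
```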
1230
1231

SEE ALSO

1233       wdmd(8)
1234
1235
1236
1237
1238                                  2015-01-23                        SANLOCK(8)