1 SANLOCK(8) System Manager's Manual SANLOCK(8)
2
3
4
5 NAME
6 sanlock - shared storage lock manager
7
8
9 SYNOPSIS
10 sanlock [COMMAND] [ACTION] ...
11
12
13 DESCRIPTION
14 sanlock is a lock manager built on shared storage. Hosts with access
15 to the storage can perform locking. An application running on the
16 hosts is given a small amount of space on the shared block device or
17 file, and uses sanlock for its own application-specific synchroniza‐
18 tion. Internally, the sanlock daemon manages locks using two disk-
19 based lease algorithms: delta leases and paxos leases.
20
21
22 · delta leases are slow to acquire and demand regular i/o to shared
23 storage. sanlock only uses them internally to hold a lease on its
24 "host_id" (an integer host identifier from 1-2000). They prevent two
25 hosts from using the same host identifier. The delta lease renewals
26 also indicate if a host is alive. ("Light-Weight Leases for Storage-
27 Centric Coordination", Chockler and Malkhi.)
28
29
30 · paxos leases are fast to acquire and sanlock makes them available to
31 applications as general purpose resource leases. The disk paxos
32 algorithm uses host_id's internally to represent different hosts, and
33 the owner of a paxos lease. delta leases provide unique host_id's
34 for implementing paxos leases, and delta lease renewals serve as a
35 proxy for paxos lease renewal. ("Disk Paxos", Eli Gafni and Leslie
36 Lamport.)
37
38
39 Externally, the sanlock daemon exposes a locking interface through
40 libsanlock in terms of "lockspaces" and "resources". A lockspace is a
41 locking context that an application creates for itself on shared stor‐
42 age. When the application on each host is started, it "joins" the
43 lockspace. It can then create "resources" on the shared storage. Each
44 resource represents an application-specific entity. The application
45 can acquire and release leases on resources.
46
47 To use sanlock from an application:
48
49
50 · Allocate shared storage for an application, e.g. a shared LUN or LV
51 from a SAN, or files from NFS.
52
53
54 · Provide the storage to the application.
55
56
57 · The application uses this storage with libsanlock to create a
58 lockspace and resources for itself.
59
60
61 · The application joins the lockspace when it starts.
62
63
64 · The application acquires and releases leases on resources.
65
66
67 How lockspaces and resources translate to delta leases and paxos leases
68 within sanlock:
69
70 Lockspaces
71
72
73 · A lockspace is based on delta leases held by each host using the
74 lockspace.
75
76
77 · A lockspace is a series of 2000 delta leases on disk, and requires
78 1MB of storage.
79
80
81 · A lockspace can support up to 2000 concurrent hosts using it, each
82 using a different delta lease.
83
84
85 · Applications can i) create, ii) join and iii) leave a lockspace,
86 which corresponds to i) initializing the set of delta leases on disk,
87 ii) acquiring one of the delta leases and iii) releasing the delta
88 lease.
89
90
91 · When a lockspace is created, a unique lockspace name and disk loca‐
92 tion is provided by the application.
93
94
95 · When a lockspace is created/initialized, sanlock formats the sequence
96 of 2000 on-disk delta lease structures on the file or disk, e.g.
97 /mnt/leasefile (NFS) or /dev/vg/lv (SAN).
98
99
100 · The 2000 individual delta leases in a lockspace are identified by
101 number: 1,2,3,...,2000.
102
103
104 · Each delta lease is a 512 byte sector in the 1MB lockspace, offset by
105 its number, e.g. delta lease 1 is offset 0, delta lease 2 is offset
106 512, delta lease 2000 is offset 1023488.
107
108
109 · When an application joins a lockspace, it must specify the lockspace
110 name, the lockspace location on shared disk/file, and the local
111 host's host_id. sanlock then acquires the delta lease corresponding
112 to the host_id, e.g. joining the lockspace with host_id 1 acquires
113 delta lease 1.
114
115
116 · The terms delta lease, lockspace lease, and host_id lease are used
117 interchangeably.
118
119
120 · sanlock acquires a delta lease by writing the host's unique name to
121 the delta lease disk sector, reading it back after a delay, and veri‐
122 fying it is the same.
123
124
125 · If a unique host name is not specified, sanlock generates a uuid to
126 use as the host's name. The delta lease algorithm depends on hosts
127 using unique names.
128
129
130 · The application on each host should be configured with a unique
131 host_id, where the host_id is an integer 1-2000.
132
133
134 · If hosts are misconfigured and have the same host_id, the delta lease
135 algorithm is designed to detect this conflict, and only one host will
136 be able to acquire the delta lease for that host_id.
137
138
139 · A delta lease ensures that a lockspace host_id is being used by a
140 single host with the unique name specified in the delta lease.
141
142
143 · Resolving delta lease conflicts is slow, because the algorithm is
144 based on waiting and watching for some time for other hosts to write
145 to the same delta lease sector. If multiple hosts try to use the
146 same delta lease, the delay is increased substantially. So, it is
147 best to configure applications to use unique host_id's that will not
148 conflict.
149
150
151 · After sanlock acquires a delta lease, the lease must be renewed until
152 the application leaves the lockspace (which corresponds to releasing
153 the delta lease on the host_id.)
154
155
156 · sanlock renews delta leases every 20 seconds (by default) by writing
157 a new timestamp into the delta lease sector.
158
159
160 · When a host acquires a delta lease in a lockspace, it can be referred
161 to as "joining" the lockspace. Once it has joined the lockspace, it
162 can use resources associated with the lockspace.
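
The offset arithmetic for delta leases described above can be sketched as a small shell helper. The function name delta_lease_offset and its optional sector size argument are assumptions made for illustration; they are not part of sanlock.

```shell
# Hypothetical helper (not part of sanlock): byte offset of the delta
# lease for a given host_id, relative to the start of the lockspace.
# Assumes one lease per sector and 512 byte sectors unless overridden.
delta_lease_offset() {
    host_id=$1
    sector_size=${2:-512}
    if [ "$host_id" -lt 1 ] || [ "$host_id" -gt 2000 ]; then
        echo "host_id must be 1-2000" >&2
        return 1
    fi
    echo $(( (host_id - 1) * sector_size ))
}

delta_lease_offset 1      # prints 0
delta_lease_offset 2      # prints 512
delta_lease_offset 2000   # prints 1023488
```

The printed values match the offsets given for delta leases 1, 2 and 2000 in the list above.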
163
164
165 Resources
166
167
168 · A lockspace is a context for resources that can be locked and
169 unlocked by an application.
170
171
172 · sanlock uses paxos leases to implement leases on resources. The
173 terms paxos lease and resource lease are used interchangeably.
174
175
176 · A paxos lease exists on shared storage and requires 1MB of space. It
177 contains a unique resource name and the name of the lockspace.
178
179
180 · An application assigns its own meaning to a sanlock resource and the
181 leases on it. A sanlock resource could represent some shared object
182 like a file, or some unique role among the hosts.
183
184
185 · Resource leases are associated with a specific lockspace and can only
186 be used by hosts that have joined that lockspace (they are holding a
187 delta lease on a host_id in that lockspace.)
188
189
190 · An application must keep track of the disk locations of its
191 lockspaces and resources. sanlock does not maintain any persistent
192 index or directory of lockspaces or resources that have been created
193 by applications, so applications need to remember where they have
194 placed their own leases (which files or disks and offsets).
195
196
197 · sanlock does not renew paxos leases directly (although it could).
198 Instead, the renewal of a host's delta lease represents the renewal
199 of all that host's paxos leases in the associated lockspace. In
200 effect, many paxos lease renewals are factored out into one delta
201 lease renewal. This reduces i/o when many paxos leases are used.
202
203
204 · The disk paxos algorithm allows multiple hosts to all attempt to
205 acquire the same paxos lease at once, and will produce a single win‐
206 ner/owner of the resource lease. (Shared resource leases are also
207 possible in addition to the default exclusive leases.)
208
209
210 · The disk paxos algorithm involves a specific sequence of reading and
211 writing the sectors of the paxos lease disk area. Each host has a
212 dedicated 512 byte sector in the paxos lease disk area where it
213 writes its own "ballot", and each host reads the entire disk area to
214 see the ballots of other hosts. The first sector of the disk area is
215 the "leader record" that holds the result of the last paxos ballot.
216 The winner of the paxos ballot writes the result of the ballot to the
217 leader record (the winner of the ballot may have selected another
218 contending host as the owner of the paxos lease.)
219
220
221 · After a paxos lease is acquired, no further i/o is done in the paxos
222 lease disk area.
223
224
225 · Releasing the paxos lease involves writing a single sector to clear
226 the current owner in the leader record.
227
228
229 · If a host holding a paxos lease fails, the disk area of the paxos
230 lease still indicates that the paxos lease is owned by the failed
231 host. If another host attempts to acquire the paxos lease, and finds
232 the lease is held by another host_id, it will check the delta lease
233 of that host_id. If the delta lease of the host_id is being renewed,
234 then the paxos lease is owned and cannot be acquired. If the delta
235 lease of the owner's host_id has expired, then the paxos lease is
236 expired and can be taken (by going through the paxos lease algo‐
237 rithm.)
238
239
240 · The "interaction" or "awareness" between hosts of each other is lim‐
241 ited to the case where they attempt to acquire the same paxos lease,
242 and need to check if the referenced delta lease has expired or not.
243
244
245 · When hosts do not attempt to lock the same resources concurrently,
246 there is no host interaction or awareness. The state or actions of
247 one host have no effect on others.
248
249
250 · To speed up checking delta lease expiration (in the case of a paxos
251 lease conflict), sanlock keeps track of past renewals of other delta
252 leases in the lockspace.
253
254
255 Expiration
256
257
258 · If a host fails to renew its delta lease, e.g. it loses access to
259 the storage, its delta lease will eventually expire and another host
260 will be able to take over any resource leases held by the host.
261 sanlock must ensure that the application on two different hosts is not
262 holding and using the same lease concurrently.
263
264
265 · When sanlock has failed to renew a delta lease for a period of time,
266 it will begin taking measures to stop local processes (applications)
267 from using any resource leases associated with the expiring lockspace
268 delta lease. sanlock enters this "recovery mode" well ahead of the
269 time when another host could take over the locally owned leases.
270 sanlock must have sufficient time to stop all local processes that
271 are using the expiring leases.
272
273
274 · sanlock uses three methods to stop local processes that are using
275 expiring leases:
276
277 1. Graceful shutdown. sanlock will execute a "graceful shutdown"
278 program that the application previously specified for this case. The
279 shutdown program tells the application to shut down because its
280 leases are expiring. The application must respond by stopping its
281 activities and releasing its leases (or exit). If an application
282 does not specify a graceful shutdown program, sanlock sends SIGTERM
283 to the process instead. The process must release its leases or exit
284 in a prescribed amount of time (see -g), or sanlock proceeds to the
285 next method of stopping.
286
287 2. Forced shutdown. sanlock will send SIGKILL to processes using the
288 expiring leases. The processes have a fixed amount of time to exit
289 after receiving SIGKILL. If any do not exit in this time, sanlock
290 will proceed to the next method.
291
292 3. Host reset. sanlock will trigger the host's watchdog device to
293 forcibly reset it. sanlock carefully manages the timing of the
294 watchdog device so that it fires shortly before any other host could
295 take over the resource leases held by local processes.
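
The escalation from graceful to forced shutdown can be illustrated with a simplified shell sketch. This is not how sanlock is implemented: the function names and the two timeout arguments are hypothetical, the graceful step is reduced to the SIGTERM fallback, and the third step (watchdog reset) is only represented by a non-zero return status.

```shell
# Hypothetical illustration of the escalation order; not sanlock's code.

# wait_for_exit PID SECONDS: succeed once PID is gone, fail after SECONDS.
wait_for_exit() {
    pid=$1
    timeout=$2
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0
    return 1
}

# stop_with_escalation PID GRACE_SEC KILL_SEC
# Returns 0 if the process exited, 1 if it survived both signals, which
# is the point where sanlock would rely on the watchdog to reset the host.
stop_with_escalation() {
    pid=$1
    grace_sec=$2
    kill_sec=$3
    # Step 1: graceful shutdown (SIGTERM stands in for a shutdown program).
    kill -TERM "$pid" 2>/dev/null || true
    wait_for_exit "$pid" "$grace_sec" && return 0
    # Step 2: forced shutdown.
    kill -KILL "$pid" 2>/dev/null || true
    wait_for_exit "$pid" "$kill_sec" && return 0
    # Step 3 would be the host reset; this sketch only reports it.
    return 1
}
```

A caller could run, for example, stop_with_escalation "$P" 40 10 and treat a non-zero status as the case where only a host reset remains. The timeout values in that call are arbitrary examples.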
296
297
298 Failures
299
300 If a process holding resource leases fails or exits without releasing
301 its leases, sanlock will release the leases for it automatically
302 (unless persistent resource leases were used.)
303
304 If the sanlock daemon cannot renew a lockspace delta lease for a spe‐
305 cific period of time (see Expiration), sanlock will enter "recovery
306 mode" where it attempts to stop and/or kill any processes holding
307 resource leases in the expiring lockspace. If the processes do not
308 exit in time, sanlock will force the host to be reset using the local
309 watchdog device.
310
311 If the sanlock daemon crashes or hangs, it will not renew the expiry
312 time of the per-lockspace connections it had to the wdmd daemon. This
313 will lead to the expiration of the local watchdog device, and the host
314 will be reset.
315
316 Watchdog
317
318 sanlock uses the wdmd(8) daemon to access /dev/watchdog. wdmd multi‐
319 plexes multiple timeouts onto the single watchdog timer. This is
320 required because delta leases for each lockspace are renewed and expire
321 independently.
322
323 sanlock maintains a wdmd connection for each lockspace delta lease
324 being renewed. Each connection has an expiry time for some seconds in
325 the future. After each successful delta lease renewal, the expiry time
326 is renewed for the associated wdmd connection. If wdmd finds any con‐
327 nection expired, it will not renew the /dev/watchdog timer. Given
328 enough successive failed renewals, the watchdog device will fire and
329 reset the host. (Given the multiplexing nature of wdmd, shorter over‐
330 lapping renewal failures from multiple lockspaces could cause spurious
331 watchdog firing.)
332
333 The direct link between delta lease renewals and watchdog renewals pro‐
334 vides a predictable watchdog firing time based on delta lease renewal
335 timestamps that are visible from other hosts. sanlock knows the time
336 the watchdog on another host has fired based on the delta lease time.
337 Furthermore, if the watchdog device on another host fails to fire when
338 it should, the continuation of delta lease renewals from the other host
339 will make this evident and prevent leases from being taken from the
340 failed host.
341
342 If sanlock is able to stop/kill all processes using an expiring
343 lockspace, the associated wdmd connection for that lockspace is
344 removed. The expired wdmd connection will no longer block
345 /dev/watchdog renewals, and the host should avoid being reset.
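
The multiplexing rule described in this section can be modeled in a few lines of shell. This is a simplified model under stated assumptions, not wdmd's actual logic: the function name is hypothetical and times are plain integers in seconds.

```shell
# Hypothetical model of the multiplexing rule; not wdmd's actual code.
# usage: watchdog_should_renew NOW EXPIRY...
# Succeeds (status 0) only if every connection expiry is still in the
# future; a single expired connection blocks the /dev/watchdog renewal.
watchdog_should_renew() {
    now=$1
    shift
    for expiry in "$@"; do
        if [ "$expiry" -le "$now" ]; then
            return 1
        fi
    done
    return 0
}

# Two lockspaces at time 100; the second connection expired at time 95.
if watchdog_should_renew 100 130 95; then
    echo "renew /dev/watchdog"
else
    echo "skip renewal; watchdog keeps counting down"
fi
```

Calling the function with no expiry arguments succeeds, which corresponds to the case where the connection for a stopped lockspace has been removed and no longer blocks renewals.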
346
347 Storage
348
349 On devices with 512 byte sectors, lockspaces and resources are 1MB in
350 size. On devices with 4096 byte sectors, lockspaces and resources are
351 8MB in size. sanlock uses 512 byte sectors when shared files are used
352 in place of shared block devices. Offsets of leases or resources must
353 be multiples of 1MB/8MB according to the sector size.
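
The size and alignment rules above can be checked with a short shell sketch. The helper names lease_size and offset_is_aligned are hypothetical and only encode the two sector sizes mentioned in this section; on a real system, sanlock client align reports the required alignment for a storage path.

```shell
# Hypothetical helpers (not sanlock commands) for the sizes stated above.

# lease_size SECTOR_SIZE: print the lockspace/resource size in bytes.
lease_size() {
    case "$1" in
        512)  echo 1048576 ;;   # 1MB
        4096) echo 8388608 ;;   # 8MB
        *)    echo "unsupported sector size: $1" >&2; return 1 ;;
    esac
}

# offset_is_aligned OFFSET SECTOR_SIZE: succeed if OFFSET is a multiple
# of the lease size for that sector size.
offset_is_aligned() {
    size=$(lease_size "$2") || return 1
    [ $(( $1 % size )) -eq 0 ]
}

offset_is_aligned 2097152 512 && echo "2097152 is usable with 512 byte sectors"
offset_is_aligned 2097152 4096 || echo "2097152 is not usable with 4096 byte sectors"
```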
354
355 Using sanlock on shared block devices that do host based mirroring or
356 replication is not likely to work correctly. When using sanlock on
357 shared files, all sanlock io should go to one file server.
358
359 Example
360
361 This is an example of creating and using lockspaces and resources from
362 the command line. (Most applications would use sanlock through
363 libsanlock rather than through the command line.)
364
365
366 1. Allocate shared storage for sanlock leases.
367
368 This example assumes 512 byte sectors on the device, in which case
369 the lockspace needs 1MB and each resource needs 1MB.
370
371 # vgcreate vg /dev/sdb
372 # lvcreate -n leases -L 1GB vg
373
374
375 2. Start sanlock on all hosts.
376
377 The -w 0 disables use of the watchdog for testing.
378
379 # sanlock daemon -w 0
380
381
382 3. Start a dummy application on all hosts.
383
384 This sanlock command registers with sanlock, then execs the sleep
385 command which inherits the registered fd. The sleep process acts
386 as the dummy application. Because the sleep process is registered
387 with sanlock, leases can be acquired for it.
388
389 # sanlock client command -c /bin/sleep 600 &
390
391
392 4. Create a lockspace for the application (from one host).
393
394 The lockspace is named "test".
395
396 # sanlock client init -s test:0:/dev/vg/leases:0
397
398
399 5. Join the lockspace for the application.
400
401 Use a unique host_id on each host.
402
403 host1:
404 # sanlock client add_lockspace -s test:1:/dev/vg/leases:0
405 host2:
406 # sanlock client add_lockspace -s test:2:/dev/vg/leases:0
407
408
409 6. Create two resources for the application (from one host).
410
411 The resources are named "RA" and "RB". Offsets are used on the
412 same device as the lockspace. Different LVs or files could also be
413 used.
414
415 # sanlock client init -r test:RA:/dev/vg/leases:1048576
416 # sanlock client init -r test:RB:/dev/vg/leases:2097152
417
418
419 7. Acquire resource leases for the application on host1.
420
421 Acquire an exclusive lease (the default) on the first resource, and
422 a shared lease (SH) on the second resource.
423
424 # export P=`pidof sleep`
425 # sanlock client acquire -r test:RA:/dev/vg/leases:1048576 -p $P
426 # sanlock client acquire -r test:RB:/dev/vg/leases:2097152:SH -p $P
427
428
429 8. Acquire resource leases for the application on host2.
430
431 Acquiring the exclusive lease on the first resource will fail
432 because it is held by host1. Acquiring the shared lease on the
433 second resource will succeed.
434
435 # export P=`pidof sleep`
436 # sanlock client acquire -r test:RA:/dev/vg/leases:1048576 -p $P
437 # sanlock client acquire -r test:RB:/dev/vg/leases:2097152:SH -p $P
438
439
440 9. Release resource leases for the application on both hosts.
441
442 The sleep pid could also be killed, which will result in the
443 sanlock daemon releasing its leases when it exits.
444
445 # sanlock client release -r test:RA:/dev/vg/leases:1048576 -p $P
446 # sanlock client release -r test:RB:/dev/vg/leases:2097152 -p $P
447
448
449 10. Leave the lockspace for the application.
450
451 host1:
452 # sanlock client rem_lockspace -s test:1:/dev/vg/leases:0
453 host2:
454 # sanlock client rem_lockspace -s test:2:/dev/vg/leases:0
455
456
457 11. Stop sanlock on all hosts.
458
459 # sanlock shutdown
460
461
462
463 OPTIONS
464 COMMAND can be one of three primary top level choices
465
466 sanlock daemon start daemon
467 sanlock client send request to daemon (default command if none given)
468 sanlock direct access storage directly (no coordination with daemon)
469
470
471 Daemon Command
472 sanlock daemon [options]
473
474 -D no fork and print all logging to stderr
475
476 -Q 0|1 quiet error messages for common lock contention
477
478 -R 0|1 renewal debugging, log debug info for each renewal
479
480 -L pri write logging at priority level and up to logfile (-1 none)
481
482 -S pri write logging at priority level and up to syslog (-1 none)
483
484 -U uid user id
485
486 -G gid group id
487
488 -t num max worker threads
489
490 -g sec seconds for graceful recovery
491
492 -w 0|1 use watchdog through wdmd
493
494 -h 0|1 use high priority (RR) scheduling
495
496 -l num use mlockall (0 none, 1 current, 2 current and future)
497
498 -b sec seconds a host id bit will remain set in delta lease bitmap
499
500 -e str local host name used in delta leases
501
502
503
504 Client Command
505 sanlock client action [options]
506
507 sanlock client status
508
509 Print processes, lockspaces, and resources being managed by the sanlock
510 daemon. Add -D to show extra internal daemon status for debugging.
511 Add -o p to show resources by pid, or -o s to show resources by
512 lockspace.
513
514 sanlock client host_status
515
516 Print state of host_id delta leases read during the last renewal.
517 State of all lockspaces is shown (use -s to select one). Add -D to
518 show extra internal daemon status for debugging.
519
520 sanlock client gets
521
522 Print lockspaces being managed by the sanlock daemon. The LOCKSPACE
523 string will be followed by ADD or REM if the lockspace is currently
524 being added or removed. Add -h 1 to also show hosts in each lockspace.
525
526 sanlock client renewal -s LOCKSPACE
527
528 Print a history of renewals with timing details. See the Renewal his‐
529 tory section below.
530
531 sanlock client log_dump
532
533 Print the sanlock daemon internal debug log.
534
535 sanlock client shutdown
536
537 Ask the sanlock daemon to exit. Without the force option (-f 0), the
538 command will be ignored if any lockspaces exist. With the force option
539 (-f 1), any registered processes will be killed, their resource leases
540 released, and lockspaces removed. With the wait option (-w 1), the
541 command will wait for a result from the daemon indicating that it has
542 shut down and is exiting, or cannot shut down because lockspaces exist
543 (command fails).
544
545 sanlock client init -s LOCKSPACE
546
547 Tell the sanlock daemon to initialize a lockspace on disk. The -o
548 option can be used to specify the io timeout to be written in the
549 host_id leases. (Also see sanlock direct init.)
550
551 sanlock client init -r RESOURCE
552
553 Tell the sanlock daemon to initialize a resource lease on disk. (Also
554 see sanlock direct init.)
555
556 sanlock client read -s LOCKSPACE
557
558 Tell the sanlock daemon to read a lockspace from disk. Only the
559 LOCKSPACE path and offset are required. If host_id is zero, the first
560 record at offset (host_id 1) is used. The complete LOCKSPACE and io
561 timeout are printed.
562
563 sanlock client read -r RESOURCE
564
565 Tell the sanlock daemon to read a resource lease from disk. Only the
566 RESOURCE path and offset are required. The complete RESOURCE is
567 printed. (Also see sanlock direct read_leader.)
568
569 sanlock client align -s LOCKSPACE
570
571 Tell the sanlock daemon to report the required lease alignment for a
572 storage path. Only path is used from the LOCKSPACE argument.
573
574 sanlock client add_lockspace -s LOCKSPACE
575
576 Tell the sanlock daemon to acquire the specified host_id in the
577 lockspace. This will allow resources to be acquired in the lockspace.
578 The -o option can be used to specify the io timeout of the acquiring
579 host, and will be written in the host_id lease.
580
581 sanlock client inq_lockspace -s LOCKSPACE
582
583 Inquire about the state of the lockspace in the sanlock daemon, whether
584 it is being added or removed, or is joined.
585
586 sanlock client rem_lockspace -s LOCKSPACE
587
588 Tell the sanlock daemon to release the specified host_id in the
589 lockspace. Any processes holding resource leases in this lockspace
590 will be killed, and the resource leases not released.
591
592 sanlock client command -r RESOURCE -c path args
593
594 Register with the sanlock daemon, acquire the specified resource lease,
595 and exec the command at path with args. When the command exits, the
596 sanlock daemon will release the lease. -c must be the final option.
597
598 sanlock client acquire -r RESOURCE -p pid
599 sanlock client release -r RESOURCE -p pid
600
601 Tell the sanlock daemon to acquire or release the specified resource
602 lease for the given pid. The pid must be registered with the sanlock
603 daemon. acquire can optionally take a versioned RESOURCE string
604 RESOURCE:lver, where lver is the version of the lease that must be
605 acquired, or fail.
606
607 sanlock client convert -r RESOURCE -p pid
608
609 Tell the sanlock daemon to convert the mode of the specified resource
610 lease for the given pid. If the existing mode is exclusive (default),
611 the mode of the lease can be converted to shared with RESOURCE:SH. If
612 the existing mode is shared, the mode of the lease can be converted to
613 exclusive with RESOURCE (no :SH suffix).
614
615 sanlock client inquire -p pid
616
617 Print the resource leases held by the given pid. The format is a
618 versioned RESOURCE string "RESOURCE:lver" where lver is the version of
619 the lease held.
620
621 sanlock client request -r RESOURCE -f force_mode
622
623 Request the owner of a resource do something specified by force_mode.
624 A versioned RESOURCE:lver string must be used with a greater version
625 than is presently held. Zero lver and force_mode clears the request.
626
627 sanlock client examine -r RESOURCE
628
629 Examine the request record for the currently held resource lease and
630 carry out the action specified by the requested force_mode.
631
632 sanlock client examine -s LOCKSPACE
633
634 Examine requests for all resource leases currently held in the named
635 lockspace. Only lockspace_name is used from the LOCKSPACE argument.
636
637 sanlock client set_event -s LOCKSPACE -i host_id -g gen -e num -d num
638
639 Set an event for another host. When the sanlock daemon next renews its
640 delta lease for the lockspace it will: set the bit for the host_id in
641 its bitmap, and set the generation, event and data values in its own
642 delta lease. An application that has registered for events from this
643 lockspace on the destination host will get the event that has been set
644 when the destination sees the event during its next delta lease
645 renewal.
646
647 sanlock client set_config -s LOCKSPACE
648
649 Set a configuration value for a lockspace. Only lockspace_name is used
650 from the LOCKSPACE argument. The USED flag has the same effect on a
651 lockspace as a process holding a resource lease that will not exit.
652 The USED_BY_ORPHANS flag means that an orphan resource lease will have
653 the same effect as the USED.
654 -u 0|1 Set (1) or clear (0) the USED flag.
655 -O 0|1 Set (1) or clear (0) the USED_BY_ORPHANS flag.
656
657
658 Direct Command
659 sanlock direct action [options]
660
661
662 -o sec io timeout in seconds
663
664 sanlock direct init -s LOCKSPACE
665 sanlock direct init -r RESOURCE
666
667 Initialize storage for 2000 host_id (delta) leases for the given
668 lockspace, or initialize storage for one resource (paxos) lease. Both
669 options require 1MB of space. The host_id in the LOCKSPACE string is
670 not relevant to initialization, so the value is ignored. (The default
671 of 2000 host_ids can be changed for special cases using the -n
672 num_hosts and -m max_hosts options.) With -s, the -o option specifies
673 the io timeout to be written in the host_id leases. With -r, the -z 1
674 option invalidates the resource lease on disk so it cannot be used
675 until reinitialized normally.
676
677 sanlock direct read_leader -s LOCKSPACE
678 sanlock direct read_leader -r RESOURCE
679
680 Read a leader record from disk and print the fields. The leader record
681 is the single sector of a delta lease, or the first sector of a paxos
682 lease.
683
684 sanlock direct dump path[:offset[:size]]
685
686 Read disk sectors and print leader records for delta or paxos leases.
687 Add -f 1 to print the request record values for paxos leases, and
688 host_ids set in delta lease bitmaps.
689
690
691 LOCKSPACE option string
692 -s lockspace_name:host_id:path:offset
693
694 lockspace_name name of lockspace
695 host_id local host identifier in lockspace
696 path path to storage reserved for leases
697 offset offset on path (bytes)
698
699
700 RESOURCE option string
701 -r lockspace_name:resource_name:path:offset
702
703 lockspace_name name of lockspace
704 resource_name name of resource
705 path path to storage reserved for leases
706 offset offset on path (bytes)
707
708
709 RESOURCE option string with suffix
710 -r lockspace_name:resource_name:path:offset:lver
711
712 lver leader version
713
714 -r lockspace_name:resource_name:path:offset:SH
715
716 SH indicates shared mode
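
When scripting around the command line, the option strings above can be composed from their fields. The helpers below are a hypothetical sketch, not part of sanlock, and assume that no field contains a ':' character.

```shell
# Hypothetical helpers for composing and splitting option strings.
# Assumes no field contains a ':' character.

# usage: lockspace_arg LOCKSPACE_NAME HOST_ID PATH OFFSET
lockspace_arg() {
    printf '%s:%s:%s:%s\n' "$1" "$2" "$3" "$4"
}

# usage: resource_arg LOCKSPACE_NAME RESOURCE_NAME PATH OFFSET [SUFFIX]
# SUFFIX is either a leader version (lver) or SH for shared mode.
resource_arg() {
    if [ -n "${5:-}" ]; then
        printf '%s:%s:%s:%s:%s\n' "$1" "$2" "$3" "$4" "$5"
    else
        printf '%s:%s:%s:%s\n' "$1" "$2" "$3" "$4"
    fi
}

# usage: lockspace_host_id LOCKSPACE_STRING
lockspace_host_id() {
    printf '%s\n' "$1" | cut -d: -f2
}

lockspace_arg test 1 /dev/vg/leases 0            # test:1:/dev/vg/leases:0
resource_arg test RB /dev/vg/leases 2097152 SH   # test:RB:/dev/vg/leases:2097152:SH
```

The resulting strings are in the form accepted by -s and -r, e.g. sanlock client add_lockspace -s "$(lockspace_arg test 1 /dev/vg/leases 0)".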
717
718
719 Defaults
720 sanlock help shows the default values for the options above.
721
722 sanlock version shows the build version.
723
724
726 Request/Examine
727 The first part of making a request for a resource is writing the
728 request record of the resource (the sector following the leader
729 record). To make a successful request:
730
731 · RESOURCE:lver must be greater than the lver presently held by the
732 other host. This implies the leader record must be read to discover
733 the lver, prior to making a request.
734
735 · RESOURCE:lver must be greater than or equal to the lver presently
736 written to the request record. Two hosts may write a new request at
737 the same time for the same lver, in which case both would succeed,
738 but the force_mode from the last would win.
739
740 · The force_mode must be greater than zero.
741
742 · To unconditionally clear the request record (set both lver and
743 force_mode to 0), make request with RESOURCE:0 and force_mode 0.
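
The conditions above can be restated as a small predicate. This is a hypothetical sketch that mirrors the listed rules; it is not sanlock's own validation code, and the function and argument names are invented for illustration.

```shell
# Hypothetical check mirroring the rules above; not sanlock's own code.
# usage: request_would_succeed REQ_LVER FORCE_MODE HELD_LVER RECORDED_LVER
#   REQ_LVER       lver given in RESOURCE:lver
#   FORCE_MODE     requested force_mode
#   HELD_LVER      lver in the leader record (held by the owner)
#   RECORDED_LVER  lver currently written in the request record
request_would_succeed() {
    req_lver=$1
    force_mode=$2
    held_lver=$3
    recorded_lver=$4
    # RESOURCE:0 with force_mode 0 unconditionally clears the record.
    if [ "$req_lver" -eq 0 ] && [ "$force_mode" -eq 0 ]; then
        return 0
    fi
    [ "$req_lver" -gt "$held_lver" ] &&
    [ "$req_lver" -ge "$recorded_lver" ] &&
    [ "$force_mode" -gt 0 ]
}

# Owner holds lver 4, request record holds lver 5, new request uses lver 5.
request_would_succeed 5 1 4 5 && echo "request accepted"
```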
744
745
746 The owner of the requested resource will not know of the request unless
747 it is explicitly told to examine its resources via the "examine"
748 api/command, or otherwise notified.
749
750 The second part of making a request is notifying the resource lease
751 owner that it should examine the request records of its resource
752 leases. The notification will cause the lease owner to automatically
753 run the equivalent of "sanlock client examine -s LOCKSPACE" for the
754 lockspace of the requested resource.
755
756 The notification is made using a bitmap in each host_id delta lease.
757 One bit represents each of the possible host_ids (1-2000). If host A
758 wants to notify host B to examine its resources, A sets the bit in its
759 own bitmap that corresponds to the host_id of B. When B next renews
760 its delta lease, it reads the delta leases for all hosts and checks
761 each bitmap to see if its own host_id has been set. It finds the bit
762 for its own host_id set in A's bitmap, and examines its resource
763 request records. (The bit remains set in A's bitmap for
764 set_bitmap_seconds.)
765
766 force_mode determines the action the resource lease owner should take:
767
768
769 · FORCE (1): kill the process holding the resource lease. When the
770 process has exited, the resource lease will be released, and can then
771 be acquired by anyone. The kill signal is SIGKILL (or SIGTERM if
772 SIGKILL is restricted.)
773
774
775 · GRACEFUL (2): run the program configured by sanlock_killpath against
776 the process holding the resource lease. If no killpath is defined,
777 then FORCE is used.
778
779
780 Persistent and orphan resource leases
781 A resource lease can be acquired with the PERSISTENT flag (-P 1). If
782 the process holding the lease exits, the lease will not be released,
783 but kept on an orphan list. Another local process can acquire an
784 orphan lease using the ORPHAN flag (-O 1), or release the orphan lease
785 using the ORPHAN flag (-O 1). All orphan leases can be released by
786 setting the lockspace name (-s lockspace_name) with no resource name.
787
788
789 Renewal history
790 sanlock saves a limited history of lease renewal information in each
791 lockspace. See sanlock.conf renewal_history_size to set the amount of
792 history or to disable (set to 0).
793
794 IO times are measured during delta lease renewals (each delta lease renewal
795 includes one read and one write).
796
797 For each successful renewal, a record is saved that includes:
798
799 · the timestamp written in the delta lease by the renewal
800
801 · the time in milliseconds taken by the delta lease read
802
803 · the time in milliseconds taken by the delta lease write
804
805
806 Also counted and recorded are the number of io timeouts and other io
807 errors that occur between successful renewals.
808
809 Two consecutive successful renewals would be recorded as:
810 timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
811 timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0
812
813 Those fields are:
814
815
816 · timestamp is the value written into the delta lease during that
817 renewal.
818
819
820 · read_ms/write_ms are the milliseconds taken for the renewal
821 read/write ios.
822
823
824 · next_timeouts are the number of io timeouts that occurred after the
825 renewal recorded on that line, and before the next successful renewal
826 on the following line.
827
828
829 · next_errors are the number of io errors (not timeouts) that occurred
830 after the renewal recorded on that line, and before the next successful
831 renewal on the following line.
832
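As an illustration, records in the key=value form shown above can be split into fields with a few lines of Python. This is a minimal sketch that assumes every field is an integer and that the text matches the two sample lines exactly. The function name parse_renewal_record is hypothetical and not part of sanlock.

```python
def parse_renewal_record(line):
    """Split one 'key=value ...' renewal history line into a dict of ints."""
    record = {}
    for token in line.split():
        key, _, value = token.partition("=")
        record[key] = int(value)
    return record


lines = [
    "timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0",
    "timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0",
]
records = [parse_renewal_record(line) for line in lines]

# Difference between the timestamps written by the two renewals.
gap = records[1]["timestamp"] - records[0]["timestamp"]

# Slowest renewal write among the parsed records, in milliseconds.
slowest_write_ms = max(r["write_ms"] for r in records)
```

For the two sample records, gap is 21 and slowest_write_ms is 5525.
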
833
834 The command 'sanlock client renewal -s lockspace_name' reports the full
835 history of renewals saved by sanlock, which by default is 180 records,
836 about 1 hour of history when using a 20 second renewal interval for a
837 10 second io timeout.
838
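The "about 1 hour" figure follows from multiplying the record count by the renewal interval. A small sketch of that arithmetic, assuming one record is saved per successful renewal and no renewals are missed (the function name is hypothetical):

```python
def history_span_seconds(record_count, renewal_interval_s):
    """Approximate time covered by the saved renewal history."""
    return record_count * renewal_interval_s


span_s = history_span_seconds(180, 20)   # 180 records, 20 second interval
span_hours = span_s / 3600
```

With the default values this gives 3600 seconds, or 1 hour.
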
839
841 Disk Format
842 · This example uses 512 byte sectors.
843
844 · Each lockspace is 1MB. It holds 2000 delta_leases, one per sector,
845 supporting up to 2000 hosts.
846
847 · Each paxos_lease is 1MB. It is used as a lease for one resource.
848
849 · The leader_record structure is used differently by each lease type.
850
851 · To display all leader_record fields, see sanlock direct read_leader.
852
853 · A lockspace is often followed on disk by the paxos_leases used within
854 that lockspace, but this layout is not required.
855
856 · The request_record and host_id bitmap are used for requests/events.
857
858 · The mode_block contains the SHARED flag, which indicates that a
859 lease is held in shared mode.
860
861 · In a lockspace, the host using host_id N writes to a single
862 delta_lease in sector N-1. No other hosts write to this sector. All
863 hosts read all lockspace sectors when renewing their own delta_lease,
864 and are able to monitor renewals of all delta_leases.
865
866 · In a paxos_lease, each host has a dedicated sector it writes to, con‐
867 taining its own paxos_dblock and mode_block structures. Its sector
868 is based on its host_id; host_id 1 writes to the dblock/mode_block in
869 sector 2 of the paxos_lease.
870
871 · The paxos_dblock structures are used by the paxos_lease algorithm,
872 and the result is written to the leader_record.
873
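The sector rules in the list above reduce to simple offset arithmetic. The sketch below assumes the 512 byte sectors of this example, generalizes "host_id 1 writes to sector 2" to host_id N using sector N+1 of a paxos_lease, and takes the 128 byte mode_block offset from the layout listing in this section. The function names are hypothetical.

```python
SECTOR_SIZE = 512          # this example uses 512 byte sectors
MAX_HOSTS = 2000           # host_id range is 1-2000
MODE_BLOCK_OFFSET = 128    # mode_block = paxos_dblock + 128


def _check_host_id(host_id):
    if not 1 <= host_id <= MAX_HOSTS:
        raise ValueError(f"host_id must be in 1..{MAX_HOSTS}, got {host_id}")


def delta_lease_offset(host_id, lockspace_offset=0):
    """Byte offset of the delta_lease for host_id (sector host_id - 1)."""
    _check_host_id(host_id)
    return lockspace_offset + (host_id - 1) * SECTOR_SIZE


def paxos_dblock_offset(host_id, lease_offset):
    """Byte offset of the paxos_dblock for host_id (sector host_id + 1)."""
    _check_host_id(host_id)
    return lease_offset + (host_id + 1) * SECTOR_SIZE


def mode_block_offset(host_id, lease_offset):
    """Byte offset of the mode_block stored in the same sector as the dblock."""
    return paxos_dblock_offset(host_id, lease_offset) + MODE_BLOCK_OFFSET


# Offsets relative to the start of the lockspace or paxos_lease.
delta_last = hex(delta_lease_offset(2000))         # '0xf9e00'
dblock_last = hex(paxos_dblock_offset(2000, 0))    # '0xfa200'
mode_first = hex(mode_block_offset(1, 0))          # '0x480'
```

These values match the offsets listed for host_id 2000 and host_id 1 in the layout that follows.
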
874
875 0x000000 lockspace foo:0:/path:0
876
877 (There is no representation on disk of the lockspace in general, only
878 the sequence of specific delta_leases which collectively represent the
879 lockspace.)
880
881 delta_lease foo:1:/path:0
882   0x000    0     leader_record   (sector 0, for host_id 1)
883                  magic: 0x12212010
884                  space_name: foo
885                  resource_name: host uuid/name
886                  ...
887                  host_id bitmap  (leader_record + 256)
888
889 delta_lease foo:2:/path:0
890   0x200    512   leader_record   (sector 1, for host_id 2)
891                  magic: 0x12212010
892                  space_name: foo
893                  resource_name: host uuid/name
894                  ...
895                  host_id bitmap  (leader_record + 256)
896
897 delta_lease foo:3:/path:0
898   0x400    1024  leader_record   (sector 2, for host_id 3)
899                  magic: 0x12212010
900                  space_name: foo
901                  resource_name: host uuid/name
902                  ...
903                  host_id bitmap  (leader_record + 256)
904
905 delta_lease foo:2000:/path:0
906   0xF9E00        leader_record   (sector 1999, for host_id 2000)
907                  magic: 0x12212010
908                  space_name: foo
909                  resource_name: host uuid/name
910                  ...
911                  host_id bitmap  (leader_record + 256)
912
913 0x100000 paxos_lease foo:example1:/path:1048576
914   0x000    0     leader_record   (sector 0)
915                  magic: 0x06152010
916                  space_name: foo
917                  resource_name: example1
918
919   0x200    512   request_record  (sector 1)
920                  magic: 0x08292011
921
922   0x400    1024  paxos_dblock    (sector 2, for host_id 1)
923   0x480    1152  mode_block      (paxos_dblock + 128)
924
925   0x600    1536  paxos_dblock    (sector 3, for host_id 2)
926   0x680    1664  mode_block      (paxos_dblock + 128)
927
928   0x800    2048  paxos_dblock    (sector 4, for host_id 3)
929   0x880    2176  mode_block      (paxos_dblock + 128)
930
931   0xFA200        paxos_dblock    (sector 2001, for host_id 2000)
932   0xFA280        mode_block      (paxos_dblock + 128)
933
934 0x200000 paxos_lease foo:example2:/path:2097152
935   0x000    0     leader_record   (sector 0)
936                  magic: 0x06152010
937                  space_name: foo
938                  resource_name: example2
939
940   0x200    512   request_record  (sector 1)
941                  magic: 0x08292011
942
943   0x400    1024  paxos_dblock    (sector 2, for host_id 1)
944   0x480    1152  mode_block      (paxos_dblock + 128)
945
946   0x600    1536  paxos_dblock    (sector 3, for host_id 2)
947   0x680    1664  mode_block      (paxos_dblock + 128)
948
949   0x800    2048  paxos_dblock    (sector 4, for host_id 3)
950   0x880    2176  mode_block      (paxos_dblock + 128)
951
952   0xFA200        paxos_dblock    (sector 2001, for host_id 2000)
953   0xFA280        mode_block      (paxos_dblock + 128)
954
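The magic values in the layout above distinguish the three structure types. A minimal lookup sketch follows. It assumes the magic has already been decoded to an integer, because the on-disk byte order and field position are not described here. The function name is hypothetical.

```python
# Magic values as they appear in the layout above.
MAGIC_TO_STRUCT = {
    0x12212010: "delta_lease leader_record",
    0x06152010: "paxos_lease leader_record",
    0x08292011: "request_record",
}


def identify_magic(magic):
    """Name the structure type for a decoded magic value, or 'unknown'."""
    return MAGIC_TO_STRUCT.get(magic, "unknown")


kind = identify_magic(0x06152010)   # 'paxos_lease leader_record'
```
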
955
956 Lease ownership
957 Not shown in the leader_record structures above are the owner_id,
958 owner_generation and timestamp fields. These are the fields that
959 define the lease owner.
960
961 The delta_lease at sector N for host_id N+1 has leader_record.owner_id
962 N+1. The leader_record.owner_generation is incremented each time the
963 delta_lease is acquired. When a delta_lease is acquired, the
964 leader_record.timestamp field is set to the time of the host and the
965 leader_record.resource_name is set to the unique name of the host.
966 When the host renews the delta_lease, it writes a new
967 leader_record.timestamp. When a host releases a delta_lease, it writes
968 zero to leader_record.timestamp.
969
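The field updates described for a delta_lease can be summarized in a toy model. This sketch only mirrors the writes to owner_generation, timestamp and resource_name. It does not model the delta lease algorithm itself or any disk io, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DeltaLeaderRecord:
    """Toy copy of the ownership fields of a delta_lease leader_record."""
    owner_id: int
    owner_generation: int = 0
    timestamp: int = 0          # 0 means the delta_lease is not held
    resource_name: str = ""

    def acquire(self, host_name, now):
        # Each acquisition increments the generation and records the host.
        self.owner_generation += 1
        self.timestamp = now
        self.resource_name = host_name

    def renew(self, now):
        # A renewal only writes a new timestamp.
        self.timestamp = now

    def release(self):
        # Releasing writes zero to the timestamp.
        self.timestamp = 0


def record_for_sector(sector):
    """The delta_lease at sector N belongs to host_id N+1."""
    return DeltaLeaderRecord(owner_id=sector + 1)


rec = record_for_sector(0)          # host_id 1
rec.acquire("host-a", now=5332)     # host name and times are made-up values
rec.renew(now=5353)
rec.release()
```
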
970 When a host acquires a paxos_lease, it uses the host_id/generation
971 value from the delta_lease it holds in the lockspace. It uses this
972 host_id/generation to identify itself in the paxos_dblock when running
973 the paxos algorithm. The result of the algorithm is the winning
974 host_id/generation - the new owner of the paxos_lease. The winning
975 host_id/generation are written to the paxos_lease
976 leader_record.owner_id and leader_record.owner_generation fields and
977 leader_record.timestamp is set. When a host releases a paxos_lease, it
978 sets leader_record.timestamp to 0.
979
980 When a paxos_lease is free (leader_record.timestamp is 0), multiple
981 hosts may attempt to acquire it. The paxos algorithm, using the
982 paxos_dblock structures, will select only one of the hosts as the new
983 owner, and that owner is written in the leader_record. The paxos_lease
984 will no longer be free (non-zero timestamp). Other hosts will see this
985 and will not attempt to acquire the paxos_lease until it is free again.
986
987 If a paxos_lease is owned (non-zero timestamp), but the owner has not
988 renewed its delta_lease for a specific length of time, then the owner
989 value in the paxos_lease becomes expired, and other hosts will use the
990 paxos algorithm to acquire the paxos_lease, and set a new owner.
991
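The free, owned and expired cases above can be expressed as a small decision function. This is a sketch under stated assumptions: the paxos algorithm itself is not modeled, the expiry threshold expire_after is a hypothetical parameter standing in for the "specific length of time", and all names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class PaxosLeaderRecord:
    """Toy copy of the ownership fields of a paxos_lease leader_record."""
    owner_id: int = 0
    owner_generation: int = 0
    timestamp: int = 0          # 0 means the paxos_lease is free

    def set_owner(self, host_id, generation, now):
        # Record the winner selected by the paxos algorithm (not modeled).
        self.owner_id = host_id
        self.owner_generation = generation
        self.timestamp = now

    def release(self):
        self.timestamp = 0


def lease_state(leader, owner_last_renewal, now, expire_after):
    """Classify a paxos_lease as 'free', 'owned' or 'expired'.

    owner_last_renewal is the timestamp last seen in the owner's delta_lease.
    """
    if leader.timestamp == 0:
        return "free"
    if now - owner_last_renewal >= expire_after:
        return "expired"
    return "owned"


def may_attempt_acquire(leader, owner_last_renewal, now, expire_after):
    """Hosts only run the paxos algorithm on free or expired leases."""
    return lease_state(leader, owner_last_renewal, now, expire_after) != "owned"


leader = PaxosLeaderRecord()
leader.set_owner(host_id=2, generation=5, now=1000)
# 80 is an arbitrary threshold chosen for the example.
state_soon = lease_state(leader, owner_last_renewal=1000, now=1010, expire_after=80)
state_late = lease_state(leader, owner_last_renewal=1000, now=1100, expire_after=80)
```

Here state_soon is "owned" and state_late is "expired".
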
992
994 /etc/sanlock/sanlock.conf
995
996
998 wdmd(8)
999
1000
1001
1002
1003 2015-01-23 SANLOCK(8)