1SANLOCK(8) System Manager's Manual SANLOCK(8)
2
3
4
6 sanlock - shared storage lock manager
7
8
10 sanlock [COMMAND] [ACTION] ...
11
12
14 sanlock is a lock manager built on shared storage. Hosts with access
15 to the storage can perform locking. An application running on the
16 hosts is given a small amount of space on the shared block device or
17 file, and uses sanlock for its own application-specific synchroniza‐
18 tion. Internally, the sanlock daemon manages locks using two disk-
19 based lease algorithms: delta leases and paxos leases.
20
21
22 · delta leases are slow to acquire and demand regular i/o to shared
23 storage. sanlock only uses them internally to hold a lease on its
24 "host_id" (an integer host identifier from 1-2000). They prevent two
25 hosts from using the same host identifier. The delta lease renewals
26 also indicate if a host is alive. ("Light-Weight Leases for Storage-
27 Centric Coordination", Chockler and Malkhi.)
28
29
30 · paxos leases are fast to acquire and sanlock makes them available to
31 applications as general purpose resource leases. The disk paxos
32 algorithm uses host_id's internally to represent different hosts, and
33 the owner of a paxos lease. delta leases provide unique host_id's
34 for implementing paxos leases, and delta lease renewals serve as a
35 proxy for paxos lease renewal. ("Disk Paxos", Eli Gafni and Leslie
36 Lamport.)
37
38
39 Externally, the sanlock daemon exposes a locking interface through lib‐
40 sanlock in terms of "lockspaces" and "resources". A lockspace is a
41 locking context that an application creates for itself on shared stor‐
42 age. When the application on each host is started, it "joins" the
43 lockspace. It can then create "resources" on the shared storage. Each
44 resource represents an application-specific entity. The application
45 can acquire and release leases on resources.
46
47 To use sanlock from an application:
48
49
50 · Allocate shared storage for an application, e.g. a shared LUN or LV
51 from a SAN, or files from NFS.
52
53
54 · Provide the storage to the application.
55
56
57 · The application uses this storage with libsanlock to create a
58 lockspace and resources for itself.
59
60
61 · The application joins the lockspace when it starts.
62
63
64 · The application acquires and releases leases on resources.
65
66
67 How lockspaces and resources translate to delta leases and paxos leases
68 within sanlock:
69
70 Lockspaces
71
72
73 · A lockspace is based on delta leases held by each host using the
74 lockspace.
75
76
77 · A lockspace is a series of 2000 delta leases on disk, and requires
78 1MB of storage. (See Storage below for size variations.)
79
80
81 · A lockspace can support up to 2000 concurrent hosts using it, each
82 using a different delta lease.
83
84
85 · Applications can i) create, ii) join and iii) leave a lockspace,
86 which corresponds to i) initializing the set of delta leases on disk,
87 ii) acquiring one of the delta leases and iii) releasing the delta
88 lease.
89
90
91 · When a lockspace is created, a unique lockspace name and disk loca‐
92 tion is provided by the application.
93
94
95 · When a lockspace is created/initialized, sanlock formats the sequence
96 of 2000 on-disk delta lease structures on the file or disk, e.g.
97 /mnt/leasefile (NFS) or /dev/vg/lv (SAN).
98
99
100 · The 2000 individual delta leases in a lockspace are identified by
101 number: 1,2,3,...,2000.
102
103
104 · Each delta lease is a 512 byte sector in the 1MB lockspace, offset by
105 its number, e.g. delta lease 1 is offset 0, delta lease 2 is offset
106 512, delta lease 2000 is offset 1023488. (See Storage below for size
107 variations.)
108
109
110 · When an application joins a lockspace, it must specify the lockspace
111 name, the lockspace location on shared disk/file, and the local
112 host's host_id. sanlock then acquires the delta lease corresponding
113 to the host_id, e.g. joining the lockspace with host_id 1 acquires
114 delta lease 1.
115
116
117 · The terms delta lease, lockspace lease, and host_id lease are used
118 interchangeably.
119
120
121 · sanlock acquires a delta lease by writing the host's unique name to
122 the delta lease disk sector, reading it back after a delay, and veri‐
123 fying it is the same.
124
125
126 · If a unique host name is not specified, sanlock generates a uuid to
127 use as the host's name. The delta lease algorithm depends on hosts
128 using unique names.
129
130
131 · The application on each host should be configured with a unique
132 host_id, where the host_id is an integer 1-2000.
133
134
135 · If hosts are misconfigured and have the same host_id, the delta lease
136 algorithm is designed to detect this conflict, and only one host will
137 be able to acquire the delta lease for that host_id.
138
139
140 · A delta lease ensures that a lockspace host_id is being used by a
141 single host with the unique name specified in the delta lease.
142
143
144 · Resolving delta lease conflicts is slow, because the algorithm is
145 based on waiting and watching for some time for other hosts to write
146 to the same delta lease sector. If multiple hosts try to use the
147 same delta lease, the delay is increased substantially. So, it is
148 best to configure applications to use unique host_id's that will not
149 conflict.
150
151
152 · After sanlock acquires a delta lease, the lease must be renewed until
153 the application leaves the lockspace (which corresponds to releasing
154 the delta lease on the host_id.)
155
156
157 · sanlock renews delta leases every 20 seconds (by default) by writing
158 a new timestamp into the delta lease sector.
159
160
161 · When a host acquires a delta lease in a lockspace, it can be referred
162 to as "joining" the lockspace. Once it has joined the lockspace, it
163 can use resources associated with the lockspace.
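
The delta lease layout described in this list can be sketched in Python. This is only an illustration of the offset arithmetic stated above (delta lease N occupies sector N-1 of the lockspace area); the function name and its parameters are made up for this sketch and are not part of libsanlock.

```python
def delta_lease_offset(host_id, sector_size=512, max_hosts=2000,
                       lockspace_offset=0):
    """Byte offset of the delta lease sector used by host_id.

    Hypothetical helper, not a libsanlock call. The defaults match the
    512 byte sector / 1MB lockspace case described in the text.
    """
    if not 1 <= host_id <= max_hosts:
        raise ValueError("host_id must be in 1..%d" % max_hosts)
    # Delta leases are numbered from 1, sectors from 0.
    return lockspace_offset + (host_id - 1) * sector_size


if __name__ == "__main__":
    for host_id in (1, 2, 2000):
        print(host_id, delta_lease_offset(host_id))
```

For host_id 1, 2 and 2000 this gives offsets 0, 512 and 1023488, the values quoted in the list above.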
164
165
166 Resources
167
168
169 · A lockspace is a context for resources that can be locked and
170 unlocked by an application.
171
172
173 · sanlock uses paxos leases to implement leases on resources. The
174 terms paxos lease and resource lease are used interchangeably.
175
176
177 · A paxos lease exists on shared storage and requires 1MB of space. It
178 contains a unique resource name and the name of the lockspace.
179
180
181 · An application assigns its own meaning to a sanlock resource and the
182 leases on it. A sanlock resource could represent some shared object
183 like a file, or some unique role among the hosts.
184
185
186 · Resource leases are associated with a specific lockspace and can only
187 be used by hosts that have joined that lockspace (they are holding a
188 delta lease on a host_id in that lockspace.)
189
190
191 · An application must keep track of the disk locations of its
192 lockspaces and resources. sanlock does not maintain any persistent
193 index or directory of lockspaces or resources that have been created
194 by applications, so applications need to remember where they have
195 placed their own leases (which files or disks and offsets).
196
197
198 · sanlock does not renew paxos leases directly (although it could).
199 Instead, the renewal of a host's delta lease represents the renewal
200 of all that host's paxos leases in the associated lockspace. In
201 effect, many paxos lease renewals are factored out into one delta
202 lease renewal. This reduces i/o when many paxos leases are used.
203
204
205 · The disk paxos algorithm allows multiple hosts to all attempt to
206 acquire the same paxos lease at once, and will produce a single win‐
207 ner/owner of the resource lease. (Shared resource leases are also
208 possible in addition to the default exclusive leases.)
209
210
211 · The disk paxos algorithm involves a specific sequence of reading and
212 writing the sectors of the paxos lease disk area. Each host has a
213 dedicated 512 byte sector in the paxos lease disk area where it
214 writes its own "ballot", and each host reads the entire disk area to
215 see the ballots of other hosts. The first sector of the disk area is
216 the "leader record" that holds the result of the last paxos ballot.
217 The winner of the paxos ballot writes the result of the ballot to the
218 leader record (the winner of the ballot may have selected another
219 contending host as the owner of the paxos lease.)
220
221
222 · After a paxos lease is acquired, no further i/o is done in the paxos
223 lease disk area.
224
225
226 · Releasing the paxos lease involves writing a single sector to clear
227 the current owner in the leader record.
228
229
230 · If a host holding a paxos lease fails, the disk area of the paxos
231 lease still indicates that the paxos lease is owned by the failed
232 host. If another host attempts to acquire the paxos lease, and finds
233 the lease is held by another host_id, it will check the delta lease
234 of that host_id. If the delta lease of the host_id is being renewed,
235 then the paxos lease is owned and cannot be acquired. If the delta
236 lease of the owner's host_id has expired, then the paxos lease is
237 expired and can be taken (by going through the paxos lease algo‐
238 rithm.)
239
240
241 · The "interaction" or "awareness" between hosts of each other is lim‐
242 ited to the case where they attempt to acquire the same paxos lease,
243 and need to check if the referenced delta lease has expired or not.
244
245
246 · When hosts do not attempt to lock the same resources concurrently,
247 there is no host interaction or awareness. The state or actions of
248 one host have no effect on others.
249
250
251 · To speed up checking delta lease expiration (in the case of a paxos
252 lease conflict), sanlock keeps track of past renewals of other delta
253 leases in the lockspace.
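
The takeover rule in this list (a paxos lease held by another host_id can be taken only once that host's delta lease has expired) can be sketched as a small decision function. This is a simplified model, not sanlock's code: the function name, the use of None for a cleared owner, and the expire_seconds parameter are all assumptions of the sketch, and times are taken to be measured on the checking host's own clock.

```python
def paxos_lease_state(owner_host_id, delta_last_renewal, now, expire_seconds):
    """Classify a paxos lease as seen by a host that wants to acquire it.

    owner_host_id:      owner recorded in the leader record, or None if the
                        owner has been cleared (an assumption of this sketch).
    delta_last_renewal: dict mapping host_id to the time its last delta
                        lease renewal was observed.
    expire_seconds:     hypothetical expiration interval, not a sanlock
                        default.
    """
    if owner_host_id is None:
        return "free"
    last = delta_last_renewal[owner_host_id]
    if now - last < expire_seconds:
        # The owner is still renewing its delta lease.
        return "owned"
    # The owner's delta lease has expired; the paxos lease may be taken
    # by running the paxos lease algorithm.
    return "expired"
```

Only the "expired" and "free" results allow another host to proceed, and in both cases the lease is still acquired through the paxos algorithm rather than written directly.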
254
255
256 Resource Index
257
258 The resource index (rindex) is an optional sanlock feature that appli‐
259 cations can use to keep track of resource lease offsets. Without the
260 rindex, an application must keep track of where its resource leases
261 exist on disk and find available locations when creating new leases.
262
263 The sanlock rindex uses two align-size areas on disk following the
264 lockspace. The first area holds rindex entries; each entry records a
265 resource lease name and location. The second area holds a private
266 paxos lease, used by sanlock internally to protect rindex updates.
267
268 The application creates the rindex on disk with the "format" function.
269 Format is a disk-only operation and does not interact with the live
270 lockspace, so it can be called without first calling add_lockspace.
271 The application needs to follow the convention of writing the lockspace
272 at the start of the device (offset 0) and formatting the rindex immedi‐
273 ately following the lockspace area. When formatting, the application
274 must set flags for sector size and align size to match those for the
275 lockspace.
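
As a rough illustration of the convention above, the following Python sketch computes the byte ranges of the lockspace and the two rindex areas for a given align size. The function name is hypothetical, and the sketch assumes the lockspace starts at offset 0 as the convention requires.

```python
def rindex_layout(align_size=1048576):
    """Return (start, end) byte ranges for the lockspace and rindex areas.

    Hypothetical helper for illustration. Each area is one align size:
    the lockspace, then the rindex entries, then the private paxos lease
    that protects rindex updates.
    """
    return {
        "lockspace": (0, align_size),
        "rindex_entries": (align_size, 2 * align_size),
        "rindex_paxos_lease": (2 * align_size, 3 * align_size),
    }
```

With the default 1MB align size, the rindex would be formatted at offset 1048576.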
276
277 To use the rindex, the application:
278
279
280 · Uses the "create" function to create a new resource lease on disk.
281 This takes the place of the write_resource function. The create
282 function requires the location of the rindex and the name of the new
283 resource lease. sanlock finds a free lease area, writes the new
284 resource lease at that location, updates the rindex with the
285 name:offset, and returns the offset to the caller. The caller uses
286 this offset when acquiring the resource lease.
287
288
289 · Uses the "delete" function to remove a resource lease on disk (also
290 corresponding to the write_resource function.) sanlock clears the
291 resource lease and the rindex entry for it. A subsequent call to
292 create may use this same disk location for a different resource
293 lease.
294
295
296 · Uses the "lookup" function to discover the offset of a resource lease
297 given the resource lease name. The caller would typically call this
298 prior to acquiring the resource lease.
299
300
301 · Uses the "rebuild" function to recreate the rindex if it is damaged
302 or becomes inconsistent. This function scans the disk for resource
303 leases and creates new rindex entries to match the leases it finds.
304
305
306 · The "update" function manipulates rindex entries directly and should
307 not normally be used by the application. In normal usage, the create
308 and delete functions manipulate rindex entries. Update is mainly
309 useful for testing or repairs.
310
311
312 Expiration
313
314
315 · If a host fails to renew its delta lease, e.g. it loses access to
316 the storage, its delta lease will eventually expire and another host
317 will be able to take over any resource leases held by the host. san‐
318 lock must ensure that the application on two different hosts is not
319 holding and using the same lease concurrently.
320
321
322 · When sanlock has failed to renew a delta lease for a period of time,
323 it will begin taking measures to stop local processes (applications)
324 from using any resource leases associated with the expiring lockspace
325 delta lease. sanlock enters this "recovery mode" well ahead of the
326 time when another host could take over the locally owned leases.
327 sanlock must have sufficient time to stop all local processes that
328 are using the expiring leases.
329
330
331 · sanlock uses three methods to stop local processes that are using
332 expiring leases:
333
334 1. Graceful shutdown. sanlock will execute a "graceful shutdown"
335 program that the application previously specified for this case. The
336 shutdown program tells the application to shut down because its
337 leases are expiring. The application must respond by stopping its
338 activities and releasing its leases (or exit). If an application
339 does not specify a graceful shutdown program, sanlock sends SIGTERM
340 to the process instead. The process must release its leases or exit
341 in a prescribed amount of time (see -g), or sanlock proceeds to the
342 next method of stopping.
343
344 2. Forced shutdown. sanlock will send SIGKILL to processes using the
345 expiring leases. The processes have a fixed amount of time to exit
346 after receiving SIGKILL. If any do not exit in this time, sanlock
347 will proceed to the next method.
348
349 3. Host reset. sanlock will trigger the host's watchdog device to
350 forcibly reset it. sanlock carefully manages the timing of the
351 watchdog device so that it fires shortly before any other host could
352 take over the resource leases held by local processes.
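
The order of escalation can be modelled as a function of elapsed time. This is a deliberately simplified sketch: the function and parameter names are invented here, kill_sec stands in for the fixed interval the text mentions without giving a value, and in practice the host reset comes from the watchdog timer firing rather than from a decision like this one.

```python
def recovery_action(elapsed, graceful_sec, kill_sec, pids_remaining):
    """Pick the stop method for a lockspace whose delta lease is expiring.

    elapsed:        seconds since recovery mode began.
    graceful_sec:   time allowed for graceful shutdown (see -g).
    kill_sec:       time allowed after SIGKILL; hypothetical parameter.
    pids_remaining: local processes still using the expiring leases.
    """
    if pids_remaining == 0:
        # Every process released its leases or exited; nothing to stop.
        return "none"
    if elapsed < graceful_sec:
        return "graceful shutdown"
    if elapsed < graceful_sec + kill_sec:
        return "SIGKILL"
    return "watchdog reset"
```

The example values used to exercise the function are arbitrary and do not reflect sanlock defaults.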
353
354
355 Failures
356
357 If a process holding resource leases fails or exits without releasing
358 its leases, sanlock will release the leases for it automatically
359 (unless persistent resource leases were used.)
360
361 If the sanlock daemon cannot renew a lockspace delta lease for a spe‐
362 cific period of time (see Expiration), sanlock will enter "recovery
363 mode" where it attempts to stop and/or kill any processes holding
364 resource leases in the expiring lockspace. If the processes do not
365 exit in time, sanlock will force the host to be reset using the local
366 watchdog device.
367
368 If the sanlock daemon crashes or hangs, it will not renew the expiry
369 time of the per-lockspace connections it had to the wdmd daemon. This
370 will lead to the expiration of the local watchdog device, and the host
371 will be reset.
372
373 Watchdog
374
375 sanlock uses the wdmd(8) daemon to access /dev/watchdog. wdmd multi‐
376 plexes multiple timeouts onto the single watchdog timer. This is
377 required because delta leases for each lockspace are renewed and expire
378 independently.
379
380 sanlock maintains a wdmd connection for each lockspace delta lease
381 being renewed. Each connection has an expiry time for some seconds in
382 the future. After each successful delta lease renewal, the expiry time
383 is renewed for the associated wdmd connection. If wdmd finds any con‐
384 nection expired, it will not renew the /dev/watchdog timer. Given
385 enough successive failed renewals, the watchdog device will fire and
386 reset the host. (Given the multiplexing nature of wdmd, shorter over‐
387 lapping renewal failures from multiple lockspaces could cause spurious
388 watchdog firing.)
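
The multiplexing rule can be sketched as a single check over all connections. This is a simplified model of the behavior described above, not wdmd's implementation; the function name and the dict of per-lockspace expiry times are assumptions of the sketch.

```python
def should_renew_watchdog(connection_expiry, now):
    """Return True only if no connection has expired.

    connection_expiry: dict mapping a lockspace name to the expiry time of
                       its connection (hypothetical representation).
    """
    return all(expiry > now for expiry in connection_expiry.values())
```

A single expired connection is enough to withhold the renewal, which is why overlapping failures in different lockspaces can add up to one long gap in /dev/watchdog renewals.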
389
390 The direct link between delta lease renewals and watchdog renewals pro‐
391 vides a predictable watchdog firing time based on delta lease renewal
392 timestamps that are visible from other hosts. sanlock knows the time
393 the watchdog on another host has fired based on the delta lease time.
394 Furthermore, if the watchdog device on another host fails to fire when
395 it should, the continuation of delta lease renewals from the other host
396 will make this evident and prevent leases from being taken from the
397 failed host.
398
399 If sanlock is able to stop/kill all processes using an expiring
400 lockspace, the associated wdmd connection for that lockspace is
401 removed. The expired wdmd connection will no longer block /dev/watch‐
402 dog renewals, and the host should avoid being reset.
403
404 Storage
405
406 The sector size and the align size should be specified when creating
407 lockspaces and resources (and rindex). The "align size" is the size on
408 disk of a lockspace or a resource, i.e. the amount of disk space it
409 uses. Lockspaces and resources should use matching sector and align
410 sizes, and must use offsets in multiples of the align size. The max
411 number of hosts that can use a lockspace or resource depends on the
412 combination of sector size and align size, shown below. The host_id of
413 hosts using the lockspace can be no larger than the max_hosts value for
414 the lockspace.
415
416 Accepted combinations of sector size and align size, and the corre‐
417 sponding max_hosts (and max host_id) are:
418
419 sector_size 512, align_size 1M, max_hosts 2000
420 sector_size 4096, align_size 1M, max_hosts 250
421 sector_size 4096, align_size 2M, max_hosts 500
422 sector_size 4096, align_size 4M, max_hosts 1000
423 sector_size 4096, align_size 8M, max_hosts 2000
424
425 When sector_size and align_size are not specified, the behavior matches
426 the behavior before these sizes could be configured: on devices which
427 report sector size 512, 512/1M/2000 is used, on devices which report
428 sector size 4096, 4096/8M/2000 is used, and on files, 512/1M/2000 is
429 always used. (Other combinations are not compatible with sanlock ver‐
430 sion 3.6 or earlier.)
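
The accepted combinations and the defaults can be captured in a small lookup, sketched here in Python. The table values are copied from the list above; the function names are hypothetical, and 1M is taken to mean 1048576 bytes.

```python
MIB = 1024 * 1024

# (sector_size, align_size) -> max_hosts, copied from the list above.
MAX_HOSTS = {
    (512, 1 * MIB): 2000,
    (4096, 1 * MIB): 250,
    (4096, 2 * MIB): 500,
    (4096, 4 * MIB): 1000,
    (4096, 8 * MIB): 2000,
}


def max_hosts(sector_size, align_size):
    """Look up max_hosts (and max host_id) for a size combination."""
    try:
        return MAX_HOSTS[(sector_size, align_size)]
    except KeyError:
        raise ValueError("unsupported sector/align combination") from None


def default_sizes(reported_sector_size, is_file=False):
    """Sizes used when sector_size and align_size are not specified."""
    if is_file or reported_sector_size == 512:
        return 512, 1 * MIB
    if reported_sector_size == 4096:
        return 4096, 8 * MIB
    raise ValueError("unexpected sector size")


def check_placement(host_id, offset, sector_size, align_size):
    """Raise ValueError if the offset or host_id breaks the rules above."""
    if offset % align_size != 0:
        raise ValueError("offset must be a multiple of align_size")
    if not 1 <= host_id <= max_hosts(sector_size, align_size):
        raise ValueError("host_id exceeds max_hosts for this combination")
```

For example, check_placement(251, 0, 4096, MIB) raises ValueError, because a 4096/1M lockspace supports host_ids only up to 250.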
431
432 Using sanlock on shared block devices that do host based mirroring or
433 replication is not likely to work correctly. When using sanlock on
434 shared files, all sanlock io should go to one file server.
435
436 Example
437
438 This is an example of creating and using lockspaces and resources from
439 the command line. (Most applications would use sanlock through libsan‐
440 lock rather than through the command line.)
441
442
443 1. Allocate shared storage for sanlock leases.
444
445 This example assumes 512 byte sectors on the device, in which case
446 the lockspace needs 1MB and each resource needs 1MB.
447
448 The example shared block device accessible to all hosts is
449 /dev/leases.
450
451
452 2. Start sanlock on all hosts.
453
454 The -w 0 disables use of the watchdog for testing.
455
456 # sanlock daemon -w 0
457
458
459 3. Start a dummy application on all hosts.
460
461 This sanlock command registers with sanlock, then execs the sleep
462 command which inherits the registered fd. The sleep process acts
463 as the dummy application. Because the sleep process is registered
464 with sanlock, leases can be acquired for it.
465
466 # sanlock client command -c /bin/sleep 600 &
467
468
469 4. Create a lockspace for the application (from one host).
470
471 The lockspace is named "test".
472
473 # sanlock client init -s test:0:/dev/leases:0
474
475
476 5. Join the lockspace for the application.
477
478 Use a unique host_id on each host.
479
480 host1:
481 # sanlock client add_lockspace -s test:1:/dev/leases:0
482 host2:
483 # sanlock client add_lockspace -s test:2:/dev/leases:0
484
485
486 6. Create two resources for the application (from one host).
487
488 The resources are named "RA" and "RB". Offsets are used on the
489 same device as the lockspace. Different LVs or files could also be
490 used.
491
492 # sanlock client init -r test:RA:/dev/leases:1048576
493 # sanlock client init -r test:RB:/dev/leases:2097152
494
495
496 7. Acquire resource leases for the application on host1.
497
498 Acquire an exclusive lease (the default) on the first resource, and
499 a shared lease (SH) on the second resource.
500
501 # export P=`pidof sleep`
502 # sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
503 # sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P
504
505
506 8. Acquire resource leases for the application on host2.
507
508 Acquiring the exclusive lease on the first resource will fail
509 because it is held by host1. Acquiring the shared lease on the
510 second resource will succeed.
511
512 # export P=`pidof sleep`
513 # sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
514 # sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P
515
516
517 9. Release resource leases for the application on both hosts.
518
519 The sleep process could also be killed, in which case the sanlock
520 daemon releases the leases held by that process.
521
522 # sanlock client release -r test:RA:/dev/leases:1048576 -p $P
523 # sanlock client release -r test:RB:/dev/leases:2097152 -p $P
524
525
526 10. Leave the lockspace for the application.
527
528 host1:
529 # sanlock client rem_lockspace -s test:1:/dev/leases:0
530 host2:
531 # sanlock client rem_lockspace -s test:2:/dev/leases:0
532
533
534 11. Stop sanlock on all hosts.
535
536 # sanlock shutdown
537
538
539
541 COMMAND can be one of three primary top level choices
542
543 sanlock daemon start daemon
544 sanlock client send request to daemon (default command if none given)
545 sanlock direct access storage directly (no coordination with daemon)
546
547
548 Daemon Command
549 sanlock daemon [options]
550
551 -D no fork and print all logging to stderr
552
553 -Q 0|1 quiet error messages for common lock contention
554
555 -R 0|1 renewal debugging, log debug info for each renewal
556
557 -L pri write logging at priority level and up to logfile (-1 none)
558
559 -S pri write logging at priority level and up to syslog (-1 none)
560
561 -U uid user id
562
563 -G gid group id
564
565 -H num renewal history size
566
567 -t num max worker threads
568
569 -g sec seconds for graceful recovery
570
571 -w 0|1 use watchdog through wdmd
572
573 -h 0|1 use high priority (RR) scheduling
574
575 -l num use mlockall (0 none, 1 current, 2 current and future)
576
577 -b sec seconds a host id bit will remain set in delta lease bitmap
578
579 -e str local host name used in delta leases
580
581
582
583 Client Command
584 sanlock client action [options]
585
586 sanlock client status
587
588 Print processes, lockspaces, and resources being managed by the sanlock
589 daemon. Add -D to show extra internal daemon status for debugging.
590 Add -o p to show resources by pid, or -o s to show resources by
591 lockspace.
592
593 sanlock client host_status
594
595 Print state of host_id delta leases read during the last renewal.
596 State of all lockspaces is shown (use -s to select one). Add -D to
597 show extra internal daemon status for debugging.
598
599 sanlock client gets
600
601 Print lockspaces being managed by the sanlock daemon. The LOCKSPACE
602 string will be followed by ADD or REM if the lockspace is currently
603 being added or removed. Add -h 1 to also show hosts in each lockspace.
604
605 sanlock client renewal -s LOCKSPACE
606
607 Print a history of renewals with timing details. See the Renewal his‐
608 tory section below.
609
610 sanlock client log_dump
611
612 Print the sanlock daemon internal debug log.
613
614 sanlock client shutdown
615
616 Ask the sanlock daemon to exit. Without the force option (-f 0), the
617 command will be ignored if any lockspaces exist. With the force option
618 (-f 1), any registered processes will be killed, their resource leases
619 released, and lockspaces removed. With the wait option (-w 1), the
620 command will wait for a result from the daemon indicating that it has
621 shut down and is exiting, or cannot shut down because lockspaces exist
622 (command fails).
623
624 sanlock client init -s LOCKSPACE
625
626 Tell the sanlock daemon to initialize a lockspace on disk. The -o
627 option can be used to specify the io timeout to be written in the
628 host_id leases. The -Z and -A options can be used to specify the sec‐
629 tor size and align size, and both should be set together. (Also see
630 sanlock direct init.)
631
632 sanlock client init -r RESOURCE
633
634 Tell the sanlock daemon to initialize a resource lease on disk. The -Z
635 and -A options can be used to specify the sector size and align size,
636 and both should be set together. (Also see sanlock direct init.)
637
638 sanlock client read -s LOCKSPACE
639
640 Tell the sanlock daemon to read a lockspace from disk. Only the
641 LOCKSPACE path and offset are required. If host_id is zero, the first
642 record at offset (host_id 1) is used. The complete LOCKSPACE is
643 printed. Add -D to print other details. (Also see sanlock direct
644 read_leader.)
645
646 sanlock client read -r RESOURCE
647
648 Tell the sanlock daemon to read a resource lease from disk. Only the
649 RESOURCE path and offset are required. The complete RESOURCE is
650 printed. Add -D to print other details. (Also see sanlock direct
651 read_leader.)
652
653 sanlock client add_lockspace -s LOCKSPACE
654
655 Tell the sanlock daemon to acquire the specified host_id in the
656 lockspace. This will allow resources to be acquired in the lockspace.
657 The -o option can be used to specify the io timeout of the acquiring
658 host, and will be written in the host_id lease.
659
660 sanlock client inq_lockspace -s LOCKSPACE
661
662 Inquire about the state of the lockspace in the sanlock daemon, whether
663 it is being added or removed, or is joined.
664
665 sanlock client rem_lockspace -s LOCKSPACE
666
667 Tell the sanlock daemon to release the specified host_id in the
668 lockspace. Any processes holding resource leases in this lockspace
669 will be killed, and the resource leases not released.
670
671 sanlock client command -r RESOURCE -c path args
672
673 Register with the sanlock daemon, acquire the specified resource lease,
674 and exec the command at path with args. When the command exits, the
675 sanlock daemon will release the lease. -c must be the final option.
676
677 sanlock client acquire -r RESOURCE -p pid
678 sanlock client release -r RESOURCE -p pid
679
680 Tell the sanlock daemon to acquire or release the specified resource
681 lease for the given pid. The pid must be registered with the sanlock
682 daemon. acquire can optionally take a versioned RESOURCE string
683 RESOURCE:lver, where lver is the version of the lease that must be
684 acquired, or fail.
685
686 sanlock client convert -r RESOURCE -p pid
687
688 Tell the sanlock daemon to convert the mode of the specified resource
689 lease for the given pid. If the existing mode is exclusive (default),
690 the mode of the lease can be converted to shared with RESOURCE:SH. If
691 the existing mode is shared, the mode of the lease can be converted to
692 exclusive with RESOURCE (no :SH suffix).
693
694 sanlock client inquire -p pid
695
696 Print the resource leases held by the given pid. The format is a
697 versioned RESOURCE string "RESOURCE:lver" where lver is the version of the
698 lease held.
699
700 sanlock client request -r RESOURCE -f force_mode
701
702 Request the owner of a resource do something specified by force_mode.
703 A versioned RESOURCE:lver string must be used with a greater version
704 than is presently held. Zero lver and force_mode clears the request.
705
706 sanlock client examine -r RESOURCE
707
708 Examine the request record for the currently held resource lease and
709 carry out the action specified by the requested force_mode.
710
711 sanlock client examine -s LOCKSPACE
712
713 Examine requests for all resource leases currently held in the named
714 lockspace. Only lockspace_name is used from the LOCKSPACE argument.
715
716 sanlock client set_event -s LOCKSPACE -i host_id -g gen -e num -d num
717
718 Set an event for another host. When the sanlock daemon next renews its
719 delta lease for the lockspace it will: set the bit for the host_id in
720 its bitmap, and set the generation, event and data values in its own
721 delta lease. An application that has registered for events from this
722 lockspace on the destination host will get the event that has been set
723 when the destination sees the event during its next delta lease
724 renewal.
725
726 sanlock client set_config -s LOCKSPACE
727
728 Set a configuration value for a lockspace. Only lockspace_name is used
729 from the LOCKSPACE argument. The USED flag has the same effect on a
730 lockspace as a process holding a resource lease that will not exit.
731 The USED_BY_ORPHANS flag means that an orphan resource lease will have
732 the same effect as the USED.
733 -u 0|1 Set (1) or clear (0) the USED flag.
734 -O 0|1 Set (1) or clear (0) the USED_BY_ORPHANS flag.
735
736 sanlock client format -x RINDEX
737
738 Create a resource index on disk. Use -Z and -A to set the sector size
739 and align size to match the lockspace.
740
741 sanlock client create -x RINDEX -e resource_name
742
743 Create a new resource lease on disk, using the rindex to find a free
744 offset.
745
746 sanlock client delete -x RINDEX -e resource_name[:offset]
747
748 Delete an existing resource lease on disk.
749
750 sanlock client lookup -x RINDEX -e resource_name
751
752 Look up the offset of an existing resource lease by name on disk, using
753 the rindex. With no -e option, lookup returns the next free lease
754 offset. If -e specifies both name and offset, the lookup verifies both are
755 correct.
756
757 sanlock client update -x RINDEX -e resource_name[:offset] [-z 0|1]
758
759 Add (-z 0) or remove (-z 1) an rindex entry on disk.
760
761 sanlock client rebuild -x RINDEX
762
763 Rebuild the rindex entries by scanning the disk for resource leases.
764
765
766
767 Direct Command
768 sanlock direct action [options]
769
770
771 -o sec io timeout in seconds
772
773 sanlock direct init -s LOCKSPACE
774 sanlock direct init -r RESOURCE
775
776 Initialize storage for a lockspace or resource. Use the -Z and -A
777 flags to specify the sector size and align size. The max hosts that
778 can use the lockspace/resource (and the max possible host_id) is deter‐
779 mined by the sector/align size combination. Possible combinations are:
780 512/1M, 4096/1M, 4096/2M, 4096/4M, 4096/8M. Lockspaces and resources
781 both use the same amount of space (align_size) for each combination.
782 When initializing a lockspace, sanlock initializes delta leases for
783 max_hosts in the given space. When initializing a resource, sanlock
784 initializes a single paxos lease in the space. With -s, the -o option
785 specifies the io timeout to be written in the host_id leases. With -r,
786 the -z 1 option invalidates the resource lease on disk so it cannot be
787 used until reinitialized normally.
788
789 sanlock direct read_leader -s LOCKSPACE
790 sanlock direct read_leader -r RESOURCE
791
792 Read a leader record from disk and print the fields. The leader record
793 is the single sector of a delta lease, or the first sector of a paxos
794 lease.
795
796 sanlock direct dump path[:offset[:size]]
797
798 Read disk sectors and print leader records for delta or paxos leases.
799 Add -f 1 to print the request record values for paxos leases, host_ids
800 set in delta lease bitmaps, and rindex entries.
801
802 sanlock direct format -x RINDEX
803 sanlock direct lookup -x RINDEX -e resource_name
804 sanlock direct update -x RINDEX -e resource_name[:offset] [-z 0|1]
805 sanlock direct rebuild -x RINDEX
806
807 Access the resource index on disk without going through the sanlock
808 daemon. This precludes using the internal paxos lease to protect
809 rindex modifications. See client equivalents for descriptions.
810
811
812
813 LOCKSPACE option string
814 -s lockspace_name:host_id:path:offset
815
816 lockspace_name name of lockspace
817 host_id local host identifier in lockspace
818 path path to storage to use for leases
819 offset offset on path (bytes)
820
821
822 RESOURCE option string
823 -r lockspace_name:resource_name:path:offset
824
825 lockspace_name name of lockspace
826 resource_name name of resource
827 path path to storage to use for leases
828 offset offset on path (bytes)
829
830
831 RESOURCE option string with suffix
832 -r lockspace_name:resource_name:path:offset:lver
833
834 lver leader version
835
836 -r lockspace_name:resource_name:path:offset:SH
837
838 SH indicates shared mode
839
840
841 RINDEX option string
842 -x lockspace_name:path:offset
843
844 lockspace_name name of lockspace
845 path path to storage to use for leases
846 offset offset on path (bytes) of rindex
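
The option strings above are colon-separated fields, so they can be split mechanically. The Python sketch below shows one way to parse them. The class and function names are hypothetical, and the sketch assumes that no field (including the path) contains a colon.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lockspace:
    name: str
    host_id: int
    path: str
    offset: int


@dataclass
class Resource:
    lockspace_name: str
    resource_name: str
    path: str
    offset: int
    lver: Optional[int] = None  # leader version suffix, if given
    shared: bool = False        # True when the suffix is SH


@dataclass
class Rindex:
    lockspace_name: str
    path: str
    offset: int


def parse_lockspace(arg: str) -> Lockspace:
    """Parse lockspace_name:host_id:path:offset."""
    name, host_id, path, offset = arg.split(":")
    return Lockspace(name, int(host_id), path, int(offset))


def parse_resource(arg: str) -> Resource:
    """Parse lockspace_name:resource_name:path:offset[:lver|:SH]."""
    parts = arg.split(":")
    if len(parts) not in (4, 5):
        raise ValueError(f"expected 4 or 5 fields, got {len(parts)}")
    resource = Resource(parts[0], parts[1], parts[2], int(parts[3]))
    if len(parts) == 5:
        if parts[4] == "SH":
            resource.shared = True
        else:
            resource.lver = int(parts[4])
    return resource


def parse_rindex(arg: str) -> Rindex:
    """Parse lockspace_name:path:offset."""
    name, path, offset = arg.split(":")
    return Rindex(name, path, int(offset))
```

For example, parse_resource("foo:example1:/path:1048576:SH") yields a Resource with shared set to True and lver left unset.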
847
848
849
850 Defaults
851 sanlock help shows the default values for the options above.
852
853 sanlock version shows the build version.
854
855
857 Request/Examine
858 The first part of making a request for a resource is writing the
859 request record of the resource (the sector following the leader
860 record). To make a successful request:
861
862 · RESOURCE:lver must be greater than the lver presently held by the
863 other host. This implies the leader record must be read to discover
864 the lver, prior to making a request.
865
866 · RESOURCE:lver must be greater than or equal to the lver presently
867 written to the request record. Two hosts may write a new request at
868 the same time for the same lver, in which case both would succeed,
869 but the force_mode from the last would win.
870
871 · The force_mode must be greater than zero.
872
873 · To unconditionally clear the request record (set both lver and
874 force_mode to 0), make request with RESOURCE:0 and force_mode 0.
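
The rules above can be restated as a small decision function. This Python sketch is only a model of the listed conditions, not sanlock's implementation. The function name, argument names, and return strings are hypothetical.

```python
def check_request(req_lver: int, force_mode: int,
                  leader_lver: int, record_lver: int) -> str:
    """Classify a request according to the rules listed above.

    req_lver    -- the lver given as RESOURCE:lver
    force_mode  -- the requested force_mode
    leader_lver -- lver currently held by the other host (leader record)
    record_lver -- lver currently written in the request record

    Returns "clear", "write", or "reject".
    """
    # RESOURCE:0 with force_mode 0 unconditionally clears the record.
    if req_lver == 0 and force_mode == 0:
        return "clear"
    # The force_mode must be greater than zero.
    if force_mode <= 0:
        return "reject"
    # The lver must exceed the lver held by the current owner.
    if req_lver <= leader_lver:
        return "reject"
    # The lver must be at least the lver already in the request record.
    if req_lver < record_lver:
        return "reject"
    return "write"
```

Because the third rule accepts an equal lver, two hosts writing a request for the same lver both pass this check, and the force_mode written last is the one left in the record.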
875
876
877 The owner of the requested resource will not know of the request unless
878 it is explicitly told to examine its resources via the "examine"
879 api/command, or otherwise notified.
880
881 The second part of making a request is notifying the resource lease
882 owner that it should examine the request records of its resource
883 leases. The notification will cause the lease owner to automatically
884 run the equivalent of "sanlock client examine -s LOCKSPACE" for the
885 lockspace of the requested resource.
886
887 The notification is made using a bitmap in each host_id delta lease.
888 Each bit represents one of the possible host_ids (1-2000). If host A
889 wants to notify host B to examine its resources, A sets the bit in its
890 own bitmap that corresponds to the host_id of B. When B next renews
891 its delta lease, it reads the delta leases for all hosts and checks
892 each bitmap to see if its own host_id has been set. It finds the bit
893 for its own host_id set in A's bitmap, and examines its resource
894 request records. (The bit remains set in A's bitmap for set_bit‐
895 map_seconds.)
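
The notification scheme above can be modeled in memory as follows. This Python sketch is illustrative only. The class and function names are hypothetical, and the mapping of host_id N to bit N-1 (least significant bit first within each byte) is an assumption, not the documented on-disk layout.

```python
from typing import Iterable, List

MAX_HOST_ID = 2000  # host_ids run from 1 to 2000 per the text


class DeltaLease:
    """In-memory stand-in for one host's delta lease and its bitmap."""

    def __init__(self, host_id: int) -> None:
        self.host_id = host_id
        self.bitmap = bytearray((MAX_HOST_ID + 7) // 8)

    def set_bit(self, target_host_id: int) -> None:
        """Ask target_host_id to examine its resources."""
        index = target_host_id - 1  # assumed mapping: host_id N -> bit N-1
        self.bitmap[index // 8] |= 1 << (index % 8)

    def has_bit(self, target_host_id: int) -> bool:
        index = target_host_id - 1
        return bool(self.bitmap[index // 8] & (1 << (index % 8)))


def hosts_notifying(leases: Iterable[DeltaLease], my_host_id: int) -> List[int]:
    """Return host_ids whose bitmaps have the bit for my_host_id set.

    This models what a host does at renewal time: read every delta
    lease and check each bitmap for its own host_id.
    """
    return [lease.host_id for lease in leases
            if lease.host_id != my_host_id and lease.has_bit(my_host_id)]


if __name__ == "__main__":
    host_a, host_b = DeltaLease(1), DeltaLease(2)
    host_a.set_bit(host_b.host_id)                 # A notifies B
    print(hosts_notifying([host_a, host_b], 2))    # B sees A's request
```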
896
897 force_mode determines the action the resource lease owner should take:
898
899
900 · FORCE (1): kill the process holding the resource lease. When the
901 process has exited, the resource lease will be released, and can then
902 be acquired by anyone. The kill signal is SIGKILL (or SIGTERM if
903 SIGKILL is restricted).
904
905
906 · GRACEFUL (2): run the program configured by sanlock_killpath against
907 the process holding the resource lease. If no killpath is defined,
908 then FORCE is used.
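
The fallback from GRACEFUL to FORCE can be written as a short selection function. This Python sketch only models the choice described above. The names and the returned strings are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class ForceMode(IntEnum):
    FORCE = 1
    GRACEFUL = 2


def choose_action(force_mode: int, killpath: Optional[str]) -> str:
    """Return "killpath" to run the configured program, else "signal".

    GRACEFUL falls back to the FORCE behavior when no killpath is set.
    An unknown force_mode raises ValueError.
    """
    mode = ForceMode(force_mode)
    if mode is ForceMode.GRACEFUL and killpath:
        return "killpath"
    return "signal"
```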
909
910
911 Persistent and orphan resource leases
912 A resource lease can be acquired with the PERSISTENT flag (-P 1). If
913 the process holding the lease exits, the lease will not be released,
914 but kept on an orphan list. Another local process can then use the
915 ORPHAN flag (-O 1) either to acquire the orphan lease or to release
916 it. To release all orphan leases in a lockspace, specify the lockspace
917 name (-s lockspace_name) with no resource name.
918
919
920 Renewal history
921 sanlock saves a limited history of lease renewal information in each
922 lockspace. See sanlock.conf renewal_history_size to set the amount of
923 history or to disable (set to 0).
924
925 IO times are measured during delta lease renewal (each delta lease renewal
926 includes one read and one write).
927
928 For each successful renewal, a record is saved that includes:
929
930 · the timestamp written in the delta lease by the renewal
931
932 · the time in milliseconds taken by the delta lease read
933
934 · the time in milliseconds taken by the delta lease write
935
936
937 Also counted and recorded are the number of io timeouts and other io
938 errors that occur between successful renewals.
939
940 Two consecutive successful renewals would be recorded as:
941 timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
942 timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0
943
944 Those fields are:
945
946
947 · timestamp is the value written into the delta lease during that
948 renewal.
949
950
951 · read_ms/write_ms are the milliseconds taken for the renewal
952 read/write ios.
953
954
955 · next_timeouts is the number of io timeouts that occurred after the
956 renewal recorded on that line, and before the next successful renewal
957 on the following line.
958
959
960 · next_errors is the number of io errors (not timeouts) that occurred
961 after the renewal recorded on that line, and before the next successful
962 renewal on the following line.
963
964
965 The command 'sanlock client renewal -s lockspace_name' reports the full
966 history of renewals saved by sanlock, which by default is 180 records,
967 about 1 hour of history when using a 20 second renewal interval for a
968 10 second io timeout.
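
Renewal records in the key=value form shown above are easy to post-process. The Python sketch below parses one record and reproduces the one hour estimate. The function names are hypothetical, and the sketch assumes every field value is an integer, as in the two sample lines.

```python
from typing import Dict


def parse_renewal_record(line: str) -> Dict[str, int]:
    """Split a record such as 'timestamp=5332 read_ms=482 ...' into a dict."""
    return {key: int(value)
            for key, value in (field.split("=", 1) for field in line.split())}


def history_seconds(records: int, renewal_interval: int) -> int:
    """Approximate time span covered by a full renewal history."""
    return records * renewal_interval


if __name__ == "__main__":
    first = parse_renewal_record(
        "timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0")
    second = parse_renewal_record(
        "timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0")
    # Seconds between the two sample renewals.
    print(second["timestamp"] - first["timestamp"])
    # 180 records at a 20 second interval cover 3600 seconds.
    print(history_seconds(180, 20))
```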
969
970
972 Disk Format
973 · This example uses 512 byte sectors.
974
975 · Each lockspace is 1MB. It holds 2000 delta_leases, one per sector,
976 supporting up to 2000 hosts.
977
978 · Each paxos_lease is 1MB. It is used as a lease for one resource.
979
980 · The leader_record structure is used differently by each lease type.
981
982 · To display all leader_record fields, see sanlock direct read_leader.
983
984 · A lockspace is often followed on disk by the paxos_leases used within
985 that lockspace, but this layout is not required.
986
987 · The request_record and host_id bitmap are used for requests/events.
988
989 · The mode_block contains the SHARED flag indicating a lease is held in
990 the shared mode.
991
992 · In a lockspace, the host using host_id N writes to a single
993 delta_lease in sector N-1. No other hosts write to this sector. All
994 hosts read all lockspace sectors when renewing their own delta_lease,
995 and are able to monitor renewals of all delta_leases.
996
997 · In a paxos_lease, each host has a dedicated sector it writes to, con‐
998 taining its own paxos_dblock and mode_block structures. Its sector
999 is based on its host_id; host_id 1 writes to the dblock/mode_block in
1000 sector 2 of the paxos_lease.
1001
1002 · The paxos_dblock structures are used by the paxos_lease algorithm,
1003 and the result is written to the leader_record.
1004
1005
1006 0x000000 lockspace foo:0:/path:0
1007
1008 (There is no representation on disk of the lockspace in general, only
1009 the sequence of specific delta_leases which collectively represent the
1010 lockspace.)
1011
1012 delta_lease foo:1:/path:0
1013 0x000 0 leader_record (sector 0, for host_id 1)
1014 magic: 0x12212010
1015 space_name: foo
1016 resource_name: host uuid/name
1017 ...
1018 host_id bitmap (leader_record + 256)
1019
1020 delta_lease foo:2:/path:0
1021 0x200 512 leader_record (sector 1, for host_id 2)
1022 magic: 0x12212010
1023 space_name: foo
1024 resource_name: host uuid/name
1025 ...
1026 host_id bitmap (leader_record + 256)
1027
1028 delta_lease foo:3:/path:0
1029 0x400 1024 leader_record (sector 2, for host_id 3)
1030 magic: 0x12212010
1031 space_name: foo
1032 resource_name: host uuid/name
1033 ...
1034 host_id bitmap (leader_record + 256)
1035
1036 delta_lease foo:2000:/path:0
1037 0xF9E00 leader_record (sector 1999, for host_id 2000)
1038 magic: 0x12212010
1039 space_name: foo
1040 resource_name: host uuid/name
1041 ...
1042 host_id bitmap (leader_record + 256)
1043
1044 0x100000 paxos_lease foo:example1:/path:1048576
1045 0x000 0 leader_record (sector 0)
1046 magic: 0x06152010
1047 space_name: foo
1048 resource_name: example1
1049
1050 0x200 512 request_record (sector 1)
1051 magic: 0x08292011
1052
1053 0x400 1024 paxos_dblock (sector 2, for host_id 1)
1054 0x480 1152 mode_block (paxos_dblock + 128)
1055
1056 0x600 1536 paxos_dblock (sector 3, for host_id 2)
1057 0x680 1664 mode_block (paxos_dblock + 128)
1058
1059 0x800 2048 paxos_dblock (sector 4, for host_id 3)
1060 0x880 2176 mode_block (paxos_dblock + 128)
1061
1062 0xFA200 paxos_dblock (sector 2001, for host_id 2000)
1063 0xFA280 mode_block (paxos_dblock + 128)
1064
1065 0x200000 paxos_lease foo:example2:/path:2097152
1066 0x000 0 leader_record (sector 0)
1067 magic: 0x06152010
1068 space_name: foo
1069 resource_name: example2
1070
1071 0x200 512 request_record (sector 1)
1072 magic: 0x08292011
1073
1074 0x400 1024 paxos_dblock (sector 2, for host_id 1)
1075 0x480 1152 mode_block (paxos_dblock + 128)
1076
1077 0x600 1536 paxos_dblock (sector 3, for host_id 2)
1078 0x680 1664 mode_block (paxos_dblock + 128)
1079
1080 0x800 2048 paxos_dblock (sector 4, for host_id 3)
1081 0x880 2176 mode_block (paxos_dblock + 128)
1082
1083 0xFA200 paxos_dblock (sector 2001, for host_id 2000)
1084 0xFA280 mode_block (paxos_dblock + 128)
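
The offsets in the layout above follow from the sector numbering rules, and can be recomputed with a few lines of Python. This sketch assumes the 512 byte sector example shown here. The function names are hypothetical.

```python
SECTOR_SIZE = 512          # this example uses 512 byte sectors
BITMAP_OFFSET = 256        # host_id bitmap is at leader_record + 256
MODE_BLOCK_OFFSET = 128    # mode_block is at paxos_dblock + 128


def delta_lease_offset(lockspace_offset: int, host_id: int) -> int:
    """Byte offset of the delta_lease for host_id (sector host_id - 1)."""
    return lockspace_offset + (host_id - 1) * SECTOR_SIZE


def host_bitmap_offset(lockspace_offset: int, host_id: int) -> int:
    """Byte offset of the host_id bitmap inside that delta_lease."""
    return delta_lease_offset(lockspace_offset, host_id) + BITMAP_OFFSET


def paxos_dblock_offset(lease_offset: int, host_id: int) -> int:
    """Byte offset of the paxos_dblock for host_id (sector host_id + 1).

    Sector 0 holds the leader_record and sector 1 the request_record.
    """
    return lease_offset + (host_id + 1) * SECTOR_SIZE


def mode_block_offset(lease_offset: int, host_id: int) -> int:
    """Byte offset of the mode_block that follows the paxos_dblock."""
    return paxos_dblock_offset(lease_offset, host_id) + MODE_BLOCK_OFFSET


if __name__ == "__main__":
    print(hex(delta_lease_offset(0, 2000)))        # delta_lease of host_id 2000
    print(hex(paxos_dblock_offset(0x100000, 1)))   # dblock of host_id 1 in example1
    print(hex(mode_block_offset(0, 2000)))         # relative to the lease start
```

The values agree with the layout: host_id 2000 has its delta_lease at 0xF9E00, and its paxos_dblock and mode_block at 0xFA200 and 0xFA280 relative to the start of a paxos_lease.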
1085
1086
1087 Lease ownership
1088 Not shown in the leader_record structures above are the owner_id,
1089 owner_generation and timestamp fields. These are the fields that
1090 define the lease owner.
1091
1092 The delta_lease at sector N for host_id N+1 has leader_record.owner_id
1093 N+1. The leader_record.owner_generation is incremented each time the
1094 delta_lease is acquired. When a delta_lease is acquired, the
1095 leader_record.timestamp field is set to the time of the host and the
1096 leader_record.resource_name is set to the unique name of the host.
1097 When the host renews the delta_lease, it writes a new
1098 leader_record.timestamp. When a host releases a delta_lease, it writes
1099 zero to leader_record.timestamp.
1100
1101 When a host acquires a paxos_lease, it uses the host_id/generation
1102 value from the delta_lease it holds in the lockspace. It uses this
1103 host_id/generation to identify itself in the paxos_dblock when running
1104 the paxos algorithm. The result of the algorithm is the winning
1105 host_id/generation - the new owner of the paxos_lease. The winning
1106 host_id/generation are written to the paxos_lease
1107 leader_record.owner_id and leader_record.owner_generation fields and
1108 leader_record.timestamp is set. When a host releases a paxos_lease, it
1109 sets leader_record.timestamp to 0.
1110
1111 When a paxos_lease is free (leader_record.timestamp is 0), multiple
1112 hosts may attempt to acquire it. The paxos algorithm, using the
1113 paxos_dblock structures, will select only one of the hosts as the new
1114 owner, and that owner is written in the leader_record. The paxos_lease
1115 will no longer be free (non-zero timestamp). Other hosts will see this
1116 and will not attempt to acquire the paxos_lease until it is free again.
1117
1118 If a paxos_lease is owned (non-zero timestamp), but the owner has not
1119 renewed its delta_lease for a specific length of time, then the owner
1120 value in the paxos_lease becomes expired, and other hosts will use the
1121 paxos algorithm to acquire the paxos_lease, and set a new owner.
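
The ownership rules above reduce to a three-way classification of a paxos_lease. The Python sketch below is a simplified model. The function name and the expire_seconds parameter are hypothetical, since the text only says "a specific length of time", and real expiry handling in sanlock involves more than a single comparison.

```python
def paxos_lease_state(leader_timestamp: int,
                      owner_last_renewal: int,
                      now: int,
                      expire_seconds: int) -> str:
    """Classify a paxos_lease as "free", "owned", or "expired".

    leader_timestamp   -- leader_record.timestamp of the paxos_lease
    owner_last_renewal -- timestamp last written by the owner in its
                          delta_lease
    now                -- current time, in the same units
    expire_seconds     -- hypothetical expiry threshold (an assumption)
    """
    if leader_timestamp == 0:
        return "free"      # released, any host may try to acquire it
    if now - owner_last_renewal >= expire_seconds:
        return "expired"   # owner stopped renewing its delta_lease
    return "owned"
```

In both the "free" and "expired" cases, hosts that want the lease still run the paxos algorithm, and only the winner is written to the leader_record.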
1122
1123
1125 /etc/sanlock/sanlock.conf
1126
1127
1128 · quiet_fail = 1
1129 See -Q
1130
1131
1132 · debug_renew = 0
1133 See -R
1134
1135
1136 · logfile_priority = 4
1137 See -L
1138
1139
1140 · logfile_use_utc = 0
1141 Use UTC instead of local time in log messages.
1142
1143
1144 · syslog_priority = 3
1145 See -S
1146
1147
1148 · names_log_priority = 4
1149 Log resource names at this priority level (uses syslog priority num‐
1150 bers). If this is greater than or equal to logfile_priority, each
1151 requested resource name and location is recorded in sanlock.log.
1152
1153
1154 · use_watchdog = 1
1155 See -w
1156
1157
1158 · high_priority = 1
1159 See -h
1160
1161
1162 · mlock_level = 1
1163 See -l
1164
1165
1166 · sh_retries = 8
1167 The number of times to try acquiring a paxos lease for a shared
1168 lease when the paxos lease is held by another host that is also
1169 acquiring a shared lease.
1170
1171
1172 · uname = sanlock
1173 See -U
1174
1175
1176 · gname = sanlock
1177 See -G
1178
1179
1180 · our_host_name = <str>
1181 See -e
1182
1183
1184 · renewal_read_extend_sec = <seconds>
1185 If a renewal read i/o times out, wait this many additional seconds
1186 for that read to complete at the start of the subsequent renewal
1187 attempt. When not configured, sanlock waits for an additional
1188 io_timeout seconds for a previous timed out read to complete.
1189
1190
1191 · renewal_history_size = 180
1192 See -H
1193
1194
1195 · paxos_debug_all = 0
1196 Include all details in the paxos debug logging.
1197
1198
1199 · debug_io = <str>
1200 Add debug logging for each i/o. "submit" (no quotes) produces debug
1201 output at submission time, "complete" produces debug output at com‐
1202 pletion time, and "submit,complete" (no space) produces both.
1203
1204
1205 · max_sectors_kb = <str>|<num>
1206 Set to "ignore" (no quotes) to prevent sanlock from checking or
1207 changing max_sectors_kb for the lockspace disk when starting a
1208 lockspace. Set to "align" (no quotes) to set max_sectors_kb for the
1209 lockspace disk to the align size of the lockspace. Set to a number
1210 to set a specific number of KB for all lockspace disks.
1211
1212
1213 · debug_clients = 0
1214 Enable or disable debug logging for all client connections to the
1215 sanlock daemon.
1216
1217
1218 · debug_cmd = +|-<name>
1219 Enable (+name) or disable (-name) debug logging at the command pro‐
1220 cessing level for specifically named commands, e.g. "debug_cmd =
1221 +acquire", or "debug_cmd = -inq_lockspace". Repeat this line for
1222 each command name. Use a plus prefix before the name to enable and a
1223 minus prefix to disable. By default sanlock disables some command
1224 level debugging for commands that are often repetitive and fill the
1225 in-memory debug buffer. This only affects debug logging, not errors
1226 or warnings, and disabling command level debugging for a command does
1227 not disable lower level debugging for that command. Special values
1228 +all and -all can be used to enable or disable all commands, and can
1229 be used before or after other debug_cmd lines.
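
The settings above use a "name = value" form, and debug_cmd may appear on several lines. The Python sketch below reads such text while keeping repeated debug_cmd lines in order. It is not sanlock's parser. The function name is hypothetical, and treating "#" as a comment marker is an assumption.

```python
from typing import Dict, List, Tuple


def parse_conf(text: str) -> Tuple[Dict[str, str], List[str]]:
    """Return (settings, debug_cmds) from sanlock.conf style text.

    settings   -- last value seen for each ordinary name
    debug_cmds -- every debug_cmd value, in file order (e.g. "+acquire")
    """
    settings: Dict[str, str] = {}
    debug_cmds: List[str] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumed comment syntax
        if "=" not in line:
            continue                          # blank or unrecognized line
        name, value = (part.strip() for part in line.split("=", 1))
        if name == "debug_cmd":
            debug_cmds.append(value)
        else:
            settings[name] = value
    return settings, debug_cmds


if __name__ == "__main__":
    sample = """
    renewal_history_size = 180
    debug_cmd = +acquire
    debug_cmd = -inq_lockspace
    """
    print(parse_conf(sample))
```

Keeping the debug_cmd values in file order preserves whatever effect the position of +all or -all has relative to the other debug_cmd lines.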
1230
1231
1233 wdmd(8)
1234
1235
1236
1237
1238 2015-01-23 SANLOCK(8)