SANLOCK(8)                System Manager's Manual                SANLOCK(8)
2
3
4
NAME
sanlock - shared storage lock manager
7
8
SYNOPSIS
sanlock [COMMAND] [ACTION] ...
11
12
DESCRIPTION
sanlock is a lock manager built on shared storage. Hosts with access
to the storage can perform locking. An application running on the
hosts is given a small amount of space on the shared block device or
file, and uses sanlock for its own application-specific
synchronization. Internally, the sanlock daemon manages locks using
two disk-based lease algorithms: delta leases and paxos leases.
20
21
22 • delta leases are slow to acquire and demand regular i/o to shared
23 storage. sanlock only uses them internally to hold a lease on its
24 "host_id" (an integer host identifier from 1-2000). They prevent two
25 hosts from using the same host identifier. The delta lease renewals
26 also indicate if a host is alive. ("Light-Weight Leases for Storage-
27 Centric Coordination", Chockler and Malkhi.)
28
29
30 • paxos leases are fast to acquire and sanlock makes them available to
31 applications as general purpose resource leases. The disk paxos al‐
32 gorithm uses host_id's internally to represent different hosts, and
33 the owner of a paxos lease. delta leases provide unique host_id's
34 for implementing paxos leases, and delta lease renewals serve as a
35 proxy for paxos lease renewal. ("Disk Paxos", Eli Gafni and Leslie
36 Lamport.)
37
38
39 Externally, the sanlock daemon exposes a locking interface through lib‐
40 sanlock in terms of "lockspaces" and "resources". A lockspace is a
41 locking context that an application creates for itself on shared stor‐
42 age. When the application on each host is started, it "joins" the
43 lockspace. It can then create "resources" on the shared storage. Each
44 resource represents an application-specific entity. The application
45 can acquire and release leases on resources.
46
47 To use sanlock from an application:
48
49
50 • Allocate shared storage for an application, e.g. a shared LUN or LV
51 from a SAN, or files from NFS.
52
53
54 • Provide the storage to the application.
55
56
57 • The application uses this storage with libsanlock to create a
58 lockspace and resources for itself.
59
60
61 • The application joins the lockspace when it starts.
62
63
64 • The application acquires and releases leases on resources.
65
66
67 How lockspaces and resources translate to delta leases and paxos leases
68 within sanlock:
69
70 Lockspaces
71
72
73 • A lockspace is based on delta leases held by each host using the
74 lockspace.
75
76
77 • A lockspace is a series of 2000 delta leases on disk, and requires
78 1MB of storage. (See Storage below for size variations.)
79
80
81 • A lockspace can support up to 2000 concurrent hosts using it, each
82 using a different delta lease.
83
84
85 • Applications can i) create, ii) join and iii) leave a lockspace,
86 which corresponds to i) initializing the set of delta leases on disk,
87 ii) acquiring one of the delta leases and iii) releasing the delta
88 lease.
89
90
91 • When a lockspace is created, a unique lockspace name and disk loca‐
92 tion is provided by the application.
93
94
95 • When a lockspace is created/initialized, sanlock formats the sequence
96 of 2000 on-disk delta lease structures on the file or disk, e.g.
97 /mnt/leasefile (NFS) or /dev/vg/lv (SAN).
98
99
100 • The 2000 individual delta leases in a lockspace are identified by
101 number: 1,2,3,...,2000.
102
103
104 • Each delta lease is a 512 byte sector in the 1MB lockspace, offset by
105 its number, e.g. delta lease 1 is offset 0, delta lease 2 is offset
106 512, delta lease 2000 is offset 1023488. (See Storage below for size
107 variations.)
108
109
110 • When an application joins a lockspace, it must specify the lockspace
111 name, the lockspace location on shared disk/file, and the local
112 host's host_id. sanlock then acquires the delta lease corresponding
113 to the host_id, e.g. joining the lockspace with host_id 1 acquires
114 delta lease 1.
115
116
117 • The terms delta lease, lockspace lease, and host_id lease are used
118 interchangeably.
119
120
121 • sanlock acquires a delta lease by writing the host's unique name to
122 the delta lease disk sector, reading it back after a delay, and veri‐
123 fying it is the same.
124
125
126 • If a unique host name is not specified, sanlock generates a uuid to
127 use as the host's name. The delta lease algorithm depends on hosts
128 using unique names.
129
130
131 • The application on each host should be configured with a unique
132 host_id, where the host_id is an integer 1-2000.
133
134
135 • If hosts are misconfigured and have the same host_id, the delta lease
136 algorithm is designed to detect this conflict, and only one host will
137 be able to acquire the delta lease for that host_id.
138
139
140 • A delta lease ensures that a lockspace host_id is being used by a
141 single host with the unique name specified in the delta lease.
142
143
144 • Resolving delta lease conflicts is slow, because the algorithm is
145 based on waiting and watching for some time for other hosts to write
146 to the same delta lease sector. If multiple hosts try to use the
147 same delta lease, the delay is increased substantially. So, it is
148 best to configure applications to use unique host_id's that will not
149 conflict.
150
151
152 • After sanlock acquires a delta lease, the lease must be renewed until
153 the application leaves the lockspace (which corresponds to releasing
154 the delta lease on the host_id.)
155
156
157 • sanlock renews delta leases every 20 seconds (by default) by writing
158 a new timestamp into the delta lease sector.
159
160
161 • When a host acquires a delta lease in a lockspace, it can be referred
162 to as "joining" the lockspace. Once it has joined the lockspace, it
163 can use resources associated with the lockspace.
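
The offset rule above (delta lease N occupies sector N-1 of the lockspace) can be sketched as a small shell helper. The function name delta_lease_offset is hypothetical and not part of sanlock. It defaults to 512 byte sectors, and it assumes the same one-sector-per-lease layout holds for other sector sizes.

```shell
# Hypothetical helper (not part of sanlock): byte offset of the delta lease
# for a host_id, relative to the start of the lockspace.
# Usage: delta_lease_offset HOST_ID [SECTOR_SIZE]
delta_lease_offset() {
    host_id=$1
    sector_size=${2:-512}
    # Leases are numbered from 1, so lease N occupies sector N-1.
    echo $(( (host_id - 1) * sector_size ))
}

delta_lease_offset 1      # prints 0
delta_lease_offset 2      # prints 512
delta_lease_offset 2000   # prints 1023488
```

The three calls reproduce the offsets quoted in the list above for delta leases 1, 2 and 2000.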
164
165
166 Resources
167
168
169 • A lockspace is a context for resources that can be locked and un‐
170 locked by an application.
171
172
173 • sanlock uses paxos leases to implement leases on resources. The
174 terms paxos lease and resource lease are used interchangeably.
175
176
177 • A paxos lease exists on shared storage and requires 1MB of space. It
178 contains a unique resource name and the name of the lockspace.
179
180
181 • An application assigns its own meaning to a sanlock resource and the
182 leases on it. A sanlock resource could represent some shared object
183 like a file, or some unique role among the hosts.
184
185
186 • Resource leases are associated with a specific lockspace and can only
187 be used by hosts that have joined that lockspace (they are holding a
188 delta lease on a host_id in that lockspace.)
189
190
191 • An application must keep track of the disk locations of its
192 lockspaces and resources. sanlock does not maintain any persistent
193 index or directory of lockspaces or resources that have been created
194 by applications, so applications need to remember where they have
195 placed their own leases (which files or disks and offsets).
196
197
198 • sanlock does not renew paxos leases directly (although it could).
199 Instead, the renewal of a host's delta lease represents the renewal
200 of all that host's paxos leases in the associated lockspace. In ef‐
201 fect, many paxos lease renewals are factored out into one delta lease
202 renewal. This reduces i/o when many paxos leases are used.
203
204
205 • The disk paxos algorithm allows multiple hosts to all attempt to ac‐
206 quire the same paxos lease at once, and will produce a single win‐
207 ner/owner of the resource lease. (Shared resource leases are also
208 possible in addition to the default exclusive leases.)
209
210
211 • The disk paxos algorithm involves a specific sequence of reading and
212 writing the sectors of the paxos lease disk area. Each host has a
213 dedicated 512 byte sector in the paxos lease disk area where it
214 writes its own "ballot", and each host reads the entire disk area to
215 see the ballots of other hosts. The first sector of the disk area is
216 the "leader record" that holds the result of the last paxos ballot.
217 The winner of the paxos ballot writes the result of the ballot to the
218 leader record (the winner of the ballot may have selected another
219 contending host as the owner of the paxos lease.)
220
221
222 • After a paxos lease is acquired, no further i/o is done in the paxos
223 lease disk area.
224
225
226 • Releasing the paxos lease involves writing a single sector to clear
227 the current owner in the leader record.
228
229
230 • If a host holding a paxos lease fails, the disk area of the paxos
231 lease still indicates that the paxos lease is owned by the failed
232 host. If another host attempts to acquire the paxos lease, and finds
233 the lease is held by another host_id, it will check the delta lease
234 of that host_id. If the delta lease of the host_id is being renewed,
235 then the paxos lease is owned and cannot be acquired. If the delta
236 lease of the owner's host_id has expired, then the paxos lease is ex‐
237 pired and can be taken (by going through the paxos lease algorithm.)
238
239
• Hosts interact with, or are aware of, each other only when they
attempt to acquire the same paxos lease and need to check whether the
referenced delta lease has expired.
243
244
245 • When hosts do not attempt to lock the same resources concurrently,
246 there is no host interaction or awareness. The state or actions of
247 one host have no effect on others.
248
249
250 • To speed up checking delta lease expiration (in the case of a paxos
251 lease conflict), sanlock keeps track of past renewals of other delta
252 leases in the lockspace.
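
The takeover rule described above (a paxos lease held by another host_id can be taken only once that host's delta lease has expired) can be modelled roughly as follows. This is an illustrative sketch, not sanlock's implementation: the function name, the use of owner 0 to mean "no owner", and the EXPIRE_SECONDS parameter are all assumptions made for the example.

```shell
# Hypothetical decision helper; sanlock performs this check internally.
# Usage: lease_state OWNER_HOST_ID OWNER_LAST_RENEWAL NOW EXPIRE_SECONDS
# OWNER_HOST_ID 0 is used here to mean "leader record has no owner".
lease_state() {
    owner=$1
    last_renewal=$2
    now=$3
    expire=$4
    if [ "$owner" -eq 0 ]; then
        echo free
    elif [ $(( now - last_renewal )) -ge "$expire" ]; then
        echo expired    # owner's delta lease lapsed; the lease can be taken
    else
        echo owned      # owner is still renewing; acquisition fails
    fi
}

lease_state 0 0 100 80    # prints free
lease_state 1 90 100 80   # prints owned
lease_state 1 10 100 80   # prints expired
```

In the "expired" case the acquiring host would still have to go through the paxos lease algorithm to become the owner.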
253
254
255 Resource Index
256
257 The resource index (rindex) is an optional sanlock feature that appli‐
258 cations can use to keep track of resource lease offsets. Without the
259 rindex, an application must keep track of where its resource leases ex‐
260 ist on disk and find available locations when creating new leases.
261
262 The sanlock rindex uses two align-size areas on disk following the
263 lockspace. The first area holds rindex entries; each entry records a
264 resource lease name and location. The second area holds a private
265 paxos lease, used by sanlock internally to protect rindex updates.
266
267 The application creates the rindex on disk with the "format" function.
268 Format is a disk-only operation and does not interact with the live
269 lockspace, so it can be called without first calling add_lockspace.
270 The application needs to follow the convention of writing the lockspace
271 at the start of the device (offset 0) and formatting the rindex immedi‐
272 ately following the lockspace area. When formatting, the application
273 must set flags for sector size and align size to match those for the
274 lockspace.
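
The layout convention above can be turned into byte offsets. The helper below is a hypothetical sketch: it assumes the lockspace occupies one align-size area at offset 0 and that the two rindex areas follow it directly. The "first_free" value is an inference from that layout, not something sanlock reports.

```shell
# Hypothetical helper: print the byte offsets implied by the convention of
# lockspace at offset 0 followed by the two align-size rindex areas.
# Usage: rindex_layout ALIGN_SIZE_BYTES
rindex_layout() {
    align=$1
    echo "lockspace=0"
    echo "rindex_entries=$align"
    echo "rindex_lease=$(( 2 * align ))"
    echo "first_free=$(( 3 * align ))"
}

rindex_layout 1048576
```

With a 1M align size, the rindex offset passed to the format function would be 1048576 under these assumptions.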
275
276 To use the rindex, the application:
277
278
279 • Uses the "create" function to create a new resource lease on disk.
280 This takes the place of the write_resource function. The create
281 function requires the location of the rindex and the name of the new
282 resource lease. sanlock finds a free lease area, writes the new re‐
283 source lease at that location, updates the rindex with the name:off‐
284 set, and returns the offset to the caller. The caller uses this off‐
285 set when acquiring the resource lease.
286
287
• Uses the "delete" function to remove a resource lease on disk (also
corresponding to the write_resource function). sanlock clears the
resource lease and the rindex entry for it. A subsequent call to
create may use this same disk location for a different resource
lease.
293
294
295 • Uses the "lookup" function to discover the offset of a resource lease
296 given the resource lease name. The caller would typically call this
297 prior to acquiring the resource lease.
298
299
300 • Uses the "rebuild" function to recreate the rindex if it is damaged
301 or becomes inconsistent. This function scans the disk for resource
302 leases and creates new rindex entries to match the leases it finds.
303
304
305 • The "update" function manipulates rindex entries directly and should
306 not normally be used by the application. In normal usage, the create
307 and delete functions manipulate rindex entries. Update is mainly
308 useful for testing or repairs.
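
The rindex functions above map onto the client actions documented under Client Command. The following dry run prints a possible sequence instead of executing it. The lockspace name, device path and offset are example values, and the sketch does not show initializing or joining the lockspace, or the -Z and -A options.

```shell
# Dry run: print the commands instead of executing them.
# Set RUN to an empty string to run them against a live sanlock daemon.
RUN=echo
RINDEX="test:/dev/leases:1048576"   # lockspace_name:path:offset (example values)

rindex_demo() {
    $RUN sanlock client format -x "$RINDEX"          # create the rindex
    $RUN sanlock client create -x "$RINDEX" -e RA    # new resource lease "RA"
    $RUN sanlock client lookup -x "$RINDEX" -e RA    # find its offset
    $RUN sanlock client delete -x "$RINDEX" -e RA    # remove it again
}

rindex_demo
```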
309
310
311 Expiration
312
313
• If a host fails to renew its delta lease, e.g. it loses access to
the storage, its delta lease will eventually expire and another host
will be able to take over any resource leases held by the host.
sanlock must ensure that the application on two different hosts is not
holding and using the same lease concurrently.
319
320
321 • When sanlock has failed to renew a delta lease for a period of time,
322 it will begin taking measures to stop local processes (applications)
323 from using any resource leases associated with the expiring lockspace
324 delta lease. sanlock enters this "recovery mode" well ahead of the
325 time when another host could take over the locally owned leases.
326 sanlock must have sufficient time to stop all local processes that
327 are using the expiring leases.
328
329
330 • sanlock uses three methods to stop local processes that are using ex‐
331 piring leases:
332
333 1. Graceful shutdown. sanlock will execute a "graceful shutdown"
334 program that the application previously specified for this case. The
335 shutdown program tells the application to shut down because its
336 leases are expiring. The application must respond by stopping its
337 activities and releasing its leases (or exit). If an application
338 does not specify a graceful shutdown program, sanlock sends SIGTERM
339 to the process instead. The process must release its leases or exit
340 in a prescribed amount of time (see -g), or sanlock proceeds to the
341 next method of stopping.
342
343 2. Forced shutdown. sanlock will send SIGKILL to processes using the
344 expiring leases. The processes have a fixed amount of time to exit
345 after receiving SIGKILL. If any do not exit in this time, sanlock
346 will proceed to the next method.
347
348 3. Host reset. sanlock will trigger the host's watchdog device to
349 forcibly reset it. sanlock carefully manages the timing of the
350 watchdog device so that it fires shortly before any other host could
351 take over the resource leases held by local processes.
352
353
354 Failures
355
356 If a process holding resource leases fails or exits without releasing
357 its leases, sanlock will release the leases for it automatically (un‐
358 less persistent resource leases were used.)
359
360 If the sanlock daemon cannot renew a lockspace delta lease for a spe‐
361 cific period of time (see Expiration), sanlock will enter "recovery
362 mode" where it attempts to stop and/or kill any processes holding re‐
363 source leases in the expiring lockspace. If the processes do not exit
364 in time, sanlock will force the host to be reset using the local watch‐
365 dog device.
366
367 If the sanlock daemon crashes or hangs, it will not renew the expiry
368 time of the per-lockspace connections it had to the wdmd daemon. This
369 will lead to the expiration of the local watchdog device, and the host
370 will be reset.
371
372 Watchdog
373
374 sanlock uses the wdmd(8) daemon to access /dev/watchdog. wdmd multi‐
375 plexes multiple timeouts onto the single watchdog timer. This is re‐
376 quired because delta leases for each lockspace are renewed and expire
377 independently.
378
379 sanlock maintains a wdmd connection for each lockspace delta lease be‐
380 ing renewed. Each connection has an expiry time for some seconds in
381 the future. After each successful delta lease renewal, the expiry time
382 is renewed for the associated wdmd connection. If wdmd finds any con‐
383 nection expired, it will not renew the /dev/watchdog timer. Given
384 enough successive failed renewals, the watchdog device will fire and
385 reset the host. (Given the multiplexing nature of wdmd, shorter over‐
386 lapping renewal failures from multiple lockspaces could cause spurious
387 watchdog firing.)
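
The multiplexing rule above (any expired connection stops wdmd from renewing the /dev/watchdog timer) can be modelled as a simple check. This is only a sketch of the decision as described here. The function name and the use of plain integer timestamps are assumptions, and it does not talk to wdmd.

```shell
# Hypothetical model of the decision: the watchdog timer is renewed only if
# no connection has reached its expiry time.
# Usage: should_renew_watchdog NOW EXPIRY...
should_renew_watchdog() {
    now=$1
    shift
    for expiry in "$@"; do
        if [ "$expiry" -le "$now" ]; then
            return 1    # one expired connection blocks the renewal
        fi
    done
    return 0
}

if should_renew_watchdog 100 130 145; then echo renew; else echo skip; fi   # prints renew
if should_renew_watchdog 100 130 95; then echo renew; else echo skip; fi    # prints skip
```

In the second call, one lockspace connection expired at time 95, so the timer is not renewed even though the other connection is still valid.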
388
389 The direct link between delta lease renewals and watchdog renewals pro‐
390 vides a predictable watchdog firing time based on delta lease renewal
391 timestamps that are visible from other hosts. sanlock knows the time
392 the watchdog on another host has fired based on the delta lease time.
393 Furthermore, if the watchdog device on another host fails to fire when
394 it should, the continuation of delta lease renewals from the other host
395 will make this evident and prevent leases from being taken from the
396 failed host.
397
If sanlock is able to stop/kill all processes using an expiring
lockspace, the associated wdmd connection for that lockspace is
removed. The expired wdmd connection will no longer block /dev/watchdog
renewals, and the host should avoid being reset.
402
403 Storage
404
405 The sector size and the align size should be specified when creating
406 lockspaces and resources (and rindex). The "align size" is the size on
407 disk of a lockspace or a resource, i.e. the amount of disk space it
408 uses. Lockspaces and resources should use matching sector and align
409 sizes, and must use offsets in multiples of the align size. The max
410 number of hosts that can use a lockspace or resource depends on the
411 combination of sector size and align size, shown below. The host_id of
412 hosts using the lockspace can be no larger than the max_hosts value for
413 the lockspace.
414
415 Accepted combinations of sector size and align size, and the corre‐
416 sponding max_hosts (and max host_id) are:
417
418 sector_size 512, align_size 1M, max_hosts 2000
419 sector_size 4096, align_size 1M, max_hosts 250
420 sector_size 4096, align_size 2M, max_hosts 500
421 sector_size 4096, align_size 4M, max_hosts 1000
422 sector_size 4096, align_size 8M, max_hosts 2000
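
A script that picks host_ids can check them against the table above. The helpers below are hypothetical (they are not part of sanlock) and simply encode the five accepted combinations.

```shell
# Usage: max_hosts SECTOR_SIZE ALIGN_SIZE   (ALIGN_SIZE as 1M, 2M, 4M or 8M)
max_hosts() {
    case "$1/$2" in
        512/1M)  echo 2000 ;;
        4096/1M) echo 250 ;;
        4096/2M) echo 500 ;;
        4096/4M) echo 1000 ;;
        4096/8M) echo 2000 ;;
        *) echo "unsupported combination: $1/$2" >&2; return 1 ;;
    esac
}

# Usage: host_id_allowed HOST_ID SECTOR_SIZE ALIGN_SIZE
# Succeeds if HOST_ID is between 1 and max_hosts for the combination.
host_id_allowed() {
    limit=$(max_hosts "$2" "$3") || return 1
    [ "$1" -ge 1 ] && [ "$1" -le "$limit" ]
}

max_hosts 4096 2M                                   # prints 500
host_id_allowed 251 4096 1M || echo "host_id too large"
```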
423
424 When sector_size and align_size are not specified, the behavior matches
425 the behavior before these sizes could be configured: on devices which
426 report sector size 512, 512/1M/2000 is used, on devices which report
427 sector size 4096, 4096/8M/2000 is used, and on files, 512/1M/2000 is
428 always used. (Other combinations are not compatible with sanlock ver‐
429 sion 3.6 or earlier.)
430
431 Using sanlock on shared block devices that do host based mirroring or
432 replication is not likely to work correctly. When using sanlock on
433 shared files, all sanlock io should go to one file server.
434
435 Example
436
437 This is an example of creating and using lockspaces and resources from
438 the command line. (Most applications would use sanlock through libsan‐
439 lock rather than through the command line.)
440
441
442 1. Allocate shared storage for sanlock leases.
443
444 This example assumes 512 byte sectors on the device, in which case
445 the lockspace needs 1MB and each resource needs 1MB.
446
447 The example shared block device accessible to all hosts is
448 /dev/leases.
449
450
451 2. Start sanlock on all hosts.
452
453 The -w 0 disables use of the watchdog for testing.
454
455 # sanlock daemon -w 0
456
457
458 3. Start a dummy application on all hosts.
459
460 This sanlock command registers with sanlock, then execs the sleep
461 command which inherits the registered fd. The sleep process acts
462 as the dummy application. Because the sleep process is registered
463 with sanlock, leases can be acquired for it.
464
465 # sanlock client command -c /bin/sleep 600 &
466
467
468 4. Create a lockspace for the application (from one host).
469
470 The lockspace is named "test".
471
472 # sanlock client init -s test:0:/dev/leases:0
473
474
475 5. Join the lockspace for the application.
476
477 Use a unique host_id on each host.
478
479 host1:
480 # sanlock client add_lockspace -s test:1:/dev/leases:0
481 host2:
482 # sanlock client add_lockspace -s test:2:/dev/leases:0
483
484
485 6. Create two resources for the application (from one host).
486
487 The resources are named "RA" and "RB". Offsets are used on the
488 same device as the lockspace. Different LVs or files could also be
489 used.
490
491 # sanlock client init -r test:RA:/dev/leases:1048576
492 # sanlock client init -r test:RB:/dev/leases:2097152
493
494
495 7. Acquire resource leases for the application on host1.
496
497 Acquire an exclusive lease (the default) on the first resource, and
498 a shared lease (SH) on the second resource.
499
500 # export P=`pidof sleep`
501 # sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
502 # sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P
503
504
505 8. Acquire resource leases for the application on host2.
506
507 Acquiring the exclusive lease on the first resource will fail be‐
508 cause it is held by host1. Acquiring the shared lease on the sec‐
509 ond resource will succeed.
510
511 # export P=`pidof sleep`
512 # sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
513 # sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P
514
515
516 9. Release resource leases for the application on both hosts.
517
The sleep process could also be killed, in which case the sanlock
daemon releases the leases held by that process when it exits.
520
521 # sanlock client release -r test:RA:/dev/leases:1048576 -p $P
522 # sanlock client release -r test:RB:/dev/leases:2097152 -p $P
523
524
525 10. Leave the lockspace for the application.
526
527 host1:
528 # sanlock client rem_lockspace -s test:1:/dev/leases:0
529 host2:
530 # sanlock client rem_lockspace -s test:2:/dev/leases:0
531
532
533 11. Stop sanlock on all hosts.
534
535 # sanlock shutdown
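
The offsets used in steps 4 and 6 follow a simple pattern: the lockspace sits at offset 0 and each resource takes the next align-size slot. The helper below is a hypothetical sketch that prints the corresponding init commands without running them. It assumes the lockspace and all resources share one device and one align size.

```shell
# Print (not run) the init commands for a lockspace and its resources,
# placing the Nth resource at offset N * ALIGN on the same device.
# Usage: print_init_commands LOCKSPACE PATH ALIGN_BYTES RESOURCE_NAME...
print_init_commands() {
    lockspace=$1
    path=$2
    align=$3
    shift 3
    echo "sanlock client init -s $lockspace:0:$path:0"
    offset=$align
    for name in "$@"; do
        echo "sanlock client init -r $lockspace:$name:$path:$offset"
        offset=$(( offset + align ))
    done
}

print_init_commands test /dev/leases 1048576 RA RB
```

For the values shown, the output matches the init commands in steps 4 and 6 of the example.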
536
537
538
OPTIONS
COMMAND can be one of three primary top level choices:
541
sanlock daemon    start daemon
sanlock client    send request to daemon (default command if none given)
sanlock direct    access storage directly (no coordination with daemon)
545
546
547 Daemon Command
548 sanlock daemon [options]
549
550 -D no fork and print all logging to stderr
551
552 -Q 0|1 quiet error messages for common lock contention
553
554 -R 0|1 renewal debugging, log debug info for each renewal
555
556 -L pri write logging at priority level and up to logfile (-1 none)
557
558 -S pri write logging at priority level and up to syslog (-1 none)
559
560 -U uid user id
561
562 -G gid group id
563
564 -H num renewal history size
565
566 -t num max worker threads
567
568 -g sec seconds for graceful recovery
569
570 -w 0|1 use watchdog through wdmd
571
572 -h 0|1 use high priority (RR) scheduling
573
574 -l num use mlockall (0 none, 1 current, 2 current and future)
575
576 -b sec seconds a host id bit will remain set in delta lease bitmap
577
578 -e str local host name used in delta leases
579
580
581
582 Client Command
583 sanlock client action [options]
584
585 sanlock client status
586
587 Print processes, lockspaces, and resources being managed by the sanlock
588 daemon. Add -D to show extra internal daemon status for debugging.
589 Add -o p to show resources by pid, or -o s to show resources by
590 lockspace.
591
592 sanlock client host_status
593
594 Print state of host_id delta leases read during the last renewal.
595 State of all lockspaces is shown (use -s to select one). Add -D to
596 show extra internal daemon status for debugging.
597
598 sanlock client gets
599
600 Print lockspaces being managed by the sanlock daemon. The LOCKSPACE
601 string will be followed by ADD or REM if the lockspace is currently be‐
602 ing added or removed. Add -h 1 to also show hosts in each lockspace.
603
604 sanlock client renewal -s LOCKSPACE
605
606 Print a history of renewals with timing details. See the Renewal his‐
607 tory section below.
608
609 sanlock client log_dump
610
611 Print the sanlock daemon internal debug log.
612
613 sanlock client shutdown
614
615 Ask the sanlock daemon to exit. Without the force option (-f 0), the
616 command will be ignored if any lockspaces exist. With the force option
617 (-f 1), any registered processes will be killed, their resource leases
618 released, and lockspaces removed. With the wait option (-w 1), the
619 command will wait for a result from the daemon indicating that it has
620 shut down and is exiting, or cannot shut down because lockspaces exist
621 (command fails).
622
623 sanlock client init -s LOCKSPACE
624
625 Tell the sanlock daemon to initialize a lockspace on disk. The -o op‐
626 tion can be used to specify the io timeout to be written in the host_id
627 leases. The -Z and -A options can be used to specify the sector size
628 and align size, and both should be set together. (Also see sanlock di‐
629 rect init.)
630
631 sanlock client init -r RESOURCE
632
633 Tell the sanlock daemon to initialize a resource lease on disk. The -Z
634 and -A options can be used to specify the sector size and align size,
635 and both should be set together. (Also see sanlock direct init.)
636
637 sanlock client read -s LOCKSPACE
638
639 Tell the sanlock daemon to read a lockspace from disk. Only the
640 LOCKSPACE path and offset are required. If host_id is zero, the first
641 record at offset (host_id 1) is used. The complete LOCKSPACE is
642 printed. Add -D to print other details. (Also see sanlock direct
643 read_leader.)
644
645 sanlock client read -r RESOURCE
646
647 Tell the sanlock daemon to read a resource lease from disk. Only the
648 RESOURCE path and offset are required. The complete RESOURCE is
649 printed. Add -D to print other details. (Also see sanlock direct
650 read_leader.)
651
652 sanlock client add_lockspace -s LOCKSPACE
653
654 Tell the sanlock daemon to acquire the specified host_id in the
655 lockspace. This will allow resources to be acquired in the lockspace.
656 The -o option can be used to specify the io timeout of the acquiring
657 host, and will be written in the host_id lease.
658
659 sanlock client inq_lockspace -s LOCKSPACE
660
661 Inquire about the state of the lockspace in the sanlock daemon, whether
662 it is being added or removed, or is joined.
663
664 sanlock client rem_lockspace -s LOCKSPACE
665
666 Tell the sanlock daemon to release the specified host_id in the
667 lockspace. Any processes holding resource leases in this lockspace
668 will be killed, and the resource leases not released.
669
670 sanlock client command -r RESOURCE -c path args
671
672 Register with the sanlock daemon, acquire the specified resource lease,
673 and exec the command at path with args. When the command exits, the
674 sanlock daemon will release the lease. -c must be the final option.
675
676 sanlock client acquire -r RESOURCE -p pid
677 sanlock client release -r RESOURCE -p pid
678
679 Tell the sanlock daemon to acquire or release the specified resource
680 lease for the given pid. The pid must be registered with the sanlock
681 daemon. acquire can optionally take a versioned RESOURCE string RE‐
682 SOURCE:lver, where lver is the version of the lease that must be ac‐
683 quired, or fail.
684
685 sanlock client convert -r RESOURCE -p pid
686
687 Tell the sanlock daemon to convert the mode of the specified resource
688 lease for the given pid. If the existing mode is exclusive (default),
689 the mode of the lease can be converted to shared with RESOURCE:SH. If
690 the existing mode is shared, the mode of the lease can be converted to
691 exclusive with RESOURCE (no :SH suffix).
692
693 sanlock client inquire -p pid
694
Print the resource leases held by the given pid. The format is a
versioned RESOURCE string "RESOURCE:lver" where lver is the version of
the lease held.
698
699 sanlock client request -r RESOURCE -f force_mode
700
701 Request the owner of a resource do something specified by force_mode.
702 A versioned RESOURCE:lver string must be used with a greater version
703 than is presently held. Zero lver and force_mode clears the request.
704
705 sanlock client examine -r RESOURCE
706
707 Examine the request record for the currently held resource lease and
708 carry out the action specified by the requested force_mode.
709
710 sanlock client examine -s LOCKSPACE
711
712 Examine requests for all resource leases currently held in the named
713 lockspace. Only lockspace_name is used from the LOCKSPACE argument.
714
715 sanlock client set_event -s LOCKSPACE -i host_id -g gen -e num -d num
716
717 Set an event for another host. When the sanlock daemon next renews its
718 delta lease for the lockspace it will: set the bit for the host_id in
719 its bitmap, and set the generation, event and data values in its own
720 delta lease. An application that has registered for events from this
721 lockspace on the destination host will get the event that has been set
722 when the destination sees the event during its next delta lease re‐
723 newal.
724
725 sanlock client set_config -s LOCKSPACE
726
727 Set a configuration value for a lockspace. Only lockspace_name is used
728 from the LOCKSPACE argument. The USED flag has the same effect on a
729 lockspace as a process holding a resource lease that will not exit.
730 The USED_BY_ORPHANS flag means that an orphan resource lease will have
731 the same effect as the USED.
732 -u 0|1 Set (1) or clear (0) the USED flag.
733 -O 0|1 Set (1) or clear (0) the USED_BY_ORPHANS flag.
734
735 sanlock client format -x RINDEX
736
737 Create a resource index on disk. Use -Z and -A to set the sector size
738 and align size to match the lockspace.
739
740 sanlock client create -x RINDEX -e resource_name
741
742 Create a new resource lease on disk, using the rindex to find a free
743 offset.
744
745 sanlock client delete -x RINDEX -e resource_name[:offset]
746
747 Delete an existing resource lease on disk.
748
749 sanlock client lookup -x RINDEX -e resource_name
750
Look up the offset of an existing resource lease by name on disk, using
the rindex. With no -e option, lookup returns the next free lease
offset. If -e specifies both name and offset, the lookup verifies both
are correct.
755
756 sanlock client update -x RINDEX -e resource_name[:offset] [-z 0|1]
757
758 Add (-z 0) or remove (-z 1) an rindex entry on disk.
759
760 sanlock client rebuild -x RINDEX
761
762 Rebuild the rindex entries by scanning the disk for resource leases.
763
764
765
766 Direct Command
767 sanlock direct action [options]
768
769
770 -o sec io timeout in seconds
771
772 sanlock direct init -s LOCKSPACE
773 sanlock direct init -r RESOURCE
774
775 Initialize storage for a lockspace or resource. Use the -Z and -A
776 flags to specify the sector size and align size. The max hosts that
777 can use the lockspace/resource (and the max possible host_id) is deter‐
778 mined by the sector/align size combination. Possible combinations are:
779 512/1M, 4096/1M, 4096/2M, 4096/4M, 4096/8M. Lockspaces and resources
780 both use the same amount of space (align_size) for each combination.
781 When initializing a lockspace, sanlock initializes delta leases for
782 max_hosts in the given space. When initializing a resource, sanlock
783 initializes a single paxos lease in the space. With -s, the -o option
784 specifies the io timeout to be written in the host_id leases. With -r,
785 the -z 1 option invalidates the resource lease on disk so it cannot be
786 used until reinitialized normally.
787
788 sanlock direct read_leader -s LOCKSPACE
789 sanlock direct read_leader -r RESOURCE
790
791 Read a leader record from disk and print the fields. The leader record
792 is the single sector of a delta lease, or the first sector of a paxos
793 lease.
794
795 sanlock direct dump path[:offset[:size]]
796
797 Read disk sectors and print leader records for delta or paxos leases.
798 Add -f 1 to print the request record values for paxos leases, host_ids
799 set in delta lease bitmaps, and rindex entries.
800
801 sanlock direct format -x RINDEX
802 sanlock direct lookup -x RINDEX -e resource_name
803 sanlock direct update -x RINDEX -e resource_name[:offset] [-z 0|1]
804 sanlock direct rebuild -x RINDEX
805
806 Access the resource index on disk without going through the sanlock
807 daemon. This precludes using the internal paxos lease to protect
808 rindex modifications. See client equivalents for descriptions.
809
810
811
812 LOCKSPACE option string
813 -s lockspace_name:host_id:path:offset
814
815 lockspace_name name of lockspace
816 host_id local host identifier in lockspace
817 path path to storage to use for leases
818 offset offset on path (bytes)
819
820
821 RESOURCE option string
822 -r lockspace_name:resource_name:path:offset
823
824 lockspace_name name of lockspace
825 resource_name name of resource
826 path path to storage to use for leases
827 offset offset on path (bytes)
828
829
830 RESOURCE option string with suffix
831 -r lockspace_name:resource_name:path:offset:lver
832
833 lver leader version
834
835 -r lockspace_name:resource_name:path:offset:SH
836
837 SH indicates shared mode
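
A script that builds or inspects RESOURCE option strings could split them as in the following Python sketch. This is not sanlock's own parser. The Resource class and parse_resource function are hypothetical names, and the sketch assumes that no field (including the path) contains a colon.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    lockspace_name: str
    resource_name: str
    path: str
    offset: int                  # offset on path, in bytes
    lver: Optional[int] = None   # leader version suffix, if given
    shared: bool = False         # True when the suffix is SH


def parse_resource(arg: str) -> Resource:
    """Split lockspace_name:resource_name:path:offset[:lver|:SH]."""
    fields = arg.split(":")
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 colon-separated fields: {arg!r}")
    res = Resource(fields[0], fields[1], fields[2], int(fields[3]))
    if len(fields) == 5:
        if fields[4] == "SH":
            res.shared = True
        else:
            res.lver = int(fields[4])
    return res
```

For example, parse_resource("foo:example1:/path:1048576:SH") yields a Resource with offset 1048576 and shared set to True.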
838
839
840 RINDEX option string
841 -x lockspace_name:path:offset
842
843 lockspace_name name of lockspace
844 path path to storage to use for leases
845 offset offset on path (bytes) of rindex
846
847
848
849 Defaults
850 sanlock help shows the default values for the options above.
851
852 sanlock version shows the build version.
853
854
856 Request/Examine
857 The first part of making a request for a resource is writing the re‐
858 quest record of the resource (the sector following the leader record).
859 To make a successful request:
860
861 • RESOURCE:lver must be greater than the lver presently held by the
862 other host. This implies the leader record must be read to discover
863 the lver, prior to making a request.
864
865 • RESOURCE:lver must be greater than or equal to the lver presently
866 written to the request record. Two hosts may write a new request at
867 the same time for the same lver, in which case both would succeed,
868 but the force_mode from the last would win.
869
870 • The force_mode must be greater than zero.
871
872 • To unconditionally clear the request record (set both lver and
873 force_mode to 0), make a request with RESOURCE:0 and force_mode 0.
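
The rules above can be summarized as a single decision. The following Python sketch restates them for illustration only; it is not taken from the sanlock source, and the function name, argument names and return strings are hypothetical.

```python
def check_request(req_lver: int, force_mode: int,
                  owner_lver: int, record_lver: int) -> str:
    """Classify a request according to the rules listed in the text.

    req_lver    -- the lver given as RESOURCE:lver
    force_mode  -- the requested force_mode
    owner_lver  -- the lver presently held by the other host
                   (read from the leader record)
    record_lver -- the lver presently written to the request record
    """
    # RESOURCE:0 with force_mode 0 unconditionally clears the record.
    if req_lver == 0 and force_mode == 0:
        return "clear"
    # The force_mode must be greater than zero.
    if force_mode <= 0:
        return "reject"
    # Must be greater than the lver held by the current owner.
    if req_lver <= owner_lver:
        return "reject"
    # Must be greater than or equal to the lver already requested.
    if req_lver < record_lver:
        return "reject"
    return "write"
```

Because equality with the request record's lver is allowed, two hosts requesting the same lver both get "write", which matches the note that the force_mode from the last writer wins.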
874
875
876 The owner of the requested resource will not know of the request unless
877 it is explicitly told to examine its resources via the "examine"
878 api/command, or otherwise notified.
879
880 The second part of making a request is notifying the resource lease
881 owner that it should examine the request records of its resource
882 leases. The notification will cause the lease owner to automatically
883 run the equivalent of "sanlock client examine -s LOCKSPACE" for the
884 lockspace of the requested resource.
885
886 The notification is made using a bitmap in each host_id delta lease.
887 Each bit represents one of the possible host_ids (1-2000). If host A
888 wants to notify host B to examine its resources, A sets the bit in its
889 own bitmap that corresponds to the host_id of B. When B next renews
890 its delta lease, it reads the delta leases for all hosts and checks
891 each bitmap to see if its own host_id has been set. It finds the bit
892 for its own host_id set in A's bitmap, and examines its resource re‐
893 quest records. (The bit remains set in A's bitmap for set_bitmap_sec‐
894 onds.)
895
896 force_mode determines the action the resource lease owner should take:
897
898
899 • FORCE (1): kill the process holding the resource lease. When the
900 process has exited, the resource lease will be released, and can then
901 be acquired by anyone. The kill signal is SIGKILL (or SIGTERM if
902 SIGKILL is restricted.)
903
904
905 • GRACEFUL (2): run the program configured by sanlock_killpath against
906 the process holding the resource lease. If no killpath is defined,
907 then FORCE is used.
908
909
910 Persistent and orphan resource leases
911 A resource lease can be acquired with the PERSISTENT flag (-P 1). If
912 the process holding the lease exits, the lease will not be released,
913 but kept on an orphan list. Another local process can acquire or
914 release an orphan lease using the ORPHAN flag (-O 1). All orphan
915 leases can be released by setting the lockspace name
916 (-s lockspace_name) with no resource name.
917
918
919 Renewal history
920 sanlock saves a limited history of lease renewal information in each
921 lockspace. See sanlock.conf renewal_history_size to set the amount of
922 history or to disable (set to 0).
923
924 IO times are measured in delta lease renewal (each delta lease renewal
925 includes one read and one write).
926
927 For each successful renewal, a record is saved that includes:
928
929 • the timestamp written in the delta lease by the renewal
930
931 • the time in milliseconds taken by the delta lease read
932
933 • the time in milliseconds taken by the delta lease write
934
935
936 Also counted and recorded are the numbers of io timeouts and other io
937 errors that occur between successful renewals.
938
939 Two consecutive successful renewals would be recorded as:
940 timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
941 timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0
942
943 Those fields are:
944
945
946 • timestamp is the value written into the delta lease during that re‐
947 newal.
948
949
950 • read_ms/write_ms are the milliseconds taken for the renewal
951 read/write ios.
952
953
954 • next_timeouts are the number of io timeouts that occurred after the
955 renewal recorded on that line, and before the next successful renewal
956 on the following line.
957
958
959 • next_errors are the number of io errors (not timeouts) that occurred
960 after the renewal recorded on that line, and before the next successful
961 renewal on the following line.
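
Records in this key=value form are easy to process in a script. The following Python sketch assumes each record is one line of space-separated name=integer fields, as in the example above; the function name parse_renewal is hypothetical.

```python
from typing import Dict


def parse_renewal(line: str) -> Dict[str, int]:
    """Turn one renewal record into a dict of integer fields."""
    record = {}
    for field in line.split():
        name, value = field.split("=", 1)
        record[name] = int(value)
    return record


first = parse_renewal(
    "timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0")
second = parse_renewal(
    "timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0")

# Seconds between the two successful renewals in the example.
interval = second["timestamp"] - first["timestamp"]
```

For the two example records, interval is 21 seconds.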
962
963
964 The command 'sanlock client renewal -s lockspace_name' reports the full
965 history of renewals saved by sanlock, which by default is 180 records,
966 about 1 hour of history when using a 20 second renewal interval for a
967 10 second io timeout.
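
The "about 1 hour" figure follows directly from the record count and the renewal interval. A minimal Python sketch of the arithmetic, using a hypothetical helper name:

```python
def history_span_seconds(records: int = 180, renewal_interval: int = 20) -> int:
    """Approximate time covered by the saved history, in seconds.

    Assumes one record per renewal interval, i.e. every renewal succeeds.
    """
    return records * renewal_interval
```

With the defaults this is 3600 seconds. Timeouts or errors between successful renewals would stretch the same number of records over a longer period.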
968
969
971 Disk Format
972 • This example uses 512 byte sectors.
973
974 • Each lockspace is 1MB. It holds 2000 delta_leases, one per sector,
975 supporting up to 2000 hosts.
976
977 • Each paxos_lease is 1MB. It is used as a lease for one resource.
978
979 • The leader_record structure is used differently by each lease type.
980
981 • To display all leader_record fields, see sanlock direct read_leader.
982
983 • A lockspace is often followed on disk by the paxos_leases used within
984 that lockspace, but this layout is not required.
985
986 • The request_record and host_id bitmap are used for requests/events.
987
988 • The mode_block contains the SHARED flag indicating a lease is held in
989 the shared mode.
990
991 • In a lockspace, the host using host_id N writes to a single
992 delta_lease in sector N-1. No other hosts write to this sector. All
993 hosts read all lockspace sectors when renewing their own delta_lease,
994 and are able to monitor renewals of all delta_leases.
995
996 • In a paxos_lease, each host has a dedicated sector it writes to, con‐
997 taining its own paxos_dblock and mode_block structures. Its sector
998 is based on its host_id; host_id 1 writes to the dblock/mode_block in
999 sector 2 of the paxos_lease.
1000
1001 • The paxos_dblock structures are used by the paxos_lease algorithm,
1002 and the result is written to the leader_record.
1003
1004
1005 0x000000 lockspace foo:0:/path:0
1006
1007 (There is no representation on disk of the lockspace in general, only
1008 the sequence of specific delta_leases which collectively represent the
1009 lockspace.)
1010
1011 delta_lease foo:1:/path:0
1012 0x000 0 leader_record (sector 0, for host_id 1)
1013 magic: 0x12212010
1014 space_name: foo
1015 resource_name: host uuid/name
1016 ...
1017 host_id bitmap (leader_record + 256)
1018
1019 delta_lease foo:2:/path:0
1020 0x200 512 leader_record (sector 1, for host_id 2)
1021 magic: 0x12212010
1022 space_name: foo
1023 resource_name: host uuid/name
1024 ...
1025 host_id bitmap (leader_record + 256)
1026
1027 delta_lease foo:3:/path:0
1028 0x400 1024 leader_record (sector 2, for host_id 3)
1029 magic: 0x12212010
1030 space_name: foo
1031 resource_name: host uuid/name
1032 ...
1033 host_id bitmap (leader_record + 256)
1034
1035 delta_lease foo:2000:/path:0
1036 0xF9E00 leader_record (sector 1999, for host_id 2000)
1037 magic: 0x12212010
1038 space_name: foo
1039 resource_name: host uuid/name
1040 ...
1041 host_id bitmap (leader_record + 256)
1042
1043 0x100000 paxos_lease foo:example1:/path:1048576
1044 0x000 0 leader_record (sector 0)
1045 magic: 0x06152010
1046 space_name: foo
1047 resource_name: example1
1048
1049 0x200 512 request_record (sector 1)
1050 magic: 0x08292011
1051
1052 0x400 1024 paxos_dblock (sector 2, for host_id 1)
1053 0x480 1152 mode_block (paxos_dblock + 128)
1054
1055 0x600 1536 paxos_dblock (sector 3, for host_id 2)
1056 0x680 1664 mode_block (paxos_dblock + 128)
1057
1058 0x800 2048 paxos_dblock (sector 4, for host_id 3)
1059 0x880 2176 mode_block (paxos_dblock + 128)
1060
1061 0xFA200 paxos_dblock (sector 2001, for host_id 2000)
1062 0xFA280 mode_block (paxos_dblock + 128)
1063
1064 0x200000 paxos_lease foo:example2:/path:2097152
1065 0x000 0 leader_record (sector 0)
1066 magic: 0x06152010
1067 space_name: foo
1068 resource_name: example2
1069
1070 0x200 512 request_record (sector 1)
1071 magic: 0x08292011
1072
1073 0x400 1024 paxos_dblock (sector 2, for host_id 1)
1074 0x480 1152 mode_block (paxos_dblock + 128)
1075
1076 0x600 1536 paxos_dblock (sector 3, for host_id 2)
1077 0x680 1664 mode_block (paxos_dblock + 128)
1078
1079 0x800 2048 paxos_dblock (sector 4, for host_id 3)
1080 0x880 2176 mode_block (paxos_dblock + 128)
1081
1082 0xFA200 paxos_dblock (sector 2001, for host_id 2000)
1083 0xFA280 mode_block (paxos_dblock + 128)
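
The offsets in the layout above follow a simple pattern that can be reproduced in code. The following Python sketch applies only to the 512 byte sector example shown here; the function names are hypothetical and the constants are read off the diagram.

```python
SECTOR_SIZE = 512         # this example uses 512 byte sectors
BITMAP_OFFSET = 256       # host_id bitmap is at leader_record + 256
MODE_BLOCK_OFFSET = 128   # mode_block is at paxos_dblock + 128


def delta_lease_offset(lockspace_offset: int, host_id: int) -> int:
    """Byte offset of the delta_lease for host_id (sector host_id - 1)."""
    return lockspace_offset + (host_id - 1) * SECTOR_SIZE


def host_bitmap_offset(lockspace_offset: int, host_id: int) -> int:
    """Byte offset of the host_id bitmap inside that delta_lease sector."""
    return delta_lease_offset(lockspace_offset, host_id) + BITMAP_OFFSET


def paxos_dblock_offset(lease_offset: int, host_id: int) -> int:
    """Byte offset of the paxos_dblock for host_id (sector host_id + 1).

    Sector 0 holds the leader_record and sector 1 the request_record.
    """
    return lease_offset + (host_id + 1) * SECTOR_SIZE


def mode_block_offset(lease_offset: int, host_id: int) -> int:
    """Byte offset of the mode_block for host_id."""
    return paxos_dblock_offset(lease_offset, host_id) + MODE_BLOCK_OFFSET
```

For host_id 2000, delta_lease_offset(0, 2000) is 0xF9E00 and paxos_dblock_offset(0, 2000) is 0xFA200, matching the last entries of each lease in the diagram.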
1084
1085
1086 Lease ownership
1087 Not shown in the leader_record structures above are the owner_id,
1088 owner_generation and timestamp fields. These are the fields that de‐
1089 fine the lease owner.
1090
1091 The delta_lease at sector N for host_id N+1 has leader_record.owner_id
1092 N+1. The leader_record.owner_generation is incremented each time the
1093 delta_lease is acquired. When a delta_lease is acquired, the
1094 leader_record.timestamp field is set to the time of the host and the
1095 leader_record.resource_name is set to the unique name of the host.
1096 When the host renews the delta_lease, it writes a new
1097 leader_record.timestamp. When a host releases a delta_lease, it writes
1098 zero to leader_record.timestamp.
1099
1100 When a host acquires a paxos_lease, it uses the host_id/generation
1101 value from the delta_lease it holds in the lockspace. It uses this
1102 host_id/generation to identify itself in the paxos_dblock when running
1103 the paxos algorithm. The result of the algorithm is the winning
1104 host_id/generation - the new owner of the paxos_lease. The winning
1105 host_id/generation are written to the paxos_lease
1106 leader_record.owner_id and leader_record.owner_generation fields and
1107 leader_record.timestamp is set. When a host releases a paxos_lease, it
1108 sets leader_record.timestamp to 0.
1109
1110 When a paxos_lease is free (leader_record.timestamp is 0), multiple
1111 hosts may attempt to acquire it. The paxos algorithm, using the
1112 paxos_dblock structures, will select only one of the hosts as the new
1113 owner, and that owner is written in the leader_record. The paxos_lease
1114 will no longer be free (non-zero timestamp). Other hosts will see this
1115 and will not attempt to acquire the paxos_lease until it is free again.
1116
1117 If a paxos_lease is owned (non-zero timestamp), but the owner has not
1118 renewed its delta_lease for a specific length of time, then the owner
1119 value in the paxos_lease becomes expired, and other hosts will use the
1120 paxos algorithm to acquire the paxos_lease, and set a new owner.
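
The three states of a paxos_lease described above (free, held, expired) can be sketched as a small decision function. This Python example is illustrative only. The expire_seconds parameter stands in for the "specific length of time", which this section does not quantify, and the function and argument names are hypothetical.

```python
def paxos_lease_state(leader_timestamp: int,
                      owner_last_renewal: int,
                      now: int,
                      expire_seconds: int) -> str:
    """Classify a paxos_lease as seen by a host that wants to acquire it.

    leader_timestamp   -- leader_record.timestamp of the paxos_lease
    owner_last_renewal -- last timestamp seen in the owner's delta_lease
    now                -- current time on the same clock as the renewals
    expire_seconds     -- hypothetical expiry threshold (an assumption)
    """
    if leader_timestamp == 0:
        return "free"      # any host may run the paxos algorithm
    if now - owner_last_renewal >= expire_seconds:
        return "expired"   # owner stopped renewing its delta_lease
    return "held"          # other hosts do not attempt to acquire
```

A host would attempt acquisition only for "free" or "expired"; in both cases the paxos algorithm still decides which single host becomes the new owner.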
1121
1122
1124 /etc/sanlock/sanlock.conf
1125
1126
1127 • quiet_fail = 1
1128 See -Q
1129
1130
1131 • debug_renew = 0
1132 See -R
1133
1134
1135 • logfile_priority = 4
1136 See -L
1137
1138
1139 • logfile_use_utc = 0
1140 Use UTC instead of local time in log messages.
1141
1142
1143 • syslog_priority = 3
1144 See -S
1145
1146
1147 • names_log_priority = 4
1148 Log resource names at this priority level (uses syslog priority num‐
1149 bers). If this is greater than or equal to logfile_priority, each
1150 requested resource name and location is recorded in sanlock.log.
1151
1152
1153 • use_watchdog = 1
1154 See -w
1155
1156
1157 • high_priority = 1
1158 See -h
1159
1160
1161 • mlock_level = 1
1162 See -l
1163
1164
1165 • sh_retries = 8
1166 The number of times to try acquiring the paxos lease while acquiring
1167 a shared lease, when that paxos lease is held by another host that is
1168 also acquiring a shared lease.
1169
1170
1171 • uname = sanlock
1172 See -U
1173
1174
1175 • gname = sanlock
1176 See -G
1177
1178
1179 • our_host_name = <str>
1180 See -e
1181
1182
1183 • renewal_read_extend_sec = <seconds>
1184 If a renewal read i/o times out, wait this many additional seconds
1185 for that read to complete at the start of the subsequent renewal at‐
1186 tempt. When not configured, sanlock waits for an additional io_time‐
1187 out seconds for a previous timed out read to complete.
1188
1189
1190 • renewal_history_size = 180
1191 See -H
1192
1193
1194 • paxos_debug_all = 0
1195 Include all details in the paxos debug logging.
1196
1197
1198 • debug_io = <str>
1199 Add debug logging for each i/o. "submit" (no quotes) produces debug
1200 output at submission time, "complete" produces debug output at com‐
1201 pletion time, and "submit,complete" (no space) produces both.
1202
1203
1204 • max_sectors_kb = <str>|<num>
1205 Set to "ignore" (no quotes) to prevent sanlock from checking or
1206 changing max_sectors_kb for the lockspace disk when starting a
1207 lockspace. Set to "align" (no quotes) to set max_sectors_kb for the
1208 lockspace disk to the align size of the lockspace. Set to a number
1209 to set a specific number of KB for all lockspace disks.
1210
1211
1212 • debug_clients = 0
1213 Enable or disable debug logging for all client connections to the
1214 sanlock daemon.
1215
1216
1217 • debug_cmd = +|-<name>
1218 Enable (+name) or disable (-name) debug logging at the command pro‐
1219 cessing level for specifically named commands, e.g. "debug_cmd = +ac‐
1220 quire", or "debug_cmd = -inq_lockspace". Repeat this line for each
1221 command name. Use a plus prefix before the name to enable and a mi‐
1222 nus prefix to disable. By default sanlock disables some command
1223 level debugging for commands that are often repetitive and fill the
1224 in-memory debug buffer. This only affects debug logging, not errors
1225 or warnings, and disabling command level debugging for a command does
1226 not disable lower level debugging for that command. Special values
1227 +all and -all can be used to enable or disable all commands, and can
1228 be used before or after other debug_cmd lines.
1229
1230
1231 • write_init_io_timeout = <seconds>
1232 The io timeout to use when initializing ondisk lease structures for a
1233 lockspace or resource. This timeout is not used as a part of either
1234 lease algorithm (as the standard io_timeout is.)
1235
1236
1237 • max_worker_threads = <num>
1238 See -t
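
A tool that inspects this file could read the "name = value" lines as in the following Python sketch. It is not sanlock's own parser. The function name is hypothetical, and treating lines that start with "#" as comments is an assumption. Because debug_cmd may be repeated, its values are collected in order rather than overwritten.

```python
from typing import Dict, List, Tuple


def parse_sanlock_conf(text: str) -> Tuple[Dict[str, str], List[str]]:
    """Return (settings, debug_cmds) from "name = value" lines."""
    settings: Dict[str, str] = {}
    debug_cmds: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and (assumed) comment lines.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a "name = value" line
        name, value = name.strip(), value.strip()
        if name == "debug_cmd":
            debug_cmds.append(value)   # repeated once per command name
        else:
            settings[name] = value
    return settings, debug_cmds


example = """
use_watchdog = 1
renewal_history_size = 180
debug_cmd = +acquire
debug_cmd = -inq_lockspace
"""
settings, debug_cmds = parse_sanlock_conf(example)
```

Here settings holds use_watchdog and renewal_history_size as strings, and debug_cmds is ["+acquire", "-inq_lockspace"].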
1239
1240
1242 wdmd(8)
1243
1244
1245
1246
1247 2015-01-23 SANLOCK(8)