SANLOCK(8)                System Manager's Manual                SANLOCK(8)

NAME
sanlock - shared storage lock manager

SYNOPSIS
sanlock [COMMAND] [ACTION] ...

DESCRIPTION
The sanlock daemon manages leases for applications running on a cluster
of hosts with shared storage. All lease management and coordination is
done through reading and writing blocks on the shared storage. Two
types of leases are used, each based on a different algorithm:

"delta leases" are slow to acquire and require regular i/o to shared
storage. A delta lease exists in a single sector of storage.
Acquiring a delta lease involves reads and writes to that sector
separated by specific delays. Once acquired, a lease must be renewed
by regularly updating a timestamp in the sector. sanlock uses a delta
lease internally to hold a lease on a host_id. host_id leases prevent
two hosts from using the same host_id and provide basic host liveness
information based on the renewals.

"paxos leases" are generally fast to acquire and sanlock makes them
available to applications as general purpose resource leases. A paxos
lease exists in 1MB of shared storage (8MB for 4k sectors). Acquiring
a paxos lease involves reads and writes to max_hosts (2000) sectors in
a sequence specified by the Disk Paxos algorithm. paxos leases use
host_ids internally to indicate the owner of the lease, and the
algorithm fails if different hosts use the same host_id. So, delta
leases provide the unique host_ids used in paxos leases. paxos leases
also refer to delta leases to check if a host_id is alive.

Before sanlock can be used, the user must assign each host a host_id,
which is a number between 1 and 2000. Two hosts should not be given
the same host_id (even though delta leases attempt to detect this
mistake).

sanlock views a pool of storage as a "lockspace". Each distinct pool
of storage, e.g. from different sources, would typically be defined as
a separate lockspace, with a unique lockspace name.

Part of this storage space must be reserved and initialized for sanlock
to store delta leases. Each host that wants to use the lockspace must
first acquire a delta lease on its host_id number within the lockspace.
(See the add_lockspace action/api.) The space required for 2000 delta
leases in the lockspace (for 2000 possible host_ids) is 1MB (8MB for
4k sectors). (This is the same size required for a single paxos
lease.)
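
The size figures above follow from simple arithmetic, since each delta
lease occupies one sector. The sketch below checks that 2000
single-sector leases fit in the stated area. The function names
delta_bytes and delta_fits are made up for this illustration and are
not part of sanlock, and 1MB is assumed to mean 1048576 bytes.

```shell
#!/bin/sh
# Illustrative helpers only; not part of sanlock.

# delta_bytes HOSTS SECTOR_SIZE
# Print the bytes occupied by HOSTS single-sector delta leases.
delta_bytes() {
    echo $(( $1 * $2 ))
}

# delta_fits HOSTS SECTOR_SIZE AREA_MB
# Succeed if the leases fit in AREA_MB (assumed: 1MB = 1048576 bytes).
delta_fits() {
    [ "$(delta_bytes "$1" "$2")" -le $(( $3 * 1048576 )) ]
}

delta_bytes 2000 512     # prints 1024000, under 1MB (1048576)
delta_bytes 2000 4096    # prints 8192000, under 8MB (8388608)
```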

More storage space must be reserved and initialized for paxos leases,
according to the needs of the applications using sanlock.

The following steps illustrate these concepts using the command line.
Applications may choose to do these same steps through libsanlock.

1. Create storage pools and reserve and initialize host_id leases,
using two different LUNs on a SAN (/dev/sdb and /dev/sdc):
   # vgcreate pool1 /dev/sdb
   # vgcreate pool2 /dev/sdc
   # lvcreate -n hostid_leases -L 1MB pool1
   # lvcreate -n hostid_leases -L 1MB pool2
   # sanlock direct init -s LS1:0:/dev/pool1/hostid_leases:0
   # sanlock direct init -s LS2:0:/dev/pool2/hostid_leases:0

2. Start the sanlock daemon on each host
   # sanlock daemon

3. Add each lockspace to be used
   host1:
   # sanlock client add_lockspace -s LS1:1:/dev/pool1/hostid_leases:0
   # sanlock client add_lockspace -s LS2:1:/dev/pool2/hostid_leases:0
   host2:
   # sanlock client add_lockspace -s LS1:2:/dev/pool1/hostid_leases:0
   # sanlock client add_lockspace -s LS2:2:/dev/pool2/hostid_leases:0

4. Applications can now reserve/initialize space for resource leases,
and then acquire the leases as they need to access the resources.

The resource leases that are created and how they are used depend on
the application. For example, say application A, running on host1 and
host2, needs to synchronize access to data it stores on
/dev/pool1/Adata. A could use a resource lease as follows:

5. Reserve and initialize a single resource lease for Adata
   # lvcreate -n Adata_lease -L 1MB pool1
   # sanlock direct init -r LS1:Adata:/dev/pool1/Adata_lease:0

6. Acquire the lease from the app using libsanlock (see
sanlock_register, sanlock_acquire). If the app is already running as
pid 123, and has registered with the sanlock daemon, the lease can be
added for it manually.
   # sanlock client acquire -r LS1:Adata:/dev/pool1/Adata_lease:0 -p 123

offsets

offsets must be 1MB aligned for disks with 512 byte sectors, and 8MB
aligned for disks with 4096 byte sectors.

offsets may be used to place leases on the same device rather than
using separate devices and offset 0 as shown in the examples above,
e.g. these commands above:
   # sanlock direct init -s LS1:0:/dev/pool1/hostid_leases:0
   # sanlock direct init -r LS1:Adata:/dev/pool1/Adata_lease:0
could be replaced by:
   # sanlock direct init -s LS1:0:/dev/pool1/leases:0
   # sanlock direct init -r LS1:Adata:/dev/pool1/leases:1048576
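
When several leases share one device, each offset is a multiple of the
alignment for the sector size. The following is a minimal sketch of
that calculation. The helper names (lease_align, lease_offset,
is_aligned) are hypothetical and only the two sector sizes named in the
text are handled.

```shell
#!/bin/sh
# Illustrative helpers only; not part of sanlock.

# lease_align SECTOR_SIZE
# Print the required offset alignment in bytes (1MB or 8MB).
lease_align() {
    case "$1" in
        512)  echo 1048576 ;;
        4096) echo 8388608 ;;
        *)    echo "unsupported sector size: $1" >&2; return 1 ;;
    esac
}

# lease_offset INDEX SECTOR_SIZE
# Print the byte offset of the INDEX-th lease area (0-based) on a device.
lease_offset() {
    align=$(lease_align "$2") || return 1
    echo $(( $1 * align ))
}

# is_aligned OFFSET SECTOR_SIZE
# Succeed if OFFSET is a valid lease offset for the sector size.
is_aligned() {
    align=$(lease_align "$2") || return 1
    [ $(( $1 % align )) -eq 0 ]
}

lease_offset 1 512    # prints 1048576, the offset used in the example above
```

With 4096 byte sectors the second lease area would instead begin at
8388608.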

failures

If a process holding resource leases fails or exits without releasing
its leases, sanlock will release the leases for it automatically.

If the sanlock daemon cannot renew a lockspace host_id for a specific
period of time (usually because storage access is lost), sanlock will
kill any process holding a resource lease within the lockspace.

If the sanlock daemon crashes or gets stuck, it will no longer renew
the expiry time of its per-host_id connections to the wdmd daemon, and
the watchdog device will reset the host.

watchdog

sanlock uses the wdmd(8) daemon to access /dev/watchdog. A separate
connection to wdmd is maintained for each host_id being renewed. Each
host_id connection has an expiry time some seconds in the future.
After each successful host_id renewal, sanlock updates the associated
expiry time in wdmd. If wdmd finds any connection expired, it will not
pet /dev/watchdog. After enough successive expired/failed checks, the
watchdog device will fire and reset the host.

After a number of failed attempts to renew a host_id, sanlock kills any
process using that lockspace. Once all those processes have exited,
sanlock will unregister the associated wdmd connection. wdmd will no
longer find the expired connection, and will resume petting
/dev/watchdog (assuming it finds no other failed/expired tests). If
the killed processes do not exit quickly enough, the expired wdmd
connection will not be unregistered, and /dev/watchdog will reset the
host.

Because these timeout values are known, sanlock on another host can
calculate, from the last host_id renewal, when the failed host will
have been reset by its watchdog (or will have killed all the necessary
processes).

If the sanlock daemon itself fails, crashes, or gets stuck, it will no
longer update the expiry time for its host_id connections to wdmd,
which will also lead to the watchdog resetting the host.

safety

sanlock leases are meant to guarantee that two processes on two hosts
are never allowed to hold the same resource lease at once. If they
were, the resource being protected could be corrupted. There are three
levels of protection built into sanlock itself:

1. The paxos leases and delta leases themselves.

2. If the leases cannot function because storage access is lost
(host_ids cannot be renewed), the sanlock daemon kills any pids using
resource leases in the lockspace.

3. If the pids do not exit after being killed, or if the sanlock daemon
fails, the watchdog device resets the host.


OPTIONS
COMMAND can be one of three primary top-level choices:

sanlock daemon   start daemon
sanlock client   send request to daemon (default command if none given)
sanlock direct   access storage directly (no coordination with daemon)

sanlock daemon [options]

-D      no fork and print all logging to stderr

-Q 0|1  quiet error messages for common lock contention

-R 0|1  renewal debugging, log debug info for each renewal

-L pri  write logging at priority level and up to logfile (-1 none)

-S pri  write logging at priority level and up to syslog (-1 none)

-U uid  user id

-G gid  group id

-t num  max worker threads

-g sec  seconds for graceful recovery

-w 0|1  use watchdog through wdmd

-h 0|1  use high priority (RR) scheduling

-l num  use mlockall (0 none, 1 current, 2 current and future)

-a 0|1  use async i/o

sanlock client action [options]

sanlock client status

Print processes, lockspaces, and resources being managed by the sanlock
daemon. Add -D to show extra internal daemon status for debugging.
Add -o p to show resources by pid, or -o s to show resources by
lockspace.

sanlock client host_status

Print state of host_id delta leases read during the last renewal.
State of all lockspaces is shown (use -s to select one). Add -D to
show extra internal daemon status for debugging.

sanlock client gets

Print lockspaces being managed by the sanlock daemon. The LOCKSPACE
string will be followed by ADD or REM if the lockspace is currently
being added or removed. Add -h 1 to also show hosts in each lockspace.

sanlock client log_dump

Print the sanlock daemon internal debug log.

sanlock client shutdown

Ask the sanlock daemon to exit. Without the force option (-f 0), the
command will be ignored if any lockspaces exist. With the force option
(-f 1), any registered processes will be killed, their resource leases
released, and lockspaces removed.

sanlock client init -s LOCKSPACE

Tell the sanlock daemon to initialize a lockspace on disk. The -o
option can be used to specify the io timeout to be written in the
host_id leases. (Also see sanlock direct init.)

sanlock client init -r RESOURCE

Tell the sanlock daemon to initialize a resource lease on disk. (Also
see sanlock direct init.)

sanlock client read -s LOCKSPACE

Tell the sanlock daemon to read a lockspace from disk. Only the
LOCKSPACE path and offset are required. If host_id is zero, the first
record at offset (host_id 1) is used. The complete LOCKSPACE and io
timeout are printed.

sanlock client read -r RESOURCE

Tell the sanlock daemon to read a resource lease from disk. Only the
RESOURCE path and offset are required. The complete RESOURCE is
printed. (Also see sanlock direct read_leader.)

sanlock client align -s LOCKSPACE

Tell the sanlock daemon to report the required lease alignment for a
storage path. Only path is used from the LOCKSPACE argument.

sanlock client add_lockspace -s LOCKSPACE

Tell the sanlock daemon to acquire the specified host_id in the
lockspace. This will allow resources to be acquired in the lockspace.
The -o option can be used to specify the io timeout of the acquiring
host, which will be written in the host_id lease.

sanlock client inq_lockspace -s LOCKSPACE

Inquire about the state of the lockspace in the sanlock daemon: whether
it is being added, is being removed, or is joined.

sanlock client rem_lockspace -s LOCKSPACE

Tell the sanlock daemon to release the specified host_id in the
lockspace. Any processes holding resource leases in this lockspace
will be killed, and their resource leases will not be released.

sanlock client command -r RESOURCE -c path args

Register with the sanlock daemon, acquire the specified resource lease,
and exec the command at path with args. When the command exits, the
sanlock daemon will release the lease. -c must be the final option.

sanlock client acquire -r RESOURCE -p pid
sanlock client release -r RESOURCE -p pid

Tell the sanlock daemon to acquire or release the specified resource
lease for the given pid. The pid must be registered with the sanlock
daemon. acquire can optionally take a versioned RESOURCE string
RESOURCE:lver, where lver is the version of the lease that must be
acquired; otherwise the acquire fails.

sanlock client inquire -p pid

Print the resource leases held by the given pid. The format is a
versioned RESOURCE string "RESOURCE:lver" where lver is the version of
the lease held.

sanlock client request -r RESOURCE -f force_mode

Request that the owner of a resource take the action specified by
force_mode. A versioned RESOURCE:lver string must be used with a
greater version than is presently held. An lver and force_mode of zero
clear the request.

sanlock client examine -r RESOURCE

Examine the request record for the currently held resource lease and
carry out the action specified by the requested force_mode.

sanlock client examine -s LOCKSPACE

Examine requests for all resource leases currently held in the named
lockspace. Only lockspace_name is used from the LOCKSPACE argument.

sanlock direct action [options]

-a 0|1  use async i/o

-o sec  io timeout in seconds

sanlock direct init -s LOCKSPACE
sanlock direct init -r RESOURCE

Initialize storage for 2000 host_id (delta) leases for the given
lockspace, or initialize storage for one resource (paxos) lease. Both
options require 1MB of space (8MB for 4k sectors). The host_id in the
LOCKSPACE string is not relevant to initialization, so the value is
ignored. (The default of 2000 host_ids can be changed for special
cases using the -n num_hosts and -m max_hosts options.) With -s, the
-o option specifies the io timeout to be written in the host_id leases.

sanlock direct read_leader -s LOCKSPACE
sanlock direct read_leader -r RESOURCE

Read a leader record from disk and print the fields. The leader record
is the single sector of a delta lease, or the first sector of a paxos
lease.

sanlock direct dump path[:offset]

Read disk sectors and print leader records for delta or paxos leases.
Add -f 1 to print the request record values for paxos leases, and
host_ids set in delta lease bitmaps.


LOCKSPACE option string
-s lockspace_name:host_id:path:offset

lockspace_name  name of lockspace
host_id         local host identifier in lockspace
path            path to storage reserved for leases
offset          offset on path (bytes)


RESOURCE option string
-r lockspace_name:resource_name:path:offset

lockspace_name  name of lockspace
resource_name   name of resource
path            path to storage reserved for leases
offset          offset on path (bytes)


RESOURCE option string with version
-r lockspace_name:resource_name:path:offset:lver

lver            leader version or SH for shared lease
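
Scripts that drive the command line often need to assemble these
colon-separated strings. The sketch below builds them from their
fields and extracts lver from a versioned string. The helper names are
hypothetical, and the sketch assumes no field (including path)
contains a colon.

```shell
#!/bin/sh
# Illustrative helpers only; not part of sanlock.
# Assumption: no field contains a ':' character.

# lockspace_arg NAME HOST_ID PATH OFFSET
lockspace_arg() {
    printf '%s:%s:%s:%s\n' "$1" "$2" "$3" "$4"
}

# resource_arg LOCKSPACE_NAME RESOURCE_NAME PATH OFFSET [LVER]
# With LVER, print the versioned form of the string.
resource_arg() {
    if [ -n "${5:-}" ]; then
        printf '%s:%s:%s:%s:%s\n' "$1" "$2" "$3" "$4" "$5"
    else
        printf '%s:%s:%s:%s\n' "$1" "$2" "$3" "$4"
    fi
}

# resource_lver VERSIONED_RESOURCE
# Print the last field of a versioned RESOURCE string.
resource_lver() {
    echo "${1##*:}"
}

lockspace_arg LS1 1 /dev/pool1/hostid_leases 0
resource_arg LS1 Adata /dev/pool1/Adata_lease 0
```

The two calls at the end print the same LOCKSPACE and RESOURCE strings
used in the DESCRIPTION examples. resource_lver is only meaningful for
a versioned string; given an unversioned one it would return the
offset.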


Defaults
sanlock help shows the default values for the options above.

sanlock version shows the build version.


INTERNALS
Request/Examine
The first part of making a request for a resource is writing the
request record of the resource (the sector following the leader
record). To make a successful request:

· RESOURCE:lver must be greater than the lver presently held by the
  other host. This implies the leader record must be read to discover
  the lver, prior to making a request.

· RESOURCE:lver must be greater than or equal to the lver presently
  written to the request record. Two hosts may write a new request at
  the same time for the same lver, in which case both would succeed,
  but the force_mode from the last would win.

· The force_mode must be greater than zero.

· To unconditionally clear the request record (set both lver and
  force_mode to 0), make a request with RESOURCE:0 and force_mode 0.
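
The conditions above can be restated as a small decision function.
This is only a sketch of the rules as listed here, not sanlock's
implementation; request_check and its argument order are invented for
the illustration.

```shell
#!/bin/sh
# Illustrative helper only; not part of sanlock.

# request_check NEW_LVER FORCE_MODE HELD_LVER RECORD_LVER
#   NEW_LVER     lver given in RESOURCE:lver
#   FORCE_MODE   requested force_mode
#   HELD_LVER    lver in the leader record (held by the owner)
#   RECORD_LVER  lver currently in the request record
# Prints "clear", "ok", or "reject".
request_check() {
    if [ "$1" -eq 0 ] && [ "$2" -eq 0 ]; then
        echo clear      # RESOURCE:0 with force_mode 0 clears the record
    elif [ "$2" -gt 0 ] && [ "$1" -gt "$3" ] && [ "$1" -ge "$4" ]; then
        echo ok         # all three conditions hold
    else
        echo reject
    fi
}

request_check 6 1 5 0    # prints ok
request_check 5 1 5 0    # prints reject (lver not greater than held)
request_check 0 0 5 6    # prints clear
```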

The owner of the requested resource will not know of the request unless
it is explicitly told to examine its resources via the "examine"
api/command, or is otherwise notified.

The second part of making a request is notifying the resource lease
owner that it should examine the request records of its resource
leases. The notification will cause the lease owner to automatically
run the equivalent of "sanlock client examine -s LOCKSPACE" for the
lockspace of the requested resource.

The notification is made using a bitmap in each host_id delta lease.
Each bit represents one of the possible host_ids (1-2000). If host A
wants to notify host B to examine its resources, A sets the bit in its
own bitmap that corresponds to the host_id of B. When B next renews
its delta lease, it reads the delta leases for all hosts and checks
each bitmap to see if its own host_id has been set. It finds the bit
for its own host_id set in A's bitmap, and examines its resource
request records. (The bit remains set in A's bitmap for
request_finish_seconds.)
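
A bitmap with one bit per host_id can be indexed as sketched below.
The text does not describe the on-disk bit layout, so the mapping used
here (host_id 1 at bit 0 of byte 0, eight bits per byte) is an
assumption made for illustration, and the helper names are
hypothetical.

```shell
#!/bin/sh
# Illustrative helpers only; not part of sanlock.
# Assumed layout: host_id 1 maps to bit 0 of byte 0.

# bitmap_bytes MAX_HOSTS
# Print the bytes needed for one bit per host_id.
bitmap_bytes() {
    echo $(( ($1 + 7) / 8 ))
}

# bit_position HOST_ID
# Print "BYTE_INDEX BIT_INDEX" for a host_id in the range 1-2000.
bit_position() {
    if [ "$1" -lt 1 ] || [ "$1" -gt 2000 ]; then
        echo "host_id out of range: $1" >&2
        return 1
    fi
    idx=$(( $1 - 1 ))
    echo "$(( idx / 8 )) $(( idx % 8 ))"
}

bitmap_bytes 2000    # prints 250
bit_position 9       # prints "1 0"
```

Under this assumed layout, a bitmap covering all 2000 host_ids needs
250 bytes, which fits in a single 512 byte sector.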

force_mode determines the action the resource lease owner should take:

1 (FORCE): kill the process holding the resource lease. When the
process has exited, the resource lease will be released, and can then
be acquired by anyone. The kill signal is SIGKILL (or SIGTERM if
SIGKILL is restricted).

2 (GRACEFUL): run the program configured by sanlock_killpath against
the process holding the resource lease. If no killpath is defined,
then FORCE is used.


Graceful recovery
When a lockspace host_id cannot be renewed for a specific period of
time, sanlock enters a recovery mode in which it attempts to forcibly
release any resource leases in that lockspace. If the leases are not
all released within 60 seconds, the watchdog will fire, resetting the
host.

The most immediate way of releasing the resource leases in the failed
lockspace is by sending SIGKILL to all pids holding the leases, and
automatically releasing the resource leases as the pids exit. After
all pids have exited, no resource leases are held in the lockspace, the
watchdog expiration is removed, and the host can avoid the watchdog
reset.

A slightly more graceful approach is to send SIGTERM to a pid before
escalating to SIGKILL. sanlock does this by sending SIGTERM to each
pid, once a second, for the first N seconds, before sending SIGKILL
once a second for the remaining M seconds (N/M can be tuned with the -g
daemon option).

An even more graceful approach is to configure a program for sanlock to
run that will terminate or suspend each pid, and explicitly release the
leases it held. sanlock will run this program for each pid. It has N
seconds to terminate the pid or explicitly release its leases before
sanlock escalates to SIGKILL for the remaining M seconds.
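
The SIGTERM-then-SIGKILL escalation can be pictured as a per-second
schedule. The sketch below only models the timing described above; it
sends no signals. signal_at is a hypothetical name, the value 40 for N
is an arbitrary example, and the total defaults to the 60 seconds
mentioned earlier so that M = 60 - N.

```shell
#!/bin/sh
# Illustrative model only; not part of sanlock and sends no signals.

# signal_at SECOND N [TOTAL]
# Print the signal sent at the given second (1-based) of recovery:
# TERM during the first N seconds, KILL for the remaining seconds,
# and "none" outside the recovery window. TOTAL defaults to 60.
signal_at() {
    second=$1
    n_term=$2
    total=${3:-60}
    if [ "$second" -lt 1 ] || [ "$second" -gt "$total" ]; then
        echo none
    elif [ "$second" -le "$n_term" ]; then
        echo TERM
    else
        echo KILL
    fi
}

signal_at 40 40    # prints TERM (last graceful second)
signal_at 41 40    # prints KILL (escalation begins)
```

Setting N to 0 in this model corresponds to the most immediate
approach, where SIGKILL is used from the first second.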


SEE ALSO
wdmd(8)

2011-08-05                                                        SANLOCK(8)