1OCFS2(7) OCFS2 Manual Pages OCFS2(7)
2
3
4
6 OCFS2 - A Shared-Disk Cluster File System for Linux
7
8
10 OCFS2 is a file system. It allows users to store and retrieve data. The
11 data is stored in files that are organized in a hierarchical directory
12 tree. It is a POSIX compliant file system that supports the standard
13 interfaces and the behavioral semantics as spelled out by that specifi‐
14 cation.
15
16 It is also a shared disk cluster file system, one that allows multiple
17 nodes to access the same disk at the same time. This is where the fun
18 begins as allowing a file system to be accessible on multiple nodes
19 opens a can of worms. What if the nodes are of different architectures?
20 What if a node dies while writing to the file system? What data consis‐
21 tency can one expect if processes on two nodes are reading and writing
22 concurrently? What if one node removes a file while it is still being
23 used on another node?
24
25 Unlike most shared file systems where the answer is fuzzy, the answer
26 in OCFS2 is very well defined. It behaves on all nodes exactly like a
27 local file system. If a file is removed, the directory entry is removed
28 but the inode is kept as long as it is in use across the cluster. When
29 the last user closes the descriptor, the inode is marked for deletion.
30
31 The data consistency model follows the same principle. It works as if
32 the two processes that are running on two different nodes are running
33 on the same node. A read on a node gets the last write irrespective of
34 the IO mode used. The modes can be buffered, direct, asynchronous,
35 splice or memory mapped IOs. It is fully cache coherent.
36
37 Take for example the REFLINK feature that allows a user to create mul‐
38 tiple write-able snapshots of a file. This feature, like all others, is
39 fully cluster-aware. A file being written to on multiple nodes can be
40 safely reflinked on another. The snapshot created is a point-in-time
41 image of the file that includes both the file data and all its at‐
42 tributes (including extended attributes).
43
44 It is a journaling file system. When a node dies, a surviving node
45 transparently replays the journal of the dead node. This ensures that
46 the file system metadata is always consistent. It also defaults to or‐
47 dered data journaling to ensure the file data is flushed to disk before
48 the journal commit, to remove the small possibility of stale data ap‐
49 pearing in files after a crash.
50
51 It is architecture and endian neutral. It allows concurrent mounts on
52 nodes with different processors like x86, x86_64, IA64 and PPC64. It
53 handles little and big endian, 32-bit and 64-bit architectures.
54
55 It is feature rich. It supports indexed directories, metadata check‐
56 sums, extended attributes, POSIX ACLs, quotas, REFLINKs, sparse files,
57 unwritten extents and inline-data.
58
59 It is fully integrated with the mainline Linux kernel. The file system
60 was merged into Linux kernel 2.6.16 in early 2006.
61
62 It is quickly installed. It is available with almost all Linux distri‐
63 butions. The file system is on-disk compatible across all of them.
64
65 It is modular. The file system can be configured to operate with other
66 cluster stacks like Pacemaker and CMAN along with its own stack, O2CB.
67
68 It is easily configured. The O2CB cluster stack configuration involves
69 editing two files, one for cluster layout and the other for cluster
70 timeouts.
71
72 It is very efficient. The file system consumes very little resources.
73 It is used to store virtual machine images in limited memory environ‐
74 ments like Xen and KVM.
75
76 In summary, OCFS2 is an efficient, easily configured, modular, quickly
77 installed, fully integrated and compatible, feature-rich, architecture
78 and endian neutral, cache coherent, ordered data journaling, POSIX-com‐
79 pliant, shared disk cluster file system.
80
81
83 OCFS2 is a general-purpose shared-disk cluster file system for Linux
84 capable of providing both high performance and high availability.
85
86 As it provides local file system semantics, it can be used with almost
87 all applications. Cluster-aware applications can make use of cache-co‐
88 herent parallel I/Os from multiple nodes to scale out applications eas‐
ily. Other applications can make use of the clustering facilities to
fail over running applications in the event of a node failure.
91
92 The notable features of the file system are:
93
94 Tunable Block size
95 The file system supports block sizes of 512, 1K, 2K and 4K
96 bytes. 4KB is almost always recommended. This feature is avail‐
97 able in all releases of the file system.
98
99
100 Tunable Cluster size
101 A cluster size is also referred to as an allocation unit. The
102 file system supports cluster sizes of 4K, 8K, 16K, 32K, 64K,
103 128K, 256K, 512K and 1M bytes. For most use cases, 4KB is recom‐
104 mended. However, a larger value is recommended for volumes host‐
105 ing mostly very large files like database files, virtual machine
106 images, etc. A large cluster size allows the file system to
107 store large files more efficiently. This feature is available in
108 all releases of the file system.
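
              As an illustration (the device name and label below are only
              examples), a volume meant for large virtual machine images
              could be formatted with a 4KB block size and a 1MB cluster
              size:

                  # mkfs.ocfs2 -b 4K -C 1M -L "vmstore" /dev/sdc1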
109
110
111 Endian and Architecture neutral
112 The file system can be mounted concurrently on nodes having dif‐
113 ferent architectures. Like 32-bit, 64-bit, little-endian (x86,
114 x86_64, ia64) and big-endian (ppc64, s390x). This feature is
115 available in all releases of the file system.
116
117
118 Buffered, Direct, Asynchronous, Splice and Memory Mapped I/O modes
119 The file system supports all modes of I/O for maximum flexibil‐
120 ity and performance. It also supports cluster-wide shared
writeable mmap(2). The support for buffered, direct and
asynchronous I/O is available in all releases. The support for
splice I/O was added in Linux kernel 2.6.20 and for shared
writeable mmap(2) in 2.6.23.
125
126
127 Multiple Cluster Stacks
128 The file system includes a flexible framework to allow it to
129 function with userspace cluster stacks like Pacemaker (pcmk) and
130 CMAN (cman), its own in-kernel cluster stack o2cb and no cluster
131 stack.
132
133 The support for o2cb cluster stack is available in all releases.
134
135 The support for no cluster stack, or local mount, was added in
136 Linux kernel 2.6.20.
137
138 The support for userspace cluster stack was added in Linux ker‐
139 nel 2.6.26.
140
141
142 Journaling
143 The file system supports both ordered (default) and writeback
144 data journaling modes to provide file system consistency in the
145 event of power failure or system crash. It uses JBD2 in Linux
146 kernel 2.6.28 and later. It used JBD in earlier kernels.
147
148
149 Extent-based Allocations
150 The file system allocates and tracks space in ranges of clus‐
151 ters. This is unlike block based file systems that have to track
152 each and every block. This feature allows the file system to be
153 very efficient when dealing with both large volumes and large
154 files. This feature is available in all releases of the file
155 system.
156
157
158 Sparse files
159 Sparse files are files with holes. With this feature, the file
160 system delays allocating space until a write is issued to a
161 cluster. This feature was added in Linux kernel 2.6.22 and re‐
162 quires enabling on-disk feature sparse.
163
164
165 Unwritten Extents
166 An unwritten extent is also referred to as user pre-allocation.
167 It allows an application to request a range of clusters to be
168 allocated, but not initialized, within a file. Pre-allocation
169 allows the file system to optimize the data layout with fewer,
170 larger extents. It also provides a performance boost, delaying
171 initialization until the user writes to the clusters. This fea‐
172 ture was added in Linux kernel 2.6.23 and requires enabling on-
173 disk feature unwritten.
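
              As a sketch, assuming a kernel in which OCFS2 wires up the
              fallocate(2) interface, the util-linux fallocate(1) utility can
              request such a keep-size pre-allocation (the file name and
              length are illustrative):

                  $ fallocate -n -o 0 -l 10G dbfile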
174
175
176 Hole Punching
Hole punching allows an application to remove arbitrary allocated
regions within a file, essentially creating holes. This is more
efficient than zeroing the same extents. This feature
180 is especially useful in virtualized environments as it allows a
181 block discard in a guest file system to be converted to a hole
182 punch in the host file system thus allowing users to reduce disk
183 space usage. This feature was added in Linux kernel 2.6.23 and
184 requires enabling on-disk features sparse and unwritten.
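
              As a sketch, again assuming fallocate(2) support in the running
              kernel, a hole can be punched with the util-linux fallocate(1)
              utility (the offset, length and file name are illustrative):

                  $ fallocate --punch-hole --offset 4M --length 64M vmdisk.img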
185
186
187 Inline-data
188 Inline data is also referred to as data-in-inode as it allows
189 storing small files and directories in the inode block. This not
190 only saves space but also has a positive impact on cold-cache
191 directory and file operations. The data is transparently moved
192 out to an extent when it no longer fits inside the inode block.
193 This feature was added in Linux kernel 2.6.24 and requires en‐
194 abling on-disk feature inline-data.
195
196
197 REFLINK
198 REFLINK is also referred to as fast copy. It allows users to
199 atomically (and instantly) copy regular files. In other words,
200 create multiple writeable snapshots of regular files. It is
201 called REFLINK because it looks and feels more like a (hard)
202 link(2) than a traditional snapshot. Like a link, it is a regu‐
203 lar user operation, subject to the security attributes of the
204 inode being reflinked and not to the super user privileges typi‐
205 cally required to create a snapshot. Like a link, it operates
206 within a file system. But unlike a link, it links the inodes at
207 the data extent level allowing each reflinked inode to grow in‐
208 dependently as and when written to. Up to four billion inodes
209 can share a data extent. This feature was added in Linux kernel
210 2.6.32 and requires enabling on-disk feature refcount.
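
              For example, using the reflink(1) utility described later in
              this page (the file names are illustrative):

                  $ reflink dbfile dbfile.snap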
211
212
213 Allocation Reservation
214 File contiguity plays an important role in file system perfor‐
215 mance. When a file is fragmented on disk, reading and writing to
216 the file involves many seeks, leading to lower throughput. Con‐
217 tiguous files, on the other hand, minimize seeks, allowing the
218 disks to perform IO at the maximum rate.
219
220 With allocation reservation, the file system reserves a window
221 in the bitmap for all extending files allowing each to grow as
222 contiguously as possible. As this extra space is not actually
223 allocated, it is available for use by other files if the need
224 arises. This feature was added in Linux kernel 2.6.35 and can
225 be tuned using the mount option resv_level.
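
              For example, to reserve more aggressively on a volume hosting
              many concurrently extending files, one might mount with a
              higher reservation level (the level and mount point below are
              illustrative):

                  # mount -o resv_level=4 /dev/sda1 /ocfs2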
226
227
228 Indexed Directories
229 An indexed directory allows users to perform quick lookups of a
230 file in very large directories. It also results in faster cre‐
231 ates and unlinks and thus provides better overall performance.
232 This feature was added in Linux kernel 2.6.30 and requires en‐
233 abling on-disk feature indexed-dirs.
234
235
236 File Attributes
237 This refers to EXT2-style file attributes, such as immutable,
238 modified using chattr(1) and queried using lsattr(1). This fea‐
239 ture was added in Linux kernel 2.6.19.
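
              For example (the file path is illustrative):

                  # chattr +i /ocfs2/etc/app.conf
                  # lsattr /ocfs2/etc/app.conf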
240
241
242 Extended Attributes
An extended attribute refers to a name:value pair that can be
244 associated with file system objects like regular files, directo‐
245 ries, symbolic links, etc. OCFS2 allows associating an unlimited
246 number of attributes per object. The attribute names can be up
247 to 255 bytes in length, terminated by the first NUL character.
248 While it is not required, printable names (ASCII) are recom‐
249 mended. The attribute values can be up to 64 KB of arbitrary bi‐
250 nary data. These attributes can be modified and listed using
251 standard Linux utilities setfattr(1) and getfattr(1). This fea‐
252 ture was added in Linux kernel 2.6.29 and requires enabling on-
253 disk feature xattr.
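
              For example (the attribute name, value and path are
              illustrative):

                  $ setfattr -n user.location -v "rack42" /ocfs2/data/report.log
                  $ getfattr -n user.location /ocfs2/data/report.log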
254
255
256 Metadata Checksums
257 This feature allows the file system to detect silent corruptions
258 in all metadata blocks like inodes and directories. This feature
259 was added in Linux kernel 2.6.29 and requires enabling on-disk
260 feature metaecc.
261
262
263 POSIX ACLs and Security Attributes
POSIX ACLs allow assigning fine-grained discretionary access
rights for files and directories. This security scheme is a lot
more flexible than the traditional file access permissions, which
impose a strict user-group-other model.
268
269 Security attributes allow the file system to support other secu‐
270 rity regimes like SELinux, SMACK, AppArmor, etc.
271
272 Both these security extensions were added in Linux kernel 2.6.29
and require enabling on-disk feature xattr.
274
275
276 User and Group Quotas
277 This feature allows setting up usage quotas on user and group
278 basis by using the standard utilities like quota(1),
279 setquota(8), quotacheck(8), and quotaon(8). This feature was
280 added in Linux kernel 2.6.29 and requires enabling on-disk fea‐
281 tures usrquota and grpquota.
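
              For example, assuming the volume is mounted with the usrquota
              mount option, a per-user limit could be set with setquota(8)
              (the user name and limits are illustrative):

                  # setquota -u jeff 5120000 6144000 100000 120000 /ocfs2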
282
283
284 Unix File Locking
285 The Unix operating system has historically provided two system
286 calls to lock files. flock(2) or BSD locking and fcntl(2) or
287 POSIX locking. OCFS2 extends both file locks to the cluster.
288 File locks taken on one node interact with those taken on other
289 nodes.
290
291 The support for clustered flock(2) was added in Linux kernel
2.6.26. All flock(2) options are supported, including the kernel's
ability to cancel a lock request when an appropriate kill
294 signal is received by the user. This feature is supported with
295 all cluster-stacks including o2cb.
296
297 The support for clustered fcntl(2) was added in Linux kernel
298 2.6.28. But because it requires group communication to make the
299 locks coherent, it is only supported with userspace cluster
300 stacks, pcmk and cman and not with the default cluster stack
301 o2cb.
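
              As a sketch, the util-linux flock(1) wrapper can be used to
              serialize a job across the cluster; a second invocation on any
              node blocks until the first releases the lock (the lock file
              and command are illustrative):

                  $ flock /ocfs2/locks/batch.lock -c "run-batch-job"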
302
303
304 Comprehensive Tools Support
305 The file system has a comprehensive EXT3-style toolset that
306 tries to use similar parameters for ease-of-use. It includes
307 mkfs.ocfs2(8) (format), tunefs.ocfs2(8) (tune), fsck.ocfs2(8)
308 (check), debugfs.ocfs2(8) (debug), etc.
309
310
311 Online Resize
312 The file system can be dynamically grown using tunefs.ocfs2(8).
313 This feature was added in Linux kernel 2.6.25.
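
              For example, after growing the underlying device, the file
              system could be grown to fill it, assuming the installed
              tunefs.ocfs2(8) supports the volume-size option:

                  # tunefs.ocfs2 -S /dev/sda1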
314
315
317 The O2CB cluster stack has a global heartbeat mode. It allows users to
318 specify heartbeat regions that are consistent across all nodes. The
319 cluster stack also allows online addition and removal of both nodes and
320 heartbeat regions.
321
o2cb(8) is the new cluster configuration utility. It is an easy-to-use
utility that allows users to create the cluster configuration on a node
that is not part of the cluster. It replaces the older utility
o2cb_ctl(8), which has been deprecated.
326
327 ocfs2console(8) has been obsoleted.
328
o2info(1) is a new utility that can be used to provide file system in‐
330 formation. It allows non-privileged users to see the enabled file sys‐
331 tem features, block and cluster sizes, extended file stat, free space
332 fragmentation, etc.
333
o2hbmonitor(8) is an o2hb heartbeat monitor. It is an extremely
lightweight utility that logs messages to the system logger once the
heartbeat delay exceeds the warn threshold. This utility is useful in
identifying volumes encountering I/O delays.
338
339 debugfs.ocfs2(8) has some new commands. net_stats shows the o2net mes‐
sage times between various nodes. This is useful in identifying nodes
that are slowing down the cluster operations. stat_sysdir allows the
342 user to dump the entire system directory that can be used to debug is‐
343 sues. grpextents dumps the complete free space fragmentation in the
344 cluster group allocator.
345
346 mkfs.ocfs2(8) now enables xattr, indexed-dirs, discontig-bg, refcount,
347 extended-slotmap and clusterinfo feature flags by default, in addition
348 to the older defaults, sparse, unwritten and inline-data.
349
350 mount.ocfs2(8) allows users to specify the level of cache coherency be‐
351 tween nodes. By default the file system operates in full coherency
352 mode that also serializes the direct I/Os. While this mode is techni‐
cally correct, it limits the I/O throughput in a clustered database. This
354 mount option allows the user to limit the cache coherency to only the
355 buffered I/Os to allow multiple nodes to do concurrent direct writes to
356 the same file. This feature works with Linux kernel 2.6.37 and later.
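
For example (the device and mount point are illustrative):

    # mount -o coherency=buffered /dev/sdd1 /vmstore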
357
358
The OCFS2 development team goes to great lengths to maintain compati‐
361 bility. It attempts to maintain both on-disk and network protocol com‐
362 patibility across all releases of the file system. It does so even
363 while adding new features that entail on-disk format and network proto‐
364 col changes. To do this successfully, it follows a few rules:
365
366 1. The on-disk format changes are managed by a set of feature flags
367 that can be turned on and off. The file system in kernel detects
368 these features during mount and continues only if it understands
369 all the features. Users encountering this have the option of either
370 disabling that feature or upgrading the file system to a newer re‐
371 lease.
372
373 2. The latest release of ocfs2-tools is compatible with all ver‐
sions of the file system. All utilities detect the features enabled
on disk and continue only if they understand all the features. Users
encountering this have to upgrade the tools to a newer release.
377
378 3. The network protocol version is negotiated by the nodes to en‐
379 sure all nodes understand the active protocol version.
380
381
382 FEATURE FLAGS
383 The feature flags are split into three categories, namely, Com‐
384 pat, Incompat and RO Compat.
385
386 Compat, or compatible, is a feature that the file system does
387 not need to fully understand to safely read/write to the volume.
388 An example of this is the backup-super feature that added the
389 capability to backup the super block in multiple locations in
390 the file system. As the backup super blocks are typically not
391 read nor written to by the file system, an older file system can
392 safely mount a volume with this feature enabled.
393
394 Incompat, or incompatible, is a feature that the file system
395 needs to fully understand to read/write to the volume. Most fea‐
396 tures fall under this category.
397
398 RO Compat, or read-only compatible, is a feature that the file
399 system needs to fully understand to write to the volume. Older
400 software can safely read a volume with this feature enabled. An
401 example of this would be user and group quotas. As quotas are
402 manipulated only when the file system is written to, older soft‐
403 ware can safely mount such volumes in read-only mode.
404
The list of feature flags, the version of the kernel each flag
was added in, the earliest version of the tools that understands
it, etc., is as follows:
408
409
410 ┌─────────────────────┬────────────────┬─────────────────┬───────────┬───────────┐
411 │Feature Flags │ Kernel Version │ Tools Version │ Category │ Hex Value │
412 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
413 │backup-super │ All │ ocfs2-tools 1.2 │ Compat │ 1 │
414 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
415 │strict-journal-super │ All │ All │ Compat │ 2 │
416 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
417 │local │ Linux 2.6.20 │ ocfs2-tools 1.2 │ Incompat │ 8 │
418 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
419 │sparse │ Linux 2.6.22 │ ocfs2-tools 1.4 │ Incompat │ 10 │
420 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
421 │inline-data │ Linux 2.6.24 │ ocfs2-tools 1.4 │ Incompat │ 40 │
422 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
423 │extended-slotmap │ Linux 2.6.27 │ ocfs2-tools 1.6 │ Incompat │ 100 │
424 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
425 │xattr │ Linux 2.6.29 │ ocfs2-tools 1.6 │ Incompat │ 200 │
426 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
427 │indexed-dirs │ Linux 2.6.30 │ ocfs2-tools 1.6 │ Incompat │ 400 │
428 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
429 │metaecc │ Linux 2.6.29 │ ocfs2-tools 1.6 │ Incompat │ 800 │
430 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
431 │refcount │ Linux 2.6.32 │ ocfs2-tools 1.6 │ Incompat │ 1000 │
432 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
433 │discontig-bg │ Linux 2.6.35 │ ocfs2-tools 1.6 │ Incompat │ 2000 │
434 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
435 │clusterinfo │ Linux 2.6.37 │ ocfs2-tools 1.8 │ Incompat │ 4000 │
436 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
437 │unwritten │ Linux 2.6.23 │ ocfs2-tools 1.4 │ RO Compat │ 1 │
438 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
439 │grpquota │ Linux 2.6.29 │ ocfs2-tools 1.6 │ RO Compat │ 2 │
440 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
441 │usrquota │ Linux 2.6.29 │ ocfs2-tools 1.6 │ RO Compat │ 4 │
442 └─────────────────────┴────────────────┴─────────────────┴───────────┴───────────┘
443
444 To query the features enabled on a volume, do:
445
446 $ o2info --fs-features /dev/sdf1
447 backup-super strict-journal-super sparse extended-slotmap inline-data xattr
448 indexed-dirs refcount discontig-bg clusterinfo unwritten
449
450
451 ENABLING AND DISABLING FEATURES
452
453 The format utility, mkfs.ocfs2(8), allows a user to enable and
disable specific features using the --fs-features option. The fea‐
455 tures are provided as a comma separated list. The enabled fea‐
456 tures are listed as is. The disabled features are prefixed with
457 no. The example below shows the file system being formatted
458 with sparse disabled and inline-data enabled.
459
460 # mkfs.ocfs2 --fs-features=nosparse,inline-data /dev/sda1
461
After formatting, users can toggle features using the tune
utility, tunefs.ocfs2(8). This is an offline operation. The
volume needs to be unmounted across the cluster. The example be‐
465 low shows the sparse feature being enabled and inline-data dis‐
466 abled.
467
468 # tunefs.ocfs2 --fs-features=sparse,noinline-data /dev/sda1
469
470 Care should be taken before enabling and disabling features.
Users planning to use a volume with an older version of the file
system will be better off not enabling newer features, as
disabling them later may not succeed.
474
475 An example would be disabling the sparse feature; this requires
476 filling every hole. The operation can only succeed if the file
477 system has enough free space.
478
479
480 DETECTING FEATURE INCOMPATIBILITY
481
482 Say one tries to mount a volume with an incompatible feature.
483 What happens then? How does one detect the problem? How does one
484 know the name of that incompatible feature?
485
486 To begin with, one should look for error messages in dmesg(8).
487 Mount failures that are due to an incompatible feature will al‐
488 ways result in an error message like the following:
489
490 ERROR: couldn't mount because of unsupported optional features (200).
491
492 Here the file system is unable to mount the volume due to an un‐
supported optional feature. That means the feature in question is an
494 Incompat feature. By referring to the table above, one can then
495 deduce that the user failed to mount a volume with the xattr
496 feature enabled. (The value in the error message is in hexadeci‐
497 mal.)
498
499 Another example of an error message due to incompatibility is as
500 follows:
501
502 ERROR: couldn't mount RDWR because of unsupported optional features (1).
503
504 Here the file system is unable to mount the volume in the RW
mode. That means the feature in question is a RO Compat feature. An‐
506 other look at the table and it becomes apparent that the volume
507 had the unwritten feature enabled.
508
509 In both cases, the user has the option of disabling the feature.
510 In the second case, the user has the choice of mounting the vol‐
511 ume in the RO mode.
512
513
515 The OCFS2 software is split into two components, namely, kernel and
516 tools. The kernel component includes the core file system and the clus‐
517 ter stack, and is packaged along with the kernel. The tools component
518 is packaged as ocfs2-tools and needs to be specifically installed. It
519 provides utilities to format, tune, mount, debug and check the file
520 system.
521
To install ocfs2-tools, refer to the package handling utility of
your distribution.
524
525 The next step is selecting a cluster stack. The options include:
526
527 A. No cluster stack, or local mount.
528
529 B. In-kernel o2cb cluster stack with local or global heartbeat.
530
531 C. Userspace cluster stacks pcmk or cman.
532
533 The file system allows changing cluster stacks easily using
534 tunefs.ocfs2(8). To list the cluster stacks stamped on the OCFS2 vol‐
535 umes, do:
536
537 # mounted.ocfs2 -d
538 Device Stack Cluster F UUID Label
539 /dev/sdb1 o2cb webcluster G DCDA2845177F4D59A0F2DCD8DE507CC3 hbvol1
540 /dev/sdc1 None 23878C320CF3478095D1318CB5C99EED localmount
541 /dev/sdd1 o2cb webcluster G 8AB016CD59FC4327A2CDAB69F08518E3 webvol
542 /dev/sdg1 o2cb webcluster G 77D95EF51C0149D2823674FCC162CF8B logsvol
543 /dev/sdh1 o2cb webcluster G BBA1DBD0F73F449384CE75197D9B7098 scratch
544
545
546 NON-CLUSTERED OR LOCAL MOUNT
547
To format an OCFS2 volume as a non-clustered (local) volume, do:
549
550 # mkfs.ocfs2 -L "mylabel" --fs-features=local /dev/sda1
551
552 To convert an existing clustered volume to a non-clustered vol‐
553 ume, do:
554
555 # tunefs.ocfs2 --fs-features=local /dev/sda1
556
557 Non-clustered volumes do not interact with the cluster stack.
558 One can have both clustered and non-clustered volumes mounted at
559 the same time.
560
561 While formatting a non-clustered volume, users should consider
562 the possibility of later converting that volume to a clustered
563 one. If there is a possibility of that, then the user should add
564 enough node-slots using the -N option. Adding node-slots during
565 format creates journals with large extents. If created later,
566 then the journals will be fragmented which is not good for per‐
567 formance.
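
         For example, the following formats a local volume with four
         node slots so that it can later be converted to a clustered
         volume (the label and device are illustrative):

             # mkfs.ocfs2 -N 4 -L "scratch" --fs-features=local /dev/sdb1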
568
569
570 CLUSTERED MOUNT WITH O2CB CLUSTER STACK
571
Only one of the two heartbeat modes can be active at any one
573 time. Changing heartbeat modes is an offline operation.
574
575 Both heartbeat modes require /etc/ocfs2/cluster.conf and
576 /etc/sysconfig/o2cb to be populated as described in ocfs2.clus‐
577 ter.conf(5) and o2cb.sysconfig(5) respectively. The only differ‐
578 ence in set up between the two modes is that global requires
579 heartbeat devices to be configured whereas local does not.
580
Refer to o2cb(7) for more information.
582
583
584 LOCAL HEARTBEAT
585 This is the default heartbeat mode. The user needs to
586 populate the configuration files as described in
587 ocfs2.cluster.conf(5) and o2cb.sysconfig(5). In this
588 mode, the cluster stack heartbeats on all mounted vol‐
589 umes. Thus, one does not have to specify heartbeat de‐
590 vices in cluster.conf.
591
592 Once configured, the o2cb cluster stack can be onlined
593 and offlined as follows:
594
595 # service o2cb online
596 Setting cluster stack "o2cb": OK
597 Registering O2CB cluster "webcluster": OK
598 Setting O2CB cluster timeouts : OK
599
600 # service o2cb offline
601 Clean userdlm domains: OK
602 Stopping O2CB cluster webcluster: OK
603 Unregistering O2CB cluster "webcluster": OK
604
605
606 GLOBAL HEARTBEAT
607 The configuration is similar to local heartbeat. The one
608 additional step in this mode is that it requires heart‐
609 beat devices to be also configured.
610
611 These heartbeat devices are OCFS2 formatted volumes with
612 global heartbeat enabled on disk. These volumes can later
613 be mounted and used as clustered file systems.
614
The steps to format a volume with global heartbeat enabled
are listed in o2cb(7). Also listed there are the steps to
list all volumes with the cluster stack stamped on disk.
618
619 In this mode, the heartbeat is started when the cluster
620 is onlined and stopped when the cluster is offlined.
621
622 # service o2cb online
623 Setting cluster stack "o2cb": OK
624 Registering O2CB cluster "webcluster": OK
625 Setting O2CB cluster timeouts : OK
626 Starting global heartbeat for cluster "webcluster": OK
627
628 # service o2cb offline
629 Clean userdlm domains: OK
630 Stopping global heartbeat on cluster "webcluster": OK
631 Stopping O2CB cluster webcluster: OK
632 Unregistering O2CB cluster "webcluster": OK
633
634 # service o2cb status
635 Driver for "configfs": Loaded
636 Filesystem "configfs": Mounted
637 Stack glue driver: Loaded
638 Stack plugin "o2cb": Loaded
639 Driver for "ocfs2_dlmfs": Loaded
640 Filesystem "ocfs2_dlmfs": Mounted
641 Checking O2CB cluster "webcluster": Online
642 Heartbeat dead threshold: 31
643 Network idle timeout: 30000
644 Network keepalive delay: 2000
645 Network reconnect delay: 2000
646 Heartbeat mode: Global
647 Checking O2CB heartbeat: Active
648 77D95EF51C0149D2823674FCC162CF8B /dev/sdg1
649 Nodes in O2CB cluster: 92 96
650
651
652
653 CLUSTERED MOUNT WITH USERSPACE CLUSTER STACK
654
655 Configure and online the userspace stack pcmk or cman before us‐
656 ing tunefs.ocfs2(8) to update the cluster stack on disk.
657
658 # tunefs.ocfs2 --update-cluster-stack /dev/sdd1
659 Updating on-disk cluster information to match the running cluster.
660 DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS
661 FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
662 Update the on-disk cluster information? y
663
664 Refer to the cluster stack documentation for information on
665 starting and stopping the cluster stack.
666
667
This section lists the utilities that are used to manage OCFS2
file systems. This includes tools to format, tune, check, mount, and
debug the file system. Each utility has a man page that lists its capabili‐
672 ties in detail.
673
674
675 mkfs.ocfs2(8)
676 This is the file system format utility. All volumes have to be
formatted prior to use. As this utility overwrites the vol‐
678 ume, use it with care. Double check to ensure the volume is not
679 in use on any node in the cluster.
680
681 As a precaution, the utility will abort if the volume is locally
682 mounted. It also detects use across the cluster if used by
683 OCFS2. But these checks are not comprehensive and can be over‐
684 ridden. So use it with care.
685
686 While it is not always required, the cluster should be online.
687
688
689 tunefs.ocfs2(8)
690 This is the file system tune utility. It allows users to change
691 certain on-disk parameters like label, uuid, number of node-
692 slots, volume size and the size of the journals. It also allows
693 turning on and off the file system features as listed above.
694
695 This utility requires the cluster to be online.
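
       For example, relabeling a volume might look as follows (the
       label and device are illustrative):

           # tunefs.ocfs2 -L "backupvol" /dev/sda1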
696
697
698 fsck.ocfs2(8)
699 This is the file system check utility. It detects and fixes on-
700 disk errors. All the check codes and their fixes are listed in
701 fsck.ocfs2.checks(8).
702
703 This utility requires the cluster to be online to ensure the
704 volume is not in use on another node and to prevent the volume
705 from being mounted for the duration of the check.
706
707
708 mount.ocfs2(8)
709 This is the file system mount utility. It is invoked indirectly
710 by the mount(8) utility.
711
712 This utility detects the cluster status and aborts if the clus‐
713 ter is offline or does not match the cluster stamped on disk.
714
715
716 o2cluster(8)
717 This is the file system cluster stack update utility. It allows
718 the users to update the on-disk cluster stack to the one pro‐
719 vided.
720
721 This utility only updates the disk if the utility is reasonably
722 assured that the file system is not in use on any node.
723
724
725 o2info(1)
726 This is the file system information utility. It provides infor‐
727 mation like the features enabled on disk, block size, cluster
728 size, free space fragmentation, etc.
729
730 It can be used by both privileged and non-privileged users.
731 Users having read permission on the device can provide the path
732 to the device. Other users can provide the path to a file on a
733 mounted file system.
734
735
736 debugfs.ocfs2(8)
737 This is the file system debug utility. It allows users to exam‐
738 ine all file system structures including walking directory
739 structures, displaying inodes, backing up files, etc., without
740 mounting the file system.
741
742 This utility requires the user to have read permission on the
743 device.
744
745
746 o2image(8)
747 This is the file system image utility. It allows users to copy
748 the file system metadata skeleton, including the inodes, direc‐
tories, bitmaps, etc. As it excludes data, the resulting image is
far smaller than the file system itself.
751
752 The image file created can be used in debugging on-disk corrup‐
753 tions.
754
755
756 mounted.ocfs2(8)
757 This is the file system detect utility. It detects all OCFS2
volumes in the system and lists their labels, UUIDs and cluster
stacks.
760
761
This section lists the utilities that are used to manage the O2CB cluster
764 stack. Each utility has a man page that lists its capabilities in de‐
765 tail.
766
767 o2cb(8)
768 This is the cluster configuration utility. It allows users to
769 update the cluster configuration by adding and removing nodes
770 and heartbeat regions. This utility is used by the o2cb init
771 script to online and offline the cluster.
772
773 This is a new utility and replaces o2cb_ctl(8) which has been
774 deprecated.
775
776
777 ocfs2_hb_ctl(8)
778 This is the cluster heartbeat utility. It allows users to start
779 and stop local heartbeat. This utility is invoked by
780 mount.ocfs2(8) and should not be invoked directly by the user.
781
782
783 o2hbmonitor(8)
784 This is the disk heartbeat monitor. It tracks the elapsed time
785 since the last heartbeat and logs warnings once that time ex‐
786 ceeds the warn threshold.
787
788
790 This section includes some useful notes that may prove helpful to the
791 user.
792
793 BALANCED CLUSTER
794 A cluster is a computer. This is a fact and not a slogan. What
795 this means is that an errant node in the cluster can affect the
796 behavior of other nodes. If one node is slow, the cluster opera‐
797 tions will slow down on all nodes. To prevent that, it is best
798 to have a balanced cluster. This is a cluster that has equally
799 powered and loaded nodes.
800
801 The standard recommendation for such clusters is to have identi‐
802 cal hardware and software across all the nodes. However, that is
803 not a hard and fast rule. After all, we have taken the effort to
804 ensure that OCFS2 works in a mixed architecture environment.
805
806 If one uses OCFS2 in a mixed architecture environment, try to
807 ensure that the nodes are equally powered and loaded. The use of
808 a load balancer can assist with the latter. Power refers to the
809 number of processors, speed, amount of memory, I/O throughput,
810 network bandwidth, etc. In reality, having equally powered het‐
811 erogeneous nodes is not always practical. In that case, make the
812 lower node numbers more powerful than the higher node numbers.
813 The O2CB cluster stack favors lower node numbers in all of its
814 tiebreaking logic.
815
816 This is not to suggest you should add a single core node in a
817 cluster of quad cores. No amount of node number juggling will
818 help you there.
819
820
821 FILE DELETION
822 In Linux, rm(1) removes the directory entry. It does not neces‐
823 sarily delete the corresponding inode. But by removing the di‐
824 rectory entry, it gives the illusion that the inode has been
825 deleted. This puzzles users when they do not see a correspond‐
826 ing up-tick in the reported free space. The reason is that in‐
827 ode deletion has a few more hurdles to cross.
828
First is the hard link count, which indicates the number of di‐
830 rectory entries pointing to that inode. As long as an inode has
831 one or more directory entries pointing to it, it cannot be
832 deleted. The file system has to wait for the removal of all
833 those directory entries. In other words, wait for that count to
834 drop to zero.
835
836 The second hurdle is the POSIX semantics allowing files to be
837 unlinked even while they are in-use. In OCFS2, that translates
838 to in-use across the cluster. The file system has to wait for
839 all processes across the cluster to stop using the inode.
840
841 Once these conditions are met, the inode is deleted and the
842 freed space is visible after the next sync.
843
844 Now the amount of space freed depends on the allocation. Only
845 space that is actually allocated to that inode is freed. The ex‐
846 ample below shows a sparsely allocated file of size 51TB of
847 which only 2.4GB is actually allocated.
848
849 $ ls -lsh largefile
850 2.4G -rw-r--r-- 1 mark mark 51T Sep 29 15:04 largefile
851
852 Furthermore, for reflinked files, only private extents are
freed. Shared extents are freed only when the last inode accessing
them is deleted. The example below shows a 4GB file that shares
3GB with other reflinked files. Deleting it will increase the
free space by 1GB. However, if it is the only remaining file ac‐
cessing the shared extents, the full 4GB will be freed. (More
858 information on the shared-du(1) utility is provided below.)
859
860 $ shared-du -m -c --shared-size reflinkedfile
861 4000 (3000) reflinkedfile
862
863 The deletion itself is a multi-step process. Once the hard link
864 count falls to zero, the inode is moved to the orphan_dir system
865 directory where it remains until the last process, across the
866 cluster, stops using the inode. Then the file system frees the
867 extents and adds the freed space count to the truncate_log sys‐
868 tem file where it remains until the next sync. The freed space
869 is made visible to the user only after that sync.
870
871
872 DIRECTORY LISTING
873 ls(1) may be a simple command, but it is not cheap. What is ex‐
874 pensive is not the part where it reads the directory listing,
but the second part where it reads all the inodes, also referred
to as an inode stat(2). If the inodes are not in cache, this can
877 entail disk I/O. Now, while a cold cache inode stat(2) is ex‐
878 pensive in all file systems, it is especially so in a clustered
879 file system as it needs to take a cluster lock on each inode.
880
A hot cache stat(2), on the other hand, has been shown to perform
on OCFS2 like it does on EXT3.
883
884 In other words, the second ls(1) will be quicker than the first.
885 However, it is not guaranteed. Say you have a million files in a
886 file system and not enough kernel memory to cache all the in‐
887 odes. In that case, each ls(1) will involve some cold cache
888 stat(2)s.
889
890
891 ALLOCATION RESERVATION
892 Allocation reservation allows multiple concurrently extending
893 files to grow as contiguously as possible. One way to demon‐
894 strate its functioning is to run a script that extends multiple
895 files in a circular order. The script below does that by writing
896 one hundred 4KB chunks to four files, one after another.
897
898 $ for i in $(seq 0 99);
899 > do
900 > for j in $(seq 4);
901 > do
902 > dd if=/dev/zero of=file$j bs=4K count=1 seek=$i;
903 > done;
904 > done;
905
906 When run on a system running Linux kernel 2.6.34 or earlier, we
907 end up with files with 100 extents each. That is full fragmenta‐
908 tion. As the files are being extended one after another, the on-
909 disk allocations are fully interleaved.
910
911 $ filefrag file1 file2 file3 file4
912 file1: 100 extents found
913 file2: 100 extents found
914 file3: 100 extents found
915 file4: 100 extents found
916
917 When run on a system running Linux kernel 2.6.35 or later, we
918 see files with 7 extents each. That is a lot fewer than before.
919 Fewer extents mean more on-disk contiguity and that always leads
920 to better overall performance.
921
922 $ filefrag file1 file2 file3 file4
923 file1: 7 extents found
924 file2: 7 extents found
925 file3: 7 extents found
926 file4: 7 extents found
927
928
929 REFLINK OPERATION
930 This feature allows a user to create a writeable snapshot of a
931 regular file. In this operation, the file system creates a new
932 inode with the same extent pointers as the original inode. Mul‐
933 tiple inodes are thus able to share data extents. This adds a
934 twist in file system administration because none of the existing
file system utilities in Linux expect this behavior. du(1), a
utility used to compute file space usage, simply adds the
blocks allocated to each inode. As it does not know about shared
extents, it overestimates the space used. Say, we have a 5GB
939 file in a volume having 42GB free.
940
941 $ ls -l
942 total 5120000
943 -rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:15 myfile
944
945 $ du -m myfile*
946 5000 myfile
947
948 $ df -h .
949 Filesystem Size Used Avail Use% Mounted on
950 /dev/sdd1 50G 8.2G 42G 17% /ocfs2
951
952 If we were to reflink it 4 times, we would expect the directory
953 listing to report five 5GB files, but the df(1) to report no
954 loss of available space. du(1), on the other hand, would report
955 the disk usage to climb to 25GB.
956
957 $ reflink myfile myfile-ref1
958 $ reflink myfile myfile-ref2
959 $ reflink myfile myfile-ref3
960 $ reflink myfile myfile-ref4
961
962 $ ls -l
963 total 25600000
964 -rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:15 myfile
965 -rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref1
966 -rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref2
967 -rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref3
968 -rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref4
969
970 $ df -h .
971 Filesystem Size Used Avail Use% Mounted on
972 /dev/sdd1 50G 8.2G 42G 17% /ocfs2
973
974 $ du -m myfile*
975 5000 myfile
976 5000 myfile-ref1
977 5000 myfile-ref2
978 5000 myfile-ref3
979 5000 myfile-ref4
980 25000 total
981
982 Enter shared-du(1), a shared extent-aware du. This utility re‐
ports the shared extents per file in parentheses and the overall
984 footprint. As expected, it lists the overall footprint at 5GB.
985 One can view the details of the extents using shared-file‐
986 frag(1). Both these utilities are available at http://oss.ora‐
987 cle.com/~smushran/reflink-tools/. We are currently in the
988 process of pushing the changes to the upstream maintainers of
989 these utilities.
990
991 $ shared-du -m -c --shared-size myfile*
992 5000 (5000) myfile
993 5000 (5000) myfile-ref1
994 5000 (5000) myfile-ref2
995 5000 (5000) myfile-ref3
996 5000 (5000) myfile-ref4
997 25000 total
998 5000 footprint
999
1000 # shared-filefrag -v myfile
1001 Filesystem type is: 7461636f
1002 File size of myfile is 5242880000 (1280000 blocks, blocksize 4096)
1003 ext logical physical expected length flags
1004 0 0 2247937 8448
1005 1 8448 2257921 2256384 30720
1006 2 39168 2290177 2288640 30720
1007 3 69888 2322433 2320896 30720
1008 4 100608 2354689 2353152 30720
1009 7 192768 2451457 2449920 30720
1010 . . .
1011 37 1073408 2032129 2030592 30720 shared
1012 38 1104128 2064385 2062848 30720 shared
1013 39 1134848 2096641 2095104 30720 shared
1014 40 1165568 2128897 2127360 30720 shared
1015 41 1196288 2161153 2159616 30720 shared
1016 42 1227008 2193409 2191872 30720 shared
1017 43 1257728 2225665 2224128 22272 shared,eof
1018 myfile: 44 extents found
1019
1020
1021 DATA COHERENCY
1022 One of the challenges in a shared file system is data coherency
1023 when multiple nodes are writing to the same set of files. NFS,
1024 for example, provides close-to-open data coherency that results
1025 in the data being flushed to the server when the file is closed
1026 on the client. This leaves open a wide window for stale data
1027 being read on another node.
1028
1029 A simple test to check the data coherency of a shared file sys‐
1030 tem involves concurrently appending the same file. Like running
1031 "uname -a >>/dir/file" using a parallel distributed shell like
1032 dsh or pconsole. If coherent, the file will contain the results
1033 from all nodes.
1034
1035 # dsh -R ssh -w node32,node33,node34,node35 "uname -a >> /ocfs2/test"
1036 # cat /ocfs2/test
1037 Linux node32 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
1038 Linux node35 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
1039 Linux node33 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
1040 Linux node34 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
1041
1042 OCFS2 is a fully cache coherent cluster file system.
1043
1044
1045 DISCONTIGUOUS BLOCK GROUP
1046 Most file systems pre-allocate space for inodes during format.
1047 OCFS2 dynamically allocates this space when required.
1048
1049 However, this dynamic allocation has been problematic when the
1050 free space is very fragmented, because the file system required
1051 the inode and extent allocators to grow in contiguous fixed-size
1052 chunks.
1053
1054 The discontiguous block group feature takes care of this problem
1055 by allowing the allocators to grow in smaller, variable-sized
1056 chunks.
1057
1058 This feature was added in Linux kernel 2.6.35 and requires en‐
1059 abling on-disk feature discontig-bg.
1060
1061
1062 BACKUP SUPER BLOCKS
1063 A file system super block stores critical information that is
1064 hard to recreate. In OCFS2, it stores the block size, cluster
1065 size, and the locations of the root and system directories,
1066 among other things. As this block is close to the start of the
1067 disk, it is very susceptible to being overwritten by an errant
1068 write. Say, dd if=file of=/dev/sda1.
1069
1070 Backup super blocks are copies of the super block. These blocks
1071 are dispersed in the volume to minimize the chances of being
1072 overwritten. On the small chance that the original gets cor‐
1073 rupted, the backups are available to scan and fix the corrup‐
1074 tion.
1075
1076 mkfs.ocfs2(8) enables this feature by default. Users can disable
1077 this by specifying --fs-features=nobackup-super during format.
1078
1079 o2info(1) can be used to view whether the feature has been en‐
1080 abled on a device.
1081
1082 # o2info --fs-features /dev/sdb1
1083 backup-super strict-journal-super sparse extended-slotmap inline-data xattr
1084 indexed-dirs refcount discontig-bg clusterinfo unwritten
1085
1086 In OCFS2, the super block is on the third block. The backups are
1087 located at the 1G, 4G, 16G, 64G, 256G and 1T byte offsets. The
1088 actual number of backup blocks depends on the size of the de‐
1089 vice. The super block is not backed up on devices smaller than
1090 1GB.
1091
1092 fsck.ocfs2(8) refers to these six offsets by numbers, 1 to 6.
1093 Users can specify any backup with the -r option to recover the
1094 volume. The example below uses the second backup. If successful,
1095 fsck.ocfs2(8) overwrites the corrupted super block with the
1096 backup.
1097
1098 # fsck.ocfs2 -f -r 2 /dev/sdb1
1099 fsck.ocfs2 1.8.0
1100 [RECOVER_BACKUP_SUPERBLOCK] Recover superblock information from backup block#1048576? <n> y
1101 Checking OCFS2 filesystem in /dev/sdb1:
1102 Label: webhome
1103 UUID: B3E021A2A12B4D0EB08E9E986CDC7947
1104 Number of blocks: 13107196
1105 Block size: 4096
1106 Number of clusters: 13107196
1107 Cluster size: 4096
1108 Number of slots: 8
1109
1110 /dev/sdb1 was run with -f, check forced.
1111 Pass 0a: Checking cluster allocation chains
1112 Pass 0b: Checking inode allocation chains
1113 Pass 0c: Checking extent block allocation chains
1114 Pass 1: Checking inodes and blocks.
1115 Pass 2: Checking directory entries.
1116 Pass 3: Checking directory connectivity.
1117 Pass 4a: checking for orphaned inodes
1118 Pass 4b: Checking inodes link counts.
1119 All passes succeeded.
1120
1121
1122 SYNTHETIC FILE SYSTEMS
1123 The OCFS2 development effort included two synthetic file sys‐
1124 tems, configfs and dlmfs. It also makes use of a third, debugfs.
1125
1126
1127 configfs
1128 configfs has since been accepted as a generic kernel com‐
1129 ponent and is also used by netconsole and fs/dlm. OCFS2
1130 tools use it to communicate the list of nodes in the
1131 cluster, details of the heartbeat device, cluster time‐
1132 outs, and so on to the in-kernel node manager. The o2cb
1133 init script mounts this file system at /sys/kernel/con‐
1134 fig.
1135
1136
1137 dlmfs dlmfs exposes the in-kernel o2dlm to the user-space.
1138 While it was developed primarily for OCFS2 tools, it has
1139 seen usage by others looking to add a cluster locking di‐
1140 mension in their applications. Users interested in doing
1141 the same should look at the libo2dlm library provided by
1142 ocfs2-tools. The o2cb init script mounts this file system
1143 at /dlm.
1144
1145
1146 debugfs
1147 OCFS2 uses debugfs to expose its in-kernel information to
1148 user space. For example, listing the file system cluster
1149 locks, dlm locks, dlm state, o2net state, etc. Users can
1150 access the information by mounting the file system at
1151 /sys/kernel/debug. To automount, add the following to
1152 /etc/fstab: debugfs /sys/kernel/debug debugfs defaults 0
1153 0
1154
1155
1156 DISTRIBUTED LOCK MANAGER
1157 One of the key technologies in a cluster is the lock manager,
1158 which maintains the locking state of all resources across the
1159 cluster. An easy implementation of a lock manager involves des‐
1160 ignating one node to handle everything. In this model, if a node
1161 wanted to acquire a lock, it would send the request to the lock
1162 manager. However, this model has a weakness: lock manager’s
1163 death causes the cluster to seize up.
1164
1165 A better model is one where all nodes manage a subset of the
1166 lock resources. Each node maintains enough information for all
the lock resources it is interested in. In the event of a node
death, the remaining nodes pool their information to reconstruct
the lock state maintained by the dead node. In this
1170 scheme, the locking overhead is distributed amongst all the
1171 nodes. Hence, the term distributed lock manager.
1172
1173 O2DLM is a distributed lock manager. It is based on the specifi‐
cation titled "Programming Locking Applications" written by
1175 Kristin Thomas and is available at the following link.
1176 http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlm‐
1177 book_final.pdf
1178
1179
1180 DLM DEBUGGING
1181 O2DLM has a rich debugging infrastructure that allows it to show
1182 the state of the lock manager, all the lock resources, among
1183 other things. The figure below shows the dlm state of a nine-
1184 node cluster that has just lost three nodes: 12, 32, and 35. It
1185 can be ascertained that node 7, the recovery master, is cur‐
1186 rently recovering node 12 and has received the lock states of
1187 the dead node from all other live nodes.
1188
1189 # cat /sys/kernel/debug/o2dlm/45F81E3B6F2B48CCAAD1AE7945AB2001/dlm_state
1190 Domain: 45F81E3B6F2B48CCAAD1AE7945AB2001 Key: 0x10748e61
1191 Thread Pid: 24542 Node: 7 State: JOINED
1192 Number of Joins: 1 Joining Node: 255
1193 Domain Map: 7 31 33 34 40 50
1194 Live Map: 7 31 33 34 40 50
1195 Lock Resources: 48850 (439879)
1196 MLEs: 0 (1428625)
1197 Blocking: 0 (1066000)
1198 Mastery: 0 (362625)
1199 Migration: 0 (0)
1200 Lists: Dirty=Empty Purge=Empty PendingASTs=Empty PendingBASTs=Empty
1201 Purge Count: 0 Refs: 1
1202 Dead Node: 12
1203 Recovery Pid: 24543 Master: 7 State: ACTIVE
1204 Recovery Map: 12 32 35
1205 Recovery Node State:
1206 7 - DONE
1207 31 - DONE
1208 33 - DONE
1209 34 - DONE
1210 40 - DONE
1211 50 - DONE
1212
1213 The figure below shows the state of a dlm lock resource that is
1214 mastered (owned) by node 25, with 6 locks in the granted queue
1215 and node 26 holding the EX (writelock) lock on that resource.
1216
1217 # debugfs.ocfs2 -R "dlm_locks M000000000000000022d63c00000000" /dev/sda1
1218 Lockres: M000000000000000022d63c00000000 Owner: 25 State: 0x0
1219 Last Used: 0 ASTs Reserved: 0 Inflight: 0 Migration Pending: No
1220 Refs: 8 Locks: 6 On Lists: None
1221 Reference Map: 26 27 28 94 95
1222 Lock-Queue Node Level Conv Cookie Refs AST BAST Pending-Action
1223 Granted 94 NL -1 94:3169409 2 No No None
1224 Granted 28 NL -1 28:3213591 2 No No None
1225 Granted 27 NL -1 27:3216832 2 No No None
1226 Granted 95 NL -1 95:3178429 2 No No None
1227 Granted 25 NL -1 25:3513994 2 No No None
1228 Granted 26 EX -1 26:3512906 2 No No None
1229
1230 The figure below shows a lock from the file system perspective.
1231 Specifically, it shows a lock that is in the process of being
upconverted from NL to EX. Locks in this state are referred to in
the file system as busy locks and can be listed using the
debugfs.ocfs2 command, "fs_locks -B".
1235
1236 # debugfs.ocfs2 -R "fs_locks -B" /dev/sda1
1237 Lockres: M000000000000000000000b9aba12ec Mode: No Lock
1238 Flags: Initialized Attached Busy
1239 RO Holders: 0 EX Holders: 0
1240 Pending Action: Convert Pending Unlock Action: None
1241 Requested Mode: Exclusive Blocking Mode: No Lock
1242 PR > Gets: 0 Fails: 0 Waits Total: 0us Max: 0us Avg: 0ns
1243 EX > Gets: 1 Fails: 0 Waits Total: 544us Max: 544us Avg: 544185ns
1244 Disk Refreshes: 1
1245
1246 With this debugging infrastructure in place, users can debug
1247 hang issues as follows:
1248
1249 * Dump the busy fs locks for all the OCFS2 volumes on the
1250 node with hanging processes. If no locks are found, then the
1251 problem is not related to O2DLM.
1252
1253 * Dump the corresponding dlm lock for all the busy fs locks.
1254 Note down the owner (master) of all the locks.
1255
1256 * Dump the dlm locks on the master node for each lock.
1257
1258 At this stage, one should note that the hanging node is waiting
1259 to get an AST from the master. The master, on the other hand,
1260 cannot send the AST until the current holder has down converted
1261 that lock, which it will do upon receiving a Blocking AST. How‐
1262 ever, a node can only down convert if all the lock holders have
1263 stopped using that lock. After dumping the dlm lock on the mas‐
1264 ter node, identify the current lock holder and dump both the dlm
1265 and fs locks on that node.
1266
1267 The trick here is to see whether the Blocking AST message has
1268 been relayed to file system. If not, the problem is in the dlm
1269 layer. If it has, then the most common reason would be a lock
1270 holder, the count for which is maintained in the fs lock.
1271
At this stage, printing the list of processes helps.
1273
1274 $ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
1275
1276 Make a note of all D state processes. At least one of them is
1277 responsible for the hang on the first node.
1278
1279 The challenge then is to figure out why those processes are
1280 hanging. Failing that, at least get enough information (like
1281 alt-sysrq t output) for the kernel developers to review. What
1282 to do next depends on where the process is hanging. If it is
1283 waiting for the I/O to complete, the problem could be anywhere
1284 in the I/O subsystem, from the block device layer through the
1285 drivers to the disk array. If the hang concerns a user lock
1286 (flock(2)), the problem could be in the user’s application. A
1287 possible solution could be to kill the holder. If the hang is
1288 due to tight or fragmented memory, free up some memory by
1289 killing non-essential processes.
1290
1291 The thing to note is that the symptom for the problem was on one
1292 node but the cause is on another. The issue can only be resolved
1293 on the node holding the lock. Sometimes, the best solution will
1294 be to reset that node. Once killed, the O2DLM recovery process
1295 will clear all locks owned by the dead node and let the cluster
1296 continue to operate. As harsh as that sounds, at times it is the
1297 only solution. The good news is that, by following the trail,
1298 you now have enough information to file a bug and get the real
1299 issue resolved.
1300
1301
1302 NFS EXPORTING
1303 OCFS2 volumes can be exported as NFS volumes. This support is
1304 limited to NFS version 3, which translates to Linux kernel ver‐
1305 sion 2.4 or later.
1306
1307 If the version of the Linux kernel on the system exporting the
1308 volume is older than 2.6.30, then the NFS clients must mount the
1309 volumes using the nordirplus mount option. This disables the
1310 READDIRPLUS RPC call to work around a bug in NFSD, detailed in
1311 the following link:
1312
1313 http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
1314
1315 Users running NFS version 2 can export the volume after having
1316 disabled subtree checking (export option no_subtree_check). Be
1317 warned, disabling the check has security implications (docu‐
1318 mented in the exports(5) man page) that users must evaluate on
1319 their own.
1320
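        A hypothetical export and client mount illustrating the options
        above (the path, host name and mount point are examples only):

            # /etc/exports on the server; subtree checking disabled as
            # required for NFS version 2 clients (see exports(5) for the
            # security implications):
            /ocfs2/exported  *(rw,no_subtree_check)

            # On a client of a pre-2.6.30 server, disable READDIRPLUS:
            mount -t nfs -o nordirplus server:/ocfs2/exported /mnt/ocfs2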
1321
1322 FILE SYSTEM LIMITS
1323 OCFS2 has no intrinsic limit on the total number of files and
1324 directories in the file system. In general, it is only limited
1325 by the size of the device. But there is one limit imposed by
1326 the current format: it can address at most 2^32 (approximately
1327 four billion) clusters. A file system with a 1MB cluster size
1328 can therefore grow to 4PB, while one with a 4KB cluster size
1329 can address up to 16TB.
1330
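        The limit follows directly from the maximum cluster count; as a
        quick check, using bash arithmetic:

            echo $(( (2 ** 32 * 4096)    / 2 ** 40 ))   # 4KB clusters -> 16 (TB)
            echo $(( (2 ** 32 * 1048576) / 2 ** 50 ))   # 1MB clusters -> 4  (PB)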
1331
1332 SYSTEM OBJECTS
1333 The OCFS2 file system stores its internal meta-data, including
1334 bitmaps, journals, etc., as system files. These are grouped in a
1335 system directory. These files and directories are not accessible
1336 via the file system interface but can be viewed using the de‐
1337 bugfs.ocfs2(8) tool.
1338
1339 To list the system directory (referred to as double-slash), do:
1340
1341 # debugfs.ocfs2 -R "ls -l //" /dev/sde1
1342 66 drwxr-xr-x 10 0 0 3896 19-Jul-2011 13:36 .
1343 66 drwxr-xr-x 10 0 0 3896 19-Jul-2011 13:36 ..
1344 67 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 bad_blocks
1345 68 -rw-r--r-- 1 0 0 1179648 19-Jul-2011 13:36 global_inode_alloc
1346 69 -rw-r--r-- 1 0 0 4096 19-Jul-2011 14:35 slot_map
1347 70 -rw-r--r-- 1 0 0 1048576 19-Jul-2011 13:36 heartbeat
1348 71 -rw-r--r-- 1 0 0 53686960128 19-Jul-2011 13:36 global_bitmap
1349 72 drwxr-xr-x 2 0 0 3896 25-Jul-2011 15:05 orphan_dir:0000
1350 73 drwxr-xr-x 2 0 0 3896 19-Jul-2011 13:36 orphan_dir:0001
1351 74 -rw-r--r-- 1 0 0 8388608 19-Jul-2011 13:36 extent_alloc:0000
1352 75 -rw-r--r-- 1 0 0 8388608 19-Jul-2011 13:36 extent_alloc:0001
1353 76 -rw-r--r-- 1 0 0 121634816 19-Jul-2011 13:36 inode_alloc:0000
1354 77 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 inode_alloc:0001
1355 78 -rw-r--r-- 1 0 0 268435456 19-Jul-2011 13:36 journal:0000
1356 79 -rw-r--r-- 1 0 0 268435456 19-Jul-2011 13:37 journal:0001
1357 80 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 local_alloc:0000
1358 81 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 local_alloc:0001
1359 82 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 truncate_log:0000
1360 83 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 truncate_log:0001
1361
1362 The file names that end with numbers are slot specific and are
1363 referred to as node-local system files. The set of node-local
1364 files used by a node can be determined from the slot map. To
1365 list the slot map, do:
1366
1367 # debugfs.ocfs2 -R "slotmap" /dev/sde1
1368 Slot# Node#
1369 0 32
1370 1 35
1371 2 40
1372 3 31
1373 4 34
1374 5 33
1375
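        The slot map above shows, for example, that the node with
        number 32 occupies slot 0 and therefore uses journal:0000,
        inode_alloc:0000, etc. Individual system files can also be
        examined with debugfs.ocfs2(8); the command below is
        illustrative (the device name is an example):

            # debugfs.ocfs2 -R "stat //journal:0000" /dev/sde1
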
1376 For more information, refer to the OCFS2 support guides avail‐
1377 able in the Documentation section at http://oss.ora‐
1378 cle.com/projects/ocfs2.
1379
1380
1381 HEARTBEAT, QUORUM, AND FENCING
1382 Heartbeat is an essential component in any cluster. It is
1383 charged with accurately designating nodes as dead or alive. A
1384 mistake here could lead to a cluster hang or to data corruption.
1385
1386 o2hb is the disk heartbeat component of o2cb. It periodically
1387 updates a timestamp on disk, indicating to others that this node
1388 is alive. It also reads all the timestamps to identify other
1389 live nodes. Other cluster components, like o2dlm and o2net, use
1390 the o2hb service to get node up and down events.
1391
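        The on-disk heartbeat can be observed with debugfs.ocfs2(8).
        For example (the device name is illustrative), the hb command
        prints the heartbeat blocks being updated by live nodes:

            # debugfs.ocfs2 -R "hb" /dev/sde1
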
1392 The quorum is the group of nodes in a cluster that is allowed to
1393 operate on the shared storage. When there is a failure in the
1394 cluster, nodes may be split into groups that can communicate in
1395 their groups and with the shared storage but not between groups.
1396 o2quo determines which group is allowed to continue and initi‐
1397 ates fencing of the other group(s).
1398
1399 Fencing is the act of forcefully removing a node from a cluster.
1400 A node with OCFS2 mounted will fence itself when it realizes
1401 that it does not have quorum in a degraded cluster. It does this
1402 so that other nodes won’t be stuck trying to access its re‐
1403 sources.
1404
1405 o2cb uses a machine reset to fence. This is the quickest route
1406 for the node to rejoin the cluster.
1407
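        Whether a fence results in a machine reset or a kernel panic is
        controlled by the o2cb fence method. Assuming the default
        cluster name ocfs2 and a kernel that exposes the fence_method
        attribute in configfs, it can be inspected and changed while
        the cluster is online:

            # cat /sys/kernel/config/cluster/ocfs2/fence_method
            reset
            # echo panic > /sys/kernel/config/cluster/ocfs2/fence_method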
1408
1409 PROCESSES
1410
1411
1412 [o2net]
1413 One per node. It is a work-queue thread started when the
1414 cluster is brought on-line and stopped when it is off-
1415 lined. It handles network communication for all mounts.
1416 It gets the list of active nodes from O2HB and sets up a
1417 TCP/IP communication channel with each live node. It
1418 sends regular keep-alive packets to detect any interrup‐
1419 tion on the channels.
1420
1421
1422 [user_dlm]
1423 One per node. It is a work-queue thread started when
1424 dlmfs is loaded and stopped when it is unloaded (dlmfs is
1425 a synthetic file system that allows user space processes
1426 to access the in-kernel dlm).
1427
1428
1429 [ocfs2_wq]
1430 One per node. It is a work-queue thread started when the
1431 OCFS2 module is loaded and stopped when it is unloaded.
1432 It is assigned background file system tasks that may take
1433 cluster locks like flushing the truncate log, orphan di‐
1434 rectory recovery and local alloc recovery. For example,
1435 orphan directory recovery runs in the background so that
1436 it does not affect recovery time.
1437
1438
1439 [o2hb-14C29A7392]
1440 One per heartbeat device. It is a kernel thread started
1441 when the heartbeat region is populated in configfs and
1442 stopped when it is removed. It writes every two seconds
1443 to a block in the heartbeat region, indicating that this
1444 node is alive. It also reads the region to maintain a map
1445 of live nodes. It notifies subscribers like o2net and
1446 o2dlm of any changes in the live node map.
1447
1448
1449 [ocfs2dc]
1450 One per mount. It is a kernel thread started when a vol‐
1451 ume is mounted and stopped when it is unmounted. It down‐
1452 grades locks in response to blocking ASTs (BASTs) re‐
1453 quested by other nodes.
1454
1455
1456 [jbd2/sdf1-97]
1457 One per mount. It is part of JBD2, which OCFS2 uses for
1458 journaling.
1459
1460
1461 [ocfs2cmt]
1462 One per mount. It is a kernel thread started when a vol‐
1463 ume is mounted and stopped when it is unmounted. It works
1464 with kjournald2.
1465
1466
1467 [ocfs2rec]
1468 It is started whenever a node has to be recovered. This
1469 thread performs file system recovery by replaying the
1470 journal of the dead node. It is scheduled to run after
1471 dlm recovery has completed.
1472
1473
1474 [dlm_thread]
1475 One per dlm domain. It is a kernel thread started when a
1476 dlm domain is created and stopped when it is destroyed.
1477 This thread sends ASTs and blocking ASTs in response to
1478 lock level convert requests. It also frees unused lock
1479 resources.
1480
1481
1482 [dlm_reco_thread]
1483 One per dlm domain. It is a kernel thread that handles
1484 dlm recovery when another node dies. If this node is the
1485 dlm recovery master, it re-masters every lock resource
1486 owned by the dead node.
1487
1488
1489 [dlm_wq]
1490 One per dlm domain. It is a work-queue thread that o2dlm
1491 uses to queue blocking tasks.
1492
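        The threads described in this section can be observed with
        ps(1); for example (an illustrative filter, not an exhaustive
        one):

            # ps -e -o pid,comm | grep -E 'o2net|o2hb|user_dlm|ocfs2|dlm_|jbd2'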
1493
1494 FUTURE WORK
1495 File system development is a never-ending cycle. Faster and
1496 larger disks, faster and more numerous processors, larger
1497 caches, etc. keep changing the sweet spot for performance,
1498 forcing developers to rethink long-held beliefs. Add to that
1499 new use cases, which force developers to be innovative in
1500 providing solutions that meld seamlessly with existing semantics.
1501
1502 We are currently looking to add features like transparent com‐
1503 pression, transparent encryption, delayed allocation, multi-de‐
1504 vice support, etc. as well as work on improving performance on
1505 newer generation machines.
1506
1507 If you are interested in contributing, email the development
1508 team at ocfs2-devel@oss.oracle.com.
1509
1510
1511 ACKNOWLEDGEMENTS
1512 The principal developers of the OCFS2 file system, its tools and the
1513 O2CB cluster stack, are Joel Becker, Zach Brown, Mark Fasheh, Jan Kara,
1514 Kurt Hackel, Tao Ma, Sunil Mushran, Tiger Yang and Tristan Ye.
1515
1516 Other developers who have contributed to the file system via bug fixes,
1517 testing, etc. are Wim Coekaerts, Srinivas Eeda, Coly Li, Jeff Mahoney,
1518 Marcos Matsunaga, Goldwyn Rodrigues, Manish Singh and Wengang Wang.
1519
1520 The members of the Linux Cluster community including Andrew Beekhof,
1521 Lars Marowsky-Bree, Fabio Massimo Di Nitto and David Teigland.
1522
1523 The members of the Linux File system community including Christoph
1524 Hellwig and Chris Mason.
1525
1526 The corporations that have contributed resources for this project in‐
1527 cluding Oracle, SUSE Labs, EMC, Emulex, HP, IBM, Intel and Network Ap‐
1528 pliance.
1529
1530
1531 SEE ALSO
1532 debugfs.ocfs2(8) fsck.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8)
1533 mount.ocfs2(8) mounted.ocfs2(8) o2cluster(8) o2image(8) o2info(1)
1534 o2cb(7) o2cb(8) o2cb.sysconfig(5) o2hbmonitor(8) ocfs2.cluster.conf(5)
1535 tunefs.ocfs2(8)
1536
1537
1538 AUTHORS
1539 Oracle Corporation
1540
1541
1542 COPYRIGHT
1543 Copyright © 2004, 2012 Oracle. All rights reserved.
1544
1545
1546
1547Version 1.8.7 January 2012 OCFS2(7)