1OCFS2(7) OCFS2 Manual Pages OCFS2(7)
2
3
4
6 OCFS2 - A Shared-Disk Cluster File System for Linux
7
8
10 OCFS2 is a file system. It allows users to store and retrieve data. The
11 data is stored in files that are organized in a hierarchical directory
12 tree. It is a POSIX compliant file system that supports the standard
13 interfaces and the behavioral semantics as spelled out by that specifi‐
14 cation.
15
16 It is also a shared disk cluster file system, one that allows multiple
17 nodes to access the same disk at the same time. This is where the fun
18 begins as allowing a file system to be accessible on multiple nodes
19 opens a can of worms. What if the nodes are of different architectures?
20 What if a node dies while writing to the file system? What data consis‐
21 tency can one expect if processes on two nodes are reading and writing
22 concurrently? What if one node removes a file while it is still being
23 used on another node?
24
25 Unlike most shared file systems where the answer is fuzzy, the answer
26 in OCFS2 is very well defined. It behaves on all nodes exactly like a
27 local file system. If a file is removed, the directory entry is removed
28 but the inode is kept as long as it is in use across the cluster. When
29 the last user closes the descriptor, the inode is marked for deletion.
30
31 The data consistency model follows the same principle. It works as if
32 the two processes that are running on two different nodes are running
33 on the same node. A read on a node gets the last write irrespective of
34 the IO mode used. The modes can be buffered, direct, asynchronous,
35 splice or memory mapped IOs. It is fully cache coherent.
36
37 Take for example the REFLINK feature that allows a user to create mul‐
38 tiple write-able snapshots of a file. This feature, like all others, is
39 fully cluster-aware. A file being written to on multiple nodes can be
40 safely reflinked on another. The snapshot created is a point-in-time
41 image of the file that includes both the file data and all its at‐
42 tributes (including extended attributes).
43
44 It is a journaling file system. When a node dies, a surviving node
45 transparently replays the journal of the dead node. This ensures that
46 the file system metadata is always consistent. It also defaults to or‐
47 dered data journaling to ensure the file data is flushed to disk before
48 the journal commit, to remove the small possibility of stale data ap‐
49 pearing in files after a crash.
50
51 It is architecture and endian neutral. It allows concurrent mounts on
52 nodes with different processors like x86, x86_64, IA64 and PPC64. It
53 handles little and big endian, 32-bit and 64-bit architectures.
54
55 It is feature rich. It supports indexed directories, metadata check‐
56 sums, extended attributes, POSIX ACLs, quotas, REFLINKs, sparse files,
57 unwritten extents and inline-data.
58
59 It is fully integrated with the mainline Linux kernel. The file system
60 was merged into Linux kernel 2.6.16 in early 2006.
61
62 It is quickly installed. It is available with almost all Linux distri‐
63 butions. The file system is on-disk compatible across all of them.
64
65 It is modular. The file system can be configured to operate with other
66 cluster stacks like Pacemaker and CMAN along with its own stack, O2CB.
67
68 It is easily configured. The O2CB cluster stack configuration involves
69 editing two files, one for cluster layout and the other for cluster
70 timeouts.
71
72 It is very efficient. The file system consumes very little resources.
73 It is used to store virtual machine images in limited memory environ‐
74 ments like Xen and KVM.
75
76 In summary, OCFS2 is an efficient, easily configured, modular, quickly
77 installed, fully integrated and compatible, feature-rich, architecture
78 and endian neutral, cache coherent, ordered data journaling, POSIX-com‐
79 pliant, shared disk cluster file system.
80
81
83 OCFS2 is a general-purpose shared-disk cluster file system for Linux
84 capable of providing both high performance and high availability.
85
86 As it provides local file system semantics, it can be used with almost
87 all applications. Cluster-aware applications can make use of cache-co‐
88 herent parallel I/Os from multiple nodes to scale out applications eas‐
ily. Other applications can make use of the clustering facilities to
fail over running applications in the event of a node failure.
91
92 The notable features of the file system are:
93
94 Tunable Block size
95 The file system supports block sizes of 512, 1K, 2K and 4K
96 bytes. 4KB is almost always recommended. This feature is avail‐
97 able in all releases of the file system.
98
99
100 Tunable Cluster size
101 A cluster size is also referred to as an allocation unit. The
102 file system supports cluster sizes of 4K, 8K, 16K, 32K, 64K,
103 128K, 256K, 512K and 1M bytes. For most use cases, 4KB is recom‐
104 mended. However, a larger value is recommended for volumes host‐
105 ing mostly very large files like database files, virtual machine
106 images, etc. A large cluster size allows the file system to
107 store large files more efficiently. This feature is available in
108 all releases of the file system.
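
              As an illustration (the device name and label below are only
              examples), a volume meant for large virtual machine images
              could be formatted with a 4KB block size and a 1MB cluster
              size:

                  # mkfs.ocfs2 -b 4K -C 1M -L "vmstore" /dev/sdc1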
109
110
111 Endian and Architecture neutral
112 The file system can be mounted concurrently on nodes having dif‐
113 ferent architectures. Like 32-bit, 64-bit, little-endian (x86,
114 x86_64, ia64) and big-endian (ppc64, s390x). This feature is
115 available in all releases of the file system.
116
117
118 Buffered, Direct, Asynchronous, Splice and Memory Mapped I/O modes
119 The file system supports all modes of I/O for maximum flexibil‐
120 ity and performance. It also supports cluster-wide shared
writeable mmap(2). The support for buffered, direct and
asynchronous I/O is available in all releases. The support for
splice I/O was added in Linux kernel 2.6.20 and for shared
writeable mmap(2) in 2.6.23.
125
126
127 Multiple Cluster Stacks
128 The file system includes a flexible framework to allow it to
129 function with userspace cluster stacks like Pacemaker (pcmk) and
130 CMAN (cman), its own in-kernel cluster stack o2cb and no cluster
131 stack.
132
133 The support for o2cb cluster stack is available in all releases.
134
135 The support for no cluster stack, or local mount, was added in
136 Linux kernel 2.6.20.
137
138 The support for userspace cluster stack was added in Linux ker‐
139 nel 2.6.26.
140
141
142 Journaling
143 The file system supports both ordered (default) and writeback
144 data journaling modes to provide file system consistency in the
145 event of power failure or system crash. It uses JBD2 in Linux
146 kernel 2.6.28 and later. It used JBD in earlier kernels.
147
148
149 Extent-based Allocations
150 The file system allocates and tracks space in ranges of clus‐
151 ters. This is unlike block based file systems that have to track
152 each and every block. This feature allows the file system to be
153 very efficient when dealing with both large volumes and large
154 files. This feature is available in all releases of the file
155 system.
156
157
158 Sparse files
159 Sparse files are files with holes. With this feature, the file
160 system delays allocating space until a write is issued to a
161 cluster. This feature was added in Linux kernel 2.6.22 and re‐
162 quires enabling on-disk feature sparse.
163
164
165 Unwritten Extents
166 An unwritten extent is also referred to as user pre-allocation.
167 It allows an application to request a range of clusters to be
168 allocated, but not initialized, within a file. Pre-allocation
169 allows the file system to optimize the data layout with fewer,
170 larger extents. It also provides a performance boost, delaying
171 initialization until the user writes to the clusters. This fea‐
172 ture was added in Linux kernel 2.6.23 and requires enabling on-
173 disk feature unwritten.
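
              As a sketch, assuming a kernel in which OCFS2 wires up the
              fallocate(2) interface, the util-linux fallocate(1) utility can
              request such a keep-size pre-allocation (the file name and
              length are illustrative):

                  $ fallocate -n -o 0 -l 10G dbfile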
174
175
176 Hole Punching
Hole punching allows an application to remove arbitrary allocated
regions within a file, essentially creating holes. This is more
efficient than zeroing the same extents. This feature
180 is especially useful in virtualized environments as it allows a
181 block discard in a guest file system to be converted to a hole
182 punch in the host file system thus allowing users to reduce disk
183 space usage. This feature was added in Linux kernel 2.6.23 and
184 requires enabling on-disk features sparse and unwritten.
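
              As a sketch, again assuming fallocate(2) support in the running
              kernel, a hole can be punched with the util-linux fallocate(1)
              utility (the offset, length and file name are illustrative):

                  $ fallocate --punch-hole --offset 4M --length 64M vmdisk.img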
185
186
187 Inline-data
188 Inline data is also referred to as data-in-inode as it allows
189 storing small files and directories in the inode block. This not
190 only saves space but also has a positive impact on cold-cache
191 directory and file operations. The data is transparently moved
192 out to an extent when it no longer fits inside the inode block.
193 This feature was added in Linux kernel 2.6.24 and requires en‐
194 abling on-disk feature inline-data.
195
196
197 REFLINK
198 REFLINK is also referred to as fast copy. It allows users to
199 atomically (and instantly) copy regular files. In other words,
200 create multiple writeable snapshots of regular files. It is
201 called REFLINK because it looks and feels more like a (hard)
202 link(2) than a traditional snapshot. Like a link, it is a regu‐
203 lar user operation, subject to the security attributes of the
204 inode being reflinked and not to the super user privileges typi‐
205 cally required to create a snapshot. Like a link, it operates
206 within a file system. But unlike a link, it links the inodes at
207 the data extent level allowing each reflinked inode to grow in‐
208 dependently as and when written to. Up to four billion inodes
209 can share a data extent. This feature was added in Linux kernel
210 2.6.32 and requires enabling on-disk feature refcount.
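
              For example, using the reflink(1) utility described later in
              this page (the file names are illustrative):

                  $ reflink dbfile dbfile.snap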
211
212
213 Allocation Reservation
214 File contiguity plays an important role in file system perfor‐
215 mance. When a file is fragmented on disk, reading and writing to
216 the file involves many seeks, leading to lower throughput. Con‐
217 tiguous files, on the other hand, minimize seeks, allowing the
218 disks to perform IO at the maximum rate.
219
220 With allocation reservation, the file system reserves a window
221 in the bitmap for all extending files allowing each to grow as
222 contiguously as possible. As this extra space is not actually
223 allocated, it is available for use by other files if the need
224 arises. This feature was added in Linux kernel 2.6.35 and can
225 be tuned using the mount option resv_level.
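
              For example, to reserve more aggressively on a volume hosting
              many concurrently extending files, one might mount with a
              higher reservation level (the level and mount point below are
              illustrative):

                  # mount -o resv_level=4 /dev/sda1 /ocfs2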
226
227
228 Indexed Directories
229 An indexed directory allows users to perform quick lookups of a
230 file in very large directories. It also results in faster cre‐
231 ates and unlinks and thus provides better overall performance.
232 This feature was added in Linux kernel 2.6.30 and requires en‐
233 abling on-disk feature indexed-dirs.
234
235
236 File Attributes
237 This refers to EXT2-style file attributes, such as immutable,
238 modified using chattr(1) and queried using lsattr(1). This fea‐
239 ture was added in Linux kernel 2.6.19.
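
              For example (the file path is illustrative):

                  # chattr +i /ocfs2/etc/app.conf
                  # lsattr /ocfs2/etc/app.conf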
240
241
242 Extended Attributes
An extended attribute refers to a name:value pair that can be
244 associated with file system objects like regular files, directo‐
245 ries, symbolic links, etc. OCFS2 allows associating an unlimited
246 number of attributes per object. The attribute names can be up
247 to 255 bytes in length, terminated by the first NUL character.
248 While it is not required, printable names (ASCII) are recom‐
249 mended. The attribute values can be up to 64 KB of arbitrary bi‐
250 nary data. These attributes can be modified and listed using
251 standard Linux utilities setfattr(1) and getfattr(1). This fea‐
252 ture was added in Linux kernel 2.6.29 and requires enabling on-
253 disk feature xattr.
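
              For example (the attribute name, value and path are
              illustrative):

                  $ setfattr -n user.location -v "rack42" /ocfs2/data/report.log
                  $ getfattr -n user.location /ocfs2/data/report.log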
254
255
256 Metadata Checksums
257 This feature allows the file system to detect silent corruptions
258 in all metadata blocks like inodes and directories. This feature
259 was added in Linux kernel 2.6.29 and requires enabling on-disk
260 feature metaecc.
261
262
263 POSIX ACLs and Security Attributes
POSIX ACLs allow assigning fine-grained discretionary access
rights for files and directories. This security scheme is a lot
more flexible than the traditional file access permissions, which
impose a strict user-group-other model.
268
269 Security attributes allow the file system to support other secu‐
270 rity regimes like SELinux, SMACK, AppArmor, etc.
271
272 Both these security extensions were added in Linux kernel 2.6.29
and require enabling on-disk feature xattr.
274
275
276 User and Group Quotas
277 This feature allows setting up usage quotas on user and group
278 basis by using the standard utilities like quota(1),
279 setquota(8), quotacheck(8), and quotaon(8). This feature was
280 added in Linux kernel 2.6.29 and requires enabling on-disk fea‐
281 tures usrquota and grpquota.
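
              For example, assuming the volume is mounted with the usrquota
              mount option, a per-user limit could be set with setquota(8)
              (the user name and limits are illustrative):

                  # setquota -u jeff 5120000 6144000 100000 120000 /ocfs2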
282
283
284 Unix File Locking
285 The Unix operating system has historically provided two system
286 calls to lock files. flock(2) or BSD locking and fcntl(2) or
287 POSIX locking. OCFS2 extends both file locks to the cluster.
288 File locks taken on one node interact with those taken on other
289 nodes.
290
291 The support for clustered flock(2) was added in Linux kernel
2.6.26. All flock(2) options are supported, including the kernel's
ability to cancel a lock request when an appropriate kill
294 signal is received by the user. This feature is supported with
295 all cluster-stacks including o2cb.
296
297 The support for clustered fcntl(2) was added in Linux kernel
298 2.6.28. But because it requires group communication to make the
299 locks coherent, it is only supported with userspace cluster
300 stacks, pcmk and cman and not with the default cluster stack
301 o2cb.
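
              As a sketch, the util-linux flock(1) wrapper can be used to
              serialize a job across the cluster; a second invocation on any
              node blocks until the first releases the lock (the lock file
              and command are illustrative):

                  $ flock /ocfs2/locks/batch.lock -c "run-batch-job"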
302
303
304 Comprehensive Tools Support
305 The file system has a comprehensive EXT3-style toolset that
306 tries to use similar parameters for ease-of-use. It includes
307 mkfs.ocfs2(8) (format), tunefs.ocfs2(8) (tune), fsck.ocfs2(8)
308 (check), debugfs.ocfs2(8) (debug), etc.
309
310
311 Online Resize
312 The file system can be dynamically grown using tunefs.ocfs2(8).
313 This feature was added in Linux kernel 2.6.25.
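
              For example, after growing the underlying device, the file
              system could be grown to fill it, assuming the installed
              tunefs.ocfs2(8) supports the volume-size option:

                  # tunefs.ocfs2 -S /dev/sda1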
314
315
317 The O2CB cluster stack has a global heartbeat mode. It allows users to
318 specify heartbeat regions that are consistent across all nodes. The
319 cluster stack also allows online addition and removal of both nodes and
320 heartbeat regions.
321
o2cb(8) is the new cluster configuration utility. It is an easy-to-use
utility that allows users to create the cluster configuration on a node
that is not part of the cluster. It replaces the older utility
o2cb_ctl(8), which has been deprecated.
326
327 ocfs2console(8) has been obsoleted.
328
o2info(1) is a new utility that can be used to provide file system in‐
330 formation. It allows non-privileged users to see the enabled file sys‐
331 tem features, block and cluster sizes, extended file stat, free space
332 fragmentation, etc.
333
o2hbmonitor(8) is an o2hb heartbeat monitor. It is an extremely
lightweight utility that logs messages to the system logger once the
heartbeat delay exceeds the warn threshold. This utility is useful in
identifying volumes encountering I/O delays.
338
339 debugfs.ocfs2(8) has some new commands. net_stats shows the o2net mes‐
sage times between various nodes. This is useful in identifying nodes
that are slowing down the cluster operations. stat_sysdir allows the
342 user to dump the entire system directory that can be used to debug is‐
343 sues. grpextents dumps the complete free space fragmentation in the
344 cluster group allocator.
345
346 mkfs.ocfs2(8) now enables xattr, indexed-dirs, discontig-bg, refcount,
347 extended-slotmap and clusterinfo feature flags by default, in addition
348 to the older defaults, sparse, unwritten and inline-data.
349
350 mount.ocfs2(8) allows users to specify the level of cache coherency be‐
351 tween nodes. By default the file system operates in full coherency
352 mode that also serializes the direct I/Os. While this mode is techni‐
cally correct, it limits the I/O throughput in a clustered database. This
354 mount option allows the user to limit the cache coherency to only the
355 buffered I/Os to allow multiple nodes to do concurrent direct writes to
356 the same file. This feature works with Linux kernel 2.6.37 and later.
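
For example (the device and mount point are illustrative):

    # mount -o coherency=buffered /dev/sdd1 /vmstore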
357
358
The OCFS2 development team goes to great lengths to maintain compati‐
361 bility. It attempts to maintain both on-disk and network protocol com‐
362 patibility across all releases of the file system. It does so even
363 while adding new features that entail on-disk format and network proto‐
364 col changes. To do this successfully, it follows a few rules:
365
366 1. The on-disk format changes are managed by a set of feature flags
367 that can be turned on and off. The file system in kernel detects
368 these features during mount and continues only if it understands
369 all the features. Users encountering this have the option of either
370 disabling that feature or upgrading the file system to a newer re‐
371 lease.
372
373 2. The latest release of ocfs2-tools is compatible with all ver‐
sions of the file system. All utilities detect the features enabled
on disk and continue only if they understand all the features. Users
encountering this have to upgrade the tools to a newer release.
377
378 3. The network protocol version is negotiated by the nodes to en‐
379 sure all nodes understand the active protocol version.
380
381
382 FEATURE FLAGS
383 The feature flags are split into three categories, namely, Com‐
384 pat, Incompat and RO Compat.
385
386 Compat, or compatible, is a feature that the file system does
387 not need to fully understand to safely read/write to the volume.
388 An example of this is the backup-super feature that added the
389 capability to backup the super block in multiple locations in
390 the file system. As the backup super blocks are typically not
391 read nor written to by the file system, an older file system can
392 safely mount a volume with this feature enabled.
393
394 Incompat, or incompatible, is a feature that the file system
395 needs to fully understand to read/write to the volume. Most fea‐
396 tures fall under this category.
397
398 RO Compat, or read-only compatible, is a feature that the file
399 system needs to fully understand to write to the volume. Older
400 software can safely read a volume with this feature enabled. An
401 example of this would be user and group quotas. As quotas are
402 manipulated only when the file system is written to, older soft‐
403 ware can safely mount such volumes in read-only mode.
404
The list of feature flags, the version of the kernel each flag
was added in, the earliest version of the tools that understands
it, etc., is as follows:
408
409
410 ┌─────────────────────┬────────────────┬─────────────────┬───────────┬───────────┐
411 │Feature Flags │ Kernel Version │ Tools Version │ Category │ Hex Value │
412 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
413 │backup-super │ All │ ocfs2-tools 1.2 │ Compat │ 1 │
414 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
415 │strict-journal-super │ All │ All │ Compat │ 2 │
416 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
417 │local │ Linux 2.6.20 │ ocfs2-tools 1.2 │ Incompat │ 8 │
418 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
419 │sparse │ Linux 2.6.22 │ ocfs2-tools 1.4 │ Incompat │ 10 │
420 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
421 │inline-data │ Linux 2.6.24 │ ocfs2-tools 1.4 │ Incompat │ 40 │
422 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
423 │extended-slotmap │ Linux 2.6.27 │ ocfs2-tools 1.6 │ Incompat │ 100 │
424 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
425 │xattr │ Linux 2.6.29 │ ocfs2-tools 1.6 │ Incompat │ 200 │
426 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
427 │indexed-dirs │ Linux 2.6.30 │ ocfs2-tools 1.6 │ Incompat │ 400 │
428 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
429 │metaecc │ Linux 2.6.29 │ ocfs2-tools 1.6 │ Incompat │ 800 │
430 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
431 │refcount │ Linux 2.6.32 │ ocfs2-tools 1.6 │ Incompat │ 1000 │
432 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
433 │discontig-bg │ Linux 2.6.35 │ ocfs2-tools 1.6 │ Incompat │ 2000 │
434 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
435 │clusterinfo │ Linux 2.6.37 │ ocfs2-tools 1.8 │ Incompat │ 4000 │
436 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
437 │unwritten │ Linux 2.6.23 │ ocfs2-tools 1.4 │ RO Compat │ 1 │
438 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
439 │grpquota │ Linux 2.6.29 │ ocfs2-tools 1.6 │ RO Compat │ 2 │
440 ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
441 │usrquota │ Linux 2.6.29 │ ocfs2-tools 1.6 │ RO Compat │ 4 │
442 └─────────────────────┴────────────────┴─────────────────┴───────────┴───────────┘
443
444 To query the features enabled on a volume, do:
445
446 $ o2info --fs-features /dev/sdf1
447 backup-super strict-journal-super sparse extended-slotmap inline-data xattr
448 indexed-dirs refcount discontig-bg clusterinfo unwritten
449
450
451 ENABLING AND DISABLING FEATURES
452
453 The format utility, mkfs.ocfs2(8), allows a user to enable and
disable specific features using the --fs-features option. The fea‐
455 tures are provided as a comma separated list. The enabled fea‐
456 tures are listed as is. The disabled features are prefixed with
457 no. The example below shows the file system being formatted
458 with sparse disabled and inline-data enabled.
459
460 # mkfs.ocfs2 --fs-features=nosparse,inline-data /dev/sda1
461
After formatting, users can toggle features using the tune
utility, tunefs.ocfs2(8). This is an offline operation. The
volume needs to be unmounted across the cluster. The example be‐
465 low shows the sparse feature being enabled and inline-data dis‐
466 abled.
467
468 # tunefs.ocfs2 --fs-features=sparse,noinline-data /dev/sda1
469
470 Care should be taken before enabling and disabling features.
Users planning to use a volume with an older version of the file
system will be better off not enabling newer features, as
disabling them later may not succeed.
474
475 An example would be disabling the sparse feature; this requires
476 filling every hole. The operation can only succeed if the file
477 system has enough free space.
478
479
480 DETECTING FEATURE INCOMPATIBILITY
481
482 Say one tries to mount a volume with an incompatible feature.
483 What happens then? How does one detect the problem? How does one
484 know the name of that incompatible feature?
485
486 To begin with, one should look for error messages in dmesg(8).
487 Mount failures that are due to an incompatible feature will al‐
488 ways result in an error message like the following:
489
490 ERROR: couldn't mount because of unsupported optional features (200).
491
492 Here the file system is unable to mount the volume due to an un‐
supported optional feature. That means the feature in question is an
494 Incompat feature. By referring to the table above, one can then
495 deduce that the user failed to mount a volume with the xattr
496 feature enabled. (The value in the error message is in hexadeci‐
497 mal.)
498
499 Another example of an error message due to incompatibility is as
500 follows:
501
502 ERROR: couldn't mount RDWR because of unsupported optional features (1).
503
504 Here the file system is unable to mount the volume in the RW
mode. That means the feature in question is a RO Compat feature. An‐
506 other look at the table and it becomes apparent that the volume
507 had the unwritten feature enabled.
508
509 In both cases, the user has the option of disabling the feature.
510 In the second case, the user has the choice of mounting the vol‐
511 ume in the RO mode.
512
513
515 The OCFS2 software is split into two components, namely, kernel and
516 tools. The kernel component includes the core file system and the clus‐
517 ter stack, and is packaged along with the kernel. The tools component
518 is packaged as ocfs2-tools and needs to be specifically installed. It
519 provides utilities to format, tune, mount, debug and check the file
520 system.
521
To install ocfs2-tools, refer to the package handling utility of
your distribution.
524
525 The next step is selecting a cluster stack. The options include:
526
527 A. No cluster stack, or local mount.
528
529 B. In-kernel o2cb cluster stack with local or global heartbeat.
530
531 C. Userspace cluster stacks pcmk or cman.
532
533 The file system allows changing cluster stacks easily using
534 tunefs.ocfs2(8). To list the cluster stacks stamped on the OCFS2 vol‐
535 umes, do:
536
537 # mounted.ocfs2 -d
538 Device Stack Cluster F UUID Label
539 /dev/sdb1 o2cb webcluster G DCDA2845177F4D59A0F2DCD8DE507CC3 hbvol1
540 /dev/sdc1 None 23878C320CF3478095D1318CB5C99EED localmount
541 /dev/sdd1 o2cb webcluster G 8AB016CD59FC4327A2CDAB69F08518E3 webvol
542 /dev/sdg1 o2cb webcluster G 77D95EF51C0149D2823674FCC162CF8B logsvol
543 /dev/sdh1 o2cb webcluster G BBA1DBD0F73F449384CE75197D9B7098 scratch
544
545
546 NON-CLUSTERED OR LOCAL MOUNT
547
To format an OCFS2 volume as a non-clustered (local) volume, do:
549
550 # mkfs.ocfs2 -L "mylabel" --fs-features=local /dev/sda1
551
552 To convert an existing clustered volume to a non-clustered vol‐
553 ume, do:
554
555 # tunefs.ocfs2 --fs-features=local /dev/sda1
556
557 Non-clustered volumes do not interact with the cluster stack.
558 One can have both clustered and non-clustered volumes mounted at
559 the same time.
560
561 While formatting a non-clustered volume, users should consider
562 the possibility of later converting that volume to a clustered
563 one. If there is a possibility of that, then the user should add
564 enough node-slots using the -N option. Adding node-slots during
565 format creates journals with large extents. If created later,
566 then the journals will be fragmented which is not good for per‐
567 formance.
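
         For example, the following formats a local volume with four
         node slots so that it can later be converted to a clustered
         volume (the label and device are illustrative):

             # mkfs.ocfs2 -N 4 -L "scratch" --fs-features=local /dev/sdb1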
568
569
570 CLUSTERED MOUNT WITH O2CB CLUSTER STACK
571
Only one of the two heartbeat modes can be active at any one
573 time. Changing heartbeat modes is an offline operation.
574
575 Both heartbeat modes require /etc/ocfs2/cluster.conf and
576 /etc/sysconfig/o2cb to be populated as described in ocfs2.clus‐
577 ter.conf(5) and o2cb.sysconfig(5) respectively. The only differ‐
578 ence in set up between the two modes is that global requires
579 heartbeat devices to be configured whereas local does not.
580
Refer to o2cb(7) for more information.
582
583
584 LOCAL HEARTBEAT
585 This is the default heartbeat mode. The user needs to
586 populate the configuration files as described in
587 ocfs2.cluster.conf(5) and o2cb.sysconfig(5). In this
588 mode, the cluster stack heartbeats on all mounted vol‐
589 umes. Thus, one does not have to specify heartbeat de‐
590 vices in cluster.conf.
591
592 Once configured, the o2cb cluster stack can be onlined
593 and offlined as follows:
594
595 # service o2cb online
596 Setting cluster stack "o2cb": OK
597 Registering O2CB cluster "webcluster": OK
598 Setting O2CB cluster timeouts : OK
599
600 # service o2cb offline
601 Clean userdlm domains: OK
602 Stopping O2CB cluster webcluster: OK
603 Unregistering O2CB cluster "webcluster": OK
604
605
606 GLOBAL HEARTBEAT
607 The configuration is similar to local heartbeat. The one
608 additional step in this mode is that it requires heart‐
609 beat devices to be also configured.
610
611 These heartbeat devices are OCFS2 formatted volumes with
612 global heartbeat enabled on disk. These volumes can later
613 be mounted and used as clustered file systems.
614
The steps to format a volume with global heartbeat enabled
are listed in o2cb(7). Also listed there are the steps to
list all volumes with the cluster stack stamped on disk.
618
619 In this mode, the heartbeat is started when the cluster
620 is onlined and stopped when the cluster is offlined.
621
622 # service o2cb online
623 Setting cluster stack "o2cb": OK
624 Registering O2CB cluster "webcluster": OK
625 Setting O2CB cluster timeouts : OK
626 Starting global heartbeat for cluster "webcluster": OK
627
628 # service o2cb offline
629 Clean userdlm domains: OK
630 Stopping global heartbeat on cluster "webcluster": OK
631 Stopping O2CB cluster webcluster: OK
632 Unregistering O2CB cluster "webcluster": OK
633
634 # service o2cb status
635 Driver for "configfs": Loaded
636 Filesystem "configfs": Mounted
637 Stack glue driver: Loaded
638 Stack plugin "o2cb": Loaded
639 Driver for "ocfs2_dlmfs": Loaded
640 Filesystem "ocfs2_dlmfs": Mounted
641 Checking O2CB cluster "webcluster": Online
642 Heartbeat dead threshold: 31
643 Network idle timeout: 30000
644 Network keepalive delay: 2000
645 Network reconnect delay: 2000
646 Heartbeat mode: Global
647 Checking O2CB heartbeat: Active
648 77D95EF51C0149D2823674FCC162CF8B /dev/sdg1
649 Nodes in O2CB cluster: 92 96
650
651
652
653 CLUSTERED MOUNT WITH USERSPACE CLUSTER STACK
654
655 Configure and online the userspace stack pcmk or cman before us‐
656 ing tunefs.ocfs2(8) to update the cluster stack on disk.
657
658 # tunefs.ocfs2 --update-cluster-stack /dev/sdd1
659 Updating on-disk cluster information to match the running cluster.
660 DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS
661 FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
662 Update the on-disk cluster information? y
663
664 Refer to the cluster stack documentation for information on
665 starting and stopping the cluster stack.
666
667
This section lists the utilities that are used to manage OCFS2
file systems. This includes tools to format, tune, check, mount, and
debug the file system. Each utility has a man page that lists its capabili‐
672 ties in detail.
673
674
675 mkfs.ocfs2(8)
676 This is the file system format utility. All volumes have to be
formatted prior to use. As this utility overwrites the vol‐
678 ume, use it with care. Double check to ensure the volume is not
679 in use on any node in the cluster.
680
681 As a precaution, the utility will abort if the volume is locally
682 mounted. It also detects use across the cluster if used by
683 OCFS2. But these checks are not comprehensive and can be over‐
684 ridden. So use it with care.
685
686 While it is not always required, the cluster should be online.
687
688
689 tunefs.ocfs2(8)
690 This is the file system tune utility. It allows users to change
691 certain on-disk parameters like label, uuid, number of node-
692 slots, volume size and the size of the journals. It also allows
693 turning on and off the file system features as listed above.
694
695 This utility requires the cluster to be online.
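
       For example, relabeling a volume might look as follows (the
       label and device are illustrative):

           # tunefs.ocfs2 -L "backupvol" /dev/sda1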
696
697
698 fsck.ocfs2(8)
699 This is the file system check utility. It detects and fixes on-
700 disk errors. All the check codes and their fixes are listed in
701 fsck.ocfs2.checks(8).
702
703 This utility requires the cluster to be online to ensure the
704 volume is not in use on another node and to prevent the volume
705 from being mounted for the duration of the check.
706
707
708 mount.ocfs2(8)
709 This is the file system mount utility. It is invoked indirectly
710 by the mount(8) utility.
711
712 This utility detects the cluster status and aborts if the clus‐
713 ter is offline or does not match the cluster stamped on disk.
714
715
716 o2cluster(8)
717 This is the file system cluster stack update utility. It allows
718 the users to update the on-disk cluster stack to the one pro‐
719 vided.
720
721 This utility only updates the disk if the utility is reasonably
722 assured that the file system is not in use on any node.
723
724
725 o2info(1)
726 This is the file system information utility. It provides infor‐
727 mation like the features enabled on disk, block size, cluster
728 size, free space fragmentation, etc.
729
730 It can be used by both privileged and non-privileged users.
731 Users having read permission on the device can provide the path
732 to the device. Other users can provide the path to a file on a
733 mounted file system.
734
735
736 debugfs.ocfs2(8)
737 This is the file system debug utility. It allows users to exam‐
738 ine all file system structures including walking directory
739 structures, displaying inodes, backing up files, etc., without
740 mounting the file system.
741
742 This utility requires the user to have read permission on the
743 device.
744
745
746 o2image(8)
747 This is the file system image utility. It allows users to copy
748 the file system metadata skeleton, including the inodes, direc‐
tories, bitmaps, etc. As it excludes data, the resulting image is
far smaller than the file system itself.
751
752 The image file created can be used in debugging on-disk corrup‐
753 tions.
754
755
756 mounted.ocfs2(8)
757 This is the file system detect utility. It detects all OCFS2
volumes in the system and lists their labels, UUIDs and cluster
stacks.
760
761
This section lists the utilities that are used to manage the O2CB cluster
764 stack. Each utility has a man page that lists its capabilities in de‐
765 tail.
766
767 o2cb(8)
768 This is the cluster configuration utility. It allows users to
769 update the cluster configuration by adding and removing nodes
770 and heartbeat regions. This utility is used by the o2cb init
771 script to online and offline the cluster.
772
773 This is a new utility and replaces o2cb_ctl(8) which has been
774 deprecated.
775
776
777 ocfs2_hb_ctl(8)
778 This is the cluster heartbeat utility. It allows users to start
779 and stop local heartbeat. This utility is invoked by
780 mount.ocfs2(8) and should not be invoked directly by the user.
781
782
783 o2hbmonitor(8)
784 This is the disk heartbeat monitor. It tracks the elapsed time
785 since the last heartbeat and logs warnings once that time ex‐
786 ceeds the warn threshold.
787
788
790 This section includes some useful notes that may prove helpful to the
791 user.
792
793 BALANCED CLUSTER
794 A cluster is a computer. This is a fact and not a slogan. What
795 this means is that an errant node in the cluster can affect the
796 behavior of other nodes. If one node is slow, the cluster opera‐
797 tions will slow down on all nodes. To prevent that, it is best
798 to have a balanced cluster. This is a cluster that has equally
799 powered and loaded nodes.
800
801 The standard recommendation for such clusters is to have identi‐
802 cal hardware and software across all the nodes. However, that is
803 not a hard and fast rule. After all, we have taken the effort to
804 ensure that OCFS2 works in a mixed architecture environment.
805
806 If one uses OCFS2 in a mixed architecture environment, try to
807 ensure that the nodes are equally powered and loaded. The use of
808 a load balancer can assist with the latter. Power refers to the
809 number of processors, speed, amount of memory, I/O throughput,
810 network bandwidth, etc. In reality, having equally powered het‐
811 erogeneous nodes is not always practical. In that case, make the
812 lower node numbers more powerful than the higher node numbers.
813 The O2CB cluster stack favors lower node numbers in all of its
814 tiebreaking logic.
815
816 This is not to suggest you should add a single core node in a
817 cluster of quad cores. No amount of node number juggling will
818 help you there.
819
820
821 FILE DELETION
822 In Linux, rm(1) removes the directory entry. It does not neces‐
823 sarily delete the corresponding inode. But by removing the di‐
824 rectory entry, it gives the illusion that the inode has been
825 deleted. This puzzles users when they do not see a correspond‐
826 ing up-tick in the reported free space. The reason is that in‐
827 ode deletion has a few more hurdles to cross.
828
First is the hard link count, which indicates the number of di‐
830 rectory entries pointing to that inode. As long as an inode has
831 one or more directory entries pointing to it, it cannot be
832 deleted. The file system has to wait for the removal of all
833 those directory entries. In other words, wait for that count to
834 drop to zero.
835
836 The second hurdle is the POSIX semantics allowing files to be
837 unlinked even while they are in-use. In OCFS2, that translates
838 to in-use across the cluster. The file system has to wait for
839 all processes across the cluster to stop using the inode.
840
841 Once these conditions are met, the inode is deleted and the
842 freed space is visible after the next sync.
843
844 Now the amount of space freed depends on the allocation. Only
845 space that is actually allocated to that inode is freed. The ex‐
846 ample below shows a sparsely allocated file of size 51TB of
847 which only 2.4GB is actually allocated.
848
849 $ ls -lsh largefile
850 2.4G -rw-r--r-- 1 mark mark 51T Sep 29 15:04 largefile
851
852 Furthermore, for reflinked files, only private extents are
freed. Shared extents are freed only when the last inode accessing
them is deleted. The example below shows a 4GB file that shares
3GB with other reflinked files. Deleting it will increase the
free space by 1GB. However, if it is the only remaining file ac‐
cessing the shared extents, the full 4GB will be freed. (More
858 information on the shared-du(1) utility is provided below.)
859
860 $ shared-du -m -c --shared-size reflinkedfile
861 4000 (3000) reflinkedfile
862
863 The deletion itself is a multi-step process. Once the hard link
864 count falls to zero, the inode is moved to the orphan_dir system
865 directory where it remains until the last process, across the
866 cluster, stops using the inode. Then the file system frees the
867 extents and adds the freed space count to the truncate_log sys‐
868 tem file where it remains until the next sync. The freed space
869 is made visible to the user only after that sync.
870
871
872 DIRECTORY LISTING
873 ls(1) may be a simple command, but it is not cheap. What is ex‐
874 pensive is not the part where it reads the directory listing,
but the second part where it reads all the inodes, also referred
to as an inode stat(2). If the inodes are not in cache, this can
877 entail disk I/O. Now, while a cold cache inode stat(2) is ex‐
878 pensive in all file systems, it is especially so in a clustered
879 file system as it needs to take a cluster lock on each inode.
880
A hot cache stat(2), on the other hand, has been shown to perform
on OCFS2 like it does on EXT3.
883
884 In other words, the second ls(1) will be quicker than the first.
885 However, it is not guaranteed. Say you have a million files in a
886 file system and not enough kernel memory to cache all the in‐
887 odes. In that case, each ls(1) will involve some cold cache
888 stat(2)s.
889
890
891 ALLOCATION RESERVATION
892 Allocation reservation allows multiple concurrently extending
893 files to grow as contiguously as possible. One way to demon‐
894 strate its functioning is to run a script that extends multiple
895 files in a circular order. The script below does that by writing
896 one hundred 4KB chunks to four files, one after another.
897
898 $ for i in $(seq 0 99);
899 > do
900 > for j in $(seq 4);
901 > do
902 > dd if=/dev/zero of=file$j bs=4K count=1 seek=$i;
903 > done;
904 > done;
905
906 When run on a system running Linux kernel 2.6.34 or earlier, we
907 end up with files with 100 extents each. That is full fragmenta‐
908 tion. As the files are being extended one after another, the on-
909 disk allocations are fully interleaved.
910
911 $ filefrag file1 file2 file3 file4
912 file1: 100 extents found
913 file2: 100 extents found
914 file3: 100 extents found
915 file4: 100 extents found
916
917 When run on a system running Linux kernel 2.6.35 or later, we
918 see files with 7 extents each. That is a lot fewer than before.
919 Fewer extents mean more on-disk contiguity and that always leads
920 to better overall performance.
921
922 $ filefrag file1 file2 file3 file4
923 file1: 7 extents found
924 file2: 7 extents found
925 file3: 7 extents found
926 file4: 7 extents found
927
928
929 REFLINK OPERATION
930 This feature allows a user to create a writeable snapshot of a
931 regular file. In this operation, the file system creates a new
932 inode with the same extent pointers as the original inode. Mul‐
933 tiple inodes are thus able to share data extents. This adds a
934 twist in file system administration because none of the existing
file system utilities in Linux expect this behavior. du(1), a
utility used to compute file space usage, simply adds the
blocks allocated to each inode. As it does not know about shared
extents, it overestimates the space used. Say, we have a 5GB
939 file in a volume having 42GB free.
940
941 $ ls -l
942 total 5120000
943 -rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:15 myfile
944
945 $ du -m myfile*
946 5000 myfile
947
948 $ df -h .
949 Filesystem Size Used Avail Use% Mounted on
950 /dev/sdd1 50G 8.2G 42G 17% /ocfs2
951
952 If we were to reflink it 4 times, we would expect the directory
953 listing to report five 5GB files, but the df(1) to report no
954 loss of available space. du(1), on the other hand, would report
955 the disk usage to climb to 25GB.
956
957 $ reflink myfile myfile-ref1
958 $ reflink myfile myfile-ref2
959 $ reflink myfile myfile-ref3
960 $ reflink myfile myfile-ref4
961
962 $ ls -l
963 total 25600000
964 -rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:15 myfile
965 -rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref1
966 -rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref2
967 -rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref3
968 -rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref4
969
970 $ df -h .
971 Filesystem Size Used Avail Use% Mounted on
972 /dev/sdd1 50G 8.2G 42G 17% /ocfs2
973
974 $ du -m myfile*
975 5000 myfile
976 5000 myfile-ref1
977 5000 myfile-ref2
978 5000 myfile-ref3
979 5000 myfile-ref4
980 25000 total
981
982 Enter shared-du(1), a shared extent-aware du. This utility re‐
ports the shared extents per file in parentheses and the overall
984 footprint. As expected, it lists the overall footprint at 5GB.
985 One can view the details of the extents using shared-file‐
986 frag(1). Both these utilities are available at http://oss.ora‐
987 cle.com/~smushran/reflink-tools/. We are currently in the
988 process of pushing the changes to the upstream maintainers of
989 these utilities.
990
991 $ shared-du -m -c --shared-size myfile*
992 5000 (5000) myfile
993 5000 (5000) myfile-ref1
994 5000 (5000) myfile-ref2
995 5000 (5000) myfile-ref3
996 5000 (5000) myfile-ref4
997 25000 total
998 5000 footprint
999
1000 # shared-filefrag -v myfile
1001 Filesystem type is: 7461636f
1002 File size of myfile is 5242880000 (1280000 blocks, blocksize 4096)
1003 ext logical physical expected length flags
1004 0 0 2247937 8448
1005 1 8448 2257921 2256384 30720
1006 2 39168 2290177 2288640 30720
1007 3 69888 2322433 2320896 30720
1008 4 100608 2354689 2353152 30720
1009 7 192768 2451457 2449920 30720
1010 . . .
1011 37 1073408 2032129 2030592 30720 shared
1012 38 1104128 2064385 2062848 30720 shared
1013 39 1134848 2096641 2095104 30720 shared
1014 40 1165568 2128897 2127360 30720 shared
1015 41 1196288 2161153 2159616 30720 shared
1016 42 1227008 2193409 2191872 30720 shared
1017 43 1257728 2225665 2224128 22272 shared,eof
1018 myfile: 44 extents found
1019
1020
1021 DATA COHERENCY
1022 One of the challenges in a shared file system is data coherency
1023 when multiple nodes are writing to the same set of files. NFS,
1024 for example, provides close-to-open data coherency that results
1025 in the data being flushed to the server when the file is closed
1026 on the client. This leaves open a wide window for stale data
1027 being read on another node.
1028
1029 A simple test to check the data coherency of a shared file sys‐
1030 tem involves concurrently appending the same file. Like running
1031 "uname -a >>/dir/file" using a parallel distributed shell like
1032 dsh or pconsole. If coherent, the file will contain the results
1033 from all nodes.
1034
1035 # dsh -R ssh -w node32,node33,node34,node35 "uname -a >> /ocfs2/test"
1036 # cat /ocfs2/test
1037 Linux node32 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
1038 Linux node35 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
1039 Linux node33 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
1040 Linux node34 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
1041
1042 OCFS2 is a fully cache coherent cluster file system.
1043
1044
1045 DISCONTIGUOUS BLOCK GROUP
1046 Most file systems pre-allocate space for inodes during format.
1047 OCFS2 dynamically allocates this space when required.
1048
1049 However, this dynamic allocation has been problematic when the
1050 free space is very fragmented, because the file system required
1051 the inode and extent allocators to grow in contiguous fixed-size
1052 chunks.
1053
1054 The discontiguous block group feature takes care of this problem
1055 by allowing the allocators to grow in smaller, variable-sized
1056 chunks.
1057
1058 This feature was added in Linux kernel 2.6.35 and requires en‐
1059 abling on-disk feature discontig-bg.
1060
1061
1062 BACKUP SUPER BLOCKS
1063 A file system super block stores critical information that is
1064 hard to recreate. In OCFS2, it stores the block size, cluster
1065 size, and the locations of the root and system directories,
1066 among other things. As this block is close to the start of the
1067 disk, it is very susceptible to being overwritten by an errant
1068 write. Say, dd if=file of=/dev/sda1.
1069
1070 Backup super blocks are copies of the super block. These blocks
1071 are dispersed in the volume to minimize the chances of being
1072 overwritten. On the small chance that the original gets cor‐
1073 rupted, the backups are available to scan and fix the corrup‐
1074 tion.
1075
1076 mkfs.ocfs2(8) enables this feature by default. Users can disable
1077 this by specifying --fs-features=nobackup-super during format.
1078
1079 o2info(1) can be used to view whether the feature has been en‐
1080 abled on a device.
1081
1082 # o2info --fs-features /dev/sdb1
1083 backup-super strict-journal-super sparse extended-slotmap inline-data xattr
1084 indexed-dirs refcount discontig-bg clusterinfo unwritten
1085
1086 In OCFS2, the super block is on the third block. The backups are
1087 located at the 1G, 4G, 16G, 64G, 256G and 1T byte offsets. The
1088 actual number of backup blocks depends on the size of the de‐
1089 vice. The super block is not backed up on devices smaller than
1090 1GB.
1091
1092 fsck.ocfs2(8) refers to these six offsets by numbers, 1 to 6.
1093 Users can specify any backup with the -r option to recover the
1094 volume. The example below uses the second backup. If successful,
1095 fsck.ocfs2(8) overwrites the corrupted super block with the
1096 backup.
1097
1098 # fsck.ocfs2 -f -r 2 /dev/sdb1
1099 fsck.ocfs2 1.8.0
1100 [RECOVER_BACKUP_SUPERBLOCK] Recover superblock information from backup block#1048576? <n> y
1101 Checking OCFS2 filesystem in /dev/sdb1:
1102 Label: webhome
1103 UUID: B3E021A2A12B4D0EB08E9E986CDC7947
1104 Number of blocks: 13107196
1105 Block size: 4096
1106 Number of clusters: 13107196
1107 Cluster size: 4096
1108 Number of slots: 8
1109
1110 /dev/sdb1 was run with -f, check forced.
1111 Pass 0a: Checking cluster allocation chains
1112 Pass 0b: Checking inode allocation chains
1113 Pass 0c: Checking extent block allocation chains
1114 Pass 1: Checking inodes and blocks.
1115 Pass 2: Checking directory entries.
1116 Pass 3: Checking directory connectivity.
1117 Pass 4a: checking for orphaned inodes
1118 Pass 4b: Checking inodes link counts.
1119 All passes succeeded.
1120
1121
1122 SYNTHETIC FILE SYSTEMS
1123 The OCFS2 development effort included two synthetic file sys‐
1124 tems, configfs and dlmfs. It also makes use of a third, debugfs.
1125
1126
1127 configfs
1128 configfs has since been accepted as a generic kernel com‐
1129 ponent and is also used by netconsole and fs/dlm. OCFS2
1130 tools use it to communicate the list of nodes in the
1131 cluster, details of the heartbeat device, cluster time‐
1132 outs, and so on to the in-kernel node manager. The o2cb
1133 init script mounts this file system at /sys/kernel/con‐
1134 fig.
1135
1136
1137 dlmfs dlmfs exposes the in-kernel o2dlm to the user-space.
1138 While it was developed primarily for OCFS2 tools, it has
1139 seen usage by others looking to add a cluster locking di‐
1140 mension in their applications. Users interested in doing
1141 the same should look at the libo2dlm library provided by
1142 ocfs2-tools. The o2cb init script mounts this file system
1143 at /dlm.
1144
1145
1146 debugfs
1147 OCFS2 uses debugfs to expose its in-kernel information to
1148 user space. For example, listing the file system cluster
1149 locks, dlm locks, dlm state, o2net state, etc. Users can
1150 access the information by mounting the file system at
1151 /sys/kernel/debug. To automount, add the following to
1152 /etc/fstab: debugfs /sys/kernel/debug debugfs defaults 0
1153 0
1154
1155
1156 DISTRIBUTED LOCK MANAGER
1157 One of the key technologies in a cluster is the lock manager,
1158 which maintains the locking state of all resources across the
1159 cluster. An easy implementation of a lock manager involves des‐
1160 ignating one node to handle everything. In this model, if a node
1161 wanted to acquire a lock, it would send the request to the lock
1162 manager. However, this model has a weakness: lock manager’s
1163 death causes the cluster to seize up.
1164
1165 A better model is one where all nodes manage a subset of the
1166 lock resources. Each node maintains enough information for all
the lock resources it is interested in. In the event of a node
death, the remaining nodes pool their information to reconstruct
the lock state maintained by the dead node. In this
1170 scheme, the locking overhead is distributed amongst all the
1171 nodes. Hence, the term distributed lock manager.
1172
1173 O2DLM is a distributed lock manager. It is based on the specifi‐
cation titled "Programming Locking Applications" written by
1175 Kristin Thomas and is available at the following link.
1176 http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlm‐
1177 book_final.pdf
1178
1179
1180 DLM DEBUGGING
1181 O2DLM has a rich debugging infrastructure that allows it to show
1182 the state of the lock manager, all the lock resources, among
1183 other things. The figure below shows the dlm state of a nine-
1184 node cluster that has just lost three nodes: 12, 32, and 35. It
1185 can be ascertained that node 7, the recovery master, is cur‐
1186 rently recovering node 12 and has received the lock states of
1187 the dead node from all other live nodes.
1188
1189 # cat /sys/kernel/debug/o2dlm/45F81E3B6F2B48CCAAD1AE7945AB2001/dlm_state
1190 Domain: 45F81E3B6F2B48CCAAD1AE7945AB2001 Key: 0x10748e61
1191 Thread Pid: 24542 Node: 7 State: JOINED
1192 Number of Joins: 1 Joining Node: 255
1193 Domain Map: 7 31 33 34 40 50
1194 Live Map: 7 31 33 34 40 50
1195 Lock Resources: 48850 (439879)
1196 MLEs: 0 (1428625)
1197 Blocking: 0 (1066000)
1198 Mastery: 0 (362625)
1199 Migration: 0 (0)
1200 Lists: Dirty=Empty Purge=Empty PendingASTs=Empty PendingBASTs=Empty
1201 Purge Count: 0 Refs: 1
1202 Dead Node: 12
1203 Recovery Pid: 24543 Master: 7 State: ACTIVE
1204 Recovery Map: 12 32 35
1205 Recovery Node State:
1206 7 - DONE
1207 31 - DONE
1208 33 - DONE
1209 34 - DONE
1210 40 - DONE
1211 50 - DONE
1212
1213 The figure below shows the state of a dlm lock resource that is
1214 mastered (owned) by node 25, with 6 locks in the granted queue
1215 and node 26 holding the EX (writelock) lock on that resource.
1216
1217 # debugfs.ocfs2 -R "dlm_locks M000000000000000022d63c00000000" /dev/sda1
1218 Lockres: M000000000000000022d63c00000000 Owner: 25 State: 0x0
1219 Last Used: 0 ASTs Reserved: 0 Inflight: 0 Migration Pending: No
1220 Refs: 8 Locks: 6 On Lists: None
1221 Reference Map: 26 27 28 94 95
1222 Lock-Queue Node Level Conv Cookie Refs AST BAST Pending-Action
1223 Granted 94 NL -1 94:3169409 2 No No None
1224 Granted 28 NL -1 28:3213591 2 No No None
1225 Granted 27 NL -1 27:3216832 2 No No None
1226 Granted 95 NL -1 95:3178429 2 No No None
1227 Granted 25 NL -1 25:3513994 2 No No None
1228 Granted 26 EX -1 26:3512906 2 No No None
1229
1230 The figure below shows a lock from the file system perspective.
1231 Specifically, it shows a lock that is in the process of being
upconverted from NL to EX. Locks in this state are referred to in
the file system as busy locks and can be listed using the
debugfs.ocfs2 command, "fs_locks -B".
1235
1236 # debugfs.ocfs2 -R "fs_locks -B" /dev/sda1
1237 Lockres: M000000000000000000000b9aba12ec Mode: No Lock
1238 Flags: Initialized Attached Busy
1239 RO Holders: 0 EX Holders: 0
1240 Pending Action: Convert Pending Unlock Action: None
1241 Requested Mode: Exclusive Blocking Mode: No Lock
1242 PR > Gets: 0 Fails: 0 Waits Total: 0us Max: 0us Avg: 0ns
1243 EX > Gets: 1 Fails: 0 Waits Total: 544us Max: 544us Avg: 544185ns
1244 Disk Refreshes: 1
1245
1246 With this debugging infrastructure in place, users can debug
1247 hang issues as follows:
1248
1249 * Dump the busy fs locks for all the OCFS2 volumes on the
1250 node with hanging processes. If no locks are found, then the
1251 problem is not related to O2DLM.
1252
1253 * Dump the corresponding dlm lock for all the busy fs locks.
1254 Note down the owner (master) of all the locks.
1255
1256 * Dump the dlm locks on the master node for each lock.
1257
1258 At this stage, one should note that the hanging node is waiting
1259 to get an AST from the master. The master, on the other hand,
1260 cannot send the AST until the current holder has down converted
1261 that lock, which it will do upon receiving a Blocking AST. How‐
1262 ever, a node can only down convert if all the lock holders have
1263 stopped using that lock. After dumping the dlm lock on the mas‐
1264 ter node, identify the current lock holder and dump both the dlm
1265 and fs locks on that node.
1266
1267 The trick here is to see whether the Blocking AST message has
1268 been relayed to file system. If not, the problem is in the dlm
1269 layer. If it has, then the most common reason would be a lock
1270 holder, the count for which is maintained in the fs lock.
1271
At this stage, printing the list of processes helps.
1273
1274 $ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
1275
1276 Make a note of all D state processes. At least one of them is
1277 responsible for the hang on the first node.
1278
1279 The challenge then is to figure out why those processes are
1280 hanging. Failing that, at least get enough information (like
1281 alt-sysrq t output) for the kernel developers to review. What
1282 to do next depends on where the process is hanging. If it is
1283 waiting for the I/O to complete, the problem could be anywhere
1284 in the I/O subsystem, from the block device layer through the
1285 drivers to the disk array. If the hang concerns a user lock
1286 (flock(2)), the problem could be in the user’s application. A
1287 possible solution could be to kill the holder. If the hang is
1288 due to tight or fragmented memory, free up some memory by
1289 killing non-essential processes.
1290
1291 The thing to note is that the symptom for the problem was on one
1292 node but the cause is on another. The issue can only be resolved
1293 on the node holding the lock. Sometimes, the best solution will
1294 be to reset that node. Once killed, the O2DLM recovery process
1295 will clear all locks owned by the dead node and let the cluster
1296 continue to operate. As harsh as that sounds, at times it is the
1297 only solution. The good news is that, by following the trail,
1298 you now have enough information to file a bug and get the real
1299 issue resolved.
1300
1301
1302 NFS EXPORTING
1303 OCFS2 volumes can be exported as NFS volumes. This support is
1304 limited to NFS version 3, which translates to Linux kernel ver‐
1305 sion 2.4 or later.
1306
1307 If the version of the Linux kernel on the system exporting the
1308 volume is older than 2.6.30, then the NFS clients must mount the
1309 volumes using the nordirplus mount option. This disables the
1310 READDIRPLUS RPC call to work around a bug in NFSD, detailed in
1311 the following link:
1312
1313 http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
1314
1315 Users running NFS version 2 can export the volume after having
1316 disabled subtree checking (export option no_subtree_check). Be
1317 warned, disabling the check has security implications (docu‐
1318 mented in the exports(5) man page) that users must evaluate on
1319 their own.
1320
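        A hypothetical export and client mount illustrating the options
        above (the path, host name and mount point are examples only):

            # /etc/exports on the server; subtree checking disabled as
            # required for NFS version 2 clients (see exports(5) for the
            # security implications):
            /ocfs2/exported  *(rw,no_subtree_check)

            # On a client of a pre-2.6.30 server, disable READDIRPLUS:
            mount -t nfs -o nordirplus server:/ocfs2/exported /mnt/ocfs2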
1321
1322 FILE SYSTEM LIMITS
1323 OCFS2 has no intrinsic limit on the total number of files and
1324 directories in the file system. In general, it is only limited
1325 by the size of the device. But there is one limit imposed by
1326 the current format: it can address at most 2^32 (approximately
1327 four billion) clusters. A file system with a 1MB cluster size
1328 can therefore grow to 4PB, while one with a 4KB cluster size
1329 can address up to 16TB.
1330
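        The limit follows directly from the maximum cluster count; as a
        quick check, using bash arithmetic:

            echo $(( (2 ** 32 * 4096)    / 2 ** 40 ))   # 4KB clusters -> 16 (TB)
            echo $(( (2 ** 32 * 1048576) / 2 ** 50 ))   # 1MB clusters -> 4  (PB)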
1331
1332 SYSTEM OBJECTS
1333 The OCFS2 file system stores its internal meta-data, including
1334 bitmaps, journals, etc., as system files. These are grouped in a
1335 system directory. These files and directories are not accessible
1336 via the file system interface but can be viewed using the de‐
1337 bugfs.ocfs2(8) tool.
1338
1339 To list the system directory (referred to as double-slash), do:
1340
1341 # debugfs.ocfs2 -R "ls -l //" /dev/sde1
1342 66 drwxr-xr-x 10 0 0 3896 19-Jul-2011 13:36 .
1343 66 drwxr-xr-x 10 0 0 3896 19-Jul-2011 13:36 ..
1344 67 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 bad_blocks
1345 68 -rw-r--r-- 1 0 0 1179648 19-Jul-2011 13:36 global_inode_alloc
1346 69 -rw-r--r-- 1 0 0 4096 19-Jul-2011 14:35 slot_map
1347 70 -rw-r--r-- 1 0 0 1048576 19-Jul-2011 13:36 heartbeat
1348 71 -rw-r--r-- 1 0 0 53686960128 19-Jul-2011 13:36 global_bitmap
1349 72 drwxr-xr-x 2 0 0 3896 25-Jul-2011 15:05 orphan_dir:0000
1350 73 drwxr-xr-x 2 0 0 3896 19-Jul-2011 13:36 orphan_dir:0001
1351 74 -rw-r--r-- 1 0 0 8388608 19-Jul-2011 13:36 extent_alloc:0000
1352 75 -rw-r--r-- 1 0 0 8388608 19-Jul-2011 13:36 extent_alloc:0001
1353 76 -rw-r--r-- 1 0 0 121634816 19-Jul-2011 13:36 inode_alloc:0000
1354 77 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 inode_alloc:0001
1355 78 -rw-r--r-- 1 0 0 268435456 19-Jul-2011 13:36 journal:0000
1356 79 -rw-r--r-- 1 0 0 268435456 19-Jul-2011 13:37 journal:0001
1357 80 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 local_alloc:0000
1358 81 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 local_alloc:0001
1359 82 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 truncate_log:0000
1360 83 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 truncate_log:0001
1361
1362 The file names that end with numbers are slot specific and are
1363 referred to as node-local system files. The set of node-local
1364 files used by a node can be determined from the slot map. To
1365 list the slot map, do:
1366
1367 # debugfs.ocfs2 -R "slotmap" /dev/sde1
1368 Slot# Node#
1369 0 32
1370 1 35
1371 2 40
1372 3 31
1373 4 34
1374 5 33
1375
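        The slot map above shows, for example, that the node with
        number 32 occupies slot 0 and therefore uses journal:0000,
        inode_alloc:0000, etc. Individual system files can also be
        examined with debugfs.ocfs2(8); the command below is
        illustrative (the device name is an example):

            # debugfs.ocfs2 -R "stat //journal:0000" /dev/sde1
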
1376 For more information, refer to the OCFS2 support guides avail‐
1377 able in the Documentation section at http://oss.ora‐
1378 cle.com/projects/ocfs2.
1379
1380
1381 HEARTBEAT, QUORUM, AND FENCING
1382 Heartbeat is an essential component in any cluster. It is
1383 charged with accurately designating nodes as dead or alive. A
1384 mistake here could lead to a cluster hang or to data corruption.
1385
1386 o2hb is the disk heartbeat component of o2cb. It periodically
1387 updates a timestamp on disk, indicating to others that this node
1388 is alive. It also reads all the timestamps to identify other
1389 live nodes. Other cluster components, like o2dlm and o2net, use
1390 the o2hb service to get node up and down events.
1391
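        The on-disk heartbeat can be observed with debugfs.ocfs2(8).
        For example (the device name is illustrative), the hb command
        prints the heartbeat blocks being updated by live nodes:

            # debugfs.ocfs2 -R "hb" /dev/sde1
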
1392 The quorum is the group of nodes in a cluster that is allowed to
1393 operate on the shared storage. When there is a failure in the
1394 cluster, nodes may be split into groups that can communicate in
1395 their groups and with the shared storage but not between groups.
1396 o2quo determines which group is allowed to continue and initi‐
1397 ates fencing of the other group(s).
1398
1399 Fencing is the act of forcefully removing a node from a cluster.
1400 A node with OCFS2 mounted will fence itself when it realizes
1401 that it does not have quorum in a degraded cluster. It does this
1402 so that other nodes won’t be stuck trying to access its re‐
1403 sources.
1404
1405 o2cb uses a machine reset to fence. This is the quickest route
1406 for the node to rejoin the cluster.
1407
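        Whether a fence results in a machine reset or a kernel panic is
        controlled by the o2cb fence method. Assuming the default
        cluster name ocfs2 and a kernel that exposes the fence_method
        attribute in configfs, it can be inspected and changed while
        the cluster is online:

            # cat /sys/kernel/config/cluster/ocfs2/fence_method
            reset
            # echo panic > /sys/kernel/config/cluster/ocfs2/fence_method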
1408
1409 PROCESSES
1410
1411
1412 [o2net]
1413 One per node. It is a work-queue thread started when the
1414 cluster is brought on-line and stopped when it is off-
1415 lined. It handles network communication for all mounts.
1416 It gets the list of active nodes from O2HB and sets up a
1417 TCP/IP communication channel with each live node. It
1418 sends regular keep-alive packets to detect any interrup‐
1419 tion on the channels.
1420
1421
1422 [user_dlm]
1423 One per node. It is a work-queue thread started when
1424 dlmfs is loaded and stopped when it is unloaded (dlmfs is
1425 a synthetic file system that allows user space processes
1426 to access the in-kernel dlm).
1427
1428
1429 [ocfs2_wq]
1430 One per node. It is a work-queue thread started when the
1431 OCFS2 module is loaded and stopped when it is unloaded.
1432 It is assigned background file system tasks that may take
1433 cluster locks like flushing the truncate log, orphan di‐
1434 rectory recovery and local alloc recovery. For example,
1435 orphan directory recovery runs in the background so that
1436 it does not affect recovery time.
1437
1438
1439 [o2hb-14C29A7392]
1440 One per heartbeat device. It is a kernel thread started
1441 when the heartbeat region is populated in configfs and
1442 stopped when it is removed. It writes every two seconds
1443 to a block in the heartbeat region, indicating that this
1444 node is alive. It also reads the region to maintain a map
1445 of live nodes. It notifies subscribers like o2net and
1446 o2dlm of any changes in the live node map.
1447
1448
1449 [ocfs2dc]
1450 One per mount. It is a kernel thread started when a vol‐
1451 ume is mounted and stopped when it is unmounted. It down‐
1452 grades locks in response to blocking ASTs (BASTs) re‐
1453 quested by other nodes.
1454
1455
1456 [jbd2/sdf1-97]
1457 One per mount. It is part of JBD2, which OCFS2 uses for
1458 journaling.
1459
1460
1461 [ocfs2cmt]
1462 One per mount. It is a kernel thread started when a vol‐
1463 ume is mounted and stopped when it is unmounted. It works
1464 with kjournald2.
1465
1466
1467 [ocfs2rec]
1468 It is started whenever a node has to be recovered. This
1469 thread performs file system recovery by replaying the
1470 journal of the dead node. It is scheduled to run after
1471 dlm recovery has completed.
1472
1473
1474 [dlm_thread]
1475 One per dlm domain. It is a kernel thread started when a
1476 dlm domain is created and stopped when it is destroyed.
1477 This thread sends ASTs and blocking ASTs in response to
1478 lock level convert requests. It also frees unused lock
1479 resources.
1480
1481
1482 [dlm_reco_thread]
1483 One per dlm domain. It is a kernel thread that handles
1484 dlm recovery when another node dies. If this node is the
1485 dlm recovery master, it re-masters every lock resource
1486 owned by the dead node.
1487
1488
1489 [dlm_wq]
1490 One per dlm domain. It is a work-queue thread that o2dlm
1491 uses to queue blocking tasks.
1492
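        The threads described in this section can be observed with
        ps(1); for example (an illustrative filter, not an exhaustive
        one):

            # ps -e -o pid,comm | grep -E 'o2net|o2hb|user_dlm|ocfs2|dlm_|jbd2'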
1493
1494 FUTURE WORK
1495 File system development is a never-ending cycle. Faster and
1496 larger disks, faster and more numerous processors, larger
1497 caches, etc. keep changing the sweet spot for performance,
1498 forcing developers to rethink long-held beliefs. Add to that
1499 new use cases, which force developers to be innovative in
1500 providing solutions that meld seamlessly with existing semantics.
1501
1502 We are currently looking to add features like transparent com‐
1503 pression, transparent encryption, delayed allocation, multi-de‐
1504 vice support, etc. as well as work on improving performance on
1505 newer generation machines.
1506
1507 If you are interested in contributing, email the development
1508 team at ocfs2-devel@oss.oracle.com.
1509
1510
1511 ACKNOWLEDGEMENTS
1512 The principal developers of the OCFS2 file system, its tools and the
1513 O2CB cluster stack, are Joel Becker, Zach Brown, Mark Fasheh, Jan Kara,
1514 Kurt Hackel, Tao Ma, Sunil Mushran, Tiger Yang and Tristan Ye.
1515
1516 Other developers who have contributed to the file system via bug fixes,
1517 testing, etc. are Wim Coekaerts, Srinivas Eeda, Coly Li, Jeff Mahoney,
1518 Marcos Matsunaga, Goldwyn Rodrigues, Manish Singh and Wengang Wang.
1519
1520 The members of the Linux Cluster community including Andrew Beekhof,
1521 Lars Marowsky-Bree, Fabio Massimo Di Nitto and David Teigland.
1522
1523 The members of the Linux File system community including Christoph
1524 Hellwig and Chris Mason.
1525
1526 The corporations that have contributed resources for this project in‐
1527 cluding Oracle, SUSE Labs, EMC, Emulex, HP, IBM, Intel and Network Ap‐
1528 pliance.
1529
1530
1531 SEE ALSO
1532 debugfs.ocfs2(8) fsck.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8)
1533 mount.ocfs2(8) mounted.ocfs2(8) o2cluster(8) o2image(8) o2info(1)
1534 o2cb(7) o2cb(8) o2cb.sysconfig(5) o2hbmonitor(8) ocfs2.cluster.conf(5)
1535 tunefs.ocfs2(8)
1536
1537
1538 AUTHORS
1539 Oracle Corporation
1540
1541
1542 COPYRIGHT
1543 Copyright © 2004, 2012 Oracle. All rights reserved.
1544
1545
1546
1547Version 1.8.7 January 2012 OCFS2(7)