OCFS2(7)                      OCFS2 Manual Pages                      OCFS2(7)

NAME

6       OCFS2 - A Shared-Disk Cluster File System for Linux
7
8

INTRODUCTION

10       OCFS2 is a file system. It allows users to store and retrieve data. The
11       data is stored in files that are organized in a hierarchical  directory
12       tree.  It  is  a POSIX compliant file system that supports the standard
13       interfaces and the behavioral semantics as spelled out by that specifi‐
14       cation.
15
16       It  is also a shared disk cluster file system, one that allows multiple
17       nodes to access the same disk at the same time. This is where  the  fun
18       begins  as  allowing  a  file system to be accessible on multiple nodes
19       opens a can of worms. What if the nodes are of different architectures?
20       What if a node dies while writing to the file system? What data consis‐
21       tency can one expect if processes on two nodes are reading and  writing
22       concurrently?  What  if one node removes a file while it is still being
23       used on another node?
24
25       Unlike most shared file systems where the answer is fuzzy,  the  answer
26       in  OCFS2  is very well defined. It behaves on all nodes exactly like a
27       local file system. If a file is removed, the directory entry is removed
28       but  the inode is kept as long as it is in use across the cluster. When
29       the last user closes the descriptor, the inode is marked for deletion.
30
31       The data consistency model follows the same principle. It works  as  if
32       the  two  processes that are running on two different nodes are running
33       on the same node. A read on a node gets the last write irrespective  of
34       the  IO  mode  used.  The  modes can be buffered, direct, asynchronous,
35       splice or memory mapped IOs. It is fully cache coherent.
36
37       Take for example the REFLINK feature that allows a user to create  mul‐
       tiple writeable snapshots of a file. This feature, like all others, is
39       fully cluster-aware. A file being written to on multiple nodes  can  be
40       safely  reflinked  on  another. The snapshot created is a point-in-time
41       image of the file  that  includes  both  the  file  data  and  all  its
42       attributes (including extended attributes).
43
44       It  is  a  journaling  file  system. When a node dies, a surviving node
45       transparently replays the journal of the dead node. This  ensures  that
46       the  file  system  metadata  is  always consistent. It also defaults to
47       ordered data journaling to ensure the file  data  is  flushed  to  disk
48       before  the  journal  commit,  to remove the small possibility of stale
49       data appearing in files after a crash.
50
51       It is architecture and endian neutral. It allows concurrent  mounts  on
52       nodes  with  different  processors like x86, x86_64, IA64 and PPC64. It
53       handles little and big endian, 32-bit and 64-bit architectures.
54
55       It is feature rich. It supports indexed  directories,  metadata  check‐
56       sums,  extended attributes, POSIX ACLs, quotas, REFLINKs, sparse files,
57       unwritten extents and inline-data.
58
59       It is fully integrated with the mainline Linux kernel. The file  system
60       was merged into Linux kernel 2.6.16 in early 2006.
61
62       It  is quickly installed. It is available with almost all Linux distri‐
63       butions.  The file system is on-disk compatible across all of them.
64
65       It is modular. The file system can be configured to operate with  other
66       cluster stacks like Pacemaker and CMAN along with its own stack, O2CB.
67
68       It  is easily configured. The O2CB cluster stack configuration involves
69       editing two files, one for cluster layout and  the  other  for  cluster
70       timeouts.
71
72       It  is  very efficient. The file system consumes very little resources.
73       It is used to store virtual machine images in limited  memory  environ‐
74       ments like Xen and KVM.
75
76       In  summary, OCFS2 is an efficient, easily configured, modular, quickly
77       installed, fully integrated and compatible, feature-rich,  architecture
78       and endian neutral, cache coherent, ordered data journaling, POSIX-com‐
79       pliant, shared disk cluster file system.
80
81

OVERVIEW

83       OCFS2 is a general-purpose shared-disk cluster file  system  for  Linux
84       capable of providing both high performance and high availability.
85
86       As  it provides local file system semantics, it can be used with almost
87       all applications.  Cluster-aware applications can make  use  of  cache-
88       coherent  parallel  I/Os  from multiple nodes to scale out applications
89       easily. Other applications can make use of the clustering facilities to
       fail over running applications in the event of a node failure.
91
92       The notable features of the file system are:
93
94       Tunable Block size
95              The  file  system  supports  block  sizes  of 512, 1K, 2K and 4K
96              bytes. 4KB is almost always recommended. This feature is  avail‐
97              able in all releases of the file system.
98
99
100       Tunable Cluster size
101              A  cluster  size  is also referred to as an allocation unit. The
102              file system supports cluster sizes of 4K,  8K,  16K,  32K,  64K,
103              128K, 256K, 512K and 1M bytes. For most use cases, 4KB is recom‐
104              mended. However, a larger value is recommended for volumes host‐
105              ing mostly very large files like database files, virtual machine
106              images, etc. A large cluster size  allows  the  file  system  to
107              store large files more efficiently. This feature is available in
108              all releases of the file system.
109
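              For illustration (the device name and label are hypothetical),
              a volume intended for virtual machine images could be formatted
              with a 4KB block size and a 1MB cluster size using the -b and -C
              options of mkfs.ocfs2(8):

              # mkfs.ocfs2 -b 4K -C 1M -L vmstore /dev/sdc1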
110
111       Endian and Architecture neutral
              The file system can be mounted concurrently on nodes having
              different architectures: 32-bit and 64-bit, little-endian (x86,
              x86_64, ia64) and big-endian (ppc64, s390x). This feature is
              available in all releases of the file system.
116
117
118       Buffered, Direct, Asynchronous, Splice and Memory Mapped I/O modes
119              The  file system supports all modes of I/O for maximum flexibil‐
120              ity and  performance.   It  also  supports  cluster-wide  shared
              writeable mmap(2). The support for buffered, direct and
              asynchronous I/O is available in all releases. The support for
              splice I/O was added in Linux kernel 2.6.20 and for shared
              writeable mmap(2) in 2.6.23.
125
126
127       Multiple Cluster Stacks
128              The file system includes a flexible framework  to  allow  it  to
129              function with userspace cluster stacks like Pacemaker (pcmk) and
130              CMAN (cman), its own in-kernel cluster stack o2cb and no cluster
131              stack.
132
133              The support for o2cb cluster stack is available in all releases.
134
135              The  support  for no cluster stack, or local mount, was added in
136              Linux kernel 2.6.20.
137
138              The support for userspace cluster stack was added in Linux  ker‐
139              nel 2.6.26.
140
141
142       Journaling
143              The  file  system  supports both ordered (default) and writeback
144              data journaling modes to provide file system consistency in  the
145              event  of  power failure or system crash.  It uses JBD2 in Linux
146              kernel 2.6.28 and later. It used JBD in earlier kernels.
147
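              The data journaling mode is selected at mount time. A minimal
              sketch (the device and mount point are hypothetical) that mounts
              with writeback data journaling instead of the ordered default:

              # mount -o data=writeback /dev/sda1 /ocfs2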
148
149       Extent-based Allocations
150              The file system allocates and tracks space in  ranges  of  clus‐
151              ters. This is unlike block based file systems that have to track
152              each and every block. This feature allows the file system to  be
153              very  efficient  when  dealing with both large volumes and large
154              files.  This feature is available in all releases  of  the  file
155              system.
156
157
158       Sparse files
159              Sparse  files  are files with holes. With this feature, the file
160              system delays allocating space until a  write  is  issued  to  a
161              cluster.  This  feature  was  added  in  Linux kernel 2.6.22 and
162              requires enabling on-disk feature sparse.
163
164
165       Unwritten Extents
166              An unwritten extent is also referred to as user  pre-allocation.
167              It  allows  an  application to request a range of clusters to be
168              allocated, but not initialized, within a  file.   Pre-allocation
169              allows  the  file system to optimize the data layout with fewer,
170              larger extents. It also provides a performance  boost,  delaying
171              initialization  until the user writes to the clusters. This fea‐
172              ture was added in Linux kernel 2.6.23 and requires enabling  on-
173              disk feature unwritten.
174
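              Pre-allocation is requested through fallocate(2). As a sketch,
              the fallocate(1) utility can pre-allocate space for a
              (hypothetical) file without initializing it, assuming the
              unwritten feature is enabled:

              $ fallocate -l 1G prealloc.img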
175
176       Hole Punching
              Hole punching allows an application to remove arbitrary allocated
              regions within a file, essentially creating holes. This is more
              efficient than zeroing the same extents. This feature is
              especially useful in virtualized environments as it allows a
              block discard in a guest file system to be converted to a hole
              punch in the host file system, thus allowing users to reduce disk
              space usage. This feature was added in Linux kernel 2.6.23 and
              requires enabling on-disk features sparse and unwritten.
185
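              Hole punching is exposed through fallocate(2). As an illustrative
              sketch (the file name, offset and length are hypothetical),
              fallocate(1) can punch a 1MB hole at a 4MB offset:

              $ fallocate --punch-hole --keep-size --offset 4M --length 1M largefile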
186
187       Inline-data
188              Inline data is also referred to as data-in-inode  as  it  allows
189              storing small files and directories in the inode block. This not
190              only saves space but also has a positive  impact  on  cold-cache
191              directory  and  file operations. The data is transparently moved
192              out to an extent when it no longer fits inside the inode  block.
193              This  feature  was  added  in  Linux  kernel 2.6.24 and requires
194              enabling on-disk feature inline-data.
195
196
197       REFLINK
198              REFLINK is also referred to as fast copy.  It  allows  users  to
199              atomically  (and  instantly) copy regular files. In other words,
200              create multiple writeable snapshots of  regular  files.   It  is
201              called  REFLINK  because  it  looks and feels more like a (hard)
202              link(2) than a traditional snapshot. Like a link, it is a  regu‐
203              lar  user  operation,  subject to the security attributes of the
204              inode being reflinked and not to the super user privileges typi‐
205              cally  required  to  create a snapshot. Like a link, it operates
206              within a file system. But unlike a link, it links the inodes  at
207              the  data  extent  level  allowing  each reflinked inode to grow
208              independently as and when written to. Up to four billion  inodes
209              can share a data extent.  This feature was added in Linux kernel
210              2.6.32 and requires enabling on-disk feature refcount.
211
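              A writeable snapshot is created with the reflink utility, as
              shown in the REFLINK OPERATION section below. For example (the
              file names are illustrative):

              $ reflink myfile myfile-snap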
212
213       Allocation Reservation
214              File contiguity plays an important role in file  system  perfor‐
215              mance. When a file is fragmented on disk, reading and writing to
216              the file involves many seeks, leading to lower throughput.  Con‐
217              tiguous  files,  on the other hand, minimize seeks, allowing the
218              disks to perform IO at the maximum rate.
219
220              With allocation reservation, the file system reserves  a  window
221              in  the  bitmap for all extending files allowing each to grow as
222              contiguously as possible. As this extra space  is  not  actually
223              allocated,  it  is  available for use by other files if the need
224              arises.  This feature was added in Linux kernel 2.6.35  and  can
225              be tuned using the mount option resv_level.
226
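              As a sketch, the reservation window can be tuned at mount time
              with the resv_level mount option (the level used below is
              illustrative; see mount.ocfs2(8) for the valid range):

              # mount -o resv_level=4 /dev/sda1 /ocfs2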
227
228       Indexed Directories
229              An  indexed directory allows users to perform quick lookups of a
230              file in very large directories. It also results in  faster  cre‐
231              ates  and  unlinks and thus provides better overall performance.
232              This feature was added  in  Linux  kernel  2.6.30  and  requires
233              enabling on-disk feature indexed-dirs.
234
235
236       File Attributes
237              This  refers  to  EXT2-style file attributes, such as immutable,
238              modified using chattr(1) and queried using lsattr(1). This  fea‐
239              ture was added in Linux kernel 2.6.19.
240
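              For example, a file can be marked immutable and its attributes
              queried as follows (the file name is arbitrary):

              # chattr +i /ocfs2/fstab.master
              # lsattr /ocfs2/fstab.master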
241
242       Extended Attributes
              An extended attribute refers to a name:value pair that can be
244              associated with file system objects like regular files, directo‐
245              ries, symbolic links, etc. OCFS2 allows associating an unlimited
246              number of attributes per object. The attribute names can  be  up
247              to  255  bytes in length, terminated by the first NUL character.
248              While it is not required, printable  names  (ASCII)  are  recom‐
249              mended.  The  attribute  values  can be up to 64 KB of arbitrary
250              binary data. These attributes can be modified and  listed  using
251              standard  Linux utilities setfattr(1) and getfattr(1). This fea‐
252              ture was added in Linux kernel 2.6.29 and requires enabling  on-
253              disk feature xattr.
254
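              For example, an attribute in the user namespace can be set and
              read back with the standard utilities (the attribute name and
              value are arbitrary):

              $ setfattr -n user.origin -v "build42" myfile
              $ getfattr -d myfile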
255
256       Metadata Checksums
257              This feature allows the file system to detect silent corruptions
258              in all metadata blocks like inodes and directories. This feature
259              was  added  in Linux kernel 2.6.29 and requires enabling on-disk
260              feature metaecc.
261
262
263       POSIX ACLs and Security Attributes
              POSIX ACLs allow assigning fine-grained discretionary access
              rights for files and directories. This security scheme is a lot
              more flexible than the traditional file access permissions,
              which impose a strict user-group-other model.
268
269              Security attributes allow the file system to support other secu‐
270              rity regimes like SELinux, SMACK, AppArmor, etc.
271
              Both these security extensions were added in Linux kernel 2.6.29
              and require enabling on-disk feature xattr.
274
275
276       User and Group Quotas
              This feature allows setting up usage quotas on a per-user and
              per-group basis using the standard utilities quota(1),
              setquota(8), quotacheck(8), and quotaon(8). This feature was
280              added in Linux kernel 2.6.29 and requires enabling on-disk  fea‐
281              tures usrquota and grpquota.
282
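              As a sketch, quotas are activated with the corresponding mount
              options and limits are then assigned with setquota(8) (the user
              name and limits below are hypothetical):

              # mount -o usrquota,grpquota /dev/sda1 /ocfs2
              # setquota -u jeff 1048576 2097152 0 0 /ocfs2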
283
284       Unix File Locking
              The Unix operating system has historically provided two system
              calls to lock files: flock(2), or BSD locking, and fcntl(2), or
              POSIX locking. OCFS2 extends both file locks to the cluster.
288              File locks taken on one node interact with those taken on  other
289              nodes.
290
291              The  support  for  clustered  flock(2) was added in Linux kernel
              2.6.26. All flock(2) options are supported, including the
              kernel's ability to cancel a lock request when an appropriate kill
294              signal is received by the user. This feature is  supported  with
295              all cluster-stacks including o2cb.
296
297              The  support  for  clustered  fcntl(2) was added in Linux kernel
298              2.6.28.  But because it requires group communication to make the
299              locks  coherent,  it  is  only  supported with userspace cluster
              stacks, pcmk and cman, and not with the default cluster stack
301              o2cb.
302
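              Because flock(2) is cluster-aware, a lock taken with the
              flock(1) utility on one node excludes holders on all other
              nodes. An illustrative sketch (the lock file and command are
              hypothetical):

              $ flock /ocfs2/app.lock -c "run-batch-job"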
303
304       Comprehensive Tools Support
305              The  file  system  has  a  comprehensive EXT3-style toolset that
306              tries to use similar parameters  for  ease-of-use.  It  includes
307              mkfs.ocfs2(8)  (format),  tunefs.ocfs2(8)  (tune), fsck.ocfs2(8)
308              (check), debugfs.ocfs2(8) (debug), etc.
309
310
311       Online Resize
312              The file system can be dynamically grown using  tunefs.ocfs2(8).
313              This feature was added in Linux kernel 2.6.25.
314
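              A minimal sketch, assuming the underlying device has already
              been grown and using the -S (volume resize) option documented in
              tunefs.ocfs2(8):

              # tunefs.ocfs2 -S /dev/sda1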
315

RECENT CHANGES

317       The  O2CB cluster stack has a global heartbeat mode. It allows users to
318       specify heartbeat regions that are consistent  across  all  nodes.  The
319       cluster stack also allows online addition and removal of both nodes and
320       heartbeat regions.
321
322       o2cb(8) is the new cluster configuration utility. It is an easy to  use
323       utility that allows users to create the cluster configuration on a node
324       that is not  part  of  the  cluster.  It  replaces  the  older  utility
       o2cb_ctl(8), which has been deprecated.
326
327       ocfs2console(8) has been obsoleted.
328
329       o2info(8)  is  a  new  utility  that can be used to provide file system
330       information.  It allows non-privileged users to see  the  enabled  file
331       system  features,  block  and  cluster  sizes, extended file stat, free
332       space fragmentation, etc.
333
       o2hbmonitor(8) is an o2hb heartbeat monitor. It is an extremely
       lightweight utility that logs messages to the system logger once the heart‐
336       beat delay exceeds the warn threshold. This utility is useful in  iden‐
337       tifying volumes encountering I/O delays.
338
339       debugfs.ocfs2(8)  has some new commands. net_stats shows the o2net mes‐
       sage times between various nodes. This is useful in identifying nodes
       that are slowing down the cluster operations. stat_sysdir allows the
342       user to dump the entire system directory that  can  be  used  to  debug
343       issues.  grpextents  dumps the complete free space fragmentation in the
344       cluster group allocator.
345
346       mkfs.ocfs2(8) now enables xattr, indexed-dirs, discontig-bg,  refcount,
347       extended-slotmap  and clusterinfo feature flags by default, in addition
348       to the older defaults, sparse, unwritten and inline-data.
349
350       mount.ocfs2(8) allows users to specify the  level  of  cache  coherency
351       between  nodes.   By default the file system operates in full coherency
352       mode that also serializes the direct I/Os. While this mode  is  techni‐
353       cally  correct, it limits the I/O thruput in a clustered database. This
354       mount option allows the user to limit the cache coherency to  only  the
355       buffered I/Os to allow multiple nodes to do concurrent direct writes to
356       the same file. This feature works with Linux kernel 2.6.37 and later.
357
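       A minimal sketch of such a mount, assuming the cluster stack is online
       (the device and mount point are hypothetical):

       # mount -o coherency=buffered /dev/sdd1 /ocfs2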
358

COMPATIBILITY

       The OCFS2 development team goes to great lengths to maintain compati‐
361       bility.  It attempts to maintain both on-disk and network protocol com‐
362       patibility across all releases of the file  system.  It  does  so  even
363       while adding new features that entail on-disk format and network proto‐
364       col changes. To do this successfully, it follows a few rules:
365
366           1. The on-disk format changes are managed by a set of feature flags
367           that  can  be  turned on and off. The file system in kernel detects
368           these features during mount and continues only  if  it  understands
369           all the features. Users encountering this have the option of either
370           disabling that feature or upgrading the  file  system  to  a  newer
371           release.
372
373           2.  The  latest  release of ocfs2-tools is compatible with all ver‐
374           sions of the file system. All utilities detect the features enabled
           on disk and continue only if they understand all the features. Users
376           encountering this have to upgrade the tools to a newer release.
377
378           3. The network protocol version  is  negotiated  by  the  nodes  to
379           ensure all nodes understand the active protocol version.
380
381
382       FEATURE FLAGS
383              The  feature flags are split into three categories, namely, Com‐
384              pat, Incompat and RO Compat.
385
386              Compat, or compatible, is a feature that the  file  system  does
387              not need to fully understand to safely read/write to the volume.
388              An example of this is the backup-super feature  that  added  the
              capability to back up the super block in multiple locations in
390              the file system. As the backup super blocks  are  typically  not
391              read nor written to by the file system, an older file system can
392              safely mount a volume with this feature enabled.
393
394              Incompat, or incompatible, is a feature  that  the  file  system
395              needs to fully understand to read/write to the volume. Most fea‐
396              tures fall under this category.
397
398              RO Compat, or read-only compatible, is a feature that  the  file
399              system  needs  to fully understand to write to the volume. Older
400              software can safely read a volume with this feature enabled.  An
401              example  of  this  would be user and group quotas. As quotas are
402              manipulated only when the file system is written to, older soft‐
403              ware can safely mount such volumes in read-only mode.
404
405              The  list  of  feature  flags,  the version of the kernel it was
406              added in, the earliest version of the tools that understands it,
407              etc., is as follows:
408
409
      ┌─────────────────────┬────────────────┬─────────────────┬───────────┬───────────┐
      │Feature Flags        │ Kernel Version │ Tools Version   │ Category  │ Hex Value │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │backup-super         │      All       │ ocfs2-tools 1.2 │  Compat   │     1     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │strict-journal-super │      All       │       All       │  Compat   │     2     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │local                │  Linux 2.6.20  │ ocfs2-tools 1.2 │ Incompat  │     8     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │sparse               │  Linux 2.6.22  │ ocfs2-tools 1.4 │ Incompat  │    10     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │inline-data          │  Linux 2.6.24  │ ocfs2-tools 1.4 │ Incompat  │    40     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │extended-slotmap     │  Linux 2.6.27  │ ocfs2-tools 1.6 │ Incompat  │    100    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │xattr                │  Linux 2.6.29  │ ocfs2-tools 1.6 │ Incompat  │    200    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │indexed-dirs         │  Linux 2.6.30  │ ocfs2-tools 1.6 │ Incompat  │    400    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │metaecc              │  Linux 2.6.29  │ ocfs2-tools 1.6 │ Incompat  │    800    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │refcount             │  Linux 2.6.32  │ ocfs2-tools 1.6 │ Incompat  │   1000    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │discontig-bg         │  Linux 2.6.35  │ ocfs2-tools 1.6 │ Incompat  │   2000    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │clusterinfo          │  Linux 2.6.37  │ ocfs2-tools 1.8 │ Incompat  │   4000    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │unwritten            │  Linux 2.6.23  │ ocfs2-tools 1.4 │ RO Compat │     1     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │grpquota             │  Linux 2.6.29  │ ocfs2-tools 1.6 │ RO Compat │     2     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │usrquota             │  Linux 2.6.29  │ ocfs2-tools 1.6 │ RO Compat │     4     │
      └─────────────────────┴────────────────┴─────────────────┴───────────┴───────────┘
443
444              To query the features enabled on a volume, do:
445
446              $ o2info --fs-features /dev/sdf1
447              backup-super strict-journal-super sparse extended-slotmap inline-data xattr
448              indexed-dirs refcount discontig-bg clusterinfo unwritten
449
450
451       ENABLING AND DISABLING FEATURES
452
453              The  format  utility, mkfs.ocfs2(8), allows a user to enable and
454              disable specific features using the fs-features option. The fea‐
455              tures  are  provided as a comma separated list. The enabled fea‐
456              tures are listed as is. The disabled features are prefixed  with
457              no.   The  example  below  shows the file system being formatted
458              with sparse disabled and inline-data enabled.
459
460              # mkfs.ocfs2 --fs-features=nosparse,inline-data /dev/sda1
461
462              After formatting, the users can toggle features using  the  tune
463              utility,  tunefs.ocfs2(8).   This  is  an offline operation. The
              volume needs to be unmounted across the cluster. The example
465              below  shows  the  sparse  feature being enabled and inline-data
466              disabled.
467
468              # tunefs.ocfs2 --fs-features=sparse,noinline-data /dev/sda1
469
              Care should be taken before enabling and disabling features.
              Users planning to use a volume with an older version of the file
              system are better off not enabling newer features, as disabling
              them later may not succeed.
474
475              An  example would be disabling the sparse feature; this requires
476              filling every hole.  The operation can only succeed if the  file
477              system has enough free space.
478
479
480       DETECTING FEATURE INCOMPATIBILITY
481
482              Say  one  tries  to mount a volume with an incompatible feature.
483              What happens then? How does one detect the problem? How does one
484              know the name of that incompatible feature?
485
486              To  begin  with, one should look for error messages in dmesg(8).
487              Mount failures that are due  to  an  incompatible  feature  will
488              always result in an error message like the following:
489
490              ERROR: couldn't mount because of unsupported optional features (200).
491
492              Here  the  file  system  is unable to mount the volume due to an
              unsupported optional feature. That means the feature is an
494              Incompat  feature. By referring to the table above, one can then
495              deduce that the user failed to mount a  volume  with  the  xattr
496              feature enabled. (The value in the error message is in hexadeci‐
497              mal.)
498
499              Another example of an error message due to incompatibility is as
500              follows:
501
502              ERROR: couldn't mount RDWR because of unsupported optional features (1).
503
504              Here  the  file  system  is unable to mount the volume in the RW
              mode. That means the feature is a RO Compat feature.
506              Another  look at the table and it becomes apparent that the vol‐
507              ume had the unwritten feature enabled.
508
509              In both cases, the user has the option of disabling the feature.
510              In the second case, the user has the choice of mounting the vol‐
511              ume in the RO mode.
512
513

GETTING STARTED

515       The OCFS2 software is split into two  components,  namely,  kernel  and
516       tools. The kernel component includes the core file system and the clus‐
517       ter stack, and is packaged along with the kernel. The  tools  component
518       is  packaged  as ocfs2-tools and needs to be specifically installed. It
519       provides utilities to format, tune, mount, debug  and  check  the  file
520       system.
521
       To install ocfs2-tools, refer to the package management utility of
       your distribution.
524
525       The next step is selecting a cluster stack. The options include:
526
527           A. No cluster stack, or local mount.
528
529           B. In-kernel o2cb cluster stack with local or global heartbeat.
530
531           C. Userspace cluster stacks pcmk or cman.
532
533       The  file  system  allows  changing   cluster   stacks   easily   using
534       tunefs.ocfs2(8).   To list the cluster stacks stamped on the OCFS2 vol‐
535       umes, do:
536
537       # mounted.ocfs2 -d
538       Device     Stack  Cluster     F  UUID                              Label
539       /dev/sdb1  o2cb   webcluster  G  DCDA2845177F4D59A0F2DCD8DE507CC3  hbvol1
540       /dev/sdc1  None                  23878C320CF3478095D1318CB5C99EED  localmount
541       /dev/sdd1  o2cb   webcluster  G  8AB016CD59FC4327A2CDAB69F08518E3  webvol
542       /dev/sdg1  o2cb   webcluster  G  77D95EF51C0149D2823674FCC162CF8B  logsvol
543       /dev/sdh1  o2cb   webcluster  G  BBA1DBD0F73F449384CE75197D9B7098  scratch
544
545
546       NON-CLUSTERED OR LOCAL MOUNT
547
              To format an OCFS2 volume as a non-clustered (local) volume, do:
549
550              # mkfs.ocfs2 -L "mylabel" --fs-features=local /dev/sda1
551
552              To convert an existing clustered volume to a non-clustered  vol‐
553              ume, do:
554
555              # tunefs.ocfs2 --fs-features=local /dev/sda1
556
557              Non-clustered  volumes  do  not interact with the cluster stack.
558              One can have both clustered and non-clustered volumes mounted at
559              the same time.
560
561              While  formatting  a non-clustered volume, users should consider
562              the possibility of later converting that volume to  a  clustered
563              one. If there is a possibility of that, then the user should add
564              enough node-slots using the -N option. Adding node-slots  during
565              format  creates  journals  with large extents. If created later,
              then the journals will be fragmented, which is not good for per‐
567              formance.
568
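              For example (the slot count, label and device below are
              illustrative), a local volume can be formatted with four
              node-slots to ease a later conversion to a clustered volume:

              # mkfs.ocfs2 -N 4 -L scratch --fs-features=local /dev/sda1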
569
570       CLUSTERED MOUNT WITH O2CB CLUSTER STACK
571
              Only one of the two heartbeat modes can be active at any one
573              time. Changing heartbeat modes is an offline operation.
574
575              Both  heartbeat  modes   require   /etc/ocfs2/cluster.conf   and
576              /etc/sysconfig/o2cb  to be populated as described in ocfs2.clus‐
577              ter.conf(5) and o2cb.sysconfig(5) respectively. The only differ‐
              ence in setup between the two modes is that global requires
579              heartbeat devices to be configured whereas local does not.
580
              Refer to o2cb(7) for more information.
582
583
584              LOCAL HEARTBEAT
585                     This is the default heartbeat mode.  The  user  needs  to
586                     populate   the   configuration   files  as  described  in
587                     ocfs2.cluster.conf(5)  and  o2cb.sysconfig(5).  In   this
588                     mode,  the  cluster  stack heartbeats on all mounted vol‐
589                     umes. Thus,  one  does  not  have  to  specify  heartbeat
590                     devices in cluster.conf.
591
592                     Once  configured,  the  o2cb cluster stack can be onlined
593                     and offlined as follows:
594
595                     # service o2cb online
596                     Setting cluster stack "o2cb": OK
597                     Registering O2CB cluster "webcluster": OK
598                     Setting O2CB cluster timeouts : OK
599
600                     # service o2cb offline
601                     Clean userdlm domains: OK
602                     Stopping O2CB cluster webcluster: OK
603                     Unregistering O2CB cluster "webcluster": OK
604
605
606              GLOBAL HEARTBEAT
                     The configuration is similar to local heartbeat. The one
                     additional step in this mode is that heartbeat devices
                     must also be configured.
610
611                     These heartbeat devices are OCFS2 formatted volumes  with
612                     global heartbeat enabled on disk. These volumes can later
613                     be mounted and used as clustered file systems.
614
                     The steps to format a volume with global heartbeat
                     enabled are listed in o2cb(7), as is the procedure for
                     listing all volumes with the cluster stack stamped on disk.
618
619                     In this mode, the heartbeat is started when  the  cluster
620                     is onlined and stopped when the cluster is offlined.
621
622                     # service o2cb online
623                     Setting cluster stack "o2cb": OK
624                     Registering O2CB cluster "webcluster": OK
625                     Setting O2CB cluster timeouts : OK
626                     Starting global heartbeat for cluster "webcluster": OK
627
628                     # service o2cb offline
629                     Clean userdlm domains: OK
630                     Stopping global heartbeat on cluster "webcluster": OK
631                     Stopping O2CB cluster webcluster: OK
632                     Unregistering O2CB cluster "webcluster": OK
633
634                     # service o2cb status
635                     Driver for "configfs": Loaded
636                     Filesystem "configfs": Mounted
637                     Stack glue driver: Loaded
638                     Stack plugin "o2cb": Loaded
639                     Driver for "ocfs2_dlmfs": Loaded
640                     Filesystem "ocfs2_dlmfs": Mounted
641                     Checking O2CB cluster "webcluster": Online
642                       Heartbeat dead threshold: 31
643                       Network idle timeout: 30000
644                       Network keepalive delay: 2000
645                       Network reconnect delay: 2000
646                       Heartbeat mode: Global
647                     Checking O2CB heartbeat: Active
648                       77D95EF51C0149D2823674FCC162CF8B /dev/sdg1
649                     Nodes in O2CB cluster: 92 96
650
651
652
653       CLUSTERED MOUNT WITH USERSPACE CLUSTER STACK
654
655              Configure  and  online  the  userspace stack pcmk or cman before
656              using tunefs.ocfs2(8) to update the cluster stack on disk.
657
658              # tunefs.ocfs2 --update-cluster-stack /dev/sdd1
659              Updating on-disk cluster information to match the running cluster.
660              DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS
661              FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
662              Update the on-disk cluster information? y
663
664              Refer to the cluster  stack  documentation  for  information  on
665              starting and stopping the cluster stack.
666
667

FILE SYSTEM UTILITIES

       This section lists the utilities that are used to manage OCFS2
       file systems. This includes tools to format, tune, check, mount and
       debug the file system. Each utility has a man page that lists its
       capabilities in detail.
673
674
675       mkfs.ocfs2(8)
676              This is the file system format utility. All volumes have  to  be
              formatted prior to use. As this utility overwrites the vol‐
678              ume, use it with care. Double check to ensure the volume is  not
679              in use on any node in the cluster.
680
681              As a precaution, the utility will abort if the volume is locally
              mounted. It also detects whether the volume is in use across the
              cluster by OCFS2. But these checks are not comprehensive and can
              be overridden, so use it with care.
685
686              While it is not always required, the cluster should be online.
687
688
689       tunefs.ocfs2(8)
690              This is the file system tune utility. It allows users to  change
691              certain  on-disk  parameters  like  label, uuid, number of node-
692              slots, volume size and the size of the journals. It also  allows
693              turning on and off the file system features as listed above.
694
695              This utility requires the cluster to be online.
696
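              For example, a volume can be relabeled with the -L option (the
              label and device are hypothetical):

              # tunefs.ocfs2 -L webhome /dev/sdb1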
697
698       fsck.ocfs2(8)
699              This  is the file system check utility. It detects and fixes on-
700              disk errors. All the check codes and their fixes are  listed  in
701              fsck.ocfs2.checks(8).
702
703              This  utility  requires  the  cluster to be online to ensure the
704              volume is not in use on another node and to prevent  the  volume
705              from being mounted for the duration of the check.
706
707
708       mount.ocfs2(8)
709              This  is the file system mount utility. It is invoked indirectly
710              by the mount(8) utility.
711
712              This utility detects the cluster status and aborts if the  clus‐
713              ter is offline or does not match the cluster stamped on disk.
714
715
716       o2cluster(8)
717              This  is the file system cluster stack update utility. It allows
718              the users to update the on-disk cluster stack to  the  one  pro‐
719              vided.
720
721              This  utility only updates the disk if the utility is reasonably
722              assured that the file system is not in use on any node.
723
724
725       o2info(1)
726              This is the file system information utility. It provides  infor‐
727              mation  like  the  features enabled on disk, block size, cluster
728              size, free space fragmentation, etc.
729
730              It can be used by  both  privileged  and  non-privileged  users.
731              Users  having read permission on the device can provide the path
732              to the device. Other users can provide the path to a file  on  a
733              mounted file system.
734
735
736       debugfs.ocfs2(8)
737              This  is the file system debug utility. It allows users to exam‐
738              ine all  file  system  structures  including  walking  directory
739              structures,  displaying  inodes, backing up files, etc., without
740              mounting the file system.
741
742              This utility requires the user to have read  permission  on  the
743              device.
744
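              As a sketch, a single command can be run non-interactively with
              the -R option (the device here is hypothetical):

              # debugfs.ocfs2 -R "ls /" /dev/sdb1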
745
746       o2image(8)
747              This  is  the file system image utility. It allows users to copy
748              the file system metadata skeleton, including the inodes,  direc‐
              tories, bitmaps, etc. As it excludes data, the resulting image is
              much smaller than the file system itself.
751
752              The image file created can be used in debugging on-disk  corrup‐
753              tions.
754
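              For example, the metadata of a (hypothetical) device can be
              captured into an image file with:

              # o2image /dev/sdb1 /tmp/sdb1.o2image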
755
756       mounted.ocfs2(8)
757              This  is  the  file  system detect utility. It detects all OCFS2
              volumes in the system and lists their label, UUID and cluster
              stack.
760
761

O2CB CLUSTER STACK UTILITIES

       This section lists the utilities that are used to manage the O2CB
       cluster stack. Each utility has a man page that lists its capabilities in
765       detail.
766
767       o2cb(8)
768              This  is  the  cluster configuration utility. It allows users to
769              update the cluster configuration by adding  and  removing  nodes
770              and  heartbeat  regions.  This  utility is used by the o2cb init
771              script to online and offline the cluster.
772
773              This is a new utility and replaces o2cb_ctl(8)  which  has  been
774              deprecated.
775
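              As an illustrative sketch only (the cluster name, node name and
              address are hypothetical; see o2cb(8) for the exact sub-commands
              and options), a node might be added to a cluster configuration
              with:

              # o2cb add-node --ip 192.0.2.7 webcluster node7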
776
777       ocfs2_hb_ctl(8)
778              This  is the cluster heartbeat utility. It allows users to start
779              and  stop  local  heartbeat.  This   utility   is   invoked   by
780              mount.ocfs2(8) and should not be invoked directly by the user.
781
782
783       o2hbmonitor(8)
784              This  is  the disk heartbeat monitor. It tracks the elapsed time
785              since the last  heartbeat  and  logs  warnings  once  that  time
786              exceeds the warn threshold.
787
788

FILE SYSTEM NOTES

790       This  section  includes some useful notes that may prove helpful to the
791       user.
792
793       BALANCED CLUSTER
794              A cluster is a computer. This is a fact and not a  slogan.  What
795              this  means is that an errant node in the cluster can affect the
796              behavior of other nodes. If one node is slow, the cluster opera‐
797              tions  will  slow down on all nodes. To prevent that, it is best
798              to have a balanced cluster. This is a cluster that  has  equally
799              powered and loaded nodes.
800
801              The standard recommendation for such clusters is to have identi‐
802              cal hardware and software across all the nodes. However, that is
803              not a hard and fast rule. After all, we have taken the effort to
804              ensure that OCFS2 works in a mixed architecture environment.
805
806              If one uses OCFS2 in a mixed architecture  environment,  try  to
807              ensure that the nodes are equally powered and loaded. The use of
808              a load balancer can assist with the latter. Power refers to  the
809              number  of  processors, speed, amount of memory, I/O throughput,
810              network bandwidth, etc. In reality, having equally powered  het‐
811              erogeneous nodes is not always practical. In that case, make the
812              lower node numbers more powerful than the higher  node  numbers.
813              The  O2CB  cluster stack favors lower node numbers in all of its
814              tiebreaking logic.
815
816              This is not to suggest you should add a single core  node  in  a
817              cluster  of  quad  cores. No amount of node number juggling will
818              help you there.
819
820
821       FILE DELETION
822              In Linux, rm(1) removes the directory entry. It does not  neces‐
823              sarily  delete  the  corresponding  inode.  But  by removing the
824              directory entry, it gives the illusion that the inode  has  been
825              deleted.   This puzzles users when they do not see a correspond‐
826              ing up-tick in the reported free  space.   The  reason  is  that
827              inode deletion has a few more hurdles to cross.
828
829              First  is  the  hard  link  count,  that indicates the number of
830              directory entries pointing to that inode. As long  as  an  inode
831              has  one  or more directory entries pointing to it, it cannot be
832              deleted.  The file system has to wait for  the  removal  of  all
833              those  directory entries. In other words, wait for that count to
834              drop to zero.
835
836              The second hurdle is the POSIX semantics allowing  files  to  be
837              unlinked  even  while they are in-use. In OCFS2, that translates
838              to in-use across the cluster. The file system has  to  wait  for
839              all processes across the cluster to stop using the inode.
840
841              Once  these  conditions  are  met,  the inode is deleted and the
842              freed space is visible after the next sync.
843
844              Now the amount of space freed depends on  the  allocation.  Only
845              space  that  is  actually  allocated to that inode is freed. The
846              example below shows a sparsely allocated file of  size  51TB  of
847              which only 2.4GB is actually allocated.
848
849              $ ls -lsh largefile
850              2.4G -rw-r--r-- 1 mark mark 51T Sep 29 15:04 largefile
851
852              Furthermore,  for  reflinked  files,  only  private  extents are
853              freed. Shared extents are freed when the  last  inode  accessing
854              it,  is  deleted. The example below shows a 4GB file that shares
855              3GB with other reflinked files. Deleting it  will  increase  the
856              free  space  by  1GB.  However, if it is the only remaining file
857              accessing the shared extents, the full 4G will be freed.   (More
858              information on the shared-du(1) utility is provided below.)
859
860              $ shared-du -m -c --shared-size reflinkedfile
861              4000    (3000)  reflinkedfile
862
863              The  deletion itself is a multi-step process. Once the hard link
864              count falls to zero, the inode is moved to the orphan_dir system
865              directory  where  it  remains until the last process, across the
866              cluster, stops using the inode. Then the file system  frees  the
867              extents  and adds the freed space count to the truncate_log sys‐
868              tem file where it remains until the next sync.  The freed  space
869              is made visible to the user only after that sync.
870
871
872       DIRECTORY LISTING
873              ls(1)  may  be  a  simple  command, but it is not cheap. What is
874              expensive is not the part where it reads the directory  listing,
875              but the second part where it reads all the inodes, also referred
              to as an inode stat(2). If the inodes are not in cache, this can
877              entail  disk  I/O.   Now,  while  a  cold cache inode stat(2) is
878              expensive in all file systems, it is especially so  in  a  clus‐
879              tered  file  system  as  it needs to take a cluster lock on each
880              inode.
881
              A hot cache stat(2), on the other hand, has been shown to perform
              on OCFS2 like it does on EXT3.
884
885              In other words, the second ls(1) will be quicker than the first.
886              However, it is not guaranteed. Say you have a million files in a
887              file  system  and  not  enough  kernel  memory  to cache all the
888              inodes. In that case, each ls(1) will involve  some  cold  cache
889              stat(2)s.
890
891
892       ALLOCATION RESERVATION
893              Allocation  reservation  allows  multiple concurrently extending
894              files to grow as contiguously as possible.  One  way  to  demon‐
895              strate  its functioning is to run a script that extends multiple
896              files in a circular order. The script below does that by writing
897              one hundred 4KB chunks to four files, one after another.
898
899              $ for i in $(seq 0 99);
900              > do
901              >   for j in $(seq 4);
902              >   do
903              >     dd if=/dev/zero of=file$j bs=4K count=1 seek=$i;
904              >   done;
905              > done;
906
907              When  run on a system running Linux kernel 2.6.34 or earlier, we
908              end up with files with 100 extents each. That is full fragmenta‐
909              tion. As the files are being extended one after another, the on-
910              disk allocations are fully interleaved.
911
912              $ filefrag file1 file2 file3 file4
913              file1: 100 extents found
914              file2: 100 extents found
915              file3: 100 extents found
916              file4: 100 extents found
917
918              When run on a system running Linux kernel 2.6.35  or  later,  we
919              see  files with 7 extents each. That is a lot fewer than before.
920              Fewer extents mean more on-disk contiguity and that always leads
921              to better overall performance.
922
923              $ filefrag file1 file2 file3 file4
924              file1: 7 extents found
925              file2: 7 extents found
926              file3: 7 extents found
927              file4: 7 extents found
928
929
930       REFLINK OPERATION
931              This  feature  allows a user to create a writeable snapshot of a
932              regular file. In this operation, the file system creates  a  new
933              inode  with the same extent pointers as the original inode. Mul‐
934              tiple inodes are thus able to share data extents.  This  adds  a
935              twist in file system administration because none of the existing
              file system utilities in Linux expect this behavior. du(1), a
              utility used to compute file space usage, simply adds the
              blocks allocated to each inode. As it does not know about shared
              extents, it overestimates the space used. Say, we have a 5GB
              file in a volume having 42GB free.
941
942              $ ls -l
943              total 5120000
944              -rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:15 myfile
945
946              $ du -m myfile*
947              5000    myfile
948
949              $ df -h .
950              Filesystem            Size  Used Avail Use% Mounted on
951              /dev/sdd1             50G   8.2G   42G  17% /ocfs2
952
953              If we were to reflink it 4 times, we would expect the  directory
954              listing  to  report  five  5GB files, but the df(1) to report no
955              loss of available space. du(1), on the other hand, would  report
956              the disk usage to climb to 25GB.
957
958              $ reflink myfile myfile-ref1
959              $ reflink myfile myfile-ref2
960              $ reflink myfile myfile-ref3
961              $ reflink myfile myfile-ref4
962
963              $ ls -l
964              total 25600000
965              -rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:15 myfile
966              -rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref1
967              -rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref2
968              -rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref3
969              -rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref4
970
971              $ df -h .
972              Filesystem            Size  Used Avail Use% Mounted on
973              /dev/sdd1             50G   8.2G   42G  17% /ocfs2
974
975              $ du -m myfile*
976              5000    myfile
977              5000    myfile-ref1
978              5000    myfile-ref2
979              5000    myfile-ref3
980              5000    myfile-ref4
981              25000 total
982
              Enter shared-du(1), a shared extent-aware du. This utility
              reports the shared extents per file in parentheses and the
              overall footprint. As expected, it lists the overall footprint at
              5GB. One can view the details of the extents using
              shared-filefrag(1). Both these utilities are available at
              http://oss.oracle.com/~smushran/reflink-tools/. We are currently
              in the process of pushing the changes to the upstream maintainers
              of these utilities.
991
992              $ shared-du -m -c --shared-size myfile*
993              5000    (5000)  myfile
994              5000    (5000)  myfile-ref1
995              5000    (5000)  myfile-ref2
996              5000    (5000)  myfile-ref3
997              5000    (5000)  myfile-ref4
998              25000 total
999              5000 footprint
1000
1001              # shared-filefrag -v myfile
1002              Filesystem type is: 7461636f
1003              File size of myfile is 5242880000 (1280000 blocks, blocksize 4096)
1004              ext logical physical expected length flags
1005              0         0  2247937            8448
1006              1      8448  2257921  2256384  30720
1007              2     39168  2290177  2288640  30720
1008              3     69888  2322433  2320896  30720
1009              4    100608  2354689  2353152  30720
1010              7    192768  2451457  2449920  30720
1011               . . .
1012              37  1073408  2032129  2030592  30720 shared
1013              38  1104128  2064385  2062848  30720 shared
1014              39  1134848  2096641  2095104  30720 shared
1015              40  1165568  2128897  2127360  30720 shared
1016              41  1196288  2161153  2159616  30720 shared
1017              42  1227008  2193409  2191872  30720 shared
1018              43  1257728  2225665  2224128  22272 shared,eof
1019              myfile: 44 extents found
1020
1021
1022       DATA COHERENCY
1023              One of the challenges in a shared file system is data  coherency
1024              when  multiple  nodes are writing to the same set of files. NFS,
1025              for example, provides close-to-open data coherency that  results
1026              in  the data being flushed to the server when the file is closed
1027              on the client.  This leaves open a wide window  for  stale  data
1028              being read on another node.
1029
1030              A  simple test to check the data coherency of a shared file sys‐
1031              tem involves concurrently appending the same file. Like  running
1032              "uname  -a  >>/dir/file" using a parallel distributed shell like
1033              dsh or pconsole. If coherent, the file will contain the  results
1034              from all nodes.
1035
1036              # dsh -R ssh -w node32,node33,node34,node35 "uname -a >> /ocfs2/test"
1037              # cat /ocfs2/test
1038              Linux node32 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
1039              Linux node35 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
1040              Linux node33 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
1041              Linux node34 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
1042
1043              OCFS2 is a fully cache coherent cluster file system.
1044
1045
1046       DISCONTIGUOUS BLOCK GROUP
1047              Most  file  systems pre-allocate space for inodes during format.
1048              OCFS2 dynamically allocates this space when required.
1049
1050              However, this dynamic allocation has been problematic  when  the
1051              free  space is very fragmented, because the file system required
1052              the inode and extent allocators to grow in contiguous fixed-size
1053              chunks.
1054
1055              The discontiguous block group feature takes care of this problem
1056              by allowing the allocators to grow  in  smaller,  variable-sized
1057              chunks.
1058
1059              This  feature  was  added  in  Linux  kernel 2.6.35 and requires
1060              enabling on-disk feature discontig-bg.
1061
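                  As an illustration, and assuming the volume is unmounted
                  cluster-wide and the installed tunefs.ocfs2(8) accepts the
                  --fs-features option, the feature could be enabled on an
                  existing volume (the device name is a placeholder) with:

                  # tunefs.ocfs2 --fs-features=discontig-bg /dev/sdX1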
1062
1063       BACKUP SUPER BLOCKS
1064              A file system super block stores critical  information  that  is
1065              hard  to  recreate.  In OCFS2, it stores the block size, cluster
1066              size, and the locations of  the  root  and  system  directories,
1067              among  other  things. As this block is close to the start of the
1068              disk, it is very susceptible to being overwritten by  an  errant
1069              write, for example, dd if=file of=/dev/sda1.
1070
1071              Backup  super blocks are copies of the super block. These blocks
1072              are dispersed in the volume to minimize  the  chances  of  being
1073              overwritten.  On  the  small  chance that the original gets cor‐
1074              rupted, the backups are available to scan and  fix  the  corrup‐
1075              tion.
1076
1077              mkfs.ocfs2(8) enables this feature by default. Users can disable
1078              this by specifying --fs-features=nobackup-super during format.
1079
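                  For instance, a volume could be formatted without backup
                  super blocks as follows (the device name is a placeholder):

                  # mkfs.ocfs2 --fs-features=nobackup-super /dev/sdX1
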
1080              o2info(1) can be used to  view  whether  the  feature  has  been
1081              enabled on a device.
1082
1083              # o2info --fs-features /dev/sdb1
1084              backup-super strict-journal-super sparse extended-slotmap inline-data xattr
1085              indexed-dirs refcount discontig-bg clusterinfo unwritten
1086
1087              In OCFS2, the super block is on the third block. The backups are
1088              located at the 1G, 4G, 16G, 64G, 256G and 1T byte  offsets.  The
1089              actual  number  of  backup  blocks  depends  on  the size of the
1090              device. The super block is not backed up on devices smaller than
1091              1GB.
1092
1093              fsck.ocfs2(8)  refers  to  these six offsets by numbers, 1 to 6.
1094              Users can specify any backup with the -r option to  recover  the
1095              volume. The example below uses the second backup. If successful,
1096              fsck.ocfs2(8) overwrites the  corrupted  super  block  with  the
1097              backup.
1098
1099              # fsck.ocfs2 -f -r 2 /dev/sdb1
1100              fsck.ocfs2 1.8.0
1101              [RECOVER_BACKUP_SUPERBLOCK] Recover superblock information from backup block#1048576? <n> y
1102              Checking OCFS2 filesystem in /dev/sdb1:
1103                Label:              webhome
1104                UUID:               B3E021A2A12B4D0EB08E9E986CDC7947
1105                Number of blocks:   13107196
1106                Block size:         4096
1107                Number of clusters: 13107196
1108                Cluster size:       4096
1109                Number of slots:    8
1110
1111              /dev/sdb1 was run with -f, check forced.
1112              Pass 0a: Checking cluster allocation chains
1113              Pass 0b: Checking inode allocation chains
1114              Pass 0c: Checking extent block allocation chains
1115              Pass 1: Checking inodes and blocks.
1116              Pass 2: Checking directory entries.
1117              Pass 3: Checking directory connectivity.
1118              Pass 4a: checking for orphaned inodes
1119              Pass 4b: Checking inodes link counts.
1120              All passes succeeded.
1121
1122
1123       SYNTHETIC FILE SYSTEMS
1124              The  OCFS2  development  effort included two synthetic file sys‐
1125              tems, configfs and dlmfs. It also makes use of a third, debugfs.
1126
1127
1128              configfs
1129                     configfs has since been accepted as a generic kernel com‐
1130                     ponent  and  is also used by netconsole and fs/dlm. OCFS2
1131                     tools use it to communicate the  list  of  nodes  in  the
1132                     cluster,  details  of the heartbeat device, cluster time‐
1133                     outs, and so on to the in-kernel node manager.  The  o2cb
1134                     init  script  mounts this file system at /sys/kernel/con‐
1135                     fig.
1136
1137
1138              dlmfs  dlmfs exposes the in-kernel o2dlm to user space. While
1139                     it was developed primarily for the OCFS2 tools, it has
1140                     seen use by others looking to add a cluster locking
1141                     dimension to their applications. Users interested in
1142                     doing the same should look at the libo2dlm library
1143                     provided by ocfs2-tools. The o2cb init script mounts
1144                     this file system at /dlm.
1145
1146
1147              debugfs
1148                     OCFS2 uses debugfs to expose its in-kernel information to
1149                     user  space. For example, listing the file system cluster
1150                     locks, dlm locks, dlm state, o2net state, etc. Users  can
1151                     access  the  information  by  mounting the file system at
1152                     /sys/kernel/debug. To automount,  add  the  following  to
1153                     /etc/fstab:  debugfs /sys/kernel/debug debugfs defaults 0
1154                     0
1155
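                  A quick way to confirm that these synthetic file systems are
                  mounted at the locations listed above is to filter the mount
                  table, for example:

                  # mount | grep -E 'configfs|dlmfs|debugfs'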
1156
1157       DISTRIBUTED LOCK MANAGER
1158              One of the key technologies in a cluster is  the  lock  manager,
1159              which  maintains  the  locking state of all resources across the
1160              cluster. An easy implementation of a lock manager involves  des‐
1161              ignating one node to handle everything. In this model, if a node
1162              wanted to acquire a lock, it would send the request to the  lock
1163              manager. However, this model has a single point of failure:
1164              the lock manager's death causes the entire cluster to seize up.
1165
1166              A better model is one where all nodes manage  a  subset  of  the
1167              lock  resources.  Each node maintains enough information for all
1168              the lock resources it is interested in. In the event of a node
1169              death, the remaining nodes pool their information to
1170              reconstruct the lock state maintained by the dead node. In
1171              this scheme, the locking overhead is distributed amongst all
1172              the nodes. Hence the term distributed lock manager.
1173
1174              O2DLM is a distributed lock manager. It is based on the
1175              specification titled "Programming Locking Applications"
1176              written by Kristin Thomas, available at the following link:
1177              http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
1179
1180
1181       DLM DEBUGGING
1182              O2DLM has a rich debugging infrastructure that allows it to show
1183              the  state  of  the  lock manager, all the lock resources, among
1184              other things.  The figure below shows the dlm state of  a  nine-
1185              node  cluster that has just lost three nodes: 12, 32, and 35. It
1186              can be ascertained that node 7, the  recovery  master,  is  cur‐
1187              rently  recovering  node  12 and has received the lock states of
1188              the dead node from all other live nodes.
1189
1190              # cat /sys/kernel/debug/o2dlm/45F81E3B6F2B48CCAAD1AE7945AB2001/dlm_state
1191              Domain: 45F81E3B6F2B48CCAAD1AE7945AB2001  Key: 0x10748e61
1192              Thread Pid: 24542  Node: 7  State: JOINED
1193              Number of Joins: 1  Joining Node: 255
1194              Domain Map: 7 31 33 34 40 50
1195              Live Map: 7 31 33 34 40 50
1196              Lock Resources: 48850 (439879)
1197              MLEs: 0 (1428625)
1198                Blocking: 0 (1066000)
1199                Mastery: 0 (362625)
1200                Migration: 0 (0)
1201              Lists: Dirty=Empty  Purge=Empty  PendingASTs=Empty  PendingBASTs=Empty
1202              Purge Count: 0  Refs: 1
1203              Dead Node: 12
1204              Recovery Pid: 24543  Master: 7  State: ACTIVE
1205              Recovery Map: 12 32 35
1206              Recovery Node State:
1207                      7 - DONE
1208                      31 - DONE
1209                      33 - DONE
1210                      34 - DONE
1211                      40 - DONE
1212                      50 - DONE
1213
1214              The figure below shows the state of a dlm lock resource that  is
1215              mastered  (owned)  by node 25, with 6 locks in the granted queue
1216              and node 26 holding the EX (writelock) lock on that resource.
1217
1218              # debugfs.ocfs2 -R "dlm_locks M000000000000000022d63c00000000" /dev/sda1
1219              Lockres: M000000000000000022d63c00000000   Owner: 25    State: 0x0
1220              Last Used: 0      ASTs Reserved: 0    Inflight: 0    Migration Pending: No
1221              Refs: 8    Locks: 6    On Lists: None
1222              Reference Map: 26 27 28 94 95
1223               Lock-Queue  Node  Level  Conv  Cookie           Refs  AST  BAST  Pending-Action
1224               Granted     94    NL     -1    94:3169409       2     No   No    None
1225               Granted     28    NL     -1    28:3213591       2     No   No    None
1226               Granted     27    NL     -1    27:3216832       2     No   No    None
1227               Granted     95    NL     -1    95:3178429       2     No   No    None
1228               Granted     25    NL     -1    25:3513994       2     No   No    None
1229               Granted     26    EX     -1    26:3512906       2     No   No    None
1230
1231              The figure below shows a lock from the file system  perspective.
1232              Specifically,  it  shows  a lock that is in the process of being
1233              upconverted from NL to EX. Locks in this state are
1234              referred  to  in the file system as busy locks and can be listed
1235              using the debugfs.ocfs2 command, "fs_locks -B".
1236
1237              # debugfs.ocfs2 -R "fs_locks -B" /dev/sda1
1238              Lockres: M000000000000000000000b9aba12ec  Mode: No Lock
1239              Flags: Initialized Attached Busy
1240              RO Holders: 0  EX Holders: 0
1241              Pending Action: Convert  Pending Unlock Action: None
1242              Requested Mode: Exclusive  Blocking Mode: No Lock
1243              PR > Gets: 0  Fails: 0    Waits Total: 0us  Max: 0us  Avg: 0ns
1244              EX > Gets: 1  Fails: 0    Waits Total: 544us  Max: 544us  Avg: 544185ns
1245              Disk Refreshes: 1
1246
1247              With this debugging infrastructure in  place,  users  can  debug
1248              hang issues as follows:
1249
1250                  * Dump the busy fs locks for all the OCFS2 volumes on the
1251                  node with hanging processes (a sketch follows this list).
1252                  If no locks are found, then the problem is not related to O2DLM.
1253
1254                  * Dump the corresponding dlm lock for all the busy fs locks.
1255                  Note down the owner (master) of all the locks.
1256
1257                  * Dump the dlm locks on the master node for each lock.
1258
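                  A minimal sketch of the first step, assuming the mounted
                  OCFS2 volumes are listed in /proc/mounts and that
                  debugfs.ocfs2(8) is run against each backing device:

                  # for dev in $(awk '$3 == "ocfs2" {print $1}' /proc/mounts)
                  > do
                  >     echo "==> $dev"
                  >     debugfs.ocfs2 -R "fs_locks -B" "$dev"
                  > done
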
1259              At this stage, one should note that the hanging node is  waiting
1260              to  get  an  AST from the master. The master, on the other hand,
1261              cannot send the AST until the current holder has down  converted
1262              that  lock, which it will do upon receiving a Blocking AST. How‐
1263              ever, a node can only down convert if all the lock holders  have
1264              stopped using that lock.  After dumping the dlm lock on the mas‐
1265              ter node, identify the current lock holder and dump both the dlm
1266              and fs locks on that node.
1267
1268              The trick here is to see whether the Blocking AST message has
1269              been relayed to the file system. If not, the problem is in the
1270              dlm layer. If it has, then the most common cause is an active
1271              lock holder; the holder counts are maintained in the fs lock.
1272
1273              At this stage, printing the list of processes helps.
1274
1275              $ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
1276
1277              Make a note of all D state processes. At least one  of  them  is
1278              responsible for the hang on the first node.
1279
1280              The  challenge  then  is  to  figure out why those processes are
1281              hanging. Failing that, at least  get  enough  information  (like
1282              alt-sysrq  t  output) for the kernel developers to review.  What
1283              to do next depends on where the process is  hanging.  If  it  is
1284              waiting  for  the I/O to complete, the problem could be anywhere
1285              in the I/O subsystem, from the block device  layer  through  the
1286              drivers  to  the  disk  array.  If the hang concerns a user lock
1287              (flock(2)), the problem could be in the  user’s  application.  A
1288              possible  solution  could  be to kill the holder. If the hang is
1289              due to tight or  fragmented  memory,  free  up  some  memory  by
1290              killing non-essential processes.
1291
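                  The alt-sysrq t output mentioned above can also be captured
                  from a shell, assuming the magic SysRq key is enabled on the
                  system; the task states then appear in the kernel log:

                  # echo 1 > /proc/sys/kernel/sysrq
                  # echo t > /proc/sysrq-trigger
                  # dmesg
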
1292              The thing to note is that the symptom of the problem shows up
1293              on one node but the cause lies on another. The issue can only
1294              be resolved on the node holding the lock. Sometimes the best
1295              solution is to reset that node. Once that node is dead, the
1296              O2DLM recovery process clears all locks owned by it and lets
1297              the cluster continue to operate. As harsh as that sounds, at
1298              times it is the only solution. The good news is that, by
1299              following the trail, you now have enough information to file a
1300              bug and get the real issue resolved.
1301
1302
1303       NFS EXPORTING
1304              OCFS2  volumes  can  be exported as NFS volumes. This support is
1305              limited to NFS version 3, which translates to Linux kernel  ver‐
1306              sion 2.4 or later.
1307
1308              If  the  version of the Linux kernel on the system exporting the
1309              volume is older than 2.6.30, then the NFS clients must mount the
1310              volumes using the nordirplus mount option. This disables the
1311              READDIRPLUS RPC call to work around a bug in NFSD, detailed in
1312              the following link:
1313
1314              http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
1315
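                  For illustration, an NFS version 3 client would mount such an
                  export with the nordirplus option as follows (the server name
                  and paths are placeholders):

                  # mount -t nfs -o vers=3,nordirplus nfsserver:/ocfs2 /mnt/ocfs2
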
1316              Users running NFS version 2 can export the volume after having
1317              disabled subtree checking (the no_subtree_check export option).
1318              Be warned, disabling the check has security implications,
1319              documented in the exports(5) man page, that users must
1320              evaluate on their own.
1321
1322
1323       FILE SYSTEM LIMITS
1324              OCFS2 has no intrinsic limit on the total number of files and
1325              directories in the file system. In general, it is only limited
1326              by the size of the device. But there is one limit imposed by
1327              the current on-disk format: it can address at most 2^32 (about
1328              four billion) clusters. A file system with a 1MB cluster size
1329              can grow to 4PB, while a file system with a 4KB cluster size
1330              can address up to 16TB.
1331
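                  A quick back-of-the-envelope check of these limits, using
                  bash arithmetic purely for illustration:

                  # echo $(( 2**32 * 4096 / 2**40 ))     # 4KB clusters, in TB
                  16
                  # echo $(( 2**32 * 2**20 / 2**50 ))    # 1MB clusters, in PB
                  4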
1332
1333       SYSTEM OBJECTS
1334              The  OCFS2  file system stores its internal meta-data, including
1335              bitmaps, journals, etc., as system files. These are grouped in a
1336              system directory. These files and directories are not accessible
1337              via the file system  interface  but  can  be  viewed  using  the
1338              debugfs.ocfs2(8) tool.
1339
1340              To list the system directory (referred to as double-slash), do:
1341
1342              # debugfs.ocfs2 -R "ls -l //" /dev/sde1
1343                      66     drwxr-xr-x  10  0  0         3896 19-Jul-2011 13:36 .
1344                      66     drwxr-xr-x  10  0  0         3896 19-Jul-2011 13:36 ..
1345                      67     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 bad_blocks
1346                      68     -rw-r--r--   1  0  0      1179648 19-Jul-2011 13:36 global_inode_alloc
1347                      69     -rw-r--r--   1  0  0         4096 19-Jul-2011 14:35 slot_map
1348                      70     -rw-r--r--   1  0  0      1048576 19-Jul-2011 13:36 heartbeat
1349                      71     -rw-r--r--   1  0  0  53686960128 19-Jul-2011 13:36 global_bitmap
1350                      72     drwxr-xr-x   2  0  0         3896 25-Jul-2011 15:05 orphan_dir:0000
1351                      73     drwxr-xr-x   2  0  0         3896 19-Jul-2011 13:36 orphan_dir:0001
1352                      74     -rw-r--r--   1  0  0      8388608 19-Jul-2011 13:36 extent_alloc:0000
1353                      75     -rw-r--r--   1  0  0      8388608 19-Jul-2011 13:36 extent_alloc:0001
1354                      76     -rw-r--r--   1  0  0    121634816 19-Jul-2011 13:36 inode_alloc:0000
1355                      77     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 inode_alloc:0001
1356                      78     -rw-r--r--   1  0  0    268435456 19-Jul-2011 13:36 journal:0000
1357                      79     -rw-r--r--   1  0  0    268435456 19-Jul-2011 13:37 journal:0001
1358                      80     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 local_alloc:0000
1359                      81     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 local_alloc:0001
1360                      82     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 truncate_log:0000
1361                      83     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 truncate_log:0001
1362
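                  Individual system files can be examined the same way. For
                  example, assuming the stat command of debugfs.ocfs2(8), the
                  global bitmap inode on the same device could be inspected
                  with (output omitted):

                  # debugfs.ocfs2 -R "stat //global_bitmap" /dev/sde1
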
1363              The  file  names that end with numbers are slot specific and are
1364              referred to as node-local system files. The  set  of  node-local
1365              files  used  by  a  node can be determined from the slot map. To
1366              list the slot map, do:
1367
1368              # debugfs.ocfs2 -R "slotmap" /dev/sde1
1369                  Slot#    Node#
1370                      0       32
1371                      1       35
1372                      2       40
1373                      3       31
1374                      4       34
1375                      5       33
1376
1377              For more information, refer to the OCFS2 support guides
1378              available in the Documentation section at
1379              http://oss.oracle.com/projects/ocfs2.
1380
1381
1382       HEARTBEAT, QUORUM, AND FENCING
1383              Heartbeat is an  essential  component  in  any  cluster.  It  is
1384              charged  with  accurately  designating nodes as dead or alive. A
1385              mistake here could lead to a cluster hang or to corruption.
1386
1387              o2hb is the disk heartbeat component of  o2cb.  It  periodically
1388              updates a timestamp on disk, indicating to others that this node
1389              is alive. It also reads all the  timestamps  to  identify  other
1390              live  nodes. Other cluster components, like o2dlm and o2net, use
1391              the o2hb service to get node up and down events.
1392
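                  The heartbeat regions configured on a node can be inspected
                  through configfs. The exact layout depends on the o2cb
                  version, and the cluster name below is a placeholder:

                  # ls /sys/kernel/config/cluster/mycluster/heartbeat/
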
1393              The quorum is the group of nodes in a cluster that is allowed to
1394              operate  on  the  shared storage. When there is a failure in the
1395              cluster, nodes may be split into groups that can communicate
1396              within their group and with the shared storage but not across groups.
1397              o2quo determines which group is allowed to continue  and  initi‐
1398              ates fencing of the other group(s).
1399
1400              Fencing is the act of forcefully removing a node from a cluster.
1401              A node with OCFS2 mounted will fence  itself  when  it  realizes
1402              that it does not have quorum in a degraded cluster. It does this
1403              so that  other  nodes  won’t  be  stuck  trying  to  access  its
1404              resources.
1405
1406              o2cb  uses  a machine reset to fence. This is the quickest route
1407              for the node to rejoin the cluster.
1408
1409
1410       PROCESSES
1411
1412
1413              [o2net]
1414                     One per node. It is a work-queue thread started when  the
1415                     cluster  is  brought  on-line and stopped when it is off-
1416                     lined. It handles network communication for  all  mounts.
1417                     It  gets the list of active nodes from O2HB and sets up a
1418                     TCP/IP communication channel  with  each  live  node.  It
1419                     sends  regular keep-alive packets to detect any interrup‐
1420                     tion on the channels.
1421
1422
1423              [user_dlm]
1424                     One per node. It is  a  work-queue  thread  started  when
1425                     dlmfs is loaded and stopped when it is unloaded (dlmfs is
1426                     a synthetic file system that allows user space  processes
1427                     to access the in-kernel dlm).
1428
1429
1430              [ocfs2_wq]
1431                     One  per node. It is a work-queue thread started when the
1432                     OCFS2 module is loaded and stopped when it  is  unloaded.
1433                     It is assigned background file system tasks that may take
1434                     cluster locks, such as flushing the truncate log, orphan
1435                     directory recovery and local alloc recovery. For example,
1436                     orphan directory recovery runs in the background so  that
1437                     it does not affect recovery time.
1438
1439
1440              [o2hb-14C29A7392]
1441                     One  per  heartbeat device. It is a kernel thread started
1442                     when the heartbeat region is populated  in  configfs  and
1443                     stopped  when  it is removed. It writes every two seconds
1444                     to a block in the heartbeat region, indicating that  this
1445                     node is alive. It also reads the region to maintain a map
1446                     of live nodes. It notifies  subscribers  like  o2net  and
1447                     o2dlm of any changes in the live node map.
1448
1449
1450              [ocfs2dc]
1451                     One  per mount. It is a kernel thread started when a vol‐
1452                     ume is mounted and stopped when it is unmounted. It down‐
1453                     grades   locks  in  response  to  blocking  ASTs  (BASTs)
1454                     requested by other nodes.
1455
1456
1457              [jbd2/sdf1-97]
1458                     One per mount. It is part of JBD2, which OCFS2  uses  for
1459                     journaling.
1460
1461
1462              [ocfs2cmt]
1463                     One  per mount. It is a kernel thread started when a vol‐
1464                     ume is mounted and stopped when it is unmounted. It works
1465                     with kjournald2.
1466
1467
1468              [ocfs2rec]
1469                     It  is  started whenever a node has to be recovered. This
1470                     thread performs file system  recovery  by  replaying  the
1471                     journal  of  the  dead node. It is scheduled to run after
1472                     dlm recovery has completed.
1473
1474
1475              [dlm_thread]
1476                     One per dlm domain. It is a kernel thread started when  a
1477                     dlm  domain  is created and stopped when it is destroyed.
1478                     This thread sends ASTs and blocking ASTs in  response  to
1479                     lock  level  convert  requests. It also frees unused lock
1480                     resources.
1481
1482
1483              [dlm_reco_thread]
1484                     One per dlm domain. It is a kernel  thread  that  handles
1485                     dlm  recovery when another node dies. If this node is the
1486                     dlm recovery master, it re-masters  every  lock  resource
1487                     owned by the dead node.
1488
1489
1490              [dlm_wq]
1491                     One  per dlm domain. It is a work-queue thread that o2dlm
1492                     uses to queue blocking tasks.
1493
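                  A quick way to see which of these threads are running on a
                  node (the heartbeat thread name includes the region UUID, so
                  a pattern match is used):

                  # ps -e -o pid,comm | grep -E 'o2net|o2hb|user_dlm|ocfs2|dlm_|jbd2'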
1494
1495       FUTURE WORK
1496              File system development is a never-ending cycle. Faster and
1497              larger disks, more and faster processors, larger caches, etc.
1498              keep shifting the sweet spot for performance, forcing
1499              developers to rethink long-held beliefs. Add to that new use
1500              cases, which force developers to be innovative in providing
1501              solutions that meld seamlessly with existing semantics.
1502
1503              We are currently looking to add features like transparent
1504              compression, transparent encryption, delayed allocation, and
1505              multi-device support, as well as to improve performance on
1506              newer generation machines.
1507
1508              If you are interested in  contributing,  email  the  development
1509              team at ocfs2-devel@oss.oracle.com.
1510
1511

ACKNOWLEDGEMENTS

1513       The  principal  developers  of the OCFS2 file system, its tools and the
1514       O2CB cluster stack, are Joel Becker, Zach Brown, Mark Fasheh, Jan Kara,
1515       Kurt Hackel, Tao Ma, Sunil Mushran, Tiger Yang and Tristan Ye.
1516
1517       Other developers who have contributed to the file system via bug fixes,
1518       testing, etc.  are Wim Coekaerts, Srinivas Eeda, Coly Li, Jeff Mahoney,
1519       Marcos Matsunaga, Goldwyn Rodrigues, Manish Singh and Wengang Wang.
1520
1521       The  members  of  the Linux Cluster community including Andrew Beekhof,
1522       Lars Marowsky-Bree, Fabio Massimo Di Nitto and David Teigland.
1523
1524       The members of the Linux  File  system  community  including  Christoph
1525       Hellwig and Chris Mason.
1526
1527       The  corporations  that  have  contributed  resources  for this project
1528       including Oracle, SUSE Labs, EMC, Emulex, HP, IBM,  Intel  and  Network
1529       Appliance.
1530
1531

SEE ALSO

1533       debugfs.ocfs2(8)   fsck.ocfs2(8)   fsck.ocfs2.checks(8)   mkfs.ocfs2(8)
1534       mount.ocfs2(8)  mounted.ocfs2(8)  o2cluster(8)   o2image(8)   o2info(1)
1535       o2cb(7)  o2cb(8) o2cb.sysconfig(5) o2hbmonitor(8) ocfs2.cluster.conf(5)
1536       tunefs.ocfs2(8)
1537
1538

AUTHOR

1540       Oracle Corporation
1541
1542
1544       Copyright © 2004, 2012 Oracle. All rights reserved.
1545
1546
1547
Version 1.8.6                    January 2012                         OCFS2(7)