dirfile-encoding(5)              DATA FORMATS              dirfile-encoding(5)

NAME

       dirfile-encoding - dirfile database encoding schemes

DESCRIPTION

       The Dirfile Standards indicate that RAW fields defined in the database
       are accompanied by binary files containing the field data in the
       specified simple data type.  In certain situations, it may be
       advantageous to convert the binary files in the database into a more
       convenient form.  This is accomplished by encoding the binary file into
       the alternate form.  A common use-case for encoding a binary file is to
       compress it to save disk space.  Only data is modified by an encoding
       scheme.  Database metadata is unaffected.

       Support for encoding schemes is optional.  An implementation need not
       support any particular encoding scheme, or may only support certain
       operations with it, but should expect to encounter unknown encoding
       schemes and fail gracefully in such situations.

       Additionally, how a particular encoding is implemented is not specified
       by the Dirfile Standards, but, for purposes of interoperability, all
       dirfile implementations are encouraged to support the encoding
       implementation used by the GetData dirfile reference implementation,
       elaborated below.

       An encoding scheme is local to the particular format file fragment in
       which it is indicated.  This allows a single dirfile to have binary
       files which are stored using multiple encodings, by having them defined
       in multiple fragments.

       The rest of this manual page discusses specifics of the encoding
       framework implemented in the GetData library, and does not constitute
       part of the Dirfile Standards.

THE GETDATA ENCODING FRAMEWORK

       The GetData library provides an encoding framework which abstracts
       binary file I/O, allowing for generic support for a wide variety of
       encoding schemes.  Functions which may make use of the encoding
       framework are:

              dirfile_add(3), dirfile_add_raw(3), dirfile_add_spec(3),
              dirfile_alter_encoding(3), dirfile_alter_endianness(3),
              dirfile_alter_frameoffset(3), dirfile_alter_entry(3),
              dirfile_alter_raw(3), dirfile_alter_spec(3), dirfile_move(3),
              dirfile_rename(3), getdata(3), get_nframes(3), and putdata(3).

       Most of the encodings supported by GetData are implemented through
       external libraries which handle the actual file I/O and data
       translation.  All such libraries are optional; a build of the library
       which omits an external library will lack support for the associated
       encoding scheme.  In this case, GetData will still properly identify
       the encoding scheme, but attempts to use GetData for file I/O via the
       encoding will fail with the GD_E_UNSUPPORTED error code.

       GetData discovers the encoding scheme of a particular RAW field by
       noting the filename extension of files associated with the field.
       Binary files which form an unencoded dirfile have no file extension.
       The file extensions used by the other encodings are noted below.
       Encoding discovery proceeds by searching for files with the known list
       of file extensions (in an unspecified order) and stopping when the
       first successful match is made.  Because of this, when a field has
       multiple data files with different, supported file extensions which
       could legitimately be associated with it, the encoding scheme
       discovered by GetData is not well defined.

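       The discovery procedure described above can be sketched in Python.
       This is an illustration only, not GetData's implementation: the
       function names, the scheme labels, and the probe order are assumptions
       made for the example; only the file extensions come from this manual
       page.

```python
import os
import tempfile

# Extensions as listed in this manual page; the scheme labels are illustrative.
KNOWN_EXTENSIONS = {
    "": "none",
    ".txt": "text",
    ".bz2": "bzip2",
    ".gz": "gzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".slm": "slim",
}


def discover_encoding(dirfile_path, field_name):
    """Return the scheme of the first data file found, or None if none exists."""
    # The probe order here is simply the dict order (an assumption of this sketch).
    for extension, scheme in KNOWN_EXTENSIONS.items():
        if os.path.isfile(os.path.join(dirfile_path, field_name + extension)):
            return scheme
    return None


def candidate_encodings(dirfile_path, field_name):
    """Return every scheme that has a matching data file, in sorted order."""
    found = set()
    for extension, scheme in KNOWN_EXTENSIONS.items():
        if os.path.isfile(os.path.join(dirfile_path, field_name + extension)):
            found.add(scheme)
    return sorted(found)


# Demonstration: a field with two data files is ambiguous.
with tempfile.TemporaryDirectory() as dirfile_path:
    for name in ("data.gz", "data.txt"):
        open(os.path.join(dirfile_path, name), "wb").close()
    found = discover_encoding(dirfile_path, "data")
    candidates = candidate_encodings(dirfile_path, "data")
    missing = discover_encoding(dirfile_path, "absent")
```

       When candidate_encodings() returns more than one entry, the result of
       discover_encoding() depends only on the probe order, which is the
       ambiguity the paragraph above warns about.
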
       In addition to raw (unencoded) data, GetData supports five other
       encoding schemes: text encoding, bzip2 encoding, gzip encoding, lzma
       encoding, and slim encoding, all discussed below.

   Text Encoding
       The Text Encoding is unique among GetData encoding schemes in that it
       requires no external library.  As a result, all builds of the library
       contain full support for this encoding.  It is meant to serve as a
       reference encoding and example of the encoding framework for work on
       other encoding schemes.

       The Text Encoding replaces the binary data files with 7-bit ASCII files
       containing a decimal text encoding of the data, one sample per line.
       All operations are supported by the Text Encoding.  The file extension
       of the Text Encoding is .txt.

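       As a rough illustration of the layout described above, the following
       Python sketch converts integer samples to and from one-sample-per-line
       decimal ASCII.  The helper names are hypothetical, and the sketch
       covers integer data only; it does not reproduce how GetData formats
       other data types.

```python
def encode_text(samples):
    """Render integer samples as ASCII decimal text, one sample per line."""
    return "".join(f"{sample:d}\n" for sample in samples).encode("ascii")


def decode_text(payload):
    """Parse one-sample-per-line ASCII decimal text back into integers."""
    return [int(line) for line in payload.decode("ascii").splitlines()]


samples = [0, -3, 65535]
payload = encode_text(samples)      # contents of a hypothetical "field.txt"
round_trip = decode_text(payload)
```
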
   BZip2 Encoding
       The BZip2 Encoding compresses raw binary files using the
       Burrows-Wheeler block sorting text compression algorithm and Huffman
       coding, as implemented in the bzip2 format.  GetData's BZip2 Encoding
       scheme is implemented through the bzip2 compression library written by
       Julian Seward.  GetData's BZip2 Encoding framework currently lacks
       write capabilities; as a result the BZip2 Encoding does not support
       functions which modify binary data.

       GetData caches an uncompressed megabyte of data at a time to speed
       access times.  A call to get_nframes(3) requires decompression of the
       entire binary file to determine its uncompressed size, and may take
       some time to complete.  The file extension of the BZip2 Encoding is
       .bz2.

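       The cost of counting frames in a bzip2-compressed file can be seen in
       a short Python sketch: the stream has to be read to the end before its
       uncompressed length is known.  The function names and the chunk size
       are assumptions chosen for illustration, and the frame arithmetic
       ignores any frame offset; this is not GetData's code.

```python
import bz2
import io

CHUNK_SIZE = 1 << 20  # read 1 MiB of uncompressed data at a time (illustrative)


def uncompressed_size(stream):
    """Decompress the whole stream, chunk by chunk, and return its length."""
    total = 0
    with bz2.open(stream, "rb") as reader:
        while True:
            chunk = reader.read(CHUNK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    return total


def frame_count(stream, sample_size, samples_per_frame):
    """Number of whole frames, assuming fixed-size samples and no frame offset."""
    return uncompressed_size(stream) // (sample_size * samples_per_frame)


raw = bytes(8000)                           # 8000 bytes of zeroed sample data
compressed = io.BytesIO(bz2.compress(raw))
nframes = frame_count(compressed, sample_size=2, samples_per_frame=20)
```
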
   GZip Encoding
       The GZip Encoding compresses raw binary files using Lempel-Ziv coding
       (LZ77) as implemented in the gzip format.  GetData's GZip Encoding
       scheme is implemented through the zlib compression library written by
       Jean-loup Gailly and Mark Adler.  GetData's GZip Encoding framework
       currently lacks write capabilities; as a result the GZip Encoding does
       not support functions which modify binary data.

       To speed the operation of get_nframes(3), the GZip Encoding takes the
       uncompressed size of the file from the gzip footer, which contains the
       file's uncompressed size in bytes, modulo 2^32.  As a result, using a
       field with an (uncompressed) binary file size of 4 GiB or more as the
       reference field will result in the wrong number of frames being
       reported.  The file extension of the GZip Encoding is .gz.

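       The footer field in question is the last four bytes of a gzip member,
       stored as a little-endian unsigned 32-bit integer.  The Python sketch
       below reads it and shows the wrap-around for a large file; the helper
       names are hypothetical and the sketch assumes a single-member gzip
       file.

```python
import gzip
import struct


def gzip_footer_size(payload):
    """Return the size stored in the last four bytes of a gzip member."""
    (isize,) = struct.unpack("<I", payload[-4:])
    return isize


def reported_size(true_size):
    """Size a 32-bit footer would hold for a file of true_size bytes."""
    return true_size % 2**32


raw = bytes(100000)
payload = gzip.compress(raw)
footer_size = gzip_footer_size(payload)     # matches len(raw) for small files

# A 5 GiB file wraps around and appears to be only 1 GiB long.
wrapped = reported_size(5 * 2**30)
```
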
   LZMA Encoding
       The LZMA Encoding compresses raw binary files using the Lempel-Ziv
       Markov Chain Algorithm (LZMA) as implemented in the xz container
       format.  GetData's LZMA Encoding scheme is implemented through the lzma
       library, part of the XZ Utils suite written by Lasse Collin, Ville
       Koskinen, and Igor Pavlov.  GetData's LZMA Encoding framework currently
       lacks write capabilities; as a result the LZMA Encoding does not
       support functions which modify binary data.

       As with the BZip2 Encoding, GetData caches an uncompressed megabyte of
       data at a time to speed access times.  A call to get_nframes(3)
       requires decompression of the entire binary file to determine its
       uncompressed size, and may take some time to complete.  The file
       extension of the LZMA Encoding is .xz or .lzma.

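       The two extensions usually correspond to two container formats
       understood by XZ Utils: the xz container and the older standalone LZMA
       container.  That correspondence is an assumption here, since this page
       does not state it.  A minimal Python sketch using the standard lzma
       module shows that one decoder can read either container by
       autodetecting the format; decode_either is a hypothetical name.

```python
import lzma

raw = bytes(range(256)) * 16

# The same data in the xz container and in the legacy standalone container.
xz_payload = lzma.compress(raw, format=lzma.FORMAT_XZ)
alone_payload = lzma.compress(raw, format=lzma.FORMAT_ALONE)


def decode_either(payload):
    """Decompress a payload, letting the library detect the container."""
    return lzma.decompress(payload, format=lzma.FORMAT_AUTO)


from_xz = decode_either(xz_payload)
from_alone = decode_either(alone_payload)
```
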
   Slim Encoding
       The Slim Encoding compresses raw binary files using the slimlib
       compression library written by Joseph Fowler.  The slimlib library was
       developed at Princeton University to compress dirfile-like data.
       GetData's Slim Encoding framework currently lacks write capabilities;
       as a result, the Slim Encoding does not support functions which modify
       binary data.  The file extension of the Slim Encoding is .slm.

AUTHOR

       This manual page was written by D. V. Wiebe <dvw@ketiltrout.net>.

SEE ALSO

       dirfile(5), dirfile-format(5), bzip2(1), gzip(1), zlib(3).

Standards Version 7             16 October 2009            dirfile-encoding(5)