dirfile-encoding(5)              DATA FORMATS              dirfile-encoding(5)

NAME

       dirfile-encoding - dirfile database encoding schemes

DESCRIPTION

       The Dirfile Standards indicate that RAW fields defined in the database
       are accompanied by binary files containing the field data in the
       specified simple data type.  In certain situations, it may be
       advantageous to convert the binary files in the database into a more
       convenient form.  This is accomplished by encoding the binary file
       into the alternate form.  A common use-case for encoding a binary file
       is to compress it to save disk space.  Only data is modified by an
       encoding scheme.  Database metadata is never encoded.

       Support for encoding schemes is optional.  An implementation need not
       support any particular encoding scheme, or may only support certain
       operations with it, but should expect to encounter unknown encoding
       schemes and fail gracefully in such situations.

       Additionally, how a particular encoding is implemented is not
       specified by the Dirfile Standards, but, for purposes of
       interoperability, all dirfile implementations are encouraged to
       support the encoding implementation used by the GetData dirfile
       reference implementation, elaborated below.

       An encoding scheme is local to the particular format specification
       fragment in which it is indicated.  This allows a single dirfile to
       have binary files which are stored using multiple encodings, by having
       them defined in multiple fragments.

       The rest of this manual page discusses specifics of the encoding
       framework implemented in the GetData library, and does not constitute
       part of the Dirfile Standards.

THE GETDATA ENCODING FRAMEWORK

       The GetData library provides an encoding framework which abstracts
       binary file I/O, allowing for generic support for a wide variety of
       encoding schemes.  Functions which may make use of the encoding
       framework are:

              gd_add(3), gd_add_raw(3), gd_add_spec(3), gd_alter_encoding(3),
              gd_alter_endianness(3), gd_alter_frameoffset(3),
              gd_alter_entry(3), gd_alter_raw(3), gd_alter_spec(3),
              gd_flush(3), gd_getdata(3), gd_malter_spec(3), gd_move(3),
              gd_nframes(3), gd_putdata(3), gd_raw_close(3), gd_rename(3), and
              gd_sync(3).

       Most of the encodings supported by GetData are implemented through
       external libraries which handle the actual file I/O and data
       translation.  All such libraries are optional; a build of the library
       which omits an external library will lack support for the associated
       encoding scheme.  In this case, GetData will still properly identify
       the encoding scheme, but attempts to use GetData for file I/O via the
       encoding will fail with the GD_E_UNSUPPORTED error code.

       GetData discovers the encoding scheme of a particular RAW field by
       noting the filename extension of files associated with the field.
       Binary files which form an unencoded dirfile have no file extension.
       The file extensions used by the other encodings are noted below.
       Encoding discovery proceeds by searching for files with the known list
       of file extensions (in an unspecified order) and stopping when the
       first successful match is made.  Because of this, when a field has
       multiple data files with different, supported file extensions which
       could legitimately be associated with it, the encoding scheme
       discovered by GetData is not well defined.

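       The ambiguity in extension-based discovery can be illustrated with a
       minimal Python sketch.  This is not GetData's search code: the
       function name and the lookup table are hypothetical, and the table
       contains only the per-field extensions named on this page.  Instead
       of stopping at the first match, the sketch returns every candidate so
       that the ill-defined case is visible.

```python
# Hypothetical sketch of extension-based encoding discovery; not GetData code.
# The table holds only the per-field extensions named on this manual page.
KNOWN_EXTENSIONS = {
    "": "none",
    ".txt": "text",
    ".bz2": "bzip2",
    ".flac": "flac",
    ".gz": "gzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".sie": "sie",
    ".slm": "slim",
}


def candidate_encodings(field, filenames):
    """Return every encoding whose data file for `field` is present."""
    present = set(filenames)
    return sorted({enc for ext, enc in KNOWN_EXTENSIONS.items()
                   if field + ext in present})


# One data file: the encoding is unambiguous.
print(candidate_encodings("temp", ["format", "temp.gz"]))
# Two data files: a first-match search could report either encoding.
print(candidate_encodings("temp", ["format", "temp.gz", "temp.sie"]))
```

       A result with more than one entry corresponds to the situation in
       which the discovered encoding is not well defined.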
       In addition to raw (unencoded) data, GetData supports nine other
       encoding schemes: text encoding, bzip2 encoding, flac encoding, gzip
       encoding, lzma encoding, sie (sample-index encoding), slim encoding,
       zzip encoding, and zzslim encoding, all discussed below.

       The text encoding and the sample-index encoding are implemented by
       GetData natively and need no external library.  As a result, they are
       always present in the library.

   Out-of-place writes
       Some of the encodings listed below support writing only via
       out-of-place writes; that is, raw files are written in a temporary
       location and only moved into place when closed.  As a result, writing
       to these encodings requires making a copy of the whole binary data
       file.  A further side effect is that a third party trying to read a
       Dirfile concurrently while it is being written to using one of these
       encodings will usually fail.

       Within GetData, reading from a field so encoded after writing to it
       will cause writing to the temporary file to be finished and the file
       to be moved into place before the read occurs, which may take some
       time.  Encodings which perform out-of-place writes are: bzip2, flac,
       gzip, and lzma.

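       The general pattern of an out-of-place write can be sketched in
       Python as follows.  This is an illustration of the technique only,
       not GetData's implementation: the function name, the temporary file
       naming, and the choice of gzip as the example compressor are all
       assumptions.

```python
import gzip
import os
import tempfile


def append_out_of_place(path, new_bytes):
    """Append to a gzip file by rewriting it in a temporary file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as raw, gzip.open(raw, "wb") as out:
            # Copy the whole existing data file first ...
            if os.path.exists(path):
                with gzip.open(path, "rb") as old:
                    for chunk in iter(lambda: old.read(1 << 20), b""):
                        out.write(chunk)
            # ... then add the new data.
            out.write(new_bytes)
        # Readers see the new contents only after this rename.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as demo_dir:
    target = os.path.join(demo_dir, "field.gz")
    append_out_of_place(target, b"\x01\x02")
    append_out_of_place(target, b"\x03")
    with gzip.open(target, "rb") as f:
        print(f.read())
```

       Every append copies all existing data, which is why writes to these
       encodings cost time proportional to the size of the whole file.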
   BZip2 Encoding
       The BZip2 Encoding compresses raw binary files using the
       Burrows-Wheeler block sorting text compression algorithm and Huffman
       coding, as implemented in the bzip2 format.  GetData's BZip2 Encoding
       scheme is implemented through the bzip2 compression library written by
       Julian Seward.  All operations are supported by the BZip2 Encoding,
       but writing occurs out-of-place.  See the Out-of-place writes section
       above for details.

       GetData caches one megabyte of uncompressed data at a time to speed
       up access.  A call to gd_nframes(3) requires decompression of the
       entire binary file to determine its uncompressed size, and may take
       some time to complete.  The file extension of the BZip2 Encoding is
       .bz2.

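       The cost of counting frames in a bzip2 stream can be seen in a short
       Python sketch using the standard bz2 module.  The function names and
       the one-megabyte chunk size are illustrative assumptions; the point is
       only that the uncompressed length is unknown until the whole stream
       has been decompressed.

```python
import bz2
import io


def uncompressed_size(fileobj, chunk=1 << 20):
    """Count uncompressed bytes by decompressing the whole stream."""
    total = 0
    with bz2.open(fileobj, "rb") as stream:
        for block in iter(lambda: stream.read(chunk), b""):
            total += len(block)
    return total


def frame_count(fileobj, bytes_per_frame):
    """Number of complete frames held in the compressed stream."""
    return uncompressed_size(fileobj) // bytes_per_frame


# 1000 frames of 8 bytes each, compressed in memory for the demonstration.
blob = bz2.compress(bytes(8000))
print(frame_count(io.BytesIO(blob), 8))
```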
   FLAC Encoding
       The FLAC Encoding compresses raw binary files using the Free Lossless
       Audio Codec.  GetData's FLAC Encoding scheme is implemented through
       the libFLAC reference implementation developed by Josh Coalson and the
       Xiph.Org Foundation.  All operations are supported by the FLAC
       Encoding, but writing occurs out-of-place.  See the Out-of-place
       writes section above for details.

       The FLAC format permits samples only up to 32 bits wide, but the
       libFLAC reference codec can handle samples only up to 24 bits wide.
       GetData gets around this by slicing data wider than 16 bits into
       multiple channels (2, 4, or 8, depending on width).  For big-endian
       data, the most significant 16 bits are in channel 0, the second 16
       bits in channel 1, and so on.  For little-endian data, this is
       reversed, with the least significant word in channel 0.

       The sample rate specified in the FLAC header is ignored and may be any
       valid value.  FLAC files written by GetData use a sample rate of 1 Hz.
       The file extension of the FLAC Encoding is .flac.  The Ogg container
       format is not supported.

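       The channel slicing described above can be sketched in Python.  This
       is one reading of the text, not code taken from GetData: the function
       name is hypothetical, words are treated as unsigned, and no FLAC
       encoding is performed.  Under this reading, channel k simply holds
       the k-th 16-bit word of each sample in storage order.

```python
def slice_into_channels(raw, sample_bytes, byteorder):
    """Split each sample into 16-bit words, one list of words per channel.

    `raw` holds samples as stored on disk; `byteorder` is "big" or "little".
    """
    if sample_bytes % 2 or len(raw) % sample_bytes:
        raise ValueError("samples must be a whole number of 16-bit words")
    n_channels = sample_bytes // 2
    channels = [[] for _ in range(n_channels)]
    for start in range(0, len(raw), sample_bytes):
        sample = raw[start:start + sample_bytes]
        for ch in range(n_channels):
            # In storage order, word 0 is the most significant word of
            # big-endian data and the least significant word of
            # little-endian data, matching the channel order in the text.
            word = sample[2 * ch:2 * ch + 2]
            channels[ch].append(int.from_bytes(word, byteorder))
    return channels


value = 0x11223344
print(slice_into_channels(value.to_bytes(4, "big"), 4, "big"))
print(slice_into_channels(value.to_bytes(4, "little"), 4, "little"))
```

       A 32-bit sample yields two channels and a 64-bit sample yields four,
       consistent with the channel counts given above.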
   GZip Encoding
       The GZip Encoding compresses raw binary files using Lempel-Ziv coding
       (LZ77) as implemented in the gzip format.  GetData's GZip Encoding
       scheme is implemented through the zlib compression library written by
       Jean-loup Gailly and Mark Adler.  All operations are supported by the
       GZip Encoding, but writing occurs out-of-place.  See the Out-of-place
       writes section above for details.

       To speed the operation of gd_nframes(3), the GZip Encoding takes the
       uncompressed size of the file from the gzip footer, which contains the
       file's uncompressed size in bytes, modulo 2**32.  As a result, using a
       field with an (uncompressed) binary file size larger than 4 GiB as the
       reference field will result in the wrong number of frames being
       reported.  The file extension of the GZip Encoding is .gz.

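       The footer arithmetic can be checked with a few lines of Python.  The
       gzip format stores the uncompressed size in the last four bytes of
       the file as a little-endian unsigned integer; the helper names and
       the 8-byte frame size below are assumptions made for the example, and
       the oversized file is simulated rather than created.

```python
import gzip
import struct


def isize_from_footer(gz_bytes):
    """Return the size field held in the last four bytes of a gzip file."""
    return struct.unpack("<I", gz_bytes[-4:])[0]


def frames_from_isize(isize, bytes_per_frame):
    """Frame count implied by the size recorded in the footer."""
    return isize // bytes_per_frame


blob = gzip.compress(bytes(8000))      # 1000 frames of 8 bytes each
print(frames_from_isize(isize_from_footer(blob), 8))

# The footer stores the size modulo 2**32, so large files wrap around.
true_size = 2**32 + 8000               # just over 4 GiB, not actually created
print(frames_from_isize(true_size % 2**32, 8))
```

       Both calls report the same frame count, although the second file
       would hold over half a billion frames more than the first.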
   LZMA Encoding
       The LZMA Encoding compresses raw binary files using the Lempel-Ziv
       Markov Chain Algorithm (LZMA) as implemented in the xz container
       format.  GetData's LZMA Encoding scheme is implemented through the
       lzma library, part of the XZ Utils suite written by Lasse Collin,
       Ville Koskinen, and Igor Pavlov.  All operations are supported by the
       LZMA Encoding, but writing occurs out-of-place.  See the Out-of-place
       writes section above for details.  Writing is supported only for the
       .xz container format, and not for the obsolete .lzma format, which can
       still be read.

       GetData caches one megabyte of uncompressed data at a time to speed
       up access.  A call to gd_nframes(3) requires decompression of the
       entire binary file to determine its uncompressed size, and may take
       some time to complete.  The file extension of the LZMA Encoding is
       .xz or .lzma.

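       The two container formats can be compared using Python's standard
       lzma module.  This sketch says nothing about GetData's internals; it
       only shows that a single decoder can read both containers and that
       the .xz container is recognisable by its leading magic bytes.  The
       variable names are illustrative.

```python
import lzma

data = bytes(range(16)) * 4

xz_blob = lzma.compress(data, format=lzma.FORMAT_XZ)         # .xz container
legacy_blob = lzma.compress(data, format=lzma.FORMAT_ALONE)  # legacy .lzma

# FORMAT_AUTO (the default for decompression) accepts either container.
for blob in (xz_blob, legacy_blob):
    assert lzma.decompress(blob) == data

# The .xz container begins with a fixed six-byte magic number.
print(xz_blob[:6] == b"\xfd7zXZ\x00")
```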
   Sample-Index Encoding
       The Sample-Index Encoding (SIE) compresses raw binary data by
       replacing runs of repeated data, similar to run-length encoding.  SIE
       files contain binary records consisting of a 64-bit sample number
       followed by a datum (the size and format of which are determined by
       the RAW field's data type in the format metadata).  The sample number
       indicates the last sample of the field which has the specified value.
       The first sample with the value is the sample immediately following
       the data in the previous record, or sample number zero, for the first
       record.  Sample numbers are relative to any /FRAMEOFFSET specified in
       the Dirfile metadata.  All operations are supported by the
       Sample-Index Encoding.  The file extension of the Sample-Index
       Encoding is .sie.

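       The record layout described above can be sketched in Python with the
       struct module.  Several details here are assumptions, not statements
       about GetData: the records are taken to be little-endian and
       unpadded, the sample number is read as a signed 64-bit integer, the
       datum is an unsigned 8-bit value, and no /FRAMEOFFSET is applied.

```python
import struct


def sie_encode(samples, datum_format="B", byteorder="<"):
    """Emit one (last sample number, value) record per run of equal values."""
    record = struct.Struct(byteorder + "q" + datum_format)
    out = bytearray()
    for index, value in enumerate(samples):
        run_ends_here = (index + 1 == len(samples)
                         or samples[index + 1] != value)
        if run_ends_here:
            out += record.pack(index, value)
    return bytes(out)


def sie_decode(blob, datum_format="B", byteorder="<"):
    """Expand SIE-style records into a flat list of samples."""
    record = struct.Struct(byteorder + "q" + datum_format)
    samples = []
    for last_sample, value in record.iter_unpack(blob):
        # Each record's value runs up to and including `last_sample`.
        samples.extend([value] * (last_sample + 1 - len(samples)))
    return samples


samples = [5, 5, 5, 7, 7]
blob = sie_encode(samples)
print(len(blob), sie_decode(blob))
```

       Five samples collapse into two 9-byte records here; data which
       changes on every sample would instead grow, since each record carries
       an 8-byte sample number.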
   Slim Encoding
       The Slim Encoding reads raw binary files compressed using the slimlib
       compression library written by Joseph Fowler.  The slimlib library was
       developed at Princeton University to compress dirfile-like data.
       GetData's Slim Encoding framework currently lacks write capabilities;
       as a result, the Slim Encoding does not support functions which modify
       binary files.  The file extension of the Slim Encoding is .slm.

       Using the Slim Encoding with GetData may result in unexpected, but
       manageable, memory usage.  See the gd_getdata(3) manual page for
       details.

   Text Encoding
       The Text Encoding replaces the binary data files with 7-bit ASCII
       files containing a decimal text encoding of the data, one sample per
       line.  All operations are supported by the Text Encoding.  The file
       extension of the Text Encoding is .txt.

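       A round trip through a one-sample-per-line decimal text file might
       look like the following Python sketch.  The function names are
       hypothetical, and only integer data is shown; how GetData formats
       floating-point or complex samples is not described on this page.

```python
def text_encode(samples):
    """Render samples as decimal ASCII text, one sample per line."""
    return "".join(f"{s}\n" for s in samples).encode("ascii")


def text_decode(blob, convert=int):
    """Parse one decimal sample per line back into a list."""
    return [convert(line) for line in blob.decode("ascii").splitlines()]


blob = text_encode([1, -2, 300])
print(blob)
print(text_decode(blob))
```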
   ZZip Encoding
       The ZZip Encoding reads raw binary files compressed using the DEFLATE
       algorithm as implemented in the PKWARE ZIP archive container format.
       GetData's ZZip Encoding scheme is implemented through the zzip library
       written by Tomi Ollila and Guido Draheim.  The ZZip Encoding framework
       currently lacks write capabilities; as a result, the ZZip Encoding
       does not support functions which modify binary data.

       Unlike most encoding schemes, the ZZip Encoding merges all binary data
       files defined in a given fragment into a single ZIP archive.  The name
       of this archive is raw.zip by default, but a different name may be
       specified using the second parameter to the /ENCODING directive.  For
       example,

              /ENCODING zzip archive

       indicates that the ZIP archive is called archive.zip.  The file
       extension of the ZZip Encoding is .zip.

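       Since ZIP is a standard container, such an archive can be prepared
       with ordinary tools.  The Python sketch below builds a stand-in for
       raw.zip in memory and reads one field back.  It assumes that each
       archive member is named after its field with no extension; that
       naming, the field names, and the helper function are illustrative
       assumptions, not details taken from this page.

```python
import io
import zipfile


def read_field_from_archive(archive, field):
    """Return the raw bytes of one field stored in a ZIP archive."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(field)


# Build a stand-in for raw.zip in memory holding two fields.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("temp", bytes([1, 2, 3, 4]))
    zf.writestr("pressure", bytes([9, 9]))

print(read_field_from_archive(buffer, "pressure"))
```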
   ZZSlim Encoding
       The ZZSlim Encoding is a combination of the Slim Encoding and the ZZip
       Encoding.  To create ZZSlim Encoded files, the raw data are first
       compressed using the slim library, and these slim-compressed files are
       then archived (and compressed again) into a ZIP archive.  As with the
       ZZip Encoding, the ZIP archive is raw.zip by default, but a different
       name may be specified with the /ENCODING directive.

       Notably, since the archives have the same name as ZZip Encoded data,
       automatic encoding detection on ZZSlim Encoded data always fails: the
       data are incorrectly identified as simply ZZip Encoded.  As a result,
       an /ENCODING directive in the format file or else a GD_ZZSLIM_ENCODED
       flag passed to gd_open(3) is required to read ZZSlim Encoded data.
       The file extension of the ZZSlim Encoding is .zip.

       Using the ZZSlim Encoding with GetData may result in unexpected, but
       manageable, memory usage.  See the gd_getdata(3) manual page for
       details.

AUTHOR

       This manual page was written by D. V. Wiebe <dvw@ketiltrout.net>.

SEE ALSO

       bzip2(1), flac(1), gzip(1), xz(1), zlib(3), dirfile(5),
       dirfile-format(5)

Standards Version 9             15 October 2015            dirfile-encoding(5)