dirfile-encoding(5)            DATA FORMATS            dirfile-encoding(5)

NAME
dirfile-encoding - dirfile database encoding schemes

DESCRIPTION
The Dirfile Standards indicate that RAW fields defined in the database
are accompanied by binary files containing the field data in the
specified simple data type. In certain situations, it may be
advantageous to convert the binary files in the database into a more
convenient form. This is accomplished by encoding the binary file into
the alternate form. A common use case for encoding a binary file is to
compress it to save disk space. Only data is modified by an encoding
scheme. Database metadata is never encoded.

Support for encoding schemes is optional. An implementation need not
support any particular encoding scheme, or may support only certain
operations with it, but should expect to encounter unknown encoding
schemes and fail gracefully in such situations.

Additionally, how a particular encoding is implemented is not specified
by the Dirfile Standards, but, for purposes of interoperability, all
dirfile implementations are encouraged to support the encoding
implementation used by the GetData dirfile reference implementation,
elaborated below.

An encoding scheme is local to the particular format specification
fragment in which it is indicated. This allows a single dirfile to
have binary files which are stored using multiple encodings, by having
them defined in multiple fragments.

The rest of this manual page discusses specifics of the encoding
framework implemented in the GetData library, and does not constitute
part of the Dirfile Standards.


The GetData library provides an encoding framework which abstracts
binary file I/O, allowing for generic support for a wide variety of
encoding schemes. Functions which may make use of the encoding
framework are:

    gd_add(3), gd_add_raw(3), gd_add_spec(3), gd_alter_encoding(3),
    gd_alter_endianness(3), gd_alter_frameoffset(3),
    gd_alter_entry(3), gd_alter_raw(3), gd_alter_spec(3),
    gd_flush(3), gd_getdata(3), gd_malter_spec(3), gd_move(3),
    gd_nframes(3), gd_putdata(3), gd_raw_close(3), gd_rename(3), and
    gd_sync(3).

Most of the encodings supported by GetData are implemented through
external libraries which handle the actual file I/O and data
translation. All such libraries are optional; a build of the library
which omits an external library will lack support for the associated
encoding scheme. In this case, GetData will still properly identify
the encoding scheme, but attempts to use GetData for file I/O via the
encoding will fail with the GD_E_UNSUPPORTED error code.

GetData discovers the encoding scheme of a particular RAW field by
noting the filename extension of files associated with the field.
Binary files which form an unencoded dirfile have no file extension.
The file extensions used by the other encodings are noted below.
Encoding discovery proceeds by searching for files with the known list
of file extensions (in an unspecified order) and stopping when the
first successful match is made. Because of this, when a field has
multiple data files with different, supported file extensions which
could legitimately be associated with it, the encoding scheme
discovered by GetData is not well defined.
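
The discovery procedure described above can be sketched in C as
follows. This is a minimal illustration, not GetData's actual code: the
function name discover_ext, the known_ext table, and the idea of
passing a directory listing as an array of strings are all assumptions
made for the example. The table holds the per-field extensions given in
this manual page, with the empty string standing for unencoded data;
.zip is left out because ZIP archives are not named after the field.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Per-field extensions named in this manual page. The empty string
 * stands for unencoded data. The order is arbitrary (an assumption of
 * this sketch), mirroring the "unspecified order" of the real search. */
static const char *const known_ext[] = {
    "", ".txt", ".bz2", ".flac", ".gz", ".xz", ".lzma", ".sie", ".slm"
};

/* Return the first known extension for which "<field><ext>" appears in
 * `files` (a hypothetical directory listing), or NULL if none does. */
const char *discover_ext(const char *field, const char *const *files,
                         size_t nfiles)
{
    assert(field != NULL);
    size_t flen = strlen(field);

    for (size_t e = 0; e < sizeof known_ext / sizeof known_ext[0]; e++) {
        for (size_t i = 0; i < nfiles; i++) {
            /* The name must be the field name followed by exactly
             * this extension. */
            if (strncmp(files[i], field, flen) == 0 &&
                strcmp(files[i] + flen, known_ext[e]) == 0)
                return known_ext[e];
        }
    }
    return NULL;
}
```

If the listing holds both data and data.gz, the sketch returns whichever
extension comes first in its table. A different table order would give
a different answer, which is why the result is not well defined when
several candidate files exist.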

In addition to raw (unencoded) data, GetData supports nine other
encoding schemes: text encoding, bzip2 encoding, flac encoding, gzip
encoding, lzma encoding, sie (sample-index encoding), slim encoding,
zzip encoding, and zzslim encoding, all discussed below.

The text encoding and the sample-index encoding are implemented by
GetData natively and need no external library. As a result, they are
always present in the library.


Out-of-place writes
Some of the encodings listed below only support writing via
out-of-place writes; that is, raw files are written in a temporary
location and only moved into place when closed. As a result, writing
to these encodings requires making a copy of the whole binary data
file. A further side effect is that concurrent reads by a third party
of a Dirfile which is being written to using one of these encodings
usually do not work.

Within GetData, reading from a field so encoded after writing to it
will cause writing to the temporary file to be finished and the file
to be moved into place before the read occurs, which may take some
time. Encodings which perform out-of-place writes are: bzip2, flac,
gzip, and lzma.
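
The general out-of-place pattern can be sketched as follows. This is an
illustration under stated assumptions rather than GetData's
implementation: the function name write_out_of_place and the .tmp
suffix are hypothetical, and the sketch relies on rename() replacing an
existing file, as it does on POSIX systems.

```c
#include <assert.h>
#include <stdio.h>

/* Write `len` bytes to `path` out of place: the bytes go to a temporary
 * file first, and that file replaces `path` only once it is complete.
 * Returns 0 on success and -1 on failure. */
int write_out_of_place(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    char tmp[4096];
    /* Hypothetical temporary name: the target name plus ".tmp". */
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *fp = fopen(tmp, "wb");
    if (fp == NULL)
        return -1;
    if (fwrite(buf, 1, len, fp) != len) {
        fclose(fp);
        remove(tmp);
        return -1;
    }
    if (fclose(fp) != 0) {
        remove(tmp);
        return -1;
    }

    /* Up to here, anyone reading `path` sees the old contents, if any. */
    if (rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

The sketch shows both costs noted above: the whole file is written a
second time, and a concurrent reader sees nothing new until the final
rename.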


BZip2 Encoding
The BZip2 Encoding reads compressed raw binary files using the
Burrows-Wheeler block sorting text compression algorithm and Huffman
coding, as implemented in the bzip2 format. GetData's BZip2 Encoding
scheme is implemented through the bzip2 compression library written by
Julian Seward. All operations are supported by the BZip2 Encoding, but
writing occurs out-of-place. See the Out-of-place writes section above
for details.

GetData caches an uncompressed megabyte of data at a time to speed
access times. A call to gd_nframes(3) requires decompression of the
entire binary file to determine its uncompressed size, and may take
some time to complete. The file extension of the BZip2 Encoding is
.bz2.


FLAC Encoding
The FLAC Encoding compresses raw binary files using the Free Lossless
Audio Codec. GetData's FLAC Encoding scheme is implemented through the
libFLAC reference implementation developed by Josh Coalson and the
Xiph.Org Foundation. All operations are supported by the FLAC
Encoding, but writing occurs out-of-place. See the Out-of-place writes
section above for details.

The FLAC format only permits samples up to 32 bits wide, but the
libFLAC reference codec can only handle samples up to 24 bits. GetData
gets around this by slicing data that is wider than 16 bits into
multiple channels (2, 4, or 8, depending on width). For big-endian
data, the most-significant 16 bits are in channel 0, the second 16 bits
in channel 1, and so on. For little-endian data, this is reversed,
with the least-significant word in channel 0.
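
The channel ordering described above can be sketched in C as follows.
The function name slice_sample is hypothetical, and the sketch only
shows how 16-bit words are assigned to channels for 32-bit and 64-bit
samples. It makes no claim about how GetData treats signedness or how
it hands the channels to libFLAC.

```c
#include <assert.h>
#include <stdint.h>

/* Split a sample into `nchan` 16-bit words, one per channel.
 * nchan is 2 for a 32-bit sample and 4 for a 64-bit sample.
 * big_endian != 0: channel 0 holds the most-significant word.
 * big_endian == 0: channel 0 holds the least-significant word. */
void slice_sample(uint64_t sample, int nchan, int big_endian,
                  uint16_t *chan)
{
    assert(nchan == 2 || nchan == 4);

    for (int i = 0; i < nchan; i++) {
        /* Word 0 is the least-significant 16 bits of the sample. */
        int word = big_endian ? nchan - 1 - i : i;
        chan[i] = (uint16_t)(sample >> (16 * word));
    }
}
```

For the 32-bit value 0x12345678, the big-endian case puts 0x1234 in
channel 0 and 0x5678 in channel 1; the little-endian case swaps them.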

The sample rate specified in the FLAC header is ignored and may be any
valid value. FLAC files written by GetData use a sample rate of 1 Hz.
The file extension of the FLAC Encoding is .flac. The Ogg container
format is not supported.


GZip Encoding
The GZip Encoding compresses raw binary files using Lempel-Ziv coding
(LZ77) as implemented in the gzip format. GetData's GZip Encoding
scheme is implemented through the zlib compression library written by
Jean-loup Gailly and Mark Adler. All operations are supported by the
GZip Encoding, but writing occurs out-of-place. See the Out-of-place
writes section above for details.

To speed the operation of gd_nframes(3), the GZip Encoding takes the
uncompressed size of the file from the gzip footer, which contains the
file's uncompressed size in bytes, modulo 2**32. As a result, using a
field with an (uncompressed) binary file size of 4 GiB or larger as the
reference field will result in the wrong number of frames being
reported. The file extension of the GZip Encoding is .gz.
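
The wrap-around can be seen in a short C sketch. The gzip format (RFC
1952) stores the uncompressed size in the last four bytes of the file
as a little-endian 32-bit integer. The function names gzip_isize and
wrapped_size are hypothetical and are not part of GetData or zlib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the size field from the last four bytes of a gzip file image:
 * the uncompressed size modulo 2**32, stored little-endian. */
uint32_t gzip_isize(const unsigned char *file, size_t len)
{
    assert(file != NULL && len >= 4);

    const unsigned char *p = file + len - 4;
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* The value the footer holds for a file of `nbytes` uncompressed
 * bytes: everything above the low 32 bits is lost. */
uint32_t wrapped_size(uint64_t nbytes)
{
    return (uint32_t)(nbytes & 0xFFFFFFFFu);
}
```

For a 5 GiB file, wrapped_size() yields 1 GiB, so a frame count derived
from the footer would be far too small.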


LZMA Encoding
The LZMA Encoding reads compressed raw binary files using the
Lempel-Ziv Markov Chain Algorithm (LZMA) as implemented in the xz
container format. GetData's LZMA Encoding scheme is implemented
through the lzma library, part of the XZ Utils suite written by Lasse
Collin, Ville Koskinen, and Igor Pavlov. All operations are supported
by the LZMA Encoding, but writing occurs out-of-place. See the
Out-of-place writes section above for details. Writing is supported
only for the .xz container format, and not for the obsolete .lzma
format, which can still be read.

GetData caches an uncompressed megabyte of data at a time to speed
access times. A call to gd_nframes(3) requires decompression of the
entire binary file to determine its uncompressed size, and may take
some time to complete. The file extension of the LZMA Encoding is .xz
or .lzma.


Sample-Index Encoding
The Sample-Index Encoding (SIE) compresses raw binary data by replacing
runs of repeated data, similar to run-length encoding. SIE files
contain binary records consisting of a 64-bit sample number followed by
a datum (the size and format of which is determined by the RAW field's
data type in the format metadata). The sample number indicates the
last sample of the field which has the specified value. The first
sample with the value is the sample immediately following the last
sample of the previous record, or sample number zero, for the first
record. Sample numbers are relative to any /FRAMEOFFSET specified in
the Dirfile metadata. All operations are supported by the Sample-Index
Encoding. The file extension of the Sample-Index Encoding is .sie.
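
The record semantics can be sketched in C as follows. The sketch works
on records that have already been parsed, so it does not address the
on-disk byte layout. The names sie_record and sie_expand are
hypothetical, the datum is fixed to int32_t purely for illustration,
and the records are assumed to be in increasing sample order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One already-parsed SIE record. On disk the datum's type follows the
 * RAW field's data type; int32_t is used here only as an example. */
struct sie_record {
    uint64_t last_sample; /* last sample which holds `value` */
    int32_t  value;
};

/* Expand records into out[0..n-1]. Returns the number of samples
 * filled, which is less than n if the records end before sample n-1. */
size_t sie_expand(const struct sie_record *rec, size_t nrec,
                  int32_t *out, size_t n)
{
    size_t s = 0; /* next sample to fill; the first run starts at 0 */

    for (size_t r = 0; r < nrec && s < n; r++) {
        /* Assumption of this sketch: strictly increasing order. */
        assert(r == 0 || rec[r].last_sample > rec[r - 1].last_sample);

        /* Each run extends from the sample after the previous run up
         * to and including last_sample. */
        while (s < n && s <= rec[r].last_sample)
            out[s++] = rec[r].value;
    }
    return s;
}
```

With the records (2, 10), (3, 20), (6, 30), samples 0 to 2 hold 10,
sample 3 holds 20, and samples 4 to 6 hold 30.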


Slim Encoding
The Slim Encoding reads compressed raw binary files using the slimlib
compression library written by Joseph Fowler. The slimlib library was
developed at Princeton University to compress dirfile-like data.
GetData's Slim Encoding framework currently lacks write capabilities;
as a result, the Slim Encoding does not support functions which modify
binary files. The file extension of the Slim Encoding is .slm.

Using the Slim Encoding with GetData may result in unexpected, but
manageable, memory usage. See the gd_getdata(3) manual page for
details.


Text Encoding
The Text Encoding replaces the binary data files with 7-bit ASCII files
containing a decimal text encoding of the data, one sample per line.
All operations are supported by the Text Encoding. The file extension
of the Text Encoding is .txt.
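
Reading such a file can be sketched in C as follows. The function name
text_decode is hypothetical, and the sketch handles integer samples
only; how GetData formats or parses floating-point and complex samples
is not covered here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Parse up to n decimal integers, one per line, from NUL-terminated
 * text. Returns the number of samples parsed. */
size_t text_decode(const char *text, long *out, size_t n)
{
    assert(text != NULL && out != NULL);

    size_t count = 0;
    while (count < n && *text != '\0') {
        char *end;
        long v = strtol(text, &end, 10);
        if (end == text)
            break; /* no number found: stop */
        out[count++] = v;

        /* Skip the remainder of the line, including the newline. */
        while (*end != '\0' && *end != '\n')
            end++;
        if (*end == '\n')
            end++;
        text = end;
    }
    return count;
}
```

Given the text "1\n-2\n300\n", the sketch fills the output array with
1, -2, and 300.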


ZZip Encoding
The ZZip Encoding reads compressed raw binary files using the DEFLATE
algorithm as implemented in the PKWARE ZIP archive container format.
GetData's ZZip Encoding scheme is implemented through the zzip library
written by Tomi Ollila and Guido Draheim. The ZZip Encoding framework
currently lacks write capabilities; as a result the ZZip Encoding does
not support functions which modify binary data.

Unlike most encoding schemes, the ZZip encoding merges all binary data
files defined in a given fragment into a single ZIP archive. The name
of this archive is raw.zip by default, but a different name may be
specified using the second parameter to the /ENCODING directive. For
example,

    /ENCODING zzip archive

indicates that the ZIP archive is called archive.zip. The file
extension of the ZZip Encoding is .zip.
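
The naming rule can be expressed as a small C helper. The function
name zzip_archive_name is hypothetical and is not part of the GetData
API. It takes the optional second parameter of the /ENCODING directive
and produces the archive file name described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Build the ZIP archive name for a fragment. `param` is the optional
 * second parameter of the /ENCODING directive; NULL or an empty string
 * selects the default name. Returns 0 on success and -1 if the name
 * does not fit in `buf`. */
int zzip_archive_name(const char *param, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    const char *base = (param != NULL && strlen(param) > 0) ? param : "raw";
    int n = snprintf(buf, len, "%s.zip", base);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling the helper with NULL yields raw.zip, and calling it with
"archive" yields archive.zip, matching the directive shown above.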


ZZSlim Encoding
The ZZSlim Encoding is a combination of the Slim Encoding and the ZZip
Encoding. To create ZZSlim Encoded files, first the raw data are
compressed using the slim library, and then these slim-compressed files
are archived (and compressed again) into a ZIP archive. As with the
ZZip Encoding, the ZIP archive is raw.zip by default, but a different
name may be specified with the /ENCODING directive.

Notably, since the archives have the same name as ZZip Encoded data,
automatic encoding detection on ZZSlim Encoded data always fails: the
data are incorrectly identified as simply ZZip Encoded. As a result,
an /ENCODING directive in the format file or else a GD_ZZSLIM_ENCODED
flag passed to gd_open(3) is required to read ZZSlim encoded data. The
file extension of the ZZSlim Encoding is .zip.

Using the ZZSlim Encoding with GetData may result in unexpected, but
manageable, memory usage. See the gd_getdata(3) manual page for
details.

AUTHOR
This manual page was written by D. V. Wiebe <dvw@ketiltrout.net>.

SEE ALSO
bzip2(1), flac(1), gzip(1), xz(1), zlib(3), dirfile(5),
dirfile-format(5)

Standards Version 9            15 October 2015            dirfile-encoding(5)