dirfile-encoding(5)          DATA FORMATS          dirfile-encoding(5)

dirfile-encoding - dirfile database encoding schemes

The Dirfile Standards indicate that RAW fields defined in the database
are accompanied by binary files containing the field data in the
specified simple data type. In certain situations, it may be
advantageous to convert the binary files in the database into a more
convenient form. This is accomplished by encoding the binary file into
the alternate form. A common use-case for encoding a binary file is to
compress it to save disk space. Only data is modified by an encoding
scheme. Database metadata is unaffected.

Support for encoding schemes is optional. An implementation need not
support any particular encoding scheme, or may only support certain
operations with it, but should expect to encounter unknown encoding
schemes and fail gracefully in such situations.

Additionally, how a particular encoding is implemented is not specified
by the Dirfile Standards, but, for purposes of interoperability, all
dirfile implementations are encouraged to support the encoding
implementation used by the GetData dirfile reference implementation,
elaborated below.

An encoding scheme is local to the particular format file fragment in
which it is indicated. This allows a single dirfile to have binary
files which are stored using multiple encodings, by having them defined
in multiple fragments.

The rest of this manual page discusses specifics of the encoding
framework implemented in the GetData library, and does not constitute
part of the Dirfile Standards.

The GetData library provides an encoding framework which abstracts
binary file I/O, allowing for generic support for a wide variety of
encoding schemes. Functions which may make use of the encoding
framework are:

dirfile_add(3), dirfile_add_raw(3), dirfile_add_spec(3),
dirfile_alter_encoding(3), dirfile_alter_endianness(3),
dirfile_alter_frameoffset(3), dirfile_alter_entry(3),
dirfile_alter_raw(3), dirfile_alter_spec(3), dirfile_move(3),
dirfile_rename(3), getdata(3), get_nframes(3), and putdata(3).

Most of the encodings supported by GetData are implemented through
external libraries which handle the actual file I/O and data
translation. All such libraries are optional; a build of the library
which omits an external library will lack support for the associated
encoding scheme. In this case, GetData will still properly identify
the encoding scheme, but attempts to use GetData for file I/O via the
encoding will fail with the GD_E_UNSUPPORTED error code.

GetData discovers the encoding scheme of a particular RAW field by
noting the filename extension of files associated with the field.
Binary files which form an unencoded dirfile have no file extension.
The file extensions used by the other encodings are noted below.
Encoding discovery proceeds by searching for files with the known list
of file extensions (in an unspecified order) and stopping when the
first successful match is made. Because of this, when a field has
multiple data files with different, supported file extensions which
could legitimately be associated with it, the encoding scheme
discovered by GetData is not well defined.

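The discovery rule above can be sketched in a few lines of Python.
This is an illustration only, not GetData's implementation: the
function name discover_encodings and the scheme labels are
hypothetical, and the extension table is simply the list given in this
manual page. Unlike GetData, which stops at the first match, the
sketch collects every match so that the ambiguous case is visible.

```python
import os

# Extensions listed in this manual page; "" stands for unencoded data.
# The scheme labels on the right are hypothetical names for this sketch.
KNOWN_EXTENSIONS = {
    "": "none",
    ".txt": "text",
    ".bz2": "bzip2",
    ".gz": "gzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".slm": "slim",
}


def discover_encodings(dirfile_path, field_name):
    """Return the sorted list of schemes for which a data file exists."""
    found = set()
    for extension, scheme in KNOWN_EXTENSIONS.items():
        candidate = os.path.join(dirfile_path, field_name + extension)
        if os.path.isfile(candidate):
            found.add(scheme)
    return sorted(found)
```

A result with more than one entry corresponds to the case in which the
encoding discovered by GetData is not well defined.
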
In addition to raw (unencoded) data, GetData supports five other
encoding schemes: text encoding, bzip2 encoding, gzip encoding, lzma
encoding, and slim encoding, all discussed below.

Text Encoding
The Text Encoding is unique among GetData encoding schemes in that it
requires no external library. As a result, all builds of the library
contain full support for this encoding. It is meant to serve as a
reference encoding and example of the encoding framework for work on
other encoding schemes.

The Text Encoding replaces the binary data files with 7-bit ASCII files
containing a decimal text encoding of the data, one sample per line.
All operations are supported by the Text Encoding. The file extension
of the Text Encoding is .txt.

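As an illustration of the one-sample-per-line layout, the following
Python sketch converts between raw integer samples and decimal text.
The helper names, the choice of little-endian 16-bit integers, and the
exact decimal formatting are assumptions made for the example;
GetData's own formatting may differ.

```python
import struct


def raw_to_text(raw, fmt="<h"):
    """Render fixed-size integer samples as decimal text, one per line."""
    size = struct.calcsize(fmt)
    if len(raw) % size != 0:
        raise ValueError("raw data is not a whole number of samples")
    return "".join(f"{value}\n" for (value,) in struct.iter_unpack(fmt, raw))


def text_to_raw(text, fmt="<h"):
    """Pack decimal integers, one per line, back into binary samples."""
    return b"".join(struct.pack(fmt, int(line)) for line in text.splitlines())


raw = struct.pack("<3h", 1, -2, 300)
print(raw_to_text(raw), end="")  # three lines: 1, -2 and 300
```

The sketch handles integer samples only; floating-point data would
need a different parser than int().
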
BZip2 Encoding
The BZip2 Encoding compresses raw binary files using the
Burrows-Wheeler block sorting text compression algorithm and Huffman
coding, as implemented in the bzip2 format. GetData's BZip2 Encoding
scheme is implemented through the bzip2 compression library written by
Julian Seward. GetData's BZip2 Encoding framework currently lacks
write capabilities; as a result, the BZip2 Encoding does not support
functions which modify binary data.

GetData caches an uncompressed megabyte of data at a time to speed
access times. A call to get_nframes(3) requires decompression of the
entire binary file to determine its uncompressed size, and may take
some time to complete. The file extension of the BZip2 Encoding is
.bz2.

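The cost of get_nframes(3) on a bzip2-encoded field follows from the
format: the uncompressed length is not recorded in the file, so it has
to be counted. A minimal Python sketch of that count follows, using
the standard bz2 module and a hypothetical helper name. The
one-mebibyte chunk merely mirrors the cache size mentioned above and
is not taken from GetData's source.

```python
import bz2

CHUNK = 1 << 20  # read one mebibyte of uncompressed data at a time


def bz2_uncompressed_size(path, chunk=CHUNK):
    """Count uncompressed bytes by decompressing the whole file in chunks."""
    total = 0
    with bz2.open(path, "rb") as stream:
        while True:
            block = stream.read(chunk)
            if not block:
                break
            total += len(block)
    return total
```

The running time grows with the size of the whole file, which is why
the call may take some time on large fields.
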
GZip Encoding
The GZip Encoding compresses raw binary files using Lempel-Ziv coding
(LZ77) as implemented in the gzip format. GetData's GZip Encoding
scheme is implemented through the zlib compression library written by
Jean-loup Gailly and Mark Adler. GetData's GZip Encoding framework
currently lacks write capabilities; as a result, the GZip Encoding does
not support functions which modify binary data.

To speed the operation of get_nframes(3), the GZip Encoding reads the
gzip footer, which contains the file's uncompressed size in bytes,
modulo 2^32. As a result, using a field with an (uncompressed) binary
file size larger than 4 GiB as the reference field will result in the
wrong number of frames being reported. The file extension of the GZip
Encoding is .gz.

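The 4 GiB limitation can be seen directly from the gzip footer. The
Python sketch below reads the last four bytes of a gzip file (the
ISIZE field, a little-endian 32-bit integer) and shows how a larger
size would wrap. The helper names are hypothetical, and the sketch
assumes a file containing a single gzip member.

```python
import os
import struct


def gzip_footer_size(path):
    """Return ISIZE: the uncompressed size modulo 2**32 (last 4 bytes)."""
    with open(path, "rb") as stream:
        stream.seek(-4, os.SEEK_END)
        (isize,) = struct.unpack("<I", stream.read(4))
    return isize


def wrapped_size(true_size):
    """Size that a 32-bit footer would report for a given true size."""
    return true_size % 2**32


five_gib = 5 * 2**30
print(wrapped_size(five_gib) == 2**30)  # True: 5 GiB is reported as 1 GiB
```

Reading the footer takes constant time regardless of file size, which
is the speed advantage described above, at the price of the wrapped
value for very large files.
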
LZMA Encoding
The LZMA Encoding compresses raw binary files using the Lempel-Ziv
Markov Chain Algorithm (LZMA) as implemented in the xz container
format. GetData's LZMA Encoding scheme is implemented through the lzma
library, part of the XZ Utils suite written by Lasse Collin, Ville
Koskinen, and Igor Pavlov. GetData's LZMA Encoding framework currently
lacks write capabilities; as a result, the LZMA Encoding does not
support functions which modify binary data.

As with the BZip2 Encoding, GetData caches an uncompressed megabyte of
data at a time to speed access times. A call to get_nframes(3)
requires decompression of the entire binary file to determine its
uncompressed size, and may take some time to complete. The file
extension of the LZMA Encoding is either .xz or .lzma.

Slim Encoding
The Slim Encoding compresses raw binary files using the slimlib
compression library written by Joseph Fowler. The slimlib library was
developed at Princeton University to compress dirfile-like data.
GetData's Slim Encoding framework currently lacks write capabilities;
as a result, the Slim Encoding does not support functions which modify
binary files. The file extension of the Slim Encoding is .slm.

This manual page was written by D. V. Wiebe <dvw@ketiltrout.net>.

dirfile(5), dirfile-format(5), bzip2(1), gzip(1), zlib(3).

Standards Version 7          16 October 2009          dirfile-encoding(5)