dictzip(1)

DICTZIP(1)                                                          DICTZIP(1)

NAME

       dictzip, dictunzip - compress (or expand) files, allowing random access

SYNOPSIS

       dictzip [options] name
       dictunzip [options] name

DESCRIPTION

       dictzip compresses files using the gzip(1) algorithm (LZ77) in a manner
       which is completely compatible with the gzip file format.  An extension
       to the gzip file format (Extra Field, described in 2.3.1.1 of RFC 1952)
       allows extra data to be stored in the header of a compressed file.
       Programs like gzip and zcat will ignore this extra data.  However,
       dictd(8), the DICT protocol dictionary server, will make use of this
       data to perform pseudo-random access on the file.  Files in the dictzip
       format should end in ".dz" so that they may be distinguished from
       common gzip files that do not contain the special header information.

       From RFC 1952, the extra field is specified as follows:

              If the FLG.FEXTRA bit is set, an "extra field" is present in the
              header, with total length XLEN bytes.  It consists of a series
              of subfields, each of the form:

              +---+---+---+---+==================================+
              |SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
              +---+---+---+---+==================================+

              SI1 and SI2 provide a subfield ID, typically two ASCII letters
              with some mnemonic value.  Jean-Loup Gailly
              <gzip@prep.ai.mit.edu> is maintaining a registry of subfield
              IDs; please send him any subfield ID you wish to use.  Subfield
              IDs with SI2 = 0 are reserved for future use.

              LEN gives the length of the subfield data, excluding the 4
              initial bytes.

       The dictzip program uses 'R' for SI1, and 'A' for SI2 (i.e., "Random
       Access").  After the LEN field, the data is arranged as follows:

       +---+---+---+---+---+---+===============================+
       |  VER  | CHLEN | CHCNT |  ... CHCNT words of data ...  |
       +---+---+---+---+---+---+===============================+

       As per RFC 1952, all data is stored least-significant byte first.  For
       VER 1 of the data, all values are 16 bits long (2 bytes), and are
       unsigned integers.
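
       The layout above can be decoded with a few lines of Python.  This is a
       minimal sketch, not the dictzip implementation: the function name
       parse_ra_subfield and the returned dictionary keys are made up for
       illustration, and the input is assumed to be the raw XLEN bytes of the
       extra field, already read from the gzip header.

```python
import struct


def parse_ra_subfield(extra: bytes):
    """Locate the 'RA' subfield in a gzip extra field and decode VER 1 data.

    `extra` is assumed to hold exactly the XLEN bytes of the extra field.
    Returns None when no 'RA' subfield is present.
    """
    pos = 0
    while pos + 4 <= len(extra):
        # Each subfield starts with SI1, SI2 and a 16-bit little-endian LEN.
        si1, si2, length = struct.unpack_from("<BBH", extra, pos)
        data = extra[pos + 4 : pos + 4 + length]
        if (si1, si2) == (ord("R"), ord("A")):
            # VER, CHLEN and CHCNT are 16-bit unsigned, least-significant byte first.
            ver, chlen, chcnt = struct.unpack_from("<HHH", data, 0)
            if ver != 1:
                raise ValueError(f"unsupported RA version {ver}")
            # CHCNT 16-bit words follow: the compressed size of each chunk.
            sizes = list(struct.unpack_from(f"<{chcnt}H", data, 6))
            return {"ver": ver, "chlen": chlen, "chcnt": chcnt, "sizes": sizes}
        pos += 4 + length  # skip to the next subfield
    return None
```

       In a gzip file with FLG.FEXTRA set, the two-byte XLEN follows the
       10-byte fixed header, and the extra field follows XLEN.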

       XLEN (which is specified earlier in the header) is a two-byte integer,
       so the extra field can be 0xffff bytes long, 2 bytes of which are used
       for the subfield ID (SI1 and SI2), and 2 bytes of which are used for
       the subfield length (LEN).  This leaves 0xfffb bytes (0x7ffd 2-byte
       entries or 0x3ffe 4-byte entries).  Given that the zip output buffer
       must be 10% + 12 bytes larger than the input buffer, we can store 58969
       bytes per entry, or about 1.8GB if the 2-byte entries are used.  If
       this becomes a limiting factor, another format version can be defined
       and selected for 4-byte entries.
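
       The figures in this paragraph can be reproduced with a short
       calculation.  The sketch follows the text's own accounting, which
       subtracts only the 4 bytes of subfield ID and LEN from the extra
       field; the constant names are chosen here for illustration.

```python
XLEN_MAX = 0xFFFF          # largest value of the two-byte XLEN field
SUBFIELD_HEADER = 4        # SI1 + SI2 + two-byte LEN
BYTES_PER_CHUNK = 58969    # uncompressed bytes per entry, as stated in the text

available = XLEN_MAX - SUBFIELD_HEADER
entries_2byte = available // 2
entries_4byte = available // 4
max_bytes = entries_2byte * BYTES_PER_CHUNK

print(hex(available), hex(entries_2byte), hex(entries_4byte))
# prints: 0xfffb 0x7ffd 0x3ffe
print(f"{max_bytes} bytes, about {max_bytes / 2**30:.2f} GB")
# prints: 1932119285 bytes, about 1.80 GB
```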

       For compression, the file is divided up into "chunks" of data.  Each
       chunk is less than 64kB, and can be compressed into an area that is
       also less than 64kB long (taking incompressible data into account --
       usually the data is compressed into a block that is much smaller than
       the original).  The CHLEN field specifies the length of a "chunk" of
       data.  The CHCNT field specifies how many chunks are present, and the
       CHCNT words of data specify how long each chunk is after compression
       (i.e., in the current compressed file).
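
       Because the header stores only the compressed size of each chunk, a
       reader has to derive where each chunk starts by summing the sizes of
       the chunks before it.  A minimal sketch, assuming the offsets are
       wanted relative to the start of the first compressed chunk (the
       function name chunk_offsets is hypothetical):

```python
from itertools import accumulate


def chunk_offsets(sizes):
    """Return the start offset of each compressed chunk.

    `sizes` is the list of CHCNT compressed chunk sizes from the header.
    Offsets are relative to the first compressed chunk, not to the start
    of the file.
    """
    # Running totals give the end of each chunk; shift by one to get starts.
    return [0, *accumulate(sizes)][:-1]


print(chunk_offsets([100, 200, 50]))
# prints: [0, 100, 300]
```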

       To perform random access on the data, the offset and length of the data
       are provided to library routines.  These routines determine the chunk
       in which the desired data begins, and decompress that chunk.
       Consecutive chunks are decompressed as necessary.
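
       The chunk lookup described above amounts to integer division by
       CHLEN.  The sketch below is an illustration, not the library routine
       itself: chunk_span, read_range and the load_chunk callback (which is
       assumed to return the decompressed bytes of chunk i) are hypothetical
       names.

```python
def chunk_span(offset, length, chlen):
    """Return (first, last, skip) for an uncompressed byte range.

    first/last are the indices of the chunks that cover the range and
    skip is the number of bytes to drop from the start of the first chunk.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    first = offset // chlen
    last = (offset + length - 1) // chlen
    skip = offset - first * chlen
    return first, last, skip


def read_range(offset, length, chlen, load_chunk):
    """Assemble `length` bytes starting at `offset` from decompressed chunks."""
    first, last, skip = chunk_span(offset, length, chlen)
    # Only the chunks that overlap the requested range are decompressed.
    data = b"".join(load_chunk(i) for i in range(first, last + 1))
    return data[skip : skip + length]
```

       For example, with CHLEN = 100, a request for 120 bytes at offset 250
       touches chunks 2 and 3 and skips the first 50 bytes of chunk 2.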

TRADEOFFS

       Speed  True random file access is not realized, since any access, even
              for a single byte, requires that a 64kB chunk be read and
              decompressed.  This is slower than accessing a flat text file,
              but is much, much faster than performing serial access on a
              fully compressed file.

       Space  For the textual dictionary databases we are working with, the
              use of 64kB chunks and maximal LZ77 compression realizes a file
              which is only about 4% larger than the same file compressed all
              at once.

OPTIONS

       -d or --decompress
              Decompress.  This is the default if the executable is called
              dictunzip.

       -c or --stdout
              Write output on standard output; keep original files unchanged.
              This is only available when decompressing (because parts of the
              header must be updated after a write when compressing).

       -f or --force
              Force compression or decompression even if the output file
              already exists.

       -h or --help
              Display help.

       -k or --keep
              Do not delete the original file.

       -l or --list
              For each compressed file, list the following fields:

                  type: dzip, gzip, or text (includes files in unknown formats)
                  crc: CRC checksum
                  date and time: from header
                  chunks: number of chunks in file
                  size: size of each uncompressed chunk
                  compr.: compressed size
                  uncompr.: uncompressed size
                  ratio: compression ratio (0.0% if unknown)
                  name: name of uncompressed file

              Unlike gzip, the compression method is not detected.

       -L or --license
              Display the dictzip license and quit.

       -t or --test
              Check the compressed file integrity.  This option is not
              implemented.  Instead, it will list the header information.

       -v or --verbose
              Verbose.  Display extra information during compression.

       -V or --version
              Version.  Display the version number and compilation options,
              then quit.

       -s start or --start start
              Specify the offset at which to start decompression, using
              decimal numbers.  The default is at the beginning of the file.

       -e size or --size size
              Specify the size of the portion of the file to decompress, using
              decimal numbers.  The default is the whole file.

       -S start or --Start start
              Specify the offset at which to start decompression, using
              base64 numbers.  The default is at the beginning of the file.

       -E size or --Size size
              Specify the size of the portion of the file to decompress, using
              base64 numbers.  The default is the whole file.

       -p prefilter or --pre prefilter
              Specify a shell command to execute as a filter before
              compression or decompression of a chunk.  The pre- and
              post-compression filters can be used to provide additional
              compression or output formatting.  The filters may not increase
              the buffer size significantly.  The pre- and post-compression
              filters were designed to provide the most general interface
              possible.

       -P postfilter or --post postfilter
              Specify a shell command to execute as a filter after compression
              or decompression.
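
       The -s/-e and -S/-E options can be combined with -d and -c to extract
       part of a file.  The sketch below only builds the argument list; it
       does not run dictzip.  This page does not define the "base64 numbers"
       format, so the encoding shown is an assumption: digits taken from the
       standard base64 alphabet, most significant digit first, as used for
       offsets in dictd index files.  The helper names and the file name in
       the usage note are hypothetical.

```python
# Assumed digit alphabet: standard base64, where 'A' is 0 and '/' is 63.
B64_DIGITS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_encode_number(value):
    """Encode a non-negative integer as base-64 digits, most significant first."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = ""
    while True:
        digits = B64_DIGITS[value % 64] + digits
        value //= 64
        if value == 0:
            return digits


def b64_decode_number(text):
    """Decode base-64 digits (most significant first) into an integer."""
    value = 0
    for ch in text:
        value = value * 64 + B64_DIGITS.index(ch)
    return value


def extract_argv(path, start, size, base64_numbers=False):
    """Build a dictzip command line that writes part of `path` to stdout."""
    if base64_numbers:
        return ["dictzip", "-d", "-c",
                "-S", b64_encode_number(start),
                "-E", b64_encode_number(size), path]
    return ["dictzip", "-d", "-c", "-s", str(start), "-e", str(size), path]
```

       For instance, extract_argv("words.dz", 100, 50) yields the arguments
       for "dictzip -d -c -s 100 -e 50 words.dz".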

CREDITS

       dictzip was written by Rik Faith (faith@cs.unc.edu) and is distributed
       under the terms of the GNU General Public License.  If you need to
       distribute under other terms, write to the author.

       The main libraries used by this program (zlib, regex, libmaa) are
       distributed under different terms, so you may be able to use the
       libraries for applications which are incompatible with the GPL --
       please see the copyright notices and license information that come
       with the libraries for more information, and consult with your
       attorney to resolve these issues.
