bzcat(1) - f37

bzip2(1)                    General Commands Manual                   bzip2(1)

NAME

       bzip2, bunzip2 - a block-sorting file compressor, v1.0.8
       bzcat - decompresses files to stdout
       bzip2recover - recovers data from damaged bzip2 files

SYNOPSIS

       bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ...  ]
       bunzip2 [ -fkvsVL ] [ filenames ...  ]
       bzcat [ -s ] [ filenames ...  ]
       bzip2recover filename

DESCRIPTION

       bzip2 compresses files using the Burrows-Wheeler block sorting text
       compression algorithm, and Huffman coding.  Compression is generally
       considerably better than that achieved by more conventional
       LZ77/LZ78-based compressors, and approaches the performance of the PPM
       family of statistical compressors.

       The command-line options are deliberately very similar to those of GNU
       gzip, but they are not identical.

       bzip2 expects a list of file names to accompany the command-line flags.
       Each file is replaced by a compressed version of itself, with the name
       "original_name.bz2".  Each compressed file has the same modification
       date, permissions, and, when possible, ownership as the corresponding
       original, so that these properties can be correctly restored at
       decompression time.  File name handling is naive in the sense that
       there is no mechanism for preserving original file names, permissions,
       ownerships or dates in filesystems which lack these concepts, or have
       serious file name length restrictions, such as MS-DOS.

       bzip2 and bunzip2 will by default not overwrite existing files.  If you
       want this to happen, specify the -f flag.

       If no file names are specified, bzip2 compresses from standard input to
       standard output.  In this case, bzip2 will decline to write compressed
       output to a terminal, as this would be entirely incomprehensible and
       therefore pointless.

       bunzip2 (or bzip2 -d) decompresses all specified files.  Files which
       were not created by bzip2 will be detected and ignored, and a warning
       issued.  bzip2 attempts to guess the filename for the decompressed file
       from that of the compressed file as follows:

              filename.bz2    becomes   filename
              filename.bz     becomes   filename
              filename.tbz2   becomes   filename.tar
              filename.tbz    becomes   filename.tar
              anyothername    becomes   anyothername.out

       If the file does not end in one of the recognised endings, .bz2, .bz,
       .tbz2 or .tbz, bzip2 complains that it cannot guess the name of the
       original file, and uses the original name with .out appended.
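
       The suffix rules above can be sketched in a few lines of Python.  This
       is an illustration of the mapping only, not bzip2's own code; the names
       SUFFIX_RULES and guess_output_name are hypothetical.

```python
# Suffix rules copied from the table above: (recognised ending, replacement).
SUFFIX_RULES = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(name: str) -> str:
    """Return the decompressed file name implied by the suffix rules."""
    for suffix, replacement in SUFFIX_RULES:
        if name.endswith(suffix):
            # Strip the recognised ending and append its replacement.
            return name[: -len(suffix)] + replacement
    # No recognised ending: fall back to appending ".out".
    return name + ".out"
```

       For instance, guess_output_name("backup.tbz2") yields "backup.tar",
       while a name without a recognised ending simply gains ".out".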

       As with compression, supplying no filenames causes decompression from
       standard input to standard output.

       bunzip2 will correctly decompress a file which is the concatenation of
       two or more compressed files.  The result is the concatenation of the
       corresponding uncompressed files.  Integrity testing (-t) of
       concatenated compressed files is also supported.
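
       The concatenation property can be observed without the command-line
       tools.  The sketch below uses Python's standard bz2 module (a binding
       to the same compression library) rather than bunzip2 itself, so it
       shows the behaviour of the format, not of the program.

```python
import bz2

first = b"first file\n"
second = b"second file\n"

# Compress each input separately, then join the two compressed streams.
combined = bz2.compress(first) + bz2.compress(second)

# Decompressing the joined streams gives the joined original data.
restored = bz2.decompress(combined)
```

       Here restored equals first + second, mirroring what bunzip2 does with
       a file built by concatenating two .bz2 files.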

       You can also compress or decompress files to the standard output by
       giving the -c flag.  Multiple files may be compressed and decompressed
       like this.  The resulting outputs are fed sequentially to stdout.
       Compression of multiple files in this manner generates a stream
       containing multiple compressed file representations.  Such a stream can
       be decompressed correctly only by bzip2 version 0.9.0 or later.
       Earlier versions of bzip2 will stop after decompressing the first file
       in the stream.

       bzcat (or bzip2 -dc) decompresses all specified files to the standard
       output.

       bzip2 will read arguments from the environment variables BZIP2 and
       BZIP, in that order, and will process them before any arguments read
       from the command line.  This gives a convenient way to supply default
       arguments.
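
       The processing order described above can be modelled as follows.  This
       is a sketch only: the function name effective_arguments is
       hypothetical, and splitting each variable on whitespace is an
       assumption that this page does not spell out.

```python
import os


def effective_arguments(command_line, environ=None):
    """Return arguments in the order described: BZIP2, then BZIP, then argv."""
    env = os.environ if environ is None else environ
    args = []
    for variable in ("BZIP2", "BZIP"):
        # Assumption: each variable holds whitespace-separated arguments.
        args.extend(env.get(variable, "").split())
    args.extend(command_line)
    return args
```

       With BZIP2 set to "-9 -v" and a command line of "-k file", the
       modelled order is -9, -v, -k, file.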

       Compression is always performed, even if the compressed file is
       slightly larger than the original.  Files of less than about one
       hundred bytes tend to get larger, since the compression mechanism has a
       constant overhead in the region of 50 bytes.  Random data (including
       the output of most file compressors) is coded at about 8.05 bits per
       byte, giving an expansion of around 0.5%.
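
       Both effects are easy to observe with Python's standard bz2 module,
       which is used here in place of the bzip2 program.  The exact sizes
       depend on the library version and, for the random case, on the data,
       so the sketch computes them rather than quoting figures.

```python
import bz2
import os

# A tiny input: the fixed overhead outweighs any saving.
tiny = b"hello"
tiny_compressed = bz2.compress(tiny)

# Random bytes are effectively incompressible, so the output grows slightly.
random_data = os.urandom(200_000)
random_compressed = bz2.compress(random_data)
expansion = len(random_compressed) / len(random_data)

print(len(tiny), "->", len(tiny_compressed), "bytes")
print(f"random data expansion factor: {expansion:.4f}")
```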

       As a self-check for your protection, bzip2 uses 32-bit CRCs to make
       sure that the decompressed version of a file is identical to the
       original.  This guards against corruption of the compressed data, and
       against undetected bugs in bzip2 (hopefully very unlikely).  The chance
       of data corruption going undetected is microscopic, about one chance in
       four billion for each file processed.  Be aware, though, that the check
       occurs upon decompression, so it can only tell you that something is
       wrong.  It can't help you recover the original uncompressed data.  You
       can use bzip2recover to try to recover data from damaged files.

       Unlike GNU gzip, bzip2 will not create a cascade of .bz2 suffixes even
       when using the --force option:

              filename.bz2    does not become   filename.bz2.bz2

       Return values: 0 for a normal exit, 1 for environmental problems (file
       not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
       compressed file, 3 for an internal consistency error (e.g., a bug)
       which caused bzip2 to panic.
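
       A wrapper script might translate these return values into messages.
       The following is a minimal sketch; the names EXIT_STATUS and
       describe_exit_status are hypothetical, and the wording of each message
       paraphrases the list above.

```python
# Return values as documented above.
EXIT_STATUS = {
    0: "normal exit",
    1: "environmental problem (file not found, invalid flags, I/O error)",
    2: "corrupt compressed file",
    3: "internal consistency error (bzip2 panicked)",
}


def describe_exit_status(code: int) -> str:
    """Map a bzip2 return value to a short description."""
    return EXIT_STATUS.get(code, f"undocumented status {code}")
```

       The code passed in would typically come from the returncode attribute
       of a subprocess.run() result for a bzip2 invocation.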

OPTIONS

       -c --stdout
              Compress or decompress to standard output.

       -d --decompress
              Force decompression.  bzip2, bunzip2 and bzcat are really the
              same program, and the decision about what actions to take is
              done on the basis of which name is used.  This flag overrides
              that mechanism, and forces bzip2 to decompress.

       -z --compress
              The complement to -d: forces compression, regardless of the
              invocation name.

       -t --test
              Check integrity of the specified file(s), but don't decompress
              them.  This really performs a trial decompression and throws
              away the result.

       -f --force
              Force overwrite of output files.  Normally, bzip2 will not
              overwrite existing output files.  Also forces bzip2 to break
              hard links to files, which it otherwise wouldn't do.

              bzip2 normally declines to decompress files which don't have the
              correct magic header bytes.  If forced (-f), however, it will
              pass such files through unmodified.  This is how GNU gzip
              behaves.

       -k --keep
              Keep (don't delete) input files during compression or
              decompression.

       -s --small
              Reduce memory usage, for compression, decompression and testing.
              Files are decompressed and tested using a modified algorithm
              which only requires 2.5 bytes per block byte.  This means any
              file can be decompressed in 2300k of memory, albeit at about
              half the normal speed.

              During compression, -s selects a block size of 200k, which
              limits memory use to around the same figure, at the expense of
              your compression ratio.  In short, if your machine is low on
              memory (8 megabytes or less), use -s for everything.  See MEMORY
              MANAGEMENT below.

       -q --quiet
              Suppress non-essential warning messages.  Messages pertaining to
              I/O errors and other critical events will not be suppressed.

       -v --verbose
              Verbose mode -- show the compression ratio for each file
              processed.  Further -v's increase the verbosity level, spewing
              out lots of information which is primarily of interest for
              diagnostic purposes.

       -L --license -V --version
              Display the software version, license terms and conditions.

       -1 (or --fast) to -9 (or --best)
              Set the block size to 100 k, 200 k ..  900 k when compressing.
              Has no effect when decompressing.  See MEMORY MANAGEMENT below.
              The --fast and --best aliases are primarily for GNU gzip
              compatibility.  In particular, --fast doesn't make things
              significantly faster.  And --best merely selects the default
              behaviour.

       --     Treats all subsequent arguments as file names, even if they
              start with a dash.  This is so you can handle files with names
              beginning with a dash, for example: bzip2 -- -myfilename.

       --repetitive-fast --repetitive-best
              These flags are redundant in versions 0.9.5 and above.  They
              provided some coarse control over the behaviour of the sorting
              algorithm in earlier versions, which was sometimes useful.
              0.9.5 and above have an improved algorithm which renders these
              flags irrelevant.

MEMORY MANAGEMENT

       bzip2 compresses large files in blocks.  The block size affects both
       the compression ratio achieved, and the amount of memory needed for
       compression and decompression.  The flags -1 through -9 specify the
       block size to be 100,000 bytes through 900,000 bytes (the default)
       respectively.  At decompression time, the block size used for
       compression is read from the header of the compressed file, and bunzip2
       then allocates itself just enough memory to decompress the file.  Since
       block sizes are stored in compressed files, it follows that the flags
       -1 to -9 are irrelevant to and so ignored during decompression.

       Compression and decompression requirements, in bytes, can be estimated
       as:

              Compression:   400k + ( 8 x block size )

              Decompression: 100k + ( 4 x block size ), or
                             100k + ( 2.5 x block size ) with -s
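
       The estimates above translate directly into code.  In this sketch the
       function names are hypothetical and "k" is assumed to mean 1000 bytes,
       in line with the 100,000-byte block units stated earlier.  The results
       are estimates, as the text says; the table later in this section lists
       6100k for -7 compression where the formula gives 6000k.

```python
K = 1000  # assumption: "k" denotes 1000 bytes


def block_size(level: int) -> int:
    """Block size in bytes for flags -1 through -9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return level * 100 * K


def compress_memory(level: int) -> int:
    """Estimated bytes needed to compress: 400k + 8 x block size."""
    return 400 * K + 8 * block_size(level)


def decompress_memory(level: int, small: bool = False) -> int:
    """Estimated bytes needed to decompress; small=True models -s."""
    factor = 2.5 if small else 4
    return int(100 * K + factor * block_size(level))
```

       For the default -9, these give 7600k for compression, 3700k for normal
       decompression and 2350k for decompression with -s.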

       Larger block sizes give rapidly diminishing marginal returns.  Most of
       the compression comes from the first two or three hundred k of block
       size, a fact worth bearing in mind when using bzip2 on small machines.
       It is also important to appreciate that the decompression memory
       requirement is set at compression time by the choice of block size.

       For files compressed with the default 900k block size, bunzip2 will
       require about 3700 kbytes to decompress.  To support decompression of
       any file on a 4 megabyte machine, bunzip2 has an option to decompress
       using approximately half this amount of memory, about 2300 kbytes.
       Decompression speed is also halved, so you should use this option only
       where necessary.  The relevant flag is -s.

       In general, try to use the largest block size memory constraints
       allow, since that maximises the compression achieved.  Compression and
       decompression speed are virtually unaffected by block size.

       Another significant point applies to files which fit in a single block
       -- that means most files you'd encounter using a large block size.  The
       amount of real memory touched is proportional to the size of the file,
       since the file is smaller than a block.  For example, compressing a
       file 20,000 bytes long with the flag -9 will cause the compressor to
       allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
       kbytes of it.  Similarly, the decompressor will allocate 3700k but only
       touch 100k + 20000 * 4 = 180 kbytes.
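
       The worked example can be reproduced with a short calculation.  This is
       a sketch: touched_memory is a hypothetical name, "k" is taken as 1000
       bytes, and capping the file size at the block size for larger files is
       an assumption made to keep the function total.

```python
K = 1000  # assumption: "k" denotes 1000 bytes


def touched_memory(file_size: int, level: int = 9) -> dict:
    """Estimate the memory actually touched for a file of file_size bytes."""
    block = level * 100 * K
    # A file smaller than one block only exercises file_size bytes of it.
    used = min(file_size, block)
    return {
        "compress": 400 * K + 8 * used,
        "decompress": 100 * K + 4 * used,
    }
```

       touched_memory(20_000) gives 560,000 bytes for compression and 180,000
       bytes for decompression, matching the figures in the text.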

       Here is a table which summarises the maximum memory usage for different
       block sizes.  Also recorded is the total compressed size for 14 files
       of the Calgary Text Compression Corpus totalling 3,141,622 bytes.  This
       column gives some feel for how compression varies with block size.
       These figures tend to understate the advantage of larger block sizes
       for larger files, since the Corpus is dominated by smaller files.

                  Compress   Decompress   Decompress   Corpus
           Flag     usage      usage       -s usage     Size

            -1      1200k       500k         350k      914704
            -2      2000k       900k         600k      877703
            -3      2800k      1300k         850k      860338
            -4      3600k      1700k        1100k      846899
            -5      4400k      2100k        1350k      845160
            -6      5200k      2500k        1600k      838626
            -7      6100k      2900k        1850k      834096
            -8      6800k      3300k        2100k      828642
            -9      7600k      3700k        2350k      828642

RECOVERING DATA FROM DAMAGED FILES

       bzip2 compresses files in blocks, usually 900 kbytes long.  Each block
       is handled independently.  If a media or transmission error causes a
       multi-block .bz2 file to become damaged, it may be possible to recover
       data from the undamaged blocks in the file.

       The compressed representation of each block is delimited by a 48-bit
       pattern, which makes it possible to find the block boundaries with
       reasonable certainty.  Each block also carries its own 32-bit CRC, so
       damaged blocks can be distinguished from undamaged ones.
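
       Searching for the block delimiter can be sketched as a bit-level scan.
       The pattern value 0x314159265359 is not given on this page; it is the
       block-start marker used by the bzip2 format and is supplied here as an
       outside fact.  The function name find_block_starts is hypothetical, and
       this is only an illustration of the idea, not the bzip2recover
       implementation.

```python
BLOCK_MAGIC = 0x314159265359  # 48-bit block-start pattern (not stated above)


def find_block_starts(data: bytes) -> list:
    """Return the bit offsets at which the 48-bit pattern occurs in data."""
    offsets = []
    window = 0
    mask = (1 << 48) - 1
    bits_seen = 0
    for byte in data:
        # Blocks are not byte-aligned, so feed bits one at a time, MSB first.
        for shift in range(7, -1, -1):
            window = ((window << 1) | ((byte >> shift) & 1)) & mask
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MAGIC:
                offsets.append(bits_seen - 48)
    return offsets
```

       For a freshly compressed single-block stream the first block begins
       right after the 4-byte stream header, so bit offset 32 should appear in
       the result.  A match alone does not prove a block is intact; that is
       what the per-block CRC is for.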

       bzip2recover is a simple program whose purpose is to search for blocks
       in .bz2 files, and write each block out into its own .bz2 file.  You
       can then use bzip2 -t to test the integrity of the resulting files, and
       decompress those which are undamaged.

       bzip2recover takes a single argument, the name of the damaged file, and
       writes a number of files "rec00001file.bz2", "rec00002file.bz2", etc.,
       containing the extracted blocks.  The output filenames are designed so
       that the use of wildcards in subsequent processing -- for example,
       "bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files in the
       correct order.
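
       The reason the naming scheme keeps wildcard expansion in order is the
       zero-padded block number: with a fixed width, lexicographic order and
       numeric order coincide.  The sketch below assumes the five-digit
       pattern shown in the example names; recovered_name is a hypothetical
       helper.

```python
def recovered_name(index: int, original: str) -> str:
    """Build a name in the style of the examples above, e.g. rec00001file.bz2."""
    return f"rec{index:05d}{original}"


# Generate names out of order, then sort them as plain strings.
names = [recovered_name(i, "file.bz2") for i in (10, 2, 1, 100)]
ordered = sorted(names)
```

       Sorting the names as plain strings puts blocks 1, 2, 10 and 100 in that
       order, which would not hold for unpadded numbers.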

       bzip2recover should be of most use dealing with large .bz2 files, as
       these will contain many blocks.  It is clearly futile to use it on
       damaged single-block files, since a damaged block cannot be recovered.
       If you wish to minimise any potential data loss through media or
       transmission errors, you might consider compressing with a smaller
       block size.

PERFORMANCE NOTES

       The sorting phase of compression gathers together similar strings in
       the file.  Because of this, files containing very long runs of repeated
       symbols, like "aabaabaabaab ..." (repeated several hundred times) may
       compress more slowly than normal.  Versions 0.9.5 and above fare much
       better than previous versions in this respect.  The ratio between
       worst-case and average-case compression time is in the region of 10:1.
       For previous versions, this figure was more like 100:1.  You can use
       the -vvvv option to monitor progress in great detail, if you want.

       Decompression speed is unaffected by these phenomena.

       bzip2 usually allocates several megabytes of memory to operate in, and
       then charges all over it in a fairly random fashion.  This means that
       performance, both for compressing and decompressing, is largely
       determined by the speed at which your machine can service cache misses.
       Because of this, small changes to the code to reduce the miss rate have
       been observed to give disproportionately large performance
       improvements.  I imagine bzip2 will perform best on machines with very
       large caches.

CAVEATS

       I/O error messages are not as helpful as they could be.  bzip2 tries
       hard to detect I/O errors and exit cleanly, but the details of what the
       problem is sometimes seem rather misleading.

       This manual page pertains to version 1.0.8 of bzip2.  Compressed data
       created by this version is entirely forwards and backwards compatible
       with the previous public releases, versions 0.1pl2, 0.9.0, 0.9.5,
       1.0.0, 1.0.1, 1.0.2 and above, but with the following exception: 0.9.0
       and above can correctly decompress multiple concatenated compressed
       files.  0.1pl2 cannot do this; it will stop after decompressing just
       the first file in the stream.

       bzip2recover versions prior to 1.0.2 used 32-bit integers to represent
       bit positions in compressed files, so they could not handle compressed
       files more than 512 megabytes long.  Versions 1.0.2 and above use
       64-bit ints on some platforms which support them (GNU supported
       targets, and Windows).  To establish whether or not bzip2recover was
       built with such a limitation, run it without arguments.  In any event
       you can build yourself an unlimited version if you can recompile it
       with MaybeUInt64 set to be an unsigned 64-bit integer.

AUTHOR

       Julian Seward, jseward@acm.org.

       https://sourceware.org/bzip2/

       The ideas embodied in bzip2 are due to (at least) the following people:
       Michael Burrows and David Wheeler (for the block sorting
       transformation), David Wheeler (again, for the Huffman coder), Peter
       Fenwick (for the structured coding model in the original bzip, and many
       refinements), and Alistair Moffat, Radford Neal and Ian Witten (for the
       arithmetic coder in the original bzip).  I am much indebted for their
       help, support and advice.  See the manual in the source distribution
       for pointers to sources of documentation.  Christian von Roques
       encouraged me to look for faster sorting algorithms, so as to speed up
       compression.  Bela Lubkin encouraged me to improve the worst-case
       compression performance.  Donna Robinson XMLised the documentation.
       The bz* scripts are derived from those of GNU gzip.  Many people sent
       patches, helped with portability problems, lent machines, gave advice
       and were generally helpful.

                                                                      bzip2(1)