1slim(1) PHYSICS DATA COMPRESSION slim(1)
2
3
4
6 slim, unslim, slimcat - Reversible compression of noisy physics data.
7
9 slim [options] filename ...
10 unslim [options] filename ...
11 slimcat [options] filename ...
12
13
14
15
17 slim performs lossless compression on binary data files. It is
18 designed to operate very rapidly and achieve better compression on
19 noisy physics data than general-purpose tools such as gzip and bzip2.
20
21 slim works well only on very specific kinds of data. It requires a
22 file that consists of integers represented in their native 16- or
23 32-bit binary format and arranged in a regular repetitive pattern.
24 Specifically, slim is designed for the kind of enormous files produced
25 copiously by the Atacama Cosmology Telescope. If it works well for
26 other types of files and from other experiments, then so much the bet‐
27 ter.
28
29 slim is not now appropriate for self-describing files containing their
30 own meta-data (e.g. FITS or NetCDF files). Probably the slim library
31 could be adapted to such files, if a front end were written to parse
32 the meta-data to extract details about the structure of the main data.
33
34
35 Although slim can operate on any file in the sense that the "slimfile"
36 will correctly expand back into the original file, good compression
37 performance--indeed, any compression at all--requires that slim knows
38 specific details about the file contents. See the Data description
39 options on how to describe the file contents, and RAW DATA STRUCTURE
40 for explanation of the relevant terms.
41
42 slim "trains" its encoding algorithms by examining only a fraction of
43 the data. For example, the main algorithm (the reduced binary encoder)
44 works by compressing a small contiguous range of values at the expense
45 of all values outside that range. The ends of that range are selected
46 based on only a sample of the data. For more, see Data sampling
47 options.
48
49
50 unslim is a synonym for slim --expand
51
52
53 slimcat is a synonym for slim --stdout --preserve
54
55
56
57
59 All options have short and long forms, which are listed together here.
60
61
62
63
64 General options
65 -p, --preserve
66 Preserve the original raw file when compressing or the original
67 slim file when expanding. The default behavior is to delete the
68 input file(s) as each output file is successfully compressed or
69 expanded (except when the option --stdout is used). If this
70 option is selected when expanding files, then the new output
71 file is named by replacing the ".slm" extension with ".raw".
72
73 -X, --compress
74 The file or files given as arguments are raw files to be com‐
75 pressed. The default is to assume raw files, except where the
76 file names end with the .slm or .SLM suffix. This option is the
77 only way to force compression of a file named something.slm.
78
79 -x, --expand
80 The file or files given as arguments are slim files to be
81 expanded. The default is to assume raw files, except where the
82 file names end with the .slm or .SLM suffix.
83
84 -k, --force
85 Force compression or expansion to overwrite any existing file of
86 the same name. Without this option, slim will refuse to over‐
87 write any files.
88
89 -V, --version
90 Print the slim file version and exit.
91
92 -q, --quiet
93 Don't print any output unless there are errors.
94
95 -?, --help
96 Print a usage message and exit.
97
98
99
100
101 Compression options
102 These options are ignored when expanding files.
103
104 -m, --method=method
105 Use encoding method method (default = 2). The methods are:
106
107 * -m2 The reduced binary encoder (default)
108 * -m5 Runlength coder. Good for data where the values are
109 strictly identical for long periods.
110
111 All other values are reserved. Methods numbered 1, 3, and 4
112 were once implemented. They all proved wrong for the job and
113 have been removed. Method 1 was a slight variation on the
114 reduced binary system, differing only in how the parameters were
115 computed. For more on methods 2 through 5, see ENCODING ALGO‐
116 RITHMS.
117
118 -d, --deltas
119 Encoders operate on the differences between successive data val‐
120 ues (the deltas). Default is to operate on the data itself.
121 Deltas are slightly slower to encode and decode, but they com‐
122 press more effectively in almost all situations. In principle,
123 raw data could compress to 0.5 fewer bits per data value, but
124 this would be expected only for extremely stable channels that
125 are not typical of physics data.
126
127 -b, --permit-bitrotation
128 Bit-rotation is an encoding feature useful in some data streams
129 where the lowest n bits are always the same. When using this
130 option, slim will check each channel for the existence of some
131 number of constant lowest bits. If any are found, then the data
132 are compressed by "rotating" the lowest bits to the top of the
133 number and then proceeding with the usual methods. The author
134 finds this useful with some analog-to-digital schemes that have
135 several low-order bits fixed at zero. For more, see Bit rota‐
136 tion below.
137
138 -n, --filename
139 Save the original filename in the slim file header. By default,
140 slim does not include the filename in the slim file.
141
142 -S, --rawsize
143 Save the original file's size (in bytes) in the slim file
144 header. (Currently, this is on by default and cannot be turned
145 off.) Note that when expanding, the raw file size is not tested
146 against the size stored in the file header.
147
148 -C, --compute-crc32
149 On compression, compute a CRC-32 checksum on the raw data in
150 each section and include its value in the slim file. Default is
151 not to compute the CRC. (On expansion, slim will compute and
152 test the checksum if it is present in the slim file, with or
153 without this option set. However, also see --ignore-crc32 )
154
155
156 Expansion options
157 These options are ignored when compressing files.
158
159 -o, --stdout
160 Write the results of expansion to stdout instead of to a file or
161 files. This option implies "--preserve" and is equivalent to
162 running slimcat
163
164 -0, --ignore-crc32
165 On expansion, do not compute and test the CRC-32 checksum, even
166 if one is included in the slim file.
167
168
169
170
171 Data description options
172 The current version of slim requires that all data channels be the same
173 (i.e. of the same raw type, having the same number of repeats per
174 frame, and so forth). Therefore, when given, all of the options in this
175 section apply to all channels. These requirements are not limitations
176 inherent to the slim file specification, but only to the current imple‐
177 mentation of slim. In the future, it could be possible to slim a more
178 heterogeneous mix of data channels, if a configuration file parser were
179 to be added. For more on the differences between the current restric‐
180 tions on compression and the full generality of the expander, see More
181 general slim data structures.
182
183 Data types are assumed to be 1-, 2- or 4-byte integers with little-
184 endian byte order. (Single-byte integer types cannot currently be
185 specified at the command line.) There are no immediate plans to add
186 facilities for handling 8-byte integers or floating-point numbers to
187 slim (although command-line arguments have been reserved for the float‐
188 ing-points).
189
190 See RAW DATA STRUCTURE for more information about how the raw data is
191 assumed to be structured, and for the definitions of the terms chan‐
192 nels, sections, frames, and repeats.
193
194 All data description options apply to compression only; they are all
195 ignored when expanding a slim file.
196
197
198 -c, --num-chan=nc
199 The data consist of nc separate data channels.
200
201 -r, --repeats=reps
202 One frame contains reps consecutive repeated data values for
203 each channel.
204
205 -F, --frames=nf
206 There are nf frames in each data section. Note that slim is
207 allowed to break up these frames into multiple sections as
208 needed to keep the section size below the hard limit of 16 MB,
209 or for any other reason. However, slim will also enforce a
210 change to a new section at the end of every nf frames with this
211 option.
212
213
214 -i, --int
215 -u, --unsigned
216 All data are signed or unsigned 32-bit integers (the default is
217 signed 32-bit).
218
219
220 -s, --short
221 -v, --ushort
222 All data are signed or unsigned 16-bit integers. Some aspects
223 of slim are not implemented for short integers, such as bit
224 rotation. See BUGS.
225
226
227 -y, --char
228 All data are signed 8-bit integers.
229
230
231
232 -f, --float
233 -g, --double
234 All data are 32-bit or 64-bit floating point numbers using the
235 IEEE-754 standard. Currently no floating-point compression
236 schemes are implemented, and this option is merely reserved for
237 future use. Instead, 32-bit floating point data are treated as
238 signed 32-bit integers, while 64-bit data are stored without
239 compression.
240
241
242
243 Data sampling options
244 The main encoder system (the reduced binary method) works by sampling
245 only some fraction of the raw data and assuming that the statistical
246 properties of the sample are adequate to predict the behavior of the
247 entire set. The data sampling options govern what fraction of the data
248 are used in the sampling. These control the trade-off between speed
249 and compression ratio.
250
251 Experience suggests that sampling a greater fraction of the data offers
252 very little benefit; we suggest sampling no more than 10% of all values
253 (the default), unless you plan to do (or have done) a careful test of
254 speed and compression ratio.
255
256 Note that the encoders are all universal, in the sense that all possi‐
257 ble values can be encoded (though of course most values will not be
258 compressed to a smaller size). This means that the occasional unusual
259 data value can be encoded, and sparse sampling is not fatal to the
260 resulting slim file.
261
262 Sampling can be done on a maximum of 20,000 data values. All sample
263 options refer to percentages of either this maximum or the full number
264 of available values, whichever is smaller. Regardless of the options
265 given, a minimum of 20 data values or all values will be sampled,
266 whichever is smaller.
267
268 -G, --sample-pct=pct
269 Sample pct percent of the data (between 2 and 100%) up to a max‐
270 imum of 200*pct samples when determining the encoding parame‐
271 ters. Default is 10% up to 2000 samples.
272
273 -9, --18-pct, --best
274 Sample 18% of the data (up to 3600 samples) to determine encod‐
275 ing parameters.
276
277 -8, --16-pct
278 Sample 16% of the data (up to 3200 samples) to determine encod‐
279 ing parameters.
280
281 -7, --14-pct
282 Sample 14% of the data (up to 2800 samples) to determine encod‐
283 ing parameters.
284
285 -6, --12-pct
286 Sample 12% of the data (up to 2400 samples) to determine encod‐
287 ing parameters.
288
289 -5, --10-pct
290 Sample 10% of the data (up to 2000 samples) to determine encod‐
291 ing parameters. (This is the default.)
292
293 -4, --8-pct
294 Sample 8% of the data (up to 1600 samples) to determine encoding
295 parameters.
296
297 -3, --6-pct
298 Sample 6% of the data (up to 1200 samples) to determine encoding
299 parameters.
300
301 -2, --4-pct
302 Sample 4% of the data (up to 800 samples) to determine encoding
303 parameters.
304
305 -1, --2-pct, --fast
306 Sample 2% of the data (up to 400 samples) to determine encoding
307 parameters.
308
309
310
311
312 Debugging and unimplemented features
313 -B, --debug-buffer=bufsize
314 read or write raw data from a debugging buffer of size bufsize
315 bytes. This is not the optimal way to compress or expand files
316 (though the price is small). This option is only intended to
317 exercise and debug parts of the code that would be used in a
318 planned slimlib library of functions.
319
320
321
322
323
325 The fundamental concept in slimming a raw data file is that of the data
326 channel. One channel would normally correspond to one physical data
327 source, such a single thermometer or encoder or other sensor. Good
328 compression requires that the distribution of the data from a single
329 channel has a single mode, that its statistical properties of be more
330 or less stationary,that the rate of extreme outliers be small (less
331 than a few percent), and so forth.
332
333 Data from more than one channel can be mixed together in the raw file,
334 provided that the pattern of changes from one channel to another is
335 repeated consistently. The most general pattern permitted is to have M
336 samples of channel 1, followed by N samples of channel 2, and so forth
337 until all of the channels in the raw data file have appeared. This
338 unit of the raw file, with each channel occurring one time and giving
339 its expect number of samples, is called the frame.
340
341 Multiple frames make up a section. The raw data file is assumed to
342 consist of one or more sections concatenated together. The number,
343 type, repetition count, and order of the channels is fixed throughout a
344 single section (and in the current version of the slim executable,
345 there is no way to change these factors between sections, either). A
346 section consists of one or more frames, and it is valid for the last
347 section in a file to end with a fractional frame (this would, we
348 assume, be due to an unexpected event, such as a raw data file being
349 truncated during acquisition or transmission).
350
351 When a raw file contains only one channel, there is an ambiguity in
352 whether that channel is being repeated N times within a single frame,
353 or if it is repeated only once in each of N frames. Because the first
354 choice leads to faster execution, slim silently selects it even if the
355 user's command-line options call for the second. Thus the options
356 "--num-chan=1 --repeats=1 --frames=20000" are silently converted to
357 "--num-chan=1 --repeats=20000 --frames=1".
358
359 Note that if one or both of the options giving the repeats or the
360 frames are absent, then slim tries to do the smartest possible thing.
361 You probably don't want to let it do that, but you can.
362
363
364
365
366
367 More general slim data structures
368 The slim executable produces slim files with several restrictions that
369 are not inherent to the definition of the slim file format itself. The
370 expander can expand files meeting the more general specification, but
371 there is no way at this time to construct these more general files.
372
373 Here we list some ways that the structure could be less restrictive
374 than the current compression executable permits. For one thing, the
375 data type of different channels need not be the same, and the number of
376 repeats can also be different. The list of channels could change
377 between one section and the next. Also, the slim file specification
378 allows all sections--not only the last--to be of arbitrary size. How‐
379 ever, the current implementation of the slim executable does not offer
380 a way to break up the raw data into sections at arbitrary places in the
381 raw file.
382
383 Currently, slim enforces a limit that sections be no longer than 16 MB
384 (2^24 bytes), because of the implementation detail that sections must
385 be held in memory when compressing or expanding them. If data compris‐
386 ing what is conceptually one section exceed this limit, then slim
387 silently divides it into two or more sections of nearly equal length
388 and writes them one after another. This fact is not generally relevant
389 to the user, but it does mean that compression will not be very effec‐
390 tive if there are channels that repeat only a few times per 16 MB of
391 raw data.
392
393
394
395
396
398 slim uses a few different algorithms for converting raw data into gen‐
399 erally smaller data. By "generally", we mean that no system can con‐
400 vert all possible 32-bit values into smaller ones; this is shown by a
401 simple counting argument. The goal of slim is to recognize a small
402 subset of values that appear most often and to map only this subset
403 into codes that require fewer bits. All other data values outside the
404 small subset are expanded into more than their original number of bits.
405 If most or all raw values come from the compressible subset, then the
406 encoded data will be smaller than the raw data.
407
408 Note that all encoding methods currently used in slim convert a single
409 value into an exact integer number of bits (unlike range or arithmetic
410 coding). Thus each bit in a slim file is a part of (or all of) the
411 code for a single value in the raw file (unless, of course, it belongs
412 to the file's header data).
413
414 The three encoding methods currently implemented are:
415
416 Constant-value encoding
417 A channel that contains exactly the same value for every
418 instance will be encoded as a constant-value. This system com‐
419 presses each value to zero bits, apart from recording the con‐
420 stant value in the section header. No command-line option is
421 required for constant-value encoding: all channels will be
422 checked to see if their values are strictly constant, and if so,
423 they will be encoded by this method in preference over all oth‐
424 ers.
425
426
427 Run-length encoding
428 For a channel whose values are strictly constant for long runs
429 but not for an entire data section, run-length encoding is
430 ideal. For example, it works well on a channel storing the
431 integer part of the time (in seconds), if the time is sampled
432 many times per second. This method will store a repeated run of
433 a single value as a (value, count) pair.
434
435 Two cautions are: (1) if the data sample used for evaluation
436 shows that run-length encoding will not actually compress the
437 data channel, then that channel will silently switch over to the
438 standard reduced-binary encoding, and (2) the longest possible
439 run is the number of repetitions within a frame--for technical
440 reasons, a "run" cannot cross from one frame to another. If a
441 channel's values do not appear as several successive words in
442 the raw file, then run-length encoding is not a good choice.
443
444
445 Reduced-binary encoding
446 This is the default method. The reduced-binary encoder has two
447 parameters: the number of bits N used for storing the "normal"
448 data, and the offset s subtracted from each value before encod‐
449 ing. The idea is to choose the parameters so that most or all
450 values lie in the range [ s, s+(2^N)-2 ] and can therefore be
451 stored using only N bits. The value s+(2^N)-1 is reserved to
452 indicate "Overflow", that the value being encoded did not lie in
453 the normal range. Overflow codes are followed by the raw data
454 value itself, stored in its natural length of 32 or 16 bits.
455 Thus, most 32-bit data are stored as N-bit numbers, while a
456 small fraction require N+32 bits.
457
458 The parameters are chosen using the data sample set. All possi‐
459 ble values are tested for N, while s is varied to keep the
460 arithmetic mean of the sample data in the middle of the normal
461 range for the given value of N. The choice of N is that which
462 gives the best compression on the sample data set. Note that
463 keeping the mean in the middle of the range might not always be
464 appropriate for every possible distribution, but that's how it
465 is done.
466
467
468
469
470 Tested but rejected encoding algorithms
471 Three other algorithms were implemented into slim and later removed.
472 Each of them improves on the compression ratios of the reduced-binary
473 code, but the improvements are small and come at a price in compression
474 and expansion speed (not to mention program complexity). Development
475 on these algorithms was dropped for this reason, and we mention them in
476 case you are also pondering ideas for improved encoding algorithms in
477 slim.
478
479
480 Codes A and B
481 Codes A and B are minor variations on the reduced-binary code.
482 The former changes how the presence of overflows is signaled,
483 and the latter also changes how their overflowing values are
484 stored.
485
486 Code A arises from the observation that overflow values might be
487 rare, but they are still more common than any other single
488 value. The idea is to encode most normal values as an N-bit
489 number but the overflow as an m-bit number, where m<N. The
490 result is to make each overflow take up (N-m) fewer bits, while
491 the allowed range of values contains only 2^(N)-2^(N-m) values,
492 slightly fewer than it would have had otherwise. This results
493 in savings of approximately 0.15 to 0.20 bits per channel for
494 normally-distributed data. However, the method was removed from
495 slim because of its CPU cost: decoding each normal value
496 requires first reading m bits, testing whether they are the
497 overflow code, and (if not) then reading in an additional (N-m)
498 bits to find the full N-bit number.
499
500 Code B expands on Code A. It starts with the observation that
501 overflow values are not generally taken randomly from the full
502 range of possible values, but instead are most likely found near
503 the allowed range of non-overflowing values. It is therefore
504 beneficial to write the overflowed value not as a full 32-bit
505 number (assuming the raw data are 32-bit numbers), but instead
506 as an N-bit, or (N+1)-bit number. Thus, overflows are encoded
507 as in Code A, followed by a (prefix, value), where the prefix is
508 an exponential Golomb code for the value's size (actually, the
509 size with N subtracted, since the size of an overflowing value
510 is guaranteed to be no less than N). As with Code A, it
511 improves compression by some fraction of one bit per value, but
512 the savings were judged not to be worth the performance penalty.
513
514
515 Huffman coding
516 Huffman coding of the raw data clearly accomplished the best
517 compression ratios of all methods. However, it was also the
518 slowest and so was removed from the program.
519
520 The approach is to take the data and split it into "upper" and
521 "lower" bits. The upper bits are Huffman-coded, while the lower
522 bits are assumed to be uniformly-distributed random values and
523 are repeated verbatim into the compressed data stream. The
524 split into upper and lower is made such that the data sample
525 contains no more than 127 distinct values for the upper bits.
526 The Huffman tree is built on the sampled frequencies of the val‐
527 ues of the upper bits (with an extra symbol to signal over‐
528 flows). The choice of 127 symbols was motivated by the possi‐
529 bility of storing the code tree with N symbols in only (N+1)
530 bytes, and by the observation that even as few as 32 symbols
531 work just as well on truly Gaussian-distributed data.
532
533 The price of decoding Huffman-encoded symbols is steep: we found
534 that decoding took twice as long when the Huffman encoders were
535 used. The problem is that the prefix-free Huffman codes must be
536 read one bit at a time, for there is no way to know their length
537 in advance.
538
539
540
541
542 Bit rotation
543 Bit-rotation is an advanced feature that helps in compressing channels
544 where the lowest B bits are constant. This sort of channel violates
545 the usual assumption that data follow a smooth distribution function,
546 and that the lowest few bits are likely to be distributed almost uni‐
547 formly. The author of slim has certainly found channels in real exper‐
548 iments where (for example) the six least significant bits are always
549 zero. This might happen a case where an IEEE 754 floating-point number
550 with 24 bits in its sign-plus-mantissa was converted to and recorded as
551 an integer.
552
553 If such a channel were encoded by the reduced-binary encoder, then the
554 unchanging B lowest bits would be faithfully but wastefully reproduced
555 over and over again. Bit rotation offers a way around this redundancy.
556
557 Bit rotation has slim encode such a channel by bit-shifting the raw
558 value by B bits to the right and then copying what were the B lowest
559 bits to the top (most significant bits) of the shifted result. This
560 makes, for example, the value 0xbeef9900 into 0x00beef99 for the case
561 of B=8. The reduced binary encoder can then act on the bit-rotated
562 result, in which the lowest bits of the new value are presumably less
563 repetitive than they were before bit rotation was applied.
564
565 The user selects only whether to try bit rotation with the --permit-
566 bitrotation option. The program then decides whether to choose a non-
567 zero value of B based on the sampled data set. The choice will be the
568 number of lowest-order bits that are strictly constant throughout the
569 sample set.
570
571 The penalty for trying bit rotation on every channel is modest when
572 compressing--causing roughly a 1.5% increase in instructions executed,
573 in one test. There is of course no penalty in expanding unless bit
574 rotation was actually used.
575
576
577
578
580 In order to operate on machines with 32-bit and 64-bit processors, slim
581 uses strictly sized data types where it matters, such as int32_t for
582 signed 32-bit integers. There are no known problems in running it on
583 machines with 32-bit or 64-bit word sizes. The low-level bit opera‐
584 tions are performed on words with the same size as the C++ data type
585 long which would normally be the native word size of the build machine.
586 The code works either way, but it is likely most efficient to operate
587 on words of the native size. The author admits to not having verified
588 this hunch empirically. (You can override this choice and force 32 or
589 64 bit word sizes. You do this when running build_bit_constants during
590 the build process, but you probably don't want to unless cross-compil‐
591 ing.)
592
593 Because of the author's prejudice or narrow experience (or the demands
594 of his day job), the assumption that words are in little-endian order
595 is deeply embedded in slim. An ambitious future contributor is welcome
596 to remove this restriction, but it is hard to see how to handle byte-
597 swapping without a performance penalty. For now, slim fails an asser‐
598 tion if run on a big-endian machine.
599
600
601
602
604 It would be very reasonable to have slim use environment variables for
605 some things. Feel free to suggest such features; there are none at
606 this time.
607
608
609
610
612 First, compress data from a Multi-channel electronics system. The raw
613 files consist of frames 4400 bytes long having 1100 4-byte words. We
614 ignore that a few of the channels are unsigned and treat them all as
615 signed words. Each channel repeats only once in the frame. We want to
616 encode the differences between successive data values and use method 2
617 (the reduced binary encoder).
618
619 slim -c1100 -i -r1 -dm2 raw_mce_file.dat
620
621 Next, a file containing only many, many repetitions of a single chan‐
622 nel:
623
624 slim -c1 -i -dm2 one_channel_only.dat
625
626 To uncompress both files:
627
628 unslim raw_mce_file.dat.slm one_channel_only.dat.slm
629
630
631
632
634 slim_acthk(1), gzip(1), bzip2(1)
635
636
637
638
640 The -f and -g command-line options for floating-point data are reserved
641 but aren't implemented.
642
643 Bit rotation does not work when the data are 8 or 16-bit integers
644 (whether signed or unsigned).
645
646
647
648
650 Copyright © 2008 Joseph Fowler, Princeton University.
651
652 This work was supported by and done for the benefit of the Atacama Cos‐
653 mology Telescope collaboration.
654
655 Permission is granted to make and distribute verbatim copies of this
656 manual provided the copyright notice and this permission notice are
657 preserved on all copies.
658
659 Permission is granted to copy and distribute modified versions of this
660 manual under the conditions for verbatim copying, provided that the
661 entire resulting derived work is distributed under the terms of a per‐
662 mission notice identical to this one.
663
664 Permission is granted to copy and distribute translations of this man‐
665 ual into another language, under the above conditions for modified ver‐
666 sions, except that this permission notice may be stated in a transla‐
667 tion approved by the Free Software Foundation.
668
669
670
671Version 2.6 October 24, 2008 slim(1)