1slim(1) PHYSICS DATA COMPRESSION slim(1)
2
3
4
6 slim, unslim, slimcat - Reversible compression of noisy physics data.
7
9 slim [options] filename ...
10 unslim [options] filename ...
11 slimcat [options] filename ...
12
13
14
15
17 slim performs lossless compression on binary data files. It is
18 designed to operate very rapidly and achieve better compression on
19 noisy physics data than general-purpose tools such as gzip and bzip2.
20
21 slim works well only on very specific kinds of data. It requires a
22 file that consists of integers represented in their native 16- or
23 32-bit binary format and arranged in a regular repetetive pattern.
24 Specifically, slim is designed for the kind of enormous files produced
25 copiously by the Atacama Cosmology Telescope. If it works well for
26 other types of files and from other experiments, then so much the bet‐
27 ter.
28
29 slim is not now appropriate for self-describing files containing their
30 own meta-data (e.g. FITS or NetCDF files). Probably the slim library
31 could be adapted to such files, if a front end were written to parse
32 the meta-data to extract details about the structure of the main data.
33
34
35 Although slim can operate on any file in the sense that the "slimfile"
36 will correctly expand back into the original file, good compression
37 performance--indeed, any compression at all--requires that slim knows
38 specific details about the file contents. See the Data description
39 options on how to describe the file contents, and RAW DATA STRUCTURE
40 for explanation of the relevant terms.
41
42 slim "trains" its encoding algorithms by examining only a fraction of
43 the data. For example, the main algorithm (the reduced binary encoder)
44 works by compressing a small contiguous range of values at the expense
45 of all values outside that range. The ends of that range are selected
46 based on only a sample of the data. For more, see Data sampling
47 options.
48
49
50 unslim is a synonym for slim --expand
51
52
53 slimcat is a synonym for slim --stdout --preserve
54
55
56
57
59 All options have short and long forms, which are listed together here.
60
61
62
63
64 General options
65 -p, --preserve
66 Preserve the original raw file when compressing or the original
67 slim file when expanding. The default behavior is to delete the
68 input file(s) as each output file is successfully compressed or
69 expanded (except when the option --stdout is used). If this
70 option is selected when expanding files, then the new output
71 file is named by replacing the ".slm" extension with ".raw".
72
73 -X, --compress
74 The file or files given as arguments are raw files to be com‐
75 pressed. The default is to assume raw files, except where the
76 file names end with the .slm or .SLM suffix. This option is the
77 only way to force compression of a file named something.slm.
78
79 -x, --expand
80 The file or files given as arguments are slim files to be
81 expanded. The default is to assume raw files, except where the
82 file names end with the .slm or .SLM suffix.
83
84 -k, --force
85 Force compression or expansion to overwrite any existing file of
86 the same name. Without this option, slim will refuse to over‐
87 write any files.
88
89 -V, --version
90 Print the slim file version and exit.
91
92 -?, --help
93 Print a usage message and exit.
94
95
96
97
98 Compression options
99 These options are ignored when expanding files.
100
101 -m, --method=method
102 Use encoding method method (default = 2). The methods are:
103
104 * -m2 The reduced binary encoder (default)
105 * -m5 Runlength coder. Good for data where the values are
106 strictly identical for long periods.
107
108 All other values are reserved. Methods numbered 1, 3, and 4
109 were once implemented. They all proved wrong for the job and
110 have been removed. Method 1 was a slight variation on the
111 reduced binary system, differing only in how the parameters were
112 computed. For more on methods 2 through 5, see ENCODING ALGO‐
113 RITHMS.
114
115 -d, --deltas
116 Encoders operate on the differences between successive data val‐
117 ues (the deltas). Default is to operate on the data itself.
118 Deltas are slightly slower to encode and decode, but they com‐
119 press more effectively in almost all situations. In principle,
120 raw data could compress to 0.5 fewer bits per data value, but
121 this would be expected only for extremely stable channels that
122 are not typical of physics data.
123
124 -b, --permit-bitrotation
125 Bit-rotation is an encoding feature useful in some data streams
126 where the lowest n bits are always the same. When using this
127 option, slim will check each channel for the existence of some
128 number of constant lowest bits. If any are found, then the data
129 are compressed by "rotating" the lowest bits to the top of the
130 number and then proceeding with the usual methods. The author
131 finds this useful with some analog-to-digital schemes that have
132 several low-order bits fixed at zero. For more, see Bit rota‐
133 tion below.
134
135 -n, --filename
136 Save the original filename in the slim file header. By default,
137 slim does not include the filename in the slim file.
138
139 -S, --rawsize
140 Save the original file's size (in bytes) in the slim file
141 header. (Currently, this is on by default and cannot be turned
142 off.) Note that when expanding, the raw file size is not tested
143 against the size stored in the file header.
144
145 -C, --compute-crc32
146 On compression, compute a CRC-32 checksum on the raw data in
147 each section and include its value in the slim file. Default is
148 not to compute the CRC. (On expansion, slim will compute and
149 test the checksum if it is present in the slim file, with or
150 without this option set. However, also see --ignore-crc32 )
151
152
153 Expansion options
154 These options are ignored when compressing files.
155
156 -o, --stdout
157 Write the results of expansion to stdout instead of to a file or
158 files. This option implies "--preserve" and is equivalent to
159 running slimcat
160
161 -0, --ignore-crc32
162 On expansion, do not compute and test the CRC-32 checksum, even
163 if one is included in the slim file.
164
165
166
167
168 Data description options
169 The current version of slim requires that all data channels be the same
170 (i.e. of the same raw type, having the same number of repeats per
171 frame, and so forth). Therefore, when given, all of the options in this
172 section apply to all channels. These requirements are not limitations
173 inherent to the slim file specification, but only to the current imple‐
174 mentation of slim. In the future, it could be possible to slim a more
175 heterogeneous mix of data channels, if a configuation file parser were
176 to be added. For more on the differences between the current restric‐
177 tions on compression and the full generality of the expander, see More
178 general slim data structures.
179
180 Data types are assumed to be 1-, 2- or 4-byte integers with little-
181 endian byte order. (Single-byte integer types cannot currently be
182 specified at the command line.) There are no immidiate plans to add
183 facilities for handling 8-byte integers or floating-point numbers to
184 slim (although command-line arguments have been reserved for the float‐
185 ing-points).
186
187 See RAW DATA STRUCTURE for more information about how the raw data is
188 assumed to be structured, and for the definitions of the terms chan‐
189 nels, sections, frames, and repeats.
190
191 All data description options apply to compression only; they are all
192 ignored when expanding a slim file.
193
194
195 -c, --num-chan=nc
196 The data consist of nc separate data channels.
197
198 -r, --repeats=reps
199 One frame contains reps consecutive repeated data values for
200 each channel.
201
202 -F, --frames=nf
203 There are nf frames in each data section. Note that slim is
204 allowed to break up these frames into multiple sections as
205 needed to keep the section size below the hard limit of 16 MB,
206 or for any other reason. However, slim will also enforce a
207 change to a new section at the end of every nf frames with this
208 option.
209
210
211 -i, --int
212 -u, --unsigned
213 All data are signed or unsigned 32-bit integers (the default is
214 signed 32-bit).
215
216
217 -s, --short
218 -v, --ushort
219 All data are signed or unsigned 16-bit integers. Some aspects
220 of slim are not implemented for short integers, such as bit
221 rotation. See BUGS.
222
223
224 -y, --char
225 All data are signed 8-bit integers.
226
227
228
229 -f, --float
230 -g, --double
231 All data are 32-bit or 64-bit floating point numbers using the
232 IEEE-754 standard. Currently no floating-point compression
233 schemes are implemented, and this option is merely reserved for
234 future use. Instead, 32-bit floating point data are treated as
235 signed 32-bit integers, while 64-bit data are stored without
236 compression.
237
238
239
240 Data sampling options
241 The main encoder system (the reduced binary method) works by sampling
242 only some fraction of the raw data and assuming that the statistical
243 properties of the sample are adequate to predict the behavior of the
244 entire set. The data sampling options govern what fraction of the data
245 are used in the sampling. These control the tradeoff between speed and
246 compression ratio.
247
248 Experience suggests that sampling a greater fraction of the data offers
249 very little benefit; we suggest sampling no more than 10% of all values
250 (the default), unless you plan to do (or have done) a careful test of
251 speed and compression ratio.
252
253 Note that the encoders are all universal, in the sense that all possi‐
254 ble values can be encoded (though of course most values will not be
255 compressed to a smaller size). This means that the occasional unusual
256 data value can be encoded, and sparse sampling is not fatal to the
257 resulting slim file.
258
259 Sampling can be done on a maximum of 20,000 data values. All sample
260 options refer to percentages of either this maximum or the full number
261 of available values, whichever is smaller. Regardless of the options
262 given, a minimum of 20 data values or all values will be sampled,
263 whichever is smaller.
264
265 -G, --sample-pct=pct
266 Sample pct percent of the data (between 2 and 100%) up to a
267 maxmimum of 200*pct samples when determining the encoding param‐
268 eters. Default is 10% up to 2000 samples.
269
270 -9, --18-pct, --best
271 Sample 18% of the data (up to 3600 samples) to determine encod‐
272 ing parameters.
273
274 -8, --16-pct
275 Sample 16% of the data (up to 3200 samples) to determine encod‐
276 ing parameters.
277
278 -7, --14-pct
279 Sample 14% of the data (up to 2800 samples) to determine encod‐
280 ing parameters.
281
282 -6, --12-pct
283 Sample 12% of the data (up to 2400 samples) to determine encod‐
284 ing parameters.
285
286 -5, --10-pct
287 Sample 10% of the data (up to 2000 samples) to determine encod‐
288 ing parameters. (This is the default.)
289
290 -4, --8-pct
291 Sample 8% of the data (up to 1600 samples) to determine encoding
292 parameters.
293
294 -3, --6-pct
295 Sample 6% of the data (up to 1200 samples) to determine encoding
296 parameters.
297
298 -2, --4-pct
299 Sample 4% of the data (up to 800 samples) to determine encoding
300 parameters.
301
302 -1, --2-pct, --fast
303 Sample 2% of the data (up to 400 samples) to determine encoding
304 parameters.
305
306
307
308
309 Debugging and unimplemented features
310 -B, --debug-buffer=bufsize
311 read or write raw data from a debugging buffer of size bufsize
312 bytes. This is not the optimal way to compress or expand files
313 (though the price is small). This option is only intended to
314 exercise and debug parts of the code that would be used in a
315 planned slimlib library of functions.
316
317
318
319
320
322 The fundamental concept in slimming a raw data file is that of the data
323 channel. One channel would normally correspond to one physical data
324 source, such a single thermometer or encoder or other sensor. Good
325 compression requres that the distribution of the data from a single
326 channel has a single mode, that its statistical properties of be more
327 or less stationary,that the rate of extreme outliers be small (less
328 than a few percent), and so forth.
329
330 Data from more than one channel can be mixed together in the raw file,
331 provided that the pattern of changes from one channel to another is
332 repeated consistently. The most general pattern permitted is to have M
333 samples of channel 1, followed by N samples of channel 2, and so forth
334 until all of the channels in the raw data file have appeared. This
335 unit of the raw file, with each channel occurring one time and giving
336 its expect number of samples, is called the frame.
337
338 Multiple frames make up a section. The raw data file is assumed to
339 consist of one or more sections concactenated together. The number,
340 type, repetition count, and order of the channels is fixed throughout a
341 single section (and in the current version of the slim executable,
342 there is no way to change these factors between sections, either). A
343 section consists of one or more frames, and it is valid for the last
344 section in a file to end with a fractional frame (this would, we
345 assume, be due to an unexpected event, such as a raw data file being
346 truncated during acquisition or transmission).
347
348 When a raw file contains only one channel, there is an ambiguity in
349 whether that channel is being repeated N times within a single frame,
350 or if it is repeated only once in each of N frames. Because the first
351 choice leads to faster execution, slim silently selects it even if the
352 user's command-line options call for the second. Thus the options
353 "--num-chan=1 --repeats=1 --frames=20000" are silently converted to
354 "--num-chan=1 --repeats=20000 --frames=1".
355
356 Note that if one or both of the options giving the repeats or the
357 frames are absent, then slim tries to do the smartest possible thing.
358 You probably don't want to let it do that, but you can.
359
360
361
362
363
364 More general slim data structures
365 The slim executable produces slim files with several restrictions that
366 are not inherent to the definition of the slim file format itself. The
367 expander can expand files meeting the more general specification, but
368 there is no way at this time to construct these more general files.
369
370 Here we list some ways that the structure could be less restrictive
371 than the current compression executable permits. For one thing, the
372 data type of different channels need not be the same, and the number of
373 repeats can also be different. The list of channels could change
374 between one section and the next. Also, the slim file specification
375 allows all sections--not only the last--to be of arbitrary size. How‐
376 ever, the current implementation of the slim executable does not offer
377 a way to break up the raw data into sections at arbitrary places in the
378 raw file.
379
380 Currently, slim enforces a limit that sections be no longer than 16 MB
381 (2^24 bytes), because of the implementation detail that sections must
382 be held in memory when compressing or expanding them. If data compris‐
383 ing what is conceptually one section exceed this limit, then slim
384 silently divides it into two or more sections of nearly equal length
385 and writes them one after another. This fact is not generally relevant
386 to the user, but it does mean that compression will not be very effec‐
387 tive if there are channels that repeat only a few times per 16 MB of
388 raw data.
389
390
391
392
393
395 slim uses a few different algorithms for converting raw data into gen‐
396 erally smaller data. By "generally", we mean that no system can con‐
397 vert all possible 32-bit values into smaller ones; this is shown by a
398 simple counting argument. The goal of slim is to recognize a small
399 subset of values that appear most often and to map only this subset
400 into codes that require fewer bits. All other data values outside the
401 small subset are expanded into more than their original number of bits.
402 If most or all raw values come from the compressible subset, then the
403 encoded data will be smaller than the raw data.
404
405 Note that all encoding methods currently used in slim convert a single
406 value into an exact integer number of bits (unlike range or arithmetic
407 coding). Thus each bit in a slim file is a part of (or all of) the
408 code for a single value in the raw file (unless, of course, it belongs
409 to the file's header data).
410
411 The three encoding methods currently implemented are:
412
413 Constant-value encoding
414 A channel that contains exactly the same value for every
415 instance will be encoded as a constant-value. This system com‐
416 presses each value to zero bits, apart from recording the con‐
417 stant value in the section header. No command-line option is
418 required for constant-value encoding: all channels will be
419 checked to see if their values are strictly constant, and if so,
420 they will be encoded by this method in preference over all oth‐
421 ers.
422
423
424 Run-length encoding
425 For a channel whose values are strictly constant for long runs
426 but not for an entire data section, run-length encoding is
427 ideal. For example, it works well on a channel storing the
428 integer part of the time (in seconds), if the time is sampled
429 many times per second. This method will store a repeated run of
430 a single value as a (value, count) pair.
431
432 Two cautions are: (1) if the data sample used for evaluation
433 shows that run-length encoding will not actually compress the
434 data channel, then that channel will silently switch over to the
435 standard reduced-binary encoding, and (2) the longest possible
436 run is the number of repetitions within a frame--for technical
437 reasons, a "run" cannot cross from one frame to another. If a
438 channel's values do not appear as several successive words in
439 the raw file, then run-length encoding is not a good choice.
440
441
442 Reduced-binary encoding
443 This is the default method. The reduced-binary encoder has two
444 parameters: the number of bits N used for storing the "normal"
445 data, and the offset s subtracted from each value before encod‐
446 ing. The idea is to choose the parameters so that most or all
447 values lie in the range [ s, s+(2^N)-2 ] and can therefore be
448 stored using only N bits. The value s+(2^N)-1 is reserved to
449 indicate "Overflow", that the value being encoded did not lie in
450 the normal range. Overflow codes are followed by the raw data
451 value itself, stored in its natural length of 32 or 16 bits.
452 Thus, most 32-bit data are stored as N-bit numbers, while a
453 small fraction require N+32 bits.
454
455 The parameters are chosen using the data sample set. All possi‐
456 ble values are tested for N, while s is varied to keep the
457 arithmetic mean of the sample data in the middle of the normal
458 range for the given value of N. The choice of N is that which
459 gives the best compression on the sample data set. Note that
460 keeping the mean in the middle of the range might not always be
461 appropriate for every possible distribution, but that's how it
462 is done.
463
464
465
466
467 Tested but rejected encoding algorithms
468 Three other algorithms were implemented into slim and later removed.
469 Each of them improves on the compression ratios of the reduced-binary
470 code, but the improvements are small and come at a price in compression
471 and expansion speed (not to mention program complexity). Development
472 on these algorithms was dropped for this reason, and we mention them in
473 case you are also pondering ideas for improved encoding algorithms in
474 slim.
475
476
477 Codes A and B
478 Codes A and B are minor variations on the reduced-binary code.
479 The former changes how the presence of overflows is signaled,
480 and the latter also changes how their overflowing values are
481 stored.
482
483 Code A arises from the observation that overflow values might be
484 rare, but they are still more common than any other single
485 value. The idea is to encode most normal values as an N-bit
486 number but the overflow as an m-bit number, where m<N. The
487 result is to make each overflow take up (N-m) fewer bits, while
488 the allowed range of values contains only 2^(N)-2^(N-m) values,
489 slightly fewer than it would have had otherwise. This results
490 in savings of approximately 0.15 to 0.20 bits per channel for
491 normally-distributed data. However, the method was removed from
492 slim because of its CPU cost: decoding each normal value
493 requires first reading m bits, testing whether they are the
494 overflow code, and (if not) then reading in an additional (N-m)
495 bits to find the full N-bit number.
496
497 Code B expands on Code A. It starts with the observation that
498 overflow values are not generally taken randomly from the full
499 range of possible values, but instead are most likely found near
500 the allowed range of non-overflowing values. It is therefore
501 beneficial to write the overflowed value not as a full 32-bit
502 number (assuming the raw data are 32-bit numbers), but instead
503 as an N-bit, or (N+1)-bit number. Thus, overflows are encoded
504 as in Code A, followed by a (prefix, value), where the prefix is
505 an exponential Golomb code for the value's size (actually, the
506 size with N subtracted, since the size of an overflowing value
507 is guaranteed to be no less than N). As with Code A, it
508 improves compression by some fraction of one bit per value, but
509 the savings were judged not to be worth the performance penalty.
510
511
512 Huffman coding
513 Huffman coding of the raw data clearly accomplished the best
514 compression ratios of all methods. However, it was also the
515 slowest and so was removed from the program.
516
517 The approach is to take the data and split it into "upper" and
518 "lower" bits. The upper bits are Huffman-coded, while the lower
519 bits are assumed to be uniformly-distributed random values and
520 are repeated verbatim into the compressed data stream. The
521 split into upper and lower is made such that the data sample
522 contains no more than 127 distinct values for the upper bits.
523 The Huffman tree is built on the sampled frequencies of the val‐
524 ues of the upper bits (with an extra symbol to signal over‐
525 flows). The choice of 127 symbols was motivated by the possi‐
526 bility of storing the code tree with N symbols in only (N+1)
527 bytes, and by the observation that even as few as 32 symbols
528 work just as well on truly Gaussian-distributed data.
529
530 The price of decoding Huffman-encoded symbols is steep: we found
531 that decoding took twice as long when the Huffman encoders were
532 used. The problem is that the prefix-free Huffman codes must be
533 read one bit at a time, for there is no way to know their length
534 in advance.
535
536
537
538
539 Bit rotation
540 Bit-rotation is an advanced feature that helps in compressing channels
541 where the lowest B bits are constant. This sort of channel violates
542 the usual assumption that data follow a smooth distribution function,
543 and that the lowest few bits are likely to be distributed almost uni‐
544 formly. The author of slim has certainly found channels in real exper‐
545 iments where (for example) the six least significant bits are always
546 zero. This might happen a case where an IEEE 754 floating-point number
547 with 24 bits in its sign-plus-mantissa was converted to and recorded as
548 an integer.
549
550 If such a channel were encoded by the reduced-binary encoder, then the
551 unchanging B lowest bits would be faithfully but wastefully reproduced
552 over and over again. Bit rotation offers a way around this redundancy.
553
554 Bit rotation has slim encode such a channel by bit-shifting the raw
555 value by B bits to the right and then copying what were the B lowest
556 bits to the top (most significant bits) of the shifted result. This
557 makes, for example, the value 0xbeef9900 into 0x00beef99 for the case
558 of B=8. The reduced binary encoder can then act on the bit-rotated
559 result, in which the lowest bits of the new value are presumably less
560 repetetive than they were before bit rotation was applied.
561
562 The user selects only whether to try bit rotation with the --permit-
563 bitrotation option. The program then decides whether to choose a non-
564 zero value of B based on the sampled data set. The choice will be the
565 number of lowest-order bits that are strictly constant throughout the
566 sample set.
567
568 The penalty for trying bit rotation on every channel is modest when
569 compressing--causing roughly a 1.5% increase in instructions executed,
570 in one test. There is of course no penalty in expanding unless bit
571 rotation was actually used.
572
573
574
575
577 In order to operate on machines with 32-bit and 64-bit processors, slim
578 uses strictly sized data types where it matters, such as int32_t for
579 signed 32-bit integers. There are no known problems in running it on
580 machines with 32-bit or 64-bit word sizes. The low-level bit opera‐
581 tions are performed on words with the same size as the C++ data type
582 long which would normally be the native word size of the build machine.
583 The code works either way, but it is likely most efficient to operate
584 on words of the native size. The author admits to not having verified
585 this hunch empirically. (You can override this choice and force 32 or
586 64 bit word sizes. You do this when running build_bit_constants during
587 the build process, but you probably don't want to unless cross-compil‐
588 ing.)
589
590 Because of the author's prejudice or narrow experience (or the demands
591 of his day job), the assumption that words are in little-endian order
592 is deeply embedded in slim. An ambitious future contributor is welcome
593 to remove this restriction, but it is hard to see how to handle byte-
594 swapping without a performance penalty. For now, slim fails an asser‐
595 tion if run on a big-endian machine.
596
597
598
599
601 It would be very reasonable to have slim use environment variables for
602 some things. Feel free to suggest such features; there are none at
603 this time.
604
605
606
607
609 First, compress data from a Multi-channel electronics system. The raw
610 files consist of frames 4400 bytes long having 1100 4-byte words. We
611 ignore that a few of the channels are unsigned and treat them all as
612 signed words. Each channel repeats only once in the frame. We want to
613 encode the differences between successive data values and use method 2
614 (the reduced binary encoder).
615
616 slim -c1100 -i -r1 -dm2 raw_mce_file.dat
617
618 Next, a file containing only many, many repetitions of a single chan‐
619 nel:
620
621 slim -c1 -i -dm2 one_channel_only.dat
622
623 To uncompress both files:
624
625 unslim raw_mce_file.dat.slm one_channel_only.dat.slm
626
627
628
629
631 slim_acthk(1), gzip(1), bzip2(1)
632
633
634
635
637 The -f and -g command-line options for floating-point data are reserved
638 but aren't implemented.
639
640 Bit rotation does not work when the data are 8 or 16-bit integers
641 (whether signed or unsigned).
642
643
644
645
647 Copyright © 2008 Joseph Fowler, Princeton University.
648
649 This work was supported by and done for the benefit of the Atacama Cos‐
650 mology Telescope collaboration.
651
652 Permission is granted to make and distribute verbatim copies of this
653 manual provided the copyright notice and this permission notice are
654 preserved on all copies.
655
656 Permission is granted to copy and distribute modified versions of this
657 manual under the conditions for verbatim copying, provided that the
658 entire resulting derived work is distributed under the terms of a per‐
659 mission notice identical to this one.
660
661 Permission is granted to copy and distribute translations of this man‐
662 ual into another language, under the above conditions for modified ver‐
663 sions, except that this permission notice may be stated in a transla‐
664 tion approved by the Free Software Foundation.
665
666
667
668Version 2.6 October 24, 2008 slim(1)