perlpacktut(1)

PERLPACKTUT(1)         Perl Programmers Reference Guide         PERLPACKTUT(1)

NAME

       perlpacktut - tutorial on "pack" and "unpack"

DESCRIPTION

       "pack" and "unpack" are two functions for transforming data according
       to a user-defined template, between the guarded way Perl stores values
       and some well-defined representation as might be required in the
       environment of a Perl program. Unfortunately, they're also two of the
       most misunderstood and most often overlooked functions that Perl
       provides. This tutorial will demystify them for you.

The Basic Principle

       Most programming languages don't shelter the memory where variables are
       stored. In C, for instance, you can take the address of some variable,
       and the "sizeof" operator tells you how many bytes are allocated to the
       variable. Using the address and the size, you may access the storage to
       your heart's content.

       In Perl, you just can't access memory at random, but the structural and
       representational conversion provided by "pack" and "unpack" is an
       excellent alternative. The "pack" function converts values to a byte
       sequence containing representations according to a given specification,
       the so-called "template" argument. "unpack" is the reverse process,
       deriving some values from the contents of a string of bytes. (Be
       cautioned, however, that not all that has been packed together can be
       neatly unpacked - a very common experience as seasoned travellers are
       likely to confirm.)

       Why, you may ask, would you need a chunk of memory containing some
       values in binary representation? One good reason is input and output
       accessing some file, a device, or a network connection, whereby this
       binary representation is either forced on you or will give you some
       benefit in processing. Another cause is passing data to some system
       call that is not available as a Perl function: "syscall" requires you
       to provide parameters stored in the way it happens in a C program. Even
       text processing (as shown in the next section) may be simplified with
       judicious usage of these two functions.

       To see how (un)packing works, we'll start with a simple template code
       where the conversion is in low gear: between the contents of a byte
       sequence and a string of hexadecimal digits. Let's use "unpack", since
       this is likely to remind you of a dump program, or some desperate last
       message unfortunate programs are wont to throw at you before they
       expire into the wild blue yonder. Assuming that the variable $mem holds
       a sequence of bytes that we'd like to inspect without assuming anything
       about its meaning, we can write

          my( $hex ) = unpack( 'H*', $mem );
          print "$hex\n";

       whereupon we might see something like this, with each pair of hex
       digits corresponding to a byte:

          41204d414e204120504c414e20412043414e414c2050414e414d41

       What was in this chunk of memory? Numbers, characters, or a mixture of
       both? Assuming that we're on a computer where ASCII (or some similar)
       encoding is used: hexadecimal values in the range 0x41 - 0x5A indicate
       an uppercase letter, and 0x20 encodes a space. So we might assume it is
       a piece of text, which some are able to read like a tabloid; but others
       will have to get hold of an ASCII table and relive that first-grader
       feeling. Not caring too much about which way to read this, we note that
       "unpack" with the template code "H" converts the contents of a sequence
       of bytes into the customary hexadecimal notation. Since "a sequence of"
       is a pretty vague indication of quantity, "H" has been defined to
       convert just a single hexadecimal digit unless it is followed by a
       repeat count. An asterisk for the repeat count means to use whatever
       remains.

       The inverse operation - packing byte contents from a string of
       hexadecimal digits - is just as easily written. For instance:

          my $s = pack( 'H2' x 10, map { "3$_" } ( 0..9 ) );
          print "$s\n";

       Since we feed a list of ten 2-digit hexadecimal strings to "pack", the
       pack template should contain ten pack codes. If this is run on a
       computer with ASCII character coding, it will print 0123456789.

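       The two directions can be combined into a round trip. The following is
       a minimal sketch, assuming an ASCII platform; the sample contents of
       $mem are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data; any byte string would do.
my $mem = "A MAN A PLAN";

# Bytes to hex digits: two digits per byte, high nybble first.
my ($hex) = unpack( 'H*', $mem );
print "$hex\n";

# Hex digits back to bytes: the inverse conversion.
my $back = pack( 'H*', $hex );
print "$back\n";
```

       On an ASCII system this should print 41204d414e204120504c414e followed
       by the original text.
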

Packing Text

       Let's suppose you've got to read in a data file like this:

           Date      |Description                | Income|Expenditure
           01/24/2001 Ahmed's Camel Emporium                  1147.99
           01/28/2001 Flea spray                                24.99
           01/29/2001 Camel rides to tourists      235.00

       How do we do it? You might think first to use "split"; however, since
       "split" collapses blank fields, you'll never know whether a record was
       income or expenditure. Oops. Well, you could always use "substr":

           while (<>) {
               my $date   = substr($_,  0, 11);
               my $desc   = substr($_, 12, 27);
               my $income = substr($_, 40,  7);
               my $expend = substr($_, 52,  7);
               ...
           }

       It's not really a barrel of laughs, is it? In fact, it's worse than it
       may seem; the eagle-eyed may notice that the first field should only be
       10 characters wide, and the error has propagated right through the
       other numbers - which we've had to count by hand. So it's error-prone
       as well as horribly unfriendly.

       Or maybe we could use regular expressions:

           while (<>) {
               my($date, $desc, $income, $expend) =
                   m|(\d\d/\d\d/\d{4}) (.{27}) (.{7})(.*)|;
               ...
           }

       Urgh. Well, it's a bit better, but - well, would you want to maintain
       that?

       Hey, isn't Perl supposed to make this sort of thing easy? Well, it
       does, if you use the right tools. "pack" and "unpack" are designed to
       help you out when dealing with fixed-width data like the above. Let's
       have a look at a solution with "unpack":

           while (<>) {
               my($date, $desc, $income, $expend) = unpack("A10xA27xA7xA*", $_);
               ...
           }

       That looks a bit nicer; but we've got to take apart that weird
       template. Where did I pull that out of?

       OK, let's have a look at some of our data again; in fact, we'll include
       the headers, and a handy ruler so we can keep track of where we are.

                    1         2         3         4         5
           1234567890123456789012345678901234567890123456789012345678
           Date      |Description                | Income|Expenditure
           01/28/2001 Flea spray                                24.99
           01/29/2001 Camel rides to tourists      235.00

       From this, we can see that the date column stretches from column 1 to
       column 10 - ten characters wide. The "pack"-ese for "character" is "A",
       and ten of them are "A10". So if we just wanted to extract the dates,
       we could say this:

           my($date) = unpack("A10", $_);

       OK, what's next? Between the date and the description is a blank
       column; we want to skip over that. The "x" template means "skip
       forward", so we want one of those. Next, we have another batch of
       characters, from 12 to 38. That's 27 more characters, hence "A27".
       (Don't make the fencepost error - there are 27 characters between 12
       and 38, not 26. Count 'em!)

       Now we skip another character and pick up the next 7 characters:

           my($date,$description,$income) = unpack("A10xA27xA7", $_);

       Now comes the clever bit. Lines in our ledger which are just income and
       not expenditure might end at column 46. Hence, we don't want to tell
       our "unpack" pattern that we need to find another 12 characters; we'll
       just say "if there's anything left, take it". As you might guess from
       regular expressions, that's what the "*" means: "use everything
       remaining".

       ·  Be warned, though, that unlike regular expressions, if the "unpack"
          template doesn't match the incoming data, Perl will scream and die.

       Hence, putting it all together:

           my($date,$description,$income,$expend) = unpack("A10xA27xA7xA*", $_);

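       To see the finished template at work on a single record, here is a
       small self-contained sketch. The sample line is built with "sprintf"
       purely for illustration and is an assumption, not data read from a real
       ledger file.

```perl
use strict;
use warnings;

# One record laid out like the ledger above (hypothetical sample line):
# date in columns 1-10, description in 12-38, income in 40-46,
# expenditure from column 48 onwards.
my $line = sprintf( "%-10s %-27s %7s %s",
                    '01/28/2001', 'Flea spray', '', '24.99' );

my( $date, $desc, $income, $expend ) = unpack( "A10xA27xA7xA*", $line );

# "A" strips trailing spaces, so an empty column comes back as ''.
print "date=[$date] desc=[$desc] income=[$income] expend=[$expend]\n";
```

       Note that "A" only removes trailing spaces; a value that is
       right-justified within its column keeps its leading spaces.
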
       Now, that's our data parsed. I suppose what we might want to do now is
       total up our income and expenditure, and add another line to the end of
       our ledger - in the same format - saying how much we've brought in and
       how much we've spent:

           use POSIX ();   # for POSIX::strftime

           my($tot_income, $tot_expend) = (0, 0);
           while (<>) {
               my($date, $desc, $income, $expend) = unpack("A10xA27xA7xA*", $_);
               $tot_income += $income;
               $tot_expend += $expend;
           }

           $tot_income = sprintf("%.2f", $tot_income); # Get them into
           $tot_expend = sprintf("%.2f", $tot_expend); # "financial" format

           my $date = POSIX::strftime("%m/%d/%Y", localtime);

           # OK, let's go:

           print pack("A10xA27xA7xA*", $date, "Totals", $tot_income, $tot_expend);

       Oh, hmm. That didn't quite work. Let's see what happened:

           01/24/2001 Ahmed's Camel Emporium                   1147.99
           01/28/2001 Flea spray                                 24.99
           01/29/2001 Camel rides to tourists     1235.00
           03/23/2001Totals                     1235.001172.98

       OK, it's a start, but what happened to the spaces? We put "x", didn't
       we? Shouldn't it skip forward? Let's look at what "pack" in perlfunc
       says:

           x   A null byte.

       Urgh. No wonder. There's a big difference between "a null byte",
       character zero, and "a space", character 32. Perl's put something
       between the date and the description - but unfortunately, we can't see
       it!

       What we actually need to do is expand the width of the fields. The "A"
       format pads any non-existent characters with spaces, so we can use the
       additional spaces to line up our fields, like this:

           print pack("A11 A28 A8 A*", $date, "Totals", $tot_income, $tot_expend);

       (Note that you can put spaces in the template to make it more readable,
       but they don't translate to spaces in the output.) Here's what we got
       this time:

           01/24/2001 Ahmed's Camel Emporium                   1147.99
           01/28/2001 Flea spray                                 24.99
           01/29/2001 Camel rides to tourists     1235.00
           03/23/2001 Totals                      1235.00 1172.98

       That's a bit better, but we still have that last column which needs to
       be moved further over. There's an easy way to fix this up:
       unfortunately, we can't get "pack" to right-justify our fields, but we
       can get "sprintf" to do it:

           $tot_income = sprintf("%.2f", $tot_income);
           $tot_expend = sprintf("%12.2f", $tot_expend);
           $date = POSIX::strftime("%m/%d/%Y", localtime);
           print pack("A11 A28 A8 A*", $date, "Totals", $tot_income, $tot_expend);

       This time we get the right answer:

           01/28/2001 Flea spray                                 24.99
           01/29/2001 Camel rides to tourists     1235.00
           03/23/2001 Totals                      1235.00      1172.98

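       The padding behaviour of "A" and the right-justification done by
       "sprintf" can be checked in isolation. This sketch assumes fixed,
       made-up totals and a fixed date instead of values computed from a file.

```perl
use strict;
use warnings;

# Hypothetical totals; in the tutorial they come from the read loop.
my $date       = '03/23/2001';
my $tot_income = sprintf( "%.2f",   1235.00 );
my $tot_expend = sprintf( "%12.2f", 1172.98 );  # right-justify in 12 columns

# "A" pads short values with spaces on the right, up to the given width.
my $line = pack( "A11 A28 A8 A*", $date, "Totals", $tot_income, $tot_expend );
print "[$line]\n";
print length($line), "\n";    # 11 + 28 + 8 + 12 columns
```

       The brackets in the output make the trailing and leading padding
       visible.
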
       So that's how we consume and produce fixed-width data. Let's recap what
       we've seen of "pack" and "unpack" so far:

       ·  Use "pack" to go from several pieces of data to one fixed-width
          version; use "unpack" to turn a fixed-width-format string into
          several pieces of data.

       ·  The pack format "A" means "any character"; if you're "pack"ing and
          you've run out of things to pack, "pack" will fill the rest up with
          spaces.

       ·  "x" means "skip a byte" when "unpack"ing; when "pack"ing, it means
          "introduce a null byte" - that's probably not what you mean if
          you're dealing with plain text.

       ·  You can follow the formats with numbers to say how many characters
          should be affected by that format: "A12" means "take 12 characters";
          "x6" means "skip 6 bytes" or "character 0, 6 times".

       ·  Instead of a number, you can use "*" to mean "consume everything
          else left".

          Warning: when packing multiple pieces of data, "*" only means
          "consume all of the current piece of data". That's to say

              pack("A*A*", $one, $two)

          packs all of $one into the first "A*" and then all of $two into the
          second. This is a general principle: each format character
          corresponds to one piece of data to be "pack"ed.

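       The last point of the recap is easy to verify. A brief sketch, with
       made-up values for $one and $two:

```perl
use strict;
use warnings;

my( $one, $two ) = ( 'camel', 'ride' );    # hypothetical sample values

# Each "A*" consumes exactly one list item, in full.
my $joined = pack( "A*A*", $one, $two );

# A numeric count pads (or truncates) that one item to the given width.
my $padded = pack( "A8A*", $one, $two );

print "[$joined]\n";
print "[$padded]\n";
```

       The first line printed should be [camelride], the second
       [camel   ride].
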

Packing Numbers

       So much for textual data. Let's get onto the meaty stuff that "pack"
       and "unpack" are best at: handling binary formats for numbers. There
       is, of course, not just one binary format - life would be too simple -
       but Perl will do all the finicky labor for you.

       Integers

       Packing and unpacking numbers implies conversion to and from some
       specific binary representation. Leaving floating point numbers aside
       for the moment, the salient properties of any such representation are:

       ·   the number of bytes used for storing the integer,

       ·   whether the contents are interpreted as a signed or unsigned
           number,

       ·   the byte ordering: whether the first byte is the least or most
           significant byte (or: little-endian or big-endian, respectively).

       So, for instance, to pack 20302 to a signed 16 bit integer in your
       computer's representation you write

          my $ps = pack( 's', 20302 );

       Again, the result is a string, now containing 2 bytes. If you print
       this string (which is, generally, not recommended) you might see "ON"
       or "NO" (depending on your system's byte ordering) - or something
       entirely different if your computer doesn't use ASCII character
       encoding. Unpacking $ps with the same template returns the original
       integer value:

          my( $s ) = unpack( 's', $ps );

       This is true for all numeric template codes. But don't expect miracles:
       if the packed value exceeds the allotted byte capacity, high order bits
       are silently discarded, and unpack certainly won't be able to pull them
       back out of some magic hat. And, when you pack using a signed template
       code such as "s", an excess value may result in the sign bit getting
       set, and unpacking this will smartly return a negative value.

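       The overflow behaviour just described can be observed directly. This
       is a sketch assuming the usual two's complement representation; the
       numbers 40000 and 70000 are arbitrary examples.

```perl
use strict;
use warnings;

# Fits into 16 bits: the round trip is exact.
my ($same) = unpack( 's', pack( 's', 20302 ) );

# Fits into 16 bits only as unsigned; read back as signed, bit 15 is
# taken as the sign bit.
my ($wrapped) = unpack( 's', pack( 's', 40000 ) );

# Needs 17 bits: the high order bit is silently discarded.
my ($truncated) = unpack( 'S', pack( 'S', 70000 ) );

print "$same $wrapped $truncated\n";
```

       On such a system this should print 20302 -25536 4464, since
       40000 - 65536 = -25536 and 70000 - 65536 = 4464.
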
       16 bits won't get you too far with integers, but there are "l" and "L"
       for signed and unsigned 32-bit integers. And if this is not enough and
       your system supports 64 bit integers you can push the limits much
       closer to infinity with pack codes "q" and "Q". A notable exception is
       provided by pack codes "i" and "I" for signed and unsigned integers of
       the "local custom" variety: Such an integer will take up as many bytes
       as a local C compiler returns for "sizeof(int)", but it'll use at least
       32 bits.

       Each of the integer pack codes "sSlLqQ" results in a fixed number of
       bytes, no matter where you execute your program. This may be useful for
       some applications, but it does not provide for a portable way to pass
       data structures between Perl and C programs (bound to happen when you
       call XS extensions or the Perl function "syscall"), or when you read or
       write binary files. What you'll need in this case are template codes
       that depend on what your local C compiler compiles when you code
       "short" or "unsigned long", for instance. These codes and their
       corresponding byte lengths are shown in the table below. Since the C
       standard leaves much leeway with respect to the relative sizes of these
       data types, actual values may vary, and that's why the values are given
       as expressions in C and Perl. (If you'd like to use values from %Config
       in your program you have to import it with "use Config".)

          signed unsigned  byte length in C   byte length in Perl
            s!     S!      sizeof(short)      $Config{shortsize}
            i!     I!      sizeof(int)        $Config{intsize}
            l!     L!      sizeof(long)       $Config{longsize}
            q!     Q!      sizeof(long long)  $Config{longlongsize}

       The "i!" and "I!" codes aren't different from "i" and "I"; they are
       tolerated for completeness' sake.

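       The table can be cross-checked on your own system by comparing the
       length of a packed value with the matching %Config entry. A minimal
       sketch; the "q!" row is left out because not every perl supports 64 bit
       integers.

```perl
use strict;
use warnings;
use Config;

# Native-size codes and the %Config key that should give their length.
my %size_key = ( 's!' => 'shortsize', 'i!' => 'intsize', 'l!' => 'longsize' );

for my $code ( sort keys %size_key ) {
    my $len = length( pack( $code, 0 ) );
    printf "%-2s packs to %d byte(s); \$Config{%s} is %d\n",
        $code, $len, $size_key{$code}, $Config{ $size_key{$code} };
}

# The codes without "!" have a fixed size everywhere.
printf "s packs to %d bytes, l packs to %d bytes\n",
    length( pack( 's', 0 ) ), length( pack( 'l', 0 ) );
```

       The two numbers on each of the first three lines should agree; the
       last line should report 2 and 4 regardless of platform.
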
       Unpacking a Stack Frame

       Requesting a particular byte ordering may be necessary when you work
       with binary data coming from some specific architecture whereas your
       program could run on a totally different system. As an example, assume
       you have 24 bytes containing a stack frame as it happens on an Intel
       8086:

             +---------+        +----+----+               +---------+
        TOS: |   IP    |  TOS+4:| FL | FH | FLAGS  TOS+14:|   SI    |
             +---------+        +----+----+               +---------+
             |   CS    |        | AL | AH | AX            |   DI    |
             +---------+        +----+----+               +---------+
                                | BL | BH | BX            |   BP    |
                                +----+----+               +---------+
                                | CL | CH | CX            |   DS    |
                                +----+----+               +---------+
                                | DL | DH | DX            |   ES    |
                                +----+----+               +---------+

       First, we note that this time-honored 16-bit CPU uses little-endian
       order, and that's why the low order byte is stored at the lower
       address. To unpack such an (unsigned) short we'll have to use code "v".
       A repeat count unpacks all 12 shorts:

          my( $ip, $cs, $flags, $ax, $bx, $cx, $dx, $si, $di, $bp, $ds, $es ) =
            unpack( 'v12', $frame );

       Alternatively, we could have used "C" to unpack the individually
       accessible byte registers FL, FH, AL, AH, etc.:

          my( $fl, $fh, $al, $ah, $bl, $bh, $cl, $ch, $dl, $dh ) =
            unpack( 'C10', substr( $frame, 4, 10 ) );

       It would be nice if we could do this in one fell swoop: unpack a short,
       back up a little, and then unpack 2 bytes. Since Perl is nice, it
       proffers the template code "X" to back up one byte. Putting this all
       together, we may now write:

          my( $ip, $cs,
              $flags,$fl,$fh,
              $ax,$al,$ah, $bx,$bl,$bh, $cx,$cl,$ch, $dx,$dl,$dh,
              $si, $di, $bp, $ds, $es ) =
          unpack( 'v2' . ('vXXCC' x 5) . 'v5', $frame );

       (The clumsy construction of the template can be avoided - just read
       on!)

       We've taken some pains to construct the template so that it matches the
       contents of our frame buffer. Otherwise we'd either get undefined
       values, or "unpack" could not unpack all. If "pack" runs out of items,
       it will supply null strings (which are coerced into zeroes whenever the
       pack code says so).

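       To try the combined template without an actual 8086 at hand, a frame
       can be fabricated with "pack". The register values below are invented
       for illustration only.

```perl
use strict;
use warnings;

# Hypothetical register values, in frame order:
# IP CS FLAGS AX BX CX DX SI DI BP DS ES
my @regs = ( 0x0100, 0x2000, 0x0246, 0x1234, 0x5678, 0x9ABC, 0xDEF0,
             1, 2, 3, 4, 5 );
my $frame = pack( 'v12', @regs );    # 12 little-endian shorts, 24 bytes

my( $ip, $cs,
    $flags,$fl,$fh,
    $ax,$al,$ah, $bx,$bl,$bh, $cx,$cl,$ch, $dx,$dl,$dh,
    $si, $di, $bp, $ds, $es ) =
  unpack( 'v2' . ('vXXCC' x 5) . 'v5', $frame );

# Each 16-bit register is followed by its low and high byte.
printf "AX=%04X AL=%02X AH=%02X\n", $ax, $al, $ah;
```

       This should print AX=1234 AL=34 AH=12, showing that the low byte comes
       first in the frame.
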
       How to Eat an Egg on a Net

       The pack code for big-endian (high order byte at the lowest address) is
       "n" for 16 bit and "N" for 32 bit integers. You use these codes if you
       know that your data comes from a compliant architecture, but,
       surprisingly enough, you should also use these pack codes if you
       exchange binary data, across the network, with some system that you
       know next to nothing about. The simple reason is that this order has
       been chosen as the network order, and all standard-fearing programs
       ought to follow this convention. (This is, of course, a stern backing
       for one of the Lilliputian parties and may well influence the political
       development there.) So, if the protocol expects you to send a message
       by sending the length first, followed by just so many bytes, you could
       write:

          my $buf = pack( 'N', length( $msg ) ) . $msg;

       or even:

          my $buf = pack( 'NA*', length( $msg ), $msg );

       and pass $buf to your send routine. Some protocols demand that the
       count should include the length of the count itself: then just add 4 to
       the data length. (But make sure to read "Lengths and Widths" before you
       really code this!)

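       The receiving side of such a protocol might look like the following
       sketch. The message text is a made-up example, and the buffer is
       assumed to hold exactly one complete message.

```perl
use strict;
use warnings;

my $msg = "Hello, world";    # hypothetical payload

# Sender: 32-bit big-endian length, then the payload bytes.
my $buf = pack( 'N', length($msg) ) . $msg;

# Receiver: read the length first, then take that many bytes.
my ($len)   = unpack( 'N', $buf );
my $payload = substr( $buf, 4, $len );

my ($prefix) = unpack( 'H8', $buf );    # the four length bytes in hex
print "$prefix $len $payload\n";
```

       This should print 0000000c 12 Hello, world, with the most significant
       byte of the length first.
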
       Floating point Numbers

       For packing floating point numbers you have the choice between the pack
       codes "f" and "d" which pack into (or unpack from) single-precision or
       double-precision representation as it is provided by your system.
       (There is no such thing as a network representation for reals, so if
       you want to send your real numbers across computer boundaries, you'd
       better stick to ASCII representation, unless you're absolutely sure
       what's on the other end of the line.)

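       The difference between the two codes shows up as soon as a value is
       not exactly representable in single precision. A rough sketch,
       assuming a perl whose native floating point type is a C double; the
       value 0.1 is just an example.

```perl
use strict;
use warnings;

my $x = 0.1;    # not exactly representable in binary floating point

my ($d) = unpack( 'd', pack( 'd', $x ) );    # double precision round trip
my ($f) = unpack( 'f', pack( 'f', $x ) );    # single precision round trip

printf "d: %.17g (%d bytes)\n", $d, length( pack( 'd', $x ) );
printf "f: %.17g (%d bytes)\n", $f, length( pack( 'f', $x ) );
```

       The "f" line should differ from 0.1 after about seven significant
       digits, because the value was squeezed into fewer bits on the way.
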

Exotic Templates

       Bit Strings

       Bits are the atoms in the memory world. Access to individual bits may
       have to be used either as a last resort or because it is the most
       convenient way to handle your data. Bit string (un)packing converts
       between strings containing a series of 0 and 1 characters and a
       sequence of bytes each containing a group of 8 bits. This is almost as
       simple as it sounds, except that there are two ways the contents of a
       byte may be written as a bit string. Let's have a look at an annotated
       byte:

            7 6 5 4 3 2 1 0
          +-----------------+
          | 1 0 0 0 1 1 0 0 |
          +-----------------+
           MSB           LSB

       It's egg-eating all over again: Some think that as a bit string this
       should be written "10001100" i.e. beginning with the most significant
       bit, others insist on "00110001". Well, Perl isn't biased, so that's
       why we have two bit string codes:

          $byte = pack( 'B8', '10001100' ); # start with MSB
          $byte = pack( 'b8', '00110001' ); # start with LSB

       It is not possible to pack or unpack bit fields - just integral bytes.
       "pack" always starts at the next byte boundary and "rounds up" to the
       next multiple of 8 by adding zero bits as required. (If you do want bit
       fields, there is "vec" in perlfunc. Or you could implement bit field
       handling at the character string level, using split, substr, and
       concatenation on unpacked bit strings.)

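       Both bit string codes and the rounding up to a full byte can be
       checked with a few lines. A minimal sketch using the annotated byte
       from above; the 3-bit string is an arbitrary example.

```perl
use strict;
use warnings;

my $msb_first = pack( 'B8', '10001100' );    # start with MSB
my $lsb_first = pack( 'b8', '00110001' );    # start with LSB

# Both templates describe the same byte, 0x8C.
printf "%02X %02X\n", ord($msb_first), ord($lsb_first);

# Fewer than 8 bits: the byte is filled up with zero bits.
my $partial = pack( 'B3', '101' );
my ($bits)  = unpack( 'B8', $partial );
print "$bits\n";
```

       This should print 8C 8C and then 10100000.
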
       To illustrate unpacking for bit strings, we'll decompose a simple
       status register (a "-" stands for a "reserved" bit):

          +-----------------+-----------------+
          | S Z - A - P - C | - - - - O D I T |
          +-----------------+-----------------+
           MSB           LSB MSB           LSB

       Converting these two bytes to a string can be done with the unpack
       template 'b16'. To obtain the individual bit values from the bit string
       we use "split" with the "empty" separator pattern which dissects into
       individual characters. Bit values from the "reserved" positions are
       simply assigned to "undef", a convenient notation for "I don't care
       where this goes".

          ($carry, undef, $parity, undef, $auxcarry, undef, $zero, $sign,
           $trace, $interrupt, $direction, $overflow) =
             split( //, unpack( 'b16', $status ) );

       We could have used an unpack template 'b12' just as well, since the
       last 4 bits can be ignored anyway.

       Uuencoding

       Another odd-man-out in the template alphabet is "u", which packs an
       "uuencoded string". ("uu" is short for Unix-to-Unix.) Chances are that
       you won't ever need this encoding technique which was invented to
       overcome the shortcomings of old-fashioned transmission mediums that do
       not support other than simple ASCII data. The essential recipe is
       simple: Take three bytes, or 24 bits. Split them into 4 six-packs,
       adding a space (0x20) to each. Repeat until all of the data is blended.
       Fold groups of 4 bytes into lines no longer than 60 and garnish them in
       front with the original byte count (incremented by 0x20) and a "\n" at
       the end. - The "pack" chef will prepare this for you, a la minute, when
       you select pack code "u" on the menu:

          my $uubuf = pack( 'u', $bindat );

       A repeat count after "u" sets the number of bytes to put into an
       uuencoded line, which is the maximum of 45 by default, but could be set
       to some (smaller) integer multiple of three. "unpack" simply ignores
       the repeat count.

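       The recipe can be followed by hand for a three-byte input and compared
       with what "pack" produces. A short sketch, assuming ASCII; the input
       strings are arbitrary.

```perl
use strict;
use warnings;

# "abc" is 0x61 0x62 0x63; the four 6-bit groups are 24, 22, 9 and 35.
# Adding 0x20 to each gives "8", "6", ")" and "C"; the leading "#" is the
# byte count 3 plus 0x20.
my $uu = pack( 'u', 'abc' );
print $uu;

my ($back) = unpack( 'u', $uu );
print "$back\n";

# 100 bytes are folded into lines of at most 45 input bytes each.
my $long  = pack( 'u', 'x' x 100 );
my $lines = () = $long =~ /\n/g;
print "$lines\n";
```

       The encoded form of "abc" should be #86)C, and the 100-byte input
       should come out as three lines (45 + 45 + 10 bytes).
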
       Doing Sums

       An even stranger template code is "%"<number>. First, because it's used
       as a prefix to some other template code. Second, because it cannot be
       used in "pack" at all, and third, because in "unpack" it doesn't return
       the data as defined by the template code it precedes. Instead it'll
       give you an integer of <number> bits that is computed from the data
       value by doing sums. For numeric unpack codes, no big feat is achieved:

           my $buf = pack( 'iii', 100, 20, 3 );
           print unpack( '%32i3', $buf ), "\n";  # prints 123

       For string values, "%" returns the sum of the byte values saving you
       the trouble of a sum loop with "substr" and "ord":

           print unpack( '%32A*', "\x01\x10" ), "\n";  # prints 17

       Although the "%" code is documented as returning a "checksum": don't
       put your trust in such values! Even when applied to a small number of
       bytes, they won't guarantee a noticeable Hamming distance.

       In connection with "b" or "B", "%" simply adds bits, and this can be
       put to good use to count set bits efficiently:

           my $bitcount = unpack( '%32b*', $mask );

       And an even parity bit can be determined like this:

           my $evenparity = unpack( '%1b*', $mask );

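       Here is a small sketch that runs the bit counting and parity templates
       on a concrete value; the contents of $mask and the string "perl" are
       arbitrary sample data.

```perl
use strict;
use warnings;

my $mask = "\xFF\x01";    # hypothetical mask: 8 + 1 bits set

my $bitcount   = unpack( '%32b*', $mask );    # number of set bits
my $evenparity = unpack( '%1b*',  $mask );    # bit count modulo 2

# Sum of the byte values 112 + 101 + 114 + 108.
my $bytesum = unpack( '%32C*', 'perl' );

print "$bitcount $evenparity $bytesum\n";
```

       This should print 9 1 435.
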
       Unicode

       Unicode is a character set that can represent most characters in most
       of the world's languages, providing room for over one million different
       characters. Unicode 3.1 specifies 94,140 characters: The Basic Latin
       characters are assigned to the numbers 0 - 127. The Latin-1 Supplement
       with characters that are used in several European languages is in the
       next range, up to 255. After some more Latin extensions we find the
       character sets from languages using non-Roman alphabets, interspersed
       with a variety of symbol sets such as currency symbols, Zapf Dingbats
       or Braille. (You might want to visit www.unicode.org for a look at
       some of them - my personal favourites are Telugu and Kannada.)

       The Unicode character set associates characters with integers.
       Encoding these numbers in an equal number of bytes would more than
       double the requirements for storing texts written in Latin alphabets.
       The UTF-8 encoding avoids this by storing the most common (from a
       western point of view) characters in a single byte while encoding the
       rarer ones in two or more bytes.

       So what has this got to do with "pack"? Well, if you want to convert
       between a Unicode number and its UTF-8 representation you can do so by
       using template code "U". As an example, let's produce the UTF-8
       representation of the Euro currency symbol (code number 0x20AC):

          $UTF8{Euro} = pack( 'U', 0x20AC );

       Inspecting $UTF8{Euro} shows that it contains 3 bytes: "\xe2\x82\xac".
       The round trip can be completed with "unpack":

          $Unicode{Euro} = unpack( 'U', $UTF8{Euro} );

       Usually you'll want to pack or unpack UTF-8 strings:

          # pack and unpack the Hebrew alphabet
          my $alefbet = pack( 'U*', 0x05d0..0x05ea );
          my @hebrew = unpack( 'U*', $alefbet );

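       One way to look at the three bytes is sketched below. It rests on an
       assumption about newer perls: there, "pack" with "U" yields a character
       string, so the sketch makes the UTF-8 bytes explicit with the built-in
       utf8::encode before dumping them in hex.

```perl
use strict;
use warnings;

my $euro = pack( 'U', 0x20AC );

# Work on a copy: utf8::encode converts the string in place to its
# UTF-8 byte representation.
my $bytes = $euro;
utf8::encode($bytes);
my ($hex) = unpack( 'H*', $bytes );

my ($code) = unpack( 'U', $euro );

# The Hebrew alphabet range used above, packed and unpacked again.
my @hebrew = unpack( 'U*', pack( 'U*', 0x05d0 .. 0x05ea ) );

printf "%s U+%04X %d\n", $hex, $code, scalar @hebrew;
```

       This should print e282ac U+20AC 27.
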
       Another Portable Binary Encoding

       The pack code "w" has been added to support a portable binary data
       encoding scheme that goes way beyond simple integers. (Details can be
       found at Casbah.org, the Scarab project.) A BER (Binary Encoded
       Representation) compressed unsigned integer stores base 128 digits,
       most significant digit first, with as few digits as possible. Bit eight
       (the high bit) is set on each byte except the last. There is no size
       limit to BER encoding, but Perl won't go to extremes.

          my $berbuf = pack( 'w*', 1, 128, 128+1, 128*128+127 );

       A hex dump of $berbuf, with spaces inserted at the right places, shows
       01 8100 8101 81807F. Since the last byte is always less than 128,
       "unpack" knows where to stop.

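       The hex dump mentioned above can be reproduced with "unpack" and "H*".
       A minimal sketch using the same four values:

```perl
use strict;
use warnings;

my @values = ( 1, 128, 128 + 1, 128 * 128 + 127 );
my $berbuf = pack( 'w*', @values );

# Hex dump of the packed buffer, without separating spaces.
my ($hex) = unpack( 'H*', $berbuf );
print "$hex\n";

# Each integer ends at the first byte whose high bit is clear.
my @back = unpack( 'w*', $berbuf );
print "@back\n";
```

       This should print 018100810181807f, followed by 1 128 129 16511.
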

Template Grouping

596       Prior to Perl 5.8, repetitions of templates had to be made by "x"-mul‐
597       tiplication of template strings. Now there is a better way as we may
598       use the pack codes "(" and ")" combined with a repeat count.  The
599       "unpack" template from the Stack Frame example can simply be written
600       like this:
601
602          unpack( 'v2 (vXXCC)5 v5', $frame )
603
604       Let's explore this feature a little more. We'll begin with the equiva‐
605       lent of
606
607          join( '', map( substr( $_, 0, 1 ), @str ) )
608
609       which returns a string consisting of the first character from each
610       string.  Using pack, we can write
611
612          pack( '(A)'.@str, @str )
613
614       or, because a repeat count "*" means "repeat as often as required",
615       simply
616
617          pack( '(A)*', @str )
618
619       (Note that the template "A*" would only have packed $str[0] in full
620       length.)
621
622       To pack dates stored as triplets ( day, month, year ) in an array
623       @dates into a sequence of byte, byte, short integer we can write
624
625          $pd = pack( '(CCS)*', map( @$_, @dates ) );
626
627       To swap pairs of characters in a string (with even length) one could
628       use several techniques. First, let's use "x" and "X" to skip forward
629       and back:
630
631          $s = pack( '(A)*', unpack( '(xAXXAx)*', $s ) );
632
633       We can also use "@" to jump to an offset, with 0 being the position
634       where we were when the last "(" was encountered:
635
636          $s = pack( '(A)*', unpack( '(@1A @0A @2)*', $s ) );
637
638       Finally, there is also an entirely different approach by unpacking big
639       endian shorts and packing them in the reverse byte order:
640
          $s = pack( '(v)*', unpack( '(n)*', $s ) );
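
       A short sketch, using a sample string of our own, runs the three
       techniques next to each other. Keep in mind that "A" strips trailing
       blanks when unpacking, so the first two variants are only suitable for
       strings without spaces.

```perl
use strict;
use warnings;

my $s = 'abcdef';

my $skip   = pack( '(A)*', unpack( '(xAXXAx)*',     $s ) );
my $offset = pack( '(A)*', unpack( '(@1A @0A @2)*', $s ) );
my $endian = pack( '(v)*', unpack( '(n)*',          $s ) );

print "$skip $offset $endian\n";    # badcfe badcfe badcfe
```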
642

Lengths and Widths

644       String Lengths
645
646       In the previous section we've seen a network message that was con‐
647       structed by prefixing the binary message length to the actual message.
648       You'll find that packing a length followed by so many bytes of data is
649       a frequently used recipe since appending a null byte won't work if a
650       null byte may be part of the data. Here is an example where both tech‐
651       niques are used: after two null terminated strings with source and des‐
652       tination address, a Short Message (to a mobile phone) is sent after a
653       length byte:
654
655          my $msg = pack( 'Z*Z*CA*', $src, $dst, length( $sm ), $sm );
656
657       Unpacking this message can be done with the same template:
658
659          ( $src, $dst, $len, $sm ) = unpack( 'Z*Z*CA*', $msg );
660
661       There's a subtle trap lurking in the offing: Adding another field after
662       the Short Message (in variable $sm) is all right when packing, but this
663       cannot be unpacked naively:
664
665          # pack a message
666          my $msg = pack( 'Z*Z*CA*C', $src, $dst, length( $sm ), $sm, $prio );
667
668          # unpack fails - $prio remains undefined!
669          ( $src, $dst, $len, $sm, $prio ) = unpack( 'Z*Z*CA*C', $msg );
670
671       The pack code "A*" gobbles up all remaining bytes, and $prio remains
672       undefined! Before we let disappointment dampen the morale: Perl's got
673       the trump card to make this trick too, just a little further up the
674       sleeve.  Watch this:
675
676          # pack a message: ASCIIZ, ASCIIZ, length/string, byte
677          my $msg = pack( 'Z* Z* C/A* C', $src, $dst, $sm, $prio );
678
679          # unpack
680          ( $src, $dst, $sm, $prio ) = unpack( 'Z* Z* C/A* C', $msg );
681
682       Combining two pack codes with a slash ("/") associates them with a sin‐
683       gle value from the argument list. In "pack", the length of the argument
684       is taken and packed according to the first code while the argument
685       itself is added after being converted with the template code after the
686       slash.  This saves us the trouble of inserting the "length" call, but
687       it is in "unpack" where we really score: The value of the length byte
       marks the end of the string to be taken from the buffer. Since this
       combination doesn't make sense unless the second pack code is "a*",
       "A*" or "Z*", Perl won't let you use anything else there.
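
       The difference can be watched in a small experiment. The addresses,
       text and priority below are invented sample values. Note that both
       "pack" calls produce the very same bytes; it is only "unpack" that
       profits from the slash.

```perl
use strict;
use warnings;

# Sample values, made up for illustration.
my ( $src, $dst, $sm, $prio ) =
    ( '+4366412345', '+4369987654', 'Hello, world', 3 );

# Without "/", the A* swallows the priority byte as well.
my $naive = pack( 'Z*Z*CA*C', $src, $dst, length($sm), $sm, $prio );
my @naive = unpack( 'Z*Z*CA*C', $naive );
print scalar(@naive), " fields\n";           # 4 fields, nothing left for the last C

# With "C/A*", the length byte delimits the text, so the last C finds its byte.
my $msg    = pack( 'Z* Z* C/A* C', $src, $dst, $sm, $prio );
my @fields = unpack( 'Z* Z* C/A* C', $msg );
print "$fields[2] (priority $fields[3])\n";  # Hello, world (priority 3)
```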
691
692       The pack code preceding "/" may be anything that's fit to represent a
693       number: All the numeric binary pack codes, and even text codes such as
694       "A4" or "Z*":
695
696          # pack/unpack a string preceded by its length in ASCII
697          my $buf = pack( 'A4/A*', "Humpty-Dumpty" );
698          # unpack $buf: '13  Humpty-Dumpty'
699          my $txt = unpack( 'A4/A*', $buf );
700
701       "/" is not implemented in Perls before 5.6, so if your code is required
702       to work on older Perls you'll need to "unpack( 'Z* Z* C')" to get the
703       length, then use it to make a new unpack string. For example
704
705          # pack a message: ASCIIZ, ASCIIZ, length, string, byte (5.005 compatible)
706          my $msg = pack( 'Z* Z* C A* C', $src, $dst, length $sm, $sm, $prio );
707
708          # unpack
709          ( undef, undef, $len) = unpack( 'Z* Z* C', $msg );
710          ($src, $dst, $sm, $prio) = unpack ( "Z* Z* x A$len C", $msg );
711
712       But that second "unpack" is rushing ahead. It isn't using a simple lit‐
713       eral string for the template. So maybe we should introduce...
714
715       Dynamic Templates
716
717       So far, we've seen literals used as templates. If the list of pack
718       items doesn't have fixed length, an expression constructing the tem‐
719       plate is required (whenever, for some reason, "()*" cannot be used).
720       Here's an example: To store named string values in a way that can be
721       conveniently parsed by a C program, we create a sequence of names and
722       null terminated ASCII strings, with "=" between the name and the value,
723       followed by an additional delimiting null byte. Here's how:
724
725          my $env = pack( '(A*A*Z*)' . keys( %Env ) . 'C',
726                          map( { ( $_, '=', $Env{$_} ) } keys( %Env ) ), 0 );
727
728       Let's examine the cogs of this byte mill, one by one. There's the "map"
729       call, creating the items we intend to stuff into the $env buffer: to
730       each key (in $_) it adds the "=" separator and the hash entry value.
731       Each triplet is packed with the template code sequence "A*A*Z*" that is
732       repeated according to the number of keys. (Yes, that's what the "keys"
733       function returns in scalar context.) To get the very last null byte, we
734       add a 0 at the end of the "pack" list, to be packed with "C".  (Atten‐
735       tive readers may have noticed that we could have omitted the 0.)
736
737       For the reverse operation, we'll have to determine the number of items
738       in the buffer before we can let "unpack" rip it apart:
739
740          my $n = $env =~ tr/\0// - 1;
741          my %env = map( split( /=/, $_ ), unpack( "(Z*)$n", $env ) );
742
743       The "tr" counts the null bytes. The "unpack" call returns a list of
744       name-value pairs each of which is taken apart in the "map" block.
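
       Put together, the round trip might look like the sketch below. The two
       entries in %Env are invented, and the keys are sorted only to make the
       output predictable. The third argument to "split" is our own
       precaution for values that contain a "=" themselves.

```perl
use strict;
use warnings;

# A small stand-in for %Env; the entries are made up for this example.
my %Env   = ( HOME => '/home/me', SHELL => '/bin/sh' );
my @names = sort keys %Env;

# One 'A*A*Z*' group per name: name, '=', value plus terminating null.
my $template = '(A*A*Z*)' . @names . 'C';
my $env = pack( $template, ( map { ( $_, '=', $Env{$_} ) } @names ), 0 );

# Count the null bytes; the final one is the extra delimiter.
my $n   = ( $env =~ tr/\0// ) - 1;
my %env = map( split( /=/, $_, 2 ), unpack( "(Z*)$n", $env ) );

print "$_=$env{$_}\n" for sort keys %env;
```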
745
746       Counting Repetitions
747
748       Rather than storing a sentinel at the end of a data item (or a list of
749       items), we could precede the data with a count. Again, we pack keys and
750       values of a hash, preceding each with an unsigned short length count,
751       and up front we store the number of pairs:
752
753          my $env = pack( 'S(S/A* S/A*)*', scalar keys( %Env ), %Env );
754
755       This simplifies the reverse operation as the number of repetitions can
756       be unpacked with the "/" code:
757
758          my %env = unpack( 'S/(S/A* S/A*)', $env );
759
760       Note that this is one of the rare cases where you cannot use the same
761       template for "pack" and "unpack" because "pack" can't determine a
762       repeat count for a "()"-group.
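
       A round trip with two invented entries, as a minimal sketch (it
       assumes a Perl recent enough to accept a group after the slash):

```perl
use strict;
use warnings;

my %Env = ( HOME => '/home/me', SHELL => '/bin/sh' );   # invented sample data

# Pair count first, then each key and value behind its own 16-bit length.
my $env = pack( 'S(S/A* S/A*)*', scalar keys( %Env ), %Env );
print length($env), " bytes\n";      # 34 bytes

# The leading S supplies the repeat count for the group after "/".
my %env = unpack( 'S/(S/A* S/A*)', $env );
print "$_=$env{$_}\n" for sort keys %env;
```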
763

Packing and Unpacking C Structures

765       In previous sections we have seen how to pack numbers and character
766       strings. If it were not for a couple of snags we could conclude this
767       section right away with the terse remark that C structures don't con‐
768       tain anything else, and therefore you already know all there is to it.
769       Sorry, no: read on, please.
770
771       The Alignment Pit
772
773       In the consideration of speed against memory requirements the balance
774       has been tilted in favor of faster execution. This has influenced the
775       way C compilers allocate memory for structures: On architectures where
776       a 16-bit or 32-bit operand can be moved faster between places in mem‐
777       ory, or to or from a CPU register, if it is aligned at an even or mul‐
778       tiple-of-four or even at a multiple-of eight address, a C compiler will
779       give you this speed benefit by stuffing extra bytes into structures.
780       If you don't cross the C shoreline this is not likely to cause you any
781       grief (although you should care when you design large data structures,
782       or you want your code to be portable between architectures (you do want
783       that, don't you?)).
784
785       To see how this affects "pack" and "unpack", we'll compare these two C
786       structures:
787
788          typedef struct {
789            char     c1;
790            short    s;
791            char     c2;
792            long     l;
793          } gappy_t;
794
795          typedef struct {
796            long     l;
797            short    s;
798            char     c1;
799            char     c2;
800          } dense_t;
801
802       Typically, a C compiler allocates 12 bytes to a "gappy_t" variable, but
803       requires only 8 bytes for a "dense_t". After investigating this fur‐
804       ther, we can draw memory maps, showing where the extra 4 bytes are hid‐
805       den:
806
          0           +4          +8          +12
          +--+--+--+--+--+--+--+--+--+--+--+--+
          |c1|xx|  s  |c2|xx|xx|xx|     l     |    xx = fill byte
          +--+--+--+--+--+--+--+--+--+--+--+--+
          gappy_t
812
          0           +4          +8
          +--+--+--+--+--+--+--+--+
          |     l     |  s  |c1|c2|
          +--+--+--+--+--+--+--+--+
          dense_t
818
819       And that's where the first quirk strikes: "pack" and "unpack" templates
820       have to be stuffed with "x" codes to get those extra fill bytes.
821
822       The natural question: "Why can't Perl compensate for the gaps?" war‐
823       rants an answer. One good reason is that C compilers might provide
824       (non-ANSI) extensions permitting all sorts of fancy control over the
825       way structures are aligned, even at the level of an individual struc‐
826       ture field. And, if this were not enough, there is an insidious thing
827       called "union" where the amount of fill bytes cannot be derived from
828       the alignment of the next item alone.
829
830       OK, so let's bite the bullet. Here's one way to get the alignment right
831       by inserting template codes "x", which don't take a corresponding item
832       from the list:
833
834         my $gappy = pack( 'cxs cxxx l!', $c1, $s, $c2, $l );
835
836       Note the "!" after "l": We want to make sure that we pack a long inte‐
837       ger as it is compiled by our C compiler. And even now, it will only
838       work for the platforms where the compiler aligns things as above.  And
839       somebody somewhere has a platform where it doesn't.  [Probably a Cray,
840       where "short"s, "int"s and "long"s are all 8 bytes. :-)]
841
842       Counting bytes and watching alignments in lengthy structures is bound
843       to be a drag. Isn't there a way we can create the template with a sim‐
844       ple program? Here's a C program that does the trick:
845
          #include <stdio.h>
          #include <stddef.h>

          typedef struct {
            char     fc1;
            short    fs;
            char     fc2;
            long     fl;
          } gappy_t;

          /* Print "@<offset><code> " for one field; offsetof() yields a
             size_t, so cast it to match the %d conversion. */
          #define Pt(type,field,tchar) \
            printf( "@%d%s ", (int) offsetof(type,field), # tchar );

          int main(void) {
            Pt( gappy_t, fc1, c  );
            Pt( gappy_t, fs,  s! );
            Pt( gappy_t, fc2, c  );
            Pt( gappy_t, fl,  l! );
            printf( "\n" );
            return 0;
          }
866
867       The output line can be used as a template in a "pack" or "unpack" call:
868
869         my $gappy = pack( '@0c @2s! @4c @8l!', $c1, $s, $c2, $l );
870
871       Gee, yet another template code - as if we hadn't plenty. But "@" saves
872       our day by enabling us to specify the offset from the beginning of the
873       pack buffer to the next item: This is just the value the "offsetof"
874       macro (defined in "<stddef.h>") returns when given a "struct" type and
875       one of its field names ("member-designator" in C standardese).
876
877       Neither using offsets nor adding "x"'s to bridge the gaps is satisfac‐
878       tory.  (Just imagine what happens if the structure changes.) What we
879       really need is a way of saying "skip as many bytes as required to the
880       next multiple of N".  In fluent Templatese, you say this with "x!N"
881       where N is replaced by the appropriate value. Here's the next version
882       of our struct packaging:
883
884         my $gappy = pack( 'c x!2 s c x!4 l!', $c1, $s, $c2, $l );
885
886       That's certainly better, but we still have to know how long all the
887       integers are, and portability is far away. Rather than 2, for instance,
888       we want to say "however long a short is". But this can be done by
889       enclosing the appropriate pack code in brackets: "[s]". So, here's the
890       very best we can do:
891
892         my $gappy = pack( 'c x![s] s c x![l!] l!', $c1, $s, $c2, $l );
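
       On the common platforms where a "short" takes two bytes and the
       compiler aligns as shown in the memory map, all four templates we have
       tried for "gappy_t" ought to produce the same buffer. The sketch below
       checks this with arbitrary sample values; the variable names and the
       use of the Config module are our own additions.

```perl
use strict;
use warnings;
use Config;

my ( $c1, $s, $c2, $l ) = ( 1, 2, 3, 4 );   # arbitrary sample values

my $padded  = pack( 'cxs cxxx l!',           $c1, $s, $c2, $l );
my $offsets = pack( '@0c @2s! @4c @8l!',     $c1, $s, $c2, $l );
my $aligned = pack( 'c x!2 s c x!4 l!',      $c1, $s, $c2, $l );
my $sized   = pack( 'c x![s] s c x![l!] l!', $c1, $s, $c2, $l );

# Eight bytes up to the long, then however many bytes a native long takes.
printf "%d bytes, long is %d bytes\n", length($sized), $Config{longsize};
print "all four agree\n"
    if $padded eq $offsets and $offsets eq $aligned and $aligned eq $sized;
```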
893
894       Alignment, Take 2
895
896       I'm afraid that we're not quite through with the alignment catch yet.
897       The hydra raises another ugly head when you pack arrays of structures:
898
899          typedef struct {
900            short    count;
901            char     glyph;
902          } cell_t;
903
904          typedef cell_t buffer_t[BUFLEN];
905
906       Where's the catch? Padding is neither required before the first field
907       "count", nor between this and the next field "glyph", so why can't we
908       simply pack like this:
909
910          # something goes wrong here:
911          pack( 's!a' x @buffer,
912                map{ ( $_->{count}, $_->{glyph} ) } @buffer );
913
914       This packs "3*@buffer" bytes, but it turns out that the size of "buf‐
915       fer_t" is four times "BUFLEN"! The moral of the story is that the
916       required alignment of a structure or array is propagated to the next
917       higher level where we have to consider padding at the end of each com‐
918       ponent as well. Thus the correct template is:
919
920          pack( 's!ax' x @buffer,
921                map{ ( $_->{count}, $_->{glyph} ) } @buffer );
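
       The difference shows up as soon as you measure the results. In this
       sketch the two cells are invented, a two-byte "short" is assumed, and
       the last line adds a group-based spelling of the same template as an
       alternative of our own.

```perl
use strict;
use warnings;

# Two invented cells; each hash stands in for one cell_t.
my @buffer = ( { count => 1, glyph => 'a' }, { count => 2, glyph => 'b' } );
my @items  = map { ( $_->{count}, $_->{glyph} ) } @buffer;

my $short  = pack( 's!a'  x @buffer, @items );   # no trailing fill byte
my $padded = pack( 's!ax' x @buffer, @items );   # one fill byte per cell
my $group  = pack( '(s!ax)*', @items );          # the same, written as a group

print length($short), ' ', length($padded), "\n";   # 6 8, given a 2-byte short
```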
922
923       Alignment, Take 3
924
925       And even if you take all the above into account, ANSI still lets this:
926
927          typedef struct {
928            char     foo[2];
929          } foo_t;
930
931       vary in size. The alignment constraint of the structure can be greater
932       than any of its elements. [And if you think that this doesn't affect
933       anything common, dismember the next cellphone that you see. Many have
934       ARM cores, and the ARM structure rules make "sizeof (foo_t)" == 4]
935
936       Pointers for How to Use Them
937
938       The title of this section indicates the second problem you may run into
939       sooner or later when you pack C structures. If the function you intend
940       to call expects a, say, "void *" value, you cannot simply take a refer‐
941       ence to a Perl variable. (Although that value certainly is a memory
942       address, it's not the address where the variable's contents are
943       stored.)
944
945       Template code "P" promises to pack a "pointer to a fixed length
946       string".  Isn't this what we want? Let's try:
947
948           # allocate some storage and pack a pointer to it
949           my $memory = "\x00" x $size;
950           my $memptr = pack( 'P', $memory );
951
952       But wait: doesn't "pack" just return a sequence of bytes? How can we
953       pass this string of bytes to some C code expecting a pointer which is,
954       after all, nothing but a number? The answer is simple: We have to
955       obtain the numeric address from the bytes returned by "pack".
956
957           my $ptr = unpack( 'L!', $memptr );
958
       Obviously this assumes that it is possible to typecast a pointer to an
       unsigned long and vice versa, which frequently works but should not be
       taken as a universal law. Now that we have this pointer, the next
       question is: How can we put it to good use? We need a call to some C
       function where a pointer is expected. The read(2) system call comes to
       mind:
965
966           ssize_t read(int fd, void *buf, size_t count);
967
968       After reading perlfunc explaining how to use "syscall" we can write
969       this Perl function copying a file to standard output:
970
           require 'syscall.ph';
           sub cat($){
               my $path = shift();
               my $size = -s $path;
               my $memory = "\x00" x $size;  # allocate some memory
               my $ptr = unpack( 'L!', pack( 'P', $memory ) );
               open( F, $path ) || die( "$path: cannot open ($!)\n" );
               my $fd = fileno(F);
               my $res = syscall( &SYS_read, $fd, $ptr, $size );
               print $memory;
               close( F );
           }
983
984       This is neither a specimen of simplicity nor a paragon of portability
985       but it illustrates the point: We are able to sneak behind the scenes
986       and access Perl's otherwise well-guarded memory! (Important note:
987       Perl's "syscall" does not require you to construct pointers in this
988       roundabout way. You simply pass a string variable, and Perl forwards
989       the address.)
990
991       How does "unpack" with "P" work? Imagine some pointer in the buffer
992       about to be unpacked: If it isn't the null pointer (which will smartly
993       produce the "undef" value) we have a start address - but then what?
994       Perl has no way of knowing how long this "fixed length string" is, so
995       it's up to you to specify the actual size as an explicit length after
996       "P".
997
998          my $mem = "abcdefghijklmn";
999          print unpack( 'P5', pack( 'P', $mem ) ); # prints "abcde"
1000
1001       As a consequence, "pack" ignores any number or "*" after "P".
1002
1003       Now that we have seen "P" at work, we might as well give "p" a whirl.
1004       Why do we need a second template code for packing pointers at all? The
1005       answer lies behind the simple fact that an "unpack" with "p" promises a
1006       null-terminated string starting at the address taken from the buffer,
1007       and that implies a length for the data item to be returned:
1008
1009          my $buf = pack( 'p', "abc\x00efhijklmn" );
1010          print unpack( 'p', $buf );    # prints "abc"
1011
       This is apt to be confusing, though: Because the length is implied
       by the string itself, a number after pack code "p" is a repeat count,
       not a length as after "P".
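
       A minimal sketch of the repeat count at work follows. The two strings
       are sample data, and they are kept in named variables so that the
       memory stays valid while we hold pointers to it.

```perl
use strict;
use warnings;
use Config;

# Named variables keep the strings alive while the pointers are in use.
my ( $first, $second ) = ( "abc", "defgh" );

my $buf   = pack( 'p2', $first, $second );     # two pointers, one per string
my $count = length($buf) / $Config{ptrsize};
print "$count pointers\n";                     # 2 pointers

my @strings = unpack( 'p2', $buf );            # follows both pointers
print "@strings\n";                            # abc defgh
```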
1015
1016       Using "pack(..., $x)" with "P" or "p" to get the address where $x is
1017       actually stored must be used with circumspection. Perl's internal
1018       machinery considers the relation between a variable and that address as
1019       its very own private matter and doesn't really care that we have
1020       obtained a copy. Therefore:
1021
       ·   Do not use "pack" with "p" or "P" to obtain the address of a
           variable that's bound to go out of scope (thereby freeing its
           memory) before you are done using the memory at that address.
1025
1026       ·   Be very careful with Perl operations that change the value of the
1027           variable. Appending something to the variable, for instance, might
1028           require reallocation of its storage, leaving you with a pointer
1029           into no-man's land.
1030
1031       ·   Don't think that you can get the address of a Perl variable when it
1032           is stored as an integer or double number! "pack('P', $x)" will
1033           force the variable's internal representation to string, just as if
1034           you had written something like "$x .= ''".
1035
1036       It's safe, however, to P- or p-pack a string literal, because Perl sim‐
1037       ply allocates an anonymous variable.
1038

Pack Recipes

       Here is a collection of (possibly) useful canned recipes for "pack"
       and "unpack":
1042
           # Convert IP address for socket functions
           pack( "C4", split /\./, "123.4.5.6" );

           # Count the bits in a chunk of memory (e.g. a select vector)
           unpack( '%32b*', $mask );

           # Determine the endianness of your system
           $is_little_endian = unpack( 'c', pack( 's', 1 ) );
           $is_big_endian = unpack( 'xc', pack( 's', 1 ) );

           # Determine the number of bits in a native integer
           $bits = unpack( '%32b*', pack( 'I!', ~0 ) );

           # Prepare argument for the nanosleep system call
           my $timespec = pack( 'L!L!', $secs, $nanosecs );
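
       If you'd rather see some of these recipes produce a result, here is a
       sketch that runs them on sample data of our own. The comparison with
       Socket's "inet_aton" and the look at %Config are additions for
       checking purposes. The bit count is taken by packing ~0 first, since
       "unpack" expects a string of bytes as its second argument.

```perl
use strict;
use warnings;
use Config;
use Socket qw( inet_aton );

# The packed dotted quad is the 4-byte form that inet_aton() produces.
my $packed = pack( 'C4', split /\./, '123.4.5.6' );
print "matches inet_aton\n" if $packed eq inet_aton('123.4.5.6');

# "\x0f\x01" has four bits set in the first byte and one in the second.
my $mask    = "\x0f\x01";
my $bit_sum = unpack( '%32b*', $mask );
print "$bit_sum bits set\n";                      # 5 bits set

# Exactly one of the two endianness tests yields 1.
my $is_little_endian = unpack( 'c',  pack( 's', 1 ) );
my $is_big_endian    = unpack( 'xc', pack( 's', 1 ) );
print( ( $is_little_endian ? 'little' : 'big' ), " endian\n" );

# Count the one-bits of an all-ones native unsigned int.
my $bits = unpack( '%32b*', pack( 'I!', ~0 ) );
print "$bits bits in a native integer\n";
```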
1058
1059       For a simple memory dump we unpack some bytes into just as many pairs
1060       of hex digits, and use "map" to handle the traditional spacing - 16
1061       bytes to a line:
1062
1063           my $i;
1064           print map( ++$i % 16 ? "$_ " : "$_\n",
1065                      unpack( 'H2' x length( $mem ), $mem ) ),
1066                 length( $mem ) % 16 ? "\n" : '';
1067

Funnies Section

1069           # Pulling digits out of nowhere...
1070           print unpack( 'C', pack( 'x' ) ),
1071                 unpack( '%B*', pack( 'A' ) ),
1072                 unpack( 'H', pack( 'A' ) ),
1073                 unpack( 'A', unpack( 'C', pack( 'A' ) ) ), "\n";
1074
1075           # One for the road ;-)
1076           my $advice = pack( 'all u can in a van' );
1077

Authors

1079       Simon Cozens and Wolfgang Laun.
1080
1081
1082
1083perl v5.8.8                       2006-01-07                    PERLPACKTUT(1)