LIBARCHIVE_INTERNALS(3)  BSD Library Functions Manual  LIBARCHIVE_INTERNALS(3)

NAME

     libarchive_internals -- description of libarchive internal interfaces

OVERVIEW

     The libarchive library provides a flexible interface for reading and
     writing streaming archive files such as tar and cpio.  Internally, it
     follows a modular layered design that should make it easy to add new
     archive and compression formats.

GENERAL ARCHITECTURE

     Externally, libarchive exposes most operations through an opaque,
     object-style interface.  The archive_entry(3) objects store information
     about a single filesystem object.  The rest of the library provides
     facilities to write archive_entry(3) objects to archive files, read them
     from archive files, and write them to disk.  (There are plans to add a
     facility to read archive_entry(3) objects from disk as well.)

     The read and write APIs each have four layers: a public API layer, a
     format layer that understands the archive file format, a compression
     layer, and an I/O layer.  The I/O layer is completely exposed to
     clients, who can replace it entirely with their own functions.

     In order to provide as much consistency as possible for clients, some
     public functions are virtualized.  Eventually, it should be possible for
     clients to open an archive or disk writer, and then use a single set of
     code to select and write entries, regardless of the target.

READ ARCHITECTURE

     From the outside, clients use the archive_read(3) API to manipulate an
     archive object to read entries and bodies from an archive stream.
     Internally, the archive object is cast to an archive_read object, which
     holds all read-specific data.

     The API has four layers.  The lowest layer is the I/O layer.  This layer
     can be overridden by clients, but most clients use the packaged I/O
     callbacks provided, for example, by archive_read_open_memory(3) and
     archive_read_open_fd(3).  The compression layer calls the I/O layer to
     read bytes and decompresses them for the format layer.  The format layer
     unpacks a stream of uncompressed bytes and creates archive_entry objects
     from the incoming data.  The API layer tracks overall state (for
     example, it prevents clients from reading data before reading a header)
     and invokes the format and compression layer operations through
     registered function pointers.

     In particular, the API layer drives the format-detection process: when
     opening the archive, it reads an initial block of data and offers it to
     each registered compression handler.  The one with the highest bid is
     initialized with the first block.  Similarly, the format handlers are
     polled to see which handler is the best for each archive.  (Prior to
     2.4.0, the format bidders were invoked for each entry, but this design
     hindered error recovery.)
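
     The bidding step can be sketched roughly as follows.  This is a minimal
     illustration, not libarchive's actual code: struct bidder, bid_gzip(),
     bid_none(), and pick_best_bidder() are hypothetical names introduced
     here.  The only external fact assumed is the two-byte gzip magic number
     (0x1f 0x8b).

```c
#include <stddef.h>

/* Hypothetical handler record: a name plus a bid function. */
struct bidder {
    const char *name;
    int (*bid)(const unsigned char *buf, size_t len);
};

/* Example bidder: matches the two-byte gzip magic number. */
static int bid_gzip(const unsigned char *buf, size_t len)
{
    if (len < 2 || buf[0] != 0x1f || buf[1] != 0x8b)
        return 0;
    return 16; /* 16 bits verified */
}

/* Example bidder: a pass-through handler that weakly accepts anything. */
static int bid_none(const unsigned char *buf, size_t len)
{
    (void)buf;
    (void)len;
    return 1;
}

/* Offer the first block to every bidder; return the highest one or NULL. */
static const struct bidder *
pick_best_bidder(const struct bidder *table, size_t count,
                 const unsigned char *buf, size_t len)
{
    const struct bidder *best = NULL;
    int best_bid = 0;

    for (size_t i = 0; i < count; i++) {
        int b = table[i].bid(buf, len);
        if (b > best_bid) {
            best_bid = b;
            best = &table[i];
        }
    }
    return best;
}
```

     A bid of zero never wins in this sketch, so a handler that rejects the
     data is simply skipped.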

   I/O Layer and Client Callbacks
     The read API goes to some lengths to be nice to clients.  As a result,
     there are few restrictions on the behavior of the client callbacks.

     The client read callback is expected to provide a block of data on each
     call.  A zero-length return does indicate end of file, but otherwise
     blocks may be as small as one byte or as large as the entire file.  In
     particular, blocks may be of different sizes.
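
     The read callback contract can be illustrated with a small
     self-contained sketch.  The names struct mem_source and mem_read() are
     hypothetical and do not match the real callback signature; the point is
     only that each call may return a block of a different size and that a
     zero-length return marks end of file.

```c
#include <stddef.h>

/* Hypothetical client state: an in-memory source handed out in chunks. */
struct mem_source {
    const unsigned char *data;
    size_t size;
    size_t pos;
    size_t chunk;   /* size of the next block; must start nonzero */
};

/*
 * Read callback sketch: point *block at the next chunk and return its
 * length.  A return of 0 signals end of file.
 */
static long mem_read(struct mem_source *src, const void **block)
{
    size_t remaining = src->size - src->pos;
    size_t n = src->chunk < remaining ? src->chunk : remaining;

    *block = src->data + src->pos;
    src->pos += n;
    src->chunk *= 2;    /* deliberately vary the block size */
    return (long)n;
}
```

     With a 10-byte source and an initial chunk of 1, successive calls
     return blocks of 1, 2, 4, and 3 bytes, followed by 0.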

     The client skip callback returns the number of bytes actually skipped,
     which may be much smaller than the skip requested.  The only requirement
     is that the skip not be larger.  In particular, clients are allowed to
     return zero for any skip that they don't want to handle.  The skip
     callback must never be invoked with a negative value.
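
     A skip callback honoring these rules might look like the sketch below.
     The struct skip_state type, its align field, and client_skip() are
     assumptions made for illustration; a real client would seek on its
     underlying file or device instead of adjusting a counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical client state for a seekable source. */
struct skip_state {
    int64_t pos;    /* current offset */
    int64_t size;   /* total length */
    int64_t align;  /* only skip in multiples of this many bytes (> 0) */
};

/* Skip sketch: returns bytes actually skipped, never more than requested. */
static int64_t client_skip(struct skip_state *s, int64_t request)
{
    assert(request >= 0);   /* negative skips are never issued */

    int64_t avail = s->size - s->pos;
    int64_t n = request < avail ? request : avail;

    n -= n % s->align;      /* round down; the result may be zero */
    s->pos += n;
    return n;
}
```

     Rounding down to a block boundary is one reason a client might skip
     less than requested; the caller is expected to read and discard the
     remainder.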

     Keep in mind that not all clients are reading from disk: clients reading
     from networks may provide different-sized blocks on every request and
     cannot skip at all; advanced clients may use mmap(2) to read the entire
     file into memory at once and return the entire file to libarchive as a
     single block; other clients may begin asynchronous I/O operations for
     the next block on each request.

   Decompression Layer
     The decompression layer not only handles decompression, it also buffers
     data so that the format handlers see a much nicer I/O model.  The
     decompression API is a two-stage peek/consume model.  A read_ahead
     request specifies a minimum read amount; the decompression layer must
     provide a pointer to at least that much data.  If more data is
     immediately available, it should return more: the format layer handles
     bulk data reads by asking for a minimum of one byte and then copying as
     much data as is available.

     A subsequent call to the consume() function advances the read pointer.
     Note that data returned from a read_ahead() call is guaranteed to remain
     in place until the next call to read_ahead().  Intervening calls to
     consume() should not cause the data to move.
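
     The peek/consume model can be sketched as below.  This is an
     illustration under simplifying assumptions, not the internal API: the
     names struct peek_buf, peek_read_ahead(), and peek_consume() are
     hypothetical, and all data is assumed to be in memory already, so no
     refilling is shown.

```c
#include <stddef.h>

/* Hypothetical buffered stream; all data is already in memory here. */
struct peek_buf {
    const unsigned char *data;
    size_t size;
    size_t pos;     /* read pointer, advanced only by consume */
};

/*
 * Peek: return a pointer to the unread data and store how much is
 * available in *avail.  Returns NULL if fewer than min bytes remain.
 */
static const unsigned char *
peek_read_ahead(struct peek_buf *b, size_t min, size_t *avail)
{
    *avail = b->size - b->pos;
    if (*avail < min)
        return NULL;
    return b->data + b->pos;
}

/* Consume: advance the read pointer; previously returned data stays put. */
static void peek_consume(struct peek_buf *b, size_t n)
{
    if (n > b->size - b->pos)
        n = b->size - b->pos;
    b->pos += n;
}
```

     A caller asks for a minimum of one byte, uses as much of the returned
     data as it needs, and then consumes exactly that amount.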

     Skip requests must always be handled exactly.  Decompression handlers
     that cannot seek forward should not register a skip handler; the API
     layer fills in a generic skip handler that reads and discards data.
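
     A read-and-discard skip of this kind could be sketched as follows.  The
     struct stream type and the stream_peek(), stream_consume(), and
     generic_skip() functions are hypothetical stand-ins; the actual generic
     handler is not shown in this manual.

```c
#include <stdint.h>

/* Hypothetical stream that delivers at most `block` bytes per peek. */
struct stream {
    int64_t remaining;  /* bytes left in the stream */
    int64_t block;      /* most bytes visible at once */
};

/* How many bytes are visible right now (0 at end of stream). */
static int64_t stream_peek(const struct stream *s)
{
    return s->remaining < s->block ? s->remaining : s->block;
}

/* Discard n bytes that were previously made visible. */
static void stream_consume(struct stream *s, int64_t n)
{
    s->remaining -= n;
}

/* Skip exactly `request` bytes unless the stream ends first. */
static int64_t generic_skip(struct stream *s, int64_t request)
{
    int64_t skipped = 0;

    while (skipped < request) {
        int64_t avail = stream_peek(s);
        if (avail == 0)
            break;                          /* premature end of stream */
        int64_t want = request - skipped;
        int64_t n = avail < want ? avail : want;
        stream_consume(s, n);
        skipped += n;
    }
    return skipped;
}
```

     Because only the needed part of each block is consumed, the skip is
     exact even when the block size does not divide the request.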

     A decompression handler has a specific lifecycle:
     Registration/Configuration
             When the client invokes the public support function, the
             decompression handler invokes the internal
             __archive_read_register_compression() function to provide bid
             and initialization functions.  This function returns NULL on
             error or else a pointer to a struct decompressor_t.  This
             structure contains a void * config slot that can be used for
             storing any customization information.
     Bid     The bid function is invoked with a pointer and size of a block of
             data.  The decompressor can access its config data through the
             decompressor element of the archive_read object.  The bid
             function is otherwise stateless.  In particular, it must not
             perform any I/O operations.

             The value returned by the bid function indicates its suitability
             for handling this data stream.  A bid of zero will ensure that
             this decompressor is never invoked.  Return zero if magic number
             checks fail.  Otherwise, your initial implementation should
             return the number of bits actually checked.  For example, if you
             verify two full bytes and three bits of another byte, bid 19.
             Note that the initial block may be very short; be careful to only
             inspect the data you are given.  (The current decompressors
             require two bytes for correct bidding.)
     Initialize
             The winning bidder will have its init function called.  This
             function should initialize the remaining slots of the struct
             decompressor_t object pointed to by the decompressor element of
             the archive_read object.  In particular, it should allocate any
             working data it needs in the data slot of that structure.  The
             init function is called with the block of data that was used for
             tasting.  At this point, the decompressor is responsible for all
             I/O requests to the client callbacks.  The decompressor is free
             to read more data as and when necessary.
     Satisfy I/O requests
             The format handler will invoke the read_ahead, consume, and skip
             functions as needed.
     Finish  The finish method is called only once when the archive is closed.
             It should release anything stored in the data and config slots of
             the decompressor object.  It should not invoke the client close
             callback.
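
     The bid rule from the list above (return the number of bits actually
     checked) can be sketched for an invented format.  The magic bytes 0xAB
     0xCD, the three-bit pattern 101, and the name bid_example() are all
     made up for illustration; no real decompressor uses them.

```c
#include <stddef.h>

/*
 * Bid sketch for an invented format: two magic bytes (0xAB 0xCD)
 * followed by a byte whose top three bits must be 101.
 * Returns the number of bits verified, or 0 on a mismatch.
 */
static int bid_example(const unsigned char *buf, size_t len)
{
    if (len < 2)
        return 0;       /* too short to check the magic number */
    if (buf[0] != 0xAB || buf[1] != 0xCD)
        return 0;       /* magic number check failed */
    if (len < 3)
        return 16;      /* only two full bytes could be inspected */
    if ((buf[2] & 0xE0) != 0xA0)
        return 0;       /* top three bits are not 101 */
    return 19;          /* 16 bits + 3 bits verified */
}
```

     The function never reads past len, which matters because the initial
     block may be very short.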

   Format Layer
     The read formats have a similar lifecycle to the decompression handlers:
     Registration
             Allocate your private data and initialize your pointers.
     Bid     Formats bid by invoking the read_ahead() decompression method but
             not calling the consume() method.  This allows each bidder to
             look ahead in the input stream.  Bidders should not look further
             ahead than necessary, as long look aheads put pressure on the
             decompression layer to buffer lots of data.  Most formats only
             require a few hundred bytes of look ahead; look aheads of a few
             kilobytes are reasonable.  (The ISO9660 reader sometimes looks
             ahead by 48k, which should be considered an upper limit.)
     Read header
             The header read is usually the most complex part of any format.
             There are a few strategies worth mentioning: For formats such as
             tar or cpio, reading and parsing the header is straightforward
             since headers alternate with data.  For formats that store all
             header data at the beginning of the file, the first header read
             request may have to read all headers into memory and store that
             data, sorted by the location of the file data.  Subsequent header
             read requests will skip forward to the beginning of the file data
             and return the corresponding header.
     Read Data
             The read data interface supports sparse files; this requires that
             each call return a block of data specifying the file offset and
             size.  This may require you to carefully track the location so
             that you can return accurate file offsets for each read.
             Remember that the decompressor will return as much data as it
             has.  Generally, you will want to request one byte, examine the
             return value to see how much data is available, and possibly trim
             that to the amount you can use.  You should invoke consume for
             each block just before you return it.
     Skip All Data
             The skip data call should skip over all file data and trailing
             padding.  This is called automatically by the API layer just
             before each header read.  It is also called in response to the
             client calling the public data_skip() function.
     Cleanup
             On cleanup, the format should release all of its allocated
             memory.
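
     The Read Data step from the list above (take what is available, trim it
     to the current entry, report the file offset, then consume) might be
     sketched like this.  struct entry_reader and read_data() are
     hypothetical names, and the decompressed stream is modeled as a plain
     in-memory array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader state for one entry's body. */
struct entry_reader {
    const unsigned char *stream;    /* decompressed bytes available now */
    size_t avail;                   /* how many bytes the stream holds */
    size_t pos;                     /* bytes already consumed */
    int64_t entry_remaining;        /* body bytes left in this entry */
    int64_t entry_offset;           /* file offset of the next block */
};

/*
 * Return the next block of the entry body together with its file
 * offset.  Returns 1 when a block was produced, 0 when the entry is
 * finished or no data is available.
 */
static int read_data(struct entry_reader *r, const void **buf,
                     size_t *size, int64_t *offset)
{
    size_t n = r->avail - r->pos;           /* everything available */

    if (r->entry_remaining == 0 || n == 0)
        return 0;
    if ((int64_t)n > r->entry_remaining)    /* trim to this entry */
        n = (size_t)r->entry_remaining;

    *buf = r->stream + r->pos;
    *size = n;
    *offset = r->entry_offset;

    r->pos += n;                            /* consume just before returning */
    r->entry_remaining -= (int64_t)n;
    r->entry_offset += (int64_t)n;
    return 1;
}
```

     Starting entry_offset at a nonzero value models a sparse file whose
     first data block does not begin at offset zero.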

   API Layer
     XXX to do XXX

WRITE ARCHITECTURE

     The write API has a similar set of four layers: an API layer, a format
     layer, a compression layer, and an I/O layer.  The registration here is
     much simpler because only one format and one compression can be
     registered at a time.

   I/O Layer and Client Callbacks
     XXX To be written XXX

   Compression Layer
     XXX To be written XXX

   Format Layer
     XXX To be written XXX

   API Layer
     XXX To be written XXX

WRITE_DISK ARCHITECTURE

     The write_disk API is intended to look just like the write API to
     clients.  Since it does not handle multiple formats or compression, it
     is not layered internally.

GENERAL SERVICES

     The archive_read, archive_write, and archive_write_disk objects all
     contain an initial archive object which provides common support for a
     set of standard services.  (Recall that ANSI/ISO C90 guarantees that you
     can cast freely between a pointer to a structure and a pointer to the
     first element of that structure.)  The archive object has a magic value
     that indicates which API this object is associated with, slots for
     storing error information, and function pointers for virtualized API
     functions.
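
     The embedding technique can be illustrated with a short sketch.  The
     struct layouts, the magic constants, and the names reader_close() and
     as_reader() are hypothetical; the real structures contain more fields
     and use different values.

```c
#include <stddef.h>

#define READ_MAGIC  0x52454144U  /* hypothetical magic values */
#define WRITE_MAGIC 0x57524954U

/* Common part shared by every object type. */
struct base_archive {
    unsigned magic;                         /* identifies the owning API */
    int error_code;                         /* slot for error information */
    int (*close)(struct base_archive *);    /* virtualized operation */
};

/* A read object: the common part must be the first member. */
struct reader {
    struct base_archive base;
    int blocks_read;
};

/* Virtualized close for readers; reports how many blocks were read. */
static int reader_close(struct base_archive *a)
{
    struct reader *r = (struct reader *)a;  /* valid: base is first member */
    return r->blocks_read;
}

/* Checked downcast: refuse objects that belong to another API. */
static struct reader *as_reader(struct base_archive *a)
{
    if (a == NULL || a->magic != READ_MAGIC)
        return NULL;
    return (struct reader *)a;
}
```

     Checking the magic value before the cast is what lets a public function
     reject, for example, a write object passed to a read-only call.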

MISCELLANEOUS NOTES

     Connecting existing archiving libraries into libarchive is generally
     quite difficult.  In particular, many existing libraries strongly assume
     that you are reading from a file; they seek forwards and backwards as
     necessary to locate various pieces of information.  In contrast,
     libarchive never seeks backwards in its input, which sometimes requires
     very different approaches.

     For example, libarchive's ISO9660 support operates very differently from
     most ISO9660 readers.  The libarchive support utilizes a work-queue
     design that keeps a list of known entries sorted by their location in
     the input.  Whenever libarchive's ISO9660 implementation is asked for
     the next header, it checks this list to find the next item on the disk.
     Directories are parsed when they are encountered and new items are added
     to the list.  This design relies heavily on the ISO9660 image being
     optimized so that directories always occur earlier on the disk than the
     files they describe.
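
     A location-sorted work queue of this kind can be sketched as follows.
     This is not the ISO9660 reader's actual data structure: the fixed-size
     array, struct pending, queue_add(), and queue_next() are assumptions
     chosen to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

#define QUEUE_MAX 16

/* Hypothetical pending item: where its data lives in the image. */
struct pending {
    int64_t location;
    int is_dir;
};

struct work_queue {
    struct pending items[QUEUE_MAX];
    size_t count;
};

/* Insert keeping items sorted by ascending location; returns 0 if full. */
static int queue_add(struct work_queue *q, struct pending p)
{
    size_t i;

    if (q->count == QUEUE_MAX)
        return 0;
    for (i = q->count; i > 0 && q->items[i - 1].location > p.location; i--)
        q->items[i] = q->items[i - 1];
    q->items[i] = p;
    q->count++;
    return 1;
}

/* Remove the item with the lowest location; returns 0 if empty. */
static int queue_next(struct work_queue *q, struct pending *out)
{
    if (q->count == 0)
        return 0;
    *out = q->items[0];
    for (size_t i = 1; i < q->count; i++)
        q->items[i - 1] = q->items[i];
    q->count--;
    return 1;
}
```

     Taking the lowest location first means the reader only ever moves
     forward through the input, provided each directory precedes the files
     it describes.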

     Depending on the specific format, such approaches may not be possible.
     The ZIP format specification, for example, allows archivers to store key
     information only at the end of the file.  In theory, it is possible to
     create ZIP archives that cannot be read without seeking.  Fortunately,
     such archives are very rare, and libarchive can read most ZIP archives,
     though it cannot always extract as much information as a dedicated ZIP
     program.

SEE ALSO

     archive(3), archive_entry(3), archive_read(3), archive_write(3),
     archive_write_disk(3)

HISTORY

     The libarchive library first appeared in FreeBSD 5.3.

AUTHORS

     The libarchive library was written by Tim Kientzle <kientzle@acm.org>.

BSD                            January 26, 2011                            BSD