gitformat-pack(5)

1GITFORMAT-PACK(5)                 Git Manual                 GITFORMAT-PACK(5)
2
3
4

NAME

6       gitformat-pack - Git pack format
7

SYNOPSIS

9       $GIT_DIR/objects/pack/pack-.{pack,idx}
10       $GIT_DIR/objects/pack/pack-.rev
11       $GIT_DIR/objects/pack/pack-*.mtimes
12       $GIT_DIR/objects/pack/multi-pack-index
13

DESCRIPTION

15       The Git pack format is how Git stores most of its primary repository
16       data. Over the lifetime of a repository, loose objects (if any) and
17       smaller packs are consolidated into larger pack(s). See git-gc(1) and
18       git-pack-objects(1).
19
20       The pack format is also used over-the-wire, see e.g. gitprotocol-v2(5),
21       as well as being a part of other container formats in the case of
22       gitformat-bundle(5).
23

CHECKSUMS AND OBJECT IDS

25       In a repository using the traditional SHA-1, pack checksums, index
26       checksums, and object IDs (object names) mentioned below are all
27       computed using SHA-1. Similarly, in SHA-256 repositories, these values
28       are computed using SHA-256.
29

PACK-*.PACK FILES HAVE THE FOLLOWING FORMAT:

31       •   A header appears at the beginning and consists of the following:
32
33               4-byte signature:
34                   The signature is: {'P', 'A', 'C', 'K'}
35
36                  4-byte version number (network byte order):
37               Git currently accepts version number 2 or 3 but
38                      generates version 2 only.
39
40               4-byte number of objects contained in the pack (network byte order)
41
42               Observation: we cannot have more than 4G versions ;-) and
43               more than 4G objects in a pack.
44
45       •   The header is followed by a number of object entries, each of which
46           looks like this:
47
48               (undeltified representation)
49               n-byte type and length (3-bit type, (n-1)*7+4-bit length)
50               compressed data
51
52                  (deltified representation)
53                  n-byte type and length (3-bit type, (n-1)*7+4-bit length)
54                  base object name if OBJ_REF_DELTA or a negative relative
55               offset from the delta object's position in the pack if this
56               is an OBJ_OFS_DELTA object
57                  compressed delta data
58
59               Observation: the length of each object is encoded in a variable
60               length format and is not constrained to 32-bit or anything.
61
62       •   The trailer records a pack checksum of all of the above.
63
64   Object types
65       Valid object types are:
66
67       •   OBJ_COMMIT (1)
68
69       •   OBJ_TREE (2)
70
71       •   OBJ_BLOB (3)
72
73       •   OBJ_TAG (4)
74
75       •   OBJ_OFS_DELTA (6)
76
77       •   OBJ_REF_DELTA (7)
78
79       Type 5 is reserved for future expansion. Type 0 is invalid.
80
81   Size encoding
82       This document uses the following "size encoding" of non-negative
83       integers: From each byte, the seven least significant bits are used to
84       form the resulting integer. As long as the most significant bit is 1,
85       this process continues; the byte with MSB 0 provides the last seven
86       bits. The seven-bit chunks are concatenated. Later values are more
87       significant.
88
89       This size encoding should not be confused with the "offset encoding",
90       which is also used in this document.
91
92   Deltified representation
93       Conceptually there are only four object types: commit, tree, tag and
94       blob. However to save space, an object could be stored as a "delta" of
95       another "base" object. These representations are assigned new types
96       ofs-delta and ref-delta, which is only valid in a pack file.
97
98       Both ofs-delta and ref-delta store the "delta" to be applied to another
99       object (called base object) to reconstruct the object. The difference
100       between them is, ref-delta directly encodes base object name. If the
101       base object is in the same pack, ofs-delta encodes the offset of the
102       base object in the pack instead.
103
104       The base object could also be deltified if it’s in the same pack.
105       Ref-delta can also refer to an object outside the pack (i.e. the
106       so-called "thin pack"). When stored on disk however, the pack should be
107       self contained to avoid cyclic dependency.
108
109       The delta data starts with the size of the base object and the size of
110       the object to be reconstructed. These sizes are encoded using the size
111       encoding from above. The remainder of the delta data is a sequence of
112       instructions to reconstruct the object from the base object. If the
113       base object is deltified, it must be converted to canonical form first.
114       Each instruction appends more and more data to the target object until
115       it’s complete. There are two supported instructions so far: one for
116       copying a byte range from the source object and one for inserting new
117       data embedded in the instruction itself.
118
119       Each instruction has variable length. Instruction type is determined by
120       the seventh bit of the first octet. The following diagrams follow the
121       convention in RFC 1951 (Deflate compressed data format).
122
123       Instruction to copy from base object
124               +----------+---------+---------+---------+---------+-------+-------+-------+
125               | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 |
126               +----------+---------+---------+---------+---------+-------+-------+-------+
127
128           This is the instruction format to copy a byte range from the source
129           object. It encodes the offset to copy from and the number of bytes
130           to copy. Offset and size are in little-endian order.
131
132           All offset and size bytes are optional. This is to reduce the
133           instruction size when encoding small offsets or sizes. The first
134           seven bits in the first octet determine which of the next seven
135           octets is present. If bit zero is set, offset1 is present. If bit
136           one is set offset2 is present and so on.
137
138           Note that a more compact instruction does not change offset and
139           size encoding. For example, if only offset2 is omitted like below,
140           offset3 still contains bits 16-23. It does not become offset2 and
141           contains bits 8-15 even if it’s right next to offset1.
142
143               +----------+---------+---------+
144               | 10000101 | offset1 | offset3 |
145               +----------+---------+---------+
146
147           In its most compact form, this instruction only takes up one byte
148           (0x80) with both offset and size omitted, which will have default
149           values zero. There is another exception: size zero is automatically
150           converted to 0x10000.
151
152       Instruction to add new data
153               +----------+============+
154               | 0xxxxxxx |    data    |
155               +----------+============+
156
157           This is the instruction to construct the target object without the
158           base object. The following data is appended to the target object.
159           The first seven bits of the first octet determine the size of data
160           in bytes. The size must be non-zero.
161
162       Reserved instruction
163               +----------+============
164               | 00000000 |
165               +----------+============
166
167           This is the instruction reserved for future expansion.
168

ORIGINAL (VERSION 1) PACK-*.IDX FILES HAVE THE FOLLOWING FORMAT:

170       •   The header consists of 256 4-byte network byte order integers. N-th
171           entry of this table records the number of objects in the
172           corresponding pack, the first byte of whose object name is less
173           than or equal to N. This is called the first-level fan-out table.
174
175       •   The header is followed by sorted 24-byte entries, one entry per
176           object in the pack. Each entry is:
177
178               4-byte network byte order integer, recording where the
179               object is stored in the packfile as the offset from the
180               beginning.
181
182               one object name of the appropriate size.
183
184       •   The file is concluded with a trailer:
185
186               A copy of the pack checksum at the end of the corresponding
187               packfile.
188
189               Index checksum of all of the above.
190
191       Pack Idx file:
192
193                --  +--------------------------------+
194           fanout        | fanout[0] = 2 (for example)    |-.
195           table         +--------------------------------+ |
196                    | fanout[1]                      | |
197                    +--------------------------------+ |
198                    | fanout[2]                      | |
199                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
200                    | fanout[255] = total objects    |---.
201                --  +--------------------------------+ | |
202           main     | offset                         | | |
203           index         | object name 00XXXXXXXXXXXXXXXX | | |
204           table         +--------------------------------+ | |
205                    | offset                         | | |
206                    | object name 00XXXXXXXXXXXXXXXX | | |
207                    +--------------------------------+<+ |
208                  .-| offset                         |   |
209                  | | object name 01XXXXXXXXXXXXXXXX |   |
210                  | +--------------------------------+   |
211                  | | offset                         |   |
212                  | | object name 01XXXXXXXXXXXXXXXX |   |
213                  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~   |
214                  | | offset                         |   |
215                  | | object name FFXXXXXXXXXXXXXXXX |   |
216                --| +--------------------------------+<--+
217           trailer     | | packfile checksum              |
218                  | +--------------------------------+
219                  | | idxfile checksum               |
220                  | +--------------------------------+
221                     .-------.
222                             |
223           Pack file entry: <+
224
225               packed object header:
226           1-byte size extension bit (MSB)
227                  type (next 3 bit)
228                  size0 (lower 4-bit)
229                  n-byte sizeN (as long as MSB is set, each 7-bit)
230                size0..sizeN form 4+7+7+..+7 bit integer, size0
231                is the least significant part, and sizeN is the
232                most significant part.
233               packed object data:
234                  If it is not DELTA, then deflated bytes (the size above
235                is the size before compression).
236           If it is REF_DELTA, then
237             base object name (the size above is the
238                size of the delta data that follows).
239                    delta data, deflated.
240           If it is OFS_DELTA, then
241             n-byte offset (see below) interpreted as a negative
242                offset from the type-byte of the header of the
243                ofs-delta entry (the size above is the size of
244                the delta data that follows).
245             delta data, deflated.
246
247             offset encoding:
248           n bytes with MSB set in all but the last one.
249           The offset is then the number constructed by
250           concatenating the lower 7 bit of each byte, and
251           for n >= 2 adding 2^7 + 2^14 + ... + 2^(7*(n-1))
252           to the result.
253

VERSION 2 PACK-*.IDX FILES SUPPORT PACKS LARGER THAN 4 GIB, AND

255           have some other reorganizations.  They have the format:
256
257       •   A 4-byte magic number \377tOc which is an unreasonable fanout[0]
258           value.
259
260       •   A 4-byte version number (= 2)
261
262       •   A 256-entry fan-out table just like v1.
263
264       •   A table of sorted object names. These are packed together without
265           offset values to reduce the cache footprint of the binary search
266           for a specific object name.
267
268       •   A table of 4-byte CRC32 values of the packed object data. This is
269           new in v2 so compressed data can be copied directly from pack to
270           pack during repacking without undetected data corruption.
271
272       •   A table of 4-byte offset values (in network byte order). These are
273           usually 31-bit pack file offsets, but large offsets are encoded as
274           an index into the next table with the msbit set.
275
276       •   A table of 8-byte offset entries (empty for pack files less than 2
277           GiB). Pack files are organized with heavily used objects toward the
278           front, so most object references should not need to refer to this
279           table.
280
281       •   The same trailer as a v1 pack file:
282
283               A copy of the pack checksum at the end of the
284               corresponding packfile.
285
286               Index checksum of all of the above.
287

PACK-*.REV FILES HAVE THE FORMAT:

289       •   A 4-byte magic number 0x52494458 (RIDX).
290
291       •   A 4-byte version identifier (= 1).
292
293       •   A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
294
295       •   A table of index positions (one per packed object, num_objects in
296           total, each a 4-byte unsigned integer in network order), sorted by
297           their corresponding offsets in the packfile.
298
299       •   A trailer, containing a:
300
301               checksum of the corresponding packfile, and
302
303               a checksum of all of the above.
304
305       All 4-byte numbers are in network order.
306

PACK-*.MTIMES FILES HAVE THE FORMAT:

308       All 4-byte numbers are in network byte order.
309
310       •   A 4-byte magic number 0x4d544d45 (MTME).
311
312       •   A 4-byte version identifier (= 1).
313
314       •   A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
315
316       •   A table of 4-byte unsigned integers. The ith value is the
317           modification time (mtime) of the ith object in the corresponding
318           pack by lexicographic (index) order. The mtimes count standard
319           epoch seconds.
320
321       •   A trailer, containing a checksum of the corresponding packfile, and
322           a checksum of all of the above (each having length according to the
323           specified hash function).
324

MULTI-PACK-INDEX (MIDX) FILES HAVE THE FOLLOWING FORMAT:

326       The multi-pack-index files refer to multiple pack-files and loose
327       objects.
328
329       In order to allow extensions that add extra data to the MIDX, we
330       organize the body into "chunks" and provide a lookup table at the
331       beginning of the body. The header includes certain length values, such
332       as the number of packs, the number of base MIDX files, hash lengths and
333       types.
334
335       All 4-byte numbers are in network order.
336
337       HEADER:
338
339           4-byte signature:
340               The signature is: {'M', 'I', 'D', 'X'}
341
342           1-byte version number:
343               Git only writes or recognizes version 1.
344
345           1-byte Object Id Version
346               We infer the length of object IDs (OIDs) from this value:
347                1 => SHA-1
348                2 => SHA-256
349               If the hash type does not match the repository's hash algorithm,
350               the multi-pack-index file should be ignored with a warning
351               presented to the user.
352
353           1-byte number of "chunks"
354
355           1-byte number of base multi-pack-index files:
356               This value is currently always zero.
357
358           4-byte number of pack files
359
360       CHUNK LOOKUP:
361
362           (C + 1) * 12 bytes providing the chunk offsets:
363               First 4 bytes describe chunk id. Value 0 is a terminating label.
364               Other 8 bytes provide offset in current file for chunk to start.
365               (Chunks are provided in file-order, so you can infer the length
366               using the next chunk position if necessary.)
367
368           The CHUNK LOOKUP matches the table of contents from
369           the chunk-based file format, see linkgit:gitformat-chunk[5].
370
371           The remaining data in the body is described one chunk at a time, and
372           these chunks may be given in any order. Chunks are required unless
373           otherwise specified.
374
375       CHUNK DATA:
376
377           Packfile Names (ID: {'P', 'N', 'A', 'M'})
378               Store the names of packfiles as a sequence of NUL-terminated
379               strings. There is no extra padding between the filenames,
380               and they are listed in lexicographic order. The chunk itself
381               is padded at the end with between 0 and 3 NUL bytes to make the
382               chunk size a multiple of 4 bytes.
383
384           OID Fanout (ID: {'O', 'I', 'D', 'F'})
385               The ith entry, F[i], stores the number of OIDs with first
386               byte at most i. Thus F[255] stores the total
387               number of objects.
388
389           OID Lookup (ID: {'O', 'I', 'D', 'L'})
390               The OIDs for all objects in the MIDX are stored in lexicographic
391               order in this chunk.
392
393           Object Offsets (ID: {'O', 'O', 'F', 'F'})
394               Stores two 4-byte values for every object.
395               1: The pack-int-id for the pack storing this object.
396               2: The offset within the pack.
397                If all offsets are less than 2^32, then the large offset chunk
398                will not exist and offsets are stored as in IDX v1.
399                If there is at least one offset value larger than 2^32-1, then
400                the large offset chunk must exist, and offsets larger than
401                2^31-1 must be stored in it instead. If the large offset chunk
402                exists and the 31st bit is on, then removing that bit reveals
403                the row in the large offsets containing the 8-byte offset of
404                this object.
405
406           [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
407               8-byte offsets into large packfiles.
408
409           [Optional] Bitmap pack order (ID: {'R', 'I', 'D', 'X'})
410               A list of MIDX positions (one per object in the MIDX, num_objects in
411               total, each a 4-byte unsigned integer in network byte order), sorted
412               according to their relative bitmap/pseudo-pack positions.
413
414       TRAILER:
415
416           Index checksum of the above contents.
417

MULTI-PACK-INDEX REVERSE INDEXES

419       Similar to the pack-based reverse index, the multi-pack index can also
420       be used to generate a reverse index.
421
422       Instead of mapping between offset, pack-, and index position, this
423       reverse index maps between an object’s position within the MIDX, and
424       that object’s position within a pseudo-pack that the MIDX describes
425       (i.e., the ith entry of the multi-pack reverse index holds the MIDX
426       position of ith object in pseudo-pack order).
427
428       To clarify the difference between these orderings, consider a
429       multi-pack reachability bitmap (which does not yet exist, but is what
430       we are building towards here). Each bit needs to correspond to an
431       object in the MIDX, and so we need an efficient mapping from bit
432       position to MIDX position.
433
434       One solution is to let bits occupy the same position in the oid-sorted
435       index stored by the MIDX. But because oids are effectively random,
436       their resulting reachability bitmaps would have no locality, and thus
437       compress poorly. (This is the reason that single-pack bitmaps use the
438       pack ordering, and not the .idx ordering, for the same purpose.)
439
440       So we’d like to define an ordering for the whole MIDX based around pack
441       ordering, which has far better locality (and thus compresses more
442       efficiently). We can think of a pseudo-pack created by the
443       concatenation of all of the packs in the MIDX. E.g., if we had a MIDX
444       with three packs (a, b, c), with 10, 15, and 20 objects respectively,
445       we can imagine an ordering of the objects like:
446
447           |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
448
449       where the ordering of the packs is defined by the MIDX’s pack list, and
450       then the ordering of objects within each pack is the same as the order
451       in the actual packfile.
452
453       Given the list of packs and their counts of objects, you can naïvely
454       reconstruct that pseudo-pack ordering (e.g., the object at position 27
455       must be (c,1) because packs "a" and "b" consumed 25 of the slots). But
456       there’s a catch. Objects may be duplicated between packs, in which case
457       the MIDX only stores one pointer to the object (and thus we’d want only
458       one slot in the bitmap).
459
460       Callers could handle duplicates themselves by reading objects in order
461       of their bit-position, but that’s linear in the number of objects, and
462       much too expensive for ordinary bitmap lookups. Building a reverse
463       index solves this, since it is the logical inverse of the index, and
464       that index has already removed duplicates. But, building a reverse
465       index on the fly can be expensive. Since we already have an on-disk
466       format for pack-based reverse indexes, let’s reuse it for the MIDX’s
467       pseudo-pack, too.
468
469       Objects from the MIDX are ordered as follows to string together the
470       pseudo-pack. Let pack(o) return the pack from which o was selected by
471       the MIDX, and define an ordering of packs based on their numeric ID (as
472       stored by the MIDX). Let offset(o) return the object offset of o within
473       pack(o). Then, compare o1 and o2 as follows:
474
475       •   If one of pack(o1) and pack(o2) is preferred and the other is not,
476           then the preferred one sorts first.
477
478           (This is a detail that allows the MIDX bitmap to determine which
479           pack should be used by the pack-reuse mechanism, since it can ask
480           the MIDX for the pack containing the object at bit position 0).
481
482       •   If pack(o1) ≠ pack(o2), then sort the two objects in descending
483           order based on the pack ID.
484
485       •   Otherwise, pack(o1) = pack(o2), and the objects are sorted in
486           pack-order (i.e., o1 sorts ahead of o2 exactly when offset(o1) <
487           offset(o2)).
488
489       In short, a MIDX’s pseudo-pack is the de-duplicated concatenation of
490       objects in packs stored by the MIDX, laid out in pack order, and the
491       packs arranged in MIDX order (with the preferred pack coming first).
492
493       The MIDX’s reverse index is stored in the optional RIDX chunk within
494       the MIDX itself.
495

CRUFT PACKS

497       The cruft packs feature offer an alternative to Git’s traditional
498       mechanism of removing unreachable objects. This document provides an
499       overview of Git’s pruning mechanism, and how a cruft pack can be used
500       instead to accomplish the same.
501
502   Background
503       To remove unreachable objects from your repository, Git offers git
504       repack -Ad (see git-repack(1)). Quoting from the documentation:
505
506           [...] unreachable objects in a previous pack become loose, unpacked objects,
507           instead of being left in the old pack. [...] loose unreachable objects will be
508           pruned according to normal expiry rules with the next 'git gc' invocation.
509
510       Unreachable objects aren’t removed immediately, since doing so could
511       race with an incoming push which may reference an object which is about
512       to be deleted. Instead, those unreachable objects are stored as loose
513       objects and stay that way until they are older than the expiration
514       window, at which point they are removed by git-prune(1).
515
516       Git must store these unreachable objects loose in order to keep track
517       of their per-object mtimes. If these unreachable objects were written
518       into one big pack, then either freshening that pack (because an object
519       contained within it was re-written) or creating a new pack of
520       unreachable objects would cause the pack’s mtime to get updated, and
521       the objects within it would never leave the expiration window. Instead,
522       objects are stored loose in order to keep track of the individual
523       object mtimes and avoid a situation where all cruft objects are
524       freshened at once.
525
526       This can lead to undesirable situations when a repository contains many
527       unreachable objects which have not yet left the grace period. Having
528       large directories in the shards of .git/objects can lead to decreased
529       performance in the repository. But given enough unreachable objects,
530       this can lead to inode starvation and degrade the performance of the
531       whole system. Since we can never pack those objects, these repositories
532       often take up a large amount of disk space, since we can only zlib
533       compress them, but not store them in delta chains.
534
535   Cruft packs
536       A cruft pack eliminates the need for storing unreachable objects in a
537       loose state by including the per-object mtimes in a separate file
538       alongside a single pack containing all loose objects.
539
540       A cruft pack is written by git repack --cruft when generating a new
541       pack. git-pack-objects(1)'s --cruft option. Note that git repack
542       --cruft is a classic all-into-one repack, meaning that everything in
543       the resulting pack is reachable, and everything else is unreachable.
544       Once written, the --cruft option instructs git repack to generate
545       another pack containing only objects not packed in the previous step
546       (which equates to packing all unreachable objects together). This
547       progresses as follows:
548
549        1. Enumerate every object, marking any object which is (a) not
550           contained in a kept-pack, and (b) whose mtime is within the grace
551           period as a traversal tip.
552
553        2. Perform a reachability traversal based on the tips gathered in the
554           previous step, adding every object along the way to the pack.
555
556        3. Write the pack out, along with a .mtimes file that records the
557           per-object timestamps.
558
559       This mode is invoked internally by git-repack(1) when instructed to
560       write a cruft pack. Crucially, the set of in-core kept packs is exactly
561       the set of packs which will not be deleted by the repack; in other
562       words, they contain all of the repository’s reachable objects.
563
564       When a repository already has a cruft pack, git repack --cruft
565       typically only adds objects to it. An exception to this is when git
566       repack is given the --cruft-expiration option, which allows the
567       generated cruft pack to omit expired objects instead of waiting for
568       git-gc(1) to expire those objects later on.
569
570       It is git-gc(1) that is typically responsible for removing expired
571       unreachable objects.
572
573   Alternatives
574       Notable alternatives to this design include:
575
576       •   The location of the per-object mtime data.
577
578       On the location of mtime data, a new auxiliary file tied to the pack
579       was chosen to avoid complicating the .idx format. If the .idx format
580       were ever to gain support for optional chunks of data, it may make
581       sense to consolidate the .mtimes format into the .idx itself.
582

GIT

584       Part of the git(1) suite
585
586
587
588Git 2.43.0                        11/20/2023                 GITFORMAT-PACK(5)