gitformat-pack(5)

1GITFORMAT-PACK(5)                 Git Manual                 GITFORMAT-PACK(5)
2
3
4

NAME

6       gitformat-pack - Git pack format
7

SYNOPSIS

9       $GIT_DIR/objects/pack/pack-.{pack,idx}
10       $GIT_DIR/objects/pack/pack-.rev
11       $GIT_DIR/objects/pack/pack-*.mtimes
12       $GIT_DIR/objects/pack/multi-pack-index
13

DESCRIPTION

15       The Git pack format is now Git stores most of its primary repository
16       data. Over the lietime af a repository loose objects (if any) and
17       smaller packs are consolidated into larger pack(s). See git-gc(1) and
18       git-pack-objects(1).
19
20       The pack format is also used over-the-wire, see e.g. gitprotocol-v2(5),
21       as well as being a part of other container formats in the case of
22       gitformat-bundle(5).
23

CHECKSUMS AND OBJECT IDS

25       In a repository using the traditional SHA-1, pack checksums, index
26       checksums, and object IDs (object names) mentioned below are all
27       computed using SHA-1. Similarly, in SHA-256 repositories, these values
28       are computed using SHA-256.
29

PACK-*.PACK FILES HAVE THE FOLLOWING FORMAT:

31       •   A header appears at the beginning and consists of the following:
32
33               4-byte signature:
34                   The signature is: {'P', 'A', 'C', 'K'}
35
36                  4-byte version number (network byte order):
37               Git currently accepts version number 2 or 3 but
38                      generates version 2 only.
39
40               4-byte number of objects contained in the pack (network byte order)
41
42               Observation: we cannot have more than 4G versions ;-) and
43               more than 4G objects in a pack.
44
45       •   The header is followed by number of object entries, each of which
46           looks like this:
47
48               (undeltified representation)
49               n-byte type and length (3-bit type, (n-1)*7+4-bit length)
50               compressed data
51
52                  (deltified representation)
53                  n-byte type and length (3-bit type, (n-1)*7+4-bit length)
54                  base object name if OBJ_REF_DELTA or a negative relative
55               offset from the delta object's position in the pack if this
56               is an OBJ_OFS_DELTA object
57                  compressed delta data
58
59               Observation: length of each object is encoded in a variable
60               length format and is not constrained to 32-bit or anything.
61
62       •   The trailer records a pack checksum of all of the above.
63
64   Object types
65       Valid object types are:
66
67       •   OBJ_COMMIT (1)
68
69       •   OBJ_TREE (2)
70
71       •   OBJ_BLOB (3)
72
73       •   OBJ_TAG (4)
74
75       •   OBJ_OFS_DELTA (6)
76
77       •   OBJ_REF_DELTA (7)
78
79       Type 5 is reserved for future expansion. Type 0 is invalid.
80
81   Size encoding
82       This document uses the following "size encoding" of non-negative
83       integers: From each byte, the seven least significant bits are used to
84       form the resulting integer. As long as the most significant bit is 1,
85       this process continues; the byte with MSB 0 provides the last seven
86       bits. The seven-bit chunks are concatenated. Later values are more
87       significant.
88
89       This size encoding should not be confused with the "offset encoding",
90       which is also used in this document.
91
92   Deltified representation
93       Conceptually there are only four object types: commit, tree, tag and
94       blob. However to save space, an object could be stored as a "delta" of
95       another "base" object. These representations are assigned new types
96       ofs-delta and ref-delta, which is only valid in a pack file.
97
98       Both ofs-delta and ref-delta store the "delta" to be applied to another
99       object (called base object) to reconstruct the object. The difference
100       between them is, ref-delta directly encodes base object name. If the
101       base object is in the same pack, ofs-delta encodes the offset of the
102       base object in the pack instead.
103
104       The base object could also be deltified if it’s in the same pack.
105       Ref-delta can also refer to an object outside the pack (i.e. the
106       so-called "thin pack"). When stored on disk however, the pack should be
107       self contained to avoid cyclic dependency.
108
109       The delta data starts with the size of the base object and the size of
110       the object to be reconstructed. These sizes are encoded using the size
111       encoding from above. The remainder of the delta data is a sequence of
112       instructions to reconstruct the object from the base object. If the
113       base object is deltified, it must be converted to canonical form first.
114       Each instruction appends more and more data to the target object until
115       it’s complete. There are two supported instructions so far: one for
116       copy a byte range from the source object and one for inserting new data
117       embedded in the instruction itself.
118
119       Each instruction has variable length. Instruction type is determined by
120       the seventh bit of the first octet. The following diagrams follow the
121       convention in RFC 1951 (Deflate compressed data format).
122
123       Instruction to copy from base object
124               +----------+---------+---------+---------+---------+-------+-------+-------+
125               | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 |
126               +----------+---------+---------+---------+---------+-------+-------+-------+
127
128           This is the instruction format to copy a byte range from the source
129           object. It encodes the offset to copy from and the number of bytes
130           to copy. Offset and size are in little-endian order.
131
132           All offset and size bytes are optional. This is to reduce the
133           instruction size when encoding small offsets or sizes. The first
134           seven bits in the first octet determines which of the next seven
135           octets is present. If bit zero is set, offset1 is present. If bit
136           one is set offset2 is present and so on.
137
138           Note that a more compact instruction does not change offset and
139           size encoding. For example, if only offset2 is omitted like below,
140           offset3 still contains bits 16-23. It does not become offset2 and
141           contains bits 8-15 even if it’s right next to offset1.
142
143               +----------+---------+---------+
144               | 10000101 | offset1 | offset3 |
145               +----------+---------+---------+
146
147           In its most compact form, this instruction only takes up one byte
148           (0x80) with both offset and size omitted, which will have default
149           values zero. There is another exception: size zero is automatically
150           converted to 0x10000.
151
152       Instruction to add new data
153               +----------+============+
154               | 0xxxxxxx |    data    |
155               +----------+============+
156
157           This is the instruction to construct target object without the base
158           object. The following data is appended to the target object. The
159           first seven bits of the first octet determines the size of data in
160           bytes. The size must be non-zero.
161
162       Reserved instruction
163               +----------+============
164               | 00000000 |
165               +----------+============
166
167           This is the instruction reserved for future expansion.
168

ORIGINAL (VERSION 1) PACK-*.IDX FILES HAVE THE FOLLOWING FORMAT:

170       •   The header consists of 256 4-byte network byte order integers. N-th
171           entry of this table records the number of objects in the
172           corresponding pack, the first byte of whose object name is less
173           than or equal to N. This is called the first-level fan-out table.
174
175       •   The header is followed by sorted 24-byte entries, one entry per
176           object in the pack. Each entry is:
177
178               4-byte network byte order integer, recording where the
179               object is stored in the packfile as the offset from the
180               beginning.
181
182               one object name of the appropriate size.
183
184       •   The file is concluded with a trailer:
185
186               A copy of the pack checksum at the end of the corresponding
187               packfile.
188
189               Index checksum of all of the above.
190
191       Pack Idx file:
192
193                --  +--------------------------------+
194           fanout        | fanout[0] = 2 (for example)    |-.
195           table         +--------------------------------+ |
196                    | fanout[1]                      | |
197                    +--------------------------------+ |
198                    | fanout[2]                      | |
199                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
200                    | fanout[255] = total objects    |---.
201                --  +--------------------------------+ | |
202           main     | offset                         | | |
203           index         | object name 00XXXXXXXXXXXXXXXX | | |
204           table         +--------------------------------+ | |
205                    | offset                         | | |
206                    | object name 00XXXXXXXXXXXXXXXX | | |
207                    +--------------------------------+<+ |
208                  .-| offset                         |   |
209                  | | object name 01XXXXXXXXXXXXXXXX |   |
210                  | +--------------------------------+   |
211                  | | offset                         |   |
212                  | | object name 01XXXXXXXXXXXXXXXX |   |
213                  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~   |
214                  | | offset                         |   |
215                  | | object name FFXXXXXXXXXXXXXXXX |   |
216                --| +--------------------------------+<--+
217           trailer     | | packfile checksum              |
218                  | +--------------------------------+
219                  | | idxfile checksum               |
220                  | +--------------------------------+
221                     .-------.
222                             |
223           Pack file entry: <+
224
225               packed object header:
226           1-byte size extension bit (MSB)
227                  type (next 3 bit)
228                  size0 (lower 4-bit)
229                  n-byte sizeN (as long as MSB is set, each 7-bit)
230                size0..sizeN form 4+7+7+..+7 bit integer, size0
231                is the least significant part, and sizeN is the
232                most significant part.
233               packed object data:
234                  If it is not DELTA, then deflated bytes (the size above
235                is the size before compression).
236           If it is REF_DELTA, then
237             base object name (the size above is the
238                size of the delta data that follows).
239                    delta data, deflated.
240           If it is OFS_DELTA, then
241             n-byte offset (see below) interpreted as a negative
242                offset from the type-byte of the header of the
243                ofs-delta entry (the size above is the size of
244                the delta data that follows).
245             delta data, deflated.
246
247             offset encoding:
248           n bytes with MSB set in all but the last one.
249           The offset is then the number constructed by
250           concatenating the lower 7 bit of each byte, and
251           for n >= 2 adding 2^7 + 2^14 + ... + 2^(7*(n-1))
252           to the result.
253

VERSION 2 PACK-*.IDX FILES SUPPORT PACKS LARGER THAN 4 GIB, AND

255           have some other reorganizations.  They have the format:
256
257       •   A 4-byte magic number \377tOc which is an unreasonable fanout[0]
258           value.
259
260       •   A 4-byte version number (= 2)
261
262       •   A 256-entry fan-out table just like v1.
263
264       •   A table of sorted object names. These are packed together without
265           offset values to reduce the cache footprint of the binary search
266           for a specific object name.
267
268       •   A table of 4-byte CRC32 values of the packed object data. This is
269           new in v2 so compressed data can be copied directly from pack to
270           pack during repacking without undetected data corruption.
271
272       •   A table of 4-byte offset values (in network byte order). These are
273           usually 31-bit pack file offsets, but large offsets are encoded as
274           an index into the next table with the msbit set.
275
276       •   A table of 8-byte offset entries (empty for pack files less than 2
277           GiB). Pack files are organized with heavily used objects toward the
278           front, so most object references should not need to refer to this
279           table.
280
281       •   The same trailer as a v1 pack file:
282
283               A copy of the pack checksum at the end of
284               corresponding packfile.
285
286               Index checksum of all of the above.
287

PACK-*.REV FILES HAVE THE FORMAT:

289       •   A 4-byte magic number 0x52494458 (RIDX).
290
291       •   A 4-byte version identifier (= 1).
292
293       •   A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
294
295       •   A table of index positions (one per packed object, num_objects in
296           total, each a 4-byte unsigned integer in network order), sorted by
297           their corresponding offsets in the packfile.
298
299       •   A trailer, containing a:
300
301               checksum of the corresponding packfile, and
302
303               a checksum of all of the above.
304
305       All 4-byte numbers are in network order.
306

PACK-*.MTIMES FILES HAVE THE FORMAT:

308       All 4-byte numbers are in network byte order.
309
310       •   A 4-byte magic number 0x4d544d45 (MTME).
311
312       •   A 4-byte version identifier (= 1).
313
314       •   A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
315
316       •   A table of 4-byte unsigned integers. The ith value is the
317           modification time (mtime) of the ith object in the corresponding
318           pack by lexicographic (index) order. The mtimes count standard
319           epoch seconds.
320
321       •   A trailer, containing a checksum of the corresponding packfile, and
322           a checksum of all of the above (each having length according to the
323           specified hash function).
324

MULTI-PACK-INDEX (MIDX) FILES HAVE THE FOLLOWING FORMAT:

326       The multi-pack-index files refer to multiple pack-files and loose
327       objects.
328
329       In order to allow extensions that add extra data to the MIDX, we
330       organize the body into "chunks" and provide a lookup table at the
331       beginning of the body. The header includes certain length values, such
332       as the number of packs, the number of base MIDX files, hash lengths and
333       types.
334
335       All 4-byte numbers are in network order.
336
337       HEADER:
338
339           4-byte signature:
340               The signature is: {'M', 'I', 'D', 'X'}
341
342           1-byte version number:
343               Git only writes or recognizes version 1.
344
345           1-byte Object Id Version
346               We infer the length of object IDs (OIDs) from this value:
347                1 => SHA-1
348                2 => SHA-256
349               If the hash type does not match the repository's hash algorithm,
350               the multi-pack-index file should be ignored with a warning
351               presented to the user.
352
353           1-byte number of "chunks"
354
355           1-byte number of base multi-pack-index files:
356               This value is currently always zero.
357
358           4-byte number of pack files
359
360       CHUNK LOOKUP:
361
362           (C + 1) * 12 bytes providing the chunk offsets:
363               First 4 bytes describe chunk id. Value 0 is a terminating label.
364               Other 8 bytes provide offset in current file for chunk to start.
365               (Chunks are provided in file-order, so you can infer the length
366               using the next chunk position if necessary.)
367
368           The CHUNK LOOKUP matches the table of contents from
369           the chunk-based file format, see linkgit:gitformat-chunk[5].
370
371           The remaining data in the body is described one chunk at a time, and
372           these chunks may be given in any order. Chunks are required unless
373           otherwise specified.
374
375       CHUNK DATA:
376
377           Packfile Names (ID: {'P', 'N', 'A', 'M'})
378               Stores the packfile names as concatenated, null-terminated strings.
379               Packfiles must be listed in lexicographic order for fast lookups by
380               name. This is the only chunk not guaranteed to be a multiple of four
381               bytes in length, so should be the last chunk for alignment reasons.
382
383           OID Fanout (ID: {'O', 'I', 'D', 'F'})
384               The ith entry, F[i], stores the number of OIDs with first
385               byte at most i. Thus F[255] stores the total
386               number of objects.
387
388           OID Lookup (ID: {'O', 'I', 'D', 'L'})
389               The OIDs for all objects in the MIDX are stored in lexicographic
390               order in this chunk.
391
392           Object Offsets (ID: {'O', 'O', 'F', 'F'})
393               Stores two 4-byte values for every object.
394               1: The pack-int-id for the pack storing this object.
395               2: The offset within the pack.
396                If all offsets are less than 2^32, then the large offset chunk
397                will not exist and offsets are stored as in IDX v1.
398                If there is at least one offset value larger than 2^32-1, then
399                the large offset chunk must exist, and offsets larger than
400                2^31-1 must be stored in it instead. If the large offset chunk
401                exists and the 31st bit is on, then removing that bit reveals
402                the row in the large offsets containing the 8-byte offset of
403                this object.
404
405           [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
406               8-byte offsets into large packfiles.
407
408           [Optional] Bitmap pack order (ID: {'R', 'I', 'D', 'X'})
409               A list of MIDX positions (one per object in the MIDX, num_objects in
410               total, each a 4-byte unsigned integer in network byte order), sorted
411               according to their relative bitmap/pseudo-pack positions.
412
413       TRAILER:
414
415           Index checksum of the above contents.
416

MULTI-PACK-INDEX REVERSE INDEXES

418       Similar to the pack-based reverse index, the multi-pack index can also
419       be used to generate a reverse index.
420
421       Instead of mapping between offset, pack-, and index position, this
422       reverse index maps between an object’s position within the MIDX, and
423       that object’s position within a pseudo-pack that the MIDX describes
424       (i.e., the ith entry of the multi-pack reverse index holds the MIDX
425       position of ith object in pseudo-pack order).
426
427       To clarify the difference between these orderings, consider a
428       multi-pack reachability bitmap (which does not yet exist, but is what
429       we are building towards here). Each bit needs to correspond to an
430       object in the MIDX, and so we need an efficient mapping from bit
431       position to MIDX position.
432
433       One solution is to let bits occupy the same position in the oid-sorted
434       index stored by the MIDX. But because oids are effectively random,
435       their resulting reachability bitmaps would have no locality, and thus
436       compress poorly. (This is the reason that single-pack bitmaps use the
437       pack ordering, and not the .idx ordering, for the same purpose.)
438
439       So we’d like to define an ordering for the whole MIDX based around pack
440       ordering, which has far better locality (and thus compresses more
441       efficiently). We can think of a pseudo-pack created by the
442       concatenation of all of the packs in the MIDX. E.g., if we had a MIDX
443       with three packs (a, b, c), with 10, 15, and 20 objects respectively,
444       we can imagine an ordering of the objects like:
445
446           |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
447
448       where the ordering of the packs is defined by the MIDX’s pack list, and
449       then the ordering of objects within each pack is the same as the order
450       in the actual packfile.
451
452       Given the list of packs and their counts of objects, you can naïvely
453       reconstruct that pseudo-pack ordering (e.g., the object at position 27
454       must be (c,1) because packs "a" and "b" consumed 25 of the slots). But
455       there’s a catch. Objects may be duplicated between packs, in which case
456       the MIDX only stores one pointer to the object (and thus we’d want only
457       one slot in the bitmap).
458
459       Callers could handle duplicates themselves by reading objects in order
460       of their bit-position, but that’s linear in the number of objects, and
461       much too expensive for ordinary bitmap lookups. Building a reverse
462       index solves this, since it is the logical inverse of the index, and
463       that index has already removed duplicates. But, building a reverse
464       index on the fly can be expensive. Since we already have an on-disk
465       format for pack-based reverse indexes, let’s reuse it for the MIDX’s
466       pseudo-pack, too.
467
468       Objects from the MIDX are ordered as follows to string together the
469       pseudo-pack. Let pack(o) return the pack from which o was selected by
470       the MIDX, and define an ordering of packs based on their numeric ID (as
471       stored by the MIDX). Let offset(o) return the object offset of o within
472       pack(o). Then, compare o1 and o2 as follows:
473
474       •   If one of pack(o1) and pack(o2) is preferred and the other is not,
475           then the preferred one sorts first.
476
477           (This is a detail that allows the MIDX bitmap to determine which
478           pack should be used by the pack-reuse mechanism, since it can ask
479           the MIDX for the pack containing the object at bit position 0).
480
481       •   If pack(o1) ≠ pack(o2), then sort the two objects in descending
482           order based on the pack ID.
483
484       •   Otherwise, pack(o1) = pack(o2), and the objects are sorted in
485           pack-order (i.e., o1 sorts ahead of o2 exactly when offset(o1) <
486           offset(o2)).
487
488       In short, a MIDX’s pseudo-pack is the de-duplicated concatenation of
489       objects in packs stored by the MIDX, laid out in pack order, and the
490       packs arranged in MIDX order (with the preferred pack coming first).
491
492       The MIDX’s reverse index is stored in the optional RIDX chunk within
493       the MIDX itself.
494

CRUFT PACKS

496       The cruft packs feature offer an alternative to Git’s traditional
497       mechanism of removing unreachable objects. This document provides an
498       overview of Git’s pruning mechanism, and how a cruft pack can be used
499       instead to accomplish the same.
500
501   Background
502       To remove unreachable objects from your repository, Git offers git
503       repack -Ad (see git-repack(1)). Quoting from the documentation:
504
505           [...] unreachable objects in a previous pack become loose, unpacked objects,
506           instead of being left in the old pack. [...] loose unreachable objects will be
507           pruned according to normal expiry rules with the next 'git gc' invocation.
508
509       Unreachable objects aren’t removed immediately, since doing so could
510       race with an incoming push which may reference an object which is about
511       to be deleted. Instead, those unreachable objects are stored as loose
512       objects and stay that way until they are older than the expiration
513       window, at which point they are removed by git-prune(1).
514
515       Git must store these unreachable objects loose in order to keep track
516       of their per-object mtimes. If these unreachable objects were written
517       into one big pack, then either freshening that pack (because an object
518       contained within it was re-written) or creating a new pack of
519       unreachable objects would cause the pack’s mtime to get updated, and
520       the objects within it would never leave the expiration window. Instead,
521       objects are stored loose in order to keep track of the individual
522       object mtimes and avoid a situation where all cruft objects are
523       freshened at once.
524
525       This can lead to undesirable situations when a repository contains many
526       unreachable objects which have not yet left the grace period. Having
527       large directories in the shards of .git/objects can lead to decreased
528       performance in the repository. But given enough unreachable objects,
529       this can lead to inode starvation and degrade the performance of the
530       whole system. Since we can never pack those objects, these repositories
531       often take up a large amount of disk space, since we can only zlib
532       compress them, but not store them in delta chains.
533
534   Cruft packs
535       A cruft pack eliminates the need for storing unreachable objects in a
536       loose state by including the per-object mtimes in a separate file
537       alongside a single pack containing all loose objects.
538
539       A cruft pack is written by git repack --cruft when generating a new
540       pack. git-pack-objects(1)'s --cruft option. Note that git repack
541       --cruft is a classic all-into-one repack, meaning that everything in
542       the resulting pack is reachable, and everything else is unreachable.
543       Once written, the --cruft option instructs git repack to generate
544       another pack containing only objects not packed in the previous step
545       (which equates to packing all unreachable objects together). This
546       progresses as follows:
547
548        1. Enumerate every object, marking any object which is (a) not
549           contained in a kept-pack, and (b) whose mtime is within the grace
550           period as a traversal tip.
551
552        2. Perform a reachability traversal based on the tips gathered in the
553           previous step, adding every object along the way to the pack.
554
555        3. Write the pack out, along with a .mtimes file that records the
556           per-object timestamps.
557
558       This mode is invoked internally by git-repack(1) when instructed to
559       write a cruft pack. Crucially, the set of in-core kept packs is exactly
560       the set of packs which will not be deleted by the repack; in other
561       words, they contain all of the repository’s reachable objects.
562
563       When a repository already has a cruft pack, git repack --cruft
564       typically only adds objects to it. An exception to this is when git
565       repack is given the --cruft-expiration option, which allows the
566       generated cruft pack to omit expired objects instead of waiting for
567       git-gc(1) to expire those objects later on.
568
569       It is git-gc(1) that is typically responsible for removing expired
570       unreachable objects.
571
572   Caution for mixed-version environments
573       Repositories that have cruft packs in them will continue to work with
574       any older version of Git. Note, however, that previous versions of Git
575       which do not understand the .mtimes file will use the cruft pack’s
576       mtime as the mtime for all of the objects in it. In other words, do not
577       expect older (pre-cruft pack) versions of Git to interpret or even read
578       the contents of the .mtimes file.
579
580       Note that having mixed versions of Git GC-ing the same repository can
581       lead to unreachable objects never being completely pruned. This can
582       happen under the following circumstances:
583
584       •   An older version of Git running GC explodes the contents of an
585           existing cruft pack loose, using the cruft pack’s mtime.
586
587       •   A newer version running GC collects those loose objects into a
588           cruft pack, where the .mtime file reflects the loose object’s
589           actual mtimes, but the cruft pack mtime is "now".
590
591       Repeating this process will lead to unreachable objects not getting
592       pruned as a result of repeatedly resetting the objects' mtimes to the
593       present time.
594
595       If you are GC-ing repositories in a mixed version environment, consider
596       omitting the --cruft option when using git-repack(1) and git-gc(1), and
597       leaving the gc.cruftPacks configuration unset until all writers
598       understand cruft packs.
599
600   Alternatives
601       Notable alternatives to this design include:
602
603       •   The location of the per-object mtime data, and
604
605       •   Storing unreachable objects in multiple cruft packs.
606
607       On the location of mtime data, a new auxiliary file tied to the pack
608       was chosen to avoid complicating the .idx format. If the .idx format
609       were ever to gain support for optional chunks of data, it may make
610       sense to consolidate the .mtimes format into the .idx itself.
611
612       Storing unreachable objects among multiple cruft packs (e.g., creating
613       a new cruft pack during each repacking operation including only
614       unreachable objects which aren’t already stored in an earlier cruft
615       pack) is significantly more complicated to construct, and so aren’t
616       pursued here. The obvious drawback to the current implementation is
617       that the entire cruft pack must be re-written from scratch.
618

GIT

620       Part of the git(1) suite
621
622
623
624Git 2.39.1                        2023-01-13                 GITFORMAT-PACK(5)