1GITFORMAT-PACK(5) Git Manual GITFORMAT-PACK(5)
2
3
4
6 gitformat-pack - Git pack format
7
9 $GIT_DIR/objects/pack/pack-.{pack,idx}
10 $GIT_DIR/objects/pack/pack-.rev
11 $GIT_DIR/objects/pack/pack-*.mtimes
12 $GIT_DIR/objects/pack/multi-pack-index
13
15 The Git pack format is how Git stores most of its primary repository
16 data. Over the lifetime of a repository, loose objects (if any) and
17 smaller packs are consolidated into larger pack(s). See git-gc(1) and
18 git-pack-objects(1).
19
20 The pack format is also used over-the-wire, see e.g. gitprotocol-v2(5),
21 as well as being a part of other container formats in the case of
22 gitformat-bundle(5).
23
25 In a repository using the traditional SHA-1, pack checksums, index
26 checksums, and object IDs (object names) mentioned below are all
27 computed using SHA-1. Similarly, in SHA-256 repositories, these values
28 are computed using SHA-256.
29
31 • A header appears at the beginning and consists of the following:
32
33 4-byte signature:
34 The signature is: {'P', 'A', 'C', 'K'}
35
36 4-byte version number (network byte order):
37 Git currently accepts version number 2 or 3 but
38 generates version 2 only.
39
40 4-byte number of objects contained in the pack (network byte order)
41
42 Observation: we cannot have more than 4G versions ;-) and
43 more than 4G objects in a pack.
44
45 • The header is followed by a number of object entries, each of which
46 looks like this:
47
48 (undeltified representation)
49 n-byte type and length (3-bit type, (n-1)*7+4-bit length)
50 compressed data
51
52 (deltified representation)
53 n-byte type and length (3-bit type, (n-1)*7+4-bit length)
54 base object name if OBJ_REF_DELTA or a negative relative
55 offset from the delta object's position in the pack if this
56 is an OBJ_OFS_DELTA object
57 compressed delta data
58
59 Observation: the length of each object is encoded in a variable
60 length format and is not constrained to 32-bit or anything.
61
62 • The trailer records a pack checksum of all of the above.
63
64 Object types
65 Valid object types are:
66
67 • OBJ_COMMIT (1)
68
69 • OBJ_TREE (2)
70
71 • OBJ_BLOB (3)
72
73 • OBJ_TAG (4)
74
75 • OBJ_OFS_DELTA (6)
76
77 • OBJ_REF_DELTA (7)
78
79 Type 5 is reserved for future expansion. Type 0 is invalid.
80
81 Size encoding
82 This document uses the following "size encoding" of non-negative
83 integers: From each byte, the seven least significant bits are used to
84 form the resulting integer. As long as the most significant bit is 1,
85 this process continues; the byte with MSB 0 provides the last seven
86 bits. The seven-bit chunks are concatenated. Later values are more
87 significant.
88
89 This size encoding should not be confused with the "offset encoding",
90 which is also used in this document.
91
92 Deltified representation
93 Conceptually there are only four object types: commit, tree, tag and
94 blob. However to save space, an object could be stored as a "delta" of
95 another "base" object. These representations are assigned new types
96 ofs-delta and ref-delta, which is only valid in a pack file.
97
98 Both ofs-delta and ref-delta store the "delta" to be applied to another
99 object (called base object) to reconstruct the object. The difference
100 between them is, ref-delta directly encodes base object name. If the
101 base object is in the same pack, ofs-delta encodes the offset of the
102 base object in the pack instead.
103
104 The base object could also be deltified if it’s in the same pack.
105 Ref-delta can also refer to an object outside the pack (i.e. the
106 so-called "thin pack"). When stored on disk however, the pack should be
107 self contained to avoid cyclic dependency.
108
109 The delta data starts with the size of the base object and the size of
110 the object to be reconstructed. These sizes are encoded using the size
111 encoding from above. The remainder of the delta data is a sequence of
112 instructions to reconstruct the object from the base object. If the
113 base object is deltified, it must be converted to canonical form first.
114 Each instruction appends more and more data to the target object until
115 it’s complete. There are two supported instructions so far: one for
116 copying a byte range from the source object and one for inserting new
117 data embedded in the instruction itself.
118
119 Each instruction has variable length. Instruction type is determined by
120 the seventh bit of the first octet. The following diagrams follow the
121 convention in RFC 1951 (Deflate compressed data format).
122
123 Instruction to copy from base object
124 +----------+---------+---------+---------+---------+-------+-------+-------+
125 | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 |
126 +----------+---------+---------+---------+---------+-------+-------+-------+
127
128 This is the instruction format to copy a byte range from the source
129 object. It encodes the offset to copy from and the number of bytes
130 to copy. Offset and size are in little-endian order.
131
132 All offset and size bytes are optional. This is to reduce the
133 instruction size when encoding small offsets or sizes. The first
134 seven bits in the first octet determine which of the next seven
135 octets is present. If bit zero is set, offset1 is present. If bit
136 one is set offset2 is present and so on.
137
138 Note that a more compact instruction does not change offset and
139 size encoding. For example, if only offset2 is omitted like below,
140 offset3 still contains bits 16-23. It does not become offset2 and
141 contains bits 8-15 even if it’s right next to offset1.
142
143 +----------+---------+---------+
144 | 10000101 | offset1 | offset3 |
145 +----------+---------+---------+
146
147 In its most compact form, this instruction only takes up one byte
148 (0x80) with both offset and size omitted, which will have default
149 values zero. There is another exception: size zero is automatically
150 converted to 0x10000.
151
152 Instruction to add new data
153 +----------+============+
154 | 0xxxxxxx | data |
155 +----------+============+
156
157 This is the instruction to construct the target object without the
158 base object. The following data is appended to the target object.
159 The first seven bits of the first octet determine the size of data
160 in bytes. The size must be non-zero.
161
162 Reserved instruction
163 +----------+============
164 | 00000000 |
165 +----------+============
166
167 This is the instruction reserved for future expansion.
168
170 • The header consists of 256 4-byte network byte order integers. N-th
171 entry of this table records the number of objects in the
172 corresponding pack, the first byte of whose object name is less
173 than or equal to N. This is called the first-level fan-out table.
174
175 • The header is followed by sorted 24-byte entries, one entry per
176 object in the pack. Each entry is:
177
178 4-byte network byte order integer, recording where the
179 object is stored in the packfile as the offset from the
180 beginning.
181
182 one object name of the appropriate size.
183
184 • The file is concluded with a trailer:
185
186 A copy of the pack checksum at the end of the corresponding
187 packfile.
188
189 Index checksum of all of the above.
190
191 Pack Idx file:
192
193 -- +--------------------------------+
194 fanout | fanout[0] = 2 (for example) |-.
195 table +--------------------------------+ |
196 | fanout[1] | |
197 +--------------------------------+ |
198 | fanout[2] | |
199 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
200 | fanout[255] = total objects |---.
201 -- +--------------------------------+ | |
202 main | offset | | |
203 index | object name 00XXXXXXXXXXXXXXXX | | |
204 table +--------------------------------+ | |
205 | offset | | |
206 | object name 00XXXXXXXXXXXXXXXX | | |
207 +--------------------------------+<+ |
208 .-| offset | |
209 | | object name 01XXXXXXXXXXXXXXXX | |
210 | +--------------------------------+ |
211 | | offset | |
212 | | object name 01XXXXXXXXXXXXXXXX | |
213 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
214 | | offset | |
215 | | object name FFXXXXXXXXXXXXXXXX | |
216 --| +--------------------------------+<--+
217 trailer | | packfile checksum |
218 | +--------------------------------+
219 | | idxfile checksum |
220 | +--------------------------------+
221 .-------.
222 |
223 Pack file entry: <+
224
225 packed object header:
226 1-byte size extension bit (MSB)
227 type (next 3 bit)
228 size0 (lower 4-bit)
229 n-byte sizeN (as long as MSB is set, each 7-bit)
230 size0..sizeN form 4+7+7+..+7 bit integer, size0
231 is the least significant part, and sizeN is the
232 most significant part.
233 packed object data:
234 If it is not DELTA, then deflated bytes (the size above
235 is the size before compression).
236 If it is REF_DELTA, then
237 base object name (the size above is the
238 size of the delta data that follows).
239 delta data, deflated.
240 If it is OFS_DELTA, then
241 n-byte offset (see below) interpreted as a negative
242 offset from the type-byte of the header of the
243 ofs-delta entry (the size above is the size of
244 the delta data that follows).
245 delta data, deflated.
246
247 offset encoding:
248 n bytes with MSB set in all but the last one.
249 The offset is then the number constructed by
250 concatenating the lower 7 bit of each byte, and
251 for n >= 2 adding 2^7 + 2^14 + ... + 2^(7*(n-1))
252 to the result.
253
255 have some other reorganizations. They have the format:
256
257 • A 4-byte magic number \377tOc which is an unreasonable fanout[0]
258 value.
259
260 • A 4-byte version number (= 2)
261
262 • A 256-entry fan-out table just like v1.
263
264 • A table of sorted object names. These are packed together without
265 offset values to reduce the cache footprint of the binary search
266 for a specific object name.
267
268 • A table of 4-byte CRC32 values of the packed object data. This is
269 new in v2 so compressed data can be copied directly from pack to
270 pack during repacking without undetected data corruption.
271
272 • A table of 4-byte offset values (in network byte order). These are
273 usually 31-bit pack file offsets, but large offsets are encoded as
274 an index into the next table with the msbit set.
275
276 • A table of 8-byte offset entries (empty for pack files less than 2
277 GiB). Pack files are organized with heavily used objects toward the
278 front, so most object references should not need to refer to this
279 table.
280
281 • The same trailer as a v1 pack file:
282
283 A copy of the pack checksum at the end of the
284 corresponding packfile.
285
286 Index checksum of all of the above.
287
289 • A 4-byte magic number 0x52494458 (RIDX).
290
291 • A 4-byte version identifier (= 1).
292
293 • A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
294
295 • A table of index positions (one per packed object, num_objects in
296 total, each a 4-byte unsigned integer in network order), sorted by
297 their corresponding offsets in the packfile.
298
299 • A trailer, containing a:
300
301 checksum of the corresponding packfile, and
302
303 a checksum of all of the above.
304
305 All 4-byte numbers are in network order.
306
308 All 4-byte numbers are in network byte order.
309
310 • A 4-byte magic number 0x4d544d45 (MTME).
311
312 • A 4-byte version identifier (= 1).
313
314 • A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
315
316 • A table of 4-byte unsigned integers. The ith value is the
317 modification time (mtime) of the ith object in the corresponding
318 pack by lexicographic (index) order. The mtimes count standard
319 epoch seconds.
320
321 • A trailer, containing a checksum of the corresponding packfile, and
322 a checksum of all of the above (each having length according to the
323 specified hash function).
324
326 The multi-pack-index files refer to multiple pack-files and loose
327 objects.
328
329 In order to allow extensions that add extra data to the MIDX, we
330 organize the body into "chunks" and provide a lookup table at the
331 beginning of the body. The header includes certain length values, such
332 as the number of packs, the number of base MIDX files, hash lengths and
333 types.
334
335 All 4-byte numbers are in network order.
336
337 HEADER:
338
339 4-byte signature:
340 The signature is: {'M', 'I', 'D', 'X'}
341
342 1-byte version number:
343 Git only writes or recognizes version 1.
344
345 1-byte Object Id Version
346 We infer the length of object IDs (OIDs) from this value:
347 1 => SHA-1
348 2 => SHA-256
349 If the hash type does not match the repository's hash algorithm,
350 the multi-pack-index file should be ignored with a warning
351 presented to the user.
352
353 1-byte number of "chunks"
354
355 1-byte number of base multi-pack-index files:
356 This value is currently always zero.
357
358 4-byte number of pack files
359
360 CHUNK LOOKUP:
361
362 (C + 1) * 12 bytes providing the chunk offsets:
363 First 4 bytes describe chunk id. Value 0 is a terminating label.
364 Other 8 bytes provide offset in current file for chunk to start.
365 (Chunks are provided in file-order, so you can infer the length
366 using the next chunk position if necessary.)
367
368 The CHUNK LOOKUP matches the table of contents from
369 the chunk-based file format, see linkgit:gitformat-chunk[5].
370
371 The remaining data in the body is described one chunk at a time, and
372 these chunks may be given in any order. Chunks are required unless
373 otherwise specified.
374
375 CHUNK DATA:
376
377 Packfile Names (ID: {'P', 'N', 'A', 'M'})
378 Store the names of packfiles as a sequence of NUL-terminated
379 strings. There is no extra padding between the filenames,
380 and they are listed in lexicographic order. The chunk itself
381 is padded at the end with between 0 and 3 NUL bytes to make the
382 chunk size a multiple of 4 bytes.
383
384 OID Fanout (ID: {'O', 'I', 'D', 'F'})
385 The ith entry, F[i], stores the number of OIDs with first
386 byte at most i. Thus F[255] stores the total
387 number of objects.
388
389 OID Lookup (ID: {'O', 'I', 'D', 'L'})
390 The OIDs for all objects in the MIDX are stored in lexicographic
391 order in this chunk.
392
393 Object Offsets (ID: {'O', 'O', 'F', 'F'})
394 Stores two 4-byte values for every object.
395 1: The pack-int-id for the pack storing this object.
396 2: The offset within the pack.
397 If all offsets are less than 2^32, then the large offset chunk
398 will not exist and offsets are stored as in IDX v1.
399 If there is at least one offset value larger than 2^32-1, then
400 the large offset chunk must exist, and offsets larger than
401 2^31-1 must be stored in it instead. If the large offset chunk
402 exists and the 31st bit is on, then removing that bit reveals
403 the row in the large offsets containing the 8-byte offset of
404 this object.
405
406 [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
407 8-byte offsets into large packfiles.
408
409 [Optional] Bitmap pack order (ID: {'R', 'I', 'D', 'X'})
410 A list of MIDX positions (one per object in the MIDX, num_objects in
411 total, each a 4-byte unsigned integer in network byte order), sorted
412 according to their relative bitmap/pseudo-pack positions.
413
414 TRAILER:
415
416 Index checksum of the above contents.
417
419 Similar to the pack-based reverse index, the multi-pack index can also
420 be used to generate a reverse index.
421
422 Instead of mapping between offset, pack-, and index position, this
423 reverse index maps between an object’s position within the MIDX, and
424 that object’s position within a pseudo-pack that the MIDX describes
425 (i.e., the ith entry of the multi-pack reverse index holds the MIDX
426 position of ith object in pseudo-pack order).
427
428 To clarify the difference between these orderings, consider a
429 multi-pack reachability bitmap (which does not yet exist, but is what
430 we are building towards here). Each bit needs to correspond to an
431 object in the MIDX, and so we need an efficient mapping from bit
432 position to MIDX position.
433
434 One solution is to let bits occupy the same position in the oid-sorted
435 index stored by the MIDX. But because oids are effectively random,
436 their resulting reachability bitmaps would have no locality, and thus
437 compress poorly. (This is the reason that single-pack bitmaps use the
438 pack ordering, and not the .idx ordering, for the same purpose.)
439
440 So we’d like to define an ordering for the whole MIDX based around pack
441 ordering, which has far better locality (and thus compresses more
442 efficiently). We can think of a pseudo-pack created by the
443 concatenation of all of the packs in the MIDX. E.g., if we had a MIDX
444 with three packs (a, b, c), with 10, 15, and 20 objects respectively,
445 we can imagine an ordering of the objects like:
446
447 |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
448
449 where the ordering of the packs is defined by the MIDX’s pack list, and
450 then the ordering of objects within each pack is the same as the order
451 in the actual packfile.
452
453 Given the list of packs and their counts of objects, you can naïvely
454 reconstruct that pseudo-pack ordering (e.g., the object at position 27
455 must be (c,1) because packs "a" and "b" consumed 25 of the slots). But
456 there’s a catch. Objects may be duplicated between packs, in which case
457 the MIDX only stores one pointer to the object (and thus we’d want only
458 one slot in the bitmap).
459
460 Callers could handle duplicates themselves by reading objects in order
461 of their bit-position, but that’s linear in the number of objects, and
462 much too expensive for ordinary bitmap lookups. Building a reverse
463 index solves this, since it is the logical inverse of the index, and
464 that index has already removed duplicates. But, building a reverse
465 index on the fly can be expensive. Since we already have an on-disk
466 format for pack-based reverse indexes, let’s reuse it for the MIDX’s
467 pseudo-pack, too.
468
469 Objects from the MIDX are ordered as follows to string together the
470 pseudo-pack. Let pack(o) return the pack from which o was selected by
471 the MIDX, and define an ordering of packs based on their numeric ID (as
472 stored by the MIDX). Let offset(o) return the object offset of o within
473 pack(o). Then, compare o1 and o2 as follows:
474
475 • If one of pack(o1) and pack(o2) is preferred and the other is not,
476 then the preferred one sorts first.
477
478 (This is a detail that allows the MIDX bitmap to determine which
479 pack should be used by the pack-reuse mechanism, since it can ask
480 the MIDX for the pack containing the object at bit position 0).
481
482 • If pack(o1) ≠ pack(o2), then sort the two objects in descending
483 order based on the pack ID.
484
485 • Otherwise, pack(o1) = pack(o2), and the objects are sorted in
486 pack-order (i.e., o1 sorts ahead of o2 exactly when offset(o1) <
487 offset(o2)).
488
489 In short, a MIDX’s pseudo-pack is the de-duplicated concatenation of
490 objects in packs stored by the MIDX, laid out in pack order, and the
491 packs arranged in MIDX order (with the preferred pack coming first).
492
493 The MIDX’s reverse index is stored in the optional RIDX chunk within
494 the MIDX itself.
495
497 The cruft packs feature offer an alternative to Git’s traditional
498 mechanism of removing unreachable objects. This document provides an
499 overview of Git’s pruning mechanism, and how a cruft pack can be used
500 instead to accomplish the same.
501
502 Background
503 To remove unreachable objects from your repository, Git offers git
504 repack -Ad (see git-repack(1)). Quoting from the documentation:
505
506 [...] unreachable objects in a previous pack become loose, unpacked objects,
507 instead of being left in the old pack. [...] loose unreachable objects will be
508 pruned according to normal expiry rules with the next 'git gc' invocation.
509
510 Unreachable objects aren’t removed immediately, since doing so could
511 race with an incoming push which may reference an object which is about
512 to be deleted. Instead, those unreachable objects are stored as loose
513 objects and stay that way until they are older than the expiration
514 window, at which point they are removed by git-prune(1).
515
516 Git must store these unreachable objects loose in order to keep track
517 of their per-object mtimes. If these unreachable objects were written
518 into one big pack, then either freshening that pack (because an object
519 contained within it was re-written) or creating a new pack of
520 unreachable objects would cause the pack’s mtime to get updated, and
521 the objects within it would never leave the expiration window. Instead,
522 objects are stored loose in order to keep track of the individual
523 object mtimes and avoid a situation where all cruft objects are
524 freshened at once.
525
526 This can lead to undesirable situations when a repository contains many
527 unreachable objects which have not yet left the grace period. Having
528 large directories in the shards of .git/objects can lead to decreased
529 performance in the repository. But given enough unreachable objects,
530 this can lead to inode starvation and degrade the performance of the
531 whole system. Since we can never pack those objects, these repositories
532 often take up a large amount of disk space, since we can only zlib
533 compress them, but not store them in delta chains.
534
535 Cruft packs
536 A cruft pack eliminates the need for storing unreachable objects in a
537 loose state by including the per-object mtimes in a separate file
538 alongside a single pack containing all loose objects.
539
540 A cruft pack is written by git repack --cruft when generating a new
541 pack. git-pack-objects(1)'s --cruft option. Note that git repack
542 --cruft is a classic all-into-one repack, meaning that everything in
543 the resulting pack is reachable, and everything else is unreachable.
544 Once written, the --cruft option instructs git repack to generate
545 another pack containing only objects not packed in the previous step
546 (which equates to packing all unreachable objects together). This
547 progresses as follows:
548
549 1. Enumerate every object, marking any object which is (a) not
550 contained in a kept-pack, and (b) whose mtime is within the grace
551 period as a traversal tip.
552
553 2. Perform a reachability traversal based on the tips gathered in the
554 previous step, adding every object along the way to the pack.
555
556 3. Write the pack out, along with a .mtimes file that records the
557 per-object timestamps.
558
559 This mode is invoked internally by git-repack(1) when instructed to
560 write a cruft pack. Crucially, the set of in-core kept packs is exactly
561 the set of packs which will not be deleted by the repack; in other
562 words, they contain all of the repository’s reachable objects.
563
564 When a repository already has a cruft pack, git repack --cruft
565 typically only adds objects to it. An exception to this is when git
566 repack is given the --cruft-expiration option, which allows the
567 generated cruft pack to omit expired objects instead of waiting for
568 git-gc(1) to expire those objects later on.
569
570 It is git-gc(1) that is typically responsible for removing expired
571 unreachable objects.
572
573 Alternatives
574 Notable alternatives to this design include:
575
576 • The location of the per-object mtime data.
577
578 On the location of mtime data, a new auxiliary file tied to the pack
579 was chosen to avoid complicating the .idx format. If the .idx format
580 were ever to gain support for optional chunks of data, it may make
581 sense to consolidate the .mtimes format into the .idx itself.
582
584 Part of the git(1) suite
585
586
587
588Git 2.43.0 11/20/2023 GITFORMAT-PACK(5)