1GITFORMAT-PACK(5) Git Manual GITFORMAT-PACK(5)
2
3
4
6 gitformat-pack - Git pack format
7
9 $GIT_DIR/objects/pack/pack-.{pack,idx}
10 $GIT_DIR/objects/pack/pack-.rev
11 $GIT_DIR/objects/pack/pack-*.mtimes
12 $GIT_DIR/objects/pack/multi-pack-index
13
15 The Git pack format is now Git stores most of its primary repository
16 data. Over the lietime af a repository loose objects (if any) and
17 smaller packs are consolidated into larger pack(s). See git-gc(1) and
18 git-pack-objects(1).
19
20 The pack format is also used over-the-wire, see e.g. gitprotocol-v2(5),
21 as well as being a part of other container formats in the case of
22 gitformat-bundle(5).
23
25 In a repository using the traditional SHA-1, pack checksums, index
26 checksums, and object IDs (object names) mentioned below are all
27 computed using SHA-1. Similarly, in SHA-256 repositories, these values
28 are computed using SHA-256.
29
31 • A header appears at the beginning and consists of the following:
32
33 4-byte signature:
34 The signature is: {'P', 'A', 'C', 'K'}
35
36 4-byte version number (network byte order):
37 Git currently accepts version number 2 or 3 but
38 generates version 2 only.
39
40 4-byte number of objects contained in the pack (network byte order)
41
42 Observation: we cannot have more than 4G versions ;-) and
43 more than 4G objects in a pack.
44
45 • The header is followed by number of object entries, each of which
46 looks like this:
47
48 (undeltified representation)
49 n-byte type and length (3-bit type, (n-1)*7+4-bit length)
50 compressed data
51
52 (deltified representation)
53 n-byte type and length (3-bit type, (n-1)*7+4-bit length)
54 base object name if OBJ_REF_DELTA or a negative relative
55 offset from the delta object's position in the pack if this
56 is an OBJ_OFS_DELTA object
57 compressed delta data
58
59 Observation: length of each object is encoded in a variable
60 length format and is not constrained to 32-bit or anything.
61
62 • The trailer records a pack checksum of all of the above.
63
64 Object types
65 Valid object types are:
66
67 • OBJ_COMMIT (1)
68
69 • OBJ_TREE (2)
70
71 • OBJ_BLOB (3)
72
73 • OBJ_TAG (4)
74
75 • OBJ_OFS_DELTA (6)
76
77 • OBJ_REF_DELTA (7)
78
79 Type 5 is reserved for future expansion. Type 0 is invalid.
80
81 Size encoding
82 This document uses the following "size encoding" of non-negative
83 integers: From each byte, the seven least significant bits are used to
84 form the resulting integer. As long as the most significant bit is 1,
85 this process continues; the byte with MSB 0 provides the last seven
86 bits. The seven-bit chunks are concatenated. Later values are more
87 significant.
88
89 This size encoding should not be confused with the "offset encoding",
90 which is also used in this document.
91
92 Deltified representation
93 Conceptually there are only four object types: commit, tree, tag and
94 blob. However to save space, an object could be stored as a "delta" of
95 another "base" object. These representations are assigned new types
96 ofs-delta and ref-delta, which is only valid in a pack file.
97
98 Both ofs-delta and ref-delta store the "delta" to be applied to another
99 object (called base object) to reconstruct the object. The difference
100 between them is, ref-delta directly encodes base object name. If the
101 base object is in the same pack, ofs-delta encodes the offset of the
102 base object in the pack instead.
103
104 The base object could also be deltified if it’s in the same pack.
105 Ref-delta can also refer to an object outside the pack (i.e. the
106 so-called "thin pack"). When stored on disk however, the pack should be
107 self contained to avoid cyclic dependency.
108
109 The delta data starts with the size of the base object and the size of
110 the object to be reconstructed. These sizes are encoded using the size
111 encoding from above. The remainder of the delta data is a sequence of
112 instructions to reconstruct the object from the base object. If the
113 base object is deltified, it must be converted to canonical form first.
114 Each instruction appends more and more data to the target object until
115 it’s complete. There are two supported instructions so far: one for
116 copy a byte range from the source object and one for inserting new data
117 embedded in the instruction itself.
118
119 Each instruction has variable length. Instruction type is determined by
120 the seventh bit of the first octet. The following diagrams follow the
121 convention in RFC 1951 (Deflate compressed data format).
122
123 Instruction to copy from base object
124 +----------+---------+---------+---------+---------+-------+-------+-------+
125 | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 |
126 +----------+---------+---------+---------+---------+-------+-------+-------+
127
128 This is the instruction format to copy a byte range from the source
129 object. It encodes the offset to copy from and the number of bytes
130 to copy. Offset and size are in little-endian order.
131
132 All offset and size bytes are optional. This is to reduce the
133 instruction size when encoding small offsets or sizes. The first
134 seven bits in the first octet determines which of the next seven
135 octets is present. If bit zero is set, offset1 is present. If bit
136 one is set offset2 is present and so on.
137
138 Note that a more compact instruction does not change offset and
139 size encoding. For example, if only offset2 is omitted like below,
140 offset3 still contains bits 16-23. It does not become offset2 and
141 contains bits 8-15 even if it’s right next to offset1.
142
143 +----------+---------+---------+
144 | 10000101 | offset1 | offset3 |
145 +----------+---------+---------+
146
147 In its most compact form, this instruction only takes up one byte
148 (0x80) with both offset and size omitted, which will have default
149 values zero. There is another exception: size zero is automatically
150 converted to 0x10000.
151
152 Instruction to add new data
153 +----------+============+
154 | 0xxxxxxx | data |
155 +----------+============+
156
157 This is the instruction to construct target object without the base
158 object. The following data is appended to the target object. The
159 first seven bits of the first octet determines the size of data in
160 bytes. The size must be non-zero.
161
162 Reserved instruction
163 +----------+============
164 | 00000000 |
165 +----------+============
166
167 This is the instruction reserved for future expansion.
168
170 • The header consists of 256 4-byte network byte order integers. N-th
171 entry of this table records the number of objects in the
172 corresponding pack, the first byte of whose object name is less
173 than or equal to N. This is called the first-level fan-out table.
174
175 • The header is followed by sorted 24-byte entries, one entry per
176 object in the pack. Each entry is:
177
178 4-byte network byte order integer, recording where the
179 object is stored in the packfile as the offset from the
180 beginning.
181
182 one object name of the appropriate size.
183
184 • The file is concluded with a trailer:
185
186 A copy of the pack checksum at the end of the corresponding
187 packfile.
188
189 Index checksum of all of the above.
190
191 Pack Idx file:
192
193 -- +--------------------------------+
194 fanout | fanout[0] = 2 (for example) |-.
195 table +--------------------------------+ |
196 | fanout[1] | |
197 +--------------------------------+ |
198 | fanout[2] | |
199 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
200 | fanout[255] = total objects |---.
201 -- +--------------------------------+ | |
202 main | offset | | |
203 index | object name 00XXXXXXXXXXXXXXXX | | |
204 table +--------------------------------+ | |
205 | offset | | |
206 | object name 00XXXXXXXXXXXXXXXX | | |
207 +--------------------------------+<+ |
208 .-| offset | |
209 | | object name 01XXXXXXXXXXXXXXXX | |
210 | +--------------------------------+ |
211 | | offset | |
212 | | object name 01XXXXXXXXXXXXXXXX | |
213 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
214 | | offset | |
215 | | object name FFXXXXXXXXXXXXXXXX | |
216 --| +--------------------------------+<--+
217 trailer | | packfile checksum |
218 | +--------------------------------+
219 | | idxfile checksum |
220 | +--------------------------------+
221 .-------.
222 |
223 Pack file entry: <+
224
225 packed object header:
226 1-byte size extension bit (MSB)
227 type (next 3 bit)
228 size0 (lower 4-bit)
229 n-byte sizeN (as long as MSB is set, each 7-bit)
230 size0..sizeN form 4+7+7+..+7 bit integer, size0
231 is the least significant part, and sizeN is the
232 most significant part.
233 packed object data:
234 If it is not DELTA, then deflated bytes (the size above
235 is the size before compression).
236 If it is REF_DELTA, then
237 base object name (the size above is the
238 size of the delta data that follows).
239 delta data, deflated.
240 If it is OFS_DELTA, then
241 n-byte offset (see below) interpreted as a negative
242 offset from the type-byte of the header of the
243 ofs-delta entry (the size above is the size of
244 the delta data that follows).
245 delta data, deflated.
246
247 offset encoding:
248 n bytes with MSB set in all but the last one.
249 The offset is then the number constructed by
250 concatenating the lower 7 bit of each byte, and
251 for n >= 2 adding 2^7 + 2^14 + ... + 2^(7*(n-1))
252 to the result.
253
255 have some other reorganizations. They have the format:
256
257 • A 4-byte magic number \377tOc which is an unreasonable fanout[0]
258 value.
259
260 • A 4-byte version number (= 2)
261
262 • A 256-entry fan-out table just like v1.
263
264 • A table of sorted object names. These are packed together without
265 offset values to reduce the cache footprint of the binary search
266 for a specific object name.
267
268 • A table of 4-byte CRC32 values of the packed object data. This is
269 new in v2 so compressed data can be copied directly from pack to
270 pack during repacking without undetected data corruption.
271
272 • A table of 4-byte offset values (in network byte order). These are
273 usually 31-bit pack file offsets, but large offsets are encoded as
274 an index into the next table with the msbit set.
275
276 • A table of 8-byte offset entries (empty for pack files less than 2
277 GiB). Pack files are organized with heavily used objects toward the
278 front, so most object references should not need to refer to this
279 table.
280
281 • The same trailer as a v1 pack file:
282
283 A copy of the pack checksum at the end of
284 corresponding packfile.
285
286 Index checksum of all of the above.
287
289 • A 4-byte magic number 0x52494458 (RIDX).
290
291 • A 4-byte version identifier (= 1).
292
293 • A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
294
295 • A table of index positions (one per packed object, num_objects in
296 total, each a 4-byte unsigned integer in network order), sorted by
297 their corresponding offsets in the packfile.
298
299 • A trailer, containing a:
300
301 checksum of the corresponding packfile, and
302
303 a checksum of all of the above.
304
305 All 4-byte numbers are in network order.
306
308 All 4-byte numbers are in network byte order.
309
310 • A 4-byte magic number 0x4d544d45 (MTME).
311
312 • A 4-byte version identifier (= 1).
313
314 • A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
315
316 • A table of 4-byte unsigned integers. The ith value is the
317 modification time (mtime) of the ith object in the corresponding
318 pack by lexicographic (index) order. The mtimes count standard
319 epoch seconds.
320
321 • A trailer, containing a checksum of the corresponding packfile, and
322 a checksum of all of the above (each having length according to the
323 specified hash function).
324
326 The multi-pack-index files refer to multiple pack-files and loose
327 objects.
328
329 In order to allow extensions that add extra data to the MIDX, we
330 organize the body into "chunks" and provide a lookup table at the
331 beginning of the body. The header includes certain length values, such
332 as the number of packs, the number of base MIDX files, hash lengths and
333 types.
334
335 All 4-byte numbers are in network order.
336
337 HEADER:
338
339 4-byte signature:
340 The signature is: {'M', 'I', 'D', 'X'}
341
342 1-byte version number:
343 Git only writes or recognizes version 1.
344
345 1-byte Object Id Version
346 We infer the length of object IDs (OIDs) from this value:
347 1 => SHA-1
348 2 => SHA-256
349 If the hash type does not match the repository's hash algorithm,
350 the multi-pack-index file should be ignored with a warning
351 presented to the user.
352
353 1-byte number of "chunks"
354
355 1-byte number of base multi-pack-index files:
356 This value is currently always zero.
357
358 4-byte number of pack files
359
360 CHUNK LOOKUP:
361
362 (C + 1) * 12 bytes providing the chunk offsets:
363 First 4 bytes describe chunk id. Value 0 is a terminating label.
364 Other 8 bytes provide offset in current file for chunk to start.
365 (Chunks are provided in file-order, so you can infer the length
366 using the next chunk position if necessary.)
367
368 The CHUNK LOOKUP matches the table of contents from
369 the chunk-based file format, see linkgit:gitformat-chunk[5].
370
371 The remaining data in the body is described one chunk at a time, and
372 these chunks may be given in any order. Chunks are required unless
373 otherwise specified.
374
375 CHUNK DATA:
376
377 Packfile Names (ID: {'P', 'N', 'A', 'M'})
378 Stores the packfile names as concatenated, null-terminated strings.
379 Packfiles must be listed in lexicographic order for fast lookups by
380 name. This is the only chunk not guaranteed to be a multiple of four
381 bytes in length, so should be the last chunk for alignment reasons.
382
383 OID Fanout (ID: {'O', 'I', 'D', 'F'})
384 The ith entry, F[i], stores the number of OIDs with first
385 byte at most i. Thus F[255] stores the total
386 number of objects.
387
388 OID Lookup (ID: {'O', 'I', 'D', 'L'})
389 The OIDs for all objects in the MIDX are stored in lexicographic
390 order in this chunk.
391
392 Object Offsets (ID: {'O', 'O', 'F', 'F'})
393 Stores two 4-byte values for every object.
394 1: The pack-int-id for the pack storing this object.
395 2: The offset within the pack.
396 If all offsets are less than 2^32, then the large offset chunk
397 will not exist and offsets are stored as in IDX v1.
398 If there is at least one offset value larger than 2^32-1, then
399 the large offset chunk must exist, and offsets larger than
400 2^31-1 must be stored in it instead. If the large offset chunk
401 exists and the 31st bit is on, then removing that bit reveals
402 the row in the large offsets containing the 8-byte offset of
403 this object.
404
405 [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
406 8-byte offsets into large packfiles.
407
408 [Optional] Bitmap pack order (ID: {'R', 'I', 'D', 'X'})
409 A list of MIDX positions (one per object in the MIDX, num_objects in
410 total, each a 4-byte unsigned integer in network byte order), sorted
411 according to their relative bitmap/pseudo-pack positions.
412
413 TRAILER:
414
415 Index checksum of the above contents.
416
418 Similar to the pack-based reverse index, the multi-pack index can also
419 be used to generate a reverse index.
420
421 Instead of mapping between offset, pack-, and index position, this
422 reverse index maps between an object’s position within the MIDX, and
423 that object’s position within a pseudo-pack that the MIDX describes
424 (i.e., the ith entry of the multi-pack reverse index holds the MIDX
425 position of ith object in pseudo-pack order).
426
427 To clarify the difference between these orderings, consider a
428 multi-pack reachability bitmap (which does not yet exist, but is what
429 we are building towards here). Each bit needs to correspond to an
430 object in the MIDX, and so we need an efficient mapping from bit
431 position to MIDX position.
432
433 One solution is to let bits occupy the same position in the oid-sorted
434 index stored by the MIDX. But because oids are effectively random,
435 their resulting reachability bitmaps would have no locality, and thus
436 compress poorly. (This is the reason that single-pack bitmaps use the
437 pack ordering, and not the .idx ordering, for the same purpose.)
438
439 So we’d like to define an ordering for the whole MIDX based around pack
440 ordering, which has far better locality (and thus compresses more
441 efficiently). We can think of a pseudo-pack created by the
442 concatenation of all of the packs in the MIDX. E.g., if we had a MIDX
443 with three packs (a, b, c), with 10, 15, and 20 objects respectively,
444 we can imagine an ordering of the objects like:
445
446 |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
447
448 where the ordering of the packs is defined by the MIDX’s pack list, and
449 then the ordering of objects within each pack is the same as the order
450 in the actual packfile.
451
452 Given the list of packs and their counts of objects, you can naïvely
453 reconstruct that pseudo-pack ordering (e.g., the object at position 27
454 must be (c,1) because packs "a" and "b" consumed 25 of the slots). But
455 there’s a catch. Objects may be duplicated between packs, in which case
456 the MIDX only stores one pointer to the object (and thus we’d want only
457 one slot in the bitmap).
458
459 Callers could handle duplicates themselves by reading objects in order
460 of their bit-position, but that’s linear in the number of objects, and
461 much too expensive for ordinary bitmap lookups. Building a reverse
462 index solves this, since it is the logical inverse of the index, and
463 that index has already removed duplicates. But, building a reverse
464 index on the fly can be expensive. Since we already have an on-disk
465 format for pack-based reverse indexes, let’s reuse it for the MIDX’s
466 pseudo-pack, too.
467
468 Objects from the MIDX are ordered as follows to string together the
469 pseudo-pack. Let pack(o) return the pack from which o was selected by
470 the MIDX, and define an ordering of packs based on their numeric ID (as
471 stored by the MIDX). Let offset(o) return the object offset of o within
472 pack(o). Then, compare o1 and o2 as follows:
473
474 • If one of pack(o1) and pack(o2) is preferred and the other is not,
475 then the preferred one sorts first.
476
477 (This is a detail that allows the MIDX bitmap to determine which
478 pack should be used by the pack-reuse mechanism, since it can ask
479 the MIDX for the pack containing the object at bit position 0).
480
481 • If pack(o1) ≠ pack(o2), then sort the two objects in descending
482 order based on the pack ID.
483
484 • Otherwise, pack(o1) = pack(o2), and the objects are sorted in
485 pack-order (i.e., o1 sorts ahead of o2 exactly when offset(o1) <
486 offset(o2)).
487
488 In short, a MIDX’s pseudo-pack is the de-duplicated concatenation of
489 objects in packs stored by the MIDX, laid out in pack order, and the
490 packs arranged in MIDX order (with the preferred pack coming first).
491
492 The MIDX’s reverse index is stored in the optional RIDX chunk within
493 the MIDX itself.
494
496 The cruft packs feature offer an alternative to Git’s traditional
497 mechanism of removing unreachable objects. This document provides an
498 overview of Git’s pruning mechanism, and how a cruft pack can be used
499 instead to accomplish the same.
500
501 Background
502 To remove unreachable objects from your repository, Git offers git
503 repack -Ad (see git-repack(1)). Quoting from the documentation:
504
505 [...] unreachable objects in a previous pack become loose, unpacked objects,
506 instead of being left in the old pack. [...] loose unreachable objects will be
507 pruned according to normal expiry rules with the next 'git gc' invocation.
508
509 Unreachable objects aren’t removed immediately, since doing so could
510 race with an incoming push which may reference an object which is about
511 to be deleted. Instead, those unreachable objects are stored as loose
512 objects and stay that way until they are older than the expiration
513 window, at which point they are removed by git-prune(1).
514
515 Git must store these unreachable objects loose in order to keep track
516 of their per-object mtimes. If these unreachable objects were written
517 into one big pack, then either freshening that pack (because an object
518 contained within it was re-written) or creating a new pack of
519 unreachable objects would cause the pack’s mtime to get updated, and
520 the objects within it would never leave the expiration window. Instead,
521 objects are stored loose in order to keep track of the individual
522 object mtimes and avoid a situation where all cruft objects are
523 freshened at once.
524
525 This can lead to undesirable situations when a repository contains many
526 unreachable objects which have not yet left the grace period. Having
527 large directories in the shards of .git/objects can lead to decreased
528 performance in the repository. But given enough unreachable objects,
529 this can lead to inode starvation and degrade the performance of the
530 whole system. Since we can never pack those objects, these repositories
531 often take up a large amount of disk space, since we can only zlib
532 compress them, but not store them in delta chains.
533
534 Cruft packs
535 A cruft pack eliminates the need for storing unreachable objects in a
536 loose state by including the per-object mtimes in a separate file
537 alongside a single pack containing all loose objects.
538
539 A cruft pack is written by git repack --cruft when generating a new
540 pack. git-pack-objects(1)'s --cruft option. Note that git repack
541 --cruft is a classic all-into-one repack, meaning that everything in
542 the resulting pack is reachable, and everything else is unreachable.
543 Once written, the --cruft option instructs git repack to generate
544 another pack containing only objects not packed in the previous step
545 (which equates to packing all unreachable objects together). This
546 progresses as follows:
547
548 1. Enumerate every object, marking any object which is (a) not
549 contained in a kept-pack, and (b) whose mtime is within the grace
550 period as a traversal tip.
551
552 2. Perform a reachability traversal based on the tips gathered in the
553 previous step, adding every object along the way to the pack.
554
555 3. Write the pack out, along with a .mtimes file that records the
556 per-object timestamps.
557
558 This mode is invoked internally by git-repack(1) when instructed to
559 write a cruft pack. Crucially, the set of in-core kept packs is exactly
560 the set of packs which will not be deleted by the repack; in other
561 words, they contain all of the repository’s reachable objects.
562
563 When a repository already has a cruft pack, git repack --cruft
564 typically only adds objects to it. An exception to this is when git
565 repack is given the --cruft-expiration option, which allows the
566 generated cruft pack to omit expired objects instead of waiting for
567 git-gc(1) to expire those objects later on.
568
569 It is git-gc(1) that is typically responsible for removing expired
570 unreachable objects.
571
572 Caution for mixed-version environments
573 Repositories that have cruft packs in them will continue to work with
574 any older version of Git. Note, however, that previous versions of Git
575 which do not understand the .mtimes file will use the cruft pack’s
576 mtime as the mtime for all of the objects in it. In other words, do not
577 expect older (pre-cruft pack) versions of Git to interpret or even read
578 the contents of the .mtimes file.
579
580 Note that having mixed versions of Git GC-ing the same repository can
581 lead to unreachable objects never being completely pruned. This can
582 happen under the following circumstances:
583
584 • An older version of Git running GC explodes the contents of an
585 existing cruft pack loose, using the cruft pack’s mtime.
586
587 • A newer version running GC collects those loose objects into a
588 cruft pack, where the .mtime file reflects the loose object’s
589 actual mtimes, but the cruft pack mtime is "now".
590
591 Repeating this process will lead to unreachable objects not getting
592 pruned as a result of repeatedly resetting the objects' mtimes to the
593 present time.
594
595 If you are GC-ing repositories in a mixed version environment, consider
596 omitting the --cruft option when using git-repack(1) and git-gc(1), and
597 leaving the gc.cruftPacks configuration unset until all writers
598 understand cruft packs.
599
600 Alternatives
601 Notable alternatives to this design include:
602
603 • The location of the per-object mtime data, and
604
605 • Storing unreachable objects in multiple cruft packs.
606
607 On the location of mtime data, a new auxiliary file tied to the pack
608 was chosen to avoid complicating the .idx format. If the .idx format
609 were ever to gain support for optional chunks of data, it may make
610 sense to consolidate the .mtimes format into the .idx itself.
611
612 Storing unreachable objects among multiple cruft packs (e.g., creating
613 a new cruft pack during each repacking operation including only
614 unreachable objects which aren’t already stored in an earlier cruft
615 pack) is significantly more complicated to construct, and so aren’t
616 pursued here. The obvious drawback to the current implementation is
617 that the entire cruft pack must be re-written from scratch.
618
620 Part of the git(1) suite
621
622
623
624Git 2.39.1 2023-01-13 GITFORMAT-PACK(5)