PUBLIC-INBOX-V2-FORMAT(5) public-inbox user manual PUBLIC-INBOX-V2-FORMAT(5)

NAME
public-inbox-v2-format - structure of public inbox v2 archives

DESCRIPTION
The v2 format is designed primarily to address several scalability
problems of the original format described at public-inbox-v1-format(5).
It also handles messages with missing or conflicting Message-IDs.

INBOX LAYOUT
The key change in v2 is that the inbox is no longer a bare git
repository, but a directory with two or more git repositories. v2
divides git repositories by time into "epochs" and divides Xapian
databases into "shards" for parallelism.

INBOX OVERVIEW AND DEFINITIONS
$EPOCH - Integer starting with 0 based on time
$SCHEMA_VERSION - DB schema version (for Xapian)
$SHARD - Integer starting with 0 based on parallelism

foo/                              # "foo" is the name of the inbox
- inbox.lock                      # lock file to protect global state
- git/$EPOCH.git                  # normal git repositories
- all.git                         # empty, alternates to $EPOCH.git
- xap$SCHEMA_VERSION/$SHARD       # per-shard Xapian DB
- xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP, threading
- msgmap.sqlite3                  # same as the v1 msgmap

For blob lookups, the reader only needs to open the "all.git"
repository, whose $GIT_DIR/objects/info/alternates file references
every $EPOCH.git repo.

Individual $EPOCH.git repos DO NOT use alternates themselves, as git
currently limits the nesting depth of alternates to 5.
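The relationship between "all.git" and the epoch repositories can be
sketched as the contents of the alternates file. This is a minimal
illustration, not the public-inbox implementation: the function name is
hypothetical, and the relative form "../../git/N.git/objects" is an
assumption based on git resolving relative alternates against the
object directory (here, all.git/objects).

```python
def alternates_lines(max_epoch: int) -> list[str]:
    """Return one alternates entry per epoch from 0 through max_epoch.

    Assumption: entries are relative to all.git/objects, so "../../git"
    climbs out of all.git and into the inbox's git/ directory.
    """
    return [f"../../git/{epoch}.git/objects" for epoch in range(max_epoch + 1)]


if __name__ == "__main__":
    # Entries a reader might expect for an inbox with epochs 0, 1 and 2.
    print("\n".join(alternates_lines(2)))
```

Because every epoch is listed directly in all.git, no epoch needs
alternates of its own and the nesting depth stays at one.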

GIT EPOCHS
One of the inherent scalability problems with git itself is that the
full history of a project must be stored and carried around to all
clients. To address this problem, the v2 format uses multiple git
repositories, stored as time-based "epochs".

We currently divide epochs into segments of roughly one gigabyte; this
size may become configurable in the future if needed.

A pleasant side effect of this design is that the git packs of older
epochs are stable, allowing them to be cloned without requiring
expensive pack generation. This also allows clients to clone only the
epochs they are interested in to save bandwidth and storage.

To minimize changes to existing v1-based code and simplify our code, we
use the "alternates" mechanism described in gitrepository-layout(5) to
link all the epoch repositories with a single read-only "all.git"
endpoint.

Processes retrieve blobs via the "all.git" repository, while writers
write blobs directly to epochs.
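How a writer might pick the epoch to write to can be sketched as
follows. The rollover rule, the exact byte threshold, and both function
names are assumptions made for illustration; the text above only says
that epochs are roughly one gigabyte.

```python
# "Roughly one gigabyte"; the exact threshold is an assumption.
EPOCH_LIMIT_BYTES = 1 << 30


def epoch_for_write(current_epoch: int, epoch_bytes: int,
                    limit: int = EPOCH_LIMIT_BYTES) -> int:
    """Return the epoch a writer should append to.

    Hypothetical rule: stay in the current epoch until its size reaches
    the limit, then move to the next one. Older epochs are never
    reopened, which is what keeps their packs stable.
    """
    if epoch_bytes >= limit:
        return current_epoch + 1
    return current_epoch


def epoch_dir(epoch: int) -> str:
    """Path of an epoch repository relative to the inbox directory."""
    return f"git/{epoch}.git"
```

Readers are unaffected by a rollover, since they only ever open
"all.git".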

GIT TREE LAYOUT
One key problem specific to v1 was that large trees frequently hurt
performance: name lookups are expensive, and unpredictable file names
left limited deltafication opportunities. As a result, all
Xapian-enabled installations retrieve blob object_ids directly in v1,
bypassing tree lookups.

While dividing git repositories into epochs caps the growth of trees,
worst-case tree size was still unnecessary overhead and worth
eliminating.

So in contrast to the big trees of v1, the v2 git tree contains only a
single file at the top level of the tree, either 'm' (for 'mail' or
'message') or 'd' (for deleted). A tree does not have 'm' and 'd' at
the same time.

Mail is still stored in blobs (instead of inline with the commit
object) as we still need a stable reference in the indices in case
commit history is rewritten to comply with legal requirements.

After-the-fact invocations of public-inbox-index will ignore messages
written to 'd' after they are written to 'm'.
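The effect of replaying 'm' and 'd' commits can be sketched as below.
This is not how public-inbox-index is implemented; the function name
and the input shape are hypothetical, and the sketch assumes a deletion
commit stores the message under 'd' with the same blob ID it had under
'm'.

```python
def live_blobs(history: list[tuple[str, str]]) -> list[str]:
    """Replay v2 commits, oldest first, and return blobs still present.

    history holds one (entry_name, blob_id) pair per commit, where
    entry_name is the single top-level file in that commit's tree.
    Assumption: 'd' carries the same blob_id the message had under 'm'.
    """
    live: dict[str, bool] = {}  # dicts keep insertion (commit) order
    for entry_name, blob_id in history:
        if entry_name == "m":
            live[blob_id] = True
        elif entry_name == "d":
            live.pop(blob_id, None)  # deleted later, so it is ignored
        else:
            raise ValueError(f"unexpected v2 tree entry: {entry_name!r}")
    return list(live)
```

A message written to 'm' and later written to 'd' drops out of the
result, matching the indexing behavior described above.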

Deltafication is not significantly improved over v1, but overall
storage for trees is made as small as possible. Initial statistics
and benchmarks showing the benefits of this approach are documented at:

<https://public-inbox.org/meta/20180209205140.GA11047@dcvr/>

XAPIAN SHARDS
A second scalability problem in v1 was the inability to utilize
multiple CPU cores for Xapian indexing. This is addressed by using
shards in Xapian to perform import indexing in parallel.

As with git alternates, Xapian natively supports a read-only interface
which transparently abstracts away the knowledge of multiple shards.
This allows us to simplify our read-only code paths.
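Splitting indexing work across shards can be sketched with a simple
modulo assignment. The modulo rule and both function names are
assumptions for illustration only; this text does not specify how
documents are assigned to shards.

```python
def shard_for(docid: int, nshards: int) -> int:
    """Pick a shard for a document ID (assumed rule: docid modulo nshards)."""
    if nshards < 1:
        raise ValueError("nshards must be at least 1")
    return docid % nshards


def partition(docids, nshards: int) -> list[list[int]]:
    """Group document IDs by shard so each group can be indexed by its own worker."""
    shards: list[list[int]] = [[] for _ in range(nshards)]
    for docid in docids:
        shards[shard_for(docid, nshards)].append(docid)
    return shards
```

A deterministic assignment such as this lets each worker index its
shard without coordinating with the others.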

The performance of the storage device is now the bottleneck on larger
multi-core systems. In our experience, performance is improved with
high-quality and high-quantity solid-state storage. Issuing TRIM
commands with fstrim(8) was necessary to maintain consistent
performance while developing this feature.

Rotational storage devices perform significantly worse than solid-state
storage for indexing of large mail archives, but are fine for backup
and usable for small instances.

As of public-inbox 1.6.0, the "publicInbox.indexSequentialShard" option
of public-inbox-index(1) may be used with a high shard count to ensure
individual shards fit into page cache when the entire Xapian DB cannot.

Our use of the "OVERVIEW DB" requires Xapian document IDs to remain
stable. Using the public-inbox-compact(1) and public-inbox-xcpdb(1)
wrappers is recommended over tools provided by Xapian.

OVERVIEW DB
Towards the end of v2 development, it became apparent that Xapian did
not perform well for sorting the large result sets used to generate the
landing page in the PSGI UI (/$INBOX/) or many queries used by the NNTP
server. Thus, SQLite was employed and the Xapian "skeleton" DB was
renamed to the "overview" DB (after the NNTP OVER/XOVER commands).

The overview DB maintains all the header information necessary to
implement the NNTP OVER/XOVER commands and non-search endpoints of the
PSGI UI.

Xapian has become completely optional for v2 (as it is for v1), but
SQLite remains required for v2. SQLite turns out to be powerful enough
to maintain overview information. Most of the PSGI and all of the NNTP
functionality is possible with only SQLite in addition to git.

The overview DB was an instrumental piece in maintaining near
constant-time read performance on a dataset 2-3 times larger than LKML
history as of 2018.

GHOST MESSAGES
The overview DB also includes references to "ghost" messages, or
messages which have replies but have not been seen by us. Thus it is
expected to have more rows than the "msgmap" DB described below.
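The idea of a ghost can be sketched with plain sets. The input shape
and function names are hypothetical and say nothing about the real
overview schema; the sketch only shows why the overview DB ends up with
more rows than msgmap.

```python
def find_ghosts(messages: dict[str, list[str]]) -> set[str]:
    """Return Message-IDs that are referenced but were never seen.

    messages maps each seen Message-ID to the Message-IDs it references
    (for example, from its References or In-Reply-To headers).
    """
    seen = set(messages)
    referenced = {ref for refs in messages.values() for ref in refs}
    return referenced - seen


def overview_row_count(messages: dict[str, list[str]]) -> int:
    """Seen messages plus ghosts; msgmap would hold only the seen ones."""
    return len(messages) + len(find_ghosts(messages))
```

For example, a reply to a message that never reached the archive adds
one seen row and one ghost row.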

msgmap.sqlite3
The SQLite msgmap DB is unchanged from v1, but it is now at the
top level of the directory.

OBJECT IDENTIFIERS
There are three distinct types of identifiers. content_hash is the new
one for v2 and should make message removal and deduplication easier.
object_id and Message-ID are already known.

object_id
    The blob identifier git uses (currently SHA-1). There is no need to
    publicly expose this outside of normal git ops (cloning) and no
    need to make this searchable. As with v1 of public-inbox, this is
    stored as part of the Xapian document so expensive name lookups can
    be avoided for document retrieval.

Message-ID
    The email header; duplicates allowed for archival purposes. This
    remains a searchable field in Xapian. Note: it's possible for
    emails to have multiple Message-ID headers (and git-send-email(1)
    had that bug for a bit), so we take all of them into account. In
    case of conflicts detected by content_hash below, we generate a new
    Message-ID based on content_hash; if the generated Message-ID still
    conflicts, a random one is generated.

content_hash
    A hash of relevant headers and raw body content for purging of
    unwanted content. This is not stored anywhere, but always
    calculated on the fly.

    For now, the relevant headers are:

    Subject, From, Date, References, In-Reply-To, To, Cc

    Received, List-Id, and similar headers are NOT part of content_hash
    as they differ across lists and we want removals to be able to
    cross lists.

    The textual parts of the body are decoded, CRLF normalized to LF,
    and trailing whitespace stripped. Notably, hashing the raw body
    risks being broken by list signatures, but we can use filters (e.g.
    PublicInbox::Filter::Vger) to clean the body for imports.

    content_hash is SHA-256 for now, but can be changed at any time
    without making DB changes.
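The content_hash description can be sketched as follows. This is not
the algorithm public-inbox uses and will not produce matching digests:
the way headers are fed to the digest, the handling of single-part
bodies only, and stripping whitespace at the end of the body (rather
than per line) are all assumptions made for illustration.

```python
import hashlib
from email import message_from_string
from email.message import Message

# Header list taken from the text above.
RELEVANT_HEADERS = ("Subject", "From", "Date", "References",
                    "In-Reply-To", "To", "Cc")


def normalize_body(text: str) -> str:
    """Normalize CRLF to LF and strip trailing whitespace (assumed: at end of body)."""
    return text.replace("\r\n", "\n").rstrip()


def content_hash(msg: Message) -> str:
    """Hex SHA-256 over the relevant headers and the normalized body."""
    digest = hashlib.sha256()
    for name in RELEVANT_HEADERS:
        for value in msg.get_all(name, []):
            digest.update(f"{name}: {value}\n".encode("utf-8", "surrogateescape"))
    body = msg.get_payload()
    if isinstance(body, str):  # multipart messages are skipped in this sketch
        digest.update(normalize_body(body).encode("utf-8", "surrogateescape"))
    return digest.hexdigest()


if __name__ == "__main__":
    raw = "Subject: hi\nFrom: a@example.com\nList-Id: <l.example.com>\n\nhello\n"
    print(content_hash(message_from_string(raw)))
```

Because List-Id and Received are left out, two copies of the same
message delivered through different lists hash identically in this
sketch, which is the property that lets a removal cross lists.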

LOCKING
An exclusive flock(2) lock on the empty inbox.lock file protects all
non-atomic operations.
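Taking the lock can be sketched with Python's fcntl module on a Unix
system. The helper name is hypothetical and this is not the
public-inbox code; it only shows the flock(2) pattern on a file named
inbox.lock inside the inbox directory.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def inbox_lock(inbox_dir: str):
    """Hold an exclusive flock(2) lock on inbox_dir/inbox.lock."""
    path = os.path.join(inbox_dir, "inbox.lock")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield fd
    finally:
        os.close(fd)  # closing the descriptor releases the lock


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as inbox_dir:
        with inbox_lock(inbox_dir):
            # Non-atomic work on global state would happen here.
            print("holding", os.path.join(inbox_dir, "inbox.lock"))
```

The lock file stays empty; only the lock held on it matters.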

HEADERS
Same handling as with v1, except the Message-ID header will be
generated if not provided or conflicting. "Bytes", "Lines" and
"Content-Length" headers are stripped and not allowed, as they can
interfere with further processing.

The "Status" mbox header is also stripped as that header makes no sense
in a public archive.
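The header stripping can be sketched with Python's standard email
package. The function name is hypothetical, and Message-ID generation
is left out; only the four header names come from the text above.

```python
from email import message_from_string
from email.message import Message

# Headers named in the text above as stripped on import.
STRIPPED_HEADERS = ("Bytes", "Lines", "Content-Length", "Status")


def strip_mbox_headers(msg: Message) -> Message:
    """Remove every occurrence of the stripped headers, in place."""
    for name in STRIPPED_HEADERS:
        del msg[name]  # case-insensitive; no error if the header is absent
    return msg


if __name__ == "__main__":
    raw = "Subject: x\nStatus: RO\nLines: 1\nContent-Length: 5\n\nbody\n"
    print(strip_mbox_headers(message_from_string(raw)).as_string())
```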

THANKS
Thanks to the Linux Foundation for sponsoring the development and
testing of the v2 format.

COPYRIGHT
Copyright 2018-2021 all contributors <mailto:meta@public-inbox.org>

License: AGPL-3.0+ <http://www.gnu.org/licenses/agpl-3.0.txt>

SEE ALSO
gitrepository-layout(5), public-inbox-v1-format(5)

public-inbox.git 1993-10-02 PUBLIC-INBOX-V2-FORMAT(5)