PUBLIC-INBOX-EXTINDEX-FORMAT(5)  public-inbox user manual  PUBLIC-INBOX-EXTINDEX-FORMAT(5)

NAME
public-inbox-extindex-format - external index format description

DESCRIPTION
The extindex is an index-only evolution of the per-inbox SQLite and
Xapian indices used by public-inbox-v2-format(5) and
public-inbox-v1-format(5). It exists to facilitate searches across
multiple inboxes as well as to reduce index space when messages are
cross-posted to several existing inboxes.

It transparently indexes messages across any combination of v1 and v2
inboxes and data about inboxes themselves.

While inspired by v2, there is no git blob storage nor "msgmap.sqlite3"
DB.

Instead, there is an "ALL.git" (all caps) git repo which treats every
indexed v1 inbox or v2 epoch as a git alternate.
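
As an illustration of the alternates arrangement, the Python sketch
below builds the lines one might expect to find in the
"objects/info/alternates" file of ALL.git. The object directory
locations ($INBOXDIR/objects for v1, $INBOXDIR/git/$EPOCH.git/objects
for v2) follow the v1 and v2 formats. The function name, the use of
absolute paths, and the example paths are assumptions made for
illustration, not a description of what public-inbox itself writes.

```python
import os


def alternates_lines(v1_inboxdirs, v2_epochs):
    """Return one git object directory per indexed v1 inbox or v2 epoch.

    v1_inboxdirs: iterable of v1 inbox directories (each a bare git repo)
    v2_epochs:    iterable of (v2 inboxdir, epoch number) pairs

    Hypothetical helper; the real file may be ordered or written differently.
    """
    lines = []
    for inboxdir in v1_inboxdirs:
        # A v1 inbox is itself a bare repo, so objects live directly in it.
        lines.append(os.path.join(os.path.abspath(inboxdir), "objects"))
    for inboxdir, epoch in v2_epochs:
        # Each v2 epoch is a separate bare repo under git/$EPOCH.git.
        lines.append(os.path.join(os.path.abspath(inboxdir),
                                  "git", f"{epoch}.git", "objects"))
    return lines
```

Because ALL.git stores no blobs of its own, every object lookup is
answered by one of these alternates.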
24
25 As with v2 inboxes, it uses "over.sqlite3" and Xapian "shards" for WWW
26 and IMAP use. Several exclusive new tables are added to deal with
27 "XREF3 DEDUPLICATION" and metadata.
28
29 Unlike v1 and v2 inboxes, it is NOT designed to map to a NNTP
30 newsgroup. Thus it lacks "msgmap.sqlite3" to enforce the unique
31 Message-ID requirement of NNTP.
32
33 INDEX OVERVIEW AND DEFINITIONS
34 $SCHEMA_VERSION - DB schema version (for Xapian)
35 $SHARD - Integer starting with 0 based on parallelism
36
37 foo/ # "foo" is the name of the index
38 - ei.lock # lock file to protect global state
39 - ALL.git # empty, alternates for inboxes
40 - ei$SCHEMA_VERSION/$SHARD # per-shard Xapian DB
41 - ei$SCHEMA_VERSION/over.sqlite3 # overview DB for WWW, IMAP
42 - ei$SCHEMA_VERSION/misc # misc Xapian DB
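
The layout above can be expressed as a small Python sketch that
derives each path from the index directory, the schema version, and
the number of shards. The function name, the dictionary keys, and the
argument values used when calling it are hypothetical; only the path
components come from the layout itself.

```python
import os


def extindex_paths(eidx_dir, schema_version, nshards):
    """Map the documented extindex layout to concrete paths.

    Hypothetical helper: schema_version and nshards must be supplied by
    the caller, since this sketch does not inspect an existing index.
    """
    top = os.path.join(eidx_dir, f"ei{schema_version}")
    return {
        "lock": os.path.join(eidx_dir, "ei.lock"),      # global state lock
        "all_git": os.path.join(eidx_dir, "ALL.git"),   # alternates only
        "over": os.path.join(top, "over.sqlite3"),      # overview DB
        "misc": os.path.join(top, "misc"),              # misc Xapian DB
        # $SHARD is an integer starting with 0
        "shards": [os.path.join(top, str(n)) for n in range(nshards)],
    }
```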

File and directory names are intentionally different from analogous v2
names to ensure extindex and v2 inboxes can easily be distinguished
from each other.

XREF3 DEDUPLICATION
Due to cross-posted messages being the norm in the large Linux kernel
development community and Xapian indices being the primary consumer of
storage, it makes sense to deduplicate indexing as much as possible.

The internal storage format is based on the NNTP "Xref" tuple, but with
the addition of a third element: the git blob OID. Thus the triple is
expressed in string form as:

  $NEWSGROUP_NAME:$ARTICLE_NUM:$OID

If no "newsgroup" is configured for an inbox, the "inboxdir" of the
inbox is used.
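
To make the string form of the triple concrete, here is a minimal
Python sketch that formats and parses it. The class and function names
are hypothetical, and the example values in the usage note are
invented. Splitting from the right is a design choice of this sketch:
when an "inboxdir" stands in for the newsgroup name, the first element
may itself contain a colon.

```python
from typing import NamedTuple


class Xref3(NamedTuple):
    newsgroup: str    # newsgroup name, or inboxdir if none is configured
    article_num: int  # article number within that inbox
    oid: str          # git blob OID, as a hex string


def format_xref3(x):
    """Render the triple as $NEWSGROUP_NAME:$ARTICLE_NUM:$OID."""
    return f"{x.newsgroup}:{x.article_num}:{x.oid}"


def parse_xref3(s):
    """Parse the string form; raises ValueError if it has fewer than 3 parts."""
    # rsplit keeps any ':' inside the first element intact.
    newsgroup, num, oid = s.rsplit(":", 2)
    return Xref3(newsgroup, int(num), oid)
```

For example, parse_xref3("inbox.comp.example:42:" + oid) yields
article number 42 in the (made-up) newsgroup "inbox.comp.example". A
cross-posted message would carry one such triple per inbox it appears
in.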

This data is stored in the "xref3" table of over.sqlite3.

misc XAPIAN DB
In addition to the numeric Xapian shards for indexing messages, there
is a new, in-development Xapian index for storing data about inboxes
themselves and other non-message data. This index allows us to speed
up operations involving hundreds or thousands of inboxes.

In addition to providing cross-inbox search capabilities, it can also
replace per-inbox Xapian shards (but not per-inbox over.sqlite3). This
allows reduction in disk space, open file handles, and associated
memory use.

Relocating v1 and v2 inboxes on the filesystem will require extindex to
be garbage-collected and/or reindexed.

Configuring and maintaining stable "newsgroup" names for every inbox
before any of its messages are indexed avoids expensive reindexing;
garbage collection alone is then sufficient.

flock(2) locking exclusively locks the empty ei.lock file for all
non-atomic operations.
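
The locking rule can be sketched in Python with fcntl.flock, which
wraps flock(2). This is only an illustration of taking an exclusive
lock on ei.lock around a non-atomic operation. The context manager
name is hypothetical, and creating the file when it is missing
(O_CREAT) is an assumption of this sketch.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def eidx_lock(eidx_dir):
    """Hold an exclusive flock(2) lock on ei.lock for the `with` body."""
    path = os.path.join(eidx_dir, "ei.lock")
    # Assumption: create the (empty) lock file if it does not exist yet.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
        yield fd
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

A writer would wrap each multi-step update in "with eidx_lock(dir):"
so that concurrent writers observing the same rule are serialized.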

Thanks to the Linux Foundation for sponsoring the development and
testing.

COPYRIGHT
Copyright 2020-2021 all contributors <mailto:meta@public-inbox.org>

License: AGPL-3.0+ <http://www.gnu.org/licenses/agpl-3.0.txt>

SEE ALSO
public-inbox-v2-format(5)

public-inbox.git  1993-10-02  PUBLIC-INBOX-EXTINDEX-FORMAT(5)