Lucy::Docs::FileFormat(3pm)

Lucy::Docs::FileFormat(3pm)  User Contributed Perl Documentation  Lucy::Docs::FileFormat(3pm)

NAME

       Lucy::Docs::FileFormat - Overview of index file format

DESCRIPTION

       It is not necessary to understand the current implementation details of
       the index file format in order to use Apache Lucy effectively, but it
       may be helpful if you are interested in tweaking for high performance,
       exotic usage, or debugging and development.

       On a file system, an index is a directory.  The files inside have a
       hierarchical relationship: an index is made up of “segments”, each of
       which is an independent inverted index with its own subdirectory; each
       segment is made up of several component parts.

           [index]--|
                    |--snapshot_XXX.json
                    |--schema_XXX.json
                    |--write.lock
                    |
                    |--seg_1--|
                    |         |--segmeta.json
                    |         |--cfmeta.json
                    |         |--cf.dat-------|
                    |                         |--[lexicon]
                    |                         |--[postings]
                    |                         |--[documents]
                    |                         |--[highlight]
                    |                         |--[deletions]
                    |
                    |--seg_2--|
                    |         |--segmeta.json
                    |         |--cfmeta.json
                    |         |--cf.dat-------|
                    |                         |--[lexicon]
                    |                         |--[postings]
                    |                         |--[documents]
                    |                         |--[highlight]
                    |                         |--[deletions]
                    |
                    |--[...]--|

   Write-once philosophy
       All segment directory names consist of the string “seg_” followed by a
       number in base 36: seg_1, seg_5m, seg_p9s2 and so on, with higher
       numbers indicating more recent segments.  Once a segment is finished
       and committed, its name is never re-used and its files are never
       modified.
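
       As a rough illustration of the naming convention, the following Python
       sketch decodes the base-36 suffix and orders segment names from oldest
       to most recent.  The helper names are hypothetical and are not part of
       the Lucy API; the sketch assumes lowercase digits 0-9 and a-z, as in
       the examples above.

```python
import re

# Matches names such as seg_1, seg_5m, seg_p9s2 (lowercase base 36 assumed).
SEG_NAME = re.compile(r"^seg_([0-9a-z]+)$")


def segment_number(name):
    """Return the integer encoded in a 'seg_<base36>' name, or None."""
    match = SEG_NAME.match(name)
    return int(match.group(1), 36) if match else None


def sort_segments(names):
    """Order segment directory names from oldest to most recent.

    Entries that do not look like segment directories are ignored.
    """
    segs = [n for n in names if segment_number(n) is not None]
    return sorted(segs, key=segment_number)
```

       Sorting on the decoded number matters because a plain string sort
       would place seg_10 (36 in decimal) before seg_z (35 in decimal).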

       Old segments become obsolete and can be removed when their data has
       been consolidated into new segments during the process of segment
       merging and optimization.  A fully-optimized index has only one
       segment.

   Top-level entries
       There are a handful of “top-level” files and directories which belong
       to the entire index rather than to a particular segment.

       snapshot_XXX.json

       A “snapshot” file, e.g. "snapshot_m7p.json", is a list of index files
       and directories.  Because index files, once written, are never
       modified, the list of entries in a snapshot defines a point-in-time
       view of the data in an index.

       Like segment directories, snapshot files also utilize the
       unique-base-36-number naming convention; the higher the number, the
       more recent the file.  The appearance of a new snapshot file within the
       index directory constitutes an index update.  While a new segment is
       being written new files may be added to the index directory, but until
       a new snapshot file gets written, a Searcher opening the index for
       reading won’t know about them.
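
       The selection of the most recent snapshot can be sketched in Python as
       follows.  This is only an illustration of the naming rule described
       above: the function name is hypothetical, and the sketch looks at file
       names only, without reading the snapshot contents.

```python
import re

# snapshot_<base36>.json, with lowercase base-36 digits assumed.
SNAPSHOT_NAME = re.compile(r"^snapshot_([0-9a-z]+)\.json$")


def latest_snapshot(entries):
    """Return the snapshot file name with the highest base-36 number.

    `entries` is any iterable of names found in the index directory.
    Returns None when no snapshot file is present.
    """
    best_name = None
    best_number = -1
    for name in entries:
        match = SNAPSHOT_NAME.match(name)
        if match is None:
            continue  # not a snapshot file
        number = int(match.group(1), 36)
        if number > best_number:
            best_name, best_number = name, number
    return best_name
```

       In practice the entries could come from os.listdir() on the index
       directory; files belonging to a segment that is still being written
       are simply not listed in any snapshot yet.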

       schema_XXX.json

       The schema file is a Schema object describing the index’s format,
       serialized as JSON.  It, too, is versioned, and a given snapshot file
       will reference one and only one schema file.

       locks

       By default, only one indexing process may safely modify the index at
       any given time.  Processes reserve an index by laying claim to the
       "write.lock" file within the "locks/" directory.  A smattering of other
       lock files may be used from time to time, as well.

   A segment’s component parts
       By default, each segment has up to five logical components: lexicon,
       postings, document storage, highlight data, and deletions.  Binary data
       from these components gets stored in virtual files within the “cf.dat”
       compound file; metadata is stored in a shared “segmeta.json” file.

       segmeta.json

       The segmeta.json file is a central repository for segment metadata.  In
       addition to information such as document counts and field numbers, it
       also warehouses arbitrary metadata on behalf of individual index
       components.

       Lexicon

       Each indexed field gets its own lexicon in each segment.  The exact
       files involved depend on the field’s type, but generally speaking there
       will be two parts.  First, there’s a primary "lexicon-XXX.dat" file
       which houses a complete term list associating terms with corpus
       frequency statistics, postings file locations, etc.  Second, one or
       more “lexicon index” files may be present which contain periodic
       samples from the primary lexicon file to facilitate fast lookups.
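
       The idea of periodic samples can be illustrated with a toy in-memory
       model in Python.  This is a sketch of the general technique only, not
       of Lucy's on-disk lexicon format: the sorted term list stands in for
       the primary lexicon file, and the function names and the sampling
       interval are assumptions made for the example.

```python
from bisect import bisect_right


def build_samples(terms, interval):
    """Keep every `interval`-th term together with its position.

    `terms` must be sorted; the result plays the role of a lexicon index.
    """
    return [(terms[i], i) for i in range(0, len(terms), interval)]


def lookup(terms, samples, target):
    """Return the position of `target` in `terms`, or None if absent."""
    keys = [term for term, _ in samples]
    # Find the last sample that sorts at or before the target.
    slot = bisect_right(keys, target) - 1
    if slot < 0:
        return None  # target sorts before the first term
    start = samples[slot][1]
    # Scan forward from the sampled position in the full term list.
    for pos in range(start, len(terms)):
        if terms[pos] == target:
            return pos
        if terms[pos] > target:
            break  # passed the place where target would be
    return None
```

       The samples narrow the search to a short stretch of the full term
       list, so only a few entries have to be scanned.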

       Postings

       “Posting” is a technical term from the field of information retrieval,
       defined as a single instance of one term indexing one document.  If
       you are looking at the index in the back of a book, and you see that
       “freedom” is referenced on pages 8, 86, and 240, that would be three
       postings, which taken together form a “posting list”.  The same
       terminology applies to an index in electronic form.

       Each segment has one postings file per indexed field.  When a search is
       performed for a single term, first that term is looked up in the
       lexicon.  If the term exists in the segment, the record in the lexicon
       will contain information about which postings file to look at and where
       to look.

       The first thing any posting record tells you is a document id.  By
       iterating over all the postings associated with a term, you can find
       all the documents that match that term, a process which is analogous to
       looking up page numbers in a book’s index.  However, each posting
       record typically contains other information in addition to document id,
       e.g. the positions at which the term occurs within the field.
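
       To make the role of document ids and positions concrete, here is a
       small Python sketch using an in-memory dictionary in place of a
       postings file.  The data layout (term mapped to a list of document id
       and position-list pairs) and the function names are assumptions made
       for illustration; they do not describe Lucy's binary encoding.

```python
def matching_docs(postings, term):
    """Return the document ids in the posting list for `term`."""
    return [doc_id for doc_id, _positions in postings.get(term, [])]


def phrase_docs(postings, words):
    """Return document ids where `words` occur at consecutive positions."""
    if not words:
        return []
    # Candidate start positions per document, taken from the first word.
    candidates = {doc: set(pos) for doc, pos in postings.get(words[0], [])}
    for offset, word in enumerate(words[1:], start=1):
        word_pos = {doc: set(pos) for doc, pos in postings.get(word, [])}
        # Keep a start position only if this word sits `offset` slots later.
        candidates = {
            doc: {p for p in starts if p + offset in word_pos.get(doc, ())}
            for doc, starts in candidates.items()
        }
    return sorted(doc for doc, starts in candidates.items() if starts)
```

       Matching a single term needs only the document ids, while matching a
       phrase also needs the positions stored with each posting.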

       Documents

       The document storage section is a simple database, organized into two
       files:

       •   documents.dat - Serialized documents.

       •   documents.ix - Document storage index, a solid array of 64-bit
           integers where each integer location corresponds to a document id,
           and the value at that location points at a file position in the
           documents.dat file.
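
       Looking up a file position in such an array amounts to simple offset
       arithmetic, as the following Python sketch shows.  The byte order
       (big-endian) and the signedness of the integers are assumptions of
       this sketch, not something stated above, and the function name is
       hypothetical.

```python
import struct

ENTRY_SIZE = 8  # each slot is one 64-bit integer


def doc_offset(ix_bytes, doc_id):
    """Return the file position stored in slot `doc_id` of an index array.

    `ix_bytes` holds the raw contents of the index; big-endian signed
    64-bit integers are assumed.
    """
    start = doc_id * ENTRY_SIZE
    if doc_id < 0 or start + ENTRY_SIZE > len(ix_bytes):
        raise IndexError("document id %d is out of range" % doc_id)
    return struct.unpack_from(">q", ix_bytes, start)[0]
```

       Because every slot has the same size, finding the entry for a given
       document id requires no searching at all.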

       Highlight data

       The files which store data used for excerpting and highlighting are
       organized similarly to the files used to store documents.

       •   highlight.dat - Chunks of serialized highlight data, one per doc
           id.

       •   highlight.ix - Highlight data index – as with the "documents.ix"
           file, a solid array of 64-bit file pointers.

       Deletions

       When a document is “deleted” from a segment, it is not actually purged
       right away; it is merely marked as “deleted” via a deletions file.
       Deletions files contain bit vectors with one bit for each document in
       the segment; if bit #254 is set then document 254 is deleted, and if
       that document turns up in a search it will be masked out.

       It is only when a segment’s contents are rewritten to a new segment
       during the segment-merging process that deleted documents truly go
       away.
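
       The masking step can be sketched in Python with a bytearray standing
       in for the bit vector.  The bit order within each byte (least
       significant bit first) is an assumption of this sketch, and the
       function names are hypothetical.

```python
def mark_deleted(bits, doc_id):
    """Set the bit for `doc_id` in the bytearray `bits`."""
    byte_index, bit_index = divmod(doc_id, 8)
    bits[byte_index] |= 1 << bit_index


def is_deleted(bits, doc_id):
    """Return True if the bit for `doc_id` is set."""
    byte_index, bit_index = divmod(doc_id, 8)
    if byte_index >= len(bits):
        return False  # beyond the vector: treated as not deleted
    return bool(bits[byte_index] & (1 << bit_index))


def mask_hits(bits, doc_ids):
    """Drop deleted documents from a list of matching document ids."""
    return [doc_id for doc_id in doc_ids if not is_deleted(bits, doc_id)]
```

       Marking a document only flips one bit; the stored data for that
       document is left in place until the segment is merged.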

   Compound Files
       If you peer inside an index directory, you won’t actually find any
       files named “documents.dat”, “highlight.ix”, etc. unless there is an
       indexing process underway.  What you will find instead is one “cf.dat”
       and one “cfmeta.json” file per segment.

       To minimize the need for file descriptors at search-time, all per-
       segment binary data files are concatenated together in “cf.dat” at the
       close of each indexing session.  Information about where each file
       begins and ends is stored in "cfmeta.json".  When the segment is opened
       for reading, a single file descriptor per “cf.dat” file can be shared
       among several readers.
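
       The concatenate-and-record-offsets idea can be sketched in Python as
       follows.  The JSON layout used here (a "files" map holding an "offset"
       and a "length" for each virtual file) is an assumption made for
       illustration; inspect a real cfmeta.json for the exact keys.  The
       function names are hypothetical.

```python
import json


def build_compound(files):
    """Concatenate virtual files and describe where each one lives.

    `files` maps a virtual file name to its bytes.  Returns a tuple of
    (compound bytes, metadata as JSON text).
    """
    data = bytearray()
    layout = {}
    for name, content in files.items():
        layout[name] = {"offset": len(data), "length": len(content)}
        data.extend(content)
    return bytes(data), json.dumps({"files": layout})


def read_virtual_file(cf_data, cfmeta_text, name):
    """Slice one virtual file out of the compound data."""
    entry = json.loads(cfmeta_text)["files"][name]
    start = entry["offset"]
    return cf_data[start:start + entry["length"]]
```

       A reader that holds the compound data and the metadata can serve any
       virtual file by slicing, which is why one file descriptor is enough.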

   A Typical Search
       Here’s a simplified narrative, dramatizing how a search for “freedom”
       against a given segment plays out:

       •   The searcher asks the relevant Lexicon Index, “Do you know anything
           about ‘freedom’?”  Lexicon Index replies, “Can’t say for sure, but
           if the main Lexicon file does, ‘freedom’ is probably somewhere
           around byte 21008”.

       •   The main Lexicon tells the searcher “One moment, let me scan our
           records… Yes, we have 2 documents which contain ‘freedom’.  You’ll
           find them in seg_6/postings-4.dat starting at byte 66991.”

       •   The Postings file says “Yep, we have ‘freedom’, all right!
           Document id 40 has 1 ‘freedom’, and document 44 has 8.  If you need
           to know more, like if any ‘freedom’ is part of the phrase ‘freedom
           of speech’, ask me about positions!”

       •   If the searcher is only looking for ‘freedom’ in isolation, that’s
           where it stops.  It now knows enough to assign the documents scores
           against “freedom”, with the 8-freedom document likely ranking
           higher than the single-freedom document.
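
       The last step of the narrative can be sketched in Python with the
       numbers from the example.  Raw term frequency stands in for a scoring
       function here purely for illustration; it is an assumption of this
       sketch, and real scoring models usually weigh additional factors.
       The function name is hypothetical.

```python
def rank_by_frequency(postings):
    """Return document ids ordered by descending term frequency.

    `postings` is a list of (doc_id, frequency) pairs for a single term.
    Ties are broken by ascending document id.
    """
    ordered = sorted(postings, key=lambda pair: (-pair[1], pair[0]))
    return [doc_id for doc_id, _freq in ordered]


# Posting list for "freedom" as described in the narrative above.
freedom_postings = [(40, 1), (44, 8)]
ranking = rank_by_frequency(freedom_postings)
```

       With these inputs, document 44 is placed ahead of document 40.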

perl v5.38.0                      2023-07-20       Lucy::Docs::FileFormat(3pm)