1 KinoSearch1::Docs::FileFormat(3) User Contributed Perl Documentation KinoSearch1::Docs::FileFormat(3)
2
3
4
6 KinoSearch1::Docs::FileFormat - overview of invindex file format
7
9 It is not necessary to understand the guts of the Lucene-derived
10 "invindex" file format in order to use KinoSearch1, but it may be
11 helpful if you are interested in tweaking for high performance, exotic
12 usage, or debugging and development.
13
14 On a file system, all the files in an invindex exist in one, flat
15 directory. Conceptually, the files have a hierarchical relationship:
16 an invindex is made up of "segments", each of which is an independent
17 inverted index, and each segment is made up of several subsections.
18
19 [invindex]--|
20             |-"segments" file
21             |
22             |-[segments]------|
23                               |--[seg _0]--|
24                               |            |--[postings]
25                               |            |--[stored fields]
26                               |            |--[deletions]
27                               |
28                               |--[seg _1]--|
29                               |            |--[postings]
30                               |            |--[stored fields]
31                               |            |--[deletions]
32                               |
33                               |--[ ... ]---|
34
35 The "segments" file keeps a list of the segments that make up an
36 invindex. When a new segment is being written, KinoSearch1 may put
37 files into the directory, but until the segments file is updated, a
38 Searcher reading the index won't know about them.
39
40 Each segment is an independent inverted index. All the files which
41 belong to a given segment share a common prefix which consists of an
42 underscore followed by 1 or more decimal digits: _0, _67, _1058. A
43 fully optimized index has only a single segment.
44
45 In theory there are many files which make up each segment. However,
46 when you look inside an invindex not in the process of being updated,
47 you'll probably see only the segments file and files with either a .cfs
48 or .del extension. The .cfs file, a "compound" file which is
49 consolidated when a segment is finalized, "contains" all the other per-
50 segment files.
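
The naming scheme described above can be illustrated with a short Python sketch that groups a directory listing by segment prefix. The listing and the helper name group_by_segment are made up for this example; only the "underscore plus decimal digits" prefix rule and the .cfs/.del extensions come from the text.

```python
import re
from collections import defaultdict

# Per the text: an underscore followed by one or more decimal digits.
SEGMENT_PREFIX = re.compile(r"^(_\d+)\.")

def group_by_segment(filenames):
    """Map each segment prefix to the sorted list of files that carry it."""
    groups = defaultdict(list)
    for name in filenames:
        match = SEGMENT_PREFIX.match(name)
        if match:  # the "segments" file itself carries no segment prefix
            groups[match.group(1)].append(name)
    return {seg: sorted(files) for seg, files in groups.items()}

# Hypothetical listing of an invindex that is not being updated.
listing = ["segments", "_0.cfs", "_0.del", "_67.cfs", "_1058.cfs"]
print(group_by_segment(listing))
```

The "segments" file is skipped on purpose: it describes the invindex as a whole and belongs to no single segment.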
51
52 Segments are written once, and with the exception of the deletions
53 file, are never modified once written. They are deleted when their
54 data is written to new segments during the process of optimization.
55
57 Each segment can be said to have four logical parts: postings, stored
58 fields, the deletions file, and the term vectors data.
59
60 Stored fields
61 The stored fields are organized into two files.
62
63 • [seg_name].fdx - Field inDeX - pointers to field data
64
65 • [seg_name].fdt - Field DaTa - the actual stored fields
66
67 When a document turns up as a hit in a search and must be retrieved,
68 KinoSearch1 looks at the Field inDeX file to see where in the data file
69 the document's stored fields start, then retrieves all of them from the
70 .fdt file in one lump.
71
72 _1.fdx--|
73         |--[doc#0 => 0]------->_1.fdt--|
74         |                              |--[bodytext]
75         |                              |--[title]
76         |                              |--[url]
77         |--[doc#1 => 305]----->_1.fdt--| # byte 305
78         |                              |--[bodytext]
79         |                              |--[title]
80         |                              |--[url]
81         |--[...]-------------->_1.fdt--|--[...]
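
The two-file lookup can be modeled in a few lines of Python. This is only a sketch: it assumes the index file is a flat array of fixed-width 8-byte big-endian offsets (the layout Lucene uses for its .fdx file), which may not match KinoSearch1's actual on-disk encoding. The function name doc_span and the toy data are invented for the example.

```python
import struct

POINTER_SIZE = 8  # assumption: one 8-byte big-endian offset per document

def doc_span(fdx: bytes, fdt_len: int, doc_num: int):
    """Return (start, end) byte offsets of a document's record in the data file."""
    count = len(fdx) // POINTER_SIZE
    if not 0 <= doc_num < count:
        raise IndexError(f"no such document: {doc_num}")
    (start,) = struct.unpack_from(">Q", fdx, doc_num * POINTER_SIZE)
    if doc_num + 1 < count:
        # A record ends where the next document's record begins.
        (end,) = struct.unpack_from(">Q", fdx, (doc_num + 1) * POINTER_SIZE)
    else:
        end = fdt_len  # last document: record runs to the end of the file
    return start, end

# Toy stand-ins: doc#0 occupies bytes 0..304, doc#1 starts at byte 305.
fdt = b"A" * 305 + b"B" * 120
fdx = struct.pack(">QQ", 0, 305)

start, end = doc_span(fdx, len(fdt), 1)
record = fdt[start:end]  # all of doc#1's stored fields in one lump
```

One seek into the pointer file plus one contiguous read from the data file is enough to fetch a hit, which is the point of splitting the data this way.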
82
83 If a field is marked as "vectorized", its "term vectors" are also
84 stored in the .fdt file.
85
86 Postings
87 "Posting" is a technical term from the field of Information Retrieval
88 which refers to a single instance of one term indexing one document.
89 If you are looking at the index in the back of a book, and you see that
90 "freedom" is referenced on pages 8, 86, and 240, that would be three
91 postings, which taken together form a "posting list". The same
92 terminology applies to an index in electronic form.
93
94 The postings data is spread out over 4 main files (not including field
95 normalization data, which we'll get to in a moment). From lowest to
96 highest in the hierarchy, they are...
97
98 [seg_name].prx - PRoXimity data. A list of the positions at which terms
99 appear in any given document. The .prx file is just a raw stream of
100 VInts; the document numbers and terms are implicitly indicated by files
101 higher up the hierarchy.
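
For readers unfamiliar with VInts, here is a minimal Python sketch of the variable-length integer encoding used by Lucene: seven payload bits per byte, low-order group first, with the high bit set on every byte except the last. That KinoSearch1 uses exactly this byte layout is an assumption based on its Lucene heritage; the function names are invented for the example.

```python
def write_vint(value: int) -> bytes:
    """Encode a non-negative integer as a Lucene-style VInt."""
    if value < 0:
        raise ValueError("VInts encode non-negative integers only")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # more bytes follow
        value >>= 7
    out.append(value)                      # final byte, high bit clear
    return bytes(out)

def read_vint(buf: bytes, pos: int = 0):
    """Decode one VInt starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7

# A raw stream of positions, with no delimiters between values.
stream = b"".join(write_vint(n) for n in (54, 107, 747))
```

Small values cost a single byte, which is why a raw stream of them is compact.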
102
103 [seg_name].frq - FReQuency data for terms. If a term has a frequency
104 of 5 in a given document, that implies that there will be 5 entries in
105 the .prx file. The terms themselves are implicitly specified by the
106 .tis file.
107
108 _1.frq--|
109         |--[doc#40 => 2]----->_1.prx--|--[54,107]
110         |--[doc#0  => 1]----->_1.prx--|--[6]
111         |--[doc#6  => 1]----->_1.prx--|--[504]
112         |--[doc#36 => 3]----->_1.prx--|--[2,33,747]
113         |--[...]------------->_1.prx--|--[...]
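
The relationship between the two files can be sketched in Python as follows: each frequency entry says how many values to consume from the flat position stream. The data is taken from the diagram; both files are modeled as plain Python lists rather than decoded from bytes, and positions_by_doc is an invented name.

```python
from itertools import islice

def positions_by_doc(frq_entries, prx_stream):
    """Pair each (doc_num, freq) entry with its run of positions."""
    positions = iter(prx_stream)
    result = {}
    for doc_num, freq in frq_entries:
        # A frequency of N means the next N position values belong to this doc.
        result[doc_num] = list(islice(positions, freq))
    return result

frq = [(40, 2), (0, 1), (6, 1), (36, 3)]   # (document number, frequency)
prx = [54, 107, 6, 504, 2, 33, 747]        # undelimited positions

print(positions_by_doc(frq, prx))
```

Because the position stream carries no document numbers of its own, it can only be interpreted while walking the frequency data in step.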
114
115 [seg_name].tis - TermInfoS. Among the items stored here is the term's
116 doc_freq, which is the number of documents the term appears in. If a
117 term has a doc_freq of 22 in a given collection, that implies that
118 there will be 22 corresponding entries in the .frq file. Terms are
119 ordered lexically, first by field, then by term text.
120
121 _1.tis--|
122         |--[...]---------------------->_1.frq--|--[...]
123         |--[bodytext:mule      => 1]-->_1.frq--|--[doc#40 => 2]
124         |--[bodytext:multitude => 3]-->_1.frq--|--[doc#0  => 1]
125         |                                      |--[doc#6  => 1]
126         |                                      |--[doc#36 => 3]
127         |--[bodytext:navigate  => 1]-->_1.frq--|--[doc#21 => 1]
128         |--[...]---------------------->_1.frq--|--[...]
129         |--[title:amendment    => 2]-->_1.frq--|--[doc#21 => 1]
130         |                                      |--[doc#22 => 1]
131         |--[...]---------------------->_1.frq--|--[...]
132
133 [seg_name].tii - TermInfos Index. This file, which is decompressed and
134 loaded into RAM as soon as the IndexReader is initialized, contains a
135 small subset of the .tis data, with pointers to locations in the .tis
136 file. It is used to locate the right general vicinity in the .tis file
137 as quickly as possible.
138
139 _1.tii--|
140         |--[bodytext:a => 20]---------->_1.tis--|--[bodytext:a] # byte 20
141         |                                       |--[bodytext:about]
142         |                                       |--[bodytext:absolute]
143         |                                       |--[...]
144         |--[bodytext:mule => 27065]---->_1.tis--|--[bodytext:mule]
145         |                                       |--[bodytext:multitude]
146         |                                       |--[...]
147         |--[title:amendment => 56992]-->_1.tis--|--[title:amendment]
148         |--[...]
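
One way to picture the "general vicinity" lookup is a binary search over the in-memory sample, sketched below in Python. The entries come from the diagram; modeling terms as (field, text) tuples compared by Python's default string ordering is an assumption standing in for whatever comparison KinoSearch1 really performs, and tis_start_offset is an invented name.

```python
from bisect import bisect_right

# In-memory sample: ((field, term text), byte offset into the .tis file).
# Sorted first by field, then by term text.
TII = [
    (("bodytext", "a"), 20),
    (("bodytext", "mule"), 27065),
    (("title", "amendment"), 56992),
]
KEYS = [key for key, _ in TII]

def tis_start_offset(field: str, text: str):
    """Return the .tis offset at which a scan for (field, text) should start."""
    i = bisect_right(KEYS, (field, text)) - 1
    if i < 0:
        return None  # sorts before every sampled term, so it cannot be present
    return TII[i][1]

print(tis_start_offset("bodytext", "multitude"))  # 27065
```

The search never answers "is the term present?". It only picks the last sampled term that sorts at or before the target, so the subsequent scan of the .tis file is short.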
149
150 Here's a simplified version of how a search for "freedom" against a
151 given segment plays out:
152
153 1. The searcher asks the .tii file, "Do you know anything about
154 'freedom'?" The .tii file replies, "Can't say for sure, but if the
155 .tis file does, 'freedom' is probably somewhere around byte 21008".
156
157 2. The .tis file tells the searcher "Yes, we have 2 documents which
158 contain 'freedom'. You'll find them in the .frq file starting at
159 byte 66991."
160
161 3. The .frq file says "document number 40 has 1 'freedom', and
162 document 44 has 8. If you need to know more, like if any 'freedom'
163 is part of the phrase 'freedom of speech', take a look at the .prx
164 file starting at..."
165
166 4. If the searcher is only looking for 'freedom' in isolation, that's
167 where it stops. It already knows enough to assign the documents
168 scores against "freedom", with the 8-freedom document scoring
169 higher than the single-freedom document.
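
Steps 2 through 4 can be mimicked with a toy in-memory model in Python. The offsets and counts are the ones from the walkthrough; the field name "bodytext", the dictionary layout, and ranking by raw frequency alone are simplifying assumptions for this sketch and ignore everything else that goes into real scoring.

```python
# Toy .tis: (field, term text) -> (doc_freq, byte offset into .frq)
TIS = {("bodytext", "freedom"): (2, 66991)}

# Toy .frq: byte offset -> list of (document number, frequency in document)
FRQ = {66991: [(40, 1), (44, 8)]}

def search(field: str, text: str):
    """Return (doc_num, freq) hits for one term, highest frequency first."""
    info = TIS.get((field, text))
    if info is None:
        return []                          # the term is not in this segment
    doc_freq, frq_offset = info
    entries = FRQ[frq_offset][:doc_freq]   # doc_freq entries belong to this term
    return sorted(entries, key=lambda entry: entry[1], reverse=True)

print(search("bodytext", "freedom"))  # [(44, 8), (40, 1)]
```

A single-term query never touches the position data, which is why it is cheaper than a phrase query.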
170
171 Deletions
172 When a document is "deleted" from a segment, it is not actually purged
173 from the postings data and the stored fields data right away; it is
174 merely marked as "deleted", via the .del file. The .del file contains
175 a bit vector with one bit for each document in the segment; if bit #254
176 is set then document 254 is deleted, and if it turns up in a search it
177 will be masked out.
178
179 It is only when a segment's contents are rewritten to a new segment
180 during the segment-merging process that deleted documents truly go
181 away.
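
The masking step can be sketched with a plain byte array in Python. The bit order within each byte (least significant bit first) is an assumption made for the example, not a statement about the real .del layout, and the function names are invented.

```python
def mark_deleted(bits: bytearray, doc_num: int) -> None:
    """Set the bit for doc_num (assumes LSB-first order within each byte)."""
    byte_index, bit_index = divmod(doc_num, 8)
    bits[byte_index] |= 1 << bit_index

def is_deleted(bits: bytes, doc_num: int) -> bool:
    """Report whether the bit for doc_num is set."""
    byte_index, bit_index = divmod(doc_num, 8)
    if byte_index >= len(bits):
        return False
    return bool(bits[byte_index] & (1 << bit_index))

bits = bytearray(32)        # room for 256 documents, none deleted yet
mark_deleted(bits, 254)

hits = [253, 254, 255]      # raw hits from the postings data
visible = [doc for doc in hits if not is_deleted(bits, doc)]
print(visible)              # [253, 255]
```

Deleting a document therefore costs one bit flip; the postings and stored fields stay where they are until the segment is merged away.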
182
183 Field Normalization Files
184 For the sake of simplicity, the example search scenario above omits the
185 role played by the field normalization files, or "fieldnorms" for short.
186 These files have the (theoretical) suffix of .f followed by an integer
187 -- .f0, .f1, etc. Each segment contains one such file for every
188 indexed field.
189
190 By default, the fieldnorms' job is to make sure that a field which is
191 100 terms long and contains 10 mentions of the word 'freedom' scores
192 higher than a field which also contains 10 mentions of the word
193 'freedom', but is 1000 terms in length. The idea is that the higher
194 the density of the desired term, the more relevant the document.
195
196 The fieldnorms files contain one byte per document per indexed field,
197 and all of them must be loaded into RAM before a search can be
198 executed.
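
The effect of length normalization, and the RAM cost of the fieldnorms, can both be worked through in Python. The factor 1/sqrt(number of terms) is the classic Lucene default and is used here as an assumption; the text above does not state KinoSearch1's exact formula, and the real files quantize each norm into a single byte. All function names are invented.

```python
import math

def length_norm(num_terms: int) -> float:
    """Assumed normalization factor: shorter fields get a larger weight."""
    return 1.0 / math.sqrt(num_terms)

def toy_score(term_freq: int, num_terms: int) -> float:
    """Illustrative score: term frequency scaled by the length norm."""
    return term_freq * length_norm(num_terms)

def fieldnorms_ram_bytes(num_docs: int, num_indexed_fields: int) -> int:
    """One byte per document per indexed field, all held in RAM."""
    return num_docs * num_indexed_fields

short_field = toy_score(10, 100)    # 10 mentions in 100 terms
long_field = toy_score(10, 1000)    # 10 mentions in 1000 terms
print(short_field > long_field)     # True

print(fieldnorms_ram_bytes(1_000_000, 3))  # 3000000
```

Under these assumptions, a million-document segment with three indexed fields needs about 3 MB of fieldnorms in memory before the first search can run.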
199
201 Document numbers are ephemeral. They change every time a document
202 gets moved from one segment to a new one during optimization. If you
203 need to assign a primary key to each document, you need to create a
204 field and populate it with an externally generated unique identifier.
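
A minimal Python sketch of the idea follows. It attaches an externally generated identifier to each document before it is handed to the indexer. The field name "doc_id", the dictionary representation of a document, and the use of a random UUID are all choices made for this example; any stable identifier from your own system would serve.

```python
import uuid

def with_primary_key(doc: dict, key_field: str = "doc_id") -> dict:
    """Return a copy of doc carrying a stable, externally generated identifier."""
    keyed = dict(doc)
    # Keep an identifier the caller already supplied; otherwise mint a new one.
    keyed.setdefault(key_field, uuid.uuid4().hex)
    return keyed

doc = with_primary_key({"title": "Amendment I", "bodytext": "..."})
print(doc["doc_id"])
```

Unlike a document number, this value travels with the document's stored fields and stays the same across segment merges.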
205
207 The file format used by KinoSearch1 is closely related to the Lucene
208 compound index format. (The technical specification for Lucene's file
209 format is distributed along with Lucene.) However, indexes generated
210 by Lucene and KinoSearch1 are not compatible.
211
213 Copyright 2005-2010 Marvin Humphrey
214
216 See KinoSearch1 version 1.01.
217
218
219
220perl v5.36.0 2023-01-20 KinoSearch1::Docs::FileFormat(3)