KinoSearch1::Docs::FileFormat(3pm)

KinoSearch1::Docs::FileFormat(3)  User Contributed Perl Documentation  KinoSearch1::Docs::FileFormat(3)

NAME

       KinoSearch1::Docs::FileFormat - overview of invindex file format

OVERVIEW

       It is not necessary to understand the guts of the Lucene-derived
       "invindex" file format in order to use KinoSearch1, but it may be
       helpful if you are interested in tweaking for high performance, exotic
       usage, or debugging and development.

       On a file system, all the files in an invindex exist in one flat
       directory.  Conceptually, the files have a hierarchical relationship:
       an invindex is made up of "segments", each of which is an independent
       inverted index, and each segment is made up of several subsections.

           [invindex]--|
                       |-"segments" file
                       |
                       |-[segments]------|
                                         |--[seg _0]--|
                                         |            |--[postings]
                                         |            |--[stored fields]
                                         |            |--[deletions]
                                         |
                                         |--[seg _1]--|
                                         |            |--[postings]
                                         |            |--[stored fields]
                                         |            |--[deletions]
                                         |
                                         |--[ ... ]---|

       The "segments" file keeps a list of the segments that make up an
       invindex.  When a new segment is being written, KinoSearch1 may put
       files into the directory, but until the segments file is updated, a
       Searcher reading the index won't know about them.

       Each segment is an independent inverted index.  All the files which
       belong to a given segment share a common prefix which consists of an
       underscore followed by 1 or more decimal digits: _0, _67, _1058.  A
       fully optimized index has only a single segment.

       In theory there are many files which make up each segment.  However,
       when you look inside an invindex not in the process of being updated,
       you'll probably see only the segments file and files with either a .cfs
       or .del extension.  The .cfs file, a "compound" file which is
       consolidated when a segment is finalized, "contains" all the other
       per-segment files.

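       The naming rule described above is enough to sort a directory listing
       into segments.  What follows is a minimal Python sketch, not part of
       KinoSearch1: the function name group_by_segment and the sample listing
       are made up for illustration, and it assumes every per-segment file is
       named as prefix, dot, extension.

```python
import re
from collections import defaultdict

# Assumption taken from the text: a segment prefix is an underscore
# followed by one or more decimal digits, e.g. _0, _67, _1058.
SEGMENT_FILE = re.compile(r"^(_[0-9]+)\.(\w+)$")


def group_by_segment(filenames):
    """Map each segment prefix to the sorted extensions found for it."""
    groups = defaultdict(list)
    for name in filenames:
        match = SEGMENT_FILE.match(name)
        if match:  # the "segments" file has no prefix, so it is skipped
            groups[match.group(1)].append(match.group(2))
    return {seg: sorted(exts) for seg, exts in groups.items()}


# Hypothetical listing of a quiescent invindex directory.
listing = ["segments", "_0.cfs", "_0.del", "_67.cfs"]
print(group_by_segment(listing))
```

       A listing that yields a single prefix would correspond to a fully
       optimized index.
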
       Segments are written once, and with the exception of the deletions
       file, are never modified once written.  They are deleted when their
       data is written to new segments during the process of optimization.

A segment's component parts

       Each segment can be said to have four logical parts: postings, stored
       fields, the deletions file, and the term vectors data.

   Stored fields
       The stored fields are organized into two files.

       •   [seg_name].fdx - Field inDeX - pointers to field data

       •   [seg_name].fdt - Field DaTa - the actual stored fields

       When a document turns up as a hit in a search and must be retrieved,
       KinoSearch1 looks at the Field inDeX file to see where in the data file
       the document's stored fields start, then retrieves all of them from the
       .fdt file in one lump.

           _1.fdx--|
                   |--[doc#0  =>   0]----->_1.fdt--|
                   |                               |--[bodytext]
                   |                               |--[title]
                   |                               |--[url]
                   |--[doc#1  => 305]----->_1.fdt--|             # byte 305
                   |                               |--[bodytext]
                   |                               |--[title]
                   |                               |--[url]
                   |--[...]--------------->_1.fdt--|--[...]

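       The index-then-data lookup can be modelled in a few lines of Python.
       This is a toy model only: the real byte layout of the .fdx and .fdt
       files is not described here, so the sketch simply keeps one start
       offset per document in a list and all field data in one byte string.
       The names build and fetch are hypothetical.

```python
# Toy model: fdx holds one start offset per document, fdt holds every
# document's stored fields concatenated.  Not the on-disk encoding.

def build(docs):
    """Return (fdx, fdt) for a list of byte strings, one per document."""
    fdx, fdt = [], bytearray()
    for blob in docs:
        fdx.append(len(fdt))  # where this document's data starts
        fdt.extend(blob)
    return fdx, bytes(fdt)


def fetch(fdx, fdt, doc_num):
    """Read all of one document's stored fields in a single slice."""
    start = fdx[doc_num]
    # The next document's offset (or end of data) marks where to stop.
    end = fdx[doc_num + 1] if doc_num + 1 < len(fdx) else len(fdt)
    return fdt[start:end]


fdx, fdt = build([b"first doc fields", b"second doc fields"])
print(fdx)                 # [0, 16]
print(fetch(fdx, fdt, 1))  # b'second doc fields'
```

       Because the offsets are fixed-width and ordered by document number,
       finding a document's data costs one index read and one data read.
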
       If a field is marked as "vectorized", its "term vectors" are also
       stored in the .fdt file.

   Postings
       "Posting" is a technical term from the field of Information Retrieval
       which refers to a single instance of one term indexing one document.
       If you are looking at the index in the back of a book, and you see that
       "freedom" is referenced on pages 8, 86, and 240, that would be three
       postings, which taken together form a "posting list".  The same
       terminology applies to an index in electronic form.

       The postings data is spread out over 4 main files (not including field
       normalization data, which we'll get to in a moment).  From lowest to
       highest in the hierarchy, they are...

       [seg_name].prx - PRoXimity data. A list of the positions at which terms
       appear in any given document.  The .prx file is just a raw stream of
       VInts; the document numbers and terms are implicitly indicated by files
       higher up the hierarchy.

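       For readers who have not met VInts before, here is a minimal Python
       sketch of a variable-length integer codec.  It assumes the Lucene
       convention (seven data bits per byte, low-order group first, high bit
       set on every byte except the last); whether KinoSearch1 matches this
       bit for bit is an assumption, and the function names are made up.

```python
def encode_vint(n):
    """Encode a non-negative int, 7 bits per byte, low-order group first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)                      # final byte has the high bit clear
    return bytes(out)


def decode_vints(data):
    """Decode a raw stream of back-to-back VInts into a list of ints."""
    values, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                 # keep accumulating this integer
        else:
            values.append(value)       # high bit clear: integer complete
            value, shift = 0, 0
    return values


stream = b"".join(encode_vint(n) for n in [54, 107, 6, 504])
print(decode_vints(stream))  # [54, 107, 6, 504]
```

       Nothing in such a stream says which term or document a number belongs
       to, which is why the files higher up the hierarchy are needed to make
       sense of it.
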
       [seg_name].frq - FReQuency data for terms.  If a term has a frequency
       of 5 in a given document, that implies that there will be 5 entries in
       the .prx file.  The terms themselves are implicitly specified by the
       .tis file.

           _1.frq--|
                   |--[doc#40 => 2]----->_1.prx--|--[54,107]
                   |--[doc#0  => 1]----->_1.prx--|--[6]
                   |--[doc#6  => 1]----->_1.prx--|--[504]
                   |--[doc#36 => 3]----->_1.prx--|--[2,33,747]
                   |--[...]------------->_1.prx--|--[...]

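       The way a frequency implies a run of position entries can be sketched
       in Python.  This is a simplified in-memory model using the numbers
       from the diagram above; the function name positions_by_doc is
       hypothetical and plain lists stand in for the files.

```python
def positions_by_doc(frq, prx):
    """Slice a flat position stream using per-document frequencies.

    frq: list of (doc_num, freq) pairs, in file order
    prx: flat list of positions, in the same order
    """
    result, cursor = {}, 0
    for doc_num, freq in frq:
        # A frequency of N means the next N entries belong to this document.
        result[doc_num] = prx[cursor:cursor + freq]
        cursor += freq
    return result


frq = [(40, 2), (0, 1), (6, 1), (36, 3)]
prx = [54, 107, 6, 504, 2, 33, 747]
print(positions_by_doc(frq, prx))
# {40: [54, 107], 0: [6], 6: [504], 36: [2, 33, 747]}
```
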
       [seg_name].tis - TermInfoS.  Among the items stored here is the term's
       doc_freq, which is the number of documents the term appears in.  If a
       term has a doc_freq of 22 in a given collection, that implies that
       there will be 22 corresponding entries in the .frq file.  Terms are
       ordered lexically, first by field, then by term text.

           _1.tis--|
                   |--[...]----------------------->_1.frq--|--[...]
                   |--[bodytext:mule      =>  1]-->_1.frq--|--[doc#40 => 2]
                   |--[bodytext:multitude =>  3]-->_1.frq--|--[doc#0  => 1]
                   |                                       |--[doc#6  => 1]
                   |                                       |--[doc#36 => 3]
                   |--[bodytext:navigate  =>  1]-->_1.frq--|--[doc#21 => 1]
                   |--[...]----------------------->_1.frq--|--[...]
                   |--[title:amendment    => 27]-->_1.frq--|--[doc#21 => 1]
                   |                                       |--[doc#22 => 1]
                   |--[...]----------------------->_1.frq--|--[...]

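       The ordering rule and the meaning of doc_freq can be illustrated with
       a small Python sketch.  It is a toy model: the dictionary of postings
       and the function name term_infos are invented for the example, and
       Python's code point ordering stands in for "lexical" order.

```python
def term_infos(postings):
    """Return (field, text, doc_freq) rows, sorted by field then text.

    postings maps (field, text) to the list of documents containing it.
    """
    return [
        (field, text, len(docs))  # doc_freq is the number of documents
        for (field, text), docs in sorted(postings.items())
    ]


# Hypothetical postings, deliberately given out of order.
postings = {
    ("title", "amendment"): [21, 22],
    ("bodytext", "navigate"): [21],
    ("bodytext", "multitude"): [0, 6, 36],
    ("bodytext", "mule"): [40],
}
for row in term_infos(postings):
    print(row)
```

       Sorting on the (field, text) pair puts every bodytext term ahead of
       every title term, and within a field "mule" sorts before "multitude".
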
       [seg_name].tii - TermInfos Index.  This file, which is decompressed and
       loaded into RAM as soon as the IndexReader is initialized, contains a
       small subset of the .tis data, with pointers to locations in the .tis
       file.  It is used to locate the right general vicinity in the .tis file
       as quickly as possible.

           _1.tii--|
                   |--[bodytext:a => 20]---------->_1.tis--|--[bodytext:a] # byte 20
                   |                                       |--[bodytext:about]
                   |                                       |--[bodytext:absolute]
                   |                                       |--[...]
                   |--[bodytext:mule => 27065]---->_1.tis--|--[bodytext:mule]
                   |                                       |--[bodytext:multitude]
                   |                                       |--[...]
                   |--[title:amendment => 56992]-->_1.tis--|--[title:amendment]
                                                           |--[...]

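       Locating "the right general vicinity" amounts to finding the last
       sampled term that is not greater than the one sought.  Here is a
       Python sketch of that idea using the standard bisect module and the
       sample entries from the diagram; the function name seek_offset is
       hypothetical and the in-memory list is only a stand-in for the
       decompressed .tii data.

```python
import bisect

# Sampled ((field, text), byte offset into .tis) pairs, in sorted order.
tii = [
    (("bodytext", "a"), 20),
    (("bodytext", "mule"), 27065),
    (("title", "amendment"), 56992),
]


def seek_offset(tii, term):
    """Return the .tis offset at which to start scanning for term."""
    keys = [entry[0] for entry in tii]
    # Index of the last sampled term that is <= the term we want.
    i = bisect.bisect_right(keys, term) - 1
    return tii[i][1] if i >= 0 else None  # None: sorts before everything


print(seek_offset(tii, ("bodytext", "multitude")))  # 27065
print(seek_offset(tii, ("bodytext", "freedom")))    # 20
```

       From the returned offset, a reader would scan forward through the .tis
       entries until it reaches the term or passes the place where the term
       would have been.
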
       Here's a simplified version of how a search for "freedom" against a
       given segment plays out:

       1.  The searcher asks the .tii file, "Do you know anything about
           'freedom'?"  The .tii file replies, "Can't say for sure, but if the
           .tis file does, 'freedom' is probably somewhere around byte 21008".

       2.  The .tis file tells the searcher "Yes, we have 2 documents which
           contain 'freedom'.  You'll find them in the .frq file starting at
           byte 66991."

       3.  The .frq file says "document number 40 has 1 'freedom', and
           document 44 has 8.  If you need to know more, like if any 'freedom'
           is part of the phrase 'freedom of speech', take a look at the .prx
           file starting at..."

       4.  If the searcher is only looking for 'freedom' in isolation, that's
           where it stops.  It already knows enough to assign the documents
           scores against "freedom", with the 8-freedom document scoring
           higher than the single-freedom document.

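       The four steps above can be strung together in a short Python sketch.
       Everything here is a simplification made for illustration: Python
       lists stand in for the files, list indexes stand in for byte offsets,
       the sample data is invented, and ranking by raw frequency is only a
       placeholder for real scoring.

```python
import bisect

# (term, doc_freq, index of first entry in FRQ), sorted by term.
TIS = [
    ("equality", 1, 0),
    ("freedom", 2, 1),
    ("justice", 1, 3),
    ("liberty", 1, 4),
]
FRQ = [(12, 3), (40, 1), (44, 8), (7, 2), (40, 5)]  # (doc_num, freq)
# Every second TIS entry is sampled into the in-memory index.
TII = [(TIS[i][0], i) for i in range(0, len(TIS), 2)]


def search(term):
    """Return (doc_num, freq) hits for term, highest frequency first."""
    # Step 1: the .tii stand-in gives the vicinity in the .tis stand-in.
    keys = [text for text, _ in TII]
    slot = bisect.bisect_right(keys, term) - 1
    if slot < 0:
        return []
    # Step 2: scan forward from there until the term is found or passed.
    for text, doc_freq, frq_start in TIS[TII[slot][1]:]:
        if text == term:
            # Step 3: doc_freq says how many frequency entries to read.
            hits = FRQ[frq_start:frq_start + doc_freq]
            # Step 4: rank by frequency; positions are never consulted.
            return sorted(hits, key=lambda hit: hit[1], reverse=True)
        if text > term:
            break
    return []


print(search("freedom"))  # [(44, 8), (40, 1)]
```

       A phrase query would need one more hop, into the position data, which
       this sketch leaves out.
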
   Deletions
       When a document is "deleted" from a segment, it is not actually purged
       from the postings data and the stored fields data right away; it is
       merely marked as "deleted", via the .del file.  The .del file contains
       a bit vector with one bit for each document in the segment; if bit #254
       is set then document 254 is deleted, and if it turns up in a search it
       will be masked out.

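       A bit vector of this kind takes only a few lines of Python.  The
       class below is a sketch, not the KinoSearch1 implementation: the name
       Deletions is made up, and the order of bits within each byte is an
       assumption.

```python
class Deletions:
    """One bit per document; a set bit marks the document as deleted."""

    def __init__(self, max_doc):
        self.bits = bytearray((max_doc + 7) // 8)  # round up to whole bytes

    def delete(self, doc_num):
        self.bits[doc_num >> 3] |= 1 << (doc_num & 7)

    def is_deleted(self, doc_num):
        return bool(self.bits[doc_num >> 3] & (1 << (doc_num & 7)))


dels = Deletions(300)
dels.delete(254)

# Mask deleted documents out of a list of raw hits.
hits = [12, 254, 255]
visible = [doc for doc in hits if not dels.is_deleted(doc)]
print(visible)  # [12, 255]
```

       At one bit per document, a segment of a million documents needs about
       122 KiB for its deletions.
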
       It is only when a segment's contents are rewritten to a new segment
       during the segment-merging process that deleted documents truly go
       away.

   Field Normalization Files
       For the sake of simplicity, the example search scenario above omits the
       role played by the field normalization files, or "fieldnorms" for
       short.  These files have the (theoretical) suffix of .f followed by an
       integer -- .f0, .f1, etc.  Each segment contains one such file for
       every indexed field.

       By default, the fieldnorms' job is to make sure that a field which is
       100 terms long and contains 10 mentions of the word 'freedom' scores
       higher than a field which also contains 10 mentions of the word
       'freedom', but is 1000 terms in length.  The idea is that the higher
       the density of the desired term, the more relevant the document.

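       The density idea can be made concrete with a little arithmetic.  The
       Python sketch below assumes the classic Lucene default of one over the
       square root of the field length as the normalization factor; the text
       above does not state the formula KinoSearch1 uses, so treat both the
       formula and the function names as assumptions.

```python
import math


def length_norm(num_terms):
    # Assumed formula: 1 / sqrt(field length), the classic Lucene default.
    return 1.0 / math.sqrt(num_terms)


def toy_score(term_freq, num_terms):
    # Deliberately crude: raw frequency scaled by the length norm.
    return term_freq * length_norm(num_terms)


short_field = toy_score(10, 100)   # 10 mentions in 100 terms
long_field = toy_score(10, 1000)   # 10 mentions in 1000 terms
print(short_field > long_field)    # True
```

       Under that assumption the shorter field scores roughly three times
       higher, since the square root of 1000 / 100 is about 3.16.
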
       The fieldnorms files contain one byte per document per indexed field,
       and all of them must be loaded into RAM before a search can be
       executed.

Document Numbers

       Document numbers are ephemeral.  They change every time a document
       gets moved from one segment to a new one during optimization.  If you
       need to assign a primary key to each document, you need to create a
       field and populate it with an externally generated unique identifier.

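       The following Python sketch shows why an external identifier is
       needed.  It is a toy model, not the KinoSearch1 API: segments are
       plain lists, the merge function is hypothetical, and uuid4 is just
       one possible source of unique identifiers.

```python
import uuid


def merge(segments, deleted):
    """Rewrite several segments into one, dropping deleted documents.

    segments: list of segments, each a list of document dicts
    deleted:  set of (segment_index, doc_num) pairs
    """
    merged = []
    for seg_index, segment in enumerate(segments):
        for doc_num, doc in enumerate(segment):
            if (seg_index, doc_num) not in deleted:
                merged.append(doc)  # position in merged is the new doc number
    return merged


seg0 = [{"id": str(uuid.uuid4()), "title": t} for t in ("a", "b")]
seg1 = [{"id": str(uuid.uuid4()), "title": "c"}]

target = seg1[0]["id"]  # this document is number 0 in its segment
merged = merge([seg0, seg1], deleted={(0, 0)})

# The document number changed, but the "id" field still finds the document.
new_num = next(n for n, doc in enumerate(merged) if doc["id"] == target)
print(new_num)  # 1
```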

Not compatible with Java Lucene

       The file format used by KinoSearch1 is closely related to the Lucene
       compound index format. (The technical specification for Lucene's file
       format is distributed along with Lucene.)  However, indexes generated
       by Lucene and KinoSearch1 are not compatible.

COPYRIGHT

       Copyright 2005-2010 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.

       See KinoSearch1 version 1.01.

perl v5.34.0                      2022-01-21  KinoSearch1::Docs::FileFormat(3)