Lucy::Docs::IRTheory(3)  User Contributed Perl Documentation  Lucy::Docs::IRTheory(3)

NAME

       Lucy::Docs::IRTheory - Crash course in information retrieval

DESCRIPTION

       Just enough Information Retrieval theory to find your way around Apache
       Lucy.

   Terminology
       Lucy uses some terminology from the field of information retrieval
       which may be unfamiliar to many users.  "Document" and "term" mean
       pretty much what you'd expect them to, but others such as "posting" and
       "inverted index" need a formal introduction:

       •   document - An atomic unit of retrieval.

       •   term - An attribute which describes a document.

       •   posting - One term indexing one document.

       •   term list - The complete list of terms which describe a document.

       •   posting list - The complete list of documents which a term indexes.

       •   inverted index - A data structure which maps from terms to
           documents.

       Since Lucy is a practical implementation of IR theory, it loads these
       abstract, distilled definitions down with useful traits.  For instance,
       a "posting" in its most rarefied form is simply a term-document
       pairing; in Lucy, the class MatchPosting fills this role.  However, by
       associating additional information with a posting, such as the number
       of times the term occurs in the document, we can turn it into a
       ScorePosting, making it possible to rank documents by relevance rather
       than just list documents which happen to match in no particular order.
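
       The definitions above can be sketched in a few lines of Python.  This
       is a toy illustration only, not Lucy's implementation or API: the
       corpus and the names term_lists, inverted_index, matching_docs and
       docs_by_frequency are all hypothetical.  Each posting is stored as a
       (document id, term frequency) pair; ignoring the frequency gives the
       bare term-document pairing, while using it hints at how extra
       per-posting information makes ranking possible.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: document id -> text.  Not taken from Lucy.
docs = {
    1: "skate park",
    2: "park bench in the park",
    3: "skate shop",
}

# Term list: the complete list of terms which describe each document.
term_lists = {doc_id: sorted(set(text.split())) for doc_id, text in docs.items()}

# Inverted index: term -> posting list.  Each posting is a
# (doc_id, term_frequency) pair; dropping the frequency would leave the
# bare term-document pairing.
inverted_index = defaultdict(list)
for doc_id in sorted(docs):
    counts = Counter(docs[doc_id].split())
    for term, freq in sorted(counts.items()):
        inverted_index[term].append((doc_id, freq))


def matching_docs(term):
    """Return ids of the documents a term indexes, in doc-id order."""
    return [doc_id for doc_id, _freq in inverted_index.get(term, [])]


def docs_by_frequency(term):
    """Return matching ids, most occurrences of the term first."""
    postings = inverted_index.get(term, [])
    ordered = sorted(postings, key=lambda p: (-p[1], p[0]))
    return [doc_id for doc_id, _freq in ordered]
```

       Here matching_docs("park") simply lists documents 1 and 2, whereas
       docs_by_frequency("park") puts document 2 first because "park" occurs
       twice in it.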

   TF/IDF ranking algorithm
       Lucy uses a variant of the well-established "Term Frequency / Inverse
       Document Frequency" weighting scheme.  A thorough treatment of TF/IDF
       is too ambitious for our present purposes, but in a nutshell, it means
       that...

       •   in a search for "skate park", documents which score well for the
           comparatively rare term "skate" will rank higher than documents
           which score well for the more common term "park".

       •   a 10-word text which has one occurrence each of both "skate" and
           "park" will rank higher than a 1000-word text which also contains
           one occurrence of each.

       A web search for "tf idf" will turn up many excellent explanations of
       the algorithm.
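
       The two effects described in this section can be reproduced with a
       minimal scoring sketch.  It is written under stated assumptions and is
       not Lucy's actual formula: the logarithmic idf, the square-root
       damping of term frequency and document length, the corpus, and the
       names make_doc, idf and score are all chosen here for illustration.

```python
import math
from collections import Counter


def make_doc(terms, length):
    """Build a word list of the given length, padded with a filler word."""
    return list(terms) + ["filler"] * (length - len(terms))


def idf(term, docs):
    """Inverse document frequency: rarer terms receive larger weights."""
    df = sum(1 for words in docs.values() if term in words)
    return 1.0 + math.log(len(docs) / (1 + df))


def score(query, doc_id, docs):
    """Sum damped tf * idf over the query terms, normalized by length."""
    words = docs[doc_id]
    counts = Counter(words)
    raw = sum(math.sqrt(counts[term]) * idf(term, docs) for term in query)
    return raw / math.sqrt(len(words))


# Hypothetical corpus in which "park" is more common than "skate".
docs = {
    "short": make_doc(["skate", "park"], 10),
    "long": make_doc(["skate", "park"], 1000),
    "skate_only": make_doc(["skate"], 10),
    "park_only": make_doc(["park"], 10),
    "park_a": make_doc(["park"], 10),
    "park_b": make_doc(["park"], 10),
}

query = ["skate", "park"]
ranking = sorted(docs, key=lambda doc_id: score(query, doc_id, docs), reverse=True)
```

       With this corpus, "skate_only" outscores "park_only" because "skate"
       appears in fewer documents, and "short" outscores "long" because the
       same two matches make up a larger share of a 10-word text.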

perl v5.34.0                      2022-01-21           Lucy::Docs::IRTheory(3)