Lucy::Analysis::Token(3)  User Contributed Perl Documentation  Lucy::Analysis::Token(3)

NAME
       Lucy::Analysis::Token - Unit of text.

SYNOPSIS
           my $token = Lucy::Analysis::Token->new(
               text         => 'blind',
               start_offset => 8,
               end_offset   => 13,
           );

           $token->set_text('mice');

DESCRIPTION
       Token is the fundamental unit used by Apache Lucy’s Analyzer
       subclasses. Each Token has five attributes: "text", "start_offset",
       "end_offset", "boost", and "pos_inc".

       The "text" attribute is a Unicode string encoded as UTF-8.

       "start_offset" is the start point of the token text, measured in
       Unicode code points from the top of the stored field; "end_offset"
       delimits the corresponding closing boundary. "start_offset" and
       "end_offset" locate the Token within a larger context, even if the
       Token’s text attribute gets modified, by stemming for instance. The
       Token for “beating” in the text “beating a dead horse” begins life with
       a start_offset of 0 and an end_offset of 7; after stemming, the text is
       “beat”, but the start_offset is still 0 and the end_offset is still 7.
       This allows “beating” to be highlighted correctly after a search
       matches “beat”.

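       The “beating” example above can be sketched as follows. This is a
       minimal illustration, not how Lucy's own stemmers are implemented: the
       call to set_text() merely stands in for a stemmer, and the $field,
       $start, $len, and $original variables are names chosen for this sketch.
       It assumes only the constructor and accessors documented on this page.

```perl
use strict;
use warnings;
use Lucy::Analysis::Token;

my $field = 'beating a dead horse';

# Token for the first word. Offsets are in code points; judging by the
# example in the text (0 and 7 for a seven-letter word), the end offset
# points one past the last character.
my $token = Lucy::Analysis::Token->new(
    text         => 'beating',
    start_offset => 0,
    end_offset   => 7,
);

# Stand-in for a stemmer: only the text changes, not the offsets.
$token->set_text('beat');

# Recover the original span from the field using the unchanged offsets.
my $start    = $token->get_start_offset;
my $len      = $token->get_end_offset - $start;
my $original = substr( $field, $start, $len );

print $token->get_text, " -> highlight '$original'\n";
```

       Because substr() counts characters on a decoded Perl string, the
       code-point offsets can be used directly to pull “beating” back out of
       the field even though the token text is now “beat”.
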
       "boost" is a per-token weight. Use this when you want to assign more
       or less importance to a particular token, as you might for emboldened
       text within an HTML document. (Note: The field this token belongs to
       must be configured to use a posting of type RichPosting.)

       "pos_inc" is the POSition INCrement, measured in Tokens. This
       attribute, which defaults to 1, is an advanced tool for manipulating
       phrase matching. Ordinarily, Tokens are assigned consecutive position
       numbers: 0, 1, and 2 for "three blind mice". However, if you set the
       position increment for “blind” to, say, 1000, then the three tokens
       will end up assigned to positions 0, 1, and 1001, and will no longer
       produce a phrase match for the query "three blind mice".

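       The position arithmetic above can be reproduced with a few lines of
       plain Perl. This sketch does not call Lucy at all; assign_positions()
       is a hypothetical helper written for illustration, and it assumes that
       a token's increment is added after that token's own position has been
       assigned, which is the reading that yields the numbers given in the
       text.

```perl
use strict;
use warnings;

# Hypothetical helper: takes [text, pos_inc] pairs, returns [text, position]
# pairs. Assumption: each token's pos_inc is applied after its own position
# is recorded, so it determines where the NEXT token lands.
sub assign_positions {
    my @tokens = @_;
    my $pos    = 0;
    my @out;
    for my $tok (@tokens) {
        my ( $text, $pos_inc ) = @$tok;
        push @out, [ $text, $pos ];
        $pos += $pos_inc;
    }
    return @out;
}

my @default = assign_positions( [ three => 1 ], [ blind => 1 ],    [ mice => 1 ] );
my @gapped  = assign_positions( [ three => 1 ], [ blind => 1000 ], [ mice => 1 ] );

print join( ', ', map { "$_->[0]=$_->[1]" } @default ), "\n";
print join( ', ', map { "$_->[0]=$_->[1]" } @gapped ),  "\n";
```

       With default increments the positions are 0, 1, and 2. With an
       increment of 1000 on “blind” they become 0, 1, and 1001, so “mice” no
       longer sits directly after “blind” and the phrase cannot match.
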
CONSTRUCTORS
   new
           my $token = Lucy::Analysis::Token->new(
               text         => $text,          # required
               start_offset => $start_offset,  # required
               end_offset   => $end_offset,    # required
               boost        => 1.0,            # optional
               pos_inc      => 1,              # optional
           );

       •   text - A string.

       •   start_offset - Start offset into the original document in Unicode
           code points.

       •   end_offset - End offset into the original document in Unicode
           code points.

       •   boost - Per-token weight.

       •   pos_inc - Position increment for phrase matching.

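       As a rough sketch of how the required parameters might be filled in,
       the following splits a string on whitespace and builds one Token per
       word. The whitespace splitting is an assumption made for illustration
       and is not how Lucy's own tokenizers work; $field and @tokens are
       names chosen for this example. On a decoded Perl string, the @- and @+
       match arrays hold character offsets, which is what the constructor
       expects.

```perl
use strict;
use warnings;
use utf8;
use Lucy::Analysis::Token;

binmode STDOUT, ':encoding(UTF-8)';

my $field = 'naïve café owner';

my @tokens;
while ( $field =~ /(\S+)/g ) {
    # $-[1] and $+[1] are the start and end of capture group 1, counted in
    # characters (code points) because $field is a decoded string.
    push @tokens, Lucy::Analysis::Token->new(
        text         => $1,
        start_offset => $-[1],
        end_offset   => $+[1],
    );
}

printf "%-6s %2d %2d\n", $_->get_text, $_->get_start_offset, $_->get_end_offset
    for @tokens;
```

       Here “café” receives offsets 6 and 10, counted in code points. Its
       UTF-8 byte offsets would differ, since “ï” and “é” each occupy two
       bytes.
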
METHODS
   get_text
           my $text = $token->get_text;

       Get the token's text.

   set_text
           $token->set_text($text);

       Set the token's text.

   get_start_offset
           my $int = $token->get_start_offset();

   get_end_offset
           my $int = $token->get_end_offset();

   get_boost
           my $float = $token->get_boost();

   get_pos_inc
           my $int = $token->get_pos_inc();

   get_len
           my $int = $token->get_len();

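       A short sketch of the accessors in use, reading back the values passed
       to the constructor. The boost of 2.0 is an arbitrary value chosen for
       illustration, and %attr is a name invented for this example. get_len()
       is left out because this page does not say what unit it measures.

```perl
use strict;
use warnings;
use Lucy::Analysis::Token;

my $token = Lucy::Analysis::Token->new(
    text         => 'blind',
    start_offset => 6,
    end_offset   => 11,
    boost        => 2.0,    # illustrative: e.g. the word appeared in bold
    pos_inc      => 1,
);

# Collect each attribute through its documented getter.
my %attr = (
    text         => $token->get_text,
    start_offset => $token->get_start_offset,
    end_offset   => $token->get_end_offset,
    boost        => $token->get_boost,
    pos_inc      => $token->get_pos_inc,
);

print "$_ = $attr{$_}\n" for sort keys %attr;
```

       Only the text has a documented setter here, so the other attributes
       are best treated as fixed once the Token has been constructed.
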
INHERITANCE
       Lucy::Analysis::Token isa Clownfish::Obj.

perl v5.36.0                      2023-01-20          Lucy::Analysis::Token(3)