Lucy::Analysis::RegexTokenizer(3)  User Contributed Perl Documentation  Lucy::Analysis::RegexTokenizer(3)

NAME

       Lucy::Analysis::RegexTokenizer - Split a string into tokens.

SYNOPSIS

           my $whitespace_tokenizer
               = Lucy::Analysis::RegexTokenizer->new( pattern => '\S+' );

           # or...
           my $word_char_tokenizer
               = Lucy::Analysis::RegexTokenizer->new( pattern => '\w+' );

           # or...
           my $apostrophising_tokenizer = Lucy::Analysis::RegexTokenizer->new;

           # Then... once you have a tokenizer, put it into a PolyAnalyzer:
           my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
               analyzers => [ $word_char_tokenizer, $normalizer, $stemmer ], );

DESCRIPTION

       Generically, "tokenizing" is a process of breaking up a string into an
       array of "tokens".  For instance, the string "three blind mice" might
       be tokenized into "three", "blind", "mice".

       Lucy::Analysis::RegexTokenizer decides where it should break up the
       text based on a regular expression compiled from a supplied "pattern"
       matching one token.  If our source string is...

           "Eats, Shoots and Leaves."

       ... then a "whitespace tokenizer" with a "pattern" of "\\S+" produces...

           Eats,
           Shoots
           and
           Leaves.

       ... while a "word character tokenizer" with a "pattern" of "\\w+"
       produces...

           Eats
           Shoots
           and
           Leaves

       ... the difference being that the word character tokenizer skips over
       punctuation as well as whitespace when determining token boundaries.
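
       The difference between the two patterns can be reproduced with a plain
       Perl global match.  The following is only a sketch of the idea, not the
       Lucy implementation: "tokenize_with" is a hypothetical helper invented
       for illustration, and it returns bare strings rather than the Inversion
       objects that the real class works with.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lucy): collect every substring that
# matches the token pattern, in order of appearance.
sub tokenize_with {
    my ( $pattern, $text ) = @_;
    my @tokens = $text =~ /$pattern/g;    # list-context /g returns all matches
    return @tokens;
}

my $source = 'Eats, Shoots and Leaves.';

my @whitespace_tokens = tokenize_with( '\S+', $source );
my @word_char_tokens  = tokenize_with( '\w+', $source );

print join( '|', @whitespace_tokens ), "\n";    # Eats,|Shoots|and|Leaves.
print join( '|', @word_char_tokens ),  "\n";    # Eats|Shoots|and|Leaves
```

       Note that the patterns are written in single quotes here, so a single
       backslash is enough.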

CONSTRUCTORS

   new
           my $word_char_tokenizer = Lucy::Analysis::RegexTokenizer->new(
               pattern => '\w+',    # optional; default described below
           );

       Create a new RegexTokenizer.

       *   pattern - A string specifying a Perl-syntax regular expression
           which should match one token.  The default value is
           "\w+(?:[\x{2019}']\w+)*", which matches "it's" as well as "it" and
           "O'Henry's" as well as "Henry".
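
       To see what the default pattern accepts, it can be tried directly as a
       Perl regular expression.  This is a sketch under the assumption that
       ordinary Perl matching behaves like the tokenizer for these inputs;
       "apostrophe_tokens" is a hypothetical helper, not a Lucy function.

```perl
use strict;
use warnings;

# The default pattern quoted above: a run of word characters, optionally
# followed by groups of (apostrophe or U+2019, more word characters).
my $default_pattern = qr/\w+(?:[\x{2019}']\w+)*/;

# Hypothetical helper (not part of Lucy): return all matches as a list.
sub apostrophe_tokens {
    my ($text) = @_;
    my @tokens = $text =~ /$default_pattern/g;
    return @tokens;
}

my @tokens = apostrophe_tokens("it's O'Henry's hat, isn\x{2019}t it?");

binmode( STDOUT, ':encoding(UTF-8)' );
print "$_\n" for @tokens;    # it's, O'Henry's, hat, isn\x{2019}t, it
```

       Apostrophes inside a word stay attached to it, while the comma and the
       question mark are skipped.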

METHODS

   transform
           my $inversion = $regex_tokenizer->transform($inversion);

       Takes a single Inversion as input and returns an Inversion, either the
       same one (presumably transformed in some way), or a new one.

       *   inversion - An inversion.

INHERITANCE

       Lucy::Analysis::RegexTokenizer isa Lucy::Analysis::Analyzer isa
       Clownfish::Obj.


perl v5.34.0                      2021-07-22 Lucy::Analysis::RegexTokenizer(3)