Lucy::Analysis::RegexTokenizer(3)  User Contributed Perl Documentation  Lucy::Analysis::RegexTokenizer(3)

NAME

       Lucy::Analysis::RegexTokenizer - Split a string into tokens.

SYNOPSIS

           my $whitespace_tokenizer
               = Lucy::Analysis::RegexTokenizer->new( pattern => '\S+' );

           # or...
           my $word_char_tokenizer
               = Lucy::Analysis::RegexTokenizer->new( pattern => '\w+' );

           # or...
           my $apostrophising_tokenizer = Lucy::Analysis::RegexTokenizer->new;

           # Then... once you have a tokenizer, put it into a PolyAnalyzer
           # ($normalizer and $stemmer are other analyzers created elsewhere):
           my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
               analyzers => [ $word_char_tokenizer, $normalizer, $stemmer ], );

DESCRIPTION

       Generically, “tokenizing” is a process of breaking up a string into an
       array of “tokens”.  For instance, the string “three blind mice” might
       be tokenized into “three”, “blind”, “mice”.

       Lucy::Analysis::RegexTokenizer decides where it should break up the
       text based on a regular expression compiled from a supplied "pattern"
       matching one token.  If our source string is…

           "Eats, Shoots and Leaves."

       … then a “whitespace tokenizer” with a "pattern" of "\S+" produces…

           Eats,
           Shoots
           and
           Leaves.

       … while a “word character tokenizer” with a "pattern" of "\w+"
       produces…

           Eats
           Shoots
           and
           Leaves

       … the difference being that the word character tokenizer skips over
       punctuation as well as whitespace when determining token boundaries.
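
       As a rough illustration of the two patterns above, the following
       sketch applies them with Perl's built-in global match rather than
       with Lucy itself.  It only shows which substrings each pattern
       matches and is not how RegexTokenizer is implemented; the variable
       names are hypothetical.

```perl
use strict;
use warnings;

# Illustration only: plain Perl regex matching, not the Lucy API.
my $source = "Eats, Shoots and Leaves.";

# In list context, a /g match with no capture groups returns every match.
my @whitespace_tokens = $source =~ /\S+/g;
my @word_char_tokens  = $source =~ /\w+/g;

print join( "|", @whitespace_tokens ), "\n";
print join( "|", @word_char_tokens ),  "\n";
```

       The first list keeps the trailing comma and full stop attached to
       their words, while the second drops them.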

CONSTRUCTORS

   new
           my $word_char_tokenizer = Lucy::Analysis::RegexTokenizer->new(
               pattern => '\w+',    # optional; see the default below
           );

       Create a new RegexTokenizer.

       pattern - A string specifying a Perl-syntax regular expression
           which should match one token.  The default value is
           "\w+(?:[\x{2019}']\w+)*", which matches “it’s” as well as “it” and
           “O’Henry’s” as well as “Henry”.
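
       To see what the default pattern quoted above matches, here is a
       minimal sketch using plain Perl regex matching rather than Lucy
       itself.  The sample string and variable names are assumptions made
       for illustration.

```perl
use strict;
use warnings;

binmode( STDOUT, ':encoding(UTF-8)' );

# Illustration only: plain Perl regex matching, not the Lucy API.
# The pattern is the default value quoted in the documentation above.
my $default_pattern = qr/\w+(?:[\x{2019}']\w+)*/;

# \x{2019} is the right single quotation mark (the "curly" apostrophe).
my $source = "it's O\x{2019}Henry\x{2019}s hat";

# The group is non-capturing, so /g in list context returns whole matches.
my @tokens = $source =~ /$default_pattern/g;

print join( "|", @tokens ), "\n";
```

       With a plain "\w+" pattern the same input would instead be split at
       each apostrophe.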

METHODS

   transform
           my $inversion = $regex_tokenizer->transform($inversion);

       Takes a single Inversion as input and returns an Inversion, either the
       same one (presumably transformed in some way), or a new one.

       inversion - An inversion.

INHERITANCE

       Lucy::Analysis::RegexTokenizer isa Lucy::Analysis::Analyzer isa
       Clownfish::Obj.

perl v5.36.0                      2023-01-20 Lucy::Analysis::RegexTokenizer(3)