Lucy::Docs::Tutorial::AnalysisTutorial(3)    User Contributed Perl Documentation

NAME
Lucy::Docs::Tutorial::AnalysisTutorial - How to choose and use
Analyzers.

DESCRIPTION
Try swapping out the EasyAnalyzer in our Schema for a
StandardTokenizer:

    my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
    my $type = Lucy::Plan::FullTextType->new(
        analyzer => $tokenizer,
    );

Search for "senate", "Senate", and "Senator" before and after making
the change and re-indexing.

Under EasyAnalyzer, the results are identical for all three searches,
but under StandardTokenizer, searches are case-sensitive, and the
result sets for "Senate" and "Senator" are distinct.
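
The comparison described above can be tried without touching the tutorial's on-disk index. The following is a minimal sketch, not the tutorial's own script: it builds a throwaway index in a Lucy::Store::RAMFolder for each analyzer and counts hits per query. The three sample sentences and the helper name hit_counts are made up for illustration.

```perl
use strict;
use warnings;
use Lucy::Analysis::EasyAnalyzer;
use Lucy::Analysis::StandardTokenizer;
use Lucy::Index::Indexer;
use Lucy::Plan::FullTextType;
use Lucy::Plan::Schema;
use Lucy::Search::IndexSearcher;
use Lucy::Store::RAMFolder;

# Hypothetical helper: index a few sample sentences in memory with the
# given analyzer, then return a hashref of hit counts keyed by query.
sub hit_counts {
    my ( $analyzer, @queries ) = @_;

    my $schema = Lucy::Plan::Schema->new;
    my $type   = Lucy::Plan::FullTextType->new( analyzer => $analyzer );
    $schema->spec_field( name => 'content', type => $type );

    my $folder  = Lucy::Store::RAMFolder->new;
    my $indexer = Lucy::Index::Indexer->new(
        index  => $folder,
        schema => $schema,
        create => 1,
    );

    # Made-up sample documents, not the tutorial's corpus.
    $indexer->add_doc( { content => $_ } ) for (
        'The Senate shall choose its officers.',
        'No person shall be a Senator who is under thirty.',
        'each senate seat',
    );
    $indexer->commit;

    my $searcher = Lucy::Search::IndexSearcher->new( index => $folder );
    my %counts;
    for my $query (@queries) {
        $counts{$query} = $searcher->hits( query => $query )->total_hits;
    }
    return \%counts;
}

my @queries = ( 'senate', 'Senate', 'Senator' );
my @setups  = (
    [ 'EasyAnalyzer',      Lucy::Analysis::EasyAnalyzer->new( language => 'en' ) ],
    [ 'StandardTokenizer', Lucy::Analysis::StandardTokenizer->new ],
);
for my $setup (@setups) {
    my ( $name, $analyzer ) = @$setup;
    my $counts = hit_counts( $analyzer, @queries );
    print "$name: ", join( ', ', map { "$_=$counts->{$_}" } @queries ), "\n";
}
```

If the analyzers behave as this chapter describes, the EasyAnalyzer line should show the same count for all three queries, while the StandardTokenizer line should show each query matching only the sentence that contains that exact token.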

EasyAnalyzer
What’s happening is that EasyAnalyzer is performing more aggressive
processing than StandardTokenizer. In addition to tokenizing, it’s
also converting all text to lower case so that searches are
case-insensitive, and using a “stemming” algorithm to reduce related
words to a common stem ("senat", in this case).

EasyAnalyzer is actually multiple Analyzers wrapped up in a single
package. In this case, it’s three-in-one, since specifying an
EasyAnalyzer with "language => 'en'" is equivalent to this snippet
creating a PolyAnalyzer:

    my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
    my $normalizer = Lucy::Analysis::Normalizer->new;
    my $stemmer = Lucy::Analysis::SnowballStemmer->new( language => 'en' );
    my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => [ $tokenizer, $normalizer, $stemmer ],
    );
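
One way to check the claimed equivalence is to run the same text through both analyzers and compare the resulting tokens. This sketch assumes that the Analyzer method split returns a reference to an array of token texts, which is how the Lucy::Analysis::Analyzer documentation describes it. The sample sentence is made up.

```perl
use strict;
use warnings;
use Lucy::Analysis::EasyAnalyzer;
use Lucy::Analysis::Normalizer;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Analysis::SnowballStemmer;
use Lucy::Analysis::StandardTokenizer;

my $easy = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );
my $poly = Lucy::Analysis::PolyAnalyzer->new(
    analyzers => [
        Lucy::Analysis::StandardTokenizer->new,
        Lucy::Analysis::Normalizer->new,
        Lucy::Analysis::SnowballStemmer->new( language => 'en' ),
    ],
);

# Made-up sample text; any English sentence will do.
my $text = 'The Senate confirmed the Senator.';

# split() analyzes the text and hands back the token texts.
my $easy_tokens = $easy->split($text);
my $poly_tokens = $poly->split($text);

print "EasyAnalyzer: @$easy_tokens\n";
print "PolyAnalyzer: @$poly_tokens\n";
```

Printing tokens this way is also a quick means of seeing what actually ends up in the index for a given field.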

You can add or subtract Analyzers from there if you like. Try adding a
fourth Analyzer, a SnowballStopFilter for suppressing “stopwords” like
“the”, “if”, and “maybe”.

    my $stopfilter = Lucy::Analysis::SnowballStopFilter->new(
        language => 'en',
    );
    my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => [ $tokenizer, $normalizer, $stopfilter, $stemmer ],
    );

Also, try removing the SnowballStemmer.

    my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => [ $tokenizer, $normalizer ],
    );
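
To see what each variation does to the token stream, the chains can be compared side by side on one sentence. This is an illustrative sketch: the helper tokens_with and the sample sentence are not part of Lucy or of the tutorial, and split is assumed to return an array reference of token texts.

```perl
use strict;
use warnings;
use Lucy::Analysis::Normalizer;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Analysis::SnowballStemmer;
use Lucy::Analysis::SnowballStopFilter;
use Lucy::Analysis::StandardTokenizer;

# Hypothetical helper: chain the given analyzers in order and return
# the token texts produced for $text.
sub tokens_with {
    my ( $analyzers, $text ) = @_;
    my $poly = Lucy::Analysis::PolyAnalyzer->new( analyzers => $analyzers );
    return $poly->split($text);
}

my $tokenizer  = Lucy::Analysis::StandardTokenizer->new;
my $normalizer = Lucy::Analysis::Normalizer->new;
my $stopfilter = Lucy::Analysis::SnowballStopFilter->new( language => 'en' );
my $stemmer    = Lucy::Analysis::SnowballStemmer->new( language => 'en' );

my $text = 'The senators debated';    # made-up sample text

my @chains = (
    [ 'no stemmer'      => [ $tokenizer, $normalizer ] ],
    [ 'with stemmer'    => [ $tokenizer, $normalizer, $stemmer ] ],
    [ 'with stopfilter' => [ $tokenizer, $normalizer, $stopfilter, $stemmer ] ],
);
for my $chain (@chains) {
    my ( $label, $analyzers ) = @$chain;
    my $tokens = tokens_with( $analyzers, $text );
    print "$label: @$tokens\n";
}
```

The stop filter sits before the stemmer here, as in the snippet above, so stopwords are matched in their normalized but unstemmed form.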

The original choice of a stock English EasyAnalyzer probably still
yields the best results for this document collection, but you get the
idea: sometimes you want a different Analyzer.

When the best Analyzer is no Analyzer
Sometimes you don’t want an Analyzer at all. That was true for our
“url” field because we didn’t need it to be searchable, but it’s also
true for certain types of searchable fields. For instance, “category”
fields are often set up to match exactly or not at all, as are fields
like “last_name” (because you may not want to conflate results for
“Humphrey” and “Humphries”).

To specify that there should be no analysis performed at all, use
StringType:

    my $type = Lucy::Plan::StringType->new;
    $schema->spec_field( name => 'category', type => $type );
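
The exact-match behaviour can be seen with a small in-memory index. This sketch uses a “last_name” field rather than “category”, and queries it with Lucy::Search::TermQuery so that no query parsing is involved. The two sample documents and the helper exact_hits are assumptions for illustration only.

```perl
use strict;
use warnings;
use Lucy::Index::Indexer;
use Lucy::Plan::Schema;
use Lucy::Plan::StringType;
use Lucy::Search::IndexSearcher;
use Lucy::Search::TermQuery;
use Lucy::Store::RAMFolder;

# A StringType field is indexed as a single unanalyzed term.
my $schema = Lucy::Plan::Schema->new;
$schema->spec_field(
    name => 'last_name',
    type => Lucy::Plan::StringType->new,
);

my $folder  = Lucy::Store::RAMFolder->new;
my $indexer = Lucy::Index::Indexer->new(
    index  => $folder,
    schema => $schema,
    create => 1,
);
# Made-up sample documents.
$indexer->add_doc( { last_name => 'Humphrey' } );
$indexer->add_doc( { last_name => 'Humphries' } );
$indexer->commit;

my $searcher = Lucy::Search::IndexSearcher->new( index => $folder );

# Hypothetical helper: count documents whose last_name is exactly $term.
sub exact_hits {
    my ($term) = @_;
    my $query = Lucy::Search::TermQuery->new(
        field => 'last_name',
        term  => $term,
    );
    return $searcher->hits( query => $query )->total_hits;
}

for my $term ( 'Humphrey', 'humphrey', 'Humphries' ) {
    print "$term: ", exact_hits($term), "\n";
}
```

Because no analysis is applied, only a term that matches the stored value character for character, including case, should produce a hit.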

Highlighting up next
In our next tutorial chapter, HighlighterTutorial, we’ll add
highlighted excerpts from the “content” field to our search results.

perl v5.36.0                      2023-01-20      Lucy::Docs::Tutorial::AnalysisTutorial(3)