Lingua::Stem::En(3)    User Contributed Perl Documentation    Lingua::Stem::En(3)

NAME
    Lingua::Stem::En - Porter's stemming algorithm for 'generic' English

SYNOPSIS
        use Lingua::Stem::En;
        my $stems = Lingua::Stem::En::stem({ -words      => $word_list_reference,
                                             -locale     => 'en',
                                             -exceptions => $exceptions_hash,
                                          });

DESCRIPTION
    This routine applies the Porter Stemming Algorithm to its parameters,
    returning the stemmed words.

    It is derived from the C program "stemmer.c" as found in freewais and
    elsewhere, which contains these notes:

        Purpose:    Implementation of the Porter stemming algorithm documented
                    in: Porter, M.F., "An Algorithm For Suffix Stripping,"
                    Program 14 (3), July 1980, pp. 130-137.
        Provenance: Written by B. Frakes and C. Cox, 1986.

    I have re-interpreted areas that use Frakes and Cox's "WordSize"
    function. My version may misbehave on short words starting with "y",
    but I can't think of any examples.

    The step numbers correspond to Frakes and Cox, and are probably in
    Porter's article (which I've not seen). Porter's algorithm still has
    rough spots (e.g. current/currency, -ings words), which I've not
    attempted to cure, although I have added support for the British -ise
    suffix.

CHANGES
    1999.06.15 - Changed to '.pm' module, moved into Lingua::Stem namespace,
                 optionalized the export of the 'stem' routine
                 into the caller's namespace, added named parameters

    1999.06.24 - Switched core implementation of the Porter stemmer to
                 the one written by Jim Richardson <jimr@maths.usyd.edu.au>

    2000.08.25 - 2.11 Added stemming cache

    2000.09.14 - 2.12 Fixed *major* :( implementation error of Porter's algorithm
                      Error was entirely my fault - I completely forgot to include
                      rule sets 2, 3, and 4 starting with Lingua::Stem 0.30.
                      -- Jerilyn Franz

    2003.09.28 - 2.13 Corrected documentation error pointed out by Simon Cozens.

    2005.11.20 - 2.14 Changed rule declarations to conform to Perl style convention
                      for 'private' subroutines. Changed Exporter invocation to more
                      portable 'require' vice 'use'.

    2006.02.14 - 2.15 Added ability to pass word list by 'handle' for in-place stemming.

    2009.07.27 - 2.16 Documentation Fix

    2020.06.20 - 2.30 Version renumber for module consistency.

    2020.09.26 - 2.31 Fix for Latin1/UTF8 issue in documentation

METHODS
    stem({ -words => \@words, -locale => 'en', -exceptions => \%exceptions });
        Stems a list of passed words using the rules of US English. Returns
        an anonymous array reference to the stemmed words.

        Example:

            my @words         = ( 'wordy', 'another' );
            my $stemmed_words = Lingua::Stem::En::stem({ -words      => \@words,
                                                         -locale     => 'en',
                                                         -exceptions => \%exceptions,
                                                      });
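
The example above leaves %exceptions undefined. A minimal sketch of a complete call follows. It assumes that the exceptions hash maps a lower-cased input word to the exact stem that should be returned for it, which is the convention described for exceptions in the parent Lingua::Stem module; the sample words and the 'emily' entry are illustrative only.

```perl
use strict;
use warnings;
use Lingua::Stem::En;

# Assumed convention: key = lower-cased input word, value = stem to return.
my %exceptions = ( 'emily' => 'emily' );
my @words      = ( 'wordy', 'another', 'emily' );

my $stemmed_words = Lingua::Stem::En::stem({ -words      => \@words,
                                             -locale     => 'en',
                                             -exceptions => \%exceptions,
                                          });

# Print each input next to its stem. In this 'by value' mode the
# original @words array is not modified.
for my $i (0 .. $#words) {
    printf "%-10s => %s\n", $words[$i], $stemmed_words->[$i];
}
```

Without the exception entry, 'emily' would be passed through the normal suffix rules like any other word.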

        If the first element of the -words list is itself a list reference,
        the stemming is performed 'in place' on that list (modifying the
        passed list directly instead of copying it to a new array).

        This is only useful if you do not need to keep the original list.
        If you do need to keep it, use the normal semantics and let 'stem'
        return a new list. That is faster than making your own copy and
        then stemming it 'in place', because the primary difference between
        'in place' and 'by value' stemming is the creation of a copy of the
        original list. If you don't need the original list, 'in place'
        stemming is about 60% faster.

        Example of 'in place' stemming:

            my $words         = [ 'wordy', 'another' ];
            my $stemmed_words = Lingua::Stem::En::stem({ -words      => [$words],
                                                         -locale     => 'en',
                                                         -exceptions => \%exceptions,
                                                      });

        The 'in place' mode returns a reference to the original list with
        the words stemmed.
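
A short sketch of what 'in place' means for the caller, based only on the description above: the returned reference and the reference that was passed in should be the same array. The variable names are illustrative, and -exceptions is omitted on the assumption that it is optional.

```perl
use strict;
use warnings;
use Lingua::Stem::En;

my $words = [ 'wordy', 'another' ];

# Wrapping the reference in an anonymous array requests 'in place' stemming.
my $stemmed = Lingua::Stem::En::stem({ -words  => [$words],
                                       -locale => 'en',
                                    });

# Comparing two references numerically compares their addresses.
print "same array\n" if $stemmed == $words;

# The original list now holds the stems, not the words passed in.
print join(' ', @$words), "\n";
```

If the original words are still needed afterwards, pass `\@words` directly instead, as in the first example.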

    stem_caching({ -level => 0|1|2 });
        Sets the level of stem caching.

        '0' means 'no caching'. This is the default level.

        '1' means 'cache per run'. This caches stemming results during a
        single call to 'stem'.

        '2' means 'cache indefinitely'. This caches stemming results until
        either the process exits or the 'clear_stem_cache' method is
        called.

    clear_stem_cache;
        Clears the cache of stemmed words.
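
A minimal sketch of how the two caching calls might be combined when the same words recur across several calls to 'stem'. The batches are made-up data, and both routines are called by their fully qualified names on the assumption that they are not imported by default.

```perl
use strict;
use warnings;
use Lingua::Stem::En;

# Level 2: keep cached stems until the process exits or the cache is cleared.
Lingua::Stem::En::stem_caching({ -level => 2 });

# 'running' appears in both batches, so the second call can reuse the cache.
my @batches = ( [ 'running', 'runs' ], [ 'running', 'jumped' ] );
my @results;
for my $batch (@batches) {
    my $stems = Lingua::Stem::En::stem({ -words => $batch, -locale => 'en' });
    push @results, $stems;
    print join(' ', @$stems), "\n";
}

# Release the cached stems and return to the default (no caching).
Lingua::Stem::En::clear_stem_cache();
Lingua::Stem::En::stem_caching({ -level => 0 });
```

Level 2 trades memory for speed, so clearing the cache once a batch job is finished keeps a long-running process from holding on to every word it has ever seen.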

NOTES
    This code is almost entirely derived from the Porter 2.1 module written
    by Jim Richardson.

SEE ALSO
    Lingua::Stem

AUTHOR
    Jim Richardson, University of Sydney
    jimr@maths.usyd.edu.au or http://www.maths.usyd.edu.au:8000/jimr.html

    Integration in Lingua::Stem by
    Jerilyn Franz, FreeRun Technologies,
    <cpan@jerilyn.info>

COPYRIGHT
    Jim Richardson, University of Sydney
    Jerilyn Franz, FreeRun Technologies

    This code is freely available under the same terms as Perl.

perl v5.32.1                      2021-01-27                Lingua::Stem::En(3)