Lingua::Stem(3)       User Contributed Perl Documentation      Lingua::Stem(3)

NAME

       Lingua::Stem - Stemming of words

SYNOPSIS

           use Lingua::Stem qw(stem);
           my $stemmed_words_anon_array   = stem(@words);

           or for the OO inclined,

           use Lingua::Stem;
           my $stemmer = Lingua::Stem->new(-locale => 'EN-UK');
           $stemmer->stem_caching({ -level => 2 });
           my $stemmed_words_anon_array   = $stemmer->stem(@words);

DESCRIPTION

       This routine applies stemming algorithms to its parameters, returning
       the stemmed words as appropriate to the selected locale.

       You can import some or all of the class methods.

       use Lingua::Stem qw (stem clear_stem_cache stem_caching
                            add_exceptions delete_exceptions
                            get_exceptions set_locale get_locale
                            :all :locale :exceptions :stem :caching);

        :all        - imports  stem add_exceptions delete_exceptions get_exceptions
                      set_locale get_locale
        :stem       - imports  stem
        :caching    - imports  stem_caching clear_stem_cache
        :locale     - imports  set_locale get_locale
        :exceptions - imports  add_exceptions delete_exceptions get_exceptions

       Currently supported locales are:

             DA          - Danish
             DE          - German
             EN          - English (also EN-US and EN-UK)
             FR          - French
             GL          - Galician
             IT          - Italian
             NO          - Norwegian
             PT          - Portuguese
             RU          - Russian (also RU-RU and RU-RU.KOI8-R)
             SV          - Swedish

       If you have the memory and lots of stemming to do, I strongly suggest
       using cache level 2 and processing lists in 'big chunks' (long lists)
       for best performance.
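
       As a rough illustration of that advice, the sketch below turns on
       cache level 2 and stems one long list in a single call rather than
       word by word. The sample text and variable names are made up for the
       example; only 'new', 'stem_caching' and 'stem' come from this
       documentation.

```perl
use strict;
use warnings;
use Lingua::Stem;

# Hypothetical input; in practice this would be a large body of text.
my $text  = 'The cats were running while the dogs kept running after the cats';
my @words = split ' ', lc $text;

my $stemmer = Lingua::Stem->new(-locale => 'en');

# Cache level 2 keeps stemming results until clear_stem_cache is called
# or the process exits.
$stemmer->stem_caching({ -level => 2 });

# One call with the whole list, not one call per word.
my $stems = $stemmer->stem(@words);

printf "%-10s => %s\n", $words[$_], $stems->[$_] for 0 .. $#words;
```

       Repeated words ('running', 'cats') are the case the cache is meant
       to help with; how much it helps depends on the data.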

   Comparison with Lingua::Stem::Snowball
       It functions fairly similarly to the Lingua::Stem::Snowball suite of
       stemmers, with the most significant differences being:

       1) Lingua::Stem is a 'pure perl' (no compiled XS code is needed) suite.
          Lingua::Stem::Snowball is XS based (must be compiled).

       2) Lingua::Stem works with Perl 5.6 or later.
          Lingua::Stem::Snowball works with Perl 5.8 or later.

       3) Lingua::Stem has an 'exceptions' system allowing you to override
          stemming on a 'case by case' basis.
          Lingua::Stem::Snowball does not have an 'exceptions' system.

       4) A somewhat different set of supported languages:

        +---------------------------------------------------------------+
        | Language   | ISO code | Lingua::Stem | Lingua::Stem::Snowball |
        |---------------------------------------------------------------|
        | Danish     | da       |      yes     |          yes           |
        | Dutch      | nl       |       no     |          yes           |
        | English    | en       |      yes     |          yes           |
        | Finnish    | fi       |       no     |          yes           |
        | French     | fr       |      yes     |          yes           |
        | Galician   | gl       |      yes     |           no           |
        | German     | de       |      yes     |          yes           |
        | Italian    | it       |      yes     |          yes           |
        | Norwegian  | no       |      yes     |          yes           |
        | Portuguese | pt       |      yes     |          yes           |
        | Russian    | ru       |      yes     |          yes           |
        | Spanish    | es       |       no     |          yes           |
        | Swedish    | sv       |      yes     |          yes           |
        +---------------------------------------------------------------+

       5) Lingua::Stem is faster for 'stem' (circa 30% faster than
          Lingua::Stem::Snowball).

       6) Lingua::Stem::Snowball is faster for 'stem_in_place' (circa 30%
          faster than Lingua::Stem).

       7) Lingua::Stem::Snowball is more consistent with regard to character
          set issues.

       8) Lingua::Stem::Snowball is under active development. Lingua::Stem is
          currently fairly static.

       Some benchmarks using Lingua::Stem 0.82 and Lingua::Stem::Snowball 0.94
       give an idea of how various options impact performance. The dataset
       was The Works of Edgar Allan Poe, volumes 1-5, from Project Gutenberg,
       processed 10 times in a row as a single batch of words. (Processing a
       long text one word at a time is very inefficient and drops the
       performance of Lingua::Stem by about 90%, so "Don't Do That" ;) )

       The benchmarks were run on a 3.06 GHz P4 with HT on Fedora Core 5 Linux
       using Perl 5.8.8.

        +------------------------------------------------------------------------+
        | source: collected_works_poe.txt | words: 454691 | unique words: 22802  |
        |------------------------------------------------------------------------|
        | module                          | config        | avg secs | words/sec |
        |------------------------------------------------------------------------|
        | Lingua::Stem 0.82               | no cache      | 1.922    |  236560   |
        | Lingua::Stem 0.82               | cache level 2 | 1.235    |  368292   |
        | Lingua::Stem 0.82               | cachelv2, sip | 0.798    |  569494   |
        | Lingua::Stem::Snowball 0.94     | stem          | 1.622    |  280276   |
        | Lingua::Stem::Snowball 0.94     | stem_in_place | 0.627    |  725129   |
        +------------------------------------------------------------------------+

       In the table, 'cachelv2, sip' means cache level 2 combined with
       'stem_in_place'.

       The script for the benchmark is included in the examples/ directory of
       this distribution as benchmark_stemmers.plx.

CHANGES

        2.31 2020.09.26 - Fix for Latin1/UTF8 issue in documentation

        2.30 2020.06.20 - Version renumber for module consistency

        0.84 2010.04.29 - Documentation fixes to the En stemmer and removal
                          of the accidentally included lib/Lingua/test.pl file.
                          Thanks go to Aaron Naiman for bringing the
                          documentation error to my attention and to
                          Alexandr Ciornii and 'kmx' for pointing out
                          the problem with the test.pl file.

        0.83 2007.06.23 - Disabled Italian locale build tests due to
                          changes in Lingua::Stem::It breaking the tests.

        0.82 2006.07.23 - Added 'stem_in_place' to base package.
                          Tweaks to documentation and build tests.

        0.81 2004.07.26 - Minor documentation tweak. No functional change.

        0.80 2004.07.25 - Added 'RU', 'RU_RU', 'RU_RU.KOI-8' locales.
                          Added support for Lingua::Stem::Ru to
                          Makefile.PL and autoloader.

                          Added documentation stressing use of caching
                          and batches for performance. Added support
                          for '_' as a separator in the locale strings.
                          Added example benchmark script. Expanded copyright
                          credits.

        0.70 2004.04.26 - Added FR locale and documentation fixes
                          to Lingua::Stem::Gl

        0.61 2003.09.28 - Documentation fixes. No functional changes.

        0.60 2003.04.05 - Added more locales by wrapping various stemming
                          implementations. Documented currently supported
                          list of locales.

        0.50 2000.09.14 - Fixed major implementation error. Starting with
                          version 0.30 I forgot to include rulesets 2, 3 and 4
                          for Porter's algorithm. The resulting stemming results
                          were very poor. Thanks go to <csyap@netfision.com>
                          for bringing the problem to my attention.

                          Unfortunately, the fix inherently generates *different*
                          stemming results than 0.30 and 0.40 did. If you
                          need identically broken output - use locale 'en-broken'.

        0.40 2000.08.25 - Added stem caching support as an option. This
                          can provide a large speedup to the operation
                          of the stemmer. Caching is turned off by default
                          to maximize compatibility with previous versions.

        0.30 1999.06.24 - Replaced core of 'En' stemmers with code from
                          Jim Richardson <jimr@maths.usyd.edu.au>
                          Aliased 'en-us' and 'en-uk' to 'en'
                          Fixed 'SYNOPSIS' to correct return value
                          type for stemmed words (SYNOPSIS error spotted
                          by <Arved_37@chebucto.ns.ca>)

        0.20 1999.06.15 - Changed to '.pm' module, moved into Lingua:: namespace,
                          added OO interface, optionalized the export of routines
                          into the caller's namespace, added named parameter
                          initialization, stemming exceptions, autoloaded
                          locale support and isolated case flattening to
                          localized stemmers to prevent i18n problems later.

                          Input and output text are assumed to be in UTF8
                          encoding (no operational impact right now, but
                          will be important when extending the module to
                          non-English).

METHODS

       new(...);
           Returns a new instance of a Lingua::Stem object and, optionally,
           selects the locale to be used for stemming.

           Examples:

             # By default the locale is en
             $us_stemmer = Lingua::Stem->new;

             # Turn on the cache
             $us_stemmer->stem_caching({ -level => 2 });

             # Overriding the default for a specific instance
             $uk_stemmer = Lingua::Stem->new({ -locale => 'en-uk' });

             # Overriding the default for a specific instance and changing the default
             $uk_stemmer = Lingua::Stem->new({ -default_locale => 'en-uk' });

       set_locale($locale);
           Sets the locale to one of the recognized locales. Locale
           identifiers are converted to lowercase.

           Called as a class method, it changes the default locale for all
           subsequently generated object instances.

           Called as an instance method, it only changes the locale for that
           particular instance.

           'croaks' if passed an unknown locale.

           Examples:

            # Change default locale
            Lingua::Stem::set_locale('en-uk'); # UK's spellings

            # Change instance locale
            $stemmer->set_locale('en-us');  # US's spellings
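
           Because 'set_locale' croaks on an unknown locale, callers that
           take locale names from user input may want to trap the error.
           A minimal sketch, assuming 'xx-nowhere' (a made-up identifier)
           is not among the supported locales:

```perl
use strict;
use warnings;
use Lingua::Stem;

my $stemmer = Lingua::Stem->new;

# Identifiers are lowercased, so 'EN-UK' is accepted like 'en-uk'.
$stemmer->set_locale('EN-UK');
my $locale = $stemmer->get_locale;
print "instance locale: $locale\n";

# 'xx-nowhere' is a hypothetical, unsupported identifier, so set_locale
# croaks; eval traps the exception instead of ending the program.
my $ok = eval { $stemmer->set_locale('xx-nowhere'); 1 };
print "set_locale failed: $@" unless $ok;
```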

       get_locale;
           Called as a class method, returns the current default locale.

           Example:

            $default_locale = Lingua::Stem::get_locale;

           Called as an instance method, returns the locale for the instance.

            $instance_locale = $stemmer->get_locale;

       add_exceptions($exceptions_hash_ref);
           Exceptions allow overriding the stemming algorithm on a case by
           case basis. It is done on an exact match and substitution basis: If
           a passed word is identical to the exception it will be replaced by
           the specified value. No case adjustments are performed.

           Called as a class method, adds exceptions to the default exceptions
           list used for subsequent instantiations of Lingua::Stem objects.

           Example:

            # adding default exceptions
            Lingua::Stem::add_exceptions({ 'emily' => 'emily',
                                           'driven' => 'driven',
                                       });

           Called as an instance method, adds exceptions only to the specific
           instance.

            # adding instance exceptions
            $stemmer->add_exceptions({ 'steely' => 'steely' });

           The exceptions short-circuit the normal stemming: if an exception
           matches, no further stemming is performed after the substitution.

           Adding an exception with the same key value as an already defined
           exception replaces the pre-existing exception with the new value.
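
           The behaviour described above can be seen end to end in a short
           sketch. The words are arbitrary examples; the calls are the
           instance methods documented in this section.

```perl
use strict;
use warnings;
use Lingua::Stem;

my $stemmer = Lingua::Stem->new(-locale => 'en');

# Keep 'driven' unstemmed for this instance only.
$stemmer->add_exceptions({ 'driven' => 'driven' });

my $stems = $stemmer->stem('driven', 'driving');
print "driven  => $stems->[0]\n";   # the exception value is used as-is
print "driving => $stems->[1]\n";   # no exception, so normal stemming applies

# Re-adding the same key replaces the earlier value.
$stemmer->add_exceptions({ 'driven' => 'drive' });
my $current = $stemmer->get_exceptions('driven');
print "exception now maps driven => $current->{driven}\n";
```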

       delete_exceptions(@exceptions_list);
           The mirror of add_exceptions, this allows the _removal_ of
           exceptions from either the defaults for the class or from the
           instance.

            # Deletion of exceptions from class default exceptions
            Lingua::Stem::delete_exceptions('aragorn','frodo','samwise');

            # Deletion of exceptions from instance
            $stemmer->delete_exceptions('smaug','sauron','gollum');

            # Deletion of all class default exceptions
            Lingua::Stem::delete_exceptions;

            # Deletion of all exceptions from instance
            $stemmer->delete_exceptions;

       get_exceptions;
           As a class method with no parameters, it returns all the default
           exceptions as an anonymous hash of 'exception' => 'replace with'
           pairs.

           Example:

            # Returns all class default exceptions
            $exceptions = Lingua::Stem::get_exceptions;

           As a class method with parameters, it returns the default
           exceptions listed in the parameters as an anonymous hash of
           'exception' => 'replace with' pairs.  If a parameter specifies an
           undefined 'exception', the value is set to undef.

            # Returns class default exceptions for 'emily' and 'george'
            $exceptions = Lingua::Stem::get_exceptions('emily','george');

           As an instance method with no parameters, it returns the currently
           active exceptions for the instance.

            # Returns all instance exceptions
            $exceptions = $stemmer->get_exceptions;

           As an instance method with parameters, it returns the instance
           exceptions listed in the parameters as an anonymous hash of
           'exception' => 'replace with' pairs.  If a parameter specifies an
           undefined 'exception', the value is set to undef.

            # Returns instance exceptions for 'lisa' and 'bart'
            $exceptions = $stemmer->get_exceptions('lisa','bart');
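
           To make the 'undef for an undefined exception' rule concrete,
           here is a small sketch that adds, queries and deletes instance
           exceptions. The names are arbitrary sample words.

```perl
use strict;
use warnings;
use Lingua::Stem;

my $stemmer = Lingua::Stem->new;
$stemmer->add_exceptions({ 'lisa' => 'lisa', 'homer' => 'homer' });

# 'bart' was never added, so its value comes back as undef.
my $some = $stemmer->get_exceptions('lisa', 'bart');
for my $word (sort keys %$some) {
    my $value = defined $some->{$word} ? $some->{$word} : '(undef)';
    print "$word => $value\n";
}

# Remove one exception, then list whatever is left on the instance.
$stemmer->delete_exceptions('lisa');
my $remaining = $stemmer->get_exceptions;
print "remaining: ", join(', ', sort keys %$remaining), "\n";
```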

       stem(@list);
           Called as a class method, it applies the default settings and stems
           the list of passed words, returning an anonymous array with the
           stemmed words in the same order as the passed list of words.

           Example:

               # Default settings applied
               my $anon_array_of_stemmed_words = Lingua::Stem::stem(@words);

           Called as an instance method, it applies the instance's settings
           and stems the list of passed words, returning an anonymous array
           with the stemmed words in the same order as the passed list of
           words.

              # Instance's settings applied
              my $stemmed_words = $stemmer->stem(@words);

           The stemmer performs best when handed long lists of words rather
           than one word at a time. The cache also provides a huge speed up if
           you are processing lots of text.
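
           Since the result preserves input order, a word-to-stem lookup
           can be built with one batched call. The following is only a
           sketch; the sample sentence and the '%stem_of' hash are
           illustrative names, not part of the module.

```perl
use strict;
use warnings;
use Lingua::Stem;

# Hypothetical sample text; split it into lowercase words first.
my $text  = 'Stemming reduces related words to a shared stem';
my @words = grep { length } split /\W+/, lc $text;

my $stemmer = Lingua::Stem->new(-locale => 'en');
my $stems   = $stemmer->stem(@words);    # one call for the whole list

# Results come back in input order, so $stems->[$i] belongs to $words[$i].
my %stem_of;
@stem_of{@words} = @$stems;

print "$_ => $stem_of{$_}\n" for @words;
```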

       stem_in_place(@list);
           Stems the passed list of words 'in place'. It returns a reference
           to the modified list.  This is about 60% faster than the 'stem'
           method but modifies the original list. This currently only works
           for the English locales.

           Example:

             my @words = ( 'a', 'list', 'of', 'words' );
             my $stemmed_list_of_words = stem_in_place(@words);

             # '$stemmed_list_of_words' refers to the @words list
             # after 'stem_in_place' has executed

           DO NOT use this method of stemming if you need to keep the original
           list of words. Its performance gain derives entirely from the fact
           it does not make a copy of the original list but instead overwrites
           the original list.

           If you try something like

             my @words_for_stemming = @words;
             my $stemmed_list_of_words = stem_in_place(@words_for_stemming);

           thinking you will get a speed boost while keeping the original
           list, you won't: You wipe out the speed gain completely with your
           copying of the original list. You should just use the 'stem' method
           instead on the original list of words if you need to keep the
           original list.

       clear_stem_cache;
           Clears the stemming cache for the current locale. Can be called as
           either a class method or an instance method.

               $stemmer->clear_stem_cache;

               clear_stem_cache;

       stem_caching ({ -level => 0|1|2 });
           Sets stemming cache level for the current locale. Can be called as
           either a class method or an instance method.

               $stemmer->stem_caching({ -level => 1 });

               stem_caching({ -level => 1 });

           For the sake of maximum compatibility with previous versions, stem
           caching is set to '-level => 0' by default.

           '-level' definitions

            '0' means 'no caching'. This is the default level.

            '1' means 'cache per run'. This caches stemming results during each
               call to 'stem'.

            '2' means 'cache indefinitely'. This caches stemming results until
               either the process exits or the 'clear_stem_cache' method is called.

           Stem caching is global to the locale. If you turn on stem caching
           for one instance of a locale stemmer, all instances using the same
           locale will have it turned on as well.

           I STRONGLY suggest turning caching on if you have enough memory and
           are processing a lot of data.
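
           The sketch below switches between cache levels on one instance
           and then clears the cache. It assumes, as one would expect, that
           caching changes only speed and not the stems returned; the word
           list is an arbitrary sample.

```perl
use strict;
use warnings;
use Lingua::Stem;

my @words   = qw(connection connected connecting connection connected);
my $stemmer = Lingua::Stem->new(-locale => 'en');

# Level 0 (the default): no caching.
$stemmer->stem_caching({ -level => 0 });
my $uncached = $stemmer->stem(@words);

# Level 2: results stay cached across calls for this locale.
$stemmer->stem_caching({ -level => 2 });
my $cached = $stemmer->stem(@words);

# Compare the two result lists as strings.
print "same stems: ", ("@$uncached" eq "@$cached" ? "yes" : "no"), "\n";

# Release the cached entries and return to the default level.
$stemmer->clear_stem_cache;
$stemmer->stem_caching({ -level => 0 });
```

           Because the cache is global to the locale, resetting the level
           at the end avoids surprising other stemmers in the same process.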

VERSION

        2.31 2020.09.26

NOTES

       It started with the 'Text::Stem' module, which has been adapted into a
       more general framework, moved into the more language oriented
       'Lingua' namespace, and reorganized to support an OOP interface as
       well as switching the core 'En' locale stemmers.

       Version 0.40 added a cache for stemmed words. This can provide up to a
       several-fold performance improvement.

       Organization is such that extending this module to any number of
       languages should be direct and simple.

       Case flattening is a function of the language, so the 'exceptions'
       methods have to be used appropriately to the language. For 'En' family
       stemming, use lower case words, only, for exceptions.
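
       A small sketch of the lower case rule for 'En' exceptions follows.
       It assumes the 'En' stemmer flattens input to lower case before
       consulting the exceptions list, which is what the note above
       implies; the word 'emily' is just a sample.

```perl
use strict;
use warnings;
use Lingua::Stem;

my $stemmer = Lingua::Stem->new(-locale => 'en');

# The key is lower case, as recommended for 'En' family stemming.
$stemmer->add_exceptions({ 'emily' => 'emily' });

# Mixed-case input is assumed to be case-flattened by the 'En' stemmer.
my $stems = $stemmer->stem('Emily', 'emily');
print join(', ', @$stems), "\n";
```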

AUTHORS

        Jerilyn Franz <cpan@jerilyn.info>
        Jim Richardson  <imr@maths.usyd.edu.au>

CREDITS

        Jim Richardson             <imr@maths.usyd.edu.au>
        Ulrich Pfeifer             <pfeifer@ls6.informatik.uni-dortmund.de>
        Aldo Calpini               <dada@perl.it>
        xern                       <xern@cpan.org>
        Ask Solem Hoel             <ask@unixmonks.net>
        Dennis Haney               <davh@davh.dk>
        Sébastien Darribere-Pleyt  <sebastien.darribere@lefute.com>
        Aleksandr Guidrevitch      <pillgrim@mail.ru>

SEE ALSO

        Lingua::Stem::En            Lingua::Stem::Da            Lingua::Stem::De
        Lingua::Stem::Gl            Lingua::Stem::No            Lingua::Stem::Pt
        Lingua::Stem::Sv            Lingua::Stem::It            Lingua::Stem::Fr
        Lingua::Stem::Ru            Text::German                Lingua::PT::Stemmer
        Lingua::GL::Stemmer         Lingua::Stem::Snowball::No  Lingua::Stem::Snowball::Se
        Lingua::Stem::Snowball::Da  Lingua::Stem::Snowball::Sv  Lingua::Stemmer::GL
        Lingua::Stem::Snowball

        http://snowball.tartarus.org

COPYRIGHT

       Copyright 1999-2004

       Freerun Technologies, Inc (Freerun), Jim Richardson, University of
       Sydney <imr@maths.usyd.edu.au> and Jerilyn Franz <cpan@jerilyn.info>.
       All rights reserved.

       Text::German was written and is copyrighted by Ulrich Pfeifer.

       Lingua::Stem::Snowball::Da was written and is copyrighted by Dennis
       Haney and Ask Solem Hoel.

       Lingua::Stem::It was written and is copyrighted by Aldo Calpini.

       Lingua::Stem::Snowball::No, Lingua::Stem::Snowball::Se and
       Lingua::Stem::Snowball::Sv were written and are copyrighted by Ask
       Solem Hoel.

       Lingua::Stemmer::GL and Lingua::PT::Stemmer were written and are
       copyrighted by Xern.

       Lingua::Stem::Fr was written and is copyrighted by Aldo Calpini and
       Sébastien Darribere-Pleyt.

       Lingua::Stem::Ru was written and is copyrighted by Aleksandr
       Guidrevitch.

       This software may be freely copied and distributed under the same terms
       and conditions as Perl.

BUGS

       None known.

TODO

       Add more languages. Extend regression tests. Add support for the
       Lingua::Stem::Snowball family of stemmers as an alternative core
       stemming engine. Extend 'stem_in_place' functionality to non-English
       stemmers.

perl v5.36.0                      2023-01-20                   Lingua::Stem(3)