Lingua::Stem(3)       User Contributed Perl Documentation      Lingua::Stem(3)

NAME

       Lingua::Stem - Stemming of words

SYNOPSIS

           use Lingua::Stem qw(stem);
           my $stemmed_words_anon_array   = stem(@words);

           or for the OO inclined,

           use Lingua::Stem;
           my $stemmer = Lingua::Stem->new(-locale => 'EN-UK');
           $stemmer->stem_caching({ -level => 2 });
           my $stemmed_words_anon_array   = $stemmer->stem(@words);

DESCRIPTION

       This routine applies stemming algorithms to its parameters, returning
       the stemmed words as appropriate to the selected locale.

       You can import some or all of the class methods.

       use Lingua::Stem qw (stem clear_stem_cache stem_caching
                            add_exceptions delete_exceptions
                            get_exceptions set_locale get_locale
                            :all :locale :exceptions :stem :caching);

        :all        - imports  stem add_exceptions delete_exceptions get_exceptions
                      set_locale get_locale
        :stem       - imports  stem
        :caching    - imports  stem_caching clear_stem_cache
        :locale     - imports  set_locale get_locale
        :exceptions - imports  add_exceptions delete_exceptions get_exceptions

       Currently supported locales are:

             DA          - Danish
             DE          - German
             EN          - English (also EN-US and EN-UK)
             FR          - French
             GL          - Galician
             IT          - Italian
             NO          - Norwegian
             PT          - Portuguese
             RU          - Russian (also RU-RU and RU-RU.KOI8-R)
             SV          - Swedish

       If you have the memory and lots of stemming to do, I strongly suggest
       using cache level 2 and processing lists in 'big chunks' (long lists)
       for best performance.
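
       As a rough sketch of that advice, the following gathers every word
       of a text into one list and stems it in a single call with cache
       level 2. The sample text and the word-splitting regex are
       assumptions made for illustration; they are not part of the module.

```perl
use strict;
use warnings;
use Lingua::Stem;

# Sample text (illustrative only); real input would be a large document.
my $text = 'The stemmer performs best when handed long lists of words';

# Collect all words up front instead of stemming them one at a time.
my @words = map { lc $_ } $text =~ /(\w+)/g;

my $stemmer = Lingua::Stem->new({ -locale => 'en' });
$stemmer->stem_caching({ -level => 2 });   # keep results until cleared

# One call for the whole list; the result is an array reference.
my $stemmed = $stemmer->stem(@words);

printf "%-10s => %s\n", $words[$_], $stemmed->[$_] for 0 .. $#words;
```

       Because the cache is shared per locale, it probably pays off most
       when the same words recur across many long lists.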

   Comparison with Lingua::Stem::Snowball
       Lingua::Stem functions much like the Lingua::Stem::Snowball suite of
       stemmers. The most significant differences are:

       1) Lingua::Stem is a 'pure perl' (no compiled XS code is needed) suite.
          Lingua::Stem::Snowball is XS based (must be compiled).

       2) Lingua::Stem works with Perl 5.6 or later.
          Lingua::Stem::Snowball works with Perl 5.8 or later.

       3) Lingua::Stem has an 'exceptions' system allowing you to override
          stemming on a 'case by case' basis.
          Lingua::Stem::Snowball does not have an 'exceptions' system.

       4) A somewhat different set of supported languages:

        +---------------------------------------------------------------+
        | Language   | ISO code | Lingua::Stem | Lingua::Stem::Snowball |
        |---------------------------------------------------------------|
        | Danish     | da       |      yes     |          yes           |
        | Dutch      | nl       |       no     |          yes           |
        | English    | en       |      yes     |          yes           |
        | Finnish    | fi       |       no     |          yes           |
        | French     | fr       |      yes     |          yes           |
        | Galician   | gl       |      yes     |           no           |
        | German     | de       |      yes     |          yes           |
        | Italian    | it       |      yes     |          yes           |
        | Norwegian  | no       |      yes     |          yes           |
        | Portuguese | pt       |      yes     |          yes           |
        | Russian    | ru       |      yes     |          yes           |
        | Spanish    | es       |       no     |          yes           |
        | Swedish    | sv       |      yes     |          yes           |
        +---------------------------------------------------------------+

       5) Lingua::Stem is faster for 'stem' (circa 30% faster than
          Lingua::Stem::Snowball).

       6) Lingua::Stem::Snowball is faster for 'stem_in_place' (circa 30%
          faster than Lingua::Stem).

       7) Lingua::Stem::Snowball is more consistent with regard to character
          set issues.

       8) Lingua::Stem::Snowball is under active development. Lingua::Stem is
          currently fairly static.

       Some benchmarks using Lingua::Stem 0.82 and Lingua::Stem::Snowball 0.94
       give an idea of how various options impact performance. The dataset
       was The Works of Edgar Allan Poe, volumes 1-5, from Project
       Gutenberg, processed 10 times in a row as a single batch of words.
       (Processing a long text one word at a time is very inefficient and
       drops the performance of Lingua::Stem by about 90%, so "Don't Do
       That" ;) )

       The benchmarks were run on a 3.06 GHz P4 with HT on Fedora Core 5 Linux
       using Perl 5.8.8.

        +------------------------------------------------------------------------+
        | source: collected_works_poe.txt | words: 454691 | unique words: 22802  |
        |------------------------------------------------------------------------|
        | module                          | config        | avg secs | words/sec |
        |------------------------------------------------------------------------|
        | Lingua::Stem 0.82               | no cache      | 1.922    |  236560   |
        | Lingua::Stem 0.82               | cache level 2 | 1.235    |  368292   |
        | Lingua::Stem 0.82               | cachelv2, sip | 0.798    |  569494   |
        | Lingua::Stem::Snowball 0.94     | stem          | 1.622    |  280276   |
        | Lingua::Stem::Snowball 0.94     | stem_in_place | 0.627    |  725129   |
        +------------------------------------------------------------------------+

       In the table, 'cachelv2, sip' means cache level 2 combined with
       'stem_in_place'.

       The script for the benchmark is included in the examples/ directory of
       this distribution as benchmark_stemmers.plx.
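
       The words/sec column appears to be the word count divided by the
       average seconds. The short calculation below, which uses only the
       figures printed in the table, recomputes the throughput and one of
       the speed ratios quoted in the comparison list. The labels are
       shorthand chosen for this sketch.

```perl
use strict;
use warnings;

# Figures copied from the benchmark table.
my $words = 454691;
my @runs  = (
    [ 'Lingua::Stem, no cache',                1.922 ],
    [ 'Lingua::Stem, cache level 2',           1.235 ],
    [ 'Lingua::Stem, cache level 2, in place', 0.798 ],
    [ 'Lingua::Stem::Snowball, stem',          1.622 ],
    [ 'Lingua::Stem::Snowball, stem_in_place', 0.627 ],
);

# Throughput = words / average seconds. Small differences from the table
# are expected because the published timings are rounded.
my %rate;
for my $run (@runs) {
    my ($label, $secs) = @$run;
    $rate{$label} = $words / $secs;
    printf "%-40s %8.0f words/sec\n", $label, $rate{$label};
}

# Cached Lingua::Stem 'stem' relative to Snowball 'stem'.
my $ratio = $rate{'Lingua::Stem, cache level 2'}
          / $rate{'Lingua::Stem::Snowball, stem'};
printf "cached stem vs Snowball stem: %.2fx\n", $ratio;
```

       The ratio comes out at roughly 1.3, which matches the 'circa 30%
       faster' figure given for 'stem'.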

CHANGES

        0.84 2010.04.29 - Documentation fixes to the En stemmer and removal
                          of the accidentally included lib/Lingua/test.pl file.
                          Thanks go to Aaron Naiman for bringing the
                          documentation error to my attention and to
                          Alexandr Ciornii and 'kmx' for pointing out
                          the problem with the test.pl file.

        0.83 2007.06.23 - Disabled Italian locale build tests due to
                          changes in Lingua::Stem::It breaking the tests.

        0.82 2006.07.23 - Added 'stem_in_place' to base package.
                          Tweaks to documentation and build tests.

        0.81 2004.07.26 - Minor documentation tweak. No functional change.

        0.80 2004.07.25 - Added 'RU', 'RU_RU', 'RU_RU.KOI-8' locale.
                          Added support for Lingua::Stem::Ru to
                          Makefile.PL and autoloader.

                          Added documentation stressing use of caching
                          and batches for performance. Added support
                          for '_' as a separator in the locale strings.
                          Added example benchmark script. Expanded copyright
                          credits.

        0.70 2004.04.26 - Added FR locale and documentation fixes
                          to Lingua::Stem::Gl

        0.61 2003.09.28 - Documentation fixes. No functional changes.

        0.60 2003.04.05 - Added more locales by wrapping various stemming
                          implementations. Documented currently supported
                          list of locales.

        0.50 2000.09.14 - Fixed major implementation error. Starting with
                          version 0.30 I forgot to include rulesets 2, 3 and 4
                          for Porter's algorithm. The resulting stemming results
                          were very poor. Thanks go to <csyap@netfision.com>
                          for bringing the problem to my attention.

                          Unfortunately, the fix inherently generates *different*
                          stemming results than 0.30 and 0.40 did. If you
                          need identically broken output - use locale 'en-broken'.

        0.40 2000.08.25 - Added stem caching support as an option. This
                          can provide a large speedup to the operation
                          of the stemmer. Caching is turned off by default
                          to maximize compatibility with previous versions.

        0.30 1999.06.24 - Replaced core of 'En' stemmers with code from
                          Jim Richardson <jimr@maths.usyd.edu.au>
                          Aliased 'en-us' and 'en-uk' to 'en'
                          Fixed 'SYNOPSIS' to correct return value
                          type for stemmed words (SYNOPSIS error spotted
                          by <Arved_37@chebucto.ns.ca>)

        0.20 1999.06.15 - Changed to '.pm' module, moved into Lingua:: namespace,
                          added OO interface, optionalized the export of routines
                          into the caller's namespace, added named parameter
                          initialization, stemming exceptions, autoloaded
                          locale support and isolated case flattening to
                          localized stemmers to prevent i18n problems later.

                          Input and output text are assumed to be in UTF8
                          encoding (no operational impact right now, but
                          will be important when extending the module to
                          non-English).

METHODS

       new(...);
           Returns a new Lingua::Stem object, optionally selecting the
           locale to be used for stemming.

           Examples:

             # By default the locale is en
             $us_stemmer = Lingua::Stem->new;

             # Turn on the cache
             $us_stemmer->stem_caching({ -level => 2 });

             # Overriding the default for a specific instance
             $uk_stemmer = Lingua::Stem->new({ -locale => 'en-uk' });

             # Overriding the default for a specific instance and changing the default
             $uk_stemmer = Lingua::Stem->new({ -default_locale => 'en-uk' });

       set_locale($locale);
           Sets the locale to one of the recognized locales. Locale
           identifiers are converted to lowercase.

           Called as a class method, it changes the default locale for all
           subsequently generated object instances.

           Called as an instance method, it only changes the locale for that
           particular instance.

           'croaks' if passed an unknown locale.

           Examples:

            # Change default locale
            Lingua::Stem::set_locale('en-uk'); # UK's spellings

            # Change instance locale
            $stemmer->set_locale('en-us');  # US's spellings
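
           Since an unknown locale makes 'set_locale' croak, a caller that
           takes locale names from outside may want to trap the failure.
           The sketch below does so with 'eval'; the identifier 'xx-yy' is
           made up purely to exercise the failure path.

```perl
use strict;
use warnings;
use Lingua::Stem;

my $stemmer = Lingua::Stem->new;

# 'xx-yy' is a hypothetical, unsupported identifier used only for the demo.
for my $locale ('EN-UK', 'xx-yy') {
    # eval traps the croak; the trailing 1 marks success.
    my $ok = eval { $stemmer->set_locale($locale); 1 };
    if ($ok) {
        print "locale set to ", $stemmer->get_locale, "\n";
    }
    else {
        print "rejected '$locale': $@";
    }
}
```

           The first call also shows the lowercase conversion: the
           identifier is passed in upper case and read back via
           'get_locale'.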

       get_locale;
           Called as a class method, returns the current default locale.

           Example:

            $default_locale = Lingua::Stem::get_locale;

           Called as an instance method, returns the locale for the instance.

            $instance_locale = $stemmer->get_locale;

       add_exceptions($exceptions_hash_ref);
           Exceptions allow overriding the stemming algorithm on a case by
           case basis. It is done on an exact match and substitution basis: If
           a passed word is identical to the exception it will be replaced by
           the specified value. No case adjustments are performed.

           Called as a class method, adds exceptions to the default exceptions
           list used for subsequent instantiations of Lingua::Stem objects.

           Example:

            # adding default exceptions
            Lingua::Stem::add_exceptions({ 'emily' => 'emily',
                                           'driven' => 'driven',
                                       });

           Called as an instance method, adds exceptions only to the specific
           instance.

            # adding instance exceptions
            $stemmer->add_exceptions({ 'steely' => 'steely' });

           The exceptions shortcut the normal stemming - if an exception
           matches no further stemming is performed after the substitution.

           Adding an exception with the same key value as an already defined
           exception replaces the pre-existing exception with the new value.
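
           To make the effect visible, the following sketch stems the same
           two words before adding instance exceptions, after adding them,
           and again after deleting them. The word list is arbitrary, and
           the algorithmic stems are printed rather than assumed.

```perl
use strict;
use warnings;
use Lingua::Stem;

my $stemmer = Lingua::Stem->new({ -locale => 'en' });

my @words  = ('steely', 'driven');
my $before = $stemmer->stem(@words);

# Map each word to itself so it passes through the stemmer unchanged.
$stemmer->add_exceptions({ 'steely' => 'steely', 'driven' => 'driven' });
my $after = $stemmer->stem(@words);

# Removing the exceptions restores the normal algorithmic stemming.
$stemmer->delete_exceptions('steely', 'driven');
my $restored = $stemmer->stem(@words);

print "before:   @$before\n";
print "after:    @$after\n";
print "restored: @$restored\n";
```

           Keys are lower case here because, as the NOTES section says,
           'En' family exceptions should use lower case words only.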

       delete_exceptions(@exceptions_list);
           The mirror of add_exceptions, this allows the _removal_ of
           exceptions from either the defaults for the class or from the
           instance.

            # Deletion of exceptions from class default exceptions
            Lingua::Stem::delete_exceptions('aragorn','frodo','samwise');

            # Deletion of exceptions from instance
            $stemmer->delete_exceptions('smaug','sauron','gollum');

            # Deletion of all class default exceptions
            delete_exceptions;

            # Deletion of all exceptions from instance
            $stemmer->delete_exceptions;

       get_exceptions;
           As a class method with no parameters it returns all the default
           exceptions as an anonymous hash of 'exception' => 'replace with'
           pairs.

           Example:

            # Returns all class default exceptions
            $exceptions = Lingua::Stem::get_exceptions;

           As a class method with parameters, it returns the default
           exceptions listed in the parameters as an anonymous hash of
           'exception' => 'replace with' pairs.  If a parameter specifies an
           undefined 'exception', the value is set to undef.

            # Returns class default exceptions for 'emily' and 'george'
            $exceptions = Lingua::Stem::get_exceptions('emily','george');

           As an instance method, with no parameters it returns the currently
           active exceptions for the instance.

            # Returns all instance exceptions
            $exceptions = $stemmer->get_exceptions;

           As an instance method with parameters, it returns the instance
           exceptions listed in the parameters as an anonymous hash of
           'exception' => 'replace with' pairs.  If a parameter specifies an
           undefined 'exception', the value is set to undef.

            # Returns instance exceptions for 'lisa' and 'bart'
            $exceptions = $stemmer->get_exceptions('lisa','bart');
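
           The undef convention means a caller should test values with
           'defined' rather than rely on the key being absent. A small
           sketch, assuming no class default exception exists for 'bart':

```perl
use strict;
use warnings;
use Lingua::Stem;

my $stemmer = Lingua::Stem->new;
$stemmer->add_exceptions({ 'lisa' => 'lisa' });

# Ask about one defined exception and one that was never added.
my $found = $stemmer->get_exceptions('lisa', 'bart');

for my $word (sort keys %$found) {
    my $value = defined $found->{$word}
              ? "'$found->{$word}'"
              : 'undef (no exception defined)';
    print "$word => $value\n";
}
```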

       stem(@list);
           Called as a class method, it applies the default settings and stems
           the list of passed words, returning an anonymous array with the
           stemmed words in the same order as the passed list of words.

           Example:

               # Default settings applied
               my $anon_array_of_stemmed_words = Lingua::Stem::stem(@words);

           Called as an instance method, it applies the instance's settings
           and stems the list of passed words, returning an anonymous array
           with the stemmed words in the same order as the passed list of
           words.

              # Instance's settings applied
              my $stemmed_words = $stemmer->stem(@words);

           The stemmer performs best when handed long lists of words rather
           than one word at a time. The cache also provides a huge speed up if
           you are processing lots of text.
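
           Because the stems come back in input order, a hash slice can
           pair each word with its stem after a single batched call. The
           sketch below uses that to group words by stem; the word list is
           illustrative and the stems are printed, not assumed.

```perl
use strict;
use warnings;
use Lingua::Stem;

my $stemmer = Lingua::Stem->new;
my @words   = qw(connect connected connecting connection connections);

# One call for the whole list; results are in the same order as @words.
my $stemmed = $stemmer->stem(@words);

# Pair every original word with the stem at the same position.
my %stem_of;
@stem_of{@words} = @$stemmed;

# Group the original words under their stem.
my %group;
push @{ $group{ $stem_of{$_} } }, $_ for @words;

print "$_: @{ $group{$_} }\n" for sort keys %group;
```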

       stem_in_place(@list);
           Stems the passed list of words 'in place'. It returns a reference
           to the modified list.  This is about 60% faster than the 'stem'
           method but modifies the original list. This currently only works
           for the English locales.

            Example:

             my @words = ( 'a', 'list', 'of', 'words' );
             my $stemmed_list_of_words = stem_in_place(@words);

             # '$stemmed_list_of_words' refers to the @words list
             # after 'stem_in_place' has executed

           DO NOT use this method of stemming if you need to keep the original
           list of words. Its performance gain derives entirely from the fact
           that it does not make a copy of the original list but instead
           overwrites the original list.

           If you try something like

             my @words_for_stemming = @words;
             my $stemmed_list_of_words = stem_in_place(@words_for_stemming);

           thinking you will get a speed boost while keeping the original
           list, you won't: You wipe out the speed gain completely with your
           copying of the original list. You should just use the 'stem' method
           instead on the original list of words if you need to keep the
           original list.
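
           The overwrite can be observed directly. In this sketch the
           function is called by its fully qualified name, which avoids
           any assumption about what is exported, and a copy is kept only
           so the change can be printed (which, as noted, gives up the
           speed gain).

```perl
use strict;
use warnings;
use Lingua::Stem;

my @words    = qw(running jumps easily);
my @original = @words;   # demo only: this copy cancels the speed benefit

# Called as a plain function, so the default (English) settings apply.
my $result = Lingua::Stem::stem_in_place(@words);

# @words itself now holds the stems.
print "$original[$_] -> $words[$_]\n" for 0 .. $#words;
```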

       clear_stem_cache;
           Clears the stemming cache for the current locale. Can be called as
           either a class method or an instance method.

               $stemmer->clear_stem_cache;

               clear_stem_cache;

       stem_caching ({ -level => 0|1|2 });
           Sets stemming cache level for the current locale. Can be called as
           either a class method or an instance method.

               $stemmer->stem_caching({ -level => 1 });

               stem_caching({ -level => 1 });

           For the sake of maximum compatibility with previous versions, stem
           caching is set to '-level => 0' by default.

           '-level' definitions

            '0' means 'no caching'. This is the default level.

            '1' means 'cache per run'. This caches stemming results during each
               call to 'stem'.

            '2' means 'cache indefinitely'. This caches stemming results until
               either the process exits or the 'clear_stem_cache' method is called.

           Stem caching is global to the locale. If you turn on stem caching
           for one instance of a locale stemmer, all instances using the same
           locale will have it turned on as well.

           I STRONGLY suggest turning caching on if you have enough memory and
           are processing a lot of data.
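
           One possible pattern, sketched under the assumption that memory
           should be given back once a job is done: turn on level 2 for
           the duration of a batch job, then clear the cache and return to
           level 0. The two short word lists stand in for large documents.

```perl
use strict;
use warnings;
use Lingua::Stem;

my $stemmer = Lingua::Stem->new({ -locale => 'en' });

# Level 2 keeps stems across calls, so later documents reuse earlier work.
$stemmer->stem_caching({ -level => 2 });

# Illustrative stand-ins for large documents.
my @documents = (
    [qw(caching caches cached)],
    [qw(stemming stems stemmed)],
);

my @results;
push @results, $stemmer->stem(@$_) for @documents;

# The cache is global to the locale, so release it and restore the default.
$stemmer->clear_stem_cache;
$stemmer->stem_caching({ -level => 0 });

print "@$_\n" for @results;
```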

VERSION

        0.84 2008.07.27

NOTES

       It started with the 'Text::Stem' module, which has been adapted into a
       more general framework, moved into the more language-oriented
       'Lingua' namespace, and reorganized to support an OOP interface as
       well as to switch the core 'En' locale stemmers.

       Version 0.40 added a cache for stemmed words. This can provide a
       severalfold performance improvement.

       Organization is such that extending this module to any number of
       languages should be direct and simple.

       Case flattening is a function of the language, so the 'exceptions'
       methods have to be used appropriately to the language. For 'En' family
       stemming, use lower case words, only, for exceptions.

AUTHORS

        Benjamin Franz <snowhare@nihongo.org>
        Jim Richardson  <imr@maths.usyd.edu.au>

CREDITS

        Jim Richardson             <imr@maths.usyd.edu.au>
        Ulrich Pfeifer             <pfeifer@ls6.informatik.uni-dortmund.de>
        Aldo Calpini               <dada@perl.it>
        xern                       <xern@cpan.org>
        Ask Solem Hoel             <ask@unixmonks.net>
        Dennis Haney               <davh@davh.dk>
        Sebastien Darribere-Pleyt  <sebastien.darribere@lefute.com>
        Aleksandr Guidrevitch      <pillgrim@mail.ru>

SEE ALSO

        Lingua::Stem::En            Lingua::Stem::Da            Lingua::Stem::De
        Lingua::Stem::Gl            Lingua::Stem::No            Lingua::Stem::Pt
        Lingua::Stem::Sv            Lingua::Stem::It            Lingua::Stem::Fr
        Lingua::Stem::Ru            Text::German                Lingua::PT::Stemmer
        Lingua::GL::Stemmer         Lingua::Stem::Snowball::No  Lingua::Stem::Snowball::Se
        Lingua::Stem::Snowball::Da  Lingua::Stem::Snowball::Sv  Lingua::Stemmer::GL
        Lingua::Stem::Snowball

        http://snowball.tartarus.org

COPYRIGHT

       Copyright 1999-2004

       Freerun Technologies, Inc (Freerun), Jim Richardson, University of
       Sydney <imr@maths.usyd.edu.au> and Benjamin Franz
       <snowhare@nihongo.org>. All rights reserved.

       Text::German was written and is copyrighted by Ulrich Pfeifer.

       Lingua::Stem::Snowball::Da was written and is copyrighted by Dennis
       Haney and Ask Solem Hoel.

       Lingua::Stem::It was written and is copyrighted by Aldo Calpini.

       Lingua::Stem::Snowball::No, Lingua::Stem::Snowball::Se,
       Lingua::Stem::Snowball::Sv were written and are copyrighted by Ask
       Solem Hoel.

       Lingua::Stemmer::GL and Lingua::PT::Stemmer were written and are
       copyrighted by Xern.

       Lingua::Stem::Fr was written and is copyrighted by Aldo Calpini and
       Sébastien Darribere-Pley.

       Lingua::Stem::Ru was written and is copyrighted by Aleksandr
       Guidrevitch.

       This software may be freely copied and distributed under the same terms
       and conditions as Perl.

BUGS

       None known.

TODO

       Add more languages. Extend regression tests. Add support for the
       Lingua::Stem::Snowball family of stemmers as an alternative core
       stemming engine. Extend 'stem_in_place' functionality to non-English
       stemmers.

perl v5.12.1                      2010-09-14                   Lingua::Stem(3)