Lingua::Stem(3) User Contributed Perl Documentation Lingua::Stem(3)

NAME
Lingua::Stem - Stemming of words

SYNOPSIS
    use Lingua::Stem qw(stem);
    my $stemmed_words_anon_array = stem(@words);

or for the OO inclined,

    use Lingua::Stem;
    my $stemmer = Lingua::Stem->new(-locale => 'EN-UK');
    $stemmer->stem_caching({ -level => 2 });
    my $stemmed_words_anon_array = $stemmer->stem(@words);

DESCRIPTION
This routine applies stemming algorithms to its parameters, returning
the stemmed words as appropriate to the selected locale.

You can import some or all of the class methods.

    use Lingua::Stem qw (stem clear_stem_cache stem_caching
                         add_exceptions delete_exceptions
                         get_exceptions set_locale get_locale
                         :all :locale :exceptions :stem :caching);

    :all        - imports stem add_exceptions delete_exceptions get_exceptions
                  set_locale get_locale
    :stem       - imports stem
    :caching    - imports stem_caching clear_stem_cache
    :locale     - imports set_locale get_locale
    :exceptions - imports add_exceptions delete_exceptions get_exceptions

Currently supported locales are:

    DA - Danish
    DE - German
    EN - English (also EN-US and EN-UK)
    FR - French
    GL - Galician
    IT - Italian
    NO - Norwegian
    PT - Portuguese
    RU - Russian (also RU-RU and RU-RU.KOI8-R)
    SV - Swedish

If you have the memory and lots of stemming to do, I strongly suggest
using cache level 2 and processing lists in 'big chunks' (long lists)
for best performance.

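As a rough sketch of that advice, the following splits a text into words once and hands the whole list to a single 'stem' call with cache level 2 enabled. The sample text and the simple regular-expression tokenizer are assumptions made for illustration; they are not part of Lingua::Stem.

```perl
use strict;
use warnings;
use Lingua::Stem;

# Hypothetical sample text; in practice this would be a large document.
my $text = 'The cats were running while the dogs were jumping and running';

# Naive tokenizer (an assumption): lowercase, then split on anything
# that is not a letter or an apostrophe.
my @words = grep { length } split /[^a-z']+/, lc $text;

my $stemmer = Lingua::Stem->new(-locale => 'EN');

# Level 2 keeps stemming results cached until the process exits
# or clear_stem_cache is called.
$stemmer->stem_caching({ -level => 2 });

# One call for the whole batch rather than one call per word.
my $stems = $stemmer->stem(@words);

print join(' ', @$stems), "\n";
```

The returned array reference holds one stem per input word, in the same order as @words.
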
Comparison with Lingua::Stem::Snowball
It functions fairly similarly to the Lingua::Stem::Snowball suite of
stemmers, with the most significant differences being

1) Lingua::Stem is a 'pure perl' (no compiled XS code is needed) suite.
   Lingua::Stem::Snowball is XS based (must be compiled).

2) Lingua::Stem works with Perl 5.6 or later.
   Lingua::Stem::Snowball works with Perl 5.8 or later.

3) Lingua::Stem has an 'exceptions' system allowing you to override
   stemming on a 'case by case' basis.
   Lingua::Stem::Snowball does not have an 'exceptions' system.

4) A somewhat different set of supported languages:

   +---------------------------------------------------------------+
   | Language   | ISO code | Lingua::Stem | Lingua::Stem::Snowball |
   |---------------------------------------------------------------|
   | Danish     | da       | yes          | yes                    |
   | Dutch      | nl       | no           | yes                    |
   | English    | en       | yes          | yes                    |
   | Finnish    | fi       | no           | yes                    |
   | French     | fr       | yes          | yes                    |
   | Galician   | gl       | yes          | no                     |
   | German     | de       | yes          | yes                    |
   | Italian    | it       | yes          | yes                    |
   | Norwegian  | no       | yes          | yes                    |
   | Portuguese | pt       | yes          | yes                    |
   | Russian    | ru       | yes          | yes                    |
   | Spanish    | es       | no           | yes                    |
   | Swedish    | sv       | yes          | yes                    |
   +---------------------------------------------------------------+

5) Lingua::Stem is faster for 'stem' (circa 30% faster than
   Lingua::Stem::Snowball).

6) Lingua::Stem::Snowball is faster for 'stem_in_place' (circa 30%
   faster than Lingua::Stem).

7) Lingua::Stem::Snowball is more consistent with regard to character
   set issues.

8) Lingua::Stem::Snowball is under active development. Lingua::Stem is
   currently fairly static.

Some benchmarks using Lingua::Stem 0.82 and Lingua::Stem::Snowball 0.94
give an idea of how various options impact performance. The dataset
was The Works of Edgar Allan Poe, volumes 1-5, from Project Gutenberg,
processed 10 times in a row as a single batch of words. (Processing a
long text one word at a time is very inefficient and drops the
performance of Lingua::Stem by about 90%, so "Don't Do That" ;) )

The benchmarks were run on a 3.06 GHz P4 with HT on Fedora Core 5 Linux
using Perl 5.8.8.

+------------------------------------------------------------------------+
| source: collected_works_poe.txt | words: 454691 | unique words: 22802  |
|------------------------------------------------------------------------|
| module                      | config        | avg secs   | words/sec   |
|------------------------------------------------------------------------|
| Lingua::Stem 0.82           | no cache      | 1.922      | 236560      |
| Lingua::Stem 0.82           | cache level 2 | 1.235      | 368292      |
| Lingua::Stem 0.82           | cachelv2, sip | 0.798      | 569494      |
| Lingua::Stem::Snowball 0.94 | stem          | 1.622      | 280276      |
| Lingua::Stem::Snowball 0.94 | stem_in_place | 0.627      | 725129      |
+------------------------------------------------------------------------+

The script for the benchmark is included in the examples/ directory of
this distribution as benchmark_stemmers.plx.

CHANGES
0.84 2010.04.29 - Documentation fixes to the En stemmer and removal
                  of the accidentally included lib/Lingua/test.pl file.
                  Thanks go to Aaron Naiman for bringing the
                  documentation error to my attention and to
                  Alexandr Ciornii and 'kmx' for pointing out
                  the problem with the test.pl file.

0.83 2007.06.23 - Disabled Italian locale build tests due to
                  changes in Lingua::Stem::It breaking the tests.

0.82 2006.07.23 - Added 'stem_in_place' to base package.
                  Tweaks to documentation and build tests.

0.81 2004.07.26 - Minor documentation tweak. No functional change.

0.80 2004.07.25 - Added 'RU', 'RU_RU', 'RU_RU.KOI-8' locales.
                  Added support for Lingua::Stem::Ru to
                  Makefile.PL and autoloader.

                  Added documentation stressing use of caching
                  and batches for performance. Added support
                  for '_' as a separator in the locale strings.
                  Added example benchmark script. Expanded copyright
                  credits.

0.70 2004.04.26 - Added FR locale and documentation fixes
                  to Lingua::Stem::Gl

0.61 2003.09.28 - Documentation fixes. No functional changes.

0.60 2003.04.05 - Added more locales by wrapping various stemming
                  implementations. Documented currently supported
                  list of locales.

0.50 2000.09.14 - Fixed major implementation error. Starting with
                  version 0.30 I forgot to include rulesets 2, 3 and 4
                  for Porter's algorithm. The resulting stemming results
                  were very poor. Thanks go to <csyap@netfision.com>
                  for bringing the problem to my attention.

                  Unfortunately, the fix inherently generates *different*
                  stemming results than 0.30 and 0.40 did. If you
                  need identically broken output, use locale 'en-broken'.

0.40 2000.08.25 - Added stem caching support as an option. This
                  can provide a large speedup to the operation
                  of the stemmer. Caching is turned off by default
                  to maximize compatibility with previous versions.

0.30 1999.06.24 - Replaced core of 'En' stemmers with code from
                  Jim Richardson <jimr@maths.usyd.edu.au>
                  Aliased 'en-us' and 'en-uk' to 'en'
                  Fixed 'SYNOPSIS' to correct return value
                  type for stemmed words (SYNOPSIS error spotted
                  by <Arved_37@chebucto.ns.ca>)

0.20 1999.06.15 - Changed to '.pm' module, moved into Lingua:: namespace,
                  added OO interface, optionalized the export of routines
                  into the caller's namespace, added named parameter
                  initialization, stemming exceptions, autoloaded
                  locale support and isolated case flattening to
                  localized stemmers to prevent i18n problems later.

                  Input and output text are assumed to be in UTF8
                  encoding (no operational impact right now, but
                  will be important when extending the module to
                  non-English).

METHODS
new(...);
    Returns a new instance of a Lingua::Stem object, optionally
    selecting the locale to be used for stemming.

    Examples:

        # By default the locale is en
        $us_stemmer = Lingua::Stem->new;

        # Turn on the cache
        $us_stemmer->stem_caching({ -level => 2 });

        # Overriding the default for a specific instance
        $uk_stemmer = Lingua::Stem->new({ -locale => 'en-uk' });

        # Overriding the default for a specific instance and changing the default
        $uk_stemmer = Lingua::Stem->new({ -default_locale => 'en-uk' });

set_locale($locale);
    Sets the locale to one of the recognized locales. Locale
    identifiers are converted to lowercase.

    Called as a class method, it changes the default locale for all
    subsequently generated object instances.

    Called as an instance method, it only changes the locale for that
    particular instance.

    'croaks' if passed an unknown locale.

    Examples:

        # Change default locale
        Lingua::Stem::set_locale('en-uk'); # UK's spellings

        # Change instance locale
        $self->set_locale('en-us');        # US's spellings

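Because 'set_locale' croaks on an unknown locale, a caller that takes locale names from outside input may want to trap the error. A minimal sketch, assuming the usual eval idiom is acceptable here: 'xx-nowhere' is a made-up identifier used only to trigger the failure, and a throwaway object is used for it because the documentation does not say what state an instance is left in after a failed call.

```perl
use strict;
use warnings;
use Lingua::Stem;

# Throwaway instance for the failing call.
my $throwaway = Lingua::Stem->new;

# eval returns 1 on success and undef if set_locale croaks.
my $ok = eval { $throwaway->set_locale('xx-nowhere'); 1 };
print "set_locale failed: $@" unless $ok;

# A recognized locale on a fresh instance; the identifier is given in
# upper case to exercise the documented conversion to lowercase.
my $stemmer = Lingua::Stem->new;
$stemmer->set_locale('EN-UK');
print 'instance locale: ', $stemmer->get_locale, "\n";
```
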
get_locale;
    Called as a class method, returns the current default locale.

    Example:

        $default_locale = Lingua::Stem::get_locale;

    Called as an instance method, returns the locale for the instance.

        $instance_locale = $stemmer->get_locale;

add_exceptions($exceptions_hash_ref);
    Exceptions allow overriding the stemming algorithm on a case by
    case basis. It is done on an exact match and substitution basis: if
    a passed word is identical to the exception, it will be replaced by
    the specified value. No case adjustments are performed.

    Called as a class method, adds exceptions to the default exceptions
    list used for subsequent instantiations of Lingua::Stem objects.

    Example:

        # adding default exceptions
        Lingua::Stem::add_exceptions({ 'emily'  => 'emily',
                                       'driven' => 'driven',
                                     });

    Called as an instance method, adds exceptions only to the specific
    instance.

        # adding instance exceptions
        $stemmer->add_exceptions({ 'steely' => 'steely' });

    The exceptions shortcut the normal stemming: if an exception
    matches, no further stemming is performed after the substitution.

    Adding an exception with the same key value as an already defined
    exception replaces the pre-existing exception with the new value.

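To see the effect of an instance exception, the sketch below stems the same word before and after registering it. The word 'emily' is borrowed from the example above; what the stemmer produces without the exception is simply printed rather than assumed.

```perl
use strict;
use warnings;
use Lingua::Stem;

my $stemmer = Lingua::Stem->new(-locale => 'EN');

# Stem the word with the plain algorithm first.
my $before = $stemmer->stem('emily');

# Register an instance exception that maps the word to itself,
# which bypasses the stemming algorithm for exact matches.
$stemmer->add_exceptions({ 'emily' => 'emily' });
my $after = $stemmer->stem('emily');

print "without exception: $before->[0]\n";
print "with exception:    $after->[0]\n";
```

Since exceptions are matched exactly and without case adjustment, the key is written in lower case here.
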
delete_exceptions(@exceptions_list);
    The mirror of add_exceptions, this allows the _removal_ of
    exceptions from either the defaults for the class or from the
    instance.

        # Deletion of exceptions from class default exceptions
        Lingua::Stem::delete_exceptions('aragorn','frodo','samwise');

        # Deletion of exceptions from instance
        $stemmer->delete_exceptions('smaug','sauron','gollum');

        # Deletion of all class default exceptions
        delete_exceptions;

        # Deletion of all exceptions from instance
        $stemmer->delete_exceptions;

get_exceptions;
    As a class method with no parameters, it returns all the default
    exceptions as an anonymous hash of 'exception' => 'replace with'
    pairs.

    Example:

        # Returns all class default exceptions
        $exceptions = Lingua::Stem::get_exceptions;

    As a class method with parameters, it returns the default
    exceptions listed in the parameters as an anonymous hash of
    'exception' => 'replace with' pairs. If a parameter specifies an
    undefined 'exception', the value is set to undef.

        # Returns class default exceptions for 'emily' and 'george'
        $exceptions = Lingua::Stem::get_exceptions('emily','george');

    As an instance method with no parameters, it returns the currently
    active exceptions for the instance.

        # Returns all instance exceptions
        $exceptions = $stemmer->get_exceptions;

    As an instance method with parameters, it returns the instance
    exceptions listed in the parameters as an anonymous hash of
    'exception' => 'replace with' pairs. If a parameter specifies an
    undefined 'exception', the value is set to undef.

        # Returns instance exceptions for 'lisa' and 'bart'
        $exceptions = $stemmer->get_exceptions('lisa','bart');

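Because a requested word with no exception comes back with an undef value, callers should test with 'defined' rather than relying on the key being absent. A small sketch using the 'lisa' and 'bart' names from the example above, where only 'lisa' has been registered:

```perl
use strict;
use warnings;
use Lingua::Stem;

my $stemmer = Lingua::Stem->new(-locale => 'EN');
$stemmer->add_exceptions({ 'lisa' => 'lisa' });

# Ask for one word that has an exception and one that does not.
my $found = $stemmer->get_exceptions('lisa', 'bart');

for my $word (sort keys %$found) {
    my $value = defined $found->{$word}
              ? $found->{$word}
              : '(no exception defined)';
    print "$word => $value\n";
}
```
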
stem(@list);
    Called as a class method, it applies the default settings and stems
    the list of passed words, returning an anonymous array with the
    stemmed words in the same order as the passed list of words.

    Example:

        # Default settings applied
        my $anon_array_of_stemmed_words = Lingua::Stem::stem(@words);

    Called as an instance method, it applies the instance's settings
    and stems the list of passed words, returning an anonymous array
    with the stemmed words in the same order as the passed list of
    words.

        # Instance's settings applied
        my $stemmed_words = $stemmer->stem(@words);

    The stemmer performs best when handed long lists of words rather
    than one word at a time. The cache also provides a huge speed up if
    you are processing lots of text.

stem_in_place(@list);
    Stems the passed list of words 'in place'. It returns a reference
    to the modified list. This is about 60% faster than the 'stem'
    method but modifies the original list. This currently only works
    for the English locales.

    Example:

        my @words = ( 'a', 'list', 'of', 'words' );
        my $stemmed_list_of_words = stem_in_place(@words);

        # '$stemmed_list_of_words' refers to the @words list
        # after 'stem_in_place' has executed

    DO NOT use this method of stemming if you need to keep the original
    list of words. Its performance gain derives entirely from the fact
    that it does not make a copy of the original list but instead
    overwrites the original list.

    If you try something like

        my @words_for_stemming = @words;
        my $stemmed_list_of_words = stem_in_place(@words_for_stemming);

    thinking you will get a speed boost while keeping the original
    list, you won't: you wipe out the speed gain completely with your
    copying of the original list. You should just use the 'stem' method
    on the original list of words if you need to keep the original
    list.

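The difference between the two methods can be sketched as follows, assuming both are called as instance methods on an English-locale stemmer. The word lists are arbitrary examples; two separate arrays are used only so that each behaviour can be observed on its own.

```perl
use strict;
use warnings;
use Lingua::Stem;

my $stemmer = Lingua::Stem->new(-locale => 'EN');

# stem() works on a copy: @original keeps its contents and the
# stems come back in a new anonymous array.
my @original = qw(running jumps easily);
my $stems    = $stemmer->stem(@original);

# stem_in_place() overwrites the elements of the list it is given,
# so only pass it a list you no longer need in its original form.
my @disposable = qw(running jumps easily);
$stemmer->stem_in_place(@disposable);

print "original:   @original\n";
print "stems:      @$stems\n";
print "disposable: @disposable\n";
```
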
clear_stem_cache;
    Clears the stemming cache for the current locale. Can be called as
    either a class method or an instance method.

        $stemmer->clear_stem_cache;

        clear_stem_cache;

stem_caching ({ -level => 0|1|2 });
    Sets stemming cache level for the current locale. Can be called as
    either a class method or an instance method.

        $stemmer->stem_caching({ -level => 1 });

        stem_caching({ -level => 1 });

    For the sake of maximum compatibility with previous versions, stem
    caching is set to '-level => 0' by default.

    '-level' definitions

    '0' means 'no caching'. This is the default level.

    '1' means 'cache per run'. This caches stemming results during each
    call to 'stem'.

    '2' means 'cache indefinitely'. This caches stemming results until
    either the process exits or the 'clear_stem_cache' method is called.

    Stem caching is global to the locale. If you turn on stem caching
    for one instance of a locale stemmer, all instances using the same
    locale will have it turned on as well.

    I STRONGLY suggest turning caching on if you have enough memory and
    are processing a lot of data.

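One possible way to manage the cache over the life of a job is sketched here: enable level 2, stem several chunks so later chunks can reuse earlier results, then clear the cache and return to level 0. The chunk contents are invented for illustration, and resetting the level at the end is a design choice made because the setting is shared by all stemmers of the same locale.

```perl
use strict;
use warnings;
use Lingua::Stem;

my $stemmer = Lingua::Stem->new(-locale => 'EN');

# Keep results cached across calls to stem().
$stemmer->stem_caching({ -level => 2 });

# Hypothetical chunks of words; 'running' appears in both so the
# second chunk can be served from the cache.
my @chunks = (
    [qw(running jumping walked)],
    [qw(running talked singing)],
);

my @all_stems;
for my $chunk (@chunks) {
    push @all_stems, @{ $stemmer->stem(@$chunk) };
}

# Release the cached stems and restore the default level so other
# stemmers for this locale are not left with caching switched on.
$stemmer->clear_stem_cache;
$stemmer->stem_caching({ -level => 0 });

print join(' ', @all_stems), "\n";
```
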
VERSION
0.84 2008.07.27

NOTES
It started with the 'Text::Stem' module, which has been adapted into a
more general framework, moved into the more language oriented
'Lingua' namespace and re-organized to support an OOP interface as well
as to switch the core 'En' locale stemmers.

Version 0.40 added a cache for stemmed words. This can improve
performance by up to several-fold.

Organization is such that extending this module to any number of
languages should be direct and simple.

Case flattening is a function of the language, so the 'exceptions'
methods have to be used appropriately to the language. For 'En' family
stemming, use only lower case words for exceptions.

AUTHORS
Benjamin Franz <snowhare@nihongo.org>
Jim Richardson <imr@maths.usyd.edu.au>

CREDITS
Jim Richardson <imr@maths.usyd.edu.au>
Ulrich Pfeifer <pfeifer@ls6.informatik.uni-dortmund.de>
Aldo Calpini <dada@perl.it>
xern <xern@cpan.org>
Ask Solem Hoel <ask@unixmonks.net>
Dennis Haney <davh@davh.dk>
Sebastien Darribere-Pleyt <sebastien.darribere@lefute.com>
Aleksandr Guidrevitch <pillgrim@mail.ru>

SEE ALSO
Lingua::Stem::En            Lingua::Stem::Da            Lingua::Stem::De
Lingua::Stem::Gl            Lingua::Stem::No            Lingua::Stem::Pt
Lingua::Stem::Sv            Lingua::Stem::It            Lingua::Stem::Fr
Lingua::Stem::Ru            Text::German                Lingua::PT::Stemmer
Lingua::GL::Stemmer         Lingua::Stem::Snowball::No  Lingua::Stem::Snowball::Se
Lingua::Stem::Snowball::Da  Lingua::Stem::Snowball::Sv  Lingua::Stemmer::GL
Lingua::Stem::Snowball

http://snowball.tartarus.org

COPYRIGHT
Copyright 1999-2004

Freerun Technologies, Inc (Freerun), Jim Richardson, University of
Sydney <imr@maths.usyd.edu.au> and Benjamin Franz
<snowhare@nihongo.org>. All rights reserved.

Text::German was written and is copyrighted by Ulrich Pfeifer.

Lingua::Stem::Snowball::Da was written and is copyrighted by Dennis
Haney and Ask Solem Hoel.

Lingua::Stem::It was written and is copyrighted by Aldo Calpini.

Lingua::Stem::Snowball::No, Lingua::Stem::Snowball::Se,
Lingua::Stem::Snowball::Sv were written and are copyrighted by Ask
Solem Hoel.

Lingua::Stemmer::GL and Lingua::PT::Stemmer were written and are
copyrighted by Xern.

Lingua::Stem::Fr was written and is copyrighted by Aldo Calpini and
Sébastien Darribere-Pleyt.

Lingua::Stem::Ru was written and is copyrighted by Aleksandr
Guidrevitch.

This software may be freely copied and distributed under the same terms
and conditions as Perl.

BUGS
None known.

TODO
Add more languages. Extend regression tests. Add support for the
Lingua::Stem::Snowball family of stemmers as an alternative core
stemming engine. Extend 'stem_in_place' functionality to non-English
stemmers.


perl v5.12.1 2010-09-14 Lingua::Stem(3)