Lingua::Stem(3)       User Contributed Perl Documentation      Lingua::Stem(3)

NAME
Lingua::Stem - Stemming of words

SYNOPSIS
    use Lingua::Stem qw(stem);
    my $stemmed_words_anon_array = stem(@words);

or for the OO inclined,

    use Lingua::Stem;
    my $stemmer = Lingua::Stem->new(-locale => 'EN-UK');
    $stemmer->stem_caching({ -level => 2 });
    my $stemmed_words_anon_array = $stemmer->stem(@words);

DESCRIPTION
This routine applies stemming algorithms to its parameters, returning
the stemmed words as appropriate to the selected locale.

You can import some or all of the class methods.

    use Lingua::Stem qw (stem clear_stem_cache stem_caching
                         add_exceptions delete_exceptions
                         get_exceptions set_locale get_locale
                         :all :locale :exceptions :stem :caching);

    :all        - imports stem add_exceptions delete_exceptions get_exceptions
                  set_locale get_locale
    :stem       - imports stem
    :caching    - imports stem_caching clear_stem_cache
    :locale     - imports set_locale get_locale
    :exceptions - imports add_exceptions delete_exceptions get_exceptions

Currently supported locales are:

    DA - Danish
    DE - German
    EN - English (also EN-US and EN-UK)
    FR - French
    GL - Galician
    IT - Italian
    NO - Norwegian
    PT - Portuguese
    RU - Russian (also RU-RU and RU-RU.KOI8-R)
    SV - Swedish

If you have the memory and lots of stemming to do, I strongly suggest
using cache level 2 and processing lists in 'big chunks' (long lists)
for best performance.

Comparison with Lingua::Stem::Snowball
It functions fairly similarly to the Lingua::Stem::Snowball suite of
stemmers, with the most significant differences being

1) Lingua::Stem is a 'pure perl' (no compiled XS code is needed) suite.
   Lingua::Stem::Snowball is XS based (must be compiled).

2) Lingua::Stem works with Perl 5.6 or later.
   Lingua::Stem::Snowball works with Perl 5.8 or later.

3) Lingua::Stem has an 'exceptions' system allowing you to override
   stemming on a 'case by case' basis.
   Lingua::Stem::Snowball does not have an 'exceptions' system.

4) A somewhat different set of supported languages:

    +---------------------------------------------------------------+
    | Language   | ISO code | Lingua::Stem | Lingua::Stem::Snowball |
    |---------------------------------------------------------------|
    | Danish     | da       | yes          | yes                    |
    | Dutch      | nl       | no           | yes                    |
    | English    | en       | yes          | yes                    |
    | Finnish    | fi       | no           | yes                    |
    | French     | fr       | yes          | yes                    |
    | Galician   | gl       | yes          | no                     |
    | German     | de       | yes          | yes                    |
    | Italian    | it       | yes          | yes                    |
    | Norwegian  | no       | yes          | yes                    |
    | Portuguese | pt       | yes          | yes                    |
    | Russian    | ru       | yes          | yes                    |
    | Spanish    | es       | no           | yes                    |
    | Swedish    | sv       | yes          | yes                    |
    +---------------------------------------------------------------+

5) Lingua::Stem is faster for 'stem' (circa 30% faster than
   Lingua::Stem::Snowball).

6) Lingua::Stem::Snowball is faster for 'stem_in_place' (circa 30%
   faster than Lingua::Stem).

7) Lingua::Stem::Snowball is more consistent with regard to character
   set issues.

8) Lingua::Stem::Snowball is under active development. Lingua::Stem is
   currently fairly static.

Some benchmarks using Lingua::Stem 0.82 and Lingua::Stem::Snowball 0.94
give an idea of how various options impact performance. The dataset
was The Works of Edgar Allan Poe, volumes 1-5 from the Gutenberg
Project, processed 10 times in a row as a single batch of words
(processing a long text one word at a time is very inefficient and
drops the performance of Lingua::Stem by about 90%: So "Don't Do That"
;) )

The benchmarks were run on a 3.06 GHz P4 with HT on Fedora Core 5 Linux
using Perl 5.8.8.

    +------------------------------------------------------------------------+
    | source: collected_works_poe.txt | words: 454691 | unique words: 22802  |
    |------------------------------------------------------------------------|
    | module                       | config         | avg secs  | words/sec  |
    |------------------------------------------------------------------------|
    | Lingua::Stem 0.82            | no cache       | 1.922     | 236560     |
    | Lingua::Stem 0.82            | cache level 2  | 1.235     | 368292     |
    | Lingua::Stem 0.82            | cachelv2, sip  | 0.798     | 569494     |
    | Lingua::Stem::Snowball 0.94  | stem           | 1.622     | 280276     |
    | Lingua::Stem::Snowball 0.94  | stem_in_place  | 0.627     | 725129     |
    +------------------------------------------------------------------------+

The script for the benchmark is included in the examples/ directory of
this distribution as benchmark_stemmers.plx.
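
The relationship between the table's columns and the 'circa 30%' figures
quoted earlier can be checked with a few lines of arithmetic. The sketch
below is not the benchmark script; it only recomputes throughput and
relative speed from the timings printed in the table. The helper names
(words_per_sec, speedup_percent) are made up for this illustration, and
because the timings are rounded to three decimals the recomputed
words/sec values differ slightly from the printed ones.

```perl
use strict;
use warnings;

my $words = 454_691;    # words per pass, from the table header

# [ label => average seconds per pass ], copied from the table
my @rows = (
    [ 'Lingua::Stem, no cache'                => 1.922 ],
    [ 'Lingua::Stem, cache level 2'           => 1.235 ],
    [ 'Lingua::Stem, cache level 2, in place' => 0.798 ],
    [ 'Lingua::Stem::Snowball, stem'          => 1.622 ],
    [ 'Lingua::Stem::Snowball, stem_in_place' => 0.627 ],
);

# Throughput is the word count divided by the elapsed seconds.
sub words_per_sec {
    my ($secs) = @_;
    return $words / $secs;
}

# How much faster the quicker run is, as a percentage of the slower
# run's throughput (equivalently, the ratio of the two timings minus 1).
sub speedup_percent {
    my ( $slower_secs, $faster_secs ) = @_;
    return ( $slower_secs / $faster_secs - 1 ) * 100;
}

for my $row (@rows) {
    my ( $label, $secs ) = @$row;
    printf "%-40s %8.0f words/sec\n", $label, words_per_sec($secs);
}

printf "stem, Lingua::Stem (cache 2) over Snowball: %.0f%% faster\n",
    speedup_percent( 1.622, 1.235 );
printf "stem_in_place, Snowball over Lingua::Stem:  %.0f%% faster\n",
    speedup_percent( 0.798, 0.627 );
```

The two ratios come out near 31% and 27%, which is where the 'circa 30%'
statements in items 5 and 6 come from. Note that the 'stem' comparison
uses the cache level 2 row; without caching, Lingua::Stem is the slower
of the two in this table.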

CHANGES
2.31 2020.09.26 - Fix for Latin1/UTF8 issue in documentation

2.30 2020.06.20 - Version renumber for module consistency

0.84 2010.04.29 - Documentation fixes to the En stemmer and removal
                  of the accidentally included lib/Lingua/test.pl file.
                  Thanks go to Aaron Naiman for bringing the
                  documentation error to my attention and to
                  Alexandr Ciornii and 'kmx' for pointing out
                  the problem with the test.pl file.

0.83 2007.06.23 - Disabled Italian locale build tests due to
                  changes in Lingua::Stem::It breaking the tests.

0.82 2006.07.23 - Added 'stem_in_place' to base package.
                  Tweaks to documentation and build tests.

0.81 2004.07.26 - Minor documentation tweak. No functional change.

0.80 2004.07.25 - Added 'RU', 'RU_RU', 'RU_RU.KOI-8' locale.
                  Added support for Lingua::Stem::Ru to
                  Makefile.PL and autoloader.

                  Added documentation stressing use of caching
                  and batches for performance. Added support
                  for '_' as a separator in the locale strings.
                  Added example benchmark script. Expanded copyright
                  credits.

0.70 2004.04.26 - Added FR locale and documentation fixes
                  to Lingua::Stem::Gl

0.61 2003.09.28 - Documentation fixes. No functional changes.

0.60 2003.04.05 - Added more locales by wrapping various stemming
                  implementations. Documented currently supported
                  list of locales.

0.50 2000.09.14 - Fixed major implementation error. Starting with
                  version 0.30 I forgot to include rulesets 2, 3 and 4
                  for Porter's algorithm. The resulting stemming results
                  were very poor. Thanks go to <csyap@netfision.com>
                  for bringing the problem to my attention.

                  Unfortunately, the fix inherently generates *different*
                  stemming results than 0.30 and 0.40 did. If you
                  need identically broken output, use locale 'en-broken'.

0.40 2000.08.25 - Added stem caching support as an option. This
                  can provide a large speedup to the operation
                  of the stemmer. Caching is turned off by default
                  to maximize compatibility with previous versions.

0.30 1999.06.24 - Replaced core of 'En' stemmers with code from
                  Jim Richardson <jimr@maths.usyd.edu.au>
                  Aliased 'en-us' and 'en-uk' to 'en'
                  Fixed 'SYNOPSIS' to correct return value
                  type for stemmed words (SYNOPSIS error spotted
                  by <Arved_37@chebucto.ns.ca>)

0.20 1999.06.15 - Changed to '.pm' module, moved into Lingua:: namespace,
                  added OO interface, optionalized the export of routines
                  into the caller's namespace, added named parameter
                  initialization, stemming exceptions, autoloaded
                  locale support and isolated case flattening to
                  localized stemmers to prevent i18n problems later.

                  Input and output text are assumed to be in UTF8
                  encoding (no operational impact right now, but
                  will be important when extending the module to
                  non-English).

METHODS
new(...);
Returns a new instance of a Lingua::Stem object, optionally
selecting the locale to be used for stemming.

Examples:

    # By default the locale is en
    $us_stemmer = Lingua::Stem->new;

    # Turn on the cache
    $us_stemmer->stem_caching({ -level => 2 });

    # Overriding the default for a specific instance
    $uk_stemmer = Lingua::Stem->new({ -locale => 'en-uk' });

    # Overriding the default for a specific instance and changing the default
    $uk_stemmer = Lingua::Stem->new({ -default_locale => 'en-uk' });

set_locale($locale);
Sets the locale to one of the recognized locales. Locale
identifiers are converted to lowercase.

Called as a class method, it changes the default locale for all
subsequently generated object instances.

Called as an instance method, it only changes the locale for that
particular instance.

'croaks' if passed an unknown locale.

Examples:

    # Change default locale
    Lingua::Stem::set_locale('en-uk'); # UK's spellings

    # Change instance locale
    $self->set_locale('en-us');  # US's spellings
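
The rules described here (lowercasing, '_' accepted as a separator as
noted in the 0.80 change entry, and croaking on an unknown locale) can be
sketched in a few lines. This is only an illustration of the described
behaviour, not the module's code: normalize_locale and %known_locale are
hypothetical names, and the table lists only some of the supported
locales.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical lookup table for illustration only; the real module
# autoloads its locale stemmers and recognizes more identifiers.
my %known_locale = map { $_ => 1 }
    qw(da de en en-us en-uk fr gl it no pt ru sv);

# Lowercase the identifier, treat '_' like '-', and reject anything
# that is not in the table.
sub normalize_locale {
    my ($locale) = @_;
    croak "No locale given" unless defined $locale;
    ( my $key = lc $locale ) =~ tr/_/-/;
    croak "Unknown locale '$locale'" unless $known_locale{$key};
    return $key;
}

print normalize_locale('EN_UK'), "\n";    # prints "en-uk"
print normalize_locale('Sv'),    "\n";    # prints "sv"

# An unknown locale croaks; trap it with eval if the caller must go on.
my $ok = eval { normalize_locale('xx'); 1 };
print "rejected: $@" unless $ok;
```

Because an unknown locale is a fatal error rather than a silent fallback,
code that takes locale names from user input will probably want the same
kind of eval guard around the real set_locale call.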

get_locale;
Called as a class method, returns the current default locale.

Example:

    $default_locale = Lingua::Stem::get_locale;

Called as an instance method, returns the locale for the instance.

    $instance_locale = $stemmer->get_locale;

add_exceptions($exceptions_hash_ref);
Exceptions allow overriding the stemming algorithm on a case by
case basis. It is done on an exact match and substitution basis: If
a passed word is identical to the exception it will be replaced by
the specified value. No case adjustments are performed.

Called as a class method, adds exceptions to the default exceptions
list used for subsequent instantiations of Lingua::Stem objects.

Example:

    # adding default exceptions
    Lingua::Stem::add_exceptions({ 'emily' => 'emily',
                                   'driven' => 'driven',
                                 });

Called as an instance method, adds exceptions only to the specific
instance.

    # adding instance exceptions
    $stemmer->add_exceptions({ 'steely' => 'steely' });

The exceptions shortcut the normal stemming: if an exception
matches, no further stemming is performed after the substitution.

Adding an exception with the same key value as an already defined
exception replaces the pre-existing exception with the new value.
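
The exact-match, short-circuit behaviour described above can be pictured
with a small stand-alone sketch. It does not use Lingua::Stem itself:
toy_stem_word is a deliberately naive stand-in for a real stemming
algorithm, and stem_with_exceptions is a hypothetical helper written only
for this illustration.

```perl
use strict;
use warnings;

# Toy stand-in for a real stemming algorithm: strips one trailing "s".
sub toy_stem_word {
    my ($word) = @_;
    ( my $stem = $word ) =~ s/s\z//;
    return $stem;
}

# The exception table is consulted first, by exact string match. On a
# hit the replacement is returned as-is and the algorithm never runs.
sub stem_with_exceptions {
    my ( $exceptions, @words ) = @_;
    return [
        map { exists $exceptions->{$_} ? $exceptions->{$_} : toy_stem_word($_) }
            @words
    ];
}

my %exceptions = ( 'atlas' => 'atlas' );
my $stems = stem_with_exceptions( \%exceptions, qw(atlas maps cats) );
print join( ' ', @$stems ), "\n";    # prints "atlas map cat"
```

Without the exception the toy algorithm would have cut 'atlas' down to
'atla'; mapping a word to itself is the usual way to protect it.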

delete_exceptions(@exceptions_list);
The mirror of add_exceptions, this allows the _removal_ of
exceptions from either the defaults for the class or from the
instance.

    # Deletion of exceptions from class default exceptions
    Lingua::Stem::delete_exceptions('aragorn','frodo','samwise');

    # Deletion of exceptions from instance
    $stemmer->delete_exceptions('smaug','sauron','gollum');

    # Deletion of all class default exceptions
    delete_exceptions;

    # Deletion of all exceptions from instance
    $stemmer->delete_exceptions;

get_exceptions;
As a class method with no parameters it returns all the default
exceptions as an anonymous hash of 'exception' => 'replace with'
pairs.

Example:

    # Returns all class default exceptions
    $exceptions = Lingua::Stem::get_exceptions;

As a class method with parameters, it returns the default
exceptions listed in the parameters as an anonymous hash of
'exception' => 'replace with' pairs. If a parameter specifies an
undefined 'exception', the value is set to undef.

    # Returns class default exceptions for 'emily' and 'george'
    $exceptions = Lingua::Stem::get_exceptions('emily','george');

As an instance method, with no parameters it returns the currently
active exceptions for the instance.

    # Returns all instance exceptions
    $exceptions = $stemmer->get_exceptions;

As an instance method with parameters, it returns the instance
exceptions listed in the parameters as an anonymous hash of
'exception' => 'replace with' pairs. If a parameter specifies an
undefined 'exception', the value is set to undef.

    # Returns instance exceptions for 'lisa' and 'bart'
    $exceptions = $stemmer->get_exceptions('lisa','bart');
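
The shape of the returned hash when a requested word has no exception
defined is easy to get wrong, so here is a minimal stand-alone sketch of
it. It assumes nothing about the module's internals: exceptions_subset is
a hypothetical helper, and the table is an ordinary hash.

```perl
use strict;
use warnings;

my %exceptions = ( emily => 'emily', driven => 'driven' );

# Build a fresh hash ref for the requested words. A word with no
# exception defined still appears as a key, with undef as its value.
sub exceptions_subset {
    my ( $table, @names ) = @_;
    return { map { $_ => $table->{$_} } @names };
}

my $subset = exceptions_subset( \%exceptions, 'emily', 'george' );
for my $name ( sort keys %$subset ) {
    my $value = defined $subset->{$name} ? "'$subset->{$name}'" : 'undef';
    print "$name => $value\n";
}
```

With a result of this shape, test values with 'defined' rather than
testing keys with 'exists': every requested word is present as a key.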

stem(@list);
Called as a class method, it applies the default settings and stems
the list of passed words, returning an anonymous array with the
stemmed words in the same order as the passed list of words.

Example:

    # Default settings applied
    my $anon_array_of_stemmed_words = Lingua::Stem::stem(@words);

Called as an instance method, it applies the instance's settings
and stems the list of passed words, returning an anonymous array
with the stemmed words in the same order as the passed list of
words.

    # Instance's settings applied
    my $stemmed_words = $stemmer->stem(@words);

The stemmer performs best when handed long lists of words rather
than one word at a time. The cache also provides a huge speed up if
you are processing lots of text.

stem_in_place(@list);
Stems the passed list of words 'in place'. It returns a reference
to the modified list. This is about 60% faster than the 'stem'
method but modifies the original list. This currently only works
for the English locales.

Example:

    my @words = ( 'a', 'list', 'of', 'words' );
    my $stemmed_list_of_words = stem_in_place(@words);

    # '$stemmed_list_of_words' refers to the @words list
    # after 'stem_in_place' has executed

DO NOT use this method of stemming if you need to keep the original
list of words. Its performance gain derives entirely from the fact
it does not make a copy of the original list but instead overwrites
the original list.

If you try something like

    my @words_for_stemming = @words;
    my $stemmed_list_of_words = stem_in_place(@words_for_stemming);

thinking you will get a speed boost while keeping the original
list, you won't: You wipe out the speed gain completely with your
copying of the original list. You should just use the 'stem' method
instead on the original list of words if you need to keep the
original list.
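
Why a sub can overwrite its caller's list at all comes down to how Perl
passes arguments: the elements of @_ are aliases to the caller's
variables. The sketch below shows that general Perl mechanism with a
naive stand-in stemmer; toy_stem_in_place is a hypothetical name, and
nothing here is claimed about how Lingua::Stem implements the method.

```perl
use strict;
use warnings;

# Toy stand-in for a real stemmer: strips one trailing "s".
# Each element of @_ aliases the caller's variable, so the substitution
# rewrites the caller's array directly and no copy is made.
sub toy_stem_in_place {
    s/s\z// for @_;
    return;
}

my @words = qw(cats dogs bird);
toy_stem_in_place(@words);
print "@words\n";    # prints "cat dog bird"

# Copying first keeps the original intact, but the copy is exactly the
# work that in-place stemming was meant to avoid.
my @original = qw(trees rocks);
my @scratch  = @original;
toy_stem_in_place(@scratch);
print "@original | @scratch\n";    # prints "trees rocks | tree rock"
```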

clear_stem_cache;
Clears the stemming cache for the current locale. Can be called as
either a class method or an instance method.

    $stemmer->clear_stem_cache;

    clear_stem_cache;

stem_caching ({ -level => 0|1|2 });
Sets stemming cache level for the current locale. Can be called as
either a class method or an instance method.

    $stemmer->stem_caching({ -level => 1 });

    stem_caching({ -level => 1 });

For the sake of maximum compatibility with previous versions, stem
caching is set to '-level => 0' by default.

'-level' definitions

'0' means 'no caching'. This is the default level.

'1' means 'cache per run'. This caches stemming results during each
call to 'stem'.

'2' means 'cache indefinitely'. This caches stemming results until
either the process exits or the 'clear_stem_cache' method is called.

Stem caching is global to the locale. If you turn on stem caching
for one instance of a locale stemmer, all instances using the same
locale will have it turned on as well.

I STRONGLY suggest turning caching on if you have enough memory and
are processing a lot of data.
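
The difference between the three levels is easiest to see by counting how
often the stemming algorithm actually runs. The following is a
stand-alone sketch of the semantics as described above, not the module's
implementation: toy_stem_word is a naive stand-in algorithm, cached_stem
is a hypothetical helper, and the per-locale sharing of the cache is left
out.

```perl
use strict;
use warnings;

my $calls = 0;    # how many times the (toy) algorithm really ran

# Toy stand-in for a real stemming algorithm: strips one trailing "s".
sub toy_stem_word {
    my ($word) = @_;
    $calls++;
    ( my $stem = $word ) =~ s/s\z//;
    return $stem;
}

my %persistent_cache;    # level 2 storage, kept between calls

sub cached_stem {
    my ( $level, @words ) = @_;
    my %run_cache;       # level 1 storage, thrown away on return
    my $cache = $level == 2 ? \%persistent_cache
              : $level == 1 ? \%run_cache
              :               undef;
    my @stems;
    for my $word (@words) {
        if ($cache) {
            $cache->{$word} = toy_stem_word($word)
                unless exists $cache->{$word};
            push @stems, $cache->{$word};
        }
        else {
            push @stems, toy_stem_word($word);
        }
    }
    return \@stems;
}

my @words = qw(cats dogs cats cats);
for my $level ( 0, 1, 2 ) {
    $calls = 0;
    cached_stem( $level, @words ) for 1 .. 2;    # same list, two calls
    print "level $level: $calls algorithm runs\n";
}
```

For this input the counts are 8, 4 and 2: level 0 stems every word every
time, level 1 stems each distinct word once per call, and level 2 stems
each distinct word once for the life of the cache. The price of level 2
is memory that grows with the number of distinct words seen.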

VERSION
2.31 2020.09.26

NOTES
It started with the 'Text::Stem' module which has been adapted into a
more general framework and moved into the more language oriented
'Lingua' namespace and reorganized to support an OOP interface as well
as to switch core 'En' locale stemmers.

Version 0.40 added a cache for stemmed words. This can provide up to a
several-fold performance improvement.

Organization is such that extending this module to any number of
languages should be direct and simple.

Case flattening is a function of the language, so the 'exceptions'
methods have to be used appropriately to the language. For 'En' family
stemming, use only lower case words for exceptions.

AUTHORS
Jerilyn Franz <cpan@jerilyn.info>
Jim Richardson <imr@maths.usyd.edu.au>

CREDITS
Jim Richardson <imr@maths.usyd.edu.au>
Ulrich Pfeifer <pfeifer@ls6.informatik.uni-dortmund.de>
Aldo Calpini <dada@perl.it>
xern <xern@cpan.org>
Ask Solem Hoel <ask@unixmonks.net>
Dennis Haney <davh@davh.dk>
Sebastien Darribere-Pleyt <sebastien.darribere@lefute.com>
Aleksandr Guidrevitch <pillgrim@mail.ru>

SEE ALSO
    Lingua::Stem::En            Lingua::Stem::Da            Lingua::Stem::De
    Lingua::Stem::Gl            Lingua::Stem::No            Lingua::Stem::Pt
    Lingua::Stem::Sv            Lingua::Stem::It            Lingua::Stem::Fr
    Lingua::Stem::Ru            Text::German                Lingua::PT::Stemmer
    Lingua::GL::Stemmer         Lingua::Stem::Snowball::No  Lingua::Stem::Snowball::Se
    Lingua::Stem::Snowball::Da  Lingua::Stem::Snowball::Sv  Lingua::Stemmer::GL
    Lingua::Stem::Snowball

    http://snowball.tartarus.org

COPYRIGHT
Copyright 1999-2004

Freerun Technologies, Inc (Freerun), Jim Richardson, University of
Sydney <imr@maths.usyd.edu.au> and Jerilyn Franz <cpan@jerilyn.info>.
All rights reserved.

Text::German was written and is copyrighted by Ulrich Pfeifer.

Lingua::Stem::Snowball::Da was written and is copyrighted by Dennis
Haney and Ask Solem Hoel.

Lingua::Stem::It was written and is copyrighted by Aldo Calpini.

Lingua::Stem::Snowball::No, Lingua::Stem::Snowball::Se,
Lingua::Stem::Snowball::Sv were written and are copyrighted by Ask
Solem Hoel.

Lingua::Stemmer::GL and Lingua::PT::Stemmer were written and are
copyrighted by Xern.

Lingua::Stem::Fr was written and is copyrighted by Aldo Calpini and
Sebastien Darribere-Pleyt.

Lingua::Stem::Ru was written and is copyrighted by Aleksandr
Guidrevitch.

This software may be freely copied and distributed under the same terms
and conditions as Perl.

BUGS
None known.

TODO
Add more languages. Extend regression tests. Add support for the
Lingua::Stem::Snowball family of stemmers as an alternative core
stemming engine. Extend 'stem_in_place' functionality to non-English
stemmers.

perl v5.32.0                      2020-09-29                   Lingua::Stem(3)