Lingua::StopWords(3) User Contributed Perl Documentation Lingua::StopWords(3)

Lingua::StopWords - Stop words for several languages.

    use Lingua::StopWords qw( getStopWords );
    my $stopwords = getStopWords('en');

    my @words = qw( i am the walrus goo goo g'joob );

    # prints "walrus goo goo g'joob"
    print join ' ', grep { !$stopwords->{$_} } @words;

In keyword search, it is common practice to suppress a collection of
"stopwords": words such as "the", "and", and "maybe" that appear in a
large number of documents and tell you nothing important about any
document that contains them. This module provides such "stoplists" in
several languages.

Supported Languages
    |-----------------------------------------------------------|
    | Language   | ISO code | default encoding | also available |
    |-----------------------------------------------------------|
    | Danish     | da       | ISO-8859-1       | UTF-8          |
    | Dutch      | nl       | ISO-8859-1       | UTF-8          |
    | English    | en       | ISO-8859-1       | UTF-8          |
    | Finnish    | fi       | ISO-8859-1       | UTF-8          |
    | French     | fr       | ISO-8859-1       | UTF-8          |
    | German     | de       | ISO-8859-1       | UTF-8          |
    | Hungarian  | hu       | ISO-8859-1       | UTF-8          |
    | Italian    | it       | ISO-8859-1       | UTF-8          |
    | Norwegian  | no       | ISO-8859-1       | UTF-8          |
    | Portuguese | pt       | ISO-8859-1       | UTF-8          |
    | Spanish    | es       | ISO-8859-1       | UTF-8          |
    | Swedish    | sv       | ISO-8859-1       | UTF-8          |
    | Russian    | ru       | KOI8-R           | UTF-8          |
    |-----------------------------------------------------------|

getStopWords
    my $stoplist      = getStopWords('en');
    my $utf8_stoplist = getStopWords('en', 'UTF-8');

Retrieve a stoplist as a hashref whose keys are the stopwords and whose
values are all 1.

    $stoplist = {
        and => 1,
        if  => 1,
        # ...
    };

getStopWords() expects one or two arguments. The first, which is
required, is an ISO code representing a supported language. If the ISO
code is not supported, getStopWords() returns undef.
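
Because an unsupported code yields undef rather than an exception, callers may want to check the return value before dereferencing it. The following is a minimal sketch, not part of this module: the helper name stoplist_for and its fall-back to an empty stoplist are assumptions made for illustration.

```perl
use strict;
use warnings;
use Lingua::StopWords qw( getStopWords );

# Hypothetical helper (not part of Lingua::StopWords): return the stoplist
# for $lang, or an empty hashref when the language code is not supported.
sub stoplist_for {
    my ($lang) = @_;
    my $stoplist = getStopWords($lang);
    if ( !defined $stoplist ) {
        warn "No stoplist for language code '$lang'; filtering disabled\n";
        return {};
    }
    return $stoplist;
}

# 'xx' is not one of the supported ISO codes, so nothing is filtered here.
my $stopwords = stoplist_for('xx');
my @kept = grep { !$stopwords->{$_} } qw( the walrus );
print "@kept\n";
```

Returning an empty hashref keeps the grep idiom from the synopsis working unchanged; code that must not silently skip filtering could die instead of warning.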

The optional second argument should be 'UTF-8' if you want the
stopwords encoded in UTF-8; otherwise the default encoding listed in
the table above is used. The returned strings will have Perl's UTF-8
flag turned on, so make sure you understand the implications of that.
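
One practical implication is that input words should be decoded to Perl character strings before they are compared with a UTF-8 stoplist. The sketch below assumes the input arrives as raw UTF-8 bytes and uses the core Encode module; the sample phrase and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw( decode encode );
use Lingua::StopWords qw( getStopWords );

# Request the German stoplist with UTF-8 flagged keys.
my $stopwords = getStopWords( 'de', 'UTF-8' );

# Assumed input: raw UTF-8 bytes, e.g. as read from a file or socket.
my $bytes = "das ist f\xC3\xBCr die K\xC3\xB6nigin";

# Decode to characters so the words can match the stoplist keys.
my @words = split ' ', decode( 'UTF-8', $bytes );
my @kept  = grep { !$stopwords->{$_} } @words;

# Encode back to UTF-8 bytes for output.
print encode( 'UTF-8', join( ' ', @kept ) ), "\n";
```

Looking up the undecoded bytes instead would miss any stopword containing a non-ASCII character, since the byte sequence and the decoded character string are different hash keys.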

The stoplists supplied by this module were created as part of the
Snowball project (see <http://snowball.tartarus.org>,
Lingua::Stem::Snowball).

Lingua::EN::StopWords provides a different stoplist for English.

Maintained by Marvin Humphrey <marvin at rectangular dot com>.
Original author Fabien Potencier, <fabpot at cpan dot org>.

Copyright 2004-2008 Fabien Potencier, Marvin Humphrey

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself, either Perl version 5.8.3 or, at
your option, any later version of Perl 5 you may have available.

perl v5.28.1 2008-08-22 Lingua::StopWords(3)