Lingua::Identify(3)   User Contributed Perl Documentation  Lingua::Identify(3)

NAME

       Lingua::Identify - Language identification

SYNOPSIS

         use Lingua::Identify qw(:language_identification);
         $a = langof($textstring); # gives the most probable language

       or the complete way:

         @a = langof($textstring); # gives pairs of languages / probabilities
                                   # sorted from most to least probable

         %a = langof($textstring); # gives a hash of language / probability

       or the expert way (see section OPTIONS, under HOW TO PERFORM
       IDENTIFICATION)

         $a = langof( { method => [qw/smallwords prefixes2 suffixes2/] }, $text);

         $a = langof( { 'max-size' => 3_000_000 }, $text);

         $a = langof( { 'extract_from' => { 'head' => 1, 'tail' => 2 } }, $text);

DESCRIPTION

       STARTING WITH VERSION 0.25, Lingua::Identify IS UNICODE BY DEFAULT!

       "Lingua::Identify" identifies the language a given string or file is
       written in.

       See section WHY LINGUA::IDENTIFY for a list of "Lingua::Identify"'s
       strong points.

       See section KNOWN LANGUAGES for a list of available languages and HOW
       TO PERFORM IDENTIFICATION to learn how to actually use this module.

       If you're in a hurry, jump to section EXAMPLES, way down below.

       Also, don't forget to read the following section, A WARNING ON THE
       ACCURACY OF LANGUAGE IDENTIFICATION METHODS.

A WARNING ON THE ACCURACY OF LANGUAGE IDENTIFICATION METHODS

       Take a word that exists in two different languages, take a good look at
       it and answer this question: "What language does this word belong to?".

       You can't give an answer like "Language X", right? You can only say it
       looks like any of a set of languages.

       Similarly, it isn't always easy to identify the language of a text,
       even when only two languages are active, if those two are very similar.

       Now that the warning that language identification is not 100% accurate
       is out of the way, please keep reading the documentation.

WHY LINGUA::IDENTIFY

       You might be wondering why you should use Lingua::Identify instead of
       any other tool for language identification.

       Here's a list of Lingua::Identify's strong points:

       ·     it's free and it's open-source;

       ·     it's portable (it's Perl, which means it will work on lots of
             different platforms);

       ·     it supports Unicode;

       ·     4 different methods of language identification and growing (see
             METHODS OF LANGUAGE IDENTIFICATION for more details on this one);

       ·     it's a module, which means you can easily write your own
             application (be it CGI, TK, whatever) around it;

       ·     it comes with langident, which means you don't actually need to
             write your own application around it;

       ·     it's flexible (at the moment, you can actually choose the methods
             to use and their relevance, the max size of input to analyze each
             time and which part(s) of the input to analyze);

       ·     it supports big inputs (through the 'max-size' and 'extract_from'
             options);

       ·     it's easy to deal with languages (you can activate and deactivate
             the ones you choose whenever you want to, which can improve your
             times and accuracy);

       ·     it's maintained.

HOW TO PERFORM IDENTIFICATION

   langof
       To identify the language a given text is written in, use the langof
       function.  To get a single value, do:

         $language = langof($text);

       To get the most probable language and also the percentage of its
       probability, do:

         ($language, $probability) = langof($text);

       If you want a hash where each active language is mapped into its
       percentage, use this:

         %languages = langof($text);

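       Assigning the list returned by langof to a hash discards the
       most-to-least-probable ordering, so it can help to walk the list in
       pairs instead. The following is only a sketch: the @results values
       are invented stand-ins for a real langof call (their scale is an
       assumption), and pairs_in_order is a hypothetical helper, not part of
       the module.

```perl
use strict;
use warnings;

# Invented stand-in for "my @results = langof($text);":
# language / probability pairs, most probable first.
my @results = ('pt', 0.62, 'es', 0.27, 'it', 0.11);

# Hypothetical helper: group a flat list into [language, probability]
# pairs while keeping the original order.
sub pairs_in_order {
    my @flat = @_;
    my @pairs;
    while (my ($language, $probability) = splice(@flat, 0, 2)) {
        push @pairs, [ $language, $probability ];
    }
    return @pairs;
}

printf "%s: %.2f\n", @$_ for pairs_in_order(@results);
```

       With a real call, replace the invented list with the return value of
       langof in list context.
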
       OPTIONS

       langof can also be given some configuration parameters, in this way:

         $language = langof(\%config, $text);

       These parameters are detailed here:

       ·     extract_from

             When the size of the input exceeds 'max-size', "langof"
             analyzes only the beginning of the file. You can specify which
             part of the file is analyzed with the 'extract_from' option:

               langof( { 'extract_from' => 'tail' } , $text );

             Possible values are 'head' and 'tail' (for now).

             You can also specify more than one part of the file, so that text
             is extracted from those parts:

               langof( { 'extract_from' => [ 'head', 'tail' ] } , $text );

             (this will be useful when more than two possibilities exist)

             You can also specify different values for each part of the file
             (not necessarily for all of them):

               langof( { 'extract_from' => { head => 40, tail => 60 } } , $text);

             The line above, for instance, retrieves 40% of the text from the
             beginning and 60% from the end. Note, however, that those values
             are relative weights, not percentages. You'd get the same
             behavior with:

               langof( { 'extract_from' => { head => 80, tail => 120 } } , $text);

             The resulting proportions would be the same.

       ·     max-size

             By default, "langof" analyzes only 1,000,000 bytes. You can
             specify how many bytes (at the most) can be analyzed (if not
             enough exist, the whole input is still analyzed).

               langof( { 'max-size' => 2000 }, $text);

             If you want all the text to be analyzed, set max-size to 0:

               langof( { 'max-size' => 0 }, $text);

             See also "set_max_size".

       ·     method

             You can choose which method or methods to use, and also the
             relevance of each of them.

             To choose a single method to use:

               langof( {method => 'smallwords' }, $text);

             To choose several methods:

               langof( {method => [qw/prefixes2 suffixes2/]}, $text);

             To choose several methods and give them different weight:

               langof( {method => {smallwords => 0.5, ngrams3 => 1.5} }, $text);

             To see the list of available methods, see section METHODS OF
             LANGUAGE IDENTIFICATION.

             If no method is specified, the configuration for this parameter
             is the following (this might change in the future):

               method => {
                 smallwords => 0.5,
                 prefixes2  => 1,
                 suffixes3  => 1,
                 ngrams3    => 1.3
               };

       ·     mode

             By default, "Lingua::Identify" assumes "normal" mode, but others
             are available.

             In "dummy" mode, instead of actually calculating anything,
             "Lingua::Identify" only does the preparation it has to and then
             returns a bunch of information, including the list of the active
             languages, the selected methods, etc. It also returns the text
             meant to be analyzed.

             Do be warned that, with langof_file, the dummy mode still reads
             the files; it simply doesn't calculate the language.

               langof( { 'mode' => 'dummy' }, $text);

             This returns something like this:

               { 'methods'          => {   'smallwords' => '0.5',
                                           'prefixes2'  => '1',
                                       },
                 'config'           => {   'mode' => 'dummy' },
                 'max-size'         => 1000000,
                 'active-languages' => [ 'es', 'pt' ],
                 'text'             => $text,
                 'mode'             => 'dummy',
               }

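       The relative weights accepted by the 'extract_from' option (described
       above) can be made concrete with a small sketch of the arithmetic. It
       only shows why { head => 40, tail => 60 } and { head => 80, tail =>
       120 } describe the same split; weight_fractions is a hypothetical
       helper, not part of Lingua::Identify, and it says nothing about how
       the module selects bytes internally.

```perl
use strict;
use warnings;

# Hypothetical helper: turn relative weights into fractions of the
# extracted text. Illustration only; not the module's internal code.
sub weight_fractions {
    my %weight = @_;
    my $total = 0;
    $total += $_ for values %weight;
    return map { $_ => $weight{$_} / $total } keys %weight;
}

my %first  = weight_fractions(head => 40, tail => 60);
my %second = weight_fractions(head => 80, tail => 120);

printf "head %.2f, tail %.2f\n", $first{head},  $first{tail};
printf "head %.2f, tail %.2f\n", $second{head}, $second{tail};
```

       Both calls yield the same fractions, which is why the weights do not
       need to add up to 100.
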
   langof_file
       langof_file works just like langof, with the exception that it receives
       filenames instead of text. It reads these files (if they exist and are
       readable, of course) and parses their content.

       Currently, langof_file assumes the files are regular text. This may
       change in the future and the files might be scanned to check their
       filetype and then parsed to extract only their textual content (which
       should be pretty useful so that you can perform language
       identification, say, in HTML files, or PDFs).

       To identify the language a file is written in:

         $language = langof_file($path);

       To get the most probable language and also the percentage of its
       probability, do:

         ($language, $probability) = langof_file($path);

       If you want a hash where each active language is mapped into its
       percentage, use this:

         %languages = langof_file($path);

       If you pass more than one file to langof_file, they will all be read
       and their content merged and then parsed for language identification.

       OPTIONS

       langof_file accepts all the options langof does, so refer to those
       first (up in this document).

         $language = langof_file(\%config, $path);

       langof_file currently only reads the first 10,000 bytes of each file.

       You can force an input encoding with "{ encoding => 'ISO-8859-1' }" in
       the configuration hash.

   confidence
       After getting the results into an array, its first element is the most
       probable language. That alone says nothing about how probable it
       actually is.

       You can find out more about how likely the results are to be accurate
       by computing their confidence level.

         use Lingua::Identify qw/:language_identification/;
         my @results = langof($text);
         my $confidence_level = confidence(@results);
         # $confidence_level now holds a value between 0.5 and 1; the higher that
         # value, the more accurate the results seem to be

       The formula used is pretty simple: p1 / (p1 + p2) , where p1 is the
       probability of the most likely language and p2 is the probability of
       the language which came in second. A few examples to illustrate
       this:

       English 50% Portuguese 10% ...

       confidence level: 50 / (50 + 10) = 0.83

       Another example:

       Spanish 30% Portuguese 25% ...

       confidence level: 30 / (30 + 25) = 0.55

       A third example:

       French 10% German 5% ...

       confidence level: 10 / (10 + 5) = 0.67

       As you can see, the first example is probably the most accurate one.
       Are there any doubts? The English language has five times the
       probability of the second language.

       The second example is a bit more tricky. 55% confidence. The confidence
       level is never below 50%, because p1 is at least as large as p2. 55%
       doesn't make anyone confident in the results, and one shouldn't be,
       with results such as these.

       Notice the third example. The confidence level goes up to 67%, but the
       probability of French is a mere 10%. So what? It's twice as much as
       the second language. The low probability may well be caused by a great
       number of languages in play.

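       The formula can be checked by hand with a few lines of Perl. This is
       a sketch of the arithmetic only: confidence_sketch is a hypothetical
       stand-in that takes the two highest probabilities directly, and it is
       not the module's own confidence function.

```perl
use strict;
use warnings;

# p1 / (p1 + p2), where p1 and p2 are the two highest probabilities.
# Hypothetical helper for illustration; not Lingua::Identify's confidence().
sub confidence_sketch {
    my ($p1, $p2) = @_;
    return $p1 / ($p1 + $p2);
}

printf "%.2f\n", confidence_sketch(50, 10);  # first example
printf "%.2f\n", confidence_sketch(30, 25);  # second example (30 and 25)
printf "%.2f\n", confidence_sketch(10, 5);   # third example
```
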
   get_all_methods
       Returns a list comprised of all the available methods for language
       identification.

LANGUAGE IDENTIFICATION IN GENERAL

       Language identification is based on patterns.

       In order to identify the language a given text is written in, we repeat
       a given process for each active language (see section LANGUAGE
       MANIPULATION); in that process, we look for common patterns of that
       language. Those patterns can be prefixes, suffixes, common words,
       ngrams or even sequences of words.

       After repeating the process for each language, the total score for each
       of them is then used to compute the probability (in percentage) for
       each language to be the one of that text.

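       As a rough illustration of that last step, per-language scores can be
       turned into percentages by dividing each one by the total. The
       scores below are invented, scores_to_percentages is a hypothetical
       helper, and dividing by the sum is an assumption about the
       normalization; the module's actual scoring may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: convert raw per-language scores into percentages
# that add up to 100. Assumes plain division by the total score.
sub scores_to_percentages {
    my %score = @_;
    my $total = 0;
    $total += $_ for values %score;
    return () unless $total;    # no evidence for any language
    return map { $_ => 100 * $score{$_} / $total } keys %score;
}

# Invented scores for three active languages.
my %percent = scores_to_percentages(pt => 6, es => 3, en => 1);

# Print from most to least probable.
printf "%s: %.0f%%\n", $_, $percent{$_}
    for sort { $percent{$b} <=> $percent{$a} } keys %percent;
```
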

METHODS OF LANGUAGE IDENTIFICATION

       "Lingua::Identify" currently comprises four different ways for language
       identification, in a total of thirteen variations of those.

       The available methods are the following: smallwords, prefixes1,
       prefixes2, prefixes3, prefixes4, suffixes1, suffixes2, suffixes3,
       suffixes4, ngrams1, ngrams2, ngrams3 and ngrams4.

       Here's a more detailed explanation of each of those ways and those
       methods.

   Small Word Technique - smallwords
       The "Small Word Technique" searches the text for the most common words
       of each active language. These words are usually articles, pronouns,
       etc, which happen to be (usually) the shortest words of the language;
       hence, the method name.

       This is usually a good method for big texts, especially if you happen
       to have few languages active.

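       The idea can be sketched in a few lines. The word lists below are
       tiny, made-up samples and small_word_hits is a hypothetical helper;
       the module's real data and scoring are not reproduced here.

```perl
use strict;
use warnings;

# Made-up, tiny word lists; real data would be far larger.
my %small_words = (
    en => [qw/the of and to in/],
    pt => [qw/de a o que e/],
);

# Hypothetical helper: count how many words of the text appear in each
# language's list of small words.
sub small_word_hits {
    my ($text) = @_;
    my @words = split ' ', lc $text;
    my %hits;
    for my $lang (keys %small_words) {
        my %is_small = map { $_ => 1 } @{ $small_words{$lang} };
        $hits{$lang} = grep { $is_small{$_} } @words;
    }
    return %hits;
}

my %hits = small_word_hits('the cat sat in the corner of the room');
print "$_: $hits{$_}\n" for sort keys %hits;
```
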
   Prefix Analysis - prefixes1, prefixes2, prefixes3, prefixes4
       This method analyses text for the common prefixes of each active
       language.

       The methods are, respectively, for prefixes of size 1, 2, 3 and 4.

   Suffix Analysis - suffixes1, suffixes2, suffixes3, suffixes4
       Similar to the Prefix Analysis (see above), but instead analysing
       common suffixes.

       The methods are, respectively, for suffixes of size 1, 2, 3 and 4.

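       To make "prefixes of size N" and "suffixes of size N" concrete, here
       is a minimal sketch that collects them from a text. The helper
       affixes is hypothetical, and skipping words that are not longer than
       the affix size is an assumption made for the example, not documented
       behavior of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: count the prefixes and suffixes of a given size.
sub affixes {
    my ($text, $size) = @_;
    my (%prefix, %suffix);
    for my $word (split ' ', lc $text) {
        # Assumption: ignore words too short to have a proper affix.
        next if length($word) <= $size;
        $prefix{ substr($word, 0, $size) }++;
        $suffix{ substr($word, -$size) }++;
    }
    return (\%prefix, \%suffix);
}

my ($prefixes, $suffixes) = affixes('walking talking running', 3);
print "prefix $_: $prefixes->{$_}\n" for sort keys %$prefixes;
print "suffix $_: $suffixes->{$_}\n" for sort keys %$suffixes;
```
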
   Ngram Categorization - ngrams1, ngrams2, ngrams3, ngrams4
       Ngrams are sequences of tokens. You can think of them as syllables, but
       they are also more than that, as they are composed not only of
       characters, but also of spaces (delimiting or separating words).

       Ngrams are a very good way of identifying languages, given that the
       most common ones of each language are not generally very common in
       others.

       This is usually the best method for small amounts of text or when many
       languages are active.

       The methods are, respectively, for ngrams of size 1, 2, 3 and 4.

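       The following sketch shows what character ngrams that include spaces
       look like. char_ngrams is a hypothetical helper written for this
       example; lowercasing and collapsing whitespace are assumptions, and
       the module's own extraction may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: count character ngrams of size $n, spaces included.
sub char_ngrams {
    my ($text, $n) = @_;
    # Assumption: lowercase and collapse whitespace runs to a single space.
    (my $clean = lc $text) =~ s/\s+/ /g;
    my %count;
    for my $i (0 .. length($clean) - $n) {
        $count{ substr($clean, $i, $n) }++;
    }
    return %count;
}

my %trigrams = char_ngrams('the cat', 3);
print "'$_': $trigrams{$_}\n" for sort keys %trigrams;
```

       Note how 'he ' and ' ca' straddle the word boundary, which is what
       distinguishes these ngrams from plain syllables.
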

LANGUAGE MANIPULATION

       When trying to perform language identification, "Lingua::Identify"
       works not with all available languages, but instead with the ones that
       are active.

       By default, all available languages are active, but that can be changed
       by the user.

       For your convenience, several functions for language manipulation are
       provided. In order to use them, load the module with the tag
       :language_manipulation.

       These functions work with the two-letter codes for languages.

       activate_language
             Activates a language

               activate_language('en');

               # or

               activate_language($_) for get_all_languages();

       activate_all_languages
             Activates all languages

               activate_all_languages();

       deactivate_language
             Deactivates a language

               deactivate_language('en');

       deactivate_all_languages
             Deactivates all languages

               deactivate_all_languages();

       get_all_languages
             Returns the names of all available languages

               my @all_languages = get_all_languages();

       get_active_languages
             Returns the names of all active languages

               my @active_languages = get_active_languages();

       get_inactive_languages
             Returns the names of all inactive languages

               my @inactive_languages = get_inactive_languages();

       is_active
             Returns the name of the language if it is active, an empty list
             otherwise

               if (is_active('en')) {
                 # YOUR CODE HERE
               }

       is_valid_language
             Returns the name of the language if it exists, an empty list
             otherwise

               if (is_valid_language('en')) {
                 # YOUR CODE HERE
               }

       set_active_languages
             Sets the active languages

               set_active_languages('en', 'pt');

               # or

               set_active_languages(get_all_languages());

       name_of
             Given the two-letter tag of a language, returns its name

               my $language_name = name_of('pt');

KNOWN LANGUAGES

       Currently, "Lingua::Identify" knows the following languages (33 total):

       AF - Afrikaans
       BG - Bulgarian
       BR - Breton
       BS - Bosnian
       CY - Welsh
       DA - Danish
       DE - German
       EN - English
       EO - Esperanto
       ES - Spanish
       FI - Finnish
       FR - French
       FY - Frisian
       GA - Irish
       HR - Croatian
       HU - Hungarian
       ID - Indonesian
       IS - Icelandic
       IT - Italian
       LA - Latin
       MS - Malay
       NL - Dutch
       NO - Norwegian
       PL - Polish
       PT - Portuguese
       RO - Romanian
       RU - Russian
       SL - Slovene
       SO - Somali
       SQ - Albanian
       SV - Swedish
       SW - Swahili
       TR - Turkish

CONTRIBUTING WITH NEW LANGUAGES

       Please do not contribute modules you made yourself. It's easier to
       contribute unprocessed text, because that way new versions of
       Lingua::Identify don't have to drop languages in case I can't contact
       you by that time.

       Use make-lingua-identify-language to create a new module for your own
       personal use, if you must, but try to contribute unprocessed text
       rather than those modules.

EXAMPLES

   THE BASIC EXAMPLE
       Check the language a given text file is written in:

         use Lingua::Identify qw/langof/;

         # read all input; each line already ends with its own newline
         my $text = join '', <>;

         # identify the language by letting the module decide on the best way
         # to do so
         my $language = langof($text);

   IDENTIFYING BETWEEN TWO LANGUAGES
       Check the language a given text file is written in, supposing you
       happen to know it's either Portuguese or English:

         use Lingua::Identify qw/langof set_active_languages/;
         set_active_languages(qw/pt en/);

         # read all input; each line already ends with its own newline
         my $text = join '', <>;

         # identify the language by letting the module decide on the best way
         # to do so
         my $language = langof($text);

TO DO

       ·     WordNgrams based methods;

       ·     More languages (always);

       ·     File recognition and treatment;

       ·     Deal with different encodings;

       ·     Create sets of languages and allow their activation/deactivation;

       ·     There should be a way of knowing the default configuration (other
             than using the dummy mode, of course, or than accessing the
             variables directly);

       ·     Add a section about other similar tools.

ACKNOWLEDGMENTS

       The following people and/or projects helped during this tool's
       development:

          * EuroParl v5 corpus was used to train Dutch, German, English,
            Spanish, Finnish, French, Italian, Portuguese, Danish and Swedish.

SEE ALSO

       langident(1), Text::ExtractWords(3), Text::Ngram(3), Text::Affixes(3).

       ISO 639 Language Codes, at http://www.w3.org/WAI/ER/IG/ert/iso639.htm

AUTHOR

       Alberto Simoes, "<ambs@cpan.org>"

       Jose Castro, "<cog@cpan.org>"

       Copyright 2008-2010 Alberto Simoes, All Rights Reserved.  Copyright
       2004-2008 Jose Castro, All Rights Reserved.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.28.1                      2013-08-17               Lingua::Identify(3)