Lingua::Identify(3)    User Contributed Perl Documentation    Lingua::Identify(3)

NAME
Lingua::Identify - Language identification

SYNOPSIS
    use Lingua::Identify qw(:language_identification);
    $a = langof($textstring); # gives the most probable language

or the complete way:

    @a = langof($textstring); # gives pairs of languages / probabilities
                              # sorted from most to least probable

    %a = langof($textstring); # gives a hash of language / probability

or the expert way (see section OPTIONS, under HOW TO PERFORM
IDENTIFICATION):

    $a = langof( { method => [qw/smallwords prefixes2 suffixes2/] }, $text);

    $a = langof( { 'max-size' => 3_000_000 }, $text);

    $a = langof( { 'extract_from' => { 'head' => 1, 'tail' => 2 } }, $text);

DESCRIPTION
STARTING WITH VERSION 0.25, Lingua::Identify IS UNICODE BY DEFAULT!

"Lingua::Identify" identifies the language a given string or file is
written in.

See section WHY LINGUA::IDENTIFY for a list of "Lingua::Identify"'s
strong points.

See section KNOWN LANGUAGES for a list of available languages and HOW
TO PERFORM IDENTIFICATION to learn how to use this module.

If you're in a hurry, jump to section EXAMPLES, way down below.

Also, don't forget to read the following section, IMPORTANT WARNING.

IMPORTANT WARNING
Take a word that exists in two different languages, take a good look at
it and answer this question: "What language does this word belong to?"

You can't give an answer like "Language X", right? You can only say it
looks like any of a set of languages.

Similarly, it isn't always easy to identify the language of a text if
the only two active languages are very similar.

Now that the warning that language identification is not 100% accurate
is out of the way, please keep reading the documentation.

WHY LINGUA::IDENTIFY
You might be wondering why you should use Lingua::Identify instead of
any other tool for language identification.

Here's a list of Lingua::Identify's strong points:

·   it's free and it's open-source;

·   it's portable (it's Perl, which means it will work on lots of
    different platforms);

·   Unicode support;

·   4 different methods of language identification and growing (see
    METHODS OF LANGUAGE IDENTIFICATION for more details on this one);

·   it's a module, which means you can easily write your own
    application (be it CGI, TK, whatever) around it;

·   it comes with langident, which means you don't actually need to
    write your own application around it;

·   it's flexible (at the moment, you can choose the methods to use
    and their relevance, the max size of input to analyze each time
    and which part(s) of the input to analyze);

·   it supports big inputs (through the 'max-size' and 'extract_from'
    options);

·   it's easy to deal with languages (you can activate and deactivate
    the ones you choose whenever you want to, which can improve your
    times and accuracy);

·   it's maintained.

HOW TO PERFORM IDENTIFICATION
langof
To identify the language a given text is written in, use the langof
function. To get a single value, do:

    $language = langof($text);

To get the most probable language along with its probability, do:

    ($language, $probability) = langof($text);

If you want a hash where each active language is mapped to its
probability, use this:

    %languages = langof($text);
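
In list context the result is a flat list of language / probability pairs, which can be walked two at a time. The following is only a minimal sketch: `format_ranking` is a hypothetical helper (not part of the module), and the sample values are made up to show the shape of the data.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Identify): turns a flat
# (language, probability, language, probability, ...) list, such as the
# one langof returns in list context, into printable lines, keeping the
# order in which the pairs were given.
sub format_ranking {
    my @pairs = @_;
    my @lines;
    while (my ($language, $probability) = splice(@pairs, 0, 2)) {
        push @lines, "$language: $probability";
    }
    return @lines;
}

# With the module installed, one would pass langof's result directly:
#   use Lingua::Identify qw(:language_identification);
#   print "$_\n" for format_ranking(langof($text));

# Made-up sample values, used here only for illustration.
print "$_\n" for format_ranking('pt', 0.62, 'es', 0.38);
```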

OPTIONS

langof can also be given some configuration parameters, in this way:

    $language = langof(\%config, $text);

These parameters are detailed here:

·   extract_from

    When the size of the input exceeds 'max-size', "langof"
    analyzes only the beginning of the file. You can specify which
    part of the file is analyzed with the 'extract_from' option:

        langof( { 'extract_from' => 'tail' } , $text );

    Possible values are 'head' and 'tail' (for now).

    You can also specify more than one part of the file, so that text
    is extracted from those parts:

        langof( { 'extract_from' => [ 'head', 'tail' ] } , $text );

    (this will be useful when more than two possibilities exist)

    You can also specify different values for each part of the file
    (not necessarily for all of them):

        langof( { 'extract_from' => { head => 40, tail => 60 } } , $text);

    The line above, for instance, takes 40% of the analyzed text from
    the beginning and 60% from the end. Note, however, that those
    values are relative weights, not percentages. You'd get the same
    behavior with:

        langof( { 'extract_from' => { head => 80, tail => 120 } } , $text);

    The resulting proportions would be the same.

·   max-size

    By default, "langof" analyzes only 1,000,000 bytes. You can
    specify how many bytes (at most) can be analyzed (if fewer
    exist, the whole input is still analyzed).

        langof( { 'max-size' => 2000 }, $text);

    If you want all the text to be analyzed, set max-size to 0:

        langof( { 'max-size' => 0 }, $text);

    See also "set_max_size".

·   method

    You can choose which method or methods to use, and also the
    relevance of each of them.

    To choose a single method to use:

        langof( {method => 'smallwords' }, $text);

    To choose several methods:

        langof( {method => [qw/prefixes2 suffixes2/]}, $text);

    To choose several methods and give them different weights:

        langof( {method => {smallwords => 0.5, ngrams3 => 1.5} }, $text);

    To see the list of available methods, see section METHODS OF
    LANGUAGE IDENTIFICATION.

    If no method is specified, the configuration for this parameter
    is the following (this might change in the future):

        method => {
            smallwords => 0.5,
            prefixes2  => 1,
            suffixes3  => 1,
            ngrams3    => 1.3
        };

·   mode

    By default, "Lingua::Identify" assumes "normal" mode, but others
    are available.

    In "dummy" mode, instead of actually calculating anything,
    "Lingua::Identify" only does the preparation it has to and then
    returns a bunch of information, including the list of the active
    languages, the selected methods, etc. It also returns the text
    meant to be analyzed.

    Do be warned that, with langof_file, the dummy mode still reads
    the files; it simply doesn't calculate the language.

        langof( { 'mode' => 'dummy' }, $text);

    This returns something like this:

        { 'methods'          => { 'smallwords' => '0.5',
                                  'prefixes2'  => '1',
                                },
          'config'           => { 'mode' => 'dummy' },
          'max-size'         => 1000000,
          'active-languages' => [ 'es', 'pt' ],
          'text'             => $text,
          'mode'             => 'dummy',
        }
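
The point made under the extract_from option above, that the head and tail values act as relative weights rather than percentages, can be illustrated with a small sketch. This is not the module's actual implementation; `split_budget` is a hypothetical function, and the byte budget of 1000 is an arbitrary figure chosen for the example.

```perl
use strict;
use warnings;

# Illustration only (not Lingua::Identify's internals): splits a byte
# budget between parts of the input in proportion to relative weights.
sub split_budget {
    my ($budget, %weight) = @_;
    my $total = 0;
    $total += $_ for values %weight;
    return map { ($_ => int($budget * $weight{$_} / $total)) } keys %weight;
}

my %first  = split_budget(1000, head => 40, tail => 60);
my %second = split_budget(1000, head => 80, tail => 120);

# Both pairs of weights describe the same 40/60 split of the budget.
printf "head=%d tail=%d\n", $first{head},  $first{tail};
printf "head=%d tail=%d\n", $second{head}, $second{tail};
```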

langof_file
langof_file works just like langof, with the exception that it receives
filenames instead of text. It reads these files (if existing and
readable, of course) and parses their content.

Currently, langof_file assumes the files are regular text. This may
change in the future and the files might be scanned to check their
filetype and then parsed to extract only their textual content (which
should be pretty useful so that you can perform language
identification, say, in HTML files, or PDFs).

To identify the language a file is written in:

    $language = langof_file($path);

To get the most probable language along with its probability, do:

    ($language, $probability) = langof_file($path);

If you want a hash where each active language is mapped to its
probability, use this:

    %languages = langof_file($path);

If you pass more than one file to langof_file, they will all be read
and their content merged and then parsed for language identification.

OPTIONS

langof_file accepts all the options langof does, so refer to those
first (up in this document).

    $language = langof_file(\%config, $path);

langof_file currently only reads the first 10,000 bytes of each file.

You can force an input encoding with "{ encoding => 'ISO-8859-1' }" in
the configuration hash.

confidence
After getting the results into an array, its first element is the most
probable language. That alone doesn't tell you how probable it is.

You can find out more about how likely the results are to be accurate
by computing their confidence level.

    use Lingua::Identify qw/:language_identification/;
    my @results = langof($text);
    my $confidence_level = confidence(@results);
    # $confidence_level now holds a value between 0.5 and 1; the higher that
    # value, the more accurate the results seem to be

The formula used is pretty simple: p1 / (p1 + p2), where p1 is the
probability of the most likely language and p2 is the probability of
the language which came in second. A few examples to illustrate
this:

    English 50% Portuguese 10% ...

    confidence level: 50 / (50 + 10) = 0.83

Another example:

    Spanish 30% Portuguese 25% ...

    confidence level: 30 / (30 + 25) = 0.55

And a third one:

    French 10% German 5% ...

    confidence level: 10 / (10 + 5) = 0.67

As you can see, the first example is probably the most accurate one.
Are there any doubts? The English language has five times the
probability of the second language.

The second example is a bit more tricky. 55% confidence. The confidence
level is never below 50%, because p1 is at least as large as p2. 55%
doesn't make anyone confident in the results, and one shouldn't be,
with results such as these.

Notice the third example. The confidence level goes up to 67%, but the
probability of French is a mere 10%. So what? It's twice as much as
the second language. The low probability may well be caused by a great
number of languages in play.
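
The formula can be checked by hand with a few lines of plain Perl. The sketch below restates p1 / (p1 + p2) in a hypothetical function named `confidence_of`; in practice you would call the module's own confidence function on the results of langof.

```perl
use strict;
use warnings;

# Plain-Perl restatement of the formula p1 / (p1 + p2).
# confidence_of is a hypothetical name used only in this sketch.
sub confidence_of {
    my ($p1, $p2) = @_;
    return $p1 / ($p1 + $p2);
}

printf "%.2f\n", confidence_of(50, 10);   # prints 0.83
printf "%.2f\n", confidence_of(30, 25);   # prints 0.55
printf "%.2f\n", confidence_of(10, 5);    # prints 0.67
```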

get_all_methods
Returns a list of all the available methods for language
identification.

LANGUAGE IDENTIFICATION IN GENERAL
Language identification is based on patterns.

In order to identify the language a given text is written in, we repeat
a given process for each active language (see section LANGUAGES
MANIPULATION); in that process, we look for common patterns of that
language. Those patterns can be prefixes, suffixes, common words,
ngrams or even sequences of words.

After repeating the process for each language, the total score for each
of them is then used to compute the probability (as a percentage) of
each language being the one the text is written in.
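
One simple way to picture that last step is to divide each language's score by the sum of all scores. The sketch below does just that; it is an assumption made for illustration (the module's actual scoring may well differ), and both `to_percentages` and the sample scores are invented.

```perl
use strict;
use warnings;

# Illustration only: converts per-language total scores into percentages
# by dividing each score by the sum of all scores. This is an assumed
# normalization, not necessarily what Lingua::Identify does internally.
sub to_percentages {
    my (%score) = @_;
    my $total = 0;
    $total += $_ for values %score;
    return () unless $total > 0;    # nothing matched for any language
    return map { ($_ => 100 * $score{$_} / $total) } keys %score;
}

# Invented scores for two active languages.
my %percentage = to_percentages(pt => 30, es => 10);
printf "%s: %.0f%%\n", $_, $percentage{$_} for sort keys %percentage;
```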

METHODS OF LANGUAGE IDENTIFICATION
"Lingua::Identify" currently comprises four different approaches to
language identification, with a total of thirteen variations.

The available methods are the following: smallwords, prefixes1,
prefixes2, prefixes3, prefixes4, suffixes1, suffixes2, suffixes3,
suffixes4, ngrams1, ngrams2, ngrams3 and ngrams4.

Here's a more detailed explanation of each of those approaches and
methods:

Small Word Technique - smallwords
The "Small Word Technique" searches the text for the most common words
of each active language. These words are usually articles, pronouns,
etc., which happen to be (usually) the shortest words of the language;
hence the method name.

This is usually a good method for big texts, especially if you happen
to have few languages active.
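
The idea can be sketched in a few lines of Perl. The word lists below are tiny and made up purely for illustration (the module ships its own data for each language), and `smallword_scores` is a hypothetical function, not the module's implementation.

```perl
use strict;
use warnings;

# Made-up, tiny word lists used only for this sketch.
my %small_words = (
    en => [qw(the of and to in)],
    pt => [qw(de a o que e)],
);

# Counts, for each language, how many words of the text appear in that
# language's list of small words.
sub smallword_scores {
    my ($text) = @_;
    my %count;
    $count{ lc $_ }++ for $text =~ /\w+/g;
    my %score;
    for my $language (keys %small_words) {
        $score{$language} = 0;
        $score{$language} += $count{$_} // 0 for @{ $small_words{$language} };
    }
    return %score;
}

my %score = smallword_scores('The name of the rose and the key to the door');
printf "%s: %d\n", $_, $score{$_} for sort keys %score;
```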

Prefix Analysis - prefixes1, prefixes2, prefixes3, prefixes4
This method analyses text for the common prefixes of each active
language.

The methods are, respectively, for prefixes of size 1, 2, 3 and 4.

Suffix Analysis - suffixes1, suffixes2, suffixes3, suffixes4
Similar to the Prefix Analysis (see above), but instead analysing
common suffixes.

The methods are, respectively, for suffixes of size 1, 2, 3 and 4.
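
To make "prefixes of size 2" or "suffixes of size 3" concrete, here is a minimal sketch that collects them from a piece of text. `affixes` is a hypothetical function; skipping words that are not longer than the affix is an assumption of this sketch, not documented behavior of the module.

```perl
use strict;
use warnings;

# Counts the prefixes or suffixes of a given size found in the text.
# $kind is either 'prefix' or 'suffix'.
sub affixes {
    my ($text, $size, $kind) = @_;
    my %count;
    for my $word ($text =~ /\w+/g) {
        # Assumption: words no longer than the affix size are skipped.
        next if length($word) <= $size;
        my $affix = $kind eq 'prefix'
            ? substr($word, 0, $size)
            : substr($word, -$size);
        $count{ lc $affix }++;
    }
    return %count;
}

my %prefixes = affixes('running runner rundown', 2, 'prefix');
my %suffixes = affixes('walking talking', 3, 'suffix');
print "ru: $prefixes{ru}\n";     # prints "ru: 3"
print "ing: $suffixes{ing}\n";   # prints "ing: 2"
```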

Ngram Categorization - ngrams1, ngrams2, ngrams3, ngrams4
Ngrams are sequences of tokens. You can think of them as syllables, but
they are also more than that, as they are made up not only of
characters, but also of spaces (delimiting or separating words).

Ngrams are a very good way of identifying languages, given that the
most common ones of each language are not generally very common in
others.

This is usually the best method for small amounts of text or many
active languages.

The methods are, respectively, for ngrams of size 1, 2, 3 and 4.
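
As a rough picture of what character ngrams that include spaces look like, the following sketch counts every substring of a given size. `ngrams` is a hypothetical function, and lowercasing the text and collapsing runs of whitespace are assumptions of this sketch rather than documented behavior.

```perl
use strict;
use warnings;

# Counts every character ngram of the given size, spaces included.
sub ngrams {
    my ($text, $size) = @_;
    # Assumption: lowercase the text and collapse whitespace first.
    (my $clean = lc $text) =~ s/\s+/ /g;
    my %count;
    for my $start (0 .. length($clean) - $size) {
        $count{ substr($clean, $start, $size) }++;
    }
    return %count;
}

my %trigram = ngrams('the theme', 3);
printf "'%s': %d\n", $_, $trigram{$_} for sort keys %trigram;
```

Note that " th" (with a leading space) and "he " (with a trailing space) are counted as ngrams of their own, which is how word boundaries end up contributing to the result.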

LANGUAGES MANIPULATION
When trying to perform language identification, "Lingua::Identify"
works not with all available languages, but instead with the ones that
are active.

By default, all available languages are active, but that can be changed
by the user.

For your convenience, several functions for language manipulation
are provided. In order to use them, load the module with the tag
:language_manipulation.

These functions work with the two-letter codes for languages.

activate_language
Activates a language

    activate_language('en');

    # or

    activate_language($_) for get_all_languages();

activate_all_languages
Activates all languages

    activate_all_languages();

deactivate_language
Deactivates a language

    deactivate_language('en');

deactivate_all_languages
Deactivates all languages

    deactivate_all_languages();

get_all_languages
Returns the names of all available languages

    my @all_languages = get_all_languages();

get_active_languages
Returns the names of all active languages

    my @active_languages = get_active_languages();

get_inactive_languages
Returns the names of all inactive languages

    my @inactive_languages = get_inactive_languages();

is_active
Returns the name of the language if it is active, an empty list
otherwise

    if (is_active('en')) {
        # YOUR CODE HERE
    }

is_valid_language
Returns the name of the language if it exists, an empty list
otherwise

    if (is_valid_language('en')) {
        # YOUR CODE HERE
    }

set_active_languages
Sets the active languages

    set_active_languages('en', 'pt');

    # or

    set_active_languages(get_all_languages());

name_of
Given the two-letter tag of a language, returns its name

    my $language_name = name_of('pt');

KNOWN LANGUAGES
Currently, "Lingua::Identify" knows the following languages (33 total):

    AF - Afrikaans
    BG - Bulgarian
    BR - Breton
    BS - Bosnian
    CY - Welsh
    DA - Danish
    DE - German
    EN - English
    EO - Esperanto
    ES - Spanish
    FI - Finnish
    FR - French
    FY - Frisian
    GA - Irish
    HR - Croatian
    HU - Hungarian
    ID - Indonesian
    IS - Icelandic
    IT - Italian
    LA - Latin
    MS - Malay
    NL - Dutch
    NO - Norwegian
    PL - Polish
    PT - Portuguese
    RO - Romanian
    RU - Russian
    SL - Slovene
    SO - Somali
    SQ - Albanian
    SV - Swedish
    SW - Swahili
    TR - Turkish

CONTRIBUTING WITH NEW LANGUAGES
Please do not contribute modules you made yourself. It's easier to
contribute unprocessed text, because that way new versions of
Lingua::Identify don't have to drop languages in case I can't
contact you by that time.

Use make-lingua-identify-language to create a new module for your own
personal use, if you must, but try to contribute unprocessed text
rather than those modules.

EXAMPLES
THE BASIC EXAMPLE
Check the language a given text file is written in:

    use Lingua::Identify qw/langof/;

    # read all input; each line already ends with its own newline
    my $text = join '', <>;

    # identify the language by letting the module decide on the best way
    # to do so
    my $language = langof($text);

IDENTIFYING BETWEEN TWO LANGUAGES
Check the language a given text file is written in, supposing you
happen to know it's either Portuguese or English:

    use Lingua::Identify qw/langof set_active_languages/;
    set_active_languages(qw/pt en/);

    # read all input; each line already ends with its own newline
    my $text = join '', <>;

    # identify the language by letting the module decide on the best way
    # to do so
    my $language = langof($text);

TO DO
·   WordNgrams based methods;

·   More languages (always);

·   File recognition and treatment;

·   Deal with different encodings;

·   Create sets of languages and allow their activation/deactivation;

·   There should be a way of knowing the default configuration (other
    than using the dummy mode, of course, or than accessing the
    variables directly);

·   Add a section about other similar tools.

ACKNOWLEDGMENTS
The following people and/or projects helped during this tool's
development:

*   EuroParl v5 corpus was used to train Dutch, German, English,
    Spanish, Finnish, French, Italian, Portuguese, Danish and Swedish.

SEE ALSO
langident(1), Text::ExtractWords(3), Text::Ngram(3), Text::Affixes(3).

ISO 639 Language Codes, at http://www.w3.org/WAI/ER/IG/ert/iso639.htm

AUTHOR
Alberto Simoes, "<ambs@cpan.org>"

Jose Castro, "<cog@cpan.org>"

COPYRIGHT & LICENSE
Copyright 2008-2010 Alberto Simoes, All Rights Reserved. Copyright
2004-2008 Jose Castro, All Rights Reserved.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

perl v5.28.1                      2013-08-17              Lingua::Identify(3)