I18N::LangTags(3pm)    Perl Programmers Reference Guide    I18N::LangTags(3pm)

NAME

       I18N::LangTags - functions for dealing with RFC3066-style language tags

SYNOPSIS

         use I18N::LangTags();

       ...or specify whichever functions you want to import, like so:

         use I18N::LangTags qw(implicate_supers similarity_language_tag);

       All the exportable functions are listed below -- you're free to import
       only some, or none at all.  By default, none are imported.  If you say:

           use I18N::LangTags qw(:ALL);

       ...then all are exported.  (This saves you from having to use something
       less obvious like "use I18N::LangTags qw(/./)".)

       If you don't import any of these functions, assume a &I18N::LangTags::
       in front of all the function names in the following examples.

DESCRIPTION

       Language tags are a formalism, described in RFC 3066 (obsoleting RFC
       1766), for declaring what language form (language and possibly dialect)
       a given chunk of information is in.

       This library provides functions for common tasks involving language
       tags as they are needed in a variety of protocols and applications.

       Please see the "See Also" references for a thorough explanation of how
       to correctly use language tags.

       ·   the function is_language_tag($lang1)

           Returns true iff $lang1 is a formally valid language tag.

              is_language_tag("fr")            is TRUE
              is_language_tag("x-jicarilla")   is FALSE
                  (Subtags can be 8 chars long at most -- 'jicarilla' is 9)

              is_language_tag("sgn-US")        is TRUE
                  (That's American Sign Language)

              is_language_tag("i-Klikitat")    is TRUE
                  (True without regard to the fact that no one has actually
                   registered Klikitat -- it's a formally valid tag)

              is_language_tag("fr-patois")     is TRUE
                  (Formally valid -- although descriptively weak!)

              is_language_tag("Spanish")       is FALSE
              is_language_tag("french-patois") is FALSE
                  (No good -- first subtag has to match
                   /^([xXiI]|[a-zA-Z]{2,3})$/ -- see RFC3066)

              is_language_tag("x-borg-prot2532") is TRUE
                  (Yes, subtags can contain digits, as of RFC3066)

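           A minimal sketch of using is_language_tag as a filter.  The
           candidate strings are taken from the examples above and merely
           stand in for untrusted input; the variable names are illustrative.

```perl
use strict;
use warnings;
use I18N::LangTags qw(is_language_tag);

# Hypothetical input: a mix of well-formed and malformed tags.
my @candidates = ('fr', 'x-jicarilla', 'sgn-US', 'Spanish', 'x-borg-prot2532');

# Keep only the formally valid tags.
my @valid = grep { is_language_tag($_) } @candidates;

print join(', ', @valid), "\n";   # fr, sgn-US, x-borg-prot2532
```
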
       ·   the function extract_language_tags($whatever)

           Returns a list of whatever looks like formally valid language tags
           in $whatever.  Not very smart, so don't get too creative with what
           you want to feed it.

             extract_language_tags("fr, fr-ca, i-mingo")
               returns:   ('fr', 'fr-ca', 'i-mingo')

             extract_language_tags("It's like this: I'm in fr -- French!")
               returns:   ('It', 'in', 'fr')
             (So don't just feed it any old thing.)

           The output is untainted.  If you don't know what tainting is, don't
           worry about it.

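           The two cases above, written out as a runnable sketch.  The
           variable names are illustrative; the point is only that a
           delimited list of tags works well and free-form prose does not.

```perl
use strict;
use warnings;
use I18N::LangTags qw(extract_language_tags);

# A delimited list of tags is the kind of input this function handles well.
my @tags = extract_language_tags("fr, fr-ca, i-mingo");
print join(' ', @tags), "\n";    # fr fr-ca i-mingo

# Free-form prose yields false positives, so avoid passing it in.
my @noise = extract_language_tags("It's like this: I'm in fr -- French!");
print join(' ', @noise), "\n";   # It in fr
```
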
       ·   the function same_language_tag($lang1, $lang2)

           Returns true iff $lang1 and $lang2 are acceptable variant tags
           representing the same language-form.

              same_language_tag('x-kadara', 'i-kadara')  is TRUE
                 (The x/i- alternation doesn't matter)
              same_language_tag('X-KADARA', 'i-kadara')  is TRUE
                 (...and neither does case)
              same_language_tag('en',       'en-US')     is FALSE
                 (all-English is not the SAME as US English)
              same_language_tag('x-kadara', 'x-kadar')   is FALSE
                 (these are totally unrelated tags)
              same_language_tag('no-bok',    'nb')       is TRUE
                 (no-bok is a legacy tag for nb (Norwegian Bokmal))

           "same_language_tag" works by just seeing whether
           "encode_language_tag($lang1)" is the same as
           "encode_language_tag($lang2)".

           (Yes, I know this function is named a bit oddly.  Call it historic
           reasons.)

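           One possible use, sketched under the assumption that you have a
           list of tags with redundant variants and want one representative
           per language form.  The list and variable names are hypothetical.

```perl
use strict;
use warnings;
use I18N::LangTags qw(same_language_tag);

# Hypothetical input containing variant spellings of the same language forms.
my @tags = ('x-kadara', 'I-KADARA', 'en', 'en-US', 'no-bok', 'nb');

# Keep the first tag seen for each distinct language form.
my @distinct;
for my $tag (@tags) {
  push @distinct, $tag
    unless grep { same_language_tag($tag, $_) } @distinct;
}

print join(', ', @distinct), "\n";   # x-kadara, en, en-US, no-bok
```

           This is quadratic in the number of tags, which is fine for short
           lists; for larger ones, keying a hash on encode_language_tag is
           likely the better fit.
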
       ·   the function similarity_language_tag($lang1, $lang2)

           Returns an integer representing the degree of similarity between
           tags $lang1 and $lang2 (the order of which does not matter), where
           similarity is the number of common elements on the left, without
           regard to case and to x/i- alternation.

              similarity_language_tag('fr', 'fr-ca')           is 1
                 (one element in common)
              similarity_language_tag('fr-ca', 'fr-FR')        is 1
                 (one element in common)

              similarity_language_tag('fr-CA-joual',
                                      'fr-CA-PEI')             is 2
              similarity_language_tag('fr-CA-joual', 'fr-CA')  is 2
                 (two elements in common)

              similarity_language_tag('x-kadara', 'i-kadara')  is 1
                 (x/i- doesn't matter)

              similarity_language_tag('en',       'x-kadar')   is 0
              similarity_language_tag('x-kadara', 'x-kadar')   is 0
                 (unrelated tags -- no similarity)

              similarity_language_tag('i-cree-syllabic',
                                      'i-cherokee-syllabic')   is 0
                 (no leftmost elements in common!)

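           A sketch of picking the closest available tag for a request by
           taking the highest similarity score.  The list of available tags
           and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use I18N::LangTags qw(similarity_language_tag);

my $wanted    = 'fr-CA-joual';
my @available = ('en', 'fr', 'fr-CA', 'fr-FR');   # hypothetical offerings

# Track the tag sharing the most leftmost elements with $wanted.
my ($best, $best_score) = (undef, 0);
for my $tag (@available) {
  my $score = similarity_language_tag($wanted, $tag);
  ($best, $best_score) = ($tag, $score) if $score > $best_score;
}

print defined $best ? "$best ($best_score)\n" : "no match\n";   # fr-CA (2)
```

           A score of 0 leaves $best undefined here, so unrelated tags are
           never chosen by accident.
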
       ·   the function is_dialect_of($lang1, $lang2)

           Returns true iff language tag $lang1 represents a subform of
           language tag $lang2.

           Get the order right!  It doesn't work the other way around!

              is_dialect_of('en-US', 'en')            is TRUE
                (American English IS a dialect of all-English)

              is_dialect_of('fr-CA-joual', 'fr-CA')   is TRUE
              is_dialect_of('fr-CA-joual', 'fr')      is TRUE
                (Joual is a dialect of (a dialect of) French)

              is_dialect_of('en', 'en-US')            is FALSE
                (all-English is NOT a dialect of American English)

              is_dialect_of('fr', 'en-CA')            is FALSE

              is_dialect_of('en',    'en'   )         is TRUE
              is_dialect_of('en-US', 'en-US')         is TRUE
                (Note: these are degenerate cases)

              is_dialect_of('i-mingo-tom', 'x-Mingo') is TRUE
                (the x/i thing doesn't matter, nor does case)

              is_dialect_of('nn', 'no')               is TRUE
                (because 'nn' (New Norse) is aliased to 'no-nyn',
                 as a special legacy case, and 'no-nyn' is a
                 subform of 'no' (Norwegian))

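           A short sketch of the argument order in practice: selecting every
           available tag that is a subform of French.  The list of tags is
           hypothetical.

```perl
use strict;
use warnings;
use I18N::LangTags qw(is_dialect_of);

my @available = ('fr', 'fr-CA', 'fr-CA-joual', 'en-CA', 'FR-be');

# The candidate goes first, the (possibly broader) language second.
my @french = grep { is_dialect_of($_, 'fr') } @available;

print join(', ', @french), "\n";   # fr, fr-CA, fr-CA-joual, FR-be
```

           Note that 'fr' itself is selected, since a tag counts as a dialect
           of itself.
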
       ·   the function super_languages($lang1)

           Returns a list of language tags that are superordinate tags to
           $lang1 -- it gets this by removing subtags from the end of $lang1
           until nothing (or just "i" or "x") is left.

              super_languages("fr-CA-joual")  is  ("fr-CA", "fr")

              super_languages("en-AU")  is  ("en")

              super_languages("en")  is  empty-list, ()

              super_languages("i-cherokee")  is  empty-list, ()
               ...not ("i"), which would be illegal as well as pointless.

           If $lang1 is not a valid language tag, returns empty-list in a list
           context, undef in a scalar context.

           A notable and rather unavoidable problem with this method:
           "x-mingo-tom" has an "x" because the whole tag isn't an IANA-
           registered tag -- but super_languages('x-mingo-tom') is ('x-mingo')
           -- which isn't really right, since 'i-mingo' is registered.  But
           this module has no way of knowing that.  (But note that
           same_language_tag('x-mingo', 'i-mingo') is TRUE.)

           More importantly, you assume at your peril that superordinates of
           $lang1 are mutually intelligible with $lang1.  Consider this
           carefully.

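           A sketch of a fallback lookup built on super_languages.  The
           %greetings table and the greeting_for helper are hypothetical, and
           the lookup is deliberately naive (exact, case-sensitive keys).

```perl
use strict;
use warnings;
use I18N::LangTags qw(super_languages);

# Hypothetical lookup table, keyed by exact (case-sensitive) tag.
my %greetings = ('fr' => 'Bonjour', 'en' => 'Hello');

# Try the wanted tag first, then each superordinate tag, longest first.
sub greeting_for {
  my ($wanted) = @_;
  for my $tag ($wanted, super_languages($wanted)) {
    return $greetings{$tag} if exists $greetings{$tag};
  }
  return;   # nothing suitable found
}

print greeting_for('fr-CA-joual'), "\n";   # Bonjour
```

           Whether falling back from 'fr-CA-joual' to 'fr' is acceptable is
           exactly the intelligibility question raised above.
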
       ·   the function locale2language_tag($locale_identifier)

           This takes a locale name (like "en", "en_US", or "en_US.ISO8859-1")
           and maps it to a language tag.  If it's not mappable (as with,
           notably, "C" and "POSIX"), this returns empty-list in a list
           context, or undef in a scalar context.

              locale2language_tag("en") is "en"

              locale2language_tag("en_US") is "en-US"

              locale2language_tag("en_US.ISO8859-1") is "en-US"

              locale2language_tag("C") is undef or ()

              locale2language_tag("POSIX") is undef or ()

           I'm not totally sure that locale names map satisfactorily to
           language tags.  Think REAL hard about how you use this.  YOU HAVE
           BEEN WARNED.

           The output is untainted.  If you don't know what tainting is, don't
           worry about it.

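           A sketch of handling the unmappable case explicitly.  The
           tag_from_locale helper and its default argument are hypothetical,
           not part of this module.

```perl
use strict;
use warnings;
use I18N::LangTags qw(locale2language_tag);

# Map a locale name to a language tag, falling back to a caller-supplied
# default when the locale (such as "C" or "POSIX") is not mappable.
sub tag_from_locale {
  my ($locale, $default) = @_;
  my $tag = locale2language_tag($locale);   # scalar context: undef on failure
  return defined $tag ? $tag : $default;
}

print tag_from_locale('en_US.ISO8859-1', 'en'), "\n";   # en-US
print tag_from_locale('POSIX', 'en'), "\n";             # en
```
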
       ·   the function encode_language_tag($lang1)

           This function, if given a language tag, returns an encoding of it
           such that:

           * tags representing different languages never get the same
           encoding.

           * tags representing the same language always get the same encoding.

           * an encoding of a formally valid language tag always is a string
           value that is defined, has length, and is true if considered as a
           boolean.

           Note that the encoding itself is not a formally valid language tag.
           Note also that you cannot, currently, go from an encoding back to a
           language tag that it's an encoding of.

           Note also that you must consider the encoded value as atomic; i.e.,
           you should not consider it as anything but an opaque, unanalysable
           string value.  (The internals of the encoding method may change in
           future versions, as the language tagging standard changes over
           time.)

           "encode_language_tag" returns undef if given anything other than a
           formally valid language tag.

           "encode_language_tag" exists because different language tags may
           represent the same language; this is normally treatable with
           "same_language_tag", but consider this situation:

           You have a data file that expresses greetings in different
           languages.  Its format is "[language tag]=[how to say 'Hello']",
           like:

                     en-US=Hiho
                     fr=Bonjour
                     i-mingo=Hau'

           And suppose you write a program that reads that file and then runs
           as a daemon, answering client requests that specify a language tag
           and then expect the string that says how to greet in that language.
           So an interaction looks like:

                     greeting-client asks:    fr
                     greeting-server answers: Bonjour

           So far so good.  But suppose the way you're implementing this is:

                     my %greetings;
                     open(my $in, "<", "in.dat") or die "Can't open in.dat: $!";
                     while(<$in>) {
                       chomp;
                       next unless /^([^=]+)=(.+)/s;
                       my($lang, $expr) = ($1, $2);
                       $greetings{$lang} = $expr;   # keyed by the raw tag
                     }
                     close($in);

           at which point %greetings has the contents:

                     "en-US"   => "Hiho"
                     "fr"      => "Bonjour"
                     "i-mingo" => "Hau'"

           And suppose then that you answer client requests for language
           $wanted by just looking up $greetings{$wanted}.

           If the client asks for "fr", that will look up successfully in
           %greetings, to the value "Bonjour".  And if the client asks for
           "i-mingo", that will look up successfully in %greetings, to the
           value "Hau'".

           But if the client asks for "i-Mingo" or "x-mingo", or "Fr", then
           the lookup in %greetings fails.  That's the Wrong Thing.

           You could instead do lookups on $wanted with:

                     use I18N::LangTags qw(same_language_tag);
                     my $response = '';
                     foreach my $l2 (keys %greetings) {
                       if(same_language_tag($wanted, $l2)) {
                         $response = $greetings{$l2};
                         last;
                       }
                     }

           But that's rather inefficient, since every request scans all the
           keys.  A better way to do it is to start your program with:

                     use I18N::LangTags qw(encode_language_tag);
                     my %greetings;
                     open(my $in, "<", "in.dat") or die "Can't open in.dat: $!";
                     while(<$in>) {
                       chomp;
                       next unless /^([^=]+)=(.+)/s;
                       my($lang, $expr) = ($1, $2);
                       # keyed by the encoding, not the raw tag
                       $greetings{ encode_language_tag($lang) } = $expr;
                     }
                     close($in);

           and then answer client requests for language $wanted by just
           looking up

                     $greetings{encode_language_tag($wanted)}

           And that does the Right Thing.

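           The same idea as a self-contained sketch, with the data held in an
           array instead of a file, and with an explicit check for the undef
           that encode_language_tag returns for malformed tags.  The @lines
           array and the greet helper are hypothetical.

```perl
use strict;
use warnings;
use I18N::LangTags qw(encode_language_tag);

# Hypothetical stand-in for the contents of the data file.
my @lines = ("en-US=Hiho", "fr=Bonjour", "i-mingo=Hau'");

my %greetings;
for my $line (@lines) {
  next unless $line =~ /^([^=]+)=(.+)/s;
  my ($lang, $expr) = ($1, $2);
  my $key = encode_language_tag($lang);
  next unless defined $key;        # skip entries with malformed tags
  $greetings{$key} = $expr;
}

# Look up a greeting; returns undef for malformed or unknown tags.
sub greet {
  my ($wanted) = @_;
  my $key = encode_language_tag($wanted);
  return defined $key ? $greetings{$key} : undef;
}

print greet('x-mingo'), "\n";   # Hau'
print greet('Fr'), "\n";        # Bonjour
```

           Checking for undef avoids using an undefined value as a hash key
           when a client sends something that is not a language tag at all.
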
       ·   the function alternate_language_tags($lang1)

           This function, if given a language tag, returns all language tags
           that are alternate forms of this language tag.  (I.e., tags which
           refer to the same language.)  This is meant to handle legacy tags
           caused by the minor changes in language tag standards over the
           years; and the x-/i- alternation is also dealt with.

           Note that this function does not try to equate new (and never-used,
           and unusable) ISO639-2 three-letter tags to old (and still in use)
           ISO639-1 two-letter equivalents -- like "ara" -> "ar" -- because
           "ara" has never been in use as an Internet language tag, and RFC
           3066 stipulates that it never should be, since a shorter tag ("ar")
           exists.

           Examples:

             alternate_language_tags('no-bok')       is ('nb')
             alternate_language_tags('nb')           is ('no-bok')
             alternate_language_tags('he')           is ('iw')
             alternate_language_tags('iw')           is ('he')
             alternate_language_tags('i-hakka')      is ('zh-hakka', 'x-hakka')
             alternate_language_tags('zh-hakka')     is ('i-hakka', 'x-hakka')
             alternate_language_tags('en')           is ()
             alternate_language_tags('x-mingo-tom')  is ('i-mingo-tom')
             alternate_language_tags('x-klikitat')   is ('i-klikitat')
             alternate_language_tags('i-klikitat')   is ('x-klikitat')

           This function returns empty-list if given anything other than a
           formally valid language tag.

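           A sketch of expanding a tag into every spelling worth trying,
           e.g. before probing a directory of per-language files.  The
           all_forms helper is hypothetical.

```perl
use strict;
use warnings;
use I18N::LangTags qw(alternate_language_tags);

# Return the tag itself followed by its alternate forms, if any.
sub all_forms {
  my ($tag) = @_;
  return ($tag, alternate_language_tags($tag));
}

print join(' ', all_forms('he')), "\n";   # he iw
print join(' ', all_forms('en')), "\n";   # en
```

           The order in which multiple alternates come back is best not
           relied upon; sort the list if a stable order matters.
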
       ·   the function @langs = panic_languages(@accept_languages)

           This function takes a list of 0 or more language tags that
           constitute a given user's Accept-Language list, and returns a list
           of tags for other (non-super) languages that are probably
           acceptable to the user, to be used if all else fails.

           For example, if a user accepts only 'ca' (Catalan) and 'es'
           (Spanish), and the documents/interfaces you have available are just
           in German, Italian, and Chinese, then the user will most likely
           want the Italian one (and not the Chinese or German one!), instead
           of getting nothing.  So "panic_languages('ca', 'es')" returns a
           list containing 'it' (Italian).

           English ('en') is always in the return list, but whether it's at
           the very end or not depends on the input languages.  This function
           works by consulting an internal table that stipulates what common
           languages are "close" to each other.

           A useful construct you might consider using is shown below.  Since
           super_languages takes a single tag, it is applied to each accepted
           language in turn:

             @fallbacks = map { super_languages($_) } @accept_languages;
             push @fallbacks, panic_languages(
               @accept_languages, @fallbacks,
             );

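           The Catalan/Spanish scenario above as a sketch.  The %available
           set of document languages is hypothetical, and the choice relies
           on the module's internal closeness table behaving as described.

```perl
use strict;
use warnings;
use I18N::LangTags qw(panic_languages);

my @accept    = ('ca', 'es');
my %available = map { $_ => 1 } ('de', 'it', 'zh');   # hypothetical offerings

# Prefer what the user asked for; otherwise take the first panic language
# that we actually have.
my ($choice) = grep { $available{$_} } @accept, panic_languages(@accept);

print defined $choice ? "$choice\n" : "nothing suitable\n";
```
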
       ·   the function implicate_supers( ...languages... )

           This takes a list of strings (which are presumed to be language
           tags; strings that aren't are ignored); and after each one, this
           function inserts superordinate forms that don't already appear in
           the list.  The original list, plus these insertions, is returned.

           In other words, it takes this:

             pt-br de-DE en-US fr pt-br-janeiro

           and returns this:

             pt-br pt de-DE de en-US en fr pt-br-janeiro

           This function is most useful in the idiom

             implicate_supers( I18N::LangTags::Detect::detect() );

           (See I18N::LangTags::Detect.)

       ·   the function implicate_supers_strictly( ...languages... )

           This works like "implicate_supers" except that the implicated forms
           are added to the end of the return list.

           In other words, implicate_supers_strictly takes a list of strings
           (which are presumed to be language tags; strings that aren't are
           ignored) and after the whole given list, it inserts the
           superordinate forms of all given tags, minus any tags that already
           appear in the input list.

           In other words, it takes this:

             pt-br de-DE en-US fr pt-br-janeiro

           and returns this:

             pt-br de-DE en-US fr pt-br-janeiro pt de en

           The reason this function has "_strictly" in its name is that when
           you're processing an Accept-Language list according to the RFCs, if
           you interpret the RFCs quite strictly, then you would use
           implicate_supers_strictly, but for normal use (i.e., common-sense
           use, as far as I'm concerned) you'd use implicate_supers.

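           The two functions side by side, as a sketch using the sample list
           from the text above.  The variable names are illustrative.

```perl
use strict;
use warnings;
use I18N::LangTags qw(implicate_supers implicate_supers_strictly);

my @accept = qw(pt-br de-DE en-US fr pt-br-janeiro);

# Superordinate forms inserted right after each tag.
my @loose  = implicate_supers(@accept);

# Superordinate forms appended after the whole original list.
my @strict = implicate_supers_strictly(@accept);

print "@loose\n";    # pt-br pt de-DE de en-US en fr pt-br-janeiro
print "@strict\n";   # pt-br de-DE en-US fr pt-br-janeiro pt de en
```
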

ABOUT LOWERCASING

       I've considered making all the above functions that output language
       tags return all those tags strictly in lowercase.  Having all your
       language tags in lowercase does make some things easier.  But you might
       as well just lowercase as you like, or call
       "encode_language_tag($lang1)" where appropriate.

ABOUT UNICODE PLAINTEXT LANGUAGE TAGS

       In some future version of I18N::LangTags, I plan to include support for
       RFC2482-style language tags -- which are basically just normal language
       tags with their ASCII characters shifted into Plane 14.

SEE ALSO

       * I18N::LangTags::List

       * RFC 3066, "http://www.ietf.org/rfc/rfc3066.txt", "Tags for the
       Identification of Languages".  (Obsoletes RFC 1766)

       * RFC 2277, "http://www.ietf.org/rfc/rfc2277.txt", "IETF Policy on
       Character Sets and Languages".

       * RFC 2231, "http://www.ietf.org/rfc/rfc2231.txt", "MIME Parameter
       Value and Encoded Word Extensions: Character Sets, Languages, and
       Continuations".

       * RFC 2482, "http://www.ietf.org/rfc/rfc2482.txt", "Language Tagging in
       Unicode Plain Text".

       * Locale::Codes, in
       "http://www.perl.com/CPAN/modules/by-module/Locale/"

       * ISO 639-2, "Codes for the representation of names of languages",
       including two-letter and three-letter codes,
       "http://www.loc.gov/standards/iso639-2/php/code_list.php"

       * The IANA list of registered languages (hopefully up-to-date),
       "http://www.iana.org/assignments/language-tags"

COPYRIGHT

       Copyright (c) 1998+ Sean M. Burke. All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       The programs and documentation in this dist are distributed in the hope
       that they will be useful, but without any warranty; without even the
       implied warranty of merchantability or fitness for a particular
       purpose.

AUTHOR

       Sean M. Burke "sburke@cpan.org"

perl v5.30.2                      2020-03-27               I18N::LangTags(3pm)