I18N::LangTags(3pm)

I18N::LangTags(3pm)    Perl Programmers Reference Guide    I18N::LangTags(3pm)

NAME

       I18N::LangTags - functions for dealing with RFC3066-style language tags

SYNOPSIS

         use I18N::LangTags();

       ...or specify whichever of the functions you want to import, like so:

         use I18N::LangTags qw(implicate_supers similarity_language_tag);

       All the exportable functions are listed below -- you're free to import
       only some, or none at all.  By default, none are imported.  If you say:

           use I18N::LangTags qw(:ALL)

       ...then all are exported.  (This saves you from having to use something
       less obvious like "use I18N::LangTags qw(/./)".)

       If you don't import any of these functions, assume a &I18N::LangTags::
       in front of all the function names in the following examples.

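       As a minimal sketch of the two calling styles described above, the
       following script imports one function by name and calls another by its
       fully qualified name.  The variable names are illustrative only.

```perl
use strict;
use warnings;
use I18N::LangTags qw(is_language_tag);   # import just one function

# Imported function: call it directly.
my $ok = is_language_tag('en-US');

# Function that was not imported: call it with the package prefix.
my @supers = I18N::LangTags::super_languages('en-US');

print "valid: $ok; supers: @supers\n";
```
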

DESCRIPTION

       Language tags are a formalism, described in RFC 3066 (obsoleting
       RFC 1766), for declaring what language form (language and possibly
       dialect) a given chunk of information is in.

       This library provides functions for common tasks involving language
       tags as they are needed in a variety of protocols and applications.

       Please see the "See Also" references for a thorough explanation of how
       to correctly use language tags.

       * the function is_language_tag($lang1)
           Returns true iff $lang1 is a formally valid language tag.

              is_language_tag("fr")            is TRUE
              is_language_tag("x-jicarilla")   is FALSE
                  (Subtags can be 8 chars long at most -- 'jicarilla' is 9)

              is_language_tag("sgn-US")        is TRUE
                  (That's American Sign Language)

              is_language_tag("i-Klikitat")    is TRUE
                  (True without regard to the fact that no one has actually
                   registered Klikitat -- it's a formally valid tag)

              is_language_tag("fr-patois")     is TRUE
                  (Formally valid -- although descriptively weak!)

              is_language_tag("Spanish")       is FALSE
              is_language_tag("french-patois") is FALSE
                  (No good -- first subtag has to match
                   /^([xXiI]|[a-zA-Z]{2,3})$/ -- see RFC3066)

              is_language_tag("x-borg-prot2532") is TRUE
                  (Yes, subtags can contain digits, as of RFC3066)

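           The checks above can be applied in bulk.  The following is a
           minimal sketch that filters a list of candidate strings down to
           the formally valid tags; the @candidates list and the variable
           names are assumptions made for illustration.

```perl
use strict;
use warnings;
use I18N::LangTags qw(is_language_tag);

# Hypothetical input: a mix of valid and invalid candidate tags.
my @candidates = ('fr', 'x-jicarilla', 'sgn-US', 'Spanish', 'x-borg-prot2532');

# Keep only the strings that are formally valid language tags.
my @valid = grep { is_language_tag($_) } @candidates;

print join(', ', @valid), "\n";
```
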
       * the function extract_language_tags($whatever)
           Returns a list of whatever looks like formally valid language tags
           in $whatever.  Not very smart, so don't get too creative with what
           you want to feed it.

             extract_language_tags("fr, fr-ca, i-mingo")
               returns:   ('fr', 'fr-ca', 'i-mingo')

             extract_language_tags("It's like this: I'm in fr -- French!")
               returns:   ('It', 'in', 'fr')
             (So don't just feed it any old thing.)

           The output is untainted.  If you don't know what tainting is, don't
           worry about it.

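           As a sketch of the intended use, the script below pulls tags out
           of a field that is already known to hold only tags and
           separators.  The $field value is a hypothetical input, not
           something the module supplies.

```perl
use strict;
use warnings;
use I18N::LangTags qw(extract_language_tags);

# Hypothetical field value containing only tags and separators.
my $field = "fr, fr-ca, i-mingo";

# Collect everything that looks like a formally valid tag.
my @tags = extract_language_tags($field);

print "found: @tags\n";
```
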
       * the function same_language_tag($lang1, $lang2)
           Returns true iff $lang1 and $lang2 are acceptable variant tags
           representing the same language-form.

              same_language_tag('x-kadara', 'i-kadara')  is TRUE
                 (The x/i- alternation doesn't matter)
              same_language_tag('X-KADARA', 'i-kadara')  is TRUE
                 (...and neither does case)
              same_language_tag('en',       'en-US')     is FALSE
                 (all-English is not the SAME as US English)
              same_language_tag('x-kadara', 'x-kadar')   is FALSE
                 (these are totally unrelated tags)
              same_language_tag('no-bok',    'nb')       is TRUE
                 (no-bok is a legacy tag for nb (Norwegian Bokmal))

           "same_language_tag" works by just seeing whether
           "encode_language_tag($lang1)" is the same as
           "encode_language_tag($lang2)".

           (Yes, I know this function is named a bit oddly.  Call it
           historical reasons.)

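           A minimal sketch of how this comparison might be used: checking
           whether a requested tag matches any entry in a list of supported
           tags, regardless of legacy spelling.  The @supported list and
           $request value are assumptions for illustration.

```perl
use strict;
use warnings;
use I18N::LangTags qw(same_language_tag);

# Hypothetical list of tags an application supports.
my @supported = ('en-US', 'nb', 'i-kadara');
my $request   = 'no-bok';    # legacy spelling of 'nb'

# Find the first supported tag that denotes the same language form.
my ($hit) = grep { same_language_tag($request, $_) } @supported;

print defined $hit ? "matched $hit\n" : "no match\n";
```
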
       * the function similarity_language_tag($lang1, $lang2)
           Returns an integer representing the degree of similarity between
           tags $lang1 and $lang2 (the order of which does not matter), where
           similarity is the number of common elements on the left, without
           regard to case and to x/i- alternation.

              similarity_language_tag('fr', 'fr-ca')           is 1
                 (one element in common)
              similarity_language_tag('fr-ca', 'fr-FR')        is 1
                 (one element in common)

              similarity_language_tag('fr-CA-joual',
                                      'fr-CA-PEI')             is 2
              similarity_language_tag('fr-CA-joual', 'fr-CA')  is 2
                 (two elements in common)

              similarity_language_tag('x-kadara', 'i-kadara')  is 1
                 (x/i- doesn't matter)

              similarity_language_tag('en',       'x-kadar')   is 0
              similarity_language_tag('x-kadara', 'x-kadar')   is 0
                 (unrelated tags -- no similarity)

              similarity_language_tag('i-cree-syllabic',
                                      'i-cherokee-syllabic')   is 0
                 (no leftmost elements in common!)

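           One possible use of the similarity score, sketched under the
           assumption that you hold a list of available tags and want the
           closest one to a requested tag.  The names $wanted, @available,
           $best and $best_score are hypothetical.

```perl
use strict;
use warnings;
use I18N::LangTags qw(similarity_language_tag);

my $wanted    = 'fr-CA-joual';
my @available = ('en', 'fr', 'fr-CA', 'de');   # hypothetical inventory

# Track the available tag that shares the most leftmost elements.
my ($best, $best_score) = (undef, 0);
for my $tag (@available) {
    my $score = similarity_language_tag($wanted, $tag);
    ($best, $best_score) = ($tag, $score) if $score > $best_score;
}

print defined $best
    ? "best match: $best (similarity $best_score)\n"
    : "no match\n";
```

           Note that a high score only says the tags share leading subtags;
           it does not promise that the two forms are mutually intelligible.
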
       * the function is_dialect_of($lang1, $lang2)
           Returns true iff language tag $lang1 represents a subform of
           language tag $lang2.

           Get the order right!  It doesn't work the other way around!

              is_dialect_of('en-US', 'en')            is TRUE
                (American English IS a dialect of all-English)

              is_dialect_of('fr-CA-joual', 'fr-CA')   is TRUE
              is_dialect_of('fr-CA-joual', 'fr')      is TRUE
                (Joual is a dialect of (a dialect of) French)

              is_dialect_of('en', 'en-US')            is FALSE
                (all-English is NOT a dialect of American English)

              is_dialect_of('fr', 'en-CA')            is FALSE

              is_dialect_of('en',    'en'   )         is TRUE
              is_dialect_of('en-US', 'en-US')         is TRUE
                (Note: these are degenerate cases)

              is_dialect_of('i-mingo-tom', 'x-Mingo') is TRUE
                (the x/i thing doesn't matter, nor does case)

              is_dialect_of('nn', 'no')               is TRUE
                (because 'nn' (New Norse) is aliased to 'no-nyn',
                 as a special legacy case, and 'no-nyn' is a
                 subform of 'no' (Norwegian))

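           As an illustration of the argument order, this sketch selects
           every tag in a list that is a subform of 'en'.  The @docs list is
           a hypothetical set of document language tags.

```perl
use strict;
use warnings;
use I18N::LangTags qw(is_dialect_of);

# Hypothetical language tags attached to available documents.
my @docs = ('en-US', 'en-GB', 'fr-CA', 'en');

# The candidate goes first, the broader language second.
my @english = grep { is_dialect_of($_, 'en') } @docs;

print "English documents: @english\n";
```
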
       * the function super_languages($lang1)
           Returns a list of language tags that are superordinate tags to
           $lang1 -- it gets this by removing subtags from the end of $lang1
           until nothing (or just "i" or "x") is left.

              super_languages("fr-CA-joual")  is  ("fr-CA", "fr")

              super_languages("en-AU")  is  ("en")

              super_languages("en")  is  empty-list, ()

              super_languages("i-cherokee")  is  empty-list, ()
               ...not ("i"), which would be illegal as well as pointless.

           If $lang1 is not a valid language tag, returns empty-list in a list
           context, undef in a scalar context.

           A notable and rather unavoidable problem with this method:
           "x-mingo-tom" has an "x" because the whole tag isn't an
           IANA-registered tag -- but super_languages('x-mingo-tom') is
           ('x-mingo') -- which isn't really right, since 'i-mingo' is
           registered.  But this module has no way of knowing that.  (But
           note that same_language_tag('x-mingo', 'i-mingo') is TRUE.)

           More importantly, you assume at your peril that superordinates of
           $lang1 are mutually intelligible with $lang1.  Consider this
           carefully.

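           A minimal sketch of building a most-specific-first lookup chain
           from one tag, subject to the intelligibility caveat above.  The
           names $tag and @chain are illustrative.

```perl
use strict;
use warnings;
use I18N::LangTags qw(super_languages);

my $tag = 'fr-CA-joual';

# The tag itself first, then progressively more general forms.
my @chain = ($tag, super_languages($tag));

print join(' > ', @chain), "\n";
```
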
       * the function locale2language_tag($locale_identifier)
           This takes a locale name (like "en", "en_US", or "en_US.ISO8859-1")
           and maps it to a language tag.  If it's not mappable (as with,
           notably, "C" and "POSIX"), this returns empty-list in a list
           context, or undef in a scalar context.

              locale2language_tag("en") is "en"

              locale2language_tag("en_US") is "en-US"

              locale2language_tag("en_US.ISO8859-1") is "en-US"

              locale2language_tag("C") is undef or ()

              locale2language_tag("POSIX") is undef or ()

           I'm not totally sure that locale names map satisfactorily to
           language tags.  Think REAL hard about how you use this.  YOU HAVE
           BEEN WARNED.

           The output is untainted.  If you don't know what tainting is, don't
           worry about it.

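           Because unmappable locales yield undef in scalar context, callers
           need a fallback.  The sketch below assumes a default of 'en',
           which is a choice made for this example and not something the
           module prescribes; %tag_for and @locales are likewise
           hypothetical.

```perl
use strict;
use warnings;
use I18N::LangTags qw(locale2language_tag);

# Hypothetical locale names, e.g. gathered from the environment.
my @locales = ('en_US.ISO8859-1', 'C', 'fr_CA');

my %tag_for;
for my $locale (@locales) {
    # Scalar context: undef when the locale has no language tag.
    my $tag = locale2language_tag($locale);
    $tag_for{$locale} = defined $tag ? $tag : 'en';   # assumed default
}

print "$_ => $tag_for{$_}\n" for @locales;
```
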
       * the function encode_language_tag($lang1)
           This function, if given a language tag, returns an encoding of it
           such that:

           * tags representing different languages never get the same
           encoding.

           * tags representing the same language always get the same encoding.

           * an encoding of a formally valid language tag always is a string
           value that is defined, has length, and is true if considered as a
           boolean.

           Note that the encoding itself is not a formally valid language tag.
           Note also that you cannot, currently, go from an encoding back to a
           language tag that it's an encoding of.

           Note also that you must consider the encoded value as atomic; i.e.,
           you should not consider it as anything but an opaque, unanalysable
           string value.  (The internals of the encoding method may change in
           future versions, as the language tagging standard changes over
           time.)

           "encode_language_tag" returns undef if given anything other than a
           formally valid language tag.

           The reason "encode_language_tag" exists is that different language
           tags may represent the same language; this is normally treatable
           with "same_language_tag", but consider this situation:

           You have a data file that expresses greetings in different
           languages.  Its format is "[language tag]=[how to say 'Hello']",
           like:

                     en-US=Hiho
                     fr=Bonjour
                     i-mingo=Hau'

           And suppose you write a program that reads that file and then runs
           as a daemon, answering client requests that specify a language tag
           and then expect the string that says how to greet in that language.
           So an interaction looks like:

                     greeting-client asks:    fr
                     greeting-server answers: Bonjour

           So far so good.  But suppose the way you're implementing this is:

                     my %greetings;
                     # Read "tag=greeting" lines, keyed by the raw tag.
                     open(my $in, '<', 'in.dat')
                       or die "Can't open in.dat: $!";
                     while(<$in>) {
                       chomp;
                       next unless /^([^=]+)=(.+)/s;
                       my($lang, $expr) = ($1, $2);
                       $greetings{$lang} = $expr;
                     }
                     close($in);

           at which point %greetings has the contents:

                     "en-US"   => "Hiho"
                     "fr"      => "Bonjour"
                     "i-mingo" => "Hau'"

           And suppose then that you answer client requests for language
           $wanted by just looking up $greetings{$wanted}.

           If the client asks for "fr", that will look up successfully in
           %greetings, to the value "Bonjour".  And if the client asks for
           "i-mingo", that will look up successfully in %greetings, to the
           value "Hau'".

           But if the client asks for "i-Mingo" or "x-mingo", or "Fr", then
           the lookup in %greetings fails.  That's the Wrong Thing.

           You could instead do lookups on $wanted with:

                     use I18N::LangTags qw(same_language_tag);
                     my $response = '';
                     # Linear scan: compare $wanted with every stored tag.
                     foreach my $l2 (keys %greetings) {
                       if(same_language_tag($wanted, $l2)) {
                         $response = $greetings{$l2};
                         last;
                       }
                     }

           But that's rather inefficient.  A better way to do it is to start
           your program with:

                     use I18N::LangTags qw(encode_language_tag);
                     my %greetings;
                     open(my $in, '<', 'in.dat')
                       or die "Can't open in.dat: $!";
                     while(<$in>) {
                       chomp;
                       next unless /^([^=]+)=(.+)/s;
                       my($lang, $expr) = ($1, $2);
                       # Key by the normalized encoding, not the raw tag.
                       $greetings{
                                   encode_language_tag($lang)
                                 } = $expr;
                     }
                     close($in);

           and then answer client requests for language $wanted by just
           looking up

                     $greetings{encode_language_tag($wanted)}

           And that does the Right Thing.

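           The whole approach can be sketched as one self-contained script.
           To stay runnable without a data file, this version reads the
           sample lines from an in-memory list; @lines and the helper
           greeting_for() are hypothetical names introduced for this
           example, and invalid tags are skipped because
           encode_language_tag returns undef for them.

```perl
use strict;
use warnings;
use I18N::LangTags qw(encode_language_tag);

# Stand-in for the contents of the data file described above.
my @lines = ("en-US=Hiho", "fr=Bonjour", "i-mingo=Hau'");

my %greetings;
for my $line (@lines) {
    next unless $line =~ /^([^=]+)=(.+)/s;
    my ($lang, $expr) = ($1, $2);
    my $key = encode_language_tag($lang);
    next unless defined $key;          # skip formally invalid tags
    $greetings{$key} = $expr;
}

# Hypothetical helper: look up a greeting by any variant of a tag.
sub greeting_for {
    my ($wanted) = @_;
    my $key = encode_language_tag($wanted);
    return defined $key ? $greetings{$key} : undef;
}

for my $wanted ('fr', 'Fr', 'i-Mingo', 'x-mingo') {
    my $greeting = greeting_for($wanted);
    print "$wanted: ", (defined $greeting ? $greeting : '(none)'), "\n";
}
```
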
       * the function alternate_language_tags($lang1)
           This function, if given a language tag, returns all language tags
           that are alternate forms of this language tag.  (I.e., tags which
           refer to the same language.)  This is meant to handle legacy tags
           caused by the minor changes in language tag standards over the
           years; and the x-/i- alternation is also dealt with.

           Note that this function does not try to equate new (and never-used,
           and unusable) ISO639-2 three-letter tags to old (and still in use)
           ISO639-1 two-letter equivalents -- like "ara" -> "ar" -- because
           "ara" has never been in use as an Internet language tag, and RFC
           3066 stipulates that it never should be, since a shorter tag ("ar")
           exists.

           Examples:

                     alternate_language_tags('no-bok')       is ('nb')
                     alternate_language_tags('nb')           is ('no-bok')
                     alternate_language_tags('he')           is ('iw')
                     alternate_language_tags('iw')           is ('he')
                     alternate_language_tags('i-hakka')      is ('zh-hakka', 'x-hakka')
                     alternate_language_tags('zh-hakka')     is ('i-hakka', 'x-hakka')
                     alternate_language_tags('en')           is ()
                     alternate_language_tags('x-mingo-tom')  is ('i-mingo-tom')
                     alternate_language_tags('x-klikitat')   is ('i-klikitat')
                     alternate_language_tags('i-klikitat')   is ('x-klikitat')

           This function returns empty-list if given anything other than a
           formally valid language tag.

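           One way this might be used, sketched under the assumption that
           you must query an existing table keyed by exact (possibly
           legacy) tag strings.  The %legacy table and its contents are
           hypothetical.

```perl
use strict;
use warnings;
use I18N::LangTags qw(alternate_language_tags);

# Hypothetical table that still uses the legacy Hebrew tag 'iw'.
my %legacy = ( iw => 'Hebrew resource' );

my $wanted = 'he';

# Try the tag as given, then each of its alternate forms.
my ($key) = grep { exists $legacy{$_} }
            ($wanted, alternate_language_tags($wanted));

print defined $key ? "found under '$key'\n" : "not found\n";
```
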
       * the function @langs = panic_languages(@accept_languages)
           This function takes a list of 0 or more language tags that
           constitute a given user's Accept-Language list, and returns a list
           of tags for other (non-super) languages that are probably
           acceptable to the user, to be used if all else fails.

           For example, if a user accepts only 'ca' (Catalan) and 'es'
           (Spanish), and the documents/interfaces you have available are
           just in German, Italian, and Chinese, then the user will most
           likely want the Italian one (and not the Chinese or German one!),
           instead of getting nothing.  So "panic_languages('ca', 'es')"
           returns a list containing 'it' (Italian).

           English ('en') is always in the return list, but whether it's at
           the very end or not depends on the input languages.  This function
           works by consulting an internal table that stipulates what common
           languages are "close" to each other.

           A useful construct you might consider using is:

             @fallbacks = map { super_languages($_) } @accept_languages;
             push @fallbacks, panic_languages(
               @accept_languages, @fallbacks,
             );

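           The Catalan/Spanish scenario above can be sketched as follows.
           The @accept and @available lists are the hypothetical inputs from
           that scenario, and picking the first panic language that is
           available is one possible policy, not the only one.

```perl
use strict;
use warnings;
use I18N::LangTags qw(panic_languages);

my @accept    = ('ca', 'es');          # what the user accepts
my @available = ('de', 'it', 'zh');    # what we can actually serve

# Ask for last-resort languages, in the module's order of preference.
my @panic = panic_languages(@accept);

# Pick the first last-resort language that we actually have.
my %have = map { $_ => 1 } @available;
my ($choice) = grep { $have{$_} } @panic;

print defined $choice ? "serving: $choice\n" : "nothing suitable\n";
```
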
       * the function implicate_supers( ...languages... )
           This takes a list of strings (which are presumed to be
           language-tags; strings that aren't, are ignored); and after each
           one, this function inserts super-ordinate forms that don't already
           appear in the list.  The original list, plus these insertions, is
           returned.

           In other words, it takes this:

             pt-br de-DE en-US fr pt-br-janeiro

           and returns this:

             pt-br pt de-DE de en-US en fr pt-br-janeiro

           This function is most useful in the idiom

             implicate_supers( I18N::LangTags::Detect::detect() );

           (See I18N::LangTags::Detect.)

       * the function implicate_supers_strictly( ...languages... )
           This works like "implicate_supers" except that the implicated forms
           are added to the end of the return list.

           In other words, implicate_supers_strictly takes a list of strings
           (which are presumed to be language-tags; strings that aren't, are
           ignored) and after the whole given list, it inserts the
           super-ordinate forms of all given tags, minus any tags that already
           appear in the input list.

           In other words, it takes this:

             pt-br de-DE en-US fr pt-br-janeiro

           and returns this:

             pt-br de-DE en-US fr pt-br-janeiro pt de en

           The reason this function has "_strictly" in its name is that when
           you're processing an Accept-Language list according to the RFCs, if
           you interpret the RFCs quite strictly, then you would use
           implicate_supers_strictly, but for normal use (i.e., common-sense
           use, as far as I'm concerned) you'd use implicate_supers.

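           The difference between the two functions is easiest to see side
           by side.  This sketch runs both on the sample list used above;
           the variable names are illustrative.

```perl
use strict;
use warnings;
use I18N::LangTags qw(implicate_supers implicate_supers_strictly);

my @accept = qw(pt-br de-DE en-US fr pt-br-janeiro);

# Super-ordinate forms inserted right after each tag.
my @interleaved = implicate_supers(@accept);

# Super-ordinate forms appended after the whole list.
my @appended = implicate_supers_strictly(@accept);

print "implicate_supers:          @interleaved\n";
print "implicate_supers_strictly: @appended\n";
```
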

ABOUT LOWERCASING

       I've considered making all the above functions that output language
       tags return all those tags strictly in lowercase.  Having all your
       language tags in lowercase does make some things easier.  But you
       might as well just lowercase as you like, or call
       "encode_language_tag($lang1)" where appropriate.

ABOUT UNICODE PLAINTEXT LANGUAGE TAGS

       In some future version of I18N::LangTags, I plan to include support for
       RFC2482-style language tags -- which are basically just normal language
       tags with their ASCII characters shifted into Plane 14.

COPYRIGHT

       Copyright (c) 1998+ Sean M. Burke. All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       The programs and documentation in this distribution are distributed in
       the hope that they will be useful, but without any warranty; without
       even the implied warranty of merchantability or fitness for a
       particular purpose.

AUTHOR

       Sean M. Burke "sburke@cpan.org"

perl v5.8.8                       2001-09-21               I18N::LangTags(3pm)