I18N::LangTags(3pm)    Perl Programmers Reference Guide    I18N::LangTags(3pm)

NAME
I18N::LangTags - functions for dealing with RFC3066-style language tags

SYNOPSIS
    use I18N::LangTags();

...or specify whichever of the functions you want to import, like so:

    use I18N::LangTags qw(implicate_supers similarity_language_tag);

All the exportable functions are listed below -- you're free to import
only some, or none at all. By default, none are imported. If you say:

    use I18N::LangTags qw(:ALL);

...then all are exported. (This saves you from having to use something
less obvious like "use I18N::LangTags qw(/./)".)

If you don't import any of these functions, assume a &I18N::LangTags::
in front of all the function names in the following examples.

DESCRIPTION
Language tags are a formalism, described in RFC 3066 (which obsoletes
RFC 1766), for declaring what language form (language and possibly
dialect) a given chunk of information is in.

This library provides functions for common tasks involving language
tags as they are needed in a variety of protocols and applications.

Please see the "See Also" references for a thorough explanation of how
to correctly use language tags.

· the function is_language_tag($lang1)

Returns true iff $lang1 is a formally valid language tag.

    is_language_tag("fr") is TRUE
    is_language_tag("x-jicarilla") is FALSE
      (Subtags can be 8 chars long at most -- 'jicarilla' is 9)

    is_language_tag("sgn-US") is TRUE
      (That's American Sign Language)

    is_language_tag("i-Klikitat") is TRUE
      (True without regard to the fact that no one has actually
       registered Klikitat -- it's a formally valid tag)

    is_language_tag("fr-patois") is TRUE
      (Formally valid -- although descriptively weak!)

    is_language_tag("Spanish") is FALSE
    is_language_tag("french-patois") is FALSE
      (No good -- first subtag has to match
       /^([xXiI]|[a-zA-Z]{2,3})$/ -- see RFC3066)

    is_language_tag("x-borg-prot2532") is TRUE
      (Yes, subtags can contain digits, as of RFC3066)
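
As a minimal sketch of how this check might be used in practice, the following filters a list of candidate strings down to the formally valid tags. The @candidates list is made up for illustration; the expected result follows from the examples above.

```perl
use strict;
use warnings;
use I18N::LangTags qw(is_language_tag);

# Hypothetical input, e.g. values read from a configuration file.
my @candidates = ('fr', 'x-jicarilla', 'sgn-US', 'Spanish', 'x-borg-prot2532');

# Keep only the formally valid tags. Validity is purely syntactic: it
# says nothing about whether a tag is registered or in actual use.
my @valid = grep { is_language_tag($_) } @candidates;

print join(', ', @valid), "\n";   # fr, sgn-US, x-borg-prot2532
```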

· the function extract_language_tags($whatever)

Returns a list of whatever looks like formally valid language tags
in $whatever. Not very smart, so don't get too creative with what
you want to feed it.

    extract_language_tags("fr, fr-ca, i-mingo")
      returns: ('fr', 'fr-ca', 'i-mingo')

    extract_language_tags("It's like this: I'm in fr -- French!")
      returns: ('It', 'in', 'fr')
      (So don't just feed it any old thing.)

The output is untainted. If you don't know what tainting is, don't
worry about it.
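
Because the extraction is deliberately loose, one possible way to avoid false positives is to split the input on a delimiter you already know and validate each piece with is_language_tag. The sketch below compares the two approaches; the input string and the comma delimiter are assumptions made for illustration only.

```perl
use strict;
use warnings;
use I18N::LangTags qw(extract_language_tags is_language_tag);

# Hypothetical comma-separated input containing one piece of junk.
my $raw = 'fr, fr-ca, not a tag, i-mingo';

# Loose extraction picks up anything tag-shaped, including 'not' and 'tag'.
my @loose = extract_language_tags($raw);

# Stricter alternative: split on the known delimiter, trim whitespace,
# and keep only the pieces that are valid tags in their entirety.
my @strict;
for my $field (split /,/, $raw) {
    (my $piece = $field) =~ s/^\s+|\s+$//g;
    push @strict, $piece if is_language_tag($piece);
}

print "loose:  @loose\n";    # fr fr-ca not tag i-mingo
print "strict: @strict\n";   # fr fr-ca i-mingo
```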

· the function same_language_tag($lang1, $lang2)

Returns true iff $lang1 and $lang2 are acceptable variant tags
representing the same language-form.

    same_language_tag('x-kadara', 'i-kadara') is TRUE
      (The x/i- alternation doesn't matter)
    same_language_tag('X-KADARA', 'i-kadara') is TRUE
      (...and neither does case)
    same_language_tag('en', 'en-US') is FALSE
      (all-English is not the SAME as US English)
    same_language_tag('x-kadara', 'x-kadar') is FALSE
      (these are totally unrelated tags)
    same_language_tag('no-bok', 'nb') is TRUE
      (no-bok is a legacy tag for nb (Norwegian Bokmal))

"same_language_tag" works by just seeing whether
"encode_language_tag($lang1)" is the same as
"encode_language_tag($lang2)".

(Yes, I know this function is named a bit oddly. Call it historic
reasons.)

· the function similarity_language_tag($lang1, $lang2)

Returns an integer representing the degree of similarity between
tags $lang1 and $lang2 (the order of which does not matter), where
similarity is the number of common elements on the left, without
regard to case and to x/i- alternation.

    similarity_language_tag('fr', 'fr-ca') is 1
      (one element in common)
    similarity_language_tag('fr-ca', 'fr-FR') is 1
      (one element in common)

    similarity_language_tag('fr-CA-joual',
                            'fr-CA-PEI') is 2
    similarity_language_tag('fr-CA-joual', 'fr-CA') is 2
      (two elements in common)

    similarity_language_tag('x-kadara', 'i-kadara') is 1
      (x/i- doesn't matter)

    similarity_language_tag('en', 'x-kadar') is 0
    similarity_language_tag('x-kadara', 'x-kadar') is 0
      (unrelated tags -- no similarity)

    similarity_language_tag('i-cree-syllabic',
                            'i-cherokee-syllabic') is 0
      (no leftmost elements in common!)
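
One way the similarity score might be put to work is to pick, from a set of available tags, the one that shares the most leading elements with a requested tag. This is only a sketch: @available and $wanted are hypothetical, and a high score is no guarantee that the two language forms are mutually intelligible.

```perl
use strict;
use warnings;
use I18N::LangTags qw(similarity_language_tag);

# Hypothetical set of tags we have content for, and a requested tag.
my @available = ('en', 'fr', 'fr-CA', 'de');
my $wanted    = 'fr-CA-joual';

# Track the tag with the highest similarity; a score of 0 means
# "nothing in common", so such tags are never selected.
my ($best, $best_score) = (undef, 0);
for my $tag (@available) {
    my $score = similarity_language_tag($wanted, $tag);
    ($best, $best_score) = ($tag, $score) if $score > $best_score;
}

print defined $best ? "$best ($best_score)\n" : "no match\n";   # fr-CA (2)
```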

· the function is_dialect_of($lang1, $lang2)

Returns true iff language tag $lang1 represents a subform of
language tag $lang2.

Get the order right! It doesn't work the other way around!

    is_dialect_of('en-US', 'en') is TRUE
      (American English IS a dialect of all-English)

    is_dialect_of('fr-CA-joual', 'fr-CA') is TRUE
    is_dialect_of('fr-CA-joual', 'fr') is TRUE
      (Joual is a dialect of (a dialect of) French)

    is_dialect_of('en', 'en-US') is FALSE
      (all-English is NOT a dialect of American English)

    is_dialect_of('fr', 'en-CA') is FALSE

    is_dialect_of('en', 'en' ) is TRUE
    is_dialect_of('en-US', 'en-US') is TRUE
      (Note: these are degenerate cases)

    is_dialect_of('i-mingo-tom', 'x-Mingo') is TRUE
      (the x/i thing doesn't matter, nor does case)

    is_dialect_of('nn', 'no') is TRUE
      (because 'nn' (New Norse) is aliased to 'no-nyn',
       as a special legacy case, and 'no-nyn' is a
       subform of 'no' (Norwegian))
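
As an illustration of the argument order, the sketch below selects from a list of requested tags those that are subforms of 'fr' (including 'fr' itself, the degenerate case). The @requested list is invented for the example.

```perl
use strict;
use warnings;
use I18N::LangTags qw(is_dialect_of);

# Hypothetical list of tags requested by a client.
my @requested = ('en-US', 'fr-CA-joual', 'fr', 'de-AT');

# First argument is the candidate subform, second is the broader tag.
my @french = grep { is_dialect_of($_, 'fr') } @requested;

print join(', ', @french), "\n";   # fr-CA-joual, fr
```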

· the function super_languages($lang1)

Returns a list of language tags that are superordinate tags to
$lang1 -- it gets this by removing subtags from the end of $lang1
until nothing (or just "i" or "x") is left.

    super_languages("fr-CA-joual") is ("fr-CA", "fr")

    super_languages("en-AU") is ("en")

    super_languages("en") is empty-list, ()

    super_languages("i-cherokee") is empty-list, ()
      ...not ("i"), which would be illegal as well as pointless.

If $lang1 is not a valid language tag, returns empty-list in a list
context, undef in a scalar context.

A notable and rather unavoidable problem with this method:
"x-mingo-tom" has an "x" because the whole tag isn't an IANA-
registered tag -- but super_languages('x-mingo-tom') is ('x-mingo')
-- which isn't really right, since 'i-mingo' is registered. But
this module has no way of knowing that. (But note that
same_language_tag('x-mingo', 'i-mingo') is TRUE.)

More importantly, you assume at your peril that superordinates of
$lang1 are mutually intelligible with $lang1. Consider this
carefully.
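
A common use for superordinate tags is a fallback chain: try the requested tag first, then each broader tag in turn. The sketch below assumes a hypothetical %greeting table keyed on plain lowercase tags; as the caveat above says, whether falling back is appropriate depends on the languages involved.

```perl
use strict;
use warnings;
use I18N::LangTags qw(super_languages);

# Hypothetical lookup table; only the broad tags are present.
my %greeting = (fr => 'Bonjour', en => 'Hello');
my $wanted   = 'fr-CA-joual';

# Try 'fr-CA-joual', then 'fr-CA', then 'fr', stopping at the first hit.
my $found;
for my $tag ($wanted, super_languages($wanted)) {
    if (exists $greeting{$tag}) {
        $found = $greeting{$tag};
        last;
    }
}

print defined $found ? "$found\n" : "no greeting available\n";   # Bonjour
```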

· the function locale2language_tag($locale_identifier)

This takes a locale name (like "en", "en_US", or "en_US.ISO8859-1")
and maps it to a language tag. If it's not mappable (as with,
notably, "C" and "POSIX"), this returns empty-list in a list
context, or undef in a scalar context.

    locale2language_tag("en") is "en"

    locale2language_tag("en_US") is "en-US"

    locale2language_tag("en_US.ISO8859-1") is "en-US"

    locale2language_tag("C") is undef or ()

    locale2language_tag("POSIX") is undef or ()

I'm not totally sure that locale names map satisfactorily to
language tags. Think REAL hard about how you use this. YOU HAVE
BEEN WARNED.

The output is untainted. If you don't know what tainting is, don't
worry about it.
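
The following sketch shows one way to handle the unmappable case explicitly by calling the function in scalar context and checking for undef. The locale names are sample values rather than anything read from a real environment.

```perl
use strict;
use warnings;
use I18N::LangTags qw(locale2language_tag);

# Map a few sample locale names; unmappable ones come back as undef
# because the function is called in scalar context here.
my %tag_for;
for my $locale ('en_US.ISO8859-1', 'fr_CA', 'C') {
    my $tag = locale2language_tag($locale);
    $tag_for{$locale} = $tag;
}

for my $locale (sort keys %tag_for) {
    my $shown = defined $tag_for{$locale} ? $tag_for{$locale} : '(no tag)';
    printf "%-16s => %s\n", $locale, $shown;
}
```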

· the function encode_language_tag($lang1)

This function, if given a language tag, returns an encoding of it
such that:

* tags representing different languages never get the same
  encoding.

* tags representing the same language always get the same encoding.

* an encoding of a formally valid language tag always is a string
  value that is defined, has length, and is true if considered as a
  boolean.

Note that the encoding itself is not a formally valid language tag.
Note also that you cannot, currently, go from an encoding back to a
language tag that it's an encoding of.

Note also that you must consider the encoded value as atomic; i.e.,
you should not consider it as anything but an opaque, unanalysable
string value. (The internals of the encoding method may change in
future versions, as the language tagging standard changes over
time.)

"encode_language_tag" returns undef if given anything other than a
formally valid language tag.

"encode_language_tag" exists because different language tags may
represent the same language; this is normally treatable with
"same_language_tag", but consider this situation:

You have a data file that expresses greetings in different
languages. Its format is "[language tag]=[how to say 'Hello']",
like:

    en-US=Hiho
    fr=Bonjour
    i-mingo=Hau'

And suppose you write a program that reads that file and then runs
as a daemon, answering client requests that specify a language tag
and then expect the string that says how to greet in that language.
So an interaction looks like:

    greeting-client asks:    fr
    greeting-server answers: Bonjour

So far so good. But suppose the way you're implementing this is:

    my %greetings;
    # Lexical filehandle and three-argument open; report why open failed
    open(my $in, '<', 'in.dat') or die "Can't open in.dat: $!";
    while (<$in>) {
        chomp;
        next unless /^([^=]+)=(.+)/s;
        my($lang, $expr) = ($1, $2);
        $greetings{$lang} = $expr;
    }
    close($in);

at which point %greetings has the contents:

    "en-US"   => "Hiho"
    "fr"      => "Bonjour"
    "i-mingo" => "Hau'"

And suppose then that you answer client requests for language
$wanted by just looking up $greetings{$wanted}.

If the client asks for "fr", that will look up successfully in
%greetings, to the value "Bonjour". And if the client asks for
"i-mingo", that will look up successfully in %greetings, to the
value "Hau'".

But if the client asks for "i-Mingo" or "x-mingo", or "Fr", then
the lookup in %greetings fails. That's the Wrong Thing.

You could instead do lookups on $wanted with:

    use I18N::LangTags qw(same_language_tag);
    my $response = '';
    foreach my $l2 (keys %greetings) {
        if (same_language_tag($wanted, $l2)) {
            $response = $greetings{$l2};
            last;
        }
    }

But that's rather inefficient, since every request scans all the
keys. A better way to do it is to start your program with:

    use I18N::LangTags qw(encode_language_tag);
    my %greetings;
    open(my $in, '<', 'in.dat') or die "Can't open in.dat: $!";
    while (<$in>) {
        chomp;
        next unless /^([^=]+)=(.+)/s;
        my($lang, $expr) = ($1, $2);
        # Key the hash on the normalized encoding, not the raw tag
        my $key = encode_language_tag($lang);
        next unless defined $key;   # skip lines whose tag is not valid
        $greetings{$key} = $expr;
    }
    close($in);

and then answer client requests for language $wanted by just
looking up

    $greetings{encode_language_tag($wanted)}

And that does the Right Thing.

· the function alternate_language_tags($lang1)

This function, if given a language tag, returns all language tags
that are alternate forms of this language tag. (I.e., tags which
refer to the same language.) This is meant to handle legacy tags
caused by the minor changes in language tag standards over the
years; and the x-/i- alternation is also dealt with.

Note that this function does not try to equate new (and never-used,
and unusable) ISO639-2 three-letter tags to old (and still in use)
ISO639-1 two-letter equivalents -- like "ara" -> "ar" -- because
"ara" has never been in use as an Internet language tag, and RFC
3066 stipulates that it never should be, since a shorter tag ("ar")
exists.

Examples:

    alternate_language_tags('no-bok') is ('nb')
    alternate_language_tags('nb') is ('no-bok')
    alternate_language_tags('he') is ('iw')
    alternate_language_tags('iw') is ('he')
    alternate_language_tags('i-hakka') is ('zh-hakka', 'x-hakka')
    alternate_language_tags('zh-hakka') is ('i-hakka', 'x-hakka')
    alternate_language_tags('en') is ()
    alternate_language_tags('x-mingo-tom') is ('i-mingo-tom')
    alternate_language_tags('x-klikitat') is ('i-klikitat')
    alternate_language_tags('i-klikitat') is ('x-klikitat')

This function returns empty-list if given anything other than a
formally valid language tag.
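
When existing data is keyed on raw tags that you cannot re-encode, one option is to probe for the requested tag and then each of its alternates. The sketch below assumes a hypothetical %docs table that was keyed on the legacy Hebrew tag 'iw'.

```perl
use strict;
use warnings;
use I18N::LangTags qw(alternate_language_tags);

# Hypothetical table keyed on a legacy tag.
my %docs   = ('iw' => 'legacy-hebrew.html');
my $wanted = 'he';

# Probe the requested tag first, then its alternate forms.
my ($hit) = grep { exists $docs{$_} } ($wanted, alternate_language_tags($wanted));

print defined $hit ? "$docs{$hit}\n" : "not found\n";   # legacy-hebrew.html
```

Note that this exact-string probing is still case-sensitive; where you control the table, keying it on encode_language_tag is the more robust choice.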

· the function @langs = panic_languages(@accept_languages)

This function takes a list of 0 or more language tags that
constitute a given user's Accept-Language list, and returns a list
of tags for other (non-super) languages that are probably
acceptable to the user, to be used if all else fails.

For example, if a user accepts only 'ca' (Catalan) and 'es'
(Spanish), and the documents/interfaces you have available are just
in German, Italian, and Chinese, then the user will most likely
want the Italian one (and not the Chinese or German one!), instead
of getting nothing. So "panic_languages('ca', 'es')" returns a
list containing 'it' (Italian).

English ('en') is always in the return list, but whether it's at
the very end or not depends on the input languages. This function
works by consulting an internal table that stipulates what common
languages are "close" to each other.

A useful construct you might consider using is the following
(super_languages takes a single tag, so it is applied to each
accepted language in turn):

    @fallbacks = map { super_languages($_) } @accept_languages;
    push @fallbacks, panic_languages(
        @accept_languages, @fallbacks,
    );
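
To make the fallback idea concrete, here is a sketch with a made-up Accept-Language list. Since super_languages is documented above as taking a single tag, the sketch maps it over the list. The exact contents and order of the panic list depend on the module's internal table, so only the documented properties (Italian and English are present) are relied on here.

```perl
use strict;
use warnings;
use I18N::LangTags qw(super_languages panic_languages);

# Hypothetical Accept-Language list: Catalan and Mexican Spanish.
my @accept_languages = ('ca', 'es-MX');

# Superordinate tags first: 'ca' has none, 'es-MX' yields 'es'.
my @fallbacks = map { super_languages($_) } @accept_languages;

# Then the "close enough" languages from the module's internal table.
push @fallbacks, panic_languages(@accept_languages, @fallbacks);

print join(' ', @fallbacks), "\n";
```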

· the function implicate_supers( ...languages... )

This takes a list of strings (which are presumed to be language-
tags; strings that aren't, are ignored); and after each one, this
function inserts super-ordinate forms that don't already appear in
the list. The original list, plus these insertions, is returned.

In other words, it takes this:

    pt-br de-DE en-US fr pt-br-janeiro

and returns this:

    pt-br pt de-DE de en-US en fr pt-br-janeiro

This function is most useful in the idiom

    implicate_supers( I18N::LangTags::Detect::detect() );

(See I18N::LangTags::Detect.)

· the function implicate_supers_strictly( ...languages... )

This works like "implicate_supers" except that the implicated forms
are added to the end of the return list.

In other words, implicate_supers_strictly takes a list of strings
(which are presumed to be language-tags; strings that aren't, are
ignored) and after the whole given list, it inserts the super-
ordinate forms of all given tags, minus any tags that already
appear in the input list.

In other words, it takes this:

    pt-br de-DE en-US fr pt-br-janeiro

and returns this:

    pt-br de-DE en-US fr pt-br-janeiro pt de en

The reason this function has "_strictly" in its name is that when
you're processing an Accept-Language list according to the RFCs, if
you interpret the RFCs quite strictly, then you would use
implicate_supers_strictly, but for normal use (i.e., common-sense
use, as far as I'm concerned) you'd use implicate_supers.
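
The difference between the two functions is easiest to see side by side. This sketch runs both on the same short, made-up preference list.

```perl
use strict;
use warnings;
use I18N::LangTags qw(implicate_supers implicate_supers_strictly);

# Hypothetical user preference list, most preferred first.
my @accept = ('fr-CA', 'en-US');

# Superordinate forms are inserted right after each tag...
my @loose = implicate_supers(@accept);

# ...or appended after the whole original list.
my @strict = implicate_supers_strictly(@accept);

print "implicate_supers:          @loose\n";    # fr-CA fr en-US en
print "implicate_supers_strictly: @strict\n";   # fr-CA en-US fr en
```

With the first form, plain French outranks US English; with the strict form, every explicitly listed tag outranks every implicated one.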

ABOUT LOWERCASING
I've considered making all the above functions that output language
tags return all those tags strictly in lowercase. Having all your
language tags in lowercase does make some things easier. But you might
as well just lowercase as you like, or call
"encode_language_tag($lang1)" where appropriate.

ABOUT UNICODE PLAINTEXT LANGUAGE TAGS
In some future version of I18N::LangTags, I plan to include support for
RFC2482-style language tags -- which are basically just normal language
tags with their ASCII characters shifted into Plane 14.

SEE ALSO
* I18N::LangTags::List

* RFC 3066, "ftp://ftp.isi.edu/in-notes/rfc3066.txt", "Tags for the
  Identification of Languages". (Obsoletes RFC 1766)

* RFC 2277, "ftp://ftp.isi.edu/in-notes/rfc2277.txt", "IETF Policy on
  Character Sets and Languages".

* RFC 2231, "ftp://ftp.isi.edu/in-notes/rfc2231.txt", "MIME Parameter
  Value and Encoded Word Extensions: Character Sets, Languages, and
  Continuations".

* RFC 2482, "ftp://ftp.isi.edu/in-notes/rfc2482.txt", "Language Tagging
  in Unicode Plain Text".

* Locale::Codes, in
  "http://www.perl.com/CPAN/modules/by-module/Locale/"

* ISO 639-2, "Codes for the representation of names of languages",
  including two-letter and three-letter codes,
  "http://www.loc.gov/standards/iso639-2/langcodes.html"

* The IANA list of registered languages (hopefully up-to-date),
  "http://www.iana.org/assignments/language-tags"

COPYRIGHT
Copyright (c) 1998+ Sean M. Burke. All rights reserved.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

The programs and documentation in this dist are distributed in the hope
that they will be useful, but without any warranty; without even the
implied warranty of merchantability or fitness for a particular
purpose.

AUTHOR
Sean M. Burke "sburke@cpan.org"

perl v5.10.1                      2009-02-12               I18N::LangTags(3pm)