I18N::LangTags(3pm)    Perl Programmers Reference Guide    I18N::LangTags(3pm)

NAME
I18N::LangTags - functions for dealing with RFC3066-style language tags

SYNOPSIS
    use I18N::LangTags();

...or specify whichever of those functions you want to import, like so:

    use I18N::LangTags qw(implicate_supers similarity_language_tag);

All the exportable functions are listed below -- you're free to import
only some, or none at all. By default, none are imported. If you say:

    use I18N::LangTags qw(:ALL)

...then all are exported. (This saves you from having to use something
less obvious like "use I18N::LangTags qw(/./)".)

If you don't import any of these functions, assume a &I18N::LangTags::
in front of all the function names in the following examples.

DESCRIPTION
Language tags are a formalism, described in RFC 3066 (obsoleting 1766),
for declaring what language form (language and possibly dialect) a
given chunk of information is in.

This library provides functions for common tasks involving language
tags as they are needed in a variety of protocols and applications.

Please see the "See Also" references for a thorough explanation of how
to correctly use language tags.

* the function is_language_tag($lang1)
Returns true iff $lang1 is a formally valid language tag.

    is_language_tag("fr") is TRUE
    is_language_tag("x-jicarilla") is FALSE
      (Subtags can be 8 chars long at most -- 'jicarilla' is 9)

    is_language_tag("sgn-US") is TRUE
      (That's American Sign Language)

    is_language_tag("i-Klikitat") is TRUE
      (True without regard to the fact that no one has actually
      registered Klikitat -- it's a formally valid tag)

    is_language_tag("fr-patois") is TRUE
      (Formally valid -- although descriptively weak!)

    is_language_tag("Spanish") is FALSE
    is_language_tag("french-patois") is FALSE
      (No good -- first subtag has to match
      /^([xXiI]|[a-zA-Z]{2,3})$/ -- see RFC3066)

    is_language_tag("x-borg-prot2532") is TRUE
      (Yes, subtags can contain digits, as of RFC3066)
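
As a rough illustration of how this check might be used, the following sketch filters a list of candidate strings down to the formally valid tags. The helper name valid_tags and the sample list are assumptions made for this example; they are not part of the module.

```perl
use strict;
use warnings;
use I18N::LangTags qw(is_language_tag);

# valid_tags() is a hypothetical helper, not part of I18N::LangTags.
# It keeps only the formally valid tags from a list of candidates.
sub valid_tags {
    return grep { defined $_ && is_language_tag($_) } @_;
}

my @candidates = ('fr', 'Spanish', 'sgn-US', 'x-jicarilla', 'x-borg-prot2532');
my @ok = valid_tags(@candidates);
print join(', ', @ok), "\n";    # fr, sgn-US, x-borg-prot2532
```

Remember that this only checks form: a tag that passes may still be unregistered or meaningless.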

* the function extract_language_tags($whatever)
Returns a list of whatever looks like formally valid language tags
in $whatever. Not very smart, so don't get too creative with what
you want to feed it.

    extract_language_tags("fr, fr-ca, i-mingo")
      returns: ('fr', 'fr-ca', 'i-mingo')

    extract_language_tags("It's like this: I'm in fr -- French!")
      returns: ('It', 'in', 'fr')
      (So don't just feed it any old thing.)

The output is untainted. If you don't know what tainting is, don't
worry about it.
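
Because extract_language_tags picks tag-like words out of any text, one possible precaution is to split a comma-separated list yourself and validate each item as a whole. The sketch below shows both approaches; strict_tags is a hypothetical helper invented for this example, and it assumes the input really is comma-separated.

```perl
use strict;
use warnings;
use I18N::LangTags qw(extract_language_tags is_language_tag);

# Well-formed input: a plain comma-separated list of tags.
my @tags = extract_language_tags("fr, fr-ca, i-mingo");
print "@tags\n";      # fr fr-ca i-mingo

# strict_tags() is a hypothetical helper, not part of the module. It splits
# on commas and keeps only the items that are, as a whole, valid tags.
sub strict_tags {
    my ($text) = @_;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    return grep { is_language_tag($_) } split /\s*,\s*/, $text;
}

my @strict = strict_tags("fr, fr-ca, not a tag, i-mingo");
print "@strict\n";    # fr fr-ca i-mingo
```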

* the function same_language_tag($lang1, $lang2)
Returns true iff $lang1 and $lang2 are acceptable variant tags
representing the same language-form.

    same_language_tag('x-kadara', 'i-kadara') is TRUE
      (The x/i- alternation doesn't matter)
    same_language_tag('X-KADARA', 'i-kadara') is TRUE
      (...and neither does case)
    same_language_tag('en', 'en-US') is FALSE
      (all-English is not the SAME as US English)
    same_language_tag('x-kadara', 'x-kadar') is FALSE
      (these are totally unrelated tags)
    same_language_tag('no-bok', 'nb') is TRUE
      (no-bok is a legacy tag for nb (Norwegian Bokmal))

"same_language_tag" works by just seeing whether
"encode_language_tag($lang1)" is the same as "encode_language_tag($lang2)".

(Yes, I know this function is named a bit oddly. Call it historic
reasons.)

* the function similarity_language_tag($lang1, $lang2)
Returns an integer representing the degree of similarity between
tags $lang1 and $lang2 (the order of which does not matter), where
similarity is the number of common elements on the left, without
regard to case and to x/i- alternation.

    similarity_language_tag('fr', 'fr-ca') is 1
      (one element in common)
    similarity_language_tag('fr-ca', 'fr-FR') is 1
      (one element in common)

    similarity_language_tag('fr-CA-joual', 'fr-CA-PEI') is 2
    similarity_language_tag('fr-CA-joual', 'fr-CA') is 2
      (two elements in common)

    similarity_language_tag('x-kadara', 'i-kadara') is 1
      (x/i- doesn't matter)

    similarity_language_tag('en', 'x-kadar') is 0
    similarity_language_tag('x-kadara', 'x-kadar') is 0
      (unrelated tags -- no similarity)

    similarity_language_tag('i-cree-syllabic', 'i-cherokee-syllabic') is 0
      (no leftmost elements in common!)
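
One way the similarity score might be put to work is to choose, from the tags you have content for, the one closest to what was requested. This is only a sketch: best_match is a hypothetical helper, and "highest score wins, first one on ties" is an assumption of this example, not a policy of the module.

```perl
use strict;
use warnings;
use I18N::LangTags qw(similarity_language_tag);

# best_match() is a hypothetical helper, not part of I18N::LangTags.
# It returns the available tag with the highest similarity to $wanted,
# or undef when no available tag shares a leftmost element with it.
sub best_match {
    my ($wanted, @available) = @_;
    my ($best, $best_score) = (undef, 0);
    foreach my $tag (@available) {
        my $score = similarity_language_tag($wanted, $tag) || 0;
        ($best, $best_score) = ($tag, $score) if $score > $best_score;
    }
    return $best;
}

my $pick = best_match('fr-CA-joual', 'en', 'fr', 'fr-CA');
print "$pick\n";    # fr-CA
```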

* the function is_dialect_of($lang1, $lang2)
Returns true iff language tag $lang1 represents a subform of
language tag $lang2.

Get the order right! It doesn't work the other way around!

    is_dialect_of('en-US', 'en') is TRUE
      (American English IS a dialect of all-English)

    is_dialect_of('fr-CA-joual', 'fr-CA') is TRUE
    is_dialect_of('fr-CA-joual', 'fr') is TRUE
      (Joual is a dialect of (a dialect of) French)

    is_dialect_of('en', 'en-US') is FALSE
      (all-English is NOT a dialect of American English)

    is_dialect_of('fr', 'en-CA') is FALSE

    is_dialect_of('en', 'en') is TRUE
    is_dialect_of('en-US', 'en-US') is TRUE
      (Note: these are degenerate cases)

    is_dialect_of('i-mingo-tom', 'x-Mingo') is TRUE
      (the x/i thing doesn't matter, nor does case)

    is_dialect_of('nn', 'no') is TRUE
      (because 'nn' (New Norse) is aliased to 'no-nyn',
      as a special legacy case, and 'no-nyn' is a
      subform of 'no' (Norwegian))
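
A small sketch of the argument order in practice: selecting, from a list of available tags, those that are subforms of a given language. The list @available is made up for this example.

```perl
use strict;
use warnings;
use I18N::LangTags qw(is_dialect_of);

# Hypothetical list of tags for which content exists.
my @available = ('en-US', 'en-GB', 'fr-CA', 'fr', 'de');

# The candidate goes first, the broader language second.
my @english = grep { is_dialect_of($_, 'en') } @available;
print "@english\n";    # en-US en-GB
```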

* the function super_languages($lang1)
Returns a list of language tags that are superordinate tags to
$lang1 -- it gets this by removing subtags from the end of $lang1
until nothing (or just "i" or "x") is left.

    super_languages("fr-CA-joual") is ("fr-CA", "fr")

    super_languages("en-AU") is ("en")

    super_languages("en") is empty-list, ()

    super_languages("i-cherokee") is empty-list, ()
      ...not ("i"), which would be illegal as well as pointless.

If $lang1 is not a valid language tag, returns empty-list in a list
context, undef in a scalar context.

A notable and rather unavoidable problem with this method:
"x-mingo-tom" has an "x" because the whole tag isn't an IANA-registered
tag -- but super_languages('x-mingo-tom') is ('x-mingo') --
which isn't really right, since 'i-mingo' is registered. But this
module has no way of knowing that. (But note that
same_language_tag('x-mingo', 'i-mingo') is TRUE.)

More importantly, you assume at your peril that superordinates of
$lang1 are mutually intelligible with $lang1. Consider this
carefully.
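
For instance, a lookup order from most to least specific could be built as in the sketch below. The helper name fallback_chain is hypothetical, and whether the broader tags are acceptable substitutes is a judgment the caller still has to make.

```perl
use strict;
use warnings;
use I18N::LangTags qw(super_languages);

# fallback_chain() is a hypothetical helper, not part of I18N::LangTags.
# It returns the tag itself followed by its superordinate tags,
# most specific first.
sub fallback_chain {
    my ($tag) = @_;
    return ($tag, super_languages($tag));
}

my @chain = fallback_chain('fr-CA-joual');
print "@chain\n";    # fr-CA-joual fr-CA fr
```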

* the function locale2language_tag($locale_identifier)
This takes a locale name (like "en", "en_US", or "en_US.ISO8859-1")
and maps it to a language tag. If it's not mappable (as with,
notably, "C" and "POSIX"), this returns empty-list in a list
context, or undef in a scalar context.

    locale2language_tag("en") is "en"

    locale2language_tag("en_US") is "en-US"

    locale2language_tag("en_US.ISO8859-1") is "en-US"

    locale2language_tag("C") is undef or ()

    locale2language_tag("POSIX") is undef or ()

I'm not totally sure that locale names map satisfactorily to
language tags. Think REAL hard about how you use this. YOU HAVE BEEN
WARNED.

The output is untainted. If you don't know what tainting is, don't
worry about it.
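
Since unmappable locales yield undef in scalar context, callers usually need a default. The following sketch shows one way to supply it; language_from_locale and the choice of 'en' as the default are assumptions of this example.

```perl
use strict;
use warnings;
use I18N::LangTags qw(locale2language_tag);

# language_from_locale() is a hypothetical helper, not part of the module.
# It maps a locale name to a language tag, or returns $default when the
# locale is missing or not mappable (as with "C" and "POSIX").
sub language_from_locale {
    my ($locale, $default) = @_;
    my $tag = defined $locale ? locale2language_tag($locale) : undef;
    return defined $tag ? $tag : $default;
}

print language_from_locale('en_US.ISO8859-1', 'en'), "\n";    # en-US
print language_from_locale('C', 'en'), "\n";                  # en
```

In a real program the first argument might come from something like $ENV{LANG}, with the caveats above about how well locale names map to language tags.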

* the function encode_language_tag($lang1)
This function, if given a language tag, returns an encoding of it
such that:

  * tags representing different languages never get the same
    encoding.

  * tags representing the same language always get the same encoding.

  * an encoding of a formally valid language tag always is a string
    value that is defined, has length, and is true if considered as a
    boolean.

Note that the encoding itself is not a formally valid language tag.
Note also that you cannot, currently, go from an encoding back to a
language tag that it's an encoding of.

Note also that you must consider the encoded value as atomic; i.e.,
you should not consider it as anything but an opaque, unanalysable
string value. (The internals of the encoding method may change in
future versions, as the language tagging standard changes over
time.)

"encode_language_tag" returns undef if given anything other than a
formally valid language tag.

"encode_language_tag" exists because different language tags may
represent the same language; this is normally treatable with
"same_language_tag", but consider this situation:

You have a data file that expresses greetings in different
languages. Its format is "[language tag]=[how to say 'Hello']", like:

    en-US=Hiho
    fr=Bonjour
    i-mingo=Hau'

And suppose you write a program that reads that file and then runs
as a daemon, answering client requests that specify a language tag
and then expect the string that says how to greet in that language.
So an interaction looks like:

    greeting-client asks: fr
    greeting-server answers: Bonjour

So far so good. But suppose the way you're implementing this is:

    my %greetings;
    # Read "tag=greeting" lines into a hash keyed by the raw tag.
    open(my $in, '<', 'in.dat') or die "Can't open in.dat: $!";
    while (<$in>) {
        chomp;
        next unless /^([^=]+)=(.+)/s;
        my($lang, $expr) = ($1, $2);
        $greetings{$lang} = $expr;
    }
    close($in);

at which point %greetings has the contents:

    "en-US" => "Hiho"
    "fr" => "Bonjour"
    "i-mingo" => "Hau'"

And suppose then that you answer client requests for language
$wanted by just looking up $greetings{$wanted}.

If the client asks for "fr", that will look up successfully in
%greetings, to the value "Bonjour". And if the client asks for
"i-mingo", that will look up successfully in %greetings, to the
value "Hau'".

But if the client asks for "i-Mingo" or "x-mingo", or "Fr", then
the lookup in %greetings fails. That's the Wrong Thing.

You could instead do lookups on $wanted with:

    use I18N::LangTags qw(same_language_tag);
    my $response = '';
    # Linear scan: compare $wanted against every stored tag.
    foreach my $l2 (keys %greetings) {
        if (same_language_tag($wanted, $l2)) {
            $response = $greetings{$l2};
            last;
        }
    }

But that's rather inefficient. A better way to do it is to start
your program with:

    use I18N::LangTags qw(encode_language_tag);
    my %greetings;
    # Key the hash by the encoded tag, so equivalent tags share one entry.
    open(my $in, '<', 'in.dat') or die "Can't open in.dat: $!";
    while (<$in>) {
        chomp;
        next unless /^([^=]+)=(.+)/s;
        my($lang, $expr) = ($1, $2);
        $greetings{ encode_language_tag($lang) } = $expr;
    }
    close($in);

and then answer client requests for language $wanted by just
looking up

    $greetings{encode_language_tag($wanted)}

And that does the Right Thing.
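
The whole greeting-server idea can be tried without a data file. The sketch below holds the sample lines in memory and adds a guard for invalid tags; greeting_for is a hypothetical helper, and skipping lines whose key is not a valid tag is a choice made for this example.

```perl
use strict;
use warnings;
use I18N::LangTags qw(encode_language_tag);

# Sample data held in memory instead of in.dat, so the sketch is
# self-contained.
my @lines = ("en-US=Hiho", "fr=Bonjour", "i-mingo=Hau'");

my %greetings;
foreach my $line (@lines) {
    next unless $line =~ /^([^=]+)=(.+)/s;
    my ($lang, $expr) = ($1, $2);
    my $key = encode_language_tag($lang);
    next unless defined $key;    # skip keys that are not valid tags
    $greetings{$key} = $expr;
}

# greeting_for() is a hypothetical lookup helper. It returns undef for
# invalid tags and for tags with no stored greeting.
sub greeting_for {
    my ($wanted) = @_;
    my $key = encode_language_tag($wanted);
    return defined $key ? $greetings{$key} : undef;
}

print greeting_for('Fr'), "\n";         # Bonjour
print greeting_for('x-mingo'), "\n";    # Hau'
```

Checking the return value of encode_language_tag matters here, because it is undef for anything that is not a formally valid tag.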

* the function alternate_language_tags($lang1)
This function, if given a language tag, returns all language tags
that are alternate forms of this language tag. (I.e., tags which
refer to the same language.) This is meant to handle legacy tags
caused by the minor changes in language tag standards over the
years; and the x-/i- alternation is also dealt with.

Note that this function does not try to equate new (and never-used,
and unusable) ISO639-2 three-letter tags to old (and still in use)
ISO639-1 two-letter equivalents -- like "ara" -> "ar" -- because
"ara" has never been in use as an Internet language tag, and RFC
3066 stipulates that it never should be, since a shorter tag ("ar")
exists.

Examples:

    alternate_language_tags('no-bok') is ('nb')
    alternate_language_tags('nb') is ('no-bok')
    alternate_language_tags('he') is ('iw')
    alternate_language_tags('iw') is ('he')
    alternate_language_tags('i-hakka') is ('zh-hakka', 'x-hakka')
    alternate_language_tags('zh-hakka') is ('i-hakka', 'x-hakka')
    alternate_language_tags('en') is ()
    alternate_language_tags('x-mingo-tom') is ('i-mingo-tom')
    alternate_language_tags('x-klikitat') is ('i-klikitat')
    alternate_language_tags('i-klikitat') is ('x-klikitat')

This function returns empty-list if given anything other than a
formally valid language tag.
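
One conceivable use is widening a list of requested tags so that legacy spellings are tried as well. In this sketch, with_alternates is a hypothetical helper and the input list is made up.

```perl
use strict;
use warnings;
use I18N::LangTags qw(alternate_language_tags);

# with_alternates() is a hypothetical helper, not part of the module.
# It returns each tag followed by its alternate forms, if any.
sub with_alternates {
    return map { ($_, alternate_language_tags($_)) } @_;
}

my @expanded = with_alternates('he', 'en');
print "@expanded\n";    # he iw en
```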

* the function @langs = panic_languages(@accept_languages)
This function takes a list of 0 or more language tags that
constitute a given user's Accept-Language list, and returns a list of
tags for other (non-super) languages that are probably acceptable
to the user, to be used if all else fails.

For example, if a user accepts only 'ca' (Catalan) and 'es'
(Spanish), and the documents/interfaces you have available are just in
German, Italian, and Chinese, then the user will most likely want
the Italian one (and not the Chinese or German one!), instead of
getting nothing. So "panic_languages('ca', 'es')" returns a list
containing 'it' (Italian).

English ('en') is always in the return list, but whether it's at
the very end or not depends on the input languages. This function
works by consulting an internal table that stipulates what common
languages are "close" to each other.

A useful construct you might consider using is the following. (Note
that super_languages takes a single tag, so it is applied to each
accepted language in turn.)

    @fallbacks = map { super_languages($_) } @accept_languages;
    push @fallbacks, panic_languages(
        @accept_languages, @fallbacks,
    );
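
A runnable version of that construct, with a made-up Accept-Language list, might look like the sketch below. The exact contents and order of the panic list come from the module's internal table, so the sketch only checks that Italian turns up, as described above.

```perl
use strict;
use warnings;
use I18N::LangTags qw(super_languages panic_languages);

# Hypothetical Accept-Language list for this example.
my @accept_languages = ('ca', 'es-ES');

# Superordinate tags first (here just 'es'), then the panic fallbacks.
my @fallbacks = map { super_languages($_) } @accept_languages;
push @fallbacks, panic_languages(@accept_languages, @fallbacks);

print "fallbacks: @fallbacks\n";
print "Italian is among them\n" if grep { $_ eq 'it' } @fallbacks;
```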

* the function implicate_supers( ...languages... )
This takes a list of strings (which are presumed to be
language-tags; strings that aren't, are ignored); and after each one,
this function inserts super-ordinate forms that don't already
appear in the list. The original list, plus these insertions, is
returned.

In other words, it takes this:

    pt-br de-DE en-US fr pt-br-janeiro

and returns this:

    pt-br pt de-DE de en-US en fr pt-br-janeiro

This function is most useful in the idiom

    implicate_supers( I18N::LangTags::Detect::detect() );

(See I18N::LangTags::Detect.)
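
To try this without depending on the environment, a hard-coded list can stand in for the detected languages. The list below is the one from the example above; using it in place of detect() is an assumption of this sketch.

```perl
use strict;
use warnings;
use I18N::LangTags qw(implicate_supers);

# A hard-coded list stands in for I18N::LangTags::Detect::detect().
my @detected = ('pt-br', 'de-DE', 'en-US', 'fr', 'pt-br-janeiro');

my @expanded = implicate_supers(@detected);
print "@expanded\n";    # pt-br pt de-DE de en-US en fr pt-br-janeiro
```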

* the function implicate_supers_strictly( ...languages... )
This works like "implicate_supers" except that the implicated forms
are added to the end of the return list.

In other words, implicate_supers_strictly takes a list of strings
(which are presumed to be language-tags; strings that aren't, are
ignored) and after the whole given list, it inserts the
super-ordinate forms of all given tags, minus any tags that already
appear in the input list.

In other words, it takes this:

    pt-br de-DE en-US fr pt-br-janeiro

and returns this:

    pt-br de-DE en-US fr pt-br-janeiro pt de en

The reason this function has "_strictly" in its name is that when
you're processing an Accept-Language list according to the RFCs, if
you interpret the RFCs quite strictly, then you would use
implicate_supers_strictly, but for normal use (i.e., common-sense use,
as far as I'm concerned) you'd use implicate_supers.
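
The difference in ordering is easy to see by running the strict variant on the same input. This sketch reuses the list from the example above.

```perl
use strict;
use warnings;
use I18N::LangTags qw(implicate_supers_strictly);

my @accepted = ('pt-br', 'de-DE', 'en-US', 'fr', 'pt-br-janeiro');

# Superordinate forms are appended after the whole original list.
my @strict_order = implicate_supers_strictly(@accepted);
print "@strict_order\n";    # pt-br de-DE en-US fr pt-br-janeiro pt de en
```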

ABOUT LOWERCASING
I've considered making all the above functions that output language
tags return all those tags strictly in lowercase. Having all your
language tags in lowercase does make some things easier. But you might as
well just lowercase as you like, or call "encode_language_tag($lang1)"
where appropriate.
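
Lowercasing the output yourself takes one line. A minimal sketch, using super_languages as an arbitrary example of a function that returns tags:

```perl
use strict;
use warnings;
use I18N::LangTags qw(super_languages);

# Lowercase whatever tags come back.
my @lower = map { lc $_ } super_languages('fr-CA-joual');
print "@lower\n";    # fr-ca fr
```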

ABOUT UNICODE PLAINTEXT LANGUAGE TAGS
In some future version of I18N::LangTags, I plan to include support for
RFC2482-style language tags -- which are basically just normal language
tags with their ASCII characters shifted into Plane 14.

SEE ALSO
* I18N::LangTags::List

* RFC 3066, "ftp://ftp.isi.edu/in-notes/rfc3066.txt", "Tags for the
Identification of Languages". (Obsoletes RFC 1766)

* RFC 2277, "ftp://ftp.isi.edu/in-notes/rfc2277.txt", "IETF Policy on
Character Sets and Languages".

* RFC 2231, "ftp://ftp.isi.edu/in-notes/rfc2231.txt", "MIME Parameter
Value and Encoded Word Extensions: Character Sets, Languages, and
Continuations".

* RFC 2482, "ftp://ftp.isi.edu/in-notes/rfc2482.txt", "Language Tagging
in Unicode Plain Text".

* Locale::Codes, in
"http://www.perl.com/CPAN/modules/by-module/Locale/"

* ISO 639-2, "Codes for the representation of names of languages",
including two-letter and three-letter codes,
"http://www.loc.gov/standards/iso639-2/langcodes.html"

* The IANA list of registered languages (hopefully up-to-date),
"http://www.iana.org/assignments/language-tags"

COPYRIGHT
Copyright (c) 1998+ Sean M. Burke. All rights reserved.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

The programs and documentation in this dist are distributed in the hope
that they will be useful, but without any warranty; without even the
implied warranty of merchantability or fitness for a particular
purpose.

AUTHOR
Sean M. Burke "sburke@cpan.org"

perl v5.8.8                       2001-09-21               I18N::LangTags(3pm)