PERLLOCALE(1)          Perl Programmers Reference Guide          PERLLOCALE(1)

NAME
perllocale - Perl locale handling (internationalization and
localization)

DESCRIPTION
In the beginning there was ASCII, the "American Standard Code for
Information Interchange", which works quite well for Americans with
their English alphabet and dollar-denominated currency. But it doesn't
work so well even for other English speakers, who may use different
currencies, such as the pound sterling (as the symbol for that currency
is not in ASCII); and it's hopelessly inadequate for many of the
thousands of the world's other languages.

To address these deficiencies, the concept of locales was invented
(formally the ISO C, XPG4, POSIX 1.c "locale system"). Applications
were, and still are, being written that use the locale mechanism.
The process of making such an application take account of its users'
preferences in these kinds of matters is called internationalization
(often abbreviated as i18n); telling such an application about a
particular set of preferences is known as localization (l10n).

Perl has been extended to support certain types of locales available in
the locale system. This is controlled per application by using one
pragma, one function call, and several environment variables.

Perl supports single-byte locales that are supersets of ASCII, such as
the ISO 8859 ones, and one type of multi-byte locale, the UTF-8 ones,
described in the next paragraph. Perl doesn't support any other
multi-byte locales, such as the ones for East Asian languages.

Unfortunately, there are quite a few deficiencies with the design (and
often, the implementations) of locales. Unicode was invented (see
perlunitut for an introduction to that) in part to address these design
deficiencies, and nowadays, there is a series of "UTF-8 locales", based
on Unicode. These are locales whose character set is Unicode, encoded
in UTF-8. Starting in v5.20, Perl fully supports UTF-8 locales, except
for sorting and string comparisons like "lt" and "ge". Starting in
v5.26, Perl can handle these reasonably as well, depending on the
platform's implementation. However, for earlier releases or for better
control, use Unicode::Collate. There are actually two slightly
different types of UTF-8 locales: one for Turkic languages and one for
everything else.

Starting in Perl v5.30, Perl detects Turkic locales by their behaviour,
and seamlessly handles both types; previously only the non-Turkic one
was supported. The name of the locale is ignored: if your system has a
"tr_TR.UTF-8" locale and it doesn't behave like a Turkic locale, perl
will treat it like a non-Turkic locale.

Perl continues to support the old non-UTF-8 locales as well. There are
currently no UTF-8 locales for EBCDIC platforms.

(Unicode is also creating "CLDR", the "Common Locale Data Repository",
<http://cldr.unicode.org/>, which includes more types of information
than are available in the POSIX locale system. At the time of this
writing, there was no CPAN module that provides access to this
XML-encoded data. However, it is possible to compute the POSIX locale
data from them, and earlier CLDR versions had these already extracted
for you as UTF-8 locales <http://unicode.org/Public/cldr/2.0.1/>.)

WHAT IS A LOCALE
A locale is a set of data that describes various aspects of how various
communities in the world categorize their world. These categories are
broken down into the following types (some of which include a brief
note here):

Category "LC_NUMERIC": Numeric formatting
    This indicates how numbers should be formatted for human
    readability, for example the character used as the decimal point.

Category "LC_MONETARY": Formatting of monetary amounts

Category "LC_TIME": Date/Time formatting

Category "LC_MESSAGES": Error and other messages
    This is used by Perl itself only for accessing operating system
    error messages via $! and $^E.

Category "LC_COLLATE": Collation
    This indicates the ordering of letters for comparison and sorting.
    In Latin alphabets, for example, "b" generally follows "a".

Category "LC_CTYPE": Character Types
    This indicates, for example, whether a character is an uppercase
    letter.

Other categories
    Some platforms have other categories, dealing with such things as
    measurement units and paper sizes. None of these are used directly
    by Perl, but outside operations that Perl interacts with may use
    these. See "Not within the scope of "use locale"" below.

More details on the categories used by Perl are given below in "LOCALE
CATEGORIES".

Together, these categories go a long way towards being able to
customize a single program to run in many different locations. But
there are deficiencies, so keep reading.

PREPARING TO USE LOCALES
Perl itself (outside the POSIX module) will not use locales unless
specifically requested to (but again note that Perl may interact with
code that does use them). Even if there is such a request, all of the
following must be true for it to work properly:

• Your operating system must support the locale system. If it does,
  you should find that the "setlocale()" function is a documented
  part of its C library.

• Definitions for locales that you use must be installed. You, or
  your system administrator, must make sure that this is the case.
  The available locales, the location in which they are kept, and the
  manner in which they are installed all vary from system to system.
  Some systems provide only a few, hard-wired locales and do not
  allow more to be added. Others allow you to add "canned" locales
  provided by the system supplier. Still others allow you or the
  system administrator to define and add arbitrary locales. (You may
  have to ask your supplier to provide canned locales that are not
  delivered with your operating system.) Read your system
  documentation for further illumination.

• Perl must believe that the locale system is supported. If it does,
  "perl -V:d_setlocale" will say that the value for "d_setlocale" is
  "define".

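The third condition can also be checked from inside a program. The
following is a minimal sketch, assuming only the core Config module,
whose %Config hash holds the same values that "perl -V:name" reports.
The variable name $has_setlocale is illustrative.

```perl
use strict;
use warnings;
use Config;

# %Config exposes the build-time settings that "perl -V:d_setlocale"
# prints. The value is "define" when Configure found setlocale().
my $has_setlocale = ($Config{d_setlocale} // '') eq 'define' ? 1 : 0;

print $has_setlocale
    ? "This perl believes the locale system is supported.\n"
    : "This perl was built without setlocale() support.\n";
```
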
If you want a Perl application to process and present your data
according to a particular locale, the application code should include
the "use locale" pragma (see "The "use locale" pragma") where
appropriate, and at least one of the following must be true:

1. The locale-determining environment variables (see "ENVIRONMENT")
   must be correctly set up at the time the application is started,
   either by yourself or by whoever set up your system account; or

2. The application must set its own locale using the method described
   in "The setlocale function".

USING LOCALES
The "use locale" pragma
Starting in Perl 5.28, this pragma may be used in multi-threaded
applications on systems that have thread-safe locale ability. Some
caveats apply; see "Multi-threaded" below. On systems without this
capability, or in earlier Perls, do NOT use this pragma in scripts that
have multiple threads active. The locale in these cases is not local
to a single thread. Another thread may change the locale at any time,
which means, at a minimum, that a given thread may end up operating in
a locale it isn't expecting. On some platforms, segfaults can also
occur. The locale change need not be explicit; some operations cause
perl to change the locale itself. You are vulnerable simply by having
done a "use locale".

By default, Perl itself (outside the POSIX module) ignores the current
locale. The "use locale" pragma tells Perl to use the current locale
for some operations. Starting in v5.16, there are optional parameters
to this pragma, described below, which restrict which operations are
affected by it.

The current locale is set at execution time by setlocale(), described
below. If that function hasn't yet been called in the course of the
program's execution, the current locale is that which was determined by
the "ENVIRONMENT" in effect at the start of the program. If there is
no valid environment, the current locale is whatever the system default
has been set to. On POSIX systems, it is likely, but not necessarily,
the "C" locale. On Windows, the default is set via the computer's
"Control Panel->Regional and Language Options" (or its current
equivalent).

The operations that are affected by locale are:

Not within the scope of "use locale"
    Only certain operations (all originating outside Perl) should be
    affected, as follows:

    • The current locale is used when going outside of Perl with
      operations like system() or qx//, if those operations are
      locale-sensitive.

    • Also Perl gives access to various C library functions through
      the POSIX module. Some of those functions are always affected
      by the current locale. For example, "POSIX::strftime()" uses
      "LC_TIME"; "POSIX::strtod()" uses "LC_NUMERIC";
      "POSIX::strcoll()" and "POSIX::strxfrm()" use "LC_COLLATE".
      All such functions will behave according to the current
      underlying locale, even if that locale isn't exposed to Perl
      space.

      This applies as well to I18N::Langinfo.

    • XS modules for all categories but "LC_NUMERIC" get the
      underlying locale, and hence any C library functions they call
      will use that underlying locale. For more discussion, see
      "CAVEATS" in perlxs.

    Note that all C programs (including the perl interpreter, which is
    written in C) always have an underlying locale. That locale is the
    "C" locale unless changed by a call to setlocale(). When Perl
    starts up, it changes the underlying locale to the one which is
    indicated by the "ENVIRONMENT". When using the POSIX module or
    writing XS code, it is important to keep in mind that the
    underlying locale may be something other than "C", even if the
    program hasn't explicitly changed it.

Lingering effects of "use locale"
    Certain Perl operations that are set up within the scope of a "use
    locale" retain that effect even outside the scope. These include:

    • The output format of a write() is determined by an earlier
      format declaration ("format" in perlfunc), so whether or not
      the output is affected by locale is determined by whether the
      "format()" is within the scope of a "use locale", not whether
      the "write()" is.

    • Regular expression patterns can be compiled using qr// with
      actual matching deferred to later. Again, it is whether or not
      the compilation was done within the scope of "use locale" that
      determines the match behavior, not whether the matches are done
      within such a scope.

Under "use locale";
    • All the above operations

    • Format declarations ("format" in perlfunc) and hence any
      subsequent "write()"s use "LC_NUMERIC".

    • Stringification and output use "LC_NUMERIC". These include the
      results of "print()", "printf()", "say()", and "sprintf()".

    • The comparison operators ("lt", "le", "cmp", "ge", and "gt")
      use "LC_COLLATE". "sort()" is also affected if used without an
      explicit comparison function, because it uses "cmp" by default.

      Note: "eq" and "ne" are unaffected by locale: they always
      perform a char-by-char comparison of their scalar operands.
      What's more, if "cmp" finds that its operands are equal
      according to the collation sequence specified by the current
      locale, it goes on to perform a char-by-char comparison, and
      only returns 0 (equal) if the operands are char-for-char
      identical. If you really want to know whether two
      strings--which "eq" and "cmp" may consider different--are equal
      as far as collation in the locale is concerned, see the
      discussion in "Category "LC_COLLATE": Collation".

    • Regular expressions and case-modification functions ("uc()",
      "lc()", "ucfirst()", and "lcfirst()") use "LC_CTYPE".

    • The variables $! (and its synonyms $ERRNO and $OS_ERROR) and
      $^E (and its synonym $EXTENDED_OS_ERROR) when used as strings
      use "LC_MESSAGES".

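The split between the two groups of operations can be sketched as
follows, assuming nothing beyond the core POSIX module.
"POSIX::strftime()" follows the underlying "LC_TIME" locale whether or
not "use locale" is in scope, whereas "sort" uses locale collation only
inside that scope. The subroutine names are illustrative, and what the
locale-aware sort prints depends on the locale your program starts in.

```perl
use strict;
use warnings;
use POSIX qw(locale_h strftime);

# POSIX functions see the underlying locale even without "use locale".
# Pin LC_TIME to "C" here so that the day name is predictable.
setlocale(LC_TIME, "C");
my $day = strftime("%A", gmtime(0));    # the epoch fell on a Thursday
print "Day name from strftime: $day\n";

my @words = qw(b A a B);

# Outside "use locale": sort compares code points.
sub sort_plain { return sort @_ }

# Inside "use locale": sort (through cmp) consults LC_COLLATE.
sub sort_locale {
    use locale;
    return sort @_;
}

print "Code point order: @{[ sort_plain(@words) ]}\n";
print "Locale order:     @{[ sort_locale(@words) ]}\n";
```
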
The default behavior is restored with the "no locale" pragma, or upon
reaching the end of the block enclosing "use locale". Note that "use
locale" calls may be nested, and that what is in effect within an inner
scope will revert to the outer scope's rules at the end of the inner
scope.

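These scoping rules can be sketched with numeric stringification as
the operation of interest. The subroutine name demo() is illustrative.
Outside "use locale" the decimal point is always a dot; inside, it is
whatever the current "LC_NUMERIC" locale specifies, which may well also
be a dot.

```perl
use strict;
use warnings;

sub demo {
    my ($x, $y) = @_;           # two separate copies of the same number

    use locale;                 # in effect until the end of this block
    my $outer = "$x";           # uses the LC_NUMERIC radix character

    my $inner;
    {
        no locale;              # nested scope: default rules again
        $inner = "$y";          # always uses "." as the radix
    }
    # "use locale" applies again from here to the end of the sub.

    return ($outer, $inner);
}

my ($outer, $inner) = demo(1.5, 1.5);
print "under use locale: $outer, under no locale: $inner\n";
```
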
The string result of any operation that uses locale information is
tainted, as it is possible for a locale to be untrustworthy. See
"SECURITY".

Starting in Perl v5.16 in a very limited way, and more generally in
v5.22, you can restrict which category or categories are enabled by
this particular instance of the pragma by adding parameters to it. For
example,

    use locale qw(:ctype :numeric);

enables locale awareness within its scope of only those operations
(listed above) that are affected by "LC_CTYPE" and "LC_NUMERIC".

The possible categories are: ":collate", ":ctype", ":messages",
":monetary", ":numeric", ":time", and the pseudo category ":characters"
(described below).

Thus you can say

    use locale ':messages';

and only $! and $^E will be locale aware. Everything else is
unaffected.

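As a sketch of such a restricted form, the following enables only
"LC_MESSAGES" handling, so that the text of $! may be localized while
every other operation keeps its default behaviour. It assumes Perl
v5.22 or later, and uses the core Errno module merely to obtain a
well-known error number; the variable name $text is illustrative.

```perl
use strict;
use warnings;
use 5.022;                      # general category parameters need v5.22
use Errno qw(ENOENT);

my $text;
{
    use locale ':messages';     # only $! and $^E become locale-aware
    local $! = ENOENT;          # the "file not found" error number
    $text = "$!";               # stringify while the pragma is in scope
}

print "ENOENT reads: $text\n";
```
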
Since Perl doesn't currently do anything with the "LC_MONETARY"
category, specifying ":monetary" does effectively nothing. Some
systems have other categories, such as "LC_PAPER", but Perl also
doesn't do anything with them, and there is no way to specify them in
this pragma's arguments.

You can also easily enable all categories but one, using either of
the following forms, for example:

    use locale ':!ctype';
    use locale ':not_ctype';

both of which mean to enable locale awareness of all categories but
"LC_CTYPE". Only one category argument may be specified in a
"use locale" if it is of the negated form.

Prior to v5.22 only one form of the pragma with arguments is available:

    use locale ':not_characters';

(and you have to say "not_"; you can't use the bang "!" form). This
pseudo category is a shorthand for specifying both ":collate" and
":ctype". Hence, in the negated form, it is nearly the same thing as
saying

    use locale qw(:messages :monetary :numeric :time);

We use the term "nearly", because ":not_characters" also turns on
"use feature 'unicode_strings'" within its scope. This form is less
useful in v5.20 and later, and is described fully in "Unicode and
UTF-8", but briefly, it tells Perl to not use the character portions of
the locale definition, that is the "LC_CTYPE" and "LC_COLLATE"
categories. Instead it will use the native character set (extended by
Unicode). When using this parameter, you are responsible for getting
the external character set translated into the native/Unicode one
(which it already will be if it is one of the increasingly popular
UTF-8 locales). There are convenient ways of doing this, as described
in "Unicode and UTF-8".

The setlocale function
WARNING! Prior to Perl 5.28 or on a system that does not support
thread-safe locale operations, do NOT use this function in a thread.
The locale will change in all other threads at the same time, and
should your thread get paused by the operating system, and another
started, that thread will not have the locale it is expecting. On some
platforms, there can be a race leading to segfaults if two threads call
this function nearly simultaneously. This warning does not apply on
unthreaded builds, or on perls where "${^SAFE_LOCALES}" exists and is
non-zero; namely Perl 5.28 and later that are either unthreaded or
compiled to be locale-thread-safe.

You can switch locales as often as you wish at run time with the
"POSIX::setlocale()" function:

    # Import locale-handling tool set from POSIX module.
    # This example uses: setlocale -- the function call
    #                    LC_CTYPE -- explained below
    # (Showing the testing for success/failure of operations is
    # omitted in these examples to avoid distracting from the main
    # point)

    use POSIX qw(locale_h);
    use locale;
    my $old_locale;

    # query and save the old locale
    $old_locale = setlocale(LC_CTYPE);

    setlocale(LC_CTYPE, "fr_CA.ISO8859-1");
    # LC_CTYPE now in locale "French, Canada, codeset ISO 8859-1"

    setlocale(LC_CTYPE, "");
    # LC_CTYPE now reset to the default defined by the
    # LC_ALL/LC_CTYPE/LANG environment variables, or to the system
    # default. See below for documentation.

    # restore the old locale
    setlocale(LC_CTYPE, $old_locale);

The first argument of "setlocale()" gives the category, the second the
locale. The category tells in what aspect of data processing you want
to apply locale-specific rules. Category names are discussed in
"LOCALE CATEGORIES" and "ENVIRONMENT". The locale is the name of a
collection of customization information corresponding to a particular
combination of language, country or territory, and codeset. Read on
for hints on the naming of locales: not all systems name locales as in
the example.

If no second argument is provided and the category is something other
than "LC_ALL", the function returns a string naming the current locale
for the category. You can use this value as the second argument in a
subsequent call to "setlocale()", but on some platforms the string is
opaque, not something that most people would be able to decipher as to
what locale it means.

If no second argument is provided and the category is "LC_ALL", the
result is implementation-dependent. It may be a string of concatenated
locale names (separator also implementation-dependent) or a single
locale name. Please consult your setlocale(3) man page for details.

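A minimal sketch of the query form, assuming only the core POSIX
module: calling "setlocale()" with a single argument reports the
current locale name for a category without changing anything. The
%category hash is illustrative; "LC_MESSAGES" is left out of it because
not every platform defines that category.

```perl
use strict;
use warnings;
use POSIX qw(locale_h);

# Map printable names to the category constants exported by POSIX.
my %category = (
    LC_COLLATE  => LC_COLLATE,
    LC_CTYPE    => LC_CTYPE,
    LC_MONETARY => LC_MONETARY,
    LC_NUMERIC  => LC_NUMERIC,
    LC_TIME     => LC_TIME,
);

my %current;
for my $name (sort keys %category) {
    # One argument: query only. The result may be an opaque string.
    $current{$name} = setlocale($category{$name});
    printf "%-12s %s\n", $name, $current{$name} // '(unknown)';
}
```
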
If a second argument is given and it corresponds to a valid locale, the
locale for the category is set to that value, and the function returns
the now-current locale value. You can then use this in yet another
call to "setlocale()". (In some implementations, the return value may
sometimes differ from the value you gave as the second argument--think
of it as an alias for the value you gave.)

As the example shows, if the second argument is an empty string, the
category's locale is returned to the default specified by the
corresponding environment variables. Generally, this results in a
return to the default that was in force when Perl started up: changes
to the environment made by the application after startup may or may not
be noticed, depending on your system's C library.

Note that when a form of "use locale" that doesn't include all
categories is specified, Perl ignores the excluded categories.

If "setlocale()" fails for some reason (for example, an attempt to set
to a locale unknown to the system), the locale for the category is not
changed, and the function returns "undef".

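Because a failed call returns "undef" and leaves the locale untouched,
it is worth checking the result. The following sketch reuses the locale
name from the earlier example, which may well not exist on your system;
the variable names are illustrative.

```perl
use strict;
use warnings;
use POSIX qw(locale_h);

my $wanted = "fr_CA.ISO8859-1";     # may not be installed here
my $old    = setlocale(LC_CTYPE);   # remember the current setting

my $new = setlocale(LC_CTYPE, $wanted);
if (defined $new) {
    print "LC_CTYPE is now '$new'\n";

    # ... locale-sensitive work would go here ...

    setlocale(LC_CTYPE, $old)
        or warn "Could not restore LC_CTYPE to '$old'\n";
}
else {
    # Nothing was changed, so there is nothing to restore.
    warn "Locale '$wanted' is not available; LC_CTYPE stays '$old'\n";
}
```
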
Starting in Perl 5.28, on multi-threaded perls compiled on systems that
implement POSIX 2008 thread-safe locale operations, this function
doesn't actually call the system "setlocale". Instead those
thread-safe operations are used to emulate the "setlocale" function,
but in a thread-safe manner.

You can force the thread-safe locale operations to always be used (if
available) by recompiling perl with

    -Accflags='-DUSE_THREAD_SAFE_LOCALE'

added to your call to Configure.

For further information about the categories, consult setlocale(3).

Multi-threaded operation
Beginning in Perl 5.28, multi-threaded locale operation is supported on
systems that implement either the POSIX 2008 or Windows-specific
thread-safe locale operations. Many modern systems, such as various
Unix variants and Darwin, do have this.

You can tell if using locales is safe on your system by looking at the
read-only boolean variable "${^SAFE_LOCALES}". The value is 1 if the
perl is not threaded, or if it is using thread-safe locale operations.

Thread-safe operations are supported in Windows starting in Visual
Studio 2005, and in systems compatible with POSIX 2008. Some platforms
claim to support POSIX 2008, but have buggy implementations, so that
the hints files for compiling to run on them turn off attempting to use
thread-safety. "${^SAFE_LOCALES}" will be 0 on them.

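A program can consult this variable before deciding whether to change
locales while threads are running. A minimal sketch follows; the
variable name $safe is illustrative, and on perls older than 5.28 the
special variable simply reads as undefined, which the code treats as
"not safe".

```perl
use strict;
use warnings;

# Normalize to 0 or 1; an undefined value (pre-5.28) counts as unsafe.
my $safe = ${^SAFE_LOCALES} ? 1 : 0;

if ($safe) {
    print "Locale operations are safe here (unthreaded or thread-safe).\n";
}
else {
    print "Avoid changing the locale while several threads are active.\n";
}
```
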
Be aware that a multi-threaded application written to rely on this
will not be portable to a platform that lacks native thread-safe
locale support. On systems that do have it, you automatically get this
behavior for threaded perls, without having to do anything. If, for
some reason, you don't want to use this capability (perhaps the POSIX
2008 support is buggy on your system), you can manually compile Perl to
use the old non-thread-safe implementation by passing the argument
"-Accflags='-DNO_THREAD_SAFE_LOCALE'" to Configure. Except on Windows,
this will continue to use certain of the POSIX 2008 functions in some
situations. If these are buggy, you can pass the following to
Configure instead or additionally:
"-Accflags='-DNO_POSIX_2008_LOCALE'". This will also keep the code
from using thread-safe locales. "${^SAFE_LOCALES}" will be 0 on
systems that turn off the thread-safe operations.

Normally on unthreaded builds, the traditional "setlocale()" is used
and not the thread-safe locale functions. You can force the use of
these on systems that have them by passing
"-Accflags='-DUSE_THREAD_SAFE_LOCALE'" to Configure.

The initial program starts up, as before, using the locale specified
by the environment, as described in "ENVIRONMENT". All newly
created threads start with "LC_ALL" set to "C". Each thread may use
"POSIX::setlocale()" to query or switch its locale at any time, without
affecting any other thread. All locale-dependent operations
automatically use their thread's locale.

This should be completely transparent to any applications written
entirely in Perl (minus a few rarely encountered caveats given in the
"Multi-threaded" section). Information for XS module writers is given
in "Locale-aware XS code" in perlxs.

Finding locales
For locales available in your system, consult also setlocale(3) to see
whether it leads to the list of available locales (search for the SEE
ALSO section). If that fails, try the following command lines:

    locale -a

    nlsinfo

    ls /usr/lib/nls/loc

    ls /usr/lib/locale

    ls /usr/lib/nls

    ls /usr/share/locale

and see whether they list something resembling these:

    en_US.ISO8859-1     de_DE.ISO8859-1     ru_RU.ISO8859-5
    en_US.iso88591      de_DE.iso88591      ru_RU.iso88595
    en_US               de_DE               ru_RU
    en                  de                  ru
    english             german              russian
    english.iso88591    german.iso88591     russian.iso88595
    english.roman8                          russian.koi8r

Sadly, even though the calling interface for "setlocale()" has been
standardized, names of locales and the directories where the
configuration resides have not been. The basic form of the name is
language_territory.codeset, but the latter parts after language are not
always present. The language and territory codes usually come from the
standards ISO 639 and ISO 3166, respectively: the two-letter
abbreviations for the languages and the countries of the world. The
codeset part often mentions some ISO 8859 character set, the Latin
codesets. For example, "ISO 8859-1" is the so-called "Western European
codeset" that can be used to encode most Western European languages
adequately. Again, there are several ways to write even the name of
that one standard. Lamentably.

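To make the naming scheme concrete, here is a sketch that splits a name
of the form language_territory.codeset into its parts. The helper
parse_locale_name() is hypothetical, not part of Perl or the POSIX
module. The optional "@modifier" suffix it accepts (as in "de_DE@euro")
is an assumption based on common practice rather than something
described above, and real systems also use names that fit no pattern
at all.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference, or undef when the name
# does not look like language[_territory][.codeset][@modifier].
sub parse_locale_name {
    my ($name) = @_;
    return undef unless $name =~ m{
        \A ([A-Za-z]+)              # language, such as "en"
        (?: _  ([A-Za-z0-9]+)  )?   # territory, such as "US"
        (?: \. ([^\@]+)        )?   # codeset, such as "ISO8859-1"
        (?: \@ (.+)            )?   # modifier (assumed extension)
        \z
    }x;
    return {
        language  => $1,
        territory => $2,
        codeset   => $3,
        modifier  => $4,
    };
}

for my $name (qw(en_US.ISO8859-1 ru_RU.koi8r de)) {
    my $parts = parse_locale_name($name);
    printf "%-16s language=%s territory=%s codeset=%s\n", $name,
        map { $_ // '-' } @{$parts}{qw(language territory codeset)};
}
```

A name that comes back as "undef" is not necessarily invalid; it only
means the name does not follow this common pattern.
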
Two special locales are worth particular mention: "C" and "POSIX".
Currently these are effectively the same locale: the difference is
mainly that the first one is defined by the C standard, the second by
the POSIX standard. They define the default locale in which every
program starts in the absence of locale information in its environment.
(The default default locale, if you will.) Its language is (American)
English and its character codeset ASCII or, rarely, a superset thereof
(such as the "DEC Multinational Character Set (DEC-MCS)"). Warning:
the C locale delivered by some vendors may not actually exactly match
what the C standard calls for. So beware.

NOTE: Not all systems have the "POSIX" locale (not all systems are
POSIX-conformant), so use "C" when you need explicitly to specify this
default locale.

LOCALE PROBLEMS
You may encounter the following warning message at Perl startup:

    perl: warning: Setting locale failed.
    perl: warning: Please check that your locale settings:
            LC_ALL = "En_US",
            LANG = (unset)
        are supported and installed on your system.
    perl: warning: Falling back to the standard locale ("C").

This means that your locale settings had "LC_ALL" set to "En_US" and
"LANG" was not set. Perl tried to believe you but could not.
Instead, Perl gave up and fell back to the "C" locale, the default
locale that is supposed to work no matter what. (On Windows, it first
tries falling back to the system default locale.) This usually means
that your locale settings were wrong (they mention locales your system
has never heard of), or that the locale installation in your system has
problems (for example, some system files are broken or missing). There
are quick and temporary fixes to these problems, as well as more
thorough and lasting fixes.

Testing for broken locales
If you are building Perl from source, the Perl test suite file
lib/locale.t can be used to test the locales on your system. Setting
the environment variable "PERL_DEBUG_FULL_TEST" to 1 will cause it to
output detailed results. For example, on Linux, you could say

    PERL_DEBUG_FULL_TEST=1 ./perl -T -Ilib lib/locale.t > locale.log 2>&1

Besides many other tests, it will test every locale it finds on your
system to see if they conform to the POSIX standard. If any have
errors, it will include a summary near the end of the output of which
locales passed all its tests, and which failed, and why.

Temporarily fixing locale problems
The two quickest fixes are either to render Perl silent about any
locale inconsistencies or to run Perl under the default locale "C".

Perl's moaning about locale problems can be silenced by setting the
environment variable "PERL_BADLANG" to "0" or "". This method really
just sweeps the problem under the carpet: you tell Perl to shut up even
when Perl sees that something is wrong. Do not be surprised if later
something locale-dependent misbehaves.

Perl can be run under the "C" locale by setting the environment
variable "LC_ALL" to "C". This method is perhaps a bit more civilized
than the "PERL_BADLANG" approach, but setting "LC_ALL" (or other locale
variables) may affect other programs as well, not just Perl. In
particular, external programs run from within Perl will see these
changes. If you make the new settings permanent (read on), all
programs you run see the changes. See "ENVIRONMENT" for the full list
of relevant environment variables and "USING LOCALES" for their effects
in Perl. Effects in other programs are easily deducible. For example,
the variable "LC_COLLATE" may well affect your sort program (or
whatever the program that arranges "records" alphabetically in your
system is called).

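That external programs see these settings can be sketched from within
Perl itself. The example assumes a Unix-like system where the list form
of a piped "open" is available, and it uses a second copy of the
running perl ($^X) as a stand-in for any external program; the variable
names are illustrative.

```perl
use strict;
use warnings;

my $seen;
{
    # Change LC_ALL for the rest of this block only. Child processes
    # started here inherit it; the previous value returns afterwards.
    local $ENV{LC_ALL} = 'C';

    open(my $child, '-|', $^X, '-e', 'print $ENV{LC_ALL}')
        or die "Cannot run $^X: $!";
    $seen = do { local $/; <$child> };
    close($child);
}

print "The child process saw LC_ALL=$seen\n";
```

Note that assigning to %ENV does not by itself change the locale of the
running program; that takes a call to "setlocale()".
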
You can test out changing these variables temporarily, and if the new
settings seem to help, put those settings into your shell startup
files. Consult your local documentation for the exact details. For
Bourne-like shells (sh, ksh, bash, zsh):

    LC_ALL=en_US.ISO8859-1
    export LC_ALL

This assumes that we saw the locale "en_US.ISO8859-1" using the
commands discussed above. We decided to try that instead of the above
faulty locale "En_US". In csh-like shells (csh, tcsh):

    setenv LC_ALL en_US.ISO8859-1

or, if you have the "env" application, you can do (in any shell):

    env LC_ALL=en_US.ISO8859-1 perl ...

If you do not know what shell you have, consult your local helpdesk or
the equivalent.

Permanently fixing locale problems
The slower but superior fix is to correct the misconfiguration of your
own environment variables, which you may be able to do yourself. The
mis(sing)configuration of the whole system's locales usually requires
the help of your friendly system administrator.

First, see earlier in this document about "Finding locales". That
tells how to find which locales are really supported--and more
importantly, installed--on your system. In our example error message,
environment variables affecting the locale are listed in the order of
decreasing importance (and unset variables do not matter). Therefore,
having LC_ALL set to "En_US" must have been the bad choice, as shown by
the error message. First try fixing locale settings listed first.

Second, if using the listed commands you see something exactly (prefix
matches do not count and case usually counts) like "En_US" without the
quotes, then you should be okay because you are using a locale name
that should be installed and available in your system. In this case,
see "Permanently fixing your system's locale configuration".

Permanently fixing your system's locale configuration
This is when you see something like:

    perl: warning: Please check that your locale settings:
            LC_ALL = "En_US",
            LANG = (unset)
        are supported and installed on your system.

but then cannot find "En_US" listed by the above-mentioned
commands. You may see things like "en_US.ISO8859-1", but that isn't
the same. In this case, try running under a locale that you can list
and which somehow matches what you tried. The rules for matching
locale names are a bit vague because standardization is weak in this
area. See "Finding locales" again for the general rules.

Fixing system locale configuration
Contact a system administrator (preferably your own) and report the
exact error message you get, and ask them to read this same
documentation you are now reading. They should be able to check
whether there is something wrong with the locale configuration of the
system. The "Finding locales" section is unfortunately a bit vague
about the exact commands and places because these things are not that
standardized.

The localeconv function
The "POSIX::localeconv()" function allows you to get particulars of the
locale-dependent numeric formatting information specified by the
current underlying "LC_NUMERIC" and "LC_MONETARY" locales (regardless
of whether called from within the scope of "use locale" or not). (If
you just want the name of the current locale for a particular category,
use "POSIX::setlocale()" with a single parameter--see "The setlocale
function".)

    use POSIX qw(locale_h);

    # Get a reference to a hash of locale-dependent info
    my $locale_values = localeconv();

    # Output sorted list of the values
    for (sort keys %$locale_values) {
        printf "%-20s = %s\n", $_, $locale_values->{$_};
    }

"localeconv()" takes no arguments, and returns a reference to a hash.
The keys of this hash are variable names for formatting, such as
"decimal_point" and "thousands_sep". The values are the corresponding,
er, values. See "localeconv" in POSIX for a longer example listing the
categories an implementation might be expected to provide; some provide
more and others fewer. You don't need an explicit "use locale",
because "localeconv()" always observes the current locale.

678 Here's a simple-minded example program that rewrites its command-line
679 parameters as integers correctly formatted in the current locale:
680
681 use POSIX qw(locale_h);
682
683 # Get some of locale's numeric formatting parameters
684 my ($thousands_sep, $grouping) =
685 @{localeconv()}{'thousands_sep', 'grouping'};
686
687 # Apply defaults if values are missing
688 $thousands_sep = ',' unless $thousands_sep;
689
690 # grouping and mon_grouping are packed lists
691 # of small integers (characters) telling the
692 # grouping (thousands_sep and mon_thousands_sep
693 # being the group dividers) of numbers and
694 # monetary quantities. The integers' meanings:
695 # 255 means no more grouping, 0 means repeat
696 # the previous grouping, 1-254 means use that
697 # as the current grouping. Grouping goes from
698 # right to left (low to high digits). In the
699 # below we cheat slightly by never using anything
700 # else than the first grouping (whatever that is).
701 if ($grouping) {
702 @grouping = unpack("C*", $grouping);
703 } else {
704 @grouping = (3);
705 }
706
707 # Format command line params for current locale
708 for (@ARGV) {
709 $_ = int; # Chop non-integer part
710 1 while
711 s/(\d)(\d{$grouping[0]}($|\Q$thousands_sep\E))/$1$thousands_sep$2/;
712 print "$_ ";
713 }
714 print "\n";
715
716 Note that if the platform doesn't have "LC_NUMERIC" and/or
717 "LC_MONETARY" available or enabled, the corresponding elements of the
718 hash will be missing.
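
The example above uses only the first grouping size. As a rough sketch
of how the whole list could be honored, the hypothetical helper
"group_digits()" below (an illustration, not a POSIX or core function)
walks an unpacked grouping list from right to left. It treats 0, or the
end of the list, as "repeat the previous size" and 255 as "stop
grouping"; the fallback size of 3 for an empty list is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not a POSIX or core function.
# $digits   : string of decimal digits
# $sep      : separator, e.g. localeconv()->{thousands_sep}
# @grouping : group sizes, e.g. unpack("C*", localeconv()->{grouping})
sub group_digits {
    my ($digits, $sep, @grouping) = @_;
    my @groups;
    my $size = 3;                          # assumed fallback for an empty list
    while (length $digits) {
        my $next = @grouping ? shift @grouping : 0;
        last if $next == 255;              # 255: no more grouping
        if ($next == 0) { @grouping = () } # 0: repeat previous size from now on
        else            { $size = $next }
        last if length($digits) <= $size;  # remaining digits form the last group
        # Cut the rightmost $size digits off and keep them as a group
        unshift @groups, substr($digits, -$size, $size, '');
    }
    return join $sep, grep { length $_ } $digits, @groups;
}

print group_digits('1234567', ',', 3), "\n";        # 1,234,567
print group_digits('1234567', ',', 3, 2), "\n";     # 12,34,567
print group_digits('1234567', ',', 3, 255), "\n";   # 1234,567
```

In a real program the separator and the grouping list would come from
"localeconv()", with defaults applied when the elements are missing.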
719
720 I18N::Langinfo
721 Another interface for querying locale-dependent information is the
722 "I18N::Langinfo::langinfo()" function.
723
724 The following example will import the "langinfo()" function itself and
725 three constants to be used as arguments to "langinfo()": a constant for
726 the abbreviated first day of the week (the numbering starts from Sunday
727 = 1) and two more constants for the affirmative and negative answers
728 for a yes/no question in the current locale.
729
730 use I18N::Langinfo qw(langinfo ABDAY_1 YESSTR NOSTR);
731
732 my ($abday_1, $yesstr, $nostr)
733 = map { langinfo($_) } (ABDAY_1, YESSTR, NOSTR);
734
735 print "$abday_1? [$yesstr/$nostr] ";
736
737 In other words, in the "C" (or English) locale the above will probably
738 print something like:
739
740 Sun? [yes/no]
741
742 See I18N::Langinfo for more information.
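
As a further small sketch, the same function can report other items,
such as the character encoding and the decimal point of the current
locale. The constants used here ("CODESET", "RADIXCHAR", "DAY_1",
"DAY_7") are exported by I18N::Langinfo on request; what is printed
depends entirely on your platform and locale.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_ALL);
use I18N::Langinfo qw(langinfo CODESET RADIXCHAR DAY_1 DAY_7);

# Adopt the locale named by the environment, if it is valid
setlocale(LC_ALL, "");

printf "codeset:   %s\n", langinfo(CODESET());
printf "radix:     %s\n", langinfo(RADIXCHAR());
printf "week runs: %s .. %s\n", langinfo(DAY_1()), langinfo(DAY_7());
```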
743
744 LOCALE CATEGORIES
745 The following subsections describe basic locale categories. Beyond
746 these, some combination categories allow manipulation of more than one
747 basic category at a time. See "ENVIRONMENT" for a discussion of these.
748
749 Category "LC_COLLATE": Collation: Text Comparisons and Sorting
750 In the scope of a "use locale" form that includes collation, Perl looks
751 to the "LC_COLLATE" environment variable to determine the application's
752 notions on collation (ordering) of characters. For example, "b"
753 follows "a" in Latin alphabets, but where do "a" and "aa" belong? And
754 while "color" follows "chocolate" in English, what about in traditional
755 Spanish?
756
757 The following collations all make sense and you may meet any of them if
758 you "use locale".
759
760 A B C D E a b c d e
761 A a B b C c D d E e
762 a A b B c C d D e E
763 a b c d e A B C D E
764
765 Here is a code snippet to tell what "word" characters are in the
766 current locale, in that locale's order:
767
768 use locale;
769 print +(sort grep /\w/, map { chr } 0..255), "\n";
770
771 Compare this with the characters that you see and their order if you
772 state explicitly that the locale should be ignored:
773
774 no locale;
775 print +(sort grep /\w/, map { chr } 0..255), "\n";
776
777 This machine-native collation (which is what you get unless
778 "use locale" has appeared earlier in the same block) must be used for
779 sorting raw binary data, whereas the locale-dependent collation of the
780 first example is useful for natural text.
781
782 As noted in "USING LOCALES", "cmp" compares according to the current
783 collation locale when "use locale" is in effect, but falls back to a
784 char-by-char comparison for strings that the locale says are equal. You
785 can use "POSIX::strcoll()" if you don't want this fall-back:
786
787 use POSIX qw(strcoll);
788 $equal_in_locale =
789 !strcoll("space and case ignored", "SpaceAndCaseIgnored");
790
791 $equal_in_locale will be true if the collation locale specifies a
792 dictionary-like ordering that ignores space characters completely and
793 which folds case.
794
795 Perl uses the platform's C library collation functions "strcoll()" and
796 "strxfrm()". That means you get whatever they give. On some
797 platforms, these functions work well on UTF-8 locales, giving a
798 reasonable default collation for the code points that are important in
799 that locale. (And if they aren't working well, the problem may only be
800 that the locale definition is deficient, so can be fixed by using a
801 better definition file. Unicode's definitions (see "Freely available
802 locale definitions") provide reasonable UTF-8 locale collation
803 definitions.) Starting in Perl v5.26, Perl's use of these functions
804 has been made more seamless. This may be sufficient for your needs.
805 For more control, and to make sure strings containing any code point
806 (not just the ones important in the locale) collate properly, the
807 Unicode::Collate module is suggested.
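
As a minimal sketch of the Unicode::Collate approach, the following
compares plain code point order with the module's default Unicode
Collation Algorithm order. No locale is involved, and the word list is
made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new;   # default collation rules
my @words    = qw(b A a B);

my @by_code_point = sort @words;            # machine-native order
my @by_uca        = $collator->sort(@words); # Unicode collation order

print "code point: @by_code_point\n";   # A B a b
print "collated:   @by_uca\n";          # a A b B
```

See Unicode::Collate for the options that tailor the ordering.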
808
809 In non-UTF-8 locales (hence single byte), code points above 0xFF are
810 technically invalid. But if present, again starting in v5.26, they
811 will collate to the same position as the highest valid code point does.
812 This generally gives good results, but the collation order may be
813 skewed if the valid code point gets special treatment when it forms
814 particular sequences with other characters as defined by the locale.
815 When two strings collate identically, the code point order is used as a
816 tie breaker.
817
818 If Perl detects that there are problems with the locale collation
819 order, it reverts to using non-locale collation rules for that locale.
820
821 If you have a single string that you want to check for "equality in
822 locale" against several others, you might think you could gain a little
823 efficiency by using "POSIX::strxfrm()" in conjunction with "eq":
824
825 use POSIX qw(strxfrm);
826 $xfrm_string = strxfrm("Mixed-case string");
827 print "locale collation ignores spaces\n"
828 if $xfrm_string eq strxfrm("Mixed-casestring");
829 print "locale collation ignores hyphens\n"
830 if $xfrm_string eq strxfrm("Mixedcase string");
831 print "locale collation ignores case\n"
832 if $xfrm_string eq strxfrm("mixed-case string");
833
834 "strxfrm()" takes a string and maps it into a transformed string for
835 use in char-by-char comparisons against other transformed strings
836 during collation. "Under the hood", locale-affected Perl comparison
837 operators call "strxfrm()" for both operands, then do a char-by-char
838 comparison of the transformed strings. By calling "strxfrm()"
839 explicitly and using a non locale-affected comparison, the example
840 attempts to save a couple of transformations. But in fact, it doesn't
841 save anything: Perl magic (see "Magic Variables" in perlguts) creates
842 the transformed version of a string the first time it's needed in a
843 comparison, then keeps this version around in case it's needed again.
844 An example rewritten the easy way with "cmp" runs just about as fast.
845 It also copes with null characters embedded in strings; if you call
846 "strxfrm()" directly, it treats the first null it finds as a
847 terminator. Don't expect the transformed strings it produces to be
848 portable across systems--or even from one revision of your operating
849 system to the next. In short, don't call "strxfrm()" directly: let
850 Perl do it for you.
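
If what you want is the "equality in locale" test itself, one way to
sketch it without "strxfrm()" is to call "POSIX::strcoll()" and check
for a zero result. The strings are the ones from the example above; the
%variants hash is just an illustrative way to loop over them, and
whether anything is printed depends on your collation locale.

```perl
use strict;
use warnings;
use POSIX qw(setlocale strcoll LC_COLLATE);

# Adopt the collation locale named by the environment, if it is valid
setlocale(LC_COLLATE, "");

my $reference = "Mixed-case string";
my %variants  = (
    "spaces"  => "Mixed-casestring",
    "hyphens" => "Mixedcase string",
    "case"    => "mixed-case string",
);

for my $what (sort keys %variants) {
    # strcoll() returns 0 when the locale considers the strings equal
    print "locale collation ignores $what\n"
        if strcoll($reference, $variants{$what}) == 0;
}
```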
851
852 Note: "use locale" isn't shown in some of these examples because it
853 isn't needed: "strcoll()" and "strxfrm()" are POSIX functions which use
854 the standard system-supplied "libc" functions that always obey the
855 current "LC_COLLATE" locale.
856
857 Category "LC_CTYPE": Character Types
858 In the scope of a "use locale" form that includes "LC_CTYPE", Perl
859 obeys the "LC_CTYPE" locale setting. This controls the application's
860 notion of which characters are alphabetic, numeric, punctuation, etc.
861 This affects Perl's "\w" regular expression metanotation, which stands
862 for alphanumeric characters--that is, alphabetic, numeric, and the
863 platform's native underscore. (Consult perlre for more information
864 about regular expressions.) Thanks to "LC_CTYPE", depending on your
865 locale setting, characters like "æ", "ð", "ß", and "ø" may be
866 understood as "\w" characters. It also affects things like "\s", "\D",
867 and the POSIX character classes, like "[[:graph:]]". (See
868 perlrecharclass for more information on all these.)
869
870 The "LC_CTYPE" locale also provides the map used in transliterating
871 characters between lower and uppercase. This affects the case-mapping
872 functions--"fc()", "lc()", "lcfirst()", "uc()", and "ucfirst()"; case-
873 mapping interpolation with "\F", "\l", "\L", "\u", or "\U" in double-
874 quoted strings and "s///" substitutions; and case-insensitive regular
875 expression pattern matching using the "i" modifier.
876
877 Starting in v5.20, Perl supports UTF-8 locales for "LC_CTYPE", but
878 otherwise Perl only supports single-byte locales, such as the ISO 8859
879 series. This means that wide character locales, for example for Asian
880 languages, are not well-supported. Use of these locales may cause core
881 dumps. If the platform has the capability for Perl to detect such a
882 locale, starting in Perl v5.22, Perl will warn, default enabled, using
883 the "locale" warning category, whenever such a locale is switched into.
884 The UTF-8 locale support is actually a superset of POSIX locales,
885 because it is really full Unicode behavior as if no "LC_CTYPE" locale
886 were in effect at all (except for tainting; see "SECURITY"). POSIX
887 locales, even UTF-8 ones, are lacking certain concepts in Unicode, such
888 as the idea that changing the case of a character could expand to be
889 more than one character. Perl, in a UTF-8 locale, will give you that
890 expansion. Prior to v5.20, Perl treated a UTF-8 locale on some
891 platforms like an ISO 8859-1 one, with some restrictions, and on other
892 platforms more like the "C" locale. For releases v5.16 and v5.18,
893 "use locale ':not_characters'" could be used as a workaround for this
894 (see "Unicode and UTF-8").
895
896 Note that there are quite a few things that are unaffected by the
897 current locale. Any literal character is the native character for the
898 given platform. Hence 'A' means the character at code point 65 on
899 ASCII platforms, and 193 on EBCDIC. That may or may not be an 'A' in
900 the current locale, if that locale even has an 'A'. Similarly, all the
901 escape sequences for particular characters, "\n" for example, always
902 mean the platform's native one. This means, for example, that "\N" in
903 regular expressions (every character but new-line) works on the
904 platform character set.
905
906 Starting in v5.22, Perl will by default warn when switching into a
907 locale that redefines any ASCII printable character (plus "\t" and
908 "\n") into a different class than expected. This is likely to happen
909 on modern locales only on EBCDIC platforms, where, for example, a CCSID
910 0037 locale on a CCSID 1047 machine moves "[", but it can happen on
911 ASCII platforms with the ISO 646 and other 7-bit locales that are
912 essentially obsolete. Things may still work, depending on what
913 features of Perl are used by the program. For example, if the locale
914 makes "|" a "\w" character, and there are no regular
915 expressions where this matters, the program may still work properly.
916 The warning lists all the characters that it can determine could be
917 adversely affected.
918
919 Note: A broken or malicious "LC_CTYPE" locale definition may result in
920 clearly ineligible characters being considered to be alphanumeric by
921 your application. For strict matching of (mundane) ASCII letters and
922 digits--for example, in command strings--locale-aware applications
923 should use "\w" with the "/a" regular expression modifier. See
924 "SECURITY".
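
A minimal sketch of such strict matching follows. The helper name
"is_ascii_word()" is made up for illustration. The explicit "/a"
modifier overrides the locale rules that "use locale" would otherwise
select for the pattern, so "\w" matches only ASCII letters, digits,
and the underscore.

```perl
use strict;
use warnings;
use locale;

# Hypothetical helper: true only for a non-empty string of ASCII
# word characters, regardless of the current LC_CTYPE locale.
sub is_ascii_word {
    my ($string) = @_;
    return $string =~ /\A\w+\z/a ? 1 : 0;
}

print is_ascii_word("cafe_1"),   "\n";   # 1
print is_ascii_word("caf\xE9"),  "\n";   # 0: 0xE9 is not an ASCII letter
print is_ascii_word("rm -rf"),   "\n";   # 0: space and hyphen are rejected
```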
925
926 Category "LC_NUMERIC": Numeric Formatting
927 After a proper "POSIX::setlocale()" call, and within the scope of a
928 "use locale" form that includes numerics, Perl obeys the "LC_NUMERIC"
929 locale information, which controls an application's idea of how numbers
930 should be formatted for human readability. In most implementations the
931 only effect is to change the character used for the decimal
932 point--perhaps from "." to ",". Perl's formatting functions aren't aware of such
933 niceties as thousands separation and so on. (See "The localeconv
934 function" if you care about these things.)
935
936 use POSIX qw(strtod setlocale LC_NUMERIC);
937 use locale;
938
939 setlocale LC_NUMERIC, "";
940
941 $n = 5/2; # Assign numeric 2.5 to $n
942
943 $a = " $n"; # Locale-dependent conversion to string
944
945 print "half five is $n\n"; # Locale-dependent output
946
947 printf "half five is %g\n", $n; # Locale-dependent output
948
949 print "DECIMAL POINT IS COMMA\n"
950 if $n == (strtod("2,5"))[0]; # Locale-dependent conversion
951
952 See also I18N::Langinfo and "RADIXCHAR".
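
Sometimes a program running under "use locale" needs to emit a number
for another program to read, with "." as the decimal point. One way to
sketch this is to switch "LC_NUMERIC" to the "C" locale around the
formatting and then restore it. The helper name "format_c_numeric()"
is hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_NUMERIC);
use locale;

# Hypothetical helper: format a number with "." as the decimal point,
# whatever LC_NUMERIC locale is current, then restore that locale.
sub format_c_numeric {
    my ($number) = @_;
    my $saved = setlocale(LC_NUMERIC);     # one argument: query only
    setlocale(LC_NUMERIC, "C");
    my $text = sprintf "%g", $number;
    setlocale(LC_NUMERIC, $saved) if defined $saved;
    return $text;
}

setlocale(LC_NUMERIC, "");                 # adopt the environment's setting
print format_c_numeric(5/2), "\n";         # 2.5
```

Doing the formatting inside a block that says "no locale" is a simpler
alternative when the code can be arranged that way.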
953
954 Category "LC_MONETARY": Formatting of monetary amounts
955 The C standard defines the "LC_MONETARY" category, but not a function
956 that is affected by its contents. (Those with experience of standards
957 committees will recognize that the working group decided to punt on the
958 issue.) Consequently, Perl essentially takes no notice of it. If you
959 really want to use "LC_MONETARY", you can query its contents--see "The
960 localeconv function"--and use the information that it returns in your
961 application's own formatting of currency amounts. However, you may
962 well find that the information, voluminous and complex though it may
963 be, still does not quite meet your requirements: currency formatting is
964 a hard nut to crack.
965
966 See also I18N::Langinfo and "CRNCYSTR".
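
As a deliberately simple-minded sketch of using that information, the
hypothetical helper "format_money()" below formats a non-negative
amount from a "localeconv()"-style hash. It looks only at
"currency_symbol", "mon_decimal_point", "frac_digits", "p_cs_precedes"
and "p_sep_by_space", treats any non-zero "p_sep_by_space" as a single
space, and ignores grouping and the sign conventions. The defaults used
when an element is missing are assumptions.

```perl
use strict;
use warnings;
use POSIX qw(localeconv);

# Hypothetical helper, not a POSIX or core function.
sub format_money {
    my ($amount, $lc) = @_;
    my $symbol  = $lc->{currency_symbol}   // '';
    my $point   = $lc->{mon_decimal_point} || '.';
    my $digits  = $lc->{frac_digits}       // 2;   # assumed default
    my $precede = $lc->{p_cs_precedes}     // 1;   # assumed default
    my $gap     = (length $symbol && $lc->{p_sep_by_space}) ? ' ' : '';

    # No "use locale" here, so sprintf always produces "."
    my $number = sprintf "%.*f", $digits, $amount;
    $number =~ s/\./$point/;

    return $precede ? "$symbol$gap$number" : "$number$gap$symbol";
}

# With a hand-written hash, to show the effect of the fields
my %example = (
    currency_symbol   => 'EUR',
    mon_decimal_point => ',',
    frac_digits       => 2,
    p_cs_precedes     => 0,
    p_sep_by_space    => 1,
);
print format_money(12.5, \%example), "\n";      # 12,50 EUR

# With whatever the current LC_MONETARY locale provides
print format_money(12.5, localeconv()), "\n";
```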
967
968 Category "LC_TIME": Representation of time
969 Output produced by "POSIX::strftime()", which builds a formatted human-
970 readable date/time string, is affected by the current "LC_TIME" locale.
971 Thus, in a French locale, the output produced by the %B format element
972 (full month name) for the first month of the year would be "janvier".
973 Here's how to get a list of long month names in the current locale:
974
975 use POSIX qw(strftime);
976 for (0..11) {
977 $long_month_name[$_] =
978 strftime("%B", 0, 0, 0, 1, $_, 96);
979 }
980
981 Note: "use locale" isn't needed in this example: "strftime()" is a
982 POSIX function which uses the standard system-supplied "libc" function
983 that always obeys the current "LC_TIME" locale.
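
The same technique gives the day names. In this sketch the helper name
"weekday_names()" is made up; it relies on 7 January 1996 having been a
Sunday, so the seven days starting there run from Sunday to Saturday.

```perl
use strict;
use warnings;
use POSIX qw(strftime setlocale LC_TIME);

# Hypothetical helper: full weekday names, Sunday first, in the
# current LC_TIME locale.  (mday 7..13, month 0, year 96 is the
# week of 7-13 January 1996.)
sub weekday_names {
    return map { strftime("%A", 0, 0, 0, 7 + $_, 0, 96) } 0 .. 6;
}

setlocale(LC_TIME, "");    # adopt the environment's setting, if valid
print join(", ", weekday_names()), "\n";
```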
984
985 See also I18N::Langinfo and "ABDAY_1".."ABDAY_7", "DAY_1".."DAY_7",
986 "ABMON_1".."ABMON_12", and "MON_1".."MON_12".
987
988 Other categories
989 The remaining locale categories are not currently used by Perl itself.
990 But again note that things Perl interacts with may use these, including
991 extensions outside the standard Perl distribution, and by the operating
992 system and its utilities. Note especially that the string value of $!
993 and the error messages given by external utilities may be changed by
994 "LC_MESSAGES". If you want to have portable error codes, use "%!".
995 See Errno.
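
For instance, a test for a missing file can be written against "%!"
rather than against the text of $!, which may be translated. This is a
minimal sketch; the helper name "is_missing()" and the path are made up
for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $path cannot be opened because it
# does not exist, whatever language the error message is in.
sub is_missing {
    my ($path) = @_;
    return 0 if open my $fh, '<', $path;
    return $!{ENOENT} ? 1 : 0;
}

my $path = "/nonexistent/perllocale-example";   # illustrative path
print "$path is missing\n" if is_missing($path);
```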
996
997 SECURITY
998 Although the main discussion of Perl security issues can be found in
999 perlsec, a discussion of Perl's locale handling would be incomplete if
1000 it did not draw your attention to locale-dependent security issues.
1001 Locales--particularly on systems that allow unprivileged users to build
1002 their own locales--are untrustworthy. A malicious (or just plain
1003 broken) locale can make a locale-aware application give unexpected
1004 results. Here are a few possibilities:
1005
1006 • Regular expression checks for safe file names or mail addresses
1007 using "\w" may be spoofed by an "LC_CTYPE" locale that claims that
1008 characters such as ">" and "|" are alphanumeric.
1009
1010 • String interpolation with case-mapping, as in, say, "$dest =
1011 "C:\U$name.$ext"", may produce dangerous results if a bogus
1012 "LC_CTYPE" case-mapping table is in effect.
1013
1014 • A sneaky "LC_COLLATE" locale could result in the names of students
1015 with "D" grades appearing ahead of those with "A"s.
1016
1017 • An application that takes the trouble to use information in
1018 "LC_MONETARY" may format debits as if they were credits and vice
1019 versa if that locale has been subverted. Or it might make payments
1020 in US dollars instead of Hong Kong dollars.
1021
1022 • The date and day names in dates formatted by "strftime()" could be
1023 manipulated to advantage by a malicious user able to subvert the
1024 "LC_TIME" locale. ("Look--it says I wasn't in the building on
1025 Sunday.")
1026
1027 Such dangers are not peculiar to the locale system: any aspect of an
1028 application's environment which may be modified maliciously presents
1029 similar challenges. Similarly, they are not specific to Perl: any
1030 programming language that allows you to write programs that take
1031 account of their environment exposes you to these issues.
1032
1033 Perl cannot protect you from all possibilities shown in the
1034 examples--there is no substitute for your own vigilance--but, when "use
1035 locale" is in effect, Perl uses the tainting mechanism (see perlsec) to
1036 mark string results that become locale-dependent, and which may be
1037 untrustworthy in consequence. Here is a summary of the tainting
1038 behavior of operators and functions that may be affected by the locale:
1039
1040 • Comparison operators ("lt", "le", "ge", "gt" and "cmp"):
1041
1042 Scalar true/false (or less/equal/greater) result is never tainted.
1043
1044 • Case-mapping interpolation (with "\l", "\L", "\u", "\U", or "\F")
1045
1046 The result string containing interpolated material is tainted if a
1047 "use locale" form that includes "LC_CTYPE" is in effect.
1048
1049 • Matching operator ("m//"):
1050
1051 Scalar true/false result never tainted.
1052
1053 All subpatterns, either delivered as a list-context result or as $1
1054 etc., are tainted if a "use locale" form that includes "LC_CTYPE"
1055 is in effect, and the subpattern regular expression contains a
1056 locale-dependent construct. These constructs include "\w" (to
1057 match an alphanumeric character), "\W" (non-alphanumeric
1058 character), "\b" and "\B" (word-boundary and non-boundary, which
1059 depend on what "\w" and "\W" match), "\s" (whitespace character),
1060 "\S" (non whitespace character), "\d" and "\D" (digits and non-
1061 digits), and the POSIX character classes, such as "[:alpha:]" (see
1062 "POSIX Character Classes" in perlrecharclass).
1063
1064 Tainting is also likely if the pattern is to be matched case-
1065 insensitively (via "/i"). The exception is if all the code points
1066 to be matched this way are above 255 and do not have folds under
1067 Unicode rules to below 256. Tainting is not done for these because
1068 Perl only uses Unicode rules for such code points, and those rules
1069 are the same no matter what the current locale.
1070
1071 The matched-pattern variables, $&, "$`" (pre-match), "$'" (post-
1072 match), and $+ (last match) also are tainted.
1073
1074 • Substitution operator ("s///"):
1075
1076 Has the same behavior as the match operator. Also, the left
1077 operand of "=~" becomes tainted when a "use locale" form that
1078 includes "LC_CTYPE" is in effect, if modified as a result of a
1079 substitution based on a regular expression match involving any of
1080 the things mentioned in the previous item, or of case-mapping, such
1081 as "\l", "\L", "\u", "\U", or "\F".
1082
1083 • Output formatting functions ("printf()" and "write()"):
1084
1085 Results are never tainted because otherwise even output from print,
1086 for example "print(1/7)", should be tainted if "use locale" is in
1087 effect.
1088
1089 • Case-mapping functions ("lc()", "lcfirst()", "uc()", "ucfirst()"):
1090
1091 Results are tainted if a "use locale" form that includes "LC_CTYPE"
1092 is in effect.
1093
1094 • POSIX locale-dependent functions ("localeconv()", "strcoll()",
1095 "strftime()", "strxfrm()"):
1096
1097 Results are never tainted.
1098
1099 Three examples illustrate locale-dependent tainting. The first
1100 program, which ignores its locale, won't run: a value taken directly
1101 from the command line may not be used to name an output file when taint
1102 checks are enabled.
1103
1104 #!/usr/local/bin/perl -T
1105 # Run with taint checking
1106
1107 # Command line sanity check omitted...
1108 $tainted_output_file = shift;
1109
1110 open(F, ">$tainted_output_file")
1111 or warn "Open of $tainted_output_file failed: $!\n";
1112
1113 The program can be made to run by "laundering" the tainted value
1114 through a regular expression: the second example--which still ignores
1115 locale information--runs, creating the file named on its command line
1116 if it can.
1117
1118 #!/usr/local/bin/perl -T
1119
1120 $tainted_output_file = shift;
1121 $tainted_output_file =~ m%[\w/]+%;
1122 $untainted_output_file = $&;
1123
1124 open(F, ">$untainted_output_file")
1125 or warn "Open of $untainted_output_file failed: $!\n";
1126
1127 Compare this with a similar but locale-aware program:
1128
1129 #!/usr/local/bin/perl -T
1130
1131 $tainted_output_file = shift;
1132 use locale;
1133 $tainted_output_file =~ m%[\w/]+%;
1134 $localized_output_file = $&;
1135
1136 open(F, ">$localized_output_file")
1137 or warn "Open of $localized_output_file failed: $!\n";
1138
1139 This third program fails to run because $& is tainted: it is the result
1140 of a match involving "\w" while "use locale" is in effect.
1141
1142 ENVIRONMENT
1143 PERL_SKIP_LOCALE_INIT
1144 This environment variable, available starting in Perl
1145 v5.20, if set (to any value), tells Perl not to use the
1146 rest of the environment variables to initialize its locale.
1147 Instead, Perl uses whatever the current locale settings
1148 are. This is particularly useful in embedded environments,
1149 see "Using embedded Perl with POSIX locales" in perlembed.
1150
1151 PERL_BADLANG
1152 A string that can suppress Perl's warning about failed
1153 locale settings at startup. Failure can occur if the
1154 locale support in the operating system is lacking (broken)
1155 in some way--or if you mistyped the name of a locale when
1156 you set up your environment. If this environment variable
1157 is absent, or has a value other than "0" or "", Perl will
1158 complain about locale setting failures.
1159
1160 NOTE: "PERL_BADLANG" only gives you a way to hide the
1161 warning message. The message tells about some problem in
1162 your system's locale support, and you should investigate
1163 what the problem is.
1164
1165 The following environment variables are not specific to Perl: They are
1166 part of the standardized (ISO C, XPG4, POSIX 1.c) "setlocale()" method
1167 for controlling an application's opinion on data. Windows is non-
1168 POSIX, but Perl arranges for the following to work as described anyway.
1169 If the locale given by an environment variable is not valid, Perl tries
1170 the next lower one in priority. If none are valid, on Windows, the
1171 system default locale is then tried. If all else fails, the "C" locale
1172 is used. If even that doesn't work, something is badly broken, but
1173 Perl tries to forge ahead with whatever the locale settings might be.
1174
1175 "LC_ALL" "LC_ALL" is the "override-all" locale environment variable.
1176 If set, it overrides all the rest of the locale environment
1177 variables.
1178
1179 "LANGUAGE" NOTE: "LANGUAGE" is a GNU extension; it affects you only if
1180 you are using the GNU libc. This is the case if you are
1181 using e.g. Linux. If you are using "commercial" Unixes you
1182 are most probably not using GNU libc and you can ignore
1183 "LANGUAGE".
1184
1185 However, in the case you are using "LANGUAGE": it affects
1186 the language of informational, warning, and error messages
1187 output by commands (in other words, it's like
1188 "LC_MESSAGES") but it has higher priority than "LC_ALL".
1189 Moreover, it's not a single value but instead a "path"
1190 (":"-separated list) of languages (not locales). See the
1191 GNU "gettext" library documentation for more information.
1192
1193 "LC_CTYPE" In the absence of "LC_ALL", "LC_CTYPE" chooses the
1194 character type locale. In the absence of both "LC_ALL" and
1195 "LC_CTYPE", "LANG" chooses the character type locale.
1196
1197 "LC_COLLATE"
1198 In the absence of "LC_ALL", "LC_COLLATE" chooses the
1199 collation (sorting) locale. In the absence of both
1200 "LC_ALL" and "LC_COLLATE", "LANG" chooses the collation
1201 locale.
1202
1203 "LC_MONETARY"
1204 In the absence of "LC_ALL", "LC_MONETARY" chooses the
1205 monetary formatting locale. In the absence of both
1206 "LC_ALL" and "LC_MONETARY", "LANG" chooses the monetary
1207 formatting locale.
1208
1209 "LC_NUMERIC"
1210 In the absence of "LC_ALL", "LC_NUMERIC" chooses the
1211 numeric format locale. In the absence of both "LC_ALL" and
1212 "LC_NUMERIC", "LANG" chooses the numeric format.
1213
1214 "LC_TIME" In the absence of "LC_ALL", "LC_TIME" chooses the date and
1215 time formatting locale. In the absence of both "LC_ALL"
1216 and "LC_TIME", "LANG" chooses the date and time formatting
1217 locale.
1218
1219 "LANG" "LANG" is the "catch-all" locale environment variable. If
1220 it is set, it is used as the last resort after the overall
1221 "LC_ALL" and the category-specific "LC_foo".
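
The precedence just described can be sketched as a small lookup. The
helper name "effective_locale_var()" is hypothetical. It only reports
which variable would be consulted for a category; it treats an empty
value as unset and, unlike Perl itself, makes no attempt to check that
the locale named is valid. "LANGUAGE" is left out because it names
languages rather than locales.

```perl
use strict;
use warnings;

# Hypothetical helper: which environment variable decides the locale
# for $category (e.g. "LC_TIME"), given a hash like %ENV.
# Returns (name, value), or an empty list if none is set.
sub effective_locale_var {
    my ($category, $env) = @_;
    for my $name ('LC_ALL', $category, 'LANG') {
        my $value = $env->{$name};
        return ($name, $value) if defined $value && length $value;
    }
    return;
}

my ($name, $value) = effective_locale_var('LC_TIME', \%ENV);
print defined $name ? "LC_TIME comes from $name=$value\n"
                    : "no locale variables set\n";
```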
1222
1223 Examples
1224 "LC_NUMERIC" controls numeric output:
1225
1226 use locale;
1227 use POSIX qw(locale_h); # Imports setlocale() and the LC_ constants.
1228 setlocale(LC_NUMERIC, "fr_FR") or die "Pardon";
1229 printf "%g\n", 1.23; # If the "fr_FR" succeeded, probably shows 1,23.
1230
1231 and also how strings are parsed by "POSIX::strtod()" as numbers:
1232
1233 use locale;
1234 use POSIX qw(locale_h strtod);
1235 setlocale(LC_NUMERIC, "de_DE") or die "Entschuldigung";
1236 my $x = strtod("2,34") + 5;
1237 print $x, "\n"; # Probably shows 7,34.
1238
1239 NOTES
1240 String "eval" and "LC_NUMERIC"
1241 A string eval parses its expression as standard Perl. It is therefore
1242 expecting the decimal point to be a dot. If "LC_NUMERIC" is set to
1243 have this be a comma instead, the parsing will be confused, perhaps
1244 silently.
1245
1246 use locale;
1247 use POSIX qw(locale_h);
1248 setlocale(LC_NUMERIC, "fr_FR") or die "Pardon";
1249 my $a = 1.2;
1250 print eval "$a + 1.5";
1251 print "\n";
1252
1253 prints "13,5". This is because in that locale, the comma is the
1254 decimal point character. The "eval" thus expands to:
1255
1256 eval "1,2 + 1.5"
1257
1258 and the result is not what you likely expected. No warnings are
1259 generated. If you do string "eval"'s within the scope of "use locale",
1260 you should instead change the "eval" line to do something like:
1261
1262 print eval "no locale; $a + 1.5";
1263
1264 This prints 2.7.
1265
1266 You could also exclude "LC_NUMERIC", if you don't need it, by
1267
1268 use locale ':!numeric';
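
As a sketch of what that buys you: with "LC_NUMERIC" excluded, numbers
should be converted to strings with "." as the decimal point whatever
the environment says, so the string "eval" sees ordinary Perl syntax.
The variable names are made up for illustration.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_ALL);
use locale ':!numeric';     # every category except LC_NUMERIC

setlocale(LC_ALL, "");      # adopt the environment's locale, if valid

my $half_five = 5 / 2;
print "half five is $half_five\n";      # half five is 2.5

my $sum = eval "$half_five + 1.5";      # evaluates "2.5 + 1.5"
print "sum is $sum\n";                  # sum is 4
```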
1269
1270 Backward compatibility
1271 Versions of Perl prior to 5.004 mostly ignored locale information,
1272 generally behaving as if something similar to the "C" locale were
1273 always in force, even if the program environment suggested otherwise
1274 (see "The setlocale function"). By default, Perl still behaves this
1275 way for backward compatibility. If you want a Perl application to pay
1276 attention to locale information, you must use the "use locale" pragma
1277 (see "The "use locale" pragma") or, in the unlikely event that you want
1278 to do so for just pattern matching, the "/l" regular expression
1279 modifier (see "Character set modifiers" in perlre) to instruct it to do
1280 so.
1281
1282 Versions of Perl from 5.002 to 5.003 did use the "LC_CTYPE" information
1283 if available; that is, "\w" did understand what were the letters
1284 according to the locale environment variables. The problem was that
1285 the user had no control over the feature: if the C library supported
1286 locales, Perl used them.
1287
1288 I18N::Collate obsolete
1289 In versions of Perl prior to 5.004, per-locale collation was possible
1290 using the "I18N::Collate" library module. This module is now mildly
1291 obsolete and should be avoided in new applications. The "LC_COLLATE"
1292 functionality is now integrated into the Perl core language: One can
1293 use locale-specific scalar data completely normally with "use locale",
1294 so there is no longer any need to juggle with the scalar references of
1295 "I18N::Collate".
1296
1297 Sort speed and memory use impacts
1298 Comparing and sorting by locale is usually slower than the default
1299 sorting; slow-downs of two to four times have been observed. It will
1300 also consume more memory: once a Perl scalar variable has participated
1301 in any string comparison or sorting operation obeying the locale
1302 collation rules, it will take 3-15 times more memory than before. (The
1303 exact multiplier depends on the string's contents, the operating system
1304 and the locale.) These downsides are dictated more by the operating
1305 system's implementation of the locale system than by Perl.
1306
1307 Freely available locale definitions
1308 The Unicode CLDR project extracts the POSIX portion of many of its
1309 locales, available at
1310
1311 https://unicode.org/Public/cldr/2.0.1/
1312
1313 (Newer versions of CLDR require you to compute the POSIX data yourself.
1314 See <http://unicode.org/Public/cldr/latest/>.)
1315
1316 There is a large collection of locale definitions at:
1317
1318 http://std.dkuug.dk/i18n/WG15-collection/locales/
1319
1320 You should be aware that it is unsupported, and is not claimed to be
1321 fit for any purpose. If your system allows installation of arbitrary
1322 locales, you may find the definitions useful as they are, or as a basis
1323 for the development of your own locales.
1324
1325 I18n and l10n
1326 "Internationalization" is often abbreviated as i18n because its first
1327 and last letters are separated by eighteen others. (You may guess why
1328 the internalin ... internaliti ... i18n tends to get abbreviated.) In
1329 the same way, "localization" is often abbreviated to l10n.
1330
1331 An imperfect standard
1332 Internationalization, as defined in the C and POSIX standards, can be
1333 criticized as incomplete and ungainly. These standards also tend, like
1334 standards groups, to divide the world into nations, when we all know
1335 that the world can equally well be divided into bankers, bikers,
1336 gamers, and so on.
1337
1338 Unicode and UTF-8
1339 Support for Unicode was introduced in Perl v5.6, and is more
1340 fully implemented in versions v5.8 and later. See perluniintro.
1341
1342 Starting in Perl v5.20, UTF-8 locales are supported in Perl, except
1343 "LC_COLLATE" is only partially supported; collation support is improved
1344 in Perl v5.26 to a level that may be sufficient for your needs (see
1345 "Category "LC_COLLATE": Collation: Text Comparisons and Sorting").
1346
1347 If you have Perl v5.16 or v5.18 and can't upgrade, you can use
1348
1349 use locale ':not_characters';
1350
1351 When this form of the pragma is used, only the non-character portions
1352 of locales are used by Perl, for example "LC_NUMERIC". Perl assumes
1353 that you have translated all the characters it is to operate on into
1354 Unicode (actually the platform's native character set (ASCII or EBCDIC)
1355 plus Unicode). For data in files, this can conveniently be done by
1356 also specifying
1357
1358 use open ':locale';
1359
1360 This pragma arranges for all inputs from files to be translated into
1361 Unicode from the current locale as specified in the environment (see
1362 "ENVIRONMENT"), and all outputs to files to be translated back into the
1363 locale. (See open). On a per-filehandle basis, you can instead use
1364 the PerlIO::locale module, or the Encode::Locale module, both available
1365 from CPAN. The latter module also has methods to ease the handling of
1366 "ARGV" and environment variables, and can be used on individual
1367 strings. If you know that all your locales will be UTF-8, as many are
1368 these days, you can use the -C command line switch.
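
A minimal sketch of a script that combines the two pragmas, assuming Perl v5.16 or later and a usable locale in the environment. The variable names are illustrative only, and the decimal point printed depends on the locale in effect.

```perl
use strict;
use warnings;
use locale ':not_characters';    # needs Perl v5.16 or later
use open ':locale';              # file I/O is translated to/from the locale
use POSIX ();

# Character operations follow Unicode rules whatever the locale is:
# U+00E9 (e with acute) upper-cases to U+00C9.
my $upper = uc("\x{E9}");
printf "uc(U+00E9) = U+%04X\n", ord($upper);

# Non-character categories such as LC_NUMERIC still follow the locale.
my $decimal_point = POSIX::localeconv()->{decimal_point};
$decimal_point = "." unless defined $decimal_point;
printf "decimal point is %d character(s) long\n", length($decimal_point);
```

Note that "use open ':locale'" on its own does not change the standard file handles; it sets the default layers for handles opened afterwards.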
1369
1370 This form of the pragma allows essentially seamless handling of locales
1371 with Unicode. The collation order will be by Unicode code point order.
1372 Unicode::Collate can be used to get Unicode rules collation.
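
The difference between the two orderings can be seen with a short sketch. It assumes the Unicode::Collate module bundled with Perl and its default collation table; the word list and the "printable" helper are made up for the example.

```perl
use strict;
use warnings;
use locale ':not_characters';
use Unicode::Collate;

my @words = ("Zebra", "apple", "\x{E9}clair");

# Plain sort compares code points: "Z" (U+005A) < "a" (U+0061) < U+00E9.
my @by_code_point = sort @words;

# Unicode::Collate applies the default Unicode Collation Algorithm rules.
my @by_uca = Unicode::Collate->new->sort(@words);

# Hypothetical helper: show non-ASCII characters as escapes.
sub printable {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7E])/sprintf("\\x{%X}", ord $1)/ge;
    return $text;
}

print "code point order: ", join(" ", map { printable($_) } @by_code_point), "\n";
print "UCA order:        ", join(" ", map { printable($_) } @by_uca), "\n";
```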
1373
1374 All the modules and switches just described can be used in v5.20 with
1375 just plain "use locale", and, should the input locales not be UTF-8,
1376 you'll get the less than ideal behavior, described below, that you get
1377 with pre-v5.16 Perls, or when you use the locale pragma without the
1378 ":not_characters" parameter in v5.16 and v5.18. If you are using
1379 exclusively UTF-8 locales in v5.20 and higher, the rest of this section
1380 does not apply to you.
1381
1382 There are two cases, multi-byte and single-byte locales. First multi-
1383 byte:
1384
1385 The only multi-byte (or wide character) locale that Perl is ever likely
1386 to support is UTF-8. This is due to the difficulty of implementation,
1387 the fact that high quality UTF-8 locales are now published for every
1388 area of the world (<https://unicode.org/Public/cldr/2.0.1/> for ones
1389 that are already set-up, but from an earlier version;
1390 <https://unicode.org/Public/cldr/latest/> for the most up-to-date, but
1391 you have to extract the POSIX information yourself), and that failing
1392 all that you can use the Encode module to translate to/from your
1393 locale. So, you'll have to do one of those things if you're using one
1394 of these locales, such as Big5 or Shift JIS. For UTF-8 locales, in
1395 Perls (pre-v5.20) that don't have full UTF-8 locale support, they may
1396 work reasonably well (depending on your C library implementation)
1397 simply because both they and Perl store characters that take up
1398 multiple bytes the same way. However, some, if not most, C library
1399 implementations may not process the characters in the upper half of the
1400 Latin-1 range (128 - 255) properly under "LC_CTYPE". To see if a
1401 character is a particular type under a locale, Perl uses functions
1402 like "isalnum()". Your C library may not work for UTF-8 locales with
1403 those functions, instead only working under the newer wide library
1404 functions like "iswalnum()", which Perl does not use. These multi-byte
1405 locales are treated like single-byte locales, and will have the
1406 restrictions described below. Starting in Perl v5.22 a warning message
1407 is raised when Perl detects a multi-byte locale that it doesn't fully
1408 support.
1409
1410 For single-byte locales, Perl generally takes the tack of using locale
1411 rules on code points that can fit in a single byte, and Unicode rules
1412 for those that can't (though this isn't uniformly applied, see the note
1413 at the end of this section). This prevents many problems in locales
1414 that aren't UTF-8. Suppose the locale is ISO8859-7, Greek. The
1415 character at 0xD7 there is a capital Chi. But in the ISO8859-1 locale,
1416 Latin1, it is a multiplication sign. The POSIX regular expression
1417 character class "[[:alpha:]]" will magically match 0xD7 in the Greek
1418 locale but not in the Latin one.
1419
1420 However, there are places where this breaks down. Certain Perl
1421 constructs are for Unicode only, such as "\p{Alpha}". They assume that
1422 0xD7 always has its Unicode meaning (or the equivalent on EBCDIC
1423 platforms). Since Latin1 is a subset of Unicode and 0xD7 is the
1424 multiplication sign in both Latin1 and Unicode, "\p{Alpha}" will never
1425 match it, regardless of locale. A similar issue occurs with "\N{...}".
1426 Prior to v5.20, it is therefore a bad idea to use "\p{}" or "\N{}"
1427 under plain "use locale"--unless you can guarantee that the locale will
1428 be ISO8859-1. Use POSIX character classes instead.
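
The contrast between the two kinds of character class can be sketched as follows. The locale names tried here are assumptions; the names actually installed differ between systems, and the locale half of the example is skipped when none of them is available.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

my $char = "\xD7";

# \p{Alpha} always gives 0xD7 its Unicode meaning, MULTIPLICATION SIGN.
my $matches_unicode_alpha = ($char =~ /\p{Alpha}/) ? 1 : 0;
print "\\p{Alpha} matches 0xD7: $matches_unicode_alpha\n";

# Assumed names for an ISO 8859-7 Greek locale; stop at the first that works.
my $greek;
for my $name ("el_GR.ISO8859-7", "el_GR.ISO-8859-7") {
    if (defined setlocale(LC_CTYPE, $name)) {
        $greek = $name;
        last;
    }
}

if (defined $greek) {
    use locale;    # the POSIX class now consults the current locale
    my $matches_posix_alpha = ($char =~ /[[:alpha:]]/) ? 1 : 0;
    print "[[:alpha:]] matches 0xD7 under $greek: $matches_posix_alpha\n";
}
else {
    print "No ISO 8859-7 Greek locale found; skipping the locale test\n";
}
```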
1429
1430 Another problem with this approach is that operations that cross the
1431 single byte/multiple byte boundary are not well-defined, and so are
1432 disallowed. (This boundary is between the codepoints at 255/256.) For
1433 example, lower casing LATIN CAPITAL LETTER Y WITH DIAERESIS (U+0178)
1434 should return LATIN SMALL LETTER Y WITH DIAERESIS (U+00FF). But in the
1435 Greek locale, for example, there is no character at 0xFF, and Perl has
1436 no way of knowing what the character at 0xFF is really supposed to
1437 represent. Thus it disallows the operation. In this mode, the
1438 lowercase of U+0178 is itself.
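
A sketch of this behavior, assuming the platform's "C" locale is a single-byte one. On a system whose "C" locale reports a UTF-8 codeset, Unicode rules would apply instead and the result would be U+00FF. Warnings are collected rather than printed, because Perls from v5.22 on may warn about this operation.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

setlocale(LC_CTYPE, "C");    # the "C" locale exists on every system
my $codeset = langinfo(CODESET);

my @warnings;
my $lower;
{
    use locale;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    # U+0178 would lowercase to U+00FF, which is on the other side of
    # the 255/256 boundary, so under a single-byte locale it is left alone.
    $lower = lc("\x{178}");
}

printf "codeset %s: lc(U+0178) = U+%04X, %d warning(s)\n",
    $codeset, ord($lower), scalar @warnings;
```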
1439
1440 The same problems ensue if you enable automatic UTF-8-ification of your
1441 standard file handles, default "open()" layer, and @ARGV on
1442 non-ISO8859-1, non-UTF-8 locales (by using either the -C command line
1443 switch or the "PERL_UNICODE" environment variable; see perlrun).
1444 Things are read in as UTF-8, which would normally imply a Unicode
1445 interpretation, but the presence of a locale causes them to be
1446 interpreted in that locale instead. For example, a 0xD7 code point in
1447 the Unicode input, which should mean the multiplication sign, won't be
1448 interpreted by Perl that way under the Greek locale. This is not a
1449 problem provided you make certain that all locales will always and only
1450 be either an ISO8859-1, or, if you don't have a deficient C library, a
1451 UTF-8 locale.
1452
1453 Still another problem is that this approach can lead to two code points
1454 meaning the same character. Thus in a Greek locale, both U+03A7 and
1455 U+00D7 are GREEK CAPITAL LETTER CHI.
1456
1457 Because of all these problems, starting in v5.22, Perl will raise a
1458 warning if a multi-byte (hence Unicode) code point is used when a
1459 single-byte locale is in effect. (Although it doesn't check for this
1460 if doing so would unreasonably slow execution down.)
1461
1462 Vendor locales are notoriously buggy, and it is difficult for Perl to
1463 test its locale-handling code because this interacts with code that
1464 Perl has no control over; therefore the locale-handling code in Perl
1465 may be buggy as well. (However, the Unicode-supplied locales should be
1466 better, and there is a feedback mechanism to correct any problems.
1467 See "Freely available locale definitions".)
1468
1469 If you have Perl v5.16 or later, the problems mentioned above go away if
1470 you use the ":not_characters" parameter to the locale pragma (except for
1471 vendor bugs in the non-character portions). If your Perl is older, and
1472 you do have locales that work, using them may be worthwhile for certain
1473 specific purposes, as long as you keep in mind the gotchas already
1474 mentioned. For example, if the collation for your locales works, it
1475 runs faster under locales than under Unicode::Collate; and you gain
1476 access to such things as the local currency symbol and the names of the
1477 months and days of the week. (But to hammer home the point, from v5.16
1478 on, you get this access without the downsides of locales by using the
1479 ":not_characters" form of the pragma.)
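
As a rough sketch of that kind of access, the following reads a month name and the currency symbol through the POSIX module. The helpers "january_in" and "printable" are made up for this example, and what is printed for the environment's locale depends on the system.

```perl
use strict;
use warnings;
use locale ':not_characters';
use POSIX qw(setlocale strftime localeconv LC_ALL LC_TIME);

# Hypothetical helper: the full name of January in the named locale,
# or undef if that locale cannot be selected.
sub january_in {
    my ($locale) = @_;
    my $previous = setlocale(LC_TIME);    # one argument: query only
    return undef unless defined setlocale(LC_TIME, $locale);
    my $name = strftime("%B", 0, 0, 0, 1, 0, 100);    # 1 January 2000
    setlocale(LC_TIME, $previous) if defined $previous;
    return $name;
}

# Hypothetical helper: show non-ASCII characters as escapes.
sub printable {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7E])/sprintf("\\x{%X}", ord $1)/ge;
    return $text;
}

print "January in the C locale: ", january_in("C"), "\n";

# The empty string selects the locale named by the environment.
setlocale(LC_ALL, "");
my $local_name = january_in("");
my $currency   = localeconv()->{currency_symbol};
print "January in your locale:  ",
    (defined $local_name ? printable($local_name) : "(unavailable)"), "\n";
print "currency symbol:         ",
    (defined $currency ? printable($currency) : "(none)"), "\n";
```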
1480
1481 Note: The policy of using locale rules for code points that can fit in
1482 a byte, and Unicode rules for those that can't is not uniformly
1483 applied. Pre-v5.12, it was somewhat haphazard; in v5.12 it was applied
1484 fairly consistently to regular expression matching except for bracketed
1485 character classes; in v5.14 it was extended to all regex matches; and
1486 in v5.16 to the casing operations such as "\L" and "uc()". For
1487 collation, in all releases so far, the system's "strxfrm()" function is
1488 called, and whatever it does is what you get. Starting in v5.26,
1489 various bugs are fixed with the way perl uses this function.
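
For illustration, here is one way to sort with keys produced by "POSIX::strxfrm", so that each string is transformed only once. The helper name is made up for this example. The "C" locale is selected only to keep the result predictable; a real program would use the user's collation locale.

```perl
use strict;
use warnings;
use POSIX qw(setlocale strxfrm LC_COLLATE);

# Hypothetical helper: sort by the rules of the current LC_COLLATE locale.
# The transformed keys are compared with a plain "cmp", so this code is
# deliberately not within the scope of "use locale".
sub sort_by_strxfrm {
    my @strings = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ strxfrm($_), $_ ] } @strings;
}

setlocale(LC_COLLATE, "C");
print join(" ", sort_by_strxfrm(qw(banana Cherry apple))), "\n";
```

In the "C" locale the upper-case word comes first because comparison is by byte value; other locales typically interleave the cases.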
1490
1491 BUGS
1492 Collation of strings containing embedded "NUL" characters
1493 "NUL" characters will sort the same as the lowest collating control
1494 character does, or to "\001" in the unlikely event that there are no
1495 control characters at all in the locale. In cases where the strings
1496 don't contain this non-"NUL" control, the results will be correct, and
1497 in many locales, this control, whatever it might be, will rarely be
1498 encountered. But there are cases where a "NUL" should sort before this
1499 control, but doesn't. If two strings do collate identically, the one
1500 containing the "NUL" will sort earlier. Prior to v5.26, there were
1501 more bugs.
1502
1503 Multi-threaded
1504 XS code or C-language libraries called from it that use the system
1505 setlocale(3) function (except on Windows) likely will not work from a
1506 multi-threaded application without changes. See "Locale-aware XS code"
1507 in perlxs.
1508
1509 An XS module that is locale-dependent could have been written under the
1510 assumption that it will never be called in a multi-threaded
1511 environment, and so uses other non-locale constructs that aren't multi-
1512 thread-safe. See "Thread-aware system interfaces" in perlxs.
1513
1514 POSIX does not define a way to get the name of the current per-thread
1515 locale. Some systems, such as Darwin and NetBSD, do implement a
1516 function, querylocale(3), to do this. On non-Windows systems without
1517 it, such as Linux, there are some additional caveats:
1518
1519 • An embedded perl needs to be started up while the global locale is
1520 in effect. See "Using embedded Perl with POSIX locales" in
1521 perlembed.
1522
1523 • It becomes more important for perl to know about all the possible
1524 locale categories on the platform, even if they aren't apparently
1525 used in your program. Perl knows all of the Linux ones. If your
1526 platform has others, you can submit an issue at
1527 <https://github.com/Perl/perl5/issues> for inclusion of it in the
1528 next release. In the meantime, it is possible to edit the Perl
1529 source to teach it about the category, and then recompile. Search
1530 for instances of, say, "LC_PAPER" in the source, and use that as a
1531 template to add the omitted one.
1532
1533 • It is possible, though hard to do, to call "POSIX::setlocale" with
1534 a locale that it doesn't recognize as syntactically legal, but
1535 actually is legal on that system. This should happen only with
1536 embedded perls, or if you hand-craft a locale name yourself.
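
From Perl space, the current setting of each category can be queried by calling "POSIX::setlocale" with a single argument. The following sketch does that and shows why the return value should be checked when changing the locale. The list of categories is only a sample, and the names reported depend on the system.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE LC_COLLATE LC_NUMERIC LC_TIME LC_MONETARY);

my %category = (
    LC_CTYPE    => LC_CTYPE,
    LC_COLLATE  => LC_COLLATE,
    LC_NUMERIC  => LC_NUMERIC,
    LC_TIME     => LC_TIME,
    LC_MONETARY => LC_MONETARY,
);

for my $name (sort keys %category) {
    my $current = setlocale($category{$name});    # one argument: query only
    printf "%-12s %s\n", $name, defined $current ? $current : "(unknown)";
}

# setlocale() returns undef when it cannot switch to the requested name,
# so check the result instead of assuming the change happened.
my $result = setlocale(LC_CTYPE, "C");
print defined $result
    ? "LC_CTYPE is now $result\n"
    : "could not set LC_CTYPE\n";
```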
1537
1538 Broken systems
1539 In certain systems, the operating system's locale support is broken and
1540 cannot be fixed or used by Perl. Such deficiencies can and will result
1541 in mysterious hangs and/or Perl core dumps when "use locale" is in
1542 effect. When confronted with such a system, please report in
1543 excruciating detail to <https://github.com/Perl/perl5/issues>, and
1544 also contact your vendor: bug fixes may exist for these problems in
1545 your operating system. Sometimes such bug fixes are called an
1546 operating system upgrade. If you have the source for Perl, include in
1547 the bug report the output of the test described above in "Testing for
1548 broken locales".
1549
1550 SEE ALSO
1551 I18N::Langinfo, perluniintro, perlunicode, open, "localeconv" in POSIX,
1552 "setlocale" in POSIX, "strcoll" in POSIX, "strftime" in POSIX, "strtod"
1553 in POSIX, "strxfrm" in POSIX.
1554
1555 For special considerations when Perl is embedded in a C program, see
1556 "Using embedded Perl with POSIX locales" in perlembed.
1557
1558 HISTORY
1559 Jarkko Hietaniemi's original perli18n.pod heavily hacked by Dominic
1560 Dunlop, assisted by the perl5-porters. Prose worked over a bit by Tom
1561 Christiansen, and now maintained by Perl 5 porters.
1562
1563
1564
1565perl v5.34.1 2022-03-15 PERLLOCALE(1)