PERLUNIINTRO(1)        Perl Programmers Reference Guide       PERLUNIINTRO(1)



NAME
       perluniintro - Perl Unicode introduction

DESCRIPTION
       This document gives a general idea of Unicode and how to use Unicode
       in Perl.  See "Further Resources" for references to more in-depth
       treatments of Unicode.

   Unicode
       Unicode is a character set standard which plans to codify all of the
       writing systems of the world, plus many other symbols.

       Unicode and ISO/IEC 10646 are coordinated standards that unify almost
       all other modern character set standards, covering more than 80
       writing systems and hundreds of languages, including all
       commercially-important modern languages.  All characters in the
       largest Chinese, Japanese, and Korean dictionaries are also encoded.
       The standards will eventually cover almost all characters in more
       than 250 writing systems and thousands of languages.  Unicode 1.0 was
       released in October 1991, and 6.0 in October 2010.

       A Unicode character is an abstract entity.  It is not bound to any
       particular integer width, especially not to the C language "char".
       Unicode is language-neutral and display-neutral: it does not encode
       the language of the text, and it does not generally define fonts or
       other graphical layout details.  Unicode operates on characters and
       on text built from those characters.

       Unicode defines characters like "LATIN CAPITAL LETTER A" or "GREEK
       SMALL LETTER ALPHA" and unique numbers for the characters, in this
       case 0x0041 and 0x03B1, respectively.  These unique numbers are
       called code points.  A code point is essentially the position of the
       character within the set of all possible Unicode characters, and thus
       in Perl, the term ordinal is often used interchangeably with it.

       The Unicode standard prefers using hexadecimal notation for the code
       points.  If numbers like 0x0041 are unfamiliar to you, take a peek at
       a later section, "Hexadecimal Notation".  The Unicode standard uses
       the notation "U+0041 LATIN CAPITAL LETTER A", to give the hexadecimal
       code point and the normative name of the character.

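       For instance, the mapping between a character and the "U+XXXX"
       notation can be explored with "ord()", "chr()", and "sprintf()" (a
       small illustrative sketch, not part of the standard's text):

```perl
# Convert a character to its U+XXXX notation and back.
my $alpha    = "\x{3B1}";                     # GREEK SMALL LETTER ALPHA
my $notation = sprintf "U+%04X", ord $alpha;  # code point as hexadecimal
print "$notation\n";                          # prints "U+03B1"
print chr(0x0041), "\n";                      # prints "A"
```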
       Unicode also defines various properties for the characters, like
       "uppercase" or "lowercase", "decimal digit", or "punctuation"; these
       properties are independent of the names of the characters.
       Furthermore, various operations on the characters like uppercasing,
       lowercasing, and collating (sorting) are defined.

       A Unicode logical "character" can actually consist of more than one
       internal actual "character" or code point.  For Western languages,
       this is adequately modelled by a base character (like "LATIN CAPITAL
       LETTER A") followed by one or more modifiers (like "COMBINING ACUTE
       ACCENT").  This sequence of base character and modifiers is called a
       combining character sequence.  Some non-western languages require
       more complicated models, so Unicode created the grapheme cluster
       concept, which was later further refined into the extended grapheme
       cluster.  For example, a Korean Hangul syllable is considered a
       single logical character, but most often consists of three actual
       Unicode characters: a leading consonant followed by an interior
       vowel followed by a trailing consonant.

       Whether to call these extended grapheme clusters "characters"
       depends on your point of view.  If you are a programmer, you
       probably would tend towards seeing each element in the sequences as
       one unit, or "character".  However, from the user's point of view,
       the whole sequence could be seen as one "character" since that's
       probably what it looks like in the context of the user's language.
       In this document, we take the programmer's point of view: one
       "character" is one Unicode code point.

       For some combinations of base character and modifiers, there are
       precomposed characters.  There is a single character equivalent, for
       example, to the sequence "LATIN CAPITAL LETTER A" followed by
       "COMBINING ACUTE ACCENT".  It is called "LATIN CAPITAL LETTER A WITH
       ACUTE".  These precomposed characters are, however, only available
       for some combinations, and are mainly meant to support round-trip
       conversions between Unicode and legacy standards (like ISO 8859).
       Using sequences, as Unicode does, means that fewer basic building
       blocks (code points) are needed to express many more potential
       grapheme clusters.  To support conversion between equivalent forms,
       various normalization forms are also defined.  Thus, "LATIN CAPITAL
       LETTER A WITH ACUTE" is in Normalization Form Composed (abbreviated
       NFC), and the sequence "LATIN CAPITAL LETTER A" followed by
       "COMBINING ACUTE ACCENT" represents the same character in
       Normalization Form Decomposed (NFD).

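       Conversions between these forms can be performed with the standard
       Unicode::Normalize module, which is bundled with Perl; a brief
       sketch:

```perl
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{C1}";      # LATIN CAPITAL LETTER A WITH ACUTE
my $decomposed = "A\x{301}";    # "A" + COMBINING ACUTE ACCENT

# Decomposing the precomposed character yields the combining sequence,
# and composing the sequence yields the precomposed character.
print NFD($composed)   eq $decomposed ? "NFD matches\n" : "differ\n";
print NFC($decomposed) eq $composed   ? "NFC matches\n" : "differ\n";
```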
       Because of backward compatibility with legacy encodings, the "a
       unique number for every character" idea breaks down a bit: instead,
       there is "at least one number for every character".  The same
       character could be represented differently in several legacy
       encodings.  The converse is also not true: some code points do not
       have an assigned character.  Firstly, there are unallocated code
       points within otherwise used blocks.  Secondly, there are special
       Unicode control characters that do not represent true characters.

       When Unicode was first conceived, it was thought that all the
       world's characters could be represented using a 16-bit word; that
       is, a maximum of 0x10000 (or 65536) characters from 0x0000 to 0xFFFF
       would be needed.  This soon proved to be false, and since Unicode
       2.0 (July 1996), Unicode has been defined all the way up to 21 bits
       (0x10FFFF), and Unicode 3.1 (March 2001) defined the first
       characters above 0xFFFF.  The first 0x10000 characters are called
       the Plane 0, or the Basic Multilingual Plane (BMP).  With Unicode
       3.1, 17 (yes, seventeen) planes in all were defined--but they are
       nowhere near full of defined characters, yet.

       When a new language is being encoded, Unicode generally will choose
       a "block" of consecutive unallocated code points for its characters.
       So far, the number of code points in these blocks has always been
       evenly divisible by 16.  Extras in a block, not currently needed,
       are left unallocated, for future growth.  But there have been
       occasions when a later release needed more code points than the
       available extras, and a new block had to be allocated somewhere
       else, not contiguous to the initial one, to handle the overflow.
       Thus, it became apparent early on that "block" wasn't an adequate
       organizing principle, and so the "Script" property was created.
       (Later an improved script property was added as well, the
       "Script_Extensions" property.)  Those code points that are in
       overflow blocks can still have the same script as the original ones.
       The script concept fits more closely with natural language: there is
       "Latin" script, "Greek" script, and so on; and there are several
       artificial scripts, like "Common" for characters that are used in
       multiple scripts, such as mathematical symbols.  Scripts usually
       span varied parts of several blocks.  For more information about
       scripts, see "Scripts" in perlunicode.  The division into blocks
       exists, but it is almost completely accidental--an artifact of how
       the characters have been and still are allocated.  (Note that this
       paragraph has oversimplified things for the sake of this being an
       introduction.  Unicode doesn't really encode languages, but the
       writing systems for them--their scripts; and one script can be used
       by many languages.  Unicode also encodes things that aren't really
       about languages, such as symbols like "BAGGAGE CLAIM".)

       The Unicode code points are just abstract numbers.  To input and
       output these abstract numbers, the numbers must be encoded or
       serialised somehow.  Unicode defines several character encoding
       forms, of which UTF-8 is perhaps the most popular.  UTF-8 is a
       variable length encoding that encodes Unicode characters as 1 to 4
       bytes.  Other encodings include UTF-16 and UTF-32 and their big- and
       little-endian variants (UTF-8 is byte-order independent).  ISO/IEC
       10646 defines the UCS-2 and UCS-4 encoding forms.

       For more information about encodings--for instance, to learn what
       surrogates and byte order marks (BOMs) are--see perlunicode.

   Perl's Unicode Support
       Starting from Perl 5.6.0, Perl has had the capacity to handle
       Unicode natively.  Perl 5.8.0, however, is the first recommended
       release for serious Unicode work.  The maintenance release 5.6.1
       fixed many of the problems of the initial Unicode implementation,
       but for example regular expressions still did not work with Unicode
       in 5.6.1.  Perl 5.14.0 is the first release where Unicode support is
       (almost) seamlessly integrable without gotchas (the exception being
       some differences in "quotemeta", which is fixed starting in Perl
       5.16.0).  To enable this seamless support, you should "use feature
       'unicode_strings'" (which is automatically selected if you "use
       5.012" or higher).  See feature.  (5.14 also fixes a number of bugs
       and departures from the Unicode standard.)

       Before Perl 5.8.0, "use utf8" was used to declare that operations in
       the current block or file would be Unicode-aware.  This model was
       found to be wrong, or at least clumsy: the "Unicodeness" is now
       carried with the data, instead of being attached to the operations.
       Starting with Perl 5.8.0, only one case remains where an explicit
       "use utf8" is needed: if your Perl script itself is encoded in
       UTF-8, you can use UTF-8 in your identifier names, and in string and
       regular expression literals, by saying "use utf8".  This is not the
       default because scripts with legacy 8-bit data in them would break.
       See utf8.

   Perl's Unicode Model
       Perl supports both pre-5.6 strings of eight-bit native bytes, and
       strings of Unicode characters.  The general principle is that Perl
       tries to keep its data as eight-bit bytes for as long as possible,
       but as soon as Unicodeness cannot be avoided, the data is
       transparently upgraded to Unicode.  Prior to Perl 5.14, the upgrade
       was not completely transparent (see "The "Unicode Bug"" in
       perlunicode), and for backwards compatibility, full transparency is
       not gained unless "use feature 'unicode_strings'" (see feature) or
       "use 5.012" (or higher) is selected.

       Internally, Perl currently uses either whatever the native eight-bit
       character set of the platform (for example Latin-1) is, or UTF-8, to
       encode Unicode strings.  Specifically, if all code points in the
       string are 0xFF or less, Perl uses the native eight-bit character
       set.  Otherwise, it uses UTF-8.

       A user of Perl does not normally need to know nor care how Perl
       happens to encode its internal strings, but it becomes relevant when
       outputting Unicode strings to a stream without a PerlIO layer (one
       with the "default" encoding).  In such a case, the raw bytes used
       internally (the native character set or UTF-8, as appropriate for
       each string) will be used, and a "Wide character" warning will be
       issued if those strings contain a character beyond 0x00FF.

       For example,

           perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

       produces a fairly useless mixture of native bytes and UTF-8, as well
       as a warning:

           Wide character in print at ...

       To output UTF-8, use the ":encoding" or ":utf8" output layer.
       Prepending

           binmode(STDOUT, ":utf8");

       to this sample program ensures that the output is completely UTF-8,
       and removes the program's warning.

       You can enable automatic UTF-8-ification of your standard file
       handles, default "open()" layer, and @ARGV by using either the "-C"
       command line switch or the "PERL_UNICODE" environment variable; see
       perlrun for the documentation of the "-C" switch.

       Note that this means that Perl expects other software to work the
       same way: if Perl has been led to believe that STDIN should be
       UTF-8, but then STDIN coming in from another command is not UTF-8,
       Perl will likely complain about the malformed UTF-8.

       All features that combine Unicode and I/O also require using the new
       PerlIO feature.  Almost all Perl 5.8 platforms do use PerlIO,
       though: you can see whether yours does by running "perl -V" and
       looking for "useperlio=define".

   Unicode and EBCDIC
       Perl 5.8.0 also supports Unicode on EBCDIC platforms.  There,
       Unicode support is somewhat more complex to implement since
       additional conversions are needed at every step.

       Later Perl releases have added code that will not work on EBCDIC
       platforms, and no one has complained, so the divergence has
       continued.  If you want to run Perl on an EBCDIC platform, send
       email to perlbug@perl.org.

       On EBCDIC platforms, the internal Unicode encoding form is
       UTF-EBCDIC instead of UTF-8.  The difference is that UTF-8 is
       "ASCII-safe", in that ASCII characters encode to UTF-8 as-is, while
       UTF-EBCDIC is "EBCDIC-safe".

   Creating Unicode
       To create Unicode characters in literals for code points above 0xFF,
       use the "\x{...}" notation in double-quoted strings:

           my $smiley = "\x{263a}";

       Similarly, it can be used in regular expression literals:

           $smiley =~ /\x{263a}/;

       At run-time you can use "chr()":

           my $hebrew_alef = chr(0x05d0);

       See "Further Resources" for how to find all these numeric codes.

       Naturally, "ord()" will do the reverse: it turns a character into a
       code point.

       Note that "\x.." (no "{}" and only two hexadecimal digits),
       "\x{...}", and "chr(...)" for arguments less than 0x100 (decimal
       256) generate an eight-bit character for backward compatibility with
       older Perls.  For arguments of 0x100 or more, Unicode characters are
       always produced.  If you want to force the production of Unicode
       characters regardless of the numeric value, use "pack("U", ...)"
       instead of "\x..", "\x{...}", or "chr()".

       You can invoke characters by name in double-quoted strings:

           my $arabic_alef = "\N{ARABIC LETTER ALEF}";

       And, as mentioned above, you can also "pack()" numbers into Unicode
       characters:

           my $georgian_an = pack("U", 0x10a0);

       Note that both "\x{...}" and "\N{...}" are compile-time string
       constants: you cannot use variables in them.  If you want similar
       run-time functionality, use "chr()" and
       "charnames::string_vianame()".

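       A run-time sketch using "charnames::string_vianame()", which is
       available since Perl 5.14:

```perl
use charnames ();   # load the module; no compile-time imports needed

my $name = "ARABIC LETTER ALEF";
my $char = charnames::string_vianame($name);  # like "\N{ARABIC LETTER ALEF}",
                                              # but the name can be a variable
printf "U+%04X\n", ord $char;                 # prints "U+0627"
```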
       If you want to force the result to Unicode characters, use the
       special "U0" prefix.  It consumes no arguments but causes the
       following bytes to be interpreted as the UTF-8 encoding of Unicode
       characters:

           my $chars = pack("U0W*", 0x80, 0x42);

       Likewise, you can stop such UTF-8 interpretation by using the
       special "C0" prefix.

   Handling Unicode
       Handling Unicode is for the most part transparent: just use the
       strings as usual.  Functions like "index()", "length()", and
       "substr()" will work on the Unicode characters; regular expressions
       will work on the Unicode characters (see perlunicode and perlretut).

       Note that Perl considers grapheme clusters to be separate
       characters, so for example

           print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"),
                 "\n";

       will print 2, not 1.  The only exception is that regular expressions
       have "\X" for matching an extended grapheme cluster.  (Thus "\X" in
       a regular expression would match the entire sequence of both the
       example characters.)

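       For example, "\X" can be used to count user-visible characters
       rather than code points (a small sketch):

```perl
# "A" + COMBINING ACUTE ACCENT + "b": three code points, two graphemes.
my $s = "A\x{301}b";

my $code_points = length $s;         # 3
my $graphemes   = () = $s =~ /\X/g;  # 2: the A-plus-accent cluster, then "b"
print "$code_points code points, $graphemes graphemes\n";
```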
       Life is not quite so transparent, however, when working with legacy
       encodings, I/O, and certain special cases:

   Legacy Encodings
       When you combine legacy data and Unicode, the legacy data needs to
       be upgraded to Unicode.  Normally the legacy data is assumed to be
       ISO 8859-1 (or EBCDIC, if applicable).

       The "Encode" module knows about many encodings and has interfaces
       for doing conversions between those encodings:

           use Encode 'decode';
           $data = decode("iso-8859-3", $data); # convert from legacy to utf-8

   Unicode I/O
       Normally, writing out Unicode data

           print FH $some_string_with_unicode, "\n";

       produces raw bytes that Perl happens to use to internally encode the
       Unicode string.  Perl's internal encoding depends on the system as
       well as what characters happen to be in the string at the time.  If
       any of the characters are at code points 0x100 or above, you will
       get a warning.  To ensure that the output is explicitly rendered in
       the encoding you desire--and to avoid the warning--open the stream
       with the desired encoding.  Some examples:

           open FH, ">:utf8", "file";

           open FH, ">:encoding(ucs2)",      "file";
           open FH, ">:encoding(UTF-8)",     "file";
           open FH, ">:encoding(shift_jis)", "file";

       and on already open streams, use "binmode()":

           binmode(STDOUT, ":utf8");

           binmode(STDOUT, ":encoding(ucs2)");
           binmode(STDOUT, ":encoding(UTF-8)");
           binmode(STDOUT, ":encoding(shift_jis)");

       The matching of encoding names is loose: case does not matter, and
       many encodings have several aliases.  Note that the ":utf8" layer
       must always be specified exactly like that; it is not subject to the
       loose matching of encoding names.  Also note that currently ":utf8"
       is unsafe for input, because it accepts the data without validating
       that it is indeed valid UTF-8; you should instead use
       ":encoding(utf-8)" (with or without a hyphen).

       See PerlIO for the ":utf8" layer, PerlIO::encoding and
       Encode::PerlIO for the ":encoding()" layer, and Encode::Supported
       for many encodings supported by the "Encode" module.

       Reading in a file that you know happens to be encoded in one of the
       Unicode or legacy encodings does not magically turn the data into
       Unicode in Perl's eyes.  To do that, specify the appropriate layer
       when opening files:

           open(my $fh, '<:encoding(utf8)', 'anything');
           my $line_of_unicode = <$fh>;

           open(my $fh, '<:encoding(Big5)', 'anything');
           my $line_of_unicode = <$fh>;

       The I/O layers can also be specified more flexibly with the "open"
       pragma.  See open, or look at the following example.

           use open ':encoding(utf8)'; # input/output default encoding will be
                                       # UTF-8
           open X, ">file";
           print X chr(0x100), "\n";
           close X;
           open Y, "<file";
           printf "%#x\n", ord(<Y>); # this should print 0x100
           close Y;

       With the "open" pragma you can use the ":locale" layer

           BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
           # the :locale will probe the locale environment variables like
           # LC_ALL
           use open OUT => ':locale'; # russki parusski
           open(O, ">koi8");
           print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
           close O;
           open(I, "<koi8");
           printf "%#x\n", ord(<I>); # this should print 0xc1
           close I;

       These methods install a transparent filter on the I/O stream that
       converts data from the specified encoding when it is read in from
       the stream.  The result is always Unicode.

       The open pragma affects all the "open()" calls after the pragma by
       setting default layers.  If you want to affect only certain streams,
       use explicit layers directly in the "open()" call.

       You can switch encodings on an already opened stream by using
       "binmode()"; see "binmode" in perlfunc.

       The ":locale" does not currently (as of Perl 5.8.0) work with
       "open()" and "binmode()", only with the "open" pragma.  The ":utf8"
       and ":encoding(...)" methods do work with all of "open()",
       "binmode()", and the "open" pragma.

       Similarly, you may use these I/O layers on output streams to
       automatically convert Unicode to the specified encoding when it is
       written to the stream.  For example, the following snippet copies
       the contents of the file "text.jis" (encoded as ISO-2022-JP, aka
       JIS) to the file "text.utf8", encoded as UTF-8:

           open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
           open(my $unicode, '>:utf8',                  'text.utf8');
           while (<$nihongo>) { print $unicode $_ }

       The naming of encodings, both by "open()" and by the "open" pragma,
       allows for flexible names: "koi8-r" and "KOI8R" will both be
       understood.

       Common encodings recognized by ISO, MIME, IANA, and various other
       standardisation organisations are recognised; for a more detailed
       list see Encode::Supported.

       "read()" reads characters and returns the number of characters.
       "seek()" and "tell()" operate on byte counts, as do "sysread()" and
       "sysseek()".

436
437 Notice that because of the default behaviour of not doing any
438 conversion upon input if there is no default layer, it is easy to
439 mistakenly write code that keeps on expanding a file by repeatedly
440 encoding the data:
441
442 # BAD CODE WARNING
443 open F, "file";
444 local $/; ## read in the whole file of 8-bit characters
445 $t = <F>;
446 close F;
447 open F, ">:encoding(utf8)", "file";
448 print F $t; ## convert to UTF-8 on output
449 close F;
450
451 If you run this code twice, the contents of the file will be twice
452 UTF-8 encoded. A "use open ':encoding(utf8)'" would have avoided the
453 bug, or explicitly opening also the file for input as UTF-8.
454
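       A corrected version reads and writes through explicit layers, so
       re-running it leaves the file's bytes unchanged (a sketch; the file
       name and sample content are just examples):

```perl
# Create a sample UTF-8 file, then demonstrate a safe re-encode cycle.
my $file = "sample.txt";                       # hypothetical file name
open my $fh, ">:encoding(UTF-8)", $file or die "write: $!";
print $fh "caf\x{E9}\x{100}\n";                # includes a char above 0xFF
close $fh;

# Decode on input and encode on output: running this any number of
# times round-trips characters instead of re-encoding raw bytes.
open my $in, "<:encoding(UTF-8)", $file or die "read: $!";
local $/;          ## read in the whole file as characters
my $t = <$in>;
close $in;

open my $out, ">:encoding(UTF-8)", $file or die "write: $!";
print $out $t;     ## characters are encoded back to the same bytes
close $out;
```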
       NOTE: the ":utf8" and ":encoding" features work only if your Perl
       has been built with the new PerlIO feature (which is the default on
       most systems).

   Displaying Unicode As Text
       Sometimes you might want to display Perl scalars containing Unicode
       as simple ASCII (or EBCDIC) text.  The following subroutine converts
       its argument so that Unicode characters with code points greater
       than 255 are displayed as "\x{...}", control characters (like "\n")
       are displayed as "\x..", and the rest of the characters as
       themselves:

           sub nice_string {
               join("",
                 map { $_ > 255 ?                  # if wide character...
                       sprintf("\\x{%04X}", $_) :  # \x{...}
                       chr($_) =~ /[[:cntrl:]]/ ?  # else if control character...
                       sprintf("\\x%02X", $_) :    # \x..
                       quotemeta(chr($_))          # else quoted or as themselves
                 } unpack("W*", $_[0]));           # unpack Unicode characters
           }

       For example,

           nice_string("foo\x{100}bar\n")

       returns the string

           'foo\x{0100}bar\x0A'

       which is ready to be printed.

486 Special Cases
487 · Bit Complement Operator ~ And vec()
488
489 The bit complement operator "~" may produce surprising results if
490 used on strings containing characters with ordinal values above
491 255. In such a case, the results are consistent with the internal
492 encoding of the characters, but not with much else. So don't do
493 that. Similarly for "vec()": you will be operating on the
494 internally-encoded bit patterns of the Unicode characters, not on
495 the code point values, which is very probably not what you want.
496
497 · Peeking At Perl's Internal Encoding
498
499 Normal users of Perl should never care how Perl encodes any
500 particular Unicode string (because the normal ways to get at the
501 contents of a string with Unicode--via input and output--should
502 always be via explicitly-defined I/O layers). But if you must,
503 there are two ways of looking behind the scenes.
504
505 One way of peeking inside the internal encoding of Unicode
506 characters is to use "unpack("C*", ..." to get the bytes of
507 whatever the string encoding happens to be, or "unpack("U0..",
508 ...)" to get the bytes of the UTF-8 encoding:
509
510 # this prints c4 80 for the UTF-8 bytes 0xc4 0x80
511 print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
512
513 Yet another way would be to use the Devel::Peek module:
514
515 perl -MDevel::Peek -e 'Dump(chr(0x100))'
516
517 That shows the "UTF8" flag in FLAGS and both the UTF-8 bytes and
518 Unicode characters in "PV". See also later in this document the
519 discussion about the "utf8::is_utf8()" function.
520
   Advanced Topics
       ·   String Equivalence

           The question of string equivalence turns somewhat complicated in
           Unicode: what do you mean by "equal"?

           (Is "LATIN CAPITAL LETTER A WITH ACUTE" equal to "LATIN CAPITAL
           LETTER A"?)

           The short answer is that by default Perl compares equivalence
           ("eq", "ne") based only on code points of the characters.  In
           the above case, the answer is no (because 0x00C1 != 0x0041).
           But sometimes, any CAPITAL LETTER A's should be considered
           equal, or even A's of any case.

           The long answer is that you need to consider character
           normalization and casing issues: see Unicode::Normalize, Unicode
           Technical Report #15, Unicode Normalization Forms
           <http://www.unicode.org/unicode/reports/tr15> and sections on
           case mapping in the Unicode Standard <http://www.unicode.org>.

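           For instance, strings that differ at the code-point level can be
           compared under a common normalization form with the bundled
           Unicode::Normalize module (a sketch):

```perl
use Unicode::Normalize qw(NFD);

my $precomposed = "\x{E9}";    # LATIN SMALL LETTER E WITH ACUTE
my $combining   = "e\x{301}";  # "e" + COMBINING ACUTE ACCENT

# "eq" compares code points, so these differ; after normalizing both
# to NFD they compare equal.
print $precomposed eq $combining           ? "eq\n" : "ne\n";  # ne
print NFD($precomposed) eq NFD($combining) ? "eq\n" : "ne\n";  # eq
```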
           As of Perl 5.8.0, the "Full" case-folding of Case
           Mappings/SpecialCasing is implemented, but bugs remain in
           "qr//i" with them, mostly fixed by 5.14.

       ·   String Collation

           People like to see their strings nicely sorted--or as Unicode
           parlance goes, collated.  But again, what do you mean by
           collate?

           (Does "LATIN CAPITAL LETTER A WITH ACUTE" come before or after
           "LATIN CAPITAL LETTER A WITH GRAVE"?)

           The short answer is that by default, Perl compares strings
           ("lt", "le", "cmp", "ge", "gt") based only on the code points of
           the characters.  In the above case, the answer is "after", since
           0x00C1 > 0x00C0.

           The long answer is that "it depends", and a good answer cannot
           be given without knowing (at the very least) the language
           context.  See Unicode::Collate, and Unicode Collation Algorithm
           <http://www.unicode.org/unicode/reports/tr10/>.

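           A minimal sketch with the bundled Unicode::Collate module, which
           implements the Unicode Collation Algorithm's default ordering:

```perl
use Unicode::Collate;

# With no arguments, new() uses the default (language-independent)
# collation table rather than code-point order.
my $collator = Unicode::Collate->new();
my @sorted   = $collator->sort("pear", "apple", "banana");
print "@sorted\n";   # prints "apple banana pear"
```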
   Miscellaneous
       ·   Character Ranges and Classes

           Character ranges in regular expression bracketed character
           classes (e.g., "/[a-z]/") and in the "tr///" (also known as
           "y///") operator are not magically Unicode-aware.  What this
           means is that "[A-Za-z]" will not magically start to mean "all
           alphabetic letters" (not that it does mean that even for 8-bit
           characters; for those, if you are using locales (perllocale),
           use "/[[:alpha:]]/"; and if not, use the 8-bit-aware property
           "\p{alpha}").

           All the properties that begin with "\p" (and its inverse "\P")
           are actually character classes that are Unicode-aware.  There
           are dozens of them, see perluniprops.

           You can use Unicode code points as the end points of character
           ranges, and the range will include all Unicode code points that
           lie between those end points.

       ·   String-To-Number Conversions

           Unicode does define several other decimal--and numeric--
           characters besides the familiar 0 to 9, such as the Arabic and
           Indic digits.  Perl does not support string-to-number conversion
           for digits other than ASCII 0 to 9 (and ASCII a to f for
           hexadecimal).  To get safe conversions from any Unicode string,
           use "num()" in Unicode::UCD.

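           A sketch of "num()", which is available from Unicode::UCD since
           Perl 5.14:

```perl
use Unicode::UCD qw(num);

print num("123"), "\n";                    # 123, from ASCII digits
print num("\x{661}\x{662}\x{663}"), "\n";  # 123, from ARABIC-INDIC digits

# A string mixing digits from different scripts is rejected:
my $mixed = num("1\x{661}");
print defined $mixed ? "number\n" : "undef\n";   # undef
```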
   Questions With Answers
       ·   Will My Old Scripts Break?

           Very probably not.  Unless you are generating Unicode characters
           somehow, old behaviour should be preserved.  About the only
           behaviour that has changed and which could start generating
           Unicode is the old behaviour of "chr()" where supplying an
           argument more than 255 produced a character modulo 255.
           "chr(300)", for example, was equal to "chr(45)" or "-" (in
           ASCII); now it is LATIN CAPITAL LETTER I WITH BREVE.

       ·   How Do I Make My Scripts Work With Unicode?

           Very little work should be needed since nothing changes until
           you generate Unicode data.  The most important thing is getting
           input as Unicode; for that, see the earlier I/O discussion.  To
           get full seamless Unicode support, add "use feature
           'unicode_strings'" (or "use 5.012" or higher) to your script.

       ·   How Do I Know Whether My String Is In Unicode?

           You shouldn't have to care.  But you may if your Perl is before
           5.14.0 or you haven't specified "use feature 'unicode_strings'"
           or "use 5.012" (or higher) because otherwise the semantics of
           the code points in the range 128 to 255 are different depending
           on whether the string they are contained within is in Unicode or
           not.  (See "When Unicode Does Not Happen" in perlunicode.)

           To determine if a string is in Unicode, use:

               print utf8::is_utf8($string) ? 1 : 0, "\n";

           But note that this doesn't mean that any of the characters in
           the string are necessarily UTF-8 encoded, or that any of the
           characters have code points greater than 0xFF (255) or even 0x80
           (128), or that the string has any characters at all.  All the
           "is_utf8()" does is to return the value of the internal
           "utf8ness" flag attached to the $string.  If the flag is off,
           the bytes in the scalar are interpreted as a single byte
           encoding.  If the flag is on, the bytes in the scalar are
           interpreted as the (variable-length, potentially multi-byte)
           UTF-8 encoded code points of the characters.  Bytes added to a
           UTF-8 encoded string are automatically upgraded to UTF-8.  If
           mixed non-UTF-8 and UTF-8 scalars are merged (double-quoted
           interpolation, explicit concatenation, or printf/sprintf
           parameter substitution), the result will be UTF-8 encoded as if
           copies of the byte strings were upgraded to UTF-8: for example,

               $a = "ab\x80c";
               $b = "\x{100}";
               print "$a = $b\n";

           the output string will be UTF-8-encoded "ab\x80c = \x{100}\n",
           but $a will stay byte-encoded.

           Sometimes you might really need to know the byte length of a
           string instead of the character length.  For that use either the
           "Encode::encode_utf8()" function or the "bytes" pragma and the
           "length()" function:

               my $unicode = chr(0x100);
               print length($unicode), "\n"; # will print 1
               require Encode;
               print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
               use bytes;
               print length($unicode), "\n"; # will also print 2
                                             # (the 0xC4 0x80 of the UTF-8)
               no bytes;

       ·   How Do I Find Out What Encoding a File Has?

           You might try Encode::Guess, but it has a number of limitations.

       ·   How Do I Detect Data That's Not Valid In a Particular Encoding?

           Use the "Encode" package to try converting it.  For example,

               use Encode 'decode_utf8';

               if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
                   # $string is valid utf8
               } else {
                   # $string is not valid utf8
               }

           Or use "unpack" to try decoding it:

               use warnings;
               @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);

           If invalid, a "Malformed UTF-8 character" warning is produced.
           The "C0" means "process the string character per character".
           Without that, the "unpack("U*", ...)" would work in "U0" mode
           (the default if the format string starts with "U") and it would
           return the bytes making up the UTF-8 encoding of the target
           string, something that will always work.

688 · How Do I Convert Binary Data Into a Particular Encoding, Or Vice
689 Versa?
690
691 This probably isn't as useful as you might think. Normally, you
692 shouldn't need to.
693
694 In one sense, what you are asking doesn't make much sense:
695 encodings are for characters, and binary data are not "characters",
696 so converting "data" into some encoding isn't meaningful unless you
697 know in what character set and encoding the binary data is in, in
698 which case it's not just binary data, now is it?
699
700 If you have a raw sequence of bytes that you know should be
701 interpreted via a particular encoding, you can use "Encode":
702
703 use Encode 'from_to';
704 from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8
705
706 The call to "from_to()" changes the bytes in $data, but nothing
707 material about the nature of the string has changed as far as Perl
708 is concerned. Both before and after the call, the string $data
709 contains just a bunch of 8-bit bytes. As far as Perl is concerned,
710 the encoding of the string remains as "system-native 8-bit bytes".
711
712 You might relate this to a fictional 'Translate' module:
713
714 use Translate;
715 my $phrase = "Yes";
716 Translate::from_to($phrase, 'english', 'deutsch');
717 ## phrase now contains "Ja"
718
719                The contents of the string change, but not the nature of the
720                string.  Perl doesn't know any more after the call than before
721                that the contents of the string indicate the affirmative.
722
723 Back to converting data. If you have (or want) data in your
724 system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you
725 can use pack/unpack to convert to/from Unicode.
726
727 $native_string = pack("W*", unpack("U*", $Unicode_string));
728 $Unicode_string = pack("U*", unpack("W*", $native_string));
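As a sketch of the two idioms above round-tripping (the byte 0xE9, Latin-1 "é", is an illustrative choice; "W*" requires Perl 5.10 or later):

```perl
use strict;
use warnings;

# 0xE9 is "é" in Latin-1 / ISO 8859-1.
my $native  = "\xE9";
my $unicode = pack("U*", unpack("W*", $native));  # now a character string
my $back    = pack("W*", unpack("U*", $unicode)); # native bytes again

print ord($unicode), "\n";                        # 233, i.e. 0xE9
print $back eq $native ? "round-trip ok\n" : "mismatch\n";
```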
729
730 If you have a sequence of bytes you know is valid UTF-8, but Perl
731 doesn't know it yet, you can make Perl a believer, too:
732
733 use Encode 'decode_utf8';
734 $Unicode = decode_utf8($bytes);
735
736 or:
737
738 $Unicode = pack("U0a*", $bytes);
739
740 You can find the bytes that make up a UTF-8 sequence with
741
742 @bytes = unpack("C*", $Unicode_string)
743
744 and you can create well-formed Unicode with
745
746 $Unicode_string = pack("U*", 0xff, ...)
747
748 · How Do I Display Unicode? How Do I Input Unicode?
749
750 See <http://www.alanwood.net/unicode/> and
751 <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
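For output, a minimal sketch (assuming the terminal understands UTF-8) is to declare the output encoding explicitly, which also avoids the "Wide character in print" warning:

```perl
use strict;
use warnings;

# Assumes a UTF-8 capable terminal; declare the encoding so Perl
# encodes characters on output instead of warning about them.
binmode STDOUT, ':encoding(UTF-8)';
print "\x{0100}\n";   # LATIN CAPITAL LETTER A WITH MACRON
```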
752
753 · How Does Unicode Work With Traditional Locales?
754
755 Starting in Perl 5.16, you can specify
756
757 use locale ':not_characters';
758
759                to get Perl to work well with traditional locales.  The catch is
760 that you have to translate from the locale character set to/from
761 Unicode yourself. See "Unicode I/O" above for how to
762
763 use open ':locale';
764
765 to accomplish this, but full details are in "Unicode and UTF-8" in
766                perllocale, including gotchas that happen if you don't specify
767 ":not_characters".
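Put together, a sketch of this 5.16+ combination looks like the following; whether it does what you want depends entirely on the locale in effect at run time:

```perl
use v5.16;
use locale ':not_characters';  # locale rules, except for character semantics
use open   ':locale';          # handles translate to/from the locale's charset

# Strings are Unicode characters internally; I/O through handles
# opened in this scope is converted via the locale's character set.
```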
768
769 Hexadecimal Notation
770 The Unicode standard prefers using hexadecimal notation because that
771 more clearly shows the division of Unicode into blocks of 256
772 characters. Hexadecimal is also simply shorter than decimal. You can
773 use decimal notation, too, but learning to use hexadecimal just makes
774 life easier with the Unicode standard. The "U+HHHH" notation uses
775 hexadecimal, for example.
776
777 The "0x" prefix means a hexadecimal number, the digits are 0-9 and a-f
778 (or A-F, case doesn't matter). Each hexadecimal digit represents four
779 bits, or half a byte. "print 0x..., "\n"" will show a hexadecimal
780 number in decimal, and "printf "%x\n", $decimal" will show a decimal
781                number in hexadecimal.  If you have a string of "hex digits", the
782                "hex()" function converts it to a number.
783
784 print 0x0009, "\n"; # 9
785 print 0x000a, "\n"; # 10
786 print 0x000f, "\n"; # 15
787 print 0x0010, "\n"; # 16
788 print 0x0011, "\n"; # 17
789 print 0x0100, "\n"; # 256
790
791 print 0x0041, "\n"; # 65
792
793 printf "%x\n", 65; # 41
794 printf "%#x\n", 65; # 0x41
795
796 print hex("41"), "\n"; # 65
797
798 Further Resources
799 · Unicode Consortium
800
801 <http://www.unicode.org/>
802
803 · Unicode FAQ
804
805 <http://www.unicode.org/unicode/faq/>
806
807 · Unicode Glossary
808
809 <http://www.unicode.org/glossary/>
810
811 · Unicode Recommended Reading List
812
813 The Unicode Consortium has a list of articles and books, some of
814 which give a much more in depth treatment of Unicode:
815 <http://unicode.org/resources/readinglist.html>
816
817 · Unicode Useful Resources
818
819 <http://www.unicode.org/unicode/onlinedat/resources.html>
820
821 · Unicode and Multilingual Support in HTML, Fonts, Web Browsers and
822 Other Applications
823
824 <http://www.alanwood.net/unicode/>
825
826 · UTF-8 and Unicode FAQ for Unix/Linux
827
828 <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
829
830 · Legacy Character Sets
831
832 <http://www.czyborra.com/> <http://www.eki.ee/letter/>
833
834 · You can explore various information from the Unicode data files
835 using the "Unicode::UCD" module.
836
838 If you cannot upgrade your Perl to 5.8.0 or later, you can still do
839 some Unicode processing by using the modules "Unicode::String",
840 "Unicode::Map8", and "Unicode::Map", available from CPAN. If you have
841 the GNU recode installed, you can also use the Perl front-end
842 "Convert::Recode" for character conversions.
843
844        The following are fast conversions from ISO 8859-1 (Latin-1) bytes to
845        UTF-8 bytes and back; the code works even with older Perl 5 versions.
846
847 # ISO 8859-1 to UTF-8
848 s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
849
850 # UTF-8 to ISO 8859-1
851 s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
852
853SEE ALSO
854 perlunitut, perlunicode, Encode, open, utf8, bytes, perlretut, perlrun,
855 Unicode::Collate, Unicode::Normalize, Unicode::UCD
856
857ACKNOWLEDGMENTS
858 Thanks to the kind readers of the perl5-porters@perl.org,
859 perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
860 mailing lists for their valuable feedback.
861
862AUTHOR, COPYRIGHT, AND LICENSE
863 Copyright 2001-2011 Jarkko Hietaniemi <jhi@iki.fi>
864
865 This document may be distributed under the same terms as Perl itself.
866
867
868
869perl v5.16.3 2013-03-04 PERLUNIINTRO(1)