PERLUNIINTRO(1) Perl Programmers Reference Guide PERLUNIINTRO(1)

NAME
perluniintro - Perl Unicode introduction

DESCRIPTION
This document gives a general idea of Unicode and how to use Unicode in
Perl.
11
12 Unicode
13 Unicode is a character set standard which plans to codify all of the
14 writing systems of the world, plus many other symbols.
15
16 Unicode and ISO/IEC 10646 are coordinated standards that provide code
17 points for characters in almost all modern character set standards,
18 covering more than 30 writing systems and hundreds of languages,
19 including all commercially-important modern languages. All characters
20 in the largest Chinese, Japanese, and Korean dictionaries are also
21 encoded. The standards will eventually cover almost all characters in
22 more than 250 writing systems and thousands of languages. Unicode 1.0
23 was released in October 1991, and 4.0 in April 2003.
24
25 A Unicode character is an abstract entity. It is not bound to any
26 particular integer width, especially not to the C language "char".
27 Unicode is language-neutral and display-neutral: it does not encode the
28 language of the text, and it does not generally define fonts or other
29 graphical layout details. Unicode operates on characters and on text
30 built from those characters.
31
32 Unicode defines characters like "LATIN CAPITAL LETTER A" or "GREEK
33 SMALL LETTER ALPHA" and unique numbers for the characters, in this case
34 0x0041 and 0x03B1, respectively. These unique numbers are called code
35 points.
36
37 The Unicode standard prefers using hexadecimal notation for the code
38 points. If numbers like 0x0041 are unfamiliar to you, take a peek at a
39 later section, "Hexadecimal Notation". The Unicode standard uses the
40 notation "U+0041 LATIN CAPITAL LETTER A", to give the hexadecimal code
41 point and the normative name of the character.
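
As a small illustration of this notation, the following sketch formats the
code point of a character in the "U+HHHH" style. The helper name
"code_point_label" is hypothetical and only serves this example.

```perl
use strict;
use warnings;

# Hypothetical helper: format a character's code point in U+HHHH notation.
sub code_point_label {
    my ($char) = @_;
    return sprintf("U+%04X", ord($char));
}

my $label_a     = code_point_label("A");          # LATIN CAPITAL LETTER A
my $label_alpha = code_point_label("\x{03B1}");   # GREEK SMALL LETTER ALPHA

print "$label_a $label_alpha\n";   # prints "U+0041 U+03B1"
```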
42
43 Unicode also defines various properties for the characters, like
44 "uppercase" or "lowercase", "decimal digit", or "punctuation"; these
45 properties are independent of the names of the characters.
46 Furthermore, various operations on the characters like uppercasing,
47 lowercasing, and collating (sorting) are defined.
48
49 A Unicode logical "character" can actually consist of more than one
50 internal actual "character" or code point. For Western languages, this
51 is adequately modelled by a base character (like "LATIN CAPITAL LETTER
52 A") followed by one or more modifiers (like "COMBINING ACUTE ACCENT").
53 This sequence of base character and modifiers is called a combining
54 character sequence. Some non-western languages require more
55 complicated models, so Unicode created the grapheme cluster concept,
56 and then the extended grapheme cluster. For example, a Korean Hangul
57 syllable is considered a single logical character, but most often
58 consists of three actual Unicode characters: a leading consonant
59 followed by an interior vowel followed by a trailing consonant.
60
61 Whether to call these extended grapheme clusters "characters" depends
62 on your point of view. If you are a programmer, you probably would tend
63 towards seeing each element in the sequences as one unit, or
64 "character". The whole sequence could be seen as one "character",
65 however, from the user's point of view, since that's probably what it
66 looks like in the context of the user's language.
67
68 With this "whole sequence" view of characters, the total number of
69 characters is open-ended. But in the programmer's "one unit is one
70 character" point of view, the concept of "characters" is more
71 deterministic. In this document, we take that second point of view:
72 one "character" is one Unicode code point.
73
74 For some combinations, there are precomposed characters. "LATIN
75 CAPITAL LETTER A WITH ACUTE", for example, is defined as a single code
76 point. These precomposed characters are, however, only available for
77 some combinations, and are mainly meant to support round-trip
78 conversions between Unicode and legacy standards (like the ISO 8859).
79 In the general case, the composing method is more extensible. To
80 support conversion between different compositions of the characters,
81 various normalization forms to standardize representations are also
82 defined.
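
The relationship between precomposed characters, combining sequences, and
normalization forms can be sketched with the core "Unicode::Normalize"
module. This is only an illustration; the variable names are made up for the
example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{00C1}";    # LATIN CAPITAL LETTER A WITH ACUTE
my $combining   = "A\x{0301}";   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# The two spellings use different code points, so a plain comparison fails.
my $raw_equal = ($precomposed eq $combining) ? 1 : 0;

# After normalizing both to the same form they compare equal.
my $nfc_equal = (NFC($precomposed) eq NFC($combining)) ? 1 : 0;

my $nfd_length = length(NFD($precomposed));   # decomposed: two code points
my $nfc_length = length(NFC($combining));     # composed: one code point

print "$raw_equal $nfc_equal $nfd_length $nfc_length\n";   # prints "0 1 2 1"
```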
83
Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character". The same character could be
represented differently in several legacy encodings. The converse does
not hold either: some code points do not have an assigned character.
Firstly, there are unallocated code points within otherwise used
blocks. Secondly, there are special Unicode control characters that do
not represent true characters.
92
93 A common myth about Unicode is that it is "16-bit", that is, Unicode is
94 only represented as 0x10000 (or 65536) characters from 0x0000 to
95 0xFFFF. This is untrue. Since Unicode 2.0 (July 1996), Unicode has
96 been defined all the way up to 21 bits (0x10FFFF), and since Unicode
3.1 (March 2001), characters have been defined beyond 0xFFFF. The
first 0x10000 characters are called Plane 0, or the Basic
Multilingual Plane (BMP). With Unicode 3.1, 17 (yes, seventeen) planes
in all were defined--but they are nowhere near full of defined
characters yet.
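
Since each plane holds 0x10000 code points, the plane number of a code point
is simply its value shifted right by 16 bits. The following sketch shows
this; "plane_of" is a hypothetical helper, not a Perl built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: return the plane (0..16) that a code point belongs to.
sub plane_of {
    my ($code_point) = @_;
    die "Code point out of range\n"
        if $code_point < 0 || $code_point > 0x10FFFF;
    return $code_point >> 16;    # each plane holds 0x10000 code points
}

my @planes = map { plane_of($_) } (0x0041, 0xFFFF, 0x10000, 0x10FFFF);

print "@planes\n";   # prints "0 0 1 16"
```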
102
103 Another myth is about Unicode blocks--that they have something to do
104 with languages--that each block would define the characters used by a
105 language or a set of languages. This is also untrue. The division
106 into blocks exists, but it is almost completely accidental--an artifact
107 of how the characters have been and still are allocated. Instead,
108 there is a concept called scripts, which is more useful: there is
109 "Latin" script, "Greek" script, and so on. Scripts usually span varied
110 parts of several blocks. For more information about scripts, see
111 "Scripts" in perlunicode.
112
113 The Unicode code points are just abstract numbers. To input and output
114 these abstract numbers, the numbers must be encoded or serialised
somehow. Unicode defines several character encoding forms, of which
UTF-8 is perhaps the most popular. UTF-8 is a variable-length encoding
that encodes Unicode characters as 1 to 4 bytes. Other encodings
include UTF-16 and UTF-32 and their big- and little-endian variants
(UTF-8 is byte-order independent). ISO/IEC 10646 defines the UCS-2
and UCS-4 encoding forms.
121
122 For more information about encodings--for instance, to learn what
123 surrogates and byte order marks (BOMs) are--see perlunicode.
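
To make the difference between a code point and its serialised bytes
concrete, here is a minimal sketch using the "Encode" module. The helper
"hex_bytes" is hypothetical; the encoding names are ones "Encode" is known to
support. Note how the code point beyond 0xFFFF becomes a surrogate pair in
UTF-16.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a string and show the resulting bytes in hex.
sub hex_bytes {
    my ($encoding, $string) = @_;
    return unpack("H*", encode($encoding, $string));
}

my $smiley = "\x{263A}";    # a code point inside the BMP
my $beyond = "\x{10400}";   # a code point beyond 0xFFFF

my $utf8_smiley    = hex_bytes("UTF-8",    $smiley);   # e298ba
my $utf16be_smiley = hex_bytes("UTF-16BE", $smiley);   # 263a
my $utf16le_smiley = hex_bytes("UTF-16LE", $smiley);   # 3a26
my $utf32be_smiley = hex_bytes("UTF-32BE", $smiley);   # 0000263a
my $utf8_beyond    = hex_bytes("UTF-8",    $beyond);   # f0909080
my $utf16be_beyond = hex_bytes("UTF-16BE", $beyond);   # d801dc00 (surrogates)

print "$utf8_smiley $utf16be_smiley $utf16le_smiley $utf32be_smiley\n";
print "$utf8_beyond $utf16be_beyond\n";
```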
124
125 Perl's Unicode Support
126 Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
127 natively. Perl 5.8.0, however, is the first recommended release for
128 serious Unicode work. The maintenance release 5.6.1 fixed many of the
129 problems of the initial Unicode implementation, but for example regular
130 expressions still do not work with Unicode in 5.6.1.
131
132 Starting from Perl 5.8.0, the use of "use utf8" is needed only in much
133 more restricted circumstances. In earlier releases the "utf8" pragma
134 was used to declare that operations in the current block or file would
135 be Unicode-aware. This model was found to be wrong, or at least
136 clumsy: the "Unicodeness" is now carried with the data, instead of
137 being attached to the operations. Only one case remains where an
138 explicit "use utf8" is needed: if your Perl script itself is encoded in
139 UTF-8, you can use UTF-8 in your identifier names, and in string and
140 regular expression literals, by saying "use utf8". This is not the
141 default because scripts with legacy 8-bit data in them would break.
142 See utf8.
143
144 Perl's Unicode Model
145 Perl supports both pre-5.6 strings of eight-bit native bytes, and
146 strings of Unicode characters. The principle is that Perl tries to
147 keep its data as eight-bit bytes for as long as possible, but as soon
148 as Unicodeness cannot be avoided, the data is (mostly) transparently
149 upgraded to Unicode. There are some problems--see "The "Unicode Bug""
150 in perlunicode.
151
Internally, Perl currently uses either the native eight-bit character
set of the platform (for example Latin-1) or UTF-8 to encode Unicode
strings. Specifically, if all code points in the string are 0xFF or
less, Perl uses the native eight-bit character set. Otherwise, it
uses UTF-8.
157
158 A user of Perl does not normally need to know nor care how Perl happens
159 to encode its internal strings, but it becomes relevant when outputting
160 Unicode strings to a stream without a PerlIO layer (one with the
161 "default" encoding). In such a case, the raw bytes used internally
162 (the native character set or UTF-8, as appropriate for each string)
163 will be used, and a "Wide character" warning will be issued if those
164 strings contain a character beyond 0x00FF.
165
166 For example,
167
168 perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
169
170 produces a fairly useless mixture of native bytes and UTF-8, as well as
171 a warning:
172
173 Wide character in print at ...
174
175 To output UTF-8, use the ":encoding" or ":utf8" output layer.
176 Prepending
177
178 binmode(STDOUT, ":utf8");
179
180 to this sample program ensures that the output is completely UTF-8, and
181 removes the program's warning.
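
The effect of an output layer can be observed without touching STDOUT by
writing to an in-memory file. The following is only a sketch: it uses the
":encoding(UTF-8)" layer on a scalar-backed handle and then inspects the bytes
that were produced.

```perl
use strict;
use warnings;

my $text = "\x{DF}\x{0100}\n";

# Write through an explicit UTF-8 layer into an in-memory file.
open(my $out, '>:encoding(UTF-8)', \my $buffer)
    or die "Cannot open in-memory file: $!";
print {$out} $text;
close($out) or die "Cannot close in-memory file: $!";

# Every character was encoded as UTF-8 and no warning was issued.
my $hex = unpack("H*", $buffer);

print "$hex\n";   # prints "c39fc4800a"
```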
182
You can enable automatic UTF-8-ification of your standard file handles,
default "open()" layer, and @ARGV by using either the "-C" command line
switch or the "PERL_UNICODE" environment variable; see perlrun for the
documentation of the "-C" switch.
187
188 Note that this means that Perl expects other software to work, too: if
189 Perl has been led to believe that STDIN should be UTF-8, but then STDIN
190 coming in from another command is not UTF-8, Perl will complain about
191 the malformed UTF-8.
192
193 All features that combine Unicode and I/O also require using the new
194 PerlIO feature. Almost all Perl 5.8 platforms do use PerlIO, though:
195 you can see whether yours is by running "perl -V" and looking for
196 "useperlio=define".
197
198 Unicode and EBCDIC
199 Perl 5.8.0 also supports Unicode on EBCDIC platforms. There, Unicode
200 support is somewhat more complex to implement since additional
201 conversions are needed at every step.
202
203 Later Perl releases have added code that will not work on EBCDIC
204 platforms, and no one has complained, so the divergence has continued.
If you want to run Perl on an EBCDIC platform, send email to
perlbug@perl.org.

On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
instead of UTF-8. The difference is that UTF-8 is "ASCII-safe", in
that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
"EBCDIC-safe".
212
213 Creating Unicode
214 To create Unicode characters in literals for code points above 0xFF,
215 use the "\x{...}" notation in double-quoted strings:
216
217 my $smiley = "\x{263a}";
218
Similarly, it can be used in regular expression literals:
220
221 $smiley =~ /\x{263a}/;
222
223 At run-time you can use "chr()":
224
225 my $hebrew_alef = chr(0x05d0);
226
227 See "Further Resources" for how to find all these numeric codes.
228
229 Naturally, "ord()" will do the reverse: it turns a character into a
230 code point.
231
232 Note that "\x.." (no "{}" and only two hexadecimal digits), "\x{...}",
233 and "chr(...)" for arguments less than 0x100 (decimal 256) generate an
234 eight-bit character for backward compatibility with older Perls. For
235 arguments of 0x100 or more, Unicode characters are always produced. If
236 you want to force the production of Unicode characters regardless of
237 the numeric value, use "pack("U", ...)" instead of "\x..", "\x{...}",
238 or "chr()".
239
240 You can also use the "charnames" pragma to invoke characters by name in
241 double-quoted strings:
242
243 use charnames ':full';
244 my $arabic_alef = "\N{ARABIC LETTER ALEF}";
245
246 And, as mentioned above, you can also "pack()" numbers into Unicode
247 characters:
248
249 my $georgian_an = pack("U", 0x10a0);
250
Note that both "\x{...}" and "\N{...}" are compile-time string
constants: you cannot use variables in them. If you want similar
run-time functionality, use "chr()" and "charnames::vianame()".
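
A minimal sketch of the run-time route: the character name is held in a
variable, looked up with "charnames::vianame()", and turned into a character
with "chr()". The variable names are chosen for this example only.

```perl
use strict;
use warnings;
use charnames ':full';

my $name = "ARABIC LETTER ALEF";               # could come from user input

my $code_point = charnames::vianame($name);    # name -> code point
my $char       = chr($code_point);             # code point -> character

# The run-time result matches the compile-time \N{...} constant.
my $same  = ($char eq "\N{ARABIC LETTER ALEF}") ? 1 : 0;
my $label = sprintf("U+%04X", $code_point);

print "$label $same\n";   # prints "U+0627 1"
```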
254
255 If you want to force the result to Unicode characters, use the special
256 "U0" prefix. It consumes no arguments but causes the following bytes
257 to be interpreted as the UTF-8 encoding of Unicode characters:
258
259 my $chars = pack("U0W*", 0x80, 0x42);
260
261 Likewise, you can stop such UTF-8 interpretation by using the special
262 "C0" prefix.
263
264 Handling Unicode
265 Handling Unicode is for the most part transparent: just use the strings
266 as usual. Functions like "index()", "length()", and "substr()" will
267 work on the Unicode characters; regular expressions will work on the
268 Unicode characters (see perlunicode and perlretut).
269
Note that Perl considers each code point in a grapheme cluster to be a
separate character, so for example
272
273 use charnames ':full';
274 print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";
275
276 will print 2, not 1. The only exception is that regular expressions
277 have "\X" for matching an extended grapheme cluster.
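
The difference between counting code points and counting extended grapheme
clusters can be sketched as follows; the list-assignment idiom is just one
way to count "\X" matches.

```perl
use strict;
use warnings;

my $string = "A\x{0301}bc";   # A + COMBINING ACUTE ACCENT, then "bc"

my $code_points = length($string);          # counts code points
my $clusters    = () = $string =~ /\X/g;    # counts extended grapheme clusters

print "$code_points $clusters\n";   # prints "4 3"
```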
278
279 Life is not quite so transparent, however, when working with legacy
280 encodings, I/O, and certain special cases:
281
282 Legacy Encodings
When you combine legacy data and Unicode, the legacy data needs to be
upgraded to Unicode. Normally ISO 8859-1 (or EBCDIC, if applicable) is
assumed.
286
287 The "Encode" module knows about many encodings and has interfaces for
288 doing conversions between those encodings:
289
    use Encode 'decode';
    $data = decode("iso-8859-3", $data); # decode legacy bytes into Unicode characters
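
As a hedged illustration of a full round trip, the sketch below decodes a
KOI8-R byte into a Perl character, re-encodes it as UTF-8, and converts it
back. The byte value 0xC1 for CYRILLIC SMALL LETTER A is the one used in the
":locale" example later in this document.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $koi8_bytes = "\xC1";    # KOI8-R byte for CYRILLIC SMALL LETTER A

# Legacy bytes -> Perl characters.
my $chars      = decode("koi8-r", $koi8_bytes);
my $code_point = sprintf("%#x", ord($chars));

# Perl characters -> bytes in another encoding, and back to the original.
my $utf8_hex   = unpack("H*", encode("UTF-8", $chars));
my $round_trip = encode("koi8-r", $chars);

print "$code_point $utf8_hex\n";   # prints "0x430 d0b0"
```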
292
293 Unicode I/O
294 Normally, writing out Unicode data
295
296 print FH $some_string_with_unicode, "\n";
297
298 produces raw bytes that Perl happens to use to internally encode the
299 Unicode string. Perl's internal encoding depends on the system as well
300 as what characters happen to be in the string at the time. If any of
301 the characters are at code points 0x100 or above, you will get a
302 warning. To ensure that the output is explicitly rendered in the
303 encoding you desire--and to avoid the warning--open the stream with the
304 desired encoding. Some examples:
305
306 open FH, ">:utf8", "file";
307
308 open FH, ">:encoding(ucs2)", "file";
309 open FH, ">:encoding(UTF-8)", "file";
310 open FH, ">:encoding(shift_jis)", "file";
311
312 and on already open streams, use "binmode()":
313
314 binmode(STDOUT, ":utf8");
315
316 binmode(STDOUT, ":encoding(ucs2)");
317 binmode(STDOUT, ":encoding(UTF-8)");
318 binmode(STDOUT, ":encoding(shift_jis)");
319
320 The matching of encoding names is loose: case does not matter, and many
321 encodings have several aliases. Note that the ":utf8" layer must
322 always be specified exactly like that; it is not subject to the loose
matching of encoding names. Also note that ":utf8" is unsafe for input,
because it accepts the data without validating that it is indeed valid
UTF-8.
326
327 See PerlIO for the ":utf8" layer, PerlIO::encoding and Encode::PerlIO
328 for the ":encoding()" layer, and Encode::Supported for many encodings
329 supported by the "Encode" module.
330
331 Reading in a file that you know happens to be encoded in one of the
332 Unicode or legacy encodings does not magically turn the data into
Unicode in Perl's eyes. To do that, specify the appropriate layer when
opening files:
335
336 open(my $fh,'<:encoding(utf8)', 'anything');
337 my $line_of_unicode = <$fh>;
338
339 open(my $fh,'<:encoding(Big5)', 'anything');
340 my $line_of_unicode = <$fh>;
341
342 The I/O layers can also be specified more flexibly with the "open"
343 pragma. See open, or look at the following example.
344
345 use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
346 open X, ">file";
347 print X chr(0x100), "\n";
348 close X;
349 open Y, "<file";
350 printf "%#x\n", ord(<Y>); # this should print 0x100
351 close Y;
352
With the "open" pragma you can use the ":locale" layer:

    BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
    # the :locale will probe the locale environment variables like LC_ALL
    use open OUT => ':locale'; # russki parusski
    open(O, ">koi8");
    print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
    close O;
    open(I, "<koi8");
    printf "%#x\n", ord(<I>); # this should print 0xc1
    close I;
364
365 These methods install a transparent filter on the I/O stream that
366 converts data from the specified encoding when it is read in from the
367 stream. The result is always Unicode.
368
369 The open pragma affects all the "open()" calls after the pragma by
370 setting default layers. If you want to affect only certain streams,
371 use explicit layers directly in the "open()" call.
372
373 You can switch encodings on an already opened stream by using
374 "binmode()"; see "binmode" in perlfunc.
375
376 The ":locale" does not currently (as of Perl 5.8.0) work with "open()"
377 and "binmode()", only with the "open" pragma. The ":utf8" and
378 ":encoding(...)" methods do work with all of "open()", "binmode()", and
379 the "open" pragma.
380
381 Similarly, you may use these I/O layers on output streams to
382 automatically convert Unicode to the specified encoding when it is
383 written to the stream. For example, the following snippet copies the
384 contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
385 the file "text.utf8", encoded as UTF-8:
386
387 open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
388 open(my $unicode, '>:utf8', 'text.utf8');
389 while (<$nihongo>) { print $unicode $_ }
390
The naming of encodings, both by "open()" and by the "open" pragma,
allows for flexible names: "koi8-r" and "KOI8R" will both be
understood.
394
395 Common encodings recognized by ISO, MIME, IANA, and various other
396 standardisation organisations are recognised; for a more detailed list
397 see Encode::Supported.
398
399 "read()" reads characters and returns the number of characters.
400 "seek()" and "tell()" operate on byte counts, as do "sysread()" and
401 "sysseek()".
402
403 Notice that because of the default behaviour of not doing any
404 conversion upon input if there is no default layer, it is easy to
405 mistakenly write code that keeps on expanding a file by repeatedly
406 encoding the data:
407
408 # BAD CODE WARNING
409 open F, "file";
410 local $/; ## read in the whole file of 8-bit characters
411 $t = <F>;
412 close F;
413 open F, ">:encoding(utf8)", "file";
414 print F $t; ## convert to UTF-8 on output
415 close F;
416
If you run this code twice, the contents of the file will be twice
UTF-8 encoded. A "use open ':encoding(utf8)'" would have avoided the
bug, as would explicitly opening the file for input as UTF-8 as well.
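
One way to avoid the double encoding is to decode on input with the same
layer that is used on output. The following sketch assumes the file really
is UTF-8; "rewrite_utf8_file" is a hypothetical helper, and a temporary file
from "File::Temp" stands in for real data.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a UTF-8 file and write it back unchanged.
# Decoding on input is what prevents the repeated re-encoding.
sub rewrite_utf8_file {
    my ($path) = @_;
    open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
    my $text = do { local $/; <$in> };
    close($in);
    open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
    print {$out} $text;
    close($out) or die "Cannot close $path: $!";
    return $text;
}

# Create a small UTF-8 file to experiment on.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':encoding(UTF-8)');
print {$tmp_fh} "caf\x{E9} \x{263A}\n";
close($tmp_fh) or die "Cannot close $tmp_path: $!";

my $size_before = -s $tmp_path;
my $last_text;
$last_text = rewrite_utf8_file($tmp_path) for 1 .. 2;   # run it twice
my $size_after = -s $tmp_path;

print "file size unchanged\n" if $size_before == $size_after;
```

Because the data is decoded to characters before it is encoded again, the
file keeps the same size no matter how often the helper runs.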
420
421 NOTE: the ":utf8" and ":encoding" features work only if your Perl has
422 been built with the new PerlIO feature (which is the default on most
423 systems).
424
425 Displaying Unicode As Text
426 Sometimes you might want to display Perl scalars containing Unicode as
427 simple ASCII (or EBCDIC) text. The following subroutine converts its
428 argument so that Unicode characters with code points greater than 255
429 are displayed as "\x{...}", control characters (like "\n") are
430 displayed as "\x..", and the rest of the characters as themselves:
431
    sub nice_string {
        join("",
          map { $_ > 255 ?                  # if wide character...
                sprintf("\\x{%04X}", $_) :  # \x{...}
                chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
                sprintf("\\x%02X", $_) :    # \x..
                quotemeta(chr($_))          # else quoted or as themselves
          } unpack("W*", $_[0]));           # unpack Unicode characters
    }
441
442 For example,
443
444 nice_string("foo\x{100}bar\n")
445
446 returns the string
447
448 'foo\x{0100}bar\x0A'
449
450 which is ready to be printed.
451
452 Special Cases
453 · Bit Complement Operator ~ And vec()
454
455 The bit complement operator "~" may produce surprising results if
456 used on strings containing characters with ordinal values above
457 255. In such a case, the results are consistent with the internal
458 encoding of the characters, but not with much else. So don't do
459 that. Similarly for "vec()": you will be operating on the
460 internally-encoded bit patterns of the Unicode characters, not on
461 the code point values, which is very probably not what you want.
462
463 · Peeking At Perl's Internal Encoding
464
465 Normal users of Perl should never care how Perl encodes any
466 particular Unicode string (because the normal ways to get at the
467 contents of a string with Unicode--via input and output--should
468 always be via explicitly-defined I/O layers). But if you must,
469 there are two ways of looking behind the scenes.
470
One way of peeking inside the internal encoding of Unicode
characters is to use "unpack("C*", ...)" to get the bytes of
whatever the string encoding happens to be, or "unpack("U0..",
...)" to get the bytes of the UTF-8 encoding:
475
476 # this prints c4 80 for the UTF-8 bytes 0xc4 0x80
477 print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
478
479 Yet another way would be to use the Devel::Peek module:
480
481 perl -MDevel::Peek -e 'Dump(chr(0x100))'
482
483 That shows the "UTF8" flag in FLAGS and both the UTF-8 bytes and
484 Unicode characters in "PV". See also later in this document the
485 discussion about the "utf8::is_utf8()" function.
486
487 Advanced Topics
488 · String Equivalence
489
490 The question of string equivalence turns somewhat complicated in
491 Unicode: what do you mean by "equal"?
492
493 (Is "LATIN CAPITAL LETTER A WITH ACUTE" equal to "LATIN CAPITAL
494 LETTER A"?)
495
496 The short answer is that by default Perl compares equivalence
497 ("eq", "ne") based only on code points of the characters. In the
498 above case, the answer is no (because 0x00C1 != 0x0041). But
499 sometimes, any CAPITAL LETTER As should be considered equal, or
500 even As of any case.
501
502 The long answer is that you need to consider character
503 normalization and casing issues: see Unicode::Normalize, Unicode
504 Technical Report #15, Unicode Normalization Forms
505 <http://www.unicode.org/unicode/reports/tr15> and sections on case
506 mapping in the Unicode Standard <http://www.unicode.org>.
507
508 As of Perl 5.8.0, the "Full" case-folding of Case
509 Mappings/SpecialCasing is implemented, but bugs remain in "qr//i"
510 with them.
511
512 · String Collation
513
514 People like to see their strings nicely sorted--or as Unicode
515 parlance goes, collated. But again, what do you mean by collate?
516
517 (Does "LATIN CAPITAL LETTER A WITH ACUTE" come before or after
518 "LATIN CAPITAL LETTER A WITH GRAVE"?)
519
520 The short answer is that by default, Perl compares strings ("lt",
521 "le", "cmp", "ge", "gt") based only on the code points of the
522 characters. In the above case, the answer is "after", since 0x00C1
523 > 0x00C0.
524
525 The long answer is that "it depends", and a good answer cannot be
526 given without knowing (at the very least) the language context.
527 See Unicode::Collate, and Unicode Collation Algorithm
528 <http://www.unicode.org/unicode/reports/tr10/>
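
Both advanced topics above can be sketched with core modules. The first half
compares strings after decomposing them, dropping combining marks, and
lowercasing; "loosely_equal" is a hypothetical helper and only one of many
possible definitions of "equal". The second half contrasts code point order
with the default ordering of "Unicode::Collate", without any language-specific
tailoring.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);
use Unicode::Collate;

# Hypothetical helper: compare two strings ignoring accents and case.
sub loosely_equal {
    my ($left, $right) = @_;
    for ($left, $right) {
        $_ = NFD($_);      # split precomposed characters apart
        s/\p{Mn}+//g;      # drop the combining (nonspacing) marks
        $_ = lc;           # ignore case
    }
    return $left eq $right ? 1 : 0;
}

my $strict = ("\x{00C1}" eq "A") ? 1 : 0;       # code points differ
my $loose  = loosely_equal("\x{00C1}", "a");    # same base letter

# "\x{10D}aj" starts with LATIN SMALL LETTER C WITH CARON.
my @words = ("zoo", "\x{10D}aj", "apple", "Zebra", "dog");

my @by_code_point = sort @words;                             # plain "cmp"
my @by_collation  = Unicode::Collate->new->sort(@words);     # default UCA order

print "$strict $loose\n";   # prints "0 1"
```

With code point order the capitalised and accented words end up at the
edges of the list; the collator places them next to their base letters.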
529
530 Miscellaneous
531 · Character Ranges and Classes
532
533 Character ranges in regular expression bracketed character classes
(e.g., "/[a-z]/") and in the "tr///" (also known as "y///")
535 operator are not magically Unicode-aware. What this means is that
536 "[A-Za-z]" will not magically start to mean "all alphabetic
537 letters" (not that it does mean that even for 8-bit characters; for
538 those, if you are using locales (perllocale), use "/[[:alpha:]]/";
539 and if not, use the 8-bit-aware property "\p{alpha}").
540
All the properties that begin with "\p" (and its inverse "\P") are
actually character classes that are Unicode-aware. There are
dozens of them; see perluniprops.
544
545 You can use Unicode code points as the end points of character
546 ranges, and the range will include all Unicode code points that lie
547 between those end points.
548
549 · String-To-Number Conversions
550
551 Unicode does define several other decimal--and numeric--characters
552 besides the familiar 0 to 9, such as the Arabic and Indic digits.
553 Perl does not support string-to-number conversion for digits other
554 than ASCII 0 to 9 (and ASCII a to f for hexadecimal).
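
The two points above can be illustrated with a short sketch. The first part
shows that an ASCII range does not match a Greek letter while Unicode
properties and explicit code point ranges do. The second part converts
non-ASCII decimal digits by looking up the "decimal" field returned by
"Unicode::UCD::charinfo()"; "digits_to_number" is a hypothetical helper.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

my $alpha = "\x{03B1}";    # GREEK SMALL LETTER ALPHA

my $in_ascii_range = ($alpha =~ /^[A-Za-z]$/)            ? 1 : 0;
my $is_alphabetic  = ($alpha =~ /^\p{Alphabetic}$/)      ? 1 : 0;
my $is_greek       = ($alpha =~ /^\p{Greek}$/)           ? 1 : 0;
my $in_cp_range    = ($alpha =~ /^[\x{03B1}-\x{03C9}]$/) ? 1 : 0;

# Hypothetical helper: map each decimal digit character to its ASCII value.
sub digits_to_number {
    my ($digits) = @_;
    my $ascii = "";
    for my $char (split //, $digits) {
        my $info = charinfo(ord($char));
        die "Not a decimal digit\n"
            unless $info && length $info->{decimal};
        $ascii .= $info->{decimal};
    }
    return $ascii + 0;
}

my $arabic_indic = "\x{0661}\x{0662}\x{0663}";   # ARABIC-INDIC DIGITS ONE, TWO, THREE

# Perl's own numeric conversion does not understand these digits.
my $naive     = do { no warnings 'numeric'; $arabic_indic + 0 };
my $converted = digits_to_number($arabic_indic);

print "$in_ascii_range $is_alphabetic $is_greek $in_cp_range\n";   # prints "0 1 1 1"
print "$naive $converted\n";                                       # prints "0 123"
```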
555
556 Questions With Answers
557 · Will My Old Scripts Break?
558
Very probably not. Unless you are generating Unicode characters
somehow, old behaviour should be preserved. About the only
behaviour that has changed and which could start generating Unicode
is the old behaviour of "chr()" where supplying an argument greater
than 255 produced a character modulo 256. "chr(300)", for example,
was equal to "chr(44)" or "," (in ASCII); now it is LATIN CAPITAL
LETTER I WITH BREVE.
566
567 · How Do I Make My Scripts Work With Unicode?
568
569 Very little work should be needed since nothing changes until you
570 generate Unicode data. The most important thing is getting input
571 as Unicode; for that, see the earlier I/O discussion.
572
573 · How Do I Know Whether My String Is In Unicode?
574
575 You shouldn't have to care. But you may, because currently the
576 semantics of the characters whose ordinals are in the range 128 to
577 255 are different depending on whether the string they are
578 contained within is in Unicode or not. (See "When Unicode Does Not
579 Happen" in perlunicode.)
580
581 To determine if a string is in Unicode, use:
582
583 print utf8::is_utf8($string) ? 1 : 0, "\n";
584
But note that this doesn't mean that any of the characters in the
string are necessarily UTF-8 encoded, or that any of the characters
have code points greater than 0xFF (255) or even 0x80 (128), or
that the string has any characters at all. All "is_utf8()"
does is return the value of the internal "utf8ness" flag
attached to the $string. If the flag is off, the bytes in the
591 scalar are interpreted as a single byte encoding. If the flag is
592 on, the bytes in the scalar are interpreted as the (variable-
593 length, potentially multi-byte) UTF-8 encoded code points of the
594 characters. Bytes added to a UTF-8 encoded string are
595 automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8
596 scalars are merged (double-quoted interpolation, explicit
597 concatenation, and printf/sprintf parameter substitution), the
598 result will be UTF-8 encoded as if copies of the byte strings were
599 upgraded to UTF-8: for example,
600
601 $a = "ab\x80c";
602 $b = "\x{100}";
603 print "$a = $b\n";
604
605 the output string will be UTF-8-encoded "ab\x80c = \x{100}\n", but
606 $a will stay byte-encoded.
607
608 Sometimes you might really need to know the byte length of a string
609 instead of the character length. For that use either the
610 "Encode::encode_utf8()" function or the "bytes" pragma and the
611 "length()" function:
612
613 my $unicode = chr(0x100);
614 print length($unicode), "\n"; # will print 1
615 require Encode;
616 print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
617 use bytes;
618 print length($unicode), "\n"; # will also print 2
619 # (the 0xC4 0x80 of the UTF-8)
620 no bytes;
621
622 · How Do I Detect Data That's Not Valid In a Particular Encoding?
623
624 Use the "Encode" package to try converting it. For example,
625
626 use Encode 'decode_utf8';
627
628 if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
629 # $string is valid utf8
630 } else {
631 # $string is not valid utf8
632 }
633
634 Or use "unpack" to try decoding it:
635
636 use warnings;
637 @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
638
639 If invalid, a "Malformed UTF-8 character" warning is produced. The
640 "C0" means "process the string character per character". Without
641 that, the "unpack("U*", ...)" would work in "U0" mode (the default
642 if the format string starts with "U") and it would return the bytes
643 making up the UTF-8 encoding of the target string, something that
644 will always work.
645
646 · How Do I Convert Binary Data Into a Particular Encoding, Or Vice
647 Versa?
648
649 This probably isn't as useful as you might think. Normally, you
650 shouldn't need to.
651
In one sense, what you are asking doesn't make much sense:
encodings are for characters, and binary data are not "characters",
so converting "data" into some encoding isn't meaningful unless you
know what character set and encoding the binary data is in, in
which case it's not just binary data, now is it?
657
658 If you have a raw sequence of bytes that you know should be
659 interpreted via a particular encoding, you can use "Encode":
660
661 use Encode 'from_to';
662 from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8
663
664 The call to "from_to()" changes the bytes in $data, but nothing
665 material about the nature of the string has changed as far as Perl
666 is concerned. Both before and after the call, the string $data
667 contains just a bunch of 8-bit bytes. As far as Perl is concerned,
668 the encoding of the string remains as "system-native 8-bit bytes".
669
670 You might relate this to a fictional 'Translate' module:
671
672 use Translate;
673 my $phrase = "Yes";
674 Translate::from_to($phrase, 'english', 'deutsch');
675 ## phrase now contains "Ja"
676
The contents of the string change, but not the nature of the
string. Perl doesn't know any more after the call than before that
the contents of the string indicate the affirmative.
680
681 Back to converting data. If you have (or want) data in your
682 system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you
683 can use pack/unpack to convert to/from Unicode.
684
685 $native_string = pack("W*", unpack("U*", $Unicode_string));
686 $Unicode_string = pack("U*", unpack("W*", $native_string));
687
688 If you have a sequence of bytes you know is valid UTF-8, but Perl
689 doesn't know it yet, you can make Perl a believer, too:
690
691 use Encode 'decode_utf8';
692 $Unicode = decode_utf8($bytes);
693
694 or:
695
696 $Unicode = pack("U0a*", $bytes);
697
698 You can find the bytes that make up a UTF-8 sequence with
699
700 @bytes = unpack("C*", $Unicode_string)
701
702 and you can create well-formed Unicode with
703
704 $Unicode_string = pack("U*", 0xff, ...)
705
706 · How Do I Display Unicode? How Do I Input Unicode?
707
708 See <http://www.alanwood.net/unicode/> and
709 <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
710
711 · How Does Unicode Work With Traditional Locales?
712
713 In Perl, not very well. Avoid using locales through the "locale"
714 pragma. Use only one or the other. But see perlrun for the
715 description of the "-C" switch and its environment counterpart,
716 $ENV{PERL_UNICODE} to see how to enable various Unicode features,
717 for example by using locale settings.
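
Returning to the question "How Do I Know Whether My String Is In Unicode?"
above, the behaviour of the internal flag can be observed with a small
sketch. It mirrors the concatenation example given there and additionally
shows that upgrading a string with "utf8::upgrade()" changes its internal
representation but not its value.

```perl
use strict;
use warnings;

my $bytes  = "ab\x80c";          # all code points below 0x100
my $wide   = "\x{100}";          # needs the UTF-8 representation
my $joined = "$bytes = $wide";   # the result is upgraded as a whole

my $flag_bytes    = utf8::is_utf8($bytes)  ? 1 : 0;   # original is untouched
my $flag_joined   = utf8::is_utf8($joined) ? 1 : 0;
my $joined_length = length($joined);                  # counted in characters

# Upgrading changes only the internal representation, not the value.
my $upgraded = $bytes;
utf8::upgrade($upgraded);
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;
my $still_equal   = ($upgraded eq $bytes)    ? 1 : 0;

print "$flag_bytes $flag_joined $joined_length $flag_upgraded $still_equal\n";
# prints "0 1 8 1 1"
```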
718
719 Hexadecimal Notation
720 The Unicode standard prefers using hexadecimal notation because that
721 more clearly shows the division of Unicode into blocks of 256
722 characters. Hexadecimal is also simply shorter than decimal. You can
723 use decimal notation, too, but learning to use hexadecimal just makes
724 life easier with the Unicode standard. The "U+HHHH" notation uses
725 hexadecimal, for example.
726
The "0x" prefix means a hexadecimal number; the digits are 0-9 and a-f
(or A-F, case doesn't matter). Each hexadecimal digit represents four
729 bits, or half a byte. "print 0x..., "\n"" will show a hexadecimal
730 number in decimal, and "printf "%x\n", $decimal" will show a decimal
731 number in hexadecimal. If you have just the "hex digits" of a
732 hexadecimal number, you can use the "hex()" function.
733
734 print 0x0009, "\n"; # 9
735 print 0x000a, "\n"; # 10
736 print 0x000f, "\n"; # 15
737 print 0x0010, "\n"; # 16
738 print 0x0011, "\n"; # 17
739 print 0x0100, "\n"; # 256
740
741 print 0x0041, "\n"; # 65
742
743 printf "%x\n", 65; # 41
744 printf "%#x\n", 65; # 0x41
745
746 print hex("41"), "\n"; # 65
747
748 Further Resources
749 · Unicode Consortium
750
751 <http://www.unicode.org/>
752
753 · Unicode FAQ
754
755 <http://www.unicode.org/unicode/faq/>
756
757 · Unicode Glossary
758
759 <http://www.unicode.org/glossary/>
760
761 · Unicode Useful Resources
762
763 <http://www.unicode.org/unicode/onlinedat/resources.html>
764
765 · Unicode and Multilingual Support in HTML, Fonts, Web Browsers and
766 Other Applications
767
768 <http://www.alanwood.net/unicode/>
769
770 · UTF-8 and Unicode FAQ for Unix/Linux
771
772 <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
773
774 · Legacy Character Sets
775
776 <http://www.czyborra.com/> <http://www.eki.ee/letter/>
777
778 · The Unicode support files live within the Perl installation in the
779 directory
780
781 $Config{installprivlib}/unicore
782
783 in Perl 5.8.0 or newer, and
784
785 $Config{installprivlib}/unicode
786
in the Perl 5.6 series. (The renaming to lib/unicore was done to
avoid naming conflicts with lib/Unicode in case-insensitive
filesystems.) The main Unicode data file is UnicodeData.txt (or
Unicode.301 in Perl 5.6.1). You can find the
$Config{installprivlib} by
792
793 perl "-V:installprivlib"
794
795 You can explore various information from the Unicode data files
796 using the "Unicode::UCD" module.
797
799 If you cannot upgrade your Perl to 5.8.0 or later, you can still do
800 some Unicode processing by using the modules "Unicode::String",
801 "Unicode::Map8", and "Unicode::Map", available from CPAN. If you have
802 the GNU recode installed, you can also use the Perl front-end
803 "Convert::Recode" for character conversions.
804
The following are fast conversions from ISO 8859-1 (Latin-1) bytes to
UTF-8 bytes and back; the code works even with older Perl 5 versions.
807
808 # ISO 8859-1 to UTF-8
809 s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
810
811 # UTF-8 to ISO 8859-1
812 s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
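
For experimentation, the two substitutions above can be wrapped in small
functions. The function names are hypothetical; the regular expressions are
the ones shown above, applied to a copy of the argument.

```perl
use strict;
use warnings;

# Hypothetical wrapper: ISO 8859-1 bytes to UTF-8 bytes.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    $bytes =~ s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
    return $bytes;
}

# Hypothetical wrapper: UTF-8 bytes to ISO 8859-1 bytes.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    $bytes =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
    return $bytes;
}

my $latin1 = "caf\xE9";                    # "cafe" with e-acute as one byte
my $utf8   = latin1_to_utf8($latin1);      # e-acute becomes 0xC3 0xA9
my $back   = utf8_to_latin1($utf8);

print unpack("H*", $utf8), "\n";   # prints "636166c3a9"
```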
813
SEE ALSO
perlunitut, perlunicode, Encode, open, utf8, bytes, perlretut, perlrun,
Unicode::Collate, Unicode::Normalize, Unicode::UCD
817
819 Thanks to the kind readers of the perl5-porters@perl.org,
820 perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
821 mailing lists for their valuable feedback.
822
824 Copyright 2001-2002 Jarkko Hietaniemi <jhi@iki.fi>
825
826 This document may be distributed under the same terms as Perl itself.

perl v5.12.4 2011-06-07 PERLUNIINTRO(1)