PERLUNIINTRO(1)        Perl Programmers Reference Guide        PERLUNIINTRO(1)


NAME
perluniintro - Perl Unicode introduction

DESCRIPTION
This document gives a general idea of Unicode and how to use Unicode in
Perl.

Unicode
Unicode is a character set standard which plans to codify all of the
writing systems of the world, plus many other symbols.

Unicode and ISO/IEC 10646 are coordinated standards that provide code
points for characters in almost all modern character set standards,
covering more than 30 writing systems and hundreds of languages,
including all commercially-important modern languages.  All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded.  The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.  Unicode 1.0
was released in October 1991, and 4.0 in April 2003.

A Unicode character is an abstract entity.  It is not bound to any
particular integer width, especially not to the C language "char".
Unicode is language-neutral and display-neutral: it does not encode the
language of the text, and it does not generally define fonts or other
graphical layout details.  Unicode operates on characters and on text
built from those characters.

Unicode defines characters like "LATIN CAPITAL LETTER A" or "GREEK
SMALL LETTER ALPHA" and unique numbers for the characters, in this case
0x0041 and 0x03B1, respectively.  These unique numbers are called code
points.

The Unicode standard prefers using hexadecimal notation for the code
points.  If numbers like 0x0041 are unfamiliar to you, take a peek at a
later section, "Hexadecimal Notation".  The Unicode standard uses the
notation "U+0041 LATIN CAPITAL LETTER A" to give the hexadecimal code
point and the normative name of the character.

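As an illustration (a sketch, not part of the standard itself), the charnames
module that ships with Perl can map between code points and these normative
names:

    use charnames ();    # charnames::viacode() and charnames::vianame()

    printf "U+%04X is %s\n", 0x0041, charnames::viacode(0x0041);
    # U+0041 is LATIN CAPITAL LETTER A

    printf "GREEK SMALL LETTER ALPHA is U+%04X\n",
           charnames::vianame("GREEK SMALL LETTER ALPHA");
    # GREEK SMALL LETTER ALPHA is U+03B1
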
Unicode also defines various properties for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation"; these
properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.

A Unicode character consists either of a single code point, or a base
character (like "LATIN CAPITAL LETTER A"), followed by one or more
modifiers (like "COMBINING ACUTE ACCENT").  This sequence of base
character and modifiers is called a combining character sequence.

Whether to call these combining character sequences "characters"
depends on your point of view.  If you are a programmer, you probably
would tend towards seeing each element in the sequences as one unit, or
"character".  The whole sequence could be seen as one "character",
however, from the user's point of view, since that's probably what it
looks like in the context of the user's language.

With this "whole sequence" view of characters, the total number of
characters is open-ended.  But in the programmer's "one unit is one
character" point of view, the concept of "characters" is more
deterministic.  In this document, we take that second point of view:
one "character" is one Unicode code point, be it a base character or a
combining character.

For some combinations, there are precomposed characters.  "LATIN
CAPITAL LETTER A WITH ACUTE", for example, is defined as a single code
point.  These precomposed characters are, however, only available for
some combinations, and are mainly meant to support round-trip
conversions between Unicode and legacy standards (like the ISO 8859
series).  In the general case, the composing method is more extensible.
To support conversion between different compositions of the characters,
various normalization forms to standardize representations are also
defined.

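A minimal sketch of normalization at work, using the Unicode::Normalize
module that ships with Perl 5.8 and later (the lengths count characters, as
discussed under "Handling Unicode" below):

    use Unicode::Normalize;    # NFC() and NFD() are exported by default

    my $decomposed  = "A\x{301}";        # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
    my $precomposed = NFC($decomposed);  # composed form: "\x{C1}", LATIN CAPITAL LETTER A WITH ACUTE
    my $redone      = NFD($precomposed); # decomposed again: two code points

    printf "%d %d %d\n",
           length($decomposed), length($precomposed), length($redone);   # 2 1 2
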
Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character".  The same character could be
represented differently in several legacy encodings.  The converse is
also not true: some code points do not have an assigned character.
Firstly, there are unallocated code points within otherwise used
blocks.  Secondly, there are special Unicode control characters that do
not represent true characters.

A common myth about Unicode is that it is "16-bit", that is, that it
can represent only 0x10000 (or 65536) characters, from 0x0000 to
0xFFFF.  This is untrue.  Since Unicode 2.0 (July 1996), Unicode has
been defined all the way up to 21 bits (0x10FFFF), and since Unicode
3.1 (March 2001), characters have been defined beyond 0xFFFF.  The
first 0x10000 characters are called Plane 0, or the Basic Multilingual
Plane (BMP).  With Unicode 3.1, 17 (yes, seventeen) planes in all were
defined--but they are nowhere near full of defined characters, yet.

Another myth is that the 256-character blocks have something to do with
languages--that each block would define the characters used by a
language or a set of languages.  This is also untrue.  The division
into blocks exists, but it is almost completely accidental--an artifact
of how the characters have been and still are allocated.  Instead,
there is a concept called scripts, which is more useful: there is
"Latin" script, "Greek" script, and so on.  Scripts usually span varied
parts of several blocks.  For further information see Unicode::UCD.

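For instance, the Unicode::UCD module can report both the block and the
script of a code point; a small sketch (the names come from the Unicode data
files):

    use Unicode::UCD qw(charblock charscript);

    # LATIN SMALL LETTER E WITH ACUTE: its block and its script are different things
    printf "block:  %s\n", charblock(0x00E9);    # block:  Latin-1 Supplement
    printf "script: %s\n", charscript(0x00E9);   # script: Latin
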
The Unicode code points are just abstract numbers.  To input and output
these abstract numbers, the numbers must be encoded or serialised
somehow.  Unicode defines several character encoding forms, of which
UTF-8 is perhaps the most popular.  UTF-8 is a variable-length encoding
that encodes Unicode characters as 1 to 6 bytes (only 4 with the
currently defined characters).  Other encodings include UTF-16 and
UTF-32 and their big- and little-endian variants (UTF-8 is byte-order
independent).  ISO/IEC 10646 defines the UCS-2 and UCS-4 encoding
forms.

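As a sketch of how the same characters serialise differently, the Encode
module can be asked for the byte length of each encoding form (the counts in
the comments assume the encodings named here):

    use Encode qw(encode);

    my $str = "A\x{3B1}\x{263A}";   # LATIN CAPITAL LETTER A, GREEK SMALL LETTER ALPHA, WHITE SMILING FACE
    for my $enc ("UTF-8", "UTF-16BE", "UTF-32BE") {
        printf "%-8s %2d bytes\n", $enc, length(encode($enc, $str));
    }
    # UTF-8     6 bytes   (1 + 2 + 3)
    # UTF-16BE  6 bytes   (2 + 2 + 2)
    # UTF-32BE 12 bytes   (4 + 4 + 4)
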
For more information about encodings--for instance, to learn what
surrogates and byte order marks (BOMs) are--see perlunicode.

Perl's Unicode Support
Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
natively.  Perl 5.8.0, however, is the first recommended release for
serious Unicode work.  The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but for example regular
expressions still do not work with Unicode in 5.6.1.

Starting from Perl 5.8.0, the use of "use utf8" is needed only in much
more restricted circumstances.  In earlier releases the "utf8" pragma
was used to declare that operations in the current block or file would
be Unicode-aware.  This model was found to be wrong, or at least
clumsy: the "Unicodeness" is now carried with the data, instead of
being attached to the operations.  Only one case remains where an
explicit "use utf8" is needed: if your Perl script itself is encoded in
UTF-8, you can use UTF-8 in your identifier names, and in string and
regular expression literals, by saying "use utf8".  This is not the
default because scripts with legacy 8-bit data in them would break.
See utf8.

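A hedged sketch of that one remaining case, a script whose own source file is
saved as UTF-8 (the identifier and the literal are purely illustrative):

    use utf8;                      # the source of this script is itself UTF-8

    my $café = "naïve";            # UTF-8 allowed in identifiers and literals
    print length($café), "\n";     # 5 characters, not 6 bytes
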
Perl's Unicode Model
Perl supports both pre-5.6 strings of eight-bit native bytes, and
strings of Unicode characters.  The principle is that Perl tries to
keep its data as eight-bit bytes for as long as possible, but as soon
as Unicodeness cannot be avoided, the data is transparently upgraded to
Unicode.

Internally, Perl currently uses either the native eight-bit character
set of the platform (for example Latin-1) or UTF-8 to encode Unicode
strings.  Specifically, if all code points in the string are 0xFF or
less, Perl uses the native eight-bit character set.  Otherwise, it uses
UTF-8.

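You can watch that upgrade happen with the always-available
utf8::is_utf8() function; a minimal sketch (the flag is an internal detail,
hence the "usually" in the comments):

    my $s = "caf\x{E9}";                      # all code points <= 0xFF: native eight-bit storage
    print utf8::is_utf8($s) ? 1 : 0, "\n";    # usually 0

    $s .= "\x{263A}";                         # a code point above 0xFF joins the string...
    print utf8::is_utf8($s) ? 1 : 0, "\n";    # ...so the whole string is upgraded: 1
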
A user of Perl does not normally need to know nor care how Perl happens
to encode its internal strings, but it becomes relevant when outputting
Unicode strings to a stream without a PerlIO layer -- one with the
"default" encoding.  In such a case, the raw bytes used internally (the
native character set or UTF-8, as appropriate for each string) will be
used, and a "Wide character" warning will be issued if those strings
contain a character beyond 0x00FF.

For example,

    perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

produces a fairly useless mixture of native bytes and UTF-8, as well as
a warning:

    Wide character in print at ...

To output UTF-8, use the ":encoding" or ":utf8" output layer.
Prepending

    binmode(STDOUT, ":utf8");

to this sample program ensures that the output is completely UTF-8, and
removes the program's warning.

You can enable automatic UTF-8-ification of your standard file handles,
default "open()" layer, and @ARGV by using either the "-C" command line
switch or the "PERL_UNICODE" environment variable; see perlrun for the
documentation of the "-C" switch.

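For example, either of the following one-liners (shown for a POSIX-style
shell; -CS tells Perl to treat STDIN, STDOUT, and STDERR as UTF-8) prints the
earlier sample without the "Wide character" warning:

    perl -CS -e 'print "\x{0100}\x{DF}\n"'

    PERL_UNICODE=S perl -e 'print "\x{0100}\x{DF}\n"'
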
Note that this means that Perl expects other software to work, too: if
Perl has been led to believe that STDIN should be UTF-8, but then STDIN
coming in from another command is not UTF-8, Perl will complain about
the malformed UTF-8.

All features that combine Unicode and I/O also require using the new
PerlIO feature.  Almost all Perl 5.8 platforms do use PerlIO, though:
you can see whether yours does by running "perl -V" and looking for
"useperlio=define".

Unicode and EBCDIC
Perl 5.8.0 also supports Unicode on EBCDIC platforms.  There, Unicode
support is somewhat more complex to implement since additional
conversions are needed at every step.  Some problems remain; see
perlebcdic for details.

In any case, the Unicode support on EBCDIC platforms is better than in
the 5.6 series, which didn't work much at all for EBCDIC platforms.  On
EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
instead of UTF-8.  The difference is that UTF-8 is "ASCII-safe", in
that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
"EBCDIC-safe".

Creating Unicode
To create Unicode characters in literals for code points above 0xFF,
use the "\x{...}" notation in double-quoted strings:

    my $smiley = "\x{263a}";

Similarly, it can be used in regular expression literals:

    $smiley =~ /\x{263a}/;

At run-time you can use "chr()":

    my $hebrew_alef = chr(0x05d0);

See "Further Resources" for how to find all these numeric codes.

Naturally, "ord()" will do the reverse: it turns a character into a
code point.

Note that "\x.." (no "{}" and only two hexadecimal digits), "\x{...}",
and "chr(...)" for arguments less than 0x100 (decimal 256) generate an
eight-bit character for backward compatibility with older Perls.  For
arguments of 0x100 or more, Unicode characters are always produced.  If
you want to force the production of Unicode characters regardless of
the numeric value, use "pack("U", ...)" instead of "\x..", "\x{...}",
or "chr()".

You can also use the "charnames" pragma to invoke characters by name in
double-quoted strings:

    use charnames ':full';
    my $arabic_alef = "\N{ARABIC LETTER ALEF}";

And, as mentioned above, you can also "pack()" numbers into Unicode
characters:

    my $georgian_an = pack("U", 0x10a0);

Note that both "\x{...}" and "\N{...}" are compile-time string
constants: you cannot use variables in them.  If you want similar run-
time functionality, use "chr()" and "charnames::vianame()".

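A small run-time sketch of that combination (the character name here is only
an example; it could just as well come from user input):

    use charnames ();

    my $name = "ARABIC LETTER ALEF";
    my $char = chr(charnames::vianame($name));   # run-time equivalent of "\N{ARABIC LETTER ALEF}"
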
If you want to force the result to Unicode characters, use the special
"U0" prefix.  It consumes no arguments but causes the following bytes
to be interpreted as the UTF-8 encoding of Unicode characters:

    my $chars = pack("U0W*", 0x80, 0x42);

Likewise, you can stop such UTF-8 interpretation by using the special
"C0" prefix.

Handling Unicode
Handling Unicode is for the most part transparent: just use the strings
as usual.  Functions like "index()", "length()", and "substr()" will
work on the Unicode characters; regular expressions will work on the
Unicode characters (see perlunicode and perlretut).

Note that Perl considers combining character sequences to be separate
characters, so for example

    use charnames ':full';
    print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";

will print 2, not 1.  The only exception is that regular expressions
have "\X" for matching a combining character sequence.

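A minimal sketch of "\X" treating that same two-code-point sequence as a
single unit:

    use charnames ':full';

    my $seq = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}";
    my @graphemes = $seq =~ /(\X)/g;
    print scalar(@graphemes), "\n";   # 1: the whole sequence matches one \X
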
Life is not quite so transparent, however, when working with legacy
encodings, I/O, and certain special cases:

Legacy Encodings
When you combine legacy data and Unicode, the legacy data needs to be
upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if applicable) is
assumed.

The "Encode" module knows about many encodings and has interfaces for
doing conversions between those encodings:

    use Encode 'decode';
    $data = decode("iso-8859-3", $data); # convert from legacy to utf-8

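The other direction is symmetrical; a minimal sketch (the target encoding is
only an example):

    use Encode 'encode';
    $octets = encode("iso-8859-3", $data); # from Perl's Unicode strings back to legacy bytes
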
Unicode I/O
Normally, writing out Unicode data

    print FH $some_string_with_unicode, "\n";

produces raw bytes that Perl happens to use to internally encode the
Unicode string.  Perl's internal encoding depends on the system as well
as what characters happen to be in the string at the time.  If any of
the characters are at code points 0x100 or above, you will get a
warning.  To ensure that the output is explicitly rendered in the
encoding you desire--and to avoid the warning--open the stream with the
desired encoding.  Some examples:

    open FH, ">:utf8", "file";

    open FH, ">:encoding(ucs2)", "file";
    open FH, ">:encoding(UTF-8)", "file";
    open FH, ">:encoding(shift_jis)", "file";

and on already open streams, use "binmode()":

    binmode(STDOUT, ":utf8");

    binmode(STDOUT, ":encoding(ucs2)");
    binmode(STDOUT, ":encoding(UTF-8)");
    binmode(STDOUT, ":encoding(shift_jis)");

The matching of encoding names is loose: case does not matter, and many
encodings have several aliases.  Note that the ":utf8" layer must
always be specified exactly like that; it is not subject to the loose
matching of encoding names.  Also note that ":utf8" is unsafe for
input, because it accepts the data without validating that it is indeed
valid UTF-8.

See PerlIO for the ":utf8" layer, PerlIO::encoding and Encode::PerlIO
for the ":encoding()" layer, and Encode::Supported for many encodings
supported by the "Encode" module.

Reading in a file that you know happens to be encoded in one of the
Unicode or legacy encodings does not magically turn the data into
Unicode in Perl's eyes.  To do that, specify the appropriate layer when
opening files:

    open(my $fh, '<:encoding(utf8)', 'anything');
    my $line_of_unicode = <$fh>;

    open(my $fh, '<:encoding(Big5)', 'anything');
    my $line_of_unicode = <$fh>;

The I/O layers can also be specified more flexibly with the "open"
pragma.  See open, or look at the following example.

    use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
    open X, ">file";
    print X chr(0x100), "\n";
    close X;
    open Y, "<file";
    printf "%#x\n", ord(<Y>); # this should print 0x100
    close Y;

With the "open" pragma you can use the ":locale" layer:

    BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
    # the :locale will probe the locale environment variables like LC_ALL
    use open OUT => ':locale'; # russki parusski
    open(O, ">koi8");
    print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
    close O;
    open(I, "<koi8");
    printf "%#x\n", ord(<I>); # this should print 0xc1
    close I;

These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream.  The result is always Unicode.

The open pragma affects all the "open()" calls after the pragma by
setting default layers.  If you want to affect only certain streams,
use explicit layers directly in the "open()" call.

You can switch encodings on an already opened stream by using
"binmode()"; see "binmode" in perlfunc.

The ":locale" does not currently (as of Perl 5.8.0) work with "open()"
and "binmode()", only with the "open" pragma.  The ":utf8" and
":encoding(...)" methods do work with all of "open()", "binmode()", and
the "open" pragma.

Similarly, you may use these I/O layers on output streams to
automatically convert Unicode to the specified encoding when it is
written to the stream.  For example, the following snippet copies the
contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
the file "text.utf8", encoded as UTF-8:

    open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
    open(my $unicode, '>:utf8', 'text.utf8');
    while (<$nihongo>) { print $unicode $_ }

The naming of encodings, both in "open()" and in the "open" pragma,
allows for flexible names: "koi8-r" and "KOI8R" will both be
understood.

Common encodings recognized by ISO, MIME, IANA, and various other
standardisation organisations are accepted; for a more detailed list
see Encode::Supported.

"read()" reads characters and returns the number of characters.
"seek()" and "tell()" operate on byte counts, as do "sysread()" and
"sysseek()".

Notice that because of the default behaviour of not doing any
conversion upon input if there is no default layer, it is easy to
mistakenly write code that keeps on expanding a file by repeatedly
encoding the data:

    # BAD CODE WARNING
    open F, "file";
    local $/; ## read in the whole file of 8-bit characters
    $t = <F>;
    close F;
    open F, ">:encoding(utf8)", "file";
    print F $t; ## convert to UTF-8 on output
    close F;

If you run this code twice, the contents of the file will be UTF-8
encoded twice.  A "use open ':encoding(utf8)'" would have avoided the
bug, as would explicitly opening the file for input as UTF-8.

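One corrected version, as a sketch; it assumes the file on disk really is
UTF-8 to begin with:

    open F, "<:encoding(UTF-8)", "file";
    local $/;    ## read in the whole file of Unicode characters
    $t = <F>;
    close F;
    open F, ">:encoding(UTF-8)", "file";
    print F $t;  ## re-encode to UTF-8 on output; no double encoding
    close F;
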
NOTE: the ":utf8" and ":encoding" features work only if your Perl has
been built with the new PerlIO feature (which is the default on most
systems).

Displaying Unicode As Text
Sometimes you might want to display Perl scalars containing Unicode as
simple ASCII (or EBCDIC) text.  The following subroutine converts its
argument so that Unicode characters with code points greater than 255
are displayed as "\x{...}", control characters (like "\n") are
displayed as "\x..", and the rest of the characters as themselves:

    sub nice_string {
        join("",
          map { $_ > 255 ?                    # if wide character...
                sprintf("\\x{%04X}", $_) :    # \x{...}
                chr($_) =~ /[[:cntrl:]]/ ?    # else if control character...
                sprintf("\\x%02X", $_) :      # \x..
                quotemeta(chr($_))            # else quoted or as themselves
              } unpack("W*", $_[0]));         # unpack Unicode characters
    }

For example,

    nice_string("foo\x{100}bar\n")

returns the string

    'foo\x{0100}bar\x0A'

which is ready to be printed.

Special Cases
· Bit Complement Operator ~ And vec()

  The bit complement operator "~" may produce surprising results if
  used on strings containing characters with ordinal values above
  255.  In such a case, the results are consistent with the internal
  encoding of the characters, but not with much else.  So don't do
  that.  Similarly for "vec()": you will be operating on the
  internally-encoded bit patterns of the Unicode characters, not on
  the code point values, which is very probably not what you want.

· Peeking At Perl's Internal Encoding

  Normal users of Perl should never care how Perl encodes any
  particular Unicode string (because the normal ways to get at the
  contents of a string with Unicode--via input and output--should
  always be via explicitly-defined I/O layers).  But if you must,
  there are two ways of looking behind the scenes.

  One way of peeking inside the internal encoding of Unicode
  characters is to use "unpack("C*", ...)" to get the bytes of
  whatever the string encoding happens to be, or "unpack("U0..",
  ...)" to get the bytes of the UTF-8 encoding:

      # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
      print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";

  Yet another way would be to use the Devel::Peek module:

      perl -MDevel::Peek -e 'Dump(chr(0x100))'

  That shows the "UTF8" flag in FLAGS and both the UTF-8 bytes and
  Unicode characters in "PV".  See also later in this document the
  discussion about the "utf8::is_utf8()" function.

Advanced Topics
· String Equivalence

  The question of string equivalence turns somewhat complicated in
  Unicode: what do you mean by "equal"?

  (Is "LATIN CAPITAL LETTER A WITH ACUTE" equal to "LATIN CAPITAL
  LETTER A"?)

  The short answer is that by default Perl compares equivalence
  ("eq", "ne") based only on code points of the characters.  In the
  above case, the answer is no (because 0x00C1 != 0x0041).  But
  sometimes, any CAPITAL LETTER As should be considered equal, or
  even As of any case.

  The long answer is that you need to consider character
  normalization and casing issues: see Unicode::Normalize, Unicode
  Technical Reports #15 and #21, Unicode Normalization Forms and Case
  Mappings, <http://www.unicode.org/unicode/reports/tr15/> and
  <http://www.unicode.org/unicode/reports/tr21/>

  As of Perl 5.8.0, the "Full" case-folding of Case
  Mappings/SpecialCasing is implemented.

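  A minimal sketch of a normalization-based comparison with
  Unicode::Normalize; here "equal" is taken to mean "same NFC form", which
  is only one of several possible definitions:

      use Unicode::Normalize;

      my $composed   = "\x{C1}";       # LATIN CAPITAL LETTER A WITH ACUTE
      my $decomposed = "A\x{301}";     # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

      print $composed eq $decomposed           ? "eq" : "ne", "\n";  # ne: different code points
      print NFC($composed) eq NFC($decomposed) ? "eq" : "ne", "\n";  # eq: same canonical form
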
· String Collation

  People like to see their strings nicely sorted--or as Unicode
  parlance goes, collated.  But again, what do you mean by collate?

  (Does "LATIN CAPITAL LETTER A WITH ACUTE" come before or after
  "LATIN CAPITAL LETTER A WITH GRAVE"?)

  The short answer is that by default, Perl compares strings ("lt",
  "le", "cmp", "ge", "gt") based only on the code points of the
  characters.  In the above case, the answer is "after", since 0x00C1
  > 0x00C0.

  The long answer is that "it depends", and a good answer cannot be
  given without knowing (at the very least) the language context.
  See Unicode::Collate, and Unicode Collation Algorithm
  <http://www.unicode.org/unicode/reports/tr10/>

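  A sketch of collation with the Unicode::Collate module and its default
  (DUCET) table; language-specific tailoring is a separate topic:

      use Unicode::Collate;

      my $collator = Unicode::Collate->new();   # the default collation table

      my @code_point_order = sort  ("A", "b", "\x{E1}");           # ("A", "b", "\x{E1}")
      my @collated         = $collator->sort("A", "b", "\x{E1}");  # ("A", "\x{E1}", "b")
      # 0xE1 is LATIN SMALL LETTER A WITH ACUTE: the collator sorts it with the
      # other letter As, before "b", instead of after all ASCII letters
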
Miscellaneous
· Character Ranges and Classes

  Character ranges in regular expression character classes
  ("/[a-z]/") and in the "tr///" (also known as "y///") operator are
  not magically Unicode-aware.  What this means is that "[A-Za-z]"
  will not magically start to mean "all alphabetic letters" (not that
  it means that even for 8-bit characters; for that, you should be
  using "/[[:alpha:]]/").

  For specifying character classes like that in regular expressions,
  you can use the various Unicode properties--"\pL", or perhaps
  "\p{Alphabetic}", in this particular case.  You can use Unicode
  code points as the end points of character ranges, but there is no
  magic associated with specifying a certain range.  For further
  information--there are dozens of Unicode character classes--see
  perlunicode.

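  A small sketch of the difference; the strings and the captured lists shown
  in the comments are only illustrative:

      my $text = "\x{3B1}\x{3B2} abc 123";   # GREEK SMALL LETTER ALPHA and BETA, then ASCII

      my @ascii_letters = $text =~ /([A-Za-z]+)/g;         # ("abc")
      my @any_letters   = $text =~ /(\p{Alphabetic}+)/g;   # ("\x{3B1}\x{3B2}", "abc")
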
· String-To-Number Conversions

  Unicode does define several other decimal--and numeric--characters
  besides the familiar 0 to 9, such as the Arabic and Indic digits.
  Perl does not support string-to-number conversion for digits other
  than ASCII 0 to 9 (and ASCII a to f for hexadecimal).

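  For example (a sketch; the warning wording is what "use warnings"
  typically emits):

      use warnings;

      my $arabic_four = "\x{0664}";     # ARABIC-INDIC DIGIT FOUR
      print $arabic_four + 0, "\n";     # 0, with an "isn't numeric in addition" warning
      print "4" + 0, "\n";              # 4
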
Questions With Answers
· Will My Old Scripts Break?

  Very probably not.  Unless you are generating Unicode characters
  somehow, old behaviour should be preserved.  About the only
  behaviour that has changed and which could start generating Unicode
  is the old behaviour of "chr()" where supplying an argument greater
  than 255 produced a character modulo 255.  "chr(300)", for example,
  was equal to "chr(45)" or "-" (in ASCII); now it is LATIN CAPITAL
  LETTER I WITH BREVE.

· How Do I Make My Scripts Work With Unicode?

  Very little work should be needed since nothing changes until you
  generate Unicode data.  The most important thing is getting input
  as Unicode; for that, see the earlier I/O discussion.

· How Do I Know Whether My String Is In Unicode?

  You shouldn't have to care.  But you may, because currently the
  semantics of the characters whose ordinals are in the range 128 to
  255 are different depending on whether the string they are contained
  within is in Unicode or not.  (See perlunicode.)

  To determine if a string is in Unicode, use:

      print utf8::is_utf8($string) ? 1 : 0, "\n";

  But note that this doesn't mean that any of the characters in the
  string are necessarily UTF-8 encoded, or that any of the characters
  have code points greater than 0xFF (255) or even 0x80 (128), or
  that the string has any characters at all.  All "is_utf8()"
  does is return the value of the internal "utf8ness" flag
  attached to $string.  If the flag is off, the bytes in the
  scalar are interpreted as a single-byte encoding.  If the flag is
  on, the bytes in the scalar are interpreted as the (multi-byte,
  variable-length) UTF-8 encoded code points of the characters.
  Bytes added to a UTF-8 encoded string are automatically upgraded
  to UTF-8.  If mixed non-UTF-8 and UTF-8 scalars are merged (double-
  quoted interpolation, explicit concatenation, and printf/sprintf
  parameter substitution), the result will be UTF-8 encoded as if
  copies of the byte strings were upgraded to UTF-8: for example,

      $a = "ab\x80c";
      $b = "\x{100}";
      print "$a = $b\n";

  the output string will be UTF-8-encoded "ab\x80c = \x{100}\n", but
  $a will stay byte-encoded.

  Sometimes you might really need to know the byte length of a string
  instead of the character length.  For that use either the
  "Encode::encode_utf8()" function or the "bytes" pragma and the
  "length()" function:

      my $unicode = chr(0x100);
      print length($unicode), "\n"; # will print 1
      require Encode;
      print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
      use bytes;
      print length($unicode), "\n"; # will also print 2
                                    # (the 0xC4 0x80 of the UTF-8)

· How Do I Detect Data That's Not Valid In a Particular Encoding?

  Use the "Encode" package to try converting it.  For example,

      use Encode 'decode_utf8';

      if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
          # $string is valid utf8
      } else {
          # $string is not valid utf8
      }

  Or use "unpack" to try decoding it:

      use warnings;
      @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);

  If invalid, a "Malformed UTF-8 character" warning is produced.  The
  "C0" means "process the string character per character".  Without
  that, the "unpack("U*", ...)" would work in "U0" mode (the default
  if the format string starts with "U") and it would return the bytes
  making up the UTF-8 encoding of the target string, something that
  will always work.

· How Do I Convert Binary Data Into a Particular Encoding, Or Vice
  Versa?

  This probably isn't as useful as you might think.  Normally, you
  shouldn't need to.

  In one sense, what you are asking doesn't make much sense:
  encodings are for characters, and binary data are not "characters",
  so converting "data" into some encoding isn't meaningful unless you
  know in which character set and encoding the binary data is, in
  which case it's not just binary data, now is it?

  If you have a raw sequence of bytes that you know should be
  interpreted via a particular encoding, you can use "Encode":

      use Encode 'from_to';
      from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8

  The call to "from_to()" changes the bytes in $data, but nothing
  material about the nature of the string has changed as far as Perl
  is concerned.  Both before and after the call, the string $data
  contains just a bunch of 8-bit bytes.  As far as Perl is concerned,
  the encoding of the string remains as "system-native 8-bit bytes".

  You might relate this to a fictional 'Translate' module:

      use Translate;
      my $phrase = "Yes";
      Translate::from_to($phrase, 'english', 'deutsch');
      ## phrase now contains "Ja"

  The contents of the string change, but not the nature of the
  string.  Perl doesn't know any more after the call than before that
  the contents of the string indicate the affirmative.

  Back to converting data.  If you have (or want) data in your
  system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you
  can use pack/unpack to convert to/from Unicode.

      $native_string  = pack("W*", unpack("U*", $Unicode_string));
      $Unicode_string = pack("U*", unpack("W*", $native_string));

  If you have a sequence of bytes you know is valid UTF-8, but Perl
  doesn't know it yet, you can make Perl a believer, too:

      use Encode 'decode_utf8';
      $Unicode = decode_utf8($bytes);

  or:

      $Unicode = pack("U0a*", $bytes);

  You can find the bytes that make up a UTF-8 sequence with

      @bytes = unpack("C*", $Unicode_string)

  and you can create well-formed Unicode with

      $Unicode_string = pack("U*", 0xff, ...)

· How Do I Display Unicode?  How Do I Input Unicode?

  See <http://www.alanwood.net/unicode/> and
  <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

· How Does Unicode Work With Traditional Locales?

  In Perl, not very well.  Avoid using locales through the "locale"
  pragma: use Unicode or locales, but not both.  See perlrun for the
  description of the "-C" switch and its environment counterpart,
  $ENV{PERL_UNICODE}, to see how to enable various Unicode features,
  for example by using locale settings.

Hexadecimal Notation
The Unicode standard prefers using hexadecimal notation because that
more clearly shows the division of Unicode into blocks of 256
characters.  Hexadecimal is also simply shorter than decimal.  You can
use decimal notation, too, but learning to use hexadecimal just makes
life easier with the Unicode standard.  The "U+HHHH" notation uses
hexadecimal, for example.

The "0x" prefix means a hexadecimal number; the digits are 0-9 and a-f
(or A-F, case doesn't matter).  Each hexadecimal digit represents four
bits, or half a byte.  "print 0x..., "\n"" will show a hexadecimal
number in decimal, and "printf "%x\n", $decimal" will show a decimal
number in hexadecimal.  If you have just the "hex digits" of a
hexadecimal number, you can use the "hex()" function.

    print 0x0009, "\n";    # 9
    print 0x000a, "\n";    # 10
    print 0x000f, "\n";    # 15
    print 0x0010, "\n";    # 16
    print 0x0011, "\n";    # 17
    print 0x0100, "\n";    # 256

    print 0x0041, "\n";    # 65

    printf "%x\n",  65;    # 41
    printf "%#x\n", 65;    # 0x41

    print hex("41"), "\n"; # 65

Further Resources
· Unicode Consortium

  <http://www.unicode.org/>

· Unicode FAQ

  <http://www.unicode.org/unicode/faq/>

· Unicode Glossary

  <http://www.unicode.org/glossary/>

· Unicode Useful Resources

  <http://www.unicode.org/unicode/onlinedat/resources.html>

· Unicode and Multilingual Support in HTML, Fonts, Web Browsers and
  Other Applications

  <http://www.alanwood.net/unicode/>

· UTF-8 and Unicode FAQ for Unix/Linux

  <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

· Legacy Character Sets

  <http://www.czyborra.com/>
  <http://www.eki.ee/letter/>

· The Unicode support files live within the Perl installation in the
  directory

      $Config{installprivlib}/unicore

  in Perl 5.8.0 or newer, and

      $Config{installprivlib}/unicode

  in the Perl 5.6 series.  (The renaming to lib/unicore was done to
  avoid naming conflicts with lib/Unicode in case-insensitive
  filesystems.)  The main Unicode data file is UnicodeData.txt (or
  Unicode.301 in Perl 5.6.1.)  You can find the
  $Config{installprivlib} by

      perl "-V:installprivlib"

  You can explore various information from the Unicode data files
  using the "Unicode::UCD" module.

UNICODE IN OLDER PERLS
If you cannot upgrade your Perl to 5.8.0 or later, you can still do
some Unicode processing by using the modules "Unicode::String",
"Unicode::Map8", and "Unicode::Map", available from CPAN.  If you have
the GNU recode installed, you can also use the Perl front-end
"Convert::Recode" for character conversions.

The following are fast conversions from ISO 8859-1 (Latin-1) bytes to
UTF-8 bytes and back; the code works even with older Perl 5 versions.

    # ISO 8859-1 to UTF-8
    s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

    # UTF-8 to ISO 8859-1
    s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

SEE ALSO
perlunitut, perlunicode, Encode, open, utf8, bytes, perlretut, perlrun,
Unicode::Collate, Unicode::Normalize, Unicode::UCD

ACKNOWLEDGMENTS
Thanks to the kind readers of the perl5-porters@perl.org,
perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
mailing lists for their valuable feedback.

AUTHOR, COPYRIGHT, AND LICENSE
Copyright 2001-2002 Jarkko Hietaniemi <jhi@iki.fi>

This document may be distributed under the same terms as Perl itself.


perl v5.10.1                      2009-02-25                   PERLUNIINTRO(1)