PERLUNIINTRO(1)    Perl Programmers Reference Guide    PERLUNIINTRO(1)

NAME
perluniintro - Perl Unicode introduction

DESCRIPTION
This document gives a general idea of Unicode and how to use Unicode in
Perl.
11
12 Unicode
13
14 Unicode is a character set standard which plans to codify all of the
15 writing systems of the world, plus many other symbols.
16
17 Unicode and ISO/IEC 10646 are coordinated standards that provide code
18 points for characters in almost all modern character set standards,
19 covering more than 30 writing systems and hundreds of languages,
20 including all commercially-important modern languages. All characters
21 in the largest Chinese, Japanese, and Korean dictionaries are also
22 encoded. The standards will eventually cover almost all characters in
23 more than 250 writing systems and thousands of languages. Unicode 1.0
24 was released in October 1991, and 4.0 in April 2003.
25
A Unicode character is an abstract entity. It is not bound to any
particular integer width, especially not to the C language "char".
Unicode is language-neutral and display-neutral: it does not encode the
language of the text and it does not define fonts or other graphical
layout details. Unicode operates on characters and on text built from
those characters.
32
33 Unicode defines characters like "LATIN CAPITAL LETTER A" or "GREEK
34 SMALL LETTER ALPHA" and unique numbers for the characters, in this case
35 0x0041 and 0x03B1, respectively. These unique numbers are called code
36 points.
37
38 The Unicode standard prefers using hexadecimal notation for the code
39 points. If numbers like 0x0041 are unfamiliar to you, take a peek at a
40 later section, "Hexadecimal Notation". The Unicode standard uses the
41 notation "U+0041 LATIN CAPITAL LETTER A", to give the hexadecimal code
42 point and the normative name of the character.
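
As a small sketch of how a character, its code point, and its normative
name relate, the following uses "ord()", "sprintf()", and the
"viacode()" function of the core "charnames" module. The helper name
"describe()" is made up for this illustration and is not part of any
module.

```perl
use strict;
use warnings;
use charnames ':full';    # loads the Unicode name tables

# Return the "U+HHHH NAME" form for the first character of a string.
# describe() is a hypothetical helper, not a standard function.
sub describe {
    my ($char) = @_;
    my $code = ord($char);
    my $name = charnames::viacode($code);
    $name = '<unnamed>' unless defined $name;
    return sprintf("U+%04X %s", $code, $name);
}

print describe("A"), "\n";
print describe("\x{03B1}"), "\n";
```

The "%04X" format pads to at least four hexadecimal digits, which is the
convention the "U+" notation follows.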
43
Unicode also defines various properties for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation"; these
properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.
49
A Unicode character consists either of a single code point, or a base
character (like "LATIN CAPITAL LETTER A"), followed by one or more
modifiers (like "COMBINING ACUTE ACCENT"). This sequence of base
character and modifiers is called a combining character sequence.
54
Whether to call these combining character sequences "characters"
depends on your point of view. If you are a programmer, you probably
would tend towards seeing each element in the sequences as one unit, or
"character". The whole sequence could be seen as one "character",
however, from the user's point of view, since that's probably what it
looks like in the context of the user's language.

With this "whole sequence" view of characters, the total number of
characters is open-ended. But in the programmer's "one unit is one
character" point of view, the concept of "characters" is more
deterministic. In this document, we take that second point of view: one
"character" is one Unicode code point, be it a base character or a
combining character.
68
For some combinations, there are precomposed characters. "LATIN
CAPITAL LETTER A WITH ACUTE", for example, is defined as a single code
point. These precomposed characters are, however, only available for
some combinations, and are mainly meant to support round-trip
conversions between Unicode and legacy standards (like ISO 8859). In the
general case, the composing method is more extensible. To support
conversion between different compositions of the characters, various
normalization forms to standardize representations are also defined.
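
To make the difference between precomposed and combining forms concrete,
here is a minimal sketch using the "Unicode::Normalize" module (listed
under SEE ALSO). It only illustrates the idea of normalization forms;
the variable names are ours.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{00C1}";     # LATIN CAPITAL LETTER A WITH ACUTE
my $combining   = "A\x{0301}";    # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# The raw strings differ in both length and content.
my $raw_equal = $precomposed eq $combining ? 1 : 0;

# After normalizing both to the same form they compare equal.
my $nfc_equal = NFC($precomposed) eq NFC($combining) ? 1 : 0;
my $nfd_equal = NFD($precomposed) eq NFD($combining) ? 1 : 0;

printf "raw: %d, NFC: %d, NFD: %d\n", $raw_equal, $nfc_equal, $nfd_equal;
printf "lengths: NFC %d, NFD %d\n",
    length(NFC($combining)), length(NFD($precomposed));
```

NFC prefers the precomposed form where one exists, while NFD decomposes
into base character plus combining marks.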
77
78 Because of backward compatibility with legacy encodings, the "a unique
79 number for every character" idea breaks down a bit: instead, there is
80 "at least one number for every character". The same character could be
81 represented differently in several legacy encodings. The converse is
82 also not true: some code points do not have an assigned character.
83 Firstly, there are unallocated code points within otherwise used
84 blocks. Secondly, there are special Unicode control characters that do
85 not represent true characters.
86
A common myth about Unicode is that it is "16-bit", that is, that
Unicode can represent only 0x10000 (or 65536) characters, from 0x0000
to 0xFFFF. This is untrue. Since Unicode 2.0 (July 1996), Unicode has
been defined all the way up to 21 bits (0x10FFFF), and since Unicode
3.1 (March 2001), characters have been defined beyond 0xFFFF. The
first 0x10000 characters are called Plane 0, or the Basic
Multilingual Plane (BMP). With Unicode 3.1, 17 (yes, seventeen) planes
in all were defined--but they are nowhere near full of defined
characters, yet.
96
Another myth is that the 256-character blocks have something to do with
languages--that each block would define the characters used by a
language or a set of languages. This is also untrue. The division into
blocks exists, but it is almost completely accidental--an artifact of
how the characters have been and still are allocated. Instead, there
is a concept called scripts, which is more useful: there is the "Latin"
script, the "Greek" script, and so on. Scripts usually span varied
parts of several blocks. For further information see Unicode::UCD.
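
The following sketch shows two ways of asking about scripts: the
"charscript()" function of "Unicode::UCD", and a script property in a
regular expression. The sample text and variable names are only for
illustration.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Look up the script of individual code points.
for my $code (0x0041, 0x03B1, 0x05D0) {
    printf "U+%04X is in the %s script\n", $code, charscript($code);
}

# Pick out the characters of one script from mixed text.
my $text  = "abc \x{03B1}\x{03B2} xyz";
my @greek = $text =~ /(\p{Greek})/g;
print scalar(@greek), " Greek characters found\n";
```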
105
The Unicode code points are just abstract numbers. To input and output
these abstract numbers, the numbers must be encoded or serialised
somehow. Unicode defines several character encoding forms, of which
UTF-8 is perhaps the most popular. UTF-8 is a variable length encoding
that encodes Unicode characters as 1 to 6 bytes (only 4 with the
currently defined characters). Other encodings include UTF-16 and
UTF-32 and their big- and little-endian variants (UTF-8 is byte-order
independent). The ISO/IEC 10646 defines the UCS-2 and UCS-4 encoding
forms.
114
For more information about encodings--for instance, to learn what
surrogates and byte order marks (BOMs) are--see perlunicode.
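
As a rough illustration of how one code point serialises differently
under different encoding forms, this sketch uses "Encode::encode()" to
show the bytes for one character inside the BMP and one beyond it. The
helper "hex_bytes()" is a made-up name for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the encoded bytes of $string as space-separated hex pairs.
# hex_bytes() is a hypothetical helper for this illustration.
sub hex_bytes {
    my ($encoding, $string) = @_;
    return join " ", map { sprintf "%02X", $_ }
                     unpack("C*", encode($encoding, $string));
}

my $alpha  = "\x{03B1}";     # GREEK SMALL LETTER ALPHA, inside the BMP
my $beyond = "\x{10400}";    # a character beyond the BMP

for my $enc (qw(UTF-8 UTF-16BE UTF-16LE UTF-32BE)) {
    printf "%-8s  %-11s  %s\n",
        $enc, hex_bytes($enc, $alpha), hex_bytes($enc, $beyond);
}
```

The character beyond the BMP needs four bytes in UTF-8 and a pair of
16-bit units (a surrogate pair) in UTF-16.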
117
118 Perl's Unicode Support
119
120 Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
121 natively. Perl 5.8.0, however, is the first recommended release for
122 serious Unicode work. The maintenance release 5.6.1 fixed many of the
123 problems of the initial Unicode implementation, but for example regular
124 expressions still do not work with Unicode in 5.6.1.
125
Starting from Perl 5.8.0, the use of "use utf8" is no longer necessary.
In earlier releases the "utf8" pragma was used to declare that
operations in the current block or file would be Unicode-aware. This
model was found to be wrong, or at least clumsy: the "Unicodeness" is
now carried with the data, instead of being attached to the operations.
Only one case remains where an explicit "use utf8" is needed: if your
Perl script itself is encoded in UTF-8, you can use UTF-8 in your
identifier names, and in string and regular expression literals, by
saying "use utf8". This is not the default because scripts with legacy
8-bit data in them would break. See utf8.
136
137 Perl's Unicode Model
138
139 Perl supports both pre-5.6 strings of eight-bit native bytes, and
140 strings of Unicode characters. The principle is that Perl tries to
141 keep its data as eight-bit bytes for as long as possible, but as soon
142 as Unicodeness cannot be avoided, the data is transparently upgraded to
143 Unicode.
144
Internally, Perl currently uses either the native eight-bit character
set of the platform (for example Latin-1) or UTF-8 to encode Unicode
strings. Specifically, if all code points in the string are 0xFF or
less, Perl uses the native eight-bit character set. Otherwise, it
uses UTF-8.
150
151 A user of Perl does not normally need to know nor care how Perl happens
152 to encode its internal strings, but it becomes relevant when outputting
153 Unicode strings to a stream without a PerlIO layer -- one with the
154 "default" encoding. In such a case, the raw bytes used internally (the
155 native character set or UTF-8, as appropriate for each string) will be
156 used, and a "Wide character" warning will be issued if those strings
157 contain a character beyond 0x00FF.
158
159 For example,
160
161 perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
162
163 produces a fairly useless mixture of native bytes and UTF-8, as well as
164 a warning:
165
166 Wide character in print at ...
167
168 To output UTF-8, use the ":utf8" output layer. Prepending
169
170 binmode(STDOUT, ":utf8");
171
172 to this sample program ensures that the output is completely UTF-8, and
173 removes the program's warning.
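
To see what an encoding layer actually writes, one option is to print to
an in-memory file and inspect the resulting bytes. This is only a
sketch: it uses the ":encoding(UTF-8)" layer rather than ":utf8", and
the variable names are ours.

```perl
use strict;
use warnings;

# Write the two sample lines through a UTF-8 encoding layer into $buffer.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer)
    or die "cannot open in-memory file: $!";
print {$out} "\x{DF}\n", "\x{0100}\x{DF}\n";
close($out) or die "close failed: $!";

# Show the bytes that were written, one hex pair per byte.
my $hex = join " ", map { sprintf "%02x", $_ } unpack("C*", $buffer);
print "$hex\n";
```

Both occurrences of 0xDF come out as the same two bytes, so the output
is consistently UTF-8 and no "Wide character" warning is issued.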
174
You can enable automatic UTF-8-ification of your standard file handles,
default "open()" layer, and @ARGV by using either the "-C" command line
switch or the "PERL_UNICODE" environment variable; see perlrun for the
documentation of the "-C" switch.
179
180 Note that this means that Perl expects other software to work, too: if
181 Perl has been led to believe that STDIN should be UTF-8, but then STDIN
182 coming in from another command is not UTF-8, Perl will complain about
183 the malformed UTF-8.
184
185 All features that combine Unicode and I/O also require using the new
186 PerlIO feature. Almost all Perl 5.8 platforms do use PerlIO, though:
187 you can see whether yours is by running "perl -V" and looking for
188 "useperlio=define".
189
190 Unicode and EBCDIC
191
Perl 5.8.0 also supports Unicode on EBCDIC platforms. There, Unicode
support is somewhat more complex to implement since additional
conversions are needed at every step. Some problems remain; see
perlebcdic for details.

In any case, the Unicode support on EBCDIC platforms is better than in
the 5.6 series, which didn't work much at all for EBCDIC platforms. On
EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
instead of UTF-8. The difference is that UTF-8 is "ASCII-safe", in
that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
"EBCDIC-safe".
203
204 Creating Unicode
205
206 To create Unicode characters in literals for code points above 0xFF,
207 use the "\x{...}" notation in double-quoted strings:
208
209 my $smiley = "\x{263a}";
210
Similarly, it can be used in regular expression literals:
212
213 $smiley =~ /\x{263a}/;
214
215 At run-time you can use "chr()":
216
217 my $hebrew_alef = chr(0x05d0);
218
219 See "Further Resources" for how to find all these numeric codes.
220
221 Naturally, "ord()" will do the reverse: it turns a character into a
222 code point.
223
224 Note that "\x.." (no "{}" and only two hexadecimal digits), "\x{...}",
225 and "chr(...)" for arguments less than 0x100 (decimal 256) generate an
226 eight-bit character for backward compatibility with older Perls. For
227 arguments of 0x100 or more, Unicode characters are always produced. If
228 you want to force the production of Unicode characters regardless of
229 the numeric value, use "pack("U", ...)" instead of "\x..", "\x{...}",
230 or "chr()".
231
232 You can also use the "charnames" pragma to invoke characters by name in
233 double-quoted strings:
234
235 use charnames ':full';
236 my $arabic_alef = "\N{ARABIC LETTER ALEF}";
237
238 And, as mentioned above, you can also "pack()" numbers into Unicode
239 characters:
240
241 my $georgian_an = pack("U", 0x10a0);
242
Note that both "\x{...}" and "\N{...}" are compile-time string
constants: you cannot use variables in them. If you want similar
run-time functionality, use "chr()" and "charnames::vianame()".
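
A minimal sketch of the run-time alternatives just mentioned: the code
point and the character name are held in ordinary variables, as they
would be if they came from user input. The variable names are
illustrative only.

```perl
use strict;
use warnings;
use charnames ':full';

# A code point known only at run time.
my $code = 0x05D0;
my $alef = chr($code);

# A character name known only at run time; vianame() returns the code
# point, or undef if the name is unknown.
my $name      = "GREEK SMALL LETTER ALPHA";
my $name_code = charnames::vianame($name);
my $alpha     = defined $name_code ? chr($name_code) : undef;

printf "%#x %#x\n", ord($alef), ord($alpha);
```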
246
247 If you want to force the result to Unicode characters, use the special
248 "U0" prefix. It consumes no arguments but forces the result to be in
249 Unicode characters, instead of bytes.
250
251 my $chars = pack("U0C*", 0x80, 0x42);
252
253 Likewise, you can force the result to be bytes by using the special
254 "C0" prefix.
255
256 Handling Unicode
257
258 Handling Unicode is for the most part transparent: just use the strings
259 as usual. Functions like "index()", "length()", and "substr()" will
260 work on the Unicode characters; regular expressions will work on the
261 Unicode characters (see perlunicode and perlretut).
262
Note that Perl counts each element of a combining character sequence as
a separate character, so for example
265
266 use charnames ':full';
267 print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";
268
269 will print 2, not 1. The only exception is that regular expressions
270 have "\X" for matching a combining character sequence.
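
As an illustration of the difference, the following sketch counts the
same string both ways: by code point with "length()", and by combining
character sequence with "\X". The sample string is ours.

```perl
use strict;
use warnings;

my $string = "A\x{0301}bc";    # A + COMBINING ACUTE ACCENT, then "b" and "c"

# length() counts code points; each \X match is one whole sequence.
my $code_points = length($string);
my @clusters    = $string =~ /(\X)/g;

printf "%d code points, %d user-visible characters\n",
    $code_points, scalar @clusters;
```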
271
272 Life is not quite so transparent, however, when working with legacy
273 encodings, I/O, and certain special cases:
274
275 Legacy Encodings
276
When you combine legacy data and Unicode, the legacy data needs to be
upgraded to Unicode. Normally ISO 8859-1 (or EBCDIC, if applicable) is
assumed. You can override this assumption by using the "encoding"
pragma, for example
281
282 use encoding 'latin2'; # ISO 8859-2
283
284 in which case literals (string or regular expressions), "chr()", and
285 "ord()" in your whole script are assumed to produce Unicode characters
286 from ISO 8859-2 code points. Note that the matching for encoding names
287 is forgiving: instead of "latin2" you could have said "Latin 2", or
288 "iso8859-2", or other variations. With just
289
290 use encoding;
291
292 the environment variable "PERL_ENCODING" will be consulted. If that
293 variable isn't set, the encoding pragma will fail.
294
295 The "Encode" module knows about many encodings and has interfaces for
296 doing conversions between those encodings:
297
298 use Encode 'decode';
299 $data = decode("iso-8859-3", $data); # convert from legacy to utf-8
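
To show the two directions side by side, here is a small sketch that
decodes one ISO 8859-2 byte into a Unicode character and then encodes
that character as UTF-8 bytes. The sample byte and the variable names
are chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $latin2_bytes = "\xB1";    # one byte: a with ogonek in ISO 8859-2

# decode(): legacy bytes in, Unicode characters out.
my $characters = decode("iso-8859-2", $latin2_bytes);

# encode(): Unicode characters in, bytes of the target encoding out.
my $utf8_bytes = encode("UTF-8", $characters);

printf "code point U+%04X, %d character(s)\n",
    ord($characters), length($characters);
printf "UTF-8 bytes: %s\n",
    join " ", map { sprintf "%02X", $_ } unpack("C*", $utf8_bytes);
```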
300
301 Unicode I/O
302
303 Normally, writing out Unicode data
304
305 print FH $some_string_with_unicode, "\n";
306
produces raw bytes that Perl happens to use to internally encode the
Unicode string. Perl's internal encoding depends on the system as well
as what characters happen to be in the string at the time. If any of
the characters are at code points 0x100 or above, you will get a
warning. To ensure that the output is explicitly rendered in the
encoding you desire--and to avoid the warning--open the stream with the
desired encoding. Some examples:
314
315 open FH, ">:utf8", "file";
316
317 open FH, ">:encoding(ucs2)", "file";
318 open FH, ">:encoding(UTF-8)", "file";
319 open FH, ">:encoding(shift_jis)", "file";
320
321 and on already open streams, use "binmode()":
322
323 binmode(STDOUT, ":utf8");
324
325 binmode(STDOUT, ":encoding(ucs2)");
326 binmode(STDOUT, ":encoding(UTF-8)");
327 binmode(STDOUT, ":encoding(shift_jis)");
328
329 The matching of encoding names is loose: case does not matter, and many
330 encodings have several aliases. Note that the ":utf8" layer must
331 always be specified exactly like that; it is not subject to the loose
332 matching of encoding names.
333
334 See PerlIO for the ":utf8" layer, PerlIO::encoding and Encode::PerlIO
335 for the ":encoding()" layer, and Encode::Supported for many encodings
336 supported by the "Encode" module.
337
Reading in a file that you know happens to be encoded in one of the
Unicode or legacy encodings does not magically turn the data into
Unicode in Perl's eyes. To do that, specify the appropriate layer when
opening files:

    open(my $fh, '<:utf8', 'anything');
    my $line_of_unicode = <$fh>;

    open(my $big5_fh, '<:encoding(Big5)', 'anything');
    my $another_line_of_unicode = <$big5_fh>;
348
349 The I/O layers can also be specified more flexibly with the "open"
350 pragma. See open, or look at the following example.
351
352 use open ':utf8'; # input and output default layer will be UTF-8
353 open X, ">file";
354 print X chr(0x100), "\n";
355 close X;
356 open Y, "<file";
357 printf "%#x\n", ord(<Y>); # this should print 0x100
358 close Y;
359
With the "open" pragma you can use the ":locale" layer:

    BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
    # the :locale will probe the locale environment variables like LC_ALL
    use open OUT => ':locale'; # russki parusski
    open(O, ">koi8");
    print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
    close O;
    open(I, "<koi8");
    printf "%#x\n", ord(<I>); # this should print 0xc1
    close I;
371
or you can also use the ':encoding(...)' layer:
373
374 open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
375 my $line_of_unicode = <$epic>;
376
These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream. The result is always Unicode.
380
The open pragma affects all the "open()" calls after the pragma by
setting default layers. If you want to affect only certain streams, use
explicit layers directly in the "open()" call.

You can switch encodings on an already opened stream by using
"binmode()"; see "binmode" in perlfunc.

The ":locale" does not currently (as of Perl 5.8.0) work with "open()"
and "binmode()", only with the "open" pragma. The ":utf8" and
":encoding(...)" methods do work with all of "open()", "binmode()", and
the "open" pragma.
392
Similarly, you may use these I/O layers on output streams to
automatically convert Unicode to the specified encoding when it is
written to the stream. For example, the following snippet copies the
contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
the file "text.utf8", encoded as UTF-8:
398
399 open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
400 open(my $unicode, '>:utf8', 'text.utf8');
401 while (<$nihongo>) { print $unicode $_ }
402
403 The naming of encodings, both by the "open()" and by the "open" pragma,
404 is similar to the "encoding" pragma in that it allows for flexible
405 names: "koi8-r" and "KOI8R" will both be understood.
406
Common encodings recognized by ISO, MIME, IANA, and various other
standardisation organisations are recognised; for a more detailed list
see Encode::Supported.
410
411 "read()" reads characters and returns the number of characters.
412 "seek()" and "tell()" operate on byte counts, as do "sysread()" and
413 "sysseek()".
414
Notice that because of the default behaviour of not doing any
conversion upon input if there is no default layer, it is easy to
mistakenly write code that keeps on expanding a file by repeatedly
encoding the data:
419
420 # BAD CODE WARNING
421 open F, "file";
422 local $/; ## read in the whole file of 8-bit characters
423 $t = <F>;
424 close F;
425 open F, ">:utf8", "file";
426 print F $t; ## convert to UTF-8 on output
427 close F;
428
If you run this code twice, the contents of the file will be twice
UTF-8 encoded. A "use open ':utf8'" would have avoided the bug, as
would explicitly opening the file for input as UTF-8, too.
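
The effect of the bug can be sketched without touching any files by
calling "Encode::encode()" twice on the same data. This is an
illustration of the double-encoding problem, not a reproduction of the
file-based code above; "hex_of()" is a made-up helper.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# hex_of() is a hypothetical helper: bytes as space-separated hex pairs.
sub hex_of { join " ", map { sprintf "%02x", $_ } unpack("C*", $_[0]) }

my $text  = "\xE9";                    # e with acute, one 8-bit character
my $once  = encode("UTF-8", $text);    # what the first run writes
my $twice = encode("UTF-8", $once);    # second run: the bytes are misread
                                       # as characters and encoded again

print "once:  ", hex_of($once),  "\n";
print "twice: ", hex_of($twice), "\n";

# Decoding on input before encoding on output keeps the data stable.
my $stable = encode("UTF-8", decode("UTF-8", $once));
print "fixed: ", hex_of($stable), "\n";
```

Each extra pass roughly doubles the bytes used for every non-ASCII
character, which is why the file keeps growing.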
432
433 NOTE: the ":utf8" and ":encoding" features work only if your Perl has
434 been built with the new PerlIO feature (which is the default on most
435 systems).
436
437 Displaying Unicode As Text
438
Sometimes you might want to display Perl scalars containing Unicode as
simple ASCII (or EBCDIC) text. The following subroutine converts its
argument so that Unicode characters with code points greater than 255
are displayed as "\x{...}", control characters (like "\n") are
displayed as "\x..", and the rest of the characters as themselves:

    sub nice_string {
        join("",
          map { $_ > 255 ?                  # if wide character...
                sprintf("\\x{%04X}", $_) :  # \x{...}
                chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
                sprintf("\\x%02X", $_) :    # \x..
                quotemeta(chr($_))          # else quoted or as themselves
          } unpack("U*", $_[0]));           # unpack Unicode characters
    }
454
455 For example,
456
457 nice_string("foo\x{100}bar\n")
458
459 returns the string
460
461 'foo\x{0100}bar\x0A'
462
463 which is ready to be printed.
464
465 Special Cases
466
467 · Bit Complement Operator ~ And vec()
468
The bit complement operator "~" may produce surprising results if
used on strings containing characters with ordinal values above
255. In such a case, the results are consistent with the internal
encoding of the characters, but not with much else. So don't do
that. Similarly for "vec()": you will be operating on the
internally-encoded bit patterns of the Unicode characters, not on the
code point values, which is very probably not what you want.
476
477 · Peeking At Perl's Internal Encoding
478
Normal users of Perl should never care how Perl encodes any
particular Unicode string (because the normal ways to get at the
contents of a string with Unicode--via input and output--should always
be via explicitly-defined I/O layers). But if you must, there are two
ways of looking behind the scenes.

One way of peeking inside the internal encoding of Unicode
characters is to use "unpack("C*", ...)" to get the bytes or
"unpack("U0(H2)*", ...)" to display the bytes:

    # this prints c4 80 for the UTF-8 bytes 0xc4 0x80
    print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
491
492 Yet another way would be to use the Devel::Peek module:
493
494 perl -MDevel::Peek -e 'Dump(chr(0x100))'
495
496 That shows the "UTF8" flag in FLAGS and both the UTF-8 bytes and
497 Unicode characters in "PV". See also later in this document the
498 discussion about the "utf8::is_utf8()" function.
499
500 Advanced Topics
501
502 · String Equivalence
503
504 The question of string equivalence turns somewhat complicated in
505 Unicode: what do you mean by "equal"?
506
507 (Is "LATIN CAPITAL LETTER A WITH ACUTE" equal to "LATIN CAPITAL
508 LETTER A"?)
509
The short answer is that by default Perl compares equivalence
("eq", "ne") based only on code points of the characters. In the
above case, the answer is no (because 0x00C1 != 0x0041). But
sometimes, any CAPITAL LETTER As should be considered equal, or even As
of any case.

The long answer is that you need to consider character
normalization and casing issues: see Unicode::Normalize, Unicode
Technical Reports #15 and #21, Unicode Normalization Forms and Case
Mappings, http://www.unicode.org/unicode/reports/tr15/ and
http://www.unicode.org/unicode/reports/tr21/

As of Perl 5.8.0, the "Full" case-folding of Case
Mappings/SpecialCasing is implemented.
524
525 · String Collation
526
People like to see their strings nicely sorted--or as Unicode
parlance goes, collated. But again, what do you mean by collate?

(Does "LATIN CAPITAL LETTER A WITH ACUTE" come before or after
"LATIN CAPITAL LETTER A WITH GRAVE"?)

The short answer is that by default, Perl compares strings ("lt",
"le", "cmp", "ge", "gt") based only on the code points of the
characters. In the above case, the answer is "after", since 0x00C1 >
0x00C0.
537
538 The long answer is that "it depends", and a good answer cannot be
539 given without knowing (at the very least) the language context.
540 See Unicode::Collate, and Unicode Collation Algorithm
541 http://www.unicode.org/unicode/reports/tr10/
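
The difference between code point order and collation can be sketched
with the "Unicode::Collate" module mentioned above, using its default
settings (no language tailoring). The word list and the "show()" helper
are made up for this example.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = ("zebra", "\x{E9}clair", "eclair", "Apple", "apple", "fig");

# Plain sort compares code points; the collator applies the default
# Unicode Collation Algorithm ordering.
my @by_code_point = sort @words;
my @collated      = Unicode::Collate->new->sort(@words);

# show() is a hypothetical helper that escapes non-ASCII for display.
sub show {
    return join ", ", map {
        (my $w = $_) =~ s/([^\x00-\x7F])/sprintf("\\x{%X}", ord($1))/ge;
        $w;
    } @_;
}

print "code points: ", show(@by_code_point), "\n";
print "collated:    ", show(@collated), "\n";
```

With code point order the accented word lands after "zebra"; with
collation it sorts next to its unaccented counterpart. A real
application would normally also choose a locale-specific tailoring.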
542
543 Miscellaneous
544
545 · Character Ranges and Classes
546
Character ranges in regular expression character classes
("/[a-z]/") and in the "tr///" (also known as "y///") operator are
not magically Unicode-aware. What this means is that "[A-Za-z]" will
not magically start to mean "all alphabetic letters"; not that it
means that even for 8-bit characters: you should be using
"/[[:alpha:]]/" in that case.

For specifying character classes like that in regular expressions,
you can use the various Unicode properties--"\pL", or perhaps
"\p{Alphabetic}", in this particular case. You can use Unicode
code points as the end points of character ranges, but there is no
magic associated with specifying a certain range. For further
information--there are dozens of Unicode character classes--see
perlunicode.
561
562 · String-To-Number Conversions
563
564 Unicode does define several other decimal--and numeric--characters
565 besides the familiar 0 to 9, such as the Arabic and Indic digits.
566 Perl does not support string-to-number conversion for digits other
567 than ASCII 0 to 9 (and ASCII a to f for hexadecimal).
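
Both points in this list can be sketched briefly. The first half counts
letters with an ASCII range and with a Unicode property; the second half
maps Arabic-Indic digits to ASCII with "tr///" before using them as a
number. The sample strings are ours, and the "tr///" mapping is just one
possible workaround covering a single set of digits.

```perl
use strict;
use warnings;

# "naive" with i-diaeresis, then Greek alpha and beta
my $text = "na\x{EF}ve \x{3B1}\x{3B2}";

# An ASCII range misses the accented and Greek letters; \p{L} does not.
my $ascii_letters   = () = $text =~ /[A-Za-z]/g;
my $unicode_letters = () = $text =~ /\p{L}/g;
print "$ascii_letters ASCII letters, $unicode_letters Unicode letters\n";

# ARABIC-INDIC DIGITS ONE, TWO, THREE
my $arabic_digits = "\x{661}\x{662}\x{663}";

# Map U+0660..U+0669 onto ASCII 0..9 so Perl can treat it as a number.
(my $ascii_digits = $arabic_digits) =~ tr/\x{660}-\x{669}/0-9/;
my $sum = $ascii_digits + 1;
print "$sum\n";
```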
568
569 Questions With Answers
570
571 · Will My Old Scripts Break?
572
Very probably not. Unless you are generating Unicode characters
somehow, old behaviour should be preserved. About the only behaviour
that has changed and which could start generating Unicode is
the old behaviour of "chr()" where supplying an argument more than
255 produced a character modulo 256. "chr(300)", for example, was
equal to "chr(44)" or "," (in ASCII); now it is LATIN CAPITAL LETTER
I WITH BREVE.
580
581 · How Do I Make My Scripts Work With Unicode?
582
583 Very little work should be needed since nothing changes until you
584 generate Unicode data. The most important thing is getting input
585 as Unicode; for that, see the earlier I/O discussion.
586
587 · How Do I Know Whether My String Is In Unicode?
588
589 You shouldn't care. No, you really shouldn't. No, really. If you
590 have to care--beyond the cases described above--it means that we
591 didn't get the transparency of Unicode quite right.
592
593 Okay, if you insist:
594
595 print utf8::is_utf8($string) ? 1 : 0, "\n";
596
But note that this doesn't mean that any of the characters in the
string are necessarily UTF-8 encoded, or that any of the characters
have code points greater than 0xFF (255) or even 0x80 (128), or
that the string has any characters at all. All the "is_utf8()"
does is to return the value of the internal "utf8ness" flag
attached to the $string. If the flag is off, the bytes in the
scalar are interpreted as a single byte encoding. If the flag is
on, the bytes in the scalar are interpreted as the (multi-byte,
variable-length) UTF-8 encoded code points of the characters.
Bytes added to a UTF-8 encoded string are automatically upgraded
to UTF-8. If mixed non-UTF-8 and UTF-8 scalars are merged
(double-quoted interpolation, explicit concatenation, and
printf/sprintf parameter substitution), the result will be UTF-8
encoded as if copies of the byte strings were upgraded to UTF-8:
for example,
612
613 $a = "ab\x80c";
614 $b = "\x{100}";
615 print "$a = $b\n";
616
617 the output string will be UTF-8-encoded "ab\x80c = \x{100}\n", but
618 $a will stay byte-encoded.
619
620 Sometimes you might really need to know the byte length of a string
621 instead of the character length. For that use either the
622 "Encode::encode_utf8()" function or the "bytes" pragma and its only
623 defined function "length()":
624
625 my $unicode = chr(0x100);
626 print length($unicode), "\n"; # will print 1
627 require Encode;
628 print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
629 use bytes;
630 print length($unicode), "\n"; # will also print 2
631 # (the 0xC4 0x80 of the UTF-8)
632
633 · How Do I Detect Data That's Not Valid In a Particular Encoding?
634
635 Use the "Encode" package to try converting it. For example,
636
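    use Encode 'decode_utf8';
    # FB_CROAK makes decode_utf8() die on malformed input instead of
    # quietly substituting characters; LEAVE_SRC keeps the input intact.
    if (eval { decode_utf8($string_of_bytes_that_I_think_is_utf8,
                           Encode::FB_CROAK | Encode::LEAVE_SRC); 1 }) {
        # valid
    } else {
        # invalid
    }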
643
644 For UTF-8 only, you can use:
645
646 use warnings;
647 @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);
648
If invalid, a "Malformed UTF-8 character (byte 0x##) in unpack"
warning is produced. The "U0" means "expect strictly UTF-8 encoded
Unicode". Without that, "unpack("U*", ...)" would also accept
data like "chr(0xFF)", similarly to the "pack" as we saw earlier.
653
654 · How Do I Convert Binary Data Into a Particular Encoding, Or Vice
655 Versa?
656
657 This probably isn't as useful as you might think. Normally, you
658 shouldn't need to.
659
In one sense, what you are asking doesn't make much sense:
encodings are for characters, and binary data are not "characters", so
converting "data" into some encoding isn't meaningful unless you
know what character set and encoding the binary data is in, in
which case it's not just binary data, now is it?

If you have a raw sequence of bytes that you know should be
interpreted via a particular encoding, you can use "Encode":
668
669 use Encode 'from_to';
670 from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8
671
672 The call to "from_to()" changes the bytes in $data, but nothing
673 material about the nature of the string has changed as far as Perl
674 is concerned. Both before and after the call, the string $data
675 contains just a bunch of 8-bit bytes. As far as Perl is concerned,
676 the encoding of the string remains as "system-native 8-bit bytes".
677
678 You might relate this to a fictional 'Translate' module:
679
680 use Translate;
681 my $phrase = "Yes";
682 Translate::from_to($phrase, 'english', 'deutsch');
683 ## phrase now contains "Ja"
684
The contents of the string change, but not the nature of the
string. Perl doesn't know any more after the call than before that
the contents of the string indicate the affirmative.

Back to converting data. If you have (or want) data in your
system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can
use pack/unpack to convert to/from Unicode.
692
693 $native_string = pack("C*", unpack("U*", $Unicode_string));
694 $Unicode_string = pack("U*", unpack("C*", $native_string));
695
696 If you have a sequence of bytes you know is valid UTF-8, but Perl
697 doesn't know it yet, you can make Perl a believer, too:
698
699 use Encode 'decode_utf8';
700 $Unicode = decode_utf8($bytes);
701
702 You can convert well-formed UTF-8 to a sequence of bytes, but if
703 you just want to convert random binary data into UTF-8, you can't.
704 Any random collection of bytes isn't well-formed UTF-8. You can
705 use "unpack("C*", $string)" for the former, and you can create
706 well-formed Unicode data by "pack("U*", 0xff, ...)".
707
708 · How Do I Display Unicode? How Do I Input Unicode?
709
710 See http://www.alanwood.net/unicode/ and
711 http://www.cl.cam.ac.uk/~mgk25/unicode.html
712
713 · How Does Unicode Work With Traditional Locales?
714
715 In Perl, not very well. Avoid using locales through the "locale"
716 pragma. Use only one or the other. But see perlrun for the
717 description of the "-C" switch and its environment counterpart,
718 $ENV{PERL_UNICODE} to see how to enable various Unicode features,
719 for example by using locale settings.
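
As a follow-up to the question "How Do I Know Whether My String Is In
Unicode?" above, the automatic upgrading described there can be sketched
with "utf8::is_utf8()". The variable names are ours; the point is only
that merging a byte string with a Unicode string changes the internal
flag of the result, not the characters.

```perl
use strict;
use warnings;

my $bytes = "ab\x80c";           # four eight-bit characters
my $wide  = "\x{100}";           # one character above 0xFF
my $mixed = "$bytes = $wide";    # interpolation upgrades a copy of $bytes

# The flag differs, but $bytes itself is left alone.
printf "bytes: flag %d, length %d\n",
    utf8::is_utf8($bytes) ? 1 : 0, length($bytes);
printf "mixed: flag %d, length %d\n",
    utf8::is_utf8($mixed) ? 1 : 0, length($mixed);

# The character that was 0x80 is still 0x80 after the upgrade.
printf "character 2 is still %#x\n", ord(substr($mixed, 2, 1));
```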
720
721 Hexadecimal Notation
722
The Unicode standard prefers using hexadecimal notation because that
more clearly shows the division of Unicode into blocks of 256
characters. Hexadecimal is also simply shorter than decimal. You can
use decimal notation, too, but learning to use hexadecimal just makes
life easier with the Unicode standard. The "U+HHHH" notation uses
hexadecimal, for example.

The "0x" prefix means a hexadecimal number; the digits are 0-9 and a-f
(or A-F, case doesn't matter). Each hexadecimal digit represents four
bits, or half a byte. "print 0x..., "\n"" will show a hexadecimal
number in decimal, and "printf "%x\n", $decimal" will show a decimal
number in hexadecimal. If you have just the "hex digits" of a
hexadecimal number, you can use the "hex()" function.
736
737 print 0x0009, "\n"; # 9
738 print 0x000a, "\n"; # 10
739 print 0x000f, "\n"; # 15
740 print 0x0010, "\n"; # 16
741 print 0x0011, "\n"; # 17
742 print 0x0100, "\n"; # 256
743
744 print 0x0041, "\n"; # 65
745
746 printf "%x\n", 65; # 41
747 printf "%#x\n", 65; # 0x41
748
749 print hex("41"), "\n"; # 65
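
Building on "hex()" and "sprintf()", the following sketch converts
between the "U+HHHH" notation and actual characters. Both helper names
are hypothetical, and the accepted input format (four to six hex digits
after "U+") is an assumption made for this example.

```perl
use strict;
use warnings;

# Turn a "U+HHHH" string into the character it names, or undef if the
# string does not look like that notation.
sub char_from_notation {
    my ($notation) = @_;
    return undef unless $notation =~ /\AU\+([0-9A-Fa-f]{4,6})\z/;
    return chr(hex($1));
}

# Turn the first character of a string back into "U+HHHH" notation.
sub notation_from_char {
    my ($char) = @_;
    return sprintf("U+%04X", ord($char));
}

print notation_from_char(char_from_notation("U+263A")), "\n";
```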
750
751 Further Resources
752
753 · Unicode Consortium
754
755 http://www.unicode.org/
756
757 · Unicode FAQ
758
759 http://www.unicode.org/unicode/faq/
760
761 · Unicode Glossary
762
763 http://www.unicode.org/glossary/
764
765 · Unicode Useful Resources
766
767 http://www.unicode.org/unicode/onlinedat/resources.html
768
769 · Unicode and Multilingual Support in HTML, Fonts, Web Browsers and
770 Other Applications
771
772 http://www.alanwood.net/unicode/
773
774 · UTF-8 and Unicode FAQ for Unix/Linux
775
776 http://www.cl.cam.ac.uk/~mgk25/unicode.html
777
778 · Legacy Character Sets
779
780 http://www.czyborra.com/
781 http://www.eki.ee/letter/
782
783 · The Unicode support files live within the Perl installation in the
784 directory
785
786 $Config{installprivlib}/unicore
787
788 in Perl 5.8.0 or newer, and
789
790 $Config{installprivlib}/unicode
791
792 in the Perl 5.6 series. (The renaming to lib/unicore was done to
793 avoid naming conflicts with lib/Unicode in case-insensitive
filesystems.) The main Unicode data file is UnicodeData.txt (or
Unicode.301 in Perl 5.6.1.) You can find the
$Config{installprivlib} by
797
798 perl "-V:installprivlib"
799
800 You can explore various information from the Unicode data files
801 using the "Unicode::UCD" module.
802
UNICODE IN OLDER PERLS
If you cannot upgrade your Perl to 5.8.0 or later, you can still do
some Unicode processing by using the modules "Unicode::String",
"Unicode::Map8", and "Unicode::Map", available from CPAN. If you have
the GNU recode installed, you can also use the Perl front-end
"Convert::Recode" for character conversions.

The following are fast conversions from ISO 8859-1 (Latin-1) bytes to
UTF-8 bytes and back; the code works even with older Perl 5 versions.
812
    # ISO 8859-1 to UTF-8
    s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

    # UTF-8 to ISO 8859-1
    s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
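
As a quick check of the two substitutions, they can be wrapped in
subroutines and run as a round trip. The subroutine names are made up
for this sketch; the substitutions themselves are the ones shown above.

```perl
use strict;
use warnings;

# Hypothetical wrapper: ISO 8859-1 bytes in, UTF-8 bytes out.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    $bytes =~ s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
    return $bytes;
}

# Hypothetical wrapper: UTF-8 bytes (Latin-1 range only) in,
# ISO 8859-1 bytes out.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    $bytes =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
    return $bytes;
}

my $latin1 = "caf\xE9";    # "cafe" with e-acute as one Latin-1 byte
my $utf8   = latin1_to_utf8($latin1);

printf "%s\n", join " ", map { sprintf "%02x", $_ } unpack("C*", $utf8);
print utf8_to_latin1($utf8) eq $latin1
    ? "round trip ok\n"
    : "round trip failed\n";
```

Note that the reverse conversion only handles lead bytes 0xC2 and 0xC3,
that is, characters that fit in Latin-1; other UTF-8 sequences pass
through unchanged.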
818
SEE ALSO
perlunicode, Encode, encoding, open, utf8, bytes, perlretut, perlrun,
Unicode::Collate, Unicode::Normalize, Unicode::UCD

ACKNOWLEDGMENTS
Thanks to the kind readers of the perl5-porters@perl.org,
perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
mailing lists for their valuable feedback.

AUTHOR, COPYRIGHT, AND LICENSE
Copyright 2001-2002 Jarkko Hietaniemi <jhi@iki.fi>

This document may be distributed under the same terms as Perl itself.
832
833
834
perl v5.8.8                       2006-01-07                 PERLUNIINTRO(1)