PERLUNIINTRO(1)        Perl Programmers Reference Guide        PERLUNIINTRO(1)


NAME
perluniintro - Perl Unicode introduction

DESCRIPTION
This document gives a general idea of Unicode and how to use Unicode in
Perl.

Unicode
Unicode is a character set standard which plans to codify all of the
writing systems of the world, plus many other symbols.

Unicode and ISO/IEC 10646 are coordinated standards that provide code
points for characters in almost all modern character set standards,
covering more than 30 writing systems and hundreds of languages,
including all commercially-important modern languages.  All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded.  The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.  Unicode 1.0
was released in October 1991, and 4.0 in April 2003.

A Unicode character is an abstract entity.  It is not bound to any
particular integer width, especially not to the C language "char".
Unicode is language-neutral and display-neutral: it does not encode the
language of the text, and it does not generally define fonts or other
graphical layout details.  Unicode operates on characters and on text
built from those characters.

Unicode defines characters like "LATIN CAPITAL LETTER A" or "GREEK
SMALL LETTER ALPHA" and unique numbers for the characters, in this case
0x0041 and 0x03B1, respectively.  These unique numbers are called code
points.

The Unicode standard prefers using hexadecimal notation for the code
points.  If numbers like 0x0041 are unfamiliar to you, take a peek at a
later section, "Hexadecimal Notation".  The Unicode standard uses the
notation "U+0041 LATIN CAPITAL LETTER A" to give the hexadecimal code
point and the normative name of the character.

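As an illustration (a sketch, not part of the standard itself), the charnames
module that ships with Perl can map between code points and these normative
names:

    use charnames ();    # charnames::viacode() and charnames::vianame()

    printf "U+%04X is %s\n", 0x0041, charnames::viacode(0x0041);
    # U+0041 is LATIN CAPITAL LETTER A

    printf "GREEK SMALL LETTER ALPHA is U+%04X\n",
           charnames::vianame("GREEK SMALL LETTER ALPHA");
    # GREEK SMALL LETTER ALPHA is U+03B1
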
Unicode also defines various properties for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation"; these
properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.

A Unicode character consists either of a single code point, or a base
character (like "LATIN CAPITAL LETTER A"), followed by one or more
modifiers (like "COMBINING ACUTE ACCENT").  This sequence of base
character and modifiers is called a combining character sequence.

Whether to call these combining character sequences "characters"
depends on your point of view.  If you are a programmer, you probably
would tend towards seeing each element in the sequences as one unit, or
"character".  The whole sequence could be seen as one "character",
however, from the user's point of view, since that's probably what it
looks like in the context of the user's language.

With this "whole sequence" view of characters, the total number of
characters is open-ended.  But in the programmer's "one unit is one
character" point of view, the concept of "characters" is more
deterministic.  In this document, we take that second point of view:
one "character" is one Unicode code point, be it a base character or a
combining character.

For some combinations, there are precomposed characters.  "LATIN
CAPITAL LETTER A WITH ACUTE", for example, is defined as a single code
point.  These precomposed characters are, however, only available for
some combinations, and are mainly meant to support round-trip
conversions between Unicode and legacy standards (like the ISO 8859
series).  In the general case, the composing method is more extensible.
To support conversion between different compositions of the characters,
various normalization forms to standardize representations are also
defined.

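A minimal sketch of normalization at work, using the Unicode::Normalize
module that ships with Perl 5.8 and later (the lengths count characters, as
discussed under "Handling Unicode" below):

    use Unicode::Normalize;    # NFC() and NFD() are exported by default

    my $decomposed  = "A\x{301}";        # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
    my $precomposed = NFC($decomposed);  # composed form: "\x{C1}", LATIN CAPITAL LETTER A WITH ACUTE
    my $redone      = NFD($precomposed); # decomposed again: two code points

    printf "%d %d %d\n",
           length($decomposed), length($precomposed), length($redone);   # 2 1 2
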
Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character".  The same character could be
represented differently in several legacy encodings.  The converse is
also not true: some code points do not have an assigned character.
Firstly, there are unallocated code points within otherwise used
blocks.  Secondly, there are special Unicode control characters that do
not represent true characters.

A common myth about Unicode is that it is "16-bit", that is, that it
can represent only 0x10000 (or 65536) characters, from 0x0000 to
0xFFFF.  This is untrue.  Since Unicode 2.0 (July 1996), Unicode has
been defined all the way up to 21 bits (0x10FFFF), and since Unicode
3.1 (March 2001), characters have been defined beyond 0xFFFF.  The
first 0x10000 characters are called Plane 0, or the Basic Multilingual
Plane (BMP).  With Unicode 3.1, 17 (yes, seventeen) planes in all were
defined--but they are nowhere near full of defined characters, yet.

Another myth is that the 256-character blocks have something to do with
languages--that each block would define the characters used by a
language or a set of languages.  This is also untrue.  The division
into blocks exists, but it is almost completely accidental--an artifact
of how the characters have been and still are allocated.  Instead,
there is a concept called scripts, which is more useful: there is
"Latin" script, "Greek" script, and so on.  Scripts usually span varied
parts of several blocks.  For further information see Unicode::UCD.

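For instance, the Unicode::UCD module can report both the block and the
script of a code point; a small sketch (the names come from the Unicode data
files):

    use Unicode::UCD qw(charblock charscript);

    # LATIN SMALL LETTER E WITH ACUTE: its block and its script are different things
    printf "block:  %s\n", charblock(0x00E9);    # block:  Latin-1 Supplement
    printf "script: %s\n", charscript(0x00E9);   # script: Latin
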
The Unicode code points are just abstract numbers.  To input and output
these abstract numbers, the numbers must be encoded or serialised
somehow.  Unicode defines several character encoding forms, of which
UTF-8 is perhaps the most popular.  UTF-8 is a variable-length encoding
that encodes Unicode characters as 1 to 6 bytes (only 4 with the
currently defined characters).  Other encodings include UTF-16 and
UTF-32 and their big- and little-endian variants (UTF-8 is byte-order
independent).  ISO/IEC 10646 defines the UCS-2 and UCS-4 encoding
forms.

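As a sketch of how the same characters serialise differently, the Encode
module can be asked for the byte length of each encoding form (the counts in
the comments assume the encodings named here):

    use Encode qw(encode);

    my $str = "A\x{3B1}\x{263A}";   # LATIN CAPITAL LETTER A, GREEK SMALL LETTER ALPHA, WHITE SMILING FACE
    for my $enc ("UTF-8", "UTF-16BE", "UTF-32BE") {
        printf "%-8s %2d bytes\n", $enc, length(encode($enc, $str));
    }
    # UTF-8     6 bytes   (1 + 2 + 3)
    # UTF-16BE  6 bytes   (2 + 2 + 2)
    # UTF-32BE 12 bytes   (4 + 4 + 4)
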
For more information about encodings--for instance, to learn what
surrogates and byte order marks (BOMs) are--see perlunicode.

Perl's Unicode Support
Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
natively.  Perl 5.8.0, however, is the first recommended release for
serious Unicode work.  The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but for example regular
expressions still do not work with Unicode in 5.6.1.

Starting from Perl 5.8.0, the use of "use utf8" is needed only in much
more restricted circumstances.  In earlier releases the "utf8" pragma
was used to declare that operations in the current block or file would
be Unicode-aware.  This model was found to be wrong, or at least
clumsy: the "Unicodeness" is now carried with the data, instead of
being attached to the operations.  Only one case remains where an
explicit "use utf8" is needed: if your Perl script itself is encoded in
UTF-8, you can use UTF-8 in your identifier names, and in string and
regular expression literals, by saying "use utf8".  This is not the
default because scripts with legacy 8-bit data in them would break.
See utf8.

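A hedged sketch of that one remaining case, a script whose own source file is
saved as UTF-8 (the identifier and the literal are purely illustrative):

    use utf8;                      # the source of this script is itself UTF-8

    my $café = "naïve";            # UTF-8 allowed in identifiers and literals
    print length($café), "\n";     # 5 characters, not 6 bytes
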
Perl's Unicode Model
Perl supports both pre-5.6 strings of eight-bit native bytes, and
strings of Unicode characters.  The principle is that Perl tries to
keep its data as eight-bit bytes for as long as possible, but as soon
as Unicodeness cannot be avoided, the data is transparently upgraded to
Unicode.

Internally, Perl currently uses either the native eight-bit character
set of the platform (for example Latin-1) or UTF-8 to encode Unicode
strings.  Specifically, if all code points in the string are 0xFF or
less, Perl uses the native eight-bit character set.  Otherwise, it uses
UTF-8.

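You can watch that upgrade happen with the always-available
utf8::is_utf8() function; a minimal sketch (the flag is an internal detail,
hence the "usually" in the comments):

    my $s = "caf\x{E9}";                      # all code points <= 0xFF: native eight-bit storage
    print utf8::is_utf8($s) ? 1 : 0, "\n";    # usually 0

    $s .= "\x{263A}";                         # a code point above 0xFF joins the string...
    print utf8::is_utf8($s) ? 1 : 0, "\n";    # ...so the whole string is upgraded: 1
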
A user of Perl does not normally need to know nor care how Perl happens
to encode its internal strings, but it becomes relevant when outputting
Unicode strings to a stream without a PerlIO layer -- one with the
"default" encoding.  In such a case, the raw bytes used internally (the
native character set or UTF-8, as appropriate for each string) will be
used, and a "Wide character" warning will be issued if those strings
contain a character beyond 0x00FF.

For example,

    perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

produces a fairly useless mixture of native bytes and UTF-8, as well as
a warning:

    Wide character in print at ...

To output UTF-8, use the ":encoding" or ":utf8" output layer.
Prepending

    binmode(STDOUT, ":utf8");

to this sample program ensures that the output is completely UTF-8, and
removes the program's warning.

You can enable automatic UTF-8-ification of your standard file handles,
default "open()" layer, and @ARGV by using either the "-C" command line
switch or the "PERL_UNICODE" environment variable; see perlrun for the
documentation of the "-C" switch.

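For example, either of the following one-liners (shown for a POSIX-style
shell; -CS tells Perl to treat STDIN, STDOUT, and STDERR as UTF-8) prints the
earlier sample without the "Wide character" warning:

    perl -CS -e 'print "\x{0100}\x{DF}\n"'

    PERL_UNICODE=S perl -e 'print "\x{0100}\x{DF}\n"'
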
Note that this means that Perl expects other software to work, too: if
Perl has been led to believe that STDIN should be UTF-8, but then STDIN
coming in from another command is not UTF-8, Perl will complain about
the malformed UTF-8.

All features that combine Unicode and I/O also require using the new
PerlIO feature.  Almost all Perl 5.8 platforms do use PerlIO, though:
you can see whether yours does by running "perl -V" and looking for
"useperlio=define".

Unicode and EBCDIC
Perl 5.8.0 also supports Unicode on EBCDIC platforms.  There, Unicode
support is somewhat more complex to implement since additional
conversions are needed at every step.  Some problems remain; see
perlebcdic for details.

In any case, the Unicode support on EBCDIC platforms is better than in
the 5.6 series, which didn't work much at all for EBCDIC platforms.  On
EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
instead of UTF-8.  The difference is that UTF-8 is "ASCII-safe", in
that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
"EBCDIC-safe".

Creating Unicode
To create Unicode characters in literals for code points above 0xFF,
use the "\x{...}" notation in double-quoted strings:

    my $smiley = "\x{263a}";

Similarly, it can be used in regular expression literals:

    $smiley =~ /\x{263a}/;

At run-time you can use "chr()":

    my $hebrew_alef = chr(0x05d0);

See "Further Resources" for how to find all these numeric codes.

Naturally, "ord()" will do the reverse: it turns a character into a
code point.

Note that "\x.." (no "{}" and only two hexadecimal digits), "\x{...}",
and "chr(...)" for arguments less than 0x100 (decimal 256) generate an
eight-bit character for backward compatibility with older Perls.  For
arguments of 0x100 or more, Unicode characters are always produced.  If
you want to force the production of Unicode characters regardless of
the numeric value, use "pack("U", ...)" instead of "\x..", "\x{...}",
or "chr()".

You can also use the "charnames" pragma to invoke characters by name in
double-quoted strings:

    use charnames ':full';
    my $arabic_alef = "\N{ARABIC LETTER ALEF}";

And, as mentioned above, you can also "pack()" numbers into Unicode
characters:

    my $georgian_an = pack("U", 0x10a0);

Note that both "\x{...}" and "\N{...}" are compile-time string
constants: you cannot use variables in them.  If you want similar run-
time functionality, use "chr()" and "charnames::vianame()".

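A small run-time sketch of that combination (the character name here is only
an example; it could just as well come from user input):

    use charnames ();

    my $name = "ARABIC LETTER ALEF";
    my $char = chr(charnames::vianame($name));   # run-time equivalent of "\N{ARABIC LETTER ALEF}"
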
If you want to force the result to Unicode characters, use the special
"U0" prefix.  It consumes no arguments but causes the following bytes
to be interpreted as the UTF-8 encoding of Unicode characters:

    my $chars = pack("U0W*", 0x80, 0x42);

Likewise, you can stop such UTF-8 interpretation by using the special
"C0" prefix.

Handling Unicode
Handling Unicode is for the most part transparent: just use the strings
as usual.  Functions like "index()", "length()", and "substr()" will
work on the Unicode characters; regular expressions will work on the
Unicode characters (see perlunicode and perlretut).

Note that Perl considers combining character sequences to be separate
characters, so for example

    use charnames ':full';
    print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";

will print 2, not 1.  The only exception is that regular expressions
have "\X" for matching a combining character sequence.

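A minimal sketch of "\X" treating that same two-code-point sequence as a
single unit:

    use charnames ':full';

    my $seq = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}";
    my @graphemes = $seq =~ /(\X)/g;
    print scalar(@graphemes), "\n";   # 1: the whole sequence matches one \X
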
Life is not quite so transparent, however, when working with legacy
encodings, I/O, and certain special cases:

Legacy Encodings
When you combine legacy data and Unicode, the legacy data needs to be
upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if applicable) is
assumed.

The "Encode" module knows about many encodings and has interfaces for
doing conversions between those encodings:

    use Encode 'decode';
    $data = decode("iso-8859-3", $data); # convert from legacy to utf-8

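The other direction is symmetrical; a minimal sketch (the target encoding is
only an example):

    use Encode 'encode';
    $octets = encode("iso-8859-3", $data); # from Perl's Unicode strings back to legacy bytes
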
Unicode I/O
Normally, writing out Unicode data

    print FH $some_string_with_unicode, "\n";

produces raw bytes that Perl happens to use to internally encode the
Unicode string.  Perl's internal encoding depends on the system as well
as what characters happen to be in the string at the time.  If any of
the characters are at code points 0x100 or above, you will get a
warning.  To ensure that the output is explicitly rendered in the
encoding you desire--and to avoid the warning--open the stream with the
desired encoding.  Some examples:

    open FH, ">:utf8", "file";

    open FH, ">:encoding(ucs2)", "file";
    open FH, ">:encoding(UTF-8)", "file";
    open FH, ">:encoding(shift_jis)", "file";

and on already open streams, use "binmode()":

    binmode(STDOUT, ":utf8");

    binmode(STDOUT, ":encoding(ucs2)");
    binmode(STDOUT, ":encoding(UTF-8)");
    binmode(STDOUT, ":encoding(shift_jis)");

The matching of encoding names is loose: case does not matter, and many
encodings have several aliases.  Note that the ":utf8" layer must
always be specified exactly like that; it is not subject to the loose
matching of encoding names.  Also note that ":utf8" is unsafe for
input, because it accepts the data without validating that it is indeed
valid UTF-8.

See PerlIO for the ":utf8" layer, PerlIO::encoding and Encode::PerlIO
for the ":encoding()" layer, and Encode::Supported for many encodings
supported by the "Encode" module.

Reading in a file that you know happens to be encoded in one of the
Unicode or legacy encodings does not magically turn the data into
Unicode in Perl's eyes.  To do that, specify the appropriate layer when
opening files:

    open(my $fh, '<:encoding(utf8)', 'anything');
    my $line_of_unicode = <$fh>;

    open(my $fh, '<:encoding(Big5)', 'anything');
    my $line_of_unicode = <$fh>;

The I/O layers can also be specified more flexibly with the "open"
pragma.  See open, or look at the following example.

    use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
    open X, ">file";
    print X chr(0x100), "\n";
    close X;
    open Y, "<file";
    printf "%#x\n", ord(<Y>); # this should print 0x100
    close Y;

With the "open" pragma you can use the ":locale" layer:

    BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
    # the :locale will probe the locale environment variables like LC_ALL
    use open OUT => ':locale'; # russki parusski
    open(O, ">koi8");
    print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
    close O;
    open(I, "<koi8");
    printf "%#x\n", ord(<I>); # this should print 0xc1
    close I;

These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream.  The result is always Unicode.

The open pragma affects all the "open()" calls after the pragma by
setting default layers.  If you want to affect only certain streams,
use explicit layers directly in the "open()" call.

You can switch encodings on an already opened stream by using
"binmode()"; see "binmode" in perlfunc.

The ":locale" does not currently (as of Perl 5.8.0) work with "open()"
and "binmode()", only with the "open" pragma.  The ":utf8" and
":encoding(...)" methods do work with all of "open()", "binmode()", and
the "open" pragma.

Similarly, you may use these I/O layers on output streams to
automatically convert Unicode to the specified encoding when it is
written to the stream.  For example, the following snippet copies the
contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
the file "text.utf8", encoded as UTF-8:

    open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
    open(my $unicode, '>:utf8', 'text.utf8');
    while (<$nihongo>) { print $unicode $_ }

The naming of encodings, both in "open()" and in the "open" pragma,
allows for flexible names: "koi8-r" and "KOI8R" will both be
understood.

Common encodings recognized by ISO, MIME, IANA, and various other
standardisation organisations are accepted; for a more detailed list
see Encode::Supported.

"read()" reads characters and returns the number of characters.
"seek()" and "tell()" operate on byte counts, as do "sysread()" and
"sysseek()".

Notice that because of the default behaviour of not doing any
conversion upon input if there is no default layer, it is easy to
mistakenly write code that keeps on expanding a file by repeatedly
encoding the data:

    # BAD CODE WARNING
    open F, "file";
    local $/; ## read in the whole file of 8-bit characters
    $t = <F>;
    close F;
    open F, ">:encoding(utf8)", "file";
    print F $t; ## convert to UTF-8 on output
    close F;

If you run this code twice, the contents of the file will be UTF-8
encoded twice.  A "use open ':encoding(utf8)'" would have avoided the
bug, as would explicitly opening the file for input as UTF-8.

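One corrected version, as a sketch; it assumes the file on disk really is
UTF-8 to begin with:

    open F, "<:encoding(UTF-8)", "file";
    local $/;    ## read in the whole file of Unicode characters
    $t = <F>;
    close F;
    open F, ">:encoding(UTF-8)", "file";
    print F $t;  ## re-encode to UTF-8 on output; no double encoding
    close F;
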
NOTE: the ":utf8" and ":encoding" features work only if your Perl has
been built with the new PerlIO feature (which is the default on most
systems).

Displaying Unicode As Text
Sometimes you might want to display Perl scalars containing Unicode as
simple ASCII (or EBCDIC) text.  The following subroutine converts its
argument so that Unicode characters with code points greater than 255
are displayed as "\x{...}", control characters (like "\n") are
displayed as "\x..", and the rest of the characters as themselves:

    sub nice_string {
        join("",
          map { $_ > 255 ?                    # if wide character...
                sprintf("\\x{%04X}", $_) :    # \x{...}
                chr($_) =~ /[[:cntrl:]]/ ?    # else if control character...
                sprintf("\\x%02X", $_) :      # \x..
                quotemeta(chr($_))            # else quoted or as themselves
              } unpack("W*", $_[0]));         # unpack Unicode characters
    }

For example,

    nice_string("foo\x{100}bar\n")

returns the string

    'foo\x{0100}bar\x0A'

which is ready to be printed.

Special Cases
· Bit Complement Operator ~ And vec()

  The bit complement operator "~" may produce surprising results if
  used on strings containing characters with ordinal values above
  255.  In such a case, the results are consistent with the internal
  encoding of the characters, but not with much else.  So don't do
  that.  Similarly for "vec()": you will be operating on the
  internally-encoded bit patterns of the Unicode characters, not on
  the code point values, which is very probably not what you want.

· Peeking At Perl's Internal Encoding

  Normal users of Perl should never care how Perl encodes any
  particular Unicode string (because the normal ways to get at the
  contents of a string with Unicode--via input and output--should
  always be via explicitly-defined I/O layers).  But if you must,
  there are two ways of looking behind the scenes.

  One way of peeking inside the internal encoding of Unicode
  characters is to use "unpack("C*", ...)" to get the bytes of
  whatever the string encoding happens to be, or "unpack("U0..",
  ...)" to get the bytes of the UTF-8 encoding:

      # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
      print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";

  Yet another way would be to use the Devel::Peek module:

      perl -MDevel::Peek -e 'Dump(chr(0x100))'

  That shows the "UTF8" flag in FLAGS and both the UTF-8 bytes and
  Unicode characters in "PV".  See also later in this document the
  discussion about the "utf8::is_utf8()" function.

Advanced Topics
· String Equivalence

  The question of string equivalence turns somewhat complicated in
  Unicode: what do you mean by "equal"?

  (Is "LATIN CAPITAL LETTER A WITH ACUTE" equal to "LATIN CAPITAL
  LETTER A"?)

  The short answer is that by default Perl compares equivalence
  ("eq", "ne") based only on code points of the characters.  In the
  above case, the answer is no (because 0x00C1 != 0x0041).  But
  sometimes, any CAPITAL LETTER As should be considered equal, or
  even As of any case.

  The long answer is that you need to consider character
  normalization and casing issues: see Unicode::Normalize, Unicode
  Technical Reports #15 and #21, Unicode Normalization Forms and Case
  Mappings, <http://www.unicode.org/unicode/reports/tr15/> and
  <http://www.unicode.org/unicode/reports/tr21/>

  As of Perl 5.8.0, the "Full" case-folding of Case
  Mappings/SpecialCasing is implemented.

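  A minimal sketch of a normalization-based comparison with
  Unicode::Normalize; here "equal" is taken to mean "same NFC form", which
  is only one of several possible definitions:

      use Unicode::Normalize;

      my $composed   = "\x{C1}";       # LATIN CAPITAL LETTER A WITH ACUTE
      my $decomposed = "A\x{301}";     # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

      print $composed eq $decomposed           ? "eq" : "ne", "\n";  # ne: different code points
      print NFC($composed) eq NFC($decomposed) ? "eq" : "ne", "\n";  # eq: same canonical form
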
· String Collation

  People like to see their strings nicely sorted--or as Unicode
  parlance goes, collated.  But again, what do you mean by collate?

  (Does "LATIN CAPITAL LETTER A WITH ACUTE" come before or after
  "LATIN CAPITAL LETTER A WITH GRAVE"?)

  The short answer is that by default, Perl compares strings ("lt",
  "le", "cmp", "ge", "gt") based only on the code points of the
  characters.  In the above case, the answer is "after", since 0x00C1
  > 0x00C0.

  The long answer is that "it depends", and a good answer cannot be
  given without knowing (at the very least) the language context.
  See Unicode::Collate, and Unicode Collation Algorithm
  <http://www.unicode.org/unicode/reports/tr10/>

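  A sketch of collation with the Unicode::Collate module and its default
  (DUCET) table; language-specific tailoring is a separate topic:

      use Unicode::Collate;

      my $collator = Unicode::Collate->new();   # the default collation table

      my @code_point_order = sort  ("A", "b", "\x{E1}");           # ("A", "b", "\x{E1}")
      my @collated         = $collator->sort("A", "b", "\x{E1}");  # ("A", "\x{E1}", "b")
      # 0xE1 is LATIN SMALL LETTER A WITH ACUTE: the collator sorts it with the
      # other letter As, before "b", instead of after all ASCII letters
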
Miscellaneous
· Character Ranges and Classes

  Character ranges in regular expression character classes
  ("/[a-z]/") and in the "tr///" (also known as "y///") operator are
  not magically Unicode-aware.  What this means is that "[A-Za-z]"
  will not magically start to mean "all alphabetic letters" (not that
  it means that even for 8-bit characters; for that, you should be
  using "/[[:alpha:]]/").

  For specifying character classes like that in regular expressions,
  you can use the various Unicode properties--"\pL", or perhaps
  "\p{Alphabetic}", in this particular case.  You can use Unicode
  code points as the end points of character ranges, but there is no
  magic associated with specifying a certain range.  For further
  information--there are dozens of Unicode character classes--see
  perlunicode.

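  A small sketch of the difference; the strings and the captured lists shown
  in the comments are only illustrative:

      my $text = "\x{3B1}\x{3B2} abc 123";   # GREEK SMALL LETTER ALPHA and BETA, then ASCII

      my @ascii_letters = $text =~ /([A-Za-z]+)/g;         # ("abc")
      my @any_letters   = $text =~ /(\p{Alphabetic}+)/g;   # ("\x{3B1}\x{3B2}", "abc")
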
· String-To-Number Conversions

  Unicode does define several other decimal--and numeric--characters
  besides the familiar 0 to 9, such as the Arabic and Indic digits.
  Perl does not support string-to-number conversion for digits other
  than ASCII 0 to 9 (and ASCII a to f for hexadecimal).

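  For example (a sketch; the warning wording is what "use warnings"
  typically emits):

      use warnings;

      my $arabic_four = "\x{0664}";     # ARABIC-INDIC DIGIT FOUR
      print $arabic_four + 0, "\n";     # 0, with an "isn't numeric in addition" warning
      print "4" + 0, "\n";              # 4
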
Questions With Answers
· Will My Old Scripts Break?

  Very probably not.  Unless you are generating Unicode characters
  somehow, old behaviour should be preserved.  About the only
  behaviour that has changed and which could start generating Unicode
  is the old behaviour of "chr()" where supplying an argument greater
  than 255 produced a character modulo 255.  "chr(300)", for example,
  was equal to "chr(45)" or "-" (in ASCII); now it is LATIN CAPITAL
  LETTER I WITH BREVE.

· How Do I Make My Scripts Work With Unicode?

  Very little work should be needed since nothing changes until you
  generate Unicode data.  The most important thing is getting input
  as Unicode; for that, see the earlier I/O discussion.

· How Do I Know Whether My String Is In Unicode?

  You shouldn't have to care.  But you may, because currently the
  semantics of the characters whose ordinals are in the range 128 to
  255 are different depending on whether the string they are contained
  within is in Unicode or not.  (See perlunicode.)

  To determine if a string is in Unicode, use:

      print utf8::is_utf8($string) ? 1 : 0, "\n";

  But note that this doesn't mean that any of the characters in the
  string are necessarily UTF-8 encoded, or that any of the characters
  have code points greater than 0xFF (255) or even 0x80 (128), or
  that the string has any characters at all.  All "is_utf8()"
  does is return the value of the internal "utf8ness" flag
  attached to $string.  If the flag is off, the bytes in the
  scalar are interpreted as a single-byte encoding.  If the flag is
  on, the bytes in the scalar are interpreted as the (multi-byte,
  variable-length) UTF-8 encoded code points of the characters.
  Bytes added to a UTF-8 encoded string are automatically upgraded
  to UTF-8.  If mixed non-UTF-8 and UTF-8 scalars are merged (double-
  quoted interpolation, explicit concatenation, and printf/sprintf
  parameter substitution), the result will be UTF-8 encoded as if
  copies of the byte strings were upgraded to UTF-8: for example,

      $a = "ab\x80c";
      $b = "\x{100}";
      print "$a = $b\n";

  the output string will be UTF-8-encoded "ab\x80c = \x{100}\n", but
  $a will stay byte-encoded.

  Sometimes you might really need to know the byte length of a string
  instead of the character length.  For that use either the
  "Encode::encode_utf8()" function or the "bytes" pragma and the
  "length()" function:

      my $unicode = chr(0x100);
      print length($unicode), "\n"; # will print 1
      require Encode;
      print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
      use bytes;
      print length($unicode), "\n"; # will also print 2
                                    # (the 0xC4 0x80 of the UTF-8)

· How Do I Detect Data That's Not Valid In a Particular Encoding?

  Use the "Encode" package to try converting it.  For example,

      use Encode 'decode_utf8';

      if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
          # $string is valid utf8
      } else {
          # $string is not valid utf8
      }

  Or use "unpack" to try decoding it:

      use warnings;
      @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);

  If invalid, a "Malformed UTF-8 character" warning is produced.  The
  "C0" means "process the string character per character".  Without
  that, the "unpack("U*", ...)" would work in "U0" mode (the default
  if the format string starts with "U") and it would return the bytes
  making up the UTF-8 encoding of the target string, something that
  will always work.

· How Do I Convert Binary Data Into a Particular Encoding, Or Vice
  Versa?

  This probably isn't as useful as you might think.  Normally, you
  shouldn't need to.

  In one sense, what you are asking doesn't make much sense:
  encodings are for characters, and binary data are not "characters",
  so converting "data" into some encoding isn't meaningful unless you
  know in which character set and encoding the binary data is, in
  which case it's not just binary data, now is it?

  If you have a raw sequence of bytes that you know should be
  interpreted via a particular encoding, you can use "Encode":

      use Encode 'from_to';
      from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8

  The call to "from_to()" changes the bytes in $data, but nothing
  material about the nature of the string has changed as far as Perl
  is concerned.  Both before and after the call, the string $data
  contains just a bunch of 8-bit bytes.  As far as Perl is concerned,
  the encoding of the string remains as "system-native 8-bit bytes".

  You might relate this to a fictional 'Translate' module:

      use Translate;
      my $phrase = "Yes";
      Translate::from_to($phrase, 'english', 'deutsch');
      ## phrase now contains "Ja"

  The contents of the string change, but not the nature of the
  string.  Perl doesn't know any more after the call than before that
  the contents of the string indicate the affirmative.

  Back to converting data.  If you have (or want) data in your
  system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you
  can use pack/unpack to convert to/from Unicode.

      $native_string  = pack("W*", unpack("U*", $Unicode_string));
      $Unicode_string = pack("U*", unpack("W*", $native_string));

  If you have a sequence of bytes you know is valid UTF-8, but Perl
  doesn't know it yet, you can make Perl a believer, too:

      use Encode 'decode_utf8';
      $Unicode = decode_utf8($bytes);

  or:

      $Unicode = pack("U0a*", $bytes);

  You can find the bytes that make up a UTF-8 sequence with

      @bytes = unpack("C*", $Unicode_string)

  and you can create well-formed Unicode with

      $Unicode_string = pack("U*", 0xff, ...)

· How Do I Display Unicode?  How Do I Input Unicode?

  See <http://www.alanwood.net/unicode/> and
  <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

· How Does Unicode Work With Traditional Locales?

  In Perl, not very well.  Avoid using locales through the "locale"
  pragma: use Unicode or locales, but not both.  See perlrun for the
  description of the "-C" switch and its environment counterpart,
  $ENV{PERL_UNICODE}, to see how to enable various Unicode features,
  for example by using locale settings.

Hexadecimal Notation
The Unicode standard prefers using hexadecimal notation because that
more clearly shows the division of Unicode into blocks of 256
characters.  Hexadecimal is also simply shorter than decimal.  You can
use decimal notation, too, but learning to use hexadecimal just makes
life easier with the Unicode standard.  The "U+HHHH" notation uses
hexadecimal, for example.

The "0x" prefix means a hexadecimal number; the digits are 0-9 and a-f
(or A-F, case doesn't matter).  Each hexadecimal digit represents four
bits, or half a byte.  "print 0x..., "\n"" will show a hexadecimal
number in decimal, and "printf "%x\n", $decimal" will show a decimal
number in hexadecimal.  If you have just the "hex digits" of a
hexadecimal number, you can use the "hex()" function.

    print 0x0009, "\n";    # 9
    print 0x000a, "\n";    # 10
    print 0x000f, "\n";    # 15
    print 0x0010, "\n";    # 16
    print 0x0011, "\n";    # 17
    print 0x0100, "\n";    # 256

    print 0x0041, "\n";    # 65

    printf "%x\n",  65;    # 41
    printf "%#x\n", 65;    # 0x41

    print hex("41"), "\n"; # 65

Further Resources
· Unicode Consortium

  <http://www.unicode.org/>

· Unicode FAQ

  <http://www.unicode.org/unicode/faq/>

· Unicode Glossary

  <http://www.unicode.org/glossary/>

· Unicode Useful Resources

  <http://www.unicode.org/unicode/onlinedat/resources.html>

· Unicode and Multilingual Support in HTML, Fonts, Web Browsers and
  Other Applications

  <http://www.alanwood.net/unicode/>

· UTF-8 and Unicode FAQ for Unix/Linux

  <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

· Legacy Character Sets

  <http://www.czyborra.com/>
  <http://www.eki.ee/letter/>

· The Unicode support files live within the Perl installation in the
  directory

      $Config{installprivlib}/unicore

  in Perl 5.8.0 or newer, and

      $Config{installprivlib}/unicode

  in the Perl 5.6 series.  (The renaming to lib/unicore was done to
  avoid naming conflicts with lib/Unicode in case-insensitive
  filesystems.)  The main Unicode data file is UnicodeData.txt (or
  Unicode.301 in Perl 5.6.1.)  You can find the
  $Config{installprivlib} by

      perl "-V:installprivlib"

  You can explore various information from the Unicode data files
  using the "Unicode::UCD" module.

UNICODE IN OLDER PERLS
If you cannot upgrade your Perl to 5.8.0 or later, you can still do
some Unicode processing by using the modules "Unicode::String",
"Unicode::Map8", and "Unicode::Map", available from CPAN.  If you have
the GNU recode installed, you can also use the Perl front-end
"Convert::Recode" for character conversions.

The following are fast conversions from ISO 8859-1 (Latin-1) bytes to
UTF-8 bytes and back; the code works even with older Perl 5 versions.

    # ISO 8859-1 to UTF-8
    s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

    # UTF-8 to ISO 8859-1
    s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

SEE ALSO
perlunitut, perlunicode, Encode, open, utf8, bytes, perlretut, perlrun,
Unicode::Collate, Unicode::Normalize, Unicode::UCD

ACKNOWLEDGMENTS
Thanks to the kind readers of the perl5-porters@perl.org,
perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
mailing lists for their valuable feedback.

AUTHOR, COPYRIGHT, AND LICENSE
Copyright 2001-2002 Jarkko Hietaniemi <jhi@iki.fi>

This document may be distributed under the same terms as Perl itself.


perl v5.10.1                      2009-02-25                   PERLUNIINTRO(1)