perluniintro(1)

PERLUNIINTRO(1)        Perl Programmers Reference Guide        PERLUNIINTRO(1)

NAME

6       perluniintro - Perl Unicode introduction
7

DESCRIPTION

9       This document gives a general idea of Unicode and how to use Unicode in
10       Perl.
11
12   Unicode
13       Unicode is a character set standard which plans to codify all of the
14       writing systems of the world, plus many other symbols.
15
16       Unicode and ISO/IEC 10646 are coordinated standards that provide code
17       points for characters in almost all modern character set standards,
18       covering more than 30 writing systems and hundreds of languages,
19       including all commercially-important modern languages.  All characters
20       in the largest Chinese, Japanese, and Korean dictionaries are also
21       encoded. The standards will eventually cover almost all characters in
22       more than 250 writing systems and thousands of languages.  Unicode 1.0
23       was released in October 1991, and 4.0 in April 2003.
24
25       A Unicode character is an abstract entity.  It is not bound to any
26       particular integer width, especially not to the C language "char".
27       Unicode is language-neutral and display-neutral: it does not encode the
28       language of the text, and it does not generally define fonts or other
29       graphical layout details.  Unicode operates on characters and on text
30       built from those characters.
31
32       Unicode defines characters like "LATIN CAPITAL LETTER A" or "GREEK
33       SMALL LETTER ALPHA" and unique numbers for the characters, in this case
34       0x0041 and 0x03B1, respectively.  These unique numbers are called code
35       points.
36
37       The Unicode standard prefers using hexadecimal notation for the code
38       points.  If numbers like 0x0041 are unfamiliar to you, take a peek at a
39       later section, "Hexadecimal Notation".  The Unicode standard uses the
40       notation "U+0041 LATIN CAPITAL LETTER A", to give the hexadecimal code
41       point and the normative name of the character.
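
       As a minimal sketch of that notation, the following prints code points
       in the "U+XXXX NAME" style.  The helper name "describe_code_point" is
       made up for this illustration; "charnames::viacode()" is the core
       function that looks up a character name from its code point.

```perl
use strict;
use warnings;
use charnames ();   # gives access to charnames::viacode() at run time

# Hypothetical helper: format a code point the way the Unicode standard does.
sub describe_code_point {
    my ($cp) = @_;
    my $name = charnames::viacode($cp);
    return sprintf("U+%04X %s", $cp, defined $name ? $name : "<no name>");
}

my @described = map { describe_code_point($_) } (0x0041, 0x03B1);
print "$_\n" for @described;
# prints:
#   U+0041 LATIN CAPITAL LETTER A
#   U+03B1 GREEK SMALL LETTER ALPHA
```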
42
43       Unicode also defines various properties for the characters, like
44       "uppercase" or "lowercase", "decimal digit", or "punctuation"; these
45       properties are independent of the names of the characters.
46       Furthermore, various operations on the characters like uppercasing,
47       lowercasing, and collating (sorting) are defined.
48
49       A Unicode logical "character" can actually consist of more than one
50       internal actual "character" or code point.  For Western languages, this
51       is adequately modelled by a base character (like "LATIN CAPITAL LETTER
52       A") followed by one or more modifiers (like "COMBINING ACUTE ACCENT").
53       This sequence of base character and modifiers is called a combining
54       character sequence.  Some non-western languages require more
55       complicated models, so Unicode created the grapheme cluster concept,
56       and then the extended grapheme cluster.  For example, a Korean Hangul
57       syllable is considered a single logical character, but most often
58       consists of three actual Unicode characters: a leading consonant
59       followed by an interior vowel followed by a trailing consonant.
60
61       Whether to call these extended grapheme clusters "characters" depends
62       on your point of view. If you are a programmer, you probably would tend
63       towards seeing each element in the sequences as one unit, or
64       "character".  The whole sequence could be seen as one "character",
65       however, from the user's point of view, since that's probably what it
66       looks like in the context of the user's language.
67
68       With this "whole sequence" view of characters, the total number of
69       characters is open-ended. But in the programmer's "one unit is one
70       character" point of view, the concept of "characters" is more
71       deterministic.  In this document, we take that second point of view:
72       one "character" is one Unicode code point.
73
74       For some combinations, there are precomposed characters.  "LATIN
75       CAPITAL LETTER A WITH ACUTE", for example, is defined as a single code
76       point.  These precomposed characters are, however, only available for
77       some combinations, and are mainly meant to support round-trip
78       conversions between Unicode and legacy standards (like the ISO 8859).
79       In the general case, the composing method is more extensible.  To
80       support conversion between different compositions of the characters,
81       various normalization forms to standardize representations are also
82       defined.
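
       The relationship between the composed and decomposed forms can be
       sketched with the core Unicode::Normalize module.  This is only an
       illustration of two of the normalization forms (NFC and NFD), using
       the A WITH ACUTE example from above; the variable names are arbitrary.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $decomposed  = "A\x{0301}";  # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
my $precomposed = "\x{00C1}";   # LATIN CAPITAL LETTER A WITH ACUTE

my $nfc_length = length(NFC($decomposed));    # composed: one code point
my $nfd_length = length(NFD($precomposed));   # decomposed: two code points
my $raw_equal  = $decomposed eq $precomposed ? 1 : 0;
my $nfc_equal  = NFC($decomposed) eq NFC($precomposed) ? 1 : 0;

print "NFC length: $nfc_length\n";            # NFC length: 1
print "NFD length: $nfd_length\n";            # NFD length: 2
print "equal as written: $raw_equal\n";       # equal as written: 0
print "equal after NFC:  $nfc_equal\n";       # equal after NFC:  1
```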
83
       Because of backward compatibility with legacy encodings, the "a unique
       number for every character" idea breaks down a bit: instead, there is
       "at least one number for every character".  The same character could be
       represented differently in several legacy encodings.  The converse does
       not hold either: some code points do not have an assigned character.
       Firstly, there are unallocated code points within otherwise used
       blocks.  Secondly, there are special Unicode control characters that do
       not represent true characters.
92
93       A common myth about Unicode is that it is "16-bit", that is, Unicode is
94       only represented as 0x10000 (or 65536) characters from 0x0000 to
95       0xFFFF.  This is untrue.  Since Unicode 2.0 (July 1996), Unicode has
96       been defined all the way up to 21 bits (0x10FFFF), and since Unicode
97       3.1 (March 2001), characters have been defined beyond 0xFFFF.  The
98       first 0x10000 characters are called the Plane 0, or the Basic
99       Multilingual Plane (BMP).  With Unicode 3.1, 17 (yes, seventeen) planes
100       in all were defined--but they are nowhere near full of defined
101       characters, yet.
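
       Since each plane holds 0x10000 code points, the plane of a code point
       can be found by shifting away its low 16 bits.  The following is a
       small sketch of that arithmetic; "plane_of" is a name invented for the
       example, not a Perl built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: return the plane number (0..16) of a code point.
sub plane_of {
    my ($cp) = @_;
    die sprintf("0x%X is outside the Unicode range\n", $cp)
        if $cp < 0 || $cp > 0x10FFFF;
    return $cp >> 16;
}

my @samples = (0x0041, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF);
my @planes  = map { plane_of($_) } @samples;

printf "U+%04X is in plane %d\n", $samples[$_], $planes[$_] for 0 .. $#samples;
# U+0041 and U+FFFF are in plane 0 (the BMP), U+10000 and U+1F600 in
# plane 1, and U+10FFFF in plane 16, the last of the 17 planes.
```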
102
103       Another myth is about Unicode blocks--that they have something to do
104       with languages--that each block would define the characters used by a
105       language or a set of languages.  This is also untrue.  The division
106       into blocks exists, but it is almost completely accidental--an artifact
107       of how the characters have been and still are allocated.  Instead,
108       there is a concept called scripts, which is more useful: there is
109       "Latin" script, "Greek" script, and so on.  Scripts usually span varied
110       parts of several blocks.  For more information about scripts, see
111       "Scripts" in perlunicode.
112
       The Unicode code points are just abstract numbers.  To input and output
       these abstract numbers, the numbers must be encoded or serialised
       somehow.  Unicode defines several character encoding forms, of which
       UTF-8 is perhaps the most popular.  UTF-8 is a variable length encoding
       that encodes Unicode characters as 1 to 4 bytes.  Other encodings
       include UTF-16 and UTF-32 and their big- and little-endian variants
       (UTF-8 is byte-order independent).  The ISO/IEC 10646 defines the UCS-2
       and UCS-4 encoding forms.
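
       To make the difference between a code point and its serialised forms
       concrete, here is a small sketch that encodes one character (U+263A,
       chosen arbitrarily) with the core Encode module and shows the
       resulting bytes in hexadecimal.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $smiley = "\x{263A}";   # one character, one code point

# Serialise the same character in several encodings and record the bytes.
my %hex;
for my $encoding (qw(UTF-8 UTF-16BE UTF-16LE UTF-32BE)) {
    $hex{$encoding} = unpack("H*", encode($encoding, $smiley));
}

printf "%-8s %s\n", $_, $hex{$_} for sort keys %hex;
# prints:
#   UTF-16BE 263a
#   UTF-16LE 3a26
#   UTF-32BE 0000263a
#   UTF-8    e298ba
```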
121
122       For more information about encodings--for instance, to learn what
123       surrogates and byte order marks (BOMs) are--see perlunicode.
124
125   Perl's Unicode Support
126       Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
127       natively.  Perl 5.8.0, however, is the first recommended release for
128       serious Unicode work.  The maintenance release 5.6.1 fixed many of the
129       problems of the initial Unicode implementation, but for example regular
130       expressions still do not work with Unicode in 5.6.1.
131
132       Starting from Perl 5.8.0, the use of "use utf8" is needed only in much
133       more restricted circumstances. In earlier releases the "utf8" pragma
134       was used to declare that operations in the current block or file would
135       be Unicode-aware.  This model was found to be wrong, or at least
136       clumsy: the "Unicodeness" is now carried with the data, instead of
137       being attached to the operations.  Only one case remains where an
138       explicit "use utf8" is needed: if your Perl script itself is encoded in
139       UTF-8, you can use UTF-8 in your identifier names, and in string and
140       regular expression literals, by saying "use utf8".  This is not the
141       default because scripts with legacy 8-bit data in them would break.
142       See utf8.
143
144   Perl's Unicode Model
145       Perl supports both pre-5.6 strings of eight-bit native bytes, and
146       strings of Unicode characters.  The principle is that Perl tries to
147       keep its data as eight-bit bytes for as long as possible, but as soon
148       as Unicodeness cannot be avoided, the data is (mostly) transparently
149       upgraded to Unicode.  There are some problems--see "The "Unicode Bug""
150       in perlunicode.
151
       Internally, Perl currently uses either the native eight-bit character
       set of the platform (for example Latin-1) or UTF-8 to encode Unicode
       strings.  Specifically, if all code points in the string are 0xFF or
       less, Perl uses the native eight-bit character set.  Otherwise, it
       uses UTF-8.
157
158       A user of Perl does not normally need to know nor care how Perl happens
159       to encode its internal strings, but it becomes relevant when outputting
160       Unicode strings to a stream without a PerlIO layer (one with the
161       "default" encoding).  In such a case, the raw bytes used internally
162       (the native character set or UTF-8, as appropriate for each string)
163       will be used, and a "Wide character" warning will be issued if those
164       strings contain a character beyond 0x00FF.
165
166       For example,
167
168             perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
169
170       produces a fairly useless mixture of native bytes and UTF-8, as well as
171       a warning:
172
173            Wide character in print at ...
174
175       To output UTF-8, use the ":encoding" or ":utf8" output layer.
176       Prepending
177
178             binmode(STDOUT, ":utf8");
179
180       to this sample program ensures that the output is completely UTF-8, and
181       removes the program's warning.
182
       You can enable automatic UTF-8-ification of your standard file handles,
       default "open()" layer, and @ARGV by using either the "-C" command line
       switch or the "PERL_UNICODE" environment variable; see perlrun for the
       documentation of the "-C" switch.
187
188       Note that this means that Perl expects other software to work, too: if
189       Perl has been led to believe that STDIN should be UTF-8, but then STDIN
190       coming in from another command is not UTF-8, Perl will complain about
191       the malformed UTF-8.
192
193       All features that combine Unicode and I/O also require using the new
194       PerlIO feature.  Almost all Perl 5.8 platforms do use PerlIO, though:
195       you can see whether yours is by running "perl -V" and looking for
196       "useperlio=define".
197
198   Unicode and EBCDIC
199       Perl 5.8.0 also supports Unicode on EBCDIC platforms.  There, Unicode
200       support is somewhat more complex to implement since additional
201       conversions are needed at every step.
202
       Later Perl releases have added code that will not work on EBCDIC
       platforms, and no one has complained, so the divergence has continued.
       If you want to run Perl on an EBCDIC platform, send email to
       perlbug@perl.org.
207
       On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
       instead of UTF-8.  The difference is that UTF-8 is "ASCII-safe", in
       that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
       "EBCDIC-safe".
212
213   Creating Unicode
214       To create Unicode characters in literals for code points above 0xFF,
215       use the "\x{...}" notation in double-quoted strings:
216
217           my $smiley = "\x{263a}";
218
       Similarly, it can be used in regular expression literals:
220
221           $smiley =~ /\x{263a}/;
222
223       At run-time you can use "chr()":
224
225           my $hebrew_alef = chr(0x05d0);
226
227       See "Further Resources" for how to find all these numeric codes.
228
229       Naturally, "ord()" will do the reverse: it turns a character into a
230       code point.
231
232       Note that "\x.." (no "{}" and only two hexadecimal digits), "\x{...}",
233       and "chr(...)" for arguments less than 0x100 (decimal 256) generate an
234       eight-bit character for backward compatibility with older Perls.  For
235       arguments of 0x100 or more, Unicode characters are always produced. If
236       you want to force the production of Unicode characters regardless of
237       the numeric value, use "pack("U", ...)"  instead of "\x..", "\x{...}",
238       or "chr()".
239
240       You can also use the "charnames" pragma to invoke characters by name in
241       double-quoted strings:
242
243           use charnames ':full';
244           my $arabic_alef = "\N{ARABIC LETTER ALEF}";
245
246       And, as mentioned above, you can also "pack()" numbers into Unicode
247       characters:
248
249          my $georgian_an  = pack("U", 0x10a0);
250
       Note that both "\x{...}" and "\N{...}" are compile-time string
       constants: you cannot use variables in them.  If you want similar
       run-time functionality, use "chr()" and "charnames::vianame()".
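
       A minimal sketch of the run-time route: the character name is held in
       a variable (here a fixed string standing in for data that only becomes
       known at run time), "charnames::vianame()" turns it into a code point,
       and "chr()" turns that into the character.

```perl
use strict;
use warnings;
use charnames ();   # run-time lookup functions only

my $name       = "GREEK SMALL LETTER ALPHA";   # stand-in for run-time input
my $code_point = charnames::vianame($name);
die "unknown character name: $name\n" unless defined $code_point;

my $char = chr($code_point);
printf "%s is U+%04X\n", $name, ord($char);
# prints: GREEK SMALL LETTER ALPHA is U+03B1
```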
254
255       If you want to force the result to Unicode characters, use the special
256       "U0" prefix.  It consumes no arguments but causes the following bytes
257       to be interpreted as the UTF-8 encoding of Unicode characters:
258
259          my $chars = pack("U0W*", 0x80, 0x42);
260
261       Likewise, you can stop such UTF-8 interpretation by using the special
262       "C0" prefix.
263
264   Handling Unicode
265       Handling Unicode is for the most part transparent: just use the strings
266       as usual.  Functions like "index()", "length()", and "substr()" will
267       work on the Unicode characters; regular expressions will work on the
268       Unicode characters (see perlunicode and perlretut).
269
270       Note that Perl considers grapheme clusters to be separate characters,
271       so for example
272
273           use charnames ':full';
274           print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";
275
276       will print 2, not 1.  The only exception is that regular expressions
277       have "\X" for matching an extended grapheme cluster.
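
       The difference between counting code points and counting extended
       grapheme clusters can be sketched as follows.  The helper name
       "count_clusters" is invented for the example, and the Hangul result
       assumes a Perl whose "\X" matches extended grapheme clusters as
       described above.

```perl
use strict;
use warnings;

# Hypothetical helper: count the \X matches (grapheme clusters) in a string.
sub count_clusters {
    my ($string) = @_;
    my $count = () = $string =~ /\X/g;
    return $count;
}

my $a_acute = "A\x{0301}";                 # base letter + combining accent
my $hangul  = "\x{1100}\x{1161}\x{11A8}";  # leading consonant, vowel, trailing consonant

my @report = map { [ length($_), count_clusters($_) ] } ($a_acute, $hangul);
printf "%d code points, %d cluster(s)\n", @$_ for @report;
# prints:
#   2 code points, 1 cluster(s)
#   3 code points, 1 cluster(s)
```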
278
279       Life is not quite so transparent, however, when working with legacy
280       encodings, I/O, and certain special cases:
281
282   Legacy Encodings
283       When you combine legacy data and Unicode the legacy data needs to be
284       upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if applicable) is
285       assumed.
286
287       The "Encode" module knows about many encodings and has interfaces for
288       doing conversions between those encodings:
289
           use Encode 'decode';
           $data = decode("iso-8859-3", $data); # decode legacy bytes into Perl characters
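
       As a slightly fuller sketch of the same idea, the following decodes a
       short ISO 8859-1 byte string into characters and then encodes those
       characters as UTF-8 bytes.  The sample data is made up for the
       illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $latin1_bytes = "caf\xE9";                          # 4 bytes; 0xE9 is e-acute in ISO 8859-1
my $text         = decode("iso-8859-1", $latin1_bytes); # 4 characters
my $utf8_bytes   = encode("UTF-8", $text);              # 5 bytes: e-acute becomes 0xC3 0xA9

printf "characters: %d\n", length($text);               # characters: 4
printf "UTF-8 bytes: %d (%s)\n",
    length($utf8_bytes), unpack("H*", $utf8_bytes);     # UTF-8 bytes: 5 (636166c3a9)
```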
292
293   Unicode I/O
294       Normally, writing out Unicode data
295
296           print FH $some_string_with_unicode, "\n";
297
298       produces raw bytes that Perl happens to use to internally encode the
299       Unicode string.  Perl's internal encoding depends on the system as well
300       as what characters happen to be in the string at the time. If any of
301       the characters are at code points 0x100 or above, you will get a
302       warning.  To ensure that the output is explicitly rendered in the
303       encoding you desire--and to avoid the warning--open the stream with the
304       desired encoding. Some examples:
305
306           open FH, ">:utf8", "file";
307
308           open FH, ">:encoding(ucs2)",      "file";
309           open FH, ">:encoding(UTF-8)",     "file";
310           open FH, ">:encoding(shift_jis)", "file";
311
312       and on already open streams, use "binmode()":
313
314           binmode(STDOUT, ":utf8");
315
316           binmode(STDOUT, ":encoding(ucs2)");
317           binmode(STDOUT, ":encoding(UTF-8)");
318           binmode(STDOUT, ":encoding(shift_jis)");
319
320       The matching of encoding names is loose: case does not matter, and many
321       encodings have several aliases.  Note that the ":utf8" layer must
322       always be specified exactly like that; it is not subject to the loose
       matching of encoding names. Also note that ":utf8" is unsafe for input,
       because it accepts the data without validating that it is indeed valid
       UTF-8.
326
327       See PerlIO for the ":utf8" layer, PerlIO::encoding and Encode::PerlIO
328       for the ":encoding()" layer, and Encode::Supported for many encodings
329       supported by the "Encode" module.
330
331       Reading in a file that you know happens to be encoded in one of the
332       Unicode or legacy encodings does not magically turn the data into
333       Unicode in Perl's eyes.  To do that, specify the appropriate layer when
334       opening files
335
336           open(my $fh,'<:encoding(utf8)', 'anything');
337           my $line_of_unicode = <$fh>;
338
339           open(my $fh,'<:encoding(Big5)', 'anything');
340           my $line_of_unicode = <$fh>;
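
       The effect of the layers can be tried without touching the file system
       by using in-memory filehandles.  The sketch below writes two
       characters through an ":encoding(UTF-8)" output layer into a scalar,
       then reads them back through the matching input layer; the buffer and
       variable names are arbitrary.

```perl
use strict;
use warnings;

my $text   = "\x{100}\x{263A}\n";   # two wide characters and a newline
my $buffer = '';                    # will receive the encoded bytes

# Write characters; the layer turns them into UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "cannot open buffer: $!";
print {$out} $text;
close($out) or die "cannot close buffer: $!";
my $byte_length = length($buffer);  # 2 + 3 + 1 bytes

# Read the bytes back; the layer turns them into characters again.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "cannot reopen buffer: $!";
my $line = <$in>;
close($in);

printf "bytes written: %d, characters read: %d\n", $byte_length, length($line);
# prints: bytes written: 6, characters read: 3
```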
341
342       The I/O layers can also be specified more flexibly with the "open"
343       pragma.  See open, or look at the following example.
344
345           use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
346           open X, ">file";
347           print X chr(0x100), "\n";
348           close X;
349           open Y, "<file";
350           printf "%#x\n", ord(<Y>); # this should print 0x100
351           close Y;
352
353       With the "open" pragma you can use the ":locale" layer
354
           BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
           # the :locale will probe the locale environment variables like LC_ALL
           use open OUT => ':locale'; # russki parusski
           open(O, ">koi8");
           print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
           close O;
           open(I, "<koi8");
           printf "%#x\n", ord(<I>); # this should print 0xc1
           close I;
364
365       These methods install a transparent filter on the I/O stream that
366       converts data from the specified encoding when it is read in from the
367       stream.  The result is always Unicode.
368
369       The open pragma affects all the "open()" calls after the pragma by
370       setting default layers.  If you want to affect only certain streams,
371       use explicit layers directly in the "open()" call.
372
373       You can switch encodings on an already opened stream by using
374       "binmode()"; see "binmode" in perlfunc.
375
376       The ":locale" does not currently (as of Perl 5.8.0) work with "open()"
377       and "binmode()", only with the "open" pragma.  The ":utf8" and
378       ":encoding(...)" methods do work with all of "open()", "binmode()", and
379       the "open" pragma.
380
381       Similarly, you may use these I/O layers on output streams to
382       automatically convert Unicode to the specified encoding when it is
383       written to the stream. For example, the following snippet copies the
384       contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
385       the file "text.utf8", encoded as UTF-8:
386
387           open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
388           open(my $unicode, '>:utf8',                  'text.utf8');
389           while (<$nihongo>) { print $unicode $_ }
390
       The naming of encodings, both by the "open()" and by the "open" pragma,
       allows for flexible names: "koi8-r" and "KOI8R" will both be
       understood.
394
395       Common encodings recognized by ISO, MIME, IANA, and various other
396       standardisation organisations are recognised; for a more detailed list
397       see Encode::Supported.
398
399       "read()" reads characters and returns the number of characters.
400       "seek()" and "tell()" operate on byte counts, as do "sysread()" and
401       "sysseek()".
402
403       Notice that because of the default behaviour of not doing any
404       conversion upon input if there is no default layer, it is easy to
405       mistakenly write code that keeps on expanding a file by repeatedly
406       encoding the data:
407
408           # BAD CODE WARNING
409           open F, "file";
410           local $/; ## read in the whole file of 8-bit characters
411           $t = <F>;
412           close F;
413           open F, ">:encoding(utf8)", "file";
414           print F $t; ## convert to UTF-8 on output
415           close F;
416
       If you run this code twice, the contents of the file will be twice
       UTF-8 encoded.  A "use open ':encoding(utf8)'" would have avoided the
       bug, as would explicitly opening the file for input as UTF-8.
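
       One way to write the read-modify-write cycle so that it can be run
       repeatedly is sketched below: the same encoding layer is given on
       input and on output.  The temporary file and the "rewrite_file" name
       are assumptions made only for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small UTF-8 encoded file to work on.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':encoding(UTF-8)');
print {$tmp} "caf\x{E9}\n";
close($tmp) or die "cannot close $path: $!";

# Hypothetical helper: read the file as UTF-8 and write it back as UTF-8.
sub rewrite_file {
    my ($file) = @_;
    open(my $in, '<:encoding(UTF-8)', $file) or die "cannot read $file: $!";
    my $text = do { local $/; <$in> };   # slurp decoded characters
    close($in);
    open(my $out, '>:encoding(UTF-8)', $file) or die "cannot write $file: $!";
    print {$out} $text;                  # encode exactly once on output
    close($out) or die "cannot close $file: $!";
    return length($text);
}

my $size_before = -s $path;
my @char_counts = map { rewrite_file($path) } 1 .. 2;
my $size_after  = -s $path;

print "size before: $size_before, size after: $size_after\n";
```

       Because the input is decoded before it is encoded again, the file size
       and the character count stay the same however often the rewrite runs.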
420
421       NOTE: the ":utf8" and ":encoding" features work only if your Perl has
422       been built with the new PerlIO feature (which is the default on most
423       systems).
424
425   Displaying Unicode As Text
426       Sometimes you might want to display Perl scalars containing Unicode as
427       simple ASCII (or EBCDIC) text.  The following subroutine converts its
428       argument so that Unicode characters with code points greater than 255
429       are displayed as "\x{...}", control characters (like "\n") are
430       displayed as "\x..", and the rest of the characters as themselves:
431
432          sub nice_string {
433              join("",
434                map { $_ > 255 ?                  # if wide character...
435                      sprintf("\\x{%04X}", $_) :  # \x{...}
436                      chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
437                      sprintf("\\x%02X", $_) :    # \x..
438                      quotemeta(chr($_))          # else quoted or as themselves
439                } unpack("W*", $_[0]));           # unpack Unicode characters
440          }
441
442       For example,
443
444          nice_string("foo\x{100}bar\n")
445
446       returns the string
447
448          'foo\x{0100}bar\x0A'
449
450       which is ready to be printed.
451
452   Special Cases
453       ·   Bit Complement Operator ~ And vec()
454
455           The bit complement operator "~" may produce surprising results if
456           used on strings containing characters with ordinal values above
457           255. In such a case, the results are consistent with the internal
458           encoding of the characters, but not with much else. So don't do
459           that. Similarly for "vec()": you will be operating on the
460           internally-encoded bit patterns of the Unicode characters, not on
461           the code point values, which is very probably not what you want.
462
463       ·   Peeking At Perl's Internal Encoding
464
465           Normal users of Perl should never care how Perl encodes any
466           particular Unicode string (because the normal ways to get at the
467           contents of a string with Unicode--via input and output--should
468           always be via explicitly-defined I/O layers). But if you must,
469           there are two ways of looking behind the scenes.
470
           One way of peeking inside the internal encoding of Unicode
           characters is to use "unpack("C*", ...)" to get the bytes of
           whatever the string encoding happens to be, or "unpack("U0..",
           ...)" to get the bytes of the UTF-8 encoding:
475
476               # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
477               print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
478
479           Yet another way would be to use the Devel::Peek module:
480
481               perl -MDevel::Peek -e 'Dump(chr(0x100))'
482
483           That shows the "UTF8" flag in FLAGS and both the UTF-8 bytes and
484           Unicode characters in "PV".  See also later in this document the
485           discussion about the "utf8::is_utf8()" function.
486
487   Advanced Topics
488       ·   String Equivalence
489
490           The question of string equivalence turns somewhat complicated in
491           Unicode: what do you mean by "equal"?
492
493           (Is "LATIN CAPITAL LETTER A WITH ACUTE" equal to "LATIN CAPITAL
494           LETTER A"?)
495
496           The short answer is that by default Perl compares equivalence
497           ("eq", "ne") based only on code points of the characters.  In the
498           above case, the answer is no (because 0x00C1 != 0x0041).  But
499           sometimes, any CAPITAL LETTER As should be considered equal, or
500           even As of any case.
501
502           The long answer is that you need to consider character
503           normalization and casing issues: see Unicode::Normalize, Unicode
504           Technical Report #15, Unicode Normalization Forms
505           <http://www.unicode.org/unicode/reports/tr15> and sections on case
506           mapping in the Unicode Standard <http://www.unicode.org>.
507
508           As of Perl 5.8.0, the "Full" case-folding of Case
509           Mappings/SpecialCasing is implemented, but bugs remain in "qr//i"
510           with them.
511
512       ·   String Collation
513
514           People like to see their strings nicely sorted--or as Unicode
515           parlance goes, collated.  But again, what do you mean by collate?
516
517           (Does "LATIN CAPITAL LETTER A WITH ACUTE" come before or after
518           "LATIN CAPITAL LETTER A WITH GRAVE"?)
519
520           The short answer is that by default, Perl compares strings ("lt",
521           "le", "cmp", "ge", "gt") based only on the code points of the
522           characters.  In the above case, the answer is "after", since 0x00C1
523           > 0x00C0.
524
525           The long answer is that "it depends", and a good answer cannot be
526           given without knowing (at the very least) the language context.
527           See Unicode::Collate, and Unicode Collation Algorithm
528           <http://www.unicode.org/unicode/reports/tr10/>
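
       The difference between code point order and collation order can be
       sketched with plain ASCII letters and the core Unicode::Collate
       module, used here with its default settings and no language-specific
       tailoring.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = qw(b A a B);

# Perl's built-in sort compares code points: all capitals come first.
my @by_code_point = sort @words;

# The Unicode Collation Algorithm groups letters together regardless of case.
my $collator = Unicode::Collate->new;
my @by_uca   = $collator->sort(@words);

print "code point order: @by_code_point\n";   # code point order: A B a b
print "collation order:  @by_uca\n";          # collation order:  a A b B
```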
529
530   Miscellaneous
531       ·   Character Ranges and Classes
532
           Character ranges in regular expression bracketed character classes
           (e.g., "/[a-z]/") and in the "tr///" (also known as "y///")
           operator are not magically Unicode-aware.  What this means is that
           "[A-Za-z]" will not magically start to mean "all alphabetic
           letters" (not that it does mean that even for 8-bit characters; for
           those, if you are using locales (perllocale), use "/[[:alpha:]]/";
           and if not, use the 8-bit-aware property "\p{alpha}").

           All the properties that begin with "\p" (and its inverse "\P") are
           actually character classes that are Unicode-aware.  There are
           dozens of them; see perluniprops.
544
545           You can use Unicode code points as the end points of character
546           ranges, and the range will include all Unicode code points that lie
547           between those end points.
548
549       ·   String-To-Number Conversions
550
551           Unicode does define several other decimal--and numeric--characters
552           besides the familiar 0 to 9, such as the Arabic and Indic digits.
553           Perl does not support string-to-number conversion for digits other
554           than ASCII 0 to 9 (and ASCII a to f for hexadecimal).
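
       The points in this section can be checked with a few one-line tests.
       The sketch below uses GREEK SMALL LETTER ALPHA, GREEK CAPITAL LETTER
       DELTA and ARABIC-INDIC DIGIT THREE as sample characters; the choice of
       characters and the variable names are only for illustration.

```perl
use strict;
use warnings;

my $alpha        = "\x{03B1}";   # GREEK SMALL LETTER ALPHA
my $delta        = "\x{0394}";   # GREEK CAPITAL LETTER DELTA
my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE

my %result = (
    ascii_range => ($alpha =~ /[A-Za-z]/            ? 1 : 0),  # not an ASCII letter
    greek_prop  => ($alpha =~ /\p{Greek}/           ? 1 : 0),  # Unicode-aware property
    code_range  => ($delta =~ /[\x{0391}-\x{03A9}]/ ? 1 : 0),  # code point end points
    is_digit    => ($arabic_three =~ /^\d$/         ? 1 : 0),  # a decimal digit...
);

# ...but string-to-number conversion only understands ASCII digits.
my $as_number = do { no warnings 'numeric'; $arabic_three + 0 };

print "$_: $result{$_}\n" for sort keys %result;
print "numeric value: $as_number\n";
# prints:
#   ascii_range: 0
#   code_range: 1
#   greek_prop: 1
#   is_digit: 1
#   numeric value: 0
```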
555
556   Questions With Answers
557       ·   Will My Old Scripts Break?
558
           Very probably not.  Unless you are generating Unicode characters
           somehow, old behaviour should be preserved.  About the only
           behaviour that has changed and which could start generating Unicode
           is the old behaviour of "chr()" where supplying an argument more
           than 255 produced a character modulo 256.  "chr(300)", for example,
           was equal to "chr(44)" or "," (in ASCII); now it is LATIN CAPITAL
           LETTER I WITH BREVE.
566
567       ·   How Do I Make My Scripts Work With Unicode?
568
569           Very little work should be needed since nothing changes until you
570           generate Unicode data.  The most important thing is getting input
571           as Unicode; for that, see the earlier I/O discussion.
572
573       ·   How Do I Know Whether My String Is In Unicode?
574
575           You shouldn't have to care.  But you may, because currently the
576           semantics of the characters whose ordinals are in the range 128 to
577           255 are different depending on whether the string they are
578           contained within is in Unicode or not.  (See "When Unicode Does Not
579           Happen" in perlunicode.)
580
581           To determine if a string is in Unicode, use:
582
583               print utf8::is_utf8($string) ? 1 : 0, "\n";
584
           But note that this doesn't mean that any of the characters in the
           string are necessarily UTF-8 encoded, or that any of the characters
587           have code points greater than 0xFF (255) or even 0x80 (128), or
588           that the string has any characters at all.  All the "is_utf8()"
589           does is to return the value of the internal "utf8ness" flag
590           attached to the $string.  If the flag is off, the bytes in the
591           scalar are interpreted as a single byte encoding.  If the flag is
592           on, the bytes in the scalar are interpreted as the (variable-
593           length, potentially multi-byte) UTF-8 encoded code points of the
594           characters.  Bytes added to a UTF-8 encoded string are
595           automatically upgraded to UTF-8.  If mixed non-UTF-8 and UTF-8
596           scalars are merged (double-quoted interpolation, explicit
597           concatenation, and printf/sprintf parameter substitution), the
598           result will be UTF-8 encoded as if copies of the byte strings were
599           upgraded to UTF-8: for example,
600
601               $a = "ab\x80c";
602               $b = "\x{100}";
603               print "$a = $b\n";
604
605           the output string will be UTF-8-encoded "ab\x80c = \x{100}\n", but
606           $a will stay byte-encoded.
607
608           Sometimes you might really need to know the byte length of a string
609           instead of the character length. For that use either the
610           "Encode::encode_utf8()" function or the "bytes" pragma  and the
611           "length()" function:
612
613               my $unicode = chr(0x100);
614               print length($unicode), "\n"; # will print 1
615               require Encode;
616               print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
617               use bytes;
618               print length($unicode), "\n"; # will also print 2
619                                             # (the 0xC4 0x80 of the UTF-8)
620               no bytes;
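
           The flag behaviour described in this answer can be observed
           directly.  The following sketch repeats the merging example with
           differently named variables and reports the flag of each scalar;
           it only inspects the internal flag and says nothing about whether
           the data is meaningful text.

```perl
use strict;
use warnings;

my $bytes  = "ab\x80c";             # all code points below 0x100
my $wide   = "\x{100}";             # needs the UTF-8 representation
my $joined = "$bytes = $wide\n";    # merging upgrades a copy of $bytes

my @flags = map { utf8::is_utf8($_) ? 1 : 0 } ($bytes, $wide, $joined);

print "flags: @flags\n";                     # flags: 0 1 1
print "length: ", length($joined), "\n";     # length: 9 (characters, not bytes)
```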

       ·   How Do I Detect Data That's Not Valid In a Particular Encoding?

           Use the "Encode" package to try converting it.  For example,

               use Encode 'decode_utf8';

               if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
                   # $string is valid utf8
               } else {
                   # $string is not valid utf8
               }

           Or use "unpack" to try decoding it:

               use warnings;
               @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);

           If the data is invalid, a "Malformed UTF-8 character" warning is
           produced.  The "C0" means "process the string character by
           character".  Without that, the "unpack("U*", ...)" would work in
           "U0" mode (the default if the format string starts with "U") and
           it would return the bytes making up the UTF-8 encoding of the
           target string, something that will always work.
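
           The "Encode" check can be wrapped in a small helper.  This is a
           sketch under assumptions: the name is_valid_utf8() is hypothetical,
           and it uses the strict "UTF-8" encoding name.  One caveat worth
           knowing is that, when a CHECK argument is given without
           "Encode::LEAVE_SRC", Encode may modify the source string in place,
           so the sketch sets that bit.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if $octets decodes cleanly as UTF-8,
# 0 otherwise.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps it from altering $octets in place.
        Encode::decode('UTF-8', $octets,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

print is_valid_utf8("\xC4\x80") ? "valid\n" : "invalid\n";  # valid (U+0100)
print is_valid_utf8("ab\x80c")  ? "valid\n" : "invalid\n";  # invalid
```

           The second call fails because 0x80 is a continuation byte with no
           lead byte before it.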

       ·   How Do I Convert Binary Data Into a Particular Encoding, Or Vice
           Versa?

           This probably isn't as useful as you might think.  Normally, you
           shouldn't need to.

           In one sense, what you are asking doesn't make much sense:
           encodings are for characters, and binary data are not "characters",
           so converting "data" into some encoding isn't meaningful unless you
           know which character set and encoding the binary data is in, in
           which case it's not just binary data, now is it?

           If you have a raw sequence of bytes that you know should be
           interpreted via a particular encoding, you can use "Encode":

               use Encode 'from_to';
               from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8

           The call to "from_to()" changes the bytes in $data, but nothing
           material about the nature of the string has changed as far as Perl
           is concerned.  Both before and after the call, the string $data
           contains just a bunch of 8-bit bytes.  As far as Perl is concerned,
           the encoding of the string remains as "system-native 8-bit bytes".
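
           A short sketch makes this visible: the bytes change, the flag
           does not.  The sample value (a single Latin-1 e-acute byte) is an
           assumption made for illustration.

```perl
use strict;
use warnings;
use Encode 'from_to';

my $data = "\xE9";    # e-acute as one Latin-1 byte

# %vX prints the ordinal of each character in hex, joined by dots.
printf "before: %vX, flag %s\n",
    $data, utf8::is_utf8($data) ? "on" : "off";    # E9, flag off

from_to($data, "iso-8859-1", "utf-8");             # rewrites $data in place

printf "after:  %vX, flag %s\n",
    $data, utf8::is_utf8($data) ? "on" : "off";    # C3.A9, flag off
```

           After the call, "length($data)" is 2: Perl still sees two
           separate 8-bit bytes, not one character.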

           You might relate this to a fictional 'Translate' module:

               use Translate;
               my $phrase = "Yes";
               Translate::from_to($phrase, 'english', 'deutsch');
               ## phrase now contains "Ja"

           The contents of the string change, but not the nature of the
           string.  Perl doesn't know any more after the call than before that
           the contents of the string indicate the affirmative.

           Back to converting data.  If you have (or want) data in your
           system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you
           can use pack/unpack to convert to/from Unicode.

               $native_string  = pack("W*", unpack("U*", $Unicode_string));
               $Unicode_string = pack("U*", unpack("W*", $native_string));

           If you have a sequence of bytes you know is valid UTF-8, but Perl
           doesn't know it yet, you can make Perl a believer, too:

               use Encode 'decode_utf8';
               $Unicode = decode_utf8($bytes);

           or:

               $Unicode = pack("U0a*", $bytes);

           You can find the bytes that make up a UTF-8 sequence with

               @bytes = unpack("C*", $Unicode_string)

           and you can create well-formed Unicode with

               $Unicode_string = pack("U*", 0xff, ...)
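
           As a rough illustration of the "Encode" route, the sketch below
           decodes two bytes into one character and encodes it back.  It
           uses "Encode::encode_utf8()" for the return trip rather than the
           "unpack" idiom shown above; the sample bytes are an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

my $bytes   = "\xC4\x80";             # UTF-8 encoding of U+0100
my $unicode = decode_utf8($bytes);    # one character, as far as Perl knows

printf "%d byte(s) in, %d character(s) out, U+%04X\n",
    length($bytes), length($unicode), ord($unicode);    # 2, 1, U+0100

# encode_utf8() returns a byte string, so "C*" yields one value per byte.
my @octets = unpack("C*", encode_utf8($unicode));
printf "octets: %s\n",
    join(" ", map { sprintf "%02X", $_ } @octets);      # C4 80
```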

       ·   How Do I Display Unicode?  How Do I Input Unicode?

           See <http://www.alanwood.net/unicode/> and
           <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

       ·   How Does Unicode Work With Traditional Locales?

           In Perl, not very well.  Avoid using locales through the "locale"
           pragma: use either Unicode or locales, not both.  But see perlrun
           for the description of the "-C" switch and its environment
           counterpart, $ENV{PERL_UNICODE}, to see how to enable various
           Unicode features, for example by using locale settings.

   Hexadecimal Notation
       The Unicode standard prefers using hexadecimal notation because that
       more clearly shows the division of Unicode into blocks of 256
       characters.  Hexadecimal is also simply shorter than decimal.  You can
       use decimal notation, too, but learning to use hexadecimal just makes
       life easier with the Unicode standard.  The "U+HHHH" notation uses
       hexadecimal, for example.

       The "0x" prefix means a hexadecimal number; the digits are 0-9 and a-f
       (or A-F, case doesn't matter).  Each hexadecimal digit represents four
       bits, or half a byte.  "print 0x..., "\n"" will show a hexadecimal
       number in decimal, and "printf "%x\n", $decimal" will show a decimal
       number in hexadecimal.  If you have just the "hex digits" of a
       hexadecimal number, you can use the "hex()" function.

           print 0x0009, "\n";    # 9
           print 0x000a, "\n";    # 10
           print 0x000f, "\n";    # 15
           print 0x0010, "\n";    # 16
           print 0x0011, "\n";    # 17
           print 0x0100, "\n";    # 256

           print 0x0041, "\n";    # 65

           printf "%x\n",  65;    # 41
           printf "%#x\n", 65;    # 0x41

           print hex("41"), "\n"; # 65
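
       The same functions can convert to and from the "U+HHHH" notation.
       The helper names u_plus() and from_u_plus() below are hypothetical,
       introduced only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: format a code point in U+HHHH style
# (at least four uppercase hex digits).
sub u_plus {
    my ($code_point) = @_;
    return sprintf "U+%04X", $code_point;
}

# Hypothetical helper: parse "U+HHHH" back into a number,
# or return undef if the string does not have that shape.
sub from_u_plus {
    my ($text) = @_;
    return $text =~ /\AU\+([0-9A-Fa-f]+)\z/ ? hex($1) : undef;
}

print u_plus(65), "\n";               # U+0041
print u_plus(0x3B1), "\n";            # U+03B1
print from_u_plus("U+0100"), "\n";    # 256
```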

   Further Resources
       ·   Unicode Consortium

           <http://www.unicode.org/>

       ·   Unicode FAQ

           <http://www.unicode.org/unicode/faq/>

       ·   Unicode Glossary

           <http://www.unicode.org/glossary/>

       ·   Unicode Useful Resources

           <http://www.unicode.org/unicode/onlinedat/resources.html>

       ·   Unicode and Multilingual Support in HTML, Fonts, Web Browsers and
           Other Applications

           <http://www.alanwood.net/unicode/>

       ·   UTF-8 and Unicode FAQ for Unix/Linux

           <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

       ·   Legacy Character Sets

           <http://www.czyborra.com/> <http://www.eki.ee/letter/>

       ·   The Unicode support files live within the Perl installation in the
           directory

               $Config{installprivlib}/unicore

           in Perl 5.8.0 or newer, and

               $Config{installprivlib}/unicode

           in the Perl 5.6 series.  (The renaming to lib/unicore was done to
           avoid naming conflicts with lib/Unicode in case-insensitive
           filesystems.)  The main Unicode data file is UnicodeData.txt (or
           Unicode.301 in Perl 5.6.1).  You can find the value of
           $Config{installprivlib} by running

               perl "-V:installprivlib"

           You can explore various information from the Unicode data files
           using the "Unicode::UCD" module.
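
           As a small sketch of both points, the following prints the
           library directory from within a script and looks up one
           character with "Unicode::UCD::charinfo()".  The choice of U+03B1
           is just an example.

```perl
use strict;
use warnings;
use Config;                      # exports %Config
use Unicode::UCD 'charinfo';

print "privlib: $Config{installprivlib}\n";

# charinfo() returns a hash reference describing the code point,
# or undef if the code point is unassigned.
my $info = charinfo(0x03B1);
printf "U+%s %s (category %s)\n",
    $info->{code}, $info->{name}, $info->{category};
# prints: U+03B1 GREEK SMALL LETTER ALPHA (category Ll)
```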

UNICODE IN OLDER PERLS

       If you cannot upgrade your Perl to 5.8.0 or later, you can still do
       some Unicode processing by using the modules "Unicode::String",
       "Unicode::Map8", and "Unicode::Map", available from CPAN.  If you have
       the GNU recode installed, you can also use the Perl front-end
       "Convert::Recode" for character conversions.

       The following are fast conversions from ISO 8859-1 (Latin-1) bytes to
       UTF-8 bytes and back.  The code works even with older Perl 5 versions.

           # ISO 8859-1 to UTF-8
           s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

           # UTF-8 to ISO 8859-1
           s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
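
       The two substitutions operate on $_ as written.  Wrapped in
       subroutines they can be applied to any string, as in this sketch;
       the names latin1_to_utf8() and utf8_to_latin1() are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical wrapper: each Latin-1 byte 0x80-0xFF becomes a
# two-byte UTF-8 sequence.
sub latin1_to_utf8 {
    my ($s) = @_;
    $s =~ s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
    return $s;
}

# Hypothetical wrapper: two-byte UTF-8 sequences with lead byte
# 0xC2 or 0xC3 collapse back to a single Latin-1 byte.
sub utf8_to_latin1 {
    my ($s) = @_;
    $s =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
    return $s;
}

my $latin1 = "caf\xE9";                  # "cafe" with e-acute, Latin-1
my $utf8   = latin1_to_utf8($latin1);

printf "%vX\n", $utf8;                   # 63.61.66.C3.A9
print utf8_to_latin1($utf8) eq $latin1 ? "round trip ok\n" : "mismatch\n";
```

       Both directions treat the string as plain bytes; neither looks at or
       sets Perl's internal UTF-8 flag.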

ACKNOWLEDGMENTS

       Thanks to the kind readers of the perl5-porters@perl.org,
       perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
       mailing lists for their valuable feedback.

AUTHOR, COPYRIGHT, AND LICENSE

       Copyright 2001-2002 Jarkko Hietaniemi <jhi@iki.fi>

       This document may be distributed under the same terms as Perl itself.



perl v5.12.4                      2011-06-07                   PERLUNIINTRO(1)
