PERLUNIINTRO(1)        Perl Programmers Reference Guide       PERLUNIINTRO(1)



NAME
       perluniintro - Perl Unicode introduction

DESCRIPTION
       This document gives a general idea of Unicode and how to use Unicode
       in Perl.  See "Further Resources" for references to more in-depth
       treatments of Unicode.

   Unicode
       Unicode is a character set standard which plans to codify all of the
       writing systems of the world, plus many other symbols.

       Unicode and ISO/IEC 10646 are coordinated standards that unify almost
       all other modern character set standards, covering more than 80
       writing systems and hundreds of languages, including all
       commercially-important modern languages.  All characters in the
       largest Chinese, Japanese, and Korean dictionaries are also encoded.
       The standards will eventually cover almost all characters in more
       than 250 writing systems and thousands of languages.  Unicode 1.0 was
       released in October 1991, and 6.0 in October 2010.

       A Unicode character is an abstract entity.  It is not bound to any
       particular integer width, especially not to the C language "char".
       Unicode is language-neutral and display-neutral: it does not encode
       the language of the text, and it does not generally define fonts or
       other graphical layout details.  Unicode operates on characters and
       on text built from those characters.

       Unicode defines characters like "LATIN CAPITAL LETTER A" or "GREEK
       SMALL LETTER ALPHA" and unique numbers for the characters, in this
       case 0x0041 and 0x03B1, respectively.  These unique numbers are
       called code points.  A code point is essentially the position of the
       character within the set of all possible Unicode characters, and thus
       in Perl, the term ordinal is often used interchangeably with it.

       The Unicode standard prefers using hexadecimal notation for the code
       points.  If numbers like 0x0041 are unfamiliar to you, take a peek at
       a later section, "Hexadecimal Notation".  The Unicode standard uses
       the notation "U+0041 LATIN CAPITAL LETTER A", to give the hexadecimal
       code point and the normative name of the character.

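       For instance, the mapping between a character and the "U+XXXX"
       notation can be explored with "ord()", "chr()", and "sprintf()" (a
       small illustrative sketch, not part of the standard's text):

```perl
# Convert a character to its U+XXXX notation and back.
my $alpha    = "\x{3B1}";                     # GREEK SMALL LETTER ALPHA
my $notation = sprintf "U+%04X", ord $alpha;  # code point as hexadecimal
print "$notation\n";                          # prints "U+03B1"
print chr(0x0041), "\n";                      # prints "A"
```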
       Unicode also defines various properties for the characters, like
       "uppercase" or "lowercase", "decimal digit", or "punctuation"; these
       properties are independent of the names of the characters.
       Furthermore, various operations on the characters like uppercasing,
       lowercasing, and collating (sorting) are defined.

       A Unicode logical "character" can actually consist of more than one
       internal actual "character" or code point.  For Western languages,
       this is adequately modelled by a base character (like "LATIN CAPITAL
       LETTER A") followed by one or more modifiers (like "COMBINING ACUTE
       ACCENT").  This sequence of base character and modifiers is called a
       combining character sequence.  Some non-western languages require
       more complicated models, so Unicode created the grapheme cluster
       concept, which was later further refined into the extended grapheme
       cluster.  For example, a Korean Hangul syllable is considered a
       single logical character, but most often consists of three actual
       Unicode characters: a leading consonant followed by an interior
       vowel followed by a trailing consonant.

       Whether to call these extended grapheme clusters "characters"
       depends on your point of view.  If you are a programmer, you
       probably would tend towards seeing each element in the sequences as
       one unit, or "character".  However, from the user's point of view,
       the whole sequence could be seen as one "character" since that's
       probably what it looks like in the context of the user's language.
       In this document, we take the programmer's point of view: one
       "character" is one Unicode code point.

       For some combinations of base character and modifiers, there are
       precomposed characters.  There is a single character equivalent, for
       example, to the sequence "LATIN CAPITAL LETTER A" followed by
       "COMBINING ACUTE ACCENT".  It is called "LATIN CAPITAL LETTER A WITH
       ACUTE".  These precomposed characters are, however, only available
       for some combinations, and are mainly meant to support round-trip
       conversions between Unicode and legacy standards (like ISO 8859).
       Using sequences, as Unicode does, means that fewer basic building
       blocks (code points) are needed to express many more potential
       grapheme clusters.  To support conversion between equivalent forms,
       various normalization forms are also defined.  Thus, "LATIN CAPITAL
       LETTER A WITH ACUTE" is in Normalization Form Composed (abbreviated
       NFC), and the sequence "LATIN CAPITAL LETTER A" followed by
       "COMBINING ACUTE ACCENT" represents the same character in
       Normalization Form Decomposed (NFD).

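       Conversions between these forms can be performed with the standard
       Unicode::Normalize module, which is bundled with Perl; a brief
       sketch:

```perl
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{C1}";      # LATIN CAPITAL LETTER A WITH ACUTE
my $decomposed = "A\x{301}";    # "A" + COMBINING ACUTE ACCENT

# Decomposing the precomposed character yields the combining sequence,
# and composing the sequence yields the precomposed character.
print NFD($composed)   eq $decomposed ? "NFD matches\n" : "differ\n";
print NFC($decomposed) eq $composed   ? "NFC matches\n" : "differ\n";
```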
       Because of backward compatibility with legacy encodings, the "a
       unique number for every character" idea breaks down a bit: instead,
       there is "at least one number for every character".  The same
       character could be represented differently in several legacy
       encodings.  The converse is also not true: some code points do not
       have an assigned character.  Firstly, there are unallocated code
       points within otherwise used blocks.  Secondly, there are special
       Unicode control characters that do not represent true characters.

       When Unicode was first conceived, it was thought that all the
       world's characters could be represented using a 16-bit word; that
       is, a maximum of 0x10000 (or 65536) characters from 0x0000 to 0xFFFF
       would be needed.  This soon proved to be false, and since Unicode
       2.0 (July 1996), Unicode has been defined all the way up to 21 bits
       (0x10FFFF), and Unicode 3.1 (March 2001) defined the first
       characters above 0xFFFF.  The first 0x10000 characters are called
       the Plane 0, or the Basic Multilingual Plane (BMP).  With Unicode
       3.1, 17 (yes, seventeen) planes in all were defined--but they are
       nowhere near full of defined characters, yet.

       When a new language is being encoded, Unicode generally will choose
       a "block" of consecutive unallocated code points for its characters.
       So far, the number of code points in these blocks has always been
       evenly divisible by 16.  Extras in a block, not currently needed,
       are left unallocated, for future growth.  But there have been
       occasions when a later release needed more code points than the
       available extras, and a new block had to be allocated somewhere
       else, not contiguous to the initial one, to handle the overflow.
       Thus, it became apparent early on that "block" wasn't an adequate
       organizing principle, and so the "Script" property was created.
       (Later an improved script property was added as well, the
       "Script_Extensions" property.)  Those code points that are in
       overflow blocks can still have the same script as the original ones.
       The script concept fits more closely with natural language: there is
       "Latin" script, "Greek" script, and so on; and there are several
       artificial scripts, like "Common" for characters that are used in
       multiple scripts, such as mathematical symbols.  Scripts usually
       span varied parts of several blocks.  For more information about
       scripts, see "Scripts" in perlunicode.  The division into blocks
       exists, but it is almost completely accidental--an artifact of how
       the characters have been and still are allocated.  (Note that this
       paragraph has oversimplified things for the sake of this being an
       introduction.  Unicode doesn't really encode languages, but the
       writing systems for them--their scripts; and one script can be used
       by many languages.  Unicode also encodes things that aren't really
       about languages, such as symbols like "BAGGAGE CLAIM".)

       The Unicode code points are just abstract numbers.  To input and
       output these abstract numbers, the numbers must be encoded or
       serialised somehow.  Unicode defines several character encoding
       forms, of which UTF-8 is perhaps the most popular.  UTF-8 is a
       variable length encoding that encodes Unicode characters as 1 to 4
       bytes.  Other encodings include UTF-16 and UTF-32 and their big- and
       little-endian variants (UTF-8 is byte-order independent).  ISO/IEC
       10646 defines the UCS-2 and UCS-4 encoding forms.

       For more information about encodings--for instance, to learn what
       surrogates and byte order marks (BOMs) are--see perlunicode.

   Perl's Unicode Support
       Starting from Perl 5.6.0, Perl has had the capacity to handle
       Unicode natively.  Perl 5.8.0, however, is the first recommended
       release for serious Unicode work.  The maintenance release 5.6.1
       fixed many of the problems of the initial Unicode implementation,
       but for example regular expressions still did not work with Unicode
       in 5.6.1.  Perl 5.14.0 is the first release where Unicode support is
       (almost) seamlessly integrable without gotchas (the exception being
       some differences in "quotemeta", which is fixed starting in Perl
       5.16.0).  To enable this seamless support, you should "use feature
       'unicode_strings'" (which is automatically selected if you "use
       5.012" or higher).  See feature.  (5.14 also fixes a number of bugs
       and departures from the Unicode standard.)

       Before Perl 5.8.0, "use utf8" was used to declare that operations in
       the current block or file would be Unicode-aware.  This model was
       found to be wrong, or at least clumsy: the "Unicodeness" is now
       carried with the data, instead of being attached to the operations.
       Starting with Perl 5.8.0, only one case remains where an explicit
       "use utf8" is needed: if your Perl script itself is encoded in
       UTF-8, you can use UTF-8 in your identifier names, and in string and
       regular expression literals, by saying "use utf8".  This is not the
       default because scripts with legacy 8-bit data in them would break.
       See utf8.

   Perl's Unicode Model
       Perl supports both pre-5.6 strings of eight-bit native bytes, and
       strings of Unicode characters.  The general principle is that Perl
       tries to keep its data as eight-bit bytes for as long as possible,
       but as soon as Unicodeness cannot be avoided, the data is
       transparently upgraded to Unicode.  Prior to Perl 5.14, the upgrade
       was not completely transparent (see "The "Unicode Bug"" in
       perlunicode), and for backwards compatibility, full transparency is
       not gained unless "use feature 'unicode_strings'" (see feature) or
       "use 5.012" (or higher) is selected.

       Internally, Perl currently uses either whatever the native eight-bit
       character set of the platform (for example Latin-1) is, or UTF-8, to
       encode Unicode strings.  Specifically, if all code points in the
       string are 0xFF or less, Perl uses the native eight-bit character
       set.  Otherwise, it uses UTF-8.

       A user of Perl does not normally need to know nor care how Perl
       happens to encode its internal strings, but it becomes relevant when
       outputting Unicode strings to a stream without a PerlIO layer (one
       with the "default" encoding).  In such a case, the raw bytes used
       internally (the native character set or UTF-8, as appropriate for
       each string) will be used, and a "Wide character" warning will be
       issued if those strings contain a character beyond 0x00FF.

       For example,

           perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

       produces a fairly useless mixture of native bytes and UTF-8, as well
       as a warning:

           Wide character in print at ...

       To output UTF-8, use the ":encoding" or ":utf8" output layer.
       Prepending

           binmode(STDOUT, ":utf8");

       to this sample program ensures that the output is completely UTF-8,
       and removes the program's warning.

       You can enable automatic UTF-8-ification of your standard file
       handles, default "open()" layer, and @ARGV by using either the "-C"
       command line switch or the "PERL_UNICODE" environment variable; see
       perlrun for the documentation of the "-C" switch.

       Note that this means that Perl expects other software to work the
       same way: if Perl has been led to believe that STDIN should be
       UTF-8, but then STDIN coming in from another command is not UTF-8,
       Perl will likely complain about the malformed UTF-8.

       All features that combine Unicode and I/O also require using the new
       PerlIO feature.  Almost all Perl 5.8 platforms do use PerlIO,
       though: you can see whether yours does by running "perl -V" and
       looking for "useperlio=define".

   Unicode and EBCDIC
       Perl 5.8.0 also supports Unicode on EBCDIC platforms.  There,
       Unicode support is somewhat more complex to implement since
       additional conversions are needed at every step.

       Later Perl releases have added code that will not work on EBCDIC
       platforms, and no one has complained, so the divergence has
       continued.  If you want to run Perl on an EBCDIC platform, send
       email to perlbug@perl.org.

       On EBCDIC platforms, the internal Unicode encoding form is
       UTF-EBCDIC instead of UTF-8.  The difference is that UTF-8 is
       "ASCII-safe", in that ASCII characters encode to UTF-8 as-is, while
       UTF-EBCDIC is "EBCDIC-safe".

   Creating Unicode
       To create Unicode characters in literals for code points above 0xFF,
       use the "\x{...}" notation in double-quoted strings:

           my $smiley = "\x{263a}";

       Similarly, it can be used in regular expression literals:

           $smiley =~ /\x{263a}/;

       At run-time you can use "chr()":

           my $hebrew_alef = chr(0x05d0);

       See "Further Resources" for how to find all these numeric codes.

       Naturally, "ord()" will do the reverse: it turns a character into a
       code point.

       Note that "\x.." (no "{}" and only two hexadecimal digits),
       "\x{...}", and "chr(...)" for arguments less than 0x100 (decimal
       256) generate an eight-bit character for backward compatibility with
       older Perls.  For arguments of 0x100 or more, Unicode characters are
       always produced.  If you want to force the production of Unicode
       characters regardless of the numeric value, use "pack("U", ...)"
       instead of "\x..", "\x{...}", or "chr()".

       You can invoke characters by name in double-quoted strings:

           my $arabic_alef = "\N{ARABIC LETTER ALEF}";

       And, as mentioned above, you can also "pack()" numbers into Unicode
       characters:

           my $georgian_an = pack("U", 0x10a0);

       Note that both "\x{...}" and "\N{...}" are compile-time string
       constants: you cannot use variables in them.  If you want similar
       run-time functionality, use "chr()" and
       "charnames::string_vianame()".

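       A run-time sketch using "charnames::string_vianame()", which is
       available since Perl 5.14:

```perl
use charnames ();   # load the module; no compile-time imports needed

my $name = "ARABIC LETTER ALEF";
my $char = charnames::string_vianame($name);  # like "\N{ARABIC LETTER ALEF}",
                                              # but the name can be a variable
printf "U+%04X\n", ord $char;                 # prints "U+0627"
```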
       If you want to force the result to Unicode characters, use the
       special "U0" prefix.  It consumes no arguments but causes the
       following bytes to be interpreted as the UTF-8 encoding of Unicode
       characters:

           my $chars = pack("U0W*", 0x80, 0x42);

       Likewise, you can stop such UTF-8 interpretation by using the
       special "C0" prefix.

   Handling Unicode
       Handling Unicode is for the most part transparent: just use the
       strings as usual.  Functions like "index()", "length()", and
       "substr()" will work on the Unicode characters; regular expressions
       will work on the Unicode characters (see perlunicode and perlretut).

       Note that Perl considers grapheme clusters to be separate
       characters, so for example

           print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"),
                 "\n";

       will print 2, not 1.  The only exception is that regular expressions
       have "\X" for matching an extended grapheme cluster.  (Thus "\X" in
       a regular expression would match the entire sequence of both the
       example characters.)

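       For example, "\X" can be used to count user-visible characters
       rather than code points (a small sketch):

```perl
# "A" + COMBINING ACUTE ACCENT + "b": three code points, two graphemes.
my $s = "A\x{301}b";

my $code_points = length $s;         # 3
my $graphemes   = () = $s =~ /\X/g;  # 2: the A-plus-accent cluster, then "b"
print "$code_points code points, $graphemes graphemes\n";
```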
       Life is not quite so transparent, however, when working with legacy
       encodings, I/O, and certain special cases:

   Legacy Encodings
       When you combine legacy data and Unicode, the legacy data needs to
       be upgraded to Unicode.  Normally the legacy data is assumed to be
       ISO 8859-1 (or EBCDIC, if applicable).

       The "Encode" module knows about many encodings and has interfaces
       for doing conversions between those encodings:

           use Encode 'decode';
           $data = decode("iso-8859-3", $data); # convert from legacy to utf-8

   Unicode I/O
       Normally, writing out Unicode data

           print FH $some_string_with_unicode, "\n";

       produces raw bytes that Perl happens to use to internally encode the
       Unicode string.  Perl's internal encoding depends on the system as
       well as what characters happen to be in the string at the time.  If
       any of the characters are at code points 0x100 or above, you will
       get a warning.  To ensure that the output is explicitly rendered in
       the encoding you desire--and to avoid the warning--open the stream
       with the desired encoding.  Some examples:

           open FH, ">:utf8", "file";

           open FH, ">:encoding(ucs2)",      "file";
           open FH, ">:encoding(UTF-8)",     "file";
           open FH, ">:encoding(shift_jis)", "file";

       and on already open streams, use "binmode()":

           binmode(STDOUT, ":utf8");

           binmode(STDOUT, ":encoding(ucs2)");
           binmode(STDOUT, ":encoding(UTF-8)");
           binmode(STDOUT, ":encoding(shift_jis)");

       The matching of encoding names is loose: case does not matter, and
       many encodings have several aliases.  Note that the ":utf8" layer
       must always be specified exactly like that; it is not subject to the
       loose matching of encoding names.  Also note that currently ":utf8"
       is unsafe for input, because it accepts the data without validating
       that it is indeed valid UTF-8; you should instead use
       ":encoding(utf-8)" (with or without a hyphen).

       See PerlIO for the ":utf8" layer, PerlIO::encoding and
       Encode::PerlIO for the ":encoding()" layer, and Encode::Supported
       for many encodings supported by the "Encode" module.

       Reading in a file that you know happens to be encoded in one of the
       Unicode or legacy encodings does not magically turn the data into
       Unicode in Perl's eyes.  To do that, specify the appropriate layer
       when opening files:

           open(my $fh, '<:encoding(utf8)', 'anything');
           my $line_of_unicode = <$fh>;

           open(my $fh, '<:encoding(Big5)', 'anything');
           my $line_of_unicode = <$fh>;

       The I/O layers can also be specified more flexibly with the "open"
       pragma.  See open, or look at the following example.

           use open ':encoding(utf8)'; # input/output default encoding will be
                                       # UTF-8
           open X, ">file";
           print X chr(0x100), "\n";
           close X;
           open Y, "<file";
           printf "%#x\n", ord(<Y>); # this should print 0x100
           close Y;

       With the "open" pragma you can use the ":locale" layer

           BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
           # the :locale will probe the locale environment variables like
           # LC_ALL
           use open OUT => ':locale'; # russki parusski
           open(O, ">koi8");
           print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
           close O;
           open(I, "<koi8");
           printf "%#x\n", ord(<I>); # this should print 0xc1
           close I;

       These methods install a transparent filter on the I/O stream that
       converts data from the specified encoding when it is read in from
       the stream.  The result is always Unicode.

       The open pragma affects all the "open()" calls after the pragma by
       setting default layers.  If you want to affect only certain streams,
       use explicit layers directly in the "open()" call.

       You can switch encodings on an already opened stream by using
       "binmode()"; see "binmode" in perlfunc.

       The ":locale" does not currently (as of Perl 5.8.0) work with
       "open()" and "binmode()", only with the "open" pragma.  The ":utf8"
       and ":encoding(...)" methods do work with all of "open()",
       "binmode()", and the "open" pragma.

       Similarly, you may use these I/O layers on output streams to
       automatically convert Unicode to the specified encoding when it is
       written to the stream.  For example, the following snippet copies
       the contents of the file "text.jis" (encoded as ISO-2022-JP, aka
       JIS) to the file "text.utf8", encoded as UTF-8:

           open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
           open(my $unicode, '>:utf8',                  'text.utf8');
           while (<$nihongo>) { print $unicode $_ }

       The naming of encodings, both by "open()" and by the "open" pragma,
       allows for flexible names: "koi8-r" and "KOI8R" will both be
       understood.

       Common encodings recognized by ISO, MIME, IANA, and various other
       standardisation organisations are recognised; for a more detailed
       list see Encode::Supported.

       "read()" reads characters and returns the number of characters.
       "seek()" and "tell()" operate on byte counts, as do "sysread()" and
       "sysseek()".

436
437 Notice that because of the default behaviour of not doing any
438 conversion upon input if there is no default layer, it is easy to
439 mistakenly write code that keeps on expanding a file by repeatedly
440 encoding the data:
441
442 # BAD CODE WARNING
443 open F, "file";
444 local $/; ## read in the whole file of 8-bit characters
445 $t = <F>;
446 close F;
447 open F, ">:encoding(utf8)", "file";
448 print F $t; ## convert to UTF-8 on output
449 close F;
450
451 If you run this code twice, the contents of the file will be twice
452 UTF-8 encoded. A "use open ':encoding(utf8)'" would have avoided the
453 bug, or explicitly opening also the file for input as UTF-8.
454
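       A corrected version reads and writes through explicit layers, so
       re-running it leaves the file's bytes unchanged (a sketch; the file
       name and sample content are just examples):

```perl
# Create a sample UTF-8 file, then demonstrate a safe re-encode cycle.
my $file = "sample.txt";                       # hypothetical file name
open my $fh, ">:encoding(UTF-8)", $file or die "write: $!";
print $fh "caf\x{E9}\x{100}\n";                # includes a char above 0xFF
close $fh;

# Decode on input and encode on output: running this any number of
# times round-trips characters instead of re-encoding raw bytes.
open my $in, "<:encoding(UTF-8)", $file or die "read: $!";
local $/;          ## read in the whole file as characters
my $t = <$in>;
close $in;

open my $out, ">:encoding(UTF-8)", $file or die "write: $!";
print $out $t;     ## characters are encoded back to the same bytes
close $out;
```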
       NOTE: the ":utf8" and ":encoding" features work only if your Perl
       has been built with the new PerlIO feature (which is the default on
       most systems).

   Displaying Unicode As Text
       Sometimes you might want to display Perl scalars containing Unicode
       as simple ASCII (or EBCDIC) text.  The following subroutine converts
       its argument so that Unicode characters with code points greater
       than 255 are displayed as "\x{...}", control characters (like "\n")
       are displayed as "\x..", and the rest of the characters as
       themselves:

           sub nice_string {
               join("",
                 map { $_ > 255 ?                  # if wide character...
                       sprintf("\\x{%04X}", $_) :  # \x{...}
                       chr($_) =~ /[[:cntrl:]]/ ?  # else if control character...
                       sprintf("\\x%02X", $_) :    # \x..
                       quotemeta(chr($_))          # else quoted or as themselves
                 } unpack("W*", $_[0]));           # unpack Unicode characters
           }

       For example,

           nice_string("foo\x{100}bar\n")

       returns the string

           'foo\x{0100}bar\x0A'

       which is ready to be printed.

486 Special Cases
487 · Bit Complement Operator ~ And vec()
488
489 The bit complement operator "~" may produce surprising results if
490 used on strings containing characters with ordinal values above
491 255. In such a case, the results are consistent with the internal
492 encoding of the characters, but not with much else. So don't do
493 that. Similarly for "vec()": you will be operating on the
494 internally-encoded bit patterns of the Unicode characters, not on
495 the code point values, which is very probably not what you want.
496
497 · Peeking At Perl's Internal Encoding
498
499 Normal users of Perl should never care how Perl encodes any
500 particular Unicode string (because the normal ways to get at the
501 contents of a string with Unicode--via input and output--should
502 always be via explicitly-defined I/O layers). But if you must,
503 there are two ways of looking behind the scenes.
504
505 One way of peeking inside the internal encoding of Unicode
506 characters is to use "unpack("C*", ..." to get the bytes of
507 whatever the string encoding happens to be, or "unpack("U0..",
508 ...)" to get the bytes of the UTF-8 encoding:
509
510 # this prints c4 80 for the UTF-8 bytes 0xc4 0x80
511 print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
512
513 Yet another way would be to use the Devel::Peek module:
514
515 perl -MDevel::Peek -e 'Dump(chr(0x100))'
516
517 That shows the "UTF8" flag in FLAGS and both the UTF-8 bytes and
518 Unicode characters in "PV". See also later in this document the
519 discussion about the "utf8::is_utf8()" function.
520
   Advanced Topics
       ·   String Equivalence

           The question of string equivalence turns somewhat complicated in
           Unicode: what do you mean by "equal"?

           (Is "LATIN CAPITAL LETTER A WITH ACUTE" equal to "LATIN CAPITAL
           LETTER A"?)

           The short answer is that by default Perl compares equivalence
           ("eq", "ne") based only on code points of the characters.  In
           the above case, the answer is no (because 0x00C1 != 0x0041).
           But sometimes, any CAPITAL LETTER A's should be considered
           equal, or even A's of any case.

           The long answer is that you need to consider character
           normalization and casing issues: see Unicode::Normalize, Unicode
           Technical Report #15, Unicode Normalization Forms
           <http://www.unicode.org/unicode/reports/tr15> and sections on
           case mapping in the Unicode Standard <http://www.unicode.org>.

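           For instance, strings that differ at the code-point level can be
           compared under a common normalization form with the bundled
           Unicode::Normalize module (a sketch):

```perl
use Unicode::Normalize qw(NFD);

my $precomposed = "\x{E9}";    # LATIN SMALL LETTER E WITH ACUTE
my $combining   = "e\x{301}";  # "e" + COMBINING ACUTE ACCENT

# "eq" compares code points, so these differ; after normalizing both
# to NFD they compare equal.
print $precomposed eq $combining           ? "eq\n" : "ne\n";  # ne
print NFD($precomposed) eq NFD($combining) ? "eq\n" : "ne\n";  # eq
```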
           As of Perl 5.8.0, the "Full" case-folding of Case
           Mappings/SpecialCasing is implemented, but bugs remain in
           "qr//i" with them, mostly fixed by 5.14.

       ·   String Collation

           People like to see their strings nicely sorted--or as Unicode
           parlance goes, collated.  But again, what do you mean by
           collate?

           (Does "LATIN CAPITAL LETTER A WITH ACUTE" come before or after
           "LATIN CAPITAL LETTER A WITH GRAVE"?)

           The short answer is that by default, Perl compares strings
           ("lt", "le", "cmp", "ge", "gt") based only on the code points of
           the characters.  In the above case, the answer is "after", since
           0x00C1 > 0x00C0.

           The long answer is that "it depends", and a good answer cannot
           be given without knowing (at the very least) the language
           context.  See Unicode::Collate, and Unicode Collation Algorithm
           <http://www.unicode.org/unicode/reports/tr10/>.

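           A minimal sketch with the bundled Unicode::Collate module, which
           implements the Unicode Collation Algorithm's default ordering:

```perl
use Unicode::Collate;

# With no arguments, new() uses the default (language-independent)
# collation table rather than code-point order.
my $collator = Unicode::Collate->new();
my @sorted   = $collator->sort("pear", "apple", "banana");
print "@sorted\n";   # prints "apple banana pear"
```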
   Miscellaneous
       ·   Character Ranges and Classes

           Character ranges in regular expression bracketed character
           classes (e.g., "/[a-z]/") and in the "tr///" (also known as
           "y///") operator are not magically Unicode-aware.  What this
           means is that "[A-Za-z]" will not magically start to mean "all
           alphabetic letters" (not that it does mean that even for 8-bit
           characters; for those, if you are using locales (perllocale),
           use "/[[:alpha:]]/"; and if not, use the 8-bit-aware property
           "\p{alpha}").

           All the properties that begin with "\p" (and its inverse "\P")
           are actually character classes that are Unicode-aware.  There
           are dozens of them, see perluniprops.

           You can use Unicode code points as the end points of character
           ranges, and the range will include all Unicode code points that
           lie between those end points.

       ·   String-To-Number Conversions

           Unicode does define several other decimal--and numeric--
           characters besides the familiar 0 to 9, such as the Arabic and
           Indic digits.  Perl does not support string-to-number conversion
           for digits other than ASCII 0 to 9 (and ASCII a to f for
           hexadecimal).  To get safe conversions from any Unicode string,
           use "num()" in Unicode::UCD.

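           A sketch of "num()", which is available from Unicode::UCD since
           Perl 5.14:

```perl
use Unicode::UCD qw(num);

print num("123"), "\n";                    # 123, from ASCII digits
print num("\x{661}\x{662}\x{663}"), "\n";  # 123, from ARABIC-INDIC digits

# A string mixing digits from different scripts is rejected:
my $mixed = num("1\x{661}");
print defined $mixed ? "number\n" : "undef\n";   # undef
```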
   Questions With Answers
       ·   Will My Old Scripts Break?

           Very probably not.  Unless you are generating Unicode characters
           somehow, old behaviour should be preserved.  About the only
           behaviour that has changed and which could start generating
           Unicode is the old behaviour of "chr()" where supplying an
           argument more than 255 produced a character modulo 255.
           "chr(300)", for example, was equal to "chr(45)" or "-" (in
           ASCII); now it is LATIN CAPITAL LETTER I WITH BREVE.

       ·   How Do I Make My Scripts Work With Unicode?

           Very little work should be needed since nothing changes until
           you generate Unicode data.  The most important thing is getting
           input as Unicode; for that, see the earlier I/O discussion.  To
           get full seamless Unicode support, add "use feature
           'unicode_strings'" (or "use 5.012" or higher) to your script.

       ·   How Do I Know Whether My String Is In Unicode?

           You shouldn't have to care.  But you may if your Perl is before
           5.14.0 or you haven't specified "use feature 'unicode_strings'"
           or "use 5.012" (or higher) because otherwise the semantics of
           the code points in the range 128 to 255 are different depending
           on whether the string they are contained within is in Unicode or
           not.  (See "When Unicode Does Not Happen" in perlunicode.)

           To determine if a string is in Unicode, use:

               print utf8::is_utf8($string) ? 1 : 0, "\n";

           But note that this doesn't mean that any of the characters in
           the string are necessarily UTF-8 encoded, or that any of the
           characters have code points greater than 0xFF (255) or even 0x80
           (128), or that the string has any characters at all.  All the
           "is_utf8()" does is to return the value of the internal
           "utf8ness" flag attached to the $string.  If the flag is off,
           the bytes in the scalar are interpreted as a single byte
           encoding.  If the flag is on, the bytes in the scalar are
           interpreted as the (variable-length, potentially multi-byte)
           UTF-8 encoded code points of the characters.  Bytes added to a
           UTF-8 encoded string are automatically upgraded to UTF-8.  If
           mixed non-UTF-8 and UTF-8 scalars are merged (double-quoted
           interpolation, explicit concatenation, or printf/sprintf
           parameter substitution), the result will be UTF-8 encoded as if
           copies of the byte strings were upgraded to UTF-8: for example,

               $a = "ab\x80c";
               $b = "\x{100}";
               print "$a = $b\n";

           the output string will be UTF-8-encoded "ab\x80c = \x{100}\n",
           but $a will stay byte-encoded.

           Sometimes you might really need to know the byte length of a
           string instead of the character length.  For that use either the
           "Encode::encode_utf8()" function or the "bytes" pragma and the
           "length()" function:

               my $unicode = chr(0x100);
               print length($unicode), "\n"; # will print 1
               require Encode;
               print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
               use bytes;
               print length($unicode), "\n"; # will also print 2
                                             # (the 0xC4 0x80 of the UTF-8)
               no bytes;

       ·   How Do I Find Out What Encoding a File Has?

           You might try Encode::Guess, but it has a number of limitations.

       ·   How Do I Detect Data That's Not Valid In a Particular Encoding?

           Use the "Encode" package to try converting it.  For example,

               use Encode 'decode_utf8';

               if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
                   # $string is valid utf8
               } else {
                   # $string is not valid utf8
               }

           Or use "unpack" to try decoding it:

               use warnings;
               @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);

           If invalid, a "Malformed UTF-8 character" warning is produced.
           The "C0" means "process the string character per character".
           Without that, the "unpack("U*", ...)" would work in "U0" mode
           (the default if the format string starts with "U") and it would
           return the bytes making up the UTF-8 encoding of the target
           string, something that will always work.

688 · How Do I Convert Binary Data Into a Particular Encoding, Or Vice
689 Versa?
690
691 This probably isn't as useful as you might think. Normally, you
692 shouldn't need to.
693
694 In one sense, what you are asking doesn't make much sense:
695 encodings are for characters, and binary data are not "characters",
696 so converting "data" into some encoding isn't meaningful unless you
697 know in what character set and encoding the binary data is in, in
698 which case it's not just binary data, now is it?
699
700 If you have a raw sequence of bytes that you know should be
701 interpreted via a particular encoding, you can use "Encode":
702
703 use Encode 'from_to';
704 from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8
705
706 The call to "from_to()" changes the bytes in $data, but nothing
707 material about the nature of the string has changed as far as Perl
708 is concerned. Both before and after the call, the string $data
709 contains just a bunch of 8-bit bytes. As far as Perl is concerned,
710 the encoding of the string remains as "system-native 8-bit bytes".
711
712 You might relate this to a fictional 'Translate' module:
713
714 use Translate;
715 my $phrase = "Yes";
716 Translate::from_to($phrase, 'english', 'deutsch');
717 ## phrase now contains "Ja"
718
719                The contents of the string change, but not the nature of the
720                string.  Perl doesn't know any more after the call than before
721                that the contents of the string indicate the affirmative.
722
723 Back to converting data. If you have (or want) data in your
724 system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you
725 can use pack/unpack to convert to/from Unicode.
726
727 $native_string = pack("W*", unpack("U*", $Unicode_string));
728 $Unicode_string = pack("U*", unpack("W*", $native_string));
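As a sketch of the two idioms above round-tripping (the byte 0xE9, Latin-1 "é", is an illustrative choice; "W*" requires Perl 5.10 or later):

```perl
use strict;
use warnings;

# 0xE9 is "é" in Latin-1 / ISO 8859-1.
my $native  = "\xE9";
my $unicode = pack("U*", unpack("W*", $native));  # now a character string
my $back    = pack("W*", unpack("U*", $unicode)); # native bytes again

print ord($unicode), "\n";                        # 233, i.e. 0xE9
print $back eq $native ? "round-trip ok\n" : "mismatch\n";
```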
729
730 If you have a sequence of bytes you know is valid UTF-8, but Perl
731 doesn't know it yet, you can make Perl a believer, too:
732
733 use Encode 'decode_utf8';
734 $Unicode = decode_utf8($bytes);
735
736 or:
737
738 $Unicode = pack("U0a*", $bytes);
739
740 You can find the bytes that make up a UTF-8 sequence with
741
742 @bytes = unpack("C*", $Unicode_string)
743
744 and you can create well-formed Unicode with
745
746 $Unicode_string = pack("U*", 0xff, ...)
747
748 · How Do I Display Unicode? How Do I Input Unicode?
749
750 See <http://www.alanwood.net/unicode/> and
751 <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
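For output, a minimal sketch (assuming the terminal understands UTF-8) is to declare the output encoding explicitly, which also avoids the "Wide character in print" warning:

```perl
use strict;
use warnings;

# Assumes a UTF-8 capable terminal; declare the encoding so Perl
# encodes characters on output instead of warning about them.
binmode STDOUT, ':encoding(UTF-8)';
print "\x{0100}\n";   # LATIN CAPITAL LETTER A WITH MACRON
```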
752
753 · How Does Unicode Work With Traditional Locales?
754
755 Starting in Perl 5.16, you can specify
756
757 use locale ':not_characters';
758
759                to get Perl to work well with traditional locales.  The catch is
760 that you have to translate from the locale character set to/from
761 Unicode yourself. See "Unicode I/O" above for how to
762
763 use open ':locale';
764
765 to accomplish this, but full details are in "Unicode and UTF-8" in
766                perllocale, including gotchas that happen if you don't specify
767 ":not_characters".
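Put together, a sketch of this 5.16+ combination looks like the following; whether it does what you want depends entirely on the locale in effect at run time:

```perl
use v5.16;
use locale ':not_characters';  # locale rules, except for character semantics
use open   ':locale';          # handles translate to/from the locale's charset

# Strings are Unicode characters internally; I/O through handles
# opened in this scope is converted via the locale's character set.
```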
768
769 Hexadecimal Notation
770 The Unicode standard prefers using hexadecimal notation because that
771 more clearly shows the division of Unicode into blocks of 256
772 characters. Hexadecimal is also simply shorter than decimal. You can
773 use decimal notation, too, but learning to use hexadecimal just makes
774 life easier with the Unicode standard. The "U+HHHH" notation uses
775 hexadecimal, for example.
776
777 The "0x" prefix means a hexadecimal number, the digits are 0-9 and a-f
778 (or A-F, case doesn't matter). Each hexadecimal digit represents four
779 bits, or half a byte. "print 0x..., "\n"" will show a hexadecimal
780 number in decimal, and "printf "%x\n", $decimal" will show a decimal
781                number in hexadecimal.  If you have a string of "hex digits", the
782                "hex()" function converts it to a number.
783
784 print 0x0009, "\n"; # 9
785 print 0x000a, "\n"; # 10
786 print 0x000f, "\n"; # 15
787 print 0x0010, "\n"; # 16
788 print 0x0011, "\n"; # 17
789 print 0x0100, "\n"; # 256
790
791 print 0x0041, "\n"; # 65
792
793 printf "%x\n", 65; # 41
794 printf "%#x\n", 65; # 0x41
795
796 print hex("41"), "\n"; # 65
797
798 Further Resources
799 · Unicode Consortium
800
801 <http://www.unicode.org/>
802
803 · Unicode FAQ
804
805 <http://www.unicode.org/unicode/faq/>
806
807 · Unicode Glossary
808
809 <http://www.unicode.org/glossary/>
810
811 · Unicode Recommended Reading List
812
813 The Unicode Consortium has a list of articles and books, some of
814 which give a much more in depth treatment of Unicode:
815 <http://unicode.org/resources/readinglist.html>
816
817 · Unicode Useful Resources
818
819 <http://www.unicode.org/unicode/onlinedat/resources.html>
820
821 · Unicode and Multilingual Support in HTML, Fonts, Web Browsers and
822 Other Applications
823
824 <http://www.alanwood.net/unicode/>
825
826 · UTF-8 and Unicode FAQ for Unix/Linux
827
828 <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
829
830 · Legacy Character Sets
831
832 <http://www.czyborra.com/> <http://www.eki.ee/letter/>
833
834 · You can explore various information from the Unicode data files
835 using the "Unicode::UCD" module.
836
838 If you cannot upgrade your Perl to 5.8.0 or later, you can still do
839 some Unicode processing by using the modules "Unicode::String",
840 "Unicode::Map8", and "Unicode::Map", available from CPAN. If you have
841 the GNU recode installed, you can also use the Perl front-end
842 "Convert::Recode" for character conversions.
843
844        The following are fast conversions from ISO 8859-1 (Latin-1) bytes to
845        UTF-8 bytes and back; the code works even with older Perl 5 versions.
846
847 # ISO 8859-1 to UTF-8
848 s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
849
850 # UTF-8 to ISO 8859-1
851 s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
852
853SEE ALSO
854 perlunitut, perlunicode, Encode, open, utf8, bytes, perlretut, perlrun,
855 Unicode::Collate, Unicode::Normalize, Unicode::UCD
856
857ACKNOWLEDGMENTS
858 Thanks to the kind readers of the perl5-porters@perl.org,
859 perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
860 mailing lists for their valuable feedback.
861
862AUTHOR, COPYRIGHT, AND LICENSE
863 Copyright 2001-2011 Jarkko Hietaniemi <jhi@iki.fi>
864
865 This document may be distributed under the same terms as Perl itself.
866
867
868
869perl v5.16.3 2013-03-04 PERLUNIINTRO(1)