perluniintro(1)

PERLUNIINTRO(1)        Perl Programmers Reference Guide        PERLUNIINTRO(1)

NAME

6       perluniintro - Perl Unicode introduction
7

DESCRIPTION

9       This document gives a general idea of Unicode and how to use Unicode in
10       Perl.
11
12       Unicode
13
14       Unicode is a character set standard which plans to codify all of the
15       writing systems of the world, plus many other symbols.
16
17       Unicode and ISO/IEC 10646 are coordinated standards that provide code
18       points for characters in almost all modern character set standards,
19       covering more than 30 writing systems and hundreds of languages,
20       including all commercially-important modern languages.  All characters
21       in the largest Chinese, Japanese, and Korean dictionaries are also
22       encoded. The standards will eventually cover almost all characters in
23       more than 250 writing systems and thousands of languages.  Unicode 1.0
24       was released in October 1991, and 4.0 in April 2003.
25
26       A Unicode character is an abstract entity.  It is not bound to any par‐
27       ticular integer width, especially not to the C language "char".  Uni‐
28       code is language-neutral and display-neutral: it does not encode the
29       language of the text and it does not define fonts or other graphical
30       layout details.  Unicode operates on characters and on text built from
31       those characters.
32
33       Unicode defines characters like "LATIN CAPITAL LETTER A" or "GREEK
34       SMALL LETTER ALPHA" and unique numbers for the characters, in this case
35       0x0041 and 0x03B1, respectively.  These unique numbers are called code
36       points.
37
38       The Unicode standard prefers using hexadecimal notation for the code
39       points.  If numbers like 0x0041 are unfamiliar to you, take a peek at a
40       later section, "Hexadecimal Notation".  The Unicode standard uses the
41       notation "U+0041 LATIN CAPITAL LETTER A", to give the hexadecimal code
42       point and the normative name of the character.
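
       As a small illustration of the relationship between characters and
       code points, the following sketch converts in both directions with
       "chr()" and "ord()" and formats the result in the "U+" notation.  The
       helper name "u_plus" is invented for this example and is not part of
       Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: format a character's code point in U+XXXX notation.
sub u_plus {
    my ($char) = @_;
    return sprintf("U+%04X", ord($char));
}

my $capital_a = chr(0x0041);    # LATIN CAPITAL LETTER A
my $alpha     = chr(0x03B1);    # GREEK SMALL LETTER ALPHA

my $a_label     = u_plus($capital_a);
my $alpha_label = u_plus($alpha);

binmode(STDOUT, ":utf8");       # avoid a "Wide character" warning
print "$a_label $capital_a\n";
print "$alpha_label $alpha\n";
```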
43
44       Unicode also defines various properties for the characters, like
45       "uppercase" or "lowercase", "decimal digit", or "punctuation"; these
46       properties are independent of the names of the characters.  Further‐
47       more, various operations on the characters like uppercasing, lowercas‐
48       ing, and collating (sorting) are defined.
49
50       A Unicode character consists either of a single code point, or a base
51       character (like "LATIN CAPITAL LETTER A"), followed by one or more mod‐
52       ifiers (like "COMBINING ACUTE ACCENT").  This sequence of base charac‐
53       ter and modifiers is called a combining character sequence.
54
55       Whether to call these combining character sequences "characters"
56       depends on your point of view. If you are a programmer, you probably
57       would tend towards seeing each element in the sequences as one unit, or
58       "character".  The whole sequence could be seen as one "character", how‐
59       ever, from the user's point of view, since that's probably what it
60       looks like in the context of the user's language.
61
       With this "whole sequence" view of characters, the total number of
       characters is open-ended. But in the programmer's "one unit is one
       character" point of view, the concept of "characters" is more
       deterministic.  In this document, we take that second point of view:
       one "character" is one Unicode code point, be it a base character or a
       combining character.
68
69       For some combinations, there are precomposed characters.  "LATIN CAPI‐
70       TAL LETTER A WITH ACUTE", for example, is defined as a single code
71       point.  These precomposed characters are, however, only available for
72       some combinations, and are mainly meant to support round-trip conver‐
73       sions between Unicode and legacy standards (like the ISO 8859).  In the
74       general case, the composing method is more extensible.  To support con‐
75       version between different compositions of the characters, various nor‐
76       malization forms to standardize representations are also defined.
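
       To make the difference between a precomposed character and its
       combining character sequence concrete, here is a minimal sketch using
       the Unicode::Normalize module.  It only assumes that the module is
       installed (it is bundled with Perl 5.8); the variable names are
       illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{00C1}";     # LATIN CAPITAL LETTER A WITH ACUTE
my $decomposed  = "A\x{0301}";    # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# The two spellings differ code point by code point...
my $differ_raw = ($precomposed ne $decomposed) ? 1 : 0;

# ...but agree once both are brought to the same normalization form.
my $same_nfc = (NFC($decomposed)  eq $precomposed) ? 1 : 0;   # composed form
my $same_nfd = (NFD($precomposed) eq $decomposed)  ? 1 : 0;   # decomposed form

print "raw differ: $differ_raw, NFC equal: $same_nfc, NFD equal: $same_nfd\n";
```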
77
78       Because of backward compatibility with legacy encodings, the "a unique
79       number for every character" idea breaks down a bit: instead, there is
80       "at least one number for every character".  The same character could be
81       represented differently in several legacy encodings.  The converse is
82       also not true: some code points do not have an assigned character.
83       Firstly, there are unallocated code points within otherwise used
84       blocks.  Secondly, there are special Unicode control characters that do
85       not represent true characters.
86
       A common myth about Unicode is that it is "16-bit", that is, that
       Unicode can represent only 0x10000 (or 65536) characters from 0x0000
       to 0xFFFF.  This is untrue.  Since Unicode 2.0 (July 1996), Unicode has
       been defined all the way up to 21 bits (0x10FFFF), and since Unicode
       3.1 (March 2001), characters have been defined beyond 0xFFFF.  The
       first 0x10000 code points make up Plane 0, or the Basic Multilingual
       Plane (BMP).  With Unicode 3.1, 17 (yes, seventeen) planes in all
       were defined--but they are nowhere near full of defined characters,
       yet.
96
97       Another myth is that the 256-character blocks have something to do with
98       languages--that each block would define the characters used by a lan‐
99       guage or a set of languages.  This is also untrue.  The division into
100       blocks exists, but it is almost completely accidental--an artifact of
101       how the characters have been and still are allocated.  Instead, there
102       is a concept called scripts, which is more useful: there is "Latin"
103       script, "Greek" script, and so on.  Scripts usually span varied parts
104       of several blocks.  For further information see Unicode::UCD.
105
       The Unicode code points are just abstract numbers.  To input and output
       these abstract numbers, the numbers must be encoded or serialised
       somehow.  Unicode defines several character encoding forms, of which
       UTF-8 is perhaps the most popular.  UTF-8 is a variable length encoding
       that encodes Unicode characters as 1 to 6 bytes (only 4 with the
       currently defined characters).  Other encodings include UTF-16 and
       UTF-32 and their big- and little-endian variants (UTF-8 is byte-order
       independent).  ISO/IEC 10646 defines the UCS-2 and UCS-4 encoding forms.
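
       The variable-length nature of these encodings can be observed with the
       "Encode" module.  The sketch below is only an illustration: it encodes
       four sample code points and records how many bytes each encoding
       needs.  The choice of sample characters and the %bytes_for hash are
       assumptions of this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Byte counts per code point, keyed by encoding name.
my %bytes_for;
for my $code_point (0x41, 0x3B1, 0x263A, 0x10000) {
    my $char = chr($code_point);
    for my $encoding (qw(UTF-8 UTF-16BE UTF-32BE)) {
        $bytes_for{$code_point}{$encoding} = length(encode($encoding, $char));
    }
}

for my $code_point (sort { $a <=> $b } keys %bytes_for) {
    printf "U+%04X: UTF-8 %d, UTF-16BE %d, UTF-32BE %d\n",
        $code_point,
        @{ $bytes_for{$code_point} }{qw(UTF-8 UTF-16BE UTF-32BE)};
}
```

       UTF-8 grows from one to four bytes across these samples, UTF-16 needs
       a surrogate pair (four bytes) only for the character beyond the BMP,
       and UTF-32 always uses four bytes.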
114
115       For more information about encodings--for instance, to learn what sur‐
116       rogates and byte order marks (BOMs) are--see perlunicode.
117
118       Perl's Unicode Support
119
120       Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
121       natively.  Perl 5.8.0, however, is the first recommended release for
122       serious Unicode work.  The maintenance release 5.6.1 fixed many of the
123       problems of the initial Unicode implementation, but for example regular
124       expressions still do not work with Unicode in 5.6.1.
125
126       Starting from Perl 5.8.0, the use of "use utf8" is no longer necessary.
127       In earlier releases the "utf8" pragma was used to declare that opera‐
128       tions in the current block or file would be Unicode-aware.  This model
129       was found to be wrong, or at least clumsy: the "Unicodeness" is now
130       carried with the data, instead of being attached to the operations.
131       Only one case remains where an explicit "use utf8" is needed: if your
132       Perl script itself is encoded in UTF-8, you can use UTF-8 in your iden‐
133       tifier names, and in string and regular expression literals, by saying
134       "use utf8".  This is not the default because scripts with legacy 8-bit
135       data in them would break.  See utf8.
136
137       Perl's Unicode Model
138
139       Perl supports both pre-5.6 strings of eight-bit native bytes, and
140       strings of Unicode characters.  The principle is that Perl tries to
141       keep its data as eight-bit bytes for as long as possible, but as soon
142       as Unicodeness cannot be avoided, the data is transparently upgraded to
143       Unicode.
144
       Internally, Perl currently uses either the native eight-bit character
       set of the platform (for example Latin-1) or UTF-8 to encode Unicode
       strings.  Specifically, if all code points in the string are 0xFF or
       less, Perl uses the native eight-bit character set.  Otherwise, it uses
       UTF-8.
150
151       A user of Perl does not normally need to know nor care how Perl happens
152       to encode its internal strings, but it becomes relevant when outputting
153       Unicode strings to a stream without a PerlIO layer -- one with the
154       "default" encoding.  In such a case, the raw bytes used internally (the
155       native character set or UTF-8, as appropriate for each string) will be
156       used, and a "Wide character" warning will be issued if those strings
157       contain a character beyond 0x00FF.
158
159       For example,
160
161             perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
162
163       produces a fairly useless mixture of native bytes and UTF-8, as well as
164       a warning:
165
166            Wide character in print at ...
167
168       To output UTF-8, use the ":utf8" output layer.  Prepending
169
170             binmode(STDOUT, ":utf8");
171
172       to this sample program ensures that the output is completely UTF-8, and
173       removes the program's warning.
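
       The effect of the ":utf8" layer can also be inspected without a
       terminal.  The following sketch, which assumes a Perl built with
       PerlIO so that in-memory filehandles are available, writes the same
       two characters through a ":utf8" layer into a scalar and then lists
       the resulting bytes in hexadecimal.

```perl
use strict;
use warnings;

# Write through a :utf8 layer into an in-memory "file".
my $buffer = '';
open(my $out, '>:utf8', \$buffer) or die "cannot open in-memory file: $!";
print {$out} "\x{DF}\x{0100}";
close($out) or die "cannot close in-memory file: $!";

# Every character has been serialised as UTF-8, including U+00DF.
my $hex = join(" ", unpack("(H2)*", $buffer));
print "$hex\n";
```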
174
       You can enable automatic UTF-8-ification of your standard file handles,
       default "open()" layer, and @ARGV by using either the "-C" command line
       switch or the "PERL_UNICODE" environment variable; see perlrun for the
       documentation of the "-C" switch.
179
180       Note that this means that Perl expects other software to work, too: if
181       Perl has been led to believe that STDIN should be UTF-8, but then STDIN
182       coming in from another command is not UTF-8, Perl will complain about
183       the malformed UTF-8.
184
185       All features that combine Unicode and I/O also require using the new
186       PerlIO feature.  Almost all Perl 5.8 platforms do use PerlIO, though:
187       you can see whether yours is by running "perl -V" and looking for
188       "useperlio=define".
189
190       Unicode and EBCDIC
191
       Perl 5.8.0 also supports Unicode on EBCDIC platforms.  There, Unicode
       support is somewhat more complex to implement since additional
       conversions are needed at every step.  Some problems remain; see
       perlebcdic for details.

       In any case, the Unicode support on EBCDIC platforms is better than in
       the 5.6 series, which hardly worked at all on EBCDIC platforms.  On
       EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
       instead of UTF-8.  The difference is that UTF-8 is "ASCII-safe", in
       that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
       "EBCDIC-safe".
203
204       Creating Unicode
205
206       To create Unicode characters in literals for code points above 0xFF,
207       use the "\x{...}" notation in double-quoted strings:
208
209           my $smiley = "\x{263a}";
210
       Similarly, it can be used in regular expression literals:
212
213           $smiley =~ /\x{263a}/;
214
215       At run-time you can use "chr()":
216
217           my $hebrew_alef = chr(0x05d0);
218
219       See "Further Resources" for how to find all these numeric codes.
220
221       Naturally, "ord()" will do the reverse: it turns a character into a
222       code point.
223
224       Note that "\x.." (no "{}" and only two hexadecimal digits), "\x{...}",
225       and "chr(...)" for arguments less than 0x100 (decimal 256) generate an
226       eight-bit character for backward compatibility with older Perls.  For
227       arguments of 0x100 or more, Unicode characters are always produced. If
228       you want to force the production of Unicode characters regardless of
229       the numeric value, use "pack("U", ...)"  instead of "\x..", "\x{...}",
230       or "chr()".
231
232       You can also use the "charnames" pragma to invoke characters by name in
233       double-quoted strings:
234
235           use charnames ':full';
236           my $arabic_alef = "\N{ARABIC LETTER ALEF}";
237
238       And, as mentioned above, you can also "pack()" numbers into Unicode
239       characters:
240
241          my $georgian_an  = pack("U", 0x10a0);
242
       Note that both "\x{...}" and "\N{...}" are compile-time string
       constants: you cannot use variables in them.  If you want similar
       run-time functionality, use "chr()" and "charnames::vianame()".
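
       A run-time lookup by name might look like the sketch below.  The name
       is held in a variable to stand in for data that is only known while
       the program runs; the variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ':full';

# The name could come from user input or a file at run time.
my $name = "GREEK SMALL LETTER ALPHA";

# vianame() returns the code point, or undef if the name is unknown.
my $code_point = charnames::vianame($name);
my $char = defined $code_point ? chr($code_point) : undef;

printf "%s is U+%04X\n", $name, $code_point if defined $code_point;
```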
246
247       If you want to force the result to Unicode characters, use the special
248       "U0" prefix.  It consumes no arguments but forces the result to be in
249       Unicode characters, instead of bytes.
250
251          my $chars = pack("U0C*", 0x80, 0x42);
252
253       Likewise, you can force the result to be bytes by using the special
254       "C0" prefix.
255
256       Handling Unicode
257
258       Handling Unicode is for the most part transparent: just use the strings
259       as usual.  Functions like "index()", "length()", and "substr()" will
260       work on the Unicode characters; regular expressions will work on the
261       Unicode characters (see perlunicode and perlretut).
262
       Note that Perl considers each element of a combining character sequence
       to be a separate character, so for example

           use charnames ':full';
           print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";

       will print 2, not 1.  The only exception is that regular expressions
       have "\X" for matching a combining character sequence.
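
       If you need to count user-visible units rather than code points, one
       possible approach is to count "\X" matches.  This is a sketch of that
       idea, not the only way to do it:

```perl
use strict;
use warnings;

# A + COMBINING ACUTE ACCENT, followed by "b" and "c"
my $text = "A\x{0301}bc";

my $code_points = length($text);           # counts every code point
my $sequences   = () = $text =~ /\X/g;     # counts combining character sequences

print "$code_points code points, $sequences sequences\n";
```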
271
272       Life is not quite so transparent, however, when working with legacy
273       encodings, I/O, and certain special cases:
274
275       Legacy Encodings
276
       When you combine legacy data and Unicode, the legacy data needs to be
       upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if applicable) is
       assumed.  You can override this assumption by using the "encoding"
       pragma, for example
281
282           use encoding 'latin2'; # ISO 8859-2
283
284       in which case literals (string or regular expressions), "chr()", and
285       "ord()" in your whole script are assumed to produce Unicode characters
286       from ISO 8859-2 code points.  Note that the matching for encoding names
287       is forgiving: instead of "latin2" you could have said "Latin 2", or
288       "iso8859-2", or other variations.  With just
289
290           use encoding;
291
292       the environment variable "PERL_ENCODING" will be consulted.  If that
293       variable isn't set, the encoding pragma will fail.
294
295       The "Encode" module knows about many encodings and has interfaces for
296       doing conversions between those encodings:
297
298           use Encode 'decode';
299           $data = decode("iso-8859-3", $data); # convert from legacy to utf-8
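
       The two directions of conversion can be seen side by side in the
       following sketch: "decode()" turns legacy bytes into Perl characters,
       and "encode()" turns characters back into bytes of a chosen encoding.
       The sample byte and variable names are assumptions made for the
       example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $latin1_bytes = "\xE9";    # LATIN SMALL LETTER E WITH ACUTE in ISO 8859-1

# Bytes in a legacy encoding -> Perl characters
my $characters = decode("iso-8859-1", $latin1_bytes);

# Perl characters -> bytes in another encoding
my $utf8_bytes = encode("UTF-8", $characters);

printf "1 character (U+%04X) becomes %d UTF-8 bytes\n",
    ord($characters), length($utf8_bytes);
```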
300
301       Unicode I/O
302
303       Normally, writing out Unicode data
304
305           print FH $some_string_with_unicode, "\n";
306
307       produces raw bytes that Perl happens to use to internally encode the
308       Unicode string.  Perl's internal encoding depends on the system as well
309       as what characters happen to be in the string at the time. If any of
310       the characters are at code points 0x100 or above, you will get a warn‐
311       ing.  To ensure that the output is explicitly rendered in the encoding
312       you desire--and to avoid the warning--open the stream with the desired
313       encoding. Some examples:
314
315           open FH, ">:utf8", "file";
316
317           open FH, ">:encoding(ucs2)",      "file";
318           open FH, ">:encoding(UTF-8)",     "file";
319           open FH, ">:encoding(shift_jis)", "file";
320
321       and on already open streams, use "binmode()":
322
323           binmode(STDOUT, ":utf8");
324
325           binmode(STDOUT, ":encoding(ucs2)");
326           binmode(STDOUT, ":encoding(UTF-8)");
327           binmode(STDOUT, ":encoding(shift_jis)");
328
329       The matching of encoding names is loose: case does not matter, and many
330       encodings have several aliases.  Note that the ":utf8" layer must
331       always be specified exactly like that; it is not subject to the loose
332       matching of encoding names.
333
334       See PerlIO for the ":utf8" layer, PerlIO::encoding and Encode::PerlIO
335       for the ":encoding()" layer, and Encode::Supported for many encodings
336       supported by the "Encode" module.
337
       Reading in a file that you know happens to be encoded in one of the
       Unicode or legacy encodings does not magically turn the data into
       Unicode in Perl's eyes.  To do that, specify the appropriate layer when
       opening files:
342
343           open(my $fh,'<:utf8', 'anything');
344           my $line_of_unicode = <$fh>;
345
346           open(my $fh,'<:encoding(Big5)', 'anything');
347           my $line_of_unicode = <$fh>;
348
349       The I/O layers can also be specified more flexibly with the "open"
350       pragma.  See open, or look at the following example.
351
352           use open ':utf8'; # input and output default layer will be UTF-8
353           open X, ">file";
354           print X chr(0x100), "\n";
355           close X;
356           open Y, "<file";
357           printf "%#x\n", ord(<Y>); # this should print 0x100
358           close Y;
359
       With the "open" pragma you can use the ":locale" layer:

           BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
           # the :locale will probe the locale environment variables like LC_ALL
           use open OUT => ':locale'; # russki parusski
           open(O, ">koi8");
           print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
           close O;
           open(I, "<koi8");
           printf "%#x\n", ord(<I>); # this should print 0xc1
           close I;

       or you can also use the ':encoding(...)' layer:
373
374           open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
375           my $line_of_unicode = <$epic>;
376
377       These methods install a transparent filter on the I/O stream that con‐
378       verts data from the specified encoding when it is read in from the
379       stream.  The result is always Unicode.
380
381       The open pragma affects all the "open()" calls after the pragma by set‐
382       ting default layers.  If you want to affect only certain streams, use
383       explicit layers directly in the "open()" call.
384
385       You can switch encodings on an already opened stream by using "bin‐
386       mode()"; see "binmode" in perlfunc.
387
       The ":locale" layer does not currently (as of Perl 5.8.0) work with
       "open()" and "binmode()", only with the "open" pragma.  The ":utf8" and
       ":encoding(...)" methods do work with all of "open()", "binmode()", and
       the "open" pragma.
392
393       Similarly, you may use these I/O layers on output streams to automati‐
394       cally convert Unicode to the specified encoding when it is written to
395       the stream. For example, the following snippet copies the contents of
396       the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to the file
397       "text.utf8", encoded as UTF-8:
398
399           open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
400           open(my $unicode, '>:utf8',                  'text.utf8');
401           while (<$nihongo>) { print $unicode $_ }
402
403       The naming of encodings, both by the "open()" and by the "open" pragma,
404       is similar to the "encoding" pragma in that it allows for flexible
405       names: "koi8-r" and "KOI8R" will both be understood.
406
407       Common encodings recognized by ISO, MIME, IANA, and various other stan‐
408       dardisation organisations are recognised; for a more detailed list see
409       Encode::Supported.
410
411       "read()" reads characters and returns the number of characters.
412       "seek()" and "tell()" operate on byte counts, as do "sysread()" and
413       "sysseek()".
414
415       Notice that because of the default behaviour of not doing any conver‐
416       sion upon input if there is no default layer, it is easy to mistakenly
417       write code that keeps on expanding a file by repeatedly encoding the
418       data:
419
420           # BAD CODE WARNING
421           open F, "file";
422           local $/; ## read in the whole file of 8-bit characters
423           $t = <F>;
424           close F;
425           open F, ">:utf8", "file";
426           print F $t; ## convert to UTF-8 on output
427           close F;
428
       If you run this code twice, the contents of the file will be UTF-8
       encoded twice.  A "use open ':utf8'" would have avoided the bug, as
       would explicitly opening the file for input as UTF-8.
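
       A corrected version might look like the sketch below, which opens the
       file with the ":utf8" layer for both reading and writing so that
       repeated runs leave the bytes unchanged.  The subroutine name
       "rewrite_utf8_file" and its return value are inventions of this
       example.

```perl
use strict;
use warnings;

# Hypothetical helper: read a UTF-8 file and write it back without
# encoding it a second time.  Returns the number of characters handled.
sub rewrite_utf8_file {
    my ($path) = @_;

    open(my $in, '<:utf8', $path) or die "cannot read $path: $!";
    my $text = do { local $/; <$in> };    # slurp the file as characters
    close($in);
    $text = '' unless defined $text;      # an empty file yields undef

    open(my $out, '>:utf8', $path) or die "cannot write $path: $!";
    print {$out} $text;                   # characters are encoded once on output
    close($out) or die "cannot close $path: $!";

    return length($text);
}
```

       Calling "rewrite_utf8_file('file')" any number of times should leave
       the size of a valid UTF-8 file unchanged.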
432
433       NOTE: the ":utf8" and ":encoding" features work only if your Perl has
434       been built with the new PerlIO feature (which is the default on most
435       systems).
436
437       Displaying Unicode As Text
438
439       Sometimes you might want to display Perl scalars containing Unicode as
440       simple ASCII (or EBCDIC) text.  The following subroutine converts its
441       argument so that Unicode characters with code points greater than 255
442       are displayed as "\x{...}", control characters (like "\n") are dis‐
443       played as "\x..", and the rest of the characters as themselves:
444
445          sub nice_string {
446              join("",
447                map { $_ > 255 ?                  # if wide character...
448                      sprintf("\\x{%04X}", $_) :  # \x{...}
449                      chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
450                      sprintf("\\x%02X", $_) :    # \x..
451                      quotemeta(chr($_))          # else quoted or as themselves
452                } unpack("U*", $_[0]));           # unpack Unicode characters
453          }
454
455       For example,
456
457          nice_string("foo\x{100}bar\n")
458
459       returns the string
460
461          'foo\x{0100}bar\x0A'
462
463       which is ready to be printed.
464
465       Special Cases
466
467       ·   Bit Complement Operator ~ And vec()
468
469           The bit complement operator "~" may produce surprising results if
470           used on strings containing characters with ordinal values above
471           255. In such a case, the results are consistent with the internal
472           encoding of the characters, but not with much else. So don't do
473           that. Similarly for "vec()": you will be operating on the inter‐
474           nally-encoded bit patterns of the Unicode characters, not on the
475           code point values, which is very probably not what you want.
476
477       ·   Peeking At Perl's Internal Encoding
478
479           Normal users of Perl should never care how Perl encodes any partic‐
480           ular Unicode string (because the normal ways to get at the contents
481           of a string with Unicode--via input and output--should always be
482           via explicitly-defined I/O layers). But if you must, there are two
483           ways of looking behind the scenes.
484
           One way of peeking inside the internal encoding of Unicode
           characters is to use "unpack("C*", ...)" to get the bytes or
           "unpack("U0(H2)*", ...)" to display the UTF-8 bytes in
           hexadecimal:

               # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
               print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
491
492           Yet another way would be to use the Devel::Peek module:
493
494               perl -MDevel::Peek -e 'Dump(chr(0x100))'
495
496           That shows the "UTF8" flag in FLAGS and both the UTF-8 bytes and
497           Unicode characters in "PV".  See also later in this document the
498           discussion about the "utf8::is_utf8()" function.
499
500       Advanced Topics
501
502       ·   String Equivalence
503
504           The question of string equivalence turns somewhat complicated in
505           Unicode: what do you mean by "equal"?
506
507           (Is "LATIN CAPITAL LETTER A WITH ACUTE" equal to "LATIN CAPITAL
508           LETTER A"?)
509
510           The short answer is that by default Perl compares equivalence
511           ("eq", "ne") based only on code points of the characters.  In the
512           above case, the answer is no (because 0x00C1 != 0x0041).  But some‐
513           times, any CAPITAL LETTER As should be considered equal, or even As
514           of any case.
515
           The long answer is that you need to consider character
           normalization and casing issues: see Unicode::Normalize, Unicode
           Technical Reports #15 and #21, Unicode Normalization Forms and Case
           Mappings, http://www.unicode.org/unicode/reports/tr15/ and
           http://www.unicode.org/unicode/reports/tr21/

           As of Perl 5.8.0, the "Full" case-folding of Case
           Mappings/SpecialCasing is implemented.
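
           As one possible way to treat "any CAPITAL LETTER A" as equal, the
           sketch below decomposes each string and removes the combining
           marks before comparing.  The helper "strip_marks" is hypothetical,
           and discarding marks is a deliberate loss of information that
           suits only some applications.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then drop all combining marks (\pM).
sub strip_marks {
    my ($string) = @_;
    my $decomposed = NFD($string);
    $decomposed =~ s/\pM//g;
    return $decomposed;
}

my $a_acute = "\x{00C1}";    # LATIN CAPITAL LETTER A WITH ACUTE
my $plain_a = "A";           # LATIN CAPITAL LETTER A

# Plain eq compares code points only.
my $equal_by_code_point = ($a_acute eq $plain_a) ? 1 : 0;

# Ignoring accents, and then ignoring case as well.
my $equal_ignoring_marks = (strip_marks($a_acute) eq $plain_a) ? 1 : 0;
my $equal_ignoring_case  = (lc(strip_marks($a_acute)) eq lc("a")) ? 1 : 0;

print "$equal_by_code_point $equal_ignoring_marks $equal_ignoring_case\n";
```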
524
525       ·   String Collation
526
527           People like to see their strings nicely sorted--or as Unicode par‐
528           lance goes, collated.  But again, what do you mean by collate?
529
530           (Does "LATIN CAPITAL LETTER A WITH ACUTE" come before or after
531           "LATIN CAPITAL LETTER A WITH GRAVE"?)
532
533           The short answer is that by default, Perl compares strings ("lt",
534           "le", "cmp", "ge", "gt") based only on the code points of the char‐
535           acters.  In the above case, the answer is "after", since 0x00C1 >
536           0x00C0.
537
538           The long answer is that "it depends", and a good answer cannot be
539           given without knowing (at the very least) the language context.
540           See Unicode::Collate, and Unicode Collation Algorithm
541           http://www.unicode.org/unicode/reports/tr10/
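
           The difference between the two orderings can be seen in a short
           sketch that sorts the same list twice: once with Perl's default
           code point comparison, and once with Unicode::Collate using its
           default (language-neutral) collation table.  The word list is
           made up for the example.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @letters = ("B", "\x{00C1}", "A");    # B, A WITH ACUTE, A

# Default sort: by code point, so U+00C1 lands after "B".
my @by_code_point = sort @letters;

# Unicode Collation Algorithm: accented A sorts next to plain A.
my $collator = Unicode::Collate->new;
my @collated = $collator->sort(@letters);

# Show the results as code points to keep the output ASCII-only.
my $code_point_order = join(" ", map { sprintf("U+%04X", ord) } @by_code_point);
my $collated_order   = join(" ", map { sprintf("U+%04X", ord) } @collated);
print "code point order: $code_point_order\n";
print "collated order:   $collated_order\n";
```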
542
543       Miscellaneous
544
545       ·   Character Ranges and Classes
546
           Character ranges in regular expression character classes
           ("/[a-z]/") and in the "tr///" (also known as "y///") operator are
           not magically Unicode-aware.  What this means is that "[A-Za-z]"
           will not magically start to mean "all alphabetic letters"; not that
           it means that even for 8-bit characters, where you should be using
           "/[[:alpha:]]/" instead.
553
554           For specifying character classes like that in regular expressions,
555           you can use the various Unicode properties--"\pL", or perhaps
556           "\p{Alphabetic}", in this particular case.  You can use Unicode
557           code points as the end points of character ranges, but there is no
558           magic associated with specifying a certain range.  For further
559           information--there are dozens of Unicode character classes--see
560           perlunicode.
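
           For instance, the following sketch tests a single Greek letter
           against an ASCII range and against two Unicode properties.  The
           variable names are illustrative only.

```perl
use strict;
use warnings;

my $alpha = "\x{03B1}";    # GREEK SMALL LETTER ALPHA

# An explicit range matches only what it lists.
my $in_ascii_range = ($alpha =~ /^[A-Za-z]$/) ? 1 : 0;

# Unicode properties describe the character itself.
my $is_letter     = ($alpha =~ /^\pL$/)            ? 1 : 0;
my $is_alphabetic = ($alpha =~ /^\p{Alphabetic}$/) ? 1 : 0;

print "range: $in_ascii_range, \\pL: $is_letter, Alphabetic: $is_alphabetic\n";
```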
561
562       ·   String-To-Number Conversions
563
564           Unicode does define several other decimal--and numeric--characters
565           besides the familiar 0 to 9, such as the Arabic and Indic digits.
566           Perl does not support string-to-number conversion for digits other
567           than ASCII 0 to 9 (and ASCII a to f for hexadecimal).
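
           The sketch below shows the consequence: a non-ASCII digit can
           match "\d" in a regular expression and still have the numeric
           value zero when used as a number.  The "numeric" warning is
           disabled locally only to keep the example quiet.

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}";    # ARABIC-INDIC DIGIT THREE

# The regular expression engine recognises it as a decimal digit...
my $matches_digit = ($arabic_three =~ /^\d$/) ? 1 : 0;

# ...but string-to-number conversion does not.
my $numeric_value = do { no warnings 'numeric'; $arabic_three + 0 };

print "matches \\d: $matches_digit, numeric value: $numeric_value\n";
```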
568
569       Questions With Answers
570
571       ·   Will My Old Scripts Break?
572
573           Very probably not.  Unless you are generating Unicode characters
574           somehow, old behaviour should be preserved.  About the only behav‐
575           iour that has changed and which could start generating Unicode is
576           the old behaviour of "chr()" where supplying an argument more than
577           255 produced a character modulo 255.  "chr(300)", for example, was
578           equal to "chr(45)" or "-" (in ASCII), now it is LATIN CAPITAL LET‐
579           TER I WITH BREVE.
580
581       ·   How Do I Make My Scripts Work With Unicode?
582
583           Very little work should be needed since nothing changes until you
584           generate Unicode data.  The most important thing is getting input
585           as Unicode; for that, see the earlier I/O discussion.
586
587       ·   How Do I Know Whether My String Is In Unicode?
588
589           You shouldn't care.  No, you really shouldn't.  No, really.  If you
590           have to care--beyond the cases described above--it means that we
591           didn't get the transparency of Unicode quite right.
592
593           Okay, if you insist:
594
595               print utf8::is_utf8($string) ? 1 : 0, "\n";
596
           But note that this doesn't mean that any of the characters in the
           string are necessarily UTF-8 encoded, or that any of the characters
           have code points greater than 0xFF (255) or even 0x80 (128), or
           that the string has any characters at all.  All "is_utf8()" does is
           return the value of the internal "utf8ness" flag attached to the
           $string.  If the flag is off, the bytes in the scalar are
           interpreted as a single byte encoding.  If the flag is on, the
           bytes in the scalar are interpreted as the (multi-byte,
           variable-length) UTF-8 encoded code points of the characters.
           Bytes added to a UTF-8 encoded string are automatically upgraded
           to UTF-8.  If mixed non-UTF-8 and UTF-8 scalars are merged
           (double-quoted interpolation, explicit concatenation, and
           printf/sprintf parameter substitution), the result will be UTF-8
           encoded as if copies of the byte strings were upgraded to UTF-8:
           for example,
612
613               $a = "ab\x80c";
614               $b = "\x{100}";
615               print "$a = $b\n";
616
617           the output string will be UTF-8-encoded "ab\x80c = \x{100}\n", but
618           $a will stay byte-encoded.
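
           The same behaviour can be checked programmatically.  In this
           sketch the variables are renamed (to avoid the special sort
           variables $a and $b), and the flags of the two inputs and of the
           merged string are collected with "utf8::is_utf8()":

```perl
use strict;
use warnings;

my $bytes  = "ab\x80c";     # all code points below 0x100: stays byte-encoded
my $wide   = "\x{100}";     # needs UTF-8 internally
my $merged = "$bytes = $wide\n";

# Collect the internal flag of each scalar as 0 or 1.
my @flags = map { utf8::is_utf8($_) ? 1 : 0 } ($bytes, $wide, $merged);

# The merged string is counted in characters, not bytes.
my $merged_length = length($merged);

print "flags: @flags, merged length: $merged_length\n";
```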
619
620           Sometimes you might really need to know the byte length of a string
621           instead of the character length. For that use either the
622           "Encode::encode_utf8()" function or the "bytes" pragma and its only
623           defined function "length()":
624
625               my $unicode = chr(0x100);
626               print length($unicode), "\n"; # will print 1
627               require Encode;
628               print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
629               use bytes;
630               print length($unicode), "\n"; # will also print 2
631                                             # (the 0xC4 0x80 of the UTF-8)
632
633       ·   How Do I Detect Data That's Not Valid In a Particular Encoding?
634
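           Use the "Encode" package to try converting it.  For example,

               use Encode 'decode';
               # FB_CROAK makes decode() die on malformed input;
               # LEAVE_SRC keeps the input string unmodified
               if (eval { decode('UTF-8', $string_of_bytes_that_I_think_is_utf8,
                                 Encode::FB_CROAK() | Encode::LEAVE_SRC()); 1 }) {
                   # valid
               } else {
                   # invalid
               }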
643
           For UTF-8 only, you can use:

               use warnings;
               @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);

           If invalid, a "Malformed UTF-8 character (byte 0x##) in unpack"
           warning is produced.  The "U0" means "expect strictly UTF-8 encoded
           Unicode".  Without that, "unpack("U*", ...)" would also accept
           data like "chr(0xFF)", similarly to the "pack" as we saw earlier.

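           A strict validity check can be sketched as a small helper that
           asks "Encode" to croak on malformed input.  The helper name
           "is_valid_utf8" is hypothetical, not part of "Encode"; the sketch
           assumes an "Encode" version that accepts a CHECK argument to
           "decode()".

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if $octets is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $octets.  The trailing 1 makes the eval
    # true even when the decoded string is "" or "0".
    my $ok = eval {
        Encode::decode('UTF-8', $octets,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my @results = map { is_valid_utf8($_) }
    "caf\xC3\xA9",   # well-formed UTF-8
    "caf\xE9",       # Latin-1 byte, malformed as UTF-8
    "",              # empty string is valid
    "0";             # valid, although false as a Perl value
print "@results\n";
```

           Testing the success of the "eval" rather than the decoded value
           matters because valid input such as "0" decodes to a false value.
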
       ·   How Do I Convert Binary Data Into a Particular Encoding, Or Vice
           Versa?

           This probably isn't as useful as you might think.  Normally, you
           shouldn't need to.

           In one sense, what you are asking doesn't make much sense:
           encodings are for characters, and binary data are not
           "characters", so converting "data" into some encoding isn't
           meaningful unless you know what character set and encoding the
           binary data is in, in which case it's not just binary data, now is
           it?

           If you have a raw sequence of bytes that you know should be
           interpreted via a particular encoding, you can use "Encode":

               use Encode 'from_to';
               from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8

           The call to "from_to()" changes the bytes in $data, but nothing
           material about the nature of the string has changed as far as Perl
           is concerned.  Both before and after the call, the string $data
           contains just a bunch of 8-bit bytes, and Perl still treats its
           encoding as "system-native 8-bit bytes".

           You might relate this to a fictional 'Translate' module:

              use Translate;
              my $phrase = "Yes";
              Translate::from_to($phrase, 'english', 'deutsch');
              ## phrase now contains "Ja"

           The contents of the string change, but not the nature of the
           string.  Perl doesn't know any more after the call than before that
           the contents of the string indicate the affirmative.

           Back to converting data.  If you have (or want) data in your
           system's native 8-bit encoding (e.g. Latin-1 or EBCDIC), you can
           use pack/unpack to convert to/from Unicode.

               $native_string  = pack("C*", unpack("U*", $Unicode_string));
               $Unicode_string = pack("U*", unpack("C*", $native_string));

           If you have a sequence of bytes you know is valid UTF-8, but Perl
           doesn't know it yet, you can make Perl a believer, too:

               use Encode 'decode_utf8';
               $Unicode = decode_utf8($bytes);

           You can convert well-formed UTF-8 to a sequence of bytes, but if
           you just want to convert random binary data into UTF-8, you can't.
           Not every random collection of bytes is well-formed UTF-8.  You can
           use "unpack("C*", $string)" for the former, and you can create
           well-formed Unicode data by "pack("U*", 0xff, ...)".

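           The conversions in this answer can be traced end to end on one
           short string.  This is a sketch under the assumption that the
           input really is Latin-1; the sample text and variable names are
           illustrative.

```perl
use strict;
use warnings;
use Encode qw(from_to decode_utf8);

my $data = "caf\xE9";                     # 4 bytes of Latin-1
from_to($data, "iso-8859-1", "utf-8");    # same variable, now UTF-8 bytes

# Perl still sees plain bytes: the flag is off and length() counts bytes.
my $still_bytes = utf8::is_utf8($data) ? 0 : 1;
my $n_bytes     = length($data);

# decode_utf8() tells Perl that these bytes are UTF-8 encoded characters.
my $text    = decode_utf8($data);
my $n_chars = length($text);
my $last    = sprintf "U+%04X", ord(substr($text, -1));

# pack/unpack round trip between characters and native 8-bit bytes.
my $native = pack("C*", unpack("U*", $text));
my $again  = pack("U*", unpack("C*", $native));

print "$n_bytes bytes, $n_chars characters, last character $last\n";
```
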
       ·   How Do I Display Unicode?  How Do I Input Unicode?

           See http://www.alanwood.net/unicode/ and
           http://www.cl.cam.ac.uk/~mgk25/unicode.html

       ·   How Does Unicode Work With Traditional Locales?

           In Perl, not very well.  Avoid using locales through the "locale"
           pragma.  Use either Unicode or locales, but not both.  But see
           perlrun for the description of the "-C" switch and its environment
           counterpart, $ENV{PERL_UNICODE}, to see how to enable various
           Unicode features, for example by using locale settings.

       Hexadecimal Notation

       The Unicode standard prefers hexadecimal notation because it shows
       the division of Unicode into blocks of 256 characters more clearly.
       Hexadecimal is also simply shorter than decimal.  You can use decimal
       notation, too, but learning to use hexadecimal just makes life easier
       with the Unicode standard.  The "U+HHHH" notation uses hexadecimal,
       for example.

       The "0x" prefix means a hexadecimal number; the digits are 0-9 and a-f
       (or A-F, case doesn't matter).  Each hexadecimal digit represents four
       bits, or half a byte.  "print 0x..., "\n"" will show a hexadecimal
       number in decimal, and "printf "%x\n", $decimal" will show a decimal
       number in hexadecimal.  If you have just the "hex digits" of a
       hexadecimal number, you can use the "hex()" function.

           print 0x0009, "\n";    # 9
           print 0x000a, "\n";    # 10
           print 0x000f, "\n";    # 15
           print 0x0010, "\n";    # 16
           print 0x0011, "\n";    # 17
           print 0x0100, "\n";    # 256

           print 0x0041, "\n";    # 65

           printf "%x\n",  65;    # 41
           printf "%#x\n", 65;    # 0x41

           print hex("41"), "\n"; # 65

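       The same functions convert between the "U+HHHH" notation and
       numeric code points.  The helper names "from_u_notation" and
       "to_u_notation" below are hypothetical, chosen for this sketch only.

```perl
use strict;
use warnings;

# Hypothetical helper: "U+03B1" -> 945; returns nothing if the text
# does not look like U+ followed by 4 to 6 hex digits.
sub from_u_notation {
    my ($text) = @_;
    $text =~ /^U\+([0-9A-Fa-f]{4,6})$/ or return;
    return hex($1);
}

# Hypothetical helper: 65 -> "U+0041" (at least four uppercase hex digits).
sub to_u_notation {
    my ($code_point) = @_;
    return sprintf "U+%04X", $code_point;
}

my $cp    = from_u_notation("U+03B1");
my $label = to_u_notation(0x41);

# Shifting off the low 8 bits gives the number of the 256-character
# block that the code point falls in.
my $block = sprintf "%02X", $cp >> 8;

print "$cp $label block $block\n";
```
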
       Further Resources

       ·   Unicode Consortium

               http://www.unicode.org/

       ·   Unicode FAQ

               http://www.unicode.org/unicode/faq/

       ·   Unicode Glossary

               http://www.unicode.org/glossary/

       ·   Unicode Useful Resources

               http://www.unicode.org/unicode/onlinedat/resources.html

       ·   Unicode and Multilingual Support in HTML, Fonts, Web Browsers and
           Other Applications

               http://www.alanwood.net/unicode/

       ·   UTF-8 and Unicode FAQ for Unix/Linux

               http://www.cl.cam.ac.uk/~mgk25/unicode.html

       ·   Legacy Character Sets

               http://www.czyborra.com/
               http://www.eki.ee/letter/

       ·   The Unicode support files live within the Perl installation in the
           directory

               $Config{installprivlib}/unicore

           in Perl 5.8.0 or newer, and

               $Config{installprivlib}/unicode

           in the Perl 5.6 series.  (The renaming to lib/unicore was done to
           avoid naming conflicts with lib/Unicode in case-insensitive
           filesystems.)  The main Unicode data file is UnicodeData.txt (or
           Unicode.301 in Perl 5.6.1).  You can find the
           $Config{installprivlib} by

               perl "-V:installprivlib"

           You can explore various information from the Unicode data files
           using the "Unicode::UCD" module.

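       As a minimal sketch of querying the data files through
       "Unicode::UCD", the following looks up one code point with
       "charinfo()".  It assumes Perl 5.8.0 or newer; the choice of
       U+03B1 is only an example.

```perl
use strict;
use warnings;
use Config;
use Unicode::UCD qw(charinfo);

# Directory named in the text for Perl 5.8.0 or newer; printed for
# orientation only, the script does not read from it directly.
print "support files: $Config{installprivlib}/unicore\n";

# charinfo() returns a hash reference describing one code point.
my $info     = charinfo(0x03B1);
my $name     = $info->{name};       # official character name
my $category = $info->{category};   # general category, e.g. Ll
my $upper    = $info->{upper};      # uppercase mapping as hex digits

print "U+$info->{code} $name, category $category, uppercase U+$upper\n";
```
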

UNICODE IN OLDER PERLS

       If you cannot upgrade your Perl to 5.8.0 or later, you can still do
       some Unicode processing by using the modules "Unicode::String",
       "Unicode::Map8", and "Unicode::Map", available from CPAN.  If you have
       the GNU recode installed, you can also use the Perl front-end
       "Convert::Recode" for character conversions.

       The following are fast conversions from ISO 8859-1 (Latin-1) bytes to
       UTF-8 bytes and back.  The code works even with older Perl 5 versions.

           # ISO 8859-1 to UTF-8
           s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

           # UTF-8 to ISO 8859-1
           s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

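       A round trip through both substitutions can be sketched as follows.
       The sample string and variable names are illustrative, and the input
       is assumed to contain only Latin-1 bytes.

```perl
use strict;
use warnings;

my $latin1 = "na\xEFve caf\xE9";   # 10 bytes of ISO 8859-1

# ISO 8859-1 to UTF-8: each byte 0x80-0xFF becomes a two-byte sequence.
(my $utf8 = $latin1) =~
    s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

# UTF-8 to ISO 8859-1: each lead byte 0xC2 or 0xC3 plus its
# continuation byte collapses back into a single byte.
(my $back = $utf8) =~
    s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

printf "%d bytes as Latin-1, %d bytes as UTF-8\n",
    length($latin1), length($utf8);
```

       The reverse substitution only recognizes the lead bytes 0xC2 and
       0xC3, so it handles only UTF-8 that encodes code points up to 0xFF.
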

ACKNOWLEDGMENTS

       Thanks to the kind readers of the perl5-porters@perl.org,
       perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
       mailing lists for their valuable feedback.


AUTHOR, COPYRIGHT, AND LICENSE

       Copyright 2001-2002 Jarkko Hietaniemi <jhi@iki.fi>

       This document may be distributed under the same terms as Perl itself.



perl v5.8.8                       2006-01-07                   PERLUNIINTRO(1)
