PERLUNIINTRO(1)        Perl Programmers Reference Guide       PERLUNIINTRO(1)

NAME
       perluniintro - Perl Unicode introduction

DESCRIPTION
       This document gives a general idea of Unicode and how to use Unicode
       in Perl.  See "Further Resources" for references to more in-depth
       treatments of Unicode.

   Unicode
       Unicode is a character set standard which plans to codify all of the
       writing systems of the world, plus many other symbols.

       Unicode and ISO/IEC 10646 are coordinated standards that unify
       almost all other modern character set standards, covering more than
       80 writing systems and hundreds of languages, including all
       commercially important modern languages.  All characters in the
       largest Chinese, Japanese, and Korean dictionaries are also encoded.
       The standards will eventually cover almost all characters in more
       than 250 writing systems and thousands of languages.  Unicode 1.0
       was released in October 1991, and 6.0 in October 2010.

       A Unicode character is an abstract entity.  It is not bound to any
       particular integer width, especially not to the C language "char".
       Unicode is language-neutral and display-neutral: it does not encode
       the language of the text, and it does not generally define fonts or
       other graphical layout details.  Unicode operates on characters and
       on text built from those characters.

       Unicode defines characters like "LATIN CAPITAL LETTER A" or "GREEK
       SMALL LETTER ALPHA" and unique numbers for the characters, in this
       case 0x0041 and 0x03B1, respectively.  These unique numbers are
       called code points.  A code point is essentially the position of
       the character within the set of all possible Unicode characters,
       and thus in Perl, the term ordinal is often used interchangeably
       with it.

       The Unicode standard prefers using hexadecimal notation for the
       code points.  If numbers like 0x0041 are unfamiliar to you, take a
       peek at a later section, "Hexadecimal Notation".  The Unicode
       standard uses the notation "U+0041 LATIN CAPITAL LETTER A", to give
       the hexadecimal code point and the normative name of the character.

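       As a quick illustration (a minimal sketch, not taken from the
       standard itself), chr() and ord() map between characters and code
       points, and sprintf() can render the "U+" notation:

```perl
use strict;
use warnings;

my $alpha = chr(0x03B1);            # GREEK SMALL LETTER ALPHA
my $code  = ord($alpha);            # back to the code point, 0x03B1
printf "U+%04X\n", $code;           # prints "U+03B1"
```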
       Unicode also defines various properties for the characters, like
       "uppercase" or "lowercase", "decimal digit", or "punctuation";
       these properties are independent of the names of the characters.
       Furthermore, various operations on the characters like uppercasing,
       lowercasing, and collating (sorting) are defined.

51
52 A Unicode logical "character" can actually consist of more than one
53 internal actual "character" or code point. For Western languages, this
54 is adequately modelled by a base character (like "LATIN CAPITAL LETTER
55 A") followed by one or more modifiers (like "COMBINING ACUTE ACCENT").
56 This sequence of base character and modifiers is called a combining
57 character sequence. Some non-western languages require more
58 complicated models, so Unicode created the grapheme cluster concept,
59 which was later further refined into the extended grapheme cluster.
60 For example, a Korean Hangul syllable is considered a single logical
61 character, but most often consists of three actual Unicode characters:
62 a leading consonant followed by an interior vowel followed by a
63 trailing consonant.
64
65 Whether to call these extended grapheme clusters "characters" depends
66 on your point of view. If you are a programmer, you probably would tend
67 towards seeing each element in the sequences as one unit, or
68 "character". However from the user's point of view, the whole sequence
69 could be seen as one "character" since that's probably what it looks
70 like in the context of the user's language. In this document, we take
71 the programmer's point of view: one "character" is one Unicode code
72 point.
73
74 For some combinations of base character and modifiers, there are
75 precomposed characters. There is a single character equivalent, for
76 example, for the sequence "LATIN CAPITAL LETTER A" followed by
77 "COMBINING ACUTE ACCENT". It is called "LATIN CAPITAL LETTER A WITH
78 ACUTE". These precomposed characters are, however, only available for
79 some combinations, and are mainly meant to support round-trip
80 conversions between Unicode and legacy standards (like ISO 8859).
81 Using sequences, as Unicode does, allows for needing fewer basic
82 building blocks (code points) to express many more potential grapheme
83 clusters. To support conversion between equivalent forms, various
84 normalization forms are also defined. Thus, "LATIN CAPITAL LETTER A
85 WITH ACUTE" is in Normalization Form Composed, (abbreviated NFC), and
86 the sequence "LATIN CAPITAL LETTER A" followed by "COMBINING ACUTE
87 ACCENT" represents the same character in Normalization Form Decomposed
88 (NFD).
89
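       The core Unicode::Normalize module converts between these forms; a
       minimal sketch:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";  # one code point
my $decomposed = "A\N{COMBINING ACUTE ACCENT}";            # two code points

# The two spellings are canonically equivalent; normalizing both to
# the same form makes them compare equal with eq:
print NFD($composed)   eq $decomposed ? "same\n" : "different\n";
print NFC($decomposed) eq $composed   ? "same\n" : "different\n";
```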
       Because of backward compatibility with legacy encodings, the "a
       unique number for every character" idea breaks down a bit: instead,
       there is "at least one number for every character".  The same
       character could be represented differently in several legacy
       encodings.  The converse is not true: some code points do not have
       an assigned character.  Firstly, there are unallocated code points
       within otherwise used blocks.  Secondly, there are special Unicode
       control characters that do not represent true characters.

98
99 When Unicode was first conceived, it was thought that all the world's
100 characters could be represented using a 16-bit word; that is a maximum
101 of 0x10000 (or 65,536) characters would be needed, from 0x0000 to
102 0xFFFF. This soon proved to be wrong, and since Unicode 2.0 (July
103 1996), Unicode has been defined all the way up to 21 bits (0x10FFFF),
104 and Unicode 3.1 (March 2001) defined the first characters above 0xFFFF.
105 The first 0x10000 characters are called the Plane 0, or the Basic
106 Multilingual Plane (BMP). With Unicode 3.1, 17 (yes, seventeen) planes
107 in all were defined--but they are nowhere near full of defined
108 characters, yet.
109
110 When a new language is being encoded, Unicode generally will choose a
111 "block" of consecutive unallocated code points for its characters. So
112 far, the number of code points in these blocks has always been evenly
113 divisible by 16. Extras in a block, not currently needed, are left
114 unallocated, for future growth. But there have been occasions when a
115 later release needed more code points than the available extras, and a
116 new block had to allocated somewhere else, not contiguous to the
117 initial one, to handle the overflow. Thus, it became apparent early on
118 that "block" wasn't an adequate organizing principle, and so the
119 "Script" property was created. (Later an improved script property was
120 added as well, the "Script_Extensions" property.) Those code points
121 that are in overflow blocks can still have the same script as the
122 original ones. The script concept fits more closely with natural
123 language: there is "Latin" script, "Greek" script, and so on; and there
124 are several artificial scripts, like "Common" for characters that are
125 used in multiple scripts, such as mathematical symbols. Scripts
126 usually span varied parts of several blocks. For more information
127 about scripts, see "Scripts" in perlunicode. The division into blocks
128 exists, but it is almost completely accidental--an artifact of how the
129 characters have been and still are allocated. (Note that this
130 paragraph has oversimplified things for the sake of this being an
131 introduction. Unicode doesn't really encode languages, but the writing
132 systems for them--their scripts; and one script can be used by many
133 languages. Unicode also encodes things that aren't really about
134 languages, such as symbols like "BAGGAGE CLAIM".)
135
136 The Unicode code points are just abstract numbers. To input and output
137 these abstract numbers, the numbers must be encoded or serialised
138 somehow. Unicode defines several character encoding forms, of which
139 UTF-8 is the most popular. UTF-8 is a variable length encoding that
140 encodes Unicode characters as 1 to 4 bytes. Other encodings include
141 UTF-16 and UTF-32 and their big- and little-endian variants (UTF-8 is
142 byte-order independent). The ISO/IEC 10646 defines the UCS-2 and UCS-4
143 encoding forms.
144
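       The variable-length property is easy to see with the core Encode
       module; a minimal sketch:

```perl
use strict;
use warnings;
use Encode qw(encode);

# UTF-8 uses more bytes for higher code points: one byte for ASCII
# "A" (U+0041), two for U+00DF, three for U+263A, four for U+1F600.
my @byte_lengths =
    map { length encode("UTF-8", chr($_)) } (0x41, 0xDF, 0x263A, 0x1F600);
print "@byte_lengths\n";    # prints "1 2 3 4"
```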
       For more information about encodings--for instance, to learn what
       surrogates and byte order marks (BOMs) are--see perlunicode.

   Perl's Unicode Support
       Starting from Perl v5.6.0, Perl has had the capacity to handle
       Unicode natively.  Perl v5.8.0, however, is the first recommended
       release for serious Unicode work.  The maintenance release 5.6.1
       fixed many of the problems of the initial Unicode implementation,
       but for example regular expressions still do not work with Unicode
       in 5.6.1.  Perl v5.14.0 is the first release where Unicode support
       is (almost) seamlessly integratable without some gotchas.  (There
       are a few exceptions.  Firstly, some differences in quotemeta were
       fixed starting in Perl 5.16.0.  Secondly, some differences in the
       range operator were fixed starting in Perl 5.26.0.  Thirdly, some
       differences in split were fixed starting in Perl 5.28.0.)

       To enable this seamless support, you should "use feature
       'unicode_strings'" (which is automatically selected if you "use
       v5.12" or higher).  See feature.  (5.14 also fixes a number of bugs
       and departures from the Unicode standard.)

       Before Perl v5.8.0, "use utf8" was used to declare that operations
       in the current block or file would be Unicode-aware.  This model
       was found to be wrong, or at least clumsy: the "Unicodeness" is now
       carried with the data, instead of being attached to the operations.
       Starting with Perl v5.8.0, only one case remains where an explicit
       "use utf8" is needed: if your Perl script itself is encoded in
       UTF-8, you can use UTF-8 in your identifier names, and in string
       and regular expression literals, by saying "use utf8".  This is not
       the default because scripts with legacy 8-bit data in them would
       break.  See utf8.

   Perl's Unicode Model
       Perl supports both pre-5.6 strings of eight-bit native bytes, and
       strings of Unicode characters.  The general principle is that Perl
       tries to keep its data as eight-bit bytes for as long as possible,
       but as soon as Unicodeness cannot be avoided, the data is
       transparently upgraded to Unicode.  Prior to Perl v5.14.0, the
       upgrade was not completely transparent (see "The "Unicode Bug"" in
       perlunicode), and for backwards compatibility, full transparency is
       not gained unless "use feature 'unicode_strings'" (see feature) or
       "use v5.12" (or higher) is selected.

       Internally, Perl currently uses either whatever the native
       eight-bit character set of the platform (for example Latin-1) is,
       defaulting to UTF-8, to encode Unicode strings.  Specifically, if
       all code points in the string are 0xFF or less, Perl uses the
       native eight-bit character set.  Otherwise, it uses UTF-8.

       A user of Perl does not normally need to know nor care how Perl
       happens to encode its internal strings, but it becomes relevant
       when outputting Unicode strings to a stream without a PerlIO layer
       (one with the "default" encoding).  In such a case, the raw bytes
       used internally (the native character set or UTF-8, as appropriate
       for each string) will be used, and a "Wide character" warning will
       be issued if those strings contain a character beyond 0x00FF.

       For example,

           perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

       produces a fairly useless mixture of native bytes and UTF-8, as
       well as a warning:

           Wide character in print at ...

       To output UTF-8, use the ":encoding" or ":utf8" output layer.
       Prepending

           binmode(STDOUT, ":utf8");

       to this sample program ensures that the output is completely UTF-8,
       and removes the program's warning.

       You can enable automatic UTF-8-ification of your standard file
       handles, default open() layer, and @ARGV by using either the "-C"
       command line switch or the "PERL_UNICODE" environment variable; see
       perlrun for the documentation of the "-C" switch.

       Note that this means that Perl expects other software to work the
       same way: if Perl has been led to believe that STDIN should be
       UTF-8, but then STDIN coming in from another command is not UTF-8,
       Perl will likely complain about the malformed UTF-8.

       All features that combine Unicode and I/O also require using the
       new PerlIO feature.  Almost all Perl 5.8 platforms do use PerlIO,
       though: you can see whether yours does by running "perl -V" and
       looking for "useperlio=define".

   Unicode and EBCDIC
       Perl 5.8.0 added support for Unicode on EBCDIC platforms.  This
       support was allowed to lapse in later releases, but was revived in
       5.22.  Unicode support is somewhat more complex to implement there,
       since additional conversions are needed.  See perlebcdic for more
       information.

       On EBCDIC platforms, the internal Unicode encoding form is
       UTF-EBCDIC instead of UTF-8.  The difference is that UTF-8 is
       "ASCII-safe", in that ASCII characters encode to UTF-8 as-is, while
       UTF-EBCDIC is "EBCDIC-safe", in that all the basic characters
       (which includes all those that have ASCII equivalents, like "A",
       "0", "%", etc.) are the same in both EBCDIC and UTF-EBCDIC.  Often,
       documentation will use the term "UTF-8" to mean UTF-EBCDIC as well.
       This is the case in this document.

   Creating Unicode
       This section applies fully to Perls starting with v5.22.  Various
       caveats for earlier releases are in the "Earlier releases caveats"
       subsection below.

       To create Unicode characters in literals, use the "\N{...}"
       notation in double-quoted strings:

           my $smiley_from_name = "\N{WHITE SMILING FACE}";
           my $smiley_from_code_point = "\N{U+263a}";

       Similarly, they can be used in regular expression literals

           $smiley =~ /\N{WHITE SMILING FACE}/;
           $smiley =~ /\N{U+263a}/;

       or, starting in v5.32:

           $smiley =~ /\p{Name=WHITE SMILING FACE}/;
           $smiley =~ /\p{Name=whitesmilingface}/;

       At run-time you can use:

           use charnames ();
           my $hebrew_alef_from_name
                       = charnames::string_vianame("HEBREW LETTER ALEF");
           my $hebrew_alef_from_code_point = charnames::string_vianame("U+05D0");

       Naturally, ord() will do the reverse: it turns a character into a
       code point.

       There are other runtime options as well.  You can use pack():

           my $hebrew_alef_from_code_point = pack("U", 0x05d0);

       Or you can use chr(), though it is less convenient in the general
       case:

           $hebrew_alef_from_code_point = chr(utf8::unicode_to_native(0x05d0));
           utf8::upgrade($hebrew_alef_from_code_point);

       The utf8::unicode_to_native() and utf8::upgrade() calls aren't
       needed if the argument is above 0xFF, so the above could have been
       written as

           $hebrew_alef_from_code_point = chr(0x05d0);

       since 0x5d0 is above 255.

       "\x{}" and "\o{}" can also be used to specify code points at
       compile time in double-quotish strings, but, for backward
       compatibility with older Perls, the same rules apply as with chr()
       for code points less than 256.

       utf8::unicode_to_native() is used so that the Perl code is portable
       to EBCDIC platforms.  You can omit it if you're really sure no one
       will ever want to use your code on a non-ASCII platform.  Starting
       in Perl v5.22, calls to it on ASCII platforms are optimized out, so
       there's no performance penalty at all in adding it.  Or you can
       simply use the other constructs that don't require it.

       See "Further Resources" for how to find all these names and numeric
       codes.

   Earlier releases caveats
       On EBCDIC platforms, prior to v5.22, using "\N{U+...}" doesn't work
       properly.

       Prior to v5.16, using "\N{...}" with a character name (as opposed
       to a "U+..." code point) required a "use charnames :full".

       Prior to v5.14, there were some bugs in "\N{...}" with a character
       name (as opposed to a "U+..." code point).

       charnames::string_vianame() was introduced in v5.14.  Prior to
       that, charnames::vianame() should work, but only if the argument is
       of the form "U+...".  Your best bet there for runtime Unicode by
       character name is probably:

           use charnames ();
           my $hebrew_alef_from_name
                     = pack("U", charnames::vianame("HEBREW LETTER ALEF"));

   Handling Unicode
       Handling Unicode is for the most part transparent: just use the
       strings as usual.  Functions like index(), length(), and substr()
       will work on the Unicode characters; regular expressions will work
       on the Unicode characters (see perlunicode and perlretut).

       Note that Perl considers grapheme clusters to be separate
       characters, so for example

           print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"),
                 "\n";

       will print 2, not 1.  The only exception is that regular
       expressions have "\X" for matching an extended grapheme cluster.
       (Thus "\X" in a regular expression would match the entire sequence
       of both the example characters.)

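       A minimal sketch contrasting length() with "\X":

```perl
use strict;
use warnings;

my $s = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}";
print length($s), "\n";             # prints 2: two code points
my @clusters = $s =~ /(\X)/g;
print scalar(@clusters), "\n";      # prints 1: one grapheme cluster
```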
       Life is not quite so transparent, however, when working with legacy
       encodings, I/O, and certain special cases:

   Legacy Encodings
       When you combine legacy data and Unicode, the legacy data needs to
       be upgraded to Unicode.  Normally the legacy data is assumed to be
       ISO 8859-1 (or EBCDIC, if applicable).

       The "Encode" module knows about many encodings and has interfaces
       for doing conversions between those encodings:

           use Encode 'decode';
           $data = decode("iso-8859-3", $data); # convert from legacy

   Unicode I/O
       Normally, writing out Unicode data

           print FH $some_string_with_unicode, "\n";

       produces raw bytes that Perl happens to use to internally encode
       the Unicode string.  Perl's internal encoding depends on the system
       as well as on what characters happen to be in the string at the
       time.  If any of the characters are at code points 0x100 or above,
       you will get a warning.  To ensure that the output is explicitly
       rendered in the encoding you desire--and to avoid the warning--open
       the stream with the desired encoding.  Some examples:

           open FH, ">:utf8", "file";

           open FH, ">:encoding(ucs2)",      "file";
           open FH, ">:encoding(UTF-8)",     "file";
           open FH, ">:encoding(shift_jis)", "file";

       and on already open streams, use binmode():

           binmode(STDOUT, ":utf8");

           binmode(STDOUT, ":encoding(ucs2)");
           binmode(STDOUT, ":encoding(UTF-8)");
           binmode(STDOUT, ":encoding(shift_jis)");

       The matching of encoding names is loose: case does not matter, and
       many encodings have several aliases.  Note that the ":utf8" layer
       must always be specified exactly like that; it is not subject to
       the loose matching of encoding names.  Also note that currently
       ":utf8" is unsafe for input, because it accepts the data without
       validating that it is indeed valid UTF-8; you should instead use
       ":encoding(UTF-8)" (with or without a hyphen).

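       As an illustration, an ":encoding(UTF-8)" layer decodes raw bytes
       into characters as they are read; a minimal sketch using an
       in-memory file handle makes the round trip easy to see:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Raw UTF-8 bytes for "caf\x{E9}\n", read back through a decoding layer:
my $bytes = encode("UTF-8", "caf\x{E9}\n");
open my $fh, "<:encoding(UTF-8)", \$bytes or die $!;
my $line = <$fh>;
close $fh;
print $line eq "caf\x{E9}\n" ? "decoded\n" : "mismatch\n";
```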
       See PerlIO for the ":utf8" layer, PerlIO::encoding and
       Encode::PerlIO for the ":encoding()" layer, and Encode::Supported
       for many encodings supported by the "Encode" module.

       Reading in a file that you know happens to be encoded in one of the
       Unicode or legacy encodings does not magically turn the data into
       Unicode in Perl's eyes.  To do that, specify the appropriate layer
       when opening files:

           open(my $fh, '<:encoding(UTF-8)', 'anything');
           my $line_of_unicode = <$fh>;

           open(my $fh, '<:encoding(Big5)', 'anything');
           my $line_of_unicode = <$fh>;

       The I/O layers can also be specified more flexibly with the "open"
       pragma.  See open, or look at the following example.

           use open ':encoding(UTF-8)'; # input/output default encoding will be
                                        # UTF-8
           open X, ">file";
           print X chr(0x100), "\n";
           close X;
           open Y, "<file";
           printf "%#x\n", ord(<Y>); # this should print 0x100
           close Y;

       With the "open" pragma you can use the ":locale" layer

           BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
           # the :locale will probe the locale environment variables like
           # LC_ALL
           use open OUT => ':locale'; # russki parusski
           open(O, ">koi8");
           print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
           close O;
           open(I, "<koi8");
           printf "%#x\n", ord(<I>); # this should print 0xc1
           close I;

       These methods install a transparent filter on the I/O stream that
       converts data from the specified encoding when it is read in from
       the stream.  The result is always Unicode.

       The open pragma affects all the open() calls after the pragma by
       setting default layers.  If you want to affect only certain
       streams, use explicit layers directly in the open() call.

       You can switch encodings on an already opened stream by using
       binmode(); see "binmode" in perlfunc.

       The ":locale" layer does not currently work with open() and
       binmode(), only with the "open" pragma.  The ":utf8" and
       ":encoding(...)" layers do work with all of open(), binmode(), and
       the "open" pragma.

       Similarly, you may use these I/O layers on output streams to
       automatically convert Unicode to the specified encoding when it is
       written to the stream.  For example, the following snippet copies
       the contents of the file "text.jis" (encoded as ISO-2022-JP, aka
       JIS) to the file "text.utf8", encoded as UTF-8:

           open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
           open(my $unicode, '>:utf8',                  'text.utf8');
           while (<$nihongo>) { print $unicode $_ }

       The naming of encodings, both by the open() and by the "open"
       pragma, allows for flexible names: "koi8-r" and "KOI8R" will both
       be understood.

       Common encoding names used by ISO, MIME, IANA, and various other
       standardisation organisations are recognised; for a more detailed
       list see Encode::Supported.

       read() reads characters and returns the number of characters.
       seek() and tell() operate on byte counts, as does sysseek().

       sysread() and syswrite() should not be used on file handles with
       character encoding layers; they behave badly, and that behaviour
       has been deprecated since perl 5.24.

       Notice that because of the default behaviour of not doing any
       conversion upon input if there is no default layer, it is easy to
       mistakenly write code that keeps on expanding a file by repeatedly
       encoding the data:

           # BAD CODE WARNING
           open F, "file";
           local $/; ## read in the whole file of 8-bit characters
           $t = <F>;
           close F;
           open F, ">:encoding(UTF-8)", "file";
           print F $t; ## convert to UTF-8 on output
           close F;

       If you run this code twice, the contents of the file will be
       UTF-8 encoded twice.  A "use open ':encoding(UTF-8)'", or
       explicitly opening the file for input as UTF-8 as well, would have
       avoided the bug.

       NOTE: the ":utf8" and ":encoding" features work only if your Perl
       has been built with PerlIO, which is the default on most systems.

   Displaying Unicode As Text
       Sometimes you might want to display Perl scalars containing Unicode
       as simple ASCII (or EBCDIC) text.  The following subroutine
       converts its argument so that Unicode characters with code points
       greater than 255 are displayed as "\x{...}", control characters
       (like "\n") are displayed as "\x..", and the rest of the characters
       as themselves:

           sub nice_string {
               join("",
                 map { $_ > 255                     # if wide character...
                       ? sprintf("\\x{%04X}", $_)   # \x{...}
                       : chr($_) =~ /[[:cntrl:]]/   # else if control character...
                         ? sprintf("\\x%02X", $_)   # \x..
                         : quotemeta(chr($_))       # else quoted or as themselves
                     } unpack("W*", $_[0]));        # unpack Unicode characters
           }

       For example,

           nice_string("foo\x{100}bar\n")

       returns the string

           'foo\x{0100}bar\x0A'

       which is ready to be printed.

       ("\\x{}" is used here instead of "\\N{}", since it's most likely
       that you want to see what the native values are.)

   Special Cases
       •   Starting in Perl 5.28, it is illegal for bit operators, like
           "~", to operate on strings containing code points above 255.

       •   The vec() function may produce surprising results if used on
           strings containing characters with ordinal values above 255.
           In such a case, the results are consistent with the internal
           encoding of the characters, but not with much else.  So don't
           do that; starting in Perl 5.28, a deprecation message is issued
           if you do so, and this usage becomes illegal in Perl 5.32.

       •   Peeking At Perl's Internal Encoding

           Normal users of Perl should never care how Perl encodes any
           particular Unicode string (because the normal ways to get at
           the contents of a string with Unicode--via input and
           output--should always be via explicitly-defined I/O layers).
           But if you must, there are two ways of looking behind the
           scenes.

           One way of peeking inside the internal encoding of Unicode
           characters is to use "unpack("C*", ...)" to get the bytes of
           whatever the string encoding happens to be, or
           "unpack("U0..", ...)" to get the bytes of the UTF-8 encoding:

               # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
               print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";

           Yet another way would be to use the Devel::Peek module:

               perl -MDevel::Peek -e 'Dump(chr(0x100))'

           That shows the "UTF8" flag in FLAGS and both the UTF-8 bytes
           and Unicode characters in "PV".  See also later in this
           document the discussion about the utf8::is_utf8() function.

   Advanced Topics
       •   String Equivalence

           The question of string equivalence turns somewhat complicated
           in Unicode: what do you mean by "equal"?

           (Is "LATIN CAPITAL LETTER A WITH ACUTE" equal to "LATIN
           CAPITAL LETTER A"?)

           The short answer is that by default Perl compares equivalence
           ("eq", "ne") based only on code points of the characters.  In
           the above case, the answer is no (because 0x00C1 != 0x0041).
           But sometimes, any CAPITAL LETTER A's should be considered
           equal, or even A's of any case.

           The long answer is that you need to consider character
           normalization and casing issues: see Unicode::Normalize,
           Unicode Technical Report #15, Unicode Normalization Forms
           <https://www.unicode.org/reports/tr15>, and sections on case
           mapping in the Unicode Standard <https://www.unicode.org>.

           As of Perl 5.8.0, the "Full" case-folding of Case
           Mappings/SpecialCasing is implemented, but bugs remain in
           "qr//i" with them, mostly fixed by 5.14, and essentially
           entirely by 5.18.

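           A minimal sketch of an equality check that ignores both
           composed/decomposed differences and case, using
           Unicode::Normalize together with fc() (available with "use
           feature 'fc'" since v5.16):

```perl
use strict;
use warnings;
use feature 'fc';
use Unicode::Normalize qw(NFD);

my $precomposed = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";
my $decomposed  = "a\N{COMBINING ACUTE ACCENT}";

# Code-point-wise the strings differ...
print $precomposed eq $decomposed ? "eq\n" : "ne\n";    # prints "ne"
# ...but after normalizing and case folding they compare equal:
print fc(NFD($precomposed)) eq fc(NFD($decomposed)) ? "eq\n" : "ne\n";
```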
       •   String Collation

           People like to see their strings nicely sorted--or as Unicode
           parlance goes, collated.  But again, what do you mean by
           collate?

           (Does "LATIN CAPITAL LETTER A WITH ACUTE" come before or after
           "LATIN CAPITAL LETTER A WITH GRAVE"?)

           The short answer is that by default, Perl compares strings
           ("lt", "le", "cmp", "ge", "gt") based only on the code points
           of the characters.  In the above case, the answer is "after",
           since 0x00C1 > 0x00C0.

           The long answer is that "it depends", and a good answer cannot
           be given without knowing (at the very least) the language
           context.  See Unicode::Collate, and the Unicode Collation
           Algorithm <https://www.unicode.org/reports/tr10/>.

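           A minimal sketch of the core Unicode::Collate module, which
           implements the Unicode Collation Algorithm:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new();

# Code-point order puts "B" (0x42) before "a" (0x61); the UCA's
# default collation compares letters case-insensitively at the
# primary level, so "a" sorts before "B":
my @sorted = $collator->sort("B", "a");
print "@sorted\n";    # prints "a B"
```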
   Miscellaneous
       •   Character Ranges and Classes

           Character ranges in regular expression bracketed character
           classes (e.g., "/[a-z]/") and in the "tr///" (also known as
           "y///") operator are not magically Unicode-aware.  What this
           means is that "[A-Za-z]" will not magically start to mean "all
           alphabetic letters" (not that it does mean that even for 8-bit
           characters; for those, if you are using locales (perllocale),
           use "/[[:alpha:]]/"; and if not, use the 8-bit-aware property
           "\p{alpha}").

           All the properties that begin with "\p" (and its inverse "\P")
           are actually character classes that are Unicode-aware.  There
           are dozens of them; see perluniprops.

           Starting in v5.22, you can use Unicode code points as the end
           points of regular expression pattern character ranges, and the
           range will include all Unicode code points that lie between
           those end points, inclusive.

               qr/ [ \N{U+03} - \N{U+20} ] /xx

           includes the code points "\N{U+03}", "\N{U+04}", ...,
           "\N{U+20}".

           This also works for ranges in "tr///" starting in Perl v5.24.

       •   String-To-Number Conversions

           Unicode does define several other decimal--and
           numeric--characters besides the familiar 0 to 9, such as the
           Arabic and Indic digits.  Perl does not support
           string-to-number conversion for digits other than ASCII 0 to 9
           (and ASCII "a" to "f" for hexadecimal).  To get safe
           conversions from any Unicode string, use "num()" in
           Unicode::UCD.

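           A minimal sketch of "num()" from the core Unicode::UCD module:

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# ASCII digits convert as usual; non-ASCII digits need num():
my $arabic_indic = "\N{ARABIC-INDIC DIGIT FOUR}\N{ARABIC-INDIC DIGIT TWO}";
print num("42"), "\n";             # prints 42
print num($arabic_indic), "\n";    # also prints 42
```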
   Questions With Answers
       •   Will My Old Scripts Break?

           Very probably not.  Unless you are generating Unicode
           characters somehow, old behaviour should be preserved.  About
           the only behaviour that has changed and which could start
           generating Unicode is the old behaviour of chr(), where
           supplying an argument of more than 255 produced a character
           modulo 255.  chr(300), for example, was equal to chr(45) or
           "-" (in ASCII); now it is LATIN CAPITAL LETTER I WITH BREVE.

       •   How Do I Make My Scripts Work With Unicode?

           Very little work should be needed since nothing changes until
           you generate Unicode data.  The most important thing is
           getting input as Unicode; for that, see the earlier I/O
           discussion.  To get full seamless Unicode support, add "use
           feature 'unicode_strings'" (or "use v5.12" or higher) to your
           script.

       •   How Do I Know Whether My String Is In Unicode?

           You shouldn't have to care.  But you may, if your Perl is
           before 5.14.0 or you haven't specified "use feature
           'unicode_strings'" or "use 5.012" (or higher), because
           otherwise the rules for the code points in the range 128 to
           255 are different depending on whether the string they are
           contained within is in Unicode or not.  (See "When Unicode
           Does Not Happen" in perlunicode.)

           To determine if a string is in Unicode, use:

               print utf8::is_utf8($string) ? 1 : 0, "\n";

           But note that this doesn't mean that any of the characters in
           the string are necessarily UTF-8 encoded, or that any of the
           characters have code points greater than 0xFF (255) or even
           0x80 (128), or that the string has any characters at all.  All
           that is_utf8() does is to return the value of the internal
           "utf8ness" flag attached to the $string.  If the flag is off,
           the bytes in the scalar are interpreted as a single byte
           encoding.  If the flag is on, the bytes in the scalar are
           interpreted as the (variable-length, potentially multi-byte)
           UTF-8 encoded code points of the characters.  Bytes added to a
           UTF-8 encoded string are automatically upgraded to UTF-8.  If
           mixed non-UTF-8 and UTF-8 scalars are merged (double-quoted
           interpolation, explicit concatenation, or printf/sprintf
           parameter substitution), the result will be UTF-8 encoded as
           if copies of the byte strings were upgraded to UTF-8: for
           example,

               $a = "ab\x80c";
               $b = "\x{100}";
               print "$a = $b\n";

           the output string will be UTF-8-encoded "ab\x80c = \x{100}\n",
           but $a will stay byte-encoded.

           Sometimes you might really need to know the byte length of a
           string instead of the character length.  For that use the
           "bytes" pragma and the length() function:

               my $unicode = chr(0x100);
               print length($unicode), "\n"; # will print 1
               use bytes;
               print length($unicode), "\n"; # will print 2
                                             # (the 0xC4 0x80 of the UTF-8)
               no bytes;

706 • How Do I Find Out What Encoding a File Has?
707
708 You might try Encode::Guess, but it has a number of limitations.
709
710 • How Do I Detect Data That's Not Valid In a Particular Encoding?
711
712 Use the "Encode" package to try converting it. For example,
713
714 use Encode 'decode';
715
716 if (eval { decode('UTF-8', $string, Encode::FB_CROAK); 1 }) {
717 # $string is valid UTF-8
718 } else {
719 # $string is not valid UTF-8
720 }
721
722 Or use "unpack" to try decoding it:
723
724 use warnings;
725 @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
726
727 If invalid, a "Malformed UTF-8 character" warning is produced. The
728 "C0" means "process the string character per character". Without
729 that, the "unpack("U*", ...)" would work in "U0" mode (the default
730 if the format string starts with "U") and it would return the bytes
731 making up the UTF-8 encoding of the target string, something that
732 will always work.
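
           The unpack check above can be wrapped into a small helper that
           traps the warning.  A sketch: looks_like_utf8() is a
           hypothetical name, and it relies on catching the "Malformed
           UTF-8 character" warning with a local __WARN__ handler:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $bytes decodes cleanly as UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $malformed = 0;
    local $SIG{__WARN__} = sub { $malformed++ if $_[0] =~ /Malformed/ };
    my @chars = unpack("C0U*", $bytes);
    return $malformed == 0;
}

print looks_like_utf8("\xC4\x80") ? "valid\n" : "invalid\n";  # UTF-8 for U+0100
print looks_like_utf8("\xC4")     ? "valid\n" : "invalid\n";  # truncated sequence
```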

       •   How Do I Convert Binary Data Into a Particular Encoding, Or
           Vice Versa?

           This probably isn't as useful as you might think.  Normally,
           you shouldn't need to.

           In one sense, what you are asking doesn't make much sense:
           encodings are for characters, and binary data are not
           "characters", so converting "data" into some encoding isn't
           meaningful unless you know what character set and encoding the
           binary data is in, in which case it's not just binary data, now
           is it?

           If you have a raw sequence of bytes that you know should be
           interpreted via a particular encoding, you can use "Encode":

               use Encode 'from_to';
               from_to($data, "iso-8859-1", "UTF-8"); # from latin-1 to UTF-8

           The call to from_to() changes the bytes in $data, but nothing
           material about the nature of the string has changed as far as
           Perl is concerned.  Both before and after the call, the string
           $data contains just a bunch of 8-bit bytes.  As far as Perl is
           concerned, the encoding of the string remains as "system-native
           8-bit bytes".

           You might relate this to a fictional 'Translate' module:

               use Translate;
               my $phrase = "Yes";
               Translate::from_to($phrase, 'english', 'deutsch');
               ## phrase now contains "Ja"

           The contents of the string change, but not the nature of the
           string.  Perl doesn't know any more after the call than before
           that the contents of the string indicate the affirmative.

           Back to converting data.  If you have (or want) data in your
           system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.),
           you can use pack/unpack to convert to/from Unicode.

               $native_string  = pack("W*", unpack("U*", $Unicode_string));
               $Unicode_string = pack("U*", unpack("W*", $native_string));

           If you have a sequence of bytes you know is valid UTF-8, but
           Perl doesn't know it yet, you can make Perl a believer, too:

               $Unicode = $bytes;
               utf8::decode($Unicode);

           or:

               $Unicode = pack("U0a*", $bytes);

           You can find the bytes that make up a UTF-8 sequence with

               @bytes = unpack("C*", $Unicode_string)

           and you can create well-formed Unicode with

               $Unicode_string = pack("U*", 0xff, ...)
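
           As a concrete sketch of the pack/unpack conversions above
           (using a hypothetical Latin-1 string; the "W" template requires
           Perl 5.10 or later):

```perl
use strict;
use warnings;

my $native = "caf\xE9";    # "café" in the native 8-bit encoding (Latin-1)

# Native 8-bit -> Unicode string, and back again.
my $unicode = pack("U*", unpack("W*", $native));
my $back    = pack("W*", unpack("U*", $unicode));

print length($unicode), "\n";                          # 4 characters
print $back eq $native ? "round trip ok" : "mismatch", "\n";
```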

       •   How Do I Display Unicode?  How Do I Input Unicode?

           See <http://www.alanwood.net/unicode/> and
           <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

       •   How Does Unicode Work With Traditional Locales?

           If your locale is a UTF-8 locale, starting in Perl v5.26, Perl
           works well for all categories; before this, starting with Perl
           v5.20, it works for all categories but "LC_COLLATE", which
           deals with sorting and the "cmp" operator.  But note that the
           standard "Unicode::Collate" and "Unicode::Collate::Locale"
           modules offer much more powerful solutions to collation issues,
           and work on earlier releases.

           For other locales, starting in Perl 5.16, you can specify

               use locale ':not_characters';

           to get Perl to work well with them.  The catch is that you have
           to translate from the locale character set to/from Unicode
           yourself.  See "Unicode I/O" above for how to

               use open ':locale';

           to accomplish this, but full details are in "Unicode and UTF-8"
           in perllocale, including gotchas that happen if you don't
           specify ":not_characters".

   Hexadecimal Notation
       The Unicode standard prefers using hexadecimal notation because
       that more clearly shows the division of Unicode into blocks of 256
       characters.  Hexadecimal is also simply shorter than decimal.  You
       can use decimal notation, too, but learning to use hexadecimal just
       makes life easier with the Unicode standard.  The "U+HHHH" notation
       uses hexadecimal, for example.

       The "0x" prefix means a hexadecimal number, the digits are 0-9 and
       a-f (or A-F, case doesn't matter).  Each hexadecimal digit
       represents four bits, or half a byte.  "print 0x..., "\n"" will
       show a hexadecimal number in decimal, and "printf "%x\n", $decimal"
       will show a decimal number in hexadecimal.  If you have just the
       "hex digits" of a hexadecimal number, you can use the hex()
       function.

           print 0x0009, "\n";    # 9
           print 0x000a, "\n";    # 10
           print 0x000f, "\n";    # 15
           print 0x0010, "\n";    # 16
           print 0x0011, "\n";    # 17
           print 0x0100, "\n";    # 256

           print 0x0041, "\n";    # 65

           printf "%x\n",  65;    # 41
           printf "%#x\n", 65;    # 0x41

           print hex("41"), "\n"; # 65

   Further Resources
       •   Unicode Consortium

           <https://www.unicode.org/>

       •   Unicode FAQ

           <https://www.unicode.org/faq/>

       •   Unicode Glossary

           <https://www.unicode.org/glossary/>

       •   Unicode Recommended Reading List

           The Unicode Consortium has a list of articles and books, some
           of which give a much more in-depth treatment of Unicode:
           <http://unicode.org/resources/readinglist.html>

       •   Unicode Useful Resources

           <https://www.unicode.org/unicode/onlinedat/resources.html>

       •   Unicode and Multilingual Support in HTML, Fonts, Web Browsers
           and Other Applications

           <http://www.alanwood.net/unicode/>

       •   UTF-8 and Unicode FAQ for Unix/Linux

           <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

       •   Legacy Character Sets

           <http://www.czyborra.com/>
           <http://www.eki.ee/letter/>

       •   You can explore various information from the Unicode data files
           using the "Unicode::UCD" module.
UNICODE IN OLDER PERLS
       If you cannot upgrade your Perl to 5.8.0 or later, you can still do
       some Unicode processing by using the modules "Unicode::String",
       "Unicode::Map8", and "Unicode::Map", available from CPAN.  If you
       have the GNU recode installed, you can also use the Perl front-end
       "Convert::Recode" for character conversions.

       The following are fast conversions from ISO 8859-1 (Latin-1) bytes
       to UTF-8 bytes and back; the code works even with older Perl 5
       versions.

           # ISO 8859-1 to UTF-8
           s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

           # UTF-8 to ISO 8859-1
           s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

SEE ALSO
       perlunitut, perlunicode, Encode, open, utf8, bytes, perlretut,
       perlrun, Unicode::Collate, Unicode::Normalize, Unicode::UCD

ACKNOWLEDGMENTS
       Thanks to the kind readers of the perl5-porters@perl.org,
       perl-unicode@perl.org, linux-utf8@nl.linux.org, and
       unicore@unicode.org mailing lists for their valuable feedback.

AUTHOR, COPYRIGHT, AND LICENSE
       Copyright 2001-2011 Jarkko Hietaniemi <jhi@iki.fi>.  Now maintained
       by Perl 5 Porters.

       This document may be distributed under the same terms as Perl
       itself.



perl v5.38.2                      2023-11-30                 PERLUNIINTRO(1)