PERLUNIINTRO(1)        Perl Programmers Reference Guide        PERLUNIINTRO(1)



NAME
       perluniintro - Perl Unicode introduction

DESCRIPTION
       This document gives a general idea of Unicode and how to use Unicode
       in Perl.  See "Further Resources" for references to more in-depth
       treatments of Unicode.

   Unicode
       Unicode is a character set standard which plans to codify all of the
       writing systems of the world, plus many other symbols.

       Unicode and ISO/IEC 10646 are coordinated standards that unify almost
       all other modern character set standards, covering more than 80
       writing systems and hundreds of languages, including all
       commercially-important modern languages.  All characters in the
       largest Chinese, Japanese, and Korean dictionaries are also encoded.
       The standards will eventually cover almost all characters in more
       than 250 writing systems and thousands of languages.  Unicode 1.0 was
       released in October 1991, and 6.0 in October 2010.

       A Unicode character is an abstract entity.  It is not bound to any
       particular integer width, especially not to the C language "char".
       Unicode is language-neutral and display-neutral: it does not encode
       the language of the text, and it does not generally define fonts or
       other graphical layout details.  Unicode operates on characters and
       on text built from those characters.

       Unicode defines characters like "LATIN CAPITAL LETTER A" or "GREEK
       SMALL LETTER ALPHA" and unique numbers for the characters, in this
       case 0x0041 and 0x03B1, respectively.  These unique numbers are
       called code points.  A code point is essentially the position of the
       character within the set of all possible Unicode characters, and
       thus in Perl, the term ordinal is often used interchangeably with
       it.

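       As a small hedged sketch of the character/code-point relationship
       (assuming an ASCII platform, where "A" is 0x0041), Perl's built-in
       "ord()" and "chr()" map between a character and its ordinal:

```perl
use strict;
use warnings;

# ord() turns a character into its code point; chr() is the inverse.
printf "U+%04X\n", ord("A");         # U+0041  LATIN CAPITAL LETTER A
printf "U+%04X\n", ord(chr(0x3B1));  # U+03B1  GREEK SMALL LETTER ALPHA
```
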
       The Unicode standard prefers using hexadecimal notation for the code
       points.  If numbers like 0x0041 are unfamiliar to you, take a peek
       at a later section, "Hexadecimal Notation".  The Unicode standard
       uses the notation "U+0041 LATIN CAPITAL LETTER A", to give the
       hexadecimal code point and the normative name of the character.

       Unicode also defines various properties for the characters, like
       "uppercase" or "lowercase", "decimal digit", or "punctuation"; these
       properties are independent of the names of the characters.
       Furthermore, various operations on the characters like uppercasing,
       lowercasing, and collating (sorting) are defined.

       A Unicode logical "character" can actually consist of more than one
       internal actual "character" or code point.  For Western languages,
       this is adequately modelled by a base character (like "LATIN CAPITAL
       LETTER A") followed by one or more modifiers (like "COMBINING ACUTE
       ACCENT").  This sequence of base character and modifiers is called a
       combining character sequence.  Some non-western languages require
       more complicated models, so Unicode created the grapheme cluster
       concept, which was later further refined into the extended grapheme
       cluster.  For example, a Korean Hangul syllable is considered a
       single logical character, but most often consists of three actual
       Unicode characters: a leading consonant followed by an interior
       vowel followed by a trailing consonant.

       Whether to call these extended grapheme clusters "characters"
       depends on your point of view.  If you are a programmer, you
       probably would tend towards seeing each element in the sequences as
       one unit, or "character".  However from the user's point of view,
       the whole sequence could be seen as one "character" since that's
       probably what it looks like in the context of the user's language.
       In this document, we take the programmer's point of view: one
       "character" is one Unicode code point.

       For some combinations of base character and modifiers, there are
       precomposed characters.  There is a single character equivalent, for
       example, for the sequence "LATIN CAPITAL LETTER A" followed by
       "COMBINING ACUTE ACCENT".  It is called "LATIN CAPITAL LETTER A WITH
       ACUTE".  These precomposed characters are, however, only available
       for some combinations, and are mainly meant to support round-trip
       conversions between Unicode and legacy standards (like ISO 8859).
       Using sequences, as Unicode does, allows for needing fewer basic
       building blocks (code points) to express many more potential
       grapheme clusters.  To support conversion between equivalent forms,
       various normalization forms are also defined.  Thus, "LATIN CAPITAL
       LETTER A WITH ACUTE" is in Normalization Form Composed (abbreviated
       NFC), and the sequence "LATIN CAPITAL LETTER A" followed by
       "COMBINING ACUTE ACCENT" represents the same character in
       Normalization Form Decomposed (NFD).

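       A hedged sketch of the two forms, using the core Unicode::Normalize
       module: the decomposed sequence and the precomposed character are
       unequal code-point-wise, but normalizing maps one onto the other:

```perl
use strict;
use warnings;
use charnames ':full';
use Unicode::Normalize qw(NFC NFD);

my $nfd = "A\N{COMBINING ACUTE ACCENT}";            # two code points
my $nfc = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";  # one code point

print $nfd eq $nfc      ? "equal\n"           : "unequal as raw code points\n";
print NFC($nfd) eq $nfc ? "equal after NFC\n" : "unequal\n";
print NFD($nfc) eq $nfd ? "equal after NFD\n" : "unequal\n";
```
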
       Because of backward compatibility with legacy encodings, the "a
       unique number for every character" idea breaks down a bit: instead,
       there is "at least one number for every character".  The same
       character could be represented differently in several legacy
       encodings.  The converse is not true: some code points do not have
       an assigned character.  Firstly, there are unallocated code points
       within otherwise used blocks.  Secondly, there are special Unicode
       control characters that do not represent true characters.

       When Unicode was first conceived, it was thought that all the
       world's characters could be represented using a 16-bit word; that
       is, a maximum of 0x10000 (or 65,536) characters would be needed,
       from 0x0000 to 0xFFFF.  This soon proved to be wrong, and since
       Unicode 2.0 (July 1996), Unicode has been defined all the way up to
       21 bits (0x10FFFF), and Unicode 3.1 (March 2001) defined the first
       characters above 0xFFFF.  The first 0x10000 characters are called
       the Plane 0, or the Basic Multilingual Plane (BMP).  With Unicode
       3.1, 17 (yes, seventeen) planes in all were defined--but they are
       nowhere near full of defined characters, yet.

       When a new language is being encoded, Unicode generally will choose
       a "block" of consecutive unallocated code points for its characters.
       So far, the number of code points in these blocks has always been
       evenly divisible by 16.  Extras in a block, not currently needed,
       are left unallocated, for future growth.  But there have been
       occasions when a later release needed more code points than the
       available extras, and a new block had to be allocated somewhere
       else, not contiguous to the initial one, to handle the overflow.
       Thus, it became apparent early on that "block" wasn't an adequate
       organizing principle, and so the "Script" property was created.
       (Later an improved script property was added as well, the
       "Script_Extensions" property.)  Those code points that are in
       overflow blocks can still have the same script as the original ones.
       The script concept fits more closely with natural language: there is
       "Latin" script, "Greek" script, and so on; and there are several
       artificial scripts, like "Common" for characters that are used in
       multiple scripts, such as mathematical symbols.  Scripts usually
       span varied parts of several blocks.  For more information about
       scripts, see "Scripts" in perlunicode.  The division into blocks
       exists, but it is almost completely accidental--an artifact of how
       the characters have been and still are allocated.  (Note that this
       paragraph has oversimplified things for the sake of this being an
       introduction.  Unicode doesn't really encode languages, but the
       writing systems for them--their scripts; and one script can be used
       by many languages.  Unicode also encodes things that aren't really
       about languages, such as symbols like "BAGGAGE CLAIM".)

       The Unicode code points are just abstract numbers.  To input and
       output these abstract numbers, the numbers must be encoded or
       serialised somehow.  Unicode defines several character encoding
       forms, of which UTF-8 is the most popular.  UTF-8 is a variable
       length encoding that encodes Unicode characters as 1 to 4 bytes.
       Other encodings include UTF-16 and UTF-32 and their big- and
       little-endian variants (UTF-8 is byte-order independent).  ISO/IEC
       10646 defines the UCS-2 and UCS-4 encoding forms.

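       A hedged illustration of the variable-length property, using the
       core Encode module (the three sample code points, including EURO
       SIGN at U+20AC, are just convenient examples): one character can
       serialise to one, two, or three UTF-8 bytes:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same abstract code points, serialised as UTF-8 bytes:
for my $cp (0x41, 0x3B1, 0x20AC) {           # A, alpha, EURO SIGN
    my $bytes = encode("UTF-8", chr($cp));
    printf "U+%04X -> %d byte(s): %v02X\n", $cp, length($bytes), $bytes;
}
```
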
       For more information about encodings--for instance, to learn what
       surrogates and byte order marks (BOMs) are--see perlunicode.

   Perl's Unicode Support
       Starting from Perl v5.6.0, Perl has had the capacity to handle
       Unicode natively.  Perl v5.8.0, however, is the first recommended
       release for serious Unicode work.  The maintenance release 5.6.1
       fixed many of the problems of the initial Unicode implementation,
       but for example regular expressions still do not work with Unicode
       in 5.6.1.  Perl v5.14.0 is the first release where Unicode support
       is (almost) seamlessly integrable without some gotchas.  (There are
       a few exceptions.  Firstly, some differences in quotemeta were fixed
       starting in Perl 5.16.0.  Secondly, some differences in the range
       operator were fixed starting in Perl 5.26.0.  Thirdly, some
       differences in split were fixed starting in Perl 5.28.0.)

       To enable this seamless support, you should "use feature
       'unicode_strings'" (which is automatically selected if you "use
       5.012" or higher).  See feature.  (5.14 also fixes a number of bugs
       and departures from the Unicode standard.)

       Before Perl v5.8.0, "use utf8" was used to declare that operations
       in the current block or file would be Unicode-aware.  This model was
       found to be wrong, or at least clumsy: the "Unicodeness" is now
       carried with the data, instead of being attached to the operations.
       Starting with Perl v5.8.0, only one case remains where an explicit
       "use utf8" is needed: if your Perl script itself is encoded in
       UTF-8, you can use UTF-8 in your identifier names, and in string and
       regular expression literals, by saying "use utf8".  This is not the
       default because scripts with legacy 8-bit data in them would break.
       See utf8.

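       A minimal hedged sketch (assuming the script file itself is saved as
       UTF-8): with "use utf8" in effect, the literal below is read as four
       characters rather than five bytes:

```perl
use utf8;      # the source code of this very script is encoded in UTF-8
use strict;
use warnings;
binmode STDOUT, ":encoding(UTF-8)";  # so printing wide characters is safe

my $word = "café";              # "é" is one character under "use utf8"
print length($word), "\n";      # 4 characters, not 5 bytes
```
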
   Perl's Unicode Model
       Perl supports both pre-5.6 strings of eight-bit native bytes, and
       strings of Unicode characters.  The general principle is that Perl
       tries to keep its data as eight-bit bytes for as long as possible,
       but as soon as Unicodeness cannot be avoided, the data is
       transparently upgraded to Unicode.  Prior to Perl v5.14.0, the
       upgrade was not completely transparent (see "The "Unicode Bug"" in
       perlunicode), and for backwards compatibility, full transparency is
       not gained unless "use feature 'unicode_strings'" (see feature) or
       "use 5.012" (or higher) is selected.

       Internally, Perl currently uses either the native eight-bit
       character set of the platform (for example Latin-1) or UTF-8 to
       encode Unicode strings.  Specifically, if all code points in the
       string are 0xFF or less, Perl uses the native eight-bit character
       set.  Otherwise, it uses UTF-8.

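       A hedged peek at this internal choice (an implementation detail that
       ordinary code should not rely on), using the always-available
       "utf8::is_utf8()" introspection function:

```perl
use strict;
use warnings;

my $s = "abc";                   # all code points are 0xFF or less
print utf8::is_utf8($s) ? "UTF-8\n" : "native eight-bit\n";

$s .= chr(0x100);                # now a code point above 0xFF appears
print utf8::is_utf8($s) ? "upgraded to UTF-8\n" : "native eight-bit\n";
```
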
       A user of Perl does not normally need to know nor care how Perl
       happens to encode its internal strings, but it becomes relevant when
       outputting Unicode strings to a stream without a PerlIO layer (one
       with the "default" encoding).  In such a case, the raw bytes used
       internally (the native character set or UTF-8, as appropriate for
       each string) will be used, and a "Wide character" warning will be
       issued if those strings contain a character beyond 0x00FF.

       For example,

           perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

       produces a fairly useless mixture of native bytes and UTF-8, as well
       as a warning:

           Wide character in print at ...

       To output UTF-8, use the ":encoding" or ":utf8" output layer.
       Prepending

           binmode(STDOUT, ":utf8");

       to this sample program ensures that the output is completely UTF-8,
       and removes the program's warning.

       You can enable automatic UTF-8-ification of your standard file
       handles, default "open()" layer, and @ARGV by using either the "-C"
       command line switch or the "PERL_UNICODE" environment variable; see
       perlrun for the documentation of the "-C" switch.

       Note that this means that Perl expects other software to work the
       same way: if Perl has been led to believe that STDIN should be
       UTF-8, but then STDIN coming in from another command is not UTF-8,
       Perl will likely complain about the malformed UTF-8.

       All features that combine Unicode and I/O also require using the new
       PerlIO feature.  Almost all Perl 5.8 platforms do use PerlIO,
       though: you can see whether yours is by running "perl -V" and
       looking for "useperlio=define".

   Unicode and EBCDIC
       Perl 5.8.0 added support for Unicode on EBCDIC platforms.  This
       support was allowed to lapse in later releases, but was revived in
       5.22.  Unicode support is somewhat more complex to implement since
       additional conversions are needed.  See perlebcdic for more
       information.

       On EBCDIC platforms, the internal Unicode encoding form is
       UTF-EBCDIC instead of UTF-8.  The difference is that while UTF-8 is
       "ASCII-safe", in that ASCII characters encode to UTF-8 as-is,
       UTF-EBCDIC is "EBCDIC-safe", in that all the basic characters (which
       includes all those that have ASCII equivalents, like "A", "0", "%",
       etc.) are the same in both EBCDIC and UTF-EBCDIC.  Often,
       documentation will use the term "UTF-8" to mean UTF-EBCDIC as well.
       This is the case in this document.

   Creating Unicode
       This section applies fully to Perls starting with v5.22.  Various
       caveats for earlier releases are in the "Earlier releases caveats"
       subsection below.

       To create Unicode characters in literals, use the "\N{...}" notation
       in double-quoted strings:

           my $smiley_from_name = "\N{WHITE SMILING FACE}";
           my $smiley_from_code_point = "\N{U+263a}";

       Similarly, they can be used in regular expression literals

           $smiley =~ /\N{WHITE SMILING FACE}/;
           $smiley =~ /\N{U+263a}/;

       At run-time you can use:

           use charnames ();
           my $hebrew_alef_from_name
                       = charnames::string_vianame("HEBREW LETTER ALEF");
           my $hebrew_alef_from_code_point
                       = charnames::string_vianame("U+05D0");

       Naturally, "ord()" will do the reverse: it turns a character into a
       code point.

       There are other runtime options as well.  You can use "pack()":

           my $hebrew_alef_from_code_point = pack("U", 0x05d0);

       Or you can use "chr()", though it is less convenient in the general
       case:

           $hebrew_alef_from_code_point
                       = chr(utf8::unicode_to_native(0x05d0));
           utf8::upgrade($hebrew_alef_from_code_point);

       The "utf8::unicode_to_native()" and "utf8::upgrade()" aren't needed
       if the argument is above 0xFF, so the above could have been written
       as

           $hebrew_alef_from_code_point = chr(0x05d0);

       since 0x5d0 is above 255.

       "\x{}" and "\o{}" can also be used to specify code points at compile
       time in double-quotish strings, but, for backward compatibility with
       older Perls, the same rules apply as with "chr()" for code points
       less than 256.

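       A hedged sketch of how these constructs line up for a code point
       above 0xFF (where none of the portability caveats apply):

```perl
use strict;
use warnings;

# All three designate the same character, WHITE SMILING FACE (U+263A).
my $x = "\x{263A}";
my $y = "\N{U+263A}";
my $z = chr(0x263A);

print(($x eq $y && $y eq $z) ? "all three agree\n" : "mismatch\n");
```
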
       "utf8::unicode_to_native()" is used so that the Perl code is
       portable to EBCDIC platforms.  You can omit it if you're really sure
       no one will ever want to use your code on a non-ASCII platform.
       Starting in Perl v5.22, calls to it on ASCII platforms are optimized
       out, so there's no performance penalty at all in adding it.  Or you
       can simply use the other constructs that don't require it.

       See "Further Resources" for how to find all these names and numeric
       codes.

       Earlier releases caveats

       On EBCDIC platforms, prior to v5.22, using "\N{U+...}" doesn't work
       properly.

       Prior to v5.16, using "\N{...}" with a character name (as opposed to
       a "U+..." code point) required a "use charnames :full".

       Prior to v5.14, there were some bugs in "\N{...}" with a character
       name (as opposed to a "U+..." code point).

       "charnames::string_vianame()" was introduced in v5.14.  Prior to
       that, "charnames::vianame()" should work, but only if the argument
       is of the form "U+...".  Your best bet there for runtime Unicode by
       character name is probably:

           use charnames ();
           my $hebrew_alef_from_name
                   = pack("U", charnames::vianame("HEBREW LETTER ALEF"));

   Handling Unicode
       Handling Unicode is for the most part transparent: just use the
       strings as usual.  Functions like "index()", "length()", and
       "substr()" will work on the Unicode characters; regular expressions
       will work on the Unicode characters (see perlunicode and perlretut).

       Note that Perl considers grapheme clusters to be separate
       characters, so for example

           print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"),
                 "\n";

       will print 2, not 1.  The only exception is that regular expressions
       have "\X" for matching an extended grapheme cluster.  (Thus "\X" in
       a regular expression would match the entire sequence of both the
       example characters.)

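       A hedged sketch of the contrast between per-code-point "length()"
       and the "\X" grapheme-cluster match:

```perl
use strict;
use warnings;

my $s = "A\x{301}";          # base letter plus COMBINING ACUTE ACCENT
print length($s), " code points\n";              # 2 code points

my @clusters = $s =~ /(\X)/g;  # \X matches one extended grapheme cluster
print scalar(@clusters), " grapheme cluster\n";  # 1 grapheme cluster
```
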
       Life is not quite so transparent, however, when working with legacy
       encodings, I/O, and certain special cases:

   Legacy Encodings
       When you combine legacy data and Unicode, the legacy data needs to
       be upgraded to Unicode.  Normally the legacy data is assumed to be
       ISO 8859-1 (or EBCDIC, if applicable).

       The "Encode" module knows about many encodings and has interfaces
       for doing conversions between those encodings:

           use Encode 'decode';
           $data = decode("iso-8859-3", $data); # convert from legacy

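       A hedged round-trip sketch: "decode()" turns encoding-specific bytes
       into Perl characters, and "encode()" goes the other way (ISO 8859-1
       is used here because its byte values coincide with the first 256
       code points):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "caf\xE9";                       # 0xE9 is "é" in ISO 8859-1
my $chars = decode("iso-8859-1", $bytes);    # now Unicode characters
printf "U+%04X\n", ord(substr($chars, -1));  # U+00E9

my $again = encode("iso-8859-1", $chars);    # back to the legacy bytes
print $again eq $bytes ? "round trip ok\n" : "mismatch\n";
```
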
   Unicode I/O
       Normally, writing out Unicode data

           print FH $some_string_with_unicode, "\n";

       produces raw bytes that Perl happens to use to internally encode the
       Unicode string.  Perl's internal encoding depends on the system as
       well as what characters happen to be in the string at the time.  If
       any of the characters are at code points 0x100 or above, you will
       get a warning.  To ensure that the output is explicitly rendered in
       the encoding you desire--and to avoid the warning--open the stream
       with the desired encoding.  Some examples:

           open FH, ">:utf8", "file";

           open FH, ">:encoding(ucs2)",      "file";
           open FH, ">:encoding(UTF-8)",     "file";
           open FH, ">:encoding(shift_jis)", "file";

       and on already open streams, use "binmode()":

           binmode(STDOUT, ":utf8");

           binmode(STDOUT, ":encoding(ucs2)");
           binmode(STDOUT, ":encoding(UTF-8)");
           binmode(STDOUT, ":encoding(shift_jis)");

       The matching of encoding names is loose: case does not matter, and
       many encodings have several aliases.  Note that the ":utf8" layer
       must always be specified exactly like that; it is not subject to the
       loose matching of encoding names.  Also note that currently ":utf8"
       is unsafe for input, because it accepts the data without validating
       that it is indeed valid UTF-8; you should instead use
       ":encoding(UTF-8)" (with or without a hyphen).

       See PerlIO for the ":utf8" layer, PerlIO::encoding and
       Encode::PerlIO for the ":encoding()" layer, and Encode::Supported
       for many encodings supported by the "Encode" module.

       Reading in a file that you know happens to be encoded in one of the
       Unicode or legacy encodings does not magically turn the data into
       Unicode in Perl's eyes.  To do that, specify the appropriate layer
       when opening files

           open(my $fh, '<:encoding(UTF-8)', 'anything');
           my $line_of_unicode = <$fh>;

           open(my $fh, '<:encoding(Big5)', 'anything');
           my $line_of_unicode = <$fh>;

       The I/O layers can also be specified more flexibly with the "open"
       pragma.  See open, or look at the following example.

           use open ':encoding(UTF-8)'; # input/output default encoding
                                        # will be UTF-8
           open X, ">file";
           print X chr(0x100), "\n";
           close X;
           open Y, "<file";
           printf "%#x\n", ord(<Y>); # this should print 0x100
           close Y;

       With the "open" pragma you can use the ":locale" layer

           BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
           # the :locale will probe the locale environment variables like
           # LC_ALL
           use open OUT => ':locale'; # russki parusski
           open(O, ">koi8");
           print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
           close O;
           open(I, "<koi8");
           printf "%#x\n", ord(<I>); # this should print 0xc1
           close I;

       These methods install a transparent filter on the I/O stream that
       converts data from the specified encoding when it is read in from
       the stream.  The result is always Unicode.

       The open pragma affects all the "open()" calls after the pragma by
       setting default layers.  If you want to affect only certain streams,
       use explicit layers directly in the "open()" call.

       You can switch encodings on an already opened stream by using
       "binmode()"; see "binmode" in perlfunc.

       The ":locale" does not currently work with "open()" and "binmode()",
       only with the "open" pragma.  The ":utf8" and ":encoding(...)"
       methods do work with all of "open()", "binmode()", and the "open"
       pragma.

       Similarly, you may use these I/O layers on output streams to
       automatically convert Unicode to the specified encoding when it is
       written to the stream.  For example, the following snippet copies
       the contents of the file "text.jis" (encoded as ISO-2022-JP, aka
       JIS) to the file "text.utf8", encoded as UTF-8:

           open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
           open(my $unicode, '>:utf8',                  'text.utf8');
           while (<$nihongo>) { print $unicode $_ }

       The naming of encodings, both by "open()" and by the "open" pragma,
       allows for flexible names: "koi8-r" and "KOI8R" will both be
       understood.

       Common encodings recognized by ISO, MIME, IANA, and various other
       standardisation organisations are recognised; for a more detailed
       list see Encode::Supported.

       "read()" reads characters and returns the number of characters.
       "seek()" and "tell()" operate on byte counts, as do "sysread()" and
       "sysseek()".

       Notice that because of the default behaviour of not doing any
       conversion upon input if there is no default layer, it is easy to
       mistakenly write code that keeps on expanding a file by repeatedly
       encoding the data:

           # BAD CODE WARNING
           open F, "file";
           local $/; ## read in the whole file of 8-bit characters
           $t = <F>;
           close F;
           open F, ">:encoding(UTF-8)", "file";
           print F $t; ## convert to UTF-8 on output
           close F;

       If you run this code twice, the contents of the file will be twice
       UTF-8 encoded.  A "use open ':encoding(UTF-8)'" would have avoided
       the bug, as would explicitly opening the input file as UTF-8.

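       A hedged corrected sketch of the same cycle (it creates its own
       sample file; the filename "file" is just illustrative): declaring
       the input encoding makes the read/rewrite cycle idempotent, so
       running it repeatedly no longer double-encodes:

```perl
use strict;
use warnings;

# Create a sample UTF-8 file containing one non-ASCII character.
open my $w, ">:encoding(UTF-8)", "file" or die $!;
print $w "caf\x{E9}\n";
close $w;

for (1 .. 2) {   # the cycle is now safe to run any number of times
    open my $in, "<:encoding(UTF-8)", "file" or die $!;  # decode on input
    my $t = do { local $/; <$in> };
    close $in;
    open my $out, ">:encoding(UTF-8)", "file" or die $!; # encode on output
    print $out $t;
    close $out;
}

open my $r, "<:encoding(UTF-8)", "file" or die $!;
my $final = <$r>;
close $r;
unlink "file";
print $final eq "caf\x{E9}\n" ? "still encoded once\n" : "double-encoded!\n";
```
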
       NOTE: the ":utf8" and ":encoding" features work only if your Perl
       has been built with PerlIO, which is the default on most systems.

   Displaying Unicode As Text
       Sometimes you might want to display Perl scalars containing Unicode
       as simple ASCII (or EBCDIC) text.  The following subroutine converts
       its argument so that Unicode characters with code points greater
       than 255 are displayed as "\x{...}", control characters (like "\n")
       are displayed as "\x..", and the rest of the characters as
       themselves:

           sub nice_string {
               join("",
                 map { $_ > 255                    # if wide character...
                       ? sprintf("\\x{%04X}", $_)  # \x{...}
                       : chr($_) =~ /[[:cntrl:]]/  # else if control character...
                         ? sprintf("\\x%02X", $_)  # \x..
                         : quotemeta(chr($_))      # else quoted or as themselves
                 } unpack("W*", $_[0]));           # unpack Unicode characters
           }

       For example,

           nice_string("foo\x{100}bar\n")

       returns the string

           'foo\x{0100}bar\x0A'

       which is ready to be printed.

       ("\\x{}" is used here instead of "\\N{}", since it's most likely
       that you want to see what the native values are.)

   Special Cases
       ·   Bit Complement Operator ~ And vec()

           The bit complement operator "~" may produce surprising results
           if used on strings containing characters with ordinal values
           above 255.  In such a case, the results are consistent with the
           internal encoding of the characters, but not with much else.  So
           don't do that.  Similarly for "vec()": you will be operating on
           the internally-encoded bit patterns of the Unicode characters,
           not on the code point values, which is very probably not what
           you want.

       ·   Peeking At Perl's Internal Encoding

           Normal users of Perl should never care how Perl encodes any
           particular Unicode string (because the normal ways to get at the
           contents of a string with Unicode--via input and output--should
           always be via explicitly-defined I/O layers).  But if you must,
           there are two ways of looking behind the scenes.

           One way of peeking inside the internal encoding of Unicode
           characters is to use "unpack("C*", ...)" to get the bytes of
           whatever the string encoding happens to be, or "unpack("U0..",
           ...)" to get the bytes of the UTF-8 encoding:

               # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
               print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";

           Yet another way would be to use the Devel::Peek module:

               perl -MDevel::Peek -e 'Dump(chr(0x100))'

           That shows the "UTF8" flag in FLAGS and both the UTF-8 bytes and
           Unicode characters in "PV".  See also later in this document the
           discussion about the "utf8::is_utf8()" function.

   Advanced Topics
       ·   String Equivalence

           The question of string equivalence turns somewhat complicated in
           Unicode: what do you mean by "equal"?

           (Is "LATIN CAPITAL LETTER A WITH ACUTE" equal to "LATIN CAPITAL
           LETTER A"?)

           The short answer is that by default Perl compares equivalence
           ("eq", "ne") based only on code points of the characters.  In
           the above case, the answer is no (because 0x00C1 != 0x0041).
           But sometimes, any CAPITAL LETTER A's should be considered
           equal, or even A's of any case.

           The long answer is that you need to consider character
           normalization and casing issues: see Unicode::Normalize, Unicode
           Technical Report #15, Unicode Normalization Forms
           <http://www.unicode.org/unicode/reports/tr15> and sections on
           case mapping in the Unicode Standard <http://www.unicode.org>.

           As of Perl 5.8.0, the "Full" case-folding of Case
           Mappings/SpecialCasing is implemented, but bugs remain in
           "qr//i" with them, mostly fixed by 5.14, and essentially
           entirely by 5.18.

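           A hedged sketch of full case folding with "fc()" (available via
           "use feature 'fc'" from v5.16): LATIN SMALL LETTER SHARP S folds
           to "ss", so a caseless comparison succeeds where a code-point
           "eq" fails:

```perl
use strict;
use warnings;
use feature 'fc';        # fc() needs Perl v5.16 or later

my $sharp_s = "\x{DF}";  # LATIN SMALL LETTER SHARP S
print $sharp_s eq "ss"         ? "eq: same\n" : "eq: different\n";
print fc($sharp_s) eq fc("SS") ? "fc: same\n" : "fc: different\n";
```
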
       ·   String Collation

           People like to see their strings nicely sorted--or as Unicode
           parlance goes, collated.  But again, what do you mean by
           collate?

           (Does "LATIN CAPITAL LETTER A WITH ACUTE" come before or after
           "LATIN CAPITAL LETTER A WITH GRAVE"?)

           The short answer is that by default, Perl compares strings
           ("lt", "le", "cmp", "ge", "gt") based only on the code points of
           the characters.  In the above case, the answer is "after", since
           0x00C1 > 0x00C0.

           The long answer is that "it depends", and a good answer cannot
           be given without knowing (at the very least) the language
           context.  See Unicode::Collate, and Unicode Collation Algorithm
           <http://www.unicode.org/unicode/reports/tr10/>.

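           A hedged sketch with the core Unicode::Collate module, which
           implements the Unicode Collation Algorithm; under the default
           collation table, letter identity outranks case, unlike a raw
           code-point sort:

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = qw(cherry Banana apple);

my @codepoint = sort @words;                      # "Banana" first: B < a
my @collated  = Unicode::Collate->new->sort(@words);

print "code points: @codepoint\n";   # Banana apple cherry
print "collated:    @collated\n";    # apple Banana cherry
```
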
   Miscellaneous
       ·   Character Ranges and Classes

           Character ranges in regular expression bracketed character
           classes (e.g., "/[a-z]/") and in the "tr///" (also known as
           "y///") operator are not magically Unicode-aware.  What this
           means is that "[A-Za-z]" will not magically start to mean "all
           alphabetic letters" (not that it does mean that even for 8-bit
           characters; for those, if you are using locales (perllocale),
           use "/[[:alpha:]]/"; and if not, use the 8-bit-aware property
           "\p{alpha}").

           All the properties that begin with "\p" (and its inverse "\P")
           are actually character classes that are Unicode-aware.  There
           are dozens of them, see perluniprops.

           Starting in v5.22, you can use Unicode code points as the end
           points of regular expression pattern character ranges, and the
           range will include all Unicode code points that lie between
           those end points, inclusive.

               qr/ [ \N{U+03} - \N{U+20} ] /xx

           includes the code points "\N{U+03}", "\N{U+04}", ...,
           "\N{U+20}".

           This also works for ranges in "tr///" starting in Perl v5.24.

       ·   String-To-Number Conversions

           Unicode does define several other decimal--and numeric--
           characters besides the familiar 0 to 9, such as the Arabic and
           Indic digits.  Perl does not support string-to-number conversion
           for digits other than ASCII 0 to 9 (and ASCII "a" to "f" for
           hexadecimal).  To get safe conversions from any Unicode string,
           use "num()" in Unicode::UCD.

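           A hedged sketch of "num()" (available in Unicode::UCD from
           v5.14): it converts a string of same-script digits, and returns
           undef for a mixed-script digit string rather than guessing:

```perl
use strict;
use warnings;
use Unicode::UCD 'num';

my $ascii  = "123";
my $arabic = "\x{661}\x{662}\x{663}";  # ARABIC-INDIC DIGITS ONE TWO THREE

print num($ascii), "\n";               # 123
print num($arabic), "\n";              # 123
print defined num("1\x{661}") ? "converted\n" : "mixed scripts: undef\n";
```
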
   Questions With Answers
       ·   Will My Old Scripts Break?

           Very probably not.  Unless you are generating Unicode characters
           somehow, old behaviour should be preserved.  About the only
           behaviour that has changed and which could start generating
           Unicode is the old behaviour of "chr()" where supplying an
           argument more than 255 produced a character modulo 255.
           "chr(300)", for example, was equal to "chr(45)" or "-" (in
           ASCII), now it is LATIN CAPITAL LETTER I WITH BREVE.

       ·   How Do I Make My Scripts Work With Unicode?

           Very little work should be needed since nothing changes until
           you generate Unicode data.  The most important thing is getting
           input as Unicode; for that, see the earlier I/O discussion.  To
           get full seamless Unicode support, add "use feature
           'unicode_strings'" (or "use 5.012" or higher) to your script.

       ·   How Do I Know Whether My String Is In Unicode?

           You shouldn't have to care.  But you may if your Perl is before
           5.14.0 or you haven't specified "use feature 'unicode_strings'"
           or "use 5.012" (or higher) because otherwise the rules for the
           code points in the range 128 to 255 are different depending on
           whether the string they are contained within is in Unicode or
           not.  (See "When Unicode Does Not Happen" in perlunicode.)

           To determine if a string is in Unicode, use:

               print utf8::is_utf8($string) ? 1 : 0, "\n";

           But note that this doesn't mean that any of the characters in
           the string are necessarily UTF-8 encoded, or that any of the
           characters have code points greater than 0xFF (255) or even 0x80
           (128), or that the string has any characters at all.  All the
           "is_utf8()" does is to return the value of the internal
           "utf8ness" flag attached to the $string.  If the flag is off,
           the bytes in the scalar are interpreted as a single byte
           encoding.  If the flag is on, the bytes in the scalar are
           interpreted as the (variable-length, potentially multi-byte)
           UTF-8 encoded code points of the characters.  Bytes added to a
           UTF-8 encoded string are automatically upgraded to UTF-8.  If
           mixed non-UTF-8 and UTF-8 scalars are merged (double-quoted
           interpolation, explicit concatenation, or printf/sprintf
           parameter substitution), the result will be UTF-8 encoded as if
           copies of the byte strings were upgraded to UTF-8: for example,

               $a = "ab\x80c";
               $b = "\x{100}";
               print "$a = $b\n";

           the output string will be UTF-8-encoded "ab\x80c = \x{100}\n",
           but $a will stay byte-encoded.

           Sometimes you might really need to know the byte length of a
           string instead of the character length.  For that use the
           "bytes" pragma and the "length()" function:

               my $unicode = chr(0x100);
               print length($unicode), "\n"; # will print 1
               use bytes;
               print length($unicode), "\n"; # will print 2
                                             # (the 0xC4 0x80 of the UTF-8)
               no bytes;

699 · How Do I Find Out What Encoding a File Has?
700
701 You might try Encode::Guess, but it has a number of limitations.
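
           For example (a sketch; note that Encode::Guess only ever considers
           the short list of "suspect" encodings you hand it, plus ASCII,
           UTF-8, and BOM-marked UTF-16/32, and it can fail with an
           ambiguity error when several suspects match):

```perl
use strict;
use warnings;
use Encode::Guess;

my $octets  = "\xE3\x81\x82";   # the three UTF-8 bytes of HIRAGANA LETTER A
my $decoder = guess_encoding($octets, qw/euc-jp shiftjis/);

# guess_encoding() returns an encoding object on success,
# or a plain error string on failure:
ref($decoder) or die "Can't guess encoding: $decoder";

print $decoder->name, "\n";
my $text = $decoder->decode($octets);
```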

       ·   How Do I Detect Data That's Not Valid In a Particular Encoding?

           Use the "Encode" package to try converting it. For example,

               use Encode 'decode';

               if (eval { decode('UTF-8', $string, Encode::FB_CROAK); 1 }) {
                   # $string is valid UTF-8
               } else {
                   # $string is not valid UTF-8
               }

           Or use "unpack" to try decoding it:

               use warnings;
               @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);

           If invalid, a "Malformed UTF-8 character" warning is produced. The
           "C0" means "process the string character by character". Without
           that, "unpack("U*", ...)" would work in "U0" mode (the default
           if the format string starts with "U") and it would return the bytes
           making up the UTF-8 encoding of the target string, something that
           will always work.

       ·   How Do I Convert Binary Data Into a Particular Encoding, Or Vice
           Versa?

           This probably isn't as useful as you might think. Normally, you
           shouldn't need to.

           In one sense, what you are asking doesn't make much sense:
           encodings are for characters, and binary data are not "characters",
           so converting "data" into some encoding isn't meaningful unless you
           know the character set and encoding of the binary data, in
           which case it's not just binary data, now is it?

           If you have a raw sequence of bytes that you know should be
           interpreted via a particular encoding, you can use "Encode":

               use Encode 'from_to';
               from_to($data, "iso-8859-1", "UTF-8"); # from latin-1 to UTF-8

           The call to "from_to()" changes the bytes in $data, but nothing
           material about the nature of the string has changed as far as Perl
           is concerned. Both before and after the call, the string $data
           contains just a bunch of 8-bit bytes. As far as Perl is concerned,
           the encoding of the string remains "system-native 8-bit bytes".

           You might relate this to a fictional 'Translate' module:

               use Translate;
               my $phrase = "Yes";
               Translate::from_to($phrase, 'english', 'deutsch');
               ## phrase now contains "Ja"

           The contents of the string change, but not the nature of the
           string. Perl doesn't know any more after the call than before that
           the contents of the string indicate the affirmative.

           Back to converting data. If you have (or want) data in your
           system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you
           can use pack/unpack to convert to/from Unicode.

               $native_string  = pack("W*", unpack("U*", $Unicode_string));
               $Unicode_string = pack("U*", unpack("W*", $native_string));

           If you have a sequence of bytes you know is valid UTF-8, but Perl
           doesn't know it yet, you can make Perl a believer, too:

               $Unicode = $bytes;
               utf8::decode($Unicode);

           or:

               $Unicode = pack("U0a*", $bytes);

           You can find the bytes that make up a UTF-8 sequence with

               @bytes = unpack("C*", $Unicode_string)

           and you can create well-formed Unicode with

               $Unicode_string = pack("U*", 0xff, ...)

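           Putting the "believer" step above into a small, self-contained
           sketch, together with its inverse "utf8::encode()" (the variable
           names are just for the example):

```perl
use strict;
use warnings;

my $bytes = "\xC4\x80";        # the two UTF-8 bytes encoding U+0100
my $str   = $bytes;
utf8::decode($str)             # in place; returns false on malformed input
    or die "not well-formed UTF-8";
printf "U+%04X, length %d\n", ord($str), length($str);   # U+0100, length 1

my $copy = "\x{100}";
utf8::encode($copy);           # back to the raw octets, in place
printf "%vX\n", $copy;         # C4.80
```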
       ·   How Do I Display Unicode? How Do I Input Unicode?

           See <http://www.alanwood.net/unicode/> and
           <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

       ·   How Does Unicode Work With Traditional Locales?

           If your locale is a UTF-8 locale, starting in Perl v5.26, Perl
           works well for all categories; before this, starting with Perl
           v5.20, it works for all categories but "LC_COLLATE", which deals
           with sorting and the "cmp" operator. But note that the standard
           "Unicode::Collate" and "Unicode::Collate::Locale" modules offer
           much more powerful solutions to collation issues, and work on
           earlier releases.
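
           A minimal "Unicode::Collate" sketch, contrasting its order with
           plain code-point "sort" (the two-word list is made up):

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new();

# Plain sort compares code points, so "B" (0x42) sorts before "a" (0x61).
# The Unicode Collation Algorithm compares letters primarily by their
# alphabetic weight, with case only as a tiebreaker:
my @codepoint_order = sort "a", "B";               # ("B", "a")
my @uca_order       = $collator->sort("B", "a");   # ("a", "B")
print "@uca_order\n";
```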

           For other locales, starting in Perl 5.16, you can specify

               use locale ':not_characters';

           to get Perl to work well with them. The catch is that you have to
           translate from the locale character set to/from Unicode yourself.
           See "Unicode I/O" above for how to

               use open ':locale';

           to accomplish this, but full details are in "Unicode and UTF-8" in
           perllocale, including gotchas that happen if you don't specify
           ":not_characters".

   Hexadecimal Notation
       The Unicode standard prefers using hexadecimal notation because that
       more clearly shows the division of Unicode into blocks of 256
       characters. Hexadecimal is also simply shorter than decimal. You can
       use decimal notation, too, but learning to use hexadecimal just makes
       life easier with the Unicode standard. The "U+HHHH" notation uses
       hexadecimal, for example.

       The "0x" prefix marks a hexadecimal number; the digits are 0-9 and a-f
       (or A-F, case doesn't matter). Each hexadecimal digit represents four
       bits, or half a byte. "print 0x..., "\n"" will show a hexadecimal
       number in decimal, and "printf "%x\n", $decimal" will show a decimal
       number in hexadecimal. If you have just the "hex digits" of a
       hexadecimal number, you can use the "hex()" function.

           print 0x0009, "\n";    # 9
           print 0x000a, "\n";    # 10
           print 0x000f, "\n";    # 15
           print 0x0010, "\n";    # 16
           print 0x0011, "\n";    # 17
           print 0x0100, "\n";    # 256

           print 0x0041, "\n";    # 65

           printf "%x\n",  65;    # 41
           printf "%#x\n", 65;    # 0x41

           print hex("41"), "\n"; # 65

   Further Resources
       ·   Unicode Consortium

           <http://www.unicode.org/>

       ·   Unicode FAQ

           <http://www.unicode.org/unicode/faq/>

       ·   Unicode Glossary

           <http://www.unicode.org/glossary/>

       ·   Unicode Recommended Reading List

           The Unicode Consortium has a list of articles and books, some of
           which give a much more in-depth treatment of Unicode:
           <http://unicode.org/resources/readinglist.html>

       ·   Unicode Useful Resources

           <http://www.unicode.org/unicode/onlinedat/resources.html>

       ·   Unicode and Multilingual Support in HTML, Fonts, Web Browsers and
           Other Applications

           <http://www.alanwood.net/unicode/>

       ·   UTF-8 and Unicode FAQ for Unix/Linux

           <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

       ·   Legacy Character Sets

           <http://www.czyborra.com/> <http://www.eki.ee/letter/>

       ·   You can explore various information from the Unicode data files
           using the "Unicode::UCD" module.
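
           For example, using its "charinfo()" interface:

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# charinfo() takes a code point and returns a hash reference of
# properties from the Unicode Character Database:
my $info = charinfo(0x41);
print $info->{name},   "\n";   # LATIN CAPITAL LETTER A
print $info->{block},  "\n";   # Basic Latin
print $info->{script}, "\n";   # Latin
```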

UNICODE IN OLDER PERLS
       If you cannot upgrade your Perl to 5.8.0 or later, you can still do
       some Unicode processing by using the modules "Unicode::String",
       "Unicode::Map8", and "Unicode::Map", available from CPAN. If you have
       GNU recode installed, you can also use the Perl front-end
       "Convert::Recode" for character conversions.

       The following are fast conversions from ISO 8859-1 (Latin-1) bytes to
       UTF-8 bytes and back; the code works even with older Perl 5 versions.

           # ISO 8859-1 to UTF-8
           s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

           # UTF-8 to ISO 8859-1
           s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

SEE ALSO
       perlunitut, perlunicode, Encode, open, utf8, bytes, perlretut, perlrun,
       Unicode::Collate, Unicode::Normalize, Unicode::UCD

ACKNOWLEDGMENTS
       Thanks to the kind readers of the perl5-porters@perl.org,
       perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
       mailing lists for their valuable feedback.

AUTHOR, COPYRIGHT, AND LICENSE
       Copyright 2001-2011 Jarkko Hietaniemi <jhi@iki.fi>. Now maintained by
       Perl 5 Porters.

       This document may be distributed under the same terms as Perl itself.


perl v5.26.3                      2019-05-11                  PERLUNIINTRO(1)