PERLUNIINTRO(1)     Perl Programmers Reference Guide     PERLUNIINTRO(1)

perluniintro - Perl Unicode introduction
7
9 This document gives a general idea of Unicode and how to use Unicode in
10 Perl. See "Further Resources" for references to more in-depth
11 treatments of Unicode.
12
13 Unicode
14 Unicode is a character set standard which plans to codify all of the
15 writing systems of the world, plus many other symbols.
16
17 Unicode and ISO/IEC 10646 are coordinated standards that unify almost
18 all other modern character set standards, covering more than 80 writing
19 systems and hundreds of languages, including all commercially-important
20 modern languages. All characters in the largest Chinese, Japanese, and
21 Korean dictionaries are also encoded. The standards will eventually
22 cover almost all characters in more than 250 writing systems and
23 thousands of languages. Unicode 1.0 was released in October 1991, and
24 6.0 in October 2010.
25
26 A Unicode character is an abstract entity. It is not bound to any
27 particular integer width, especially not to the C language "char".
28 Unicode is language-neutral and display-neutral: it does not encode the
29 language of the text, and it does not generally define fonts or other
30 graphical layout details. Unicode operates on characters and on text
31 built from those characters.
32
33 Unicode defines characters like "LATIN CAPITAL LETTER A" or "GREEK
34 SMALL LETTER ALPHA" and unique numbers for the characters, in this case
35 0x0041 and 0x03B1, respectively. These unique numbers are called code
36 points. A code point is essentially the position of the character
37 within the set of all possible Unicode characters, and thus in Perl,
38 the term ordinal is often used interchangeably with it.
39
40 The Unicode standard prefers using hexadecimal notation for the code
41 points. If numbers like 0x0041 are unfamiliar to you, take a peek at a
42 later section, "Hexadecimal Notation". The Unicode standard uses the
43 notation "U+0041 LATIN CAPITAL LETTER A", to give the hexadecimal code
44 point and the normative name of the character.
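
As a small illustration of this notation, the following sketch formats
a character in the "U+XXXX NAME" style. The helper name
"describe_char" is made up for this example, and the expected values
assume an ASCII (non-EBCDIC) platform and Perl v5.10 or later.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: format one character as "U+XXXX NAME".
sub describe_char {
    my ($char) = @_;
    my $code_point = ord($char);                     # character -> code point
    my $name = charnames::viacode($code_point) // '<unnamed>';
    return sprintf("U+%04X %s", $code_point, $name);
}

my $latin_a     = describe_char("A");
my $greek_alpha = describe_char("\N{U+03B1}");

print "$latin_a\n";       # U+0041 LATIN CAPITAL LETTER A
print "$greek_alpha\n";   # U+03B1 GREEK SMALL LETTER ALPHA
```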
45
46 Unicode also defines various properties for the characters, like
47 "uppercase" or "lowercase", "decimal digit", or "punctuation"; these
48 properties are independent of the names of the characters.
49 Furthermore, various operations on the characters like uppercasing,
50 lowercasing, and collating (sorting) are defined.
51
52 A Unicode logical "character" can actually consist of more than one
53 internal actual "character" or code point. For Western languages, this
54 is adequately modelled by a base character (like "LATIN CAPITAL LETTER
55 A") followed by one or more modifiers (like "COMBINING ACUTE ACCENT").
56 This sequence of base character and modifiers is called a combining
57 character sequence. Some non-western languages require more
58 complicated models, so Unicode created the grapheme cluster concept,
59 which was later further refined into the extended grapheme cluster.
60 For example, a Korean Hangul syllable is considered a single logical
61 character, but most often consists of three actual Unicode characters:
62 a leading consonant followed by an interior vowel followed by a
63 trailing consonant.
64
65 Whether to call these extended grapheme clusters "characters" depends
66 on your point of view. If you are a programmer, you probably would tend
67 towards seeing each element in the sequences as one unit, or
"character". However, from the user's point of view, the whole sequence
69 could be seen as one "character" since that's probably what it looks
70 like in the context of the user's language. In this document, we take
71 the programmer's point of view: one "character" is one Unicode code
72 point.
73
74 For some combinations of base character and modifiers, there are
75 precomposed characters. There is a single character equivalent, for
76 example, for the sequence "LATIN CAPITAL LETTER A" followed by
77 "COMBINING ACUTE ACCENT". It is called "LATIN CAPITAL LETTER A WITH
78 ACUTE". These precomposed characters are, however, only available for
79 some combinations, and are mainly meant to support round-trip
80 conversions between Unicode and legacy standards (like ISO 8859).
81 Using sequences, as Unicode does, allows for needing fewer basic
82 building blocks (code points) to express many more potential grapheme
83 clusters. To support conversion between equivalent forms, various
84 normalization forms are also defined. Thus, "LATIN CAPITAL LETTER A
WITH ACUTE" is in Normalization Form Composed (abbreviated NFC), and
86 the sequence "LATIN CAPITAL LETTER A" followed by "COMBINING ACUTE
87 ACCENT" represents the same character in Normalization Form Decomposed
88 (NFD).
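
The two forms can be converted into each other with the core
Unicode::Normalize module. The following is only a minimal sketch; the
variable names are chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\N{U+00C1}";            # LATIN CAPITAL LETTER A WITH ACUTE
my $decomposed  = "\N{U+0041}\N{U+0301}";  # A + COMBINING ACUTE ACCENT

my $nfd_length     = length(NFD($precomposed));   # one code point becomes two
my $nfc_length     = length(NFC($decomposed));    # two code points become one
my $same_after_nfc = NFC($decomposed) eq $precomposed ? 1 : 0;

print "NFD length: $nfd_length, NFC length: $nfc_length, ",
      "equal after NFC: $same_after_nfc\n";
```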
89
90 Because of backward compatibility with legacy encodings, the "a unique
91 number for every character" idea breaks down a bit: instead, there is
92 "at least one number for every character". The same character could be
93 represented differently in several legacy encodings. The converse is
94 not true: some code points do not have an assigned character. Firstly,
95 there are unallocated code points within otherwise used blocks.
96 Secondly, there are special Unicode control characters that do not
97 represent true characters.
98
99 When Unicode was first conceived, it was thought that all the world's
100 characters could be represented using a 16-bit word; that is a maximum
101 of 0x10000 (or 65,536) characters would be needed, from 0x0000 to
102 0xFFFF. This soon proved to be wrong, and since Unicode 2.0 (July
103 1996), Unicode has been defined all the way up to 21 bits (0x10FFFF),
104 and Unicode 3.1 (March 2001) defined the first characters above 0xFFFF.
105 The first 0x10000 characters are called the Plane 0, or the Basic
106 Multilingual Plane (BMP). With Unicode 3.1, 17 (yes, seventeen) planes
107 in all were defined--but they are nowhere near full of defined
108 characters, yet.
109
110 When a new language is being encoded, Unicode generally will choose a
111 "block" of consecutive unallocated code points for its characters. So
112 far, the number of code points in these blocks has always been evenly
113 divisible by 16. Extras in a block, not currently needed, are left
114 unallocated, for future growth. But there have been occasions when a
115 later release needed more code points than the available extras, and a
new block had to be allocated somewhere else, not contiguous to the
117 initial one, to handle the overflow. Thus, it became apparent early on
118 that "block" wasn't an adequate organizing principle, and so the
119 "Script" property was created. (Later an improved script property was
120 added as well, the "Script_Extensions" property.) Those code points
121 that are in overflow blocks can still have the same script as the
122 original ones. The script concept fits more closely with natural
123 language: there is "Latin" script, "Greek" script, and so on; and there
124 are several artificial scripts, like "Common" for characters that are
125 used in multiple scripts, such as mathematical symbols. Scripts
126 usually span varied parts of several blocks. For more information
127 about scripts, see "Scripts" in perlunicode. The division into blocks
128 exists, but it is almost completely accidental--an artifact of how the
129 characters have been and still are allocated. (Note that this
130 paragraph has oversimplified things for the sake of this being an
131 introduction. Unicode doesn't really encode languages, but the writing
132 systems for them--their scripts; and one script can be used by many
133 languages. Unicode also encodes things that aren't really about
134 languages, such as symbols like "BAGGAGE CLAIM".)
135
136 The Unicode code points are just abstract numbers. To input and output
137 these abstract numbers, the numbers must be encoded or serialised
138 somehow. Unicode defines several character encoding forms, of which
139 UTF-8 is the most popular. UTF-8 is a variable length encoding that
140 encodes Unicode characters as 1 to 4 bytes. Other encodings include
141 UTF-16 and UTF-32 and their big- and little-endian variants (UTF-8 is
142 byte-order independent). The ISO/IEC 10646 defines the UCS-2 and UCS-4
143 encoding forms.
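
To make the "1 to 4 bytes" claim concrete, this sketch encodes four
sample code points with the core Encode module and records how many
bytes each one needs. The choice of sample code points is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my %utf8_length;   # code point => number of UTF-8 bytes
for my $code_point (0x41, 0xE9, 0x20AC, 0x1F600) {
    my $bytes = encode('UTF-8', chr($code_point));
    $utf8_length{$code_point} = length($bytes);
    printf "U+%04X -> %d byte(s): %s\n",
        $code_point, length($bytes),
        join(' ', map { sprintf '%02X', ord($_) } split //, $bytes);
}
```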
144
145 For more information about encodings--for instance, to learn what
146 surrogates and byte order marks (BOMs) are--see perlunicode.
147
148 Perl's Unicode Support
149 Starting from Perl v5.6.0, Perl has had the capacity to handle Unicode
150 natively. Perl v5.8.0, however, is the first recommended release for
151 serious Unicode work. The maintenance release 5.6.1 fixed many of the
152 problems of the initial Unicode implementation, but for example regular
153 expressions still do not work with Unicode in 5.6.1. Perl v5.14.0 is
154 the first release where Unicode support is (almost) seamlessly
155 integratable without some gotchas. (There are a few exceptions.
156 Firstly, some differences in quotemeta were fixed starting in Perl
157 5.16.0. Secondly, some differences in the range operator were fixed
158 starting in Perl 5.26.0. Thirdly, some differences in split were fixed
starting in Perl 5.28.0.)
160
161 To enable this seamless support, you should "use feature
162 'unicode_strings'" (which is automatically selected if you "use v5.12"
163 or higher). See feature. (5.14 also fixes a number of bugs and
164 departures from the Unicode standard.)
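
The effect of the feature can be seen on a character in the range 128
to 255. The sketch below assumes an ASCII platform, Perl v5.12 or
later, and no locale in effect; it uppercases the same one-byte string
with and without the feature.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";  # U+00E9 on ASCII platforms, stored as one native byte

my ($upper_without, $upper_with);
{
    no feature 'unicode_strings';
    $upper_without = uc($e_acute);   # byte semantics: left unchanged
}
{
    use feature 'unicode_strings';
    $upper_with = uc($e_acute);      # Unicode semantics: becomes U+00C9
}

printf "without: %#x, with: %#x\n", ord($upper_without), ord($upper_with);
```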
165
Before Perl v5.8.0, "use utf8" was used to declare that
167 operations in the current block or file would be Unicode-aware. This
168 model was found to be wrong, or at least clumsy: the "Unicodeness" is
169 now carried with the data, instead of being attached to the operations.
170 Starting with Perl v5.8.0, only one case remains where an explicit "use
171 utf8" is needed: if your Perl script itself is encoded in UTF-8, you
172 can use UTF-8 in your identifier names, and in string and regular
173 expression literals, by saying "use utf8". This is not the default
174 because scripts with legacy 8-bit data in them would break. See utf8.
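
A sketch of what "use utf8" changes for a literal, assuming this source
file itself is saved as UTF-8 and that the "ï" below is the single
precomposed character U+00EF:

```perl
use strict;
use warnings;

my ($length_as_bytes, $length_as_chars);
{
    no utf8;
    $length_as_bytes = length("naïve");   # literal read as raw bytes
}
{
    use utf8;
    $length_as_chars = length("naïve");   # literal read as characters
}

print "bytes: $length_as_bytes, characters: $length_as_chars\n";
```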
175
176 Perl's Unicode Model
177 Perl supports both pre-5.6 strings of eight-bit native bytes, and
178 strings of Unicode characters. The general principle is that Perl
179 tries to keep its data as eight-bit bytes for as long as possible, but
180 as soon as Unicodeness cannot be avoided, the data is transparently
181 upgraded to Unicode. Prior to Perl v5.14.0, the upgrade was not
182 completely transparent (see "The "Unicode Bug"" in perlunicode), and
183 for backwards compatibility, full transparency is not gained unless
184 "use feature 'unicode_strings'" (see feature) or "use v5.12" (or
185 higher) is selected.
186
Internally, Perl currently encodes each string in one of two ways:
either in the native eight-bit character set of the platform (for
example Latin-1) or in UTF-8. Specifically, if all code points in
the string are 0xFF or less, Perl uses the native eight-bit character
set. Otherwise, it uses UTF-8.
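
For illustration only, the internal choice can be observed with
"utf8::is_utf8()", which reports the internal flag of a scalar. The
sketch looks at freshly created strings; a string that was upgraded
earlier may keep the UTF-8 form even if all its code points are small.

```perl
use strict;
use warnings;

my $narrow = chr(0xE9);    # fits in eight bits
my $wide   = chr(0x100);   # does not fit in eight bits

my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;   # native eight-bit form
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;   # UTF-8 form

print "narrow: $narrow_flag, wide: $wide_flag\n";
```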
192
193 A user of Perl does not normally need to know nor care how Perl happens
194 to encode its internal strings, but it becomes relevant when outputting
195 Unicode strings to a stream without a PerlIO layer (one with the
196 "default" encoding). In such a case, the raw bytes used internally
197 (the native character set or UTF-8, as appropriate for each string)
198 will be used, and a "Wide character" warning will be issued if those
199 strings contain a character beyond 0x00FF.
200
201 For example,
202
203 perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
204
205 produces a fairly useless mixture of native bytes and UTF-8, as well as
206 a warning:
207
208 Wide character in print at ...
209
210 To output UTF-8, use the ":encoding" or ":utf8" output layer.
211 Prepending
212
213 binmode(STDOUT, ":utf8");
214
215 to this sample program ensures that the output is completely UTF-8, and
216 removes the program's warning.
217
218 You can enable automatic UTF-8-ification of your standard file handles,
219 default "open()" layer, and @ARGV by using either the "-C" command line
switch or the "PERL_UNICODE" environment variable; see perlrun for the
221 documentation of the "-C" switch.
222
223 Note that this means that Perl expects other software to work the same
224 way: if Perl has been led to believe that STDIN should be UTF-8, but
225 then STDIN coming in from another command is not UTF-8, Perl will
226 likely complain about the malformed UTF-8.
227
228 All features that combine Unicode and I/O also require using the new
229 PerlIO feature. Almost all Perl 5.8 platforms do use PerlIO, though:
230 you can see whether yours is by running "perl -V" and looking for
231 "useperlio=define".
232
233 Unicode and EBCDIC
234 Perl 5.8.0 added support for Unicode on EBCDIC platforms. This support
235 was allowed to lapse in later releases, but was revived in 5.22.
236 Unicode support is somewhat more complex to implement since additional
237 conversions are needed. See perlebcdic for more information.
238
239 On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
instead of UTF-8. The difference is that UTF-8 is "ASCII-safe", in
that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
"EBCDIC-safe", in that all the basic characters (which include all
those that have ASCII equivalents, like "A", "0", "%", etc.) are the
same in both EBCDIC and UTF-EBCDIC. Often, documentation will use the
245 term "UTF-8" to mean UTF-EBCDIC as well. This is the case in this
246 document.
247
248 Creating Unicode
249 This section applies fully to Perls starting with v5.22. Various
250 caveats for earlier releases are in the "Earlier releases caveats"
251 subsection below.
252
253 To create Unicode characters in literals, use the "\N{...}" notation in
254 double-quoted strings:
255
256 my $smiley_from_name = "\N{WHITE SMILING FACE}";
257 my $smiley_from_code_point = "\N{U+263a}";
258
259 Similarly, they can be used in regular expression literals
260
261 $smiley =~ /\N{WHITE SMILING FACE}/;
262 $smiley =~ /\N{U+263a}/;
263
264 or, starting in v5.32:
265
266 $smiley =~ /\p{Name=WHITE SMILING FACE}/;
267 $smiley =~ /\p{Name=whitesmilingface}/;
268
269 At run-time you can use:
270
271 use charnames ();
272 my $hebrew_alef_from_name
273 = charnames::string_vianame("HEBREW LETTER ALEF");
274 my $hebrew_alef_from_code_point = charnames::string_vianame("U+05D0");
275
276 Naturally, "ord()" will do the reverse: it turns a character into a
277 code point.
278
279 There are other runtime options as well. You can use "pack()":
280
281 my $hebrew_alef_from_code_point = pack("U", 0x05d0);
282
283 Or you can use "chr()", though it is less convenient in the general
284 case:
285
286 $hebrew_alef_from_code_point = chr(utf8::unicode_to_native(0x05d0));
287 utf8::upgrade($hebrew_alef_from_code_point);
288
289 The "utf8::unicode_to_native()" and "utf8::upgrade()" aren't needed if
290 the argument is above 0xFF, so the above could have been written as
291
292 $hebrew_alef_from_code_point = chr(0x05d0);
293
294 since 0x5d0 is above 255.
295
296 "\x{}" and "\o{}" can also be used to specify code points at compile
297 time in double-quotish strings, but, for backward compatibility with
298 older Perls, the same rules apply as with "chr()" for code points less
299 than 256.
300
301 "utf8::unicode_to_native()" is used so that the Perl code is portable
302 to EBCDIC platforms. You can omit it if you're really sure no one will
303 ever want to use your code on a non-ASCII platform. Starting in Perl
304 v5.22, calls to it on ASCII platforms are optimized out, so there's no
305 performance penalty at all in adding it. Or you can simply use the
306 other constructs that don't require it.
307
308 See "Further Resources" for how to find all these names and numeric
309 codes.
310
311 Earlier releases caveats
312
313 On EBCDIC platforms, prior to v5.22, using "\N{U+...}" doesn't work
314 properly.
315
316 Prior to v5.16, using "\N{...}" with a character name (as opposed to a
317 "U+..." code point) required a "use charnames :full".
318
319 Prior to v5.14, there were some bugs in "\N{...}" with a character name
320 (as opposed to a "U+..." code point).
321
322 "charnames::string_vianame()" was introduced in v5.14. Prior to that,
323 "charnames::vianame()" should work, but only if the argument is of the
324 form "U+...". Your best bet there for runtime Unicode by character
325 name is probably:
326
327 use charnames ();
328 my $hebrew_alef_from_name
329 = pack("U", charnames::vianame("HEBREW LETTER ALEF"));
330
331 Handling Unicode
332 Handling Unicode is for the most part transparent: just use the strings
333 as usual. Functions like "index()", "length()", and "substr()" will
334 work on the Unicode characters; regular expressions will work on the
335 Unicode characters (see perlunicode and perlretut).
336
Note that Perl treats each code point of a grapheme cluster as a
separate character, so for example
339
340 print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"),
341 "\n";
342
343 will print 2, not 1. The only exception is that regular expressions
344 have "\X" for matching an extended grapheme cluster. (Thus "\X" in a
345 regular expression would match the entire sequence of both the example
346 characters.)
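
A minimal sketch of counting with "\X", using the accented letter from
above and a Hangul syllable written as three conjoining jamo. The
helper name "count_clusters" is made up for this example.

```perl
use strict;
use warnings;

my $a_acute = "\N{U+0041}\N{U+0301}";            # A + COMBINING ACUTE ACCENT
my $hangul  = "\N{U+1100}\N{U+1161}\N{U+11A8}";  # leading consonant, vowel,
                                                 # trailing consonant

# Hypothetical helper: count extended grapheme clusters in a string.
sub count_clusters {
    my ($string) = @_;
    my $count = () = $string =~ /\X/g;   # list assignment counts the matches
    return $count;
}

printf "code points: %d, clusters: %d\n",
    length($a_acute), count_clusters($a_acute);
printf "code points: %d, clusters: %d\n",
    length($hangul), count_clusters($hangul);
```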
347
348 Life is not quite so transparent, however, when working with legacy
349 encodings, I/O, and certain special cases:
350
351 Legacy Encodings
352 When you combine legacy data and Unicode, the legacy data needs to be
353 upgraded to Unicode. Normally the legacy data is assumed to be ISO
354 8859-1 (or EBCDIC, if applicable).
355
356 The "Encode" module knows about many encodings and has interfaces for
357 doing conversions between those encodings:
358
359 use Encode 'decode';
360 $data = decode("iso-8859-3", $data); # convert from legacy
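
A slightly fuller sketch of the same idea, using ISO 8859-15 purely as
an illustrative legacy encoding: the bytes are decoded into characters,
and the characters are then encoded as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $legacy_bytes = "\xA4";                               # EURO SIGN in ISO 8859-15
my $characters   = decode('iso-8859-15', $legacy_bytes); # bytes -> characters
my $utf8_bytes   = encode('UTF-8', $characters);         # characters -> bytes

printf "U+%04X\n", ord($characters);
print join(' ', map { sprintf '%02X', ord($_) } split //, $utf8_bytes), "\n";
```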
361
362 Unicode I/O
363 Normally, writing out Unicode data
364
365 print FH $some_string_with_unicode, "\n";
366
367 produces raw bytes that Perl happens to use to internally encode the
368 Unicode string. Perl's internal encoding depends on the system as well
369 as what characters happen to be in the string at the time. If any of
370 the characters are at code points 0x100 or above, you will get a
371 warning. To ensure that the output is explicitly rendered in the
372 encoding you desire--and to avoid the warning--open the stream with the
373 desired encoding. Some examples:
374
375 open FH, ">:utf8", "file";
376
377 open FH, ">:encoding(ucs2)", "file";
378 open FH, ">:encoding(UTF-8)", "file";
379 open FH, ">:encoding(shift_jis)", "file";
380
381 and on already open streams, use "binmode()":
382
383 binmode(STDOUT, ":utf8");
384
385 binmode(STDOUT, ":encoding(ucs2)");
386 binmode(STDOUT, ":encoding(UTF-8)");
387 binmode(STDOUT, ":encoding(shift_jis)");
388
389 The matching of encoding names is loose: case does not matter, and many
390 encodings have several aliases. Note that the ":utf8" layer must
391 always be specified exactly like that; it is not subject to the loose
392 matching of encoding names. Also note that currently ":utf8" is unsafe
393 for input, because it accepts the data without validating that it is
394 indeed valid UTF-8; you should instead use ":encoding(UTF-8)" (with or
395 without a hyphen).
396
397 See PerlIO for the ":utf8" layer, PerlIO::encoding and Encode::PerlIO
398 for the ":encoding()" layer, and Encode::Supported for many encodings
399 supported by the "Encode" module.
400
401 Reading in a file that you know happens to be encoded in one of the
402 Unicode or legacy encodings does not magically turn the data into
403 Unicode in Perl's eyes. To do that, specify the appropriate layer when
404 opening files
405
open(my $utf8_fh, '<:encoding(UTF-8)', 'anything');
my $line_of_unicode = <$utf8_fh>;

open(my $big5_fh, '<:encoding(Big5)', 'anything');
my $line_from_big5 = <$big5_fh>;
411
412 The I/O layers can also be specified more flexibly with the "open"
413 pragma. See open, or look at the following example.
414
415 use open ':encoding(UTF-8)'; # input/output default encoding will be
416 # UTF-8
417 open X, ">file";
418 print X chr(0x100), "\n";
419 close X;
420 open Y, "<file";
421 printf "%#x\n", ord(<Y>); # this should print 0x100
422 close Y;
423
424 With the "open" pragma you can use the ":locale" layer
425
426 BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
427 # the :locale will probe the locale environment variables like
428 # LC_ALL
429 use open OUT => ':locale'; # russki parusski
430 open(O, ">koi8");
431 print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
432 close O;
433 open(I, "<koi8");
printf "%#x\n", ord(<I>); # this should print 0xc1
435 close I;
436
437 These methods install a transparent filter on the I/O stream that
438 converts data from the specified encoding when it is read in from the
439 stream. The result is always Unicode.
440
441 The open pragma affects all the "open()" calls after the pragma by
442 setting default layers. If you want to affect only certain streams,
443 use explicit layers directly in the "open()" call.
444
445 You can switch encodings on an already opened stream by using
446 "binmode()"; see "binmode" in perlfunc.
447
448 The ":locale" does not currently work with "open()" and "binmode()",
449 only with the "open" pragma. The ":utf8" and ":encoding(...)" methods
450 do work with all of "open()", "binmode()", and the "open" pragma.
451
452 Similarly, you may use these I/O layers on output streams to
453 automatically convert Unicode to the specified encoding when it is
454 written to the stream. For example, the following snippet copies the
455 contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
456 the file "text.utf8", encoded as UTF-8:
457
458 open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
459 open(my $unicode, '>:utf8', 'text.utf8');
460 while (<$nihongo>) { print $unicode $_ }
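
The same layers can be tried without touching the file system by
opening in-memory file handles on a scalar. This is only a sketch of
the round trip; the sample text is arbitrary.

```perl
use strict;
use warnings;

my $text = "\N{U+0100}\N{U+263A}\n";

# Write characters out; the layer turns them into UTF-8 bytes.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

my $byte_count = length($buffer);   # 2 + 3 + 1 bytes

# Read the bytes back; the layer turns them into characters again.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $line = <$in>;
close($in);

my $round_trip_ok = $line eq $text ? 1 : 0;
print "bytes: $byte_count, round trip ok: $round_trip_ok\n";
```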
461
462 The naming of encodings, both by the "open()" and by the "open" pragma
463 allows for flexible names: "koi8-r" and "KOI8R" will both be
464 understood.
465
466 Common encodings recognized by ISO, MIME, IANA, and various other
467 standardisation organisations are recognised; for a more detailed list
468 see Encode::Supported.
469
470 "read()" reads characters and returns the number of characters.
471 "seek()" and "tell()" operate on byte counts, as does "sysseek()".
472
473 "sysread()" and "syswrite()" should not be used on file handles with
character encoding layers: they behave badly, and that behaviour has
475 been deprecated since perl 5.24.
476
477 Notice that because of the default behaviour of not doing any
478 conversion upon input if there is no default layer, it is easy to
479 mistakenly write code that keeps on expanding a file by repeatedly
480 encoding the data:
481
482 # BAD CODE WARNING
483 open F, "file";
484 local $/; ## read in the whole file of 8-bit characters
485 $t = <F>;
486 close F;
487 open F, ">:encoding(UTF-8)", "file";
488 print F $t; ## convert to UTF-8 on output
489 close F;
490
491 If you run this code twice, the contents of the file will be twice
492 UTF-8 encoded. A "use open ':encoding(UTF-8)'" would have avoided the
bug, as would explicitly opening the file for input as UTF-8.
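
A corrected version might look like the following sketch, which decodes
on input as well as encoding on output, so running it repeatedly leaves
the file unchanged. The scratch file and the helper name "rewrite_file"
are assumptions made for the example, and the byte count assumes a
platform without CRLF translation.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file holding the UTF-8 bytes for U+0100 plus a newline.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} "\xC4\x80\n";
close($tmp_fh) or die "close: $!";

# Hypothetical helper: read the file as UTF-8, write it back as UTF-8.
sub rewrite_file {
    my ($file) = @_;
    open(my $in, '<:encoding(UTF-8)', $file) or die "open $file: $!";
    my $text = do { local $/; <$in> };   # slurp, decoding as we read
    close($in);
    open(my $out, '>:encoding(UTF-8)', $file) or die "open $file: $!";
    print {$out} $text;                  # encode once on output
    close($out) or die "close: $!";
}

rewrite_file($path) for 1 .. 2;
my $size = -s $path;
print "size after two runs: $size bytes\n";
```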
494
495 NOTE: the ":utf8" and ":encoding" features work only if your Perl has
496 been built with PerlIO, which is the default on most systems.
497
498 Displaying Unicode As Text
499 Sometimes you might want to display Perl scalars containing Unicode as
500 simple ASCII (or EBCDIC) text. The following subroutine converts its
501 argument so that Unicode characters with code points greater than 255
502 are displayed as "\x{...}", control characters (like "\n") are
503 displayed as "\x..", and the rest of the characters as themselves:
504
505 sub nice_string {
506 join("",
507 map { $_ > 255 # if wide character...
508 ? sprintf("\\x{%04X}", $_) # \x{...}
509 : chr($_) =~ /[[:cntrl:]]/ # else if control character...
510 ? sprintf("\\x%02X", $_) # \x..
511 : quotemeta(chr($_)) # else quoted or as themselves
512 } unpack("W*", $_[0])); # unpack Unicode characters
513 }
514
515 For example,
516
517 nice_string("foo\x{100}bar\n")
518
519 returns the string
520
521 'foo\x{0100}bar\x0A'
522
523 which is ready to be printed.
524
525 ("\\x{}" is used here instead of "\\N{}", since it's most likely that
526 you want to see what the native values are.)
527
528 Special Cases
529 • Starting in Perl 5.28, it is illegal for bit operators, like "~",
530 to operate on strings containing code points above 255.
531
532 • The vec() function may produce surprising results if used on
533 strings containing characters with ordinal values above 255. In
534 such a case, the results are consistent with the internal encoding
535 of the characters, but not with much else. So don't do that, and
536 starting in Perl 5.28, a deprecation message is issued if you do
537 so, becoming illegal in Perl 5.32.
538
539 • Peeking At Perl's Internal Encoding
540
541 Normal users of Perl should never care how Perl encodes any
542 particular Unicode string (because the normal ways to get at the
543 contents of a string with Unicode--via input and output--should
544 always be via explicitly-defined I/O layers). But if you must,
545 there are two ways of looking behind the scenes.
546
547 One way of peeking inside the internal encoding of Unicode
characters is to use "unpack("C*", ...)" to get the bytes of
549 whatever the string encoding happens to be, or "unpack("U0..",
550 ...)" to get the bytes of the UTF-8 encoding:
551
552 # this prints c4 80 for the UTF-8 bytes 0xc4 0x80
553 print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
554
555 Yet another way would be to use the Devel::Peek module:
556
557 perl -MDevel::Peek -e 'Dump(chr(0x100))'
558
559 That shows the "UTF8" flag in FLAGS and both the UTF-8 bytes and
560 Unicode characters in "PV". See also later in this document the
561 discussion about the "utf8::is_utf8()" function.
562
563 Advanced Topics
564 • String Equivalence
565
566 The question of string equivalence turns somewhat complicated in
567 Unicode: what do you mean by "equal"?
568
569 (Is "LATIN CAPITAL LETTER A WITH ACUTE" equal to "LATIN CAPITAL
570 LETTER A"?)
571
572 The short answer is that by default Perl compares equivalence
573 ("eq", "ne") based only on code points of the characters. In the
574 above case, the answer is no (because 0x00C1 != 0x0041). But
575 sometimes, any CAPITAL LETTER A's should be considered equal, or
576 even A's of any case.
577
578 The long answer is that you need to consider character
579 normalization and casing issues: see Unicode::Normalize, Unicode
580 Technical Report #15, Unicode Normalization Forms
581 <https://www.unicode.org/reports/tr15> and sections on case mapping
582 in the Unicode Standard <https://www.unicode.org>.
583
584 As of Perl 5.8.0, the "Full" case-folding of Case
585 Mappings/SpecialCasing is implemented, but bugs remain in "qr//i"
586 with them, mostly fixed by 5.14, and essentially entirely by 5.18.
587
588 • String Collation
589
590 People like to see their strings nicely sorted--or as Unicode
591 parlance goes, collated. But again, what do you mean by collate?
592
593 (Does "LATIN CAPITAL LETTER A WITH ACUTE" come before or after
594 "LATIN CAPITAL LETTER A WITH GRAVE"?)
595
596 The short answer is that by default, Perl compares strings ("lt",
597 "le", "cmp", "ge", "gt") based only on the code points of the
598 characters. In the above case, the answer is "after", since 0x00C1
599 > 0x00C0.
600
601 The long answer is that "it depends", and a good answer cannot be
602 given without knowing (at the very least) the language context.
603 See Unicode::Collate, and Unicode Collation Algorithm
<https://www.unicode.org/reports/tr10/>.
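
The two topics above can be sketched together with the core modules
Unicode::Normalize and Unicode::Collate. The helper "loosely_equal" is
hypothetical and is just one possible definition of "equal"; the
collation uses the module's default settings, with no language
tailoring. Perl v5.16 or later is assumed for "fc".

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);
use Unicode::Normalize qw(NFD);
use Unicode::Collate;

# Hypothetical helper: equal after canonical decomposition and case folding.
sub loosely_equal {
    my ($left, $right) = @_;
    return fc(NFD($left)) eq fc(NFD($right)) ? 1 : 0;
}

my $precomposed = "\N{U+00C1}";     # LATIN CAPITAL LETTER A WITH ACUTE
my $sequence    = "a\N{U+0301}";    # small a + COMBINING ACUTE ACCENT

my $strict_equal = $precomposed eq $sequence ? 1 : 0;       # code points differ
my $loose_equal  = loosely_equal($precomposed, $sequence);

my @words         = ("Zebra", "apple", "\N{U+00E9}t\N{U+00E9}");
my @by_code_point = sort @words;                            # "Z" sorts before "a"
my @by_collation  = Unicode::Collate->new->sort(@words);    # alphabetical order

binmode(STDOUT, ":encoding(UTF-8)");
print "strict: $strict_equal, loose: $loose_equal\n";
print "code point order: @by_code_point\n";
print "collation order:  @by_collation\n";
```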
605
606 Miscellaneous
607 • Character Ranges and Classes
608
609 Character ranges in regular expression bracketed character classes
610 ( e.g., "/[a-z]/") and in the "tr///" (also known as "y///")
611 operator are not magically Unicode-aware. What this means is that
612 "[A-Za-z]" will not magically start to mean "all alphabetic
613 letters" (not that it does mean that even for 8-bit characters; for
614 those, if you are using locales (perllocale), use "/[[:alpha:]]/";
615 and if not, use the 8-bit-aware property "\p{alpha}").
616
617 All the properties that begin with "\p" (and its inverse "\P") are
618 actually character classes that are Unicode-aware. There are
dozens of them; see perluniprops.
620
621 Starting in v5.22, you can use Unicode code points as the end
622 points of regular expression pattern character ranges, and the
623 range will include all Unicode code points that lie between those
624 end points, inclusive.
625
626 qr/ [ \N{U+03} - \N{U+20} ] /xx
627
628 includes the code points "\N{U+03}", "\N{U+04}", ..., "\N{U+20}".
629
630 This also works for ranges in "tr///" starting in Perl v5.24.
631
632 • String-To-Number Conversions
633
634 Unicode does define several other decimal--and numeric--characters
635 besides the familiar 0 to 9, such as the Arabic and Indic digits.
636 Perl does not support string-to-number conversion for digits other
637 than ASCII 0 to 9 (and ASCII "a" to "f" for hexadecimal). To get
638 safe conversions from any Unicode string, use "num()" in
639 Unicode::UCD.
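
Both points above can be illustrated in a few lines. This is a sketch
with arbitrary sample characters; "num()" is assumed to be available,
which needs the Unicode::UCD shipped with Perl v5.14 or later.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

my $alpha = "\N{U+03B1}";                 # GREEK SMALL LETTER ALPHA

# A bracketed ASCII range does not see the letter; properties do.
my $in_ascii_range = $alpha =~ /[a-z]/         ? 1 : 0;
my $is_lowercase   = $alpha =~ /\p{Lowercase}/ ? 1 : 0;
my $is_greek       = $alpha =~ /\p{Greek}/     ? 1 : 0;

my $arabic_34 = "\N{U+0663}\N{U+0664}";   # ARABIC-INDIC DIGIT THREE, FOUR
my $via_num   = num($arabic_34);          # understands non-ASCII digits
my $via_perl  = do { no warnings 'numeric'; 0 + $arabic_34 };  # does not

print "range: $in_ascii_range, lowercase: $is_lowercase, greek: $is_greek\n";
print "num(): $via_num, plain conversion: $via_perl\n";
```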
640
641 Questions With Answers
642 • Will My Old Scripts Break?
643
644 Very probably not. Unless you are generating Unicode characters
645 somehow, old behaviour should be preserved. About the only
646 behaviour that has changed and which could start generating Unicode
647 is the old behaviour of "chr()" where supplying an argument more
648 than 255 produced a character modulo 255. "chr(300)", for example,
649 was equal to "chr(45)" or "-" (in ASCII), now it is LATIN CAPITAL
650 LETTER I WITH BREVE.
651
652 • How Do I Make My Scripts Work With Unicode?
653
654 Very little work should be needed since nothing changes until you
655 generate Unicode data. The most important thing is getting input
656 as Unicode; for that, see the earlier I/O discussion. To get full
657 seamless Unicode support, add "use feature 'unicode_strings'" (or
658 "use v5.12" or higher) to your script.
659
660 • How Do I Know Whether My String Is In Unicode?
661
662 You shouldn't have to care. But you may if your Perl is before
663 5.14.0 or you haven't specified "use feature 'unicode_strings'" or
664 "use 5.012" (or higher) because otherwise the rules for the code
665 points in the range 128 to 255 are different depending on whether
666 the string they are contained within is in Unicode or not. (See
667 "When Unicode Does Not Happen" in perlunicode.)
668
669 To determine if a string is in Unicode, use:
670
671 print utf8::is_utf8($string) ? 1 : 0, "\n";
672
673 But note that this doesn't mean that any of the characters in the
string are necessarily UTF-8 encoded, or that any of the characters
675 have code points greater than 0xFF (255) or even 0x80 (128), or
676 that the string has any characters at all. All the "is_utf8()"
677 does is to return the value of the internal "utf8ness" flag
678 attached to the $string. If the flag is off, the bytes in the
679 scalar are interpreted as a single byte encoding. If the flag is
680 on, the bytes in the scalar are interpreted as the (variable-
681 length, potentially multi-byte) UTF-8 encoded code points of the
682 characters. Bytes added to a UTF-8 encoded string are
683 automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8
684 scalars are merged (double-quoted interpolation, explicit
685 concatenation, or printf/sprintf parameter substitution), the
686 result will be UTF-8 encoded as if copies of the byte strings were
687 upgraded to UTF-8: for example,
688
689 $a = "ab\x80c";
690 $b = "\x{100}";
691 print "$a = $b\n";
692
693 the output string will be UTF-8-encoded "ab\x80c = \x{100}\n", but
694 $a will stay byte-encoded.
695
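The upgrade rule can be checked directly. The following is a small sketch, in which "utf8_flags()" is a hypothetical helper written for this example, that reports the internal flag of each operand and of the merged result:

```perl
use strict;
use warnings;

# Report the internal UTF-8 flag of each argument as 0 or 1.
sub utf8_flags { return map { utf8::is_utf8($_) ? 1 : 0 } @_ }

my $byte_string = "ab\x80c";      # all code points below 0x100
my $wide_string = "\x{100}";      # needs the UTF-8 representation
my $joined      = "$byte_string = $wide_string\n";

print join(" ", utf8_flags($byte_string, $wide_string, $joined)), "\n";  # 0 1 1
print length($joined), "\n";      # 9 characters, however they are stored
```

The character length of the merged string is the same whichever representation Perl picks; only the flag differs.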
696 Sometimes you might really need to know the byte length of a string
697 instead of the character length. For that use the "bytes" pragma
698 and the "length()" function:
699
700 my $unicode = chr(0x100);
701 print length($unicode), "\n"; # will print 1
702 use bytes;
703 print length($unicode), "\n"; # will print 2
704 # (the 0xC4 0x80 of the UTF-8)
705 no bytes;
706
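The documentation of the "bytes" pragma discourages its use for anything other than debugging. If what you actually want is the number of bytes the string will occupy once written out as UTF-8, one alternative is to encode a copy and measure that. This is only a sketch, and "utf8_byte_length()" is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes the string would occupy once encoded as UTF-8.
# utf8_byte_length() is a hypothetical helper, not a core function.
sub utf8_byte_length {
    my ($string) = @_;
    return length(encode('UTF-8', $string));
}

print utf8_byte_length(chr(0x100)), "\n";   # 2 (0xC4 0x80)
print utf8_byte_length("abc"), "\n";        # 3
```

Unlike "bytes::length", this does not depend on how Perl happens to store the string internally: a string holding only chr(0xE9) is reported as 2 bytes whether or not its UTF-8 flag is on.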
707 • How Do I Find Out What Encoding a File Has?
708
709 You might try Encode::Guess, but it has a number of limitations.
710
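A minimal sketch of such a guess follows. "guess_name()" is a hypothetical wrapper written for this example; it relies on "guess_encoding()" returning an encoding object on success and a plain error string otherwise. With no extra suspects given, only a few encodings (such as ASCII and UTF-8) are considered, which is one of the limitations mentioned above.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Return the name of the guessed encoding, or undef if the guess failed.
# guess_name() is a hypothetical wrapper for illustration.
sub guess_name {
    my ($octets, @suspects) = @_;
    my $guess = guess_encoding($octets, @suspects);
    return ref($guess) ? $guess->name : undef;   # a plain string means failure
}

my $name  = guess_name("caf\xC3\xA9");
my $label = defined $name ? $name : "no unambiguous guess";
print "$label\n";
```

Adding suspects that overlap, such as Latin-1 together with UTF-8, often makes the guess ambiguous, in which case the wrapper returns undef.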
711 • How Do I Detect Data That's Not Valid In a Particular Encoding?
712
713 Use the "Encode" package to try converting it. For example,
714
715 use Encode 'decode';
716
717 if (eval { decode('UTF-8', $string, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 }) {
718 # $string is valid UTF-8 and, thanks to LEAVE_SRC, unmodified
719 } else {
720 # $string is not valid UTF-8
721 }
722
723 Or use "unpack" to try decoding it:
724
725 use warnings;
726 @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
727
728 If invalid, a "Malformed UTF-8 character" warning is produced. The
729 "C0" means "process the string character per character". Without
730 that, the "unpack("U*", ...)" would work in "U0" mode (the default
731 if the format string starts with "U") and it would return the bytes
732 making up the UTF-8 encoding of the target string, something that
733 will always work.
734
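If you also want to know where the invalid data starts, one possible approach is to decode with "Encode::FB_QUIET", which stops at the first error and leaves the unprocessed remainder in the source string. The sketch below assumes that behaviour; "first_invalid_offset()" is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode ();

# Return the byte offset of the first sequence that is not valid UTF-8,
# or undef if the whole string is valid.  first_invalid_offset() is a
# hypothetical helper written for this example.
sub first_invalid_offset {
    my ($octets) = @_;
    my $rest = $octets;    # work on a copy: decode() consumes it in place
    Encode::decode('UTF-8', $rest, Encode::FB_QUIET);
    return length($rest) ? length($octets) - length($rest) : undef;
}

my $offset = first_invalid_offset("ab\xFFcd");
print defined $offset ? "invalid at byte $offset\n" : "valid\n";
```

Working on a copy matters here because the call deliberately lets "decode()" overwrite its input.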
735 • How Do I Convert Binary Data Into a Particular Encoding, Or Vice
736 Versa?
737
738 This probably isn't as useful as you might think. Normally, you
739 shouldn't need to.
740
741 In one sense, what you are asking doesn't make much sense:
742 encodings are for characters, and binary data are not "characters",
743 so converting "data" into some encoding isn't meaningful unless you
744 know what character set and encoding the binary data is in, in
745 which case it's not just binary data, now is it?
746
747 If you have a raw sequence of bytes that you know should be
748 interpreted via a particular encoding, you can use "Encode":
749
750 use Encode 'from_to';
751 from_to($data, "iso-8859-1", "UTF-8"); # from latin-1 to UTF-8
752
753 The call to "from_to()" changes the bytes in $data, but nothing
754 material about the nature of the string has changed as far as Perl
755 is concerned. Both before and after the call, the string $data
756 contains just a bunch of 8-bit bytes. As far as Perl is concerned,
757 the encoding of the string remains as "system-native 8-bit bytes".
758
759 You might relate this to a fictional 'Translate' module:
760
761 use Translate;
762 my $phrase = "Yes";
763 Translate::from_to($phrase, 'english', 'deutsch');
764 ## phrase now contains "Ja"
765
766 The contents of the string change, but not the nature of the
767 string. Perl doesn't know any more after the call than before that
768 the contents of the string indicate the affirmative.
769
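To make the point about "from_to()" concrete, here is a short sketch (the sample data is made up) showing that the converted string is still a plain byte string as far as Perl is concerned:

```perl
use strict;
use warnings;
use Encode qw(from_to);

my $data   = "caf\xE9";                                # Latin-1 bytes
my $octets = from_to($data, "iso-8859-1", "UTF-8");    # converts in place

printf "%vX\n", $data;                                 # 63.61.66.C3.A9
print "$octets bytes, UTF-8 flag ",
    (utf8::is_utf8($data) ? "on" : "off"), "\n";       # 5 bytes, UTF-8 flag off
```

The bytes changed from four to five, but "length()" still counts bytes, because Perl has not been told that they represent characters.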
770 Back to converting data. If you have (or want) data in your
771 system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you
772 can use pack/unpack to convert to/from Unicode.
773
774 $native_string = pack("W*", unpack("U*", $Unicode_string));
775 $Unicode_string = pack("U*", unpack("W*", $native_string));
776
777 If you have a sequence of bytes you know is valid UTF-8, but Perl
778 doesn't know it yet, you can make Perl a believer, too:
779
780 $Unicode = $bytes;
781 utf8::decode($Unicode);
782
783 or:
784
785 $Unicode = pack("U0a*", $bytes);
786
787 You can find the bytes that make up a UTF-8 sequence with
788
789 @bytes = unpack("U0C*", $Unicode_string)
790
791 and you can create well-formed Unicode with
792
793 $Unicode_string = pack("U*", 0xff, ...)
794
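A round trip that does not depend on pack template modes can be sketched with "utf8::decode()" and "utf8::encode()", both of which work in place. The sample bytes are the UTF-8 encoding of U+0100:

```perl
use strict;
use warnings;

my $bytes = "\xC4\x80";            # UTF-8 encoding of U+0100, as raw bytes

my $Unicode = $bytes;
utf8::decode($Unicode);            # now one character, U+0100
printf "%d char, U+%04X\n", length($Unicode), ord($Unicode);   # 1 char, U+0100

my $copy = $Unicode;
utf8::encode($copy);               # back to the UTF-8 bytes
my @bytes = unpack("C*", $copy);
printf "%s\n", join(" ", map { sprintf "%02X", $_ } @bytes);   # C4 80
```

Encoding a copy keeps the original character string intact, since "utf8::encode()" modifies its argument.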
795 • How Do I Display Unicode? How Do I Input Unicode?
796
797 See <http://www.alanwood.net/unicode/> and
798 <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
799
800 • How Does Unicode Work With Traditional Locales?
801
802 If your locale is a UTF-8 locale, starting in Perl v5.26, Perl
803 works well for all categories; before this, starting with Perl
804 v5.20, it works for all categories but "LC_COLLATE", which deals
805 with sorting and the "cmp" operator. But note that the standard
806 "Unicode::Collate" and "Unicode::Collate::Locale" modules offer
807 much more powerful solutions to collation issues, and work on
808 earlier releases.
809
810 For other locales, starting in Perl 5.16, you can specify
811
812 use locale ':not_characters';
813
814 to get Perl to work well with them. The catch is that you have to
815 translate from the locale character set to/from Unicode yourself.
816 See "Unicode I/O" above for how to
817
818 use open ':locale';
819
820 to accomplish this, but full details are in "Unicode and UTF-8" in
821 perllocale, including gotchas that happen if you don't specify
822 ":not_characters".
823
824 Hexadecimal Notation
825 The Unicode standard prefers using hexadecimal notation because that
826 more clearly shows the division of Unicode into blocks of 256
827 characters. Hexadecimal is also simply shorter than decimal. You can
828 use decimal notation, too, but learning to use hexadecimal just makes
829 life easier with the Unicode standard. The "U+HHHH" notation uses
830 hexadecimal, for example.
831
832 The "0x" prefix means a hexadecimal number; the digits are 0-9 and a-f
833 (or A-F, case doesn't matter). Each hexadecimal digit represents four
834 bits, or half a byte. "print 0x..., "\n"" will show a hexadecimal
835 number in decimal, and "printf "%x\n", $decimal" will show a decimal
836 number in hexadecimal. If you have just the "hex digits" of a
837 hexadecimal number, you can use the "hex()" function.
838
839 print 0x0009, "\n"; # 9
840 print 0x000a, "\n"; # 10
841 print 0x000f, "\n"; # 15
842 print 0x0010, "\n"; # 16
843 print 0x0011, "\n"; # 17
844 print 0x0100, "\n"; # 256
845
846 print 0x0041, "\n"; # 65
847
848 printf "%x\n", 65; # 41
849 printf "%#x\n", 65; # 0x41
850
851 print hex("41"), "\n"; # 65
852
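The same functions convert between the "U+HHHH" notation and characters. The following is a small sketch; both helper names are hypothetical and chosen only for this example:

```perl
use strict;
use warnings;

# Both helpers are hypothetical names chosen for this example.

# "U+03B1" -> the character with code point 0x3B1; undef if malformed.
sub from_u_notation {
    my ($notation) = @_;
    return undef unless $notation =~ /\AU\+([0-9A-Fa-f]{4,6})\z/;
    return chr(hex($1));
}

# A character -> its "U+HHHH" notation (at least four hex digits).
sub to_u_notation {
    my ($char) = @_;
    return sprintf("U+%04X", ord($char));
}

print to_u_notation("A"), "\n";                  # U+0041
print ord(from_u_notation("U+03B1")), "\n";      # 945
print 0x03B1 >> 8, "\n";                         # 3: the 256-character block number
```

The last line illustrates the block structure: dropping the final two hexadecimal digits of a code point gives the number of the 256-character block it falls in.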
853 Further Resources
854 • Unicode Consortium
855
856 <https://www.unicode.org/>
857
858 • Unicode FAQ
859
860 <https://www.unicode.org/faq/>
861
862 • Unicode Glossary
863
864 <https://www.unicode.org/glossary/>
865
866 • Unicode Recommended Reading List
867
868 The Unicode Consortium has a list of articles and books, some of
869 which give a much more in-depth treatment of Unicode:
870 <http://unicode.org/resources/readinglist.html>
871
872 • Unicode Useful Resources
873
874 <https://www.unicode.org/unicode/onlinedat/resources.html>
875
876 • Unicode and Multilingual Support in HTML, Fonts, Web Browsers and
877 Other Applications
878
879 <http://www.alanwood.net/unicode/>
880
881 • UTF-8 and Unicode FAQ for Unix/Linux
882
883 <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
884
885 • Legacy Character Sets
886
887 <http://www.czyborra.com/> <http://www.eki.ee/letter/>
888
889 • You can explore various information from the Unicode data files
890 using the "Unicode::UCD" module.
891
893 If you cannot upgrade your Perl to 5.8.0 or later, you can still do
894 some Unicode processing by using the modules "Unicode::String",
895 "Unicode::Map8", and "Unicode::Map", available from CPAN. If you have
896 the GNU recode installed, you can also use the Perl front-end
897 "Convert::Recode" for character conversions.
898
899 The following are fast conversions from ISO 8859-1 (Latin-1) bytes to
900 UTF-8 bytes and back; the code works even with older Perl 5 versions.
901
902 # ISO 8859-1 to UTF-8
903 s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
904
905 # UTF-8 to ISO 8859-1
906 s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
907
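As a usage sketch, the two substitutions can be wrapped in subroutines and checked with a round trip. The subroutine names are hypothetical, and both expect byte strings, not strings that Perl already treats as Unicode:

```perl
use strict;
use warnings;

# Hypothetical wrappers around the two substitutions shown above.
# Both take and return byte strings.

sub latin1_to_utf8_bytes {
    my ($s) = @_;
    # Each byte 0x80..0xFF becomes a two-byte UTF-8 sequence.
    $s =~ s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
    return $s;
}

sub utf8_to_latin1_bytes {
    my ($s) = @_;
    # Only lead bytes 0xC2 and 0xC3 encode code points below 0x100.
    $s =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
    return $s;
}

my $utf8 = latin1_to_utf8_bytes("caf\xE9");
printf "%vX\n", $utf8;                           # 63.61.66.C3.A9
printf "%vX\n", utf8_to_latin1_bytes($utf8);     # 63.61.66.E9
```

UTF-8 sequences for code points above 0xFF are left untouched by the second substitution, because they have no Latin-1 equivalent.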
909 perlunitut, perlunicode, Encode, open, utf8, bytes, perlretut, perlrun,
910 Unicode::Collate, Unicode::Normalize, Unicode::UCD
911
913 Thanks to the kind readers of the perl5-porters@perl.org,
914 perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
915 mailing lists for their valuable feedback.
916
918 Copyright 2001-2011 Jarkko Hietaniemi <jhi@iki.fi>. Now maintained by
919 Perl 5 Porters.
920
921 This document may be distributed under the same terms as Perl itself.
922
923
924
925perl v5.36.3 2023-11-30 PERLUNIINTRO(1)