PERLUNIINTRO(1)        Perl Programmers Reference Guide       PERLUNIINTRO(1)

NAME
       perluniintro - Perl Unicode introduction

DESCRIPTION
       This document gives a general idea of Unicode and how to use Unicode
       in Perl.  See "Further Resources" for references to more in-depth
       treatments of Unicode.

   Unicode
       Unicode is a character set standard which plans to codify all of the
       writing systems of the world, plus many other symbols.

       Unicode and ISO/IEC 10646 are coordinated standards that unify
       almost all other modern character set standards, covering more than
       80 writing systems and hundreds of languages, including all
       commercially important modern languages.  All characters in the
       largest Chinese, Japanese, and Korean dictionaries are also encoded.
       The standards will eventually cover almost all characters in more
       than 250 writing systems and thousands of languages.  Unicode 1.0
       was released in October 1991, and 6.0 in October 2010.

       A Unicode character is an abstract entity.  It is not bound to any
       particular integer width, especially not to the C language "char".
       Unicode is language-neutral and display-neutral: it does not encode
       the language of the text, and it does not generally define fonts or
       other graphical layout details.  Unicode operates on characters and
       on text built from those characters.

       Unicode defines characters like "LATIN CAPITAL LETTER A" or "GREEK
       SMALL LETTER ALPHA" and unique numbers for the characters, in this
       case 0x0041 and 0x03B1, respectively.  These unique numbers are
       called code points.  A code point is essentially the position of
       the character within the set of all possible Unicode characters,
       and thus in Perl, the term ordinal is often used interchangeably
       with it.

       The Unicode standard prefers using hexadecimal notation for the
       code points.  If numbers like 0x0041 are unfamiliar to you, take a
       peek at a later section, "Hexadecimal Notation".  The Unicode
       standard uses the notation "U+0041 LATIN CAPITAL LETTER A", to give
       the hexadecimal code point and the normative name of the character.

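       As a quick illustration (a minimal sketch, not taken from the
       standard itself), chr() and ord() map between characters and code
       points, and sprintf() can render the "U+" notation:

```perl
use strict;
use warnings;

my $alpha = chr(0x03B1);            # GREEK SMALL LETTER ALPHA
my $code  = ord($alpha);            # back to the code point, 0x03B1
printf "U+%04X\n", $code;           # prints "U+03B1"
```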
       Unicode also defines various properties for the characters, like
       "uppercase" or "lowercase", "decimal digit", or "punctuation";
       these properties are independent of the names of the characters.
       Furthermore, various operations on the characters like uppercasing,
       lowercasing, and collating (sorting) are defined.

51
52 A Unicode logical "character" can actually consist of more than one
53 internal actual "character" or code point. For Western languages, this
54 is adequately modelled by a base character (like "LATIN CAPITAL LETTER
55 A") followed by one or more modifiers (like "COMBINING ACUTE ACCENT").
56 This sequence of base character and modifiers is called a combining
57 character sequence. Some non-western languages require more
58 complicated models, so Unicode created the grapheme cluster concept,
59 which was later further refined into the extended grapheme cluster.
60 For example, a Korean Hangul syllable is considered a single logical
61 character, but most often consists of three actual Unicode characters:
62 a leading consonant followed by an interior vowel followed by a
63 trailing consonant.
64
65 Whether to call these extended grapheme clusters "characters" depends
66 on your point of view. If you are a programmer, you probably would tend
67 towards seeing each element in the sequences as one unit, or
68 "character". However from the user's point of view, the whole sequence
69 could be seen as one "character" since that's probably what it looks
70 like in the context of the user's language. In this document, we take
71 the programmer's point of view: one "character" is one Unicode code
72 point.
73
74 For some combinations of base character and modifiers, there are
75 precomposed characters. There is a single character equivalent, for
76 example, for the sequence "LATIN CAPITAL LETTER A" followed by
77 "COMBINING ACUTE ACCENT". It is called "LATIN CAPITAL LETTER A WITH
78 ACUTE". These precomposed characters are, however, only available for
79 some combinations, and are mainly meant to support round-trip
80 conversions between Unicode and legacy standards (like ISO 8859).
81 Using sequences, as Unicode does, allows for needing fewer basic
82 building blocks (code points) to express many more potential grapheme
83 clusters. To support conversion between equivalent forms, various
84 normalization forms are also defined. Thus, "LATIN CAPITAL LETTER A
85 WITH ACUTE" is in Normalization Form Composed, (abbreviated NFC), and
86 the sequence "LATIN CAPITAL LETTER A" followed by "COMBINING ACUTE
87 ACCENT" represents the same character in Normalization Form Decomposed
88 (NFD).
89
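       The core Unicode::Normalize module converts between these forms; a
       minimal sketch:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";  # one code point
my $decomposed = "A\N{COMBINING ACUTE ACCENT}";            # two code points

# The two spellings are canonically equivalent; normalizing both to
# the same form makes them compare equal with eq:
print NFD($composed)   eq $decomposed ? "same\n" : "different\n";
print NFC($decomposed) eq $composed   ? "same\n" : "different\n";
```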
       Because of backward compatibility with legacy encodings, the "a
       unique number for every character" idea breaks down a bit: instead,
       there is "at least one number for every character".  The same
       character could be represented differently in several legacy
       encodings.  The converse is not true: some code points do not have
       an assigned character.  Firstly, there are unallocated code points
       within otherwise used blocks.  Secondly, there are special Unicode
       control characters that do not represent true characters.

98
99 When Unicode was first conceived, it was thought that all the world's
100 characters could be represented using a 16-bit word; that is a maximum
101 of 0x10000 (or 65,536) characters would be needed, from 0x0000 to
102 0xFFFF. This soon proved to be wrong, and since Unicode 2.0 (July
103 1996), Unicode has been defined all the way up to 21 bits (0x10FFFF),
104 and Unicode 3.1 (March 2001) defined the first characters above 0xFFFF.
105 The first 0x10000 characters are called the Plane 0, or the Basic
106 Multilingual Plane (BMP). With Unicode 3.1, 17 (yes, seventeen) planes
107 in all were defined--but they are nowhere near full of defined
108 characters, yet.
109
110 When a new language is being encoded, Unicode generally will choose a
111 "block" of consecutive unallocated code points for its characters. So
112 far, the number of code points in these blocks has always been evenly
113 divisible by 16. Extras in a block, not currently needed, are left
114 unallocated, for future growth. But there have been occasions when a
115 later release needed more code points than the available extras, and a
116 new block had to allocated somewhere else, not contiguous to the
117 initial one, to handle the overflow. Thus, it became apparent early on
118 that "block" wasn't an adequate organizing principle, and so the
119 "Script" property was created. (Later an improved script property was
120 added as well, the "Script_Extensions" property.) Those code points
121 that are in overflow blocks can still have the same script as the
122 original ones. The script concept fits more closely with natural
123 language: there is "Latin" script, "Greek" script, and so on; and there
124 are several artificial scripts, like "Common" for characters that are
125 used in multiple scripts, such as mathematical symbols. Scripts
126 usually span varied parts of several blocks. For more information
127 about scripts, see "Scripts" in perlunicode. The division into blocks
128 exists, but it is almost completely accidental--an artifact of how the
129 characters have been and still are allocated. (Note that this
130 paragraph has oversimplified things for the sake of this being an
131 introduction. Unicode doesn't really encode languages, but the writing
132 systems for them--their scripts; and one script can be used by many
133 languages. Unicode also encodes things that aren't really about
134 languages, such as symbols like "BAGGAGE CLAIM".)
135
136 The Unicode code points are just abstract numbers. To input and output
137 these abstract numbers, the numbers must be encoded or serialised
138 somehow. Unicode defines several character encoding forms, of which
139 UTF-8 is the most popular. UTF-8 is a variable length encoding that
140 encodes Unicode characters as 1 to 4 bytes. Other encodings include
141 UTF-16 and UTF-32 and their big- and little-endian variants (UTF-8 is
142 byte-order independent). The ISO/IEC 10646 defines the UCS-2 and UCS-4
143 encoding forms.
144
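       The variable-length property is easy to see with the core Encode
       module; a minimal sketch:

```perl
use strict;
use warnings;
use Encode qw(encode);

# UTF-8 uses more bytes for higher code points: one byte for ASCII
# "A" (U+0041), two for U+00DF, three for U+263A, four for U+1F600.
my @byte_lengths =
    map { length encode("UTF-8", chr($_)) } (0x41, 0xDF, 0x263A, 0x1F600);
print "@byte_lengths\n";    # prints "1 2 3 4"
```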
       For more information about encodings--for instance, to learn what
       surrogates and byte order marks (BOMs) are--see perlunicode.

   Perl's Unicode Support
       Starting from Perl v5.6.0, Perl has had the capacity to handle
       Unicode natively.  Perl v5.8.0, however, is the first recommended
       release for serious Unicode work.  The maintenance release 5.6.1
       fixed many of the problems of the initial Unicode implementation,
       but for example regular expressions still do not work with Unicode
       in 5.6.1.  Perl v5.14.0 is the first release where Unicode support
       is (almost) seamlessly integratable without some gotchas.  (There
       are a few exceptions.  Firstly, some differences in quotemeta were
       fixed starting in Perl 5.16.0.  Secondly, some differences in the
       range operator were fixed starting in Perl 5.26.0.  Thirdly, some
       differences in split were fixed starting in Perl 5.28.0.)

       To enable this seamless support, you should "use feature
       'unicode_strings'" (which is automatically selected if you "use
       v5.12" or higher).  See feature.  (5.14 also fixes a number of bugs
       and departures from the Unicode standard.)

       Before Perl v5.8.0, "use utf8" was used to declare that operations
       in the current block or file would be Unicode-aware.  This model
       was found to be wrong, or at least clumsy: the "Unicodeness" is now
       carried with the data, instead of being attached to the operations.
       Starting with Perl v5.8.0, only one case remains where an explicit
       "use utf8" is needed: if your Perl script itself is encoded in
       UTF-8, you can use UTF-8 in your identifier names, and in string
       and regular expression literals, by saying "use utf8".  This is not
       the default because scripts with legacy 8-bit data in them would
       break.  See utf8.

   Perl's Unicode Model
       Perl supports both pre-5.6 strings of eight-bit native bytes, and
       strings of Unicode characters.  The general principle is that Perl
       tries to keep its data as eight-bit bytes for as long as possible,
       but as soon as Unicodeness cannot be avoided, the data is
       transparently upgraded to Unicode.  Prior to Perl v5.14.0, the
       upgrade was not completely transparent (see "The "Unicode Bug"" in
       perlunicode), and for backwards compatibility, full transparency is
       not gained unless "use feature 'unicode_strings'" (see feature) or
       "use v5.12" (or higher) is selected.

       Internally, Perl currently uses either whatever the native
       eight-bit character set of the platform (for example Latin-1) is,
       defaulting to UTF-8, to encode Unicode strings.  Specifically, if
       all code points in the string are 0xFF or less, Perl uses the
       native eight-bit character set.  Otherwise, it uses UTF-8.

       A user of Perl does not normally need to know nor care how Perl
       happens to encode its internal strings, but it becomes relevant
       when outputting Unicode strings to a stream without a PerlIO layer
       (one with the "default" encoding).  In such a case, the raw bytes
       used internally (the native character set or UTF-8, as appropriate
       for each string) will be used, and a "Wide character" warning will
       be issued if those strings contain a character beyond 0x00FF.

       For example,

           perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

       produces a fairly useless mixture of native bytes and UTF-8, as
       well as a warning:

           Wide character in print at ...

       To output UTF-8, use the ":encoding" or ":utf8" output layer.
       Prepending

           binmode(STDOUT, ":utf8");

       to this sample program ensures that the output is completely UTF-8,
       and removes the program's warning.

       You can enable automatic UTF-8-ification of your standard file
       handles, default open() layer, and @ARGV by using either the "-C"
       command line switch or the "PERL_UNICODE" environment variable; see
       perlrun for the documentation of the "-C" switch.

       Note that this means that Perl expects other software to work the
       same way: if Perl has been led to believe that STDIN should be
       UTF-8, but then STDIN coming in from another command is not UTF-8,
       Perl will likely complain about the malformed UTF-8.

       All features that combine Unicode and I/O also require using the
       new PerlIO feature.  Almost all Perl 5.8 platforms do use PerlIO,
       though: you can see whether yours does by running "perl -V" and
       looking for "useperlio=define".

   Unicode and EBCDIC
       Perl 5.8.0 added support for Unicode on EBCDIC platforms.  This
       support was allowed to lapse in later releases, but was revived in
       5.22.  Unicode support is somewhat more complex to implement there,
       since additional conversions are needed.  See perlebcdic for more
       information.

       On EBCDIC platforms, the internal Unicode encoding form is
       UTF-EBCDIC instead of UTF-8.  The difference is that UTF-8 is
       "ASCII-safe", in that ASCII characters encode to UTF-8 as-is, while
       UTF-EBCDIC is "EBCDIC-safe", in that all the basic characters
       (which includes all those that have ASCII equivalents, like "A",
       "0", "%", etc.) are the same in both EBCDIC and UTF-EBCDIC.  Often,
       documentation will use the term "UTF-8" to mean UTF-EBCDIC as well.
       This is the case in this document.

   Creating Unicode
       This section applies fully to Perls starting with v5.22.  Various
       caveats for earlier releases are in the "Earlier releases caveats"
       subsection below.

       To create Unicode characters in literals, use the "\N{...}"
       notation in double-quoted strings:

           my $smiley_from_name = "\N{WHITE SMILING FACE}";
           my $smiley_from_code_point = "\N{U+263a}";

       Similarly, they can be used in regular expression literals

           $smiley =~ /\N{WHITE SMILING FACE}/;
           $smiley =~ /\N{U+263a}/;

       or, starting in v5.32:

           $smiley =~ /\p{Name=WHITE SMILING FACE}/;
           $smiley =~ /\p{Name=whitesmilingface}/;

       At run-time you can use:

           use charnames ();
           my $hebrew_alef_from_name
                       = charnames::string_vianame("HEBREW LETTER ALEF");
           my $hebrew_alef_from_code_point = charnames::string_vianame("U+05D0");

       Naturally, ord() will do the reverse: it turns a character into a
       code point.

       There are other runtime options as well.  You can use pack():

           my $hebrew_alef_from_code_point = pack("U", 0x05d0);

       Or you can use chr(), though it is less convenient in the general
       case:

           $hebrew_alef_from_code_point = chr(utf8::unicode_to_native(0x05d0));
           utf8::upgrade($hebrew_alef_from_code_point);

       The utf8::unicode_to_native() and utf8::upgrade() calls aren't
       needed if the argument is above 0xFF, so the above could have been
       written as

           $hebrew_alef_from_code_point = chr(0x05d0);

       since 0x5d0 is above 255.

       "\x{}" and "\o{}" can also be used to specify code points at
       compile time in double-quotish strings, but, for backward
       compatibility with older Perls, the same rules apply as with chr()
       for code points less than 256.

       utf8::unicode_to_native() is used so that the Perl code is portable
       to EBCDIC platforms.  You can omit it if you're really sure no one
       will ever want to use your code on a non-ASCII platform.  Starting
       in Perl v5.22, calls to it on ASCII platforms are optimized out, so
       there's no performance penalty at all in adding it.  Or you can
       simply use the other constructs that don't require it.

       See "Further Resources" for how to find all these names and numeric
       codes.

   Earlier releases caveats
       On EBCDIC platforms, prior to v5.22, using "\N{U+...}" doesn't work
       properly.

       Prior to v5.16, using "\N{...}" with a character name (as opposed
       to a "U+..." code point) required a "use charnames :full".

       Prior to v5.14, there were some bugs in "\N{...}" with a character
       name (as opposed to a "U+..." code point).

       charnames::string_vianame() was introduced in v5.14.  Prior to
       that, charnames::vianame() should work, but only if the argument is
       of the form "U+...".  Your best bet there for runtime Unicode by
       character name is probably:

           use charnames ();
           my $hebrew_alef_from_name
                     = pack("U", charnames::vianame("HEBREW LETTER ALEF"));

   Handling Unicode
       Handling Unicode is for the most part transparent: just use the
       strings as usual.  Functions like index(), length(), and substr()
       will work on the Unicode characters; regular expressions will work
       on the Unicode characters (see perlunicode and perlretut).

       Note that Perl considers grapheme clusters to be separate
       characters, so for example

           print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"),
                 "\n";

       will print 2, not 1.  The only exception is that regular
       expressions have "\X" for matching an extended grapheme cluster.
       (Thus "\X" in a regular expression would match the entire sequence
       of both the example characters.)

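       A minimal sketch contrasting length() with "\X":

```perl
use strict;
use warnings;

my $s = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}";
print length($s), "\n";             # prints 2: two code points
my @clusters = $s =~ /(\X)/g;
print scalar(@clusters), "\n";      # prints 1: one grapheme cluster
```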
       Life is not quite so transparent, however, when working with legacy
       encodings, I/O, and certain special cases:

   Legacy Encodings
       When you combine legacy data and Unicode, the legacy data needs to
       be upgraded to Unicode.  Normally the legacy data is assumed to be
       ISO 8859-1 (or EBCDIC, if applicable).

       The "Encode" module knows about many encodings and has interfaces
       for doing conversions between those encodings:

           use Encode 'decode';
           $data = decode("iso-8859-3", $data); # convert from legacy

   Unicode I/O
       Normally, writing out Unicode data

           print FH $some_string_with_unicode, "\n";

       produces raw bytes that Perl happens to use to internally encode
       the Unicode string.  Perl's internal encoding depends on the system
       as well as on what characters happen to be in the string at the
       time.  If any of the characters are at code points 0x100 or above,
       you will get a warning.  To ensure that the output is explicitly
       rendered in the encoding you desire--and to avoid the warning--open
       the stream with the desired encoding.  Some examples:

           open FH, ">:utf8", "file";

           open FH, ">:encoding(ucs2)",      "file";
           open FH, ">:encoding(UTF-8)",     "file";
           open FH, ">:encoding(shift_jis)", "file";

       and on already open streams, use binmode():

           binmode(STDOUT, ":utf8");

           binmode(STDOUT, ":encoding(ucs2)");
           binmode(STDOUT, ":encoding(UTF-8)");
           binmode(STDOUT, ":encoding(shift_jis)");

       The matching of encoding names is loose: case does not matter, and
       many encodings have several aliases.  Note that the ":utf8" layer
       must always be specified exactly like that; it is not subject to
       the loose matching of encoding names.  Also note that currently
       ":utf8" is unsafe for input, because it accepts the data without
       validating that it is indeed valid UTF-8; you should instead use
       ":encoding(UTF-8)" (with or without a hyphen).

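       As an illustration, an ":encoding(UTF-8)" layer decodes raw bytes
       into characters as they are read; a minimal sketch using an
       in-memory file handle makes the round trip easy to see:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Raw UTF-8 bytes for "caf\x{E9}\n", read back through a decoding layer:
my $bytes = encode("UTF-8", "caf\x{E9}\n");
open my $fh, "<:encoding(UTF-8)", \$bytes or die $!;
my $line = <$fh>;
close $fh;
print $line eq "caf\x{E9}\n" ? "decoded\n" : "mismatch\n";
```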
       See PerlIO for the ":utf8" layer, PerlIO::encoding and
       Encode::PerlIO for the ":encoding()" layer, and Encode::Supported
       for many encodings supported by the "Encode" module.

       Reading in a file that you know happens to be encoded in one of the
       Unicode or legacy encodings does not magically turn the data into
       Unicode in Perl's eyes.  To do that, specify the appropriate layer
       when opening files:

           open(my $fh, '<:encoding(UTF-8)', 'anything');
           my $line_of_unicode = <$fh>;

           open(my $fh, '<:encoding(Big5)', 'anything');
           my $line_of_unicode = <$fh>;

       The I/O layers can also be specified more flexibly with the "open"
       pragma.  See open, or look at the following example.

           use open ':encoding(UTF-8)'; # input/output default encoding will be
                                        # UTF-8
           open X, ">file";
           print X chr(0x100), "\n";
           close X;
           open Y, "<file";
           printf "%#x\n", ord(<Y>); # this should print 0x100
           close Y;

       With the "open" pragma you can use the ":locale" layer

           BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
           # the :locale will probe the locale environment variables like
           # LC_ALL
           use open OUT => ':locale'; # russki parusski
           open(O, ">koi8");
           print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
           close O;
           open(I, "<koi8");
           printf "%#x\n", ord(<I>); # this should print 0xc1
           close I;

       These methods install a transparent filter on the I/O stream that
       converts data from the specified encoding when it is read in from
       the stream.  The result is always Unicode.

       The open pragma affects all the open() calls after the pragma by
       setting default layers.  If you want to affect only certain
       streams, use explicit layers directly in the open() call.

       You can switch encodings on an already opened stream by using
       binmode(); see "binmode" in perlfunc.

       The ":locale" layer does not currently work with open() and
       binmode(), only with the "open" pragma.  The ":utf8" and
       ":encoding(...)" layers do work with all of open(), binmode(), and
       the "open" pragma.

       Similarly, you may use these I/O layers on output streams to
       automatically convert Unicode to the specified encoding when it is
       written to the stream.  For example, the following snippet copies
       the contents of the file "text.jis" (encoded as ISO-2022-JP, aka
       JIS) to the file "text.utf8", encoded as UTF-8:

           open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
           open(my $unicode, '>:utf8',                  'text.utf8');
           while (<$nihongo>) { print $unicode $_ }

       The naming of encodings, both by the open() and by the "open"
       pragma, allows for flexible names: "koi8-r" and "KOI8R" will both
       be understood.

       Common encoding names used by ISO, MIME, IANA, and various other
       standardisation organisations are recognised; for a more detailed
       list see Encode::Supported.

       read() reads characters and returns the number of characters.
       seek() and tell() operate on byte counts, as does sysseek().

       sysread() and syswrite() should not be used on file handles with
       character encoding layers; they behave badly, and that behaviour
       has been deprecated since perl 5.24.

       Notice that because of the default behaviour of not doing any
       conversion upon input if there is no default layer, it is easy to
       mistakenly write code that keeps on expanding a file by repeatedly
       encoding the data:

           # BAD CODE WARNING
           open F, "file";
           local $/; ## read in the whole file of 8-bit characters
           $t = <F>;
           close F;
           open F, ">:encoding(UTF-8)", "file";
           print F $t; ## convert to UTF-8 on output
           close F;

       If you run this code twice, the contents of the file will be
       UTF-8 encoded twice.  A "use open ':encoding(UTF-8)'", or
       explicitly opening the file for input as UTF-8 as well, would have
       avoided the bug.

       NOTE: the ":utf8" and ":encoding" features work only if your Perl
       has been built with PerlIO, which is the default on most systems.

   Displaying Unicode As Text
       Sometimes you might want to display Perl scalars containing Unicode
       as simple ASCII (or EBCDIC) text.  The following subroutine
       converts its argument so that Unicode characters with code points
       greater than 255 are displayed as "\x{...}", control characters
       (like "\n") are displayed as "\x..", and the rest of the characters
       as themselves:

           sub nice_string {
               join("",
                 map { $_ > 255                     # if wide character...
                       ? sprintf("\\x{%04X}", $_)   # \x{...}
                       : chr($_) =~ /[[:cntrl:]]/   # else if control character...
                         ? sprintf("\\x%02X", $_)   # \x..
                         : quotemeta(chr($_))       # else quoted or as themselves
                     } unpack("W*", $_[0]));        # unpack Unicode characters
           }

       For example,

           nice_string("foo\x{100}bar\n")

       returns the string

           'foo\x{0100}bar\x0A'

       which is ready to be printed.

       ("\\x{}" is used here instead of "\\N{}", since it's most likely
       that you want to see what the native values are.)

   Special Cases
       •   Starting in Perl 5.28, it is illegal for bit operators, like
           "~", to operate on strings containing code points above 255.

       •   The vec() function may produce surprising results if used on
           strings containing characters with ordinal values above 255.
           In such a case, the results are consistent with the internal
           encoding of the characters, but not with much else.  So don't
           do that; starting in Perl 5.28, a deprecation message is issued
           if you do so, and this usage becomes illegal in Perl 5.32.

       •   Peeking At Perl's Internal Encoding

           Normal users of Perl should never care how Perl encodes any
           particular Unicode string (because the normal ways to get at
           the contents of a string with Unicode--via input and
           output--should always be via explicitly-defined I/O layers).
           But if you must, there are two ways of looking behind the
           scenes.

           One way of peeking inside the internal encoding of Unicode
           characters is to use "unpack("C*", ...)" to get the bytes of
           whatever the string encoding happens to be, or
           "unpack("U0..", ...)" to get the bytes of the UTF-8 encoding:

               # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
               print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";

           Yet another way would be to use the Devel::Peek module:

               perl -MDevel::Peek -e 'Dump(chr(0x100))'

           That shows the "UTF8" flag in FLAGS and both the UTF-8 bytes
           and Unicode characters in "PV".  See also later in this
           document the discussion about the utf8::is_utf8() function.

   Advanced Topics
       •   String Equivalence

           The question of string equivalence turns somewhat complicated
           in Unicode: what do you mean by "equal"?

           (Is "LATIN CAPITAL LETTER A WITH ACUTE" equal to "LATIN
           CAPITAL LETTER A"?)

           The short answer is that by default Perl compares equivalence
           ("eq", "ne") based only on code points of the characters.  In
           the above case, the answer is no (because 0x00C1 != 0x0041).
           But sometimes, any CAPITAL LETTER A's should be considered
           equal, or even A's of any case.

           The long answer is that you need to consider character
           normalization and casing issues: see Unicode::Normalize,
           Unicode Technical Report #15, Unicode Normalization Forms
           <https://www.unicode.org/reports/tr15>, and sections on case
           mapping in the Unicode Standard <https://www.unicode.org>.

           As of Perl 5.8.0, the "Full" case-folding of Case
           Mappings/SpecialCasing is implemented, but bugs remain in
           "qr//i" with them, mostly fixed by 5.14, and essentially
           entirely by 5.18.

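           A minimal sketch of an equality check that ignores both
           composed/decomposed differences and case, using
           Unicode::Normalize together with fc() (available with "use
           feature 'fc'" since v5.16):

```perl
use strict;
use warnings;
use feature 'fc';
use Unicode::Normalize qw(NFD);

my $precomposed = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";
my $decomposed  = "a\N{COMBINING ACUTE ACCENT}";

# Code-point-wise the strings differ...
print $precomposed eq $decomposed ? "eq\n" : "ne\n";    # prints "ne"
# ...but after normalizing and case folding they compare equal:
print fc(NFD($precomposed)) eq fc(NFD($decomposed)) ? "eq\n" : "ne\n";
```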
       •   String Collation

           People like to see their strings nicely sorted--or as Unicode
           parlance goes, collated.  But again, what do you mean by
           collate?

           (Does "LATIN CAPITAL LETTER A WITH ACUTE" come before or after
           "LATIN CAPITAL LETTER A WITH GRAVE"?)

           The short answer is that by default, Perl compares strings
           ("lt", "le", "cmp", "ge", "gt") based only on the code points
           of the characters.  In the above case, the answer is "after",
           since 0x00C1 > 0x00C0.

           The long answer is that "it depends", and a good answer cannot
           be given without knowing (at the very least) the language
           context.  See Unicode::Collate, and the Unicode Collation
           Algorithm <https://www.unicode.org/reports/tr10/>.

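           A minimal sketch of the core Unicode::Collate module, which
           implements the Unicode Collation Algorithm:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new();

# Code-point order puts "B" (0x42) before "a" (0x61); the UCA's
# default collation compares letters case-insensitively at the
# primary level, so "a" sorts before "B":
my @sorted = $collator->sort("B", "a");
print "@sorted\n";    # prints "a B"
```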
   Miscellaneous
       •   Character Ranges and Classes

           Character ranges in regular expression bracketed character
           classes (e.g., "/[a-z]/") and in the "tr///" (also known as
           "y///") operator are not magically Unicode-aware.  What this
           means is that "[A-Za-z]" will not magically start to mean "all
           alphabetic letters" (not that it does mean that even for 8-bit
           characters; for those, if you are using locales (perllocale),
           use "/[[:alpha:]]/"; and if not, use the 8-bit-aware property
           "\p{alpha}").

           All the properties that begin with "\p" (and its inverse "\P")
           are actually character classes that are Unicode-aware.  There
           are dozens of them; see perluniprops.

           Starting in v5.22, you can use Unicode code points as the end
           points of regular expression pattern character ranges, and the
           range will include all Unicode code points that lie between
           those end points, inclusive.

               qr/ [ \N{U+03} - \N{U+20} ] /xx

           includes the code points "\N{U+03}", "\N{U+04}", ...,
           "\N{U+20}".

           This also works for ranges in "tr///" starting in Perl v5.24.

       •   String-To-Number Conversions

           Unicode does define several other decimal--and
           numeric--characters besides the familiar 0 to 9, such as the
           Arabic and Indic digits.  Perl does not support
           string-to-number conversion for digits other than ASCII 0 to 9
           (and ASCII "a" to "f" for hexadecimal).  To get safe
           conversions from any Unicode string, use "num()" in
           Unicode::UCD.

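           A minimal sketch of "num()" from the core Unicode::UCD module:

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# ASCII digits convert as usual; non-ASCII digits need num():
my $arabic_indic = "\N{ARABIC-INDIC DIGIT FOUR}\N{ARABIC-INDIC DIGIT TWO}";
print num("42"), "\n";             # prints 42
print num($arabic_indic), "\n";    # also prints 42
```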
   Questions With Answers
       •   Will My Old Scripts Break?

           Very probably not.  Unless you are generating Unicode
           characters somehow, old behaviour should be preserved.  About
           the only behaviour that has changed and which could start
           generating Unicode is the old behaviour of chr(), where
           supplying an argument of more than 255 produced a character
           modulo 255.  chr(300), for example, was equal to chr(45) or
           "-" (in ASCII); now it is LATIN CAPITAL LETTER I WITH BREVE.

       •   How Do I Make My Scripts Work With Unicode?

           Very little work should be needed since nothing changes until
           you generate Unicode data.  The most important thing is
           getting input as Unicode; for that, see the earlier I/O
           discussion.  To get full seamless Unicode support, add "use
           feature 'unicode_strings'" (or "use v5.12" or higher) to your
           script.

       •   How Do I Know Whether My String Is In Unicode?

           You shouldn't have to care.  But you may, if your Perl is
           before 5.14.0 or you haven't specified "use feature
           'unicode_strings'" or "use 5.012" (or higher), because
           otherwise the rules for the code points in the range 128 to
           255 are different depending on whether the string they are
           contained within is in Unicode or not.  (See "When Unicode
           Does Not Happen" in perlunicode.)

           To determine if a string is in Unicode, use:

               print utf8::is_utf8($string) ? 1 : 0, "\n";

           But note that this doesn't mean that any of the characters in
           the string are necessarily UTF-8 encoded, or that any of the
           characters have code points greater than 0xFF (255) or even
           0x80 (128), or that the string has any characters at all.  All
           that is_utf8() does is to return the value of the internal
           "utf8ness" flag attached to the $string.  If the flag is off,
           the bytes in the scalar are interpreted as a single byte
           encoding.  If the flag is on, the bytes in the scalar are
           interpreted as the (variable-length, potentially multi-byte)
           UTF-8 encoded code points of the characters.  Bytes added to a
           UTF-8 encoded string are automatically upgraded to UTF-8.  If
           mixed non-UTF-8 and UTF-8 scalars are merged (double-quoted
           interpolation, explicit concatenation, or printf/sprintf
           parameter substitution), the result will be UTF-8 encoded as
           if copies of the byte strings were upgraded to UTF-8: for
           example,

               $a = "ab\x80c";
               $b = "\x{100}";
               print "$a = $b\n";

           the output string will be UTF-8-encoded "ab\x80c = \x{100}\n",
           but $a will stay byte-encoded.

           Sometimes you might really need to know the byte length of a
           string instead of the character length.  For that use the
           "bytes" pragma and the length() function:

               my $unicode = chr(0x100);
               print length($unicode), "\n"; # will print 1
               use bytes;
               print length($unicode), "\n"; # will print 2
                                             # (the 0xC4 0x80 of the UTF-8)
               no bytes;

706 • How Do I Find Out What Encoding a File Has?
707
708 You might try Encode::Guess, but it has a number of limitations.
709
710 • How Do I Detect Data That's Not Valid In a Particular Encoding?
711
712 Use the "Encode" package to try converting it. For example,
713
714 use Encode 'decode';
715
716 if (eval { decode('UTF-8', $string, Encode::FB_CROAK); 1 }) {
717 # $string is valid UTF-8
718 } else {
719 # $string is not valid UTF-8
720 }
721
722 Or use "unpack" to try decoding it:
723
724 use warnings;
725 @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
726
727 If invalid, a "Malformed UTF-8 character" warning is produced. The
728 "C0" means "process the string character per character". Without
729 that, the "unpack("U*", ...)" would work in "U0" mode (the default
730 if the format string starts with "U") and it would return the bytes
731 making up the UTF-8 encoding of the target string, something that
732 will always work.
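
           The unpack check above can be wrapped into a small helper that
           traps the warning.  A sketch: looks_like_utf8() is a
           hypothetical name, and it relies on catching the "Malformed
           UTF-8 character" warning with a local __WARN__ handler:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $bytes decodes cleanly as UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $malformed = 0;
    local $SIG{__WARN__} = sub { $malformed++ if $_[0] =~ /Malformed/ };
    my @chars = unpack("C0U*", $bytes);
    return $malformed == 0;
}

print looks_like_utf8("\xC4\x80") ? "valid\n" : "invalid\n";  # UTF-8 for U+0100
print looks_like_utf8("\xC4")     ? "valid\n" : "invalid\n";  # truncated sequence
```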

       •   How Do I Convert Binary Data Into a Particular Encoding, Or
           Vice Versa?

           This probably isn't as useful as you might think.  Normally,
           you shouldn't need to.

           In one sense, what you are asking doesn't make much sense:
           encodings are for characters, and binary data are not
           "characters", so converting "data" into some encoding isn't
           meaningful unless you know what character set and encoding the
           binary data is in, in which case it's not just binary data, now
           is it?

           If you have a raw sequence of bytes that you know should be
           interpreted via a particular encoding, you can use "Encode":

               use Encode 'from_to';
               from_to($data, "iso-8859-1", "UTF-8"); # from latin-1 to UTF-8

           The call to from_to() changes the bytes in $data, but nothing
           material about the nature of the string has changed as far as
           Perl is concerned.  Both before and after the call, the string
           $data contains just a bunch of 8-bit bytes.  As far as Perl is
           concerned, the encoding of the string remains as "system-native
           8-bit bytes".

           You might relate this to a fictional 'Translate' module:

               use Translate;
               my $phrase = "Yes";
               Translate::from_to($phrase, 'english', 'deutsch');
               ## phrase now contains "Ja"

           The contents of the string change, but not the nature of the
           string.  Perl doesn't know any more after the call than before
           that the contents of the string indicate the affirmative.

           Back to converting data.  If you have (or want) data in your
           system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.),
           you can use pack/unpack to convert to/from Unicode.

               $native_string  = pack("W*", unpack("U*", $Unicode_string));
               $Unicode_string = pack("U*", unpack("W*", $native_string));

           If you have a sequence of bytes you know is valid UTF-8, but
           Perl doesn't know it yet, you can make Perl a believer, too:

               $Unicode = $bytes;
               utf8::decode($Unicode);

           or:

               $Unicode = pack("U0a*", $bytes);

           You can find the bytes that make up a UTF-8 sequence with

               @bytes = unpack("C*", $Unicode_string)

           and you can create well-formed Unicode with

               $Unicode_string = pack("U*", 0xff, ...)
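
           As a concrete sketch of the pack/unpack conversions above
           (using a hypothetical Latin-1 string; the "W" template requires
           Perl 5.10 or later):

```perl
use strict;
use warnings;

my $native = "caf\xE9";    # "café" in the native 8-bit encoding (Latin-1)

# Native 8-bit -> Unicode string, and back again.
my $unicode = pack("U*", unpack("W*", $native));
my $back    = pack("W*", unpack("U*", $unicode));

print length($unicode), "\n";                          # 4 characters
print $back eq $native ? "round trip ok" : "mismatch", "\n";
```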

       •   How Do I Display Unicode?  How Do I Input Unicode?

           See <http://www.alanwood.net/unicode/> and
           <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

       •   How Does Unicode Work With Traditional Locales?

           If your locale is a UTF-8 locale, starting in Perl v5.26, Perl
           works well for all categories; before this, starting with Perl
           v5.20, it works for all categories but "LC_COLLATE", which
           deals with sorting and the "cmp" operator.  But note that the
           standard "Unicode::Collate" and "Unicode::Collate::Locale"
           modules offer much more powerful solutions to collation issues,
           and work on earlier releases.

           For other locales, starting in Perl 5.16, you can specify

               use locale ':not_characters';

           to get Perl to work well with them.  The catch is that you have
           to translate from the locale character set to/from Unicode
           yourself.  See "Unicode I/O" above for how to

               use open ':locale';

           to accomplish this, but full details are in "Unicode and UTF-8"
           in perllocale, including gotchas that happen if you don't
           specify ":not_characters".

   Hexadecimal Notation
       The Unicode standard prefers using hexadecimal notation because
       that more clearly shows the division of Unicode into blocks of 256
       characters.  Hexadecimal is also simply shorter than decimal.  You
       can use decimal notation, too, but learning to use hexadecimal just
       makes life easier with the Unicode standard.  The "U+HHHH" notation
       uses hexadecimal, for example.

       The "0x" prefix means a hexadecimal number, the digits are 0-9 and
       a-f (or A-F, case doesn't matter).  Each hexadecimal digit
       represents four bits, or half a byte.  "print 0x..., "\n"" will
       show a hexadecimal number in decimal, and "printf "%x\n", $decimal"
       will show a decimal number in hexadecimal.  If you have just the
       "hex digits" of a hexadecimal number, you can use the hex()
       function.

           print 0x0009, "\n";    # 9
           print 0x000a, "\n";    # 10
           print 0x000f, "\n";    # 15
           print 0x0010, "\n";    # 16
           print 0x0011, "\n";    # 17
           print 0x0100, "\n";    # 256

           print 0x0041, "\n";    # 65

           printf "%x\n",  65;    # 41
           printf "%#x\n", 65;    # 0x41

           print hex("41"), "\n"; # 65

   Further Resources
       •   Unicode Consortium

           <https://www.unicode.org/>

       •   Unicode FAQ

           <https://www.unicode.org/faq/>

       •   Unicode Glossary

           <https://www.unicode.org/glossary/>

       •   Unicode Recommended Reading List

           The Unicode Consortium has a list of articles and books, some
           of which give a much more in-depth treatment of Unicode:
           <http://unicode.org/resources/readinglist.html>

       •   Unicode Useful Resources

           <https://www.unicode.org/unicode/onlinedat/resources.html>

       •   Unicode and Multilingual Support in HTML, Fonts, Web Browsers
           and Other Applications

           <http://www.alanwood.net/unicode/>

       •   UTF-8 and Unicode FAQ for Unix/Linux

           <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

       •   Legacy Character Sets

           <http://www.czyborra.com/>
           <http://www.eki.ee/letter/>

       •   You can explore various information from the Unicode data files
           using the "Unicode::UCD" module.
UNICODE IN OLDER PERLS
       If you cannot upgrade your Perl to 5.8.0 or later, you can still do
       some Unicode processing by using the modules "Unicode::String",
       "Unicode::Map8", and "Unicode::Map", available from CPAN.  If you
       have the GNU recode installed, you can also use the Perl front-end
       "Convert::Recode" for character conversions.

       The following are fast conversions from ISO 8859-1 (Latin-1) bytes
       to UTF-8 bytes and back; the code works even with older Perl 5
       versions.

           # ISO 8859-1 to UTF-8
           s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

           # UTF-8 to ISO 8859-1
           s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

SEE ALSO
       perlunitut, perlunicode, Encode, open, utf8, bytes, perlretut,
       perlrun, Unicode::Collate, Unicode::Normalize, Unicode::UCD

ACKNOWLEDGMENTS
       Thanks to the kind readers of the perl5-porters@perl.org,
       perl-unicode@perl.org, linux-utf8@nl.linux.org, and
       unicore@unicode.org mailing lists for their valuable feedback.

AUTHOR, COPYRIGHT, AND LICENSE
       Copyright 2001-2011 Jarkko Hietaniemi <jhi@iki.fi>.  Now maintained
       by Perl 5 Porters.

       This document may be distributed under the same terms as Perl
       itself.



perl v5.38.2                      2023-11-30                 PERLUNIINTRO(1)