PERLUNIINTRO(1)        Perl Programmers Reference Guide        PERLUNIINTRO(1)



NAME
       perluniintro - Perl Unicode introduction

DESCRIPTION
       This document gives a general idea of Unicode and how to use Unicode
       in Perl.  See "Further Resources" for references to more in-depth
       treatments of Unicode.

   Unicode
       Unicode is a character set standard which plans to codify all of the
       writing systems of the world, plus many other symbols.

       Unicode and ISO/IEC 10646 are coordinated standards that unify almost
       all other modern character set standards, covering more than 80
       writing systems and hundreds of languages, including all
       commercially-important modern languages.  All characters in the
       largest Chinese, Japanese, and Korean dictionaries are also encoded.
       The standards will eventually cover almost all characters in more than
       250 writing systems and thousands of languages.  Unicode 1.0 was
       released in October 1991, and 6.0 in October 2010.

       A Unicode character is an abstract entity.  It is not bound to any
       particular integer width, especially not to the C language "char".
       Unicode is language-neutral and display-neutral: it does not encode
       the language of the text, and it does not generally define fonts or
       other graphical layout details.  Unicode operates on characters and on
       text built from those characters.

       Unicode defines characters like "LATIN CAPITAL LETTER A" or "GREEK
       SMALL LETTER ALPHA" and unique numbers for the characters, in this
       case 0x0041 and 0x03B1, respectively.  These unique numbers are called
       code points.  A code point is essentially the position of the
       character within the set of all possible Unicode characters, and thus
       in Perl, the term ordinal is often used interchangeably with it.

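       To make the notion of code points concrete: "ord()" reports a
       character's code point, and "chr()" maps a code point back to its
       character.  A small sketch (not from the original examples; it assumes
       an ASCII platform):

```perl
use strict;
use warnings;

# On an ASCII platform, "A" has code point 0x41.
printf "U+%04X\n", ord("A");          # U+0041

# chr() is the inverse of ord().
my $alpha = chr(0x03B1);              # GREEK SMALL LETTER ALPHA
printf "U+%04X\n", ord($alpha);       # U+03B1
```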
       The Unicode standard prefers using hexadecimal notation for the code
       points.  If numbers like 0x0041 are unfamiliar to you, take a peek at
       a later section, "Hexadecimal Notation".  The Unicode standard uses
       the notation "U+0041 LATIN CAPITAL LETTER A", to give the hexadecimal
       code point and the normative name of the character.

       Unicode also defines various properties for the characters, like
       "uppercase" or "lowercase", "decimal digit", or "punctuation"; these
       properties are independent of the names of the characters.
       Furthermore, various operations on the characters like uppercasing,
       lowercasing, and collating (sorting) are defined.

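       In Perl these properties are visible through "\p{...}" in regular
       expressions.  A brief sketch of how a property, unlike a name, cuts
       across scripts:

```perl
use strict;
use warnings;

# The Uppercase property is independent of any character's name or script.
print "A"         =~ /\p{Uppercase}/ ? "yes" : "no", "\n";  # yes
print chr(0x0391) =~ /\p{Uppercase}/ ? "yes" : "no", "\n";  # yes (GREEK CAPITAL LETTER ALPHA)
print chr(0x03B1) =~ /\p{Uppercase}/ ? "yes" : "no", "\n";  # no  (GREEK SMALL LETTER ALPHA)
```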
       A Unicode logical "character" can actually consist of more than one
       internal actual "character" or code point.  For Western languages,
       this is adequately modelled by a base character (like "LATIN CAPITAL
       LETTER A") followed by one or more modifiers (like "COMBINING ACUTE
       ACCENT").  This sequence of base character and modifiers is called a
       combining character sequence.  Some non-western languages require more
       complicated models, so Unicode created the grapheme cluster concept,
       which was later further refined into the extended grapheme cluster.
       For example, a Korean Hangul syllable is considered a single logical
       character, but most often consists of three actual Unicode characters:
       a leading consonant followed by an interior vowel followed by a
       trailing consonant.

       Whether to call these extended grapheme clusters "characters" depends
       on your point of view.  If you are a programmer, you probably would
       tend towards seeing each element in the sequences as one unit, or
       "character".  However, from the user's point of view, the whole
       sequence could be seen as one "character", since that's probably what
       it looks like in the context of the user's language.  In this
       document, we take the programmer's point of view: one "character" is
       one Unicode code point.

       For some combinations of base character and modifiers, there are
       precomposed characters.  There is a single character equivalent, for
       example, for the sequence "LATIN CAPITAL LETTER A" followed by
       "COMBINING ACUTE ACCENT".  It is called "LATIN CAPITAL LETTER A WITH
       ACUTE".  These precomposed characters are, however, only available for
       some combinations, and are mainly meant to support round-trip
       conversions between Unicode and legacy standards (like ISO 8859).
       Using sequences, as Unicode does, allows for needing fewer basic
       building blocks (code points) to express many more potential grapheme
       clusters.  To support conversion between equivalent forms, various
       normalization forms are also defined.  Thus, "LATIN CAPITAL LETTER A
       WITH ACUTE" is in Normalization Form Composed (abbreviated NFC), and
       the sequence "LATIN CAPITAL LETTER A" followed by "COMBINING ACUTE
       ACCENT" represents the same character in Normalization Form Decomposed
       (NFD).

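       The core Unicode::Normalize module converts between these forms.  A
       short sketch:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{C1}";      # LATIN CAPITAL LETTER A WITH ACUTE
my $decomposed = "A\x{301}";    # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# The two spellings are different code point sequences...
print $composed eq $decomposed ? "same" : "different", "\n";    # different

# ...but each normalization form maps them to a single spelling.
print NFC($decomposed) eq $composed   ? "ok" : "not ok", "\n";  # ok
print NFD($composed)   eq $decomposed ? "ok" : "not ok", "\n";  # ok
```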
       Because of backward compatibility with legacy encodings, the "a unique
       number for every character" idea breaks down a bit: instead, there is
       "at least one number for every character".  The same character could
       be represented differently in several legacy encodings.  The converse
       is not true: some code points do not have an assigned character.
       Firstly, there are unallocated code points within otherwise used
       blocks.  Secondly, there are special Unicode control characters that
       do not represent true characters.

       When Unicode was first conceived, it was thought that all the world's
       characters could be represented using a 16-bit word; that is, a
       maximum of 0x10000 (or 65,536) characters would be needed, from 0x0000
       to 0xFFFF.  This soon proved to be wrong, and since Unicode 2.0 (July
       1996), Unicode has been defined all the way up to 21 bits (0x10FFFF),
       and Unicode 3.1 (March 2001) defined the first characters above
       0xFFFF.  The first 0x10000 characters are called Plane 0, or the Basic
       Multilingual Plane (BMP).  With Unicode 3.1, 17 (yes, seventeen)
       planes in all were defined--but they are nowhere near full of defined
       characters, yet.

       When a new language is being encoded, Unicode generally will choose a
       "block" of consecutive unallocated code points for its characters.  So
       far, the number of code points in these blocks has always been evenly
       divisible by 16.  Extras in a block, not currently needed, are left
       unallocated, for future growth.  But there have been occasions when a
       later release needed more code points than the available extras, and a
       new block had to be allocated somewhere else, not contiguous to the
       initial one, to handle the overflow.  Thus, it became apparent early
       on that "block" wasn't an adequate organizing principle, and so the
       "Script" property was created.  (Later an improved script property was
       added as well, the "Script_Extensions" property.)  Those code points
       that are in overflow blocks can still have the same script as the
       original ones.  The script concept fits more closely with natural
       language: there is "Latin" script, "Greek" script, and so on; and
       there are several artificial scripts, like "Common" for characters
       that are used in multiple scripts, such as mathematical symbols.
       Scripts usually span varied parts of several blocks.  For more
       information about scripts, see "Scripts" in perlunicode.  The division
       into blocks exists, but it is almost completely accidental--an
       artifact of how the characters have been and still are allocated.
       (Note that this paragraph has oversimplified things for the sake of
       this being an introduction.  Unicode doesn't really encode languages,
       but the writing systems for them--their scripts; and one script can be
       used by many languages.  Unicode also encodes things that aren't
       really about languages, such as symbols like "BAGGAGE CLAIM".)

       The Unicode code points are just abstract numbers.  To input and
       output these abstract numbers, the numbers must be encoded or
       serialised somehow.  Unicode defines several character encoding forms,
       of which UTF-8 is the most popular.  UTF-8 is a variable length
       encoding that encodes Unicode characters as 1 to 4 bytes.  Other
       encodings include UTF-16 and UTF-32 and their big- and little-endian
       variants (UTF-8 is byte-order independent).  ISO/IEC 10646 defines the
       UCS-2 and UCS-4 encoding forms.

       For more information about encodings--for instance, to learn what
       surrogates and byte order marks (BOMs) are--see perlunicode.

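       The Encode module (covered further below) can serialise the same
       abstract code point into several of these forms.  A sketch showing
       the differing byte counts:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $str = chr(0x100);    # LATIN CAPITAL LETTER A WITH MACRON: one character

print length($str), "\n";                        # 1 (character)
print length(encode('UTF-8',    $str)), "\n";    # 2 (bytes: 0xC4 0x80)
print length(encode('UTF-16BE', $str)), "\n";    # 2 (bytes: 0x01 0x00)
print length(encode('UTF-32BE', $str)), "\n";    # 4 (bytes)
```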
   Perl's Unicode Support
       Starting from Perl v5.6.0, Perl has had the capacity to handle Unicode
       natively.  Perl v5.8.0, however, is the first recommended release for
       serious Unicode work.  The maintenance release 5.6.1 fixed many of the
       problems of the initial Unicode implementation, but for example
       regular expressions still do not work with Unicode in 5.6.1.  Perl
       v5.14.0 is the first release where Unicode support is (almost)
       seamlessly integrable without some gotchas.  (There are a few
       exceptions.  Firstly, some differences in quotemeta were fixed
       starting in Perl 5.16.0.  Secondly, some differences in the range
       operator were fixed starting in Perl 5.26.0.  Thirdly, some
       differences in split were fixed starting in Perl 5.28.0.)

       To enable this seamless support, you should "use feature
       'unicode_strings'" (which is automatically selected if you "use 5.012"
       or higher).  See feature.  (5.14 also fixes a number of bugs and
       departures from the Unicode standard.)

       Before Perl v5.8.0, "use utf8" was used to declare that operations in
       the current block or file would be Unicode-aware.  This model was
       found to be wrong, or at least clumsy: the "Unicodeness" is now
       carried with the data, instead of being attached to the operations.
       Starting with Perl v5.8.0, only one case remains where an explicit
       "use utf8" is needed: if your Perl script itself is encoded in UTF-8,
       you can use UTF-8 in your identifier names, and in string and regular
       expression literals, by saying "use utf8".  This is not the default
       because scripts with legacy 8-bit data in them would break.  See utf8.

   Perl's Unicode Model
       Perl supports both pre-5.6 strings of eight-bit native bytes, and
       strings of Unicode characters.  The general principle is that Perl
       tries to keep its data as eight-bit bytes for as long as possible, but
       as soon as Unicodeness cannot be avoided, the data is transparently
       upgraded to Unicode.  Prior to Perl v5.14.0, the upgrade was not
       completely transparent (see "The "Unicode Bug"" in perlunicode), and
       for backwards compatibility, full transparency is not gained unless
       "use feature 'unicode_strings'" (see feature) or "use 5.012" (or
       higher) is selected.

       Internally, Perl currently uses either the native eight-bit character
       set of the platform (for example Latin-1) or UTF-8 to encode Unicode
       strings.  Specifically, if all code points in the string are 0xFF or
       less, Perl uses the native eight-bit character set.  Otherwise, it
       uses UTF-8.

       A user of Perl does not normally need to know nor care how Perl
       happens to encode its internal strings, but it becomes relevant when
       outputting Unicode strings to a stream without a PerlIO layer (one
       with the "default" encoding).  In such a case, the raw bytes used
       internally (the native character set or UTF-8, as appropriate for each
       string) will be used, and a "Wide character" warning will be issued if
       those strings contain a character beyond 0x00FF.

       For example,

           perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

       produces a fairly useless mixture of native bytes and UTF-8, as well
       as a warning:

           Wide character in print at ...

       To output UTF-8, use the ":encoding" or ":utf8" output layer.
       Prepending

           binmode(STDOUT, ":utf8");

       to this sample program ensures that the output is completely UTF-8,
       and removes the program's warning.

       You can enable automatic UTF-8-ification of your standard file
       handles, default "open()" layer, and @ARGV by using either the "-C"
       command line switch or the "PERL_UNICODE" environment variable; see
       perlrun for the documentation of the "-C" switch.

       Note that this means that Perl expects other software to work the same
       way: if Perl has been led to believe that STDIN should be UTF-8, but
       then STDIN coming in from another command is not UTF-8, Perl will
       likely complain about the malformed UTF-8.

       All features that combine Unicode and I/O also require using the new
       PerlIO feature.  Almost all Perl 5.8 platforms do use PerlIO, though:
       you can see whether yours does by running "perl -V" and looking for
       "useperlio=define".

   Unicode and EBCDIC
       Perl 5.8.0 added support for Unicode on EBCDIC platforms.  This
       support was allowed to lapse in later releases, but was revived in
       5.22.  Unicode support is somewhat more complex to implement, since
       additional conversions are needed.  See perlebcdic for more
       information.

       On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
       instead of UTF-8.  The difference is that UTF-8 is "ASCII-safe", in
       that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
       "EBCDIC-safe", in that all the basic characters (which includes all
       those that have ASCII equivalents, like "A", "0", "%", etc.) are the
       same in both EBCDIC and UTF-EBCDIC.  Often, documentation will use the
       term "UTF-8" to mean UTF-EBCDIC as well.  This is the case in this
       document.

   Creating Unicode
       This section applies fully to Perls starting with v5.22.  Various
       caveats for earlier releases are in the "Earlier releases caveats"
       subsection below.

       To create Unicode characters in literals, use the "\N{...}" notation
       in double-quoted strings:

           my $smiley_from_name = "\N{WHITE SMILING FACE}";
           my $smiley_from_code_point = "\N{U+263a}";

       Similarly, they can be used in regular expression literals:

           $smiley =~ /\N{WHITE SMILING FACE}/;
           $smiley =~ /\N{U+263a}/;

       At run-time you can use:

           use charnames ();
           my $hebrew_alef_from_name
                           = charnames::string_vianame("HEBREW LETTER ALEF");
           my $hebrew_alef_from_code_point
                           = charnames::string_vianame("U+05D0");

       Naturally, "ord()" will do the reverse: it turns a character into a
       code point.

       There are other runtime options as well.  You can use "pack()":

           my $hebrew_alef_from_code_point = pack("U", 0x05d0);

       Or you can use "chr()", though it is less convenient in the general
       case:

           $hebrew_alef_from_code_point = chr(utf8::unicode_to_native(0x05d0));
           utf8::upgrade($hebrew_alef_from_code_point);

       The "utf8::unicode_to_native()" and "utf8::upgrade()" aren't needed if
       the argument is above 0xFF, so the above could have been written as

           $hebrew_alef_from_code_point = chr(0x05d0);

       since 0x5d0 is above 255.

       "\x{}" and "\o{}" can also be used to specify code points at compile
       time in double-quotish strings, but, for backward compatibility with
       older Perls, the same rules apply as with "chr()" for code points less
       than 256.

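       For example, hexadecimal "\x{}" and octal "\o{}" (the latter available
       since Perl v5.14) denote the same code point in different bases.  A
       sketch using a code point above 255, where the rules are uniform
       across platforms:

```perl
use strict;
use warnings;

# 0x263A (WHITE SMILING FACE) is 23072 in octal.
my $smiley_hex   = "\x{263A}";
my $smiley_octal = "\o{23072}";

print $smiley_hex eq $smiley_octal ? "ok" : "not ok", "\n";  # ok
print $smiley_hex eq chr(0x263A)   ? "ok" : "not ok", "\n";  # ok
```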
       "utf8::unicode_to_native()" is used so that the Perl code is portable
       to EBCDIC platforms.  You can omit it if you're really sure no one
       will ever want to use your code on a non-ASCII platform.  Starting in
       Perl v5.22, calls to it on ASCII platforms are optimized out, so
       there's no performance penalty at all in adding it.  Or you can simply
       use the other constructs that don't require it.

       See "Further Resources" for how to find all these names and numeric
       codes.

   Earlier releases caveats
       On EBCDIC platforms, prior to v5.22, using "\N{U+...}" doesn't work
       properly.

       Prior to v5.16, using "\N{...}" with a character name (as opposed to a
       "U+..." code point) required a "use charnames :full".

       Prior to v5.14, there were some bugs in "\N{...}" with a character
       name (as opposed to a "U+..." code point).

       "charnames::string_vianame()" was introduced in v5.14.  Prior to that,
       "charnames::vianame()" should work, but only if the argument is of the
       form "U+...".  Your best bet there for runtime Unicode by character
       name is probably:

           use charnames ();
           my $hebrew_alef_from_name
                       = pack("U", charnames::vianame("HEBREW LETTER ALEF"));

   Handling Unicode
       Handling Unicode is for the most part transparent: just use the
       strings as usual.  Functions like "index()", "length()", and
       "substr()" will work on the Unicode characters; regular expressions
       will work on the Unicode characters (see perlunicode and perlretut).

       Note that Perl considers grapheme clusters to be separate characters,
       so for example

           print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"),
                 "\n";

       will print 2, not 1.  The only exception is that regular expressions
       have "\X" for matching an extended grapheme cluster.  (Thus "\X" in a
       regular expression would match the entire sequence of both the example
       characters.)

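       A sketch of the difference between counting code points and matching
       with "\X":

```perl
use strict;
use warnings;

my $s = "A\x{301}";    # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# length() counts code points...
print length($s), "\n";             # 2

# ...while \X matches a whole extended grapheme cluster.
my @clusters = $s =~ /(\X)/g;
print scalar(@clusters), "\n";      # 1
```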
       Life is not quite so transparent, however, when working with legacy
       encodings, I/O, and certain special cases:

   Legacy Encodings
       When you combine legacy data and Unicode, the legacy data needs to be
       upgraded to Unicode.  Normally the legacy data is assumed to be ISO
       8859-1 (or EBCDIC, if applicable).

       The "Encode" module knows about many encodings and has interfaces for
       doing conversions between those encodings:

           use Encode 'decode';
           $data = decode("iso-8859-3", $data); # convert from legacy

   Unicode I/O
       Normally, writing out Unicode data

           print FH $some_string_with_unicode, "\n";

       produces raw bytes that Perl happens to use to internally encode the
       Unicode string.  Perl's internal encoding depends on the system as
       well as what characters happen to be in the string at the time.  If
       any of the characters are at code points 0x100 or above, you will get
       a warning.  To ensure that the output is explicitly rendered in the
       encoding you desire--and to avoid the warning--open the stream with
       the desired encoding.  Some examples:

           open FH, ">:utf8", "file";

           open FH, ">:encoding(ucs2)",      "file";
           open FH, ">:encoding(UTF-8)",     "file";
           open FH, ">:encoding(shift_jis)", "file";

       and on already open streams, use "binmode()":

           binmode(STDOUT, ":utf8");

           binmode(STDOUT, ":encoding(ucs2)");
           binmode(STDOUT, ":encoding(UTF-8)");
           binmode(STDOUT, ":encoding(shift_jis)");

       The matching of encoding names is loose: case does not matter, and
       many encodings have several aliases.  Note that the ":utf8" layer must
       always be specified exactly like that; it is not subject to the loose
       matching of encoding names.  Also note that currently ":utf8" is
       unsafe for input, because it accepts the data without validating that
       it is indeed valid UTF-8; you should instead use ":encoding(UTF-8)"
       (with or without a hyphen).

       See PerlIO for the ":utf8" layer, PerlIO::encoding and Encode::PerlIO
       for the ":encoding()" layer, and Encode::Supported for many encodings
       supported by the "Encode" module.

       Reading in a file that you know happens to be encoded in one of the
       Unicode or legacy encodings does not magically turn the data into
       Unicode in Perl's eyes.  To do that, specify the appropriate layer
       when opening files:

           open(my $fh, '<:encoding(UTF-8)', 'anything');
           my $line_of_unicode = <$fh>;

           open(my $fh, '<:encoding(Big5)', 'anything');
           my $line_of_unicode = <$fh>;

       The I/O layers can also be specified more flexibly with the "open"
       pragma.  See open, or look at the following example.

           use open ':encoding(UTF-8)'; # input/output default encoding will be
                                        # UTF-8
           open X, ">file";
           print X chr(0x100), "\n";
           close X;
           open Y, "<file";
           printf "%#x\n", ord(<Y>); # this should print 0x100
           close Y;

       With the "open" pragma you can use the ":locale" layer

           BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
           # the :locale will probe the locale environment variables like
           # LC_ALL
           use open OUT => ':locale'; # russki parusski
           open(O, ">koi8");
           print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
           close O;
           open(I, "<koi8");
           printf "%#x\n", ord(<I>); # this should print 0xc1
           close I;

       These methods install a transparent filter on the I/O stream that
       converts data from the specified encoding when it is read in from the
       stream.  The result is always Unicode.

       The open pragma affects all the "open()" calls after the pragma by
       setting default layers.  If you want to affect only certain streams,
       use explicit layers directly in the "open()" call.

       You can switch encodings on an already opened stream by using
       "binmode()"; see "binmode" in perlfunc.

       The ":locale" does not currently work with "open()" and "binmode()",
       only with the "open" pragma.  The ":utf8" and ":encoding(...)" methods
       do work with all of "open()", "binmode()", and the "open" pragma.

       Similarly, you may use these I/O layers on output streams to
       automatically convert Unicode to the specified encoding when it is
       written to the stream.  For example, the following snippet copies the
       contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
       the file "text.utf8", encoded as UTF-8:

           open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
           open(my $unicode, '>:utf8', 'text.utf8');
           while (<$nihongo>) { print $unicode $_ }

       The naming of encodings, both by "open()" and by the "open" pragma,
       allows for flexible names: "koi8-r" and "KOI8R" will both be
       understood.

       Common encodings recognized by ISO, MIME, IANA, and various other
       standardisation organisations are supported; for a more detailed list
       see Encode::Supported.

       "read()" reads characters and returns the number of characters.
       "seek()" and "tell()" operate on byte counts, as does "sysseek()".

       "sysread()" and "syswrite()" should not be used on file handles with
       character encoding layers: they behave badly, and that behaviour has
       been deprecated since perl 5.24.

       Notice that because of the default behaviour of not doing any
       conversion upon input if there is no default layer, it is easy to
       mistakenly write code that keeps on expanding a file by repeatedly
       encoding the data:

           # BAD CODE WARNING
           open F, "file";
           local $/; ## read in the whole file of 8-bit characters
           $t = <F>;
           close F;
           open F, ">:encoding(UTF-8)", "file";
           print F $t; ## convert to UTF-8 on output
           close F;

       If you run this code twice, the contents of the file will be UTF-8
       encoded twice.  A "use open ':encoding(UTF-8)'" would have avoided the
       bug, as would explicitly opening the input file as UTF-8.

       NOTE: the ":utf8" and ":encoding" features work only if your Perl has
       been built with PerlIO, which is the default on most systems.

   Displaying Unicode As Text
       Sometimes you might want to display Perl scalars containing Unicode as
       simple ASCII (or EBCDIC) text.  The following subroutine converts its
       argument so that Unicode characters with code points greater than 255
       are displayed as "\x{...}", control characters (like "\n") are
       displayed as "\x..", and the rest of the characters as themselves:

           sub nice_string {
               join("",
                 map { $_ > 255                     # if wide character...
                       ? sprintf("\\x{%04X}", $_)   # \x{...}
                       : chr($_) =~ /[[:cntrl:]]/   # else if control character...
                         ? sprintf("\\x%02X", $_)   # \x..
                         : quotemeta(chr($_))       # else quoted or as themselves
                 } unpack("W*", $_[0]));            # unpack Unicode characters
           }

       For example,

           nice_string("foo\x{100}bar\n")

       returns the string

           'foo\x{0100}bar\x0A'

       which is ready to be printed.

       ("\\x{}" is used here instead of "\\N{}", since it's most likely that
       you want to see what the native values are.)

   Special Cases
       ·   Starting in Perl 5.28, it is illegal for bit operators, like "~",
           to operate on strings containing code points above 255.

       ·   The vec() function may produce surprising results if used on
           strings containing characters with ordinal values above 255.  In
           such a case, the results are consistent with the internal encoding
           of the characters, but not with much else.  So don't do that;
           starting in Perl 5.28, a deprecation message is issued if you do
           so, and this became illegal in Perl 5.32.

       ·   Peeking At Perl's Internal Encoding

           Normal users of Perl should never care how Perl encodes any
           particular Unicode string (because the normal ways to get at the
           contents of a string with Unicode--via input and output--should
           always be via explicitly-defined I/O layers).  But if you must,
           there are two ways of looking behind the scenes.

           One way of peeking inside the internal encoding of Unicode
           characters is to use "unpack("C*", ...)" to get the bytes of
           whatever the string encoding happens to be, or "unpack("U0..",
           ...)" to get the bytes of the UTF-8 encoding:

               # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
               print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";

           Yet another way would be to use the Devel::Peek module:

               perl -MDevel::Peek -e 'Dump(chr(0x100))'

           That shows the "UTF8" flag in FLAGS and both the UTF-8 bytes and
           Unicode characters in "PV".  See also later in this document the
           discussion about the "utf8::is_utf8()" function.

   Advanced Topics
       ·   String Equivalence

           The question of string equivalence turns somewhat complicated in
           Unicode: what do you mean by "equal"?

           (Is "LATIN CAPITAL LETTER A WITH ACUTE" equal to "LATIN CAPITAL
           LETTER A"?)

           The short answer is that by default Perl compares equivalence
           ("eq", "ne") based only on code points of the characters.  In the
           above case, the answer is no (because 0x00C1 != 0x0041).  But
           sometimes, any CAPITAL LETTER A's should be considered equal, or
           even A's of any case.

           The long answer is that you need to consider character
           normalization and casing issues: see Unicode::Normalize, Unicode
           Technical Report #15, Unicode Normalization Forms
           <http://www.unicode.org/unicode/reports/tr15> and sections on case
           mapping in the Unicode Standard <http://www.unicode.org>.

           As of Perl 5.8.0, the "Full" case-folding of Case
           Mappings/SpecialCasing is implemented, but bugs remain in "qr//i"
           with them, mostly fixed by 5.14, and essentially entirely by 5.18.

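           A sketch of code-point comparison versus comparison after
           normalization, plus a caseless comparison using the "fc()"
           case-folding feature of Perl v5.16+ (the combination shown is one
           reasonable approach, not the only one):

```perl
use strict;
use warnings;
use feature 'fc';                  # case folding, Perl v5.16+
use Unicode::Normalize qw(NFD);

my $precomposed = "\x{C1}";        # LATIN CAPITAL LETTER A WITH ACUTE
my $sequence    = "A\x{301}";      # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# Different code point sequences, so eq says they differ...
print $precomposed eq $sequence ? 1 : 0, "\n";              # 0

# ...but they are canonically equivalent under normalization.
print NFD($precomposed) eq NFD($sequence) ? 1 : 0, "\n";    # 1

# Caseless comparison of canonically equivalent strings:
print fc(NFD($precomposed)) eq fc(NFD("a\x{301}")) ? 1 : 0, "\n";   # 1
```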
       ·   String Collation

           People like to see their strings nicely sorted--or as Unicode
           parlance goes, collated.  But again, what do you mean by collate?

           (Does "LATIN CAPITAL LETTER A WITH ACUTE" come before or after
           "LATIN CAPITAL LETTER A WITH GRAVE"?)

           The short answer is that by default, Perl compares strings ("lt",
           "le", "cmp", "ge", "gt") based only on the code points of the
           characters.  In the above case, the answer is "after", since
           0x00C1 > 0x00C0.

           The long answer is that "it depends", and a good answer cannot be
           given without knowing (at the very least) the language context.
           See Unicode::Collate, and the Unicode Collation Algorithm
           <http://www.unicode.org/unicode/reports/tr10/>.

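           A sketch contrasting default code-point sorting with
           Unicode::Collate (the UCA ordering shown assumes the module's
           default collation table):

```perl
use strict;
use warnings;
use Unicode::Collate;

# "\x{E4}" is LATIN SMALL LETTER A WITH DIAERESIS.
my @words = ("zebra", "\x{E4}pple");

# Default sort compares code points: 'z' (0x7A) sorts before 0xE4.
my @by_codepoint = sort @words;

# The Unicode Collation Algorithm groups a-diaeresis with "a",
# so it sorts before "zebra".
my @by_uca = Unicode::Collate->new->sort(@words);

print join(" ", @by_codepoint), "\n";
print join(" ", @by_uca), "\n";
```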
   Miscellaneous
       ·   Character Ranges and Classes

           Character ranges in regular expression bracketed character classes
           (e.g., "/[a-z]/") and in the "tr///" (also known as "y///")
           operator are not magically Unicode-aware.  What this means is that
           "[A-Za-z]" will not magically start to mean "all alphabetic
           letters" (not that it does mean that even for 8-bit characters;
           for those, if you are using locales (perllocale), use
           "/[[:alpha:]]/"; and if not, use the 8-bit-aware property
           "\p{alpha}").

           All the properties that begin with "\p" (and its inverse "\P") are
           actually character classes that are Unicode-aware.  There are
           dozens of them, see perluniprops.

           Starting in v5.22, you can use Unicode code points as the end
           points of regular expression pattern character ranges, and the
           range will include all Unicode code points that lie between those
           end points, inclusive.

               qr/ [ \N{U+03} - \N{U+20} ] /xx

           includes the code points "\N{U+03}", "\N{U+04}", ..., "\N{U+20}".

           This also works for ranges in "tr///" starting in Perl v5.24.

       ·   String-To-Number Conversions

           Unicode does define several other decimal--and numeric--characters
           besides the familiar 0 to 9, such as the Arabic and Indic digits.
           Perl does not support string-to-number conversion for digits other
           than ASCII 0 to 9 (and ASCII "a" to "f" for hexadecimal).  To get
           safe conversions from any Unicode string, use "num()" in
           Unicode::UCD.

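           A sketch of "num()"; the Devanagari digit string is an assumption
           picked for illustration:

```perl
use strict;
use warnings;
use Unicode::UCD 'num';

# Perl's own string-to-number conversion only knows ASCII digits:
print 0 + "42", "\n";                                        # 42

# num() understands decimal digits from any one script...
my $devanagari = "\x{967}\x{968}";   # DEVANAGARI DIGIT ONE, DEVANAGARI DIGIT TWO
print num($devanagari), "\n";                                # 12

# ...but returns undef for mixed-script digit strings rather than guessing.
print defined num("1\x{968}") ? "defined" : "undef", "\n";   # undef
```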
   Questions With Answers
       ·   Will My Old Scripts Break?

           Very probably not.  Unless you are generating Unicode characters
           somehow, old behaviour should be preserved.  About the only
           behaviour that has changed and which could start generating
           Unicode is the old behaviour of "chr()" where supplying an
           argument greater than 255 produced a character modulo 255.
           "chr(300)", for example, was equal to "chr(45)" or "-" (in ASCII);
           now it is LATIN CAPITAL LETTER I WITH BREVE.

       ·   How Do I Make My Scripts Work With Unicode?

           Very little work should be needed since nothing changes until you
           generate Unicode data.  The most important thing is getting input
           as Unicode; for that, see the earlier I/O discussion.  To get full
           seamless Unicode support, add "use feature 'unicode_strings'" (or
           "use 5.012" or higher) to your script.

       ·   How Do I Know Whether My String Is In Unicode?

           You shouldn't have to care.  But you may if your Perl is before
           5.14.0, or you haven't specified "use feature 'unicode_strings'"
           or "use 5.012" (or higher), because otherwise the rules for the
           code points in the range 128 to 255 are different depending on
           whether the string they are contained within is in Unicode or not.
           (See "When Unicode Does Not Happen" in perlunicode.)

           To determine if a string is in Unicode, use:

               print utf8::is_utf8($string) ? 1 : 0, "\n";

           But note that this doesn't mean that any of the characters in the
           string are necessarily UTF-8 encoded, or that any of the
           characters have code points greater than 0xFF (255) or even 0x80
           (128), or that the string has any characters at all.  All
           "is_utf8()" does is to return the value of the internal "utf8ness"
           flag attached to the $string.  If the flag is off, the bytes in
           the scalar are interpreted as a single-byte encoding.  If the flag
           is on, the bytes in the scalar are interpreted as the
           (variable-length, potentially multi-byte) UTF-8 encoded code
           points of the characters.  Bytes added to a UTF-8 encoded string
           are automatically upgraded to UTF-8.  If mixed non-UTF-8 and UTF-8
           scalars are merged (double-quoted interpolation, explicit
           concatenation, or printf/sprintf parameter substitution), the
           result will be UTF-8 encoded as if copies of the byte strings were
           upgraded to UTF-8: for example,

               $a = "ab\x80c";
               $b = "\x{100}";
               print "$a = $b\n";

           the output string will be UTF-8-encoded "ab\x80c = \x{100}\n", but
           $a will stay byte-encoded.

           Sometimes you might really need to know the byte length of a
           string instead of the character length.  For that use the "bytes"
           pragma and the "length()" function:

               my $unicode = chr(0x100);
               print length($unicode), "\n"; # will print 1
               use bytes;
               print length($unicode), "\n"; # will print 2
                                             # (the 0xC4 0x80 of the UTF-8)
               no bytes;

   ·   How Do I Find Out What Encoding a File Has?

       You might try Encode::Guess, but it has a number of limitations.

   ·   How Do I Detect Data That's Not Valid In a Particular Encoding?

       Use the "Encode" package to try converting it. For example,

           use Encode 'decode';

           if (eval { decode('UTF-8', $string, Encode::FB_CROAK); 1 }) {
               # $string is valid UTF-8
           } else {
               # $string is not valid UTF-8
           }

       Or use "unpack" to try decoding it:

           use warnings;
           @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);

       If invalid, a "Malformed UTF-8 character" warning is produced.
       The "C0" means "process the string one character at a time".
       Without that, the "unpack("U*", ...)" would work in "U0" mode
       (the default if the format string starts with "U") and would
       return the bytes making up the UTF-8 encoding of the target
       string, something that will always work.

   ·   How Do I Convert Binary Data Into a Particular Encoding, Or Vice
       Versa?

       This probably isn't as useful as you might think. Normally, you
       shouldn't need to.

       In one sense, what you are asking doesn't make much sense:
       encodings are for characters, and binary data are not
       "characters", so converting "data" into some encoding isn't
       meaningful unless you know in what character set and encoding
       the binary data is in, in which case it's not just binary data,
       now is it?

       If you have a raw sequence of bytes that you know should be
       interpreted via a particular encoding, you can use "Encode":

           use Encode 'from_to';
           from_to($data, "iso-8859-1", "UTF-8"); # from latin-1 to UTF-8

       The call to "from_to()" changes the bytes in $data, but nothing
       material about the nature of the string has changed as far as
       Perl is concerned. Both before and after the call, the string
       $data contains just a bunch of 8-bit bytes. As far as Perl is
       concerned, the encoding of the string remains as "system-native
       8-bit bytes".

       You might relate this to a fictional 'Translate' module:

           use Translate;
           my $phrase = "Yes";
           Translate::from_to($phrase, 'english', 'deutsch');
           ## phrase now contains "Ja"

       The contents of the string change, but not the nature of the
       string. Perl doesn't know any more after the call than before
       that the contents of the string indicate the affirmative.

       Back to converting data. If you have (or want) data in your
       system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you
       can use pack/unpack to convert to/from Unicode.

           $native_string  = pack("W*", unpack("U*", $Unicode_string));
           $Unicode_string = pack("U*", unpack("W*", $native_string));

       If you have a sequence of bytes you know is valid UTF-8, but
       Perl doesn't know it yet, you can make Perl a believer, too:

           $Unicode = $bytes;
           utf8::decode($Unicode);

       or:

           $Unicode = pack("U0a*", $bytes);

       You can find the bytes that make up a UTF-8 sequence with

           @bytes = unpack("C*", $Unicode_string)

       and you can create well-formed Unicode with

           $Unicode_string = pack("U*", 0xff, ...)

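       As a round-trip sketch combining "pack("U*", ...)" with "ord()":
       build a character string from code points, then read the code
       points back out.

```perl
# Build a character string from code points, then recover them with ord()
my $s = pack("U*", 0x41, 0x100, 0x3B1);    # "A", U+0100, U+03B1

printf "U+%04X ", ord($_) for split //, $s;
print "\n";                                # U+0041 U+0100 U+03B1
print length($s), "\n";                    # 3 (characters, not bytes)
```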
   ·   How Do I Display Unicode?  How Do I Input Unicode?

       See <http://www.alanwood.net/unicode/> and
       <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

   ·   How Does Unicode Work With Traditional Locales?

       If your locale is a UTF-8 locale, starting in Perl v5.26, Perl
       works well for all categories; before this, starting with Perl
       v5.20, it works for all categories but "LC_COLLATE", which deals
       with sorting and the "cmp" operator. But note that the standard
       "Unicode::Collate" and "Unicode::Collate::Locale" modules offer
       much more powerful solutions to collation issues, and work on
       earlier releases.

       For other locales, starting in Perl 5.16, you can specify

           use locale ':not_characters';

       to get Perl to work well with them. The catch is that you have
       to translate from the locale character set to/from Unicode
       yourself. See "Unicode I/O" above for how to

           use open ':locale';

       to accomplish this, but full details are in "Unicode and UTF-8"
       in perllocale, including gotchas that happen if you don't
       specify ":not_characters".
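
       A minimal boilerplate combining the two pragmas might look like
       the following sketch (assuming Perl 5.16 or later and a
       non-UTF-8 locale):

```perl
use v5.16;
use locale ':not_characters';  # locale rules for everything but character ops
use open   ':locale';          # translate I/O using the locale's character set
```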

   Hexadecimal Notation
       The Unicode standard prefers using hexadecimal notation because
       that more clearly shows the division of Unicode into blocks of
       256 characters. Hexadecimal is also simply shorter than decimal.
       You can use decimal notation, too, but learning to use
       hexadecimal just makes life easier with the Unicode standard.
       The "U+HHHH" notation uses hexadecimal, for example.

       The "0x" prefix denotes a hexadecimal number; its digits are 0-9
       and a-f (or A-F; case doesn't matter). Each hexadecimal digit
       represents four bits, or half a byte. "print 0x..., "\n"" will
       show a hexadecimal number in decimal, and "printf "%x\n",
       $decimal" will show a decimal number in hexadecimal. If you have
       just the "hex digits" of a hexadecimal number, you can use the
       "hex()" function.

           print 0x0009, "\n";    # 9
           print 0x000a, "\n";    # 10
           print 0x000f, "\n";    # 15
           print 0x0010, "\n";    # 16
           print 0x0011, "\n";    # 17
           print 0x0100, "\n";    # 256

           print 0x0041, "\n";    # 65

           printf "%x\n",  65;    # 41
           printf "%#x\n", 65;    # 0x41

           print hex("41"), "\n"; # 65
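
       Combined with "ord()", this is enough to print any character's
       code point in the standard's "U+" notation; a small sketch:

```perl
printf "U+%04X\n", ord("A");   # U+0041
printf "U+%04X\n", 0x3B1;      # U+03B1 (GREEK SMALL LETTER ALPHA)
print hex("3B1"), "\n";        # 945, the same code point in decimal
```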

   Further Resources
   ·   Unicode Consortium

       <http://www.unicode.org/>

   ·   Unicode FAQ

       <http://www.unicode.org/unicode/faq/>

   ·   Unicode Glossary

       <http://www.unicode.org/glossary/>

   ·   Unicode Recommended Reading List

       The Unicode Consortium has a list of articles and books, some of
       which give a much more in-depth treatment of Unicode:
       <http://unicode.org/resources/readinglist.html>

   ·   Unicode Useful Resources

       <http://www.unicode.org/unicode/onlinedat/resources.html>

   ·   Unicode and Multilingual Support in HTML, Fonts, Web Browsers
       and Other Applications

       <http://www.alanwood.net/unicode/>

   ·   UTF-8 and Unicode FAQ for Unix/Linux

       <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

   ·   Legacy Character Sets

       <http://www.czyborra.com/> <http://www.eki.ee/letter/>

   ·   You can explore various information from the Unicode data files
       using the "Unicode::UCD" module.
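
       For instance, the "charinfo()" function returns a hash reference
       describing a single code point; a short sketch:

```perl
use Unicode::UCD 'charinfo';

# Look up the Unicode database entry for a code point
my $info = charinfo(0x0041);
print "$info->{name}\n";      # LATIN CAPITAL LETTER A
print "$info->{category}\n";  # Lu (uppercase letter)
print "$info->{script}\n";    # Latin
```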

UNICODE IN OLDER PERLS
       If you cannot upgrade your Perl to 5.8.0 or later, you can still
       do some Unicode processing by using the modules
       "Unicode::String", "Unicode::Map8", and "Unicode::Map",
       available from CPAN. If you have the GNU recode installed, you
       can also use the Perl front-end "Convert::Recode" for character
       conversions.

       The following are fast conversions from ISO 8859-1 (Latin-1)
       bytes to UTF-8 bytes and back; the code works even with older
       Perl 5 versions.

           # ISO 8859-1 to UTF-8
           s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

           # UTF-8 to ISO 8859-1
           s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
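
       On Perls with the "Encode" module (5.8.0 and later), the same
       byte-level conversion can be written more clearly with
       "decode()" and "encode()"; a sketch:

```perl
use Encode qw(encode decode);

my $latin1 = "caf\xE9";                      # "café" as ISO 8859-1 bytes
my $chars  = decode("ISO-8859-1", $latin1);  # bytes -> character string
my $utf8   = encode("UTF-8", $chars);        # character string -> UTF-8 bytes
# $utf8 is now "caf\xC3\xA9"

my $back = encode("ISO-8859-1", decode("UTF-8", $utf8));
# $back is "caf\xE9" again
```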

SEE ALSO
       perlunitut, perlunicode, Encode, open, utf8, bytes, perlretut,
       perlrun, Unicode::Collate, Unicode::Normalize, Unicode::UCD

ACKNOWLEDGMENTS
       Thanks to the kind readers of the perl5-porters@perl.org,
       perl-unicode@perl.org, linux-utf8@nl.linux.org, and
       unicore@unicode.org mailing lists for their valuable feedback.

AUTHOR, COPYRIGHT, AND LICENSE
       Copyright 2001-2011 Jarkko Hietaniemi <jhi@iki.fi>. Now
       maintained by Perl 5 Porters.

       This document may be distributed under the same terms as Perl
       itself.



perl v5.28.2                      2018-11-01                 PERLUNIINTRO(1)