PERLUNICODE(1)         Perl Programmers Reference Guide        PERLUNICODE(1)

NAME
       perlunicode - Unicode support in Perl

DESCRIPTION
   Important Caveats
       Unicode support is an extensive requirement.  While Perl does not
       implement the Unicode standard or the accompanying technical reports
       from cover to cover, Perl does support many Unicode features.

       People who want to learn to use Unicode in Perl should probably read
       the Perl Unicode tutorial, perlunitut, before reading this reference
       document.

       Also, the use of Unicode may present security issues that aren't
       obvious.  Read Unicode Security Considerations
       <http://www.unicode.org/reports/tr36>.

       Input and Output Layers
           Perl knows when a filehandle uses Perl's internal Unicode
           encodings (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle
           is opened with the ":utf8" layer.  Other encodings can be
           converted to Perl's encoding on input or from Perl's encoding on
           output by use of the ":encoding(...)" layer.  See open.

           To indicate that Perl source itself is in UTF-8, use
           "use utf8;".
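
           For instance (a sketch: the in-memory filehandle stands in for a
           real file, and the variable names are illustrative only):

```perl
use strict;
use warnings;

# UTF-8 octets for "café"; an in-memory handle stands in for a disk file.
my $octets = "caf\xC3\xA9\n";
open my $in, '<:encoding(UTF-8)', \$octets or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;

# The ":encoding(UTF-8)" layer decoded the octets into characters,
# so length() counts 4 characters ("café"), not 5 bytes.
print length($line), "\n";
```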

       Regular Expressions
           The regular expression compiler produces polymorphic opcodes.
           That is, the pattern adapts to the data and automatically
           switches to the Unicode character scheme when presented with
           data that is internally encoded in UTF-8, or instead uses a
           traditional byte scheme when presented with byte data.

       "use utf8" still needed to enable UTF-8/UTF-EBCDIC in scripts
           As a compatibility measure, the "use utf8" pragma must be
           explicitly included to enable recognition of UTF-8 in the Perl
           scripts themselves (in string or regular expression literals, or
           in identifier names) on ASCII-based machines, or to recognize
           UTF-EBCDIC on EBCDIC-based machines.  These are the only times
           when an explicit "use utf8" is needed.  See utf8.

       BOM-marked scripts and UTF-16 scripts autodetected
           If a Perl script begins with the Unicode BOM (UTF-16LE,
           UTF-16BE, or UTF-8), or if the script looks like non-BOM-marked
           UTF-16 of either endianness, Perl will correctly read in the
           script as Unicode.  (BOMless UTF-8 cannot be effectively
           recognized or differentiated from ISO 8859-1 or other eight-bit
           encodings.)

       "use encoding" needed to upgrade non-Latin-1 byte strings
           By default, there is a fundamental asymmetry in Perl's Unicode
           model: implicit upgrading from byte strings to Unicode strings
           assumes that they were encoded in ISO 8859-1 (Latin-1), but
           Unicode strings are downgraded with UTF-8 encoding.  This
           happens because the first 256 code points in Unicode happen to
           agree with Latin-1.

           See "Byte and Character Semantics" for more details.

   Byte and Character Semantics
       Beginning with version 5.6, Perl uses logically wide characters to
       represent strings internally.

       In future, Perl-level operations will be expected to work with
       characters rather than bytes.

       However, as an interim compatibility measure, Perl aims to provide a
       safe migration path from byte semantics to character semantics for
       programs.  For operations where Perl can unambiguously decide that
       the input data are characters, Perl switches to character semantics.
       For operations where this determination cannot be made without
       additional information from the user, Perl decides in favor of
       compatibility and chooses to use byte semantics.

       Under byte semantics, when "use locale" is in effect, Perl uses the
       semantics associated with the current locale.  Absent a "use
       locale", and absent a "use feature 'unicode_strings'" pragma, Perl
       currently uses US-ASCII (or Basic Latin in Unicode terminology) byte
       semantics, meaning that characters whose ordinal numbers are in the
       range 128-255 are undefined except for their ordinal numbers.  This
       means that none have case (upper and lower), nor are any a member of
       character classes, like "[:alpha:]" or "\w".  (But all do belong to
       the "\W" class or the Perl regular expression extension
       "[:^alpha:]".)

       This behavior preserves compatibility with earlier versions of Perl,
       which allowed byte semantics in Perl operations only if none of the
       program's inputs were marked as being a source of Unicode character
       data.  Such data may come from filehandles, from calls to external
       programs, from information provided by the system (such as %ENV), or
       from literals and constants in the source text.

       The "bytes" pragma will always, regardless of platform, force byte
       semantics in a particular lexical scope.  See bytes.

       The "use feature 'unicode_strings'" pragma is intended to always,
       regardless of platform, force character (Unicode) semantics in a
       particular lexical scope.  In release 5.12, it is partially
       implemented, applying only to case changes.  See "The "Unicode Bug""
       below.

       The "utf8" pragma is primarily a compatibility device that enables
       recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
       Note that this pragma is only required while Perl defaults to byte
       semantics; when character semantics become the default, this pragma
       may become a no-op.  See utf8.

       Unless explicitly stated, Perl operators use character semantics for
       Unicode data and byte semantics for non-Unicode data.  The decision
       to use character semantics is made transparently.  If input data
       comes from a Unicode source--for example, if a character encoding
       layer is added to a filehandle or a literal Unicode string constant
       appears in a program--character semantics apply.  Otherwise, byte
       semantics are in effect.  The "bytes" pragma should be used to force
       byte semantics on Unicode data, and the "use feature
       'unicode_strings'" pragma to force Unicode semantics on byte data
       (though in 5.12 it isn't fully implemented).

       If strings operating under byte semantics and strings with Unicode
       character data are concatenated, the new string will have character
       semantics.  This can cause surprises: see "BUGS", below.  You can
       choose to be warned when this happens.  See encoding::warnings.

       Under character semantics, many operations that formerly operated on
       bytes now operate on characters.  A character in Perl is logically
       just a number ranging from 0 to 2**31 or so.  Larger characters may
       encode into longer sequences of bytes internally, but this internal
       detail is mostly hidden for Perl code.  See perluniintro for more.

   Effects of Character Semantics
       Character semantics have the following effects:

       ·   Strings--including hash keys--and regular expression patterns
           may contain characters that have an ordinal value larger than
           255.

           If you use a Unicode editor to edit your program, Unicode
           characters may occur directly within the literal strings in
           UTF-8 encoding, or UTF-16.  (The former requires a BOM or "use
           utf8", the latter requires a BOM.)

           Unicode characters can also be added to a string by using the
           "\N{U+...}" notation.  The Unicode code for the desired
           character, in hexadecimal, should be placed in the braces, after
           the "U".  For instance, a smiley face is "\N{U+263A}".

           Alternatively, you can use the "\x{...}" notation for characters
           0x100 and above.  For characters below 0x100 you may get byte
           semantics instead of character semantics; see "The "Unicode
           Bug"".  On EBCDIC machines there is the additional problem that
           the value for such characters gives the EBCDIC character rather
           than the Unicode one.

           Additionally, if you

               use charnames ':full';

           you can use the "\N{...}" notation and put the official Unicode
           character name within the braces, such as
           "\N{WHITE SMILING FACE}".  See charnames.
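
           All three notations can denote the same character; a small
           sketch:

```perl
use strict;
use warnings;
use charnames ':full';

# Three spellings of U+263A WHITE SMILING FACE:
my $s1 = "\N{U+263A}";               # code point, \N{U+...} form
my $s2 = "\x{263A}";                 # code point, \x{...} form
my $s3 = "\N{WHITE SMILING FACE}";   # official Unicode character name

print ord($s1), "\n";   # the code point, 9786 (0x263A)
```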

       ·   If an appropriate encoding is specified, identifiers within the
           Perl script may contain Unicode alphanumeric characters,
           including ideographs.  Perl does not currently attempt to
           canonicalize variable names.

       ·   Regular expressions match characters instead of bytes.  "."
           matches a character instead of a byte.

       ·   Bracketed character classes in regular expressions match
           characters instead of bytes and match against the character
           properties specified in the Unicode properties database.  "\w"
           can be used to match a Japanese ideograph, for instance.
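
           For example (a sketch; the ideograph literal requires "use
           utf8", as described above):

```perl
use strict;
use warnings;
use utf8;   # the string literal below is UTF-8 in the source

my $ideograph = "漢";   # U+6F22, a Japanese ideograph

# Under character semantics, \w and "." see one character, not bytes.
print $ideograph =~ /^\w$/ ? "word character\n" : "not a word character\n";
print $ideograph =~ /^.$/  ? "one character\n"  : "more than one\n";
```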

       ·   Named Unicode properties, scripts, and block ranges may be used
           (like bracketed character classes) by using the "\p{}" "matches
           property" construct and the "\P{}" negation, "doesn't match
           property".  See "Unicode Character Properties" for more details.

           You can define your own character properties and use them in
           the regular expression with the "\p{}" or "\P{}" construct.
           See "User-Defined Character Properties" for more details.

       ·   The special pattern "\X" matches a logical character, an
           "extended grapheme cluster" in Standardese.  In Unicode what
           appears to the user to be a single character, for example an
           accented "G", may in fact be composed of a sequence of
           characters, in this case a "G" followed by an accent character.
           "\X" will match the entire sequence.
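
           A short illustration of "." versus "\X" on a two-code-point
           grapheme:

```perl
use strict;
use warnings;

# "G" followed by U+0301 COMBINING ACUTE ACCENT: two code points that
# display as one accented character.
my $str = "G\x{0301}";

my ($dot)     = $str =~ /\A(.)/;    # "." matches one code point
my ($cluster) = $str =~ /\A(\X)/;   # "\X" matches the whole cluster

print length($dot), " ", length($cluster), "\n";   # 1 2
```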

       ·   The "tr///" operator translates characters instead of bytes.
           Note that the "tr///CU" functionality has been removed.  For
           similar functionality see pack('U0', ...) and pack('C0', ...).

       ·   Case translation operators use the Unicode case translation
           tables when character input is provided.  Note that "uc()", or
           "\U" in interpolated strings, translates to uppercase, while
           "ucfirst", or "\u" in interpolated strings, translates to
           titlecase in languages that make the distinction (which is
           equivalent to uppercase in languages without the distinction).
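
           One character where the difference is visible is U+01C6, whose
           uppercase and titlecase forms are distinct code points:

```perl
use strict;
use warnings;

my $dz = "\x{01C6}";   # LATIN SMALL LETTER DZ WITH CARON

printf "%04X\n", ord uc $dz;        # 01C4: uppercase "DZ" with caron
printf "%04X\n", ord ucfirst $dz;   # 01C5: titlecase "Dz" with caron
```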

       ·   Most operators that deal with positions or lengths in a string
           will automatically switch to using character positions,
           including "chop()", "chomp()", "substr()", "pos()", "index()",
           "rindex()", "sprintf()", "write()", and "length()".  An
           operator that specifically does not switch is "vec()".
           Operators that really don't care include operators that treat
           strings as a bucket of bits such as "sort()", and operators
           dealing with filenames.

       ·   The "pack()"/"unpack()" letter "C" does not change, since it is
           often used for byte-oriented formats.  Again, think "char" in
           the C language.

           There is a new "U" specifier that converts between Unicode
           characters and code points.  There is also a "W" specifier that
           is the equivalent of "chr"/"ord" and properly handles character
           values even if they are above 255.
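
           A sketch of "U" and "W" round-tripping a code point above 255:

```perl
use strict;
use warnings;

my $char = pack 'U', 0x263A;    # one character, U+263A
my ($cp) = unpack 'W', $char;   # back to the numeric code point

print $char eq chr 0x263A ? "round-trip ok\n" : "mismatch\n";
print $cp, "\n";   # 9786
```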

       ·   The "chr()" and "ord()" functions work on characters, similar
           to "pack("W")" and "unpack("W")", not "pack("C")" and
           "unpack("C")".  "pack("C")" and "unpack("C")" are methods for
           emulating byte-oriented "chr()" and "ord()" on Unicode strings.
           While these methods reveal the internal encoding of Unicode
           strings, that is not something one normally needs to care about
           at all.

       ·   The bit string operators, "& | ^ ~", can operate on character
           data.  However, for backward compatibility, such as when using
           bit string operations when characters are all less than 256 in
           ordinal value, one should not use "~" (the bit complement) with
           characters of both values less than 256 and values greater than
           256.  Most importantly, DeMorgan's laws ("~($x|$y) eq ~$x&~$y"
           and "~($x&$y) eq ~$x|~$y") will not hold.  The reason for this
           mathematical faux pas is that the complement cannot return both
           the 8-bit (byte-wide) bit complement and the full
           character-wide bit complement.

       ·   You can define your own mappings to be used in "lc()",
           "lcfirst()", "uc()", and "ucfirst()" (or their double-quoted
           string inlined versions such as "\U").  See "User-Defined Case
           Mappings" for more details.

       ·   And finally, "scalar reverse()" reverses by character rather
           than by byte.

   Unicode Character Properties
       Most Unicode character properties are accessible by using regular
       expressions.  They are used (like bracketed character classes) by
       using the "\p{}" "matches property" construct and the "\P{}"
       negation, "doesn't match property".

       Note that the only time that Perl considers a sequence of
       individual code points as a single logical character is in the "\X"
       construct, already mentioned above.  Therefore "character" in this
       discussion means a single Unicode code point.

       For instance, "\p{Uppercase}" matches any single character with the
       Unicode "Uppercase" property, while "\p{L}" matches any character
       with a General_Category of "L" (letter) property.  Brackets are not
       required for single letter property names, so "\p{L}" is
       equivalent to "\pL".

       More formally, "\p{Uppercase}" matches any single character whose
       Unicode Uppercase property value is True, and "\P{Uppercase}"
       matches any character whose Uppercase property value is False, and
       they could have been written as "\p{Uppercase=True}" and
       "\p{Uppercase=False}", respectively.

       This formality is needed when properties are not binary; that is,
       if they can take on more values than just True and False.  For
       example, the Bidi_Class (see "Bidirectional Character Types" below)
       can take on a number of different values, such as Left, Right,
       Whitespace, and others.  To match these, one needs to specify the
       property name (Bidi_Class), and the value being matched against
       (Left, Right, etc.).  This is done, as in the examples above, by
       having the two components separated by an equal sign (or
       interchangeably, a colon), like "\p{Bidi_Class: Left}".
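
       For example (a binary property, its compound spelling, and a
       non-binary property=value match; U+05D0 HEBREW LETTER ALEF has
       Bidi_Class "R"):

```perl
use strict;
use warnings;

print "A" =~ /\p{Uppercase}/            ? "1\n" : "0\n";  # binary, single form
print "A" =~ /\p{Uppercase=True}/       ? "1\n" : "0\n";  # same, compound form
print "\x{05D0}" =~ /\p{Bidi_Class: R}/ ? "1\n" : "0\n";  # non-binary property
```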

       All Unicode-defined character properties may be written in these
       compound forms of "\p{property=value}" or "\p{property:value}", but
       Perl provides some additional properties that are written only in
       the single form, as well as single-form short-cuts for all binary
       properties and certain others described below, in which you may
       omit the property name and the equals or colon separator.

       Most Unicode character properties have at least two synonyms (or
       aliases if you prefer): a short one that is easier to type, and a
       longer one that is more descriptive and hence easier to understand.
       Thus the "L" and "Letter" above are equivalent and can be used
       interchangeably.  Likewise, "Upper" is a synonym for "Uppercase",
       and we could have written "\p{Uppercase}" equivalently as
       "\p{Upper}".  Also, there are typically various synonyms for the
       values the property can take.  For binary properties, "True" has 3
       synonyms: "T", "Yes", and "Y"; and "False" has correspondingly
       "F", "No", and "N".  But be careful.  A short form of a value for
       one property may not mean the same thing as the same short form for
       another.  Thus, for the General_Category property, "L" means
       "Letter", but for the Bidi_Class property, "L" means "Left".  A
       complete list of properties and synonyms is in perluniprops.

       Upper/lower case differences in the property names and values are
       irrelevant; thus "\p{Upper}" means the same thing as "\p{upper}" or
       even "\p{UpPeR}".  Similarly, you can add or subtract underscores
       anywhere in the middle of a word, so that these are also equivalent
       to "\p{U_p_p_e_r}".  And white space is irrelevant adjacent to
       non-word characters, such as the braces and the equals or colon
       separators, so "\p{ Upper }" and "\p{ Upper_case : Y }" are
       equivalent to these as well.  In fact, in most cases, white space
       and even hyphens can be added or deleted anywhere.  So even
       "\p{ Up-per case = Yes}" is equivalent.  All this is called "loose
       matching" by Unicode.  The few places where stricter matching is
       employed are in the middle of numbers, and in the Perl extension
       properties that begin or end with an underscore.  Stricter matching
       cares about white space (except adjacent to non-word characters),
       hyphens, and non-interior underscores.

       You can also use negation in both "\p{}" and "\P{}" by introducing
       a caret (^) between the first brace and the property name:
       "\p{^Tamil}" is equal to "\P{Tamil}".

    General_Category

       Every Unicode character is assigned a general category, which is
       the "most usual categorization of a character" (from
       <http://www.unicode.org/reports/tr44>).

       The compound way of writing these is like
       "\p{General_Category=Number}" (short: "\p{gc:n}").  But Perl
       furnishes shortcuts in which everything up through the equal or
       colon separator is omitted.  So you can instead just write "\pN".

       Here are the short and long forms of the General Category
       properties:

           Short       Long

           L           Letter
           LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
           Lu          Uppercase_Letter
           Ll          Lowercase_Letter
           Lt          Titlecase_Letter
           Lm          Modifier_Letter
           Lo          Other_Letter

           M           Mark
           Mn          Nonspacing_Mark
           Mc          Spacing_Mark
           Me          Enclosing_Mark

           N           Number
           Nd          Decimal_Number (also Digit)
           Nl          Letter_Number
           No          Other_Number

           P           Punctuation (also Punct)
           Pc          Connector_Punctuation
           Pd          Dash_Punctuation
           Ps          Open_Punctuation
           Pe          Close_Punctuation
           Pi          Initial_Punctuation
                       (may behave like Ps or Pe depending on usage)
           Pf          Final_Punctuation
                       (may behave like Ps or Pe depending on usage)
           Po          Other_Punctuation

           S           Symbol
           Sm          Math_Symbol
           Sc          Currency_Symbol
           Sk          Modifier_Symbol
           So          Other_Symbol

           Z           Separator
           Zs          Space_Separator
           Zl          Line_Separator
           Zp          Paragraph_Separator

           C           Other
           Cc          Control (also Cntrl)
           Cf          Format
           Cs          Surrogate (not usable)
           Co          Private_Use
           Cn          Unassigned

       Single-letter properties match all characters in any of the
       two-letter sub-properties starting with the same letter.  "LC" and
       "L&" are special cases, which are both aliases for the set
       consisting of everything matched by "Ll", "Lu", and "Lt".

       Because Perl hides the need for the user to understand the internal
       representation of Unicode characters, there is no need to implement
       the somewhat messy concept of surrogates.  "Cs" is therefore not
       supported.
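
       A few of these categories in action:

```perl
use strict;
use warnings;

print "9" =~ /\pN/           ? "1\n" : "0\n";  # Number, single-letter form
print "9" =~ /\p{Nd}/        ? "1\n" : "0\n";  # Decimal_Number
print "\x{20AC}" =~ /\p{Sc}/ ? "1\n" : "0\n";  # EURO SIGN: Currency_Symbol
print "X" =~ /\p{L&}/        ? "1\n" : "0\n";  # cased letter (Ll, Lu, or Lt)
```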

    Bidirectional Character Types

       Because scripts differ in their directionality (Hebrew is written
       right to left, for example), Unicode supplies these properties in
       the Bidi_Class class:

           Property    Meaning

           L           Left-to-Right
           LRE         Left-to-Right Embedding
           LRO         Left-to-Right Override
           R           Right-to-Left
           AL          Arabic Letter
           RLE         Right-to-Left Embedding
           RLO         Right-to-Left Override
           PDF         Pop Directional Format
           EN          European Number
           ES          European Separator
           ET          European Terminator
           AN          Arabic Number
           CS          Common Separator
           NSM         Non-Spacing Mark
           BN          Boundary Neutral
           B           Paragraph Separator
           S           Segment Separator
           WS          Whitespace
           ON          Other Neutrals

       This property is always written in the compound form.  For example,
       "\p{Bidi_Class:R}" matches characters that are normally written
       right to left.

    Scripts

       The world's languages are written in a number of scripts.  This
       sentence (unless you're reading it in translation) is written in
       Latin, while Russian is written in Cyrillic, and Greek is written
       in, well, Greek; Japanese mainly in Hiragana or Katakana.  There
       are many more.

       The Unicode Script property gives what script a given character is
       in, and the property can be specified with the compound form like
       "\p{Script=Hebrew}" (short: "\p{sc=hebr}").  Perl furnishes
       shortcuts for all script names.  You can omit everything up through
       the equals (or colon), and simply write "\p{Latin}" or
       "\P{Cyrillic}".

       A complete list of scripts and their shortcuts is in perluniprops.
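
       For instance, the compound form, its short synonyms, and the bare
       shortcut all match the same characters:

```perl
use strict;
use warnings;

my $zhe = "\x{0436}";   # CYRILLIC SMALL LETTER ZHE

print $zhe =~ /\p{Script=Cyrillic}/ ? "1\n" : "0\n";   # compound form
print $zhe =~ /\p{sc=cyrl}/         ? "1\n" : "0\n";   # short synonyms
print $zhe =~ /\p{Cyrillic}/        ? "1\n" : "0\n";   # Perl shortcut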

    Use of "Is" Prefix

       For backward compatibility (with Perl 5.6), all properties
       mentioned so far may have "Is" or "Is_" prepended to their name, so
       "\P{Is_Lu}", for example, is equal to "\P{Lu}", and
       "\p{IsScript:Arabic}" is equal to "\p{Arabic}".

    Blocks

       In addition to scripts, Unicode also defines blocks of characters.
       The difference between scripts and blocks is that the concept of
       scripts is closer to natural languages, while the concept of blocks
       is more of an artificial grouping based on groups of Unicode
       characters with consecutive ordinal values.  For example, the
       "Basic Latin" block is all characters whose ordinals are between 0
       and 127, inclusive; in other words, the ASCII characters.  The
       "Latin" script contains some letters from this block as well as
       several more, like "Latin-1 Supplement", "Latin Extended-A", etc.,
       but it does not contain all the characters from those blocks.  It
       does not, for example, contain digits, because digits are shared
       across many scripts.  Digits and similar groups, like punctuation,
       are in the script called "Common".  There is also a script called
       "Inherited" for characters that modify other characters, and
       inherit the script value of the controlling character.

       For more about scripts versus blocks, see UAX#24 "Unicode Script
       Property": <http://www.unicode.org/reports/tr24>

       The Script property is likely to be the one you want to use when
       processing natural language; the Block property may be useful in
       working with the nuts and bolts of Unicode.

       Block names are matched in the compound form, like "\p{Block:
       Arrows}" or "\p{Blk=Hebrew}".  Unlike most other properties, only a
       few block names have a Unicode-defined short name.  But Perl does
       provide a (slight) shortcut: you can say, for example,
       "\p{In_Arrows}" or "\p{In_Hebrew}".  For backwards compatibility,
       the "In" prefix may be omitted if there is no naming conflict with
       a script or any other property, and you can even use an "Is" prefix
       instead in those cases.  But it is not a good idea to do this, for
       a couple of reasons:

       1.  It is confusing.  There are many naming conflicts, and you may
           forget some.  For example, "\p{Hebrew}" means the script
           Hebrew, and NOT the block Hebrew.  But would you remember that
           6 months from now?

       2.  It is unstable.  A new version of Unicode may pre-empt the
           current meaning by creating a property with the same name.
           There was a time in very early Unicode releases when
           "\p{Hebrew}" would have matched the block Hebrew; now it
           doesn't.

       Some people just prefer to always use "\p{Block: foo}" and
       "\p{Script: bar}" instead of the shortcuts, for clarity, and
       because they can't remember the difference between 'In' and 'Is'
       anyway (or aren't confident that those who eventually will read
       their code will know).

       A complete list of blocks and their shortcuts is in perluniprops.
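
       The script/block distinction is easy to check directly:

```perl
use strict;
use warnings;

my $arrow = "\x{2192}";   # RIGHTWARDS ARROW
my $alef  = "\x{05D0}";   # HEBREW LETTER ALEF

print $arrow =~ /\p{Block: Arrows}/ ? "1\n" : "0\n";   # in the Arrows block
print $alef  =~ /\p{Blk=Hebrew}/    ? "1\n" : "0\n";   # in the Hebrew block
print $alef  =~ /\p{Hebrew}/        ? "1\n" : "0\n";   # and the Hebrew script
```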

    Other Properties

       There are many more properties than the very basic ones described
       here.  A complete list is in perluniprops.

       Unicode defines all its properties in the compound form, so all
       single-form properties are Perl extensions.  A number of these are
       just synonyms for the Unicode ones, but some are genuine
       extensions, including a couple that are in the compound form.  And
       quite a few of these are actually recommended by Unicode (in
       <http://www.unicode.org/reports/tr18>).

       This section gives some details on all the extensions that aren't
       synonyms for compound-form Unicode properties (for those, you'll
       have to refer to the Unicode Standard
       <http://www.unicode.org/reports/tr44>).

       "\p{All}"
           This matches any of the 1_114_112 Unicode code points.  It is a
           synonym for "\p{Any}".

       "\p{Alnum}"
           This matches any "\p{Alphabetic}" or "\p{Decimal_Number}"
           character.

       "\p{Any}"
           This matches any of the 1_114_112 Unicode code points.  It is a
           synonym for "\p{All}".

       "\p{Assigned}"
           This matches any assigned code point; that is, any code point
           whose general category is not Unassigned (or equivalently, not
           Cn).

       "\p{Blank}"
           This is the same as "\h" and "\p{HorizSpace}": a character that
           changes the spacing horizontally.

       "\p{Decomposition_Type: Non_Canonical}" (Short: "\p{Dt=NonCanon}")
           Matches a character that has a non-canonical decomposition.

           To understand the use of this rarely used property=value
           combination, it is necessary to know some basics about
           decomposition.  Consider a character, say H.  It could appear
           with various marks around it, such as an acute accent, or a
           circumflex, or various hooks, circles, arrows, etc., above,
           below, to one side and/or the other, etc.  There are many
           possibilities among the world's languages.  The number of
           combinations is astronomical, and if there were a character
           for each combination, it would soon exhaust Unicode's more
           than a million possible characters.  So Unicode took a
           different approach: there is a character for the base H, and a
           character for each of the possible marks, and they can be
           combined variously to get a final logical character.  So a
           logical character--what appears to be a single character--can
           be a sequence of more than one individual character.  This is
           called an "extended grapheme cluster".  (Perl furnishes the
           "\X" construct to match such sequences.)

           But Unicode's intent is to unify the existing character set
           standards and practices, and a number of pre-existing
           standards have single characters that mean the same thing as
           some of these combinations.  An example is ISO-8859-1, which
           has quite a few of these in the Latin-1 range, an example
           being "LATIN CAPITAL LETTER E WITH ACUTE".  Because this
           character was in this pre-existing standard, Unicode added it
           to its repertoire.  But this character is considered by
           Unicode to be equivalent to the sequence consisting of first
           the character "LATIN CAPITAL LETTER E", then the character
           "COMBINING ACUTE ACCENT".

           "LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed"
           character, and the equivalence with the sequence is called
           canonical equivalence.  All pre-composed characters are said
           to have a decomposition (into the equivalent sequence), and
           the decomposition type is also called canonical.

           However, many more characters have a different type of
           decomposition, a "compatible" or "non-canonical"
           decomposition.  The sequences that form these decompositions
           are not considered canonically equivalent to the pre-composed
           character.  An example, again in the Latin-1 range, is the
           "SUPERSCRIPT ONE".  It is kind of like a regular digit 1, but
           not exactly; its decomposition into the digit 1 is called a
           "compatible" decomposition, specifically a "super"
           decomposition.  There are several such compatibility
           decompositions (see <http://www.unicode.org/reports/tr44>),
           including one called "compat" which means some miscellaneous
           type of decomposition that doesn't fit into the decomposition
           categories that Unicode has chosen.

           Note that most Unicode characters don't have a decomposition,
           so their decomposition type is "None".

           Perl has added the "Non_Canonical" type, for your convenience,
           to mean any of the compatibility decompositions.
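
           The three situations described above can be distinguished
           directly (a sketch, using the short "Dt" form):

```perl
use strict;
use warnings;

my $e_acute = "\x{00C9}";   # LATIN CAPITAL LETTER E WITH ACUTE
my $super1  = "\x{00B9}";   # SUPERSCRIPT ONE

print $e_acute =~ /\p{Dt=Canonical}/ ? "1\n" : "0\n";  # canonical decomposition
print $super1  =~ /\p{Dt=NonCanon}/  ? "1\n" : "0\n";  # compatibility ("super")
print "E"      =~ /\p{Dt=None}/      ? "1\n" : "0\n";  # no decomposition
```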

       "\p{Graph}"
           Matches any character that is graphic.  Theoretically, this
           means a character that on a printer would cause ink to be used.

       "\p{HorizSpace}"
           This is the same as "\h" and "\p{Blank}": a character that
           changes the spacing horizontally.

       "\p{In=*}"
           This is a synonym for "\p{Present_In=*}".

       "\p{PerlSpace}"
           This is the same as "\s", restricted to ASCII, namely
           "[ \f\n\r\t]".

           Mnemonic: Perl's (original) space.

       "\p{PerlWord}"
           This is the same as "\w", restricted to ASCII, namely
           "[A-Za-z0-9_]".

           Mnemonic: Perl's (original) word.

       "\p{PosixAlnum}"
           This matches any alphanumeric character in the ASCII range,
           namely "[A-Za-z0-9]".

       "\p{PosixAlpha}"
           This matches any alphabetic character in the ASCII range,
           namely "[A-Za-z]".

       "\p{PosixBlank}"
           This matches any blank character in the ASCII range, namely
           "[ \t]".

       "\p{PosixCntrl}"
           This matches any control character in the ASCII range, namely
           "[\x00-\x1F\x7F]".

       "\p{PosixDigit}"
           This matches any digit character in the ASCII range, namely
           "[0-9]".

       "\p{PosixGraph}"
           This matches any graphical character in the ASCII range,
           namely "[\x21-\x7E]".

       "\p{PosixLower}"
           This matches any lowercase character in the ASCII range,
           namely "[a-z]".

       "\p{PosixPrint}"
           This matches any printable character in the ASCII range,
           namely "[\x20-\x7E]".  These are the graphical characters plus
           SPACE.

       "\p{PosixPunct}"
           This matches any punctuation character in the ASCII range,
           namely "[\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E]".  These are
           the graphical characters that aren't word characters.  Note
           that the Posix standard includes in its definition of
           punctuation those characters that Unicode calls "symbols".

       "\p{PosixSpace}"
           This matches any space character in the ASCII range, namely
           "[ \f\n\r\t\x0B]" (the last being a vertical tab).

       "\p{PosixUpper}"
           This matches any uppercase character in the ASCII range,
           namely "[A-Z]".

       "\p{Present_In: *}" (Short: "\p{In=*}")
           This property is used when you need to know in what Unicode
           version(s) a character is.

           The "*" above stands for a two-part Unicode version number,
           such as 1.1 or 4.0; or the "*" can also be "Unassigned".  This
           property will match the code points whose final disposition
           has been settled as of the Unicode release given by the
           version number; "\p{Present_In: Unassigned}" will match those
           code points whose meaning has yet to be assigned.

           For example, "U+0041" "LATIN CAPITAL LETTER A" was present in
           the very first Unicode release available, which is 1.1, so
           this property is true for all valid "*" versions.  On the
           other hand, "U+1EFF" was not assigned until version 5.1 when
           it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*"
           that would match it are 5.1, 5.2, and later.

           Unicode furnishes the "Age" property from which this is
           derived.  The problem with Age is that a strict interpretation
           of it (which Perl takes) has it matching the precise release
           in which a code point's meaning is introduced.  Thus "U+0041"
           would match only 1.1, and "U+1EFF" only 5.1.  This is not
           usually what you want.

           Some non-Perl implementations of the Age property may change
           its meaning to be the same as the Perl Present_In property;
           just be aware of that.

           Another confusion with both these properties is that the
           definition is not that the code point has been assigned, but
           that the meaning of the code point has been determined.  This
           is because 66 code points will always be unassigned, and so
           the Age for them is the Unicode version in which the decision
           to make them so was made.  For example, "U+FDD0" is to be
           permanently unassigned to a character, and the decision to do
           that was made in version 3.1, so "\p{Age=3.1}" matches this
           character, and "\p{Present_In: 3.1}" and up matches as well.
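
           The contrast between Present_In and the strict Age
           interpretation:

```perl
use strict;
use warnings;

# "A" has been in Unicode since version 1.1.
print "A" =~ /\p{Present_In: 1.1}/ ? "1\n" : "0\n";   # 1: present in 1.1
print "A" =~ /\p{Present_In: 5.1}/ ? "1\n" : "0\n";   # 1: still present later
print "A" =~ /\p{Age=1.1}/         ? "1\n" : "0\n";   # 1: introduced in 1.1
print "A" =~ /\p{Age=5.1}/         ? "1\n" : "0\n";   # 0: Age is strict
```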

       "\p{Print}"
           This matches any character that is graphical or blank, except
           controls.

       "\p{SpacePerl}"
           This is the same as "\s", including beyond ASCII.

           Mnemonic: Space, as modified by Perl.  (It doesn't include the
           vertical tab which both the Posix standard and Unicode
           consider to be space.)

       "\p{VertSpace}"
           This is the same as "\v": a character that changes the spacing
           vertically.

       "\p{Word}"
           This is the same as "\w", including beyond ASCII.

   User-Defined Character Properties
       You can define your own binary character properties by defining
       subroutines whose names begin with "In" or "Is".  The subroutines
       can be defined in any package.  The user-defined properties can be
       used in the regular expression "\p" and "\P" constructs; if you are
       using a user-defined property from a package other than the one you
       are in, you must specify its package in the "\p" or "\P" construct.

           # assuming property Is_Foreign defined in Lang::
           package main;  # property package name required
           if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

           package Lang;  # property package name not required
           if ($txt =~ /\p{IsForeign}+/) { ... }

       Note that the effect is compile-time and immutable once defined.
726
727 The subroutines must return a specially-formatted string, with one or
728 more newline-separated lines. Each line must be one of the following:
729
730 · A single hexadecimal number denoting a Unicode code point to
731 include.
732
733 · Two hexadecimal numbers separated by horizontal whitespace (space
734 or tabular characters) denoting a range of Unicode code points to
735 include.
736
737 · Something to include, prefixed by "+": a built-in character
738 property (prefixed by "utf8::") or a user-defined character
739 property, to represent all the characters in that property; two
740 hexadecimal code points for a range; or a single hexadecimal code
741 point.
742
743 · Something to exclude, prefixed by "-": an existing character
744 property (prefixed by "utf8::") or a user-defined character
745 property, to represent all the characters in that property; two
746 hexadecimal code points for a range; or a single hexadecimal code
747 point.
748
· Something to negate, prefixed by "!": an existing character
  property (prefixed by "utf8::") or a user-defined character
  property, to represent all the characters not in that property;
  two hexadecimal code points for a range; or a single hexadecimal
  code point.
753
· Something to intersect with, prefixed by "&": an existing
  character property (prefixed by "utf8::") or a user-defined
  character property, to restrict the set built so far to only the
  characters that are also in that property; two hexadecimal code
  points for a range; or a single hexadecimal code point.
759
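The line-combining rules above amount to simple set arithmetic. As
a rough illustrative sketch (in Python, since the set operations
themselves are language-independent; this toy parser handles only
the hex and range forms plus the "+", "-", and "&" prefixes, and
omits "!" and named properties for brevity):

```python
# Sketch: interpret the newline-separated property-definition
# format described above as set arithmetic. Only hex numbers,
# ranges, and the "+", "-", "&" prefixes are handled.
def parse_property(spec):
    result = set()
    for line in spec.splitlines():
        line = line.strip()
        if not line:
            continue
        op, body = "+", line
        if line[0] in "+-&":
            op, body = line[0], line[1:]
        parts = body.split()
        if len(parts) == 2:            # two hex numbers: a range
            lo, hi = (int(p, 16) for p in parts)
            points = set(range(lo, hi + 1))
        else:                          # a single hex code point
            points = {int(parts[0], 16)}
        if op == "+":
            result |= points           # include
        elif op == "-":
            result -= points           # exclude
        elif op == "&":
            result &= points           # intersect
    return result

# Equivalent of the InKana example: hiragana plus katakana blocks.
kana = parse_property("3040\t309F\n30A0\t30FF")
```

With this, HIRAGANA LETTER A (U+3042) is in the set while LATIN
CAPITAL LETTER A (U+0041) is not, mirroring what "\p{InKana}"
would match.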
760 For example, to define a property that covers both the Japanese
761 syllabaries (hiragana and katakana), you can define
762
763 sub InKana {
764 return <<END;
765 3040\t309F
766 30A0\t30FF
767 END
768 }
769
770 Imagine that the here-doc end marker is at the beginning of the line.
771 Now you can use "\p{InKana}" and "\P{InKana}".
772
773 You could also have used the existing block property names:
774
775 sub InKana {
776 return <<'END';
777 +utf8::InHiragana
778 +utf8::InKatakana
779 END
780 }
781
782 Suppose you wanted to match only the allocated characters, not the raw
783 block ranges: in other words, you want to remove the non-characters:
784
785 sub InKana {
786 return <<'END';
787 +utf8::InHiragana
788 +utf8::InKatakana
789 -utf8::IsCn
790 END
791 }
792
793 The negation is useful for defining (surprise!) negated classes.
794
795 sub InNotKana {
796 return <<'END';
797 !utf8::InHiragana
798 -utf8::InKatakana
799 +utf8::IsCn
800 END
801 }
802
803 Intersection is useful for getting the common characters matched by two
804 (or more) classes.
805
806 sub InFooAndBar {
807 return <<'END';
808 +main::Foo
809 &main::Bar
810 END
811 }
812
813 It's important to remember not to use "&" for the first set; that would
814 be intersecting with nothing (resulting in an empty set).
815
816 User-Defined Case Mappings
817 You can also define your own mappings to be used in the lc(),
818 lcfirst(), uc(), and ucfirst() (or their string-inlined versions). The
819 principle is similar to that of user-defined character properties: to
820 define subroutines with names like "ToLower" (for lc() and lcfirst()),
821 "ToTitle" (for the first character in ucfirst()), and "ToUpper" (for
822 uc(), and the rest of the characters in ucfirst()).
823
The string returned by the subroutines needs to be two hexadecimal
numbers separated by two tabs: the two numbers being,
respectively, the source code point and the destination code
point. For example:
828
829 sub ToUpper {
830 return <<END;
831 0061\t\t0041
832 END
833 }
834
835 defines an uc() mapping that causes only the character "a" to be mapped
836 to "A"; all other characters will remain unchanged.
837
(For serious hackers only) The above means you have to furnish a
complete mapping; you can't just override a couple of characters
and leave the rest unchanged. You can find all the mappings in the
directory $Config{privlib}/unicore/To/. The mapping data is
returned as the here-document, and the "utf8::ToSpecFoo"
subroutines are special exception mappings derived from
$Config{privlib}/unicore/SpecialCasing.txt. The "Digit" and "Fold"
mappings that one can see in the directory are not directly
user-accessible; one can use either the "Unicode::UCD" module, or
just match case-insensitively (which is when the "Fold" mapping is
used).
848
849 The mappings will only take effect on scalars that have been marked as
850 having Unicode characters, for example by using "utf8::upgrade()". Old
851 byte-style strings are not affected.
852
853 The mappings are in effect for the package they are defined in.
854
855 Character Encodings for Input and Output
856 See Encode.
857
858 Unicode Regular Expression Support Level
The following list describes the Unicode regular-expression
features Perl currently supports. The references to "Level N" and
the section numbers refer to the Unicode Technical Standard #18,
"Unicode Regular Expressions", version 11, May 2005.
863
864 · Level 1 - Basic Unicode Support
865
866 RL1.1 Hex Notation - done [1]
867 RL1.2 Properties - done [2][3]
868 RL1.2a Compatibility Properties - done [4]
869 RL1.3 Subtraction and Intersection - MISSING [5]
870 RL1.4 Simple Word Boundaries - done [6]
871 RL1.5 Simple Loose Matches - done [7]
872 RL1.6 Line Boundaries - MISSING [8]
873 RL1.7 Supplementary Code Points - done [9]
874
875 [1] \x{...}
876 [2] \p{...} \P{...}
877 [3] supports not only minimal list, but all Unicode character
878 properties (see L</Unicode Character Properties>)
879 [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
880 [5] can use regular expression look-ahead [a] or
881 user-defined character properties [b] to emulate set operations
882 [6] \b \B
883 [7] note that Perl does Full case-folding in matching (but with bugs),
884 not Simple: for example U+1F88 is equivalent to U+1F00 U+03B9,
885 not with 1F80. This difference matters mainly for certain Greek
886 capital letters with certain modifiers: the Full case-folding
887 decomposes the letter, while the Simple case-folding would map
888 it to a single character.
889 [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r),
890 CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029);
891 should also affect <>, $., and script line numbers;
892 should not split lines within CRLF [c] (i.e. there is no empty
893 line between \r and \n)
[9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to
    U+10FFFF but also beyond U+10FFFF [d]
896
897 [a] You can mimic class subtraction using lookahead. For example,
898 what UTS#18 might write as
899
900 [{Greek}-[{UNASSIGNED}]]
901
902 in Perl can be written as:
903
904 (?!\p{Unassigned})\p{InGreekAndCoptic}
905 (?=\p{Assigned})\p{InGreekAndCoptic}
906
907 But in this particular example, you probably really want
908
909 \p{GreekAndCoptic}
910
911 which will match assigned characters known to be part of the Greek
912 script.
913
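The lookahead idiom for class subtraction is not Perl-specific. A
minimal sketch in Python's "re" module (which lacks "\p{...}", so
plain ASCII classes stand in for the Unicode properties here):

```python
import re

# Class subtraction via lookahead: match hex digits that are NOT
# decimal digits, i.e. [0-9a-f] minus [0-9] -- structurally the
# same trick as (?!\p{Unassigned})\p{InGreekAndCoptic}.
hex_minus_digit = re.compile(r"(?![0-9])[0-9a-f]")

matches = hex_minus_digit.findall("0a1b2c")  # -> ['a', 'b', 'c']
```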
Also see the Unicode::Regex::Set module, which implements the full
UTS#18 grouping, intersection, union, and removal (subtraction)
syntax.
917
918 [b] '+' for union, '-' for removal (set-difference), '&' for
919 intersection (see "User-Defined Character Properties")
920
921 [c] Try the ":crlf" layer (see PerlIO).
922
[d] U+FFFF will currently generate a warning message if 'utf8'
    warnings are enabled
926
927 · Level 2 - Extended Unicode Support
928
929 RL2.1 Canonical Equivalents - MISSING [10][11]
930 RL2.2 Default Grapheme Clusters - MISSING [12]
931 RL2.3 Default Word Boundaries - MISSING [14]
932 RL2.4 Default Loose Matches - MISSING [15]
933 RL2.5 Name Properties - MISSING [16]
934 RL2.6 Wildcard Properties - MISSING
935
936 [10] see UAX#15 "Unicode Normalization Forms"
937 [11] have Unicode::Normalize but not integrated to regexes
938 [12] have \X but we don't have a "Grapheme Cluster Mode"
939 [14] see UAX#29, Word Boundaries
940 [15] see UAX#21 "Case Mappings"
941 [16] have \N{...} but neither compute names of CJK Ideographs
942 and Hangul Syllables nor use a loose match [e]
943
944 [e] "\N{...}" allows namespaces (see charnames).
945
946 · Level 3 - Tailored Support
947
948 RL3.1 Tailored Punctuation - MISSING
949 RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
950 RL3.3 Tailored Word Boundaries - MISSING
951 RL3.4 Tailored Loose Matches - MISSING
952 RL3.5 Tailored Ranges - MISSING
953 RL3.6 Context Matching - MISSING [19]
954 RL3.7 Incremental Matches - MISSING
955 ( RL3.8 Unicode Set Sharing )
956 RL3.9 Possible Match Sets - MISSING
957 RL3.10 Folded Matching - MISSING [20]
958 RL3.11 Submatchers - MISSING
959
960 [17] see UAX#10 "Unicode Collation Algorithms"
961 [18] have Unicode::Collate but not integrated to regexes
962 [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see
963 outside of the target substring
964 [20] need insensitive matching for linguistic features other than case;
965 for example, hiragana to katakana, wide and narrow, simplified Han
966 to traditional Han (see UTR#30 "Character Foldings")
967
968 Unicode Encodings
969 Unicode characters are assigned to code points, which are abstract
970 numbers. To use these numbers, various encodings are needed.
971
972 · UTF-8
973
974 UTF-8 is a variable-length (1 to 6 bytes, current character
975 allocations require 4 bytes), byte-order independent encoding. For
976 ASCII (and we really do mean 7-bit ASCII, not another 8-bit
977 encoding), UTF-8 is transparent.
978
979 The following table is from Unicode 3.2.
980
981 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
982
983 U+0000..U+007F 00..7F
984 U+0080..U+07FF * C2..DF 80..BF
985 U+0800..U+0FFF E0 * A0..BF 80..BF
986 U+1000..U+CFFF E1..EC 80..BF 80..BF
987 U+D000..U+D7FF ED 80..9F 80..BF
988 U+D800..U+DFFF +++++++ utf16 surrogates, not legal utf8 +++++++
989 U+E000..U+FFFF EE..EF 80..BF 80..BF
990 U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF
991 U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
992 U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
993
994 Note the gaps before several of the byte entries above marked by
995 '*'. These are caused by legal UTF-8 avoiding non-shortest
996 encodings: it is technically possible to UTF-8-encode a single code
997 point in different ways, but that is explicitly forbidden, and the
998 shortest possible encoding should always be used (and that is what
999 Perl does).
1000
1001 Another way to look at it is via bits:
1002
1003 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
1004
1005 0aaaaaaa 0aaaaaaa
1006 00000bbbbbaaaaaa 110bbbbb 10aaaaaa
1007 ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa
1008 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
1009
1010 As you can see, the continuation bytes all begin with "10", and the
1011 leading bits of the start byte tell how many bytes there are in the
1012 encoded character.
1013
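The bit layout above can be verified directly. A small sketch in
Python, hand-assembling the byte forms and checking them against a
standard UTF-8 encoder:

```python
# Sketch: encode a code point by hand per the bit patterns above,
# then compare with Python's built-in UTF-8 encoder.
def utf8_bytes(cp):
    if cp < 0x80:                      # 0aaaaaaa
        return bytes([cp])
    if cp < 0x800:                     # 110bbbbb 10aaaaaa
        return bytes([0xC0 | cp >> 6,
                      0x80 | cp & 0x3F])
    if cp < 0x10000:                   # 1110cccc 10bbbbbb 10aaaaaa
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,     # 11110ddd 10cccccc ...
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

for cp in (0x41, 0x7FF, 0x2660, 0x10FFFF):
    assert utf8_bytes(cp) == chr(cp).encode("utf-8")
```

Note that a real encoder must also reject the surrogate range
U+D800..U+DFFF, which this sketch omits.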
1014 · UTF-EBCDIC
1015
1016 Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
1017
1018 · UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
1019
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
1022
1023 UTF-16 is a 2 or 4 byte encoding. The Unicode code points
1024 "U+0000..U+FFFF" are stored in a single 16-bit unit, and the code
1025 points "U+10000..U+10FFFF" in two 16-bit units. The latter case is
1026 using surrogates, the first 16-bit unit being the high surrogate,
1027 and the second being the low surrogate.
1028
1029 Surrogates are code points set aside to encode the
1030 "U+10000..U+10FFFF" range of Unicode code points in pairs of 16-bit
1031 units. The high surrogates are the range "U+D800..U+DBFF" and the
1032 low surrogates are the range "U+DC00..U+DFFF". The surrogate
1033 encoding is
1034
1035 $hi = ($uni - 0x10000) / 0x400 + 0xD800;
1036 $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
1037
1038 and the decoding is
1039
1040 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
1041
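The encode and decode arithmetic above round-trips exactly; a
quick sketch in Python:

```python
# Sketch: the UTF-16 surrogate-pair arithmetic shown above.
def to_surrogates(cp):
    hi = (cp - 0x10000) // 0x400 + 0xD800   # high surrogate
    lo = (cp - 0x10000) % 0x400 + 0xDC00    # low surrogate
    return hi, lo

def from_surrogates(hi, lo):
    return 0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)

hi, lo = to_surrogates(0x10437)   # U+10437 DESERET SMALL LETTER YEE
assert (hi, lo) == (0xD801, 0xDC37)
assert from_surrogates(hi, lo) == 0x10437
```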
1042 If you try to generate surrogates (for example by using chr()), you
1043 will get a warning, if warnings are turned on, because those code
1044 points are not valid for a Unicode character.
1045
1046 Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
1047 itself can be used for in-memory computations, but if storage or
1048 transfer is required either UTF-16BE (big-endian) or UTF-16LE
1049 (little-endian) encodings must be chosen.
1050
1051 This introduces another problem: what if you just know that your
1052 data is UTF-16, but you don't know which endianness? Byte Order
1053 Marks, or BOMs, are a solution to this. A special character has
1054 been reserved in Unicode to function as a byte order marker: the
1055 character with the code point "U+FEFF" is the BOM.
1056
1057 The trick is that if you read a BOM, you will know the byte order,
1058 since if it was written on a big-endian platform, you will read the
1059 bytes "0xFE 0xFF", but if it was written on a little-endian
1060 platform, you will read the bytes "0xFF 0xFE". (And if the
1061 originating platform was writing in UTF-8, you will read the bytes
1062 "0xEF 0xBB 0xBF".)
1063
The way this trick works is that the character with the code point
"U+FFFE" is guaranteed not to be a valid Unicode character, so the
sequence of bytes "0xFF 0xFE" is unambiguously "BOM, represented
in little-endian format" and cannot be "U+FFFE, represented in
big-endian format". (Actually, "U+FFFE" is legal for use by your
program, even for input/output, but better not use it if you need
a BOM. But it is "illegal for interchange", so an unsuspecting
program won't get confused.)
1072
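A byte-order sniffer following this logic might look like the
sketch below (illustrative only; real code should fall back to
context or metadata when no BOM is present, and would need to test
the longer UTF-32 signatures before the UTF-16 ones):

```python
# Sketch: guess an encoding from a leading BOM, per the byte
# sequences described above. Returns None when no BOM is found.
def sniff_bom(data):
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    return None

# U+FEFF encoded as UTF-8 yields the bytes 0xEF 0xBB 0xBF.
assert sniff_bom("\N{ZERO WIDTH NO-BREAK SPACE}hi".encode("utf-8")) == "UTF-8"
```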
1073 · UTF-32, UTF-32BE, UTF-32LE
1074
The UTF-32 family is pretty much like the UTF-16 family, except
that the units are 32-bit, and therefore the surrogate scheme is
not needed. The BOM signatures will be "0x00 0x00 0xFE 0xFF" for
BE and "0xFF 0xFE 0x00 0x00" for LE.
1079
1080 · UCS-2, UCS-4
1081
1082 Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
1083 encoding. Unlike UTF-16, UCS-2 is not extensible beyond "U+FFFF",
1084 because it does not use surrogates. UCS-4 is a 32-bit encoding,
1085 functionally identical to UTF-32.
1086
1087 · UTF-7
1088
1089 A seven-bit safe (non-eight-bit) encoding, which is useful if the
1090 transport or storage is not eight-bit safe. Defined by RFC 2152.
1091
1092 Security Implications of Unicode
1093 Read Unicode Security Considerations
1094 <http://www.unicode.org/reports/tr36>. Also, note the following:
1095
1096 · Malformed UTF-8
1097
1098 Unfortunately, the specification of UTF-8 leaves some room for
1099 interpretation of how many bytes of encoded output one should
1100 generate from one input Unicode character. Strictly speaking, the
1101 shortest possible sequence of UTF-8 bytes should be generated,
1102 because otherwise there is potential for an input buffer overflow
1103 at the receiving end of a UTF-8 connection. Perl always generates
1104 the shortest length UTF-8, and with warnings on, Perl will warn
1105 about non-shortest length UTF-8 along with other malformations,
1106 such as the surrogates, which are not real Unicode code points.
1107
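Strict decoders reject such non-shortest ("overlong") forms
outright; Python's UTF-8 decoder, for example:

```python
# Sketch: b"\xc0\xaf" is an overlong (non-shortest) encoding of
# "/" (U+002F); a strict UTF-8 decoder must reject it.
overlong_slash = b"\xc0\xaf"
try:
    overlong_slash.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected

# The shortest form decodes fine.
assert b"\x2f".decode("utf-8") == "/"
```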
· Regular expressions behave slightly differently between byte
  data and character (Unicode) data. For example, the "word
  character" character class "\w" will work differently depending
  on whether the data is eight-bit bytes or Unicode.
1112
1113 In the first case, the set of "\w" characters is either small--the
1114 default set of alphabetic characters, digits, and the "_"--or, if
1115 you are using a locale (see perllocale), the "\w" might contain a
1116 few more letters according to your language and country.
1117
1118 In the second case, the "\w" set of characters is much, much
1119 larger. Most importantly, even in the set of the first 256
1120 characters, it will probably match different characters: unlike
1121 most locales, which are specific to a language and country pair,
1122 Unicode classifies all the characters that are letters somewhere as
1123 "\w". For example, your locale might not think that LATIN SMALL
1124 LETTER ETH is a letter (unless you happen to speak Icelandic), but
1125 Unicode does.
1126
1127 As discussed elsewhere, Perl has one foot (two hooves?) planted in
1128 each of two worlds: the old world of bytes and the new world of
1129 characters, upgrading from bytes to characters when necessary. If
1130 your legacy code does not explicitly use Unicode, no automatic
1131 switch-over to characters should happen. Characters shouldn't get
1132 downgraded to bytes, either. It is possible to accidentally mix
1133 bytes and characters, however (see perluniintro), in which case
1134 "\w" in regular expressions might start behaving differently.
1135 Review your code. Use warnings and the "strict" pragma.
1136
1137 Unicode in Perl on EBCDIC
1138 The way Unicode is handled on EBCDIC platforms is still experimental.
1139 On such platforms, references to UTF-8 encoding in this document and
1140 elsewhere should be read as meaning the UTF-EBCDIC specified in Unicode
1141 Technical Report 16, unless ASCII vs. EBCDIC issues are specifically
1142 discussed. There is no "utfebcdic" pragma or ":utfebcdic" layer;
1143 rather, "utf8" and ":utf8" are reused to mean the platform's "natural"
1144 8-bit encoding of Unicode. See perlebcdic for more discussion of the
1145 issues.
1146
1147 Locales
1148 Usually locale settings and Unicode do not affect each other, but there
1149 are a couple of exceptions:
1150
1151 · You can enable automatic UTF-8-ification of your standard file
1152 handles, default "open()" layer, and @ARGV by using either the "-C"
1153 command line switch or the "PERL_UNICODE" environment variable, see
1154 perlrun for the documentation of the "-C" switch.
1155
1156 · Perl tries really hard to work both with Unicode and the old byte-
1157 oriented world. Most often this is nice, but sometimes Perl's
1158 straddling of the proverbial fence causes problems.
1159
1160 When Unicode Does Not Happen
While Perl does have extensive ways to input and output in
Unicode, and a few other 'entry points' like @ARGV which can be
interpreted as Unicode (UTF-8), there still are many places where
Unicode (in some encoding or another) could be given as arguments
or received as results, or both, but it is not.
1166
1167 The following are such interfaces. Also, see "The "Unicode Bug"". For
1168 all of these interfaces Perl currently (as of 5.8.3) simply assumes
1169 byte strings both as arguments and results, or UTF-8 strings if the
1170 "encoding" pragma has been used.
1171
1172 One reason why Perl does not attempt to resolve the role of Unicode in
1173 these cases is that the answers are highly dependent on the operating
1174 system and the file system(s). For example, whether filenames can be
1175 in Unicode, and in exactly what kind of encoding, is not exactly a
1176 portable concept. Similarly for the qx and system: how well will the
1177 'command line interface' (and which of them?) handle Unicode?
1178
1179 · chdir, chmod, chown, chroot, exec, link, lstat, mkdir, rename,
1180 rmdir, stat, symlink, truncate, unlink, utime, -X
1181
1182 · %ENV
1183
1184 · glob (aka the <*>)
1185
1186 · open, opendir, sysopen
1187
1188 · qx (aka the backtick operator), system
1189
1190 · readdir, readlink
1191
1192 The "Unicode Bug"
The term "Unicode bug" has been applied to an inconsistency with
the Unicode characters whose ordinals are in the Latin-1
Supplement block, that is, between 128 and 255. Without a locale
specified, and unlike all other characters or code points, these
characters behave very differently under byte semantics versus
character semantics.
1198
1199 In character semantics they are interpreted as Unicode code points,
1200 which means they have the same semantics as Latin-1 (ISO-8859-1).
1201
1202 In byte semantics, they are considered to be unassigned characters,
1203 meaning that the only semantics they have is their ordinal numbers, and
1204 that they are not members of various character classes. None are
1205 considered to match "\w" for example, but all match "\W". (On EBCDIC
1206 platforms, the behavior may be different from this, depending on the
1207 underlying C language library functions.)
1208
1209 The behavior is known to have effects on these areas:
1210
1211 · Changing the case of a scalar, that is, using "uc()", "ucfirst()",
1212 "lc()", and "lcfirst()", or "\L", "\U", "\u" and "\l" in regular
1213 expression substitutions.
1214
1215 · Using caseless ("/i") regular expression matching
1216
1217 · Matching a number of properties in regular expressions, such as
1218 "\w"
1219
1220 · User-defined case change mappings. You can create a "ToUpper()"
1221 function, for example, which overrides Perl's built-in case
1222 mappings. The scalar must be encoded in utf8 for your function to
1223 actually be invoked.
1224
This behavior can lead to unexpected results: a string's semantics
can suddenly change from byte to character (or vice versa) merely
because a code point above 255 is appended to or removed from it.
As an example, consider the following program and its output:
1230
1231 $ perl -le'
1232 $s1 = "\xC2";
1233 $s2 = "\x{2660}";
1234 for ($s1, $s2, $s1.$s2) {
1235 print /\w/ || 0;
1236 }
1237 '
1238 0
1239 0
1240 1
1241
If there's no "\w" in $s1 or in $s2, why does their concatenation
have one?
1244
1245 This anomaly stems from Perl's attempt to not disturb older programs
1246 that didn't use Unicode, and hence had no semantics for characters
1247 outside of the ASCII range (except in a locale), along with Perl's
1248 desire to add Unicode support seamlessly. The result wasn't seamless:
1249 these characters were orphaned.
1250
1251 Work is being done to correct this, but only some of it was complete in
1252 time for the 5.12 release. What has been finished is the important
1253 part of the case changing component. Due to concerns, and some
1254 evidence, that older code might have come to rely on the existing
1255 behavior, the new behavior must be explicitly enabled by the feature
1256 "unicode_strings" in the feature pragma, even though no new syntax is
1257 involved.
1258
1259 See "lc" in perlfunc for details on how this pragma works in
1260 combination with various others for casing. Even though the pragma
1261 only affects casing operations in the 5.12 release, it is planned to
1262 have it affect all the problematic behaviors in later releases: you
1263 can't have one without them all.
1264
1265 In the meantime, a workaround is to always call utf8::upgrade($string),
1266 or to use the standard module Encode. Also, a scalar that has any
1267 characters whose ordinal is above 0x100, or which were specified using
1268 either of the "\N{...}" notations will automatically have character
1269 semantics.
1270
1271 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1272 Sometimes (see "When Unicode Does Not Happen" or "The "Unicode Bug"")
1273 there are situations where you simply need to force a byte string into
1274 UTF-8, or vice versa. The low-level calls utf8::upgrade($bytestring)
1275 and utf8::downgrade($utf8string[, FAIL_OK]) are the answers.
1276
1277 Note that utf8::downgrade() can fail if the string contains characters
1278 that don't fit into a byte.
1279
1280 Calling either function on a string that already is in the desired
1281 state is a no-op.
1282
1283 Using Unicode in XS
1284 If you want to handle Perl Unicode in XS extensions, you may find the
1285 following C APIs useful. See also "Unicode Support" in perlguts for an
1286 explanation about Unicode at the XS level, and perlapi for the API
1287 details.
1288
1289 · "DO_UTF8(sv)" returns true if the "UTF8" flag is on and the bytes
1290 pragma is not in effect. "SvUTF8(sv)" returns true if the "UTF8"
1291 flag is on; the bytes pragma is ignored. The "UTF8" flag being on
1292 does not mean that there are any characters of code points greater
1293 than 255 (or 127) in the scalar or that there are even any
1294 characters in the scalar. What the "UTF8" flag means is that the
1295 sequence of octets in the representation of the scalar is the
1296 sequence of UTF-8 encoded code points of the characters of a
1297 string. The "UTF8" flag being off means that each octet in this
1298 representation encodes a single character with code point 0..255
1299 within the string. Perl's Unicode model is not to use UTF-8 until
1300 it is absolutely necessary.
1301
1302 · "uvchr_to_utf8(buf, chr)" writes a Unicode character code point
1303 into a buffer encoding the code point as UTF-8, and returns a
1304 pointer pointing after the UTF-8 bytes. It works appropriately on
1305 EBCDIC machines.
1306
1307 · "utf8_to_uvchr(buf, lenp)" reads UTF-8 encoded bytes from a buffer
1308 and returns the Unicode character code point and, optionally, the
1309 length of the UTF-8 byte sequence. It works appropriately on
1310 EBCDIC machines.
1311
1312 · "utf8_length(start, end)" returns the length of the UTF-8 encoded
1313 buffer in characters. "sv_len_utf8(sv)" returns the length of the
1314 UTF-8 encoded scalar.
1315
1316 · "sv_utf8_upgrade(sv)" converts the string of the scalar to its
1317 UTF-8 encoded form. "sv_utf8_downgrade(sv)" does the opposite, if
1318 possible. "sv_utf8_encode(sv)" is like sv_utf8_upgrade except that
1319 it does not set the "UTF8" flag. "sv_utf8_decode()" does the
1320 opposite of "sv_utf8_encode()". Note that none of these are to be
1321 used as general-purpose encoding or decoding interfaces: "use
1322 Encode" for that. "sv_utf8_upgrade()" is affected by the encoding
1323 pragma but "sv_utf8_downgrade()" is not (since the encoding pragma
1324 is designed to be a one-way street).
1325
· "is_utf8_char(s)" returns true if the pointer points to a valid
  UTF-8 character.
1328
1329 · "is_utf8_string(buf, len)" returns true if "len" bytes of the
1330 buffer are valid UTF-8.
1331
1332 · "UTF8SKIP(buf)" will return the number of bytes in the UTF-8
1333 encoded character in the buffer. "UNISKIP(chr)" will return the
1334 number of bytes required to UTF-8-encode the Unicode character code
1335 point. "UTF8SKIP()" is useful for example for iterating over the
1336 characters of a UTF-8 encoded buffer; "UNISKIP()" is useful, for
1337 example, in computing the size required for a UTF-8 encoded buffer.
1338
1339 · "utf8_distance(a, b)" will tell the distance in characters between
1340 the two pointers pointing to the same UTF-8 encoded buffer.
1341
1342 · "utf8_hop(s, off)" will return a pointer to a UTF-8 encoded buffer
1343 that is "off" (positive or negative) Unicode characters displaced
1344 from the UTF-8 buffer "s". Be careful not to overstep the buffer:
1345 "utf8_hop()" will merrily run off the end or the beginning of the
1346 buffer if told to do so.
1347
1348 · "pv_uni_display(dsv, spv, len, pvlim, flags)" and
1349 "sv_uni_display(dsv, ssv, pvlim, flags)" are useful for debugging
1350 the output of Unicode strings and scalars. By default they are
1351 useful only for debugging--they display all characters as
1352 hexadecimal code points--but with the flags "UNI_DISPLAY_ISPRINT",
1353 "UNI_DISPLAY_BACKSLASH", and "UNI_DISPLAY_QQ" you can make the
1354 output more readable.
1355
1356 · "ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)" can be used to
1357 compare two strings case-insensitively in Unicode. For case-
1358 sensitive comparisons you can just use "memEQ()" and "memNE()" as
1359 usual.
1360
1361 For more information, see perlapi, and utf8.c and utf8.h in the Perl
1362 source code distribution.
1363
1364 Hacking Perl to work on earlier Unicode versions (for very serious hackers
1365 only)
1366 Perl by default comes with the latest supported Unicode version built
1367 in, but you can change to use any earlier one.
1368
Download the files in the version of Unicode that you want from
the Unicode web site <http://www.unicode.org>. These should
replace the existing files in $Config{privlib}/unicore. (%Config
is available from the Config module.) Follow the instructions in
README.perl in that directory to change some of their names, and
then run make.
1375
It is even possible to download them to a different directory and
then change utf8_heavy.pl in the directory $Config{privlib} to
point to the new directory; or make a copy of that directory
before making the change, and use @INC or the "-I" run-time flag
to switch between versions at will (though, because of caching,
not in the middle of a process). All this is beyond the scope of
these instructions.
1382
1384 Interaction with Locales
1385 Use of locales with Unicode data may lead to odd results. Currently,
1386 Perl attempts to attach 8-bit locale info to characters in the range
1387 0..255, but this technique is demonstrably incorrect for locales that
1388 use characters above that range when mapped into Unicode. Perl's
1389 Unicode support will also tend to run slower. Use of locales with
1390 Unicode is discouraged.
1391
1392 Problems with characters in the Latin-1 Supplement range
1393 See "The "Unicode Bug""
1394
1395 Problems with case-insensitive regular expression matching
1396 There are problems with case-insensitive matches, including those
1397 involving character classes (enclosed in [square brackets]), characters
1398 whose fold is to multiple characters (such as the single character
1399 LATIN SMALL LIGATURE FFL matches case-insensitively with the
1400 3-character string "ffl"), and characters in the Latin-1 Supplement.
1401
1402 Interaction with Extensions
1403 When Perl exchanges data with an extension, the extension should be
1404 able to understand the UTF8 flag and act accordingly. If the extension
1405 doesn't know about the flag, it's likely that the extension will return
1406 incorrectly-flagged data.
1407
So if you're working with Unicode data, consult the documentation
of every module you're using to see whether there are any issues
with Unicode data exchange. If the documentation does not talk
about Unicode at all, suspect the worst and probably look at the
source to learn how the module is implemented. Modules written
completely in Perl shouldn't cause problems. Modules that directly
or indirectly access code written in other programming languages
are at risk.
1415
1416 For affected functions, the simple strategy to avoid data corruption is
1417 to always make the encoding of the exchanged data explicit. Choose an
1418 encoding that you know the extension can handle. Convert arguments
1419 passed to the extensions to that encoding and convert results back from
1420 that encoding. Write wrapper functions that do the conversions for you,
1421 so you can later change the functions when the extension catches up.

       To provide an example, let's say the popular Foo::Bar::escape_html
       function doesn't deal with Unicode data yet. The wrapper function would
       convert the argument to raw UTF-8 and convert the result back to Perl's
       internal representation like so:

           sub my_escape_html ($) {
               my $what = shift;
               return unless defined $what;
               return Encode::decode_utf8(
                   Foo::Bar::escape_html(Encode::encode_utf8($what)));
           }

       Sometimes, when the extension does not convert data but just stores and
       retrieves them, you will be in a position to use the otherwise
       dangerous Encode::_utf8_on() function. Let's say the popular "Foo::Bar"
       extension, written in C, provides a "param" method that lets you store
       and retrieve data according to these prototypes:

           $self->param($name, $value);   # set a scalar
           $value = $self->param($name);  # retrieve a scalar

       If it does not yet provide support for any encoding, one could write a
       derived class with such a "param" method:

           sub param {
               my ($self, $name, $value) = @_;
               utf8::upgrade($name);          # make sure it is UTF-8 encoded
               if (defined $value) {
                   utf8::upgrade($value);     # make sure it is UTF-8 encoded
                   return $self->SUPER::param($name, $value);
               } else {
                   my $ret = $self->SUPER::param($name);
                   Encode::_utf8_on($ret);    # we know it is UTF-8 encoded
                   return $ret;
               }
           }

       Some extensions provide filters on data entry/exit points, such as
       DB_File::filter_store_key and family. Look out for such filters in the
       documentation of your extensions; they can make the transition to
       Unicode data much easier.
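       To sketch what such a filter pair accomplishes (hypothetical helper
       names; a plain hash stands in for the tied DB_File hash, which can
       store only octets):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my %db;    # stands in for a tied DB_File hash: octets in, octets out

# A store filter encodes characters to UTF-8 octets on the way in ...
sub store_kv {
    my ($key, $value) = @_;
    $db{ encode_utf8($key) } = encode_utf8($value);
}

# ... and a fetch filter decodes octets back to characters on the way out.
sub fetch_v {
    my ($key) = @_;
    return decode_utf8( $db{ encode_utf8($key) } );
}

store_kv("cl\x{E9}", "caf\x{E9}");   # 'clé' => 'café'
print fetch_v("cl\x{E9}"), "\n";
```

       With DB_File itself, the same conversions would be installed once via
       $db->filter_store_key(), filter_store_value(), filter_fetch_key() and
       filter_fetch_value(), so every access through the tied hash is
       converted automatically.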
   Speed
       Some functions are slower when working on UTF-8 encoded strings than on
       byte encoded strings. All functions that need to hop over characters,
       such as length(), substr() or index(), or matching regular expressions,
       can work much faster when the underlying data are byte-encoded.

       In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 a
       caching scheme was introduced which will hopefully make the slowness
       somewhat less spectacular, at least for some operations. In general,
       operations with UTF-8 encoded strings are still slower. As an example,
       the Unicode properties (character classes) like "\p{Nd}" are known to
       be quite a bit slower (5-20 times) than their simpler counterparts like
       "\d" (then again, there are 268 Unicode characters matching "Nd"
       compared with the 10 ASCII characters matching "d").
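       The extra work is easy to see: "\p{Nd}" has to recognize decimal digits
       from every script, not just ASCII. A minimal sketch (ARABIC-INDIC DIGIT
       FIVE is one of the non-ASCII characters in "Nd"):

```perl
use strict;
use warnings;

my $ascii_five  = "5";
my $arabic_five = "\x{0665}";   # ARABIC-INDIC DIGIT FIVE

# Both are decimal digits as far as the Unicode property is concerned:
for my $ch ($ascii_five, $arabic_five) {
    printf "U+%04X is \\p{Nd}\n", ord $ch if $ch =~ /\p{Nd}/;
}
```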
   Problems on EBCDIC platforms
       There are a number of known problems with Perl on EBCDIC platforms. If
       you want to use Perl there, send email to perlbug@perl.org.

       In earlier versions, when byte and character data were concatenated,
       the new string was sometimes created by decoding the byte strings as
       ISO 8859-1 (Latin-1), even if the old Unicode string used EBCDIC.

       If you find any of these problems, please report them as bugs.

   Porting code from perl-5.6.X
       Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
       was required to use the "utf8" pragma to declare that a given scope
       expected to deal with Unicode data and had to make sure that only
       Unicode data were reaching that scope. If you have code that is working
       with 5.6, you will need some of the following adjustments to your code.
       The examples are written such that the code will continue to work under
       5.6, so you should be safe to try them out.

       ·   A filehandle that should read or write UTF-8

               if ($] > 5.007) {
                   binmode $fh, ":encoding(utf8)";
               }

       ·   A scalar that is going to be passed to some extension

           Be it Compress::Zlib, Apache::Request or any extension that has no
           mention of Unicode in the manpage, you need to make sure that the
           UTF8 flag is stripped off. Note that at the time of this writing
           (October 2002) the mentioned modules are not UTF-8-aware. Please
           check the documentation to verify if this is still true.

               if ($] > 5.007) {
                   require Encode;
                   $val = Encode::encode_utf8($val);  # make octets
               }

       ·   A scalar we got back from an extension

           If you believe the scalar comes back as UTF-8, you will most likely
           want the UTF8 flag restored:

               if ($] > 5.007) {
                   require Encode;
                   $val = Encode::decode_utf8($val);
               }

       ·   Same thing, if you are really sure it is UTF-8

               if ($] > 5.007) {
                   require Encode;
                   Encode::_utf8_on($val);
               }

       ·   A wrapper for fetchrow_array and fetchrow_hashref

           When the database contains only UTF-8, a wrapper function or method
           is a convenient way to replace all your fetchrow_array and
           fetchrow_hashref calls. A wrapper function will also make it easier
           to adapt to future enhancements in your database driver. Note that
           at the time of this writing (October 2002), the DBI has no
           standardized way to deal with UTF-8 data. Please check the
           documentation to verify if that is still true.

               sub fetchrow {
                   # $what is one of fetchrow_{array,hashref}
                   my ($self, $sth, $what) = @_;
                   if ($] < 5.007) {
                       return $sth->$what;
                   } else {
                       require Encode;
                       if (wantarray) {
                           my @arr = $sth->$what;
                           for (@arr) {
                               defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                           }
                           return @arr;
                       } else {
                           my $ret = $sth->$what;
                           if (ref $ret) {
                               for my $k (keys %$ret) {
                                   defined && /[^\000-\177]/ && Encode::_utf8_on($_)
                                       for $ret->{$k};
                               }
                               return $ret;
                           } else {
                               defined && /[^\000-\177]/ && Encode::_utf8_on($_)
                                   for $ret;
                               return $ret;
                           }
                       }
                   }
               }

       ·   A large scalar that you know can only contain ASCII

           Scalars that contain only ASCII and are marked as UTF-8 are
           sometimes a drag to your program. If you recognize such a
           situation, just remove the UTF8 flag:

               utf8::downgrade($val) if $] > 5.007;
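           A sketch of the flag's behavior: upgrading an ASCII-only scalar
           sets the UTF8 flag without changing its contents, and downgrading
           it again is always safe:

```perl
use strict;
use warnings;

my $val = "plain ASCII text";

utf8::upgrade($val);     # now stored as UTF-8 internally, flag on
print "flagged\n" if utf8::is_utf8($val);

utf8::downgrade($val);   # ASCII needs no UTF-8 representation; flag off
print "unflagged\n" unless utf8::is_utf8($val);
```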

SEE ALSO
       perlunitut, perluniintro, perluniprops, Encode, open, utf8, bytes,
       perlretut, "${^UNICODE}" in perlvar,
       <http://www.unicode.org/reports/tr44>.



perl v5.12.4                      2011-06-07                   PERLUNICODE(1)