PERLUNICODE(1) Perl Programmers Reference Guide PERLUNICODE(1)
2
3
4
6 perlunicode - Unicode support in Perl
7
9 Important Caveats
10 Unicode support is an extensive requirement. While Perl does not
11 implement the Unicode standard or the accompanying technical reports
12 from cover to cover, Perl does support many Unicode features.
13
People who want to learn to use Unicode in Perl should probably read
the Perl Unicode tutorials, perlunitut and perluniintro, before reading
this reference document.
17
18 Also, the use of Unicode may present security issues that aren't
19 obvious. Read Unicode Security Considerations
20 <http://www.unicode.org/reports/tr36>.
21
22 Safest if you "use feature 'unicode_strings'"
23 In order to preserve backward compatibility, Perl does not turn on
24 full internal Unicode support unless the pragma "use feature
25 'unicode_strings'" is specified. (This is automatically selected
if you use "use 5.012" or higher.) Failure to do this can cause
surprises. See "The "Unicode Bug"" below.
28
29 This pragma doesn't affect I/O, and there are still several places
30 where Unicode isn't fully supported, such as in filenames.
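
As a rough illustration of what the pragma changes, the following sketch case-maps the same byte string inside and outside its scope. It assumes an ASCII platform, Perl 5.12 or later, and no "use locale" in scope; the variable names are made up for the example.

```perl
use strict;
use warnings;

# "\xE0" is LATIN SMALL LETTER A WITH GRAVE in Latin-1. With no character
# above 255 in it, Perl stores it as a plain byte string.
my $a_grave = "\xE0";

my $without_feature = do {
    no feature 'unicode_strings';    # native byte semantics for 128-255
    uc($a_grave);
};

my $with_feature = do {
    use feature 'unicode_strings';   # Unicode semantics for all code points
    uc($a_grave);
};

printf "without: U+%04X\n", ord($without_feature);   # U+00E0 (unchanged)
printf "with:    U+%04X\n", ord($with_feature);      # U+00C0 (uppercased)
```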
31
32 Input and Output Layers
33 Perl knows when a filehandle uses Perl's internal Unicode encodings
34 (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened
35 with the ":encoding(utf8)" layer. Other encodings can be converted
36 to Perl's encoding on input or from Perl's encoding on output by
37 use of the ":encoding(...)" layer. See open.
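
A minimal sketch of an encoding layer at work follows. It writes to an in-memory filehandle rather than a real file, and it spells the layer ":encoding(UTF-8)" (the strictly checked form); both are choices made for this illustration, not requirements of the text above.

```perl
use strict;
use warnings;

my $text  = "caf\x{E9} \x{263A}";   # 6 characters, one of them above 255
my $bytes = '';

# On output the layer converts Perl's characters into UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', \$bytes) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# On input the same layer converts the bytes back into characters.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open for read: $!";
my $round_trip = <$in>;
close($in);

printf "%d characters, %d bytes\n", length($text), length($bytes);
print "round trip ok\n" if $round_trip eq $text;
```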
38
39 To indicate that Perl source itself is in UTF-8, use "use utf8;".
40
41 "use utf8" still needed to enable UTF-8/UTF-EBCDIC in scripts
42 As a compatibility measure, the "use utf8" pragma must be
43 explicitly included to enable recognition of UTF-8 in the Perl
44 scripts themselves (in string or regular expression literals, or in
45 identifier names) on ASCII-based machines or to recognize UTF-
46 EBCDIC on EBCDIC-based machines. These are the only times when an
47 explicit "use utf8" is needed. See utf8.
48
49 BOM-marked scripts and UTF-16 scripts autodetected
50 If a Perl script begins marked with the Unicode BOM (UTF-16LE,
UTF-16BE, or UTF-8), or if the script looks like non-BOM-marked
52 UTF-16 of either endianness, Perl will correctly read in the script
53 as Unicode. (BOMless UTF-8 cannot be effectively recognized or
54 differentiated from ISO 8859-1 or other eight-bit encodings.)
55
56 "use encoding" needed to upgrade non-Latin-1 byte strings
57 By default, there is a fundamental asymmetry in Perl's Unicode
58 model: implicit upgrading from byte strings to Unicode strings
59 assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode
60 strings are downgraded with UTF-8 encoding. This happens because
the first 256 codepoints in Unicode happen to agree with Latin-1.
62
63 See "Byte and Character Semantics" for more details.
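
The implicit Latin-1 assumption can be seen in a small sketch. Here the byte string holds UTF-8 data that was never decoded; concatenating it with a wide character upgrades it as if it were Latin-1. The explicit call to Encode's decode() is one way to say what the bytes really are; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $utf8_bytes = "\xC3\xA9";    # the two UTF-8 bytes that encode U+00E9
my $smiley     = "\x{263A}";    # a character above 255 forces an upgrade

# Implicit upgrade: each byte is taken to be one Latin-1 character.
my $implicit = $utf8_bytes . $smiley;

# Explicit decoding: the two bytes become the single character U+00E9.
my $explicit = decode('UTF-8', $utf8_bytes) . $smiley;

for my $pair ([implicit => $implicit], [explicit => $explicit]) {
    my ($label, $string) = @$pair;
    my @code_points = map { sprintf 'U+%04X', ord } split //, $string;
    print "$label: @code_points\n";
}
```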
64
65 Byte and Character Semantics
66 Beginning with version 5.6, Perl uses logically-wide characters to
67 represent strings internally.
68
69 Starting in Perl 5.14, Perl-level operations work with characters
70 rather than bytes within the scope of a "use feature 'unicode_strings'"
71 (or equivalently "use 5.012" or higher). (This is not true if bytes
72 have been explicitly requested by "use bytes", nor necessarily true for
73 interactions with the platform's operating system.)
74
75 For earlier Perls, and when "unicode_strings" is not in effect, Perl
76 provides a fairly safe environment that can handle both types of
77 semantics in programs. For operations where Perl can unambiguously
78 decide that the input data are characters, Perl switches to character
79 semantics. For operations where this determination cannot be made
80 without additional information from the user, Perl decides in favor of
81 compatibility and chooses to use byte semantics.
82
83 When "use locale" (but not "use locale ':not_characters'") is in
84 effect, Perl uses the semantics associated with the current locale.
85 ("use locale" overrides "use feature 'unicode_strings'" in the same
86 scope; while "use locale ':not_characters'" effectively also selects
87 "use feature 'unicode_strings'" in its scope; see perllocale.)
88 Otherwise, Perl uses the platform's native byte semantics for
89 characters whose code points are less than 256, and Unicode semantics
90 for those greater than 255. On EBCDIC platforms, this is almost
91 seamless, as the EBCDIC code pages that Perl handles are equivalent to
92 Unicode's first 256 code points. (The exception is that EBCDIC regular
expression case-insensitive matching rules are not as robust as
94 Unicode's.) But on ASCII platforms, Perl uses US-ASCII (or Basic
95 Latin in Unicode terminology) byte semantics, meaning that characters
96 whose ordinal numbers are in the range 128 - 255 are undefined except
97 for their ordinal numbers. This means that none have case (upper and
98 lower), nor are any a member of character classes, like "[:alpha:]" or
99 "\w". (But all do belong to the "\W" class or the Perl regular
100 expression extension "[:^alpha:]".)
101
102 This behavior preserves compatibility with earlier versions of Perl,
103 which allowed byte semantics in Perl operations only if none of the
104 program's inputs were marked as being a source of Unicode character
105 data. Such data may come from filehandles, from calls to external
106 programs, from information provided by the system (such as %ENV), or
107 from literals and constants in the source text.
108
109 The "utf8" pragma is primarily a compatibility device that enables
110 recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
111 Note that this pragma is only required while Perl defaults to byte
112 semantics; when character semantics become the default, this pragma may
113 become a no-op. See utf8.
114
115 If strings operating under byte semantics and strings with Unicode
116 character data are concatenated, the new string will have character
117 semantics. This can cause surprises: See "BUGS", below. You can
118 choose to be warned when this happens. See encoding::warnings.
119
120 Under character semantics, many operations that formerly operated on
121 bytes now operate on characters. A character in Perl is logically just
122 a number ranging from 0 to 2**31 or so. Larger characters may encode
123 into longer sequences of bytes internally, but this internal detail is
124 mostly hidden for Perl code. See perluniintro for more.
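
To make the distinction concrete, here is a small sketch comparing the character length of a string with the number of bytes in its internal encoding. "use bytes" is used only to peek at the internal representation; the byte count shown assumes an ASCII platform, where that representation is UTF-8.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";    # one character, ordinal 0x263A

# Character semantics: one logical character.
my $char_length = length($smiley);

# Byte semantics, requested explicitly for this one expression.
my $byte_length = do { use bytes; length($smiley) };

print "characters: $char_length\n";   # 1
print "bytes:      $byte_length\n";   # 3 on ASCII platforms
```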
125
126 Effects of Character Semantics
127 Character semantics have the following effects:
128
129 · Strings--including hash keys--and regular expression patterns may
130 contain characters that have an ordinal value larger than 255.
131
132 If you use a Unicode editor to edit your program, Unicode
133 characters may occur directly within the literal strings in UTF-8
134 encoding, or UTF-16. (The former requires a BOM or "use utf8", the
135 latter requires a BOM.)
136
137 Unicode characters can also be added to a string by using the
138 "\N{U+...}" notation. The Unicode code for the desired character,
139 in hexadecimal, should be placed in the braces, after the "U". For
140 instance, a smiley face is "\N{U+263A}".
141
142 Alternatively, you can use the "\x{...}" notation for characters
143 0x100 and above. For characters below 0x100 you may get byte
144 semantics instead of character semantics; see "The "Unicode Bug"".
145 On EBCDIC machines there is the additional problem that the value
146 for such characters gives the EBCDIC character rather than the
147 Unicode one, thus it is more portable to use "\N{U+...}" instead.
148
149 Additionally, you can use the "\N{...}" notation and put the
150 official Unicode character name within the braces, such as
151 "\N{WHITE SMILING FACE}". This automatically loads the charnames
152 module with the ":full" and ":short" options. If you prefer
153 different options for this module, you can instead, before the
154 "\N{...}", explicitly load it with your desired options; for
155 example,
156
157 use charnames ':loose';
158
159 · If an appropriate encoding is specified, identifiers within the
160 Perl script may contain Unicode alphanumeric characters, including
161 ideographs. Perl does not currently attempt to canonicalize
162 variable names.
163
164 · Regular expressions match characters instead of bytes. "." matches
165 a character instead of a byte.
166
167 · Bracketed character classes in regular expressions match characters
168 instead of bytes and match against the character properties
169 specified in the Unicode properties database. "\w" can be used to
170 match a Japanese ideograph, for instance.
171
172 · Named Unicode properties, scripts, and block ranges may be used
173 (like bracketed character classes) by using the "\p{}" "matches
174 property" construct and the "\P{}" negation, "doesn't match
175 property". See "Unicode Character Properties" for more details.
176
177 You can define your own character properties and use them in the
178 regular expression with the "\p{}" or "\P{}" construct. See "User-
179 Defined Character Properties" for more details.
180
181 · The special pattern "\X" matches a logical character, an "extended
182 grapheme cluster" in Standardese. In Unicode what appears to the
183 user to be a single character, for example an accented "G", may in
184 fact be composed of a sequence of characters, in this case a "G"
185 followed by an accent character. "\X" will match the entire
186 sequence.
187
188 · The "tr///" operator translates characters instead of bytes. Note
189 that the "tr///CU" functionality has been removed. For similar
190 functionality see pack('U0', ...) and pack('C0', ...).
191
192 · Case translation operators use the Unicode case translation tables
193 when character input is provided. Note that "uc()", or "\U" in
194 interpolated strings, translates to uppercase, while "ucfirst", or
195 "\u" in interpolated strings, translates to titlecase in languages
196 that make the distinction (which is equivalent to uppercase in
197 languages without the distinction).
198
199 · Most operators that deal with positions or lengths in a string will
200 automatically switch to using character positions, including
201 "chop()", "chomp()", "substr()", "pos()", "index()", "rindex()",
202 "sprintf()", "write()", and "length()". An operator that
203 specifically does not switch is "vec()". Operators that really
204 don't care include operators that treat strings as a bucket of bits
205 such as "sort()", and operators dealing with filenames.
206
207 · The "pack()"/"unpack()" letter "C" does not change, since it is
208 often used for byte-oriented formats. Again, think "char" in the C
209 language.
210
211 There is a new "U" specifier that converts between Unicode
212 characters and code points. There is also a "W" specifier that is
213 the equivalent of "chr"/"ord" and properly handles character values
214 even if they are above 255.
215
216 · The "chr()" and "ord()" functions work on characters, similar to
217 "pack("W")" and "unpack("W")", not "pack("C")" and "unpack("C")".
218 "pack("C")" and "unpack("C")" are methods for emulating byte-
219 oriented "chr()" and "ord()" on Unicode strings. While these
220 methods reveal the internal encoding of Unicode strings, that is
221 not something one normally needs to care about at all.
222
223 · The bit string operators, "& | ^ ~", can operate on character data.
However, for backward compatibility with bit string operations on
strings whose characters are all less than 256 in ordinal value,
one should not use "~" (the bit complement) on strings that mix
characters below 256 with characters of 256 and above. Most
228 importantly, DeMorgan's laws ("~($x|$y) eq ~$x&~$y" and "~($x&$y)
229 eq ~$x|~$y") will not hold. The reason for this mathematical faux
230 pas is that the complement cannot return both the 8-bit (byte-wide)
231 bit complement and the full character-wide bit complement.
232
233 · There is a CPAN module, Unicode::Casing, which allows you to define
234 your own mappings to be used in "lc()", "lcfirst()", "uc()",
235 "ucfirst()", and "fc" (or their double-quoted string inlined
236 versions such as "\U"). (Prior to Perl 5.16, this functionality
237 was partially provided in the Perl core, but suffered from a number
238 of insurmountable drawbacks, so the CPAN module was written
239 instead.)
240
241 · And finally, "scalar reverse()" reverses by character rather than
242 by byte.
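
Several of the effects listed above can be seen together in a short sketch. The strings and variable names are chosen for illustration only, and charnames is loaded explicitly so that the named form also works on Perls that do not load it automatically.

```perl
use strict;
use warnings;
use charnames ':full';

# Three spellings of the same character, U+263A.
my $by_code = "\N{U+263A}";
my $by_hex  = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";

# Lengths and positions count characters, not bytes.
my $string   = "ab\x{263A}";
my $length   = length($string);            # 3
my $last     = substr($string, 2, 1);      # the smiley
my $reversed = scalar reverse($string);    # smiley, "b", "a"

# chr() and pack("W") agree, even above 255.
my $packed = pack('W', 0x263A);

# uc() gives uppercase and ucfirst() gives titlecase; U+01C6
# (LATIN SMALL LETTER DZ WITH CARON) is one character where they differ.
my $dz    = "\x{01C6}";
my $upper = uc($dz);         # U+01C4
my $title = ucfirst($dz);    # U+01C5

# "." matches one code point; "\X" matches a whole grapheme cluster.
my $g_acute     = "G\x{0301}";             # "G" plus COMBINING ACUTE ACCENT
my $dot_spans   = $g_acute =~ /\A.\z/  ? 1 : 0;   # 0
my $big_x_spans = $g_acute =~ /\A\X\z/ ? 1 : 0;   # 1
```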
243
244 Unicode Character Properties
245 (The only time that Perl considers a sequence of individual code points
246 as a single logical character is in the "\X" construct, already
247 mentioned above. Therefore "character" in this discussion means a
248 single Unicode code point.)
249
250 Very nearly all Unicode character properties are accessible through
251 regular expressions by using the "\p{}" "matches property" construct
252 and the "\P{}" "doesn't match property" for its negation.
253
For instance, "\p{Uppercase}" matches any single character with the
Unicode "Uppercase" property, while "\p{L}" matches any character whose
General_Category property is "L" (Letter). Braces are not required
for single-letter property names, so "\p{L}" is equivalent to "\pL".
258
259 More formally, "\p{Uppercase}" matches any single character whose
260 Unicode Uppercase property value is True, and "\P{Uppercase}" matches
261 any character whose Uppercase property value is False, and they could
262 have been written as "\p{Uppercase=True}" and "\p{Uppercase=False}",
263 respectively.
264
265 This formality is needed when properties are not binary; that is, if
266 they can take on more values than just True and False. For example,
267 the Bidi_Class (see "Bidirectional Character Types" below), can take on
268 several different values, such as Left, Right, Whitespace, and others.
269 To match these, one needs to specify both the property name
270 (Bidi_Class), AND the value being matched against (Left, Right, etc.).
271 This is done, as in the examples above, by having the two components
272 separated by an equal sign (or interchangeably, a colon), like
273 "\p{Bidi_Class: Left}".
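
The single and compound forms can be compared in a short sketch. The Bidi_Class value is spelled "R" here, as in the table of bidirectional types later in this document; the test characters are chosen for illustration.

```perl
use strict;
use warnings;

# Binary property: the single form and the compound form agree.
my $single_form   = 'A' =~ /\p{Uppercase}/       ? 1 : 0;
my $compound_form = 'A' =~ /\p{Uppercase=True}/  ? 1 : 0;

# Negation: \P{...} corresponds to the value False.
my $negated_form   = 'a' =~ /\P{Uppercase}/      ? 1 : 0;
my $compound_false = 'a' =~ /\p{Uppercase=False}/ ? 1 : 0;

# Single-letter names may drop the braces.
my $braced   = 'a' =~ /\p{L}/ ? 1 : 0;
my $unbraced = 'a' =~ /\pL/   ? 1 : 0;

# Non-binary property: name and value, separated by "=" or ":".
my $aleph       = "\x{05D0}";    # HEBREW LETTER ALEF
my $bidi_equals = $aleph =~ /\p{Bidi_Class=R}/  ? 1 : 0;
my $bidi_colon  = $aleph =~ /\p{Bidi_Class: R}/ ? 1 : 0;
my $bidi_latin  = 'A'    =~ /\p{Bidi_Class: R}/ ? 1 : 0;   # 0
```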
274
275 All Unicode-defined character properties may be written in these
276 compound forms of "\p{property=value}" or "\p{property:value}", but
277 Perl provides some additional properties that are written only in the
278 single form, as well as single-form short-cuts for all binary
279 properties and certain others described below, in which you may omit
280 the property name and the equals or colon separator.
281
282 Most Unicode character properties have at least two synonyms (or
283 aliases if you prefer): a short one that is easier to type and a longer
284 one that is more descriptive and hence easier to understand. Thus the
285 "L" and "Letter" properties above are equivalent and can be used
286 interchangeably. Likewise, "Upper" is a synonym for "Uppercase", and
287 we could have written "\p{Uppercase}" equivalently as "\p{Upper}".
288 Also, there are typically various synonyms for the values the property
can be. For binary properties, "True" has 3 synonyms: "T", "Yes", and
"Y"; and "False" correspondingly has "F", "No", and "N". But be
291 careful. A short form of a value for one property may not mean the
292 same thing as the same short form for another. Thus, for the
293 General_Category property, "L" means "Letter", but for the Bidi_Class
294 property, "L" means "Left". A complete list of properties and synonyms
295 is in perluniprops.
296
297 Upper/lower case differences in property names and values are
298 irrelevant; thus "\p{Upper}" means the same thing as "\p{upper}" or
299 even "\p{UpPeR}". Similarly, you can add or subtract underscores
300 anywhere in the middle of a word, so that these are also equivalent to
301 "\p{U_p_p_e_r}". And white space is irrelevant adjacent to non-word
302 characters, such as the braces and the equals or colon separators, so
303 "\p{ Upper }" and "\p{ Upper_case : Y }" are equivalent to these as
304 well. In fact, white space and even hyphens can usually be added or
305 deleted anywhere. So even "\p{ Up-per case = Yes}" is equivalent. All
this is called "loose-matching" by Unicode. The few places where
stricter matching is used are in the middle of numbers, and in the Perl
extension properties that begin or end with an underscore. Stricter
309 matching cares about white space (except adjacent to non-word
310 characters), hyphens, and non-interior underscores.
311
312 You can also use negation in both "\p{}" and "\P{}" by introducing a
313 caret (^) between the first brace and the property name: "\p{^Tamil}"
314 is equal to "\P{Tamil}".
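
A quick way to convince yourself of the loose-matching and negation rules is to try several spellings side by side. This sketch uses only a handful of the variations described above; the list is illustrative, not exhaustive.

```perl
use strict;
use warnings;

# Different spellings of the same binary property.
my @spellings = (
    qr/\p{Upper}/,         qr/\p{upper}/,
    qr/\p{UpPeR}/,         qr/\p{Uppercase}/,
    qr/\p{Uppercase=Yes}/, qr/\p{Uppercase: Y}/,
);

# Count how many spellings match an uppercase and a lowercase letter.
my $matching_upper = grep { 'A' =~ $_ } @spellings;   # all six
my $matching_lower = grep { 'a' =~ $_ } @spellings;   # none

# The caret form and the capital-P form both negate the property.
my $caret_form = 'a' =~ /\p{^Upper}/ ? 1 : 0;
my $big_p_form = 'a' =~ /\P{Upper}/  ? 1 : 0;

print "spellings matching 'A': $matching_upper\n";
print "spellings matching 'a': $matching_lower\n";
```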
315
316 Almost all properties are immune to case-insensitive matching. That
317 is, adding a "/i" regular expression modifier does not change what they
318 match. There are two sets that are affected. The first set is
319 "Uppercase_Letter", "Lowercase_Letter", and "Titlecase_Letter", all of
320 which match "Cased_Letter" under "/i" matching. And the second set is
321 "Uppercase", "Lowercase", and "Titlecase", all of which match "Cased"
322 under "/i" matching. This set also includes its subsets "PosixUpper"
323 and "PosixLower" both of which under "/i" matching match "PosixAlpha".
324 (The difference between these sets is that some things, such as Roman
325 numerals, come in both upper and lower case so they are "Cased", but
326 aren't considered letters, so they aren't "Cased_Letter"s.)
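
The two affected sets can be told apart with a Roman numeral, as the following sketch suggests. It assumes a Perl recent enough to implement the "/i" behavior described above; U+2170 (SMALL ROMAN NUMERAL ONE) is used as the cased non-letter.

```perl
use strict;
use warnings;

# Without /i, a lowercase letter is not an Uppercase_Letter.
my $lu_plain = 'a' =~ /\p{Lu}/  ? 1 : 0;    # 0

# Under /i, "Lu" behaves like "Cased_Letter".
my $lu_fold  = 'a' =~ /\p{Lu}/i ? 1 : 0;    # 1

# Characters without case are unaffected by /i.
my $digit_fold = '1' =~ /\p{Lu}/i ? 1 : 0;  # 0

# A small Roman numeral is cased but is not a letter.
my $roman = "\x{2170}";
my $roman_cased  = $roman =~ /\p{Uppercase}/i ? 1 : 0;   # 1 (Cased)
my $roman_letter = $roman =~ /\p{Lu}/i        ? 1 : 0;   # 0 (not Cased_Letter)
```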
327
328 The result is undefined if you try to match a non-Unicode code point
329 (that is, one above 0x10FFFF) against a Unicode property. Currently, a
330 warning is raised, and the match will fail. In some cases, this is
331 counterintuitive, as both these fail:
332
    chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/   # Fails.
    chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/  # Fails!
335
336 General_Category
337
338 Every Unicode character is assigned a general category, which is the
339 "most usual categorization of a character" (from
340 <http://www.unicode.org/reports/tr44>).
341
342 The compound way of writing these is like "\p{General_Category=Number}"
343 (short, "\p{gc:n}"). But Perl furnishes shortcuts in which everything
344 up through the equal or colon separator is omitted. So you can instead
345 just write "\pN".
346
347 Here are the short and long forms of the General Category properties:
348
349 Short Long
350
351 L Letter
352 LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
353 Lu Uppercase_Letter
354 Ll Lowercase_Letter
355 Lt Titlecase_Letter
356 Lm Modifier_Letter
357 Lo Other_Letter
358
359 M Mark
360 Mn Nonspacing_Mark
361 Mc Spacing_Mark
362 Me Enclosing_Mark
363
364 N Number
365 Nd Decimal_Number (also Digit)
366 Nl Letter_Number
367 No Other_Number
368
369 P Punctuation (also Punct)
370 Pc Connector_Punctuation
371 Pd Dash_Punctuation
372 Ps Open_Punctuation
373 Pe Close_Punctuation
374 Pi Initial_Punctuation
375 (may behave like Ps or Pe depending on usage)
376 Pf Final_Punctuation
377 (may behave like Ps or Pe depending on usage)
378 Po Other_Punctuation
379
380 S Symbol
381 Sm Math_Symbol
382 Sc Currency_Symbol
383 Sk Modifier_Symbol
384 So Other_Symbol
385
386 Z Separator
387 Zs Space_Separator
388 Zl Line_Separator
389 Zp Paragraph_Separator
390
391 C Other
392 Cc Control (also Cntrl)
393 Cf Format
394 Cs Surrogate
395 Co Private_Use
396 Cn Unassigned
397
398 Single-letter properties match all characters in any of the two-letter
399 sub-properties starting with the same letter. "LC" and "L&" are
400 special: both are aliases for the set consisting of everything matched
401 by "Ll", "Lu", and "Lt".
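
As an illustration of how the single-letter properties partition characters, the sketch below reports the major category of a few sample characters. The helper major_category() is hypothetical, written for this example only.

```perl
use strict;
use warnings;

# One precompiled pattern per single-letter General_Category property.
my @major = (
    [ L => qr/\pL/ ], [ M => qr/\pM/ ], [ N => qr/\pN/ ],
    [ P => qr/\pP/ ], [ S => qr/\pS/ ], [ Z => qr/\pZ/ ],
    [ C => qr/\pC/ ],
);

# Return the single-letter category of a one-character string.
sub major_category {
    my ($char) = @_;
    for my $entry (@major) {
        my ($name, $pattern) = @$entry;
        return $name if $char =~ /\A$pattern\z/;
    }
    return 'unknown';
}

my %category_of;
$category_of{$_} = major_category($_)
    for ('A', "\x{0301}", '7', '!', '$', ' ', "\t");

# The compound form, its short form, and the shortcut are equivalent.
my $long_form  = '7' =~ /\p{General_Category=Number}/ ? 1 : 0;
my $short_form = '7' =~ /\p{gc:n}/                    ? 1 : 0;
my $shortcut   = '7' =~ /\pN/                         ? 1 : 0;
```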
402
403 Bidirectional Character Types
404
405 Because scripts differ in their directionality (Hebrew and Arabic are
406 written right to left, for example) Unicode supplies these properties
407 in the Bidi_Class class:
408
409 Property Meaning
410
411 L Left-to-Right
412 LRE Left-to-Right Embedding
413 LRO Left-to-Right Override
414 R Right-to-Left
415 AL Arabic Letter
416 RLE Right-to-Left Embedding
417 RLO Right-to-Left Override
418 PDF Pop Directional Format
419 EN European Number
420 ES European Separator
421 ET European Terminator
422 AN Arabic Number
423 CS Common Separator
424 NSM Non-Spacing Mark
425 BN Boundary Neutral
426 B Paragraph Separator
427 S Segment Separator
428 WS Whitespace
429 ON Other Neutrals
430
431 This property is always written in the compound form. For example,
432 "\p{Bidi_Class:R}" matches characters that are normally written right
433 to left.
434
435 Scripts
436
437 The world's languages are written in many different scripts. This
438 sentence (unless you're reading it in translation) is written in Latin,
439 while Russian is written in Cyrillic, and Greek is written in, well,
440 Greek; Japanese mainly in Hiragana or Katakana. There are many more.
441
442 The Unicode Script and Script_Extensions properties give what script a
443 given character is in. Either property can be specified with the
444 compound form like "\p{Script=Hebrew}" (short: "\p{sc=hebr}"), or
445 "\p{Script_Extensions=Javanese}" (short: "\p{scx=java}"). In addition,
446 Perl furnishes shortcuts for all "Script" property names. You can omit
447 everything up through the equals (or colon), and simply write
448 "\p{Latin}" or "\P{Cyrillic}". (This is not true for
449 "Script_Extensions", which is required to be written in the compound
450 form.)
451
452 The difference between these two properties involves characters that
453 are used in multiple scripts. For example the digits '0' through '9'
454 are used in many parts of the world. These are placed in a script
455 named "Common". Other characters are used in just a few scripts. For
456 example, the "KATAKANA-HIRAGANA DOUBLE HYPHEN" is used in both Japanese
457 scripts, Katakana and Hiragana, but nowhere else. The "Script"
458 property places all characters that are used in multiple scripts in the
459 "Common" script, while the "Script_Extensions" property places those
460 that are used in only a few scripts into each of those scripts; while
461 still using "Common" for those used in many scripts. Thus both these
462 match:
463
464 "0" =~ /\p{sc=Common}/ # Matches
465 "0" =~ /\p{scx=Common}/ # Matches
466
and only the first of these matches:

    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/   # Matches
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/  # No match
471
472 And only the last two of these match:
473
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/   # No match
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/   # No match
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/  # Matches
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/  # Matches
478
479 "Script_Extensions" is thus an improved "Script", in which there are
480 fewer characters in the "Common" script, and correspondingly more in
481 other scripts. It is new in Unicode version 6.0, and its data are
482 likely to change significantly in later releases, as things get sorted
483 out.
484
(Actually, besides "Common", the "Inherited" script also contains
486 characters that are used in multiple scripts. These are modifier
487 characters which modify other characters, and inherit the script value
488 of the controlling character. Some of these are used in many scripts,
489 and so go into "Inherited" in both "Script" and "Script_Extensions".
490 Others are used in just a few scripts, so are in "Inherited" in
491 "Script", but not in "Script_Extensions".)
492
493 It is worth stressing that there are several different sets of digits
494 in Unicode that are equivalent to 0-9 and are matchable by "\d" in a
495 regular expression. If they are used in a single language only, they
are in that language's "Script" and "Script_Extensions". If they are
497 used in more than one script, they will be in "sc=Common", but only if
498 they are used in many scripts should they be in "scx=Common".
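
A brief sketch of the point about digits: a Devanagari digit is matched by "\d" and belongs to its own script, while the ASCII digits are in "Common". The choice of U+0967 (DEVANAGARI DIGIT ONE) is only an example.

```perl
use strict;
use warnings;

my $deva_one  = "\x{0967}";    # DEVANAGARI DIGIT ONE
my $ascii_one = '1';

# "\d" matches digits from any script; a bracketed range does not.
my $deva_is_digit = $deva_one =~ /\A\d\z/     ? 1 : 0;   # 1
my $deva_is_ascii = $deva_one =~ /\A[0-9]\z/  ? 1 : 0;   # 0

# Script membership differs between the two digits.
my $deva_script  = $deva_one  =~ /\p{Script=Devanagari}/ ? 1 : 0;   # 1
my $ascii_common = $ascii_one =~ /\p{Script=Common}/     ? 1 : 0;   # 1
my $ascii_latin  = $ascii_one =~ /\p{Script=Latin}/      ? 1 : 0;   # 0
```

If only 0-9 should be accepted, a bracketed class such as "[0-9]" is one straightforward way to say so.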
499
500 A complete list of scripts and their shortcuts is in perluniprops.
501
502 Use of "Is" Prefix
503
504 For backward compatibility (with Perl 5.6), all properties mentioned so
505 far may have "Is" or "Is_" prepended to their name, so "\P{Is_Lu}", for
506 example, is equal to "\P{Lu}", and "\p{IsScript:Arabic}" is equal to
507 "\p{Arabic}".
508
509 Blocks
510
511 In addition to scripts, Unicode also defines blocks of characters. The
512 difference between scripts and blocks is that the concept of scripts is
513 closer to natural languages, while the concept of blocks is more of an
514 artificial grouping based on groups of Unicode characters with
515 consecutive ordinal values. For example, the "Basic Latin" block is all
516 characters whose ordinals are between 0 and 127, inclusive; in other
517 words, the ASCII characters. The "Latin" script contains some letters
518 from this as well as several other blocks, like "Latin-1 Supplement",
519 "Latin Extended-A", etc., but it does not contain all the characters
520 from those blocks. It does not, for example, contain the digits 0-9,
521 because those digits are shared across many scripts, and hence are in
522 the "Common" script.
523
524 For more about scripts versus blocks, see UAX#24 "Unicode Script
525 Property": <http://www.unicode.org/reports/tr24>
526
527 The "Script" or "Script_Extensions" properties are likely to be the
528 ones you want to use when processing natural language; the Block
529 property may occasionally be useful in working with the nuts and bolts
530 of Unicode.
531
532 Block names are matched in the compound form, like "\p{Block: Arrows}"
533 or "\p{Blk=Hebrew}". Unlike most other properties, only a few block
534 names have a Unicode-defined short name. But Perl does provide a
535 (slight) shortcut: You can say, for example "\p{In_Arrows}" or
536 "\p{In_Hebrew}". For backwards compatibility, the "In" prefix may be
537 omitted if there is no naming conflict with a script or any other
538 property, and you can even use an "Is" prefix instead in those cases.
But it is not a good idea to do this, for a couple of reasons:
540
541 1. It is confusing. There are many naming conflicts, and you may
542 forget some. For example, "\p{Hebrew}" means the script Hebrew,
543 and NOT the block Hebrew. But would you remember that 6 months
544 from now?
545
546 2. It is unstable. A new version of Unicode may pre-empt the current
547 meaning by creating a property with the same name. There was a
548 time in very early Unicode releases when "\p{Hebrew}" would have
549 matched the block Hebrew; now it doesn't.
550
551 Some people prefer to always use "\p{Block: foo}" and "\p{Script: bar}"
552 instead of the shortcuts, whether for clarity, because they can't
553 remember the difference between 'In' and 'Is' anyway, or they aren't
554 confident that those who eventually will read their code will know that
555 difference.
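
The difference between a block and a script can be checked directly, as in this sketch. It uses the explicit compound forms recommended above, plus the "In_" shortcut for comparison; the test characters are illustrative.

```perl
use strict;
use warnings;

# The digit 5 lies in the Basic Latin block but is not in the Latin script.
my $digit_in_block  = '5' =~ /\p{Block: Basic_Latin}/ ? 1 : 0;   # 1
my $digit_in_latin  = '5' =~ /\p{Script: Latin}/      ? 1 : 0;   # 0
my $digit_in_common = '5' =~ /\p{Script: Common}/     ? 1 : 0;   # 1

# A Latin letter is in both the block and the script.
my $letter_in_block = 'a' =~ /\p{Block: Basic_Latin}/ ? 1 : 0;   # 1
my $letter_in_latin = 'a' =~ /\p{Script: Latin}/      ? 1 : 0;   # 1

# For a Hebrew letter, block and script happen to agree.
my $aleph          = "\x{05D0}";    # HEBREW LETTER ALEF
my $aleph_block    = $aleph =~ /\p{Block: Hebrew}/  ? 1 : 0;   # 1
my $aleph_shortcut = $aleph =~ /\p{In_Hebrew}/      ? 1 : 0;   # 1
my $aleph_script   = $aleph =~ /\p{Script: Hebrew}/ ? 1 : 0;   # 1
```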
556
557 A complete list of blocks and their shortcuts is in perluniprops.
558
559 Other Properties
560
561 There are many more properties than the very basic ones described here.
562 A complete list is in perluniprops.
563
564 Unicode defines all its properties in the compound form, so all single-
565 form properties are Perl extensions. Most of these are just synonyms
566 for the Unicode ones, but some are genuine extensions, including
567 several that are in the compound form. And quite a few of these are
568 actually recommended by Unicode (in
569 <http://www.unicode.org/reports/tr18>).
570
This section gives some details on all extensions that aren't just
synonyms for compound-form Unicode properties (for those properties,
you'll have to refer to the Unicode Standard
<http://www.unicode.org/reports/tr44>).
575
576 "\p{All}"
577 This matches any of the 1_114_112 Unicode code points. It is a
578 synonym for "\p{Any}".
579
580 "\p{Alnum}"
581 This matches any "\p{Alphabetic}" or "\p{Decimal_Number}"
582 character.
583
584 "\p{Any}"
585 This matches any of the 1_114_112 Unicode code points. It is a
586 synonym for "\p{All}".
587
588 "\p{ASCII}"
589 This matches any of the 128 characters in the US-ASCII character
590 set, which is a subset of Unicode.
591
592 "\p{Assigned}"
593 This matches any assigned code point; that is, any code point whose
594 general category is not Unassigned (or equivalently, not Cn).
595
596 "\p{Blank}"
597 This is the same as "\h" and "\p{HorizSpace}": A character that
598 changes the spacing horizontally.
599
600 "\p{Decomposition_Type: Non_Canonical}" (Short: "\p{Dt=NonCanon}")
601 Matches a character that has a non-canonical decomposition.
602
603 To understand the use of this rarely used property=value
604 combination, it is necessary to know some basics about
605 decomposition. Consider a character, say H. It could appear with
606 various marks around it, such as an acute accent, or a circumflex,
607 or various hooks, circles, arrows, etc., above, below, to one side
608 or the other, etc. There are many possibilities among the world's
609 languages. The number of combinations is astronomical, and if
610 there were a character for each combination, it would soon exhaust
611 Unicode's more than a million possible characters. So Unicode took
612 a different approach: there is a character for the base H, and a
613 character for each of the possible marks, and these can be
614 variously combined to get a final logical character. So a logical
615 character--what appears to be a single character--can be a sequence
of more than one individual character. This is called an
617 "extended grapheme cluster"; Perl furnishes the "\X" regular
618 expression construct to match such sequences.
619
620 But Unicode's intent is to unify the existing character set
621 standards and practices, and several pre-existing standards have
622 single characters that mean the same thing as some of these
623 combinations. An example is ISO-8859-1, which has quite a few of
624 these in the Latin-1 range, an example being "LATIN CAPITAL LETTER
625 E WITH ACUTE". Because this character was in this pre-existing
626 standard, Unicode added it to its repertoire. But this character
627 is considered by Unicode to be equivalent to the sequence
628 consisting of the character "LATIN CAPITAL LETTER E" followed by
629 the character "COMBINING ACUTE ACCENT".
630
631 "LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed"
632 character, and its equivalence with the sequence is called
633 canonical equivalence. All pre-composed characters are said to
634 have a decomposition (into the equivalent sequence), and the
635 decomposition type is also called canonical.
636
637 However, many more characters have a different type of
638 decomposition, a "compatible" or "non-canonical" decomposition.
639 The sequences that form these decompositions are not considered
640 canonically equivalent to the pre-composed character. An example,
641 again in the Latin-1 range, is the "SUPERSCRIPT ONE". It is
642 somewhat like a regular digit 1, but not exactly; its decomposition
643 into the digit 1 is called a "compatible" decomposition,
644 specifically a "super" decomposition. There are several such
645 compatibility decompositions (see
646 <http://www.unicode.org/reports/tr44>), including one called
647 "compat", which means some miscellaneous type of decomposition that
648 doesn't fit into the decomposition categories that Unicode has
649 chosen.
650
651 Note that most Unicode characters don't have a decomposition, so
652 their decomposition type is "None".
653
654 For your convenience, Perl has added the "Non_Canonical"
655 decomposition type to mean any of the several compatibility
656 decompositions.
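
The two kinds of decomposition can be observed with the core Unicode::Normalize module, which is not otherwise discussed here; its NFD() applies canonical decompositions only, while NFKD() applies compatibility ones as well. The sketch below uses the two Latin-1 examples from the text.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

my $e_acute = "\x{C9}";    # LATIN CAPITAL LETTER E WITH ACUTE
my $sup_one = "\x{B9}";    # SUPERSCRIPT ONE

# Canonical decomposition: "E" followed by COMBINING ACUTE ACCENT.
my $e_canonical = NFD($e_acute);

# SUPERSCRIPT ONE has no canonical decomposition, only a compatible one.
my $one_canonical  = NFD($sup_one);     # unchanged
my $one_compatible = NFKD($sup_one);    # the digit "1"

# Perl's Non_Canonical type covers all the compatibility decompositions.
my $one_non_canon = $sup_one =~ /\p{Decomposition_Type: Non_Canonical}/ ? 1 : 0;
my $e_non_canon   = $e_acute =~ /\p{Dt=NonCanon}/                       ? 1 : 0;
my $e_canon       = $e_acute =~ /\p{Decomposition_Type=Canonical}/      ? 1 : 0;
my $h_none        = 'H'      =~ /\p{Decomposition_Type=None}/           ? 1 : 0;
```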
657
658 "\p{Graph}"
659 Matches any character that is graphic. Theoretically, this means a
660 character that on a printer would cause ink to be used.
661
662 "\p{HorizSpace}"
663 This is the same as "\h" and "\p{Blank}": a character that changes
664 the spacing horizontally.
665
"\p{In=*}"
This is a synonym for "\p{Present_In=*}".
668
669 "\p{PerlSpace}"
670 This is the same as "\s", restricted to ASCII, namely
671 "[ \f\n\r\t]".
672
673 Mnemonic: Perl's (original) space
674
675 "\p{PerlWord}"
676 This is the same as "\w", restricted to ASCII, namely
677 "[A-Za-z0-9_]"
678
679 Mnemonic: Perl's (original) word.
680
681 "\p{Posix...}"
682 There are several of these, which are equivalents using the "\p"
683 notation for Posix classes and are described in "POSIX Character
684 Classes" in perlrecharclass.
685
686 "\p{Present_In: *}" (Short: "\p{In=*}")
687 This property is used when you need to know in what Unicode
688 version(s) a character is.
689
690 The "*" above stands for some two digit Unicode version number,
691 such as 1.1 or 4.0; or the "*" can also be "Unassigned". This
692 property will match the code points whose final disposition has
693 been settled as of the Unicode release given by the version number;
694 "\p{Present_In: Unassigned}" will match those code points whose
695 meaning has yet to be assigned.
696
For example, "U+0041" "LATIN CAPITAL LETTER A" was present in the
very first Unicode release available, which is 1.1, so this
property is true for all valid "*" versions. On the other hand,
"U+1EFF" was not assigned until version 5.1 when it became "LATIN
SMALL LETTER Y WITH LOOP", so the only "*" values that match it are
5.1, 5.2, and later.
703
704 Unicode furnishes the "Age" property from which this is derived.
705 The problem with Age is that a strict interpretation of it (which
706 Perl takes) has it matching the precise release a code point's
707 meaning is introduced in. Thus "U+0041" would match only 1.1; and
708 "U+1EFF" only 5.1. This is not usually what you want.
709
710 Some non-Perl implementations of the Age property may change its
711 meaning to be the same as the Perl Present_In property; just be
712 aware of that.
713
714 Another confusion with both these properties is that the definition
715 is not that the code point has been assigned, but that the meaning
716 of the code point has been determined. This is because 66 code
717 points will always be unassigned, and so the Age for them is the
718 Unicode version in which the decision to make them so was made.
719 For example, "U+FDD0" is to be permanently unassigned to a
720 character, and the decision to do that was made in version 3.1, so
721 "\p{Age=3.1}" matches this character, as also does "\p{Present_In:
722 3.1}" and up.
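
A small sketch of the difference between the two properties, using the two code points discussed above. It assumes a Perl whose Unicode tables are at version 5.1 or later; the hash keys are illustrative only.

```perl
use strict;
use warnings;

my $cap_a  = "\x{41}";     # LATIN CAPITAL LETTER A, assigned in Unicode 1.1
my $y_loop = "\x{1EFF}";   # LATIN SMALL LETTER Y WITH LOOP, assigned in 5.1

my %result = (
    # Present_In is cumulative: true from the introducing release onward.
    a_in_1_1  => ($cap_a  =~ /\p{Present_In: 1.1}/ ? 1 : 0),
    a_in_5_1  => ($cap_a  =~ /\p{Present_In: 5.1}/ ? 1 : 0),
    y_in_5_0  => ($y_loop =~ /\p{In=5.0}/          ? 1 : 0),
    y_in_5_1  => ($y_loop =~ /\p{In=5.1}/          ? 1 : 0),

    # Age names only the release in which the code point was introduced.
    a_age_5_1 => ($cap_a  =~ /\p{Age=5.1}/         ? 1 : 0),
    y_age_5_1 => ($y_loop =~ /\p{Age=5.1}/         ? 1 : 0),
);
```

Only the Age test distinguishes "introduced in 5.1" from "already present by 5.1", which is why Present_In is usually the more convenient of the two.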
723
724 "\p{Print}"
725 This matches any character that is graphical or blank, except
726 controls.
727
728 "\p{SpacePerl}"
729 This is the same as "\s", including beyond ASCII.
730
731 Mnemonic: Space, as modified by Perl. (It doesn't include the
732 vertical tab which both the Posix standard and Unicode consider
733 white space.)
734
"\p{Title}" and "\p{Titlecase}"
Under case-sensitive matching, these both match the same code
points as "\p{General Category=Titlecase_Letter}" ("\p{gc=lt}").
The difference is that under "/i" caseless matching, these match
the same as "\p{Cased}", whereas "\p{gc=lt}" matches
"\p{Cased_Letter}".
741
742 "\p{VertSpace}"
743 This is the same as "\v": A character that changes the spacing
744 vertically.
745
746 "\p{Word}"
747 This is the same as "\w", including over 100_000 characters beyond
748 ASCII.
749
750 "\p{XPosix...}"
751 There are several of these, which are the standard Posix classes
752 extended to the full Unicode range. They are described in "POSIX
753 Character Classes" in perlrecharclass.
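
To make the ASCII-restricted and full-range properties above concrete, here is a minimal sketch that tests a few sample characters. The sample characters and hash keys are chosen for illustration only.

```perl
use strict;
use warnings;

my $nbsp    = "\x{A0}";    # NO-BREAK SPACE, outside ASCII
my $e_acute = "\x{E9}";    # LATIN SMALL LETTER E WITH ACUTE

my %match = (
    # PerlSpace is ASCII-only, so it sees TAB but not NO-BREAK SPACE.
    tab_perlspace   => ("\t"     =~ /\p{PerlSpace}/  ? 1 : 0),
    nbsp_perlspace  => ($nbsp    =~ /\p{PerlSpace}/  ? 1 : 0),

    # HorizSpace and VertSpace split white space by direction.
    nbsp_horizspace => ($nbsp    =~ /\p{HorizSpace}/ ? 1 : 0),
    nl_horizspace   => ("\n"     =~ /\p{HorizSpace}/ ? 1 : 0),
    nl_vertspace    => ("\n"     =~ /\p{VertSpace}/  ? 1 : 0),

    # PerlWord is ASCII-only; Word covers the full Unicode range.
    e_perlword      => ($e_acute =~ /\p{PerlWord}/   ? 1 : 0),
    e_word          => ($e_acute =~ /\p{Word}/       ? 1 : 0),
);
```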
754
755 User-Defined Character Properties
756 You can define your own binary character properties by defining
757 subroutines whose names begin with "In" or "Is". The subroutines can
758 be defined in any package. The user-defined properties can be used in
759 the regular expression "\p" and "\P" constructs; if you are using a
760 user-defined property from a package other than the one you are in, you
761 must specify its package in the "\p" or "\P" construct.
762
    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }
769
770 Note that the effect is compile-time and immutable once defined.
771 However, the subroutines are passed a single parameter, which is 0 if
772 case-sensitive matching is in effect and non-zero if caseless matching
773 is in effect. The subroutine may return different values depending on
774 the value of the flag, and one set of values will immutably be in
775 effect for all case-sensitive matches, and the other set for all case-
776 insensitive matches.
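
As a sketch of how the flag can be used, the hypothetical property below (IsLowerA is a name made up for this example) returns one set of code points for case-sensitive matching and a larger one for caseless matching:

```perl
use strict;
use warnings;

# Hypothetical property: lowercase ASCII "a" only, but also "A" when the
# pattern is compiled with /i. The single argument is the caseless flag.
sub IsLowerA {
    my ($caseless) = @_;
    return $caseless ? "41\n61\n" : "61\n";
}

my $sensitive   = "A" =~ /\p{IsLowerA}/  ? 1 : 0;
my $insensitive = "A" =~ /\p{IsLowerA}/i ? 1 : 0;
```

Because each set is fixed once it has been computed, the subroutine should depend only on the flag, not on any state that changes at run time.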
777
Note that if the regular expression is tainted, Perl will die rather
than call a subroutine whose name is determined by the tainted data.
781
782 The subroutines must return a specially-formatted string, with one or
783 more newline-separated lines. Each line must be one of the following:
784
785 · A single hexadecimal number denoting a Unicode code point to
786 include.
787
· Two hexadecimal numbers separated by horizontal whitespace (space
  or tab characters) denoting a range of Unicode code points to
  include.
791
792 · Something to include, prefixed by "+": a built-in character
793 property (prefixed by "utf8::") or a fully qualified (including
794 package name) user-defined character property, to represent all the
795 characters in that property; two hexadecimal code points for a
796 range; or a single hexadecimal code point.
797
798 · Something to exclude, prefixed by "-": an existing character
799 property (prefixed by "utf8::") or a fully qualified (including
800 package name) user-defined character property, to represent all the
801 characters in that property; two hexadecimal code points for a
802 range; or a single hexadecimal code point.
803
· Something to negate, prefixed by "!": an existing character
  property (prefixed by "utf8::") or a fully qualified (including
  package name) user-defined character property, to represent all the
  characters in that property; two hexadecimal code points for a
  range; or a single hexadecimal code point.

· Something to intersect with, prefixed by "&": an existing character
  property (prefixed by "utf8::") or a fully qualified (including
  package name) user-defined character property, so that only the
  characters also in that property are kept; two hexadecimal
  code points for a range; or a single hexadecimal code point.
815
816 For example, to define a property that covers both the Japanese
817 syllabaries (hiragana and katakana), you can define
818
819 sub InKana {
820 return <<END;
821 3040\t309F
822 30A0\t30FF
823 END
824 }
825
826 Imagine that the here-doc end marker is at the beginning of the line.
827 Now you can use "\p{InKana}" and "\P{InKana}".
828
829 You could also have used the existing block property names:
830
831 sub InKana {
832 return <<'END';
833 +utf8::InHiragana
834 +utf8::InKatakana
835 END
836 }
837
Suppose you wanted to match only the allocated characters, not the raw
block ranges: in other words, you want to remove the unassigned code
points:
840
841 sub InKana {
842 return <<'END';
843 +utf8::InHiragana
844 +utf8::InKatakana
845 -utf8::IsCn
846 END
847 }
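
A runnable sketch of this last definition follows. It uses a hypothetical name, InAssignedKana, so that it can coexist with the InKana definitions above, and it relies on U+3040 being an unassigned code point inside the Hiragana block.

```perl
use strict;
use warnings;

# Hypothetical property: both kana blocks, minus unassigned code points.
sub InAssignedKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
my $reserved   = "\x{3040}";   # in the Hiragana block, but unassigned
my $latin_a    = "A";          # outside both blocks

# One flag per sample: 1 if it has the property, 0 otherwise.
my @got = map { /\p{InAssignedKana}/ ? 1 : 0 }
          ($hiragana_a, $reserved, $latin_a);
```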
848
849 The negation is useful for defining (surprise!) negated classes.
850
851 sub InNotKana {
852 return <<'END';
853 !utf8::InHiragana
854 -utf8::InKatakana
855 +utf8::IsCn
856 END
857 }
858
859 This will match all non-Unicode code points, since every one of them is
860 not in Kana. You can use intersection to exclude these, if desired, as
861 this modified example shows:
862
863 sub InNotKana {
864 return <<'END';
865 !utf8::InHiragana
866 -utf8::InKatakana
867 +utf8::IsCn
868 &utf8::Any
869 END
870 }
871
872 &utf8::Any must be the last line in the definition.
873
874 Intersection is used generally for getting the common characters
875 matched by two (or more) classes. It's important to remember not to
876 use "&" for the first set; that would be intersecting with nothing,
877 resulting in an empty set.
878
879 (Note that official Unicode properties differ from these in that they
880 automatically exclude non-Unicode code points and a warning is raised
881 if a match is attempted on one of those.)
882
883 User-Defined Case Mappings (for serious hackers only)
884 This feature has been removed as of Perl 5.16. The CPAN module
885 Unicode::Casing provides better functionality without the drawbacks
886 that this feature had. If you are using a Perl earlier than 5.16, this
887 feature was most fully documented in the 5.14 version of this pod:
888 http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29
891
892 Character Encodings for Input and Output
893 See Encode.
894
895 Unicode Regular Expression Support Level
896 The following list of Unicode supported features for regular
897 expressions describes all features currently directly supported by core
898 Perl. The references to "Level N" and the section numbers refer to the
899 Unicode Technical Standard #18, "Unicode Regular Expressions", version
900 13, from August 2008.
901
902 · Level 1 - Basic Unicode Support
903
904 RL1.1 Hex Notation - done [1]
905 RL1.2 Properties - done [2][3]
906 RL1.2a Compatibility Properties - done [4]
907 RL1.3 Subtraction and Intersection - MISSING [5]
908 RL1.4 Simple Word Boundaries - done [6]
909 RL1.5 Simple Loose Matches - done [7]
910 RL1.6 Line Boundaries - MISSING [8][9]
911 RL1.7 Supplementary Code Points - done [10]
912
913 [1] \x{...}
914 [2] \p{...} \P{...}
915 [3] supports not only minimal list, but all Unicode character
916 properties (see Unicode Character Properties above)
917 [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
918 [5] can use regular expression look-ahead [a] or
919 user-defined character properties [b] to emulate set
920 operations
921 [6] \b \B
922 [7] note that Perl does Full case-folding in matching (but with
923 bugs), not Simple: for example U+1F88 is equivalent to
924 U+1F00 U+03B9, instead of just U+1F80. This difference
925 matters mainly for certain Greek capital letters with certain
926 modifiers: the Full case-folding decomposes the letter,
927 while the Simple case-folding would map it to a single
928 character.
929 [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR
930 (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
931 (U+2029); should also affect <>, $., and script line
932 numbers; should not split lines within CRLF [c] (i.e. there
933 is no empty line between \r and \n)
[9] Linebreaking conformant with UAX#14 "Unicode Line Breaking
    Algorithm" is available through the Unicode::LineBreak
    module.
[10] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
    U+10FFFF but also beyond U+10FFFF
939
940 [a] You can mimic class subtraction using lookahead. For example,
941 what UTS#18 might write as
942
943 [{Greek}-[{UNASSIGNED}]]
944
945 in Perl can be written as:
946
947 (?!\p{Unassigned})\p{InGreekAndCoptic}
948 (?=\p{Assigned})\p{InGreekAndCoptic}
949
But in this particular example, you probably really want

    \p{Greek}

which will match assigned characters known to be part of the Greek
script.
956
957 Also see the Unicode::Regex::Set module; it does implement the full
958 UTS#18 grouping, intersection, union, and removal (subtraction)
959 syntax.
960
961 [b] '+' for union, '-' for removal (set-difference), '&' for
962 intersection (see "User-Defined Character Properties")
963
964 [c] Try the ":crlf" layer (see PerlIO).
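
The look-ahead emulation in note [a] can be tried out with a short script. This is only a sketch: it spells the block with the explicit "Block=" form, which names the same Greek and Coptic block, and it relies on U+03A2 being an unassigned code point inside that block.

```perl
use strict;
use warnings;

# Block membership minus unassigned code points, written two ways.
my $negative = qr/(?!\p{Unassigned})\p{Block=Greek_And_Coptic}/;
my $positive = qr/(?=\p{Assigned})\p{Block=Greek_And_Coptic}/;

my $alpha    = "\x{3B1}";   # GREEK SMALL LETTER ALPHA
my $reserved = "\x{3A2}";   # inside the block, but unassigned

my %got = (
    alpha_negative    => ($alpha    =~ $negative   ? 1 : 0),
    alpha_positive    => ($alpha    =~ $positive   ? 1 : 0),
    reserved_negative => ($reserved =~ $negative   ? 1 : 0),
    reserved_positive => ($reserved =~ $positive   ? 1 : 0),

    # The script property already excludes unassigned code points.
    alpha_script      => ($alpha    =~ /\p{Greek}/ ? 1 : 0),
    reserved_script   => ($reserved =~ /\p{Greek}/ ? 1 : 0),
);
```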
965
966 · Level 2 - Extended Unicode Support
967
968 RL2.1 Canonical Equivalents - MISSING [10][11]
969 RL2.2 Default Grapheme Clusters - MISSING [12]
970 RL2.3 Default Word Boundaries - MISSING [14]
971 RL2.4 Default Loose Matches - MISSING [15]
972 RL2.5 Name Properties - DONE
973 RL2.6 Wildcard Properties - MISSING
974
975 [10] see UAX#15 "Unicode Normalization Forms"
976 [11] have Unicode::Normalize but not integrated to regexes
977 [12] have \X but we don't have a "Grapheme Cluster Mode"
978 [14] see UAX#29, Word Boundaries
979 [15] This is covered in Chapter 3.13 (in Unicode 6.0)
980
981 · Level 3 - Tailored Support
982
983 RL3.1 Tailored Punctuation - MISSING
984 RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
985 RL3.3 Tailored Word Boundaries - MISSING
986 RL3.4 Tailored Loose Matches - MISSING
987 RL3.5 Tailored Ranges - MISSING
988 RL3.6 Context Matching - MISSING [19]
989 RL3.7 Incremental Matches - MISSING
990 ( RL3.8 Unicode Set Sharing )
991 RL3.9 Possible Match Sets - MISSING
992 RL3.10 Folded Matching - MISSING [20]
993 RL3.11 Submatchers - MISSING
994
995 [17] see UAX#10 "Unicode Collation Algorithms"
996 [18] have Unicode::Collate but not integrated to regexes
997 [19] have (?<=x) and (?=x), but look-aheads or look-behinds
998 should see outside of the target substring
999 [20] need insensitive matching for linguistic features other
1000 than case; for example, hiragana to katakana, wide and
1001 narrow, simplified Han to traditional Han (see UTR#30
1002 "Character Foldings")
1003
1004 Unicode Encodings
1005 Unicode characters are assigned to code points, which are abstract
1006 numbers. To use these numbers, various encodings are needed.
1007
1008 · UTF-8
1009
1010 UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
1011 encoding. For ASCII (and we really do mean 7-bit ASCII, not another
1012 8-bit encoding), UTF-8 is transparent.
1013
1014 The following table is from Unicode 3.2.
1015
Code Points              1st Byte    2nd Byte    3rd Byte    4th Byte

U+0000..U+007F           00..7F
U+0080..U+07FF         * C2..DF      80..BF
U+0800..U+0FFF           E0        * A0..BF      80..BF
U+1000..U+CFFF           E1..EC      80..BF      80..BF
U+D000..U+D7FF           ED          80..9F      80..BF
U+D800..U+DFFF           +++++ utf16 surrogates, not legal utf8 +++++
U+E000..U+FFFF           EE..EF      80..BF      80..BF
U+10000..U+3FFFF         F0        * 90..BF      80..BF      80..BF
U+40000..U+FFFFF         F1..F3      80..BF      80..BF      80..BF
U+100000..U+10FFFF       F4          80..8F      80..BF      80..BF
1028
1029 Note the gaps marked by "*" before several of the byte entries
1030 above. These are caused by legal UTF-8 avoiding non-shortest
1031 encodings: it is technically possible to UTF-8-encode a single code
1032 point in different ways, but that is explicitly forbidden, and the
1033 shortest possible encoding should always be used (and that is what
1034 Perl does).
1035
1036 Another way to look at it is via bits:
1037
Code Points                   1st Byte  2nd Byte  3rd Byte  4th Byte

                  0aaaaaaa    0aaaaaaa
          00000bbbbbaaaaaa    110bbbbb  10aaaaaa
          ccccbbbbbbaaaaaa    1110cccc  10bbbbbb  10aaaaaa
00000dddccccccbbbbbbaaaaaa    11110ddd  10cccccc  10bbbbbb  10aaaaaa
1044
1045 As you can see, the continuation bytes all begin with "10", and the
1046 leading bits of the start byte tell how many bytes there are in the
1047 encoded character.
1048
The original UTF-8 specification allowed up to 6 bytes, to allow
encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow
those, and has extended that up to 13 bytes to encode code points
up to what can fit in a 64-bit word. However, these are
non-portable, so Perl will warn if you output any of them; and under
strict UTF-8 input protocols, they are forbidden.
1055
1056 The Unicode non-character code points are also disallowed in UTF-8
1057 in "open interchange". See "Non-character code points".
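
The length classes in the tables above can be checked with a small sketch. It assumes the Encode module and picks one arbitrary code point per class; the %hex_bytes hash is illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One sample code point from each length class (1 to 4 bytes).
my @samples = (0x41, 0xE9, 0x20AC, 0x1F600);

# Map each code point to its UTF-8 bytes, written as hexadecimal.
my %hex_bytes;
for my $cp (@samples) {
    my $bytes = encode("UTF-8", chr($cp));
    $hex_bytes{$cp} =
        join " ", map { sprintf("%02X", ord($_)) } split //, $bytes;
}
```

For U+20AC, for instance, this gives the three bytes E2 82 AC, which fall in the E1..EC, 80..BF, 80..BF row of the first table.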
1058
1059 · UTF-EBCDIC
1060
1061 Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
1062
1063 · UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
1064
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
1067
Like UTF-8, UTF-16 is a variable-width encoding, but where UTF-8
uses 8-bit code units, UTF-16 uses 16-bit code units. All code
points occupy either 2 or 4 bytes in UTF-16: code points
"U+0000..U+FFFF" are stored in a single 16-bit unit, and code
points "U+10000..U+10FFFF" in two 16-bit units. The latter case
uses surrogates, the first 16-bit unit being the high surrogate,
and the second being the low surrogate.
1075
1076 Surrogates are code points set aside to encode the
1077 "U+10000..U+10FFFF" range of Unicode code points in pairs of 16-bit
1078 units. The high surrogates are the range "U+D800..U+DBFF" and the
1079 low surrogates are the range "U+DC00..U+DFFF". The surrogate
1080 encoding is
1081
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
1084
1085 and the decoding is
1086
1087 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
1088
1089 Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
1090 itself can be used for in-memory computations, but if storage or
1091 transfer is required either UTF-16BE (big-endian) or UTF-16LE
1092 (little-endian) encodings must be chosen.
1093
1094 This introduces another problem: what if you just know that your
1095 data is UTF-16, but you don't know which endianness? Byte Order
1096 Marks, or BOMs, are a solution to this. A special character has
1097 been reserved in Unicode to function as a byte order marker: the
1098 character with the code point "U+FEFF" is the BOM.
1099
1100 The trick is that if you read a BOM, you will know the byte order,
1101 since if it was written on a big-endian platform, you will read the
1102 bytes "0xFE 0xFF", but if it was written on a little-endian
1103 platform, you will read the bytes "0xFF 0xFE". (And if the
1104 originating platform was writing in UTF-8, you will read the bytes
1105 "0xEF 0xBB 0xBF".)
1106
The way this trick works is that the character with the code point
"U+FFFE" is not supposed to be in input streams, so the sequence of
bytes "0xFF 0xFE" is unambiguously "BOM, represented in little-endian
format" and cannot be "U+FFFE, represented in big-endian format".
1112
Surrogates have no meaning in Unicode outside their use in pairs to
represent other code points. However, Perl allows them to be
represented individually internally, for example by saying
"chr(0xD801)", so that all code points, not just those valid for
open interchange, are representable. Unicode does define semantics
for them, such as their General Category being "Cs". But because
their use is somewhat dangerous, Perl will warn (using the warning
category "surrogate", which is a sub-category of "utf8") if an
attempt is made to do things like take the lower case of one, or
match case-insensitively, or to output them. (But don't try this
on Perls before 5.14.)
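
The surrogate arithmetic and the BOM convention described in this item can be exercised together. This is a minimal sketch that assumes the Encode module; the sample code point U+1F600 is arbitrary, and int() is added so that the division yields a whole number.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $uni = 0x1F600;   # an arbitrary supplementary code point

# Surrogate encoding; int() keeps the division integral.
my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
my $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

# Decoding recovers the original code point.
my $back = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);

# Cross-check against Encode: UTF-16BE output is the two 16-bit units
# in big-endian order.
my ($enc_hi, $enc_lo) = unpack "n2", encode("UTF-16BE", chr($uni));

# Encode's plain "UTF-16" prepends a BOM when encoding, and uses the BOM
# to pick the byte order when decoding.
my $with_bom = encode("UTF-16", "A");
my $from_le  = decode("UTF-16", "\xFF\xFE\x41\x00");
my $from_be  = decode("UTF-16", "\xFE\xFF\x00\x41");
```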
1124
1125 · UTF-32, UTF-32BE, UTF-32LE
1126
The UTF-32 family is pretty much like the UTF-16 family, except
that the units are 32-bit, and therefore the surrogate scheme is
not needed. UTF-32 is a fixed-width encoding. The BOM signatures
are "0x00 0x00 0xFE 0xFF" for BE and "0xFF 0xFE 0x00 0x00" for LE.
1131
1132 · UCS-2, UCS-4
1133
1134 Legacy, fixed-width encodings defined by the ISO 10646 standard.
1135 UCS-2 is a 16-bit encoding. Unlike UTF-16, UCS-2 is not extensible
1136 beyond "U+FFFF", because it does not use surrogates. UCS-4 is a
1137 32-bit encoding, functionally identical to UTF-32 (the difference
1138 being that UCS-4 forbids neither surrogates nor code points larger
1139 than 0x10_FFFF).
1140
1141 · UTF-7
1142
1143 A seven-bit safe (non-eight-bit) encoding, which is useful if the
1144 transport or storage is not eight-bit safe. Defined by RFC 2152.
1145
1146 Non-character code points
1147 66 code points are set aside in Unicode as "non-character code points".
1148 These all have the Unassigned (Cn) General Category, and they never
1149 will be assigned. These are never supposed to be in legal Unicode
1150 input streams, so that code can use them as sentinels that can be mixed
1151 in with character data, and they always will be distinguishable from
1152 that data. To keep them out of Perl input streams, strict UTF-8 should
1153 be specified, such as by using the layer ":encoding('UTF-8')". The
1154 non-character code points are the 32 between U+FDD0 and U+FDEF, and the
1155 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE,
1156 U+10FFFF. Some people are under the mistaken impression that these are
1157 "illegal", but that is not true. An application or cooperating set of
1158 applications can legally use them at will internally; but these code
1159 points are "illegal for open interchange". Therefore, Perl will not
1160 accept these from input streams unless lax rules are being used, and
1161 will warn (using the warning category "nonchar", which is a sub-
1162 category of "utf8") if an attempt is made to output them.
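
The count of 66 can be reproduced with a few lines of Perl. This sketch only builds the list and checks it against two Unicode properties; it never prints the code points, since outputting them would trigger the warning just described.

```perl
use strict;
use warnings;

# U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes
# (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF).
my @nonchars = (0xFDD0 .. 0xFDEF);
for my $plane (0 .. 0x10) {
    push @nonchars, ($plane << 16) | 0xFFFE, ($plane << 16) | 0xFFFF;
}

my $count = @nonchars;

# Each flag is true only if no code point in the list fails the test:
# General Category Cn (Unassigned), and Noncharacter_Code_Point.
my $all_cn    = !grep { chr($_) !~ /\p{Cn}/ } @nonchars;
my $all_nchar = !grep { chr($_) !~ /\p{Noncharacter_Code_Point}/ } @nonchars;
```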
1163
1164 Beyond Unicode code points
The maximum Unicode code point is U+10FFFF. But Perl accepts code
points up to the maximum permissible unsigned number available on the
platform. However, Perl will not accept these from input streams
unless lax rules are being used, and will warn (using the warning
category "non_unicode", which is a sub-category of "utf8") if an
attempt is made to operate on or output them. For example,
"uc(chr(0x11_0000))" will generate this warning, returning the input
parameter as its result, as the upper case of every non-Unicode code
point is the code point itself.
1174
1175 Security Implications of Unicode
1176 Read Unicode Security Considerations
1177 <http://www.unicode.org/reports/tr36>. Also, note the following:
1178
1179 · Malformed UTF-8
1180
1181 Unfortunately, the original specification of UTF-8 leaves some room
1182 for interpretation of how many bytes of encoded output one should
1183 generate from one input Unicode character. Strictly speaking, the
1184 shortest possible sequence of UTF-8 bytes should be generated,
1185 because otherwise there is potential for an input buffer overflow
1186 at the receiving end of a UTF-8 connection. Perl always generates
1187 the shortest length UTF-8, and with warnings on, Perl will warn
1188 about non-shortest length UTF-8 along with other malformations,
1189 such as the surrogates, which are not Unicode code points valid for
1190 interchange.
1191
1192 · Regular expression pattern matching may surprise you if you're not
1193 accustomed to Unicode. Starting in Perl 5.14, several pattern
1194 modifiers are available to control this, called the character set
1195 modifiers. Details are given in "Character set modifiers" in
1196 perlre.
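
For the malformed UTF-8 item above, one way to reject such input is to decode it with the strict "UTF-8" encoding and a CHECK value that croaks on errors. This is a sketch assuming the Encode module; is_strict_utf8 is a helper name made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Returns 1 if the octets are accepted as strict UTF-8, 0 otherwise.
sub is_strict_utf8 {
    my ($octets) = @_;    # work on a copy: decoding with CHECK may modify it
    return eval { decode("UTF-8", $octets, FB_CROAK); 1 } ? 1 : 0;
}

my $shortest  = is_strict_utf8("\xC3\xA9");       # U+00E9, shortest form
my $overlong  = is_strict_utf8("\xC1\x81");       # "A" padded to two bytes
my $surrogate = is_strict_utf8("\xED\xA0\x81");   # U+D801 encoded directly
```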
1197
1198 As discussed elsewhere, Perl has one foot (two hooves?) planted in each
1199 of two worlds: the old world of bytes and the new world of characters,
1200 upgrading from bytes to characters when necessary. If your legacy code
1201 does not explicitly use Unicode, no automatic switch-over to characters
1202 should happen. Characters shouldn't get downgraded to bytes, either.
1203 It is possible to accidentally mix bytes and characters, however (see
1204 perluniintro), in which case "\w" in regular expressions might start
1205 behaving differently (unless the "/a" modifier is in effect). Review
1206 your code. Use warnings and the "strict" pragma.
1207
1208 Unicode in Perl on EBCDIC
1209 The way Unicode is handled on EBCDIC platforms is still experimental.
1210 On such platforms, references to UTF-8 encoding in this document and
1211 elsewhere should be read as meaning the UTF-EBCDIC specified in Unicode
1212 Technical Report 16, unless ASCII vs. EBCDIC issues are specifically
1213 discussed. There is no "utfebcdic" pragma or ":utfebcdic" layer;
1214 rather, "utf8" and ":utf8" are reused to mean the platform's "natural"
1215 8-bit encoding of Unicode. See perlebcdic for more discussion of the
1216 issues.
1217
1218 Locales
1219 See "Unicode and UTF-8" in perllocale
1220
1221 When Unicode Does Not Happen
1222 While Perl does have extensive ways to input and output in Unicode, and
1223 a few other "entry points" like the @ARGV array (which can sometimes be
1224 interpreted as UTF-8), there are still many places where Unicode (in
1225 some encoding or another) could be given as arguments or received as
1226 results, or both, but it is not.
1227
1228 The following are such interfaces. Also, see "The "Unicode Bug"". For
1229 all of these interfaces Perl currently (as of 5.8.3) simply assumes
1230 byte strings both as arguments and results, or UTF-8 strings if the
1231 (problematic) "encoding" pragma has been used.
1232
1233 One reason that Perl does not attempt to resolve the role of Unicode in
1234 these situations is that the answers are highly dependent on the
1235 operating system and the file system(s). For example, whether
1236 filenames can be in Unicode and in exactly what kind of encoding, is
1237 not exactly a portable concept. Similarly for "qx" and "system": how
1238 well will the "command-line interface" (and which of them?) handle
1239 Unicode?
1240
1241 · chdir, chmod, chown, chroot, exec, link, lstat, mkdir, rename,
1242 rmdir, stat, symlink, truncate, unlink, utime, -X
1243
1244 · %ENV
1245
1246 · glob (aka the <*>)
1247
1248 · open, opendir, sysopen
1249
1250 · qx (aka the backtick operator), system
1251
1252 · readdir, readlink
1253
1254 The "Unicode Bug"
1255 The term, "Unicode bug" has been applied to an inconsistency on ASCII
1256 platforms with the Unicode code points in the Latin-1 Supplement block,
1257 that is, between 128 and 255. Without a locale specified, unlike all
1258 other characters or code points, these characters have very different
1259 semantics in byte semantics versus character semantics, unless "use
1260 feature 'unicode_strings'" is specified, directly or indirectly. (It
1261 is indirectly specified by a "use v5.12" or higher.)
1262
1263 In character semantics these upper-Latin1 characters are interpreted as
1264 Unicode code points, which means they have the same semantics as
1265 Latin-1 (ISO-8859-1).
1266
1267 In byte semantics (without "unicode_strings"), they are considered to
1268 be unassigned characters, meaning that the only semantics they have is
1269 their ordinal numbers, and that they are not members of various
1270 character classes. None are considered to match "\w" for example, but
1271 all match "\W".
1272
1273 Perl 5.12.0 added "unicode_strings" to force character semantics on
1274 these code points in some circumstances, which fixed portions of the
1275 bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the
1276 remainder (so far as we know, anyway). The lesson here is to enable
1277 "unicode_strings" to avoid the headaches described below.
1278
1279 The old, problematic behavior affects these areas:
1280
1281 · Changing the case of a scalar, that is, using "uc()", "ucfirst()",
1282 "lc()", and "lcfirst()", or "\L", "\U", "\u" and "\l" in double-
1283 quotish contexts, such as regular expression substitutions. Under
1284 "unicode_strings" starting in Perl 5.12.0, character semantics are
1285 generally used. See "lc" in perlfunc for details on how this works
1286 in combination with various other pragmas.
1287
1288 · Using caseless ("/i") regular expression matching. Starting in
1289 Perl 5.14.0, regular expressions compiled within the scope of
1290 "unicode_strings" use character semantics even when executed or
1291 compiled into larger regular expressions outside the scope.
1292
1293 · Matching any of several properties in regular expressions, namely
1294 "\b", "\B", "\s", "\S", "\w", "\W", and all the Posix character
1295 classes except "[[:ascii:]]". Starting in Perl 5.14.0, regular
1296 expressions compiled within the scope of "unicode_strings" use
1297 character semantics even when executed or compiled into larger
1298 regular expressions outside the scope.
1299
1300 · In "quotemeta" or its inline equivalent "\Q", no code points above
1301 127 are quoted in UTF-8 encoded strings, but in byte encoded
1302 strings, code points between 128-255 are always quoted. Starting
1303 in Perl 5.16.0, consistent quoting rules are used within the scope
1304 of "unicode_strings", as described in "quotemeta" in perlfunc.
1305
1306 This behavior can lead to unexpected results in which a string's
1307 semantics suddenly change if a code point above 255 is appended to or
1308 removed from it, which changes the string's semantics from byte to
1309 character or vice versa. As an example, consider the following program
1310 and its output:
1311
    $ perl -le'
        no feature "unicode_strings";
        $s1 = "\xC2";
        $s2 = "\x{2660}";
        for ($s1, $s2, $s1.$s2) {
            print /\w/ || 0;
        }
    '
    0
    0
    1
1323
If there's no "\w" in $s1 or in $s2, why does their concatenation
have one?
1326
1327 This anomaly stems from Perl's attempt to not disturb older programs
1328 that didn't use Unicode, and hence had no semantics for characters
1329 outside of the ASCII range (except in a locale), along with Perl's
1330 desire to add Unicode support seamlessly. The result wasn't seamless:
1331 these characters were orphaned.
1332
For Perls earlier than those described above, or when a string is
passed to a function outside the subpragma's scope, a workaround is to
always call "utf8::upgrade($string)", or to use the standard module
Encode. Also, a scalar that has any characters whose ordinal is
0x100 or above, or which were specified using either of the "\N{...}"
notations, will automatically have character semantics.
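
The effect of the workaround, and of the feature itself, can be seen in a short sketch. It assumes Perl 5.14 or later with no locale in effect; the variable names are illustrative only.

```perl
use strict;
use warnings;
no feature 'unicode_strings';      # make the legacy behavior explicit

my $s = "\xE9";                    # e-acute, stored as a single byte

my $before = $s =~ /\w/ ? 1 : 0;   # byte semantics: not a word character

utf8::upgrade($s);                 # same character, UTF-8 internally
my $after = $s =~ /\w/ ? 1 : 0;    # character semantics now apply

my $scoped;
{
    use feature 'unicode_strings';
    my $t = "\xE9";                # never upgraded
    $scoped = $t =~ /\w/ ? 1 : 0;  # pattern compiled under the feature
}
```

The upgrade changes only the internal representation, yet the first two matches differ, which is exactly the inconsistency the feature removes.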
1339
1340 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1341 Sometimes (see "When Unicode Does Not Happen" or "The "Unicode Bug"")
1342 there are situations where you simply need to force a byte string into
1343 UTF-8, or vice versa. The low-level calls utf8::upgrade($bytestring)
1344 and utf8::downgrade($utf8string[, FAIL_OK]) are the answers.
1345
1346 Note that utf8::downgrade() can fail if the string contains characters
1347 that don't fit into a byte.
1348
1349 Calling either function on a string that already is in the desired
1350 state is a no-op.
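
A minimal sketch of both calls follows. It uses utf8::is_utf8(), which is not discussed above, purely to peek at the internal representation; the sample strings are arbitrary.

```perl
use strict;
use warnings;

my $s = "caf\xE9";                         # four characters, stored as bytes
my $was_utf8 = utf8::is_utf8($s) ? 1 : 0;

utf8::upgrade($s);                         # now stored as UTF-8 internally
my $is_utf8   = utf8::is_utf8($s) ? 1 : 0;
my $same_text = $s eq "caf\xE9" ? 1 : 0;   # the characters did not change

my $down_ok = utf8::downgrade($s) ? 1 : 0; # every character fits in a byte

# U+263A does not fit in a byte; with FAIL_OK true the call returns false
# instead of dying.
my $wide      = "\x{263A}";
my $wide_down = utf8::downgrade($wide, 1) ? 1 : 0;
```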
1351
1352 Using Unicode in XS
1353 If you want to handle Perl Unicode in XS extensions, you may find the
1354 following C APIs useful. See also "Unicode Support" in perlguts for an
1355 explanation about Unicode at the XS level, and perlapi for the API
1356 details.
1357
1358 · "DO_UTF8(sv)" returns true if the "UTF8" flag is on and the bytes
1359 pragma is not in effect. "SvUTF8(sv)" returns true if the "UTF8"
1360 flag is on; the bytes pragma is ignored. The "UTF8" flag being on
1361 does not mean that there are any characters of code points greater
1362 than 255 (or 127) in the scalar or that there are even any
1363 characters in the scalar. What the "UTF8" flag means is that the
1364 sequence of octets in the representation of the scalar is the
1365 sequence of UTF-8 encoded code points of the characters of a
1366 string. The "UTF8" flag being off means that each octet in this
1367 representation encodes a single character with code point 0..255
1368 within the string. Perl's Unicode model is not to use UTF-8 until
1369 it is absolutely necessary.
1370
1371 · "uvchr_to_utf8(buf, chr)" writes a Unicode character code point
1372 into a buffer encoding the code point as UTF-8, and returns a
1373 pointer pointing after the UTF-8 bytes. It works appropriately on
1374 EBCDIC machines.
1375
1376 · "utf8_to_uvchr_buf(buf, bufend, lenp)" reads UTF-8 encoded bytes
1377 from a buffer and returns the Unicode character code point and,
1378 optionally, the length of the UTF-8 byte sequence. It works
1379 appropriately on EBCDIC machines.
1380
· "utf8_length(start, end)" returns the length of the UTF-8 encoded
  buffer in characters. "sv_len_utf8(sv)" returns the length of the
  UTF-8 encoded scalar, also in characters.
1384
1385 · "sv_utf8_upgrade(sv)" converts the string of the scalar to its
1386 UTF-8 encoded form. "sv_utf8_downgrade(sv)" does the opposite, if
1387 possible. "sv_utf8_encode(sv)" is like sv_utf8_upgrade except that
1388 it does not set the "UTF8" flag. "sv_utf8_decode()" does the
1389 opposite of "sv_utf8_encode()". Note that none of these are to be
1390 used as general-purpose encoding or decoding interfaces: "use
1391 Encode" for that. "sv_utf8_upgrade()" is affected by the encoding
1392 pragma but "sv_utf8_downgrade()" is not (since the encoding pragma
1393 is designed to be a one-way street).
1394
1395 · "is_utf8_string(buf, len)" returns true if "len" bytes of the
1396 buffer are valid UTF-8.
1397
· "is_utf8_char(s)" returns true if the pointer points to a valid UTF-8
  character. However, this function should not be used because of
  security concerns. Instead, use "is_utf8_string()".
1401
· "UTF8SKIP(buf)" will return the number of bytes in the UTF-8
  encoded character in the buffer. "UNISKIP(chr)" will return the
  number of bytes required to UTF-8-encode the Unicode character code
  point. "UTF8SKIP()" is useful, for example, for iterating over the
  characters of a UTF-8 encoded buffer; "UNISKIP()" is useful, for
  example, in computing the size required for a UTF-8 encoded buffer.
1408
1409 · "utf8_distance(a, b)" will tell the distance in characters between
1410 the two pointers pointing to the same UTF-8 encoded buffer.
1411
1412 · "utf8_hop(s, off)" will return a pointer to a UTF-8 encoded buffer
1413 that is "off" (positive or negative) Unicode characters displaced
1414 from the UTF-8 buffer "s". Be careful not to overstep the buffer:
1415 "utf8_hop()" will merrily run off the end or the beginning of the
1416 buffer if told to do so.
1417
1418 · "pv_uni_display(dsv, spv, len, pvlim, flags)" and
1419 "sv_uni_display(dsv, ssv, pvlim, flags)" are useful for debugging
1420 the output of Unicode strings and scalars. By default they are
1421 useful only for debugging--they display all characters as
1422 hexadecimal code points--but with the flags "UNI_DISPLAY_ISPRINT",
1423 "UNI_DISPLAY_BACKSLASH", and "UNI_DISPLAY_QQ" you can make the
1424 output more readable.
1425
1426 · "foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)" can be used to
1427 compare two strings case-insensitively in Unicode. For case-
1428 sensitive comparisons you can just use "memEQ()" and "memNE()" as
1429 usual, except if one string is in utf8 and the other isn't.
1430
1431 For more information, see perlapi, and utf8.c and utf8.h in the Perl
1432 source code distribution.
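
The meaning of the "UTF8" flag described in the first item above can also be observed from Perl, without writing any XS. The following is a sketch only: it uses the Perl-level functions utf8::upgrade() and bytes::length() rather than the C API, and the variable names are illustrative. It shows two scalars that hold the same single character but differ in their internal octet sequence:

```perl
use strict;
use warnings;
use bytes ();   # load the bytes:: functions without enabling byte semantics

my $plain    = "\xE9";       # one character, UTF8 flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);    # same character, UTF8 flag on

# Character semantics are identical for both scalars.
my $equal          = ($plain eq $upgraded) ? 1 : 0;
my $chars_plain    = length $plain;
my $chars_upgraded = length $upgraded;

# Only the number of octets in the internal representation differs.
my $octets_plain    = bytes::length($plain);
my $octets_upgraded = bytes::length($upgraded);

printf "equal=%d chars=%d/%d octets=%d/%d\n",
    $equal, $chars_plain, $chars_upgraded, $octets_plain, $octets_upgraded;
```

XS code that reads the string buffer directly sees the octets, which is why it has to check the flag before interpreting them.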
1433
1434 Hacking Perl to work on earlier Unicode versions (for very serious hackers
1435 only)
1436 Perl by default comes with the latest supported Unicode version built
1437 in, but you can change to use any earlier one.
1438
Download the files in the desired version of Unicode from the Unicode
web site <http://www.unicode.org>. These should replace the existing
files in lib/unicore in the Perl source tree. Follow the instructions
in README.perl in that directory to change some of their names, and
then build perl (see INSTALL).
1444
BUGS
1446 Interaction with Locales
1447 See "Unicode and UTF-8" in perllocale
1448
1449 Problems with characters in the Latin-1 Supplement range
1450 See "The "Unicode Bug""
1451
1452 Interaction with Extensions
1453 When Perl exchanges data with an extension, the extension should be
1454 able to understand the UTF8 flag and act accordingly. If the extension
1455 doesn't recognize that flag, it's likely that the extension will return
1456 incorrectly-flagged data.
1457
So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange. If the documentation does not talk about Unicode
at all, suspect the worst and probably look at the source to learn how
the module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.
1465
1466 For affected functions, the simple strategy to avoid data corruption is
1467 to always make the encoding of the exchanged data explicit. Choose an
1468 encoding that you know the extension can handle. Convert arguments
1469 passed to the extensions to that encoding and convert results back from
1470 that encoding. Write wrapper functions that do the conversions for you,
1471 so you can later change the functions when the extension catches up.
1472
1473 To provide an example, let's say the popular Foo::Bar::escape_html
1474 function doesn't deal with Unicode data yet. The wrapper function would
1475 convert the argument to raw UTF-8 and convert the result back to Perl's
1476 internal representation like so:
1477
    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        require Encode;    # provides encode_utf8() and decode_utf8()
        Encode::decode_utf8(Foo::Bar::escape_html(
                                Encode::encode_utf8($what)));
    }
1484
1485 Sometimes, when the extension does not convert data but just stores and
1486 retrieves them, you will be able to use the otherwise dangerous
1487 Encode::_utf8_on() function. Let's say the popular "Foo::Bar"
1488 extension, written in C, provides a "param" method that lets you store
1489 and retrieve data according to these prototypes:
1490
    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar
1493
1494 If it does not yet provide support for any encoding, one could write a
1495 derived class with such a "param" method:
1496
    sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);       # make sure it is UTF-8 encoded
        if (defined $value) {
            utf8::upgrade($value);  # make sure it is UTF-8 encoded
            return $self->SUPER::param($name,$value);
        } else {
            require Encode;         # needed for Encode::_utf8_on()
            my $ret = $self->SUPER::param($name);
            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
            return $ret;
        }
    }
1509
Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family. Look out for such filters in the
documentation of your extensions; they can make the transition to
Unicode data much easier.
1514
1515 Speed
1516 Some functions are slower when working on UTF-8 encoded strings than on
1517 byte encoded strings. All functions that need to hop over characters
1518 such as length(), substr() or index(), or matching regular expressions
1519 can work much faster when the underlying data are byte-encoded.
1520
1521 In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 a
1522 caching scheme was introduced which will hopefully make the slowness
1523 somewhat less spectacular, at least for some operations. In general,
1524 operations with UTF-8 encoded strings are still slower. As an example,
1525 the Unicode properties (character classes) like "\p{Nd}" are known to
1526 be quite a bit slower (5-20 times) than their simpler counterparts like
1527 "\d" (then again, there are hundreds of Unicode characters matching
1528 "Nd" compared with the 10 ASCII characters matching "d").
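
To check how much this matters on a particular installation, the two representations can be timed side by side with the core Benchmark module. This is a rough sketch: the string contents, the offsets, and the sub names are arbitrary choices for illustration, and the results depend heavily on the Perl version and on the position caching mentioned above, so no particular ratio should be expected.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Two strings with identical characters but different internal storage.
my $bytes = "abc\xE9" x 25_000;     # 100_000 characters, one octet each
my $utf8  = $bytes;
utf8::upgrade($utf8);               # same characters, UTF-8 encoded

# Offsets spread across the string, all below its length.
my @offsets = map { $_ * 997 } 1 .. 100;

# Sum the code points found at the chosen offsets of each string.
sub pick_bytes { my $n = 0; $n += ord substr($bytes, $_, 1) for @offsets; return $n }
sub pick_utf8  { my $n = 0; $n += ord substr($utf8,  $_, 1) for @offsets; return $n }

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, {
    byte_encoded => \&pick_bytes,
    utf8_encoded => \&pick_utf8,
});
```

Both subs return the same sum, since the strings are equal as far as Perl's character semantics are concerned; only the time taken can differ.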
1529
1530 Problems on EBCDIC platforms
1531 There are several known problems with Perl on EBCDIC platforms. If you
1532 want to use Perl there, send email to perlbug@perl.org.
1533
1534 In earlier versions, when byte and character data were concatenated,
1535 the new string was sometimes created by decoding the byte strings as
1536 ISO 8859-1 (Latin-1), even if the old Unicode string used EBCDIC.
1537
1538 If you find any of these, please report them as bugs.
1539
1540 Porting code from perl-5.6.X
1541 Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1542 was required to use the "utf8" pragma to declare that a given scope
1543 expected to deal with Unicode data and had to make sure that only
1544 Unicode data were reaching that scope. If you have code that is working
1545 with 5.6, you will need some of the following adjustments to your code.
1546 The examples are written such that the code will continue to work under
1547 5.6, so you should be safe to try them out.
1548
1549 · A filehandle that should read or write UTF-8
1550
      if ($] > 5.007) {
          binmode $fh, ":encoding(utf8)";
      }
1554
1555 · A scalar that is going to be passed to some extension
1556
1557 Be it Compress::Zlib, Apache::Request or any extension that has no
1558 mention of Unicode in the manpage, you need to make sure that the
1559 UTF8 flag is stripped off. Note that at the time of this writing
1560 (October 2002) the mentioned modules are not UTF-8-aware. Please
1561 check the documentation to verify if this is still true.
1562
      if ($] > 5.007) {
          require Encode;
          $val = Encode::encode_utf8($val); # make octets
      }
1567
1568 · A scalar we got back from an extension
1569
1570 If you believe the scalar comes back as UTF-8, you will most likely
1571 want the UTF8 flag restored:
1572
      if ($] > 5.007) {
          require Encode;
          $val = Encode::decode_utf8($val);
      }
1577
1578 · Same thing, if you are really sure it is UTF-8
1579
      if ($] > 5.007) {
          require Encode;
          Encode::_utf8_on($val);
      }
1584
1585 · A wrapper for fetchrow_array and fetchrow_hashref
1586
1587 When the database contains only UTF-8, a wrapper function or method
1588 is a convenient way to replace all your fetchrow_array and
1589 fetchrow_hashref calls. A wrapper function will also make it easier
1590 to adapt to future enhancements in your database driver. Note that
1591 at the time of this writing (October 2002), the DBI has no
1592 standardized way to deal with UTF-8 data. Please check the
1593 documentation to verify if that is still true.
1594
      sub fetchrow {
          # $what is one of fetchrow_{array,hashref}
          my($self, $sth, $what) = @_;
          if ($] < 5.007) {
              return $sth->$what;
          } else {
              require Encode;
              if (wantarray) {
                  my @arr = $sth->$what;
                  for (@arr) {
                      defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                  }
                  return @arr;
              } else {
                  my $ret = $sth->$what;
                  if (ref $ret) {
                      for my $k (keys %$ret) {
                          defined
                          && /[^\000-\177]/
                          && Encode::_utf8_on($_) for $ret->{$k};
                      }
                      return $ret;
                  } else {
                      defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                      return $ret;
                  }
              }
          }
      }
1624
1625 · A large scalar that you know can only contain ASCII
1626
1627 Scalars that contain only ASCII and are marked as UTF-8 are
1628 sometimes a drag to your program. If you recognize such a situation,
1629 just remove the UTF8 flag:
1630
      utf8::downgrade($val) if $] > 5.007;
1632
SEE ALSO
perlunitut, perluniintro, perluniprops, Encode, open, utf8, bytes,
perlretut, "${^UNICODE}" in perlvar,
<http://www.unicode.org/reports/tr44>.
1637
1638
1639
1640perl v5.16.3 2013-03-04 PERLUNICODE(1)