perlunicode(1)

PERLUNICODE(1)         Perl Programmers Reference Guide         PERLUNICODE(1)

NAME

       perlunicode - Unicode support in Perl

DESCRIPTION

   Important Caveats
       Unicode support is an extensive requirement. While Perl does not
       implement the Unicode standard or the accompanying technical reports
       from cover to cover, Perl does support many Unicode features.

       People who want to learn to use Unicode in Perl should probably read
       the Perl Unicode tutorials, perlunitut and perluniintro, before reading
       this reference document.

       Also, the use of Unicode may present security issues that aren't
       obvious.  Read Unicode Security Considerations
       <http://www.unicode.org/reports/tr36>.

       Safest if you "use feature 'unicode_strings'"
           In order to preserve backward compatibility, Perl does not turn on
           full internal Unicode support unless the pragma "use feature
           'unicode_strings'" is specified.  (This is automatically selected
           if you use "use 5.012" or higher.)  Failure to do this can trigger
           surprises.  See "The "Unicode Bug"" below.

           This pragma doesn't affect I/O, and there are still several places
           where Unicode isn't fully supported, such as in filenames.

       Input and Output Layers
           Perl knows when a filehandle uses Perl's internal Unicode encodings
           (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened
           with the ":encoding(utf8)" layer.  Other encodings can be converted
           to Perl's encoding on input or from Perl's encoding on output by
           use of the ":encoding(...)"  layer.  See open.

           To indicate that Perl source itself is in UTF-8, use "use utf8;".

       "use utf8" still needed to enable UTF-8/UTF-EBCDIC in scripts
           As a compatibility measure, the "use utf8" pragma must be
           explicitly included to enable recognition of UTF-8 in the Perl
           scripts themselves (in string or regular expression literals, or in
           identifier names) on ASCII-based machines or to recognize
           UTF-EBCDIC on EBCDIC-based machines.  These are the only times when
           an explicit "use utf8" is needed.  See utf8.

       BOM-marked scripts and UTF-16 scripts autodetected
           If a Perl script begins marked with the Unicode BOM (UTF-16LE,
           UTF-16BE, or UTF-8), or if the script looks like non-BOM-marked
           UTF-16 of either endianness, Perl will correctly read in the script
           as Unicode.  (BOMless UTF-8 cannot be effectively recognized or
           differentiated from ISO 8859-1 or other eight-bit encodings.)

       "use encoding" needed to upgrade non-Latin-1 byte strings
           By default, there is a fundamental asymmetry in Perl's Unicode
           model: implicit upgrading from byte strings to Unicode strings
           assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode
           strings are downgraded with UTF-8 encoding.  This happens because
           the first 256 codepoints in Unicode happen to agree with Latin-1.

           See "Byte and Character Semantics" for more details.

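       As a minimal sketch of the layer and upgrade behavior described above,
       the following script writes one character through an encoding layer,
       reads it back, and then contrasts an implicit-style upgrade with an
       explicit UTF-8 encoding.  The in-memory filehandle, the variable names,
       and the strict "UTF-8" spelling of the layer name (instead of "utf8")
       are choices made for this illustration only.

```perl
use strict;
use warnings;

# Write one character through an encoding layer into an in-memory file.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "Cannot open for writing: $!";
print {$out} "\x{263A}";                # WHITE SMILING FACE, a single character
close($out) or die "Cannot close: $!";
my $bytes_written = length($buffer);    # the buffer holds encoded bytes

# Read it back through the same layer to recover the character.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "Cannot open for reading: $!";
my $text = <$in>;
close($in) or die "Cannot close: $!";
my $chars_read = length($text);

# Upgrading treats byte values as Latin-1 code points: the string keeps
# the same single character, only its internal form changes.
my $upgraded = "\xE9";
utf8::upgrade($upgraded);
my $same_character = ($upgraded eq "\xE9" && length($upgraded) == 1) ? 1 : 0;

# Explicit encoding to UTF-8 yields two bytes for U+00E9.
my $encoded = "\xE9";
utf8::encode($encoded);
my $encoded_bytes = join ' ', map { sprintf '%02X', ord } split //, $encoded;

print "bytes written: $bytes_written, characters read: $chars_read\n";
print "same character after upgrade: $same_character\n";
print "UTF-8 bytes for U+00E9: $encoded_bytes\n";
```

       The byte count and the character count differ because the layer
       converts between Perl's character strings and the external encoding.
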
   Byte and Character Semantics
       Beginning with version 5.6, Perl uses logically-wide characters to
       represent strings internally.

       Starting in Perl 5.14, Perl-level operations work with characters
       rather than bytes within the scope of a "use feature 'unicode_strings'"
       (or equivalently "use 5.012" or higher).  (This is not true if bytes
       have been explicitly requested by "use bytes", nor necessarily true for
       interactions with the platform's operating system.)

       For earlier Perls, and when "unicode_strings" is not in effect, Perl
       provides a fairly safe environment that can handle both types of
       semantics in programs.  For operations where Perl can unambiguously
       decide that the input data are characters, Perl switches to character
       semantics.  For operations where this determination cannot be made
       without additional information from the user, Perl decides in favor of
       compatibility and chooses to use byte semantics.

       When "use locale" (but not "use locale ':not_characters'") is in
       effect, Perl uses the semantics associated with the current locale.
       ("use locale" overrides "use feature 'unicode_strings'" in the same
       scope; while "use locale ':not_characters'" effectively also selects
       "use feature 'unicode_strings'" in its scope; see perllocale.)
       Otherwise, Perl uses the platform's native byte semantics for
       characters whose code points are less than 256, and Unicode semantics
       for those greater than 255.  On EBCDIC platforms, this is almost
       seamless, as the EBCDIC code pages that Perl handles are equivalent to
       Unicode's first 256 code points.  (The exception is that EBCDIC regular
       expression case-insensitive matching rules are not as robust as
       Unicode's.)   But on ASCII platforms, Perl uses US-ASCII (or Basic
       Latin in Unicode terminology) byte semantics, meaning that characters
       whose ordinal numbers are in the range 128 - 255 are undefined except
       for their ordinal numbers.  This means that none have case (upper and
       lower), nor are any a member of character classes, like "[:alpha:]" or
       "\w".  (But all do belong to the "\W" class or the Perl regular
       expression extension "[:^alpha:]".)

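       The difference between native byte semantics and Unicode semantics for
       code points 128 - 255 can be observed with a short script.  This is a
       sketch only: it assumes an ASCII platform, no "use locale" in effect,
       and a Perl recent enough to honor "unicode_strings" in regular
       expressions; the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $e_acute = chr(0xE9);    # LATIN SMALL LETTER E WITH ACUTE, held as a byte string

my ($is_word_native, $uc_unchanged_native);
{
    no feature 'unicode_strings';       # legacy rules for code points 128 - 255
    $is_word_native      = $e_acute =~ /\w/ ? 1 : 0;
    $uc_unchanged_native = uc($e_acute) eq $e_acute ? 1 : 0;
}

my ($is_word_unicode, $uc_is_capital_unicode);
{
    use feature 'unicode_strings';      # Unicode rules for every code point
    $is_word_unicode       = $e_acute =~ /\w/ ? 1 : 0;
    $uc_is_capital_unicode = uc($e_acute) eq chr(0xC9) ? 1 : 0;
}

print "native:  matches \\w = $is_word_native, uc() leaves it unchanged = $uc_unchanged_native\n";
print "unicode: matches \\w = $is_word_unicode, uc() gives U+00C9 = $uc_is_capital_unicode\n";
```
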
       This behavior preserves compatibility with earlier versions of Perl,
       which allowed byte semantics in Perl operations only if none of the
       program's inputs were marked as being a source of Unicode character
       data.  Such data may come from filehandles, from calls to external
       programs, from information provided by the system (such as %ENV), or
       from literals and constants in the source text.

       The "utf8" pragma is primarily a compatibility device that enables
       recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
       Note that this pragma is only required while Perl defaults to byte
       semantics; when character semantics become the default, this pragma may
       become a no-op.  See utf8.

       If strings operating under byte semantics and strings with Unicode
       character data are concatenated, the new string will have character
       semantics.  This can cause surprises: See "BUGS", below.  You can
       choose to be warned when this happens.  See encoding::warnings.

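       The kind of surprise that concatenation can cause is sketched below.
       The example assumes an ASCII platform with "unicode_strings" and
       "use locale" both off; the variable names are hypothetical.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make sure the legacy rules are in force

my $byte_string = "\xE9";       # one character below 256: byte semantics
my $wide_string = "\x{263A}";   # one character above 255: character semantics

# On its own, the byte string's first character is not a word character.
my $word_before = $byte_string =~ /\A\w/ ? 1 : 0;

# After concatenation the whole string has character semantics, so the
# very same first character now matches \w.
my $joined     = $byte_string . $wide_string;
my $word_after = $joined =~ /\A\w/ ? 1 : 0;

print "length of joined string: ", length($joined), "\n";
print "first character matches \\w before: $word_before, after: $word_after\n";
```
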
       Under character semantics, many operations that formerly operated on
       bytes now operate on characters. A character in Perl is logically just
       a number ranging from 0 to 2**31 or so. Larger characters may encode
       into longer sequences of bytes internally, but this internal detail is
       mostly hidden for Perl code.  See perluniintro for more.

   Effects of Character Semantics
       Character semantics have the following effects:

       ·   Strings--including hash keys--and regular expression patterns may
           contain characters that have an ordinal value larger than 255.

           If you use a Unicode editor to edit your program, Unicode
           characters may occur directly within the literal strings in UTF-8
           encoding, or UTF-16.  (The former requires a BOM or "use utf8", the
           latter requires a BOM.)

           Unicode characters can also be added to a string by using the
           "\N{U+...}" notation.  The Unicode code for the desired character,
           in hexadecimal, should be placed in the braces, after the "U". For
           instance, a smiley face is "\N{U+263A}".

           Alternatively, you can use the "\x{...}" notation for characters
           0x100 and above.  For characters below 0x100 you may get byte
           semantics instead of character semantics;  see "The "Unicode Bug"".
           On EBCDIC machines there is the additional problem that the value
           for such characters gives the EBCDIC character rather than the
           Unicode one, thus it is more portable to use "\N{U+...}" instead.

           Additionally, you can use the "\N{...}" notation and put the
           official Unicode character name within the braces, such as
           "\N{WHITE SMILING FACE}".  This automatically loads the charnames
           module with the ":full" and ":short" options.  If you prefer
           different options for this module, you can instead, before the
           "\N{...}", explicitly load it with your desired options; for
           example,

              use charnames ':loose';

       ·   If an appropriate encoding is specified, identifiers within the
           Perl script may contain Unicode alphanumeric characters, including
           ideographs.  Perl does not currently attempt to canonicalize
           variable names.

       ·   Regular expressions match characters instead of bytes.  "." matches
           a character instead of a byte.

       ·   Bracketed character classes in regular expressions match characters
           instead of bytes and match against the character properties
           specified in the Unicode properties database.  "\w" can be used to
           match a Japanese ideograph, for instance.

       ·   Named Unicode properties, scripts, and block ranges may be used
           (like bracketed character classes) by using the "\p{}" "matches
           property" construct and the "\P{}" negation, "doesn't match
           property".  See "Unicode Character Properties" for more details.

           You can define your own character properties and use them in the
           regular expression with the "\p{}" or "\P{}" construct.  See "User-
           Defined Character Properties" for more details.

       ·   The special pattern "\X" matches a logical character, an "extended
           grapheme cluster" in Standardese.  In Unicode what appears to the
           user to be a single character, for example an accented "G", may in
           fact be composed of a sequence of characters, in this case a "G"
           followed by an accent character.  "\X" will match the entire
           sequence.

       ·   The "tr///" operator translates characters instead of bytes.  Note
           that the "tr///CU" functionality has been removed.  For similar
           functionality see pack('U0', ...) and pack('C0', ...).

       ·   Case translation operators use the Unicode case translation tables
           when character input is provided.  Note that "uc()", or "\U" in
           interpolated strings, translates to uppercase, while "ucfirst", or
           "\u" in interpolated strings, translates to titlecase in languages
           that make the distinction (which is equivalent to uppercase in
           languages without the distinction).

       ·   Most operators that deal with positions or lengths in a string will
           automatically switch to using character positions, including
           "chop()", "chomp()", "substr()", "pos()", "index()", "rindex()",
           "sprintf()", "write()", and "length()".  An operator that
           specifically does not switch is "vec()".  Operators that really
           don't care include operators that treat strings as a bucket of bits
           such as "sort()", and operators dealing with filenames.

       ·   The "pack()"/"unpack()" letter "C" does not change, since it is
           often used for byte-oriented formats.  Again, think "char" in the C
           language.

           There is a new "U" specifier that converts between Unicode
           characters and code points. There is also a "W" specifier that is
           the equivalent of "chr"/"ord" and properly handles character values
           even if they are above 255.

       ·   The "chr()" and "ord()" functions work on characters, similar to
           "pack("W")" and "unpack("W")", not "pack("C")" and "unpack("C")".
           "pack("C")" and "unpack("C")" are methods for emulating byte-
           oriented "chr()" and "ord()" on Unicode strings.  While these
           methods reveal the internal encoding of Unicode strings, that is
           not something one normally needs to care about at all.

       ·   The bit string operators, "& | ^ ~", can operate on character data.
           However, for backward compatibility, such as when using bit string
           operations when characters are all less than 256 in ordinal value,
           one should not use "~" (the bit complement) with characters of both
           values less than 256 and values greater than 256.  Most
           importantly, DeMorgan's laws ("~($x|$y) eq ~$x&~$y" and "~($x&$y)
           eq ~$x|~$y") will not hold.  The reason for this mathematical faux
           pas is that the complement cannot return both the 8-bit (byte-wide)
           bit complement and the full character-wide bit complement.

       ·   There is a CPAN module, Unicode::Casing, which allows you to define
           your own mappings to be used in "lc()", "lcfirst()", "uc()",
           "ucfirst()", and "fc" (or their double-quoted string inlined
           versions such as "\U").  (Prior to Perl 5.16, this functionality
           was partially provided in the Perl core, but suffered from a number
           of insurmountable drawbacks, so the CPAN module was written
           instead.)

       ·   And finally, "scalar reverse()" reverses by character rather than
           by byte.

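       Several of the effects listed above can be checked with a short
       script.  This is an illustrative sketch, not an exhaustive test: the
       sample characters and variable names are chosen for the example, and
       charnames is loaded explicitly so that the script does not depend on
       automatic loading.

```perl
use strict;
use warnings;
use charnames ':full';

# Three spellings of the same character.
my @smileys  = ("\x{263A}", "\N{U+263A}", "\N{WHITE SMILING FACE}");
my $all_same = (grep { $_ eq $smileys[0] } @smileys) == @smileys ? 1 : 0;

# \X matches a whole grapheme cluster; "." matches one code point.
my $g_acute     = "G\N{U+0301}";        # G followed by COMBINING ACUTE ACCENT
my $clusters    = () = $g_acute =~ /\X/g;
my $code_points = () = $g_acute =~ /./g;

# uc() gives uppercase, ucfirst() gives titlecase.
my $dz       = "\N{U+01C6}";            # LATIN SMALL LETTER DZ WITH CARON
my $upper_cp = sprintf 'U+%04X', ord uc $dz;
my $title_cp = sprintf 'U+%04X', ord ucfirst $dz;

# length() and scalar reverse() work on characters.
my $string      = "a\x{263A}b";
my $length      = length $string;
my $reversed_ok = scalar(reverse $string) eq "b\x{263A}a" ? 1 : 0;

# pack "W" and unpack "W" mirror chr() and ord(), even above 255.
my $packed_matches_chr = pack('W', 0x263A) eq chr(0x263A) ? 1 : 0;
my $unpacked           = unpack('W', chr(0x263A));

print "all spellings equal: $all_same\n";
print "clusters: $clusters, code points: $code_points\n";
print "uc: $upper_cp, ucfirst: $title_cp\n";
print "length: $length, reversed by character: $reversed_ok\n";
printf "pack W matches chr: %d, unpack W: 0x%04X\n", $packed_matches_chr, $unpacked;
```
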
   Unicode Character Properties
       (The only time that Perl considers a sequence of individual code points
       as a single logical character is in the "\X" construct, already
       mentioned above.   Therefore "character" in this discussion means a
       single Unicode code point.)

       Very nearly all Unicode character properties are accessible through
       regular expressions by using the "\p{}" "matches property" construct
       and the "\P{}" "doesn't match property" for its negation.

       For instance, "\p{Uppercase}" matches any single character with the
       Unicode "Uppercase" property, while "\p{L}" matches any character with
       a General_Category of "L" (letter) property.  Brackets are not required
       for single letter property names, so "\p{L}" is equivalent to "\pL".

       More formally, "\p{Uppercase}" matches any single character whose
       Unicode Uppercase property value is True, and "\P{Uppercase}" matches
       any character whose Uppercase property value is False, and they could
       have been written as "\p{Uppercase=True}" and "\p{Uppercase=False}",
       respectively.

       This formality is needed when properties are not binary; that is, if
       they can take on more values than just True and False.  For example,
       the Bidi_Class (see "Bidirectional Character Types" below), can take on
       several different values, such as Left, Right, Whitespace, and others.
       To match these, one needs to specify both the property name
       (Bidi_Class), AND the value being matched against (Left, Right, etc.).
       This is done, as in the examples above, by having the two components
       separated by an equal sign (or interchangeably, a colon), like
       "\p{Bidi_Class: Left}".

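       The single and compound forms can be compared side by side.  The
       sketch below uses the short Bidi_Class values "L" and "R" listed later
       in this document; the helper "matches" and the sample characters are
       hypothetical names introduced only for the example.

```perl
use strict;
use warnings;

# Return 1 or 0 so that results can be stored safely in a list.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

my $alef = "\x{05D0}";      # HEBREW LETTER ALEF, written right to left

my @checks = (
    [ 'A    \p{Uppercase}',       matches('A',   qr/\p{Uppercase}/)       ],
    [ 'A    \p{Uppercase=True}',  matches('A',   qr/\p{Uppercase=True}/)  ],
    [ 'a    \P{Uppercase}',       matches('a',   qr/\P{Uppercase}/)       ],
    [ 'a    \p{Uppercase=False}', matches('a',   qr/\p{Uppercase=False}/) ],
    [ 'A    \pL',                 matches('A',   qr/\pL/)                 ],
    [ 'A    \p{Bidi_Class: L}',   matches('A',   qr/\p{Bidi_Class: L}/)   ],
    [ 'alef \p{Bidi_Class=R}',    matches($alef, qr/\p{Bidi_Class=R}/)    ],
    [ 'alef \p{Bidi_Class=L}',    matches($alef, qr/\p{Bidi_Class=L}/)    ],
);

printf "%-28s %d\n", @$_ for @checks;
my @results = map { $_->[1] } @checks;
```
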
       All Unicode-defined character properties may be written in these
       compound forms of "\p{property=value}" or "\p{property:value}", but
       Perl provides some additional properties that are written only in the
       single form, as well as single-form short-cuts for all binary
       properties and certain others described below, in which you may omit
       the property name and the equals or colon separator.

       Most Unicode character properties have at least two synonyms (or
       aliases if you prefer): a short one that is easier to type and a longer
       one that is more descriptive and hence easier to understand.  Thus the
       "L" and "Letter" properties above are equivalent and can be used
       interchangeably.  Likewise, "Upper" is a synonym for "Uppercase", and
       we could have written "\p{Uppercase}" equivalently as "\p{Upper}".
       Also, there are typically various synonyms for the values the property
       can be.   For binary properties, "True" has 3 synonyms: "T", "Yes", and
       "Y"; and "False" has correspondingly "F", "No", and "N".  But be
       careful.  A short form of a value for one property may not mean the
       same thing as the same short form for another.  Thus, for the
       General_Category property, "L" means "Letter", but for the Bidi_Class
       property, "L" means "Left".  A complete list of properties and synonyms
       is in perluniprops.

       Upper/lower case differences in property names and values are
       irrelevant; thus "\p{Upper}" means the same thing as "\p{upper}" or
       even "\p{UpPeR}".  Similarly, you can add or subtract underscores
       anywhere in the middle of a word, so that these are also equivalent to
       "\p{U_p_p_e_r}".  And white space is irrelevant adjacent to non-word
       characters, such as the braces and the equals or colon separators, so
       "\p{   Upper  }" and "\p{ Upper_case : Y }" are equivalent to these as
       well.  In fact, white space and even hyphens can usually be added or
       deleted anywhere.  So even "\p{ Up-per case = Yes}" is equivalent.  All
       this is called "loose-matching" by Unicode.  The few places where
       stricter matching is used are in the middle of numbers, and in the Perl
       extension properties that begin or end with an underscore.  Stricter
       matching cares about white space (except adjacent to non-word
       characters), hyphens, and non-interior underscores.

       You can also use negation in both "\p{}" and "\P{}" by introducing a
       caret (^) between the first brace and the property name: "\p{^Tamil}"
       is equal to "\P{Tamil}".

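       Loose matching and caret negation can be exercised as follows.  This
       is a sketch under the assumption that the spellings quoted above are
       accepted by the Perl in use; "matches_property" is a hypothetical
       helper that builds the pattern at run time from the text between the
       braces.

```perl
use strict;
use warnings;

# $property is the text that goes between the braces of \p{...}.
sub matches_property {
    my ($string, $property) = @_;
    my $pattern = '\p{' . $property . '}';
    return $string =~ /$pattern/ ? 1 : 0;
}

# Spellings that loose matching should treat as the same property.
my @spellings = ('Upper', 'upper', 'UpPeR', 'U_p_p_e_r',
                 '   Upper  ', ' Upper_case : Y ');
my @loose_results = map { matches_property('A', $_) } @spellings;

# A caret after the opening brace negates the property.
my $tamil_a = "\x{0B85}";   # TAMIL LETTER A
my @caret_results = (
    matches_property('A',      '^Tamil'),   # same as \P{Tamil}
    ('A' =~ /\P{Tamil}/ ? 1 : 0),
    matches_property($tamil_a, '^Tamil'),
    matches_property($tamil_a, 'Tamil'),
);

printf "%-20s %d\n", "'$spellings[$_]'", $loose_results[$_] for 0 .. $#spellings;
print "caret checks: @caret_results\n";
```
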
       Almost all properties are immune to case-insensitive matching.  That
       is, adding a "/i" regular expression modifier does not change what they
       match.  There are two sets that are affected.  The first set is
       "Uppercase_Letter", "Lowercase_Letter", and "Titlecase_Letter", all of
       which match "Cased_Letter" under "/i" matching.  And the second set is
       "Uppercase", "Lowercase", and "Titlecase", all of which match "Cased"
       under "/i" matching.  This set also includes its subsets "PosixUpper"
       and "PosixLower" both of which under "/i" matching match "PosixAlpha".
       (The difference between these sets is that some things, such as Roman
       numerals, come in both upper and lower case so they are "Cased", but
       aren't considered letters, so they aren't "Cased_Letter"s.)

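       The effect of "/i" on these two sets can be sketched with a lowercase
       letter and a lowercase Roman numeral.  The sample character U+2170 and
       the variable names are choices made for this illustration.

```perl
use strict;
use warnings;

my $small_roman_one = "\x{2170}";   # SMALL ROMAN NUMERAL ONE: cased, but not a letter

# Without /i, a lowercase letter is not an Uppercase_Letter.
my $lu_plain = 'a' =~ /\p{Lu}/ ? 1 : 0;

# With /i, \p{Lu} behaves like Cased_Letter and \p{Uppercase} like Cased.
my $lu_folded    = 'a' =~ /\p{Lu}/i        ? 1 : 0;
my $upper_folded = 'a' =~ /\p{Uppercase}/i ? 1 : 0;

# The Roman numeral is Cased but not a Cased_Letter.
my $roman_upper_folded = $small_roman_one =~ /\p{Uppercase}/i ? 1 : 0;
my $roman_lu_folded    = $small_roman_one =~ /\p{Lu}/i        ? 1 : 0;

print "a      \\p{Lu}: $lu_plain, \\p{Lu}/i: $lu_folded, \\p{Uppercase}/i: $upper_folded\n";
print "U+2170 \\p{Uppercase}/i: $roman_upper_folded, \\p{Lu}/i: $roman_lu_folded\n";
```
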
       The result is undefined if you try to match a non-Unicode code point
       (that is, one above 0x10FFFF) against a Unicode property.  Currently, a
       warning is raised, and the match will fail.  In some cases, this is
       counterintuitive, as both these fail:

        chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/      # Fails.
        chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/     # Fails!

       General_Category

       Every Unicode character is assigned a general category, which is the
       "most usual categorization of a character" (from
       <http://www.unicode.org/reports/tr44>).

       The compound way of writing these is like "\p{General_Category=Number}"
       (short, "\p{gc:n}").  But Perl furnishes shortcuts in which everything
       up through the equal or colon separator is omitted.  So you can instead
       just write "\pN".

       Here are the short and long forms of the General Category properties:

           Short       Long

           L           Letter
           LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
           Lu          Uppercase_Letter
           Ll          Lowercase_Letter
           Lt          Titlecase_Letter
           Lm          Modifier_Letter
           Lo          Other_Letter

           M           Mark
           Mn          Nonspacing_Mark
           Mc          Spacing_Mark
           Me          Enclosing_Mark

           N           Number
           Nd          Decimal_Number (also Digit)
           Nl          Letter_Number
           No          Other_Number

           P           Punctuation (also Punct)
           Pc          Connector_Punctuation
           Pd          Dash_Punctuation
           Ps          Open_Punctuation
           Pe          Close_Punctuation
           Pi          Initial_Punctuation
                       (may behave like Ps or Pe depending on usage)
           Pf          Final_Punctuation
                       (may behave like Ps or Pe depending on usage)
           Po          Other_Punctuation

           S           Symbol
           Sm          Math_Symbol
           Sc          Currency_Symbol
           Sk          Modifier_Symbol
           So          Other_Symbol

           Z           Separator
           Zs          Space_Separator
           Zl          Line_Separator
           Zp          Paragraph_Separator

           C           Other
           Cc          Control (also Cntrl)
           Cf          Format
           Cs          Surrogate
           Co          Private_Use
           Cn          Unassigned

       Single-letter properties match all characters in any of the two-letter
       sub-properties starting with the same letter.  "LC" and "L&" are
       special: both are aliases for the set consisting of everything matched
       by "Ll", "Lu", and "Lt".

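       A brief sketch of how the single-letter, two-letter, and "LC"/"L&"
       categories relate to one another follows.  The helper "matches" and
       the sample characters are assumptions made for the example.

```perl
use strict;
use warnings;

# Return 1 or 0 so that results can be stored safely in a list.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

my $roman_eight = "\x{2167}";   # ROMAN NUMERAL EIGHT, category Nl
my $title_dz    = "\x{01C5}";   # a titlecase letter, category Lt

my @checks = (
    [ '5      \p{General_Category=Number}', matches('5', qr/\p{General_Category=Number}/) ],
    [ '5      \p{gc:n}',                    matches('5', qr/\p{gc:n}/)                    ],
    [ '5      \pN',                         matches('5', qr/\pN/)                         ],
    [ '5      \p{Nd}',                      matches('5', qr/\p{Nd}/)                      ],
    [ 'U+2167 \pN',                         matches($roman_eight, qr/\pN/)                ],
    [ 'U+2167 \p{Nl}',                      matches($roman_eight, qr/\p{Nl}/)             ],
    [ 'U+2167 \p{Nd}',                      matches($roman_eight, qr/\p{Nd}/)             ],
    [ 'U+01C5 \p{LC}',                      matches($title_dz, qr/\p{LC}/)                ],
    [ 'U+01C5 \p{L&}',                      matches($title_dz, qr/\p{L&}/)                ],
    [ 'U+01C5 \p{Lu}',                      matches($title_dz, qr/\p{Lu}/)                ],
    [ 'U+01C5 \p{Ll}',                      matches($title_dz, qr/\p{Ll}/)                ],
);

printf "%-36s %d\n", @$_ for @checks;
my @results = map { $_->[1] } @checks;
```
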
       Bidirectional Character Types

       Because scripts differ in their directionality (Hebrew and Arabic are
       written right to left, for example) Unicode supplies these properties
       in the Bidi_Class class:

           Property    Meaning

           L           Left-to-Right
           LRE         Left-to-Right Embedding
           LRO         Left-to-Right Override
           R           Right-to-Left
           AL          Arabic Letter
           RLE         Right-to-Left Embedding
           RLO         Right-to-Left Override
           PDF         Pop Directional Format
           EN          European Number
           ES          European Separator
           ET          European Terminator
           AN          Arabic Number
           CS          Common Separator
           NSM         Non-Spacing Mark
           BN          Boundary Neutral
           B           Paragraph Separator
           S           Segment Separator
           WS          Whitespace
           ON          Other Neutrals

       This property is always written in the compound form.  For example,
       "\p{Bidi_Class:R}" matches characters that are normally written right
       to left.

       Scripts

       The world's languages are written in many different scripts.  This
       sentence (unless you're reading it in translation) is written in Latin,
       while Russian is written in Cyrillic, and Greek is written in, well,
       Greek; Japanese mainly in Hiragana or Katakana.  There are many more.

       The Unicode Script and Script_Extensions properties give what script a
       given character is in.  Either property can be specified with the
       compound form like "\p{Script=Hebrew}" (short: "\p{sc=hebr}"), or
       "\p{Script_Extensions=Javanese}" (short: "\p{scx=java}").  In addition,
       Perl furnishes shortcuts for all "Script" property names.  You can omit
       everything up through the equals (or colon), and simply write
       "\p{Latin}" or "\P{Cyrillic}".  (This is not true for
       "Script_Extensions", which is required to be written in the compound
       form.)

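       The compound forms and the shortcuts can be tried on characters that
       belong to exactly one script.  This is a minimal sketch: it assumes a
       Perl that knows the Script_Extensions property, and the helper
       "matches" and the sample characters are chosen for the illustration.

```perl
use strict;
use warnings;

# Return 1 or 0 so that results can be stored safely in a list.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

my $alef = "\x{05D0}";      # HEBREW LETTER ALEF
my $zhe  = "\x{0416}";      # CYRILLIC CAPITAL LETTER ZHE

my @checks = (
    [ 'alef \p{Script=Hebrew}',            matches($alef, qr/\p{Script=Hebrew}/)            ],
    [ 'alef \p{sc=hebr}',                  matches($alef, qr/\p{sc=hebr}/)                  ],
    [ 'alef \p{Script_Extensions=Hebrew}', matches($alef, qr/\p{Script_Extensions=Hebrew}/) ],
    [ 'alef \p{scx=hebr}',                 matches($alef, qr/\p{scx=hebr}/)                 ],
    [ 'alef \p{Hebrew}',                   matches($alef, qr/\p{Hebrew}/)                   ],
    [ 'A    \p{Latin}',                    matches('A',   qr/\p{Latin}/)                    ],
    [ 'zhe  \p{Cyrillic}',                 matches($zhe,  qr/\p{Cyrillic}/)                 ],
    [ 'zhe  \p{Latin}',                    matches($zhe,  qr/\p{Latin}/)                    ],
    [ 'A    \P{Cyrillic}',                 matches('A',   qr/\P{Cyrillic}/)                 ],
);

printf "%-36s %d\n", @$_ for @checks;
my @results = map { $_->[1] } @checks;
```
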
       The difference between these two properties involves characters that
       are used in multiple scripts.  For example the digits '0' through '9'
       are used in many parts of the world.  These are placed in a script
       named "Common".  Other characters are used in just a few scripts.  For
       example, the "KATAKANA-HIRAGANA DOUBLE HYPHEN" is used in both Japanese
       scripts, Katakana and Hiragana, but nowhere else.  The "Script"
       property places all characters that are used in multiple scripts in the
       "Common" script, while the "Script_Extensions" property places those
       that are used in only a few scripts into each of those scripts; while
       still using "Common" for those used in many scripts.  Thus both these
       match:

        "0" =~ /\p{sc=Common}/     # Matches
        "0" =~ /\p{scx=Common}/    # Matches

       and only the first of these matches:

        "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/  # Matches
        "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match

       And only the last two of these match:

        "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/  # No match
        "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/  # No match
        "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches
        "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches

       "Script_Extensions" is thus an improved "Script", in which there are
       fewer characters in the "Common" script, and correspondingly more in
       other scripts.  It is new in Unicode version 6.0, and its data are
       likely to change significantly in later releases, as things get sorted
       out.

       (Actually, besides "Common", the "Inherited" script contains
       characters that are used in multiple scripts.  These are modifier
       characters which modify other characters, and inherit the script value
       of the controlling character.  Some of these are used in many scripts,
       and so go into "Inherited" in both "Script" and "Script_Extensions".
       Others are used in just a few scripts, so are in "Inherited" in
       "Script", but not in "Script_Extensions".)

       It is worth stressing that there are several different sets of digits
       in Unicode that are equivalent to 0-9 and are matchable by "\d" in a
       regular expression.  If they are used in a single language only, they
       are in that language's "Script" and "Script_Extensions".  If they are
       used in more than one script, they will be in "sc=Common", but only if
       they are used in many scripts should they be in "scx=Common".

       A complete list of scripts and their shortcuts is in perluniprops.

       Use of "Is" Prefix

       For backward compatibility (with Perl 5.6), all properties mentioned so
       far may have "Is" or "Is_" prepended to their name, so "\P{Is_Lu}", for
       example, is equal to "\P{Lu}", and "\p{IsScript:Arabic}" is equal to
       "\p{Arabic}".

       Blocks

       In addition to scripts, Unicode also defines blocks of characters.  The
       difference between scripts and blocks is that the concept of scripts is
       closer to natural languages, while the concept of blocks is more of an
       artificial grouping based on groups of Unicode characters with
       consecutive ordinal values. For example, the "Basic Latin" block is all
       characters whose ordinals are between 0 and 127, inclusive; in other
       words, the ASCII characters.  The "Latin" script contains some letters
       from this as well as several other blocks, like "Latin-1 Supplement",
       "Latin Extended-A", etc., but it does not contain all the characters
       from those blocks. It does not, for example, contain the digits 0-9,
       because those digits are shared across many scripts, and hence are in
       the "Common" script.

       For more about scripts versus blocks, see UAX#24 "Unicode Script
       Property": <http://www.unicode.org/reports/tr24>

       The "Script" or "Script_Extensions" properties are likely to be the
       ones you want to use when processing natural language; the Block
       property may occasionally be useful in working with the nuts and bolts
       of Unicode.

       Block names are matched in the compound form, like "\p{Block: Arrows}"
       or "\p{Blk=Hebrew}".  Unlike most other properties, only a few block
       names have a Unicode-defined short name.  But Perl does provide a
       (slight) shortcut:  You can say, for example "\p{In_Arrows}" or
       "\p{In_Hebrew}".  For backwards compatibility, the "In" prefix may be
       omitted if there is no naming conflict with a script or any other
       property, and you can even use an "Is" prefix instead in those cases.
       But it is not a good idea to do this, for a couple of reasons:

       1.  It is confusing.  There are many naming conflicts, and you may
           forget some.  For example, "\p{Hebrew}" means the script Hebrew,
           and NOT the block Hebrew.  But would you remember that 6 months
           from now?

       2.  It is unstable.  A new version of Unicode may pre-empt the current
           meaning by creating a property with the same name.  There was a
           time in very early Unicode releases when "\p{Hebrew}" would have
           matched the block Hebrew; now it doesn't.

       Some people prefer to always use "\p{Block: foo}" and "\p{Script: bar}"
       instead of the shortcuts, whether for clarity, because they can't
       remember the difference between 'In' and 'Is' anyway, or they aren't
       confident that those who eventually will read their code will know that
       difference.

       A complete list of blocks and their shortcuts is in perluniprops.

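       The distinction between blocks and scripts can be made concrete with a
       few characters whose block and script disagree.  The sketch below is
       illustrative only; the helper "matches" and the sample characters are
       hypothetical choices, not part of any API.

```perl
use strict;
use warnings;

# Return 1 or 0 so that results can be stored safely in a list.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

my $arrow     = "\x{2190}";     # LEFTWARDS ARROW, in the Arrows block
my $alef      = "\x{05D0}";     # HEBREW LETTER ALEF, Hebrew block and script
my $yod_hiriq = "\x{FB1D}";     # HEBREW LETTER YOD WITH HIRIQ: Hebrew script,
                                # but in the Alphabetic Presentation Forms block

my @checks = (
    [ 'arrow  \p{Block: Arrows}',      matches($arrow,     qr/\p{Block: Arrows}/)      ],
    [ 'arrow  \p{In_Arrows}',          matches($arrow,     qr/\p{In_Arrows}/)          ],
    [ 'alef   \p{Blk=Hebrew}',         matches($alef,      qr/\p{Blk=Hebrew}/)         ],
    [ 'alef   \p{In_Hebrew}',          matches($alef,      qr/\p{In_Hebrew}/)          ],
    [ 'U+FB1D \p{Script=Hebrew}',      matches($yod_hiriq, qr/\p{Script=Hebrew}/)      ],
    [ 'U+FB1D \p{Blk=Hebrew}',         matches($yod_hiriq, qr/\p{Blk=Hebrew}/)         ],
    [ '0      \p{Block=Basic_Latin}',  matches('0',        qr/\p{Block=Basic_Latin}/)  ],
    [ '0      \p{Script=Latin}',       matches('0',        qr/\p{Script=Latin}/)       ],
);

printf "%-32s %d\n", @$_ for @checks;
my @results = map { $_->[1] } @checks;
```

       Writing the property name out in full, as here, avoids having to
       remember which shortcut refers to a block and which to a script.
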
559       Other Properties
560
561       There are many more properties than the very basic ones described here.
562       A complete list is in perluniprops.
563
564       Unicode defines all its properties in the compound form, so all single-
565       form properties are Perl extensions.  Most of these are just synonyms
566       for the Unicode ones, but some are genuine extensions, including
567       several that are in the compound form.  And quite a few of these are
568       actually recommended by Unicode (in
569       <http://www.unicode.org/reports/tr18>).
570
571       This section gives some details on all extensions that aren't just
572       synonyms for compound-form Unicode properties (for those properties,
573       you'll have to refer to the Unicode Standard
574       <http://www.unicode.org/reports/tr44>).
575
576       "\p{All}"
577           This matches any of the 1_114_112 Unicode code points.  It is a
578           synonym for "\p{Any}".
579
580       "\p{Alnum}"
581           This matches any "\p{Alphabetic}" or "\p{Decimal_Number}"
582           character.
583
584       "\p{Any}"
585           This matches any of the 1_114_112 Unicode code points.  It is a
586           synonym for "\p{All}".
587
588       "\p{ASCII}"
589           This matches any of the 128 characters in the US-ASCII character
590           set, which is a subset of Unicode.
591
592       "\p{Assigned}"
593           This matches any assigned code point; that is, any code point whose
594           general category is not Unassigned (or equivalently, not Cn).
595
596       "\p{Blank}"
597           This is the same as "\h" and "\p{HorizSpace}":  A character that
598           changes the spacing horizontally.
599
600       "\p{Decomposition_Type: Non_Canonical}"    (Short: "\p{Dt=NonCanon}")
601           Matches a character that has a non-canonical decomposition.
602
603           To understand the use of this rarely used property=value
604           combination, it is necessary to know some basics about
605           decomposition.  Consider a character, say H.  It could appear with
606           various marks around it, such as an acute accent, or a circumflex,
607           or various hooks, circles, arrows, etc., above, below, to one side
608           or the other, etc.  There are many possibilities among the world's
609           languages.  The number of combinations is astronomical, and if
610           there were a character for each combination, it would soon exhaust
611           Unicode's more than a million possible characters.  So Unicode took
612           a different approach: there is a character for the base H, and a
613           character for each of the possible marks, and these can be
614           variously combined to get a final logical character.  So a logical
615           character--what appears to be a single character--can be a sequence
616           of more than one individual character.  This is called an
617           "extended grapheme cluster";  Perl furnishes the "\X" regular
618           expression construct to match such sequences.
619
620           But Unicode's intent is to unify the existing character set
621           standards and practices, and several pre-existing standards have
622           single characters that mean the same thing as some of these
623           combinations.  An example is ISO-8859-1, which has quite a few of
624           these in the Latin-1 range, an example being "LATIN CAPITAL LETTER
625           E WITH ACUTE".  Because this character was in this pre-existing
626           standard, Unicode added it to its repertoire.  But this character
627           is considered by Unicode to be equivalent to the sequence
628           consisting of the character "LATIN CAPITAL LETTER E" followed by
629           the character "COMBINING ACUTE ACCENT".
630
631           "LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed"
632           character, and its equivalence with the sequence is called
633           canonical equivalence.  All pre-composed characters are said to
634           have a decomposition (into the equivalent sequence), and the
635           decomposition type is also called canonical.
636
637           However, many more characters have a different type of
638           decomposition, a "compatible" or "non-canonical" decomposition.
639           The sequences that form these decompositions are not considered
640           canonically equivalent to the pre-composed character.  An example,
641           again in the Latin-1 range, is the "SUPERSCRIPT ONE".  It is
642           somewhat like a regular digit 1, but not exactly; its decomposition
643           into the digit 1 is called a "compatible" decomposition,
644           specifically a "super" decomposition.  There are several such
645           compatibility decompositions (see
646           <http://www.unicode.org/reports/tr44>), including one called
647           "compat", which means some miscellaneous type of decomposition that
648           doesn't fit into the decomposition categories that Unicode has
649           chosen.
650
651           Note that most Unicode characters don't have a decomposition, so
652           their decomposition type is "None".
653
654           For your convenience, Perl has added the "Non_Canonical"
655           decomposition type to mean any of the several compatibility
656           decompositions.
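
           The decomposition types described above can be explored with the
           core Unicode::Normalize module.  The following is a minimal
           sketch, assuming that module's "NFD" and "NFKD" functions; the
           variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

my $e_acute    = "\x{C9}";        # LATIN CAPITAL LETTER E WITH ACUTE
my $decomposed = NFD($e_acute);   # apply the canonical decomposition

# Code points of the decomposed form: base letter, then combining mark.
my @code_points = map { sprintf 'U+%04X', ord($_) } split //, $decomposed;

# \X consumes the base letter plus its mark as one extended grapheme cluster.
my @clusters = $decomposed =~ /(\X)/g;

my $sup_one         = "\x{B9}";   # SUPERSCRIPT ONE
my $sup_is_noncanon = ($sup_one =~ /\p{Dt=NonCanon}/) ? 1 : 0;
my $e_is_noncanon   = ($e_acute =~ /\p{Dt=NonCanon}/) ? 1 : 0;
my $sup_compat      = NFKD($sup_one);   # compatibility decomposition

print "NFD of U+00C9: @code_points\n";
print "grapheme clusters: ", scalar(@clusters), "\n";
print "U+00B9 non-canonical? $sup_is_noncanon; U+00C9? $e_is_noncanon\n";
print "NFKD of U+00B9: $sup_compat\n";
```

           Here the pre-composed character has a canonical decomposition, so
           only SUPERSCRIPT ONE is expected to match "\p{Dt=NonCanon}".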
657
658       "\p{Graph}"
659           Matches any character that is graphic.  Theoretically, this means a
660           character that on a printer would cause ink to be used.
661
662       "\p{HorizSpace}"
663           This is the same as "\h" and "\p{Blank}":  a character that changes
664           the spacing horizontally.
665
666       "\p{In=*}"
667           This is a synonym for "\p{Present_In=*}"
668
669       "\p{PerlSpace}"
670           This is the same as "\s", restricted to ASCII, namely
671           "[ \f\n\r\t]".
672
673           Mnemonic: Perl's (original) space
674
675       "\p{PerlWord}"
676           This is the same as "\w", restricted to ASCII, namely
677           "[A-Za-z0-9_]"
678
679           Mnemonic: Perl's (original) word.
680
681       "\p{Posix...}"
682           There are several of these, which are equivalents using the "\p"
683           notation for Posix classes and are described in "POSIX Character
684           Classes" in perlrecharclass.
685
686       "\p{Present_In: *}"    (Short: "\p{In=*}")
687           This property is used when you need to know in what Unicode
688           version(s) a character is.
689
690           The "*" above stands for a Unicode version number of the form
691           major.minor, such as 1.1 or 4.0; or it can be "Unassigned".  This
692           property will match the code points whose final disposition has
693           been settled as of the Unicode release given by the version number;
694           "\p{Present_In: Unassigned}" will match those code points whose
695           meaning has yet to be assigned.
696
697           For example, "U+0041" "LATIN CAPITAL LETTER A" was present in the
698           very first Unicode release available, which is 1.1, so this
699           property is true for all valid "*" versions.  On the other hand,
700           "U+1EFF" was not assigned until version 5.1 when it became "LATIN
701           SMALL LETTER Y WITH LOOP", so the only "*" values that match it
702           are 5.1, 5.2, and later.
703
704           Unicode furnishes the "Age" property from which this is derived.
705           The problem with Age is that a strict interpretation of it (which
706           Perl takes) has it matching the precise release a code point's
707           meaning is introduced in.  Thus "U+0041" would match only 1.1; and
708           "U+1EFF" only 5.1.  This is not usually what you want.
709
710           Some non-Perl implementations of the Age property may change its
711           meaning to be the same as the Perl Present_In property; just be
712           aware of that.
713
714           Another confusion with both these properties is that the definition
715           is not that the code point has been assigned, but that the meaning
716           of the code point has been determined.  This is because 66 code
717           points will always be unassigned, and so the Age for them is the
718           Unicode version in which the decision to make them so was made.
719           For example, "U+FDD0" is to be permanently unassigned to a
720           character, and the decision to do that was made in version 3.1, so
721           "\p{Age=3.1}" matches this character, as also does "\p{Present_In:
722           3.1}" and up.
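
           The difference between "Age" and "Present_In" can be checked
           directly.  This is a small sketch using the code points named
           above; it assumes a Perl whose Unicode tables are at version 5.2
           or later, and the hash keys are purely illustrative.

```perl
use strict;
use warnings;

my $cap_a   = "A";           # U+0041, present since Unicode 1.1
my $y_loop  = chr(0x1EFF);   # LATIN SMALL LETTER Y WITH LOOP, added in 5.1
my $nonchar = chr(0xFDD0);   # permanently unassigned, decided in 3.1

my %is = (
    # Age names only the release that introduced the code point.
    a_age_1_1      => ($cap_a   =~ /\p{Age=1.1}/)         ? 1 : 0,
    a_age_5_1      => ($cap_a   =~ /\p{Age=5.1}/)         ? 1 : 0,
    # Present_In also matches every later release.
    a_present_5_1  => ($cap_a   =~ /\p{Present_In: 5.1}/) ? 1 : 0,
    y_age_5_1      => ($y_loop  =~ /\p{Age=5.1}/)         ? 1 : 0,
    y_present_5_0  => ($y_loop  =~ /\p{Present_In: 5.0}/) ? 1 : 0,
    y_present_5_2  => ($y_loop  =~ /\p{In=5.2}/)          ? 1 : 0,
    nonchar_age    => ($nonchar =~ /\p{Age=3.1}/)         ? 1 : 0,
);

printf "%-14s %d\n", $_, $is{$_} for sort keys %is;
```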
723
724       "\p{Print}"
725           This matches any character that is graphical or blank, except
726           controls.
727
728       "\p{SpacePerl}"
729           This is the same as "\s", including beyond ASCII.
730
731           Mnemonic: Space, as modified by Perl.  (It doesn't include the
732           vertical tab which both the Posix standard and Unicode consider
733           white space.)
734
735       "\p{Title}" and  "\p{Titlecase}"
736           Under case-sensitive matching, these both match the same code
737           points as "\p{General Category=Titlecase_Letter}" ("\p{gc=lt}").
738           The difference is that under "/i" caseless matching, these match
739           the same as "\p{Cased}", whereas "\p{gc=lt}" matches
740           "\p{Cased_Letter}").
741
742       "\p{VertSpace}"
743           This is the same as "\v":  A character that changes the spacing
744           vertically.
745
746       "\p{Word}"
747           This is the same as "\w", including over 100_000 characters beyond
748           ASCII.
749
750       "\p{XPosix...}"
751           There are several of these, which are the standard Posix classes
752           extended to the full Unicode range.  They are described in "POSIX
753           Character Classes" in perlrecharclass.
754
755   User-Defined Character Properties
756       You can define your own binary character properties by defining
757       subroutines whose names begin with "In" or "Is".  The subroutines can
758       be defined in any package.  The user-defined properties can be used in
759       the regular expression "\p" and "\P" constructs; if you are using a
760       user-defined property from a package other than the one you are in, you
761       must specify its package in the "\p" or "\P" construct.
762
763           # assuming property IsForeign defined in Lang::
764           package main;  # property package name required
765           if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
766
767           package Lang;  # property package name not required
768           if ($txt =~ /\p{IsForeign}+/) { ... }
769
770       Note that the effect is compile-time and immutable once defined.
771       However, the subroutines are passed a single parameter, which is 0 if
772       case-sensitive matching is in effect and non-zero if caseless matching
773       is in effect.  The subroutine may return different values depending on
774       the value of the flag, and one set of values will immutably be in
775       effect for all case-sensitive matches, and the other set for all case-
776       insensitive matches.
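
       As a sketch of how the caseless flag might be used, the hypothetical
       property below ("IsHexLetter" is a name invented for this example)
       returns only the upper-case range A-F for case-sensitive matching and
       both cases when the flag is non-zero.

```perl
use strict;
use warnings;

# Hypothetical user-defined property: the hexadecimal letters A-F.
# The single argument is non-zero when caseless matching is in effect.
sub IsHexLetter {
    my ($caseless) = @_;
    return $caseless
        ? "41\t46\n61\t66\n"   # A-F and a-f
        : "41\t46\n";          # A-F only
}

my $upper_sensitive = ("C" =~ /\p{IsHexLetter}/)  ? 1 : 0;
my $lower_sensitive = ("c" =~ /\p{IsHexLetter}/)  ? 1 : 0;
my $lower_caseless  = ("c" =~ /\p{IsHexLetter}/i) ? 1 : 0;

print "C sensitive: $upper_sensitive\n";
print "c sensitive: $lower_sensitive\n";
print "c caseless:  $lower_caseless\n";
```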
777
778       Note that if the regular expression is tainted, then Perl will die
779       rather than call the subroutine when the name of the subroutine is
780       determined by the tainted data.
781
782       The subroutines must return a specially-formatted string, with one or
783       more newline-separated lines.  Each line must be one of the following:
784
785       ·   A single hexadecimal number denoting a Unicode code point to
786           include.
787
788       ·   Two hexadecimal numbers separated by horizontal whitespace (space
789           or tab characters) denoting a range of Unicode code points to
790           include.
791
792       ·   Something to include, prefixed by "+": a built-in character
793           property (prefixed by "utf8::") or a fully qualified (including
794           package name) user-defined character property, to represent all the
795           characters in that property; two hexadecimal code points for a
796           range; or a single hexadecimal code point.
797
798       ·   Something to exclude, prefixed by "-": an existing character
799           property (prefixed by "utf8::") or a fully qualified (including
800           package name) user-defined character property, to represent all the
801           characters in that property; two hexadecimal code points for a
802           range; or a single hexadecimal code point.
803
804       ·   Something to negate, prefixed by "!": an existing character
805           property (prefixed by "utf8::") or a fully qualified (including
806           package name) user-defined character property, to represent all
807           the characters except those in that property; two hexadecimal
808           code points for a range; or a single hexadecimal code point.
809
810       ·   Something to intersect with, prefixed by "&": an existing character
811           property (prefixed by "utf8::") or a fully qualified (including
812           package name) user-defined character property, to keep only those
813           characters accumulated so far that are also in that property; two
814           hexadecimal code points for a range; or a single hexadecimal code
815           point.
815
816       For example, to define a property that covers both the Japanese
817       syllabaries (hiragana and katakana), you can define
818
819           sub InKana {
820               return <<END;
821           3040\t309F
822           30A0\t30FF
823           END
824           }
825
826       Imagine that the here-doc end marker is at the beginning of the line.
827       Now you can use "\p{InKana}" and "\P{InKana}".
828
829       You could also have used the existing block property names:
830
831           sub InKana {
832               return <<'END';
833           +utf8::InHiragana
834           +utf8::InKatakana
835           END
836           }
837
838       Suppose you wanted to match only the allocated characters, not the raw
839       block ranges; that is, you want to remove the unassigned code points:
840
841           sub InKana {
842               return <<'END';
843           +utf8::InHiragana
844           +utf8::InKatakana
845           -utf8::IsCn
846           END
847           }
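
       The effect of the "-utf8::IsCn" line can be seen with a runnable
       variant of the definition above.  This is only a sketch: the name
       "InAssignedKana" is invented here so as not to clash with the earlier
       definitions, and the sample code points were picked for this example
       (U+3040 is an unassigned code point at the start of the Hiragana
       block in current Unicode versions).

```perl
use strict;
use warnings;

# Hypothetical property: both kana blocks, minus unassigned code points.
sub InAssignedKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my $hiragana_a = chr(0x3042);   # HIRAGANA LETTER A
my $reserved   = chr(0x3040);   # in the Hiragana block, but unassigned

my $letter_matches   = ($hiragana_a =~ /\p{InAssignedKana}/) ? 1 : 0;
my $reserved_matches = ($reserved   =~ /\p{InAssignedKana}/) ? 1 : 0;
my $reserved_in_raw  = ($reserved   =~ /\p{InHiragana}/)     ? 1 : 0;

print "U+3042 in InAssignedKana: $letter_matches\n";
print "U+3040 in InAssignedKana: $reserved_matches\n";
print "U+3040 in raw block:      $reserved_in_raw\n";
```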
848
849       The negation is useful for defining (surprise!) negated classes.
850
851           sub InNotKana {
852               return <<'END';
853           !utf8::InHiragana
854           -utf8::InKatakana
855           +utf8::IsCn
856           END
857           }
858
859       This will match all non-Unicode code points, since every one of them is
860       not in Kana.  You can use intersection to exclude these, if desired, as
861       this modified example shows:
862
863           sub InNotKana {
864               return <<'END';
865           !utf8::InHiragana
866           -utf8::InKatakana
867           +utf8::IsCn
868           &utf8::Any
869           END
870           }
871
872       &utf8::Any must be the last line in the definition.
873
874       Intersection is used generally for getting the common characters
875       matched by two (or more) classes.  It's important to remember not to
876       use "&" for the first set; that would be intersecting with nothing,
877       resulting in an empty set.
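
       As a hedged example of intersection, the hypothetical property below
       ("InKanaLetter" is a name made up for this sketch) first builds the
       union of the two kana blocks and then keeps only the code points whose
       General Category is Letter.  Note that the "&" line comes after the
       sets it is intersected with.

```perl
use strict;
use warnings;

# Hypothetical property: letters within the Hiragana and Katakana blocks.
sub InKanaLetter {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
&utf8::IsL
END
}

my $katakana_a = chr(0x30A2);   # KATAKANA LETTER A (a letter)
my $middle_dot = chr(0x30FB);   # KATAKANA MIDDLE DOT (punctuation)

my $letter_kept = ($katakana_a =~ /\p{InKanaLetter}/) ? 1 : 0;
my $dot_kept    = ($middle_dot =~ /\p{InKanaLetter}/) ? 1 : 0;

print "U+30A2 kept: $letter_kept\n";
print "U+30FB kept: $dot_kept\n";
```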
878
879       (Note that official Unicode properties differ from these in that they
880       automatically exclude non-Unicode code points and a warning is raised
881       if a match is attempted on one of those.)
882
883   User-Defined Case Mappings (for serious hackers only)
884       This feature has been removed as of Perl 5.16.  The CPAN module
885       Unicode::Casing provides better functionality without the drawbacks
886       that this feature had.  If you are using a Perl earlier than 5.16, this
887       feature was most fully documented in the 5.14 version of this pod:
888       <http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
891
892   Character Encodings for Input and Output
893       See Encode.
894
895   Unicode Regular Expression Support Level
896       The following list of Unicode supported features for regular
897       expressions describes all features currently directly supported by core
898       Perl.  The references to "Level N" and the section numbers refer to the
899       Unicode Technical Standard #18, "Unicode Regular Expressions", version
900       13, from August 2008.
901
902       ·   Level 1 - Basic Unicode Support
903
904            RL1.1   Hex Notation                     - done          [1]
905            RL1.2   Properties                       - done          [2][3]
906            RL1.2a  Compatibility Properties         - done          [4]
907            RL1.3   Subtraction and Intersection     - MISSING       [5]
908            RL1.4   Simple Word Boundaries           - done          [6]
909            RL1.5   Simple Loose Matches             - done          [7]
910            RL1.6   Line Boundaries                  - MISSING       [8][9]
911            RL1.7   Supplementary Code Points        - done          [10]
912
913            [1]  \x{...}
914            [2]  \p{...} \P{...}
915            [3]  supports not only minimal list, but all Unicode character
916                 properties (see Unicode Character Properties above)
917            [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
918            [5]  can use regular expression look-ahead [a] or
919                 user-defined character properties [b] to emulate set
920                 operations
921            [6]  \b \B
922            [7]  note that Perl does Full case-folding in matching (but with
923                 bugs), not Simple: for example U+1F88 is equivalent to
924                 U+1F00 U+03B9, instead of just U+1F80.  This difference
925                 matters mainly for certain Greek capital letters with certain
926                 modifiers: the Full case-folding decomposes the letter,
927                 while the Simple case-folding would map it to a single
928                 character.
929            [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR
930                 (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
931                 (U+2029); should also affect <>, $., and script line
932                 numbers; should not split lines within CRLF [c] (i.e. there
933                 is no empty line between \r and \n)
934            [9]  Linebreaking conformant with UAX#14 "Unicode Line Breaking
935                 Algorithm" is available through the Unicode::LineBreak
936                 module.
937            [10] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
938                 U+10FFFF but also beyond U+10FFFF
939
940           [a] You can mimic class subtraction using lookahead.  For example,
941           what UTS#18 might write as
942
943               [{Greek}-[{UNASSIGNED}]]
944
945           in Perl can be written as either of:
946
947               (?!\p{Unassigned})\p{InGreekAndCoptic}
948               (?=\p{Assigned})\p{InGreekAndCoptic}
949
950           But in this particular example, you probably really want
951
952               \p{Greek}
953
954           which will match assigned characters known to be part of the Greek
955           script.
956
957           Also see the Unicode::Regex::Set module; it does implement the full
958           UTS#18 grouping, intersection, union, and removal (subtraction)
959           syntax.
960
961           [b] '+' for union, '-' for removal (set-difference), '&' for
962           intersection (see "User-Defined Character Properties")
963
964           [c] Try the ":crlf" layer (see PerlIO).
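
           The look-ahead emulation from note [a] can be tried out as
           follows.  This is a sketch only: the block is written in its
           compound form, and the sample code points are assumptions chosen
           for the example (U+0378 is unassigned in current Unicode
           versions, and U+03E2 is a Coptic letter inside the Greek and
           Coptic block).

```perl
use strict;
use warnings;

# Emulated set subtraction: the block minus unassigned code points.
my $assigned_in_block = qr/(?!\p{Unassigned})\p{Block: Greek_And_Coptic}/;

# The script property, which is often what is really wanted.
my $greek_script = qr/\p{Script: Greek}/;

my $alpha    = chr(0x03B1);   # GREEK SMALL LETTER ALPHA
my $reserved = chr(0x0378);   # unassigned, inside the block
my $coptic   = chr(0x03E2);   # COPTIC CAPITAL LETTER SHEI

my %hit = (
    alpha_block     => ($alpha    =~ $assigned_in_block) ? 1 : 0,
    reserved_block  => ($reserved =~ $assigned_in_block) ? 1 : 0,
    coptic_block    => ($coptic   =~ $assigned_in_block) ? 1 : 0,
    alpha_script    => ($alpha    =~ $greek_script)      ? 1 : 0,
    coptic_script   => ($coptic   =~ $greek_script)      ? 1 : 0,
);

printf "%-15s %d\n", $_, $hit{$_} for sort keys %hit;
```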
965
966       ·   Level 2 - Extended Unicode Support
967
968            RL2.1   Canonical Equivalents           - MISSING       [10][11]
969            RL2.2   Default Grapheme Clusters       - MISSING       [12]
970            RL2.3   Default Word Boundaries         - MISSING       [14]
971            RL2.4   Default Loose Matches           - MISSING       [15]
972            RL2.5   Name Properties                 - DONE
973            RL2.6   Wildcard Properties             - MISSING
974
975            [10] see UAX#15 "Unicode Normalization Forms"
976            [11] have Unicode::Normalize but not integrated to regexes
977            [12] have \X but we don't have a "Grapheme Cluster Mode"
978            [14] see UAX#29, Word Boundaries
979            [15] This is covered in Chapter 3.13 (in Unicode 6.0)
980
981       ·   Level 3 - Tailored Support
982
983            RL3.1   Tailored Punctuation            - MISSING
984            RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
985            RL3.3   Tailored Word Boundaries        - MISSING
986            RL3.4   Tailored Loose Matches          - MISSING
987            RL3.5   Tailored Ranges                 - MISSING
988            RL3.6   Context Matching                - MISSING       [19]
989            RL3.7   Incremental Matches             - MISSING
990                 ( RL3.8   Unicode Set Sharing )
991            RL3.9   Possible Match Sets             - MISSING
992            RL3.10  Folded Matching                 - MISSING       [20]
993            RL3.11  Submatchers                     - MISSING
994
995            [17] see UTS#10 "Unicode Collation Algorithm"
996            [18] have Unicode::Collate but not integrated to regexes
997            [19] have (?<=x) and (?=x), but look-aheads or look-behinds
998                 should see outside of the target substring
999            [20] need insensitive matching for linguistic features other
1000                 than case; for example, hiragana to katakana, wide and
1001                 narrow, simplified Han to traditional Han (see UTR#30
1002                 "Character Foldings")
1003
1004   Unicode Encodings
1005       Unicode characters are assigned to code points, which are abstract
1006       numbers.  To use these numbers, various encodings are needed.
1007
1008       ·   UTF-8
1009
1010           UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
1011           encoding. For ASCII (and we really do mean 7-bit ASCII, not another
1012           8-bit encoding), UTF-8 is transparent.
1013
1014           The following table is from Unicode 3.2.
1015
1016            Code Points            1st Byte  2nd Byte  3rd Byte 4th Byte
1017
1018              U+0000..U+007F       00..7F
1019              U+0080..U+07FF     * C2..DF    80..BF
1020              U+0800..U+0FFF       E0      * A0..BF    80..BF
1021              U+1000..U+CFFF       E1..EC    80..BF    80..BF
1022              U+D000..U+D7FF       ED        80..9F    80..BF
1023              U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
1024              U+E000..U+FFFF       EE..EF    80..BF    80..BF
1025             U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
1026             U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
1027            U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
1028
1029           Note the gaps marked by "*" before several of the byte entries
1030           above.  These are caused by legal UTF-8 avoiding non-shortest
1031           encodings: it is technically possible to UTF-8-encode a single code
1032           point in different ways, but that is explicitly forbidden, and the
1033           shortest possible encoding should always be used (and that is what
1034           Perl does).
1035
1036           Another way to look at it is via bits:
1037
1038                           Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte
1039
1040                              0aaaaaaa  0aaaaaaa
1041                      00000bbbbbaaaaaa  110bbbbb  10aaaaaa
1042                      ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
1043            00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
1044
1045           As you can see, the continuation bytes all begin with "10", and the
1046           leading bits of the start byte tell how many bytes there are in the
1047           encoded character.
1048
1049           The original UTF-8 specification allowed up to 6 bytes, to allow
1050           encoding of numbers up to 0x7FFF_FFFF.  Perl continues to allow
1051           those, and has extended that up to 13 bytes to encode code points
1052           up to what can fit in a 64-bit word.  However, Perl will warn if
1053           you output any of these as being non-portable; and under strict
1054           UTF-8 input protocols, they are forbidden.
1055
1056           The Unicode non-character code points are also disallowed in UTF-8
1057           in "open interchange".  See "Non-character code points".
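
           To see the byte patterns from the tables above, one can encode a
           few code points and print the resulting bytes.  The sketch below
           assumes the core Encode module; the helper name "utf8_bytes_hex"
           and the sample code points are chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 encoding of one code point as space-separated hex bytes.
sub utf8_bytes_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr($code_point));
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my %encoded = map { $_ => utf8_bytes_hex($_) }
    (0x41, 0xE9, 0x20AC, 0x10348);   # 1-, 2-, 3-, and 4-byte cases

printf "U+%04X => %s\n", $_, $encoded{$_}
    for sort { $a <=> $b } keys %encoded;
```

           Each continuation byte falls in the range 80..BF, and the first
           byte alone tells how long the sequence is.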
1058
1059       ·   UTF-EBCDIC
1060
1061           Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
1062
1063       ·   UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
1064
1065           The following items are mostly for reference and general Unicode
1066           knowledge; Perl doesn't use these constructs internally.
1067
1068           Like UTF-8, UTF-16 is a variable-width encoding, but where UTF-8
1069           uses 8-bit code units, UTF-16 uses 16-bit code units.  All code
1070           points occupy either 2 or 4 bytes in UTF-16: code points
1071           "U+0000..U+FFFF" are stored in a single 16-bit unit, and code
1072           points "U+10000..U+10FFFF" in two 16-bit units.  The latter case
1073           uses surrogates, the first 16-bit unit being the high surrogate,
1074           and the second being the low surrogate.
1075
1076           Surrogates are code points set aside to encode the
1077           "U+10000..U+10FFFF" range of Unicode code points in pairs of 16-bit
1078           units.  The high surrogates are the range "U+D800..U+DBFF" and the
1079           low surrogates are the range "U+DC00..U+DFFF".  The surrogate
1080           encoding is
1081
1082               $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
1083               $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
1084
1085           and the decoding is
1086
1087               $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
1088
1089           Because of the 16-bitness, UTF-16 is byte-order dependent.  UTF-16
1090           itself can be used for in-memory computations, but if storage or
1091           transfer is required either UTF-16BE (big-endian) or UTF-16LE
1092           (little-endian) encodings must be chosen.
1093
1094           This introduces another problem: what if you just know that your
1095           data is UTF-16, but you don't know which endianness?  Byte Order
1096           Marks, or BOMs, are a solution to this.  A special character has
1097           been reserved in Unicode to function as a byte order marker: the
1098           character with the code point "U+FEFF" is the BOM.
1099
1100           The trick is that if you read a BOM, you will know the byte order,
1101           since if it was written on a big-endian platform, you will read the
1102           bytes "0xFE 0xFF", but if it was written on a little-endian
1103           platform, you will read the bytes "0xFF 0xFE".  (And if the
1104           originating platform was writing in UTF-8, you will read the bytes
1105           "0xEF 0xBB 0xBF".)
1106
1107           The way this trick works is that the character with the code point
1108           "U+FFFE" is not supposed to be in input streams, so the sequence of
1109           bytes "0xFF 0xFE" is unambiguously "BOM, represented in
1110           little-endian format" and cannot be "U+FFFE, represented in
1111           big-endian format".
1112
1113           Surrogates have no meaning in Unicode outside their use in pairs to
1114           represent other code points.  However, Perl allows them to be
1115           represented individually internally, for example by saying
1116           "chr(0xD801)", so that all code points, not just those valid for
1117           open interchange, are representable.  Unicode does define semantics
1118           for them, such as their General Category being "Cs".  But because
1119           their use is somewhat dangerous, Perl will warn (using the warning
1120           category "surrogate", which is a sub-category of "utf8") if an
1121           attempt is made to do things like take the lower case of one, or
1122           match case-insensitively, or to output them.  (But don't try this
1123           on Perls before 5.14.)
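
           The surrogate arithmetic given earlier in this item can be worked
           through for a concrete code point.  The following sketch uses
           integer shifts and masks, which are equivalent to dividing by
           0x400 and taking the remainder; the subroutine names are
           illustrative, and U+1F600 is simply a sample supplementary code
           point.  The result is cross-checked against the core Encode
           module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Split a supplementary code point into its high and low surrogates.
sub to_surrogates {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    return (0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

# Recombine a surrogate pair into the original code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my $code_point  = 0x1F600;
my ($hi, $lo)   = to_surrogates($code_point);
my $round_trip  = from_surrogates($hi, $lo);
my $via_encode  = encode('UTF-16BE', chr($code_point));
my $by_hand     = pack('n2', $hi, $lo);   # two big-endian 16-bit units

printf "U+%X => %04X %04X => U+%X\n", $code_point, $hi, $lo, $round_trip;
```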
1124
1125       ·   UTF-32, UTF-32BE, UTF-32LE
1126
1127           The UTF-32 family is pretty much like the UTF-16 family, except
1128           that the units are 32-bit, and therefore the surrogate scheme is
1129           not needed.  UTF-32 is a fixed-width encoding.  The BOM signatures
1130           are "0x00 0x00 0xFE 0xFF" for BE and "0xFF 0xFE 0x00 0x00" for LE.
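
           The BOM signatures listed for UTF-8, UTF-16, and UTF-32 can be
           used to guess the encoding of a byte string.  The sketch below is
           one possible approach, not a definitive detector; "sniff_bom" is
           a hypothetical helper.  The longer signatures are tested first
           because the UTF-32LE signature begins with the UTF-16LE one, so
           UTF-16LE data that starts with a BOM followed by U+0000 would be
           misreported.

```perl
use strict;
use warnings;

# Return the encoding name implied by a leading BOM, or undef if none.
sub sniff_bom {
    my ($bytes) = @_;
    my @signatures = (
        [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
        [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
        [ "\xEF\xBB\xBF",     'UTF-8'    ],
        [ "\xFE\xFF",         'UTF-16BE' ],
        [ "\xFF\xFE",         'UTF-16LE' ],
    );
    for my $signature (@signatures) {
        my ($prefix, $name) = @$signature;
        return $name if substr($bytes, 0, length $prefix) eq $prefix;
    }
    return undef;   # no BOM recognized
}

my $utf16le = sniff_bom("\xFF\xFE\x41\x00");           # BOM + "A"
my $utf16be = sniff_bom("\xFE\xFF\x00\x41");
my $utf32le = sniff_bom("\xFF\xFE\x00\x00\x41\x00\x00\x00");
my $utf8    = sniff_bom("\xEF\xBB\xBFabc");
my $none    = sniff_bom("abc");

print join(', ', map { defined $_ ? $_ : 'none' }
    $utf16le, $utf16be, $utf32le, $utf8, $none), "\n";
```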
1131
1132       ·   UCS-2, UCS-4
1133
1134           Legacy, fixed-width encodings defined by the ISO 10646 standard.
1135           UCS-2 is a 16-bit encoding.  Unlike UTF-16, UCS-2 is not extensible
1136           beyond "U+FFFF", because it does not use surrogates.  UCS-4 is a
1137           32-bit encoding, functionally identical to UTF-32 (the difference
1138           being that UCS-4 forbids neither surrogates nor code points larger
1139           than 0x10_FFFF).
1140
1141       ·   UTF-7
1142
1143           A seven-bit safe (non-eight-bit) encoding, which is useful if the
1144           transport or storage is not eight-bit safe.  Defined by RFC 2152.
1145
1146   Non-character code points
1147       66 code points are set aside in Unicode as "non-character code points".
1148       These all have the Unassigned (Cn) General Category, and they never
1149       will be assigned.  These are never supposed to be in legal Unicode
1150       input streams, so that code can use them as sentinels that can be mixed
1151       in with character data, and they always will be distinguishable from
1152       that data.  To keep them out of Perl input streams, strict UTF-8 should
1153       be specified, such as by using the layer ":encoding('UTF-8')".  The
1154       non-character code points are the 32 between U+FDD0 and U+FDEF, and the
1155       34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE,
1156       U+10FFFF.  Some people are under the mistaken impression that these are
1157       "illegal", but that is not true.  An application or cooperating set of
1158       applications can legally use them at will internally; but these code
1159       points are "illegal for open interchange".  Therefore, Perl will not
1160       accept these from input streams unless lax rules are being used, and
1161       will warn (using the warning category "nonchar", which is a sub-
1162       category of "utf8") if an attempt is made to output them.
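
       The count of 66 can be reproduced by enumerating the two groups
       described above.  This is a small sketch, assuming Perl 5.14 or later;
       it checks each code point against the Unicode
       "Noncharacter_Code_Point" property.

```perl
use strict;
use warnings;
no warnings 'utf8';   # be quiet about handling non-characters internally

# The contiguous group of 32: U+FDD0..U+FDEF.
my @nonchars = (0xFDD0 .. 0xFDEF);

# The last two code points of each of the 17 planes: U+xxFFFE, U+xxFFFF.
for my $plane (0 .. 0x10) {
    push @nonchars, ($plane << 16) | 0xFFFE, ($plane << 16) | 0xFFFF;
}

my $count = scalar @nonchars;

# Count how many of them fail to match the property (expected: none).
my $unflagged = grep { chr($_) !~ /\p{Noncharacter_Code_Point}/ } @nonchars;

# They all have the Unassigned (Cn) General Category.
my $not_cn = grep { chr($_) !~ /\p{Cn}/ } @nonchars;

print "non-character code points: $count\n";
```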
1163
1164   Beyond Unicode code points
1165       The maximum Unicode code point is U+10FFFF.  But Perl accepts code
1166       points up to the maximum permissible unsigned number available on the
1167       platform.  However, Perl will not accept these from input streams
1168       unless lax rules are being used, and will warn (using the warning
1169       category "non_unicode", which is a sub-category of "utf8") if an
1170       attempt is made to operate on or output them.  For example,
1171       "uc(chr(0x11_0000))" will generate this warning, returning the input
1172       parameter as its result, as the upper case of every non-Unicode code
1173       point is the code point itself.
1174
1175   Security Implications of Unicode
1176       Read Unicode Security Considerations
1177       <http://www.unicode.org/reports/tr36>.  Also, note the following:
1178
1179       ·   Malformed UTF-8
1180
1181           Unfortunately, the original specification of UTF-8 leaves some room
1182           for interpretation of how many bytes of encoded output one should
1183           generate from one input Unicode character.  Strictly speaking, the
1184           shortest possible sequence of UTF-8 bytes should be generated,
1185           because otherwise there is potential for an input buffer overflow
1186           at the receiving end of a UTF-8 connection.  Perl always generates
1187           the shortest length UTF-8, and with warnings on, Perl will warn
1188           about non-shortest length UTF-8 along with other malformations,
1189           such as the surrogates, which are not Unicode code points valid for
1190           interchange.
1191
1192       ·   Regular expression pattern matching may surprise you if you're not
1193           accustomed to Unicode.  Starting in Perl 5.14, several pattern
1194           modifiers are available to control this, called the character set
1195           modifiers.  Details are given in "Character set modifiers" in
1196           perlre.
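
           To make the first point concrete, here is a sketch that feeds a
           non-shortest-form sequence and a well-formed one to the strict
           "UTF-8" decoder of the standard Encode module.  The sample byte
           strings and hash names are chosen for illustration:

```perl
use strict;
use warnings;
use Encode ();

# "\xC0\x80" is an overlong (non-shortest) encoding of U+0000;
# "\xC3\xA9" is the shortest-form encoding of U+00E9.
my %samples = (overlong => "\xC0\x80", shortest => "\xC3\xA9");

my %accepted;
for my $name (sort keys %samples) {
    my $octets = $samples{$name};   # copy: decode may modify its argument
    my $chars  = eval {
        Encode::decode('UTF-8', $octets, Encode::FB_CROAK());
    };
    $accepted{$name} = defined $chars ? 1 : 0;
    print "$name: ", ($accepted{$name} ? "accepted" : "rejected"), "\n";
}
```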
1197
1198       As discussed elsewhere, Perl has one foot (two hooves?) planted in each
1199       of two worlds: the old world of bytes and the new world of characters,
1200       upgrading from bytes to characters when necessary.  If your legacy code
1201       does not explicitly use Unicode, no automatic switch-over to characters
1202       should happen.  Characters shouldn't get downgraded to bytes, either.
1203       It is possible to accidentally mix bytes and characters, however (see
1204       perluniintro), in which case "\w" in regular expressions might start
1205       behaving differently (unless the "/a" modifier is in effect).  Review
1206       your code.  Use warnings and the "strict" pragma.
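
           The effect of the "/a" modifier on "\w" can be seen in a small
           sketch like the following, which assumes Perl 5.14 or later and
           enables "unicode_strings" so that Unicode rules apply by default:

```perl
use strict;
use warnings;
use feature 'unicode_strings';

my $e_acute = "\xE9";    # U+00E9 LATIN SMALL LETTER E WITH ACUTE

my $unicode_rules = $e_acute =~ /\w/  ? 1 : 0;   # Unicode rules apply
my $ascii_rules   = $e_acute =~ /\w/a ? 1 : 0;   # /a restricts \w to ASCII

print "without /a: $unicode_rules\n";
print "with /a:    $ascii_rules\n";
```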
1207
1208   Unicode in Perl on EBCDIC
1209       The way Unicode is handled on EBCDIC platforms is still experimental.
1210       On such platforms, references to UTF-8 encoding in this document and
1211       elsewhere should be read as meaning the UTF-EBCDIC specified in Unicode
1212       Technical Report 16, unless ASCII vs. EBCDIC issues are specifically
1213       discussed. There is no "utfebcdic" pragma or ":utfebcdic" layer;
1214       rather, "utf8" and ":utf8" are reused to mean the platform's "natural"
1215       8-bit encoding of Unicode. See perlebcdic for more discussion of the
1216       issues.
1217
1218   Locales
1219       See "Unicode and UTF-8" in perllocale.
1220
1221   When Unicode Does Not Happen
1222       While Perl does have extensive ways to input and output in Unicode, and
1223       a few other "entry points" like the @ARGV array (which can sometimes be
1224       interpreted as UTF-8), there are still many places where Unicode (in
1225       some encoding or another) could be given as arguments or received as
1226       results, or both, but it is not.
1227
1228       The following are such interfaces.  Also, see "The "Unicode Bug"".  For
1229       all of these interfaces Perl currently (as of 5.8.3) simply assumes
1230       byte strings both as arguments and results, or UTF-8 strings if the
1231       (problematic) "encoding" pragma has been used.
1232
1233       One reason that Perl does not attempt to resolve the role of Unicode in
1234       these situations is that the answers are highly dependent on the
1235       operating system and the file system(s).  For example, whether
1236       filenames can be in Unicode and in exactly what kind of encoding, is
1237       not exactly a portable concept.  Similarly for "qx" and "system": how
1238       well will the "command-line interface" (and which of them?) handle
1239       Unicode?
1240
1241       ·   chdir, chmod, chown, chroot, exec, link, lstat, mkdir, rename,
1242           rmdir, stat, symlink, truncate, unlink, utime, -X
1243
1244       ·   %ENV
1245
1246       ·   glob (aka the <*>)
1247
1248       ·   open, opendir, sysopen
1249
1250       ·   qx (aka the backtick operator), system
1251
1252       ·   readdir, readlink
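
           Because these interfaces deal in bytes, one common approach is to
           encode and decode explicitly at the boundary.  The sketch below
           assumes a Unix-like file system that stores file names as UTF-8;
           that assumption does not hold everywhere, which is exactly the
           portability problem described above:

```perl
use strict;
use warnings;
use Encode ();
use File::Temp ();

my $dir  = File::Temp::tempdir(CLEANUP => 1);
my $name = "\x{2660}.txt";                  # a character string

# Encode the name explicitly before handing it to open().
my $path = "$dir/" . Encode::encode('UTF-8', $name);
open my $fh, '>', $path or die "cannot create file: $!";
close $fh;

# readdir() returns octets; decode them to get characters back.
opendir my $dh, $dir or die "cannot open directory: $!";
my @found = map  { Encode::decode('UTF-8', $_) }
            grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

# Print code points rather than raw characters to avoid output issues.
print join(' ', map { sprintf 'U+%04X', ord } split //, $found[0]), "\n";
```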
1253
1254   The "Unicode Bug"
1255       The term "Unicode bug" has been applied to an inconsistency on ASCII
1256       platforms with the Unicode code points in the Latin-1 Supplement block,
1257       that is, between 128 and 255.  Without a locale specified, unlike all
1258       other characters or code points, these characters have very different
1259       semantics in byte semantics versus character semantics, unless "use
1260       feature 'unicode_strings'" is specified, directly or indirectly.  (It
1261       is indirectly specified by a "use v5.12" or higher.)
1262
1263       In character semantics these upper-Latin1 characters are interpreted as
1264       Unicode code points, which means they have the same semantics as
1265       Latin-1 (ISO-8859-1).
1266
1267       In byte semantics (without "unicode_strings"), they are considered to
1268       be unassigned characters, meaning that the only semantics they have is
1269       their ordinal numbers, and that they are not members of various
1270       character classes.  None are considered to match "\w" for example, but
1271       all match "\W".
1272
1273       Perl 5.12.0 added "unicode_strings" to force character semantics on
1274       these code points in some circumstances, which fixed portions of the
1275       bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the
1276       remainder (so far as we know, anyway).  The lesson here is to enable
1277       "unicode_strings" to avoid the headaches described below.
1278
1279       The old, problematic behavior affects these areas:
1280
1281       ·   Changing the case of a scalar, that is, using "uc()", "ucfirst()",
1282           "lc()", and "lcfirst()", or "\L", "\U", "\u" and "\l" in double-
1283           quotish contexts, such as regular expression substitutions.  Under
1284           "unicode_strings" starting in Perl 5.12.0, character semantics are
1285           generally used.  See "lc" in perlfunc for details on how this works
1286           in combination with various other pragmas.
1287
1288       ·   Using caseless ("/i") regular expression matching.  Starting in
1289           Perl 5.14.0, regular expressions compiled within the scope of
1290           "unicode_strings" use character semantics even when executed or
1291           compiled into larger regular expressions outside the scope.
1292
1293       ·   Matching any of several properties in regular expressions, namely
1294           "\b", "\B", "\s", "\S", "\w", "\W", and all the POSIX character
1295           classes except "[[:ascii:]]".  Starting in Perl 5.14.0, regular
1296           expressions compiled within the scope of "unicode_strings" use
1297           character semantics even when executed or compiled into larger
1298           regular expressions outside the scope.
1299
1300       ·   In "quotemeta" or its inline equivalent "\Q", no code points above
1301           127 are quoted in UTF-8 encoded strings, but in byte encoded
1302           strings, code points between 128-255 are always quoted.  Starting
1303           in Perl 5.16.0, consistent quoting rules are used within the scope
1304           of "unicode_strings", as described in "quotemeta" in perlfunc.
1305
1306       This behavior can lead to unexpected results in which a string's
1307       semantics suddenly change if a code point above 255 is appended to or
1308       removed from it, which changes the string's semantics from byte to
1309       character or vice versa.  As an example, consider the following program
1310       and its output:
1311
1312        $ perl -le'
1313            no feature 'unicode_strings';
1314            $s1 = "\xC2";
1315            $s2 = "\x{2660}";
1316            for ($s1, $s2, $s1.$s2) {
1317                print /\w/ || 0;
1318            }
1319        '
1320        0
1321        0
1322        1
1323
1324       If there's no "\w" in $s1 or in $s2, why does their concatenation
1325       have one?
1326
1327       This anomaly stems from Perl's attempt to not disturb older programs
1328       that didn't use Unicode, and hence had no semantics for characters
1329       outside of the ASCII range (except in a locale), along with Perl's
1330       desire to add Unicode support seamlessly.  The result wasn't seamless:
1331       these characters were orphaned.
1332
1333       For Perls earlier than those described above, or when a string is
1334       passed to a function outside the subpragma's scope, a workaround is to
1335       always call "utf8::upgrade($string)", or to use the standard module
1336       Encode.  Also, a scalar that has any characters whose ordinal is 0x100
1337       or above, or which were specified using either of the "\N{...}" notations,
1338       will automatically have character semantics.
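
           The utf8::upgrade() workaround can be sketched as follows, with
           "unicode_strings" explicitly disabled to reproduce the old
           behavior.  The variable names are illustrative:

```perl
use strict;
use warnings;
no feature 'unicode_strings';

my $s = "\xC2";                     # stored internally as a single byte
my $before = $s =~ /\w/ ? 1 : 0;    # byte semantics apply here

utf8::upgrade($s);                  # same character, UTF-8 internally
my $after = $s =~ /\w/ ? 1 : 0;     # character semantics apply here

print "before upgrade: $before\n";
print "after upgrade:  $after\n";
```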
1339
1340   Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1341       Sometimes (see "When Unicode Does Not Happen" or "The "Unicode Bug"")
1342       there are situations where you simply need to force a byte string into
1343       UTF-8, or vice versa.  The low-level calls utf8::upgrade($bytestring)
1344       and utf8::downgrade($utf8string[, FAIL_OK]) are the answers.
1345
1346       Note that utf8::downgrade() can fail if the string contains characters
1347       that don't fit into a byte.
1348
1349       Calling either function on a string that already is in the desired
1350       state is a no-op.
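
           A short sketch of both calls, including the failure case.  It
           uses utf8::is_utf8() only to make the internal state visible:

```perl
use strict;
use warnings;

my $latin1 = "\xE9";
utf8::upgrade($latin1);
my $flag_after_upgrade = utf8::is_utf8($latin1) ? 1 : 0;
my $same_string        = $latin1 eq "\xE9" ? 1 : 0;   # value is unchanged

utf8::downgrade($latin1);
my $flag_after_downgrade = utf8::is_utf8($latin1) ? 1 : 0;

# U+263A does not fit into a byte, so downgrading cannot succeed.
my $wide = "\x{263A}";
my $ok   = utf8::downgrade($wide, 1) ? 1 : 0;   # FAIL_OK: returns false
my $died = eval { utf8::downgrade(my $copy = $wide); 1 } ? 0 : 1;

print "flag after upgrade:   $flag_after_upgrade\n";
print "flag after downgrade: $flag_after_downgrade\n";
print "downgrade with FAIL_OK succeeded: $ok\n";
print "downgrade without FAIL_OK died:   $died\n";
```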
1351
1352   Using Unicode in XS
1353       If you want to handle Perl Unicode in XS extensions, you may find the
1354       following C APIs useful.  See also "Unicode Support" in perlguts for an
1355       explanation about Unicode at the XS level, and perlapi for the API
1356       details.
1357
1358       ·   "DO_UTF8(sv)" returns true if the "UTF8" flag is on and the bytes
1359           pragma is not in effect.  "SvUTF8(sv)" returns true if the "UTF8"
1360           flag is on; the bytes pragma is ignored.  The "UTF8" flag being on
1361           does not mean that there are any characters of code points greater
1362           than 255 (or 127) in the scalar or that there are even any
1363           characters in the scalar.  What the "UTF8" flag means is that the
1364           sequence of octets in the representation of the scalar is the
1365           sequence of UTF-8 encoded code points of the characters of a
1366           string.  The "UTF8" flag being off means that each octet in this
1367           representation encodes a single character with code point 0..255
1368           within the string.  Perl's Unicode model is not to use UTF-8 until
1369           it is absolutely necessary.
1370
1371       ·   "uvchr_to_utf8(buf, chr)" writes a Unicode character code point
1372           into a buffer encoding the code point as UTF-8, and returns a
1373           pointer pointing after the UTF-8 bytes.  It works appropriately on
1374           EBCDIC machines.
1375
1376       ·   "utf8_to_uvchr_buf(buf, bufend, lenp)" reads UTF-8 encoded bytes
1377           from a buffer and returns the Unicode character code point and,
1378           optionally, the length of the UTF-8 byte sequence.  It works
1379           appropriately on EBCDIC machines.
1380
1381       ·   "utf8_length(start, end)" returns the length of the UTF-8 encoded
1382           buffer in characters.  "sv_len_utf8(sv)" returns the length of the
1383           UTF-8 encoded scalar.
1384
1385       ·   "sv_utf8_upgrade(sv)" converts the string of the scalar to its
1386           UTF-8 encoded form.  "sv_utf8_downgrade(sv)" does the opposite, if
1387           possible.  "sv_utf8_encode(sv)" is like sv_utf8_upgrade except that
1388           it does not set the "UTF8" flag.  "sv_utf8_decode()" does the
1389           opposite of "sv_utf8_encode()".  Note that none of these are to be
1390           used as general-purpose encoding or decoding interfaces: "use
1391           Encode" for that.  "sv_utf8_upgrade()" is affected by the encoding
1392           pragma but "sv_utf8_downgrade()" is not (since the encoding pragma
1393           is designed to be a one-way street).
1394
1395       ·   "is_utf8_string(buf, len)" returns true if "len" bytes of the
1396           buffer are valid UTF-8.
1397
1398       ·   "is_utf8_char(s)" returns true if the pointer points to a valid UTF-8
1399           character.  However, this function should not be used because of
1400           security concerns.  Instead, use "is_utf8_string()".
1401
1402       ·   "UTF8SKIP(buf)" will return the number of bytes in the UTF-8
1403           encoded character in the buffer.  "UNISKIP(chr)" will return the
1404           number of bytes required to UTF-8-encode the Unicode character code
1405           point.  "UTF8SKIP()" is useful for example for iterating over the
1406           characters of a UTF-8 encoded buffer; "UNISKIP()" is useful, for
1407           example, in computing the size required for a UTF-8 encoded buffer.
1408
1409       ·   "utf8_distance(a, b)" will tell the distance in characters between
1410           the two pointers pointing to the same UTF-8 encoded buffer.
1411
1412       ·   "utf8_hop(s, off)" will return a pointer to a UTF-8 encoded buffer
1413           that is "off" (positive or negative) Unicode characters displaced
1414           from the UTF-8 buffer "s".  Be careful not to overstep the buffer:
1415           "utf8_hop()" will merrily run off the end or the beginning of the
1416           buffer if told to do so.
1417
1418       ·   "pv_uni_display(dsv, spv, len, pvlim, flags)" and
1419           "sv_uni_display(dsv, ssv, pvlim, flags)" are useful for debugging
1420           the output of Unicode strings and scalars.  By default they are
1421           useful only for debugging--they display all characters as
1422           hexadecimal code points--but with the flags "UNI_DISPLAY_ISPRINT",
1423           "UNI_DISPLAY_BACKSLASH", and "UNI_DISPLAY_QQ" you can make the
1424           output more readable.
1425
1426       ·   "foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)" can be used to
1427           compare two strings case-insensitively in Unicode.  For case-
1428           sensitive comparisons you can just use "memEQ()" and "memNE()" as
1429           usual, except if one string is in utf8 and the other isn't.
1430
1431       For more information, see perlapi, and utf8.c and utf8.h in the Perl
1432       source code distribution.
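
           The XS calls above cannot be shown without an extension, but the
           meaning of the "UTF8" flag can be observed from Perl itself.  The
           sketch below uses the Perl-level functions in the "utf8"
           namespace, which perform operations analogous to the upgrade,
           encode and decode calls listed above; treat the correspondence as
           an illustration rather than a statement about the implementation:

```perl
use strict;
use warnings;

my $s = "\xE9";    # one character, UTF8 flag off
my @trace;
my $snapshot = sub {
    my ($label) = @_;
    push @trace, sprintf '%-8s flag=%d length=%d',
        $label, (utf8::is_utf8($s) ? 1 : 0), length $s;
};

$snapshot->('initial');
utf8::upgrade($s); $snapshot->('upgrade');   # flag on, still 1 character
utf8::encode($s);  $snapshot->('encode');    # flag off, now 2 octets
utf8::decode($s);  $snapshot->('decode');    # flag on, 1 character again

print "$_\n" for @trace;
```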
1433
1434   Hacking Perl to work on earlier Unicode versions (for very serious hackers
1435       only)
1436       Perl by default comes with the latest supported Unicode version built
1437       in, but you can change to use any earlier one.
1438
1439       Download the files in the desired version of Unicode from the Unicode
1440       web site <http://www.unicode.org>.  These should replace the existing
1441       files in lib/unicore in the Perl source tree.  Follow the instructions
1442       in README.perl in that directory to change some of their names, and
1443       then build perl (see INSTALL).
1444

BUGS

1446   Interaction with Locales
1447       See "Unicode and UTF-8" in perllocale.
1448
1449   Problems with characters in the Latin-1 Supplement range
1450       See "The "Unicode Bug"".
1451
1452   Interaction with Extensions
1453       When Perl exchanges data with an extension, the extension should be
1454       able to understand the UTF8 flag and act accordingly. If the extension
1455       doesn't recognize that flag, it's likely that the extension will return
1456       incorrectly-flagged data.
1457
1458       So if you're working with Unicode data, consult the documentation of
1459       every module you're using to see whether it has issues with Unicode data
1460       exchange. If the documentation does not talk about Unicode at all,
1461       suspect the worst and probably look at the source to learn how the
1462       module is implemented. Modules written completely in Perl shouldn't
1463       cause problems. Modules that directly or indirectly access code written
1464       in other programming languages are at risk.
1465
1466       For affected functions, the simple strategy to avoid data corruption is
1467       to always make the encoding of the exchanged data explicit. Choose an
1468       encoding that you know the extension can handle. Convert arguments
1469       passed to the extensions to that encoding and convert results back from
1470       that encoding. Write wrapper functions that do the conversions for you,
1471       so you can later change the functions when the extension catches up.
1472
1473       To provide an example, let's say the popular Foo::Bar::escape_html
1474       function doesn't deal with Unicode data yet. The wrapper function would
1475       convert the argument to raw UTF-8 and convert the result back to Perl's
1476       internal representation like so:
1477
1478           sub my_escape_html ($) {
1479               my($what) = shift;
1480               return unless defined $what;
1481               Encode::decode_utf8(Foo::Bar::escape_html(
1482                                                Encode::encode_utf8($what)));
1483           }
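
           Since Foo::Bar is only an example name, the wrapper above cannot
           be run as written.  The following self-contained sketch replaces
           it with a hypothetical stand-in, bytes_only_escape_html(), so the
           round trip can be tried out:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for Foo::Bar::escape_html; it works on octets.
sub bytes_only_escape_html {
    my ($octets) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
    $octets =~ s/([&<>])/$entity{$1}/g;
    return $octets;
}

# Encode on the way in, decode on the way out.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    return Encode::decode_utf8(
        bytes_only_escape_html(Encode::encode_utf8($what)));
}

my $escaped = my_escape_html("<\x{2660}>");
printf "length of result: %d characters\n", length $escaped;
```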
1484
1485       Sometimes, when the extension does not convert data but just stores and
1486       retrieves them, you will be able to use the otherwise dangerous
1487       Encode::_utf8_on() function. Let's say the popular "Foo::Bar"
1488       extension, written in C, provides a "param" method that lets you store
1489       and retrieve data according to these prototypes:
1490
1491           $self->param($name, $value);            # set a scalar
1492           $value = $self->param($name);           # retrieve a scalar
1493
1494       If it does not yet provide support for any encoding, one could write a
1495       derived class with such a "param" method:
1496
1497           sub param {
1498             my($self,$name,$value) = @_;
1499             utf8::upgrade($name);     # make sure it is UTF-8 encoded
1500             if (defined $value) {
1501               utf8::upgrade($value);  # make sure it is UTF-8 encoded
1502               return $self->SUPER::param($name,$value);
1503             } else {
1504               my $ret = $self->SUPER::param($name);
1505               Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
1506               return $ret;
1507             }
1508           }
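
           Where Encode::_utf8_on() feels too risky, a more conservative
           variant is to encode on the way in and decode on the way out.
           The sketch below is an alternative to the method above, not a
           replacement for it; My::ByteStore is a hypothetical pure-Perl
           stand-in for the C extension:

```perl
use strict;
use warnings;

{
    package My::ByteStore;    # hypothetical stand-in for Foo::Bar
    sub new { my $class = shift; return bless { data => {} }, $class }
    sub param {
        my ($self, $name, $value) = @_;
        $self->{data}{$name} = $value if defined $value;
        return $self->{data}{$name};
    }
}

{
    package My::CharStore;    # derived class converting at the boundary
    our @ISA = ('My::ByteStore');
    use Encode ();
    sub param {
        my ($self, $name, $value) = @_;
        my $key = Encode::encode_utf8($name);
        if (defined $value) {
            return $self->SUPER::param($key, Encode::encode_utf8($value));
        }
        my $octets = $self->SUPER::param($key);
        return defined $octets ? Encode::decode_utf8($octets) : undef;
    }
}

my $store = My::CharStore->new;
$store->param("suit", "\x{2660}");
my $value = $store->param("suit");
printf "retrieved %d character(s)\n", length $value;
```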
1509
1510       Some extensions provide filters on data entry/exit points, such as
1511       DB_File::filter_store_key and family. Look out for such filters in the
1512       documentation of your extensions; they can make the transition to
1513       Unicode data much easier.
1514
1515   Speed
1516       Some functions are slower when working on UTF-8 encoded strings than on
1517       byte encoded strings.  All functions that need to hop over characters
1518       such as length(), substr() or index(), or matching regular expressions
1519       can work much faster when the underlying data are byte-encoded.
1520
1521       In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 a
1522       caching scheme was introduced which will hopefully make the slowness
1523       somewhat less spectacular, at least for some operations.  In general,
1524       operations with UTF-8 encoded strings are still slower. As an example,
1525       the Unicode properties (character classes) like "\p{Nd}" are known to
1526       be quite a bit slower (5-20 times) than their simpler counterparts like
1527       "\d" (then again, there are hundreds of Unicode characters matching
1528       "Nd" compared with the 10 ASCII characters matching "d").
1529
1530   Problems on EBCDIC platforms
1531       There are several known problems with Perl on EBCDIC platforms.  If you
1532       want to use Perl there, send email to perlbug@perl.org.
1533
1534       In earlier versions, when byte and character data were concatenated,
1535       the new string was sometimes created by decoding the byte strings as
1536       ISO 8859-1 (Latin-1), even if the old Unicode string used EBCDIC.
1537
1538       If you find any of these, please report them as bugs.
1539
1540   Porting code from perl-5.6.X
1541       Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1542       was required to use the "utf8" pragma to declare that a given scope
1543       expected to deal with Unicode data and had to make sure that only
1544       Unicode data were reaching that scope. If you have code that is working
1545       with 5.6, you will need some of the following adjustments to your code.
1546       The examples are written such that the code will continue to work under
1547       5.6, so you should be safe to try them out.
1548
1549       ·  A filehandle that should read or write UTF-8
1550
1551            if ($] > 5.007) {
1552              binmode $fh, ":encoding(utf8)";
1553            }
1554
1555       ·  A scalar that is going to be passed to some extension
1556
1557          Be it Compress::Zlib, Apache::Request or any extension that has no
1558          mention of Unicode in the manpage, you need to make sure that the
1559          UTF8 flag is stripped off. Note that at the time of this writing
1560          (October 2002) the mentioned modules are not UTF-8-aware. Please
1561          check the documentation to verify if this is still true.
1562
1563            if ($] > 5.007) {
1564              require Encode;
1565              $val = Encode::encode_utf8($val); # make octets
1566            }
1567
1568       ·  A scalar we got back from an extension
1569
1570          If you believe the scalar comes back as UTF-8, you will most likely
1571          want the UTF8 flag restored:
1572
1573            if ($] > 5.007) {
1574              require Encode;
1575              $val = Encode::decode_utf8($val);
1576            }
1577
1578       ·  Same thing, if you are really sure it is UTF-8
1579
1580            if ($] > 5.007) {
1581              require Encode;
1582              Encode::_utf8_on($val);
1583            }
1584
1585       ·  A wrapper for fetchrow_array and fetchrow_hashref
1586
1587          When the database contains only UTF-8, a wrapper function or method
1588          is a convenient way to replace all your fetchrow_array and
1589          fetchrow_hashref calls. A wrapper function will also make it easier
1590          to adapt to future enhancements in your database driver. Note that
1591          at the time of this writing (October 2002), the DBI has no
1592          standardized way to deal with UTF-8 data. Please check the
1593          documentation to verify if that is still true.
1594
1595            sub fetchrow {
1596              # $what is one of fetchrow_{array,hashref}
1597              my($self, $sth, $what) = @_;
1598              if ($] < 5.007) {
1599                return $sth->$what;
1600              } else {
1601                require Encode;
1602                if (wantarray) {
1603                  my @arr = $sth->$what;
1604                  for (@arr) {
1605                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
1606                  }
1607                  return @arr;
1608                } else {
1609                  my $ret = $sth->$what;
1610                  if (ref $ret) {
1611                    for my $k (keys %$ret) {
1612                      defined
1613                      && /[^\000-\177]/
1614                      && Encode::_utf8_on($_) for $ret->{$k};
1615                    }
1616                    return $ret;
1617                  } else {
1618                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
1619                    return $ret;
1620                  }
1621                }
1622              }
1623            }
1624
1625       ·  A large scalar that you know can only contain ASCII
1626
1627          Scalars that contain only ASCII and are marked as UTF-8 are
1628          sometimes a drag to your program. If you recognize such a situation,
1629          just remove the UTF8 flag:
1630
1631            utf8::downgrade($val) if $] > 5.007;
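
           The first three adjustments can be tried together in a small
           sketch.  It assumes Perl 5.8 or later and uses in-memory files so
           that nothing is written to disk; the variable names are
           illustrative:

```perl
use strict;
use warnings;
use Encode ();

my $text = "\x{2660} spade";

# Write through an encoding layer to an in-memory file.
open my $out, '>', \my $file or die "open failed: $!";
binmode $out, ':encoding(utf8)';
print {$out} $text;
close $out;

# Read it back through the same layer.
open my $in, '<', \$file or die "open failed: $!";
binmode $in, ':encoding(utf8)';
my $read_back = <$in>;
close $in;

# The same conversions done by hand for data exchanged with an extension.
my $octets = Encode::encode_utf8($text);    # characters to octets
my $chars  = Encode::decode_utf8($octets);  # octets back to characters

printf "octets on disk: %d, characters read: %d\n",
    length $file, length $read_back;
```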
1632
