PERLUNICODE(1)         Perl Programmers Reference Guide        PERLUNICODE(1)

NAME
       perlunicode - Unicode support in Perl

DESCRIPTION
   Important Caveats
       Unicode support is an extensive requirement.  While Perl does not
       implement the Unicode standard or the accompanying technical reports
       from cover to cover, Perl does support many Unicode features.

       People who want to learn to use Unicode in Perl should probably read
       the Perl Unicode tutorial, perlunitut, before reading this reference
       document.

       Also, the use of Unicode may present security issues that aren't
       obvious.  Read Unicode Security Considerations
       <http://www.unicode.org/reports/tr36>.

       Input and Output Layers
           Perl knows when a filehandle uses Perl's internal Unicode
           encodings (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle
           is opened with the ":utf8" layer.  Other encodings can be
           converted to Perl's encoding on input or from Perl's encoding on
           output by use of the ":encoding(...)" layer.  See open.

           To indicate that Perl source itself is in UTF-8, use
           "use utf8;".
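
           For instance (a sketch: the in-memory filehandle stands in for a
           real file, and the variable names are illustrative only):

```perl
use strict;
use warnings;

# UTF-8 octets for "café"; an in-memory handle stands in for a disk file.
my $octets = "caf\xC3\xA9\n";
open my $in, '<:encoding(UTF-8)', \$octets or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;

# The ":encoding(UTF-8)" layer decoded the octets into characters,
# so length() counts 4 characters ("café"), not 5 bytes.
print length($line), "\n";
```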

       Regular Expressions
           The regular expression compiler produces polymorphic opcodes.
           That is, the pattern adapts to the data and automatically
           switches to the Unicode character scheme when presented with
           data that is internally encoded in UTF-8, or instead uses a
           traditional byte scheme when presented with byte data.

       "use utf8" still needed to enable UTF-8/UTF-EBCDIC in scripts
           As a compatibility measure, the "use utf8" pragma must be
           explicitly included to enable recognition of UTF-8 in the Perl
           scripts themselves (in string or regular expression literals, or
           in identifier names) on ASCII-based machines, or to recognize
           UTF-EBCDIC on EBCDIC-based machines.  These are the only times
           when an explicit "use utf8" is needed.  See utf8.

       BOM-marked scripts and UTF-16 scripts autodetected
           If a Perl script begins with the Unicode BOM (UTF-16LE,
           UTF-16BE, or UTF-8), or if the script looks like non-BOM-marked
           UTF-16 of either endianness, Perl will correctly read in the
           script as Unicode.  (BOMless UTF-8 cannot be effectively
           recognized or differentiated from ISO 8859-1 or other eight-bit
           encodings.)

       "use encoding" needed to upgrade non-Latin-1 byte strings
           By default, there is a fundamental asymmetry in Perl's Unicode
           model: implicit upgrading from byte strings to Unicode strings
           assumes that they were encoded in ISO 8859-1 (Latin-1), but
           Unicode strings are downgraded with UTF-8 encoding.  This
           happens because the first 256 code points in Unicode happen to
           agree with Latin-1.

           See "Byte and Character Semantics" for more details.

   Byte and Character Semantics
       Beginning with version 5.6, Perl uses logically wide characters to
       represent strings internally.

       In future, Perl-level operations will be expected to work with
       characters rather than bytes.

       However, as an interim compatibility measure, Perl aims to provide a
       safe migration path from byte semantics to character semantics for
       programs.  For operations where Perl can unambiguously decide that
       the input data are characters, Perl switches to character semantics.
       For operations where this determination cannot be made without
       additional information from the user, Perl decides in favor of
       compatibility and chooses to use byte semantics.

       Under byte semantics, when "use locale" is in effect, Perl uses the
       semantics associated with the current locale.  Absent a "use
       locale", and absent a "use feature 'unicode_strings'" pragma, Perl
       currently uses US-ASCII (or Basic Latin in Unicode terminology) byte
       semantics, meaning that characters whose ordinal numbers are in the
       range 128-255 are undefined except for their ordinal numbers.  This
       means that none have case (upper and lower), nor are any a member of
       character classes, like "[:alpha:]" or "\w".  (But all do belong to
       the "\W" class or the Perl regular expression extension
       "[:^alpha:]".)

       This behavior preserves compatibility with earlier versions of Perl,
       which allowed byte semantics in Perl operations only if none of the
       program's inputs were marked as being a source of Unicode character
       data.  Such data may come from filehandles, from calls to external
       programs, from information provided by the system (such as %ENV), or
       from literals and constants in the source text.

       The "bytes" pragma will always, regardless of platform, force byte
       semantics in a particular lexical scope.  See bytes.

       The "use feature 'unicode_strings'" pragma is intended to always,
       regardless of platform, force character (Unicode) semantics in a
       particular lexical scope.  In release 5.12, it is partially
       implemented, applying only to case changes.  See "The "Unicode Bug""
       below.

       The "utf8" pragma is primarily a compatibility device that enables
       recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
       Note that this pragma is only required while Perl defaults to byte
       semantics; when character semantics become the default, this pragma
       may become a no-op.  See utf8.

       Unless explicitly stated, Perl operators use character semantics for
       Unicode data and byte semantics for non-Unicode data.  The decision
       to use character semantics is made transparently.  If input data
       comes from a Unicode source--for example, if a character encoding
       layer is added to a filehandle or a literal Unicode string constant
       appears in a program--character semantics apply.  Otherwise, byte
       semantics are in effect.  The "bytes" pragma should be used to force
       byte semantics on Unicode data, and the "use feature
       'unicode_strings'" pragma to force Unicode semantics on byte data
       (though in 5.12 it isn't fully implemented).

       If strings operating under byte semantics and strings with Unicode
       character data are concatenated, the new string will have character
       semantics.  This can cause surprises: see "BUGS", below.  You can
       choose to be warned when this happens.  See encoding::warnings.

       Under character semantics, many operations that formerly operated on
       bytes now operate on characters.  A character in Perl is logically
       just a number ranging from 0 to 2**31 or so.  Larger characters may
       encode into longer sequences of bytes internally, but this internal
       detail is mostly hidden for Perl code.  See perluniintro for more.

   Effects of Character Semantics
       Character semantics have the following effects:

       ·   Strings--including hash keys--and regular expression patterns
           may contain characters that have an ordinal value larger than
           255.

           If you use a Unicode editor to edit your program, Unicode
           characters may occur directly within the literal strings in
           UTF-8 encoding, or UTF-16.  (The former requires a BOM or "use
           utf8", the latter requires a BOM.)

           Unicode characters can also be added to a string by using the
           "\N{U+...}" notation.  The Unicode code for the desired
           character, in hexadecimal, should be placed in the braces, after
           the "U".  For instance, a smiley face is "\N{U+263A}".

           Alternatively, you can use the "\x{...}" notation for characters
           0x100 and above.  For characters below 0x100 you may get byte
           semantics instead of character semantics; see "The "Unicode
           Bug"".  On EBCDIC machines there is the additional problem that
           the value for such characters gives the EBCDIC character rather
           than the Unicode one.

           Additionally, if you

               use charnames ':full';

           you can use the "\N{...}" notation and put the official Unicode
           character name within the braces, such as
           "\N{WHITE SMILING FACE}".  See charnames.
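
           All three notations can denote the same character; a small
           sketch:

```perl
use strict;
use warnings;
use charnames ':full';

# Three spellings of U+263A WHITE SMILING FACE:
my $s1 = "\N{U+263A}";               # code point, \N{U+...} form
my $s2 = "\x{263A}";                 # code point, \x{...} form
my $s3 = "\N{WHITE SMILING FACE}";   # official Unicode character name

print ord($s1), "\n";   # the code point, 9786 (0x263A)
```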

       ·   If an appropriate encoding is specified, identifiers within the
           Perl script may contain Unicode alphanumeric characters,
           including ideographs.  Perl does not currently attempt to
           canonicalize variable names.

       ·   Regular expressions match characters instead of bytes.  "."
           matches a character instead of a byte.

       ·   Bracketed character classes in regular expressions match
           characters instead of bytes and match against the character
           properties specified in the Unicode properties database.  "\w"
           can be used to match a Japanese ideograph, for instance.
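
           For example (a sketch; the ideograph literal requires "use
           utf8", as described above):

```perl
use strict;
use warnings;
use utf8;   # the string literal below is UTF-8 in the source

my $ideograph = "漢";   # U+6F22, a Japanese ideograph

# Under character semantics, \w and "." see one character, not bytes.
print $ideograph =~ /^\w$/ ? "word character\n" : "not a word character\n";
print $ideograph =~ /^.$/  ? "one character\n"  : "more than one\n";
```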

       ·   Named Unicode properties, scripts, and block ranges may be used
           (like bracketed character classes) by using the "\p{}" "matches
           property" construct and the "\P{}" negation, "doesn't match
           property".  See "Unicode Character Properties" for more details.

           You can define your own character properties and use them in
           the regular expression with the "\p{}" or "\P{}" construct.
           See "User-Defined Character Properties" for more details.

       ·   The special pattern "\X" matches a logical character, an
           "extended grapheme cluster" in Standardese.  In Unicode what
           appears to the user to be a single character, for example an
           accented "G", may in fact be composed of a sequence of
           characters, in this case a "G" followed by an accent character.
           "\X" will match the entire sequence.
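
           A short illustration of "." versus "\X" on a two-code-point
           grapheme:

```perl
use strict;
use warnings;

# "G" followed by U+0301 COMBINING ACUTE ACCENT: two code points that
# display as one accented character.
my $str = "G\x{0301}";

my ($dot)     = $str =~ /\A(.)/;    # "." matches one code point
my ($cluster) = $str =~ /\A(\X)/;   # "\X" matches the whole cluster

print length($dot), " ", length($cluster), "\n";   # 1 2
```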

       ·   The "tr///" operator translates characters instead of bytes.
           Note that the "tr///CU" functionality has been removed.  For
           similar functionality see pack('U0', ...) and pack('C0', ...).

       ·   Case translation operators use the Unicode case translation
           tables when character input is provided.  Note that "uc()", or
           "\U" in interpolated strings, translates to uppercase, while
           "ucfirst", or "\u" in interpolated strings, translates to
           titlecase in languages that make the distinction (which is
           equivalent to uppercase in languages without the distinction).
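
           One character where the difference is visible is U+01C6, whose
           uppercase and titlecase forms are distinct code points:

```perl
use strict;
use warnings;

my $dz = "\x{01C6}";   # LATIN SMALL LETTER DZ WITH CARON

printf "%04X\n", ord uc $dz;        # 01C4: uppercase "DZ" with caron
printf "%04X\n", ord ucfirst $dz;   # 01C5: titlecase "Dz" with caron
```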

       ·   Most operators that deal with positions or lengths in a string
           will automatically switch to using character positions,
           including "chop()", "chomp()", "substr()", "pos()", "index()",
           "rindex()", "sprintf()", "write()", and "length()".  An
           operator that specifically does not switch is "vec()".
           Operators that really don't care include operators that treat
           strings as a bucket of bits such as "sort()", and operators
           dealing with filenames.

       ·   The "pack()"/"unpack()" letter "C" does not change, since it is
           often used for byte-oriented formats.  Again, think "char" in
           the C language.

           There is a new "U" specifier that converts between Unicode
           characters and code points.  There is also a "W" specifier that
           is the equivalent of "chr"/"ord" and properly handles character
           values even if they are above 255.
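
           A sketch of "U" and "W" round-tripping a code point above 255:

```perl
use strict;
use warnings;

my $char = pack 'U', 0x263A;    # one character, U+263A
my ($cp) = unpack 'W', $char;   # back to the numeric code point

print $char eq chr 0x263A ? "round-trip ok\n" : "mismatch\n";
print $cp, "\n";   # 9786
```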

       ·   The "chr()" and "ord()" functions work on characters, similar
           to "pack("W")" and "unpack("W")", not "pack("C")" and
           "unpack("C")".  "pack("C")" and "unpack("C")" are methods for
           emulating byte-oriented "chr()" and "ord()" on Unicode strings.
           While these methods reveal the internal encoding of Unicode
           strings, that is not something one normally needs to care about
           at all.

       ·   The bit string operators, "& | ^ ~", can operate on character
           data.  However, for backward compatibility, such as when using
           bit string operations when characters are all less than 256 in
           ordinal value, one should not use "~" (the bit complement) with
           characters of both values less than 256 and values greater than
           256.  Most importantly, DeMorgan's laws ("~($x|$y) eq ~$x&~$y"
           and "~($x&$y) eq ~$x|~$y") will not hold.  The reason for this
           mathematical faux pas is that the complement cannot return both
           the 8-bit (byte-wide) bit complement and the full
           character-wide bit complement.

       ·   You can define your own mappings to be used in "lc()",
           "lcfirst()", "uc()", and "ucfirst()" (or their double-quoted
           string inlined versions such as "\U").  See "User-Defined Case
           Mappings" for more details.

       ·   And finally, "scalar reverse()" reverses by character rather
           than by byte.

   Unicode Character Properties
       Most Unicode character properties are accessible by using regular
       expressions.  They are used (like bracketed character classes) by
       using the "\p{}" "matches property" construct and the "\P{}"
       negation, "doesn't match property".

       Note that the only time that Perl considers a sequence of
       individual code points as a single logical character is in the "\X"
       construct, already mentioned above.  Therefore "character" in this
       discussion means a single Unicode code point.

       For instance, "\p{Uppercase}" matches any single character with the
       Unicode "Uppercase" property, while "\p{L}" matches any character
       with a General_Category of "L" (letter) property.  Brackets are not
       required for single letter property names, so "\p{L}" is
       equivalent to "\pL".

       More formally, "\p{Uppercase}" matches any single character whose
       Unicode Uppercase property value is True, and "\P{Uppercase}"
       matches any character whose Uppercase property value is False, and
       they could have been written as "\p{Uppercase=True}" and
       "\p{Uppercase=False}", respectively.

       This formality is needed when properties are not binary; that is,
       if they can take on more values than just True and False.  For
       example, the Bidi_Class (see "Bidirectional Character Types" below)
       can take on a number of different values, such as Left, Right,
       Whitespace, and others.  To match these, one needs to specify the
       property name (Bidi_Class), and the value being matched against
       (Left, Right, etc.).  This is done, as in the examples above, by
       having the two components separated by an equal sign (or
       interchangeably, a colon), like "\p{Bidi_Class: Left}".
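
       For example (a binary property, its compound spelling, and a
       non-binary property=value match; U+05D0 HEBREW LETTER ALEF has
       Bidi_Class "R"):

```perl
use strict;
use warnings;

print "A" =~ /\p{Uppercase}/            ? "1\n" : "0\n";  # binary, single form
print "A" =~ /\p{Uppercase=True}/       ? "1\n" : "0\n";  # same, compound form
print "\x{05D0}" =~ /\p{Bidi_Class: R}/ ? "1\n" : "0\n";  # non-binary property
```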

       All Unicode-defined character properties may be written in these
       compound forms of "\p{property=value}" or "\p{property:value}", but
       Perl provides some additional properties that are written only in
       the single form, as well as single-form short-cuts for all binary
       properties and certain others described below, in which you may
       omit the property name and the equals or colon separator.

       Most Unicode character properties have at least two synonyms (or
       aliases if you prefer): a short one that is easier to type, and a
       longer one that is more descriptive and hence easier to understand.
       Thus the "L" and "Letter" above are equivalent and can be used
       interchangeably.  Likewise, "Upper" is a synonym for "Uppercase",
       and we could have written "\p{Uppercase}" equivalently as
       "\p{Upper}".  Also, there are typically various synonyms for the
       values the property can take.  For binary properties, "True" has 3
       synonyms: "T", "Yes", and "Y"; and "False" has correspondingly
       "F", "No", and "N".  But be careful.  A short form of a value for
       one property may not mean the same thing as the same short form for
       another.  Thus, for the General_Category property, "L" means
       "Letter", but for the Bidi_Class property, "L" means "Left".  A
       complete list of properties and synonyms is in perluniprops.

       Upper/lower case differences in the property names and values are
       irrelevant; thus "\p{Upper}" means the same thing as "\p{upper}" or
       even "\p{UpPeR}".  Similarly, you can add or subtract underscores
       anywhere in the middle of a word, so that these are also equivalent
       to "\p{U_p_p_e_r}".  And white space is irrelevant adjacent to
       non-word characters, such as the braces and the equals or colon
       separators, so "\p{ Upper }" and "\p{ Upper_case : Y }" are
       equivalent to these as well.  In fact, in most cases, white space
       and even hyphens can be added or deleted anywhere.  So even
       "\p{ Up-per case = Yes}" is equivalent.  All this is called "loose
       matching" by Unicode.  The few places where stricter matching is
       employed are in the middle of numbers, and in the Perl extension
       properties that begin or end with an underscore.  Stricter matching
       cares about white space (except adjacent to non-word characters),
       hyphens, and non-interior underscores.

       You can also use negation in both "\p{}" and "\P{}" by introducing
       a caret (^) between the first brace and the property name:
       "\p{^Tamil}" is equal to "\P{Tamil}".

    General_Category

       Every Unicode character is assigned a general category, which is
       the "most usual categorization of a character" (from
       <http://www.unicode.org/reports/tr44>).

       The compound way of writing these is like
       "\p{General_Category=Number}" (short: "\p{gc:n}").  But Perl
       furnishes shortcuts in which everything up through the equal or
       colon separator is omitted.  So you can instead just write "\pN".

       Here are the short and long forms of the General Category
       properties:

           Short       Long

           L           Letter
           LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
           Lu          Uppercase_Letter
           Ll          Lowercase_Letter
           Lt          Titlecase_Letter
           Lm          Modifier_Letter
           Lo          Other_Letter

           M           Mark
           Mn          Nonspacing_Mark
           Mc          Spacing_Mark
           Me          Enclosing_Mark

           N           Number
           Nd          Decimal_Number (also Digit)
           Nl          Letter_Number
           No          Other_Number

           P           Punctuation (also Punct)
           Pc          Connector_Punctuation
           Pd          Dash_Punctuation
           Ps          Open_Punctuation
           Pe          Close_Punctuation
           Pi          Initial_Punctuation
                       (may behave like Ps or Pe depending on usage)
           Pf          Final_Punctuation
                       (may behave like Ps or Pe depending on usage)
           Po          Other_Punctuation

           S           Symbol
           Sm          Math_Symbol
           Sc          Currency_Symbol
           Sk          Modifier_Symbol
           So          Other_Symbol

           Z           Separator
           Zs          Space_Separator
           Zl          Line_Separator
           Zp          Paragraph_Separator

           C           Other
           Cc          Control (also Cntrl)
           Cf          Format
           Cs          Surrogate (not usable)
           Co          Private_Use
           Cn          Unassigned

       Single-letter properties match all characters in any of the
       two-letter sub-properties starting with the same letter.  "LC" and
       "L&" are special cases, which are both aliases for the set
       consisting of everything matched by "Ll", "Lu", and "Lt".

       Because Perl hides the need for the user to understand the internal
       representation of Unicode characters, there is no need to implement
       the somewhat messy concept of surrogates.  "Cs" is therefore not
       supported.
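
       A few of these categories in action:

```perl
use strict;
use warnings;

print "9" =~ /\pN/           ? "1\n" : "0\n";  # Number, single-letter form
print "9" =~ /\p{Nd}/        ? "1\n" : "0\n";  # Decimal_Number
print "\x{20AC}" =~ /\p{Sc}/ ? "1\n" : "0\n";  # EURO SIGN: Currency_Symbol
print "X" =~ /\p{L&}/        ? "1\n" : "0\n";  # cased letter (Ll, Lu, or Lt)
```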

    Bidirectional Character Types

       Because scripts differ in their directionality (Hebrew is written
       right to left, for example), Unicode supplies these properties in
       the Bidi_Class class:

           Property    Meaning

           L           Left-to-Right
           LRE         Left-to-Right Embedding
           LRO         Left-to-Right Override
           R           Right-to-Left
           AL          Arabic Letter
           RLE         Right-to-Left Embedding
           RLO         Right-to-Left Override
           PDF         Pop Directional Format
           EN          European Number
           ES          European Separator
           ET          European Terminator
           AN          Arabic Number
           CS          Common Separator
           NSM         Non-Spacing Mark
           BN          Boundary Neutral
           B           Paragraph Separator
           S           Segment Separator
           WS          Whitespace
           ON          Other Neutrals

       This property is always written in the compound form.  For example,
       "\p{Bidi_Class:R}" matches characters that are normally written
       right to left.

    Scripts

       The world's languages are written in a number of scripts.  This
       sentence (unless you're reading it in translation) is written in
       Latin, while Russian is written in Cyrillic, and Greek is written
       in, well, Greek; Japanese mainly in Hiragana or Katakana.  There
       are many more.

       The Unicode Script property gives what script a given character is
       in, and the property can be specified with the compound form like
       "\p{Script=Hebrew}" (short: "\p{sc=hebr}").  Perl furnishes
       shortcuts for all script names.  You can omit everything up through
       the equals (or colon), and simply write "\p{Latin}" or
       "\P{Cyrillic}".

       A complete list of scripts and their shortcuts is in perluniprops.
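
       For instance, the compound form, its short synonyms, and the bare
       shortcut all match the same characters:

```perl
use strict;
use warnings;

my $zhe = "\x{0436}";   # CYRILLIC SMALL LETTER ZHE

print $zhe =~ /\p{Script=Cyrillic}/ ? "1\n" : "0\n";   # compound form
print $zhe =~ /\p{sc=cyrl}/         ? "1\n" : "0\n";   # short synonyms
print $zhe =~ /\p{Cyrillic}/        ? "1\n" : "0\n";   # Perl shortcut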

    Use of "Is" Prefix

       For backward compatibility (with Perl 5.6), all properties
       mentioned so far may have "Is" or "Is_" prepended to their name, so
       "\P{Is_Lu}", for example, is equal to "\P{Lu}", and
       "\p{IsScript:Arabic}" is equal to "\p{Arabic}".

    Blocks

       In addition to scripts, Unicode also defines blocks of characters.
       The difference between scripts and blocks is that the concept of
       scripts is closer to natural languages, while the concept of blocks
       is more of an artificial grouping based on groups of Unicode
       characters with consecutive ordinal values.  For example, the
       "Basic Latin" block is all characters whose ordinals are between 0
       and 127, inclusive; in other words, the ASCII characters.  The
       "Latin" script contains some letters from this block as well as
       several more, like "Latin-1 Supplement", "Latin Extended-A", etc.,
       but it does not contain all the characters from those blocks.  It
       does not, for example, contain digits, because digits are shared
       across many scripts.  Digits and similar groups, like punctuation,
       are in the script called "Common".  There is also a script called
       "Inherited" for characters that modify other characters, and
       inherit the script value of the controlling character.

       For more about scripts versus blocks, see UAX#24 "Unicode Script
       Property": <http://www.unicode.org/reports/tr24>

       The Script property is likely to be the one you want to use when
       processing natural language; the Block property may be useful in
       working with the nuts and bolts of Unicode.

       Block names are matched in the compound form, like "\p{Block:
       Arrows}" or "\p{Blk=Hebrew}".  Unlike most other properties, only a
       few block names have a Unicode-defined short name.  But Perl does
       provide a (slight) shortcut: you can say, for example,
       "\p{In_Arrows}" or "\p{In_Hebrew}".  For backwards compatibility,
       the "In" prefix may be omitted if there is no naming conflict with
       a script or any other property, and you can even use an "Is" prefix
       instead in those cases.  But it is not a good idea to do this, for
       a couple of reasons:

       1.  It is confusing.  There are many naming conflicts, and you may
           forget some.  For example, "\p{Hebrew}" means the script
           Hebrew, and NOT the block Hebrew.  But would you remember that
           6 months from now?

       2.  It is unstable.  A new version of Unicode may pre-empt the
           current meaning by creating a property with the same name.
           There was a time in very early Unicode releases when
           "\p{Hebrew}" would have matched the block Hebrew; now it
           doesn't.

       Some people just prefer to always use "\p{Block: foo}" and
       "\p{Script: bar}" instead of the shortcuts, for clarity, and
       because they can't remember the difference between 'In' and 'Is'
       anyway (or aren't confident that those who eventually will read
       their code will know).

       A complete list of blocks and their shortcuts is in perluniprops.
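
       The script/block distinction is easy to check directly:

```perl
use strict;
use warnings;

my $arrow = "\x{2192}";   # RIGHTWARDS ARROW
my $alef  = "\x{05D0}";   # HEBREW LETTER ALEF

print $arrow =~ /\p{Block: Arrows}/ ? "1\n" : "0\n";   # in the Arrows block
print $alef  =~ /\p{Blk=Hebrew}/    ? "1\n" : "0\n";   # in the Hebrew block
print $alef  =~ /\p{Hebrew}/        ? "1\n" : "0\n";   # and the Hebrew script
```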

    Other Properties

       There are many more properties than the very basic ones described
       here.  A complete list is in perluniprops.

       Unicode defines all its properties in the compound form, so all
       single-form properties are Perl extensions.  A number of these are
       just synonyms for the Unicode ones, but some are genuine
       extensions, including a couple that are in the compound form.  And
       quite a few of these are actually recommended by Unicode (in
       <http://www.unicode.org/reports/tr18>).

       This section gives some details on all the extensions that aren't
       synonyms for compound-form Unicode properties (for those, you'll
       have to refer to the Unicode Standard
       <http://www.unicode.org/reports/tr44>).

       "\p{All}"
           This matches any of the 1_114_112 Unicode code points.  It is a
           synonym for "\p{Any}".

       "\p{Alnum}"
           This matches any "\p{Alphabetic}" or "\p{Decimal_Number}"
           character.

       "\p{Any}"
           This matches any of the 1_114_112 Unicode code points.  It is a
           synonym for "\p{All}".

       "\p{Assigned}"
           This matches any assigned code point; that is, any code point
           whose general category is not Unassigned (or equivalently, not
           Cn).

       "\p{Blank}"
           This is the same as "\h" and "\p{HorizSpace}": a character that
           changes the spacing horizontally.

       "\p{Decomposition_Type: Non_Canonical}" (Short: "\p{Dt=NonCanon}")
           Matches a character that has a non-canonical decomposition.

           To understand the use of this rarely used property=value
           combination, it is necessary to know some basics about
           decomposition.  Consider a character, say H.  It could appear
           with various marks around it, such as an acute accent, or a
           circumflex, or various hooks, circles, arrows, etc., above,
           below, to one side and/or the other, etc.  There are many
           possibilities among the world's languages.  The number of
           combinations is astronomical, and if there were a character
           for each combination, it would soon exhaust Unicode's more
           than a million possible characters.  So Unicode took a
           different approach: there is a character for the base H, and a
           character for each of the possible marks, and they can be
           combined variously to get a final logical character.  So a
           logical character--what appears to be a single character--can
           be a sequence of more than one individual character.  This is
           called an "extended grapheme cluster".  (Perl furnishes the
           "\X" construct to match such sequences.)

           But Unicode's intent is to unify the existing character set
           standards and practices, and a number of pre-existing
           standards have single characters that mean the same thing as
           some of these combinations.  An example is ISO-8859-1, which
           has quite a few of these in the Latin-1 range, an example
           being "LATIN CAPITAL LETTER E WITH ACUTE".  Because this
           character was in this pre-existing standard, Unicode added it
           to its repertoire.  But this character is considered by
           Unicode to be equivalent to the sequence consisting of first
           the character "LATIN CAPITAL LETTER E", then the character
           "COMBINING ACUTE ACCENT".

           "LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed"
           character, and the equivalence with the sequence is called
           canonical equivalence.  All pre-composed characters are said
           to have a decomposition (into the equivalent sequence), and
           the decomposition type is also called canonical.

           However, many more characters have a different type of
           decomposition, a "compatible" or "non-canonical"
           decomposition.  The sequences that form these decompositions
           are not considered canonically equivalent to the pre-composed
           character.  An example, again in the Latin-1 range, is the
           "SUPERSCRIPT ONE".  It is kind of like a regular digit 1, but
           not exactly; its decomposition into the digit 1 is called a
           "compatible" decomposition, specifically a "super"
           decomposition.  There are several such compatibility
           decompositions (see <http://www.unicode.org/reports/tr44>),
           including one called "compat" which means some miscellaneous
           type of decomposition that doesn't fit into the decomposition
           categories that Unicode has chosen.

           Note that most Unicode characters don't have a decomposition,
           so their decomposition type is "None".

           Perl has added the "Non_Canonical" type, for your convenience,
           to mean any of the compatibility decompositions.
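
           The three situations described above can be distinguished
           directly (a sketch, using the short "Dt" form):

```perl
use strict;
use warnings;

my $e_acute = "\x{00C9}";   # LATIN CAPITAL LETTER E WITH ACUTE
my $super1  = "\x{00B9}";   # SUPERSCRIPT ONE

print $e_acute =~ /\p{Dt=Canonical}/ ? "1\n" : "0\n";  # canonical decomposition
print $super1  =~ /\p{Dt=NonCanon}/  ? "1\n" : "0\n";  # compatibility ("super")
print "E"      =~ /\p{Dt=None}/      ? "1\n" : "0\n";  # no decomposition
```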

       "\p{Graph}"
           Matches any character that is graphic.  Theoretically, this
           means a character that on a printer would cause ink to be used.

       "\p{HorizSpace}"
           This is the same as "\h" and "\p{Blank}": a character that
           changes the spacing horizontally.

       "\p{In=*}"
           This is a synonym for "\p{Present_In=*}".

       "\p{PerlSpace}"
           This is the same as "\s", restricted to ASCII, namely
           "[ \f\n\r\t]".

           Mnemonic: Perl's (original) space.

       "\p{PerlWord}"
           This is the same as "\w", restricted to ASCII, namely
           "[A-Za-z0-9_]".

           Mnemonic: Perl's (original) word.

       "\p{PosixAlnum}"
           This matches any alphanumeric character in the ASCII range,
           namely "[A-Za-z0-9]".

       "\p{PosixAlpha}"
           This matches any alphabetic character in the ASCII range,
           namely "[A-Za-z]".

       "\p{PosixBlank}"
           This matches any blank character in the ASCII range, namely
           "[ \t]".

       "\p{PosixCntrl}"
           This matches any control character in the ASCII range, namely
           "[\x00-\x1F\x7F]".

       "\p{PosixDigit}"
           This matches any digit character in the ASCII range, namely
           "[0-9]".

       "\p{PosixGraph}"
           This matches any graphical character in the ASCII range,
           namely "[\x21-\x7E]".

       "\p{PosixLower}"
           This matches any lowercase character in the ASCII range,
           namely "[a-z]".

       "\p{PosixPrint}"
           This matches any printable character in the ASCII range,
           namely "[\x20-\x7E]".  These are the graphical characters plus
           SPACE.

       "\p{PosixPunct}"
           This matches any punctuation character in the ASCII range,
           namely "[\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E]".  These are
           the graphical characters that aren't word characters.  Note
           that the Posix standard includes in its definition of
           punctuation those characters that Unicode calls "symbols".

       "\p{PosixSpace}"
           This matches any space character in the ASCII range, namely
           "[ \f\n\r\t\x0B]" (the last being a vertical tab).

       "\p{PosixUpper}"
           This matches any uppercase character in the ASCII range,
           namely "[A-Z]".

       "\p{Present_In: *}" (Short: "\p{In=*}")
           This property is used when you need to know in what Unicode
           version(s) a character is.

           The "*" above stands for a two-part Unicode version number,
           such as 1.1 or 4.0; or the "*" can also be "Unassigned".  This
           property will match the code points whose final disposition
           has been settled as of the Unicode release given by the
           version number; "\p{Present_In: Unassigned}" will match those
           code points whose meaning has yet to be assigned.

           For example, "U+0041" "LATIN CAPITAL LETTER A" was present in
           the very first Unicode release available, which is 1.1, so
           this property is true for all valid "*" versions.  On the
           other hand, "U+1EFF" was not assigned until version 5.1 when
           it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*"
           that would match it are 5.1, 5.2, and later.

           Unicode furnishes the "Age" property from which this is
           derived.  The problem with Age is that a strict interpretation
           of it (which Perl takes) has it matching the precise release
           in which a code point's meaning is introduced.  Thus "U+0041"
           would match only 1.1, and "U+1EFF" only 5.1.  This is not
           usually what you want.

           Some non-Perl implementations of the Age property may change
           its meaning to be the same as the Perl Present_In property;
           just be aware of that.

           Another confusion with both these properties is that the
           definition is not that the code point has been assigned, but
           that the meaning of the code point has been determined.  This
           is because 66 code points will always be unassigned, and so
           the Age for them is the Unicode version in which the decision
           to make them so was made.  For example, "U+FDD0" is to be
           permanently unassigned to a character, and the decision to do
           that was made in version 3.1, so "\p{Age=3.1}" matches this
           character, and "\p{Present_In: 3.1}" and up matches as well.
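
           The contrast between Present_In and the strict Age
           interpretation:

```perl
use strict;
use warnings;

# "A" has been in Unicode since version 1.1.
print "A" =~ /\p{Present_In: 1.1}/ ? "1\n" : "0\n";   # 1: present in 1.1
print "A" =~ /\p{Present_In: 5.1}/ ? "1\n" : "0\n";   # 1: still present later
print "A" =~ /\p{Age=1.1}/         ? "1\n" : "0\n";   # 1: introduced in 1.1
print "A" =~ /\p{Age=5.1}/         ? "1\n" : "0\n";   # 0: Age is strict
```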

       "\p{Print}"
           This matches any character that is graphical or blank, except
           controls.

       "\p{SpacePerl}"
           This is the same as "\s", including beyond ASCII.

           Mnemonic: Space, as modified by Perl.  (It doesn't include the
           vertical tab which both the Posix standard and Unicode
           consider to be space.)

       "\p{VertSpace}"
           This is the same as "\v": a character that changes the spacing
           vertically.

       "\p{Word}"
           This is the same as "\w", including beyond ASCII.

   User-Defined Character Properties
       You can define your own binary character properties by defining
       subroutines whose names begin with "In" or "Is".  The subroutines
       can be defined in any package.  The user-defined properties can be
       used in the regular expression "\p" and "\P" constructs; if you are
       using a user-defined property from a package other than the one you
       are in, you must specify its package in the "\p" or "\P" construct.

           # assuming property Is_Foreign defined in Lang::
           package main;  # property package name required
           if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

           package Lang;  # property package name not required
           if ($txt =~ /\p{IsForeign}+/) { ... }

       Note that the effect is compile-time and immutable once defined.
726
727 The subroutines must return a specially-formatted string, with one or
728 more newline-separated lines. Each line must be one of the following:
729
730 · A single hexadecimal number denoting a Unicode code point to
731 include.
732
733 · Two hexadecimal numbers separated by horizontal whitespace (space
734 or tabular characters) denoting a range of Unicode code points to
735 include.
736
737 · Something to include, prefixed by "+": a built-in character
738 property (prefixed by "utf8::") or a user-defined character
739 property, to represent all the characters in that property; two
740 hexadecimal code points for a range; or a single hexadecimal code
741 point.
742
743 · Something to exclude, prefixed by "-": an existing character
744 property (prefixed by "utf8::") or a user-defined character
745 property, to represent all the characters in that property; two
746 hexadecimal code points for a range; or a single hexadecimal code
747 point.
748
· Something to negate, prefixed by "!": an existing character
  property (prefixed by "utf8::") or a user-defined character
  property, to represent all the characters not in that property;
  two hexadecimal code points for a range; or a single hexadecimal
  code point.
753
· Something to intersect with, prefixed by "&": an existing
  character property (prefixed by "utf8::") or a user-defined
  character property, to restrict the set built so far to only the
  characters that are also in that property; two hexadecimal code
  points for a range; or a single hexadecimal code point.
759
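The line-combining rules above amount to simple set arithmetic. As
a rough illustrative sketch (in Python, since the set operations
themselves are language-independent; this toy parser handles only
the hex and range forms plus the "+", "-", and "&" prefixes, and
omits "!" and named properties for brevity):

```python
# Sketch: interpret the newline-separated property-definition
# format described above as set arithmetic. Only hex numbers,
# ranges, and the "+", "-", "&" prefixes are handled.
def parse_property(spec):
    result = set()
    for line in spec.splitlines():
        line = line.strip()
        if not line:
            continue
        op, body = "+", line
        if line[0] in "+-&":
            op, body = line[0], line[1:]
        parts = body.split()
        if len(parts) == 2:            # two hex numbers: a range
            lo, hi = (int(p, 16) for p in parts)
            points = set(range(lo, hi + 1))
        else:                          # a single hex code point
            points = {int(parts[0], 16)}
        if op == "+":
            result |= points           # include
        elif op == "-":
            result -= points           # exclude
        elif op == "&":
            result &= points           # intersect
    return result

# Equivalent of the InKana example: hiragana plus katakana blocks.
kana = parse_property("3040\t309F\n30A0\t30FF")
```

With this, HIRAGANA LETTER A (U+3042) is in the set while LATIN
CAPITAL LETTER A (U+0041) is not, mirroring what "\p{InKana}"
would match.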
760 For example, to define a property that covers both the Japanese
761 syllabaries (hiragana and katakana), you can define
762
763 sub InKana {
764 return <<END;
765 3040\t309F
766 30A0\t30FF
767 END
768 }
769
770 Imagine that the here-doc end marker is at the beginning of the line.
771 Now you can use "\p{InKana}" and "\P{InKana}".
772
773 You could also have used the existing block property names:
774
775 sub InKana {
776 return <<'END';
777 +utf8::InHiragana
778 +utf8::InKatakana
779 END
780 }
781
782 Suppose you wanted to match only the allocated characters, not the raw
783 block ranges: in other words, you want to remove the non-characters:
784
785 sub InKana {
786 return <<'END';
787 +utf8::InHiragana
788 +utf8::InKatakana
789 -utf8::IsCn
790 END
791 }
792
793 The negation is useful for defining (surprise!) negated classes.
794
795 sub InNotKana {
796 return <<'END';
797 !utf8::InHiragana
798 -utf8::InKatakana
799 +utf8::IsCn
800 END
801 }
802
803 Intersection is useful for getting the common characters matched by two
804 (or more) classes.
805
806 sub InFooAndBar {
807 return <<'END';
808 +main::Foo
809 &main::Bar
810 END
811 }
812
813 It's important to remember not to use "&" for the first set; that would
814 be intersecting with nothing (resulting in an empty set).
815
816 User-Defined Case Mappings
817 You can also define your own mappings to be used in the lc(),
818 lcfirst(), uc(), and ucfirst() (or their string-inlined versions). The
819 principle is similar to that of user-defined character properties: to
820 define subroutines with names like "ToLower" (for lc() and lcfirst()),
821 "ToTitle" (for the first character in ucfirst()), and "ToUpper" (for
822 uc(), and the rest of the characters in ucfirst()).
823
The string returned by the subroutines needs to be two hexadecimal
numbers separated by two tabs: the two numbers being,
respectively, the source code point and the destination code
point. For example:
828
829 sub ToUpper {
830 return <<END;
831 0061\t\t0041
832 END
833 }
834
835 defines an uc() mapping that causes only the character "a" to be mapped
836 to "A"; all other characters will remain unchanged.
837
(For serious hackers only) The above means you have to furnish a
complete mapping; you can't just override a couple of characters
and leave the rest unchanged. You can find all the mappings in the
directory $Config{privlib}/unicore/To/. The mapping data is
returned as the here-document, and the "utf8::ToSpecFoo"
subroutines are special exception mappings derived from
$Config{privlib}/unicore/SpecialCasing.txt. The "Digit" and "Fold"
mappings that one can see in the directory are not directly
user-accessible; one can use either the "Unicode::UCD" module, or
just match case-insensitively (which is when the "Fold" mapping is
used).
848
849 The mappings will only take effect on scalars that have been marked as
850 having Unicode characters, for example by using "utf8::upgrade()". Old
851 byte-style strings are not affected.
852
853 The mappings are in effect for the package they are defined in.
854
855 Character Encodings for Input and Output
856 See Encode.
857
858 Unicode Regular Expression Support Level
The following list describes the Unicode regular-expression
features Perl currently supports. The references to "Level N" and
the section numbers refer to the Unicode Technical Standard #18,
"Unicode Regular Expressions", version 11, May 2005.
863
864 · Level 1 - Basic Unicode Support
865
866 RL1.1 Hex Notation - done [1]
867 RL1.2 Properties - done [2][3]
868 RL1.2a Compatibility Properties - done [4]
869 RL1.3 Subtraction and Intersection - MISSING [5]
870 RL1.4 Simple Word Boundaries - done [6]
871 RL1.5 Simple Loose Matches - done [7]
872 RL1.6 Line Boundaries - MISSING [8]
873 RL1.7 Supplementary Code Points - done [9]
874
875 [1] \x{...}
876 [2] \p{...} \P{...}
877 [3] supports not only minimal list, but all Unicode character
878 properties (see L</Unicode Character Properties>)
879 [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
880 [5] can use regular expression look-ahead [a] or
881 user-defined character properties [b] to emulate set operations
882 [6] \b \B
883 [7] note that Perl does Full case-folding in matching (but with bugs),
884 not Simple: for example U+1F88 is equivalent to U+1F00 U+03B9,
885 not with 1F80. This difference matters mainly for certain Greek
886 capital letters with certain modifiers: the Full case-folding
887 decomposes the letter, while the Simple case-folding would map
888 it to a single character.
889 [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r),
890 CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029);
891 should also affect <>, $., and script line numbers;
892 should not split lines within CRLF [c] (i.e. there is no empty
893 line between \r and \n)
[9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to
    U+10FFFF but also beyond U+10FFFF [d]
896
897 [a] You can mimic class subtraction using lookahead. For example,
898 what UTS#18 might write as
899
900 [{Greek}-[{UNASSIGNED}]]
901
902 in Perl can be written as:
903
904 (?!\p{Unassigned})\p{InGreekAndCoptic}
905 (?=\p{Assigned})\p{InGreekAndCoptic}
906
907 But in this particular example, you probably really want
908
909 \p{GreekAndCoptic}
910
911 which will match assigned characters known to be part of the Greek
912 script.
913
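The lookahead idiom for class subtraction is not Perl-specific. A
minimal sketch in Python's "re" module (which lacks "\p{...}", so
plain ASCII classes stand in for the Unicode properties here):

```python
import re

# Class subtraction via lookahead: match hex digits that are NOT
# decimal digits, i.e. [0-9a-f] minus [0-9] -- structurally the
# same trick as (?!\p{Unassigned})\p{InGreekAndCoptic}.
hex_minus_digit = re.compile(r"(?![0-9])[0-9a-f]")

matches = hex_minus_digit.findall("0a1b2c")  # -> ['a', 'b', 'c']
```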
Also see the Unicode::Regex::Set module, which implements the full
UTS#18 grouping, intersection, union, and removal (subtraction)
syntax.
917
918 [b] '+' for union, '-' for removal (set-difference), '&' for
919 intersection (see "User-Defined Character Properties")
920
921 [c] Try the ":crlf" layer (see PerlIO).
922
[d] U+FFFF will currently generate a warning message if 'utf8'
    warnings are enabled
926
927 · Level 2 - Extended Unicode Support
928
929 RL2.1 Canonical Equivalents - MISSING [10][11]
930 RL2.2 Default Grapheme Clusters - MISSING [12]
931 RL2.3 Default Word Boundaries - MISSING [14]
932 RL2.4 Default Loose Matches - MISSING [15]
933 RL2.5 Name Properties - MISSING [16]
934 RL2.6 Wildcard Properties - MISSING
935
936 [10] see UAX#15 "Unicode Normalization Forms"
937 [11] have Unicode::Normalize but not integrated to regexes
938 [12] have \X but we don't have a "Grapheme Cluster Mode"
939 [14] see UAX#29, Word Boundaries
940 [15] see UAX#21 "Case Mappings"
941 [16] have \N{...} but neither compute names of CJK Ideographs
942 and Hangul Syllables nor use a loose match [e]
943
944 [e] "\N{...}" allows namespaces (see charnames).
945
946 · Level 3 - Tailored Support
947
948 RL3.1 Tailored Punctuation - MISSING
949 RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
950 RL3.3 Tailored Word Boundaries - MISSING
951 RL3.4 Tailored Loose Matches - MISSING
952 RL3.5 Tailored Ranges - MISSING
953 RL3.6 Context Matching - MISSING [19]
954 RL3.7 Incremental Matches - MISSING
955 ( RL3.8 Unicode Set Sharing )
956 RL3.9 Possible Match Sets - MISSING
957 RL3.10 Folded Matching - MISSING [20]
958 RL3.11 Submatchers - MISSING
959
960 [17] see UAX#10 "Unicode Collation Algorithms"
961 [18] have Unicode::Collate but not integrated to regexes
962 [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see
963 outside of the target substring
964 [20] need insensitive matching for linguistic features other than case;
965 for example, hiragana to katakana, wide and narrow, simplified Han
966 to traditional Han (see UTR#30 "Character Foldings")
967
968 Unicode Encodings
969 Unicode characters are assigned to code points, which are abstract
970 numbers. To use these numbers, various encodings are needed.
971
972 · UTF-8
973
974 UTF-8 is a variable-length (1 to 6 bytes, current character
975 allocations require 4 bytes), byte-order independent encoding. For
976 ASCII (and we really do mean 7-bit ASCII, not another 8-bit
977 encoding), UTF-8 is transparent.
978
979 The following table is from Unicode 3.2.
980
981 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
982
983 U+0000..U+007F 00..7F
984 U+0080..U+07FF * C2..DF 80..BF
985 U+0800..U+0FFF E0 * A0..BF 80..BF
986 U+1000..U+CFFF E1..EC 80..BF 80..BF
987 U+D000..U+D7FF ED 80..9F 80..BF
988 U+D800..U+DFFF +++++++ utf16 surrogates, not legal utf8 +++++++
989 U+E000..U+FFFF EE..EF 80..BF 80..BF
990 U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF
991 U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
992 U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
993
994 Note the gaps before several of the byte entries above marked by
995 '*'. These are caused by legal UTF-8 avoiding non-shortest
996 encodings: it is technically possible to UTF-8-encode a single code
997 point in different ways, but that is explicitly forbidden, and the
998 shortest possible encoding should always be used (and that is what
999 Perl does).
1000
1001 Another way to look at it is via bits:
1002
1003 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
1004
1005 0aaaaaaa 0aaaaaaa
1006 00000bbbbbaaaaaa 110bbbbb 10aaaaaa
1007 ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa
1008 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
1009
1010 As you can see, the continuation bytes all begin with "10", and the
1011 leading bits of the start byte tell how many bytes there are in the
1012 encoded character.
1013
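The bit layout above can be verified directly. A small sketch in
Python, hand-assembling the byte forms and checking them against a
standard UTF-8 encoder:

```python
# Sketch: encode a code point by hand per the bit patterns above,
# then compare with Python's built-in UTF-8 encoder.
def utf8_bytes(cp):
    if cp < 0x80:                      # 0aaaaaaa
        return bytes([cp])
    if cp < 0x800:                     # 110bbbbb 10aaaaaa
        return bytes([0xC0 | cp >> 6,
                      0x80 | cp & 0x3F])
    if cp < 0x10000:                   # 1110cccc 10bbbbbb 10aaaaaa
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,     # 11110ddd 10cccccc ...
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

for cp in (0x41, 0x7FF, 0x2660, 0x10FFFF):
    assert utf8_bytes(cp) == chr(cp).encode("utf-8")
```

Note that a real encoder must also reject the surrogate range
U+D800..U+DFFF, which this sketch omits.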
1014 · UTF-EBCDIC
1015
1016 Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
1017
1018 · UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
1019
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
1022
1023 UTF-16 is a 2 or 4 byte encoding. The Unicode code points
1024 "U+0000..U+FFFF" are stored in a single 16-bit unit, and the code
1025 points "U+10000..U+10FFFF" in two 16-bit units. The latter case is
1026 using surrogates, the first 16-bit unit being the high surrogate,
1027 and the second being the low surrogate.
1028
1029 Surrogates are code points set aside to encode the
1030 "U+10000..U+10FFFF" range of Unicode code points in pairs of 16-bit
1031 units. The high surrogates are the range "U+D800..U+DBFF" and the
1032 low surrogates are the range "U+DC00..U+DFFF". The surrogate
1033 encoding is
1034
1035 $hi = ($uni - 0x10000) / 0x400 + 0xD800;
1036 $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
1037
1038 and the decoding is
1039
1040 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
1041
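The encode and decode arithmetic above round-trips exactly; a
quick sketch in Python:

```python
# Sketch: the UTF-16 surrogate-pair arithmetic shown above.
def to_surrogates(cp):
    hi = (cp - 0x10000) // 0x400 + 0xD800   # high surrogate
    lo = (cp - 0x10000) % 0x400 + 0xDC00    # low surrogate
    return hi, lo

def from_surrogates(hi, lo):
    return 0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)

hi, lo = to_surrogates(0x10437)   # U+10437 DESERET SMALL LETTER YEE
assert (hi, lo) == (0xD801, 0xDC37)
assert from_surrogates(hi, lo) == 0x10437
```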
1042 If you try to generate surrogates (for example by using chr()), you
1043 will get a warning, if warnings are turned on, because those code
1044 points are not valid for a Unicode character.
1045
1046 Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
1047 itself can be used for in-memory computations, but if storage or
1048 transfer is required either UTF-16BE (big-endian) or UTF-16LE
1049 (little-endian) encodings must be chosen.
1050
1051 This introduces another problem: what if you just know that your
1052 data is UTF-16, but you don't know which endianness? Byte Order
1053 Marks, or BOMs, are a solution to this. A special character has
1054 been reserved in Unicode to function as a byte order marker: the
1055 character with the code point "U+FEFF" is the BOM.
1056
1057 The trick is that if you read a BOM, you will know the byte order,
1058 since if it was written on a big-endian platform, you will read the
1059 bytes "0xFE 0xFF", but if it was written on a little-endian
1060 platform, you will read the bytes "0xFF 0xFE". (And if the
1061 originating platform was writing in UTF-8, you will read the bytes
1062 "0xEF 0xBB 0xBF".)
1063
The way this trick works is that the character with the code point
"U+FFFE" is guaranteed not to be a valid Unicode character, so the
sequence of bytes "0xFF 0xFE" is unambiguously "BOM, represented
in little-endian format" and cannot be "U+FFFE, represented in
big-endian format". (Actually, "U+FFFE" is legal for use by your
program, even for input/output, but better not use it if you need
a BOM. But it is "illegal for interchange", so an unsuspecting
program won't get confused.)
1072
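A byte-order sniffer following this logic might look like the
sketch below (illustrative only; real code should fall back to
context or metadata when no BOM is present, and would need to test
the longer UTF-32 signatures before the UTF-16 ones):

```python
# Sketch: guess an encoding from a leading BOM, per the byte
# sequences described above. Returns None when no BOM is found.
def sniff_bom(data):
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    return None

# U+FEFF encoded as UTF-8 yields the bytes 0xEF 0xBB 0xBF.
assert sniff_bom("\N{ZERO WIDTH NO-BREAK SPACE}hi".encode("utf-8")) == "UTF-8"
```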
1073 · UTF-32, UTF-32BE, UTF-32LE
1074
The UTF-32 family is pretty much like the UTF-16 family, except
that the units are 32-bit, and therefore the surrogate scheme is
not needed. The BOM signatures will be "0x00 0x00 0xFE 0xFF" for
BE and "0xFF 0xFE 0x00 0x00" for LE.
1079
1080 · UCS-2, UCS-4
1081
1082 Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
1083 encoding. Unlike UTF-16, UCS-2 is not extensible beyond "U+FFFF",
1084 because it does not use surrogates. UCS-4 is a 32-bit encoding,
1085 functionally identical to UTF-32.
1086
1087 · UTF-7
1088
1089 A seven-bit safe (non-eight-bit) encoding, which is useful if the
1090 transport or storage is not eight-bit safe. Defined by RFC 2152.
1091
1092 Security Implications of Unicode
1093 Read Unicode Security Considerations
1094 <http://www.unicode.org/reports/tr36>. Also, note the following:
1095
1096 · Malformed UTF-8
1097
1098 Unfortunately, the specification of UTF-8 leaves some room for
1099 interpretation of how many bytes of encoded output one should
1100 generate from one input Unicode character. Strictly speaking, the
1101 shortest possible sequence of UTF-8 bytes should be generated,
1102 because otherwise there is potential for an input buffer overflow
1103 at the receiving end of a UTF-8 connection. Perl always generates
1104 the shortest length UTF-8, and with warnings on, Perl will warn
1105 about non-shortest length UTF-8 along with other malformations,
1106 such as the surrogates, which are not real Unicode code points.
1107
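Strict decoders reject such non-shortest ("overlong") forms
outright; Python's UTF-8 decoder, for example:

```python
# Sketch: b"\xc0\xaf" is an overlong (non-shortest) encoding of
# "/" (U+002F); a strict UTF-8 decoder must reject it.
overlong_slash = b"\xc0\xaf"
try:
    overlong_slash.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected

# The shortest form decodes fine.
assert b"\x2f".decode("utf-8") == "/"
```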
· Regular expressions behave slightly differently between byte
  data and character (Unicode) data. For example, the "word
  character" character class "\w" will work differently depending
  on whether the data is eight-bit bytes or Unicode.
1112
1113 In the first case, the set of "\w" characters is either small--the
1114 default set of alphabetic characters, digits, and the "_"--or, if
1115 you are using a locale (see perllocale), the "\w" might contain a
1116 few more letters according to your language and country.
1117
1118 In the second case, the "\w" set of characters is much, much
1119 larger. Most importantly, even in the set of the first 256
1120 characters, it will probably match different characters: unlike
1121 most locales, which are specific to a language and country pair,
1122 Unicode classifies all the characters that are letters somewhere as
1123 "\w". For example, your locale might not think that LATIN SMALL
1124 LETTER ETH is a letter (unless you happen to speak Icelandic), but
1125 Unicode does.
1126
1127 As discussed elsewhere, Perl has one foot (two hooves?) planted in
1128 each of two worlds: the old world of bytes and the new world of
1129 characters, upgrading from bytes to characters when necessary. If
1130 your legacy code does not explicitly use Unicode, no automatic
1131 switch-over to characters should happen. Characters shouldn't get
1132 downgraded to bytes, either. It is possible to accidentally mix
1133 bytes and characters, however (see perluniintro), in which case
1134 "\w" in regular expressions might start behaving differently.
1135 Review your code. Use warnings and the "strict" pragma.
1136
1137 Unicode in Perl on EBCDIC
1138 The way Unicode is handled on EBCDIC platforms is still experimental.
1139 On such platforms, references to UTF-8 encoding in this document and
1140 elsewhere should be read as meaning the UTF-EBCDIC specified in Unicode
1141 Technical Report 16, unless ASCII vs. EBCDIC issues are specifically
1142 discussed. There is no "utfebcdic" pragma or ":utfebcdic" layer;
1143 rather, "utf8" and ":utf8" are reused to mean the platform's "natural"
1144 8-bit encoding of Unicode. See perlebcdic for more discussion of the
1145 issues.
1146
1147 Locales
1148 Usually locale settings and Unicode do not affect each other, but there
1149 are a couple of exceptions:
1150
1151 · You can enable automatic UTF-8-ification of your standard file
1152 handles, default "open()" layer, and @ARGV by using either the "-C"
1153 command line switch or the "PERL_UNICODE" environment variable, see
1154 perlrun for the documentation of the "-C" switch.
1155
1156 · Perl tries really hard to work both with Unicode and the old byte-
1157 oriented world. Most often this is nice, but sometimes Perl's
1158 straddling of the proverbial fence causes problems.
1159
1160 When Unicode Does Not Happen
While Perl does have extensive ways to input and output in
Unicode, and a few other 'entry points' like @ARGV which can be
interpreted as Unicode (UTF-8), there still are many places where
Unicode (in some encoding or another) could be given as arguments
or received as results, or both, but it is not.
1166
1167 The following are such interfaces. Also, see "The "Unicode Bug"". For
1168 all of these interfaces Perl currently (as of 5.8.3) simply assumes
1169 byte strings both as arguments and results, or UTF-8 strings if the
1170 "encoding" pragma has been used.
1171
1172 One reason why Perl does not attempt to resolve the role of Unicode in
1173 these cases is that the answers are highly dependent on the operating
1174 system and the file system(s). For example, whether filenames can be
1175 in Unicode, and in exactly what kind of encoding, is not exactly a
1176 portable concept. Similarly for the qx and system: how well will the
1177 'command line interface' (and which of them?) handle Unicode?
1178
1179 · chdir, chmod, chown, chroot, exec, link, lstat, mkdir, rename,
1180 rmdir, stat, symlink, truncate, unlink, utime, -X
1181
1182 · %ENV
1183
1184 · glob (aka the <*>)
1185
1186 · open, opendir, sysopen
1187
1188 · qx (aka the backtick operator), system
1189
1190 · readdir, readlink
1191
1192 The "Unicode Bug"
The term "Unicode bug" has been applied to an inconsistency with
the Unicode characters whose ordinals are in the Latin-1
Supplement block, that is, between 128 and 255. Without a locale
specified, and unlike all other characters or code points, these
characters behave very differently under byte semantics versus
character semantics.
1198
1199 In character semantics they are interpreted as Unicode code points,
1200 which means they have the same semantics as Latin-1 (ISO-8859-1).
1201
1202 In byte semantics, they are considered to be unassigned characters,
1203 meaning that the only semantics they have is their ordinal numbers, and
1204 that they are not members of various character classes. None are
1205 considered to match "\w" for example, but all match "\W". (On EBCDIC
1206 platforms, the behavior may be different from this, depending on the
1207 underlying C language library functions.)
1208
1209 The behavior is known to have effects on these areas:
1210
1211 · Changing the case of a scalar, that is, using "uc()", "ucfirst()",
1212 "lc()", and "lcfirst()", or "\L", "\U", "\u" and "\l" in regular
1213 expression substitutions.
1214
1215 · Using caseless ("/i") regular expression matching
1216
1217 · Matching a number of properties in regular expressions, such as
1218 "\w"
1219
1220 · User-defined case change mappings. You can create a "ToUpper()"
1221 function, for example, which overrides Perl's built-in case
1222 mappings. The scalar must be encoded in utf8 for your function to
1223 actually be invoked.
1224
This behavior can lead to unexpected results: a string's semantics
can suddenly change from byte to character (or vice versa) merely
because a code point above 255 is appended to or removed from it.
As an example, consider the following program and its output:
1230
1231 $ perl -le'
1232 $s1 = "\xC2";
1233 $s2 = "\x{2660}";
1234 for ($s1, $s2, $s1.$s2) {
1235 print /\w/ || 0;
1236 }
1237 '
1238 0
1239 0
1240 1
1241
If there's no "\w" in $s1 or in $s2, why does their concatenation
have one?
1244
1245 This anomaly stems from Perl's attempt to not disturb older programs
1246 that didn't use Unicode, and hence had no semantics for characters
1247 outside of the ASCII range (except in a locale), along with Perl's
1248 desire to add Unicode support seamlessly. The result wasn't seamless:
1249 these characters were orphaned.
1250
1251 Work is being done to correct this, but only some of it was complete in
1252 time for the 5.12 release. What has been finished is the important
1253 part of the case changing component. Due to concerns, and some
1254 evidence, that older code might have come to rely on the existing
1255 behavior, the new behavior must be explicitly enabled by the feature
1256 "unicode_strings" in the feature pragma, even though no new syntax is
1257 involved.
1258
1259 See "lc" in perlfunc for details on how this pragma works in
1260 combination with various others for casing. Even though the pragma
1261 only affects casing operations in the 5.12 release, it is planned to
1262 have it affect all the problematic behaviors in later releases: you
1263 can't have one without them all.
1264
1265 In the meantime, a workaround is to always call utf8::upgrade($string),
1266 or to use the standard module Encode. Also, a scalar that has any
1267 characters whose ordinal is above 0x100, or which were specified using
1268 either of the "\N{...}" notations will automatically have character
1269 semantics.
1270
1271 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1272 Sometimes (see "When Unicode Does Not Happen" or "The "Unicode Bug"")
1273 there are situations where you simply need to force a byte string into
1274 UTF-8, or vice versa. The low-level calls utf8::upgrade($bytestring)
1275 and utf8::downgrade($utf8string[, FAIL_OK]) are the answers.
1276
1277 Note that utf8::downgrade() can fail if the string contains characters
1278 that don't fit into a byte.
1279
1280 Calling either function on a string that already is in the desired
1281 state is a no-op.
1282
1283 Using Unicode in XS
1284 If you want to handle Perl Unicode in XS extensions, you may find the
1285 following C APIs useful. See also "Unicode Support" in perlguts for an
1286 explanation about Unicode at the XS level, and perlapi for the API
1287 details.
1288
1289 · "DO_UTF8(sv)" returns true if the "UTF8" flag is on and the bytes
1290 pragma is not in effect. "SvUTF8(sv)" returns true if the "UTF8"
1291 flag is on; the bytes pragma is ignored. The "UTF8" flag being on
1292 does not mean that there are any characters of code points greater
1293 than 255 (or 127) in the scalar or that there are even any
1294 characters in the scalar. What the "UTF8" flag means is that the
1295 sequence of octets in the representation of the scalar is the
1296 sequence of UTF-8 encoded code points of the characters of a
1297 string. The "UTF8" flag being off means that each octet in this
1298 representation encodes a single character with code point 0..255
1299 within the string. Perl's Unicode model is not to use UTF-8 until
1300 it is absolutely necessary.
1301
1302 · "uvchr_to_utf8(buf, chr)" writes a Unicode character code point
1303 into a buffer encoding the code point as UTF-8, and returns a
1304 pointer pointing after the UTF-8 bytes. It works appropriately on
1305 EBCDIC machines.
1306
1307 · "utf8_to_uvchr(buf, lenp)" reads UTF-8 encoded bytes from a buffer
1308 and returns the Unicode character code point and, optionally, the
1309 length of the UTF-8 byte sequence. It works appropriately on
1310 EBCDIC machines.
1311
1312 · "utf8_length(start, end)" returns the length of the UTF-8 encoded
1313 buffer in characters. "sv_len_utf8(sv)" returns the length of the
1314 UTF-8 encoded scalar.
1315
1316 · "sv_utf8_upgrade(sv)" converts the string of the scalar to its
1317 UTF-8 encoded form. "sv_utf8_downgrade(sv)" does the opposite, if
1318 possible. "sv_utf8_encode(sv)" is like sv_utf8_upgrade except that
1319 it does not set the "UTF8" flag. "sv_utf8_decode()" does the
1320 opposite of "sv_utf8_encode()". Note that none of these are to be
1321 used as general-purpose encoding or decoding interfaces: "use
1322 Encode" for that. "sv_utf8_upgrade()" is affected by the encoding
1323 pragma but "sv_utf8_downgrade()" is not (since the encoding pragma
1324 is designed to be a one-way street).
1325
· "is_utf8_char(s)" returns true if the pointer points to a valid
  UTF-8 character.
1328
1329 · "is_utf8_string(buf, len)" returns true if "len" bytes of the
1330 buffer are valid UTF-8.
1331
1332 · "UTF8SKIP(buf)" will return the number of bytes in the UTF-8
1333 encoded character in the buffer. "UNISKIP(chr)" will return the
1334 number of bytes required to UTF-8-encode the Unicode character code
1335 point. "UTF8SKIP()" is useful for example for iterating over the
1336 characters of a UTF-8 encoded buffer; "UNISKIP()" is useful, for
1337 example, in computing the size required for a UTF-8 encoded buffer.
1338
1339 · "utf8_distance(a, b)" will tell the distance in characters between
1340 the two pointers pointing to the same UTF-8 encoded buffer.
1341
1342 · "utf8_hop(s, off)" will return a pointer to a UTF-8 encoded buffer
1343 that is "off" (positive or negative) Unicode characters displaced
1344 from the UTF-8 buffer "s". Be careful not to overstep the buffer:
1345 "utf8_hop()" will merrily run off the end or the beginning of the
1346 buffer if told to do so.
1347
1348 · "pv_uni_display(dsv, spv, len, pvlim, flags)" and
1349 "sv_uni_display(dsv, ssv, pvlim, flags)" are useful for debugging
1350 the output of Unicode strings and scalars. By default they are
1351 useful only for debugging--they display all characters as
1352 hexadecimal code points--but with the flags "UNI_DISPLAY_ISPRINT",
1353 "UNI_DISPLAY_BACKSLASH", and "UNI_DISPLAY_QQ" you can make the
1354 output more readable.
1355
1356 · "ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)" can be used to
1357 compare two strings case-insensitively in Unicode. For case-
1358 sensitive comparisons you can just use "memEQ()" and "memNE()" as
1359 usual.
1360
1361 For more information, see perlapi, and utf8.c and utf8.h in the Perl
1362 source code distribution.
1363
1364 Hacking Perl to work on earlier Unicode versions (for very serious hackers
1365 only)
1366 Perl by default comes with the latest supported Unicode version built
1367 in, but you can change to use any earlier one.
1368
Download the files in the version of Unicode that you want from
the Unicode web site <http://www.unicode.org>. These should
replace the existing files in $Config{privlib}/unicore. (%Config
is available from the Config module.) Follow the instructions in
README.perl in that directory to change some of their names, and
then run make.
1375
It is even possible to download them to a different directory and
then change utf8_heavy.pl in the directory $Config{privlib} to
point to the new directory; or make a copy of that directory
before making the change, and use @INC or the "-I" run-time flag
to switch between versions at will (though, because of caching,
not in the middle of a process). All this is beyond the scope of
these instructions.
1382
1384 Interaction with Locales
1385 Use of locales with Unicode data may lead to odd results. Currently,
1386 Perl attempts to attach 8-bit locale info to characters in the range
1387 0..255, but this technique is demonstrably incorrect for locales that
1388 use characters above that range when mapped into Unicode. Perl's
1389 Unicode support will also tend to run slower. Use of locales with
1390 Unicode is discouraged.
1391
1392 Problems with characters in the Latin-1 Supplement range
1393 See "The "Unicode Bug""
1394
1395 Problems with case-insensitive regular expression matching
1396 There are problems with case-insensitive matches, including those
1397 involving character classes (enclosed in [square brackets]), characters
1398 whose fold is to multiple characters (such as the single character
1399 LATIN SMALL LIGATURE FFL matches case-insensitively with the
1400 3-character string "ffl"), and characters in the Latin-1 Supplement.
1401
1402 Interaction with Extensions
1403 When Perl exchanges data with an extension, the extension should be
1404 able to understand the UTF8 flag and act accordingly. If the extension
1405 doesn't know about the flag, it's likely that the extension will return
1406 incorrectly-flagged data.
1407
So if you're working with Unicode data, consult the documentation
of every module you're using to see whether there are any issues
with Unicode data exchange. If the documentation does not talk
about Unicode at all, suspect the worst and probably look at the
source to learn how the module is implemented. Modules written
completely in Perl shouldn't cause problems. Modules that directly
or indirectly access code written in other programming languages
are at risk.
1415
1416 For affected functions, the simple strategy to avoid data corruption is
1417 to always make the encoding of the exchanged data explicit. Choose an
1418 encoding that you know the extension can handle. Convert arguments
1419 passed to the extensions to that encoding and convert results back from
1420 that encoding. Write wrapper functions that do the conversions for you,
1421 so you can later change the functions when the extension catches up.

       To provide an example, let's say the popular Foo::Bar::escape_html
       function doesn't deal with Unicode data yet. The wrapper function would
       convert the argument to raw UTF-8 and convert the result back to Perl's
       internal representation like so:

           sub my_escape_html ($) {
               my $what = shift;
               return unless defined $what;
               return Encode::decode_utf8(
                   Foo::Bar::escape_html(Encode::encode_utf8($what)));
           }

       Sometimes, when the extension does not convert data but just stores and
       retrieves them, you will be in a position to use the otherwise
       dangerous Encode::_utf8_on() function. Let's say the popular "Foo::Bar"
       extension, written in C, provides a "param" method that lets you store
       and retrieve data according to these prototypes:

           $self->param($name, $value);   # set a scalar
           $value = $self->param($name);  # retrieve a scalar

       If it does not yet provide support for any encoding, one could write a
       derived class with such a "param" method:

           sub param {
               my ($self, $name, $value) = @_;
               utf8::upgrade($name);          # make sure it is UTF-8 encoded
               if (defined $value) {
                   utf8::upgrade($value);     # make sure it is UTF-8 encoded
                   return $self->SUPER::param($name, $value);
               } else {
                   my $ret = $self->SUPER::param($name);
                   Encode::_utf8_on($ret);    # we know it is UTF-8 encoded
                   return $ret;
               }
           }

       Some extensions provide filters on data entry/exit points, such as
       DB_File::filter_store_key and family. Look out for such filters in the
       documentation of your extensions; they can make the transition to
       Unicode data much easier.
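       To sketch what such a filter pair accomplishes (hypothetical helper
       names; a plain hash stands in for the tied DB_File hash, which can
       store only octets):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my %db;    # stands in for a tied DB_File hash: octets in, octets out

# A store filter encodes characters to UTF-8 octets on the way in ...
sub store_kv {
    my ($key, $value) = @_;
    $db{ encode_utf8($key) } = encode_utf8($value);
}

# ... and a fetch filter decodes octets back to characters on the way out.
sub fetch_v {
    my ($key) = @_;
    return decode_utf8( $db{ encode_utf8($key) } );
}

store_kv("cl\x{E9}", "caf\x{E9}");   # 'clé' => 'café'
print fetch_v("cl\x{E9}"), "\n";
```

       With DB_File itself, the same conversions would be installed once via
       $db->filter_store_key(), filter_store_value(), filter_fetch_key() and
       filter_fetch_value(), so every access through the tied hash is
       converted automatically.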
   Speed
       Some functions are slower when working on UTF-8 encoded strings than on
       byte encoded strings. All functions that need to hop over characters,
       such as length(), substr() or index(), or matching regular expressions,
       can work much faster when the underlying data are byte-encoded.

       In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 a
       caching scheme was introduced which will hopefully make the slowness
       somewhat less spectacular, at least for some operations. In general,
       operations with UTF-8 encoded strings are still slower. As an example,
       the Unicode properties (character classes) like "\p{Nd}" are known to
       be quite a bit slower (5-20 times) than their simpler counterparts like
       "\d" (then again, there are 268 Unicode characters matching "Nd"
       compared with the 10 ASCII characters matching "d").
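       The extra work is easy to see: "\p{Nd}" has to recognize decimal digits
       from every script, not just ASCII. A minimal sketch (ARABIC-INDIC DIGIT
       FIVE is one of the non-ASCII characters in "Nd"):

```perl
use strict;
use warnings;

my $ascii_five  = "5";
my $arabic_five = "\x{0665}";   # ARABIC-INDIC DIGIT FIVE

# Both are decimal digits as far as the Unicode property is concerned:
for my $ch ($ascii_five, $arabic_five) {
    printf "U+%04X is \\p{Nd}\n", ord $ch if $ch =~ /\p{Nd}/;
}
```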
   Problems on EBCDIC platforms
       There are a number of known problems with Perl on EBCDIC platforms. If
       you want to use Perl there, send email to perlbug@perl.org.

       In earlier versions, when byte and character data were concatenated,
       the new string was sometimes created by decoding the byte strings as
       ISO 8859-1 (Latin-1), even if the old Unicode string used EBCDIC.

       If you find any of these problems, please report them as bugs.

   Porting code from perl-5.6.X
       Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
       was required to use the "utf8" pragma to declare that a given scope
       expected to deal with Unicode data and had to make sure that only
       Unicode data were reaching that scope. If you have code that is working
       with 5.6, you will need some of the following adjustments to your code.
       The examples are written such that the code will continue to work under
       5.6, so you should be safe to try them out.

       ·   A filehandle that should read or write UTF-8

               if ($] > 5.007) {
                   binmode $fh, ":encoding(utf8)";
               }

       ·   A scalar that is going to be passed to some extension

           Be it Compress::Zlib, Apache::Request or any extension that has no
           mention of Unicode in the manpage, you need to make sure that the
           UTF8 flag is stripped off. Note that at the time of this writing
           (October 2002) the mentioned modules are not UTF-8-aware. Please
           check the documentation to verify if this is still true.

               if ($] > 5.007) {
                   require Encode;
                   $val = Encode::encode_utf8($val);  # make octets
               }

       ·   A scalar we got back from an extension

           If you believe the scalar comes back as UTF-8, you will most likely
           want the UTF8 flag restored:

               if ($] > 5.007) {
                   require Encode;
                   $val = Encode::decode_utf8($val);
               }

       ·   Same thing, if you are really sure it is UTF-8

               if ($] > 5.007) {
                   require Encode;
                   Encode::_utf8_on($val);
               }

       ·   A wrapper for fetchrow_array and fetchrow_hashref

           When the database contains only UTF-8, a wrapper function or method
           is a convenient way to replace all your fetchrow_array and
           fetchrow_hashref calls. A wrapper function will also make it easier
           to adapt to future enhancements in your database driver. Note that
           at the time of this writing (October 2002), the DBI has no
           standardized way to deal with UTF-8 data. Please check the
           documentation to verify if that is still true.

               sub fetchrow {
                   # $what is one of fetchrow_{array,hashref}
                   my ($self, $sth, $what) = @_;
                   if ($] < 5.007) {
                       return $sth->$what;
                   } else {
                       require Encode;
                       if (wantarray) {
                           my @arr = $sth->$what;
                           for (@arr) {
                               defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                           }
                           return @arr;
                       } else {
                           my $ret = $sth->$what;
                           if (ref $ret) {
                               for my $k (keys %$ret) {
                                   defined && /[^\000-\177]/ && Encode::_utf8_on($_)
                                       for $ret->{$k};
                               }
                               return $ret;
                           } else {
                               defined && /[^\000-\177]/ && Encode::_utf8_on($_)
                                   for $ret;
                               return $ret;
                           }
                       }
                   }
               }

       ·   A large scalar that you know can only contain ASCII

           Scalars that contain only ASCII and are marked as UTF-8 are
           sometimes a drag to your program. If you recognize such a
           situation, just remove the UTF8 flag:

               utf8::downgrade($val) if $] > 5.007;
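           A sketch of the flag's behavior: upgrading an ASCII-only scalar
           sets the UTF8 flag without changing its contents, and downgrading
           it again is always safe:

```perl
use strict;
use warnings;

my $val = "plain ASCII text";

utf8::upgrade($val);     # now stored as UTF-8 internally, flag on
print "flagged\n" if utf8::is_utf8($val);

utf8::downgrade($val);   # ASCII needs no UTF-8 representation; flag off
print "unflagged\n" unless utf8::is_utf8($val);
```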

SEE ALSO
       perlunitut, perluniintro, perluniprops, Encode, open, utf8, bytes,
       perlretut, "${^UNICODE}" in perlvar,
       <http://www.unicode.org/reports/tr44>.



perl v5.12.4                      2011-06-07                   PERLUNICODE(1)