PERLUNICODE(1)         Perl Programmers Reference Guide         PERLUNICODE(1)

NAME

       perlunicode - Unicode support in Perl

DESCRIPTION

9   Important Caveats
10       Unicode support is an extensive requirement. While Perl does not
11       implement the Unicode standard or the accompanying technical reports
12       from cover to cover, Perl does support many Unicode features.
13
       People who want to learn to use Unicode in Perl should probably read
       the Perl Unicode tutorial, perlunitut, before reading this reference
       document.
17
18       Input and Output Layers
19           Perl knows when a filehandle uses Perl's internal Unicode encodings
20           (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened
21           with the ":utf8" layer.  Other encodings can be converted to Perl's
22           encoding on input or from Perl's encoding on output by use of the
23           ":encoding(...)"  layer.  See open.
24
25           To indicate that Perl source itself is in UTF-8, use "use utf8;".
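
           As a minimal sketch of the layers described above, the following
           writes one character through an ":encoding(UTF-8)" layer and reads
           it back.  The in-memory filehandle and the variable names are
           assumptions made only to keep the example self-contained; a real
           program would normally name a file.

```perl
use strict;
use warnings;

# Write one character above 255 through an encoding layer into an
# in-memory file, so the example needs no files on disk.
my $text = "\x{263A}";                      # WHITE SMILING FACE, one character
open(my $out, '>:encoding(UTF-8)', \my $buffer) or die "open: $!";
print {$out} $text;
close($out) or die "close: $!";

# The buffer now holds the encoded bytes, not characters.
my $byte_count = length($buffer);

# Reading back through the same layer decodes the bytes into characters.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open: $!";
my $round_trip = <$in>;
close($in) or die "close: $!";

printf "%d bytes stored, %d character read back\n",
    $byte_count, length($round_trip);
```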
26
27       Regular Expressions
28           The regular expression compiler produces polymorphic opcodes.  That
29           is, the pattern adapts to the data and automatically switches to
30           the Unicode character scheme when presented with data that is
31           internally encoded in UTF-8 -- or instead uses a traditional byte
32           scheme when presented with byte data.
33
34       "use utf8" still needed to enable UTF-8/UTF-EBCDIC in scripts
35           As a compatibility measure, the "use utf8" pragma must be
36           explicitly included to enable recognition of UTF-8 in the Perl
37           scripts themselves (in string or regular expression literals, or in
38           identifier names) on ASCII-based machines or to recognize UTF-
39           EBCDIC on EBCDIC-based machines.  These are the only times when an
40           explicit "use utf8" is needed.  See utf8.
41
42       BOM-marked scripts and UTF-16 scripts autodetected
           If a Perl script begins with the Unicode BOM (UTF-16LE,
           UTF-16BE, or UTF-8), or if the script looks like non-BOM-marked
           UTF-16 of either endianness, Perl will correctly read in the script
           as Unicode.  (BOMless UTF-8 cannot be effectively recognized or
           differentiated from ISO 8859-1 or other eight-bit encodings.)
48
49       "use encoding" needed to upgrade non-Latin-1 byte strings
           By default, there is a fundamental asymmetry in Perl's Unicode
           model: implicit upgrading from byte strings to Unicode strings
           assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode
           strings are downgraded with UTF-8 encoding.  This happens because
           the first 256 code points in Unicode happen to agree with Latin-1.
55
56           See "Byte and Character Semantics" for more details.
57
58   Byte and Character Semantics
59       Beginning with version 5.6, Perl uses logically-wide characters to
60       represent strings internally.
61
62       In future, Perl-level operations will be expected to work with
63       characters rather than bytes.
64
65       However, as an interim compatibility measure, Perl aims to provide a
66       safe migration path from byte semantics to character semantics for
67       programs.  For operations where Perl can unambiguously decide that the
68       input data are characters, Perl switches to character semantics.  For
69       operations where this determination cannot be made without additional
70       information from the user, Perl decides in favor of compatibility and
71       chooses to use byte semantics.
72
73       Under byte semantics, when "use locale" is in effect, Perl uses the
74       semantics associated with the current locale.  Absent a "use locale",
75       Perl currently uses US-ASCII (or Basic Latin in Unicode terminology)
76       byte semantics, meaning that characters whose ordinal numbers are in
77       the range 128 - 255 are undefined except for their ordinal numbers.
78       This means that none have case (upper and lower), nor are any a member
79       of character classes, like "[:alpha:]" or "\w".  (But all do belong to
80       the "\W" class or the Perl regular expression extension "[:^alpha:]".)
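
       The byte semantics described above can be observed with a short
       sketch.  It assumes that neither "use locale" nor any pragma that
       changes the default string semantics is in effect; the variable names
       are illustrative only.

```perl
use strict;
use warnings;

# "\xE9" is e-acute in Latin-1.  With no wider characters present, Perl
# keeps it as a byte string and applies byte semantics to it.
my $byte = "\xE9";

my $word_as_byte    = $byte =~ /\w/           ? 1 : 0;
my $alpha_as_byte   = $byte =~ /[[:alpha:]]/  ? 1 : 0;
my $nonword_as_byte = $byte =~ /\W/           ? 1 : 0;
my $uc_is_noop      = uc($byte) eq $byte      ? 1 : 0;

# utf8::upgrade() changes only the internal representation, yet the same
# character is then treated under character semantics.
my $char = $byte;
utf8::upgrade($char);

my $word_as_char = $char =~ /\w/          ? 1 : 0;
my $uc_as_char   = uc($char) eq "\xC9"    ? 1 : 0;

printf "byte string: \\w=%d [:alpha:]=%d \\W=%d uc-unchanged=%d\n",
    $word_as_byte, $alpha_as_byte, $nonword_as_byte, $uc_is_noop;
printf "upgraded:    \\w=%d uc-gives-E-acute=%d\n",
    $word_as_char, $uc_as_char;
```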
81
       This behavior preserves compatibility with earlier versions of Perl,
       which allowed byte semantics in Perl operations only if none of the
       program's inputs were marked as being a source of Unicode character
       data.  Such data may come from filehandles, from calls to external
       programs, from information provided by the system (such as %ENV), or
       from literals and constants in the source text.
88
89       The "bytes" pragma will always, regardless of platform, force byte
90       semantics in a particular lexical scope.  See bytes.
91
92       The "utf8" pragma is primarily a compatibility device that enables
93       recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
94       Note that this pragma is only required while Perl defaults to byte
95       semantics; when character semantics become the default, this pragma may
96       become a no-op.  See utf8.
97
98       Unless explicitly stated, Perl operators use character semantics for
99       Unicode data and byte semantics for non-Unicode data.  The decision to
100       use character semantics is made transparently.  If input data comes
101       from a Unicode source--for example, if a character encoding layer is
102       added to a filehandle or a literal Unicode string constant appears in a
103       program--character semantics apply.  Otherwise, byte semantics are in
104       effect.  The "bytes" pragma should be used to force byte semantics on
105       Unicode data.
106
       If strings operating under byte semantics and strings with Unicode
       character data are concatenated, the new string will have character
       semantics.  This can cause surprises: see "BUGS", below.
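
       A small sketch of that surprise, under the same default settings (no
       "use locale", no pragma changing string semantics).  The variable
       names are assumptions for illustration.

```perl
use strict;
use warnings;

my $bytes = "\xE9";        # one character below 256: byte semantics
my $wide  = "\x{263A}";    # one character above 255: character semantics

my $word_before = $bytes =~ /^\w/ ? 1 : 0;

# Concatenation upgrades the byte string, reading each byte as Latin-1,
# so the first character is still U+00E9 but now has Unicode properties.
my $joined     = $bytes . $wide;
my $word_after = $joined =~ /^\w/ ? 1 : 0;

printf "length=%d first=U+%04X \\w before=%d after=%d\n",
    length($joined), ord($joined), $word_before, $word_after;
```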
110
       Under character semantics, many operations that formerly operated on
       bytes now operate on characters.  A character in Perl is logically just
       a number ranging from 0 to 2**31 or so.  Larger characters may encode
       into longer sequences of bytes internally, but this internal detail is
       mostly hidden from Perl code.  See perluniintro for more.
116
117   Effects of Character Semantics
118       Character semantics have the following effects:
119
120       ·   Strings--including hash keys--and regular expression patterns may
121           contain characters that have an ordinal value larger than 255.
122
           If you use a Unicode editor to edit your program, Unicode
           characters may occur directly within the literal strings in UTF-8
           encoding, or UTF-16.  (The former requires a BOM or "use utf8"; the
           latter requires a BOM.)
127
128           Unicode characters can also be added to a string by using the
129           "\x{...}" notation.  The Unicode code for the desired character, in
130           hexadecimal, should be placed in the braces. For instance, a smiley
131           face is "\x{263A}".  This encoding scheme works for all characters,
132           but for characters under 0x100, note that Perl may use an 8 bit
133           encoding internally, for optimization and/or backward
134           compatibility.
135
           Additionally, if you

              use charnames ':full';

           you can use the "\N{...}" notation and put the official Unicode
           character name within the braces, such as "\N{WHITE SMILING FACE}".
142
143       ·   If an appropriate encoding is specified, identifiers within the
144           Perl script may contain Unicode alphanumeric characters, including
145           ideographs.  Perl does not currently attempt to canonicalize
146           variable names.
147
148       ·   Regular expressions match characters instead of bytes.  "." matches
149           a character instead of a byte.
150
151       ·   Character classes in regular expressions match characters instead
152           of bytes and match against the character properties specified in
153           the Unicode properties database.  "\w" can be used to match a
154           Japanese ideograph, for instance.
155
156       ·   Named Unicode properties, scripts, and block ranges may be used
157           like character classes via the "\p{}" "matches property" construct
158           and the "\P{}" negation, "doesn't match property".
159
160           See "Unicode Character Properties" for more details.
161
162           You can define your own character properties and use them in the
163           regular expression with the "\p{}" or "\P{}" construct.
164
165           See "User-Defined Character Properties" for more details.
166
167       ·   The special pattern "\X" matches any extended Unicode sequence--"a
168           combining character sequence" in Standardese--where the first
169           character is a base character and subsequent characters are mark
170           characters that apply to the base character.  "\X" is equivalent to
171           "(?>\PM\pM*)".
172
173       ·   The "tr///" operator translates characters instead of bytes.  Note
174           that the "tr///CU" functionality has been removed.  For similar
175           functionality see pack('U0', ...) and pack('C0', ...).
176
177       ·   Case translation operators use the Unicode case translation tables
178           when character input is provided.  Note that "uc()", or "\U" in
179           interpolated strings, translates to uppercase, while "ucfirst", or
180           "\u" in interpolated strings, translates to titlecase in languages
181           that make the distinction.
182
183       ·   Most operators that deal with positions or lengths in a string will
184           automatically switch to using character positions, including
185           "chop()", "chomp()", "substr()", "pos()", "index()", "rindex()",
186           "sprintf()", "write()", and "length()".  An operator that
187           specifically does not switch is "vec()".  Operators that really
188           don't care include operators that treat strings as a bucket of bits
189           such as "sort()", and operators dealing with filenames.
190
191       ·   The "pack()"/"unpack()" letter "C" does not change, since it is
192           often used for byte-oriented formats.  Again, think "char" in the C
193           language.
194
195           There is a new "U" specifier that converts between Unicode
196           characters and code points. There is also a "W" specifier that is
197           the equivalent of "chr"/"ord" and properly handles character values
198           even if they are above 255.
199
200       ·   The "chr()" and "ord()" functions work on characters, similar to
201           "pack("W")" and "unpack("W")", not "pack("C")" and "unpack("C")".
202           "pack("C")" and "unpack("C")" are methods for emulating byte-
203           oriented "chr()" and "ord()" on Unicode strings.  While these
204           methods reveal the internal encoding of Unicode strings, that is
205           not something one normally needs to care about at all.
206
       ·   The bit string operators, "& | ^ ~", can operate on character data.
           However, to stay compatible with bit string operations on
           characters that are all less than 256 in ordinal value, one should
           not use "~" (the bit complement) on a string that mixes characters
           below 256 with characters above 255.  Most importantly, DeMorgan's
           laws ("~($x|$y) eq ~$x&~$y" and "~($x&$y) eq ~$x|~$y") will not
           hold.  The reason for this mathematical faux pas is that the
           complement cannot return both the 8-bit (byte-wide) bit complement
           and the full character-wide bit complement.
216
217       ·   lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
218
219           ·       the case mapping is from a single Unicode character to
220                   another single Unicode character, or
221
222           ·       the case mapping is from a single Unicode character to more
223                   than one Unicode character.
224
           Locale-specific case mappings (Lithuanian, Turkish, Azeri) do not
           work, since Perl does not understand the concept of Unicode
           locales.

           See the Unicode Technical Report #21, Case Mappings, for more
           details.

           You can also define your own mappings to be used in lc(),
           lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
233
234           See "User-Defined Case Mappings" for more details.
235
236       ·   And finally, "scalar reverse()" reverses by character rather than
237           by byte.
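
       Several of the effects listed above can be checked with a short
       sketch.  The sample characters and variable names are chosen only for
       illustration, and nothing is printed other than numbers, to avoid
       needing an output layer.

```perl
use strict;
use warnings;
use charnames ':full';

# Two spellings of the same one-character string.
my $smiley = "\x{263A}";
my $named  = "\N{WHITE SMILING FACE}";

# Lengths and positions count characters, not bytes.
my $string   = $smiley . "ab";
my $length   = length($string);
my $tail     = substr($string, 1);
my $reversed = scalar reverse($string);

# chr() and ord() round-trip code points above 255, as does pack("W").
my $code   = ord(chr(0x263A));
my $packed = pack("W", 0x263A);

# uc() gives uppercase and ucfirst() gives titlecase where the two differ:
# U+01C6 (the "dz with caron" digraph) maps to U+01C4 and U+01C5.
my $upper = uc("\x{01C6}");
my $title = ucfirst("\x{01C6}");

# One character may map to several: the "ff" ligature uppercases to "FF".
my $ligature_upper = uc("\x{FB00}");

# \X takes a base character together with its combining marks.
my $combined  = "e\x{0301}";               # "e" + COMBINING ACUTE ACCENT
my ($cluster) = $combined =~ /^(\X)/;

printf "length=%d code=U+%04X upper=U+%04X title=U+%04X cluster=%d\n",
    $length, $code, ord($upper), ord($title), length($cluster);
```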
238
239   Unicode Character Properties
240       Named Unicode properties, scripts, and block ranges may be used like
241       character classes via the "\p{}" "matches property" construct and the
242       "\P{}" negation, "doesn't match property".
243
244       For instance, "\p{Lu}" matches any character with the Unicode "Lu"
245       (Letter, uppercase) property, while "\p{M}" matches any character with
246       an "M" (mark--accents and such) property.  Brackets are not required
247       for single letter properties, so "\p{M}" is equivalent to "\pM". Many
248       predefined properties are available, such as "\p{Mirrored}" and
249       "\p{Tibetan}".
250
       The official Unicode script and block names have spaces and dashes as
       separators, but for convenience you can use dashes, spaces, or
       underbars, and case is unimportant.  It is recommended, however, that
       for consistency you use the following naming: the official Unicode
       script, property, or block name (see below for the additional rules
       that apply to block names) with whitespace and dashes removed, and
       each word written with an uppercase first letter and the rest in
       lowercase.  "Latin-1 Supplement" thus becomes "Latin1Supplement".
259
260       You can also use negation in both "\p{}" and "\P{}" by introducing a
261       caret (^) between the first brace and the property name: "\p{^Tamil}"
262       is equal to "\P{Tamil}".
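
       As a brief sketch of these constructs, the following tests a few
       characters against "\p{}", its brace-less single-letter form, and both
       ways of negating.  The sample characters are assumptions chosen for
       illustration.

```perl
use strict;
use warnings;

my $accent  = "\x{0301}";    # COMBINING ACUTE ACCENT, a mark
my $tamil_a = "\x{0B85}";    # TAMIL LETTER A

my $upper_is_lu    = "A"      =~ /^\p{Lu}$/     ? 1 : 0;
my $lower_is_lu    = "a"      =~ /^\p{Lu}$/     ? 1 : 0;
my $accent_is_mark = $accent  =~ /^\pM$/        ? 1 : 0;   # same as \p{M}
my $tamil_matches  = $tamil_a =~ /^\p{Tamil}$/  ? 1 : 0;

# Both negated spellings accept a character that is not Tamil.
my $caret_form   = "a" =~ /^\p{^Tamil}$/ ? 1 : 0;
my $capital_form = "a" =~ /^\P{Tamil}$/  ? 1 : 0;

printf "Lu: A=%d a=%d  M: accent=%d  Tamil: %d  negated: %d %d\n",
    $upper_is_lu, $lower_is_lu, $accent_is_mark, $tamil_matches,
    $caret_form, $capital_form;
```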
263
264       NOTE: the properties, scripts, and blocks listed here are as of Unicode
265       5.0.0 in July 2006.
266
267       General Category
268           Here are the basic Unicode General Category properties, followed by
269           their long form.  You can use either; "\p{Lu}" and
270           "\p{UppercaseLetter}", for instance, are identical.
271
272               Short       Long
273
274               L           Letter
275               LC          CasedLetter
276               Lu          UppercaseLetter
277               Ll          LowercaseLetter
278               Lt          TitlecaseLetter
279               Lm          ModifierLetter
280               Lo          OtherLetter
281
282               M           Mark
283               Mn          NonspacingMark
284               Mc          SpacingMark
285               Me          EnclosingMark
286
287               N           Number
288               Nd          DecimalNumber
289               Nl          LetterNumber
290               No          OtherNumber
291
292               P           Punctuation
293               Pc          ConnectorPunctuation
294               Pd          DashPunctuation
295               Ps          OpenPunctuation
296               Pe          ClosePunctuation
297               Pi          InitialPunctuation
298                           (may behave like Ps or Pe depending on usage)
299               Pf          FinalPunctuation
300                           (may behave like Ps or Pe depending on usage)
301               Po          OtherPunctuation
302
303               S           Symbol
304               Sm          MathSymbol
305               Sc          CurrencySymbol
306               Sk          ModifierSymbol
307               So          OtherSymbol
308
309               Z           Separator
310               Zs          SpaceSeparator
311               Zl          LineSeparator
312               Zp          ParagraphSeparator
313
314               C           Other
315               Cc          Control
316               Cf          Format
317               Cs          Surrogate   (not usable)
318               Co          PrivateUse
319               Cn          Unassigned
320
321           Single-letter properties match all characters in any of the two-
322           letter sub-properties starting with the same letter.  "LC" and "L&"
323           are special cases, which are aliases for the set of "Ll", "Lu", and
324           "Lt".
325
326           Because Perl hides the need for the user to understand the internal
327           representation of Unicode characters, there is no need to implement
328           the somewhat messy concept of surrogates. "Cs" is therefore not
329           supported.
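
           The relationship between short forms, long forms, and the
           single-letter groupings can be sketched as follows.  The two sample
           characters are assumptions chosen because their categories are
           unambiguous.

```perl
use strict;
use warnings;

my $titlecase = "\x{01C5}";   # a titlecase letter (Lt)
my $digit     = "\x{0966}";   # DEVANAGARI DIGIT ZERO (Nd)

# Short and long forms name the same category.
my $lt_short = $titlecase =~ /^\p{Lt}$/              ? 1 : 0;
my $lt_long  = $titlecase =~ /^\p{TitlecaseLetter}$/ ? 1 : 0;

# Single-letter properties and "LC" cover their sub-properties.
my $lt_in_l  = $titlecase =~ /^\p{L}$/  ? 1 : 0;
my $lt_in_lc = $titlecase =~ /^\p{LC}$/ ? 1 : 0;
my $lt_in_lu = $titlecase =~ /^\p{Lu}$/ ? 1 : 0;

my $nd_in_n  = $digit =~ /^\p{N}$/ ? 1 : 0;
my $nd_in_l  = $digit =~ /^\p{L}$/ ? 1 : 0;

printf "Lt=%d TitlecaseLetter=%d L=%d LC=%d Lu=%d | N=%d L=%d\n",
    $lt_short, $lt_long, $lt_in_l, $lt_in_lc, $lt_in_lu,
    $nd_in_n, $nd_in_l;
```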
330
331       Bidirectional Character Types
332           Because scripts differ in their directionality--Hebrew is written
333           right to left, for example--Unicode supplies these properties in
334           the BidiClass class:
335
336               Property    Meaning
337
338               L           Left-to-Right
339               LRE         Left-to-Right Embedding
340               LRO         Left-to-Right Override
341               R           Right-to-Left
342               AL          Right-to-Left Arabic
343               RLE         Right-to-Left Embedding
344               RLO         Right-to-Left Override
345               PDF         Pop Directional Format
346               EN          European Number
347               ES          European Number Separator
348               ET          European Number Terminator
349               AN          Arabic Number
350               CS          Common Number Separator
351               NSM         Non-Spacing Mark
352               BN          Boundary Neutral
353               B           Paragraph Separator
354               S           Segment Separator
355               WS          Whitespace
356               ON          Other Neutrals
357
358           For example, "\p{BidiClass:R}" matches characters that are normally
359           written right to left.
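
           A minimal sketch of testing bidirectional types, assuming a Hebrew
           letter, an Arabic letter, and a Latin letter as sample input:

```perl
use strict;
use warnings;

my $hebrew_alef = "\x{05D0}";   # HEBREW LETTER ALEF
my $arabic_alef = "\x{0627}";   # ARABIC LETTER ALEF

my $hebrew_is_r  = $hebrew_alef =~ /^\p{BidiClass:R}$/  ? 1 : 0;
my $arabic_is_al = $arabic_alef =~ /^\p{BidiClass:AL}$/ ? 1 : 0;
my $arabic_is_r  = $arabic_alef =~ /^\p{BidiClass:R}$/  ? 1 : 0;
my $latin_is_l   = "A"          =~ /^\p{BidiClass:L}$/  ? 1 : 0;

printf "Hebrew R=%d  Arabic AL=%d R=%d  Latin L=%d\n",
    $hebrew_is_r, $arabic_is_al, $arabic_is_r, $latin_is_l;
```

           Note that Arabic letters carry the separate "AL" type, so a test
           for right-to-left text generally needs to check both "R" and "AL".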
360
361       Scripts
362           The script names which can be used by "\p{...}" and "\P{...}", such
363           as in "\p{Latin}" or "\p{Cyrillic}", are as follows:
364
365               Arabic
366               Armenian
367               Balinese
368               Bengali
369               Bopomofo
370               Braille
371               Buginese
372               Buhid
373               CanadianAboriginal
374               Cherokee
375               Coptic
376               Cuneiform
377               Cypriot
378               Cyrillic
379               Deseret
380               Devanagari
381               Ethiopic
382               Georgian
383               Glagolitic
384               Gothic
385               Greek
386               Gujarati
387               Gurmukhi
388               Han
389               Hangul
390               Hanunoo
391               Hebrew
392               Hiragana
393               Inherited
394               Kannada
395               Katakana
396               Kharoshthi
397               Khmer
398               Lao
399               Latin
400               Limbu
401               LinearB
402               Malayalam
403               Mongolian
404               Myanmar
405               NewTaiLue
406               Nko
407               Ogham
408               OldItalic
409               OldPersian
410               Oriya
411               Osmanya
412               PhagsPa
413               Phoenician
414               Runic
415               Shavian
416               Sinhala
417               SylotiNagri
418               Syriac
419               Tagalog
420               Tagbanwa
421               TaiLe
422               Tamil
423               Telugu
424               Thaana
425               Thai
426               Tibetan
427               Tifinagh
428               Ugaritic
429               Yi
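
           As a sketch of how script names might be used, the following counts
           the characters of each script in a mixed string.  The sample string
           and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# "Perl" in Latin letters, a space, then four Cyrillic letters.
my $mixed = "Perl \x{041F}\x{0435}\x{0440}\x{043B}";

# The "= () =" idiom counts the matches of a /g pattern.
my $latin    = () = $mixed =~ /\p{Latin}/g;
my $cyrillic = () = $mixed =~ /\p{Cyrillic}/g;
my $neither  = () = $mixed =~ /[^\p{Latin}\p{Cyrillic}]/g;

printf "Latin=%d Cyrillic=%d neither=%d\n", $latin, $cyrillic, $neither;
```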
430
431       Extended property classes
432           Extended property classes can supplement the basic properties,
433           defined by the PropList Unicode database:
434
435               ASCIIHexDigit
436               BidiControl
437               Dash
438               Deprecated
439               Diacritic
440               Extender
441               HexDigit
442               Hyphen
443               Ideographic
444               IDSBinaryOperator
445               IDSTrinaryOperator
446               JoinControl
447               LogicalOrderException
448               NoncharacterCodePoint
449               OtherAlphabetic
450               OtherDefaultIgnorableCodePoint
451               OtherGraphemeExtend
452               OtherIDStart
453               OtherIDContinue
454               OtherLowercase
455               OtherMath
456               OtherUppercase
457               PatternSyntax
458               PatternWhiteSpace
459               QuotationMark
460               Radical
461               SoftDotted
462               STerm
463               TerminalPunctuation
464               UnifiedIdeograph
465               VariationSelector
466               WhiteSpace
467
468           and there are further derived properties:
469
470               Alphabetic  =  Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic
471               Lowercase   =  Ll + OtherLowercase
472               Uppercase   =  Lu + OtherUppercase
473               Math        =  Sm + OtherMath
474
475               IDStart     =  Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart
476               IDContinue  =  IDStart + Mn + Mc + Nd + Pc + OtherIDContinue
477
478               DefaultIgnorableCodePoint
479                           =  OtherDefaultIgnorableCodePoint
480                              + Cf + Cc + Cs + Noncharacters + VariationSelector
481                              - WhiteSpace - FFF9..FFFB (Annotation Characters)
482
483               Any         =  Any code points (i.e. U+0000 to U+10FFFF)
484               Assigned    =  Any non-Cn code points (i.e. synonym for \P{Cn})
485               Unassigned  =  Synonym for \p{Cn}
486               ASCII       =  ASCII (i.e. U+0000 to U+007F)
487
488               Common      =  Any character (or unassigned code point)
489                              not explicitly assigned to a script
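
           The difference between a derived property and the general
           categories it builds on can be sketched with U+2160 ROMAN NUMERAL
           ONE, a letter number (Nl), and with U+3040, an unassigned code
           point.  Both sample code points are assumptions chosen for
           illustration.

```perl
use strict;
use warnings;

my $roman_one  = "\x{2160}";   # ROMAN NUMERAL ONE, General Category Nl
my $unassigned = "\x{3040}";   # reserved, not assigned to any character

# Nl counts as Alphabetic even though it is not a letter (L).
my $nl_alphabetic = $roman_one =~ /^\p{Alphabetic}$/ ? 1 : 0;
my $nl_letter     = $roman_one =~ /^\p{L}$/          ? 1 : 0;

# Uppercase is wider than Lu because it adds OtherUppercase.
my $nl_uppercase  = $roman_one =~ /^\p{Uppercase}$/  ? 1 : 0;
my $nl_lu         = $roman_one =~ /^\p{Lu}$/         ? 1 : 0;

# Assigned and Unassigned are complements; Any matches either.
my $gap_assigned   = $unassigned =~ /^\p{Assigned}$/   ? 1 : 0;
my $gap_unassigned = $unassigned =~ /^\p{Unassigned}$/ ? 1 : 0;
my $gap_any        = $unassigned =~ /^\p{Any}$/        ? 1 : 0;

printf "Alphabetic=%d L=%d Uppercase=%d Lu=%d | Assigned=%d Unassigned=%d Any=%d\n",
    $nl_alphabetic, $nl_letter, $nl_uppercase, $nl_lu,
    $gap_assigned, $gap_unassigned, $gap_any;
```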
490
491       Use of "Is" Prefix
492           For backward compatibility (with Perl 5.6), all properties
493           mentioned so far may have "Is" prepended to their name, so
494           "\P{IsLu}", for example, is equal to "\P{Lu}".
495
496       Blocks
497           In addition to scripts, Unicode also defines blocks of characters.
498           The difference between scripts and blocks is that the concept of
499           scripts is closer to natural languages, while the concept of blocks
500           is more of an artificial grouping based on groups of 256 Unicode
501           characters. For example, the "Latin" script contains letters from
502           many blocks but does not contain all the characters from those
503           blocks. It does not, for example, contain digits, because digits
504           are shared across many scripts. Digits and similar groups, like
505           punctuation, are in a category called "Common".
506
507           For more about scripts, see the UAX#24 "Script Names":
508
509              http://www.unicode.org/reports/tr24/
510
511           For more about blocks, see:
512
513              http://www.unicode.org/Public/UNIDATA/Blocks.txt
514
515           Block names are given with the "In" prefix. For example, the
516           Katakana block is referenced via "\p{InKatakana}".  The "In" prefix
517           may be omitted if there is no naming conflict with a script or any
518           other property, but it is recommended that "In" always be used for
519           block tests to avoid confusion.
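
           The distinction between scripts and blocks can be sketched with a
           digit, which sits in the Basic Latin block but does not belong to
           the Latin script.  The sample characters are assumptions for
           illustration.

```perl
use strict;
use warnings;

my $katakana_a = "\x{30A2}";   # KATAKANA LETTER A

# A digit is inside the block but outside the script.
my $digit_in_block  = "1" =~ /^\p{InBasicLatin}$/ ? 1 : 0;
my $digit_in_script = "1" =~ /^\p{Latin}$/        ? 1 : 0;

# A letter is inside both.
my $letter_in_block  = "z" =~ /^\p{InBasicLatin}$/ ? 1 : 0;
my $letter_in_script = "z" =~ /^\p{Latin}$/        ? 1 : 0;

# Block test using the recommended "In" prefix.
my $kana_in_block = $katakana_a =~ /^\p{InKatakana}$/ ? 1 : 0;

printf "digit: block=%d script=%d  letter: block=%d script=%d  katakana block=%d\n",
    $digit_in_block, $digit_in_script,
    $letter_in_block, $letter_in_script, $kana_in_block;
```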
520
521           These block names are supported:
522
523               InAegeanNumbers
524               InAlphabeticPresentationForms
525               InAncientGreekMusicalNotation
526               InAncientGreekNumbers
527               InArabic
528               InArabicPresentationFormsA
529               InArabicPresentationFormsB
530               InArabicSupplement
531               InArmenian
532               InArrows
533               InBalinese
534               InBasicLatin
535               InBengali
536               InBlockElements
537               InBopomofo
538               InBopomofoExtended
539               InBoxDrawing
540               InBraillePatterns
541               InBuginese
542               InBuhid
543               InByzantineMusicalSymbols
544               InCJKCompatibility
545               InCJKCompatibilityForms
546               InCJKCompatibilityIdeographs
547               InCJKCompatibilityIdeographsSupplement
548               InCJKRadicalsSupplement
549               InCJKStrokes
550               InCJKSymbolsAndPunctuation
551               InCJKUnifiedIdeographs
552               InCJKUnifiedIdeographsExtensionA
553               InCJKUnifiedIdeographsExtensionB
554               InCherokee
555               InCombiningDiacriticalMarks
556               InCombiningDiacriticalMarksSupplement
557               InCombiningDiacriticalMarksforSymbols
558               InCombiningHalfMarks
559               InControlPictures
560               InCoptic
561               InCountingRodNumerals
562               InCuneiform
563               InCuneiformNumbersAndPunctuation
564               InCurrencySymbols
565               InCypriotSyllabary
566               InCyrillic
567               InCyrillicSupplement
568               InDeseret
569               InDevanagari
570               InDingbats
571               InEnclosedAlphanumerics
572               InEnclosedCJKLettersAndMonths
573               InEthiopic
574               InEthiopicExtended
575               InEthiopicSupplement
576               InGeneralPunctuation
577               InGeometricShapes
578               InGeorgian
579               InGeorgianSupplement
580               InGlagolitic
581               InGothic
582               InGreekExtended
583               InGreekAndCoptic
584               InGujarati
585               InGurmukhi
586               InHalfwidthAndFullwidthForms
587               InHangulCompatibilityJamo
588               InHangulJamo
589               InHangulSyllables
590               InHanunoo
591               InHebrew
592               InHighPrivateUseSurrogates
593               InHighSurrogates
594               InHiragana
595               InIPAExtensions
596               InIdeographicDescriptionCharacters
597               InKanbun
598               InKangxiRadicals
599               InKannada
600               InKatakana
601               InKatakanaPhoneticExtensions
602               InKharoshthi
603               InKhmer
604               InKhmerSymbols
605               InLao
606               InLatin1Supplement
607               InLatinExtendedA
608               InLatinExtendedAdditional
609               InLatinExtendedB
610               InLatinExtendedC
611               InLatinExtendedD
612               InLetterlikeSymbols
613               InLimbu
614               InLinearBIdeograms
615               InLinearBSyllabary
616               InLowSurrogates
617               InMalayalam
618               InMathematicalAlphanumericSymbols
619               InMathematicalOperators
620               InMiscellaneousMathematicalSymbolsA
621               InMiscellaneousMathematicalSymbolsB
622               InMiscellaneousSymbols
623               InMiscellaneousSymbolsAndArrows
624               InMiscellaneousTechnical
625               InModifierToneLetters
626               InMongolian
627               InMusicalSymbols
628               InMyanmar
629               InNKo
630               InNewTaiLue
631               InNumberForms
632               InOgham
633               InOldItalic
634               InOldPersian
635               InOpticalCharacterRecognition
636               InOriya
637               InOsmanya
638               InPhagspa
639               InPhoenician
640               InPhoneticExtensions
641               InPhoneticExtensionsSupplement
642               InPrivateUseArea
643               InRunic
644               InShavian
645               InSinhala
646               InSmallFormVariants
647               InSpacingModifierLetters
648               InSpecials
649               InSuperscriptsAndSubscripts
650               InSupplementalArrowsA
651               InSupplementalArrowsB
652               InSupplementalMathematicalOperators
653               InSupplementalPunctuation
654               InSupplementaryPrivateUseAreaA
655               InSupplementaryPrivateUseAreaB
656               InSylotiNagri
657               InSyriac
658               InTagalog
659               InTagbanwa
660               InTags
661               InTaiLe
662               InTaiXuanJingSymbols
663               InTamil
664               InTelugu
665               InThaana
666               InThai
667               InTibetan
668               InTifinagh
669               InUgaritic
670               InUnifiedCanadianAboriginalSyllabics
671               InVariationSelectors
672               InVariationSelectorsSupplement
673               InVerticalForms
674               InYiRadicals
675               InYiSyllables
676               InYijingHexagramSymbols
677
678   User-Defined Character Properties
679       You can define your own character properties by defining subroutines
680       whose names begin with "In" or "Is".  The subroutines can be defined in
681       any package.  The user-defined properties can be used in the regular
682       expression "\p" and "\P" constructs; if you are using a user-defined
683       property from a package other than the one you are in, you must specify
684       its package in the "\p" or "\P" construct.
685
           # assuming property IsForeign defined in Lang::
           package main;  # property package name required
           if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

           package Lang;  # property package name not required
           if ($txt =~ /\p{IsForeign}+/) { ... }
692
693       Note that the effect is compile-time and immutable once defined.
694
695       The subroutines must return a specially-formatted string, with one or
696       more newline-separated lines.  Each line must be one of the following:
697
698       ·   A single hexadecimal number denoting a Unicode code point to
699           include.
700
701       ·   Two hexadecimal numbers separated by horizontal whitespace (space
702           or tabular characters) denoting a range of Unicode code points to
703           include.
704
705       ·   Something to include, prefixed by "+": a built-in character
706           property (prefixed by "utf8::") or a user-defined character
707           property, to represent all the characters in that property; two
708           hexadecimal code points for a range; or a single hexadecimal code
709           point.
710
711       ·   Something to exclude, prefixed by "-": an existing character
712           property (prefixed by "utf8::") or a user-defined character
713           property, to represent all the characters in that property; two
714           hexadecimal code points for a range; or a single hexadecimal code
715           point.
716
       ·   Something to negate, prefixed by "!": an existing character
           property (prefixed by "utf8::") or a user-defined character
           property, to represent all the characters in that property; two
           hexadecimal code points for a range; or a single hexadecimal code
           point.
721
       ·   Something to intersect with, prefixed by "&": an existing character
           property (prefixed by "utf8::") or a user-defined character
           property, to keep only the characters that are also in that
           property; two hexadecimal code points for a range; or a single
           hexadecimal code point.
727
728       For example, to define a property that covers both the Japanese
729       syllabaries (hiragana and katakana), you can define
730
731           sub InKana {
732               return <<END;
733           3040\t309F
734           30A0\t30FF
735           END
736           }
737
738       Imagine that the here-doc end marker is at the beginning of the line.
739       Now you can use "\p{InKana}" and "\P{InKana}".
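
       As a minimal, self-contained sketch of the range form (the sample
       characters are illustrative assumptions, and the string is built with
       join() instead of a here-document so that indentation does not
       matter):

```perl
use strict;
use warnings;

# Same ranges as the InKana example: one "start<TAB>end" pair per line.
sub InKana {
    return join "\n",
        "3040\t309F",    # Hiragana block
        "30A0\t30FF",    # Katakana block
        "";              # ends the string with a newline
}

# Illustrative test characters (assumptions, not from the original text).
my %sample = (
    "HIRAGANA LETTER A"    => "\x{3042}",
    "KATAKANA LETTER A"    => "\x{30A2}",
    "LATIN SMALL LETTER A" => "a",
);

for my $name (sort keys %sample) {
    my $verdict = $sample{$name} =~ /\p{InKana}/ ? "matches" : "does not match";
    print "$name $verdict \\p{InKana}\n";
}
```

       Because the subroutine lives in package main, the pattern can refer to
       it without a package qualifier.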
740
741       You could also have used the existing block property names:
742
743           sub InKana {
744               return <<'END';
745           +utf8::InHiragana
746           +utf8::InKatakana
747           END
748           }
749
       Suppose you wanted to match only the allocated characters, not the raw
       block ranges: in other words, you want to remove the unassigned code
       points ("IsCn"):
752
753           sub InKana {
754               return <<'END';
755           +utf8::InHiragana
756           +utf8::InKatakana
757           -utf8::IsCn
758           END
759           }
760
761       The negation is useful for defining (surprise!) negated classes.
762
763           sub InNotKana {
764               return <<'END';
765           !utf8::InHiragana
766           -utf8::InKatakana
767           +utf8::IsCn
768           END
769           }
770
771       Intersection is useful for getting the common characters matched by two
772       (or more) classes.
773
774           sub InFooAndBar {
775               return <<'END';
776           +main::Foo
777           &main::Bar
778           END
779           }
780
781       It's important to remember not to use "&" for the first set -- that
782       would be intersecting with nothing (resulting in an empty set).
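
       A small sketch of the "+" then "&" ordering, using two built-in
       properties; the property name InHexLetter and the candidate list are
       hypothetical, chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical property: alphabetic characters that are also ASCII hex digits.
# The first line starts the set with "+"; the second narrows it with "&".
sub InHexLetter {
    return join "\n",
        "+utf8::Alphabetic",
        "&utf8::ASCII_Hex_Digit",
        "";
}

# The last candidate is FULLWIDTH LATIN CAPITAL LETTER A: alphabetic,
# but not an ASCII hex digit.
my @candidates = ("a", "F", "g", "7", "\x{FF21}");
my @hits = grep { /\p{InHexLetter}/ } @candidates;
print join(" ", @hits), "\n";
```

       Swapping the two lines so that "&" came first would start from an
       empty set, which is the pitfall described above.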
783
784   User-Defined Case Mappings
       You can also define your own mappings to be used in the lc(),
       lcfirst(), uc(), and ucfirst() functions (or their string-inlined
       versions).  The principle is similar to that of user-defined character
       properties: define subroutines in the "main" package with names like
       "ToLower" (for lc() and lcfirst()), "ToTitle" (for the first character
       in ucfirst()), and "ToUpper" (for uc(), and the rest of the characters
       in ucfirst()).
791
       The string returned by the subroutines must now consist of three
       hexadecimal numbers separated by tabs: start of the source range,
       end of the source range, and start of the destination range.  For
       example:
796
797           sub ToUpper {
798               return <<END;
799           0061\t0063\t0041
800           END
801           }
802
       defines a uc() mapping that causes only the characters "a", "b", and
       "c" to be mapped to "A", "B", and "C"; all other characters remain
       unchanged.
806
       If there is no source range to speak of, that is, the mapping is from a
       single character to another single character, leave the end of the
       source range empty, but keep both tab characters.
       For example:
811
812           sub ToLower {
813               return <<END;
814           0041\t\t0061
815           END
816           }
817
       defines a lc() mapping that causes only "A" to be mapped to "a"; all
       other characters remain unchanged.
820
       (For serious hackers only)  If you want to inspect the default
       mappings, you can find the data in the directory
       $Config{privlib}/unicore/To/.  The mapping data is returned as the
       here-document, and the "utf8::ToSpecFoo" are special exception mappings
       derived from $Config{privlib}/unicore/SpecialCasing.txt.  The "Digit"
       and "Fold" mappings that one can see in the directory are not directly
       user-accessible; use either the "Unicode::UCD" module, or just
       match case-insensitively (that's when the "Fold" mapping is used).
829
830       A final note on the user-defined case mappings: they will be used only
831       if the scalar has been marked as having Unicode characters.  Old byte-
832       style strings will not be affected.
833
834   Character Encodings for Input and Output
835       See Encode.
836
837   Unicode Regular Expression Support Level
838       The following list of Unicode support for regular expressions describes
839       all the features currently supported.  The references to "Level N" and
840       the section numbers refer to the Unicode Technical Standard #18,
841       "Unicode Regular Expressions", version 11, in May 2005.
842
843       ·   Level 1 - Basic Unicode Support
844
                   RL1.1   Hex Notation                        - done          [1]
                   RL1.2   Properties                          - done          [2][3]
                   RL1.2a  Compatibility Properties            - done          [4]
                   RL1.3   Subtraction and Intersection        - MISSING       [5]
                   RL1.4   Simple Word Boundaries              - done          [6]
                   RL1.5   Simple Loose Matches                - done          [7]
                   RL1.6   Line Boundaries                     - MISSING       [8]
                   RL1.7   Supplementary Code Points           - done          [9]

                   [1]  \x{...}
                   [2]  \p{...} \P{...}
                   [3]  supports not only minimal list (general category, scripts,
                        Alphabetic, Lowercase, Uppercase, WhiteSpace,
                        NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
                        ASCII, Assigned), but also bidirectional types, blocks, etc.
                        (see "Unicode Character Properties")
                   [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
                   [5]  can use regular expression look-ahead [a] or
                        user-defined character properties [b] to emulate set operations
                   [6]  \b \B
                   [7]  note that Perl does Full case-folding in matching, not Simple:
                        for example U+1F88 is equivalent to U+1F00 U+03B9,
                        not to U+1F80.  This difference matters mainly for certain Greek
                        capital letters with certain modifiers: the Full case-folding
                        decomposes the letter, while the Simple case-folding would map
                        it to a single character.
                   [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r),
                        CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029);
                        should also affect <>, $., and script line numbers;
                        should not split lines within CRLF [c] (i.e. there is no empty
                        line between \r and \n)
                   [9]  UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF
                        but also beyond U+10FFFF [d]
878
879           [a] You can mimic class subtraction using lookahead.  For example,
880           what UTS#18 might write as
881
882               [{Greek}-[{UNASSIGNED}]]
883
884           in Perl can be written as:
885
886               (?!\p{Unassigned})\p{InGreekAndCoptic}
887               (?=\p{Assigned})\p{InGreekAndCoptic}
888
889           But in this particular example, you probably really want
890
891               \p{GreekAndCoptic}
892
893           which will match assigned characters known to be part of the Greek
894           script.
895
           Also see the Unicode::Regex::Set module, which implements the full
           UTS#18 grouping, intersection, union, and removal (subtraction)
           syntax.
899
900           [b] '+' for union, '-' for removal (set-difference), '&' for
901           intersection (see "User-Defined Character Properties")
902
903           [c] Try the ":crlf" layer (see PerlIO).
904
           [d] Avoid "use warnings 'utf8';" (or say "no warnings 'utf8';") to
           allow U+FFFF ("\x{FFFF}").
907
908       ·   Level 2 - Extended Unicode Support
909
910                   RL2.1   Canonical Equivalents           - MISSING       [10][11]
911                   RL2.2   Default Grapheme Clusters       - MISSING       [12][13]
912                   RL2.3   Default Word Boundaries         - MISSING       [14]
913                   RL2.4   Default Loose Matches           - MISSING       [15]
914                   RL2.5   Name Properties                 - MISSING       [16]
915                   RL2.6   Wildcard Properties             - MISSING
916
917                   [10] see UAX#15 "Unicode Normalization Forms"
918                   [11] have Unicode::Normalize but not integrated to regexes
919                   [12] have \X but at this level . should equal that
920                   [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable
921                        clusters as a single grapheme cluster.
922                   [14] see UAX#29, Word Boundaries
923                   [15] see UAX#21 "Case Mappings"
924                   [16] have \N{...} but neither compute names of CJK Ideographs
925                        and Hangul Syllables nor use a loose match [e]
926
927           [e] "\N{...}" allows namespaces (see charnames).
928
929       ·   Level 3 - Tailored Support
930
                   RL3.1   Tailored Punctuation            - MISSING
                   RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
                   RL3.3   Tailored Word Boundaries        - MISSING
                   RL3.4   Tailored Loose Matches          - MISSING
                   RL3.5   Tailored Ranges                 - MISSING
                   RL3.6   Context Matching                - MISSING       [19]
                   RL3.7   Incremental Matches             - MISSING
                 ( RL3.8   Unicode Set Sharing )
                   RL3.9   Possible Match Sets             - MISSING
                   RL3.10  Folded Matching                 - MISSING       [20]
                   RL3.11  Submatchers                     - MISSING

                   [17] see UTS#10 "Unicode Collation Algorithm"
                   [18] have Unicode::Collate but not integrated into regexes
                   [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see
                        outside of the target substring
                   [20] need insensitive matching for linguistic features other than case;
                        for example, hiragana to katakana, wide and narrow, simplified Han
                        to traditional Han (see UTR#30 "Character Foldings")
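
       The look-ahead emulation of set subtraction mentioned in note [a] can
       be sketched as follows.  The particular sets (Greek script minus
       uppercase letters) and the sample characters are assumptions chosen
       for illustration, not taken from UTS#18:

```perl
use strict;
use warnings;

# "Greek script minus uppercase letters", emulated with a negative look-ahead:
# the look-ahead rejects \p{Lu} before \p{Greek} consumes the character.
my $greek_not_upper = qr/(?!\p{Lu})\p{Greek}/;

my $small_alpha   = "\x{03B1}";   # GREEK SMALL LETTER ALPHA
my $capital_alpha = "\x{0391}";   # GREEK CAPITAL LETTER ALPHA

for my $ch ($small_alpha, $capital_alpha, "a") {
    printf "U+%04X %s\n", ord($ch),
        $ch =~ /\A$greek_not_upper\z/ ? "in the set" : "not in the set";
}
```

       A positive look-ahead, "(?=...)", gives intersection in the same way.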
950
951   Unicode Encodings
952       Unicode characters are assigned to code points, which are abstract
953       numbers.  To use these numbers, various encodings are needed.
954
955       ·   UTF-8
956
           UTF-8 is a variable-length (1 to 6 bytes; current character
           allocations require at most 4 bytes), byte-order independent
           encoding.  For ASCII (and we really do mean 7-bit ASCII, not another
           8-bit encoding), UTF-8 is transparent.
961
962           The following table is from Unicode 3.2.
963
964            Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte
965
966              U+0000..U+007F       00..7F
967              U+0080..U+07FF       C2..DF    80..BF
968              U+0800..U+0FFF       E0        A0..BF    80..BF
969              U+1000..U+CFFF       E1..EC    80..BF    80..BF
970              U+D000..U+D7FF       ED        80..9F    80..BF
971              U+D800..U+DFFF       ******* ill-formed *******
972              U+E000..U+FFFF       EE..EF    80..BF    80..BF
973             U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
974             U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
975            U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
976
           Note the "A0..BF" in "U+0800..U+0FFF", the "80..9F" in
           "U+D000..U+D7FF", the "90..BF" in "U+10000..U+3FFFF", and the
           "80..8F" in "U+100000..U+10FFFF".  The "gaps" are caused by legal
980           UTF-8 avoiding non-shortest encodings: it is technically possible
981           to UTF-8-encode a single code point in different ways, but that is
982           explicitly forbidden, and the shortest possible encoding should
983           always be used.  So that's what Perl does.
984
985           Another way to look at it is via bits:
986
987            Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte
988
989                               0aaaaaaa     0aaaaaaa
990                       00000bbbbbaaaaaa     110bbbbb  10aaaaaa
991                       ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
992             00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa
993
           As you can see, the continuation bytes all begin with 10, and the
           leading bits of the start byte tell how many bytes there are in the
           encoded character.
997
998       ·   UTF-EBCDIC
999
1000           Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
1001
1002       ·   UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
1003
           The following items are mostly for reference and general Unicode
           knowledge; Perl doesn't use these constructs internally.
1006
           UTF-16 is a 2 or 4 byte encoding.  The Unicode code points
           "U+0000..U+FFFF" are stored in a single 16-bit unit, and the code
           points "U+10000..U+10FFFF" in two 16-bit units.  The latter case
           uses surrogates, the first 16-bit unit being the high surrogate,
           and the second being the low surrogate.
1012
1013           Surrogates are code points set aside to encode the
1014           "U+10000..U+10FFFF" range of Unicode code points in pairs of 16-bit
1015           units.  The high surrogates are the range "U+D800..U+DBFF", and the
1016           low surrogates are the range "U+DC00..U+DFFF".  The surrogate
1017           encoding is
1018
                   # int() is needed because "/" is floating-point division in Perl
                   $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
                   $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
1021
1022           and the decoding is
1023
1024                   $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
1025
1026           If you try to generate surrogates (for example by using chr()), you
1027           will get a warning if warnings are turned on, because those code
1028           points are not valid for a Unicode character.
1029
           Because of the 16-bitness, UTF-16 is byte-order dependent.  UTF-16
           itself can be used for in-memory computations, but if storage or
           transfer is required, either the UTF-16BE (big-endian) or the
           UTF-16LE (little-endian) encoding must be chosen.
1034
1035           This introduces another problem: what if you just know that your
1036           data is UTF-16, but you don't know which endianness?  Byte Order
1037           Marks, or BOMs, are a solution to this.  A special character has
1038           been reserved in Unicode to function as a byte order marker: the
1039           character with the code point "U+FEFF" is the BOM.
1040
1041           The trick is that if you read a BOM, you will know the byte order,
1042           since if it was written on a big-endian platform, you will read the
1043           bytes "0xFE 0xFF", but if it was written on a little-endian
1044           platform, you will read the bytes "0xFF 0xFE".  (And if the
1045           originating platform was writing in UTF-8, you will read the bytes
1046           "0xEF 0xBB 0xBF".)
1047
           The way this trick works is that the character with the code point
           "U+FFFE" is guaranteed not to be a valid Unicode character, so the
           sequence of bytes "0xFF 0xFE" is unambiguously "BOM, represented in
           little-endian format" and cannot be "U+FFFE, represented in
           big-endian format".
1053
1054       ·   UTF-32, UTF-32BE, UTF-32LE
1055
           The UTF-32 family is pretty much like the UTF-16 family, except
           that the units are 32-bit, and therefore the surrogate scheme is
           not needed.  The BOM signatures will be "0x00 0x00 0xFE 0xFF" for
           BE and "0xFF 0xFE 0x00 0x00" for LE.
1060
1061       ·   UCS-2, UCS-4
1062
1063           Encodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
1064           encoding.  Unlike UTF-16, UCS-2 is not extensible beyond "U+FFFF",
1065           because it does not use surrogates.  UCS-4 is a 32-bit encoding,
1066           functionally identical to UTF-32.
1067
1068       ·   UTF-7
1069
1070           A seven-bit safe (non-eight-bit) encoding, which is useful if the
1071           transport or storage is not eight-bit safe.  Defined by RFC 2152.
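
       The encodings described in the list above can be explored from Perl
       with the standard Encode module.  The following is a sketch only: the
       helper names hex_bytes, to_surrogates, and from_surrogates are
       hypothetical, and the sample code points are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex octets (hypothetical helper).
sub hex_bytes { return join " ", map { sprintf "%02X", ord } split //, $_[0] }

# UTF-8 length grows with the code point, as in the table above.
for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    printf "U+%04X  UTF-8: %s\n", $cp, hex_bytes(encode("UTF-8", chr $cp));
}

# Surrogate arithmetic from the UTF-16 item; int() keeps $hi integral.
sub to_surrogates {
    my ($uni) = @_;
    my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    my $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
    return ($hi, $lo);
}

sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "surrogates:      %04X %04X\n", $hi, $lo;
printf "UTF-16BE:        %s\n", hex_bytes(encode("UTF-16BE", chr 0x1F600));

# Encode's plain "UTF-16" writes a byte order mark before the data.
printf "UTF-16 with BOM: %s\n", hex_bytes(encode("UTF-16", "A"));
```

       Comparing the hand-computed surrogates with the UTF-16BE octets is a
       quick way to check the arithmetic.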
1072
1073   Security Implications of Unicode
1074       ·   Malformed UTF-8
1075
1076           Unfortunately, the specification of UTF-8 leaves some room for
1077           interpretation of how many bytes of encoded output one should
1078           generate from one input Unicode character.  Strictly speaking, the
1079           shortest possible sequence of UTF-8 bytes should be generated,
1080           because otherwise there is potential for an input buffer overflow
1081           at the receiving end of a UTF-8 connection.  Perl always generates
1082           the shortest length UTF-8, and with warnings on Perl will warn
1083           about non-shortest length UTF-8 along with other malformations,
1084           such as the surrogates, which are not real Unicode code points.
1085
       ·   Regular expressions behave slightly differently between byte data
           and character (Unicode) data.  For example, the "word character"
           character class "\w" will work differently depending on whether the
           data is eight-bit bytes or Unicode.
1090
1091           In the first case, the set of "\w" characters is either small--the
1092           default set of alphabetic characters, digits, and the "_"--or, if
1093           you are using a locale (see perllocale), the "\w" might contain a
1094           few more letters according to your language and country.
1095
1096           In the second case, the "\w" set of characters is much, much
1097           larger.  Most importantly, even in the set of the first 256
1098           characters, it will probably match different characters: unlike
1099           most locales, which are specific to a language and country pair,
1100           Unicode classifies all the characters that are letters somewhere as
1101           "\w".  For example, your locale might not think that LATIN SMALL
1102           LETTER ETH is a letter (unless you happen to speak Icelandic), but
1103           Unicode does.
1104
1105           As discussed elsewhere, Perl has one foot (two hooves?) planted in
1106           each of two worlds: the old world of bytes and the new world of
1107           characters, upgrading from bytes to characters when necessary.  If
1108           your legacy code does not explicitly use Unicode, no automatic
1109           switch-over to characters should happen.  Characters shouldn't get
1110           downgraded to bytes, either.  It is possible to accidentally mix
1111           bytes and characters, however (see perluniintro), in which case
1112           "\w" in regular expressions might start behaving differently.
1113           Review your code.  Use warnings and the "strict" pragma.
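
       The "\w" difference described above can be observed directly.  This
       sketch assumes that no locale is in effect, that the "unicode_strings"
       feature is not enabled, and that the pattern carries no "/u" or "/a"
       modifier; the choice of LATIN SMALL LETTER ETH follows the example in
       the text.

```perl
use strict;
use warnings;

my $eth = "\xF0";              # LATIN SMALL LETTER ETH, stored as one byte
my $as_bytes = $eth =~ /\w/ ? 1 : 0;

utf8::upgrade($eth);           # same character, now stored internally as UTF-8
my $as_chars = $eth =~ /\w/ ? 1 : 0;

print "byte semantics:      \\w ", ($as_bytes ? "matches" : "does not match"), "\n";
print "character semantics: \\w ", ($as_chars ? "matches" : "does not match"), "\n";
```

       The string holds the same single character both times; only its
       internal representation changes.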
1114
1115   Unicode in Perl on EBCDIC
1116       The way Unicode is handled on EBCDIC platforms is still experimental.
1117       On such platforms, references to UTF-8 encoding in this document and
1118       elsewhere should be read as meaning the UTF-EBCDIC specified in Unicode
1119       Technical Report 16, unless ASCII vs. EBCDIC issues are specifically
1120       discussed. There is no "utfebcdic" pragma or ":utfebcdic" layer;
1121       rather, "utf8" and ":utf8" are reused to mean the platform's "natural"
1122       8-bit encoding of Unicode. See perlebcdic for more discussion of the
1123       issues.
1124
1125   Locales
1126       Usually locale settings and Unicode do not affect each other, but there
1127       are a couple of exceptions:
1128
       ·   You can enable automatic UTF-8-ification of your standard file
           handles, default "open()" layer, and @ARGV by using either the "-C"
           command line switch or the "PERL_UNICODE" environment variable; see
           perlrun for the documentation of the "-C" switch.
1133
1134       ·   Perl tries really hard to work both with Unicode and the old byte-
1135           oriented world. Most often this is nice, but sometimes Perl's
1136           straddling of the proverbial fence causes problems.
1137
1138   When Unicode Does Not Happen
       While Perl does have extensive ways to input and output in Unicode, and
       a few other 'entry points' like @ARGV which can be interpreted as
       Unicode (UTF-8), there are still many places where Unicode (in some
       encoding or another) could be given as arguments or received as
       results, or both, but it is not.
1144
1145       The following are such interfaces.  For all of these interfaces Perl
1146       currently (as of 5.8.3) simply assumes byte strings both as arguments
1147       and results, or UTF-8 strings if the "encoding" pragma has been used.
1148
       One reason why Perl does not attempt to resolve the role of Unicode in
       these cases is that the answers are highly dependent on the operating
       system and the file system(s).  For example, whether filenames can be
       in Unicode, and in exactly what kind of encoding, is not exactly a
       portable concept.  Similarly for qx and system: how well will the
       'command line interface' (and which of them?) handle Unicode?
1155
1156       ·   chdir, chmod, chown, chroot, exec, link, lstat, mkdir, rename,
1157           rmdir, stat, symlink, truncate, unlink, utime, -X
1158
1159       ·   %ENV
1160
1161       ·   glob (aka the <*>)
1162
1163       ·   open, opendir, sysopen
1164
1165       ·   qx (aka the backtick operator), system
1166
1167       ·   readdir, readlink
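
       Since these interfaces deal in byte strings, one hedged approach is to
       encode and decode explicitly at the boundary.  The sketch below uses
       %ENV; the variable name PERLUNICODE_DEMO is hypothetical, and UTF-8 is
       an assumption about what the surrounding environment expects.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $name = "caf\x{E9} \x{263A}";     # contains a character above U+00FF

# Outbound: pick the encoding explicitly instead of passing characters.
local $ENV{PERLUNICODE_DEMO} = encode("UTF-8", $name);

# Inbound: what comes back is a byte string; decode it with the same encoding.
my $raw     = $ENV{PERLUNICODE_DEMO};
my $decoded = decode("UTF-8", $raw);

printf "bytes: %d, characters: %d\n", length($raw), length($decoded);
```

       The same pattern applies to filenames and command arguments, with the
       encoding chosen to suit the operating system and file system.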
1168
1169   Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1170       Sometimes (see "When Unicode Does Not Happen") there are situations
1171       where you simply need to force a byte string into UTF-8, or vice versa.
1172       The low-level calls utf8::upgrade($bytestring) and
1173       utf8::downgrade($utf8string[, FAIL_OK]) are the answers.
1174
1175       Note that utf8::downgrade() can fail if the string contains characters
1176       that don't fit into a byte.
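
       A short sketch of both calls, including the FAIL_OK form; the sample
       strings are arbitrary, and utf8::is_utf8() is used here only to
       inspect the internal flag:

```perl
use strict;
use warnings;

my $s = "caf\xE9";                       # four characters, all below 256
utf8::upgrade($s);                       # internal representation becomes UTF-8
my $flag_after_upgrade = utf8::is_utf8($s) ? 1 : 0;

utf8::downgrade($s);                     # back to one byte per character
my $flag_after_downgrade = utf8::is_utf8($s) ? 1 : 0;

my $wide = "\x{263A}";                   # cannot fit into a byte
my $ok = utf8::downgrade($wide, 1) ? 1 : 0;   # FAIL_OK: false instead of dying

print "after upgrade: $flag_after_upgrade, after downgrade: $flag_after_downgrade, ",
      "length still ", length($s), ", wide downgrade ok: $ok\n";
```

       Neither call changes the characters in the string, only how they are
       stored.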
1177
1178   Using Unicode in XS
1179       If you want to handle Perl Unicode in XS extensions, you may find the
1180       following C APIs useful.  See also "Unicode Support" in perlguts for an
1181       explanation about Unicode at the XS level, and perlapi for the API
1182       details.
1183
1184       ·   "DO_UTF8(sv)" returns true if the "UTF8" flag is on and the bytes
1185           pragma is not in effect.  "SvUTF8(sv)" returns true if the "UTF8"
1186           flag is on; the bytes pragma is ignored.  The "UTF8" flag being on
1187           does not mean that there are any characters of code points greater
1188           than 255 (or 127) in the scalar or that there are even any
1189           characters in the scalar.  What the "UTF8" flag means is that the
1190           sequence of octets in the representation of the scalar is the
1191           sequence of UTF-8 encoded code points of the characters of a
1192           string.  The "UTF8" flag being off means that each octet in this
1193           representation encodes a single character with code point 0..255
1194           within the string.  Perl's Unicode model is not to use UTF-8 until
1195           it is absolutely necessary.
1196
1197       ·   "uvchr_to_utf8(buf, chr)" writes a Unicode character code point
1198           into a buffer encoding the code point as UTF-8, and returns a
1199           pointer pointing after the UTF-8 bytes.  It works appropriately on
1200           EBCDIC machines.
1201
1202       ·   "utf8_to_uvchr(buf, lenp)" reads UTF-8 encoded bytes from a buffer
1203           and returns the Unicode character code point and, optionally, the
1204           length of the UTF-8 byte sequence.  It works appropriately on
1205           EBCDIC machines.
1206
1207       ·   "utf8_length(start, end)" returns the length of the UTF-8 encoded
1208           buffer in characters.  "sv_len_utf8(sv)" returns the length of the
1209           UTF-8 encoded scalar.
1210
1211       ·   "sv_utf8_upgrade(sv)" converts the string of the scalar to its
1212           UTF-8 encoded form.  "sv_utf8_downgrade(sv)" does the opposite, if
1213           possible.  "sv_utf8_encode(sv)" is like sv_utf8_upgrade except that
1214           it does not set the "UTF8" flag.  "sv_utf8_decode()" does the
1215           opposite of "sv_utf8_encode()".  Note that none of these are to be
1216           used as general-purpose encoding or decoding interfaces: "use
1217           Encode" for that.  "sv_utf8_upgrade()" is affected by the encoding
1218           pragma but "sv_utf8_downgrade()" is not (since the encoding pragma
1219           is designed to be a one-way street).
1220
       ·   "is_utf8_char(s)" returns true if the pointer points to a valid
           UTF-8 character.
1223
1224       ·   "is_utf8_string(buf, len)" returns true if "len" bytes of the
1225           buffer are valid UTF-8.
1226
1227       ·   "UTF8SKIP(buf)" will return the number of bytes in the UTF-8
1228           encoded character in the buffer.  "UNISKIP(chr)" will return the
1229           number of bytes required to UTF-8-encode the Unicode character code
1230           point.  "UTF8SKIP()" is useful for example for iterating over the
1231           characters of a UTF-8 encoded buffer; "UNISKIP()" is useful, for
1232           example, in computing the size required for a UTF-8 encoded buffer.
1233
1234       ·   "utf8_distance(a, b)" will tell the distance in characters between
1235           the two pointers pointing to the same UTF-8 encoded buffer.
1236
1237       ·   "utf8_hop(s, off)" will return a pointer to a UTF-8 encoded buffer
1238           that is "off" (positive or negative) Unicode characters displaced
1239           from the UTF-8 buffer "s".  Be careful not to overstep the buffer:
1240           "utf8_hop()" will merrily run off the end or the beginning of the
1241           buffer if told to do so.
1242
1243       ·   "pv_uni_display(dsv, spv, len, pvlim, flags)" and
1244           "sv_uni_display(dsv, ssv, pvlim, flags)" are useful for debugging
1245           the output of Unicode strings and scalars.  By default they are
1246           useful only for debugging--they display all characters as
1247           hexadecimal code points--but with the flags "UNI_DISPLAY_ISPRINT",
1248           "UNI_DISPLAY_BACKSLASH", and "UNI_DISPLAY_QQ" you can make the
1249           output more readable.
1250
1251       ·   "ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)" can be used to
1252           compare two strings case-insensitively in Unicode.  For case-
1253           sensitive comparisons you can just use "memEQ()" and "memNE()" as
1254           usual.
1255
1256       For more information, see perlapi, and utf8.c and utf8.h in the Perl
1257       source code distribution.
1258

BUGS

1260   Interaction with Locales
1261       Use of locales with Unicode data may lead to odd results.  Currently,
1262       Perl attempts to attach 8-bit locale info to characters in the range
1263       0..255, but this technique is demonstrably incorrect for locales that
1264       use characters above that range when mapped into Unicode.  Perl's
1265       Unicode support will also tend to run slower.  Use of locales with
1266       Unicode is discouraged.
1267
1268   Problems with characters whose ordinal numbers are in the range 128 - 255
1269       with no Locale specified
1270       Without a locale specified, unlike all other characters or code points,
1271       these characters have very different semantics in byte semantics versus
1272       character semantics.  In character semantics they are interpreted as
1273       Unicode code points, which means they are viewed as Latin-1
1274       (ISO-8859-1).  In byte semantics, they are considered to be unassigned
1275       characters, meaning that the only semantics they have is their ordinal
1276       numbers, and that they are not members of various character classes.
1277       None are considered to match "\w" for example, but all match "\W".
1278       Besides these class matches, the known operations that this affects are
1279       those that change the case, regular expression matching while ignoring
1280       case, and quotemeta().  This can lead to unexpected results in which a
1281       string's semantics suddenly change if a code point above 255 is
1282       appended to or removed from it, which changes the string's semantics
1283       from byte to character or vice versa.  This behavior is scheduled to
1284       change in version 5.12, but in the meantime, a workaround is to always
1285       call utf8::upgrade($string), or to use the standard modules Encode or
1286       charnames.
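
       The effect on case-changing operations, and the utf8::upgrade()
       workaround, can be sketched like this.  It assumes no locale is in
       effect and that the "unicode_strings" feature is not enabled; the
       sample character is arbitrary.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";                 # LATIN SMALL LETTER E WITH ACUTE as a byte
my $upper_bytes = uc $e_acute;        # byte semantics: no case mapping applied

my $copy = $e_acute;
utf8::upgrade($copy);                 # the workaround: force character semantics
my $upper_chars = uc $copy;           # now maps to U+00C9

printf "byte semantics:      U+%04X\n", ord $upper_bytes;
printf "character semantics: U+%04X\n", ord $upper_chars;
```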
1287
1288   Interaction with Extensions
1289       When Perl exchanges data with an extension, the extension should be
1290       able to understand the UTF8 flag and act accordingly. If the extension
1291       doesn't know about the flag, it's likely that the extension will return
1292       incorrectly-flagged data.
1293
       So if you're working with Unicode data, consult the documentation of
       every module you're using to see whether there are any issues with
       Unicode data exchange. If the documentation does not talk about Unicode
       at all, suspect the worst and probably look at the source to learn how
       the module is implemented. Modules written completely in Perl shouldn't
       cause problems. Modules that directly or indirectly access code written
       in other programming languages are at risk.
1301
1302       For affected functions, the simple strategy to avoid data corruption is
1303       to always make the encoding of the exchanged data explicit. Choose an
1304       encoding that you know the extension can handle. Convert arguments
1305       passed to the extensions to that encoding and convert results back from
1306       that encoding. Write wrapper functions that do the conversions for you,
1307       so you can later change the functions when the extension catches up.
1308
1309       To provide an example, let's say the popular Foo::Bar::escape_html
1310       function doesn't deal with Unicode data yet. The wrapper function would
1311       convert the argument to raw UTF-8 and convert the result back to Perl's
1312       internal representation like so:
1313
1314           sub my_escape_html ($) {
1315             my($what) = shift;
1316             return unless defined $what;
1317             Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
1318           }
1319
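       To experiment with this pattern without the extension at hand, here is
       a self-contained sketch.  The function legacy_escape_html is a
       hypothetical stand-in for a byte-oriented routine such as
       Foo::Bar::escape_html; it is not part of any real module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for an extension function that only knows octets.
sub legacy_escape_html {
    my ($octets) = @_;
    $octets =~ s/&/&amp;/g;
    $octets =~ s/</&lt;/g;
    $octets =~ s/>/&gt;/g;
    return $octets;
}

# Wrapper: characters in, characters out, UTF-8 octets in between.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    return Encode::decode_utf8(
        legacy_escape_html(Encode::encode_utf8($what)));
}

my $escaped = my_escape_html("<caf\x{E9} \x{263A}>");
printf "escaped length in characters: %d\n", length $escaped;
```

       Because the wrapper is the only place that knows about the encoding,
       it is also the only place that has to change once the extension
       handles Unicode itself.
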
1320       Sometimes, when the extension does not convert data but just stores and
1321       retrieves them, you will be in a position to use the otherwise
1322       dangerous Encode::_utf8_on() function. Let's say the popular "Foo::Bar"
1323       extension, written in C, provides a "param" method that lets you store
1324       and retrieve data according to these prototypes:
1325
1326           $self->param($name, $value);            # set a scalar
1327           $value = $self->param($name);           # retrieve a scalar
1328
1329       If it does not yet provide support for any encoding, one could write a
1330       derived class with such a "param" method:
1331
1332           sub param {
1333             my($self,$name,$value) = @_;
1334             utf8::upgrade($name);     # make sure it is UTF-8 encoded
1335             if (defined $value) {
1336               utf8::upgrade($value);  # make sure it is UTF-8 encoded
1337               return $self->SUPER::param($name,$value);
1338             } else {
1339               my $ret = $self->SUPER::param($name);
1340               Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
1341               return $ret;
1342             }
1343           }
1344
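       If you cannot be certain that the retrieved octets are well-formed
       UTF-8, a more cautious variant is to let perl check them instead of
       forcing the flag on.  The sketch below uses utf8::decode(), which
       decodes in place and returns false when the octets are not
       well-formed UTF-8; the helper name flag_if_utf8 is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: decode in place if the octets are valid UTF-8,
# otherwise leave the scalar untouched and report failure.
sub flag_if_utf8 {
    my ($ref) = @_;
    return utf8::decode($$ref) ? 1 : 0;
}

my $good = "caf\xC3\xA9";   # "cafe" with e-acute, as UTF-8 octets
my $bad  = "caf\xE9";       # a lone Latin-1 byte: not well-formed UTF-8

my $good_ok = flag_if_utf8(\$good);
my $bad_ok  = flag_if_utf8(\$bad);

printf "good: ok=%d length=%d\n", $good_ok, length $good;
printf "bad:  ok=%d length=%d\n", $bad_ok,  length $bad;
```

       In a "param" method like the one above, the return value could decide
       whether to hand back the decoded string or to treat the stored value
       as raw bytes.
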
1345       Some extensions provide filters on data entry/exit points, such as
1346       DB_File::filter_store_key and family. Look out for such filters in the
1347       documentation of your extensions; they can make the transition to
1348       Unicode data much easier.
1349
1350   Speed
1351       Some functions are slower when working on UTF-8 encoded strings than on
1352       byte encoded strings.  All functions that need to hop over characters
1353       such as length(), substr() or index(), or matching regular expressions
1354       can work much faster when the underlying data are byte-encoded.
1355
1356       In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 a
1357       caching scheme was introduced which will hopefully make the slowness
1358       somewhat less spectacular, at least for some operations.  In general,
1359       operations with UTF-8 encoded strings are still slower. As an example,
1360       the Unicode properties (character classes) like "\p{Nd}" are known to
1361       be quite a bit slower (5-20 times) than their simpler counterparts like
1362       "\d" (then again, there are 268 Unicode characters matching "Nd"
1363       compared with the 10 ASCII characters matching "\d").
1364
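       To see how much this matters on your own perl, the difference can be
       measured with the standard Benchmark module.  This is only a rough
       sketch: the string size and the operations timed are arbitrary
       choices, and the results vary widely between perl versions.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The same text twice: once byte-encoded, once UTF-8 encoded internally.
my $bytes = ("a" x 50_000) . "\xE9";
my $chars = $bytes;
utf8::upgrade($chars);

# Run each case for at least one CPU second and print a comparison table.
cmpthese(-1, {
    bytes => sub {
        my $c = substr($bytes, 49_999, 1);
        my $i = index($bytes, "\xE9");
    },
    chars => sub {
        my $c = substr($chars, 49_999, 1);
        my $i = index($chars, "\xE9");
    },
});
```

       Both strings hold the same characters, so any difference in the table
       comes from the internal encoding alone.
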
1365   Possible problems on EBCDIC platforms
1366       In earlier versions, when byte and character data were concatenated,
1367       the new string was sometimes created by decoding the byte strings as
1368       ISO 8859-1 (Latin-1), even if the old Unicode string used EBCDIC.
1369
1370       If you find any remaining cases of this, please report them as bugs.
1371
1372   Porting code from perl-5.6.X
1373       Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1374       was required to use the "utf8" pragma to declare that a given scope
1375       expected to deal with Unicode data and had to make sure that only
1376       Unicode data were reaching that scope. If you have code that is working
1377       with 5.6, you will need some of the following adjustments to your code.
1378       The examples are written such that the code will continue to work under
1379       5.6, so you should be safe to try them out.
1380
1381       ·   A filehandle that should read or write UTF-8
1382
1383             if ($] > 5.007) {
1384               binmode $fh, ":encoding(utf8)";
1385             }
1386
1387       ·   A scalar that is going to be passed to some extension
1388
1389           Be it Compress::Zlib, Apache::Request or any extension that has no
1390           mention of Unicode in the manpage, you need to make sure that the
1391           UTF8 flag is stripped off. Note that at the time of this writing
1392           (October 2002) the mentioned modules are not UTF-8-aware. Please
1393           check the documentation to verify if this is still true.
1394
1395             if ($] > 5.007) {
1396               require Encode;
1397               $val = Encode::encode_utf8($val); # make octets
1398             }
1399
1400       ·   A scalar we got back from an extension
1401
1402           If you believe the scalar comes back as UTF-8, you will most likely
1403           want the UTF8 flag restored:
1404
1405             if ($] > 5.007) {
1406               require Encode;
1407               $val = Encode::decode_utf8($val);
1408             }
1409
1410       ·   Same thing, if you are really sure it is UTF-8
1411
1412             if ($] > 5.007) {
1413               require Encode;
1414               Encode::_utf8_on($val);
1415             }
1416
1417       ·   A wrapper for fetchrow_array and fetchrow_hashref
1418
1419           When the database contains only UTF-8, a wrapper function or method
1420           is a convenient way to replace all your fetchrow_array and
1421           fetchrow_hashref calls. A wrapper function will also make it easier
1422           to adapt to future enhancements in your database driver. Note that
1423           at the time of this writing (October 2002), the DBI has no
1424           standardized way to deal with UTF-8 data. Please check the
1425           documentation to verify if that is still true.
1426
1427             sub fetchrow {
1428               my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
1429               if ($] < 5.007) {
1430                 return $sth->$what;
1431               } else {
1432                 require Encode;
1433                 if (wantarray) {
1434                   my @arr = $sth->$what;
1435                   for (@arr) {
1436                     defined && /[^\000-\177]/ && Encode::_utf8_on($_);
1437                   }
1438                   return @arr;
1439                 } else {
1440                   my $ret = $sth->$what;
1441                   if (ref $ret) {
1442                     for my $k (keys %$ret) {
1443                       defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
1444                     }
1445                     return $ret;
1446                   } else {
1447                     defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
1448                     return $ret;
1449                   }
1450                 }
1451               }
1452             }
1453
1454       ·   A large scalar that you know can only contain ASCII
1455
1456           Scalars that contain only ASCII and are marked as UTF-8 are
1457           sometimes a drag on your program. If you recognize such a
1458           situation, just remove the UTF8 flag:
1459
1460             utf8::downgrade($val) if $] > 5.007;
1461
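       Several of the adjustments above can be tried together in one small
       script.  This is a sketch for Perl 5.8 or later only, so the "$]"
       guards are left out; an in-memory file stands in for a real
       filehandle, and no actual extension is called.

```perl
use strict;
use warnings;
use Encode ();

my $text = "caf\x{E9} \x{263A}";           # 6 characters

# Filehandle: let an I/O layer do the encoding (in-memory file here).
my $buffer = '';
open my $out, '>:encoding(utf8)', \$buffer or die "open: $!";
print {$out} $text;
close $out or die "close: $!";

open my $in, '<:encoding(utf8)', \$buffer or die "open: $!";
my $read_back = <$in>;
close $in;

# Extension boundary: octets go out, characters come back in.
my $octets  = Encode::encode_utf8($text);  # UTF8 flag is off
my $decoded = Encode::decode_utf8($octets);

# ASCII-only scalar that happens to carry the UTF8 flag.
my $ascii = "plain ASCII";
utf8::upgrade($ascii);
my $flag_before = utf8::is_utf8($ascii) ? 1 : 0;
utf8::downgrade($ascii);
my $flag_after  = utf8::is_utf8($ascii) ? 1 : 0;

printf "characters=%d  bytes on disk=%d  octets=%d\n",
    length $text, length $buffer, length $octets;
printf "flag before downgrade=%d  after=%d\n", $flag_before, $flag_after;
```

       The character count and the octet count differ because the two
       non-ASCII characters take two and three bytes in UTF-8.
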

SEE ALSO

1463       perlunitut, perluniintro, Encode, open, utf8, bytes, perlretut,
1464       "${^UNICODE}" in perlvar
1465
1466
1467
1468perl v5.10.1                      2009-05-14                    PERLUNICODE(1)