PERLUNICODE(1)        Perl Programmers Reference Guide        PERLUNICODE(1)

perlunicode - Unicode support in Perl
7
9 Important Caveats
10 Unicode support is an extensive requirement. While Perl does not
11 implement the Unicode standard or the accompanying technical reports
12 from cover to cover, Perl does support many Unicode features.
13
People who want to learn to use Unicode in Perl should probably read
the Perl Unicode tutorial, perlunitut, before reading this reference
document.
17
18 Input and Output Layers
19 Perl knows when a filehandle uses Perl's internal Unicode encodings
20 (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened
21 with the ":utf8" layer. Other encodings can be converted to Perl's
22 encoding on input or from Perl's encoding on output by use of the
23 ":encoding(...)" layer. See open.
24
25 To indicate that Perl source itself is in UTF-8, use "use utf8;".
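
As a minimal sketch of the layers described above (the temporary file and
the variable names are illustrative assumptions, not part of this
document), the following writes one character through an
":encoding(UTF-8)" layer and reads it back, first as raw bytes and then as
decoded characters:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration; removed automatically at exit.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Output: characters are converted from Perl's internal form to UTF-8.
open(my $out, '>:encoding(UTF-8)', $path) or die "open for write: $!";
print {$out} "\x{263A}";
close $out or die "close: $!";

# Without a decoding layer the file content arrives as bytes.
open(my $raw, '<:raw', $path) or die "open raw: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# With the layer the same content arrives as characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "open for read: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "%d bytes on disk, %d character(s) after decoding\n",
    length($bytes), length($chars);
```

The same file yields different lengths depending on the layer, which is
why the layer has to match the encoding actually used in the file.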
26
27 Regular Expressions
28 The regular expression compiler produces polymorphic opcodes. That
29 is, the pattern adapts to the data and automatically switches to
30 the Unicode character scheme when presented with data that is
31 internally encoded in UTF-8 -- or instead uses a traditional byte
32 scheme when presented with byte data.
33
34 "use utf8" still needed to enable UTF-8/UTF-EBCDIC in scripts
35 As a compatibility measure, the "use utf8" pragma must be
36 explicitly included to enable recognition of UTF-8 in the Perl
37 scripts themselves (in string or regular expression literals, or in
38 identifier names) on ASCII-based machines or to recognize UTF-
39 EBCDIC on EBCDIC-based machines. These are the only times when an
40 explicit "use utf8" is needed. See utf8.
41
42 BOM-marked scripts and UTF-16 scripts autodetected
If a Perl script begins marked with the Unicode BOM (UTF-16LE,
UTF-16BE, or UTF-8), or if the script looks like non-BOM-marked
UTF-16 of either endianness, Perl will correctly read in the script
as Unicode. (BOMless UTF-8 cannot be effectively recognized or
differentiated from ISO 8859-1 or other eight-bit encodings.)
48
49 "use encoding" needed to upgrade non-Latin-1 byte strings
By default, there is a fundamental asymmetry in Perl's Unicode
model: implicit upgrading from byte strings to Unicode strings
assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode
strings are downgraded with UTF-8 encoding. This happens because
the first 256 code points in Unicode happen to agree with Latin-1.
55
56 See "Byte and Character Semantics" for more details.
57
58 Byte and Character Semantics
59 Beginning with version 5.6, Perl uses logically-wide characters to
60 represent strings internally.
61
62 In future, Perl-level operations will be expected to work with
63 characters rather than bytes.
64
65 However, as an interim compatibility measure, Perl aims to provide a
66 safe migration path from byte semantics to character semantics for
67 programs. For operations where Perl can unambiguously decide that the
68 input data are characters, Perl switches to character semantics. For
69 operations where this determination cannot be made without additional
70 information from the user, Perl decides in favor of compatibility and
71 chooses to use byte semantics.
72
73 Under byte semantics, when "use locale" is in effect, Perl uses the
74 semantics associated with the current locale. Absent a "use locale",
75 Perl currently uses US-ASCII (or Basic Latin in Unicode terminology)
76 byte semantics, meaning that characters whose ordinal numbers are in
77 the range 128 - 255 are undefined except for their ordinal numbers.
78 This means that none have case (upper and lower), nor are any a member
79 of character classes, like "[:alpha:]" or "\w". (But all do belong to
80 the "\W" class or the Perl regular expression extension "[:^alpha:]".)
81
This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if none of the
program's inputs were marked as being a source of Unicode character
data. Such data may come from filehandles, from calls to external
programs, from information provided by the system (such as %ENV), or
from literals and constants in the source text.
88
89 The "bytes" pragma will always, regardless of platform, force byte
90 semantics in a particular lexical scope. See bytes.
91
92 The "utf8" pragma is primarily a compatibility device that enables
93 recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
94 Note that this pragma is only required while Perl defaults to byte
95 semantics; when character semantics become the default, this pragma may
96 become a no-op. See utf8.
97
98 Unless explicitly stated, Perl operators use character semantics for
99 Unicode data and byte semantics for non-Unicode data. The decision to
100 use character semantics is made transparently. If input data comes
101 from a Unicode source--for example, if a character encoding layer is
102 added to a filehandle or a literal Unicode string constant appears in a
103 program--character semantics apply. Otherwise, byte semantics are in
104 effect. The "bytes" pragma should be used to force byte semantics on
105 Unicode data.
106
If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have character
semantics. This can cause surprises; see "BUGS", below.
110
111 Under character semantics, many operations that formerly operated on
112 bytes now operate on characters. A character in Perl is logically just
113 a number ranging from 0 to 2**31 or so. Larger characters may encode
114 into longer sequences of bytes internally, but this internal detail is
115 mostly hidden for Perl code. See perluniintro for more.
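
The difference between the two semantics can be sketched as follows. The
variable names are illustrative, and the byte count shown by the script
assumes an ASCII-based platform where the internal encoding is UTF-8:

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";          # one character above 255

# Character semantics: the string is one character long.
my $char_len = length $smiley;

# Byte semantics, forced for this block only by the "bytes" pragma:
# length now reports the size of the internal encoding.
my $byte_len = do { use bytes; length $smiley };

# Concatenating a byte string with Unicode data yields a string with
# character semantics; the byte 0xE9 is upgraded as if it were Latin-1.
my $joined = "\xE9" . $smiley;

printf "characters: %d, bytes: %d, joined length: %d\n",
    $char_len, $byte_len, length($joined);
```

Because "use bytes" is lexically scoped, only the "do" block sees the
internal encoding; the rest of the program keeps character semantics.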
116
117 Effects of Character Semantics
118 Character semantics have the following effects:
119
120 · Strings--including hash keys--and regular expression patterns may
121 contain characters that have an ordinal value larger than 255.
122
123 If you use a Unicode editor to edit your program, Unicode
124 characters may occur directly within the literal strings in UTF-8
125 encoding, or UTF-16. (The former requires a BOM or "use utf8", the
126 latter requires a BOM.)
127
128 Unicode characters can also be added to a string by using the
129 "\x{...}" notation. The Unicode code for the desired character, in
130 hexadecimal, should be placed in the braces. For instance, a smiley
131 face is "\x{263A}". This encoding scheme works for all characters,
132 but for characters under 0x100, note that Perl may use an 8 bit
133 encoding internally, for optimization and/or backward
134 compatibility.
135
136 Additionally, if you
137
138 use charnames ':full';
139
140 you can use the "\N{...}" notation and put the official Unicode
141 character name within the braces, such as "\N{WHITE SMILING FACE}".
142
143 · If an appropriate encoding is specified, identifiers within the
144 Perl script may contain Unicode alphanumeric characters, including
145 ideographs. Perl does not currently attempt to canonicalize
146 variable names.
147
148 · Regular expressions match characters instead of bytes. "." matches
149 a character instead of a byte.
150
151 · Character classes in regular expressions match characters instead
152 of bytes and match against the character properties specified in
153 the Unicode properties database. "\w" can be used to match a
154 Japanese ideograph, for instance.
155
156 · Named Unicode properties, scripts, and block ranges may be used
157 like character classes via the "\p{}" "matches property" construct
158 and the "\P{}" negation, "doesn't match property".
159
160 See "Unicode Character Properties" for more details.
161
162 You can define your own character properties and use them in the
163 regular expression with the "\p{}" or "\P{}" construct.
164
165 See "User-Defined Character Properties" for more details.
166
167 · The special pattern "\X" matches any extended Unicode sequence--"a
168 combining character sequence" in Standardese--where the first
169 character is a base character and subsequent characters are mark
170 characters that apply to the base character. "\X" is equivalent to
171 "(?>\PM\pM*)".
172
173 · The "tr///" operator translates characters instead of bytes. Note
174 that the "tr///CU" functionality has been removed. For similar
175 functionality see pack('U0', ...) and pack('C0', ...).
176
177 · Case translation operators use the Unicode case translation tables
178 when character input is provided. Note that "uc()", or "\U" in
179 interpolated strings, translates to uppercase, while "ucfirst", or
180 "\u" in interpolated strings, translates to titlecase in languages
181 that make the distinction.
182
183 · Most operators that deal with positions or lengths in a string will
184 automatically switch to using character positions, including
185 "chop()", "chomp()", "substr()", "pos()", "index()", "rindex()",
186 "sprintf()", "write()", and "length()". An operator that
187 specifically does not switch is "vec()". Operators that really
188 don't care include operators that treat strings as a bucket of bits
189 such as "sort()", and operators dealing with filenames.
190
191 · The "pack()"/"unpack()" letter "C" does not change, since it is
192 often used for byte-oriented formats. Again, think "char" in the C
193 language.
194
195 There is a new "U" specifier that converts between Unicode
196 characters and code points. There is also a "W" specifier that is
197 the equivalent of "chr"/"ord" and properly handles character values
198 even if they are above 255.
199
200 · The "chr()" and "ord()" functions work on characters, similar to
201 "pack("W")" and "unpack("W")", not "pack("C")" and "unpack("C")".
202 "pack("C")" and "unpack("C")" are methods for emulating byte-
203 oriented "chr()" and "ord()" on Unicode strings. While these
204 methods reveal the internal encoding of Unicode strings, that is
205 not something one normally needs to care about at all.
206
· The bit string operators, "& | ^ ~", can operate on character data.
However, to stay backward compatible with bit string operations on
characters whose ordinal values are all below 256, one should not
use "~" (the bit complement) on a mix of characters with ordinal
values below 256 and characters with ordinal values above 255. Most
importantly, DeMorgan's laws ("~($x|$y) eq ~$x&~$y" and "~($x&$y)
eq ~$x|~$y") will not hold. The reason for this mathematical faux
pas is that the complement cannot return both the 8-bit (byte-wide)
bit complement and the full character-wide bit complement.
216
217 · lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
218
219 · the case mapping is from a single Unicode character to
220 another single Unicode character, or
221
222 · the case mapping is from a single Unicode character to more
223 than one Unicode character.
224
Locale-dependent case mappings (Lithuanian, Turkish, Azeri) do not
work, since Perl does not understand the concept of Unicode locales.
227
228 See the Unicode Technical Report #21, Case Mappings, for more
229 details.
230
231 But you can also define your own mappings to be used in the lc(),
232 lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
233
234 See "User-Defined Case Mappings" for more details.
235
236 · And finally, "scalar reverse()" reverses by character rather than
237 by byte.
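
Several of the effects listed above can be observed in one short script.
This is only a sketch; the sample strings and variable names are
illustrative choices, not taken from this document:

```perl
use strict;
use warnings;
use charnames ':full';

# Two spellings of the same one-character string.
my $smiley = "\x{263A}";
my $named  = "\N{WHITE SMILING FACE}";

# ord() and chr() work on characters, including those above 255.
my $code = ord($smiley);
my $back = chr(0x263A);

# pack("U") builds a character from a code point.
my $packed = pack("U", 0x263A);

# uc() gives uppercase, ucfirst() gives titlecase; the two differ for
# U+01C6 (LATIN SMALL LETTER DZ WITH CARON).
my $dz    = "\x{01C6}";
my $upper = uc($dz);
my $title = ucfirst($dz);

# Positions and lengths are counted in characters.
my $text   = "a\x{263A}b";
my $middle = substr($text, 1, 1);
my $where  = index($text, "b");

# scalar reverse() reverses by character.
my $reversed = scalar reverse $text;

# "." matches one character, "\w" matches an ideograph, and "\X"
# matches a base character together with its combining marks.
my ($dot)     = $smiley =~ /^(.)$/;
my $is_word   = "\x{65E5}" =~ /^\w$/ ? 1 : 0;
my ($cluster) = "e\x{0301}x" =~ /^(\X)/;

printf "U+%04X U+%04X U+%04X, cluster length %d\n",
    $code, ord($upper), ord($title), length($cluster);
```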
238
239 Unicode Character Properties
240 Named Unicode properties, scripts, and block ranges may be used like
241 character classes via the "\p{}" "matches property" construct and the
242 "\P{}" negation, "doesn't match property".
243
244 For instance, "\p{Lu}" matches any character with the Unicode "Lu"
245 (Letter, uppercase) property, while "\p{M}" matches any character with
246 an "M" (mark--accents and such) property. Brackets are not required
247 for single letter properties, so "\p{M}" is equivalent to "\pM". Many
248 predefined properties are available, such as "\p{Mirrored}" and
249 "\p{Tibetan}".
250
251 The official Unicode script and block names have spaces and dashes as
252 separators, but for convenience you can use dashes, spaces, or
253 underbars, and case is unimportant. It is recommended, however, that
254 for consistency you use the following naming: the official Unicode
255 script, property, or block name (see below for the additional rules
256 that apply to block names) with whitespace and dashes removed, and the
257 words "uppercase-first-lowercase-rest". "Latin-1 Supplement" thus
258 becomes "Latin1Supplement".
259
260 You can also use negation in both "\p{}" and "\P{}" by introducing a
261 caret (^) between the first brace and the property name: "\p{^Tamil}"
262 is equal to "\P{Tamil}".
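
A brief sketch of these constructs in use; the sample strings are
illustrative assumptions:

```perl
use strict;
use warnings;

# \p{Lu} picks out the uppercase letters.
my @upper = "Hello, World" =~ /\p{Lu}/g;

# \pM (the same as \p{M}) matches only the combining acute accent.
my $accented = "e\x{0301}";
my $marks = () = $accented =~ /\pM/g;

# \p{^Tamil} and \P{Tamil} are the same negated test.
my $tamil_a      = "\x{0B85}";    # TAMIL LETTER A
my $caret_form   = "a" =~ /^\p{^Tamil}$/     ? 1 : 0;
my $capital_form = "a" =~ /^\P{Tamil}$/      ? 1 : 0;
my $is_tamil     = $tamil_a =~ /^\p{Tamil}$/ ? 1 : 0;

print "uppercase: @upper; marks: $marks\n";
```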
263
264 NOTE: the properties, scripts, and blocks listed here are as of Unicode
265 5.0.0 in July 2006.
266
267 General Category
268 Here are the basic Unicode General Category properties, followed by
269 their long form. You can use either; "\p{Lu}" and
270 "\p{UppercaseLetter}", for instance, are identical.
271
272 Short Long
273
274 L Letter
275 LC CasedLetter
276 Lu UppercaseLetter
277 Ll LowercaseLetter
278 Lt TitlecaseLetter
279 Lm ModifierLetter
280 Lo OtherLetter
281
282 M Mark
283 Mn NonspacingMark
284 Mc SpacingMark
285 Me EnclosingMark
286
287 N Number
288 Nd DecimalNumber
289 Nl LetterNumber
290 No OtherNumber
291
292 P Punctuation
293 Pc ConnectorPunctuation
294 Pd DashPunctuation
295 Ps OpenPunctuation
296 Pe ClosePunctuation
297 Pi InitialPunctuation
298 (may behave like Ps or Pe depending on usage)
299 Pf FinalPunctuation
300 (may behave like Ps or Pe depending on usage)
301 Po OtherPunctuation
302
303 S Symbol
304 Sm MathSymbol
305 Sc CurrencySymbol
306 Sk ModifierSymbol
307 So OtherSymbol
308
309 Z Separator
310 Zs SpaceSeparator
311 Zl LineSeparator
312 Zp ParagraphSeparator
313
314 C Other
315 Cc Control
316 Cf Format
317 Cs Surrogate (not usable)
318 Co PrivateUse
319 Cn Unassigned
320
321 Single-letter properties match all characters in any of the two-
322 letter sub-properties starting with the same letter. "LC" and "L&"
323 are special cases, which are aliases for the set of "Ll", "Lu", and
324 "Lt".
325
326 Because Perl hides the need for the user to understand the internal
327 representation of Unicode characters, there is no need to implement
328 the somewhat messy concept of surrogates. "Cs" is therefore not
329 supported.
330
331 Bidirectional Character Types
332 Because scripts differ in their directionality--Hebrew is written
333 right to left, for example--Unicode supplies these properties in
334 the BidiClass class:
335
336 Property Meaning
337
338 L Left-to-Right
339 LRE Left-to-Right Embedding
340 LRO Left-to-Right Override
341 R Right-to-Left
342 AL Right-to-Left Arabic
343 RLE Right-to-Left Embedding
344 RLO Right-to-Left Override
345 PDF Pop Directional Format
346 EN European Number
347 ES European Number Separator
348 ET European Number Terminator
349 AN Arabic Number
350 CS Common Number Separator
351 NSM Non-Spacing Mark
352 BN Boundary Neutral
353 B Paragraph Separator
354 S Segment Separator
355 WS Whitespace
356 ON Other Neutrals
357
358 For example, "\p{BidiClass:R}" matches characters that are normally
359 written right to left.
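
As a hedged illustration, assuming the Unicode tables shipped with your
Perl assign these classes as the Unicode Character Database does, a
Hebrew letter matches "\p{BidiClass:R}" while an Arabic letter falls
under "AL" instead:

```perl
use strict;
use warnings;

my $hebrew_alef = "\x{05D0}";   # HEBREW LETTER ALEF, class R
my $arabic_alef = "\x{0627}";   # ARABIC LETTER ALEF, class AL
my $latin_a     = "A";          # class L

my $hebrew_is_r  = $hebrew_alef =~ /^\p{BidiClass:R}$/  ? 1 : 0;
my $arabic_is_r  = $arabic_alef =~ /^\p{BidiClass:R}$/  ? 1 : 0;
my $arabic_is_al = $arabic_alef =~ /^\p{BidiClass:AL}$/ ? 1 : 0;
my $latin_is_l   = $latin_a     =~ /^\p{BidiClass:L}$/  ? 1 : 0;

print "Hebrew R: $hebrew_is_r, Arabic R: $arabic_is_r, ",
      "Arabic AL: $arabic_is_al, Latin L: $latin_is_l\n";
```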
360
361 Scripts
362 The script names which can be used by "\p{...}" and "\P{...}", such
363 as in "\p{Latin}" or "\p{Cyrillic}", are as follows:
364
365 Arabic
366 Armenian
367 Balinese
368 Bengali
369 Bopomofo
370 Braille
371 Buginese
372 Buhid
373 CanadianAboriginal
374 Cherokee
375 Coptic
376 Cuneiform
377 Cypriot
378 Cyrillic
379 Deseret
380 Devanagari
381 Ethiopic
382 Georgian
383 Glagolitic
384 Gothic
385 Greek
386 Gujarati
387 Gurmukhi
388 Han
389 Hangul
390 Hanunoo
391 Hebrew
392 Hiragana
393 Inherited
394 Kannada
395 Katakana
396 Kharoshthi
397 Khmer
398 Lao
399 Latin
400 Limbu
401 LinearB
402 Malayalam
403 Mongolian
404 Myanmar
405 NewTaiLue
406 Nko
407 Ogham
408 OldItalic
409 OldPersian
410 Oriya
411 Osmanya
412 PhagsPa
413 Phoenician
414 Runic
415 Shavian
416 Sinhala
417 SylotiNagri
418 Syriac
419 Tagalog
420 Tagbanwa
421 TaiLe
422 Tamil
423 Telugu
424 Thaana
425 Thai
426 Tibetan
427 Tifinagh
428 Ugaritic
429 Yi
430
431 Extended property classes
432 Extended property classes can supplement the basic properties,
433 defined by the PropList Unicode database:
434
435 ASCIIHexDigit
436 BidiControl
437 Dash
438 Deprecated
439 Diacritic
440 Extender
441 HexDigit
442 Hyphen
443 Ideographic
444 IDSBinaryOperator
445 IDSTrinaryOperator
446 JoinControl
447 LogicalOrderException
448 NoncharacterCodePoint
449 OtherAlphabetic
450 OtherDefaultIgnorableCodePoint
451 OtherGraphemeExtend
452 OtherIDStart
453 OtherIDContinue
454 OtherLowercase
455 OtherMath
456 OtherUppercase
457 PatternSyntax
458 PatternWhiteSpace
459 QuotationMark
460 Radical
461 SoftDotted
462 STerm
463 TerminalPunctuation
464 UnifiedIdeograph
465 VariationSelector
466 WhiteSpace
467
468 and there are further derived properties:
469
470 Alphabetic = Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic
471 Lowercase = Ll + OtherLowercase
472 Uppercase = Lu + OtherUppercase
473 Math = Sm + OtherMath
474
475 IDStart = Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart
476 IDContinue = IDStart + Mn + Mc + Nd + Pc + OtherIDContinue
477
478 DefaultIgnorableCodePoint
479 = OtherDefaultIgnorableCodePoint
480 + Cf + Cc + Cs + Noncharacters + VariationSelector
481 - WhiteSpace - FFF9..FFFB (Annotation Characters)
482
483 Any = Any code points (i.e. U+0000 to U+10FFFF)
484 Assigned = Any non-Cn code points (i.e. synonym for \P{Cn})
485 Unassigned = Synonym for \p{Cn}
486 ASCII = ASCII (i.e. U+0000 to U+007F)
487
488 Common = Any character (or unassigned code point)
489 not explicitly assigned to a script
490
491 Use of "Is" Prefix
492 For backward compatibility (with Perl 5.6), all properties
493 mentioned so far may have "Is" prepended to their name, so
494 "\P{IsLu}", for example, is equal to "\P{Lu}".
495
496 Blocks
497 In addition to scripts, Unicode also defines blocks of characters.
498 The difference between scripts and blocks is that the concept of
499 scripts is closer to natural languages, while the concept of blocks
500 is more of an artificial grouping based on groups of 256 Unicode
501 characters. For example, the "Latin" script contains letters from
502 many blocks but does not contain all the characters from those
503 blocks. It does not, for example, contain digits, because digits
504 are shared across many scripts. Digits and similar groups, like
505 punctuation, are in a category called "Common".
506
507 For more about scripts, see the UAX#24 "Script Names":
508
509 http://www.unicode.org/reports/tr24/
510
511 For more about blocks, see:
512
513 http://www.unicode.org/Public/UNIDATA/Blocks.txt
514
515 Block names are given with the "In" prefix. For example, the
516 Katakana block is referenced via "\p{InKatakana}". The "In" prefix
517 may be omitted if there is no naming conflict with a script or any
518 other property, but it is recommended that "In" always be used for
519 block tests to avoid confusion.
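
The distinction between scripts and blocks can be checked directly. In
this sketch (sample characters chosen for illustration), the digit "1"
lies in the Basic Latin block but belongs to the "Common" script rather
than to "Latin":

```perl
use strict;
use warnings;

my $digit_in_block   = "1" =~ /^\p{InBasicLatin}$/ ? 1 : 0;
my $digit_in_latin   = "1" =~ /^\p{Latin}$/        ? 1 : 0;
my $digit_in_common  = "1" =~ /^\p{Common}$/       ? 1 : 0;
my $letter_in_latin  = "A" =~ /^\p{Latin}$/        ? 1 : 0;

# KATAKANA LETTER KA lies in the Katakana block.
my $ka_in_block = "\x{30AB}" =~ /^\p{InKatakana}$/ ? 1 : 0;

print "block: $digit_in_block, Latin: $digit_in_latin, ",
      "Common: $digit_in_common\n";
```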
520
521 These block names are supported:
522
523 InAegeanNumbers
524 InAlphabeticPresentationForms
525 InAncientGreekMusicalNotation
526 InAncientGreekNumbers
527 InArabic
528 InArabicPresentationFormsA
529 InArabicPresentationFormsB
530 InArabicSupplement
531 InArmenian
532 InArrows
533 InBalinese
534 InBasicLatin
535 InBengali
536 InBlockElements
537 InBopomofo
538 InBopomofoExtended
539 InBoxDrawing
540 InBraillePatterns
541 InBuginese
542 InBuhid
543 InByzantineMusicalSymbols
544 InCJKCompatibility
545 InCJKCompatibilityForms
546 InCJKCompatibilityIdeographs
547 InCJKCompatibilityIdeographsSupplement
548 InCJKRadicalsSupplement
549 InCJKStrokes
550 InCJKSymbolsAndPunctuation
551 InCJKUnifiedIdeographs
552 InCJKUnifiedIdeographsExtensionA
553 InCJKUnifiedIdeographsExtensionB
554 InCherokee
555 InCombiningDiacriticalMarks
556 InCombiningDiacriticalMarksSupplement
557 InCombiningDiacriticalMarksforSymbols
558 InCombiningHalfMarks
559 InControlPictures
560 InCoptic
561 InCountingRodNumerals
562 InCuneiform
563 InCuneiformNumbersAndPunctuation
564 InCurrencySymbols
565 InCypriotSyllabary
566 InCyrillic
567 InCyrillicSupplement
568 InDeseret
569 InDevanagari
570 InDingbats
571 InEnclosedAlphanumerics
572 InEnclosedCJKLettersAndMonths
573 InEthiopic
574 InEthiopicExtended
575 InEthiopicSupplement
576 InGeneralPunctuation
577 InGeometricShapes
578 InGeorgian
579 InGeorgianSupplement
580 InGlagolitic
581 InGothic
582 InGreekExtended
583 InGreekAndCoptic
584 InGujarati
585 InGurmukhi
586 InHalfwidthAndFullwidthForms
587 InHangulCompatibilityJamo
588 InHangulJamo
589 InHangulSyllables
590 InHanunoo
591 InHebrew
592 InHighPrivateUseSurrogates
593 InHighSurrogates
594 InHiragana
595 InIPAExtensions
596 InIdeographicDescriptionCharacters
597 InKanbun
598 InKangxiRadicals
599 InKannada
600 InKatakana
601 InKatakanaPhoneticExtensions
602 InKharoshthi
603 InKhmer
604 InKhmerSymbols
605 InLao
606 InLatin1Supplement
607 InLatinExtendedA
608 InLatinExtendedAdditional
609 InLatinExtendedB
610 InLatinExtendedC
611 InLatinExtendedD
612 InLetterlikeSymbols
613 InLimbu
614 InLinearBIdeograms
615 InLinearBSyllabary
616 InLowSurrogates
617 InMalayalam
618 InMathematicalAlphanumericSymbols
619 InMathematicalOperators
620 InMiscellaneousMathematicalSymbolsA
621 InMiscellaneousMathematicalSymbolsB
622 InMiscellaneousSymbols
623 InMiscellaneousSymbolsAndArrows
624 InMiscellaneousTechnical
625 InModifierToneLetters
626 InMongolian
627 InMusicalSymbols
628 InMyanmar
629 InNKo
630 InNewTaiLue
631 InNumberForms
632 InOgham
633 InOldItalic
634 InOldPersian
635 InOpticalCharacterRecognition
636 InOriya
637 InOsmanya
638 InPhagspa
639 InPhoenician
640 InPhoneticExtensions
641 InPhoneticExtensionsSupplement
642 InPrivateUseArea
643 InRunic
644 InShavian
645 InSinhala
646 InSmallFormVariants
647 InSpacingModifierLetters
648 InSpecials
649 InSuperscriptsAndSubscripts
650 InSupplementalArrowsA
651 InSupplementalArrowsB
652 InSupplementalMathematicalOperators
653 InSupplementalPunctuation
654 InSupplementaryPrivateUseAreaA
655 InSupplementaryPrivateUseAreaB
656 InSylotiNagri
657 InSyriac
658 InTagalog
659 InTagbanwa
660 InTags
661 InTaiLe
662 InTaiXuanJingSymbols
663 InTamil
664 InTelugu
665 InThaana
666 InThai
667 InTibetan
668 InTifinagh
669 InUgaritic
670 InUnifiedCanadianAboriginalSyllabics
671 InVariationSelectors
672 InVariationSelectorsSupplement
673 InVerticalForms
674 InYiRadicals
675 InYiSyllables
676 InYijingHexagramSymbols
677
678 User-Defined Character Properties
679 You can define your own character properties by defining subroutines
680 whose names begin with "In" or "Is". The subroutines can be defined in
681 any package. The user-defined properties can be used in the regular
682 expression "\p" and "\P" constructs; if you are using a user-defined
683 property from a package other than the one you are in, you must specify
684 its package in the "\p" or "\P" construct.
685
    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }
692
693 Note that the effect is compile-time and immutable once defined.
694
695 The subroutines must return a specially-formatted string, with one or
696 more newline-separated lines. Each line must be one of the following:
697
698 · A single hexadecimal number denoting a Unicode code point to
699 include.
700
· Two hexadecimal numbers separated by horizontal whitespace (space
or tab characters) denoting a range of Unicode code points to
include.
704
705 · Something to include, prefixed by "+": a built-in character
706 property (prefixed by "utf8::") or a user-defined character
707 property, to represent all the characters in that property; two
708 hexadecimal code points for a range; or a single hexadecimal code
709 point.
710
711 · Something to exclude, prefixed by "-": an existing character
712 property (prefixed by "utf8::") or a user-defined character
713 property, to represent all the characters in that property; two
714 hexadecimal code points for a range; or a single hexadecimal code
715 point.
716
· Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") or a user-defined character
property, to represent all the characters except the characters in
that property; two hexadecimal code points for a range; or a single
hexadecimal code point.

· Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a user-defined character
property, to keep only the characters that are also in that
property; two hexadecimal code points for a range; or a single
hexadecimal code point.
727
728 For example, to define a property that covers both the Japanese
729 syllabaries (hiragana and katakana), you can define
730
    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }
737
738 Imagine that the here-doc end marker is at the beginning of the line.
739 Now you can use "\p{InKana}" and "\P{InKana}".
740
741 You could also have used the existing block property names:
742
    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }
749
Suppose you wanted to match only the allocated characters, not the raw
block ranges: in other words, you want to remove the unassigned code
points ("Cn"):

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }
760
761 The negation is useful for defining (surprise!) negated classes.
762
    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }
770
771 Intersection is useful for getting the common characters matched by two
772 (or more) classes.
773
    sub InFooAndBar {
        return <<'END';
    +main::Foo
    &main::Bar
    END
    }
780
781 It's important to remember not to use "&" for the first set -- that
782 would be intersecting with nothing (resulting in an empty set).
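
The following is a runnable sketch of the "InKana" property built from
existing block names, with the here-doc end marker placed at the
beginning of the line as required. The sample text and the counting
variables are illustrative assumptions:

```perl
use strict;
use warnings;

# User-defined property: both kana blocks, minus unassigned code points.
sub InKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

# HIRAGANA LETTER A, KATAKANA LETTER KA, then three ASCII letters.
my $text = "\x{3042}\x{30AB}abc";

# Count the characters inside and outside the property.
my $kana     = () = $text =~ /\p{InKana}/g;
my $not_kana = () = $text =~ /\P{InKana}/g;

print "kana: $kana, other: $not_kana\n";
```

Since the subroutine lives in package "main" and the patterns are
compiled there, no package qualifier is needed inside "\p{...}".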
783
784 User-Defined Case Mappings
785 You can also define your own mappings to be used in the lc(),
786 lcfirst(), uc(), and ucfirst() (or their string-inlined versions). The
787 principle is similar to that of user-defined character properties: to
788 define subroutines in the "main" package with names like "ToLower" (for
789 lc() and lcfirst()), "ToTitle" (for the first character in ucfirst()),
790 and "ToUpper" (for uc(), and the rest of the characters in ucfirst()).
791
The string returned by the subroutines now needs to be three
hexadecimal numbers separated by tabs: start of the source range,
end of the source range, and start of the destination range. For
example:

    sub ToUpper {
        return <<END;
    0061\t0063\t0041
    END
    }

defines a uc() mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", and "C"; all other characters will remain
unchanged.
806
If there is no source range to speak of, that is, the mapping is from a
single character to another single character, leave the end of the
source range empty, but the two tab characters are still needed.
For example:

    sub ToLower {
        return <<END;
    0041\t\t0061
    END
    }

defines an lc() mapping that causes only "A" to be mapped to "a"; all
other characters will remain unchanged.
820
(For serious hackers only) If you want to introspect the default
mappings, you can find the data in the directory
$Config{privlib}/unicore/To/. The mapping data is returned as the
here-document, and the "utf8::ToSpecFoo" are special exception mappings
derived from $Config{privlib}/unicore/SpecialCasing.txt. The "Digit"
and "Fold" mappings that one can see in the directory are not directly
user-accessible; one can use either the "Unicode::UCD" module or just
match case-insensitively (that's when the "Fold" mapping is used).
829
830 A final note on the user-defined case mappings: they will be used only
831 if the scalar has been marked as having Unicode characters. Old byte-
832 style strings will not be affected.
833
834 Character Encodings for Input and Output
835 See Encode.
836
837 Unicode Regular Expression Support Level
838 The following list of Unicode support for regular expressions describes
839 all the features currently supported. The references to "Level N" and
840 the section numbers refer to the Unicode Technical Standard #18,
841 "Unicode Regular Expressions", version 11, in May 2005.
842
843 · Level 1 - Basic Unicode Support
844
845 RL1.1 Hex Notation - done [1]
846 RL1.2 Properties - done [2][3]
847 RL1.2a Compatibility Properties - done [4]
848 RL1.3 Subtraction and Intersection - MISSING [5]
849 RL1.4 Simple Word Boundaries - done [6]
850 RL1.5 Simple Loose Matches - done [7]
851 RL1.6 Line Boundaries - MISSING [8]
852 RL1.7 Supplementary Code Points - done [9]
853
854 [1] \x{...}
855 [2] \p{...} \P{...}
856 [3] supports not only minimal list (general category, scripts,
857 Alphabetic, Lowercase, Uppercase, WhiteSpace,
858 NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
859 ASCII, Assigned), but also bidirectional types, blocks, etc.
860 (see "Unicode Character Properties")
861 [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
862 [5] can use regular expression look-ahead [a] or
863 user-defined character properties [b] to emulate set operations
864 [6] \b \B
865 [7] note that Perl does Full case-folding in matching, not Simple:
866 for example U+1F88 is equivalent to U+1F00 U+03B9,
867 not with 1F80. This difference matters mainly for certain Greek
868 capital letters with certain modifiers: the Full case-folding
869 decomposes the letter, while the Simple case-folding would map
870 it to a single character.
871 [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r),
872 CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029);
873 should also affect <>, $., and script line numbers;
874 should not split lines within CRLF [c] (i.e. there is no empty
875 line between \r and \n)
[9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF
but also beyond U+10FFFF [d]
878
879 [a] You can mimic class subtraction using lookahead. For example,
880 what UTS#18 might write as
881
882 [{Greek}-[{UNASSIGNED}]]
883
884 in Perl can be written as:
885
886 (?!\p{Unassigned})\p{InGreekAndCoptic}
887 (?=\p{Assigned})\p{InGreekAndCoptic}
888
But in this particular example, you probably really want

    \p{Greek}

which will match assigned characters known to be part of the Greek
script.

Also see the Unicode::Regex::Set module, which implements the full
UTS#18 grouping, intersection, union, and removal (subtraction)
syntax.
899
900 [b] '+' for union, '-' for removal (set-difference), '&' for
901 intersection (see "User-Defined Character Properties")
902
903 [c] Try the ":crlf" layer (see PerlIO).
904
[d] Avoid "use warnings 'utf8';" (or say "no warnings 'utf8';") to
allow U+FFFF ("\x{FFFF}").
907
908 · Level 2 - Extended Unicode Support
909
910 RL2.1 Canonical Equivalents - MISSING [10][11]
911 RL2.2 Default Grapheme Clusters - MISSING [12][13]
912 RL2.3 Default Word Boundaries - MISSING [14]
913 RL2.4 Default Loose Matches - MISSING [15]
914 RL2.5 Name Properties - MISSING [16]
915 RL2.6 Wildcard Properties - MISSING
916
917 [10] see UAX#15 "Unicode Normalization Forms"
918 [11] have Unicode::Normalize but not integrated to regexes
919 [12] have \X but at this level . should equal that
920 [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable
921 clusters as a single grapheme cluster.
922 [14] see UAX#29, Word Boundaries
[15] see UTR#21 "Case Mappings"
924 [16] have \N{...} but neither compute names of CJK Ideographs
925 and Hangul Syllables nor use a loose match [e]
926
927 [e] "\N{...}" allows namespaces (see charnames).
928
929 · Level 3 - Tailored Support
930
931 RL3.1 Tailored Punctuation - MISSING
932 RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
933 RL3.3 Tailored Word Boundaries - MISSING
934 RL3.4 Tailored Loose Matches - MISSING
935 RL3.5 Tailored Ranges - MISSING
936 RL3.6 Context Matching - MISSING [19]
937 RL3.7 Incremental Matches - MISSING
938 ( RL3.8 Unicode Set Sharing )
939 RL3.9 Possible Match Sets - MISSING
940 RL3.10 Folded Matching - MISSING [20]
941 RL3.11 Submatchers - MISSING
942
[17] see UTS#10 "Unicode Collation Algorithm"
944 [18] have Unicode::Collate but not integrated to regexes
945 [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see
946 outside of the target substring
947 [20] need insensitive matching for linguistic features other than case;
948 for example, hiragana to katakana, wide and narrow, simplified Han
949 to traditional Han (see UTR#30 "Character Foldings")
950
951 Unicode Encodings
952 Unicode characters are assigned to code points, which are abstract
953 numbers. To use these numbers, various encodings are needed.
954
955 · UTF-8
956
957 UTF-8 is a variable-length (1 to 6 bytes, current character
958 allocations require 4 bytes), byte-order independent encoding. For
959 ASCII (and we really do mean 7-bit ASCII, not another 8-bit
960 encoding), UTF-8 is transparent.
961
962 The following table is from Unicode 3.2.
963
    Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

    U+0000..U+007F         00..7F
    U+0080..U+07FF         C2..DF    80..BF
    U+0800..U+0FFF         E0        A0..BF    80..BF
    U+1000..U+CFFF         E1..EC    80..BF    80..BF
    U+D000..U+D7FF         ED        80..9F    80..BF
    U+D800..U+DFFF         ******* ill-formed *******
    U+E000..U+FFFF         EE..EF    80..BF    80..BF
    U+10000..U+3FFFF       F0        90..BF    80..BF    80..BF
    U+40000..U+FFFFF       F1..F3    80..BF    80..BF    80..BF
    U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
976
Note the "A0..BF" in "U+0800..U+0FFF", the "80..9F" in
"U+D000..U+D7FF", the "90..BF" in "U+10000..U+3FFFF", and the
"80..8F" in "U+100000..U+10FFFF". The "gaps" in the first and third
of these are caused by legal UTF-8 avoiding non-shortest encodings:
it is technically possible to UTF-8-encode a single code point in
different ways, but that is explicitly forbidden, and the shortest
possible encoding should always be used. So that's what Perl does.
The "80..9F" gap excludes the surrogates "U+D800..U+DFFF", and the
"80..8F" gap stops at "U+10FFFF", the last Unicode code point.
984
985 Another way to look at it is via bits:
986
    Code Points                    1st Byte  2nd Byte  3rd Byte  4th Byte

    0aaaaaaa                       0aaaaaaa
    00000bbbbbaaaaaa               110bbbbb  10aaaaaa
    ccccbbbbbbaaaaaa               1110cccc  10bbbbbb  10aaaaaa
    00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa
993
As you can see, the continuation bytes all begin with 10, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.
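
The bit layout above can be turned into code almost mechanically. The following is a minimal sketch for illustration only, not how Perl encodes strings internally; the names "utf8_bytes" and "utf8_hex" are made up for this example, and the code assumes a code point in the range 0..0x10FFFF and does not reject surrogates. Real programs should use Encode instead.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 bytes (as integers) for one
# code point, following the bit table above.
# Assumes 0 <= $cp <= 0x10FFFF; surrogates are not rejected.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                     # 0aaaaaaa
    return (0xC0 | ($cp >> 6),                      # 110bbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x800;    # 10aaaaaa
    return (0xE0 | ($cp >> 12),                     # 1110cccc
            0x80 | (($cp >> 6) & 0x3F),             # 10bbbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;  # 10aaaaaa
    return (0xF0 | ($cp >> 18),                     # 11110ddd
            0x80 | (($cp >> 12) & 0x3F),            # 10cccccc
            0x80 | (($cp >> 6) & 0x3F),             # 10bbbbbb
            0x80 | ($cp & 0x3F));                   # 10aaaaaa
}

# Hypothetical helper: the same bytes as a hex string.
sub utf8_hex { join ' ', map { sprintf '%02X', $_ } utf8_bytes(@_) }

print utf8_hex(0x20AC), "\n";    # EURO SIGN, prints "E2 82 AC"
```

Because each branch is chosen by the size of the code point, this sketch can only ever produce the shortest form, which is the property the table enforces.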
997
998 · UTF-EBCDIC
999
1000 Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
1001
1002 · UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
1003
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
1006
1007 UTF-16 is a 2 or 4 byte encoding. The Unicode code points
1008 "U+0000..U+FFFF" are stored in a single 16-bit unit, and the code
1009 points "U+10000..U+10FFFF" in two 16-bit units. The latter case is
1010 using surrogates, the first 16-bit unit being the high surrogate,
1011 and the second being the low surrogate.
1012
1013 Surrogates are code points set aside to encode the
1014 "U+10000..U+10FFFF" range of Unicode code points in pairs of 16-bit
1015 units. The high surrogates are the range "U+D800..U+DBFF", and the
1016 low surrogates are the range "U+DC00..U+DFFF". The surrogate
1017 encoding is
1018
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
1025
1026 If you try to generate surrogates (for example by using chr()), you
1027 will get a warning if warnings are turned on, because those code
1028 points are not valid for a Unicode character.
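
As a worked example of the surrogate arithmetic, here is a small sketch using integer shifts and masks, which are equivalent to dividing by 0x400 and taking the remainder. The function names "to_surrogates" and "from_surrogates" are invented for this illustration. The code only computes the 16-bit unit values as numbers, so it never creates surrogate characters and triggers no warnings.

```perl
use strict;
use warnings;

# Hypothetical helper: split a supplementary code point into its
# high and low surrogate values.
sub to_surrogates {
    my ($uni) = @_;
    die sprintf("U+%04X is outside U+10000..U+10FFFF\n", $uni)
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $v = $uni - 0x10000;                  # 20 significant bits
    return (0xD800 + ($v >> 10),             # top 10 bits
            0xDC00 + ($v & 0x3FF));          # bottom 10 bits
}

# Hypothetical helper: combine a surrogate pair into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

printf "U+%04X -> %04X %04X\n", 0x1F600, to_surrogates(0x1F600);
# prints "U+1F600 -> D83D DE00"
```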
1029
1030 Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
1031 itself can be used for in-memory computations, but if storage or
1032 transfer is required either UTF-16BE (big-endian) or UTF-16LE
1033 (little-endian) encodings must be chosen.
1034
1035 This introduces another problem: what if you just know that your
1036 data is UTF-16, but you don't know which endianness? Byte Order
1037 Marks, or BOMs, are a solution to this. A special character has
1038 been reserved in Unicode to function as a byte order marker: the
1039 character with the code point "U+FEFF" is the BOM.
1040
1041 The trick is that if you read a BOM, you will know the byte order,
1042 since if it was written on a big-endian platform, you will read the
1043 bytes "0xFE 0xFF", but if it was written on a little-endian
1044 platform, you will read the bytes "0xFF 0xFE". (And if the
1045 originating platform was writing in UTF-8, you will read the bytes
1046 "0xEF 0xBB 0xBF".)
1047
The way this trick works is that the character with the code point
"U+FFFE" is guaranteed not to be a valid Unicode character, so the
sequence of bytes "0xFF 0xFE" is unambiguously "BOM, represented in
little-endian format" and cannot be "U+FFFE, represented in
big-endian format".
1053
1054 · UTF-32, UTF-32BE, UTF-32LE
1055
The UTF-32 family is pretty much like the UTF-16 family, except
1057 that the units are 32-bit, and therefore the surrogate scheme is
1058 not needed. The BOM signatures will be "0x00 0x00 0xFE 0xFF" for
1059 BE and "0xFF 0xFE 0x00 0x00" for LE.
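
The BOM byte sequences listed for UTF-8, UTF-16 and UTF-32 can be checked against the start of a byte buffer. The following is a sketch only: "sniff_bom" and "@BOMS" are names invented here, and the longest signatures are tried first because the UTF-32LE signature begins with the UTF-16LE one. A buffer starting with "0xFF 0xFE 0x00 0x00" could also be UTF-16LE text whose first character is U+0000, so treat the result as a guess.

```perl
use strict;
use warnings;

# Signatures from the text above, longest first so that UTF-32LE is
# tried before UTF-16LE.
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: name the encoding suggested by a leading BOM,
# or return undef (empty list in list context) if there is none.
sub sniff_bom {
    my ($bytes) = @_;
    for my $bom (@BOMS) {
        my ($sig, $name) = @$bom;
        return $name if substr($bytes, 0, length $sig) eq $sig;
    }
    return;
}

print sniff_bom("\xFF\xFEh\x00i\x00"), "\n";    # prints "UTF-16LE"
```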
1060
1061 · UCS-2, UCS-4
1062
1063 Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
1064 encoding. Unlike UTF-16, UCS-2 is not extensible beyond "U+FFFF",
1065 because it does not use surrogates. UCS-4 is a 32-bit encoding,
1066 functionally identical to UTF-32.
1067
1068 · UTF-7
1069
1070 A seven-bit safe (non-eight-bit) encoding, which is useful if the
1071 transport or storage is not eight-bit safe. Defined by RFC 2152.
1072
1073 Security Implications of Unicode
1074 · Malformed UTF-8
1075
1076 Unfortunately, the specification of UTF-8 leaves some room for
1077 interpretation of how many bytes of encoded output one should
1078 generate from one input Unicode character. Strictly speaking, the
1079 shortest possible sequence of UTF-8 bytes should be generated,
1080 because otherwise there is potential for an input buffer overflow
1081 at the receiving end of a UTF-8 connection. Perl always generates
1082 the shortest length UTF-8, and with warnings on Perl will warn
1083 about non-shortest length UTF-8 along with other malformations,
1084 such as the surrogates, which are not real Unicode code points.
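
To see the strict treatment of malformed input from Perl code, one option is to decode with the Encode module's strict "UTF-8" encoding and the FB_CROAK check, which makes decoding die on malformed data. This is a sketch, not a complete validator; "decodes_ok" is a name made up for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the octets are well-formed strict UTF-8.
sub decodes_ok {
    my ($octets) = @_;    # work on a copy: decode() with a CHECK value
                          # may modify the string it is given
    return eval { Encode::decode('UTF-8', $octets, Encode::FB_CROAK); 1 } ? 1 : 0;
}

print decodes_ok("\x2F"), "\n";            # "/" in its shortest form
print decodes_ok("\xC0\xAF"), "\n";        # non-shortest two-byte form of "/"
print decodes_ok("\xED\xA0\x80"), "\n";    # UTF-8-style bytes for surrogate U+D800
```

Only the first of the three inputs should be accepted.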
1085
1086 · Regular expressions behave slightly differently between byte data
1087 and character (Unicode) data. For example, the "word character"
1088 character class "\w" will work differently depending on if data is
1089 eight-bit bytes or Unicode.
1090
1091 In the first case, the set of "\w" characters is either small--the
1092 default set of alphabetic characters, digits, and the "_"--or, if
1093 you are using a locale (see perllocale), the "\w" might contain a
1094 few more letters according to your language and country.
1095
1096 In the second case, the "\w" set of characters is much, much
1097 larger. Most importantly, even in the set of the first 256
1098 characters, it will probably match different characters: unlike
1099 most locales, which are specific to a language and country pair,
1100 Unicode classifies all the characters that are letters somewhere as
1101 "\w". For example, your locale might not think that LATIN SMALL
1102 LETTER ETH is a letter (unless you happen to speak Icelandic), but
1103 Unicode does.
1104
1105 As discussed elsewhere, Perl has one foot (two hooves?) planted in
1106 each of two worlds: the old world of bytes and the new world of
1107 characters, upgrading from bytes to characters when necessary. If
1108 your legacy code does not explicitly use Unicode, no automatic
1109 switch-over to characters should happen. Characters shouldn't get
1110 downgraded to bytes, either. It is possible to accidentally mix
1111 bytes and characters, however (see perluniintro), in which case
1112 "\w" in regular expressions might start behaving differently.
1113 Review your code. Use warnings and the "strict" pragma.
1114
1115 Unicode in Perl on EBCDIC
1116 The way Unicode is handled on EBCDIC platforms is still experimental.
1117 On such platforms, references to UTF-8 encoding in this document and
1118 elsewhere should be read as meaning the UTF-EBCDIC specified in Unicode
1119 Technical Report 16, unless ASCII vs. EBCDIC issues are specifically
1120 discussed. There is no "utfebcdic" pragma or ":utfebcdic" layer;
1121 rather, "utf8" and ":utf8" are reused to mean the platform's "natural"
1122 8-bit encoding of Unicode. See perlebcdic for more discussion of the
1123 issues.
1124
1125 Locales
1126 Usually locale settings and Unicode do not affect each other, but there
1127 are a couple of exceptions:
1128
· You can enable automatic UTF-8-ification of your standard file
handles, default "open()" layer, and @ARGV by using either the "-C"
command line switch or the "PERL_UNICODE" environment variable; see
perlrun for the documentation of the "-C" switch.
1133
1134 · Perl tries really hard to work both with Unicode and the old byte-
1135 oriented world. Most often this is nice, but sometimes Perl's
1136 straddling of the proverbial fence causes problems.
1137
1138 When Unicode Does Not Happen
While Perl does have extensive ways to input and output in Unicode, and
a few other 'entry points' like @ARGV that can be interpreted as
Unicode (UTF-8), there are still many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.
1144
1145 The following are such interfaces. For all of these interfaces Perl
1146 currently (as of 5.8.3) simply assumes byte strings both as arguments
1147 and results, or UTF-8 strings if the "encoding" pragma has been used.
1148
One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s). For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept. Similarly for qx and system: how well will the
'command line interface' (and which of them?) handle Unicode?
1155
1156 · chdir, chmod, chown, chroot, exec, link, lstat, mkdir, rename,
1157 rmdir, stat, symlink, truncate, unlink, utime, -X
1158
1159 · %ENV
1160
1161 · glob (aka the <*>)
1162
1163 · open, opendir, sysopen
1164
1165 · qx (aka the backtick operator), system
1166
1167 · readdir, readlink
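
Since these interfaces work on byte strings, one way to stay in control is to encode and decode explicitly at the boundary. The sketch below assumes that the file system stores names as UTF-8, which is an assumption about the platform and not something Perl can know; "fs_encode", "fs_decode" and "$FS_ENCODING" are names invented for this example.

```perl
use strict;
use warnings;
use Encode ();

# ASSUMPTION: file names on this system are UTF-8. Change to suit.
my $FS_ENCODING = 'UTF-8';

# Hypothetical helpers: characters -> bytes for the OS, and back.
sub fs_encode { my ($chars) = @_; return Encode::encode($FS_ENCODING, $chars) }
sub fs_decode { my ($bytes) = @_; return Encode::decode($FS_ENCODING, $bytes) }

my $name  = "\x{263A}.txt";       # WHITE SMILING FACE, then ".txt"
my $bytes = fs_encode($name);     # what open(), unlink() etc. would be given

printf "%d characters, %d bytes\n", length($name), length($bytes);
# prints "5 characters, 7 bytes"
```

Names returned by readdir or glob would go through "fs_decode" in the same way before being compared with character strings.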
1168
1169 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1170 Sometimes (see "When Unicode Does Not Happen") there are situations
1171 where you simply need to force a byte string into UTF-8, or vice versa.
1172 The low-level calls utf8::upgrade($bytestring) and
1173 utf8::downgrade($utf8string[, FAIL_OK]) are the answers.
1174
1175 Note that utf8::downgrade() can fail if the string contains characters
1176 that don't fit into a byte.
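
A short sketch of both calls follows. It shows that upgrading changes only the internal representation, not the characters, and that a downgrade with FAIL_OK returns false when a character does not fit into a byte. The variable names are illustrative.

```perl
use strict;
use warnings;

my $s = "caf\xE9";       # four characters, stored as four bytes
utf8::upgrade($s);       # same four characters, now stored as UTF-8
print utf8::is_utf8($s) ? "UTF8 flag on\n" : "UTF8 flag off\n";
print length($s), "\n";  # still 4

my $back = utf8::downgrade($s);          # succeeds: everything fits in a byte

my $wide = "\x{263A}";                   # code point above 255
my $ok   = utf8::downgrade($wide, 1);    # FAIL_OK: false instead of dying
print $ok ? "downgraded\n" : "cannot downgrade\n";
```

Without the FAIL_OK argument the last call would die, so wrap it in eval if failure is possible.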
1177
1178 Using Unicode in XS
1179 If you want to handle Perl Unicode in XS extensions, you may find the
1180 following C APIs useful. See also "Unicode Support" in perlguts for an
1181 explanation about Unicode at the XS level, and perlapi for the API
1182 details.
1183
1184 · "DO_UTF8(sv)" returns true if the "UTF8" flag is on and the bytes
1185 pragma is not in effect. "SvUTF8(sv)" returns true if the "UTF8"
1186 flag is on; the bytes pragma is ignored. The "UTF8" flag being on
1187 does not mean that there are any characters of code points greater
1188 than 255 (or 127) in the scalar or that there are even any
1189 characters in the scalar. What the "UTF8" flag means is that the
1190 sequence of octets in the representation of the scalar is the
1191 sequence of UTF-8 encoded code points of the characters of a
1192 string. The "UTF8" flag being off means that each octet in this
1193 representation encodes a single character with code point 0..255
1194 within the string. Perl's Unicode model is not to use UTF-8 until
1195 it is absolutely necessary.
1196
1197 · "uvchr_to_utf8(buf, chr)" writes a Unicode character code point
1198 into a buffer encoding the code point as UTF-8, and returns a
1199 pointer pointing after the UTF-8 bytes. It works appropriately on
1200 EBCDIC machines.
1201
1202 · "utf8_to_uvchr(buf, lenp)" reads UTF-8 encoded bytes from a buffer
1203 and returns the Unicode character code point and, optionally, the
1204 length of the UTF-8 byte sequence. It works appropriately on
1205 EBCDIC machines.
1206
1207 · "utf8_length(start, end)" returns the length of the UTF-8 encoded
1208 buffer in characters. "sv_len_utf8(sv)" returns the length of the
1209 UTF-8 encoded scalar.
1210
1211 · "sv_utf8_upgrade(sv)" converts the string of the scalar to its
1212 UTF-8 encoded form. "sv_utf8_downgrade(sv)" does the opposite, if
1213 possible. "sv_utf8_encode(sv)" is like sv_utf8_upgrade except that
1214 it does not set the "UTF8" flag. "sv_utf8_decode()" does the
1215 opposite of "sv_utf8_encode()". Note that none of these are to be
1216 used as general-purpose encoding or decoding interfaces: "use
1217 Encode" for that. "sv_utf8_upgrade()" is affected by the encoding
1218 pragma but "sv_utf8_downgrade()" is not (since the encoding pragma
1219 is designed to be a one-way street).
1220
· "is_utf8_char(s)" returns true if the pointer points to a valid UTF-8
character.
1223
1224 · "is_utf8_string(buf, len)" returns true if "len" bytes of the
1225 buffer are valid UTF-8.
1226
1227 · "UTF8SKIP(buf)" will return the number of bytes in the UTF-8
1228 encoded character in the buffer. "UNISKIP(chr)" will return the
1229 number of bytes required to UTF-8-encode the Unicode character code
1230 point. "UTF8SKIP()" is useful for example for iterating over the
1231 characters of a UTF-8 encoded buffer; "UNISKIP()" is useful, for
1232 example, in computing the size required for a UTF-8 encoded buffer.
1233
1234 · "utf8_distance(a, b)" will tell the distance in characters between
1235 the two pointers pointing to the same UTF-8 encoded buffer.
1236
1237 · "utf8_hop(s, off)" will return a pointer to a UTF-8 encoded buffer
1238 that is "off" (positive or negative) Unicode characters displaced
1239 from the UTF-8 buffer "s". Be careful not to overstep the buffer:
1240 "utf8_hop()" will merrily run off the end or the beginning of the
1241 buffer if told to do so.
1242
1243 · "pv_uni_display(dsv, spv, len, pvlim, flags)" and
1244 "sv_uni_display(dsv, ssv, pvlim, flags)" are useful for debugging
1245 the output of Unicode strings and scalars. By default they are
1246 useful only for debugging--they display all characters as
1247 hexadecimal code points--but with the flags "UNI_DISPLAY_ISPRINT",
1248 "UNI_DISPLAY_BACKSLASH", and "UNI_DISPLAY_QQ" you can make the
1249 output more readable.
1250
1251 · "ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)" can be used to
1252 compare two strings case-insensitively in Unicode. For case-
1253 sensitive comparisons you can just use "memEQ()" and "memNE()" as
1254 usual.
1255
1256 For more information, see perlapi, and utf8.c and utf8.h in the Perl
1257 source code distribution.
1258
1260 Interaction with Locales
1261 Use of locales with Unicode data may lead to odd results. Currently,
1262 Perl attempts to attach 8-bit locale info to characters in the range
1263 0..255, but this technique is demonstrably incorrect for locales that
1264 use characters above that range when mapped into Unicode. Perl's
1265 Unicode support will also tend to run slower. Use of locales with
1266 Unicode is discouraged.
1267
1268 Problems with characters whose ordinal numbers are in the range 128 - 255
1269 with no Locale specified
1270 Without a locale specified, unlike all other characters or code points,
1271 these characters have very different semantics in byte semantics versus
1272 character semantics. In character semantics they are interpreted as
1273 Unicode code points, which means they are viewed as Latin-1
1274 (ISO-8859-1). In byte semantics, they are considered to be unassigned
1275 characters, meaning that the only semantics they have is their ordinal
1276 numbers, and that they are not members of various character classes.
1277 None are considered to match "\w" for example, but all match "\W".
1278 Besides these class matches, the known operations that this affects are
1279 those that change the case, regular expression matching while ignoring
1280 case, and quotemeta(). This can lead to unexpected results in which a
1281 string's semantics suddenly change if a code point above 255 is
1282 appended to or removed from it, which changes the string's semantics
1283 from byte to character or vice versa. This behavior is scheduled to
1284 change in version 5.12, but in the meantime, a workaround is to always
1285 call utf8::upgrade($string), or to use the standard modules Encode or
1286 charnames.
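
The difference can be observed directly. In this sketch the same character, code point 0xE9, is held once as a byte string and once upgraded. It assumes no locale is in effect and, on later perls, that the "unicode_strings" feature is not enabled.

```perl
use strict;
use warnings;

my $bytes = "\xE9";      # e with acute, stored as one byte
my $chars = "\xE9";
utf8::upgrade($chars);   # same character, stored as UTF-8

printf "equal as strings: %s\n", $bytes eq $chars ? "yes" : "no";
printf "bytes match \\w:   %s\n", $bytes =~ /\w/ ? "yes" : "no";
printf "chars match \\w:   %s\n", $chars =~ /\w/ ? "yes" : "no";
printf "uc changes bytes: %s\n", uc($bytes) ne $bytes ? "yes" : "no";
printf "uc changes chars: %s\n", uc($chars) ne $chars ? "yes" : "no";
```

The two strings compare equal, yet only the upgraded one matches "\w" and is changed by uc(), which is why the workaround above upgrades strings before such operations.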
1287
1288 Interaction with Extensions
1289 When Perl exchanges data with an extension, the extension should be
1290 able to understand the UTF8 flag and act accordingly. If the extension
1291 doesn't know about the flag, it's likely that the extension will return
1292 incorrectly-flagged data.
1293
So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange. If the documentation does not talk about Unicode at all,
1297 suspect the worst and probably look at the source to learn how the
1298 module is implemented. Modules written completely in Perl shouldn't
1299 cause problems. Modules that directly or indirectly access code written
1300 in other programming languages are at risk.
1301
1302 For affected functions, the simple strategy to avoid data corruption is
1303 to always make the encoding of the exchanged data explicit. Choose an
1304 encoding that you know the extension can handle. Convert arguments
1305 passed to the extensions to that encoding and convert results back from
1306 that encoding. Write wrapper functions that do the conversions for you,
1307 so you can later change the functions when the extension catches up.
1308
1309 To provide an example, let's say the popular Foo::Bar::escape_html
1310 function doesn't deal with Unicode data yet. The wrapper function would
1311 convert the argument to raw UTF-8 and convert the result back to Perl's
1312 internal representation like so:
1313
    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        require Encode;    # load Encode before calling its functions
        Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
    }
1319
1320 Sometimes, when the extension does not convert data but just stores and
1321 retrieves them, you will be in a position to use the otherwise
1322 dangerous Encode::_utf8_on() function. Let's say the popular "Foo::Bar"
1323 extension, written in C, provides a "param" method that lets you store
1324 and retrieve data according to these prototypes:
1325
    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar
1328
1329 If it does not yet provide support for any encoding, one could write a
1330 derived class with such a "param" method:
1331
    sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);     # make sure it is UTF-8 encoded
        if (defined $value) {
            utf8::upgrade($value);  # make sure it is UTF-8 encoded
            return $self->SUPER::param($name,$value);
        } else {
            my $ret = $self->SUPER::param($name);
            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
            return $ret;
        }
    }
1344
Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family. Look out for such filters in the
documentation of your extensions; they can make the transition to
Unicode data much easier.
1349
1350 Speed
1351 Some functions are slower when working on UTF-8 encoded strings than on
1352 byte encoded strings. All functions that need to hop over characters
1353 such as length(), substr() or index(), or matching regular expressions
1354 can work much faster when the underlying data are byte-encoded.
1355
1356 In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 a
1357 caching scheme was introduced which will hopefully make the slowness
1358 somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like "\p{Nd}" are known to
be quite a bit slower (5-20 times) than their simpler counterparts like
"\d" (then again, there are 268 Unicode characters matching "Nd"
compared with the 10 ASCII characters matching "\d").
1364
1365 Possible problems on EBCDIC platforms
1366 In earlier versions, when byte and character data were concatenated,
1367 the new string was sometimes created by decoding the byte strings as
1368 ISO 8859-1 (Latin-1), even if the old Unicode string used EBCDIC.
1369
1370 If you find any of these, please report them as bugs.
1371
1372 Porting code from perl-5.6.X
1373 Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1374 was required to use the "utf8" pragma to declare that a given scope
1375 expected to deal with Unicode data and had to make sure that only
1376 Unicode data were reaching that scope. If you have code that is working
1377 with 5.6, you will need some of the following adjustments to your code.
1378 The examples are written such that the code will continue to work under
1379 5.6, so you should be safe to try them out.
1380
1381 · A filehandle that should read or write UTF-8
1382
    if ($] > 5.007) {
        binmode $fh, ":encoding(utf8)";
    }
1386
1387 · A scalar that is going to be passed to some extension
1388
1389 Be it Compress::Zlib, Apache::Request or any extension that has no
1390 mention of Unicode in the manpage, you need to make sure that the
1391 UTF8 flag is stripped off. Note that at the time of this writing
1392 (October 2002) the mentioned modules are not UTF-8-aware. Please
1393 check the documentation to verify if this is still true.
1394
    if ($] > 5.007) {
        require Encode;
        $val = Encode::encode_utf8($val); # make octets
    }
1399
1400 · A scalar we got back from an extension
1401
1402 If you believe the scalar comes back as UTF-8, you will most likely
1403 want the UTF8 flag restored:
1404
    if ($] > 5.007) {
        require Encode;
        $val = Encode::decode_utf8($val);
    }
1409
1410 · Same thing, if you are really sure it is UTF-8
1411
    if ($] > 5.007) {
        require Encode;
        Encode::_utf8_on($val);
    }
1416
1417 · A wrapper for fetchrow_array and fetchrow_hashref
1418
1419 When the database contains only UTF-8, a wrapper function or method
1420 is a convenient way to replace all your fetchrow_array and
1421 fetchrow_hashref calls. A wrapper function will also make it easier
1422 to adapt to future enhancements in your database driver. Note that
1423 at the time of this writing (October 2002), the DBI has no
1424 standardized way to deal with UTF-8 data. Please check the
1425 documentation to verify if that is still true.
1426
    sub fetchrow {
        my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
        if ($] < 5.007) {
            return $sth->$what;
        } else {
            require Encode;
            if (wantarray) {
                my @arr = $sth->$what;
                for (@arr) {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                }
                return @arr;
            } else {
                my $ret = $sth->$what;
                if (ref $ret) {
                    for my $k (keys %$ret) {
                        defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
                    }
                    return $ret;
                } else {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                    return $ret;
                }
            }
        }
    }
1453
1454 · A large scalar that you know can only contain ASCII
1455
1456 Scalars that contain only ASCII and are marked as UTF-8 are
1457 sometimes a drag to your program. If you recognize such a
1458 situation, just remove the UTF8 flag:
1459
    utf8::downgrade($val) if $] > 5.007;
1461
1463 perlunitut, perluniintro, Encode, open, utf8, bytes, perlretut,
1464 "${^UNICODE}" in perlvar
1465
1466
1467
1468perl v5.10.1 2009-05-14 PERLUNICODE(1)