1 PERLUNICOOK(1)          Perl Programmers Reference Guide          PERLUNICOOK(1)
2
3
4
5 NAME
6 perlunicook - cookbookish examples of handling Unicode in Perl
7
8 DESCRIPTION
9 This manpage contains short recipes demonstrating how to handle common
10 Unicode operations in Perl, plus one complete program at the end. Any
11 undeclared variables in individual recipes are assumed to have a
12 previous appropriate value in them.
13
14 EXAMPLES
15 ℞ 0: Standard preamble
16 Unless otherwise noted, all examples below require this standard
17 preamble to work correctly, with the "#!" adjusted to work on your
18 system:
19
20 #!/usr/bin/env perl
21
22 use v5.36; # or later to get "unicode_strings" feature,
23 # plus strict, warnings
24 use utf8; # so literals and identifiers can be in UTF-8
25 use warnings qw(FATAL utf8); # fatalize encoding glitches
26 use open qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
27 use charnames qw(:full :short); # unneeded in v5.16
28
29 This does make even Unix programmers "binmode" your binary streams, or
30 open them with ":raw", but that's the only way to get at them portably
31 anyway.
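
       For example, a binary stream such as an image file would be opened
       roughly like this (the filename here is only illustrative):

           open(my $img, "< :raw", "photo.jpg")
               || die "can't open photo.jpg: $!";

           # or on a handle that is already open:
           binmode($img, ":raw");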
32
33 WARNING: "use autodie" (pre 2.26) and "use open" do not get along with
34 each other.
35
36 ℞ 1: Generic Unicode-savvy filter
37 Always decompose on the way in, then recompose on the way out.
38
39 use Unicode::Normalize;
40
41 while (<>) {
42 $_ = NFD($_); # decompose + reorder canonically
43 ...
44 } continue {
45 print NFC($_); # recompose (where possible) + reorder canonically
46 }
47
48 ℞ 2: Fine-tuning Unicode warnings
49 As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.
50
51 use v5.14; # subwarnings unavailable any earlier
52 no warnings "nonchar"; # the 66 forbidden non-characters
53 no warnings "surrogate"; # UTF-16/CESU-8 nonsense
54 no warnings "non_unicode"; # for codepoints over 0x10_FFFF
55
56 ℞ 3: Declare source in utf8 for identifiers and literals
57 Without the all-critical "use utf8" declaration, putting UTF‑8 in your
58 literals and identifiers won’t work right. If you used the standard
59 preamble just given above, this already happened. If you did, you can
60 do things like this:
61
62 use utf8;
63
64 my $measure = "Ångström";
65 my @μsoft = qw( cp852 cp1251 cp1252 );
66 my @ὑπέρμεγας = qw( ὑπέρ μεγας );
67 my @鯉 = qw( koi8-f koi8-u koi8-r );
68 my $motto = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL
69
70 If you forget "use utf8", high bytes will be misunderstood as separate
71 characters, and nothing will work right.
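
       A rough illustration of the difference, assuming the source file is
       saved as UTF-8 (the character counts below follow from that
       assumption):

           use utf8;
           say length("Ångström");    # 8: Å and ö are single characters

           no utf8;
           say length("Ångström");    # 10: the same two characters are now
                                      #     misread as two bytes apiece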
72
73 ℞ 4: Characters and their numbers
74 The "ord" and "chr" functions work transparently on all codepoints, not
75 just on ASCII alone — nor, in fact, even just on Unicode alone.
76
77 # ASCII characters
78 ord("A")
79 chr(65)
80
81 # characters from the Basic Multilingual Plane
82 ord("Σ")
83 chr(0x3A3)
84
85 # beyond the BMP
86 ord("𝑛") # MATHEMATICAL ITALIC SMALL N
87 chr(0x1D45B)
88
89 # beyond Unicode! (up to MAXINT)
90 ord("\x{20_0000}")
91 chr(0x20_0000)
92
93 ℞ 5: Unicode literals by character number
94 In an interpolated literal, whether a double-quoted string or a regex,
95 you may specify a character by its number using the "\x{HHHHHH}"
96 escape.
97
98 String: "\x{3a3}"
99 Regex: /\x{3a3}/
100
101 String: "\x{1d45b}"
102 Regex: /\x{1d45b}/
103
104 # even non-BMP ranges in regex work fine
105 /[\x{1D434}-\x{1D467}]/
106
107 ℞ 6: Get character name by number
108 use charnames ();
109 my $name = charnames::viacode(0x03A3);
110
111 ℞ 7: Get character number by name
112 use charnames ();
113 my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
114
115 ℞ 8: Unicode named characters
116 Use the "\N{charname}" notation to get the character by that name for
117 use in interpolated literals (double-quoted strings and regexes). In
118 v5.16, there is an implicit
119
120 use charnames qw(:full :short);
121
122 But prior to v5.16, you must be explicit about which set of charnames
123 you want. The ":full" names are the official Unicode character name,
124 alias, or sequence, which all share a namespace.
125
126 use charnames qw(:full :short latin greek);
127
128 "\N{MATHEMATICAL ITALIC SMALL N}" # :full
129 "\N{GREEK CAPITAL LETTER SIGMA}" # :full
130
131 Anything else is a Perl-specific convenience abbreviation. Specify one
132 or more scripts by names if you want short names that are script-
133 specific.
134
135 "\N{Greek:Sigma}" # :short
136 "\N{ae}" # latin
137 "\N{epsilon}" # greek
138
139 The v5.16 release also supports a ":loose" import for loose matching of
140 character names, which works just like loose matching of property
141 names: that is, it disregards case, whitespace, and underscores:
142
143 "\N{euro sign}" # :loose (from v5.16)
144
145 Starting in v5.32, you can also use
146
147 qr/\p{name=euro sign}/
148
149 to get official Unicode named characters in regular expressions. Loose
150 matching is always done for these.
151
152 ℞ 9: Unicode named sequences
153 These look just like character names but return multiple codepoints.
154 Notice the %vx vector-print functionality in "printf".
155
156 use charnames qw(:full);
157 my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
158 printf "U+%v04X\n", $seq;
159 U+0100.0300
160
161 ℞ 10: Custom named characters
162 Use ":alias" to give your own lexically scoped nicknames to existing
163 characters, or even to give unnamed private-use characters useful
164 names.
165
166 use charnames ":full", ":alias" => {
167 ecute => "LATIN SMALL LETTER E WITH ACUTE",
168 "APPLE LOGO" => 0xF8FF, # private use character
169 };
170
171 "\N{ecute}"
172 "\N{APPLE LOGO}"
173
174 ℞ 11: Names of CJK codepoints
175 Sinograms like “東京” come back with character names of "CJK UNIFIED
176 IDEOGRAPH-6771" and "CJK UNIFIED IDEOGRAPH-4EAC", because their “names”
177 vary. The CPAN "Unicode::Unihan" module has a large database for
178 decoding these (and a whole lot more), provided you know how to
179 understand its output.
180
181 # cpan -i Unicode::Unihan
182 use Unicode::Unihan;
183 my $str = "東京";
184 my $unhan = Unicode::Unihan->new;
185 for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
186 printf "CJK $str in %-12s is ", $lang;
187 say $unhan->$lang($str);
188 }
189
190 prints:
191
192 CJK 東京 in Mandarin is DONG1JING1
193 CJK 東京 in Cantonese is dung1ging1
194 CJK 東京 in Korean is TONGKYENG
195 CJK 東京 in JapaneseOn is TOUKYOU KEI KIN
196 CJK 東京 in JapaneseKun is HIGASHI AZUMAMIYAKO
197
198 If you have a specific romanization scheme in mind, use the specific
199 module:
200
201 # cpan -i Lingua::JA::Romanize::Japanese
202 use Lingua::JA::Romanize::Japanese;
203 my $k2r = Lingua::JA::Romanize::Japanese->new;
204 my $str = "東京";
205 say "Japanese for $str is ", $k2r->chars($str);
206
207 prints
208
209 Japanese for 東京 is toukyou
210
211 ℞ 12: Explicit encode/decode
212 On rare occasion, such as a database read, you may be given encoded
213 text you need to decode.
214
215 use Encode qw(encode decode);
216
217 my $chars = decode("shiftjis", $bytes, 1);
218 # OR
219 my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);
220
221 For streams all in the same encoding, don't use encode/decode; instead
222 set the file encoding when you open the file or immediately after with
223 "binmode" as described later below.
224
225 ℞ 13: Decode program arguments as utf8
226 $ perl -CA ...
227 or
228 $ export PERL_UNICODE=A
229 or
230 use Encode qw(decode);
231 @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
232
233 ℞ 14: Decode program arguments as locale encoding
234 # cpan -i Encode::Locale
235 use Encode qw(decode);
236 use Encode::Locale;
237
238 # use "locale" as an arg to encode/decode
239 @ARGV = map { decode(locale => $_, 1) } @ARGV;
240
241 ℞ 15: Declare STD{IN,OUT,ERR} to be utf8
242 Use a command-line option, an environment variable, or else call
243 "binmode" explicitly:
244
245 $ perl -CS ...
246 or
247 $ export PERL_UNICODE=S
248 or
249 use open qw(:std :encoding(UTF-8));
250 or
251 binmode(STDIN, ":encoding(UTF-8)");
252 binmode(STDOUT, ":utf8");
253 binmode(STDERR, ":utf8");
254
255 ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding
256 # cpan -i Encode::Locale
257 use Encode;
258 use Encode::Locale;
259
260 # or as a stream for binmode or open
261 binmode STDIN, ":encoding(console_in)" if -t STDIN;
262 binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
263 binmode STDERR, ":encoding(console_out)" if -t STDERR;
264
265 ℞ 17: Make file I/O default to utf8
266 Files opened without an encoding argument will be in UTF-8:
267
268 $ perl -CD ...
269 or
270 $ export PERL_UNICODE=D
271 or
272 use open qw(:encoding(UTF-8));
273
274 ℞ 18: Make all I/O and args default to utf8
275 $ perl -CSDA ...
276 or
277 $ export PERL_UNICODE=SDA
278 or
279 use open qw(:std :encoding(UTF-8));
280 use Encode qw(decode);
281 @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
282
283 ℞ 19: Open file with specific encoding
284 Specify stream encoding. This is the normal way to deal with encoded
285 text, not by calling low-level functions.
286
287 # input file
288 open(my $in_file, "< :encoding(UTF-16)", "wintext");
289 OR
290 open(my $in_file, "<", "wintext");
291 binmode($in_file, ":encoding(UTF-16)");
292 THEN
293 my $line = <$in_file>;
294
295 # output file
296 open(my $out_file, "> :encoding(cp1252)", "wintext");
297 OR
298 open(my $out_file, ">", "wintext");
299 binmode($out_file, ":encoding(cp1252)");
300 THEN
301 print $out_file "some text\n";
302
303 More layers than just the encoding can be specified here. For example,
304 the incantation ":raw :encoding(UTF-16LE) :crlf" includes implicit CRLF
305 handling.
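
       For instance, a UTF-16LE file with Windows line endings might be read
       like this (the filename is only illustrative):

           open(my $fh, "< :raw :encoding(UTF-16LE) :crlf", "winlog.txt")
               || die "can't open winlog.txt: $!";
           while (my $line = <$fh>) {
               chomp $line;    # lines arrive as characters, with CRLF
                               # already folded to \n by the :crlf layer
               ...
           }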
306
307 ℞ 20: Unicode casing
308 Unicode casing is very different from ASCII casing.
309
310 uc("henry ⅷ") # "HENRY Ⅷ"
311 uc("tschüß") # "TSCHÜSS" notice ß => SS
312
313 # both are true:
314 "tschüß" =~ /TSCHÜSS/i # notice ß => SS
315 "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i # notice Σ,σ,ς sameness
316
317 ℞ 21: Unicode case-insensitive comparisons
318 Also available in the CPAN Unicode::CaseFold module, the new "fc"
319 “foldcase” function from v5.16 grants access to the same Unicode
320 casefolding as the "/i" pattern modifier has always used:
321
322 use feature "fc"; # fc() function is from v5.16
323
324 # sort case-insensitively
325 my @sorted = sort { fc($a) cmp fc($b) } @list;
326
327 # both are true:
328 fc("tschüß") eq fc("TSCHÜSS")
329 fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")
330
331 ℞ 22: Match Unicode linebreak sequence in regex
332 A Unicode linebreak matches the two-character CRLF grapheme or any of
333 seven vertical whitespace characters. Good for dealing with textfiles
334 coming from different operating systems.
335
336 \R
337
338 s/\R/\n/g; # normalize all linebreaks to \n
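
       The same "\R" also works with "split" if you want the lines themselves
       rather than an in-place substitution; a minimal sketch, assuming
       $slurped holds a whole file read in as one string:

           my @lines = split /\R/, $slurped;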
339
340 ℞ 23: Get character category
341 Find the general category of a numeric codepoint.
342
343 use Unicode::UCD qw(charinfo);
344 my $cat = charinfo(0x3A3)->{category}; # "Lu"
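
       If all you need is a yes/no answer rather than the category string
       itself, a property match does the same job; a minimal sketch:

           say "uppercase letter" if chr(0x3A3) =~ /\p{Gc=Lu}/;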
345
346 ℞ 24: Disabling Unicode-awareness in builtin charclasses
347 Disable "\w", "\b", "\s", "\d", and the POSIX classes from working
348 correctly on Unicode either in this scope, or in just one regex.
349
350 use v5.14;
351 use re "/a";
352
353 # OR
354
355 my($num) = $str =~ /(\d+)/a;
356
357 Or use specific un-Unicode properties, like "\p{ahex}" and
358 "\p{POSIX_Digit}". Properties still work normally no matter what
359 charset modifiers ("/d /u /l /a /aa") are in effect.
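
       For instance, one way to capture only ASCII digits without changing
       the charset semantics of the whole pattern might be:

           my($num) = $str =~ /(\p{POSIX_Digit}+)/;   # ASCII [0-9] only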
360
361 ℞ 25: Match Unicode properties in regex with \p, \P
362 These all match a single codepoint with the given property. Use "\P"
363 in place of "\p" to match one codepoint lacking that property.
364
365 \pL, \pN, \pS, \pP, \pM, \pZ, \pC
366 \p{Sk}, \p{Ps}, \p{Lt}
367 \p{alpha}, \p{upper}, \p{lower}
368 \p{Latin}, \p{Greek}
369 \p{script_extensions=Latin}, \p{scx=Greek}
370 \p{East_Asian_Width=Wide}, \p{EA=W}
371 \p{Line_Break=Hyphen}, \p{LB=HY}
372 \p{Numeric_Value=4}, \p{NV=4}
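
       A few of these in context, as a rough sketch (assuming $word and $str
       hold decoded text):

           say "all Greek letters"   if $word =~ /^\p{Greek}+$/;
           say "contains currency"   if $str  =~ /\p{Sc}/;
           say "has a mark character" if $str =~ /\pM/;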
373
374 ℞ 26: Custom character properties
375 Define at compile-time your own custom character properties for use in
376 regexes.
377
378 # using private-use characters
379 sub In_Tengwar { "E000\tE07F\n" }
380
381 if (/\p{In_Tengwar}/) { ... }
382
383 # blending existing properties
384 sub Is_GraecoRoman_Title {<<'END_OF_SET'}
385 +utf8::IsLatin
386 +utf8::IsGreek
387 &utf8::IsTitle
388 END_OF_SET
389
390 if (/\p{Is_GraecoRoman_Title}/) { ... }
391
392 ℞ 27: Unicode normalization
393 Typically render into NFD on input and NFC on output. Using NFKC or
394 NFKD functions improves recall on searches, assuming you've already done
395 the same thing to the text to be searched. Note that this is about much
396 more than just pre-combined compatibility glyphs; it also reorders marks
397 according to their canonical combining classes and weeds out
398 singletons.
399
400 use Unicode::Normalize;
401 my $nfd = NFD($orig);
402 my $nfc = NFC($orig);
403 my $nfkd = NFKD($orig);
404 my $nfkc = NFKC($orig);
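
       As a quick illustration of why this matters for comparisons, the
       precomposed and decomposed spellings of "é" compare equal only once
       both sides have been normalized the same way; a minimal sketch:

           my $precomposed = "\x{E9}";       # é as one codepoint
           my $decomposed  = "e\x{301}";     # e + COMBINING ACUTE ACCENT
           say "same" if NFC($precomposed) eq NFC($decomposed);   # prints "same"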
405
406 ℞ 28: Convert non-ASCII Unicode numerics
407 Unless you’ve used "/a" or "/aa", "\d" matches more than ASCII digits
408 only, but Perl’s implicit string-to-number conversion does not currently
409 recognize these. Here’s how to convert such strings manually.
410
411 use v5.14; # needed for num() function
412 use Unicode::UCD qw(num);
413 my $str = "got Ⅻ and ४५६७ and ⅞ and here";
414 my @nums = ();
415 while ($str =~ /(\d+|\N)/g) { # not just ASCII!
416 push @nums, num($1);
417 }
418 say "@nums"; # 12 4567 0.875
419
420 use charnames qw(:full);
421 my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
422
423 ℞ 29: Match Unicode grapheme cluster in regex
424 Programmer-visible “characters” are codepoints matched by "/./s", but
425 user-visible “characters” are graphemes matched by "/\X/".
426
427 # Find vowel *plus* any combining diacritics, underlining, etc.
428 my $nfd = NFD($orig);
429 $nfd =~ / (?=[aeiou]) \X /xi
430
431 ℞ 30: Extract by grapheme instead of by codepoint (regex)
432 # match and grab five first graphemes
433 my($first_five) = $str =~ /^ ( \X{5} ) /x;
434
435 ℞ 31: Extract by grapheme instead of by codepoint (substr)
436 # cpan -i Unicode::GCString
437 use Unicode::GCString;
438 my $gcs = Unicode::GCString->new($str);
439 my $first_five = $gcs->substr(0, 5);
440
441 ℞ 32: Reverse string by grapheme
442 Reversing by codepoint messes up diacritics, mistakenly converting
443 "crème brûlée" into "éel̂urb em̀erc" instead of into "eélûrb emèrc"; so
444 reverse by grapheme instead. Both these approaches work right no
445 matter what normalization the string is in:
446
447 $str = join("", reverse $str =~ /\X/g);
448
449 # OR: cpan -i Unicode::GCString
450 use Unicode::GCString;
451 $str = reverse Unicode::GCString->new($str);
452
453 ℞ 33: String length in graphemes
454 The string "brûlée" has six graphemes but up to eight codepoints. This
455 counts by grapheme, not by codepoint:
456
457 my $str = "brûlée";
458 my $count = 0;
459 while ($str =~ /\X/g) { $count++ }
460
461 # OR: cpan -i Unicode::GCString
462 use Unicode::GCString;
463 my $gcs = Unicode::GCString->new($str);
464 my $count = $gcs->length;
465
466 ℞ 34: Unicode column-width for printing
467 Perl’s "printf", "sprintf", and "format" think all codepoints take up 1
468 print column, but many take 0 or 2. Here to show that normalization
469 makes no difference, we print out both forms:
470
471 use Unicode::GCString;
472 use Unicode::Normalize;
473
474 my @words = qw/crème brûlée/;
475 @words = map { NFC($_), NFD($_) } @words;
476
477 for my $str (@words) {
478 my $gcs = Unicode::GCString->new($str);
479 my $cols = $gcs->columns;
480 my $pad = " " x (10 - $cols);
481 say $str, $pad, " |";
482 }
483
484 generates this to show that it pads correctly no matter the
485 normalization:
486
487 crème |
488 crème |
489 brûlée |
490 brûlée |
491
492 ℞ 35: Unicode collation
493 Text sorted by numeric codepoint follows no reasonable alphabetic
494 order; use the UCA for sorting text.
495
496 use Unicode::Collate;
497 my $col = Unicode::Collate->new();
498 my @list = $col->sort(@old_list);
499
500 See the ucsort program from the Unicode::Tussle CPAN module for a
501 convenient command-line interface to this module.
502
503 ℞ 36: Case- and accent-insensitive Unicode sort
504 Specify a collation strength of level 1 to ignore case and diacritics,
505 only looking at the basic character.
506
507 use Unicode::Collate;
508 my $col = Unicode::Collate->new(level => 1);
509 my @list = $col->sort(@old_list);
510
511 ℞ 37: Unicode locale collation
512 Some locales have special sorting rules.
513
514 # either use v5.12, OR: cpan -i Unicode::Collate::Locale
515 use Unicode::Collate::Locale;
516 my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
517 my @list = $col->sort(@old_list);
518
519 The ucsort program mentioned above accepts a "--locale" parameter.
520
521 ℞ 38: Making "cmp" work on text instead of codepoints
522 Instead of this:
523
524 @srecs = sort {
525 $b->{AGE} <=> $a->{AGE}
526 ||
527 $a->{NAME} cmp $b->{NAME}
528 } @recs;
529
530 Use this:
531
532 my $coll = Unicode::Collate->new();
533 for my $rec (@recs) {
534 $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
535 }
536 @srecs = sort {
537 $b->{AGE} <=> $a->{AGE}
538 ||
539 $a->{NAME_key} cmp $b->{NAME_key}
540 } @recs;
541
542 ℞ 39: Case- and accent-insensitive comparisons
543 Use a collator object to compare Unicode text by character instead of
544 by codepoint.
545
546 use Unicode::Collate;
547 my $es = Unicode::Collate->new(
548 level => 1,
549 normalization => undef
550 );
551
552 # now both are true:
553 $es->eq("García", "GARCIA" );
554 $es->eq("Márquez", "MARQUEZ");
555
556 ℞ 40: Case- and accent-insensitive locale comparisons
557 Same, but in a specific locale.
558
559 my $de = Unicode::Collate::Locale->new(
560 locale => "de__phonebook",
561 );
562
563 # now this is true:
564 $de->eq("tschüß", "TSCHUESS"); # notice ü => UE, ß => SS
565
566 ℞ 41: Unicode linebreaking
567 Break up text into lines according to Unicode rules.
568
569 # cpan -i Unicode::LineBreak
570 use Unicode::LineBreak;
571 use charnames qw(:full);
572
573 my $para = "This is a super\N{HYPHEN}long string. " x 20;
574 my $fmt = Unicode::LineBreak->new;
575 print $fmt->break($para), "\n";
576
577 ℞ 42: Unicode text in DBM hashes, the tedious way
578 Using a regular Perl string as a key or value for a DBM hash will
579 trigger a wide character exception if any codepoints won’t fit into a
580 byte. Here’s how to manually manage the translation:
581
582 use DB_File;
583 use Encode qw(encode decode);
584 tie %dbhash, "DB_File", "pathname";
585
586 # STORE
587
588 # assume $uni_key and $uni_value are abstract Unicode strings
589 my $enc_key = encode("UTF-8", $uni_key, 1);
590 my $enc_value = encode("UTF-8", $uni_value, 1);
591 $dbhash{$enc_key} = $enc_value;
592
593 # FETCH
594
595 # assume $uni_key holds a normal Perl string (abstract Unicode)
596 my $enc_key = encode("UTF-8", $uni_key, 1);
597 my $enc_value = $dbhash{$enc_key};
598 my $uni_value = decode("UTF-8", $enc_value, 1);
599
600 ℞ 43: Unicode text in DBM hashes, the easy way
601 Here’s how to implicitly manage the translation; all encoding and
602 decoding is done automatically, just as with streams that have a
603 particular encoding attached to them:
604
605 use DB_File;
606 use DBM_Filter;
607
608 my $dbobj = tie %dbhash, "DB_File", "pathname";
609 $dbobj->Filter_Value("utf8"); # this is the magic bit
610
611 # STORE
612
613 # assume $uni_key and $uni_value are abstract Unicode strings
614 $dbhash{$uni_key} = $uni_value;
615
616 # FETCH
617
618 # $uni_key holds a normal Perl string (abstract Unicode)
619 my $uni_value = $dbhash{$uni_key};
620
621 ℞ 44: PROGRAM: Demo of Unicode collation and printing
622 Here’s a full program showing how to make use of locale-sensitive
623 sorting, Unicode casing, and managing print widths when some of the
624 characters take up zero or two columns, not just one column each time.
625 When run, the following program produces this nicely aligned output:
626
627 Crème Brûlée....... €2.00
628 Éclair............. €1.60
629 Fideuà............. €4.20
630 Hamburger.......... €6.00
631 Jamón Serrano...... €4.45
632 Linguiça........... €7.00
633 Pâté............... €4.15
634 Pears.............. €2.00
635 Pêches............. €2.25
636 Smørbrød........... €5.75
637 Spätzle............ €5.50
638 Xoriço............. €3.00
639 Γύρος.............. €6.50
640 막걸리............. €4.00
641 おもち............. €2.65
642 お好み焼き......... €8.00
643 シュークリーム..... €1.85
644 寿司............... €9.99
645 包子............... €7.50
646
647 Here's that program.
648
649 #!/usr/bin/env perl
650 # umenu - demo sorting and printing of Unicode food
651 #
652 # (obligatory and increasingly long preamble)
653 #
654 use v5.36;
655 use utf8;
656 use warnings qw(FATAL utf8); # fatalize encoding faults
657 use open qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
658 use charnames qw(:full :short); # unneeded in v5.16
659
660 # std modules
661 use Unicode::Normalize; # std perl distro as of v5.8
662 use List::Util qw(max); # std perl distro as of v5.10
663 use Unicode::Collate::Locale; # std perl distro as of v5.14
664
665 # cpan modules
666 use Unicode::GCString; # from CPAN
667
668 my %price = (
669 "γύρος" => 6.50, # gyros
670 "pears" => 2.00, # like um, pears
671 "linguiça" => 7.00, # spicy sausage, Portuguese
672 "xoriço" => 3.00, # chorizo sausage, Catalan
673 "hamburger" => 6.00, # burgermeister meisterburger
674 "éclair" => 1.60, # dessert, French
675 "smørbrød" => 5.75, # sandwiches, Norwegian
676 "spätzle" => 5.50, # Bayerisch noodles, little sparrows
677 "包子" => 7.50, # bao1 zi5, steamed pork buns, Mandarin
678 "jamón serrano" => 4.45, # country ham, Spanish
679 "pêches" => 2.25, # peaches, French
680 "シュークリーム" => 1.85, # cream-filled pastry like eclair
681 "막걸리" => 4.00, # makgeolli, Korean rice wine
682 "寿司" => 9.99, # sushi, Japanese
683 "おもち" => 2.65, # omochi, rice cakes, Japanese
684 "crème brûlée" => 2.00, # crema catalana
685 "fideuà" => 4.20, # more noodles, Valencian
686 # (Catalan=fideuada)
687 "pâté" => 4.15, # gooseliver paste, French
688 "お好み焼き" => 8.00, # okonomiyaki, Japanese
689 );
690
691 my $width = 5 + max map { colwidth($_) } keys %price;
692
693 # So the Asian stuff comes out in an order that someone
694 # who reads those scripts won't freak out over; the
695 # CJK stuff will be in JIS X 0208 order that way.
696 my $coll = Unicode::Collate::Locale->new(locale => "ja");
697
698 for my $item ($coll->sort(keys %price)) {
699 print pad(entitle($item), $width, ".");
700 printf " €%.2f\n", $price{$item};
701 }
702
703 sub pad ($str, $width, $padchar) {
704 return $str . ($padchar x ($width - colwidth($str)));
705 }
706
707 sub colwidth ($str) {
708 return Unicode::GCString->new($str)->columns;
709 }
710
711 sub entitle ($str) {
712 $str =~ s{ (?=\pL)(\S) (\S*) }
713 { ucfirst($1) . lc($2) }xge;
714 return $str;
715 }
716
717 SEE ALSO
718 See these manpages, some of which are CPAN modules: perlunicode,
719 perluniprops, perlre, perlrecharclass, perluniintro, perlunitut,
720 perlunifaq, PerlIO, DB_File, DBM_Filter, DBM_Filter::utf8, Encode,
721 Encode::Locale, Unicode::UCD, Unicode::Normalize, Unicode::GCString,
722 Unicode::LineBreak, Unicode::Collate, Unicode::Collate::Locale,
723 Unicode::Unihan, Unicode::CaseFold, Unicode::Tussle,
724 Lingua::JA::Romanize::Japanese, Lingua::ZH::Romanize::Pinyin,
725 Lingua::KO::Romanize::Hangul.
726
727 The Unicode::Tussle CPAN module includes many programs to help with
728 working with Unicode, including these programs to fully or partly
729 replace standard utilities: tcgrep instead of egrep, uniquote instead
730 of cat -v or hexdump, uniwc instead of wc, unilook instead of look,
731 unifmt instead of fmt, and ucsort instead of sort. For exploring
732 Unicode character names and character properties, see its uniprops,
733 unichars, and uninames programs. It also supplies these programs, all
734 of which are general filters that do Unicode-y things: unititle and
735 unicaps; uniwide and uninarrow; unisupers and unisubs; nfd, nfc, nfkd,
736 and nfkc; and uc, lc, and tc.
737
738 Finally, see the published Unicode Standard (page numbers are from
739 version 6.0.0), including these specific annexes and technical reports:
740
741 §3.13 Default Case Algorithms, page 113; §4.2 Case, pages 120–122;
742 Case Mappings, page 166–172, especially Caseless Matching starting on
743 page 170.
744 UAX #44: Unicode Character Database
745 UTS #18: Unicode Regular Expressions
746 UAX #15: Unicode Normalization Forms
747 UTS #10: Unicode Collation Algorithm
748 UAX #29: Unicode Text Segmentation
749 UAX #14: Unicode Line Breaking Algorithm
750 UAX #11: East Asian Width
751
752 AUTHOR
753 Tom Christiansen <tchrist@perl.com> wrote this, with occasional
754 kibbitzing from Larry Wall and Jeffrey Friedl in the background.
755
756 COPYRIGHT AND LICENCE
757 Copyright © 2012 Tom Christiansen.
758
759 This program is free software; you may redistribute it and/or modify it
760 under the same terms as Perl itself.
761
762 Most of these examples taken from the current edition of the “Camel
763 Book”; that is, from the 4ᵗʰ Edition of Programming Perl, Copyright ©
764 2012 Tom Christiansen <et al.>, 2012-02-13 by O’Reilly Media. The code
765 itself is freely redistributable, and you are encouraged to transplant,
766 fold, spindle, and mutilate any of the examples in this manpage however
767 you please for inclusion into your own programs without any encumbrance
768 whatsoever. Acknowledgement via code comment is polite but not
769 required.
770
771 REVISION HISTORY
772 v1.0.0 – first public release, 2012-02-27
773
774
775
776 perl v5.38.2                     2023-11-30                    PERLUNICOOK(1)