PERLUNICOOK(1)         Perl Programmers Reference Guide         PERLUNICOOK(1)

NAME
       perlunicook - cookbookish examples of handling Unicode in Perl

DESCRIPTION
       This manpage contains short recipes demonstrating how to handle common
       Unicode operations in Perl, plus one complete program at the end.  Any
       undeclared variables in individual recipes are assumed to have a
       previous appropriate value in them.

EXAMPLES
   ℞ 0: Standard preamble
       Unless otherwise noted, all examples below require this standard
       preamble to work correctly, with the "#!" adjusted to work on your
       system:

        #!/usr/bin/env perl

        use utf8;      # so literals and identifiers can be in UTF-8
        use v5.12;     # or later to get "unicode_strings" feature
        use strict;    # quote strings, declare variables
        use warnings;  # on by default
        use warnings  qw(FATAL utf8);    # fatalize encoding glitches
        use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
        use charnames qw(:full :short);  # unneeded in v5.16

       This does make even Unix programmers "binmode" your binary streams, or
       open them with ":raw", but that's the only way to get at them portably
       anyway.

       WARNING: "use autodie" (pre 2.26) and "use open" do not get along with
       each other.
   ℞ 1: Generic Unicode-savvy filter
       Always decompose on the way in, then recompose on the way out.

        use Unicode::Normalize;

        while (<>) {
            $_ = NFD($_);   # decompose + reorder canonically
            ...
        } continue {
            print NFC($_);  # recompose (where possible) + reorder canonically
        }

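       The decompose-then-recompose round trip can be checked directly.  This
       is a minimal sketch (the sample string "\x{E9}" is illustrative, not
       from the recipe above):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# "é" as a single precomposed codepoint, U+00E9
my $precomposed = "\x{E9}";

# NFD splits it into "e" plus U+0301 COMBINING ACUTE ACCENT
my $decomposed = NFD($precomposed);
die "expected 2 codepoints" unless length($decomposed) == 2;

# NFC puts it back together, so the filter round-trips cleanly
die "round trip failed" unless NFC($decomposed) eq $precomposed;
print "NFD/NFC round trip ok\n";
```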
   ℞ 2: Fine-tuning Unicode warnings
       As of v5.14, Perl distinguishes three subclasses of UTF-8 warnings.

        use v5.14;                  # subwarnings unavailable any earlier
        no warnings "nonchar";      # the 66 forbidden non-characters
        no warnings "surrogate";    # UTF-16/CESU-8 nonsense
        no warnings "non_unicode";  # for codepoints over 0x10_FFFF

   ℞ 3: Declare source in utf8 for identifiers and literals
       Without the all-critical "use utf8" declaration, putting UTF-8 in your
       literals and identifiers won't work right.  If you used the standard
       preamble just given above, this already happened.  If you did, you can
       do things like this:

        use utf8;

        my $measure       = "Ångström";
        my @μsoft         = qw( cp852 cp1251 cp1252 );
        my @ὑπέρμεγας     = qw( ὑπέρ  μεγας );
        my @鯉            = qw( koi8-f koi8-u koi8-r );
        my $motto         = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL

       If you forget "use utf8", high bytes will be misunderstood as separate
       characters, and nothing will work right.

   ℞ 4: Characters and their numbers
       The "ord" and "chr" functions work transparently on all codepoints,
       not just on ASCII alone; nor, in fact, even on Unicode alone.

        # ASCII characters
        ord("A")
        chr(65)

        # characters from the Basic Multilingual Plane
        ord("Σ")
        chr(0x3A3)

        # beyond the BMP
        ord("𝑛")               # MATHEMATICAL ITALIC SMALL N
        chr(0x1D45B)

        # beyond Unicode! (up to MAXINT)
        ord("\x{20_0000}")
        chr(0x20_0000)

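       The round-trip property just described is easy to verify; this sketch
       sticks to real Unicode codepoints so no above-Unicode warnings fire:

```perl
use strict;
use warnings;

# chr and ord are inverses at any codepoint, inside the BMP or beyond it
die unless ord("A") == 65;
die unless chr(65) eq "A";
die unless ord(chr(0x3A3))   == 0x3A3;    # GREEK CAPITAL LETTER SIGMA
die unless ord(chr(0x1D45B)) == 0x1D45B;  # beyond the BMP
print "ord/chr round trips ok\n";
```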
   ℞ 5: Unicode literals by character number
       In an interpolated literal, whether a double-quoted string or a regex,
       you may specify a character by its number using the "\x{HHHHHH}"
       escape.

        String: "\x{3a3}"
        Regex:  /\x{3a3}/

        String: "\x{1d45b}"
        Regex:  /\x{1d45b}/

        # even non-BMP ranges in regex work fine
        /[\x{1D434}-\x{1D467}]/

   ℞ 6: Get character name by number
        use charnames ();
        my $name = charnames::viacode(0x03A3);

   ℞ 7: Get character number by name
        use charnames ();
        my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");

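       The two functions are inverses of each other for official character
       names, so a quick round-trip check looks like this:

```perl
use strict;
use warnings;
use charnames ();

# viacode maps a number to its official name, vianame goes back again
my $name = charnames::viacode(0x03A3);
die unless $name eq "GREEK CAPITAL LETTER SIGMA";
die unless charnames::vianame($name) == 0x03A3;
print "$name\n";
```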
   ℞ 8: Unicode named characters
       Use the "\N{charname}" notation to get the character by that name for
       use in interpolated literals (double-quoted strings and regexes).  In
       v5.16, there is an implicit

        use charnames qw(:full :short);

       But prior to v5.16, you must be explicit about which set of charnames
       you want.  The ":full" names are the official Unicode character name,
       alias, or sequence, which all share a namespace.

        use charnames qw(:full :short latin greek);

        "\N{MATHEMATICAL ITALIC SMALL N}"      # :full
        "\N{GREEK CAPITAL LETTER SIGMA}"       # :full

       Anything else is a Perl-specific convenience abbreviation.  Specify
       one or more scripts by names if you want short names that are
       script-specific.

        "\N{Greek:Sigma}"                      # :short
        "\N{ae}"                               #  latin
        "\N{epsilon}"                          #  greek

       The v5.16 release also supports a ":loose" import for loose matching
       of character names, which works just like loose matching of property
       names: that is, it disregards case, whitespace, and underscores:

        "\N{euro sign}"        # :loose (from v5.16)

       Starting in v5.32, you can also use

        qr/\p{name=euro sign}/

       to get official Unicode named characters in regular expressions.
       Loose matching is always done for these.

   ℞ 9: Unicode named sequences
       These look just like character names but return multiple codepoints.
       Notice the %vx vector-print functionality in "printf".

        use charnames qw(:full);
        my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
        printf "U+%v04X\n", $seq;
        U+0100.0300

   ℞ 10: Custom named characters
       Use ":alias" to give your own lexically scoped nicknames to existing
       characters, or even to give unnamed private-use characters useful
       names.

        use charnames ":full", ":alias" => {
            ecute => "LATIN SMALL LETTER E WITH ACUTE",
            "APPLE LOGO" => 0xF8FF, # private use character
        };

        "\N{ecute}"
        "\N{APPLE LOGO}"

   ℞ 11: Names of CJK codepoints
       Sinograms like 東京 come back with character names of "CJK UNIFIED
       IDEOGRAPH-6771" and "CJK UNIFIED IDEOGRAPH-4EAC", because their
       "names" vary.  The CPAN "Unicode::Unihan" module has a large database
       for decoding these (and a whole lot more), provided you know how to
       understand its output.

        # cpan -i Unicode::Unihan
        use Unicode::Unihan;
        my $str = "東京";
        my $unhan = Unicode::Unihan->new;
        for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
            printf "CJK $str in %-12s is ", $lang;
            say $unhan->$lang($str);
        }

       prints:

        CJK 東京 in Mandarin     is DONG1JING1
        CJK 東京 in Cantonese    is dung1ging1
        CJK 東京 in Korean       is TONGKYENG
        CJK 東京 in JapaneseOn   is TOUKYOU KEI KIN
        CJK 東京 in JapaneseKun  is HIGASHI AZUMAMIYAKO

       If you have a specific romanization scheme in mind, use the specific
       module:

        # cpan -i Lingua::JA::Romanize::Japanese
        use Lingua::JA::Romanize::Japanese;
        my $k2r = Lingua::JA::Romanize::Japanese->new;
        my $str = "東京";
        say "Japanese for $str is ", $k2r->chars($str);

       prints:

        Japanese for 東京 is toukyou

   ℞ 12: Explicit encode/decode
       On rare occasion, such as a database read, you may be given encoded
       text you need to decode.

        use Encode qw(encode decode);

        my $chars = decode("shiftjis", $bytes, 1);
         # OR
        my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);

       For streams all in the same encoding, don't use encode/decode;
       instead set the file encoding when you open the file or immediately
       after with "binmode" as described later below.

226 X 13: Decode program arguments as utf8
227 $ perl -CA ...
228 or
229 $ export PERL_UNICODE=A
230 or
231 use Encode qw(decode);
232 @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
233
234 X 14: Decode program arguments as locale encoding
235 # cpan -i Encode::Locale
236 use Encode qw(locale);
237 use Encode::Locale;
238
239 # use "locale" as an arg to encode/decode
240 @ARGV = map { decode(locale => $_, 1) } @ARGV;
241
   ℞ 15: Declare STD{IN,OUT,ERR} to be utf8
       Use a command-line option, an environment variable, or else call
       "binmode" explicitly:

            $ perl -CS ...
        or
            $ export PERL_UNICODE=S
        or
            use open qw(:std :encoding(UTF-8));
        or
            binmode(STDIN,  ":encoding(UTF-8)");
            binmode(STDOUT, ":utf8");
            binmode(STDERR, ":utf8");

   ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding
        # cpan -i Encode::Locale
        use Encode;
        use Encode::Locale;

        # or as a stream for binmode or open
        binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
        binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
        binmode STDERR, ":encoding(console_out)" if -t STDERR;

   ℞ 17: Make file I/O default to utf8
       Files opened without an encoding argument will be in UTF-8:

            $ perl -CD ...
        or
            $ export PERL_UNICODE=D
        or
            use open qw(:encoding(UTF-8));

   ℞ 18: Make all I/O and args default to utf8
            $ perl -CSDA ...
        or
            $ export PERL_UNICODE=SDA
        or
            use open qw(:std :encoding(UTF-8));
            use Encode qw(decode);
            @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;

   ℞ 19: Open file with specific encoding
       Specify stream encoding.  This is the normal way to deal with encoded
       text, not by calling low-level functions.

        # input file
            open(my $in_file, "< :encoding(UTF-16)", "wintext");
        OR
            open(my $in_file, "<", "wintext");
            binmode($in_file, ":encoding(UTF-16)");
        THEN
            my $line = <$in_file>;

        # output file
            open(my $out_file, "> :encoding(cp1252)", "wintext");
        OR
            open(my $out_file, ">", "wintext");
            binmode($out_file, ":encoding(cp1252)");
        THEN
            print $out_file "some text\n";

       More layers than just the encoding can be specified here.  For
       example, the incantation ":raw :encoding(UTF-16LE) :crlf" includes
       implicit CRLF handling.

   ℞ 20: Unicode casing
       Unicode casing is very different from ASCII casing.

        uc("henry ⅷ")  # "HENRY Ⅷ"
        uc("tschüß")   # "TSCHÜSS"  notice ß => SS

        # both are true:
        "tschüß"  =~ /TSCHÜSS/i   # notice ß => SS
        "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i   # notice Σ,σ,ς sameness

318 X 21: Unicode case-insensitive comparisons
319 Also available in the CPAN Unicode::CaseFold module, the new "fc"
320 XfoldcaseX function from v5.16 grants access to the same Unicode
321 casefolding as the "/i" pattern modifier has always used:
322
323 use feature "fc"; # fc() function is from v5.16
324
325 # sort case-insensitively
326 my @sorted = sort { fc($a) cmp fc($b) } @list;
327
328 # both are true:
329 fc("tschuess") eq fc("TSCHUeSS")
330 fc("XXXXXXX") eq fc("XXXXXXX")
331
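       A small runnable sketch of the same idea: "ß" casefolds to "ss", so
       the two spellings compare equal under fc() even though the raw
       strings differ.

```perl
use v5.16;          # fc() ships with v5.16
use strict;
use warnings;
use utf8;
use feature "fc";

# both spellings fold to "tschüss", so fc() calls them equal
die unless fc("tschüß") eq fc("TSCHÜSS");

# but plain eq on the raw strings still sees different codepoints
die if "tschüß" eq "TSCHÜSS";
say "casefold comparison ok";
```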
   ℞ 22: Match Unicode linebreak sequence in regex
       A Unicode linebreak matches the two-character CRLF grapheme or any of
       seven vertical whitespace characters.  Good for dealing with textfiles
       coming from different operating systems.

        \R

        s/\R/\n/g;  # normalize all linebreaks to \n

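       To see the substitution in action, a minimal sketch mixing DOS, Mac,
       Unix, and Unicode line endings in one string:

```perl
use strict;
use warnings;

# CRLF, lone CR, U+2028 LINE SEPARATOR, and plain LF all match \R
my $text = "a\r\nb\rc\x{2028}d\ne";
(my $fixed = $text) =~ s/\R/\n/g;
die unless $fixed eq "a\nb\nc\nd\ne";
print "linebreaks normalized\n";
```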
   ℞ 23: Get character category
       Find the general category of a numeric codepoint.

        use Unicode::UCD qw(charinfo);
        my $cat = charinfo(0x3A3)->{category};  # "Lu"

   ℞ 24: Disabling Unicode-awareness in builtin charclasses
       Disable "\w", "\b", "\s", "\d", and the POSIX classes from working
       correctly on Unicode either in this scope, or in just one regex.

        use v5.14;
        use re "/a";

        # OR

        my($num) = $str =~ /(\d+)/a;

       Or use specific un-Unicode properties, like "\p{ahex}" and
       "\p{POSIX_Digit}".  Properties still work normally no matter what
       charset modifiers ("/d /u /l /a /aa") are in effect.

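       The difference "/a" makes is easy to demonstrate with non-ASCII
       digits (the sample string here is illustrative):

```perl
use v5.14;
use strict;
use warnings;

# ARABIC-INDIC DIGIT FOUR and FIVE: digits, but not ASCII ones
my $str = "got \x{664}\x{665} here";

my ($uni)   = $str =~ /(\d+)/;    # Unicode-aware \d matches them
my ($ascii) = $str =~ /(\d+)/a;   # /a restricts \d to [0-9]: no match

die unless defined $uni && $uni eq "\x{664}\x{665}";
die unless !defined $ascii;
say "/a modifier works as described";
```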
   ℞ 25: Match Unicode properties in regex with \p, \P
       These all match a single codepoint with the given property.  Use "\P"
       in place of "\p" to match one codepoint lacking that property.

        \pL, \pN, \pS, \pP, \pM, \pZ, \pC
        \p{Sk}, \p{Ps}, \p{Lt}
        \p{alpha}, \p{upper}, \p{lower}
        \p{Latin}, \p{Greek}
        \p{script_extensions=Latin}, \p{scx=Greek}
        \p{East_Asian_Width=Wide}, \p{EA=W}
        \p{Line_Break=Hyphen}, \p{LB=HY}
        \p{Numeric_Value=4}, \p{NV=4}

   ℞ 26: Custom character properties
       Define at compile-time your own custom character properties for use
       in regexes.

        # using private-use characters
        sub In_Tengwar { "E000\tE07F\n" }

        if (/\p{In_Tengwar}/) { ... }

        # blending existing properties
        sub Is_GraecoRoman_Title {<<'END_OF_SET'}
        +utf8::IsLatin
        +utf8::IsGreek
        &utf8::IsTitle
        END_OF_SET

        if (/\p{Is_GraecoRoman_Title}/) { ... }

   ℞ 27: Unicode normalization
       Typically render into NFD on input and NFC on output.  Using NFKC or
       NFKD functions improves recall on searches, assuming you've already
       done the same thing to the text to be searched.  Note that this is
       about much more than just precombined compatibility glyphs; it also
       reorders marks according to their canonical combining classes and
       weeds out singletons.

        use Unicode::Normalize;
        my $nfd  = NFD($orig);
        my $nfc  = NFC($orig);
        my $nfkd = NFKD($orig);
        my $nfkc = NFKC($orig);

407 X 28: Convert non-ASCII Unicode numerics
408 Unless youXve used "/a" or "/aa", "\d" matches more than ASCII digits
409 only, but PerlXs implicit string-to-number conversion does not current
410 recognize these. HereXs how to convert such strings manually.
411
412 use v5.14; # needed for num() function
413 use Unicode::UCD qw(num);
414 my $str = "got X and XXXX and X and here";
415 my @nums = ();
416 while ($str =~ /(\d+|\N)/g) { # not just ASCII!
417 push @nums, num($1);
418 }
419 say "@nums"; # 12 4567 0.875
420
421 use charnames qw(:full);
422 my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
423
424 X 29: Match Unicode grapheme cluster in regex
425 Programmer-visible XcharactersX are codepoints matched by "/./s", but
426 user-visible XcharactersX are graphemes matched by "/\X/".
427
428 # Find vowel *plus* any combining diacritics,underlining,etc.
429 my $nfd = NFD($orig);
430 $nfd =~ / (?=[aeiou]) \X /xi
431
432 X 30: Extract by grapheme instead of by codepoint (regex)
433 # match and grab five first graphemes
434 my($first_five) = $str =~ /^ ( \X{5} ) /x;
435
436 X 31: Extract by grapheme instead of by codepoint (substr)
437 # cpan -i Unicode::GCString
438 use Unicode::GCString;
439 my $gcs = Unicode::GCString->new($str);
440 my $first_five = $gcs->substr(0, 5);
441
442 X 32: Reverse string by grapheme
443 Reversing by codepoint messes up diacritics, mistakenly converting
444 "creme brulee" into "eelXurb emXerc" instead of into "eelurb emerc"; so
445 reverse by grapheme instead. Both these approaches work right no
446 matter what normalization the string is in:
447
448 $str = join("", reverse $str =~ /\X/g);
449
450 # OR: cpan -i Unicode::GCString
451 use Unicode::GCString;
452 $str = reverse Unicode::GCString->new($str);
453
454 X 33: String length in graphemes
455 The string "brulee" has six graphemes but up to eight codepoints. This
456 counts by grapheme, not by codepoint:
457
458 my $str = "brulee";
459 my $count = 0;
460 while ($str =~ /\X/g) { $count++ }
461
462 # OR: cpan -i Unicode::GCString
463 use Unicode::GCString;
464 my $gcs = Unicode::GCString->new($str);
465 my $count = $gcs->length;
466
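       The six-versus-eight claim above can be verified without any CPAN
       module, using only the core Unicode::Normalize:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# NFD("brûlée") carries two combining accents: 8 codepoints, 6 graphemes
my $str = NFD("brûlée");
die unless length($str) == 8;        # length() counts codepoints

my $graphemes = () = $str =~ /\X/g;  # \X counts grapheme clusters
die unless $graphemes == 6;
print "6 graphemes, 8 codepoints\n";
```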
467 X 34: Unicode column-width for printing
468 PerlXs "printf", "sprintf", and "format" think all codepoints take up 1
469 print column, but many take 0 or 2. Here to show that normalization
470 makes no difference, we print out both forms:
471
472 use Unicode::GCString;
473 use Unicode::Normalize;
474
475 my @words = qw/creme brulee/;
476 @words = map { NFC($_), NFD($_) } @words;
477
478 for my $str (@words) {
479 my $gcs = Unicode::GCString->new($str);
480 my $cols = $gcs->columns;
481 my $pad = " " x (10 - $cols);
482 say str, $pad, " |";
483 }
484
485 generates this to show that it pads correctly no matter the
486 normalization:
487
488 creme |
489 creXme |
490 brulee |
491 bruXleXe |
492
493 X 35: Unicode collation
494 Text sorted by numeric codepoint follows no reasonable alphabetic
495 order; use the UCA for sorting text.
496
497 use Unicode::Collate;
498 my $col = Unicode::Collate->new();
499 my @list = $col->sort(@old_list);
500
501 See the ucsort program from the Unicode::Tussle CPAN module for a
502 convenient command-line interface to this module.
503
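       To see why codepoint order is unreasonable, compare Perl's builtin
       sort against the UCA on a word with an accented initial (the word
       list is illustrative):

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;   # in the standard distribution since v5.8

my @words = ("zebra", "étude", "apple");

# by codepoint, "é" (U+00E9) sorts after every ASCII letter
my @bycode = sort @words;
die unless $bycode[-1] eq "étude";

# the UCA files "étude" between "apple" and "zebra", as a reader expects
my @byuca = Unicode::Collate->new->sort(@words);
die unless "@byuca" eq "apple étude zebra";
print "@byuca\n";
```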
   ℞ 36: Case- and accent-insensitive Unicode sort
       Specify a collation strength of level 1 to ignore case and
       diacritics, only looking at the basic character.

        use Unicode::Collate;
        my $col = Unicode::Collate->new(level => 1);
        my @list = $col->sort(@old_list);

   ℞ 37: Unicode locale collation
       Some locales have special sorting rules.

        # either use v5.12, OR: cpan -i Unicode::Collate::Locale
        use Unicode::Collate::Locale;
        my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
        my @list = $col->sort(@old_list);

       The ucsort program mentioned above accepts a "--locale" parameter.

   ℞ 38: Making "cmp" work on text instead of codepoints
       Instead of this:

        @srecs = sort {
            $b->{AGE}  <=>  $a->{AGE}
                    ||
            $a->{NAME} cmp $b->{NAME}
        } @recs;

       Use this:

        my $coll = Unicode::Collate->new();
        for my $rec (@recs) {
            $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
        }
        @srecs = sort {
            $b->{AGE}      <=>  $a->{AGE}
                        ||
            $a->{NAME_key} cmp  $b->{NAME_key}
        } @recs;

543 X 39: Case- and accent-insensitive comparisons
544 Use a collator object to compare Unicode text by character instead of
545 by codepoint.
546
547 use Unicode::Collate;
548 my $es = Unicode::Collate->new(
549 level => 1,
550 normalization => undef
551 );
552
553 # now both are true:
554 $es->eq("Garcia", "GARCIA" );
555 $es->eq("Marquez", "MARQUEZ");
556
557 X 40: Case- and accent-insensitive locale comparisons
558 Same, but in a specific locale.
559
560 my $de = Unicode::Collate::Locale->new(
561 locale => "de__phonebook",
562 );
563
564 # now this is true:
565 $de->eq("tschuess", "TSCHUESS"); # notice ue => UE, ss => SS
566
   ℞ 41: Unicode linebreaking
       Break up text into lines according to Unicode rules.

        # cpan -i Unicode::LineBreak
        use Unicode::LineBreak;
        use charnames qw(:full);

        my $para = "This is a super\N{HYPHEN}long string. " x 20;
        my $fmt = Unicode::LineBreak->new;
        print $fmt->break($para), "\n";

578 X 42: Unicode text in DBM hashes, the tedious way
579 Using a regular Perl string as a key or value for a DBM hash will
580 trigger a wide character exception if any codepoints wonXt fit into a
581 byte. HereXs how to manually manage the translation:
582
583 use DB_File;
584 use Encode qw(encode decode);
585 tie %dbhash, "DB_File", "pathname";
586
587 # STORE
588
589 # assume $uni_key and $uni_value are abstract Unicode strings
590 my $enc_key = encode("UTF-8", $uni_key, 1);
591 my $enc_value = encode("UTF-8", $uni_value, 1);
592 $dbhash{$enc_key} = $enc_value;
593
594 # FETCH
595
596 # assume $uni_key holds a normal Perl string (abstract Unicode)
597 my $enc_key = encode("UTF-8", $uni_key, 1);
598 my $enc_value = $dbhash{$enc_key};
599 my $uni_value = decode("UTF-8", $enc_value, 1);
600
601 X 43: Unicode text in DBM hashes, the easy way
602 HereXs how to implicitly manage the translation; all encoding and
603 decoding is done automatically, just as with streams that have a
604 particular encoding attached to them:
605
606 use DB_File;
607 use DBM_Filter;
608
609 my $dbobj = tie %dbhash, "DB_File", "pathname";
610 $dbobj->Filter_Value("utf8"); # this is the magic bit
611
612 # STORE
613
614 # assume $uni_key and $uni_value are abstract Unicode strings
615 $dbhash{$uni_key} = $uni_value;
616
617 # FETCH
618
619 # $uni_key holds a normal Perl string (abstract Unicode)
620 my $uni_value = $dbhash{$uni_key};
621
   ℞ 44: PROGRAM: Demo of Unicode collation and printing
       Here's a full program showing how to make use of locale-sensitive
       sorting, Unicode casing, and managing print widths when some of the
       characters take up zero or two columns, not just one column each
       time.  When run, the following program produces this nicely aligned
       output:

        Crème Brûlée....... €2.00
        Éclair............. €1.60
        Fideuà............. €4.20
        Hamburger.......... €6.00
        Jamón Serrano...... €4.45
        Linguiça........... €7.00
        Pâté............... €4.15
        Pears.............. €2.00
        Pêches............. €2.25
        Smørbrød........... €5.75
        Spätzle............ €5.50
        Xoriço............. €3.00
        γύρος.............. €6.50
        막걸리............. €4.00
        おもち............. €2.65
        お好み焼き......... €8.00
        シュークリーム..... €1.85
        寿司............... €9.99
        包子............... €7.50

       Here's that program; tested on v5.14.

        #!/usr/bin/env perl
        # umenu - demo sorting and printing of Unicode food
        #
        # (obligatory and increasingly long preamble)
        #
        use utf8;
        use v5.14;                       # for locale sorting
        use strict;
        use warnings;
        use warnings  qw(FATAL utf8);    # fatalize encoding faults
        use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
        use charnames qw(:full :short);  # unneeded in v5.16

        # std modules
        use Unicode::Normalize;          # std perl distro as of v5.8
        use List::Util qw(max);          # std perl distro as of v5.10
        use Unicode::Collate::Locale;    # std perl distro as of v5.14

        # cpan modules
        use Unicode::GCString;           # from CPAN

        # forward defs
        sub pad($$$);
        sub colwidth(_);
        sub entitle(_);

        my %price = (
            "γύρος"             => 6.50, # gyros
            "pears"             => 2.00, # like um, pears
            "linguiça"          => 7.00, # spicy sausage, Portuguese
            "xoriço"            => 3.00, # chorizo sausage, Catalan
            "hamburger"         => 6.00, # burgermeister meisterburger
            "éclair"            => 1.60, # dessert, French
            "smørbrød"          => 5.75, # sandwiches, Norwegian
            "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
            "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
            "jamón serrano"     => 4.45, # country ham, Spanish
            "pêches"            => 2.25, # peaches, French
            "シュークリーム"    => 1.85, # cream-filled pastry like éclair
            "막걸리"            => 4.00, # makgeolli, Korean rice wine
            "寿司"              => 9.99, # sushi, Japanese
            "おもち"            => 2.65, # omochi, rice cakes, Japanese
            "crème brûlée"      => 2.00, # crema catalana
            "fideuà"            => 4.20, # more noodles, Valencian
                                         # (Catalan=fideuada)
            "pâté"              => 4.15, # gooseliver paste, French
            "お好み焼き"        => 8.00, # okonomiyaki, Japanese
        );

        my $width = 5 + max map { colwidth } keys %price;

        # So the Asian stuff comes out in an order that someone
        # who reads those scripts won't freak out over; the
        # CJK stuff will be in JIS X 0208 order that way.
        my $coll = Unicode::Collate::Locale->new(locale => "ja");

        for my $item ($coll->sort(keys %price)) {
            print pad(entitle($item), $width, ".");
            printf " €%.2f\n", $price{$item};
        }

        sub pad($$$) {
            my($str, $width, $padchar) = @_;
            return $str . ($padchar x ($width - colwidth($str)));
        }

        sub colwidth(_) {
            my($str) = @_;
            return Unicode::GCString->new($str)->columns;
        }

        sub entitle(_) {
            my($str) = @_;
            $str =~ s{ (?=\pL)(\S) (\S*) }
                     { ucfirst($1) . lc($2) }xge;
            return $str;
        }

SEE ALSO
       See these manpages, some of which are CPAN modules: perlunicode,
       perluniprops, perlre, perlrecharclass, perluniintro, perlunitut,
       perlunifaq, PerlIO, DB_File, DBM_Filter, DBM_Filter::utf8, Encode,
       Encode::Locale, Unicode::UCD, Unicode::Normalize, Unicode::GCString,
       Unicode::LineBreak, Unicode::Collate, Unicode::Collate::Locale,
       Unicode::Unihan, Unicode::CaseFold, Unicode::Tussle,
       Lingua::JA::Romanize::Japanese, Lingua::ZH::Romanize::Pinyin,
       Lingua::KO::Romanize::Hangul.

       The Unicode::Tussle CPAN module includes many programs to help with
       working with Unicode, including these programs to fully or partly
       replace standard utilities: tcgrep instead of egrep, uniquote instead
       of cat -v or hexdump, uniwc instead of wc, unilook instead of look,
       unifmt instead of fmt, and ucsort instead of sort.  For exploring
       Unicode character names and character properties, see its uniprops,
       unichars, and uninames programs.  It also supplies these programs,
       all of which are general filters that do Unicode-y things: unititle
       and unicaps; uniwide and uninarrow; unisupers and unisubs; nfd, nfc,
       nfkd, and nfkc; and uc, lc, and tc.

       Finally, see the published Unicode Standard (page numbers are from
       version 6.0.0), including these specific annexes and technical
       reports:

       §3.13 Default Case Algorithms, page 113; §4.2 Case, pages 120–122;
        Case Mappings, pages 166–172, especially Caseless Matching starting
        on page 170.
       UAX #44: Unicode Character Database
       UTS #18: Unicode Regular Expressions
       UAX #15: Unicode Normalization Forms
       UTS #10: Unicode Collation Algorithm
       UAX #29: Unicode Text Segmentation
       UAX #14: Unicode Line Breaking Algorithm
       UAX #11: East Asian Width

AUTHOR
       Tom Christiansen <tchrist@perl.com> wrote this, with occasional
       kibbitzing from Larry Wall and Jeffrey Friedl in the background.

COPYRIGHT AND LICENCE
       Copyright © 2012 Tom Christiansen.

       This program is free software; you may redistribute it and/or modify
       it under the same terms as Perl itself.

       Most of these examples taken from the current edition of the "Camel
       Book"; that is, from the 4ᵗʰ Edition of Programming Perl, Copyright ©
       2012 Tom Christiansen <et al.>, 2012-02-13 by O'Reilly Media.  The
       code itself is freely redistributable, and you are encouraged to
       transplant, fold, spindle, and mutilate any of the examples in this
       manpage however you please for inclusion into your own programs
       without any encumbrance whatsoever.  Acknowledgement via code comment
       is polite but not required.

REVISION HISTORY
       v1.0.0 – first public release, 2012-02-27



perl v5.34.0                      2021-10-18                    PERLUNICOOK(1)