PERLUNICOOK(1)         Perl Programmers Reference Guide         PERLUNICOOK(1)

NAME
       perlunicook - cookbookish examples of handling Unicode in Perl

DESCRIPTION
       This manpage contains short recipes demonstrating how to handle
       common Unicode operations in Perl, plus one complete program at the
       end.  Any undeclared variables in individual recipes are assumed to
       have a previous appropriate value in them.

EXAMPLES
   ℞ 0: Standard preamble
       Unless otherwise noted, all examples below require this standard
       preamble to work correctly, with the "#!" adjusted to work on your
       system:

        #!/usr/bin/env perl

        use utf8;      # so literals and identifiers can be in UTF-8
        use v5.12;     # or later to get "unicode_strings" feature
        use strict;    # quote strings, declare variables
        use warnings;  # on by default
        use warnings  qw(FATAL utf8);    # fatalize encoding glitches
        use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
        use charnames qw(:full :short);  # unneeded in v5.16

       This does make even Unix programmers "binmode" your binary streams,
       or open them with ":raw", but that's the only way to get at them
       portably anyway.

       WARNING: "use autodie" (pre 2.26) and "use open" do not get along
       with each other.

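       A minimal sketch of what that means in practice: with the preamble's
       "use open" in effect, a binary file must be opened (or "binmode"d)
       with ":raw" so its bytes are not run through the UTF-8 layer.  The
       temporary file here is a stand-in for a real binary such as an image.

```perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));  # text streams default to UTF-8
use File::Temp qw(tempfile);

# stand-in for a real binary file, e.g. an image
my ($tmp, $path) = tempfile();
binmode($tmp, ":raw");               # bypass any text layers for raw bytes
print $tmp "\x89PNG\r\n";            # arbitrary non-text bytes
close($tmp);

# explicit :raw in the mode overrides the "use open" default layers
open(my $fh, "< :raw", $path) or die "can't open $path: $!";
read($fh, my $bytes, 6);
close($fh);
# $bytes now holds the six raw bytes, undecoded and untranslated
```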
   ℞ 1: Generic Unicode-savvy filter
       Always decompose on the way in, then recompose on the way out.

        use Unicode::Normalize;

        while (<>) {
            $_ = NFD($_);   # decompose + reorder canonically
            ...
        } continue {
            print NFC($_);  # recompose (where possible) + reorder canonically
        }

   ℞ 2: Fine-tuning Unicode warnings
       As of v5.14, Perl distinguishes three subclasses of UTF-8 warnings.

        use v5.14;                  # subwarnings unavailable any earlier
        no warnings "nonchar";      # the 66 forbidden non-characters
        no warnings "surrogate";    # UTF-16/CESU-8 nonsense
        no warnings "non_unicode";  # for codepoints over 0x10_FFFF

   ℞ 3: Declare source in utf8 for identifiers and literals
       Without the all-critical "use utf8" declaration, putting UTF-8 in
       your literals and identifiers won't work right.  If you used the
       standard preamble just given above, this already happened.  If you
       did, you can do things like this:

        use utf8;

        my $measure       = "Ångström";
        my @μsoft         = qw( cp852 cp1251 cp1252 );
        my @ὑπέρμεγας     = qw( ὑπέρ  μεγας );
        my @鯉            = qw( koi8-f koi8-u koi8-r );
        my $motto         = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL

       If you forget "use utf8", high bytes will be misunderstood as
       separate characters, and nothing will work right.

   ℞ 4: Characters and their numbers
       The "ord" and "chr" functions work transparently on all codepoints,
       not just on ASCII alone — nor, in fact, even on Unicode alone.

        # ASCII characters
        ord("A")
        chr(65)

        # characters from the Basic Multilingual Plane
        ord("Σ")
        chr(0x3A3)

        # beyond the BMP
        ord("𝑛")                    # MATHEMATICAL ITALIC SMALL N
        chr(0x1D45B)

        # beyond Unicode! (up to MAXINT)
        ord("\x{20_0000}")
        chr(0x20_0000)

   ℞ 5: Unicode literals by character number
       In an interpolated literal, whether a double-quoted string or a
       regex, you may specify a character by its number using the
       "\x{HHHHHH}" escape.

        String: "\x{3a3}"
        Regex:  /\x{3a3}/

        String: "\x{1d45b}"
        Regex:  /\x{1d45b}/

        # even non-BMP ranges in regex work fine
        /[\x{1D434}-\x{1D467}]/

   ℞ 6: Get character name by number
        use charnames ();
        my $name = charnames::viacode(0x03A3);

   ℞ 7: Get character number by name
        use charnames ();
        my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");

   ℞ 8: Unicode named characters
       Use the "\N{charname}" notation to get the character by that name
       for use in interpolated literals (double-quoted strings and
       regexes).  In v5.16, there is an implicit

        use charnames qw(:full :short);

       But prior to v5.16, you must be explicit about which set of
       charnames you want.  The ":full" names are the official Unicode
       character name, alias, or sequence, which all share a namespace.

        use charnames qw(:full :short latin greek);

        "\N{MATHEMATICAL ITALIC SMALL N}"    # :full
        "\N{GREEK CAPITAL LETTER SIGMA}"     # :full

       Anything else is a Perl-specific convenience abbreviation.  Specify
       one or more scripts by names if you want short names that are
       script-specific.

        "\N{Greek:Sigma}"                    # :short
        "\N{ae}"                             #  latin
        "\N{epsilon}"                        #  greek

       The v5.16 release also supports a ":loose" import for loose matching
       of character names, which works just like loose matching of property
       names: that is, it disregards case, whitespace, and underscores:

        "\N{euro sign}"                      # :loose (from v5.16)

   ℞ 9: Unicode named sequences
       These look just like character names but return multiple codepoints.
       Notice the %vx vector-print functionality in "printf".

        use charnames qw(:full);
        my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
        printf "U+%v04X\n", $seq;
        U+0100.0300

   ℞ 10: Custom named characters
       Use ":alias" to give your own lexically scoped nicknames to existing
       characters, or even to give unnamed private-use characters useful
       names.

        use charnames ":full", ":alias" => {
            ecute => "LATIN SMALL LETTER E WITH ACUTE",
            "APPLE LOGO" => 0xF8FF, # private use character
        };

        "\N{ecute}"
        "\N{APPLE LOGO}"

   ℞ 11: Names of CJK codepoints
       Sinograms like “東京” come back with character names of "CJK UNIFIED
       IDEOGRAPH-6771" and "CJK UNIFIED IDEOGRAPH-4EAC", because their
       “names” vary.  The CPAN "Unicode::Unihan" module has a large
       database for decoding these (and a whole lot more), provided you
       know how to understand its output.

        # cpan -i Unicode::Unihan
        use Unicode::Unihan;
        my $str = "東京";
        my $unhan = Unicode::Unihan->new;
        for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
            printf "CJK $str in %-12s is ", $lang;
            say $unhan->$lang($str);
        }

       prints:

        CJK 東京 in Mandarin     is DONG1JING1
        CJK 東京 in Cantonese    is dung1ging1
        CJK 東京 in Korean       is TONGKYENG
        CJK 東京 in JapaneseOn   is TOUKYOU KEI KIN
        CJK 東京 in JapaneseKun  is HIGASHI AZUMAMIYAKO

       If you have a specific romanization scheme in mind, use the specific
       module:

        # cpan -i Lingua::JA::Romanize::Japanese
        use Lingua::JA::Romanize::Japanese;
        my $k2r = Lingua::JA::Romanize::Japanese->new;
        my $str = "東京";
        say "Japanese for $str is ", $k2r->chars($str);

       prints:

        Japanese for 東京 is toukyou

   ℞ 12: Explicit encode/decode
       On rare occasion, such as a database read, you may be given encoded
       text you need to decode.

        use Encode qw(encode decode);

        my $chars = decode("shiftjis", $bytes, 1);
        # OR
        my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);

       For streams all in the same encoding, don't use encode/decode;
       instead set the file encoding when you open the file or immediately
       after with "binmode" as described later below.

   ℞ 13: Decode program arguments as utf8
            $ perl -CA ...
        or
            $ export PERL_UNICODE=A
        or
            use Encode qw(decode);
            @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;

   ℞ 14: Decode program arguments as locale encoding
        # cpan -i Encode::Locale
        use Encode qw(locale);
        use Encode::Locale;

        # use "locale" as an arg to encode/decode
        @ARGV = map { decode(locale => $_, 1) } @ARGV;

   ℞ 15: Declare STD{IN,OUT,ERR} to be utf8
       Use a command-line option, an environment variable, or else call
       "binmode" explicitly:

            $ perl -CS ...
        or
            $ export PERL_UNICODE=S
        or
            use open qw(:std :encoding(UTF-8));
        or
            binmode(STDIN,  ":encoding(UTF-8)");
            binmode(STDOUT, ":utf8");
            binmode(STDERR, ":utf8");

   ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding
        # cpan -i Encode::Locale
        use Encode;
        use Encode::Locale;

        # or as a stream for binmode or open
        binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
        binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
        binmode STDERR, ":encoding(console_out)" if -t STDERR;

   ℞ 17: Make file I/O default to utf8
       Files opened without an encoding argument will be in UTF-8:

            $ perl -CD ...
        or
            $ export PERL_UNICODE=D
        or
            use open qw(:encoding(UTF-8));

   ℞ 18: Make all I/O and args default to utf8
            $ perl -CSDA ...
        or
            $ export PERL_UNICODE=SDA
        or
            use open qw(:std :encoding(UTF-8));
            use Encode qw(decode);
            @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;

   ℞ 19: Open file with specific encoding
       Specify stream encoding.  This is the normal way to deal with
       encoded text, not by calling low-level functions.

        # input file
            open(my $in_file, "< :encoding(UTF-16)", "wintext");
        OR
            open(my $in_file, "<", "wintext");
            binmode($in_file, ":encoding(UTF-16)");
        THEN
            my $line = <$in_file>;

        # output file
            open(my $out_file, "> :encoding(cp1252)", "wintext");
        OR
            open(my $out_file, ">", "wintext");
            binmode($out_file, ":encoding(cp1252)");
        THEN
            print $out_file "some text\n";

       More layers than just the encoding can be specified here.  For
       example, the incantation ":raw :encoding(UTF-16LE) :crlf" includes
       implicit CRLF handling.

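       As a sketch of that layered incantation in action, this writes and
       re-reads a little-endian UTF-16 file with CRLF line endings; the
       temporary file stands in for a real Windows text file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tfh, $path) = tempfile();    # placeholder path for the demo
close($tfh);

# write CRLF-terminated UTF-16LE text: :crlf maps "\n" to CRLF on the
# way down, then :encoding(UTF-16LE) turns characters into byte pairs
open(my $out, "> :raw :encoding(UTF-16LE) :crlf", $path) or die $!;
print $out "r\x{e9}sum\x{e9}\n";  # "résumé"
close($out);

# read it back: the CRLF comes out as a plain "\n" again
open(my $in, "< :raw :encoding(UTF-16LE) :crlf", $path) or die $!;
my $line = <$in>;
close($in);
# $line is "résumé\n", characters not bytes
```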
   ℞ 20: Unicode casing
       Unicode casing is very different from ASCII casing.

        uc("henry ⅷ")  # "HENRY Ⅷ"
        uc("tschüß")   # "TSCHÜSS"  notice ß => SS

        # both are true:
        "tschüß"  =~ /TSCHÜSS/i   # notice ß => SS
        "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i   # notice Σ,σ,ς sameness

   ℞ 21: Unicode case-insensitive comparisons
       Also available in the CPAN Unicode::CaseFold module, the new "fc"
       “foldcase” function from v5.16 grants access to the same Unicode
       casefolding as the "/i" pattern modifier has always used:

        use feature "fc";  # fc() function is from v5.16

        # sort case-insensitively
        my @sorted = sort { fc($a) cmp fc($b) } @list;

        # both are true:
        fc("tschüß")  eq fc("TSCHÜSS")
        fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")

   ℞ 22: Match Unicode linebreak sequence in regex
       A Unicode linebreak matches the two-character CRLF grapheme or any
       of seven vertical whitespace characters.  Good for dealing with
       textfiles coming from different operating systems.

        \R

        s/\R/\n/g;  # normalize all linebreaks to \n

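       For example, a quick sketch of that normalization on a string mixing
       old-Mac CR, DOS CRLF, and Unix LF endings:

```perl
use strict;
use warnings;

# one string with old-Mac CR, DOS CRLF, and Unix LF line endings
my $text = "mac\rdos\r\nunix\n";
$text =~ s/\R/\n/g;   # each CRLF collapses to a single \n, not two
# $text is now "mac\ndos\nunix\n"
```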
   ℞ 23: Get character category
       Find the general category of a numeric codepoint.

        use Unicode::UCD qw(charinfo);
        my $cat = charinfo(0x3A3)->{category};  # "Lu"

   ℞ 24: Disabling Unicode-awareness in builtin charclasses
       Disable "\w", "\b", "\s", "\d", and the POSIX classes from working
       correctly on Unicode either in this scope, or in just one regex.

        use v5.14;
        use re "/a";

        # OR

        my($num) = $str =~ /(\d+)/a;

       Or use specific un-Unicode properties, like "\p{ahex}" and
       "\p{POSIX_Digit}".  Properties still work normally no matter what
       charset modifiers ("/d /u /l /a /aa") are in effect.

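       A sketch contrasting the default Unicode-aware "\d" with the
       "/a"-restricted one; the sample string mixes ASCII digits with
       ARABIC-INDIC digits:

```perl
use v5.14;  # the /a modifier needs v5.14

my $str   = "42 and \x{664}\x{662}";  # "42 and ٤٢" (ARABIC-INDIC digits)
my @uni   = $str =~ /(\d+)/g;         # \d matches both digit runs
my @ascii = $str =~ /(\d+)/ag;        # /a: \d means [0-9] only
# @uni is ("42", "٤٢"); @ascii is just ("42")
```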
   ℞ 25: Match Unicode properties in regex with \p, \P
       These all match a single codepoint with the given property.  Use
       "\P" in place of "\p" to match one codepoint lacking that property.

        \pL, \pN, \pS, \pP, \pM, \pZ, \pC
        \p{Sk}, \p{Ps}, \p{Lt}
        \p{alpha}, \p{upper}, \p{lower}
        \p{Latin}, \p{Greek}
        \p{script_extensions=Latin}, \p{scx=Greek}
        \p{East_Asian_Width=Wide}, \p{EA=W}
        \p{Line_Break=Hyphen}, \p{LB=HY}
        \p{Numeric_Value=4}, \p{NV=4}
   ℞ 26: Custom character properties
       Define at compile-time your own custom character properties for use
       in regexes.

        # using private-use characters
        sub In_Tengwar { "E000\tE07F\n" }

        if (/\p{In_Tengwar}/) { ... }

        # blending existing properties
        sub Is_GraecoRoman_Title {<<'END_OF_SET'}
        +utf8::IsLatin
        +utf8::IsGreek
        &utf8::IsTitle
        END_OF_SET

        if (/\p{Is_GraecoRoman_Title}/) { ... }

   ℞ 27: Unicode normalization
       Typically render into NFD on input and NFC on output.  Using NFKC or
       NFKD functions improves recall on searches, assuming you've already
       done the same to the text being searched.  Note that this is about
       much more than just precombined compatibility glyphs; it also
       reorders marks according to their canonical combining classes and
       weeds out singletons.

        use Unicode::Normalize;
        my $nfd  = NFD($orig);
        my $nfc  = NFC($orig);
        my $nfkd = NFKD($orig);
        my $nfkc = NFKC($orig);

   ℞ 28: Convert non-ASCII Unicode numerics
       Unless you've used "/a" or "/aa", "\d" matches more than ASCII
       digits only, but Perl's implicit string-to-number conversion does
       not currently recognize these.  Here's how to convert such strings
       manually.

        use v5.14;  # needed for num() function
        use Unicode::UCD qw(num);
        my $str = "got Ⅻ and ४५६७ and ⅞ and here";
        my @nums = ();
        while ($str =~ /(\d+|\N)/g) {  # not just ASCII!
            push @nums, num($1);
        }
        say "@nums";  # 12 4567 0.875

        use charnames qw(:full);
        my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");

   ℞ 29: Match Unicode grapheme cluster in regex
       Programmer-visible “characters” are codepoints matched by "/./s",
       but user-visible “characters” are graphemes matched by "/\X/".

        # Find vowel *plus* any combining diacritics, underlining, etc.
        my $nfd = NFD($orig);
        $nfd =~ / (?=[aeiou]) \X /xi

   ℞ 30: Extract by grapheme instead of by codepoint (regex)
        # match and grab five first graphemes
        my($first_five) = $str =~ /^ ( \X{5} ) /x;

   ℞ 31: Extract by grapheme instead of by codepoint (substr)
        # cpan -i Unicode::GCString
        use Unicode::GCString;
        my $gcs = Unicode::GCString->new($str);
        my $first_five = $gcs->substr(0, 5);

   ℞ 32: Reverse string by grapheme
       Reversing by codepoint messes up diacritics, mistakenly converting
       "crème brûlée" into "eél̂urb em̀erc" instead of into "eélûrb emèrc";
       so reverse by grapheme instead.  Both these approaches work right no
       matter what normalization the string is in:

        $str = join("", reverse $str =~ /\X/g);

        # OR: cpan -i Unicode::GCString
        use Unicode::GCString;
        $str = reverse Unicode::GCString->new($str);

   ℞ 33: String length in graphemes
       The string "brûlée" has six graphemes but up to eight codepoints.
       This counts by grapheme, not by codepoint:

        my $str = "brûlée";
        my $count = 0;
        while ($str =~ /\X/g) { $count++ }

        # OR: cpan -i Unicode::GCString
        use Unicode::GCString;
        my $gcs = Unicode::GCString->new($str);
        my $count = $gcs->length;

   ℞ 34: Unicode column-width for printing
       Perl's "printf", "sprintf", and "format" think all codepoints take
       up 1 print column, but many take 0 or 2.  Here to show that
       normalization makes no difference, we print out both forms:

        use Unicode::GCString;
        use Unicode::Normalize;

        my @words = qw/crème brûlée/;
        @words = map { NFC($_), NFD($_) } @words;

        for my $str (@words) {
            my $gcs = Unicode::GCString->new($str);
            my $cols = $gcs->columns;
            my $pad = " " x (10 - $cols);
            say $str, $pad, " |";
        }

       generates this to show that it pads correctly no matter the
       normalization:

        crème      |
        crème      |
        brûlée     |
        brûlée     |

   ℞ 35: Unicode collation
       Text sorted by numeric codepoint follows no reasonable alphabetic
       order; use the UCA for sorting text.

        use Unicode::Collate;
        my $col = Unicode::Collate->new();
        my @list = $col->sort(@old_list);

       See the ucsort program from the Unicode::Tussle CPAN module for a
       convenient command-line interface to this module.

   ℞ 36: Case- and accent-insensitive Unicode sort
       Specify a collation strength of level 1 to ignore case and
       diacritics, only looking at the basic character.

        use Unicode::Collate;
        my $col = Unicode::Collate->new(level => 1);
        my @list = $col->sort(@old_list);

   ℞ 37: Unicode locale collation
       Some locales have special sorting rules.

        # either use v5.12, OR: cpan -i Unicode::Collate::Locale
        use Unicode::Collate::Locale;
        my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
        my @list = $col->sort(@old_list);

       The ucsort program mentioned above accepts a "--locale" parameter.

   ℞ 38: Making "cmp" work on text instead of codepoints
       Instead of this:

        @srecs = sort {
            $b->{AGE}  <=>  $a->{AGE}
                ||
            $a->{NAME} cmp $b->{NAME}
        } @recs;

       Use this:

        my $coll = Unicode::Collate->new();
        for my $rec (@recs) {
            $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
        }
        @srecs = sort {
            $b->{AGE}      <=>  $a->{AGE}
                ||
            $a->{NAME_key} cmp $b->{NAME_key}
        } @recs;

   ℞ 39: Case- and accent-insensitive comparisons
       Use a collator object to compare Unicode text by character instead
       of by codepoint.

        use Unicode::Collate;
        my $es = Unicode::Collate->new(
            level => 1,
            normalization => undef
        );

        # now both are true:
        $es->eq("García",  "GARCIA" );
        $es->eq("Márquez", "MÁRQUEZ");

   ℞ 40: Case- and accent-insensitive locale comparisons
       Same, but in a specific locale.

        my $de = Unicode::Collate::Locale->new(
            locale => "de__phonebook",
        );

        # now this is true:
        $de->eq("tschüß", "TSCHUESS");  # notice ü => UE, ß => SS

   ℞ 41: Unicode linebreaking
       Break up text into lines according to Unicode rules.

        # cpan -i Unicode::LineBreak
        use Unicode::LineBreak;
        use charnames qw(:full);

        my $para = "This is a super\N{HYPHEN}long string. " x 20;
        my $fmt = Unicode::LineBreak->new;
        print $fmt->break($para), "\n";

   ℞ 42: Unicode text in DBM hashes, the tedious way
       Using a regular Perl string as a key or value for a DBM hash will
       trigger a wide character exception if any codepoints won't fit into
       a byte.  Here's how to manually manage the translation:

        use DB_File;
        use Encode qw(encode decode);
        tie %dbhash, "DB_File", "pathname";

        # STORE

        # assume $uni_key and $uni_value are abstract Unicode strings
        my $enc_key   = encode("UTF-8", $uni_key,   1);
        my $enc_value = encode("UTF-8", $uni_value, 1);
        $dbhash{$enc_key} = $enc_value;

        # FETCH

        # assume $uni_key holds a normal Perl string (abstract Unicode)
        my $enc_key   = encode("UTF-8", $uni_key, 1);
        my $enc_value = $dbhash{$enc_key};
        my $uni_value = decode("UTF-8", $enc_value, 1);

   ℞ 43: Unicode text in DBM hashes, the easy way
       Here's how to implicitly manage the translation; all encoding and
       decoding is done automatically, just as with streams that have a
       particular encoding attached to them:

        use DB_File;
        use DBM_Filter;

        my $dbobj = tie %dbhash, "DB_File", "pathname";
        $dbobj->Filter_Value("utf8");  # this is the magic bit

        # STORE

        # assume $uni_key and $uni_value are abstract Unicode strings
        $dbhash{$uni_key} = $uni_value;

        # FETCH

        # $uni_key holds a normal Perl string (abstract Unicode)
        my $uni_value = $dbhash{$uni_key};

   ℞ 44: PROGRAM: Demo of Unicode collation and printing
       Here's a full program showing how to make use of locale-sensitive
       sorting, Unicode casing, and managing print widths when some of the
       characters take up zero or two columns, not just one column each
       time.  When run, the following program produces this nicely aligned
       output:

        Crème Brûlée....... €2.00
        Éclair............. €1.60
        Fideuà............. €4.20
        Hamburger.......... €6.00
        Jamón Serrano...... €4.45
        Linguiça........... €7.00
        Pâté............... €4.15
        Pears.............. €2.00
        Pêches............. €2.25
        Smørbrød........... €5.75
        Spätzle............ €5.50
        Xoriço............. €3.00
        Γύρος.............. €6.50
        막걸리............. €4.00
        おもち............. €2.65
        お好み焼き......... €8.00
        シュークリーム..... €1.85
        寿司............... €9.99
        包子............... €7.50

       Here's that program; tested on v5.14.

        #!/usr/bin/env perl
        # umenu - demo sorting and printing of Unicode food
        #
        # (obligatory and increasingly long preamble)
        #
        use utf8;
        use v5.14;                       # for locale sorting
        use strict;
        use warnings;
        use warnings  qw(FATAL utf8);    # fatalize encoding faults
        use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
        use charnames qw(:full :short);  # unneeded in v5.16

        # std modules
        use Unicode::Normalize;          # std perl distro as of v5.8
        use List::Util qw(max);          # std perl distro as of v5.10
        use Unicode::Collate::Locale;    # std perl distro as of v5.14

        # cpan modules
        use Unicode::GCString;           # from CPAN

        # forward defs
        sub pad($$$);
        sub colwidth(_);
        sub entitle(_);

        my %price = (
            "γύρος"             => 6.50, # gyros
            "pears"             => 2.00, # like um, pears
            "linguiça"          => 7.00, # spicy sausage, Portuguese
            "xoriço"            => 3.00, # chorizo sausage, Catalan
            "hamburger"         => 6.00, # burgermeister meisterburger
            "éclair"            => 1.60, # dessert, French
            "smørbrød"          => 5.75, # sandwiches, Norwegian
            "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
            "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
            "jamón serrano"     => 4.45, # country ham, Spanish
            "pêches"            => 2.25, # peaches, French
            "シュークリーム"    => 1.85, # cream-filled pastry like eclair
            "막걸리"            => 4.00, # makgeolli, Korean rice wine
            "寿司"              => 9.99, # sushi, Japanese
            "おもち"            => 2.65, # omochi, rice cakes, Japanese
            "crème brûlée"      => 2.00, # crema catalana
            "fideuà"            => 4.20, # more noodles, Valencian
                                         # (Catalan=fideuada)
            "pâté"              => 4.15, # gooseliver paste, French
            "お好み焼き"        => 8.00, # okonomiyaki, Japanese
        );

        my $width = 5 + max map { colwidth } keys %price;

        # So the Asian stuff comes out in an order that someone
        # who reads those scripts won't freak out over; the
        # CJK stuff will be in JIS X 0208 order that way.
        my $coll = Unicode::Collate::Locale->new(locale => "ja");

        for my $item ($coll->sort(keys %price)) {
            print pad(entitle($item), $width, ".");
            printf " €%.2f\n", $price{$item};
        }

        sub pad($$$) {
            my($str, $width, $padchar) = @_;
            return $str . ($padchar x ($width - colwidth($str)));
        }

        sub colwidth(_) {
            my($str) = @_;
            return Unicode::GCString->new($str)->columns;
        }

        sub entitle(_) {
            my($str) = @_;
            $str =~ s{ (?=\pL)(\S)(\S*) }
                     { ucfirst($1) . lc($2) }xge;
            return $str;
        }

SEE ALSO
       See these manpages, some of which are CPAN modules: perlunicode,
       perluniprops, perlre, perlrecharclass, perluniintro, perlunitut,
       perlunifaq, PerlIO, DB_File, DBM_Filter, DBM_Filter::utf8, Encode,
       Encode::Locale, Unicode::UCD, Unicode::Normalize, Unicode::GCString,
       Unicode::LineBreak, Unicode::Collate, Unicode::Collate::Locale,
       Unicode::Unihan, Unicode::CaseFold, Unicode::Tussle,
       Lingua::JA::Romanize::Japanese, Lingua::ZH::Romanize::Pinyin,
       Lingua::KO::Romanize::Hangul.

       The Unicode::Tussle CPAN module includes many programs to help with
       working with Unicode, including these programs to fully or partly
       replace standard utilities: tcgrep instead of egrep, uniquote
       instead of cat -v or hexdump, uniwc instead of wc, unilook instead
       of look, unifmt instead of fmt, and ucsort instead of sort.  For
       exploring Unicode character names and character properties, see its
       uniprops, unichars, and uninames programs.  It also supplies these
       programs, all of which are general filters that do Unicode-y things:
       unititle and unicaps; uniwide and uninarrow; unisupers and unisubs;
       nfd, nfc, nfkd, and nfkc; and uc, lc, and tc.

       Finally, see the published Unicode Standard (page numbers are from
       version 6.0.0), including these specific annexes and technical
       reports:

       §3.13 Default Case Algorithms, page 113; §4.2 Case, pages 120–122;
       Case Mappings, pages 166–172, especially Caseless Matching starting
       on page 170.
       UAX #44: Unicode Character Database
       UTS #18: Unicode Regular Expressions
       UAX #15: Unicode Normalization Forms
       UTS #10: Unicode Collation Algorithm
       UAX #29: Unicode Text Segmentation
       UAX #14: Unicode Line Breaking Algorithm
       UAX #11: East Asian Width

AUTHOR
       Tom Christiansen <tchrist@perl.com> wrote this, with occasional
       kibbitzing from Larry Wall and Jeffrey Friedl in the background.

COPYRIGHT AND LICENCE
       Copyright © 2012 Tom Christiansen.

       This program is free software; you may redistribute it and/or modify
       it under the same terms as Perl itself.

       Most of these examples taken from the current edition of the “Camel
       Book”; that is, from the 4ᵗʰ Edition of Programming Perl, Copyright
       © 2012 Tom Christiansen <et al.>, 2012-02-13 by O'Reilly Media.  The
       code itself is freely redistributable, and you are encouraged to
       transplant, fold, spindle, and mutilate any of the examples in this
       manpage however you please for inclusion into your own programs
       without any encumbrance whatsoever.  Acknowledgement via code
       comment is polite but not required.

REVISION HISTORY
       v1.0.0 — first public release, 2012-02-27


perl v5.28.2                      2018-11-01                    PERLUNICOOK(1)