PERLUNICOOK(1)         Perl Programmers Reference Guide         PERLUNICOOK(1)

NAME

       perlunicook - cookbookish examples of handling Unicode in Perl

DESCRIPTION

       This manpage contains short recipes demonstrating how to handle common
       Unicode operations in Perl, plus one complete program at the end. Any
       variable left undeclared in an individual recipe is assumed to already
       hold an appropriate value.

EXAMPLES

   ℞ 0: Standard preamble
       Unless otherwise noted, all examples below require this standard
       preamble to work correctly, with the "#!" adjusted to work on your
       system:
19
20        #!/usr/bin/env perl
21
22        use utf8;      # so literals and identifiers can be in UTF-8
23        use v5.12;     # or later to get "unicode_strings" feature
24        use strict;    # quote strings, declare variables
25        use warnings;  # on by default
26        use warnings  qw(FATAL utf8);    # fatalize encoding glitches
27        use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
28        use charnames qw(:full :short);  # unneeded in v5.16
29
       This does force even Unix programmers to "binmode" their binary
       streams, or open them with ":raw", but that's the only way to get at
       them portably anyway.
33
34       WARNING: "use autodie" (pre 2.26) and "use open" do not get along with
35       each other.
36
   ℞ 1: Generic Unicode-savvy filter
38       Always decompose on the way in, then recompose on the way out.
39
40        use Unicode::Normalize;
41
42        while (<>) {
43            $_ = NFD($_);   # decompose + reorder canonically
44            ...
45        } continue {
46            print NFC($_);  # recompose (where possible) + reorder canonically
47        }
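
       As a minimal sketch of why the filter decomposes on the way in, the
       snippet below compares two spellings of the same word. The helper
       name "same_text" and the sample strings are illustrative assumptions,
       not part of the recipe.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Two spellings of the same word: a precomposed e-acute,
# or a plain "e" followed by a combining acute accent.
my $precomposed = "caf\x{E9}";
my $decomposed  = "cafe\x{301}";

# Hypothetical helper: compare two strings after canonical decomposition.
sub same_text {
    my ($left, $right) = @_;
    return NFD($left) eq NFD($right);
}

printf "raw eq: %s, normalized eq: %s\n",
    ($precomposed eq $decomposed            ? "yes" : "no"),
    (same_text($precomposed, $decomposed)   ? "yes" : "no");
```

       Comparing the raw strings fails because their codepoints differ, while
       the normalized comparison treats them as the same text.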
48
   ℞ 2: Fine-tuning Unicode warnings
       As of v5.14, Perl distinguishes three subclasses of UTF-8 warnings.
51
52        use v5.14;                  # subwarnings unavailable any earlier
53        no warnings "nonchar";      # the 66 forbidden non-characters
54        no warnings "surrogate";    # UTF-16/CESU-8 nonsense
55        no warnings "non_unicode";  # for codepoints over 0x10_FFFF
56
   ℞ 3: Declare source in utf8 for identifiers and literals
       Without the all-critical "use utf8" declaration, putting UTF-8 in your
       literals and identifiers won't work right.  If you used the standard
       preamble given above, this is already in effect, and you can do
       things like this:
62
63        use utf8;
64
        my $measure   = "Ångström";
        my @μsoft     = qw( cp852 cp1251 cp1252 );
        my @ὑπέρμεγας = qw( ὑπέρ  μεγας );
        my @鯉        = qw( koi8-f koi8-u koi8-r );
        my $motto     = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL
70
71       If you forget "use utf8", high bytes will be misunderstood as separate
72       characters, and nothing will work right.
73
   ℞ 4: Characters and their numbers
       The "ord" and "chr" functions work transparently on all codepoints,
       not just on ASCII, and in fact not even just on Unicode.
77
78        # ASCII characters
79        ord("A")
80        chr(65)
81
82        # characters from the Basic Multilingual Plane
        ord("Σ")
84        chr(0x3A3)
85
86        # beyond the BMP
        ord("𝑛")               # MATHEMATICAL ITALIC SMALL N
88        chr(0x1D45B)
89
90        # beyond Unicode! (up to MAXINT)
91        ord("\x{20_0000}")
92        chr(0x20_0000)
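
       A small sketch of the round trip between characters and their
       numbers. The helper name "codepoint_label" is an assumption made for
       illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: format a character's codepoint in U+XXXX style.
sub codepoint_label {
    my ($char) = @_;
    return sprintf("U+%04X", ord($char));
}

# One ASCII character, one from the BMP, one beyond the BMP.
my @samples = ("A", chr(0x3A3), chr(0x1D45B));
my @labels  = map { codepoint_label($_) } @samples;

print "@labels\n";
```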
93
   ℞ 5: Unicode literals by character number
95       In an interpolated literal, whether a double-quoted string or a regex,
96       you may specify a character by its number using the "\x{HHHHHH}"
97       escape.
98
99        String: "\x{3a3}"
100        Regex:  /\x{3a3}/
101
102        String: "\x{1d45b}"
103        Regex:  /\x{1d45b}/
104
105        # even non-BMP ranges in regex work fine
106        /[\x{1D434}-\x{1D467}]/
107
   ℞ 6: Get character name by number
109        use charnames ();
110        my $name = charnames::viacode(0x03A3);
111
   ℞ 7: Get character number by name
113        use charnames ();
114        my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
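
       The two lookups are inverses of each other, which this short sketch
       illustrates. The variable names are illustrative only.

```perl
use strict;
use warnings;
use charnames ();

# Name to number, then number back to name.
my $code = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
my $name = charnames::viacode($code);

printf "U+%04X is %s\n", $code, $name;
```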
115
   ℞ 8: Unicode named characters
117       Use the "\N{charname}" notation to get the character by that name for
118       use in interpolated literals (double-quoted strings and regexes).  In
119       v5.16, there is an implicit
120
121        use charnames qw(:full :short);
122
       But prior to v5.16, you must be explicit about which sets of charnames
       you want.  A ":full" name is an official Unicode character name,
       alias, or sequence; these all share one namespace.
126
127        use charnames qw(:full :short latin greek);
128
129        "\N{MATHEMATICAL ITALIC SMALL N}"      # :full
130        "\N{GREEK CAPITAL LETTER SIGMA}"       # :full
131
132       Anything else is a Perl-specific convenience abbreviation.  Specify one
133       or more scripts by names if you want short names that are script-
134       specific.
135
136        "\N{Greek:Sigma}"                      # :short
137        "\N{ae}"                               #  latin
138        "\N{epsilon}"                          #  greek
139
140       The v5.16 release also supports a ":loose" import for loose matching of
141       character names, which works just like loose matching of property
142       names: that is, it disregards case, whitespace, and underscores:
143
144        "\N{euro sign}"                        # :loose (from v5.16)
145
146       Starting in v5.32, you can also use
147
148        qr/\p{name=euro sign}/
149
150       to get official Unicode named characters in regular expressions.  Loose
151       matching is always done for these.
152
   ℞ 9: Unicode named sequences
154       These look just like character names but return multiple codepoints.
155       Notice the %vx vector-print functionality in "printf".
156
157        use charnames qw(:full);
158        my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
159        printf "U+%v04X\n", $seq;
160        U+0100.0300
161
   ℞ 10: Custom named characters
163       Use ":alias" to give your own lexically scoped nicknames to existing
164       characters, or even to give unnamed private-use characters useful
165       names.
166
167        use charnames ":full", ":alias" => {
168            ecute => "LATIN SMALL LETTER E WITH ACUTE",
169            "APPLE LOGO" => 0xF8FF, # private use character
170        };
171
172        "\N{ecute}"
173        "\N{APPLE LOGO}"
174
   ℞ 11: Names of CJK codepoints
       Sinograms like “東京” come back with character names of "CJK UNIFIED
       IDEOGRAPH-6771" and "CJK UNIFIED IDEOGRAPH-4EAC", because their “names”
       vary.  The CPAN "Unicode::Unihan" module has a large database for
       decoding these (and a whole lot more), provided you know how to
       understand its output.
181
182        # cpan -i Unicode::Unihan
183        use Unicode::Unihan;
        my $str = "東京";
185        my $unhan = Unicode::Unihan->new;
186        for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
187            printf "CJK $str in %-12s is ", $lang;
188            say $unhan->$lang($str);
189        }
190
191       prints:
192
        CJK 東京 in Mandarin     is DONG1JING1
        CJK 東京 in Cantonese    is dung1ging1
        CJK 東京 in Korean       is TONGKYENG
        CJK 東京 in JapaneseOn   is TOUKYOU KEI KIN
        CJK 東京 in JapaneseKun  is HIGASHI AZUMAMIYAKO
198
199       If you have a specific romanization scheme in mind, use the specific
200       module:
201
202        # cpan -i Lingua::JA::Romanize::Japanese
203        use Lingua::JA::Romanize::Japanese;
204        my $k2r = Lingua::JA::Romanize::Japanese->new;
        my $str = "東京";
        say "Japanese for $str is ", $k2r->chars($str);

       prints

        Japanese for 東京 is toukyou
211
   ℞ 12: Explicit encode/decode
213       On rare occasion, such as a database read, you may be given encoded
214       text you need to decode.
215
216         use Encode qw(encode decode);
217
218         my $chars = decode("shiftjis", $bytes, 1);
219        # OR
220         my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);
221
222       For streams all in the same encoding, don't use encode/decode; instead
223       set the file encoding when you open the file or immediately after with
224       "binmode" as described later below.
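
       The following sketch shows the difference between encoded bytes and
       decoded characters, using UTF-8 rather than the encodings named in the
       recipe. It spells the check argument as named constants; adding
       "LEAVE_SRC" is a choice made here so that the input string is not
       modified, and the sample data is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Die on malformed data, and leave the source string untouched.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $bytes = "\xC3\xA9t\xC3\xA9";              # five UTF-8 encoded bytes
my $chars = decode("UTF-8", $bytes, $check);  # three characters
my $round_trip = encode("UTF-8", $chars, $check);

# Malformed input raises an exception instead of silently passing through.
my $ok = eval { decode("UTF-8", "\xFF\xFE", $check); 1 };

printf "%d bytes, %d characters, malformed input %s\n",
    length($bytes), length($chars), ($ok ? "accepted" : "rejected");
```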
225
   ℞ 13: Decode program arguments as utf8
227            $ perl -CA ...
228        or
229            $ export PERL_UNICODE=A
230        or
231           use Encode qw(decode);
232           @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
233
   ℞ 14: Decode program arguments as locale encoding
           # cpan -i Encode::Locale
           use Encode qw(decode);
237           use Encode::Locale;
238
239           # use "locale" as an arg to encode/decode
240           @ARGV = map { decode(locale => $_, 1) } @ARGV;
241
   ℞ 15: Declare STD{IN,OUT,ERR} to be utf8
243       Use a command-line option, an environment variable, or else call
244       "binmode" explicitly:
245
246            $ perl -CS ...
247        or
248            $ export PERL_UNICODE=S
249        or
250            use open qw(:std :encoding(UTF-8));
251        or
            binmode(STDIN,  ":encoding(UTF-8)");
            binmode(STDOUT, ":encoding(UTF-8)");
            binmode(STDERR, ":encoding(UTF-8)");
255
   ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding
257           # cpan -i Encode::Locale
258           use Encode;
259           use Encode::Locale;
260
261           # or as a stream for binmode or open
262           binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
263           binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
264           binmode STDERR, ":encoding(console_out)" if -t STDERR;
265
   ℞ 17: Make file I/O default to utf8
267       Files opened without an encoding argument will be in UTF-8:
268
269            $ perl -CD ...
270        or
271            $ export PERL_UNICODE=D
272        or
273            use open qw(:encoding(UTF-8));
274
   ℞ 18: Make all I/O and args default to utf8
276            $ perl -CSDA ...
277        or
278            $ export PERL_UNICODE=SDA
279        or
280            use open qw(:std :encoding(UTF-8));
281            use Encode qw(decode);
282            @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
283
   ℞ 19: Open file with specific encoding
285       Specify stream encoding.  This is the normal way to deal with encoded
286       text, not by calling low-level functions.
287
288        # input file
289            open(my $in_file, "< :encoding(UTF-16)", "wintext");
290        OR
291            open(my $in_file, "<", "wintext");
292            binmode($in_file, ":encoding(UTF-16)");
293        THEN
294            my $line = <$in_file>;
295
296        # output file
            open(my $out_file, "> :encoding(cp1252)", "wintext");
298        OR
299            open(my $out_file, ">", "wintext");
300            binmode($out_file, ":encoding(cp1252)");
301        THEN
302            print $out_file "some text\n";
303
304       More layers than just the encoding can be specified here. For example,
305       the incantation ":raw :encoding(UTF-16LE) :crlf" includes implicit CRLF
306       handling.
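
       To see what an encoding layer actually emits, this sketch writes
       through a layer into an in-memory file instead of a real one. The
       helper name "bytes_via_layer" is an assumption for illustration; a
       real program would open a named file as in the recipe.

```perl
use strict;
use warnings;

# Hypothetical helper: push text through an encoding layer and
# return the bytes that the layer produced.
sub bytes_via_layer {
    my ($encoding, $text) = @_;
    my $buffer = "";
    open(my $out, ">:encoding($encoding)", \$buffer)
        or die "cannot open in-memory file: $!";
    print {$out} $text;
    close($out) or die "cannot close in-memory file: $!";
    return $buffer;
}

my $utf16  = bytes_via_layer("UTF-16LE", "\x{3A3}");   # GREEK CAPITAL LETTER SIGMA
my $cp1252 = bytes_via_layer("cp1252",   "\x{20AC}");  # EURO SIGN

printf "UTF-16LE: %v02X, cp1252: %v02X\n", $utf16, $cp1252;
```

       The same character string yields different bytes depending only on
       the layer, which is why the encoding belongs on the stream rather than
       in scattered encode and decode calls.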
307
   ℞ 20: Unicode casing
       Unicode casing is very different from ASCII casing.

        uc("henry ⅷ")  # "HENRY Ⅷ"
        uc("tschüß")   # "TSCHÜSS"  notice ß => SS

        # both are true:
        "tschüß"  =~ /TSCHÜSS/i   # notice ß => SS
        "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i   # notice Σ,σ,ς sameness
317
   ℞ 21: Unicode case-insensitive comparisons
       Also available in the CPAN Unicode::CaseFold module, the new "fc"
       “foldcase” function from v5.16 grants access to the same Unicode
       casefolding as the "/i" pattern modifier has always used:
322
323        use feature "fc"; # fc() function is from v5.16
324
325        # sort case-insensitively
326        my @sorted = sort { fc($a) cmp fc($b) } @list;
327
328        # both are true:
        fc("tschüß")  eq fc("TSCHÜSS")
        fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")
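
       A self-contained sketch of the same comparisons, written with
       "\x{...}" escapes so that it does not depend on the source file's
       encoding. The helper name "same_ignoring_case" is an illustrative
       assumption.

```perl
use v5.16;      # enables fc() and the unicode_strings feature
use warnings;

# Hypothetical helper: compare two strings under Unicode casefolding.
sub same_ignoring_case {
    my ($left, $right) = @_;
    return fc($left) eq fc($right);
}

# German: lowercase with u-umlaut and sharp s, versus uppercase with SS.
my $german_lower = "tsch\x{FC}\x{DF}";
my $german_upper = "TSCH\x{DC}SS";

# Greek: mixed case with a final sigma, versus all uppercase.
my $greek_mixed = "\x{3A3}\x{3AF}\x{3C3}\x{3C5}\x{3C6}\x{3BF}\x{3C2}";
my $greek_upper = "\x{3A3}\x{38A}\x{3A3}\x{3A5}\x{3A6}\x{39F}\x{3A3}";

say same_ignoring_case($german_lower, $german_upper) ? "German: same" : "German: different";
say same_ignoring_case($greek_mixed,  $greek_upper)  ? "Greek: same"  : "Greek: different";
```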
331
   ℞ 22: Match Unicode linebreak sequence in regex
333       A Unicode linebreak matches the two-character CRLF grapheme or any of
334       seven vertical whitespace characters.  Good for dealing with textfiles
335       coming from different operating systems.
336
337        \R
338
339        s/\R/\n/g;  # normalize all linebreaks to \n
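
       As a sketch, the substitution can be wrapped in a small function and
       tried on text that mixes several linebreak conventions. The function
       name and sample string are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the text with every
# Unicode linebreak sequence replaced by a single "\n".
sub normalize_linebreaks {
    my ($text) = @_;
    $text =~ s/\R/\n/g;
    return $text;
}

# CRLF, bare CR, LF, NEXT LINE (U+0085), and LINE SEPARATOR (U+2028).
my $mixed = "dos\r\nmac\runix\nnel\x{85}ls\x{2028}end";
my $clean = normalize_linebreaks($mixed);
my @lines = split /\n/, $clean;

print scalar(@lines), " lines\n";
```

       Note that CRLF is matched as a unit, so it becomes one "\n" rather
       than two.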
340
   ℞ 23: Get character category
342       Find the general category of a numeric codepoint.
343
344        use Unicode::UCD qw(charinfo);
345        my $cat = charinfo(0x3A3)->{category};  # "Lu"
346
   ℞ 24: Disabling Unicode-awareness in builtin charclasses
348       Disable "\w", "\b", "\s", "\d", and the POSIX classes from working
349       correctly on Unicode either in this scope, or in just one regex.
350
351        use v5.14;
352        use re "/a";
353
354        # OR
355
356        my($num) = $str =~ /(\d+)/a;
357
       Or use specific un-Unicode properties, like "\p{ahex}" and
       "\p{POSIX_Digit}".  Properties still work normally no matter what
       charset modifiers ("/d /u /l /a /aa") are in effect.
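
       A brief sketch of the effect of "/a" on "\d", using Devanagari digits
       as sample input. The variable names are illustrative.

```perl
use v5.14;
use warnings;

my $devanagari = "\x{96A}\x{96B}";   # DEVANAGARI DIGIT FOUR, DIGIT FIVE

# Without /a, \d matches any Unicode decimal digit.
my $unicode_match = $devanagari =~ /^\d+$/  ? 1 : 0;

# With /a, \d is restricted to the ASCII digits 0-9.
my $ascii_match   = $devanagari =~ /^\d+$/a ? 1 : 0;
my $ascii_digits  = "45"        =~ /^\d+$/a ? 1 : 0;

say "unicode rules: $unicode_match, /a on Devanagari: $ascii_match, /a on ASCII: $ascii_digits";
```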
361
   ℞ 25: Match Unicode properties in regex with \p, \P
363       These all match a single codepoint with the given property.  Use "\P"
364       in place of "\p" to match one codepoint lacking that property.
365
366        \pL, \pN, \pS, \pP, \pM, \pZ, \pC
367        \p{Sk}, \p{Ps}, \p{Lt}
368        \p{alpha}, \p{upper}, \p{lower}
369        \p{Latin}, \p{Greek}
370        \p{script_extensions=Latin}, \p{scx=Greek}
371        \p{East_Asian_Width=Wide}, \p{EA=W}
372        \p{Line_Break=Hyphen}, \p{LB=HY}
373        \p{Numeric_Value=4}, \p{NV=4}
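
       A sketch that counts characters by property in a mixed-script string.
       The sample text is made up for illustration, and the counts rely on
       the list-assignment idiom for counting matches.

```perl
use strict;
use warnings;

# Latin word, punctuation, a seven-letter Greek word, and two ASCII digits.
my $text = "Sisyphus: \x{3A3}\x{3AF}\x{3C3}\x{3C5}\x{3C6}\x{3BF}\x{3C2}, 42";

# Each match consumes exactly one codepoint that has (or lacks) the property.
my $greek      = () = $text =~ /\p{Greek}/g;
my $latin      = () = $text =~ /\p{Latin}/g;
my $uppercase  = () = $text =~ /\p{Lu}/g;
my $nonletters = () = $text =~ /\PL/g;

print "Greek=$greek Latin=$latin Lu=$uppercase non-letters=$nonletters\n";
```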
374
   ℞ 26: Custom character properties
376       Define at compile-time your own custom character properties for use in
377       regexes.
378
379        # using private-use characters
380        sub In_Tengwar { "E000\tE07F\n" }
381
382        if (/\p{In_Tengwar}/) { ... }
383
384        # blending existing properties
385        sub Is_GraecoRoman_Title {<<'END_OF_SET'}
386        +utf8::IsLatin
387        +utf8::IsGreek
388        &utf8::IsTitle
389        END_OF_SET
390
        if (/\p{Is_GraecoRoman_Title}/) { ... }
392
   ℞ 27: Unicode normalization
       Typically render into NFD on input and NFC on output.  Using the NFKC
       or NFKD functions improves recall on searches, assuming you've already
       done the same to the text to be searched.  Note that this is about
       much more than just pre-combined compatibility glyphs; it also
       reorders marks according to their canonical combining classes and
       weeds out singletons.
400
401        use Unicode::Normalize;
402        my $nfd  = NFD($orig);
403        my $nfc  = NFC($orig);
404        my $nfkd = NFKD($orig);
405        my $nfkc = NFKC($orig);
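
       The sketch below illustrates the search-recall point with a ligature,
       and the singleton point with ANGSTROM SIGN. The helper name
       "search_key" and the sample strings are assumptions for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

# Hypothetical helper: fold compatibility characters before searching.
sub search_key {
    my ($text) = @_;
    return NFKC($text);
}

# "define" spelled with U+FB01 LATIN SMALL LIGATURE FI.
my $haystack = "de\x{FB01}ne";
my $needle   = "fin";

# Canonical normalization leaves the ligature alone, so a plain search misses.
my $plain_hit = index(NFC($haystack), $needle);

# Compatibility normalization expands it, so the search succeeds.
my $folded_hit = index(search_key($haystack), search_key($needle));

# A singleton: ANGSTROM SIGN canonically maps to LATIN CAPITAL LETTER A WITH RING ABOVE.
my $angstrom = NFC("\x{212B}");

printf "plain: %d, folded: %d, angstrom: U+%04X\n",
    $plain_hit, $folded_hit, ord($angstrom);
```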
406
   ℞ 28: Convert non-ASCII Unicode numerics
       Unless you've used "/a" or "/aa", "\d" matches more than ASCII digits
       only, but Perl's implicit string-to-number conversion does not
       currently recognize these.  Here's how to convert such strings
       manually.
411
412        use v5.14;  # needed for num() function
413        use Unicode::UCD qw(num);
        my $str = "got Ⅻ and ४५६७ and ⅞ and here";
415        my @nums = ();
416        while ($str =~ /(\d+|\N)/g) {  # not just ASCII!
417           push @nums, num($1);
418        }
419        say "@nums";   #     12      4567      0.875
420
421        use charnames qw(:full);
422        my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
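
       A variant sketch that collects only the numeric values, by matching
       runs of decimal digits or single letter-number and other-number
       characters, and skipping anything for which "num" returns undef. The
       pattern is an assumption chosen for this illustration; the sample
       string is written with escapes for ROMAN NUMERAL TWELVE, four
       Devanagari digits, and VULGAR FRACTION SEVEN EIGHTHS.

```perl
use v5.14;      # needed for num() function
use warnings;
use Unicode::UCD qw(num);

my $str = "got \x{216B} and \x{96A}\x{96B}\x{96C}\x{96D} and \x{215E} and here";

my @nums;
while ($str =~ /(\d+|[\p{Nl}\p{No}])/g) {
    my $value = num($1);            # undef if the match has no numeric value
    push @nums, $value if defined $value;
}

say "@nums";
```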
423
   ℞ 29: Match Unicode grapheme cluster in regex
       Programmer-visible “characters” are codepoints matched by "/./s", but
       user-visible “characters” are graphemes matched by "/\X/".
427
428        # Find vowel *plus* any combining diacritics,underlining,etc.
429        my $nfd = NFD($orig);
430        $nfd =~ / (?=[aeiou]) \X /xi
431
   ℞ 30: Extract by grapheme instead of by codepoint (regex)
433        # match and grab five first graphemes
434        my($first_five) = $str =~ /^ ( \X{5} ) /x;
435
   ℞ 31: Extract by grapheme instead of by codepoint (substr)
437        # cpan -i Unicode::GCString
438        use Unicode::GCString;
439        my $gcs = Unicode::GCString->new($str);
440        my $first_five = $gcs->substr(0, 5);
441
   ℞ 32: Reverse string by grapheme
       Reversing by codepoint messes up diacritics, mistakenly converting
       "crème brûlée" into "éel̂urb em̀erc" instead of into "eélûrb emèrc"; so
       reverse by grapheme instead.  Both these approaches work right no
       matter what normalization the string is in:
447
448        $str = join("", reverse $str =~ /\X/g);
449
450        # OR: cpan -i Unicode::GCString
451        use Unicode::GCString;
452        $str = reverse Unicode::GCString->new($str);
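
       The difference can be checked with core modules only. This sketch
       reverses a decomposed string both ways; the helper name
       "reverse_graphemes" is an illustrative assumption.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: reverse a string one grapheme cluster at a time.
sub reverse_graphemes {
    my ($text) = @_;
    return join("", reverse $text =~ /\X/g);
}

# The dessert from the recipe text, decomposed so that each accent
# is a separate combining codepoint.
my $nfd = NFD("cr\x{E8}me br\x{FB}l\x{E9}e");

my $by_codepoint = scalar reverse $nfd;        # accents land on the wrong letters
my $by_grapheme  = reverse_graphemes($nfd);    # accents stay with their letters

my $expected = "e\x{E9}l\x{FB}rb em\x{E8}rc";
print "by grapheme matches expected: ",
    (NFC($by_grapheme) eq $expected ? "yes" : "no"), "\n";
```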
453
   ℞ 33: String length in graphemes
       The string "brûlée" has six graphemes but up to eight codepoints.  This
       counts by grapheme, not by codepoint:

        my $str = "brûlée";
459        my $count = 0;
460        while ($str =~ /\X/g) { $count++ }
461
462         # OR: cpan -i Unicode::GCString
463        use Unicode::GCString;
464        my $gcs = Unicode::GCString->new($str);
465        my $count = $gcs->length;
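
       A sketch that contrasts the two counts for both normalization forms,
       using only core modules. The helper name "grapheme_count" is an
       assumption for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: count grapheme clusters rather than codepoints.
sub grapheme_count {
    my ($text) = @_;
    my $count = () = $text =~ /\X/g;
    return $count;
}

my $word = "br\x{FB}l\x{E9}e";   # the word from the recipe, precomposed

for my $form (NFC($word), NFD($word)) {
    printf "%d codepoints, %d graphemes\n", length($form), grapheme_count($form);
}
```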
466
   ℞ 34: Unicode column-width for printing
       Perl's "printf", "sprintf", and "format" think all codepoints take up 1
       print column, but many take 0 or 2.  Here, to show that normalization
       makes no difference, we print out both forms:
471
472        use Unicode::GCString;
473        use Unicode::Normalize;
474
        my @words = qw/crème brûlée/;
476        @words = map { NFC($_), NFD($_) } @words;
477
478        for my $str (@words) {
479            my $gcs = Unicode::GCString->new($str);
480            my $cols = $gcs->columns;
481            my $pad = " " x (10 - $cols);
            say $str, $pad, " |";
483        }
484
485       generates this to show that it pads correctly no matter the
486       normalization:
487
        crème      |
        crème      |
        brûlée     |
        brûlée     |
492
   ℞ 35: Unicode collation
494       Text sorted by numeric codepoint follows no reasonable alphabetic
495       order; use the UCA for sorting text.
496
497        use Unicode::Collate;
498        my $col = Unicode::Collate->new();
499        my @list = $col->sort(@old_list);
500
501       See the ucsort program from the Unicode::Tussle CPAN module for a
502       convenient command-line interface to this module.
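
       A small sketch contrasting codepoint order with UCA order. The word
       list is made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Mixed case, plus one word starting with an accented letter (e-acute).
my @words = ("zebra", "Apple", "\x{E9}clair", "banana");

# Plain sort compares codepoints, so the accented word sorts after "zebra".
my @by_codepoint = sort @words;

# The UCA sorts it among the other words beginning with "e".
my @by_uca = Unicode::Collate->new->sort(@words);

printf "accented word is at position %d by codepoint, %d by UCA\n",
    (grep { $by_codepoint[$_] eq "\x{E9}clair" } 0 .. $#by_codepoint)[0] + 1,
    (grep { $by_uca[$_]       eq "\x{E9}clair" } 0 .. $#by_uca)[0] + 1;
```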
503
   ℞ 36: Case- and accent-insensitive Unicode sort
505       Specify a collation strength of level 1 to ignore case and diacritics,
506       only looking at the basic character.
507
508        use Unicode::Collate;
509        my $col = Unicode::Collate->new(level => 1);
510        my @list = $col->sort(@old_list);
511
   ℞ 37: Unicode locale collation
513       Some locales have special sorting rules.
514
515        # either use v5.12, OR: cpan -i Unicode::Collate::Locale
516        use Unicode::Collate::Locale;
517        my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
518        my @list = $col->sort(@old_list);
519
520       The ucsort program mentioned above accepts a "--locale" parameter.
521
   ℞ 38: Making "cmp" work on text instead of codepoints
523       Instead of this:
524
525        @srecs = sort {
526            $b->{AGE}   <=>  $a->{AGE}
527                        ||
528            $a->{NAME}  cmp  $b->{NAME}
529        } @recs;
530
531       Use this:
532
533        my $coll = Unicode::Collate->new();
534        for my $rec (@recs) {
535            $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
536        }
537        @srecs = sort {
538            $b->{AGE}       <=>  $a->{AGE}
539                            ||
540            $a->{NAME_key}  cmp  $b->{NAME_key}
541        } @recs;
542
   ℞ 39: Case- and accent-insensitive comparisons
544       Use a collator object to compare Unicode text by character instead of
545       by codepoint.
546
547        use Unicode::Collate;
548        my $es = Unicode::Collate->new(
549            level => 1,
550            normalization => undef
551        );
552
553         # now both are true:
        $es->eq("García",  "GARCIA" );
        $es->eq("Márquez", "MARQUEZ");
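
       To see what the collation level changes, this sketch compares the same
       pair of strings with a level-1 collator and with a default collator.
       It keeps the default normalization rather than disabling it as the
       recipe does; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $loose = Unicode::Collate->new(level => 1);   # base letters only
my $exact = Unicode::Collate->new();             # accents and case count too

my $accented = "Garc\x{ED}a";                    # i-acute in the middle

my $loose_equal = $loose->eq($accented, "GARCIA") ? 1 : 0;
my $exact_equal = $exact->eq($accented, "GARCIA") ? 1 : 0;

print "level 1: $loose_equal, default level: $exact_equal\n";
```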
556
   ℞ 40: Case- and accent-insensitive locale comparisons
558       Same, but in a specific locale.
559
560        my $de = Unicode::Collate::Locale->new(
561                   locale => "de__phonebook",
562                 );
563
564        # now this is true:
        $de->eq("tschüß", "TSCHUESS");  # notice ü => UE, ß => SS
566
   ℞ 41: Unicode linebreaking
568       Break up text into lines according to Unicode rules.
569
570        # cpan -i Unicode::LineBreak
571        use Unicode::LineBreak;
572        use charnames qw(:full);
573
574        my $para = "This is a super\N{HYPHEN}long string. " x 20;
575        my $fmt = Unicode::LineBreak->new;
576        print $fmt->break($para), "\n";
577
   ℞ 42: Unicode text in DBM hashes, the tedious way
       Using a regular Perl string as a key or value for a DBM hash will
       trigger a wide character exception if any codepoints won't fit into a
       byte.  Here's how to manually manage the translation:
582
583           use DB_File;
584           use Encode qw(encode decode);
585           tie %dbhash, "DB_File", "pathname";
586
587        # STORE
588
589           # assume $uni_key and $uni_value are abstract Unicode strings
590           my $enc_key   = encode("UTF-8", $uni_key, 1);
591           my $enc_value = encode("UTF-8", $uni_value, 1);
592           $dbhash{$enc_key} = $enc_value;
593
594        # FETCH
595
596           # assume $uni_key holds a normal Perl string (abstract Unicode)
597           my $enc_key   = encode("UTF-8", $uni_key, 1);
598           my $enc_value = $dbhash{$enc_key};
599           my $uni_value = decode("UTF-8", $enc_value, 1);
600
   ℞ 43: Unicode text in DBM hashes, the easy way
       Here's how to implicitly manage the translation; all encoding and
       decoding is done automatically, just as with streams that have a
       particular encoding attached to them:
605
606           use DB_File;
607           use DBM_Filter;
608
609           my $dbobj = tie %dbhash, "DB_File", "pathname";
           $dbobj->Filter_Push("utf8");  # this is the magic bit
611
612        # STORE
613
614           # assume $uni_key and $uni_value are abstract Unicode strings
615           $dbhash{$uni_key} = $uni_value;
616
617         # FETCH
618
619           # $uni_key holds a normal Perl string (abstract Unicode)
620           my $uni_value = $dbhash{$uni_key};
621
   ℞ 44: PROGRAM: Demo of Unicode collation and printing
       Here's a full program showing how to make use of locale-sensitive
       sorting, Unicode casing, and managing print widths when some of the
       characters take up zero or two columns, not just one column each time.
       When run, the following program produces this nicely aligned output:
627
           Crème Brûlée....... €2.00
           Éclair............. €1.60
           Fideuà............. €4.20
           Hamburger.......... €6.00
           Jamón Serrano...... €4.45
           Linguiça........... €7.00
           Pâté............... €4.15
           Pears.............. €2.00
           Pêches............. €2.25
           Smørbrød........... €5.75
           Spätzle............ €5.50
           Xoriço............. €3.00
           Γύρος.............. €6.50
           막걸리............. €4.00
           おもち............. €2.65
           お好み焼き......... €8.00
           シュークリーム..... €1.85
           寿司............... €9.99
           包子............... €7.50
647
648       Here's that program; tested on v5.14.
649
650        #!/usr/bin/env perl
651        # umenu - demo sorting and printing of Unicode food
652        #
653        # (obligatory and increasingly long preamble)
654        #
655        use utf8;
656        use v5.14;                       # for locale sorting
657        use strict;
658        use warnings;
659        use warnings  qw(FATAL utf8);    # fatalize encoding faults
660        use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
661        use charnames qw(:full :short);  # unneeded in v5.16
662
663        # std modules
664        use Unicode::Normalize;          # std perl distro as of v5.8
665        use List::Util qw(max);          # std perl distro as of v5.10
666        use Unicode::Collate::Locale;    # std perl distro as of v5.14
667
668        # cpan modules
669        use Unicode::GCString;           # from CPAN
670
671        # forward defs
672        sub pad($$$);
673        sub colwidth(_);
674        sub entitle(_);
675
        my %price = (
            "γύρος"             => 6.50, # gyros
            "pears"             => 2.00, # like um, pears
            "linguiça"          => 7.00, # spicy sausage, Portuguese
            "xoriço"            => 3.00, # chorizo sausage, Catalan
            "hamburger"         => 6.00, # burgermeister meisterburger
            "éclair"            => 1.60, # dessert, French
            "smørbrød"          => 5.75, # sandwiches, Norwegian
            "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
            "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
            "jamón serrano"     => 4.45, # country ham, Spanish
            "pêches"            => 2.25, # peaches, French
            "シュークリーム"    => 1.85, # cream-filled pastry like eclair
            "막걸리"            => 4.00, # makgeolli, Korean rice wine
            "寿司"              => 9.99, # sushi, Japanese
            "おもち"            => 2.65, # omochi, rice cakes, Japanese
            "crème brûlée"      => 2.00, # crema catalana
            "fideuà"            => 4.20, # more noodles, Valencian
                                         # (Catalan=fideuada)
            "pâté"              => 4.15, # gooseliver paste, French
            "お好み焼き"        => 8.00, # okonomiyaki, Japanese
697        );
698
699        my $width = 5 + max map { colwidth } keys %price;
700
701        # So the Asian stuff comes out in an order that someone
702        # who reads those scripts won't freak out over; the
703        # CJK stuff will be in JIS X 0208 order that way.
704        my $coll  = Unicode::Collate::Locale->new(locale => "ja");
705
706        for my $item ($coll->sort(keys %price)) {
707            print pad(entitle($item), $width, ".");
            printf " €%.2f\n", $price{$item};
709        }
710
711        sub pad($$$) {
712            my($str, $width, $padchar) = @_;
713            return $str . ($padchar x ($width - colwidth($str)));
714        }
715
716        sub colwidth(_) {
717            my($str) = @_;
718            return Unicode::GCString->new($str)->columns;
719        }
720
721        sub entitle(_) {
722            my($str) = @_;
723            $str =~ s{ (?=\pL)(\S)     (\S*) }
724                     { ucfirst($1) . lc($2)  }xge;
725            return $str;
726        }
727

SEE ALSO

729       See these manpages, some of which are CPAN modules: perlunicode,
730       perluniprops, perlre, perlrecharclass, perluniintro, perlunitut,
731       perlunifaq, PerlIO, DB_File, DBM_Filter, DBM_Filter::utf8, Encode,
732       Encode::Locale, Unicode::UCD, Unicode::Normalize, Unicode::GCString,
733       Unicode::LineBreak, Unicode::Collate, Unicode::Collate::Locale,
734       Unicode::Unihan, Unicode::CaseFold, Unicode::Tussle,
735       Lingua::JA::Romanize::Japanese, Lingua::ZH::Romanize::Pinyin,
736       Lingua::KO::Romanize::Hangul.
737
738       The Unicode::Tussle CPAN module includes many programs to help with
739       working with Unicode, including these programs to fully or partly
740       replace standard utilities: tcgrep instead of egrep, uniquote instead
741       of cat -v or hexdump, uniwc instead of wc, unilook instead of look,
742       unifmt instead of fmt, and ucsort instead of sort.  For exploring
743       Unicode character names and character properties, see its uniprops,
744       unichars, and uninames programs.  It also supplies these programs, all
745       of which are general filters that do Unicode-y things: unititle and
746       unicaps; uniwide and uninarrow; unisupers and unisubs; nfd, nfc, nfkd,
747       and nfkc; and uc, lc, and tc.
748
749       Finally, see the published Unicode Standard (page numbers are from
750       version 6.0.0), including these specific annexes and technical reports:
751
       §3.13 Default Case Algorithms, page 113; §4.2 Case, pages 120-122;
       Case Mappings, pages 166-172, especially Caseless Matching starting on
       page 170.
755       UAX #44: Unicode Character Database
756       UTS #18: Unicode Regular Expressions
757       UAX #15: Unicode Normalization Forms
758       UTS #10: Unicode Collation Algorithm
759       UAX #29: Unicode Text Segmentation
760       UAX #14: Unicode Line Breaking Algorithm
761       UAX #11: East Asian Width
762

AUTHOR

764       Tom Christiansen <tchrist@perl.com> wrote this, with occasional
765       kibbitzing from Larry Wall and Jeffrey Friedl in the background.
766
       Copyright © 2012 Tom Christiansen.
769
770       This program is free software; you may redistribute it and/or modify it
771       under the same terms as Perl itself.
772
       Most of these examples are taken from the current edition of the “Camel
       Book”; that is, from the 4th Edition of Programming Perl, Copyright ©
       2012 Tom Christiansen <et al.>, 2012-02-13 by O'Reilly Media.  The code
       itself is freely redistributable, and you are encouraged to transplant,
       fold, spindle, and mutilate any of the examples in this manpage however
       you please for inclusion into your own programs without any encumbrance
       whatsoever.  Acknowledgement via code comment is polite but not
       required.
781

REVISION HISTORY

       v1.0.0 - first public release, 2012-02-27
784
785
786
perl v5.32.1                      2021-03-31                    PERLUNICOOK(1)