PERLUNICOOK(1)          Perl Programmers Reference Guide         PERLUNICOOK(1)

NAME

       perlunicook - cookbookish examples of handling Unicode in Perl


DESCRIPTION

       This manpage contains short recipes demonstrating how to handle common
       Unicode operations in Perl, plus one complete program at the end.  Any
       undeclared variables in individual recipes are assumed to have a
       previous appropriate value in them.


EXAMPLES

   ℞ 0: Standard preamble
       Unless otherwise noted, all examples below require this standard
       preamble to work correctly, with the "#!" adjusted to work on your
       system:

        #!/usr/bin/env perl

        use utf8;      # so literals and identifiers can be in UTF-8
        use v5.12;     # or later to get "unicode_strings" feature
        use strict;    # quote strings, declare variables
        use warnings;  # on by default
        use warnings  qw(FATAL utf8);    # fatalize encoding glitches
        use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
        use charnames qw(:full :short);  # unneeded in v5.16

       This does make even Unix programmers "binmode" your binary streams, or
       open them with ":raw", but that's the only way to get at them portably
       anyway.

       WARNING: "use autodie" (pre 2.26) and "use open" do not get along with
       each other.

   ℞ 1: Generic Unicode-savvy filter
       Always decompose on the way in, then recompose on the way out.

        use Unicode::Normalize;

        while (<>) {
            $_ = NFD($_);   # decompose + reorder canonically
            ...
        } continue {
            print NFC($_);  # recompose (where possible) + reorder canonically
        }

   ℞ 2: Fine-tuning Unicode warnings
       As of v5.14, Perl distinguishes three subclasses of UTF-8 warnings.

        use v5.14;                  # subwarnings unavailable any earlier
        no warnings "nonchar";      # the 66 forbidden non-characters
        no warnings "surrogate";    # UTF-16/CESU-8 nonsense
        no warnings "non_unicode";  # for codepoints over 0x10_FFFF

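       The same category names can also be promoted to fatal errors
       selectively rather than all at once; a minimal sketch, assuming v5.14
       or later as above:

        use warnings FATAL => qw(surrogate non_unicode);  # die on these two
        no  warnings "nonchar";                           # but allow these quietly
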
   ℞ 3: Declare source in utf8 for identifiers and literals
       Without the all-critical "use utf8" declaration, putting UTF-8 in your
       literals and identifiers won't work right.  If you used the standard
       preamble just given above, this already happened.  If you did, you can
       do things like this:

        use utf8;

        my $measure   = "Ångström";
        my @μsoft     = qw( cp852 cp1251 cp1252 );
        my @ὑπέρμεγας = qw( ὑπέρ  μεγας );
        my @鯉        = qw( koi8-f koi8-u koi8-r );
        my $motto     = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL

       If you forget "use utf8", high bytes will be misunderstood as separate
       characters, and nothing will work right.

   ℞ 4: Characters and their numbers
       The "ord" and "chr" functions work transparently on all codepoints, not
       just on ASCII alone; in fact, not even on Unicode alone.

        # ASCII characters
        ord("A")
        chr(65)

        # characters from the Basic Multilingual Plane
        ord("Σ")
        chr(0x3A3)

        # beyond the BMP
        ord("𝑛")               # MATHEMATICAL ITALIC SMALL N
        chr(0x1D45B)

        # beyond Unicode! (up to MAXINT)
        ord("\x{20_0000}")
        chr(0x20_0000)

   ℞ 5: Unicode literals by character number
       In an interpolated literal, whether a double-quoted string or a regex,
       you may specify a character by its number using the "\x{HHHHHH}"
       escape.

        String: "\x{3a3}"
        Regex:  /\x{3a3}/

        String: "\x{1d45b}"
        Regex:  /\x{1d45b}/

        # even non-BMP ranges in regex work fine
        /[\x{1D434}-\x{1D467}]/

   ℞ 6: Get character name by number
        use charnames ();
        my $name = charnames::viacode(0x03A3);

   ℞ 7: Get character number by name
        use charnames ();
        my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");

   ℞ 8: Unicode named characters
       Use the "\N{charname}" notation to get the character by that name for
       use in interpolated literals (double-quoted strings and regexes).  In
       v5.16, there is an implicit

        use charnames qw(:full :short);

       But prior to v5.16, you must be explicit about which set of charnames
       you want.  The ":full" names are the official Unicode character name,
       alias, or sequence, which all share a namespace.

        use charnames qw(:full :short latin greek);

        "\N{MATHEMATICAL ITALIC SMALL N}"      # :full
        "\N{GREEK CAPITAL LETTER SIGMA}"       # :full

       Anything else is a Perl-specific convenience abbreviation.  Specify one
       or more scripts by names if you want short names that are script-
       specific.

        "\N{Greek:Sigma}"                      # :short
        "\N{ae}"                               #  latin
        "\N{epsilon}"                          #  greek

       The v5.16 release also supports a ":loose" import for loose matching of
       character names, which works just like loose matching of property
       names: that is, it disregards case, whitespace, and underscores:

        "\N{euro sign}"                        # :loose (from v5.16)

   ℞ 9: Unicode named sequences
       These look just like character names but return multiple codepoints.
       Notice the %vx vector-print functionality in "printf".

        use charnames qw(:full);
        my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
        printf "U+%v04X\n", $seq;
        U+0100.0300

   ℞ 10: Custom named characters
       Use ":alias" to give your own lexically scoped nicknames to existing
       characters, or even to give unnamed private-use characters useful
       names.

        use charnames ":full", ":alias" => {
            ecute => "LATIN SMALL LETTER E WITH ACUTE",
            "APPLE LOGO" => 0xF8FF, # private use character
        };

        "\N{ecute}"
        "\N{APPLE LOGO}"

   ℞ 11: Names of CJK codepoints
       Sinograms like 「東京」 come back with character names of "CJK UNIFIED
       IDEOGRAPH-6771" and "CJK UNIFIED IDEOGRAPH-4EAC", because their "names"
       vary.  The CPAN "Unicode::Unihan" module has a large database for
       decoding these (and a whole lot more), provided you know how to
       understand its output.

        # cpan -i Unicode::Unihan
        use Unicode::Unihan;
        my $str = "東京";
        my $unhan = Unicode::Unihan->new;
        for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
            printf "CJK $str in %-12s is ", $lang;
            say $unhan->$lang($str);
        }

       prints:

        CJK 東京 in Mandarin     is DONG1JING1
        CJK 東京 in Cantonese    is dung1ging1
        CJK 東京 in Korean       is TONGKYENG
        CJK 東京 in JapaneseOn   is TOUKYOU KEI KIN
        CJK 東京 in JapaneseKun  is HIGASHI AZUMAMIYAKO

       If you have a specific romanization scheme in mind, use the specific
       module:

        # cpan -i Lingua::JA::Romanize::Japanese
        use Lingua::JA::Romanize::Japanese;
        my $k2r = Lingua::JA::Romanize::Japanese->new;
        my $str = "東京";
        say "Japanese for $str is ", $k2r->chars($str);

       prints

        Japanese for 東京 is toukyou

   ℞ 12: Explicit encode/decode
       On rare occasion, such as a database read, you may be given encoded
       text you need to decode.

         use Encode qw(encode decode);

         my $chars = decode("shiftjis", $bytes, 1);
        # OR
         my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);

       For streams all in the same encoding, don't use encode/decode; instead
       set the file encoding when you open the file or immediately after with
       "binmode" as described later below.

   ℞ 13: Decode program arguments as utf8
            $ perl -CA ...
        or
            $ export PERL_UNICODE=A
        or
           use Encode qw(decode);
           @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;

   ℞ 14: Decode program arguments as locale encoding
           # cpan -i Encode::Locale
           use Encode qw(locale);
           use Encode::Locale;

           # use "locale" as an arg to encode/decode
           @ARGV = map { decode(locale => $_, 1) } @ARGV;

   ℞ 15: Declare STD{IN,OUT,ERR} to be utf8
       Use a command-line option, an environment variable, or else call
       "binmode" explicitly:

            $ perl -CS ...
        or
            $ export PERL_UNICODE=S
        or
            use open qw(:std :encoding(UTF-8));
        or
            binmode(STDIN,  ":encoding(UTF-8)");
            binmode(STDOUT, ":utf8");
            binmode(STDERR, ":utf8");

   ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding
           # cpan -i Encode::Locale
           use Encode;
           use Encode::Locale;

           # or as a stream for binmode or open
           binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
           binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
           binmode STDERR, ":encoding(console_out)" if -t STDERR;

   ℞ 17: Make file I/O default to utf8
       Files opened without an encoding argument will be in UTF-8:

            $ perl -CD ...
        or
            $ export PERL_UNICODE=D
        or
            use open qw(:encoding(UTF-8));

   ℞ 18: Make all I/O and args default to utf8
            $ perl -CSDA ...
        or
            $ export PERL_UNICODE=SDA
        or
            use open qw(:std :encoding(UTF-8));
            use Encode qw(decode);
            @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;

   ℞ 19: Open file with specific encoding
       Specify stream encoding.  This is the normal way to deal with encoded
       text, not by calling low-level functions.

        # input file
            open(my $in_file, "< :encoding(UTF-16)", "wintext");
        OR
            open(my $in_file, "<", "wintext");
            binmode($in_file, ":encoding(UTF-16)");
        THEN
            my $line = <$in_file>;

        # output file
            open(my $out_file, "> :encoding(cp1252)", "wintext");
        OR
            open(my $out_file, ">", "wintext");
            binmode($out_file, ":encoding(cp1252)");
        THEN
            print $out_file "some text\n";

       More layers than just the encoding can be specified here. For example,
       the incantation ":raw :encoding(UTF-16LE) :crlf" includes implicit CRLF
       handling.

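       For instance, a minimal sketch of reading a CRLF-delimited UTF-16LE
       file through that exact layer stack (the filename "wintext" is just a
       placeholder, as above):

        open(my $in_file, "< :raw :encoding(UTF-16LE) :crlf", "wintext")
            || die "cannot open wintext: $!";
        while (my $line = <$in_file>) {
            chomp $line;   # the :crlf layer has already turned CRLF into "\n"
            # ... process $line here ...
        }
        close($in_file);
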
   ℞ 20: Unicode casing
       Unicode casing is very different from ASCII casing.

        uc("henry ⅷ")  # "HENRY Ⅷ"
        uc("tschüß")   # "TSCHÜSS"  notice ß => SS

        # both are true:
        "tschüß"  =~ /TSCHÜSS/i   # notice ß => SS
        "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i   # notice Σ,σ,ς sameness

   ℞ 21: Unicode case-insensitive comparisons
       Also available in the CPAN Unicode::CaseFold module, the new "fc"
       ("foldcase") function from v5.16 grants access to the same Unicode
       casefolding as the "/i" pattern modifier has always used:

        use feature "fc"; # fc() function is from v5.16

        # sort case-insensitively
        my @sorted = sort { fc($a) cmp fc($b) } @list;

        # both are true:
        fc("tschüß")  eq fc("TSCHÜSS")
        fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")

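       With the "fc" feature enabled as above, the \F escape gives the same
       folding inside double-quoted strings; a small sketch:

        # same comparison written with the \F interpolation escape
        "\Ftschüß" eq fc("tschüß")   # true
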
   ℞ 22: Match Unicode linebreak sequence in regex
       A Unicode linebreak matches the two-character CRLF grapheme or any of
       seven vertical whitespace characters.  Good for dealing with textfiles
       coming from different operating systems.

        \R

        s/\R/\n/g;  # normalize all linebreaks to \n

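       The same escape also splits slurped text into lines no matter which
       platform wrote it; a small sketch, assuming $whole holds an entire
       file's contents:

        my @lines = split /\R/, $whole;   # platform-independent line split
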
   ℞ 23: Get character category
       Find the general category of a numeric codepoint.

        use Unicode::UCD qw(charinfo);
        my $cat = charinfo(0x3A3)->{category};  # "Lu"

   ℞ 24: Disabling Unicode-awareness in builtin charclasses
       Disable "\w", "\b", "\s", "\d", and the POSIX classes from working
       correctly on Unicode either in this scope, or in just one regex.

        use v5.14;
        use re "/a";

        # OR

        my($num) = $str =~ /(\d+)/a;

       Or use specific un-Unicode properties, like "\p{ahex}" and
       "\p{POSIX_Digit}".  Properties still work normally no matter what
       charset modifiers ("/d /u /l /a /aa") are in effect.

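       For example, a minimal sketch matching ASCII-only digits and ASCII-only
       hex digits even where "/u" semantics would otherwise apply:

        my($num) = $str =~ /(\p{POSIX_Digit}+)/;   # ASCII 0-9 only
        my($hex) = $str =~ /(\p{ahex}+)/;          # ASCII hex digits only
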
   ℞ 25: Match Unicode properties in regex with \p, \P
       These all match a single codepoint with the given property.  Use "\P"
       in place of "\p" to match one codepoint lacking that property.

        \pL, \pN, \pS, \pP, \pM, \pZ, \pC
        \p{Sk}, \p{Ps}, \p{Lt}
        \p{alpha}, \p{upper}, \p{lower}
        \p{Latin}, \p{Greek}
        \p{script_extensions=Latin}, \p{scx=Greek}
        \p{East_Asian_Width=Wide}, \p{EA=W}
        \p{Line_Break=Hyphen}, \p{LB=HY}
        \p{Numeric_Value=4}, \p{NV=4}

   ℞ 26: Custom character properties
       Define at compile-time your own custom character properties for use in
       regexes.

        # using private-use characters
        sub In_Tengwar { "E000\tE07F\n" }

        if (/\p{In_Tengwar}/) { ... }

        # blending existing properties
        sub Is_GraecoRoman_Title {<<'END_OF_SET'}
        +utf8::IsLatin
        +utf8::IsGreek
        &utf8::IsTitle
        END_OF_SET

        if (/\p{Is_GraecoRoman_Title}/) { ... }

   ℞ 27: Unicode normalization
       Typically render into NFD on input and NFC on output.  Using the NFKC
       or NFKD functions improves recall on searches, assuming you've already
       done the same to the text to be searched.  Note that this is about much
       more than just pre-combined compatibility glyphs; it also reorders
       marks according to their canonical combining classes and weeds out
       singletons.

        use Unicode::Normalize;
        my $nfd  = NFD($orig);
        my $nfc  = NFC($orig);
        my $nfkd = NFKD($orig);
        my $nfkc = NFKC($orig);

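       As a small illustration of the compatibility forms, the "ﬁ" ligature
       (U+FB01) decomposes to plain "fi" under NFKC, so an NFKC-normalized
       needle and haystack can find each other:

        use Unicode::Normalize;
        NFKC("ﬁle") eq NFKC("file")   # true: the ligature folds to "fi"
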
   ℞ 28: Convert non-ASCII Unicode numerics
       Unless you've used "/a" or "/aa", "\d" matches more than ASCII digits
       only, but Perl's implicit string-to-number conversion does not
       currently recognize these.  Here's how to convert such strings
       manually.

        use v5.14;  # needed for num() function
        use Unicode::UCD qw(num);
        my $str = "got Ⅻ and ४५६७ and ⅞ and here";
        my @nums = ();
        while ($str =~ /(\d+|\N)/g) {  # not just ASCII!
           push @nums, num($1);
        }
        say "@nums";   #     12      4567      0.875

        use charnames qw(:full);
        my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");

   ℞ 29: Match Unicode grapheme cluster in regex
       Programmer-visible "characters" are codepoints matched by "/./s", but
       user-visible "characters" are graphemes matched by "/\X/".

        # Find vowel *plus* any combining diacritics, underlining, etc.
        my $nfd = NFD($orig);
        $nfd =~ / (?=[aeiou]) \X /xi

   ℞ 30: Extract by grapheme instead of by codepoint (regex)
        # match and grab five first graphemes
        my($first_five) = $str =~ /^ ( \X{5} ) /x;

   ℞ 31: Extract by grapheme instead of by codepoint (substr)
        # cpan -i Unicode::GCString
        use Unicode::GCString;
        my $gcs = Unicode::GCString->new($str);
        my $first_five = $gcs->substr(0, 5);

   ℞ 32: Reverse string by grapheme
       Reversing by codepoint messes up diacritics, mistakenly converting
       "crème brûlée" into "éel̂urb em̀erc" instead of into "eélûrb emèrc"; so
       reverse by grapheme instead.  Both these approaches work right no
       matter what normalization the string is in:

        $str = join("", reverse $str =~ /\X/g);

        # OR: cpan -i Unicode::GCString
        use Unicode::GCString;
        $str = reverse Unicode::GCString->new($str);

   ℞ 33: String length in graphemes
       The string "brûlée" has six graphemes but up to eight codepoints.  This
       counts by grapheme, not by codepoint:

        my $str = "brûlée";
        my $count = 0;
        while ($str =~ /\X/g) { $count++ }

        # OR: cpan -i Unicode::GCString
        use Unicode::GCString;
        my $gcs = Unicode::GCString->new($str);
        my $count = $gcs->length;

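       If you prefer a one-liner, the same count can be had with the usual
       count-the-matches idiom; a small sketch:

        my $count = () = $str =~ /\X/g;   # force list context, then count it
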
   ℞ 34: Unicode column-width for printing
       Perl's "printf", "sprintf", and "format" think all codepoints take up 1
       print column, but many take 0 or 2.  Here to show that normalization
       makes no difference, we print out both forms:

        use Unicode::GCString;
        use Unicode::Normalize;

        my @words = qw/crème brûlée/;
        @words = map { NFC($_), NFD($_) } @words;

        for my $str (@words) {
            my $gcs = Unicode::GCString->new($str);
            my $cols = $gcs->columns;
            my $pad = " " x (10 - $cols);
            say $str, $pad, " |";
        }

       generates this to show that it pads correctly no matter the
       normalization:

        crème      |
        crème      |
        brûlée     |
        brûlée     |

   ℞ 35: Unicode collation
       Text sorted by numeric codepoint follows no reasonable alphabetic
       order; use the UCA for sorting text.

        use Unicode::Collate;
        my $col = Unicode::Collate->new();
        my @list = $col->sort(@old_list);

       See the ucsort program from the Unicode::Tussle CPAN module for a
       convenient command-line interface to this module.

   ℞ 36: Case- and accent-insensitive Unicode sort
       Specify a collation strength of level 1 to ignore case and diacritics,
       only looking at the basic character.

        use Unicode::Collate;
        my $col = Unicode::Collate->new(level => 1);
        my @list = $col->sort(@old_list);

   ℞ 37: Unicode locale collation
       Some locales have special sorting rules.

        # either use v5.12, OR: cpan -i Unicode::Collate::Locale
        use Unicode::Collate::Locale;
        my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
        my @list = $col->sort(@old_list);

       The ucsort program mentioned above accepts a "--locale" parameter.

   ℞ 38: Making "cmp" work on text instead of codepoints
       Instead of this:

        @srecs = sort {
            $b->{AGE}   <=>  $a->{AGE}
                        ||
            $a->{NAME}  cmp  $b->{NAME}
        } @recs;

       Use this:

        my $coll = Unicode::Collate->new();
        for my $rec (@recs) {
            $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
        }
        @srecs = sort {
            $b->{AGE}       <=>  $a->{AGE}
                            ||
            $a->{NAME_key}  cmp  $b->{NAME_key}
        } @recs;

   ℞ 39: Case- and accent-insensitive comparisons
       Use a collator object to compare Unicode text by character instead of
       by codepoint.

        use Unicode::Collate;
        my $es = Unicode::Collate->new(
            level => 1,
            normalization => undef
        );

        # now both are true:
        $es->eq("García",  "GARCIA" );
        $es->eq("Márquez", "MARQUEZ");

   ℞ 40: Case- and accent-insensitive locale comparisons
       Same, but in a specific locale.

        my $de = Unicode::Collate::Locale->new(
                   locale => "de__phonebook",
                 );

        # now this is true:
        $de->eq("tschüß", "TSCHUESS");  # notice ü => UE, ß => SS

   ℞ 41: Unicode linebreaking
       Break up text into lines according to Unicode rules.

        # cpan -i Unicode::LineBreak
        use Unicode::LineBreak;
        use charnames qw(:full);

        my $para = "This is a super\N{HYPHEN}long string. " x 20;
        my $fmt = Unicode::LineBreak->new;
        print $fmt->break($para), "\n";

   ℞ 42: Unicode text in DBM hashes, the tedious way
       Using a regular Perl string as a key or value for a DBM hash will
       trigger a wide character exception if any codepoints won't fit into a
       byte.  Here's how to manually manage the translation:

           use DB_File;
           use Encode qw(encode decode);
           tie %dbhash, "DB_File", "pathname";

        # STORE

           # assume $uni_key and $uni_value are abstract Unicode strings
           my $enc_key   = encode("UTF-8", $uni_key, 1);
           my $enc_value = encode("UTF-8", $uni_value, 1);
           $dbhash{$enc_key} = $enc_value;

        # FETCH

           # assume $uni_key holds a normal Perl string (abstract Unicode)
           my $enc_key   = encode("UTF-8", $uni_key, 1);
           my $enc_value = $dbhash{$enc_key};
           my $uni_value = decode("UTF-8", $enc_value, 1);

   ℞ 43: Unicode text in DBM hashes, the easy way
       Here's how to implicitly manage the translation; all encoding and
       decoding is done automatically, just as with streams that have a
       particular encoding attached to them:

           use DB_File;
           use DBM_Filter;

           my $dbobj = tie %dbhash, "DB_File", "pathname";
           $dbobj->Filter_Value("utf8");  # this is the magic bit

        # STORE

           # assume $uni_key and $uni_value are abstract Unicode strings
           $dbhash{$uni_key} = $uni_value;

        # FETCH

           # $uni_key holds a normal Perl string (abstract Unicode)
           my $uni_value = $dbhash{$uni_key};

   ℞ 44: PROGRAM: Demo of Unicode collation and printing
       Here's a full program showing how to make use of locale-sensitive
       sorting, Unicode casing, and managing print widths when some of the
       characters take up zero or two columns, not just one column each time.
       When run, the following program produces this nicely aligned output:

           Crème Brûlée....... €2.00
           Éclair............. €1.60
           Fideuà............. €4.20
           Hamburger.......... €6.00
           Jamón Serrano...... €4.45
           Linguiça........... €7.00
           Pâté............... €4.15
           Pears.............. €2.00
           Pêches............. €2.25
           Smørbrød........... €5.75
           Spätzle............ €5.50
           Xoriço............. €3.00
           Γύρος.............. €6.50
           막걸리............. €4.00
           おもち............. €2.65
           お好み焼き......... €8.00
           シュークリーム..... €1.85
           寿司............... €9.99
           包子............... €7.50

       Here's that program; tested on v5.14.

        #!/usr/bin/env perl
        # umenu - demo sorting and printing of Unicode food
        #
        # (obligatory and increasingly long preamble)
        #
        use utf8;
        use v5.14;                       # for locale sorting
        use strict;
        use warnings;
        use warnings  qw(FATAL utf8);    # fatalize encoding faults
        use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
        use charnames qw(:full :short);  # unneeded in v5.16

        # std modules
        use Unicode::Normalize;          # std perl distro as of v5.8
        use List::Util qw(max);          # std perl distro as of v5.10
        use Unicode::Collate::Locale;    # std perl distro as of v5.14

        # cpan modules
        use Unicode::GCString;           # from CPAN

        # forward defs
        sub pad($$$);
        sub colwidth(_);
        sub entitle(_);

        my %price = (
            "γύρος"             => 6.50, # gyros
            "pears"             => 2.00, # like um, pears
            "linguiça"          => 7.00, # spicy sausage, Portuguese
            "xoriço"            => 3.00, # chorizo sausage, Catalan
            "hamburger"         => 6.00, # burgermeister meisterburger
            "éclair"            => 1.60, # dessert, French
            "smørbrød"          => 5.75, # sandwiches, Norwegian
            "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
            "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
            "jamón serrano"     => 4.45, # country ham, Spanish
            "pêches"            => 2.25, # peaches, French
            "シュークリーム"    => 1.85, # cream-filled pastry like eclair
            "막걸리"            => 4.00, # makgeolli, Korean rice wine
            "寿司"              => 9.99, # sushi, Japanese
            "おもち"            => 2.65, # omochi, rice cakes, Japanese
            "crème brûlée"      => 2.00, # crema catalana
            "fideuà"            => 4.20, # more noodles, Valencian
                                         # (Catalan=fideuada)
            "pâté"              => 4.15, # gooseliver paste, French
            "お好み焼き"        => 8.00, # okonomiyaki, Japanese
        );

        my $width = 5 + max map { colwidth } keys %price;

        # So the Asian stuff comes out in an order that someone
        # who reads those scripts won't freak out over; the
        # CJK stuff will be in JIS X 0208 order that way.
        my $coll  = Unicode::Collate::Locale->new(locale => "ja");

        for my $item ($coll->sort(keys %price)) {
            print pad(entitle($item), $width, ".");
            printf " €%.2f\n", $price{$item};
        }

        sub pad($$$) {
            my($str, $width, $padchar) = @_;
            return $str . ($padchar x ($width - colwidth($str)));
        }

        sub colwidth(_) {
            my($str) = @_;
            return Unicode::GCString->new($str)->columns;
        }

        sub entitle(_) {
            my($str) = @_;
            $str =~ s{ (?=\pL)(\S)     (\S*) }
                     { ucfirst($1) . lc($2)  }xge;
            return $str;
        }


SEE ALSO

       See these manpages, some of which are CPAN modules: perlunicode,
       perluniprops, perlre, perlrecharclass, perluniintro, perlunitut,
       perlunifaq, PerlIO, DB_File, DBM_Filter, DBM_Filter::utf8, Encode,
       Encode::Locale, Unicode::UCD, Unicode::Normalize, Unicode::GCString,
       Unicode::LineBreak, Unicode::Collate, Unicode::Collate::Locale,
       Unicode::Unihan, Unicode::CaseFold, Unicode::Tussle,
       Lingua::JA::Romanize::Japanese, Lingua::ZH::Romanize::Pinyin,
       Lingua::KO::Romanize::Hangul.

       The Unicode::Tussle CPAN module includes many programs to help with
       working with Unicode, including these programs to fully or partly
       replace standard utilities: tcgrep instead of egrep, uniquote instead
       of cat -v or hexdump, uniwc instead of wc, unilook instead of look,
       unifmt instead of fmt, and ucsort instead of sort.  For exploring
       Unicode character names and character properties, see its uniprops,
       unichars, and uninames programs.  It also supplies these programs, all
       of which are general filters that do Unicode-y things: unititle and
       unicaps; uniwide and uninarrow; unisupers and unisubs; nfd, nfc, nfkd,
       and nfkc; and uc, lc, and tc.

       Finally, see the published Unicode Standard (page numbers are from
       version 6.0.0), including these specific annexes and technical reports:

       §3.13 Default Case Algorithms, page 113; §4.2 Case, pages 120-122;
       Case Mappings, pages 166-172, especially Caseless Matching starting on
       page 170.
       UAX #44: Unicode Character Database
       UTS #18: Unicode Regular Expressions
       UAX #15: Unicode Normalization Forms
       UTS #10: Unicode Collation Algorithm
       UAX #29: Unicode Text Segmentation
       UAX #14: Unicode Line Breaking Algorithm
       UAX #11: East Asian Width


AUTHOR

       Tom Christiansen <tchrist@perl.com> wrote this, with occasional
       kibbitzing from Larry Wall and Jeffrey Friedl in the background.

       Copyright © 2012 Tom Christiansen.

       This program is free software; you may redistribute it and/or modify it
       under the same terms as Perl itself.

       Most of these examples are taken from the current edition of the "Camel
       Book"; that is, from the 4th Edition of Programming Perl, Copyright ©
       2012 Tom Christiansen <et al.>, 2012-02-13 by O'Reilly Media.  The code
       itself is freely redistributable, and you are encouraged to transplant,
       fold, spindle, and mutilate any of the examples in this manpage however
       you please for inclusion into your own programs without any encumbrance
       whatsoever.  Acknowledgement via code comment is polite but not
       required.


REVISION HISTORY

       v1.0.0 - first public release, 2012-02-27

perl v5.30.2                      2020-03-27                     PERLUNICOOK(1)