1PERLUNICOOK(1)         Perl Programmers Reference Guide         PERLUNICOOK(1)
2
3
4

NAME

6       perlunicook - cookbookish examples of handling Unicode in Perl
7

DESCRIPTION

9       This manpage contains short recipes demonstrating how to handle common
10       Unicode operations in Perl, plus one complete program at the end. Any
11       undeclared variables in individual recipes are assumed to have a
12       previous appropriate value in them.
13

EXAMPLES

15   ℞ 0: Standard preamble
16       Unless otherwise noted, all examples below require this standard
17       preamble to work correctly, with the "#!" adjusted to work on your
18       system:
19
20        #!/usr/bin/env perl
21
22        use v5.36;     # or later to get "unicode_strings" feature,
23                       #   plus strict, warnings
24        use utf8;      # so literals and identifiers can be in UTF-8
25        use warnings  qw(FATAL utf8);    # fatalize encoding glitches
26        use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
27        use charnames qw(:full :short);  # unneeded in v5.16
28
29       This does mean that even Unix programmers must "binmode" their
30       binary streams, or open them with ":raw", but that's the only way to
31       get at them portably anyway.
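
       For instance, a minimal sketch of reading a binary file while the
       preamble's "use open" default is in effect (the filename here is just
       an example):

        open(my $fh, "< :raw", "some.jpeg")
            or die "can't open some.jpeg: $!";
        my $octets = do { local $/; <$fh> };   # slurp the raw bytes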
32
33       WARNING: "use autodie" (pre 2.26) and "use open" do not get along with
34       each other.
35
36   ℞ 1: Generic Unicode-savvy filter
37       Always decompose on the way in, then recompose on the way out.
38
39        use Unicode::Normalize;
40
41        while (<>) {
42            $_ = NFD($_);   # decompose + reorder canonically
43            ...
44        } continue {
45            print NFC($_);  # recompose (where possible) + reorder canonically
46        }
47
48   ℞ 2: Fine-tuning Unicode warnings
49       As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.
50
51        use v5.14;                  # subwarnings unavailable any earlier
52        no warnings "nonchar";      # the 66 forbidden non-characters
53        no warnings "surrogate";    # UTF-16/CESU-8 nonsense
54        no warnings "non_unicode";  # for codepoints over 0x10_FFFF
55
56   ℞ 3: Declare source in utf8 for identifiers and literals
57       Without the all-critical "use utf8" declaration, putting UTF‑8 in your
58       literals and identifiers won’t work right.  If you used the standard
59       preamble just given above, this already happened, and you can do
60       things like this:
61
62        use utf8;
63
64        my $measure   = "Ångström";
65        my @μsoft     = qw( cp852 cp1251 cp1252 );
66        my @ὑπέρμεγας = qw( ὑπέρ  μεγας );
67        my @鯉        = qw( koi8-f koi8-u koi8-r );
68        my $motto     = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL
69
70       If you forget "use utf8", high bytes will be misunderstood as separate
71       characters, and nothing will work right.
72
73   ℞ 4: Characters and their numbers
74       The "ord" and "chr" functions work transparently on all codepoints,
75       not just on ASCII — nor, in fact, just on Unicode.
76
77        # ASCII characters
78        ord("A")
79        chr(65)
80
81        # characters from the Basic Multilingual Plane
82        ord("Σ")
83        chr(0x3A3)
84
85        # beyond the BMP
86        ord("𝑛")               # MATHEMATICAL ITALIC SMALL N
87        chr(0x1D45B)
88
89        # beyond Unicode! (up to MAXINT)
90        ord("\x{20_0000}")
91        chr(0x20_0000)
92
93   ℞ 5: Unicode literals by character number
94       In an interpolated literal, whether a double-quoted string or a regex,
95       you may specify a character by its number using the "\x{HHHHHH}"
96       escape.
97
98        String: "\x{3a3}"
99        Regex:  /\x{3a3}/
100
101        String: "\x{1d45b}"
102        Regex:  /\x{1d45b}/
103
104        # even non-BMP ranges in regex work fine
105        /[\x{1D434}-\x{1D467}]/
106
107   ℞ 6: Get character name by number
108        use charnames ();
109        my $name = charnames::viacode(0x03A3);
110
111   ℞ 7: Get character number by name
112        use charnames ();
113        my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
114
115   ℞ 8: Unicode named characters
116       Use the "\N{charname}" notation to get the character by that name for
117       use in interpolated literals (double-quoted strings and regexes).
118       As of v5.16, there is an implicit
119
120        use charnames qw(:full :short);
121
122       But prior to v5.16, you must be explicit about which set of charnames
123       you want.  The ":full" names are the official Unicode character name,
124       alias, or sequence, which all share a namespace.
125
126        use charnames qw(:full :short latin greek);
127
128        "\N{MATHEMATICAL ITALIC SMALL N}"      # :full
129        "\N{GREEK CAPITAL LETTER SIGMA}"       # :full
130
131       Anything else is a Perl-specific convenience abbreviation.  Specify one
132       or more scripts by names if you want short names that are script-
133       specific.
134
135        "\N{Greek:Sigma}"                      # :short
136        "\N{ae}"                               #  latin
137        "\N{epsilon}"                          #  greek
138
139       The v5.16 release also supports a ":loose" import for loose matching of
140       character names, which works just like loose matching of property
141       names: that is, it disregards case, whitespace, and underscores:
142
143        "\N{euro sign}"                        # :loose (from v5.16)
144
145       Starting in v5.32, you can also use
146
147        qr/\p{name=euro sign}/
148
149       to get official Unicode named characters in regular expressions.  Loose
150       matching is always done for these.
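
       A small usage sketch (assuming the standard preamble, so the "€"
       literal below is understood, plus v5.32):

        use v5.32;
        say "that's a euro"  if "€" =~ /\p{Name=EURO SIGN}/;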
151
152   ℞ 9: Unicode named sequences
153       These look just like character names but return multiple codepoints.
154       Notice the %vx vector-print functionality in "printf".
155
156        use charnames qw(:full);
157        my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
158        printf "U+%v04X\n", $seq;
159        U+0100.0300
160
161   ℞ 10: Custom named characters
162       Use ":alias" to give your own lexically scoped nicknames to existing
163       characters, or even to give unnamed private-use characters useful
164       names.
165
166        use charnames ":full", ":alias" => {
167            ecute => "LATIN SMALL LETTER E WITH ACUTE",
168            "APPLE LOGO" => 0xF8FF, # private use character
169        };
170
171        "\N{ecute}"
172        "\N{APPLE LOGO}"
173
174   ℞ 11: Names of CJK codepoints
175       Sinograms like “東京” come back with character names of "CJK UNIFIED
176       IDEOGRAPH-6771" and "CJK UNIFIED IDEOGRAPH-4EAC", because their “names”
177       vary.  The CPAN "Unicode::Unihan" module has a large database for
178       decoding these (and a whole lot more), provided you know how to
179       understand its output.
180
181        # cpan -i Unicode::Unihan
182        use Unicode::Unihan;
183        my $str = "東京";
184        my $unhan = Unicode::Unihan->new;
185        for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
186            printf "CJK $str in %-12s is ", $lang;
187            say $unhan->$lang($str);
188        }
189
190       prints:
191
192        CJK 東京 in Mandarin     is DONG1JING1
193        CJK 東京 in Cantonese    is dung1ging1
194        CJK 東京 in Korean       is TONGKYENG
195        CJK 東京 in JapaneseOn   is TOUKYOU KEI KIN
196        CJK 東京 in JapaneseKun  is HIGASHI AZUMAMIYAKO
197
198       If you have a specific romanization scheme in mind, use the specific
199       module:
200
201        # cpan -i Lingua::JA::Romanize::Japanese
202        use Lingua::JA::Romanize::Japanese;
203        my $k2r = Lingua::JA::Romanize::Japanese->new;
204        my $str = "東京";
205        say "Japanese for $str is ", $k2r->chars($str);
206
207       prints
208
209        Japanese for 東京 is toukyou
210
211   ℞ 12: Explicit encode/decode
212       On rare occasion, such as a database read, you may be given encoded
213       text you need to decode.
214
215         use Encode qw(encode decode);
216
217         my $chars = decode("shiftjis", $bytes, 1);
218        # OR
219         my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);
220
221       For streams all in the same encoding, don't use encode/decode; instead
222       set the file encoding when you open the file or immediately after with
223       "binmode" as described later below.
224
225   ℞ 13: Decode program arguments as utf8
226            $ perl -CA ...
227        or
228            $ export PERL_UNICODE=A
229        or
230           use Encode qw(decode);
231           @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
232
233   ℞ 14: Decode program arguments as locale encoding
234           # cpan -i Encode::Locale
235           use Encode qw(decode);
236           use Encode::Locale;
237
238           # use "locale" as an arg to encode/decode
239           @ARGV = map { decode(locale => $_, 1) } @ARGV;
240
241   ℞ 15: Declare STD{IN,OUT,ERR} to be utf8
242       Use a command-line option, an environment variable, or else call
243       "binmode" explicitly:
244
245            $ perl -CS ...
246        or
247            $ export PERL_UNICODE=S
248        or
249            use open qw(:std :encoding(UTF-8));
250        or
251            binmode(STDIN,  ":encoding(UTF-8)");
252            binmode(STDOUT, ":utf8");
253            binmode(STDERR, ":utf8");
254
255   ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding
256           # cpan -i Encode::Locale
257           use Encode;
258           use Encode::Locale;
259
260           # or as a stream for binmode or open
261           binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
262           binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
263           binmode STDERR, ":encoding(console_out)" if -t STDERR;
264
265   ℞ 17: Make file I/O default to utf8
266       Files opened without an encoding argument will be in UTF-8:
267
268            $ perl -CD ...
269        or
270            $ export PERL_UNICODE=D
271        or
272            use open qw(:encoding(UTF-8));
273
274   ℞ 18: Make all I/O and args default to utf8
275            $ perl -CSDA ...
276        or
277            $ export PERL_UNICODE=SDA
278        or
279            use open qw(:std :encoding(UTF-8));
280            use Encode qw(decode);
281            @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
282
283   ℞ 19: Open file with specific encoding
284       Specify stream encoding.  This is the normal way to deal with encoded
285       text, not by calling low-level functions.
286
287        # input file
288            open(my $in_file, "< :encoding(UTF-16)", "wintext");
289        OR
290            open(my $in_file, "<", "wintext");
291            binmode($in_file, ":encoding(UTF-16)");
292        THEN
293            my $line = <$in_file>;
294
295        # output file
296            open(my $out_file, "> :encoding(cp1252)", "wintext");
297        OR
298            open(my $out_file, ">", "wintext");
299            binmode($out_file, ":encoding(cp1252)");
300        THEN
301            print $out_file "some text\n";
302
303       More layers than just the encoding can be specified here. For example,
304       the incantation ":raw :encoding(UTF-16LE) :crlf" includes implicit CRLF
305       handling.
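
       For instance, a sketch of reading a CRLF-delimited, UTF-16LE-encoded
       file with that exact layer stack (reusing the "wintext" name from
       above):

        open(my $in_file, "< :raw :encoding(UTF-16LE) :crlf", "wintext")
            or die "can't open wintext: $!";
        my $line = <$in_file>;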
306
307   ℞ 20: Unicode casing
308       Unicode casing is very different from ASCII casing.
309
310        uc("henry ⅷ")  # "HENRY Ⅷ"
311        uc("tschüß")   # "TSCHÜSS"  notice ß => SS
312
313        # both are true:
314        "tschüß"  =~ /TSCHÜSS/i   # notice ß => SS
315        "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i   # notice Σ,σ,ς sameness
316
317   ℞ 21: Unicode case-insensitive comparisons
318       Also available in the CPAN Unicode::CaseFold module, the new "fc"
319       “foldcase” function from v5.16 grants access to the same Unicode
320       casefolding as the "/i" pattern modifier has always used:
321
322        use feature "fc"; # fc() function is from v5.16
323
324        # sort case-insensitively
325        my @sorted = sort { fc($a) cmp fc($b) } @list;
326
327        # both are true:
328        fc("tschüß")  eq fc("TSCHÜSS")
329        fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")
330
331   ℞ 22: Match Unicode linebreak sequence in regex
332       A Unicode linebreak matches the two-character CRLF grapheme or any of
333       seven vertical whitespace characters.  Good for dealing with textfiles
334       coming from different operating systems.
335
336        \R
337
338        s/\R/\n/g;  # normalize all linebreaks to \n
339
340   ℞ 23: Get character category
341       Find the general category of a numeric codepoint.
342
343        use Unicode::UCD qw(charinfo);
344        my $cat = charinfo(0x3A3)->{category};  # "Lu"
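
       If you only need to test whether a codepoint belongs to a category,
       rather than fetch the category's name, a property match is a
       lighter-weight sketch:

        my $is_upper = chr(0x3A3) =~ /\p{Gc=Lu}/;   # true: "Σ" is an uppercase letter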
345
346   ℞ 24: Disabling Unicode-awareness in builtin charclasses
347       Keep "\w", "\b", "\s", "\d", and the POSIX classes from working on
348       the full range of Unicode, either in this scope or in just one regex.
349
350        use v5.14;
351        use re "/a";
352
353        # OR
354
355        my($num) = $str =~ /(\d+)/a;
356
357       Or use specific un-Unicode properties, like "\p{ahex}" and
358       "\p{POSIX_Digit}".  Properties still work normally no matter what
359       charset modifiers ("/d /u /l /a /aa") are in effect.
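
       A small illustration of that last point: "\p{POSIX_Digit}" stays
       ASCII-only even under "/u", while "\p{Nd}" stays Unicode-aware even
       under "/a" (the Devanagari digits below assume the "use utf8"
       preamble):

        my $devanagari = "४५६७";
        my $ascii_only = $devanagari =~ /\p{POSIX_Digit}/u;  # false: ASCII-only property
        my $unicode    = $devanagari =~ /\p{Nd}/a;           # true:  properties ignore /a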
360
361   ℞ 25: Match Unicode properties in regex with \p, \P
362       These all match a single codepoint with the given property.  Use "\P"
363       in place of "\p" to match one codepoint lacking that property.
364
365        \pL, \pN, \pS, \pP, \pM, \pZ, \pC
366        \p{Sk}, \p{Ps}, \p{Lt}
367        \p{alpha}, \p{upper}, \p{lower}
368        \p{Latin}, \p{Greek}
369        \p{script_extensions=Latin}, \p{scx=Greek}
370        \p{East_Asian_Width=Wide}, \p{EA=W}
371        \p{Line_Break=Hyphen}, \p{LB=HY}
372        \p{Numeric_Value=4}, \p{NV=4}
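
       For example, one common use is a quick scrub that keeps only letters,
       marks, and separators (a sketch; $str is assumed to hold the text):

        (my $clean = $str) =~ s/[^\pL\pM\pZ]//g;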
373
374   ℞ 26: Custom character properties
375       Define at compile-time your own custom character properties for use in
376       regexes.
377
378        # using private-use characters
379        sub In_Tengwar { "E000\tE07F\n" }
380
381        if (/\p{In_Tengwar}/) { ... }
382
383        # blending existing properties
384        sub Is_GraecoRoman_Title {<<'END_OF_SET'}
385        +utf8::IsLatin
386        +utf8::IsGreek
387        &utf8::IsTitle
388        END_OF_SET
389
390        if (/\p{Is_GraecoRoman_Title}/) { ... }
391
392   ℞ 27: Unicode normalization
393       Typically render into NFD on input and NFC on output. Using the NFKC
394       or NFKD functions improves recall on searches, assuming you've already
395       done the same to the text to be searched. Note that this is about
396       much more than just precombined compatibility glyphs; it also reorders
397       marks according to their canonical combining classes and weeds out
398       singletons.
399
400        use Unicode::Normalize;
401        my $nfd  = NFD($orig);
402        my $nfc  = NFC($orig);
403        my $nfkd = NFKD($orig);
404        my $nfkc = NFKC($orig);
405
406   ℞ 28: Convert non-ASCII Unicode numerics
407       Unless you’ve used "/a" or "/aa", "\d" matches more than ASCII digits
408       only, but Perl’s implicit string-to-number conversion does not currently
409       recognize these.  Here’s how to convert such strings manually.
410
411        use v5.14;  # needed for num() function
412        use Unicode::UCD qw(num);
413        my $str = "got Ⅻ and ४५६७ and ⅞ and here";
414        my @nums = ();
415        while ($str =~ /(\d+|\N)/g) {  # not just ASCII!
416           push @nums, num($1);
417        }
418        say "@nums";   #     12      4567      0.875
419
420        use charnames qw(:full);
421        my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
422
423   ℞ 29: Match Unicode grapheme cluster in regex
424       Programmer-visible “characters” are codepoints matched by "/./s", but
425       user-visible “characters” are graphemes matched by "/\X/".
426
427        # Find vowel *plus* any combining diacritics, underlining, etc.
428        my $nfd = NFD($orig);
429        $nfd =~ / (?=[aeiou]) \X /xi
430
431   ℞ 30: Extract by grapheme instead of by codepoint (regex)
432        # match and grab five first graphemes
433        my($first_five) = $str =~ /^ ( \X{5} ) /x;
434
435   ℞ 31: Extract by grapheme instead of by codepoint (substr)
436        # cpan -i Unicode::GCString
437        use Unicode::GCString;
438        my $gcs = Unicode::GCString->new($str);
439        my $first_five = $gcs->substr(0, 5);
440
441   ℞ 32: Reverse string by grapheme
442       Reversing by codepoint messes up diacritics, mistakenly converting
443       "crème brûlée" into "éel̂urb em̀erc" instead of into "eélûrb emèrc"; so
444       reverse by grapheme instead.  Both these approaches work right no
445       matter what normalization the string is in:
446
447        $str = join("", reverse $str =~ /\X/g);
448
449        # OR: cpan -i Unicode::GCString
450        use Unicode::GCString;
451        $str = reverse Unicode::GCString->new($str);
452
453   ℞ 33: String length in graphemes
454       The string "brûlée" has six graphemes but up to eight codepoints.  This
455       counts by grapheme, not by codepoint:
456
457        my $str = "brûlée";
458        my $count = 0;
459        while ($str =~ /\X/g) { $count++ }
460
461         # OR: cpan -i Unicode::GCString
462        use Unicode::GCString;
463        my $gcs = Unicode::GCString->new($str);
464        my $count = $gcs->length;
465
466   ℞ 34: Unicode column-width for printing
467       Perl’s "printf", "sprintf", and "format" think all codepoints take up 1
468       print column, but many take 0 or 2.  Here to show that normalization
469       makes no difference, we print out both forms:
470
471        use Unicode::GCString;
472        use Unicode::Normalize;
473
474        my @words = qw/crème brûlée/;
475        @words = map { NFC($_), NFD($_) } @words;
476
477        for my $str (@words) {
478            my $gcs = Unicode::GCString->new($str);
479            my $cols = $gcs->columns;
480            my $pad = " " x (10 - $cols);
481            say $str, $pad, " |";
482        }
483
484       generates this to show that it pads correctly no matter the
485       normalization:
486
487        crème      |
488        crème      |
489        brûlée     |
490        brûlée     |
491
492   ℞ 35: Unicode collation
493       Text sorted by numeric codepoint follows no reasonable alphabetic
494       order; use the UCA for sorting text.
495
496        use Unicode::Collate;
497        my $col = Unicode::Collate->new();
498        my @list = $col->sort(@old_list);
499
500       See the ucsort program from the Unicode::Tussle CPAN module for a
501       convenient command-line interface to this module.
502
503   ℞ 36: Case- and accent-insensitive Unicode sort
504       Specify a collation strength of level 1 to ignore case and diacritics,
505       only looking at the basic character.
506
507        use Unicode::Collate;
508        my $col = Unicode::Collate->new(level => 1);
509        my @list = $col->sort(@old_list);
510
511   ℞ 37: Unicode locale collation
512       Some locales have special sorting rules.
513
514        # either use v5.12, OR: cpan -i Unicode::Collate::Locale
515        use Unicode::Collate::Locale;
516        my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
517        my @list = $col->sort(@old_list);
518
519       The ucsort program mentioned above accepts a "--locale" parameter.
520
521   ℞ 38: Making "cmp" work on text instead of codepoints
522       Instead of this:
523
524        @srecs = sort {
525            $b->{AGE}   <=>  $a->{AGE}
526                        ||
527            $a->{NAME}  cmp  $b->{NAME}
528        } @recs;
529
530       Use this instead, which computes each sort key just once:
531
532        my $coll = Unicode::Collate->new();
533        for my $rec (@recs) {
534            $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
535        }
536        @srecs = sort {
537            $b->{AGE}       <=>  $a->{AGE}
538                            ||
539            $a->{NAME_key}  cmp  $b->{NAME_key}
540        } @recs;
541
542   ℞ 39: Case- and accent-insensitive comparisons
543       Use a collator object to compare Unicode text by character instead of
544       by codepoint.
545
546        use Unicode::Collate;
547        my $es = Unicode::Collate->new(
548            level => 1,
549            normalization => undef
550        );
551
552         # now both are true:
553        $es->eq("García",  "GARCIA" );
554        $es->eq("Márquez", "MARQUEZ");
555
556   ℞ 40: Case- and accent-insensitive locale comparisons
557       Same, but in a specific locale.
558
559        my $de = Unicode::Collate::Locale->new(
560                   locale => "de__phonebook",
561                 );
562
563        # now this is true:
564        $de->eq("tschüß", "TSCHUESS");  # notice ü => UE, ß => SS
565
566   ℞ 41: Unicode linebreaking
567       Break up text into lines according to Unicode rules.
568
569        # cpan -i Unicode::LineBreak
570        use Unicode::LineBreak;
571        use charnames qw(:full);
572
573        my $para = "This is a super\N{HYPHEN}long string. " x 20;
574        my $fmt = Unicode::LineBreak->new;
575        print $fmt->break($para), "\n";
576
577   ℞ 42: Unicode text in DBM hashes, the tedious way
578       Using a regular Perl string as a key or value for a DBM hash will
579       trigger a wide character exception if any codepoints won’t fit into a
580       byte.  Here’s how to manually manage the translation:
581
582           use DB_File;
583           use Encode qw(encode decode);
584           tie %dbhash, "DB_File", "pathname";
585
586        # STORE
587
588           # assume $uni_key and $uni_value are abstract Unicode strings
589           my $enc_key   = encode("UTF-8", $uni_key, 1);
590           my $enc_value = encode("UTF-8", $uni_value, 1);
591           $dbhash{$enc_key} = $enc_value;
592
593        # FETCH
594
595           # assume $uni_key holds a normal Perl string (abstract Unicode)
596           my $enc_key   = encode("UTF-8", $uni_key, 1);
597           my $enc_value = $dbhash{$enc_key};
598           my $uni_value = decode("UTF-8", $enc_value, 1);
599
600   ℞ 43: Unicode text in DBM hashes, the easy way
601       Here’s how to implicitly manage the translation; all encoding and
602       decoding is done automatically, just as with streams that have a
603       particular encoding attached to them:
604
605           use DB_File;
606           use DBM_Filter;
607
608           my $dbobj = tie %dbhash, "DB_File", "pathname";
609           $dbobj->Filter_Value_Push("utf8");  # this is the magic bit
610
611        # STORE
612
613           # assume $uni_key and $uni_value are abstract Unicode strings
614           $dbhash{$uni_key} = $uni_value;
615
616         # FETCH
617
618           # $uni_key holds a normal Perl string (abstract Unicode)
619           my $uni_value = $dbhash{$uni_key};
620
621   ℞ 44: PROGRAM: Demo of Unicode collation and printing
622       Here’s a full program showing how to make use of locale-sensitive
623       sorting, Unicode casing, and managing print widths when some of the
624       characters take up zero or two columns, not just one column each time.
625       When run, the following program produces this nicely aligned output:
626
627           Crème Brûlée....... €2.00
628           Éclair............. €1.60
629           Fideuà............. €4.20
630           Hamburger.......... €6.00
631           Jamón Serrano...... €4.45
632           Linguiça........... €7.00
633           Pâté............... €4.15
634           Pears.............. €2.00
635           Pêches............. €2.25
636           Smørbrød........... €5.75
637           Spätzle............ €5.50
638           Xoriço............. €3.00
639           Γύρος.............. €6.50
640           막걸리............. €4.00
641           おもち............. €2.65
642           お好み焼き......... €8.00
643           シュークリーム..... €1.85
644           寿司............... €9.99
645           包子............... €7.50
646
647       Here's that program.
648
649        #!/usr/bin/env perl
650        # umenu - demo sorting and printing of Unicode food
651        #
652        # (obligatory and increasingly long preamble)
653        #
654        use v5.36;
655        use utf8;
656        use warnings  qw(FATAL utf8);    # fatalize encoding faults
657        use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
658        use charnames qw(:full :short);  # unneeded in v5.16
659
660        # std modules
661        use Unicode::Normalize;          # std perl distro as of v5.8
662        use List::Util qw(max);          # std perl distro as of v5.10
663        use Unicode::Collate::Locale;    # std perl distro as of v5.14
664
665        # cpan modules
666        use Unicode::GCString;           # from CPAN
667
668        my %price = (
669            "γύρος"             => 6.50, # gyros
670            "pears"             => 2.00, # like um, pears
671            "linguiça"          => 7.00, # spicy sausage, Portuguese
672            "xoriço"            => 3.00, # chorizo sausage, Catalan
673            "hamburger"         => 6.00, # burgermeister meisterburger
674            "éclair"            => 1.60, # dessert, French
675            "smørbrød"          => 5.75, # sandwiches, Norwegian
676            "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
677            "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
678            "jamón serrano"     => 4.45, # country ham, Spanish
679            "pêches"            => 2.25, # peaches, French
680            "シュークリーム"    => 1.85, # cream-filled pastry like eclair
681            "막걸리"            => 4.00, # makgeolli, Korean rice wine
682            "寿司"              => 9.99, # sushi, Japanese
683            "おもち"            => 2.65, # omochi, rice cakes, Japanese
684            "crème brûlée"      => 2.00, # crema catalana
685            "fideuà"            => 4.20, # more noodles, Valencian
686                                         # (Catalan=fideuada)
687            "pâté"              => 4.15, # gooseliver paste, French
688            "お好み焼き"        => 8.00, # okonomiyaki, Japanese
689        );
690
691        my $width = 5 + max map { colwidth($_) } keys %price;
692
693        # So the Asian stuff comes out in an order that someone
694        # who reads those scripts won't freak out over; the
695        # CJK stuff will be in JIS X 0208 order that way.
696        my $coll  = Unicode::Collate::Locale->new(locale => "ja");
697
698        for my $item ($coll->sort(keys %price)) {
699            print pad(entitle($item), $width, ".");
700            printf " €%.2f\n", $price{$item};
701        }
702
703        sub pad ($str, $width, $padchar) {
704            return $str . ($padchar x ($width - colwidth($str)));
705        }
706
707        sub colwidth ($str) {
708            return Unicode::GCString->new($str)->columns;
709        }
710
711        sub entitle ($str) {
712            $str =~ s{ (?=\pL)(\S)     (\S*) }
713                     { ucfirst($1) . lc($2)  }xge;
714            return $str;
715        }
716

SEE ALSO

718       See these manpages, some of which are CPAN modules: perlunicode,
719       perluniprops, perlre, perlrecharclass, perluniintro, perlunitut,
720       perlunifaq, PerlIO, DB_File, DBM_Filter, DBM_Filter::utf8, Encode,
721       Encode::Locale, Unicode::UCD, Unicode::Normalize, Unicode::GCString,
722       Unicode::LineBreak, Unicode::Collate, Unicode::Collate::Locale,
723       Unicode::Unihan, Unicode::CaseFold, Unicode::Tussle,
724       Lingua::JA::Romanize::Japanese, Lingua::ZH::Romanize::Pinyin,
725       Lingua::KO::Romanize::Hangul.
726
727       The Unicode::Tussle CPAN module includes many programs to help with
728       working with Unicode, including these programs to fully or partly
729       replace standard utilities: tcgrep instead of egrep, uniquote instead
730       of cat -v or hexdump, uniwc instead of wc, unilook instead of look,
731       unifmt instead of fmt, and ucsort instead of sort.  For exploring
732       Unicode character names and character properties, see its uniprops,
733       unichars, and uninames programs.  It also supplies these programs, all
734       of which are general filters that do Unicode-y things: unititle and
735       unicaps; uniwide and uninarrow; unisupers and unisubs; nfd, nfc, nfkd,
736       and nfkc; and uc, lc, and tc.
737
738       Finally, see the published Unicode Standard (page numbers are from
739       version 6.0.0), including these specific annexes and technical reports:
740
741       §3.13 Default Case Algorithms, page 113; §4.2  Case, pages 120–122;
742       Case Mappings, pages 166–172, especially Caseless Matching starting on
743       page 170.
744       UAX #44: Unicode Character Database
745       UTS #18: Unicode Regular Expressions
746       UAX #15: Unicode Normalization Forms
747       UTS #10: Unicode Collation Algorithm
748       UAX #29: Unicode Text Segmentation
749       UAX #14: Unicode Line Breaking Algorithm
750       UAX #11: East Asian Width
751

AUTHOR

753       Tom Christiansen <tchrist@perl.com> wrote this, with occasional
754       kibbitzing from Larry Wall and Jeffrey Friedl in the background.
755

COPYRIGHT AND LICENCE

757       Copyright © 2012 Tom Christiansen.
758
759       This program is free software; you may redistribute it and/or modify it
760       under the same terms as Perl itself.
761
762       Most of these examples taken from the current edition of the “Camel
763       Book”; that is, from the 4ᵗʰ Edition of Programming Perl, Copyright ©
764       2012 Tom Christiansen <et al.>, 2012-02-13 by O’Reilly Media.  The code
765       itself is freely redistributable, and you are encouraged to transplant,
766       fold, spindle, and mutilate any of the examples in this manpage however
767       you please for inclusion into your own programs without any encumbrance
768       whatsoever.  Acknowledgement via code comment is polite but not
769       required.
770

REVISION HISTORY

772       v1.0.0 – first public release, 2012-02-27
773
774
775
776perl v5.38.2                      2023-11-30                    PERLUNICOOK(1)