Unicode::Collate::Locale(3pm)  Perl Programmers Reference Guide  Unicode::Collate::Locale(3pm)

NAME

       Unicode::Collate::Locale - Linguistic tailoring for DUCET via
       Unicode::Collate

SYNOPSIS

         use Unicode::Collate::Locale;

         #construct
         $Collator = Unicode::Collate::Locale->
             new(locale => $locale_name, %tailoring);

         #sort
         @sorted = $Collator->sort(@not_sorted);

         #compare
         $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

       Note: Strings in @not_sorted, $a and $b are interpreted according to
       Perl's Unicode support. See perlunicode, perluniintro, perlunitut,
       perlunifaq, utf8.  If your strings are not yet decoded, either decode
       them beforehand or use the "preprocess" parameter (cf.
       "Unicode::Collate").

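       As a minimal sketch of the note above, the following decodes UTF-8
       octets before sorting and, alternatively, lets "preprocess" do the
       decoding.  The sample words and the choice of the 'sv' locale are
       assumptions made for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Collate::Locale;

# Hypothetical input: UTF-8 encoded octets, e.g. as read from a file.
my @octets = ("\xC3\xA4pple", "zebra", "apple");

# Option 1: decode into character strings first, then sort.
my @words    = map { decode('UTF-8', $_) } @octets;
my $collator = Unicode::Collate::Locale->new(locale => 'sv');
my @sorted   = $collator->sort(@words);

# Option 2: keep the octets as they are and decode inside "preprocess".
my $octet_collator = Unicode::Collate::Locale->new(
    locale     => 'sv',
    preprocess => sub { decode('UTF-8', $_[0]) },
);
my @sorted_octets = $octet_collator->sort(@octets);
```

       With the Swedish tailoring, the word starting with U+00E4 is expected
       to sort after "zebra" in both variants; the second variant returns
       the original octet strings.
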

DESCRIPTION

       This module provides linguistic tailoring of the DUCET, building on
       "Unicode::Collate".

   Constructor
       The "new" method returns a collator object.

       The parameter list for the constructor is a hash, which can include
       the special key "locale".  Its value (case-insensitive) is a Unicode
       base language code (two or three letters).  For example,
       "Unicode::Collate::Locale->new(locale => 'FR')" returns a collator
       tailored for French.

       $locale_name may be suffixed with a Unicode script code (four
       letters), a Unicode region code, or a Unicode language variant code.
       These codes are case-insensitive and separated by '_' or '-'.  E.g.
       "en_US" for English in USA, "az_Cyrl" for Azerbaijani in the Cyrillic
       script, "es_ES_traditional" for Spanish in Spain (Traditional).

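       The following sketch passes a few differently spelled locale names to
       "new" and prints what "getlocale" reports for each.  The particular
       names are chosen for illustration; the exact strings returned may
       depend on the installed version of the module.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Different spellings of locale names; case and the choice of
# '_' or '-' as separator should not matter.
my %accepted;
for my $name ('FR', 'fr', 'sr_Latn', 'sr-latn', 'es_ES_traditional') {
    $accepted{$name} =
        Unicode::Collate::Locale->new(locale => $name)->getlocale;
    printf "%-20s => %s\n", $name, $accepted{$name};
}
```
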
       If $locale_name is not available, a fallback is selected in the
       following order:

           1. language with a variant code
           2. language with a script code
           3. language with a region code
           4. language
           5. default

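       The fallback order above can be observed through "getlocale".  This is
       a sketch only: "xx_YY" is a made-up name used to reach the last step,
       and the Spanish names assume that only "es" and "es__traditional" are
       provided for that language.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Requested names that are unlikely to exist as such; "xx_YY" is a
# made-up code used only to reach the final "default" step.
my %resolved;
for my $name ('es_MX_traditional', 'es_MX', 'xx_YY') {
    $resolved{$name} =
        Unicode::Collate::Locale->new(locale => $name)->getlocale;
    print "$name falls back to $resolved{$name}\n";
}
```
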
       Tailoring tags provided by "Unicode::Collate" are allowed as long as
       they are not used for "locale" support.  In particular, the "table"
       tag is never tailorable, since it is reserved for DUCET.

       E.g. the following constructs a collator for French that ignores
       diacritics and case differences (i.e. level 1), with reversed case
       ordering and no normalization.

           Unicode::Collate::Locale->new(
               level => 1,
               locale => 'fr',
               upper_before_lower => 1,
               normalization => undef
           )

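       To illustrate the effect of such a combination, this sketch compares
       two spellings with the collator shown above and with a plain French
       collator.  The sample words and variable names are assumptions for
       the example; "eq" is inherited from "Unicode::Collate".

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $loose = Unicode::Collate::Locale->new(
    level              => 1,
    locale             => 'fr',
    upper_before_lower => 1,
    normalization      => undef,
);
my $strict = Unicode::Collate::Locale->new(locale => 'fr');

# "C\x{F4}t\x{E9}" is "Cote" with a circumflex on o and an acute on e,
# written with precomposed characters.
my $accented = "C\x{F4}t\x{E9}";

# At level 1 only the base letters are compared.
my $loose_equal  = $loose->eq('cote', $accented)  ? 1 : 0;
# At the default level, accents and case still make a difference.
my $strict_equal = $strict->eq('cote', $accented) ? 1 : 0;

print "level 1: $loose_equal, default level: $strict_equal\n";
```
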
       Passing to "new()" a tailoring that overrides a behavior already
       tailored by "locale" is disallowed.

           Unicode::Collate::Locale->new(
               locale => 'da',
               upper_before_lower => 0, # causes error as reserved by 'da'
           )

       However, "change()", inherited from "Unicode::Collate", does allow
       changing a tailoring that is reserved by "locale". Examples:

           new(locale => 'ca')->change(backwards => undef)
           new(locale => 'da')->change(upper_before_lower => 0)
           new(locale => 'ja')->change(overrideCJK => undef)

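       A minimal sketch of both cases for the Danish locale: the constructor
       call is wrapped in "eval" because it is expected to die, and the same
       setting is then altered through "change()".  The variable names are
       illustrative, and the text of the error message is not relied upon.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Passing a reserved tag to new() is expected to die, so trap it.
my $ok = eval {
    Unicode::Collate::Locale->new(locale => 'da', upper_before_lower => 0);
    1;
};
my $error = $ok ? '' : $@;

# The same setting can be altered afterwards through change().
my $da     = Unicode::Collate::Locale->new(locale => 'da');
my $before = $da->cmp('A', 'a');    # ordering as tailored by 'da'
$da->change(upper_before_lower => 0);
my $after  = $da->cmp('A', 'a');    # ordering after the change

print "new() failed: $error" if $error;
print "cmp('A', 'a') before change: $before, after: $after\n";
```
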
   Methods
       "Unicode::Collate::Locale" is a subclass of "Unicode::Collate" and
       methods other than "new" are inherited from "Unicode::Collate".

       Here is a list of additional methods:

       "$Collator->getlocale"
           Returns the language code that was accepted and is actually used
           for collation.  If linguistic tailoring is not provided for the
           language code you passed (intentionally for some languages, or
           due to the incomplete implementation), this method returns the
           string 'default', meaning no special tailoring.

       "$Collator->locale_version"
           (Since Unicode::Collate::Locale 0.87) Returns the version number
           (perhaps "/\d\.\d\d/") of the locale, as that of Locale/*.pl.

           Note: Locale/*.pl that a collator uses should be identified by a
           combination of return values from "getlocale" and "locale_version".

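       As a sketch of how the two methods might be used together, the
       following reports which tailoring a collator actually uses.  The
       requested name 'da_DK' is only an example, and the "can" guard is an
       assumption to cope with module versions older than 0.87.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $collator = Unicode::Collate::Locale->new(locale => 'da_DK');

my $locale = $collator->getlocale;
# locale_version is documented as available since 0.87, so guard the call.
my $version = $collator->can('locale_version')
    ? $collator->locale_version
    : undef;

printf "locale: %s, version: %s\n",
    $locale, defined $version ? $version : 'unknown';
```
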
   A list of tailorable locales
             locale name       description
           --------------------------------------------------------------
             af                Afrikaans
             ar                Arabic
             as                Assamese
             az                Azerbaijani (Azeri)
             be                Belarusian
             bg                Bulgarian
             bn                Bengali
             bs                Bosnian
             ca                Catalan
             cs                Czech
             cy                Welsh
             da                Danish
             de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
             eo                Esperanto
             es                Spanish
             es__traditional   Spanish ('ch' and 'll' as a grapheme)
             et                Estonian
             fa                Persian
             fi                Finnish (v and w are primary equal)
             fi__phonebook     Finnish (v and w as separate characters)
             fil               Filipino
             fo                Faroese
             fr                French
             gu                Gujarati
             ha                Hausa
             haw               Hawaiian
             hi                Hindi
             hr                Croatian
             hu                Hungarian
             hy                Armenian
             ig                Igbo
             is                Icelandic
             ja                Japanese [1]
             kk                Kazakh
             kl                Kalaallisut
             kn                Kannada
             ko                Korean [2]
             kok               Konkani
             ln                Lingala
             lt                Lithuanian
             lv                Latvian
             mk                Macedonian
             ml                Malayalam
             mr                Marathi
             mt                Maltese
             nb                Norwegian Bokmal
             nn                Norwegian Nynorsk
             nso               Northern Sotho
             om                Oromo
             or                Oriya
             pa                Punjabi
             pl                Polish
             ro                Romanian
             ru                Russian
             sa                Sanskrit
             se                Northern Sami
             si                Sinhala
             si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
             sk                Slovak
             sl                Slovenian
             sq                Albanian
             sr                Serbian
             sr_Latn           Serbian in Latin (tailored as Croatian)
             sv                Swedish (v and w are primary equal)
             sv__reformed      Swedish (v and w as separate characters)
             ta                Tamil
             te                Telugu
             th                Thai
             tn                Tswana
             to                Tonga
             tr                Turkish
             uk                Ukrainian
             ur                Urdu
             vi                Vietnamese
             wae               Walser
             wo                Wolof
             yo                Yoruba
             zh                Chinese
             zh__big5han       Chinese (ideographs: big5 order)
             zh__gb2312han     Chinese (ideographs: GB-2312 order)
             zh__pinyin        Chinese (ideographs: pinyin order) [3]
             zh__stroke        Chinese (ideographs: stroke order) [3]
           --------------------------------------------------------------

       Locales that follow the default UCA rules include chr (Cherokee), de
       (German), en (English), ga (Irish), id (Indonesian), it (Italian), ka
       (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern
       Sotho), sw (Swahili), xh (Xhosa), zu (Zulu).

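       To show the difference between a locale that follows the default
       rules and a tailored one, this sketch sorts the same names with 'de'
       and with 'de__phonebook'.  The sample names are invented for the
       example.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Sample names; "\x{FC}" is LATIN SMALL LETTER U WITH DIAERESIS.
my @names = ("M\x{FC}ller", 'Muffler', 'Mueller');

my @plain     = Unicode::Collate::Locale->new(locale => 'de')->sort(@names);
my @phonebook = Unicode::Collate::Locale->new(locale => 'de__phonebook')
                                        ->sort(@names);

binmode STDOUT, ':encoding(UTF-8)';
print "de:            @plain\n";
print "de__phonebook: @phonebook\n";
```

       Since the phonebook tailoring treats the umlaut as 'ue', it is
       expected to place the name with U+00FC directly after "Mueller",
       whereas the default rules place it after "Muffler".
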
       Note

       [1] ja: Ideographs are sorted in JIS X 0208 order.  Fullwidth and
       halfwidth forms are identical to their normal form.  The difference
       between hiragana and katakana is at the 4th level; comparing them
       also requires "(variable => 'Non-ignorable')", and in that case
       "katakana_before_hiragana" has no effect.

       [2] ko: Plenty of ideographs are sorted by their reading. Such an
       ideograph is primary (level 1) equal to, and secondary (level 2)
       greater than, the corresponding hangul syllable.

       [3] zh__pinyin and zh__stroke: implemented as alt='short', where a
       smaller number of ideographs are tailored.

INSTALL

       Installation of "Unicode::Collate::Locale" requires Collate/Locale.pm,
       Collate/Locale/*.pm, Collate/CJK/*.pm and Collate/allkeys.txt.
       Building "Unicode::Collate::Locale" does not require any of
       data/*.txt, gendata/*, or mklocale.  Tests for
       "Unicode::Collate::Locale" are named t/loc_*.t.

CAVEAT

       tailoring is not maximum
           Even if a certain letter is tailored, its equivalents are not
           always tailored along with it. For example, even though W is
           tailored, fullwidth W ("U+FF37"), W with acute ("U+1E82"), etc. are
           not tailored. The result may depend on whether source strings are
           normalized or not, and whether they are decomposed or composed.
           Thus "(normalization => undef)" is less preferred.

AUTHOR

       The Unicode::Collate::Locale module for perl was written by SADAHIRO
       Tomoyuki, <SADAHIRO@cpan.org>.  This module is Copyright(C) 2004-2012,
       SADAHIRO Tomoyuki. Japan.  All rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

SEE ALSO

       Unicode Collation Algorithm - UTS #10
           <http://www.unicode.org/reports/tr10/>

       The Default Unicode Collation Element Table (DUCET)
           <http://www.unicode.org/Public/UCA/latest/allkeys.txt>

       Unicode Locale Data Markup Language (LDML) - UTS #35
           <http://www.unicode.org/reports/tr35/>

       CLDR - Unicode Common Locale Data Repository
           <http://cldr.unicode.org/>

       Unicode::Collate
       Unicode::Normalize

perl v5.16.3                      2013-03-04     Unicode::Collate::Locale(3pm)