Unicode::Collate::Locale(3pm)   Perl Programmers Reference Guide   Unicode::Collate::Locale(3pm)

NAME
Unicode::Collate::Locale - Linguistic tailoring for DUCET via
Unicode::Collate

SYNOPSIS
    use Unicode::Collate::Locale;

    # construct
    $Collator = Unicode::Collate::Locale->
        new(locale => $locale_name, %tailoring);

    # sort
    @sorted = $Collator->sort(@not_sorted);

    # compare
    $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

Note: Strings in @not_sorted, $a and $b are interpreted according to
Perl's Unicode support. See perlunicode, perluniintro, perlunitut,
perlunifaq, utf8. If your strings are still encoded bytes, either decode
them first or use the "preprocess" parameter (cf. "Unicode::Collate").
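
The decoding step mentioned in the note can be sketched as follows. This is a minimal illustration, not part of the module: the byte strings stand in for input read from a UTF-8 source (an assumption), and the core Encode module converts them to character strings before they reach the collator.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Collate::Locale;

# Hypothetical input: UTF-8 encoded bytes, as read from a file or socket.
# "\xC3\xA9" is the UTF-8 encoding of U+00E9 (e with acute).
my @bytes = ("\xC3\xA9clair", "zebra", "eclair");

# Decode to Perl character strings before handing them to the collator.
my @chars = map { decode('UTF-8', $_) } @bytes;

my $Collator = Unicode::Collate::Locale->new(locale => 'fr');
my @sorted   = $Collator->sort(@chars);

# Encode again for output.
print encode('UTF-8', $_), "\n" for @sorted;
```

Sorting the undecoded bytes would make the collator see two separate Latin-1 characters in place of the accented letter.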

DESCRIPTION
This module provides linguistic tailoring for DUCET, building on
"Unicode::Collate".

Constructor
The "new" method returns a collator object.

The constructor takes a hash as its parameter list. The hash can include
a special key "locale", whose value (case-insensitive) stands for a
Unicode base language code (two or three letters). For example,
"Unicode::Collate::Locale->new(locale => 'FR')" returns a collator
tailored for French.

$locale_name may be suffixed with a Unicode script code (four letters),
a Unicode region code, or a Unicode language variant code. These codes
are case-insensitive and separated with '_' or '-'. E.g. "en_US" for
English in USA, "az_Cyrl" for Azerbaijani in the Cyrillic script,
"es_ES_traditional" for Spanish in Spain (Traditional).

If $locale_name is not available, a fallback is selected in the
following order:

    1. language with a variant code
    2. language with a script code
    3. language with a region code
    4. language
    5. default
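
To see which locale a given name resolves to, the requested name can be compared with the value reported by "getlocale". The sketch below uses only names discussed in this page; the %accepted hash is an illustrative helper, and the exact results may vary with the installed version of the module.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Requested names mix case and separators on purpose.
my %accepted;
for my $name ('FR', 'es-ES-traditional', 'en_US') {
    my $Collator = Unicode::Collate::Locale->new(locale => $name);
    $accepted{$name} = $Collator->getlocale;
    printf "%-18s => %s\n", $name, $accepted{$name};
}
```

A name whose language has no tailoring, such as "en_US", should fall through to the last step of the list and report 'default'.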

Tailoring tags provided by "Unicode::Collate" are allowed as long as
they are not used for "locale" support. In particular, the "table" tag
can never be tailored, since it is reserved for DUCET.

E.g. a collator for French that ignores diacritics and case differences
(i.e. level 1), with reversed case ordering and no normalization:

    my $Collator = Unicode::Collate::Locale->new(
        level => 1,
        locale => 'fr',
        upper_before_lower => 1,
        normalization => undef,
    );
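
A short usage sketch for such a level 1 collator, using the "eq" method inherited from "Unicode::Collate". The sample strings are illustrative, and "upper_before_lower" is left out here because case is not compared at level 1.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $Collator = Unicode::Collate::Locale->new(
    level         => 1,
    locale        => 'fr',
    normalization => undef,
);

# "C\x{F4}t\x{E9}" is "Cote" with circumflex and acute, written with
# escapes so that the example does not depend on the source encoding.
my $same = $Collator->eq("cote", "C\x{F4}t\x{E9}");
my $diff = $Collator->eq("cote", "coti");

print "cote vs. accented Cote: ", ($same ? "equal" : "different"), "\n";
print "cote vs. coti:          ", ($diff ? "equal" : "different"), "\n";
```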

Passing to "new()" a tailoring that overrides a behavior already
tailored by "locale" is disallowed.

    Unicode::Collate::Locale->new(
        locale => 'da',
        upper_before_lower => 0, # causes error as reserved by 'da'
    )

However, "change()" inherited from "Unicode::Collate" can override a
tailoring that is reserved by "locale". Examples:

    new(locale => 'ca')->change(backwards => undef)
    new(locale => 'da')->change(upper_before_lower => 0)
    new(locale => 'ja')->change(overrideCJK => undef)
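
The two behaviors can be combined in one sketch: trap the error from "new()" with eval, then build the collator without the reserved key and adjust it with "change()". The variable names are illustrative, and the wording of the error message is not specified here.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Passing a tailoring reserved by 'da' to new() is expected to die.
my $Collator = eval {
    Unicode::Collate::Locale->new(locale => 'da', upper_before_lower => 0);
};
my $error = $@;
print "new() failed: $error" unless $Collator;

# Build the collator without the reserved key, then adjust it afterwards.
$Collator = Unicode::Collate::Locale->new(locale => 'da');
$Collator->change(upper_before_lower => 0);

# With upper_before_lower switched off, lowercase sorts first again.
my $order = $Collator->cmp("a", "A");
print "cmp('a', 'A') = $order\n";
```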

Methods
"Unicode::Collate::Locale" is a subclass of "Unicode::Collate" and
methods other than "new" are inherited from "Unicode::Collate".

Here is a list of additional methods:

"$Collator->getlocale"
    Returns the language code actually accepted and used for collation.
    If linguistic tailoring is not provided for the language code you
    passed (intentionally for some languages, or due to the incomplete
    implementation), this method returns the string 'default', meaning
    no special tailoring.

"$Collator->locale_version"
    (Since Unicode::Collate::Locale 0.87) Returns the version number
    (perhaps "/\d\.\d\d/") of the locale, as that of Locale/*.pl.

    Note: Locale/*.pl that a collator uses should be identified by a
    combination of return values from "getlocale" and "locale_version".
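
As a small sketch of that combination, the two return values can be recorded together, for instance when logging which tailoring produced a given sort order. The locale name 'sv_SE' is only an example, and the version number printed depends on the installed module.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $Collator = Unicode::Collate::Locale->new(locale => 'sv_SE');

# Together these two values identify the tailoring data in use.
my $locale  = $Collator->getlocale;
my $version = $Collator->locale_version;

print "locale=$locale version=$version\n";
```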

A list of tailorable locales
    locale name       description
    --------------------------------------------------------------
    af                Afrikaans
    ar                Arabic
    as                Assamese
    az                Azerbaijani (Azeri)
    be                Belarusian
    bg                Bulgarian
    bn                Bengali
    bs                Bosnian
    ca                Catalan
    cs                Czech
    cy                Welsh
    da                Danish
    de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
    eo                Esperanto
    es                Spanish
    es__traditional   Spanish ('ch' and 'll' as a grapheme)
    et                Estonian
    fa                Persian
    fi                Finnish (v and w are primary equal)
    fi__phonebook     Finnish (v and w as separate characters)
    fil               Filipino
    fo                Faroese
    fr                French
    gu                Gujarati
    ha                Hausa
    haw               Hawaiian
    hi                Hindi
    hr                Croatian
    hu                Hungarian
    hy                Armenian
    ig                Igbo
    is                Icelandic
    ja                Japanese [1]
    kk                Kazakh
    kl                Kalaallisut
    kn                Kannada
    ko                Korean [2]
    kok               Konkani
    ln                Lingala
    lt                Lithuanian
    lv                Latvian
    mk                Macedonian
    ml                Malayalam
    mr                Marathi
    mt                Maltese
    nb                Norwegian Bokmal
    nn                Norwegian Nynorsk
    nso               Northern Sotho
    om                Oromo
    or                Oriya
    pa                Punjabi
    pl                Polish
    ro                Romanian
    ru                Russian
    sa                Sanskrit
    se                Northern Sami
    si                Sinhala
    si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
    sk                Slovak
    sl                Slovenian
    sq                Albanian
    sr                Serbian
    sr_Latn           Serbian in Latin (tailored as Croatian)
    sv                Swedish (v and w are primary equal)
    sv__reformed      Swedish (v and w as separate characters)
    ta                Tamil
    te                Telugu
    th                Thai
    tn                Tswana
    to                Tonga
    tr                Turkish
    uk                Ukrainian
    ur                Urdu
    vi                Vietnamese
    wae               Walser
    wo                Wolof
    yo                Yoruba
    zh                Chinese
    zh__big5han       Chinese (ideographs: big5 order)
    zh__gb2312han     Chinese (ideographs: GB-2312 order)
    zh__pinyin        Chinese (ideographs: pinyin order) [3]
    zh__stroke        Chinese (ideographs: stroke order) [3]
    --------------------------------------------------------------

Locales that follow the default UCA rules (no tailoring) include chr
(Cherokee), de (German), en (English), ga (Irish), id (Indonesian), it
(Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st
(Southern Sotho), sw (Swahili), xh (Xhosa), zu (Zulu).
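
The difference between a locale that follows the default rules and a tailored variant can be sketched with 'de' and 'de__phonebook' from the lists above. The word list is an illustrative assumption; with the phonebook tailoring, A with umlaut is expected to sort like 'Ae'.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# "\x{C4}rger" is "Aerger" spelled with A-umlaut, written with an escape.
my @words = ("Affe", "\x{C4}rger", "Adler");

my $default   = Unicode::Collate::Locale->new(locale => 'de');
my $phonebook = Unicode::Collate::Locale->new(locale => 'de__phonebook');

my @by_default   = $default->sort(@words);
my @by_phonebook = $phonebook->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "de:            @by_default\n";
print "de__phonebook: @by_phonebook\n";
```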

Note

[1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and
halfwidth forms are identical to their normal form. The difference
between hiragana and katakana is at the 4th level; the comparison also
requires "(variable => 'Non-ignorable')", and then
"katakana_before_hiragana" has no effect.

[2] ko: Plenty of ideographs are sorted by their reading. Such an
ideograph is primary (level 1) equal to, and secondary (level 2)
greater than, the corresponding hangul syllable.

[3] zh__pinyin and zh__stroke: these implement alt='short', in which a
smaller number of ideographs are tailored.

INSTALL
Installation of "Unicode::Collate::Locale" requires Collate/Locale.pm,
Collate/Locale/*.pm, Collate/CJK/*.pm and Collate/allkeys.txt. When
building, "Unicode::Collate::Locale" doesn't require any of data/*.txt,
gendata/*, or mklocale. Tests for "Unicode::Collate::Locale" are
named t/loc_*.t.

CAVEAT
tailoring is not maximum
    Even if a certain letter is tailored, its equivalents are not
    always tailored along with it. For example, even though W is
    tailored, fullwidth W ("U+FF37"), W with acute ("U+1E82"), etc. are
    not tailored. The result may depend on whether source strings are
    normalized or not, and whether they are decomposed or composed. Thus
    "(normalization => undef)" is less preferred.

AUTHOR
The Unicode::Collate::Locale module for perl was written by SADAHIRO
Tomoyuki, <SADAHIRO@cpan.org>. This module is Copyright(C) 2004-2012,
SADAHIRO Tomoyuki. Japan. All rights reserved.

This module is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

SEE ALSO
Unicode Collation Algorithm - UTS #10
    <http://www.unicode.org/reports/tr10/>

The Default Unicode Collation Element Table (DUCET)
    <http://www.unicode.org/Public/UCA/latest/allkeys.txt>

Unicode Locale Data Markup Language (LDML) - UTS #35
    <http://www.unicode.org/reports/tr35/>

CLDR - Unicode Common Locale Data Repository
    <http://cldr.unicode.org/>

Unicode::Collate
Unicode::Normalize

perl v5.16.3                      2013-03-04      Unicode::Collate::Locale(3pm)