Collate::Locale(3)    User Contributed Perl Documentation    Collate::Locale(3)

NAME
Unicode::Collate::Locale - Linguistic tailoring for DUCET via
Unicode::Collate

SYNOPSIS
    use Unicode::Collate::Locale;

    # construct
    $Collator = Unicode::Collate::Locale->
        new(locale => $locale_name, %tailoring);

    # sort
    @sorted = $Collator->sort(@not_sorted);

    # compare
    $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

Note: Strings in @not_sorted, $a and $b are interpreted according to
Perl's Unicode support. See perlunicode, perluniintro, perlunitut,
perlunifaq, utf8. Otherwise, you can use the "preprocess" parameter
(cf. "Unicode::Collate") or decode the strings beforehand.

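As a minimal sketch of decoding input before collation, the following
turns UTF-8 byte strings into character strings with the core Encode
module and then sorts them. The sample words and the choice of the
Swedish locale are assumptions made only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Collate::Locale;

# Hypothetical input: UTF-8 encoded byte strings, e.g. as read from a file.
my @bytes = ("zebra", "\xC3\xA4pple", "apa");

# Decode the bytes into Perl character strings before collating.
my @not_sorted = map { decode('UTF-8', $_) } @bytes;

my $Collator = Unicode::Collate::Locale->new(locale => 'sv');
my @sorted   = $Collator->sort(@not_sorted);

# The output layer encodes the characters again; Swedish places
# a-diaeresis after z, so the decoded word sorts last.
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @sorted;
```

If the strings were left undecoded, each byte would be collated as a
separate character, so the result would generally differ.
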
DESCRIPTION
This module provides linguistic tailoring for DUCET, building on
"Unicode::Collate".

Constructor
The "new" method returns a collator object.

The parameter list for the constructor is a hash, which can include
the special key "locale". Its value (case-insensitive) is a Unicode
base language code (two or three letters). For example,
"Unicode::Collate::Locale->new(locale => 'ES')" returns a collator
tailored for Spanish.

$locale_name may be suffixed with a Unicode script code (four letters),
a Unicode region (territory) code, and/or a Unicode language variant
code. These codes are case-insensitive and separated with '_' or '-'.
E.g. "en_US" for English in USA, "az_Cyrl" for Azerbaijani in the
Cyrillic script, "es_ES_traditional" for Spanish in Spain (Traditional).

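The following sketch illustrates that case and the choice of separator
do not matter in a locale name. It relies on "getlocale" (one of the
additional methods of this module) to report the name that was
accepted; the particular spellings tried here are only examples.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Different spellings of locale names; case and the separator
# ('_' or '-') should not change which tailoring is selected.
my %accepted;
for my $name ('ES', 'es', 'es_ES_traditional', 'ES-es-TRADITIONAL') {
    my $Collator = Unicode::Collate::Locale->new(locale => $name);
    $accepted{$name} = $Collator->getlocale;
    printf "%-20s => %s\n", $name, $accepted{$name};
}
```
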
If $locale_name is not available, a fallback is selected in the
following order:

    1. language with a variant code
    2. language with a script code
    3. language with a region code
    4. language
    5. default

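A small sketch of the fallback in practice, again using "getlocale" to
see which tailoring was selected. The names below are chosen for
illustration; "xx_YY" is a made-up code that is assumed to have no
tailoring at all.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my %accepted;
for my $name (
    'es_MX',              # no tailoring of its own: falls back to the language
    'es_MX_traditional',  # falls back to the language with the variant code
    'xx_YY',              # hypothetical unknown language: falls back to default
) {
    $accepted{$name} = Unicode::Collate::Locale->new(locale => $name)->getlocale;
    printf "%-20s => %s\n", $name, $accepted{$name};
}
```
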
Tailoring tags provided by "Unicode::Collate" are allowed as long as
they are not used for "locale" support. In particular, the "table" tag
can never be tailored, since it is reserved for DUCET.

However, "entry" is allowed, even if it is used for "locale" support,
to add or override mappings.

E.g. a collator for Spanish that ignores diacritics and case
differences (i.e. level 1), with reversed case ordering and no
normalization:

    Unicode::Collate::Locale->new(
        level => 1,
        locale => 'es',
        upper_before_lower => 1,
        normalization => undef
    )

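To make the effect of such a collator concrete, here is a hedged usage
sketch built on the "eq" method inherited from "Unicode::Collate". The
sample words are assumptions picked for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $Collator = Unicode::Collate::Locale->new(
    level              => 1,
    locale             => 'es',
    upper_before_lower => 1,
    normalization      => undef,
);

# Differ only in case and in an acute accent: equal at level 1.
my $same_word = $Collator->eq("cafe", "CAF\x{C9}");

# n and n-with-tilde are separate letters in Spanish, so these two
# words differ even at level 1.
my $other_word = $Collator->eq("nino", "ni\x{F1}o");

print "cafe/CAFE-acute equal: ", ($same_word  ? "yes" : "no"), "\n";
print "nino/nino-tilde equal: ", ($other_word ? "yes" : "no"), "\n";
```
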
A tailoring passed to "new()" must not override a behavior already
tailored by "locale".

    Unicode::Collate::Locale->new(
        locale => 'da',
        upper_before_lower => 0, # causes error as reserved by 'da'
    )

However, "change()", inherited from "Unicode::Collate", does allow
overriding a tailoring that is reserved by "locale". Examples:

    new(locale => 'fr_ca')->change(backwards => undef)
    new(locale => 'da')->change(upper_before_lower => 0)
    new(locale => 'ja')->change(overrideCJK => undef)

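The difference between the two approaches can be sketched as follows
for Danish. The variable names are illustrative, and the exact wording
of the error message is not assumed, only that an error is raised.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Danish reserves upper_before_lower, so uppercase sorts first.
my $Collator = Unicode::Collate::Locale->new(locale => 'da');
my $before   = $Collator->cmp('A', 'a');

# change() may override the reserved tailoring after construction.
$Collator->change(upper_before_lower => 0);
my $after = $Collator->cmp('A', 'a');

# Passing the same tailoring to new() is rejected instead.
my $ok = eval {
    Unicode::Collate::Locale->new(locale => 'da', upper_before_lower => 0);
    1;
};
my $error = $ok ? '' : $@;

print "before change: $before, after change: $after\n";
print "new() failed: $error" if $error;
```
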
Methods
"Unicode::Collate::Locale" is a subclass of "Unicode::Collate" and
methods other than "new" are inherited from "Unicode::Collate".

Here is a list of additional methods:

"$Collator->getlocale"
    Returns the language code that was accepted and is actually used
    for collation. If linguistic tailoring is not provided for the
    language code you passed (intentionally for some languages, or due
    to the incomplete implementation), this method returns the string
    'default', meaning no special tailoring.

"$Collator->locale_version"
    (Since Unicode::Collate::Locale 0.87) Returns the version number
    (perhaps "/\d\.\d\d/") of the locale, as that of Locale/*.pl.

    Note: Locale/*.pl that a collator uses should be identified by a
    combination of return values from "getlocale" and "locale_version".

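A brief sketch of calling both methods together. The locale name
"sv_SE" is only an example; the version value printed depends on the
installed release, so no particular number is assumed here.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $Collator = Unicode::Collate::Locale->new(locale => 'sv_SE');

# The accepted locale may differ from the requested one (fallback).
my $locale  = $Collator->getlocale;
my $version = $Collator->locale_version;

printf "requested sv_SE, accepted %s, locale version %s\n",
    $locale, (defined $version ? $version : 'n/a');
```
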
A list of tailorable locales
      locale name       description
    --------------------------------------------------------------
      af                Afrikaans
      ar                Arabic
      as                Assamese
      az                Azerbaijani (Azeri)
      be                Belarusian
      bn                Bengali
      bs                Bosnian (tailored as Croatian)
      bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
      ca                Catalan
      cs                Czech
      cy                Welsh
      da                Danish
      de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
      de_AT_phonebook   Austrian German (umlaut primary greater)
      dsb               Lower Sorbian
      ee                Ewe
      eo                Esperanto
      es                Spanish
      es__traditional   Spanish ('ch' and 'll' as a grapheme)
      et                Estonian
      fa                Persian
      fi                Finnish (v and w are primary equal)
      fi__phonebook     Finnish (v and w as separate characters)
      fil               Filipino
      fo                Faroese
      fr_CA             Canadian French
      gu                Gujarati
      ha                Hausa
      haw               Hawaiian
      he                Hebrew
      hi                Hindi
      hr                Croatian
      hu                Hungarian
      hy                Armenian
      ig                Igbo
      is                Icelandic
      ja                Japanese [1]
      kk                Kazakh
      kl                Kalaallisut
      kn                Kannada
      ko                Korean [2]
      kok               Konkani
      lkt               Lakota
      ln                Lingala
      lt                Lithuanian
      lv                Latvian
      mk                Macedonian
      ml                Malayalam
      mr                Marathi
      mt                Maltese
      nb                Norwegian Bokmål
      nn                Norwegian Nynorsk
      nso               Northern Sotho
      om                Oromo
      or                Oriya
      pa                Punjabi
      pl                Polish
      ro                Romanian
      sa                Sanskrit
      se                Northern Sami
      si                Sinhala
      si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
      sk                Slovak
      sl                Slovenian
      sq                Albanian
      sr                Serbian
      sr_Latn           Serbian in Latin (tailored as Croatian)
      sv                Swedish (v and w are primary equal)
      sv__reformed      Swedish (v and w as separate characters)
      ta                Tamil
      te                Telugu
      th                Thai
      tn                Tswana
      to                Tonga
      tr                Turkish
      ug_Cyrl           Uyghur in Cyrillic
      uk                Ukrainian
      ur                Urdu
      vi                Vietnamese
      vo                Volapük
      wae               Walser
      wo                Wolof
      yo                Yoruba
      zh                Chinese
      zh__big5han       Chinese (ideographs: big5 order)
      zh__gb2312han     Chinese (ideographs: GB-2312 order)
      zh__pinyin        Chinese (ideographs: pinyin order) [3]
      zh__stroke        Chinese (ideographs: stroke order) [3]
      zh__zhuyin        Chinese (ideographs: zhuyin order) [3]
    --------------------------------------------------------------

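To illustrate how a variant changes the result, the sketch below sorts
the same words with "es" and with "es__traditional", where 'ch' is
treated as a single grapheme that follows 'c'. The word list is an
assumption chosen to make the difference visible.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my @words = qw(cuna chico dama);

# Standard Spanish: 'ch' is just 'c' followed by 'h'.
my @standard = Unicode::Collate::Locale->new(locale => 'es')->sort(@words);

# Traditional Spanish: 'ch' collates as its own letter after 'c'.
my @traditional =
    Unicode::Collate::Locale->new(locale => 'es__traditional')->sort(@words);

print "es:              @standard\n";
print "es__traditional: @traditional\n";
```
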
Locales according to the default UCA rules include am (Amharic) without
"[reorder Ethi]", bg (Bulgarian) without "[reorder Cyrl]", chr
(Cherokee) without "[reorder Cher]", de (German), en (English), fr
(French), ga (Irish), id (Indonesian), it (Italian), ka (Georgian)
without "[reorder Geor]", mn (Mongolian) without "[reorder Cyrl Mong]",
ms (Malay), nl (Dutch), pt (Portuguese), ru (Russian) without "[reorder
Cyrl]", sw (Swahili), zu (Zulu).

Note

[1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and
halfwidth forms are identical to their regular form. The difference
between hiragana and katakana is at the 4th level; comparing them also
requires "(variable => 'Non-ignorable')", and then
"katakana_before_hiragana" has no effect.

[2] ko: Plenty of ideographs are sorted by their reading. Such an
ideograph is primary (level 1) equal to, and secondary (level 2)
greater than, the corresponding hangul syllable.

[3] zh__pinyin, zh__stroke and zh__zhuyin: implemented as alt='short',
where a smaller number of ideographs are tailored.

A list of variant codes and their aliases
      variant code       alias
    ------------------------------------------
      dictionary         dict
      phonebook          phone     phonebk
      reformed           reform
      traditional        trad
    ------------------------------------------
      big5han            big5
      gb2312han          gb2312
      pinyin
      stroke
      zhuyin
    ------------------------------------------

Note: 'pinyin' is Han in Latin, 'zhuyin' is Han in Bopomofo.

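A short sketch of using the aliases, assuming that an alias is resolved
to its full variant code and that "getlocale" reports the resolved
name. The alias spellings come from the table above; the loop itself is
only illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Locale names written with variant aliases instead of full codes.
my %resolved;
for my $name (qw(de_phone de_phonebk es_trad)) {
    $resolved{$name} = Unicode::Collate::Locale->new(locale => $name)->getlocale;
    printf "%-12s => %s\n", $name, $resolved{$name};
}
```
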
INSTALL
Installation of "Unicode::Collate::Locale" requires Collate/Locale.pm,
Collate/Locale/*.pl, Collate/CJK/*.pm and Collate/allkeys.txt. On
building, "Unicode::Collate::Locale" doesn't require any of data/*.txt,
gendata/*, and mklocale. Tests for "Unicode::Collate::Locale" are
named t/loc_*.t.

CAVEAT
Tailoring is not maximum
    Even if a certain letter is tailored, its equivalents are not
    always tailored along with it. For example, even though W is
    tailored, fullwidth W ("U+FF37"), W with acute ("U+1E82"), etc. are
    not tailored. The result may depend on whether source strings are
    normalized or not, and whether decomposed or composed. Thus
    "(normalization => undef)" is less preferred.

Collation reordering is not supported
    The order of any groups including scripts is not changed.

Reference
      locale            based CLDR or other reference
    --------------------------------------------------------------------
      af                30 = 1.8.1
      ar                30 = 28 ("compat" wo [reorder Arab]) = 1.9.0
      as                30 = 28 (without [reorder Beng..]) = 23
      az                30 = 24 ("standard" wo [reorder Latn Cyrl])
      be                30 = 28 (without [reorder Cyrl])
      bn                30 = 28 ("standard" wo [reorder Beng..]) = 2.0.1
      bs                30 = 28 (type="standard": [import hr])
      bs_Cyrl           30 = 28 (type="standard": [import sr])
      ca                30 = 23 (alt="proposed" type="standard")
      cs                30 = 1.8.1 (type="standard")
      cy                30 = 1.8.1
      da                22.1 = 1.8.1 (type="standard")
      de__phonebook     30 = 2.0 (type="phonebook")
      de_AT_phonebook   30 = 27 (type="phonebook")
      dsb               30 = 26
      ee                30 = 21
      eo                30 = 1.8.1
      es                30 = 1.9.0 (type="standard")
      es__traditional   30 = 1.8.1 (type="traditional")
      et                30 = 26
      fa                22.1 = 1.8.1
      fi                22.1 = 1.8.1 (type="standard" alt="proposed")
      fi__phonebook     22.1 = 1.8.1 (type="phonebook")
      fil               30 = 1.9.0 (type="standard") = 1.8.1
      fo                22.1 = 1.8.1 (alt="proposed" type="standard")
      fr_CA             30 = 1.9.0
      gu                30 = 28 ("standard" wo [reorder Gujr..]) = 1.9.0
      ha                30 = 1.9.0
      haw               30 = 24
      he                30 = 28 (without [reorder Hebr]) = 23
      hi                30 = 28 (without [reorder Deva..]) = 1.9.0
      hr                30 = 28 ("standard" wo [reorder Latn Cyrl]) = 1.9.0
      hu                22.1 = 1.8.1 (alt="proposed" type="standard")
      hy                30 = 28 (without [reorder Armn]) = 1.8.1
      ig                30 = 1.8.1
      is                22.1 = 1.8.1 (type="standard")
      ja                22.1 = 1.8.1 (type="standard")
      kk                30 = 28 (without [reorder Cyrl])
      kl                22.1 = 1.8.1 (type="standard")
      kn                30 = 28 ("standard" wo [reorder Knda..]) = 1.9.0
      ko                22.1 = 1.8.1 (type="standard")
      kok               30 = 28 (without [reorder Deva..]) = 1.8.1
      lkt               30 = 25
      ln                30 = 2.0 (type="standard") = 1.8.1
      lt                22.1 = 1.9.0
      lv                22.1 = 1.9.0 (type="standard") = 1.8.1
      mk                30 = 28 (without [reorder Cyrl])
      ml                22.1 = 1.9.0
      mr                30 = 28 (without [reorder Deva..]) = 1.8.1
      mt                22.1 = 1.9.0
      nb                22.1 = 2.0 (type="standard")
      nn                22.1 = 2.0 (type="standard")
      nso               [*] 26 = 1.8.1
      om                22.1 = 1.8.1
      or                30 = 28 (without [reorder Orya..]) = 1.9.0
      pa                22.1 = 1.8.1
      pl                30 = 1.8.1
      ro                30 = 1.9.0 (type="standard")
      sa                [*] 1.9.1 = 1.8.1 (type="standard" alt="proposed")
      se                22.1 = 1.8.1 (type="standard")
      si                30 = 28 ("standard" wo [reorder Sinh..]) = 1.9.0
      si__dictionary    30 = 28 ("dictionary" wo [reorder Sinh..]) = 1.9.0
      sk                22.1 = 1.9.0 (type="standard")
      sl                22.1 = 1.8.1 (type="standard" alt="proposed")
      sq                22.1 = 1.8.1 (alt="proposed" type="standard")
      sr                30 = 28 (without [reorder Cyrl])
      sr_Latn           30 = 28 (type="standard": [import hr])
      sv                22.1 = 1.9.0 (type="standard")
      sv__reformed      22.1 = 1.8.1 (type="reformed")
      ta                22.1 = 1.9.0
      te                30 = 28 (without [reorder Telu..]) = 1.9.0
      th                22.1 = 22
      tn                [*] 26 = 1.8.1
      to                22.1 = 22
      tr                22.1 = 1.8.1 (type="standard")
      uk                30 = 28 (without [reorder Cyrl])
      ug_Cyrl           https://en.wikipedia.org/wiki/Uyghur_Cyrillic_alphabet
      ur                22.1 = 1.9.0
      vi                22.1 = 1.8.1
      vo                30 = 25
      wae               30 = 2.0
      wo                [*] 1.9.1 = 1.8.1
      yo                30 = 1.8.1
      zh                22.1 = 1.8.1 (type="standard")
      zh__big5han       22.1 = 1.8.1 (type="big5han")
      zh__gb2312han     22.1 = 1.8.1 (type="gb2312han")
      zh__pinyin        22.1 = 2.0 (type='pinyin' alt='short')
      zh__stroke        22.1 = 1.9.1 (type='stroke' alt='short')
      zh__zhuyin        22.1 = 22 (type='zhuyin' alt='short')
    --------------------------------------------------------------------

    [*] http://www.unicode.org/repos/cldr/tags/latest/seed/collation/

AUTHOR
The Unicode::Collate::Locale module for perl was written by SADAHIRO
Tomoyuki, <SADAHIRO@cpan.org>. This module is Copyright(C) 2004-2017,
SADAHIRO Tomoyuki. Japan. All rights reserved.

This module is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

SEE ALSO
Unicode Collation Algorithm - UTS #10
    <http://www.unicode.org/reports/tr10/>

The Default Unicode Collation Element Table (DUCET)
    <http://www.unicode.org/Public/UCA/latest/allkeys.txt>

Unicode Locale Data Markup Language (LDML) - UTS #35
    <http://www.unicode.org/reports/tr35/>

CLDR - Unicode Common Locale Data Repository
    <http://cldr.unicode.org/>

Unicode::Collate
Unicode::Normalize

perl v5.26.3                      2017-11-22                Collate::Locale(3)