Collate::Locale(3)    User Contributed Perl Documentation    Collate::Locale(3)

NAME
    Unicode::Collate::Locale - Linguistic tailoring for DUCET via
    Unicode::Collate

SYNOPSIS
        use Unicode::Collate::Locale;

        # construct
        $Collator = Unicode::Collate::Locale->
            new(locale => $locale_name, %tailoring);

        # sort
        @sorted = $Collator->sort(@not_sorted);

        # compare
        $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

    Note: Strings in @not_sorted, $a and $b are interpreted according to
    Perl's Unicode support. See perlunicode, perluniintro, perlunitut,
    perlunifaq, utf8. If your strings are not already decoded character
    strings, either decode them beforehand or use "preprocess" (cf.
    "Unicode::Collate").
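
    The note above can be illustrated with a minimal sketch. It assumes
    the input is UTF-8 encoded bytes and uses the core Encode module; the
    variable names and sample words are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Collate::Locale;

# UTF-8 encoded byte strings, e.g. as read from a file without an I/O layer.
# "\xC3\xB1" is the UTF-8 encoding of U+00F1 (n with tilde).
my @bytes = ("\xC3\xB1u", "nz", "na");

# Option 1: decode first, then sort the resulting character strings.
my @chars   = map { decode('UTF-8', $_) } @bytes;
my $spanish = Unicode::Collate::Locale->new(locale => 'es');
my @sorted  = $spanish->sort(@chars);

# Option 2: keep the data as bytes and let the collator decode each string
# through "preprocess" before it computes the sort key.
my $decoding = Unicode::Collate::Locale->new(
    locale     => 'es',
    preprocess => sub { decode('UTF-8', $_[0]) },
);
my @sorted_bytes = $decoding->sort(@bytes);
```

    In both cases the Spanish tailoring places the word starting with
    n-tilde after every word starting with plain "n". With option 2 the
    returned list still holds the original byte strings.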

DESCRIPTION
    This module provides linguistic tailoring for DUCET (the Default
    Unicode Collation Element Table) by building on "Unicode::Collate".

  Constructor
    The "new" method returns a collator object.

    The constructor takes its parameters as a hash, which may include the
    special key "locale". Its value (case-insensitive) is a Unicode base
    language code (two or three letters). For example,
    "Unicode::Collate::Locale->new(locale => 'ES')" returns a collator
    tailored for Spanish.

    $locale_name may be suffixed with a Unicode script code (four letters),
    a Unicode region (territory) code, and a Unicode language variant code.
    These codes are case-insensitive and separated with '_' or '-'. E.g.
    "en_US" for English in the USA, "az_Cyrl" for Azerbaijani in the
    Cyrillic script, "es_ES_traditional" for Spanish in Spain (Traditional).
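
    As a rough illustration of the naming rules above, the sketch below
    passes several spellings of the same locale to "new" and records what
    "getlocale" reports. The %resolved hash is only a helper for this
    example, and the values in the comments are what the rules above are
    expected to yield.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Different spellings: upper or lower case, '_' or '-' as the separator.
my @names = ('es', 'ES', 'es_ES_traditional', 'ES-es-Traditional',
             'fr_ca', 'fr-CA');

my %resolved;
for my $name (@names) {
    my $collator = Unicode::Collate::Locale->new(locale => $name);
    $resolved{$name} = $collator->getlocale;
    printf "%-20s => %s\n", $name, $resolved{$name};
}
# 'es' and 'ES' are expected to resolve to es,
# both traditional spellings to es__traditional,
# and both Canadian French spellings to fr_CA.
```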

    If $locale_name is not available, a fallback is selected in the
    following order:

        1. language with a variant code
        2. language with a script code
        3. language with a region code
        4. language
        5. default
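
    The fallback order can be observed through "getlocale". The sketch
    below is only an illustration; the three locale names were picked
    because no tailoring with exactly those names appears in the list of
    tailorable locales in this document.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Return the locale that is actually used for a requested name.
sub used_locale {
    my ($requested) = @_;
    return Unicode::Collate::Locale->new(locale => $requested)->getlocale;
}

# No Mexican Spanish tailoring: falls back to the language, 'es'.
my $mexican = used_locale('es_MX');

# German for Germany with the phonebook variant: falls back to the
# language with the variant code, 'de__phonebook'.
my $phonebook = used_locale('de_DE_phonebook');

# English follows the default UCA rules, so the result is 'default'.
my $english = used_locale('en_US');

print "$mexican $phonebook $english\n";
```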

    Tailoring tags provided by "Unicode::Collate" are allowed as long as
    they are not used for "locale" support. In particular, the "table" tag
    can never be tailored, since it is reserved for DUCET.

    However, "entry" is allowed, even if it is used for "locale" support,
    to add or override mappings.

    For example, the following constructs a collator for Spanish that
    ignores diacritics and case differences (i.e. level 1), with reversed
    case ordering and no normalization:

        Unicode::Collate::Locale->new(
            level => 1,
            locale => 'es',
            upper_before_lower => 1,
            normalization => undef
        )
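
    A short sketch of how such a collator behaves. The test strings are
    made up for illustration and are written with precomposed characters
    ("\x{F1}" etc.) because normalization is switched off.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $collator = Unicode::Collate::Locale->new(
    level              => 1,
    locale             => 'es',
    upper_before_lower => 1,
    normalization      => undef,
);

# At level 1 only primary differences count, so case and the acute
# accent on the final "u" are ignored.
my $same_word = $collator->eq("\x{F1}and\x{FA}", "\x{D1}ANDU");

# In Spanish, n with tilde is a letter of its own, so it still differs
# from plain "n" at level 1 and sorts after it.
my $n_vs_ntilde = $collator->cmp("n", "\x{F1}");

print $same_word ? "equal\n" : "different\n";
print "cmp(n, n-tilde) = $n_vs_ntilde\n";
```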

    A tailoring passed to "new()" may not override a behavior already
    tailored by "locale".

        Unicode::Collate::Locale->new(
            locale => 'da',
            upper_before_lower => 0, # causes error as reserved by 'da'
        )

    However, "change()", inherited from "Unicode::Collate", can modify a
    tailoring that is reserved by "locale". Examples:

        new(locale => 'fr_ca')->change(backwards => undef)
        new(locale => 'da')->change(upper_before_lower => 0)
        new(locale => 'ja')->change(overrideCJK => undef)
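
    The difference between the two cases can be sketched as follows. The
    "eval" wrapper and the variable names are only for this example; the
    sketch assumes the error raised by "new()" is an ordinary exception.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Passing a reserved tag to new() is expected to fail.
my $created = eval {
    Unicode::Collate::Locale->new(locale => 'da', upper_before_lower => 0);
    1;
};
my $error = $created ? '' : $@;
print "new() failed: $error" unless $created;

# The same setting can be altered afterwards with change().
my $danish = Unicode::Collate::Locale->new(locale => 'da');
my $before = $danish->cmp('A', 'a');   # uppercase first under 'da'
$danish->change(upper_before_lower => 0);
my $after  = $danish->cmp('A', 'a');   # lowercase first after the change

print "before: $before, after: $after\n";
```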

  Methods
    "Unicode::Collate::Locale" is a subclass of "Unicode::Collate" and
    methods other than "new" are inherited from "Unicode::Collate".

    Here is a list of additional methods:

    "$Collator->getlocale"
        Returns the language code that was accepted and is actually used
        for collation. If linguistic tailoring is not provided for the
        language code you passed (intentionally for some languages, or due
        to the incomplete implementation), this method returns the string
        'default', meaning no special tailoring.

    "$Collator->locale_version"
        (Since Unicode::Collate::Locale 0.87) Returns the version number
        (perhaps "/\d\.\d\d/") of the locale, i.e. that of the
        corresponding Locale/*.pl file.

        Note: The Locale/*.pl file that a collator uses should be
        identified by the combination of the return values of "getlocale"
        and "locale_version".
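
    A minimal sketch of the two methods used together. It assumes an
    installed version that provides "locale_version" (0.87 or later); the
    requested name 'sv_SE' is just an example.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $collator = Unicode::Collate::Locale->new(locale => 'sv_SE');

# The tailoring actually in use; for 'sv_SE' this should be 'sv'.
my $locale = $collator->getlocale;

# Version of the locale data behind that tailoring.
my $version = $collator->locale_version;

print "locale: $locale, locale data version: $version\n";
```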

  A list of tailorable locales
    locale name         description
    --------------------------------------------------------------
    af                  Afrikaans
    ar                  Arabic
    as                  Assamese
    az                  Azerbaijani (Azeri)
    be                  Belarusian
    bn                  Bengali
    bs                  Bosnian (tailored as Croatian)
    bs_Cyrl             Bosnian in Cyrillic (tailored as Serbian)
    ca                  Catalan
    cs                  Czech
    cu                  Church Slavic
    cy                  Welsh
    da                  Danish
    de__phonebook       German (umlaut as 'ae', 'oe', 'ue')
    de_AT_phonebook     Austrian German (umlaut primary greater)
    dsb                 Lower Sorbian
    ee                  Ewe
    eo                  Esperanto
    es                  Spanish
    es__traditional     Spanish ('ch' and 'll' as a grapheme)
    et                  Estonian
    fa                  Persian
    fi                  Finnish (v and w are primary equal)
    fi__phonebook       Finnish (v and w as separate characters)
    fil                 Filipino
    fo                  Faroese
    fr_CA               Canadian French
    gu                  Gujarati
    ha                  Hausa
    haw                 Hawaiian
    he                  Hebrew
    hi                  Hindi
    hr                  Croatian
    hu                  Hungarian
    hy                  Armenian
    ig                  Igbo
    is                  Icelandic
    ja                  Japanese [1]
    kk                  Kazakh
    kl                  Kalaallisut
    kn                  Kannada
    ko                  Korean [2]
    kok                 Konkani
    lkt                 Lakota
    ln                  Lingala
    lt                  Lithuanian
    lv                  Latvian
    mk                  Macedonian
    ml                  Malayalam
    mr                  Marathi
    mt                  Maltese
    nb                  Norwegian Bokmal
    nn                  Norwegian Nynorsk
    nso                 Northern Sotho
    om                  Oromo
    or                  Oriya
    pa                  Punjabi
    pl                  Polish
    ro                  Romanian
    sa                  Sanskrit
    se                  Northern Sami
    si                  Sinhala
    si__dictionary      Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
    sk                  Slovak
    sl                  Slovenian
    sq                  Albanian
    sr                  Serbian
    sr_Latn             Serbian in Latin (tailored as Croatian)
    sv                  Swedish (v and w are primary equal)
    sv__reformed        Swedish (v and w as separate characters)
    ta                  Tamil
    te                  Telugu
    th                  Thai
    tn                  Tswana
    to                  Tonga
    tr                  Turkish
    ug_Cyrl             Uyghur in Cyrillic
    uk                  Ukrainian
    ur                  Urdu
    vi                  Vietnamese
    vo                  Volapük
    wae                 Walser
    wo                  Wolof
    yo                  Yoruba
    zh                  Chinese
    zh__big5han         Chinese (ideographs: big5 order)
    zh__gb2312han       Chinese (ideographs: GB-2312 order)
    zh__pinyin          Chinese (ideographs: pinyin order) [3]
    zh__stroke          Chinese (ideographs: stroke order) [3]
    zh__zhuyin          Chinese (ideographs: zhuyin order) [3]
    --------------------------------------------------------------
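
    To see what a variant in this list changes, the sketch below sorts
    the same made-up word list with "es" and with "es__traditional", where
    'ch' and 'll' are each treated as a single letter.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my @words = ('cuna', 'chico', 'dama', 'llama', 'luz');

my $modern      = Unicode::Collate::Locale->new(locale => 'es');
my $traditional = Unicode::Collate::Locale->new(locale => 'es__traditional');

# Modern Spanish: 'ch' and 'll' are ordinary letter sequences.
my @modern_order = $modern->sort(@words);
# chico cuna dama llama luz

# Traditional Spanish: 'ch' follows every other "c" word and 'll'
# follows every other "l" word.
my @traditional_order = $traditional->sort(@words);
# cuna chico dama luz llama

print "es:              @modern_order\n";
print "es__traditional: @traditional_order\n";
```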

    Locales that follow the default UCA rules include am (Amharic) without
    "[reorder Ethi]", bg (Bulgarian) without "[reorder Cyrl]", chr
    (Cherokee) without "[reorder Cher]", de (German), en (English), fr
    (French), ga (Irish), id (Indonesian), it (Italian), ka (Georgian)
    without "[reorder Geor]", mn (Mongolian) without "[reorder Cyrl Mong]",
    ms (Malay), nl (Dutch), pt (Portuguese), ru (Russian) without "[reorder
    Cyrl]", sw (Swahili), zu (Zulu).

    Note

    [1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and
    halfwidth forms are identical to their regular form. Hiragana and
    katakana differ only at the 4th level; comparing them also requires
    "(variable => 'Non-ignorable')", in which case
    "katakana_before_hiragana" has no effect.

    [2] ko: Plenty of ideographs are sorted by their reading. Such an
    ideograph is primary (level 1) equal to, and secondary (level 2)
    greater than, the corresponding hangul syllable.

    [3] zh__pinyin, zh__stroke and zh__zhuyin: implemented as alt='short',
    in which a smaller number of ideographs are tailored.

  A list of variant codes and their aliases
    variant code       alias
    ------------------------------------------
    dictionary         dict
    phonebook          phone     phonebk
    reformed           reform
    traditional        trad
    ------------------------------------------
    big5han            big5
    gb2312han          gb2312
    pinyin
    stroke
    zhuyin
    ------------------------------------------

    Note: 'pinyin' is Han in Latin, 'zhuyin' is Han in Bopomofo.
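
    A small sketch of the aliases in use. It assumes that an alias can
    stand wherever the full variant code can, and that "getlocale" reports
    the full code either way.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# 'trad' is listed as an alias of 'traditional'.
my $spanish = Unicode::Collate::Locale->new(locale => 'es__trad');
my $spanish_locale = $spanish->getlocale;

# 'phonebk' is listed as an alias of 'phonebook'.
my $german = Unicode::Collate::Locale->new(locale => 'de__phonebk');
my $german_locale = $german->getlocale;

print "$spanish_locale $german_locale\n";
```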

INSTALL
    Installation of "Unicode::Collate::Locale" requires Collate/Locale.pm,
    Collate/Locale/*.pm, Collate/CJK/*.pm and Collate/allkeys.txt.
    Building "Unicode::Collate::Locale" does not require any of
    data/*.txt, gendata/* or mklocale. Tests for
    "Unicode::Collate::Locale" are named t/loc_*.t.

CAVEAT
    Tailoring is not maximum
        Even if a certain letter is tailored, its equivalents are not
        always tailored along with it. For example, even though W is
        tailored, fullwidth W ("U+FF37"), W with acute ("U+1E82"), etc. are
        not tailored. The result may depend on whether source strings are
        normalized or not, and whether they are decomposed or composed.
        Thus "(normalization => undef)" is less preferable.

    Collation reordering is not supported
        The order of any groups, including scripts, is not changed.

  Reference
    locale            based CLDR or other reference
    --------------------------------------------------------------------
    af                30 = 1.8.1
    ar                30 = 28 ("compat" wo [reorder Arab]) = 1.9.0
    as                30 = 28 (without [reorder Beng..]) = 23
    az                30 = 24 ("standard" wo [reorder Latn Cyrl])
    be                30 = 28 (without [reorder Cyrl])
    bn                30 = 28 ("standard" wo [reorder Beng..]) = 2.0.1
    bs                30 = 28 (type="standard": [import hr])
    bs_Cyrl           30 = 28 (type="standard": [import sr])
    ca                30 = 23 (alt="proposed" type="standard")
    cs                30 = 1.8.1 (type="standard")
    cu                34 = 30 (without [reorder Cyrl])
    cy                30 = 1.8.1
    da                22.1 = 1.8.1 (type="standard")
    de__phonebook     30 = 2.0 (type="phonebook")
    de_AT_phonebook   30 = 27 (type="phonebook")
    dsb               30 = 26
    ee                30 = 21
    eo                30 = 1.8.1
    es                30 = 1.9.0 (type="standard")
    es__traditional   30 = 1.8.1 (type="traditional")
    et                30 = 26
    fa                22.1 = 1.8.1
    fi                22.1 = 1.8.1 (type="standard" alt="proposed")
    fi__phonebook     22.1 = 1.8.1 (type="phonebook")
    fil               30 = 1.9.0 (type="standard") = 1.8.1
    fo                22.1 = 1.8.1 (alt="proposed" type="standard")
    fr_CA             30 = 1.9.0
    gu                30 = 28 ("standard" wo [reorder Gujr..]) = 1.9.0
    ha                30 = 1.9.0
    haw               30 = 24
    he                30 = 28 (without [reorder Hebr]) = 23
    hi                30 = 28 (without [reorder Deva..]) = 1.9.0
    hr                30 = 28 ("standard" wo [reorder Latn Cyrl]) = 1.9.0
    hu                22.1 = 1.8.1 (alt="proposed" type="standard")
    hy                30 = 28 (without [reorder Armn]) = 1.8.1
    ig                30 = 1.8.1
    is                22.1 = 1.8.1 (type="standard")
    ja                22.1 = 1.8.1 (type="standard")
    kk                30 = 28 (without [reorder Cyrl])
    kl                22.1 = 1.8.1 (type="standard")
    kn                30 = 28 ("standard" wo [reorder Knda..]) = 1.9.0
    ko                22.1 = 1.8.1 (type="standard")
    kok               30 = 28 (without [reorder Deva..]) = 1.8.1
    lkt               30 = 25
    ln                30 = 2.0 (type="standard") = 1.8.1
    lt                22.1 = 1.9.0
    lv                22.1 = 1.9.0 (type="standard") = 1.8.1
    mk                30 = 28 (without [reorder Cyrl])
    ml                22.1 = 1.9.0
    mr                30 = 28 (without [reorder Deva..]) = 1.8.1
    mt                22.1 = 1.9.0
    nb                22.1 = 2.0 (type="standard")
    nn                22.1 = 2.0 (type="standard")
    nso               [*] 26 = 1.8.1
    om                22.1 = 1.8.1
    or                30 = 28 (without [reorder Orya..]) = 1.9.0
    pa                22.1 = 1.8.1
    pl                30 = 1.8.1
    ro                30 = 1.9.0 (type="standard")
    sa                [*] 1.9.1 = 1.8.1 (type="standard" alt="proposed")
    se                22.1 = 1.8.1 (type="standard")
    si                30 = 28 ("standard" wo [reorder Sinh..]) = 1.9.0
    si__dictionary    30 = 28 ("dictionary" wo [reorder Sinh..]) = 1.9.0
    sk                22.1 = 1.9.0 (type="standard")
    sl                22.1 = 1.8.1 (type="standard" alt="proposed")
    sq                22.1 = 1.8.1 (alt="proposed" type="standard")
    sr                30 = 28 (without [reorder Cyrl])
    sr_Latn           30 = 28 (type="standard": [import hr])
    sv                22.1 = 1.9.0 (type="standard")
    sv__reformed      22.1 = 1.8.1 (type="reformed")
    ta                22.1 = 1.9.0
    te                30 = 28 (without [reorder Telu..]) = 1.9.0
    th                22.1 = 22
    tn                [*] 26 = 1.8.1
    to                22.1 = 22
    tr                22.1 = 1.8.1 (type="standard")
    uk                30 = 28 (without [reorder Cyrl])
    ug_Cyrl           https://en.wikipedia.org/wiki/Uyghur_Cyrillic_alphabet
    ur                22.1 = 1.9.0
    vi                22.1 = 1.8.1
    vo                30 = 25
    wae               30 = 2.0
    wo                [*] 1.9.1 = 1.8.1
    yo                30 = 1.8.1
    zh                22.1 = 1.8.1 (type="standard")
    zh__big5han       22.1 = 1.8.1 (type="big5han")
    zh__gb2312han     22.1 = 1.8.1 (type="gb2312han")
    zh__pinyin        22.1 = 2.0 (type='pinyin' alt='short')
    zh__stroke        22.1 = 1.9.1 (type='stroke' alt='short')
    zh__zhuyin        22.1 = 22 (type='zhuyin' alt='short')
    --------------------------------------------------------------------

    [*] http://www.unicode.org/repos/cldr/tags/latest/seed/collation/

AUTHOR
    The Unicode::Collate::Locale module for perl was written by SADAHIRO
    Tomoyuki, <SADAHIRO@cpan.org>. This module is Copyright(C) 2004-2020,
    SADAHIRO Tomoyuki. Japan. All rights reserved.

    This module is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

SEE ALSO
    Unicode Collation Algorithm - UTS #10
        <http://www.unicode.org/reports/tr10/>

    The Default Unicode Collation Element Table (DUCET)
        <http://www.unicode.org/Public/UCA/latest/allkeys.txt>

    Unicode Locale Data Markup Language (LDML) - UTS #35
        <http://www.unicode.org/reports/tr35/>

    CLDR - Unicode Common Locale Data Repository
        <http://cldr.unicode.org/>

    Unicode::Collate
    Unicode::Normalize

perl v5.34.0                      2021-08-23                Collate::Locale(3)