Collate::Locale(3)    User Contributed Perl Documentation   Collate::Locale(3)

NAME

       Unicode::Collate::Locale - Linguistic tailoring for DUCET via
       Unicode::Collate

SYNOPSIS

         use Unicode::Collate::Locale;

         #construct
         $Collator = Unicode::Collate::Locale->
             new(locale => $locale_name, %tailoring);

         #sort
         @sorted = $Collator->sort(@not_sorted);

         #compare
         $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

       Note: Strings in @not_sorted, $a and $b are interpreted according to
       Perl's Unicode support. See perlunicode, perluniintro, perlunitut,
       perlunifaq, utf8.  Otherwise, use "preprocess" (cf.
       "Unicode::Collate") or decode the strings beforehand.

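       As a minimal sketch of the note above, assuming the input arrives as
       UTF-8 encoded bytes, the strings can be decoded with the core Encode
       module before collation.  The sample words and the choice of the 'sv'
       locale are illustrative assumptions, not part of the module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Collate::Locale;

# Raw UTF-8 bytes, as they might arrive from a file or socket (sample data).
my @bytes = ("\xC3\xA4pple", "zebra", "apple");

# Decode to Perl character strings before handing them to the collator.
my @not_sorted = map { decode('UTF-8', $_) } @bytes;

my $Collator = Unicode::Collate::Locale->new(locale => 'sv');
my @sorted   = $Collator->sort(@not_sorted);

# Encode again only at the output boundary.
print encode('UTF-8', $_), "\n" for @sorted;
```

       Decoding at the input boundary and encoding at the output boundary
       keeps the collator working on characters rather than bytes.
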

DESCRIPTION

       This module provides linguistic tailoring for DUCET, taking advantage
       of "Unicode::Collate".

   Constructor
       The "new" method returns a collator object.

       The constructor takes a hash of parameters.  It may include the
       special key "locale", whose value (case-insensitive) is a Unicode base
       language code (two or three letters).  For example,
       "Unicode::Collate::Locale->new(locale => 'ES')" returns a collator
       tailored for Spanish.

       $locale_name may be suffixed with a Unicode script code (four-letter),
       a Unicode region (territory) code, and a Unicode language variant
       code.  These codes are case-insensitive, and separated with '_' or
       '-'.  E.g. "en_US" for English in USA, "az_Cyrl" for Azerbaijani in
       the Cyrillic script, "es_ES_traditional" for Spanish in Spain
       (Traditional).

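       The effect of a variant suffix can be sketched as follows.  The word
       list is an illustrative assumption; the locale names come from the
       text above, written once with '-' and mixed case to show that both
       are accepted.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my @words = qw(cuna chico casa);

# Plain Spanish: 'ch' is compared letter by letter.
my $es = Unicode::Collate::Locale->new(locale => 'es');

# Region and variant suffixes, separated with '-' and in mixed case.
my $trad = Unicode::Collate::Locale->new(locale => 'es-ES-Traditional');

my @modern      = $es->sort(@words);
my @traditional = $trad->sort(@words);

print "es:               @modern\n";
print "es (traditional): @traditional\n";
```

       With the traditional variant, 'ch' is treated as a single grapheme
       sorting after 'c', so "chico" should follow "cuna".
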
       If $locale_name is not available, a fallback is selected in the
       following order:

           1. language with a variant code
           2. language with a script code
           3. language with a region code
           4. language
           5. default

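       The fallback order above can be observed with "getlocale".  This is
       a small sketch; the requested names are illustrative assumptions
       chosen so that most of them do not match a tailorable locale exactly.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Requested name => locale actually accepted by the constructor.
my %accepted;

for my $name (qw(es_MX sr_Latn_RS de_AT_phonebook en_US)) {
    my $Collator = Unicode::Collate::Locale->new(locale => $name);
    $accepted{$name} = $Collator->getlocale;
    printf "%-16s -> %s\n", $name, $accepted{$name};
}
```

       Under the documented order, "es_MX" should fall back to the plain
       language "es", and "en_US" to 'default', since English has no
       tailoring of its own.
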
       Tailoring tags provided by "Unicode::Collate" are allowed as long as
       they are not used for "locale" support.  In particular, the "table"
       tag can never be tailored, since it is reserved for DUCET.

       However, "entry" is allowed, even if it is used for "locale" support,
       to add or override mappings.

       For example, the following constructs a collator for Spanish that
       ignores diacritics and case differences (i.e. level 1), with reversed
       case ordering and no normalization:

           Unicode::Collate::Locale->new(
               level => 1,
               locale => 'es',
               upper_before_lower => 1,
               normalization => undef
           )

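       A short usage sketch for the collator above, using the "eq" and "ne"
       methods inherited from "Unicode::Collate".  The sample words are
       assumptions chosen for illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;

my $Collator = Unicode::Collate::Locale->new(
    level              => 1,
    locale             => 'es',
    upper_before_lower => 1,
    normalization      => undef,
);

# Accents and case are distinguished below level 1, so these compare equal.
my $same = $Collator->eq('camión', 'CAMION');

# In Spanish, ñ is a letter of its own, so this differs even at level 1.
my $differ = $Collator->ne('peña', 'pena');

print "camión eq CAMION: ", ($same   ? "yes" : "no"), "\n";
print "peña ne pena:     ", ($differ ? "yes" : "no"), "\n";
```

       Note that "upper_before_lower" only affects case ordering, which is
       not compared at level 1.
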
       A tailoring passed to "new()" may not override a behavior already
       tailored by "locale".

           Unicode::Collate::Locale->new(
               locale => 'da',
               upper_before_lower => 0, # causes error as reserved by 'da'
           )

       However, "change()", inherited from "Unicode::Collate", can alter a
       tailoring that is reserved by "locale". Examples:

           new(locale => 'fr_ca')->change(backwards => undef)
           new(locale => 'da')->change(upper_before_lower => 0)
           new(locale => 'ja')->change(overrideCJK => undef)

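       The two cases above can be combined into one sketch: trap the error
       from "new()" with "eval", then construct the collator without the
       reserved tag and override it with "change()".  The exact error text
       is not relied upon here.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Passing a tailoring reserved by 'da' to new() is rejected.
my $Collator = eval {
    Unicode::Collate::Locale->new(locale => 'da', upper_before_lower => 0);
};
my $error = $@;
print "new() failed: $error" if !defined $Collator;

# Construct first, then override the reserved tailoring with change().
$Collator = Unicode::Collate::Locale->new(locale => 'da');
$Collator->change(upper_before_lower => 0);
```
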
   Methods
       "Unicode::Collate::Locale" is a subclass of "Unicode::Collate" and
       methods other than "new" are inherited from "Unicode::Collate".

       Here is a list of additional methods:

       "$Collator->getlocale"
           Returns the language code actually accepted and used for
           collation.  If linguistic tailoring is not provided for the
           language code you passed (intentionally for some languages, or
           due to the incomplete implementation), this method returns the
           string 'default', meaning no special tailoring.

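       One way to use "getlocale" is to detect whether a requested locale
       received any tailoring.  The helper name "is_tailored" below is
       hypothetical and not part of the module; 'xx' stands for an unknown
       language code.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Hypothetical helper: true if the requested locale got a tailoring.
sub is_tailored {
    my ($name) = @_;
    my $Collator = Unicode::Collate::Locale->new(locale => $name);
    return $Collator->getlocale ne 'default';
}

for my $name (qw(sv en xx)) {
    printf "%-3s %s\n", $name,
        is_tailored($name) ? "tailored" : "default rules";
}
```
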
       "$Collator->locale_version"
           (Since Unicode::Collate::Locale 0.87) Returns the version number
           (perhaps matching "/\d\.\d\d/") of the locale, i.e. that of the
           corresponding Locale/*.pl file.

           Note: The Locale/*.pl file that a collator uses should be
           identified by the combination of the return values of "getlocale"
           and "locale_version".

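       A minimal sketch of combining the two return values into one
       identifier.  The "name/version" format is an arbitrary choice for
       illustration, and the guard against an undefined version is a
       precaution assumed here rather than documented behavior.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $Collator = Unicode::Collate::Locale->new(locale => 'sv');

my $name    = $Collator->getlocale;
my $version = $Collator->locale_version;

# Guard against an undefined version (assumed possible when no
# Locale/*.pl file is in use) before building the identifier.
my $id = defined $version ? "$name/$version" : $name;
print "locale data in use: $id\n";
```
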
   A list of tailorable locales
             locale name       description
           --------------------------------------------------------------
             af                Afrikaans
             ar                Arabic
             as                Assamese
             az                Azerbaijani (Azeri)
             be                Belarusian
             bn                Bengali
             bs                Bosnian (tailored as Croatian)
             bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
             ca                Catalan
             cs                Czech
             cu                Church Slavic
             cy                Welsh
             da                Danish
             de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
             de_AT_phonebook   Austrian German (umlaut primary greater)
             dsb               Lower Sorbian
             ee                Ewe
             eo                Esperanto
             es                Spanish
             es__traditional   Spanish ('ch' and 'll' as a grapheme)
             et                Estonian
             fa                Persian
             fi                Finnish (v and w are primary equal)
             fi__phonebook     Finnish (v and w as separate characters)
             fil               Filipino
             fo                Faroese
             fr_CA             Canadian French
             gu                Gujarati
             ha                Hausa
             haw               Hawaiian
             he                Hebrew
             hi                Hindi
             hr                Croatian
             hu                Hungarian
             hy                Armenian
             ig                Igbo
             is                Icelandic
             ja                Japanese [1]
             kk                Kazakh
             kl                Kalaallisut
             kn                Kannada
             ko                Korean [2]
             kok               Konkani
             lkt               Lakota
             ln                Lingala
             lt                Lithuanian
             lv                Latvian
             mk                Macedonian
             ml                Malayalam
             mr                Marathi
             mt                Maltese
             nb                Norwegian Bokmal
             nn                Norwegian Nynorsk
             nso               Northern Sotho
             om                Oromo
             or                Oriya
             pa                Punjabi
             pl                Polish
             ro                Romanian
             sa                Sanskrit
             se                Northern Sami
             si                Sinhala
             si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
             sk                Slovak
             sl                Slovenian
             sq                Albanian
             sr                Serbian
             sr_Latn           Serbian in Latin (tailored as Croatian)
             sv                Swedish (v and w are primary equal)
             sv__reformed      Swedish (v and w as separate characters)
             ta                Tamil
             te                Telugu
             th                Thai
             tn                Tswana
             to                Tonga
             tr                Turkish
             ug_Cyrl           Uyghur in Cyrillic
             uk                Ukrainian
             ur                Urdu
             vi                Vietnamese
             vo                Volapük
             wae               Walser
             wo                Wolof
             yo                Yoruba
             zh                Chinese
             zh__big5han       Chinese (ideographs: big5 order)
             zh__gb2312han     Chinese (ideographs: GB-2312 order)
             zh__pinyin        Chinese (ideographs: pinyin order) [3]
             zh__stroke        Chinese (ideographs: stroke order) [3]
             zh__zhuyin        Chinese (ideographs: zhuyin order) [3]
           --------------------------------------------------------------

       Locales according to the default UCA rules include am (Amharic) without
       "[reorder Ethi]", bg (Bulgarian) without "[reorder Cyrl]", chr
       (Cherokee) without "[reorder Cher]", de (German), en (English), fr
       (French), ga (Irish), id (Indonesian), it (Italian), ka (Georgian)
       without "[reorder Geor]", mn (Mongolian) without "[reorder Cyrl Mong]",
       ms (Malay), nl (Dutch), pt (Portuguese), ru (Russian) without "[reorder
       Cyrl]", sw (Swahili), zu (Zulu).

       Note

       [1] ja: Ideographs are sorted in JIS X 0208 order.  Fullwidth and
       halfwidth forms are identical to their regular form.  The difference
       between hiragana and katakana is at the 4th level; comparing them also
       requires "(variable => 'Non-ignorable')", and "katakana_before_hiragana"
       then has no effect.

       [2] ko: Plenty of ideographs are sorted by their reading. Such an
       ideograph is primary (level 1) equal to, and secondary (level 2)
       greater than, the corresponding hangul syllable.

       [3] zh__pinyin, zh__stroke and zh__zhuyin: implemented as alt='short',
       where a smaller number of ideographs are tailored.

   A list of variant codes and their aliases
             variant code       alias
           ------------------------------------------
             dictionary         dict
             phonebook          phone     phonebk
             reformed           reform
             traditional        trad
           ------------------------------------------
             big5han            big5
             gb2312han          gb2312
             pinyin
             stroke
             zhuyin
           ------------------------------------------

       Note: 'pinyin' is Han in Latin, 'zhuyin' is Han in Bopomofo.

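       The aliases in the table can be checked against their full variant
       codes with a small sketch like the following.  The alias and full
       name pairs are taken from the table; the hash name %alias_for is an
       assumption made for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Locale written with an alias => same locale with the full variant code.
my %alias_for = (
    de_phone  => 'de_phonebook',
    es_trad   => 'es_traditional',
    sv_reform => 'sv_reformed',
    si_dict   => 'si_dictionary',
);

my %same;
for my $short (sort keys %alias_for) {
    my $by_alias = Unicode::Collate::Locale->new(locale => $short);
    my $by_full  = Unicode::Collate::Locale->new(locale => $alias_for{$short});

    # Both spellings should be accepted as the same locale.
    $same{$short} = $by_alias->getlocale eq $by_full->getlocale;
    printf "%-10s -> %s\n", $short, $by_alias->getlocale;
}
```
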

INSTALL

       Installation of "Unicode::Collate::Locale" requires Collate/Locale.pm,
       Collate/Locale/*.pm, Collate/CJK/*.pm and Collate/allkeys.txt.  When
       building, "Unicode::Collate::Locale" requires none of data/*.txt,
       gendata/*, or mklocale.  Tests for "Unicode::Collate::Locale" are
       named t/loc_*.t.

CAVEAT

       Tailoring is not maximum
           Even if a certain letter is tailored, its equivalents are not
           always tailored along with it. For example, even though W is
           tailored, fullwidth W ("U+FF37"), W with acute ("U+1E82"), etc. are
           not tailored. The result may depend on whether source strings are
           normalized or not, and whether decomposed or composed.  Thus
           "(normalization => undef)" is less preferred.

       Collation reordering is not supported
           The order of any groups including scripts is not changed.

   Reference
             locale            based CLDR or other reference
           --------------------------------------------------------------------
             af                30 = 1.8.1
             ar                30 = 28 ("compat" wo [reorder Arab]) = 1.9.0
             as                30 = 28 (without [reorder Beng..]) = 23
             az                30 = 24 ("standard" wo [reorder Latn Cyrl])
             be                30 = 28 (without [reorder Cyrl])
             bn                30 = 28 ("standard" wo [reorder Beng..]) = 2.0.1
             bs                30 = 28 (type="standard": [import hr])
             bs_Cyrl           30 = 28 (type="standard": [import sr])
             ca                30 = 23 (alt="proposed" type="standard")
             cs                30 = 1.8.1 (type="standard")
             cu                34 = 30 (without [reorder Cyrl])
             cy                30 = 1.8.1
             da                22.1 = 1.8.1 (type="standard")
             de__phonebook     30 = 2.0 (type="phonebook")
             de_AT_phonebook   30 = 27 (type="phonebook")
             dsb               30 = 26
             ee                30 = 21
             eo                30 = 1.8.1
             es                30 = 1.9.0 (type="standard")
             es__traditional   30 = 1.8.1 (type="traditional")
             et                30 = 26
             fa                22.1 = 1.8.1
             fi                22.1 = 1.8.1 (type="standard" alt="proposed")
             fi__phonebook     22.1 = 1.8.1 (type="phonebook")
             fil               30 = 1.9.0 (type="standard") = 1.8.1
             fo                22.1 = 1.8.1 (alt="proposed" type="standard")
             fr_CA             30 = 1.9.0
             gu                30 = 28 ("standard" wo [reorder Gujr..]) = 1.9.0
             ha                30 = 1.9.0
             haw               30 = 24
             he                30 = 28 (without [reorder Hebr]) = 23
             hi                30 = 28 (without [reorder Deva..]) = 1.9.0
             hr                30 = 28 ("standard" wo [reorder Latn Cyrl]) = 1.9.0
             hu                22.1 = 1.8.1 (alt="proposed" type="standard")
             hy                30 = 28 (without [reorder Armn]) = 1.8.1
             ig                30 = 1.8.1
             is                22.1 = 1.8.1 (type="standard")
             ja                22.1 = 1.8.1 (type="standard")
             kk                30 = 28 (without [reorder Cyrl])
             kl                22.1 = 1.8.1 (type="standard")
             kn                30 = 28 ("standard" wo [reorder Knda..]) = 1.9.0
             ko                22.1 = 1.8.1 (type="standard")
             kok               30 = 28 (without [reorder Deva..]) = 1.8.1
             lkt               30 = 25
             ln                30 = 2.0 (type="standard") = 1.8.1
             lt                22.1 = 1.9.0
             lv                22.1 = 1.9.0 (type="standard") = 1.8.1
             mk                30 = 28 (without [reorder Cyrl])
             ml                22.1 = 1.9.0
             mr                30 = 28 (without [reorder Deva..]) = 1.8.1
             mt                22.1 = 1.9.0
             nb                22.1 = 2.0   (type="standard")
             nn                22.1 = 2.0   (type="standard")
             nso           [*] 26 = 1.8.1
             om                22.1 = 1.8.1
             or                30 = 28 (without [reorder Orya..]) = 1.9.0
             pa                22.1 = 1.8.1
             pl                30 = 1.8.1
             ro                30 = 1.9.0 (type="standard")
             sa            [*] 1.9.1 = 1.8.1 (type="standard" alt="proposed")
             se                22.1 = 1.8.1 (type="standard")
             si                30 = 28 ("standard" wo [reorder Sinh..]) = 1.9.0
             si__dictionary    30 = 28 ("dictionary" wo [reorder Sinh..]) = 1.9.0
             sk                22.1 = 1.9.0 (type="standard")
             sl                22.1 = 1.8.1 (type="standard" alt="proposed")
             sq                22.1 = 1.8.1 (alt="proposed" type="standard")
             sr                30 = 28 (without [reorder Cyrl])
             sr_Latn           30 = 28 (type="standard": [import hr])
             sv                22.1 = 1.9.0 (type="standard")
             sv__reformed      22.1 = 1.8.1 (type="reformed")
             ta                22.1 = 1.9.0
             te                30 = 28 (without [reorder Telu..]) = 1.9.0
             th                22.1 = 22
             tn            [*] 26 = 1.8.1
             to                22.1 = 22
             tr                22.1 = 1.8.1 (type="standard")
             uk                30 = 28 (without [reorder Cyrl])
             ug_Cyrl           https://en.wikipedia.org/wiki/Uyghur_Cyrillic_alphabet
             ur                22.1 = 1.9.0
             vi                22.1 = 1.8.1
             vo                30 = 25
             wae               30 = 2.0
             wo            [*] 1.9.1 = 1.8.1
             yo                30 = 1.8.1
             zh                22.1 = 1.8.1 (type="standard")
             zh__big5han       22.1 = 1.8.1 (type="big5han")
             zh__gb2312han     22.1 = 1.8.1 (type="gb2312han")
             zh__pinyin        22.1 = 2.0   (type='pinyin' alt='short')
             zh__stroke        22.1 = 1.9.1 (type='stroke' alt='short')
             zh__zhuyin        22.1 = 22    (type='zhuyin' alt='short')
           --------------------------------------------------------------------

       [*] http://www.unicode.org/repos/cldr/tags/latest/seed/collation/

AUTHOR

       The Unicode::Collate::Locale module for perl was written by SADAHIRO
       Tomoyuki, <SADAHIRO@cpan.org>.  This module is Copyright(C) 2004-2020,
       SADAHIRO Tomoyuki. Japan.  All rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

SEE ALSO

       Unicode Collation Algorithm - UTS #10
           <http://www.unicode.org/reports/tr10/>

       The Default Unicode Collation Element Table (DUCET)
           <http://www.unicode.org/Public/UCA/latest/allkeys.txt>

       Unicode Locale Data Markup Language (LDML) - UTS #35
           <http://www.unicode.org/reports/tr35/>

       CLDR - Unicode Common Locale Data Repository
           <http://cldr.unicode.org/>

       Unicode::Collate
       Unicode::Normalize

perl v5.32.0                      2020-09-29                Collate::Locale(3)