Collate::Locale(3)    User Contributed Perl Documentation   Collate::Locale(3)

NAME

       Unicode::Collate::Locale - Linguistic tailoring for DUCET via
       Unicode::Collate

SYNOPSIS

         use Unicode::Collate::Locale;

         #construct
         $Collator = Unicode::Collate::Locale->
             new(locale => $locale_name, %tailoring);

         #sort
         @sorted = $Collator->sort(@not_sorted);

         #compare
         $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

       Note: Strings in @not_sorted, $a and $b are interpreted according to
       Perl's Unicode support (see perlunicode, perluniintro, perlunitut,
       perlunifaq, utf8).  If they are not already decoded character
       strings, either decode them beforehand or use the "preprocess"
       parameter (cf. "Unicode::Collate").

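       A minimal sketch of the decoding advice above.  The sample octet
       strings and the choice of the 'es' locale are illustrative
       assumptions; Encode's "decode" is shown as one way to obtain
       character strings before collation.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Collate::Locale;

# Raw UTF-8 octets, as they might arrive from a file or a socket.
# "\xC3\xB1" is the UTF-8 encoding of U+00F1 (n with tilde).
my @octets = ("\xC3\xB1u", "nz", "nube");

# Decode the octets into Perl character strings before collation.
my @chars = map { decode('UTF-8', $_) } @octets;

my $collator = Unicode::Collate::Locale->new(locale => 'es');
my @sorted   = $collator->sort(@chars);

# Encode on output so that the non-ASCII letter prints correctly.
binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @sorted;
```

       Sorting the undecoded octets instead would compare each byte as a
       separate character, so the Spanish tailoring could not apply.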

DESCRIPTION

       This module provides linguistic tailoring for DUCET, building on
       "Unicode::Collate".

   Constructor
       The "new" method returns a collator object.

       The constructor takes a parameter list in hash form.  It may include
       the special key "locale", whose value (case-insensitive) is a Unicode
       base language code (two or three letters).  For example,
       "Unicode::Collate::Locale->new(locale => 'ES')" returns a collator
       tailored for Spanish.

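       To illustrate what such a collator changes, the sketch below sorts
       the same words with the default rules and with the Spanish
       tailoring.  The word list is made up for the example, and 'en' is
       used only because it resolves to the default rules.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# "\x{F1}" is U+00F1, LATIN SMALL LETTER N WITH TILDE.
my @words = ("\x{F1}u", "nz", "nube");

my $default = Unicode::Collate::Locale->new(locale => 'en');
my $spanish = Unicode::Collate::Locale->new(locale => 'ES');

# Default rules: n-with-tilde differs from n only by its accent.
my @by_default = $default->sort(@words);

# Spanish: n-with-tilde is a separate letter that follows n.
my @by_spanish = $spanish->sort(@words);

binmode(STDOUT, ':encoding(UTF-8)');
print "default: @by_default\n";
print "spanish: @by_spanish\n";
```
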
       $locale_name may be suffixed with a Unicode script code (four-letter),
       a Unicode region (territory) code, or a Unicode language variant code.
       These codes are case-insensitive and separated with '_' or '-'.  E.g.
       "en_US" for English in USA, "az_Cyrl" for Azerbaijani in the Cyrillic
       script, "es_ES_traditional" for Spanish in Spain (Traditional).

       If $locale_name is not available, a fallback is selected in the
       following order:

           1. language with a variant code
           2. language with a script code
           3. language with a region code
           4. language
           5. default

       Tailoring tags provided by "Unicode::Collate" are allowed as long as
       they are not used for "locale" support.  In particular, the "table"
       tag can never be tailored, since it is reserved for DUCET.

       However, "entry" is allowed even if it is used for "locale" support,
       so that mappings can be added or overridden.

       E.g. a collator for Spanish that ignores diacritics and case
       differences (i.e. level 1), with reversed case ordering and no
       normalization:

           Unicode::Collate::Locale->new(
               level => 1,
               locale => 'es',
               upper_before_lower => 1,
               normalization => undef
           )

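       A short sketch of how the collator above behaves.  The test strings
       are illustrative; they use precomposed characters written as
       escapes, since normalization is switched off.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $collator = Unicode::Collate::Locale->new(
    level              => 1,
    locale             => 'es',
    upper_before_lower => 1,
    normalization      => undef,
);

# At level 1, case and accents are ignored: "\x{C1}" is A with acute.
my $same_word = $collator->eq("\x{C1}rbol", "arbol");

# The Spanish tailoring still makes n-with-tilde (U+00F1) a distinct
# letter that sorts after n, even at level 1.
my $enye_vs_n = $collator->cmp("\x{F1}", "n");

print "A-acute 'rbol' equals 'arbol': ", ($same_word ? "yes" : "no"), "\n";
print "cmp(n-with-tilde, n) = $enye_vs_n\n";
```
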
       Overriding a behavior already tailored by "locale" is disallowed when
       such a tailoring is passed to "new()".

           Unicode::Collate::Locale->new(
               locale => 'da',
               upper_before_lower => 0, # causes error as reserved by 'da'
           )

       However, "change()", inherited from "Unicode::Collate", allows
       a tailoring that is reserved by "locale" to be changed.  Examples:

           new(locale => 'fr_ca')->change(backwards => undef)
           new(locale => 'da')->change(upper_before_lower => 0)
           new(locale => 'ja')->change(overrideCJK => undef)

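       The difference between the two routes can be sketched as follows,
       assuming the error from "new()" is raised as an exception that
       "eval" can trap.  The variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Passing a tag reserved by 'da' to new() is expected to fail.
my $rejected = eval {
    Unicode::Collate::Locale->new(locale => 'da', upper_before_lower => 0);
};
print "new() refused the override: $@" unless defined $rejected;

# The same setting can be applied afterwards through change().
my $danish = Unicode::Collate::Locale->new(locale => 'da');
my $before = $danish->cmp("A", "a");    # Danish puts uppercase first
$danish->change(upper_before_lower => 0);
my $after  = $danish->cmp("A", "a");    # now lowercase first

print "before change: $before, after change: $after\n";
```
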
   Methods
       "Unicode::Collate::Locale" is a subclass of "Unicode::Collate" and
       methods other than "new" are inherited from "Unicode::Collate".

       Here is a list of additional methods:

       "$Collator->getlocale"
           Returns the language code that was accepted and is actually used
           for collation.  If linguistic tailoring is not provided for the
           language code you passed (intentionally for some languages, or
           due to the incomplete implementation), this method returns the
           string 'default', meaning no special tailoring.

       "$Collator->locale_version"
           (Since Unicode::Collate::Locale 0.87) Returns the version number
           (perhaps "/\d\.\d\d/") of the locale, as that of Locale/*.pl.

           Note: Locale/*.pl that a collator uses should be identified by a
           combination of return values from "getlocale" and "locale_version".

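       A brief sketch of both methods used together to identify the
       tailoring in effect.  The two locale names are arbitrary samples,
       and the version is printed as reported, with 'n/a' shown if none
       is returned.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my %accepted;
for my $name (qw(sv_SE en_GB)) {
    my $collator = Unicode::Collate::Locale->new(locale => $name);

    # Locale actually used, which may differ from the requested name.
    $accepted{$name} = $collator->getlocale;

    # Version of the locale data; guard against an undefined value.
    my $version = $collator->locale_version;
    printf "%-6s uses %-8s (locale version %s)\n",
        $name, $accepted{$name}, defined $version ? $version : 'n/a';
}
```
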
   A list of tailorable locales
             locale name       description
           --------------------------------------------------------------
             af                Afrikaans
             ar                Arabic
             as                Assamese
             az                Azerbaijani (Azeri)
             be                Belarusian
             bn                Bengali
             bs                Bosnian (tailored as Croatian)
             bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
             ca                Catalan
             cs                Czech
             cy                Welsh
             da                Danish
             de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
             de_AT_phonebook   Austrian German (umlaut primary greater)
             dsb               Lower Sorbian
             ee                Ewe
             eo                Esperanto
             es                Spanish
             es__traditional   Spanish ('ch' and 'll' as a grapheme)
             et                Estonian
             fa                Persian
             fi                Finnish (v and w are primary equal)
             fi__phonebook     Finnish (v and w as separate characters)
             fil               Filipino
             fo                Faroese
             fr_CA             Canadian French
             gu                Gujarati
             ha                Hausa
             haw               Hawaiian
             he                Hebrew
             hi                Hindi
             hr                Croatian
             hu                Hungarian
             hy                Armenian
             ig                Igbo
             is                Icelandic
             ja                Japanese [1]
             kk                Kazakh
             kl                Kalaallisut
             kn                Kannada
             ko                Korean [2]
             kok               Konkani
             lkt               Lakota
             ln                Lingala
             lt                Lithuanian
             lv                Latvian
             mk                Macedonian
             ml                Malayalam
             mr                Marathi
             mt                Maltese
             nb                Norwegian Bokmal
             nn                Norwegian Nynorsk
             nso               Northern Sotho
             om                Oromo
             or                Oriya
             pa                Punjabi
             pl                Polish
             ro                Romanian
             sa                Sanskrit
             se                Northern Sami
             si                Sinhala
             si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
             sk                Slovak
             sl                Slovenian
             sq                Albanian
             sr                Serbian
             sr_Latn           Serbian in Latin (tailored as Croatian)
             sv                Swedish (v and w are primary equal)
             sv__reformed      Swedish (v and w as separate characters)
             ta                Tamil
             te                Telugu
             th                Thai
             tn                Tswana
             to                Tonga
             tr                Turkish
             ug_Cyrl           Uyghur in Cyrillic
             uk                Ukrainian
             ur                Urdu
             vi                Vietnamese
             vo                Volapük
             wae               Walser
             wo                Wolof
             yo                Yoruba
             zh                Chinese
             zh__big5han       Chinese (ideographs: big5 order)
             zh__gb2312han     Chinese (ideographs: GB-2312 order)
             zh__pinyin        Chinese (ideographs: pinyin order) [3]
             zh__stroke        Chinese (ideographs: stroke order) [3]
             zh__zhuyin        Chinese (ideographs: zhuyin order) [3]
           --------------------------------------------------------------

       Locales according to the default UCA rules include am (Amharic) without
       "[reorder Ethi]", bg (Bulgarian) without "[reorder Cyrl]", chr
       (Cherokee) without "[reorder Cher]", de (German), en (English), fr
       (French), ga (Irish), id (Indonesian), it (Italian), ka (Georgian)
       without "[reorder Geor]", mn (Mongolian) without "[reorder Cyrl Mong]",
       ms (Malay), nl (Dutch), pt (Portuguese), ru (Russian) without "[reorder
       Cyrl]", sw (Swahili), zu (Zulu).

       Note

       [1] ja: Ideographs are sorted in JIS X 0208 order.  Fullwidth and
       halfwidth forms are identical to their regular form.  The difference
       between hiragana and katakana is at the 4th level; comparing them also
       requires "(variable => 'Non-ignorable')", and "katakana_before_hiragana"
       then has no effect.

       [2] ko: Plenty of ideographs are sorted by their reading. Such an
       ideograph is primary (level 1) equal to, and secondary (level 2)
       greater than, the corresponding hangul syllable.

       [3] zh__pinyin, zh__stroke and zh__zhuyin: implemented with
       alt='short', where a smaller number of ideographs are tailored.

   A list of variant codes and their aliases
             variant code       alias
           ------------------------------------------
             dictionary         dict
             phonebook          phone     phonebk
             reformed           reform
             traditional        trad
           ------------------------------------------
             big5han            big5
             gb2312han          gb2312
             pinyin
             stroke
             zhuyin
           ------------------------------------------

       Note: 'pinyin' is Han in Latin, 'zhuyin' is Han in Bopomofo.

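       As a small sketch of the alias table above, a variant alias and its
       full variant code should select the same tailoring.  The
       'de-phone' and 'de-phonebook' spellings are just one sample pair.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Request the German phonebook tailoring by alias and by full code.
my $by_alias = Unicode::Collate::Locale->new(locale => 'de-phone');
my $by_code  = Unicode::Collate::Locale->new(locale => 'de-phonebook');

my $alias_locale = $by_alias->getlocale;
my $code_locale  = $by_code->getlocale;

print "de-phone     => $alias_locale\n";
print "de-phonebook => $code_locale\n";
```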

INSTALL

       Installation of "Unicode::Collate::Locale" requires Collate/Locale.pm,
       Collate/Locale/*.pm, Collate/CJK/*.pm and Collate/allkeys.txt.  On
       building, "Unicode::Collate::Locale" doesn't require any of data/*.txt,
       gendata/*, and mklocale.  Tests for "Unicode::Collate::Locale" are
       named t/loc_*.t.

CAVEAT

       Tailoring is not maximum
           Even if a certain letter is tailored, its equivalents are not
           always tailored along with it.  For example, even though W is
           tailored, fullwidth W ("U+FF37"), W with acute ("U+1E82"), etc. are
           not tailored.  The result may depend on whether source strings are
           normalized or not, and whether decomposed or composed.  Thus
           "(normalization => undef)" is less preferred.

       Collation reordering is not supported
           The order of any groups including scripts is not changed.

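       The normalization caveat can be sketched as below.  With the
       default normalization, composed and decomposed spellings of the
       same letter compare equal; without it, the outcome depends on which
       forms the locale happens to tailor, so it is only printed here.
       The 'es' locale and the sample letter are illustrative choices.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $composed   = "\x{F1}";      # U+00F1, n with tilde (precomposed)
my $decomposed = "n\x{303}";    # n followed by U+0303 COMBINING TILDE

# Default normalization: both spellings are normalized before comparison.
my $normalizing = Unicode::Collate::Locale->new(locale => 'es');
my $equal_when_normalized = $normalizing->eq($composed, $decomposed);

# No normalization: the strings are compared exactly as given.
my $raw = Unicode::Collate::Locale->new(
    locale        => 'es',
    normalization => undef,
);
my $raw_result = $raw->cmp($composed, $decomposed);

print "with normalization: ",
    ($equal_when_normalized ? "equal" : "different"), "\n";
print "without normalization: cmp = $raw_result\n";
```
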
   Reference
             locale            based on CLDR or other reference
           --------------------------------------------------------------------
             af                30 = 1.8.1
             ar                30 = 28 ("compat" wo [reorder Arab]) = 1.9.0
             as                30 = 28 (without [reorder Beng..]) = 23
             az                30 = 24 ("standard" wo [reorder Latn Cyrl])
             be                30 = 28 (without [reorder Cyrl])
             bn                30 = 28 ("standard" wo [reorder Beng..]) = 2.0.1
             bs                30 = 28 (type="standard": [import hr])
             bs_Cyrl           30 = 28 (type="standard": [import sr])
             ca                30 = 23 (alt="proposed" type="standard")
             cs                30 = 1.8.1 (type="standard")
             cy                30 = 1.8.1
             da                22.1 = 1.8.1 (type="standard")
             de__phonebook     30 = 2.0 (type="phonebook")
             de_AT_phonebook   30 = 27 (type="phonebook")
             dsb               30 = 26
             ee                30 = 21
             eo                30 = 1.8.1
             es                30 = 1.9.0 (type="standard")
             es__traditional   30 = 1.8.1 (type="traditional")
             et                30 = 26
             fa                22.1 = 1.8.1
             fi                22.1 = 1.8.1 (type="standard" alt="proposed")
             fi__phonebook     22.1 = 1.8.1 (type="phonebook")
             fil               30 = 1.9.0 (type="standard") = 1.8.1
             fo                22.1 = 1.8.1 (alt="proposed" type="standard")
             fr_CA             30 = 1.9.0
             gu                30 = 28 ("standard" wo [reorder Gujr..]) = 1.9.0
             ha                30 = 1.9.0
             haw               30 = 24
             he                30 = 28 (without [reorder Hebr]) = 23
             hi                30 = 28 (without [reorder Deva..]) = 1.9.0
             hr                30 = 28 ("standard" wo [reorder Latn Cyrl]) = 1.9.0
             hu                22.1 = 1.8.1 (alt="proposed" type="standard")
             hy                30 = 28 (without [reorder Armn]) = 1.8.1
             ig                30 = 1.8.1
             is                22.1 = 1.8.1 (type="standard")
             ja                22.1 = 1.8.1 (type="standard")
             kk                30 = 28 (without [reorder Cyrl])
             kl                22.1 = 1.8.1 (type="standard")
             kn                30 = 28 ("standard" wo [reorder Knda..]) = 1.9.0
             ko                22.1 = 1.8.1 (type="standard")
             kok               30 = 28 (without [reorder Deva..]) = 1.8.1
             lkt               30 = 25
             ln                30 = 2.0 (type="standard") = 1.8.1
             lt                22.1 = 1.9.0
             lv                22.1 = 1.9.0 (type="standard") = 1.8.1
             mk                30 = 28 (without [reorder Cyrl])
             ml                22.1 = 1.9.0
308             lt                22.1 = 1.9.0
309             lv                22.1 = 1.9.0 (type="standard") = 1.8.1
310             mk                30 = 28 (without [reorder Cyrl])
311             ml                22.1 = 1.9.0
312             mr                30 = 28 (without [reorder Deva..]) = 1.8.1
313             mt                22.1 = 1.9.0
314             nb                22.1 = 2.0   (type="standard")
315             nn                22.1 = 2.0   (type="standard")
316             nso           [*] 26 = 1.8.1
317             om                22.1 = 1.8.1
318             or                30 = 28 (without [reorder Orya..]) = 1.9.0
319             pa                22.1 = 1.8.1
320             pl                30 = 1.8.1
321             ro                30 = 1.9.0 (type="standard")
322             sa            [*] 1.9.1 = 1.8.1 (type="standard" alt="proposed")
323             se                22.1 = 1.8.1 (type="standard")
324             si                30 = 28 ("standard" wo [reorder Sinh..]) = 1.9.0
325             si__dictionary    30 = 28 ("dictionary" wo [reorder Sinh..]) = 1.9.0
326             sk                22.1 = 1.9.0 (type="standard")
327             sl                22.1 = 1.8.1 (type="standard" alt="proposed")
328             sq                22.1 = 1.8.1 (alt="proposed" type="standard")
329             sr                30 = 28 (without [reorder Cyrl])
330             sr_Latn           30 = 28 (type="standard": [import hr])
331             sv                22.1 = 1.9.0 (type="standard")
332             sv__reformed      22.1 = 1.8.1 (type="reformed")
333             ta                22.1 = 1.9.0
334             te                30 = 28 (without [reorder Telu..]) = 1.9.0
335             th                22.1 = 22
336             tn            [*] 26 = 1.8.1
337             to                22.1 = 22
338             tr                22.1 = 1.8.1 (type="standard")
339             uk                30 = 28 (without [reorder Cyrl])
340             ug_Cyrl           https://en.wikipedia.org/wiki/Uyghur_Cyrillic_alphabet
341             ur                22.1 = 1.9.0
342             vi                22.1 = 1.8.1
343             vo                30 = 25
344             wae               30 = 2.0
345             wo            [*] 1.9.1 = 1.8.1
346             yo                30 = 1.8.1
347             zh                22.1 = 1.8.1 (type="standard")
348             zh__big5han       22.1 = 1.8.1 (type="big5han")
349             zh__gb2312han     22.1 = 1.8.1 (type="gb2312han")
350             zh__pinyin        22.1 = 2.0   (type='pinyin' alt='short')
351             zh__stroke        22.1 = 1.9.1 (type='stroke' alt='short')
352             zh__zhuyin        22.1 = 22    (type='zhuyin' alt='short')
353           --------------------------------------------------------------------
354
355       [*] http://www.unicode.org/repos/cldr/tags/latest/seed/collation/
356

AUTHOR

       The Unicode::Collate::Locale module for perl was written by SADAHIRO
       Tomoyuki, <SADAHIRO@cpan.org>.  This module is Copyright(C) 2004-2017,
       SADAHIRO Tomoyuki. Japan.  All rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

SEE ALSO

       Unicode Collation Algorithm - UTS #10
           <http://www.unicode.org/reports/tr10/>

       The Default Unicode Collation Element Table (DUCET)
           <http://www.unicode.org/Public/UCA/latest/allkeys.txt>

       Unicode Locale Data Markup Language (LDML) - UTS #35
           <http://www.unicode.org/reports/tr35/>

       CLDR - Unicode Common Locale Data Repository
           <http://cldr.unicode.org/>

       Unicode::Collate
       Unicode::Normalize

perl v5.26.3                      2017-11-22                Collate::Locale(3)