Collate(3)            User Contributed Perl Documentation           Collate(3)

NAME
       Unicode::Collate - Unicode Collation Algorithm

SYNOPSIS
         use Unicode::Collate;

         #construct
         $Collator = Unicode::Collate->new(%tailoring);

         #sort
         @sorted = $Collator->sort(@not_sorted);

         #compare
         $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

       Note: Strings in @not_sorted, $a and $b are interpreted according to
       Perl's Unicode support. See perlunicode, perluniintro, perlunitut,
       perlunifaq, utf8.  If the strings are not already decoded, use
       "preprocess" or decode them beforehand.

DESCRIPTION
       This module is an implementation of Unicode Technical Standard #10
       (a.k.a. UTS #10) - Unicode Collation Algorithm (a.k.a. UCA).

   Constructor and Tailoring
       The "new" method returns a collator object. If new() is called with no
       parameters, the collator performs the default collation.

          $Collator = Unicode::Collate->new(
             UCA_Version => $UCA_Version,
             alternate => $alternate, # alias for 'variable'
             backwards => $levelNumber, # or \@levelNumbers
             entry => $element,
             hangul_terminator => $term_primary_weight,
             highestFFFF => $bool,
             identical => $bool,
             ignoreName => qr/$ignoreName/,
             ignoreChar => qr/$ignoreChar/,
             ignore_level2 => $bool,
             katakana_before_hiragana => $bool,
             level => $collationLevel,
             long_contraction => $bool,
             minimalFFFE => $bool,
             normalization  => $normalization_form,
             overrideCJK => \&overrideCJK,
             overrideHangul => \&overrideHangul,
             preprocess => \&preprocess,
             rearrange => \@charList,
             rewrite => \&rewrite,
             suppress => \@charList,
             table => $filename,
             undefName => qr/$undefName/,
             undefChar => qr/$undefChar/,
             upper_before_lower => $bool,
             variable => $variable,
          );

       UCA_Version
           If the revision (previously "tracking version") number of UCA is
           given, the behavior of that revision is emulated when collating.
           If omitted, the return value of "UCA_Version()" is used.

           The following revisions are supported.  The default is 43.

                UCA       Unicode Standard         DUCET (@version)
              -------------------------------------------------------
                 8              3.1                3.0.1 (3.0.1d9)
                 9     3.1 with Corrigendum 3      3.1.1
                11             4.0.0
                14             4.1.0
                16             5.0.0
                18             5.1.0
                20             5.2.0
                22             6.0.0
                24             6.1.0
                26             6.2.0
                28             6.3.0
                30             7.0.0
                32             8.0.0
                34             9.0.0
                36            10.0.0
                38            11.0.0
                40            12.0.0
                41            12.1.0
                43            13.0.0

           * See below for "long_contraction" with "UCA_Version" 22 and 24.

           * Noncharacters (e.g. U+FFFF) are not ignored, and can be
           overridden since "UCA_Version" 22.

           * Out-of-range codepoints (greater than U+10FFFF) are not ignored,
           and can be overridden since "UCA_Version" 22.

           * Fully ignorable characters were ignored, and would not interrupt
           contractions with "UCA_Version" 9 and 11.

           * Treatment of ignorables after variables and some behaviors were
           changed at "UCA_Version" 9.

           * Characters regarded as CJK unified ideographs (cf. "overrideCJK")
           depend on "UCA_Version".

           * Many hangul jamo are assigned at "UCA_Version" 20, which
           affects "hangul_terminator".

       alternate
           -- see 3.2.2 Alternate Weighting, version 8 of UTS #10

           For backward compatibility, "alternate" (old name) can be used as
           an alias for "variable".

       backwards
           -- see 3.4 Backward Accents, UTS #10.

                backwards => $levelNumber or \@levelNumbers

           Weights at the given level(s) are compared in reverse order; e.g.
           level 2 (diacritic ordering) in French.  If omitted (or
           $levelNumber is "undef" or "\@levelNumbers" is "[]"), all levels
           are compared forwards.
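
           As a minimal sketch of the effect, assuming the default DUCET
           is available, the four classic French test words can be sorted
           with and without "backwards".  The variable names are
           illustrative only.

```perl
use strict;
use warnings;
use Unicode::Collate;

# cote, cote-acute, o-circumflex variants, written with escapes.
my @words = ("cot\x{E9}", "c\x{F4}te", "cote", "c\x{F4}t\x{E9}");

my $default = Unicode::Collate->new();                # all levels forwards
my $french  = Unicode::Collate->new(backwards => 2);  # level 2 reversed

my @forward  = $default->sort(@words);
my @backward = $french->sort(@words);

binmode(STDOUT, ':encoding(UTF-8)');
print "forward:  @forward\n";
print "backward: @backward\n";
```

           With level 2 reversed, the accent nearest the end of the word
           decides first, so "côte" sorts before "coté".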

       entry
           -- see 5 Tailoring; 9.1 Allkeys File Format, UTS #10.

           If the same character (or a sequence of characters) exists in the
           collation element table through "table", mapping to collation
           elements is overridden.  If it does not exist, the mapping is
           defined additionally.

               entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
           0063 0068 ; [.0E6A.0020.0002.0063] # ch
           0043 0068 ; [.0E6A.0020.0007.0043] # Ch
           0043 0048 ; [.0E6A.0020.0008.0043] # CH
           006C 006C ; [.0F4C.0020.0002.006C] # ll
           004C 006C ; [.0F4C.0020.0007.004C] # Ll
           004C 004C ; [.0F4C.0020.0008.004C] # LL
           00F1      ; [.0F7B.0020.0002.00F1] # n-tilde
           006E 0303 ; [.0F7B.0020.0002.00F1] # n-tilde
           00D1      ; [.0F7B.0020.0008.00D1] # N-tilde
           004E 0303 ; [.0F7B.0020.0008.00D1] # N-tilde
           ENTRY

               entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
           00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae ligature as <a><e>
           00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE ligature as <A><E>
           ENTRY

           NOTE: The code point in the UCA file format (before ';') must be a
           Unicode code point written in hexadecimal, not a native code
           point.  So 0063 always denotes "U+0063", not the character
           "\x63".

           Weighting may vary depending on the collation element table.  So
           ensure that the weights defined in "entry" are consistent with
           those in the collation element table loaded via "table".

           In DUCET v4.0.0, the primary weight of "C" is 0E60 and that of "D"
           is "0E6D". So setting the primary weight of "CH" to "0E6A" (a
           value between 0E60 and "0E6D") yields the ordering "C < CH < D".
           Strictly speaking, DUCET already has some characters between "C"
           and "D": "small capital C" ("U+1D04") with primary weight 0E64,
           "c-hook/C-hook" ("U+0188/U+0187") with 0E65, and "c-curl"
           ("U+0255") with 0E69.  So the primary weight "0E6A" for "CH"
           places "CH" between "c-curl" and "D".
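
           The following sketch shows the same idea on a tiny,
           self-contained table, so that it does not depend on the
           weights of any particular DUCET version.  The weights 0103 to
           0107 are arbitrary values chosen for illustration, not taken
           from DUCET.

```perl
use strict;
use warnings;
use Unicode::Collate;

# A toy table: "ch" is a contraction weighted between "c" and "d".
# All weights below are made up for this example.
my $coll = Unicode::Collate->new(
    table => undef,
    entry => <<'ENTRY',
0063      ; [.0103.0020.0002.0063] # c
0063 0068 ; [.0104.0020.0002.0063] # ch (contraction)
0064      ; [.0105.0020.0002.0064] # d
0068      ; [.0106.0020.0002.0068] # h
0069      ; [.0107.0020.0002.0069] # i
ENTRY
);

my @sorted = $coll->sort(qw(d ch ci c));
print "@sorted\n";
```

           Because "ch" is treated as a single element heavier than "c",
           "ci" sorts before "ch" even though "h" alone is lighter than
           "i".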

       hangul_terminator
           -- see 7.1.4 Trailing Weights, UTS #10.

           If a true value is given (non-zero but should be positive), it will
           be added as a terminator primary weight to the end of every
           standard Hangul syllable. Secondary and any higher weights for
           terminator are set to zero.  If the value is false or
           "hangul_terminator" key does not exist, insertion of terminator
           weights will not be performed.

           Boundaries of Hangul syllables are determined according to
           conjoining Jamo behavior in the Unicode Standard and
           HangulSyllableType.txt.

           Implementation Note: (1) For expansion mapping (Unicode character
           mapped to a sequence of collation elements), a terminator will not
           be added between collation elements, even if Hangul syllable
           boundary exists there.  Addition of terminator is restricted to the
           next position to the last collation element.

           (2) Non-conjoining Hangul letters (Compatibility Jamo, halfwidth
           Jamo, and enclosed letters) are not automatically terminated with a
           terminator primary weight.  These characters may need terminator
           included in a collation element table beforehand.

       highestFFFF
           -- see 2.4 Tailored noncharacter weights, UTS #35 (LDML) Part 5:
           Collation.

           If the parameter is made true, "U+FFFF" has the highest primary
           weight.  When both "$coll->ge($str, "abc")" and
           "$coll->le($str, "abc\x{FFFF}")" are true, $str is expected to
           begin with "abc" or a primary equivalent of it.  $str may be
           "abcd" or "abc012", but should not include "U+FFFF" itself, as
           in "abc\x{FFFF}xyz".

           "$coll->le($str, "abc\x{FFFF}")" works almost like
           "$coll->lt($str, "abd")", but the latter requires you to know
           which letter comes next after "c". In a language where "ch" is
           the letter after "c", "abch" is greater than "abc\x{FFFF}", but
           less than "abd".

           Note: This is equivalent to "(entry => 'FFFF ;
           [.FFFE.0020.0005.FFFF]')".  Any character other than "U+FFFF"
           can be tailored by "entry".
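
           A minimal sketch of the range test described above, assuming
           the default DUCET.  The helper name "starts_with_abc" is
           hypothetical.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $coll = Unicode::Collate->new(highestFFFF => 1);

# True when $str lies between "abc" and "abc" followed by U+FFFF,
# i.e. when it begins with "abc" or a primary equivalent.
sub starts_with_abc {
    my $str = shift;
    return $coll->ge($str, "abc") && $coll->le($str, "abc\x{FFFF}");
}

my @hits = grep { starts_with_abc($_) } qw(ab abc abcd abc012 abd);
print "@hits\n";
```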

       identical
           -- see A.3 Deterministic Comparison, UTS #10.

           By default, strings whose weights are equal are treated as equal,
           even though their code points are not equal.  Completely ignorable
           characters are ignored.

           If the parameter is made true, a final, tie-breaking level is used.
           If no difference of weights is found after the comparison through
           all the levels specified by "level", the comparison with code
           points will be performed.  For the tie-breaking comparison, the
           sort key has code points of the original string appended.
           Completely ignorable characters are not ignored.

           If "preprocess" and/or "normalization" is applied, the code points
           of the string after them (in NFD by default) are used.

       ignoreChar
       ignoreName
           -- see 3.6 Variable Weighting, UTS #10.

           Makes the entry in the table completely ignorable; i.e. as if the
           weights were zero at all levels.

           Through "ignoreChar", any character matching "qr/$ignoreChar/" will
           be ignored. Through "ignoreName", any character whose name (given
           in the "table" file as a comment) matches "qr/$ignoreName/" will be
           ignored.

           E.g. when 'a' and 'e' are ignorable, 'element' is equal to 'lament'
           (or 'lmnt').
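
           The 'element'/'lament' case can be sketched as follows.  This
           assumes that allkeys.txt is installed, since "ignoreChar"
           makes the module read the table file at run time.  The
           pattern is anchored here so that it only matches the single
           characters 'a' and 'e'.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Make the table entries for 'a' and 'e' completely ignorable.
my $coll = Unicode::Collate->new(ignoreChar => qr/^[ae]$/);

my $same  = $coll->eq("element", "lament");   # both reduce to "lmnt"
my $same2 = $coll->eq("element", "lmnt");

print "element eq lament: ", ($same  ? "yes" : "no"), "\n";
print "element eq lmnt:   ", ($same2 ? "yes" : "no"), "\n";
```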

       ignore_level2
           -- see 5.1 Parametric Tailoring, UTS #10.

           By default, case-sensitive comparison (that is level 3 difference)
           won't ignore accents (that is level 2 difference).

           If the parameter is made true, accents (and other primary ignorable
           characters) are ignored, even though cases are taken into account.

           NOTE: "level" should be 3 or greater.
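
           A minimal sketch, assuming the default DUCET: with "level"
           set to 3 and "ignore_level2" true, an accent should not
           matter while case still does.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $coll = Unicode::Collate->new(level => 3, ignore_level2 => 1);

# "caf\x{E9}" is "cafe" with an acute accent on the final letter.
my $accent_ignored = $coll->eq("cafe", "caf\x{E9}");
my $case_kept      = $coll->ne("cafe", "Cafe");

print "accent ignored: ", ($accent_ignored ? "yes" : "no"), "\n";
print "case kept:      ", ($case_kept      ? "yes" : "no"), "\n";
```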

       katakana_before_hiragana
           -- see 7.2 Tertiary Weight Table, UTS #10.

           By default, hiragana is before katakana.  If the parameter is made
           true, this is reversed.

           NOTE: This parameter simplemindedly assumes that any
           hiragana/katakana distinctions must occur in level 3, and their
           weights at level 3 must be same as those mentioned in 7.3.1, UTS
           #10.  If you define collation elements that violate this
           requirement, this parameter will not work correctly.

       level
           -- see 4.3 Form Sort Key, UTS #10.

           Set the maximum level.  Any higher levels than the specified one
           are ignored.

             Level 1: alphabetic ordering
             Level 2: diacritic ordering
             Level 3: case ordering
             Level 4: tie-breaking (e.g. in the case when variable is 'shifted')

             ex. level => 2,

           If omitted, the maximum is the 4th.

           NOTE: The DUCET includes weights over 0xFFFF at the 4th level.  But
           this module only uses weights within 0xFFFF.  When "variable" is
           'blanked' or 'non-ignorable' (other than 'shifted' and
           'shift-trimmed'), the level 4 may be unreliable.

           See also "identical".
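
           The levels can be sketched with three collators, assuming the
           default DUCET.  The variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $l1 = Unicode::Collate->new(level => 1);  # letters only
my $l2 = Unicode::Collate->new(level => 2);  # letters and accents
my $l3 = Unicode::Collate->new(level => 3);  # letters, accents and case

my $accented = "r\x{E9}sum\x{E9}";           # "resume" with two acute accents

my $l1_same   = $l1->eq("resume", uc $accented); # accents and case ignored
my $l2_case   = $l2->eq("resume", "RESUME");     # case ignored
my $l2_accent = $l2->ne("resume", $accented);    # accents significant
my $l3_case   = $l3->ne("resume", "RESUME");     # case significant

print "level 1 ignores accents and case: ", ($l1_same ? "yes" : "no"), "\n";
```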

       long_contraction
           -- see 3.8.2 Well-Formedness of the DUCET, 4.2 Produce Array, UTS
           #10.

           If the parameter is made true, for a contraction with three or more
           characters (here nicknamed "long contraction"), initial substrings
           will be handled.  For example, a contraction ABC, where A is a
           starter, and B and C are non-starters (character with non-zero
           combining character class), will be detected even if there is not
           AB as a contraction.

           Default: Usually false.  If "UCA_Version" is 22 or 24, and the
           value of "long_contraction" is not specified in "new()", a true
           value is set implicitly.  This is a workaround to pass Conformance
           Tests for Unicode 6.0.0 and 6.1.0.

           "change()" handles "long_contraction" explicitly only.  If
           "long_contraction" is not specified in "change()", even though
           "UCA_Version" is changed, "long_contraction" will not be changed.

           Limitation: Scanning non-starters is one-way (no back tracking).
           If AB is found but ABC is not found, other long contractions
           where the first character is A and the second is not B may not be
           found.

           Under "(normalization => undef)", detection step of discontiguous
           contractions will be skipped.

           Note: The following contractions in DUCET are not considered in
           steps S2.1.1 to S2.1.3, where they are discontiguous.

               0FB2 0F71 0F80 (TIBETAN VOWEL SIGN VOCALIC RR)
               0FB3 0F71 0F80 (TIBETAN VOWEL SIGN VOCALIC LL)

           For example "TIBETAN VOWEL SIGN VOCALIC RR" with "COMBINING TILDE
           OVERLAY" ("U+0344") is "0FB2 0344 0F71 0F80" in NFD.  In this case
           "0FB2 0F80" ("TIBETAN VOWEL SIGN VOCALIC R") is detected, instead
           of "0FB2 0F71 0F80".  Inserted 0344 makes "0FB2 0F71 0F80"
           discontiguous and lack of contraction "0FB2 0F71" prohibits "0FB2
           0F71 0F80" from being detected.

       minimalFFFE
           -- see 1.1.1 U+FFFE, UTS #35 (LDML) Part 5: Collation.

           If the parameter is made true, "U+FFFE" has a minimal primary
           weight.  The comparison between "$a1\x{FFFE}$a2" and
           "$b1\x{FFFE}$b2" first compares $a1 and $b1 at level 1, and then
           $a2 and $b2 at level 1, as follows.

                   "ab\x{FFFE}a"
                   "Ab\x{FFFE}a"
                   "ab\x{FFFE}c"
                   "Ab\x{FFFE}c"
                   "ab\x{FFFE}xyz"
                   "abc\x{FFFE}def"
                   "abc\x{FFFE}xYz"
                   "aBc\x{FFFE}xyz"
                   "abcX\x{FFFE}def"
                   "abcx\x{FFFE}xyz"
                   "b\x{FFFE}aaa"
                   "bbb\x{FFFE}a"

           Note: This is equivalent to "(entry => 'FFFE ;
           [.0001.0020.0005.FFFE]')".  Any character other than "U+FFFE"
           can be tailored by "entry".
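
           One way to use this, sketched below under the assumption that
           the default DUCET is in use, is to join the fields of a
           record with "U+FFFE" so that the first field is compared
           before the second.  The names and data are made up for
           illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @people = (["Smithson", "Al"], ["Smith", "Zed"]);

# Join last and first name with U+FFFE, which sorts below every letter.
my $field  = Unicode::Collate->new(minimalFFFE => 1);
my @by_key = map  { join " ", split /\x{FFFE}/, $_ }
             $field->sort(map { join "\x{FFFE}", @$_ } @people);

# For contrast: joined with a space, which is ignored at level 1
# under the default 'shifted' weighting.
my $plain    = Unicode::Collate->new();
my @by_space = $plain->sort(map { join " ", @$_ } @people);

print "with U+FFFE: $by_key[0]\n";
print "with space:  $by_space[0]\n";
```

           With the separator, "Smith" sorts before "Smithson" regardless
           of the first names; with a plain space, the first name leaks
           into the comparison of the last names.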

       normalization
           -- see 4.1 Normalize, UTS #10.

           If specified, strings are normalized before preparation of sort
           keys (the normalization is executed after preprocess).

           A form name "Unicode::Normalize::normalize()" accepts will be
           applied as $normalization_form.  Acceptable names include 'NFD',
           'NFC', 'NFKD', and 'NFKC'.  See "Unicode::Normalize::normalize()"
           for detail.  If omitted, 'NFD' is used.

           "normalization" is performed after "preprocess" (if defined).

           Furthermore, special values, "undef" and "prenormalized", can be
           used, though they are not concerned with
           "Unicode::Normalize::normalize()".

           If "undef" (not a string "undef") is passed explicitly as the value
           for this key, no normalization is carried out (this may make
           tailoring easier if normalization is not desired). Under
           "(normalization => undef)", only contiguous contractions are
           resolved; e.g. even if "A-ring" (and "A-ring-cedilla") is ordered
           after "Z", "A-cedilla-ring" would be primary equal to "A".  In this
           respect, "(normalization => undef, preprocess => sub { NFD(shift)
           })" is not equivalent to "(normalization => 'NFD')".

           In the case of "(normalization => "prenormalized")", no
           normalization is performed, but discontiguous contractions with
           combining characters are still handled.  Therefore
           "(normalization => 'prenormalized', preprocess => sub {
           NFD(shift) })" is equivalent to "(normalization => 'NFD')".  If
           source strings are already properly normalized, "(normalization
           => 'prenormalized')" may save the time spent on normalization.

           Unless "(normalization => undef)" is used, Unicode::Normalize is
           required (see also CAVEAT).

       overrideCJK
           -- see 7.1 Derived Collation Elements, UTS #10.

           By default, CJK unified ideographs are ordered in Unicode codepoint
           order, but those in the CJK Unified Ideographs block are less than
           those in the CJK Unified Ideographs Extension A etc.

               In the CJK Unified Ideographs block:
               U+4E00..U+9FA5 if UCA_Version is 8, 9 or 11.
               U+4E00..U+9FBB if UCA_Version is 14 or 16.
               U+4E00..U+9FC3 if UCA_Version is 18.
               U+4E00..U+9FCB if UCA_Version is 20 or 22.
               U+4E00..U+9FCC if UCA_Version is 24 to 30.
               U+4E00..U+9FD5 if UCA_Version is 32 or 34.
               U+4E00..U+9FEA if UCA_Version is 36.
               U+4E00..U+9FEF if UCA_Version is 38, 40 or 41.
               U+4E00..U+9FFC if UCA_Version is 43.

               In the CJK Unified Ideographs Extension blocks:
               Ext.A (U+3400..U+4DB5)   if UCA_Version is  8 to 41.
               Ext.A (U+3400..U+4DBF)   if UCA_Version is 43.
               Ext.B (U+20000..U+2A6D6) if UCA_Version is  8 to 41.
               Ext.B (U+20000..U+2A6DD) if UCA_Version is 43.
               Ext.C (U+2A700..U+2B734) if UCA_Version is 20 or later.
               Ext.D (U+2B740..U+2B81D) if UCA_Version is 22 or later.
               Ext.E (U+2B820..U+2CEA1) if UCA_Version is 32 or later.
               Ext.F (U+2CEB0..U+2EBE0) if UCA_Version is 36 or later.
               Ext.G (U+30000..U+3134A) if UCA_Version is 43.

           Through "overrideCJK", ordering of CJK unified ideographs
           (including extensions) can be overridden.

           ex. CJK unified ideographs in the JIS code point order.

             overrideCJK => sub {
                 my $u = shift;             # get a Unicode codepoint
                 my $b = pack('n', $u);     # to UTF-16BE
                 my $s = your_unicode_to_sjis_converter($b); # convert
                 my $n = unpack('n', $s);   # convert sjis to short
                 [ $n, 0x20, 0x2, $u ];     # return the collation element
             },

           The return value may be an arrayref of 1st to 4th weights as shown
           above. The return value may be an integer as the primary weight as
           shown below.  If "undef" is returned, the default derived collation
           element will be used.

             overrideCJK => sub {
                 my $u = shift;             # get a Unicode codepoint
                 my $b = pack('n', $u);     # to UTF-16BE
                 my $s = your_unicode_to_sjis_converter($b); # convert
                 my $n = unpack('n', $s);   # convert sjis to short
                 return $n;                 # return the primary weight
             },

           The return value may be a list containing zero or more of an
           arrayref, an integer, or "undef".

           ex. ignores all CJK unified ideographs.

             overrideCJK => sub {()}, # CODEREF returning empty list

              # where ->eq("Pe\x{4E00}rl", "Perl") is true
              # as U+4E00 is a CJK unified ideograph and to be ignorable.

           If a false value (including "undef") is passed, "overrideCJK" has
           no effect.  "$Collator->change(overrideCJK => 0)" resets the old
           one.

           But assignment of weight for CJK unified ideographs in "table" or
           "entry" is still valid.  If "undef" is passed explicitly as the
           value for this key, weights for CJK unified ideographs are treated
           as undefined.  However when "UCA_Version" > 8, "(overrideCJK =>
           undef)" has no special meaning.

           Note: In addition to them, 12 CJK compatibility ideographs
           ("U+FA0E", "U+FA0F", "U+FA11", "U+FA13", "U+FA14", "U+FA1F",
           "U+FA21", "U+FA23", "U+FA24", "U+FA27", "U+FA28", "U+FA29") are
           also treated as CJK unified ideographs. But they can't be
           overridden via "overrideCJK" when you use DUCET, as the table
           includes weights for them. "table" or "entry" has priority over
           "overrideCJK".
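
           The "ignore all CJK unified ideographs" case above can be run
           as a complete script.  This is a sketch assuming the default
           DUCET; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Collate;

# The coderef returns an empty list, so every CJK unified ideograph
# without an explicit table entry contributes no collation element.
my $no_cjk  = Unicode::Collate->new(overrideCJK => sub { () });
my $default = Unicode::Collate->new();

my $ignored  = $no_cjk->eq("Pe\x{4E00}rl", "Perl");
my $distinct = $default->ne("Pe\x{4E00}rl", "Perl");

print "U+4E00 ignored with override: ", ($ignored  ? "yes" : "no"), "\n";
print "U+4E00 counted by default:    ", ($distinct ? "yes" : "no"), "\n";
```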

       overrideHangul
           -- see 7.1 Derived Collation Elements, UTS #10.

           By default, Hangul syllables are decomposed into Hangul Jamo, even
           if "(normalization => undef)".  But the mapping of Hangul syllables
           may be overridden.

           This parameter works like "overrideCJK", so see there for examples.

           If you want to override the mapping of Hangul syllables, NFD and
           NFKD are not appropriate, since NFD and NFKD will decompose Hangul
           syllables before overriding. FCD may decompose Hangul syllables as
           the case may be.

           If a false value (but not "undef") is passed, "overrideHangul" has
           no effect.  "$Collator->change(overrideHangul => 0)" resets the old
           one.

           If "undef" is passed explicitly as the value for this key, weight
           for Hangul syllables is treated as undefined without decomposition
           into Hangul Jamo.  But definition of weight for Hangul syllables in
           "table" or "entry" is still valid.

       overrideOut
           -- see 7.1.1 Handling Ill-Formed Code Unit Sequences, UTS #10.

           Perl seems to allow out-of-range values (greater than 0x10FFFF).
           By default, out-of-range values are replaced with "U+FFFD"
           (REPLACEMENT CHARACTER) when "UCA_Version" >= 22, or ignored when
           "UCA_Version" <= 20.

           When "UCA_Version" >= 22, the weights of out-of-range values can be
           overridden. Though "table" or "entry" are available for them,
           there are too many out-of-range values to list individually.

           "overrideOut" can perform it algorithmically.  This parameter works
           like "overrideCJK", so see there for examples.

           ex. ignores all out-of-range values.

             overrideOut => sub {()}, # CODEREF returning empty list

           If a false value (including "undef") is passed, "overrideOut" has
           no effect.  "$Collator->change(overrideOut => 0)" resets the old
           one.

           NOTE ABOUT U+FFFD:

           UCA recommends that out-of-range values should not be ignored for
           security reasons. Say, "pe\x{110000}rl" should not be equal to
           "perl".  However, "U+FFFD" is wrongly mapped to a variable
           collation element in DUCET for Unicode 6.0.0 to 6.2.0, which
           means out-of-range values will be ignored when "variable" isn't
           "Non-ignorable".

           The mapping of "U+FFFD" is corrected in Unicode 6.3.0.  See
           <http://www.unicode.org/reports/tr10/tr10-28.html#Trailing_Weights>
           (7.1.4 Trailing Weights). Such a correction is reproduced by this.

             overrideOut => sub { 0xFFFD }, # CODEREF returning a very large integer

           This workaround is unnecessary since Unicode 6.3.0.

       preprocess
           -- see 5.4 Preprocessing, UTS #10.

           If specified, the coderef is used to preprocess each string before
           the formation of sort keys.

           ex. dropping English articles, such as "a" or "the".  Then, "the
           pen" is before "a pencil".

                preprocess => sub {
                      my $str = shift;
                      $str =~ s/\b(?:an?|the)\s+//gi;
                      return $str;
                   },

           "preprocess" is performed before "normalization" (if defined).

           ex. decoding strings in a legacy encoding such as shift-jis:

               $sjis_collator = Unicode::Collate->new(
                   preprocess => \&your_shiftjis_to_unicode_decoder,
               );
               @result = $sjis_collator->sort(@shiftjis_strings);

           Note: Strings returned from the coderef will be interpreted
           according to Perl's Unicode support. See perlunicode, perluniintro,
           perlunitut, perlunifaq, utf8.
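
           The article-dropping example can be exercised end to end as
           follows.  This is a sketch assuming the default DUCET; the
           sample titles are made up.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $coll = Unicode::Collate->new(
    # Strip leading or embedded English articles before collation.
    preprocess => sub {
        my $str = shift;
        $str =~ s/\b(?:an?|the)\s+//gi;
        return $str;
    },
);

# sort() orders by the preprocessed form but returns the original strings.
my @sorted = $coll->sort("a pencil", "the pen", "an eraser");
print join(", ", @sorted), "\n";
```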

       rearrange
           -- see 3.5 Rearrangement, UTS #10.

           Characters that are not coded in logical order and are to be
           rearranged.  If "UCA_Version" is equal to or less than 11, default
           is:

               rearrange => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4 ],

           If you want to disallow any rearrangement, pass "undef" or "[]" (a
           reference to an empty array) as the value for this key.

           If "UCA_Version" is equal to or greater than 14, default is "[]"
           (i.e. no rearrangement).

           According to the version 9 of UCA, this parameter shall not be
           used; but no warning is issued at present.

       rewrite
           If specified, the coderef is used to rewrite lines in "table" or
           "entry".  The coderef will get each line, and then should return a
           rewritten line according to the UCA file format.  If the coderef
           returns an empty line, the line will be skipped.

           e.g. any primary ignorable characters into tertiary ignorable:

               rewrite => sub {
                   my $line = shift;
                   $line =~ s/\[\.0000\..{4}\..{4}\./[.0000.0000.0000./g;
                   return $line;
               },

           This example shows rewriting weights. "rewrite" is allowed to
           affect code points, weights, and the name.

           NOTE: "table" is available to use another table file; preparing a
           modified table once would be more efficient than rewriting lines on
           reading an unmodified table every time.

       suppress
           -- see 3.12 Special-Purpose Commands, UTS #35 (LDML) Part 5:
           Collation.

           Contractions beginning with the specified characters are
           suppressed, even if those contractions are defined in "table".

           An example for Russian and some languages using the Cyrillic
           script:

               suppress => [0x0400..0x0417, 0x041A..0x0437, 0x043A..0x045F],

           where 0x0400 stands for "U+0400", CYRILLIC CAPITAL LETTER IE WITH
           GRAVE.

           NOTE: Contractions via "entry" will not be suppressed.

       table
           -- see 3.8 Default Unicode Collation Element Table, UTS #10.

           You can use another collation element table if desired.

           The table file should be located in the Unicode/Collate directory
           on @INC. Say, if the filename is Foo.txt, the table file is
           searched as Unicode/Collate/Foo.txt in @INC.

           By default, allkeys.txt (as the filename of DUCET) is used.  If you
           prepare your own table file, a name other than allkeys.txt is
           better, to avoid a namespace conflict.

           NOTE: When XSUB is used, the DUCET is compiled on building this
           module, and it may save time at the run time.  Explicitly
           specifying "(table => 'allkeys.txt')", or using another table, or
           using "ignoreChar", "ignoreName", "undefChar", "undefName" or
           "rewrite" will prevent this module from using the compiled DUCET.

           If "undef" is passed explicitly as the value for this key, no file
           is read (but you can define collation elements via "entry").

           A typical way to define a collation element table without any
           table file:

              $onlyABC = Unicode::Collate->new(
                  table => undef,
                  entry => << 'ENTRIES',
           0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
           0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
           0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
           0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
           0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
           0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
           ENTRIES
               );

           If "ignoreName" or "undefName" is used, character names should be
           specified as a comment (following "#") on each line.

       undefChar
       undefName
           -- see 6.3.3 Reducing the Repertoire, UTS #10.

           Undefines the collation element as if it were unassigned in the
           "table".  This reduces the size of the table.  If an unassigned
           character appears in the string to be collated, the sort key is
           made from its codepoint as a single-character collation element, as
           it is greater than any other assigned collation elements (in the
           codepoint order among the unassigned characters).  But, it'd be
           better to ignore characters unfamiliar to you and maybe never used.

           Through "undefChar", any character matching "qr/$undefChar/" will
           be undefined. Through "undefName", any character whose name (given
           in the "table" file as a comment) matches "qr/$undefName/" will be
           undefined.

           ex. Collation weights for beyond-BMP characters are not stored in
           the object:

               undefChar => qr/[^\0-\x{fffd}]/,
687
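           The effect can be sketched on a tiny hand-written table (the three
           entries and the pattern "qr/a/" are assumptions chosen for
           illustration, not part of any shipped table).  The entry for "a"
           is dropped while the table is parsed, so "a" is collated as an
           unassigned character and sorts after the assigned ones.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $noA = Unicode::Collate->new(
    table     => undef,
    undefChar => qr/a/,    # behave as if "a" had no entry
    entry     => <<'ENTRIES',
0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
ENTRIES
);

# "a" now gets a derived weight greater than any assigned element.
my @sorted = $noA->sort(qw(a b c));
print "@sorted\n";    # b c a
```
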
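       upper_before_lower
           -- see 6.6 Case Comparisons, UTS #10.

           By default, lowercase is before uppercase.  If the parameter is
           made true, this is reversed.

           NOTE: This parameter simplemindedly assumes that any
           lowercase/uppercase distinctions must occur in level 3, and their
           weights at level 3 must be the same as those mentioned in 7.3.1,
           UTS #10.  If you define collation elements that differ from this
           requirement, this parameter does not work correctly.
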
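           A short sketch comparing the two settings with the default table
           (the word list is made up for illustration):

```perl
use strict;
use warnings;
use Unicode::Collate;

my @letters = qw(B a A b);

# Default: lowercase before uppercase.
my @lower_first = Unicode::Collate->new->sort(@letters);

# upper_before_lower => 1 reverses the case order at level 3.
my @upper_first =
    Unicode::Collate->new(upper_before_lower => 1)->sort(@letters);

print "@lower_first\n";    # a A b B
print "@upper_first\n";    # A a B b
```
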
694           NOTE: This parameter simplemindedly assumes that any
695           lowercase/uppercase distinctions must occur in level 3, and their
696           weights at level 3 must be same as those mentioned in 7.3.1, UTS
697           #10.  If you define your collation elements which differs from this
698           requirement, this parameter doesn't work validly.
699
700       variable
701           -- see 3.6 Variable Weighting, UTS #10.
702
703           This key allows for variable weighting of variable collation
704           elements, which are marked with an ASTERISK in the table (NOTE:
705           Many punctuation marks and symbols are variable in allkeys.txt).
706
707              variable => 'blanked', 'non-ignorable', 'shifted', or 'shift-trimmed'.
708
709           These names are case-insensitive.  By default (if specification is
710           omitted), 'shifted' is adopted.
711
712              'Blanked'        Variable elements are made ignorable at levels 1 through 3;
713                               considered at the 4th level.
714
715              'Non-Ignorable'  Variable elements are not reset to ignorable.
716
717              'Shifted'        Variable elements are made ignorable at levels 1 through 3
718                               their level 4 weight is replaced by the old level 1 weight.
719                               Level 4 weight for Non-Variable elements is 0xFFFF.
720
721              'Shift-Trimmed'  Same as 'shifted', but all FFFF's at the 4th level
722                               are trimmed.
723
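           As a rough illustration with the default table, where HYPHEN-MINUS
           is a variable element: under 'shifted' the hyphens vanish from
           levels 1 through 3, while under 'non-ignorable' they keep their
           primary weight.  The strings are only examples.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Compare at levels 1..3 only, so the shifted level-4 weights play no part.
my $shifted = Unicode::Collate->new(variable => 'shifted', level => 3);
my $non_ign = Unicode::Collate->new(variable => 'non-ignorable', level => 3);

my $same_shifted = $shifted->eq("c-a-m-e-l", "camel") ? 1 : 0;
my $same_non_ign = $non_ign->eq("c-a-m-e-l", "camel") ? 1 : 0;

print "shifted:       $same_shifted\n";    # 1
print "non-ignorable: $same_non_ign\n";    # 0
```
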
   Methods for Collation
       "@sorted = $Collator->sort(@not_sorted)"
           Sorts a list of strings.

       "$result = $Collator->cmp($a, $b)"
           Returns 1 (when $a is greater than $b) or 0 (when $a is equal to
           $b) or -1 (when $a is less than $b).

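           A small sketch contrasting these methods with Perl's built-in
           codepoint ordering (the word list is made up for illustration):

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new;
my @words    = qw(banana Cherry apple);

# Built-in sort compares codepoints, so uppercase "C" comes first.
my @by_codepoint = sort @words;

# The collator orders by letter first and by case only as a tie-breaker.
my @by_uca = $Collator->sort(@words);

print "@by_codepoint\n";    # Cherry apple banana
print "@by_uca\n";          # apple banana Cherry

# cmp() can also be used as the comparator of the built-in sort.
my @also_uca = sort { $Collator->cmp($a, $b) } @words;
my $result   = $Collator->cmp("apple", "Cherry");
print "$result\n";          # -1
```
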
       "$result = $Collator->eq($a, $b)"
       "$result = $Collator->ne($a, $b)"
       "$result = $Collator->lt($a, $b)"
       "$result = $Collator->le($a, $b)"
       "$result = $Collator->gt($a, $b)"
       "$result = $Collator->ge($a, $b)"
           They work like the operators of the same names.

              eq : whether $a is equal to $b.
              ne : whether $a is not equal to $b.
              lt : whether $a is less than $b.
              le : whether $a is less than $b or equal to $b.
              gt : whether $a is greater than $b.
              ge : whether $a is greater than $b or equal to $b.

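           For instance, a sketch of how the collation level changes the
           answers (the strings are arbitrary examples):

```perl
use strict;
use warnings;
use Unicode::Collate;

my $primary = Unicode::Collate->new(level => 1);    # letters only
my $full    = Unicode::Collate->new;                # all levels

# At level 1 case is not considered, so the strings are equal.
my $eq_primary = $primary->eq("perl", "PERL") ? 1 : 0;

# With all levels, lowercase sorts before uppercase.
my $eq_full = $full->eq("perl", "PERL") ? 1 : 0;
my $lt_full = $full->lt("perl", "PERL") ? 1 : 0;
my $ge_full = $full->ge("perl", "PERL") ? 1 : 0;

print "$eq_primary $eq_full $lt_full $ge_full\n";    # 1 0 1 0
```
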
       "$sortKey = $Collator->getSortKey($string)"
           -- see 4.3 Form Sort Key, UTS #10.

           Returns a sort key.

           Comparing sort keys with a binary comparison gives the same result
           as comparing the original strings using UCA.

              $Collator->getSortKey($a) cmp $Collator->getSortKey($b)

                 is equivalent to

              $Collator->cmp($a, $b)

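           One possible use of this equivalence, sketched under the
           assumption that the same strings are compared many times: compute
           each sort key once and let the built-in "cmp" do the comparisons.
           The hash %key_of is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new;
my @words    = qw(banana Cherry apple);

# Compute every sort key once instead of on each comparison.
my %key_of = map { $_ => $Collator->getSortKey($_) } @words;

# Binary comparison of the keys reproduces the UCA order.
my @sorted = sort { $key_of{$a} cmp $key_of{$b} } @words;

print "@sorted\n";    # apple banana Cherry
```
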
       "$sortKeyForm = $Collator->viewSortKey($string)"
           Converts the sort key of $string into a human-readable form.  If
           "UCA_Version" is 8, the output is slightly different.

              use Unicode::Collate;
              my $c = Unicode::Collate->new();
              print $c->viewSortKey("Perl"),"\n";

              # output:
              # [0B67 0A65 0B7F 0B03 | 0020 0020 0020 0020 | 0008 0002 0002 0002 | FFFF FFFF FFFF FFFF]
              #  Level 1               Level 2               Level 3               Level 4

   Methods for Searching
       The "match", "gmatch", "subst", "gsubst" methods work like "m//",
       "m//g", "s///", "s///g", respectively, but they do not understand
       patterns; they only match a literal substring.

       DISCLAIMER: If the "preprocess" or "normalization" parameter is true
       for $Collator, calling these methods ("index", "match", "gmatch",
       "subst", "gsubst") croaks, as the position and the length might
       differ from those on the specified string.

       The "rearrange" and "hangul_terminator" parameters are ignored.
       "katakana_before_hiragana" and "upper_before_lower" don't affect
       matching and searching, as it doesn't matter which is greater or less.

       "$position = $Collator->index($string, $substring[, $position])"
       "($position, $length) = $Collator->index($string, $substring[,
       $position])"
           If $substring matches a part of $string, returns the position of
           the first occurrence of the matching part in scalar context; in
           list context, returns a two-element list of the position and the
           length of the matching part.

           If $substring does not match any part of $string, returns "-1" in
           scalar context and an empty list in list context.

           e.g. when the content of $str is "Ich muß studieren Perl.", you
           say the following where $sub is "MüSS",

             my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
                                                # (normalization => undef) is REQUIRED.
             my $match;
             if (my($pos,$len) = $Collator->index($str, $sub)) {
                 $match = substr($str, $pos, $len);
             }

           and get "muß" in $match, since "muß" is primary equal to
           "MüSS".

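           A self-contained version of the example above, written with
           "\x{..}" escapes so that it does not depend on the encoding of the
           source file.  The expected position and length follow from the
           match "muß" described above.

```perl
use strict;
use warnings;
use Unicode::Collate;

# (normalization => undef) is required for the searching methods.
my $Collator = Unicode::Collate->new(normalization => undef, level => 1);

my $str = "Ich mu\x{DF} studieren Perl.";    # \x{DF} is LATIN SMALL LETTER SHARP S
my $sub = "M\x{FC}SS";                       # \x{FC} is LATIN SMALL LETTER U WITH DIAERESIS

my $match;
my ($pos, $len) = $Collator->index($str, $sub);
if (defined $pos) {
    $match = substr($str, $pos, $len);
    print "$pos $len\n";    # 4 3
}
```
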
       "$match_ref = $Collator->match($string, $substring)"
       "($match)   = $Collator->match($string, $substring)"
           If $substring matches a part of $string, in scalar context, returns
           a reference to the first occurrence of the matching part
           ($match_ref is always true if it matches, since every reference is
           true); in list context, returns the first occurrence of the
           matching part.

           If $substring does not match any part of $string, returns "undef"
           in scalar context and an empty list in list context.

           e.g.

               if ($match_ref = $Collator->match($str, $sub)) { # scalar context
                   print "matches [$$match_ref].\n";
               } else {
                   print "doesn't match.\n";
               }

                or

               if (($match) = $Collator->match($str, $sub)) { # list context
                   print "matches [$match].\n";
               } else {
                   print "doesn't match.\n";
               }

       "@match = $Collator->gmatch($string, $substring)"
           If $substring matches a part of $string, returns all the matching
           parts (or matching count in scalar context).

           If $substring does not match any part of $string, returns an empty
           list.

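           A brief sketch of "gmatch" at level 1, where case and accents are
           not considered (the sample string is only an illustration):

```perl
use strict;
use warnings;
use Unicode::Collate;

# (normalization => undef) is required for the searching methods.
my $Collator = Unicode::Collate->new(normalization => undef, level => 1);

my $str = "Camel donkey zebra came\x{301}l CAMEL horse";

# List context: every matching part, as it appears in $str.
my @found = $Collator->gmatch($str, "camel");
my $count = @found;

print "$count\n";    # 3
```

           Here @found holds "Camel", "came\x{301}l" and "CAMEL", in that
           order.
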
       "$count = $Collator->subst($string, $substring, $replacement)"
           If $substring matches a part of $string, the first occurrence of
           the matching part is replaced by $replacement ($string is modified)
           and $count (always equal to 1) is returned.

           $replacement can be a "CODEREF", taking the matching part as an
           argument, and returning a string to replace the matching part (a
           bit similar to "s/(..)/$coderef->($1)/e").

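           A short sketch of both forms of $replacement (the strings are
           arbitrary examples):

```perl
use strict;
use warnings;
use Unicode::Collate;

# (normalization => undef) is required for the searching methods.
my $Collator = Unicode::Collate->new(normalization => undef, level => 1);

# Plain string replacement: only the first occurrence is changed.
my $plain = "PERL and perl";
$Collator->subst($plain, "perl", "Raku");
print "$plain\n";    # Raku and perl

# CODEREF replacement: receives the matching part as its argument.
my $coded = "PERL and perl";
$Collator->subst($coded, "perl", sub { "<" . lc($_[0]) . ">" });
print "$coded\n";    # <perl> and perl
```
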
       "$count = $Collator->gsubst($string, $substring, $replacement)"
           If $substring matches a part of $string, all the occurrences of the
           matching part are replaced by $replacement ($string is modified)
           and $count is returned.

           $replacement can be a "CODEREF", taking the matching part as an
           argument, and returning a string to replace the matching part (a
           bit similar to "s/(..)/$coderef->($1)/eg").

           e.g.

             my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
                                                # (normalization => undef) is REQUIRED.
             my $str = "Camel donkey zebra came\x{301}l CAMEL horse cam\0e\0l...";
             $Collator->gsubst($str, "camel", sub { "<b>$_[0]</b>" });

             # now $str is "<b>Camel</b> donkey zebra <b>came\x{301}l</b> <b>CAMEL</b> horse <b>cam\0e\0l</b>...";
             # i.e., all the camels are made bold-faced.

              Examples: levels and ignore_level2 - what does camel match?
             ---------------------------------------------------------------------------
              level  ignore_level2  |  camel  Camel  came\x{301}l  c-a-m-e-l  cam\0e\0l
             -----------------------|---------------------------------------------------
                1        false      |   yes    yes      yes          yes        yes
                2        false      |   yes    yes      no           yes        yes
                3        false      |   yes    no       no           yes        yes
                4        false      |   yes    no       no           no         yes
             -----------------------|---------------------------------------------------
                1        true       |   yes    yes      yes          yes        yes
                2        true       |   yes    yes      yes          yes        yes
                3        true       |   yes    no       yes          yes        yes
                4        true       |   yes    no       yes          no         yes
             ---------------------------------------------------------------------------
              note: if variable => non-ignorable, camel doesn't match c-a-m-e-l
                    at any level.

   Other Methods
       "%old_tailoring = $Collator->change(%new_tailoring)"
       "$modified_collator = $Collator->change(%new_tailoring)"
           Changes the value of specified keys and returns the changed part.

               $Collator = Unicode::Collate->new(level => 4);

               $Collator->eq("perl", "PERL"); # false

               %old = $Collator->change(level => 2); # returns (level => 4).

               $Collator->eq("perl", "PERL"); # true

               $Collator->change(%old); # returns (level => 2).

               $Collator->eq("perl", "PERL"); # false

           Not all "(key,value)"s are allowed to be changed.  See also
           @Unicode::Collate::ChangeOK and @Unicode::Collate::ChangeNG.

           In the scalar context, returns the modified collator (but it is not
           a clone from the original).

               $Collator->change(level => 2)->eq("perl", "PERL"); # true

               $Collator->eq("perl", "PERL"); # true; now max level is 2nd.

               $Collator->change(level => 4)->eq("perl", "PERL"); # false

       "$version = $Collator->version()"
           Returns the version number (a string) of the Unicode Standard which
           the "table" file used by the collator object is based on.  If the
           table does not include a version line (starting with @version),
           returns "unknown".

       "UCA_Version()"
           Returns the revision number of UTS #10 this module consults, which
           should correspond with the DUCET incorporated.

       "Base_Unicode_Version()"
           Returns the version number of UTS #10 this module consults, which
           should correspond with the DUCET incorporated.

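           A small sketch that prints the three values.  The actual numbers
           depend on the installed release of the module, so no output is
           shown; calling the two functions by their fully qualified names is
           an assumption based on the EXPORT note that nothing is exported.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new;

# Version of the table in use, or "unknown" without an @version line.
my $table_version = $Collator->version();

# Values fixed for the installed release of the module.
my $uca_revision = Unicode::Collate::UCA_Version();
my $base_version = Unicode::Collate::Base_Unicode_Version();

print "table:    $table_version\n";
print "revision: $uca_revision\n";
print "version:  $base_version\n";
```
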

EXPORT

       No method will be exported.


INSTALL

       Though this module can be used without any "table" file, to use this
       module easily, it is recommended to install a table file in the UCA
       format, by copying it under the directory <a place in
       @INC>/Unicode/Collate.

       The most preferable one is "The Default Unicode Collation Element
       Table" (aka DUCET), available from the Unicode Consortium's website:

          http://www.unicode.org/Public/UCA/

          http://www.unicode.org/Public/UCA/latest/allkeys.txt
          (latest version)

       If DUCET is not installed, it is recommended to copy the file from
       http://www.unicode.org/Public/UCA/latest/allkeys.txt to <a place in
       @INC>/Unicode/Collate/allkeys.txt manually.


CAVEATS

       Normalization
           Use of the "normalization" parameter requires the
           Unicode::Normalize module (see Unicode::Normalize).

           If you do not need it (say, when you do not need to handle any
           combining characters), assign "(normalization => undef)"
           explicitly.

           -- see 6.5 Avoiding Normalization, UTS #10.

       Conformance Test
           The Conformance Test for the UCA is available under
           <http://www.unicode.org/Public/UCA/>.

           For CollationTest_SHIFTED.txt, a collator via
           "Unicode::Collate->new( )" should be used; for
           CollationTest_NON_IGNORABLE.txt, a collator via
           "Unicode::Collate->new(variable => "non-ignorable", level => 3)".

           If "UCA_Version" is 26 or later, the "identical" level is
           preferred; "Unicode::Collate->new(identical => 1)" and
           "Unicode::Collate->new(identical => 1, variable =>
           "non-ignorable", level => 3)" should be used.

           Unicode::Normalize is required to try the Conformance Test.


AUTHOR, COPYRIGHT AND LICENSE

       The Unicode::Collate module for perl was written by SADAHIRO Tomoyuki,
       <SADAHIRO@cpan.org>. This module is Copyright(C) 2001-2020, SADAHIRO
       Tomoyuki. Japan. All rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       The file Unicode/Collate/allkeys.txt was copied verbatim from
       <http://www.unicode.org/Public/UCA/13.0.0/allkeys.txt>.  For this file,
       Copyright (c) 2020 Unicode, Inc.; distributed under the Terms of Use in
       <http://www.unicode.org/terms_of_use.html>


SEE ALSO

       Unicode Collation Algorithm - UTS #10
           <http://www.unicode.org/reports/tr10/>

       The Default Unicode Collation Element Table (DUCET)
           <http://www.unicode.org/Public/UCA/latest/allkeys.txt>

       The conformance test for the UCA
           <http://www.unicode.org/Public/UCA/latest/CollationTest.html>

           <http://www.unicode.org/Public/UCA/latest/CollationTest.zip>

       Hangul Syllable Type
           <http://www.unicode.org/Public/UNIDATA/HangulSyllableType.txt>

       Unicode Normalization Forms - UAX #15
           <http://www.unicode.org/reports/tr15/>

       Unicode Locale Data Markup Language (LDML) - UTS #35
           <http://www.unicode.org/reports/tr35/>



perl v5.32.1                      2021-01-27                        Collate(3)