Unicode::Collate(3pm)

Collate(3)            User Contributed Perl Documentation           Collate(3)

NAME

6       Unicode::Collate - Unicode Collation Algorithm
7

SYNOPSIS

9         use Unicode::Collate;
10
11         #construct
12         $Collator = Unicode::Collate->new(%tailoring);
13
14         #sort
15         @sorted = $Collator->sort(@not_sorted);
16
17         #compare
18         $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.
19
       Note: Strings in @not_sorted, $a and $b are interpreted according to
       Perl's Unicode support. See perlunicode, perluniintro, perlunitut,
       perlunifaq, utf8.  If your strings are not already decoded, either
       decode them beforehand or use "preprocess" to do so.
24

DESCRIPTION

26       This module is an implementation of Unicode Technical Standard #10
27       (a.k.a. UTS #10) - Unicode Collation Algorithm (a.k.a. UCA).
28
29   Constructor and Tailoring
       The "new" method returns a collator object. If new() is called with no
       parameters, the collator performs the default collation.
32
33          $Collator = Unicode::Collate->new(
34             UCA_Version => $UCA_Version,
35             alternate => $alternate, # alias for 'variable'
36             backwards => $levelNumber, # or \@levelNumbers
37             entry => $element,
38             hangul_terminator => $term_primary_weight,
39             highestFFFF => $bool,
40             identical => $bool,
41             ignoreName => qr/$ignoreName/,
42             ignoreChar => qr/$ignoreChar/,
43             ignore_level2 => $bool,
44             katakana_before_hiragana => $bool,
45             level => $collationLevel,
46             long_contraction => $bool,
47             minimalFFFE => $bool,
48             normalization  => $normalization_form,
49             overrideCJK => \&overrideCJK,
50             overrideHangul => \&overrideHangul,
51             preprocess => \&preprocess,
52             rearrange => \@charList,
53             rewrite => \&rewrite,
54             suppress => \@charList,
55             table => $filename,
56             undefName => qr/$undefName/,
57             undefChar => qr/$undefChar/,
58             upper_before_lower => $bool,
59             variable => $variable,
60          );
61
62       UCA_Version
63           If the revision (previously "tracking version") number of UCA is
64           given, behavior of that revision is emulated on collating.  If
65           omitted, the return value of "UCA_Version()" is used.
66
67           The following revisions are supported.  The default is 36.
68
                UCA       Unicode Standard         DUCET (@version)
              -------------------------------------------------------
                 8              3.1                3.0.1 (3.0.1d9)
                 9     3.1 with Corrigendum 3      3.1.1 (3.1.1)
                11              4.0                4.0.0 (4.0.0)
                14             4.1.0               4.1.0 (4.1.0)
                16              5.0                5.0.0 (5.0.0)
                18             5.1.0               5.1.0 (5.1.0)
                20             5.2.0               5.2.0 (5.2.0)
                22             6.0.0               6.0.0 (6.0.0)
                24             6.1.0               6.1.0 (6.1.0)
                26             6.2.0               6.2.0 (6.2.0)
                28             6.3.0               6.3.0 (6.3.0)
                30             7.0.0               7.0.0 (7.0.0)
                32             8.0.0               8.0.0 (8.0.0)
                34             9.0.0               9.0.0 (9.0.0)
                36            10.0.0              10.0.0 (10.0.0)
86
87           * See below for "long_contraction" with "UCA_Version" 22 and 24.
88
89           * Noncharacters (e.g. U+FFFF) are not ignored, and can be
90           overridden since "UCA_Version" 22.
91
92           * Out-of-range codepoints (greater than U+10FFFF) are not ignored,
93           and can be overridden since "UCA_Version" 22.
94
95           * Fully ignorable characters were ignored, and would not interrupt
96           contractions with "UCA_Version" 9 and 11.
97
98           * Treatment of ignorables after variables and some behaviors were
99           changed at "UCA_Version" 9.
100
101           * Characters regarded as CJK unified ideographs (cf. "overrideCJK")
102           depend on "UCA_Version".
103
           * Many hangul jamo are assigned at "UCA_Version" 20, which
           affects "hangul_terminator".
106
107       alternate
108           -- see 3.2.2 Alternate Weighting, version 8 of UTS #10
109
110           For backward compatibility, "alternate" (old name) can be used as
111           an alias for "variable".
112
113       backwards
114           -- see 3.4 Backward Accents, UTS #10.
115
116                backwards => $levelNumber or \@levelNumbers
117
           Weights at the specified level(s) are compared in reverse order;
           e.g. level 2 (diacritic ordering) in French.  If omitted (or if
           $levelNumber is "undef" or "\@levelNumbers" is "[]"), all levels
           are compared forwards.
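
           As a rough illustration (not taken from the module's own
           documentation), the sketch below compares the default forward
           ordering with "backwards => 2" on four French words.  The word
           list and variable names are chosen for this example only; the
           orders shown in the comments follow the French accent rule
           described in UTS #10.

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

# Sample words, written with escapes so the source stays ASCII-only.
my @words = (
    "c\x{F4}t\x{E9}",   # "cote" with circumflex and acute
    "cot\x{E9}",        # "cote" with acute only
    "c\x{F4}te",        # "cote" with circumflex only
    "cote",
);

my $forward = Unicode::Collate->new;                  # all levels forwards
my $french  = Unicode::Collate->new(backwards => 2);  # level 2 reversed

my @fwd = $forward->sort(@words);   # cote, cot\x{E9}, c\x{F4}te, c\x{F4}t\x{E9}
my @bwd = $french->sort(@words);    # cote, c\x{F4}te, cot\x{E9}, c\x{F4}t\x{E9}

print "forward:   @fwd\n";
print "backwards: @bwd\n";
```

           With level 2 reversed, the accent nearest the end of the word
           decides first, so the word with only a circumflex sorts before
           the word with only a final acute.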
121
122       entry
123           -- see 5 Tailoring; 9.1 Allkeys File Format, UTS #10.
124
125           If the same character (or a sequence of characters) exists in the
126           collation element table through "table", mapping to collation
127           elements is overridden.  If it does not exist, the mapping is
128           defined additionally.
129
130               entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
131           0063 0068 ; [.0E6A.0020.0002.0063] # ch
132           0043 0068 ; [.0E6A.0020.0007.0043] # Ch
133           0043 0048 ; [.0E6A.0020.0008.0043] # CH
134           006C 006C ; [.0F4C.0020.0002.006C] # ll
135           004C 006C ; [.0F4C.0020.0007.004C] # Ll
136           004C 004C ; [.0F4C.0020.0008.004C] # LL
137           00F1      ; [.0F7B.0020.0002.00F1] # n-tilde
138           006E 0303 ; [.0F7B.0020.0002.00F1] # n-tilde
139           00D1      ; [.0F7B.0020.0008.00D1] # N-tilde
140           004E 0303 ; [.0F7B.0020.0008.00D1] # N-tilde
141           ENTRY
142
143               entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
144           00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae ligature as <a><e>
145           00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE ligature as <A><E>
146           ENTRY
147
           NOTE: The code point in the UCA file format (before ';') must be a
           Unicode code point (written in hexadecimal), not a native code
           point.  So 0063 must always denote "U+0063", not the native
           character "\x63".
152
153           Weighting may vary depending on collation element table.  So ensure
154           the weights defined in "entry" will be consistent with those in the
155           collation element table loaded via "table".
156
           In DUCET v4.0.0, the primary weight of "C" is "0E60" and that of
           "D" is "0E6D".  So setting the primary weight of "CH" to "0E6A" (a
           value between "0E60" and "0E6D") yields the ordering "C < CH < D".
           Strictly speaking, DUCET already has some characters between "C"
           and "D": "small capital C" ("U+1D04") with primary weight "0E64",
           "c-hook/C-hook" ("U+0188/U+0187") with "0E65", and "c-curl"
           ("U+0255") with "0E69".  The primary weight "0E6A" therefore
           places "CH" between "c-curl" and "D".
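
           The contraction idea can be tried in isolation with a tiny
           stand-alone table.  The sketch below is only an illustration: it
           assumes "table => undef" so that no table file is read, and the
           weights (0103, 0104, ...) are hypothetical values invented for
           this example, not DUCET weights.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical weights in a tiny stand-alone table; with a real table such
# as DUCET the weights must be chosen to fit that table instead.
my $collator = Unicode::Collate->new(
    table => undef,
    entry => <<'ENTRY',
0063      ; [.0103.0020.0002.0063] # c
0063 0068 ; [.0104.0020.0002.0063] # ch (contraction)
0064      ; [.0105.0020.0002.0064] # d
0068      ; [.0108.0020.0002.0068] # h
007A      ; [.011A.0020.0002.007A] # z
ENTRY
);

# "ch" is weighted as a single element between "c" and "d",
# so it sorts after "cz".
my @sorted = $collator->sort(qw(d ch cz c));
print "@sorted\n";
```

           Because "ch" maps to one collation element whose primary weight
           is above that of "c", every string starting with "c" plus another
           letter defined here sorts before "ch".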
165
166       hangul_terminator
167           -- see 7.1.4 Trailing Weights, UTS #10.
168
169           If a true value is given (non-zero but should be positive), it will
170           be added as a terminator primary weight to the end of every
171           standard Hangul syllable. Secondary and any higher weights for
172           terminator are set to zero.  If the value is false or
173           "hangul_terminator" key does not exist, insertion of terminator
174           weights will not be performed.
175
176           Boundaries of Hangul syllables are determined according to
177           conjoining Jamo behavior in the Unicode Standard and
178           HangulSyllableType.txt.
179
           Implementation Note: (1) For expansion mapping (Unicode character
           mapped to a sequence of collation elements), a terminator will not
           be added between collation elements, even if a Hangul syllable
           boundary exists there.  A terminator is added only immediately
           after the last collation element.
185
186           (2) Non-conjoining Hangul letters (Compatibility Jamo, halfwidth
187           Jamo, and enclosed letters) are not automatically terminated with a
188           terminator primary weight.  These characters may need terminator
189           included in a collation element table beforehand.
190
191       highestFFFF
192           -- see 2.4 Tailored noncharacter weights, UTS #35 (LDML) Part 5:
193           Collation.
194
           If the parameter is made true, "U+FFFF" has the highest primary
           weight.  When both "$coll->ge($str, "abc")" and
           "$coll->le($str, "abc\x{FFFF}")" are true, $str is expected to
           begin with "abc" or a primary equivalent.  $str may be "abcd" or
           "abc012", but should not itself include "U+FFFF", as in
           "abc\x{FFFF}xyz".

           "$coll->le($str, "abc\x{FFFF}")" works almost like
           "$coll->lt($str, "abd")", but the latter requires you to know
           which letter comes next after "c".  In a language where "ch" is
           the letter after "c", "abch" is greater than "abc\x{FFFF}" but
           less than "abd".
207
208           Note: This is equivalent to "(entry => 'FFFF ;
209           [.FFFE.0020.0005.FFFF]')".  Any other character than "U+FFFF" can
210           be tailored by "entry".
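
           The prefix test described above can be sketched as a simple
           filter.  This is an illustrative example only; the candidate list
           and variable names are made up here, and it assumes a version of
           the module that supports "highestFFFF".

```perl
use strict;
use warnings;
use Unicode::Collate;

my $coll   = Unicode::Collate->new(highestFFFF => 1);
my $prefix = "abc";

my @candidates = ("ab", "abc", "ABC", "abcd", "abc012", "abd");

# Keep strings that are >= the prefix and <= the prefix followed by U+FFFF.
my @matches = grep {
    $coll->ge($_, $prefix) && $coll->le($_, "$prefix\x{FFFF}")
} @candidates;

print "@matches\n";
```

           "ABC" is kept because it is primary equivalent to "abc"; "ab" and
           "abd" fall outside the range.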
211
212       identical
213           -- see A.3 Deterministic Comparison, UTS #10.
214
215           By default, strings whose weights are equal should be equal, even
216           though their code points are not equal.  Completely ignorable
217           characters are ignored.
218
           If the parameter is made true, a final, tie-breaking level is used.
           If no difference of weights is found after the comparison through
           all the levels specified by "level", a comparison by code points
           is performed.  For the tie-breaking comparison, the sort key
           has code points of the original string appended.  Completely
           ignorable characters are not ignored.
225
226           If "preprocess" and/or "normalization" is applied, the code points
227           of the string after them (in NFD by default) are used.
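
           A small sketch of the difference, assuming the default table, in
           which SOFT HYPHEN ("U+00AD") is completely ignorable.  The
           variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $plain    = "perl";
my $with_shy = "pe\x{AD}rl";   # contains U+00AD SOFT HYPHEN

my $default = Unicode::Collate->new;
my $strict  = Unicode::Collate->new(identical => 1);

# Without the tie-breaking level the two strings have equal weights.
my $eq_default = $default->eq($plain, $with_shy) ? 1 : 0;

# With identical => 1 the code points break the tie.
my $eq_strict  = $strict->eq($plain, $with_shy) ? 1 : 0;

print "default: $eq_default, identical: $eq_strict\n";
```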
228
229       ignoreChar
230       ignoreName
231           -- see 3.6 Variable Weighting, UTS #10.
232
           Makes the entry in the table completely ignorable; i.e. as if the
           weights were zero at all levels.
235
236           Through "ignoreChar", any character matching "qr/$ignoreChar/" will
237           be ignored. Through "ignoreName", any character whose name (given
238           in the "table" file as a comment) matches "qr/$ignoreName/" will be
239           ignored.
240
241           E.g. when 'a' and 'e' are ignorable, 'element' is equal to 'lament'
242           (or 'lmnt').
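
           The 'element'/'lament' case can be reproduced with a small
           stand-alone table.  This is a sketch under stated assumptions: it
           uses "table => undef" to avoid reading a table file, and the
           weights in "entry" are hypothetical values chosen for the example.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Tiny illustrative table (hypothetical weights); entries for 'a' and 'e'
# match ignoreChar and therefore become completely ignorable.
my $ignore_ae = Unicode::Collate->new(
    table      => undef,
    ignoreChar => qr/^[ae]$/,
    entry      => <<'ENTRY',
0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
0065 ; [.0105.0020.0002.0065] # LATIN SMALL LETTER E
006C ; [.010C.0020.0002.006C] # LATIN SMALL LETTER L
006D ; [.010D.0020.0002.006D] # LATIN SMALL LETTER M
006E ; [.010E.0020.0002.006E] # LATIN SMALL LETTER N
0074 ; [.0114.0020.0002.0074] # LATIN SMALL LETTER T
ENTRY
);

my $same_as_lament = $ignore_ae->eq('element', 'lament') ? 1 : 0;
my $same_as_lmnt   = $ignore_ae->eq('element', 'lmnt')   ? 1 : 0;

print "lament: $same_as_lament, lmnt: $same_as_lmnt\n";
```

           The pattern is anchored with "^" and "$" so that it matches only
           the single characters and not longer table entries that merely
           contain them.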
243
244       ignore_level2
245           -- see 5.1 Parametric Tailoring, UTS #10.
246
247           By default, case-sensitive comparison (that is level 3 difference)
248           won't ignore accents (that is level 2 difference).
249
250           If the parameter is made true, accents (and other primary ignorable
251           characters) are ignored, even though cases are taken into account.
252
253           NOTE: "level" should be 3 or greater.
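
           A brief sketch of the intended effect, assuming the default table
           and "level => 3".  The strings and variable names are examples
           only.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $level3      = Unicode::Collate->new(level => 3);
my $no_accents  = Unicode::Collate->new(level => 3, ignore_level2 => 1);

my $accented = "caf\x{E9}";   # "cafe" with an acute accent on the e

# Accent difference: seen by the plain level-3 collator, ignored with ignore_level2.
my $eq_accent_default = $level3->eq("cafe", $accented)     ? 1 : 0;
my $eq_accent         = $no_accents->eq("cafe", $accented) ? 1 : 0;

# Case difference: still significant with ignore_level2.
my $eq_case = $no_accents->eq("cafe", "Cafe") ? 1 : 0;

print "accent (default): $eq_accent_default\n";
print "accent (ignore_level2): $eq_accent\n";
print "case (ignore_level2): $eq_case\n";
```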
254
255       katakana_before_hiragana
256           -- see 7.2 Tertiary Weight Table, UTS #10.
257
258           By default, hiragana is before katakana.  If the parameter is made
259           true, this is reversed.
260
           NOTE: This parameter simplemindedly assumes that any
           hiragana/katakana distinctions occur at level 3, and that their
           weights at level 3 are the same as those mentioned in 7.3.1, UTS
           #10.  If you define collation elements that violate this
           requirement, this parameter does not work correctly.
266
267       level
268           -- see 4.3 Form Sort Key, UTS #10.
269
270           Set the maximum level.  Any higher levels than the specified one
271           are ignored.
272
273             Level 1: alphabetic ordering
274             Level 2: diacritic ordering
275             Level 3: case ordering
276             Level 4: tie-breaking (e.g. in the case when variable is 'shifted')
277
             ex. level => 2,
279
280           If omitted, the maximum is the 4th.
281
282           NOTE: The DUCET includes weights over 0xFFFF at the 4th level.  But
283           this module only uses weights within 0xFFFF.  When "variable" is
284           'blanked' or 'non-ignorable' (other than 'shifted' and
285           'shift-trimmed'), the level 4 may be unreliable.
286
287           See also "identical".
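
           The effect of the first three levels can be seen by comparing the
           same strings at each level.  The following is a small sketch
           using the default table; the strings and the %equal hash are
           illustrative only.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $plain    = "resume";
my $accented = "r\x{E9}sum\x{E9}";   # with acute accents
my $capital  = "R\x{E9}sum\x{E9}";   # accented and capitalized

my %equal;
for my $level (1 .. 3) {
    my $collator = Unicode::Collate->new(level => $level);
    $equal{$level} = [
        $collator->eq($plain,    $accented) ? 1 : 0,   # differ only by accents
        $collator->eq($accented, $capital)  ? 1 : 0,   # differ only by case
    ];
}

# Level 1 ignores accents and case, level 2 ignores case only,
# level 3 distinguishes both.
for my $level (sort keys %equal) {
    print "level $level: accents equal=$equal{$level}[0], case equal=$equal{$level}[1]\n";
}
```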
288
289       long_contraction
290           -- see 3.8.2 Well-Formedness of the DUCET, 4.2 Produce Array, UTS
291           #10.
292
293           If the parameter is made true, for a contraction with three or more
294           characters (here nicknamed "long contraction"), initial substrings
295           will be handled.  For example, a contraction ABC, where A is a
296           starter, and B and C are non-starters (character with non-zero
297           combining character class), will be detected even if there is not
298           AB as a contraction.
299
300           Default: Usually false.  If "UCA_Version" is 22 or 24, and the
301           value of "long_contraction" is not specified in "new()", a true
302           value is set implicitly.  This is a workaround to pass Conformance
303           Tests for Unicode 6.0.0 and 6.1.0.
304
305           "change()" handles "long_contraction" explicitly only.  If
306           "long_contraction" is not specified in "change()", even though
307           "UCA_Version" is changed, "long_contraction" will not be changed.
308
           Limitation: Scanning non-starters is one-way (no backtracking).
           If AB is found but ABC is not found, another long contraction
           where the first character is A and the second is not B may not be
           found.
313
314           Under "(normalization => undef)", detection step of discontiguous
315           contractions will be skipped.
316
317           Note: The following contractions in DUCET are not considered in
318           steps S2.1.1 to S2.1.3, where they are discontiguous.
319
320               0FB2 0F71 0F80 (TIBETAN VOWEL SIGN VOCALIC RR)
321               0FB3 0F71 0F80 (TIBETAN VOWEL SIGN VOCALIC LL)
322
323           For example "TIBETAN VOWEL SIGN VOCALIC RR" with "COMBINING TILDE
324           OVERLAY" ("U+0344") is "0FB2 0344 0F71 0F80" in NFD.  In this case
325           "0FB2 0F80" ("TIBETAN VOWEL SIGN VOCALIC R") is detected, instead
326           of "0FB2 0F71 0F80".  Inserted 0344 makes "0FB2 0F71 0F80"
327           discontiguous and lack of contraction "0FB2 0F71" prohibits "0FB2
328           0F71 0F80" from being detected.
329
330       minimalFFFE
331           -- see 1.1.1 U+FFFE, UTS #35 (LDML) Part 5: Collation.
332
           If the parameter is made true, "U+FFFE" has the minimal primary
           weight.  The comparison between "$a1\x{FFFE}$a2" and
           "$b1\x{FFFE}$b2" first compares $a1 and $b1 at level 1, and then
           $a2 and $b2 at level 1, as follows.
337
338                   "ab\x{FFFE}a"
339                   "Ab\x{FFFE}a"
340                   "ab\x{FFFE}c"
341                   "Ab\x{FFFE}c"
342                   "ab\x{FFFE}xyz"
343                   "abc\x{FFFE}def"
344                   "abc\x{FFFE}xYz"
345                   "aBc\x{FFFE}xyz"
346                   "abcX\x{FFFE}def"
347                   "abcx\x{FFFE}xyz"
348                   "b\x{FFFE}aaa"
349                   "bbb\x{FFFE}a"
350
351           Note: This is equivalent to "(entry => 'FFFE ;
352           [.0001.0020.0005.FFFE]')".  Any other character than "U+FFFE" can
353           be tailored by "entry".
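
           A short sketch of field-wise sorting with "U+FFFE" as the
           separator, using four of the strings listed above.  The variable
           names are illustrative; the separator is replaced by "|" for
           display so that no noncharacter is written to the output.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new(minimalFFFE => 1);

my @keys = (
    "Ab\x{FFFE}c",
    "abc\x{FFFE}def",
    "ab\x{FFFE}a",
    "ab\x{FFFE}xyz",
);

my @sorted = $collator->sort(@keys);

# Show the result with a visible separator instead of U+FFFE.
my @shown = map { (my $s = $_) =~ s/\x{FFFE}/|/; $s } @sorted;
print "$_\n" for @shown;
```

           The first fields are compared at the primary level before the
           second fields, so "Ab|c" follows "ab|a" even though "Ab" and "ab"
           differ in case.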
354
355       normalization
356           -- see 4.1 Normalize, UTS #10.
357
358           If specified, strings are normalized before preparation of sort
359           keys (the normalization is executed after preprocess).
360
           Any form name accepted by "Unicode::Normalize::normalize()" may
           be given as $normalization_form.  Acceptable names include 'NFD',
           'NFC', 'NFKD', and 'NFKC'.  See "Unicode::Normalize::normalize()"
           for details.  If omitted, 'NFD' is used.
365
366           "normalization" is performed after "preprocess" (if defined).
367
368           Furthermore, special values, "undef" and "prenormalized", can be
369           used, though they are not concerned with
370           "Unicode::Normalize::normalize()".
371
           If "undef" (not the string "undef") is passed explicitly as the
           value for this key, no normalization is carried out (this may
           make tailoring easier if normalization is not desired).  Under
           "(normalization => undef)", only contiguous contractions are
           resolved; e.g. even if "A-ring" (and "A-ring-cedilla") is ordered
           after "Z", "A-cedilla-ring" would be primary equal to "A".  In
           this respect, "(normalization => undef, preprocess => sub {
           NFD(shift) })" is not equivalent to "(normalization => 'NFD')".

           In the case of "(normalization => "prenormalized")", no
           normalization is performed, but discontiguous contractions with
           combining characters are still resolved.  Therefore
           "(normalization => 'prenormalized', preprocess => sub { NFD(shift)
           })" is equivalent to "(normalization => 'NFD')".  If source
           strings are already properly normalized, "(normalization =>
           'prenormalized')" may save the time spent on normalization.

           Unless "(normalization => undef)" is used, Unicode::Normalize is
           required (see also CAVEAT).
391
392       overrideCJK
393           -- see 7.1 Derived Collation Elements, UTS #10.
394
395           By default, CJK unified ideographs are ordered in Unicode codepoint
396           order, but those in the CJK Unified Ideographs block are less than
397           those in the CJK Unified Ideographs Extension A etc.
398
399               In the CJK Unified Ideographs block:
400               U+4E00..U+9FA5 if UCA_Version is 8, 9 or 11.
401               U+4E00..U+9FBB if UCA_Version is 14 or 16.
402               U+4E00..U+9FC3 if UCA_Version is 18.
403               U+4E00..U+9FCB if UCA_Version is 20 or 22.
404               U+4E00..U+9FCC if UCA_Version is 24 to 30.
405               U+4E00..U+9FD5 if UCA_Version is 32 or 34.
406               U+4E00..U+9FEA if UCA_Version is 36.
407
408               In the CJK Unified Ideographs Extension blocks:
409               Ext.A (U+3400..U+4DB5) and Ext.B (U+20000..U+2A6D6) in any UCA_Version.
410               Ext.C (U+2A700..U+2B734) if UCA_Version is 20 or later.
411               Ext.D (U+2B740..U+2B81D) if UCA_Version is 22 or later.
412               Ext.E (U+2B820..U+2CEA1) if UCA_Version is 32 or later.
413               Ext.F (U+2CEB0..U+2EBE0) if UCA_Version is 36.
414
415           Through "overrideCJK", ordering of CJK unified ideographs
416           (including extensions) can be overridden.
417
418           ex. CJK unified ideographs in the JIS code point order.
419
420             overrideCJK => sub {
421                 my $u = shift;             # get a Unicode codepoint
422                 my $b = pack('n', $u);     # to UTF-16BE
423                 my $s = your_unicode_to_sjis_converter($b); # convert
424                 my $n = unpack('n', $s);   # convert sjis to short
425                 [ $n, 0x20, 0x2, $u ];     # return the collation element
426             },
427
428           The return value may be an arrayref of 1st to 4th weights as shown
429           above. The return value may be an integer as the primary weight as
430           shown below.  If "undef" is returned, the default derived collation
431           element will be used.
432
433             overrideCJK => sub {
434                 my $u = shift;             # get a Unicode codepoint
435                 my $b = pack('n', $u);     # to UTF-16BE
436                 my $s = your_unicode_to_sjis_converter($b); # convert
437                 my $n = unpack('n', $s);   # convert sjis to short
438                 return $n;                 # return the primary weight
439             },
440
441           The return value may be a list containing zero or more of an
442           arrayref, an integer, or "undef".
443
444           ex. ignores all CJK unified ideographs.
445
446             overrideCJK => sub {()}, # CODEREF returning empty list
447
448              # where ->eq("Pe\x{4E00}rl", "Perl") is true
449              # as U+4E00 is a CJK unified ideograph and to be ignorable.
450
451           If a false value (including "undef") is passed, "overrideCJK" has
452           no effect.  "$Collator->change(overrideCJK => 0)" resets the old
453           one.
454
455           But assignment of weight for CJK unified ideographs in "table" or
456           "entry" is still valid.  If "undef" is passed explicitly as the
457           value for this key, weights for CJK unified ideographs are treated
458           as undefined.  However when "UCA_Version" > 8, "(overrideCJK =>
459           undef)" has no special meaning.
460
461           Note: In addition to them, 12 CJK compatibility ideographs
462           ("U+FA0E", "U+FA0F", "U+FA11", "U+FA13", "U+FA14", "U+FA1F",
463           "U+FA21", "U+FA23", "U+FA24", "U+FA27", "U+FA28", "U+FA29") are
464           also treated as CJK unified ideographs. But they can't be
465           overridden via "overrideCJK" when you use DUCET, as the table
466           includes weights for them. "table" or "entry" has priority over
467           "overrideCJK".
468
469       overrideHangul
470           -- see 7.1 Derived Collation Elements, UTS #10.
471
472           By default, Hangul syllables are decomposed into Hangul Jamo, even
473           if "(normalization => undef)".  But the mapping of Hangul syllables
474           may be overridden.
475
476           This parameter works like "overrideCJK", so see there for examples.
477
478           If you want to override the mapping of Hangul syllables, NFD and
479           NFKD are not appropriate, since NFD and NFKD will decompose Hangul
480           syllables before overriding. FCD may decompose Hangul syllables as
481           the case may be.
482
483           If a false value (but not "undef") is passed, "overrideHangul" has
484           no effect.  "$Collator->change(overrideHangul => 0)" resets the old
485           one.
486
487           If "undef" is passed explicitly as the value for this key, weight
488           for Hangul syllables is treated as undefined without decomposition
489           into Hangul Jamo.  But definition of weight for Hangul syllables in
490           "table" or "entry" is still valid.
491
492       overrideOut
493           -- see 7.1.1 Handling Ill-Formed Code Unit Sequences, UTS #10.
494
495           Perl seems to allow out-of-range values (greater than 0x10FFFF).
496           By default, out-of-range values are replaced with "U+FFFD"
497           (REPLACEMENT CHARACTER) when "UCA_Version" >= 22, or ignored when
498           "UCA_Version" <= 20.
499
           When "UCA_Version" >= 22, the weights of out-of-range values can be
           overridden.  Although "table" or "entry" could be used for them,
           there are too many out-of-range values to list individually.

           "overrideOut" lets you assign their weights algorithmically.  This
           parameter works like "overrideCJK", so see there for examples.
506
507           ex. ignores all out-of-range values.
508
509             overrideOut => sub {()}, # CODEREF returning empty list
510
511           If a false value (including "undef") is passed, "overrideOut" has
512           no effect.  "$Collator->change(overrideOut => 0)" resets the old
513           one.
514
515           NOTE ABOUT U+FFFD:
516
           UCA recommends that out-of-range values should not be ignored for
           security reasons. Say, "pe\x{110000}rl" should not be equal to
           "perl".  However, "U+FFFD" is wrongly mapped to a variable
           collation element in DUCET for Unicode 6.0.0 to 6.2.0, which means
           out-of-range values will be ignored when "variable" isn't
           "Non-ignorable".

           The mapping of "U+FFFD" is corrected in Unicode 6.3.0.  See
           <http://www.unicode.org/reports/tr10/tr10-28.html#Trailing_Weights>
           (7.1.4 Trailing Weights).  The following reproduces that
           correction.
527
528             overrideOut => sub { 0xFFFD }, # CODEREF returning a very large integer
529
530           This workaround is unnecessary since Unicode 6.3.0.
531
532       preprocess
533           -- see 5.4 Preprocessing, UTS #10.
534
535           If specified, the coderef is used to preprocess each string before
536           the formation of sort keys.
537
538           ex. dropping English articles, such as "a" or "the".  Then, "the
539           pen" is before "a pencil".
540
541                preprocess => sub {
542                      my $str = shift;
543                      $str =~ s/\b(?:an?|the)\s+//gi;
544                      return $str;
545                   },
546
547           "preprocess" is performed before "normalization" (if defined).
548
549           ex. decoding strings in a legacy encoding such as shift-jis:
550
551               $sjis_collator = Unicode::Collate->new(
552                   preprocess => \&your_shiftjis_to_unicode_decoder,
553               );
554               @result = $sjis_collator->sort(@shiftjis_strings);
555
556           Note: Strings returned from the coderef will be interpreted
557           according to Perl's Unicode support. See perlunicode, perluniintro,
558           perlunitut, perlunifaq, utf8.
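
           The article-dropping example above can be run end to end as
           follows.  This is a minimal sketch; the title list is made up for
           illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new(
    preprocess => sub {
        my $str = shift;
        $str =~ s/\b(?:an?|the)\s+//gi;   # drop English articles
        return $str;
    },
);

my @titles = ("a pencil", "the pen", "an eraser");

# Sort keys are built from the preprocessed strings,
# but sort() returns the original strings.
my @sorted = $collator->sort(@titles);
print "$_\n" for @sorted;
```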
559
560       rearrange
561           -- see 3.5 Rearrangement, UTS #10.
562
563           Characters that are not coded in logical order and to be
564           rearranged.  If "UCA_Version" is equal to or less than 11, default
565           is:
566
567               rearrange => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4 ],
568
569           If you want to disallow any rearrangement, pass "undef" or "[]" (a
570           reference to empty list) as the value for this key.
571
572           If "UCA_Version" is equal to or greater than 14, default is "[]"
573           (i.e. no rearrangement).
574
           According to version 9 of UCA, this parameter shall not be
           used; however, no warning is issued at present.
577
578       rewrite
579           If specified, the coderef is used to rewrite lines in "table" or
580           "entry".  The coderef will get each line, and then should return a
581           rewritten line according to the UCA file format.  If the coderef
582           returns an empty line, the line will be skipped.
583
           e.g. turning all primary ignorable characters into tertiary
           ignorable ones:
585
586               rewrite => sub {
587                   my $line = shift;
588                   $line =~ s/\[\.0000\..{4}\..{4}\./[.0000.0000.0000./g;
589                   return $line;
590               },
591
592           This example shows rewriting weights. "rewrite" is allowed to
593           affect code points, weights, and the name.
594
595           NOTE: "table" is available to use another table file; preparing a
596           modified table once would be more efficient than rewriting lines on
597           reading an unmodified table every time.
598
599       suppress
600           -- see 3.12 Special-Purpose Commands, UTS #35 (LDML) Part 5:
601           Collation.
602
603           Contractions beginning with the specified characters are
604           suppressed, even if those contractions are defined in "table".
605
606           An example for Russian and some languages using the Cyrillic
607           script:
608
609               suppress => [0x0400..0x0417, 0x041A..0x0437, 0x043A..0x045F],
610
611           where 0x0400 stands for "U+0400", CYRILLIC CAPITAL LETTER IE WITH
612           GRAVE.
613
614           NOTE: Contractions via "entry" will not be suppressed.
615
616       table
617           -- see 3.8 Default Unicode Collation Element Table, UTS #10.
618
619           You can use another collation element table if desired.
620
           The table file should be located in the Unicode/Collate directory
           on @INC. Say, if the filename is Foo.txt, the table file is
           searched for as Unicode/Collate/Foo.txt in @INC.

           By default, allkeys.txt (as the filename of DUCET) is used.  If you
           prepare your own table file, a name other than allkeys.txt is
           preferable, to avoid a namespace conflict.

           NOTE: When XSUB is used, the DUCET is compiled when this module is
           built, which may save time at run time.  Explicitly specifying
           "(table => 'allkeys.txt')", or using another table, or using
           "ignoreChar", "ignoreName", "undefChar", "undefName" or "rewrite"
           will prevent this module from using the compiled DUCET.
634
635           If "undef" is passed explicitly as the value for this key, no file
636           is read (but you can define collation elements via "entry").
637
           A typical way to define a collation element table without any
           table file:
640
641              $onlyABC = Unicode::Collate->new(
642                  table => undef,
643                  entry => << 'ENTRIES',
644           0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
645           0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
646           0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
647           0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
648           0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
649           0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
650           ENTRIES
651               );
652
653           If "ignoreName" or "undefName" is used, character names should be
654           specified as a comment (following "#") on each line.
655
656       undefChar
657       undefName
658           -- see 6.3.3 Reducing the Repertoire, UTS #10.
659
           Undefines the collation element as if it were unassigned in the
           "table".  This reduces the size of the table.  If an unassigned
           character appears in the string to be collated, the sort key is
           made from its codepoint as a single-character collation element,
           so that it is greater than any assigned collation element (and
           in codepoint order among the unassigned characters).  However, it
           may be better to ignore characters that are unfamiliar to you and
           probably never used.
667
668           Through "undefChar", any character matching "qr/$undefChar/" will
669           be undefined. Through "undefName", any character whose name (given
670           in the "table" file as a comment) matches "qr/$undefName/" will be
671           undefined.
672
673           e.g. Collation weights for characters beyond the BMP are not
674           stored in the object:
675
676               undefChar => qr/[^\0-\x{fffd}]/,
677
678       upper_before_lower
679           -- see 6.6 Case Comparisons, UTS #10.
680
681           By default, lowercase is before uppercase.  If the parameter is
682           made true, this is reversed.
683
684           NOTE: This parameter naively assumes that any
685           lowercase/uppercase distinctions occur at level 3, and that their
686           weights at level 3 are the same as those mentioned in 7.3.1, UTS
687           #10.  If you define collation elements that differ from this
688           requirement, this parameter does not work correctly.
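
              A short sketch of the effect, assuming the default table; the
              variable names and the word list are illustrative only:

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = qw(B a A b);

my $lower_first = Unicode::Collate->new();                         # default
my $upper_first = Unicode::Collate->new(upper_before_lower => 1);

my @lower = $lower_first->sort(@words);
my @upper = $upper_first->sort(@words);

print "default:            @lower\n";   # a A b B
print "upper_before_lower: @upper\n";   # A a B b
```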
689
690       variable
691           -- see 3.6 Variable Weighting, UTS #10.
692
693           This key allows for variable weighting of variable collation
694           elements, which are marked with an ASTERISK in the table (NOTE:
695           Many punctuation marks and symbols are variable in allkeys.txt).
696
697              variable => 'blanked', 'non-ignorable', 'shifted', or 'shift-trimmed'.
698
699           These names are case-insensitive.  By default (if specification is
700           omitted), 'shifted' is adopted.
701
702              'Blanked'        Variable elements are made ignorable at levels 1 through 3;
703                               considered at the 4th level.
704
705              'Non-Ignorable'  Variable elements are not reset to ignorable.
706
707              'Shifted'        Variable elements are made ignorable at levels 1 through 3;
708                               their level 4 weight is replaced by the old level 1 weight.
709                               Level 4 weight for Non-Variable elements is 0xFFFF.
710
711              'Shift-Trimmed'  Same as 'shifted', but all FFFF's at the 4th level
712                               are trimmed.
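
              To illustrate the difference, the sketch below compares a word
              with and without hyphens at level 3.  It assumes the default
              table, in which HYPHEN-MINUS is a variable element; the variable
              names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

# 'shifted' is the default, so it need not be spelled out.
my $shifted       = Unicode::Collate->new(level => 3);
my $non_ignorable = Unicode::Collate->new(level => 3, variable => 'non-ignorable');

for my $pair ([ 'shifted', $shifted ], [ 'non-ignorable', $non_ignorable ]) {
    my ($name, $collator) = @$pair;
    # With 'shifted' the hyphens are ignorable at levels 1 through 3;
    # with 'non-ignorable' they keep their primary weight.
    my $result = $collator->eq("camel", "c-a-m-e-l") ? "equal" : "different";
    print "$name: $result\n";
}
```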
713
714   Methods for Collation
715       "@sorted = $Collator->sort(@not_sorted)"
716           Sorts a list of strings.
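
              As a sketch, the following contrasts the collator's order with
              Perl's built-in code point order.  The word list is only an
              example and the default table is assumed.

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';   # the sample strings contain non-ASCII characters

my $Collator = Unicode::Collate->new();

# "cote" with and without combining accents (U+0301 acute, U+0302 circumflex)
my @words = ("co\x{302}te", "Cote", "cote\x{301}", "cote");

my @uca     = $Collator->sort(@words);   # accents outrank case: cote, Cote, then the accented forms
my @builtin = sort @words;               # code point order: "Cote" comes first

print "UCA:      @uca\n";
print "built-in: @builtin\n";
```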
717
718       "$result = $Collator->cmp($a, $b)"
719           Returns 1 (when $a is greater than $b) or 0 (when $a is equal to
720           $b) or -1 (when $a is less than $b).
721
722       "$result = $Collator->eq($a, $b)"
723       "$result = $Collator->ne($a, $b)"
724       "$result = $Collator->lt($a, $b)"
725       "$result = $Collator->le($a, $b)"
726       "$result = $Collator->gt($a, $b)"
727       "$result = $Collator->ge($a, $b)"
728           They work like the string operators of the same names.
729
730              eq : whether $a is equal to $b.
731              ne : whether $a is not equal to $b.
732              lt : whether $a is less than $b.
733              le : whether $a is less than $b or equal to $b.
734              gt : whether $a is greater than $b.
735              ge : whether $a is greater than $b or equal to $b.
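
              A brief sketch of these comparison methods, assuming the default
              table; the collator names are illustrative:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $default = Unicode::Collate->new();            # all levels
my $primary = Unicode::Collate->new(level => 1);  # base letters only

my $uca_order  = $default->cmp("a", "B");   # -1: "a" collates before "B"
my $code_order = "a" cmp "B";               #  1: code point order disagrees

# Case is a level 3 difference, so it only matters to $default.
my $eq_default = $default->eq("perl", "PERL") ? "yes" : "no";   # no
my $eq_primary = $primary->eq("perl", "PERL") ? "yes" : "no";   # yes

print "cmp: $uca_order, built-in cmp: $code_order\n";
print "eq at all levels: $eq_default, eq at level 1: $eq_primary\n";
print "lt: ", ($default->lt("a", "B") ? "yes" : "no"), "\n";    # yes
```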
736
737       "$sortKey = $Collator->getSortKey($string)"
738           -- see 4.3 Form Sort Key, UTS #10.
739
740           Returns a sort key.
741
742           Comparing sort keys with a binary comparison gives the result of
743           comparing the original strings using the UCA.
744
745              $Collator->getSortKey($a) cmp $Collator->getSortKey($b)
746
747                 is equivalent to
748
749              $Collator->cmp($a, $b)
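
              One use of this equivalence is to compute each sort key once and
              then sort on the keys.  The sketch below does so and compares
              the result with sort(); the word list is illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new();
my @words    = qw(banana Apple cherry apple Banana);

# Pair each word with its sort key, sort on the keys with the
# built-in binary cmp, then drop the keys again.
my @by_key = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ $Collator->getSortKey($_), $_ ] } @words;

my @direct = $Collator->sort(@words);

print "by key: @by_key\n";   # apple Apple banana Banana cherry
print "direct: @direct\n";   # same order
```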
750
751       "$sortKeyForm = $Collator->viewSortKey($string)"
752           Returns the sort key of $string in a human-readable form.  If
753           "UCA_Version" is 8, the output is slightly different.
754
755              use Unicode::Collate;
756              my $c = Unicode::Collate->new();
757              print $c->viewSortKey("Perl"),"\n";
758
759              # output:
760              # [0B67 0A65 0B7F 0B03 | 0020 0020 0020 0020 | 0008 0002 0002 0002 | FFFF FFFF FFFF FFFF]
761              #  Level 1               Level 2               Level 3               Level 4
762
763   Methods for Searching
764       The "match", "gmatch", "subst", "gsubst" methods work like "m//",
765       "m//g", "s///", "s///g", respectively, except that they do not
766       recognize patterns; they match only a literal substring.
767
768       DISCLAIMER: If the "preprocess" or "normalization" parameter is true
769       for $Collator, calling these methods ("index", "match", "gmatch",
770       "subst", "gsubst") croaks, as the position and the length might
771       differ from those on the specified string.
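
          The following sketch shows the consequence for a collator created
          with default parameters, where normalization is in effect.  The
          exact error message is not assumed here, only that the call dies.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $default = Unicode::Collate->new();                        # normalization is on by default
my $plain   = Unicode::Collate->new(normalization => undef);  # searching is allowed

# eval traps the croak; $pos_default stays undefined when the call dies.
my $pos_default = eval { $default->index("abc", "b") };
print defined $pos_default ? "found at $pos_default\n" : "croaked: $@";

my $pos_plain = $plain->index("abc", "b");   # scalar context: position only
print "found at $pos_plain\n";               # found at 1
```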
772
773       The "rearrange" and "hangul_terminator" parameters are ignored.
774       "katakana_before_hiragana" and "upper_before_lower" don't affect
775       matching and searching, since only equality matters, not ordering.
776
777       "$position = $Collator->index($string, $substring[, $position])"
778       "($position, $length) = $Collator->index($string, $substring[,
779       $position])"
780           If $substring matches a part of $string, returns the position of
781           the first occurrence of the matching part in scalar context; in
782           list context, returns a two-element list of the position and the
783           length of the matching part.
784
785           If $substring does not match any part of $string, returns "-1" in
786           scalar context and an empty list in list context.
787
788           e.g. when the content of $str is "Ich muß studieren Perl.", you
789           say the following where $sub is "MüSS",
790
791             my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
792                                                # (normalization => undef) is REQUIRED.
793             my $match;
794             if (my($pos,$len) = $Collator->index($str, $sub)) {
795                 $match = substr($str, $pos, $len);
796             }
797
798           and get "muß" in $match, since "muß" is primary equal to
799           "MüSS".
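
              A self-contained sketch of both calling contexts and of the
              optional $position argument follows.  The sample string is
              illustrative; the expected values assume the default table.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new(normalization => undef, level => 1);

# The first "camel" carries a combining acute accent (U+0301) on the e.
my $str = "one came\x{301}l, two camels";

# List context: position and length of the first match.
my ($pos, $len) = $Collator->index($str, "CAMEL");
print "first match at $pos, length $len\n";    # first match at 4, length 6

# Scalar context with a start position: search after the first match.
my $next = $Collator->index($str, "CAMEL", $pos + $len);
print "next match at $next\n";                 # next match at 16

# No match: -1 in scalar context.
my $missing = $Collator->index($str, "zebra");
print "zebra: $missing\n";                     # zebra: -1
```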
800
801       "$match_ref = $Collator->match($string, $substring)"
802       "($match)   = $Collator->match($string, $substring)"
803           If $substring matches a part of $string, in scalar context, returns
804           a reference to the first occurrence of the matching part
805           ($match_ref is always true if matches, since every reference is
806           true); in list context, returns the first occurrence of the
807           matching part.
808
809           If $substring does not match any part of $string, returns "undef"
810           in scalar context and an empty list in list context.
811
812           e.g.
813
814               if ($match_ref = $Collator->match($str, $sub)) { # scalar context
815                   print "matches [$$match_ref].\n";
816               } else {
817                   print "doesn't match.\n";
818               }
819
820                or
821
822               if (($match) = $Collator->match($str, $sub)) { # list context
823                   print "matches [$match].\n";
824               } else {
825                   print "doesn't match.\n";
826               }
827
828       "@match = $Collator->gmatch($string, $substring)"
829           If $substring matches a part of $string, returns all the matching
830           parts (or the number of matches in scalar context).
831
832           If $substring does not match any part of $string, returns an empty
833           list.
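
              For instance, a minimal sketch with an illustrative string:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new(normalization => undef, level => 1);
my $str      = "Camel donkey CAMEL horse camel";

my @found = $Collator->gmatch($str, "camel");   # list context: the matching parts
my $count = $Collator->gmatch($str, "camel");   # scalar context: how many
my @none  = $Collator->gmatch($str, "zebra");   # no match: empty list

print "found: @found\n";                  # found: Camel CAMEL camel
print "count: $count\n";                  # count: 3
print "zebra: ", scalar(@none), "\n";     # zebra: 0
```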
834
835       "$count = $Collator->subst($string, $substring, $replacement)"
836           If $substring matches a part of $string, the first occurrence of
837           the matching part is replaced by $replacement ($string is modified)
838           and $count (always equal to 1) is returned.
839
840           $replacement can be a "CODEREF", taking the matching part as an
841           argument, and returning a string to replace the matching part (a
842           bit similar to "s/(..)/$coderef->($1)/e").
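
              A small sketch of both forms of $replacement, using illustrative
              strings:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new(normalization => undef, level => 1);

# Plain string replacement: only the first occurrence changes.
my $plain = "Camel donkey CAMEL";
my $count = $Collator->subst($plain, "camel", "horse");
print "$count: $plain\n";     # 1: horse donkey CAMEL

# Code reference: receives the matched text and returns its replacement.
my $wrapped = "Camel donkey CAMEL";
$Collator->subst($wrapped, "camel", sub { "<$_[0]>" });
print "$wrapped\n";           # <Camel> donkey CAMEL
```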
843
844       "$count = $Collator->gsubst($string, $substring, $replacement)"
845           If $substring matches a part of $string, all the occurrences of the
846           matching part are replaced by $replacement ($string is modified)
847           and $count is returned.
848
849           $replacement can be a "CODEREF", taking the matching part as an
850           argument, and returning a string to replace the matching part (a
851           bit similar to "s/(..)/$coderef->($1)/eg").
852
853           e.g.
854
855             my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
856                                                # (normalization => undef) is REQUIRED.
857             my $str = "Camel donkey zebra came\x{301}l CAMEL horse cam\0e\0l...";
858             $Collator->gsubst($str, "camel", sub { "<b>$_[0]</b>" });
859
860             # now $str is "<b>Camel</b> donkey zebra <b>came\x{301}l</b> <b>CAMEL</b> horse <b>cam\0e\0l</b>...";
861             # i.e., all the camels are made bold-faced.
862
863              Examples: levels and ignore_level2 - what does camel match?
864             ---------------------------------------------------------------------------
865              level  ignore_level2  |  camel  Camel  came\x{301}l  c-a-m-e-l  cam\0e\0l
866             -----------------------|---------------------------------------------------
867                1        false      |   yes    yes      yes          yes        yes
868                2        false      |   yes    yes      no           yes        yes
869                3        false      |   yes    no       no           yes        yes
870                4        false      |   yes    no       no           no         yes
871             -----------------------|---------------------------------------------------
872                1        true       |   yes    yes      yes          yes        yes
873                2        true       |   yes    yes      yes          yes        yes
874                3        true       |   yes    no       yes          yes        yes
875                4        true       |   yes    no       yes          no         yes
876             ---------------------------------------------------------------------------
877              note: if variable => non-ignorable, camel doesn't match c-a-m-e-l
878                    at any level.
879
880   Other Methods
881       "%old_tailoring = $Collator->change(%new_tailoring)"
882       "$modified_collator = $Collator->change(%new_tailoring)"
883           Changes the value of specified keys and returns the changed part.
884
885               $Collator = Unicode::Collate->new(level => 4);
886
887               $Collator->eq("perl", "PERL"); # false
888
889               %old = $Collator->change(level => 2); # returns (level => 4).
890
891               $Collator->eq("perl", "PERL"); # true
892
893               $Collator->change(%old); # returns (level => 2).
894
895               $Collator->eq("perl", "PERL"); # false
896
897           Not all "(key,value)"s are allowed to be changed.  See also
898           @Unicode::Collate::ChangeOK and @Unicode::Collate::ChangeNG.
899
900           In scalar context, returns the modified collator (the same
901           object, not a clone of the original).
902
903               $Collator->change(level => 2)->eq("perl", "PERL"); # true
904
905               $Collator->eq("perl", "PERL"); # true; now max level is 2nd.
906
907               $Collator->change(level => 4)->eq("perl", "PERL"); # false
908
909       "$version = $Collator->version()"
910           Returns the version number (a string) of the Unicode Standard which
911           the "table" file used by the collator object is based on.  If the
912           table does not include a version line (starting with @version),
913           returns "unknown".
914
915       "UCA_Version()"
916           Returns the revision number of UTS #10 this module consults,
917           which should correspond to the incorporated DUCET.
918
919       "Base_Unicode_Version()"
920           Returns the version number of the Unicode Standard this module
921           is based on, which should correspond to the incorporated DUCET.
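
              A sketch that prints all three values.  The output depends on
              the installed release of the module and its table, so none is
              shown here.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new();

# version() is an object method; the other two do not need an object.
my $table_version = $Collator->version();
my $uca_revision  = Unicode::Collate::UCA_Version();
my $base_version  = Unicode::Collate::Base_Unicode_Version();

print "table version: $table_version\n";
print "UCA_Version: $uca_revision\n";
print "Base_Unicode_Version: $base_version\n";
```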
922

EXPORT

924       No method will be exported.
925

INSTALL

927       Though this module can be used without any "table" file, to use this
928       module easily, it is recommended to install a table file in the UCA
929       format, by copying it under the directory <a place in
930       @INC>/Unicode/Collate.
931
932       The most preferable one is "The Default Unicode Collation Element
933       Table" (aka DUCET), available from the Unicode Consortium's website:
934
935          http://www.unicode.org/Public/UCA/
936
937          http://www.unicode.org/Public/UCA/latest/allkeys.txt
938          (latest version)
939
940       If DUCET is not installed, it is recommended to copy the file from
941       http://www.unicode.org/Public/UCA/latest/allkeys.txt to <a place in
942       @INC>/Unicode/Collate/allkeys.txt manually.
943

CAVEATS

945       Normalization
946           Use of the "normalization" parameter requires the
947           Unicode::Normalize module (see Unicode::Normalize).
948
949           If you do not need it (say, when you need not handle any
950           combining characters), assign "(normalization => undef)"
951           explicitly.
952
953           -- see 6.5 Avoiding Normalization, UTS #10.
954
955       Conformance Test
956           The Conformance Test for the UCA is available under
957           <http://www.unicode.org/Public/UCA/>.
958
959           For CollationTest_SHIFTED.txt, a collator via
960           "Unicode::Collate->new( )" should be used; for
961           CollationTest_NON_IGNORABLE.txt, a collator via
962           "Unicode::Collate->new(variable => "non-ignorable", level => 3)".
963
964           If "UCA_Version" is 26 or later, the "identical" level is
965           preferred; "Unicode::Collate->new(identical => 1)" and
966           "Unicode::Collate->new(identical => 1, variable =>
967           "non-ignorable", level => 3)" should be used.
968
969           Unicode::Normalize is required to run the Conformance Test.
970

AUTHOR, COPYRIGHT AND LICENSE

972       The Unicode::Collate module for perl was written by SADAHIRO Tomoyuki,
973       <SADAHIRO@cpan.org>. This module is Copyright(C) 2001-2018, SADAHIRO
974       Tomoyuki. Japan. All rights reserved.
975
976       This module is free software; you can redistribute it and/or modify it
977       under the same terms as Perl itself.
978
979       The file Unicode/Collate/allkeys.txt was copied verbatim from
980       <http://www.unicode.org/Public/UCA/9.0.0/allkeys.txt>.  For this file,
981       Copyright (c) 2016 Unicode, Inc.; distributed under the Terms of Use in
982       <http://www.unicode.org/terms_of_use.html>
983