Collate(3)        User Contributed Perl Documentation        Collate(3)

6 Unicode::Collate - Unicode Collation Algorithm
7
9 use Unicode::Collate;
10
11 #construct
12 $Collator = Unicode::Collate->new(%tailoring);
13
14 #sort
15 @sorted = $Collator->sort(@not_sorted);
16
17 #compare
18 $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.
19
Note: Strings in @not_sorted, $a and $b are interpreted according to
Perl's Unicode support. See perlunicode, perluniintro, perlunitut,
perlunifaq, utf8. If your strings are not yet decoded, either use
"preprocess" or decode them beforehand.
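
As a minimal sketch of the synopsis in runnable form (the word list and variable names are illustrative, not part of the module), the following sorts a few ASCII strings with the default collator and compares two of them:

```perl
use strict;
use warnings;
use Unicode::Collate;

# Default collator: no tailoring parameters.
my $Collator = Unicode::Collate->new();

# Sort a small, illustrative word list.
my @sorted = $Collator->sort(qw(banana Apple cherry apple));
print join(' ', @sorted), "\n";    # apple Apple banana cherry

# cmp() returns 1, 0, or -1; lowercase sorts before uppercase by default.
my $result = $Collator->cmp('apple', 'Apple');
print "cmp: $result\n";            # cmp: -1
```

Unlike Perl's built-in sort, the case difference is only a level 3 tie-breaker here, so "Apple" stays next to "apple" rather than moving ahead of all lowercase words.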
24
26 This module is an implementation of Unicode Technical Standard #10
27 (a.k.a. UTS #10) - Unicode Collation Algorithm (a.k.a. UCA).
28
29 Constructor and Tailoring
30 The "new" method returns a collator object. If new() is called with no
31 parameters, the collator should do the default collation.
32
33 $Collator = Unicode::Collate->new(
34 UCA_Version => $UCA_Version,
35 alternate => $alternate, # alias for 'variable'
36 backwards => $levelNumber, # or \@levelNumbers
37 entry => $element,
38 hangul_terminator => $term_primary_weight,
39 highestFFFF => $bool,
40 identical => $bool,
41 ignoreName => qr/$ignoreName/,
42 ignoreChar => qr/$ignoreChar/,
43 ignore_level2 => $bool,
44 katakana_before_hiragana => $bool,
45 level => $collationLevel,
46 long_contraction => $bool,
47 minimalFFFE => $bool,
48 normalization => $normalization_form,
49 overrideCJK => \&overrideCJK,
50 overrideHangul => \&overrideHangul,
51 preprocess => \&preprocess,
52 rearrange => \@charList,
53 rewrite => \&rewrite,
54 suppress => \@charList,
55 table => $filename,
56 undefName => qr/$undefName/,
57 undefChar => qr/$undefChar/,
58 upper_before_lower => $bool,
59 variable => $variable,
60 );
61
62 UCA_Version
63 If the revision (previously "tracking version") number of UCA is
64 given, behavior of that revision is emulated on collating. If
65 omitted, the return value of "UCA_Version()" is used.
66
67 The following revisions are supported. The default is 36.
68
UCA       Unicode Standard         DUCET (@version)
-------------------------------------------------------
 8        3.1                      3.0.1 (3.0.1d9)
 9        3.1 with Corrigendum 3   3.1.1 (3.1.1)
11        4.0                      4.0.0 (4.0.0)
14        4.1.0                    4.1.0 (4.1.0)
16        5.0                      5.0.0 (5.0.0)
18        5.1.0                    5.1.0 (5.1.0)
20        5.2.0                    5.2.0 (5.2.0)
22        6.0.0                    6.0.0 (6.0.0)
24        6.1.0                    6.1.0 (6.1.0)
26        6.2.0                    6.2.0 (6.2.0)
28        6.3.0                    6.3.0 (6.3.0)
30        7.0.0                    7.0.0 (7.0.0)
32        8.0.0                    8.0.0 (8.0.0)
34        9.0.0                    9.0.0 (9.0.0)
36        10.0.0                   10.0.0 (10.0.0)
86
87 * See below for "long_contraction" with "UCA_Version" 22 and 24.
88
89 * Noncharacters (e.g. U+FFFF) are not ignored, and can be
90 overridden since "UCA_Version" 22.
91
92 * Out-of-range codepoints (greater than U+10FFFF) are not ignored,
93 and can be overridden since "UCA_Version" 22.
94
95 * Fully ignorable characters were ignored, and would not interrupt
96 contractions with "UCA_Version" 9 and 11.
97
98 * Treatment of ignorables after variables and some behaviors were
99 changed at "UCA_Version" 9.
100
101 * Characters regarded as CJK unified ideographs (cf. "overrideCJK")
102 depend on "UCA_Version".
103
* Many hangul jamo are assigned at "UCA_Version" 20, which
  affects "hangul_terminator".
106
107 alternate
108 -- see 3.2.2 Alternate Weighting, version 8 of UTS #10
109
110 For backward compatibility, "alternate" (old name) can be used as
111 an alias for "variable".
112
113 backwards
114 -- see 3.4 Backward Accents, UTS #10.
115
116 backwards => $levelNumber or \@levelNumbers
117
Compares weights in reverse order at the given level(s); e.g.
level 2 (diacritic ordering) in French. If omitted (or if
$levelNumber is "undef" or "\@levelNumbers" is "[]"), weights are
compared forwards at all levels.
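
The effect on French-style accent ordering can be sketched as follows. The four words are a conventional illustration rather than something taken from this document, and the accented letters are written as escapes so that the source encoding does not matter:

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

# cote, cote with circumflex and/or acute (U+00F4 = o-circumflex, U+00E9 = e-acute)
my @words = ("c\x{F4}t\x{E9}", "cot\x{E9}", "c\x{F4}te", "cote");

my $forward = Unicode::Collate->new();                 # forwards at all levels
my $french  = Unicode::Collate->new(backwards => 2);   # level 2 compared backwards

my @fwd = $forward->sort(@words);
my @bwd = $french->sort(@words);

print "forwards:  @fwd\n";   # cote coté côte côté
print "backwards: @bwd\n";   # cote côte coté côté
```

With level 2 reversed, the accent nearest the end of the word decides first, which is why the word with the circumflex moves ahead of the one with the final acute.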
121
122 entry
123 -- see 5 Tailoring; 9.1 Allkeys File Format, UTS #10.
124
125 If the same character (or a sequence of characters) exists in the
126 collation element table through "table", mapping to collation
127 elements is overridden. If it does not exist, the mapping is
128 defined additionally.
129
130 entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
131 0063 0068 ; [.0E6A.0020.0002.0063] # ch
132 0043 0068 ; [.0E6A.0020.0007.0043] # Ch
133 0043 0048 ; [.0E6A.0020.0008.0043] # CH
134 006C 006C ; [.0F4C.0020.0002.006C] # ll
135 004C 006C ; [.0F4C.0020.0007.004C] # Ll
136 004C 004C ; [.0F4C.0020.0008.004C] # LL
137 00F1 ; [.0F7B.0020.0002.00F1] # n-tilde
138 006E 0303 ; [.0F7B.0020.0002.00F1] # n-tilde
139 00D1 ; [.0F7B.0020.0008.00D1] # N-tilde
140 004E 0303 ; [.0F7B.0020.0008.00D1] # N-tilde
141 ENTRY
142
143 entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
144 00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae ligature as <a><e>
145 00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE ligature as <A><E>
146 ENTRY
147
NOTE: The code point in the UCA file format (before ';') must be a
Unicode code point (written in hexadecimal), not a native code
point. So 0063 always denotes "U+0063", not the character "\x63".
152
153 Weighting may vary depending on collation element table. So ensure
154 the weights defined in "entry" will be consistent with those in the
155 collation element table loaded via "table".
156
In DUCET v4.0.0, the primary weight of "C" is "0E60" and that of
"D" is "0E6D". So setting the primary weight of "CH" to "0E6A" (a
value between "0E60" and "0E6D") yields the ordering "C < CH < D".
Strictly speaking, DUCET already has some characters between "C"
and "D": "small capital C" ("U+1D04") with primary weight "0E64",
"c-hook/C-hook" ("U+0188/U+0187") with "0E65", and "c-curl"
("U+0255") with "0E69". The primary weight "0E6A" therefore places
"CH" between "c-curl" and "D".
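
Because the weights above are tied to DUCET v4.0.0, the same idea is easier to try against a tiny self-contained table. The sketch below assumes made-up primary weights (0103 to 011A) chosen only for this illustration; it defines "ch" as a contraction that sorts after every other string beginning with "c":

```perl
use strict;
use warnings;
use Unicode::Collate;

# A toy table supplied entirely through "entry"; the weights are
# illustrative and unrelated to any real DUCET version.
my $Collator = Unicode::Collate->new(
    table => undef,
    entry => <<'ENTRY',
0063      ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
0063 0068 ; [.0104.0020.0002.0063] # ch (contraction, between c and d)
0064      ; [.0105.0020.0002.0064] # LATIN SMALL LETTER D
0068      ; [.0108.0020.0002.0068] # LATIN SMALL LETTER H
007A      ; [.011A.0020.0002.007A] # LATIN SMALL LETTER Z
ENTRY
);

my @sorted = $Collator->sort(qw(d ch cz c));
print "@sorted\n";   # c cz ch d
```

Without the contraction line, "ch" would be weighted as "c" followed by "h" and would sort before "cz".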
165
166 hangul_terminator
167 -- see 7.1.4 Trailing Weights, UTS #10.
168
169 If a true value is given (non-zero but should be positive), it will
170 be added as a terminator primary weight to the end of every
171 standard Hangul syllable. Secondary and any higher weights for
172 terminator are set to zero. If the value is false or
173 "hangul_terminator" key does not exist, insertion of terminator
174 weights will not be performed.
175
176 Boundaries of Hangul syllables are determined according to
177 conjoining Jamo behavior in the Unicode Standard and
178 HangulSyllableType.txt.
179
Implementation Note: (1) For expansion mapping (Unicode character
mapped to a sequence of collation elements), a terminator will not
be added between collation elements, even if a Hangul syllable
boundary exists there. Addition of a terminator is restricted to
the position following the last collation element.
185
186 (2) Non-conjoining Hangul letters (Compatibility Jamo, halfwidth
187 Jamo, and enclosed letters) are not automatically terminated with a
188 terminator primary weight. These characters may need terminator
189 included in a collation element table beforehand.
190
191 highestFFFF
192 -- see 2.4 Tailored noncharacter weights, UTS #35 (LDML) Part 5:
193 Collation.
194
If the parameter is made true, "U+FFFF" has the highest primary
weight. When both "$coll->ge($str, "abc")" and
"$coll->le($str, "abc\x{FFFF}")" are true, $str is expected to
begin with "abc" or a primary equivalent of it. $str may be
"abcd" or "abc012", but should not itself include "U+FFFF", as in
"abc\x{FFFF}xyz".

"$coll->le($str, "abc\x{FFFF}")" works almost like
"$coll->lt($str, "abd")", but the latter requires you to know
which letter follows "c". In a language where "ch" is the letter
after "c", "abch" is greater than "abc\x{FFFF}" but less than
"abd".
207
208 Note: This is equivalent to "(entry => 'FFFF ;
209 [.FFFE.0020.0005.FFFF]')". Any other character than "U+FFFF" can
210 be tailored by "entry".
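
The prefix test described above might be wrapped as in the following sketch. The helper name begins_with_abc and the choice of "level => 1" are assumptions made for this illustration:

```perl
use strict;
use warnings;
use Unicode::Collate;

# Primary-level comparison only, with U+FFFF given the highest primary weight.
my $Collator = Unicode::Collate->new(highestFFFF => 1, level => 1);

# Hypothetical helper: true if $str starts with "abc" or a primary equivalent.
sub begins_with_abc {
    my ($str) = @_;
    return $Collator->ge($str, "abc") && $Collator->le($str, "abc\x{FFFF}")
        ? 1 : 0;
}

for my $str (qw(abcd abc012 ABC abd ab)) {
    printf "%-7s %s\n", $str, begins_with_abc($str) ? 'yes' : 'no';
}
```

"abcd", "abc012" and "ABC" pass the test, while "abd" and "ab" do not.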
211
212 identical
213 -- see A.3 Deterministic Comparison, UTS #10.
214
215 By default, strings whose weights are equal should be equal, even
216 though their code points are not equal. Completely ignorable
217 characters are ignored.
218
219 If the parameter is made true, a final, tie-breaking level is used.
220 If no difference of weights is found after the comparison through
221 all the level specified by "level", the comparison with code points
222 will be performed. For the tie-breaking comparison, the sort key
223 has code points of the original string appended. Completely
224 ignorable characters are not ignored.
225
226 If "preprocess" and/or "normalization" is applied, the code points
227 of the string after them (in NFD by default) are used.
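
A short sketch of the difference, using a string that contains the completely ignorable character U+0000 (variable names are illustrative):

```perl
use strict;
use warnings;
use Unicode::Collate;

my $default   = Unicode::Collate->new();
my $identical = Unicode::Collate->new(identical => 1);

# U+0000 is completely ignorable, so all four weight levels are equal.
my $plain  = "camel";
my $padded = "cam\0e\0l";

my $eq_default   = $default->eq($plain, $padded)   ? 1 : 0;
my $eq_identical = $identical->eq($plain, $padded) ? 1 : 0;

print "default:   $eq_default\n";     # 1 (equal)
print "identical: $eq_identical\n";   # 0 (code points differ)
```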
228
229 ignoreChar
230 ignoreName
231 -- see 3.6 Variable Weighting, UTS #10.
232
Makes the entry in the table completely ignorable; i.e. as if the
weights were zero at all levels.
235
236 Through "ignoreChar", any character matching "qr/$ignoreChar/" will
237 be ignored. Through "ignoreName", any character whose name (given
238 in the "table" file as a comment) matches "qr/$ignoreName/" will be
239 ignored.
240
241 E.g. when 'a' and 'e' are ignorable, 'element' is equal to 'lament'
242 (or 'lmnt').
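
The 'element' and 'lament' case can be tried as sketched below. The anchored pattern is a choice made for this example so that only the single characters 'a' and 'e' are affected; note that using "ignoreChar" makes the module read the table file (allkeys.txt) instead of a compiled DUCET, so construction may take noticeably longer:

```perl
use strict;
use warnings;
use Unicode::Collate;

# Ignore exactly the characters 'a' and 'e' (anchored so that longer
# table entries containing them are left alone).
my $Collator = Unicode::Collate->new(ignoreChar => qr/^[ae]$/);

my $same = $Collator->eq('element', 'lament') ? 1 : 0;
my $also = $Collator->eq('element', 'lmnt')   ? 1 : 0;

print "element eq lament: $same\n";   # 1
print "element eq lmnt:   $also\n";   # 1
```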
243
244 ignore_level2
245 -- see 5.1 Parametric Tailoring, UTS #10.
246
247 By default, case-sensitive comparison (that is level 3 difference)
248 won't ignore accents (that is level 2 difference).
249
250 If the parameter is made true, accents (and other primary ignorable
251 characters) are ignored, even though cases are taken into account.
252
253 NOTE: "level" should be 3 or greater.
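
For instance, a sketch of a collator that is case-sensitive but accent-insensitive (the test strings are illustrative):

```perl
use strict;
use warnings;
use Unicode::Collate;

# Level 3 (case) is compared, but level 2 (accents) is skipped.
my $Collator = Unicode::Collate->new(level => 3, ignore_level2 => 1);

my $accent_ignored = $Collator->eq("camel", "came\x{301}l") ? 1 : 0;
my $case_ignored   = $Collator->eq("camel", "Camel")        ? 1 : 0;

print "accent ignored: $accent_ignored\n";   # 1
print "case ignored:   $case_ignored\n";     # 0
```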
254
255 katakana_before_hiragana
256 -- see 7.2 Tertiary Weight Table, UTS #10.
257
258 By default, hiragana is before katakana. If the parameter is made
259 true, this is reversed.
260
NOTE: This parameter simplemindedly assumes that any
hiragana/katakana distinctions must occur in level 3, and that
their weights at level 3 must be the same as those mentioned in
7.3.1, UTS #10. If you define collation elements that violate this
requirement, this parameter does not work correctly.
266
267 level
268 -- see 4.3 Form Sort Key, UTS #10.
269
270 Set the maximum level. Any higher levels than the specified one
271 are ignored.
272
273 Level 1: alphabetic ordering
274 Level 2: diacritic ordering
275 Level 3: case ordering
276 Level 4: tie-breaking (e.g. in the case when variable is 'shifted')
277
ex. level => 2,
279
280 If omitted, the maximum is the 4th.
281
282 NOTE: The DUCET includes weights over 0xFFFF at the 4th level. But
283 this module only uses weights within 0xFFFF. When "variable" is
284 'blanked' or 'non-ignorable' (other than 'shifted' and
285 'shift-trimmed'), the level 4 may be unreliable.
286
287 See also "identical".
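
The following sketch compares the same pairs of strings at levels 1, 2 and 3. The helper name is_eq is an assumption made for this example:

```perl
use strict;
use warnings;
use Unicode::Collate;

my %collator = map { $_ => Unicode::Collate->new(level => $_) } 1 .. 3;

# Hypothetical helper: 1 if the two strings are equal at the given level.
sub is_eq {
    my ($level, $x, $y) = @_;
    return $collator{$level}->eq($x, $y) ? 1 : 0;
}

for my $level (1 .. 3) {
    printf "level %d: case-insensitive=%d accent-insensitive=%d\n",
        $level,
        is_eq($level, "camel", "Camel"),
        is_eq($level, "camel", "came\x{301}l");
}
# level 1: case-insensitive=1 accent-insensitive=1
# level 2: case-insensitive=1 accent-insensitive=0
# level 3: case-insensitive=0 accent-insensitive=0
```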
288
289 long_contraction
290 -- see 3.8.2 Well-Formedness of the DUCET, 4.2 Produce Array, UTS
291 #10.
292
293 If the parameter is made true, for a contraction with three or more
294 characters (here nicknamed "long contraction"), initial substrings
295 will be handled. For example, a contraction ABC, where A is a
296 starter, and B and C are non-starters (character with non-zero
297 combining character class), will be detected even if there is not
298 AB as a contraction.
299
300 Default: Usually false. If "UCA_Version" is 22 or 24, and the
301 value of "long_contraction" is not specified in "new()", a true
302 value is set implicitly. This is a workaround to pass Conformance
303 Tests for Unicode 6.0.0 and 6.1.0.
304
305 "change()" handles "long_contraction" explicitly only. If
306 "long_contraction" is not specified in "change()", even though
307 "UCA_Version" is changed, "long_contraction" will not be changed.
308
Limitation: Scanning non-starters is one-way (no backtracking).
If AB is found but ABC is not found, another long contraction
whose first character is A and whose second is not B may not be
found.
313
314 Under "(normalization => undef)", detection step of discontiguous
315 contractions will be skipped.
316
317 Note: The following contractions in DUCET are not considered in
318 steps S2.1.1 to S2.1.3, where they are discontiguous.
319
320 0FB2 0F71 0F80 (TIBETAN VOWEL SIGN VOCALIC RR)
321 0FB3 0F71 0F80 (TIBETAN VOWEL SIGN VOCALIC LL)
322
323 For example "TIBETAN VOWEL SIGN VOCALIC RR" with "COMBINING TILDE
324 OVERLAY" ("U+0344") is "0FB2 0344 0F71 0F80" in NFD. In this case
325 "0FB2 0F80" ("TIBETAN VOWEL SIGN VOCALIC R") is detected, instead
326 of "0FB2 0F71 0F80". Inserted 0344 makes "0FB2 0F71 0F80"
327 discontiguous and lack of contraction "0FB2 0F71" prohibits "0FB2
328 0F71 0F80" from being detected.
329
330 minimalFFFE
331 -- see 1.1.1 U+FFFE, UTS #35 (LDML) Part 5: Collation.
332
If the parameter is made true, "U+FFFE" has a minimal primary
weight. The comparison between "$a1\x{FFFE}$a2" and
"$b1\x{FFFE}$b2" first compares $a1 and $b1 at level 1, and then
$a2 and $b2 at level 1, as in the following sorted list.
337
338 "ab\x{FFFE}a"
339 "Ab\x{FFFE}a"
340 "ab\x{FFFE}c"
341 "Ab\x{FFFE}c"
342 "ab\x{FFFE}xyz"
343 "abc\x{FFFE}def"
344 "abc\x{FFFE}xYz"
345 "aBc\x{FFFE}xyz"
346 "abcX\x{FFFE}def"
347 "abcx\x{FFFE}xyz"
348 "b\x{FFFE}aaa"
349 "bbb\x{FFFE}a"
350
351 Note: This is equivalent to "(entry => 'FFFE ;
352 [.0001.0020.0005.FFFE]')". Any other character than "U+FFFE" can
353 be tailored by "entry".
354
355 normalization
356 -- see 4.1 Normalize, UTS #10.
357
358 If specified, strings are normalized before preparation of sort
359 keys (the normalization is executed after preprocess).
360
361 A form name "Unicode::Normalize::normalize()" accepts will be
362 applied as $normalization_form. Acceptable names include 'NFD',
363 'NFC', 'NFKD', and 'NFKC'. See "Unicode::Normalize::normalize()"
364 for detail. If omitted, 'NFD' is used.
365
366 "normalization" is performed after "preprocess" (if defined).
367
368 Furthermore, special values, "undef" and "prenormalized", can be
369 used, though they are not concerned with
370 "Unicode::Normalize::normalize()".
371
If "undef" (not the string "undef") is passed explicitly as the
value for this key, no normalization is carried out (this may make
tailoring easier if normalization is not desired). Under
"(normalization => undef)", only contiguous contractions are
resolved; e.g. even if "A-ring" (and "A-ring-cedilla") is ordered
after "Z", "A-cedilla-ring" would be primary equal to "A". In this
respect, "(normalization => undef, preprocess => sub { NFD(shift) })"
is not equivalent to "(normalization => 'NFD')".

In the case of "(normalization => "prenormalized")", no
normalization is performed, but discontiguous contractions with
combining characters are still resolved. Therefore "(normalization =>
'prenormalized', preprocess => sub { NFD(shift) })" is equivalent
to "(normalization => 'NFD')". If source strings are already
normalized, "(normalization => 'prenormalized')" may save the time
spent on normalization.

Unless "(normalization => undef)" is specified, Unicode::Normalize
is required (see also CAVEAT).
391
392 overrideCJK
393 -- see 7.1 Derived Collation Elements, UTS #10.
394
395 By default, CJK unified ideographs are ordered in Unicode codepoint
396 order, but those in the CJK Unified Ideographs block are less than
397 those in the CJK Unified Ideographs Extension A etc.
398
399 In the CJK Unified Ideographs block:
400 U+4E00..U+9FA5 if UCA_Version is 8, 9 or 11.
401 U+4E00..U+9FBB if UCA_Version is 14 or 16.
402 U+4E00..U+9FC3 if UCA_Version is 18.
403 U+4E00..U+9FCB if UCA_Version is 20 or 22.
404 U+4E00..U+9FCC if UCA_Version is 24 to 30.
405 U+4E00..U+9FD5 if UCA_Version is 32 or 34.
406 U+4E00..U+9FEA if UCA_Version is 36.
407
408 In the CJK Unified Ideographs Extension blocks:
409 Ext.A (U+3400..U+4DB5) and Ext.B (U+20000..U+2A6D6) in any UCA_Version.
410 Ext.C (U+2A700..U+2B734) if UCA_Version is 20 or later.
411 Ext.D (U+2B740..U+2B81D) if UCA_Version is 22 or later.
412 Ext.E (U+2B820..U+2CEA1) if UCA_Version is 32 or later.
413 Ext.F (U+2CEB0..U+2EBE0) if UCA_Version is 36.
414
415 Through "overrideCJK", ordering of CJK unified ideographs
416 (including extensions) can be overridden.
417
418 ex. CJK unified ideographs in the JIS code point order.
419
420 overrideCJK => sub {
421 my $u = shift; # get a Unicode codepoint
422 my $b = pack('n', $u); # to UTF-16BE
423 my $s = your_unicode_to_sjis_converter($b); # convert
424 my $n = unpack('n', $s); # convert sjis to short
425 [ $n, 0x20, 0x2, $u ]; # return the collation element
426 },
427
428 The return value may be an arrayref of 1st to 4th weights as shown
429 above. The return value may be an integer as the primary weight as
430 shown below. If "undef" is returned, the default derived collation
431 element will be used.
432
433 overrideCJK => sub {
434 my $u = shift; # get a Unicode codepoint
435 my $b = pack('n', $u); # to UTF-16BE
436 my $s = your_unicode_to_sjis_converter($b); # convert
437 my $n = unpack('n', $s); # convert sjis to short
438 return $n; # return the primary weight
439 },
440
441 The return value may be a list containing zero or more of an
442 arrayref, an integer, or "undef".
443
444 ex. ignores all CJK unified ideographs.
445
446 overrideCJK => sub {()}, # CODEREF returning empty list
447
448 # where ->eq("Pe\x{4E00}rl", "Perl") is true
449 # as U+4E00 is a CJK unified ideograph and to be ignorable.
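
The same case in runnable form, as a minimal sketch that contrasts it with the default collator (variable names are illustrative):

```perl
use strict;
use warnings;
use Unicode::Collate;

my $default = Unicode::Collate->new();
my $no_cjk  = Unicode::Collate->new(overrideCJK => sub { () });

# U+4E00 is a CJK unified ideograph.
my $str = "Pe\x{4E00}rl";

my $eq_default = $default->eq($str, "Perl") ? 1 : 0;
my $eq_no_cjk  = $no_cjk->eq($str, "Perl")  ? 1 : 0;

print "default:     $eq_default\n";   # 0 (the ideograph has a weight)
print "overrideCJK: $eq_no_cjk\n";    # 1 (the ideograph is ignored)
```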
450
451 If a false value (including "undef") is passed, "overrideCJK" has
452 no effect. "$Collator->change(overrideCJK => 0)" resets the old
453 one.
454
455 But assignment of weight for CJK unified ideographs in "table" or
456 "entry" is still valid. If "undef" is passed explicitly as the
457 value for this key, weights for CJK unified ideographs are treated
458 as undefined. However when "UCA_Version" > 8, "(overrideCJK =>
459 undef)" has no special meaning.
460
461 Note: In addition to them, 12 CJK compatibility ideographs
462 ("U+FA0E", "U+FA0F", "U+FA11", "U+FA13", "U+FA14", "U+FA1F",
463 "U+FA21", "U+FA23", "U+FA24", "U+FA27", "U+FA28", "U+FA29") are
464 also treated as CJK unified ideographs. But they can't be
465 overridden via "overrideCJK" when you use DUCET, as the table
466 includes weights for them. "table" or "entry" has priority over
467 "overrideCJK".
468
469 overrideHangul
470 -- see 7.1 Derived Collation Elements, UTS #10.
471
472 By default, Hangul syllables are decomposed into Hangul Jamo, even
473 if "(normalization => undef)". But the mapping of Hangul syllables
474 may be overridden.
475
476 This parameter works like "overrideCJK", so see there for examples.
477
478 If you want to override the mapping of Hangul syllables, NFD and
479 NFKD are not appropriate, since NFD and NFKD will decompose Hangul
480 syllables before overriding. FCD may decompose Hangul syllables as
481 the case may be.
482
483 If a false value (but not "undef") is passed, "overrideHangul" has
484 no effect. "$Collator->change(overrideHangul => 0)" resets the old
485 one.
486
487 If "undef" is passed explicitly as the value for this key, weight
488 for Hangul syllables is treated as undefined without decomposition
489 into Hangul Jamo. But definition of weight for Hangul syllables in
490 "table" or "entry" is still valid.
491
492 overrideOut
493 -- see 7.1.1 Handling Ill-Formed Code Unit Sequences, UTS #10.
494
495 Perl seems to allow out-of-range values (greater than 0x10FFFF).
496 By default, out-of-range values are replaced with "U+FFFD"
497 (REPLACEMENT CHARACTER) when "UCA_Version" >= 22, or ignored when
498 "UCA_Version" <= 20.
499
500 When "UCA_Version" >= 22, the weights of out-of-range values can be
501 overridden. Though "table" or "entry" are available for them, out-
502 of-range values are too many.
503
504 "overrideOut" can perform it algorithmically. This parameter works
505 like "overrideCJK", so see there for examples.
506
507 ex. ignores all out-of-range values.
508
509 overrideOut => sub {()}, # CODEREF returning empty list
510
511 If a false value (including "undef") is passed, "overrideOut" has
512 no effect. "$Collator->change(overrideOut => 0)" resets the old
513 one.
514
515 NOTE ABOUT U+FFFD:
516
517 UCA recommends that out-of-range values should not be ignored for
518 security reasons. Say, "pe\x{110000}rl" should not be equal to
519 "perl". However, "U+FFFD" is wrongly mapped to a variable
520 collation element in DUCET for Unicode 6.0.0 to 6.2.0, that means
521 out-of-range values will be ignored when "variable" isn't
522 "Non-ignorable".
523
The mapping of "U+FFFD" is corrected in Unicode 6.3.0. See
<http://www.unicode.org/reports/tr10/tr10-28.html#Trailing_Weights>
(7.1.4 Trailing Weights). The correction can be reproduced with the
following:
527
528 overrideOut => sub { 0xFFFD }, # CODEREF returning a very large integer
529
530 This workaround is unnecessary since Unicode 6.3.0.
531
532 preprocess
533 -- see 5.4 Preprocessing, UTS #10.
534
535 If specified, the coderef is used to preprocess each string before
536 the formation of sort keys.
537
538 ex. dropping English articles, such as "a" or "the". Then, "the
539 pen" is before "a pencil".
540
541 preprocess => sub {
542 my $str = shift;
543 $str =~ s/\b(?:an?|the)\s+//gi;
544 return $str;
545 },
546
547 "preprocess" is performed before "normalization" (if defined).
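
Put together as a runnable sketch, with a default collator for comparison (the two sample strings come from the example above; variable names are illustrative):

```perl
use strict;
use warnings;
use Unicode::Collate;

my $plain = Unicode::Collate->new();

# Strip leading English articles before sort keys are formed.
my $no_articles = Unicode::Collate->new(
    preprocess => sub {
        my $str = shift;
        $str =~ s/\b(?:an?|the)\s+//gi;
        return $str;
    },
);

my @titles = ("a pencil", "the pen");

my @by_plain    = $plain->sort(@titles);
my @by_articles = $no_articles->sort(@titles);

print join(' | ', @by_plain), "\n";      # a pencil | the pen
print join(' | ', @by_articles), "\n";   # the pen | a pencil
```

The strings returned by sort() are the original ones; the coderef only affects the sort keys.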
548
549 ex. decoding strings in a legacy encoding such as shift-jis:
550
551 $sjis_collator = Unicode::Collate->new(
552 preprocess => \&your_shiftjis_to_unicode_decoder,
553 );
554 @result = $sjis_collator->sort(@shiftjis_strings);
555
556 Note: Strings returned from the coderef will be interpreted
557 according to Perl's Unicode support. See perlunicode, perluniintro,
558 perlunitut, perlunifaq, utf8.
559
560 rearrange
561 -- see 3.5 Rearrangement, UTS #10.
562
563 Characters that are not coded in logical order and to be
564 rearranged. If "UCA_Version" is equal to or less than 11, default
565 is:
566
567 rearrange => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4 ],
568
569 If you want to disallow any rearrangement, pass "undef" or "[]" (a
570 reference to empty list) as the value for this key.
571
572 If "UCA_Version" is equal to or greater than 14, default is "[]"
573 (i.e. no rearrangement).
574
According to version 9 of UCA, this parameter shall not be used;
however, no warning is issued at present.
577
578 rewrite
579 If specified, the coderef is used to rewrite lines in "table" or
580 "entry". The coderef will get each line, and then should return a
581 rewritten line according to the UCA file format. If the coderef
582 returns an empty line, the line will be skipped.
583
584 e.g. any primary ignorable characters into tertiary ignorable:
585
586 rewrite => sub {
587 my $line = shift;
588 $line =~ s/\[\.0000\..{4}\..{4}\./[.0000.0000.0000./g;
589 return $line;
590 },
591
592 This example shows rewriting weights. "rewrite" is allowed to
593 affect code points, weights, and the name.
594
595 NOTE: "table" is available to use another table file; preparing a
596 modified table once would be more efficient than rewriting lines on
597 reading an unmodified table every time.
598
599 suppress
600 -- see 3.12 Special-Purpose Commands, UTS #35 (LDML) Part 5:
601 Collation.
602
603 Contractions beginning with the specified characters are
604 suppressed, even if those contractions are defined in "table".
605
606 An example for Russian and some languages using the Cyrillic
607 script:
608
609 suppress => [0x0400..0x0417, 0x041A..0x0437, 0x043A..0x045F],
610
611 where 0x0400 stands for "U+0400", CYRILLIC CAPITAL LETTER IE WITH
612 GRAVE.
613
614 NOTE: Contractions via "entry" will not be suppressed.
615
616 table
617 -- see 3.8 Default Unicode Collation Element Table, UTS #10.
618
619 You can use another collation element table if desired.
620
The table file should be located in the Unicode/Collate directory
on @INC. Say, if the filename is Foo.txt, the table file is
searched for as Unicode/Collate/Foo.txt in @INC.

By default, allkeys.txt (the filename of DUCET) is used. If you
prepare your own table file, a name other than allkeys.txt is
preferable, to avoid a namespace conflict.

NOTE: When XSUB is used, the DUCET is compiled when this module is
built, which may save time at run time. Explicitly specifying
"(table => 'allkeys.txt')", using another table, or using
"ignoreChar", "ignoreName", "undefChar", "undefName" or "rewrite"
will prevent this module from using the compiled DUCET.
634
635 If "undef" is passed explicitly as the value for this key, no file
636 is read (but you can define collation elements via "entry").
637
638 A typical way to define a collation element table without any file
639 of table:
640
641 $onlyABC = Unicode::Collate->new(
642 table => undef,
643 entry => << 'ENTRIES',
644 0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
645 0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
646 0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
647 0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
648 0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
649 0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
650 ENTRIES
651 );
652
653 If "ignoreName" or "undefName" is used, character names should be
654 specified as a comment (following "#") on each line.
655
656 undefChar
657 undefName
658 -- see 6.3.3 Reducing the Repertoire, UTS #10.
659
660 Undefines the collation element as if it were unassigned in the
661 "table". This reduces the size of the table. If an unassigned
662 character appears in the string to be collated, the sort key is
663 made from its codepoint as a single-character collation element, as
664 it is greater than any other assigned collation elements (in the
665 codepoint order among the unassigned characters). But, it'd be
666 better to ignore characters unfamiliar to you and maybe never used.
667
668 Through "undefChar", any character matching "qr/$undefChar/" will
669 be undefined. Through "undefName", any character whose name (given
670 in the "table" file as a comment) matches "qr/$undefName/" will be
671 undefined.
672
673 ex. Collation weights for beyond-BMP characters are not stored in
674 object:
675
676 undefChar => qr/[^\0-\x{fffd}]/,
677
678 upper_before_lower
679 -- see 6.6 Case Comparisons, UTS #10.
680
681 By default, lowercase is before uppercase. If the parameter is
682 made true, this is reversed.
683
NOTE: This parameter simplemindedly assumes that any
lowercase/uppercase distinctions must occur in level 3, and that
their weights at level 3 must be the same as those mentioned in
7.3.1, UTS #10. If you define collation elements that differ from
this requirement, this parameter does not work correctly.
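
A minimal sketch of the reversal, using an illustrative pair of words:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $lower_first = Unicode::Collate->new();
my $upper_first = Unicode::Collate->new(upper_before_lower => 1);

my @words = qw(apple Apple);

my @default  = $lower_first->sort(@words);
my @reversed = $upper_first->sort(@words);

print "@default\n";    # apple Apple
print "@reversed\n";   # Apple apple
```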
689
690 variable
691 -- see 3.6 Variable Weighting, UTS #10.
692
693 This key allows for variable weighting of variable collation
694 elements, which are marked with an ASTERISK in the table (NOTE:
695 Many punctuation marks and symbols are variable in allkeys.txt).
696
697 variable => 'blanked', 'non-ignorable', 'shifted', or 'shift-trimmed'.
698
699 These names are case-insensitive. By default (if specification is
700 omitted), 'shifted' is adopted.
701
'Blanked'        Variable elements are made ignorable at levels 1
                 through 3; considered at the 4th level.

'Non-Ignorable'  Variable elements are not reset to ignorable.

'Shifted'        Variable elements are made ignorable at levels 1
                 through 3; their level 4 weight is replaced by the
                 old level 1 weight. Level 4 weight for Non-Variable
                 elements is 0xFFFF.

'Shift-Trimmed'  Same as 'shifted', but all FFFF's at the 4th level
                 are trimmed.
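
The difference between 'shifted' and 'non-ignorable' shows up as soon as a string contains a space, which is a variable element in DUCET. The word list below is illustrative:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $shifted       = Unicode::Collate->new(variable => 'shifted');
my $non_ignorable = Unicode::Collate->new(variable => 'non-ignorable');

my @words = ("deluge", "death", "de luge");

my @by_shifted       = $shifted->sort(@words);
my @by_non_ignorable = $non_ignorable->sort(@words);

# 'shifted': the space is ignored at levels 1 to 3, so "de luge" sorts
# next to "deluge" and is separated from it only at level 4.
print join(' | ', @by_shifted), "\n";         # death | de luge | deluge

# 'non-ignorable': the space keeps its (low) primary weight, so
# "de luge" sorts before every word that continues with a letter.
print join(' | ', @by_non_ignorable), "\n";   # de luge | death | deluge
```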
713
714 Methods for Collation
715 "@sorted = $Collator->sort(@not_sorted)"
716 Sorts a list of strings.
717
718 "$result = $Collator->cmp($a, $b)"
719 Returns 1 (when $a is greater than $b) or 0 (when $a is equal to
720 $b) or -1 (when $a is less than $b).
721
722 "$result = $Collator->eq($a, $b)"
723 "$result = $Collator->ne($a, $b)"
724 "$result = $Collator->lt($a, $b)"
725 "$result = $Collator->le($a, $b)"
726 "$result = $Collator->gt($a, $b)"
727 "$result = $Collator->ge($a, $b)"
They work like the Perl operators of the same name.
729
730 eq : whether $a is equal to $b.
731 ne : whether $a is not equal to $b.
732 lt : whether $a is less than $b.
733 le : whether $a is less than $b or equal to $b.
734 gt : whether $a is greater than $b.
735 ge : whether $a is greater than $b or equal to $b.
736
737 "$sortKey = $Collator->getSortKey($string)"
738 -- see 4.3 Form Sort Key, UTS #10.
739
740 Returns a sort key.
741
742 You compare the sort keys using a binary comparison and get the
743 result of the comparison of the strings using UCA.
744
745 $Collator->getSortKey($a) cmp $Collator->getSortKey($b)
746
747 is equivalent to
748
749 $Collator->cmp($a, $b)
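
The equivalence can be checked with a sketch like the following, which also shows the usual reason to call getSortKey() directly: computing each key once and reusing it. The sample pairs and variable names are illustrative:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new();

my @pairs = (
    [ 'apple', 'Apple' ],
    [ 'perl',  'perl'  ],
    [ 'zebra', 'apple' ],
);

my $all_agree = 1;
for my $pair (@pairs) {
    my ($x, $y) = @$pair;
    # Binary comparison of sort keys versus the cmp() method.
    my $by_key = $Collator->getSortKey($x) cmp $Collator->getSortKey($y);
    my $by_cmp = $Collator->cmp($x, $y);
    $all_agree = 0 if $by_key != $by_cmp;
    print "$x / $y: key=$by_key cmp=$by_cmp\n";
}

# Keys can be cached, e.g. to sort records by one field.
my %key = map { $_ => $Collator->getSortKey($_) } qw(cherry Apple banana);
my @sorted = sort { $key{$a} cmp $key{$b} } keys %key;
print "@sorted\n";   # Apple banana cherry
```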
750
751 "$sortKeyForm = $Collator->viewSortKey($string)"
752 Converts a sorting key into its representation form. If
753 "UCA_Version" is 8, the output is slightly different.
754
755 use Unicode::Collate;
756 my $c = Unicode::Collate->new();
757 print $c->viewSortKey("Perl"),"\n";
758
759 # output:
760 # [0B67 0A65 0B7F 0B03 | 0020 0020 0020 0020 | 0008 0002 0002 0002 | FFFF FFFF FFFF FFFF]
761 # Level 1 Level 2 Level 3 Level 4
762
763 Methods for Searching
764 The "match", "gmatch", "subst", "gsubst" methods work like "m//",
765 "m//g", "s///", "s///g", respectively, but they are not aware of any
766 pattern, but only a literal substring.
767
DISCLAIMER: If the "preprocess" or "normalization" parameter is true
for $Collator, calling these methods ("index", "match", "gmatch",
"subst", "gsubst") will croak, as the position and the length might
differ from those in the specified string.

The "rearrange" and "hangul_terminator" parameters are ignored.
"katakana_before_hiragana" and "upper_before_lower" don't affect
matching and searching, as it doesn't matter whether greater or less.
776
777 "$position = $Collator->index($string, $substring[, $position])"
778 "($position, $length) = $Collator->index($string, $substring[,
779 $position])"
780 If $substring matches a part of $string, returns the position of
781 the first occurrence of the matching part in scalar context; in
782 list context, returns a two-element list of the position and the
783 length of the matching part.
784
785 If $substring does not match any part of $string, returns "-1" in
786 scalar context and an empty list in list context.
787
e.g. when the content of $str is "Ich muß studieren Perl.", you
say the following where $sub is "MüSS",
790
791 my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
792 # (normalization => undef) is REQUIRED.
793 my $match;
794 if (my($pos,$len) = $Collator->index($str, $sub)) {
795 $match = substr($str, $pos, $len);
796 }
797
and get "muß" in $match, since "muß" is primary equal to
"MüSS".
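
A self-contained version of this example might look as follows. The non-ASCII letters are written as escapes (U+00DF for sharp s, U+00FC for u-diaeresis) so that the sketch does not depend on the source file encoding:

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

my $str = "Ich mu\x{DF} studieren Perl.";
my $sub = "M\x{FC}SS";

# (normalization => undef) is required for the searching methods.
my $Collator = Unicode::Collate->new(normalization => undef, level => 1);

my $match;
my ($pos, $len) = $Collator->index($str, $sub);
if (defined $pos) {
    $match = substr($str, $pos, $len);
    print "matched '$match' at position $pos, length $len\n";
}
```

In list context a failed search yields an empty list, so $pos stays undefined and the block is skipped.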
800
801 "$match_ref = $Collator->match($string, $substring)"
802 "($match) = $Collator->match($string, $substring)"
803 If $substring matches a part of $string, in scalar context, returns
804 a reference to the first occurrence of the matching part
805 ($match_ref is always true if matches, since every reference is
806 true); in list context, returns the first occurrence of the
807 matching part.
808
809 If $substring does not match any part of $string, returns "undef"
810 in scalar context and an empty list in list context.
811
812 e.g.
813
814 if ($match_ref = $Collator->match($str, $sub)) { # scalar context
815 print "matches [$$match_ref].\n";
816 } else {
817 print "doesn't match.\n";
818 }
819
820 or
821
822 if (($match) = $Collator->match($str, $sub)) { # list context
823 print "matches [$match].\n";
824 } else {
825 print "doesn't match.\n";
826 }
827
828 "@match = $Collator->gmatch($string, $substring)"
829 If $substring matches a part of $string, returns all the matching
830 parts (or matching count in scalar context).
831
832 If $substring does not match any part of $string, returns an empty
833 list.
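
As a sketch, collecting every primary-level match of "camel" in an illustrative string:

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

# (normalization => undef) is required for the searching methods.
my $Collator = Unicode::Collate->new(normalization => undef, level => 1);

my $str = "Camel donkey zebra came\x{301}l CAMEL horse";

# List context: the matching parts themselves.
my @matches = $Collator->gmatch($str, "camel");

# Scalar context: the number of matches.
my $count = $Collator->gmatch($str, "camel");

print "$count matches: ", join(', ', @matches), "\n";
```

Each returned part is the substring as it appears in $string, so case and combining accents are preserved.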
834
835 "$count = $Collator->subst($string, $substring, $replacement)"
836 If $substring matches a part of $string, the first occurrence of
837 the matching part is replaced by $replacement ($string is modified)
and $count (always equal to 1) is returned.
839
840 $replacement can be a "CODEREF", taking the matching part as an
841 argument, and returning a string to replace the matching part (a
842 bit similar to "s/(..)/$coderef->($1)/e").
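A minimal sketch of subst() with a plain string and with a "CODEREF" as the replacement, assuming a level 1 collator and made-up sample strings:

```perl
use strict;
use warnings;
use Unicode::Collate;

# subst() needs normalization switched off, as index() does.
my $Collator = Unicode::Collate->new(normalization => undef, level => 1);

# Plain string replacement: only the first occurrence is replaced.
my $plain = "Camel donkey CAMEL";
my $count = $Collator->subst($plain, "camel", "horse");

# CODEREF replacement: the callback receives the matching part.
my $wrapped = "Camel donkey CAMEL";
$Collator->subst($wrapped, "camel", sub { "<b>$_[0]</b>" });

print "$plain\n";
print "$wrapped\n";
```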
843
844 "$count = $Collator->gsubst($string, $substring, $replacement)"
845 If $substring matches a part of $string, all the occurrences of the
846 matching part are replaced by $replacement ($string is modified)
847 and $count is returned.
848
849 $replacement can be a "CODEREF", taking the matching part as an
850 argument, and returning a string to replace the matching part (a
851 bit similar to "s/(..)/$coderef->($1)/eg").
852
853 e.g.
854
855 my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
856 # (normalization => undef) is REQUIRED.
857 my $str = "Camel donkey zebra came\x{301}l CAMEL horse cam\0e\0l...";
858 $Collator->gsubst($str, "camel", sub { "<b>$_[0]</b>" });
859
860 # now $str is "<b>Camel</b> donkey zebra <b>came\x{301}l</b> <b>CAMEL</b> horse <b>cam\0e\0l</b>...";
861 # i.e., all the camels are made bold-faced.
862
Examples: levels and ignore_level2 - what does camel match?
---------------------------------------------------------------------------
level  ignore_level2  |  camel  Camel  came\x{301}l  c-a-m-e-l  cam\0e\0l
----------------------|----------------------------------------------------
  1        false      |   yes    yes       yes          yes        yes
  2        false      |   yes    yes       no           yes        yes
  3        false      |   yes    no        no           yes        yes
  4        false      |   yes    no        no           no         yes
----------------------|----------------------------------------------------
  1        true       |   yes    yes       yes          yes        yes
  2        true       |   yes    yes       yes          yes        yes
  3        true       |   yes    no        yes          yes        yes
  4        true       |   yes    no        yes          no         yes
---------------------------------------------------------------------------
note: if variable => non-ignorable, camel doesn't match c-a-m-e-l
at any level.
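Some rows of the table can be checked with a short script. This is a sketch only: it covers levels 1 to 4 with ignore_level2 left false, and the labels and data structure are illustrative choices, not part of the module.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Printable label => actual text to search in.
my @texts = (
    [ 'Camel'        => "Camel" ],
    [ 'came\x{301}l' => "came\x{301}l" ],
    [ 'c-a-m-e-l'    => "c-a-m-e-l" ],
);

my %matched;
for my $level (1 .. 4) {
    my $Collator = Unicode::Collate->new(normalization => undef, level => $level);
    for my $pair (@texts) {
        my ($label, $text) = @$pair;
        # List context: (position, length) on a match, empty list otherwise.
        my @hit = $Collator->index($text, "camel");
        $matched{$level}{$label} = @hit ? 'yes' : 'no';
        printf "level %d  %-14s %s\n", $level, $label, $matched{$level}{$label};
    }
}

# With variable => 'non-ignorable' the hyphens are no longer skipped.
my $NonIgnorable = Unicode::Collate->new(
    normalization => undef, level => 1, variable => 'non-ignorable',
);
my @hyphen_hit = $NonIgnorable->index("c-a-m-e-l", "camel");
print "non-ignorable, level 1: ", (@hyphen_hit ? 'yes' : 'no'), "\n";
```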
879
880 Other Methods
881 "%old_tailoring = $Collator->change(%new_tailoring)"
882 "$modified_collator = $Collator->change(%new_tailoring)"
Changes the values of the specified keys. In list context,
returns the previous values of those keys, which can be passed
back to change() to restore them.
884
885 $Collator = Unicode::Collate->new(level => 4);
886
887 $Collator->eq("perl", "PERL"); # false
888
889 %old = $Collator->change(level => 2); # returns (level => 4).
890
891 $Collator->eq("perl", "PERL"); # true
892
893 $Collator->change(%old); # returns (level => 2).
894
895 $Collator->eq("perl", "PERL"); # false
896
Not every key is allowed to be changed. See also
@Unicode::Collate::ChangeOK and @Unicode::Collate::ChangeNG.

In scalar context, returns the modified collator itself (the same
object, not a clone of the original).
902
903 $Collator->change(level => 2)->eq("perl", "PERL"); # true
904
905 $Collator->eq("perl", "PERL"); # true; now max level is 2nd.
906
907 $Collator->change(level => 4)->eq("perl", "PERL"); # false
908
909 "$version = $Collator->version()"
910 Returns the version number (a string) of the Unicode Standard which
911 the "table" file used by the collator object is based on. If the
912 table does not include a version line (starting with @version),
913 returns "unknown".
914
915 "UCA_Version()"
Returns the revision number of UTS #10 that this module consults,
which should correspond to the DUCET bundled with it.
918
919 "Base_Unicode_Version()"
Returns the version number of the Unicode Standard that this module
consults, which should correspond to the DUCET bundled with it.
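A minimal sketch that prints the three values. Since no method is exported, the two functions are called by their fully qualified names here; the output depends on the installed version of the module and its table, so none is shown.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new();

# Version of the Unicode Standard the loaded table is based on,
# or "unknown" if the table has no @version line.
my $table_version = $Collator->version();

# Values built into the module itself.
my $uca_revision = Unicode::Collate::UCA_Version();
my $base_version = Unicode::Collate::Base_Unicode_Version();

print "table version:        $table_version\n";
print "UCA revision:         $uca_revision\n";
print "base Unicode version: $base_version\n";
```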
922
924 No method will be exported.
925
This module can be used without any "table" file, but for ease of
use it is recommended to install a table file in the UCA format by
copying it under the directory <a place in
@INC>/Unicode/Collate.

The preferred table is "The Default Unicode Collation Element
Table" (aka DUCET), available from the Unicode Consortium's website:
934
935 http://www.unicode.org/Public/UCA/
936
937 http://www.unicode.org/Public/UCA/latest/allkeys.txt
938 (latest version)
939
940 If DUCET is not installed, it is recommended to copy the file from
941 http://www.unicode.org/Public/UCA/latest/allkeys.txt to <a place in
942 @INC>/Unicode/Collate/allkeys.txt manually.
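To see whether a table file is already installed, the directories in @INC can be scanned for Unicode/Collate/allkeys.txt. A minimal sketch; the variable names are illustrative:

```perl
use strict;
use warnings;

my @found;
for my $dir (@INC) {
    next if ref $dir;    # @INC may hold hooks (references); skip them
    my $path = "$dir/Unicode/Collate/allkeys.txt";
    push @found, $path if -f $path;
}

if (@found) {
    print "$_\n" for @found;
} else {
    print "allkeys.txt not found under \@INC\n";
}
```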
943
945 Normalization
946 Use of the "normalization" parameter requires the
947 Unicode::Normalize module (see Unicode::Normalize).
948
If you do not need it (say, when you do not have to handle any
combining characters), assign "(normalization => undef)"
explicitly.
952
953 -- see 6.5 Avoiding Normalization, UTS #10.
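The effect of switching normalization off can be seen with two canonically equivalent strings whose combining marks are in a different order. This is a sketch: the strings are illustrative, and the stated results are what canonical reordering under NFD would be expected to give.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Default collator: input is normalized (requires Unicode::Normalize).
my $Normalizing = Unicode::Collate->new();

# Normalization switched off: input is collated exactly as given.
my $Raw = Unicode::Collate->new(normalization => undef);

# "a" + dot below + acute versus "a" + acute + dot below.
# The two strings are canonically equivalent.
my $x = "a\x{323}\x{301}";
my $y = "a\x{301}\x{323}";

my $eq_normalized = $Normalizing->eq($x, $y) ? 1 : 0;
my $eq_raw        = $Raw->eq($x, $y)         ? 1 : 0;

print "with normalization:    ", ($eq_normalized ? "equal" : "different"), "\n";
print "without normalization: ", ($eq_raw        ? "equal" : "different"), "\n";
```

With normalization both strings are reordered to the same sequence and should compare as equal; without it the marks are weighted in the order given, so the strings are expected to differ.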
954
955 Conformance Test
956 The Conformance Test for the UCA is available under
957 <http://www.unicode.org/Public/UCA/>.
958
959 For CollationTest_SHIFTED.txt, a collator via
960 "Unicode::Collate->new( )" should be used; for
961 CollationTest_NON_IGNORABLE.txt, a collator via
962 "Unicode::Collate->new(variable => "non-ignorable", level => 3)".
963
If "UCA_Version" is 26 or later, the "identical" level is
preferred; "Unicode::Collate->new(identical => 1)" and
"Unicode::Collate->new(identical => 1, variable =>
"non-ignorable", level => 3)" should be used.
968
969 Unicode::Normalize is required to try The Conformance Test.
970
972 The Unicode::Collate module for perl was written by SADAHIRO Tomoyuki,
973 <SADAHIRO@cpan.org>. This module is Copyright(C) 2001-2018, SADAHIRO
974 Tomoyuki. Japan. All rights reserved.
975
976 This module is free software; you can redistribute it and/or modify it
977 under the same terms as Perl itself.
978
979 The file Unicode/Collate/allkeys.txt was copied verbatim from
980 <http://www.unicode.org/Public/UCA/9.0.0/allkeys.txt>. For this file,
981 Copyright (c) 2016 Unicode, Inc.; distributed under the Terms of Use in
982 <http://www.unicode.org/terms_of_use.html>
983
985 Unicode Collation Algorithm - UTS #10
986 <http://www.unicode.org/reports/tr10/>
987
988 The Default Unicode Collation Element Table (DUCET)
989 <http://www.unicode.org/Public/UCA/latest/allkeys.txt>
990
991 The conformance test for the UCA
992 <http://www.unicode.org/Public/UCA/latest/CollationTest.html>
993
994 <http://www.unicode.org/Public/UCA/latest/CollationTest.zip>
995
996 Hangul Syllable Type
997 <http://www.unicode.org/Public/UNIDATA/HangulSyllableType.txt>
998
999 Unicode Normalization Forms - UAX #15
1000 <http://www.unicode.org/reports/tr15/>
1001
1002 Unicode Locale Data Markup Language (LDML) - UTS #35
1003 <http://www.unicode.org/reports/tr35/>
1004
1005
1006
1007perl v5.28.1 2019-01-02 Collate(3)