Collate(3)            User Contributed Perl Documentation            Collate(3)

NAME
Unicode::Collate - Unicode Collation Algorithm

SYNOPSIS
use Unicode::Collate;

#construct
$Collator = Unicode::Collate->new(%tailoring);

#sort
@sorted = $Collator->sort(@not_sorted);

#compare
$result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

Note: Strings in @not_sorted, $a and $b are interpreted according to
Perl's Unicode support. See perlunicode, perluniintro, perlunitut,
perlunifaq, utf8. If your strings are not yet character strings, decode
them beforehand or supply a "preprocess" coderef that decodes them.

DESCRIPTION
This module is an implementation of Unicode Technical Standard #10
(a.k.a. UTS #10) - Unicode Collation Algorithm (a.k.a. UCA).
28
29 Constructor and Tailoring
30 The "new" method returns a collator object. If new() is called with no
31 parameters, the collator should do the default collation.
32
33 $Collator = Unicode::Collate->new(
34 UCA_Version => $UCA_Version,
35 alternate => $alternate, # alias for 'variable'
36 backwards => $levelNumber, # or \@levelNumbers
37 entry => $element,
38 hangul_terminator => $term_primary_weight,
39 highestFFFF => $bool,
40 identical => $bool,
41 ignoreName => qr/$ignoreName/,
42 ignoreChar => qr/$ignoreChar/,
43 ignore_level2 => $bool,
44 katakana_before_hiragana => $bool,
45 level => $collationLevel,
46 long_contraction => $bool,
47 minimalFFFE => $bool,
48 normalization => $normalization_form,
49 overrideCJK => \&overrideCJK,
50 overrideHangul => \&overrideHangul,
51 preprocess => \&preprocess,
52 rearrange => \@charList,
53 rewrite => \&rewrite,
54 suppress => \@charList,
55 table => $filename,
56 undefName => qr/$undefName/,
57 undefChar => qr/$undefChar/,
58 upper_before_lower => $bool,
59 variable => $variable,
60 );
61
62 UCA_Version
63 If the revision (previously "tracking version") number of UCA is
64 given, behavior of that revision is emulated on collating. If
65 omitted, the return value of "UCA_Version()" is used.
66
67 The following revisions are supported. The default is 43.
68
69 UCA Unicode Standard DUCET (@version)
70 -------------------------------------------------------
71 8 3.1 3.0.1 (3.0.1d9)
72 9 3.1 with Corrigendum 3 3.1.1
73 11 4.0.0
74 14 4.1.0
75 16 5.0.0
76 18 5.1.0
77 20 5.2.0
78 22 6.0.0
79 24 6.1.0
80 26 6.2.0
81 28 6.3.0
82 30 7.0.0
83 32 8.0.0
84 34 9.0.0
85 36 10.0.0
86 38 11.0.0
87 40 12.0.0
88 41 12.1.0
89 43 13.0.0
90
91 * See below for "long_contraction" with "UCA_Version" 22 and 24.
92
93 * Noncharacters (e.g. U+FFFF) are not ignored, and can be
94 overridden since "UCA_Version" 22.
95
96 * Out-of-range codepoints (greater than U+10FFFF) are not ignored,
97 and can be overridden since "UCA_Version" 22.
98
99 * Fully ignorable characters were ignored, and would not interrupt
100 contractions with "UCA_Version" 9 and 11.
101
102 * Treatment of ignorables after variables and some behaviors were
103 changed at "UCA_Version" 9.
104
105 * Characters regarded as CJK unified ideographs (cf. "overrideCJK")
106 depend on "UCA_Version".
107
108 * Many hangul jamo are assigned at "UCA_Version" 20, that will
109 affect "hangul_terminator".
110
111 alternate
112 -- see 3.2.2 Alternate Weighting, version 8 of UTS #10
113
114 For backward compatibility, "alternate" (old name) can be used as
115 an alias for "variable".
116
117 backwards
118 -- see 3.4 Backward Accents, UTS #10.
119
120 backwards => $levelNumber or \@levelNumbers
121
Weights at the specified level(s) are compared in reverse order;
e.g. level 2 (diacritic ordering) in French. If omitted (or if
$levelNumber is "undef" or "\@levelNumbers" is "[]"), all levels
are compared forwards.
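
The effect of "backwards" can be seen with the classic French example
below. This is a minimal sketch, assuming the default DUCET shipped with
the module; the four words and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Four words that differ only in their accents (a level 2 difference).
my @words = ("c\x{F4}t\x{E9}", "cot\x{E9}", "c\x{F4}te", "cote");

my $forward = Unicode::Collate->new();                # forwards at all levels
my $french  = Unicode::Collate->new(backwards => 2);  # level 2 read from the end

my @fwd = $forward->sort(@words);
my @bwd = $french->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "forwards : @fwd\n";   # first accent difference from the left decides
print "backwards: @bwd\n";   # first accent difference from the right decides
```

With level 2 compared backwards, an accent near the end of a word
outweighs one near the beginning, which is why "c\x{F4}te" moves ahead
of "cot\x{E9}".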
125
126 entry
127 -- see 5 Tailoring; 9.1 Allkeys File Format, UTS #10.
128
129 If the same character (or a sequence of characters) exists in the
130 collation element table through "table", mapping to collation
131 elements is overridden. If it does not exist, the mapping is
132 defined additionally.
133
134 entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
135 0063 0068 ; [.0E6A.0020.0002.0063] # ch
136 0043 0068 ; [.0E6A.0020.0007.0043] # Ch
137 0043 0048 ; [.0E6A.0020.0008.0043] # CH
138 006C 006C ; [.0F4C.0020.0002.006C] # ll
139 004C 006C ; [.0F4C.0020.0007.004C] # Ll
140 004C 004C ; [.0F4C.0020.0008.004C] # LL
141 00F1 ; [.0F7B.0020.0002.00F1] # n-tilde
142 006E 0303 ; [.0F7B.0020.0002.00F1] # n-tilde
143 00D1 ; [.0F7B.0020.0008.00D1] # N-tilde
144 004E 0303 ; [.0F7B.0020.0008.00D1] # N-tilde
145 ENTRY
146
147 entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
148 00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae ligature as <a><e>
149 00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE ligature as <A><E>
150 ENTRY
151
152 NOTE: The code point in the UCA file format (before ';') must be a
153 Unicode code point (defined as hexadecimal), but not a native code
154 point. So 0063 must always denote "U+0063", but not a character of
155 "\x63".
156
157 Weighting may vary depending on collation element table. So ensure
158 the weights defined in "entry" will be consistent with those in the
159 collation element table loaded via "table".
160
In DUCET v4.0.0, the primary weight of "C" is "0E60" and that of
"D" is "0E6D". Setting the primary weight of "CH" to "0E6A" (a
value between "0E60" and "0E6D") therefore gives the ordering
"C < CH < D". Strictly speaking, DUCET already has some characters
between "C" and "D": "small capital C" ("U+1D04") with primary
weight "0E64", "c-hook/C-hook" ("U+0188/U+0187") with "0E65", and
"c-curl" ("U+0255") with "0E69". The primary weight "0E6A" thus
places "CH" between "c-curl" and "D".
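
The contraction idea can be tried without depending on any particular
DUCET version. The sketch below builds a tiny table purely from "entry"
("table => undef"); the weights 0103 to 0120 are made up for
illustration and are not taken from any allkeys file.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Only the elements listed here are defined; "ch" is a contraction
# whose primary weight lies between those of "c" and "d".
my $coll = Unicode::Collate->new(
    table         => undef,
    normalization => undef,
    entry         => <<'ENTRY',
0063      ; [.0103.0020.0002.0063] # c
0063 0068 ; [.0104.0020.0002.0063] # ch
0064      ; [.0105.0020.0002.0064] # d
0068      ; [.0108.0020.0002.0068] # h
007A      ; [.0120.0020.0002.007A] # z
ENTRY
);

my @sorted = $coll->sort(qw(d ch cz c));
print "@sorted\n";   # "ch" sorts as one letter, after every "c..." word
```

Because "ch" carries a single primary weight greater than that of "c",
"cz" (two elements starting with "c") sorts before "ch".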
169
170 hangul_terminator
171 -- see 7.1.4 Trailing Weights, UTS #10.
172
173 If a true value is given (non-zero but should be positive), it will
174 be added as a terminator primary weight to the end of every
175 standard Hangul syllable. Secondary and any higher weights for
176 terminator are set to zero. If the value is false or
177 "hangul_terminator" key does not exist, insertion of terminator
178 weights will not be performed.
179
180 Boundaries of Hangul syllables are determined according to
181 conjoining Jamo behavior in the Unicode Standard and
182 HangulSyllableType.txt.
183
Implementation Note: (1) For expansion mapping (Unicode character
mapped to a sequence of collation elements), a terminator will not
be added between collation elements, even if a Hangul syllable
boundary exists there. A terminator is added only at the position
following the last collation element.
189
190 (2) Non-conjoining Hangul letters (Compatibility Jamo, halfwidth
191 Jamo, and enclosed letters) are not automatically terminated with a
192 terminator primary weight. These characters may need terminator
193 included in a collation element table beforehand.
194
195 highestFFFF
196 -- see 2.4 Tailored noncharacter weights, UTS #35 (LDML) Part 5:
197 Collation.
198
If the parameter is made true, "U+FFFF" has the highest primary
weight. When both "$coll->ge($str, "abc")" and
"$coll->le($str, "abc\x{FFFF}")" are true, $str is expected to
begin with "abc" or a primary equivalent of it. $str may be
"abcd" or "abc012", but should not itself include "U+FFFF", as in
"abc\x{FFFF}xyz".

"$coll->le($str, "abc\x{FFFF}")" works almost like
"$coll->lt($str, "abd")", but the latter requires you to know
which letter follows "c". In a language where "ch" is the letter
after "c", "abch" is greater than "abc\x{FFFF}" but less than
"abd".
211
212 Note: This is equivalent to "(entry => 'FFFF ;
213 [.FFFE.0020.0005.FFFF]')". Any other character than "U+FFFF" can
214 be tailored by "entry".
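
The prefix test described above can be sketched as follows, assuming the
default DUCET. The word list and the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $coll = Unicode::Collate->new(highestFFFF => 1);

my @words  = ("ab", "abc", "abcd", "ABC012", "abd", "b");
my $prefix = "abc";

# Keep every word that collates between "abc" and "abc\x{FFFF}" inclusive,
# i.e. every word that starts with "abc" or a primary equivalent.
my @hits = grep {
    $coll->ge($_, $prefix) && $coll->le($_, "$prefix\x{FFFF}")
} @words;

print "@hits\n";
```

"ABC012" is kept because it is primary equal to "abc012"; "abd" is
rejected because "d" has a greater primary weight than "c".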
215
216 identical
217 -- see A.3 Deterministic Comparison, UTS #10.
218
219 By default, strings whose weights are equal should be equal, even
220 though their code points are not equal. Completely ignorable
221 characters are ignored.
222
223 If the parameter is made true, a final, tie-breaking level is used.
224 If no difference of weights is found after the comparison through
225 all the level specified by "level", the comparison with code points
226 will be performed. For the tie-breaking comparison, the sort key
227 has code points of the original string appended. Completely
228 ignorable characters are not ignored.
229
230 If "preprocess" and/or "normalization" is applied, the code points
231 of the string after them (in NFD by default) are used.
232
233 ignoreChar
234 ignoreName
235 -- see 3.6 Variable Weighting, UTS #10.
236
Makes the entry in the table completely ignorable; i.e. as if the
weights were zero at all levels.
239
240 Through "ignoreChar", any character matching "qr/$ignoreChar/" will
241 be ignored. Through "ignoreName", any character whose name (given
242 in the "table" file as a comment) matches "qr/$ignoreName/" will be
243 ignored.
244
245 E.g. when 'a' and 'e' are ignorable, 'element' is equal to 'lament'
246 (or 'lmnt').
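
The 'element'/'lament' case can be reproduced with a small self-contained
table. This is a sketch only: the table is defined via "entry" with
made-up weights so that it does not depend on allkeys.txt, and the
anchored pattern qr/^[ae]$/ is one possible way to match exactly the
single characters 'a' and 'e'.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Illustrative weights for the six letters used below.
my $entries = <<'ENTRY';
0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
0065 ; [.0105.0020.0002.0065] # LATIN SMALL LETTER E
006C ; [.010C.0020.0002.006C] # LATIN SMALL LETTER L
006D ; [.010D.0020.0002.006D] # LATIN SMALL LETTER M
006E ; [.010E.0020.0002.006E] # LATIN SMALL LETTER N
0074 ; [.0114.0020.0002.0074] # LATIN SMALL LETTER T
ENTRY

my %common = (table => undef, normalization => undef, entry => $entries);

my $plain = Unicode::Collate->new(%common);
my $no_ae = Unicode::Collate->new(%common, ignoreChar => qr/^[ae]$/);

my $plain_equal = $plain->eq('element', 'lament') ? 1 : 0;
my $no_ae_equal = $no_ae->eq('element', 'lament') ? 1 : 0;
my $bare_equal  = $no_ae->eq('element', 'lmnt')   ? 1 : 0;

print "without ignoreChar: $plain_equal\n";
print "with ignoreChar   : $no_ae_equal (vs 'lmnt': $bare_equal)\n";
```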
247
248 ignore_level2
249 -- see 5.1 Parametric Tailoring, UTS #10.
250
251 By default, case-sensitive comparison (that is level 3 difference)
252 won't ignore accents (that is level 2 difference).
253
254 If the parameter is made true, accents (and other primary ignorable
255 characters) are ignored, even though cases are taken into account.
256
257 NOTE: "level" should be 3 or greater.
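
A minimal sketch of a comparison that respects case but not accents,
assuming the default DUCET and default NFD normalization; the sample
strings are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $strict = Unicode::Collate->new(level => 3);
my $loose  = Unicode::Collate->new(level => 3, ignore_level2 => 1);

my $accented = "r\x{E9}sum\x{E9}";

# Accents count at level 3 unless ignore_level2 is set.
my $accent_strict = $strict->eq("resume", $accented) ? 1 : 0;
my $accent_loose  = $loose->eq("resume", $accented)  ? 1 : 0;

# Case still counts when ignore_level2 is set.
my $case_loose = $loose->eq("resume", "Resume") ? 1 : 0;

print "strict, accent ignored: $accent_strict\n";
print "loose,  accent ignored: $accent_loose\n";
print "loose,  case ignored  : $case_loose\n";
```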
258
259 katakana_before_hiragana
260 -- see 7.2 Tertiary Weight Table, UTS #10.
261
262 By default, hiragana is before katakana. If the parameter is made
263 true, this is reversed.
264
NOTE: This parameter simply assumes that any hiragana/katakana
distinctions occur at level 3, and that their level 3 weights are
the same as those mentioned in 7.3.1, UTS #10. If you define
collation elements that violate this requirement, this parameter
does not work correctly.
270
271 level
272 -- see 4.3 Form Sort Key, UTS #10.
273
274 Set the maximum level. Any higher levels than the specified one
275 are ignored.
276
277 Level 1: alphabetic ordering
278 Level 2: diacritic ordering
279 Level 3: case ordering
280 Level 4: tie-breaking (e.g. in the case when variable is 'shifted')
281
282 ex.level => 2,
283
284 If omitted, the maximum is the 4th.
285
286 NOTE: The DUCET includes weights over 0xFFFF at the 4th level. But
287 this module only uses weights within 0xFFFF. When "variable" is
288 'blanked' or 'non-ignorable' (other than 'shifted' and
289 'shift-trimmed'), the level 4 may be unreliable.
290
291 See also "identical".
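
The effect of the maximum level on equality can be sketched as below,
assuming the default DUCET; the strings and the %equal bookkeeping are
illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $plain    = "resume";
my $accented = "r\x{E9}sum\x{E9}";   # differs from $plain at level 2
my $upper    = "RESUME";             # differs from $plain at level 3

my %equal;
for my $level (1 .. 3) {
    my $coll = Unicode::Collate->new(level => $level);
    $equal{$level} = {
        accent => $coll->eq($plain, $accented) ? 1 : 0,
        case   => $coll->eq($plain, $upper)    ? 1 : 0,
    };
    printf "level %d: equal ignoring accents=%d, equal ignoring case=%d\n",
        $level, $equal{$level}{accent}, $equal{$level}{case};
}
```

At level 1 both pairs compare equal; level 2 starts to distinguish
accents, and level 3 additionally distinguishes case.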
292
293 long_contraction
294 -- see 3.8.2 Well-Formedness of the DUCET, 4.2 Produce Array, UTS
295 #10.
296
297 If the parameter is made true, for a contraction with three or more
298 characters (here nicknamed "long contraction"), initial substrings
299 will be handled. For example, a contraction ABC, where A is a
300 starter, and B and C are non-starters (character with non-zero
301 combining character class), will be detected even if there is not
302 AB as a contraction.
303
304 Default: Usually false. If "UCA_Version" is 22 or 24, and the
305 value of "long_contraction" is not specified in "new()", a true
306 value is set implicitly. This is a workaround to pass Conformance
307 Tests for Unicode 6.0.0 and 6.1.0.
308
"change()" changes "long_contraction" only when it is specified
explicitly. If "long_contraction" is not specified in "change()",
it is left unchanged even though "UCA_Version" is changed.

Limitation: Scanning non-starters is one-way (no backtracking).
If AB is found but ABC is not found, another long contraction
where the first character is A and the second is not B may not be
found.
317
318 Under "(normalization => undef)", detection step of discontiguous
319 contractions will be skipped.
320
321 Note: The following contractions in DUCET are not considered in
322 steps S2.1.1 to S2.1.3, where they are discontiguous.
323
324 0FB2 0F71 0F80 (TIBETAN VOWEL SIGN VOCALIC RR)
325 0FB3 0F71 0F80 (TIBETAN VOWEL SIGN VOCALIC LL)
326
327 For example "TIBETAN VOWEL SIGN VOCALIC RR" with "COMBINING TILDE
328 OVERLAY" ("U+0344") is "0FB2 0344 0F71 0F80" in NFD. In this case
329 "0FB2 0F80" ("TIBETAN VOWEL SIGN VOCALIC R") is detected, instead
330 of "0FB2 0F71 0F80". Inserted 0344 makes "0FB2 0F71 0F80"
331 discontiguous and lack of contraction "0FB2 0F71" prohibits "0FB2
332 0F71 0F80" from being detected.
333
334 minimalFFFE
335 -- see 1.1.1 U+FFFE, UTS #35 (LDML) Part 5: Collation.
336
If the parameter is made true, "U+FFFE" has a minimal primary
weight. The comparison between "$a1\x{FFFE}$a2" and
"$b1\x{FFFE}$b2" first compares $a1 and $b1 at level 1, and then
$a2 and $b2 at level 1, as follows.
341
342 "ab\x{FFFE}a"
343 "Ab\x{FFFE}a"
344 "ab\x{FFFE}c"
345 "Ab\x{FFFE}c"
346 "ab\x{FFFE}xyz"
347 "abc\x{FFFE}def"
348 "abc\x{FFFE}xYz"
349 "aBc\x{FFFE}xyz"
350 "abcX\x{FFFE}def"
351 "abcx\x{FFFE}xyz"
352 "b\x{FFFE}aaa"
353 "bbb\x{FFFE}a"
354
355 Note: This is equivalent to "(entry => 'FFFE ;
356 [.0001.0020.0005.FFFE]')". Any other character than "U+FFFE" can
357 be tailored by "entry".
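
A typical use is sorting records by several fields at once. The sketch
below assumes the default DUCET and compares a comma-joined key with a
key joined by "U+FFFE"; the names are made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @people = (["Smithson", "Al"], ["Smith", "Zed"], ["Smith", "Al"]);

my $default = Unicode::Collate->new();
my $fielded = Unicode::Collate->new(minimalFFFE => 1);

# Comma and space are variable (ignored at levels 1 to 3 by default),
# so the two fields run together.
my @by_comma = $default->sort(map { "$_->[0], $_->[1]" } @people);

# U+FFFE sorts below every letter, so the first field is settled first.
my @by_fffe = map { join ", ", split /\x{FFFE}/, $_ }
              $fielded->sort(map { "$_->[0]\x{FFFE}$_->[1]" } @people);

print join("; ", @by_comma), "\n";
print join("; ", @by_fffe),  "\n";
```

With the comma-joined key, "Smithson, Al" lands before "Smith, Zed"
because "smithsonal" is compared with "smithzed"; with "U+FFFE" as the
separator all "Smith" entries stay together.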
358
359 normalization
360 -- see 4.1 Normalize, UTS #10.
361
362 If specified, strings are normalized before preparation of sort
363 keys (the normalization is executed after preprocess).
364
Any form name that "Unicode::Normalize::normalize()" accepts may be
given as $normalization_form. Acceptable names include 'NFD',
'NFC', 'NFKD', and 'NFKC'. See "Unicode::Normalize::normalize()"
for details. If omitted, 'NFD' is used.
369
370 "normalization" is performed after "preprocess" (if defined).
371
372 Furthermore, special values, "undef" and "prenormalized", can be
373 used, though they are not concerned with
374 "Unicode::Normalize::normalize()".
375
If "undef" (not the string "undef") is passed explicitly as the
value for this key, no normalization is carried out (this may make
tailoring easier if normalization is not desired). Under
"(normalization => undef)", only contiguous contractions are
resolved; e.g. even if "A-ring" (and "A-ring-cedilla") is ordered
after "Z", "A-cedilla-ring" would be primary equal to "A". In this
respect, "(normalization => undef, preprocess => sub { NFD(shift) })"
is not equivalent to "(normalization => 'NFD')".

In the case of "(normalization => "prenormalized")", no
normalization is performed, but discontiguous contractions with
combining characters are still resolved. Therefore "(normalization =>
'prenormalized', preprocess => sub { NFD(shift) })" is equivalent
to "(normalization => 'NFD')". If source strings are already
properly normalized, "(normalization => 'prenormalized')" may save
the time spent on normalization.
392
393 Except "(normalization => undef)", Unicode::Normalize is required
394 (see also CAVEAT).
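
The equivalence stated for 'prenormalized' can be checked with a short
sketch. It assumes the default DUCET and that the input is normalized
once up front with Unicode::Normalize::NFD; the sample strings are
illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Normalize qw(NFD);

my @raw = ("r\x{E9}sum\x{E9}", "resume", "R\x{E9}sum\x{E9}", "\x{C5}ngstr\x{F6}m");
my @nfd = map { NFD($_) } @raw;   # normalize once, e.g. when loading data

my $default = Unicode::Collate->new();   # normalizes to NFD on every call
my $pre     = Unicode::Collate->new(normalization => 'prenormalized');

# Both collators should produce the same sort key for the same text.
my $same = 1;
for my $i (0 .. $#raw) {
    $same = 0 if $default->getSortKey($raw[$i]) ne $pre->getSortKey($nfd[$i]);
}

print $same ? "identical sort keys\n" : "sort keys differ\n";
```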
395
396 overrideCJK
397 -- see 7.1 Derived Collation Elements, UTS #10.
398
399 By default, CJK unified ideographs are ordered in Unicode codepoint
400 order, but those in the CJK Unified Ideographs block are less than
401 those in the CJK Unified Ideographs Extension A etc.
402
403 In the CJK Unified Ideographs block:
404 U+4E00..U+9FA5 if UCA_Version is 8, 9 or 11.
405 U+4E00..U+9FBB if UCA_Version is 14 or 16.
406 U+4E00..U+9FC3 if UCA_Version is 18.
407 U+4E00..U+9FCB if UCA_Version is 20 or 22.
408 U+4E00..U+9FCC if UCA_Version is 24 to 30.
409 U+4E00..U+9FD5 if UCA_Version is 32 or 34.
410 U+4E00..U+9FEA if UCA_Version is 36.
411 U+4E00..U+9FEF if UCA_Version is 38, 40 or 41.
412 U+4E00..U+9FFC if UCA_Version is 43.
413
414 In the CJK Unified Ideographs Extension blocks:
415 Ext.A (U+3400..U+4DB5) if UCA_Version is 8 to 41.
416 Ext.A (U+3400..U+4DBF) if UCA_Version is 43.
417 Ext.B (U+20000..U+2A6D6) if UCA_Version is 8 to 41.
418 Ext.B (U+20000..U+2A6DD) if UCA_Version is 43.
419 Ext.C (U+2A700..U+2B734) if UCA_Version is 20 or later.
420 Ext.D (U+2B740..U+2B81D) if UCA_Version is 22 or later.
421 Ext.E (U+2B820..U+2CEA1) if UCA_Version is 32 or later.
422 Ext.F (U+2CEB0..U+2EBE0) if UCA_Version is 36 or later.
423 Ext.G (U+30000..U+3134A) if UCA_Version is 43.
424
425 Through "overrideCJK", ordering of CJK unified ideographs
426 (including extensions) can be overridden.
427
428 ex. CJK unified ideographs in the JIS code point order.
429
430 overrideCJK => sub {
431 my $u = shift; # get a Unicode codepoint
432 my $b = pack('n', $u); # to UTF-16BE
433 my $s = your_unicode_to_sjis_converter($b); # convert
434 my $n = unpack('n', $s); # convert sjis to short
435 [ $n, 0x20, 0x2, $u ]; # return the collation element
436 },
437
438 The return value may be an arrayref of 1st to 4th weights as shown
439 above. The return value may be an integer as the primary weight as
440 shown below. If "undef" is returned, the default derived collation
441 element will be used.
442
443 overrideCJK => sub {
444 my $u = shift; # get a Unicode codepoint
445 my $b = pack('n', $u); # to UTF-16BE
446 my $s = your_unicode_to_sjis_converter($b); # convert
447 my $n = unpack('n', $s); # convert sjis to short
448 return $n; # return the primary weight
449 },
450
451 The return value may be a list containing zero or more of an
452 arrayref, an integer, or "undef".
453
454 ex. ignores all CJK unified ideographs.
455
456 overrideCJK => sub {()}, # CODEREF returning empty list
457
458 # where ->eq("Pe\x{4E00}rl", "Perl") is true
459 # as U+4E00 is a CJK unified ideograph and to be ignorable.
460
461 If a false value (including "undef") is passed, "overrideCJK" has
462 no effect. "$Collator->change(overrideCJK => 0)" resets the old
463 one.
464
465 But assignment of weight for CJK unified ideographs in "table" or
466 "entry" is still valid. If "undef" is passed explicitly as the
467 value for this key, weights for CJK unified ideographs are treated
468 as undefined. However when "UCA_Version" > 8, "(overrideCJK =>
469 undef)" has no special meaning.
470
471 Note: In addition to them, 12 CJK compatibility ideographs
472 ("U+FA0E", "U+FA0F", "U+FA11", "U+FA13", "U+FA14", "U+FA1F",
473 "U+FA21", "U+FA23", "U+FA24", "U+FA27", "U+FA28", "U+FA29") are
474 also treated as CJK unified ideographs. But they can't be
475 overridden via "overrideCJK" when you use DUCET, as the table
476 includes weights for them. "table" or "entry" has priority over
477 "overrideCJK".
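
As a runnable variation on the snippets above, the sketch below orders
ideographs of the main CJK Unified Ideographs block in reverse code
point order by returning an integer primary weight. The weight formula
is purely illustrative, and the sample only exercises code points in
U+4E00..U+9FFF.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $default  = Unicode::Collate->new();
my $reversed = Unicode::Collate->new(
    overrideCJK => sub {
        my $u = shift;                        # a Unicode code point
        # Outside the main block: return undef, which per the text above
        # selects the default derived collation element. (A bare "return"
        # would yield an empty list, i.e. an ignorable character.)
        return undef unless 0x4E00 <= $u && $u <= 0x9FFF;
        return 0xFFFF - ($u - 0x4E00);        # integer = primary weight
    },
);

my @ideographs = ("\x{4E09}", "\x{4E8C}", "\x{4E00}");

my @asc  = map { sprintf "U+%04X", ord } $default->sort(@ideographs);
my @desc = map { sprintf "U+%04X", ord } $reversed->sort(@ideographs);

print "default : @asc\n";
print "reversed: @desc\n";
```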
478
479 overrideHangul
480 -- see 7.1 Derived Collation Elements, UTS #10.
481
482 By default, Hangul syllables are decomposed into Hangul Jamo, even
483 if "(normalization => undef)". But the mapping of Hangul syllables
484 may be overridden.
485
486 This parameter works like "overrideCJK", so see there for examples.
487
488 If you want to override the mapping of Hangul syllables, NFD and
489 NFKD are not appropriate, since NFD and NFKD will decompose Hangul
490 syllables before overriding. FCD may decompose Hangul syllables as
491 the case may be.
492
493 If a false value (but not "undef") is passed, "overrideHangul" has
494 no effect. "$Collator->change(overrideHangul => 0)" resets the old
495 one.
496
497 If "undef" is passed explicitly as the value for this key, weight
498 for Hangul syllables is treated as undefined without decomposition
499 into Hangul Jamo. But definition of weight for Hangul syllables in
500 "table" or "entry" is still valid.
501
502 overrideOut
503 -- see 7.1.1 Handling Ill-Formed Code Unit Sequences, UTS #10.
504
505 Perl seems to allow out-of-range values (greater than 0x10FFFF).
506 By default, out-of-range values are replaced with "U+FFFD"
507 (REPLACEMENT CHARACTER) when "UCA_Version" >= 22, or ignored when
508 "UCA_Version" <= 20.
509
510 When "UCA_Version" >= 22, the weights of out-of-range values can be
511 overridden. Though "table" or "entry" are available for them, out-
512 of-range values are too many.
513
514 "overrideOut" can perform it algorithmically. This parameter works
515 like "overrideCJK", so see there for examples.
516
517 ex. ignores all out-of-range values.
518
519 overrideOut => sub {()}, # CODEREF returning empty list
520
521 If a false value (including "undef") is passed, "overrideOut" has
522 no effect. "$Collator->change(overrideOut => 0)" resets the old
523 one.
524
525 NOTE ABOUT U+FFFD:
526
527 UCA recommends that out-of-range values should not be ignored for
528 security reasons. Say, "pe\x{110000}rl" should not be equal to
529 "perl". However, "U+FFFD" is wrongly mapped to a variable
530 collation element in DUCET for Unicode 6.0.0 to 6.2.0, that means
531 out-of-range values will be ignored when "variable" isn't
532 "Non-ignorable".
533
The mapping of "U+FFFD" is corrected in Unicode 6.3.0. See
<http://www.unicode.org/reports/tr10/tr10-28.html#Trailing_Weights>
(7.1.4 Trailing Weights). The following reproduces that correction:
537
538 overrideOut => sub { 0xFFFD }, # CODEREF returning a very large integer
539
540 This workaround is unnecessary since Unicode 6.3.0.
541
542 preprocess
543 -- see 5.4 Preprocessing, UTS #10.
544
545 If specified, the coderef is used to preprocess each string before
546 the formation of sort keys.
547
548 ex. dropping English articles, such as "a" or "the". Then, "the
549 pen" is before "a pencil".
550
551 preprocess => sub {
552 my $str = shift;
553 $str =~ s/\b(?:an?|the)\s+//gi;
554 return $str;
555 },
556
557 "preprocess" is performed before "normalization" (if defined).
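
A complete, runnable version of the article-dropping example, assuming
the default DUCET; the sample strings are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $coll = Unicode::Collate->new(
    preprocess => sub {
        my $str = shift;
        $str =~ s/\b(?:an?|the)\s+//gi;   # drop "a", "an", "the"
        return $str;
    },
);

my @titles = ("the pen", "a pencil", "an apple");

# The sort keys are built from "pen", "pencil" and "apple",
# but the original strings are what sort() returns.
my @sorted = $coll->sort(@titles);

print join(" | ", @sorted), "\n";
```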
558
559 ex. decoding strings in a legacy encoding such as shift-jis:
560
561 $sjis_collator = Unicode::Collate->new(
562 preprocess => \&your_shiftjis_to_unicode_decoder,
563 );
564 @result = $sjis_collator->sort(@shiftjis_strings);
565
566 Note: Strings returned from the coderef will be interpreted
567 according to Perl's Unicode support. See perlunicode, perluniintro,
568 perlunitut, perlunifaq, utf8.
569
570 rearrange
571 -- see 3.5 Rearrangement, UTS #10.
572
Characters that are not coded in logical order and are to be
rearranged. If "UCA_Version" is equal to or less than 11, the
default is:
576
577 rearrange => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4 ],
578
579 If you want to disallow any rearrangement, pass "undef" or "[]" (a
580 reference to empty list) as the value for this key.
581
582 If "UCA_Version" is equal to or greater than 14, default is "[]"
583 (i.e. no rearrangement).
584
According to version 9 of UCA, this parameter shall not be
used; but at present no warning is issued.
587
588 rewrite
589 If specified, the coderef is used to rewrite lines in "table" or
590 "entry". The coderef will get each line, and then should return a
591 rewritten line according to the UCA file format. If the coderef
592 returns an empty line, the line will be skipped.
593
594 e.g. any primary ignorable characters into tertiary ignorable:
595
596 rewrite => sub {
597 my $line = shift;
598 $line =~ s/\[\.0000\..{4}\..{4}\./[.0000.0000.0000./g;
599 return $line;
600 },
601
602 This example shows rewriting weights. "rewrite" is allowed to
603 affect code points, weights, and the name.
604
605 NOTE: "table" is available to use another table file; preparing a
606 modified table once would be more efficient than rewriting lines on
607 reading an unmodified table every time.
608
609 suppress
610 -- see 3.12 Special-Purpose Commands, UTS #35 (LDML) Part 5:
611 Collation.
612
613 Contractions beginning with the specified characters are
614 suppressed, even if those contractions are defined in "table".
615
616 An example for Russian and some languages using the Cyrillic
617 script:
618
619 suppress => [0x0400..0x0417, 0x041A..0x0437, 0x043A..0x045F],
620
621 where 0x0400 stands for "U+0400", CYRILLIC CAPITAL LETTER IE WITH
622 GRAVE.
623
624 NOTE: Contractions via "entry" will not be suppressed.
625
626 table
627 -- see 3.8 Default Unicode Collation Element Table, UTS #10.
628
629 You can use another collation element table if desired.
630
The table file should be located in the Unicode/Collate directory
on @INC. Say, if the filename is Foo.txt, the table file is
searched for as Unicode/Collate/Foo.txt in @INC.
634
635 By default, allkeys.txt (as the filename of DUCET) is used. If you
636 will prepare your own table file, any name other than allkeys.txt
637 may be better to avoid namespace conflict.
638
NOTE: When XSUB is used, the DUCET is compiled when this module is
built, which may save time at run time. Explicitly specifying
"(table => 'allkeys.txt')", using another table, or using
"ignoreChar", "ignoreName", "undefChar", "undefName" or "rewrite"
will prevent this module from using the compiled DUCET.
644
645 If "undef" is passed explicitly as the value for this key, no file
646 is read (but you can define collation elements via "entry").
647
648 A typical way to define a collation element table without any file
649 of table:
650
651 $onlyABC = Unicode::Collate->new(
652 table => undef,
653 entry => << 'ENTRIES',
654 0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
655 0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
656 0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
657 0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
658 0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
659 0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
660 ENTRIES
661 );
662
663 If "ignoreName" or "undefName" is used, character names should be
664 specified as a comment (following "#") on each line.
665
666 undefChar
667 undefName
668 -- see 6.3.3 Reducing the Repertoire, UTS #10.
669
670 Undefines the collation element as if it were unassigned in the
671 "table". This reduces the size of the table. If an unassigned
672 character appears in the string to be collated, the sort key is
673 made from its codepoint as a single-character collation element, as
674 it is greater than any other assigned collation elements (in the
675 codepoint order among the unassigned characters). But, it'd be
676 better to ignore characters unfamiliar to you and maybe never used.
677
678 Through "undefChar", any character matching "qr/$undefChar/" will
679 be undefined. Through "undefName", any character whose name (given
680 in the "table" file as a comment) matches "qr/$undefName/" will be
681 undefined.
682
683 ex. Collation weights for beyond-BMP characters are not stored in
684 object:
685
686 undefChar => qr/[^\0-\x{fffd}]/,
687
688 upper_before_lower
689 -- see 6.6 Case Comparisons, UTS #10.
690
691 By default, lowercase is before uppercase. If the parameter is
692 made true, this is reversed.
693
NOTE: This parameter simply assumes that any lowercase/uppercase
distinctions occur at level 3, and that their level 3 weights are
the same as those mentioned in 7.3.1, UTS #10. If you define
collation elements that differ from this requirement, this
parameter does not work correctly.
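
A minimal sketch of the two case orders, assuming the default DUCET; the
word list is illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = qw(Apple apple APPLE);

my $lower_first = Unicode::Collate->new();
my $upper_first = Unicode::Collate->new(upper_before_lower => 1);

my @default  = $lower_first->sort(@words);
my @reversed = $upper_first->sort(@words);

print "default           : @default\n";
print "upper_before_lower: @reversed\n";
```

The three words are equal at levels 1 and 2, so the level 3 (case)
weights alone decide their order.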
699
700 variable
701 -- see 3.6 Variable Weighting, UTS #10.
702
703 This key allows for variable weighting of variable collation
704 elements, which are marked with an ASTERISK in the table (NOTE:
705 Many punctuation marks and symbols are variable in allkeys.txt).
706
707 variable => 'blanked', 'non-ignorable', 'shifted', or 'shift-trimmed'.
708
709 These names are case-insensitive. By default (if specification is
710 omitted), 'shifted' is adopted.
711
'Blanked'        Variable elements are made ignorable at levels 1 through 3;
                 considered at the 4th level.

'Non-Ignorable'  Variable elements are not reset to ignorable.

'Shifted'        Variable elements are made ignorable at levels 1 through 3;
                 their level 4 weight is replaced by the old level 1 weight.
                 Level 4 weight for Non-Variable elements is 0xFFFF.

'Shift-Trimmed'  Same as 'shifted', but all FFFF's at the 4th level
                 are trimmed.
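
The difference between 'shifted' and 'non-ignorable' can be sketched
with a hyphenated word, assuming the default DUCET (where HYPHEN-MINUS
is a variable element). "level => 3" is used so that the level 4
tie-break does not come into play.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $shifted = Unicode::Collate->new(level => 3);   # 'shifted' is the default
my $non_ign = Unicode::Collate->new(level => 3, variable => 'non-ignorable');

# Under 'shifted' the hyphen is ignorable at levels 1 through 3.
my $eq_shifted = $shifted->eq("black-bird", "blackbird") ? 1 : 0;

# Under 'non-ignorable' the hyphen keeps its (low) primary weight.
my $eq_non_ign = $non_ign->eq("black-bird", "blackbird") ? 1 : 0;
my $lt_non_ign = $non_ign->lt("black-bird", "blackbird") ? 1 : 0;

print "shifted      : equal=$eq_shifted\n";
print "non-ignorable: equal=$eq_non_ign, hyphenated first=$lt_non_ign\n";
```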
723
724 Methods for Collation
725 "@sorted = $Collator->sort(@not_sorted)"
726 Sorts a list of strings.
727
728 "$result = $Collator->cmp($a, $b)"
729 Returns 1 (when $a is greater than $b) or 0 (when $a is equal to
730 $b) or -1 (when $a is less than $b).
731
732 "$result = $Collator->eq($a, $b)"
733 "$result = $Collator->ne($a, $b)"
734 "$result = $Collator->lt($a, $b)"
735 "$result = $Collator->le($a, $b)"
736 "$result = $Collator->gt($a, $b)"
737 "$result = $Collator->ge($a, $b)"
They work like the Perl operators of the same names.
739
740 eq : whether $a is equal to $b.
741 ne : whether $a is not equal to $b.
742 lt : whether $a is less than $b.
743 le : whether $a is less than $b or equal to $b.
744 gt : whether $a is greater than $b.
745 ge : whether $a is greater than $b or equal to $b.
746
747 "$sortKey = $Collator->getSortKey($string)"
748 -- see 4.3 Form Sort Key, UTS #10.
749
750 Returns a sort key.
751
752 You compare the sort keys using a binary comparison and get the
753 result of the comparison of the strings using UCA.
754
755 $Collator->getSortKey($a) cmp $Collator->getSortKey($b)
756
757 is equivalent to
758
759 $Collator->cmp($a, $b)
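
Sort keys are useful when the same strings are compared many times,
because each key is computed only once. A minimal sketch, assuming the
default DUCET; the %key cache is an illustrative choice.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $coll  = Unicode::Collate->new();
my @words = qw(banana Apple cherry apple);

# Compute each sort key once, then compare the keys as plain binary strings.
my %key    = map { $_ => $coll->getSortKey($_) } @words;
my @by_key = sort { $key{$a} cmp $key{$b} } @words;

# The collator's own sort method gives the same order.
my @direct = $coll->sort(@words);

print "by key: @by_key\n";
print "direct: @direct\n";
```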
760
761 "$sortKeyForm = $Collator->viewSortKey($string)"
Converts a sort key into its printable representation. If
"UCA_Version" is 8, the output is slightly different.
764
765 use Unicode::Collate;
766 my $c = Unicode::Collate->new();
767 print $c->viewSortKey("Perl"),"\n";
768
769 # output:
770 # [0B67 0A65 0B7F 0B03 | 0020 0020 0020 0020 | 0008 0002 0002 0002 | FFFF FFFF FFFF FFFF]
771 # Level 1 Level 2 Level 3 Level 4
772
773 Methods for Searching
The "match", "gmatch", "subst", "gsubst" methods work like "m//",
"m//g", "s///", "s///g", respectively, but they do not interpret
patterns; they search only for a literal substring.

DISCLAIMER: If the "preprocess" or "normalization" parameter is true
for $Collator, calling these methods ("index", "match", "gmatch",
"subst", "gsubst") croaks, as the position and the length might differ
from those on the specified string.
782
783 "rearrange" and "hangul_terminator" parameters are neglected.
784 "katakana_before_hiragana" and "upper_before_lower" don't affect
785 matching and searching, as it doesn't matter whether greater or less.
786
787 "$position = $Collator->index($string, $substring[, $position])"
788 "($position, $length) = $Collator->index($string, $substring[,
789 $position])"
790 If $substring matches a part of $string, returns the position of
791 the first occurrence of the matching part in scalar context; in
792 list context, returns a two-element list of the position and the
793 length of the matching part.
794
795 If $substring does not match any part of $string, returns "-1" in
796 scalar context and an empty list in list context.
797
e.g. when the content of $str is "Ich muß studieren Perl.", you
say the following where $sub is "MüSS",
800
801 my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
802 # (normalization => undef) is REQUIRED.
803 my $match;
804 if (my($pos,$len) = $Collator->index($str, $sub)) {
805 $match = substr($str, $pos, $len);
806 }
807
and get "muß" in $match, since "muß" is primary equal to
"MüSS".
810
811 "$match_ref = $Collator->match($string, $substring)"
812 "($match) = $Collator->match($string, $substring)"
813 If $substring matches a part of $string, in scalar context, returns
814 a reference to the first occurrence of the matching part
815 ($match_ref is always true if matches, since every reference is
816 true); in list context, returns the first occurrence of the
817 matching part.
818
819 If $substring does not match any part of $string, returns "undef"
820 in scalar context and an empty list in list context.
821
822 e.g.
823
824 if ($match_ref = $Collator->match($str, $sub)) { # scalar context
825 print "matches [$$match_ref].\n";
826 } else {
827 print "doesn't match.\n";
828 }
829
830 or
831
832 if (($match) = $Collator->match($str, $sub)) { # list context
833 print "matches [$match].\n";
834 } else {
835 print "doesn't match.\n";
836 }
837
838 "@match = $Collator->gmatch($string, $substring)"
839 If $substring matches a part of $string, returns all the matching
840 parts (or matching count in scalar context).
841
842 If $substring does not match any part of $string, returns an empty
843 list.
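
A short sketch of both calling contexts, assuming a level 1 collator
without normalization and an illustrative sample string:

```perl
use strict;
use warnings;
use Unicode::Collate;

# gmatch() requires (normalization => undef).
my $Collator = Unicode::Collate->new(normalization => undef, level => 1);

my $str = "Camel donkey zebra came\x{301}l CAMEL horse";

# List context: every matching part, exactly as it appears in $str.
my @match = $Collator->gmatch($str, "camel");

# Scalar context: only the number of matches.
my $count = $Collator->gmatch($str, "camel");

binmode STDOUT, ':encoding(UTF-8)';   # came\x{301}l contains a wide character
print "$count matches: ", join(", ", map { "[$_]" } @match), "\n";
```

The matching parts keep their original spelling, so they need not be
string-equal to $substring or to each other.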
844
845 "$count = $Collator->subst($string, $substring, $replacement)"
If $substring matches a part of $string, the first occurrence of
the matching part is replaced by $replacement ($string is modified)
and $count (always equal to 1) is returned.
849
850 $replacement can be a "CODEREF", taking the matching part as an
851 argument, and returning a string to replace the matching part (a
852 bit similar to "s/(..)/$coderef->($1)/e").
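
For illustration, a minimal sketch of subst() with a plain string and
with a CODEREF; the sample strings are assumptions made up for this
example:

```perl
use strict;
use warnings;
use Unicode::Collate;

# subst() requires (normalization => undef).
my $Collator = Unicode::Collate->new(normalization => undef, level => 1);

# Plain string replacement: only the first occurrence is touched.
my $plain = "the camel and the CAMEL";
my $count = $Collator->subst($plain, "CAMEL", "llama");

# CODEREF replacement: the matching part arrives as $_[0].
my $marked = "the camel and the CAMEL";
$Collator->subst($marked, "CAMEL", sub { "[$_[0]]" });

print "$count: $plain\n";
print "$marked\n";
```

Because the collator works at level 1, the substring "CAMEL" matches the
lowercase "camel" that comes first in the string.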
853
854 "$count = $Collator->gsubst($string, $substring, $replacement)"
855 If $substring matches a part of $string, all the occurrences of the
856 matching part are replaced by $replacement ($string is modified)
857 and $count is returned.
858
859 $replacement can be a "CODEREF", taking the matching part as an
860 argument, and returning a string to replace the matching part (a
861 bit similar to "s/(..)/$coderef->($1)/eg").
862
863 e.g.
864
865 my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
866 # (normalization => undef) is REQUIRED.
867 my $str = "Camel donkey zebra came\x{301}l CAMEL horse cam\0e\0l...";
868 $Collator->gsubst($str, "camel", sub { "<b>$_[0]</b>" });
869
870 # now $str is "<b>Camel</b> donkey zebra <b>came\x{301}l</b> <b>CAMEL</b> horse <b>cam\0e\0l</b>...";
871 # i.e., all the camels are made bold-faced.
872
Examples: levels and ignore_level2 - what does camel match?
---------------------------------------------------------------------------
 level  ignore_level2  |  camel  Camel  came\x{301}l  c-a-m-e-l  cam\0e\0l
-----------------------|---------------------------------------------------
   1        false      |   yes    yes      yes          yes        yes
   2        false      |   yes    yes      no           yes        yes
   3        false      |   yes    no       no           yes        yes
   4        false      |   yes    no       no           no         yes
-----------------------|---------------------------------------------------
   1        true       |   yes    yes      yes          yes        yes
   2        true       |   yes    yes      yes          yes        yes
   3        true       |   yes    no       yes          yes        yes
   4        true       |   yes    no       yes          no         yes
---------------------------------------------------------------------------
 note: if variable => non-ignorable, camel doesn't match c-a-m-e-l
       at any level.
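
The combinations above can be checked with a small script. This is a
sketch only: camel_row() is a hypothetical helper, and it builds a fresh
collator for each combination rather than reusing one.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @texts = ("camel", "Camel", "came\x{301}l", "c-a-m-e-l", "cam\0e\0l");

# camel_row() is a hypothetical helper, not part of Unicode::Collate.
# It returns "yes" or "no" for each entry of @texts.
sub camel_row {
    my ($level, $ignore_level2) = @_;
    my $Collator = Unicode::Collate->new(
        normalization => undef,          # required by match()
        level         => $level,
        ignore_level2 => $ignore_level2,
    );
    # match() in scalar context returns a reference or undef.
    return map { $Collator->match($_, "camel") ? "yes" : "no" } @texts;
}

for my $ignore (0, 1) {
    for my $level (1 .. 4) {
        my @row = camel_row($level, $ignore);
        printf "%d  %-5s | %s\n", $level, ($ignore ? "true" : "false"), "@row";
    }
}
```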
889
890 Other Methods
891 "%old_tailoring = $Collator->change(%new_tailoring)"
892 "$modified_collator = $Collator->change(%new_tailoring)"
Changes the values of the specified keys and, in list context,
returns the previous values of the changed keys.
894
895 $Collator = Unicode::Collate->new(level => 4);
896
897 $Collator->eq("perl", "PERL"); # false
898
899 %old = $Collator->change(level => 2); # returns (level => 4).
900
901 $Collator->eq("perl", "PERL"); # true
902
903 $Collator->change(%old); # returns (level => 2).
904
905 $Collator->eq("perl", "PERL"); # false
906
Not every key is allowed to be changed. See also
@Unicode::Collate::ChangeOK and @Unicode::Collate::ChangeNG.
909
In scalar context, returns the modified collator itself (it is not
a clone of the original).
912
913 $Collator->change(level => 2)->eq("perl", "PERL"); # true
914
915 $Collator->eq("perl", "PERL"); # true; now max level is 2nd.
916
917 $Collator->change(level => 4)->eq("perl", "PERL"); # false
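
Since the list-context return value holds the previous settings, a
temporary change can be wrapped in a helper that always restores them.
The following is a sketch under that assumption; with_tailoring() is a
hypothetical name, not a method of this module.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new(level => 4);

# with_tailoring() is a hypothetical helper, not part of Unicode::Collate.
# It applies %$tailoring, runs $code, then restores the old values.
sub with_tailoring {
    my ($collator, $tailoring, $code) = @_;
    my %old = $collator->change(%$tailoring);   # list context: old values
    my $result = eval { $code->($collator) };
    my $error  = $@;
    $collator->change(%old);                    # restore even on failure
    die $error if $error;
    return $result;
}

my $loose  = with_tailoring($Collator, { level => 1 },
                            sub { $_[0]->eq("perl", "PERL") });
my $strict = $Collator->eq("perl", "PERL");     # level is 4 again here

print "level 1: ", ($loose  ? "equal" : "different"), "\n";
print "level 4: ", ($strict ? "equal" : "different"), "\n";
```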
918
919 "$version = $Collator->version()"
920 Returns the version number (a string) of the Unicode Standard which
921 the "table" file used by the collator object is based on. If the
922 table does not include a version line (starting with @version),
923 returns "unknown".
924
925 "UCA_Version()"
Returns the revision number of UTS #10 this module consults, which
should correspond with the DUCET incorporated.
928
929 "Base_Unicode_Version()"
Returns the version number of the Unicode Standard on which the
UTS #10 revision this module consults is based; it should
correspond with the DUCET incorporated.
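
A short sketch that prints all three values. It assumes the collator is
built with no tailoring, so version() reports on the default table:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new();

# version() is an object method; it reports on the table in use.
my $table_version = $Collator->version();

# The other two are plain functions of the module.
my $uca_revision = Unicode::Collate::UCA_Version();
my $base_version = Unicode::Collate::Base_Unicode_Version();

print "table version:        $table_version\n";
print "UTS #10 revision:     $uca_revision\n";
print "base Unicode version: $base_version\n";
```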
932
934 No method will be exported.
935
Though this module can be used without any "table" file, it is
easiest to use with a table file in the UCA format installed.
To install one, copy it under the directory <a place in
@INC>/Unicode/Collate.
941
The preferred table is "The Default Unicode Collation Element
Table" (aka DUCET), available from the Unicode Consortium's website:
944
945 http://www.unicode.org/Public/UCA/
946
947 http://www.unicode.org/Public/UCA/latest/allkeys.txt
948 (latest version)
949
950 If DUCET is not installed, it is recommended to copy the file from
951 http://www.unicode.org/Public/UCA/latest/allkeys.txt to <a place in
952 @INC>/Unicode/Collate/allkeys.txt manually.
953
955 Normalization
956 Use of the "normalization" parameter requires the
957 Unicode::Normalize module (see Unicode::Normalize).
958
If you do not need it (say, when you do not need to handle any
combining characters), assign "(normalization => undef)"
explicitly.
962
963 -- see 6.5 Avoiding Normalization, UTS #10.
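
One possible way to handle a missing Unicode::Normalize is to fall back
to "(normalization => undef)" only when the module cannot be loaded.
This is a sketch of that idea, not a recommendation from the module;
the word list is made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Keep the default normalization when Unicode::Normalize loads;
# otherwise switch normalization off explicitly.
my %tailoring = eval { require Unicode::Normalize; 1 }
    ? ()
    : (normalization => undef);

my $Collator = Unicode::Collate->new(%tailoring);

# Plain ASCII input has no combining characters, so either setting
# gives the same order here.
my @sorted = $Collator->sort(qw(banana Apple cherry apple));
print "@sorted\n";
```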
964
965 Conformance Test
966 The Conformance Test for the UCA is available under
967 <http://www.unicode.org/Public/UCA/>.
968
969 For CollationTest_SHIFTED.txt, a collator via
970 "Unicode::Collate->new( )" should be used; for
971 CollationTest_NON_IGNORABLE.txt, a collator via
972 "Unicode::Collate->new(variable => "non-ignorable", level => 3)".
973
If "UCA_Version" is 26 or later, the "identical" level is
preferred; "Unicode::Collate->new(identical => 1)" and
"Unicode::Collate->new(identical => 1, variable =>
"non-ignorable", level => 3)" should be used.
978
Unicode::Normalize is required to run the conformance test.
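
The two collators differ in how they weight variable characters such as
the hyphen. A minimal sketch with made-up sample strings (this is not
the conformance test itself):

```perl
use strict;
use warnings;
use Unicode::Collate;

# Collator for CollationTest_SHIFTED.txt: the default tailoring.
my $shifted = Unicode::Collate->new();

# Collator for CollationTest_NON_IGNORABLE.txt.
my $non_ignorable = Unicode::Collate->new(
    variable => "non-ignorable",
    level    => 3,
);

# Shifted: the hyphen is ignored at the primary level, so "a-c" is
# compared like "ac". Non-ignorable: the hyphen has a primary weight
# of its own, lower than that of any letter.
my $by_shifted       = $shifted->cmp("a-c", "ab");
my $by_non_ignorable = $non_ignorable->cmp("a-c", "ab");

print "shifted:       $by_shifted\n";
print "non-ignorable: $by_non_ignorable\n";
```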
980
982 The Unicode::Collate module for perl was written by SADAHIRO Tomoyuki,
983 <SADAHIRO@cpan.org>. This module is Copyright(C) 2001-2020, SADAHIRO
984 Tomoyuki. Japan. All rights reserved.
985
986 This module is free software; you can redistribute it and/or modify it
987 under the same terms as Perl itself.
988
989 The file Unicode/Collate/allkeys.txt was copied verbatim from
990 <http://www.unicode.org/Public/UCA/13.0.0/allkeys.txt>. For this file,
991 Copyright (c) 2020 Unicode, Inc.; distributed under the Terms of Use in
992 <http://www.unicode.org/terms_of_use.html>
993
995 Unicode Collation Algorithm - UTS #10
996 <http://www.unicode.org/reports/tr10/>
997
998 The Default Unicode Collation Element Table (DUCET)
999 <http://www.unicode.org/Public/UCA/latest/allkeys.txt>
1000
1001 The conformance test for the UCA
1002 <http://www.unicode.org/Public/UCA/latest/CollationTest.html>
1003
1004 <http://www.unicode.org/Public/UCA/latest/CollationTest.zip>
1005
1006 Hangul Syllable Type
1007 <http://www.unicode.org/Public/UNIDATA/HangulSyllableType.txt>
1008
1009 Unicode Normalization Forms - UAX #15
1010 <http://www.unicode.org/reports/tr15/>
1011
1012 Unicode Locale Data Markup Language (LDML) - UTS #35
1013 <http://www.unicode.org/reports/tr35/>
1014
1015
1016
1017perl v5.32.1 2021-01-27 Collate(3)