Unicode::Collate(3pm)    Perl Programmers Reference Guide    Unicode::Collate(3pm)

4
NAME
6 Unicode::Collate - Unicode Collation Algorithm
7
SYNOPSIS
9 use Unicode::Collate;
10
11 #construct
12 $Collator = Unicode::Collate->new(%tailoring);
13
14 #sort
15 @sorted = $Collator->sort(@not_sorted);
16
17 #compare
18 $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.
19
Note: Strings in @not_sorted, $a and $b are interpreted according to
Perl's Unicode support. See perlunicode, perluniintro, perlunitut,
perlunifaq, utf8. If your strings are in some other encoding, either
decode them beforehand or supply a "preprocess" coderef.
24
DESCRIPTION
26 This module is an implementation of Unicode Technical Standard #10
27 (a.k.a. UTS #10) - Unicode Collation Algorithm (a.k.a. UCA).
28
29 Constructor and Tailoring
The "new" method returns a collator object. If new() is called with no
parameters, the collator performs the default collation.
32
33 $Collator = Unicode::Collate->new(
34 UCA_Version => $UCA_Version,
35 alternate => $alternate, # alias for 'variable'
36 backwards => $levelNumber, # or \@levelNumbers
37 entry => $element,
38 hangul_terminator => $term_primary_weight,
39 ignoreName => qr/$ignoreName/,
40 ignoreChar => qr/$ignoreChar/,
41 ignore_level2 => $bool,
42 katakana_before_hiragana => $bool,
43 level => $collationLevel,
44 normalization => $normalization_form,
45 overrideCJK => \&overrideCJK,
46 overrideHangul => \&overrideHangul,
47 preprocess => \&preprocess,
48 rearrange => \@charList,
49 rewrite => \&rewrite,
50 suppress => \@charList,
51 table => $filename,
52 undefName => qr/$undefName/,
53 undefChar => qr/$undefChar/,
54 upper_before_lower => $bool,
55 variable => $variable,
56 );
57
58 UCA_Version
59 If the revision (previously "tracking version") number of UCA is
60 given, behavior of that revision is emulated on collating. If
61 omitted, the return value of "UCA_Version()" is used.
62
63 The following revisions are supported. The default is 24.
64
65 UCA Unicode Standard DUCET (@version)
66 -------------------------------------------------------
67 8 3.1 3.0.1 (3.0.1d9)
68 9 3.1 with Corrigendum 3 3.1.1 (3.1.1)
69 11 4.0 4.0.0 (4.0.0)
70 14 4.1.0 4.1.0 (4.1.0)
71 16 5.0 5.0.0 (5.0.0)
72 18 5.1.0 5.1.0 (5.1.0)
73 20 5.2.0 5.2.0 (5.2.0)
74 22 6.0.0 6.0.0 (6.0.0)
75 24 6.1.0 6.1.0 (6.1.0)
76
77 * Noncharacters (e.g. U+FFFF) are not ignored, and can be
78 overridden since "UCA_Version" 22.
79
80 * Fully ignorable characters were ignored, and would not interrupt
81 contractions with "UCA_Version" 9 and 11.
82
83 * Treatment of ignorables after variables and some behaviors were
84 changed at "UCA_Version" 9.
85
86 * Characters regarded as CJK unified ideographs (cf. "overrideCJK")
87 depend on "UCA_Version".
88
* Many hangul jamo are assigned at "UCA_Version" 20, which
affects "hangul_terminator".
91
92 alternate
93 -- see 3.2.2 Alternate Weighting, version 8 of UTS #10
94
95 For backward compatibility, "alternate" (old name) can be used as
96 an alias for "variable".
97
98 backwards
99 -- see 3.1.2 French Accents, UTS #10.
100
101 backwards => $levelNumber or \@levelNumbers
102
Weights at the specified level(s) are compared in reverse order; e.g.
level 2 (diacritic ordering) in French. If omitted (or $levelNumber
is "undef" or "\@levelNumbers" is "[]"), every level is compared
forwards.
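As a rough illustration, the sketch below sorts four French words with and without "backwards => 2". It assumes the default table (DUCET) is available; the words are written with "\x{...}" escapes so that no source encoding is assumed, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

# cote, cote-with-acute, etc. (U+00E9 = e-acute, U+00F4 = o-circumflex)
my @words = ("c\x{f4}t\x{e9}", "cot\x{e9}", "c\x{f4}te", "cote");

my $forward = Unicode::Collate->new();                # level 2 compared left to right
my $french  = Unicode::Collate->new(backwards => 2);  # level 2 compared right to left

# forward:   cote, cot\x{e9}, c\x{f4}te, c\x{f4}t\x{e9}
my @fwd = $forward->sort(@words);

# backwards: cote, c\x{f4}te, cot\x{e9}, c\x{f4}t\x{e9}
my @bwd = $french->sort(@words);
```

With backwards ordering, the accent nearest the end of the word decides first, so the word with only a final accent sorts after the word with only an earlier accent.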
106
107 entry
108 -- see 3.1 Linguistic Features; 3.2.1 File Format, UTS #10.
109
110 If the same character (or a sequence of characters) exists in the
111 collation element table through "table", mapping to collation
112 elements is overridden. If it does not exist, the mapping is
113 defined additionally.
114
115 entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
116 0063 0068 ; [.0E6A.0020.0002.0063] # ch
117 0043 0068 ; [.0E6A.0020.0007.0043] # Ch
118 0043 0048 ; [.0E6A.0020.0008.0043] # CH
119 006C 006C ; [.0F4C.0020.0002.006C] # ll
120 004C 006C ; [.0F4C.0020.0007.004C] # Ll
121 004C 004C ; [.0F4C.0020.0008.004C] # LL
122 00F1 ; [.0F7B.0020.0002.00F1] # n-tilde
123 006E 0303 ; [.0F7B.0020.0002.00F1] # n-tilde
124 00D1 ; [.0F7B.0020.0008.00D1] # N-tilde
125 004E 0303 ; [.0F7B.0020.0008.00D1] # N-tilde
126 ENTRY
127
128 entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
129 00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae ligature as <a><e>
130 00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE ligature as <A><E>
131 ENTRY
132
NOTE: The code point in the UCA file format (before ';') must be a
Unicode code point (written in hexadecimal), not a native code
point. So 0063 always denotes "U+0063", not the character "\x63"
of the native character set.
137
138 Weighting may vary depending on collation element table. So ensure
139 the weights defined in "entry" will be consistent with those in the
140 collation element table loaded via "table".
141
In DUCET v4.0.0, the primary weight of "C" is "0E60" and that of "D" is
"0E6D". So setting the primary weight of "CH" to "0E6A" (a value
between "0E60" and "0E6D") gives the ordering "C < CH < D". Strictly
speaking, DUCET already has some characters between "C" and "D":
"small capital C" ("U+1D04") with primary weight "0E64",
"c-hook/C-hook" ("U+0188/U+0187") with "0E65", and "c-curl"
("U+0255") with "0E69". So the primary weight "0E6A" places "CH"
between "c-curl" and "D".
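The effect of a contraction defined through "entry" can be sketched without depending on any particular DUCET version by combining it with "table => undef". The weights below are arbitrary illustrative values chosen for this sketch, not values from any real table.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Toy table: primary weights are made up; only their relative order matters.
my $with_ch = Unicode::Collate->new(
    table => undef,
    entry => <<'ENTRY',
0061      ; [.0101.0020.0002.0061] # a
0063      ; [.0103.0020.0002.0063] # c
0063 0068 ; [.0104.0020.0002.0063] # ch (contraction, placed between c and d)
0064      ; [.0105.0020.0002.0064] # d
0068      ; [.0108.0020.0002.0068] # h
006F      ; [.010F.0020.0002.006F] # o
ENTRY
);

# "ch" is weighted as one element after "c", so "cha" follows "co".
my @sorted = $with_ch->sort(qw(da cha co));   # co, cha, da
```

Without the "0063 0068" line, "cha" would sort before "co", because "h" has a smaller primary weight than "o" in this toy table.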
150
151 hangul_terminator
152 -- see 7.1.4 Trailing Weights, UTS #10.
153
154 If a true value is given (non-zero but should be positive), it will
155 be added as a terminator primary weight to the end of every
156 standard Hangul syllable. Secondary and any higher weights for
157 terminator are set to zero. If the value is false or
158 "hangul_terminator" key does not exist, insertion of terminator
159 weights will not be performed.
160
161 Boundaries of Hangul syllables are determined according to
162 conjoining Jamo behavior in the Unicode Standard and
163 HangulSyllableType.txt.
164
Implementation Note: (1) For expansion mapping (Unicode character
mapped to a sequence of collation elements), a terminator will not
be added between collation elements, even if a Hangul syllable
boundary exists there. A terminator is added only after the last
collation element.
170
171 (2) Non-conjoining Hangul letters (Compatibility Jamo, halfwidth
172 Jamo, and enclosed letters) are not automatically terminated with a
173 terminator primary weight. These characters may need terminator
174 included in a collation element table beforehand.
175
176 ignoreChar
177 ignoreName
178 -- see 3.2.2 Variable Weighting, UTS #10.
179
Makes the entry in the table completely ignorable; i.e. as if the
weights were zero at all levels.
182
183 Through "ignoreChar", any character matching "qr/$ignoreChar/" will
184 be ignored. Through "ignoreName", any character whose name (given
185 in the "table" file as a comment) matches "qr/$ignoreName/" will be
186 ignored.
187
188 E.g. when 'a' and 'e' are ignorable, 'element' is equal to 'lament'
189 (or 'lmnt').
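A minimal sketch of the 'element'/'lament' case follows. It assumes the default table file (allkeys.txt) is installed, since "ignoreChar" requires the table to be parsed when the collator is constructed; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Make the single letters 'a' and 'e' completely ignorable.
# The pattern is anchored so that it matches only these two entries.
my $no_ae = Unicode::Collate->new(ignoreChar => qr/^[ae]$/);

my $same_1 = $no_ae->eq('element', 'lament');    # true: both reduce to 'lmnt'
my $same_2 = $no_ae->eq('element', 'lmnt');      # true
my $differ = $no_ae->eq('element', 'elements');  # false: 's' is not ignored
```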
190
191 ignore_level2
192 -- see 5.1 Parametric Tailoring, UTS #10.
193
194 By default, case-sensitive comparison (that is level 3 difference)
195 won't ignore accents (that is level 2 difference).
196
197 If the parameter is made true, accents (and other primary ignorable
198 characters) are ignored, even though cases are taken into account.
199
200 NOTE: "level" should be 3 or greater.
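The following sketch contrasts a level-3 collator with and without "ignore_level2". It assumes the default table is available; "came\x{301}l" is "camel" with a combining acute accent on the "e".

```perl
use strict;
use warnings;
use Unicode::Collate;

my $strict = Unicode::Collate->new(level => 3);
my $loose  = Unicode::Collate->new(level => 3, ignore_level2 => 1);

my $accented = "came\x{301}l";

my $strict_accent = $strict->eq("camel", $accented);  # false: accents count
my $loose_accent  = $loose->eq("camel", $accented);   # true: accents ignored
my $loose_case    = $loose->eq("camel", "Camel");     # false: case still counts
```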
201
202 katakana_before_hiragana
203 -- see 7.3.1 Tertiary Weight Table, UTS #10.
204
205 By default, hiragana is before katakana. If the parameter is made
206 true, this is reversed.
207
NOTE: This parameter naively assumes that any
hiragana/katakana distinctions occur at level 3, and that their
weights at level 3 are the same as those given in 7.3.1, UTS
#10. If you define collation elements that violate this
requirement, this parameter will not work correctly.
213
214 level
215 -- see 4.3 Form Sort Key, UTS #10.
216
217 Set the maximum level. Any higher levels than the specified one
218 are ignored.
219
220 Level 1: alphabetic ordering
221 Level 2: diacritic ordering
222 Level 3: case ordering
223 Level 4: tie-breaking (e.g. in the case when variable is 'shifted')
224
225 ex.level => 2,
226
227 If omitted, the maximum is the 4th.
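As a small sketch of how the maximum level changes equality (assuming the default table; "caf\x{e9}" is "cafe" with e-acute):

```perl
use strict;
use warnings;
use Unicode::Collate;

my $primary   = Unicode::Collate->new(level => 1);  # letters only
my $secondary = Unicode::Collate->new(level => 2);  # letters and accents

my $l1_accent = $primary->eq("cafe", "caf\x{e9}");    # true: accents ignored
my $l1_case   = $primary->eq("perl", "PERL");         # true: case ignored
my $l2_accent = $secondary->eq("cafe", "caf\x{e9}");  # false: accents differ
my $l2_case   = $secondary->eq("perl", "PERL");       # true: case is level 3
```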
228
229 normalization
230 -- see 4.1 Normalize, UTS #10.
231
If specified, strings are normalized before preparation of sort
keys ("normalization" is performed after "preprocess", if defined).

Any form name that "Unicode::Normalize::normalize()" accepts can be
given as $normalization_form. Acceptable names include 'NFD',
'NFC', 'NFKD', and 'NFKC'. See "Unicode::Normalize::normalize()"
for details. If omitted, 'NFD' is used.
241
242 Furthermore, special values, "undef" and "prenormalized", can be
243 used, though they are not concerned with
244 "Unicode::Normalize::normalize()".
245
If "undef" (not the string "undef") is passed explicitly as the value
for this key, no normalization is carried out (this may make
tailoring easier if normalization is not desired). Under
"(normalization => undef)", only contiguous contractions are
resolved; e.g. even if "A-ring" (and "A-ring-cedilla") is ordered
after "Z", "A-cedilla-ring" would be primary equal to "A". In this
respect, "(normalization => undef, preprocess => sub { NFD(shift) })"
is not equivalent to "(normalization => 'NFD')".

With "(normalization => 'prenormalized')", no
normalization is performed, but discontiguous contractions with
combining characters are still resolved. Therefore "(normalization =>
'prenormalized', preprocess => sub { NFD(shift) })" is equivalent
to "(normalization => 'NFD')". If the source strings are already
properly normalized, "(normalization => 'prenormalized')" may save
the time spent on normalization.

Unless "(normalization => undef)" is given, Unicode::Normalize is
required (see also CAVEAT).
265
266 overrideCJK
267 -- see 7.1 Derived Collation Elements, UTS #10.
268
By default, CJK unified ideographs are ordered in Unicode codepoint
order, but those in the CJK Unified Ideographs block sort before
those in the CJK Unified Ideographs Extension A etc.
272
273 In the CJK Unified Ideographs block:
274 U+4E00..U+9FA5 if UCA_Version is 8, 9 or 11.
275 U+4E00..U+9FBB if UCA_Version is 14 or 16.
276 U+4E00..U+9FC3 if UCA_Version is 18.
277 U+4E00..U+9FCB if UCA_Version is 20 or 22.
278 U+4E00..U+9FCC if UCA_Version is 24.
279
280 In the CJK Unified Ideographs Extension blocks:
281 Ext.A (U+3400..U+4DB5) and Ext.B (U+20000..U+2A6D6) in any UCA_Version.
282 Ext.C (U+2A700..U+2B734) if UCA_Version is 20 or greater.
283 Ext.D (U+2B740..U+2B81D) if UCA_Version is 22 or greater.
284
285 Through "overrideCJK", ordering of CJK unified ideographs
286 (including extensions) can be overridden.
287
288 ex. CJK unified ideographs in the JIS code point order.
289
290 overrideCJK => sub {
291 my $u = shift; # get a Unicode codepoint
292 my $b = pack('n', $u); # to UTF-16BE
293 my $s = your_unicode_to_sjis_converter($b); # convert
294 my $n = unpack('n', $s); # convert sjis to short
295 [ $n, 0x20, 0x2, $u ]; # return the collation element
296 },
297
298 The return value may be an arrayref of 1st to 4th weights as shown
299 above. The return value may be an integer as the primary weight as
300 shown below. If "undef" is returned, the default derived collation
301 element will be used.
302
303 overrideCJK => sub {
304 my $u = shift; # get a Unicode codepoint
305 my $b = pack('n', $u); # to UTF-16BE
306 my $s = your_unicode_to_sjis_converter($b); # convert
307 my $n = unpack('n', $s); # convert sjis to short
308 return $n; # return the primary weight
309 },
310
311 The return value may be a list containing zero or more of an
312 arrayref, an integer, or "undef".
313
314 ex. ignores all CJK unified ideographs.
315
316 overrideCJK => sub {()}, # CODEREF returning empty list
317
318 # where ->eq("Pe\x{4E00}rl", "Perl") is true
319 # as U+4E00 is a CJK unified ideograph and to be ignorable.
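The comment above can be checked with a short runnable sketch. It assumes the default table, in which U+4E00 has no explicit entry and therefore goes through "overrideCJK".

```perl
use strict;
use warnings;
use Unicode::Collate;

# Coderef returning an empty list: every CJK unified ideograph is ignored.
my $skip_cjk = Unicode::Collate->new(overrideCJK => sub { () });
my $default  = Unicode::Collate->new();

my $ignored     = $skip_cjk->eq("Pe\x{4E00}rl", "Perl");  # true
my $not_ignored = $default->eq("Pe\x{4E00}rl", "Perl");   # false
```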
320
321 If "undef" is passed explicitly as the value for this key, weights
322 for CJK unified ideographs are treated as undefined. But
323 assignment of weight for CJK unified ideographs in "table" or
324 "entry" is still valid.
325
326 Note: In addition to them, 12 CJK compatibility ideographs
327 ("U+FA0E", "U+FA0F", "U+FA11", "U+FA13", "U+FA14", "U+FA1F",
328 "U+FA21", "U+FA23", "U+FA24", "U+FA27", "U+FA28", "U+FA29") are
329 also treated as CJK unified ideographs. But they can't be
330 overridden via "overrideCJK" when you use DUCET, as the table
331 includes weights for them. "table" or "entry" has priority over
332 "overrideCJK".
333
334 overrideHangul
335 -- see 7.1 Derived Collation Elements, UTS #10.
336
337 By default, Hangul syllables are decomposed into Hangul Jamo, even
338 if "(normalization => undef)". But the mapping of Hangul syllables
339 may be overridden.
340
341 This parameter works like "overrideCJK", so see there for examples.
342
343 If you want to override the mapping of Hangul syllables, NFD and
344 NFKD are not appropriate, since NFD and NFKD will decompose Hangul
345 syllables before overriding. FCD may decompose Hangul syllables as
346 the case may be.
347
348 If "undef" is passed explicitly as the value for this key, weight
349 for Hangul syllables is treated as undefined without decomposition
350 into Hangul Jamo. But definition of weight for Hangul syllables in
351 "table" or "entry" is still valid.
352
353 preprocess
354 -- see 5.1 Preprocessing, UTS #10.
355
356 If specified, the coderef is used to preprocess each string before
357 the formation of sort keys.
358
359 ex. dropping English articles, such as "a" or "the". Then, "the
360 pen" is before "a pencil".
361
362 preprocess => sub {
363 my $str = shift;
364 $str =~ s/\b(?:an?|the)\s+//gi;
365 return $str;
366 },
367
368 "preprocess" is performed before "normalization" (if defined).
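Put together as a runnable sketch (assuming the default table), the article-dropping coderef changes the order of the two phrases mentioned above; note that "sort" returns the original strings, not the preprocessed ones.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $no_articles = Unicode::Collate->new(
    preprocess => sub {
        my $str = shift;
        $str =~ s/\b(?:an?|the)\s+//gi;   # drop "a", "an", "the"
        return $str;
    },
);

# Compared as "pen" and "pencil": "the pen" comes first.
my @titles = $no_articles->sort("a pencil", "the pen");

# Without preprocessing, "a pencil" comes first.
my @plain = Unicode::Collate->new->sort("the pen", "a pencil");
```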
369
370 ex. decoding strings in a legacy encoding such as shift-jis:
371
372 $sjis_collator = Unicode::Collate->new(
373 preprocess => \&your_shiftjis_to_unicode_decoder,
374 );
375 @result = $sjis_collator->sort(@shiftjis_strings);
376
377 Note: Strings returned from the coderef will be interpreted
378 according to Perl's Unicode support. See perlunicode, perluniintro,
379 perlunitut, perlunifaq, utf8.
380
381 rearrange
382 -- see 3.1.3 Rearrangement, UTS #10.
383
Characters that are not coded in logical order and are to be
rearranged. If "UCA_Version" is less than or equal to 11, the
default is:
387
388 rearrange => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4 ],
389
390 If you want to disallow any rearrangement, pass "undef" or "[]" (a
391 reference to empty list) as the value for this key.
392
393 If "UCA_Version" is equal to or greater than 14, default is "[]"
394 (i.e. no rearrangement).
395
According to version 9 of UCA, this parameter shall not be
used; but no warning is issued at present.
398
399 rewrite
400 If specified, the coderef is used to rewrite lines in "table" or
401 "entry". The coderef will get each line, and then should return a
402 rewritten line according to the UCA file format. If the coderef
403 returns an empty line, the line will be skipped.
404
e.g. turning any primary ignorable characters into tertiary ignorable ones:
406
407 rewrite => sub {
408 my $line = shift;
409 $line =~ s/\[\.0000\..{4}\..{4}\./[.0000.0000.0000./g;
410 return $line;
411 },
412
413 This example shows rewriting weights. "rewrite" is allowed to
414 affect code points, weights, and the name.
415
416 NOTE: "table" is available to use another table file; preparing a
417 modified table once would be more efficient than rewriting lines on
418 reading an unmodified table every time.
419
420 suppress
421 -- see suppress contractions in 5.14.11 Special-Purpose Commands,
422 UTS #35 (LDML).
423
424 Contractions beginning with the specified characters are
425 suppressed, even if those contractions are defined in "table".
426
427 An example for Russian and some languages using the Cyrillic
428 script:
429
430 suppress => [0x0400..0x0417, 0x041A..0x0437, 0x043A..0x045F],
431
432 where 0x0400 stands for "U+0400", CYRILLIC CAPITAL LETTER IE WITH
433 GRAVE.
434
NOTE: Contractions via "entry" are not suppressed.
436
437 table
438 -- see 3.2 Default Unicode Collation Element Table, UTS #10.
439
440 You can use another collation element table if desired.
441
The table file should be located in the Unicode/Collate directory on
@INC. Say, if the filename is Foo.txt, the table file is searched for
as Unicode/Collate/Foo.txt in @INC.

By default, allkeys.txt (as the filename of DUCET) is used. If you
prepare your own table file, a name other than allkeys.txt is
advisable, to avoid a namespace conflict.

NOTE: When XSUB is used, the DUCET is compiled when building this
module, which may save time at run time. Explicitly specifying
"table => 'allkeys.txt'" (or using another table), or using
"ignoreChar", "ignoreName", "undefChar", "undefName" or "rewrite"
will prevent this module from using the compiled DUCET.
455
456 If "undef" is passed explicitly as the value for this key, no file
457 is read (but you can define collation elements via "entry").
458
459 A typical way to define a collation element table without any file
460 of table:
461
462 $onlyABC = Unicode::Collate->new(
463 table => undef,
464 entry => << 'ENTRIES',
465 0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
466 0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
467 0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
468 0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
469 0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
470 0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
471 ENTRIES
472 );
473
474 If "ignoreName" or "undefName" is used, character names should be
475 specified as a comment (following "#") on each line.
476
477 undefChar
478 undefName
479 -- see 6.3.4 Reducing the Repertoire, UTS #10.
480
Undefines the collation element as if it were unassigned in the
"table". This reduces the size of the table. If an unassigned
character appears in the string to be collated, the sort key is
made from its codepoint as a single-character collation element
that is greater than any assigned collation element (unassigned
characters sort among themselves in codepoint order). However, for
characters that are unfamiliar to you and will probably never be
used, it may be better to ignore them instead.
488
489 Through "undefChar", any character matching "qr/$undefChar/" will
490 be undefined. Through "undefName", any character whose name (given
491 in the "table" file as a comment) matches "qr/$undefName/" will be
492 undefined.
493
494 ex. Collation weights for beyond-BMP characters are not stored in
495 object:
496
497 undefChar => qr/[^\0-\x{fffd}]/,
498
499 upper_before_lower
500 -- see 6.6 Case Comparisons, UTS #10.
501
502 By default, lowercase is before uppercase. If the parameter is
503 made true, this is reversed.
504
NOTE: This parameter naively assumes that any
lowercase/uppercase distinctions occur at level 3, and that their
weights at level 3 are the same as those given in 7.3.1, UTS
#10. If you define collation elements that differ from this
requirement, this parameter will not work correctly.
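A brief sketch of the difference, assuming the default table:

```perl
use strict;
use warnings;
use Unicode::Collate;

my @letters = qw(B a A b);

# Default: lowercase before uppercase.
my @lower_first = Unicode::Collate->new->sort(@letters);   # a A b B

# Reversed: uppercase before lowercase.
my @upper_first =
    Unicode::Collate->new(upper_before_lower => 1)->sort(@letters);   # A a B b
```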
510
511 variable
512 -- see 3.2.2 Variable Weighting, UTS #10.
513
514 This key allows for variable weighting of variable collation
515 elements, which are marked with an ASTERISK in the table (NOTE:
516 Many punctuation marks and symbols are variable in allkeys.txt).
517
518 variable => 'blanked', 'non-ignorable', 'shifted', or 'shift-trimmed'.
519
520 These names are case-insensitive. By default (if specification is
521 omitted), 'shifted' is adopted.
522
'Blanked'        Variable elements are made ignorable at levels 1 through 3;
                 considered at the 4th level.

'Non-Ignorable'  Variable elements are not reset to ignorable.

'Shifted'        Variable elements are made ignorable at levels 1 through 3;
                 their level 4 weight is replaced by the old level 1 weight.
                 Level 4 weight for Non-Variable elements is 0xFFFF.

'Shift-Trimmed'  Same as 'shifted', but all FFFF's at the 4th level
                 are trimmed.
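To make the options concrete, the sketch below compares "c-a-m-e-l" with "camel" (the hyphen is a variable element in DUCET). It assumes the default table; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

# 'shifted' is the default; level 3 leaves out the 4th-level tie-breaker.
my $shifted = Unicode::Collate->new(level => 3);
my $non_ign = Unicode::Collate->new(level => 3, variable => 'non-ignorable');
my $level4  = Unicode::Collate->new();   # shifted, all four levels

my $eq_shifted = $shifted->eq("c-a-m-e-l", "camel");  # true: hyphens ignorable
my $eq_non_ign = $non_ign->eq("c-a-m-e-l", "camel");  # false: hyphens have weight
my $eq_level4  = $level4->eq("c-a-m-e-l", "camel");   # false: differ at level 4
```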
534
535 Methods for Collation
536 "@sorted = $Collator->sort(@not_sorted)"
537 Sorts a list of strings.
538
539 "$result = $Collator->cmp($a, $b)"
Returns 1 (when $a is greater than $b) or 0 (when $a is equal to
$b) or -1 (when $a is less than $b).
542
543 "$result = $Collator->eq($a, $b)"
544 "$result = $Collator->ne($a, $b)"
545 "$result = $Collator->lt($a, $b)"
546 "$result = $Collator->le($a, $b)"
547 "$result = $Collator->gt($a, $b)"
548 "$result = $Collator->ge($a, $b)"
They work like the built-in string operators of the same names.

eq : whether $a is equal to $b.
ne : whether $a is not equal to $b.
lt : whether $a is less than $b.
le : whether $a is less than $b or equal to $b.
gt : whether $a is greater than $b.
ge : whether $a is greater than $b or equal to $b.
557
558 "$sortKey = $Collator->getSortKey($string)"
559 -- see 4.3 Form Sort Key, UTS #10.
560
561 Returns a sort key.
562
Comparing two sort keys with a binary comparison gives the same
result as comparing the original strings using UCA.
565
566 $Collator->getSortKey($a) cmp $Collator->getSortKey($b)
567
568 is equivalent to
569
570 $Collator->cmp($a, $b)
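One practical use of this equivalence is to compute each sort key once and then sort with the plain "cmp" operator. A minimal sketch, assuming the default table (the word list and variable names are illustrative):

```perl
use strict;
use warnings;
use Unicode::Collate;

my $c     = Unicode::Collate->new();
my @words = qw(pear Apple banana apple);

# Compute every sort key once, then compare the keys as binary strings.
my %key    = map { $_ => $c->getSortKey($_) } @words;
my @by_key = sort { $key{$a} cmp $key{$b} } @words;

# Same order as letting the collator sort directly.
my @direct = $c->sort(@words);   # apple Apple banana pear
```

Caching keys this way may help when the same strings are compared many times, for example when the keys are stored alongside the data.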
571
572 "$sortKeyForm = $Collator->viewSortKey($string)"
573 Converts a sorting key into its representation form. If
574 "UCA_Version" is 8, the output is slightly different.
575
576 use Unicode::Collate;
577 my $c = Unicode::Collate->new();
578 print $c->viewSortKey("Perl"),"\n";
579
580 # output:
581 # [0B67 0A65 0B7F 0B03 | 0020 0020 0020 0020 | 0008 0002 0002 0002 | FFFF FFFF FFFF FFFF]
582 # Level 1 Level 2 Level 3 Level 4
583
584 Methods for Searching
The "match", "gmatch", "subst", "gsubst" methods work like "m//",
"m//g", "s///", "s///g", respectively, except that they do not
interpret any pattern; they only look for a literal substring.

DISCLAIMER: If the "preprocess" or "normalization" parameter is true for
$Collator, calling these methods ("index", "match", "gmatch", "subst",
"gsubst") will croak, as the position and the length might differ from
those in the specified string.

"rearrange" and "hangul_terminator" parameters are ignored.
"katakana_before_hiragana" and "upper_before_lower" don't affect
matching and searching, as it doesn't matter which is greater or lesser.
597
598 "$position = $Collator->index($string, $substring[, $position])"
599 "($position, $length) = $Collator->index($string, $substring[,
600 $position])"
601 If $substring matches a part of $string, returns the position of
602 the first occurrence of the matching part in scalar context; in
603 list context, returns a two-element list of the position and the
604 length of the matching part.
605
606 If $substring does not match any part of $string, returns "-1" in
607 scalar context and an empty list in list context.
608
609 e.g. you say
610
611 my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
612 # (normalization => undef) is REQUIRED.
my $str = "Ich mu\x{DF} studieren Perl.";  # U+00DF is LATIN SMALL LETTER SHARP S
my $sub = "M\x{DC}SS";                     # U+00DC is U WITH DIAERESIS
my $match;
if (my($pos,$len) = $Collator->index($str, $sub)) {
    $match = substr($str, $pos, $len);
}

and get "mu\x{DF}" in $match since "mu\x{DF}" is primary equal to "M\x{DC}SS".
621
622 "$match_ref = $Collator->match($string, $substring)"
623 "($match) = $Collator->match($string, $substring)"
624 If $substring matches a part of $string, in scalar context, returns
625 a reference to the first occurrence of the matching part
626 ($match_ref is always true if matches, since every reference is
627 true); in list context, returns the first occurrence of the
628 matching part.
629
630 If $substring does not match any part of $string, returns "undef"
631 in scalar context and an empty list in list context.
632
633 e.g.
634
635 if ($match_ref = $Collator->match($str, $sub)) { # scalar context
636 print "matches [$$match_ref].\n";
637 } else {
638 print "doesn't match.\n";
639 }
640
641 or
642
643 if (($match) = $Collator->match($str, $sub)) { # list context
644 print "matches [$match].\n";
645 } else {
646 print "doesn't match.\n";
647 }
648
649 "@match = $Collator->gmatch($string, $substring)"
650 If $substring matches a part of $string, returns all the matching
651 parts (or matching count in scalar context).
652
653 If $substring does not match any part of $string, returns an empty
654 list.
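A short sketch of "match" and "gmatch" at level 1 follows. It assumes the default table; as noted in the disclaimer above, "normalization => undef" is needed for the searching methods.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $c = Unicode::Collate->new(normalization => undef, level => 1);

my $text = "Camel donkey CAMEL zebra camel horse";

my @hits    = $c->gmatch($text, "camel");  # all matching parts
my $count   = $c->gmatch($text, "camel");  # number of matches in scalar context
my ($first) = $c->match($text, "camel");   # first matching part
```

Here @hits should hold the three spellings as they appear in $text ("Camel", "CAMEL", "camel"), since case is a level-3 difference and is ignored at level 1.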
655
656 "$count = $Collator->subst($string, $substring, $replacement)"
657 If $substring matches a part of $string, the first occurrence of
658 the matching part is replaced by $replacement ($string is modified)
and $count (always equal to 1) is returned.
660
661 $replacement can be a "CODEREF", taking the matching part as an
662 argument, and returning a string to replace the matching part (a
663 bit similar to "s/(..)/$coderef->($1)/e").
664
665 "$count = $Collator->gsubst($string, $substring, $replacement)"
666 If $substring matches a part of $string, all the occurrences of the
667 matching part are replaced by $replacement ($string is modified)
668 and $count is returned.
669
670 $replacement can be a "CODEREF", taking the matching part as an
671 argument, and returning a string to replace the matching part (a
672 bit similar to "s/(..)/$coderef->($1)/eg").
673
674 e.g.
675
676 my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
677 # (normalization => undef) is REQUIRED.
678 my $str = "Camel donkey zebra came\x{301}l CAMEL horse cam\0e\0l...";
679 $Collator->gsubst($str, "camel", sub { "<b>$_[0]</b>" });
680
681 # now $str is "<b>Camel</b> donkey zebra <b>came\x{301}l</b> <b>CAMEL</b> horse <b>cam\0e\0l</b>...";
682 # i.e., all the camels are made bold-faced.
683
684 Examples: levels and ignore_level2 - what does camel match?
685 ---------------------------------------------------------------------------
686 level ignore_level2 | camel Camel came\x{301}l c-a-m-e-l cam\0e\0l
687 -----------------------|---------------------------------------------------
688 1 false | yes yes yes yes yes
689 2 false | yes yes no yes yes
690 3 false | yes no no yes yes
691 4 false | yes no no no yes
692 -----------------------|---------------------------------------------------
693 1 true | yes yes yes yes yes
694 2 true | yes yes yes yes yes
695 3 true | yes no yes yes yes
696 4 true | yes no yes no yes
697 ---------------------------------------------------------------------------
698 note: if variable => non-ignorable, camel doesn't match c-a-m-e-l
699 at any level.
700
701 Other Methods
702 "%old_tailoring = $Collator->change(%new_tailoring)"
703 "$modified_collator = $Collator->change(%new_tailoring)"
704 Changes the value of specified keys and returns the changed part.
705
706 $Collator = Unicode::Collate->new(level => 4);
707
708 $Collator->eq("perl", "PERL"); # false
709
710 %old = $Collator->change(level => 2); # returns (level => 4).
711
712 $Collator->eq("perl", "PERL"); # true
713
714 $Collator->change(%old); # returns (level => 2).
715
716 $Collator->eq("perl", "PERL"); # false
717
718 Not all "(key,value)"s are allowed to be changed. See also
719 @Unicode::Collate::ChangeOK and @Unicode::Collate::ChangeNG.
720
721 In the scalar context, returns the modified collator (but it is not
722 a clone from the original).
723
724 $Collator->change(level => 2)->eq("perl", "PERL"); # true
725
726 $Collator->eq("perl", "PERL"); # true; now max level is 2nd.
727
728 $Collator->change(level => 4)->eq("perl", "PERL"); # false
729
730 "$version = $Collator->version()"
731 Returns the version number (a string) of the Unicode Standard which
732 the "table" file used by the collator object is based on. If the
733 table does not include a version line (starting with @version),
734 returns "unknown".
735
736 "UCA_Version()"
737 Returns the revision number of UTS #10 this module consults, that
738 should correspond with the DUCET incorporated.
739
740 "Base_Unicode_Version()"
Returns the version number of the Unicode Standard on which the
UTS #10 revision consulted by this module is based; it should
correspond with the DUCET incorporated.
743
EXPORT
745 No method will be exported.
746
INSTALL
748 Though this module can be used without any "table" file, to use this
749 module easily, it is recommended to install a table file in the UCA
750 format, by copying it under the directory <a place in
751 @INC>/Unicode/Collate.
752
753 The most preferable one is "The Default Unicode Collation Element
754 Table" (aka DUCET), available from the Unicode Consortium's website:
755
756 http://www.unicode.org/Public/UCA/
757
758 http://www.unicode.org/Public/UCA/latest/allkeys.txt (latest version)
759
760 If DUCET is not installed, it is recommended to copy the file from
761 http://www.unicode.org/Public/UCA/latest/allkeys.txt to <a place in
762 @INC>/Unicode/Collate/allkeys.txt manually.
763
CAVEATS
765 Normalization
766 Use of the "normalization" parameter requires the
767 Unicode::Normalize module (see Unicode::Normalize).
768
If you do not need it (say, when you do not need to handle any
combining characters), assign "normalization => undef" explicitly.
771
772 -- see 6.5 Avoiding Normalization, UTS #10.
773
774 Conformance Test
775 The Conformance Test for the UCA is available under
776 <http://www.unicode.org/Public/UCA/>.
777
778 For CollationTest_SHIFTED.txt, a collator via
779 "Unicode::Collate->new( )" should be used; for
780 CollationTest_NON_IGNORABLE.txt, a collator via
781 "Unicode::Collate->new(variable => "non-ignorable", level => 3)".
782
783 Unicode::Normalize is required to try The Conformance Test.
784
AUTHOR, COPYRIGHT AND LICENSE
786 The Unicode::Collate module for perl was written by SADAHIRO Tomoyuki,
787 <SADAHIRO@cpan.org>. This module is Copyright(C) 2001-2012, SADAHIRO
788 Tomoyuki. Japan. All rights reserved.
789
790 This module is free software; you can redistribute it and/or modify it
791 under the same terms as Perl itself.
792
793 The file Unicode/Collate/allkeys.txt was copied verbatim from
794 <http://www.unicode.org/Public/UCA/6.1.0/allkeys.txt>. For this file,
795 Copyright (c) 2001-2011 Unicode, Inc. Distributed under the Terms of
796 Use in <http://www.unicode.org/copyright.html>.
797
SEE ALSO
799 Unicode Collation Algorithm - UTS #10
800 <http://www.unicode.org/reports/tr10/>
801
802 The Default Unicode Collation Element Table (DUCET)
803 <http://www.unicode.org/Public/UCA/latest/allkeys.txt>
804
805 The conformance test for the UCA
806 <http://www.unicode.org/Public/UCA/latest/CollationTest.html>
807
808 <http://www.unicode.org/Public/UCA/latest/CollationTest.zip>
809
810 Hangul Syllable Type
811 <http://www.unicode.org/Public/UNIDATA/HangulSyllableType.txt>
812
813 Unicode Normalization Forms - UAX #15
814 <http://www.unicode.org/reports/tr15/>
815
816 Unicode Locale Data Markup Language (LDML) - UTS #35
817 <http://www.unicode.org/reports/tr35/>
818

perl v5.16.3                      2013-03-04             Unicode::Collate(3pm)