Unicode::Collate(3pm)  Perl Programmers Reference Guide  Unicode::Collate(3pm)

NAME

       Unicode::Collate - Unicode Collation Algorithm
7

SYNOPSIS

         use Unicode::Collate;

         #construct
         $Collator = Unicode::Collate->new(%tailoring);

         #sort
         @sorted = $Collator->sort(@not_sorted);

         #compare
         $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

       Note: Strings in @not_sorted, $a and $b are interpreted according to
       Perl's Unicode support. See perlunicode, perluniintro, perlunitut,
       perlunifaq, utf8.  If your strings are not yet Perl character
       strings, decode them first or supply a "preprocess" coderef.
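
       As a minimal sketch of the "decode them first" route, the following
       assumes the input arrives as UTF-8 encoded octets (the sample words
       are made up for illustration) and uses the core Encode module to
       turn them into character strings before sorting:

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Collate;

# Assumed input: UTF-8 encoded octets, e.g. as read from a file or socket.
my @octets = ("zebra", "\xC3\xA9clair", "apple");

# Decode the octets into Perl character strings before collating.
my @chars = map { decode('UTF-8', $_) } @octets;

my $collator = Unicode::Collate->new;
my @sorted   = $collator->sort(@chars);

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @sorted;   # apple, then the e-acute word, then zebra
```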
24

DESCRIPTION

26       This module is an implementation of Unicode Technical Standard #10
27       (a.k.a. UTS #10) - Unicode Collation Algorithm (a.k.a. UCA).
28
   Constructor and Tailoring
       The "new" method returns a collator object. If new() is called with no
       parameters, the collator performs the default collation.

          $Collator = Unicode::Collate->new(
             UCA_Version => $UCA_Version,
             alternate => $alternate, # alias for 'variable'
             backwards => $levelNumber, # or \@levelNumbers
             entry => $element,
             hangul_terminator => $term_primary_weight,
             ignoreName => qr/$ignoreName/,
             ignoreChar => qr/$ignoreChar/,
             ignore_level2 => $bool,
             katakana_before_hiragana => $bool,
             level => $collationLevel,
             normalization  => $normalization_form,
             overrideCJK => \&overrideCJK,
             overrideHangul => \&overrideHangul,
             preprocess => \&preprocess,
             rearrange => \@charList,
             rewrite => \&rewrite,
             suppress => \@charList,
             table => $filename,
             undefName => qr/$undefName/,
             undefChar => qr/$undefChar/,
             upper_before_lower => $bool,
             variable => $variable,
          );
57
58       UCA_Version
59           If the revision (previously "tracking version") number of UCA is
60           given, behavior of that revision is emulated on collating.  If
61           omitted, the return value of "UCA_Version()" is used.
62
63           The following revisions are supported.  The default is 24.
64
65                UCA       Unicode Standard         DUCET (@version)
66              -------------------------------------------------------
67                 8              3.1                3.0.1 (3.0.1d9)
68                 9     3.1 with Corrigendum 3      3.1.1 (3.1.1)
69                11              4.0                4.0.0 (4.0.0)
70                14             4.1.0               4.1.0 (4.1.0)
71                16              5.0                5.0.0 (5.0.0)
72                18             5.1.0               5.1.0 (5.1.0)
73                20             5.2.0               5.2.0 (5.2.0)
74                22             6.0.0               6.0.0 (6.0.0)
75                24             6.1.0               6.1.0 (6.1.0)
76
77           * Noncharacters (e.g. U+FFFF) are not ignored, and can be
78           overridden since "UCA_Version" 22.
79
80           * Fully ignorable characters were ignored, and would not interrupt
81           contractions with "UCA_Version" 9 and 11.
82
83           * Treatment of ignorables after variables and some behaviors were
84           changed at "UCA_Version" 9.
85
86           * Characters regarded as CJK unified ideographs (cf. "overrideCJK")
87           depend on "UCA_Version".
88
           * Many hangul jamo are assigned at "UCA_Version" 20, which
           affects "hangul_terminator".
91
92       alternate
93           -- see 3.2.2 Alternate Weighting, version 8 of UTS #10
94
95           For backward compatibility, "alternate" (old name) can be used as
96           an alias for "variable".
97
98       backwards
99           -- see 3.1.2 French Accents, UTS #10.
100
101                backwards => $levelNumber or \@levelNumbers
102
           Weights at the given level(s) are compared in reverse order; e.g.
           level 2 (diacritic ordering) in French.  If omitted (or if
           $levelNumber is "undef" or "\@levelNumbers" is "[]"), all levels
           are compared forwards.
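
           The classic French example can be sketched as follows.  The four
           sample words are illustrative; they differ only in accents, so
           the order is decided at level 2, scanning from the start of the
           word by default and from the end when "backwards => 2" is set:

```perl
use strict;
use warnings;
use Unicode::Collate;

# cote, cot + e-acute, c + o-circumflex + te, and the form with both accents
my @words = ("cot\x{E9}", "c\x{F4}te", "cote", "c\x{F4}t\x{E9}");

my $forward = Unicode::Collate->new;                  # all levels forwards
my $french  = Unicode::Collate->new(backwards => 2);  # level 2 backwards

my @fwd = $forward->sort(@words);
my @fr  = $french->sort(@words);

binmode(STDOUT, ':encoding(UTF-8)');
print "forwards:  @fwd\n";   # plain, final accent, first accent, both
print "backwards: @fr\n";    # plain, first accent, final accent, both
```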
106
107       entry
108           -- see 3.1 Linguistic Features; 3.2.1 File Format, UTS #10.
109
110           If the same character (or a sequence of characters) exists in the
111           collation element table through "table", mapping to collation
112           elements is overridden.  If it does not exist, the mapping is
113           defined additionally.
114
               entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
           0063 0068 ; [.0E6A.0020.0002.0063] # ch
           0043 0068 ; [.0E6A.0020.0007.0043] # Ch
           0043 0048 ; [.0E6A.0020.0008.0043] # CH
           006C 006C ; [.0F4C.0020.0002.006C] # ll
           004C 006C ; [.0F4C.0020.0007.004C] # Ll
           004C 004C ; [.0F4C.0020.0008.004C] # LL
           00F1      ; [.0F7B.0020.0002.00F1] # n-tilde
           006E 0303 ; [.0F7B.0020.0002.00F1] # n-tilde
           00D1      ; [.0F7B.0020.0008.00D1] # N-tilde
           004E 0303 ; [.0F7B.0020.0008.00D1] # N-tilde
           ENTRY

               entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
           00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae ligature as <a><e>
           00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE ligature as <A><E>
           ENTRY
132
           NOTE: The code point in the UCA file format (before ';') must be a
           Unicode code point (written in hexadecimal), not a native code
           point.  So 0063 always denotes "U+0063", not the character
           "\x63".
137
138           Weighting may vary depending on collation element table.  So ensure
139           the weights defined in "entry" will be consistent with those in the
140           collation element table loaded via "table".
141
           In DUCET v4.0.0, the primary weight of "C" is "0E60" and that of
           "D" is "0E6D". So setting the primary weight of "CH" to "0E6A" (a
           value between "0E60" and "0E6D") yields the ordering "C < CH < D".
           Strictly speaking, DUCET already has some characters between "C"
           and "D": "small capital C" ("U+1D04") with primary weight "0E64",
           "c-hook/C-hook" ("U+0188/U+0187") with "0E65", and "c-curl"
           ("U+0255") with "0E69".  The primary weight "0E6A" therefore
           places "CH" between "c-curl" and "D".
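
           Because DUCET weights change between versions, the sketch below
           avoids DUCET and builds a tiny table from scratch ("table =>
           undef").  The weights 0101..0107 and the word list are invented
           for illustration; the only point is that "ch" gets a primary
           weight between those of "c" and "d":

```perl
use strict;
use warnings;
use Unicode::Collate;

# Made-up single-letter weights; not taken from any DUCET version.
my $letters = <<'ENTRY';
0061 ; [.0101.0020.0002.0061] # a
0063 ; [.0103.0020.0002.0063] # c
0064 ; [.0105.0020.0002.0064] # d
0068 ; [.0106.0020.0002.0068] # h
006F ; [.0107.0020.0002.006F] # o
ENTRY

# Contraction: "ch" as one element weighted between "c" and "d".
my $ch = "0063 0068 ; [.0104.0020.0002.0063] # ch\n";

my $plain = Unicode::Collate->new(table => undef, entry => $letters);
my $trad  = Unicode::Collate->new(table => undef, entry => $letters . $ch);

my @words = qw(dad chad cod cad);
my @plain = $plain->sort(@words);
my @trad  = $trad->sort(@words);

print "letters only:  @plain\n";   # cad chad cod dad
print "with ch entry: @trad\n";    # cad cod chad dad
```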
150
151       hangul_terminator
152           -- see 7.1.4 Trailing Weights, UTS #10.
153
154           If a true value is given (non-zero but should be positive), it will
155           be added as a terminator primary weight to the end of every
156           standard Hangul syllable. Secondary and any higher weights for
157           terminator are set to zero.  If the value is false or
158           "hangul_terminator" key does not exist, insertion of terminator
159           weights will not be performed.
160
161           Boundaries of Hangul syllables are determined according to
162           conjoining Jamo behavior in the Unicode Standard and
163           HangulSyllableType.txt.
164
165           Implementation Note: (1) For expansion mapping (Unicode character
166           mapped to a sequence of collation elements), a terminator will not
167           be added between collation elements, even if Hangul syllable
168           boundary exists there.  Addition of terminator is restricted to the
169           next position to the last collation element.
170
171           (2) Non-conjoining Hangul letters (Compatibility Jamo, halfwidth
172           Jamo, and enclosed letters) are not automatically terminated with a
173           terminator primary weight.  These characters may need terminator
174           included in a collation element table beforehand.
175
176       ignoreChar
177       ignoreName
178           -- see 3.2.2 Variable Weighting, UTS #10.
179
           Makes the entry in the table completely ignorable; i.e. as if the
           weights were zero at all levels.
182
183           Through "ignoreChar", any character matching "qr/$ignoreChar/" will
184           be ignored. Through "ignoreName", any character whose name (given
185           in the "table" file as a comment) matches "qr/$ignoreName/" will be
186           ignored.
187
188           E.g. when 'a' and 'e' are ignorable, 'element' is equal to 'lament'
189           (or 'lmnt').
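
           A minimal sketch of the 'element'/'lament' case.  The anchored
           regex is an assumption made here so that only the single-letter
           entries for "a" and "e" are affected, not longer entries that
           merely contain those letters:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $default  = Unicode::Collate->new;
my $no_vowel = Unicode::Collate->new(ignoreChar => qr/^[aeAE]$/);

# With a and e ignorable, both words reduce to l-m-n-t.
my $same_default  = $default->eq('element', 'lament')  ? 1 : 0;
my $same_ignoring = $no_vowel->eq('element', 'lament') ? 1 : 0;
my $same_bare     = $no_vowel->eq('element', 'lmnt')   ? 1 : 0;

print "default collator: $same_default\n";
print "a/e ignored:      $same_ignoring\n";
print "against 'lmnt':   $same_bare\n";
```

           Note that using "ignoreChar" makes the module read the table file
           at construction time instead of using the compiled DUCET, so
           building such a collator is slower.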
190
191       ignore_level2
192           -- see 5.1 Parametric Tailoring, UTS #10.
193
194           By default, case-sensitive comparison (that is level 3 difference)
195           won't ignore accents (that is level 2 difference).
196
197           If the parameter is made true, accents (and other primary ignorable
198           characters) are ignored, even though cases are taken into account.
199
200           NOTE: "level" should be 3 or greater.
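
           As an illustration (the sample strings are arbitrary), the sketch
           below compares a level-3 collator with and without
           "ignore_level2": accents stop mattering while case still does.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $plain  = "resume";
my $cased  = "Resume";
my $accent = "r\x{E9}sum\x{E9}";   # U+00E9 is e with acute

my $strict  = Unicode::Collate->new(level => 3);
my $relaxed = Unicode::Collate->new(level => 3, ignore_level2 => 1);

my %result = (
    strict_accent  => $strict->eq($plain, $accent)  ? 1 : 0,  # accents differ
    relaxed_accent => $relaxed->eq($plain, $accent) ? 1 : 0,  # accents ignored
    relaxed_case   => $relaxed->eq($plain, $cased)  ? 1 : 0,  # case still counts
);

print "$_: $result{$_}\n" for sort keys %result;
```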
201
202       katakana_before_hiragana
203           -- see 7.3.1 Tertiary Weight Table, UTS #10.
204
205           By default, hiragana is before katakana.  If the parameter is made
206           true, this is reversed.
207
           NOTE: This parameter simply assumes that any
           hiragana/katakana distinctions occur at level 3, and that their
           weights at level 3 are the same as those given in 7.3.1, UTS
           #10.  If you define collation elements that violate this
           requirement, this parameter will not work correctly.
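
           A small sketch with one hiragana and one katakana letter (both
           "A"; the choice of letters is arbitrary).  They are equal at
           levels 1 and 2, so only the level-3 kana distinction decides the
           order:

```perl
use strict;
use warnings;
use Unicode::Collate;

# U+30A2 KATAKANA LETTER A, U+3042 HIRAGANA LETTER A
my @kana = ("\x{30A2}", "\x{3042}");

my $default  = Unicode::Collate->new;
my $kata_1st = Unicode::Collate->new(katakana_before_hiragana => 1);

# Show the results as code points to keep the output ASCII-only.
my $d = join ' ', map { sprintf 'U+%04X', ord } $default->sort(@kana);
my $k = join ' ', map { sprintf 'U+%04X', ord } $kata_1st->sort(@kana);

print "default:                  $d\n";   # hiragana first
print "katakana_before_hiragana: $k\n";   # katakana first
```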
213
214       level
215           -- see 4.3 Form Sort Key, UTS #10.
216
217           Set the maximum level.  Any higher levels than the specified one
218           are ignored.
219
220             Level 1: alphabetic ordering
221             Level 2: diacritic ordering
222             Level 3: case ordering
223             Level 4: tie-breaking (e.g. in the case when variable is 'shifted')
224
225             ex.level => 2,
226
227           If omitted, the maximum is the 4th.
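
           The effect of the maximum level can be sketched with three
           collators and a few arbitrary sample strings:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $plain  = "resume";
my $cased  = "Resume";
my $accent = "r\x{E9}sum\x{E9}";   # U+00E9 is e with acute

my $l1 = Unicode::Collate->new(level => 1);   # letters only
my $l2 = Unicode::Collate->new(level => 2);   # letters and accents
my $l3 = Unicode::Collate->new(level => 3);   # letters, accents and case

my @report = (
    [ 'level 1, accent' => $l1->eq($plain, $accent) ],  # equal
    [ 'level 2, accent' => $l2->eq($plain, $accent) ],  # different
    [ 'level 2, case'   => $l2->eq($plain, $cased)  ],  # equal
    [ 'level 3, case'   => $l3->eq($plain, $cased)  ],  # different
);

printf "%-16s %s\n", $_->[0], ($_->[1] ? 'equal' : 'different') for @report;
```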
228
229       normalization
230           -- see 4.1 Normalize, UTS #10.
231
232           If specified, strings are normalized before preparation of sort
233           keys (the normalization is executed after preprocess).
234
235           A form name "Unicode::Normalize::normalize()" accepts will be
236           applied as $normalization_form.  Acceptable names include 'NFD',
237           'NFC', 'NFKD', and 'NFKC'.  See "Unicode::Normalize::normalize()"
238           for detail.  If omitted, 'NFD' is used.
239
240           "normalization" is performed after "preprocess" (if defined).
241
242           Furthermore, special values, "undef" and "prenormalized", can be
243           used, though they are not concerned with
244           "Unicode::Normalize::normalize()".
245
           If "undef" (not the string "undef") is passed explicitly as the
           value for this key, no normalization is carried out (this may make
           tailoring easier if normalization is not desired). Under
           "(normalization => undef)", only contiguous contractions are
           resolved; e.g. even if "A-ring" (and "A-ring-cedilla") is ordered
           after "Z", "A-cedilla-ring" would be primary equal to "A".  In this
           respect, "(normalization => undef, preprocess => sub { NFD(shift) })"
           is not equivalent to "(normalization => 'NFD')".
254
           In the case of "(normalization => "prenormalized")", no
           normalization is performed, but discontiguous contractions with
           combining characters are resolved.  Therefore "(normalization =>
           'prenormalized', preprocess => sub { NFD(shift) })" is equivalent
           to "(normalization => 'NFD')".  If source strings are already
           normalized, "(normalization => 'prenormalized')" may save the
           time spent on normalization.
262
           Except under "(normalization => undef)", Unicode::Normalize is
           required (see also CAVEAT).
265
266       overrideCJK
267           -- see 7.1 Derived Collation Elements, UTS #10.
268
           By default, CJK unified ideographs are ordered in Unicode code
           point order, but those in the CJK Unified Ideographs block sort
           before those in the CJK Unified Ideographs Extension A etc.
272
273               In the CJK Unified Ideographs block:
274               U+4E00..U+9FA5 if UCA_Version is 8, 9 or 11.
275               U+4E00..U+9FBB if UCA_Version is 14 or 16.
276               U+4E00..U+9FC3 if UCA_Version is 18.
277               U+4E00..U+9FCB if UCA_Version is 20 or 22.
278               U+4E00..U+9FCC if UCA_Version is 24.
279
280               In the CJK Unified Ideographs Extension blocks:
281               Ext.A (U+3400..U+4DB5) and Ext.B (U+20000..U+2A6D6) in any UCA_Version.
282               Ext.C (U+2A700..U+2B734) if UCA_Version is 20 or greater.
283               Ext.D (U+2B740..U+2B81D) if UCA_Version is 22 or greater.
284
285           Through "overrideCJK", ordering of CJK unified ideographs
286           (including extensions) can be overridden.
287
288           ex. CJK unified ideographs in the JIS code point order.
289
             overrideCJK => sub {
                 my $u = shift;             # get a Unicode codepoint
                 my $b = pack('n', $u);     # to UTF-16BE
                 my $s = your_unicode_to_sjis_converter($b); # convert
                 my $n = unpack('n', $s);   # convert sjis to short
                 [ $n, 0x20, 0x2, $u ];     # return the collation element
             },
297
298           The return value may be an arrayref of 1st to 4th weights as shown
299           above. The return value may be an integer as the primary weight as
300           shown below.  If "undef" is returned, the default derived collation
301           element will be used.
302
             overrideCJK => sub {
                 my $u = shift;             # get a Unicode codepoint
                 my $b = pack('n', $u);     # to UTF-16BE
                 my $s = your_unicode_to_sjis_converter($b); # convert
                 my $n = unpack('n', $s);   # convert sjis to short
                 return $n;                 # return the primary weight
             },
310
311           The return value may be a list containing zero or more of an
312           arrayref, an integer, or "undef".
313
314           ex. ignores all CJK unified ideographs.
315
316             overrideCJK => sub {()}, # CODEREF returning empty list
317
318              # where ->eq("Pe\x{4E00}rl", "Perl") is true
319              # as U+4E00 is a CJK unified ideograph and to be ignorable.
320
321           If "undef" is passed explicitly as the value for this key, weights
322           for CJK unified ideographs are treated as undefined.  But
323           assignment of weight for CJK unified ideographs in "table" or
324           "entry" is still valid.
325
326           Note: In addition to them, 12 CJK compatibility ideographs
327           ("U+FA0E", "U+FA0F", "U+FA11", "U+FA13", "U+FA14", "U+FA1F",
328           "U+FA21", "U+FA23", "U+FA24", "U+FA27", "U+FA28", "U+FA29") are
329           also treated as CJK unified ideographs. But they can't be
330           overridden via "overrideCJK" when you use DUCET, as the table
331           includes weights for them. "table" or "entry" has priority over
332           "overrideCJK".
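
           The JIS examples above depend on a converter that is not part of
           this module.  As a self-contained sketch, the coderef below
           instead reverses the code point order; the formula "0xFFFF - $u"
           is merely an illustrative way to get a 16-bit primary weight
           that decreases as the code point increases, and is only
           meaningful for ideographs below U+10000:

```perl
use strict;
use warnings;
use Unicode::Collate;

# Three ideographs from the CJK Unified Ideographs block.
my @han = ("\x{4E00}", "\x{4E8C}", "\x{4E09}");

my $default  = Unicode::Collate->new;
my $reversed = Unicode::Collate->new(
    overrideCJK => sub {
        my $u = shift;                  # Unicode code point
        my $w = 0xFFFF - $u;            # illustrative reversed primary weight
        return [ $w, 0x20, 0x2, $u ];   # 1st to 4th weights
    },
);

# Show the results as code points to keep the output ASCII-only.
my $d = join ' ', map { sprintf 'U+%04X', ord } $default->sort(@han);
my $r = join ' ', map { sprintf 'U+%04X', ord } $reversed->sort(@han);

print "default:  $d\n";   # ascending code point order
print "reversed: $r\n";   # descending code point order
```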
333
334       overrideHangul
335           -- see 7.1 Derived Collation Elements, UTS #10.
336
337           By default, Hangul syllables are decomposed into Hangul Jamo, even
338           if "(normalization => undef)".  But the mapping of Hangul syllables
339           may be overridden.
340
341           This parameter works like "overrideCJK", so see there for examples.
342
           If you want to override the mapping of Hangul syllables, NFD and
           NFKD are not appropriate, since they decompose Hangul syllables
           before overriding. FCD may also decompose Hangul syllables in
           some cases.
347
348           If "undef" is passed explicitly as the value for this key, weight
349           for Hangul syllables is treated as undefined without decomposition
350           into Hangul Jamo.  But definition of weight for Hangul syllables in
351           "table" or "entry" is still valid.
352
353       preprocess
354           -- see 5.1 Preprocessing, UTS #10.
355
356           If specified, the coderef is used to preprocess each string before
357           the formation of sort keys.
358
359           ex. dropping English articles, such as "a" or "the".  Then, "the
360           pen" is before "a pencil".
361
                preprocess => sub {
                      my $str = shift;
                      $str =~ s/\b(?:an?|the)\s+//gi;
                      return $str;
                   },
367
368           "preprocess" is performed before "normalization" (if defined).
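
           Put together as a runnable sketch (the titles are invented for
           illustration), the article-dropping coderef sorts titles by
           their first significant word while leaving the original strings
           untouched:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new(
    preprocess => sub {
        my $str = shift;
        $str =~ s/\b(?:an?|the)\s+//gi;   # drop "a", "an", "the"
        return $str;
    },
);

my @titles = ("The Pen", "A Pencil", "An Apple");
my @sorted = $collator->sort(@titles);

# Sorted as if the titles were "Apple", "Pen", "Pencil".
print "$_\n" for @sorted;
```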
369
370           ex. decoding strings in a legacy encoding such as shift-jis:
371
372               $sjis_collator = Unicode::Collate->new(
373                   preprocess => \&your_shiftjis_to_unicode_decoder,
374               );
375               @result = $sjis_collator->sort(@shiftjis_strings);
376
377           Note: Strings returned from the coderef will be interpreted
378           according to Perl's Unicode support. See perlunicode, perluniintro,
379           perlunitut, perlunifaq, utf8.
380
381       rearrange
382           -- see 3.1.3 Rearrangement, UTS #10.
383
           Characters that are not coded in logical order and are to be
           rearranged.  If "UCA_Version" is 11 or less, the default is:
387
388               rearrange => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4 ],
389
390           If you want to disallow any rearrangement, pass "undef" or "[]" (a
391           reference to empty list) as the value for this key.
392
393           If "UCA_Version" is equal to or greater than 14, default is "[]"
394           (i.e. no rearrangement).
395
           According to version 9 of UCA, this parameter shall not be
           used; however, no warning is issued at present.
398
399       rewrite
400           If specified, the coderef is used to rewrite lines in "table" or
401           "entry".  The coderef will get each line, and then should return a
402           rewritten line according to the UCA file format.  If the coderef
403           returns an empty line, the line will be skipped.
404
405           e.g. any primary ignorable characters into tertiary ignorable:
406
               rewrite => sub {
                   my $line = shift;
                   $line =~ s/\[\.0000\..{4}\..{4}\./[.0000.0000.0000./g;
                   return $line;
               },
412
413           This example shows rewriting weights. "rewrite" is allowed to
414           affect code points, weights, and the name.
415
416           NOTE: "table" is available to use another table file; preparing a
417           modified table once would be more efficient than rewriting lines on
418           reading an unmodified table every time.
419
420       suppress
421           -- see suppress contractions in 5.14.11 Special-Purpose Commands,
422           UTS #35 (LDML).
423
424           Contractions beginning with the specified characters are
425           suppressed, even if those contractions are defined in "table".
426
427           An example for Russian and some languages using the Cyrillic
428           script:
429
430               suppress => [0x0400..0x0417, 0x041A..0x0437, 0x043A..0x045F],
431
432           where 0x0400 stands for "U+0400", CYRILLIC CAPITAL LETTER IE WITH
433           GRAVE.
434
           NOTE: Contractions via "entry" are not suppressed.
436
437       table
438           -- see 3.2 Default Unicode Collation Element Table, UTS #10.
439
440           You can use another collation element table if desired.
441
           The table file should be located in the Unicode/Collate directory
           on @INC. For example, if the filename is Foo.txt, the table file is
           searched for as Unicode/Collate/Foo.txt in @INC.

           By default, allkeys.txt (the filename of DUCET) is used.  If you
           prepare your own table file, a name other than allkeys.txt is
           advisable to avoid a namespace conflict.

           NOTE: When XSUB is used, the DUCET is compiled when this module
           is built, which may save time at run time.  Explicitly specifying
           "table => 'allkeys.txt'" (or using another table), or using
           "ignoreChar", "ignoreName", "undefChar", "undefName" or "rewrite"
           will prevent this module from using the compiled DUCET.
455
456           If "undef" is passed explicitly as the value for this key, no file
457           is read (but you can define collation elements via "entry").
458
           A typical way to define a collation element table without any
           table file:

              $onlyABC = Unicode::Collate->new(
                  table => undef,
                  entry => << 'ENTRIES',
           0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
           0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
           0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
           0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
           0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
           0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
           ENTRIES
               );
473
474           If "ignoreName" or "undefName" is used, character names should be
475           specified as a comment (following "#") on each line.
476
477       undefChar
478       undefName
479           -- see 6.3.4 Reducing the Repertoire, UTS #10.
480
           Undefines the collation element as if it were unassigned in the
           "table".  This reduces the size of the table.  If an unassigned
           character appears in the string to be collated, the sort key is
           made from its code point as a single-character collation element,
           which is greater than any assigned collation element (unassigned
           characters are ordered among themselves by code point).  However,
           it may be better to ignore characters that are unfamiliar to you
           and probably never used.
488
489           Through "undefChar", any character matching "qr/$undefChar/" will
490           be undefined. Through "undefName", any character whose name (given
491           in the "table" file as a comment) matches "qr/$undefName/" will be
492           undefined.
493
494           ex. Collation weights for beyond-BMP characters are not stored in
495           object:
496
497               undefChar => qr/[^\0-\x{fffd}]/,
498
499       upper_before_lower
500           -- see 6.6 Case Comparisons, UTS #10.
501
502           By default, lowercase is before uppercase.  If the parameter is
503           made true, this is reversed.
504
           NOTE: This parameter simply assumes that any
           lowercase/uppercase distinctions occur at level 3, and that their
           weights at level 3 are the same as those given in 7.3.1, UTS
           #10.  If you define collation elements that differ from this
           requirement, this parameter will not work correctly.
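
           A brief sketch with three spellings of one word (chosen
           arbitrarily), which differ only in case and are therefore
           ordered at level 3:

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = qw(Perl PERL perl);

my $lower_first = Unicode::Collate->new;                           # default
my $upper_first = Unicode::Collate->new(upper_before_lower => 1);

my @lf = $lower_first->sort(@words);
my @uf = $upper_first->sort(@words);

print "default:            @lf\n";   # perl Perl PERL
print "upper_before_lower: @uf\n";   # PERL Perl perl
```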
510
511       variable
512           -- see 3.2.2 Variable Weighting, UTS #10.
513
514           This key allows for variable weighting of variable collation
515           elements, which are marked with an ASTERISK in the table (NOTE:
516           Many punctuation marks and symbols are variable in allkeys.txt).
517
518              variable => 'blanked', 'non-ignorable', 'shifted', or 'shift-trimmed'.
519
520           These names are case-insensitive.  By default (if specification is
521           omitted), 'shifted' is adopted.
522
523              'Blanked'        Variable elements are made ignorable at levels 1 through 3;
524                               considered at the 4th level.
525
526              'Non-Ignorable'  Variable elements are not reset to ignorable.
527
              'Shifted'        Variable elements are made ignorable at levels 1 through 3;
                               their level 4 weight is replaced by the old level 1 weight.
                               Level 4 weight for Non-Variable elements is 0xFFFF.
531
532              'Shift-Trimmed'  Same as 'shifted', but all FFFF's at the 4th level
533                               are trimmed.
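
           The difference between 'shifted' and 'non-ignorable' can be
           sketched with a hyphenated word (the words are arbitrary; the
           hyphen-minus is a variable element in DUCET):

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = ("co-op", "coat");

my $shifted = Unicode::Collate->new;   # variable => 'shifted' by default
my $non_ign = Unicode::Collate->new(variable => 'non-ignorable');

# Shifted: the hyphen is ignored at levels 1 to 3, so "co-op" sorts like "coop".
my @s = $shifted->sort(@words);
# Non-ignorable: the hyphen keeps its primary weight, which is below any letter.
my @n = $non_ign->sort(@words);

print "shifted:       @s\n";   # coat co-op
print "non-ignorable: @n\n";   # co-op coat

# At level 3 the hyphen makes no difference at all under 'shifted'.
my $eq_shifted = Unicode::Collate->new(level => 3)->eq("co-op", "coop") ? 1 : 0;
my $eq_non_ign = Unicode::Collate->new(level => 3, variable => 'non-ignorable')
                     ->eq("co-op", "coop") ? 1 : 0;
print "level 3 equal (shifted):       $eq_shifted\n";
print "level 3 equal (non-ignorable): $eq_non_ign\n";
```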
534
535   Methods for Collation
536       "@sorted = $Collator->sort(@not_sorted)"
537           Sorts a list of strings.
538
539       "$result = $Collator->cmp($a, $b)"
           Returns 1 (when $a is greater than $b) or 0 (when $a is equal to
           $b) or -1 (when $a is less than $b).
542
543       "$result = $Collator->eq($a, $b)"
544       "$result = $Collator->ne($a, $b)"
545       "$result = $Collator->lt($a, $b)"
546       "$result = $Collator->le($a, $b)"
547       "$result = $Collator->gt($a, $b)"
548       "$result = $Collator->ge($a, $b)"
           They work like the operators of the same name.

              eq : whether $a is equal to $b.
              ne : whether $a is not equal to $b.
              lt : whether $a is less than $b.
              le : whether $a is less than $b or equal to $b.
              gt : whether $a is greater than $b.
              ge : whether $a is greater than $b or equal to $b.
557
558       "$sortKey = $Collator->getSortKey($string)"
559           -- see 4.3 Form Sort Key, UTS #10.
560
561           Returns a sort key.
562
           Comparing sort keys with a binary comparison gives the same
           result as comparing the original strings using UCA.
565
566              $Collator->getSortKey($a) cmp $Collator->getSortKey($b)
567
568                 is equivalent to
569
570              $Collator->cmp($a, $b)
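
           One use of this equivalence, sketched below with arbitrary
           sample words, is to compute each sort key once and then sort
           with Perl's built-in "cmp" on the keys:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new;
my @words    = ("pear", "Apple", "\x{E9}clair", "banana");

# Compute every sort key once, then compare the keys as binary strings.
my %key    = map { $_ => $collator->getSortKey($_) } @words;
my @by_key = sort { $key{$a} cmp $key{$b} } @words;

# The same order as asking the collator to sort directly.
my @direct = $collator->sort(@words);

binmode(STDOUT, ':encoding(UTF-8)');
print "by key: @by_key\n";
print "direct: @direct\n";
```

           Cached keys are mainly useful when the same strings take part
           in many comparisons, or when keys are stored alongside the data.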
571
572       "$sortKeyForm = $Collator->viewSortKey($string)"
573           Converts a sorting key into its representation form.  If
574           "UCA_Version" is 8, the output is slightly different.
575
              use Unicode::Collate;
              my $c = Unicode::Collate->new();
              print $c->viewSortKey("Perl"),"\n";

              # output:
              # [0B67 0A65 0B7F 0B03 | 0020 0020 0020 0020 | 0008 0002 0002 0002 | FFFF FFFF FFFF FFFF]
              #  Level 1               Level 2               Level 3               Level 4
583
584   Methods for Searching
       The "match", "gmatch", "subst", "gsubst" methods work like "m//",
       "m//g", "s///", "s///g", respectively, except that they take a
       literal substring rather than a pattern.

       DISCLAIMER: If the "preprocess" or "normalization" parameter is true
       for $Collator, calling these methods ("index", "match", "gmatch",
       "subst", "gsubst") croaks, as the position and the length might
       differ from those on the specified string.

       The "rearrange" and "hangul_terminator" parameters are ignored.
       "katakana_before_hiragana" and "upper_before_lower" don't affect
       matching and searching, since only equality matters there, not order.
597
598       "$position = $Collator->index($string, $substring[, $position])"
599       "($position, $length) = $Collator->index($string, $substring[,
600       $position])"
601           If $substring matches a part of $string, returns the position of
602           the first occurrence of the matching part in scalar context; in
603           list context, returns a two-element list of the position and the
604           length of the matching part.
605
606           If $substring does not match any part of $string, returns "-1" in
607           scalar context and an empty list in list context.
608
609           e.g. you say
610
611             my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
612                                                # (normalization => undef) is REQUIRED.
613             my $str = "Ich muss studieren Perl.";
614             my $sub = "MUeSS";
615             my $match;
616             if (my($pos,$len) = $Collator->index($str, $sub)) {
617                 $match = substr($str, $pos, $len);
618             }
619
620           and get "muss" in $match since "muss" is primary equal to "MUeSS".
621
622       "$match_ref = $Collator->match($string, $substring)"
623       "($match)   = $Collator->match($string, $substring)"
624           If $substring matches a part of $string, in scalar context, returns
625           a reference to the first occurrence of the matching part
626           ($match_ref is always true if matches, since every reference is
627           true); in list context, returns the first occurrence of the
628           matching part.
629
630           If $substring does not match any part of $string, returns "undef"
631           in scalar context and an empty list in list context.
632
633           e.g.
634
               if ($match_ref = $Collator->match($str, $sub)) { # scalar context
                   print "matches [$$match_ref].\n";
               } else {
                   print "doesn't match.\n";
               }

                or

               if (($match) = $Collator->match($str, $sub)) { # list context
                   print "matches [$match].\n";
               } else {
                   print "doesn't match.\n";
               }
648
649       "@match = $Collator->gmatch($string, $substring)"
650           If $substring matches a part of $string, returns all the matching
651           parts (or matching count in scalar context).
652
653           If $substring does not match any part of $string, returns an empty
654           list.
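
           A short sketch of "gmatch" in list context, using an arbitrary
           sample sentence and a level-1 collator so that case is ignored:

```perl
use strict;
use warnings;
use Unicode::Collate;

# (normalization => undef) is required for the searching methods.
my $collator = Unicode::Collate->new(normalization => undef, level => 1);

my $text  = "Camel, camel and CAMEL";
my @found = $collator->gmatch($text, "camel");

# Each element is the matching part exactly as it appears in $text.
print scalar(@found), " matches: @found\n";

my @none = $collator->gmatch($text, "llama");   # empty list when nothing matches
print "llama matches: ", scalar(@none), "\n";
```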
655
656       "$count = $Collator->subst($string, $substring, $replacement)"
657           If $substring matches a part of $string, the first occurrence of
658           the matching part is replaced by $replacement ($string is modified)
           and $count (always equal to 1) is returned.
660
661           $replacement can be a "CODEREF", taking the matching part as an
662           argument, and returning a string to replace the matching part (a
663           bit similar to "s/(..)/$coderef->($1)/e").
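
           Both forms of $replacement can be sketched as follows (the
           sample sentence and the replacement text are arbitrary).  Only
           the first occurrence is replaced, and the string passed as the
           first argument is modified in place:

```perl
use strict;
use warnings;
use Unicode::Collate;

# (normalization => undef) is required for the searching methods.
my $collator = Unicode::Collate->new(normalization => undef, level => 1);

# Plain string replacement.
my $str   = "Perl is fun; PERL is fast.";
my $count = $collator->subst($str, "perl", "Raku");
print "$count: $str\n";    # 1: Raku is fun; PERL is fast.

# CODEREF replacement: receives the matching part as its argument.
my $str2 = "Perl is fun; PERL is fast.";
$collator->subst($str2, "perl", sub { "[$_[0]]" });
print "$str2\n";           # [Perl] is fun; PERL is fast.
```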
664
665       "$count = $Collator->gsubst($string, $substring, $replacement)"
666           If $substring matches a part of $string, all the occurrences of the
667           matching part are replaced by $replacement ($string is modified)
668           and $count is returned.
669
670           $replacement can be a "CODEREF", taking the matching part as an
671           argument, and returning a string to replace the matching part (a
672           bit similar to "s/(..)/$coderef->($1)/eg").
673
674           e.g.
675
             my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
                                                # (normalization => undef) is REQUIRED.
             my $str = "Camel donkey zebra came\x{301}l CAMEL horse cam\0e\0l...";
             $Collator->gsubst($str, "camel", sub { "<b>$_[0]</b>" });

             # now $str is "<b>Camel</b> donkey zebra <b>came\x{301}l</b> <b>CAMEL</b> horse <b>cam\0e\0l</b>...";
             # i.e., all the camels are made bold-faced.

              Examples: levels and ignore_level2 - what does camel match?
             ---------------------------------------------------------------------------
              level  ignore_level2  |  camel  Camel  came\x{301}l  c-a-m-e-l  cam\0e\0l
             -----------------------|---------------------------------------------------
                1        false      |   yes    yes      yes          yes        yes
                2        false      |   yes    yes      no           yes        yes
                3        false      |   yes    no       no           yes        yes
                4        false      |   yes    no       no           no         yes
             -----------------------|---------------------------------------------------
                1        true       |   yes    yes      yes          yes        yes
                2        true       |   yes    yes      yes          yes        yes
                3        true       |   yes    no       yes          yes        yes
                4        true       |   yes    no       yes          no         yes
             ---------------------------------------------------------------------------
              note: if variable => non-ignorable, camel doesn't match c-a-m-e-l
                    at any level.
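
           One row of the table can be checked with a sketch like the
           following, which takes "level => 3" with a true "ignore_level2".
           The sample string is made up, and the expectation in the comment
           is read off the table rather than derived independently.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new(
    normalization => undef,   # required for gmatch()
    level         => 3,
    ignore_level2 => 1,
);

# made-up sample: three of the spellings from the table, separated by spaces
my $text  = "camel Camel came\x{301}l";
my @match = $Collator->gmatch($text, "camel");

# per the table row (3, true): "camel" and "came\x{301}l" match, "Camel" does not
printf "%d match(es)\n", scalar @match;
```

           Changing "level" or "ignore_level2" in the constructor call is
           enough to try the other rows.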

   Other Methods
       "%old_tailoring = $Collator->change(%new_tailoring)"
       "$modified_collator = $Collator->change(%new_tailoring)"
           Changes the values of the specified keys and returns the changed
           part (the previous values of those keys).

               $Collator = Unicode::Collate->new(level => 4);

               $Collator->eq("perl", "PERL"); # false

               %old = $Collator->change(level => 2); # returns (level => 4).

               $Collator->eq("perl", "PERL"); # true

               $Collator->change(%old); # returns (level => 2).

               $Collator->eq("perl", "PERL"); # false

           Not every "(key,value)" pair is allowed to be changed.  See also
           @Unicode::Collate::ChangeOK and @Unicode::Collate::ChangeNG.

           In scalar context, returns the modified collator itself (it is not
           a clone of the original).

               $Collator->change(level => 2)->eq("perl", "PERL"); # true

               $Collator->eq("perl", "PERL"); # true; now max level is 2nd.

               $Collator->change(level => 4)->eq("perl", "PERL"); # false
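
           A sketch of a temporary change that is undone with the returned
           hash. The last step rests on two assumptions: that "table" is
           among the keys listed in @Unicode::Collate::ChangeNG, and that
           change() raises an exception for such a key. "other.txt" is a
           hypothetical file name.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new(level => 4);

# Loosen the comparison temporarily, keeping the previous setting.
my %old   = $Collator->change(level => 1);
my $loose = $Collator->eq("perl", "PERL");    # compared at level 1

# Restore the saved setting.
$Collator->change(%old);
my $strict = $Collator->eq("perl", "PERL");   # compared at level 4 again

# Assumption: keys in @ChangeNG (such as table) are refused with an exception,
# so the attempt is wrapped in eval. "other.txt" is a hypothetical name.
my $changed = eval { $Collator->change(table => "other.txt"); 1 };

print "level 1: ", ($loose   ? "equal" : "different"), "\n";
print "level 4: ", ($strict  ? "equal" : "different"), "\n";
print "table change ", ($changed ? "accepted" : "refused"), "\n";
```

           Saving the list-context return value avoids having to remember
           which settings were in force before the change.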

       "$version = $Collator->version()"
           Returns the version number (a string) of the Unicode Standard which
           the "table" file used by the collator object is based on.  If the
           table does not include a version line (starting with @version),
           returns "unknown".

       "UCA_Version()"
           Returns the revision number of UTS #10 this module consults, which
           should correspond with the DUCET incorporated.

       "Base_Unicode_Version()"
           Returns the version number of UTS #10 this module consults, which
           should correspond with the DUCET incorporated.
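
           The three calls can be compared side by side as in the sketch
           below. Calling UCA_Version() and Base_Unicode_Version() as fully
           qualified functions is an assumption based on their being listed
           here without a collator object.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new();

# From the table's @version line, or "unknown" if the table has none.
my $table_version = $Collator->version();

# Assumption: both can be called as plain functions of the package.
my $uca_revision = Unicode::Collate::UCA_Version();
my $uca_version  = Unicode::Collate::Base_Unicode_Version();

print "table file: Unicode $table_version\n";
print "UTS #10:    revision $uca_revision, version $uca_version\n";
```

           If a custom "table" is in use, version() describes that table,
           while the other two values describe the module itself.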


EXPORT

       No method will be exported.


INSTALL

       Although this module can be used without any "table" file, it is
       easier to use once a table file in the UCA format has been installed
       by copying it under the directory <a place in @INC>/Unicode/Collate.

       The preferred table is "The Default Unicode Collation Element Table"
       (aka DUCET), available from the Unicode Consortium's website:

          http://www.unicode.org/Public/UCA/

          http://www.unicode.org/Public/UCA/latest/allkeys.txt (latest version)

       If DUCET is not installed, it is recommended to copy the file from
       http://www.unicode.org/Public/UCA/latest/allkeys.txt to <a place in
       @INC>/Unicode/Collate/allkeys.txt manually.
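
       To see whether a table file is already in place, a small check along
       these lines may help. It only looks for allkeys.txt under each
       directory in @INC, following the path pattern given above. Variable
       names are illustrative.

```perl
use strict;
use warnings;
use File::Spec;

# Directories in @INC; code-reference hooks are skipped.
my @inc_dirs = grep { !ref } @INC;

# Candidate locations: <a place in @INC>/Unicode/Collate/allkeys.txt
my @candidates = map { File::Spec->catfile($_, 'Unicode', 'Collate', 'allkeys.txt') } @inc_dirs;
my @found      = grep { -f $_ } @candidates;

if (@found) {
    print "table file found: $_\n" for @found;
} else {
    print "no allkeys.txt found; it could be copied to one of:\n";
    print "  $_\n" for @candidates;
}
```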


CAVEATS

       Normalization
           Use of the "normalization" parameter requires the
           Unicode::Normalize module (see Unicode::Normalize).

           If you do not need it (say, when you do not have to handle any
           combining characters), assign "normalization => undef" explicitly.

           -- see 6.5 Avoiding Normalization, UTS #10.
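
           For instance, a collator for data known to contain no combining
           characters might be set up as follows. The word list is made up
           for the sketch.

```perl
use strict;
use warnings;
use Unicode::Collate;

# No combining characters in this data, so normalization is switched off.
my $Collator = Unicode::Collate->new(normalization => undef);

my @sorted = $Collator->sort(qw(banana Apple cherry apple));
print "@sorted\n";   # apple Apple banana cherry
```

           With the default table, the two spellings of "apple" differ only
           at the third level, where the lowercase form sorts first.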

       Conformance Test
           The Conformance Test for the UCA is available under
           <http://www.unicode.org/Public/UCA/>.

           For CollationTest_SHIFTED.txt, a collator via
           "Unicode::Collate->new( )" should be used; for
           CollationTest_NON_IGNORABLE.txt, a collator via
           "Unicode::Collate->new(variable => "non-ignorable", level => 3)".

           Unicode::Normalize is required to run the Conformance Test.
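
           A rough sketch of how the two collators might be applied to test
           data. It assumes that each data line holds one string as
           space-separated hexadecimal code points, that "#" starts a
           comment, and that lines are listed in non-decreasing collation
           order. The helper count_misordered() is hypothetical, and this is
           not the official test procedure.

```perl
use strict;
use warnings;
use Unicode::Collate;

# The two collators named above.
my $shifted       = Unicode::Collate->new();
my $non_ignorable = Unicode::Collate->new(variable => "non-ignorable", level => 3);

# Hypothetical helper: counts adjacent pairs the collator considers out of order.
sub count_misordered {
    my ($Collator, @lines) = @_;
    my ($prev, $bad) = (undef, 0);
    for my $line (@lines) {
        next if $line =~ /^\s*(?:#|$)/;         # skip comment and blank lines
        (my $hex = $line) =~ s/\s*[;#].*//s;    # drop a trailing comment
        my $str = join '', map { chr hex $_ } split ' ', $hex;
        $bad++ if defined $prev && $Collator->gt($prev, $str);
        $prev = $str;
    }
    return $bad;
}

# A tiny inline sample instead of the real test file.
my @sample = ("# comment", "0061", "0041", "0062 # LATIN SMALL LETTER B");
printf "misordered pairs: %d\n", count_misordered($shifted, @sample);
```

           For a real run, the lines would be read from the downloaded file
           and passed to the helper with the collator that matches the file.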


AUTHOR, COPYRIGHT AND LICENSE

       The Unicode::Collate module for perl was written by SADAHIRO Tomoyuki,
       <SADAHIRO@cpan.org>. This module is Copyright(C) 2001-2012, SADAHIRO
       Tomoyuki. Japan. All rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       The file Unicode/Collate/allkeys.txt was copied verbatim from
       <http://www.unicode.org/Public/UCA/6.1.0/allkeys.txt>.  For this file,
       Copyright (c) 2001-2011 Unicode, Inc.  Distributed under the Terms of
       Use in <http://www.unicode.org/copyright.html>.


SEE ALSO

       Unicode Collation Algorithm - UTS #10
           <http://www.unicode.org/reports/tr10/>

       The Default Unicode Collation Element Table (DUCET)
           <http://www.unicode.org/Public/UCA/latest/allkeys.txt>

       The conformance test for the UCA
           <http://www.unicode.org/Public/UCA/latest/CollationTest.html>

           <http://www.unicode.org/Public/UCA/latest/CollationTest.zip>

       Hangul Syllable Type
           <http://www.unicode.org/Public/UNIDATA/HangulSyllableType.txt>

       Unicode Normalization Forms - UAX #15
           <http://www.unicode.org/reports/tr15/>

       Unicode Locale Data Markup Language (LDML) - UTS #35
           <http://www.unicode.org/reports/tr35/>



perl v5.16.3                      2013-03-04             Unicode::Collate(3pm)