Unicode::Collate(3pm)

Unicode::Collate(3pm)  Perl Programmers Reference Guide  Unicode::Collate(3pm)

NAME

       Unicode::Collate - Unicode Collation Algorithm


SYNOPSIS

         use Unicode::Collate;

         # construct a collator
         $Collator = Unicode::Collate->new(%tailoring);

         # sort a list of strings
         @sorted = $Collator->sort(@not_sorted);

         # compare two strings
         $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

         # If %tailoring is empty, $Collator performs
         # the default collation.


DESCRIPTION

       This module is an implementation of Unicode Technical Standard #10
       (a.k.a. UTS #10) - Unicode Collation Algorithm (a.k.a. UCA).

       Constructor and Tailoring

       The "new" method returns a collator object.

          $Collator = Unicode::Collate->new(
             UCA_Version => $UCA_Version,
             alternate => $alternate, # deprecated: use 'variable' instead.
             backwards => $levelNumber, # or \@levelNumbers
             entry => $element,
             hangul_terminator => $term_primary_weight,
             ignoreName => qr/$ignoreName/,
             ignoreChar => qr/$ignoreChar/,
             katakana_before_hiragana => $bool,
             level => $collationLevel,
             normalization => $normalization_form,
             overrideCJK => \&overrideCJK,
             overrideHangul => \&overrideHangul,
             preprocess => \&preprocess,
             rearrange => \@charList,
             table => $filename,
             undefName => qr/$undefName/,
             undefChar => qr/$undefChar/,
             upper_before_lower => $bool,
             variable => $variable,
          );

       UCA_Version
           If a tracking version number of the UCA is given, the behavior of
           that tracking version is emulated when collating.  If omitted, the
           return value of "UCA_Version()" is used.  "UCA_Version()" should
           return the latest tracking version supported.

           The supported tracking versions are 8, 9, 11, and 14.

                UCA       Unicode Standard         DUCET (@version)
                ---------------------------------------------------
                 8              3.1                3.0.1 (3.0.1d9)
                 9     3.1 with Corrigendum 3      3.1.1 (3.1.1)
                11              4.0                4.0.0 (4.0.0)
                14             4.1.0               4.1.0 (4.1.0)

           Note: Recent revisions of UTS #10 rename "Tracking Version" to
           "Revision".

       alternate
           -- see 3.2.2 Alternate Weighting, version 8 of UTS #10

           For backward compatibility, "alternate" (the old name) can be used
           as an alias for "variable".

       backwards
           -- see 3.1.2 French Accents, UTS #10.

                backwards => $levelNumber or \@levelNumbers

           Weights at the given level(s) are compared in reverse order; e.g.
           level 2 (diacritic ordering) in French.  If omitted, all levels
           are compared forwards.

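           The effect of "backwards => 2" can be sketched with four French
           words that differ only in their accents.  The sketch below
           assumes the default table (allkeys.txt) is installed; the
           variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Four words differing only in accents, written with \x escapes so the
# example does not depend on the encoding of the source file.
my @words = ("cot\x{E9}", "c\x{F4}te", "cote", "c\x{F4}t\x{E9}");

# Default: secondary (accent) weights are compared from the start.
my $forward = Unicode::Collate->new();

# French style: secondary weights are compared from the end.
my $french = Unicode::Collate->new(backwards => 2);

my @fwd = $forward->sort(@words);   # cote, cot\x{E9}, c\x{F4}te, c\x{F4}t\x{E9}
my @fr  = $french->sort(@words);    # cote, c\x{F4}te, cot\x{E9}, c\x{F4}t\x{E9}

binmode STDOUT, ':encoding(UTF-8)';
print "forward:   @fwd\n";
print "backwards: @fr\n";
```

           With "backwards => 2" the last accent in a word decides first, so
           the word with the circumflex on the first vowel comes before the
           word with the acute on the last vowel.
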
       entry
           -- see 3.1 Linguistic Features; 3.2.1 File Format, UTS #10.

           If the same character (or sequence of characters) exists in the
           collation element table loaded through "table", its mapping to
           collation elements is overridden.  If it does not exist, the
           mapping is added.

               entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
           0063 0068 ; [.0E6A.0020.0002.0063] # ch
           0043 0068 ; [.0E6A.0020.0007.0043] # Ch
           0043 0048 ; [.0E6A.0020.0008.0043] # CH
           006C 006C ; [.0F4C.0020.0002.006C] # ll
           004C 006C ; [.0F4C.0020.0007.004C] # Ll
           004C 004C ; [.0F4C.0020.0008.004C] # LL
           00F1      ; [.0F7B.0020.0002.00F1] # n-tilde
           006E 0303 ; [.0F7B.0020.0002.00F1] # n-tilde
           00D1      ; [.0F7B.0020.0008.00D1] # N-tilde
           004E 0303 ; [.0F7B.0020.0008.00D1] # N-tilde
           ENTRY

               entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
           00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae ligature as <a><e>
           00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE ligature as <A><E>
           ENTRY

           NOTE: The code point in the UCA file format (before ';') must be a
           Unicode code point (written in hexadecimal), not a native code
           point.  So 0063 must always denote "U+0063", not the character
           "\x63".

           Weighting may vary depending on the collation element table.  So
           ensure that the weights defined in "entry" are consistent with
           those in the collation element table loaded via "table".

           In DUCET v4.0.0, the primary weight of "C" is "0E60" and that of
           "D" is "0E6D".  So setting the primary weight of "CH" to "0E6A" (a
           value between "0E60" and "0E6D") yields the ordering "C < CH < D".
           Strictly speaking, DUCET already has some characters between "C"
           and "D": "small capital C" ("U+1D04") with primary weight "0E64",
           "c-hook/C-hook" ("U+0188/U+0187") with "0E65", and "c-curl"
           ("U+0255") with "0E69".  The primary weight "0E6A" therefore
           orders "CH" between "c-curl" and "D".

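           A contraction such as "ch" can be tried without depending on any
           particular DUCET version by combining "entry" with "table =>
           undef".  In the sketch below the primary weights (0103 to 0107)
           are made up for illustration and are not taken from any real
           table.

```perl
use strict;
use warnings;
use Unicode::Collate;

# A tiny self-contained table: no table file is read, so the weights
# below are the only ones defined.  "ch" is a contraction whose primary
# weight (0104) lies between those of "c" (0103) and "d" (0105).
my $collator = Unicode::Collate->new(
    table => undef,
    entry => <<'ENTRY',
0063      ; [.0103.0020.0002.0063] # c
0063 0068 ; [.0104.0020.0002.0063] # ch (contraction)
0064      ; [.0105.0020.0002.0064] # d
0068      ; [.0106.0020.0002.0068] # h
0069      ; [.0107.0020.0002.0069] # i
ENTRY
);

# "ci" is weighted as c + i (0103 0107), "ch" as the single element 0104,
# so "ci" sorts before "ch" although "h" alone sorts before "i".
my @sorted = $collator->sort(qw(d ch ci c));
print "@sorted\n";   # c ci ch d
```
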
       hangul_terminator
           -- see 7.1.4 Trailing Weights, UTS #10.

           If a true value is given (non-zero, and it should be positive), it
           is added as a terminator primary weight to the end of every
           standard Hangul syllable.  Secondary and any higher weights for
           the terminator are set to zero.  If the value is false or the
           "hangul_terminator" key does not exist, no terminator weights are
           inserted.

           Boundaries of Hangul syllables are determined according to
           conjoining Jamo behavior in the Unicode Standard and
           HangulSyllableType.txt.

           Implementation Note: (1) For an expansion mapping (a Unicode
           character mapped to a sequence of collation elements), a
           terminator is not added between collation elements, even if a
           Hangul syllable boundary exists there.  A terminator is added only
           after the last collation element.

           (2) Non-conjoining Hangul letters (Compatibility Jamo, halfwidth
           Jamo, and enclosed letters) are not automatically terminated with a
           terminator primary weight.  These characters may need a terminator
           included in the collation element table beforehand.

       ignoreChar
       ignoreName
           -- see 3.2.2 Variable Weighting, UTS #10.

           Makes the entry in the table completely ignorable; i.e. as if the
           weights were zero at all levels.

           Through "ignoreChar", any character matching "qr/$ignoreChar/" is
           ignored.  Through "ignoreName", any character whose name (given
           in the "table" file as a comment) matches "qr/$ignoreName/" is
           ignored.

           E.g. when 'a' and 'e' are ignorable, 'element' is equal to 'lament'
           (or 'lmnt').

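           The 'element'/'lament' case above can be reproduced with a short
           sketch.  It assumes the default table (allkeys.txt) is installed;
           the anchored pattern "qr/^[ae]$/" is one possible way to select
           exactly the two lowercase letters.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Make the lowercase letters 'a' and 'e' completely ignorable.
my $ignore_ae = Unicode::Collate->new(ignoreChar => qr/^[ae]$/);

# A collator without tailoring, for comparison.
my $default = Unicode::Collate->new();

# With 'a' and 'e' ignored, both words collate like 'lmnt'.
print "tailored: ", ($ignore_ae->eq('element', 'lament') ? "equal" : "different"), "\n";
print "default:  ", ($default->eq('element', 'lament')   ? "equal" : "different"), "\n";
```
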
       katakana_before_hiragana
           -- see 7.3.1 Tertiary Weight Table, UTS #10.

           By default, hiragana is before katakana.  If the parameter is made
           true, this is reversed.

           NOTE: This parameter simplistically assumes that any
           hiragana/katakana distinctions occur at level 3, and that their
           weights at level 3 are the same as those mentioned in 7.3.1, UTS
           #10.  If you define collation elements that violate this
           requirement, this parameter does not work correctly.

       level
           -- see 4.3 Form Sort Key, UTS #10.

           Sets the maximum level.  Any levels higher than the specified one
           are ignored.

             Level 1: alphabetic ordering
             Level 2: diacritic ordering
             Level 3: case ordering
             Level 4: tie-breaking (e.g. in the case when variable is 'shifted')

             ex. level => 2,

           If omitted, the maximum is the 4th.

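           As a rough illustration of the levels listed above, the following
           sketch compares the same strings at different maximum levels.  It
           assumes the default table is installed; the variable names are
           illustrative only.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $level1 = Unicode::Collate->new(level => 1);  # letters only
my $level2 = Unicode::Collate->new(level => 2);  # letters + accents
my $level3 = Unicode::Collate->new(level => 3);  # letters + accents + case

my $plain    = "resume";
my $accented = "r\x{E9}sum\x{E9}";   # "resume" with acute accents on both e's

# Accents are a level 2 difference.
my $same_letters = $level1->eq($plain, $accented);  # true
my $same_accents = $level2->eq($plain, $accented);  # false

# Case is a level 3 difference.
my $same_ignoring_case = $level2->eq("perl", "PERL");  # true
my $same_with_case     = $level3->eq("perl", "PERL");  # false

print "level 1, accents: ", ($same_letters       ? "equal" : "different"), "\n";
print "level 2, accents: ", ($same_accents       ? "equal" : "different"), "\n";
print "level 2, case:    ", ($same_ignoring_case ? "equal" : "different"), "\n";
print "level 3, case:    ", ($same_with_case     ? "equal" : "different"), "\n";
```
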
       normalization
           -- see 4.1 Normalize, UTS #10.

           If specified, strings are normalized before sort keys are
           prepared.

           A form name that "Unicode::Normalize::normalize()" accepts is
           applied as $normalization_form.  Acceptable names include 'NFD',
           'NFC', 'NFKD', and 'NFKC'.  See "Unicode::Normalize::normalize()"
           for details.  If omitted, 'NFD' is used.

           "normalization" is performed after "preprocess" (if defined).

           Furthermore, the special values "undef" and "prenormalized" can be
           used, though they do not involve
           "Unicode::Normalize::normalize()".

           If "undef" (not the string "undef") is passed explicitly as the
           value for this key, no normalization is carried out (this may make
           tailoring easier if normalization is not desired).  Under
           "(normalization => undef)", only contiguous contractions are
           resolved; e.g. even if "A-ring" (and "A-ring-cedilla") is ordered
           after "Z", "A-cedilla-ring" would be primary equal to "A".  In
           this respect, "(normalization => undef, preprocess => sub {
           NFD(shift) })" is not equivalent to "(normalization => 'NFD')".

           In the case of "(normalization => "prenormalized")", no
           normalization is performed, but non-contiguous contractions with
           combining characters are resolved.  Therefore "(normalization =>
           'prenormalized', preprocess => sub { NFD(shift) })" is equivalent
           to "(normalization => 'NFD')".  If source strings are already
           normalized, "(normalization => 'prenormalized')" may save the time
           spent on normalization.

           Except for "(normalization => undef)", Unicode::Normalize is
           required (see also CAVEATS).

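           One way to see what normalization contributes is to compare two
           canonically equivalent strings whose combining marks are typed in
           a different order.  This is a sketch assuming the default table
           and Unicode::Normalize are installed.  Only the result for the
           default ('NFD') collator is certain; the result without
           normalization depends on the weights in the table, so it is
           printed rather than stated.

```perl
use strict;
use warnings;
use Unicode::Collate;

# The same letter with the same two marks, typed in a different order.
# NFD puts both into the order: a, dot below, circumflex.
my $s1 = "a\x{323}\x{302}";   # a + dot below + circumflex
my $s2 = "a\x{302}\x{323}";   # a + circumflex + dot below

my $nfd  = Unicode::Collate->new();                        # default: 'NFD'
my $none = Unicode::Collate->new(normalization => undef);  # no normalization

# Equal under NFD, because both strings normalize to the same sequence.
my $equal_with_nfd = $nfd->eq($s1, $s2);

print "NFD:   ", ($equal_with_nfd      ? "equal" : "different"), "\n";
print "undef: ", ($none->eq($s1, $s2)  ? "equal" : "different"), "\n";
```
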
       overrideCJK
           -- see 7.1 Derived Collation Elements, UTS #10.

           By default, CJK Unified Ideographs are ordered in Unicode codepoint
           order, but "CJK Unified Ideographs" (if "UCA_Version" is 8 to 11,
           the range is "U+4E00..U+9FA5"; if "UCA_Version" is 14, the range is
           "U+4E00..U+9FBB") sort before "CJK Unified Ideographs Extension"
           (the ranges are "U+3400..U+4DB5" and "U+20000..U+2A6D6").

           Through "overrideCJK", the ordering of CJK Unified Ideographs can
           be overridden.

           ex. CJK Unified Ideographs in the JIS code point order.

             overrideCJK => sub {
                 my $u = shift;             # get a Unicode codepoint
                 my $b = pack('n', $u);     # to UTF-16BE
                 my $s = your_unicode_to_sjis_converter($b); # convert
                 my $n = unpack('n', $s);   # convert sjis to short
                 [ $n, 0x20, 0x2, $u ];     # return the collation element
             },

           ex. ignores all CJK Unified Ideographs.

             overrideCJK => sub {()}, # CODEREF returning empty list

              # where ->eq("Pe\x{4E00}rl", "Perl") is true
              # as U+4E00 is a CJK Unified Ideograph and is made ignorable.

           If "undef" is passed explicitly as the value for this key, weights
           for CJK Unified Ideographs are treated as undefined.  But weights
           assigned to CJK Unified Ideographs in the table or "entry" remain
           valid.

       overrideHangul
           -- see 7.1 Derived Collation Elements, UTS #10.

           By default, Hangul Syllables are decomposed into Hangul Jamo, even
           if "(normalization => undef)".  But the mapping of Hangul Syllables
           may be overridden.

           This parameter works like "overrideCJK", so see there for examples.

           If you want to override the mapping of Hangul Syllables, NFD, NFKD,
           and FCD are not appropriate, since they will decompose Hangul
           Syllables before overriding.

           If "undef" is passed explicitly as the value for this key, the
           weight for Hangul Syllables is treated as undefined without
           decomposition into Hangul Jamo.  But weights defined for Hangul
           Syllables in the table or "entry" remain valid.

       preprocess
           -- see 5.1 Preprocessing, UTS #10.

           If specified, the coderef is used to preprocess strings before
           the formation of sort keys.

           ex. dropping English articles, such as "a" or "the".  Then, "the
           pen" is before "a pencil".

                preprocess => sub {
                      my $str = shift;
                      $str =~ s/\b(?:an?|the)\s+//gi;
                      return $str;
                   },

           "preprocess" is performed before "normalization" (if defined).

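           Put into a complete script, the article-dropping example above
           might look like the following sketch.  It assumes the default
           table is installed; the list of titles is made up.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Strip a leading or embedded English article before sort keys are built.
# The strings themselves are returned unchanged by sort().
my $collator = Unicode::Collate->new(
    preprocess => sub {
        my $str = shift;
        $str =~ s/\b(?:an?|the)\s+//gi;
        return $str;
    },
);

my @titles = $collator->sort("a pencil", "the pen", "an apple");

# Sorted as if the strings were "apple", "pen", "pencil".
print join(", ", @titles), "\n";   # an apple, the pen, a pencil
```
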
       rearrange
           -- see 3.1.3 Rearrangement, UTS #10.

           Characters that are not coded in logical order and are to be
           rearranged.  If "UCA_Version" is 11 or less, the default is:

               rearrange => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4 ],

           If you want to disallow any rearrangement, pass "undef" or "[]" (a
           reference to an empty list) as the value for this key.

           If "UCA_Version" is 14, the default is "[]" (i.e. no
           rearrangement).

           According to version 9 of the UCA, this parameter shall not be
           used; but no warning is issued at present.

       table
           -- see 3.2 Default Unicode Collation Element Table, UTS #10.

           You can use another collation element table if desired.

           The table file should be located in the Unicode/Collate directory
           on @INC.  For example, if the filename is Foo.txt, the table file
           is searched for as Unicode/Collate/Foo.txt in @INC.

           By default, allkeys.txt (the filename of DUCET) is used.  If you
           prepare your own table file, a name other than allkeys.txt is
           preferable, to avoid a namespace conflict.

           If "undef" is passed explicitly as the value for this key, no file
           is read (but you can define collation elements via "entry").

           A typical way to define a collation element table without any
           table file:

              $onlyABC = Unicode::Collate->new(
                  table => undef,
                  entry => << 'ENTRIES',
           0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
           0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
           0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
           0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
           0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
           0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
           ENTRIES
               );

           If "ignoreName" or "undefName" is used, character names should be
           specified as a comment (following "#") on each line.

       undefChar
       undefName
           -- see 6.3.4 Reducing the Repertoire, UTS #10.

           Undefines the collation element as if it were unassigned in the
           table.  This reduces the size of the table.  If an unassigned
           character appears in the string to be collated, the sort key is
           made from its codepoint as a single-character collation element,
           and it is greater than any assigned collation element (unassigned
           characters are ordered by codepoint among themselves).  However,
           it may be better to ignore characters that are unfamiliar to you
           and probably never used.

           Through "undefChar", any character matching "qr/$undefChar/" is
           undefined.  Through "undefName", any character whose name (given
           in the "table" file as a comment) matches "qr/$undefName/" is
           undefined.

           ex. Collation weights for beyond-BMP characters are not stored in
           the object:

               undefChar => qr/[^\0-\x{fffd}]/,

       upper_before_lower
           -- see 6.6 Case Comparisons, UTS #10.

           By default, lowercase is before uppercase.  If the parameter is
           made true, this is reversed.

           NOTE: This parameter simplistically assumes that any
           lowercase/uppercase distinctions occur at level 3, and that their
           weights at level 3 are the same as those mentioned in 7.3.1, UTS
           #10.  If you define collation elements that differ from this
           requirement, this parameter does not work correctly.

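           A minimal sketch of the difference, assuming the default table is
           installed (the letters are arbitrary):

```perl
use strict;
use warnings;
use Unicode::Collate;

my @letters = qw(b A a B);

# Default: lowercase before uppercase within the same letter.
my @lower_first = Unicode::Collate->new()->sort(@letters);

# Tailored: uppercase before lowercase within the same letter.
my @upper_first = Unicode::Collate->new(upper_before_lower => 1)->sort(@letters);

print "default:            @lower_first\n";   # a A b B
print "upper_before_lower: @upper_first\n";   # A a B b
```

           In both cases the letters themselves still decide first; the
           parameter only changes how case breaks the tie.
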
       variable
           -- see 3.2.2 Variable Weighting, UTS #10.

           This key selects the variable weighting for variable collation
           elements, which are marked with an ASTERISK in the table (NOTE:
           Many punctuation marks and symbols are variable in allkeys.txt).

              variable => 'blanked', 'non-ignorable', 'shifted', or 'shift-trimmed'.

           These names are case-insensitive.  By default (if the
           specification is omitted), 'shifted' is adopted.

              'Blanked'        Variable elements are made ignorable at levels 1 through 3;
                               considered at the 4th level.

              'Non-Ignorable'  Variable elements are not reset to ignorable.

              'Shifted'        Variable elements are made ignorable at levels 1 through 3;
                               their level 4 weight is replaced by the old level 1 weight.
                               Level 4 weight for Non-Variable elements is 0xFFFF.

              'Shift-Trimmed'  Same as 'shifted', but all FFFF's at the 4th level
                               are trimmed.

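           The difference between 'shifted' and 'non-ignorable' shows up as
           soon as a string contains a space, which is a variable element in
           allkeys.txt.  The sketch below assumes the default table is
           installed and limits both collators to level 3 so that the
           level 4 tie-breaking does not come into play.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $shifted = Unicode::Collate->new(variable => 'shifted',       level => 3);
my $nonign  = Unicode::Collate->new(variable => 'non-ignorable', level => 3);

# 'shifted': the space is ignorable at levels 1 through 3,
# so the two strings cannot be told apart at level 3.
my $shifted_equal = $shifted->eq("de luge", "deluge");   # true

# 'non-ignorable': the space keeps its (low) primary weight,
# so "de luge" sorts before "deluge".
my $nonign_less = $nonign->lt("de luge", "deluge");      # true

print "shifted, level 3: ", ($shifted_equal ? "equal" : "different"), "\n";
print "non-ignorable:    ", ($nonign_less ? "'de luge' first" : "'deluge' first"), "\n";
```
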
       Methods for Collation

       "@sorted = $Collator->sort(@not_sorted)"
           Sorts a list of strings.

       "$result = $Collator->cmp($a, $b)"
           Returns 1 (when $a is greater than $b), 0 (when $a is equal to
           $b), or -1 (when $a is less than $b).

       "$result = $Collator->eq($a, $b)"
       "$result = $Collator->ne($a, $b)"
       "$result = $Collator->lt($a, $b)"
       "$result = $Collator->le($a, $b)"
       "$result = $Collator->gt($a, $b)"
       "$result = $Collator->ge($a, $b)"
           They work like the operators of the same names.

              eq : whether $a is equal to $b.
              ne : whether $a is not equal to $b.
              lt : whether $a is less than $b.
              le : whether $a is less than or equal to $b.
              gt : whether $a is greater than $b.
              ge : whether $a is greater than or equal to $b.

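           Because "cmp" returns -1, 0, or 1, it can be used inside Perl's
           built-in "sort" when the strings are fields of larger records.
           The following is a sketch with made-up records, assuming the
           default table is installed.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new();

# Hypothetical records to be ordered by their "name" field.
my @records = (
    { name => "cherry", qty => 3 },
    { name => "Banana", qty => 7 },
    { name => "apple",  qty => 5 },
);

# Collation order: letters decide first, case only breaks ties.
my @by_name = sort { $collator->cmp($a->{name}, $b->{name}) } @records;
my @names   = map { $_->{name} } @by_name;

# Plain string comparison orders by code point, so "B" precedes "a".
my @codepoint_order = sort map { $_->{name} } @records;

print "collated:   @names\n";            # apple Banana cherry
print "code point: @codepoint_order\n";  # Banana apple cherry
```
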
       "$sortKey = $Collator->getSortKey($string)"
           -- see 4.3 Form Sort Key, UTS #10.

           Returns a sort key.

           Comparing sort keys with a binary comparison gives the result of
           comparing the original strings using the UCA.

              $Collator->getSortKey($a) cmp $Collator->getSortKey($b)

                 is equivalent to

              $Collator->cmp($a, $b)

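           Since sort keys compare correctly with plain "cmp", each key can
           be computed once and reused, which may help when the same strings
           are compared many times.  This is a sketch of that idea (the
           word list is arbitrary), assuming the default table is installed.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new();
my @words = qw(pear Apple banana apple Pear);

# Compute each sort key once, sort on the keys with a binary
# comparison, then drop the keys again.
my @sorted =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ $collator->getSortKey($_), $_ ] } @words;

# For comparison: the collator's own sort method.
my @expected = $collator->sort(@words);

print "by sort key: @sorted\n";     # apple Apple banana pear Pear
print "by sort():   @expected\n";   # apple Apple banana pear Pear
```
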
       "$sortKeyForm = $Collator->viewSortKey($string)"
           Returns the sort key of $string in its representation form.  If
           "UCA_Version" is 8, the output is slightly different.

              use Unicode::Collate;
              my $c = Unicode::Collate->new();
              print $c->viewSortKey("Perl"),"\n";

              # output:
              # [0B67 0A65 0B7F 0B03 | 0020 0020 0020 0020 | 0008 0002 0002 0002 | FFFF FFFF FFFF FFFF]
              #  Level 1               Level 2               Level 3               Level 4

       Methods for Searching

       DISCLAIMER: If the "preprocess" or "normalization" parameter is true
       for $Collator, calling these methods ("index", "match", "gmatch",
       "subst", "gsubst") croaks, as the position and the length might differ
       from those in the specified string.  (The "rearrange" and
       "hangul_terminator" parameters are ignored.)

       The "match", "gmatch", "subst", "gsubst" methods work like "m//",
       "m//g", "s///", "s///g", respectively, but they do not interpret any
       pattern; they search only for a literal substring.

       "$position = $Collator->index($string, $substring[, $position])"
       "($position, $length) = $Collator->index($string, $substring[, $position])"
           If $substring matches a part of $string, returns the position of
           the first occurrence of the matching part in scalar context; in
           list context, returns a two-element list of the position and the
           length of the matching part.

           If $substring does not match any part of $string, returns "-1" in
           scalar context and an empty list in list context.

           e.g. you say

             # assumes the source file is UTF-8 and "use utf8" is in effect
             my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
                                                # (normalization => undef) is REQUIRED.
             my $str = "Ich muß studieren Perl.";
             my $sub = "MÜSS";
             my $match;
             if (my($pos,$len) = $Collator->index($str, $sub)) {
                 $match = substr($str, $pos, $len);
             }

           and get "muß" in $match since "muß" is primary equal to "MÜSS".

       "$match_ref = $Collator->match($string, $substring)"
       "($match)   = $Collator->match($string, $substring)"
           If $substring matches a part of $string, in scalar context, returns
           a reference to the first occurrence of the matching part
           ($match_ref is always true on a match, since every reference is
           true); in list context, returns the first occurrence of the
           matching part.

           If $substring does not match any part of $string, returns "undef"
           in scalar context and an empty list in list context.

           e.g.

               if ($match_ref = $Collator->match($str, $sub)) { # scalar context
                   print "matches [$$match_ref].\n";
               } else {
                   print "doesn't match.\n";
               }

                or

               if (($match) = $Collator->match($str, $sub)) { # list context
                   print "matches [$match].\n";
               } else {
                   print "doesn't match.\n";
               }

       "@match = $Collator->gmatch($string, $substring)"
           If $substring matches a part of $string, returns all the matching
           parts (or the number of matches in scalar context).

           If $substring does not match any part of $string, returns an empty
           list.

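           A short sketch of both calling contexts, assuming the default
           table is installed.  As with "index", "normalization => undef" is
           needed, and "level => 1" makes the search ignore case.

```perl
use strict;
use warnings;
use Unicode::Collate;

# normalization => undef is required by the searching methods.
my $collator = Unicode::Collate->new(normalization => undef, level => 1);

my $text = "Perl PERL perl";

# List context: the matching parts as they appear in $text.
my @hits = $collator->gmatch($text, "perl");

# Scalar context: the number of matches.
my $count = $collator->gmatch($text, "perl");

print "hits:  @hits\n";    # Perl PERL perl
print "count: $count\n";   # 3
```
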
       "$count = $Collator->subst($string, $substring, $replacement)"
           If $substring matches a part of $string, the first occurrence of
           the matching part is replaced by $replacement ($string is modified)
           and $count (always equal to 1) is returned.

           $replacement can be a "CODEREF", taking the matching part as an
           argument, and returning a string to replace the matching part (a
           bit similar to "s/(..)/$coderef->($1)/e").

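           Both forms of $replacement can be sketched as follows, assuming
           the default table is installed (the sample sentence is made up).
           Only the first match is touched in each call.

```perl
use strict;
use warnings;
use Unicode::Collate;

# normalization => undef is required by the searching methods.
my $collator = Unicode::Collate->new(normalization => undef, level => 1);

# Replacement given as a plain string.
my $str = "I like PERL and perl.";
my $replaced = $collator->subst($str, "perl", "Perl");
print "$str\n";    # I like Perl and perl.

# Replacement given as a coderef that receives the matching part.
my $str2 = "I like PERL and perl.";
$collator->subst($str2, "perl", sub { "<" . lc($_[0]) . ">" });
print "$str2\n";   # I like <perl> and perl.
```
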
       "$count = $Collator->gsubst($string, $substring, $replacement)"
           If $substring matches a part of $string, all occurrences of the
           matching part are replaced by $replacement ($string is modified)
           and $count is returned.

           $replacement can be a "CODEREF", taking the matching part as an
           argument, and returning a string to replace the matching part (a
           bit similar to "s/(..)/$coderef->($1)/eg").

           e.g.

             my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
                                                # (normalization => undef) is REQUIRED.
             my $str = "Camel donkey zebra came\x{301}l CAMEL horse cAm\0E\0L...";
             $Collator->gsubst($str, "camel", sub { "<b>$_[0]</b>" });

             # now $str is "<b>Camel</b> donkey zebra <b>came\x{301}l</b> <b>CAMEL</b> horse <b>cAm\0E\0L</b>...";
             # i.e., all the camels are made bold-faced.

       Other Methods

       "%old_tailoring = $Collator->change(%new_tailoring)"
           Changes the values of the specified keys and returns the changed
           part.

               $Collator = Unicode::Collate->new(level => 4);

               $Collator->eq("perl", "PERL"); # false

               %old = $Collator->change(level => 2); # returns (level => 4).

               $Collator->eq("perl", "PERL"); # true

               $Collator->change(%old); # returns (level => 2).

               $Collator->eq("perl", "PERL"); # false

           Not all "(key,value)" pairs are allowed to be changed.  See also
           @Unicode::Collate::ChangeOK and @Unicode::Collate::ChangeNG.

           In scalar context, returns the modified collator (it is the same
           object, not a clone of the original).

               $Collator->change(level => 2)->eq("perl", "PERL"); # true

               $Collator->eq("perl", "PERL"); # true; now max level is 2nd.

               $Collator->change(level => 4)->eq("perl", "PERL"); # false

       "$version = $Collator->version()"
           Returns the version number (a string) of the Unicode Standard on
           which the "table" file used by the collator object is based.  If
           the table does not include a version line (starting with
           @version), returns "unknown".

       "UCA_Version()"
           Returns the tracking version number of UTS #10 this module
           consults.

       "Base_Unicode_Version()"
           Returns the version number of the Unicode Standard this module is
           based on.

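           The three version queries can be printed side by side.  This is a
           sketch assuming the default table is installed; no output is
           shown because the values depend on the installed release of the
           module and its table.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new();

# Module-level information, called here as plain functions.
my $uca_version  = Unicode::Collate::UCA_Version();
my $base_version = Unicode::Collate::Base_Unicode_Version();

# Information taken from the table loaded by this collator.
my $table_version = $collator->version();

print "UCA tracking version: $uca_version\n";
print "Base Unicode version: $base_version\n";
print "Table version:        $table_version\n";
```
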

EXPORT

       No method will be exported.


INSTALL

       Though this module can be used without any "table" file, it is
       easiest to use with a table file in the UCA format installed, by
       copying it under the directory <a place in @INC>/Unicode/Collate.

       The most preferable one is "The Default Unicode Collation Element
       Table" (aka DUCET), available from the Unicode Consortium's website:

          http://www.unicode.org/Public/UCA/

          http://www.unicode.org/Public/UCA/latest/allkeys.txt (latest version)

       If DUCET is not installed, it is recommended to copy the file from
       http://www.unicode.org/Public/UCA/latest/allkeys.txt to <a place in
       @INC>/Unicode/Collate/allkeys.txt manually.


CAVEATS

       Normalization
           Use of the "normalization" parameter requires the
           Unicode::Normalize module (see Unicode::Normalize).

           If you do not need it (say, when you need not handle any
           combining characters), pass "normalization => undef" explicitly.

           -- see 6.5 Avoiding Normalization, UTS #10.

       Conformance Test
           The Conformance Test for the UCA is available under
           <http://www.unicode.org/Public/UCA/>.

           For CollationTest_SHIFTED.txt, a collator via
           "Unicode::Collate->new( )" should be used; for
           CollationTest_NON_IGNORABLE.txt, a collator via
           "Unicode::Collate->new(variable => "non-ignorable", level => 3)".

           Unicode::Normalize is required to try the Conformance Test.


AUTHOR, COPYRIGHT AND LICENSE

       The Unicode::Collate module for perl was written by SADAHIRO Tomoyuki,
       <SADAHIRO@cpan.org>. This module is Copyright(C) 2001-2005, SADAHIRO
       Tomoyuki. Japan. All rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       The file Unicode/Collate/allkeys.txt was copied directly from
       <http://www.unicode.org/Public/UCA/4.1.0/allkeys.txt>.  This file is
       Copyright (c) 1991-2005 Unicode, Inc. All rights reserved.  Distributed
       under the Terms of Use in <http://www.unicode.org/copyright.html>.
