Unicode::Collate(3pm)  Perl Programmers Reference Guide  Unicode::Collate(3pm)

NAME

6       Unicode::Collate - Unicode Collation Algorithm
7

SYNOPSIS

         use Unicode::Collate;

         # construct
         $Collator = Unicode::Collate->new(%tailoring);

         # sort
         @sorted = $Collator->sort(@not_sorted);

         # compare
         $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

         # If %tailoring is false (i.e. empty),
         # $Collator should do the default collation.
22

DESCRIPTION

24       This module is an implementation of Unicode Technical Standard #10
25       (a.k.a. UTS #10) - Unicode Collation Algorithm (a.k.a. UCA).
26
27   Constructor and Tailoring
28       The "new" method returns a collator object.
29
          $Collator = Unicode::Collate->new(
             UCA_Version => $UCA_Version,
             alternate => $alternate, # deprecated: use of 'variable' is recommended.
             backwards => $levelNumber, # or \@levelNumbers
             entry => $element,
             hangul_terminator => $term_primary_weight,
             ignoreName => qr/$ignoreName/,
             ignoreChar => qr/$ignoreChar/,
             katakana_before_hiragana => $bool,
             level => $collationLevel,
             normalization => $normalization_form,
             overrideCJK => \&overrideCJK,
             overrideHangul => \&overrideHangul,
             preprocess => \&preprocess,
             rearrange => \@charList,
             table => $filename,
             undefName => qr/$undefName/,
             undefChar => qr/$undefChar/,
             upper_before_lower => $bool,
             variable => $variable,
          );
51
52       UCA_Version
           If the tracking version number of UCA is given, the behavior of
           that tracking version is emulated when collating.  If omitted, the
           return value of "UCA_Version()" is used.  "UCA_Version()" should
           return the latest tracking version supported.

           The supported tracking versions are 8, 9, 11, and 14.
59
60                UCA       Unicode Standard         DUCET (@version)
61                ---------------------------------------------------
62                 8              3.1                3.0.1 (3.0.1d9)
63                 9     3.1 with Corrigendum 3      3.1.1 (3.1.1)
64                11              4.0                4.0.0 (4.0.0)
65                14             4.1.0               4.1.0 (4.1.0)
66
           Note: Recent revisions of UTS #10 rename "Tracking Version" to
           "Revision".
68
69       alternate
70           -- see 3.2.2 Alternate Weighting, version 8 of UTS #10
71
72           For backward compatibility, "alternate" (old name) can be used as
73           an alias for "variable".
74
75       backwards
76           -- see 3.1.2 French Accents, UTS #10.
77
78                backwards => $levelNumber or \@levelNumbers
79
           Weights at the given level(s) are compared in reverse order; ex.
           level 2 (diacritic ordering) in French.  If omitted, all levels
           are compared forwards.
82
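           As a rough illustration of the effect, the sketch below sorts four
           French words with and without "backwards => 2".  The word list and
           variable names are made up for this example, and it assumes the
           default table is available.  Accents are written as combining
           marks so that the source needs no particular encoding.

```perl
use strict;
use warnings;
use Unicode::Collate;

# U+0301 is the combining acute accent, U+0302 the combining circumflex.
my @words = ("co\x{302}te\x{301}", "cote\x{301}", "co\x{302}te", "cote");

my $forward = Unicode::Collate->new();                # forwards at every level
my $french  = Unicode::Collate->new(backwards => 2);  # level 2 read from the end

my @forward_order = $forward->sort(@words);
my @french_order  = $french->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "forwards:  @forward_order\n";
print "backwards: @french_order\n";
```

           With forwards comparison the first accent difference decides the
           order; with "backwards => 2" the last one does, so the two middle
           words change places.
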
83       entry
84           -- see 3.1 Linguistic Features; 3.2.1 File Format, UTS #10.
85
           If the same character (or sequence of characters) already exists
           in the collation element table loaded through "table", its mapping
           to collation elements is overridden.  If it does not exist, the
           mapping is added.
90
               entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
           0063 0068 ; [.0E6A.0020.0002.0063] # ch
           0043 0068 ; [.0E6A.0020.0007.0043] # Ch
           0043 0048 ; [.0E6A.0020.0008.0043] # CH
           006C 006C ; [.0F4C.0020.0002.006C] # ll
           004C 006C ; [.0F4C.0020.0007.004C] # Ll
           004C 004C ; [.0F4C.0020.0008.004C] # LL
           00F1      ; [.0F7B.0020.0002.00F1] # n-tilde
           006E 0303 ; [.0F7B.0020.0002.00F1] # n-tilde
           00D1      ; [.0F7B.0020.0008.00D1] # N-tilde
           004E 0303 ; [.0F7B.0020.0008.00D1] # N-tilde
           ENTRY

               entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
           00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae ligature as <a><e>
           00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE ligature as <A><E>
           ENTRY
108
           NOTE: The code point in the UCA file format (before ';') must be a
           Unicode code point (given in hexadecimal), not a native code
           point.  So 0063 must always denote "U+0063", not the character
           "\x63".
113
114           Weighting may vary depending on collation element table.  So ensure
115           the weights defined in "entry" will be consistent with those in the
116           collation element table loaded via "table".
117
           In DUCET v4.0.0, the primary weight of "C" is "0E60" and that of
           "D" is "0E6D".  Setting the primary weight of "CH" to "0E6A" (a
           value between "0E60" and "0E6D") therefore gives the ordering
           "C < CH < D".  Strictly speaking, DUCET already has some characters
           between "C" and "D": "small capital C" ("U+1D04") with primary
           weight "0E64", "c-hook/C-hook" ("U+0188/U+0187") with "0E65", and
           "c-curl" ("U+0255") with "0E69".  The primary weight "0E6A" thus
           places "CH" between "c-curl" and "D".
126
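           Because the weights above depend on the DUCET release, a
           version-independent way to try out a contraction is a tiny table
           of your own.  The sketch below is only an illustration: the
           weights 0101 to 0106 are invented for the example and come from
           no DUCET release.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new(
    table => undef,   # read no table file; use only the entries below
    entry => <<'ENTRY',
0061      ; [.0101.0020.0002.0061] # a
0063      ; [.0103.0020.0002.0063] # c
0063 0068 ; [.0104.0020.0002.0063] # ch
0064      ; [.0105.0020.0002.0064] # d
0068      ; [.0106.0020.0002.0068] # h
ENTRY
);

# "ch" is one collation element whose weight lies between "c" and "d".
my @sorted = $collator->sort(qw(d ch cd c ca));
print "@sorted\n";    # c ca cd ch d
```

           Here "ch" sorts after "cd" because the contraction is weighted as
           a single unit rather than as "c" followed by "h".
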
127       hangul_terminator
128           -- see 7.1.4 Trailing Weights, UTS #10.
129
           If a true value is given (non-zero; it should be positive), it is
           added as a terminator primary weight to the end of every standard
           Hangul syllable.  Secondary and higher weights for the terminator
           are set to zero.  If the value is false or the "hangul_terminator"
           key does not exist, no terminator weights are inserted.
136
137           Boundaries of Hangul syllables are determined according to
138           conjoining Jamo behavior in the Unicode Standard and
139           HangulSyllableType.txt.
140
           Implementation Note: (1) For an expansion mapping (a Unicode
           character mapped to a sequence of collation elements), a
           terminator will not be added between collation elements, even if a
           Hangul syllable boundary exists there.  A terminator is added only
           at the position following the last collation element.

           (2) Non-conjoining Hangul letters (Compatibility Jamo, halfwidth
           Jamo, and enclosed letters) are not automatically terminated with
           a terminator primary weight.  For these characters, a terminator
           may need to be included in the collation element table beforehand.
151
152       ignoreChar
153       ignoreName
154           -- see 3.2.2 Variable Weighting, UTS #10.
155
           Makes the entry in the table completely ignorable; i.e. as if the
           weights were zero at every level.
158
159           Through "ignoreChar", any character matching "qr/$ignoreChar/" will
160           be ignored. Through "ignoreName", any character whose name (given
161           in the "table" file as a comment) matches "qr/$ignoreName/" will be
162           ignored.
163
164           E.g. when 'a' and 'e' are ignorable, 'element' is equal to 'lament'
165           (or 'lmnt').
166
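           The 'element'/'lament' case can be tried with a sketch like the
           following.  It assumes the default table (allkeys.txt) is
           installed; the anchored pattern is one possible way to restrict
           the match to the single-character entries "a" and "e".

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new(
    ignoreChar => qr/^[ae]\z/,   # entries consisting of exactly "a" or "e"
);

# Both strings reduce to "lmnt" once "a" and "e" are ignorable.
my $same = $collator->eq('element', 'lament');

# "lemon" reduces to "lmon", which is a different string.
my $diff = $collator->eq('element', 'lemon');

print "element/lament: ", ($same ? "equal" : "different"), "\n";
print "element/lemon:  ", ($diff ? "equal" : "different"), "\n";
```
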
167       katakana_before_hiragana
168           -- see 7.3.1 Tertiary Weight Table, UTS #10.
169
170           By default, hiragana is before katakana.  If the parameter is made
171           true, this is reversed.
172
           NOTE: This parameter simply assumes that any hiragana/katakana
           distinction occurs at level 3, and that the weights at level 3 are
           the same as those mentioned in 7.3.1, UTS #10.  If you define
           collation elements that violate this requirement, this parameter
           does not work correctly.
178
179       level
180           -- see 4.3 Form Sort Key, UTS #10.
181
           Sets the maximum level.  Levels higher than the specified one are
           ignored.
184
185             Level 1: alphabetic ordering
186             Level 2: diacritic ordering
187             Level 3: case ordering
188             Level 4: tie-breaking (e.g. in the case when variable is 'shifted')
189
             ex. level => 2,

           If omitted, the maximum is level 4.
193
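           The following sketch shows how the maximum level changes what
           counts as equal.  The variable names are chosen for this example
           only, and the default table is assumed to be available.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $level1 = Unicode::Collate->new(level => 1);  # letters only
my $level2 = Unicode::Collate->new(level => 2);  # letters and accents
my $level3 = Unicode::Collate->new(level => 3);  # letters, accents and case

my $accented = "re\x{301}sume\x{301}";           # "resume" with two acute accents

my @results = (
    $level1->eq('resume', $accented) ? 1 : 0,    # accents ignored at level 1
    $level2->eq('resume', $accented) ? 1 : 0,    # accents differ at level 2
    $level2->eq('resume', 'RESUME')  ? 1 : 0,    # case ignored at level 2
    $level3->eq('resume', 'RESUME')  ? 1 : 0,    # case differs at level 3
);
print "@results\n";    # 1 0 1 0
```
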
194       normalization
195           -- see 4.1 Normalize, UTS #10.
196
           If specified, strings are normalized before sort keys are
           prepared.

           Any form name that "Unicode::Normalize::normalize()" accepts can
           be given as $normalization_form.  Acceptable names include 'NFD',
           'NFC', 'NFKD', and 'NFKC'.  See "Unicode::Normalize::normalize()"
           for details.  If omitted, 'NFD' is used.
204
205           "normalization" is performed after "preprocess" (if defined).
206
           In addition, the special values "undef" and "prenormalized" can be
           used; they are not passed to "Unicode::Normalize::normalize()".

           If "undef" (not the string "undef") is passed explicitly as the
           value for this key, no normalization is carried out (this may make
           tailoring easier if normalization is not desired).  Under
           "(normalization => undef)", only contiguous contractions are
           resolved; e.g. even if "A-ring" (and "A-ring-cedilla") is ordered
           after "Z", "A-cedilla-ring" would be primary equal to "A".  In
           this respect, "(normalization => undef, preprocess => sub {
           NFD(shift) })" is not equivalent to "(normalization => 'NFD')".

           With "(normalization => 'prenormalized')", no normalization is
           performed, but non-contiguous contractions with combining
           characters are resolved.  Therefore "(normalization =>
           'prenormalized', preprocess => sub { NFD(shift) })" is equivalent
           to "(normalization => 'NFD')".  If the source strings are already
           normalized, "(normalization => 'prenormalized')" may save the time
           spent on normalization.

           Except under "(normalization => undef)", Unicode::Normalize is
           required (see also CAVEATS).
230
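           To see what normalization buys, the sketch below compares two
           canonically equivalent spellings of the same text.  It assumes
           the default table and Unicode::Normalize are available; the
           variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

# The same letter written with its two combining marks in either order:
# U+0323 is COMBINING DOT BELOW, U+0307 is COMBINING DOT ABOVE.
my $below_above = "s\x{323}\x{307}";
my $above_below = "s\x{307}\x{323}";

my $nfd  = Unicode::Collate->new();                        # default: 'NFD'
my $none = Unicode::Collate->new(normalization => undef);  # no normalization

my $equal_with_nfd     = $nfd->eq($below_above, $above_below);
my $equal_without_norm = $none->eq($below_above, $above_below);

print "NFD:   ", ($equal_with_nfd     ? "equal" : "different"), "\n";
print "undef: ", ($equal_without_norm ? "equal" : "different"), "\n";
```

           Under 'NFD' both strings are put into the same canonical order
           before weighting, so they compare equal; without normalization
           the marks are weighted in the order in which they appear.
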
231       overrideCJK
232           -- see 7.1 Derived Collation Elements, UTS #10.
233
           By default, CJK Unified Ideographs are ordered in Unicode
           codepoint order, but "CJK Unified Ideographs" (if "UCA_Version" is
           8 to 11, the range is "U+4E00..U+9FA5"; if "UCA_Version" is 14,
           the range is "U+4E00..U+9FBB") sort before "CJK Unified Ideographs
           Extension" (the ranges "U+3400..U+4DB5" and "U+20000..U+2A6D6").

           Through "overrideCJK", the ordering of CJK Unified Ideographs can
           be overridden.
242
           ex. CJK Unified Ideographs in the JIS code point order.

             overrideCJK => sub {
                 my $u = shift;             # get a Unicode codepoint
                 my $b = pack('n', $u);     # to UTF-16BE
                 my $s = your_unicode_to_sjis_converter($b); # convert
                 my $n = unpack('n', $s);   # convert sjis to short
                 [ $n, 0x20, 0x2, $u ];     # return the collation element
             },

           ex. ignores all CJK Unified Ideographs.

             overrideCJK => sub {()}, # CODEREF returning empty list

              # where ->eq("Pe\x{4E00}rl", "Perl") is true,
              # as U+4E00 is a CJK Unified Ideograph and thus ignorable.
259
260           If "undef" is passed explicitly as the value for this key, weights
261           for CJK Unified Ideographs are treated as undefined.  But
262           assignment of weight for CJK Unified Ideographs in table or "entry"
263           is still valid.
264
265       overrideHangul
266           -- see 7.1 Derived Collation Elements, UTS #10.
267
           By default, Hangul Syllables are decomposed into Hangul Jamo, even
           under "(normalization => undef)".  But the mapping of Hangul
           Syllables may be overridden.
271
272           This parameter works like "overrideCJK", so see there for examples.
273
274           If you want to override the mapping of Hangul Syllables, NFD, NFKD,
275           and FCD are not appropriate, since they will decompose Hangul
276           Syllables before overriding.
277
278           If "undef" is passed explicitly as the value for this key, weight
279           for Hangul Syllables is treated as undefined without decomposition
280           into Hangul Jamo.  But definition of weight for Hangul Syllables in
281           table or "entry" is still valid.
282
283       preprocess
284           -- see 5.1 Preprocessing, UTS #10.
285
           If specified, the coderef is used to preprocess each string before
           its sort key is formed.

           ex. dropping English articles, such as "a" or "the".  With this,
           "the pen" sorts before "a pencil".

                preprocess => sub {
                      my $str = shift;
                      $str =~ s/\b(?:an?|the)\s+//gi;
                      return $str;
                   },
297
298           "preprocess" is performed before "normalization" (if defined).
299
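           A complete, runnable version of the article-dropping idea might
           look like the sketch below.  The sample phrases are invented for
           the example, and the default table is assumed to be available.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new(
    preprocess => sub {
        my $str = shift;
        $str =~ s/\b(?:an?|the)\s+//gi;   # drop "a", "an", "the" and the spaces after
        return $str;
    },
);

# Keys are built from "pencil", "pen" and "apple"; the original
# strings are what sort() returns.
my @sorted = $collator->sort('a pencil', 'the pen', 'an apple');
print join(', ', @sorted), "\n";    # an apple, the pen, a pencil
```
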
300       rearrange
301           -- see 3.1.3 Rearrangement, UTS #10.
302
           Characters that are not coded in logical order and are to be
           rearranged.  If "UCA_Version" is 11 or lower, the default is:
306
307               rearrange => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4 ],
308
309           If you want to disallow any rearrangement, pass "undef" or "[]" (a
310           reference to empty list) as the value for this key.
311
312           If "UCA_Version" is equal to 14, default is "[]" (i.e. no
313           rearrangement).
314
           According to version 9 of UCA, this parameter should not be used,
           but at present no warning is issued.
317
318       table
319           -- see 3.2 Default Unicode Collation Element Table, UTS #10.
320
321           You can use another collation element table if desired.
322
           The table file should be located in the Unicode/Collate directory
           on @INC.  For example, if the filename is Foo.txt, the table file
           is searched for as Unicode/Collate/Foo.txt in @INC.

           By default, allkeys.txt (the filename of DUCET) is used.  If you
           prepare your own table file, a name other than allkeys.txt is
           preferable, to avoid a name conflict.
330
331           If "undef" is passed explicitly as the value for this key, no file
332           is read (but you can define collation elements via "entry").
333
           A typical way to define a collation element table without any
           table file:

              $onlyABC = Unicode::Collate->new(
                  table => undef,
                  entry => << 'ENTRIES',
           0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
           0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
           0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
           0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
           0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
           0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
           ENTRIES
               );
348
349           If "ignoreName" or "undefName" is used, character names should be
350           specified as a comment (following "#") on each line.
351
352       undefChar
353       undefName
354           -- see 6.3.4 Reducing the Repertoire, UTS #10.
355
           Undefines the collation element as if it were unassigned in the
           table.  This reduces the size of the table.  If an unassigned
           character appears in the string to be collated, the sort key is
           made from its codepoint as a single-character collation element,
           which is greater than any assigned collation element (unassigned
           characters sort among themselves in codepoint order).  However,
           it may be better to ignore characters that are unfamiliar to you
           and perhaps never used.
363
364           Through "undefChar", any character matching "qr/$undefChar/" will
365           be undefined. Through "undefName", any character whose name (given
366           in the "table" file as a comment) matches "qr/$undefName/" will be
367           undefined.
368
           ex. Collation weights for characters beyond the BMP are not stored
           in the object:

               undefChar => qr/[^\0-\x{fffd}]/,
373
374       upper_before_lower
375           -- see 6.6 Case Comparisons, UTS #10.
376
377           By default, lowercase is before uppercase.  If the parameter is
378           made true, this is reversed.
379
           NOTE: This parameter simply assumes that any lowercase/uppercase
           distinction occurs at level 3, and that the weights at level 3 are
           the same as those mentioned in 7.3.1, UTS #10.  If you define
           collation elements that differ from this requirement, this
           parameter does not work correctly.
385
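           A small sketch of the difference, using an invented list of
           letters and assuming the default table is available:

```perl
use strict;
use warnings;
use Unicode::Collate;

my @letters = qw(B a b A);

my $default     = Unicode::Collate->new();
my $upper_first = Unicode::Collate->new(upper_before_lower => 1);

my @default_order = $default->sort(@letters);      # lowercase first
my @upper_order   = $upper_first->sort(@letters);  # uppercase first

print "default:            @default_order\n";      # a A b B
print "upper_before_lower: @upper_order\n";        # A a B b
```

           Only the order within each letter changes; "a"/"A" still sort
           before "b"/"B" because case is a level 3 difference.
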
386       variable
387           -- see 3.2.2 Variable Weighting, UTS #10.
388
           This key specifies how variable collation elements are weighted;
           these are marked with an ASTERISK in the table (NOTE: Many
           punctuation marks and symbols are variable in allkeys.txt).

              variable => 'blanked', 'non-ignorable', 'shifted', or 'shift-trimmed'.

           These names are case-insensitive.  By default (if the parameter
           is omitted), 'shifted' is used.
397
              'Blanked'        Variable elements are made ignorable at levels 1 through 3;
                               considered at the 4th level.

              'Non-Ignorable'  Variable elements are not reset to ignorable.

              'Shifted'        Variable elements are made ignorable at levels 1 through 3;
                               their level 4 weight is replaced by the old level 1 weight.
                               The level 4 weight for non-variable elements is 0xFFFF.

              'Shift-Trimmed'  Same as 'shifted', but all FFFFs at the 4th level
                               are trimmed.
409
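           The practical difference between 'shifted' and 'non-ignorable' can
           be seen with a space, which is a variable element in
           allkeys.txt.  The sketch below assumes the default table is
           available; the sample phrases are made up for the example.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @phrases = ('dead', 'de luxe');

my $shifted       = Unicode::Collate->new();   # default: 'shifted'
my $non_ignorable = Unicode::Collate->new(variable => 'non-ignorable');

# 'shifted': the space is ignored at level 1, so "del..." follows "dea...".
my @shifted_order = $shifted->sort(@phrases);

# 'non-ignorable': the space has a primary weight lower than any letter.
my @non_ignorable_order = $non_ignorable->sort(@phrases);

# With 'shifted' and levels 1 to 3 only, the space makes no difference at all.
my $space_ignored = Unicode::Collate->new(level => 3)->eq('de luxe', 'deluxe');

print "shifted:       ", join(', ', @shifted_order), "\n";
print "non-ignorable: ", join(', ', @non_ignorable_order), "\n";
```
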
410   Methods for Collation
411       "@sorted = $Collator->sort(@not_sorted)"
412           Sorts a list of strings.
413
414       "$result = $Collator->cmp($a, $b)"
           Returns 1 (when $a is greater than $b) or 0 (when $a is equal to
           $b) or -1 (when $a is less than $b).
417
418       "$result = $Collator->eq($a, $b)"
419       "$result = $Collator->ne($a, $b)"
420       "$result = $Collator->lt($a, $b)"
421       "$result = $Collator->le($a, $b)"
422       "$result = $Collator->gt($a, $b)"
423       "$result = $Collator->ge($a, $b)"
           They work like the Perl operators of the same names.

              eq : whether $a is equal to $b.
              ne : whether $a is not equal to $b.
              lt : whether $a is less than $b.
              le : whether $a is less than $b or equal to $b.
              gt : whether $a is greater than $b.
              ge : whether $a is greater than $b or equal to $b.
432
433       "$sortKey = $Collator->getSortKey($string)"
434           -- see 4.3 Form Sort Key, UTS #10.
435
436           Returns a sort key.
437
           Comparing sort keys with a binary comparison gives the same result
           as comparing the original strings using UCA.
440
441              $Collator->getSortKey($a) cmp $Collator->getSortKey($b)
442
443                 is equivalent to
444
445              $Collator->cmp($a, $b)
446
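           One use of this equivalence is to compute each sort key once when
           sorting records by a field.  The sketch below is one way to do
           that; the record layout and names are invented for the example,
           and the default table is assumed to be available.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new();

my @people = (
    { name => 'Zoe',  age => 31 },
    { name => 'adam', age => 27 },
    { name => 'Bob',  age => 45 },
);

# Pair each record with its sort key, compare the keys with the
# built-in string "cmp", then unwrap the records again.
my @by_name =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ $collator->getSortKey($_->{name}), $_ ] }
    @people;

my @names = map { $_->{name} } @by_name;
print "@names\n";    # adam Bob Zoe
```

           A plain "sort { $a->{name} cmp $b->{name} }" would instead put
           "adam" last, since lowercase letters follow uppercase ones in
           code point order.
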
447       "$sortKeyForm = $Collator->viewSortKey($string)"
448           Converts a sorting key into its representation form.  If
449           "UCA_Version" is 8, the output is slightly different.
450
451              use Unicode::Collate;
452              my $c = Unicode::Collate->new();
453              print $c->viewSortKey("Perl"),"\n";
454
455              # output:
456              # [0B67 0A65 0B7F 0B03 | 0020 0020 0020 0020 | 0008 0002 0002 0002 | FFFF FFFF FFFF FFFF]
457              #  Level 1               Level 2               Level 3               Level 4
458
459   Methods for Searching
       DISCLAIMER: If the "preprocess" or "normalization" parameter is true
       for $Collator, calling these methods ("index", "match", "gmatch",
       "subst", "gsubst") croaks, as the position and the length might differ
       from those in the specified string.  (The "rearrange" and
       "hangul_terminator" parameters are ignored.)

       The "match", "gmatch", "subst", and "gsubst" methods work like "m//",
       "m//g", "s///", and "s///g", respectively, but they do not interpret
       any pattern; they match only a literal substring.
469
470       "$position = $Collator->index($string, $substring[, $position])"
471       "($position, $length) = $Collator->index($string, $substring[,
472       $position])"
473           If $substring matches a part of $string, returns the position of
474           the first occurrence of the matching part in scalar context; in
475           list context, returns a two-element list of the position and the
476           length of the matching part.
477
478           If $substring does not match any part of $string, returns "-1" in
479           scalar context and an empty list in list context.
480
           For example,

             my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
                                                # (normalization => undef) is REQUIRED.
             my $str = "Ich muss studieren Perl.";
             my $sub = "MUeSS";
             my $match;
             if (my($pos,$len) = $Collator->index($str, $sub)) {
                 $match = substr($str, $pos, $len);
             }

           puts "muss" in $match, since "muss" is primary equal to "MUeSS".
493
494       "$match_ref = $Collator->match($string, $substring)"
495       "($match)   = $Collator->match($string, $substring)"
496           If $substring matches a part of $string, in scalar context, returns
497           a reference to the first occurrence of the matching part
498           ($match_ref is always true if matches, since every reference is
499           true); in list context, returns the first occurrence of the
500           matching part.
501
502           If $substring does not match any part of $string, returns "undef"
503           in scalar context and an empty list in list context.
504
505           e.g.
506
507               if ($match_ref = $Collator->match($str, $sub)) { # scalar context
508                   print "matches [$$match_ref].\n";
509               } else {
510                   print "doesn't match.\n";
511               }
512
513                or
514
515               if (($match) = $Collator->match($str, $sub)) { # list context
516                   print "matches [$match].\n";
517               } else {
518                   print "doesn't match.\n";
519               }
520
521       "@match = $Collator->gmatch($string, $substring)"
522           If $substring matches a part of $string, returns all the matching
523           parts (or matching count in scalar context).
524
525           If $substring does not match any part of $string, returns an empty
526           list.
527
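           A short sketch of both calling contexts follows.  The sample text
           is invented, and the default table is assumed to be available.
           As for the other searching methods, "normalization => undef" is
           needed.

```perl
use strict;
use warnings;
use Unicode::Collate;

# level => 1 makes the search ignore case (and accents).
my $collator = Unicode::Collate->new(normalization => undef, level => 1);

my $text = "Camel donkey camel CAMEL horse";

my @matches = $collator->gmatch($text, "camel");   # list context: matching parts
my $count   = $collator->gmatch($text, "camel");   # scalar context: their count

print "found $count: @matches\n";
```

           Each element of @matches is the text as it appears in $text, not
           the search string, so differences in case are preserved.
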
528       "$count = $Collator->subst($string, $substring, $replacement)"
           If $substring matches a part of $string, the first occurrence of
           the matching part is replaced by $replacement ($string is
           modified) and $count (always equal to 1) is returned.
532
533           $replacement can be a "CODEREF", taking the matching part as an
534           argument, and returning a string to replace the matching part (a
535           bit similar to "s/(..)/$coderef->($1)/e").
536
537       "$count = $Collator->gsubst($string, $substring, $replacement)"
           If $substring matches a part of $string, all occurrences of the
           matching part are replaced by $replacement ($string is modified)
           and $count is returned.
541
542           $replacement can be a "CODEREF", taking the matching part as an
543           argument, and returning a string to replace the matching part (a
544           bit similar to "s/(..)/$coderef->($1)/eg").
545
546           e.g.
547
548             my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
549                                                # (normalization => undef) is REQUIRED.
550             my $str = "Camel donkey zebra came\x{301}l CAMEL horse cAm\0E\0L...";
551             $Collator->gsubst($str, "camel", sub { "<b>$_[0]</b>" });
552
553             # now $str is "<b>Camel</b> donkey zebra <b>came\x{301}l</b> <b>CAMEL</b> horse <b>cAm\0E\0L</b>...";
554             # i.e., all the camels are made bold-faced.
555
556   Other Methods
557       "%old_tailoring = $Collator->change(%new_tailoring)"
           Changes the values of the specified keys and returns the changed
           part.
559
560               $Collator = Unicode::Collate->new(level => 4);
561
562               $Collator->eq("perl", "PERL"); # false
563
564               %old = $Collator->change(level => 2); # returns (level => 4).
565
566               $Collator->eq("perl", "PERL"); # true
567
568               $Collator->change(%old); # returns (level => 2).
569
570               $Collator->eq("perl", "PERL"); # false
571
           Not all "(key,value)" pairs are allowed to be changed.  See also
           @Unicode::Collate::ChangeOK and @Unicode::Collate::ChangeNG.

           In scalar context, returns the modified collator (the same object,
           not a clone of the original).
577
578               $Collator->change(level => 2)->eq("perl", "PERL"); # true
579
580               $Collator->eq("perl", "PERL"); # true; now max level is 2nd.
581
582               $Collator->change(level => 4)->eq("perl", "PERL"); # false
583
584       "$version = $Collator->version()"
585           Returns the version number (a string) of the Unicode Standard which
586           the "table" file used by the collator object is based on.  If the
587           table does not include a version line (starting with @version),
588           returns "unknown".
589
590       "UCA_Version()"
591           Returns the tracking version number of UTS #10 this module
592           consults.
593
594       "Base_Unicode_Version()"
595           Returns the version number of UTS #10 this module consults.
596

EXPORT

598       No method will be exported.
599

INSTALL

601       Though this module can be used without any "table" file, to use this
602       module easily, it is recommended to install a table file in the UCA
603       format, by copying it under the directory <a place in
604       @INC>/Unicode/Collate.
605
606       The most preferable one is "The Default Unicode Collation Element
607       Table" (aka DUCET), available from the Unicode Consortium's website:
608
609          http://www.unicode.org/Public/UCA/
610
611          http://www.unicode.org/Public/UCA/latest/allkeys.txt (latest version)
612
613       If DUCET is not installed, it is recommended to copy the file from
614       http://www.unicode.org/Public/UCA/latest/allkeys.txt to <a place in
615       @INC>/Unicode/Collate/allkeys.txt manually.
616

CAVEATS

618       Normalization
619           Use of the "normalization" parameter requires the
620           Unicode::Normalize module (see Unicode::Normalize).
621
           If you do not need it (say, when you do not need to handle any
           combining characters), assign "normalization => undef" explicitly.
624
625           -- see 6.5 Avoiding Normalization, UTS #10.
626
627       Conformance Test
628           The Conformance Test for the UCA is available under
629           <http://www.unicode.org/Public/UCA/>.
630
631           For CollationTest_SHIFTED.txt, a collator via
632           "Unicode::Collate->new( )" should be used; for
633           CollationTest_NON_IGNORABLE.txt, a collator via
634           "Unicode::Collate->new(variable => "non-ignorable", level => 3)".
635
636           Unicode::Normalize is required to try The Conformance Test.
637

AUTHOR, COPYRIGHT AND LICENSE

639       The Unicode::Collate module for perl was written by SADAHIRO Tomoyuki,
640       <SADAHIRO@cpan.org>. This module is Copyright(C) 2001-2005, SADAHIRO
641       Tomoyuki. Japan. All rights reserved.
642
643       This module is free software; you can redistribute it and/or modify it
644       under the same terms as Perl itself.
645
646       The file Unicode/Collate/allkeys.txt was copied directly from
647       <http://www.unicode.org/Public/UCA/4.1.0/allkeys.txt>.  This file is
648       Copyright (c) 1991-2005 Unicode, Inc. All rights reserved.  Distributed
649       under the Terms of Use in <http://www.unicode.org/copyright.html>.
650