Unicode::Collate(3pm)

Unicode::Collate(3pm)  Perl Programmers Reference Guide  Unicode::Collate(3pm)

NAME

       Unicode::Collate - Unicode Collation Algorithm

SYNOPSIS

         use Unicode::Collate;

         # construct
         $Collator = Unicode::Collate->new(%tailoring);

         # sort
         @sorted = $Collator->sort(@not_sorted);

         # compare
         $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

         # If %tailoring is false (i.e. empty),
         # $Collator should do the default collation.

DESCRIPTION

       This module is an implementation of Unicode Technical Standard #10
       (a.k.a. UTS #10) - Unicode Collation Algorithm (a.k.a. UCA).

   Constructor and Tailoring
       The "new" method returns a collator object.

          $Collator = Unicode::Collate->new(
             UCA_Version => $UCA_Version,
             alternate => $alternate, # deprecated: use of 'variable' is recommended.
             backwards => $levelNumber, # or \@levelNumbers
             entry => $element,
             hangul_terminator => $term_primary_weight,
             ignoreName => qr/$ignoreName/,
             ignoreChar => qr/$ignoreChar/,
             katakana_before_hiragana => $bool,
             level => $collationLevel,
             normalization  => $normalization_form,
             overrideCJK => \&overrideCJK,
             overrideHangul => \&overrideHangul,
             preprocess => \&preprocess,
             rearrange => \@charList,
             table => $filename,
             undefName => qr/$undefName/,
             undefChar => qr/$undefChar/,
             upper_before_lower => $bool,
             variable => $variable,
          );

       UCA_Version
           If the tracking version number of UCA is given, the behavior of
           that tracking version is emulated when collating.  If omitted, the
           return value of "UCA_Version()" is used.  "UCA_Version()" should
           return the latest tracking version supported.

           Supported tracking versions: 8, 9, 11, and 14.

                UCA       Unicode Standard         DUCET (@version)
                ---------------------------------------------------
                 8              3.1                3.0.1 (3.0.1d9)
                 9     3.1 with Corrigendum 3      3.1.1 (3.1.1)
                11              4.0                4.0.0 (4.0.0)
                14             4.1.0               4.1.0 (4.1.0)

           Note: Recent UTS #10 renames "Tracking Version" to "Revision."

       alternate
           -- see 3.2.2 Alternate Weighting, version 8 of UTS #10

           For backward compatibility, "alternate" (old name) can be used as
           an alias for "variable".

       backwards
           -- see 3.1.2 French Accents, UTS #10.

                backwards => $levelNumber or \@levelNumbers

           Weights at the given level(s) are compared in reverse order; e.g.
           level 2 (diacritic ordering) in French.  If omitted, all levels
           are compared forwards.

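           As a small illustration of the "backwards" parameter, the
           following sketch sorts four French words with and without
           "backwards => 2".  It assumes that the default collation element
           table (DUCET) is available; the word list and variable names are
           chosen for this example only.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Illustrative word list: cote, c^ote, cot'e, c^ot'e (accents written as escapes).
my @words = ("cot\x{E9}", "c\x{F4}te", "cote", "c\x{F4}t\x{E9}");

# Default: secondary (accent) weights are compared from left to right.
my @forwards = Unicode::Collate->new->sort(@words);

# French style: secondary weights are compared from right to left.
my @backwards = Unicode::Collate->new(backwards => 2)->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "forwards:  @forwards\n";
print "backwards: @backwards\n";
```

           With forward comparison the word with the unaccented "o" and the
           accented "e" comes second; with "backwards => 2" the last accent
           in the word is the most significant, so it comes third.
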
       entry
           -- see 3.1 Linguistic Features; 3.2.1 File Format, UTS #10.

           If the same character (or a sequence of characters) exists in the
           collation element table through "table", its mapping to collation
           elements is overridden.  If it does not exist, the mapping is
           defined additionally.

               entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
           0063 0068 ; [.0E6A.0020.0002.0063] # ch
           0043 0068 ; [.0E6A.0020.0007.0043] # Ch
           0043 0048 ; [.0E6A.0020.0008.0043] # CH
           006C 006C ; [.0F4C.0020.0002.006C] # ll
           004C 006C ; [.0F4C.0020.0007.004C] # Ll
           004C 004C ; [.0F4C.0020.0008.004C] # LL
           00F1      ; [.0F7B.0020.0002.00F1] # n-tilde
           006E 0303 ; [.0F7B.0020.0002.00F1] # n-tilde
           00D1      ; [.0F7B.0020.0008.00D1] # N-tilde
           004E 0303 ; [.0F7B.0020.0008.00D1] # N-tilde
           ENTRY

               entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
           00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae ligature as <a><e>
           00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE ligature as <A><E>
           ENTRY

           NOTE: The code point in the UCA file format (before ';') must be a
           Unicode code point (given in hexadecimal), not a native code
           point.  So 0063 always denotes "U+0063", not the native character
           "\x63".

           Weighting may vary depending on the collation element table.  So
           ensure that the weights defined in "entry" are consistent with
           those in the collation element table loaded via "table".

           In DUCET v4.0.0, the primary weight of "C" is 0E60 and that of "D"
           is 0E6D.  So setting the primary weight of "CH" to 0E6A (a value
           between 0E60 and 0E6D) makes the ordering "C < CH < D".  Strictly
           speaking, DUCET already has some characters between "C" and "D":
           "small capital C" ("U+1D04") with primary weight 0E64,
           "c-hook/C-hook" ("U+0188/U+0187") with 0E65, and "c-curl"
           ("U+0255") with 0E69.  Thus the primary weight 0E6A for "CH"
           orders "CH" between "c-curl" and "D".

       hangul_terminator
           -- see 7.1.4 Trailing Weights, UTS #10.

           If a true value is given (non-zero but should be positive), it will
           be added as a terminator primary weight to the end of every
           standard Hangul syllable.  Secondary and any higher weights for
           the terminator are set to zero.  If the value is false or the
           "hangul_terminator" key does not exist, no terminator weights are
           inserted.

           Boundaries of Hangul syllables are determined according to
           conjoining Jamo behavior in the Unicode Standard and
           HangulSyllableType.txt.

           Implementation Note: (1) For expansion mapping (a Unicode character
           mapped to a sequence of collation elements), a terminator will not
           be added between collation elements, even if a Hangul syllable
           boundary exists there.  A terminator is added only immediately
           after the last collation element.

           (2) Non-conjoining Hangul letters (Compatibility Jamo, halfwidth
           Jamo, and enclosed letters) are not automatically terminated with a
           terminator primary weight.  These characters may need a terminator
           included in a collation element table beforehand.

       ignoreChar
       ignoreName
           -- see 3.2.2 Variable Weighting, UTS #10.

           Makes the entry in the table completely ignorable; i.e. as if the
           weights were zero at all levels.

           Through "ignoreChar", any character matching "qr/$ignoreChar/" will
           be ignored.  Through "ignoreName", any character whose name (given
           in the "table" file as a comment) matches "qr/$ignoreName/" will be
           ignored.

           E.g. when 'a' and 'e' are ignorable, 'element' is equal to 'lament'
           (or 'lmnt').

       katakana_before_hiragana
           -- see 7.3.1 Tertiary Weight Table, UTS #10.

           By default, hiragana is before katakana.  If the parameter is made
           true, this is reversed.

           NOTE: This parameter simplemindedly assumes that any
           hiragana/katakana distinctions occur at level 3, and that their
           weights at level 3 are the same as those mentioned in 7.3.1, UTS
           #10.  If you define collation elements that violate this
           requirement, this parameter does not work correctly.

       level
           -- see 4.3 Form Sort Key, UTS #10.

           Set the maximum level.  Any levels higher than the specified one
           are ignored.

             Level 1: alphabetic ordering
             Level 2: diacritic ordering
             Level 3: case ordering
             Level 4: tie-breaking (e.g. in the case when variable is 'shifted')

             ex. level => 2,

           If omitted, the maximum is the 4th.

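           The effect of the levels listed above can be checked with a short
           sketch such as the following.  It assumes the default collation
           element table; the test strings and variable names are only
           illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $level1 = Unicode::Collate->new(level => 1);  # letters only
my $level2 = Unicode::Collate->new(level => 2);  # letters and accents
my $level3 = Unicode::Collate->new(level => 3);  # letters, accents, and case

my $plain    = "resume";
my $accented = "r\x{E9}sum\x{E9}";   # "resume" with two e-acute letters
my $upper    = "RESUME";

# Level 1 ignores both accent and case differences.
my $l1_accent = $level1->eq($plain, $accented) ? 1 : 0;
# Level 2 sees accents but still ignores case.
my $l2_accent = $level2->eq($plain, $accented) ? 1 : 0;
my $l2_case   = $level2->eq($plain, $upper)    ? 1 : 0;
# Level 3 also distinguishes case.
my $l3_case   = $level3->eq($plain, $upper)    ? 1 : 0;

print "level 1, accent ignored: $l1_accent\n";
print "level 2, accent ignored: $l2_accent, case ignored: $l2_case\n";
print "level 3, case ignored:   $l3_case\n";
```
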
       normalization
           -- see 4.1 Normalize, UTS #10.

           If specified, strings are normalized before preparation of sort
           keys (the normalization is executed after preprocess).

           Any form name that "Unicode::Normalize::normalize()" accepts may
           be given as $normalization_form.  Acceptable names include 'NFD',
           'NFC', 'NFKD', and 'NFKC'.  See "Unicode::Normalize::normalize()"
           for details.  If omitted, 'NFD' is used.

           "normalization" is performed after "preprocess" (if defined).

           In addition, the special values "undef" and "prenormalized" can be
           used, although they are unrelated to
           "Unicode::Normalize::normalize()".

           If "undef" (not a string "undef") is passed explicitly as the value
           for this key, no normalization is carried out (this may make
           tailoring easier if normalization is not desired).  Under
           "(normalization => undef)", only contiguous contractions are
           resolved; e.g. even if "A-ring" (and "A-ring-cedilla") is ordered
           after "Z", "A-cedilla-ring" would be primary equal to "A".  In this
           respect, "(normalization => undef, preprocess => sub { NFD(shift) })"
           is not equivalent to "(normalization => 'NFD')".

           In the case of "(normalization => "prenormalized")", no
           normalization is performed, but non-contiguous contractions
           with combining characters are resolved.  Therefore "(normalization
           => 'prenormalized', preprocess => sub { NFD(shift) })" is
           equivalent to "(normalization => 'NFD')".  If source strings are
           already properly normalized, "(normalization => 'prenormalized')"
           may save the time needed for normalization.

           Except under "(normalization => undef)", Unicode::Normalize is
           required (see also CAVEATS).

       overrideCJK
           -- see 7.1 Derived Collation Elements, UTS #10.

           By default, CJK Unified Ideographs are ordered in Unicode codepoint
           order, but "CJK Unified Ideographs" (if "UCA_Version" is 8 to 11,
           its range is "U+4E00..U+9FA5"; if "UCA_Version" is 14, its range is
           "U+4E00..U+9FBB") are less than "CJK Unified Ideographs
           Extension" (its range is "U+3400..U+4DB5" and "U+20000..U+2A6D6").

           Through "overrideCJK", the ordering of CJK Unified Ideographs can
           be overridden.

           ex. CJK Unified Ideographs in the JIS code point order.

             overrideCJK => sub {
                 my $u = shift;             # get a Unicode codepoint
                 my $b = pack('n', $u);     # to UTF-16BE
                 my $s = your_unicode_to_sjis_converter($b); # convert
                 my $n = unpack('n', $s);   # convert sjis to short
                 [ $n, 0x20, 0x2, $u ];     # return the collation element
             },

           ex. ignores all CJK Unified Ideographs.

             overrideCJK => sub {()}, # CODEREF returning empty list

              # where ->eq("Pe\x{4E00}rl", "Perl") is true
              # as U+4E00 is a CJK Unified Ideograph and is therefore ignorable.

           If "undef" is passed explicitly as the value for this key, weights
           for CJK Unified Ideographs are treated as undefined.  But
           assignment of weights for CJK Unified Ideographs in the table or
           "entry" is still valid.

       overrideHangul
           -- see 7.1 Derived Collation Elements, UTS #10.

           By default, Hangul Syllables are decomposed into Hangul Jamo, even
           if "(normalization => undef)".  But the mapping of Hangul Syllables
           may be overridden.

           This parameter works like "overrideCJK", so see there for examples.

           If you want to override the mapping of Hangul Syllables, NFD, NFKD,
           and FCD are not appropriate, since they will decompose Hangul
           Syllables before overriding.

           If "undef" is passed explicitly as the value for this key, the
           weight for Hangul Syllables is treated as undefined without
           decomposition into Hangul Jamo.  But definition of weights for
           Hangul Syllables in the table or "entry" is still valid.

       preprocess
           -- see 5.1 Preprocessing, UTS #10.

           If specified, the coderef is used to preprocess before the
           formation of sort keys.

           ex. dropping English articles, such as "a" or "the".  Then, "the
           pen" is before "a pencil".

                preprocess => sub {
                      my $str = shift;
                      $str =~ s/\b(?:an?|the)\s+//gi;
                      return $str;
                   },

           "preprocess" is performed before "normalization" (if defined).

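           The article-dropping example can be tried out as a complete
           script.  This is a minimal sketch that assumes the default
           collation element table; the two test strings are the ones
           mentioned above and the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @titles = ("the pen", "a pencil");

# Without preprocessing, the articles take part in the comparison.
my @plain = Unicode::Collate->new->sort(@titles);

# With preprocessing, leading articles are removed before sort keys are
# formed; the strings returned by sort() are still the original ones.
my $no_articles = Unicode::Collate->new(
    preprocess => sub {
        my $str = shift;
        $str =~ s/\b(?:an?|the)\s+//gi;
        return $str;
    },
);
my @stripped = $no_articles->sort(@titles);

print "default:    ", join(' | ', @plain), "\n";
print "preprocess: ", join(' | ', @stripped), "\n";
```
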
       rearrange
           -- see 3.1.3 Rearrangement, UTS #10.

           Characters that are not coded in logical order and are to be
           rearranged.  If "UCA_Version" is less than or equal to 11, the
           default is:

               rearrange => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4 ],

           If you want to disallow any rearrangement, pass "undef" or "[]" (a
           reference to an empty list) as the value for this key.

           If "UCA_Version" is equal to 14, the default is "[]" (i.e. no
           rearrangement).

           According to version 9 of UCA, this parameter shall not be
           used; but no warning is issued at present.

       table
           -- see 3.2 Default Unicode Collation Element Table, UTS #10.

           You can use another collation element table if desired.

           The table file should be located in the Unicode/Collate directory
           on @INC.  Say, if the filename is Foo.txt, the table file is
           searched for as Unicode/Collate/Foo.txt in @INC.

           By default, allkeys.txt (as the filename of DUCET) is used.  If you
           prepare your own table file, a name other than allkeys.txt is
           better, to avoid a namespace conflict.

           If "undef" is passed explicitly as the value for this key, no file
           is read (but you can define collation elements via "entry").

           A typical way to define a collation element table without any
           table file:

              $onlyABC = Unicode::Collate->new(
                  table => undef,
                  entry => << 'ENTRIES',
           0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
           0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
           0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
           0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
           0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
           0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
           ENTRIES
               );

           If "ignoreName" or "undefName" is used, character names should be
           specified as a comment (following "#") on each line.

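           A collator built this way can be used like any other.  The sketch
           below repeats the six entries as a standalone script (with the
           here-document body starting in the first column, as Perl requires
           for an unindented terminator) and sorts a few letters; the input
           list is illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

# No table file is read; only the six entries below are defined.
my $onlyABC = Unicode::Collate->new(
    table => undef,
    entry => <<'ENTRIES',
0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
ENTRIES
);

# Primary weights order a < b < c; the tertiary weight 0002 (lowercase)
# is smaller than 0008 (uppercase), so each small letter comes first.
my @sorted = $onlyABC->sort(qw(c B a C b A));
print "@sorted\n";
```
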
       undefChar
       undefName
           -- see 6.3.4 Reducing the Repertoire, UTS #10.

           Undefines the collation element as if it were unassigned in the
           table.  This reduces the size of the table.  If an unassigned
           character appears in the string to be collated, the sort key is
           made from its codepoint as a single-character collation element,
           and it is greater than any assigned collation element (unassigned
           characters are ordered among themselves in codepoint order).  But
           it would be better to ignore characters that are unfamiliar to
           you and probably never used.

           Through "undefChar", any character matching "qr/$undefChar/" will
           be undefined.  Through "undefName", any character whose name (given
           in the "table" file as a comment) matches "qr/$undefName/" will be
           undefined.

           ex. Collation weights for beyond-BMP characters are not stored in
           the object:

               undefChar => qr/[^\0-\x{fffd}]/,

       upper_before_lower
           -- see 6.6 Case Comparisons, UTS #10.

           By default, lowercase is before uppercase.  If the parameter is
           made true, this is reversed.

           NOTE: This parameter simplemindedly assumes that any
           lowercase/uppercase distinctions occur at level 3, and that their
           weights at level 3 are the same as those mentioned in 7.3.1, UTS
           #10.  If you define collation elements that differ from this
           requirement, this parameter does not work correctly.

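           A minimal sketch of the difference, assuming the default
           collation element table (the letters used are illustrative):

```perl
use strict;
use warnings;
use Unicode::Collate;

my $lower_first = Unicode::Collate->new;                          # default
my $upper_first = Unicode::Collate->new(upper_before_lower => 1);

# The two strings differ only in case, so the result is decided at level 3.
my $default_cmp  = $lower_first->cmp("a", "A");
my $reversed_cmp = $upper_first->cmp("a", "A");

print "default:            $default_cmp\n";
print "upper_before_lower: $reversed_cmp\n";
```

           The first comparison yields -1 ("a" before "A") and the second
           yields 1.
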
       variable
           -- see 3.2.2 Variable Weighting, UTS #10.

           This key selects the variable weighting for variable collation
           elements, which are marked with an ASTERISK in the table (NOTE:
           Many punctuation marks and symbols are variable in allkeys.txt).

              variable => 'blanked', 'non-ignorable', 'shifted', or 'shift-trimmed'.

           These names are case-insensitive.  By default (if specification is
           omitted), 'shifted' is adopted.

              'Blanked'        Variable elements are made ignorable at levels 1 through 3;
                               considered at the 4th level.

              'Non-Ignorable'  Variable elements are not reset to ignorable.

              'Shifted'        Variable elements are made ignorable at levels 1 through 3;
                               their level 4 weight is replaced by the old level 1 weight.
                               Level 4 weight for Non-Variable elements is 0xFFFF.

              'Shift-Trimmed'  Same as 'shifted', but all FFFF's at the 4th level
                               are trimmed.

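           The following sketch contrasts 'shifted' with 'non-ignorable' on
           strings that differ by a space, which is a variable element in
           DUCET.  It assumes the default collation element table; the word
           list is modelled on the examples in UTS #10 and the variable names
           are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = ("deluge", "de luge", "death");

# 'shifted' (the default): the space is ignored at levels 1 to 3 and only
# breaks the tie between "de luge" and "deluge" at level 4.
my @shifted = Unicode::Collate->new(variable => 'shifted')->sort(@words);

# 'non-ignorable': the space keeps its own (low) primary weight, so
# "de luge" sorts before every word that continues with a letter.
my @non_ignorable =
    Unicode::Collate->new(variable => 'non-ignorable')->sort(@words);

print "shifted:       ", join(' | ', @shifted), "\n";
print "non-ignorable: ", join(' | ', @non_ignorable), "\n";
```
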
   Methods for Collation
       "@sorted = $Collator->sort(@not_sorted)"
           Sorts a list of strings.

       "$result = $Collator->cmp($a, $b)"
           Returns 1 (when $a is greater than $b) or 0 (when $a is equal to
           $b) or -1 (when $a is less than $b).

       "$result = $Collator->eq($a, $b)"
       "$result = $Collator->ne($a, $b)"
       "$result = $Collator->lt($a, $b)"
       "$result = $Collator->le($a, $b)"
       "$result = $Collator->gt($a, $b)"
       "$result = $Collator->ge($a, $b)"
           They work like the operators of the same names.

              eq : whether $a is equal to $b.
              ne : whether $a is not equal to $b.
              lt : whether $a is less than $b.
              le : whether $a is less than $b or equal to $b.
              gt : whether $a is greater than $b.
              ge : whether $a is greater than $b or equal to $b.

       "$sortKey = $Collator->getSortKey($string)"
           -- see 4.3 Form Sort Key, UTS #10.

           Returns a sort key.

           Comparing sort keys with a binary comparison gives the same
           result as comparing the strings using UCA.

              $Collator->getSortKey($a) cmp $Collator->getSortKey($b)

                 is equivalent to

              $Collator->cmp($a, $b)

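           Because of this equivalence, sort keys can be computed once per
           string and reused, which may help when the same strings are
           compared many times.  The sketch below assumes the default
           collation element table; the word list and the %key cache are
           illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new;
my @words    = ("banana", "Apple", "cherry", "apple");

# Compute each sort key once, then sort with the built-in binary cmp.
my %key    = map { $_ => $Collator->getSortKey($_) } @words;
my @by_key = sort { $key{$a} cmp $key{$b} } @words;

# The same ordering obtained directly from the collator.
my @direct = $Collator->sort(@words);

print "by sort key: @by_key\n";
print "direct:      @direct\n";
```
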
       "$sortKeyForm = $Collator->viewSortKey($string)"
           Converts the sort key of $string into a human-readable form.  If
           "UCA_Version" is 8, the output is slightly different.

              use Unicode::Collate;
              my $c = Unicode::Collate->new();
              print $c->viewSortKey("Perl"),"\n";

              # output:
              # [0B67 0A65 0B7F 0B03 | 0020 0020 0020 0020 | 0008 0002 0002 0002 | FFFF FFFF FFFF FFFF]
              #  Level 1               Level 2               Level 3               Level 4

   Methods for Searching
       DISCLAIMER: If the "preprocess" or "normalization" parameter is true
       for $Collator, calling these methods ("index", "match", "gmatch",
       "subst", "gsubst") croaks, because the position and the length might
       differ from those in the specified string.  (The "rearrange" and
       "hangul_terminator" parameters are ignored.)

       The "match", "gmatch", "subst", "gsubst" methods work like "m//",
       "m//g", "s///", "s///g", respectively, but they do not interpret
       patterns; they match only a literal substring.

       "$position = $Collator->index($string, $substring[, $position])"
       "($position, $length) = $Collator->index($string, $substring[, $position])"
           If $substring matches a part of $string, returns the position of
           the first occurrence of the matching part in scalar context; in
           list context, returns a two-element list of the position and the
           length of the matching part.

           If $substring does not match any part of $string, returns "-1" in
           scalar context and an empty list in list context.

           e.g. you say

             my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
                                                # (normalization => undef) is REQUIRED.
             my $str = "Ich muß studieren Perl.";
             my $sub = "MÜSS";
             my $match;
             if (my($pos,$len) = $Collator->index($str, $sub)) {
                 $match = substr($str, $pos, $len);
             }

           and get "muß" in $match since "muß" is primary equal to
           "MÜSS".

       "$match_ref = $Collator->match($string, $substring)"
       "($match)   = $Collator->match($string, $substring)"
           If $substring matches a part of $string, in scalar context, returns
           a reference to the first occurrence of the matching part
           ($match_ref is always true if there is a match, since every
           reference is true); in list context, returns the first occurrence
           of the matching part.

           If $substring does not match any part of $string, returns "undef"
           in scalar context and an empty list in list context.

           e.g.

               if ($match_ref = $Collator->match($str, $sub)) { # scalar context
                   print "matches [$$match_ref].\n";
               } else {
                   print "doesn't match.\n";
               }

                or

               if (($match) = $Collator->match($str, $sub)) { # list context
                   print "matches [$match].\n";
               } else {
                   print "doesn't match.\n";
               }

       "@match = $Collator->gmatch($string, $substring)"
           If $substring matches a part of $string, returns all the matching
           parts (or matching count in scalar context).

           If $substring does not match any part of $string, returns an empty
           list.

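           A short sketch of both contexts, assuming the default collation
           element table.  The collator is built with "normalization =>
           undef" as the searching methods require, and "level => 1" makes
           the search case-insensitive; the sample text is illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new(normalization => undef, level => 1);
my $text     = "Camel donkey CAMEL zebra camel";

# List context: the matching parts as they appear in $text.
my @hits = $Collator->gmatch($text, "camel");

# Scalar context: the number of matches.
my $count = $Collator->gmatch($text, "camel");

print "found $count: @hits\n";
```
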
       "$count = $Collator->subst($string, $substring, $replacement)"
           If $substring matches a part of $string, the first occurrence of
           the matching part is replaced by $replacement ($string is modified)
           and $count is returned (always equal to 1).

           $replacement can be a "CODEREF", taking the matching part as an
           argument, and returning a string to replace the matching part (a
           bit similar to "s/(..)/$coderef->($1)/e").

       "$count = $Collator->gsubst($string, $substring, $replacement)"
           If $substring matches a part of $string, all the occurrences of the
           matching part are replaced by $replacement ($string is modified)
           and $count is returned.

           $replacement can be a "CODEREF", taking the matching part as an
           argument, and returning a string to replace the matching part (a
           bit similar to "s/(..)/$coderef->($1)/eg").

           e.g.

             my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
                                                # (normalization => undef) is REQUIRED.
             my $str = "Camel donkey zebra came\x{301}l CAMEL horse cAm\0E\0L...";
             $Collator->gsubst($str, "camel", sub { "<b>$_[0]</b>" });

             # now $str is "<b>Camel</b> donkey zebra <b>came\x{301}l</b> <b>CAMEL</b> horse <b>cAm\0E\0L</b>...";
             # i.e., all the camels are made bold-faced.

   Other Methods
       "%old_tailoring = $Collator->change(%new_tailoring)"
           Changes the values of the specified keys and returns the changed
           part.

               $Collator = Unicode::Collate->new(level => 4);

               $Collator->eq("perl", "PERL"); # false

               %old = $Collator->change(level => 2); # returns (level => 4).

               $Collator->eq("perl", "PERL"); # true

               $Collator->change(%old); # returns (level => 2).

               $Collator->eq("perl", "PERL"); # false

           Not all "(key, value)" pairs can be changed.  See also
           @Unicode::Collate::ChangeOK and @Unicode::Collate::ChangeNG.

           In scalar context, returns the modified collator (but it is not
           a clone of the original).

               $Collator->change(level => 2)->eq("perl", "PERL"); # true

               $Collator->eq("perl", "PERL"); # true; now max level is 2nd.

               $Collator->change(level => 4)->eq("perl", "PERL"); # false

       "$version = $Collator->version()"
           Returns the version number (a string) of the Unicode Standard which
           the "table" file used by the collator object is based on.  If the
           table does not include a version line (starting with @version),
           returns "unknown".

       "UCA_Version()"
           Returns the tracking version number of UTS #10 this module
           consults.

       "Base_Unicode_Version()"
           Returns the version number of UTS #10 this module consults.

EXPORT

       No method will be exported.

INSTALL

       Though this module can be used without any "table" file, it is
       easier to use with a table file in the UCA format installed, by
       copying it under the directory <a place in @INC>/Unicode/Collate.

       The preferred one is "The Default Unicode Collation Element
       Table" (aka DUCET), available from the Unicode Consortium's website:

          http://www.unicode.org/Public/UCA/

          http://www.unicode.org/Public/UCA/latest/allkeys.txt (latest version)

       If DUCET is not installed, it is recommended to copy the file from
       http://www.unicode.org/Public/UCA/latest/allkeys.txt to <a place in
       @INC>/Unicode/Collate/allkeys.txt manually.

CAVEATS

       Normalization
           Use of the "normalization" parameter requires the
           Unicode::Normalize module (see Unicode::Normalize).

           If you do not need it (say, when you do not need to handle any
           combining characters), assign "normalization => undef" explicitly.

           -- see 6.5 Avoiding Normalization, UTS #10.

       Conformance Test
           The Conformance Test for the UCA is available under
           <http://www.unicode.org/Public/UCA/>.

           For CollationTest_SHIFTED.txt, a collator via
           "Unicode::Collate->new( )" should be used; for
           CollationTest_NON_IGNORABLE.txt, a collator via
           "Unicode::Collate->new(variable => "non-ignorable", level => 3)".

           Unicode::Normalize is required to run the Conformance Test.

AUTHOR, COPYRIGHT AND LICENSE

       The Unicode::Collate module for perl was written by SADAHIRO Tomoyuki,
       <SADAHIRO@cpan.org>. This module is Copyright(C) 2001-2005, SADAHIRO
       Tomoyuki. Japan. All rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       The file Unicode/Collate/allkeys.txt was copied directly from
       <http://www.unicode.org/Public/UCA/4.1.0/allkeys.txt>.  This file is
       Copyright (c) 1991-2005 Unicode, Inc. All rights reserved.  Distributed
       under the Terms of Use in <http://www.unicode.org/copyright.html>.