Unicode::Normalize(3pm)

Unicode::Normalize(3pm)   Perl Programmers Reference Guide   Unicode::Normalize(3pm)

NAME

       Unicode::Normalize - Unicode Normalization Forms

SYNOPSIS

       (1) using function names exported by default:

         use Unicode::Normalize;

         $NFD_string  = NFD($string);  # Normalization Form D
         $NFC_string  = NFC($string);  # Normalization Form C
         $NFKD_string = NFKD($string); # Normalization Form KD
         $NFKC_string = NFKC($string); # Normalization Form KC

       (2) using function names exported on request:

         use Unicode::Normalize 'normalize';

         $NFD_string  = normalize('D',  $string);  # Normalization Form D
         $NFC_string  = normalize('C',  $string);  # Normalization Form C
         $NFKD_string = normalize('KD', $string);  # Normalization Form KD
         $NFKC_string = normalize('KC', $string);  # Normalization Form KC

DESCRIPTION

       Parameters:

       $string is used as a string under character semantics (see
       perlunicode).

       $code_point should be an unsigned integer representing a Unicode code
       point.

       Note: The XSUB and pure Perl implementations interpret a non-integer
       $code_point differently.  XSUB converts $code_point to an unsigned
       integer, but pure Perl does not.  Do not pass a floating-point or
       negative value as $code_point.

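       As a minimal sketch of what character semantics implies in practice,
       the example below decodes UTF-8 bytes into characters before
       normalizing them.  The use of the core Encode module and the variable
       names are assumptions made for this illustration; they are not part
       of this module's interface.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize;    # exports NFD, NFC, NFKD, NFKC by default

# The UTF-8 encoding of U+00E9 (LATIN SMALL LETTER E WITH ACUTE), as raw bytes.
my $bytes = "\xC3\xA9";

# Decode the bytes into a one-character string before normalizing.
my $chars = decode('UTF-8', $bytes);

# NFD splits the character into "e" followed by U+0301 COMBINING ACUTE ACCENT.
my $nfd = NFD($chars);

printf "%d character(s) before NFD, %d after\n", length($chars), length($nfd);
```
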
   Normalization Forms
       "$NFD_string = NFD($string)"
           It returns the Normalization Form D (formed by canonical
           decomposition).

       "$NFC_string = NFC($string)"
           It returns the Normalization Form C (formed by canonical
           decomposition followed by canonical composition).

       "$NFKD_string = NFKD($string)"
           It returns the Normalization Form KD (formed by compatibility
           decomposition).

       "$NFKC_string = NFKC($string)"
           It returns the Normalization Form KC (formed by compatibility
           decomposition followed by canonical composition).

       "$FCD_string = FCD($string)"
           If the given string is in FCD ("Fast C or D" form; cf. UTN #5), it
           returns the string without modification; otherwise it returns an
           FCD string.

           Note: FCD is not always unique; several distinct FCD strings may
           be canonically equivalent to each other. "FCD()" returns one of
           these equivalent forms.

       "$FCC_string = FCC($string)"
           It returns the FCC form ("Fast C Contiguous"; cf. UTN #5).

           Note: FCC is unique, as are the four normalization forms (NF*).

       "$normalized_string = normalize($form_name, $string)"
           It returns $string normalized to the form named by $form_name.

           $form_name must be one of the following names:

             'C'  or 'NFC'  for Normalization Form C  (UAX #15)
             'D'  or 'NFD'  for Normalization Form D  (UAX #15)
             'KC' or 'NFKC' for Normalization Form KC (UAX #15)
             'KD' or 'NFKD' for Normalization Form KD (UAX #15)

             'FCD'          for "Fast C or D" Form  (UTN #5)
             'FCC'          for "Fast C Contiguous" (UTN #5)

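       The forms described above can be compared on two small inputs.  This
       is an illustrative sketch only: the sample characters (U+00E9 and the
       ligature U+FB01) and the variable names are chosen for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKC FCD normalize);

my $composed = "\x{00E9}";      # LATIN SMALL LETTER E WITH ACUTE
my $ligature = "\x{FB01}";      # LATIN SMALL LIGATURE FI

my $nfd  = NFD($composed);      # "e" followed by U+0301
my $nfc  = NFC($nfd);           # recomposed to U+00E9
my $nfkc = NFKC($ligature);     # the compatibility mapping gives "fi"
my $same = NFC($ligature);      # canonical forms leave the ligature alone

# normalize() selects the form by name.
my $by_name = normalize('NFD', $composed);

# A single precomposed character is already in FCD, so FCD() returns it as is.
my $fcd = FCD($composed);

printf "NFD: %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $nfd;
```

       Note that importing an explicit list replaces the default exports, so
       every function used must be named in the "use" line.
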
   Decomposition and Composition
       "$decomposed_string = decompose($string [, $useCompatMapping])"
           It returns the concatenation of the decomposition of each character
           in the string.

           If the second parameter (a boolean) is omitted or false, canonical
           decomposition is used; if it is true, compatibility decomposition
           is used.

           The string returned is not always in NFD/NFKD. Reordering may be
           required.

               $NFD_string  = reorder(decompose($string));    # equivalent to NFD()
               $NFKD_string = reorder(decompose($string, 1)); # equivalent to NFKD()

       "$reordered_string = reorder($string)"
           It returns the result of reordering the combining characters
           according to Canonical Ordering Behavior.

           For example, when you have a list of NFD/NFKD strings, you can get
           the concatenated NFD/NFKD string from them with

               $concat_NFD  = reorder(join '', @NFD_strings);
               $concat_NFKD = reorder(join '', @NFKD_strings);

       "$composed_string = compose($string)"
           It returns the result of canonical composition without applying any
           decomposition.

           For example, when you have an NFD/NFKD string, you can get its
           NFC/NFKC string with

               $NFC_string  = compose($NFD_string);
               $NFKC_string = compose($NFKD_string);

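           To see why "decompose()" alone may not yield NFD, the following
           sketch runs the three steps on one input whose combining marks
           come out in non-canonical order.  The sample string is an
           assumption chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC decompose reorder compose);

# U+00E1 (a with acute) followed by U+0323 COMBINING DOT BELOW.
my $string = "\x{00E1}\x{0323}";

# decompose() maps each character separately, so the acute accent
# (combining class 230) ends up before the dot below (class 220).
my $decomposed = decompose($string);

# reorder() sorts the combining marks into canonical order.
my $reordered = reorder($decomposed);

# compose() recombines whatever pairs have a precomposed form.
my $composed = compose($reordered);

print "decomposed is ", ($decomposed eq NFD($string) ? "" : "not "), "in NFD\n";
print "reordered is ",  ($reordered  eq NFD($string) ? "" : "not "), "in NFD\n";
```
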
       "($processed, $unprocessed) = splitOnLastStarter($normalized)"
           It returns two strings: the first, $processed, is the part before
           the last starter, and the second, $unprocessed, is the remainder,
           beginning with the last starter. A starter is a character having a
           combining class of zero (see UAX #15).

           Note that $processed may be empty (when $normalized contains no
           starter or starts with the last starter); in that case
           $unprocessed is equal to the entire $normalized.

           When you have a $normalized string and an $unnormalized string
           following it, a simple concatenation is wrong:

               $concat = $normalized . normalize($form, $unnormalized); # wrong!

           Instead, do this:

               ($processed, $unprocessed) = splitOnLastStarter($normalized);
               $concat = $processed . normalize($form, $unprocessed.$unnormalized);

           "splitOnLastStarter()" must be called with a pre-normalized
           $normalized that is already in the form given by $form.

           If you have an array @string whose elements should be concatenated
           and then normalized, you can do this:

               my $result = "";
               my $unproc = "";
               foreach my $str (@string) {
                   $unproc .= $str;
                   my $n = normalize($form, $unproc);
                   my($p, $u) = splitOnLastStarter($n);
                   $result .= $p;
                   $unproc  = $u;
               }
               $result .= $unproc;
               # instead of normalize($form, join('', @string))

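           A concrete instance of the concatenation problem is sketched
           below.  The sample strings are assumptions for illustration; NFC
           stands in for whatever $form is in use.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC splitOnLastStarter);

my $normalized   = NFC("abce");     # already in NFC
my $unnormalized = "\x{0301}z";     # begins with COMBINING ACUTE ACCENT

# Naive concatenation leaves "e" and U+0301 uncomposed.
my $naive = $normalized . NFC($unnormalized);

# Re-normalize from the last starter ("e") onward instead.
my ($processed, $unprocessed) = splitOnLastStarter($normalized);
my $concat = $processed . NFC($unprocessed . $unnormalized);

print "naive concatenation is ",
    ($naive eq NFC($naive) ? "" : "not "), "in NFC\n";
```
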
       "$processed = normalize_partial($form, $unprocessed)"
           A wrapper for the combination of "normalize()" and
           "splitOnLastStarter()".  Note that $unprocessed will be modified as
           a side effect.

           If you have an array @string whose elements should be concatenated
           and then normalized, you can do this:

               my $result = "";
               my $unproc = "";
               foreach my $str (@string) {
                   $unproc .= $str;
                   $result .= normalize_partial($form, $unproc);
               }
               $result .= $unproc;
               # instead of normalize($form, join('', @string))

       "$processed = NFD_partial($unprocessed)"
           Equivalent to "normalize_partial('NFD', $unprocessed)".  Note that
           $unprocessed will be modified as a side effect.

       "$processed = NFC_partial($unprocessed)"
           Equivalent to "normalize_partial('NFC', $unprocessed)".  Note that
           $unprocessed will be modified as a side effect.

       "$processed = NFKD_partial($unprocessed)"
           Equivalent to "normalize_partial('NFKD', $unprocessed)".  Note that
           $unprocessed will be modified as a side effect.

       "$processed = NFKC_partial($unprocessed)"
           Equivalent to "normalize_partial('NFKC', $unprocessed)".  Note that
           $unprocessed will be modified as a side effect.

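           The loop pattern above can be tried on chunks whose boundaries
           split a base letter from its combining mark.  This is a sketch
           under assumed input; the chunk contents are hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFC_partial);

# Chunk boundaries fall between a base letter and its combining mark.
my @chunks = ("e", "\x{0301}x", "A\x{030A}");

my $result = "";
my $unproc = "";
for my $chunk (@chunks) {
    $unproc .= $chunk;
    # Returns the finished prefix and leaves the tail in $unproc.
    $result .= NFC_partial($unproc);
}
$result .= $unproc;    # flush the remaining tail

print "incremental result ",
    ($result eq NFC(join '', @chunks) ? "matches" : "differs from"),
    " one-shot NFC\n";
```
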
   Quick Check
       (see Annex 8, UAX #15; and DerivedNormalizationProps.txt)

       The following functions check whether the string is in the named
       normalization form.

       The result returned will be one of the following:

           YES     The string is in that normalization form.
           NO      The string is not in that normalization form.
           MAYBE   Undetermined. Maybe yes, maybe no.

       "$result = checkNFD($string)"
           It returns true (1) if "YES"; false (empty string) if "NO".

       "$result = checkNFC($string)"
           It returns true (1) if "YES"; false (empty string) if "NO";
           "undef" if "MAYBE".

       "$result = checkNFKD($string)"
           It returns true (1) if "YES"; false (empty string) if "NO".

       "$result = checkNFKC($string)"
           It returns true (1) if "YES"; false (empty string) if "NO";
           "undef" if "MAYBE".

       "$result = checkFCD($string)"
           It returns true (1) if "YES"; false (empty string) if "NO".

       "$result = checkFCC($string)"
           It returns true (1) if "YES"; false (empty string) if "NO";
           "undef" if "MAYBE".

           Note: If a string is not in FCD, it cannot be in FCC.  So
           "checkFCC($not_FCD_string)" should return "NO".

       "$result = check($form_name, $string)"
           It returns true (1) if "YES"; false (empty string) if "NO";
           "undef" if "MAYBE".

           $form_name must be one of the following names:

             'C'  or 'NFC'  for Normalization Form C  (UAX #15)
             'D'  or 'NFD'  for Normalization Form D  (UAX #15)
             'KC' or 'NFKC' for Normalization Form KC (UAX #15)
             'KD' or 'NFKD' for Normalization Form KD (UAX #15)

             'FCD'          for "Fast C or D" Form  (UTN #5)
             'FCC'          for "Fast C Contiguous" (UTN #5)

       Note

       In the cases of NFD, NFKD, and FCD, the answer must be either "YES" or
       "NO". The answer "MAYBE" may be returned in the cases of NFC, NFKC, and
       FCC.

       A string that yields "MAYBE" contains at least one combining character
       or the like. For example, "COMBINING ACUTE ACCENT" has the
       MAYBE_NFC/MAYBE_NFKC property.

       Both "checkNFC("A\N{COMBINING ACUTE ACCENT}")" and
       "checkNFC("B\N{COMBINING ACUTE ACCENT}")" will return "MAYBE".
       "A\N{COMBINING ACUTE ACCENT}" is not in NFC (its NFC is "\N{LATIN
       CAPITAL LETTER A WITH ACUTE}"), while "B\N{COMBINING ACUTE ACCENT}" is
       in NFC.

       For an exact check, compare the string with its NFC/NFKC/FCC.

           if ($string eq NFC($string)) {
               # $string is exactly normalized in NFC;
           } else {
               # $string is not normalized in NFC;
           }

           if ($string eq NFKC($string)) {
               # $string is exactly normalized in NFKC;
           } else {
               # $string is not normalized in NFKC;
           }

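       The quick check and the exact comparison can be combined so that the
       full normalization runs only when the quick check is inconclusive.
       The helper name "is_nfc" below is hypothetical and not part of this
       module; treat it as one possible sketch.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC checkNFC);

# Hypothetical helper: trust the quick check when it is decisive and
# fall back to a full comparison only for MAYBE (undef).
sub is_nfc {
    my ($string) = @_;
    my $quick = checkNFC($string);
    return $quick ? 1 : 0 if defined $quick;    # YES or NO
    return $string eq NFC($string) ? 1 : 0;     # MAYBE
}

my @samples = (
    "\x{00C1}",     # precomposed A with acute: YES
    "\x{212B}",     # ANGSTROM SIGN: NO
    "A\x{0301}",    # MAYBE, and not in NFC
    "B\x{0301}",    # MAYBE, but in NFC
);
my @answers = map { is_nfc($_) } @samples;
print "@answers\n";
```
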
   Character Data
       These functions are an interface to the character data used
       internally.  If you only want Unicode normalization forms, you do not
       need to call them yourself.

       "$canonical_decomposition = getCanon($code_point)"
           If the character is canonically decomposable (including Hangul
           Syllables), it returns the (full) canonical decomposition as a
           string.  Otherwise it returns "undef".

           Note: According to the Unicode standard, the canonical
           decomposition of a character that is not canonically decomposable
           is the same as the character itself.

       "$compatibility_decomposition = getCompat($code_point)"
           If the character is compatibility decomposable (including Hangul
           Syllables), it returns the (full) compatibility decomposition as a
           string.  Otherwise it returns "undef".

           Note: According to the Unicode standard, the compatibility
           decomposition of a character that is not compatibility
           decomposable is the same as the character itself.

       "$code_point_composite = getComposite($code_point_here,
       $code_point_next)"
           If two characters here and next (as code points) are composable
           (including Hangul Jamo/Syllables and Composition Exclusions), it
           returns the code point of the composite.

           If they are not composable, it returns "undef".

       "$combining_class = getCombinClass($code_point)"
           It returns the combining class (as an integer) of the character.

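           The lookups above take integer code points rather than strings.
           A brief sketch, with sample code points chosen for illustration:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCanon getCompat getComposite getCombinClass);

# All arguments are integer code points, not strings.
my $canon     = getCanon(0x00E9);                # "e" followed by U+0301
my $none      = getCanon(0x0041);                # undef: "A" does not decompose
my $hangul    = getCanon(0xAC00);                # Hangul syllable, two jamo
my $compat    = getCompat(0xFB01);               # ligature fi, gives "fi"
my $composite = getComposite(0x0065, 0x0301);    # code point of e with acute
my $class     = getCombinClass(0x0301);          # class of the acute accent

printf "composite: U+%04X, combining class: %d\n", $composite, $class;
```

           To obtain a code point from a one-character string, use Perl's
           built-in "ord".
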
       "$may_be_composed_with_prev_char = isComp2nd($code_point)"
           It returns a boolean indicating whether the character of the
           specified code point may be composed with the previous one in a
           certain composition (including Hangul Compositions, but excluding
           Composition Exclusions and Non-Starter Decompositions).

       "$is_exclusion = isExclusion($code_point)"
           It returns a boolean indicating whether the code point is a
           composition exclusion.

       "$is_singleton = isSingleton($code_point)"
           It returns a boolean indicating whether the code point is a
           singleton.

       "$is_non_starter_decomposition = isNonStDecomp($code_point)"
           It returns a boolean indicating whether the code point has a
           Non-Starter Decomposition.

       "$is_Full_Composition_Exclusion = isComp_Ex($code_point)"
           It returns a boolean of the derived property Comp_Ex
           (Full_Composition_Exclusion). This property is generated from
           Composition Exclusions + Singletons + Non-Starter Decompositions.

       "$NFD_is_NO = isNFD_NO($code_point)"
           It returns a boolean of the derived property NFD_NO
           (NFD_Quick_Check=No).

       "$NFC_is_NO = isNFC_NO($code_point)"
           It returns a boolean of the derived property NFC_NO
           (NFC_Quick_Check=No).

       "$NFC_is_MAYBE = isNFC_MAYBE($code_point)"
           It returns a boolean of the derived property NFC_MAYBE
           (NFC_Quick_Check=Maybe).

       "$NFKD_is_NO = isNFKD_NO($code_point)"
           It returns a boolean of the derived property NFKD_NO
           (NFKD_Quick_Check=No).

       "$NFKC_is_NO = isNFKC_NO($code_point)"
           It returns a boolean of the derived property NFKC_NO
           (NFKC_Quick_Check=No).

       "$NFKC_is_MAYBE = isNFKC_MAYBE($code_point)"
           It returns a boolean of the derived property NFKC_MAYBE
           (NFKC_Quick_Check=Maybe).

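       The property predicates can be probed on a few well-known code
       points.  The sample code points and variable names are assumptions
       made for this sketch.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(
    isComp2nd isExclusion isSingleton isComp_Ex
    isNFD_NO isNFC_NO isNFC_MAYBE
);

# U+0301 COMBINING ACUTE ACCENT can compose with a preceding base letter.
my $acute_comp2nd   = isComp2nd(0x0301);
my $acute_nfc_maybe = isNFC_MAYBE(0x0301);

# U+0958 DEVANAGARI LETTER QA is listed as a composition exclusion.
my $qa_exclusion = isExclusion(0x0958);

# U+212B ANGSTROM SIGN decomposes to the single character U+00C5.
my $angstrom_singleton = isSingleton(0x212B);
my $angstrom_comp_ex   = isComp_Ex(0x212B);
my $angstrom_nfc_no    = isNFC_NO(0x212B);

# U+00E9 has a canonical decomposition; plain "A" (U+0041) does not.
my $e_acute_nfd_no = isNFD_NO(0x00E9);
my $plain_a_nfd_no = isNFD_NO(0x0041);

print "U+212B is ", ($angstrom_nfc_no ? "never" : "possibly"), " allowed in NFC\n";
```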

EXPORT

       "NFC", "NFD", "NFKC", "NFKD": by default.

       "normalize" and some other functions: on request.

CAVEATS

       Perl's version vs. Unicode version
           Since this module refers to perl core's Unicode database in the
           directory /lib/unicore (or formerly /lib/unicode), the Unicode
           version of normalization implemented by this module depends on your
           perl's version.

               perl's version     implemented Unicode version
                  5.6.1              3.0.1
                  5.7.2              3.1.0
                  5.7.3              3.1.1 (normalization is same as 3.1.0)
                  5.8.0              3.2.0
                5.8.1-5.8.3          4.0.0
                5.8.4-5.8.6          4.0.1 (normalization is same as 4.0.0)
                5.8.7-5.8.8          4.1.0
                  5.10.0             5.0.0
               5.8.9, 5.10.1         5.1.0
               5.12.0-5.12.3         5.2.0
                  5.14.0             6.0.0
               5.16.0 (to be)        6.1.0

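           To find out which Unicode version a given perl provides, one
           option is to ask the core Unicode::UCD module.  That module is
           not mentioned elsewhere in this document; its use here is an
           assumption for illustration.

```perl
use strict;
use warnings;
use Unicode::UCD ();
use Unicode::Normalize ();

# Version of the Unicode database bundled with the running perl.
my $unicode_version = Unicode::UCD::UnicodeVersion();

# Version of the Unicode::Normalize module itself.
my $module_version = Unicode::Normalize->VERSION;

print "perl $]: Unicode $unicode_version, ",
    "Unicode::Normalize $module_version\n";
```
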
       Correction of decomposition mapping
           In older Unicode versions, a small number of characters (all of
           those found so far are CJK compatibility ideographs) may have an
           erroneous decomposition mapping (see
           NormalizationCorrections.txt).  This module neither refers to
           NormalizationCorrections.txt nor provides any specific version of
           normalization. Therefore this module running on an older perl with
           an older Unicode database may use the erroneous decomposition
           mapping, blindly conforming to the Unicode database.

       Revised definition of canonical composition
           In Unicode 4.1.0, the definition D2 of canonical composition (which
           affects NFC and NFKC) was changed (see Public Review Issue #29
           and recent UAX #15). This module has used the newer definition
           since version 0.07 (Oct 31, 2001).  This module will not support
           normalization according to the older definition, even if the
           Unicode version implemented by perl is lower than 4.1.0.

AUTHOR

       SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

       Copyright(C) 2001-2012, SADAHIRO Tomoyuki. Japan. All rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.
