Unicode::Normalize(3pm)

Unicode::Normalize(3pm)    Perl Programmers Reference Guide    Unicode::Normalize(3pm)

NAME

       Unicode::Normalize - Unicode Normalization Forms

SYNOPSIS

       (1) using function names exported by default:

         use Unicode::Normalize;

         $NFD_string  = NFD($string);  # Normalization Form D
         $NFC_string  = NFC($string);  # Normalization Form C
         $NFKD_string = NFKD($string); # Normalization Form KD
         $NFKC_string = NFKC($string); # Normalization Form KC

       (2) using function names exported on request:

         use Unicode::Normalize 'normalize';

         $NFD_string  = normalize('D',  $string);  # Normalization Form D
         $NFC_string  = normalize('C',  $string);  # Normalization Form C
         $NFKD_string = normalize('KD', $string);  # Normalization Form KD
         $NFKC_string = normalize('KC', $string);  # Normalization Form KC

DESCRIPTION

       Parameters:

       $string is used as a string under character semantics (see
       perlunicode).

       $code_point should be an unsigned integer representing a Unicode code
       point.

       Note: The XSUB and pure Perl implementations differ in how they
       interpret $code_point as a decimal number.  XSUB converts $code_point
       to an unsigned integer, but pure Perl does not.  Do not pass a
       floating point number or a negative number as $code_point.

   Normalization Forms
       "$NFD_string = NFD($string)"
           It returns the Normalization Form D (formed by canonical
           decomposition).

       "$NFC_string = NFC($string)"
           It returns the Normalization Form C (formed by canonical
           decomposition followed by canonical composition).

       "$NFKD_string = NFKD($string)"
           It returns the Normalization Form KD (formed by compatibility
           decomposition).

       "$NFKC_string = NFKC($string)"
           It returns the Normalization Form KC (formed by compatibility
           decomposition followed by canonical composition).

       "$FCD_string = FCD($string)"
           If the given string is in FCD ("Fast C or D" form; cf. UTN #5), it
           returns the string without modification; otherwise it returns an
           FCD string.

           Note: FCD is not always unique, so several distinct forms may be
           equivalent to each other. "FCD()" will return one of these
           equivalent forms.

       "$FCC_string = FCC($string)"
           It returns the FCC form ("Fast C Contiguous"; cf. UTN #5).

           Note: FCC is unique, as are the four normalization forms (NF*).

       "$normalized_string = normalize($form_name, $string)"
           It returns $string normalized to the form named by $form_name.

           $form_name must be one of the following names:

             'C'  or 'NFC'  for Normalization Form C  (UAX #15)
             'D'  or 'NFD'  for Normalization Form D  (UAX #15)
             'KC' or 'NFKC' for Normalization Form KC (UAX #15)
             'KD' or 'NFKD' for Normalization Form KD (UAX #15)

             'FCD'          for "Fast C or D" Form  (UTN #5)
             'FCC'          for "Fast C Contiguous" (UTN #5)

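           As an illustration of the four NF* forms and of "normalize()",
           here is a minimal sketch. The sample characters (U+00E9, U+0301
           and U+FB01) and the variable names are chosen for this example
           only and do not come from the module itself.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC normalize);

# Sample data (assumptions made for this sketch).
my $composed   = "\x{00E9}";    # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "e\x{0301}";   # "e" followed by COMBINING ACUTE ACCENT
my $ligature   = "\x{FB01}";    # LATIN SMALL LIGATURE FI

# Canonical forms convert between precomposed and decomposed sequences.
my $nfd = NFD($composed);
my $nfc = NFC($decomposed);

# Compatibility forms also replace compatibility characters;
# canonical forms leave them alone.
my $nfkc    = NFKC($ligature);
my $nfc_lig = NFC($ligature);

# normalize() selects the form by name at run time.
my $by_name = normalize('C', $decomposed);

printf "NFD decomposes:        %s\n", $nfd     eq $decomposed ? "yes" : "no";
printf "NFC composes:          %s\n", $nfc     eq $composed   ? "yes" : "no";
printf "NFKC expands ligature: %s\n", $nfkc    eq "fi"        ? "yes" : "no";
printf "NFC keeps ligature:    %s\n", $nfc_lig eq $ligature   ? "yes" : "no";
printf "normalize('C') == NFC: %s\n", $by_name eq $nfc        ? "yes" : "no";
```

           Listing "NFD", "NFC", "NFKD" and "NFKC" explicitly is needed here
           because an import list replaces the default exports.
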
   Decomposition and Composition
       "$decomposed_string = decompose($string [, $useCompatMapping])"
           It returns the concatenation of the decomposition of each character
           in the string.

           If the second parameter (a boolean) is omitted or false, the
           decomposition is canonical decomposition; if the second parameter
           is true, the decomposition is compatibility decomposition.

           The string returned is not always in NFD/NFKD. Reordering may be
           required.

               $NFD_string  = reorder(decompose($string));    # eq. to NFD()
               $NFKD_string = reorder(decompose($string, 1)); # eq. to NFKD()

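           The need for reordering can be seen in a small sketch. The input
           below (U+00E1 followed by U+0323) is an assumption picked because
           its decomposition leaves the combining marks out of canonical
           order.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD decompose reorder);

# "a with acute" followed by COMBINING DOT BELOW (sample input).
my $string = "\x{00E1}\x{0323}";

# decompose() maps each character separately, so the acute accent
# (combining class 230) ends up before the dot below (class 220).
my $decomposed = decompose($string);

# reorder() then sorts the marks into canonical order.
my $nfd  = reorder($decomposed);
my $nfkd = reorder(decompose($string, 1));

printf "decomposed: %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $decomposed;
printf "reordered:  %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $nfd;
printf "same as NFD():  %s\n", $nfd  eq NFD($string)  ? "yes" : "no";
printf "same as NFKD(): %s\n", $nfkd eq NFKD($string) ? "yes" : "no";
```
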
       "$reordered_string = reorder($string)"
           It returns the result of reordering the combining characters
           according to Canonical Ordering Behavior.

           For example, when you have a list of NFD/NFKD strings, you can get
           the concatenated NFD/NFKD string from them with

               $concat_NFD  = reorder(join '', @NFD_strings);
               $concat_NFKD = reorder(join '', @NFKD_strings);

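           A concrete case, as a hedged sketch: the two pieces below are
           each in NFD on their own, but their plain concatenation is not.
           The sample list is an assumption made for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD reorder);

# Two strings that are each already in NFD (sample data).
my @NFD_strings = ("a\x{0301}", "\x{0323}");

# Joining them puts U+0301 (class 230) before U+0323 (class 220),
# which violates canonical ordering.
my $joined = join '', @NFD_strings;

# reorder() repairs the ordering without decomposing again.
my $concat_NFD = reorder($joined);

printf "join alone is NFD:   %s\n", $joined     eq NFD($joined) ? "yes" : "no";
printf "reordered is NFD:    %s\n", $concat_NFD eq NFD($joined) ? "yes" : "no";
```
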
       "$composed_string = compose($string)"
           It returns the result of canonical composition without applying any
           decomposition.

           For example, when you have an NFD/NFKD string, you can get its
           NFC/NFKC string with

               $NFC_string  = compose($NFD_string);
               $NFKC_string = compose($NFKD_string);

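           The relationship between "compose()" and NFC/NFKC can be checked
           with a short sketch. The sample text and the Hangul jamo pair are
           assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC compose);

# Sample text: A WITH RING ABOVE, O WITH DIAERESIS and the FI ligature.
my $string = "\x{00C5}ngstr\x{00F6}m \x{FB01}";

# Composing an already decomposed string gives the composed form.
my $NFC_string  = compose(NFD($string));
my $NFKC_string = compose(NFKD($string));

# Hangul jamo are composed into a syllable as well.
my $syllable = compose("\x{1100}\x{1161}");

printf "compose(NFD)  eq NFC:  %s\n", $NFC_string  eq NFC($string)  ? "yes" : "no";
printf "compose(NFKD) eq NFKC: %s\n", $NFKC_string eq NFKC($string) ? "yes" : "no";
printf "jamo composed to U+%04X\n", ord $syllable;
```
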
   Quick Check
       (see Annex 8, UAX #15; and DerivedNormalizationProps.txt)

       Each of the following functions checks whether the string is in the
       corresponding normalization form.

       The result returned will be one of the following:

           YES     The string is in that normalization form.
           NO      The string is not in that normalization form.
           MAYBE   Undetermined. The string may or may not be in that form.

       "$result = checkNFD($string)"
           It returns true (1) if "YES"; false (an empty string) if "NO".

       "$result = checkNFC($string)"
           It returns true (1) if "YES"; false (an empty string) if "NO";
           "undef" if "MAYBE".

       "$result = checkNFKD($string)"
           It returns true (1) if "YES"; false (an empty string) if "NO".

       "$result = checkNFKC($string)"
           It returns true (1) if "YES"; false (an empty string) if "NO";
           "undef" if "MAYBE".

       "$result = checkFCD($string)"
           It returns true (1) if "YES"; false (an empty string) if "NO".

       "$result = checkFCC($string)"
           It returns true (1) if "YES"; false (an empty string) if "NO";
           "undef" if "MAYBE".

           Note: If a string is not in FCD, it cannot be in FCC.  So
           "checkFCC($not_FCD_string)" should return "NO".

       "$result = check($form_name, $string)"
           It returns true (1) if "YES"; false (an empty string) if "NO";
           "undef" if "MAYBE".

           $form_name must be one of the following names:

             'C'  or 'NFC'  for Normalization Form C  (UAX #15)
             'D'  or 'NFD'  for Normalization Form D  (UAX #15)
             'KC' or 'NFKC' for Normalization Form KC (UAX #15)
             'KD' or 'NFKD' for Normalization Form KD (UAX #15)

             'FCD'          for "Fast C or D" Form  (UTN #5)
             'FCC'          for "Fast C Contiguous" (UTN #5)

       Note

       In the cases of NFD, NFKD, and FCD, the answer is always either "YES"
       or "NO". The answer "MAYBE" may be returned in the cases of NFC, NFKC,
       and FCC.

       A "MAYBE" string should contain at least one combining character or the
       like. For example, "COMBINING ACUTE ACCENT" has the
       MAYBE_NFC/MAYBE_NFKC property.

       Both "checkNFC("A\N{COMBINING ACUTE ACCENT}")" and
       "checkNFC("B\N{COMBINING ACUTE ACCENT}")" will return "MAYBE".
       "A\N{COMBINING ACUTE ACCENT}" is not in NFC (its NFC is "\N{LATIN
       CAPITAL LETTER A WITH ACUTE}"), while "B\N{COMBINING ACUTE ACCENT}" is
       in NFC.

       For an exact check, compare the string with its NFC/NFKC/FCC.

           if ($string eq NFC($string)) {
               # $string is exactly normalized in NFC;
           } else {
               # $string is not normalized in NFC;
           }

           if ($string eq NFKC($string)) {
               # $string is exactly normalized in NFKC;
           } else {
               # $string is not normalized in NFKC;
           }

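       The quick check and the exact comparison can be combined, as in the
       following sketch. The helper name "is_nfc" is hypothetical and not
       part of this module; it assumes "checkNFC" is imported on request.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC checkNFC);

# is_nfc() is a hypothetical helper, not provided by Unicode::Normalize.
# It uses the quick check first and normalizes only on "MAYBE".
sub is_nfc {
    my ($string) = @_;
    my $quick = checkNFC($string);

    # A defined result is a definite YES (true) or NO (false).
    return $quick ? 1 : 0 if defined $quick;

    # undef means MAYBE: fall back to the exact comparison.
    return $string eq NFC($string) ? 1 : 0;
}

printf "A + acute:     %d\n", is_nfc("A\x{0301}");   # not in NFC
printf "B + acute:     %d\n", is_nfc("B\x{0301}");   # in NFC
printf "ANGSTROM SIGN: %d\n", is_nfc("\x{212B}");    # not in NFC
```

       The fallback costs a full normalization, so it is only reached for
       strings whose quick check is inconclusive.
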
   Character Data
       These functions are an interface to the character data used
       internally.  If you only want to get Unicode normalization forms, you
       do not need to call them yourself.

       "$canonical_decomposition = getCanon($code_point)"
           If the character is canonically decomposable (including Hangul
           Syllables), it returns the (full) canonical decomposition as a
           string.  Otherwise it returns "undef".

           Note: According to the Unicode standard, the canonical
           decomposition of a character that is not canonically decomposable
           is the same as the character itself.

       "$compatibility_decomposition = getCompat($code_point)"
           If the character is compatibility decomposable (including Hangul
           Syllables), it returns the (full) compatibility decomposition as a
           string.  Otherwise it returns "undef".

           Note: According to the Unicode standard, the compatibility
           decomposition of a character that is not compatibility
           decomposable is the same as the character itself.

       "$code_point_composite = getComposite($code_point_here,
       $code_point_next)"
           If two characters here and next (as code points) are composable
           (including Hangul Jamo/Syllables and Composition Exclusions), it
           returns the code point of the composite.

           If they are not composable, it returns "undef".

       "$combining_class = getCombinClass($code_point)"
           It returns the combining class (as an integer) of the character.

       "$may_be_composed_with_prev_char = isComp2nd($code_point)"
           It returns a boolean indicating whether the character of the
           specified code point may be composed with the previous one in a
           certain composition (including Hangul Compositions, but excluding
           Composition Exclusions and Non-Starter Decompositions).

       "$is_exclusion = isExclusion($code_point)"
           It returns a boolean indicating whether the code point is a
           composition exclusion.

       "$is_singleton = isSingleton($code_point)"
           It returns a boolean indicating whether the code point is a
           singleton.

       "$is_non_starter_decomposition = isNonStDecomp($code_point)"
           It returns a boolean indicating whether the code point has a
           Non-Starter Decomposition.

       "$is_Full_Composition_Exclusion = isComp_Ex($code_point)"
           It returns a boolean of the derived property Comp_Ex
           (Full_Composition_Exclusion). This property is generated from
           Composition Exclusions + Singletons + Non-Starter Decompositions.

       "$NFD_is_NO = isNFD_NO($code_point)"
           It returns a boolean of the derived property NFD_NO
           (NFD_Quick_Check=No).

       "$NFC_is_NO = isNFC_NO($code_point)"
           It returns a boolean of the derived property NFC_NO
           (NFC_Quick_Check=No).

       "$NFC_is_MAYBE = isNFC_MAYBE($code_point)"
           It returns a boolean of the derived property NFC_MAYBE
           (NFC_Quick_Check=Maybe).

       "$NFKD_is_NO = isNFKD_NO($code_point)"
           It returns a boolean of the derived property NFKD_NO
           (NFKD_Quick_Check=No).

       "$NFKC_is_NO = isNFKC_NO($code_point)"
           It returns a boolean of the derived property NFKC_NO
           (NFKC_Quick_Check=No).

       "$NFKC_is_MAYBE = isNFKC_MAYBE($code_point)"
           It returns a boolean of the derived property NFKC_MAYBE
           (NFKC_Quick_Check=Maybe).
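
       A few of these lookups, sketched under the assumption that the
       functions are importable on request. The code points used (U+00C5,
       U+0041, U+FB01, U+030A, U+0301, U+212B) are sample values chosen for
       this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCanon getCompat getComposite
                          getCombinClass isSingleton);

# All arguments are code points (unsigned integers), not strings.
my $canon     = getCanon(0x00C5);               # A WITH RING ABOVE
my $no_canon  = getCanon(0x0041);               # plain "A": undef
my $compat    = getCompat(0xFB01);              # FI ligature
my $composite = getComposite(0x0041, 0x030A);   # "A" + COMBINING RING ABOVE
my $class     = getCombinClass(0x0301);         # COMBINING ACUTE ACCENT
my $singleton = isSingleton(0x212B);            # ANGSTROM SIGN

printf "getCanon(U+00C5)       = %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $canon;
printf "getCanon(U+0041)       = %s\n", defined $no_canon ? "defined" : "undef";
printf "getCompat(U+FB01)      = %s\n", $compat;
printf "getComposite(A, ring)  = U+%04X\n", $composite;
printf "getCombinClass(U+0301) = %d\n", $class;
printf "isSingleton(U+212B)    = %s\n", $singleton ? "true" : "false";
```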

EXPORT

       "NFC", "NFD", "NFKC", "NFKD": by default.

       "normalize" and some other functions: on request.

CAVEATS

       Perl's version vs. Unicode version
           Since this module refers to perl core's Unicode database in the
           directory /lib/unicore (or formerly /lib/unicode), the Unicode
           version of normalization implemented by this module depends on your
           perl's version.

               perl's version     implemented Unicode version
                  5.6.1              3.0.1
                  5.7.2              3.1.0
                  5.7.3              3.1.1 (normalization is the same as 3.1.0)
                  5.8.0              3.2.0
                5.8.1-5.8.3          4.0.0
                5.8.4-5.8.6          4.0.1 (normalization is the same as 4.0.0)
                5.8.7-5.8.8          4.1.0
                  5.10.0             5.0.0
                  5.8.9              5.1.0

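           To see which Unicode version a given perl ships, one option is
           the separate core module Unicode::UCD. That module is not part of
           Unicode::Normalize and is not mentioned elsewhere in this page,
           so the sketch below is an assumption about your installation.

```perl
use strict;
use warnings;
use Unicode::UCD ();

# Unicode::UCD::UnicodeVersion() reports the version of the Unicode
# database bundled with the running perl.
my $unicode_version = Unicode::UCD::UnicodeVersion();

printf "perl %vd uses Unicode database version %s\n", $^V, $unicode_version;
```
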
       Correction of decomposition mapping
           In older Unicode versions, a small number of characters (all of
           which are CJK compatibility ideographs, as far as is known) may
           have an erroneous decomposition mapping (see
           NormalizationCorrections.txt).  This module neither refers to
           NormalizationCorrections.txt nor provides any specific version of
           normalization.  Therefore, when this module runs on an older perl
           with an older Unicode database, it may use the erroneous
           decomposition mapping, blindly conforming to that database.

       Revised definition of canonical composition
           In Unicode 4.1.0, the definition D2 of canonical composition (which
           affects NFC and NFKC) was changed (see Public Review Issue #29
           and recent UAX #15). This module has used the newer definition
           since version 0.07 (Oct 31, 2001).  This module will not support
           normalization according to the older definition, even if the
           Unicode version implemented by perl is lower than 4.1.0.

AUTHOR

       SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

       Copyright(C) 2001-2007, SADAHIRO Tomoyuki. Japan. All rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.
