Unicode::Normalize(3pm)  Perl Programmers Reference Guide  Unicode::Normalize(3pm)

NAME
    Unicode::Normalize - Unicode Normalization Forms

SYNOPSIS
    (1) using function names exported by default:

        use Unicode::Normalize;

        $NFD_string  = NFD($string);  # Normalization Form D
        $NFC_string  = NFC($string);  # Normalization Form C
        $NFKD_string = NFKD($string); # Normalization Form KD
        $NFKC_string = NFKC($string); # Normalization Form KC

    (2) using function names exported on request:

        use Unicode::Normalize 'normalize';

        $NFD_string  = normalize('D',  $string);  # Normalization Form D
        $NFC_string  = normalize('C',  $string);  # Normalization Form C
        $NFKD_string = normalize('KD', $string);  # Normalization Form KD
        $NFKC_string = normalize('KC', $string);  # Normalization Form KC

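    As a small illustration of the default exports, the sketch below
    round-trips one precomposed character through NFD and NFC. The helper
    code_points() is hypothetical (it is not part of the module) and exists
    only to make the result visible as code points.

```perl
use strict;
use warnings;
use Unicode::Normalize;   # exports NFD, NFC, NFKD, NFKC by default

# Hypothetical helper: render a string as a list of code points.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

my $precomposed = "\x{00E9}";          # LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = NFD($precomposed);   # "e" followed by COMBINING ACUTE ACCENT
my $recomposed  = NFC($decomposed);    # back to the single precomposed character

print code_points($decomposed), "\n";  # U+0065 U+0301
print code_points($recomposed), "\n";  # U+00E9
```
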
DESCRIPTION
    Parameters:

    $string is used as a string under character semantics (see
    perlunicode).

    $code_point should be an unsigned integer representing a Unicode code
    point.

    Note: The XSUB and pure Perl implementations are incompatible in how
    they interpret $code_point as a decimal number. XSUB converts
    $code_point to an unsigned integer, but pure Perl does not. Do not use
    a floating-point value or a negative sign in $code_point.

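    A minimal sketch of passing $code_point as a plain non-negative
    integer, which avoids the XSUB/pure Perl difference noted above. The
    three ways of obtaining the integer are ordinary Perl idioms chosen
    for illustration, not requirements of the module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCombinClass);

# Three ways to obtain the same unsigned integer for COMBINING ACUTE ACCENT.
my $from_literal  = getCombinClass(0x0301);            # hexadecimal integer literal
my $from_ord      = getCombinClass(ord("\x{0301}"));   # code point of a character
my $from_notation = getCombinClass(hex('0301'));       # digits taken from "U+0301"

print "$from_literal $from_ord $from_notation\n";      # 230 230 230
```
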
  Normalization Forms
    "$NFD_string = NFD($string)"
        It returns the Normalization Form D (formed by canonical
        decomposition).

    "$NFC_string = NFC($string)"
        It returns the Normalization Form C (formed by canonical
        decomposition followed by canonical composition).

    "$NFKD_string = NFKD($string)"
        It returns the Normalization Form KD (formed by compatibility
        decomposition).

    "$NFKC_string = NFKC($string)"
        It returns the Normalization Form KC (formed by compatibility
        decomposition followed by canonical composition).

    "$FCD_string = FCD($string)"
        If the given string is in FCD ("Fast C or D" form; cf. UTN #5), it
        returns the string without modification; otherwise it returns an
        FCD string.

        Note: FCD is not always unique, so several different forms may be
        equivalent to each other. "FCD()" will return one of these
        equivalent forms.

    "$FCC_string = FCC($string)"
        It returns the FCC form ("Fast C Contiguous"; cf. UTN #5).

        Note: FCC is unique, as are the four normalization forms (NF*).

    "$normalized_string = normalize($form_name, $string)"
        It returns the normalization form named by $form_name.

        $form_name must be one of the following names.

            'C'  or 'NFC'   for Normalization Form C  (UAX #15)
            'D'  or 'NFD'   for Normalization Form D  (UAX #15)
            'KC' or 'NFKC'  for Normalization Form KC (UAX #15)
            'KD' or 'NFKD'  for Normalization Form KD (UAX #15)

            'FCD'           for "Fast C or D" Form  (UTN #5)
            'FCC'           for "Fast C Contiguous" (UTN #5)

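    The sketch below contrasts canonical and compatibility forms, shows
    the two behaviours of FCD() described above, and calls normalize()
    with both spellings of a form name. Because FCD is not unique, the
    example checks only that the result is in FCD and canonically
    equivalent to the input; it does not assume which FCD string is
    returned. The sample strings are illustrative choices.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKC FCD checkFCD normalize);

my $ligature = "\x{FB01}";   # LATIN SMALL LIGATURE FI, a compatibility character

# Canonical forms keep the ligature; compatibility forms expand it.
my $nfc_keeps_ligature = NFC($ligature)  eq $ligature;
my $nfkc_expands_it    = NFKC($ligature) eq 'fi';

# A string already in FCD is returned without modification.
my $already_fcd   = "\x{00E9}";
my $fcd_unchanged = FCD($already_fcd) eq $already_fcd;

# Acute (class 230) before dot below (class 220) is not in canonical order,
# so this string is not in FCD and FCD() must return a different string.
my $misordered = "a\x{0301}\x{0323}";
my $fcd_string = FCD($misordered);
my $now_in_fcd = checkFCD($fcd_string);
my $equivalent = NFD($fcd_string) eq NFD($misordered);

# normalize() accepts the short and the long form name.
my $same_by_name = normalize('KC',   $ligature) eq NFKC($ligature)
                && normalize('NFKC', $ligature) eq NFKC($ligature);

print "all checks hold\n"
    if $nfc_keeps_ligature && $nfkc_expands_it && $fcd_unchanged
    && $now_in_fcd && $equivalent && $same_by_name;
```
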
  Decomposition and Composition
    "$decomposed_string = decompose($string [, $useCompatMapping])"
        It returns the concatenation of the decomposition of each character
        in the string.

        If the second parameter (a boolean) is omitted or false, the
        decomposition is canonical decomposition; if it is true, the
        decomposition is compatibility decomposition.

        The string returned is not always in NFD/NFKD. Reordering may be
        required.

            $NFD_string  = reorder(decompose($string));    # eq. to NFD()
            $NFKD_string = reorder(decompose($string, 1)); # eq. to NFKD()

    "$reordered_string = reorder($string)"
        It returns the result of reordering the combining characters
        according to Canonical Ordering Behavior.

        For example, when you have a list of NFD/NFKD strings, you can get
        the concatenated NFD/NFKD string from them by saying

            $concat_NFD  = reorder(join '', @NFD_strings);
            $concat_NFKD = reorder(join '', @NFKD_strings);

    "$composed_string = compose($string)"
        It returns the result of canonical composition without applying any
        decomposition.

        For example, when you have an NFD/NFKD string, you can get its
        NFC/NFKC string by saying

            $NFC_string  = compose($NFD_string);
            $NFKC_string = compose($NFKD_string);

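    To make the three steps concrete, the sketch below runs decompose(),
    reorder(), and compose() one after another on a string whose plain
    decomposition leaves the combining marks out of canonical order. The
    sample strings are illustrative choices, not taken from the module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC decompose reorder compose);

my $string = "\x{00E9}\x{0323}";   # e-acute followed by COMBINING DOT BELOW

my $decomposed = decompose($string);    # "e\x{0301}\x{0323}": marks out of order
my $reordered  = reorder($decomposed);  # "e\x{0323}\x{0301}": canonical order
my $composed   = compose($reordered);   # canonical composition, no decomposition

# Concatenating NFD strings can also break canonical order; reorder() repairs it.
my @NFD_strings = ("e\x{0301}", "\x{0323}");
my $joined      = join '', @NFD_strings;
my $concat_NFD  = reorder($joined);

print "decompose alone is not NFD\n" if $decomposed ne NFD($string);
print "reorder(decompose) is NFD\n"  if $reordered  eq NFD($string);
print "compose(NFD) is NFC\n"        if $composed   eq NFC($string);
```
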
  Quick Check
    (see Annex 8, UAX #15; and DerivedNormalizationProps.txt)

    The following functions check whether the string is in the given
    normalization form.

    The result returned will be one of the following:

        YES     The string is in that normalization form.
        NO      The string is not in that normalization form.
        MAYBE   Undetermined. The string may or may not be in that form.

    "$result = checkNFD($string)"
        It returns true (1) if "YES"; false ("empty string") if "NO".

    "$result = checkNFC($string)"
        It returns true (1) if "YES"; false ("empty string") if "NO";
        "undef" if "MAYBE".

    "$result = checkNFKD($string)"
        It returns true (1) if "YES"; false ("empty string") if "NO".

    "$result = checkNFKC($string)"
        It returns true (1) if "YES"; false ("empty string") if "NO";
        "undef" if "MAYBE".

    "$result = checkFCD($string)"
        It returns true (1) if "YES"; false ("empty string") if "NO".

    "$result = checkFCC($string)"
        It returns true (1) if "YES"; false ("empty string") if "NO";
        "undef" if "MAYBE".

        Note: If a string is not in FCD, it cannot be in FCC. So
        "checkFCC($not_FCD_string)" should return "NO".

    "$result = check($form_name, $string)"
        It returns true (1) if "YES"; false ("empty string") if "NO";
        "undef" if "MAYBE".

        $form_name must be one of the following names.

            'C'  or 'NFC'   for Normalization Form C  (UAX #15)
            'D'  or 'NFD'   for Normalization Form D  (UAX #15)
            'KC' or 'NFKC'  for Normalization Form KC (UAX #15)
            'KD' or 'NFKD'  for Normalization Form KD (UAX #15)

            'FCD'           for "Fast C or D" Form  (UTN #5)
            'FCC'           for "Fast C Contiguous" (UTN #5)

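    Since "NO" (empty string) and "MAYBE" (undef) are both false in a
    boolean test, callers need defined() to tell them apart. The sketch
    below shows one way to do that; quick_check_label() is a hypothetical
    helper written for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(checkNFD checkNFC check);

# Hypothetical helper: map the three-valued result to a label.
sub quick_check_label {
    my ($result) = @_;
    return 'MAYBE' unless defined $result;   # undef means MAYBE
    return $result ? 'YES' : 'NO';
}

my $precomposed = "\x{00C1}";    # LATIN CAPITAL LETTER A WITH ACUTE
my $with_accent = "A\x{0301}";   # "A" followed by COMBINING ACUTE ACCENT

my $nfc_precomposed = checkNFC($precomposed);
my $nfd_precomposed = checkNFD($precomposed);
my $nfc_with_accent = checkNFC($with_accent);
my $nfd_with_accent = check('NFD', $with_accent);

print quick_check_label($nfc_precomposed), "\n";   # YES
print quick_check_label($nfd_precomposed), "\n";   # NO
print quick_check_label($nfc_with_accent), "\n";   # MAYBE
print quick_check_label($nfd_with_accent), "\n";   # YES
```
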
    Note

    In the cases of NFD, NFKD, and FCD, the answer must be either "YES" or
    "NO". The answer "MAYBE" may be returned in the cases of NFC, NFKC, and
    FCC.

    A "MAYBE" string should contain at least one combining character or the
    like. For example, "COMBINING ACUTE ACCENT" has the
    MAYBE_NFC/MAYBE_NFKC property.

    Both "checkNFC("A\N{COMBINING ACUTE ACCENT}")" and
    "checkNFC("B\N{COMBINING ACUTE ACCENT}")" will return "MAYBE".
    "A\N{COMBINING ACUTE ACCENT}" is not in NFC (its NFC is "\N{LATIN
    CAPITAL LETTER A WITH ACUTE}"), while "B\N{COMBINING ACUTE ACCENT}" is
    in NFC.

    To check exactly, compare the string with its NFC/NFKC/FCC.

        if ($string eq NFC($string)) {
            # $string is exactly normalized in NFC;
        } else {
            # $string is not normalized in NFC;
        }

        if ($string eq NFKC($string)) {
            # $string is exactly normalized in NFKC;
        } else {
            # $string is not normalized in NFKC;
        }

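    One possible way to combine the two approaches is to trust the quick
    check when it is decisive and fall back to the exact comparison only
    for "MAYBE". The function is_nfc() below is a hypothetical helper, not
    something the module provides.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC checkNFC);

# Hypothetical helper: exact NFC test that normalizes only when needed.
sub is_nfc {
    my ($string) = @_;
    my $quick = checkNFC($string);
    return $quick ? 1 : 0 if defined $quick;   # YES or NO: quick check decides
    return $string eq NFC($string) ? 1 : 0;    # MAYBE: compare with the NFC form
}

print is_nfc("A\x{0301}"), "\n";   # 0 (its NFC is U+00C1)
print is_nfc("B\x{0301}"), "\n";   # 1 (no precomposed form exists)
print is_nfc("\x{00C1}"),  "\n";   # 1
```
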
  Character Data
    These functions are an interface to the character data used
    internally. If you only want to get Unicode normalization forms, you
    do not need to call them yourself.

    "$canonical_decomposition = getCanon($code_point)"
        If the character is canonically decomposable (including Hangul
        Syllables), it returns the (full) canonical decomposition as a
        string. Otherwise it returns "undef".

        Note: According to the Unicode standard, the canonical
        decomposition of a character that is not canonically decomposable
        is the same as the character itself.

    "$compatibility_decomposition = getCompat($code_point)"
        If the character is compatibility decomposable (including Hangul
        Syllables), it returns the (full) compatibility decomposition as a
        string. Otherwise it returns "undef".

        Note: According to the Unicode standard, the compatibility
        decomposition of a character that is not compatibility
        decomposable is the same as the character itself.

    "$code_point_composite = getComposite($code_point_here,
    $code_point_next)"
        If the two characters here and next (as code points) are composable
        (including Hangul Jamo/Syllables and Composition Exclusions), it
        returns the code point of the composite.

        If they are not composable, it returns "undef".

    "$combining_class = getCombinClass($code_point)"
        It returns the combining class (as an integer) of the character.

    "$may_be_composed_with_prev_char = isComp2nd($code_point)"
        It returns a boolean indicating whether the character of the
        specified code point may be composed with the previous one in a
        certain composition (including Hangul Compositions, but excluding
        Composition Exclusions and Non-Starter Decompositions).

    "$is_exclusion = isExclusion($code_point)"
        It returns a boolean indicating whether the code point is a
        composition exclusion.

    "$is_singleton = isSingleton($code_point)"
        It returns a boolean indicating whether the code point is a
        singleton.

    "$is_non_starter_decomposition = isNonStDecomp($code_point)"
        It returns a boolean indicating whether the code point has a
        Non-Starter Decomposition.

    "$is_Full_Composition_Exclusion = isComp_Ex($code_point)"
        It returns a boolean of the derived property Comp_Ex
        (Full_Composition_Exclusion). This property is generated from
        Composition Exclusions + Singletons + Non-Starter Decompositions.

    "$NFD_is_NO = isNFD_NO($code_point)"
        It returns a boolean of the derived property NFD_NO
        (NFD_Quick_Check=No).

    "$NFC_is_NO = isNFC_NO($code_point)"
        It returns a boolean of the derived property NFC_NO
        (NFC_Quick_Check=No).

    "$NFC_is_MAYBE = isNFC_MAYBE($code_point)"
        It returns a boolean of the derived property NFC_MAYBE
        (NFC_Quick_Check=Maybe).

    "$NFKD_is_NO = isNFKD_NO($code_point)"
        It returns a boolean of the derived property NFKD_NO
        (NFKD_Quick_Check=No).

    "$NFKC_is_NO = isNFKC_NO($code_point)"
        It returns a boolean of the derived property NFKC_NO
        (NFKC_Quick_Check=No).

    "$NFKC_is_MAYBE = isNFKC_MAYBE($code_point)"
        It returns a boolean of the derived property NFKC_MAYBE
        (NFKC_Quick_Check=Maybe).

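    For orientation, the sketch below queries a few of these functions for
    well-known characters. The code points are illustrative choices made
    for this example (one from each of the three sources of Comp_Ex named
    above); all functions are imported on request.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(
    getCanon getCompat getComposite getCombinClass
    isExclusion isSingleton isNonStDecomp isComp_Ex
    isNFD_NO isNFC_MAYBE isNFKC_NO
);

# Decomposition and composition data
my $canon     = getCanon(0x00C1);               # "A\x{0301}"
my $no_canon  = getCanon(0x0041);               # undef: "A" is not decomposable
my $hangul    = getCanon(0xAC00);               # "\x{1100}\x{1161}"
my $compat    = getCompat(0xFB01);              # "fi"
my $composite = getComposite(0x0041, 0x0301);   # 0x00C1
my $no_pair   = getComposite(0x0042, 0x0301);   # undef: no "B with acute"
my $class     = getCombinClass(0x0301);         # 230

# One code point from each source of Full_Composition_Exclusion
my $exclusion  = 0x0958;   # DEVANAGARI LETTER QA (composition exclusion)
my $singleton  = 0x212B;   # ANGSTROM SIGN (singleton)
my $nonstarter = 0x0344;   # COMBINING GREEK DIALYTIKA TONOS (non-starter decomposition)

for my $code_point ($exclusion, $singleton, $nonstarter) {
    printf "U+%04X Comp_Ex=%d\n", $code_point, isComp_Ex($code_point) ? 1 : 0;
}
```
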
EXPORT
    "NFC", "NFD", "NFKC", "NFKD": by default.

    "normalize" and some other functions: on request.

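    A short sketch of the two import styles side by side. The second
    "use" line names the extra functions explicitly; which functions to
    request is an illustrative choice.

```perl
use strict;
use warnings;
use Unicode::Normalize;                       # NFC, NFD, NFKC, NFKD by default
use Unicode::Normalize qw(normalize check);   # additional names, on request

# The default export and the on-request function give the same result.
my $same = normalize('C', "e\x{0301}") eq NFC("e\x{0301}");
print $same ? "same\n" : "different\n";       # same
```
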
CAVEATS
    Perl's version vs. Unicode version
        Since this module refers to perl core's Unicode database in the
        directory /lib/unicore (or formerly /lib/unicode), the Unicode
        version of normalization implemented by this module depends on your
        perl's version.

            perl's version     implemented Unicode version
               5.6.1              3.0.1
               5.7.2              3.1.0
               5.7.3              3.1.1 (normalization is the same as 3.1.0)
               5.8.0              3.2.0
             5.8.1-5.8.3          4.0.0
             5.8.4-5.8.6          4.0.1 (normalization is the same as 4.0.0)
             5.8.7-5.8.8          4.1.0
               5.10.0             5.0.0
               5.8.9              5.1.0

    Correction of decomposition mapping
        In older Unicode versions, a small number of characters (all of
        which are CJK compatibility ideographs, as far as has been found)
        may have an erroneous decomposition mapping (see
        NormalizationCorrections.txt). In any case, this module neither
        refers to NormalizationCorrections.txt nor provides any specific
        version of normalization. Therefore this module, when running on
        an older perl with an older Unicode database, may use the erroneous
        decomposition mapping, blindly conforming to the Unicode database.

    Revised definition of canonical composition
        In Unicode 4.1.0, the definition D2 of canonical composition (which
        affects NFC and NFKC) was changed (see Public Review Issue #29
        and recent UAX #15). This module has used the newer definition
        since version 0.07 (Oct 31, 2001). This module will not support
        normalization according to the older definition, even if the
        Unicode version implemented by perl is lower than 4.1.0.

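    To find out which row of the table applies to a given installation,
    one option is to ask perl for the version of its own Unicode database.
    The sketch below uses Unicode::UCD, a separate core module that this
    page does not mention, so treat it as an assumption: it reports the
    version of perl's database, which the text above says this module
    follows.

```perl
use strict;
use warnings;
use Unicode::UCD ();

# Version of the Unicode database shipped with the running perl.
my $unicode_version = Unicode::UCD::UnicodeVersion();

print "perl $] uses Unicode $unicode_version\n";
```
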
AUTHOR
    SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

    Copyright(C) 2001-2007, SADAHIRO Tomoyuki. Japan. All rights reserved.

    This module is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

SEE ALSO
    http://www.unicode.org/reports/tr15/
        Unicode Normalization Forms - UAX #15

    http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt
        Composition Exclusion Table

    http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
        Derived Normalization Properties

    http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt
        Normalization Corrections

    http://www.unicode.org/review/pr-29.html
        Public Review Issue #29: Normalization Issue

    http://www.unicode.org/notes/tn5/
        Canonical Equivalence in Applications - UTN #5

perl v5.10.1                      2009-04-19           Unicode::Normalize(3pm)