Unicode::Normalize(3pm)  Perl Programmers Reference Guide  Unicode::Normalize(3pm)

NAME
       Unicode::Normalize - Unicode Normalization Forms

SYNOPSIS
       (1) using function names exported by default:

         use Unicode::Normalize;

         $NFD_string  = NFD($string);  # Normalization Form D
         $NFC_string  = NFC($string);  # Normalization Form C
         $NFKD_string = NFKD($string); # Normalization Form KD
         $NFKC_string = NFKC($string); # Normalization Form KC

       (2) using function names exported on request:

         use Unicode::Normalize 'normalize';

         $NFD_string  = normalize('D',  $string);  # Normalization Form D
         $NFC_string  = normalize('C',  $string);  # Normalization Form C
         $NFKD_string = normalize('KD', $string);  # Normalization Form KD
         $NFKC_string = normalize('KC', $string);  # Normalization Form KC

DESCRIPTION
       Parameters:

       $string is used as a string under character semantics (see
       perlunicode).

       $code_point should be an unsigned integer representing a Unicode code
       point.

       Note: The XSUB and pure Perl implementations differ in how they
       interpret $code_point as a decimal number. XSUB converts $code_point
       to an unsigned integer, but pure Perl does not. Do not use a
       floating-point value or a negative sign in $code_point.

   Normalization Forms
       "$NFD_string = NFD($string)"
           It returns the Normalization Form D (formed by canonical
           decomposition).

       "$NFC_string = NFC($string)"
           It returns the Normalization Form C (formed by canonical
           decomposition followed by canonical composition).

       "$NFKD_string = NFKD($string)"
           It returns the Normalization Form KD (formed by compatibility
           decomposition).

       "$NFKC_string = NFKC($string)"
           It returns the Normalization Form KC (formed by compatibility
           decomposition followed by canonical composition).
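
As a small illustration of how the four forms differ, the sketch below normalizes one string that contains both a canonically decomposable character and a compatibility character. The helper code_points() and the sample string are assumptions made for this example; they are not part of the module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC);

# Hypothetical helper: render a string as code points, e.g. "U+0065 U+0301".
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

# U+00E9 (e with acute) followed by U+FB01 (the "fi" ligature).
my $string = "\x{E9}\x{FB01}";

my %form = (
    NFD  => NFD($string),
    NFC  => NFC($string),
    NFKD => NFKD($string),
    NFKC => NFKC($string),
);

# Print one line per form so the differences are visible side by side.
printf "%-4s %s\n", $_, code_points($form{$_}) for qw(NFD NFC NFKD NFKC);
```

Only the two K forms replace the ligature with "f" and "i"; only the two D forms leave the accent as a separate combining character.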

       "$FCD_string = FCD($string)"
           If the given string is in FCD ("Fast C or D" form; cf. UTN #5),
           it returns the string without modification; otherwise it returns
           an FCD string.

           Note: FCD is not always unique: several distinct forms may be
           equivalent to each other. "FCD()" will return one of these
           equivalent forms.

       "$FCC_string = FCC($string)"
           It returns the FCC form ("Fast C Contiguous"; cf. UTN #5).

           Note: FCC is unique, as are the four normalization forms (NF*).
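
The following sketch checks the properties of "FCD()" stated above without relying on which of the equivalent FCD forms is returned. The sample strings are assumptions chosen for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD FCD checkFCD);

# U+00E1 (a with acute) followed by U+0323 (combining dot below).
# Decomposing U+00E1 puts the acute (combining class 230) before the
# dot below (combining class 220), so this string is not in FCD.
my $not_fcd = "\x{E1}\x{323}";

# "A" followed by U+0301 is already in FCD.
my $already_fcd = "A\x{301}";

my $fcd = FCD($not_fcd);

print "input in FCD?      ", (checkFCD($not_fcd) ? "yes" : "no"), "\n";
print "result in FCD?     ", (checkFCD($fcd)     ? "yes" : "no"), "\n";
print "canonically equal? ", (NFD($fcd) eq NFD($not_fcd) ? "yes" : "no"), "\n";
print "FCD input kept?    ", (FCD($already_fcd) eq $already_fcd ? "yes" : "no"), "\n";
```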

       "$normalized_string = normalize($form_name, $string)"
           It returns the normalization form of $string named by
           $form_name.

           $form_name must be one of the following names:

               'C'  or 'NFC'  for Normalization Form C  (UAX #15)
               'D'  or 'NFD'  for Normalization Form D  (UAX #15)
               'KC' or 'NFKC' for Normalization Form KC (UAX #15)
               'KD' or 'NFKD' for Normalization Form KD (UAX #15)

               'FCD'          for "Fast C or D" Form    (UTN #5)
               'FCC'          for "Fast C Contiguous"   (UTN #5)
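
A short sketch showing that the long and short form names select the same normalization as the dedicated functions. The sample string and the variable names are assumptions made for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKC NFKD normalize);

# U+212B ANGSTROM SIGN ... U+00F6 (o with diaeresis)
my $string = "\x{212B}ngstr\x{F6}m";

my %by_function = (
    NFC  => NFC($string),
    NFD  => NFD($string),
    NFKC => NFKC($string),
    NFKD => NFKD($string),
);

my @mismatch;
for my $name (qw(NFC NFD NFKC NFKD)) {
    (my $short = $name) =~ s/^NF//;    # 'NFC' -> 'C', 'NFKD' -> 'KD'
    my $agree = normalize($name,  $string) eq $by_function{$name}
             && normalize($short, $string) eq $by_function{$name};
    push @mismatch, $name unless $agree;
    print "$name / $short: ", ($agree ? "same result" : "MISMATCH"), "\n";
}
```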

   Decomposition and Composition
       "$decomposed_string = decompose($string [, $useCompatMapping])"
           It returns the concatenation of the decomposition of each
           character in the string.

           If the second parameter (a boolean) is omitted or false, the
           decomposition is canonical decomposition; if it is true, the
           decomposition is compatibility decomposition.

           The string returned is not always in NFD/NFKD. Reordering may be
           required.

               $NFD_string  = reorder(decompose($string));    # eq. to NFD()
               $NFKD_string = reorder(decompose($string, 1)); # eq. to NFKD()
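
To see why the reordering step matters, here is a minimal sketch with a string whose plain decomposition is not in canonical order. The sample string is an assumption chosen for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD decompose reorder);

# U+00E1 (a with acute) followed by U+0323 (combining dot below).
my $string = "\x{E1}\x{323}";

# decompose() only expands each character in place:
# "a", U+0301 (class 230), U+0323 (class 220).
my $decomposed = decompose($string);

# reorder() sorts the combining characters by combining class:
# "a", U+0323, U+0301.
my $reordered = reorder($decomposed);

print "decompose() alone is NFD:   ", ($decomposed eq NFD($string) ? "yes" : "no"), "\n";
print "reorder(decompose()) is NFD: ", ($reordered  eq NFD($string) ? "yes" : "no"), "\n";
```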

       "$reordered_string = reorder($string)"
           It returns the result of reordering the combining characters
           according to Canonical Ordering Behavior.

           For example, when you have a list of NFD/NFKD strings, you can
           get the concatenated NFD/NFKD string from them by saying

               $concat_NFD  = reorder(join '', @NFD_strings);
               $concat_NFKD = reorder(join '', @NFKD_strings);
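
A minimal sketch of the concatenation case: each piece is in NFD on its own, but the joined string is not until it is reordered. The two sample pieces are assumptions chosen for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD reorder);

# "d" + U+0307 (dot above, class 230), then a lone U+0323 (dot below, class 220).
my @NFD_strings = ("d\x{307}", "\x{323}");

my $joined     = join '', @NFD_strings;   # class 230 now precedes class 220
my $concat_NFD = reorder($joined);        # canonical order restored

print "plain join is NFD:    ", ($joined     eq NFD($joined) ? "yes" : "no"), "\n";
print "reordered join is NFD: ", ($concat_NFD eq NFD($joined) ? "yes" : "no"), "\n";
```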

       "$composed_string = compose($string)"
           It returns the result of canonical composition without applying
           any decomposition.

           For example, when you have an NFD/NFKD string, you can get its
           NFC/NFKC string by saying

               $NFC_string  = compose($NFD_string);
               $NFKC_string = compose($NFKD_string);
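
The sketch below contrasts "compose()" on an NFD string with "compose()" on a string that still needs decomposition. The sample strings are assumptions chosen for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD compose);

# An NFD string: "e" followed by U+0301 (combining acute accent).
my $NFD_string = NFD("\x{E9}");
my $NFC_string = compose($NFD_string);

# U+212B ANGSTROM SIGN is not in NFC, but compose() applies no
# decomposition, so it has nothing to recombine here.
my $angstrom = "\x{212B}";
my $composed = compose($angstrom);

print "compose(NFD) equals NFC:      ", ($NFC_string eq NFC("\x{E9}") ? "yes" : "no"), "\n";
print "compose() leaves U+212B as is: ", ($composed eq $angstrom ? "yes" : "no"), "\n";
print "NFC() maps U+212B to U+00C5:   ", (NFC($angstrom) eq "\x{C5}" ? "yes" : "no"), "\n";
```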

       "($processed, $unprocessed) = splitOnLastStarter($normalized)"
           It returns two strings: the first, $processed, is the part
           before the last starter; the second, $unprocessed, is the
           remainder (from the last starter onward). A starter is a
           character having a combining class of zero (see UAX #15).

           Note that $processed may be empty (when $normalized contains no
           starter or starts with the last starter), and then $unprocessed
           should be equal to the entire $normalized.

           When you have a $normalized string and an $unnormalized string
           following it, a simple concatenation is wrong:

               $concat = $normalized . normalize($form, $unnormalized); # wrong!

           Instead, do this:

               ($processed, $unprocessed) = splitOnLastStarter($normalized);
               $concat = $processed . normalize($form, $unprocessed.$unnormalized);

           "splitOnLastStarter()" should be called with a pre-normalized
           parameter $normalized that is already in the form $form you
           want.

           If you have an array @string whose elements should be
           concatenated and then normalized, you can do this:

               my $result = "";
               my $unproc = "";
               foreach my $str (@string) {
                   $unproc .= $str;
                   my $n = normalize($form, $unproc);
                   my($p, $u) = splitOnLastStarter($n);
                   $result .= $p;
                   $unproc  = $u;
               }
               $result .= $unproc;
               # instead of normalize($form, join('', @string))
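
A concrete instance of the two concatenation strategies described above, assuming NFC as the target form. The sample strings are assumptions chosen so that a combining character arrives at the start of the second piece.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC normalize splitOnLastStarter);

my $form         = 'NFC';
my $normalized   = NFC("cafe");   # already in NFC
my $unnormalized = "\x{301}s";    # starts with U+0301 (combining acute accent)

# Simple concatenation: the accent never meets the preceding "e".
my $wrong = $normalized . normalize($form, $unnormalized);

# Split off the last starter ("e") and normalize it together with the tail.
my ($processed, $unprocessed) = splitOnLastStarter($normalized);
my $right = $processed . normalize($form, $unprocessed . $unnormalized);

print "simple concatenation is NFC: ", ($wrong eq NFC($wrong) ? "yes" : "no"), "\n";
print "split-based result is NFC:   ", ($right eq NFC($right) ? "yes" : "no"), "\n";
```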

       "$processed = normalize_partial($form, $unprocessed)"
           A wrapper for the combination of "normalize()" and
           "splitOnLastStarter()". Note that $unprocessed will be modified
           as a side effect.

           If you have an array @string whose elements should be
           concatenated and then normalized, you can do this:

               my $result = "";
               my $unproc = "";
               foreach my $str (@string) {
                   $unproc .= $str;
                   $result .= normalize_partial($form, $unproc);
               }
               $result .= $unproc;
               # instead of normalize($form, join('', @string))
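
The loop above can be tried on input that splits a base letter from its accent across chunk boundaries. The chunk contents and the choice of NFC are assumptions made for this sketch.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC normalize_partial);

my $form   = 'NFC';
# The "e" and its combining acute accent (U+0301) arrive in separate chunks.
my @string = ("caf", "e", "\x{301}", " au lait");

my $result = "";
my $unproc = "";
foreach my $str (@string) {
    $unproc .= $str;
    # Returns the part that is safe to emit and leaves the rest
    # (from the last starter onward) in $unproc.
    $result .= normalize_partial($form, $unproc);
}
$result .= $unproc;   # flush what is left after the final chunk

my $expected = NFC(join '', @string);
print "incremental result matches one-shot NFC: ",
      ($result eq $expected ? "yes" : "no"), "\n";
```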

       "$processed = NFD_partial($unprocessed)"
           It behaves like "normalize_partial('NFD', $unprocessed)". Note
           that $unprocessed will be modified as a side effect.

       "$processed = NFC_partial($unprocessed)"
           It behaves like "normalize_partial('NFC', $unprocessed)". Note
           that $unprocessed will be modified as a side effect.

       "$processed = NFKD_partial($unprocessed)"
           It behaves like "normalize_partial('NFKD', $unprocessed)". Note
           that $unprocessed will be modified as a side effect.

       "$processed = NFKC_partial($unprocessed)"
           It behaves like "normalize_partial('NFKC', $unprocessed)". Note
           that $unprocessed will be modified as a side effect.

   Quick Check
       (see Annex 8, UAX #15; and DerivedNormalizationProps.txt)

       The following functions check whether the string is in that
       normalization form.

       The result returned will be one of the following:

           YES     The string is in that normalization form.
           NO      The string is not in that normalization form.
           MAYBE   Dubious. Maybe yes, maybe no.

       "$result = checkNFD($string)"
           It returns true (1) if "YES"; false (the empty string) if "NO".

       "$result = checkNFC($string)"
           It returns true (1) if "YES"; false (the empty string) if "NO";
           "undef" if "MAYBE".

       "$result = checkNFKD($string)"
           It returns true (1) if "YES"; false (the empty string) if "NO".

       "$result = checkNFKC($string)"
           It returns true (1) if "YES"; false (the empty string) if "NO";
           "undef" if "MAYBE".

       "$result = checkFCD($string)"
           It returns true (1) if "YES"; false (the empty string) if "NO".

       "$result = checkFCC($string)"
           It returns true (1) if "YES"; false (the empty string) if "NO";
           "undef" if "MAYBE".

           Note: If a string is not in FCD, it cannot be in FCC. So
           "checkFCC($not_FCD_string)" should return "NO".

       "$result = check($form_name, $string)"
           It returns true (1) if "YES"; false (the empty string) if "NO";
           "undef" if "MAYBE".

           $form_name must be one of the following names:

               'C'  or 'NFC'  for Normalization Form C  (UAX #15)
               'D'  or 'NFD'  for Normalization Form D  (UAX #15)
               'KC' or 'NFKC' for Normalization Form KC (UAX #15)
               'KD' or 'NFKD' for Normalization Form KD (UAX #15)

               'FCD'          for "Fast C or D" Form    (UTN #5)
               'FCC'          for "Fast C Contiguous"   (UTN #5)

       Note

       In the cases of NFD, NFKD, and FCD, the answer must be either "YES" or
       "NO". The answer "MAYBE" may be returned in the cases of NFC, NFKC,
       and FCC.

       A "MAYBE" string should contain at least one combining character or
       the like. For example, "COMBINING ACUTE ACCENT" has the
       MAYBE_NFC/MAYBE_NFKC property.

       Both "checkNFC("A\N{COMBINING ACUTE ACCENT}")" and
       "checkNFC("B\N{COMBINING ACUTE ACCENT}")" will return "MAYBE".
       "A\N{COMBINING ACUTE ACCENT}" is not in NFC (its NFC is "\N{LATIN
       CAPITAL LETTER A WITH ACUTE}"), while "B\N{COMBINING ACUTE ACCENT}" is
       in NFC.

       For an exact check, compare the string with its NFC/NFKC/FCC.

           if ($string eq NFC($string)) {
               # $string is exactly normalized in NFC;
           } else {
               # $string is not normalized in NFC;
           }

           if ($string eq NFKC($string)) {
               # $string is exactly normalized in NFKC;
           } else {
               # $string is not normalized in NFKC;
           }
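
One way to combine the quick check with the exact comparison is sketched below: use the quick answer when it is definite and fall back to a full normalization only on "MAYBE". The helper name is_nfc() is hypothetical and not part of the module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC checkNFC);

# Hypothetical helper: returns 1 if $string is in NFC, 0 otherwise.
sub is_nfc {
    my ($string) = @_;
    my $quick = checkNFC($string);            # 1, "" or undef
    return $quick ? 1 : 0 if defined $quick;  # definite YES or NO
    return $string eq NFC($string) ? 1 : 0;   # MAYBE: compare exactly
}

my $acute = "\x{301}";                        # COMBINING ACUTE ACCENT
for my $string ("abc", "\x{212B}", "A$acute", "B$acute") {
    my $quick = checkNFC($string);
    my $label = !defined $quick ? "MAYBE" : $quick ? "YES" : "NO";
    printf "quick check: %-5s  exact: %s\n",
           $label, (is_nfc($string) ? "in NFC" : "not in NFC");
}
```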

   Character Data
       These functions are interfaces to character data used internally. If
       you only want to get Unicode normalization forms, you do not need to
       call them yourself.

       "$canonical_decomposition = getCanon($code_point)"
           If the character is canonically decomposable (including Hangul
           Syllables), it returns the (full) canonical decomposition as a
           string. Otherwise it returns "undef".

           Note: According to the Unicode standard, the canonical
           decomposition of a character that is not canonically
           decomposable is the same as the character itself.

       "$compatibility_decomposition = getCompat($code_point)"
           If the character is compatibility decomposable (including
           Hangul Syllables), it returns the (full) compatibility
           decomposition as a string. Otherwise it returns "undef".

           Note: According to the Unicode standard, the compatibility
           decomposition of a character that is not compatibility
           decomposable is the same as the character itself.

       "$code_point_composite = getComposite($code_point_here,
       $code_point_next)"
           If two characters here and next (as code points) are composable
           (including Hangul Jamo/Syllables and Composition Exclusions), it
           returns the code point of the composite.

           If they are not composable, it returns "undef".

       "$combining_class = getCombinClass($code_point)"
           It returns the combining class (as an integer) of the character.

       "$may_be_composed_with_prev_char = isComp2nd($code_point)"
           It returns a boolean indicating whether the character of the
           specified code point may be composed with the previous one in a
           certain composition (including Hangul Compositions, but
           excluding Composition Exclusions and Non-Starter
           Decompositions).

       "$is_exclusion = isExclusion($code_point)"
           It returns a boolean indicating whether the code point is a
           composition exclusion.

       "$is_singleton = isSingleton($code_point)"
           It returns a boolean indicating whether the code point is a
           singleton.

       "$is_non_starter_decomposition = isNonStDecomp($code_point)"
           It returns a boolean indicating whether the code point has a
           Non-Starter Decomposition.

       "$is_Full_Composition_Exclusion = isComp_Ex($code_point)"
           It returns a boolean of the derived property Comp_Ex
           (Full_Composition_Exclusion). This property is generated from
           Composition Exclusions + Singletons + Non-Starter
           Decompositions.

       "$NFD_is_NO = isNFD_NO($code_point)"
           It returns a boolean of the derived property NFD_NO
           (NFD_Quick_Check=No).

       "$NFC_is_NO = isNFC_NO($code_point)"
           It returns a boolean of the derived property NFC_NO
           (NFC_Quick_Check=No).

       "$NFC_is_MAYBE = isNFC_MAYBE($code_point)"
           It returns a boolean of the derived property NFC_MAYBE
           (NFC_Quick_Check=Maybe).

       "$NFKD_is_NO = isNFKD_NO($code_point)"
           It returns a boolean of the derived property NFKD_NO
           (NFKD_Quick_Check=No).

       "$NFKC_is_NO = isNFKC_NO($code_point)"
           It returns a boolean of the derived property NFKC_NO
           (NFKC_Quick_Check=No).

       "$NFKC_is_MAYBE = isNFKC_MAYBE($code_point)"
           It returns a boolean of the derived property NFKC_MAYBE
           (NFKC_Quick_Check=Maybe).
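
A brief sketch of a few of these character data functions. Note that they take code points (integers), not strings. The code points queried here are assumptions chosen for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(
    getCanon getCompat getComposite getCombinClass isNFC_MAYBE isSingleton
);

my $canon        = getCanon(0x00C1);              # A WITH ACUTE -> "A" + U+0301
my $no_canon     = getCanon(0x0041);              # "A" is not decomposable
my $compat       = getCompat(0xFB01);             # "fi" ligature -> "f" + "i"
my $composite    = getComposite(0x0041, 0x0301);  # "A" + U+0301
my $no_composite = getComposite(0x0042, 0x0301);  # "B" + U+0301
my $class        = getCombinClass(0x0301);        # COMBINING ACUTE ACCENT

printf "getCanon(U+00C1)             = %s\n",
       join ' ', map { sprintf 'U+%04X', ord($_) } split //, $canon;
printf "getCanon(U+0041)             = %s\n", defined $no_canon ? "defined" : "undef";
printf "getCompat(U+FB01)            = %s\n", $compat;
printf "getComposite(U+0041, U+0301) = U+%04X\n", $composite;
printf "getComposite(U+0042, U+0301) = %s\n", defined $no_composite ? "defined" : "undef";
printf "getCombinClass(U+0301)       = %d\n", $class;
printf "isNFC_MAYBE(U+0301)          = %s\n", isNFC_MAYBE(0x0301) ? "true" : "false";
printf "isSingleton(U+212B)          = %s\n", isSingleton(0x212B) ? "true" : "false";
```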

EXPORT
       "NFC", "NFD", "NFKC", "NFKD": by default.

       "normalize" and some other functions: on request.

CAVEATS
       Perl's version vs. Unicode version
           Since this module refers to perl core's Unicode database in the
           directory /lib/unicore (or formerly /lib/unicode), the Unicode
           version of normalization implemented by this module depends on
           your perl's version.

               perl's version     implemented Unicode version
                  5.6.1              3.0.1
                  5.7.2              3.1.0
                  5.7.3              3.1.1 (normalization is same as 3.1.0)
                  5.8.0              3.2.0
                5.8.1-5.8.3          4.0.0
                5.8.4-5.8.6          4.0.1 (normalization is same as 4.0.0)
                5.8.7-5.8.8          4.1.0
                  5.10.0             5.0.0
               5.8.9, 5.10.1         5.1.0
               5.12.0-5.12.3         5.2.0
                  5.14.0             6.0.0
                  5.16.0     (to be) 6.1.0
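
To find out which row applies to a given installation, the running perl can be asked for the version of its Unicode database. This sketch uses the core module Unicode::UCD, which is separate from Unicode::Normalize; the variable name is an assumption made for this example.

```perl
use strict;
use warnings;
use Unicode::UCD ();

# Version of the Unicode database bundled with the running perl.
my $unicode_version = Unicode::UCD::UnicodeVersion();

printf "perl %vd implements Unicode %s\n", $^V, $unicode_version;
```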

       Correction of decomposition mapping
           In older Unicode versions, a small number of characters (all of
           which are CJK compatibility ideographs as far as they have been
           found) may have an erroneous decomposition mapping (see
           NormalizationCorrections.txt). However, this module neither
           refers to NormalizationCorrections.txt nor provides any specific
           version of normalization. Therefore, when running on an older
           perl with an older Unicode database, this module may use the
           erroneous decomposition mapping, conforming blindly to that
           database.

       Revised definition of canonical composition
           In Unicode 4.1.0, the definition D2 of canonical composition
           (which affects NFC and NFKC) was changed (see Public Review
           Issue #29 and recent UAX #15). This module has used the newer
           definition since version 0.07 (Oct 31, 2001). This module will
           not support normalization according to the older definition,
           even if the Unicode version implemented by perl is lower than
           4.1.0.

AUTHOR
       SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

       Copyright(C) 2001-2012, SADAHIRO Tomoyuki. Japan. All rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

SEE ALSO
       http://www.unicode.org/reports/tr15/
           Unicode Normalization Forms - UAX #15

       http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt
           Composition Exclusion Table

       http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
           Derived Normalization Properties

       http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt
           Normalization Corrections

       http://www.unicode.org/review/pr-29.html
           Public Review Issue #29: Normalization Issue

       http://www.unicode.org/notes/tn5/
           Canonical Equivalence in Applications - UTN #5

perl v5.16.3                      2013-03-04           Unicode::Normalize(3pm)