Unicode::Normalize(3pm)  Perl Programmers Reference Guide  Unicode::Normalize(3pm)

NAME
    Unicode::Normalize - Unicode Normalization Forms

SYNOPSIS
    (1) using function names exported by default:

        use Unicode::Normalize;

        $NFD_string  = NFD($string);  # Normalization Form D
        $NFC_string  = NFC($string);  # Normalization Form C
        $NFKD_string = NFKD($string); # Normalization Form KD
        $NFKC_string = NFKC($string); # Normalization Form KC

    (2) using function names exported on request:

        use Unicode::Normalize 'normalize';

        $NFD_string  = normalize('D',  $string);  # Normalization Form D
        $NFC_string  = normalize('C',  $string);  # Normalization Form C
        $NFKD_string = normalize('KD', $string);  # Normalization Form KD
        $NFKC_string = normalize('KC', $string);  # Normalization Form KC

DESCRIPTION
    Parameters:

    $string is used as a string under character semantics (see
    perlunicode).

    $codepoint should be an unsigned integer representing a Unicode code
    point.

    Note: The XSUB and pure Perl implementations are incompatible in how
    they interpret a $codepoint given as a decimal number. XSUB converts
    $codepoint to an unsigned integer, but pure Perl does not. Do not
    pass a floating-point number or a negative number as $codepoint.
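As a minimal sketch of the advice above, a code point can be written as an integer literal or taken from a character with ord(), so that both implementations receive a plain non-negative integer. The helper name to_codepoint is hypothetical and is not part of Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCombinClass);

# Hypothetical helper (not part of Unicode::Normalize): reject anything
# that is not a plain non-negative integer before using it as $codepoint.
sub to_codepoint {
    my ($value) = @_;
    die "not a non-negative integer\n"
        unless defined $value && $value =~ /\A[0-9]+\z/;
    return $value + 0;
}

my $from_literal = to_codepoint(0x0301);           # integer literal
my $from_char    = to_codepoint(ord("\x{0301}"));  # taken from a character

printf "U+%04X has combining class %d\n",
    $from_literal, getCombinClass($from_literal);
```

Validating once at the boundary keeps the behaviour the same whether the XSUB or the pure Perl implementation is loaded.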

  Normalization Forms
    "$NFD_string = NFD($string)"
        Returns the Normalization Form D (formed by canonical
        decomposition).

    "$NFC_string = NFC($string)"
        Returns the Normalization Form C (formed by canonical
        decomposition followed by canonical composition).

    "$NFKD_string = NFKD($string)"
        Returns the Normalization Form KD (formed by compatibility
        decomposition).

    "$NFKC_string = NFKC($string)"
        Returns the Normalization Form KC (formed by compatibility
        decomposition followed by canonical composition).

    "$FCD_string = FCD($string)"
        If the given string is in FCD ("Fast C or D" form; cf. UTN #5),
        returns it without modification; otherwise returns an FCD string.

        Note: FCD is not always unique, so several different FCD strings
        may be equivalent to each other. "FCD()" will return one of
        these equivalent forms.

    "$FCC_string = FCC($string)"
        Returns the FCC form ("Fast C Contiguous"; cf. UTN #5).

        Note: FCC is unique, as are the four normalization forms (NF*).

    "$normalized_string = normalize($form_name, $string)"
        $form_name must be one of the following names.

            'C'  or 'NFC'   for Normalization Form C  (UAX #15)
            'D'  or 'NFD'   for Normalization Form D  (UAX #15)
            'KC' or 'NFKC'  for Normalization Form KC (UAX #15)
            'KD' or 'NFKD'  for Normalization Form KD (UAX #15)

            'FCD'           for "Fast C or D" Form  (UTN #5)
            'FCC'           for "Fast C Contiguous" (UTN #5)
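The four normalization forms described above can be illustrated with two well-known characters. This is only a sketch: the helper code_points is hypothetical and exists just to keep the output ASCII.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC normalize);

# Hypothetical helper: render a string as a list of code points.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $string;
}

my $e_acute  = "\x{00E9}";  # LATIN SMALL LETTER E WITH ACUTE
my $ligature = "\x{FB01}";  # LATIN SMALL LIGATURE FI

my %result = (
    NFD  => NFD($e_acute),      # canonical decomposition: e + U+0301
    NFC  => NFC("e\x{0301}"),   # recomposed into U+00E9
    NFKD => NFKD($ligature),    # compatibility decomposition: "fi"
    NFKC => NFKC($ligature),    # "fi" has nothing to recompose
);

for my $form (sort keys %result) {
    printf "%-4s %s\n", $form, code_points($result{$form});
}

# normalize() accepts either the short or the long form name.
my $same = normalize('D', $e_acute) eq normalize('NFD', $e_acute);
print "normalize('D') and normalize('NFD') agree\n" if $same;
```

Note that the ligature survives NFD and NFC unchanged, because it has only a compatibility decomposition, not a canonical one.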

  Decomposition and Composition
    "$decomposed_string = decompose($string)"
    "$decomposed_string = decompose($string, $useCompatMapping)"
        Decomposes the specified string and returns the result.

        If the second parameter (a boolean) is omitted or false, the
        string is decomposed using the Canonical Decomposition Mapping.
        If it is true, the Compatibility Decomposition Mapping is used.

        The string returned is not always in NFD/NFKD. Reordering may be
        required.

            $NFD_string  = reorder(decompose($string));    # eq. to NFD()
            $NFKD_string = reorder(decompose($string, 1)); # eq. to NFKD()

    "$reordered_string = reorder($string)"
        Reorders the combining characters and the like into the
        canonical ordering and returns the result.

        E.g., when you have a list of NFD/NFKD strings, you can get the
        concatenated NFD/NFKD string from them with

            $concat_NFD  = reorder(join '', @NFD_strings);
            $concat_NFKD = reorder(join '', @NFKD_strings);

    "$composed_string = compose($string)"
        Returns the string in which composable pairs are composed.

        E.g., when you have an NFD/NFKD string, you can get its NFC/NFKC
        string with

            $NFC_string  = compose($NFD_string);
            $NFKC_string = compose($NFKD_string);
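The three steps above can be followed on a string where decomposition alone is not enough. The sketch below uses the d-with-dot-above plus dot-below example familiar from UAX #15; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC decompose reorder compose);

# U+1E0B (d with dot above) followed by U+0323 (combining dot below).
my $string = "\x{1E0B}\x{0323}";

# Decomposition alone leaves the marks in their original order:
# d, U+0307 (combining class 230), U+0323 (combining class 220).
my $decomposed = decompose($string);

# Canonical reordering sorts the adjacent marks by combining class.
my $reordered = reorder($decomposed);

# Composition then merges d + U+0323 into U+1E0D; U+0307 stays separate.
my $composed = compose($reordered);

print "decompose() alone is not NFD here\n"    if $decomposed ne NFD($string);
print "reorder(decompose()) matches NFD()\n"   if $reordered  eq NFD($string);
print "compose() of the NFD matches NFC()\n"   if $composed   eq NFC($string);

# Joining NFD strings can break canonical order at the seam,
# so the concatenation is reordered once more.
my @nfd_strings = ("a\x{0301}", "\x{0323}b");
my $concat_nfd  = reorder(join '', @nfd_strings);
```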

  Quick Check
    (see Annex 8, UAX #15; and DerivedNormalizationProps.txt)

    The following functions check whether the string is in the named
    normalization form.

    The result returned will be:

        YES     The string is in that normalization form.
        NO      The string is not in that normalization form.
        MAYBE   Dubious. Maybe yes, maybe no.

    "$result = checkNFD($string)"
        Returns true (1) if "YES"; false (an empty string) if "NO".

    "$result = checkNFC($string)"
        Returns true (1) if "YES"; false (an empty string) if "NO";
        "undef" if "MAYBE".

    "$result = checkNFKD($string)"
        Returns true (1) if "YES"; false (an empty string) if "NO".

    "$result = checkNFKC($string)"
        Returns true (1) if "YES"; false (an empty string) if "NO";
        "undef" if "MAYBE".

    "$result = checkFCD($string)"
        Returns true (1) if "YES"; false (an empty string) if "NO".

    "$result = checkFCC($string)"
        Returns true (1) if "YES"; false (an empty string) if "NO";
        "undef" if "MAYBE".

        If a string is not in FCD, it cannot be in FCC either. So
        "checkFCC($not_FCD_string)" should return "NO".

    "$result = check($form_name, $string)"
        Returns true (1) if "YES"; false (an empty string) if "NO";
        "undef" if "MAYBE".

        $form_name must be one of the following names.

            'C'  or 'NFC'   for Normalization Form C  (UAX #15)
            'D'  or 'NFD'   for Normalization Form D  (UAX #15)
            'KC' or 'NFKC'  for Normalization Form KC (UAX #15)
            'KD' or 'NFKD'  for Normalization Form KD (UAX #15)

            'FCD'           for "Fast C or D" Form  (UTN #5)
            'FCC'           for "Fast C Contiguous" (UTN #5)
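Because the quick-check functions have three possible outcomes, a plain boolean test cannot tell "NO" from "MAYBE". The following sketch shows one way to distinguish them with defined(); the helper name describe is hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(checkNFD checkNFC check);

# Hypothetical helper: map the three possible return values onto the
# names used in the table above.
sub describe {
    my ($result) = @_;
    return 'MAYBE' unless defined $result;   # undef
    return $result ? 'YES' : 'NO';           # 1 or empty string
}

my $precomposed = "\x{00E9}";    # e with acute as one code point
my $decomposed  = "e\x{0301}";   # e followed by COMBINING ACUTE ACCENT

printf "checkNFD(precomposed): %s\n", describe(checkNFD($precomposed));
printf "checkNFD(decomposed):  %s\n", describe(checkNFD($decomposed));
printf "checkNFC(precomposed): %s\n", describe(checkNFC($precomposed));
printf "checkNFC(decomposed):  %s\n", describe(checkNFC($decomposed));

# check() takes the form name as its first argument.
printf "check('NFC', decomposed): %s\n", describe(check('NFC', $decomposed));
```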

  Note
    In the cases of NFD, NFKD, and FCD, the answer is always either
    "YES" or "NO". The answer "MAYBE" may be returned only in the cases
    of NFC, NFKC, and FCC.

    A "MAYBE" string contains at least one combining character or the
    like. For example, "COMBINING ACUTE ACCENT" has the
    MAYBE_NFC/MAYBE_NFKC property.

    Both "checkNFC("A\N{COMBINING ACUTE ACCENT}")" and
    "checkNFC("B\N{COMBINING ACUTE ACCENT}")" will return "MAYBE".
    "A\N{COMBINING ACUTE ACCENT}" is not in NFC (its NFC is
    "\N{LATIN CAPITAL LETTER A WITH ACUTE}"), while
    "B\N{COMBINING ACUTE ACCENT}" is in NFC.

    If you want an exact answer, compare the string with its
    NFC/NFKC/FCC.

        if ($string eq NFC($string)) {
            # $string is exactly normalized in NFC;
        } else {
            # $string is not normalized in NFC;
        }

        if ($string eq NFKC($string)) {
            # $string is exactly normalized in NFKC;
        } else {
            # $string is not normalized in NFKC;
        }
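The quick check and the exact comparison can be combined so that full normalization is attempted only when the quick check answers "MAYBE". This is a sketch of that idea, not something the module provides; the helper name is_nfc is hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC checkNFC);

# Hypothetical helper: give a definite answer on whether $string is in
# NFC, normalizing only when the quick check returns undef ("MAYBE").
sub is_nfc {
    my ($string) = @_;
    my $quick = checkNFC($string);
    return $quick ? 1 : 0 if defined $quick;   # definite YES or NO
    return $string eq NFC($string) ? 1 : 0;    # MAYBE: compare exactly
}

for my $sample ("A\x{0301}", "B\x{0301}", "\x{00C1}") {
    printf "%d code point(s): %s\n",
        length $sample, is_nfc($sample) ? 'in NFC' : 'not in NFC';
}
```

The same pattern should carry over to NFKC and FCC by swapping in the corresponding check and normalization functions.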

  Character Data
    These functions are interfaces to the character data used
    internally. If you only want to get Unicode normalization forms, you
    do not need to call them yourself.

    "$canonical_decomposed = getCanon($codepoint)"
        If the character of the specified codepoint is canonically
        decomposable (including Hangul Syllables), returns the
        completely decomposed string canonically equivalent to it.

        If it is not decomposable, returns "undef".

    "$compatibility_decomposed = getCompat($codepoint)"
        If the character of the specified codepoint is compatibility
        decomposable (including Hangul Syllables), returns the
        completely decomposed string compatibility equivalent to it.

        If it is not decomposable, returns "undef".

    "$codepoint_composite = getComposite($codepoint_here, $codepoint_next)"
        If the two characters here and next (as codepoints) are
        composable (including Hangul Jamo/Syllables and Composition
        Exclusions), returns the codepoint of the composite.

        If they are not composable, returns "undef".

    "$combining_class = getCombinClass($codepoint)"
        Returns the combining class of the character as an integer.

    "$is_exclusion = isExclusion($codepoint)"
        Returns a boolean indicating whether the character of the
        specified codepoint is a composition exclusion.

    "$is_singleton = isSingleton($codepoint)"
        Returns a boolean indicating whether the character of the
        specified codepoint is a singleton.

    "$is_non_starter_decomposition = isNonStDecomp($codepoint)"
        Returns a boolean indicating whether the canonical decomposition
        of the character of the specified codepoint is a Non-Starter
        Decomposition.

    "$may_be_composed_with_prev_char = isComp2nd($codepoint)"
        Returns a boolean indicating whether the character of the
        specified codepoint may be composed with the previous one in a
        certain composition (including Hangul Compositions, but
        excluding Composition Exclusions and Non-Starter
        Decompositions).
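For illustration, the character data functions above can be probed with a few well-known code points. The choice of sample characters and the helper code_points are assumptions made for this sketch only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(
    getCanon getCompat getComposite getCombinClass
    isExclusion isSingleton isNonStDecomp isComp2nd
);

# Hypothetical helper: render a string as a list of code points.
sub code_points {
    my ($string) = @_;
    return 'undef' unless defined $string;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $string;
}

# Decomposition data: U+00C5 is A WITH RING ABOVE, U+FB01 is LIGATURE FI,
# U+0041 is plain A (not decomposable, so getCanon returns undef).
printf "getCanon(U+00C5)  = %s\n", code_points(getCanon(0x00C5));
printf "getCanon(U+0041)  = %s\n", code_points(getCanon(0x0041));
printf "getCompat(U+FB01) = %s\n", code_points(getCompat(0xFB01));

# Composition data: A + COMBINING RING ABOVE compose to a single code point.
my $composite = getComposite(0x0041, 0x030A);
printf "getComposite(U+0041, U+030A) = U+%04X\n", $composite
    if defined $composite;

# Combining class of COMBINING ACUTE ACCENT.
printf "getCombinClass(U+0301) = %d\n", getCombinClass(0x0301);

# Boolean properties, each shown on a character that has the property.
printf "isExclusion(U+0958)   = %s\n", isExclusion(0x0958)   ? 'yes' : 'no';
printf "isSingleton(U+212B)   = %s\n", isSingleton(0x212B)   ? 'yes' : 'no';
printf "isNonStDecomp(U+0344) = %s\n", isNonStDecomp(0x0344) ? 'yes' : 'no';
printf "isComp2nd(U+0301)     = %s\n", isComp2nd(0x0301)     ? 'yes' : 'no';
```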

EXPORT
    "NFC", "NFD", "NFKC", "NFKD": by default.

    "normalize" and some other functions: on request.
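A short sketch of importing on request follows. It relies on standard Exporter behaviour, under which an explicit import list replaces the default list, so any default name that is still wanted has to be repeated.

```perl
use strict;
use warnings;

# NFC is a default export, but it must be listed again here because an
# explicit list replaces the defaults; normalize and checkNFC are
# available only on request.
use Unicode::Normalize qw(NFC normalize checkNFC);

my $nfc_by_function = NFC("e\x{0301}");
my $nfc_by_name     = normalize('C', "e\x{0301}");

print "both spellings give the same string\n"
    if $nfc_by_function eq $nfc_by_name;

# A name that was not imported can still be called fully qualified.
my $nfd = Unicode::Normalize::NFD($nfc_by_function);
```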

CAVEATS
    Perl's version vs. Unicode version
        Since this module refers to perl core's Unicode database in the
        directory /lib/unicore (or formerly /lib/unicode), the Unicode
        version of normalization implemented by this module depends on
        your perl's version.

            perl's version        implemented Unicode version
            5.6.1                 3.0.1
            5.7.2                 3.1.0
            5.7.3                 3.1.1 (same normalized form as that of 3.1.0)
            5.8.0                 3.2.0
            5.8.1-5.8.3           4.0.0
            5.8.4-5.8.6 (latest)  4.0.1 (same normalized form as that of 4.0.0)

    Correction of decomposition mapping
        In older Unicode versions, a small number of characters (all of
        which are CJK compatibility ideographs, as far as they have been
        found) may have an erroneous decomposition mapping (see
        NormalizationCorrections.txt). However, this module neither
        refers to NormalizationCorrections.txt nor provides any specific
        version of normalization. Therefore this module, running on an
        older perl with an older Unicode database, may use the erroneous
        decomposition mapping, blindly conforming to the Unicode
        database.

    Revised definition of canonical composition
        In Unicode 4.1.0, the definition D2 of canonical composition
        (which affects NFC and NFKC) was changed (see Public Review
        Issue #29 and recent UAX #15). This module has used the newer
        definition since version 0.07 (Oct 31, 2001). This module does
        not support normalization according to the older definition,
        even if the Unicode version implemented by perl is lower than
        4.1.0.
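Since the results depend on the Unicode database shipped with perl, it can help to report the versions in use. The sketch below assumes the core module Unicode::UCD is available (it is not mentioned in this page) and uses its UnicodeVersion() function.

```perl
use strict;
use warnings;
use Unicode::Normalize ();
use Unicode::UCD ();

# Version of the Unicode database that this perl was built with.
my $unicode_version = Unicode::UCD::UnicodeVersion();

# Version of the Unicode::Normalize module itself.
my $module_version = Unicode::Normalize->VERSION;

printf "perl %vd, Unicode %s, Unicode::Normalize %s\n",
    $^V, $unicode_version, $module_version;
```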

AUTHOR
    SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

    Copyright(C) 2001-2005, SADAHIRO Tomoyuki. Japan. All rights reserved.

    This module is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.

SEE ALSO
    http://www.unicode.org/reports/tr15/
        Unicode Normalization Forms - UAX #15

    http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
        Derived Normalization Properties

    http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt
        Normalization Corrections

    http://www.unicode.org/review/pr-29.html
        Public Review Issue #29: Normalization Issue

    http://www.unicode.org/notes/tn5/
        Canonical Equivalence in Applications - UTN #5

perl v5.8.8                       2001-09-21           Unicode::Normalize(3pm)