Normalize(3)          User Contributed Perl Documentation          Normalize(3)

NAME
Unicode::Normalize - Unicode Normalization Forms

SYNOPSIS
(1) using function names exported by default:

    use Unicode::Normalize;

    $NFD_string  = NFD($string);  # Normalization Form D
    $NFC_string  = NFC($string);  # Normalization Form C
    $NFKD_string = NFKD($string); # Normalization Form KD
    $NFKC_string = NFKC($string); # Normalization Form KC

(2) using function names exported on request:

    use Unicode::Normalize 'normalize';

    $NFD_string  = normalize('D',  $string);  # Normalization Form D
    $NFC_string  = normalize('C',  $string);  # Normalization Form C
    $NFKD_string = normalize('KD', $string);  # Normalization Form KD
    $NFKC_string = normalize('KC', $string);  # Normalization Form KC

DESCRIPTION
Parameters:

$string is used as a string under character semantics (see
perlunicode).

$code_point should be an unsigned integer representing a Unicode code
point.

Note: The XSUB and pure Perl implementations differ in how they
interpret a $code_point that has a fractional part. XSUB converts
$code_point to an unsigned integer, but pure Perl does not. Do not
pass a floating-point number or a negative number as $code_point.

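As a minimal sketch of what "character semantics" means in practice: the functions expect decoded characters, not encoded bytes. The example below assumes the input arrives as UTF-8 bytes and uses the core Encode module to decode it first; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC);

# UTF-8 bytes for "e" followed by U+0301 COMBINING ACUTE ACCENT.
my $bytes = "e\xCC\x81";

# Decode to a character string before normalizing.
my $chars = decode('UTF-8', $bytes);
my $nfc   = NFC($chars);

printf "%d characters before NFC, %d after (U+%04X)\n",
    length($chars), length($nfc), ord($nfc);
# prints "2 characters before NFC, 1 after (U+00E9)"
```

Passing the undecoded bytes instead would treat each byte as a separate Latin-1 character, which is rarely what is intended.
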
Normalization Forms
"$NFD_string = NFD($string)"
    It returns the Normalization Form D (formed by canonical
    decomposition).

"$NFC_string = NFC($string)"
    It returns the Normalization Form C (formed by canonical
    decomposition followed by canonical composition).

"$NFKD_string = NFKD($string)"
    It returns the Normalization Form KD (formed by compatibility
    decomposition).

"$NFKC_string = NFKC($string)"
    It returns the Normalization Form KC (formed by compatibility
    decomposition followed by canonical composition).

"$FCD_string = FCD($string)"
    If the given string is in FCD ("Fast C or D" form; cf. UTN #5), it
    returns the string without modification; otherwise it returns an
    FCD string.

    Note: FCD is not always unique, so several different FCD strings
    may be canonically equivalent to each other. FCD() will return one
    of these equivalent forms.

"$FCC_string = FCC($string)"
    It returns the FCC form ("Fast C Contiguous"; cf. UTN #5).

    Note: FCC is unique, as are the four normalization forms (NF*).

"$normalized_string = normalize($form_name, $string)"
    It returns $string normalized to the form named by $form_name.

    $form_name must be one of the following names:

        'C'  or 'NFC'  for Normalization Form C  (UAX #15)
        'D'  or 'NFD'  for Normalization Form D  (UAX #15)
        'KC' or 'NFKC' for Normalization Form KC (UAX #15)
        'KD' or 'NFKD' for Normalization Form KD (UAX #15)

        'FCD'          for "Fast C or D" Form  (UTN #5)
        'FCC'          for "Fast C Contiguous" (UTN #5)

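To make the difference between the four NF* forms concrete, here is a small sketch that normalizes one sample string in each form and reports the resulting length. The sample string (U+00E9 followed by the ligature U+FB01) is chosen only for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize);

# U+00E9 LATIN SMALL LETTER E WITH ACUTE + U+FB01 LATIN SMALL LIGATURE FI
my $sample = "\x{E9}\x{FB01}";

my %length_of;
for my $form (qw(NFD NFC NFKD NFKC)) {
    $length_of{$form} = length normalize($form, $sample);
    printf "%-4s %d\n", $form, $length_of{$form};
}
# prints:
# NFD  3
# NFC  2
# NFKD 4
# NFKC 3
```

The canonical forms leave the ligature alone, while the compatibility forms (KD, KC) replace it with the two letters "f" and "i".
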
Decomposition and Composition
"$decomposed_string = decompose($string [, $useCompatMapping])"
    It returns the concatenation of the decomposition of each character
    in the string.

    If the second parameter (a boolean) is omitted or false, the
    decomposition is canonical decomposition; if it is true, the
    decomposition is compatibility decomposition.

    The string returned is not always in NFD/NFKD. Reordering may be
    required.

        $NFD_string  = reorder(decompose($string));    # equivalent to NFD()
        $NFKD_string = reorder(decompose($string, 1)); # equivalent to NFKD()

"$reordered_string = reorder($string)"
    It returns the result of reordering the combining characters
    according to Canonical Ordering Behavior.

    For example, when you have a list of NFD/NFKD strings, you can get
    the concatenated NFD/NFKD string from them with

        $concat_NFD  = reorder(join '', @NFD_strings);
        $concat_NFKD = reorder(join '', @NFKD_strings);

"$composed_string = compose($string)"
    It returns the result of canonical composition without applying any
    decomposition.

    For example, when you have an NFD/NFKD string, you can get its
    NFC/NFKC string with

        $NFC_string  = compose($NFD_string);
        $NFKC_string = compose($NFKD_string);

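The three steps above can be traced on a single string. This is only a sketch: the input (U+00E1 followed by U+0323 COMBINING DOT BELOW) is picked because its decomposition comes out in non-canonical order, and the helper code_points() is a hypothetical name introduced here for printing.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC decompose reorder compose);

# Hypothetical helper: render a string as a list of code points.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $string;
}

my $input      = "\x{E1}\x{323}";        # a-acute, then dot below
my $decomposed = decompose($input);      # acute (ccc 230) before dot below (ccc 220)
my $reordered  = reorder($decomposed);   # canonical order: dot below first
my $composed   = compose($reordered);

print "decomposed: ", code_points($decomposed), "\n";
print "reordered:  ", code_points($reordered),  "\n";
print "composed:   ", code_points($composed),   "\n";
# prints:
# decomposed: U+0061 U+0301 U+0323
# reordered:  U+0061 U+0323 U+0301
# composed:   U+1EA1 U+0301
```

Here $reordered is the same string NFD($input) would return, and $composed is the same string NFC($input) would return.
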
"($processed, $unprocessed) = splitOnLastStarter($normalized)"
    It returns two strings: the first one, $processed, is the part
    before the last starter, and the second one, $unprocessed, is the
    rest of the string, beginning with the last starter. A starter is
    a character having a combining class of zero (see UAX #15).

    Note that $processed may be empty (when $normalized contains no
    starter or starts with the last starter), and then $unprocessed
    should be equal to the entire $normalized.

    When you have a $normalized string and an $unnormalized string
    following it, a simple concatenation is wrong:

        $concat = $normalized . normalize($form, $unnormalized); # wrong!

    Instead, do this:

        ($processed, $unprocessed) = splitOnLastStarter($normalized);
        $concat = $processed . normalize($form, $unprocessed . $unnormalized);

    splitOnLastStarter() should be called with a pre-normalized
    parameter $normalized that is in the same form as the $form you
    want.

    If you have an array @string whose elements should be concatenated
    and then normalized, you can do this:

        my $result = "";
        my $unproc = "";
        foreach my $str (@string) {
            $unproc .= $str;
            my $n = normalize($form, $unproc);
            my($p, $u) = splitOnLastStarter($n);
            $result .= $p;
            $unproc  = $u;
        }
        $result .= $unproc;
        # instead of normalize($form, join('', @string))

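A short sketch of why the simple concatenation fails, assuming NFC as the target form and a second string that happens to begin with a combining mark. The sample values are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC normalize splitOnLastStarter);

my $form         = 'NFC';
my $normalized   = "cafe";        # already in NFC
my $unnormalized = "\x{301}s";    # starts with COMBINING ACUTE ACCENT

# Wrong: the accent never gets a chance to combine with the final "e".
my $wrong = $normalized . normalize($form, $unnormalized);

# Right: re-normalize from the last starter of the first string onward.
my ($processed, $unprocessed) = splitOnLastStarter($normalized);
my $right = $processed . normalize($form, $unprocessed . $unnormalized);

print "wrong is ", ($wrong eq NFC($wrong) ? "" : "not "), "in NFC\n";
print "right is ", ($right eq NFC($right) ? "" : "not "), "in NFC\n";
# prints:
# wrong is not in NFC
# right is in NFC
```

In this case splitOnLastStarter() returns ("caf", "e"), so only the final "e" is normalized again together with the new text.
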
"$processed = normalize_partial($form, $unprocessed)"
    A wrapper for the combination of normalize() and
    splitOnLastStarter(). Note that $unprocessed will be modified as a
    side effect.

    If you have an array @string whose elements should be concatenated
    and then normalized, you can do this:

        my $result = "";
        my $unproc = "";
        foreach my $str (@string) {
            $unproc .= $str;
            $result .= normalize_partial($form, $unproc);
        }
        $result .= $unproc;
        # instead of normalize($form, join('', @string))

"$processed = NFD_partial($unprocessed)"
    It behaves like "normalize_partial('NFD', $unprocessed)". Note that
    $unprocessed will be modified as a side effect.

"$processed = NFC_partial($unprocessed)"
    It behaves like "normalize_partial('NFC', $unprocessed)". Note that
    $unprocessed will be modified as a side effect.

"$processed = NFKD_partial($unprocessed)"
    It behaves like "normalize_partial('NFKD', $unprocessed)". Note that
    $unprocessed will be modified as a side effect.

"$processed = NFKC_partial($unprocessed)"
    It behaves like "normalize_partial('NFKC', $unprocessed)". Note that
    $unprocessed will be modified as a side effect.

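The *_partial functions are convenient when text arrives in pieces, for example when reading a stream block by block. The sketch below assumes the pieces are already decoded character strings; the chunk boundaries are deliberately placed between a base letter and its combining mark.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFC_partial);

# Chunk boundaries fall between base characters and combining marks.
my @chunks = ("cafe", "\x{301} ", "A", "\x{30A}ngstrom");

my $result = "";
my $unproc = "";
for my $chunk (@chunks) {
    $unproc .= $chunk;
    # Returns the part that is safe to emit; $unproc keeps the rest.
    $result .= NFC_partial($unproc);
}
$result .= $unproc;    # flush whatever is left after the last chunk

print "incremental result ",
    ($result eq NFC(join '', @chunks) ? "matches" : "differs from"),
    " one-shot NFC\n";
# prints "incremental result matches one-shot NFC"
```

Because the trailing part after the last starter is held back on every call, a combining mark at the start of the next chunk can still compose with it.
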
Quick Check
(see Annex 8, UAX #15; and lib/unicore/DerivedNormalizationProps.txt)

The following functions check whether the string is in that
normalization form.

The result returned will be one of the following:

    YES     The string is in that normalization form.
    NO      The string is not in that normalization form.
    MAYBE   Dubious. Maybe yes, maybe no.

"$result = checkNFD($string)"
    It returns true (1) if "YES"; false ("empty string") if "NO".

"$result = checkNFC($string)"
    It returns true (1) if "YES"; false ("empty string") if "NO";
    "undef" if "MAYBE".

"$result = checkNFKD($string)"
    It returns true (1) if "YES"; false ("empty string") if "NO".

"$result = checkNFKC($string)"
    It returns true (1) if "YES"; false ("empty string") if "NO";
    "undef" if "MAYBE".

"$result = checkFCD($string)"
    It returns true (1) if "YES"; false ("empty string") if "NO".

"$result = checkFCC($string)"
    It returns true (1) if "YES"; false ("empty string") if "NO";
    "undef" if "MAYBE".

    Note: If a string is not in FCD, it cannot be in FCC. So
    checkFCC($not_FCD_string) should return "NO".

"$result = check($form_name, $string)"
    It returns true (1) if "YES"; false ("empty string") if "NO";
    "undef" if "MAYBE".

    $form_name must be one of the following names:

        'C'  or 'NFC'  for Normalization Form C  (UAX #15)
        'D'  or 'NFD'  for Normalization Form D  (UAX #15)
        'KC' or 'NFKC' for Normalization Form KC (UAX #15)
        'KD' or 'NFKD' for Normalization Form KD (UAX #15)

        'FCD'          for "Fast C or D" Form  (UTN #5)
        'FCC'          for "Fast C Contiguous" (UTN #5)

Note

In the cases of NFD, NFKD, and FCD, the answer must be either "YES" or
"NO". The answer "MAYBE" may be returned in the cases of NFC, NFKC, and
FCC.

A "MAYBE" string should contain at least one combining character or the
like. For example, "COMBINING ACUTE ACCENT" has the
MAYBE_NFC/MAYBE_NFKC property.

Both "checkNFC("A\N{COMBINING ACUTE ACCENT}")" and
"checkNFC("B\N{COMBINING ACUTE ACCENT}")" will return "MAYBE".
"A\N{COMBINING ACUTE ACCENT}" is not in NFC (its NFC is "\N{LATIN
CAPITAL LETTER A WITH ACUTE}"), while "B\N{COMBINING ACUTE ACCENT}" is
in NFC.

If you need an exact answer, compare the string with its NFC/NFKC/FCC.

    if ($string eq NFC($string)) {
        # $string is exactly normalized in NFC;
    } else {
        # $string is not normalized in NFC;
    }

    if ($string eq NFKC($string)) {
        # $string is exactly normalized in NFKC;
    } else {
        # $string is not normalized in NFKC;
    }

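The quick check and the exact comparison can be combined: use the quick check first and fall back to a full normalization only on "MAYBE". The following is a sketch of that idea; the function name is_nfc() is hypothetical and not part of this module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC checkNFC);

# Hypothetical helper: returns 1 if $string is in NFC, 0 otherwise.
sub is_nfc {
    my ($string) = @_;
    my $quick = checkNFC($string);            # true, false, or undef (MAYBE)
    return $quick ? 1 : 0 if defined $quick;  # definite YES or NO
    return $string eq NFC($string) ? 1 : 0;   # MAYBE: compare exactly
}

printf "A + acute: %d\n", is_nfc("A\x{301}");   # MAYBE, then exact: not NFC
printf "B + acute: %d\n", is_nfc("B\x{301}");   # MAYBE, then exact: NFC
printf "A-acute:   %d\n", is_nfc("\x{C1}");     # quick YES
# prints:
# A + acute: 0
# B + acute: 1
# A-acute:   1
```
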
Character Data
These functions are an interface to the character data used
internally. If you only want to get Unicode normalization forms, you
do not need to call them yourself.

"$canonical_decomposition = getCanon($code_point)"
    If the character is canonically decomposable (including Hangul
    Syllables), it returns the (full) canonical decomposition as a
    string. Otherwise it returns "undef".

    Note: According to the Unicode standard, the canonical
    decomposition of a character that is not canonically decomposable
    is the same as the character itself.

"$compatibility_decomposition = getCompat($code_point)"
    If the character is compatibility decomposable (including Hangul
    Syllables), it returns the (full) compatibility decomposition as a
    string. Otherwise it returns "undef".

    Note: According to the Unicode standard, the compatibility
    decomposition of a character that is not compatibility
    decomposable is the same as the character itself.

"$code_point_composite = getComposite($code_point_here,
$code_point_next)"
    If two characters here and next (as code points) are composable
    (including Hangul Jamo/Syllables and Composition Exclusions), it
    returns the code point of the composite.

    If they are not composable, it returns "undef".

"$combining_class = getCombinClass($code_point)"
    It returns the combining class (as an integer) of the character.

"$may_be_composed_with_prev_char = isComp2nd($code_point)"
    It returns a boolean indicating whether the character of the
    specified code point may be composed with the previous one in a
    certain composition (including Hangul Compositions, but excluding
    Composition Exclusions and Non-Starter Decompositions).

"$is_exclusion = isExclusion($code_point)"
    It returns a boolean indicating whether the code point is a
    composition exclusion.

"$is_singleton = isSingleton($code_point)"
    It returns a boolean indicating whether the code point is a
    singleton.

"$is_non_starter_decomposition = isNonStDecomp($code_point)"
    It returns a boolean indicating whether the code point has a
    Non-Starter Decomposition.

"$is_Full_Composition_Exclusion = isComp_Ex($code_point)"
    It returns a boolean of the derived property Comp_Ex
    (Full_Composition_Exclusion). This property is generated from
    Composition Exclusions + Singletons + Non-Starter Decompositions.

"$NFD_is_NO = isNFD_NO($code_point)"
    It returns a boolean of the derived property NFD_NO
    (NFD_Quick_Check=No).

"$NFC_is_NO = isNFC_NO($code_point)"
    It returns a boolean of the derived property NFC_NO
    (NFC_Quick_Check=No).

"$NFC_is_MAYBE = isNFC_MAYBE($code_point)"
    It returns a boolean of the derived property NFC_MAYBE
    (NFC_Quick_Check=Maybe).

"$NFKD_is_NO = isNFKD_NO($code_point)"
    It returns a boolean of the derived property NFKD_NO
    (NFKD_Quick_Check=No).

"$NFKC_is_NO = isNFKC_NO($code_point)"
    It returns a boolean of the derived property NFKC_NO
    (NFKC_Quick_Check=No).

"$NFKC_is_MAYBE = isNFKC_MAYBE($code_point)"
    It returns a boolean of the derived property NFKC_MAYBE
    (NFKC_Quick_Check=Maybe).

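For illustration, a few of these lookups applied to well-known code points. This is a sketch only; note that the functions take code points (integers), not strings, and the helper code_points() is a hypothetical name used for printing.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCanon getCompat getComposite getCombinClass
                          isSingleton isNFC_MAYBE);

# Hypothetical helper: render a string as a list of code points.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $string;
}

my $canon     = getCanon(0x00E9);             # e-acute decomposes to e + acute
my $no_canon  = getCanon(0x0041);             # "A" is not decomposable: undef
my $compat    = getCompat(0xFB01);            # ligature fi becomes "f" + "i"
my $composite = getComposite(0x0065, 0x0301); # e + acute composes to U+00E9
my $class     = getCombinClass(0x0301);       # combining class of the acute

print "getCanon(U+00E9):       ", code_points($canon), "\n";
print "getCanon(U+0041):       ", defined $no_canon ? "defined" : "undef", "\n";
print "getCompat(U+FB01):      $compat\n";
printf "getComposite(e, acute): U+%04X\n", $composite;
print "getCombinClass(U+0301): $class\n";
print "U+212B is a singleton\n"  if isSingleton(0x212B);   # ANGSTROM SIGN
print "U+0301 is NFC_MAYBE\n"    if isNFC_MAYBE(0x0301);
```
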
EXPORT
"NFC", "NFD", "NFKC", "NFKD": by default.

"normalize" and some other functions: on request.

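As a small sketch of the two export modes: giving an explicit import list requests the named functions (and, as usual with Perl's Exporter, suppresses the default imports), while any function can still be called by its fully qualified name.

```perl
use strict;
use warnings;

# Explicit list: only normalize() and check() are imported here.
use Unicode::Normalize qw(normalize check);

my $input = "e\x{301}";
my $nfc   = normalize('NFC', $input);

# Functions that were not imported remain reachable by package name.
my $nfd = Unicode::Normalize::NFD($nfc);

print "round trip ", ($nfd eq $input ? "ok" : "failed"), "\n";
print "NFC result passes check('NFC')\n" if check('NFC', $nfc);
```
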
CAVEATS
Perl's version vs. Unicode version
    Since this module refers to perl core's Unicode database in the
    directory /lib/unicore (or formerly /lib/unicode), the Unicode
    version of normalization implemented by this module depends on what
    has been compiled into your perl. The following table lists the
    default Unicode version that comes with various perl versions. (It
    is possible to change the Unicode version in any perl version to be
    any earlier Unicode version, so one could cause Unicode 3.2 to be
    used in any perl version starting with 5.8.0. Read
    $Config{privlib}/unicore/README.perl for details.)

        perl's version     implemented Unicode version
           5.6.1              3.0.1
           5.7.2              3.1.0
           5.7.3              3.1.1 (normalization is same as 3.1.0)
           5.8.0              3.2.0
         5.8.1-5.8.3          4.0.0
         5.8.4-5.8.6          4.0.1 (normalization is same as 4.0.0)
         5.8.7-5.8.8          4.1.0
           5.10.0             5.0.0
        5.8.9, 5.10.1         5.1.0
           5.12.x             5.2.0
           5.14.x             6.0.0
           5.16.x             6.1.0
           5.18.x             6.2.0
           5.20.x             6.3.0
           5.22.x             7.0.0

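The table stops at perl 5.22, so for other perls it may be easier to ask the running interpreter. A minimal sketch, assuming the core module Unicode::UCD is available (it is not part of Unicode::Normalize):

```perl
use strict;
use warnings;
use Unicode::UCD ();
use Unicode::Normalize ();

# Version of the Unicode database compiled into this perl.
my $unicode_version = Unicode::UCD::UnicodeVersion();

printf "perl %vd, Unicode %s, Unicode::Normalize %s\n",
    $^V, $unicode_version, Unicode::Normalize->VERSION;
```
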
Correction of decomposition mapping
    In older Unicode versions, a small number of characters (all of
    which are CJK compatibility ideographs, as far as is known) may
    have an erroneous decomposition mapping (see
    lib/unicore/NormalizationCorrections.txt). However, this module
    will neither refer to lib/unicore/NormalizationCorrections.txt nor
    provide any specific version of normalization. Therefore this
    module, running on an older perl with an older Unicode database,
    may use the erroneous decomposition mapping, blindly conforming to
    the Unicode database.

Revised definition of canonical composition
    In Unicode 4.1.0, the definition D2 of canonical composition (which
    affects NFC and NFKC) was changed (see Public Review Issue #29
    and recent UAX #15). This module has used the newer definition
    since version 0.07 (Oct 31, 2001). This module will not support
    normalization according to the older definition, even if the
    Unicode version implemented by perl is lower than 4.1.0.

AUTHOR
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

Currently maintained by <perl5-porters@perl.org>

Copyright(C) 2001-2012, SADAHIRO Tomoyuki. Japan. All rights reserved.

LICENSE
This module is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

SEE ALSO
<http://www.unicode.org/reports/tr15/>
    Unicode Normalization Forms - UAX #15

<http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt>
    Composition Exclusion Table

<http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt>
    Derived Normalization Properties

<http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt>
    Normalization Corrections

<http://www.unicode.org/review/pr-29.html>
    Public Review Issue #29: Normalization Issue

<http://www.unicode.org/notes/tn5/>
    Canonical Equivalence in Applications - UTN #5

perl v5.38.0                      2023-07-21                      Normalize(3)