Unicode::Normalize(3pm)      Perl Programmers Reference Guide      Unicode::Normalize(3pm)

NAME

       Unicode::Normalize - Unicode Normalization Forms

SYNOPSIS

       (1) using function names exported by default:

         use Unicode::Normalize;

         $NFD_string  = NFD($string);  # Normalization Form D
         $NFC_string  = NFC($string);  # Normalization Form C
         $NFKD_string = NFKD($string); # Normalization Form KD
         $NFKC_string = NFKC($string); # Normalization Form KC

       (2) using function names exported on request:

         use Unicode::Normalize 'normalize';

         $NFD_string  = normalize('D',  $string);  # Normalization Form D
         $NFC_string  = normalize('C',  $string);  # Normalization Form C
         $NFKD_string = normalize('KD', $string);  # Normalization Form KD
         $NFKC_string = normalize('KC', $string);  # Normalization Form KC

DESCRIPTION

       Parameters:

       $string is used as a string under character semantics (see
       perlunicode).

       $codepoint should be an unsigned integer representing a Unicode code
       point.

       Note: The XSUB and pure Perl implementations differ in how they
       interpret $codepoint as a decimal number.  XSUB converts $codepoint
       to an unsigned integer, but pure Perl does not.  Do not use a
       floating-point value or a negative sign in $codepoint.
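
       As a small sketch of the advice above, the following passes only
       plain non-negative integers as $codepoint. It uses getCanon(), which
       is described under "Character Data"; the variable names are
       illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCanon);

# Three ways to obtain a non-negative integer code point for U+00E9.
my $from_literal = 0x00E9;          # hexadecimal literal
my $from_ord     = ord("\x{E9}");   # code point of an existing character
my $from_text    = hex("00E9");     # parsed from "U+00E9"-style text

# All three are the same integer, so getCanon() receives the same argument.
my @decomposed = map { getCanon($_) } ($from_literal, $from_ord, $from_text);
```

       Each element of @decomposed is the canonical decomposition of
       U+00E9, that is, "e" followed by U+0301.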

       Normalization Forms

       "$NFD_string = NFD($string)"
           returns the Normalization Form D (formed by canonical
           decomposition).

       "$NFC_string = NFC($string)"
           returns the Normalization Form C (formed by canonical decomposition
           followed by canonical composition).

       "$NFKD_string = NFKD($string)"
           returns the Normalization Form KD (formed by compatibility
           decomposition).

       "$NFKC_string = NFKC($string)"
           returns the Normalization Form KC (formed by compatibility
           decomposition followed by canonical composition).

       "$FCD_string = FCD($string)"
           If the given string is in FCD ("Fast C or D" form; cf. UTN #5),
           returns it without modification; otherwise returns an FCD string.

           Note: FCD is not always unique, so several different forms may be
           equivalent to each other. "FCD()" will return one of these
           equivalent forms.

       "$FCC_string = FCC($string)"
           returns the FCC form ("Fast C Contiguous"; cf. UTN #5).

           Note: FCC is unique, as are the four normalization forms (NF*).

       "$normalized_string = normalize($form_name, $string)"
           As $form_name, one of the following names must be given.

             'C'  or 'NFC'  for Normalization Form C  (UAX #15)
             'D'  or 'NFD'  for Normalization Form D  (UAX #15)
             'KC' or 'NFKC' for Normalization Form KC (UAX #15)
             'KD' or 'NFKD' for Normalization Form KD (UAX #15)

             'FCD'          for "Fast C or D" Form  (UTN #5)
             'FCC'          for "Fast C Contiguous" (UTN #5)
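
       The forms above can be compared on one short input. This is only a
       sketch: the sample string (U+00E9 followed by the ligature U+FB01)
       and the variable names are chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC FCD FCC normalize);

# LATIN SMALL LETTER E WITH ACUTE followed by LATIN SMALL LIGATURE FI.
my $string = "\x{E9}\x{FB01}";

my $nfd  = NFD($string);    # "e", U+0301, U+FB01: canonical decomposition only
my $nfc  = NFC($string);    # U+00E9, U+FB01: recomposed, ligature kept
my $nfkd = NFKD($string);   # "e", U+0301, "f", "i": ligature split as well
my $nfkc = NFKC($string);   # U+00E9, "f", "i"

# normalize() accepts either the short or the long form name.
my $same_as_nfd = normalize('D',   $string);
my $same_as_nfc = normalize('NFC', $string);

# FCD and FCC results stay canonically equivalent to the input, so
# bringing them to NFD gives the same string as NFD($string).
my $fcd = FCD($string);
my $fcc = FCC($string);
```

       Only the compatibility forms (NFKD, NFKC) replace the ligature with
       the two letters "f" and "i"; the canonical forms leave it alone.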

       Decomposition and Composition

       "$decomposed_string = decompose($string)"
       "$decomposed_string = decompose($string, $useCompatMapping)"
           Decomposes the specified string and returns the result.

           If the second parameter (a boolean) is omitted or false, decomposes
           it using the Canonical Decomposition Mapping.  If true, decomposes
           it using the Compatibility Decomposition Mapping.

           The string returned is not always in NFD/NFKD.  Reordering may be
           required.

               $NFD_string  = reorder(decompose($string));    # eq. to NFD()
               $NFKD_string = reorder(decompose($string, 1)); # eq. to NFKD()

       "$reordered_string  = reorder($string)"
           Reorders the combining characters and the like into the canonical
           ordering and returns the result.

           E.g., when you have a list of NFD/NFKD strings, you can get the
           concatenated NFD/NFKD string from them with

               $concat_NFD  = reorder(join '', @NFD_strings);
               $concat_NFKD = reorder(join '', @NFKD_strings);

       "$composed_string   = compose($string)"
           Returns the string where composable pairs are composed.

           E.g., when you have an NFD/NFKD string, you can get its NFC/NFKC
           string with

               $NFC_string  = compose($NFD_string);
               $NFKC_string = compose($NFKD_string);
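
       The three steps can be traced on an input where decomposition alone
       is not enough. The sketch below assumes the sample string U+1E0B
       (d with dot above) followed by U+0323 (combining dot below); the
       variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC decompose reorder compose);

# LATIN SMALL LETTER D WITH DOT ABOVE followed by COMBINING DOT BELOW.
my $string = "\x{1E0B}\x{323}";

# Step 1: decomposition alone leaves the marks in their original order:
# "d", U+0307 (combining class 230), U+0323 (combining class 220).
my $decomposed = decompose($string);

# Step 2: reordering sorts the marks by combining class:
# "d", U+0323, U+0307.
my $reordered = reorder($decomposed);

# Step 3: composing the reordered string combines the composable pairs.
my $composed = compose($reordered);

# Joining NFD strings can also break the canonical order; reorder() fixes it.
my @nfd_parts  = ("d\x{307}", "\x{323}");
my $concat_nfd = reorder(join '', @nfd_parts);
```

       Here $decomposed is not yet in NFD because the mark with the higher
       combining class comes first; $reordered equals NFD($string) and
       $composed equals NFC($string).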

       Quick Check

       (see Annex 8, UAX #15; and DerivedNormalizationProps.txt)

       The following functions check whether the string is in the
       corresponding normalization form.

       The result returned will be one of:

           YES     The string is in that normalization form.
           NO      The string is not in that normalization form.
           MAYBE   Undetermined. Maybe yes, maybe no.

       "$result = checkNFD($string)"
           returns true (1) if "YES"; false (the empty string) if "NO".

       "$result = checkNFC($string)"
           returns true (1) if "YES"; false (the empty string) if "NO";
           "undef" if "MAYBE".

       "$result = checkNFKD($string)"
           returns true (1) if "YES"; false (the empty string) if "NO".

       "$result = checkNFKC($string)"
           returns true (1) if "YES"; false (the empty string) if "NO";
           "undef" if "MAYBE".

       "$result = checkFCD($string)"
           returns true (1) if "YES"; false (the empty string) if "NO".

       "$result = checkFCC($string)"
           returns true (1) if "YES"; false (the empty string) if "NO";
           "undef" if "MAYBE".

           If a string is not in FCD, it cannot be in FCC.  So
           "checkFCC($not_FCD_string)" should return "NO".

       "$result = check($form_name, $string)"
           returns true (1) if "YES"; false (the empty string) if "NO";
           "undef" if "MAYBE".

           As $form_name, one of the following names must be given.

             'C'  or 'NFC'  for Normalization Form C  (UAX #15)
             'D'  or 'NFD'  for Normalization Form D  (UAX #15)
             'KC' or 'NFKC' for Normalization Form KC (UAX #15)
             'KD' or 'NFKD' for Normalization Form KD (UAX #15)

             'FCD'          for "Fast C or D" Form  (UTN #5)
             'FCC'          for "Fast C Contiguous" (UTN #5)
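
       Because "NO" and "MAYBE" are both false in boolean context, callers
       need defined() to tell them apart. The sketch below shows one way to
       do that; quick_check_label() is a hypothetical helper, not part of
       this module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(checkNFC checkNFD check);

# Hypothetical helper: map the three-valued result to a label.
sub quick_check_label {
    my ($result) = @_;
    return 'MAYBE' unless defined $result;   # undef means undetermined
    return $result ? 'YES' : 'NO';           # 1 or the empty string
}

my $precomposed = "\x{C1}";     # LATIN CAPITAL LETTER A WITH ACUTE
my $decomposed  = "A\x{301}";   # "A" followed by COMBINING ACUTE ACCENT

my $nfd_of_precomposed = quick_check_label(checkNFD($precomposed));
my $nfd_of_decomposed  = quick_check_label(checkNFD($decomposed));
my $nfc_of_precomposed = quick_check_label(checkNFC($precomposed));
my $nfc_of_decomposed  = quick_check_label(check('NFC', $decomposed));
```

       With these inputs the four labels are "NO", "YES", "YES" and
       "MAYBE" respectively.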

       Note

       In the cases of NFD, NFKD, and FCD, the answer must be either "YES" or
       "NO". The answer "MAYBE" may be returned in the cases of NFC, NFKC, and
       FCC.

       A "MAYBE" string should contain at least one combining character or the
       like. For example, "COMBINING ACUTE ACCENT" has the
       MAYBE_NFC/MAYBE_NFKC property.

       Both "checkNFC("A\N{COMBINING ACUTE ACCENT}")" and
       "checkNFC("B\N{COMBINING ACUTE ACCENT}")" will return "MAYBE".
       "A\N{COMBINING ACUTE ACCENT}" is not in NFC (its NFC is
       "\N{LATIN CAPITAL LETTER A WITH ACUTE}"), while
       "B\N{COMBINING ACUTE ACCENT}" is in NFC.

       If you want to check exactly, compare the string with its NFC/NFKC/FCC.

           if ($string eq NFC($string)) {
               # $string is exactly normalized in NFC;
           } else {
               # $string is not normalized in NFC;
           }

           if ($string eq NFKC($string)) {
               # $string is exactly normalized in NFKC;
           } else {
               # $string is not normalized in NFKC;
           }
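
       The quick check and the exact comparison can be combined so that the
       string is normalized only when the quick check is undecided. A
       minimal sketch follows; is_nfc() is a hypothetical helper, not a
       function of this module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC checkNFC);

# Hypothetical helper: exact NFC test that normalizes only on "MAYBE".
sub is_nfc {
    my ($string) = @_;
    my $quick = checkNFC($string);
    return($quick ? 1 : 0) if defined $quick;   # "YES" or "NO"
    return $string eq NFC($string) ? 1 : 0;     # "MAYBE": compare exactly
}

my $a_with_acute = is_nfc("A\x{301}");   # "A" + COMBINING ACUTE ACCENT
my $b_with_acute = is_nfc("B\x{301}");   # "B" + COMBINING ACUTE ACCENT
```

       Both inputs are "MAYBE" for the quick check, but the exact
       comparison gives 0 for the first (its NFC is U+00C1) and 1 for the
       second.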

       Character Data

       These functions are an interface to the character data used
       internally.  If you only want to get Unicode normalization forms, you
       do not need to call them yourself.

       "$canonical_decomposed = getCanon($codepoint)"
           If the character of the specified codepoint is canonically
           decomposable (including Hangul Syllables), returns the completely
           decomposed string canonically equivalent to it.

           If it is not decomposable, returns "undef".

       "$compatibility_decomposed = getCompat($codepoint)"
           If the character of the specified codepoint is compatibility
           decomposable (including Hangul Syllables), returns the completely
           decomposed string compatibility equivalent to it.

           If it is not decomposable, returns "undef".

       "$codepoint_composite = getComposite($codepoint_here, $codepoint_next)"
           If two characters here and next (as codepoints) are composable
           (including Hangul Jamo/Syllables and Composition Exclusions),
           returns the codepoint of the composite.

           If they are not composable, returns "undef".

       "$combining_class = getCombinClass($codepoint)"
           Returns the combining class of the character as an integer.

       "$is_exclusion = isExclusion($codepoint)"
           Returns a boolean indicating whether the character of the
           specified codepoint is a composition exclusion.

       "$is_singleton = isSingleton($codepoint)"
           Returns a boolean indicating whether the character of the
           specified codepoint is a singleton.

       "$is_non_starter_decomposition = isNonStDecomp($codepoint)"
           Returns a boolean indicating whether the canonical decomposition
           of the character of the specified codepoint is a Non-Starter
           Decomposition.

       "$may_be_composed_with_prev_char = isComp2nd($codepoint)"
           Returns a boolean indicating whether the character of the
           specified codepoint may be composed with the previous one in a
           certain composition (including Hangul Compositions, but excluding
           Composition Exclusions and Non-Starter Decompositions).
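
       A few sample calls may make the character data functions more
       concrete. The code points below were picked for illustration, and
       the values in the comments assume a reasonably recent Unicode
       database.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(
    getCanon getCompat getComposite getCombinClass
    isExclusion isSingleton isNonStDecomp isComp2nd
);

# Decomposition mappings (arguments are integer code points).
my $canon_a_ring     = getCanon(0x00C5);    # "A" followed by U+030A
my $canon_hangul     = getCanon(0xAC00);    # U+1100 U+1161 (Hangul syllable)
my $canon_ligature   = getCanon(0xFB01);    # undef: only compatibility-decomposable
my $compat_ligature  = getCompat(0xFB01);   # "fi"

# Composition of a pair of code points.
my $composite        = getComposite(0x41, 0x030A);   # 0x00C5
my $no_composite     = getComposite(0x41, 0x42);     # undef

# Combining classes.
my $class_of_acute   = getCombinClass(0x0301);   # 230
my $class_of_letter  = getCombinClass(0x41);     # 0

# Boolean properties.
my $is_exclusion     = isExclusion(0x0958);     # DEVANAGARI LETTER QA
my $is_singleton     = isSingleton(0x212B);     # ANGSTROM SIGN
my $is_non_starter   = isNonStDecomp(0x0344);   # COMBINING GREEK DIALYTIKA TONOS
my $is_second_part   = isComp2nd(0x0301);       # COMBINING ACUTE ACCENT
```

       Note that every argument is a code point, not a character; use
       ord() to convert a character first.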

EXPORT

       "NFC", "NFD", "NFKC", "NFKD": by default.

       "normalize" and some other functions: on request.
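
       The two import styles can be sketched as follows. The package names
       are hypothetical and serve only to keep the two imports apart; the
       sketch also relies on the usual Exporter behaviour that an explicit
       import list replaces the default one.

```perl
use strict;
use warnings;

package Demo::Default;
use Unicode::Normalize;                 # imports NFC, NFD, NFKC, NFKD
our $nfc = NFC("e\x{301}");             # U+00E9

package Demo::OnRequest;
use Unicode::Normalize qw(normalize checkNFC);   # only the names listed
our $nfd = normalize('D', "\x{E9}");    # "e" followed by U+0301

# With an explicit list, the default names are not imported here.
our $has_default_nfc = defined(&Demo::OnRequest::NFC) ? 1 : 0;

package main;
```

       To get both a default function and an on-request one in the same
       package, name them all in the import list, for example
       qw(NFC normalize).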

CAVEATS

       Perl's version vs. Unicode version
           Since this module refers to perl core's Unicode database in the
           directory /lib/unicore (or formerly /lib/unicode), the Unicode
           version of normalization implemented by this module depends on
           your perl's version.

               perl's version         implemented Unicode version
               5.6.1                  3.0.1
               5.7.2                  3.1.0
               5.7.3                  3.1.1 (same normalized form as that of 3.1.0)
               5.8.0                  3.2.0
               5.8.1-5.8.3            4.0.0
               5.8.4-5.8.6 (latest)   4.0.1 (same normalized form as that of 4.0.0)

       Correction of decomposition mapping
           In older Unicode versions, a small number of characters (all of
           which are CJK compatibility ideographs as far as they have been
           found) may have an erroneous decomposition mapping (see
           NormalizationCorrections.txt).  This module neither refers to
           NormalizationCorrections.txt nor provides any specific version of
           normalization.  Therefore this module, running on an older perl
           with an older Unicode database, may use the erroneous
           decomposition mapping, blindly conforming to the Unicode database.

       Revised definition of canonical composition
           In Unicode 4.1.0, the definition D2 of canonical composition (which
           affects NFC and NFKC) has been changed (see Public Review Issue #29
           and recent UAX #15). This module has used the newer definition
           since version 0.07 (Oct 31, 2001).  This module does not support
           normalization according to the older definition, even if the
           Unicode version implemented by perl is lower than 4.1.0.
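
       To see which row of the version table above applies to a given
       installation, the running perl can be asked directly. This sketch
       assumes the core module Unicode::UCD, which is not part of
       Unicode::Normalize, is available.

```perl
use strict;
use warnings;
use Unicode::UCD ();

# Perl version as a dotted string, e.g. "5.8.6".
my $perl_version = sprintf '%vd', $^V;

# Version of the Unicode database shipped with this perl, e.g. "4.0.1".
my $unicode_version = Unicode::UCD::UnicodeVersion();

print "perl $perl_version implements Unicode $unicode_version\n";
```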

AUTHOR

       SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

       Copyright(C) 2001-2005, SADAHIRO Tomoyuki. Japan. All rights reserved.

       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.
