Unicode::UCD(3pm)      Perl Programmers Reference Guide      Unicode::UCD(3pm)

NAME
Unicode::UCD - Unicode character database

SYNOPSIS
    use Unicode::UCD 'charinfo';
    my $charinfo = charinfo($codepoint);

    use Unicode::UCD 'casefold';
    my $casefold = casefold(0xFB00);

    use Unicode::UCD 'casespec';
    my $casespec = casespec(0xFB00);

    use Unicode::UCD 'charblock';
    my $charblock = charblock($codepoint);

    use Unicode::UCD 'charscript';
    my $charscript = charscript($codepoint);

    use Unicode::UCD 'charblocks';
    my $charblocks = charblocks();

    use Unicode::UCD 'charscripts';
    my $charscripts = charscripts();

    use Unicode::UCD qw(charscript charinrange);
    my $range = charscript($script);
    print "looks like $script\n" if charinrange($range, $codepoint);

    use Unicode::UCD qw(general_categories bidi_types);
    my $categories = general_categories();
    my $types = bidi_types();

    use Unicode::UCD 'compexcl';
    my $compexcl = compexcl($codepoint);

    use Unicode::UCD 'namedseq';
    my $namedseq = namedseq($named_sequence_name);

    my $unicode_version = Unicode::UCD::UnicodeVersion();

DESCRIPTION
The Unicode::UCD module offers a series of functions that provide a
simple interface to the Unicode Character Database.

code point argument
Some of the functions are called with a code point argument, which is
either a decimal or a hexadecimal scalar designating a Unicode code
point, or "U+" followed by hexadecimals designating a Unicode code
point. In other words, if you want a code point to be interpreted as a
hexadecimal number, you must prefix it with either "0x" or "U+",
because a string like e.g. 123 will be interpreted as a decimal code
point. Also note that Unicode is not limited to 16 bits (the Unicode
codespace extends to U+10FFFF): you may have more than 4 hexdigits.
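
As an illustration of the accepted forms, the following sketch passes the same code point to charinfo() in four spellings. The loop and the printf layout are only illustrative; nothing here is part of the module beyond charinfo() itself.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Four spellings of U+0041. The bare numbers are already numeric when
# the function sees them; the strings are parsed by the module.
my @spellings = (0x41, 65, "U+0041", "0x41");

for my $arg (@spellings) {
    my $info = charinfo($arg);
    printf "%-7s => U+%s %s\n", $arg, $info->{code}, $info->{name};
}

# A string of plain digits is read as decimal: "123" is U+007B.
print "U+", charinfo("123")->{code}, "\n";
```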

charinfo()
    use Unicode::UCD 'charinfo';

    my $charinfo = charinfo(0x41);

This returns information about the input "code point argument" as a
reference to a hash of fields as defined by the Unicode standard. If
the "code point argument" is not assigned in the standard (i.e., has
the general category "Cn" meaning "Unassigned") or is a non-character
(meaning it is guaranteed to never be assigned in the standard), undef
is returned.

Fields that aren't applicable to the particular code point argument
exist in the returned hash, and are empty.

The keys in the hash with the meanings of their values are:

code
    the input "code point argument" expressed in hexadecimal, with
    leading zeros added if necessary to make it contain at least four
    hexdigits

name
    name of code, all IN UPPER CASE. Some control-type code points do
    not have names. This field will be empty for "Surrogate" and
    "Private Use" code points, and for the others without a name, it
    will contain a description enclosed in angle brackets, like
    "<control>".

category
    The short name of the general category of code. This will match
    one of the keys in the hash returned by "general_categories()".

combining
    the combining class number for code used in the Canonical Ordering
    Algorithm. For Unicode 5.1, this is described in Section 3.11
    "Canonical Ordering Behavior" available at
    <http://www.unicode.org/versions/Unicode5.1.0/>

bidi
    bidirectional type of code. This will match one of the keys in the
    hash returned by "bidi_types()".

decomposition
    is empty if code has no decomposition; or is one or more codes
    (separated by spaces) that taken in order represent a decomposition
    for code. Each has at least four hexdigits. The codes may be
    preceded by a word enclosed in angle brackets then a space, like
    "<compat> ", giving the type of decomposition

decimal
    if code is a decimal digit this is its integer numeric value

digit
    if code represents a whole number, this is its integer numeric
    value

numeric
    if code represents a whole or rational number, this is its numeric
    value. Rational values are expressed as a string like "1/4".

mirrored
    "Y" or "N" designating if code is mirrored in bidirectional text

unicode10
    name of code in the Unicode 1.0 standard if one existed for this
    code point and is different from the current name

comment
    ISO 10646 comment field. It appears in parentheses in the ISO
    10646 names list, or contains an asterisk to indicate there is a
    note for this code point in Annex P of that standard.

upper
    is empty if there is no single code point uppercase mapping for
    code; otherwise it is that mapping expressed as at least four
    hexdigits. ("casespec()" should be used in addition to charinfo()
    for case mappings when the calling program can cope with multiple
    code point mappings.)

lower
    is empty if there is no single code point lowercase mapping for
    code; otherwise it is that mapping expressed as at least four
    hexdigits. ("casespec()" should be used in addition to charinfo()
    for case mappings when the calling program can cope with multiple
    code point mappings.)

title
    is empty if there is no single code point titlecase mapping for
    code; otherwise it is that mapping expressed as at least four
    hexdigits. ("casespec()" should be used in addition to charinfo()
    for case mappings when the calling program can cope with multiple
    code point mappings.)

block
    block code belongs to (used in \p{In...}). See "Blocks versus
    Scripts".

script
    script code belongs to. See "Blocks versus Scripts".

Note that you cannot do (de)composition and casing based solely on the
decomposition, combining, lower, upper, and title fields; you will
also need the "compexcl()" and "casespec()" functions.
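
A minimal sketch of reading the returned hash follows. The helper name summarize() is hypothetical, and the choice of fields to print is arbitrary; the point is that inapplicable fields are present but empty, and that the whole result may be undef.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# summarize() is a hypothetical helper, not part of Unicode::UCD.
# It expects a numeric code point.
sub summarize {
    my ($cp) = @_;
    my $info = charinfo($cp);
    return sprintf("U+%04X: no data", $cp) unless defined $info;

    my @parts = ("U+$info->{code}", $info->{name}, "gc=$info->{category}");

    # Inapplicable fields exist but are empty, so compare against "".
    push @parts, "numeric=$info->{numeric}" if $info->{numeric} ne "";
    push @parts, "lower=U+$info->{lower}"   if $info->{lower}   ne "";
    return join " ", @parts;
}

# LATIN CAPITAL LETTER A and VULGAR FRACTION ONE QUARTER
print summarize($_), "\n" for 0x41, 0xBC;
```

The hexadecimal strings in fields such as lower can be turned back into characters with chr(hex(...)).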

charblock()
    use Unicode::UCD 'charblock';

    my $charblock = charblock(0x41);
    my $charblock = charblock(1234);
    my $charblock = charblock(0x263a);
    my $charblock = charblock("U+263a");

    my $range = charblock('Armenian');

With a "code point argument" charblock() returns the block the code
point belongs to, e.g. "Basic Latin". If the code point is
unassigned, this returns the block it would belong to if it were
assigned (which it may in future versions of the Unicode Standard).

See also "Blocks versus Scripts".

If supplied with an argument that can't be a code point, charblock()
tries to do the opposite and interpret the argument as a code point
block. The return value is a range: an anonymous list of lists that
contain start-of-range, end-of-range code point pairs. You can test
whether a code point is in a range using the "charinrange()" function.
If the argument is not a known code point block, undef is returned.
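
The two directions of charblock() can be sketched as follows. The block name 'No Such Block' is a made-up value chosen to show the undef case.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charinrange);

# Code point in, block name out.
my $name = charblock(0x263A);
print "U+263A is in block '$name'\n";

# Block name in, range out: a reference to a list whose entries
# begin with the start and end code points of each range.
my $range = charblock('Armenian');
for my $pair (@$range) {
    printf "Armenian covers U+%04X..U+%04X\n", $pair->[0], $pair->[1];
}

# An unknown name yields undef, so check before dereferencing.
my $bogus = charblock('No Such Block');
print "unknown block\n" unless defined $bogus;
```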

charscript()
    use Unicode::UCD 'charscript';

    my $charscript = charscript(0x41);
    my $charscript = charscript(1234);
    my $charscript = charscript("U+263a");

    my $range = charscript('Thai');

With a "code point argument" charscript() returns the script the code
point belongs to, e.g. "Latin", "Greek", "Han". If the code point is
unassigned, it returns undef.

If supplied with an argument that can't be a code point, charscript()
tries to do the opposite and interpret the argument as a code point
script. The return value is a range: an anonymous list of lists that
contain start-of-range, end-of-range code point pairs. You can test
whether a code point is in a range using the "charinrange()" function.
If the argument is not a known code point script, undef is returned.

See also "Blocks versus Scripts".

charblocks()
    use Unicode::UCD 'charblocks';

    my $charblocks = charblocks();

charblocks() returns a reference to a hash with the known block names
as the keys, and the code point ranges (see "charblock()") as the
values.

See also "Blocks versus Scripts".

charscripts()
    use Unicode::UCD 'charscripts';

    my $charscripts = charscripts();

charscripts() returns a reference to a hash with the known script names
as the keys, and the code point ranges (see "charscript()") as the
values.

See also "Blocks versus Scripts".
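
As a rough sketch of walking the two hashes, the following counts the known blocks and scripts and shows how many ranges make up two scripts. The script names picked here are arbitrary examples.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblocks charscripts);

my $blocks  = charblocks();
my $scripts = charscripts();

printf "%d blocks, %d scripts known\n",
    scalar(keys %$blocks), scalar(keys %$scripts);

# Each value is a list of ranges, and each range begins with its
# start and end code points. A script is usually scattered over
# several ranges.
for my $script (qw(Latin Thai)) {
    my $ranges = $scripts->{$script};
    printf "%-6s %d range(s), first listed U+%04X..U+%04X\n",
        $script, scalar(@$ranges), $ranges->[0][0], $ranges->[0][1];
}
```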

charinrange()
In addition to using the "\p{In...}" and "\P{In...}" constructs, you
can also test whether a code point is in the range as returned by
"charblock()" and "charscript()" or as the values of the hash returned
by "charblocks()" and "charscripts()" by using charinrange():

    use Unicode::UCD qw(charscript charinrange);

    my $range = charscript('Hiragana');
    print "looks like hiragana\n" if charinrange($range, $codepoint);
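
When many code points are tested, the ranges can be fetched once and reused. The sketch below does that for two scripts; script_of() and %range_for are hypothetical names introduced for this example.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charinrange);

# Fetch the ranges once, then reuse them for every test.
my %range_for = map { $_ => charscript($_) } qw(Hiragana Katakana);

# script_of() is a hypothetical helper for this example. It only
# knows about the scripts loaded into %range_for.
sub script_of {
    my ($cp) = @_;
    for my $script (sort keys %range_for) {
        return $script if charinrange($range_for{$script}, $cp);
    }
    return 'other';
}

# HIRAGANA LETTER A, KATAKANA LETTER A, LATIN CAPITAL LETTER A
printf "U+%04X: %s\n", $_, script_of($_) for 0x3042, 0x30A2, 0x41;
```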

general_categories()
    use Unicode::UCD 'general_categories';

    my $categories = general_categories();

This returns a reference to a hash which has short general category
names (such as "Lu", "Nd", "Zs", "S") as keys and long names (such as
"UppercaseLetter", "DecimalNumber", "SpaceSeparator", "Symbol") as
values. The hash is reversible in case you need to go from the long
names to the short names. The general category is the one returned
from "charinfo()" under the "category" key.
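
The following sketch pairs charinfo() with general_categories() and builds the reversed lookup mentioned above. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo general_categories);

my $long_name_for  = general_categories();      # short => long
my %short_name_for = reverse %$long_name_for;   # long  => short

# LATIN CAPITAL LETTER A, DIGIT NINE, SPACE
for my $cp (0x41, 0x39, 0x20) {
    my $short = charinfo($cp)->{category};
    printf "U+%04X: %s (%s)\n", $cp, $short, $long_name_for->{$short};
}

print "DecimalNumber is abbreviated $short_name_for{DecimalNumber}\n";
```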

bidi_types()
    use Unicode::UCD 'bidi_types';

    my $types = bidi_types();

This returns a reference to a hash which has the short bidi
(bidirectional) type names (such as "L", "R") as keys and long names
(such as "Left-to-Right", "Right-to-Left") as values. The hash is
reversible in case you need to go from the long names to the short
names. The bidi type is the one returned from "charinfo()" under the
"bidi" key. For the exact meaning of the various bidi classes,
Unicode TR9 is recommended reading:
<http://www.unicode.org/reports/tr9/> (as of Unicode 5.0.0)
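
A short sketch of looking up the long bidi name for a few code points follows; the choice of code points is arbitrary.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo bidi_types);

my $bidi_name_for = bidi_types();   # short => long

# LATIN CAPITAL LETTER A, HEBREW LETTER ALEF, DIGIT ONE
for my $cp (0x41, 0x05D0, 0x31) {
    my $type = charinfo($cp)->{bidi};
    printf "U+%04X: %-2s %s\n", $cp, $type, $bidi_name_for->{$type};
}
```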

compexcl()
    use Unicode::UCD 'compexcl';

    my $compexcl = compexcl(0x09dc);

This returns true if the "code point argument" should not be produced
by composition normalization, AND if that fact is not otherwise
determinable from the Unicode database. It currently does not return
true if the code point has a decomposition consisting of another single
code point, nor if its decomposition starts with a code point whose
combining class is non-zero. Code points that meet either of these
conditions should also not be produced by composition normalization.

It returns false otherwise.
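
To make the distinction concrete, the sketch below prints the decomposition of two code points next to the result of compexcl(). Both have a canonical decomposition, but only the first is a composition exclusion.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo compexcl);

# BENGALI LETTER RRA and LATIN CAPITAL LETTER A WITH RING ABOVE
for my $cp (0x09DC, 0x00C5) {
    my $info = charinfo($cp);
    printf "U+%s %s\n  decomposition: %s\n  excluded: %s\n",
        $info->{code}, $info->{name}, $info->{decomposition},
        compexcl($cp) ? "yes" : "no";
}
```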

casefold()
    use Unicode::UCD 'casefold';

    my $casefold = casefold(0xDF);
    if (defined $casefold) {
        my @full_fold_hex = split / /, $casefold->{'full'};
        my $full_fold_string =
            join "", map {chr(hex($_))} @full_fold_hex;
        my @turkic_fold_hex =
            split / /, ($casefold->{'turkic'} ne "")
                ? $casefold->{'turkic'}
                : $casefold->{'full'};
        my $turkic_fold_string =
            join "", map {chr(hex($_))} @turkic_fold_hex;
    }
    if (defined $casefold && $casefold->{'simple'} ne "") {
        my $simple_fold_hex = $casefold->{'simple'};
        my $simple_fold_string = chr(hex($simple_fold_hex));
    }

This returns the (almost) locale-independent case folding of the
character specified by the "code point argument".

If there is no case folding for that code point, undef is returned.

If there is a case folding for that code point, a reference to a hash
with the following fields is returned:

code
    the input "code point argument" expressed in hexadecimal, with
    leading zeros added if necessary to make it contain at least four
    hexdigits

full
    one or more codes (separated by spaces) that taken in order give
    the code points for the case folding for code. Each has at least
    four hexdigits.

simple
    is empty, or is exactly one code with at least four hexdigits which
    can be used as an alternative case folding when the calling program
    cannot cope with the fold being a sequence of multiple code points.
    If full is just one code point, then simple equals full. If there
    is no single code point folding defined for code, then simple is
    the empty string. Otherwise, it is an inferior, but still
    better-than-nothing alternative folding to full.

mapping
    is the same as simple if simple is not empty, and it is the same as
    full otherwise. It can be considered to be the simplest possible
    folding for code. It is defined primarily for backwards
    compatibility.

status
    is "C" (for "common") if the best possible fold is a single code
    point (simple equals full equals mapping). It is "S" if there are
    distinct folds, simple and full (mapping equals simple). And it is
    "F" if there is only a full fold (mapping equals full; simple is
    empty). Note that this describes the contents of mapping. It is
    defined primarily for backwards compatibility.

    On versions 3.1 and earlier of Unicode, status can also be "I"
    which is the same as "C" but is a special case for dotted uppercase
    I and dotless lowercase i:

    *   If you use this "I" mapping, the result is case-insensitive,
        but dotless and dotted I's are not distinguished

    *   If you exclude this "I" mapping, the result is not fully
        case-insensitive, but dotless and dotted I's are distinguished

turkic
    contains any special folding for Turkic languages. For versions of
    Unicode starting with 3.2, this field is empty unless code has a
    different folding in Turkic languages, in which case it is one or
    more codes (separated by spaces) that taken in order give the code
    points for the case folding for code in those languages. Each code
    has at least four hexdigits. Note that this folding does not
    maintain canonical equivalence without additional processing.

    For versions of Unicode 3.1 and earlier, this field is empty unless
    there is a special folding for Turkic languages, in which case
    status is "I", and mapping, full, simple, and turkic are all equal.

Programs that want complete generality and the best folding results
should use the folding contained in the full field. But note that the
fold for some code points will be a sequence of multiple code points.

Programs that can't cope with the fold mapping being multiple code
points can use the folding contained in the simple field, with the loss
of some generality. In Unicode 5.1, about 7% of the defined foldings
have no single code point folding.

The mapping and status fields are provided for backwards compatibility
for existing programs. They contain the same values as in previous
versions of this function.

The folding is not completely locale-independent: the turkic field
contains results to use when the locale is a Turkic language.

For more information about case mappings see
<http://www.unicode.org/unicode/reports/tr21>
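
Building on the full field, a whole string can be folded one character at a time. The sketch below is one possible way to do it; full_fold() is a hypothetical helper, it ignores the turkic field, and it performs no normalization.

```perl
use strict;
use warnings;
use Unicode::UCD 'casefold';

# full_fold() is a hypothetical helper, not part of Unicode::UCD.
sub full_fold {
    my ($string) = @_;
    my $folded = "";
    for my $char (split //, $string) {
        my $fold = casefold(ord $char);
        if (defined $fold) {
            # full may hold several space-separated hex codes.
            $folded .= join "", map { chr(hex($_)) } split / /, $fold->{full};
        }
        else {
            # No case folding defined: keep the character as it is.
            $folded .= $char;
        }
    }
    return $folded;
}

# "Stra\x{DF}e" contains LATIN SMALL LETTER SHARP S, whose full
# fold is the two-character sequence "ss".
print full_fold("Stra\x{DF}e"), "\n";
```

Two strings can then be compared case-insensitively by comparing their folded forms with eq.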

casespec()
    use Unicode::UCD 'casespec';

    my $casespec = casespec(0xFB00);

This returns the potentially locale-dependent case mappings of the
"code point argument". The mappings may be longer than a single code
point (which the basic Unicode case mappings as returned by
"charinfo()" never are).

If there are no case mappings for the "code point argument", or if all
three possible mappings (lower, title and upper) result in single code
points and are locale independent and unconditional, undef is returned
(which means that the case mappings, if any, for the code point are
those returned by "charinfo()").

Otherwise, a reference to a hash giving the mappings (or a reference to
a hash of such hashes, explained below) is returned. The keys in the
bottom layer hash with the meanings of their values are:

code
    the input "code point argument" expressed in hexadecimal, with
    leading zeros added if necessary to make it contain at least four
    hexdigits

lower
    one or more codes (separated by spaces) that taken in order give
    the code points for the lower case of code. Each has at least four
    hexdigits.

title
    one or more codes (separated by spaces) that taken in order give
    the code points for the title case of code. Each has at least four
    hexdigits.

upper
    one or more codes (separated by spaces) that taken in order give
    the code points for the upper case of code. Each has at least four
    hexdigits.

condition
    the conditions for the mappings to be valid. If undef, the
    mappings are always valid. When defined, this field is a list of
    conditions, all of which must be true for the mappings to be valid.
    The list consists of one or more locales (see below) and/or
    contexts (explained in the next paragraph), separated by spaces.
    (Other than as used to separate elements, spaces are to be
    ignored.) Case distinctions in the condition list are not
    significant. Conditions preceded by "NON_" represent the negation
    of the condition.

    A context is one of those defined in the Unicode standard. For
    Unicode 5.1, they are defined in Section 3.13 "Default Case
    Operations" available at
    <http://www.unicode.org/versions/Unicode5.1.0/>. These are for
    context-sensitive casing.

The hash described above is returned for locale-independent casing,
where at least one of the mappings has length longer than one. If
undef is returned, the code point may have mappings, but if so, all are
length one, and are returned by "charinfo()". Note that when this
function does return a value, it will be for the complete set of
mappings for a code point, even those whose length is one.

If there are additional casing rules that apply only in certain
locales, an additional key for each will be defined in the returned
hash. Each such key will be its locale name, defined as a 2-letter ISO
3166 country code, possibly followed by a "_" and a 2-letter ISO
language code (possibly followed by a "_" and a variant code). For
the lists of all possible locales, see Locale::Country and
Locale::Language. (In Unicode 5.1, the only locales returned by this
function are "lt", "tr", and "az".)

Each locale key is a reference to a hash that has the form above, and
gives the casing rules for that particular locale, which take
precedence over the locale-independent ones when in that locale.

If the only casing for a code point is locale-dependent, then the
returned hash will not have any of the base keys, like "code", "upper",
etc., but will contain only locale keys.

For more information about case mappings see
<http://www.unicode.org/unicode/reports/tr21/>
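
The interplay between casespec() and charinfo() described above can be sketched as a fallback chain. full_upper() is a hypothetical helper; it deliberately ignores locale keys and any mapping that carries a condition, which is a simplification rather than a complete casing algorithm.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo casespec);

# full_upper() is a hypothetical helper, not part of Unicode::UCD.
# It returns the uppercase mapping of a numeric code point as a list
# of hex codes.
sub full_upper {
    my ($cp) = @_;
    my $spec = casespec($cp);

    # Use the special mapping only if the base keys are present (the
    # hash may hold nothing but locale keys) and it is unconditional.
    if (defined $spec && exists $spec->{upper}) {
        my $condition = $spec->{condition};
        return split / /, $spec->{upper}
            unless defined $condition && length $condition;
    }

    # Otherwise fall back to the single code point mapping, or to
    # the code point itself when there is none.
    my $info = charinfo($cp);
    return () unless defined $info;
    return $info->{upper} ne "" ? ($info->{upper}) : ($info->{code});
}

# LATIN SMALL LIGATURE FF, LATIN SMALL LETTER A, DIGIT ONE
printf "U+%04X => %s\n", $_, join(" ", full_upper($_))
    for 0xFB00, 0x61, 0x31;
```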

namedseq()
    use Unicode::UCD 'namedseq';

    my $namedseq = namedseq("KATAKANA LETTER AINU P");
    my @namedseq = namedseq("KATAKANA LETTER AINU P");
    my %namedseq = namedseq();

If used with a single argument in a scalar context, returns the string
consisting of the code points of the named sequence, or undef if no
named sequence by that name exists. If used with a single argument in
a list context, it returns the list of the ordinals of the code points.
If used with no arguments in a list context, returns a hash with the
names of the named sequences as the keys and the named sequences as
strings as the values. Otherwise, it returns undef or an empty list
depending on the context.

This function only operates on officially approved (not provisional)
named sequences.
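
The three calling conventions can be compared side by side, as in this small sketch.

```perl
use strict;
use warnings;
use Unicode::UCD 'namedseq';

my $name = "KATAKANA LETTER AINU P";

# One argument, list context: the ordinals of the code points.
my @ordinals = namedseq($name);
print join(" ", map { sprintf "U+%04X", $_ } @ordinals), "\n";

# One argument, scalar context: the same sequence as a string.
my $string = namedseq($name);
print length($string), " characters\n";

# No arguments, list context: name => string pairs.
my %all = namedseq();
print scalar(keys %all), " named sequences known\n";
```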

Unicode::UCD::UnicodeVersion
This returns the version of the Unicode Character Database, in other
words, the version of the Unicode standard the database implements.
The version is a string of numbers delimited by dots ('.').
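
Because the version is a dotted string, it is safest to compare it component by component. The sketch below checks for Unicode 3.2 or later, the point at which the casefold() documentation above says the "I" status is no longer used; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD;

my $version = Unicode::UCD::UnicodeVersion();
my ($major, $minor) = split /\./, $version;

# Compare numerically; a plain string comparison would sort
# "10.0.0" before "3.2.0".
my $is_3_2_or_later = $major > 3 || ($major == 3 && $minor >= 2);

print "Unicode $version\n";
print "casefold() may report status 'I'\n" unless $is_3_2_or_later;
```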

Blocks versus Scripts
The difference between a block and a script is that scripts are closer
to the linguistic notion of a set of code points required to present
languages, while block is more of an artifact of the Unicode code point
numbering and separation into blocks of (mostly) 256 code points.

For example the Latin script is spread over several blocks, such as
"Basic Latin", "Latin 1 Supplement", "Latin Extended-A", and "Latin
Extended-B". On the other hand, the Latin script does not contain all
the characters of the "Basic Latin" block (also known as ASCII): it
includes only the letters, and not, for example, the digits or the
punctuation.

For blocks see <http://www.unicode.org/Public/UNIDATA/Blocks.txt>

For scripts see UTR #24: <http://www.unicode.org/unicode/reports/tr24/>

Matching Scripts and Blocks
Scripts are matched with the regular-expression construct "\p{...}"
(e.g. "\p{Tibetan}" matches characters of the Tibetan script), while
"\p{In...}" is used for blocks (e.g. "\p{InTibetan}" matches any of the
256 code points in the Tibetan block).
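
The difference between the two constructs shows up as soon as a character belongs to a block but not to the like-named script. A brief sketch, using a digit from the Basic Latin block:

```perl
use strict;
use warnings;

my $ka    = "\x{0F40}";   # TIBETAN LETTER KA
my $digit = "1";          # DIGIT ONE, in the Basic Latin block

print "KA is in the Tibetan script\n" if $ka =~ /\p{Tibetan}/;
print "KA is in the Tibetan block\n"  if $ka =~ /\p{InTibetan}/;

# Block membership does not imply script membership.
print "'1' is in the Basic Latin block\n" if $digit =~ /\p{InBasicLatin}/;
print "'1' is not in the Latin script\n"  if $digit !~ /\p{Latin}/;
```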

Implementation Note
The first use of charinfo() opens a read-only filehandle to the Unicode
Character Database (the database is included in the Perl distribution).
The filehandle is then kept open for further queries. In other words,
if you are wondering where one of your filehandles went, that's where.

BUGS
Does not yet support EBCDIC platforms.

"compexcl()" should give a complete list of excluded code points.

AUTHOR
Jarkko Hietaniemi

perl v5.12.4                      2011-06-07                Unicode::UCD(3pm)