Unicode::UCD(3pm)

Unicode::UCD(3pm)      Perl Programmers Reference Guide      Unicode::UCD(3pm)

NAME

       Unicode::UCD - Unicode character database

SYNOPSIS

           use Unicode::UCD 'charinfo';
           my $charinfo   = charinfo($codepoint);

           use Unicode::UCD 'charblock';
           my $charblock  = charblock($codepoint);

           use Unicode::UCD 'charscript';
           my $charscript = charscript($codepoint);

           use Unicode::UCD 'charblocks';
           my $charblocks = charblocks();

           use Unicode::UCD 'charscripts';
           my $charscripts = charscripts();

           use Unicode::UCD qw(charscript charinrange);
           my $range = charscript($script);
           print "looks like $script\n" if charinrange($range, $codepoint);

           use Unicode::UCD 'compexcl';
           my $compexcl = compexcl($codepoint);

           use Unicode::UCD 'namedseq';
           my $namedseq = namedseq($named_sequence_name);

           my $unicode_version = Unicode::UCD::UnicodeVersion();

DESCRIPTION

       The Unicode::UCD module offers a simple interface to the Unicode
       Character Database.

       charinfo

           use Unicode::UCD 'charinfo';

           my $charinfo = charinfo(0x41);

       charinfo() returns a reference to a hash that has the following fields
       as defined by the Unicode standard:

           key              value

           code             code point with at least four hexdigits
           name             name of the character IN UPPER CASE
           category         general category of the character
           combining        classes used in the Canonical Ordering Algorithm
           bidi             bidirectional category
           decomposition    character decomposition mapping
           decimal          if decimal digit this is the integer numeric value
           digit            if digit this is the numeric value
           numeric          if numeric this is the integer or rational numeric
                            value
           mirrored         if mirrored in bidirectional text
           unicode10        Unicode 1.0 name if existed and different
           comment          ISO 10646 comment field
           upper            uppercase equivalent mapping
           lower            lowercase equivalent mapping
           title            titlecase equivalent mapping

           block            block the character belongs to (used in \p{In...})
           script           script the character belongs to

       If no match is found, a reference to an empty hash is returned.

       The "block" property is the same as returned by charblock().  It is not
       defined in the Unicode Character Database proper (Chapter 4 of the
       Unicode 3.0 Standard, aka TUS3) but instead in an auxiliary database
       (Chapter 14 of TUS3).  The same applies to the "script" property.

       Note that you cannot do (de)composition and casing based solely on the
       above "decomposition", "lower", "upper", and "title" properties; you
       will also need the compexcl(), casefold(), and casespec() functions.

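       As a minimal sketch of reading the hash described above: the code
       point U+0041 and the selection of fields printed are our own choices,
       and the guard around the lookup is there because an unmatched code
       point may not yield a usable hash.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Look up U+0041 and print a few of the documented fields.
my $info = charinfo(0x41);

if ($info && %$info) {
    # "code", "lower" and friends are hexadecimal strings, not numbers.
    printf "U+%s %s\n", $info->{code}, $info->{name};
    printf "category=%s block=%s script=%s\n",
        @{$info}{qw(category block script)};

    # Turn the simple lowercase mapping back into a character.
    my $lower = $info->{lower};
    printf "lowercase: U+%s (%s)\n", $lower, chr(hex($lower))
        if defined $lower && length $lower;
}
else {
    print "no data for that code point\n";
}
```
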
       charblock

           use Unicode::UCD 'charblock';

           my $charblock = charblock(0x41);
           my $charblock = charblock(1234);
           my $charblock = charblock("0x263a");
           my $charblock = charblock("U+263a");

           my $range     = charblock('Armenian');

       With a code point argument charblock() returns the block the character
       belongs to, e.g. "Basic Latin".  Note that not all the character
       positions within all blocks are defined.

       See also "Blocks versus Scripts".

       If supplied with an argument that can't be a code point, charblock()
       tries to do the opposite and interpret the argument as a character
       block.  The return value is a range: a reference to a list of lists,
       each of which contains a start-of-range, end-of-range code point pair.
       You can test whether a code point is in a range using the
       "charinrange" function.  If the argument is not a known character
       block, "undef" is returned.

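       Both directions of charblock() can be sketched as follows.  The block
       name 'No Such Block' is a made-up value used only to show the "undef"
       case, and the loop reads just the first two elements of each range
       entry.

```perl
use strict;
use warnings;
use Unicode::UCD 'charblock';

# Code point in, block name out.
my $name = charblock(0x41);
print "U+0041 is in block '$name'\n";

# Block name in, range out: a reference to a list of entries whose
# first two elements are the start and end code points.
my $range = charblock('Armenian');
for my $pair (@$range) {
    printf "Armenian: U+%04X..U+%04X\n", $pair->[0], $pair->[1];
}

# A name that is not a known block yields undef.
my $unknown = charblock('No Such Block');
print "unknown block\n" unless defined $unknown;
```
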
       charscript

           use Unicode::UCD 'charscript';

           my $charscript = charscript(0x41);
           my $charscript = charscript(1234);
           my $charscript = charscript("U+263a");

           my $range      = charscript('Thai');

       With a code point argument charscript() returns the script the
       character belongs to, e.g. "Latin", "Greek", "Han".

       See also "Blocks versus Scripts".

       If supplied with an argument that can't be a code point, charscript()
       tries to do the opposite and interpret the argument as a character
       script.  The return value is a range: a reference to a list of lists,
       each of which contains a start-of-range, end-of-range code point pair.
       You can test whether a code point is in a range using the
       "charinrange" function.  If the argument is not a known character
       script, "undef" is returned.

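       A short sketch of the same two uses for charscript().  The sample code
       points are our own picks; the total printed for Thai depends on the
       Unicode version in use, so no particular number is promised here.

```perl
use strict;
use warnings;
use Unicode::UCD 'charscript';

# Code point in, script name out.
for my $cp (0x41, 0x3B1, 0x4E2D) {
    my $script = charscript($cp);
    printf "U+%04X => %s\n", $cp, defined $script ? $script : '(unknown)';
}

# Script name in, range out; a script may consist of several pieces.
my $range = charscript('Thai');
my $total = 0;
$total += $_->[1] - $_->[0] + 1 for @$range;
printf "Thai: %d range(s), %d code points\n", scalar @$range, $total;
```
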
       charblocks

           use Unicode::UCD 'charblocks';

           my $charblocks = charblocks();

       charblocks() returns a reference to a hash with the known block names
       as the keys, and the code point ranges (see "charblock") as the values.

       See also "Blocks versus Scripts".

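       One way to walk the hash returned by charblocks() is sketched below.
       Sorting by the start of each block's first range and printing only
       five entries are illustrative choices, not something the module
       prescribes.

```perl
use strict;
use warnings;
use Unicode::UCD 'charblocks';

my $blocks = charblocks();

# Order the block names by the first code point of their first range.
my @names = sort { $blocks->{$a}[0][0] <=> $blocks->{$b}[0][0] }
            keys %$blocks;

# Print the first few blocks.
for my $name (@names[0 .. 4]) {
    my ($start, $end) = @{ $blocks->{$name}[0] }[0, 1];
    printf "%-28s U+%04X..U+%04X\n", $name, $start, $end;
}
```
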
       charscripts

           use Unicode::UCD 'charscripts';

           my $charscripts = charscripts();

       charscripts() returns a reference to a hash with the known script
       names as the keys, and the code point ranges (see "charscript") as the
       values.

       See also "Blocks versus Scripts".

       Blocks versus Scripts

       The difference between a block and a script is that scripts are closer
       to the linguistic notion of a set of characters required to represent
       languages, while blocks are more of an artifact of the Unicode
       character numbering and its separation into blocks of (mostly) 256
       characters.

       For example, the Latin script is spread over several blocks, such as
       "Basic Latin", "Latin-1 Supplement", "Latin Extended-A", and "Latin
       Extended-B".  On the other hand, the Latin script does not contain all
       the characters of the "Basic Latin" block (also known as ASCII): it
       includes only the letters, and not, for example, the digits or the
       punctuation.

       For blocks see http://www.unicode.org/Public/UNIDATA/Blocks.txt

       For scripts see UTR #24: http://www.unicode.org/unicode/reports/tr24/

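       The distinction can be seen by asking for both properties of a few
       characters.  This is only a sketch: the three code points are our own
       picks, and the script reported for the digit depends on the Unicode
       version (it may be a catch-all script or undefined), so the code
       guards against "undef".

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charscript);

# 'A' and '1' share a block, but only 'A' is in the Latin script;
# U+00E9 is in the Latin script but in a different block.
for my $cp (0x41, 0x31, 0xE9) {
    my $block  = charblock($cp);
    my $script = charscript($cp);
    $script = '(none)' unless defined $script;
    printf "U+%04X block=%-20s script=%s\n", $cp, $block, $script;
}
```
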
       Matching Scripts and Blocks

       Scripts are matched with the regular-expression construct "\p{...}"
       (e.g. "\p{Tibetan}" matches characters of the Tibetan script), while
       "\p{In...}" is used for blocks (e.g. "\p{InTibetan}" matches any of the
       256 code points in the Tibetan block).

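       A small sketch of the two constructs in use.  The test character
       U+0F40 (TIBETAN LETTER KA) is our own choice; it lies in both the
       Tibetan script and the Tibetan block, while 'A' lies in neither.

```perl
use strict;
use warnings;

my $ka      = chr 0x0F40;   # TIBETAN LETTER KA
my $latin_a = 'A';

# Script match and block match for the same character.
print "U+0F40 matches \\p{Tibetan}\n"   if $ka =~ /\p{Tibetan}/;
print "U+0F40 matches \\p{InTibetan}\n" if $ka =~ /\p{InTibetan}/;

# \P{...} is the negated form.
print "'A' matches neither\n"
    if $latin_a !~ /\p{Tibetan}/ && $latin_a =~ /\P{InTibetan}/;
```
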
       Code Point Arguments

       A code point argument is either a decimal or a hexadecimal scalar
       designating a Unicode character, or "U+" followed by hexadecimals
       designating a Unicode character.  In other words, if you want a code
       point to be interpreted as a hexadecimal number, you must prefix it
       with either "0x" or "U+", because a string such as "123" will be
       interpreted as a decimal code point.  Also note that Unicode is not
       limited to 16 bits (the number of Unicode characters is open-ended, in
       theory unlimited): you may have more than 4 hexdigits.

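       The accepted spellings can be compared side by side.  The sketch below
       uses charblock() purely as a convenient way to show which code point
       was understood; the particular code points are our own examples.

```perl
use strict;
use warnings;
use Unicode::UCD 'charblock';

# Four spellings of U+263A (decimal 9786).
for my $arg (0x263A, 9786, "0x263a", "U+263A") {
    printf "%-8s => %s\n", $arg, charblock($arg);
}

# The same digits read as decimal and as hexadecimal.
print charblock("123"),   "\n";   # code point 123, that is U+007B
print charblock("0x123"), "\n";   # code point U+0123

# Code points above U+FFFF are passed the same way.
print charblock(0x1D11E), "\n";   # MUSICAL SYMBOL G CLEF
```
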
       charinrange

       In addition to using the "\p{In...}" and "\P{In...}" constructs, you
       can also test whether a code point is in a range as returned by
       "charblock" and "charscript", or as found in the values of the hashes
       returned by "charblocks" and "charscripts", by using charinrange():

           use Unicode::UCD qw(charscript charinrange);

           my $range = charscript('Hiragana');
           print "looks like hiragana\n" if charinrange($range, $codepoint);

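       A slightly fuller sketch, testing one code point inside and one
       outside the range, and also using a value taken from charblocks().
       Only the truth of the return value is relied upon here; what exactly
       charinrange() returns on a match is not assumed.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charblocks charinrange);

# A range obtained from charscript().
my $hiragana = charscript('Hiragana');
for my $cp (0x3042, 0x41) {
    printf "U+%04X %s Hiragana\n", $cp,
        charinrange($hiragana, $cp) ? 'is' : 'is not';
}

# A value from the charblocks() hash works as a range too.
my $blocks = charblocks();
print "U+0041 is in Basic Latin\n"
    if charinrange($blocks->{'Basic Latin'}, 0x41);
```
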
       compexcl

           use Unicode::UCD 'compexcl';

           my $compexcl = compexcl("09dc");

       compexcl() returns the composition exclusion (that is, whether the
       character should not be produced during precomposition) of the
       character specified by a code point argument.

       If there is a composition exclusion for the character, true is
       returned.  Otherwise, false is returned.

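       As a brief sketch, the character from the example above can be
       contrasted with one that has no exclusion; U+0041 is our own choice
       for the second case.

```perl
use strict;
use warnings;
use Unicode::UCD 'compexcl';

# U+09DC (BENGALI LETTER RRA) is excluded; U+0041 is not.
for my $cp (0x09DC, 0x41) {
    printf "U+%04X %s\n", $cp,
        compexcl($cp) ? 'is excluded from composition'
                      : 'is not excluded';
}
```
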
       casefold

           use Unicode::UCD 'casefold';

           my $casefold = casefold("00DF");

       casefold() returns the locale-independent case folding of the
       character specified by a code point argument.

       If there is a case folding for that character, a reference to a hash
       with the following fields is returned:

           key              value

           code             code point with at least four hexdigits
           status           "C", "F", "S", or "I"
           mapping          one or more codes separated by spaces

       The meaning of the status is as follows:

          C                 common case folding, common mappings shared
                            by both simple and full mappings
          F                 full case folding, mappings that cause strings
                            to grow in length. Multiple characters are separated
                            by spaces
          S                 simple case folding, mappings to single characters
                            where different from F
          I                 special case for dotted uppercase I and
                            dotless lowercase i
                            - If this mapping is included, the result is
                              case-insensitive, but dotless and dotted I's
                              are not distinguished
                            - If this mapping is excluded, the result is not
                              fully case-insensitive, but dotless and dotted
                              I's are distinguished

       If there is no case folding for that character, "undef" is returned.

       For more information about case mappings see
       http://www.unicode.org/unicode/reports/tr21/

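       A sketch of turning the "mapping" field into an actual string.  The
       helper mapping_to_string() is hypothetical (it is not part of
       Unicode::UCD), and the three code points are our own examples of a
       full folding, a common folding, and no folding at all.

```perl
use strict;
use warnings;
use Unicode::UCD 'casefold';

# Hypothetical helper: "0073 0073" becomes the two-character string "ss".
sub mapping_to_string {
    my ($mapping) = @_;
    return join '', map { chr(hex($_)) } split ' ', $mapping;
}

for my $cp (0xDF, 0x41, 0x61) {
    my $fold = casefold($cp);
    if ($fold) {
        printf "U+%04X status=%s folds to '%s'\n",
            $cp, $fold->{status}, mapping_to_string($fold->{mapping});
    }
    else {
        printf "U+%04X has no case folding\n", $cp;
    }
}
```
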
       casespec

           use Unicode::UCD 'casespec';

           my $casespec = casespec("FB00");

       casespec() returns the potentially locale-dependent case mapping of
       the character specified by a code point argument.  The mapping may
       change the length of the string (which the basic Unicode case mappings
       as returned by charinfo() never do).

       If there is a case mapping for that character, a reference to a hash
       with the following fields is returned:

           key              value

           code             code point with at least four hexdigits
           lower            lowercase
           title            titlecase
           upper            uppercase
           condition        condition list (may be undef)

       The "condition" is optional.  Where present, it consists of one or more
       locales or contexts, separated by spaces (other than as used to
       separate elements, spaces are to be ignored).  A condition list
       overrides the normal behavior if all of the listed conditions are true.
       Case distinctions in the condition list are not significant.
       Conditions preceded by "NON_" represent the negation of the condition.

       Note that when there are multiple case mapping definitions for a single
       code point because of different locales, the value returned by
       casespec() is a hash reference which has the locales as the keys and
       hash references as described above as the values.

       A locale is defined as a 2-letter ISO 3166 country code, possibly
       followed by a "_" and a 2-letter ISO language code (possibly followed
       by a "_" and a variant code).  For lists of those codes, see
       Locale::Country and Locale::Language.

       A context is one of the following choices:

           FINAL            The letter is not followed by a letter of
                            general category L (e.g. Ll, Lt, Lu, Lm, or Lo)
           MODERN           The mapping is only used for modern text
           AFTER_i          The last base character was "i" (U+0069)

       For more information about case mappings see
       http://www.unicode.org/unicode/reports/tr21/

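       A sketch for the simple, locale-independent case.  It checks for the
       "code" key to confirm that the result is a plain mapping hash rather
       than the locale-keyed form described above; that check is our own
       convention, not something the module requires.

```perl
use strict;
use warnings;
use Unicode::UCD 'casespec';

# U+FB00 LATIN SMALL LIGATURE FF grows when upper- or titlecased.
my $spec = casespec(0xFB00);

if ($spec && exists $spec->{code}) {
    for my $case (qw(lower title upper)) {
        my @cps = split ' ', $spec->{$case};
        printf "%-5s => %s (%d code point%s)\n",
            $case, join(' ', map { "U+$_" } @cps),
            scalar @cps, @cps == 1 ? '' : 's';
    }
    my $cond = $spec->{condition};
    print "condition: ", (defined $cond ? $cond : '(none)'), "\n";
}
```
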
       namedseq

           use Unicode::UCD 'namedseq';

           my $namedseq = namedseq("KATAKANA LETTER AINU P");
           my @namedseq = namedseq("KATAKANA LETTER AINU P");
           my %namedseq = namedseq();

       If used with a single argument in a scalar context, returns the string
       consisting of the code points of the named sequence, or "undef" if no
       named sequence by that name exists.  If used with a single argument in
       a list context, returns a list of the code points.  If used with no
       arguments in a list context, returns a hash with the names of the named
       sequences as the keys and the named sequences as strings as the values.
       Otherwise, returns "undef" or an empty list depending on the context.

       (New from Unicode 4.1.0)

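       The three calling styles can be sketched together.  Nothing is assumed
       about how many named sequences exist; that count depends on the
       Unicode version shipped with Perl.

```perl
use strict;
use warnings;
use Unicode::UCD 'namedseq';

my $name = "KATAKANA LETTER AINU P";

# List context: the code points as numbers.
my @cps = namedseq($name);
print join(' ', map { sprintf 'U+%04X', $_ } @cps), "\n";

# Scalar context: the same sequence as a character string.
my $str = namedseq($name);
printf "%d character(s)\n", length $str;

# No arguments, list context: every known sequence, keyed by name.
my %all = namedseq();
printf "%d named sequences known\n", scalar keys %all;
```
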
       Unicode::UCD::UnicodeVersion

       Unicode::UCD::UnicodeVersion() returns the version of the Unicode
       Character Database, in other words, the version of the Unicode standard
       the database implements.  The version is a string of numbers delimited
       by dots ('.').

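       Because the version is a dotted string, it can be split into numeric
       parts for comparison.  The sketch below does so to check for the
       Unicode 4.1.0 threshold mentioned for named sequences; the variable
       names are our own.

```perl
use strict;
use warnings;
use Unicode::UCD;

# UnicodeVersion() is not exported, so call it by its full name.
my $version = Unicode::UCD::UnicodeVersion();
my ($major, $minor) = split /\./, $version;
print "Unicode $version (major $major, minor $minor)\n";

# Named sequences are described as new from Unicode 4.1.0.
my $has_namedseq = $major > 4 || ($major == 4 && $minor >= 1);
print "named sequence data ",
    ($has_namedseq ? "should" : "may not"), " be present\n";
```
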
       Implementation Note

       The first use of charinfo() opens a read-only filehandle to the Unicode
       Character Database (the database is included in the Perl distribution).
       The filehandle is then kept open for further queries.  In other words,
       if you are wondering where one of your filehandles went, that's where.

BUGS

       Does not yet support EBCDIC platforms.

AUTHOR

       Jarkko Hietaniemi


perl v5.8.8                       2001-09-21                 Unicode::UCD(3pm)