PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3)

PCRE - Perl-compatible regular expressions (revised API)

When PCRE2 is built with Unicode support (which is the default), it has
knowledge of Unicode character properties and can process text strings
in UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
However, by default, PCRE2 assumes that one code unit is one character.
To process a pattern as a UTF string, where a character may require
more than one code unit, you must call pcre2_compile() with the
PCRE2_UTF option flag, or the pattern must start with the sequence
(*UTF). When either of these is the case, both the pattern and any
subject strings that are matched against it are treated as UTF strings
instead of strings of individual one-code-unit characters. There are
also some other changes to the way characters are handled, as
documented below.
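The difference between code units and characters can be illustrated with a minimal sketch in C. It is not part of PCRE2, and the function name utf8_char_count is hypothetical. It assumes that the input is already valid UTF-8 and counts every byte that is not a continuation byte as the start of a character.

```c
#include <stddef.h>

/* Count the characters in a UTF-8 string of `len` code units (bytes).
 * Assumption: the input is valid UTF-8. Every byte that is not a
 * continuation byte (binary 10xxxxxx) starts a new character.
 * Hypothetical helper for illustration, not a PCRE2 function. */
size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the five code units of "caf" followed by the bytes 0xc3 0xa9 (e with acute accent), the function reports four characters. In a non-UTF 8-bit mode the same subject would be treated as five one-code-unit characters.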

If you do not need Unicode support you can build PCRE2 without it, in
which case the library will be smaller.

When PCRE2 is built with Unicode support, the escape sequences \p{..},
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF
setting. The Unicode properties that can be tested are limited to the
general category properties such as Lu for an upper case letter or Nd
for a decimal number, the Unicode script names such as Arabic or Han,
and the derived properties Any and L&. Full lists are given in the
pcre2pattern and pcre2syntax documentation. Only the short names for
properties are supported. For example, \p{L} matches a letter. Its Perl
synonym, \p{Letter}, is not supported. Furthermore, in Perl, many
properties may optionally be prefixed by "Is", for compatibility with
Perl 5.6. PCRE2 does not support this.

Code points less than 256 can be specified in patterns by either braced
or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
Larger values have to use braced sequences. Unbraced octal code points
up to \777 are also recognized; larger ones can be coded using \o{...}.

The escape sequence \N{U+<hex digits>} is recognized as another way of
specifying a Unicode character by code point in a UTF mode. It is not
allowed in non-UTF modes.

In UTF modes, repeat quantifiers apply to complete UTF characters, not
to individual code units.

In UTF modes, the dot metacharacter matches one UTF character instead
of a single code unit.

In UTF modes, capture group names are not restricted to ASCII, and may
contain any Unicode letters and decimal digits, as well as underscore.

The escape sequence \C can be used to match a single code unit in a UTF
mode, but its use can lead to some strange effects because it breaks up
multi-unit characters (see the description of \C in the pcre2pattern
documentation). For this reason, there is a build-time option that
disables support for \C completely. There is also a less draconian
compile-time option for locking out the use of \C when a pattern is
compiled.

The use of \C is not supported by the alternative matching function
pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a
character may consist of more than one code unit. The use of \C in
these modes provokes a match-time error. Also, the JIT optimization
does not support \C in these modes. If JIT optimization is requested
for a UTF-8 or UTF-16 pattern that contains \C, it will not succeed,
and so when pcre2_match() is called, the matching will be carried out
by the normal interpretive function.

The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
characters of any code value, but, by default, the characters that
PCRE2 recognizes as digits, spaces, or word characters remain the same
set as in non-UTF mode, all with code points less than 256. This
remains true even when PCRE2 is built to include Unicode support,
because to do otherwise would slow down matching in many common cases.
Note that this also applies to \b and \B, because they are defined in
terms of \w and \W. If you want to test for a wider sense of, say,
"digit", you can use explicit Unicode property tests such as \p{Nd}.
Alternatively, if you set the PCRE2_UCP option, the way that the
character escapes work is changed so that Unicode properties are used
to determine which characters match. There are more details in the
section on generic character types in the pcre2pattern documentation.

Similarly, characters that match the POSIX named character classes are
all low-valued characters, unless the PCRE2_UCP option is set.

However, the special horizontal and vertical white space matching
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode
characters, whether or not PCRE2_UCP is set.

Case-insensitive matching in a UTF mode makes use of Unicode properties
except for characters whose code points are less than 128 and that have
at most two case-equivalent values. For these, a direct table lookup is
used for speed. A few Unicode characters such as Greek sigma have more
than two code points that are case-equivalent, and these are treated as
such.
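As an illustration of a case-equivalence set with more than two members, the following C sketch treats the three Greek sigma code points as equivalent to one another. It is not PCRE2's implementation: the names sigma_set, in_sigma_set, and caseless_equal are hypothetical, and the sketch deliberately ignores every other multi-member set, handling only sigma and the plain two-way pairing of ASCII letters.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The three case-equivalent code points for Greek sigma. */
static const uint32_t sigma_set[] = {
    0x03A3,  /* GREEK CAPITAL LETTER SIGMA */
    0x03C3,  /* GREEK SMALL LETTER SIGMA */
    0x03C2   /* GREEK SMALL LETTER FINAL SIGMA */
};

static bool in_sigma_set(uint32_t c)
{
    for (size_t i = 0; i < sizeof sigma_set / sizeof sigma_set[0]; i++)
        if (sigma_set[i] == c)
            return true;
    return false;
}

/* Hypothetical caseless comparison of two code points. Only the sigma
 * set and ASCII letters are handled; this is a sketch, not a complete
 * Unicode case-folding implementation. */
bool caseless_equal(uint32_t a, uint32_t b)
{
    if (a == b)
        return true;
    if (in_sigma_set(a) && in_sigma_set(b))
        return true;
    if (a < 128 && b < 128) {
        /* Fold ASCII upper case to lower case before comparing. */
        if (a >= 'A' && a <= 'Z') a += 'a' - 'A';
        if (b >= 'A' && b <= 'Z') b += 'a' - 'A';
        return a == b;
    }
    return false;
}
```

With this sketch, capital sigma compares equal to both small sigma and final sigma, whereas an ASCII letter has only its single upper or lower case partner.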

The pattern constructs (*script_run:...) and (*atomic_script_run:...),
with synonyms (*sr:...) and (*asr:...), verify that the string matched
within the parentheses is a script run. In concept, a script run is a
sequence of characters that are all from the same Unicode script.
However, because some scripts are commonly used together, and because
some diacritical and other marks are used with multiple scripts, it is
not that simple.

Every Unicode character has a Script property, mostly with a value
corresponding to the name of a script, such as Latin, Greek, or
Cyrillic. There are also three special values:

"Unknown" is used for code points that have not been assigned, and also
for the surrogate code points. In the PCRE2 32-bit library, characters
whose code points are greater than the Unicode maximum (U+10FFFF),
which are accessible only in non-UTF mode, are assigned the Unknown
script.

"Common" is used for characters that are used with many scripts. These
include punctuation, emoji, mathematical, musical, and currency
symbols, and the ASCII digits 0 to 9.

"Inherited" is used for characters such as diacritical marks that
modify a previous character. These are considered to take on the script
of the character that they modify.

Some Inherited characters are used with many scripts, but many of them
are only normally used with a small number of scripts. For example,
U+102E0 (Coptic Epact thousands mark) is used only with Arabic and
Coptic. In order to make it possible to check this, a Unicode property
called Script Extension exists. Its value is a list of scripts that
apply to the character. For the majority of characters, the list
contains just one script, the same one as the Script property. However,
for characters such as U+102E0 more than one Script is listed. There
are also some Common characters that have a single, non-Common script
in their Script Extension list.

The next section describes the basic rules for deciding whether a given
string of characters is a script run. Note, however, that there are
some special cases involving the Chinese Han script, and an additional
constraint for decimal digits. These are covered in subsequent
sections.

Basic script run rules

A string that is less than two characters long is a script run. This is
the only case in which an Unknown character can be part of a script
run. Longer strings are checked using only the Script Extensions
property, not the basic Script property.

If a character's Script Extension property is the single value
"Inherited", it is always accepted as part of a script run. This is
also true for the property "Common", subject to the checking of decimal
digits described below. All the remaining characters in a script run
must have at least one script in common in their Script Extension
lists. In set-theoretic terminology, the intersection of all the sets
of scripts must not be empty.

A simple example is an Internet name such as "google.com". The letters
are all in the Latin script, and the dot is Common, so this string is a
script run. However, the Cyrillic letter "o" looks exactly the same as
the Latin "o"; a string that looks the same, but with Cyrillic "o"s is
not a script run.

More interesting examples involve characters with more than one script
in their Script Extension. Consider the following characters:

  U+060C  Arabic comma
  U+06D4  Arabic full stop

The first has the Script Extension list Arabic, Hanifi Rohingya,
Syriac, and Thaana; the second has just Arabic and Hanifi Rohingya.
Both of them could appear in script runs of either Arabic or Hanifi
Rohingya. The first could also appear in Syriac or Thaana script runs,
but the second could not.
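The intersection rule can be sketched in C by representing each character's Script Extension list as a bit mask. This is an illustration only, not PCRE2's code: the SCR_* identifiers and the function is_basic_script_run are hypothetical, Common and Inherited characters are modelled as a mask with every bit set so that they never narrow the intersection, and the Han and decimal digit rules are not handled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical script identifiers, one bit per script. */
enum {
    SCR_LATIN    = 1u << 0,
    SCR_CYRILLIC = 1u << 1,
    SCR_ARABIC   = 1u << 2,
    SCR_ROHINGYA = 1u << 3,
    SCR_SYRIAC   = 1u << 4,
    SCR_THAANA   = 1u << 5
};

/* Stands in for Common and Inherited: compatible with every script. */
#define SCR_ANY (~0u)

/* ext[i] is the Script Extension set of character i as a bit mask.
 * Returns true if the characters form a script run under the basic
 * rules only (sketch; no Han or decimal digit handling). */
bool is_basic_script_run(const unsigned *ext, size_t n)
{
    unsigned common = SCR_ANY;

    if (n < 2)
        return true;               /* short strings are always accepted */
    for (size_t i = 0; i < n; i++) {
        common &= ext[i];          /* intersect the script sets */
        if (common == 0)
            return false;          /* no script shared by all characters */
    }
    return true;
}
```

In this model, an Arabic letter followed by U+060C (mask Arabic, Hanifi Rohingya, Syriac, Thaana) is accepted, whereas a Syriac letter followed by U+06D4 (mask Arabic, Hanifi Rohingya) is rejected because the intersection is empty.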

The Chinese Han script

The Chinese Han script is commonly used in conjunction with other
scripts for writing certain languages. Japanese uses the Hiragana and
Katakana scripts together with Han; Korean uses Hangul and Han;
Taiwanese Mandarin uses Bopomofo and Han. These three combinations are
treated as special cases when checking script runs and are, in effect,
"virtual scripts". Thus, a script run may contain a mixture of
Hiragana, Katakana, and Han, or a mixture of Hangul and Han, or a
mixture of Bopomofo and Han, but not, for example, a mixture of Hangul
and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
Standard 39 ("Unicode Security Mechanisms",
http://unicode.org/reports/tr39/) in allowing such mixtures.
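The three combinations can be modelled as bit masks, as in the following C sketch. The S_* identifiers and the function han_mixture_allowed are hypothetical and are not PCRE2 names. The sketch accepts any subset of a single virtual script, which is an assumption of this illustration rather than a statement about PCRE2's exact behaviour.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical script identifiers, one bit per script. */
enum {
    S_HAN      = 1u << 0,
    S_HIRAGANA = 1u << 1,
    S_KATAKANA = 1u << 2,
    S_HANGUL   = 1u << 3,
    S_BOPOMOFO = 1u << 4
};

/* `used` is the set of scripts seen in a candidate script run.
 * Returns true if all of them fit within one of the three
 * "virtual scripts" (sketch under the stated assumptions). */
bool han_mixture_allowed(unsigned used)
{
    static const unsigned virtual_scripts[] = {
        S_HAN | S_HIRAGANA | S_KATAKANA,  /* Japanese */
        S_HAN | S_HANGUL,                 /* Korean */
        S_HAN | S_BOPOMOFO                /* Taiwanese Mandarin */
    };

    for (size_t i = 0; i < sizeof virtual_scripts / sizeof virtual_scripts[0]; i++) {
        if ((used & ~virtual_scripts[i]) == 0)
            return true;   /* nothing outside this virtual script */
    }
    return false;
}
```

A run that uses Hangul, Bopomofo, and Han together is rejected because no single virtual script contains all three.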

Decimal digits

Unicode contains many sets of 10 decimal digits in different scripts,
and some scripts (including the Common script) contain more than one
set. Some of these decimal digits are visually indistinguishable from
the common ASCII digits. In addition to the script checking described
above, if a script run contains any decimal digits, they must all come
from the same set of 10 adjacent characters.
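The digit constraint can be sketched in C as follows. The table lists the zero digit of only a few digit sets and is a small illustrative subset, not the full Unicode list; the names digit_zero, digit_set_of, and digits_from_one_set are hypothetical and this is not PCRE2's implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Zero digits of a few sets of 10 adjacent decimal digits
 * (illustrative subset only). */
static const uint32_t digit_zero[] = {
    0x0030,  /* ASCII digits */
    0x0660,  /* Arabic-Indic digits */
    0x06F0,  /* Extended Arabic-Indic digits */
    0x0966,  /* Devanagari digits */
    0xFF10   /* Fullwidth digits */
};

/* Return the zero digit of the set containing c, or 0 if c is not a
 * digit known to this table. */
static uint32_t digit_set_of(uint32_t c)
{
    for (size_t i = 0; i < sizeof digit_zero / sizeof digit_zero[0]; i++)
        if (c >= digit_zero[i] && c <= digit_zero[i] + 9)
            return digit_zero[i];
    return 0;
}

/* Returns true if every decimal digit among the n code points comes
 * from the same set of 10 adjacent characters. */
bool digits_from_one_set(const uint32_t *cp, size_t n)
{
    uint32_t seen = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t set = digit_set_of(cp[i]);
        if (set == 0)
            continue;          /* not a digit: no constraint */
        if (seen == 0)
            seen = set;        /* first digit fixes the set */
        else if (seen != set)
            return false;      /* digits from two different sets */
    }
    return true;
}
```

Under this check, an ASCII "1" followed by a fullwidth "2" (U+FF12) is rejected even though both characters have the Common script.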

When the PCRE2_UTF option is set, the strings passed as patterns and
subjects are (by default) checked for validity on entry to the relevant
functions. If an invalid UTF string is passed, a negative error code is
returned. The code unit offset to the offending character can be
extracted from the match data block by calling pcre2_get_startchar(),
which is used for this purpose after a UTF error.
216
217 In some situations, you may already know that your strings are valid,
218 and therefore want to skip these checks in order to improve perfor‐
219 mance, for example in the case of a long subject string that is being
220 scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com‐
221 pile time or at match time, PCRE2 assumes that the pattern or subject
222 it is given (respectively) contains only valid UTF code unit sequences.
223
224 If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
225 result is usually undefined and your program may crash or loop indefi‐
226 nitely. There is, however, one mode of matching that can handle invalid
227 UTF subject strings. This is matching via the JIT optimization using
228 the PCRE2_JIT_INVALID_UTF option when calling pcre2_jit_compile(). For
229 details, see the pcre2jit documentation.
230
231 Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
232 for the pattern; it does not also apply to subject strings. If you want
233 to disable the check for a subject string you must pass this same
234 option to pcre2_match() or pcre2_dfa_match().
235
236 UTF-16 and UTF-32 strings can indicate their endianness by special code
237 knows as a byte-order mark (BOM). The PCRE2 functions do not handle
238 this, expecting strings to be in host byte order.
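Because PCRE2 does not interpret a BOM, the caller has to deal with it before passing a string to the library. The following C sketch shows one way to do this for UTF-16 data that has been read into 16-bit units. The function utf16_strip_bom is hypothetical, and treating a leading 0xFFFE as a byte-swapped BOM is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Prepare a UTF-16 buffer for use in host byte order.
 * If the first unit is 0xFFFE, the data is assumed to be in the
 * opposite byte order and every unit is swapped in place. A leading
 * BOM (0xFEFF) is then skipped. Returns a pointer to the first unit
 * after any BOM and updates *len to the remaining number of units.
 * Hypothetical helper, not a PCRE2 function. */
uint16_t *utf16_strip_bom(uint16_t *buf, size_t *len)
{
    if (*len == 0)
        return buf;

    if (buf[0] == 0xFFFE) {
        for (size_t i = 0; i < *len; i++)
            buf[i] = (uint16_t)((buf[i] << 8) | (buf[i] >> 8));
    }

    if (buf[0] == 0xFEFF) {
        (*len)--;
        return buf + 1;
    }
    return buf;
}
```

The pointer and length that result could then be passed to the 16-bit library as the subject, with the length expressed in code units.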

Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any
other processing takes place. In the case of pcre2_match() and
pcre2_dfa_match() calls with a non-zero starting offset, the check is
applied only to that part of the subject that could be inspected during
matching, and there is a check that the starting offset points to the
first code unit of a character or to the end of the subject. If there
are no lookbehind assertions in the pattern, the check starts at the
starting offset. Otherwise, it starts at the length of the longest
lookbehind before the starting offset, or at the start of the subject
if there are not that many characters before the starting offset. Note
that the sequences \b and \B are one-character lookbehinds.
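The rule for where the check begins can be expressed as a short C sketch for the UTF-8 case. The function utf8_check_start and its parameters are hypothetical; this illustrates the rule as described above and is not PCRE2's code. It steps back one character at a time by skipping over continuation bytes.

```c
#include <stddef.h>

/* Compute the code unit offset at which validity checking would begin
 * for a UTF-8 subject, given the starting offset of the match and the
 * length of the longest lookbehind in characters.
 * Hypothetical helper for illustration only. */
size_t utf8_check_start(const unsigned char *subject,
                        size_t start_offset, size_t max_lookbehind)
{
    size_t pos = start_offset;

    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Move back over continuation bytes (binary 10xxxxxx) to reach
         * the first code unit of the previous character. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}
```

With no lookbehind the result is the starting offset itself, and when there are fewer characters before the starting offset than the lookbehind length the result is 0, the start of the subject.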

In addition to checking the format of the string, there is a check to
ensure that all code points lie in the range U+0 to U+10FFFF, excluding
the surrogate area. The so-called "non-character" code points are not
excluded because Unicode corrigendum #9 makes it clear that they should
not be.

Characters in the "Surrogate Area" of Unicode are reserved for use by
UTF-16, where they are used in pairs to encode code points with values
greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
are available independently in the UTF-8 and UTF-32 encodings. (In
other words, the whole surrogate thing is a fudge for UTF-16 which
unfortunately messes up UTF-8 and UTF-32.)
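The way a surrogate pair encodes a code point greater than 0xFFFF is the standard UTF-16 arithmetic, sketched here in C. The function name utf16_combine is hypothetical, and the sketch assumes that its arguments really are a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF).

```c
#include <stdint.h>

/* Combine a high and a low surrogate into the code point they encode.
 * Assumption: high is in 0xD800..0xDBFF and low is in 0xDC00..0xDFFF;
 * no validation is done here. Hypothetical helper for illustration. */
uint32_t utf16_combine(uint16_t high, uint16_t low)
{
    uint32_t top    = (uint32_t)(high - 0xD800u);  /* upper 10 bits */
    uint32_t bottom = (uint32_t)(low - 0xDC00u);   /* lower 10 bits */
    return 0x10000u + (top << 10) + bottom;
}
```

The pair 0xD83D, 0xDE00 yields U+1F600, and the largest pair 0xDBFF, 0xDFFF yields U+10FFFF, which is why the Unicode code space ends there.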

Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
that is given if an escape sequence for an invalid Unicode code point
is encountered in the pattern. If you want to allow escape sequences
such as \x{d800} (a surrogate code point) you can set the
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is
possible only in UTF-8 and UTF-32 modes, because these values are not
representable in UTF-16.

Errors in UTF-8 strings

The following negative error codes are given for invalid UTF-8 strings:

  PCRE2_ERROR_UTF8_ERR1
  PCRE2_ERROR_UTF8_ERR2
  PCRE2_ERROR_UTF8_ERR3
  PCRE2_ERROR_UTF8_ERR4
  PCRE2_ERROR_UTF8_ERR5

The string ends with a truncated UTF-8 character; the code specifies
how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
characters to be no longer than 4 bytes, the encoding scheme
(originally defined by RFC 2279) allows for up to 6 bytes, and this is
checked first; hence the possibility of 4 or 5 missing bytes.

  PCRE2_ERROR_UTF8_ERR6
  PCRE2_ERROR_UTF8_ERR7
  PCRE2_ERROR_UTF8_ERR8
  PCRE2_ERROR_UTF8_ERR9
  PCRE2_ERROR_UTF8_ERR10

The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
the character do not have the binary value 0b10 (that is, either the
most significant bit is 0, or the next bit is 1).

  PCRE2_ERROR_UTF8_ERR11
  PCRE2_ERROR_UTF8_ERR12

A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
long; these code points are excluded by RFC 3629.

  PCRE2_ERROR_UTF8_ERR13

A 4-byte character has a value greater than 0x10ffff; these code points
are excluded by RFC 3629.

  PCRE2_ERROR_UTF8_ERR14

A 3-byte character has a value in the range 0xd800 to 0xdfff; this
range of code points is reserved by RFC 3629 for use with UTF-16, and
so is excluded from UTF-8.

  PCRE2_ERROR_UTF8_ERR15
  PCRE2_ERROR_UTF8_ERR16
  PCRE2_ERROR_UTF8_ERR17
  PCRE2_ERROR_UTF8_ERR18
  PCRE2_ERROR_UTF8_ERR19

A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
for a value that can be represented by fewer bytes, which is invalid.
For example, the two bytes 0xc0, 0xae give the value 0x2e, whose
correct coding uses just one byte.

  PCRE2_ERROR_UTF8_ERR20

The two most significant bits of the first byte of a character have the
binary value 0b10 (that is, the most significant bit is 1 and the
second is 0). Such a byte can only validly occur as the second or
subsequent byte of a multi-byte character.

  PCRE2_ERROR_UTF8_ERR21

The first byte of a character has the value 0xfe or 0xff. These values
can never occur in a valid UTF-8 string.
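The categories of error listed above can be illustrated by a compact validator in C. This is a sketch written for this explanation, not PCRE2's checking code: the enum values and the function utf8_classify are hypothetical, the result codes are not the PCRE2_ERROR_UTF8_* values, and each category covers a group of the PCRE2 codes without distinguishing the individual byte positions.

```c
#include <stddef.h>

/* Hypothetical result codes, one per group of errors described above. */
enum utf8_result {
    UTF8_OK = 0,
    UTF8_TRUNCATED,   /* string ends inside a character (ERR1-5) */
    UTF8_BAD_CONT,    /* continuation byte is not 10xxxxxx (ERR6-10) */
    UTF8_TOO_LONG,    /* 5- or 6-byte character (ERR11-12) */
    UTF8_TOO_BIG,     /* value greater than 0x10ffff (ERR13) */
    UTF8_SURROGATE,   /* value in 0xd800..0xdfff (ERR14) */
    UTF8_OVERLONG,    /* value fits in fewer bytes (ERR15-19) */
    UTF8_STRAY_CONT,  /* first byte is 10xxxxxx (ERR20) */
    UTF8_BAD_LEAD     /* first byte is 0xfe or 0xff (ERR21) */
};

/* Classify the first problem found in s[0..len-1], or return UTF8_OK. */
enum utf8_result utf8_classify(const unsigned char *s, size_t len)
{
    /* Smallest value that legitimately needs n bytes, indexed by n. */
    static const unsigned long min_value[] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long v;
        size_t n;

        if (c < 0x80) { i++; continue; }          /* one-byte character */
        if (c < 0xC0) return UTF8_STRAY_CONT;
        if (c >= 0xFE) return UTF8_BAD_LEAD;

        /* Length by the RFC 2279 scheme, which allows up to 6 bytes. */
        n = c < 0xE0 ? 2 : c < 0xF0 ? 3 : c < 0xF8 ? 4 : c < 0xFC ? 5 : 6;
        if (len - i < n) return UTF8_TRUNCATED;

        v = c & (0xFFu >> (n + 1));               /* payload of first byte */
        for (size_t k = 1; k < n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return UTF8_BAD_CONT;
            v = (v << 6) | (s[i + k] & 0x3F);
        }

        if (v < min_value[n]) return UTF8_OVERLONG;
        if (n > 4) return UTF8_TOO_LONG;
        if (v > 0x10FFFF) return UTF8_TOO_BIG;
        if (v >= 0xD800 && v <= 0xDFFF) return UTF8_SURROGATE;
        i += n;
    }
    return UTF8_OK;
}
```

Applied to the two bytes 0xc0, 0xae from the example above, the sketch reports the overlong category, because the decoded value 0x2e is smaller than 0x80, the smallest value that needs two bytes.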

Errors in UTF-16 strings

The following negative error codes are given for invalid UTF-16
strings:

  PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
  PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
  PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
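
The three UTF-16 conditions can be illustrated with a short C sketch. The function utf16_classify is hypothetical and is not PCRE2's checking code; it returns 0 for a valid string and 1, 2, or 3 to indicate which of the three conditions above was found first. These small integers are for illustration and are not the values of the PCRE2 error codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 0 if s[0..len-1] is valid UTF-16, otherwise 1, 2, or 3 for the
 * first problem found:
 *   1  high surrogate at the end of the string (low surrogate missing)
 *   2  high surrogate not followed by a low surrogate
 *   3  isolated low surrogate
 * Hypothetical helper for illustration only. */
int utf16_classify(const uint16_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t c = s[i];

        if (c < 0xD800 || c > 0xDFFF)
            continue;                       /* not a surrogate */
        if (c >= 0xDC00)
            return 3;                       /* low surrogate comes first */
        if (i + 1 == len)
            return 1;                       /* nothing follows the high one */
        if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
            return 2;                       /* next unit is not a low one */
        i++;                                /* skip the low surrogate */
    }
    return 0;
}
```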

Errors in UTF-32 strings

The following negative error codes are given for invalid UTF-32
strings:

  PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
  PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
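
In UTF-32 each code unit is a complete code point, so the check reduces to two range tests, as in this C sketch. The function utf32_classify is hypothetical; its return values 1 and 2 mirror the two conditions above and are not the values of the PCRE2 error codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 0 if s[0..len-1] is valid UTF-32, otherwise 1 for a surrogate
 * code point or 2 for a code point greater than 0x10ffff, whichever is
 * found first. Hypothetical helper for illustration only. */
int utf32_classify(const uint32_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDFFF)
            return 1;
        if (s[i] > 0x10FFFF)
            return 2;
    }
    return 0;
}
```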

Philip Hazel
University Computing Service
Cambridge, England.

Last updated: 11 May 2019
Copyright (c) 1997-2019 University of Cambridge.

PCRE2 10.33 11 May 2019 PCRE2UNICODE(3)