PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY

       The full syntax and semantics of the regular expressions that are sup‐
       ported by PCRE2 are described in the pcre2pattern documentation. This
       document contains a quick-reference summary of the syntax.

QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal

ESCAPED CHARACTERS

       This table applies to ASCII and Unicode environments. An unrecognized
       escape sequence causes an error.

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is any ASCII printing character
         \e         escape (hex 1B)
         \f         form feed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \0dd       character with octal code 0dd
         \ddd       character with octal code ddd, or backreference
         \o{ddd..}  character with octal code ddd..
         \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
         \xhh       character with hex code hh
         \x{hh..}   character with hex code hh..

       If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
       following are also recognized:

         \U         the character "U"
         \uhhhh     character with hex code hhhh
         \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX

       When \x is not followed by {, from zero to two hexadecimal digits are
       read, but in ALT_BSUX mode \x must be followed by two hexadecimal dig‐
       its to be recognized as a hexadecimal escape; otherwise it matches a
       literal "x". Likewise, if \u (in ALT_BSUX mode) is not followed by
       four hexadecimal digits or (in EXTRA_ALT_BSUX mode) a sequence of hex
       digits in curly brackets, it matches a literal "u".

       Note that \0dd is always an octal code. The treatment of backslash fol‐
       lowed by a non-zero digit is complicated; for details see the section
       "Non-printing characters" in the pcre2pattern documentation, where
       details of escape processing in EBCDIC environments are also given.
       \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
       EBCDIC environments. Note that \N not followed by an opening curly
       bracket has a different meaning (see below).
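
       As a rough illustration of the hex and octal escapes in the table
       above, the sketch below uses Python's standard re module. This is an
       assumption made for convenience: re is not PCRE2, and it is used
       here only because it accepts the same \xhh, \0dd, \ddd and \n forms.
       The variable names are illustrative.

```python
import re

# Raw strings are used so that the regex engine, not Python, reads the escape.
hex_escape = re.fullmatch(r"\x41", "A")     # \xhh: hex 41 is "A"
octal_zero = re.fullmatch(r"\011", "\t")    # \0dd: octal 011 is tab (hex 09)
octal_three = re.fullmatch(r"\101", "A")    # \ddd: three octal digits, "A"
newline = re.fullmatch(r"a\nb", "a\nb")     # \n: newline (hex 0A)

for name, m in [("hex", hex_escape), ("octal 0dd", octal_zero),
                ("octal ddd", octal_three), ("newline", newline)]:
    print(name, "matched" if m else "did not match")
```

       Escapes such as \e, \o{ddd..} and \N{U+hh..} are not covered by this
       sketch because the re module does not accept them in this form.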

CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one code unit, even in UTF mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal white space character
         \H         a character that is not a horizontal white space character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a white space character
         \S         a character that is not a white space character
         \v         a vertical white space character
         \V         a character that is not a vertical white space character
         \w         a "word" character
         \W         a "non-word" character
         \X         a Unicode extended grapheme cluster

       \C is dangerous because it may leave the current matching point in the
       middle of a UTF-8 or UTF-16 character. The application can lock out the
       use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
       possible to build PCRE2 with the use of \C permanently disabled.

       By default, \d, \s, and \w match only ASCII characters, even in UTF-8
       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
       matching is happening, \s and \w may also match characters with code
       points in the range 128-255. If the PCRE2_UCP option is set, the behav‐
       iour of these escape sequences is changed to use Unicode properties and
       they match many more characters.
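
       The ASCII-only default for \d, \s and \w can be approximated with
       Python's standard re module, which is not PCRE2. In this sketch the
       re.ASCII flag stands in for PCRE2's default behaviour, and Python's
       own Unicode-aware default is only a loose analogue of PCRE2_UCP;
       both mappings are assumptions for illustration.

```python
import re

# re.ASCII restricts \d, \s and \w to ASCII, like PCRE2 without PCRE2_UCP.
digits = re.findall(r"\d+", "tel 555-0199", re.ASCII)
words = re.findall(r"\w+", "foo_bar baz!", re.ASCII)
chunks = re.findall(r"\S+", "a  b\tc", re.ASCII)

# U+0663 is ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
ascii_only = re.fullmatch(r"\d", "\u0663", re.ASCII)
unicode_aware = re.fullmatch(r"\d", "\u0663")

print(digits, words, chunks)
print(ascii_only is None, unicode_aware is not None)
```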

GENERAL CATEGORY PROPERTIES FOR \p and \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator

PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
                      represented by a Universal Character Name
         Xwd        Perl word: property Xan or underscore

       Perl and POSIX space are now the same. Perl added VT to its space char‐
       acter set at release 5.18.

SCRIPT NAMES FOR \p AND \P

       Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali‐
       nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
       Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba‐
       nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot,
       Cyrillic, Deseret, Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs,
       Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek,
       Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya,
       Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
       Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan‐
       nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
       Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha‐
       jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi,
       Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
       Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
       Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar‐
       ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog‐
       dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya,
       Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
       Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha‐
       vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
       Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
       Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi‐
       nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.

CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE2, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE2_UCP is set.
       You can use \Q...\E inside a character class.
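
       A minimal sketch of positive classes, negative classes and ranges,
       written with Python's standard re module as a stand-in (an
       assumption; re is not PCRE2, and it does not accept the POSIX
       [[:xxx:]] names or \Q...\E, so those are left out).

```python
import re

# Positive class: any one of the listed characters.
vowels = re.findall(r"[aeiou]", "regular")

# Negative class: any character that is not listed.
consonants = re.findall(r"[^aeiou]", "regular")

# Range written with hex escapes: \x30 to \x39 is "0" to "9".
numbers = re.findall(r"[\x30-\x39]+", "id=42")

print(vowels, consonants, numbers)
```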

QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy
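
       The difference between greedy and lazy quantifiers is easiest to see
       on a small subject. The sketch below uses Python's standard re
       module, which is not PCRE2 but treats these forms the same way;
       possessive quantifiers are omitted because older Python versions do
       not accept them.

```python
import re

subject = "<a><b>"

# Greedy: + takes as much as possible, so the match runs to the last ">".
greedy = re.match(r"<.+>", subject).group()

# Lazy: +? takes as little as possible, so the match stops at the first ">".
lazy = re.match(r"<.+?>", subject).group()

# Bounded repeats follow the same rule.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()

print(greedy, lazy, bounded_greedy, bounded_lazy)
```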

ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                       also after an internal newline in multiline mode
                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
         \A          start of subject
         $           end of subject
                       also before newline at end of subject
                       also before internal newline in multiline mode
         \Z          end of subject
                       also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject
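
       The sketch below illustrates ^, $ and \b with Python's standard re
       module, which is an assumption made for convenience and is not
       PCRE2. Only anchors that behave alike in both are shown: \Z, \z and
       \G are deliberately left out because Python either gives them a
       different meaning or does not accept them.

```python
import re

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
first_only = re.findall(r"^\w+", subject)

# With (?m), ^ also matches after an internal newline.
each_line = re.findall(r"(?m)^\w+", subject)

# $ matches before a newline that is the last character of the subject.
before_final_newline = re.search(r"two$", subject)

# \b requires a word boundary on each side, so "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(first_only, each_line, before_final_newline is not None, whole_words)
```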

REPORTED MATCH POINT SETTING

         \K          set reported start of match

       \K is honoured in positive assertions, but ignored in negative ones.

ALTERNATION

         expr|expr|expr...

CAPTURING

         (...)           capture group
         (?<name>...)    named capture group (Perl)
         (?'name'...)    named capture group (Perl)
         (?P<name>...)   named capture group (Python)
         (?:...)         non-capture group
         (?|...)         non-capture group; reset group numbers for
                          capture groups in each alternative

       In non-UTF modes, names may contain underscores and ASCII letters and
       digits; in UTF modes, any Unicode letters and Unicode decimal digits
       are permitted. In both cases, a name must not start with a digit.
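
       How numbered, named and non-capture groups interact can be sketched
       with Python's standard re module (not PCRE2; an assumption for
       illustration). Only the Python-style (?P<name>...) spelling is used
       because re does not accept the Perl forms, and the date pattern is
       purely illustrative.

```python
import re

# Group 1 is named "year"; the month group is non-capturing, so the day
# becomes group 2 rather than group 3.
m = re.match(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})", "2019-02-11")

year = m.group("year")
day = m.group(2)
all_groups = m.groups()

print(year, day, all_groups)
```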

ATOMIC GROUPS

         (?>...)         atomic non-capture group
         (*atomic:...)   atomic non-capture group

COMMENT

         (?#....)        comment (not nestable)

OPTION SETTING
       Changes of these options within a group are automatically cancelled at
       the end of the group.

         (?i)            caseless
         (?J)            allow duplicate names
         (?m)            multiline
         (?n)            no auto capture
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            extended: ignore white space except in classes
         (?xx)           as (?x) but also ignore space and tab in classes
         (?-...)         unset option(s)
         (?^)            unset imnsx options

       Unsetting x or xx unsets both. Several options may be set at once, and
       a mixture of setting and unsetting such as (?i-x) is allowed, but there
       may be only one hyphen. Setting (but not unsetting) is allowed after
       (?^, for example (?^in). An option setting may appear at the start of a
       non-capture group, for example (?i:...).
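
       The scoped form (?i:...) can be sketched with Python's standard re
       module, used here as a stand-in that is not PCRE2. Only this form is
       shown because re does not allow a bare (?i) in the middle of a
       pattern; the sample strings are illustrative.

```python
import re

# Caseless matching applies only inside the (?i:...) group.
inner_caseless = re.fullmatch(r"a(?i:b)c", "aBc")

# The "a" outside the group is still case-sensitive.
outer_still_strict = re.fullmatch(r"a(?i:b)c", "ABc")

names = re.findall(r"(?i:pcre)2", "PCRE2 pcre2 Pcre2")

print(inner_caseless is not None, outer_still_strict is None, names)
```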

       The following are recognized only at the very start of a pattern or
       after one of the newline or \R options with similar syntax. More than
       one of them may appear. For the first three, d is a decimal number.

         (*LIMIT_DEPTH=d)      set the backtracking limit to d
         (*LIMIT_HEAP=d)       set the heap size limit to d * 1024 bytes
         (*LIMIT_MATCH=d)      set the match limit to d
         (*NOTEMPTY)           set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART)   set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS)    no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR)  no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)             disable JIT optimization
         (*NO_START_OPT)       no start-match optimization
                                 (PCRE2_NO_START_OPTIMIZE)
         (*UTF)                set appropriate UTF mode for the library in use
         (*UCP)                set PCRE2_UCP (use Unicode properties for \d etc)

       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
       value of the limits set by the caller of pcre2_match() or
       pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
       synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
       and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
       respectively, at compile time.

NEWLINE CONVENTION

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence
         (*NUL)          the NUL character (binary zero)

WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence

LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)                     )
         (*pla:...)                  ) positive lookahead
         (*positive_lookahead:...)   )

         (?!...)                     )
         (*nla:...)                  ) negative lookahead
         (*negative_lookahead:...)   )

         (?<=...)                    )
         (*plb:...)                  ) positive lookbehind
         (*positive_lookbehind:...)  )

         (?<!...)                    )
         (*nlb:...)                  ) negative lookbehind
         (*negative_lookbehind:...)  )

       Each top-level branch of a lookbehind must be of a fixed length.
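
       The four assertion types can be sketched in their short (?...)
       spellings with Python's standard re module, which is not PCRE2 and
       is assumed here only as a convenient stand-in; the alphabetic
       spellings such as (*pla:...) are PCRE2-specific and are not shown.
       The sample strings are illustrative.

```python
import re

subject = "cost $42, qty 7"

# Positive lookbehind: digits that directly follow a dollar sign.
priced = re.findall(r"(?<=\$)\d+", subject)

# Negative lookbehind: whole numbers that do not follow a dollar sign.
unpriced = re.findall(r"(?<!\$)\b\d+", subject)

# Positive lookahead: words that are directly followed by a comma.
before_comma = re.findall(r"\b\w+(?=,)", "a, b c, d")

# Negative lookahead: words starting with "foo" but not with "foobar".
not_foobar = re.findall(r"\bfoo(?!bar)\w*", "foobar foobaz food")

print(priced, unpriced, before_comma, not_foobar)
```

       The assertions consume no characters, which is why the "$" and ","
       do not appear in the results.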

SCRIPT RUNS

         (*script_run:...)           ) script run, can be backtracked into
         (*sr:...)                   )

         (*atomic_script_run:...)    ) atomic script run
         (*asr:...)                  )

BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g+n            relative reference by number (PCRE2 extension)
         \g-n            relative reference by number
         \g{+n}          relative reference by number (PCRE2 extension)
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)
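
       A short sketch of a numbered and a named backreference, written with
       Python's standard re module as an assumed stand-in for PCRE2. Only
       the \n and (?P=name) forms are shown, since re does not accept the
       \g and \k spellings inside a pattern.

```python
import re

# \1 must match the same text that group 1 captured: a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

# (?P=q) must match the same quote character that opened the string.
same_quote = re.fullmatch(r"(?P<q>['\"]).*?(?P=q)", "'hi'")
mixed_quote = re.fullmatch(r"(?P<q>['\"]).*?(?P=q)", "\"hi'")

print(doubled.group(1), same_quote is not None, mixed_quote is None)
```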

SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subroutine by absolute number
         (?+n)           call subroutine by relative number
         (?-n)           call subroutine by relative number
         (?&name)        call subroutine by name (Perl)
         (?P>name)       call subroutine by name (Python)
         \g<name>        call subroutine by name (Oniguruma)
         \g'name'        call subroutine by name (Oniguruma)
         \g<n>           call subroutine by absolute number (Oniguruma)
         \g'n'           call subroutine by absolute number (Oniguruma)
         \g<+n>          call subroutine by relative number (PCRE2 extension)
         \g'+n'          call subroutine by relative number (PCRE2 extension)
         \g<-n>          call subroutine by relative number (PCRE2 extension)
         \g'-n'          call subroutine by relative number (PCRE2 extension)

CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)               absolute reference condition
         (?(+n)              relative reference condition
         (?(-n)              relative reference condition
         (?(<name>)          named reference condition (Perl)
         (?('name')          named reference condition (Perl)
         (?(name)            named reference condition (PCRE2, deprecated)
         (?(R)               overall recursion condition
         (?(Rn)              specific numbered group recursion condition
         (?(R&name)          specific named group recursion condition
         (?(DEFINE)          define groups for reference
         (?(VERSION[>]=n.m)  test PCRE2 version
         (?(assert)          assertion condition

       Note the ambiguity of (?(R) and (?(Rn) which might be named reference
       conditions or recursion tests. Such a condition is interpreted as a
       reference condition if the relevant named group exists.
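
       The absolute reference condition (?(n)...) can be sketched with
       Python's standard re module, which is not PCRE2 and is assumed here
       only because it accepts the same form. The recursion, DEFINE,
       VERSION and assertion conditions are PCRE2 features that re does not
       provide, so they are not shown.

```python
import re

# If group 1 (an opening "<") matched, a closing ">" is required.
tag = r"(<)?\w+(?(1)>)"
bracketed = re.fullmatch(tag, "<tag>")
bare = re.fullmatch(tag, "tag")
unclosed = re.fullmatch(tag, "<tag")

# With a no-pattern: "b" must follow a matched "a", otherwise "c" is needed.
either = r"(a)?(?(1)b|c)"
yes_branch = re.fullmatch(either, "ab")
no_branch = re.fullmatch(either, "c")
wrong_branch = re.fullmatch(either, "ac")

print(bool(bracketed), bool(bare), bool(unclosed))
print(bool(yes_branch), bool(no_branch), bool(wrong_branch))
```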

BACKTRACKING CONTROL

       All backtracking control verbs may be in the form (*VERB:NAME). For
       (*MARK) the name is mandatory, for the others it is optional. (*SKIP)
       changes its behaviour if :NAME is present. The others just set a name
       for passing back to the caller, but this is not a name that (*SKIP) can
       see. The following act as soon as they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)
         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

       The following act only when a subsequent match failure causes a back‐
       track to reach them. They all force a match failure, but they differ in
       what happens afterwards. Those that advance the start-of-match point do
       so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*SKIP)         advance to current matching position
         (*SKIP:NAME)    advance to position corresponding to an earlier
                         (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)         local failure, backtrack to next alternation

       The effect of one of these verbs in a group called as a subroutine is
       confined to the subroutine call.

CALLOUTS

         (?C)            callout (assumed number 0)
         (?Cn)           callout with numerical data n
         (?C"text")      callout with string data

       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
       the start and the end), and the starting delimiter { matched with the
       ending delimiter }. To encode the ending delimiter within the string,
       double it.

SEE ALSO

       pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
       pcre2(3).

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION

       Last updated: 11 February 2019
       Copyright (c) 1997-2019 University of Cambridge.



PCRE2 10.33                    11 February 2019                 PCRE2SYNTAX(3)