PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)

PCRE2 - Perl-compatible regular expressions (revised API)

The full syntax and semantics of the regular expressions that are
supported by PCRE2 are described in the pcre2pattern documentation. This
document contains a quick-reference summary of the syntax.


  \x          where x is non-alphanumeric is a literal x
  \Q...\E     treat enclosed characters as literal


This table applies to ASCII and Unicode environments. An unrecognized
escape sequence causes an error.

  \a          alarm, that is, the BEL character (hex 07)
  \cx         "control-x", where x is any ASCII printing character
  \e          escape (hex 1B)
  \f          form feed (hex 0C)
  \n          newline (hex 0A)
  \r          carriage return (hex 0D)
  \t          tab (hex 09)
  \0dd        character with octal code 0dd
  \ddd        character with octal code ddd, or backreference
  \o{ddd..}   character with octal code ddd..
  \N{U+hh..}  character with Unicode code point hh.. (Unicode mode only)
  \xhh        character with hex code hh
  \x{hh..}    character with hex code hh..

If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
following are also recognized:

  \U          the character "U"
  \uhhhh      character with hex code hhhh
  \u{hh..}    character with hex code hh.. but only for EXTRA_ALT_BSUX

When \x is not followed by {, from zero to two hexadecimal digits are
read, but in ALT_BSUX mode \x must be followed by two hexadecimal digits
to be recognized as a hexadecimal escape; otherwise it matches a literal
"x". Likewise, if \u (in ALT_BSUX mode) is not followed by four
hexadecimal digits or (in EXTRA_ALT_BSUX mode) a sequence of hex digits
in curly brackets, it matches a literal "u".
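
The \x rule above can be modelled as a small decision procedure. The
following Python sketch is not PCRE2 code; read_x_escape is a
hypothetical helper written only to illustrate the rule, and returning
code point 0 when no hexadecimal digits follow \x in the default mode is
an assumption about what "zero digits" yields.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def read_x_escape(rest, alt_bsux=False):
    """Interpret the text that follows a backslash-x not followed by '{'.

    Returns (character, number_of_characters_consumed_from_rest).
    Hypothetical helper: a model of the documented rule, not PCRE2 itself.
    """
    # Collect at most two leading hexadecimal digits.
    digits = ""
    for ch in rest[:2]:
        if ch not in HEX_DIGITS:
            break
        digits += ch

    if alt_bsux:
        # ALT_BSUX mode: exactly two digits are required,
        # otherwise the escape is just a literal "x".
        if len(digits) == 2:
            return chr(int(digits, 16)), 2
        return "x", 0

    # Default mode: zero, one, or two digits are accepted.
    # Assumption: zero digits is treated as code point 0.
    code = int(digits, 16) if digits else 0
    return chr(code), len(digits)


print(read_x_escape("41z"))                  # two digits read
print(read_x_escape("4z"))                   # one digit read
print(read_x_escape("4z", alt_bsux=True))    # literal "x"
```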

Note that \0dd is always an octal code. The treatment of backslash
followed by a non-zero digit is complicated; for details see the section
"Non-printing characters" in the pcre2pattern documentation, where
details of escape processing in EBCDIC environments are also given.
\N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
EBCDIC environments. Note that \N not followed by an opening curly
bracket has a different meaning (see below).


  .           any character except newline;
                in dotall mode, any character whatsoever
  \C          one code unit, even in UTF mode (best avoided)
  \d          a decimal digit
  \D          a character that is not a decimal digit
  \h          a horizontal white space character
  \H          a character that is not a horizontal white space character
  \N          a character that is not a newline
  \p{xx}      a character with the xx property
  \P{xx}      a character without the xx property
  \R          a newline sequence
  \s          a white space character
  \S          a character that is not a white space character
  \v          a vertical white space character
  \V          a character that is not a vertical white space character
  \w          a "word" character
  \W          a "non-word" character
  \X          a Unicode extended grapheme cluster

\C is dangerous because it may leave the current matching point in the
middle of a UTF-8 or UTF-16 character. The application can lock out the
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
possible to build PCRE2 with the use of \C permanently disabled.

By default, \d, \s, and \w match only ASCII characters, even in UTF-8
mode or in the 16-bit and 32-bit libraries. However, if locale-specific
matching is happening, \s and \w may also match characters with code
points in the range 128-255. If the PCRE2_UCP option is set, the
behaviour of these escape sequences is changed to use Unicode properties
and they match many more characters.


  C           Other
  Cc          Control
  Cf          Format
  Cn          Unassigned
  Co          Private use
  Cs          Surrogate

  L           Letter
  Ll          Lower case letter
  Lm          Modifier letter
  Lo          Other letter
  Lt          Title case letter
  Lu          Upper case letter
  L&          Ll, Lu, or Lt

  M           Mark
  Mc          Spacing mark
  Me          Enclosing mark
  Mn          Non-spacing mark

  N           Number
  Nd          Decimal number
  Nl          Letter number
  No          Other number

  P           Punctuation
  Pc          Connector punctuation
  Pd          Dash punctuation
  Pe          Close punctuation
  Pf          Final punctuation
  Pi          Initial punctuation
  Po          Other punctuation
  Ps          Open punctuation

  S           Symbol
  Sc          Currency symbol
  Sk          Modifier symbol
  Sm          Mathematical symbol
  So          Other symbol

  Z           Separator
  Zl          Line separator
  Zp          Paragraph separator
  Zs          Space separator


  Xan         Alphanumeric: union of properties L and N
  Xps         POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp         Perl space: property Z or tab, NL, VT, FF, CR
  Xuc         Universally-named character: one that can be
                represented by a Universal Character Name
  Xwd         Perl word: property Xan or underscore

Perl and POSIX space are now the same. Perl added VT to its space
character set at release 5.18.


Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese,
Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi, Braille,
Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian,
Chakma, Cham, Cherokee, Chorasmian, Common, Coptic, Cuneiform, Cypriot,
Cyrillic, Deseret, Devanagari, Dives_Akuru, Dogra, Duployan,
Egyptian_Hieroglyphs, Elbasan, Elymaic, Ethiopic, Georgian, Glagolitic,
Gothic, Grantha, Greek, Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul,
Hanifi_Rohingya, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic,
Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,
Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khitan_Small_Script,
Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B,
Lisu, Lycian, Lydian, Mahajani, Makasar, Malayalam, Mandaic, Manichaean,
Marchen, Masaram_Gondi, Medefaidrin, Meetei_Mayek, Mende_Kikakui,
Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
Multani, Myanmar, Nabataean, Nandinagari, New_Tai_Lue, Newa, Nko, Nushu,
Nyakeng_Puachue_Hmong, Ogham, Ol_Chiki, Old_Hungarian, Old_Italic,
Old_North_Arabian, Old_Permic, Old_Persian, Old_Sogdian,
Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya, Pahawh_Hmong,
Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang,
Runic, Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting,
Sinhala, Sogdian, Sora_Sompeng, Soyombo, Sundanese, Syloti_Nagri,
Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil,
Tangut, Telugu, Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai,
Wancho, Warang_Citi, Yezidi, Yi, Zanabazar_Square.


  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit

In PCRE2, POSIX character set names recognize only ASCII characters by
default, but some of them use Unicode properties if PCRE2_UCP is set.
You can use \Q...\E inside a character class.


  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
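
The difference between the greedy and lazy forms in the table above is
easiest to see on a short subject. The sketch below uses Python's re
module, which is a different engine from PCRE2; it is used here only on
the assumption that these particular quantifiers behave the same way in
both. The possessive forms are left out because older Python versions do
not accept them.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then backs off to the last '>'.
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can, so it stops at the first '>'.
lazy = re.match(r"<.+?>", subject).group()

# Counted repetition: {2,3} accepts two or three digits and,
# being greedy, prefers three.
counted = re.match(r"\d{2,3}", "12345").group()

print(greedy, lazy, counted)
```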


  \b          word boundary
  \B          not a word boundary
  ^           start of subject
                also after an internal newline in multiline mode
                (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
  \A          start of subject
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \Z          end of subject
                also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject


  \K          set reported start of match

\K is honoured in positive assertions, but ignored in negative ones.


  expr|expr|expr...


  (...)           capture group
  (?<name>...)    named capture group (Perl)
  (?'name'...)    named capture group (Perl)
  (?P<name>...)   named capture group (Python)
  (?:...)         non-capture group
  (?|...)         non-capture group; reset group numbers for
                    capture groups in each alternative

In non-UTF modes, names may contain underscores and ASCII letters and
digits; in UTF modes, any Unicode letters and Unicode decimal digits
are permitted. In both cases, a name must not start with a digit.
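
The non-UTF naming rule above can be expressed as a one-line check. This
is a minimal Python sketch; is_valid_group_name is a hypothetical helper,
not part of any PCRE2 API, and it covers only the character rule stated
here (any other restrictions PCRE2 applies to names are not modelled).

```python
import re

# Non-UTF rule from the text: ASCII letters, digits and underscores,
# with the extra condition that the first character is not a digit.
_GROUP_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_group_name(name):
    """Return True if name satisfies the non-UTF naming rule (hypothetical helper)."""
    return _GROUP_NAME.fullmatch(name) is not None


for candidate in ("year_1", "1year", "a-b"):
    print(candidate, is_valid_group_name(candidate))
```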


  (?>...)         atomic non-capture group
  (*atomic:...)   atomic non-capture group


  (?#....)        comment (not nestable)

Changes of the following options within a group are automatically
cancelled at the end of the group.

  (?i)            caseless
  (?J)            allow duplicate named groups
  (?m)            multiline
  (?n)            no auto capture
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended: ignore white space except in classes
  (?xx)           as (?x) but also ignore space and tab in classes
  (?-...)         unset option(s)
  (?^)            unset imnsx options

Unsetting x or xx unsets both. Several options may be set at once, and
a mixture of setting and unsetting such as (?i-x) is allowed, but there
may be only one hyphen. Setting (but not unsetting) is allowed after
(?^, for example (?^in). An option setting may appear at the start of a
non-capture group, for example (?i:...).
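
The scoping of an option set at the start of a non-capture group can be
illustrated as follows. The sketch uses Python's re module, which
accepts the same (?i:...) form; it is assumed, not verified here, that
PCRE2 gives the same results for this pattern.

```python
import re

# Only the "abc" part is caseless; the option is cancelled at the
# end of the group, so "def" must match exactly.
pattern = re.compile(r"(?i:abc)def")

inside_group_caseless = pattern.fullmatch("ABCdef") is not None
outside_group_caseless = pattern.fullmatch("ABCDEF") is not None

print(inside_group_caseless, outside_group_caseless)
```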

The following are recognized only at the very start of a pattern or
after one of the newline or \R options with similar syntax. More than
one of them may appear. For the first three, d is a decimal number.

  (*LIMIT_DEPTH=d)        set the backtracking limit to d
  (*LIMIT_HEAP=d)         set the heap size limit to d * 1024 bytes
  (*LIMIT_MATCH=d)        set the match limit to d
  (*NOTEMPTY)             set PCRE2_NOTEMPTY when matching
  (*NOTEMPTY_ATSTART)     set PCRE2_NOTEMPTY_ATSTART when matching
  (*NO_AUTO_POSSESS)      no auto-possessification (PCRE2_NO_AUTO_POSSESS)
  (*NO_DOTSTAR_ANCHOR)    no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
  (*NO_JIT)               disable JIT optimization
  (*NO_START_OPT)         no start-match optimization (PCRE2_NO_START_OPTIMIZE)
  (*UTF)                  set appropriate UTF mode for the library in use
  (*UCP)                  set PCRE2_UCP (use Unicode properties for \d etc)

Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
value of the limits set by the caller of pcre2_match() or
pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
respectively, at compile time.
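
The "can only reduce" rule for the three limits amounts to taking the
smaller of two values. The Python sketch below is only a model of that
rule; effective_limit and heap_limit_bytes are hypothetical names, and
nothing here calls PCRE2.

```python
def effective_limit(caller_limit, pattern_limit=None):
    """Model of the rule: a limit given in the pattern can lower the
    caller's limit but never raise it (hypothetical helper)."""
    if pattern_limit is None:
        # No (*LIMIT_...=d) item in the pattern: the caller's value stands.
        return caller_limit
    return min(caller_limit, pattern_limit)


def heap_limit_bytes(d):
    """(*LIMIT_HEAP=d) expresses the heap limit as d * 1024 bytes."""
    return d * 1024


print(effective_limit(1000, 50))     # pattern lowers the limit
print(effective_limit(1000, 5000))   # attempt to raise it has no effect
print(heap_limit_bytes(4))
```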


The following newline conventions are recognized only at the very start
of the pattern or after option settings with a similar syntax.

  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
  (*NUL)          the NUL character (binary zero)


The following settings for what \R matches are recognized only at the
very start of the pattern or after option settings with a similar
syntax.

  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence


  (?=...)                     )
  (*pla:...)                  ) positive lookahead
  (*positive_lookahead:...)   )

  (?!...)                     )
  (*nla:...)                  ) negative lookahead
  (*negative_lookahead:...)   )

  (?<=...)                    )
  (*plb:...)                  ) positive lookbehind
  (*positive_lookbehind:...)  )

  (?<!...)                    )
  (*nlb:...)                  ) negative lookbehind
  (*negative_lookbehind:...)  )

Each top-level branch of a lookbehind must be of a fixed length.
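
The four assertion kinds in the table above can be tried out with the
short (?=, (?!, (?<= and (?<! forms, which Python's re module also
accepts. The sketch below runs on Python's engine, not PCRE2, and it is
an assumption that both agree on these simple cases. Note that Python's
fixed-length rule for lookbehinds is stricter than the per-branch rule
stated above; the last check only shows that a lookbehind of unbounded
length is rejected.

```python
import re

subject = "foobar foobaz"

# Positive lookahead: "foo" only where "bar" follows.
ahead = [m.start() for m in re.finditer(r"foo(?=bar)", subject)]

# Negative lookahead: "foo" only where "bar" does not follow.
not_ahead = [m.start() for m in re.finditer(r"foo(?!bar)", subject)]

prices = "cost $42, qty 17"

# Positive lookbehind: digits immediately preceded by a dollar sign.
behind = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: whole numbers not preceded by a dollar sign.
not_behind = re.findall(r"(?<!\$)\b\d+", prices)

# A lookbehind whose length is not fixed is refused at compile time.
try:
    re.compile(r"(?<=a+)b")
    variable_length_rejected = False
except re.error:
    variable_length_rejected = True

print(ahead, not_ahead, behind, not_behind, variable_length_rejected)
```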


The following non-atomic assertions are specific to PCRE2 and are not
Perl-compatible.

  (?*...)                                )
  (*napla:...)                           ) synonyms
  (*non_atomic_positive_lookahead:...)   )

  (?<*...)                               )
  (*naplb:...)                           ) synonyms
  (*non_atomic_positive_lookbehind:...)  )


  (*script_run:...)           ) script run, can be backtracked into
  (*sr:...)                   )

  (*atomic_script_run:...)    ) atomic script run
  (*asr:...)                  )


  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g+n            relative reference by number (PCRE2 extension)
  \g-n            relative reference by number
  \g{+n}          relative reference by number (PCRE2 extension)
  \g{-n}          relative reference by number
  \k<name>        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
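
Two of the forms in the table above, \n and (?P=name), are also accepted
by Python's re module, so they can be tried without PCRE2. The sketch
below assumes the two engines agree on these simple patterns; the \g and
\k forms are not shown because Python does not accept them inside a
pattern.

```python
import re

# Reference by number: a word character followed by the same character.
doubled_letter = re.search(r"(\w)\1", "hello").group()

# Reference by name (Python syntax): a word repeated after one space.
doubled_word = re.search(
    r"\b(?P<w>\w+) (?P=w)\b", "this is is a test"
).group()

print(doubled_letter, "|", doubled_word)
```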


  (?R)            recurse whole pattern
  (?n)            call subroutine by absolute number
  (?+n)           call subroutine by relative number
  (?-n)           call subroutine by relative number
  (?&name)        call subroutine by name (Perl)
  (?P>name)       call subroutine by name (Python)
  \g<name>        call subroutine by name (Oniguruma)
  \g'name'        call subroutine by name (Oniguruma)
  \g<n>           call subroutine by absolute number (Oniguruma)
  \g'n'           call subroutine by absolute number (Oniguruma)
  \g<+n>          call subroutine by relative number (PCRE2 extension)
  \g'+n'          call subroutine by relative number (PCRE2 extension)
  \g<-n>          call subroutine by relative number (PCRE2 extension)
  \g'-n'          call subroutine by relative number (PCRE2 extension)


  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)               absolute reference condition
  (?(+n)              relative reference condition
  (?(-n)              relative reference condition
  (?(<name>)          named reference condition (Perl)
  (?('name')          named reference condition (Perl)
  (?(name)            named reference condition (PCRE2, deprecated)
  (?(R)               overall recursion condition
  (?(Rn)              specific numbered group recursion condition
  (?(R&name)          specific named group recursion condition
  (?(DEFINE)          define groups for reference
  (?(VERSION[>]=n.m)  test PCRE2 version
  (?(assert)          assertion condition

Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a
reference condition if the relevant named group exists.
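
The absolute reference condition (?(n)...) is the one form in the table
above that Python's re module shares, so it can serve as a small
illustration. The sketch runs on Python's engine and assumes PCRE2
treats this pattern the same way: a closing ">" is required only when
group 1 matched an opening "<".

```python
import re

# Group 1 optionally matches "<"; the condition (?(1)>) then demands
# a closing ">" only if group 1 took part in the match.
pattern = re.compile(r"(<)?\w+(?(1)>)")

bracketed = pattern.fullmatch("<tag>") is not None
bare = pattern.fullmatch("tag") is not None
unclosed = pattern.fullmatch("<tag") is not None
stray_close = pattern.fullmatch("tag>") is not None

print(bracketed, bare, unclosed, stray_close)
```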


All backtracking control verbs may be in the form (*VERB:NAME). For
(*MARK) the name is mandatory, for the others it is optional. (*SKIP)
changes its behaviour if :NAME is present. The others just set a name
for passing back to the caller, but this is not a name that (*SKIP) can
see. The following act as soon as they are reached:

  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

The following act only when a subsequent match failure causes a
backtrack to reach them. They all force a match failure, but they differ
in what happens afterwards. Those that advance the start-of-match point
do so only if the pattern is not anchored.

  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                    (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation

The effect of one of these verbs in a group called as a subroutine is
confined to the subroutine call.


  (?C)            callout (assumed number 0)
  (?Cn)           callout with numerical data n
  (?C"text")      callout with string data

The allowed string delimiters are ` ' " ^ % # $ (which are the same for
the start and the end), and the starting delimiter { matched with the
ending delimiter }. To encode the ending delimiter within the string,
double it.
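
The delimiter-doubling rule above is simple string handling and can be
sketched in a few lines of Python. callout_with_string is a hypothetical
helper that only builds the pattern text; it does not call PCRE2.

```python
# Delimiters listed in the text that are the same at start and end.
SYMMETRIC_DELIMITERS = set("`'\"^%#$")


def callout_with_string(text, delimiter='"'):
    """Build a (?C...) item with string data (hypothetical helper).

    The ending delimiter is doubled wherever it occurs in text.
    """
    if delimiter == "{":
        closing = "}"
    elif delimiter in SYMMETRIC_DELIMITERS:
        closing = delimiter
    else:
        raise ValueError("not an allowed callout string delimiter")
    escaped = text.replace(closing, closing * 2)
    return "(?C" + delimiter + escaped + closing + ")"


print(callout_with_string('say "hi"'))
print(callout_with_string("a}b", delimiter="{"))
```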


pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
pcre2(3).


Philip Hazel
University Computing Service
Cambridge, England.


Last updated: 28 December 2019
Copyright (c) 1997-2019 University of Cambridge.



PCRE2 10.35                    28 December 2019                 PCRE2SYNTAX(3)