PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)

NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY

       The full syntax and semantics of the regular expressions that are
       supported by PCRE2 are described in the pcre2pattern documentation.
       This document contains a quick-reference summary of the syntax.

QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal

ESCAPED CHARACTERS

       This table applies to ASCII and Unicode environments. An
       unrecognized escape sequence causes an error.

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is any ASCII printing character
         \e         escape (hex 1B)
         \f         form feed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \0dd       character with octal code 0dd
         \ddd       character with octal code ddd, or backreference
         \o{ddd..}  character with octal code ddd..
         \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
         \xhh       character with hex code hh
         \x{hh..}   character with hex code hh..

       If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"),
       the following are also recognized:

         \U         the character "U"
         \uhhhh     character with hex code hhhh
         \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX

       When \x is not followed by {, from zero to two hexadecimal digits
       are read, but in ALT_BSUX mode \x must be followed by two
       hexadecimal digits to be recognized as a hexadecimal escape;
       otherwise it matches a literal "x". Likewise, if \u (in ALT_BSUX
       mode) is not followed by four hexadecimal digits or (in
       EXTRA_ALT_BSUX mode) a sequence of hex digits in curly brackets, it
       matches a literal "u".

       Note that \0dd is always an octal code. The treatment of backslash
       followed by a non-zero digit is complicated; for details see the
       section "Non-printing characters" in the pcre2pattern documentation,
       where details of escape processing in EBCDIC environments are also
       given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
       supported in EBCDIC environments. Note that \N not followed by an
       opening curly bracket has a different meaning (see below).
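
       The hex and octal forms above can be tried out with a short sketch.
       It uses Python's re module purely as a stand-in (an assumption: it
       is not PCRE2, and only escapes that both engines read the same way
       are used). Note that Python always accepts \uhhhh, whereas PCRE2
       recognizes it only in ALT_BSUX mode. The helper name matches_all is
       hypothetical.

```python
import re

# Assumption: Python's re is used only as a stand-in; it is not PCRE2.
# \x41 is hex for "A", \102 is octal for "B", \011 is octal for tab.
ESCAPES = re.compile(r"\x41\102\011")

# Python always accepts \uhhhh; PCRE2 needs ALT_BSUX mode for this form.
E_ACUTE = re.compile(r"\u00e9")


def matches_all(pattern, text):
    """Return True when the whole of text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None


print(matches_all(ESCAPES, "AB\t"))
print(matches_all(ESCAPES, "AB "))
print(matches_all(E_ACUTE, "\u00e9"))
```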

CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one code unit, even in UTF mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal white space character
         \H         a character that is not a horizontal white space character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a white space character
         \S         a character that is not a white space character
         \v         a vertical white space character
         \V         a character that is not a vertical white space character
         \w         a "word" character
         \W         a "non-word" character
         \X         a Unicode extended grapheme cluster

       \C is dangerous because it may leave the current matching point in
       the middle of a UTF-8 or UTF-16 character. The application can lock
       out the use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It
       is also possible to build PCRE2 with the use of \C permanently
       disabled.

       By default, \d, \s, and \w match only ASCII characters, even in
       UTF-8 mode or in the 16-bit and 32-bit libraries. However, if
       locale-specific matching is in effect, \s and \w may also match
       characters with code points in the range 128-255. If the PCRE2_UCP
       option is set, the behaviour of these escape sequences is changed to
       use Unicode properties and they match many more characters.
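
       The difference between ASCII-only and Unicode-property matching for
       \d can be illustrated with a rough analogy in Python's re module.
       This is an assumption-laden sketch, not PCRE2 itself: Python's
       re.ASCII flag behaves roughly like the PCRE2 default, and Python's
       default for str patterns behaves roughly like PCRE2_UCP.

```python
import re

# The digits 1, 2, 3 written in Arabic-Indic script (category Nd).
ARABIC_INDIC = "\u0661\u0662\u0663"

# re.ASCII restricts \d to 0-9, comparable to PCRE2's default behaviour.
ascii_digits = re.compile(r"\d+", re.ASCII)

# Python's default for str patterns is Unicode-aware, comparable to the
# effect of PCRE2_UCP on \d.
unicode_digits = re.compile(r"\d+")

print(ascii_digits.fullmatch("123") is not None)
print(ascii_digits.fullmatch(ARABIC_INDIC) is not None)
print(unicode_digits.fullmatch(ARABIC_INDIC) is not None)
```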

GENERAL CATEGORY PROPERTIES FOR \p and \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator
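
       The two-letter codes in this table are the standard Unicode general
       category values, so they can be inspected outside PCRE2. The sketch
       below uses Python's unicodedata module to look up the category of a
       few characters. The helper has_property is hypothetical and only
       approximates what \p{xx} tests; it does not handle L&.

```python
import unicodedata


def general_category(ch):
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)


def has_property(ch, prop):
    """Hypothetical helper approximating \\p{prop} for codes in the table.

    A one-letter code such as "L" covers every two-letter code that
    starts with it. The special code L& is not handled here.
    """
    return general_category(ch).startswith(prop)


for ch in ["A", "a", "7", "_", "(", "$", "+", " "]:
    print(repr(ch), general_category(ch))

print(has_property("A", "L"), has_property("A", "Lu"), has_property("7", "L"))
```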

PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
                      represented by a Universal Character Name
         Xwd        Perl word: property Xan or underscore

       Perl and POSIX space are now the same. Perl added VT to its space
       character set at release 5.18.

SCRIPT NAMES FOR \p AND \P

       Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan,
       Balinese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo,
       Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,
       Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic,
       Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Dogra, Duployan,
       Egyptian_Hieroglyphs, Elbasan, Elymaic, Ethiopic, Georgian,
       Glagolitic, Gothic, Grantha, Greek, Gujarati, Gunjala_Gondi,
       Gurmukhi, Han, Hangul, Hanifi_Rohingya, Hanunoo, Hatran, Hebrew,
       Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi,
       Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana,
       Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha,
       Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Makasar,
       Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi, Medefaidrin,
       Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive, Meroitic_Hieroglyphs,
       Miao, Modi, Mongolian, Mro, Multani, Myanmar, Nabataean,
       Nandinagari, New_Tai_Lue, Newa, Nko, Nushu, Nyiakeng_Puachue_Hmong,
       Ogham, Ol_Chiki, Old_Hungarian, Old_Italic, Old_North_Arabian,
       Old_Permic, Old_Persian, Old_Sogdian, Old_South_Arabian, Old_Turkic,
       Oriya, Osage, Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau,
       Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic, Samaritan,
       Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala,
       Sogdian, Sora_Sompeng, Soyombo, Sundanese, Syloti_Nagri, Syriac,
       Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Tangut,
       Telugu, Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai,
       Wancho, Warang_Citi, Yi, Zanabazar_Square.

CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE2, POSIX character set names recognize only ASCII characters
       by default, but some of them use Unicode properties if PCRE2_UCP is
       set. You can use \Q...\E inside a character class.

QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy
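
       The difference between the greedy and lazy forms is easiest to see
       on a small input. The sketch below uses Python's re module as a
       stand-in for PCRE2 (an assumption; the two engines agree on these
       particular forms). Possessive quantifiers are left out because
       their availability in Python depends on the interpreter version.
       The helper first_match is hypothetical.

```python
import re

TEXT = "<a><b>"


def first_match(pattern, text):
    """Return the text matched at the start of text, or None."""
    m = re.match(pattern, text)
    return m.group() if m else None


greedy = first_match(r"<.+>", TEXT)          # + takes as much as it can
lazy = first_match(r"<.+?>", TEXT)           # +? takes as little as it can
exact = first_match(r"a{3}", "aaaa")         # {n} takes exactly n
ranged = first_match(r"a{2,3}", "aaaa")      # {n,m} is greedy, up to m
ranged_lazy = first_match(r"a{2,3}?", "aaaa")  # {n,m}? stops at n if it can

print(greedy, lazy, exact, ranged, ranged_lazy)
```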

ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                       also after an internal newline in multiline mode
                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
         \A          start of subject
         $           end of subject
                       also before newline at end of subject
                       also before internal newline in multiline mode
         \Z          end of subject
                       also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject
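
       The effect of multiline mode on ^ and $ can be sketched as follows,
       again using Python's re module as a stand-in (an assumption). Only
       ^, $, \A and \b are shown, because Python gives \Z a different
       meaning from PCRE2 and has no \z or \G.

```python
import re

SUBJECT = "one\ntwo\n"

# ^ matches only at the start of the subject by default.
start_default = re.findall(r"^\w+", SUBJECT)
# In multiline mode ^ also matches after an internal newline.
start_multi = re.findall(r"^\w+", SUBJECT, re.MULTILINE)

# $ matches at the end, and also before a newline at the end of the subject.
end_default = re.findall(r"\w+$", SUBJECT)
# In multiline mode $ also matches before an internal newline.
end_multi = re.findall(r"\w+$", SUBJECT, re.MULTILINE)

# \A is not affected by multiline mode.
at_start = re.findall(r"\A\w+", SUBJECT, re.MULTILINE)

# \b requires a word boundary on each side of "cat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(start_default, start_multi, end_default, end_multi, at_start, whole_words)
```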

REPORTED MATCH POINT SETTING

         \K          set reported start of match

       \K is honoured in positive assertions, but ignored in negative ones.

ALTERNATION

         expr|expr|expr...

CAPTURING

         (...)           capture group
         (?<name>...)    named capture group (Perl)
         (?'name'...)    named capture group (Perl)
         (?P<name>...)   named capture group (Python)
         (?:...)         non-capture group
         (?|...)         non-capture group; reset group numbers for
                          capture groups in each alternative

       In non-UTF modes, names may contain underscores and ASCII letters
       and digits; in UTF modes, any Unicode letters and Unicode decimal
       digits are permitted. In both cases, a name must not start with a
       digit.
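
       As a small illustration of named and non-capture groups, the sketch
       below uses the Python-style (?P<name>...) form, which is the form
       that Python's own re module accepts (the module is used here as a
       stand-in for PCRE2, which is an assumption). The pattern and the
       group names year and month are made up for the example.

```python
import re

# Two named capture groups and one non-capture group for the day.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?:\d{2})")

m = DATE.fullmatch("2019-07-29")

year = m.group("year")    # access by name
month = m.group(2)        # named groups are also numbered
captured = m.groups()     # the (?:...) group does not appear here

print(year, month, captured)
```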

ATOMIC GROUPS

         (?>...)         atomic non-capture group
         (*atomic:...)   atomic non-capture group

COMMENT

         (?#....)        comment (not nestable)

OPTION SETTING
       Changes of these options within a group are automatically cancelled
       at the end of the group.

         (?i)            caseless
         (?J)            allow duplicate names
         (?m)            multiline
         (?n)            no auto capture
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            extended: ignore white space except in classes
         (?xx)           as (?x) but also ignore space and tab in classes
         (?-...)         unset option(s)
         (?^)            unset imnsx options

       Unsetting x or xx unsets both. Several options may be set at once,
       and a mixture of setting and unsetting such as (?i-x) is allowed,
       but there may be only one hyphen. Setting (but not unsetting) is
       allowed after (?^, for example (?^in). An option setting may appear
       at the start of a non-capture group, for example (?i:...).
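
       The scoping of inline options can be sketched with Python's re
       module, which accepts (?i), (?s), (?i:...) and (?-i:...) with
       meanings close to those above. This is an assumption-based analogy,
       not PCRE2 itself: Python requires a bare (?i) to be at the very
       start of the pattern and does not support (?J), (?n), (?U), (?xx)
       or (?^). The helper name whole is hypothetical.

```python
import re


def whole(pattern, text):
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# (?i) at the start makes the whole pattern caseless.
print(whole(r"(?i)abc", "ABC"))

# (?i:...) limits caselessness to the group; "def" stays case-sensitive.
print(whole(r"(?i:abc)def", "ABCdef"))
print(whole(r"(?i:abc)def", "ABCDEF"))

# (?-i:...) unsets the option inside the group only.
print(whole(r"(?i)a(?-i:b)c", "AbC"))
print(whole(r"(?i)a(?-i:b)c", "ABC"))

# (?s) lets the dot match a newline.
print(whole(r"a.b", "a\nb"))
print(whole(r"(?s)a.b", "a\nb"))
```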

       The following are recognized only at the very start of a pattern or
       after one of the newline or \R options with similar syntax. More
       than one of them may appear. For the first three, d is a decimal
       number.

         (*LIMIT_DEPTH=d)     set the backtracking limit to d
         (*LIMIT_HEAP=d)      set the heap size limit to d * 1024 bytes
         (*LIMIT_MATCH=d)     set the match limit to d
         (*NOTEMPTY)          set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART)  set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS)   no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)            disable JIT optimization
         (*NO_START_OPT)      no start-match optimization (PCRE2_NO_START_OPTIMIZE)
         (*UTF)               set appropriate UTF mode for the library in use
         (*UCP)               set PCRE2_UCP (use Unicode properties for \d etc)

       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce
       the value of the limits set by the caller of pcre2_match() or
       pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
       synonym for LIMIT_DEPTH. The application can lock out the use of
       (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP
       options, respectively, at compile time.

NEWLINE CONVENTION

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence
         (*NUL)          the NUL character (binary zero)

WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence

LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)                     )
         (*pla:...)                  ) positive lookahead
         (*positive_lookahead:...)   )

         (?!...)                     )
         (*nla:...)                  ) negative lookahead
         (*negative_lookahead:...)   )

         (?<=...)                    )
         (*plb:...)                  ) positive lookbehind
         (*positive_lookbehind:...)  )

         (?<!...)                    )
         (*nlb:...)                  ) negative lookbehind
         (*negative_lookbehind:...)  )

       Each top-level branch of a lookbehind must be of a fixed length.
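
       The four short forms can be compared side by side on small inputs.
       The sketch uses Python's re module as a stand-in for PCRE2 (an
       assumption); Python supports only the (?=, (?!, (?<= and (?<!
       spellings, not the (*pla:...) style names, and its fixed-length
       rule for lookbehinds is stricter than the one stated above. The
       sample strings are made up.

```python
import re

PRICES = "10 USD, 20 EUR, 30 USD"
AMOUNTS = "$5 and 7 and $12"

# Positive lookahead: numbers that are followed by " USD".
usd = re.findall(r"\d+(?= USD)", PRICES)

# Negative lookahead: whole numbers that are not followed by " USD".
not_usd = re.findall(r"\b\d+\b(?! USD)", PRICES)

# Positive lookbehind: numbers that are preceded by a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", AMOUNTS)

# Negative lookbehind: whole numbers not preceded by a dollar sign.
plain = re.findall(r"(?<!\$)\b\d+", AMOUNTS)

print(usd, not_usd, dollars, plain)
```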

NON-ATOMIC LOOKAROUND ASSERTIONS

       These assertions are specific to PCRE2 and are not Perl-compatible.

         (*napla:...)
         (*non_atomic_positive_lookahead:...)

         (*naplb:...)
         (*non_atomic_positive_lookbehind:...)

SCRIPT RUNS

         (*script_run:...)           ) script run, can be backtracked into
         (*sr:...)                   )

         (*atomic_script_run:...)    ) atomic script run
         (*asr:...)                  )

BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g+n            relative reference by number (PCRE2 extension)
         \g-n            relative reference by number
         \g{+n}          relative reference by number (PCRE2 extension)
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)
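
       A backreference matches the same text that the referenced group
       captured. The sketch below shows the \n and (?P=name) forms, the
       two from this table that Python's re module also accepts (the
       module is an assumed stand-in for PCRE2). The patterns and the
       group name word are made up for the example.

```python
import re

# (?P=word) must match exactly what the group named "word" captured.
DOUBLED = re.compile(r"\b(?P<word>\w+) (?P=word)\b")

# \1 refers to group 1, so the closing quote must equal the opening one.
QUOTED = re.compile(r"""(["']).*?\1""")

doubled_word = DOUBLED.search("it is is fine").group("word")
quoted_text = QUOTED.search('say "hi" now').group()
mismatched = QUOTED.search("say \"hi' now")

print(doubled_word, quoted_text, mismatched)
```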

SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subroutine by absolute number
         (?+n)           call subroutine by relative number
         (?-n)           call subroutine by relative number
         (?&name)        call subroutine by name (Perl)
         (?P>name)       call subroutine by name (Python)
         \g<name>        call subroutine by name (Oniguruma)
         \g'name'        call subroutine by name (Oniguruma)
         \g<n>           call subroutine by absolute number (Oniguruma)
         \g'n'           call subroutine by absolute number (Oniguruma)
         \g<+n>          call subroutine by relative number (PCRE2 extension)
         \g'+n'          call subroutine by relative number (PCRE2 extension)
         \g<-n>          call subroutine by relative number (PCRE2 extension)
         \g'-n'          call subroutine by relative number (PCRE2 extension)

CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)               absolute reference condition
         (?(+n)              relative reference condition
         (?(-n)              relative reference condition
         (?(<name>)          named reference condition (Perl)
         (?('name')          named reference condition (Perl)
         (?(name)            named reference condition (PCRE2, deprecated)
         (?(R)               overall recursion condition
         (?(Rn)              specific numbered group recursion condition
         (?(R&name)          specific named group recursion condition
         (?(DEFINE)          define groups for reference
         (?(VERSION[>]=n.m)  test PCRE2 version
         (?(assert)          assertion condition

       Note the ambiguity of (?(R) and (?(Rn) which might be named
       reference conditions or recursion tests. Such a condition is
       interpreted as a reference condition if the relevant named group
       exists.
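
       The absolute reference condition (?(n)...) is the simplest form to
       try out. The sketch below uses Python's re module as a stand-in for
       PCRE2 (an assumption; Python supports only the numbered and bare
       name forms). The pattern requires a closing parenthesis only when
       an opening one was captured by group 1.

```python
import re

# Group 1 optionally captures "("; the condition (?(1)\)) then demands
# a matching ")" only if group 1 took part in the match.
PAREN_NUM = re.compile(r"(\()?\d+(?(1)\))")


def accepts(text):
    """Return True when the whole text is a number, optionally in parens."""
    return PAREN_NUM.fullmatch(text) is not None


for sample in ["(12)", "12", "(12", "12)"]:
    print(sample, accepts(sample))
```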

BACKTRACKING CONTROL

       All backtracking control verbs may be in the form (*VERB:NAME). For
       (*MARK) the name is mandatory, for the others it is optional.
       (*SKIP) changes its behaviour if :NAME is present. The others just
       set a name for passing back to the caller, but this is not a name
       that (*SKIP) can see. The following act as soon as they are
       reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)
         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

       The following act only when a subsequent match failure causes a
       backtrack to reach them. They all force a match failure, but they
       differ in what happens afterwards. Those that advance the
       start-of-match point do so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*SKIP)         advance to current matching position
         (*SKIP:NAME)    advance to position corresponding to an earlier
                         (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)         local failure, backtrack to next alternation

       The effect of one of these verbs in a group called as a subroutine
       is confined to the subroutine call.

CALLOUTS

         (?C)            callout (assumed number 0)
         (?Cn)           callout with numerical data n
         (?C"text")      callout with string data

       The allowed string delimiters are ` ' " ^ % # $ (which are the same
       for the start and the end), and the starting delimiter { matched
       with the ending delimiter }. To encode the ending delimiter within
       the string, double it.

SEE ALSO

       pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
       pcre2(3).

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION

       Last updated: 29 July 2019
       Copyright (c) 1997-2019 University of Cambridge.

PCRE2 10.34                      29 July 2019                   PCRE2SYNTAX(3)