PCRE2SYNTAX(3)            Library Functions Manual            PCRE2SYNTAX(3)

PCRE2 - Perl-compatible regular expressions (revised API)

The full syntax and semantics of the regular expressions that are
supported by PCRE2 are described in the pcre2pattern documentation. This
document contains a quick-reference summary of the syntax.

  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal

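As an illustration of \Q...\E, a caller that needs to embed arbitrary
text in a pattern could wrap it as sketched below. This is a minimal
Python sketch, not part of PCRE2: quote_literal is a hypothetical helper
name, and it assumes that the only sequence needing special care inside
\Q...\E is \E itself.

```python
def quote_literal(text):
    r"""Wrap text in \Q...\E so that a pattern treats it literally.

    An embedded \E would end the quoted span early, so each one is
    replaced by: \E (close), \\E (literal backslash, then E), \Q (reopen).
    """
    return r"\Q" + text.replace(r"\E", r"\E\\E\Q") + r"\E"


print(quote_literal("1+1=2?"))
```

The result is only a pattern string; it still has to be passed to the
regular expression compiler by the application.
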
This table applies to ASCII and Unicode environments. An unrecognized
escape sequence causes an error.

  \a          alarm, that is, the BEL character (hex 07)
  \cx         "control-x", where x is any ASCII printing character
  \e          escape (hex 1B)
  \f          form feed (hex 0C)
  \n          newline (hex 0A)
  \r          carriage return (hex 0D)
  \t          tab (hex 09)
  \0dd        character with octal code 0dd
  \ddd        character with octal code ddd, or backreference
  \o{ddd..}   character with octal code ddd..
  \N{U+hh..}  character with Unicode code point hh.. (Unicode mode only)
  \xhh        character with hex code hh
  \x{hh..}    character with hex code hh..

If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
following are also recognized:

  \U          the character "U"
  \uhhhh      character with hex code hhhh
  \u{hh..}    character with hex code hh.. but only for EXTRA_ALT_BSUX

When \x is not followed by {, from zero to two hexadecimal digits are
read, but in ALT_BSUX mode \x must be followed by two hexadecimal digits
to be recognized as a hexadecimal escape; otherwise it matches a literal
"x". Likewise, if \u (in ALT_BSUX mode) is not followed by four
hexadecimal digits or (in EXTRA_ALT_BSUX mode) a sequence of hex digits
in curly brackets, it matches a literal "u".

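The \x rule above can be sketched in a few lines of Python. This is an
illustration of the rule as stated, not PCRE2's own code:
decode_backslash_x is a hypothetical name, the function handles only the
case where \x is not followed by {, and it assumes that zero digits in
default mode yields the character with code 0.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def decode_backslash_x(rest, alt_bsux=False):
    """Interpret the text following a backslash-x (no opening brace).

    Returns a pair: (resulting character, number of digits consumed).
    """
    # Count up to two leading hexadecimal digits.
    n = 0
    while n < 2 and n < len(rest) and rest[n] in HEX_DIGITS:
        n += 1
    if alt_bsux:
        # ALT_BSUX mode: exactly two digits are required,
        # otherwise the escape is just a literal "x".
        if n == 2:
            return chr(int(rest[:2], 16)), 2
        return "x", 0
    # Default mode: zero to two digits are read
    # (assumption: zero digits gives code point 0).
    return chr(int(rest[:n], 16) if n else 0), n


print(decode_backslash_x("41"))        # two digits in default mode
print(decode_backslash_x("4g", True))  # too few digits in ALT_BSUX mode
```
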
Note that \0dd is always an octal code. The treatment of backslash
followed by a non-zero digit is complicated; for details see the section
"Non-printing characters" in the pcre2pattern documentation, where
details of escape processing in EBCDIC environments are also given.
\N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
EBCDIC environments. Note that \N not followed by an opening curly
bracket has a different meaning (see below).

  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one code unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{xx}     a character with the xx property
  \P{xx}     a character without the xx property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster

\C is dangerous because it may leave the current matching point in the
middle of a UTF-8 or UTF-16 character. The application can lock out the
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
possible to build PCRE2 with the use of \C permanently disabled.

By default, \d, \s, and \w match only ASCII characters, even in UTF-8
mode or in the 16-bit and 32-bit libraries. However, if locale-specific
matching is happening, \s and \w may also match characters with code
points in the range 128-255. If the PCRE2_UCP option is set, the
behaviour of these escape sequences is changed to use Unicode properties
and they match many more characters.

Property descriptions in \p and \P are matched caselessly; hyphens,
underscores, and white space are ignored, in accordance with Unicode's
"loose matching" rules.

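The loose matching described above amounts to comparing names after a
simple normalization. The Python sketch below shows one way to express
that; normalize_property_name is a hypothetical helper, and PCRE2's
internal lookup may be implemented differently.

```python
def normalize_property_name(name):
    """Fold case and drop hyphens, underscores, and white space."""
    return "".join(
        ch for ch in name.lower()
        if ch not in "-_" and not ch.isspace()
    )


# Under this normalization, these spellings all compare equal.
spellings = ["Bidi_Class", "bidi class", "BIDI-CLASS", "bidiclass"]
print({normalize_property_name(s) for s in spellings})
```
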
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  Lc         Ll, Lu, or Lt
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator

  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore

Perl and POSIX space are now the same. Perl added VT to its space
character set at release 5.18.

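The Xan, Xsp/Xps, and Xwd definitions above can be restated as predicates
over Unicode general categories. The Python sketch below uses the
standard unicodedata module for the categories; the function names are
hypothetical, and results may differ from PCRE2 for characters where the
two use different Unicode versions.

```python
import unicodedata


def is_xan(ch):
    """Alphanumeric: union of properties L and N."""
    return unicodedata.category(ch)[0] in "LN"


def is_xsp(ch):
    """Perl space (same as Xps): property Z or tab, NL, VT, FF, CR."""
    return unicodedata.category(ch)[0] == "Z" or ch in "\t\n\x0b\x0c\r"


def is_xwd(ch):
    """Perl word: property Xan or underscore."""
    return is_xan(ch) or ch == "_"


for ch in ["a", "7", "_", "\x0b", "-"]:
    print(repr(ch), is_xan(ch), is_xsp(ch), is_xwd(ch))
```
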
Unicode defines a number of binary properties, that is, properties whose
only values are true or false. You can obtain a list of those that are
recognized by \p and \P, along with their abbreviations, by running this
command:

  pcre2test -LP

Many script names and their 4-letter abbreviations are recognized in
\p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P of
course). You can obtain a list of these scripts by running this command:

  pcre2test -LS

  \p{Bidi_Class:<class>}   matches a character with the given class
  \p{BC:<class>}           matches a character with the given class

The recognized classes are:

  AL         Arabic letter
  AN         Arabic number
  B          paragraph separator
  BN         boundary neutral
  CS         common separator
  EN         European number
  ES         European separator
  ET         European terminator
  FSI        first strong isolate
  L          left-to-right
  LRE        left-to-right embedding
  LRI        left-to-right isolate
  LRO        left-to-right override
  NSM        non-spacing mark
  ON         other neutral
  PDF        pop directional format
  PDI        pop directional isolate
  R          right-to-left
  RLE        right-to-left embedding
  RLI        right-to-left isolate
  RLO        right-to-left override
  S          segment separator
  WS         white space

  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit

In PCRE2, POSIX character set names recognize only ASCII characters by
default, but some of them use Unicode properties if PCRE2_UCP is set.
You can use \Q...\E inside a character class.

  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy

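The difference between greedy and lazy quantifiers is easiest to see on a
small subject string. The sketch below uses Python's re module, which is
a different engine from PCRE2 but shares this part of the syntax.
Possessive forms are left out because Python supports them only from
version 3.11.

```python
import re

subject = "<a><b>"

# + is greedy: it takes as much as it can and still lets the match succeed.
greedy = re.match(r"<.+>", subject).group()

# +? is lazy: it takes as little as it can.
lazy = re.match(r"<.+?>", subject).group()

# {n,m} is greedy up to the upper bound m.
bounded = re.match(r"a{2,3}", "aaaa").group()

print(greedy, lazy, bounded)
```
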
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
                also after an internal newline in multiline mode
                (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
  \A          start of subject
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \Z          end of subject
                also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject

  \K          set reported start of match

From release 10.38 \K is not permitted by default in lookaround
assertions, for compatibility with Perl. However, if the
PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option is set, the previous behaviour is
re-enabled. When this option is set, \K is honoured in positive
assertions, but ignored in negative ones.

  expr|expr|expr...

  (...)           capture group
  (?<name>...)    named capture group (Perl)
  (?'name'...)    named capture group (Perl)
  (?P<name>...)   named capture group (Python)
  (?:...)         non-capture group
  (?|...)         non-capture group; reset group numbers for
                    capture groups in each alternative

In non-UTF modes, names may contain underscores and ASCII letters and
digits; in UTF modes, any Unicode letters and Unicode decimal digits are
permitted. In both cases, a name must not start with a digit.

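The naming rule for non-UTF modes can be checked before a pattern is
built. The Python sketch below is illustrative only:
is_valid_group_name is a hypothetical helper, it covers just the non-UTF
rule stated above, and it does not check other limits such as maximum
name length. The second part uses the (?P<name>...) spelling, which
Python's re module shares with PCRE2.

```python
import re

# Underscores, ASCII letters and digits; must not start with a digit.
_GROUP_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_group_name(name):
    return _GROUP_NAME.fullmatch(name) is not None


# Named groups are retrieved by name after a successful match.
m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2022-01")
print(m.group("year"), m.group("month"))
```
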
  (?>...)         atomic non-capture group
  (*atomic:...)   atomic non-capture group

  (?#....)        comment (not nestable)

Changes of these options within a group are automatically cancelled at
the end of the group.

  (?i)        caseless
  (?J)        allow duplicate named groups
  (?m)        multiline
  (?n)        no auto capture
  (?s)        single line (dotall)
  (?U)        default ungreedy (lazy)
  (?x)        extended: ignore white space except in classes
  (?xx)       as (?x) but also ignore space and tab in classes
  (?-...)     unset option(s)
  (?^)        unset imnsx options

Unsetting x or xx unsets both. Several options may be set at once, and a
mixture of setting and unsetting such as (?i-x) is allowed, but there may
be only one hyphen. Setting (but not unsetting) is allowed after (?^, for
example (?^in). An option setting may appear at the start of a
non-capture group, for example (?i:...).

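The scoping of an option set at the start of a non-capture group can be
seen in the following sketch. It uses Python's re module, which accepts
the same (?i:...) form; other items in the table above, such as (?J),
(?n), (?U), (?xx), and (?^), are PCRE2 syntax that Python does not
provide.

```python
import re

# The caseless option applies to "abc" only; "def" stays case-sensitive.
scoped = re.compile(r"(?i:abc)def")

print(bool(scoped.fullmatch("ABCdef")))  # inside the group: case ignored
print(bool(scoped.fullmatch("ABCDEF")))  # outside the group: case matters
```
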
The following are recognized only at the very start of a pattern or after
one of the newline or \R options with similar syntax. More than one of
them may appear. For the first three, d is a decimal number.

  (*LIMIT_DEPTH=d)      set the backtracking limit to d
  (*LIMIT_HEAP=d)       set the heap size limit to d * 1024 bytes
  (*LIMIT_MATCH=d)      set the match limit to d
  (*NOTEMPTY)           set PCRE2_NOTEMPTY when matching
  (*NOTEMPTY_ATSTART)   set PCRE2_NOTEMPTY_ATSTART when matching
  (*NO_AUTO_POSSESS)    no auto-possessification (PCRE2_NO_AUTO_POSSESS)
  (*NO_DOTSTAR_ANCHOR)  no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
  (*NO_JIT)             disable JIT optimization
  (*NO_START_OPT)       no start-match optimization (PCRE2_NO_START_OPTIMIZE)
  (*UTF)                set appropriate UTF mode for the library in use
  (*UCP)                set PCRE2_UCP (use Unicode properties for \d etc)

Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
value of the limits set by the caller of pcre2_match() or
pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
respectively, at compile time.

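The "reduce only" rule for the three limit items can be modelled as
taking the minimum of the caller's value and the value in the pattern.
The Python sketch below is a simplified model, not PCRE2's parser:
effective_limits and the dictionary keys are hypothetical names, heap
values are assumed to be in units of 1024 bytes on both sides, and every
leading (*NAME) item is skipped over without being validated.

```python
import re

_ITEM = re.compile(r"\(\*([A-Z_]+)(?:=(\d+))?\)")
_LIMIT_KEYS = {
    "LIMIT_DEPTH": "depth",
    "LIMIT_HEAP": "heap",
    "LIMIT_MATCH": "match",
}


def effective_limits(pattern, caller_limits):
    """Apply leading (*LIMIT_...=d) items to a dict of caller limits."""
    limits = dict(caller_limits)
    pos = 0
    while True:
        m = _ITEM.match(pattern, pos)
        if m is None:
            break  # items count only at the very start of the pattern
        name, value = m.group(1), m.group(2)
        if name == "LIMIT_RECURSION":
            name = "LIMIT_DEPTH"  # obsolete synonym
        if name in _LIMIT_KEYS and value is not None:
            key = _LIMIT_KEYS[name]
            # The pattern can lower a limit but never raise it.
            limits[key] = min(limits[key], int(value))
        pos = m.end()
    return limits


caller = {"match": 500, "depth": 100, "heap": 2000}
print(effective_limits("(*LIMIT_MATCH=1000)(*UTF)(*LIMIT_DEPTH=50)abc", caller))
```

In this example the match limit stays at the caller's 500 because the
pattern asks for more, while the depth limit drops to 50.
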
These are recognized only at the very start of the pattern or after
option settings with a similar syntax.

  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
  (*NUL)          the NUL character (binary zero)

These are recognized only at the very start of the pattern or after
option settings with a similar syntax.

  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence

  (?=...)                     )
  (*pla:...)                  ) positive lookahead
  (*positive_lookahead:...)   )

  (?!...)                     )
  (*nla:...)                  ) negative lookahead
  (*negative_lookahead:...)   )

  (?<=...)                    )
  (*plb:...)                  ) positive lookbehind
  (*positive_lookbehind:...)  )

  (?<!...)                    )
  (*nlb:...)                  ) negative lookbehind
  (*negative_lookbehind:...)  )

Each top-level branch of a lookbehind must be of a fixed length.

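A short example may help to show that lookaround assertions test the
surrounding text without consuming it. The sketch uses Python's re
module, which shares the (?=...), (?!...), and (?<=...) forms with PCRE2;
the alphabetic synonyms such as (*pla:...) are PCRE2 syntax and are not
accepted by Python.

```python
import re

text = "price: $42, discount: 15%"

# Positive lookbehind: digits preceded by "$"; the "$" is not part of
# the match. The lookbehind has a fixed length of one character.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Positive lookahead: digits followed by "%".
before_percent = re.findall(r"\d+(?=%)", text)

# Negative lookahead: whole numbers not followed by "%".
not_percent = re.findall(r"\b\d+\b(?!%)", text)

print(after_dollar, before_percent, not_percent)
```
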
These assertions are specific to PCRE2 and are not Perl-compatible.

  (?*...)                                )
  (*napla:...)                           ) synonyms
  (*non_atomic_positive_lookahead:...)   )

  (?<*...)                               )
  (*naplb:...)                           ) synonyms
  (*non_atomic_positive_lookbehind:...)  )

  (*script_run:...)           ) script run, can be backtracked into
  (*sr:...)                   )

  (*atomic_script_run:...)    ) atomic script run
  (*asr:...)                  )

  \n          reference by number (can be ambiguous)
  \gn         reference by number
  \g{n}       reference by number
  \g+n        relative reference by number (PCRE2 extension)
  \g-n        relative reference by number
  \g{+n}      relative reference by number (PCRE2 extension)
  \g{-n}      relative reference by number
  \k<name>    reference by name (Perl)
  \k'name'    reference by name (Perl)
  \g{name}    reference by name (Perl)
  \k{name}    reference by name (.NET)
  (?P=name)   reference by name (Python)

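Two of the forms above, \n and (?P=name), are also accepted by Python's
re module, so they can be tried without PCRE2. The sketch below is
limited to those two; the \g and \k forms listed in the table are not
available in Python patterns.

```python
import re

# \1 must match the same text that group 1 matched: a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best").group(1)

# (?P=q) must match the same quote character that opened the string.
quoted = [
    m.group(0)
    for m in re.finditer(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' and \"bye\"")
]

print(doubled, quoted)
```
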
  (?R)        recurse whole pattern
  (?n)        call subroutine by absolute number
  (?+n)       call subroutine by relative number
  (?-n)       call subroutine by relative number
  (?&name)    call subroutine by name (Perl)
  (?P>name)   call subroutine by name (Python)
  \g<name>    call subroutine by name (Oniguruma)
  \g'name'    call subroutine by name (Oniguruma)
  \g<n>       call subroutine by absolute number (Oniguruma)
  \g'n'       call subroutine by absolute number (Oniguruma)
  \g<+n>      call subroutine by relative number (PCRE2 extension)
  \g'+n'      call subroutine by relative number (PCRE2 extension)
  \g<-n>      call subroutine by relative number (PCRE2 extension)
  \g'-n'      call subroutine by relative number (PCRE2 extension)

  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)               absolute reference condition
  (?(+n)              relative reference condition
  (?(-n)              relative reference condition
  (?(<name>)          named reference condition (Perl)
  (?('name')          named reference condition (Perl)
  (?(name)            named reference condition (PCRE2, deprecated)
  (?(R)               overall recursion condition
  (?(Rn)              specific numbered group recursion condition
  (?(R&name)          specific named group recursion condition
  (?(DEFINE)          define groups for reference
  (?(VERSION[>]=n.m)  test PCRE2 version
  (?(assert)          assertion condition

Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a
reference condition if the relevant named group exists.

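The absolute reference condition (?(n)...) is the simplest of these to
demonstrate. The sketch below uses Python's re module, which supports
that one form with the same meaning; the recursion, DEFINE, VERSION, and
assertion conditions are not available there.

```python
import re

# Group 1 optionally matches "("; the condition (?(1)\)) then requires
# a closing ")" only if group 1 took part in the match.
balanced = re.compile(r"(\()?\d+(?(1)\))")

for subject in ["12", "(12)", "(12", "12)"]:
    print(subject, bool(balanced.fullmatch(subject)))
```
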
All backtracking control verbs may be in the form (*VERB:NAME). For
(*MARK) the name is mandatory, for the others it is optional. (*SKIP)
changes its behaviour if :NAME is present. The others just set a name for
passing back to the caller, but this is not a name that (*SKIP) can see.
The following act as soon as they are reached:

  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

The following act only when a subsequent match failure causes a backtrack
to reach them. They all force a match failure, but they differ in what
happens afterwards. Those that advance the start-of-match point do so
only if the pattern is not anchored.

  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                    (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation

The effect of one of these verbs in a group called as a subroutine is
confined to the subroutine call.

  (?C)          callout (assumed number 0)
  (?Cn)         callout with numerical data n
  (?C"text")    callout with string data

The allowed string delimiters are ` ' " ^ % # $ (which are the same for
the start and the end), and the starting delimiter { matched with the
ending delimiter }. To encode the ending delimiter within the string,
double it.

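The delimiter rule above can be applied mechanically when a pattern is
generated by a program. The Python sketch below builds a callout item
with string data; callout_with_string is a hypothetical helper, and it
assumes that doubling the ending delimiter is the only encoding needed.

```python
_SAME_AT_BOTH_ENDS = "`'\"^%#$"


def callout_with_string(text, delimiter='"'):
    """Build a (?C...) item with string data, doubling the end delimiter."""
    if delimiter == "{":
        end = "}"  # the one delimiter whose start and end differ
    elif len(delimiter) == 1 and delimiter in _SAME_AT_BOTH_ENDS:
        end = delimiter
    else:
        raise ValueError("unsupported callout string delimiter")
    return "(?C" + delimiter + text.replace(end, end + end) + end + ")"


print(callout_with_string('say "hi"'))
print(callout_with_string("a}b", "{"))
```

Choosing a delimiter that does not occur in the text avoids the doubling
altogether and keeps the pattern easier to read.
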
pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
pcre2(3).

Philip Hazel
Retired from University Computing Service
Cambridge, England.

Last updated: 12 January 2022
Copyright (c) 1997-2022 University of Cambridge.

PCRE2 10.40                   12 January 2022                   PCRE2SYNTAX(3)