pcre2syntax(3)

PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)

NAME

       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY

       The full syntax and semantics of the regular expressions that are
       supported by PCRE2 are described in the pcre2pattern documentation.
       This document contains a quick-reference summary of the syntax.

QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal

ESCAPED CHARACTERS

       This table applies to ASCII and Unicode environments. An unrecognized
       escape sequence causes an error.

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is any ASCII printing character
         \e         escape (hex 1B)
         \f         form feed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \0dd       character with octal code 0dd
         \ddd       character with octal code ddd, or backreference
         \o{ddd..}  character with octal code ddd..
         \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
         \xhh       character with hex code hh
         \x{hh..}   character with hex code hh..

       If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
       following are also recognized:

         \U         the character "U"
         \uhhhh     character with hex code hhhh
         \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX

       When \x is not followed by {, from zero to two hexadecimal digits are
       read, but in ALT_BSUX mode \x must be followed by two hexadecimal
       digits to be recognized as a hexadecimal escape; otherwise it matches a
       literal "x". Likewise, if \u (in ALT_BSUX mode) is not followed by four
       hexadecimal digits or (in EXTRA_ALT_BSUX mode) a sequence of hex digits
       in curly brackets, it matches a literal "u".

       Note that \0dd is always an octal code. The treatment of backslash
       followed by a non-zero digit is complicated; for details see the
       section "Non-printing characters" in the pcre2pattern documentation,
       where details of escape processing in EBCDIC environments are also
       given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
       supported in EBCDIC environments. Note that \N not followed by an
       opening curly bracket has a different meaning (see CHARACTER TYPES
       below).
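
       As an illustration only, the rule for \x without a following { can be
       sketched in Python. The function name decode_backslash_x and its
       return convention are hypothetical and are not part of PCRE2. The
       value 0 for the zero-digit case in default mode is an assumption: the
       text above says only that from zero to two digits are read.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def decode_backslash_x(rest, alt_bsux=False):
    """Return (code_point, digits_consumed) for a backslash-x escape.

    `rest` is the pattern text that follows the backslash and the "x";
    it is assumed not to start with "{". Hypothetical helper.
    """
    # Count at most two leading hexadecimal digits.
    n = 0
    while n < 2 and n < len(rest) and rest[n] in HEX_DIGITS:
        n += 1

    if alt_bsux:
        # ALT_BSUX mode: exactly two digits are required; otherwise the
        # escape is a literal "x" and no further characters are consumed.
        if n == 2:
            return int(rest[:2], 16), 2
        return ord("x"), 0

    # Default mode: zero, one, or two digits are read.
    # Assumption: with zero digits the resulting code is 0.
    return (int(rest[:n], 16) if n else 0), n


print(decode_backslash_x("41z"))                 # two digits in default mode
print(decode_backslash_x("4z", alt_bsux=True))   # falls back to literal "x"
```

       The same shape of check would apply to \u in ALT_BSUX mode, with four
       digits required instead of two.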

CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one code unit, even in UTF mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal white space character
         \H         a character that is not a horizontal white space character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a white space character
         \S         a character that is not a white space character
         \v         a vertical white space character
         \V         a character that is not a vertical white space character
         \w         a "word" character
         \W         a "non-word" character
         \X         a Unicode extended grapheme cluster

       \C is dangerous because it may leave the current matching point in the
       middle of a UTF-8 or UTF-16 character. The application can lock out the
       use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
       possible to build PCRE2 with the use of \C permanently disabled.

       By default, \d, \s, and \w match only ASCII characters, even in UTF-8
       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
       matching is happening, \s and \w may also match characters with code
       points in the range 128-255. If the PCRE2_UCP option is set, the
       behaviour of these escape sequences is changed to use Unicode
       properties and they match many more characters.

       Property descriptions in \p and \P are matched caselessly; hyphens,
       underscores, and white space are ignored, in accordance with Unicode's
       "loose matching" rules.
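
       The loose matching of property names described above can be pictured
       as a normalization step applied before names are compared. The
       following Python sketch is illustrative only; loose_key is a
       hypothetical name, and PCRE2's internal implementation may differ.

```python
def loose_key(name):
    """Fold a property name for comparison (hypothetical helper).

    Case is ignored, and hyphens, underscores, and white space are dropped.
    """
    kept = (ch for ch in name if ch not in "-_" and not ch.isspace())
    return "".join(kept).lower()


# Spellings that differ only in case, hyphens, underscores, or spaces
# fold to the same key.
print(loose_key("Bidi_Class") == loose_key("bidi class"))
print(loose_key("Old-Italic") == loose_key("OLD_ITALIC"))
```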

GENERAL CATEGORY PROPERTIES FOR \p and \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         Lc         Ll, Lu, or Lt
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator

PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
                      represented by a Universal Character Name
         Xwd        Perl word: property Xan or underscore

       Perl and POSIX space are now the same. Perl added VT to its space
       character set at release 5.18.

BINARY PROPERTIES FOR \p AND \P

       Unicode defines a number of binary properties, that is, properties
       whose only values are true or false. You can obtain a list of those
       that are recognized by \p and \P, along with their abbreviations, by
       running this command:

         pcre2test -LP

SCRIPT MATCHING WITH \p AND \P

       Many script names and their 4-letter abbreviations are recognized in
       \p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P
       of course). You can obtain a list of these scripts by running this
       command:

         pcre2test -LS

THE BIDI_CLASS PROPERTY FOR \p AND \P

         \p{Bidi_Class:<class>}   matches a character with the given class
         \p{BC:<class>}           matches a character with the given class

       The recognized classes are:

         AL          Arabic letter
         AN          Arabic number
         B           paragraph separator
         BN          boundary neutral
         CS          common separator
         EN          European number
         ES          European separator
         ET          European terminator
         FSI         first strong isolate
         L           left-to-right
         LRE         left-to-right embedding
         LRI         left-to-right isolate
         LRO         left-to-right override
         NSM         non-spacing mark
         ON          other neutral
         PDF         pop directional format
         PDI         pop directional isolate
         R           right-to-left
         RLE         right-to-left embedding
         RLI         right-to-left isolate
         RLO         right-to-left override
         S           segment separator
         WS          white space

CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE2, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE2_UCP is set.
       You can use \Q...\E inside a character class.

QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy
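
       The difference between greedy and lazy quantifiers is easiest to see
       on a small subject. The sketch below uses Python's standard re module,
       which is not PCRE2; it is used here only on the assumption that its
       greedy and lazy quantifiers behave as in the table above. Possessive
       forms are left out because older Python versions do not accept them.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can while still allowing a match.
greedy = re.match(r"<.+>", subject).group(0)

# Lazy: .+? takes as little as it can while still allowing a match.
lazy = re.match(r"<.+?>", subject).group(0)

# The same distinction applies to bounded repeats.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group(0)
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group(0)

print(greedy, lazy, bounded_greedy, bounded_lazy)
```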

ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                       also after an internal newline in multiline mode
                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
         \A          start of subject
         $           end of subject
                       also before newline at end of subject
                       also before internal newline in multiline mode
         \Z          end of subject
                       also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject

REPORTED MATCH POINT SETTING

         \K          set reported start of match

       From release 10.38 \K is not permitted by default in lookaround
       assertions, for compatibility with Perl. However, if the
       PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option is set, the previous behaviour
       is re-enabled. When this option is set, \K is honoured in positive
       assertions, but ignored in negative ones.

ALTERNATION

         expr|expr|expr...

CAPTURING

         (...)           capture group
         (?<name>...)    named capture group (Perl)
         (?'name'...)    named capture group (Perl)
         (?P<name>...)   named capture group (Python)
         (?:...)         non-capture group
         (?|...)         non-capture group; reset group numbers for
                          capture groups in each alternative

       In non-UTF modes, names may contain underscores and ASCII letters and
       digits; in UTF modes, any Unicode letters and Unicode decimal digits
       are permitted. In both cases, a name must not start with a digit.
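
       The naming rule for non-UTF modes can be written as a small check. The
       Python sketch below is illustrative only: is_valid_group_name is a
       hypothetical helper, not a PCRE2 function, and rejecting the empty
       name is an assumption that the text above does not state. Any other
       restrictions that PCRE2 may apply to names are not modelled.

```python
import string

# Characters allowed in a group name in non-UTF modes, per the rule above.
FIRST_CHARS = set(string.ascii_letters + "_")
OTHER_CHARS = FIRST_CHARS | set(string.digits)


def is_valid_group_name(name):
    """Check a capture group name against the non-UTF rule (hypothetical)."""
    if not name:
        return False  # assumption: an empty name is not accepted
    # The first character must not be a digit; the rest may be.
    return name[0] in FIRST_CHARS and all(ch in OTHER_CHARS for ch in name)


print(is_valid_group_name("year_1"))  # letters, underscore, trailing digit
print(is_valid_group_name("1st"))     # starts with a digit
```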

ATOMIC GROUPS

         (?>...)         atomic non-capture group
         (*atomic:...)   atomic non-capture group

COMMENT

         (?#....)        comment (not nestable)

OPTION SETTING

       Changes of these options within a group are automatically cancelled at
       the end of the group.

         (?i)            caseless
         (?J)            allow duplicate named groups
         (?m)            multiline
         (?n)            no auto capture
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            extended: ignore white space except in classes
         (?xx)           as (?x) but also ignore space and tab in classes
         (?-...)         unset option(s)
         (?^)            unset imnsx options

       Unsetting x or xx unsets both. Several options may be set at once, and
       a mixture of setting and unsetting such as (?i-x) is allowed, but there
       may be only one hyphen. Setting (but not unsetting) is allowed after
       (?^, for example (?^in). An option setting may appear at the start of a
       non-capture group, for example (?i:...).

       The following are recognized only at the very start of a pattern or
       after one of the newline or \R options with similar syntax. More than
       one of them may appear. For the first three, d is a decimal number.

         (*LIMIT_DEPTH=d)      set the backtracking limit to d
         (*LIMIT_HEAP=d)       set the heap size limit to d * 1024 bytes
         (*LIMIT_MATCH=d)      set the match limit to d
         (*NOTEMPTY)           set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART)   set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS)    no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR)  no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)             disable JIT optimization
         (*NO_START_OPT)       no start-match optimization
                                 (PCRE2_NO_START_OPTIMIZE)
         (*UTF)                set appropriate UTF mode for the library in use
         (*UCP)                set PCRE2_UCP (use Unicode properties for \d etc)

       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
       value of the limits set by the caller of pcre2_match() or
       pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
       synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
       and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
       respectively, at compile time.
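
       The "can only reduce" rule for the three limits amounts to taking the
       smaller of two values. The Python sketch below models that rule only;
       effective_limit is a hypothetical name and does not correspond to any
       PCRE2 function.

```python
def effective_limit(caller_limit, pattern_limit=None):
    """Combine a caller's limit with an optional in-pattern limit.

    Hypothetical model: a setting such as (*LIMIT_MATCH=d) can lower the
    caller's value but can never raise it.
    """
    if pattern_limit is None:
        return caller_limit  # no in-pattern setting present
    return min(caller_limit, pattern_limit)


print(effective_limit(10000, 500))    # the pattern lowers the limit
print(effective_limit(10000, 50000))  # an attempt to raise it has no effect
```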

NEWLINE CONVENTION

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence
         (*NUL)          the NUL character (binary zero)

WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence

LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)                     )
         (*pla:...)                  ) positive lookahead
         (*positive_lookahead:...)   )

         (?!...)                     )
         (*nla:...)                  ) negative lookahead
         (*negative_lookahead:...)   )

         (?<=...)                    )
         (*plb:...)                  ) positive lookbehind
         (*positive_lookbehind:...)  )

         (?<!...)                    )
         (*nlb:...)                  ) negative lookbehind
         (*negative_lookbehind:...)  )

       Each top-level branch of a lookbehind must be of a fixed length.
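
       Lookaround assertions test the text around the current position
       without consuming it. The sketch below uses Python's standard re
       module, which is not PCRE2 and accepts only the short (?=...) and
       (?<=...) forms, on the assumption that these behave the same way.
       Python's rule for lookbehind length is not identical to the one stated
       above, so the example keeps the lookbehind to a single character.

```python
import re

subject = "pay $15. not $20 or 30."

# Match digits that are preceded by "$" and followed by ".".
# Neither the "$" nor the "." becomes part of the matched text.
amounts = re.findall(r"(?<=\$)\d+(?=\.)", subject)

print(amounts)
```

       Only the first number qualifies: the second is not followed by a dot,
       and the third is not preceded by a dollar sign.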

NON-ATOMIC LOOKAROUND ASSERTIONS

       These assertions are specific to PCRE2 and are not Perl-compatible.

         (?*...)                                )
         (*napla:...)                           ) synonyms
         (*non_atomic_positive_lookahead:...)   )

         (?<*...)                               )
         (*naplb:...)                           ) synonyms
         (*non_atomic_positive_lookbehind:...)  )

SCRIPT RUNS

         (*script_run:...)           ) script run, can be backtracked into
         (*sr:...)                   )

         (*atomic_script_run:...)    ) atomic script run
         (*asr:...)                  )

BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g+n            relative reference by number (PCRE2 extension)
         \g-n            relative reference by number
         \g{+n}          relative reference by number (PCRE2 extension)
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)
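
       A backreference matches the same text that the referenced group
       matched earlier. The sketch below uses Python's standard re module,
       which is not PCRE2 and supports only the \n and (?P=name) forms from
       the table above; it is shown on the assumption that those two forms
       behave the same way in both.

```python
import re

subject = "this is is a test"

# By number: \1 must repeat whatever group 1 captured.
by_number = re.search(r"\b(\w+) \1\b", subject)

# By name, using the Python-style syntax.
by_name = re.search(r"\b(?P<word>\w+) (?P=word)\b", subject)

print(by_number.group(0))
print(by_name.group("word"))
```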

SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subroutine by absolute number
         (?+n)           call subroutine by relative number
         (?-n)           call subroutine by relative number
         (?&name)        call subroutine by name (Perl)
         (?P>name)       call subroutine by name (Python)
         \g<name>        call subroutine by name (Oniguruma)
         \g'name'        call subroutine by name (Oniguruma)
         \g<n>           call subroutine by absolute number (Oniguruma)
         \g'n'           call subroutine by absolute number (Oniguruma)
         \g<+n>          call subroutine by relative number (PCRE2 extension)
         \g'+n'          call subroutine by relative number (PCRE2 extension)
         \g<-n>          call subroutine by relative number (PCRE2 extension)
         \g'-n'          call subroutine by relative number (PCRE2 extension)

CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)               absolute reference condition
         (?(+n)              relative reference condition
         (?(-n)              relative reference condition
         (?(<name>)          named reference condition (Perl)
         (?('name')          named reference condition (Perl)
         (?(name)            named reference condition (PCRE2, deprecated)
         (?(R)               overall recursion condition
         (?(Rn)              specific numbered group recursion condition
         (?(R&name)          specific named group recursion condition
         (?(DEFINE)          define groups for reference
         (?(VERSION[>]=n.m)  test PCRE2 version
         (?(assert)          assertion condition

       Note the ambiguity of (?(R) and (?(Rn) which might be named reference
       conditions or recursion tests. Such a condition is interpreted as a
       reference condition if the relevant named group exists.
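
       An absolute reference condition chooses a branch according to whether
       a numbered group has matched. The sketch below uses Python's standard
       re module, which is not PCRE2 and supports only the (?(n)...) and
       (?(name)...) forms; it is shown on the assumption that the (?(n)...)
       form behaves the same way in both.

```python
import re

# Group 1 optionally matches "<". The condition (?(1)>) then requires a
# closing ">" only if group 1 took part in the match.
pattern = re.compile(r"^(<)?\w+(?(1)>)$")

results = {s: bool(pattern.match(s)) for s in ["<tag>", "tag", "<tag", "tag>"]}

for subject, matched in results.items():
    print(subject, matched)
```

       The bracketed and bare forms match; the two unbalanced forms do not.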

BACKTRACKING CONTROL

       All backtracking control verbs may be in the form (*VERB:NAME). For
       (*MARK) the name is mandatory, for the others it is optional. (*SKIP)
       changes its behaviour if :NAME is present. The others just set a name
       for passing back to the caller, but this is not a name that (*SKIP) can
       see. The following act immediately when they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)
         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

       The following act only when a subsequent match failure causes a
       backtrack to reach them. They all force a match failure, but they
       differ in what happens afterwards. Those that advance the
       start-of-match point do so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*SKIP)         advance to current matching position
         (*SKIP:NAME)    advance to position corresponding to an earlier
                         (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)         local failure, backtrack to next alternation

       The effect of one of these verbs in a group called as a subroutine is
       confined to the subroutine call.

CALLOUTS

         (?C)            callout (assumed number 0)
         (?Cn)           callout with numerical data n
         (?C"text")      callout with string data

       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
       the start and the end), and the starting delimiter { matched with the
       ending delimiter }. To encode the ending delimiter within the string,
       double it.
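
       The delimiter rule above can be applied mechanically when building a
       string callout from arbitrary text. The Python sketch below is
       illustrative only; encode_callout is a hypothetical helper and not
       part of PCRE2.

```python
# Delimiters that are the same at the start and the end of the string.
SAME_DELIMITERS = "`'\"^%#$"


def encode_callout(text, delimiter='"'):
    """Build a (?C...) string callout item (hypothetical helper)."""
    if delimiter == "{":
        closing = "}"  # the only delimiter with a different ending
    elif len(delimiter) == 1 and delimiter in SAME_DELIMITERS:
        closing = delimiter
    else:
        raise ValueError("unsupported callout delimiter: %r" % (delimiter,))
    # Double every occurrence of the ending delimiter inside the string.
    body = text.replace(closing, closing * 2)
    return "(?C" + delimiter + body + closing + ")"


print(encode_callout('say "hi"'))
print(encode_callout("a}b", "{"))
```

       Picking a delimiter that does not occur in the text avoids the need
       for doubling altogether.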

AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.

REVISION

       Last updated: 12 January 2022
       Copyright (c) 1997-2022 University of Cambridge.

PCRE2 10.40                     12 January 2022                 PCRE2SYNTAX(3)