PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)

NAME

       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY

       The  full syntax and semantics of the regular expressions that are sup‐
       ported by PCRE2 are described in the pcre2pattern  documentation.  This
       document contains a quick-reference summary of the syntax.

QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal

ESCAPED CHARACTERS

       This  table  applies to ASCII and Unicode environments. An unrecognized
       escape sequence causes an error.

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is any ASCII printing character
         \e         escape (hex 1B)
         \f         form feed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \0dd       character with octal code 0dd
         \ddd       character with octal code ddd, or backreference
         \o{ddd..}  character with octal code ddd..
         \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
         \xhh       character with hex code hh
         \x{hh..}   character with hex code hh..

       If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
       following are also recognized:

         \U         the character "U"
         \uhhhh     character with hex code hhhh
         \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX

       When  \x  is not followed by {, from zero to two hexadecimal digits are
       read, but in ALT_BSUX mode \x must be followed by two hexadecimal  dig‐
       its  to  be  recognized as a hexadecimal escape; otherwise it matches a
       literal "x".  Likewise, if \u (in ALT_BSUX mode)  is  not  followed  by
       four  hexadecimal  digits or (in EXTRA_ALT_BSUX mode) a sequence of hex
       digits in curly brackets, it matches a literal "u".

       Note that \0dd is always an octal code. The treatment of backslash fol‐
       lowed  by  a non-zero digit is complicated; for details see the section
       "Non-printing characters"  in  the  pcre2pattern  documentation,  where
       details  of  escape  processing  in EBCDIC environments are also given.
       \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
       EBCDIC  environments.  Note  that  \N  not followed by an opening curly
       bracket has a different meaning (see below).
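
       The hex and octal escapes above can be tried out with a small  sketch.
       It  assumes  Python's built-in re module as a stand-in, which is not
       PCRE2: it shares \xhh, \0dd, three-digit \ddd and the simple  escapes,
       but  has no \e, \o{...} or \Q...\E, and it insists on exactly two hex
       digits after \x. The variable names are illustrative only.

```python
import re

# Assumption: Python's "re" module stands in for PCRE2 here. It is a
# different engine that happens to share the escapes used below.

# \xhh: character with hex code 41, that is "A".
hex_match = re.fullmatch(r"\x41", "A")

# \ddd with three octal digits is read as a character, not a backreference.
octal_match = re.fullmatch(r"\101", "A")

# \0dd is always an octal code: octal 011 is tab (hex 09).
tab_match = re.fullmatch(r"\011", "\t")

# Simple single-letter escapes: newline, carriage return, tab, form feed, BEL.
simple_match = re.fullmatch(r"\n\r\t\f\a", "\n\r\t\f\x07")

print(all(m is not None for m in (hex_match, octal_match, tab_match, simple_match)))
```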

CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one code unit, even in UTF mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal white space character
         \H         a character that is not a horizontal white space character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a white space character
         \S         a character that is not a white space character
         \v         a vertical white space character
         \V         a character that is not a vertical white space character
         \w         a "word" character
         \W         a "non-word" character
         \X         a Unicode extended grapheme cluster

       \C is dangerous because it may leave the current matching point in  the
       middle of a UTF-8 or UTF-16 character. The application can lock out the
       use of \C by setting the PCRE2_NEVER_BACKSLASH_C  option.  It  is  also
       possible to build PCRE2 with the use of \C permanently disabled.

       By  default,  \d, \s, and \w match only ASCII characters, even in UTF-8
       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
       matching  is  happening,  \s and \w may also match characters with code
       points in the range 128-255. If the PCRE2_UCP option is set, the behav‐
       iour of these escape sequences is changed to use Unicode properties and
       they match many more characters.

GENERAL CATEGORY PROPERTIES FOR \p and \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator

PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
                      represented by a Universal Character Name
         Xwd        Perl word: property Xan or underscore

       Perl and POSIX space are now the same. Perl added VT to its space char‐
       acter set at release 5.18.

SCRIPT NAMES FOR \p AND \P

       Adlam,  Ahom,  Anatolian_Hieroglyphs,  Arabic, Armenian, Avestan, Bali‐
       nese, Bamum, Bassa_Vah, Batak, Bengali,  Bhaiksuki,  Bopomofo,  Brahmi,
       Braille,  Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba‐
       nian, Chakma,  Cham,  Cherokee,  Common,  Coptic,  Cuneiform,  Cypriot,
       Cyrillic,  Deseret,  Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs,
       Elbasan,  Ethiopic,  Georgian,  Glagolitic,  Gothic,  Grantha,   Greek,
       Gujarati,   Gunjala_Gondi,   Gurmukhi,  Han,  Hangul,  Hanifi_Rohingya,
       Hanunoo,  Hatran,  Hebrew,   Hiragana,   Imperial_Aramaic,   Inherited,
       Inscriptional_Pahlavi,  Inscriptional_Parthian,  Javanese, Kaithi, Kan‐
       nada, Katakana, Kayah_Li, Kharoshthi, Khmer,  Khojki,  Khudawadi,  Lao,
       Latin,  Lepcha,  Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha‐
       jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen,  Masaram_Gondi,
       Medefaidrin,     Meetei_Mayek,     Mende_Kikakui,     Meroitic_Cursive,
       Meroitic_Hieroglyphs, Miao, Modi,  Mongolian,  Mro,  Multani,  Myanmar,
       Nabataean,  New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar‐
       ian, Old_Italic, Old_North_Arabian, Old_Permic,  Old_Persian,  Old_Sog‐
       dian,    Old_South_Arabian,    Old_Turkic,   Oriya,   Osage,   Osmanya,
       Pahawh_Hmong,    Palmyrene,    Pau_Cin_Hau,    Phags_Pa,    Phoenician,
       Psalter_Pahlavi,  Rejang,  Runic,  Samaritan, Saurashtra, Sharada, Sha‐
       vian, Siddham, SignWriting, Sinhala,  Sogdian,  Sora_Sompeng,  Soyombo,
       Sundanese,  Syloti_Nagri,  Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
       Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana,  Thai,  Tibetan,  Tifi‐
       nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.

CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In  PCRE2, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE2_UCP  is  set.
       You can use \Q...\E inside a character class.

QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy
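
       The difference between greedy and lazy quantifiers is easiest  to  see
       on  a  concrete  subject. The sketch below assumes Python's re module,
       which is not PCRE2 but treats the greedy and lazy forms  in  the  same
       way.  Possessive  forms  are left out because older Python versions do
       not accept them.

```python
import re

# Assumption: Python's "re" is used only to illustrate greedy versus lazy
# matching; it is not the PCRE2 engine.
subject = "<a><b>"

# Greedy: .+ takes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", subject).group(0)

# Lazy: .+? takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.+?>", subject).group(0)

# The same distinction applies to bounded repeats.
bounded_greedy = re.search(r"\d{2,3}", "12345").group(0)
bounded_lazy = re.search(r"\d{2,3}?", "12345").group(0)

print(greedy, lazy, bounded_greedy, bounded_lazy)
```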

ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                       also after an internal newline in multiline mode
                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
         \A          start of subject
         $           end of subject
                       also before newline at end of subject
                       also before internal newline in multiline mode
         \Z          end of subject
                       also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject

REPORTED MATCH POINT SETTING

         \K          set reported start of match

       \K is honoured in positive assertions, but ignored in negative ones.

ALTERNATION

         expr|expr|expr...

CAPTURING

         (...)           capture group
         (?<name>...)    named capture group (Perl)
         (?'name'...)    named capture group (Perl)
         (?P<name>...)   named capture group (Python)
         (?:...)         non-capture group
         (?|...)         non-capture group; reset group numbers for
                          capture groups in each alternative

       In  non-UTF  modes, names may contain underscores and ASCII letters and
       digits; in UTF modes, any Unicode letters and  Unicode  decimal  digits
       are permitted. In both cases, a name must not start with a digit.
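
       As an illustration of named and non-capture groups, the sketch  below
       uses  the  Python-style  (?P<name>...)  syntax  from  the table. It is
       written for Python's re module, an assumption made so the example  can
       run without the PCRE2 library; the group names are made up.

```python
import re

# Assumption: Python's "re" stands in for PCRE2; both accept (?P<name>...)
# and (?:...). The names "year" and "day" are illustrative.
m = re.match(r"(?P<year>\d{4})-(?:\d{2})-(?P<day>\d{2})", "2019-02-11")

# Named groups can be read by name or by number.
year = m.group("year")
day = m.group("day")

# The non-capture group (?:\d{2}) takes no number, so "day" is group 2.
day_by_number = m.group(2)
all_groups = m.groups()

print(year, day, day_by_number, all_groups)
```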

ATOMIC GROUPS

         (?>...)         atomic non-capture group
         (*atomic:...)   atomic non-capture group

COMMENT

         (?#....)        comment (not nestable)

OPTION SETTING

       Changes  of these options within a group are automatically cancelled at
       the end of the group.

         (?i)            caseless
         (?J)            allow duplicate names
         (?m)            multiline
         (?n)            no auto capture
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            extended: ignore white space except in classes
         (?xx)           as (?x) but also ignore space and tab in classes
         (?-...)         unset option(s)
         (?^)            unset imnsx options

       Unsetting x or xx unsets both. Several options may be set at once,  and
       a mixture of setting and unsetting such as (?i-x) is allowed, but there
       may be only one hyphen. Setting (but no unsetting) is allowed after (?^,
       for example (?^in). An option setting may appear at the start of a non-
       capture group, for example (?i:...).

       The following are recognized only at the very start  of  a  pattern  or
       after  one  of the newline or \R options with similar syntax. More than
       one of them may appear. For the first three, d is a decimal number.

         (*LIMIT_DEPTH=d) set the backtracking limit to d
         (*LIMIT_HEAP=d)  set the heap size limit to d * 1024 bytes
         (*LIMIT_MATCH=d) set the match limit to d
         (*NOTEMPTY)      set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)       disable JIT optimization
         (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
         (*UTF)          set appropriate UTF mode for the library in use
         (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)

       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce  the
       value   of   the   limits   set  by  the  caller  of  pcre2_match()  or
       pcre2_dfa_match(), not increase them. LIMIT_RECURSION  is  an  obsolete
       synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
       and (*UCP) by setting the PCRE2_NEVER_UTF or  PCRE2_NEVER_UCP  options,
       respectively, at compile time.

NEWLINE CONVENTION

       These  are  recognized  only  at the very start of the pattern or after
       option settings with a similar syntax.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence
         (*NUL)          the NUL character (binary zero)

WHAT \R MATCHES

       These are recognized only at the very start of  the  pattern  or  after
       option setting with a similar syntax.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence

LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)                     )
         (*pla:...)                  ) positive lookahead
         (*positive_lookahead:...)   )

         (?!...)                     )
         (*nla:...)                  ) negative lookahead
         (*negative_lookahead:...)   )

         (?<=...)                    )
         (*plb:...)                  ) positive lookbehind
         (*positive_lookbehind:...)  )

         (?<!...)                    )
         (*nlb:...)                  ) negative lookbehind
         (*negative_lookbehind:...)  )

       Each top-level branch of a lookbehind must be of a fixed length.
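
       The  short  forms  of  these assertions can be exercised with a sketch
       such as the following. It assumes Python's re  module,  which  accepts
       (?=...),  (?<=...)  and  (?<!...)  but  not  the (*pla:...) style. Its
       fixed-length rule is also stricter than the one stated above,  because
       every  branch of a Python lookbehind must have the same length. The
       subjects are invented for the example.

```python
import re

# Assumption: Python's "re" stands in for PCRE2 and supports only the
# short (?=...) style assertions shown here.

# Positive lookahead: digits that are followed by " USD".
ahead = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")

# Positive lookbehind: digits that are preceded by "$".
behind = re.findall(r"(?<=\$)\d+", "$5 and 7 and $12")

# Negative lookbehind: digits preceded by neither "$" nor another digit.
not_behind = re.findall(r"(?<![$\d])\d+", "$5 and 7 and $12")

# A lookbehind of variable length is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_width_rejected = False
except re.error:
    variable_width_rejected = True

print(ahead, behind, not_behind, variable_width_rejected)
```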

SCRIPT RUNS

         (*script_run:...)           ) script run, can be backtracked into
         (*sr:...)                   )

         (*atomic_script_run:...)    ) atomic script run
         (*asr:...)                  )

BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g+n            relative reference by number (PCRE2 extension)
         \g-n            relative reference by number
         \g{+n}          relative reference by number (PCRE2 extension)
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)
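
       Two of the forms above, \n and (?P=name), can be tried with the sketch
       below. It assumes Python's re module, which does not accept the \g and
       \k  forms  inside  a  pattern. The subjects and the group name "q" are
       illustrative.

```python
import re

# Assumption: Python's "re" stands in for PCRE2; only \n and (?P=name)
# from the table are available in this engine.

# \1 refers back to whatever group 1 matched: find a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# (?P=q) refers back to the named group q: the closing quote must be the
# same character as the opening quote.
quoted = re.search(r"(?P<q>['\"])(.*?)(?P=q)", "say 'hi' now").group(2)

print(doubled, quoted)
```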

SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subroutine by absolute number
         (?+n)           call subroutine by relative number
         (?-n)           call subroutine by relative number
         (?&name)        call subroutine by name (Perl)
         (?P>name)       call subroutine by name (Python)
         \g<name>        call subroutine by name (Oniguruma)
         \g'name'        call subroutine by name (Oniguruma)
         \g<n>           call subroutine by absolute number (Oniguruma)
         \g'n'           call subroutine by absolute number (Oniguruma)
         \g<+n>          call subroutine by relative number (PCRE2 extension)
         \g'+n'          call subroutine by relative number (PCRE2 extension)
         \g<-n>          call subroutine by relative number (PCRE2 extension)
         \g'-n'          call subroutine by relative number (PCRE2 extension)

CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)               absolute reference condition
         (?(+n)              relative reference condition
         (?(-n)              relative reference condition
         (?(<name>)          named reference condition (Perl)
         (?('name')          named reference condition (Perl)
         (?(name)            named reference condition (PCRE2, deprecated)
         (?(R)               overall recursion condition
         (?(Rn)              specific numbered group recursion condition
         (?(R&name)          specific named group recursion condition
         (?(DEFINE)          define groups for reference
         (?(VERSION[>]=n.m)  test PCRE2 version
         (?(assert)          assertion condition

       Note  the  ambiguity of (?(R) and (?(Rn) which might be named reference
       conditions or recursion tests. Such a condition  is  interpreted  as  a
       reference condition if the relevant named group exists.
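
       The absolute reference condition (?(n)  can  be  illustrated  with  an
       optional  pair  of  parentheses. The sketch assumes Python's re module,
       which supports this one form of condition but none  of  the  recursion,
       DEFINE  or  VERSION  conditions.  The  helper  name  is_well_formed is
       hypothetical.

```python
import re

# Assumption: Python's "re" stands in for PCRE2; it supports (?(n)yes)
# but not the recursion, DEFINE or VERSION conditions.
# Group 1 optionally matches "(". The condition (?(1)\)) then demands a
# closing ")" only if group 1 took part in the match.
PATTERN = re.compile(r"(\()?\d+(?(1)\))")


def is_well_formed(text: str) -> bool:
    """Return True for digits with either both parentheses or none."""
    return PATTERN.fullmatch(text) is not None


print([is_well_formed(s) for s in ("12", "(12)", "(12", "12)")])
```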

BACKTRACKING CONTROL

       All  backtracking  control  verbs  may be in the form (*VERB:NAME). For
       (*MARK) the name is mandatory, for the others it is  optional.  (*SKIP)
       changes  its  behaviour if :NAME is present. The others just set a name
       for passing back to the caller, but this is not a name that (*SKIP) can
       see. The following act immediately when they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)
         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

       The  following  act only when a subsequent match failure causes a back‐
       track to reach them. They all force a match failure, but they differ in
       what happens afterwards. Those that advance the start-of-match point do
       so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*SKIP)         advance to current matching position
         (*SKIP:NAME)    advance to position corresponding to an earlier
                         (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)         local failure, backtrack to next alternation

       The effect of one of these verbs in a group called as a  subroutine  is
       confined to the subroutine call.

CALLOUTS

         (?C)            callout (assumed number 0)
         (?Cn)           callout with numerical data n
         (?C"text")      callout with string data

       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
       the start and the end), and the starting delimiter { matched  with  the
       ending  delimiter  }. To encode the ending delimiter within the string,
       double it.

SEE ALSO

       pcre2pattern(3),   pcre2api(3),   pcre2callout(3),    pcre2matching(3),
       pcre2(3).

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION

       Last updated: 11 February 2019
       Copyright (c) 1997-2019 University of Cambridge.

PCRE2 10.33                    11 February 2019                 PCRE2SYNTAX(3)