pcre2syntax(3)

PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)

NAME

       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY

       The full syntax and semantics of the regular expressions that are
       supported by PCRE2 are described in the pcre2pattern documentation.
       This document contains a quick-reference summary of the syntax.

QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal

ESCAPED CHARACTERS

       This table applies to ASCII and Unicode environments.

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is any ASCII printing character
         \e         escape (hex 1B)
         \f         form feed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \0dd       character with octal code 0dd
         \ddd       character with octal code ddd, or backreference
         \o{ddd..}  character with octal code ddd..
         \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
         \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
         \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
         \xhh       character with hex code hh
         \x{hh..}   character with hex code hh..

       Note that \0dd is always an octal code. The treatment of backslash
       followed by a non-zero digit is complicated; for details see the
       section "Non-printing characters" in the pcre2pattern documentation,
       where details of escape processing in EBCDIC environments are also
       given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
       supported in EBCDIC environments. Note that \N not followed by an
       opening curly bracket has a different meaning (see below).

       When \x is not followed by {, from zero to two hexadecimal digits are
       read, but if PCRE2_ALT_BSUX is set, \x must be followed by two
       hexadecimal digits to be recognized as a hexadecimal escape; otherwise
       it matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not
       followed by four hexadecimal digits, it matches a literal "u".
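
       As a minimal sketch of the \xhh and \0dd escapes, the following uses
       Python's built-in re module as a stand-in engine. This is an
       assumption made for illustration only: re is not PCRE2, and it accepts
       only part of the escape table above (for example, it has no \o{...} or
       \N{U+hh..} forms and always requires two digits after \x).

```python
import re  # stand-in engine for illustration; this is not PCRE2

# \x41 is "A" (hex 41); \011 is a tab (\0dd is always octal); \n is newline.
pattern = re.compile(r"\x41\011\n")
matched = pattern.fullmatch("A\t\n") is not None

# A backslash before a non-alphanumeric character makes that character literal.
literal_dot = re.findall(r"\.", "a.b")

print(matched, literal_dot)
```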

CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one code unit, even in UTF mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal white space character
         \H         a character that is not a horizontal white space character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a white space character
         \S         a character that is not a white space character
         \v         a vertical white space character
         \V         a character that is not a vertical white space character
         \w         a "word" character
         \W         a "non-word" character
         \X         a Unicode extended grapheme cluster

       \C is dangerous because it may leave the current matching point in the
       middle of a UTF-8 or UTF-16 character. The application can lock out the
       use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
       possible to build PCRE2 with the use of \C permanently disabled.

       By default, \d, \s, and \w match only ASCII characters, even in UTF-8
       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
       matching is happening, \s and \w may also match characters with code
       points in the range 128-255. If the PCRE2_UCP option is set, the
       behaviour of these escape sequences is changed to use Unicode
       properties and they match many more characters.

GENERAL CATEGORY PROPERTIES FOR \p and \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator

PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
                      represented by a Universal Character Name
         Xwd        Perl word: property Xan or underscore

       Perl and POSIX space are now the same. Perl added VT to its space
       character set at release 5.18.

SCRIPT NAMES FOR \p AND \P

       Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan,
       Balinese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo,
       Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,
       Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform,
       Cypriot, Cyrillic, Deseret, Devanagari, Dogra, Duployan,
       Egyptian_Hieroglyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic,
       Grantha, Greek, Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul,
       Hanifi_Rohingya, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic,
       Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,
       Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki,
       Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian,
       Lydian, Mahajani, Makasar, Malayalam, Mandaic, Manichaean, Marchen,
       Masaram_Gondi, Medefaidrin, Meetei_Mayek, Mende_Kikakui,
       Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
       Multani, Myanmar, Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham,
       Ol_Chiki, Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic,
       Old_Persian, Old_Sogdian, Old_South_Arabian, Old_Turkic, Oriya, Osage,
       Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
       Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada,
       Shavian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
       Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
       Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan,
       Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.

CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE2, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE2_UCP is set.
       You can use \Q...\E inside a character class.

QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy
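
       The difference between greedy and lazy quantifiers can be sketched as
       follows. The example assumes Python's built-in re module as a stand-in
       engine; it is not PCRE2, but it treats the greedy and lazy forms in
       the table above the same way. Possessive forms are left out because
       older Python versions do not accept them.

```python
import re  # stand-in engine for illustration; this is not PCRE2

subject = "<a><b>"

# Greedy: .+ takes as much as possible, then backs off to the last ">".
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as possible, stopping at the first ">".
lazy = re.match(r"<.+?>", subject).group()

# Bounded repetition: {2,3} accepts two or three digits and prefers three.
bounded = re.match(r"\d{2,3}", "12345").group()

print(greedy, lazy, bounded)
```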

ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                       also after an internal newline in multiline mode
                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
         \A          start of subject
         $           end of subject
                       also before newline at end of subject
                       also before internal newline in multiline mode
         \Z          end of subject
                       also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject
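
       The effect of multiline mode on ^, and the behaviour of $ before a
       final newline, can be illustrated with a short sketch. It assumes
       Python's built-in re module as a stand-in engine, which is not PCRE2.
       Only ^, $ and \b are shown, because Python's \Z differs from the \Z
       described above and Python has no \z or \G.

```python
import re  # stand-in engine for illustration; this is not PCRE2

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
default_starts = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after each internal newline.
multiline_starts = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches before a newline that ends the subject, so "two" is found
# even though the subject ends with "\n".
before_final_newline = re.search(r"two$", subject) is not None

# \b is a word boundary: "cat" is found as a word, not inside "concat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(default_starts, multiline_starts, before_final_newline, whole_words)
```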

REPORTED MATCH POINT SETTING

         \K          set reported start of match

       \K is honoured in positive assertions, but ignored in negative ones.

ALTERNATION

         expr|expr|expr...

CAPTURING

         (...)           capturing group
         (?<name>...)    named capturing group (Perl)
         (?'name'...)    named capturing group (Perl)
         (?P<name>...)   named capturing group (Python)
         (?:...)         non-capturing group
         (?|...)         non-capturing group; reset group numbers for
                          capturing groups in each alternative
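
       How named, numbered and non-capturing groups are counted can be
       sketched as below. The example assumes Python's built-in re module as
       a stand-in engine (not PCRE2), so it uses the Python-style
       (?P<name>...) form from the table; the (?|...) reset group is not
       available there and is not shown.

```python
import re  # stand-in engine for illustration; this is not PCRE2

# One named group, one non-capturing group, one plain capturing group.
pattern = re.compile(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})")
m = pattern.match("2018-09-02")

year = m.group("year")   # the named group is also group number 1
day = m.group(2)         # (?:...) consumed "09" but was not given a number
captured = m.groups()    # only the two capturing groups appear here

print(year, day, captured)
```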

ATOMIC GROUPS

         (?>...)         atomic, non-capturing group

COMMENT

         (?#....)        comment (not nestable)

OPTION SETTING

       Changes of these options within a group are automatically cancelled at
       the end of the group.

         (?i)            caseless
         (?J)            allow duplicate names
         (?m)            multiline
         (?n)            no auto capture
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            extended: ignore white space except in classes
         (?xx)           as (?x) but also ignore space and tab in classes
         (?-...)         unset option(s)
         (?^)            unset imnsx options

       Unsetting x or xx unsets both. Several options may be set at once, and
       a mixture of setting and unsetting such as (?i-x) is allowed, but there
       may be only one hyphen. Setting (but not unsetting) is allowed after
       (?^, for example (?^in). An option setting may appear at the start of a
       non-capturing group, for example (?i:...).
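
       The scoping of an option set at the start of a non-capturing group can
       be sketched as follows. The example assumes Python's built-in re
       module as a stand-in engine; it is not PCRE2 and supports only some of
       the option letters above, but (?i) and (?i:...) behave as described.

```python
import re  # stand-in engine for illustration; this is not PCRE2

candidates = ("abcdef", "ABCdef", "ABCDEF")

# (?i:...) makes only "abc" caseless; "def" stays case-sensitive.
scoped = re.compile(r"(?i:abc)def")
scoped_hits = [s for s in candidates if scoped.fullmatch(s)]

# (?i) at the start of the pattern applies to the whole pattern.
whole = re.compile(r"(?i)abcdef")
whole_hits = [s for s in candidates if whole.fullmatch(s)]

print(scoped_hits, whole_hits)
```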
       The following are recognized only at the very start of a pattern or
       after one of the newline or \R options with similar syntax. More than
       one of them may appear. For the first three, d is a decimal number.

         (*LIMIT_DEPTH=d)     set the backtracking limit to d
         (*LIMIT_HEAP=d)      set the heap size limit to d * 1024 bytes
         (*LIMIT_MATCH=d)     set the match limit to d
         (*NOTEMPTY)          set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART)  set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS)   no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)            disable JIT optimization
         (*NO_START_OPT)      no start-match optimization (PCRE2_NO_START_OPTIMIZE)
         (*UTF)               set appropriate UTF mode for the library in use
         (*UCP)               set PCRE2_UCP (use Unicode properties for \d etc)

       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
       value of the limits set by the caller of pcre2_match() or
       pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
       synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
       and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
       respectively, at compile time.

NEWLINE CONVENTION

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence
         (*NUL)          the NUL character (binary zero)

WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after
       option setting with a similar syntax.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence

LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)         positive look ahead
         (?!...)         negative look ahead
         (?<=...)        positive look behind
         (?<!...)        negative look behind

       Each top-level branch of a look behind must be of a fixed length.
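
       A short sketch of a look behind and a look ahead follows. It assumes
       Python's built-in re module as a stand-in engine, which is not PCRE2.
       Note that re is stricter than the rule above: it requires the whole
       look behind, not just each top-level branch, to have one fixed length.

```python
import re  # stand-in engine for illustration; this is not PCRE2

# Positive look behind: digits preceded by "$"; the "$" is not in the match.
prices = re.findall(r"(?<=\$)\d+", "cost $42 or 17")

# Negative look ahead: whole numbers that are not followed by "%".
plain = re.findall(r"\b\d+\b(?!%)", "50% of 80")

# A look behind of unbounded length (a+) is rejected when compiling.
try:
    re.compile(r"(?<=a+)b")
    variable_length_accepted = True
except re.error:
    variable_length_accepted = False

print(prices, plain, variable_length_accepted)
```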

BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g+n            relative reference by number (PCRE2 extension)
         \g-n            relative reference by number
         \g{+n}          relative reference by number (PCRE2 extension)
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)
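
       Backreferences by number and by name can be sketched as below. The
       example assumes Python's built-in re module as a stand-in engine (not
       PCRE2), so only the \n and (?P=name) forms from the table are used;
       the \g and \k forms are not accepted by re.

```python
import re  # stand-in engine for illustration; this is not PCRE2

# \1 must match the same text that group 1 matched: this finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was the the end").group(1)

# (?P=q) refers back to the named group q, so the closing quote must be the
# same character as the opening quote.
quoted = re.search(r"(?P<q>['\"])(.*?)(?P=q)", "say 'hi\" there' now").group(2)

print(doubled, quoted)
```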

SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subpattern by absolute number
         (?+n)           call subpattern by relative number
         (?-n)           call subpattern by relative number
         (?&name)        call subpattern by name (Perl)
         (?P>name)       call subpattern by name (Python)
         \g<name>        call subpattern by name (Oniguruma)
         \g'name'        call subpattern by name (Oniguruma)
         \g<n>           call subpattern by absolute number (Oniguruma)
         \g'n'           call subpattern by absolute number (Oniguruma)
         \g<+n>          call subpattern by relative number (PCRE2 extension)
         \g'+n'          call subpattern by relative number (PCRE2 extension)
         \g<-n>          call subpattern by relative number (PCRE2 extension)
         \g'-n'          call subpattern by relative number (PCRE2 extension)

CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)               absolute reference condition
         (?(+n)              relative reference condition
         (?(-n)              relative reference condition
         (?(<name>)          named reference condition (Perl)
         (?('name')          named reference condition (Perl)
         (?(name)            named reference condition (PCRE2, deprecated)
         (?(R)               overall recursion condition
         (?(Rn)              specific numbered group recursion condition
         (?(R&name)          specific named group recursion condition
         (?(DEFINE)          define subpattern for reference
         (?(VERSION[>]=n.m)  test PCRE2 version
         (?(assert)          assertion condition

       Note the ambiguity of (?(R) and (?(Rn) which might be named reference
       conditions or recursion tests. Such a condition is interpreted as a
       reference condition if the relevant named group exists.
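
       The absolute reference condition (?(n)...) can be sketched as follows.
       The example assumes Python's built-in re module as a stand-in engine;
       it is not PCRE2 and supports only the numbered and plain-name
       conditions, not the recursion, DEFINE, VERSION or assertion forms.

```python
import re  # stand-in engine for illustration; this is not PCRE2

# Group 1 optionally matches "<". The condition (?(1)>) requires a closing
# ">" only if group 1 took part in the match.
tag = re.compile(r"(<)?\w+(?(1)>)")
tag_results = {s: tag.fullmatch(s) is not None
               for s in ("<abc>", "abc", "<abc", "abc>")}

# Two-branch form: if group 1 matched "#", the yes-pattern \d+ is used;
# otherwise the no-pattern [a-z]+ is used.
ref = re.compile(r"(#)?(?(1)\d+|[a-z]+)")
ref_results = {s: ref.fullmatch(s) is not None
               for s in ("#42", "abc", "#abc", "42")}

print(tag_results, ref_results)
```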

BACKTRACKING CONTROL

       All backtracking control verbs may be in the form (*VERB:NAME). For
       (*MARK) the name is mandatory, for the others it is optional. (*SKIP)
       changes its behaviour if :NAME is present. The others just set a name
       for passing back to the caller, but this is not a name that (*SKIP) can
       see. The following act as soon as they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)
         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

       The following act only when a subsequent match failure causes a
       backtrack to reach them. They all force a match failure, but they
       differ in what happens afterwards. Those that advance the
       start-of-match point do so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*SKIP)         advance to current matching position
         (*SKIP:NAME)    advance to position corresponding to an earlier
                         (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)         local failure, backtrack to next alternation

       The effect of one of these verbs in a group called as a subroutine is
       confined to the subroutine call.

CALLOUTS

         (?C)            callout (assumed number 0)
         (?Cn)           callout with numerical data n
         (?C"text")      callout with string data

       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
       the start and the end), and the starting delimiter { matched with the
       ending delimiter }. To encode the ending delimiter within the string,
       double it.

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION

       Last updated: 02 September 2018
       Copyright (c) 1997-2018 University of Cambridge.

PCRE2 10.32                    02 September 2018                PCRE2SYNTAX(3)