pcre2syntax(3)

PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)

NAME

       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY

       The full syntax and semantics of the regular expressions that are
       supported by PCRE2 are described in the pcre2pattern documentation.
       This document contains a quick-reference summary of the syntax.

QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal

ESCAPED CHARACTERS

       This table applies to ASCII and Unicode environments.

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is any ASCII printing character
         \e         escape (hex 1B)
         \f         form feed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \0dd       character with octal code 0dd
         \ddd       character with octal code ddd, or backreference
         \o{ddd..}  character with octal code ddd..
         \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
         \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
         \xhh       character with hex code hh
         \x{hhh..}  character with hex code hhh..

       Note that \0dd is always an octal code. The treatment of backslash
       followed by a non-zero digit is complicated; for details see the
       section "Non-printing characters" in the pcre2pattern documentation,
       where details of escape processing in EBCDIC environments are also
       given.

       When \x is not followed by {, from zero to two hexadecimal digits are
       read, but if PCRE2_ALT_BSUX is set, \x must be followed by two
       hexadecimal digits to be recognized as a hexadecimal escape; otherwise
       it matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not
       followed by four hexadecimal digits, it matches a literal "u".

CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one code unit, even in UTF mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal white space character
         \H         a character that is not a horizontal white space character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a white space character
         \S         a character that is not a white space character
         \v         a vertical white space character
         \V         a character that is not a vertical white space character
         \w         a "word" character
         \W         a "non-word" character
         \X         a Unicode extended grapheme cluster

       \C is dangerous because it may leave the current matching point in the
       middle of a UTF-8 or UTF-16 character. The application can lock out the
       use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
       possible to build PCRE2 with the use of \C permanently disabled.

       By default, \d, \s, and \w match only ASCII characters, even in UTF-8
       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
       matching is happening, \s and \w may also match characters with code
       points in the range 128-255. If the PCRE2_UCP option is set, the
       behaviour of these escape sequences is changed to use Unicode
       properties and they match many more characters.
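
       The difference between ASCII-only and Unicode-aware \d can be sketched
       as follows. This is an illustration only: it uses Python's standard re
       module, which is not PCRE2. The assumption is that re.ASCII roughly
       corresponds to the PCRE2 default and that Python's default Unicode
       matching roughly corresponds to PCRE2_UCP. The helper name
       is_all_digits and the sample string are made up for the example.

```python
import re

# Arabic-Indic digits U+0661..U+0663: decimal digits, but not ASCII.
ARABIC_DIGITS = "\u0661\u0662\u0663"


def is_all_digits(text: str, ascii_only: bool) -> bool:
    """Return True if every character of text is matched by the \\d escape."""
    # re.ASCII stands in for the PCRE2 default (assumption, see lead-in);
    # no flag stands in for PCRE2_UCP.
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\d+", text, flags) is not None


print(is_all_digits("123", ascii_only=True))
print(is_all_digits(ARABIC_DIGITS, ascii_only=True))
print(is_all_digits(ARABIC_DIGITS, ascii_only=False))
```

       In this sketch only the last call accepts the non-ASCII digits, which
       mirrors the "many more characters" remark above.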

GENERAL CATEGORY PROPERTIES FOR \p and \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator

PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
                      represented by a Universal Character Name
         Xwd        Perl word: property Xan or underscore

       Perl and POSIX space are now the same. Perl added VT to its space
       character set at release 5.18.

SCRIPT NAMES FOR \p AND \P

       Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese,
       Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese,
       Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham,
       Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
       Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic,
       Georgian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han,
       Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
       Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi,
       Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
       Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian,
       Mahajani, Malayalam, Mandaic, Manichaean, Meetei_Mayek, Mende_Kikakui,
       Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
       Multani, Myanmar, Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki,
       Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian,
       Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene,
       Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic,
       Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala,
       Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
       Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai,
       Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.

CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE2, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE2_UCP is set.
       You can use \Q...\E inside a character class.

QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy
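
       As an illustration only, the following sketch contrasts the greedy and
       lazy forms from the table above. It uses Python's standard re module,
       which is not PCRE2 but is assumed here to treat these particular
       quantifiers the same way. The possessive forms are left out because
       older Python versions do not accept them. The subject strings are made
       up for the example.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as possible, then backs off to the last ">".
greedy = re.match(r"<.+>", subject).group(0)

# Lazy: .+? takes as little as possible, stopping at the first ">".
lazy = re.match(r"<.+?>", subject).group(0)

# {n,m} greedy prefers the upper bound; {n,m}? prefers the lower bound.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group(0)
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group(0)

print(greedy, lazy, bounded_greedy, bounded_lazy)
```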

ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                       also after an internal newline in multiline mode
                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
         \A          start of subject
         $           end of subject
                       also before newline at end of subject
                       also before internal newline in multiline mode
         \Z          end of subject
                       also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject
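
       The behaviour of ^, $ and \b described above can be sketched as
       follows. The sketch uses Python's standard re module rather than PCRE2,
       on the assumption that these three assertions behave alike in the cases
       shown. \Z is deliberately not used, because Python's \Z matches only at
       the very end of the subject, like \z above. The subject strings are
       invented for the example.

```python
import re

subject = "one\ntwo\n"

# ^ matches only at the start of the subject unless multiline mode is on.
words_default = re.findall(r"^\w+", subject)
words_multiline = re.findall(r"(?m)^\w+", subject)

# $ matches before a newline at the very end of the subject ...
end_before_final_newline = re.search(r"two$", subject) is not None

# ... but before an internal newline only in multiline mode.
end_internal_default = re.search(r"one$", subject) is not None
end_internal_multiline = re.search(r"(?m)one$", subject) is not None

# \b requires a word boundary on both sides of "cat".
whole_words = re.findall(r"\bcat\b", "cat concat cats")

print(words_default, words_multiline, whole_words)
```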

MATCH POINT RESET

         \K          reset start of match

       \K is honoured in positive assertions, but ignored in negative ones.

ALTERNATION

         expr|expr|expr...

CAPTURING

         (...)           capturing group
         (?<name>...)    named capturing group (Perl)
         (?'name'...)    named capturing group (Perl)
         (?P<name>...)   named capturing group (Python)
         (?:...)         non-capturing group
         (?|...)         non-capturing group; reset group numbers for
                          capturing groups in each alternative
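
       A short sketch of how capturing, named and non-capturing groups are
       numbered. It is written with Python's standard re module, which is not
       PCRE2 and accepts only the (?P<name>...) spelling of a named group; the
       Perl spellings and (?|...) are therefore not shown. The date string is
       just sample data.

```python
import re

# Group 1 is named "year", the month is grouped without capturing,
# and the day becomes group 2.
pattern = r"(?P<year>\d{4})-(?:\d{2})-(\d{2})"
match = re.match(pattern, "2016-12-23")

year_by_name = match.group("year")
year_by_number = match.group(1)
all_groups = match.groups()

print(year_by_name, year_by_number, all_groups)
```

       The non-capturing group takes no number, so the day is group 2 rather
       than group 3.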

ATOMIC GROUPS

         (?>...)         atomic, non-capturing group

COMMENT

         (?#....)        comment (not nestable)

OPTION SETTING

         (?i)            caseless
         (?J)            allow duplicate names
         (?m)            multiline
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            extended (ignore white space)
         (?-...)         unset option(s)

       The following are recognized only at the very start of a pattern or
       after one of the newline or \R options with similar syntax. More than
       one of them may appear.

         (*LIMIT_MATCH=d) set the match limit to d (decimal number)
         (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
         (*NOTEMPTY)     set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)       disable JIT optimization
         (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
         (*UTF)          set appropriate UTF mode for the library in use
         (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)

       Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
       the limits set by the caller of pcre2_match() or pcre2_dfa_match(), not
       increase them. The application can lock out the use of (*UTF) and
       (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
       respectively, at compile time.
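
       The single-letter options listed at the top of this section can be
       tried out with the sketch below. It uses Python's standard re module,
       which is not PCRE2; only (?i), (?m), (?s) and (?x) are shown, on the
       assumption that they behave alike in these simple cases. (?J), (?U) and
       the (*...) items are PCRE2-specific and are not illustrated. The
       subject strings are made up.

```python
import re

# (?i) caseless: letter case is ignored.
caseless = re.fullmatch(r"(?i)pcre2", "PCRE2") is not None

# (?s) dotall: "." also matches a newline.
dot_default = re.search(r"a.b", "a\nb") is not None
dot_all = re.search(r"(?s)a.b", "a\nb") is not None

# (?m) multiline: "^" also matches after an internal newline.
line_starts = re.findall(r"(?m)^\w+", "one\ntwo")

# (?x) extended: white space in the pattern is ignored.
extended = re.fullmatch(r"(?x) \d{4} - \d{2}", "2016-12") is not None

print(caseless, dot_default, dot_all, line_starts, extended)
```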

NEWLINE CONVENTION

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence

WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence

LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)         positive look ahead
         (?!...)         negative look ahead
         (?<=...)        positive look behind
         (?<!...)        negative look behind

       Each top-level branch of a look behind must be of a fixed length.
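
       The four assertions can be sketched as follows, again with Python's
       standard re module standing in for PCRE2 (an assumption; Python is
       stricter about look behind length, so only single-character look
       behinds are used). The sample strings are invented.

```python
import re

# Positive look ahead: words that are followed by a colon.
keys = re.findall(r"\w+(?=:)", "key: value, other: x")

# Negative look ahead: whole numbers that are not followed by "%".
plain_numbers = re.findall(r"\b\d+\b(?!%)", "10% 20 30%")

text = "cost $15, qty 3, fee $7"

# Positive look behind: digits preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", text)

# Negative look behind: numbers not preceded by a dollar sign.
quantities = re.findall(r"(?<!\$)\b\d+", text)

print(keys, plain_numbers, prices, quantities)
```

       None of the assertions consumes characters, so the colon, percent sign
       and dollar sign never appear in the matched text.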

BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g+n            relative reference by number (PCRE2 extension)
         \g-n            relative reference by number
         \g{+n}          relative reference by number (PCRE2 extension)
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)
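
       A minimal sketch of a numbered and a named backreference. It uses
       Python's standard re module, which is not PCRE2 and accepts only the \n
       and (?P=name) forms from the table above; the \g and \k forms are not
       shown. The group name "quote" and the subject strings are made up.

```python
import re

# \1 must match the same text that group 1 captured: a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
doubled_text = doubled.group(0)
doubled_word = doubled.group(1)

# (?P=quote) must match the same quote character that opened the string.
quoted = [
    m.group(0)
    for m in re.finditer(r"(?P<quote>['\"]).*?(?P=quote)", "say \"hi\" and 'bye'")
]

print(doubled_text, doubled_word, quoted)
```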

SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subpattern by absolute number
         (?+n)           call subpattern by relative number
         (?-n)           call subpattern by relative number
         (?&name)        call subpattern by name (Perl)
         (?P>name)       call subpattern by name (Python)
         \g<name>        call subpattern by name (Oniguruma)
         \g'name'        call subpattern by name (Oniguruma)
         \g<n>           call subpattern by absolute number (Oniguruma)
         \g'n'           call subpattern by absolute number (Oniguruma)
         \g<+n>          call subpattern by relative number (PCRE2 extension)
         \g'+n'          call subpattern by relative number (PCRE2 extension)
         \g<-n>          call subpattern by relative number (PCRE2 extension)
         \g'-n'          call subpattern by relative number (PCRE2 extension)

CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)               absolute reference condition
         (?(+n)              relative reference condition
         (?(-n)              relative reference condition
         (?(<name>)          named reference condition (Perl)
         (?('name')          named reference condition (Perl)
         (?(name)            named reference condition (PCRE2, deprecated)
         (?(R)               overall recursion condition
         (?(Rn)              specific numbered group recursion condition
         (?(R&name)          specific named group recursion condition
         (?(DEFINE)          define subpattern for reference
         (?(VERSION[>]=n.m)  test PCRE2 version
         (?(assert)          assertion condition

       Note the ambiguity of (?(R) and (?(Rn) which might be named reference
       conditions or recursion tests. Such a condition is interpreted as a
       reference condition if the relevant named group exists.
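
       The absolute reference condition (?(n) can be sketched as below: a
       closing ">" is required only if an opening "<" was captured by group 1.
       The sketch uses Python's standard re module, which is not PCRE2 and
       supports only the numbered and plain-name conditions; the helper name
       is_balanced and the test strings are hypothetical.

```python
import re

# Group 1 optionally captures "<"; (?(1)>) demands ">" only if it did.
BRACKETED = re.compile(r"^(<)?\w+(?(1)>)$")


def is_balanced(text: str) -> bool:
    """Return True for "name" or "<name>", False for a lone bracket."""
    return BRACKETED.match(text) is not None


for candidate in ("<user>", "user", "<user", "user>"):
    print(candidate, is_balanced(candidate))
```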

BACKTRACKING CONTROL

       The following act immediately they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)
         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

       The following act only when a subsequent match failure causes a
       backtrack to reach them. They all force a match failure, but they
       differ in what happens afterwards. Those that advance the
       start-of-match point do so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
         (*SKIP)         advance to current matching position
         (*SKIP:NAME)    advance to position corresponding to an earlier
                         (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)         local failure, backtrack to next alternation
         (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)

CALLOUTS

         (?C)            callout (assumed number 0)
         (?Cn)           callout with numerical data n
         (?C"text")      callout with string data

       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
       the start and the end), and the starting delimiter { matched with the
       ending delimiter }. To encode the ending delimiter within the string,
       double it.

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION

       Last updated: 23 December 2016
       Copyright (c) 1997-2016 University of Cambridge.

PCRE2 10.23                    23 December 2016                 PCRE2SYNTAX(3)