PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)

NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY

       The full syntax and semantics of the regular expressions that are
       supported by PCRE2 are described in the pcre2pattern documentation.
       This document contains a quick-reference summary of the syntax.

QUOTING

         \x          where x is non-alphanumeric is a literal x
         \Q...\E     treat enclosed characters as literal

ESCAPED CHARACTERS

       This table applies to ASCII and Unicode environments.

         \a          alarm, that is, the BEL character (hex 07)
         \cx         "control-x", where x is any ASCII printing character
         \e          escape (hex 1B)
         \f          form feed (hex 0C)
         \n          newline (hex 0A)
         \r          carriage return (hex 0D)
         \t          tab (hex 09)
         \0dd        character with octal code 0dd
         \ddd        character with octal code ddd, or backreference
         \o{ddd..}   character with octal code ddd..
         \U          "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
         \N{U+hh..}  character with Unicode code point hh.. (Unicode mode only)
         \uhhhh      character with hex code hhhh (if PCRE2_ALT_BSUX is set)
         \xhh        character with hex code hh
         \x{hh..}    character with hex code hh..

       Note that \0dd is always an octal code. The treatment of backslash
       followed by a non-zero digit is complicated; for details see the
       section "Non-printing characters" in the pcre2pattern documentation,
       where details of escape processing in EBCDIC environments are also
       given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
       supported in EBCDIC environments. Note that \N not followed by an
       opening curly bracket has a different meaning (see below).

       When \x is not followed by {, from zero to two hexadecimal digits are
       read, but if PCRE2_ALT_BSUX is set, \x must be followed by two
       hexadecimal digits to be recognized as a hexadecimal escape; otherwise
       it matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not
       followed by four hexadecimal digits, it matches a literal "u".

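       The \x rule described above can be sketched in code. The Python
       function below is an illustration only, not PCRE2 source: the name
       read_backslash_x and its interface are hypothetical, and treating
       "\x with no digits" in default mode as code point 0 is an assumption
       taken from the pcre2pattern documentation, not from this summary.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def read_backslash_x(rest, alt_bsux=False):
    """Interpret the text following a backslash-x that is NOT followed by "{".

    Returns (character, number_of_characters_consumed_from_rest).
    """
    # Collect at most two leading hexadecimal digits.
    digits = ""
    for ch in rest[:2]:
        if ch not in HEX_DIGITS:
            break
        digits += ch

    if alt_bsux:
        # ALT_BSUX mode: two digits are required, otherwise the sequence
        # is simply a literal "x" and nothing after it is consumed.
        if len(digits) == 2:
            return chr(int(digits, 16)), 2
        return "x", 0

    # Default mode: zero, one, or two digits are accepted.
    # Assumption: zero digits yields the character with code point 0.
    return (chr(int(digits, 16)) if digits else "\0"), len(digits)


print(read_backslash_x("41z"))                 # two digits read
print(read_backslash_x("9z"))                  # one digit read
print(read_backslash_x("9z", alt_bsux=True))   # literal "x" in ALT_BSUX mode
```

       The same shape would apply to \u in ALT_BSUX mode, with four digits
       required instead of two.
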
CHARACTER TYPES

         .        any character except newline;
                    in dotall mode, any character whatsoever
         \C       one code unit, even in UTF mode (best avoided)
         \d       a decimal digit
         \D       a character that is not a decimal digit
         \h       a horizontal white space character
         \H       a character that is not a horizontal white space character
         \N       a character that is not a newline
         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \R       a newline sequence
         \s       a white space character
         \S       a character that is not a white space character
         \v       a vertical white space character
         \V       a character that is not a vertical white space character
         \w       a "word" character
         \W       a "non-word" character
         \X       a Unicode extended grapheme cluster

       \C is dangerous because it may leave the current matching point in the
       middle of a UTF-8 or UTF-16 character. The application can lock out
       the use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is
       also possible to build PCRE2 with the use of \C permanently disabled.

       By default, \d, \s, and \w match only ASCII characters, even in UTF-8
       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
       matching is happening, \s and \w may also match characters with code
       points in the range 128-255. If the PCRE2_UCP option is set, the
       behaviour of these escape sequences is changed to use Unicode
       properties and they match many more characters.

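       The difference between ASCII-only and property-based \d and \w can be
       seen by analogy in Python. This is only an analogy: Python's re module
       is a separate engine, and its re.ASCII flag is used here as a stand-in
       for PCRE2's default behaviour, with Python's default Unicode matching
       standing in for PCRE2_UCP. The variable names are illustrative.

```python
import re

ARABIC_INDIC_THREE = "\u0663"   # a decimal digit outside the ASCII range
E_ACUTE = "\u00e9"              # a letter outside the ASCII range

# ASCII-only classes (comparable to PCRE2 without PCRE2_UCP).
ascii_digit = re.compile(r"\d", re.ASCII)
ascii_word = re.compile(r"\w", re.ASCII)

# Unicode-aware classes (comparable to PCRE2 with PCRE2_UCP set).
unicode_digit = re.compile(r"\d")
unicode_word = re.compile(r"\w")

print(bool(ascii_digit.fullmatch(ARABIC_INDIC_THREE)))    # False
print(bool(unicode_digit.fullmatch(ARABIC_INDIC_THREE)))  # True
print(bool(ascii_word.fullmatch(E_ACUTE)))                # False
print(bool(unicode_word.fullmatch(E_ACUTE)))              # True
```
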
GENERAL CATEGORY PROPERTIES FOR \p and \P

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter
         L&    Ll, Lu, or Lt

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P

         Xan   Alphanumeric: union of properties L and N
         Xps   POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp   Perl space: property Z or tab, NL, VT, FF, CR
         Xuc   Universally-named character: one that can be
                 represented by a Universal Character Name
         Xwd   Perl word: property Xan or underscore

       Perl and POSIX space are now the same. Perl added VT to its space
       character set at release 5.18.

SCRIPT NAMES FOR \p AND \P

       Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan,
       Balinese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo,
       Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,
       Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform,
       Cypriot, Cyrillic, Deseret, Devanagari, Dogra, Duployan,
       Egyptian_Hieroglyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic,
       Grantha, Greek, Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul,
       Hanifi_Rohingya, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic,
       Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,
       Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki,
       Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian,
       Lydian, Mahajani, Makasar, Malayalam, Mandaic, Manichaean, Marchen,
       Masaram_Gondi, Medefaidrin, Meetei_Mayek, Mende_Kikakui,
       Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
       Multani, Myanmar, Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham,
       Ol_Chiki, Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic,
       Old_Persian, Old_Sogdian, Old_South_Arabian, Old_Turkic, Oriya, Osage,
       Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
       Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada,
       Shavian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
       Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
       Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan,
       Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.

CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE2, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE2_UCP is set.
       You can use \Q...\E inside a character class.

QUANTIFIERS

         ?        0 or 1, greedy
         ?+       0 or 1, possessive
         ??       0 or 1, lazy
         *        0 or more, greedy
         *+       0 or more, possessive
         *?       0 or more, lazy
         +        1 or more, greedy
         ++       1 or more, possessive
         +?       1 or more, lazy
         {n}      exactly n
         {n,m}    at least n, no more than m, greedy
         {n,m}+   at least n, no more than m, possessive
         {n,m}?   at least n, no more than m, lazy
         {n,}     n or more, greedy
         {n,}+    n or more, possessive
         {n,}?    n or more, lazy

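       The greedy and lazy forms in the table can be compared side by side.
       The sketch below uses Python's re module, which is not PCRE2 but
       accepts the same syntax for these two forms; the possessive forms are
       left out because not every Python version accepts them. The subject
       strings are made up for the example.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then gives back only what is needed.
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can, extending only when forced to.
lazy = re.match(r"<.+?>", subject).group()

# Bounded repeats follow the same greedy/lazy rule.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()

print(greedy, lazy, bounded_greedy, bounded_lazy)
# prints: <a><b> <a> aaa aa
```
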
ANCHORS AND SIMPLE ASSERTIONS

         \b    word boundary
         \B    not a word boundary
         ^     start of subject
                 also after an internal newline in multiline mode
                 (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
         \A    start of subject
         $     end of subject
                 also before newline at end of subject
                 also before internal newline in multiline mode
         \Z    end of subject
                 also before newline at end of subject
         \z    end of subject
         \G    first matching position in subject

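       The effect of multiline mode on ^ and $ can be sketched as follows.
       Python's re module is used as a stand-in here; it is a different
       engine, but ^, $, and \A behave as the table describes for these
       cases. \Z and \z are deliberately avoided because Python spells them
       differently from PCRE2. The subject string is made up.

```python
import re

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after an internal newline.
starts_multiline = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches before a newline at the very end of the subject ...
ends_default = re.findall(r"\w+$", subject)

# ... and, in multiline mode, before internal newlines as well.
ends_multiline = re.findall(r"\w+$", subject, re.MULTILINE)

# \A is unaffected by multiline mode.
start_only = re.findall(r"\A\w+", subject, re.MULTILINE)

print(starts_default, starts_multiline)
print(ends_default, ends_multiline, start_only)
```
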
MATCH POINT RESET

         \K    set reported start of match

       \K is honoured in positive assertions, but ignored in negative ones.

ALTERNATION

         expr|expr|expr...

CAPTURING

         (...)           capturing group
         (?<name>...)    named capturing group (Perl)
         (?'name'...)    named capturing group (Perl)
         (?P<name>...)   named capturing group (Python)
         (?:...)         non-capturing group
         (?|...)         non-capturing group; reset group numbers for
                           capturing groups in each alternative

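       A short sketch of numbered, named, and non-capturing groups follows.
       It uses the Python-style (?P<name>...) spelling from the table and
       runs under Python's re module, which is a separate engine from PCRE2;
       the date pattern and the group names are invented for the example.

```python
import re

# Two named groups, then a non-capturing wrapper around one plain group.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})(?:-(\d{2}))?")

m = date.fullmatch("2018-09-02")

# Named groups are also numbered, in order of their opening parenthesis.
print(m.group("year"), m.group(1))    # 2018 2018
print(m.group("month"), m.group(2))   # 09 09

# (?:...) does not take a number, so the day is group 3, not group 4.
print(m.group(3))                     # 02
print(date.groups)                    # 3
```
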
ATOMIC GROUPS

         (?>...)   atomic, non-capturing group

COMMENT

         (?#....)  comment (not nestable)

OPTION SETTING
       Changes of these options within a group are automatically cancelled
       at the end of the group.

         (?i)      caseless
         (?J)      allow duplicate names
         (?m)      multiline
         (?n)      no auto capture
         (?s)      single line (dotall)
         (?U)      default ungreedy (lazy)
         (?x)      extended: ignore white space except in classes
         (?xx)     as (?x) but also ignore space and tab in classes
         (?-...)   unset option(s)
         (?^)      unset imnsx options

       Unsetting x or xx unsets both. Several options may be set at once,
       and a mixture of setting and unsetting such as (?i-x) is allowed, but
       there may be only one hyphen. Setting (but not unsetting) is allowed
       after (?^, for example (?^in). An option setting may appear at the
       start of a non-capturing group, for example (?i:...).

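       The scoping of an option placed at the start of a non-capturing group
       can be sketched as below. Python's re module is used for illustration;
       it is not PCRE2, but it accepts (?i:...) and a leading (?x) with the
       meaning given in the table. The patterns are invented examples.

```python
import re

# (?i:...) makes only "ab" caseless; the trailing "c" stays case-sensitive.
scoped = re.compile(r"(?i:ab)c")
print(bool(scoped.fullmatch("ABc")))   # True
print(bool(scoped.fullmatch("ABC")))   # False

# (?x) at the start: white space in the pattern is ignored, so this is
# the same pattern as \d+\s*kg.
extended = re.compile(r"(?x) \d+ \s* kg")
print(bool(extended.fullmatch("12 kg")))   # True
```
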
       The following are recognized only at the very start of a pattern or
       after one of the newline or \R options with similar syntax. More than
       one of them may appear. For the first three, d is a decimal number.

         (*LIMIT_DEPTH=d)      set the backtracking limit to d
         (*LIMIT_HEAP=d)       set the heap size limit to d * 1024 bytes
         (*LIMIT_MATCH=d)      set the match limit to d
         (*NOTEMPTY)           set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART)   set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS)    no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR)  no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)             disable JIT optimization
         (*NO_START_OPT)       no start-match optimization (PCRE2_NO_START_OPTIMIZE)
         (*UTF)                set appropriate UTF mode for the library in use
         (*UCP)                set PCRE2_UCP (use Unicode properties for \d etc)

       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
       value of the limits set by the caller of pcre2_match() or
       pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
       synonym for LIMIT_DEPTH. The application can lock out the use of
       (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP
       options, respectively, at compile time.

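       The "reduce only" rule for the three limits amounts to taking a
       minimum. The sketch below states that arithmetic in Python; the
       function names effective_limit and heap_limit_bytes are hypothetical
       helpers for illustration and are not part of the PCRE2 API.

```python
def effective_limit(caller_limit, pattern_limit=None):
    """Limit in force once an optional (*LIMIT_...=d) item has been seen.

    caller_limit is the value set by the caller of the matching function;
    pattern_limit is d from the pattern, or None if the item is absent.
    """
    if pattern_limit is None:
        return caller_limit
    # The pattern may lower the caller's limit but never raise it.
    return min(caller_limit, pattern_limit)


def heap_limit_bytes(d):
    """Byte count described by (*LIMIT_HEAP=d): d * 1024 bytes."""
    return d * 1024


print(effective_limit(10000, 500))     # 500: the pattern lowers the limit
print(effective_limit(10000, 50000))   # 10000: an attempt to raise is ignored
print(heap_limit_bytes(20))            # 20480
```
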
NEWLINE CONVENTION

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*CR)        carriage return only
         (*LF)        linefeed only
         (*CRLF)      carriage return followed by linefeed
         (*ANYCRLF)   all three of the above
         (*ANY)       any Unicode newline sequence
         (*NUL)       the NUL character (binary zero)

WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*BSR_ANYCRLF)   CR, LF, or CRLF
         (*BSR_UNICODE)   any Unicode newline sequence

LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)    positive look ahead
         (?!...)    negative look ahead
         (?<=...)   positive look behind
         (?<!...)   negative look behind

       Each top-level branch of a look behind must be of a fixed length.

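       The four assertions can be tried out as in the sketch below. It runs
       under Python's re module, a separate engine that accepts the same
       assertion syntax; Python applies a similar but not identical
       fixed-width rule to look behinds, so the last step shows Python's
       behaviour only. The subject strings are invented.

```python
import re

prices = "10 USD, 20 EUR"

# Positive look ahead: digits that are followed by " USD".
usd = re.findall(r"\d+(?= USD)", prices)

# Negative look ahead: whole numbers that are not followed by " USD".
not_usd = re.findall(r"\b\d+\b(?! USD)", prices)

# Positive and negative look behind on a one-character prefix.
after_dollar = re.findall(r"(?<=\$)\d+", "$5 and 7")
not_after_dollar = re.findall(r"(?<!\$)\b\d+", "$5 and 7")

# A look behind whose length is not fixed is rejected at compile time.
try:
    re.compile(r"(?<=a+)b")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(usd, not_usd, after_dollar, not_after_dollar, variable_width_ok)
```
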
BACKREFERENCES

         \n          reference by number (can be ambiguous)
         \gn         reference by number
         \g{n}       reference by number
         \g+n        relative reference by number (PCRE2 extension)
         \g-n        relative reference by number
         \g{+n}      relative reference by number (PCRE2 extension)
         \g{-n}      relative reference by number
         \k<name>    reference by name (Perl)
         \k'name'    reference by name (Perl)
         \g{name}    reference by name (Perl)
         \k{name}    reference by name (.NET)
         (?P=name)   reference by name (Python)

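       A backreference matches the same text that the referenced group
       matched, which makes it useful for finding repeated words. The sketch
       below uses only the \n and (?P=name) forms, run under Python's re
       module (a different engine that shares these two spellings); the
       other forms in the table are not shown. The sample text is made up.

```python
import re

text = "it was the the best"

# Reference by number: \1 must repeat whatever group 1 matched.
by_number = re.search(r"\b(\w+) \1\b", text)

# Reference by name, Python style.
by_name = re.search(r"\b(?P<word>\w+) (?P=word)\b", text)

print(by_number.group(), by_number.group(1))    # the the / the
print(by_name.group(), by_name.group("word"))   # the the / the
```
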
SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)        recurse whole pattern
         (?n)        call subpattern by absolute number
         (?+n)       call subpattern by relative number
         (?-n)       call subpattern by relative number
         (?&name)    call subpattern by name (Perl)
         (?P>name)   call subpattern by name (Python)
         \g<name>    call subpattern by name (Oniguruma)
         \g'name'    call subpattern by name (Oniguruma)
         \g<n>       call subpattern by absolute number (Oniguruma)
         \g'n'       call subpattern by absolute number (Oniguruma)
         \g<+n>      call subpattern by relative number (PCRE2 extension)
         \g'+n'      call subpattern by relative number (PCRE2 extension)
         \g<-n>      call subpattern by relative number (PCRE2 extension)
         \g'-n'      call subpattern by relative number (PCRE2 extension)

CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)                absolute reference condition
         (?(+n)               relative reference condition
         (?(-n)               relative reference condition
         (?(<name>)           named reference condition (Perl)
         (?('name')           named reference condition (Perl)
         (?(name)             named reference condition (PCRE2, deprecated)
         (?(R)                overall recursion condition
         (?(Rn)               specific numbered group recursion condition
         (?(R&name)           specific named group recursion condition
         (?(DEFINE)           define subpattern for reference
         (?(VERSION[>]=n.m)   test PCRE2 version
         (?(assert)           assertion condition

       Note the ambiguity of (?(R) and (?(Rn), which might be named reference
       conditions or recursion tests. Such a condition is interpreted as a
       reference condition if the relevant named group exists.

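       The absolute reference condition (?(n)...) can be sketched as below.
       The example runs under Python's re module, which is not PCRE2 but
       accepts this one form of condition with the same meaning; the other
       condition types in the table are not covered. The patterns and
       subjects are invented.

```python
import re

# Group 1 optionally captures "("; the condition then demands ")" only
# when group 1 took part in the match (yes-pattern only).
balanced = re.compile(r"(\()?\d+(?(1)\))")

# yes-pattern|no-pattern: ">" if "<" was seen, otherwise ";".
tagged = re.compile(r"(<)?\w+(?(1)>|;)")

for subject in ["12", "(12)", "(12", "12)"]:
    print(subject, bool(balanced.fullmatch(subject)))

for subject in ["<a>", "a;", "a>", "<a;"]:
    print(subject, bool(tagged.fullmatch(subject)))
```

       Under these assumptions the first two subjects in each list match and
       the last two do not.
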
BACKTRACKING CONTROL

       All backtracking control verbs may be in the form (*VERB:NAME). For
       (*MARK) the name is mandatory; for the others it is optional. (*SKIP)
       changes its behaviour if :NAME is present. The others just set a name
       for passing back to the caller, but this is not a name that (*SKIP)
       can see. The following act as soon as they are reached:

         (*ACCEPT)      force successful match
         (*FAIL)        force backtrack; synonym (*F)
         (*MARK:NAME)   set name to be passed back; synonym (*:NAME)

       The following act only when a subsequent match failure causes a
       backtrack to reach them. They all force a match failure, but they
       differ in what happens afterwards. Those that advance the
       start-of-match point do so only if the pattern is not anchored.

         (*COMMIT)      overall failure, no advance of starting point
         (*PRUNE)       advance to next starting character
         (*SKIP)        advance to current matching position
         (*SKIP:NAME)   advance to position corresponding to an earlier
                          (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)        local failure, backtrack to next alternation

       The effect of one of these verbs in a group called as a subroutine is
       confined to the subroutine call.

CALLOUTS

         (?C)         callout (assumed number 0)
         (?Cn)        callout with numerical data n
         (?C"text")   callout with string data

       The allowed string delimiters are ` ' " ^ % # $ (which are the same
       for the start and the end), and the starting delimiter { matched with
       the ending delimiter }. To encode the ending delimiter within the
       string, double it.

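       The delimiter-doubling rule can be sketched as a small helper that
       builds a string callout item. The function name callout_item is
       hypothetical and is not part of PCRE2; it only assembles pattern text
       according to the rule stated above.

```python
ALLOWED_DELIMITERS = "`'\"^%#${"


def callout_item(text, delimiter='"'):
    """Build a (?C...) item carrying text as its string data."""
    if len(delimiter) != 1 or delimiter not in ALLOWED_DELIMITERS:
        raise ValueError("unsupported callout delimiter: %r" % (delimiter,))

    # "{" is closed by "}"; every other delimiter closes with itself.
    closing = "}" if delimiter == "{" else delimiter

    # The ending delimiter is encoded inside the string by doubling it.
    escaped = text.replace(closing, closing * 2)
    return "(?C" + delimiter + escaped + closing + ")"


print(callout_item('say "hi"'))    # (?C"say ""hi""")
print(callout_item("a}b", "{"))    # (?C{a}}b})
print(callout_item("plain", "^"))  # (?C^plain^)
```
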
SEE ALSO

       pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
       pcre2(3).

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION

       Last updated: 02 September 2018
       Copyright (c) 1997-2018 University of Cambridge.

PCRE2 10.32                   02 September 2018                 PCRE2SYNTAX(3)