PCRE2SYNTAX(3)            Library Functions Manual            PCRE2SYNTAX(3)

PCRE2 - Perl-compatible regular expressions (revised API)

The full syntax and semantics of the regular expressions that are
supported by PCRE2 are described in the pcre2pattern documentation. This
document contains a quick-reference summary of the syntax.

  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal

This table applies to ASCII and Unicode environments.

  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII printing character
  \e         escape (hex 1B)
  \f         form feed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \0dd       character with octal code 0dd
  \ddd       character with octal code ddd, or backreference
  \o{ddd..}  character with octal code ddd..
  \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
  \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..

Note that \0dd is always an octal code. The treatment of backslash
followed by a non-zero digit is complicated; for details see the section
"Non-printing characters" in the pcre2pattern documentation, where
details of escape processing in EBCDIC environments are also given.

When \x is not followed by {, from zero to two hexadecimal digits are
read, but if PCRE2_ALT_BSUX is set, \x must be followed by two
hexadecimal digits to be recognized as a hexadecimal escape; otherwise it
matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not
followed by four hexadecimal digits, it matches a literal "u".
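As a rough illustration of the escapes summarized above, the following sketch uses Python's built-in re module rather than PCRE2 itself. That choice is an assumption made only to keep the example self-contained: re gives \xhh, \0dd, three-digit \ddd and backslash-plus-punctuation the same meaning, but it does not accept \o{...}, \x{...}, \Q...\E or the PCRE2_ALT_BSUX forms.

```python
import re

# Three spellings of a single character, using escapes that Python's re
# shares with PCRE2 (this is NOT the PCRE2 engine).
hex_form = re.fullmatch(r"\x41", "A")      # hex 41 is "A"
octal_form = re.fullmatch(r"\101", "A")    # three octal digits: octal 101 is "A"
tab_form = re.fullmatch(r"\011", "\t")     # leading zero: always an octal code

# A backslash before a non-alphanumeric character makes it literal.
literal_dot = re.fullmatch(r"a\.b", "a.b")     # matches
not_wildcard = re.fullmatch(r"a\.b", "axb")    # no match: "." is not a wildcard here

print(bool(hex_form), bool(octal_form), bool(tab_form),
      bool(literal_dot), bool(not_wildcard))
```

The octal forms are easy to confuse with backreferences, which is why the table lists \ddd as "octal code ddd, or backreference"; writing the code point in hex avoids that ambiguity.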

  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one code unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{xx}     a character with the xx property
  \P{xx}     a character without the xx property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster

\C is dangerous because it may leave the current matching point in the
middle of a UTF-8 or UTF-16 character. The application can lock out the
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
possible to build PCRE2 with the use of \C permanently disabled.

By default, \d, \s, and \w match only ASCII characters, even in UTF-8
mode or in the 16-bit and 32-bit libraries. However, if locale-specific
matching is happening, \s and \w may also match characters with code
points in the range 128-255. If the PCRE2_UCP option is set, the
behaviour of these escape sequences is changed to use Unicode properties
and they match many more characters.
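The difference between ASCII-only and Unicode-property matching for \d and \w can be sketched with Python's re module, used here as a stand-in and not as PCRE2. Note the assumption that makes the analogy work: Python's defaults are the reverse of PCRE2's, so re.ASCII roughly corresponds to PCRE2's default behaviour and Python's default roughly corresponds to setting PCRE2_UCP.

```python
import re

arabic_three = "\u0663"   # ARABIC-INDIC DIGIT THREE (general category Nd)
e_acute = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE (category Ll)

# ASCII-only matching: comparable to PCRE2 without PCRE2_UCP.
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII)   # no match
ascii_word = re.fullmatch(r"\w", e_acute, re.ASCII)         # no match

# Unicode-aware matching: comparable to PCRE2 with PCRE2_UCP set.
unicode_digit = re.fullmatch(r"\d", arabic_three)           # matches
unicode_word = re.fullmatch(r"\w", e_acute)                 # matches

print(bool(ascii_digit), bool(ascii_word),
      bool(unicode_digit), bool(unicode_word))
```

A pattern that validates numeric input with \d therefore accepts a much larger set of characters once Unicode properties are in effect; [0-9] stays ASCII-only in either mode.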

  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator

  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore

Perl and POSIX space are now the same. Perl added VT to its space
character set at release 5.18.

Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese,
Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese,
Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham,
Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic,
Georgian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han,
Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi,
Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian,
Mahajani, Malayalam, Mandaic, Manichaean, Meetei_Mayek, Mende_Kikakui,
Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
Multani, Myanmar, Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki,
Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian,
Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene,
Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic,
Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala,
Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai,
Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.

  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit

In PCRE2, POSIX character set names recognize only ASCII characters by
default, but some of them use Unicode properties if PCRE2_UCP is set.
You can use \Q...\E inside a character class.

  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
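The three quantifier flavours in the table above behave differently on the same subject. The sketch below uses Python's re module as a stand-in for PCRE2 (an assumption, not the library this page documents); greedy and lazy forms are available in every Python 3 release, while the possessive forms are only accepted by re from Python 3.11, so that part is guarded by a version check.

```python
import re
import sys

subject = "<a><b>"

# Greedy takes as much as it can; lazy takes as little as it can.
greedy = re.match(r"<.+>", subject).group()    # whole subject
lazy = re.match(r"<.+?>", subject).group()     # first tag only

# Counted repetition: {2,3} rejects four repetitions, {2} accepts two.
too_many = re.fullmatch(r"a{2,3}", "aaaa")
exact = re.fullmatch(r"a{2}", "aa")

possessive = greedy_backtracks = None
if sys.version_info >= (3, 11):
    # A possessive quantifier never gives back what it matched, so the
    # trailing "a" has nothing left to match.
    possessive = re.fullmatch(r"a*+a", "aaa")
    # The greedy equivalent backtracks one step and succeeds.
    greedy_backtracks = re.fullmatch(r"a*a", "aaa")

print(greedy, lazy, too_many, bool(exact))
```

Possessive quantifiers trade flexibility for predictability: they can turn a slow, heavily backtracking match into a fast failure, at the cost of rejecting subjects that the greedy form would accept.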

  \b          word boundary
  \B          not a word boundary
  ^           start of subject
                also after an internal newline in multiline mode
                (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
  \A          start of subject
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \Z          end of subject
                also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
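A short sketch of how ^, $, \A and \b behave, again using Python's re module in place of PCRE2 (an assumption for the sake of a runnable example). Only anchors with the same meaning in both engines are shown: Python's \Z behaves like PCRE2's \z, and re has no \G or PCRE2_ALT_CIRCUMFLEX, so those are left out.

```python
import re

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
first_only = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after each internal newline.
each_line = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches before a newline that ends the subject, even without multiline.
before_final_newline = re.search(r"two$", subject)

# \A is unaffected by multiline mode: start of subject only.
start_only = re.findall(r"\A\w+", subject, re.MULTILINE)

# \b requires a word/non-word transition on each side of "cat".
word_edges = re.findall(r"\bcat\b", "cat concat cat.")

print(first_only, each_line, bool(before_final_newline), start_only, word_edges)
```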

  \K          reset start of match

\K is honoured in positive assertions, but ignored in negative ones.

  expr|expr|expr...

  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                    capturing groups in each alternative
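Numbered, named and non-capturing groups can be sketched as follows. The example runs on Python's re module, not on PCRE2 (an assumption made so that it is self-contained), and so it uses only the (?P<name>...) spelling; re does not accept (?<name>...), (?'name'...) or the (?|...) branch-reset group.

```python
import re

# Named groups are also numbered, in order of their opening parenthesis.
m = re.fullmatch(r"(?P<year>\d{4})-(?P<month>\d{2})", "2016-12")
by_name = (m.group("year"), m.group("month"))
by_number = (m.group(1), m.group(2))

# (?:...) groups for repetition without consuming a group number,
# so the only captured group here is (c).
nc = re.fullmatch(r"(?:ab)+(c)", "ababc")
nc_groups = nc.groups()

print(by_name, by_number, nc_groups)
```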

  (?>...)         atomic, non-capturing group

  (?#....)        comment (not nestable)

  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
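The effect of the single-letter options can be sketched with Python's re module, which is used here as an approximation and is not PCRE2. Under that assumption only (?i), (?m), (?s) and (?x) are shown: re has no (?J) or (?U), requires these global settings to sit at the very start of the pattern, and supports unsetting only in the scoped form (?-i:...).

```python
import re

caseless = re.fullmatch(r"(?i)pcre2", "PCRE2")          # case ignored

dotall = re.fullmatch(r"(?s)a.b", "a\nb")               # "." matches newline
no_dotall = re.fullmatch(r"a.b", "a\nb")                # "." stops at newline

# (?m) lets ^ and $ match at internal line boundaries.
multiline = re.findall(r"(?m)^\d+$", "1\n22\n333")

# (?x) ignores unescaped white space and treats "#" as a comment start.
extended = re.fullmatch(r"(?x) \d{4} - \d{2}  # year-month", "2016-12")

print(bool(caseless), bool(dotall), bool(no_dotall), multiline, bool(extended))
```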

The following are recognized only at the very start of a pattern or
after one of the newline or \R options with similar syntax. More than
one of them may appear.

  (*LIMIT_MATCH=d)       set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d)   set the recursion limit to d (decimal number)
  (*NOTEMPTY)            set PCRE2_NOTEMPTY when matching
  (*NOTEMPTY_ATSTART)    set PCRE2_NOTEMPTY_ATSTART when matching
  (*NO_AUTO_POSSESS)     no auto-possessification (PCRE2_NO_AUTO_POSSESS)
  (*NO_DOTSTAR_ANCHOR)   no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
  (*NO_JIT)              disable JIT optimization
  (*NO_START_OPT)        no start-match optimization (PCRE2_NO_START_OPTIMIZE)
  (*UTF)                 set appropriate UTF mode for the library in use
  (*UCP)                 set PCRE2_UCP (use Unicode properties for \d etc)

Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
the limits set by the caller of pcre2_match() or pcre2_dfa_match(), not
increase them. The application can lock out the use of (*UTF) and
(*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
respectively, at compile time.

These are recognized only at the very start of the pattern or after
option settings with a similar syntax.

  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence

These are recognized only at the very start of the pattern or after
option settings with a similar syntax.

  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence

  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind

Each top-level branch of a look behind must be of a fixed length.
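The four assertions and the fixed-length rule can be sketched with Python's re module, standing in for PCRE2 (an assumption, not the documented library). One difference is worth flagging: PCRE2 lets each top-level branch of a look behind have its own fixed length, whereas re requires the whole look behind to have a single fixed width. Both reject an unbounded repeat such as a+.

```python
import re

# Positive look behind: digits preceded by "$" (the "$" is not consumed).
after_dollar = re.findall(r"(?<=\$)\d+", "cost $30 or 40")

# Negative look behind: whole numbers NOT preceded by "$".
not_after_dollar = re.findall(r"(?<!\$)\b\d+", "cost $30 or 40")

# Positive look ahead: names followed by "(".
called = re.findall(r"\w+(?=\()", "f(x) + g(y) + h")

# Negative look ahead: "foo" not followed by "bar".
not_foobar = re.findall(r"foo(?!bar)", "foobar foobaz")

# A variable-length look behind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_lookbehind_ok = True
except re.error:
    variable_lookbehind_ok = False

print(after_dollar, not_after_dollar, called, not_foobar, variable_lookbehind_ok)
```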

  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g+n            relative reference by number (PCRE2 extension)
  \g-n            relative reference by number
  \g{+n}          relative reference by number (PCRE2 extension)
  \g{-n}          relative reference by number
  \k<name>        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
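A backreference matches the same text that the referenced group matched, not the same pattern. The sketch below demonstrates this with Python's re module in place of PCRE2 (an assumption); of the forms listed above, re accepts only \n and (?P=name) inside a pattern, so those are the two shown.

```python
import re

# \1 must repeat exactly what group 1 captured: finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best").group()

# (?P=q) requires the closing quote to be the same character as the
# opening one, so mixed quotes do not pair up.
pairs = re.findall(r"(?P<q>['\"])(.*?)(?P=q)", "say 'hi' and \"bye\"")
contents = [text for _quote, text in pairs]

mismatched = re.fullmatch(r"(?P<q>['\"]).*(?P=q)", "'oops\"")

print(doubled, contents, mismatched)
```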

  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \g<name>        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g<n>           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g<+n>          call subpattern by relative number (PCRE2 extension)
  \g'+n'          call subpattern by relative number (PCRE2 extension)
  \g<-n>          call subpattern by relative number (PCRE2 extension)
  \g'-n'          call subpattern by relative number (PCRE2 extension)

  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)               absolute reference condition
  (?(+n)              relative reference condition
  (?(-n)              relative reference condition
  (?(<name>)          named reference condition (Perl)
  (?('name')          named reference condition (Perl)
  (?(name)            named reference condition (PCRE2, deprecated)
  (?(R)               overall recursion condition
  (?(Rn)              specific numbered group recursion condition
  (?(R&name)          specific named group recursion condition
  (?(DEFINE)          define subpattern for reference
  (?(VERSION[>]=n.m)  test PCRE2 version
  (?(assert)          assertion condition

Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a
reference condition if the relevant named group exists.
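The absolute reference condition (?(n)...) can be sketched with Python's re module, used here instead of PCRE2 (an assumption). Python's re supports only the numbered and plain-name conditions; the relative, recursion, DEFINE, VERSION and assertion conditions listed above have no equivalent there.

```python
import re

# Group 1 optionally matches "<". The condition (?(1)>) then requires a
# closing ">" only if group 1 took part in the match.
tag_or_word = re.compile(r"(<)?\w+(?(1)>)")

bracketed = tag_or_word.fullmatch("<tag>")    # "<" seen, ">" required: match
bare = tag_or_word.fullmatch("tag")           # no "<", nothing required: match
unclosed = tag_or_word.fullmatch("<tag")      # "<" seen, ">" missing: no match
stray_close = tag_or_word.fullmatch("tag>")   # ">" without "<": no match

print(bool(bracketed), bool(bare), bool(unclosed), bool(stray_close))
```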

The following act immediately when they are reached:

  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

The following act only when a subsequent match failure causes a
backtrack to reach them. They all force a match failure, but they differ
in what happens afterwards. Those that advance the start-of-match point
do so only if the pattern is not anchored.

  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                    (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)

  (?C)            callout (assumed number 0)
  (?Cn)           callout with numerical data n
  (?C"text")      callout with string data

The allowed string delimiters are ` ' " ^ % # $ (which are the same for
the start and the end), and the starting delimiter { matched with the
ending delimiter }. To encode the ending delimiter within the string,
double it.

pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
pcre2(3).

Philip Hazel
University Computing Service
Cambridge, England.

Last updated: 23 December 2016
Copyright (c) 1997-2016 University of Cambridge.

PCRE2 10.23                   23 December 2016                PCRE2SYNTAX(3)