PCRESYNTAX(3)              Library Functions Manual              PCRESYNTAX(3)

NAME

PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION SYNTAX SUMMARY

The full syntax and semantics of the regular expressions that are
supported by PCRE are described in the pcrepattern documentation. This
document contains a quick-reference summary of the syntax.

QUOTING

  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal

CHARACTERS

  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         form feed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \0dd       character with octal code 0dd
  \ddd       character with octal code ddd, or backreference
  \o{ddd..}  character with octal code ddd..
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..

Note that \0dd is always an octal code, and that \8 and \9 are the
literal characters "8" and "9".
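
As a rough illustration of the hex, octal, and single-letter escapes listed above, the following sketch uses Python's standard re module. This is an assumption made for convenience: re is not PCRE, but it accepts the same \xhh, \0dd, \t, \n, \r, \f, and \a forms. Escapes such as \e and \o{ddd..} are not accepted by re and are left out. The helper name matches_exactly is hypothetical.

```python
import re

# Python's re is only a stand-in for PCRE here; these escapes are shared.
HEX_A = re.compile(r"\x41")          # \xhh: hex 41 is "A"
OCTAL_TAB = re.compile(r"\011")      # \0dd: octal 011 is tab (hex 09)
ESCAPES = re.compile(r"\t\n\r\f\a")  # tab, newline, CR, form feed, BEL


def matches_exactly(regex, text):
    """Return True if the whole of text is matched by regex."""
    return regex.fullmatch(text) is not None


print(matches_exactly(HEX_A, "A"))
print(matches_exactly(OCTAL_TAB, "\t"))
print(matches_exactly(ESCAPES, "\t\n\r\f\x07"))
```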

CHARACTER TYPES

  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one data unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{xx}     a character with the xx property
  \P{xx}     a character without the xx property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster

By default, \d, \s, and \w match only ASCII characters, even in UTF-8
mode or in the 16-bit and 32-bit libraries. However, if locale-specific
matching is happening, \s and \w may also match characters with code
points in the range 128-255. If the PCRE_UCP option is set, the
behaviour of these escape sequences is changed to use Unicode
properties and they match many more characters.
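
The difference between ASCII-only and Unicode-property matching for \d and \w can be sketched with Python's re module, used here only as an analogy. The mapping is an assumption: compiling with re.ASCII is treated as resembling PCRE's default behaviour, and compiling without it as resembling PCRE with PCRE_UCP set. Python's own default is the opposite of PCRE's.

```python
import re

ARABIC_INDIC_THREE = "\u0663"  # a decimal digit outside the ASCII range

# re.ASCII limits \d and \w to ASCII (assumed analogue of PCRE's default).
ascii_digit = re.compile(r"\d", re.ASCII)
ascii_word = re.compile(r"\w", re.ASCII)

# Without re.ASCII, str patterns use Unicode properties
# (assumed analogue of PCRE with PCRE_UCP set).
unicode_digit = re.compile(r"\d")
unicode_word = re.compile(r"\w")

print(ascii_digit.fullmatch(ARABIC_INDIC_THREE) is None)
print(unicode_digit.fullmatch(ARABIC_INDIC_THREE) is not None)
```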

GENERAL CATEGORY PROPERTIES FOR \p AND \P

  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator

PCRE SPECIAL CATEGORY PROPERTIES FOR \p AND \P

  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore

Perl and POSIX space are now the same. Perl added VT to its space
character set at release 5.18 and PCRE changed at release 8.34.

SCRIPT NAMES FOR \p AND \P

Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali,
Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal,
Carian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic,
Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Duployan,
Egyptian_Hieroglyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic,
Grantha, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,
Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi,
Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu,
Linear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic,
Manichaean, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Myanmar, Nabataean,
New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_North_Arabian,
Old_Permic, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya,
Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada,
Shavian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri,
Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil,
Telugu, Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai,
Warang_Citi, Yi.

CHARACTER CLASSES

  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit

In PCRE, POSIX character set names recognize only ASCII characters by
default, but some of them use Unicode properties if PCRE_UCP is set.
You can use \Q...\E inside a character class.

QUANTIFIERS

  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
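
The greedy and lazy forms in the table can be compared on one subject string. The sketch below again assumes Python's re module as a stand-in, since it shares the greedy and lazy quantifier syntax with PCRE. Possessive quantifiers are omitted because older Python versions reject them.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, so the match runs to the last ">".
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can, so the match stops at the first ">".
lazy = re.match(r"<.+?>", subject).group()

# Bounded repetition: the greedy form prefers the upper bound,
# the lazy form prefers the lower bound.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()

print(greedy, lazy, bounded_greedy, bounded_lazy)
```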

ANCHORS AND SIMPLE ASSERTIONS

  \b          word boundary
  \B          not a word boundary
  ^           start of subject
                also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \Z          end of subject
                also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
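
A small sketch of ^, $, and \b, assuming Python's re module as a stand-in for the behaviour these anchors share with PCRE. \Z, \z, and \G are deliberately left out: Python gives \Z a different meaning from the one listed above and does not provide \G.

```python
import re

text = "alpha\nbeta\n"

# In multiline mode ^ also matches after each internal newline.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Without multiline mode ^ matches only at the start of the subject.
first_only = re.findall(r"^\w+", text)

# $ matches before a newline that ends the subject.
dollar_before_final_newline = re.search(r"beta$", text) is not None

# \b requires a word boundary on both sides, so "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat")

print(line_starts, first_only, dollar_before_final_newline, whole_words)
```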

MATCH POINT RESET

  \K          reset start of match

\K is honoured in positive assertions, but ignored in negative ones.

ALTERNATION

  expr|expr|expr...

CAPTURING

  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                    capturing groups in each alternative
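
The Python-style named group and the non-capturing group from the list above can be tried directly in Python's re module, which is assumed here only as a convenient host for that shared syntax. The Perl-style (?<name>...) and (?'name'...) forms and the (?|...) branch reset group are PCRE features that this sketch does not exercise.

```python
import re

# Two named capturing groups around one non-capturing group.
date = re.compile(r"(?P<year>\d{4})-(?:\d{2})-(?P<day>\d{2})")

m = date.fullmatch("2014-01-08")

# The non-capturing month group gets no number, so "day" is group 2.
print(m.group("year"), m.group("day"), m.groups())
```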

ATOMIC GROUPS

  (?>...)         atomic, non-capturing group

COMMENT

  (?#....)        comment (not nestable)

OPTION SETTING

  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
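
The effect of the (?i), (?s), and (?x) settings can be sketched as follows, assuming Python's re module as a stand-in. Python accepts these three inline flags at the start of a pattern, but it has no (?J) or (?U), so those are not shown.

```python
import re

# (?i): letters match regardless of case.
caseless = re.fullmatch(r"(?i)pcre", "PcRe") is not None

# (?s): dot also matches a newline; without it, dot stops at newlines.
dotall = re.fullmatch(r"(?s)a.b", "a\nb") is not None
default_dot = re.fullmatch(r"a.b", "a\nb") is not None

# (?x): white space inside the pattern is ignored.
extended = re.fullmatch(r"(?x) \d+  \s*  [a-z]+ ", "42 kb") is not None

print(caseless, dotall, default_dot, extended)
```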

The following are recognized only at the very start of a pattern or
after one of the newline or \R options with similar syntax. More than
one of them may appear.

  (*LIMIT_MATCH=d)      set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d)  set the recursion limit to d (decimal number)
  (*NO_AUTO_POSSESS)    no auto-possessification (PCRE_NO_AUTO_POSSESS)
  (*NO_START_OPT)       no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)               set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)              set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)              set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UTF)                set appropriate UTF mode for the library in use
  (*UCP)                set PCRE_UCP (use Unicode properties for \d etc)

Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
the limits set by the caller of pcre_exec(), not increase them.

NEWLINE CONVENTIONS

These are recognized only at the very start of the pattern or after
option settings with a similar syntax.

  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence

WHAT \R MATCHES

These are recognized only at the very start of the pattern or after
option settings with a similar syntax.

  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence

LOOKAHEAD AND LOOKBEHIND ASSERTIONS

  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind

Each top-level branch of a look behind must be of a fixed length.
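
The four assertions, and the fixed-length restriction on look behind, can be sketched with Python's re module as an assumed stand-in. One difference to keep in mind: Python requires the whole look behind to be fixed width, which is stricter than the per-branch rule stated above, so the rejected pattern below is invalid under both rules.

```python
import re

# Positive look ahead: a word only if a colon follows it.
keys = re.findall(r"\w+(?=:)", "key: value")

# Negative look ahead: "foo" only if "bar" does not follow it.
foo_pos = re.search(r"foo(?!bar)", "foobar foobaz").start()

# Positive look behind: digits only if a dollar sign precedes them.
prices = re.findall(r"(?<=\$)\d+", "cost $15, weight 20kg, fee $7")

# Negative look behind: digits that do not follow a dollar sign.
plain = re.findall(r"(?<!\$)\b\d+", "cost $15, weight 20kg, fee $7")

# A look behind whose length can vary (a+) is rejected at compile time.
try:
    re.compile(r"(?<=a+)b")
    variable_rejected = False
except re.error:
    variable_rejected = True

print(keys, foo_pos, prices, plain, variable_rejected)
```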

BACKREFERENCES

  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k<name>        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
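
Two of the forms above, \n and (?P=name), can be tried in Python's re module, assumed here as a stand-in because it shares exactly those two. The \g{...} and \k... forms are not accepted by re and are not shown.

```python
import re

# \1 must repeat whatever group 1 matched: this finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was was fine").group()

# (?P=q) must repeat the quote character captured by the group named q.
quoted = re.compile(r"(?P<q>['\"]).*?(?P=q)")
single = quoted.search("say 'hi' now").group()
double = quoted.search('a "b\'c" d').group()

print(doubled, single, double)
```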

SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \g<name>        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g<n>           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g<+n>          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g<-n>          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)

CONDITIONAL PATTERNS

  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(<name>)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
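
The absolute reference condition (?(n)...) can be sketched as follows, assuming Python's re module as a stand-in since it supports that one form. The pattern and the helper name is_valid are hypothetical; the relative, recursion, DEFINE, and assertion conditions are PCRE features not covered here.

```python
import re

# Group 1 captures an optional "<". The condition (?(1)>) then requires
# a closing ">" only when group 1 took part in the match.
address = re.compile(r"(<)?\w+@\w+(?:\.\w+)+(?(1)>)")


def is_valid(text):
    """Return True if text is an address with balanced angle brackets."""
    return address.fullmatch(text) is not None


print(is_valid("<user@host.com>"), is_valid("user@host.com"))
print(is_valid("<user@host.com"), is_valid("user@host.com>"))
```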

BACKTRACKING CONTROL

The following act as soon as they are reached:

  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

The following act only when a subsequent match failure causes a
backtrack to reach them. They all force a match failure, but they
differ in what happens afterwards. Those that advance the
start-of-match point do so only if the pattern is not anchored.

  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                    (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)

CALLOUTS

  (?C)            callout
  (?Cn)           callout with data n

SEE ALSO

pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).

AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

REVISION

Last updated: 08 January 2014
Copyright (c) 1997-2014 University of Cambridge.

PCRE 8.35                       08 January 2014                  PCRESYNTAX(3)