PCRESYNTAX(3)              Library Functions Manual              PCRESYNTAX(3)



NAME
       PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION SYNTAX SUMMARY

       The full syntax and semantics of the regular expressions that are
       supported by PCRE are described in the pcrepattern documentation. This
       document contains just a quick-reference summary of the syntax.

QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal

CHARACTERS

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is any character
         \e         escape (hex 1B)
         \f         formfeed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \ddd       character with octal code ddd, or backreference
         \xhh       character with hex code hh
         \x{hhh..}  character with hex code hhh..

CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one byte, even in UTF-8 mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal whitespace character
         \H         a character that is not a horizontal whitespace character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a whitespace character
         \S         a character that is not a whitespace character
         \v         a vertical whitespace character
         \V         a character that is not a vertical whitespace character
         \w         a "word" character
         \W         a "non-word" character
         \X         an extended Unicode sequence

       In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
       characters, even in UTF-8 mode. However, this can be changed by setting
       the PCRE_UCP option.

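       As a rough illustration of the ASCII-only default described above, the
       sketch below uses Python's re module, which is a separate engine and
       not PCRE. The mapping is an assumption made for illustration only:
       Python's re.ASCII flag stands in for PCRE's default behaviour, and
       Python's own Unicode-aware default stands in loosely for PCRE_UCP.

```python
import re

# "42" in ASCII digits, then the same number in Arabic-Indic digits.
text = "42 \u0664\u0662"

# With re.ASCII, \d is limited to [0-9] (assumed analogue of PCRE's default).
ascii_only = re.findall(r"\d+", text, flags=re.ASCII)

# Without re.ASCII, \d uses Unicode properties (assumed analogue of PCRE_UCP).
unicode_aware = re.findall(r"\d+", text)

print(ascii_only)          # ['42']
print(len(unicode_aware))  # 2
```

       The exact set of characters matched under PCRE_UCP should be checked
       against pcrepattern(3) rather than inferred from this sketch.
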
GENERAL CATEGORY PROPERTIES FOR \p and \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator

PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, FF, CR
         Xwd        Perl word: property Xan or underscore

SCRIPT NAMES FOR \p AND \P

       Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
       Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
       Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari,
       Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
       Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,
       Imperial_Aramaic, Inherited, Inscriptional_Pahlavi,
       Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
       Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
       Lydian, Malayalam, Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko,
       Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki,
       Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samaritan,
       Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog,
       Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai,
       Tibetan, Tifinagh, Ugaritic, Vai, Yi.

CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       whitespace
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE_UCP is set.
       You can use \Q...\E inside a character class.

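       The positive class, negative class, and range forms can be tried out
       with the sketch below. It uses Python's re module as a stand-in, which
       is an assumption of convenience: re is not PCRE and has no POSIX named
       sets, so the hexadecimal class is spelled out with ranges instead of
       [[:xdigit:]]. The sample string is invented for the example.

```python
import re

sample = "id=7F, tag=zz, code=0a"

# Positive class built from three [x-y] ranges: two hexadecimal digits.
hex_pairs = re.findall(r"=([0-9A-Fa-f]{2})\b", sample)

# Negative class [^...]: a field runs up to, but not including, a comma.
fields = re.findall(r"[^,\s][^,]*", sample)

# A range whose end points are written as hex escapes (A through Z).
upper_runs = re.findall(r"[\x41-\x5A]+", "abcXYZdef")

print(hex_pairs)   # ['7F', '0a']
print(fields)      # ['id=7F', 'tag=zz', 'code=0a']
print(upper_runs)  # ['XYZ']
```
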
QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy

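       The difference between the greedy and lazy forms is easiest to see on
       a small input. The sketch below uses Python's re module, which is not
       PCRE but treats these particular quantifiers the same way. Possessive
       forms are left out because older Python versions reject them. The
       sample strings are invented for the example.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy *: takes as much as possible, so it spans both elements.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy *?: takes as little as possible, so each element matches separately.
lazy = re.findall(r"<b>.*?</b>", html)

# {n,m}: whole numbers that are two or three digits long.
bounded = re.findall(r"\b\d{2,3}\b", "7 42 365 2024")

print(greedy)   # ['<b>one</b> and <b>two</b>']
print(lazy)     # ['<b>one</b>', '<b>two</b>']
print(bounded)  # ['42', '365']
```
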
ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                      also after internal newline in multiline mode
         \A          start of subject
         $           end of subject
                      also before newline at end of subject
                      also before internal newline in multiline mode
         \Z          end of subject
                      also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject

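       The effect of multiline mode on ^ and $ can be sketched as follows,
       again using Python's re module as an assumed stand-in for PCRE. Only
       anchors that behave alike in both engines are used: Python's \Z
       matches only at the very end of the subject (like PCRE's \z), and
       Python has no \z or \G, so those are omitted. The log text is
       invented for the example.

```python
import re

log = "error: disk\nwarning: fan\nerror: net\n"

# ^ on its own anchors only at the start of the subject.
first_only = re.findall(r"^error", log)

# In multiline mode, ^ also matches after each internal newline.
each_line = re.findall(r"^error", log, flags=re.MULTILINE)

# \A is unaffected by multiline mode.
start_only = re.findall(r"\Aerror", log, flags=re.MULTILINE)

# $ matches at the end of the subject, and also before a final newline.
last_word = re.findall(r"\w+$", log)

# \b keeps "fan" from matching inside "fanfare".
whole_words = re.findall(r"\bfan\b", "fan fanfare fan")

print(first_only, each_line, start_only)  # ['error'] ['error', 'error'] ['error']
print(last_word)                          # ['net']
print(whole_words)                        # ['fan', 'fan']
```
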
MATCH POINT RESET

         \K          reset start of match

ALTERNATION

         expr|expr|expr...

CAPTURING

         (...)           capturing group
         (?<name>...)    named capturing group (Perl)
         (?'name'...)    named capturing group (Perl)
         (?P<name>...)   named capturing group (Python)
         (?:...)         non-capturing group
         (?|...)         non-capturing group; reset group numbers for
                          capturing groups in each alternative

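       A small sketch of how capturing, named, and non-capturing groups are
       numbered. It uses the Python-style (?P<name>...) form from the table
       and runs under Python's re module, which is an assumption of
       convenience and not PCRE itself. The date format is invented for the
       example.

```python
import re

# Two named groups, then an optional non-capturing group that wraps
# one ordinary (numbered) capturing group.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})(?:-(\d{2}))?")

full = pattern.match("2010-05-12")
short = pattern.match("2010-05")

print(pattern.groups)    # 3: (?:...) is not counted
print(full.groups())     # ('2010', '05', '12')
print(full.groupdict())  # {'year': '2010', 'month': '05'}
print(short.group(3))    # None: the optional group did not take part
```

       Named groups still receive numbers, which is why the day ends up as
       group 3 and not group 1.
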
ATOMIC GROUPS

         (?>...)         atomic, non-capturing group

COMMENT

         (?#....)        comment (not nestable)

OPTION SETTING

         (?i)            caseless
         (?J)            allow duplicate names
         (?m)            multiline
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            extended (ignore white space)
         (?-...)         unset option(s)

       The following are recognized only at the start of a pattern or after
       one of the newline-setting options with similar syntax:

         (*UTF8)         set UTF-8 mode (PCRE_UTF8)
         (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)

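       The inline options (?i), (?m), (?s), and (?x) can be illustrated with
       the sketch below. It runs under Python's re module, which is not PCRE:
       Python accepts these four letters at the start of a pattern with the
       same meaning, but it has no (?J), (?U), or (*...) settings, so those
       are not shown. The sample strings are invented for the example.

```python
import re

# (?i) caseless matching.
caseless = re.findall(r"(?i)pcre", "PCRE, Pcre, pcre")

# (?m) multiline: ^ also matches after an internal newline.
multiline = re.findall(r"(?m)^\w+", "one\ntwo")

# (?s) dotall: . also matches a newline.
without_s = re.search(r"a.b", "a\nb")
with_s = re.search(r"(?s)a.b", "a\nb")

# (?x) extended: white space and the trailing comment are ignored.
extended = re.fullmatch(r"(?x) \d{3} - \d{4}  # three digits, dash, four",
                        "555-1234")

print(caseless)                   # ['PCRE', 'Pcre', 'pcre']
print(multiline)                  # ['one', 'two']
print(without_s is None, with_s is not None)  # True True
print(extended is not None)       # True
```
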
LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)         positive look ahead
         (?!...)         negative look ahead
         (?<=...)        positive look behind
         (?<!...)        negative look behind

       Each top-level branch of a look behind must be of a fixed length.

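       The four assertions can be sketched as follows, using Python's re
       module as an assumed stand-in for PCRE (it shares this syntax and also
       requires fixed-length look behinds). The sample strings are invented
       for the example.

```python
import re

source = "f(x) y g(z)"
prices = "USD100 EUR250 USD75"

# Positive look ahead: names that are followed by "(".
calls = re.findall(r"\w+(?=\()", source)

# Negative look ahead: whole names that are not followed by "(".
not_calls = re.findall(r"\b\w+\b(?!\()", source)

# Positive look behind (fixed length 3): amounts preceded by "USD".
usd = re.findall(r"(?<=USD)\d+", prices)

# Negative look behinds: amounts not preceded by "USD". The second
# assertion stops a match from starting in the middle of a number.
other = re.findall(r"(?<!USD)(?<!\d)\d+", prices)

print(calls)      # ['f', 'g']
print(not_calls)  # ['x', 'y', 'z']
print(usd)        # ['100', '75']
print(other)      # ['250']
```

       Assertions consume no characters, so the matched text never includes
       "USD" or "(".
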
BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)

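       A backreference matches the same text that the referenced group
       matched, not the same pattern. The sketch below shows the \n and
       (?P=name) forms under Python's re module, used here as an assumed
       stand-in for PCRE. Python does not accept the \g{...} or \k forms
       inside a pattern, so they are not shown. The sample strings are
       invented for the example.

```python
import re

# \1 refers back to group 1: find doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "it was was the the end")

# (?P=quote) refers back by name: the closing quote must equal the
# opening one, whichever kind it was.
text = '''say 'hi' and "bye" now'''
quoted = [m.group(2)
          for m in re.finditer(r"""(?P<quote>['"])(.*?)(?P=quote)""", text)]

print(doubled)  # ['was', 'the']
print(quoted)   # ['hi', 'bye']
```
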
SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subpattern by absolute number
         (?+n)           call subpattern by relative number
         (?-n)           call subpattern by relative number
         (?&name)        call subpattern by name (Perl)
         (?P>name)       call subpattern by name (Python)
         \g<name>        call subpattern by name (Oniguruma)
         \g'name'        call subpattern by name (Oniguruma)
         \g<n>           call subpattern by absolute number (Oniguruma)
         \g'n'           call subpattern by absolute number (Oniguruma)
         \g<+n>          call subpattern by relative number (PCRE extension)
         \g'+n'          call subpattern by relative number (PCRE extension)
         \g<-n>          call subpattern by relative number (PCRE extension)
         \g'-n'          call subpattern by relative number (PCRE extension)

CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)...        absolute reference condition
         (?(+n)...       relative reference condition
         (?(-n)...       relative reference condition
         (?(<name>)...   named reference condition (Perl)
         (?('name')...   named reference condition (Perl)
         (?(name)...     named reference condition (PCRE)
         (?(R)...        overall recursion condition
         (?(Rn)...       specific group recursion condition
         (?(R&name)...   specific recursion condition
         (?(DEFINE)...   define subpattern for reference
         (?(assert)...   assertion condition

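       The absolute reference condition (?(n)yes-pattern|no-pattern) can be
       sketched as follows. The example runs under Python's re module, which
       is an assumption of convenience: re is not PCRE and supports only the
       (?(n)...) and (?(name)...) conditions from the list above. The
       address format is deliberately simplistic.

```python
import re

# Group 1 is an optional "<". If it took part in the match, the
# condition requires a closing ">"; otherwise it requires end of subject.
address = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = ["<user@host.com>", "user@host.com",
              "<user@host.com", "user@host.com>"]
accepted = [s for s in candidates if address.fullmatch(s)]

print(accepted)  # ['<user@host.com>', 'user@host.com']
```

       The two rejected strings have an unbalanced angle bracket, which a
       pattern without the condition could only exclude by duplicating the
       address sub-pattern in two alternatives.
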
BACKTRACKING CONTROL

       The following act as soon as they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)

       The following act only when a subsequent match failure causes a
       backtrack to reach them. They all force a match failure, but they
       differ in what happens afterwards. Those that advance the
       start-of-match point do so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*SKIP)         advance start to current matching position
         (*THEN)         local failure, backtrack to next alternation

NEWLINE CONVENTIONS

       These are recognized only at the very start of the pattern or after a
       (*BSR_...) or (*UTF8) or (*UCP) option.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence

WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after a
       (*...) option that sets the newline convention or UTF-8 or UCP mode.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence

CALLOUTS

         (?C)      callout
         (?Cn)     callout with data n

SEE ALSO

       pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

REVISION

       Last updated: 12 May 2010
       Copyright (c) 1997-2010 University of Cambridge.



                                                                 PCRESYNTAX(3)