pcresyntax(3)

PCRESYNTAX(3)              Library Functions Manual              PCRESYNTAX(3)

NAME

       PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION SYNTAX SUMMARY

       The full syntax and semantics of the regular expressions that are
       supported by PCRE are described in the pcrepattern documentation. This
       document contains just a quick-reference summary of the syntax.

QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal

CHARACTERS

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is any character
         \e         escape (hex 1B)
         \f         formfeed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \ddd       character with octal code ddd, or backreference
         \xhh       character with hex code hh
         \x{hhh..}  character with hex code hhh..

CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one byte, even in UTF-8 mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal whitespace character
         \H         a character that is not a horizontal whitespace character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a whitespace character
         \S         a character that is not a whitespace character
         \v         a vertical whitespace character
         \V         a character that is not a vertical whitespace character
         \w         a "word" character
         \W         a "non-word" character
         \X         an extended Unicode sequence

       In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
       characters, even in UTF-8 mode. However, this can be changed by setting
       the PCRE_UCP option.

GENERAL CATEGORY PROPERTIES FOR \p and \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator

PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, FF, CR
         Xwd        Perl word: property Xan or underscore

SCRIPT NAMES FOR \p AND \P

       Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
       Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
       Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari,
       Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
       Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,
       Imperial_Aramaic, Inherited, Inscriptional_Pahlavi,
       Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
       Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
       Lydian, Malayalam, Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko,
       Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic,
       Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic,
       Samaritan, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri,
       Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Tamil, Telugu,
       Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, Yi.

CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       whitespace
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE_UCP is set.
       You can use \Q...\E inside a character class.

QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy
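
       The difference between greedy and lazy quantifiers is easiest to see
       on a small subject string. The sketch below is illustrative only: it
       uses Python's re module, which is a separate engine from PCRE, on the
       assumption that it treats these particular quantifiers the same way.
       Possessive quantifiers are left out because older Python versions do
       not accept them.

```python
import re

subject = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ takes as much as it can, so the match runs to the last '>'.
greedy = re.search(r"<.+>", subject).group(0)

# Lazy: .+? takes as little as it can, so the match stops at the first '>'.
lazy = re.search(r"<.+?>", subject).group(0)

# Bounded repetition follows the same rule: {2,3} prefers three, {2,3}? two.
bounded_greedy = re.search(r"a{2,3}", "aaaa").group(0)
bounded_lazy = re.search(r"a{2,3}?", "aaaa").group(0)

print(greedy)
print(lazy)
print(bounded_greedy, bounded_lazy)
```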

ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                      also after internal newline in multiline mode
         \A          start of subject
         $           end of subject
                      also before newline at end of subject
                      also before internal newline in multiline mode
         \Z          end of subject
                      also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject
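
       A short sketch of how ^, $ and \b behave, again using Python's re
       module as a stand-in for PCRE (an assumption made for this example).
       \Z is deliberately omitted: Python gives it a different meaning (end
       of subject only), so it would not illustrate the table above.

```python
import re

subject = "first line\nsecond line\n"

# Without multiline mode, ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after each internal newline.
starts_multiline = re.findall(r"(?m)^\w+", subject)

# $ matches before the newline that ends the subject, so the word found
# is the last "line", not the one on the first line.
last_word = re.search(r"\w+$", subject)

# \b requires a word boundary on both sides, so "inline" is skipped.
whole_words = re.findall(r"\bline\b", "line inline line")

print(starts_default, starts_multiline)
print(last_word.group(0), last_word.start())
print(whole_words)
```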

MATCH POINT RESET

         \K          reset start of match

ALTERNATION

         expr|expr|expr...

CAPTURING

         (...)           capturing group
         (?<name>...)    named capturing group (Perl)
         (?'name'...)    named capturing group (Perl)
         (?P<name>...)   named capturing group (Python)
         (?:...)         non-capturing group
         (?|...)         non-capturing group; reset group numbers for
                          capturing groups in each alternative
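
       The following sketch shows how named, numbered and non-capturing
       groups interact. It uses the Python-style (?P<name>...) form from the
       table, run through Python's re module rather than PCRE itself; the
       pattern and the subject string are made up for illustration.

```python
import re

# Group 1 is named "key", the (?:...) group is not numbered, and the
# plain (...) group therefore becomes group 2.
pattern = re.compile(r"(?P<key>\w+)(?:\s*=\s*)(\w+)")

m = pattern.match("color = red")

by_name = m.group("key")   # named groups can be fetched by name
by_number = m.group(1)     # and also by their number
value = m.group(2)
all_groups = m.groups()    # the non-capturing group does not appear

print(by_name, by_number, value, all_groups)
```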

ATOMIC GROUPS

         (?>...)         atomic, non-capturing group

COMMENT

         (?#....)        comment (not nestable)

OPTION SETTING

         (?i)            caseless
         (?J)            allow duplicate names
         (?m)            multiline
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            extended (ignore white space)
         (?-...)         unset option(s)

       The following are recognized only at the start of a pattern or after
       one of the newline-setting options with similar syntax:

         (*UTF8)         set UTF-8 mode (PCRE_UTF8)
         (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
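
       The single-letter options at the top of this section can be tried out
       with the sketch below. It assumes Python's re module, which accepts
       (?i), (?s) and (?x) with the same meaning when they appear at the
       start of the pattern; (?J), (?U), (*UTF8) and (*UCP) are specific to
       PCRE and are not shown.

```python
import re

# (?i): letters match regardless of case.
caseless = re.search(r"(?i)pcre", "Uses PCRE syntax").group(0)

# Without (?s), the dot does not match a newline.
dot_default = re.search(r"a.b", "a\nb")

# With (?s) (dotall), the dot matches any character, newline included.
dot_all = re.search(r"(?s)a.b", "a\nb").group(0)

# (?x): white space in the pattern is ignored, so it can be spaced out.
extended = re.fullmatch(r"(?x) \d{3} - \d{4}", "555-1234") is not None

print(caseless, dot_default, repr(dot_all), extended)
```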

LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)         positive look ahead
         (?!...)         negative look ahead
         (?<=...)        positive look behind
         (?<!...)        negative look behind

       Each top-level branch of a look behind must be of a fixed length.
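
       The four assertions can be compared on one subject string. This is a
       sketch using Python's re module, not PCRE; Python is somewhat
       stricter about look behind (it wants the whole assertion to have one
       fixed width), so each look behind here is a single character. The
       subject string is invented for the example.

```python
import re

prices = "cost: $40, discount: 15%, total: $34"

# Positive look behind: digits immediately preceded by a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", prices)

# Positive look ahead: digits immediately followed by a percent sign.
percent = re.findall(r"\d+(?=%)", prices)

# Negative look ahead: whole numbers not followed by a percent sign.
not_percent = re.findall(r"\b\d+\b(?!%)", prices)

# Negative look behind: whole numbers not preceded by a dollar sign.
not_dollars = re.findall(r"(?<!\$)\b\d+", prices)

print(dollars, percent, not_percent, not_dollars)
```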

BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)
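
       A backreference matches the same text that the referenced group
       matched, not the group's pattern. The sketch below shows the numbered
       form \n and the Python-style named form (?P=name), run with Python's
       re module as an assumed stand-in; the \g and \k forms in the table
       are not accepted by that module in patterns and are not shown.

```python
import re

# \1 must repeat exactly what group 1 matched: this finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# (?P=q) must repeat the opening quote character, so a single quote
# inside a double-quoted string does not end the match early.
quoted = re.findall(r"(?P<q>['\"])(.*?)(?P=q)", "say 'hi' or \"it's\"")

print(doubled)
print(quoted)
```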

SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subpattern by absolute number
         (?+n)           call subpattern by relative number
         (?-n)           call subpattern by relative number
         (?&name)        call subpattern by name (Perl)
         (?P>name)       call subpattern by name (Python)
         \g<name>        call subpattern by name (Oniguruma)
         \g'name'        call subpattern by name (Oniguruma)
         \g<n>           call subpattern by absolute number (Oniguruma)
         \g'n'           call subpattern by absolute number (Oniguruma)
         \g<+n>          call subpattern by relative number (PCRE extension)
         \g'+n'          call subpattern by relative number (PCRE extension)
         \g<-n>          call subpattern by relative number (PCRE extension)
         \g'-n'          call subpattern by relative number (PCRE extension)

CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)...        absolute reference condition
         (?(+n)...       relative reference condition
         (?(-n)...       relative reference condition
         (?(<name>)...   named reference condition (Perl)
         (?('name')...   named reference condition (Perl)
         (?(name)...     named reference condition (PCRE)
         (?(R)...        overall recursion condition
         (?(Rn)...       specific group recursion condition
         (?(R&name)...   specific recursion condition
         (?(DEFINE)...   define subpattern for reference
         (?(assert)...   assertion condition
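
       The absolute reference condition (?(n)...) is the simplest form: the
       yes-pattern is used if group n has matched, otherwise the no-pattern
       (or nothing). The sketch below uses Python's re module, which accepts
       this form, as an assumed stand-in for PCRE; the recursion, DEFINE and
       assertion conditions are PCRE features that it does not cover.

```python
import re

# One-branch form: '>' is required only if group 1 matched '<'.
angled = re.compile(r"^(<)?\w+(?(1)>)$")

plain_ok = angled.match("user") is not None
angled_ok = angled.match("<user>") is not None
unclosed = angled.match("<user") is not None
unopened = angled.match("user>") is not None

# Two-branch form: ')' if group 1 matched '(', otherwise ';'.
closer = re.compile(r"^(\()?\d+(?(1)\)|;)$")

paren_ok = closer.match("(12)") is not None
semi_ok = closer.match("12;") is not None
mixed = closer.match("(12;") is not None

print(plain_ok, angled_ok, unclosed, unopened)
print(paren_ok, semi_ok, mixed)
```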

BACKTRACKING CONTROL

       The following act immediately when they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)

       The following act only when a subsequent match failure causes a
       backtrack to reach them. They all force a match failure, but they
       differ in what happens afterwards. Those that advance the
       start-of-match point do so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*SKIP)         advance start to current matching position
         (*THEN)         local failure, backtrack to next alternation

NEWLINE CONVENTIONS

       These are recognized only at the very start of the pattern or after a
       (*BSR_...) or (*UTF8) or (*UCP) option.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence

WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after a
       (*...) option that sets the newline convention or UTF-8 or UCP mode.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence

CALLOUTS

         (?C)      callout
         (?Cn)     callout with data n

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

REVISION

       Last updated: 12 May 2010
       Copyright (c) 1997-2010 University of Cambridge.