pcresyntax(3)

PCRESYNTAX(3)              Library Functions Manual              PCRESYNTAX(3)

NAME

       PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION SYNTAX SUMMARY

       The full syntax and semantics of the regular expressions that are
       supported by PCRE are described in the pcrepattern documentation. This
       document contains a quick-reference summary of the syntax.

QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal
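
       As a rough illustration of the two quoting forms, the sketch below
       uses Python's re module, which is not PCRE. It accepts \x for a
       non-alphanumeric x, but it does not accept \Q...\E, so re.escape() is
       used here as an assumed stand-in that gives the same literal treatment.

```python
import re

# \. is a literal dot; an unescaped dot would match any character.
literal_dot_hit = re.search(r"a\.b", "a.b") is not None
literal_dot_miss = re.search(r"a\.b", "axb") is not None

# Stand-in for \Q1+1=2?\E: escape every metacharacter in the string.
quoted = re.escape("1+1=2?")
quoted_match = re.search(quoted, "is 1+1=2? yes").group(0)

print(literal_dot_hit, literal_dot_miss, quoted_match)
```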

CHARACTERS

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is any ASCII character
         \e         escape (hex 1B)
         \f         form feed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \ddd       character with octal code ddd, or backreference
         \xhh       character with hex code hh
         \x{hhh..}  character with hex code hhh..

CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one data unit, even in UTF mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal white space character
         \H         a character that is not a horizontal white space character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a white space character
         \S         a character that is not a white space character
         \v         a vertical white space character
         \V         a character that is not a vertical white space character
         \w         a "word" character
         \W         a "non-word" character
         \X         a Unicode extended grapheme cluster

       In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
       characters, even in a UTF mode. However, this can be changed by setting
       the PCRE_UCP option.
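
       The difference between ASCII-only and Unicode-aware \d can be sketched
       with Python's re module. This is an analogy only: Python's defaults are
       the reverse of PCRE's, so the re.ASCII flag is assumed here to play the
       role of PCRE's default, and the absence of the flag to play the role of
       PCRE_UCP.

```python
import re

# "x", ARABIC-INDIC DIGIT THREE (U+0663), "y"
text = "x\u0663y"

# ASCII-only \d: comparable to PCRE without PCRE_UCP.
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)

# Unicode-aware \d: comparable to PCRE with PCRE_UCP set.
unicode_digits = re.findall(r"\d", text)

print(len(ascii_digits), len(unicode_digits))  # 0 1
```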

GENERAL CATEGORY PROPERTIES FOR \p and \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator

PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, FF, CR
         Xwd        Perl word: property Xan or underscore

SCRIPT NAMES FOR \p AND \P

       Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
       Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma,
       Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
       Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic,
       Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,
       Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi,
       Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
       Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
       Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive,
       Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko,
       Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic,
       Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic,
       Samaritan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng,
       Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
       Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
       Ugaritic, Vai, Yi.

CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE_UCP is set.
       You can use \Q...\E inside a character class.
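
       The bracket forms above can be tried out with Python's re module, with
       the caveat that it is not PCRE and does not accept POSIX named sets
       such as [[:xdigit:]]. In this sketch the explicit class [0-9A-Fa-f] is
       an assumed equivalent of [[:xdigit:]] for ASCII input.

```python
import re

# Range written with hex escapes: \x41-\x46 is A-F.
hex_range = re.findall(r"[\x41-\x46]+", "xxCAFEzzBAD")

# Negative class: runs of characters that are not decimal digits.
non_digits = re.findall(r"[^0-9]+", "ab12cd")

# Hand-written counterpart of [[:xdigit:]].
xdigits = re.findall(r"[0-9A-Fa-f]+", "beef-42-zz")

print(hex_range)   # ['CAFE', 'BAD']
print(non_digits)  # ['ab', 'cd']
print(xdigits)     # ['beef', '42']
```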

QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy
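
       The greedy and lazy forms can be compared in a short sketch. It uses
       Python's re module, which is not PCRE but treats these quantifiers the
       same way. Possessive forms are left out because older Python versions
       reject them.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as possible, then backs off to the last ">".
greedy = re.search(r"<.+>", subject).group(0)

# Lazy: .+? takes as little as possible, stopping at the first ">".
lazy = re.search(r"<.+?>", subject).group(0)

# Bounded repetition, greedy and lazy.
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")
bounded_lazy = re.findall(r"\d{2,3}?", "4444")

print(greedy)        # <a><b>
print(lazy)          # <a>
print(bounded)       # ['22', '333', '444']
print(bounded_lazy)  # ['44', '44']
```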

ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                      also after internal newline in multiline mode
         \A          start of subject
         $           end of subject
                      also before newline at end of subject
                      also before internal newline in multiline mode
         \Z          end of subject
                      also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject
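
       A few of these anchors can be illustrated with Python's re module,
       which is not PCRE. The sketch sticks to ^, $, \A and \b, which behave
       alike in both. \Z and \z are avoided because Python's \Z corresponds to
       PCRE's \z, and \G is not available in Python.

```python
import re

subject = "one\ntwo\n"

# ^ matches only at the start of the subject by default.
first_only = re.findall(r"^\w+", subject)

# In multiline mode ^ also matches after each internal newline.
every_line = re.findall(r"(?m)^\w+", subject)

# \A matches only at the start, even in multiline mode.
start_only = re.findall(r"(?m)\A\w+", subject)

# $ matches before a newline that ends the subject.
before_final_newline = re.search(r"two$", subject).span()

# \b keeps "cat" from matching inside "concat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(first_only, every_line, start_only)
print(before_final_newline, whole_words)
```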

MATCH POINT RESET

         \K          reset start of match

ALTERNATION

         expr|expr|expr...
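
       Alternatives are tried from left to right, and the first one that
       leads to a match is used, even if a later one would match more text.
       A minimal sketch using Python's re module, which is not PCRE but
       follows the same ordering:

```python
import re

# "a" is tried first and succeeds, so "ab" is never attempted.
first_wins = re.search(r"a|ab", "ab").group(0)

# Each match uses whichever alternative fits at that position.
animals = re.findall(r"cat|dog|bird", "dog, cat, fish, bird")

print(first_wins)  # a
print(animals)     # ['dog', 'cat', 'bird']
```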

CAPTURING

         (...)           capturing group
         (?<name>...)    named capturing group (Perl)
         (?'name'...)    named capturing group (Perl)
         (?P<name>...)   named capturing group (Python)
         (?:...)         non-capturing group
         (?|...)         non-capturing group; reset group numbers for
                          capturing groups in each alternative
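
       The effect of capturing, named, and non-capturing groups on group
       numbering can be sketched as follows. The example uses Python's re
       module, which is not PCRE; of the named forms above it accepts only
       the (?P<name>...) spelling, and it has no (?|...) group.

```python
import re

# Group 1 is named "year", the month is not captured, the day is group 2.
m = re.match(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})", "2012-11-11")

year = m.group("year")
day = m.group(2)
all_groups = m.groups()

print(year, day)   # 2012 11
print(all_groups)  # ('2012', '11')
```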

ATOMIC GROUPS

         (?>...)         atomic, non-capturing group

COMMENT

         (?#....)        comment (not nestable)

OPTION SETTING

         (?i)            caseless
         (?J)            allow duplicate names
         (?m)            multiline
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            extended (ignore white space)
         (?-...)         unset option(s)

       The following are recognized only at the start of a pattern or after
       one of the newline-setting options with similar syntax:

         (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
         (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
         (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
         (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
         (*UTF)          set appropriate UTF mode for the library in use
         (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
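
       The single-letter options (?i), (?s) and (?x) can be tried with
       Python's re module, which is not PCRE but shares these three letters.
       Two assumptions apply to this sketch: Python requires such options at
       the very start of the pattern, and it has no counterpart for (?J),
       (?U), or the (*...) items.

```python
import re

# (?i): caseless matching.
caseless = re.search(r"(?i)pcre", "PCRE 8.32").group(0)

# (?s): dot also matches newline; without it this pattern fails.
dotall_hit = re.search(r"(?s)a.b", "a\nb") is not None
dotall_miss = re.search(r"a.b", "a\nb") is not None

# (?x): white space in the pattern is ignored.
extended = re.match(r"(?x) (\d+) \s* - \s* (\d+)", "10 - 20").groups()

print(caseless)                 # PCRE
print(dotall_hit, dotall_miss)  # True False
print(extended)                 # ('10', '20')
```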

LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)         positive look ahead
         (?!...)         negative look ahead
         (?<=...)        positive look behind
         (?<!...)        negative look behind

       Each top-level branch of a look behind must be of a fixed length.
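
       Assertions test the surrounding text without consuming it. A small
       sketch using Python's re module, which is not PCRE; it is stricter
       than the rule above in that the whole look behind, not just each
       branch, must have a fixed length.

```python
import re

# Positive look ahead: numbers followed by " USD"; the suffix is not matched.
usd_amounts = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")

# Positive look behind: digits preceded by "$".
priced = re.findall(r"(?<=\$)\d+", "$5 and 7")

# Negative look behind: digits not preceded by "$".
unpriced = re.findall(r"(?<!\$)\b\d+", "$5 and 7")

print(usd_amounts)  # ['10', '30']
print(priced)       # ['5']
print(unpriced)     # ['7']
```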

BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)
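
       A backreference matches the same text that the referenced group
       matched. The sketch below uses Python's re module, which is not PCRE
       and accepts only the \n and (?P=name) forms from the list above.

```python
import re

# \1 must repeat whatever word group 1 captured.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best").group(0)

# (?P=q) must repeat the opening quote character captured as "q".
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group(0)

print(doubled)  # the the
print(quoted)   # 'hi'
```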

SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subpattern by absolute number
         (?+n)           call subpattern by relative number
         (?-n)           call subpattern by relative number
         (?&name)        call subpattern by name (Perl)
         (?P>name)       call subpattern by name (Python)
         \g<name>        call subpattern by name (Oniguruma)
         \g'name'        call subpattern by name (Oniguruma)
         \g<n>           call subpattern by absolute number (Oniguruma)
         \g'n'           call subpattern by absolute number (Oniguruma)
         \g<+n>          call subpattern by relative number (PCRE extension)
         \g'+n'          call subpattern by relative number (PCRE extension)
         \g<-n>          call subpattern by relative number (PCRE extension)
         \g'-n'          call subpattern by relative number (PCRE extension)

CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)...        absolute reference condition
         (?(+n)...       relative reference condition
         (?(-n)...       relative reference condition
         (?(<name>)...   named reference condition (Perl)
         (?('name')...   named reference condition (Perl)
         (?(name)...     named reference condition (PCRE)
         (?(R)...        overall recursion condition
         (?(Rn)...       specific group recursion condition
         (?(R&name)...   specific recursion condition
         (?(DEFINE)...   define subpattern for reference
         (?(assert)...   assertion condition
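
       The absolute reference condition (?(n)...) can be sketched with
       Python's re module, which is not PCRE and supports only the (?(n)...)
       and (?(name)...) conditions. The pattern below requires a closing
       parenthesis only if an opening one was captured by group 1.

```python
import re

# Group 1 optionally captures "("; the condition demands ")" only if it did.
pattern = re.compile(r"^(\()?\d+(?(1)\))$")

both = pattern.match("(42)") is not None
neither = pattern.match("42") is not None
open_only = pattern.match("(42") is not None
close_only = pattern.match("42)") is not None

print(both, neither, open_only, close_only)  # True True False False
```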

BACKTRACKING CONTROL

       The following act as soon as they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)
         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

       The following act only when a subsequent match failure causes a
       backtrack to reach them. They all force a match failure, but they
       differ in what happens afterwards. Those that advance the
       start-of-match point do so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
         (*SKIP)         advance to current matching position
         (*SKIP:NAME)    advance to position corresponding to an earlier
                         (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)         local failure, backtrack to next alternation
         (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)

NEWLINE CONVENTIONS

       These are recognized only at the very start of the pattern or after a
       (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence

WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after a
       (*...) option that sets the newline convention or a UTF or UCP mode.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence

CALLOUTS

         (?C)      callout
         (?Cn)     callout with data n

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

REVISION

       Last updated: 11 November 2012
       Copyright (c) 1997-2012 University of Cambridge.

PCRE 8.32                      11 November 2012                  PCRESYNTAX(3)