PCREPATTERN(3) Library Functions Manual PCREPATTERN(3)
2
3
4
6 PCRE - Perl-compatible regular expressions
7
9
10 The syntax and semantics of the regular expressions that are supported
11 by PCRE are described in detail below. There is a quick-reference syn‐
12 tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
13 semantics as closely as it can. PCRE also supports some alternative
14 regular expression syntax (which does not conflict with the Perl syn‐
15 tax) in order to provide some compatibility with regular expressions in
16 Python, .NET, and Oniguruma.
17
18 Perl's regular expressions are described in its own documentation, and
19 regular expressions in general are covered in a number of books, some
20 of which have copious examples. Jeffrey Friedl's "Mastering Regular
21 Expressions", published by O'Reilly, covers regular expressions in
22 great detail. This description of PCRE's regular expressions is
23 intended as reference material.
24
25 The original operation of PCRE was on strings of one-byte characters.
26 However, there is now also support for UTF-8 character strings. To use
27 this, you must build PCRE to include UTF-8 support, and then call
28 pcre_compile() with the PCRE_UTF8 option. How this affects pattern
29 matching is mentioned in several places below. There is also a summary
30 of UTF-8 features in the section on UTF-8 support in the main pcre
31 page.
32
33 The remainder of this document discusses the patterns that are sup‐
34 ported by PCRE when its main matching function, pcre_exec(), is used.
35 From release 6.0, PCRE offers a second matching function,
36 pcre_dfa_exec(), which matches using a different algorithm that is not
37 Perl-compatible. Some of the features discussed below are not available
38 when pcre_dfa_exec() is used. The advantages and disadvantages of the
39 alternative function, and how it differs from the normal function, are
40 discussed in the pcrematching page.
41
43
44 PCRE supports five different conventions for indicating line breaks in
45 strings: a single CR (carriage return) character, a single LF (line‐
46 feed) character, the two-character sequence CRLF, any of the three pre‐
47 ceding, or any Unicode newline sequence. The pcreapi page has further
48 discussion about newlines, and shows how to set the newline convention
49 in the options arguments for the compiling and matching functions.
50
51 It is also possible to specify a newline convention by starting a pat‐
52 tern string with one of the following five sequences:
53
54 (*CR) carriage return
55 (*LF) linefeed
56 (*CRLF) carriage return, followed by linefeed
57 (*ANYCRLF) any of the three above
58 (*ANY) all Unicode newline sequences
59
60 These override the default and the options given to pcre_compile(). For
61 example, on a Unix system where LF is the default newline sequence, the
62 pattern
63
64 (*CR)a.b
65
66 changes the convention to CR. That pattern matches "a\nb" because LF is
67 no longer a newline. Note that these special settings, which are not
68 Perl-compatible, are recognized only at the very start of a pattern,
69 and that they must be in upper case. If more than one of them is
70 present, the last one is used.
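
The rules for these leading settings (recognized only at the very start, upper case only, last one wins) can be sketched as follows. This is an illustrative model only: split_newline_prefix() is a hypothetical helper, not part of the PCRE API, and in PCRE itself the work is done inside pcre_compile(). The default of "LF" is an assumption matching the Unix example above.

```python
import re

# One leading newline setting, upper case only.
_NEWLINE_PREFIX = re.compile(r"\(\*(CR|LF|CRLF|ANYCRLF|ANY)\)")

def split_newline_prefix(pattern, default="LF"):
    """Return (convention, rest_of_pattern) for a pattern string.

    Settings are consumed only from the very start of the pattern;
    if several are present, the last one is used.
    """
    convention = default
    pos = 0
    while True:
        m = _NEWLINE_PREFIX.match(pattern, pos)
        if m is None:
            break
        convention = m.group(1)
        pos = m.end()
    return convention, pattern[pos:]

first = split_newline_prefix("(*CR)a.b")
last_wins = split_newline_prefix("(*CR)(*LF)a.b")
lower_case = split_newline_prefix("(*cr)a.b")
not_at_start = split_newline_prefix("a(*CR)b")
```

In this sketch a lower-case or misplaced setting is simply left in the pattern text and the default convention is kept.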
71
72 The newline convention does not affect what the \R escape sequence
73 matches. By default, this is any Unicode newline sequence, for Perl
74 compatibility. However, this can be changed; see the description of \R
75 in the section entitled "Newline sequences" below. A change of \R set‐
76 ting can be combined with a change of newline convention.
77
79
80 A regular expression is a pattern that is matched against a subject
81 string from left to right. Most characters stand for themselves in a
82 pattern, and match the corresponding characters in the subject. As a
83 trivial example, the pattern
84
85 The quick brown fox
86
87 matches a portion of a subject string that is identical to itself. When
88 caseless matching is specified (the PCRE_CASELESS option), letters are
89 matched independently of case. In UTF-8 mode, PCRE always understands
90 the concept of case for characters whose values are less than 128, so
91 caseless matching is always possible. For characters with higher val‐
92 ues, the concept of case is supported if PCRE is compiled with Unicode
93 property support, but not otherwise. If you want to use caseless
94 matching for characters 128 and above, you must ensure that PCRE is
95 compiled with Unicode property support as well as with UTF-8 support.
96
97 The power of regular expressions comes from the ability to include
98 alternatives and repetitions in the pattern. These are encoded in the
99 pattern by the use of metacharacters, which do not stand for themselves
100 but instead are interpreted in some special way.
101
102 There are two different sets of metacharacters: those that are recog‐
103 nized anywhere in the pattern except within square brackets, and those
104 that are recognized within square brackets. Outside square brackets,
105 the metacharacters are as follows:
106
  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier
122
123 Part of a pattern that is in square brackets is called a "character
124 class". In a character class the only metacharacters are:
125
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
           syntax)
  ]      terminates the character class
132
133 The following sections describe the use of each of the metacharacters.
134
136
137 The backslash character has several uses. Firstly, if it is followed by
138 a non-alphanumeric character, it takes away any special meaning that
139 character may have. This use of backslash as an escape character
140 applies both inside and outside character classes.
141
142 For example, if you want to match a * character, you write \* in the
143 pattern. This escaping action applies whether or not the following
144 character would otherwise be interpreted as a metacharacter, so it is
145 always safe to precede a non-alphanumeric with backslash to specify
146 that it stands for itself. In particular, if you want to match a back‐
147 slash, you write \\.
148
149 If a pattern is compiled with the PCRE_EXTENDED option, white space in
150 the pattern (other than in a character class) and characters between a
151 # outside a character class and the next newline are ignored. An escap‐
152 ing backslash can be used to include a white space or # character as
153 part of the pattern.
154
155 If you want to remove the special meaning from a sequence of charac‐
156 ters, you can do so by putting them between \Q and \E. This is differ‐
157 ent from Perl in that $ and @ are handled as literals in \Q...\E
158 sequences in PCRE, whereas in Perl, $ and @ cause variable interpola‐
159 tion. Note the following examples:
160
  Pattern            PCRE matches   Perl matches

  \Qabc$xyz\E        abc$xyz        abc followed by the
                                      contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
167
168 The \Q...\E sequence is recognized both inside and outside character
169 classes.
170
171 Non-printing characters
172
173 A second use of backslash provides a way of encoding non-printing char‐
174 acters in patterns in a visible manner. There is no restriction on the
175 appearance of non-printing characters, apart from the binary zero that
176 terminates a pattern, but when a pattern is being prepared by text
177 editing, it is usually easier to use one of the following escape
178 sequences than the binary character it represents:
179
180 \a alarm, that is, the BEL character (hex 07)
181 \cx "control-x", where x is any character
182 \e escape (hex 1B)
183 \f form feed (hex 0C)
184 \n linefeed (hex 0A)
185 \r carriage return (hex 0D)
186 \t tab (hex 09)
187 \ddd character with octal code ddd, or backreference
188 \xhh character with hex code hh
189 \x{hhh..} character with hex code hhh..
190
191 The precise effect of \cx is as follows: if x is a lower case letter,
192 it is converted to upper case. Then bit 6 of the character (hex 40) is
193 inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
194 becomes hex 7B.
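
The \cx rule is small enough to write out directly. The following sketch models it in Python; control_char() is a hypothetical name used for illustration and is not a PCRE function.

```python
def control_char(x):
    """Return the character code produced by \\cx for the character x."""
    if len(x) != 1:
        raise ValueError("expected exactly one character")
    # Lower case letters are converted to upper case first.
    code = ord(x.upper()) if "a" <= x <= "z" else ord(x)
    # Then bit 6 (hex 40) is inverted.
    return code ^ 0x40

cz = control_char("z")
c_brace = control_char("{")
c_semicolon = control_char(";")
```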
195
196 After \x, from zero to two hexadecimal digits are read (letters can be
197 in upper or lower case). Any number of hexadecimal digits may appear
198 between \x{ and }, but the value of the character code must be less
199 than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
200 the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
201 than the largest Unicode code point, which is 10FFFF.
202
203 If characters other than hexadecimal digits appear between \x{ and },
204 or if there is no terminating }, this form of escape is not recognized.
205 Instead, the initial \x will be interpreted as a basic hexadecimal
206 escape, with no following digits, giving a character whose value is
207 zero.
208
209 Characters whose value is less than 256 can be defined by either of the
210 two syntaxes for \x. There is no difference in the way they are han‐
211 dled. For example, \xdc is exactly the same as \x{dc}.
212
213 After \0 up to two further octal digits are read. If there are fewer
214 than two digits, just those that are present are used. Thus the
215 sequence \0\x\07 specifies two binary zeros followed by a BEL character
216 (code value 7). Make sure you supply two digits after the initial zero
217 if the pattern character that follows is itself an octal digit.
218
219 The handling of a backslash followed by a digit other than 0 is compli‐
220 cated. Outside a character class, PCRE reads it and any following dig‐
221 its as a decimal number. If the number is less than 10, or if there
222 have been at least that many previous capturing left parentheses in the
223 expression, the entire sequence is taken as a back reference. A
224 description of how this works is given later, following the discussion
225 of parenthesized subpatterns.
226
227 Inside a character class, or if the decimal number is greater than 9
228 and there have not been that many capturing subpatterns, PCRE re-reads
229 up to three octal digits following the backslash, and uses them to gen‐
230 erate a data character. Any subsequent digits stand for themselves. In
231 non-UTF-8 mode, the value of a character specified in octal must be
232 less than \400. In UTF-8 mode, values up to \777 are permitted. For
233 example:
234
  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
            previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
            writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the
            character with octal code 113
  \377   might be a back reference, otherwise
            the byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
            followed by the two characters "8" and "1"
249
250 Note that octal values of 100 or greater must not be introduced by a
251 leading zero, because no more than three octal digits are ever read.
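
Because the rules for a backslash followed by digits interact, it may help to see them as one decision procedure. The sketch below is a simplified model written from the description above, not PCRE's own code. interpret_escape() and its return format are hypothetical, and the upper limits on octal values (\400 and \777) are not modelled.

```python
OCTAL = "01234567"

def interpret_escape(digits, previous_groups, in_class=False):
    """Classify a backslash followed by the decimal digit run `digits`.

    Returns (kind, value, tail): kind is "backref" or "char", value is
    the group number or character code, and tail holds any digits that
    are left over and stand for themselves.
    """
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected one or more decimal digits")

    def octal_prefix(limit):
        # Read at most `limit` octal digits from the start of the run.
        n = 0
        while n < limit and n < len(digits) and digits[n] in OCTAL:
            n += 1
        value = int(digits[:n], 8) if n else 0
        return ("char", value, digits[n:])

    if digits[0] == "0":
        # \0 plus up to two further octal digits.
        return octal_prefix(3)
    if not in_class:
        number = int(digits)
        if number < 10 or number <= previous_groups:
            return ("backref", number, "")
    # Inside a class, or too few groups: re-read up to three octal digits.
    return octal_prefix(3)

tab_or_ref = interpret_escape("11", previous_groups=3)
plain_ref = interpret_escape("11", previous_groups=11)
eighty_one = interpret_escape("81", previous_groups=0)
```

With three capturing groups, \11 falls through to the octal reading and yields a tab (code 9); with eleven groups it is a back reference.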
252
253 All the sequences that define a single character value can be used both
254 inside and outside character classes. In addition, inside a character
255 class, the sequence \b is interpreted as the backspace character (hex
256 08), and the sequences \R and \X are interpreted as the characters "R"
257 and "X", respectively. Outside a character class, these sequences have
258 different meanings (see below).
259
260 Absolute and relative back references
261
262 The sequence \g followed by an unsigned or a negative number, option‐
263 ally enclosed in braces, is an absolute or relative back reference. A
264 named back reference can be coded as \g{name}. Back references are dis‐
265 cussed later, following the discussion of parenthesized subpatterns.
266
267 Absolute and relative subroutine calls
268
269 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
270 name or a number enclosed either in angle brackets or single quotes, is
271 an alternative syntax for referencing a subpattern as a "subroutine".
272 Details are discussed later. Note that \g{...} (Perl syntax) and
273 \g<...> (Oniguruma syntax) are not synonymous. The former is a back
274 reference; the latter is a subroutine call.
275
276 Generic character types
277
278 Another use of backslash is for specifying generic character types. The
279 following are always recognized:
280
281 \d any decimal digit
282 \D any character that is not a decimal digit
283 \h any horizontal white space character
284 \H any character that is not a horizontal white space character
285 \s any white space character
286 \S any character that is not a white space character
287 \v any vertical white space character
288 \V any character that is not a vertical white space character
289 \w any "word" character
290 \W any "non-word" character
291
292 Each pair of escape sequences partitions the complete set of characters
293 into two disjoint sets. Any given character matches one, and only one,
294 of each pair.
295
296 These character type sequences can appear both inside and outside char‐
297 acter classes. They each match one character of the appropriate type.
298 If the current matching point is at the end of the subject string, all
299 of them fail, since there is no character to match.
300
301 For compatibility with Perl, \s does not match the VT character (code
11). This makes it different from the POSIX "space" class. The \s
303 characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
304 "use locale;" is included in a Perl script, \s may match the VT charac‐
305 ter. In PCRE, it never does.
306
307 In UTF-8 mode, characters with values greater than 128 never match \d,
308 \s, or \w, and always match \D, \S, and \W. This is true even when Uni‐
309 code character property support is available. These sequences retain
310 their original meanings from before UTF-8 support was available, mainly
311 for efficiency reasons.
312
313 The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
314 the other sequences, these do match certain high-valued codepoints in
315 UTF-8 mode. The horizontal space characters are:
316
317 U+0009 Horizontal tab
318 U+0020 Space
319 U+00A0 Non-break space
320 U+1680 Ogham space mark
321 U+180E Mongolian vowel separator
322 U+2000 En quad
323 U+2001 Em quad
324 U+2002 En space
325 U+2003 Em space
326 U+2004 Three-per-em space
327 U+2005 Four-per-em space
328 U+2006 Six-per-em space
329 U+2007 Figure space
330 U+2008 Punctuation space
331 U+2009 Thin space
332 U+200A Hair space
333 U+202F Narrow no-break space
334 U+205F Medium mathematical space
335 U+3000 Ideographic space
336
337 The vertical space characters are:
338
339 U+000A Linefeed
340 U+000B Vertical tab
341 U+000C Form feed
342 U+000D Carriage return
343 U+0085 Next line
344 U+2028 Line separator
345 U+2029 Paragraph separator
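
For engines that lack \h and \v, the two lists above can be turned into explicit character classes. The sketch below does this with Python's re module, which is a different engine from PCRE and does not support \h or \v itself. The names HSPACE, VSPACE, H_CLASS and V_CLASS are chosen here for illustration.

```python
import re

# Code points copied from the two lists above.
HSPACE = ({0x0009, 0x0020, 0x00A0, 0x1680, 0x180E}
          | set(range(0x2000, 0x200B))        # U+2000 .. U+200A
          | {0x202F, 0x205F, 0x3000})
VSPACE = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}

def as_class(code_points):
    """Build a character class matching exactly the given code points."""
    return "[" + "".join(re.escape(chr(c)) for c in sorted(code_points)) + "]"

H_CLASS = as_class(HSPACE)   # stands in for \h
V_CLASS = as_class(VSPACE)   # stands in for \v

fields = re.split(H_CLASS + "+", "a\u00a0\tb\u3000c")
```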
346
347 A "word" character is an underscore or any character less than 256 that
348 is a letter or digit. The definition of letters and digits is con‐
349 trolled by PCRE's low-valued character tables, and may vary if locale-
350 specific matching is taking place (see "Locale support" in the pcreapi
351 page). For example, in a French locale such as "fr_FR" in Unix-like
352 systems, or "french" in Windows, some character codes greater than 128
353 are used for accented letters, and these are matched by \w. The use of
354 locales with Unicode is discouraged.
355
356 Newline sequences
357
358 Outside a character class, by default, the escape sequence \R matches
359 any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
360 mode \R is equivalent to the following:
361
362 (?>\r\n|\n|\x0b|\f|\r|\x85)
363
364 This is an example of an "atomic group", details of which are given
365 below. This particular group matches either the two-character sequence
366 CR followed by LF, or one of the single characters LF (linefeed,
367 U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car‐
368 riage return, U+000D), or NEL (next line, U+0085). The two-character
369 sequence is treated as a single unit that cannot be split.
370
371 In UTF-8 mode, two additional characters whose codepoints are greater
372 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa‐
373 rator, U+2029). Unicode character property support is not needed for
374 these characters to be recognized.
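
The set of sequences matched by \R in UTF-8 mode can be approximated in Python's re module, which has no \R of its own. This is a sketch under two assumptions: the atomic group is replaced by ordinary alternation with CRLF listed first, so a surrounding pattern that backtracks could still split a CRLF pair, and NEWLINE_SEQ is a name invented for the example.

```python
import re

# CRLF first, then the single-character line endings, including
# LS (U+2028) and PS (U+2029) from UTF-8 mode.
NEWLINE_SEQ = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

lines = NEWLINE_SEQ.split("one\r\ntwo\nthree\u2028four\x85five")
pieces = NEWLINE_SEQ.findall("\r\n\r")
```

Here a CRLF pair counts as one line ending, while the isolated CR that follows it counts as another.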
375
376 It is possible to restrict \R to match only CR, LF, or CRLF (instead of
377 the complete set of Unicode line endings) by setting the option
378 PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
(BSR is an abbreviation for "backslash R".) This can be made the default
380 when PCRE is built; if this is the case, the other behaviour can be
381 requested via the PCRE_BSR_UNICODE option. It is also possible to
382 specify these settings by starting a pattern string with one of the
383 following sequences:
384
385 (*BSR_ANYCRLF) CR, LF, or CRLF only
386 (*BSR_UNICODE) any Unicode newline sequence
387
388 These override the default and the options given to pcre_compile(), but
389 they can be overridden by options given to pcre_exec(). Note that these
390 special settings, which are not Perl-compatible, are recognized only at
391 the very start of a pattern, and that they must be in upper case. If
392 more than one of them is present, the last one is used. They can be
393 combined with a change of newline convention, for example, a pattern
394 can start with:
395
396 (*ANY)(*BSR_ANYCRLF)
397
398 Inside a character class, \R matches the letter "R".
399
400 Unicode character properties
401
When PCRE is built with Unicode character property support, three
additional escape sequences that match characters with specific
properties are available. They also work when not in UTF-8 mode, but
are then limited to testing characters whose codepoints are less than
256. The extra escape sequences are:
407
408 \p{xx} a character with the xx property
409 \P{xx} a character without the xx property
410 \X an extended Unicode sequence
411
412 The property names represented by xx above are limited to the Unicode
413 script names, the general category properties, and "Any", which matches
414 any character (including newline). Other properties such as "InMusical‐
415 Symbols" are not currently supported by PCRE. Note that \P{Any} does
416 not match any characters, so always causes a match failure.
417
418 Sets of Unicode characters are defined as belonging to certain scripts.
419 A character from one of these sets can be matched using a script name.
420 For example:
421
422 \p{Greek}
423 \P{Han}
424
425 Those that are not part of an identified script are lumped together as
426 "Common". The current list of scripts is:
427
428 Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
429 Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform,
430 Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
431 Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira‐
432 gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin,
433 Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko,
434 Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
435 Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
436 Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
437
438 Each character has exactly one general category property, specified by
439 a two-letter abbreviation. For compatibility with Perl, negation can be
440 specified by including a circumflex between the opening brace and the
441 property name. For example, \p{^Lu} is the same as \P{Lu}.
442
443 If only one letter is specified with \p or \P, it includes all the gen‐
444 eral category properties that start with that letter. In this case, in
445 the absence of negation, the curly brackets in the escape sequence are
446 optional; these two examples have the same effect:
447
448 \p{L}
449 \pL
450
451 The following general category property codes are supported:
452
453 C Other
454 Cc Control
455 Cf Format
456 Cn Unassigned
457 Co Private use
458 Cs Surrogate
459
460 L Letter
461 Ll Lower case letter
462 Lm Modifier letter
463 Lo Other letter
464 Lt Title case letter
465 Lu Upper case letter
466
467 M Mark
468 Mc Spacing mark
469 Me Enclosing mark
470 Mn Non-spacing mark
471
472 N Number
473 Nd Decimal number
474 Nl Letter number
475 No Other number
476
477 P Punctuation
478 Pc Connector punctuation
479 Pd Dash punctuation
480 Pe Close punctuation
481 Pf Final punctuation
482 Pi Initial punctuation
483 Po Other punctuation
484 Ps Open punctuation
485
486 S Symbol
487 Sc Currency symbol
488 Sk Modifier symbol
489 Sm Mathematical symbol
490 So Other symbol
491
492 Z Separator
493 Zl Line separator
494 Zp Paragraph separator
495 Zs Space separator
496
497 The special property L& is also supported: it matches a character that
498 has the Lu, Ll, or Lt property, in other words, a letter that is not
499 classified as a modifier or "other".
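
The general category tests can be modelled with Python's standard unicodedata module. This is a sketch for illustration, not PCRE's implementation: has_property() is a hypothetical helper, script names such as Greek are not handled, and Python's Unicode tables may be a different version from those built into a given PCRE.

```python
import unicodedata

def has_property(ch, prop):
    """Model \\p{prop} for general categories, "Any", "L&" and ^ negation."""
    negate = prop.startswith("^")
    if negate:
        prop = prop[1:]
    category = unicodedata.category(ch)      # two-letter code, e.g. "Lu"
    if prop == "Any":
        result = True
    elif prop == "L&":
        result = category in ("Lu", "Ll", "Lt")
    else:
        # A one-letter name covers every category starting with it.
        result = category.startswith(prop)
    return result != negate

upper = has_property("A", "Lu")
any_letter = has_property("a", "L")
negated = has_property("a", "^Lu")           # same as \P{Lu}
cased = has_property("\u01c5", "L&")         # a title case letter
modifier = has_property("\u02b0", "L&")      # a modifier letter
```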
500
501 The Cs (Surrogate) property applies only to characters in the range
502 U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
503 RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check‐
504 ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in
505 the pcreapi page).
506
507 The long synonyms for these properties that Perl supports (such as
508 \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
509 any of these properties with "Is".
510
511 No character that is in the Unicode table has the Cn (unassigned) prop‐
512 erty. Instead, this property is assumed for any code point that is not
513 in the Unicode table.
514
515 Specifying caseless matching does not affect these escape sequences.
516 For example, \p{Lu} always matches only upper case letters.
517
518 The \X escape matches any number of Unicode characters that form an
519 extended Unicode sequence. \X is equivalent to
520
521 (?>\PM\pM*)
522
523 That is, it matches a character without the "mark" property, followed
524 by zero or more characters with the "mark" property, and treats the
525 sequence as an atomic group (see below). Characters with the "mark"
526 property are typically accents that affect the preceding character.
527 None of them have codepoints less than 256, so in non-UTF-8 mode \X
528 matches any one character.
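
The grouping that \X performs can be imitated by walking a string and attaching each mark to the character before it. The sketch below uses Python's unicodedata module; extended_sequences() is a hypothetical helper. One simplification to note: a mark at the very start of the string is returned as an item of its own, whereas \X itself would not match at that position.

```python
import unicodedata

def extended_sequences(text):
    """Split text into a non-mark character plus its following marks."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and is_mark:
            clusters[-1] += ch       # \pM* : extend the current sequence
        else:
            clusters.append(ch)      # \PM  : start a new sequence
    return clusters

# "e" with an acute accent, "a" with two marks, then a plain "b".
parts = extended_sequences("e\u0301a\u0308\u0323b")
```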
529
530 Matching characters by Unicode property is not fast, because PCRE has
531 to search a structure that contains data for over fifteen thousand
532 characters. That is why the traditional escape sequences such as \d and
533 \w do not use Unicode properties in PCRE.
534
535 Resetting the match start
536
537 The escape sequence \K, which is a Perl 5.10 feature, causes any previ‐
538 ously matched characters not to be included in the final matched
539 sequence. For example, the pattern:
540
541 foo\Kbar
542
543 matches "foobar", but reports that it has matched "bar". This feature
544 is similar to a lookbehind assertion (described below). However, in
545 this case, the part of the subject before the real match does not have
546 to be of fixed length, as lookbehind assertions do. The use of \K does
547 not interfere with the setting of captured substrings. For example,
548 when the pattern
549
550 (foo)\Kbar
551
552 matches "foobar", the first substring is still set to "foo".
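
Python's re module does not support \K, so the sketch below only imitates the reported result, as an aid to comparing \K with a lookbehind. The fixed-length case uses a real lookbehind; the variable-length case, which a lookbehind cannot express in that engine, is imitated by capturing the part after the prefix and reporting only that group.

```python
import re

# Fixed-length prefix: a lookbehind reports the same text as foo\Kbar.
fixed = re.search(r"(?<=foo)bar", "foobar")

# Variable-length prefix, as in fo+\Kbar: capture the tail and report
# only group 1 instead of the whole match.
variable = re.search(r"fo+(bar)", "fooobar")

reported_fixed = (fixed.group(), fixed.span())
reported_variable = (variable.group(1), variable.span(1))
```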
553
554 Simple assertions
555
556 The final use of backslash is for certain simple assertions. An asser‐
557 tion specifies a condition that has to be met at a particular point in
558 a match, without consuming any characters from the subject string. The
559 use of subpatterns for more complicated assertions is described below.
560 The backslashed assertions are:
561
  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at the start of the subject
  \Z     matches at the end of the subject
          also matches before a newline at the end of the subject
  \z     matches only at the end of the subject
  \G     matches at the first matching position in the subject
569
570 These assertions may not appear in character classes (but note that \b
571 has a different meaning, namely the backspace character, inside a char‐
572 acter class).
573
574 A word boundary is a position in the subject string where the current
575 character and the previous character do not both match \w or \W (i.e.
576 one matches \w and the other matches \W), or the start or end of the
577 string if the first or last character matches \w, respectively.
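
The definition of a word boundary can be written as a short predicate. The sketch below assumes the default tables, in which a "word" character is an ASCII letter, digit, or underscore. is_word_boundary() is a name invented for this example, and the cross-check uses Python's re module in ASCII mode, which follows the same definition.

```python
import re

WORD_CHARS = set("abcdefghijklmnopqrstuvwxyz"
                 "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                 "0123456789_")

def is_word_boundary(subject, pos):
    """True if position pos (0..len) lies at a word boundary."""
    before = pos > 0 and subject[pos - 1] in WORD_CHARS
    after = pos < len(subject) and subject[pos] in WORD_CHARS
    # Exactly one side is a word character; the start and end of the
    # string count as non-word neighbours.
    return before != after

subject = "hi, there_1!"
boundaries = [p for p in range(len(subject) + 1)
              if is_word_boundary(subject, p)]
from_regex = sorted(m.start()
                    for m in re.finditer(r"\b", subject, flags=re.ASCII))
```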
578
579 The \A, \Z, and \z assertions differ from the traditional circumflex
580 and dollar (described in the next section) in that they only ever match
581 at the very start and end of the subject string, whatever options are
582 set. Thus, they are independent of multiline mode. These three asser‐
583 tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
584 affect only the behaviour of the circumflex and dollar metacharacters.
585 However, if the startoffset argument of pcre_exec() is non-zero, indi‐
586 cating that matching is to start at a point other than the beginning of
587 the subject, \A can never match. The difference between \Z and \z is
588 that \Z matches before a newline at the end of the string as well as at
589 the very end, whereas \z matches only at the end.
590
The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the startoffset argument
of pcre_exec(). It differs from \A when the value of startoffset is
non-zero. By calling pcre_exec() multiple times with appropriate
arguments, you can mimic Perl's /g option, and it is in this kind of
repeated matching that \G is useful.
597
598 Note, however, that PCRE's interpretation of \G, as the start of the
599 current match, is subtly different from Perl's, which defines it as the
600 end of the previous match. In Perl, these can be different when the
601 previously matched string was empty. Because PCRE does just one match
602 at a time, it cannot reproduce this behaviour.
603
604 If all the alternatives of a pattern begin with \G, the expression is
605 anchored to the starting match position, and the "anchored" flag is set
606 in the compiled regular expression.
607
609
610 Outside a character class, in the default matching mode, the circumflex
611 character is an assertion that is true only if the current matching
612 point is at the start of the subject string. If the startoffset argu‐
613 ment of pcre_exec() is non-zero, circumflex can never match if the
614 PCRE_MULTILINE option is unset. Inside a character class, circumflex
615 has an entirely different meaning (see below).
616
617 Circumflex need not be the first character of the pattern if a number
618 of alternatives are involved, but it should be the first thing in each
619 alternative in which it appears if the pattern is ever to match that
620 branch. If all possible alternatives start with a circumflex, that is,
621 if the pattern is constrained to match only at the start of the sub‐
622 ject, it is said to be an "anchored" pattern. (There are also other
623 constructs that can cause a pattern to be anchored.)
624
625 A dollar character is an assertion that is true only if the current
626 matching point is at the end of the subject string, or immediately
627 before a newline at the end of the string (by default). Dollar need not
628 be the last character of the pattern if a number of alternatives are
629 involved, but it should be the last item in any branch in which it
630 appears. Dollar has no special meaning in a character class.
631
632 The meaning of dollar can be changed so that it matches only at the
633 very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
634 compile time. This does not affect the \Z assertion.
635
636 The meanings of the circumflex and dollar characters are changed if the
637 PCRE_MULTILINE option is set. When this is the case, a circumflex
638 matches immediately after internal newlines as well as at the start of
639 the subject string. It does not match after a newline that ends the
640 string. A dollar matches before any newlines in the string, as well as
641 at the very end, when PCRE_MULTILINE is set. When newline is specified
642 as the two-character sequence CRLF, isolated CR and LF characters do
643 not indicate newlines.
644
645 For example, the pattern /^abc$/ matches the subject string "def\nabc"
646 (where \n represents a newline) in multiline mode, but not otherwise.
647 Consequently, patterns that are anchored in single line mode because
648 all branches start with ^ are not anchored in multiline mode, and a
649 match for circumflex is possible when the startoffset argument of
650 pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
651 PCRE_MULTILINE is set.
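
The /^abc$/ example can be tried in any engine that shares these semantics. The sketch below uses Python's re module rather than PCRE; re.MULTILINE plays the role of PCRE_MULTILINE here, and the two engines agree on the cases shown, though not necessarily on every detail described above.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the string.
before_final_newline = re.search(r"abc$", "abc\n")
```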
652
653 Note that the sequences \A, \Z, and \z can be used to match the start
654 and end of the subject in both modes, and if all branches of a pattern
655 start with \A it is always anchored, whether or not PCRE_MULTILINE is
656 set.
657
659
660 Outside a character class, a dot in the pattern matches any one charac‐
661 ter in the subject string except (by default) a character that signi‐
662 fies the end of a line. In UTF-8 mode, the matched character may be
663 more than one byte long.
664
665 When a line ending is defined as a single character, dot never matches
666 that character; when the two-character sequence CRLF is used, dot does
667 not match CR if it is immediately followed by LF, but otherwise it
668 matches all characters (including isolated CRs and LFs). When any Uni‐
669 code line endings are being recognized, dot does not match CR or LF or
670 any of the other line ending characters.
671
672 The behaviour of dot with regard to newlines can be changed. If the
673 PCRE_DOTALL option is set, a dot matches any one character, without
674 exception. If the two-character sequence CRLF is present in the subject
675 string, it takes two dots to match it.
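
The effect of the dot-matches-everything setting can be seen in a small experiment. The sketch below uses Python's re module, not PCRE; re.DOTALL corresponds to PCRE_DOTALL, and that engine always treats LF alone as the line ending for dot.

```python
import re

# Without DOTALL, dot refuses to match the linefeed.
default_dot = re.search(r"a.b", "a\nb")

# With DOTALL, dot matches any one character.
dotall = re.search(r"a.b", "a\nb", re.DOTALL)

# CRLF is two characters, so one dot is not enough even with DOTALL.
one_dot = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
two_dots = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)
```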
676
677 The handling of dot is entirely independent of the handling of circum‐
678 flex and dollar, the only relationship being that they both involve
679 newlines. Dot has no special meaning in a character class.
680
682
683 Outside a character class, the escape sequence \C matches any one byte,
684 both in and out of UTF-8 mode. Unlike a dot, it always matches any
685 line-ending characters. The feature is provided in Perl in order to
686 match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char‐
687 acters into individual bytes, what remains in the string may be a mal‐
688 formed UTF-8 string. For this reason, the \C escape sequence is best
689 avoided.
690
691 PCRE does not allow \C to appear in lookbehind assertions (described
692 below), because in UTF-8 mode this would make it impossible to calcu‐
693 late the length of the lookbehind.
694
696
697 An opening square bracket introduces a character class, terminated by a
698 closing square bracket. A closing square bracket on its own is not spe‐
699 cial. If a closing square bracket is required as a member of the class,
700 it should be the first data character in the class (after an initial
701 circumflex, if present) or escaped with a backslash.
702
703 A character class matches a single character in the subject. In UTF-8
704 mode, the character may occupy more than one byte. A matched character
705 must be in the set of characters defined by the class, unless the first
706 character in the class definition is a circumflex, in which case the
707 subject character must not be in the set defined by the class. If a
708 circumflex is actually required as a member of the class, ensure it is
709 not the first character, or escape it with a backslash.
710
711 For example, the character class [aeiou] matches any lower case vowel,
712 while [^aeiou] matches any character that is not a lower case vowel.
713 Note that a circumflex is just a convenient notation for specifying the
714 characters that are in the class by enumerating those that are not. A
715 class that starts with a circumflex is not an assertion: it still con‐
716 sumes a character from the subject string, and therefore it fails if
717 the current pointer is at the end of the string.
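
The point that a negated class still consumes a character can be seen
in a minimal sketch. Python's re module is used here as a stand-in for
PCRE (an assumption of the example); the subject strings are invented
for illustration.

```python
import re

vowel = re.fullmatch(r"[aeiou]", "e")        # "e" is in the set
not_vowel = re.fullmatch(r"[^aeiou]", "x")   # "x" is not in the set

# "q[^u]" needs some character after the "q". At the end of the
# subject there is nothing left to consume, so the match fails.
at_end = re.search(r"q[^u]", "Iraq")
followed = re.search(r"q[^u]", "Iraqi")

print(vowel is not None, not_vowel is not None)
print(at_end, followed.group())
```

If the intent is "q not followed by u, or q at the very end", a
negative lookahead assertion (described later) is the usual tool,
because it does not consume a character.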
718
719 In UTF-8 mode, characters with values greater than 255 can be included
720 in a class as a literal string of bytes, or by using the \x{ escaping
721 mechanism.
722
723 When caseless matching is set, any letters in a class represent both
724 their upper case and lower case versions, so for example, a caseless
725 [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
726 match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
727 understands the concept of case for characters whose values are less
728 than 128, so caseless matching is always possible. For characters with
729 higher values, the concept of case is supported if PCRE is compiled
730 with Unicode property support, but not otherwise. If you want to use
731 caseless matching for characters 128 and above, you must ensure that
732 PCRE is compiled with Unicode property support as well as with UTF-8
733 support.
734
735 Characters that might indicate line breaks are never treated in any
736 special way when matching character classes, whatever line-ending
737 sequence is in use, and whatever setting of the PCRE_DOTALL and
738 PCRE_MULTILINE options is used. A class such as [^a] always matches one
739 of these characters.
740
741 The minus (hyphen) character can be used to specify a range of charac‐
742 ters in a character class. For example, [d-m] matches any letter
743 between d and m, inclusive. If a minus character is required in a
744 class, it must be escaped with a backslash or appear in a position
745 where it cannot be interpreted as indicating a range, typically as the
746 first or last character in the class.
747
748 It is not possible to have the literal character "]" as the end charac‐
749 ter of a range. A pattern such as [W-]46] is interpreted as a class of
750 two characters ("W" and "-") followed by a literal string "46]", so it
751 would match "W46]" or "-46]". However, if the "]" is escaped with a
752 backslash it is interpreted as the end of range, so [W-\]46] is inter‐
753 preted as a class containing a range followed by two other characters.
754 The octal or hexadecimal representation of "]" can also be used to end
755 a range.
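
The hyphen and closing-bracket rules above can be checked with a small
sketch. It assumes Python's re module, which follows the same rules for
these particular classes, rather than PCRE itself.

```python
import re

# A plain range: only the letters d through m are accepted.
in_range = [c for c in "abdgmnz" if re.fullmatch(r"[d-m]", c)]

# [W-]46] is the class [W-] followed by the literal string "46]".
unescaped = [s for s in ("W46]", "-46]", "X46]")
             if re.fullmatch(r"[W-]46]", s)]

# [W-\]46] is a single class: the range W..] plus "4" and "6".
escaped = [c for c in "WZ]46-" if re.fullmatch(r"[W-\]46]", c)]

print(in_range)
print(unescaped)
print(escaped)
```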
756
757 Ranges operate in the collating sequence of character values. They can
758 also be used for characters specified numerically, for example
759 [\000-\037]. In UTF-8 mode, ranges can include characters whose values
760 are greater than 255, for example [\x{100}-\x{2ff}].
761
If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
character tables for a French locale are in use, [\xc8-\xcb] matches
accented E characters in both cases. In UTF-8 mode, PCRE supports the
concept of case for characters with values of 128 or greater only when
it is compiled with Unicode property support.
769
770 The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
771 in a character class, and add the characters that they match to the
772 class. For example, [\dABCDEF] matches any hexadecimal digit. A circum‐
773 flex can conveniently be used with the upper case character types to
774 specify a more restricted set of characters than the matching lower
775 case type. For example, the class [^\W_] matches any letter or digit,
776 but not underscore.
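
A short sketch of the two classes just mentioned. Python's re module is
assumed in place of PCRE, and the re.ASCII flag is added so that \d and
\W are restricted to ASCII, which is closer to the non-UTF-8 behaviour
described here.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

# Upper case hexadecimal digits only; "G" and "a" are rejected.
hex_hits = [c for c in "09AFGa" if hex_digit.fullmatch(c)]

# Word characters minus the underscore.
word_hits = [c for c in "aZ5_- " if letter_or_digit.fullmatch(c)]

print(hex_hits)
print(word_hits)
```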
777
778 The only metacharacters that are recognized in character classes are
779 backslash, hyphen (only where it can be interpreted as specifying a
780 range), circumflex (only at the start), opening square bracket (only
781 when it can be interpreted as introducing a POSIX class name - see the
782 next section), and the terminating closing square bracket. However,
783 escaping other non-alphanumeric characters does no harm.
784
786
787 Perl supports the POSIX notation for character classes. This uses names
788 enclosed by [: and :] within the enclosing square brackets. PCRE also
789 supports this notation. For example,
790
791 [01[:alpha:]%]
792
793 matches "0", "1", any alphabetic character, or "%". The supported class
794 names are
795
796 alnum letters and digits
797 alpha letters
798 ascii character codes 0 - 127
799 blank space or tab only
800 cntrl control characters
801 digit decimal digits (same as \d)
802 graph printing characters, excluding space
803 lower lower case letters
804 print printing characters, including space
805 punct printing characters, excluding letters and digits
806 space white space (not quite the same as \s)
807 upper upper case letters
808 word "word" characters (same as \w)
809 xdigit hexadecimal digits
810
811 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
812 and space (32). Notice that this list includes the VT character (code
813 11). This makes "space" different to \s, which does not include VT (for
814 Perl compatibility).
815
816 The name "word" is a Perl extension, and "blank" is a GNU extension
817 from Perl 5.8. Another Perl extension is negation, which is indicated
818 by a ^ character after the colon. For example,
819
820 [12[:^digit:]]
821
822 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
823 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
824 these are not supported, and an error is given if they are encountered.
825
In UTF-8 mode, characters with values of 128 or greater do not match
any of the POSIX character classes.
828
830
831 Vertical bar characters are used to separate alternative patterns. For
832 example, the pattern
833
834 gilbert|sullivan
835
836 matches either "gilbert" or "sullivan". Any number of alternatives may
837 appear, and an empty alternative is permitted (matching the empty
838 string). The matching process tries each alternative in turn, from left
839 to right, and the first one that succeeds is used. If the alternatives
840 are within a subpattern (defined below), "succeeds" means matching the
841 rest of the main pattern as well as the alternative in the subpattern.
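
The left-to-right rule can be illustrated with a minimal sketch, again
assuming Python's re module as a stand-in for PCRE. The patterns a|ab
and (a|ab)c are invented for the illustration.

```python
import re

names = [s for s in ("gilbert", "sullivan", "gilbertsullivan")
         if re.fullmatch(r"gilbert|sullivan", s)]

# The first alternative that succeeds is used, even though the
# second one would have matched more of the subject.
first = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# "a" followed by "c" fails here, so the "ab" alternative is used.
inner = re.fullmatch(r"(a|ab)c", "abc").group(1)

print(names, first, inner)
```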
842
844
845 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
846 PCRE_EXTENDED options (which are Perl-compatible) can be changed from
847 within the pattern by a sequence of Perl option letters enclosed
848 between "(?" and ")". The option letters are
849
850 i for PCRE_CASELESS
851 m for PCRE_MULTILINE
852 s for PCRE_DOTALL
853 x for PCRE_EXTENDED
854
855 For example, (?im) sets caseless, multiline matching. It is also possi‐
856 ble to unset these options by preceding the letter with a hyphen, and a
857 combined setting and unsetting such as (?im-sx), which sets PCRE_CASE‐
858 LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
859 is also permitted. If a letter appears both before and after the
860 hyphen, the option is unset.
861
862 The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
863 can be changed in the same way as the Perl-compatible options by using
864 the characters J, U and X respectively.
865
866 When an option change occurs at top level (that is, not inside subpat‐
867 tern parentheses), the change applies to the remainder of the pattern
868 that follows. If the change is placed right at the start of a pattern,
869 PCRE extracts it into the global options (and it will therefore show up
870 in data extracted by the pcre_fullinfo() function).
871
872 An option change within a subpattern (see below for a description of
873 subpatterns) affects only that part of the current pattern that follows
874 it, so
875
876 (a(?i)b)c
877
878 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
879 used). By this means, options can be made to have different settings
880 in different parts of the pattern. Any changes made in one alternative
881 do carry on into subsequent branches within the same subpattern. For
882 example,
883
884 (a(?i)b|c)
885
886 matches "ab", "aB", "c", and "C", even though when matching "C" the
887 first branch is abandoned before the option setting. This is because
888 the effects of option settings happen at compile time. There would be
889 some very weird behaviour otherwise.
890
891 Note: There are other PCRE-specific options that can be set by the
892 application when the compile or match functions are called. In some
893 cases the pattern can contain special leading sequences to override
894 what the application has set or what has been defaulted. Details are
895 given in the section entitled "Newline sequences" above.
896
898
899 Subpatterns are delimited by parentheses (round brackets), which can be
900 nested. Turning part of a pattern into a subpattern does two things:
901
902 1. It localizes a set of alternatives. For example, the pattern
903
904 cat(aract|erpillar|)
905
906 matches one of the words "cat", "cataract", or "caterpillar". Without
907 the parentheses, it would match "cataract", "erpillar" or an empty
908 string.
909
910 2. It sets up the subpattern as a capturing subpattern. This means
911 that, when the whole pattern matches, that portion of the subject
912 string that matched the subpattern is passed back to the caller via the
913 ovector argument of pcre_exec(). Opening parentheses are counted from
914 left to right (starting from 1) to obtain numbers for the capturing
915 subpatterns.
916
917 For example, if the string "the red king" is matched against the pat‐
918 tern
919
920 the ((red|white) (king|queen))
921
922 the captured substrings are "red king", "red", and "king", and are num‐
923 bered 1, 2, and 3, respectively.
924
925 The fact that plain parentheses fulfil two functions is not always
926 helpful. There are often times when a grouping subpattern is required
927 without a capturing requirement. If an opening parenthesis is followed
928 by a question mark and a colon, the subpattern does not do any captur‐
929 ing, and is not counted when computing the number of any subsequent
930 capturing subpatterns. For example, if the string "the white queen" is
931 matched against the pattern
932
933 the ((?:red|white) (king|queen))
934
935 the captured substrings are "white queen" and "queen", and are numbered
936 1 and 2. The maximum number of capturing subpatterns is 65535.
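
The numbering of captured substrings in the examples above can be
reproduced with a short sketch. It assumes Python's re module, whose
groups() method plays the role that the ovector argument of pcre_exec()
plays for PCRE.

```python
import re

# Localizing alternatives: the empty alternative lets "cat" match.
cat_hits = [s for s in ("cat", "cataract", "caterpillar", "erpillar")
            if re.fullmatch(r"cat(aract|erpillar|)", s)]

# Three capturing subpatterns, numbered by opening parenthesis.
capturing = re.fullmatch(r"the ((red|white) (king|queen))",
                         "the red king").groups()

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen").groups()

print(cat_hits)
print(capturing)
print(non_capturing)
```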
937
938 As a convenient shorthand, if any option settings are required at the
939 start of a non-capturing subpattern, the option letters may appear
940 between the "?" and the ":". Thus the two patterns
941
942 (?i:saturday|sunday)
943 (?:(?i)saturday|sunday)
944
945 match exactly the same set of strings. Because alternative branches are
946 tried from left to right, and options are not reset until the end of
947 the subpattern is reached, an option setting in one branch does affect
948 subsequent branches, so the above patterns match "SUNDAY" as well as
949 "Saturday".
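
A minimal sketch of the shorthand form, assuming Python's re module in
place of PCRE. Only the (?i:...) form is shown, because Python handles
an option setting in the middle of a pattern differently from PCRE.

```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")

# The caseless setting covers both branches of the alternation.
hits = [s for s in ("Saturday", "SUNDAY", "sunday", "Monday")
        if weekend.fullmatch(s)]

print(hits)
```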
950
952
953 Perl 5.10 introduced a feature whereby each alternative in a subpattern
954 uses the same numbers for its capturing parentheses. Such a subpattern
955 starts with (?| and is itself a non-capturing subpattern. For example,
956 consider this pattern:
957
958 (?|(Sat)ur|(Sun))day
959
960 Because the two alternatives are inside a (?| group, both sets of cap‐
961 turing parentheses are numbered one. Thus, when the pattern matches,
962 you can look at captured substring number one, whichever alternative
963 matched. This construct is useful when you want to capture part, but
964 not all, of one of a number of alternatives. Inside a (?| group, paren‐
965 theses are numbered as usual, but the number is reset at the start of
966 each branch. The numbers of any capturing buffers that follow the sub‐
967 pattern start after the highest number used in any branch. The follow‐
968 ing example is taken from the Perl documentation. The numbers under‐
969 neath show in which buffer the captured content will be stored.
970
971 # before ---------------branch-reset----------- after
972 / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
973 # 1 2 2 3 2 3 4
974
975 A backreference or a recursive call to a numbered subpattern always
976 refers to the first one in the pattern with the given number.
977
978 An alternative approach to using this "branch reset" feature is to use
979 duplicate named subpatterns, as described in the next section.
980
982
983 Identifying capturing parentheses by number is simple, but it can be
984 very hard to keep track of the numbers in complicated regular expres‐
985 sions. Furthermore, if an expression is modified, the numbers may
986 change. To help with this difficulty, PCRE supports the naming of sub‐
987 patterns. This feature was not added to Perl until release 5.10. Python
988 had the feature earlier, and PCRE introduced it at release 4.0, using
989 the Python syntax. PCRE now supports both the Perl and the Python syn‐
990 tax.
991
992 In PCRE, a subpattern can be named in one of three ways: (?<name>...)
993 or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
994 to capturing parentheses from other parts of the pattern, such as back‐
995 references, recursion, and conditions, can be made by name as well as
996 by number.
997
998 Names consist of up to 32 alphanumeric characters and underscores.
999 Named capturing parentheses are still allocated numbers as well as
1000 names, exactly as if the names were not present. The PCRE API provides
1001 function calls for extracting the name-to-number translation table from
1002 a compiled pattern. There is also a convenience function for extracting
1003 a captured substring by name.
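
The relationship between names and numbers can be sketched as follows.
The example assumes Python's re module, which accepts the (?P<name>...)
syntax; its groupindex attribute is used here as a rough analogue of
the PCRE name-to-number table, not as the PCRE API itself. The pattern
is a single branch of the weekday example.

```python
import re

weekday = re.compile(r"(?P<DN>Mon|Fri|Sun)(?:day)?")
m = weekday.fullmatch("Monday")

by_name = m.group("DN")    # extract the substring by name
by_number = m.group(1)     # the same group still has a number
name_table = dict(weekday.groupindex)

print(by_name, by_number, name_table)
```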
1004
1005 By default, a name must be unique within a pattern, but it is possible
1006 to relax this constraint by setting the PCRE_DUPNAMES option at compile
1007 time. This can be useful for patterns where only one instance of the
1008 named parentheses can match. Suppose you want to match the name of a
1009 weekday, either as a 3-letter abbreviation or as the full name, and in
1010 both cases you want to extract the abbreviation. This pattern (ignoring
1011 the line breaks) does the job:
1012
1013 (?<DN>Mon|Fri|Sun)(?:day)?|
1014 (?<DN>Tue)(?:sday)?|
1015 (?<DN>Wed)(?:nesday)?|
1016 (?<DN>Thu)(?:rsday)?|
1017 (?<DN>Sat)(?:urday)?
1018
1019 There are five capturing substrings, but only one is ever set after a
1020 match. (An alternative way of solving this problem is to use a "branch
1021 reset" subpattern, as described in the previous section.)
1022
1023 The convenience function for extracting the data by name returns the
1024 substring for the first (and in this example, the only) subpattern of
1025 that name that matched. This saves searching to find which numbered
1026 subpattern it was. If you make a reference to a non-unique named sub‐
1027 pattern from elsewhere in the pattern, the one that corresponds to the
1028 lowest number is used. For further details of the interfaces for han‐
1029 dling named subpatterns, see the pcreapi documentation.
1030
1032
1033 Repetition is specified by quantifiers, which can follow any of the
1034 following items:
1035
1036 a literal data character
1037 the dot metacharacter
1038 the \C escape sequence
1039 the \X escape sequence (in UTF-8 mode with Unicode properties)
1040 the \R escape sequence
1041 an escape such as \d that matches a single character
1042 a character class
1043 a back reference (see next section)
1044 a parenthesized subpattern (unless it is an assertion)
1045
1046 The general repetition quantifier specifies a minimum and maximum num‐
1047 ber of permitted matches, by giving the two numbers in curly brackets
1048 (braces), separated by a comma. The numbers must be less than 65536,
1049 and the first must be less than or equal to the second. For example:
1050
1051 z{2,4}
1052
1053 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
1054 special character. If the second number is omitted, but the comma is
1055 present, there is no upper limit; if the second number and the comma
1056 are both omitted, the quantifier specifies an exact number of required
1057 matches. Thus
1058
1059 [aeiou]{3,}
1060
1061 matches at least 3 successive vowels, but may match many more, while
1062
1063 \d{8}
1064
1065 matches exactly 8 digits. An opening curly bracket that appears in a
1066 position where a quantifier is not allowed, or one that does not match
1067 the syntax of a quantifier, is taken as a literal character. For exam‐
1068 ple, {,6} is not a quantifier, but a literal string of four characters.
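
The three quantifier examples above can be exercised with a short
sketch, assuming Python's re module as a stand-in for PCRE. The {,6}
case is deliberately left out: Python treats it as a quantifier, which
differs from the PCRE behaviour described here.

```python
import re

# {2,4}: at least two and at most four repetitions.
z_hits = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
          if re.fullmatch(r"z{2,4}", s)]

# {3,}: no upper limit.
many_vowels = re.fullmatch(r"[aeiou]{3,}", "aeiouaeiou")
two_vowels = re.fullmatch(r"[aeiou]{3,}", "ae")

# {8}: exactly eight.
eight = re.fullmatch(r"\d{8}", "20240131")
seven = re.fullmatch(r"\d{8}", "2024013")

print(z_hits)
print(many_vowels is not None, two_vowels is None)
print(eight is not None, seven is None)
```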
1069
1070 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
1071 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char‐
1072 acters, each of which is represented by a two-byte sequence. Similarly,
1073 when Unicode property support is available, \X{3} matches three Unicode
1074 extended sequences, each of which may be several bytes long (and they
1075 may be of different lengths).
1076
1077 The quantifier {0} is permitted, causing the expression to behave as if
1078 the previous item and the quantifier were not present. This may be use‐
1079 ful for subpatterns that are referenced as subroutines from elsewhere
1080 in the pattern. Items other than subpatterns that have a {0} quantifier
1081 are omitted from the compiled pattern.
1082
1083 For convenience, the three most common quantifiers have single-charac‐
1084 ter abbreviations:
1085
1086 * is equivalent to {0,}
1087 + is equivalent to {1,}
1088 ? is equivalent to {0,1}
1089
1090 It is possible to construct infinite loops by following a subpattern
1091 that can match no characters with a quantifier that has no upper limit,
1092 for example:
1093
1094 (a?)*
1095
1096 Earlier versions of Perl and PCRE used to give an error at compile time
1097 for such patterns. However, because there are cases where this can be
1098 useful, such patterns are now accepted, but if any repetition of the
1099 subpattern does in fact match no characters, the loop is forcibly bro‐
1100 ken.
1101
1102 By default, the quantifiers are "greedy", that is, they match as much
1103 as possible (up to the maximum number of permitted times), without
1104 causing the rest of the pattern to fail. The classic example of where
1105 this gives problems is in trying to match comments in C programs. These
1106 appear between /* and */ and within the comment, individual * and /
1107 characters may appear. An attempt to match C comments by applying the
1108 pattern
1109
1110 /\*.*\*/
1111
1112 to the string
1113
1114 /* first comment */ not comment /* second comment */
1115
1116 fails, because it matches the entire string owing to the greediness of
1117 the .* item.
1118
1119 However, if a quantifier is followed by a question mark, it ceases to
1120 be greedy, and instead matches the minimum number of times possible, so
1121 the pattern
1122
1123 /\*.*?\*/
1124
1125 does the right thing with the C comments. The meaning of the various
1126 quantifiers is not otherwise changed, just the preferred number of
1127 matches. Do not confuse this use of question mark with its use as a
1128 quantifier in its own right. Because it has two uses, it can sometimes
1129 appear doubled, as in
1130
1131 \d??\d
1132
1133 which matches one digit by preference, but can match two if that is the
1134 only way the rest of the pattern matches.
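
The difference between greedy and minimizing repeats on the C comment
example can be seen in a small sketch. Python's re module is assumed in
place of PCRE; it uses the same ? suffix for minimizing quantifiers.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing: .*? stops at the first "*/" it can.
lazy = re.search(r"/\*.*?\*/", subject).group()
all_comments = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when it has to.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy == subject, lazy)
print(all_comments)
print(one_digit, two_digits)
```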
1135
1136 If the PCRE_UNGREEDY option is set (an option that is not available in
1137 Perl), the quantifiers are not greedy by default, but individual ones
1138 can be made greedy by following them with a question mark. In other
1139 words, it inverts the default behaviour.
1140
1141 When a parenthesized subpattern is quantified with a minimum repeat
1142 count that is greater than 1 or with a limited maximum, more memory is
1143 required for the compiled pattern, in proportion to the size of the
1144 minimum or maximum.
1145
1146 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv‐
1147 alent to Perl's /s) is set, thus allowing the dot to match newlines,
1148 the pattern is implicitly anchored, because whatever follows will be
1149 tried against every character position in the subject string, so there
1150 is no point in retrying the overall match at any position after the
1151 first. PCRE normally treats such a pattern as though it were preceded
1152 by \A.
1153
1154 In cases where it is known that the subject string contains no new‐
1155 lines, it is worth setting PCRE_DOTALL in order to obtain this opti‐
1156 mization, or alternatively using ^ to indicate anchoring explicitly.
1157
1158 However, there is one situation where the optimization cannot be used.
1159 When .* is inside capturing parentheses that are the subject of a
1160 backreference elsewhere in the pattern, a match at the start may fail
1161 where a later one succeeds. Consider, for example:
1162
1163 (.*)abc\1
1164
1165 If the subject is "xyz123abc123" the match point is the fourth charac‐
1166 ter. For this reason, such a pattern is not implicitly anchored.
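
The match point in this example can be confirmed with a minimal sketch,
assuming Python's re module as a stand-in for PCRE. Offsets in Python
are counted from zero, so the fourth character is offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# No attempt starting at offsets 0, 1 or 2 can succeed, because the
# text after "abc" would have to repeat "xyz123", "yz123" or "z123".
print(m.start(), m.group(), m.group(1))
```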
1167
1168 When a capturing subpattern is repeated, the value captured is the sub‐
1169 string that matched the final iteration. For example, after
1170
1171 (tweedle[dume]{3}\s*)+
1172
1173 has matched "tweedledum tweedledee" the value of the captured substring
1174 is "tweedledee". However, if there are nested capturing subpatterns,
1175 the corresponding captured values may have been set in previous itera‐
1176 tions. For example, after
1177
1178 /(a|(b))+/
1179
1180 matches "aba" the value of the second captured substring is "b".
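
Both cases above can be reproduced with a short sketch. It assumes
Python's re module, which happens to keep values from earlier
iterations in the same way for these two patterns; that agreement is an
observation about this example, not a general guarantee.

```python
import re

m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

n = re.fullmatch(r"(a|(b))+", "aba")
outer = n.group(1)   # set by the final iteration, which matched "a"
inner = n.group(2)   # still holds "b" from the second iteration

print(last_iteration, outer, inner)
```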
1181
1183
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
repetition, failure of what follows normally causes the repeated item
to be re-evaluated to see if a different number of repeats allows the
rest of the pattern to match. Sometimes it is useful to prevent this,
either to change the nature of the match, or to cause it to fail
earlier than it otherwise might, when the author of the pattern knows
there is no point in carrying on.
1191
1192 Consider, for example, the pattern \d+foo when applied to the subject
1193 line
1194
1195 123456bar
1196
1197 After matching all 6 digits and then failing to match "foo", the normal
1198 action of the matcher is to try again with only 5 digits matching the
1199 \d+ item, and then with 4, and so on, before ultimately failing.
1200 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
1201 the means for specifying that once a subpattern has matched, it is not
1202 to be re-evaluated in this way.
1203
1204 If we use atomic grouping for the previous example, the matcher gives
1205 up immediately on failing to match "foo" the first time. The notation
1206 is a kind of special parenthesis, starting with (?> as in this example:
1207
1208 (?>\d+)foo
1209
1210 This kind of parenthesis "locks up" the part of the pattern it con‐
1211 tains once it has matched, and a failure further into the pattern is
1212 prevented from backtracking into it. Backtracking past it to previous
1213 items, however, works as normal.
1214
1215 An alternative description is that a subpattern of this type matches
1216 the string of characters that an identical standalone pattern would
1217 match, if anchored at the current point in the subject string.
1218
1219 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
1220 such as the above example can be thought of as a maximizing repeat that
1221 must swallow everything it can. So, while both \d+ and \d+? are pre‐
1222 pared to adjust the number of digits they match in order to make the
1223 rest of the pattern match, (?>\d+) can only match an entire sequence of
1224 digits.
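
The effect of "locking up" a repeat can be sketched in Python. The
(?>...) syntax is only accepted by Python's re module from version 3.11
onwards, so this example swaps in a different technique that works on
older versions too: a lookahead captures the digit run and a back
reference then consumes it, which cannot be backtracked into. This is
an emulation, not the atomic group notation itself, and the trailing \d
is an invented variation chosen so that the two forms give different
results.

```python
import re

# Ordinary repeat: \d+ gives back one digit so the final \d can match.
plain = re.search(r"\d+\d", "123456")

# Emulated atomic group: the lookahead captures the whole digit run,
# \1 consumes it, and nothing can be handed back afterwards.
atomic = re.search(r"(?=(\d+))\1\d", "123456")

# When no backtracking is needed, the emulation matches as usual.
atomic_foo = re.search(r"(?=(\d+))\1foo", "123foo")
atomic_bar = re.search(r"(?=(\d+))\1foo", "123456bar")

print(plain.group(), atomic, atomic_foo.group(), atomic_bar)
```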
1225
1226 Atomic groups in general can of course contain arbitrarily complicated
1227 subpatterns, and can be nested. However, when the subpattern for an
1228 atomic group is just a single repeated item, as in the example above, a
1229 simpler notation, called a "possessive quantifier" can be used. This
1230 consists of an additional + character following a quantifier. Using
1231 this notation, the previous example can be rewritten as
1232
1233 \d++foo
1234
1235 Note that a possessive quantifier can be used with an entire group, for
1236 example:
1237
1238 (abc|xyz){2,3}+
1239
1240 Possessive quantifiers are always greedy; the setting of the
1241 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
1242 simpler forms of atomic group. However, there is no difference in the
1243 meaning of a possessive quantifier and the equivalent atomic group,
1244 though there may be a performance difference; possessive quantifiers
1245 should be slightly faster.
1246
1247 The possessive quantifier syntax is an extension to the Perl 5.8 syn‐
1248 tax. Jeffrey Friedl originated the idea (and the name) in the first
1249 edition of his book. Mike McCloskey liked it, so implemented it when he
1250 built Sun's Java package, and PCRE copied it from there. It ultimately
1251 found its way into Perl at release 5.10.
1252
1253 PCRE has an optimization that automatically "possessifies" certain sim‐
1254 ple pattern constructs. For example, the sequence A+B is treated as
1255 A++B because there is no point in backtracking into a sequence of A's
1256 when B must follow.
1257
1258 When a pattern contains an unlimited repeat inside a subpattern that
1259 can itself be repeated an unlimited number of times, the use of an
1260 atomic group is the only way to avoid some failing matches taking a
1261 very long time indeed. The pattern
1262
1263 (\D+|<\d+>)*[!?]
1264
1265 matches an unlimited number of substrings that either consist of non-
1266 digits, or digits enclosed in <>, followed by either ! or ?. When it
1267 matches, it runs quickly. However, if it is applied to
1268
1269 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1270
1271 it takes a long time before reporting failure. This is because the
1272 string can be divided between the internal \D+ repeat and the external
1273 * repeat in a large number of ways, and all have to be tried. (The
1274 example uses [!?] rather than a single character at the end, because
1275 both PCRE and Perl have an optimization that allows for fast failure
1276 when a single character is used. They remember the last single charac‐
1277 ter that is required for a match, and fail early if it is not present
1278 in the string.) If the pattern is changed so that it uses an atomic
1279 group, like this:
1280
1281 ((?>\D+)|<\d+>)*[!?]
1282
1283 sequences of non-digits cannot be broken, and failure happens quickly.
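
To give a feel for "a large number of ways", the following sketch
counts how many ways a run of n characters can be cut into consecutive
non-empty pieces, each piece standing for one iteration of the outer
repeat. This is a rough model of the search space under the assumption
that every division is tried; it is not a measurement of PCRE, and the
function name count_splits is invented for the illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Ways to cut a run of n characters into non-empty pieces."""
    if n == 0:
        return 1  # nothing left to cut
    # Choose the length of the first piece, then split the remainder.
    return sum(count_splits(n - first) for first in range(1, n + 1))


counts = [count_splits(n) for n in range(1, 9)]
print(counts)
print(count_splits(30) == 2 ** 29)
```

The count doubles with each extra character, which is why a failing
match on a long run of non-digits takes so long without the atomic
group.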
1284
1286
1287 Outside a character class, a backslash followed by a digit greater than
1288 0 (and possibly further digits) is a back reference to a capturing sub‐
1289 pattern earlier (that is, to its left) in the pattern, provided there
1290 have been that many previous capturing left parentheses.
1291
1292 However, if the decimal number following the backslash is less than 10,
1293 it is always taken as a back reference, and causes an error only if
1294 there are not that many capturing left parentheses in the entire pat‐
1295 tern. In other words, the parentheses that are referenced need not be
1296 to the left of the reference for numbers less than 10. A "forward back
1297 reference" of this type can make sense when a repetition is involved
1298 and the subpattern to the right has participated in an earlier itera‐
1299 tion.
1300
1301 It is not possible to have a numerical "forward back reference" to a
1302 subpattern whose number is 10 or more using this syntax because a
1303 sequence such as \50 is interpreted as a character defined in octal.
1304 See the subsection entitled "Non-printing characters" above for further
1305 details of the handling of digits following a backslash. There is no
1306 such problem when named parentheses are used. A back reference to any
1307 subpattern is possible using named parentheses (see below).
1308
1309 Another way of avoiding the ambiguity inherent in the use of digits
1310 following a backslash is to use the \g escape sequence, which is a fea‐
1311 ture introduced in Perl 5.10. This escape must be followed by an
1312 unsigned number or a negative number, optionally enclosed in braces.
1313 These examples are all identical:
1314
1315 (ring), \1
1316 (ring), \g1
1317 (ring), \g{1}
1318
1319 An unsigned number specifies an absolute reference without the ambigu‐
1320 ity that is present in the older syntax. It is also useful when literal
1321 digits follow the reference. A negative number is a relative reference.
1322 Consider this example:
1323
1324 (abc(def)ghi)\g{-1}
1325
The sequence \g{-1} is a reference to the most recently started
capturing subpattern before \g, that is, it is equivalent to \2.
Similarly, \g{-2} would be equivalent to \1. The use of relative
references can be helpful in long patterns, and also in patterns that
are created by joining together fragments that contain references
within themselves.
1331
1332 A back reference matches whatever actually matched the capturing sub‐
1333 pattern in the current subject string, rather than anything matching
1334 the subpattern itself (see "Subpatterns as subroutines" below for a way
1335 of doing that). So the pattern
1336
1337 (sens|respons)e and \1ibility
1338
1339 matches "sense and sensibility" and "response and responsibility", but
1340 not "sense and responsibility". If caseful matching is in force at the
1341 time of the back reference, the case of letters is relevant. For exam‐
1342 ple,
1343
1344 ((?i)rah)\s+\1
1345
1346 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
1347 original capturing subpattern is matched caselessly.
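
Both examples can be tried with a short sketch, assuming Python's re
module in place of PCRE. Because Python handles a (?i) setting in the
middle of a pattern differently, the caseless part is written with the
scoped form (?i:rah), which is a substitution made for this example.

```python
import re

sense = re.compile(r"(sens|respons)e and \1ibility")
sense_hits = [s for s in ("sense and sensibility",
                          "response and responsibility",
                          "sense and responsibility")
              if sense.fullmatch(s)]

# The group is matched caselessly, but the back reference is caseful:
# it must repeat the captured text exactly.
rah = re.compile(r"((?i:rah))\s+\1")
rah_hits = [s for s in ("rah rah", "RAH RAH", "RAH rah")
            if rah.fullmatch(s)]

print(sense_hits)
print(rah_hits)
```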
1348
1349 There are several different ways of writing back references to named
1350 subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
1351 \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
1352 unified back reference syntax, in which \g can be used for both numeric
1353 and named references, is also supported. We could rewrite the above
1354 example in any of the following ways:
1355
1356 (?<p1>(?i)rah)\s+\k<p1>
1357 (?'p1'(?i)rah)\s+\k{p1}
1358 (?P<p1>(?i)rah)\s+(?P=p1)
1359 (?<p1>(?i)rah)\s+\g{p1}
1360
1361 A subpattern that is referenced by name may appear in the pattern
1362 before or after the reference.
1363
1364 There may be more than one back reference to the same subpattern. If a
1365 subpattern has not actually been used in a particular match, any back
1366 references to it always fail. For example, the pattern
1367
1368 (a|(bc))\2
1369
1370 always fails if it starts to match "a" rather than "bc". Because there
1371 may be many capturing parentheses in a pattern, all digits following
1372 the backslash are taken as part of a potential back reference number.
1373 If the pattern continues with a digit character, some delimiter must be
1374 used to terminate the back reference. If the PCRE_EXTENDED option is
1375 set, this can be white space. Otherwise an empty comment (see "Com‐
1376 ments" below) can be used.
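
Both points in the paragraph above can be checked with a short sketch. It assumes Python's re module, which also fails a back reference to an unset group and also accepts an empty (?#) comment; it is offered as an illustration and was not run against PCRE itself.

```python
import re

# Group 2 is set only when the "bc" alternative is taken, so \2 cannot
# match after the "a" alternative.
UNSET = re.compile(r"(a|(bc))\2")

# An empty comment ends the back reference number, so the pattern means
# "group 1 again, then a literal 1" rather than a reference to group 11.
DELIMITED = re.compile(r"(a)\1(?#)1")

if __name__ == "__main__":
    print(bool(UNSET.fullmatch("bcbc")))
    print(bool(UNSET.fullmatch("aa")))
    print(bool(DELIMITED.fullmatch("aa1")))
```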
1377
1378 A back reference that occurs inside the parentheses to which it refers
1379 fails when the subpattern is first used, so, for example, (a\1) never
1380 matches. However, such references can be useful inside repeated sub‐
1381 patterns. For example, the pattern
1382
1383 (a|b\1)+
1384
1385 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter‐
1386 ation of the subpattern, the back reference matches the character
1387 string corresponding to the previous iteration. In order for this to
1388 work, the pattern must be such that the first iteration does not need
1389 to match the back reference. This can be done using alternation, as in
1390 the example above, or by a quantifier with a minimum of zero.
1391
1393
1394 An assertion is a test on the characters following or preceding the
1395 current matching point that does not actually consume any characters.
1396 The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
1397 described above.
1398
1399 More complicated assertions are coded as subpatterns. There are two
1400 kinds: those that look ahead of the current position in the subject
1401 string, and those that look behind it. An assertion subpattern is
1402 matched in the normal way, except that it does not cause the current
1403 matching position to be changed.
1404
1405 Assertion subpatterns are not capturing subpatterns, and may not be
1406 repeated, because it makes no sense to assert the same thing several
1407 times. If any kind of assertion contains capturing subpatterns within
1408 it, these are counted for the purposes of numbering the capturing sub‐
1409 patterns in the whole pattern. However, substring capturing is carried
1410 out only for positive assertions, because it does not make sense for
1411 negative assertions.
1412
1413 Lookahead assertions
1414
1415 Lookahead assertions start with (?= for positive assertions and (?! for
1416 negative assertions. For example,
1417
1418 \w+(?=;)
1419
1420 matches a word followed by a semicolon, but does not include the semi‐
1421 colon in the match, and
1422
1423 foo(?!bar)
1424
1425 matches any occurrence of "foo" that is not followed by "bar". Note
1426 that the apparently similar pattern
1427
1428 (?!foo)bar
1429
1430 does not find an occurrence of "bar" that is preceded by something
1431 other than "foo"; it finds any occurrence of "bar" whatsoever, because
1432 the assertion (?!foo) is always true when the next three characters are
1433 "bar". A lookbehind assertion is needed to achieve the other effect.
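
The three lookahead patterns above behave the same way in Python's re module, which is used here purely as a convenient stand-in for PCRE. The subject strings are made up for the illustration.

```python
import re

# Positive lookahead: the semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "alpha beta; gamma")

# Negative lookahead: "foo" that is not followed by "bar".
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar finds "bar" even when "foo" precedes it, because the
# assertion inspects the characters from the current point onwards.
bar = re.search(r"(?!foo)bar", "foobar")

if __name__ == "__main__":
    print(word.group(), foo_starts, bar.start())
```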
1434
1435 If you want to force a matching failure at some point in a pattern, the
1436 most convenient way to do it is with (?!). An empty string always
1437 matches, so an assertion that requires there not to be an empty string
1438 must always fail.
1439
1440 Lookbehind assertions
1441
1442 Lookbehind assertions start with (?<= for positive assertions and (?<!
1443 for negative assertions. For example,
1444
1445 (?<!foo)bar
1446
1447 does find an occurrence of "bar" that is not preceded by "foo". The
1448 contents of a lookbehind assertion are restricted such that all the
1449 strings it matches must have a fixed length. However, if there are sev‐
1450 eral top-level alternatives, they do not all have to have the same
1451 fixed length. Thus
1452
1453 (?<=bullock|donkey)
1454
1455 is permitted, but
1456
1457 (?<!dogs?|cats?)
1458
1459 causes an error at compile time. Branches that match different length
1460 strings are permitted only at the top level of a lookbehind assertion.
1461 This is an extension compared with Perl (at least for 5.8), which
1462 requires all branches to match the same length of string. An assertion
1463 such as
1464
1465 (?<=ab(c|de))
1466
1467 is not permitted, because its single top-level branch can match two
1468 different lengths, but it is acceptable if rewritten to use two top-
1469 level branches:
1470
1471 (?<=abc|abde)
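
The fixed-length rule can be explored with the sketch below, which assumes Python's re module rather than PCRE. Note one difference: Python is stricter than the behaviour described above and rejects alternatives of different lengths even at the top level of a lookbehind, so the example writes them as separate assertions. The helper compiles() and the subject strings are hypothetical.

```python
import re

def compiles(pattern):
    """Return True if Python's re accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# "bar" that is not preceded by "foo".
NOT_AFTER_FOO = re.compile(r"(?<!foo)bar")

# Python requires every alternative in one lookbehind to have the same
# length, so the two alternatives become two separate assertions.
AFTER_ANIMAL = re.compile(r"(?:(?<=bullock)|(?<=donkey)) cart")

if __name__ == "__main__":
    print(compiles(r"(?<!dogs?|cats?)"))
    print(bool(NOT_AFTER_FOO.search("bazbar")))
    print(bool(AFTER_ANIMAL.search("donkey cart")))
```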
1472
1473 In some cases, the Perl 5.10 escape sequence \K (see above) can be used
1474 instead of a lookbehind assertion; unlike a lookbehind, \K is not
1475 restricted to matching a fixed length.
1476
1477 The implementation of lookbehind assertions is, for each alternative,
1478 to temporarily move the current position back by the fixed length and
1479 then try to match. If there are insufficient characters before the cur‐
1480 rent position, the assertion fails.
1481
1482 PCRE does not allow the \C escape (which matches a single byte in UTF-8
1483 mode) to appear in lookbehind assertions, because it makes it impossi‐
1484 ble to calculate the length of the lookbehind. The \X and \R escapes,
1485 which can match different numbers of bytes, are also not permitted.
1486
1487 Possessive quantifiers can be used in conjunction with lookbehind
1488 assertions to specify efficient matching at the end of the subject
1489 string. Consider a simple pattern such as
1490
1491 abcd$
1492
1493 when applied to a long string that does not match. Because matching
1494 proceeds from left to right, PCRE will look for each "a" in the subject
1495 and then see if what follows matches the rest of the pattern. If the
1496 pattern is specified as
1497
1498 ^.*abcd$
1499
1500 the initial .* matches the entire string at first, but when this fails
1501 (because there is no following "a"), it backtracks to match all but the
1502 last character, then all but the last two characters, and so on. Once
1503 again the search for "a" covers the entire string, from right to left,
1504 so we are no better off. However, if the pattern is written as
1505
1506 ^.*+(?<=abcd)
1507
1508 there can be no backtracking for the .*+ item; it can match only the
1509 entire string. The subsequent lookbehind assertion does a single test
1510 on the last four characters. If it fails, the match fails immediately.
1511 For long strings, this approach makes a significant difference to the
1512 processing time.
1513
1514 Using multiple assertions
1515
1516 Several assertions (of any sort) may occur in succession. For example,
1517
1518 (?<=\d{3})(?<!999)foo
1519
1520 matches "foo" preceded by three digits that are not "999". Notice that
1521 each of the assertions is applied independently at the same point in
1522 the subject string. First there is a check that the previous three
1523 characters are all digits, and then there is a check that the same
1523 three characters are not "999". This pattern does not match "foo" pre‐
1524 ceded by six characters, the first three of which are digits and the
1525 last three of which are not "999". For example, it doesn't match
1526 "123abcfoo". A pattern to do that is
1528
1529 (?<=\d{3}...)(?<!999)foo
1530
1531 This time the first assertion looks at the preceding six characters,
1532 checking that the first three are digits, and then the second assertion
1533 checks that the preceding three characters are not "999".
1534
1535 Assertions can be nested in any combination. For example,
1536
1537 (?<=(?<!foo)bar)baz
1538
1539 matches an occurrence of "baz" that is preceded by "bar" which in turn
1540 is not preceded by "foo", while
1541
1542 (?<=\d{3}(?!999)...)foo
1543
1544 is another pattern that matches "foo" preceded by three digits and any
1545 three characters that are not "999".
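
The sequential and nested forms discussed in this section can be compared side by side. This is a sketch under the assumption that Python's re module handles these particular fixed-length assertions in the same way as PCRE; the constant names are invented for the example.

```python
import re

# Both lookbehinds are tested at the same point, just before "foo".
SEQUENTIAL = re.compile(r"(?<=\d{3})(?<!999)foo")

# Three digits, then any three characters that are not "999".
SIX_BACK = re.compile(r"(?<=\d{3}...)(?<!999)foo")

# Nested forms: an assertion inside another assertion.
NESTED_BAR = re.compile(r"(?<=(?<!foo)bar)baz")
NESTED_LOOKAHEAD = re.compile(r"(?<=\d{3}(?!999)...)foo")

if __name__ == "__main__":
    for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
        print(subject,
              bool(SEQUENTIAL.search(subject)),
              bool(SIX_BACK.search(subject)),
              bool(NESTED_LOOKAHEAD.search(subject)))
```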
1546
1548
1549 It is possible to cause the matching process to obey a subpattern con‐
1550 ditionally or to choose between two alternative subpatterns, depending
1551 on the result of an assertion, or whether a previous capturing subpat‐
1552 tern matched or not. The two possible forms of conditional subpattern
1553 are
1554
1555 (?(condition)yes-pattern)
1556 (?(condition)yes-pattern|no-pattern)
1557
1558 If the condition is satisfied, the yes-pattern is used; otherwise the
1559 no-pattern (if present) is used. If there are more than two alterna‐
1560 tives in the subpattern, a compile-time error occurs.
1561
1562 There are four kinds of condition: references to subpatterns, refer‐
1563 ences to recursion, a pseudo-condition called DEFINE, and assertions.
1564
1565 Checking for a used subpattern by number
1566
1567 If the text between the parentheses consists of a sequence of digits,
1568 the condition is true if the capturing subpattern of that number has
1569 previously matched. An alternative notation is to precede the digits
1570 with a plus or minus sign. In this case, the subpattern number is rela‐
1571 tive rather than absolute. The most recently opened parentheses can be
1572 referenced by (?(-1), the next most recent by (?(-2), and so on. In
1573 looping constructs it can also make sense to refer to subsequent groups
1574 with constructs such as (?(+2).
1575
1576 Consider the following pattern, which contains non-significant white
1577 space to make it more readable (assume the PCRE_EXTENDED option) and to
1578 divide it into three parts for ease of discussion:
1579
1580 ( \( )? [^()]+ (?(1) \) )
1581
1582 The first part matches an optional opening parenthesis, and if that
1583 character is present, sets it as the first captured substring. The sec‐
1584 ond part matches one or more characters that are not parentheses. The
1585 third part is a conditional subpattern that tests whether the first set
1586 of parentheses matched or not. If they did, that is, if the subject started
1587 with an opening parenthesis, the condition is true, and so the yes-pat‐
1588 tern is executed and a closing parenthesis is required. Otherwise,
1589 since no-pattern is not present, the subpattern matches nothing. In
1590 other words, this pattern matches a sequence of non-parentheses,
1591 optionally enclosed in parentheses.
1592
1593 If you were embedding this pattern in a larger one, you could use a
1594 relative reference:
1595
1596 ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
1597
1598 This makes the fragment independent of the parentheses in the larger
1599 pattern.
1600
1601 Checking for a used subpattern by name
1602
1603 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
1604 used subpattern by name. For compatibility with earlier versions of
1605 PCRE, which had this facility before Perl, the syntax (?(name)...) is
1606 also recognized. However, there is a possible ambiguity with this syn‐
1607 tax, because subpattern names may consist entirely of digits. PCRE
1608 looks first for a named subpattern; if it cannot find one and the name
1609 consists entirely of digits, PCRE looks for a subpattern of that num‐
1610 ber, which must be greater than zero. Using subpattern names that con‐
1611 sist entirely of digits is not recommended.
1612
1613 Rewriting the above example to use a named subpattern gives this:
1614
1615 (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
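
The numbered and named tests can both be tried with Python's re module, used here as a stand-in for PCRE. Python supports only the (?P<name>...) group syntax and the (?(name)...) form of the condition, and has no relative form such as (?(-1)...), so the named pattern below is adapted accordingly. re.VERBOSE plays the role of PCRE_EXTENDED.

```python
import re

# Numbered condition: a closing parenthesis is required only if group 1
# (the optional opening parenthesis) took part in the match.
BY_NUMBER = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

# Named condition, adapted to Python's group-naming syntax.
BY_NAME = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)

if __name__ == "__main__":
    for subject in ("abc", "(abc)", "(abc", "abc)"):
        print(subject,
              bool(BY_NUMBER.fullmatch(subject)),
              bool(BY_NAME.fullmatch(subject)))
```

The sketch uses fullmatch() so that an unbalanced subject is rejected outright; an unanchored search would still find the "abc" inside "(abc".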
1616
1617
1618 Checking for pattern recursion
1619
1620 If the condition is the string (R), and there is no subpattern with the
1621 name R, the condition is true if a recursive call to the whole pattern
1622 or any subpattern has been made. If digits or a name preceded by amper‐
1623 sand follow the letter R, for example:
1624
1625 (?(R3)...) or (?(R&name)...)
1626
1627 the condition is true if the most recent recursion is into the subpat‐
1628 tern whose number or name is given. This condition does not check the
1629 entire recursion stack.
1630
1631 At "top level", all these recursion test conditions are false. Recur‐
1632 sive patterns are described below.
1633
1634 Defining subpatterns for use by reference only
1635
1636 If the condition is the string (DEFINE), and there is no subpattern
1637 with the name DEFINE, the condition is always false. In this case,
1638 there may be only one alternative in the subpattern. It is always
1639 skipped if control reaches this point in the pattern; the idea of
1640 DEFINE is that it can be used to define "subroutines" that can be ref‐
1641 erenced from elsewhere. (The use of "subroutines" is described below.)
1642 For example, a pattern to match an IPv4 address could be written like
1643 this (ignore white space and line breaks):
1644
1645 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
1646 \b (?&byte) (\.(?&byte)){3} \b
1647
1648 The first part of the pattern is a DEFINE group inside which another
1649 group named "byte" is defined. This matches an individual component of
1650 an IPv4 address (a number less than 256). When matching takes place,
1651 this part of the pattern is skipped because DEFINE acts like a false
1652 condition.
1653
1654 The rest of the pattern uses references to the named group to match the
1655 four dot-separated components of an IPv4 address, insisting on a word
1656 boundary at each end.
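
Engines without DEFINE or subroutine calls can approximate this pattern by building it from a reusable string. The sketch below does that in Python; it is a substitute technique, not the PCRE mechanism, and the names BYTE, IPV4 and is_ipv4() are invented for the illustration.

```python
import re

# Stand-in for the DEFINE group: the fragment is written once as a string
# and spliced in wherever PCRE would write (?&byte).
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")

def is_ipv4(text):
    """Return True if the whole text is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None

if __name__ == "__main__":
    for candidate in ("192.168.0.1", "256.1.1.1", "1.2.3"):
        print(candidate, is_ipv4(candidate))
```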
1657
1658 Assertion conditions
1659
1660 If the condition is not in any of the above formats, it must be an
1661 assertion. This may be a positive or negative lookahead or lookbehind
1662 assertion. Consider this pattern, again containing non-significant
1663 white space, and with the two alternatives on the second line:
1664
1665 (?(?=[^a-z]*[a-z])
1666 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1667
1668 The condition is a positive lookahead assertion that matches an
1669 optional sequence of non-letters followed by a letter. In other words,
1670 it tests for the presence of at least one letter in the subject. If a
1671 letter is found, the subject is matched against the first alternative;
1672 otherwise it is matched against the second. This pattern matches
1673 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1674 letters and dd are digits.
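
Python's re module does not accept an assertion as a condition, so the sketch below emulates the pattern with two alternatives, each guarded by the assertion or its negation. This is an approximation offered for experimentation, not PCRE syntax; the constant names are hypothetical.

```python
import re

HAS_LETTER = r"(?=[^a-z]*[a-z])"   # the condition
NO_LETTER = r"(?![^a-z]*[a-z])"    # its negation

DATE = re.compile("".join([
    r"(?:",
    HAS_LETTER, r"\d{2}-[a-z]{3}-\d{2}",   # used when the condition holds
    r"|",
    NO_LETTER, r"\d{2}-\d{2}-\d{2}",       # used otherwise
    r")",
]))

if __name__ == "__main__":
    for subject in ("12-abc-34", "12-34-56", "12-ab-34"):
        print(subject, bool(DATE.fullmatch(subject)))
```

Because each alternative carries its own guard, at most one of them can be attempted at a given position, which mirrors the yes-pattern and no-pattern choice.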
1675
1676 COMMENTS
1677
1678 The sequence (?# marks the start of a comment that continues up to the
1679 next closing parenthesis. Nested parentheses are not permitted. The
1680 characters that make up a comment play no part in the pattern matching
1681 at all.
1682
1683 If the PCRE_EXTENDED option is set, an unescaped # character outside a
1684 character class introduces a comment that continues to immediately
1685 after the next newline in the pattern.
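
Both comment styles can be seen in the following sketch, which assumes Python's re module; its re.VERBOSE flag corresponds roughly to PCRE_EXTENDED. The pattern and the labels in the comments are made up for the example.

```python
import re

# (?#...) comments are available in any mode.
INLINE = re.compile(r"\d{3}(?#prefix)-\d{4}(?#line number)")

# In extended mode, "#" outside a character class starts a comment that
# runs to the end of the line; inside a class it remains a literal.
EXTENDED = re.compile(
    r"""
    \d{3}   # prefix
    [#-]    # separator: either a hash or a hyphen
    \d{4}   # line number
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    print(bool(INLINE.fullmatch("555-1234")))
    print(bool(EXTENDED.fullmatch("555#1234")))
```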
1686
1688
1689 Consider the problem of matching a string in parentheses, allowing for
1690 unlimited nested parentheses. Without the use of recursion, the best
1691 that can be done is to use a pattern that matches up to some fixed
1692 depth of nesting. It is not possible to handle an arbitrary nesting
1693 depth.
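
The fixed-depth limitation can be made concrete with a small pattern builder. The sketch below is written in Python and uses only features common to most regular expression engines; the function name nested_parens() is hypothetical. Each pass of the loop wraps the previous pattern in one more level of parentheses, so the result handles exactly the requested depth and no more.

```python
import re

def nested_parens(depth):
    """Build a pattern for parenthesized text nested up to `depth` levels."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    pattern = r"\([^()]*\)"  # depth 1: no inner parentheses at all
    for _ in range(depth - 1):
        # Allow either a non-parenthesis character or one complete
        # group of the previous depth inside the new outer pair.
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return re.compile(pattern)

if __name__ == "__main__":
    subject = "(a(b(c)))"
    for depth in (1, 2, 3):
        print(depth, bool(nested_parens(depth).fullmatch(subject)))
```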
1694
1695 For some time, Perl has provided a facility that allows regular expres‐
1696 sions to recurse (amongst other things). It does this by interpolating
1697 Perl code in the expression at run time, and the code can refer to the
1698 expression itself. A Perl pattern using code interpolation to solve the
1699 parentheses problem can be created like this:
1700
1701 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1702
1703 The (?p{...}) item interpolates Perl code at run time, and in this case
1704 refers recursively to the pattern in which it appears.
1705
1706 Obviously, PCRE cannot support the interpolation of Perl code. Instead,
1707 it supports special syntax for recursion of the entire pattern, and
1708 also for individual subpattern recursion. After its introduction in
1709 PCRE and Python, this kind of recursion was introduced into Perl at
1710 release 5.10.
1711
1712 A special item that consists of (? followed by a number greater than
1713 zero and a closing parenthesis is a recursive call of the subpattern of
1714 the given number, provided that it occurs inside that subpattern. (If
1715 not, it is a "subroutine" call, which is described in the next sec‐
1716 tion.) The special item (?R) or (?0) is a recursive call of the entire
1717 regular expression.
1718
1719 In PCRE (like Python, but unlike Perl), a recursive subpattern call is
1720 always treated as an atomic group. That is, once it has matched some of
1721 the subject string, it is never re-entered, even if it contains untried
1722 alternatives and there is a subsequent matching failure.
1723
1724 This PCRE pattern solves the nested parentheses problem (assume the
1725 PCRE_EXTENDED option is set so that white space is ignored):
1726
1727 \( ( (?>[^()]+) | (?R) )* \)
1728
1729 First it matches an opening parenthesis. Then it matches any number of
1730 substrings which can either be a sequence of non-parentheses, or a
1731 recursive match of the pattern itself (that is, a correctly parenthe‐
1732 sized substring). Finally there is a closing parenthesis.
1733
1734 If this were part of a larger pattern, you would not want to recurse
1735 the entire pattern, so instead you could use this:
1736
1737 ( \( ( (?>[^()]+) | (?1) )* \) )
1738
1739 We have put the pattern into parentheses, and caused the recursion to
1740 refer to them instead of the whole pattern.
1741
1742 In a larger pattern, keeping track of parenthesis numbers can be
1743 tricky. This is made easier by the use of relative references (a Perl
1744 5.10 feature). Instead of (?1) in the pattern above you can write
1745 (?-2) to refer to the second most recently opened parentheses preceding
1746 the recursion. In other words, a negative number counts capturing
1747 parentheses leftwards from the point at which it is encountered.
1748
1749 It is also possible to refer to subsequently opened parentheses, by
1750 writing references such as (?+2). However, these cannot be recursive
1751 because the reference is not inside the parentheses that are refer‐
1752 enced. They are always "subroutine" calls, as described in the next
1753 section.
1754
1755 An alternative approach is to use named parentheses instead. The Perl
1756 syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
1757 supported. We could rewrite the above example as follows:
1758
1759 (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
1760
1761 If there is more than one subpattern with the same name, the earliest
1762 one is used.
1763
1764 This particular example pattern that we have been looking at contains
1765 nested unlimited repeats, and so the use of atomic grouping for match‐
1766 ing strings of non-parentheses is important when applying the pattern
1767 to strings that do not match. For example, when this pattern is applied
1768 to
1769
1770 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1771
1772 it yields "no match" quickly. However, if atomic grouping is not used,
1773 the match runs for a very long time indeed because there are so many
1774 different ways the + and * repeats can carve up the subject, and all
1775 have to be tested before failure can be reported.
1776
1777 At the end of a match, the values set for any capturing subpatterns are
1778 those from the outermost level of the recursion at which the subpattern
1779 value is set. If you want to obtain intermediate values, a callout
1780 function can be used (see below and the pcrecallout documentation). If
1781 the pattern above is matched against
1782
1783 (ab(cd)ef)
1784
1785 the value for the capturing parentheses is "ef", which is the last
1786 value taken on at the top level. If additional parentheses are added,
1787 giving
1788
1789 \( ( ( (?>[^()]+) | (?R) )* ) \)
1790    ^                        ^
1791    ^                        ^
1792
1793 the string they capture is "ab(cd)ef", the contents of the top level
1794 parentheses. If there are more than 15 capturing parentheses in a pat‐
1795 tern, PCRE has to obtain extra memory to store data during a recursion,
1796 which it does by using pcre_malloc, freeing it via pcre_free after‐
1797 wards. If no memory can be obtained, the match fails with the
1798 PCRE_ERROR_NOMEMORY error.
1799
1800 Do not confuse the (?R) item with the condition (R), which tests for
1801 recursion. Consider this pattern, which matches text in angle brack‐
1802 ets, allowing for arbitrary nesting. Only digits are allowed in nested
1803 brackets (that is, when recursing), whereas any characters are permit‐
1804 ted at the outer level.
1805
1806 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
1807
1808 In this pattern, (?(R) is the start of a conditional subpattern, with
1809 two different alternatives for the recursive and non-recursive cases.
1810 The (?R) item is the actual recursive call.
1811
1812 SUBPATTERNS AS SUBROUTINES
1813
1814 If the syntax for a recursive subpattern reference (either by number or
1815 by name) is used outside the parentheses to which it refers, it oper‐
1816 ates like a subroutine in a programming language. The "called" subpat‐
1817 tern may be defined before or after the reference. A numbered reference
1818 can be absolute or relative, as in these examples:
1819
1820 (...(absolute)...)...(?2)...
1821 (...(relative)...)...(?-1)...
1822 (...(?+1)...(relative)...
1823
1824 An earlier example pointed out that the pattern
1825
1826 (sens|respons)e and \1ibility
1827
1828 matches "sense and sensibility" and "response and responsibility", but
1829 not "sense and responsibility". If instead the pattern
1830
1831 (sens|respons)e and (?1)ibility
1832
1833 is used, it does match "sense and responsibility" as well as the other
1834 two strings. Another example is given in the discussion of DEFINE
1835 above.
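
The difference between a back reference and a subroutine call can be imitated in an engine that lacks (?1) by splicing the subpattern text in twice. The sketch below does this with Python's re module; it is a substitute for the PCRE syntax, and the names STEM, BACKREF and SUBROUTINE are invented for the illustration.

```python
import re

STEM = r"(?:sens|respons)"

# Back reference: the second word must reuse the text matched by group 1.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# Stand-in for (?1): the subpattern itself is repeated, so each
# occurrence may match a different alternative.
SUBROUTINE = re.compile(STEM + r"e and " + STEM + r"ibility")

if __name__ == "__main__":
    subject = "sense and responsibility"
    print(bool(BACKREF.fullmatch(subject)))
    print(bool(SUBROUTINE.fullmatch(subject)))
```

Unlike a real subroutine call, the spliced copy is not treated as an atomic group and cannot refer to itself recursively.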
1836
1837 Like recursive subpatterns, a "subroutine" call is always treated as an
1838 atomic group. That is, once it has matched some of the subject string,
1839 it is never re-entered, even if it contains untried alternatives and
1840 there is a subsequent matching failure.
1841
1842 When a subpattern is used as a subroutine, processing options such as
1843 case-independence are fixed when the subpattern is defined. They cannot
1844 be changed for different calls. For example, consider this pattern:
1845
1846 (abc)(?i:(?-1))
1847
1848 It matches "abcabc". It does not match "abcABC" because the change of
1849 processing option does not affect the called subpattern.
1850
1852
1853 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
1854 name or a number enclosed either in angle brackets or single quotes, is
1855 an alternative syntax for referencing a subpattern as a subroutine,
1856 possibly recursively. Here are two of the examples used above, rewrit‐
1857 ten using this syntax:
1858
1859 (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
1860 (sens|respons)e and \g'1'ibility
1861
1862 PCRE supports an extension to Oniguruma: if a number is preceded by a
1863 plus or a minus sign it is taken as a relative reference. For example:
1864
1865 (abc)(?i:\g<-1>)
1866
1867 Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
1868 synonymous. The former is a back reference; the latter is a subroutine
1869 call.
1870
1872
1873 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1874 Perl code to be obeyed in the middle of matching a regular expression.
1875 This makes it possible, amongst other things, to extract different sub‐
1876 strings that match the same pair of parentheses when there is a repeti‐
1877 tion.
1878
1879 PCRE provides a similar feature, but of course it cannot obey arbitrary
1880 Perl code. The feature is called "callout". The caller of PCRE provides
1881 an external function by putting its entry point in the global variable
1882 pcre_callout. By default, this variable contains NULL, which disables
1883 all calling out.
1884
1885 Within a regular expression, (?C) indicates the points at which the
1886 external function is to be called. If you want to identify different
1887 callout points, you can put a number less than 256 after the letter C.
1888 The default value is zero. For example, this pattern has two callout
1889 points:
1890
1891 (?C1)abc(?C2)def
1892
1893 If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
1894 automatically installed before each item in the pattern. They are all
1895 numbered 255.
1896
1897 During matching, when PCRE reaches a callout point (and pcre_callout is
1898 set), the external function is called. It is provided with the number
1899 of the callout, the position in the pattern, and, optionally, one item
1900 of data originally supplied by the caller of pcre_exec(). The callout
1901 function may cause matching to proceed, to backtrack, or to fail alto‐
1902 gether. A complete description of the interface to the callout function
1903 is given in the pcrecallout documentation.
1904
1906
1907 Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
1908 which are described in the Perl documentation as "experimental and sub‐
1909 ject to change or removal in a future version of Perl". It goes on to
1910 say: "Their usage in production code should be noted to avoid problems
1911 during upgrades." The same remarks apply to the PCRE features described
1912 in this section.
1913
1914 Since these verbs are specifically related to backtracking, most of
1915 them can be used only when the pattern is to be matched using
1916 pcre_exec(), which uses a backtracking algorithm. With the exception of
1917 (*FAIL), which behaves like a failing negative assertion, they cause an
1918 error if encountered by pcre_dfa_exec().
1919
1920 The new verbs make use of what was previously invalid syntax: an open‐
1921 ing parenthesis followed by an asterisk. In Perl, they are generally of
1922 the form (*VERB:ARG) but PCRE does not support the use of arguments, so
1923 its general form is just (*VERB). Any number of these verbs may occur
1924 in a pattern. There are two kinds:
1925
1926 Verbs that act immediately
1927
1928 The following verbs act as soon as they are encountered:
1929
1930 (*ACCEPT)
1931
1932 This verb causes the match to end successfully, skipping the remainder
1933 of the pattern. When inside a recursion, only the innermost pattern is
1934 ended immediately. PCRE differs from Perl in what happens if the
1935 (*ACCEPT) is inside capturing parentheses. In Perl, the data so far is
1936 captured; in PCRE, no data is captured. For example:
1937
1938 A(A|B(*ACCEPT)|C)D
1939
1940 This matches "AB", "AAD", or "ACD", but when it matches "AB", no data
1941 is captured.
1942
1943 (*FAIL) or (*F)
1944
1945 This verb causes the match to fail, forcing backtracking to occur. It
1946 is equivalent to (?!) but easier to read. The Perl documentation notes
1947 that it is probably useful only when combined with (?{}) or (??{}).
1948 Those are, of course, Perl features that are not present in PCRE. The
1949 nearest equivalent is the callout feature, as for example in this pat‐
1950 tern:
1951
1952 a+(?C)(*FAIL)
1953
1954 A match with the string "aaaa" always fails, but the callout is taken
1955 before each backtrack happens (in this example, 10 times).
1956
1957 Verbs that act after backtracking
1958
1959 The following verbs do nothing when they are encountered. Matching con‐
1960 tinues with what follows, but if there is no subsequent match, a fail‐
1961 ure is forced. The verbs differ in exactly what kind of failure
1962 occurs.
1963
1964 (*COMMIT)
1965
1966 This verb causes the whole match to fail outright if the rest of the
1967 pattern does not match. Even if the pattern is unanchored, no further
1968 attempts to find a match by advancing the start point take place. Once
1969 (*COMMIT) has been passed, pcre_exec() is committed to finding a match
1970 at the current starting point, or not at all. For example:
1971
1972 a+(*COMMIT)b
1973
1974 This matches "xxaab" but not "aacaab". It can be thought of as a kind
1975 of dynamic anchor, or "I've started, so I must finish."
1976
1977 (*PRUNE)
1978
1979 This verb causes the match to fail at the current position if the rest
1980 of the pattern does not match. If the pattern is unanchored, the normal
1981 "bumpalong" advance to the next starting character then happens. Back‐
1982 tracking can occur as usual to the left of (*PRUNE), or when matching
1983 to the right of (*PRUNE), but if there is no match to the right, back‐
1984 tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE)
1985 is just an alternative to an atomic group or possessive quantifier, but
1986 there are some uses of (*PRUNE) that cannot be expressed in any other
1987 way.
1988
1989 (*SKIP)
1990
1991 This verb is like (*PRUNE), except that if the pattern is unanchored,
1992 the "bumpalong" advance is not to the next character, but to the posi‐
1993 tion in the subject where (*SKIP) was encountered. (*SKIP) signifies
1994 that whatever text was matched leading up to it cannot be part of a
1995 successful match. Consider:
1996
1997 a+(*SKIP)b
1998
1999 If the subject is "aaaac...", after the first match attempt fails
2000 (starting at the first character in the string), the starting point
2001 skips on to start the next attempt at "c". Note that a possessive quan‐
2002 tifier does not have the same effect in this example; although it would
2003 suppress backtracking during the first match attempt, the second
2004 attempt would start at the second character instead of skipping on to
2005 "c".
2006
2007 (*THEN)
2008
2009 This verb causes a skip to the next alternation if the rest of the pat‐
2010 tern does not match. That is, it cancels pending backtracking, but only
2011 within the current alternation. Its name comes from the observation
2012 that it can be used for a pattern-based if-then-else block:
2013
2014 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2015
2016 If the COND1 pattern matches, FOO is tried (and possibly further items
2017 after the end of the group if FOO succeeds); on failure the matcher
2018 skips to the second alternative and tries COND2, without backtracking
2019 into COND1. If (*THEN) is used outside of any alternation, it acts
2020 exactly like (*PRUNE).
2021
2023
2024 pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
2025
2027
2028 Philip Hazel
2029 University Computing Service
2030 Cambridge CB2 3QH, England.
2031
2033
2034 Last updated: 19 April 2008
2035 Copyright (c) 1997-2008 University of Cambridge.
2036